# References

A curated list of core papers, surveys, and benchmarks for the LLM-as-Judge field.
## Reading Roadmap

If you only have 2 hours, read these 3 papers in this order:

1. MT-Bench (Zheng et al., 2023)
   The seminal work that laid the foundation of the LLM-as-Judge paradigm.
2. Prometheus 2 (Kim et al., 2024)
   An open-source judge showing that evaluation ability can be distilled from strong models into small ones.
3. Justice or Prejudice (Zheng Y. et al., 2025)
   A systematic treatment of bias quantification, including a multi-dimensional bias benchmark.
## Foundational Work

- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685
  - First systematic validation that GPT-4 judges agree with human preferences at a rate above 80%.
  - Released MT-Bench (80 multi-turn questions) and the Chatbot Arena crowdsourcing platform. A pairwise-judging sketch in this style follows this list.
- Liu, Y., et al. (2023). G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. EMNLP 2023. arXiv:2303.16634
  - NLG evaluation combining CoT with form filling; reports a Spearman correlation of 0.514 with human judgments on summarization.
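To make the paradigm concrete, below is a minimal Python sketch of pairwise judging in the MT-Bench style. The prompt wording, the `gpt-4o` model name, and the tie handling are illustrative assumptions rather than the paper's exact template; querying both answer orders anticipates the position-bias findings listed under Bias and Reliability below.

```python
# Minimal pairwise-judging sketch in the MT-Bench style (assumptions noted
# in the lead-in; not the paper's exact prompt template).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Please act as an impartial judge and evaluate the two \
AI assistant responses to the user question below. Answer with only "A", \
"B", or "tie".

[Question]
{question}

[Assistant A]
{answer_a}

[Assistant B]
{answer_b}

Verdict:"""


def judge_once(question: str, answer_a: str, answer_b: str) -> str:
    """One judge call with the answers in a fixed order."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any strong instruction model as judge
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content.strip()


def judge_pairwise(question: str, ans_1: str, ans_2: str) -> str:
    """Judge in both orders; declare a winner only if the verdicts agree."""
    fwd = judge_once(question, ans_1, ans_2)  # ans_1 shown as Assistant A
    rev = judge_once(question, ans_2, ans_1)  # positions swapped
    if fwd == "A" and rev == "B":
        return "model_1"
    if fwd == "B" and rev == "A":
        return "model_2"
    return "tie"  # ties and order-dependent verdicts both count as ties
```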
## Survey Papers

- Gu, J., et al. (2024/2026). A Survey on LLM-as-a-Judge. The Innovation (2026-01). arXiv:2411.15594
  - The most systematic taxonomy to date; has gone through six arXiv revisions.
- Li, H., et al. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579
  - Organized along five dimensions: functionality, methodology, application scenarios, meta-evaluation, and limitations.
- Li, D., et al. (2024). From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. arXiv:2411.16594
## Judge Models
- Kim, S., et al. (2024). Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. ICLR 2024. arXiv:2310.08491
- Kim, S., et al. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. arXiv:2405.01535
  - An 8×7B MoE that handles both pointwise and pairwise evaluation; open source (see the loading sketch after this list).
- Zhu, L., et al. (2023). JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv:2310.17631
- Wang, Y., et al. (2023). PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. arXiv:2306.05087
- Li, J., et al. (2024). Auto-J: Generative Judge for Evaluating Alignment. ICLR 2024. arXiv:2310.05470
- Vu, T., et al. (2024). Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation (FLAMe). arXiv:2407.10817
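As a quick start for the open judges above, here is a hedged sketch of loading Prometheus 2 with Hugging Face transformers. The Hub model id and the rubric-style prompt are assumptions for illustration; the model was trained on a specific prompt template, so consult the official Prometheus repository before serious use.

```python
# Hedged sketch: loading an open judge model with Hugging Face transformers.
# The model id and prompt below are assumptions; Prometheus 2 expects a
# specific trained prompt template documented in its repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"  # assumed Hub id (7B variant)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Evaluate the response against the rubric and end with 'Score: N' (1-5).\n"
    "Instruction: Explain photosynthesis to a 10-year-old.\n"
    "Response: Plants use sunlight to turn water and air into food.\n"
    "Rubric: clarity and age-appropriateness.\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```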
## Bias and Reliability

- Wang, P., et al. (2024). Large Language Models are not Fair Evaluators. ACL 2024. arXiv:2305.17926
  - A systematic analysis of position bias; a sketch for quantifying it follows this list.
- Zheng, Y., et al. (2025). Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. arXiv:2410.02736
- Panickssery, A., et al. (2024). LLM Evaluators Recognize and Favor Their Own Generations. arXiv:2404.13076
- Stureborg, R., et al. (2024). Large Language Models are Inconsistent and Biased Evaluators. arXiv:2405.01724
- Li, D., et al. (2025). Preference Leakage: A Contamination Problem in LLM-as-a-judge. arXiv:2502.01534
- Park, et al. (2025). Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge. arXiv:2508.06709
- Koo, R., et al. (2023). Benchmarking Cognitive Biases in LLMs as Evaluators. arXiv:2309.17012
- Raina, V., et al. (2024). Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. arXiv:2402.14016
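As referenced in the Wang et al. entry above, a simple way to quantify position bias is to re-judge each pair with the answer order flipped and tally how often the verdict follows the slot rather than the answer. The sketch below assumes the hypothetical `judge_once` helper from the earlier pairwise sketch and a user-supplied list of (question, answer_x, answer_y) tuples.

```python
# Sketch for quantifying position bias in a pairwise judge, in the spirit
# of Wang et al. (2024). `judge_once(question, a, b)` returns "A", "B",
# or "tie"; it is the hypothetical helper from the earlier sketch.
from collections import Counter

def position_bias_report(pairs, judge_once) -> dict[str, float]:
    counts = Counter()
    for question, ans_x, ans_y in pairs:
        fwd = judge_once(question, ans_x, ans_y)  # ans_x in slot A
        rev = judge_once(question, ans_y, ans_x)  # order flipped
        if (fwd, rev) in {("A", "B"), ("B", "A"), ("tie", "tie")}:
            counts["consistent"] += 1       # verdict tracks the answer
        elif fwd == rev == "A":
            counts["first_slot_bias"] += 1  # always prefers slot A
        elif fwd == rev == "B":
            counts["last_slot_bias"] += 1   # always prefers slot B
        else:
            counts["inconsistent"] += 1     # e.g. a tie in one order only
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}
```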
## Translation Quality Evaluation
- Kocmi, T. & Federmann, C. (2023). Large Language Models Are State-of-the-Art Evaluators of Translation Quality. WMT 2023 (GEMBA / GEMBA-DA).
- Kocmi, T. & Federmann, C. (2023b). GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4. WMT 2023.
  - 96.5% system-level accuracy in the WMT 2023 meta-evaluation (an MQM scoring sketch follows this list).
- Fernandes, P., et al. (2023). The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation (AutoMQM). WMT 2023.
- Xu, Y., et al. (2024). ReMedy: Reference-free Metrics for Translation Evaluation.
- Lu, X., et al. (2024). EAPrompt: Error Analysis Prompting for Translation Evaluation.
- Zouhar, V., et al. (2025). ESA: Error Span Annotation.
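As flagged in the GEMBA-MQM entry, MQM-style metrics turn annotated error spans into a segment score via severity weights. The sketch below uses the weights minor = 1, major = 5, critical = 25 with a -25 cap, which is my reading of common MQM/GEMBA-MQM practice; verify the exact constants against the papers before reusing them.

```python
# Sketch of MQM-style scoring from LLM-annotated error spans, in the
# spirit of GEMBA-MQM. The severity weights and the -25 cap are assumed
# from common MQM practice, not quoted from the paper.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

def mqm_score(errors: list[dict]) -> float:
    """errors: error spans parsed from the judge's output, e.g.
    [{"span": "bank", "category": "mistranslation", "severity": "major"}]."""
    penalty = sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)
    return -min(penalty, 25.0)  # cap so one segment cannot dominate

print(mqm_score([
    {"span": "bank", "category": "mistranslation", "severity": "major"},
    {"span": ",", "category": "punctuation", "severity": "minor"},
]))  # -> -6.0
```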
## Thinking Judges

- Saha, S., et al. (2025). Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge (EvalPlanner). ICML 2025.
  - Scores 93.9 on RewardBench, state of the art.
- Whitehouse, C., et al. (2025). J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning. arXiv:2505.10320
## Benchmarks and Meta-Evaluation
- Dubois, Y., et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475
- Li, T., et al. (2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. arXiv:2406.11939
- Lambert, N., et al. (2024). RewardBench: Evaluating Reward Models for Language Modeling. arXiv:2403.13787
  - Accuracy on chosen vs. rejected response pairs; see the meta-evaluation sketch after this list.
- Liu, Y., et al. (2025). RM-Bench: Benchmarking Reward Models with Subtlety and Style. ICLR 2025.
- Frick, E., et al. (2025). How to Evaluate Reward Models for RLHF (PPE benchmark). ICLR 2025.
- Tan, X., et al. (2025). JudgeBench: A Benchmark for Evaluating LLM-based Judges.
- Gureja, V., et al. (2025). M-RewardBench: Evaluating Reward Models in Multilingual Settings.
- Jin, Z., et al. (2025). RAG-RewardBench: Benchmarking Reward Models in RAG. ACL 2025.
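Most of the benchmarks above share one meta-evaluation recipe: each example pairs a human-preferred ("chosen") response with a dispreferred ("rejected") one, and the judge or reward model is scored by how often it ranks chosen above rejected. A minimal sketch, where `score_fn` is a hypothetical callable returning a scalar score for a (prompt, response) pair:

```python
# RewardBench-style meta-evaluation sketch: fraction of examples where the
# model scores the human-preferred response above the dispreferred one.
def meta_eval_accuracy(dataset, score_fn) -> float:
    correct = 0
    for ex in dataset:  # ex: {"prompt": ..., "chosen": ..., "rejected": ...}
        s_chosen = score_fn(ex["prompt"], ex["chosen"])
        s_rejected = score_fn(ex["prompt"], ex["rejected"])
        correct += s_chosen > s_rejected
    return correct / len(dataset)
```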
## Meta-Rewarding and Self-Improvement
- Wu, T., et al. (2024). Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. arXiv:2407.19594
- Yu, Q., et al. (2025). Critic-RM: Self-generated Critiques Boost Reward Modeling for Language Models. NAACL 2025.
## Internal Course Materials

- Course survey (required reading):
  `source-materials/LLM_as_Judge_文献综述.md`: the primary source material for this lecture, covering methodology, biases, applications, and open challenges across nine subsections, with a complete reference index.
## Further Reading Suggestions

Pick by area of interest:

- Translation evaluation → GEMBA-MQM + AutoMQM + WMT shared task overview
- RLHF / reward models → RewardBench + RM-Bench + PPE
- Multilingual evaluation → M-RewardBench + Gureja et al., 2025
- Safety and robustness → Raina 2024 + JudgeBench
- RAG evaluation → RAG-RewardBench + RAGAS
- Reasoning-focused judges → EvalPlanner + J1
## Tools and Resources

| Name | Type | Purpose |
|---|---|---|
| FastChat/MT-Bench | Codebase | Official MT-Bench implementation + judge scripts |
| AlpacaEval | Codebase | AlpacaEval 1.0 / 2.0 length-controlled (LC) evaluation |
| Arena-Hard-Auto | Codebase | Arena-Hard benchmark |
| lm-judge | Codebase | General-purpose pairwise judge templates |
| Prometheus 2 | Model | Open-source judge model |
| RewardBench Leaderboard | Leaderboard | Reward model comparison |
| Chatbot Arena Leaderboard | Leaderboard | Gold-standard human ratings |
This reference list covers work through April 2026. The LLM-as-Judge field is still evolving rapidly: directions such as thinking judges, multilingual evaluation, and RLHF contamination produce new work every quarter, so we suggest a monthly scan of arXiv cs.CL for the keywords judge / evaluation / reward model.