# References

A curated list of core papers, surveys, and benchmarks for the LLM-as-Judge field.
## Reading Roadmap

If you only have 2 hours, read these 3 papers in this order:

1. MT-Bench (Zheng et al., 2023)
   The seminal work that laid the foundation of the LLM-as-Judge paradigm.
2. Prometheus 2 (Kim et al., 2024)
   An open-source judge showing that evaluation ability can be distilled from strong models into small ones.
3. Justice or Prejudice (Zheng Y. et al., 2025)
   A systematic treatment of bias quantification, including a multi-dimensional bias benchmark.
## Foundational Work

- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685
  - First systematic validation that GPT-4 judges agree with human preferences at a rate above 80%.
  - Released MT-Bench (80 multi-turn questions) and the Chatbot Arena crowdsourcing platform. A pairwise-judging sketch in this style follows this list.
- Liu, Y., et al. (2023). G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. EMNLP 2023. arXiv:2303.16634
  - NLG evaluation combining CoT with form filling; reports a Spearman correlation of 0.514 with human judgments on summarization.
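To make the paradigm concrete, below is a minimal Python sketch of pairwise judging in the MT-Bench style. The prompt wording, the `gpt-4o` model name, and the tie handling are illustrative assumptions rather than the paper's exact template; querying both answer orders anticipates the position-bias findings listed under Bias and Reliability below.

```python
# Minimal pairwise-judging sketch in the MT-Bench style (assumptions noted
# in the lead-in; not the paper's exact prompt template).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Please act as an impartial judge and evaluate the two \
AI assistant responses to the user question below. Answer with only "A", \
"B", or "tie".

[Question]
{question}

[Assistant A]
{answer_a}

[Assistant B]
{answer_b}

Verdict:"""


def judge_once(question: str, answer_a: str, answer_b: str) -> str:
    """One judge call with the answers in a fixed order."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any strong instruction model as judge
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content.strip()


def judge_pairwise(question: str, ans_1: str, ans_2: str) -> str:
    """Judge in both orders; declare a winner only if the verdicts agree."""
    fwd = judge_once(question, ans_1, ans_2)  # ans_1 shown as Assistant A
    rev = judge_once(question, ans_2, ans_1)  # positions swapped
    if fwd == "A" and rev == "B":
        return "model_1"
    if fwd == "B" and rev == "A":
        return "model_2"
    return "tie"  # ties and order-dependent verdicts both count as ties
```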
## Survey Papers

- Gu, J., et al. (2024/2026). A Survey on LLM-as-a-Judge. The Innovation (2026-01). arXiv:2411.15594
  - The most systematic taxonomy to date; has gone through six arXiv revisions.
- Li, H., et al. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579
  - Organized along five dimensions: functionality, methodology, application scenarios, meta-evaluation, and limitations.
- Li, D., et al. (2024). From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. arXiv:2411.16594
## Judge Models
- Kim, S., et al. (2024). Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. ICLR 2024. arXiv:2310.08491
- Kim, S., et al. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. arXiv:2405.01535
  - An 8×7B MoE that handles both pointwise and pairwise evaluation; open source (see the loading sketch after this list).
- Zhu, L., et al. (2023). JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv:2310.17631
- Wang, Y., et al. (2023). PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. arXiv:2306.05087
- Li, J., et al. (2024). Auto-J: Generative Judge for Evaluating Alignment. ICLR 2024. arXiv:2310.05470
- Vu, T., et al. (2024). Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation (FLAMe). arXiv:2407.10817
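As a quick start for the open judges above, here is a hedged sketch of loading Prometheus 2 with Hugging Face transformers. The Hub model id and the rubric-style prompt are assumptions for illustration; the model was trained on a specific prompt template, so consult the official Prometheus repository before serious use.

```python
# Hedged sketch: loading an open judge model with Hugging Face transformers.
# The model id and prompt below are assumptions; Prometheus 2 expects a
# specific trained prompt template documented in its repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"  # assumed Hub id (7B variant)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Evaluate the response against the rubric and end with 'Score: N' (1-5).\n"
    "Instruction: Explain photosynthesis to a 10-year-old.\n"
    "Response: Plants use sunlight to turn water and air into food.\n"
    "Rubric: clarity and age-appropriateness.\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```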
## Bias and Reliability

- Wang, P., et al. (2024). Large Language Models are not Fair Evaluators. ACL 2024. arXiv:2305.17926
  - A systematic analysis of position bias; a sketch for quantifying it follows this list.
- Zheng, Y., et al. (2025). Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. arXiv:2410.02736
- Panickssery, A., et al. (2024). LLM Evaluators Recognize and Favor Their Own Generations. arXiv:2404.13076
- Stureborg, R., et al. (2024). Large Language Models are Inconsistent and Biased Evaluators. arXiv:2405.01724
- Li, D., et al. (2025). Preference Leakage: A Contamination Problem in LLM-as-a-judge. arXiv:2502.01534
- Park, et al. (2025). Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge. arXiv:2508.06709
- Koo, R., et al. (2023). Benchmarking Cognitive Biases in LLMs as Evaluators. arXiv:2309.17012
- Raina, V., et al. (2024). Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. arXiv:2402.14016
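As referenced in the Wang et al. entry above, a simple way to quantify position bias is to re-judge each pair with the answer order flipped and tally how often the verdict follows the slot rather than the answer. The sketch below assumes the hypothetical `judge_once` helper from the earlier pairwise sketch and a user-supplied list of (question, answer_x, answer_y) tuples.

```python
# Sketch for quantifying position bias in a pairwise judge, in the spirit
# of Wang et al. (2024). `judge_once(question, a, b)` returns "A", "B",
# or "tie"; it is the hypothetical helper from the earlier sketch.
from collections import Counter

def position_bias_report(pairs, judge_once) -> dict[str, float]:
    counts = Counter()
    for question, ans_x, ans_y in pairs:
        fwd = judge_once(question, ans_x, ans_y)  # ans_x in slot A
        rev = judge_once(question, ans_y, ans_x)  # order flipped
        if (fwd, rev) in {("A", "B"), ("B", "A"), ("tie", "tie")}:
            counts["consistent"] += 1       # verdict tracks the answer
        elif fwd == rev == "A":
            counts["first_slot_bias"] += 1  # always prefers slot A
        elif fwd == rev == "B":
            counts["last_slot_bias"] += 1   # always prefers slot B
        else:
            counts["inconsistent"] += 1     # e.g. a tie in one order only
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}
```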
## Translation Quality Evaluation
- Kocmi, T. & Federmann, C. (2023). Large Language Models Are State-of-the-Art Evaluators of Translation Quality. WMT 2023 (GEMBA / GEMBA-DA).
- Kocmi, T. & Federmann, C. (2023b). GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4. WMT 2023.
  - 96.5% system-level accuracy in the WMT 2023 meta-evaluation (an MQM scoring sketch follows this list).
- Fernandes, P., et al. (2023). The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation (AutoMQM). WMT 2023.
- Xu, Y., et al. (2024). ReMedy: Reference-free Metrics for Translation Evaluation.
- Lu, X., et al. (2024). EAPrompt: Error Analysis Prompting for Translation Evaluation.
- Zouhar, V., et al. (2025). ESA: Error Span Annotation.
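As flagged in the GEMBA-MQM entry, MQM-style metrics turn annotated error spans into a segment score via severity weights. The sketch below uses the weights minor = 1, major = 5, critical = 25 with a -25 cap, which is my reading of common MQM/GEMBA-MQM practice; verify the exact constants against the papers before reusing them.

```python
# Sketch of MQM-style scoring from LLM-annotated error spans, in the
# spirit of GEMBA-MQM. The severity weights and the -25 cap are assumed
# from common MQM practice, not quoted from the paper.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

def mqm_score(errors: list[dict]) -> float:
    """errors: error spans parsed from the judge's output, e.g.
    [{"span": "bank", "category": "mistranslation", "severity": "major"}]."""
    penalty = sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)
    return -min(penalty, 25.0)  # cap so one segment cannot dominate

print(mqm_score([
    {"span": "bank", "category": "mistranslation", "severity": "major"},
    {"span": ",", "category": "punctuation", "severity": "minor"},
]))  # -> -6.0
```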
## Thinking Judges

- Saha, S., et al. (2025). Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge (EvalPlanner). ICML 2025.
  - Scores 93.9 on RewardBench, state of the art.
- Whitehouse, C., et al. (2025). J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning. arXiv:2505.10320
## Benchmarks and Meta-Evaluation
- Dubois, Y., et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475
- Li, T., et al. (2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. arXiv:2406.11939
- Lambert, N., et al. (2024). RewardBench: Evaluating Reward Models for Language Modeling. arXiv:2403.13787
  - Accuracy on chosen vs. rejected response pairs; see the meta-evaluation sketch after this list.
- Liu, Y., et al. (2025). RM-Bench: Benchmarking Reward Models with Subtlety and Style. ICLR 2025.
- Frick, E., et al. (2025). How to Evaluate Reward Models for RLHF (PPE benchmark). ICLR 2025.
- Tan, X., et al. (2025). JudgeBench: A Benchmark for Evaluating LLM-based Judges.
- Gureja, V., et al. (2025). M-RewardBench: Evaluating Reward Models in Multilingual Settings.
- Jin, Z., et al. (2025). RAG-RewardBench: Benchmarking Reward Models in RAG. ACL 2025.
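Most of the benchmarks above share one meta-evaluation recipe: each example pairs a human-preferred ("chosen") response with a dispreferred ("rejected") one, and the judge or reward model is scored by how often it ranks chosen above rejected. A minimal sketch, where `score_fn` is a hypothetical callable returning a scalar score for a (prompt, response) pair:

```python
# RewardBench-style meta-evaluation sketch: fraction of examples where the
# model scores the human-preferred response above the dispreferred one.
def meta_eval_accuracy(dataset, score_fn) -> float:
    correct = 0
    for ex in dataset:  # ex: {"prompt": ..., "chosen": ..., "rejected": ...}
        s_chosen = score_fn(ex["prompt"], ex["chosen"])
        s_rejected = score_fn(ex["prompt"], ex["rejected"])
        correct += s_chosen > s_rejected
    return correct / len(dataset)
```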
## Meta-Rewarding and Self-Improvement
- Wu, T., et al. (2024). Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. arXiv:2407.19594
- Yu, Q., et al. (2025). Critic-RM: Self-generated Critiques Boost Reward Modeling for Language Models. NAACL 2025.
## Internal Course Materials

- Course survey (required reading):
  `source-materials/LLM_as_Judge_文献综述.md`: the primary source material for this lecture, covering methodology, biases, applications, and open challenges across nine subsections, with a complete reference index.
## Further Reading Suggestions

Pick by area of interest:

- Translation evaluation → GEMBA-MQM + AutoMQM + WMT shared task overview
- RLHF / reward models → RewardBench + RM-Bench + PPE
- Multilingual evaluation → M-RewardBench + Gureja et al., 2025
- Safety and robustness → Raina 2024 + JudgeBench
- RAG evaluation → RAG-RewardBench + RAGAS
- Reasoning-focused judges → EvalPlanner + J1
## Tools and Resources

| Name | Type | Purpose |
|---|---|---|
| FastChat/MT-Bench | Codebase | Official MT-Bench implementation + judge scripts |
| AlpacaEval | Codebase | AlpacaEval 1.0 / 2.0 length-controlled (LC) evaluation |
| Arena-Hard-Auto | Codebase | Arena-Hard benchmark |
| lm-judge | Codebase | General-purpose pairwise judge templates |
| Prometheus 2 | Model | Open-source judge model |
| RewardBench Leaderboard | Leaderboard | Reward model comparison |
| Chatbot Arena Leaderboard | Leaderboard | Gold-standard human ratings |
This reference list covers work through April 2026. The LLM-as-Judge field is still evolving rapidly: directions such as thinking judges, multilingual evaluation, and RLHF contamination produce new work every quarter, so we suggest a monthly scan of arXiv cs.CL for the keywords judge / evaluation / reward model.