Artificial Intelligence in Practice (Language Intelligence)
Lecture 8: LLM as Judge

References

A curated list of core papers, surveys, and benchmarks in the LLM-as-Judge field

Reading Roadmap

If you only have 2 hours, read these 3 papers in this order:

1. MT-Bench (Zheng et al., 2023)

The seminal work that established the LLM-as-Judge paradigm

2. Prometheus 2 (Kim et al., 2024)

An open-source judge model, showing that judging ability can be distilled from strong models into smaller ones

3. Justice or Prejudice (Zheng Y. et al., 2025)

A systematic summary of bias quantification, including a multi-dimensional bias benchmark

Foundational Work

  • Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685

    • First systematic validation that a GPT-4 judge agrees with human preferences at a rate above 80%.
    • Released MT-Bench (80 multi-turn questions) and the Chatbot Arena crowdsourcing platform.
  • Liu, Y., et al. (2023). G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. EMNLP 2023. arXiv:2303.16634

    • NLG evaluation combining CoT with form filling; achieves Spearman ρ = 0.514 on summarization tasks.
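Spearman rank correlation between judge scores and human scores is the standard meta-evaluation metric that figures like G-Eval's 0.514 report. A minimal pure-Python sketch of how it is computed (real evaluations typically use scipy.stats.spearmanr; the toy scores below are hypothetical):

```python
# Spearman rank correlation: Pearson correlation computed on the ranks
# of the two score lists, with tied values given their average rank.

def ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(judge_scores, human_scores):
    rx, ry = ranks(judge_scores), ranks(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

judge = [4, 2, 5, 3, 1]   # hypothetical judge ratings for 5 summaries
human = [5, 1, 4, 3, 2]   # hypothetical human ratings
print(round(spearman(judge, human), 3))  # → 0.8
```

A ρ of 1.0 means the judge orders outputs exactly as humans do; 0.514 on summarization indicates moderate but far-from-perfect agreement.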

Surveys

  • Gu, J., et al. (2024/2026). A Survey on LLM-as-a-Judge. The Innovation (2026-01). arXiv:2411.15594

    • The most systematic taxonomy, refined through six arXiv revisions.
  • Li, H., et al. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579

    • Organized along five dimensions: functionality, methodology, application scenarios, meta-evaluation, and limitations.
  • Li, D., et al. (2024). From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. arXiv:2411.16594

Judge Models

  • Kim, S., et al. (2024). Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. ICLR 2024. arXiv:2310.08491
  • Kim, S., et al. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. arXiv:2405.01535
    • An open-source 8×7B MoE judge that handles both pointwise and pairwise evaluation.
  • Zhu, L., et al. (2023). JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv:2310.17631
  • Wang, Y., et al. (2023). PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. arXiv:2306.05087
  • Li, J., et al. (2024). Auto-J: Generative Judge for Evaluating Alignment. ICLR 2024. arXiv:2310.05470
  • Fernandes, P., et al. (2024). FLAMe: Foundational Autorater Models for Reliable Evaluation. arXiv:2407.10817

Bias and Reliability

  • Wang, P., et al. (2024). Large Language Models are not Fair Evaluators. ACL 2024. arXiv:2305.17926
    • Systematic analysis of position bias.
  • Zheng, Y., et al. (2025). Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. arXiv:2410.02736
  • Panickssery, A., et al. (2024). LLM Evaluators Recognize and Favor Their Own Generations. arXiv:2404.13076
  • Stureborg, R., et al. (2024). Large Language Models are Inconsistent and Biased Evaluators. arXiv:2405.01724
  • Li, D., et al. (2025). Preference Leakage: A Contamination Problem in LLM-as-a-judge. arXiv:2502.01534
  • Park, et al. (2025). Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge. arXiv:2508.06709
  • Koo, R., et al. (2023). Benchmarking Cognitive Biases in LLMs as Evaluators. arXiv:2309.17012
  • Raina, V., et al. (2024). Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment. arXiv:2402.14016
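Position bias, as analyzed by Wang et al. above, can be partially neutralized by judging each pair twice with the candidate order swapped and accepting only consistent verdicts. A minimal sketch of this swap-and-compare idea, where `ask_judge` is a hypothetical stand-in for an actual LLM call returning "A" or "B":

```python
# Swap-and-compare debiasing: judge (x, y) and (y, x); if the two
# verdicts disagree after remapping, the comparison is inconclusive
# and is scored as a tie instead of trusting a position-biased call.

def debiased_verdict(ask_judge, answer_1, answer_2):
    first = ask_judge(answer_1, answer_2)           # answer_1 shown as "A"
    second = ask_judge(answer_2, answer_1)          # order swapped
    # Map the second verdict back to the original labelling.
    second_mapped = {"A": "B", "B": "A"}.get(second, second)
    return first if first == second_mapped else "tie"

# Toy judge that always prefers whichever answer is shown first:
position_biased = lambda a, b: "A"
print(debiased_verdict(position_biased, "ans1", "ans2"))  # → tie
```

This doubles the judging cost, which is why calibration-based alternatives are also studied in the papers above.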

Translation Quality Evaluation

  • Kocmi, T. & Federmann, C. (2023). Large Language Models Are State-of-the-Art Evaluators of Translation Quality. WMT 2023 (GEMBA / GEMBA-DA).
  • Kocmi, T. & Federmann, C. (2023b). GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4. WMT 2023.
    • 96.5% system-level accuracy in the WMT 2023 metrics meta-evaluation.
  • Fernandes, P., et al. (2023). AutoMQM: The Devil is in the Errors.
  • Xu, Y., et al. (2024). ReMedy: Reference-free Metrics for Translation Evaluation.
  • Lu, X., et al. (2024). EAPrompt: Error Analysis Prompting for Translation Evaluation.
  • Zouhar, V., et al. (2025). ESA: Error Span Annotation.

Thinking Judges

  • Saha, S., et al. (2025). Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge (EvalPlanner). ICML 2025.
    • Scores 93.9 on RewardBench, state of the art at the time of publication.
  • Whitehouse, C., et al. (2025). J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning. arXiv:2505.10320

Benchmarks and Meta-Evaluation

  • Dubois, Y., et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475
  • Li, T., et al. (2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. arXiv:2406.11939
  • Lambert, N., et al. (2024). RewardBench: Evaluating Reward Models for Language Modeling. arXiv:2403.13787
  • Liu, Y., et al. (2025). RM-Bench: Benchmarking Reward Models with Subtlety and Style. ICLR 2025.
  • Frick, E., et al. (2025). How to Evaluate Reward Models for RLHF (PPE benchmark). ICLR 2025.
  • Tan, X., et al. (2025). JudgeBench: A Benchmark for Evaluating LLM-based Judges.
  • Gureja, V., et al. (2025). M-RewardBench: Evaluating Reward Models in Multilingual Settings.
  • Jin, Z., et al. (2025). RAG-RewardBench: Benchmarking Reward Models in RAG. ACL 2025.

Meta-Rewarding and Self-Improvement

  • Wu, T., et al. (2024). Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. arXiv:2407.19594
  • Yu, Q., et al. (2025). Critic-RM: Self-generated Critiques Boost Reward Modeling for Language Models. NAACL 2025.

Internal Course Materials

  • Course survey (required reading): source-materials/LLM_as_Judge_文献综述.md
    • The primary source material for this lecture; nine subsections covering methodology, biases, applications, and open challenges, with a complete reference index.

Further Reading Suggestions

Pick by area of interest:

  • Translation evaluation → GEMBA-MQM + AutoMQM + WMT shared task overview
  • RLHF / reward models → RewardBench + RM-Bench + PPE
  • Multilingual evaluation → M-RewardBench + Gureja et al., 2025
  • Safety and robustness → Raina 2024 + JudgeBench
  • RAG evaluation → RAG-RewardBench + RAGAS
  • Reasoning-style judges → EvalPlanner + J1

Tools and Resources

Name                       Type         Purpose
FastChat/MT-Bench          Code repo    Official MT-Bench implementation + judge scripts
AlpacaEval                 Code repo    AlpacaEval 1.0 / 2.0 LC evaluation
Arena-Hard-Auto            Code repo    Arena-Hard benchmark
lm-judge                   Code repo    Generic pairwise-judge templates
Prometheus 2               Model        Open-source judge model
RewardBench Leaderboard    Leaderboard  Reward model comparison
Chatbot Arena Leaderboard  Leaderboard  Gold-standard human ratings
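The pairwise judging that FastChat/MT-Bench and similar tools implement boils down to a templated prompt plus a verdict parser. A hedged sketch of that setup; the prompt wording below is paraphrased rather than the official MT-Bench template, and a real pipeline would send `PAIRWISE_TEMPLATE` to a chat-completion API:

```python
# Pairwise LLM-as-Judge skeleton: fill a comparison prompt, then parse
# the bracketed verdict token ([[A]] / [[B]] / [[C]]) from the raw
# judge output, tolerating surrounding free-text reasoning.

PAIRWISE_TEMPLATE = """You are an impartial judge. Compare the two AI
answers to the user question below and decide which is better.
Consider helpfulness, relevance, accuracy, and level of detail. Avoid
position and length biases. Output exactly one of: [[A]], [[B]], or
[[C]] for a tie.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

def parse_verdict(judge_output: str) -> str:
    """Extract the verdict token from the judge's raw output."""
    for token, label in (("[[A]]", "A"), ("[[B]]", "B"), ("[[C]]", "tie")):
        if token in judge_output:
            return label
    return "error"  # unparseable output; real pipelines retry or discard

print(parse_verdict("After comparing both, my verdict is [[B]]."))  # → B
```

The double-bracket convention makes the verdict trivially machine-parseable even when the judge prepends chain-of-thought reasoning.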

This reference list covers work through April 2026. The LLM-as-Judge field is still evolving rapidly: thinking judges, multilingual evaluation, and RLHF contamination all see new work every quarter, so we suggest a monthly scan of arXiv cs.CL for the keywords judge / evaluation / reward model.