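Lab 1: Running the Classic Tasks End to End with Hugging Face Pipeline

Use the Hugging Face pipeline to complete five classic NLP tasks (sentiment classification, NER, translation, summarization, QA) in about 10 minutes, record run time and GPU memory, and compare the traditional pipeline with Transformer-based models.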

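Lab Overview

The goal of this lab is not to train a model. It is to let you see for yourself, as quickly as possible, that the six classes of classic tasks from the previous section (representation, tokenization, classification, tagging, generation, evaluation) can all be run within 10 minutes through one unified interface, the Hugging Face pipeline.

This uniformity is itself the core value of the Transformer covered in Lecture 2: in this single lab you will encounter both the large engineering payoff it brings and the limitations it leaves behind.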

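Item	Details
Frameworks	Hugging Face `transformers`, `datasets`
Core tool	`pipeline` high-level API
Number of tasks	5 (sentiment classification, NER, translation, summarization, extractive QA)
Model source	Hugging Face Hub (downloaded automatically on first run, about 2-3 GB)
Estimated time	10 minutes (GPU) / 20-30 minutes (CPU)
GPU requirement	Optional; any T4 / 3060 or better GPU is enough, and everything also runs on CPU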

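Environment setup: a free Colab T4 or a local conda environment is recommended. Core dependencies: pip install transformers datasets torch sentencepiece. If the first load is slow, the cause is the model download, not computation.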
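Before starting, it can help to confirm that the four packages from the pip command are actually visible to the interpreter you are running. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical and not part of any library.

```python
import importlib.util

# Package names taken from the pip command in the setup note.
REQUIRED = ["transformers", "datasets", "torch", "sentencepiece"]


def find_missing(packages):
    """Return the packages that cannot be located in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All core dependencies are importable.")
```

This only checks that each package can be found; it does not import it, so it runs quickly and says nothing about version compatibility.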

Lab Steps

Step 1: Environment check and first contact with pipeline (2 minutes)

import torch
from transformers import pipeline
import time

# ===== 1. Environment check =====
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    device = 0  # pipeline's device argument: 0 = first GPU
else:
    device = -1  # -1 means CPU

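One line of code runs sentiment classification, a first taste of the pipeline "magic":

# ===== 2. Minimal sentiment classification =====
sentiment = pipeline("sentiment-analysis", device=device)
result = sentiment("I've been waiting for this class my whole life.")
print(result)
# [{'label': 'POSITIVE', 'score': 0.99...}]

Note: you never had to specify a model, a tokenizer, or a classification head. The pipeline picked an English DistilBERT sentiment model by default. That is the power of a unified interface.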

Step 2: Task 1, sentiment classification (2 minutes)

Compare an English and a Chinese classification model.

# ===== English sentiment (default model) =====
clf_en = pipeline("sentiment-analysis", device=device)

# ===== Chinese sentiment (explicit Chinese model) =====
clf_zh = pipeline(
    "sentiment-analysis",
    model="uer/roberta-base-finetuned-dianping-chinese",
    device=device,
)

# The Chinese samples are kept in Chinese on purpose: they are inputs to the Chinese model.
samples = {
    "EN-positive": "This movie is absolutely fantastic, I loved every minute.",
    "EN-negative": "Terrible plot, horrible acting, complete waste of time.",
    "ZH-positive": "这家餐厅的菜品精致，服务也非常贴心，值得推荐。",
    "ZH-negative": "菜上得慢，态度又差，再也不会来了。",
}

print("\n=== Sentiment classification ===")
for name, text in samples.items():
    clf = clf_zh if name.startswith("ZH") else clf_en
    t0 = time.time()
    out = clf(text)
    dt = (time.time() - t0) * 1000
    print(f"  [{name}] {out[0]['label']} ({out[0]['score']:.3f}) | {dt:.0f} ms")
    print(f"           text: {text[:40]}...")

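What to observe:

Per-call inference latency (typically < 50 ms on GPU, on the order of 500 ms on CPU)
The model is stable on clearly positive or clearly negative text; it tends to fail only on sentences that look neutral but carry sarcasm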
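A single `time.time()` measurement can be distorted by one-off costs on the first call. The sketch below shows one possible way to get steadier latency numbers: run a warm-up call, then take the median over several repeats. The helper `time_call` and the stand-in `fake_classifier` are hypothetical; the stand-in exists only so the example runs without downloading a model.

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=5, **kwargs):
    """Return (median latency in ms, last result) for fn(*args, **kwargs)."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # discard warm-up runs
    samples = []
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples), result


# Stand-in for a pipeline object; mimics the output shape of a sentiment pipeline.
def fake_classifier(text):
    return [{"label": "POSITIVE", "score": 0.99}]


median_ms, out = time_call(fake_classifier, "I loved every minute.")
print(f"{out[0]['label']} | median {median_ms:.3f} ms over 5 runs")
```

In the lab itself you would pass `clf_en` or `clf_zh` in place of `fake_classifier`. The median is used rather than the mean so that one slow outlier does not dominate the reported number.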

Step 3: Task 2, named entity recognition (NER) (2 minutes)

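# English NER (model pinned explicitly; the pipeline's built-in default for "ner" is a different checkpoint)
ner_en = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
    device=device,
)

# NER for Chinese text, using a multilingual BERT model fine-tuned for NER
ner_zh = pipeline(
    "ner",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",
    aggregation_strategy="simple",
    device=device,
)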

en_text = "Apple CEO Tim Cook visited Beijing on March 3, 2026, meeting Li Qiang at Zhongnanhai."
zh_text = "2026 年 3 月，苹果公司 CEO 库克访问北京大学软件与微电子学院。"

print("\n=== NER ===")
for name, txt, m in [("EN", en_text, ner_en), ("ZH", zh_text, ner_zh)]:
    t0 = time.time()
    out = m(txt)
    dt = (time.time() - t0) * 1000
    print(f"  [{name}] {dt:.0f} ms")
    for ent in out:
        print(f"    - '{ent['word']}' → {ent['entity_group']} ({ent['score']:.2f})")

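What to observe:

aggregation_strategy="simple" automatically merges the BIO tag sequence into complete entities (recall §1.4)
The multilingual model is often inaccurate at Chinese entity boundaries (for example, "北京大学" may be split apart), which is exactly the "zero-shot NER limitation" mentioned at the end of §1.4
Compare: what prompt could get GPT-4 to do the same thing? How do quality, cost, and latency compare?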
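To make the BIO merging concrete, here is a simplified sketch of the idea in plain Python. It is not the library's implementation (the real pipeline also handles subword pieces, character offsets, and score aggregation); the function name `merge_bio` and the word-level tags are illustrative assumptions.

```python
def merge_bio(tokens, tags):
    """Merge parallel token/tag lists into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (such stray I- tokens are dropped in this sketch).
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Apple", "CEO", "Tim", "Cook", "visited", "Beijing"]
tags = ["B-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]
print(merge_bio(tokens, tags))
# [('Apple', 'ORG'), ('Tim Cook', 'PER'), ('Beijing', 'LOC')]
```

Boundary errors like a split "北京大学" show up at exactly this stage: if the model tags the second half with a B- tag or with O, the merge produces two entities or a truncated one.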

Step 4: Task 3, machine translation (2 minutes)

Use the Helsinki-NLP OPUS-MT family: lightweight, easy to use, and covering 1000+ language pairs.

# English → Chinese
translator_en_zh = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-en-zh",
    device=device,
)

# Chinese → English
translator_zh_en = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-zh-en",
    device=device,
)

en = "Large language models have fundamentally changed how we approach natural language processing."
zh = "大语言模型正在改变自然语言处理的研究范式，也在改变工业界的部署方式。"

print("\n=== Machine translation ===")
t0 = time.time()
print(f"  [EN→ZH] {translator_en_zh(en)[0]['translation_text']}")
print(f"          elapsed {(time.time()-t0)*1000:.0f} ms")

t0 = time.time()
print(f"  [ZH→EN] {translator_zh_en(zh)[0]['translation_text']}")
print(f"          elapsed {(time.time()-t0)*1000:.0f} ms")

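What to observe:

OPUS-MT models have only ~70M parameters and run far faster than LLM-based translation
Abstract terms such as "研究范式" (research paradigm) may come out unnatural; for specialized domains, consider the newer NLLB / M2M-100 models or an LLM directly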

Step 5: Task 4, summarization (1 minute)

summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",  # a classic English summarization model
    device=device,
)

long_text = """
The field of natural language processing has undergone a remarkable transformation
in the past decade. Once dominated by rule-based systems and statistical methods
relying on hand-crafted features like TF-IDF and n-grams, NLP has been reshaped
by the rise of neural networks and, more recently, by the Transformer architecture.
Today, large language models trained on trillions of tokens can perform tasks that
were unthinkable a decade ago, from translating between hundreds of language pairs
to writing functional code. Yet classical techniques have not vanished; they continue
to serve as strong baselines, retrieval backbones in RAG systems, and lightweight
alternatives when compute is constrained.
"""

print("\n=== Summarization ===")
t0 = time.time()
out = summarizer(long_text, max_length=60, min_length=20, do_sample=False)
print(f"  source: {len(long_text)} chars | summary: {out[0]['summary_text']}")
print(f"  elapsed {(time.time()-t0)*1000:.0f} ms")

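What to observe:

BART is a seq2seq model, so the summary is generated rather than assembled from sentences picked out of the source
Compare the source with the summary: did the model introduce content that is not in the source? This is the **hallucination** problem mentioned in §1.5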
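As a rough aid for that comparison, the sketch below lists words in a summary that never occur in the source. This is a crude lexical heuristic, not a hallucination detector: a paraphrase can use new words while staying faithful, and a false claim can reuse source words. The function name `novel_words` and the `min_len` cutoff are assumptions made for illustration.

```python
import re


def novel_words(source, summary, min_len=4):
    """Return summary words (lowercased, in order, deduplicated) absent from the source."""

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    source_vocab = set(tokenize(source))
    seen, novel = set(), []
    for word in tokenize(summary):
        # Skip short function words; flag each unseen word once.
        if len(word) >= min_len and word not in source_vocab and word not in seen:
            seen.add(word)
            novel.append(word)
    return novel


source = "Classical techniques have not vanished; they continue to serve as strong baselines."
faithful = "Classical techniques continue to serve as strong baselines."
drifted = "Classical techniques were abandoned in 2020."

print(novel_words(source, faithful))  # []
print(novel_words(source, drifted))   # ['were', 'abandoned', '2020']
```

Novel numbers, dates, and names are usually the most useful entries to inspect by hand, since those are the additions a reader is least able to verify from the summary alone.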

Step 6: Task 5, extractive question answering (2 minutes)

qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",  # trained on SQuAD 2.0
    device=device,
)

context = """
The Peking University School of Software and Microelectronics was founded in 2002.
It is located in the Daxing campus of Peking University. The school offers master's
and doctoral programs in software engineering, integrated circuits, and artificial
intelligence. Professor Zhijun Gao teaches the "Artificial Intelligence Practice
(Language Intelligence)" course to graduate students.
"""

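questions = [
    "When was the School of Software and Microelectronics founded?",
    "Where is it located?",
    "Who teaches the Language Intelligence course?",
    # Unanswerable from the context: expect a low score
    # (an empty answer is only returned if handle_impossible_answer=True is passed)
    "Does the school offer undergraduate programs?",
]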

print("\n=== Extractive QA ===")
for q in questions:
    t0 = time.time()
    out = qa(question=q, context=context)
    dt = (time.time() - t0) * 1000
    print(f"  Q: {q}")
    print(f"    → '{out['answer']}' (score={out['score']:.2f}, {dt:.0f} ms)")

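What to observe:

The model outputs a span from the source text (recall extractive QA in §1.5); it does not "generate" content that is absent from the source
For the fourth question (whether undergraduate programs are offered): the context does not say, so the returned span may be unreasonable, but its score will be relatively low
This is the precursor of RAG systems; Lecture 4 covers in detail how generative RAG evolves from this extractive foundation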

Step 7: Performance summary and comparative analysis (1-2 minutes)

# ===== GPU memory check =====
if torch.cuda.is_available():
    mem_mb = torch.cuda.memory_allocated() / 1024**2
    max_mem_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"\nCurrent GPU memory allocated: {mem_mb:.0f} MB, peak: {max_mem_mb:.0f} MB")

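# ===== Conceptual comparison with the "traditional pipeline" =====
comparison = """
Comparison by dimension: traditional NLP pipeline vs Hugging Face pipeline (Transformer)

| Task          | Traditional approach                   | Transformer pipeline             |
|---------------|----------------------------------------|----------------------------------|
| Sentiment     | Tokenization + TF-IDF + LogReg         | BERT + classification head       |
| NER           | Feature engineering + BiLSTM + CRF     | BERT + token classification head |
| Translation   | Phrase alignment / Seq2Seq + Attention | Transformer Enc-Dec              |
| Summarization | Extractive (TextRank)                  | BART / T5 generation             |
| QA            | Templates + rules / BiDAF              | RoBERTa + span head              |

Key observations:
1. Every task is invoked through pipeline("task-name"): the interface is identical
2. Model, tokenizer, and post-processing are all encapsulated: you do not need to know the architectural differences between BERT and BART
3. Inference speed: on the order of 10-100 ms per sample on GPU
4. Cost: the downloaded model weights total about 2-3 GB, far heavier than a traditional pipeline
"""
print(comparison)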

Task comparison log (fill in your actual results by hand):

Task	Model	First-load time	Per-item inference time	GPU memory	Subjective quality (1-5)
English sentiment	distilbert-base-uncased-finetuned-sst-2-english	?	?	?	?
Chinese sentiment	uer/roberta-base-finetuned-dianping-chinese	?	?	?	?
English NER	dslim/bert-base-NER	?	?	?	?
Chinese NER	Davlan/bert-base-multilingual-cased-ner-hrl	?	?	?	?
EN→ZH	Helsinki-NLP/opus-mt-en-zh	?	?	?	?
Summarization	facebook/bart-large-cnn	?	?	?	?
Extractive QA	deepset/roberta-base-squad2	?	?	?	?
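If you would rather collect these numbers in code than by hand, one option is a small record type that prints rows in the same column order as the table. This is only a sketch: the class name `RunRecord`, its field names, and the zero values are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    task: str
    model: str
    load_s: float    # first-load time in seconds
    infer_ms: float  # per-item inference time in milliseconds
    mem_mb: float    # GPU memory in MB (0 when running on CPU)
    quality: int     # subjective score, 1-5

    def as_row(self):
        """Format the record as one tab-separated table row."""
        return "\t".join([
            self.task,
            self.model,
            f"{self.load_s:.1f} s",
            f"{self.infer_ms:.0f} ms",
            f"{self.mem_mb:.0f} MB",
            str(self.quality),
        ])


# Placeholder values only; replace them with what you actually measure.
records = [
    RunRecord("English sentiment",
              "distilbert-base-uncased-finetuned-sst-2-english",
              load_s=0.0, infer_ms=0.0, mem_mb=0.0, quality=3),
]
for rec in records:
    print(rec.as_row())
```

When filling in the memory column per model on a GPU, loading one pipeline at a time and calling `torch.cuda.reset_peak_memory_stats()` between models should keep the peak figures from bleeding into each other.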

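Bonus: reproduce the same 5 tasks with an LLM

Complete the same 5 tasks through prompts with a single LLM (GPT-4o-mini, Claude Haiku, or Qwen3-Chat, your choice), then compare:

Sentiment classification: how large is the accuracy gap between the LLM and the pipeline?
NER: can the LLM recognize specialized terms (such as "软件与微电子学院")?
Translation: the fluency of LLM translation vs the faithfulness of OPUS-MT
Summarization: is the LLM's summary more prone to hallucination?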

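pipeline: load 2-3 GB of models once; after that there is no per-call fee, only your own compute
LLM API: every call is billed per token
Estimate: for 100,000 sentiment classifications, what is the total cost difference between the two approaches?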
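One way to structure that estimate is as two small formulas: API cost scales with tokens, local cost scales with GPU time. The sketch below assumes exactly that model. Every number in it (token counts, prices, latency, GPU hourly rate) is a made-up placeholder in arbitrary currency units; substitute your provider's current pricing and your own measured latency.

```python
def api_cost(n_items, tokens_in, tokens_out, price_in_per_m, price_out_per_m):
    """Total API cost, given per-item token counts and per-million-token prices."""
    per_item = (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000
    return n_items * per_item


def local_cost(n_items, ms_per_item, gpu_price_per_hour):
    """Total cost of local inference, assuming sequential processing on a rented GPU."""
    hours = n_items * ms_per_item / 1000 / 3600
    return hours * gpu_price_per_hour


# All values below are hypothetical placeholders, not real prices or measurements.
n = 100_000
api_total = api_cost(n, tokens_in=60, tokens_out=5,
                     price_in_per_m=1.0, price_out_per_m=2.0)
local_total = local_cost(n, ms_per_item=50, gpu_price_per_hour=1.0)

print(f"API total:   {api_total:.2f}")
print(f"Local total: {local_total:.2f}")
```

The formulas deliberately ignore batching, prompt overhead such as instructions and few-shot examples, and engineering time, all of which can shift the comparison in either direction.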

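pipeline: structured output (label, score, span positions)
LLM: needs prompt engineering to enforce JSON format, and still occasionally emits invalid output
If your downstream system writes to a database, which is more reliable?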
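If LLM output does feed a database, a validation step between the model and the write is one common safeguard. The following is a minimal sketch assuming the prompt asks for a JSON object with a `label` and a `score`; the schema, the allowed label set, and the function name `parse_llm_sentiment` are all assumptions made for this example.

```python
import json

# Assumed label set for this example.
ALLOWED_LABELS = {"POSITIVE", "NEGATIVE"}


def parse_llm_sentiment(raw):
    """Return {'label', 'score'} if raw is a valid response string, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all, e.g. chatty text around the object
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    score = data.get("score")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0.0 <= score <= 1.0:
        return None
    return {"label": label, "score": float(score)}


print(parse_llm_sentiment('{"label": "POSITIVE", "score": 0.97}'))
print(parse_llm_sentiment('Sure! Here is the JSON: {"label": "POSITIVE"}'))
```

Rejected responses return `None`, which leaves the caller free to retry, log, or skip the item rather than write a malformed row.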

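Deliverables checklist

After completing the lab, please submit:

Run log: input, output, and elapsed time for each task
Comparison table: the performance table above, filled in
A short observational write-up (200-400 characters) discussing:
- On which task do you feel the pipeline is already sufficient, with no need for an LLM?
- On which task is the pipeline clearly inadequate?
- What tokenizer-level phenomena did you notice (e.g. Chinese text split into very small pieces, numbers broken apart)?
An open-ended question: if you had to build a baseline for the "PKU graduate handbook QA system" (the Lecture 4 lab), which pipeline task would you start from, and why?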

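Troubleshooting

Common errors:

ImportError: sentencepiece: some models (T5, Llama, Marian/OPUS-MT) need pip install sentencepiece
Slow first load: models are downloaded from the HF Hub by default; in mainland China, consider setting export HF_ENDPOINT=https://hf-mirror.com
CUDA OOM with device=0: switch to a smaller model (such as distilbert) or reduce the batch size; the summarization model bart-large needs about 2 GB of GPU memory
Chinese model outputs labels in pinyin or English: this is determined by the label scheme used at training time; inspect model.config.id2label and map the labels yourself
Deprecation warning about entity grouping in NER: the older grouped_entities argument is deprecated; newer transformers versions use aggregation_strategy (simple/first/average/max) instead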
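For the label-scheme item, a small normalization function can make outputs from different models comparable in one table. This is a sketch under the assumption that you only need a POSITIVE/NEGATIVE scheme; the function name and the example label strings are illustrative, so check `model.config.id2label` for what your model actually emits.

```python
def normalize_label(raw_label, mapping=None):
    """Map a model-specific label string onto POSITIVE / NEGATIVE where possible."""
    mapping = mapping or {}
    # An explicit mapping wins, e.g. for opaque labels such as LABEL_0 / LABEL_1.
    if raw_label in mapping:
        return mapping[raw_label]
    lowered = raw_label.lower()
    if lowered.startswith("pos"):
        return "POSITIVE"
    if lowered.startswith("neg"):
        return "NEGATIVE"
    return raw_label  # unknown scheme: leave untouched rather than guess


# Example label strings are illustrative, not guaranteed model outputs.
print(normalize_label("positive (stars 4 and 5)"))                                  # POSITIVE
print(normalize_label("LABEL_0", {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}))  # NEGATIVE
print(normalize_label("neutral"))                                                   # neutral
```

Passing unknown labels through unchanged keeps any surprise visible in the run log instead of silently folding it into one of the two classes.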

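What you should take away

After completing this lab you should understand:

The Hugging Face pipeline lets five different tasks share one API, a snapshot of the engineering payoff of the Transformer era
Behind the pipeline, however, are still models fine-tuned task by task, not one do-everything LLM
LLMs (the main subject from Lecture 3 onward) further compress "per-task fine-tuning" into "one model + different prompts", which is the next step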
