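Lab 7: Building a Small Agent with Search, a Calculator, and File I/O

Use smolagents or LangGraph to build a ReAct-style agent that completes a real task: find a paper's abstract and compute the average publication year of its references.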

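Lab Overview

In this lab you will build your first end-to-end agent: a ReAct agent that decides on its own how to call search, a calculator, and file I/O to complete a real task. Pick one of two frameworks: smolagents (a lightweight framework from Hugging Face with more concise code) or LangGraph (more engineering-oriented and better suited to production).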

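Item	Details
Target task	Find the abstract of a specified paper → fetch its reference list → compute the average publication year of the references
Toolset	`web_search`, `fetch_url`, `python_calc`, `read_file`, `write_file`
Framework	`smolagents` or `langgraph`
Model	Qwen3-8B (local), or Claude 3.5 Haiku / GPT-4o-mini (API)
Estimated time	90 minutes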

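API quota and cost reminder: if you use the OpenAI or Anthropic API, set a dedicated budget cap for this lab (for example, $2). A ReAct agent calls the LLM once per step, so an agent stuck in a failing loop can burn through your quota quickly.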
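Besides a provider-side spending limit, one local safeguard is to count LLM calls and stop once a hard cap is reached. The sketch below is a hypothetical helper: `CallBudget` and `worst_case_calls` are not part of smolagents or LangGraph, and you would have to call `charge()` yourself wherever your code invokes the model.

```python
class CallBudget:
    """Counts LLM calls and raises once a hard cap is exceeded (hypothetical helper)."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0

    def charge(self) -> None:
        """Record one LLM call; raise if the cap has been passed."""
        self.calls += 1
        if self.calls > self.max_calls:
            raise RuntimeError(
                f"LLM call budget exceeded: {self.calls} > {self.max_calls}"
            )


def worst_case_calls(max_steps: int, runs: int) -> int:
    """Upper bound on LLM calls if every run uses all of its steps (one call per step)."""
    return max_steps * runs


# Example: 5 debugging runs of an agent capped at 8 steps each
budget = CallBudget(max_calls=worst_case_calls(max_steps=8, runs=5))
print(budget.max_calls)
```

Multiplying the worst-case call count by your model's per-call cost gives a rough ceiling to compare against the budget cap.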

Lab Steps

Step 1: Environment Setup (10 minutes)

Install the dependencies and prepare your API key.

# Option A: smolagents
pip install smolagents duckduckgo-search requests beautifulsoup4

# Option B: langgraph
pip install langgraph langchain-openai langchain-community duckduckgo-search requests beautifulsoup4

import os
os.environ["OPENAI_API_KEY"] = "sk-..."          # if using OpenAI
# os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # if using Anthropic
# os.environ["HF_TOKEN"] = "hf_..."               # if using HF Inference

Verify the environment:

from duckduckgo_search import DDGS
with DDGS() as ddgs:
    for r in ddgs.text("ReAct agent paper Yao 2022", max_results=3):
        print(r["title"], "-", r["href"])

Step 2: Define the Toolset (15 minutes)

An agent's capability boundary is determined by its tools. We define five tools:

from smolagents import tool
import requests
from duckduckgo_search import DDGS
from bs4 import BeautifulSoup
import re

@tool
def web_search(query: str, max_results: int = 5) -> str:
    """Search DuckDuckGo and return result titles and links.
    
    Args:
        query: The search query.
        max_results: Number of results to return.
    """
    try:
        with DDGS() as ddgs:
            results = list(ddgs.text(query, max_results=max_results))
    except Exception as e:  # e.g. rate limiting or a network failure
        return f"Error: {e}"
    return "\n".join([f"- {r['title']}: {r['href']}" for r in results])

@tool
def fetch_url(url: str) -> str:
    """Fetch a URL and return its content as plain text (HTML tags removed).
    
    Args:
        url: The target URL.
    """
    try:
        resp = requests.get(url, timeout=15, headers={"User-Agent": "Mozilla/5.0"})
        resp.raise_for_status()  # report 4xx/5xx responses as an error string
        soup = BeautifulSoup(resp.text, "html.parser")
        text = soup.get_text(separator="\n", strip=True)
        return text[:8000]  # truncate so the page cannot blow up the context
    except Exception as e:
        return f"Error: {e}"

@tool
def python_calc(expr: str) -> str:
    """Evaluate a simple Python expression. Only math operations are allowed; imports are not.
    
    Args:
        expr: A Python expression, e.g. "sum([2011,2014,2015])/3"
    """
    # Crude keyword filter; this is not a real sandbox
    if "import" in expr or "open" in expr or "__" in expr:
        return "Error: forbidden keyword"
    try:
        return str(eval(expr, {"__builtins__": {}}, {"sum": sum, "len": len, "max": max, "min": min}))
    except Exception as e:
        return f"Error: {e}"

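@tool
def read_file(path: str) -> str:
    """Read the contents of a text file.
    
    Args:
        path: The file path.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()[:8000]  # truncate, same as fetch_url
    except Exception as e:  # fail gracefully instead of raising
        return f"Error: {e}"

@tool
def write_file(path: str, content: str) -> str:
    """Write content to a file.
    
    Args:
        path: The file path.
        content: The file content.
    """
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
    except Exception as e:  # fail gracefully instead of raising
        return f"Error: {e}"
    return f"Written {len(content)} chars to {path}"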
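The keyword filter in `python_calc` above is easy to bypass, so it should not be treated as a sandbox. A stricter alternative is to parse the expression and evaluate only whitelisted AST nodes. The following is a minimal sketch of that idea, not the lab's reference implementation; `safe_calc` and its helpers are hypothetical names, and it supports only numbers, lists, basic arithmetic, and the same four functions.

```python
import ast
import operator

_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_FUNCS = {"sum": sum, "len": len, "max": max, "min": min}


def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _eval(node):
    """Recursively evaluate a whitelisted subset of expression nodes."""
    if isinstance(node, ast.Constant) and _is_number(node.value):
        return node.value
    if isinstance(node, (ast.List, ast.Tuple)):
        return [_eval(e) for e in node.elts]
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left, right = _eval(node.left), _eval(node.right)
        if not (_is_number(left) and _is_number(right)):
            raise ValueError("operands must be numbers")
        return _BIN_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCS and not node.keywords):
        return _FUNCS[node.func.id](*[_eval(a) for a in node.args])
    raise ValueError(f"unsupported expression: {type(node).__name__}")


def safe_calc(expr: str) -> str:
    """Evaluate an arithmetic expression; return an error string on failure."""
    try:
        return str(_eval(ast.parse(expr, mode="eval").body))
    except Exception as e:
        return f"Error: {e}"


print(safe_calc("sum([2011,2014,2015])/3"))
print(safe_calc("__import__('os').system('ls')"))
```

Anything outside the whitelist (attribute access, names, comprehensions, unknown calls) is rejected with an error string, which keeps the tool consistent with returning errors rather than raising.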

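Three principles of tool design: (1) descriptions must be precise, because the docstring is all the LLM sees of a tool; (2) return values must be truncated so they cannot blow up the context; (3) failures must be graceful: return an error string instead of raising.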
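Principles (2) and (3) can be factored into a small decorator so that each tool does not have to repeat them. This is a sketch in plain Python: `safe_tool` and `read_text` are hypothetical names, and whether the wrapper composes cleanly with a framework's own tool decorator is an assumption you should verify for your framework version.

```python
import functools


def safe_tool(max_chars: int = 8000):
    """Decorator factory: truncate results and turn exceptions into error strings."""
    def decorator(fn):
        @functools.wraps(fn)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            try:
                return str(fn(*args, **kwargs))[:max_chars]
            except Exception as e:
                return f"Error: {type(e).__name__}: {e}"
        return wrapper
    return decorator


@safe_tool(max_chars=20)
def read_text(path: str) -> str:
    """Read a text file and return its content."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


@safe_tool(max_chars=20)
def long_output() -> str:
    """Return a string longer than the limit."""
    return "x" * 100


print(read_text("/nonexistent/file.txt"))  # an error string, not an exception
print(len(long_output()))                  # capped at max_chars
```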

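Step 3: Build the Agent (20 minutes)

Option A: smolagents

The selling point of smolagents is using Python code as the action space (similar to CodeAct): at each step the agent emits a snippet of Python rather than a JSON tool call: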

from smolagents import CodeAgent, InferenceClientModel

model = InferenceClientModel(model_id="Qwen/Qwen2.5-72B-Instruct")
# If using OpenAI (requires the `openai` package):
# from smolagents import OpenAIServerModel
# model = OpenAIServerModel(model_id="gpt-4o-mini")

agent = CodeAgent(
    tools=[web_search, fetch_url, python_calc, read_file, write_file],
    model=model,
    max_steps=8,
    verbosity_level=2,  # print Thought / Code / Observation
)
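To make the difference between a JSON action and a code action concrete, here is a toy illustration. It is not how smolagents executes code (smolagents uses its own interpreter), and the bare `exec` below is not a sandbox; `add`, `mul`, `TOOLS`, `run_json_action`, and `run_code_action` are all hypothetical names for this sketch.

```python
import json


def add(a: float, b: float) -> float:
    return a + b


def mul(a: float, b: float) -> float:
    return a * b


TOOLS = {"add": add, "mul": mul}


def run_json_action(action: str):
    """JSON-style action: exactly one tool call per step."""
    call = json.loads(action)
    return TOOLS[call["tool"]](**call["args"])


def run_code_action(code: str):
    """Code-style action: the snippet may chain several tool calls in one step."""
    namespace = dict(TOOLS)
    exec(code, {"__builtins__": {}}, namespace)  # toy only, not a sandbox
    return namespace["result"]


# (2 + 3) * 4 takes two JSON steps, each needing its own LLM call ...
step1 = run_json_action('{"tool": "add", "args": {"a": 2, "b": 3}}')
step2 = run_json_action(json.dumps({"tool": "mul", "args": {"a": step1, "b": 4}}))

# ... but a single code step
one_step = run_code_action("result = mul(add(2, 3), 4)")
print(step2, one_step)
```

Composing calls inside one action is the practical benefit: fewer steps means fewer LLM calls for the same work.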

Option B: LangGraph

from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool as lc_tool

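# Re-wrap the functions defined above as LangChain tools.
# Note: lc_tool expects plain Python functions with docstrings. If the functions were
# already decorated with smolagents' @tool, keep undecorated copies to wrap here.
tools = [lc_tool(fn) for fn in [web_search, fetch_url, python_calc, read_file, write_file]]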

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
agent = create_react_agent(llm, tools)

Step 4: Run a Real Task (20 minutes)

Now point the agent at a real task. We pick a composite task that requires searching, fetching, computing, and writing:

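task = """
Task:
1. Search for the arXiv page of the paper "Attention Is All You Need" (Vaswani 2017)
2. Fetch the abstract from that page
3. Then fetch the paper's reference list (e.g. from Semantic Scholar or arXiv cite)
4. Extract the publication year of every reference
5. Compute the average of those years
6. Write the result (abstract + reference list + average year) to /tmp/paper_report.md

Think step by step and complete the task.
"""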

# smolagents:
result = agent.run(task)
print(result)

# langgraph:
# from langchain_core.messages import HumanMessage
# for chunk in agent.stream({"messages": [HumanMessage(content=task)]}):
#     print(chunk)
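To check the agent's final number, it helps to have a reference computation for steps 4 and 5 of the task. The sketch below extracts one year per reference line with a regular expression and averages the results. The reference strings are made up for illustration, and `extract_years` is a hypothetical helper.

```python
import re
from statistics import mean


def extract_years(reference_lines: list[str]) -> list[int]:
    """Take the first 19xx/20xx token on each line as that reference's year."""
    years = []
    for line in reference_lines:
        match = re.search(r"\b(19|20)\d{2}\b", line)
        if match:
            years.append(int(match.group(0)))
    return years


# Made-up reference strings, for illustration only
sample_refs = [
    "[1] A. Author. A first hypothetical paper. 2014.",
    "[2] B. Author. A second hypothetical paper. 2015.",
    "[3] C. Author. A third hypothetical paper. 2016.",
]

years = extract_years(sample_refs)
average_year = mean(years)
print(years, average_year)
```

The pattern is deliberately simple and can misfire: page numbers or identifiers that happen to look like a year (for example an arXiv ID starting with `20`) would be counted, so spot-check the extracted list before trusting the average.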

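Observe the trajectory: record the agent's Thought / Action / Observation at every step, and pay attention to the following:

Which keywords did it use for the first search?
After fetching the page, did it correctly identify the abstract block?
When it could not fetch the references, did it try an alternative URL (such as Semantic Scholar)?
How did it compute the average year (by calling python_calc, or by doing the arithmetic itself)?
When writing the file, did it preserve the formatting?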
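Once a trajectory is recorded, a few counts make it easier to answer these questions and to compare runs later. The sketch below assumes a hypothetical log format, a list of dicts with `tool` and `args` keys, which you would have to produce from your framework's own logs; the sample trajectory is made up.

```python
from collections import Counter


def summarize(trajectory: list[dict]) -> dict:
    """Count steps, tool calls per tool, and exact repeated calls in one run."""
    calls = [(s["tool"], s.get("args", "")) for s in trajectory if s.get("tool")]
    by_tool = Counter(tool for tool, _ in calls)
    repeated = [call for call, n in Counter(calls).items() if n > 1]
    return {
        "steps": len(trajectory),
        "tool_calls": len(calls),
        "by_tool": dict(by_tool),
        "repeated": repeated,
    }


# Made-up trajectory, for illustration only
run = [
    {"tool": "web_search", "args": "attention is all you need arxiv"},
    {"tool": "web_search", "args": "attention is all you need arxiv"},
    {"tool": "fetch_url", "args": "https://example.org/paper"},
    {"tool": None},  # a step with no tool call
]

summary = summarize(run)
print(summary)
```

A non-empty `repeated` list is a quick signal that the agent is looping on the same call.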

Step 5: Failure Analysis and Hardening (15 minutes)

The agent will quite likely fail on its first run. Common failure modes and countermeasures:

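# Problem 1: links returned by web_search are blocked by the site
# → add a User-Agent header in fetch_url (already done), or switch to the arXiv API

# Problem 2: the agent repeats the same search
# → add to the system prompt: "Do not repeat the same search"
# → or add an LRU cache inside the tool

# Problem 3: the agent stuffs very long HTML into the context, causing truncation
# → fetch_url already truncates to 8000 characters; a better approach is to summarize first, then add the summary to the context

# Problem 4: the LLM uses python_calc for unnecessary work (such as string concatenation)
# → state explicitly in the docstring: "for numeric calculations only"

# Hardened system prompt fragment:
system = """
You are a research assistant agent. Rules:
1. Use at most 8 steps per task
2. Always call python_calc for numeric calculations; do not do mental arithmetic
3. After fetching a web page, extract the information you need right away; do not keep the whole page verbatim
4. When the same search returns the same results, change the keywords or the URL
5. After completing all steps, you must call write_file to save the result
"""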

Run the agent again and compare trajectory length, number of tool calls, and final answer quality.
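One of the countermeasures for repeated searches is caching inside the tool. A minimal sketch with `functools.lru_cache` follows. `slow_search` is a hypothetical stand-in for the real search backend so the example runs offline, and `call_log` exists only to show how many real calls were made.

```python
from functools import lru_cache

call_log: list[str] = []


def slow_search(query: str) -> str:
    """Stand-in for a real search backend (hypothetical); records each real call."""
    call_log.append(query)
    return f"results for {query!r}"


@lru_cache(maxsize=128)
def cached_search(query: str, max_results: int = 5) -> str:
    """Identical (query, max_results) calls are answered from the cache."""
    return slow_search(query)


cached_search("react agent paper")
cached_search("react agent paper")   # served from the cache
cached_search("toolformer paper")

print(len(call_log), cached_search.cache_info().hits)
```

A cache saves the network request but not the LLM step that issued it, so it complements the system prompt rule rather than replacing it.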

Step 6: Evaluation and Delivery (10 minutes)

Evaluate your agent quantitatively:

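# Prepare 3 tasks of different difficulty
tasks = [
    "Find the abstract of the GPT-3 paper and write it to /tmp/t1.md",                          # easy
    "Find the LLaMA paper and count how many of its references have more than 5000 citations",   # medium
    "Compare the ReAct and Toolformer papers and write a 300-word comparison report",            # hard
]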

results = []
for t in tasks:
    try:
        ans = agent.run(t)
        results.append({"task": t, "success": True, "answer": ans})
    except Exception as e:
        results.append({"task": t, "success": False, "error": str(e)})

success_rate = sum(r["success"] for r in results) / len(results)
print(f"Success rate: {success_rate:.0%}")
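The loop above counts a run as successful whenever `agent.run` returns without raising, which can overstate the success rate: an agent can finish normally with an empty or wrong answer. A stricter check inspects the output itself. The sketch below is one possible check for tasks that write a report file; `check_report`, the keyword list, and the length threshold are assumptions to adapt per task.

```python
import os
import tempfile


def check_report(path: str, required_keywords: list[str], min_chars: int = 50) -> bool:
    """Pass only if the file exists, is long enough, and mentions every keyword."""
    if not os.path.isfile(path):
        return False
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if len(text) < min_chars:
        return False
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in required_keywords)


# Demo with a throwaway file standing in for the agent's output
with tempfile.TemporaryDirectory() as tmp:
    report = os.path.join(tmp, "report.md")
    with open(report, "w", encoding="utf-8") as f:
        f.write("# Report\n\nAbstract: placeholder text\n\nAverage year: 2015\n")
    ok = check_report(report, ["abstract", "average year"], min_chars=20)
    missing = check_report(os.path.join(tmp, "nope.md"), ["abstract"])

print(ok, missing)
```

In the evaluation loop you could then set `success` from a check like this instead of hard-coding `True`.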

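Reflection questions (include your answers in the deliverable report):

Which type of task does your agent fail on most often, and why?
If you could add only one tool to make it stronger, what would you add?
Compared with a pure-code pipeline doing the same task, what irreplaceable advantages does the agent have?
If you had to deploy this agent to production, which three problems would worry you most?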

Bonus: Turn It into an MCP Server

Wrap your toolset as an MCP server so that Claude Desktop / Cursor can call it directly:

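# mcp_server.py
# The tool functions from Step 2 must be importable here,
# for example `from agent import web_search` if they live in agent.py.
from mcp.server import Server
import mcp.types as types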

server = Server("research-agent-tools")

@server.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(name="web_search", description="...", inputSchema={...}),
        # remaining tools
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "web_search":
        result = web_search(arguments["query"])
        return [types.TextContent(type="text", text=result)]
    # ...

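async def main():
    # stdio_server() is an async context manager that yields the read/write streams
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream, server.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    from mcp.server.stdio import stdio_server
    asyncio.run(main())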

Then register it in Claude Desktop's claude_desktop_config.json:

{
  "mcpServers": {
    "research-agent": {
      "command": "python",
      "args": ["/absolute/path/to/mcp_server.py"]
    }
  }
}

This turns your tools into an MCP component that anyone can use, which is exactly the value of MCP as the "USB-C of AI".

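Deliverables Checklist

Agent code: agent.py, containing the 5 tool definitions and the agent initialization
Trajectory logs: complete Thought/Action/Observation records for at least 3 runs
Generated markdown report: /tmp/paper_report.md
Evaluation results: success rate on the 3 tasks of different difficulty
A 1-page analysis report that answers:
- Common planning failure modes (compare against the "Common root causes of planning failures" table in Section 7.3)
- Bugs encountered in tool selection and argument passing
- Trade-offs between a single agent with tools and a multi-agent setup on this task
(Bonus) MCP server implementation + a screenshot of Claude Desktop calling it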

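Time estimate:

Environment + tool definitions: ~25 minutes
First successful run + debugging: ~30 minutes
Failure analysis + hardening: ~20 minutes
Evaluation and report: ~15 minutes
Total: about 90 minutes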
