LangGraph Error Handling and Fault-Tolerant Design: 5 Strategies to Keep an AI Agent System from Crashing

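0. Series loop (readable without access to the source code)

End-to-end path: Vue frontend → api/routes/chat.py → Guide multi-turn SSE → run_analysis_pipeline (parse → analyze → match → report) → tools/pdf_exporter PDF.
This article: part 7 of 17 · fault-tolerance loop · no crashes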

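Stage	User sees	Code entry point	Related articles
Create session	Welcome message	POST /api/sessions	09
Multi-turn chat	SSE streaming	chat/stream → run_guide_single_turn	06, 14
Enough information	Analysis starts	_run_analysis_background	05, 07
Resume parsing	Progress 30%	run_resume_parser	12
Profile/RIASEC	Progress 50%	run_profile_analyzer	03, 13
Career matching	Progress 70%	run_career_matcher	02
Report	Progress 90%	run_reporter	11
Download PDF	File	GET …/report/pdf	11, 15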

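	Notes
Before reading	Article 05 (routing), article 08 (LLM calls)
After reading	You can list the degradation chain used when Ollama is unavailable
Next step	Article 09: background task _run_analysis_background (article 8)

Full series loop index: SERIES-LOOP.md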

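1. The problem to solve

The top-level iCan workflow chains 5 LLM-dependent nodes (Guide → ResumeParser → ProfileAnalyzer → CareerMatcher → Reporter). If any step times out, returns invalid JSON, or hits an Ollama/cloud API outage, then without isolation the whole analysis fails with a 500 and the conversation the user has already filled in is wasted.

The project handles faults at three layers:

Before the call: check_ollama_available in llm/providers.py probes whether the LLM is reachable;
During the call: the 60s timeout in invoke_llm / invoke_llm_with_json, plus multi-strategy JSON extraction in llm/parsers.py;
After the call: a try/except in every workflow.py node, plus per-stage catches in run_analysis_pipeline and the rule-engine fallback _generate_fallback_report.

Figure: three-layer fault-tolerance architecture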

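2. Strategy 1: health check + 30-second cache

Before running the four analysis Agents, run_analysis_pipeline first calls check_ollama_available() (the name is a historical leftover; it actually probes the OpenAI-compatible /chat/completions endpoint at settings.LLM_BASE_URL and is not limited to Ollama).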

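# llm/providers.py
import time as _time  # alias implied by the _time.time() call below

# Module-level probe cache; "available" starts as True until a probe says otherwise
_ollama_cache = {"available": True, "last_check": 0}

async def check_ollama_available() -> bool:
    now = _time.time()
    # Within 30 seconds of the last probe, answer from the cache
    if now - _ollama_cache["last_check"] < 30:
        return _ollama_cache["available"]

    _ollama_cache["last_check"] = now
    base_url = settings.LLM_BASE_URL.rstrip("/")
    # ... (elided in the original excerpt)
    # A minimal chat completion serves as the liveness probe
    resp = await client.post(
        f"{base_url}/chat/completions",
        json={
            "model": settings.LLM_MODEL_CHAT,
            "messages": [{"role": "user", "content": "hi"}],
            "max_tokens": 5,
        },
    )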

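Design points:

30-second cache: avoids firing a probe request for every session, which keeps latency and quota cost down;
max_tokens=5: minimizes the cost of each probe;
Failures are cached as False: for the next 30 seconds the pipeline degrades immediately instead of waiting on repeated timeouts.

When the LLM is unavailable, run_analysis_pipeline in workflow.py skips the four LLM Agents, falls back to _regex_quick_profile + _generate_fallback_report, and marks ollama_unavailable: True in the DB.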
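The caching behavior described above can be sketched independently of any HTTP client. This is a minimal illustration, not the iCan implementation: the class name CachedHealthCheck, the injectable probe and clock parameters, and the use of time.monotonic are all assumptions made so the example is self-contained and testable.

```python
import time
from typing import Awaitable, Callable, Optional

CACHE_TTL_SECONDS = 30  # same window as the article's health check


class CachedHealthCheck:
    """Caches the result of an async probe for a fixed time window (hypothetical helper)."""

    def __init__(
        self,
        probe: Callable[[], Awaitable[bool]],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl
        self._clock = clock
        self._available = True                    # optimistic default, as in the article
        self._last_check: Optional[float] = None  # None means "never probed"

    async def is_available(self) -> bool:
        now = self._clock()
        # Inside the TTL window, answer from the cache without touching the network
        if self._last_check is not None and now - self._last_check < self._ttl:
            return self._available
        self._last_check = now
        try:
            self._available = bool(await self._probe())
        except Exception:
            # A failed probe is cached as False, so callers degrade quickly
            self._available = False
        return self._available
```

Injecting the clock makes the 30-second window easy to exercise in tests without sleeping; a real probe would wrap the /chat/completions request shown earlier.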

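3. Strategy 2: hard timeout with `asyncio.wait_for`

invoke_llm in llm/providers.py wraps every Chat call in a 60-second ceiling:

response = await asyncio.wait_for(model.ainvoke(processed, **kwargs), timeout=60)

On timeout it raises TimeoutError("AI 模型响应超时，请稍后重试") (in English: "AI model response timed out, please retry later"). get_chat_model() also sets request_timeout=90 at the HTTP layer; the 60s limit is the application layer cutting off earlier.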
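As a rough sketch of this application-level cutoff: the helper name call_with_timeout and the English error text are illustrative assumptions, and only asyncio.wait_for itself comes from the article.

```python
import asyncio
from typing import Any, Awaitable

LLM_TIMEOUT_SECONDS = 60  # application-level ceiling quoted in the article


async def call_with_timeout(awaitable: Awaitable[Any],
                            timeout: float = LLM_TIMEOUT_SECONDS) -> Any:
    """Await `awaitable`, converting a timeout into a readable TimeoutError."""
    try:
        # wait_for cancels the inner task once the deadline passes
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError as exc:
        # Re-raise with a message that upper layers can show or log
        raise TimeoutError("AI model response timed out, please retry later") from exc
```

Because wait_for cancels the pending call, a stuck request does not keep running in the background after the caller has given up.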

At the API layer, api/routes/chat.py wraps run_guide_chat in another 90-second wait_for, so the user gets a friendlier "please resend in a moment" message instead of a bare 500.
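A possible shape for that outer layer is sketched below. The function name guarded_chat_turn, the "reply" key, and the message text are hypothetical; is_info_sufficient is the flag the article mentions for the guide fallback.

```python
import asyncio
from typing import Any, Awaitable, Dict

API_TIMEOUT_SECONDS = 90  # outer ceiling at the API layer
RETRY_MESSAGE = "The assistant is taking too long to respond. Please resend your message in a moment."


async def guarded_chat_turn(turn: Awaitable[Dict[str, Any]],
                            timeout: float = API_TIMEOUT_SECONDS) -> Dict[str, Any]:
    """Run one chat turn; on any timeout return a friendly payload instead of raising."""
    try:
        return await asyncio.wait_for(turn, timeout=timeout)
    except (asyncio.TimeoutError, TimeoutError):
        # Covers both this layer's deadline and a TimeoutError raised further down
        return {"reply": RETRY_MESSAGE, "is_info_sufficient": False}
```

Returning is_info_sufficient=False keeps a timed-out turn from accidentally starting the analysis pipeline.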

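Empirical ranges (not hard-coded rules): ordinary replies take 2–5s, ProfileAnalyzer 10–30s, and Reporter section generation can take 30–50s; anything over 60s is treated as an error.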

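4. Strategy 3: four-level JSON fallback parsing

Structured Agents (ResumeParser, CareerMatcher, etc.) go through invoke_llm_with_json: it first tries response_format=json_object, falls back to plain text if that is not supported, and then applies parse_json_from_text from llm/parsers.py:

Strategy 1: ```json ... ``` code block
  ↓ fails
Strategy 2: plain ``` ... ``` block (starting with { or [)
  ↓ fails
Strategy 3: regex match on the outermost { ... }
  ↓ fails
Strategy 4: json.loads on the full text
  ↓ fails
Return {} (no exception raised)

parse_json_from_text catches every JSONDecodeError and returns {}, so upstream callers always get a usable value. invoke_llm_with_json still raises ValueError if the result is {}: that covers the "the business logic must have JSON" case, which is a different responsibility from the parser's "extract whatever you can".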
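The four strategies can be approximated as follows. This is a sketch written from the description above, not the code in llm/parsers.py: the names parse_json_sketch and require_json, the exact regexes, and the choice to accept only dict or list results are assumptions.

```python
import json
import re
from typing import Any, List

FENCE = "`" * 3  # a Markdown code fence, built at runtime so this example stays fence-safe

_JSON_FENCE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
_ANY_FENCE = re.compile(FENCE + r"\s*(.*?)" + FENCE, re.DOTALL)
_OUTER_OBJECT = re.compile(r"\{.*\}", re.DOTALL)


def parse_json_sketch(text: str) -> Any:
    """Try four extraction strategies in order; return {} if all of them fail."""
    candidates: List[str] = []

    match = _JSON_FENCE.search(text)       # strategy 1: fenced block tagged json
    if match:
        candidates.append(match.group(1).strip())

    match = _ANY_FENCE.search(text)        # strategy 2: plain fenced block
    if match and match.group(1).strip()[:1] in ("{", "["):
        candidates.append(match.group(1).strip())

    match = _OUTER_OBJECT.search(text)     # strategy 3: first "{" through last "}"
    if match:
        candidates.append(match.group(0))

    candidates.append(text.strip())        # strategy 4: the whole text

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue                       # fall through to the next strategy
        if isinstance(parsed, (dict, list)):
            return parsed
    return {}                              # never raises


def require_json(text: str) -> Any:
    """Stricter wrapper: callers that must have JSON get a ValueError on empty results."""
    parsed = parse_json_sketch(text)
    if not parsed:
        raise ValueError("LLM response contained no usable JSON")
    return parsed
```

Keeping the lenient parser and the strict wrapper separate mirrors the split between parse_json_from_text and invoke_llm_with_json described above.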

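5. Strategy 4: node-level exception isolation

Each of the five top-level nodes in workflow.py has its own try/except. On failure a node does not raise; it writes safe default values so LangGraph can keep going (or at least return a displayable state):

Node	Returned on exception
`guide_node`	original `conversation_history` preserved, `needs_more_info=True`
`resume_parser_node`	`structured_profile={}`
`profile_analyzer_node`	`personal_profile={}`
`career_matcher_node`	`career_matches=[]`
`reporter_node`	fixed Markdown failure message

reporter_node fallback example: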

except Exception as e:
    logger.error("[reporter_node] 报告输出节点执行异常: %s", e, exc_info=True)
    return {
        "final_report": "# iCan 职业规划报告\n\n报告生成失败，请稍后重试。",
        "current_agent": "reporter",
        "workflow_messages": [f"报告输出节点异常: {str(e)}"],
    }

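For comparison: without isolation, a Reporter exception fails the whole graph ainvoke and the CLI/API returns 500; with isolation, the user at least sees a failure notice or partial sections.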
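The article's nodes each spell out their own try/except. As a variation on the same idea, the pattern can be factored into a decorator; the sketch below is that alternative, not the project's code, and the name isolated_node plus the demo node body are hypothetical.

```python
import copy
import functools
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger(__name__)

State = Dict[str, Any]
NodeFn = Callable[[State], Awaitable[State]]


def isolated_node(name: str, defaults: State) -> Callable[[NodeFn], NodeFn]:
    """Wrap an async node so exceptions become safe default state updates."""

    def decorator(fn: NodeFn) -> NodeFn:
        @functools.wraps(fn)
        async def wrapper(state: State) -> State:
            try:
                return await fn(state)
            except Exception as exc:
                logger.error("[%s] node failed: %s", name, exc, exc_info=True)
                # deepcopy so mutable defaults are never shared between calls
                update = copy.deepcopy(defaults)
                update["current_agent"] = name
                update["workflow_messages"] = [f"{name} node error: {exc}"]
                return update

        return wrapper

    return decorator


@isolated_node("career_matcher", {"career_matches": []})
async def career_matcher_node(state: State) -> State:
    # Stand-in for a real LLM call that fails
    raise RuntimeError("LLM returned invalid JSON")
```

The trade-off is less repetition versus less room for node-specific fallbacks such as reporter_node's custom Markdown text.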

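When route_after_guide hits an exception it returns resume_parser_node. This is "fail-open progress" at the routing layer, in contrast to the guide node's fail-closed behavior (keep asking for information): the routing layer is more afraid of infinite loops.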
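A minimal sketch of a fail-open router, assuming the state carries a needs_more_info flag and that "guide_node" is the label for looping back (both the function name and that label are illustrative):

```python
from typing import Any, Dict


def route_after_guide_sketch(state: Dict[str, Any]) -> str:
    """Pick the next node; any error while deciding pushes the flow forward."""
    try:
        # A missing key or a malformed state raises here
        return "guide_node" if state["needs_more_info"] else "resume_parser_node"
    except Exception:
        # Fail open: moving on is safer than risking another trip around the loop
        return "resume_parser_node"
```

A router that failed closed here could send a broken state back to the guide node indefinitely, which is exactly the loop the article wants to avoid.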

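6. Strategy 5: per-stage fault tolerance in `run_analysis_pipeline`

In production, report generation mainly goes through run_analysis_pipeline (called from api/routes/chat.py, upload.py, and report_gen.py) and bypasses the guide loop of the top-level LangGraph. Its approach is "an independent try per stage; on failure, continue with empty data":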

# workflow.py (simplified flow)
try:
    parser_result = await run_resume_parser(parser_state)
    structured_profile = parser_result.get("structured_profile", {})
except Exception as parser_err:
    structured_profile = {}

if not structured_profile:
    structured_profile = {"basic_info": {"raw_text": combined_text[:500], "source": "fallback"}}

try:
    analyzer_result = await run_profile_analyzer(analyzer_state)
except Exception as analyzer_err:
    analyzer_result = {}

# matcher and reporter follow the same pattern...

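When the Reporter stage fails, the pipeline does not paper over it with an empty string. It assembles a Markdown document containing a JSON summary of personal_profile and appends reporter_err at the end, so operators can cross-check against the logs.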
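That fallback can be expressed as a small pure function. The sketch below follows the description above with English wording; the function name build_reporter_fallback and the heading text are assumptions.

```python
import json
from typing import Any, Dict

SUMMARY_LIMIT = 2000  # cap on the embedded JSON summary, in characters


def build_reporter_fallback(personal_profile: Dict[str, Any], error: Exception) -> str:
    """Build a degraded Markdown report that still shows what was collected."""
    # default=str keeps non-JSON values (dates, custom objects) from raising
    summary = json.dumps(personal_profile, ensure_ascii=False, default=str)[:SUMMARY_LIMIT]
    return (
        "# Career Planning Report\n\n"
        "A problem occurred while generating the report.\n\n"
        "## Profile summary\n\n"
        f"{summary}\n\n"
        f"*Full report generation failed: {error}*"
    )
```

Truncating the summary bounds the report size even when the profile is large, and ensure_ascii=False keeps non-ASCII profile text readable.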

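When the LLM is completely unavailable, the whole LLM chain is skipped and _generate_fallback_report outputs a rule-engine report carrying a ⚠️ notice (the text reads "Note: the AI model is currently unavailable; this report was generated quickly by the rule engine"):

sections.append("> ⚠️ 注意：AI 模型暂不可用，本报告基于规则引擎快速生成。")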

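There is still an outermost catch-all: it logs, notifies the frontend via ws_manager.send_error, and then re-raises. That path is for DB/session-level disasters, not single-Agent failures.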
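The log, notify, re-raise order might look like the sketch below. The signature of ws_manager.send_error is not shown in the article, so the example takes a generic notify_error callback instead; that callback, its (session_id, message) arguments, and the function name run_with_outer_guard are all assumptions.

```python
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger(__name__)


async def run_with_outer_guard(
    pipeline: Callable[[], Awaitable[Any]],
    notify_error: Callable[[str, str], Awaitable[None]],
    session_id: str,
) -> Any:
    """Last line of defense: log, tell the frontend, then let the error propagate."""
    try:
        return await pipeline()
    except Exception as exc:
        logger.error("[pipeline] unrecoverable failure for %s: %s",
                     session_id, exc, exc_info=True)
        try:
            await notify_error(session_id, str(exc))
        except Exception:
            # A broken notification channel must not mask the original error
            logger.warning("could not notify frontend for session %s", session_id)
        raise  # re-raises the original pipeline exception
```

Guarding the notification call is a design choice of this sketch: the caller always sees the pipeline's own exception, not a secondary one from the notifier.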

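7. Interaction with loop limits (article 5)

Fault tolerance also includes guarding against infinite loops (see article 5 for details):

agents/guide.py should_continue: loop_count >= 8;
workflow.py route_after_guide: user_msg_count >= 3;
recursion_limit: 15 for subgraphs, 50 for the full workflow.

Hitting a loop limit is essentially "forced progress": it prevents error + retry from forming a logical infinite loop inside the graph.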
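The two thresholds can be sketched as plain functions. The thresholds come from the list above; the return labels "continue" and "guide_node", the way user messages are counted from conversation_history, and the function names are assumptions for illustration.

```python
from typing import Any, Dict

MAX_GUIDE_LOOPS = 8     # should_continue threshold quoted above
MAX_USER_MESSAGES = 3   # route_after_guide threshold quoted above


def should_continue_sketch(state: Dict[str, Any]) -> str:
    """Inside the guide subgraph: stop looping once the loop budget is spent."""
    if state.get("loop_count", 0) >= MAX_GUIDE_LOOPS:
        return "handoff"  # forced progress, even if information is still missing
    return "continue" if state.get("needs_more_info", True) else "handoff"


def route_with_message_cap(state: Dict[str, Any]) -> str:
    """In the top-level workflow: enough user turns means moving on to parsing."""
    history = state.get("conversation_history", [])
    user_msg_count = sum(1 for msg in history if msg.get("role") == "user")
    if user_msg_count >= MAX_USER_MESSAGES:
        return "resume_parser_node"
    return "guide_node" if state.get("needs_more_info", True) else "resume_parser_node"
```

recursion_limit is a separate backstop: LangGraph reads it from the config passed at invocation (for example config={"recursion_limit": 50}) and aborts the run when the step count exceeds it.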

8. Fault-tolerance layers at a glance

Request enters run_analysis_pipeline / run_workflow
  ↓
[1] check_ollama_available → unavailable → _regex_quick_profile + _generate_fallback_report
  ↓
[2] invoke_llm wait_for 60s → TimeoutError → caught at the node/API layer
  ↓
[3] parse_json_from_text, four levels → failure → {}
  ↓
[4] try/except in each workflow node → safe default values
  ↓
[5] per-stage try in the pipeline → continue with empty dict/list + reporter summary fallback
  ↓
[6] loop limits/recursion_limit → forced handoff / resume_parser
  ↓
Return final_report (full, partial, or rule-engine version)

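9. Pitfalls and edge cases

The name check_ollama_available is misleading
It probes the current LLM_BASE_URL (which may be DeepSeek, OpenAI, or Ollama), not Ollama specifically. After .env is switched to a cloud provider, Ollama can be down while the cloud is healthy, and True/False is still cached from the cloud result.
The health check defaults to _ollama_cache["available"] = True
Right after process start, before any probe has run, the first pipeline assumes the LLM is available; if it is actually down, False is only cached once the first POST fails. For high-availability setups, consider a warm-up probe at startup.
Node isolation with "continue on an empty dict" produces thin reports
After profile_analyzer fails, most personal_profile fields are empty, yet Reporter still runs. The user gets "a report, but a hollow one": better than a 500, but the frontend should flag it using workflow_messages or a progress hint.
run_guide_chat exceptions have their own fallback
It returns the canned line 「抱歉，处理出了点问题，能再说一次吗？」 ("Sorry, something went wrong. Could you say that again?") with is_info_sufficient=False, so run_analysis_pipeline is not triggered by mistake.
Reporter section generation uses get_chat_model()
Its role differs from get_light_model(); do not assume from old comments that Reporter has already switched to the mini model (see the call table in article 8).
On the fault-tolerance paths Reporter is still likely the slowest stage and the most prone to timeouts; the rule-engine fallback only covers "the whole LLM is unavailable", not "only Reporter timed out".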
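For the "thin report" pitfall, one way to give the frontend something concrete to flag is to count how many analyzer-derived fields came back empty. The field names below are the ones the pipeline assembles into personal_profile; the helper names and the 50% threshold are hypothetical.

```python
from typing import Any, Dict, List

# Analyzer-derived fields of personal_profile, as assembled by the pipeline
PROFILE_FIELDS = (
    "ability_model", "work_style", "personality_traits", "career_values",
    "riasec_scores", "strengths", "weaknesses", "overall_summary",
)


def empty_profile_fields(personal_profile: Dict[str, Any]) -> List[str]:
    """Return the names of fields that are missing or empty ({}, [], "", None)."""
    return [name for name in PROFILE_FIELDS if not personal_profile.get(name)]


def is_thin_report(personal_profile: Dict[str, Any], threshold: float = 0.5) -> bool:
    """Flag a report as thin when at least `threshold` of the fields are empty."""
    missing = empty_profile_fields(personal_profile)
    return len(missing) / len(PROFILE_FIELDS) >= threshold
```

A flag like this could travel alongside workflow_messages so the UI can label the report as partial instead of presenting it as complete.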

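10. Summary

Before the call: a cached health check in llm/providers.py; when the LLM is unavailable, the rule engine in workflow.py produces the report.
During the call: a 60s timeout + multi-strategy JSON extraction in llm/parsers.py.
After the call: each of the five workflow nodes is isolated; run_analysis_pipeline catches per stage, and a Reporter failure still yields a summary version.
The goal is not "never fail" but failures that are visible, degradable, and unable to drag down the whole graph.
The next article (article 8) covers get_chat_model / get_light_model and the unified LLM call interface.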

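Appendix: key source code (annotated line by line)

The code below is excerpted from the iCan implementation. Every line has a comment above it, so you can follow along without access to the repository.
Generation command: python3 bin/build-ican-annotated-snippets.py

guide_node exception return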

# ========== guide_node exception return ==========
# Source file: workflow.py  lines 107-114

# L107: catch the exception so the whole graph/request does not crash
    except Exception as e:
# L108: log the error with its traceback for production debugging
        logger.error("[guide_node] 对话引导节点执行异常: %s", e, exc_info=True)
# L109: return the fields this node merges into state (LangGraph merges them)
        return {
# L110: multi-turn conversation list; each element is {role, content}
            "conversation_history": state.get("conversation_history", []),
# L111: record which agent produced this state update
            "current_agent": "guide",
# L112: whether to stay in the Guide loop; False means resume_parser can start
            "needs_more_info": True,
# L113: surface the error text through workflow_messages
            "workflow_messages": [f"对话引导节点异常: {str(e)}"],
# L114: end of the returned dict
        }

Ollama unavailable → rule-based report

# ========== Ollama unavailable → rule-based report ==========
# Source file: workflow.py  lines 734-767

# L734: import the health check
        from ican.llm.providers import check_ollama_available
# L735: probe the LLM service; on failure, degrade to the rule-engine report
        ollama_ok = await check_ollama_available()
# L736: degraded branch
        if not ollama_ok:
# L737: log the degradation (LLM unavailable, using the quick rule engine)
            logger.warning("[run_analysis_pipeline] Ollama 不可用，使用快速规则引擎生成报告")
# L738: build a quick profile from the raw text with regexes
            structured_profile = _regex_quick_profile(combined_text)
# L739: render the rule-engine fallback report
            final_report = _generate_fallback_report(structured_profile, combined_text)
# L740: log the length of the generated report
            logger.info("[run_analysis_pipeline] 快速报告生成完成，长度=%d", len(final_report))
# L741: outer try: a DB failure must not lose the report
            try:
# L742: import the DB session factory
                from ican.db.session import get_db_session
# L743: import the session repository
                from ican.db.repository import SessionRepository
# L744: take a DB session from the generator
                db = next(get_db_session())
# L745: inner try so that finally can close the session
                try:
# L746: repository bound to this DB session
                    repo = SessionRepository(db)
# L747: persist the session record
                    repo.save_session(
# L748: session being analyzed
                        session_id=session_id,
# L749: fall back to "system" when there is no user_id
                        user_id=user_id or "system",
# L750: mark the session as completed
                        status="completed",
# L751: the current stage is the report
                        current_stage="report",
# L752: JSON field: stores conversation history, intermediate results, final_report, etc.
                        workflow_data={
# L753: quick regex-based profile
                            "structured_profile": structured_profile,
# L754: rule-engine report text
                            "final_report": final_report,
# L755: flag that this report came from the degraded path
                            "ollama_unavailable": True,
# L756: end of workflow_data
                        },
# L757: end of the save_session call
                    )
# L758: cleanup that runs whether or not the save succeeded
                finally:
# L759: close the DB session
                    db.close()
# L760: catch DB errors so the report is still returned
            except Exception as db_err:
# L761: log the save failure
                logger.error("[run_analysis_pipeline] 保存快速报告失败: %s", db_err)
# L762: return the degraded result to the caller
            return {
# L763: quick regex-based profile
                "structured_profile": structured_profile,
# L764: no LLM analysis ran, so the personal profile is empty
                "personal_profile": {},
# L765: no LLM matching ran, so there are no career matches
                "career_matches": [],
# L766: rule-engine report text
                "final_report": final_report,
# L767: end of the returned dict
            }

Per-stage try/except in the pipeline

# ========== Per-stage try/except in the pipeline ==========
# Source file: workflow.py  lines 769-818

# L769: input state for the resume parser
        parser_state = {"raw_input": combined_text, "input_type": "text"}
# L770: stage 1 try; the except below provides the fallback
        try:
# L771: run the resume parser
            parser_result = await run_resume_parser(parser_state)
# L772: read the structured profile, defaulting to {}
            structured_profile = parser_result.get("structured_profile", {})
# L773: catch the exception so the whole request does not crash
        except Exception as parser_err:
# L774: log the parser failure and continue with empty data
            logger.error("[run_analysis_pipeline] 简历解析失败，使用空数据继续: %s", parser_err)
# L775: empty profile as the safe default
            structured_profile = {}

# L777: the parser failed or returned nothing
        if not structured_profile or len(structured_profile) == 0:
# L778: log that a basic profile is being built from the raw text
            logger.warning("[run_analysis_pipeline] 结构化画像为空，尝试从原始文本构建基础数据")
# L779: minimal profile: first 500 characters of the raw text, marked as fallback
            structured_profile = {"basic_info": {"raw_text": combined_text[:500], "source": "fallback"}}

# L781: stage 2 try; the except below provides the fallback
        try:
# L782: input state for the profile analyzer
            analyzer_state = {"structured_profile": structured_profile}
# L783: run the profile analyzer
            analyzer_result = await run_profile_analyzer(analyzer_state)
# L784: catch the exception so the whole request does not crash
        except Exception as analyzer_err:
# L785: log the analyzer failure and continue with empty data
            logger.error("[run_analysis_pipeline] 个人分析失败，使用空数据继续: %s", analyzer_err)
# L786: empty result as the safe default
            analyzer_result = {}

# L788: assemble personal_profile; every field has an empty default
        personal_profile = {
# L789: structured profile from stage 1
            "structured_profile": structured_profile,
# L790: ability model, {} if missing
            "ability_model": analyzer_result.get("ability_model", {}),
# L791: work style, {} if missing
            "work_style": analyzer_result.get("work_style", {}),
# L792: personality traits, {} if missing
            "personality_traits": analyzer_result.get("personality_traits", {}),
# L793: career values, {} if missing
            "career_values": analyzer_result.get("career_values", {}),
# L794: RIASEC scores, {} if missing
            "riasec_scores": analyzer_result.get("riasec_scores", {}),
# L795: strengths, [] if missing
            "strengths": analyzer_result.get("strengths", []),
# L796: weaknesses, [] if missing
            "weaknesses": analyzer_result.get("weaknesses", []),
# L797: overall summary read from the nested structured_profile, "" if missing
            "overall_summary": analyzer_result.get("structured_profile", {}).get("overall_summary", ""),
# L798: end of personal_profile
        }

# L800: stage 3 try; the except below provides the fallback
        try:
# L801: input state for the career matcher
            matcher_state = {"personal_profile": personal_profile}
# L802: run the career matcher
            matcher_result = await run_career_matcher(matcher_state)
# L803: read the recommended paths, defaulting to []
            career_matches = matcher_result.get("recommended_paths", [])
# L804: catch the exception so the whole request does not crash
        except Exception as matcher_err:
# L805: log the matcher failure and continue with empty data
            logger.error("[run_analysis_pipeline] 职业匹配失败，使用空数据继续: %s", matcher_err)
# L806: empty list as the safe default
            career_matches = []

# L808: input state for the reporter
        reporter_state = {
# L809: profile assembled above
            "personal_profile": personal_profile,
# L810: career matches from stage 3
            "career_matches": career_matches,
# L811: action plan, passed in empty
            "action_plan": {},
# L812: end of reporter_state
        }
# L813: stage 4 try; the except below provides the fallback
        try:
# L814: run the reporter
            reporter_result = await run_reporter(reporter_state)
# L815: read the final report, defaulting to ""
            final_report = reporter_result.get("final_report", "")
# L816: catch the exception so the whole request does not crash
        except Exception as reporter_err:
# L817: log the reporter failure
            logger.error("[run_analysis_pipeline] 报告生成失败: %s", reporter_err)
# L818: fallback Markdown: profile JSON summary (first 2000 characters) plus the error text
            final_report = f"# 职业规划报告\n\n基于您的简历分析，报告生成过程中遇到问题。\n\n## 个人画像摘要\n\n{json.dumps(personal_profile, ensure_ascii=False, default=str)[:2000]}\n\n*完整报告生成失败: {reporter_err}*"

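Series navigation

Part	Topic
1	System overview
2	Five-Agent collaboration
3	Holland RIASEC
4–7	State · Routing · Nesting · Fault tolerance (part 7, this article)
8–11	LLM layer · SSE/WS · DB migration · PDF
12–14	JSON Prompt · RIASEC Prompt · Guide Prompt
15–17	Docker · Middleware · Configuration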
