Qwen3.5 LoRA 踩坑：thinking 链路与 flash-attn 配置

0. 系列闭环

本篇位置	上游	本篇产出	下游
第 9/10 篇	第 08 篇推理验证	Qwen3.5 特有问题清单	第 10 篇 vLLM 请求参数

本篇问题不是 LoRA 特有，但会在「验证像失败」「API 输出英文」时误判为「微调无效」。

1. 要解决的实际问题

Qwen3.5 相比 Qwen2 系列多了 Thinking Mode：模型可先在内部通道输出推理过程，再给最终回复。

在老年陪伴场景：

用户要中文共情短句
思考链常为英文、冗长、像 debug 输出
若不过滤，产品完全不可用

另：all_logs.log 与 Mac 验证均出现 flash-linear-attention 未安装 警告，需知是否阻断训练。

2. 实现位置

问题	代码/日志位置
关 thinking（本地）	`verify_lora.py` 第 159–164 行 `enable_thinking=False`
剥 thinking 残留	`verify_lora.py` 第 137–149 行 `extract_final_reply`
flash-attn 警告	`all_logs.log` 第 27 行；Mac 验证同样出现
关 thinking（vLLM）	`README.md` 第 213–233 行 `chat_template_kwargs`
trust_remote_code	`train_lora_single.py` 第 161、180 行

3. 坑 1：Thinking Mode 未关闭

现象

generate 或 vLLM 返回大段：

1 2	`Thinking Process: The user feels lonely. I should respond empathetically...`

或仅有英文，无中文陪伴句。

根因

Qwen3.5 chat_template 默认 enable_thinking=True（推理路径）。

修复（本地）

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

修复（vLLM）

1	`"chat_template_kwargs": {"enable_thinking": false}`

Python SDK：

1	`extra_body={"chat_template_kwargs": {"enable_thinking": False}}`

兜底

即使关闭，偶发残留。extract_final_reply 按 redacted_thinking 标签截断。训练阶段 apply_chat_template 用完整 messages，不涉及 thinking 生成。

4. 坑 2：flash-linear-attention / causal-conv1d 未安装

日志原文

The fast path is not available because one of the required library is not installed.
Falling back to torch implementation.
To install follow https://github.com/fla-org/flash-linear-attention#installation
and https://github.com/Dao-AILab/causal-conv1d

影响


正确性	✅ 回退 PyTorch，训练与推理均完成
速度	略慢；本项目 V100 41 分钟仍完成 750 step
Mac MPS	同样警告；用户 verify 仍成功

是否必须安装

不必须。 云环境编译失败可跳过。若追求极致吞吐再装：

1	`pip install flash-linear-attention causal-conv1d`

5. 坑 3：tokenizer 与 model config 的 token id 不一致

日志

1 2	`The tokenizer has new PAD/BOS/EOS tokens that differ from the model config... Updated tokens: {'eos_token_id': 248046, 'pad_token_id': 248046}`

处理

train_lora_single.py 第 164 行：

1	`tokenizer.pad_token = tokenizer.eos_token`

Trainer 启动时自动对齐。无需手动改 config.json。

6. 坑 4：Mac MPS + PEFT 的 device_map

现象

使用 device_map="auto" 加载基座再 PeftModel.from_pretrained 在 MPS 上报 device 错配。

修复

verify_lora.py 第 110–134 行：不用 device_map，显式 .to(device)。

训练脚本在 CUDA 上仍用 device_map={"": gpu_id}——验证与训练 device 策略刻意不同，勿照搬。

7. 坑 5：trust_remote_code=False

Qwen3.5 模型类在仓库 Python 文件中。省略 trust_remote_code=True 会：

1	`ValueError: ... does not recognize this architecture`

三处必须一致为 True：

train_lora_single.py 加载 tokenizer/model
verify_lora.py 加载
vLLM 使用本地路径 serve 时模型目录完整

8. 坑 6：open-end generation 的 pad_token_id 警告

验证时可能出现：

1	Setting `pad_token_id` to `eos_token_id` for open-end generation.

Transformers 在 generate 时的提示，与训练侧 pad 设置一致，可忽略。

9. 三端 checklist（训练 / 验证 / 部署）

训练 (train_lora_single.py)
  □ trust_remote_code=True
  □ pad_token = eos_token
  □ add_generation_prompt=False

验证 (verify_lora.py)
  □ enable_thinking=False
  □ add_generation_prompt=True
  □ SYSTEM_PROMPT 与 JSONL 一致
  □ MPS 不用 device_map="auto" + PEFT

部署 (vLLM)
  □ chat_template_kwargs.enable_thinking=false
  □ system 与训练一致
  □ model 名与 --lora-modules 一致（第 10 篇）

10. 小结

Thinking 是 Qwen3.5 验证/部署第一坑，必须显式关闭 + 可选后处理。
flash-attn 警告 只影响速度，实测可训可验。
Mac MPS 验证路径与 CUDA 训练路径 device 策略不同。
遇「微调无效」先查 thinking，再查 system prompt。
下一篇在 vLLM 里落实部署侧参数。

附录：训练 vs 推理 template 对照

参数	train_lora_single.py	verify_lora.py	vLLM
add_generation_prompt	False	True	由 API 自动
enable_thinking	—	False	chat_template_kwargs
messages 含 system	是（JSONL）	是	是

系列导航

篇目	链接
上一篇	08 · 效果验证
下一篇	10 · vLLM 部署
索引	README

← 返回 LoRA 老年陪伴专题