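Training features supported by each model
This section lists, for each model in the AscendFactory solution, the training features that each training framework supports.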
MindSpeed-LLM and llama-factory are the pre-training and fine-tuning frameworks; VeRL and MindSpeed-RL are the reinforcement learning (RL) frameworks. Each column header below is prefixed with the framework it belongs to.

| Model series | Model | MindSpeed-LLM: pre-training, full fine-tuning | MindSpeed-LLM: LoRA fine-tuning | MindSpeed-LLM: multi-sample pack fine-tuning | MindSpeed-LLM: Flash Attention | MindSpeed-LLM: SPTD parallelism (SP, PP, TP, DP) | MindSpeed-LLM: long-sequence parallelism | MindSpeed-LLM: MoE parallelism (expert parallelism, communication reordering optimization) | MindSpeed-LLM: dynamic sequence length | llama-factory: pre-training, LoRA/full fine-tuning | llama-factory: ZeRO parallelism (ZeRO-1, ZeRO-2, ZeRO-3) | llama-factory: Flash Attention | VeRL: GRPO | VeRL: vLLM inference backend | VeRL: FSDP training backend | MindSpeed-RL: GRPO | MindSpeed-RL: vLLM inference backend | MindSpeed-RL: Megatron-LM training backend | MindSpeed-RL: long-sequence parallelism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek series | DeepSeek-R1-671B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSeek series | DeepSeek-V3-671B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSeek series | DeepSeek-V2-Lite 16B | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen2 series | Qwen2-0.5B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen2 series | Qwen2-1.5B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen2 series | Qwen2-7B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen2 series | Qwen2-72B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen2.5 series | Qwen2.5-0.5B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen2.5 series | Qwen2.5-1.5B | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5 series | Qwen2.5-7B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5 series | Qwen2.5-14B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen2.5 series | Qwen2.5-32B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5 series | Qwen2.5-72B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen3 series | Qwen3-0.6B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Qwen3 series | Qwen3-1.7B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Qwen3 series | Qwen3-4B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Qwen3 series | Qwen3-8B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Qwen3 series | Qwen3-14B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Qwen3 series | Qwen3-32B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Qwen3 series | Qwen3-30B-A3B | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen3 series | Qwen3-235B-A22B | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Llama series | Llama3.1-8B/70B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Llama series | Llama3.2-1B/3B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GLM series | glm-4-9b-chat | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Mixtral series | Mixtral-8x7B-Instruct-v0.1 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Qwen2 VL series | Qwen2-VL-2B | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | NA | NA | NA | NA |
| Qwen2 VL series | Qwen2-VL-7B | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | NA | NA | NA | NA |
| Qwen2 VL series | Qwen2-VL-72B | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | NA | NA | NA | NA |
| Qwen2.5 VL series | Qwen2.5-VL-3B | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | NA | NA | NA | NA |
| Qwen2.5 VL series | Qwen2.5-VL-7B | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | NA | NA | NA | NA |
| Qwen2.5 VL series | Qwen2.5-VL-32B | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | NA | NA | NA | NA |
| Qwen2.5 VL series | Qwen2.5-VL-72B | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | NA | NA | NA | NA |
| InternVL series | InternVL2.5-8B/38B/78B | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | NA | NA | NA | NA |
| Gemma series | Gemma3-27b | NA | NA | NA | NA | NA | NA | NA | NA | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | NA | NA | NA | NA |

- NA means that support is not planned; for example, multimodal models will not be supported by the MindSpeed-LLM training framework.
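
For scripted checks before launching a job, the matrix can be encoded as a small lookup structure. The following is a minimal sketch, not part of AscendFactory: the names `FEATURES`, `ROWS`, `status`, and `frameworks_supporting` are hypothetical, only four rows of the table are transcribed, and reading ❌ as "unsupported" is an assumption (the source defines only NA).

```python
# Feature columns per framework, in the same order as the table.
FEATURES = {
    "MindSpeed-LLM": [
        "pre-training, full fine-tuning",
        "LoRA fine-tuning",
        "multi-sample pack fine-tuning",
        "Flash Attention",
        "SPTD parallelism",
        "long-sequence parallelism",
        "MoE parallelism",
        "dynamic sequence length",
    ],
    "llama-factory": [
        "pre-training, LoRA/full fine-tuning",
        "ZeRO parallelism",
        "Flash Attention",
    ],
    "VeRL": ["GRPO", "vLLM inference backend", "FSDP training backend"],
    "MindSpeed-RL": [
        "GRPO",
        "vLLM inference backend",
        "Megatron-LM training backend",
        "long-sequence parallelism",
    ],
}

# Y = ✅, N = ❌ (assumed to mean unsupported), - = NA (not planned).
STATUS = {"Y": "supported", "N": "unsupported", "-": "not planned"}

# A few rows transcribed from the table; one group of flags per framework.
ROWS = {
    "DeepSeek-V2-Lite 16B": "YNYYYYYY NNN NNN NNNN",
    "Qwen2.5-7B": "YYYYYYYY YYY NNN YYYY",
    "Qwen3-8B": "YYYYYYYY YYY YYY NNNN",
    "Qwen2.5-VL-7B": "-------- YYY YYY ----",
}


def parse_row(encoded):
    """Expand one encoded row into {(framework, feature): status}."""
    groups = encoded.split()
    if len(groups) != len(FEATURES):
        raise ValueError("expected one flag group per framework")
    result = {}
    for (framework, names), flags in zip(FEATURES.items(), groups):
        if len(flags) != len(names):
            raise ValueError(f"wrong number of flags for {framework}")
        for name, flag in zip(names, flags):
            result[(framework, name)] = STATUS[flag]
    return result


MATRIX = {model: parse_row(row) for model, row in ROWS.items()}


def status(model, framework, feature):
    """Return the support status of one table cell."""
    return MATRIX[model][(framework, feature)]


def frameworks_supporting(model, feature):
    """List the frameworks in which `feature` is supported for `model`."""
    return [
        framework
        for (framework, name), cell in MATRIX[model].items()
        if name == feature and cell == "supported"
    ]
```

For example, `frameworks_supporting("Qwen2.5-7B", "GRPO")` returns only `MindSpeed-RL`, matching the table row in which the VeRL columns are ❌ and the MindSpeed-RL columns are ✅.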