Supported Models

Models supported by TurboMind

| Model | Size | FP16/BF16 | KV INT8 | KV INT4 | W4A16 |
| --- | --- | --- | --- | --- | --- |
| Llama | 7B - 65B | Yes | Yes | Yes | Yes |
| Llama2 | 7B - 70B | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | Yes | Yes | Yes | Yes |
| InternLM | 7B - 20B | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | Yes | Yes | Yes | Yes |
| InternLM-XComposer | 7B | Yes | - | - | - |
| InternLM-XComposer2 | 7B, 4khd-7B | Yes | - | - | - |
| QWen | 1.8B - 72B | Yes | Yes | Yes | Yes |
| QWen1.5 | 1.8B - 72B | Yes | Yes | Yes | Yes |
| QWen-VL | 7B | Yes | - | - | - |
| DeepSeek-VL | 7B | Yes | - | - | - |
| Baichuan | 7B | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | Yes | Yes | Yes | Yes |
| Code Llama | 7B - 34B | Yes | Yes | Yes | No |
| YI | 6B - 34B | Yes | Yes | Yes | No |
| LLaVA(1.5,1.6) | 7B - 34B | Yes | - | - | - |
| InternVL-Chat | v1.1 - v1.5 | Yes | - | - | - |
| MiniGeminiLlama | 7B | Yes | No | No | No |

“-” means not verified yet.

Note

The TurboMind engine does not support window attention. For models that use window attention and have the corresponding switch “use_sliding_window” enabled, choose the PyTorch engine for inference.
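
The rule in this note can be sketched as a small check. This is a minimal illustration, not part of LMDeploy: the helper names `uses_sliding_window`, `choose_engine`, and `choose_engine_for_model_dir` are hypothetical, and the sketch assumes the “use_sliding_window” switch is stored as a top-level key in a Hugging Face-style `config.json` inside the model directory. It also assumes a missing key means the switch is off.

```python
import json
from pathlib import Path


def uses_sliding_window(config: dict) -> bool:
    """Return True when the config enables the "use_sliding_window" switch."""
    # Assumption of this sketch: a missing key is treated as "not enabled".
    return bool(config.get("use_sliding_window", False))


def choose_engine(config: dict) -> str:
    """Return "pytorch" when window attention is enabled, else "turbomind"."""
    # "turbomind" here only means the window-attention restriction does not
    # apply; the model must still appear in the TurboMind support table.
    return "pytorch" if uses_sliding_window(config) else "turbomind"


def choose_engine_for_model_dir(model_dir: str) -> str:
    """Load <model_dir>/config.json (assumed layout) and pick an engine name."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        return choose_engine(json.load(f))
```

The returned name is only a label. In practice it would be mapped to whichever engine configuration is passed to the inference entry point, and the model would still need to be checked against the support tables on this page.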

Models supported by PyTorch

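| Model | Size | FP16/BF16 | KV INT8 | W8A8 |
| --- | --- | --- | --- | --- |
| Llama | 7B - 65B | Yes | No | Yes |
| Llama2 | 7B - 70B | Yes | No | Yes |
| Llama3 | 8B, 70B | Yes | No | Yes |
| InternLM | 7B - 20B | Yes | No | Yes |
| InternLM2 | 7B - 20B | Yes | No | - |
| Baichuan2 | 7B - 13B | Yes | No | Yes |
| ChatGLM2 | 6B | Yes | No | No |
| Falcon | 7B - 180B | Yes | No | No |
| YI | 6B - 34B | Yes | No | No |
| Mistral | 7B | Yes | No | No |
| Mixtral | 8x7B | Yes | No | No |
| QWen | 1.8B - 72B | Yes | No | No |
| QWen1.5 | 0.5B - 72B | Yes | No | No |
| QWen1.5-MoE | A2.7B | Yes | No | No |
| DeepSeek-MoE | 16B | Yes | No | No |
| Gemma | 2B - 7B | Yes | No | No |
| Dbrx | 132B | Yes | No | No |
| StarCoder2 | 3B - 15B | Yes | No | No |
| Phi-3-mini | 3.8B | Yes | No | No |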