llm-compressor Support#

This guide aims to introduce how to use LMDeploy’s TurboMind inference engine to run models quantized by the vllm-project/llm-compressor tool.

Currently supported llm-compressor quantization types include:

int4 quantization (e.g., AWQ, GPTQ)

These quantized models can run via the TurboMind engine on the following NVIDIA GPU architectures:

Compute Capability	Micro-architecture	GPUs
7.0	Volta	V100
7.2	Volta	Jetson Xavier
7.5	Turing	GeForce RTX 20 series, T4
8.0	Ampere	A100, A800, A30
8.6	Ampere	GeForce RTX 30 series, A40, A10
8.7	Ampere	Jetson Orin
8.9	Ada Lovelace	GeForce RTX 40 series, L40, L20
9.0	Hopper	H20, H200, H100, GH200
12.0	Blackwell	GeForce RTX 50 series

LMDeploy will continue to follow up and expand support for the llm-compressor project.

The remainder of this document consists of the following sections:

Model Quantization
Model Deployment
Accuracy Evaluation

Model Quantization#

llm-compressor provides a wealth of model quantization examples. Please refer to its tutorials to select a quantization algorithm supported by LMDeploy to complete your model quantization work.

LMDeploy also provides a built-in script for AWQ quantization of Qwen3-30B-A3B using llm-compressor for your reference:

# Create conda environment
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy

# Install llm-compressor
pip install llmcompressor

# Clone lmdeploy source code and run the quantization example
git clone https://github.com/InternLM/lmdeploy
cd lmdeploy
python examples/lite/qwen3_30b_a3b_awq.py --work-dir ./qwen3_30b_a3b_awq

In the following sections, we will use this quantized model as an example to introduce model deployment and accuracy evaluation methods.

Model Deployment#

Offline Inference#

With the quantized model, offline batch processing can be implemented with just a few lines of code:

from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig()
with pipeline("./qwen3_30b_a3b_4bit", backend_config=engine_config) as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)

For a detailed introduction to the pipeline, please refer to here.

Online Serving#

LMDeploy api_server supports encapsulating the model as a service with a single command. The provided RESTful APIs are compatible with OpenAI interfaces. Below is an example of starting the service:

lmdeploy serve api_server ./qwen3_30b_a3b_4bit --backend turbomind

The default service port is 23333. After the server starts, you can access the service via the OpenAI SDK. For command arguments and methods to access the service, please read this document.

Accuracy Evaluation#

Aftering deploying AWQ symmetric/asymmetric quantized models of Qwen3-8B (Dense) and Qwen3-30B-A3B (MoE) as services via LMDeploy, we evaluated their accuracy on several academic datasets using opencompass. Results indicate that, for Qwen3-8B, asymmetric quantization generally outperforms symmetric quantization, while Qwen3-30B-A3B shows no substantial difference between symmetric and asymmetric quantization. Compared with BF16, Qwen3-8B shows a smaller accuracy gap under both symmetric and asymmetric quantization than Qwen3-30B-A3B. Compared with BF16, accuracy drops significantly on long-output datasets such as aime2025 (avg 17,635 tokens) and LCB (avg 14,157 tokens), while on medium/short-output datasets like ifeval (avg 1,885 tokens) and mmlu_pro (avg 2,826 tokens), the accuracy is as expected.

dataset	Qwen3-8B			Qwen3-30B-A3B
	bf16	awq sym	awq asym	bf16	awq sym	awq asym
ifeval	85.58	83.73	85.77	86.32	84.10	84.29
hle	5.05	5.05	5.24	7.00	5.47	5.65
gpqa	59.97	56.57	59.47	61.74	57.95	57.07
aime2025	69.48	64.38	63.96	73.44	64.79	66.67
mmlu_pro	73.69	71.73	72.34	77.85	75.77	75.69
LCBCodeGeneration	50.86	44.10	46.95	56.67	50.86	49.24

For reproduction methods, please refer to this document.

llm-compressor Support

Contents