Command-line Tools#
lmdeploy - CLI interface#
The CLI provides a unified API for converting, compressing and deploying large language models.
lmdeploy [-h] [-v] {lite,serve,check_env,chat} ...
lmdeploy options#
lmdeploy lite#
Compressing and accelerating LLMs with lmdeploy.lite module
lmdeploy lite [-h] {auto_awq,auto_gptq,calibrate,smooth_quant} ...
lmdeploy lite options#
lmdeploy lite auto_awq#
Perform weight quantization using AWQ algorithm.
lmdeploy lite auto_awq [-h] [--revision REVISION] [--download-dir DOWNLOAD_DIR]
[--work-dir WORK_DIR]
[--calib-dataset {wikitext2,c4,pileval,gsm8k,neuralmagic_calibration,open-platypus,openwebtext}]
[--calib-samples CALIB_SAMPLES] [--calib-seqlen CALIB_SEQLEN]
[--batch-size BATCH_SIZE] [--search-scale]
[--dtype {auto,float16,bfloat16}] [--device DEVICE] [--w-bits W_BITS]
[--w-sym] [--w-group-size W_GROUP_SIZE]
model
lmdeploy lite auto_awq positional arguments#
model- The path of model in hf format (default:None)
lmdeploy lite auto_awq options#
--revisionREVISION- The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.--download-dirDOWNLOAD_DIR- Directory to download and load the weights, default to the default cache directory of huggingface.--work-dirWORK_DIR- The working directory to save results (default:./work_dir)--calib-datasetCALIB_DATASET- The calibration dataset name. (default:wikitext2)--calib-samplesCALIB_SAMPLES- The number of samples for calibration (default:128)--calib-seqlenCALIB_SEQLEN- The sequence length for calibration (default:2048)--batch-sizeBATCH_SIZE- The batch size for running the calib samples. Low GPU mem requires small batch_size. Large batch_size reduces the calibration time while costs more VRAM (default:1)--search-scale- Whether search scale ratio. Default to be disabled, which means only smooth quant with 0.5 ratio will be applied--dtypeDTYPE- data type for model weights and activations. The"auto"option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. This option will be ignored if the model is a quantized model (default:auto)--deviceDEVICE- Device for weight quantization (cuda or npu) (default:cuda)--w-bitsW_BITS- Bit number for weight quantization (default:4)--w-sym- Whether to do symmetric quantization--w-group-sizeW_GROUP_SIZE- Group size for weight quantization statistics (default:128)
lmdeploy lite auto_gptq#
Perform weight quantization using GPTQ algorithm.
lmdeploy lite auto_gptq [-h] [--revision REVISION] [--work-dir WORK_DIR]
[--calib-dataset {wikitext2,c4,pileval,gsm8k,neuralmagic_calibration,open-platypus,openwebtext}]
[--calib-samples CALIB_SAMPLES] [--calib-seqlen CALIB_SEQLEN]
[--batch-size BATCH_SIZE] [--dtype {auto,float16,bfloat16}]
[--w-bits W_BITS] [--w-group-size W_GROUP_SIZE]
model
lmdeploy lite auto_gptq positional arguments#
model- The path of model in hf format (default:None)
lmdeploy lite auto_gptq options#
--revisionREVISION- The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.--work-dirWORK_DIR- The working directory to save results (default:./work_dir)--calib-datasetCALIB_DATASET- The calibration dataset name. (default:wikitext2)--calib-samplesCALIB_SAMPLES- The number of samples for calibration (default:128)--calib-seqlenCALIB_SEQLEN- The sequence length for calibration (default:2048)--batch-sizeBATCH_SIZE- The batch size for running the calib samples. Low GPU mem requires small batch_size. Large batch_size reduces the calibration time while costs more VRAM (default:1)--dtypeDTYPE- data type for model weights and activations. The"auto"option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. This option will be ignored if the model is a quantized model (default:auto)--w-bitsW_BITS- Bit number for weight quantization (default:4)--w-group-sizeW_GROUP_SIZE- Group size for weight quantization statistics (default:128)
lmdeploy lite calibrate#
Perform calibration on a given dataset.
lmdeploy lite calibrate [-h] [--work-dir WORK_DIR]
[--calib-dataset {wikitext2,c4,pileval,gsm8k,neuralmagic_calibration,open-platypus,openwebtext}]
[--calib-samples CALIB_SAMPLES] [--calib-seqlen CALIB_SEQLEN]
[--batch-size BATCH_SIZE] [--search-scale]
[--dtype {auto,float16,bfloat16}]
model
lmdeploy lite calibrate positional arguments#
model- The name or path of the model to be loaded (default:None)
lmdeploy lite calibrate options#
--work-dirWORK_DIR- The working directory to save results (default:./work_dir)--calib-datasetCALIB_DATASET- The calibration dataset name. (default:wikitext2)--calib-samplesCALIB_SAMPLES- The number of samples for calibration (default:128)--calib-seqlenCALIB_SEQLEN- The sequence length for calibration (default:2048)--batch-sizeBATCH_SIZE- The batch size for running the calib samples. Low GPU mem requires small batch_size. Large batch_size reduces the calibration time while costs more VRAM (default:1)--search-scale- Whether search scale ratio. Default to be disabled, which means only smooth quant with 0.5 ratio will be applied--dtypeDTYPE- data type for model weights and activations. The"auto"option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. This option will be ignored if the model is a quantized model (default:auto)
lmdeploy lite smooth_quant#
Perform w8a8 quantization using SmoothQuant.
lmdeploy lite smooth_quant [-h] [--work-dir WORK_DIR] [--device DEVICE]
[--calib-dataset {wikitext2,c4,pileval,gsm8k,neuralmagic_calibration,open-platypus,openwebtext}]
[--calib-samples CALIB_SAMPLES] [--calib-seqlen CALIB_SEQLEN]
[--batch-size BATCH_SIZE] [--search-scale]
[--dtype {auto,float16,bfloat16}]
[--quant-dtype {int8,float8_e4m3fn,float8_e5m2,fp8}]
[--revision REVISION] [--download-dir DOWNLOAD_DIR]
model
lmdeploy lite smooth_quant positional arguments#
model- The name or path of the model to be loaded (default:None)
lmdeploy lite smooth_quant options#
--work-dirWORK_DIR- The working directory for outputs. defaults to"./work_dir"--deviceDEVICE- Device for weight quantization (cuda or npu) (default:cuda)--calib-datasetCALIB_DATASET- The calibration dataset name. (default:wikitext2)--calib-samplesCALIB_SAMPLES- The number of samples for calibration (default:128)--calib-seqlenCALIB_SEQLEN- The sequence length for calibration (default:2048)--batch-sizeBATCH_SIZE- The batch size for running the calib samples. Low GPU mem requires small batch_size. Large batch_size reduces the calibration time while costs more VRAM (default:1)--search-scale- Whether search scale ratio. Default to be disabled, which means only smooth quant with 0.5 ratio will be applied--dtypeDTYPE- data type for model weights and activations. The"auto"option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. This option will be ignored if the model is a quantized model (default:auto)--quant-dtypeQUANT_DTYPE- data type for the quantized model weights and activations.Note"fp8"is the short version of"float8_e4m3fn"(default:int8)--revisionREVISION- The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.--download-dirDOWNLOAD_DIR- Directory to download and load the weights, default to the default cache directory of huggingface.
lmdeploy serve#
Serve LLMs with openai API
lmdeploy serve [-h] {api_server,proxy} ...
lmdeploy serve options#
lmdeploy serve api_server#
Serve LLMs with restful api using fastapi.
lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT]
[--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]]
[--allow-credentials]
[--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]]
[--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]]
[--proxy-url PROXY_URL]
[--max-concurrent-requests MAX_CONCURRENT_REQUESTS]
[--backend {pytorch,turbomind}]
[--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
[--api-keys [API_KEYS ...]] [--ssl] [--model-name MODEL_NAME]
[--max-log-len MAX_LOG_LEN] [--disable-fastapi-docs]
[--allow-terminate-by-client] [--enable-abort-handling]
[--chat-template CHAT_TEMPLATE]
[--tool-call-parser TOOL_CALL_PARSER]
[--reasoning-parser REASONING_PARSER] [--revision REVISION]
[--download-dir DOWNLOAD_DIR] [--adapters [ADAPTERS ...]]
[--device {cuda,ascend,maca,camb}] [--eager-mode]
[--disable-vision-encoder]
[--logprobs-mode {None,raw_logits,raw_logprobs}]
[--dllm-block-length DLLM_BLOCK_LENGTH]
[--dllm-unmasking-strategy {low_confidence_dynamic,low_confidence_static,sequential}]
[--dllm-denoising-steps DLLM_DENOISING_STEPS]
[--dllm-confidence-threshold DLLM_CONFIDENCE_THRESHOLD]
[--enable-return-routed-experts]
[--distributed-executor-backend {uni,mp,ray}]
[--dtype {auto,float16,bfloat16}] [--tp TP]
[--session-len SESSION_LEN] [--max-batch-size MAX_BATCH_SIZE]
[--cache-max-entry-count CACHE_MAX_ENTRY_COUNT]
[--cache-block-seq-len CACHE_BLOCK_SEQ_LEN]
[--enable-prefix-caching]
[--max-prefill-token-num MAX_PREFILL_TOKEN_NUM]
[--quant-policy {0,4,8}] [--model-format {hf,awq,gptq,fp8,mxfp4}]
[--hf-overrides HF_OVERRIDES] [--disable-metrics] [--dp DP]
[--ep EP] [--enable-microbatch] [--enable-eplb]
[--role {Hybrid,Prefill,Decode}]
[--migration-backend {DLSlime,Mooncake}] [--node-rank NODE_RANK]
[--nnodes NNODES] [--cp CP]
[--rope-scaling-factor ROPE_SCALING_FACTOR]
[--num-tokens-per-iter NUM_TOKENS_PER_ITER]
[--max-prefill-iters MAX_PREFILL_ITERS] [--async {0,1}]
[--communicator {nccl,native,cuda-ipc}]
[--dist-init-addr DIST_INIT_ADDR]
[--vision-max-batch-size VISION_MAX_BATCH_SIZE]
[--speculative-algorithm {eagle,eagle3,deepseek_mtp}]
[--speculative-draft-model SPECULATIVE_DRAFT_MODEL]
[--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS]
model_path
lmdeploy serve api_server positional arguments#
model_path- The path of a model. it could be one of the following options: - i) a local directory path of a turbomind model which is converted by lmdeploy convert command or download from ii) and iii). - ii) the model_id of a lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as"internlm/internlm-chat-20b-4bit","lmdeploy/llama2-chat-70b-4bit", etc. - iii) the model_id of a model hosted inside a model repo on huggingface.co, such as"internlm/internlm-chat-7b","qwen/qwen-7b-chat ","baichuan-inc/baichuan2-7b-chat"and so on (default:None)
lmdeploy serve api_server options#
--server-nameSERVER_NAME- Host ip for serving (default:0.0.0.0)--server-portSERVER_PORT- Server port (default:23333)--allow-originsALLOW_ORIGINS- A list of allowed origins for cors (default:['*'])--allow-credentials- Whether to allow credentials for cors--allow-methodsALLOW_METHODS- A list of allowed http methods for cors (default:['*'])--allow-headersALLOW_HEADERS- A list of allowed http headers for cors (default:['*'])--proxy-urlPROXY_URL- The proxy url for api server. (default:None)--max-concurrent-requestsMAX_CONCURRENT_REQUESTS- This refers to the number of concurrent requests that the server can handle. The server is designed to process the engine’s tasks once the maximum number of concurrent requests is reached, regardless of any additional requests sent by clients concurrently during that time. Default to None. (default:None)--backendBACKEND- Set the inference backend (default:turbomind)--log-levelLOG_LEVEL- Set the log level (default:WARNING)--api-keysAPI_KEYS- Optional list of space separated API keys (default:None)--ssl- Enable SSL. Requires OS Environment variables'SSL_KEYFILE'and'SSL_CERTFILE'--model-nameMODEL_NAME- The name of the served model. It can be accessed by the RESTful API /v1/models. If it is not specified, model_path will be adopted (default:None)--max-log-lenMAX_LOG_LEN- Max number of prompt characters or prompt tokens being printed in log. Default: Unlimited (default:None)--disable-fastapi-docs- Disable FastAPI’s OpenAPI schema, Swagger UI, and ReDoc endpoint--allow-terminate-by-client- Enable server to be terminated by request from client--enable-abort-handling- Enable server to handle client abort requests--chat-templateCHAT_TEMPLATE- A JSON file or string that specifies the chat template configuration. Please refer to https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification (default:None)--tool-call-parserTOOL_CALL_PARSER- The registered tool parser name dict_keys(['internlm','intern-s1','llama3','qwen2d5','qwen','qwen3']). Default to None. (default:None)--reasoning-parserREASONING_PARSER- The registered reasoning parser name from dict_keys(['deepseek-r1','qwen-qwq','intern-s1']). Default to None. (default:None)--revisionREVISION- The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.--download-dirDOWNLOAD_DIR- Directory to download and load the weights, default to the default cache directory of huggingface.
lmdeploy serve api_server PyTorch engine arguments#
--adaptersADAPTERS- Used to set path(s) of lora adapter(s). One can input key-value pairs in xxx=yyy format for multiple lora adapters. If only have one adapter, one can only input the path of the adapter. (default:None)--deviceDEVICE- The device type of running (default:cuda)--eager-mode- Whether to enable eager mode. If True, cuda graph would be disabled--disable-vision-encoder- disable multimodal encoder--logprobs-modeLOGPROBS_MODE- The mode of logprobs. (default:None)--dllm-block-lengthDLLM_BLOCK_LENGTH- Block length for dllm (default:None)--dllm-unmasking-strategyDLLM_UNMASKING_STRATEGY- The unmasking strategy for dllm. (default:low_confidence_dynamic)--dllm-denoising-stepsDLLM_DENOISING_STEPS- The number of denoising steps for dllm. (default:None)--dllm-confidence-thresholdDLLM_CONFIDENCE_THRESHOLD- The confidence threshold for dllm. (default:0.85)--enable-return-routed-experts- Whether to output routed expert ids for replay--distributed-executor-backendDISTRIBUTED_EXECUTOR_BACKEND- The distributed executor backend for pytorch engine. (default:None)--dtypeDTYPE- data type for model weights and activations. The"auto"option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. This option will be ignored if the model is a quantized model (default:auto)--tpTP- GPU number used in tensor parallelism. Should be 2^n (default:1)--session-lenSESSION_LEN- The max session length of a sequence (default:None)--max-batch-sizeMAX_BATCH_SIZE- Maximum batch size. If not specified, the engine will automatically set it according to the device (default:None)--cache-max-entry-countCACHE_MAX_ENTRY_COUNT- The percentage of free gpu memory occupied by the k/v cache, excluding weights (default:0.8)--cache-block-seq-lenCACHE_BLOCK_SEQ_LEN- The length of the token sequence in a k/v block. For Turbomind Engine, if the GPU compute capability is >= 8.0, it should be a multiple of 32, otherwise it should be a multiple of 64. For Pytorch Engine, if Lora Adapter is specified, this parameter will be ignored (default:64)--enable-prefix-caching- Enable cache and match prefix--max-prefill-token-numMAX_PREFILL_TOKEN_NUM- the max number of tokens per iteration during prefill (default:8192)--quant-policyQUANT_POLICY- Quantize kv or not. 0: no quant; 4: 4bit kv; 8: 8bit kv (default:0)--model-formatMODEL_FORMAT- The format of input model. hf means hf_llama, awq represents the quantized model by AWQ, and gptq refers to the quantized model by GPTQ (default:None)--hf-overridesHF_OVERRIDES- Extra arguments to be forwarded to the HuggingFace config. (default:None)--disable-metrics- disable metrics system--dpDP- data parallelism. dp_rank is required when pytorch engine is used. (default:1)--epEP- expert parallelism. dp is required when pytorch engine is used. (default:1)--enable-microbatch- enable microbatch for specified model--enable-eplb- enable eplb for specified model--roleROLE- Hybrid for Non-Disaggregated Engine; Prefill for Disaggregated Prefill Engine; Decode for Disaggregated Decode Engine (default:Hybrid)--migration-backendMIGRATION_BACKEND- kvcache migration management backend when PD disaggregation (default:DLSlime)--node-rankNODE_RANK- The current node rank. (default:0)--nnodesNNODES- The total node nums (default:1)
lmdeploy serve api_server TurboMind engine arguments#
--dtypeDTYPE- data type for model weights and activations. The"auto"option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. This option will be ignored if the model is a quantized model (default:auto)--tpTP- GPU number used in tensor parallelism. Should be 2^n (default:1)--session-lenSESSION_LEN- The max session length of a sequence (default:None)--max-batch-sizeMAX_BATCH_SIZE- Maximum batch size. If not specified, the engine will automatically set it according to the device (default:None)--cache-max-entry-countCACHE_MAX_ENTRY_COUNT- The percentage of free gpu memory occupied by the k/v cache, excluding weights (default:0.8)--cache-block-seq-lenCACHE_BLOCK_SEQ_LEN- The length of the token sequence in a k/v block. For Turbomind Engine, if the GPU compute capability is >= 8.0, it should be a multiple of 32, otherwise it should be a multiple of 64. For Pytorch Engine, if Lora Adapter is specified, this parameter will be ignored (default:64)--enable-prefix-caching- Enable cache and match prefix--max-prefill-token-numMAX_PREFILL_TOKEN_NUM- the max number of tokens per iteration during prefill (default:8192)--quant-policyQUANT_POLICY- Quantize kv or not. 0: no quant; 4: 4bit kv; 8: 8bit kv (default:0)--model-formatMODEL_FORMAT- The format of input model. hf means hf_llama, awq represents the quantized model by AWQ, and gptq refers to the quantized model by GPTQ (default:None)--nnodesNNODES- The total node nums (default:1)--node-rankNODE_RANK- The current node rank. (default:0)--hf-overridesHF_OVERRIDES- Extra arguments to be forwarded to the HuggingFace config. (default:None)--disable-metrics- disable metrics system--dpDP- data parallelism. dp_rank is required when pytorch engine is used. (default:1)--cpCP- context parallelism size in attention for turbomind backend, tp must be a multiple of cp. (default:1)--rope-scaling-factorROPE_SCALING_FACTOR- Rope scaling factor (default:0.0)--num-tokens-per-iterNUM_TOKENS_PER_ITER- the number of tokens processed in a forward pass (default:0)--max-prefill-itersMAX_PREFILL_ITERS- the max number of forward passes in prefill stage (default:1)--asyncASYNC_- Enable async execution (default: 1, enabled). Set to 0 to disable async mode, 1 to enable it. (default:1)--communicatorCOMMUNICATOR- Communication backend for multi-GPU inference. The"native"option is deprecated and serves as an alias for"cuda-ipc"(default:nccl)--dist-init-addrDIST_INIT_ADDR(default:None)
lmdeploy serve api_server Vision model arguments#
--vision-max-batch-sizeVISION_MAX_BATCH_SIZE- the vision model batch size (default:1)
lmdeploy serve api_server Speculative decoding arguments#
--speculative-algorithmSPECULATIVE_ALGORITHM- The speculative algorithm to use. None means speculative decoding is disabled (default:None)--speculative-draft-modelSPECULATIVE_DRAFT_MODEL- The path to speculative draft model (default:None)--speculative-num-draft-tokensSPECULATIVE_NUM_DRAFT_TOKENS- The number of speculative tokens to generate per step (default:1)
lmdeploy serve proxy#
Proxy server that manages distributed api_server nodes.
lmdeploy serve proxy [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT]
[--serving-strategy {Hybrid,DistServe}] [--dummy-prefill]
[--routing-strategy {random,min_expected_latency,min_observed_latency}]
[--disable-cache-status] [--migration-protocol {RDMA,NVLINK}]
[--link-type {RoCE,IB}] [--disable-gdr] [--api-keys [API_KEYS ...]]
[--ssl]
[--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
lmdeploy serve proxy options#
--server-nameSERVER_NAME- Host ip for proxy serving (default:0.0.0.0)--server-portSERVER_PORT- Server port of the proxy (default:8000)--serving-strategySERVING_STRATEGY- the strategy to serve, Hybrid for colocating Prefill and Decodeworkloads into same engine, DistServe for Prefill-Decode Disaggregation (default:Hybrid)--dummy-prefill- dummy prefill for performance profiler--routing-strategyROUTING_STRATEGY- the strategy to dispatch requests to nodes (default:min_expected_latency)--disable-cache-status- Whether to disable cache status of the proxy. If set, the proxy will forget the status of the previous time--migration-protocolMIGRATION_PROTOCOL- transport protocol of KV migration (default:RDMA)--link-typeLINK_TYPE- RDMA Link Type (default:RoCE)--disable-gdr- with GPU Direct Memory Access--api-keysAPI_KEYS- Optional list of space separated API keys (default:None)--ssl- Enable SSL. Requires OS Environment variables'SSL_KEYFILE'and'SSL_CERTFILE'--log-levelLOG_LEVEL- Set the log level (default:WARNING)
lmdeploy check_env#
Check the environmental information.
lmdeploy check_env [-h] [--dump-file DUMP_FILE]
lmdeploy check_env options#
--dump-fileDUMP_FILE- The file path to save env info. Only support file format in json, yml, pkl (default:None)
lmdeploy chat#
lmdeploy chat [-h] [--backend {pytorch,turbomind}] [--chat-template CHAT_TEMPLATE]
[--revision REVISION] [--download-dir DOWNLOAD_DIR] [--adapters [ADAPTERS ...]]
[--device {cuda,ascend,maca,camb}] [--eager-mode]
[--dllm-block-length DLLM_BLOCK_LENGTH] [--dtype {auto,float16,bfloat16}]
[--tp TP] [--session-len SESSION_LEN]
[--cache-max-entry-count CACHE_MAX_ENTRY_COUNT] [--enable-prefix-caching]
[--quant-policy {0,4,8}] [--model-format {hf,awq,gptq,fp8,mxfp4}]
[--rope-scaling-factor ROPE_SCALING_FACTOR]
[--communicator {nccl,native,cuda-ipc}] [--cp CP] [--async {0,1}]
[--speculative-algorithm {eagle,eagle3,deepseek_mtp}]
[--speculative-draft-model SPECULATIVE_DRAFT_MODEL]
[--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS]
model_path
lmdeploy chat positional arguments#
model_path- The path of a model. it could be one of the following options: - i) a local directory path of a turbomind model which is converted by lmdeploy convert command or download from ii) and iii). - ii) the model_id of a lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as"internlm/internlm-chat-20b-4bit","lmdeploy/llama2-chat-70b-4bit", etc. - iii) the model_id of a model hosted inside a model repo on huggingface.co, such as"internlm/internlm-chat-7b","qwen/qwen-7b-chat ","baichuan-inc/baichuan2-7b-chat"and so on (default:None)
lmdeploy chat options#
--backendBACKEND- Set the inference backend (default:turbomind)--chat-templateCHAT_TEMPLATE- A JSON file or string that specifies the chat template configuration. Please refer to https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification (default:None)--revisionREVISION- The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.--download-dirDOWNLOAD_DIR- Directory to download and load the weights, default to the default cache directory of huggingface.
lmdeploy chat PyTorch engine arguments#
--adaptersADAPTERS- Used to set path(s) of lora adapter(s). One can input key-value pairs in xxx=yyy format for multiple lora adapters. If only have one adapter, one can only input the path of the adapter. (default:None)--deviceDEVICE- The device type of running (default:cuda)--eager-mode- Whether to enable eager mode. If True, cuda graph would be disabled--dllm-block-lengthDLLM_BLOCK_LENGTH- Block length for dllm (default:None)--dtypeDTYPE- data type for model weights and activations. The"auto"option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. This option will be ignored if the model is a quantized model (default:auto)--tpTP- GPU number used in tensor parallelism. Should be 2^n (default:1)--session-lenSESSION_LEN- The max session length of a sequence (default:None)--cache-max-entry-countCACHE_MAX_ENTRY_COUNT- The percentage of free gpu memory occupied by the k/v cache, excluding weights (default:0.8)--enable-prefix-caching- Enable cache and match prefix--quant-policyQUANT_POLICY- Quantize kv or not. 0: no quant; 4: 4bit kv; 8: 8bit kv (default:0)
lmdeploy chat TurboMind engine arguments#
--dtypeDTYPE- data type for model weights and activations. The"auto"option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. This option will be ignored if the model is a quantized model (default:auto)--tpTP- GPU number used in tensor parallelism. Should be 2^n (default:1)--session-lenSESSION_LEN- The max session length of a sequence (default:None)--cache-max-entry-countCACHE_MAX_ENTRY_COUNT- The percentage of free gpu memory occupied by the k/v cache, excluding weights (default:0.8)--enable-prefix-caching- Enable cache and match prefix--quant-policyQUANT_POLICY- Quantize kv or not. 0: no quant; 4: 4bit kv; 8: 8bit kv (default:0)--model-formatMODEL_FORMAT- The format of input model. hf means hf_llama, awq represents the quantized model by AWQ, and gptq refers to the quantized model by GPTQ (default:None)--rope-scaling-factorROPE_SCALING_FACTOR- Rope scaling factor (default:0.0)--communicatorCOMMUNICATOR- Communication backend for multi-GPU inference. The"native"option is deprecated and serves as an alias for"cuda-ipc"(default:nccl)--cpCP- context parallelism size in attention for turbomind backend, tp must be a multiple of cp. (default:1)--asyncASYNC_- Enable async execution (default: 1, enabled). Set to 0 to disable async mode, 1 to enable it. (default:1)
lmdeploy chat Speculative decoding arguments#
--speculative-algorithmSPECULATIVE_ALGORITHM- The speculative algorithm to use. None means speculative decoding is disabled (default:None)--speculative-draft-modelSPECULATIVE_DRAFT_MODEL- The path to speculative draft model (default:None)--speculative-num-draft-tokensSPECULATIVE_NUM_DRAFT_TOKENS- The number of speculative tokens to generate per step (default:1)