inference pipeline#
pipeline#
- lmdeploy.pipeline(model_path: str, backend_config: TurbomindEngineConfig | PytorchEngineConfig | None = None, chat_template_config: ChatTemplateConfig | None = None, log_level: str = 'WARNING', max_log_len: int | None = None, **kwargs)[source]#
- Parameters:
model_path (str) –
the path of a model. It could be one of the following options:
i) A local directory path of a turbomind model which is converted by the lmdeploy convert command or downloaded from ii) and iii).
ii) The model_id of an lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.
iii) The model_id of a model hosted inside a model repo on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Default to None.
chat_template_config (ChatTemplateConfig) – chat template configuration. Default to None.
log_level (str) – set the log level, whose value is among [CRITICAL, ERROR, WARNING, INFO, DEBUG]
max_log_len (int) – the max number of prompt characters or prompt tokens printed in the log
Examples
>>> # LLM
>>> import lmdeploy
>>> pipe = lmdeploy.pipeline('internlm/internlm-chat-7b')
>>> response = pipe(['hi', 'say this is a test'])
>>> print(response)
>>>
>>> # VLM
>>> from lmdeploy.vl import load_image
>>> from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
>>> pipe = pipeline('liuhaotian/llava-v1.5-7b',
...                 backend_config=TurbomindEngineConfig(session_len=8192),
...                 chat_template_config=ChatTemplateConfig(model_name='vicuna'))
>>> im = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
>>> response = pipe([('describe this image', [im])])
>>> print(response)
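The following additional sketch shows how sampling parameters can be overridden per call. It assumes the pipeline object accepts a GenerationConfig through the gen_config keyword and that each returned Response exposes a text attribute; the model name and parameter values are illustrative.
>>> # a sketch, assuming pipe(...) accepts a GenerationConfig via the gen_config keyword
>>> # and that each Response exposes a .text attribute; model name and values are illustrative
>>> from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
>>> pipe = pipeline('internlm/internlm-chat-7b',
...                 backend_config=TurbomindEngineConfig(tp=1, session_len=4096))
>>> gen_config = GenerationConfig(do_sample=True, top_p=0.8, temperature=0.7, max_new_tokens=256)
>>> responses = pipe(['hi', 'write a haiku about GPUs'], gen_config=gen_config)
>>> for r in responses:
...     print(r.text)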
serving#
- lmdeploy.serve(model_path: str, model_name: str | None = None, backend: Literal['turbomind', 'pytorch'] = 'turbomind', backend_config: TurbomindEngineConfig | PytorchEngineConfig | None = None, chat_template_config: ChatTemplateConfig | None = None, server_name: str = '0.0.0.0', server_port: int = 23333, log_level: str = 'ERROR', api_keys: str | List[str] | None = None, ssl: bool = False, **kwargs)[source]#
This will run the api_server in a subprocess.
- Parameters:
model_path (str) –
the path of a model. It could be one of the following options:
i) A local directory path of a turbomind model which is converted by the lmdeploy convert command or downloaded from ii) and iii).
ii) The model_id of an lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.
iii) The model_id of a model hosted inside a model repo on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
model_name (str) – the name of the served model. It can be accessed by the RESTful API /v1/models. If it is not specified, model_path will be used instead.
backend (str) – either the turbomind or the pytorch backend. Default to the turbomind backend.
backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Default to None.
chat_template_config (ChatTemplateConfig) – chat template configuration. Default to None.
server_name (str) – host IP address for serving
server_port (int) – server port
log_level (str) – set the log level, whose value is among [CRITICAL, ERROR, WARNING, INFO, DEBUG]
api_keys (List[str] | str | None) – Optional list of API keys. Accepts a string as a single api_key. Default to None, which means no api key is applied.
ssl (bool) – Enable SSL. Requires OS Environment variables ‘SSL_KEYFILE’ and ‘SSL_CERTFILE’.
- Returns:
A client chatbot for LLaMA series models.
- Return type:
APIClient
Examples
>>> import lmdeploy
>>> client = lmdeploy.serve('internlm/internlm-chat-7b', 'internlm-chat-7b')
>>> for output in client.chat('hi', 1):
...     print(output)
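A further sketch, restricted to the parameters documented above, that launches the server with the pytorch backend on a non-default port; the model name, port, and config values are illustrative.
>>> # illustrative values; serve() runs the api_server in a subprocess and returns a client
>>> import lmdeploy
>>> from lmdeploy import PytorchEngineConfig
>>> client = lmdeploy.serve('internlm/internlm-chat-7b',
...                         model_name='internlm-chat-7b',
...                         backend='pytorch',
...                         backend_config=PytorchEngineConfig(tp=1),
...                         server_port=23334)
>>> for output in client.chat('hello', 1):
...     print(output)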
PytorchEngineConfig#
- class lmdeploy.PytorchEngineConfig(dtype: str = 'auto', tp: int = 1, session_len: int | None = None, max_batch_size: int | None = None, cache_max_entry_count: float = 0.8, prefill_interval: int = 16, block_size: int = 64, num_cpu_blocks: int = 0, num_gpu_blocks: int = 0, adapters: Dict[str, str] | None = None, max_prefill_token_num: int = 4096, thread_safe: bool = False, enable_prefix_caching: bool = False, device_type: str = 'cuda', eager_mode: bool = False, custom_module_map: Dict[str, str] | None = None, download_dir: str | None = None, revision: str | None = None, quant_policy: Literal[0, 4, 8] = 0)[source]#
PyTorch Engine Config.
- Parameters:
dtype (str) – data type for model weights and activations. It can be one of the following values: [‘auto’, ‘float16’, ‘bfloat16’]. The ‘auto’ option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
tp (int) – Tensor parallelism. Default to 1.
session_len (int) – Max session length. Default None.
max_batch_size (int) – Max batch size. If it is not specified, the engine will automatically set it according to the device
cache_max_entry_count (float) – the percentage of gpu memory occupied by the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache
prefill_interval (int) – Interval to perform prefill, Default 16.
block_size (int) – paging cache block size, default 64.
num_cpu_blocks (int) – Number of CPU blocks. If it is 0, the cache will be allocated according to the current environment.
num_gpu_blocks (int) – Number of GPU blocks. If it is 0, the cache will be allocated according to the current environment.
adapters (dict) – The path configs of LoRA adapters.
max_prefill_token_num (int) – the max number of tokens processed per prefill iteration.
thread_safe (bool) – whether to create a thread-safe engine instance.
enable_prefix_caching (bool) – Enable token match and sharing caches.
device_type (str) – The inference device type, options [‘cuda’]
eager_mode (bool) – Enable “eager” mode or not
custom_module_map (Dict) – nn module map customized by users. Once provided, the original nn modules of the model will be substituted by the mapping ones
download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.
revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
quant_policy (int) – default to 0. When k/v is quantized into 4 or 8 bit, set it to 4 or 8, respectively
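A minimal sketch of wiring a PytorchEngineConfig into the pipeline documented above; the model name and all values are illustrative, not recommended settings.
>>> # illustrative values; any model_path accepted by pipeline() works the same way
>>> from lmdeploy import pipeline, PytorchEngineConfig
>>> backend_config = PytorchEngineConfig(
...     tp=1,                       # single-GPU tensor parallelism
...     session_len=4096,           # max session length
...     cache_max_entry_count=0.6,  # reserve 60% of free GPU memory for the k/v cache
...     eager_mode=True)            # run in eager mode (no graph capture)
>>> pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
>>> print(pipe(['hi']))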
TurbomindEngineConfig#
- class lmdeploy.TurbomindEngineConfig(dtype: str = 'auto', model_format: str | None = None, tp: int = 1, session_len: int | None = None, max_batch_size: int = None, cache_max_entry_count: float = 0.8, cache_chunk_size: int = -1, cache_block_seq_len: int = 64, enable_prefix_caching: bool = False, quant_policy: int = 0, rope_scaling_factor: float = 0.0, use_logn_attn: bool = False, download_dir: str | None = None, revision: str | None = None, max_prefill_token_num: int = 8192, num_tokens_per_iter: int = 0, max_prefill_iters: int = 1)[source]#
TurboMind Engine config.
- Parameters:
dtype (str) – data type for model weights and activations. It can be one of the following values, [‘auto’, ‘float16’, ‘bfloat16’] The auto option will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
model_format (str) – the layout of the deployed model. It can be one of the following values: [hf, awq, gptq]. hf means a huggingface model (.bin, .safetensors); awq and gptq mean models quantized by AWQ and GPTQ, respectively. If it is not specified, i.e. None, it will be extracted from the input model
tp (int) – the number of GPU cards used in tensor parallelism, default to 1
session_len (int) – the max session length of a sequence, default to None
max_batch_size (int) – the max batch size during inference. If it is not specified, the engine will automatically set it according to the device
cache_max_entry_count (float) – the percentage of gpu memory occupied by the k/v cache. For versions of lmdeploy between v0.2.0 and v0.2.1, it defaults to 0.5, depicting the percentage of TOTAL GPU memory to be allocated to the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache
cache_chunk_size (int) – The policy to apply for KV block from the block manager, default to -1.
cache_block_seq_len (int) – the length of the token sequence in a k/v block, default to 64
enable_prefix_caching (bool) – enable cache prompts for block reuse, default to False
quant_policy (int) – default to 0. When k/v is quantized into 4 or 8 bit, set it to 4 or 8, respectively
rope_scaling_factor (float) – scaling factor used for dynamic NTK, default to 0. TurboMind follows the implementation of transformers’ LlamaAttention
use_logn_attn (bool) – whether or not to use logn attention, default to False
download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.
revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
max_prefill_token_num (int) – the number of tokens each iteration during prefill, default to 8192
num_tokens_per_iter (int) – the number of tokens processed in each forward pass. Working with max_prefill_iters enables the “Dynamic SplitFuse”-like scheduling
max_prefill_iters (int) – the max number of forward pass during prefill stage
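A minimal sketch of a TurbomindEngineConfig passed to the pipeline documented above; the model name and all values are illustrative, not recommended settings.
>>> # illustrative values for the parameters documented above
>>> from lmdeploy import pipeline, TurbomindEngineConfig
>>> backend_config = TurbomindEngineConfig(
...     tp=2,                        # shard the model across 2 GPUs
...     session_len=8192,            # max session length
...     quant_policy=8,              # quantize the k/v cache to 8 bit online
...     cache_max_entry_count=0.5,   # reserve 50% of free GPU memory for the k/v cache
...     enable_prefix_caching=True)  # reuse cached k/v blocks of shared prompt prefixes
>>> pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
>>> print(pipe(['hi']))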
GenerationConfig#
- class lmdeploy.GenerationConfig(n: int = 1, max_new_tokens: int = 512, do_sample: bool = False, top_p: float = 1.0, top_k: int = 50, min_p: float = 0.0, temperature: float = 0.8, repetition_penalty: float = 1.0, ignore_eos: bool = False, random_seed: int | None = None, stop_words: List[str] | None = None, bad_words: List[str] | None = None, stop_token_ids: List[int] | None = None, bad_token_ids: List[int] | None = None, min_new_tokens: int | None = None, skip_special_tokens: bool = True, spaces_between_special_tokens: bool = True, logprobs: int | None = None, response_format: Dict | None = None, logits_processors: List[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] | None = None, output_logits: Literal['all', 'generation'] | None = None, output_last_hidden_state: Literal['all', 'generation'] | None = None)[source]#
Generation parameters used by inference engines.
- Parameters:
n (int) – Define how many chat completion choices to generate for each input message. Only 1 is supported now.
max_new_tokens (int) – The maximum number of tokens that can be generated in the chat completion
do_sample (bool) – Whether or not to use sampling; use greedy decoding otherwise. Default to False.
top_p (float) – An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass
top_k (int) – An alternative to sampling with temperature, where the model considers the top_k tokens with the highest probability
min_p (float) – Minimum token probability, which will be scaled by the probability of the most likely token. It must be a value between 0 and 1. Typical values are in the 0.01-0.2 range, comparably selective as setting top_p in the 0.99-0.8 range (use the opposite of normal top_p values)
temperature (float) – Sampling temperature
repetition_penalty (float) – Penalty to prevent the model from generating repeated words or phrases. A value larger than 1 discourages repetition
ignore_eos (bool) – Indicator to ignore the eos_token_id or not
random_seed (int) – Seed used when sampling a token
stop_words (List[str]) – Words that stop generating further tokens
bad_words (List[str]) – Words that the engine will never generate
stop_token_ids (List[int]) – List of tokens that stop the generation when they are generated. The returned output will not contain the stop tokens.
bad_token_ids (List[int]) – List of tokens that the engine will never generate.
min_new_tokens (int) – The minimum numbers of tokens to generate, ignoring the number of tokens in the prompt.
skip_special_tokens (bool) – Whether or not to remove special tokens during decoding. Default to True.
spaces_between_special_tokens (bool) – Whether or not to add spaces around special tokens. The behavior of fast tokenizers is to have this set to False. It is set to True in slow tokenizers.
logprobs (int) – Number of log probabilities to return per output token.
response_format (Dict) – Only the pytorch backend supports response formatting. Examples:
{
    "type": "json_schema",
    "json_schema": {
        "name": "test",
        "schema": {
            "properties": {
                "name": {"type": "string"}
            },
            "required": ["name"],
            "type": "object"
        }
    }
}
or
{
    "type": "regex_schema",
    "regex_schema": "call me [A-Za-z]{1,10}"
}
logits_processors (List[Callable]) – Custom logit processors.
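A sketch combining sampling parameters with a json_schema response format. It assumes the pipeline accepts the config through the gen_config keyword; response_format requires the pytorch backend, and the schema, model name, and values are illustrative.
>>> # a sketch, assuming pipe(...) accepts gen_config; response_format requires the pytorch backend
>>> from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig
>>> schema = {
...     'type': 'json_schema',
...     'json_schema': {
...         'name': 'person',                 # illustrative schema name
...         'schema': {
...             'properties': {'name': {'type': 'string'}},
...             'required': ['name'],
...             'type': 'object'}}}
>>> gen_config = GenerationConfig(do_sample=True, temperature=0.8, top_p=0.95,
...                               max_new_tokens=128, response_format=schema)
>>> pipe = pipeline('internlm/internlm-chat-7b', backend_config=PytorchEngineConfig())
>>> print(pipe(['Introduce yourself as a JSON object with a "name" field.'], gen_config=gen_config))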
ChatTemplateConfig#
- class lmdeploy.ChatTemplateConfig(model_name: str, system: str | None = None, meta_instruction: str | None = None, eosys: str | None = None, user: str | None = None, eoh: str | None = None, assistant: str | None = None, eoa: str | None = None, tool: str | None = None, eotool: str | None = None, separator: str | None = None, capability: Literal['completion', 'infilling', 'chat', 'python'] | None = None, stop_words: List[str] | None = None)[source]#
Parameters for chat template.
- Parameters:
model_name (str) – the name of the deployed model. It determines which chat template will be applied. All chat template names can be listed with the lmdeploy list command
system (str | None) – begin of the system prompt
meta_instruction (str | None) – system prompt
eosys (str | None) – end of the system prompt
user (str | None) – begin of the user prompt
eoh (str | None) – end of the user prompt
assistant (str | None) – begin of the assistant prompt
eoa (str | None) – end of the assistant prompt
tool (str | None) – begin of the tool prompt
eotool (str | None) – end of the tool prompt
capability – (‘completion’ | ‘infilling’ | ‘chat’ | ‘python’) = None
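A short sketch of overriding parts of a built-in chat template and handing it to the pipeline documented above; the template name and strings are illustrative.
>>> # illustrative overrides; 'internlm2' stands in for any name returned by `lmdeploy list`
>>> from lmdeploy import pipeline, ChatTemplateConfig
>>> chat_template_config = ChatTemplateConfig(
...     model_name='internlm2',                           # built-in template to start from
...     meta_instruction='You are a concise assistant.',  # override the system prompt
...     capability='chat')
>>> pipe = pipeline('internlm/internlm2-chat-7b',
...                 chat_template_config=chat_template_config)
>>> print(pipe(['hi']))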