inference pipeline#

pipeline#

lmdeploy.pipeline(model_path: str, model_name: str | None = None, backend_config: TurbomindEngineConfig | PytorchEngineConfig | None = None, chat_template_config: ChatTemplateConfig | None = None, log_level='ERROR', **kwargs)

Parameters:

model_path (str) –
the path of a model. It could be one of the following options:
- i) A local directory path of a turbomind model which is
  converted by the lmdeploy convert command or downloaded from ii) and iii).
- ii) The model_id of a lmdeploy-quantized model hosted
  inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.
- iii) The model_id of a model hosted inside a model repo
  on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
model_name (str) – needed when model_path is a pytorch model on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Defaults to None.
chat_template_config (ChatTemplateConfig) – chat template configuration. Defaults to None.
log_level (str) – the log level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]. Defaults to ERROR.

Examples

>>> # LLM
>>> import lmdeploy
>>> pipe = lmdeploy.pipeline('internlm/internlm-chat-7b')
>>> response = pipe(['hi','say this is a test'])
>>> print(response)
>>>
>>> # VLM
>>> from lmdeploy.vl import load_image
>>> from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
>>> pipe = pipeline('liuhaotian/llava-v1.5-7b',
...                 backend_config=TurbomindEngineConfig(session_len=8192),
...                 chat_template_config=ChatTemplateConfig(model_name='vicuna'))
>>> im = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
>>> response = pipe([('describe this image', [im])])
>>> print(response)

serving#

lmdeploy.serve(model_path: str, model_name: str | None = None, backend: Literal['turbomind', 'pytorch'] = 'turbomind', backend_config: TurbomindEngineConfig | PytorchEngineConfig | None = None, chat_template_config: ChatTemplateConfig | None = None, server_name: str = '0.0.0.0', server_port: int = 23333, log_level: str = 'ERROR', api_keys: str | List[str] | None = None, ssl: bool = False, **kwargs)

This will run the api_server in a subprocess.

Parameters:

model_path (str) –
the path of a model. It could be one of the following options:
- i) A local directory path of a turbomind model which is
  converted by the lmdeploy convert command or downloaded from ii) and iii).
- ii) The model_id of a lmdeploy-quantized model hosted
  inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.
- iii) The model_id of a model hosted inside a model repo
  on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
model_name (str) – needed when model_path is a pytorch model on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
backend (str) – either the turbomind or the pytorch backend. Defaults to turbomind.
backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Defaults to None.
chat_template_config (ChatTemplateConfig) – chat template configuration. Defaults to None.
server_name (str) – host IP address for serving
server_port (int) – server port
log_level (str) – the log level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]. Defaults to ERROR.
api_keys (List[str] | str | None) – Optional list of API keys. A single string is accepted as one api_key. Defaults to None, which means no API key is applied.
ssl (bool) – Enable SSL. Requires the OS environment variables ‘SSL_KEYFILE’ and ‘SSL_CERTFILE’.

Returns:

A client chatbot for LLaMA series models.

Return type:

APIClient

Examples

>>> import lmdeploy
>>> client = lmdeploy.serve('internlm/internlm-chat-7b', 'internlm-chat-7b')
>>> for output in client.chat('hi', 1):
...    print(output)
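
The api_keys and ssl parameters of serve accept several input forms and depend on environment variables, so it can help to check them before starting the server. The following is a minimal sketch under stated assumptions: normalize_api_keys and missing_ssl_env are hypothetical helper names, not part of lmdeploy, and lmdeploy's own handling may differ.

```python
import os
from typing import List, Mapping, Optional, Union


def normalize_api_keys(api_keys: Union[str, List[str], None]) -> Optional[List[str]]:
    """Hypothetical helper: turn the documented api_keys forms into a list.

    A single string becomes a one-element list; None stays None,
    which means no API key is applied.
    """
    if api_keys is None:
        return None
    if isinstance(api_keys, str):
        return [api_keys]
    return list(api_keys)


def missing_ssl_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Hypothetical helper: list the required SSL variables that are unset or empty."""
    required = ("SSL_KEYFILE", "SSL_CERTFILE")
    return [name for name in required if not env.get(name)]
```

Calling missing_ssl_env() before serve(..., ssl=True) gives an early, readable error message instead of a failure inside the subprocess.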

lmdeploy.client(api_server_url: str = 'http://0.0.0.0:23333', api_key: str | None = None, **kwargs)

Parameters:

api_server_url (str) – communicating address ‘http://<ip>:<port>’ of api_server
api_key (str | None) – api key. Default to None, which means no api key will be used.

Returns:

Chatbot for LLaMA series models with turbomind as inference engine.

PytorchEngineConfig#

class lmdeploy.PytorchEngineConfig(model_name: str = '', tp: int = 1, session_len: int | None = None, max_batch_size: int = 128, cache_max_entry_count: float = 0.8, eviction_type: str = 'recompute', prefill_interval: int = 16, block_size: int = 64, num_cpu_blocks: int = 0, num_gpu_blocks: int = 0, adapters: Dict[str, str] | None = None, max_prefill_token_num: int = 4096, thread_safe: bool = False, enable_prefix_caching: bool = False, device_type: str = 'cuda', download_dir: str | None = None, revision: str | None = None)

PyTorch Engine Config.

Parameters:

model_name (str) – name of the given model.
tp (int) – Tensor Parallelism. default 1.
session_len (int) – Max session length. Default None.
max_batch_size (int) – Max batch size. Default 128.
cache_max_entry_count (float) – the percentage of gpu memory occupied by the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache.
eviction_type (str) – the action to perform when the k/v cache is full, one of [‘recompute’, ‘copy’]. Deprecated.
prefill_interval (int) – Interval to perform prefill. Default 16.
block_size (int) – paging cache block size, default 64.
num_cpu_blocks (int) – Number of cpu blocks. If it is 0, the cache is allocated according to the current environment.
num_gpu_blocks (int) – Number of gpu blocks. If it is 0, the cache is allocated according to the current environment.
adapters (dict) – The path configs to lora adapters.
max_prefill_token_num (int) – the number of tokens per iteration during prefill. Default 4096.
thread_safe (bool) – thread safe engine instance.
enable_prefix_caching (bool) – Enable token match and sharing caches.
device_type (str) – The inference device type, options [‘cuda’]
download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.
revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
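
When num_gpu_blocks is 0, the block count is derived from the environment. The sketch below shows a common way to estimate how block_size and cache_max_entry_count translate into a number of paged cache blocks. It is an illustration only: the formula, the fp16 cache assumption (dtype_bytes=2), and the helper names kv_block_bytes and estimate_gpu_blocks are assumptions, not lmdeploy's actual allocator.

```python
def kv_block_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   block_size: int = 64, dtype_bytes: int = 2) -> int:
    """Bytes for one cache block: keys and values of block_size tokens in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes


def estimate_gpu_blocks(free_gpu_bytes: int, cache_max_entry_count: float = 0.8,
                        **model_shape: int) -> int:
    """Blocks that fit in the share of FREE GPU memory reserved for the k/v cache."""
    budget = int(free_gpu_bytes * cache_max_entry_count)
    return budget // kv_block_bytes(**model_shape)


if __name__ == "__main__":
    # Hypothetical 7B-class shape: 32 layers, 32 k/v heads, head dimension 128.
    blocks = estimate_gpu_blocks(20 * 2**30, num_layers=32, num_kv_heads=32, head_dim=128)
    print(f"estimated cache blocks: {blocks}")
```

Each block holds block_size tokens, so the estimated block count times block_size gives a rough upper bound on the number of tokens cached across all sessions.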

TurbomindEngineConfig#

class lmdeploy.TurbomindEngineConfig(model_name: str | None = None, model_format: str | None = None, tp: int = 1, session_len: int | None = None, max_batch_size: int = 128, cache_max_entry_count: float = 0.8, cache_block_seq_len: int = 64, enable_prefix_caching: bool = False, quant_policy: int = 0, rope_scaling_factor: float = 0.0, use_logn_attn: bool = False, download_dir: str | None = None, revision: str | None = None, max_prefill_token_num: int = 8192, num_tokens_per_iter: int = 0, max_prefill_iters: int = 1)

TurboMind Engine config.

Parameters:

model_name (str) – the name of the deployed model, deprecated and has no effect when version > 0.2.1
model_format (str) – the layout of the deployed model. It can be one of the following values [hf, meta_llama, awq], hf meaning huggingface model (.bin, .safetensors), meta_llama being meta llama’s format (.pth), awq meaning the model quantized by AWQ.
tp (int) – the number of GPU cards used in tensor parallelism, default to 1
session_len (int) – the max session length of a sequence, default to None
max_batch_size (int) – the max batch size during inference, default to 128
cache_max_entry_count (float) – the percentage of gpu memory occupied by the k/v cache. For versions of lmdeploy between v0.2.0 and v0.2.1, it defaults to 0.5, depicting the percentage of TOTAL GPU memory to be allocated to the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache
cache_block_seq_len (int) – the length of the token sequence in a k/v block, default to 64
enable_prefix_caching (bool) – enable cache prompts for block reuse, default to False
quant_policy (int) – default to 0. When k/v is quantized into 8 bit, set it to 4
rope_scaling_factor (float) – scaling factor used for dynamic ntk, default to 0. TurboMind follows the implementation of transformers LlamaAttention
use_logn_attn (bool) – whether or not to use logn attention, default to False
download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.
revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
max_prefill_token_num (int) – the number of tokens per iteration during prefill, default to 8192
num_tokens_per_iter (int) – the number of tokens processed in each forward pass. Working with max_prefill_iters enables “Dynamic SplitFuse”-like scheduling
max_prefill_iters (int) – the max number of forward passes during the prefill stage
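
The two meanings of cache_max_entry_count described above lead to different cache sizes for the same value. A minimal sketch of the arithmetic, assuming a hypothetical helper kv_cache_budget and a legacy flag that stands for versions v0.2.0 to v0.2.1; neither exists in lmdeploy.

```python
def kv_cache_budget(cache_max_entry_count: float, total_gpu_bytes: int,
                    free_gpu_bytes: int, legacy: bool = False) -> int:
    """Bytes reserved for the k/v cache.

    legacy=True models v0.2.0 to v0.2.1, where the ratio applies to TOTAL GPU memory.
    legacy=False models later versions, where the ratio applies to FREE GPU memory.
    """
    base = total_gpu_bytes if legacy else free_gpu_bytes
    return int(base * cache_max_entry_count)


if __name__ == "__main__":
    GIB = 2**30
    total, free = 80 * GIB, 60 * GIB  # hypothetical GPU with 20 GiB already in use
    old = kv_cache_budget(0.5, total, free, legacy=True)
    new = kv_cache_budget(0.8, total, free)
    print(f"legacy: {old / GIB:.1f} GiB, current: {new / GIB:.1f} GiB")
```

Because the current ratio is taken from free memory, the budget shrinks automatically when model weights or other processes already occupy part of the GPU.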

GenerationConfig#

class lmdeploy.GenerationConfig(n: int = 1, max_new_tokens: int = 512, top_p: float = 1.0, top_k: int = 1, temperature: float = 0.8, repetition_penalty: float = 1.0, ignore_eos: bool = False, random_seed: int | None = None, stop_words: List[str] | None = None, bad_words: List[str] | None = None, min_new_tokens: int | None = None, skip_special_tokens: bool = True, logprobs: int | None = None)

generation parameters used by inference engines.

Parameters:

n (int) – Define how many chat completion choices to generate for each input message. Only 1 is supported now.
max_new_tokens (int) – The maximum number of tokens that can be generated in the chat completion
top_p (float) – An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass
top_k (int) – An alternative to sampling with temperature, where the model considers the top_k tokens with the highest probability
temperature (float) – Sampling temperature
repetition_penalty (float) – Penalty to prevent the model from generating repeated words or phrases. A value larger than 1 discourages repetition
ignore_eos (bool) – Indicator to ignore the eos_token_id or not
random_seed (int) – Seed used when sampling a token
stop_words (List[str]) – Words that stop generating further tokens
bad_words (List[str]) – Words that the engine will never generate
min_new_tokens (int) – The minimum number of tokens to generate, ignoring the number of tokens in the prompt.
skip_special_tokens (bool) – Whether or not to remove special tokens in the decoding. Defaults to True.
logprobs (int) – Number of log probabilities to return per output token.
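
To make the interaction of temperature, top_k and top_p concrete, the sketch below filters a toy next-token distribution in plain Python. It is a generic illustration of these sampling parameters, not the engines' implementation: filter_distribution is a hypothetical name, top_k=0 is treated here as "no limit", and the order in which real engines apply the filters may differ.

```python
import math
from typing import Dict


def filter_distribution(logits: Dict[str, float], temperature: float = 0.8,
                        top_k: int = 0, top_p: float = 1.0) -> Dict[str, float]:
    """Return renormalized probabilities of the tokens that survive filtering.

    temperature must be greater than 0.
    """
    # Temperature rescales logits before softmax; lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # top_k: keep only the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for prob, _ in kept)
    return {tok: prob / norm for prob, tok in kept}


if __name__ == "__main__":
    toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0}  # made-up values for illustration
    print(filter_distribution(toy_logits, temperature=1.0, top_k=2))
```

With top_k=1 only the most probable token survives, so decoding becomes greedy regardless of temperature.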

ChatTemplateConfig#

Parameters for chat template.

Parameters:

model_name (str) – the name of the deployed model. It determines which chat template is applied. All chat template names can be listed with the lmdeploy list command.
system (str | None) – begin of the system prompt
meta_instruction (str | None) – system prompt
eosys (str | None) – end of the system prompt
user (str | None) – begin of the user prompt
eoh (str | None) – end of the user prompt
assistant (str | None) – begin of the assistant prompt
eoa (str | None) – end of the assistant prompt
capability (str | None) – one of ‘completion’, ‘infilling’, ‘chat’ or ‘python’. Defaults to None.
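
The begin and end markers above wrap each part of a conversation. The sketch below shows one plausible way such fields combine into a prompt string. It is an assumption-laden illustration: render_prompt is a hypothetical function, the concatenation order is a guess at the usual pattern, and the markers in the usage example are invented; lmdeploy's built-in templates may assemble prompts differently.

```python
from typing import List, Optional, Tuple


def render_prompt(history: List[Tuple[str, str]], prompt: str, *,
                  system: str = "", meta_instruction: Optional[str] = None,
                  eosys: str = "", user: str = "", eoh: str = "",
                  assistant: str = "", eoa: str = "") -> str:
    """Concatenate template fields around past turns and the new user prompt."""
    text = ""
    # System section: begin marker, system prompt, end marker.
    if meta_instruction is not None:
        text += f"{system}{meta_instruction}{eosys}"
    # Past turns: each user message and assistant reply wrapped in its markers.
    for question, answer in history:
        text += f"{user}{question}{eoh}{assistant}{answer}{eoa}"
    # New turn: ends with the assistant begin marker so the model continues from there.
    text += f"{user}{prompt}{eoh}{assistant}"
    return text


if __name__ == "__main__":
    # Invented markers, for illustration only.
    print(render_prompt([("q", "r")], "hi",
                        system="<s>", meta_instruction="be brief", eosys="</s>",
                        user="<u>", eoh="</u>", assistant="<a>", eoa="</a>"))
```

Leaving a field as None in ChatTemplateConfig presumably falls back to the template selected by model_name, so only the markers that differ need to be set.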
