inference pipeline#

pipeline#

lmdeploy.pipeline(model_path: str, model_name: str | None = None, backend_config: TurbomindEngineConfig | PytorchEngineConfig | None = None, chat_template_config: ChatTemplateConfig | None = None, log_level='ERROR', **kwargs)[source]#
Parameters:
  • model_path (str) –

    the path of a model. It could be one of the following options:

      1. A local directory path of a turbomind model which is

      converted by lmdeploy convert command or download from ii) and iii).

      1. The model_id of a lmdeploy-quantized model hosted

      inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.

      1. The model_id of a model hosted inside a model repo

      on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat “, “baichuan-inc/Baichuan2-7B-Chat” and so on.

  • model_name (str) – needed when model_path is a pytorch model on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat “, “baichuan-inc/Baichuan2-7B-Chat” and so on.

  • backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Default to None.

  • chat_template_config (ChatTemplateConfig) – chat template configuration. Default to None.

  • log_level (str) – set log level whose value among [CRITICAL, ERROR, WARNING, INFO, DEBUG]

Examples

>>> # LLM
>>> import lmdeploy
>>> pipe = lmdeploy.pipeline('internlm/internlm-chat-7b')
>>> response = pipe(['hi','say this is a test'])
>>> print(response)
>>>
>>> # VLM
>>> from lmdeploy.vl import load_image
>>> from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
>>> pipe = pipeline('liuhaotian/llava-v1.5-7b',
...                 backend_config=TurbomindEngineConfig(session_len=8192),
...                 chat_template_config=ChatTemplateConfig(model_name='vicuna'))
>>> im = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
>>> response = pipe([('describe this image', [im])])
>>> print(response)

serving#

lmdeploy.serve(model_path: str, model_name: str | None = None, backend: Literal['turbomind', 'pytorch'] = 'turbomind', backend_config: TurbomindEngineConfig | PytorchEngineConfig | None = None, chat_template_config: ChatTemplateConfig | None = None, server_name: str = '0.0.0.0', server_port: int = 23333, log_level: str = 'ERROR', api_keys: str | List[str] | None = None, ssl: bool = False, **kwargs)[source]#

This will run the api_server in a subprocess.

Parameters:
  • model_path (str) –

    the path of a model. It could be one of the following options:

      1. A local directory path of a turbomind model which is

      converted by lmdeploy convert command or download from ii) and iii).

      1. The model_id of a lmdeploy-quantized model hosted

      inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.

      1. The model_id of a model hosted inside a model repo

      on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat “, “baichuan-inc/Baichuan2-7B-Chat” and so on.

  • model_name (str) – needed when model_path is a pytorch model on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat “, “baichuan-inc/Baichuan2-7B-Chat” and so on.

  • backend (str) – either turbomind or pytorch backend. Default to turbomind backend.

  • backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Default to none.

  • chat_template_config (ChatTemplateConfig) – chat template configuration. Default to None.

  • server_name (str) – host ip for serving

  • server_port (int) – server port

  • log_level (str) – set log level whose value among [CRITICAL, ERROR, WARNING, INFO, DEBUG]

  • api_keys (List[str] | str | None) – Optional list of API keys. Accepts string type as a single api_key. Default to None, which means no api key applied.

  • ssl (bool) – Enable SSL. Requires OS Environment variables ‘SSL_KEYFILE’ and ‘SSL_CERTFILE’.

Returns:

A client chatbot for LLaMA series models.

Return type:

APIClient

Examples

>>> import lmdeploy
>>> client = lmdeploy.serve('internlm/internlm-chat-7b', 'internlm-chat-7b')
>>> for output in client.chat('hi', 1):
...    print(output)
lmdeploy.client(api_server_url: str = 'http://0.0.0.0:23333', api_key: str | None = None, **kwargs)[source]#
Parameters:
  • api_server_url (str) – communicating address ‘http://<ip>:<port>’ of api_server

  • api_key (str | None) – api key. Default to None, which means no api key will be used.

Returns:

Chatbot for LLaMA series models with turbomind as inference engine.

PytorchEngineConfig#

class lmdeploy.PytorchEngineConfig(model_name: str = '', tp: int = 1, session_len: int | None = None, max_batch_size: int = 128, cache_max_entry_count: float = 0.8, eviction_type: str = 'recompute', prefill_interval: int = 16, block_size: int = 64, num_cpu_blocks: int = 0, num_gpu_blocks: int = 0, adapters: Dict[str, str] | None = None, max_prefill_token_num: int = 4096, thread_safe: bool = False, enable_prefix_caching: bool = False, device_type: str = 'cuda', download_dir: str | None = None, revision: str | None = None)[source]#

PyTorch Engine Config.

Parameters:
  • model_name (str) – name of the given model.

  • tp (int) – Tensor Parallelism. default 1.

  • session_len (int) – Max session length. Default None.

  • max_batch_size (int) – Max batch size. Default 128.

  • cache_max_entry_count (float) – the percentage of gpu memory occupied by the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache

  • eviction_type (str) – What action to perform when kv cache is full, [‘recompute’, ‘copy’], Deprecated.

  • prefill_interval (int) – Interval to perform prefill, Default 16.

  • block_size (int) – paging cache block size, default 64.

  • num_cpu_blocks (int) – Num cpu blocks. If num is 0, cache would be allocate according to current environment.

  • num_gpu_blocks (int) – Num gpu blocks. If num is 0, cache would be allocate according to current environment.

  • adapters (dict) – The path configs to lora adapters.

  • max_prefill_token_num (int) – tokens per iteration.

  • thread_safe (bool) – thread safe engine instance.

  • enable_prefix_caching (bool) – Enable token match and sharing caches.

  • device_type (str) – The inference device type, options [‘cuda’]

  • download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.

  • revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

TurbomindEngineConfig#

class lmdeploy.TurbomindEngineConfig(model_name: str | None = None, model_format: str | None = None, tp: int = 1, session_len: int | None = None, max_batch_size: int = 128, cache_max_entry_count: float = 0.8, cache_block_seq_len: int = 64, enable_prefix_caching: bool = False, quant_policy: int = 0, rope_scaling_factor: float = 0.0, use_logn_attn: bool = False, download_dir: str | None = None, revision: str | None = None, max_prefill_token_num: int = 8192, num_tokens_per_iter: int = 0, max_prefill_iters: int = 1)[source]#

TurboMind Engine config.

Parameters:
  • model_name (str) – the name of the deployed model, deprecated and has no effect when version > 0.2.1

  • model_format (str) – the layout of the deployed model. It can be one of the following values [hf, meta_llama, awq], hf meaning huggingface model(.bin, .safetensors), meta_llama being meta llama’s format(.pth), awq` meaning the quantized model by AWQ.

  • tp (int) – the number of GPU cards used in tensor parallelism, default to 1

  • session_len (int) – the max session length of a sequence, default to None

  • max_batch_size (int) – the max batch size during inference, default to 128

  • cache_max_entry_count (float) – the percentage of gpu memory occupied by the k/v cache. For versions of lmdeploy between v0.2.0 and v0.2.1, it defaults to 0.5, depicting the percentage of TOTAL GPU memory to be allocated to the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache

  • cache_block_seq_len (int) – the length of the token sequence in a k/v block, default to 64

  • enable_prefix_caching (bool) – enable cache prompts for block reuse, default to False

  • quant_policy (int) – default to 0. When k/v is quantized into 8 bit, set it to 4

  • rope_scaling_factor (int) – scaling factor used for dynamic ntk, default to 0. TurboMind follows the implementation of transformer LlamaAttention

  • use_logn_attn (bool) – whether or not to use log attn: default to False

  • download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.

  • revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

  • max_prefill_token_num (int) – the number of tokens each iteration during prefill, default to 8192

  • num_tokens_per_iter (int) – the number of tokens processed in each forward pass. Working with max_prefill_iters enables “Dynamic SplitFuse”-like scheduling

  • max_prefill_iters (int) – the max number of forward pass during prefill stage

GenerationConfig#

class lmdeploy.GenerationConfig(n: int = 1, max_new_tokens: int = 512, top_p: float = 1.0, top_k: int = 1, temperature: float = 0.8, repetition_penalty: float = 1.0, ignore_eos: bool = False, random_seed: int | None = None, stop_words: List[str] | None = None, bad_words: List[str] | None = None, min_new_tokens: int | None = None, skip_special_tokens: bool = True, logprobs: int | None = None)[source]#

generation parameters used by inference engines.

Parameters:
  • n (int) – Define how many chat completion choices to generate for each input message. Only 1 is supported now.

  • max_new_tokens (int) – The maximum number of tokens that can be generated in the chat completion

  • top_p (float) – An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass

  • top_k (int) – An alternative to sampling with temperature, where the model considers the top_k tokens with the highest probability

  • temperature (float) – Sampling temperature

  • repetition_penalty (float) – Penalty to prevent the model from generating repeated words or phrases. A value larger than 1 discourages repetition

  • ignore_eos (bool) – Indicator to ignore the eos_token_id or not

  • random_seed (int) – Seed used when sampling a token

  • stop_words (List[str]) – Words that stop generating further tokens

  • bad_words (List[str]) – Words that the engine will never generate

  • min_new_tokens (int) – The minimum numbers of tokens to generate, ignoring the number of tokens in the prompt.

  • skip_special_tokens (bool) – Whether or not to remove special tokens in the decoding. Default to be True.

  • logprobs (int) – Number of log probabilities to return per output token.

ChatTemplateConfig#

class lmdeploy.ChatTemplateConfig(model_name: str, system: str | None = None, meta_instruction: str | None = None, eosys: str | None = None, user: str | None = None, eoh: str | None = None, assistant: str | None = None, eoa: str | None = None, separator: str | None = None, capability: Literal['completion', 'infilling', 'chat', 'python'] | None = None, stop_words: List[str] | None = None)[source]#

Parameters for chat template.

Parameters:
  • model_name (str) – the name of the deployed model. Determine which chat template will be applied. All the chat template names: lmdeploy list

  • system (str | None) – begin of the system prompt

  • meta_instruction (str | None) – system prompt

  • eosys (str | None) – end of the system prompt

  • user (str | None) – begin of the user prompt

  • eoh (str | None) – end of the user prompt

  • assistant (str | None) – begin of the assistant prompt

  • eoa (str | None) – end of the assistant prompt

  • capability – (‘completion’ | ‘infilling’ | ‘chat’ | ‘python’) = None