inference pipeline

pipeline

lmdeploy.pipeline(model_path: str, model_name: Optional[str] = None, backend_config: Optional[Union[lmdeploy.messages.TurbomindEngineConfig, lmdeploy.messages.PytorchEngineConfig]] = None, chat_template_config: Optional[lmdeploy.model.ChatTemplateConfig] = None, log_level='ERROR', **kwargs)[source]
Parameters
  • model_path (str) –

    the path of a model. It could be one of the following options:

      1. A local directory path of a turbomind model which has been
         converted by the lmdeploy convert command, or downloaded from
         the sources described in (2) and (3).

      2. The model_id of an lmdeploy-quantized model hosted inside a
         model repo on huggingface.co, such as "InternLM/internlm-chat-20b-4bit",
         "lmdeploy/llama2-chat-70b-4bit", etc.

      3. The model_id of a model hosted inside a model repo on
         huggingface.co, such as "internlm/internlm-chat-7b",
         "Qwen/Qwen-7B-Chat", "baichuan-inc/Baichuan2-7B-Chat" and so on.

  • model_name (str) – needed when model_path is a pytorch model on huggingface.co, such as "internlm/internlm-chat-7b", "Qwen/Qwen-7B-Chat", "baichuan-inc/Baichuan2-7B-Chat" and so on.

  • backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Default to None.

  • chat_template_config (ChatTemplateConfig) – chat template configuration. Default to None.

  • log_level (str) – set the log level; one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]

Examples

>>> # LLM
>>> import lmdeploy
>>> pipe = lmdeploy.pipeline('internlm/internlm-chat-7b')
>>> response = pipe(['hi','say this is a test'])
>>> print(response)
>>>
>>> # VLM
>>> from lmdeploy.vl import load_image
>>> from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
>>> pipe = pipeline('liuhaotian/llava-v1.5-7b',
...                 backend_config=TurbomindEngineConfig(session_len=8192),
...                 chat_template_config=ChatTemplateConfig(model_name='vicuna'))
>>> im = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
>>> response = pipe([('describe this image', [im])])
>>> print(response)

serving

lmdeploy.serve(model_path: str, model_name: Optional[str] = None, backend: Literal['turbomind', 'pytorch'] = 'turbomind', backend_config: Optional[Union[lmdeploy.messages.TurbomindEngineConfig, lmdeploy.messages.PytorchEngineConfig]] = None, chat_template_config: Optional[lmdeploy.model.ChatTemplateConfig] = None, server_name: str = '0.0.0.0', server_port: int = 23333, log_level: str = 'ERROR', api_keys: Optional[Union[str, List[str]]] = None, ssl: bool = False, **kwargs)[source]

This will run the api_server in a subprocess.

Parameters
  • model_path (str) –

    the path of a model. It could be one of the following options:

      1. A local directory path of a turbomind model which has been
         converted by the lmdeploy convert command, or downloaded from
         the sources described in (2) and (3).

      2. The model_id of an lmdeploy-quantized model hosted inside a
         model repo on huggingface.co, such as "InternLM/internlm-chat-20b-4bit",
         "lmdeploy/llama2-chat-70b-4bit", etc.

      3. The model_id of a model hosted inside a model repo on
         huggingface.co, such as "internlm/internlm-chat-7b",
         "Qwen/Qwen-7B-Chat", "baichuan-inc/Baichuan2-7B-Chat" and so on.

  • model_name (str) – needed when model_path is a pytorch model on huggingface.co, such as "internlm/internlm-chat-7b", "Qwen/Qwen-7B-Chat", "baichuan-inc/Baichuan2-7B-Chat" and so on.

  • backend (str) – either turbomind or pytorch backend. Default to turbomind backend.

  • backend_config (TurbomindEngineConfig | PytorchEngineConfig) – backend config instance. Default to None.

  • chat_template_config (ChatTemplateConfig) – chat template configuration. Default to None.

  • server_name (str) – the host IP address that the server binds to

  • server_port (int) – the port that the api_server listens on

  • log_level (str) – set the log level; one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]

  • api_keys (List[str] | str | None) – Optional list of API keys. Accepts string type as a single api_key. Default to None, which means no api key applied.

  • ssl (bool) – Enable SSL. Requires OS Environment variables ‘SSL_KEYFILE’ and ‘SSL_CERTFILE’.

Returns

A client chatbot for LLaMA series models.

Return type

APIClient

Examples

>>> import lmdeploy
>>> client = lmdeploy.serve('internlm/internlm-chat-7b', 'internlm-chat-7b')
>>> for output in client.chat('hi', 1):
...    print(output)

lmdeploy.client(api_server_url: str = 'http://0.0.0.0:23333', api_key: Optional[str] = None, **kwargs)[source]
Parameters
  • api_server_url (str) – the address of the api_server, in the form 'http://<ip>:<port>'

  • api_key (str | None) – api key. Default to None, which means no api key will be used.

Returns

Chatbot for LLaMA series models with turbomind as inference engine.
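
For illustration only, a minimal sketch of talking to a running api_server through lmdeploy.client. The available_models property and chat_completions_v1 method used below are assumptions drawn from the api_server's APIClient and may differ across versions:

>>> import lmdeploy
>>> # assumes an api_server is already running at this address
>>> client = lmdeploy.client(api_server_url='http://0.0.0.0:23333')
>>> model_name = client.available_models[0]
>>> messages = [{'role': 'user', 'content': 'hi'}]
>>> for item in client.chat_completions_v1(model=model_name, messages=messages):
...     print(item)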

PytorchEngineConfig

class lmdeploy.PytorchEngineConfig(model_name: str = '', tp: int = 1, session_len: Optional[int] = None, max_batch_size: int = 128, cache_max_entry_count: float = 0.8, eviction_type: str = 'recompute', prefill_interval: int = 16, block_size: int = 64, num_cpu_blocks: int = 0, num_gpu_blocks: int = 0, adapters: Optional[Dict[str, str]] = None, max_prefill_token_num: int = 4096, thread_safe: bool = False, enable_prefix_caching: bool = False, download_dir: Optional[str] = None, revision: Optional[str] = None)[source]

PyTorch Engine Config.

Parameters
  • model_name (str) – name of the given model.

  • tp (int) – Tensor Parallelism. default 1.

  • session_len (int) – Max session length. Default None.

  • max_batch_size (int) – Max batch size. Default 128.

  • cache_max_entry_count (float) – the percentage of gpu memory occupied by the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache

  • eviction_type (str) – What action to perform when kv cache is full, [‘recompute’, ‘copy’], Default ‘recompute’.

  • prefill_interval (int) – Interval to perform prefill, Default 16.

  • block_size (int) – paging cache block size, default 64.

  • num_cpu_blocks (int) – number of cpu blocks. If it is 0, the cache will be allocated according to the current environment.

  • num_gpu_blocks (int) – number of gpu blocks. If it is 0, the cache will be allocated according to the current environment.

  • adapters (dict) – a dict that maps adapter names to the paths of LoRA adapters.

  • max_prefill_token_num (int) – the maximum number of tokens processed in each prefill iteration.

  • thread_safe (bool) – whether to create a thread-safe engine instance.

  • enable_prefix_caching (bool) – enable prefix matching so that k/v caches of shared prompt prefixes can be reused.

  • download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.

  • revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
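
As an illustrative sketch (the values below are examples, not recommendations), a PytorchEngineConfig is passed to pipeline through its backend_config argument:

>>> from lmdeploy import pipeline, PytorchEngineConfig
>>> # example values; tune tp, session_len and cache_max_entry_count for your hardware
>>> backend_config = PytorchEngineConfig(tp=1,
...                                      session_len=4096,
...                                      cache_max_entry_count=0.6)
>>> pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
>>> print(pipe(['hi']))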

TurbomindEngineConfig

class lmdeploy.TurbomindEngineConfig(model_name: Optional[str] = None, model_format: Optional[str] = None, tp: int = 1, session_len: Optional[int] = None, max_batch_size: int = 128, cache_max_entry_count: float = 0.8, cache_block_seq_len: int = 64, enable_prefix_caching: bool = False, quant_policy: int = 0, rope_scaling_factor: float = 0.0, use_logn_attn: bool = False, download_dir: Optional[str] = None, revision: Optional[str] = None, max_prefill_token_num: int = 8192, num_tokens_per_iter: int = 0, max_prefill_iters: int = 1)[source]

TurboMind Engine config.

Parameters
  • model_name (str) – the name of the deployed model. Deprecated; it has no effect for lmdeploy versions later than v0.2.1.

  • model_format (str) – the layout of the deployed model. It can be one of [hf, llama, awq]: hf means a huggingface llama-style model, llama means a meta llama checkpoint, and awq means a model quantized by AWQ.

  • tp (int) – the number of GPU cards used in tensor parallelism, default to 1

  • session_len (int) – the max session length of a sequence, default to None

  • max_batch_size (int) – the max batch size during inference, default to 128

  • cache_max_entry_count (float) – the percentage of gpu memory occupied by the k/v cache. For versions of lmdeploy between v0.2.0 and v0.2.1, it defaults to 0.5, depicting the percentage of TOTAL GPU memory to be allocated to the k/v cache. For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache

  • cache_block_seq_len (int) – the length of the token sequence in a k/v block, default to 64

  • enable_prefix_caching (bool) – enable cache prompts for block reuse, default to False

  • quant_policy (int) – default to 0. When k/v is quantized into 8 bit, set it to 4

  • rope_scaling_factor (float) – scaling factor used for dynamic ntk, default to 0.0. TurboMind follows the implementation of transformers LlamaAttention.

  • use_logn_attn (bool) – whether to use log-n attention scaling, default to False.

  • download_dir (str) – Directory to download and load the weights, default to the default cache directory of huggingface.

  • revision (str) – The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

  • max_prefill_token_num (int) – the number of tokens each iteration during prefill, default to 8192

  • num_tokens_per_iter (int) – the number of tokens processed in each forward pass. Working with max_prefill_iters enables “Dynamic SplitFuse”-like scheduling

  • max_prefill_iters (int) – the max number of forward pass during prefill stage
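
A minimal sketch of using TurbomindEngineConfig with pipeline; the values below are illustrative only:

>>> from lmdeploy import pipeline, TurbomindEngineConfig
>>> # example values; a larger session_len raises the max sequence length,
>>> # a smaller cache_max_entry_count leaves more free GPU memory to the model
>>> backend_config = TurbomindEngineConfig(tp=1,
...                                        session_len=8192,
...                                        cache_max_entry_count=0.5)
>>> pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
>>> print(pipe(['say this is a test']))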

GenerationConfig

class lmdeploy.GenerationConfig(n: int = 1, max_new_tokens: int = 512, top_p: float = 1.0, top_k: int = 1, temperature: float = 0.8, repetition_penalty: float = 1.0, ignore_eos: bool = False, random_seed: Optional[int] = None, stop_words: Optional[List[str]] = None, bad_words: Optional[List[str]] = None, min_new_tokens: Optional[int] = None, skip_special_tokens: bool = True, logprobs: Optional[int] = None)[source]

Generation parameters used by inference engines.

Parameters
  • n (int) – Define how many chat completion choices to generate for each input message

  • max_new_tokens (int) – The maximum number of tokens that can be generated in the chat completion

  • top_p (float) – An alternative to sampling with temperature, called nucleus sampling, where the model considers only the smallest set of tokens whose cumulative probability mass reaches top_p

  • top_k (int) – An alternative to sampling with temperature, where the model considers the top_k tokens with the highest probability

  • temperature (float) – Sampling temperature

  • repetition_penalty (float) – Penalty to prevent the model from generating repeated words or phrases. A value larger than 1 discourages repetition

  • ignore_eos (bool) – Indicator to ignore the eos_token_id or not

  • random_seed (int) – Seed used when sampling a token

  • stop_words (List[str]) – Words that stop generating further tokens

  • bad_words (List[str]) – Words that the engine will never generate

  • min_new_tokens (int) – The minimum numbers of tokens to generate, ignoring the number of tokens in the prompt.

  • skip_special_tokens (bool) – Whether or not to remove special tokens in the decoding. Default to be True.

  • logprobs (int) – Number of log probabilities to return per output token.
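
For illustration, a sketch of passing a GenerationConfig to a pipeline call, assuming the call accepts it via the gen_config argument; the sampling values are arbitrary examples:

>>> from lmdeploy import pipeline, GenerationConfig
>>> pipe = pipeline('internlm/internlm-chat-7b')
>>> # arbitrary example sampling settings
>>> gen_config = GenerationConfig(max_new_tokens=256,
...                               top_p=0.8,
...                               top_k=40,
...                               temperature=0.7)
>>> response = pipe(['hi'], gen_config=gen_config)
>>> print(response)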

ChatTemplateConfig

class lmdeploy.ChatTemplateConfig(model_name: str, system: Optional[str] = None, meta_instruction: Optional[str] = None, eosys: Optional[str] = None, user: Optional[str] = None, eoh: Optional[str] = None, assistant: Optional[str] = None, eoa: Optional[str] = None, separator: Optional[str] = None, capability: Optional[Literal['completion', 'infilling', 'chat', 'python']] = None, stop_words: Optional[List[str]] = None)[source]

Parameters for chat template.

Parameters
  • model_name (str) – the name of the deployed model. Determine which chat template will be applied. All the chat template names: lmdeploy list

  • system (str | None) – begin of the system prompt

  • meta_instruction (str | None) – system prompt

  • eosys (str | None) – end of the system prompt

  • user (str | None) – begin of the user prompt

  • eoh (str | None) – end of the user prompt

  • assistant (str | None) – begin of the assistant prompt

  • eoa (str | None) – end of the assistant prompt

  • capability (str | None) – one of 'completion', 'infilling', 'chat', 'python'. Default to None
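
As a sketch, a ChatTemplateConfig can override the chat template that pipeline would otherwise infer from the model. The 'vicuna' template name mirrors the VLM example above; the meta_instruction value is an illustrative assumption:

>>> from lmdeploy import pipeline, ChatTemplateConfig
>>> # pick a built-in template by name (see lmdeploy list) and override its system prompt
>>> chat_template_config = ChatTemplateConfig(model_name='vicuna',
...                                           meta_instruction='You are a helpful assistant.')
>>> pipe = pipeline('liuhaotian/llava-v1.5-7b',
...                 chat_template_config=chat_template_config)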
