Get Started#

LMDeploy offers functionalities such as model quantization, offline batch inference, online serving, etc. Each function can be completed with just a few simple lines of code or commands.

Installation#

Install lmdeploy with pip (python 3.8+) or from source

pip install lmdeploy

The default prebuilt package is compiled on CUDA 12. However, if CUDA 11+ is required, you can install lmdeploy by:

export LMDEPLOY_VERSION=0.5.2
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118

Offline batch inference#

import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2_5-7b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

For more information on inference pipeline parameters, please refer to here.

Serving#

LMDeploy offers various serving methods, choosing one that best meet your requirements.

Quantization#

LMDeploy provides the following quantization methods. Please visit the following links for the detailed guide

Useful Tools#

LMDeploy CLI offers the following utilities, helping users experience LLM features conveniently

Inference with Command line Interface#

lmdeploy chat internlm/internlm2_5-7b-chat

Serving with Web UI#

LMDeploy adopts gradio to develop the online demo.

# install dependencies
pip install lmdeploy[serve]
# launch gradio server
lmdeploy serve gradio internlm/internlm2_5-7b-chat