lmdeploy - Home


Index

C | G | P | S | T

C

  • ChatTemplateConfig (class in lmdeploy)
  • client() (in module lmdeploy)

G

  • GenerationConfig (class in lmdeploy)

P

  • pipeline() (in module lmdeploy)
  • PytorchEngineConfig (class in lmdeploy)

S

  • serve() (in module lmdeploy)

T

  • TurbomindEngineConfig (class in lmdeploy)

By LMDeploy Authors

© Copyright 2021-2024, OpenMMLab.

Last updated on Jul 26, 2024.