PyTorchEngine Profiling#

We provide multiple profilers for analyzing the performance of PyTorchEngine.

PyTorch Profiler#

We have integrated the PyTorch Profiler. You can enable it by setting environment variables when launching the pipeline or API server:

# enable CPU profiling
export LMDEPLOY_PROFILE_CPU=1
# enable CUDA profiling
export LMDEPLOY_PROFILE_CUDA=1
# start profiling 3 seconds after launch
export LMDEPLOY_PROFILE_DELAY=3
# profile for 10 seconds
export LMDEPLOY_PROFILE_DURATION=10
# path prefix for the saved profile files
export LMDEPLOY_PROFILE_OUT_PREFIX="/path/to/save/profile_"

After the program exits, the profiling data is saved to files whose paths begin with LMDEPLOY_PROFILE_OUT_PREFIX, ready for performance analysis.
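If you prefer to configure the profiler from Python rather than the shell, the variables above can be passed to a child process. The following is a minimal sketch, not part of LMDeploy: the helper names `profiler_env` and `run_with_profiler` are hypothetical, and it assumes that leaving a variable unset keeps the corresponding profiler disabled.

```python
import os
import subprocess
import sys


def profiler_env(out_prefix, cpu=True, cuda=True, delay=3, duration=10, base=None):
    """Return a copy of the environment with the LMDEPLOY_PROFILE_* variables set.

    Hypothetical helper; only the variable names come from the documentation.
    """
    env = dict(os.environ if base is None else base)
    # Assumption: an unset variable means the profiler stays disabled.
    for name, enabled in (("LMDEPLOY_PROFILE_CPU", cpu),
                          ("LMDEPLOY_PROFILE_CUDA", cuda)):
        if enabled:
            env[name] = "1"
        else:
            env.pop(name, None)
    env["LMDEPLOY_PROFILE_DELAY"] = str(delay)
    env["LMDEPLOY_PROFILE_DURATION"] = str(duration)
    env["LMDEPLOY_PROFILE_OUT_PREFIX"] = out_prefix
    return env


def run_with_profiler(script, out_prefix, **kwargs):
    """Run `script` in a child Python process with profiling enabled."""
    env = profiler_env(out_prefix, **kwargs)
    return subprocess.run([sys.executable, script], env=env, check=True)
```

For example, `run_with_profiler("your_script.py", "/path/to/save/profile_", delay=5)` would launch the script with a 5-second delay before profiling starts.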

Nsight Systems#

We also support profiling NVIDIA devices with Nsight Systems.

Single GPU#

For single-GPU scenarios, simply use nsys profile:

nsys profile python your_script.py

Multi-GPU#

When using multi-GPU setups such as data, tensor, or expert parallelism (DP/TP/EP), set the following environment variables:

# enable Nsight Systems profiling
export LMDEPLOY_RAY_NSYS_ENABLE=1
# path prefix for the saved profile files
export LMDEPLOY_RAY_NSYS_OUT_PREFIX="/path/to/save/profile_"

Then launch the script or API server as usual. Do NOT wrap the command in nsys profile here.

The profiling results will be saved under LMDEPLOY_RAY_NSYS_OUT_PREFIX. If LMDEPLOY_RAY_NSYS_OUT_PREFIX is not configured, you can find the results in /tmp/ray/session_xxx/nsight.
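Because the output location is given as a path prefix rather than a directory, it can be handy to list every file that starts with it. The sketch below is illustrative only: `find_profile_outputs` is a hypothetical helper, and it makes no assumption about the file names or extensions that the profiler appends to the prefix.

```python
import glob
import os


def find_profile_outputs(prefix):
    """Return the files whose path starts with `prefix`, sorted by name.

    Hypothetical helper; it only matches on the prefix and does not
    assume any particular suffix or extension.
    """
    # glob.escape keeps special characters in the prefix from being
    # interpreted as wildcards.
    pattern = glob.escape(prefix) + "*"
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


if __name__ == "__main__":
    prefix = os.environ.get("LMDEPLOY_RAY_NSYS_OUT_PREFIX")
    if prefix:
        for path in find_profile_outputs(prefix):
            print(path)
```

The same helper works for LMDEPLOY_PROFILE_OUT_PREFIX, since both variables are described as prefixes.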

Ray timeline#

We use Ray to support multi-device deployment. You can dump the Ray timeline by setting the environment variables below.

export LMDEPLOY_RAY_TIMELINE_ENABLE=1
export LMDEPLOY_RAY_TIMELINE_OUT_PATH="/path/to/save/timeline.json"
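Once the timeline file exists, a quick summary can show where time is spent before you open it in a trace viewer. The sketch below assumes the file follows the Chrome trace event layout (a JSON array of events, or an object with a `traceEvents` array, where each event may carry `name` and `dur` fields). That layout is an assumption about the output, and `summarize_timeline` is a hypothetical helper, not an LMDeploy or Ray API.

```python
import json
from collections import defaultdict


def summarize_timeline(path, top=10):
    """Return up to `top` (event name, total duration) pairs, longest first.

    Assumes Chrome-trace-style events; durations are reported in the
    trace's own time unit without conversion.
    """
    with open(path) as f:
        data = json.load(f)
    # Accept both a bare event array and an object wrapping "traceEvents".
    events = data.get("traceEvents", []) if isinstance(data, dict) else data

    totals = defaultdict(float)
    for event in events:
        # Skip entries without a duration (e.g. metadata events).
        if isinstance(event, dict) and "dur" in event:
            totals[event.get("name", "<unnamed>")] += float(event["dur"])

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

For example, `summarize_timeline("/path/to/save/timeline.json", top=5)` would return the five event names with the largest accumulated duration.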