What is DeepEP?
DeepEP is a communication library dedicated to MoE dispatch and combine operations, with support for low-precision operations including FP8. DeepEP is specially optimised for the group-limited gating algorithm proposed in the DeepSeek-V3 paper, and provides a set of kernels for asymmetric-domain bandwidth forwarding (e.g., from the NVLink domain to the RDMA domain). These kernels deliver high throughput for training and inference prefilling, and also support streaming multiprocessor (SM) count control. For latency-sensitive inference decoding, DeepEP includes a set of pure-RDMA low-latency kernels to minimise latency. The library also introduces a hook-based communication-computation overlapping method that occupies no SM resources, further improving efficiency.
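To make the terminology concrete, the sketch below expresses the dispatch/combine pattern with plain torch.distributed collectives. This is a minimal illustration of the communication pattern only, not DeepEP's actual API; the function names and split-size arguments are illustrative assumptions.

```python
# A minimal sketch of the MoE dispatch/combine pattern, expressed with
# plain torch.distributed collectives -- NOT DeepEP's actual API.
import torch
import torch.distributed as dist

def dispatch(tokens: torch.Tensor,
             send_counts: list[int], recv_counts: list[int]) -> torch.Tensor:
    # Dispatch: each expert-parallel rank sends every token to the rank
    # hosting its selected expert -- an all-to-all with uneven splits.
    out = tokens.new_empty((sum(recv_counts), tokens.size(1)))
    dist.all_to_all_single(out, tokens,
                           output_split_sizes=recv_counts,
                           input_split_sizes=send_counts)
    return out

def combine(expert_out: torch.Tensor,
            send_counts: list[int], recv_counts: list[int]) -> torch.Tensor:
    # Combine: the reverse all-to-all, returning expert outputs to the ranks
    # that own the original tokens, where they are summed with gate weights.
    out = expert_out.new_empty((sum(recv_counts), expert_out.size(1)))
    dist.all_to_all_single(out, expert_out,
                           output_split_sizes=recv_counts,
                           input_split_sizes=send_counts)
    return out
```

DeepEP replaces this generic all-to-all with kernels tuned for NVLink-to-RDMA forwarding, which is what the benchmark tables below measure.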
What is a Mixture-of-Experts (MoE) model?
A Mixture-of-Experts model is a neural network architecture that combines multiple ‘expert’ networks, with a ‘gating’ network deciding which experts each input token is routed to. This design lets the model grow substantially in parameter count while remaining computationally efficient, since only a few experts are activated for each input rather than the entire network. The MoE concept was first proposed by Jacobs, Jordan, and Hinton in 1991, but was not widely used in large-scale language models until recent years. MoE architectures power Google’s Switch Transformers, Microsoft’s Z-Code, and DeepSeek’s DeepSeek-V3, enabling larger models to be trained and deployed through sparse activation of experts.
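As a concrete illustration of the gating step, here is a minimal top-k router in plain PyTorch. The shapes and names (`w_gate`, the expert count of 256) are illustrative assumptions, not DeepSeek-V3’s actual router, which additionally applies the group-limited selection mentioned above.

```python
# Hedged sketch of top-k expert gating (the 'gating' network above).
import torch
import torch.nn.functional as F

def topk_gate(x: torch.Tensor, w_gate: torch.Tensor, k: int = 8):
    # x: [num_tokens, hidden]; w_gate: [hidden, num_experts]
    logits = x @ w_gate                     # router score per expert
    topk_val, topk_idx = logits.topk(k, dim=-1)
    weights = F.softmax(topk_val, dim=-1)   # normalise over chosen experts
    # Only the k selected experts run for each token -- this sparsity is
    # what keeps compute roughly constant as the expert count grows.
    return weights, topk_idx

x = torch.randn(4, 7168)            # 4 tokens, hidden size 7168
w = torch.randn(7168, 256)          # 256 experts (illustrative)
weights, idx = topk_gate(x, w)
print(weights.shape, idx.shape)     # torch.Size([4, 8]) torch.Size([4, 8])
```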
DeepEP Performance
Normal kernels with NVLink and RDMA forwarding
We test normal kernels on H800 (~160 GB/s NVLink maximum bandwidth), with each GPU connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following the DeepSeek-V3/R1 pretraining setting (4096 tokens per batch, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatching and BF16 combining).
| Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
| --- | --- | --- | --- | --- |
| Intranode | 8 | 153 GB/s (NVLink) | 8 | 158 GB/s (NVLink) |
| Internode | 16 | 43 GB/s (RDMA) | 16 | 43 GB/s (RDMA) |
| Internode | 32 | 44 GB/s (RDMA) | 32 | 47 GB/s (RDMA) |
| Internode | 64 | 46 GB/s (RDMA) | 64 | 45 GB/s (RDMA) |
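As a rough sense of scale for this setting, the snippet below estimates the raw payload a single batch generates, under the simplifying assumption that every token is sent to all top-8 experts and that FP8 scaling factors, routing indices, and other metadata are ignored.

```python
# Back-of-the-envelope payload sizes for the pretraining setting above.
tokens, hidden, topk = 4096, 7168, 8
fp8, bf16 = 1, 2  # bytes per element

dispatch_bytes = tokens * topk * hidden * fp8   # tokens sent in FP8
combine_bytes  = tokens * topk * hidden * bf16  # outputs returned in BF16

print(f"dispatch: {dispatch_bytes / 2**20:.0f} MiB per batch")  # 224 MiB
print(f"combine:  {combine_bytes / 2**20:.0f} MiB per batch")   # 448 MiB
```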
Low-latency kernels with pure RDMA
We test low-latency kernels on H800, with each GPU connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following a typical DeepSeek-V3/R1 production setting (128 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatching and BF16 combining).
| Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
| --- | --- | --- | --- | --- | --- |
| 8 | 163 us | 46 GB/s | 8 | 318 us | 46 GB/s |
| 16 | 173 us | 43 GB/s | 16 | 329 us | 44 GB/s |
| 32 | 182 us | 41 GB/s | 32 | 350 us | 41 GB/s |
| 64 | 186 us | 40 GB/s | 64 | 353 us | 41 GB/s |
| 128 | 192 us | 39 GB/s | 128 | 369 us | 39 GB/s |
| 256 | 194 us | 39 GB/s | 256 | 360 us | 40 GB/s |
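The latency and bandwidth columns are consistent with each other under a simple model: assume each of the 128 tokens is sent to all top-8 experts, with FP8 (1 byte/element) on dispatch and BF16 (2 bytes/element) on combine, and divide payload by latency. The check below is illustrative arithmetic, not part of DeepEP, and ignores metadata overhead.

```python
# Reconstruct the reported RDMA bandwidth from the latency (EP=8 row).
tokens, hidden, topk = 128, 7168, 8

def bandwidth_gbps(bytes_per_elem: int, latency_us: float) -> float:
    payload = tokens * topk * hidden * bytes_per_elem  # bytes per batch
    return payload / (latency_us * 1e-6) / 1e9         # bytes/s -> GB/s

print(f"dispatch @ 163 us: {bandwidth_gbps(1, 163):.0f} GB/s")  # ~45 (table: 46)
print(f"combine  @ 318 us: {bandwidth_gbps(2, 318):.0f} GB/s")  # ~46 (table: 46)
```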