Features of DeepEP

1. Professional Distributed Communication Framework

DeepEP is a distributed communication framework built specifically for Mixture of Experts (MoE) and expert parallelism (EP) scenarios. It provides high-throughput, low-latency all-to-all GPU communication kernels that implement MoE dispatch and combine operations.

2. High-Performance Architecture

For latency-sensitive inference decoding, DeepEP includes a set of pure-RDMA low-latency kernels. It also introduces a hook-based communication-computation overlap method that does not occupy any SM resources, so communication can be hidden behind computation.

3. Innovative Technology

DeepEP supports low-precision operation, including FP8, and is optimized for the group-limited gating algorithm proposed in DeepSeek-V3. It also provides kernels for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain, delivering high throughput in training and inference prefilling tasks.

What is DeepEP?

DeepEP is a communication library dedicated to MoE dispatch and combine operations, and it supports low-precision operations including FP8. It is specially optimized for the group-limited gating algorithm proposed in the DeepSeek-V3 paper, and it provides a set of high-performance kernels for asymmetric-domain bandwidth forwarding (e.g., from the NVLink domain to the RDMA domain). These kernels deliver high throughput for training and inference prefilling tasks and also support streaming multiprocessor (SM) count control. For latency-sensitive inference decoding, DeepEP includes a set of pure-RDMA low-latency kernels to minimize latency. The library also introduces a hook-based communication-computation overlap method that does not consume any SM resources, further improving efficiency.
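
DeepEP's actual Python interface is documented in the repository; as a conceptual sketch only, the plain-PyTorch code below (the helper names ep_dispatch and ep_combine are hypothetical, not DeepEP functions) shows the dispatch-then-combine communication pattern that the library's kernels accelerate with fused NVLink/RDMA transfers.

# Illustrative sketch only: NOT DeepEP's API. A bare-bones expert-parallel
# dispatch/combine built from torch.distributed.all_to_all_single; DeepEP
# replaces this pattern with fused, FP8-aware NVLink/RDMA kernels.
# Assumes torch.distributed is already initialized (e.g., via torchrun).
import torch
import torch.distributed as dist

def ep_dispatch(x, topk_idx, num_experts):
    """Send each (token, selected expert) copy to the rank owning that expert.
    x: [num_tokens, hidden]; topk_idx: [num_tokens, top_k] global expert ids."""
    world_size = dist.get_world_size()
    experts_per_rank = num_experts // world_size

    # One row per (token, selected expert), grouped by destination rank.
    dst_rank = (topk_idx // experts_per_rank).flatten()
    order = torch.argsort(dst_rank)
    rows = x.repeat_interleave(topk_idx.shape[1], dim=0)[order]

    # Exchange per-rank counts so every rank knows how much it will receive.
    send_counts = torch.bincount(dst_rank, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # The dispatch itself: an all-to-all with variable split sizes.
    # (A full layer would also exchange each row's expert id so the receiver
    # can batch rows per local expert; omitted here for brevity.)
    recv_rows = x.new_empty((int(recv_counts.sum()), x.shape[1]))
    dist.all_to_all_single(recv_rows, rows,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return recv_rows, order, send_counts, recv_counts

def ep_combine(expert_out, order, send_counts, recv_counts,
               num_tokens, top_k, topk_weights):
    """Return expert outputs to their source ranks and reduce per token."""
    # Reverse all-to-all: what was received during dispatch now flows back.
    back_rows = expert_out.new_empty((int(send_counts.sum()), expert_out.shape[1]))
    dist.all_to_all_single(back_rows, expert_out,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())

    # Undo the destination-rank sort, then apply the gate-weighted sum.
    unsort = torch.argsort(order)
    back_rows = back_rows[unsort].view(num_tokens, top_k, -1)
    return (back_rows * topk_weights.unsqueeze(-1)).sum(dim=1)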

What is a Mixture of Experts (MoE) Model?

A Mixture of Experts model is a neural network architecture that combines multiple ‘expert’ networks, with a ‘gating’ network deciding which experts each input is routed to. This design allows the model to grow substantially in parameter count while remaining computationally efficient, since only a few experts are activated for each token rather than the entire network. The MoE concept was first proposed by Jacobs, Jordan, and Hinton in 1991, but it was not widely used in large-scale language models until recent years. MoE architectures are used by Google’s Switch Transformers, Microsoft’s Z-Code, and DeepSeek’s DeepSeek-V3, enabling larger-scale model training and deployment through sparse activation of experts.
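
As a minimal illustration of the idea (a generic top-k gate, not DeepSeek-V3's group-limited router and not DeepEP code), the sketch below scores experts with a linear gate, keeps the top-k experts per token, and mixes the selected experts' outputs with the normalized gate weights.

# A minimal top-k MoE gating sketch in PyTorch (for exposition only).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, hidden=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, num_experts, bias=False)   # the 'gating' network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):                                        # x: [num_tokens, hidden]
        scores = self.gate(x).softmax(dim=-1)                    # [num_tokens, num_experts]
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)       # keep only the top-k experts
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)       # renormalize gate weights

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):                # only selected experts run
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += topk_w[token_ids, slot, None] * expert(x[token_ids])
        return out

tokens = torch.randn(16, 64)
print(TinyMoE()(tokens).shape)    # torch.Size([16, 64])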

DeepEP Performance

Normal kernels with NVLink and RDMA forwarding

We test the normal kernels on H800 GPUs (~160 GB/s maximum NVLink bandwidth), each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following the DeepSeek-V3/R1 pretraining setting: 4096 tokens per batch, 7168 hidden size, top-4 groups, top-8 experts, FP8 dispatch and BF16 combine.

Type      | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth
Intranode | 8            | 153 GB/s (NVLink)    | 8           | 158 GB/s (NVLink)
Internode | 16           | 43 GB/s (RDMA)       | 16          | 43 GB/s (RDMA)
Internode | 32           | 44 GB/s (RDMA)       | 32          | 47 GB/s (RDMA)
Internode | 64           | 46 GB/s (RDMA)       | 64          | 45 GB/s (RDMA)
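
For intuition, the payload each routed token contributes in this setting follows directly from the precision: FP8 dispatch sends 1 byte per hidden element and BF16 combine returns 2 bytes per element (a rough figure that ignores routing indices and FP8 scaling factors).

# Approximate per-token payload in the pretraining setting above
# (ignores routing metadata and FP8 per-block scaling factors).
hidden = 7168
dispatch_bytes_per_token = hidden * 1   # FP8: 1 byte/element -> 7,168 B per token copy
combine_bytes_per_token = hidden * 2    # BF16: 2 bytes/element -> 14,336 B per token copy
print(dispatch_bytes_per_token, combine_bytes_per_token)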

Low-latency kernels with pure RDMA

We test the low-latency kernels on H800 GPUs, each connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following a typical DeepSeek-V3/R1 production setting: 128 tokens per batch, 7168 hidden size, top-8 experts, FP8 dispatch and BF16 combine.

Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth
8            | 163 us  | 46 GB/s        | 8           | 318 us  | 46 GB/s
16           | 173 us  | 43 GB/s        | 16          | 329 us  | 44 GB/s
32           | 182 us  | 41 GB/s        | 32          | 350 us  | 41 GB/s
64           | 186 us  | 40 GB/s        | 64          | 353 us  | 41 GB/s
128          | 192 us  | 39 GB/s        | 128         | 369 us  | 39 GB/s
256          | 194 us  | 39 GB/s        | 256         | 360 us  | 40 GB/s
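
These figures are roughly self-consistent: dividing the implied payload by the measured latency reproduces the reported RDMA bandwidth. A back-of-the-envelope check for the 8-EP row, assuming each of the 128 tokens is sent to all 8 of its top experts and ignoring headers, routing metadata and scaling factors:

# Rough sanity check for the 8-EP row (approximate).
tokens, hidden, top_k = 128, 7168, 8

dispatch_bytes = tokens * top_k * hidden * 1   # FP8: ~7.3 MB per rank
combine_bytes = tokens * top_k * hidden * 2    # BF16: ~14.7 MB per rank

print(dispatch_bytes / 163e-6 / 1e9)           # ~45 GB/s vs. 46 GB/s measured
print(combine_bytes / 318e-6 / 1e9)            # ~46 GB/s vs. 46 GB/s measured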

How to use DeepEP?

Using DeepEP requires Hopper GPUs, Python 3.8+, CUDA 12.3+, and PyTorch 2.1+, as well as NVLink for intra-node communication and an RDMA network for inter-node communication. The library depends on a modified version of NVSHMEM, which must be installed and configured before building DeepEP.

Development

# Build and make symbolic links for SO files
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py build
# You may modify the specific SO names according to your own platform
ln -s build/lib.linux-x86_64-cpython-38/deep_ep_cpp.cpython-38-x86_64-linux-gnu.so

# Run test cases
# NOTES: you may modify the `init_dist` function in `tests/utils.py`
# according to your own cluster settings, and launch into multiple nodes
python tests/test_intranode.py
python tests/test_internode.py
python tests/test_low_latency.py

Installation

NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install

More DeepEP: https://github.com/deepseek-ai/DeepEP