What is DeepEP?
DeepEP is a communication library dedicated to MoE dispatch and combine operations, with support for low-precision operations including FP8. DeepEP is specially optimised for the group-limited gating algorithm proposed in the DeepSeek-V3 paper, and provides a set of kernels for asymmetric-domain bandwidth forwarding (e.g., from the NVLink domain to the RDMA domain). These kernels deliver high throughput for training and inference prefilling, and also support streaming multiprocessor (SM) count control. For latency-sensitive inference decoding, DeepEP includes a set of pure-RDMA low-latency kernels to minimise latency. The library also introduces a hook-based communication-computation overlapping method that occupies no SM resources, further improving efficiency.
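To make the terminology concrete, the sketch below expresses the dispatch/combine pattern with plain torch.distributed collectives. This is a minimal illustration of the communication pattern only, not DeepEP's actual API; the function names and split-size arguments are illustrative assumptions.

```python
# A minimal sketch of the MoE dispatch/combine pattern, expressed with
# plain torch.distributed collectives -- NOT DeepEP's actual API.
import torch
import torch.distributed as dist

def dispatch(tokens: torch.Tensor,
             send_counts: list[int], recv_counts: list[int]) -> torch.Tensor:
    # Dispatch: each expert-parallel rank sends every token to the rank
    # hosting its selected expert -- an all-to-all with uneven splits.
    out = tokens.new_empty((sum(recv_counts), tokens.size(1)))
    dist.all_to_all_single(out, tokens,
                           output_split_sizes=recv_counts,
                           input_split_sizes=send_counts)
    return out

def combine(expert_out: torch.Tensor,
            send_counts: list[int], recv_counts: list[int]) -> torch.Tensor:
    # Combine: the reverse all-to-all, returning expert outputs to the ranks
    # that own the original tokens, where they are summed with gate weights.
    out = expert_out.new_empty((sum(recv_counts), expert_out.size(1)))
    dist.all_to_all_single(out, expert_out,
                           output_split_sizes=recv_counts,
                           input_split_sizes=send_counts)
    return out
```

DeepEP replaces this generic all-to-all with kernels tuned for NVLink-to-RDMA forwarding, which is what the benchmark tables below measure.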
What is a Mixture-of-Experts (MoE) model?
A Mixture-of-Experts model is a neural network architecture that combines multiple ‘expert’ networks, with a ‘gating’ network deciding which experts each input token is routed to. This design lets the model grow substantially in parameter count while remaining computationally efficient, since only a few experts are activated for each input rather than the entire network. The MoE concept was first proposed by Jacobs, Jordan, and Hinton in 1991, but was not widely used in large-scale language models until recent years. MoE architectures power Google’s Switch Transformers, Microsoft’s Z-Code, and DeepSeek’s DeepSeek-V3, enabling larger models to be trained and deployed through sparse activation of experts.
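As a concrete illustration of the gating step, here is a minimal top-k router in plain PyTorch. The shapes and names (`w_gate`, the expert count of 256) are illustrative assumptions, not DeepSeek-V3’s actual router, which additionally applies the group-limited selection mentioned above.

```python
# Hedged sketch of top-k expert gating (the 'gating' network above).
import torch
import torch.nn.functional as F

def topk_gate(x: torch.Tensor, w_gate: torch.Tensor, k: int = 8):
    # x: [num_tokens, hidden]; w_gate: [hidden, num_experts]
    logits = x @ w_gate                     # router score per expert
    topk_val, topk_idx = logits.topk(k, dim=-1)
    weights = F.softmax(topk_val, dim=-1)   # normalise over chosen experts
    # Only the k selected experts run for each token -- this sparsity is
    # what keeps compute roughly constant as the expert count grows.
    return weights, topk_idx

x = torch.randn(4, 7168)            # 4 tokens, hidden size 7168
w = torch.randn(7168, 256)          # 256 experts (illustrative)
weights, idx = topk_gate(x, w)
print(weights.shape, idx.shape)     # torch.Size([4, 8]) torch.Size([4, 8])
```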
DeepEP Performance
Normal kernels with NVLink and RDMA forwarding
We test normal kernels on H800 (~160 GB/s NVLink maximum bandwidth), with each GPU connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following the DeepSeek-V3/R1 pretraining setting (4096 tokens per batch, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatching and BF16 combining).
| Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
| --- | --- | --- | --- | --- |
| Intranode | 8 | 153 GB/s (NVLink) | 8 | 158 GB/s (NVLink) |
| Internode | 16 | 43 GB/s (RDMA) | 16 | 43 GB/s (RDMA) |
| Internode | 32 | 44 GB/s (RDMA) | 32 | 47 GB/s (RDMA) |
| Internode | 64 | 46 GB/s (RDMA) | 64 | 45 GB/s (RDMA) |
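As a rough sense of scale for this setting, the snippet below estimates the raw payload a single batch generates, under the simplifying assumption that every token is sent to all top-8 experts and that FP8 scaling factors, routing indices, and other metadata are ignored.

```python
# Back-of-the-envelope payload sizes for the pretraining setting above.
tokens, hidden, topk = 4096, 7168, 8
fp8, bf16 = 1, 2  # bytes per element

dispatch_bytes = tokens * topk * hidden * fp8   # tokens sent in FP8
combine_bytes  = tokens * topk * hidden * bf16  # outputs returned in BF16

print(f"dispatch: {dispatch_bytes / 2**20:.0f} MiB per batch")  # 224 MiB
print(f"combine:  {combine_bytes / 2**20:.0f} MiB per batch")   # 448 MiB
```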
Low-latency kernels with pure RDMA
We test low-latency kernels on H800, with each GPU connected to a CX7 InfiniBand 400 Gb/s RDMA network card (~50 GB/s maximum bandwidth), following a typical DeepSeek-V3/R1 production setting (128 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatching and BF16 combining).
| Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
| --- | --- | --- | --- | --- | --- |
| 8 | 163 us | 46 GB/s | 8 | 318 us | 46 GB/s |
| 16 | 173 us | 43 GB/s | 16 | 329 us | 44 GB/s |
| 32 | 182 us | 41 GB/s | 32 | 350 us | 41 GB/s |
| 64 | 186 us | 40 GB/s | 64 | 353 us | 41 GB/s |
| 128 | 192 us | 39 GB/s | 128 | 369 us | 39 GB/s |
| 256 | 194 us | 39 GB/s | 256 | 360 us | 40 GB/s |
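The latency and bandwidth columns are consistent with each other under a simple model: assume each of the 128 tokens is sent to all top-8 experts, with FP8 (1 byte/element) on dispatch and BF16 (2 bytes/element) on combine, and divide payload by latency. The check below is illustrative arithmetic, not part of DeepEP, and ignores metadata overhead.

```python
# Reconstruct the reported RDMA bandwidth from the latency (EP=8 row).
tokens, hidden, topk = 128, 7168, 8

def bandwidth_gbps(bytes_per_elem: int, latency_us: float) -> float:
    payload = tokens * topk * hidden * bytes_per_elem  # bytes per batch
    return payload / (latency_us * 1e-6) / 1e9         # bytes/s -> GB/s

print(f"dispatch @ 163 us: {bandwidth_gbps(1, 163):.0f} GB/s")  # ~45 (table: 46)
print(f"combine  @ 318 us: {bandwidth_gbps(2, 318):.0f} GB/s")  # ~46 (table: 46)
```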