DeepSeek Open Source Week Covered!
DeepSeek Libraries
DeepSeek released a number of libraries during their “open-source week.” These libraries are especially useful for training large-scale models, and DeepSeek used them in developing their own models.
I covered the paper itself in a previous newsletter; be sure to check that out first if you have not:
Deep Seek Deep Dive
Now, let’s dive into the libraries!
Data
Smallpond is a lightweight data processing framework built on DuckDB and 3FS.
Features
🚀 High-performance data processing powered by DuckDB
🌍 Scalable to handle PB-scale datasets
🛠️ Easy operations with no long-running services
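To give a feel for the API, here is a minimal smallpond pipeline, adapted from the usage example in the project’s README (method names such as `partial_sql` are from my recollection of that README, so verify against the repo):

```python
import smallpond

# Initialize a session; by default smallpond runs locally on top of DuckDB.
sp = smallpond.init()

# Lazily load a Parquet dataset and partition it by a key column.
df = sp.read_parquet("prices.parquet")
df = df.repartition(3, hash_by="ticker")

# Run SQL per partition via DuckDB; {0} refers to the input dataframe.
df = sp.partial_sql(
    "SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Trigger execution and write the results back out as Parquet.
df.write_parquet("output/")
```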
Profile-Data is a repository of profiling data from DeepSeek’s training and inference frameworks, shared to help the community better understand their communication-computation overlap strategies and low-level implementation details. The profiling data was captured using the PyTorch Profiler. After downloading, you can visualize it directly by navigating to chrome://tracing in the Chrome browser (or edge://tracing in the Edge browser). Note that the profiles were generated assuming a perfectly balanced MoE routing strategy.
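For reference, this is how such Chrome-loadable traces are typically produced with the PyTorch Profiler. This is a generic sketch (it assumes a CUDA GPU is available), not DeepSeek’s actual profiling harness:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Record both CPU-side ops and the CUDA kernels they launch.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# Export in the Chrome trace format; open the file via chrome://tracing.
prof.export_chrome_trace("trace.json")
```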
The Fire-Flyer File System (3FS) is a high-performance distributed file system designed to address the challenges of AI training and inference workloads. It leverages modern SSDs and RDMA networks to provide a shared storage layer that simplifies development of distributed applications. Key features and benefits of 3FS include:
Performance and Usability
Disaggregated Architecture: Combines the throughput of thousands of SSDs and the network bandwidth of hundreds of storage nodes, enabling applications to access storage resources in a locality-oblivious manner.
Strong Consistency: Implements Chain Replication with Apportioned Queries (CRAQ) for strong consistency, making application code simple and easy to reason about.
File Interfaces: Provides stateless metadata services backed by a transactional key-value store (e.g., FoundationDB). The file interface is well known and used everywhere, so there is no new storage API to learn.
Diverse Workloads
Data Preparation: Organizes outputs of data analytics pipelines into hierarchical directory structures and manages a large volume of intermediate outputs efficiently.
Dataloaders: Eliminates the need for prefetching or shuffling datasets by enabling random access to training samples across compute nodes (see the sketch after this list).
Checkpointing: Supports high-throughput parallel checkpointing for large-scale training.
KVCache for Inference: Provides a cost-effective alternative to DRAM-based caching, offering high throughput and significantly larger capacity.
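Because 3FS exposes a standard file interface, random access needs nothing beyond ordinary file I/O. Here is a minimal sketch of that dataloader pattern; the mount path and fixed record size are hypothetical, purely for illustration:

```python
import os
import random

DATASET = "/3fs/dataset/train.bin"  # hypothetical path on a 3FS mount
RECORD_SIZE = 4096                  # assume fixed-size serialized samples

def read_sample(f, index: int) -> bytes:
    # Seek straight to the record: no prefetching or pre-shuffling required.
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE)

with open(DATASET, "rb") as f:
    n = os.fstat(f.fileno()).st_size // RECORD_SIZE
    batch = [read_sample(f, random.randrange(n)) for _ in range(8)]
```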
Model Optimization
DualPipe is an innovative bidirectional pipeline parallelism algorithm introduced in the DeepSeek-V3 Technical Report. It achieves full overlap of the forward and backward computation-communication phases and also reduces pipeline bubbles. For detailed information on computation-communication overlap, please refer to the profile data.
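To build intuition for the “bidirectional” part: micro-batches are fed into the pipeline from both ends, so the stages at both ends start working immediately instead of idling through a one-directional warm-up. The toy timeline below illustrates only that intuition (ignoring contention and the fine-grained overlap); it is not DualPipe’s actual schedule:

```python
# Toy timeline: micro-batches enter the stage chain from both ends.
# In a one-directional pipeline, stage s idles until time s; here the
# last stage is busy from t=0 with the reverse-direction stream.
STAGES, MICRO = 4, 4

timeline: dict[int, list[str]] = {}
for b in range(MICRO):
    for s in range(STAGES):
        timeline.setdefault(b + s, []).append(f"fwd mb{b} @ stage{s}")
        timeline.setdefault(b + (STAGES - 1 - s), []).append(f"rev mb{b} @ stage{s}")

for t in sorted(timeline):
    print(f"t={t}: " + ", ".join(timeline[t]))
```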
When using expert parallelism (EP), different experts are assigned to different GPUs. Because the load on each expert may vary with the current workload, it is important to keep the load across GPUs balanced. As described in the DeepSeek-V3 paper, the Expert Parallelism Load Balancer (EPLB) adopts a redundant-experts strategy that duplicates heavily loaded experts, then heuristically packs the duplicated experts onto GPUs to balance the load across them.
The EPLB algorithm computes a balanced expert replication and placement plan based on estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of the repo; a common approach is to use a moving average of historical statistics.
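Here is roughly what calling it looks like, based on my recollection of the example in the EPLB README (the `rebalance_experts` entry point and its arguments should be verified against the repo):

```python
import torch
import eplb  # DeepSeek's EPLB module

# Estimated per-expert load for 2 MoE layers x 8 logical experts, e.g. a
# moving average of historical token counts (as suggested above).
weight = torch.tensor([[ 90, 132,  40,  61, 104, 165,  39,   4],
                       [ 20, 107, 104,  64,  19, 197, 187, 157]])

# Replicate the 8 logical experts into 16 physical slots spread across
# 2 nodes / 8 GPUs, with 4 expert groups.
phy2log, log2phy, logcnt = eplb.rebalance_experts(
    weight, num_replicas=16, num_groups=4, num_nodes=2, num_gpus=8)

print(phy2log)  # which logical expert each physical slot hosts
```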
DeepEP is a communication library tailored for Mixture-of-Experts (MoE) and expert parallelism (EP). It provides high-throughput and low-latency all-to-all GPU kernels, also known as MoE dispatch and combine. The library also supports low-precision operations, including FP8.
To align with the group-limited gating algorithm proposed in the DeepSeek-V3 paper, DeepEP offers a set of kernels optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them suitable for both training and inference prefilling. Additionally, they support controlling the number of SMs (Streaming Multiprocessors) used.
For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels with pure RDMA to minimize delays. The library also introduces a hook-based communication-computation overlapping method that does not occupy any SM resources.
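For readers unfamiliar with the pattern these kernels accelerate, the sketch below expresses the dispatch/combine semantics with plain torch.distributed all-to-all. It is not DeepEP’s API; it assumes an initialized process group, a shape-preserving expert function, and (for simplicity) that every rank routes an equal number of tokens to every other rank:

```python
import torch
import torch.distributed as dist

def dispatch_and_combine(tokens: torch.Tensor, expert_fn, world_size: int):
    """tokens: [world_size * n, hidden], pre-sorted so chunk r targets rank r."""
    send = list(tokens.chunk(world_size))      # equal-sized chunk per peer rank
    recv = [torch.empty_like(c) for c in send]
    dist.all_to_all(recv, send)                # dispatch: route tokens to experts
    outputs = [expert_fn(c) for c in recv]     # run the local expert(s)
    back = [torch.empty_like(c) for c in outputs]
    dist.all_to_all(back, outputs)             # combine: return results to origin
    return torch.cat(back)
```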
DeepGEMM is a library designed for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, as proposed in DeepSeek-V3. It supports both normal and Mixture-of-Experts (MoE) grouped GEMMs. Written in CUDA, the library requires no compilation at install time: all kernels are compiled at runtime by a lightweight Just-In-Time (JIT) module.
Currently, DeepGEMM exclusively supports NVIDIA Hopper tensor cores. To address imprecise FP8 tensor-core accumulation, it employs CUDA-core two-level accumulation (promotion). While it borrows some concepts from CUTLASS and CuTe, it avoids heavy reliance on their templates or algebras. Instead, the library is designed for simplicity, with only one core kernel function of around 300 lines of code, making it a clean and accessible resource for learning Hopper FP8 matrix multiplication and optimization techniques.
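The “fine-grained scaling” idea is worth unpacking: instead of one FP8 scale per tensor, each small block gets its own scale, so an outlier only hurts precision within its block. Below is a plain-PyTorch reference of that quantization scheme (not DeepGEMM’s API or kernel; the block size and helpers are illustrative):

```python
import torch

BLOCK = 128
FP8_MAX = 448.0  # largest normal value of float8_e4m3fn

def quantize_fp8_blockwise(x: torch.Tensor):
    """x: [m, k] with k divisible by BLOCK -> (fp8 data, fp32 per-block scales)."""
    m, k = x.shape
    xb = x.view(m, k // BLOCK, BLOCK)
    # One scale per 128-wide block, so outliers stay local to their block.
    scales = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (xb / scales).to(torch.float8_e4m3fn)
    return q.view(m, k), scales.squeeze(-1)

def reference_gemm(qa, sa, qb, sb):
    # Slow reference: dequantize block-wise, accumulate in fp32, output bf16.
    a = (qa.float().view(*sa.shape, BLOCK) * sa.unsqueeze(-1)).view(qa.shape)
    b = (qb.float().view(*sb.shape, BLOCK) * sb.unsqueeze(-1)).view(qb.shape)
    return (a @ b.T).to(torch.bfloat16)
```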
FlashMLA is an efficient MLA (Multi-head Latent Attention) decoding kernel for Hopper GPUs, optimized for serving variable-length sequences (a sketch of the paged-KV-cache layout follows the list below).
Currently released:
BF16, FP16
Paged kvcache with block size of 64
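To make the “paged kvcache” item concrete, here is a toy illustration of the layout such kernels consume: a block table maps each sequence’s logical blocks to physical blocks in one shared pool, so variable-length sequences can grow without contiguous reallocation. This is illustration only, not FlashMLA’s code, and the shapes are made up:

```python
import torch

BLOCK = 64                               # FlashMLA's paged-kvcache block size
pool = torch.zeros(1024, BLOCK, 8, 128)  # [num_blocks, block, kv_heads, head_dim]
block_table = {0: [3, 17], 1: [5]}       # seq_id -> its physical block ids

def kv_for_token(seq_id: int, pos: int) -> torch.Tensor:
    # Translate a logical token position into (physical block, offset).
    phys = block_table[seq_id][pos // BLOCK]
    return pool[phys, pos % BLOCK]       # KV vectors for that token

print(kv_for_token(0, 70).shape)         # token 70 of seq 0 lives in block 17
```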
They also have a very good write-up on how everything fits together in the following post/doc. Some of these libraries are also mentioned (though not by their exact names) in the DeepSeek paper.