Uber's in-house and open-source LLM Training Stack
InkSight to render and de-render handwritten notes, and PyTorch 2.5 supports Intel GPUs!
Articles
Uber has described their open-source and in-house LLM training stack in a very comprehensive blog post.
They discuss how their training tech stack has evolved over time, giving a good timeline of their adoption of deep learning and, now, GenAI technologies. Of course, each era brings its own stack in terms of software and hardware.
Layer 0: They have A100 GPUs in their on-premise cluster and H100s through Google Cloud.
Layer 1: They use Ray and Kubernetes for scheduling their workloads against the hardware.
Layer 2: They built resource-aware scheduling on top of Ray and Kubernetes as shown above.
Instead of Horovod and other training-authoring solutions, they have completely converged on PyTorch, since most open-source models are released as PyTorch models.
They have a very thin layer on top of Ray that allows them to schedule their jobs; DDP and NCCL are part of the PyTorch stack, DeepSpeed enables faster training on top of PyTorch, and Hugging Face lets them export and use open-source models from a single repository.
Specifically, training works as follows:
Multi-host and multi-GPU communication. To start, a TorchTrainer in Ray Train creates multiple workers in the form of Ray Actors, handles in-bound communication (used by Ray Object Store), and initializes a PyTorch distributed process group (used by Deepspeed) on GPUs across all hosts.
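As a rough sketch (not Uber's actual code; the worker count and loop body are placeholders), this is what such a setup looks like with Ray Train's public API:

# Sketch: TorchTrainer spawns GPU workers as Ray Actors and initializes a
# torch.distributed process group across all hosts before the loop runs.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Runs on every GPU worker; torch.distributed is already initialized here,
    # so DDP or DeepSpeed can be set up directly inside this function.
    pass

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=32, use_gpu=True),  # e.g. 8 hosts x 4 GPUs
)
result = trainer.fit()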
Data preparation. The LLM training framework supports remote data sources on Uber HDFS, Uber Terrablob, and Hugging Face public datasets.
Model training. Tokenization converts input text into integers that will be fed into the models. For distributed training, each GPU worker initializes a Hugging Face Transformers Trainer object using the DeepSpeed ZeRO stage 1/2/3 options.
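A hedged sketch of what a single worker's training setup can look like with these Hugging Face APIs (the model id, data file, and DeepSpeed config path are illustrative placeholders, not Uber's internals):

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any HF causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenization: convert raw text into the integer token IDs the model consumes.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=2048),
                      batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=4,
    bf16=True,
    deepspeed="ds_zero3.json",  # ZeRO stage 1/2/3 is chosen inside this config file
)
Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()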
Saving results. Metrics associated with the training experiment are saved on the Uber Comet server. The main training process on the Ray head node pushes training model weights and associated configuration files to Terrablob storage.
They employed a number of optimization tricks to improve training speed and convergence, as well as to score the LLM output for hyperparameter optimization (a configuration sketch for two of these levers follows the list):
Throughput: Both flash attention and CPU offload saved GPU memory, enabling us to increase batch size 2 to 7 times during Llama 2 70B training with maximum GPU memory usage (70GB-80GB) on 32 GPUs (8 hosts on A100, 4 hosts on H100). This led to significant throughput increases.
MFU: MFU on H100 was lower than on A100, and GPU utilization was not full even at maximum GPU memory usage. This might indicate that Llama 2 70B training is memory-bound rather than compute-bound, which is also why CPU offload helped the most to improve MFU, as plotted in Figure 5 below.
Compute or Memory Bound: The story is slightly different for Llama 2 7B on 4 A100/H100 GPUs on a single host, which may be compute-bound rather than memory-bound. The MFU of training Llama 2 7B was higher than that of Llama 2 70B, and CPU offload did not help MFU; flash attention helped the most, as shown in Figure 6 below.
Network: In our experiment, the network usage was around 10 GB/second on H100 and 3 GB/second on A100 for Llama 2 70B model training. This is small compared to the theoretical value of the infrastructure, indicating that the network is not yet a bottleneck compared to GPU compute and memory.
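For reference, the two levers called out above (flash attention and CPU offload) map to standard Transformers/DeepSpeed knobs; the snippet below is a hedged sketch with illustrative values, not Uber's actual configuration:

import json
import torch
from transformers import AutoModelForCausalLM

# Flash attention: request the FlashAttention-2 kernel when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",              # placeholder model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# CPU offload: a DeepSpeed ZeRO-3 config that moves optimizer state and parameters
# to host memory, freeing GPU memory so the per-GPU batch size can grow.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
}
with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)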
Google Research built a system called InkSight that can read and write handwritten notes. This approach aims to bridge the gap between digital and handwritten notes by allowing users to seamlessly convert between the two formats.
InkSight consists of two main components:
A handwriting recognition model that can read handwritten text
A handwriting synthesis model that can generate realistic handwritten text
It can either render digital text into handwritten form or de-render handwritten text into digital format. Most OCR approaches have not looked at the order of the handwriting or the stroke-sequence information; this approach takes the sequence information into account and hence improves over other methods.
The handwriting recognition model uses a transformer-based architecture to convert handwritten text into digital text:
Ability to handle different handwriting styles and languages
Robust performance on messy or difficult-to-read handwriting
Can recognize both printed and cursive writing
The handwriting synthesis model can generate realistic handwritten text from digital input. Thanks to its end-to-end generative capabilities, it can also do the following:
Ability to mimic different handwriting styles
Maintains consistency in generated handwriting
Can adapt to different writing implements (pen, pencil, etc.)
Preserves layout and formatting of original notes
It has a number of models available for different hardware types (CPU/GPU/TPU); a rough loading sketch follows the list:
Hugging Face model for CPU/GPU inference: InkSight Small-p.
Supplementary material for the paper. This is used in the example colab linked below, which automatically downloads this content.
Example code in the form of a Colab notebook that showcases model inference results on several samples and example code to run the inference.
Samples of model outputs from the Hugging Face demo.
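If you want to poke at the CPU/GPU model outside the Colab, a rough loading sketch might look like the following; the repository id and the SavedModel format are my assumptions, so treat the linked Colab as the authoritative reference:

# Hedged sketch: fetch the InkSight Small-p checkpoint from the Hugging Face Hub
# and load it with TensorFlow. The repo id and SavedModel layout are assumptions.
from huggingface_hub import snapshot_download
import tensorflow as tf

local_dir = snapshot_download("Derendering/InkSight-Small-p")  # assumed repo id
model = tf.saved_model.load(local_dir)
print(list(model.signatures))  # inspect the available inference signatures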
PyTorch 2.5 now supports Intel GPUs, marking a significant advancement in AI acceleration and accessibility. This integration encompasses Intel Arc discrete graphics, Intel Core Ultra processors with built-in Intel Arc graphics, and the Intel Data Center GPU Max Series.
Supporting Intel GPUs in PyTorch enables the following benefits for a wider ecosystem:
Expanded AI ecosystem: Intel GPUs provide an alternative to existing GPU solutions, broadening the range of hardware options for AI developers and researchers.
Improved user experience: Developers and customers using Intel GPUs will benefit from native PyTorch support, unified software distribution, and consistent product release timelines.
Better Integration: The support allows for a consistent GPU programming paradigm across front-ends and back-ends, enabling developers to run and deploy workloads on Intel GPUs with minimal code changes.
Performance optimization: Intel GPU support in PyTorch 2.5 offers both eager mode and graph mode (torch.compile) capabilities, with implementations of commonly used Aten operators and optimizations specific to Intel GPUs.
AI PC scenarios: Intel client GPUs can enable local AI workloads, which opens up possibilities for AI applications on personal computers. This is an area people have mostly considered in the mobile domain, but I think personal computers could also be very interesting as a place where AI applications live on top of the operating system.
Performance gains: Benchmarks show significant speedups for FP16/BF16 over FP32 in eager mode, and for torch.compile mode over eager mode, for both inference and training tasks.
The implementation includes key features such as runtime support, Aten operators, oneDNN integration, TorchInductor, Triton integration, and Intel GPU toolchain integration. Quantization and distributed computing capabilities are in active development for future releases (not supported as of now). To use Intel GPUs with PyTorch, developers need only make minimal code changes, primarily switching the device name from "cuda" to "xpu":
# CUDA Code
tensor = torch.tensor([1.0, 2.0]).to("cuda")
# Code for Intel GPU
tensor = torch.tensor([1.0, 2.0]).to("xpu")
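A slightly fuller sketch of the same idea, checking for an XPU device and using the graph mode mentioned above (the model and input are stand-ins):

import torch
import torch.nn as nn

# Pick the Intel GPU ("xpu") when available, otherwise fall back to CPU.
device = "xpu" if torch.xpu.is_available() else "cpu"

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
model = torch.compile(model)              # graph mode via TorchInductor/Triton
x = torch.randn(32, 256, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)
print(out.shape)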
Libraries
MLX-graphs is a library for Graph Neural Networks (GNNs) built upon Apple’s MLX.
Features
Fast GNN training and inference on Apple Silicon
MLX-graphs has been designed to run fast on Apple Silicon chips. All GNN operations fully leverage the GPU and CPU hardware of Macs thanks to the efficient low-level primitives available within the MLX core library. Initial benchmarks show up to a 10x speed improvement with respect to other frameworks on large datasets.
Scalability to large graphs
With unified memory architecture, objects live in a shared memory accessible by both the CPU and GPU. This setup allows Macs to leverage their entire memory capacity for storing graphs. Consequently, Macs equipped with substantial memory can efficiently train GNNs on large graphs, spanning tens of gigabytes, directly using the Mac’s GPU.
Multi-device
Unified memory eliminates the need for time-consuming device-to-device transfers. This architecture also enables specific operations to be run explicitly on either the CPU or GPU without incurring any overhead, facilitating more efficient computation and resource utilization.
The code is available on GitHub.
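As a rough illustration of the API (the class names follow the project's README, but treat the exact signatures as assumptions and check the repo for the current interface):

# Hedged sketch: one GCN layer over a toy 3-node graph with mlx-graphs.
import mlx.core as mx
from mlx_graphs.data import GraphData
from mlx_graphs.nn import GCNConv

edge_index = mx.array([[0, 1, 1, 2],
                       [1, 0, 2, 1]])        # 2 x num_edges connectivity
node_features = mx.random.normal((3, 16))    # 3 nodes, 16 features each
graph = GraphData(edge_index=edge_index, node_features=node_features)

conv = GCNConv(x_dim=16, h_dim=32)           # assumed constructor arguments
h = conv(graph.edge_index, graph.node_features)
print(h.shape)                               # expected (3, 32)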
MLR-Copilot is a framework where LLMs mimic researchers’ thought processes, designed to enhance the productivity of machine learning research by automating the generation and implementation of research ideas.
It begins with a research paper, autonomously generating and validating these ideas, while incorporating human feedback to help reach executable research outcomes.
Outlines is a Python library that allows you to use Large Language Models in a simple and robust way (with structured generation). It is built by .txt and is already used in production by many companies.
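A minimal sketch of what structured generation with Outlines looks like (the model id is just an example):

import outlines

# Load a Hugging Face model through Outlines' transformers integration.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")  # example model id

# Constrain the model so it can only output one of the allowed choices.
generator = outlines.generate.choice(model, ["Positive", "Negative"])
sentiment = generator("Review: The battery life is fantastic.\nSentiment:")
print(sentiment)  # guaranteed to be "Positive" or "Negative"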
ScrapeGraphAI is a web-scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).
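A hedged sketch of a basic scraping pipeline with ScrapeGraphAI (the class and config keys follow the project's README as I recall them; the prompt, URL, API key, and model are placeholders):

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",     # placeholder
        "model": "openai/gpt-4o-mini",        # placeholder model
    },
    "verbose": True,
}

scraper = SmartScraperGraph(
    prompt="List all article titles on the page",
    source="https://example.com/blog",        # placeholder URL
    config=graph_config,
)
print(scraper.run())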
Perplexica is an open-source, AI-powered search engine that goes deep into the internet to find answers. Inspired by Perplexity AI, it is an open-source option that not only searches the web but also understands your questions. It uses advanced machine learning techniques such as similarity search and embeddings to refine results, and provides clear answers with sources cited.
Using SearxNG to stay current and fully open source, Perplexica ensures you always get the most up-to-date information without compromising your privacy.
Learn the fundamentals of building Generative AI applications with a comprehensive 21-lesson course by Microsoft Cloud Advocates.
ChartDB is a powerful, web-based database diagramming editor. Instantly visualize your database schema with a single "Smart Query." Customize diagrams, export SQL scripts, and access all features—no account required. Experience seamless database design here.
What it does:
Instant Schema Import: Run a single query to instantly retrieve your database schema as JSON. This makes it incredibly fast to visualize your database schema, whether for documentation, team discussions, or simply understanding your data better.
AI-Powered Export for Easy Migration: Our AI-driven export feature allows you to generate the DDL script in the dialect of your choice. Whether you're migrating from MySQL to PostgreSQL or from SQLite to MariaDB, ChartDB simplifies the process by providing the necessary scripts tailored to your target database.
Interactive Editing: Fine-tune your database schema using our intuitive editor. Easily make adjustments or annotations to better visualize complex structures.