This week’s newsletter is completely focused on Llama and its ecosystem, as Llama 3 was released last week!
Articles
Llama 3 is out and available for public consumption in two sizes (8B and 70B).
The model architecture changes are the following:
Llama 3 uses a tokenizer with a vocabulary of 128K tokens that encodes language more efficiently.
Grouped query attention (GQA) is adopted across both the 8B and 70B sizes.
Models are trained on sequences of 8,192 tokens, with a mask to ensure self-attention does not cross document boundaries.
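To illustrate that last point, here is a minimal sketch (not Meta's actual code) of a block-diagonal causal mask that keeps self-attention within document boundaries when several documents are packed into one 8,192-token sequence:

```python
import torch

def doc_boundary_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Build a causal attention mask that never crosses document boundaries.

    doc_ids: (seq_len,) tensor giving the document index of each packed token,
             e.g. tensor([0, 0, 0, 1, 1]).
    Returns a (seq_len, seq_len) boolean mask where True means "may attend".
    """
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Example: two packed documents of lengths 3 and 2 in a 5-token sequence.
mask = doc_boundary_mask(torch.tensor([0, 0, 0, 1, 1]))
print(mask.int())
```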
The training data changes are the following:
Llama 3 is trained on over 15T tokens, all collected from publicly available sources; this makes its training dataset roughly 7x larger than Llama 2's.
It also includes four times more code.
Over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages.
Interesting observations in pre-training:
The Chinchilla-optimal amount of training compute for an 8B parameter model corresponds to ~200B tokens, but model performance continues to improve even after the model is trained on two orders of magnitude more data.
Both our 8B and 70B parameter models continued to improve log-linearly as we trained them on up to 15T tokens.
Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.
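As a rough back-of-the-envelope check of the first observation above, here is a sketch using the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter (the exact ratio depends on the fit used):

```python
# Chinchilla-style heuristic: compute-optimal token count is roughly
# 20x the parameter count (the exact ratio varies with the fit used).
params = 8e9                      # 8B parameter model
chinchilla_tokens = 20 * params   # ~160B, i.e. on the order of the ~200B quoted above
llama3_tokens = 15e12             # Llama 3 was trained on 15T tokens
print(f"compute-optimal estimate: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Llama 3 used ~{llama3_tokens / chinchilla_tokens:.0f}x that amount")
```

The ratio comes out near 100x, which is the "two orders of magnitude more data" mentioned in the observation.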
Infrastructure Setup:
The most efficient implementation achieves a compute utilization of over 400 TFLOPS per GPU when training on 16K GPUs simultaneously. These training runs were performed on two custom-built 24K-GPU clusters.
Three types of parallelization are adopted: data parallelization, model parallelization, and pipeline parallelization.
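To make the combination concrete, here is a hedged sketch of how a 16K-GPU job could be decomposed along those three axes; the specific degrees below are illustrative assumptions, not Meta's published configuration:

```python
# Illustrative 3D-parallel decomposition of a 16,384-GPU training job.
# The degrees below are assumptions for the sketch, not Meta's actual setup.
world_size = 16_384
tensor_parallel = 8        # model (tensor) parallel: shard each layer across 8 GPUs
pipeline_parallel = 16     # pipeline parallel: split the layer stack into 16 stages
data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # 128 replicas

assert tensor_parallel * pipeline_parallel * data_parallel == world_size
print(f"{data_parallel} data-parallel replicas, each spanning "
      f"{tensor_parallel * pipeline_parallel} GPUs")
```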
Instruction Fine-tuning:
Post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).
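Of these, DPO is the most self-contained to illustrate: it optimizes the policy directly on preference pairs against a frozen reference model. A minimal sketch of the standard DPO loss (a reference formulation, not Meta's training code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss over summed log-probs of chosen/rejected responses.

    Each argument is a (batch,) tensor of sequence log-probabilities under
    the policy being trained or under the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more than the reference does.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```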
Under the hood, Llama 3 composes multiple systems to provide safety guardrails:
These combine usage of Purple Llama, which is explained in much more detail in the Libraries section below. More information on trust and safety topics can be found on the following website.
You can check out the capabilities on meta.ai; if you log in through Facebook, the website can remember your previous queries. Llama also has an extensive getting started guide.
To download the weights, you can use the following repository:
The latest Llama 3 is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.
This release includes model weights and starting code for pre-trained and instruction-tuned Llama 3 language models, in 8B and 70B parameter sizes.
This repository is intended as a minimal example to load Llama 3 models and run inference. For more detailed examples, see llama-recipes. Its model card is available here.
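If you prefer the Hugging Face route over the raw repository, a minimal inference sketch might look like the following, assuming you have accepted the license for the gated meta-llama/Meta-Llama-3-8B-Instruct checkpoint on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated: accept the license first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize what is new in Llama 3."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Llama 3 Instruct ends each turn with the <|eot_id|> token, so stop on it too.
outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=False,
    eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```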
Datasets
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under the ODC-By 1.0 license. However, by carefully adding additional filtering steps, we managed to push the performance of 🍷 FineWeb well above that of the original 🦅 RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high-quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama) on our aggregate group of benchmark tasks.
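Since the full dataset is multiple terabytes, streaming is the practical way to poke at it. A minimal sketch, assuming the HuggingFaceFW/fineweb dataset ID on the Hugging Face Hub and its "text" column:

```python
from datasets import load_dataset

# Stream 🍷 FineWeb instead of downloading the full multi-TB dataset.
# "HuggingFaceFW/fineweb" is the Hub dataset ID at the time of writing.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, example in enumerate(fineweb):
    print(example["text"][:200])  # each record carries the cleaned web text
    if i == 2:
        break
```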
Libraries
Purple Llama is an umbrella project that over time will bring together tools and evals to help the community build responsibly with open generative AI models. The initial release will include tools and evals for Cyber Security and Input/Output safeguards but we plan to contribute more in the near future.
Why purple?
Borrowing a concept from the cybersecurity world, we believe that to truly mitigate the challenges which generative AI presents, we need to take both attack (red team) and defensive (blue team) postures. Purple teaming, composed of both red and blue team responsibilities, is a collaborative approach to evaluating and mitigating potential risks and the same ethos applies to generative AI and hence our investment in Purple Llama will be comprehensive.
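One concrete piece of Purple Llama is Llama Guard, a classifier model for input/output safeguarding. A hedged sketch of calling it through transformers; the checkpoint name meta-llama/Meta-Llama-Guard-2-8B is an assumption on my part, and its chat template does the safety-prompt formatting:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; Llama Guard classifies conversations as safe/unsafe.
model_id = "meta-llama/Meta-Llama-Guard-2-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [{"role": "user", "content": "How do I reset my router password?"}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=32, do_sample=False)
# The model replies with "safe", or "unsafe" plus the violated category codes.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```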
torchtune is a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs. We're excited to announce our alpha release!
torchtune provides:
Native-PyTorch implementations of popular LLMs using composable and modular building blocks
Easy-to-use and hackable training recipes for popular fine-tuning techniques (LoRA, QLoRA) - no trainers, no frameworks, just PyTorch!
YAML configs for easily configuring training, evaluation, quantization or inference recipes
Built-in support for many popular dataset formats and prompt templates to help you quickly get started with training
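Most users drive torchtune through its tune CLI and the YAML configs mentioned above, but the building blocks are plain Python. A hedged sketch of attaching LoRA adapters to Llama 3 8B; the builder name lora_llama3_8b and its arguments are assumptions based on the alpha release and may differ in your version:

```python
# Hedged sketch of torchtune's composable building blocks; the builder name
# lora_llama3_8b and its argument names are assumptions from the alpha release.
from torchtune.models.llama3 import lora_llama3_8b

# Build a Llama 3 8B model with LoRA adapters on the attention projections.
model = lora_llama3_8b(
    lora_attn_modules=["q_proj", "v_proj"],
    lora_rank=8,
    lora_alpha=16,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{trainable:,} trainable parameters")
```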
The 'llama-recipes' repository is a companion to the Meta Llama 2 and Meta Llama 3 models. The goal of this repository is to provide a scalable library for fine-tuning Meta Llama models, along with some example scripts and notebooks to quickly get started with using the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Meta Llama and other tools in the LLM ecosystem. The examples here showcase how to run Meta Llama locally, in the cloud, and on-prem.
Distilabel is the framework for synthetic data and AI feedback for AI engineers who require high-quality outputs, full data ownership, and overall efficiency.
Check out the documentation for more information! Whether you are working on a predictive model that computes semantic similarity or the next generative model that is going to beat the LLM benchmarks, the framework ensures that the hard data work pays off. Distilabel is the missing piece that helps you synthesize data and provide AI feedback.
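To get a feel for the shape of a pipeline, here is a hedged sketch of a tiny generation job; the import paths (Pipeline, LoadDataFromDicts, TextGeneration, OpenAILLM) follow the 1.x release and are assumptions that may differ across versions:

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# A tiny synthetic-data pipeline: feed seed instructions to an LLM and collect
# the generations as a dataset. Requires an OPENAI_API_KEY in the environment.
with Pipeline(name="synthetic-data-sketch") as pipeline:
    load = LoadDataFromDicts(
        data=[{"instruction": "Explain grouped query attention in one paragraph."}]
    )
    generate = TextGeneration(llm=OpenAILLM(model="gpt-4"))
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run()
    print(distiset)
```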
Orion is a fine-grained, interference-free scheduler for GPU sharing across ML workloads. It assumes one of the clients is high-priority, while the rest of the clients are best-effort.
Orion intercepts CUDA, CUDNN, and CUBLAS calls and submits them into software queues. The Scheduler polls these queues and schedules operations based on their resource requirements and their priority. See ARCHITECTURE for more details on the system and the scheduling policy.
Orion expects that each submitted job has a file where all of its operations, along with their profiles and Streaming Multiprocessor (SM) requirements, are listed. See PROFILE for detailed instructions on how to profile a client application and how to generate the profile files.
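Orion itself does this by intercepting CUDA calls, but the scheduling idea can be sketched in a few lines of Python. This is purely illustrative and not Orion's API: dispatch the high-priority client's kernels first, and co-schedule best-effort kernels only if their SM demand fits in what remains free.

```python
from dataclasses import dataclass, field

TOTAL_SMS = 108  # e.g. one A100; each queued op declares how many SMs it needs

@dataclass(order=True)
class Op:
    priority: int                              # 0 = high-priority client, 1 = best-effort
    sm_requirement: int = field(compare=False)
    name: str = field(compare=False)

def poll_and_schedule(ops):
    """Dispatch ops in priority order; a best-effort op runs only if its SM
    requirement fits in whatever the higher-priority work leaves free."""
    sms_free, dispatched = TOTAL_SMS, []
    for op in sorted(ops):                      # high-priority (0) before best-effort (1)
        if op.sm_requirement <= sms_free:
            sms_free -= op.sm_requirement
            dispatched.append(op.name)
    return dispatched

print(poll_and_schedule([
    Op(0, 80, "hp_matmul"),       # high-priority training kernel
    Op(1, 40, "be_inference"),    # best-effort: skipped, 80 + 40 > 108 SMs
    Op(1, 20, "be_small_kernel"), # best-effort: fits in the remaining 28 SMs
]))
```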
Llamaduo showcases an LLMOps pipeline that fine-tunes a small LLM to prepare for outages of the service LLM. For this project, Gemini 1.0 Pro is chosen as the service LLM and Gemma 2B/7B as the small local LLM.
For this project, the following tech stack is chosen:
Hugging Face open source ecosystem (transformers, peft, alignment-handbook, huggingface_hub)
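The operational idea, roughly: serve traffic from the service LLM and fall back to the locally fine-tuned small model when the service is unavailable. A conceptual sketch with hypothetical helper names, not the project's actual code:

```python
def call_gemini(prompt: str, model: str = "gemini-1.0-pro") -> str:
    # Placeholder for the real Gemini API client call (simulated outage here).
    raise ConnectionError("service LLM outage")

def call_local_gemma(prompt: str) -> str:
    # Placeholder for generating with the locally fine-tuned Gemma 2B/7B model.
    return f"[local Gemma answer to: {prompt}]"

def answer(prompt: str) -> str:
    """Serve from the service LLM; fall back to the fine-tuned local model on outage."""
    try:
        return call_gemini(prompt)
    except Exception:
        return call_local_gemma(prompt)

print(answer("Summarize today's support tickets."))
```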
llm.c is LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. Training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in the single file train_gpt2.c, and training it on GPU is ~2,000 lines (adding CUDA kernels) in train_gpt2.cu. The code compiles and runs instantly, exactly matches the PyTorch reference implementation, and ~matches the speed of (compiled) PyTorch (fp32, no flash attention).