Articles
X.ai released Grok-1, its first model, along with its weights, in a very short blog post. The model is JAX-based and available on GitHub; it uses a Mixture-of-Experts design on top of a Transformer-based architecture.
The Eagle 7B model is available as open source; it is an excellent and very efficient model that builds on top of RWKV. But what is RWKV?
RWKV (pronounced as RWaKuV) is an RNN with GPT-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable).
RWKV is an open-source, non-profit group under the Linux Foundation, supported by sponsors.
So it combines the best of RNNs and transformers: great performance, fast inference, fast training, VRAM savings, "infinite" context length, and free sentence embeddings. Moreover, it is 100% attention-free.
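To make the "RNN with transformer-level performance" idea concrete, here is a deliberately simplified linear-attention recurrence in the spirit of RWKV's WKV operator. This is not the actual RWKV-v5 formulation (which uses learned per-channel decays, a bonus term for the current token, and gating); it only illustrates why the state stays constant-size while the context grows:

```python
# Simplified linear-attention recurrence in the spirit of RWKV's WKV operator.
# NOT the real RWKV-v5 math: RWKV uses learned per-channel decays, a "bonus"
# for the current token, and gating around this core.
import numpy as np

def wkv_like(keys, values, decay=0.9):
    """Decayed weighted-average recurrence over a sequence.

    The state (num, den) is O(d) regardless of sequence length, which is
    why inference is cheap and context is "infinite" in principle.
    """
    num = np.zeros_like(values[0])  # running sum of exp(k) * v
    den = np.zeros_like(keys[0])    # running sum of exp(k)
    outputs = []
    for k, v in zip(keys, values):
        num = decay * num + np.exp(k) * v
        den = decay * den + np.exp(k)
        outputs.append(num / (den + 1e-9))
    return np.stack(outputs)

T, d = 8, 4
print(wkv_like(np.random.randn(T, d), np.random.randn(T, d)).shape)  # (8, 4)
```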
Eagle 7B has the following features:
Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost)
Trained on 1.1 trillion tokens across 100+ languages
Outperforms all 7B class models in multi-lingual benchmarks
Approaches Falcon (1.5T), LLaMA2 (2T), Mistral (>2T?) level of performance in English evals
Is a foundation model, with a very small instruct tune - further fine-tuning is required for various use cases.
The weights are available on HuggingFace.
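A minimal loading sketch via transformers; the hub id RWKV/v5-Eagle-7B-HF is my assumption of the repo name, so verify it on the hub before running:

```python
# Minimal sketch: loading Eagle 7B from the Hugging Face hub.
# The hub id "RWKV/v5-Eagle-7B-HF" is an assumption; verify it on the hub.
# trust_remote_code is needed because the architecture ships custom code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/v5-Eagle-7B-HF"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tok("The capital of Japan is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0]))
```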
Netflix wrote a 2024 blog post on “Why we love Metaflow”.
In case you have not heard of Metaflow before, it is an open-source machine learning infrastructure framework designed to empower data scientists and machine learning practitioners to construct and manage various machine learning systems. Netflix uses Metaflow throughout the company, encompassing content demand modeling, media understanding, and internal infrastructure.
In the post, they talk about the advantages and disadvantages of Metaflow:
Advantages
User-friendly API: Metaflow offers a human-readable API that simplifies the process of building and managing ML workflows. This lowers the barrier to entry for data scientists and ML engineers, making it easier for them to construct and manage ML pipelines (see the sketch after this list).
Integration with Netflix's data, compute, and orchestration platform: Metaflow integrates seamlessly with Netflix's company-wide data, compute, and orchestration platform. This integration streamlines the workflow development process by providing centralized access to essential resources.
Domain-specific libraries: Metaflow empowers teams to construct their own domain-specific libraries. This feature allows teams to tailor Metaflow to their specific needs and requirements, enhancing efficiency and productivity.
Portable execution environments: Metaflow supports portable execution environments. This flexibility grants users the ability to select the most suitable modeling approach for their particular use cases.
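To give a feel for that human-readable API, here is a minimal flow; the step names and toy logic are illustrative, not taken from the Netflix post:

```python
# Minimal Metaflow flow illustrating the human-readable API.
# Step names and the toy "training" are illustrative only.
from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.data = list(range(10))  # stand-in for real feature loading
        self.next(self.train)

    @step
    def train(self):
        # Placeholder "model": a summary statistic of the data.
        self.model = sum(self.data) / len(self.data)
        self.next(self.end)

    @step
    def end(self):
        print(f"trained model: {self.model}")

if __name__ == "__main__":
    TrainFlow()
```

Running `python train_flow.py run` executes the steps in order, and Metaflow versions and persists each step's artifacts (`self.data`, `self.model`) automatically.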
Disadvantages
Titus knowledge requirement: While Metaflow supports various compute backends, including AWS Batch and Kubernetes, Netflix primarily uses its own internal compute platform, Titus. Titus necessitates some in-depth technical knowledge to operate, potentially posing a challenge for new users.
Sakana.ai wrote an interesting blog post on how to build a foundation model through evolutionary merge design. Their method, Evolutionary Model Merge, is designed to automatically combine various open-source models, and it is especially useful for crafting models tailored to specific domains. The main problem this algorithm tries to solve is that traditional methods of building foundation models rely on human intuition and expertise to determine how to effectively merge different models. Of course, this approach has multiple issues and limitations:
Human Bias: Experts might unknowingly introduce biases into the model selection and merging process, potentially hindering the model's ability to generalize to unseen data.
Limited Exploration: Humans can only explore a limited set of possibilities due to cognitive constraints. This might prevent them from discovering particularly effective combinations of models.
Time-Consuming Process: Experimenting with different model combinations can be a lengthy and laborious process, especially as the number of available open-source models continues to grow.
The first two can be categorized as human inductive bias, and the last one as the cost of the human element; replacing that element with compute provides the following advantages:
Unbiased Exploration: Evolutionary algorithms can systematically explore a vast space of potential model combinations, significantly exceeding human capabilities. This exploration is unbiased, meaning the algorithm is not influenced by preconceived notions about which models might work well together.
Discovery of Novel Solutions: By evaluating a wide range of combinations, the algorithm has a higher chance of discovering unconventional and potentially superior solutions that human experts might miss. The article mentions an example where an evolutionary algorithm successfully merged a Japanese language model with a math model, a task that human intuition might struggle with.
Efficiency: Evolutionary algorithms can automate the process of model selection and merging, significantly reducing the time and effort required compared to traditional methods.
They conclude that this approach works especially well for the Japanese language, which they use as a successful example in line with the product they are building.
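As a toy illustration of the idea (not Sakana's actual implementation, which also evolves data-flow and layer arrangements), here is an evolutionary search over a single linear merge ratio between two model state dicts; fitness() stands in for a real benchmark score:

```python
# Toy evolutionary search over a linear merge ratio between two "models".
# NOT Sakana's implementation: fitness(), the one-parameter models, and the
# mutation scheme are illustrative assumptions.
import random

def merge(sd_a, sd_b, alpha):
    # Interpolate two state dicts parameter-by-parameter.
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

def fitness(sd):
    # Stand-in for a real benchmark (e.g., a Japanese math eval).
    return -abs(sd["w"] - 1.0)

sd_a, sd_b = {"w": 0.2}, {"w": 1.6}
population = [random.random() for _ in range(8)]  # candidate merge ratios

for generation in range(20):
    ranked = sorted(population, key=lambda a: fitness(merge(sd_a, sd_b, a)),
                    reverse=True)
    parents = ranked[:4]                               # keep the fittest half
    children = [min(1.0, max(0.0, p + random.gauss(0, 0.1))) for p in parents]
    population = parents + children                    # next generation

best = max(population, key=lambda a: fitness(merge(sd_a, sd_b, a)))
print("best merge ratio:", round(best, 3))
```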
Libraries
Open-Sora is an initiative dedicated to efficiently producing high-quality video and to making the model, tools, and contents accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video production.
Membership inference attacks (MIAs) attempt to predict whether a particular datapoint is a member of a target model’s training data. Despite extensive research on traditional machine learning models, there has been limited work studying MIAs on the pre-training data of large language models (LLMs). Mimir is a Python package that measures this kind of memorization in LLMs. It implements multiple attacks:
Likelihood (loss): works by simply using the likelihood of the target datapoint as the score.
Reference-based (ref): normalizes the likelihood score with the score obtained from a reference model.
Zlib Entropy (zlib): uses the zlib compression size of a sample to approximate the local difficulty of the sample.
Min-k% Prob (min_k): uses the k% of tokens with minimum likelihood for score computation.
Neighborhood (ne): generates neighbors using an auxiliary model and measures the change in likelihood.
More details on these approaches are outlined in the paper.
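As one concrete example, here is a minimal sketch of the Min-k% Prob score (my paraphrase of the published attack, not Mimir's own code); the model choice is illustrative:

```python
# Minimal sketch of the Min-k% Prob membership score (not Mimir's code).
# Average log-probability of the k% least likely tokens; higher scores
# suggest the text was more likely seen in training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def min_k_prob(text: str, k: float = 0.2) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Log-probability assigned to each actual next token.
    token_lp = logprobs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n = max(1, int(k * token_lp.numel()))
    return token_lp.topk(n, largest=False).values.mean().item()

print(min_k_prob("The quick brown fox jumps over the lazy dog."))
```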
Distilabel is a framework for synthetic data generation and AI feedback, aimed at AI engineers who require high-quality outputs, full data ownership, and overall efficiency. It has the following features:
Integrations with the most popular libraries and APIs for LLMs: HF Transformers, OpenAI, vLLM, etc.
Multiple tasks for Self-Instruct, Preference datasets and more.
Dataset export to Argilla for easy data exploration and further annotation.
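Distilabel's own Pipeline/Task abstractions are not reproduced here; as a rough sketch of what such a loop does under the hood, here is a generic preference-dataset generation pass using the OpenAI client (model name and judging prompt are illustrative assumptions):

```python
# Generic shape of a preference-dataset loop (NOT distilabel's API; see its
# docs for the real abstractions). Model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

dataset = []
for prompt in ["Explain RAG in one sentence."]:
    a, b = complete(prompt, 0.2), complete(prompt, 1.0)  # two candidates
    verdict = complete(                                   # AI feedback judge
        f"Question: {prompt}\nA: {a}\nB: {b}\nWhich answer is better? "
        "Reply with exactly A or B.", 0.0)
    chosen, rejected = (a, b) if verdict.strip().startswith("A") else (b, a)
    dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

print(dataset[0])
```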
Galore contains the pre-release version of the GaLore algorithm, proposed in GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
Gradient Low-Rank Projection (GaLore) is a memory-efficient low-rank training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods, such as LoRA. As a gradient projection method, GaLore is independent of the choice of optimizers and can be easily plugged into existing ones with only two lines of code.
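A sketch of what those "two lines" look like with the galore_torch package; the param-group keys (rank, update_proj_gap, scale, proj_type) follow my reading of the repo README, so treat them as assumptions and check the repo:

```python
# Sketch: swapping a standard optimizer for GaLoreAdamW (galore_torch).
# Param-group keys follow the repo README as I recall it; verify before use.
import torch
from galore_torch import GaLoreAdamW

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 512))

# Apply low-rank gradient projection only to 2D weight matrices.
galore_params = [p for p in model.parameters() if p.dim() == 2]
other_params = [p for p in model.parameters() if p.dim() != 2]

optimizer = GaLoreAdamW(
    [
        {"params": other_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ],
    lr=1e-3,
)
# The rest of the training loop is unchanged: loss.backward(); optimizer.step()
```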
Retrieval-augmented generation (RAG) greatly benefits language models (LMs) by providing additional context for tasks such as document-based question answering (DBQA). Despite its potential, the power of RAG is highly dependent on its configuration, raising the question: What is the optimal RAG configuration? To answer this, a number of researchers from CMU introduce the RAGGED framework to analyze and optimize RAG systems. On the representative DBQA tasks, they study two classic sparse and dense retrievers, and four top-performing LMs in encoder-decoder and decoder-only architectures. Through RAGGED, they uncover that different models suit substantially varied RAG setups. While encoder-decoder models monotonically improve with more documents, they find decoder-only models can only effectively use <5 documents, despite often having a longer context window. RAGGED offers further insights into LMs' context utilization habits, where they find encoder-decoder models rely more on contexts and are thus more sensitive to retrieval quality, while decoder-only models tend to rely on knowledge memorized during training.
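The kind of sweep RAGGED runs can be pictured as below; retrieve(), generate(), and score() are stand-ins for a real sparse/dense retriever, a reader LM, and an eval metric, not the framework's API:

```python
# Sketch of a RAGGED-style sweep: vary the number of retrieved documents
# and measure reader accuracy. All three functions are illustrative stubs.
def retrieve(question: str, k: int) -> list[str]:
    return [f"doc {i} for {question!r}" for i in range(k)]

def generate(question: str, docs: list[str]) -> str:
    return f"answer to {question!r} using {len(docs)} docs"

def score(answer: str, gold: str) -> float:
    return 1.0 if gold in answer else 0.0

questions = [("q1", "gold1"), ("q2", "gold2")]
for k in [1, 2, 5, 10, 20]:
    acc = sum(score(generate(q, retrieve(q, k)), g)
              for q, g in questions) / len(questions)
    print(f"k={k:2d}  accuracy={acc:.2f}")
```

Plotting such a curve per model is how one would see encoder-decoder readers improving monotonically with more documents while decoder-only readers plateau early.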
TorchTune is a native-PyTorch library for easily authoring, fine-tuning and experimenting with LLMs.
The library provides:
Native-PyTorch implementations of popular LLMs
Support for checkpoints in various formats, including checkpoints in HF format
Training recipes for popular fine-tuning techniques with reference benchmarks and comprehensive correctness checks
Integration with HuggingFace Datasets for training and EleutherAI's Eval Harness for evaluation
Support for distributed training using FSDP from PyTorch Distributed
YAML configs for easily configuring training runs
Classes
Intro to AI Transformers is a great class on Transformers.
Generative AI tools like ChatGPT are powered by neural networks called transformers. In this course, you will learn how transformers work and use Hugging Face’s transformer tools to generate text (with GPT-2) and perform sentiment analysis (with BERT). Along the way, you’ll learn about the history of transformer models and how to address carbon impacts of model training.
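For a quick taste of the tools the course uses (illustrative snippets, not course material):

```python
# Text generation with GPT-2 and sentiment analysis with a BERT-family model,
# both via the Hugging Face pipeline API.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_new_tokens=20)[0]["generated_text"])

sentiment = pipeline("sentiment-analysis")  # defaults to a BERT-family model
print(sentiment("I love this course!"))
```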
Elon Musk vs OpenAI
Microsoft CEO Satya Nadella to board members:
"If OpenAl disappeared tomorrow, we have all the IP rights and all the capability. We have the people, we have the compute, we have the data, we have everything. We are below them, above them, around them."
The sentence “We are below them, above them, around them.” stuck with me somehow.
The approach that MSFT has taken in this saga is such a hedge: a large company guaranteeing “their investments” (not much money, but rather infrastructure capabilities) in a smaller entity, where MSFT gets a lot of upside benefits without risking much. More details on this lawsuit can be found here.