Articles
SmolLM3, from Hugging Face, is a fully open 3-billion-parameter transformer decoder optimized for efficiency, multilingual support, long-context reasoning, and dual-mode instruction.
Hugging Face wrote a rather detailed blog post covering both the model architecture changes they made and the mid-training and post-training strategies.
Trained on 11.2 trillion tokens, it outperforms other 3B models (Llama-3.2-3B, Qwen2.5-3B) and competes with larger 4B models (Qwen3-4B, Gemma3-4B) while offering a complete “blueprint”—architecture, data recipes, and fine-tuning methodology—to empower researchers and engineers.
Model Overview
From 2,000 feet up, the model looks something like this in terms of properties and capabilities.
Scale & Performance
3 B parameters; trained on 11.2 T tokens with a three-stage web/code/math curriculum.
Win-rate leadership on 12 benchmarks (HellaSwag, ARC, Winogrande, CommonsenseQA, MMLU-CF/Pro, PIQA, OpenBookQA, GSM8K, MATH, HumanEval+, MBPP+) and competitive parity with 4 B models in knowledge, reasoning, math, and coding tasks.
Multilingual
Supports six languages (English, French, Spanish, German, Italian, Portuguese).
Evaluated on Global MMLU, MLMM HellaSwag, Flores-200, Belebele.
Long-Context
Native context window extended to 64 K tokens via staged RoPE theta adjustments.
Extrapolation to 128 K tokens at inference using YaRN.
Dual-Mode Instruct Model
/think (reasoning) vs. /no_think (direct answer) flags enable explicit or collapsed reasoning traces. Supports XML and Python tool-calling interfaces.
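In practice, the mode flag is placed in the system turn of the chat template. A minimal sketch, assuming the transformers library and the public checkpoint id (check the model card for the exact template behavior):

```python
# Minimal sketch of toggling SmolLM3's reasoning mode via the system-prompt flag.
# The checkpoint id and exact template behavior are assumptions; see the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "/no_think"},  # or "/think" for an explicit reasoning trace
    {"role": "user", "content": "What is the capital of Portugal?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```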
Architectural Enhancements
The model makes multiple interesting adjustments to the architecture and departs from traditional methods in several places, as outlined below:
Grouped Query Attention (GQA): Replaces standard multi-head attention with four query groups, matching task performance while reducing KV cache memory during inference.
NoPE (No Position Embeddings): Adopts a hybrid attention strategy by omitting rotary position embeddings in every fourth layer, significantly boosting long-context fidelity without degrading short-context capabilities.
Intra-Document Masking: Ensures tokens from distinct documents within the same batch don’t attend to each other, improving training stability for long sequences (see the sketch after this list).
Training Stability Adjustments: Removes weight decay from embedding layers (inspired by OLMo 2), stabilizing embedding norms and overall training dynamics.
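The intra-document masking point can be illustrated with a small block-causal mask; this is a minimal sketch under that interpretation, not SmolLM3's training code:

```python
# Minimal sketch of intra-document attention masking: tokens may only attend to
# earlier tokens from the same packed document. Purely illustrative.
import torch

def intra_document_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) document index of each packed token."""
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)           # (S, S)
    causal = torch.tril(torch.ones_like(same_doc, dtype=torch.bool))  # lower-triangular
    return same_doc & causal  # True where attention is allowed

# Two documents of lengths 3 and 2 packed into one sequence of 5 tokens.
mask = intra_document_mask(torch.tensor([0, 0, 0, 1, 1]))
print(mask.int())
```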
Mid-Training and Post-Training Strategies
Long-Context Extension
Additional 100 B tokens in two phases:
4 K → 32 K (RoPE θ = 1.5 M)
32 K → 64 K (RoPE θ = 5 M)
Both phases upsampled long-document code, books, and web data, with ablations confirming sufficient long-context gains without explicit long-document upsampling.
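The θ schedule works by stretching the rotary wavelengths so that distant positions remain distinguishable. The sketch below only illustrates how the RoPE inverse frequencies change as θ is raised through the two phases; it is not training code:

```python
# Illustrative sketch of how raising the RoPE base theta stretches the rotary
# wavelengths used for position encoding.
import numpy as np

def rope_inv_freq(head_dim: int, theta: float) -> np.ndarray:
    # Standard RoPE inverse frequencies: theta ** (-2i / d) for i = 0 .. d/2 - 1.
    return theta ** (-np.arange(0, head_dim, 2) / head_dim)

for theta in (10_000.0, 1.5e6, 5e6):  # default base, 4K->32K phase, 32K->64K phase
    inv_freq = rope_inv_freq(128, theta)
    # Longest wavelength (in tokens) covered by the slowest-rotating dimension.
    print(f"theta={theta:>9.0f}  max wavelength ≈ {2 * np.pi / inv_freq[-1]:,.0f} tokens")
```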
General Reasoning Injection
Trained on 35 B reasoning-trace tokens (OpenThoughts3, Llama-Nemotron-Post-Training) via ChatML template and packing for 4 epochs (140 B tokens), producing a reasoning-capable mid-training checkpoint for fine-tuning.
Supervised Fine-Tuning (SFT)
1.8 B token mixture (1 B non-reasoning, 0.8 B reasoning) across 22 datasets, balancing domains: math, code, general reasoning, instruction, multilinguality, tool calling. Synthetic reasoning traces were generated by Qwen3-32B for underrepresented domains. Trained for 4 epochs (~8 B tokens) with BFD packing, with the loss masked on user turns and tool outputs.
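User-turn and tool-output masking is commonly implemented by setting the corresponding label ids to -100 so the language-modeling loss ignores them. A minimal sketch of that convention (not SmolLM3's exact training code):

```python
# Illustrative sketch of turn-level loss masking in SFT: tokens from user turns and
# tool outputs get label -100, so only assistant tokens contribute to the loss.
IGNORE_INDEX = -100

def build_labels(token_ids, roles):
    """token_ids: list[int]; roles: list[str], one role per token ('user', 'assistant', 'tool')."""
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]

labels = build_labels([11, 12, 13, 14, 15],
                      ["user", "user", "assistant", "assistant", "tool"])
print(labels)  # [-100, -100, 13, 14, -100]
```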
Alignment via Anchored Preference Optimization (APO)
APO (variant of Direct Preference Optimization) leverages Tulu3 and synthetic reasoning preference pairs (chosen: Qwen3-32B; rejected: Qwen3-0.6B) to stabilize optimization and improve downstream performance. Ablations revealed reasoning training slightly degraded long-context scores, prompting mitigation via model merging.
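Since APO is a DPO variant, the standard DPO objective on (chosen, rejected) pairs gives the intuition; the sketch below shows that baseline loss. APO's anchored variants modify these terms, so consult the APO paper or an implementation such as TRL for the exact formulation:

```python
# Sketch of the standard DPO objective on (chosen, rejected) pairs, for intuition only.
# APO anchors/modifies this loss; this is not the exact objective used for SmolLM3.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log-prob ratios vs. reference
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy numbers: the policy already prefers the chosen response slightly.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
                torch.tensor([-5.5]), torch.tensor([-6.5]))
print(loss)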
Model Merging
Using MergeKit, a linear blend (0.9 reasoning-aligned “soup” + 0.1 long-context mid-training checkpoint) restored RULER 128 K performance while retaining reasoning gains.
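Conceptually, the linear merge is just a weighted average of parameters. A minimal sketch of that idea (placeholder checkpoint paths, not the actual MergeKit configuration):

```python
# Minimal sketch of a linear (weighted-average) merge of two checkpoints, i.e. what a
# "linear" merge does conceptually. Checkpoint paths are placeholders.
from transformers import AutoModelForCausalLM

soup = AutoModelForCausalLM.from_pretrained("path/to/reasoning-soup")              # weight 0.9
long_ctx = AutoModelForCausalLM.from_pretrained("path/to/long-context-midtrain")   # weight 0.1

long_ctx_state = long_ctx.state_dict()
merged_state = {
    name: 0.9 * param + 0.1 * long_ctx_state[name]
    for name, param in soup.state_dict().items()
}
soup.load_state_dict(merged_state)
soup.save_pretrained("smollm3-merged")
```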
MUVERA (Multi-Vector Retrieval Algorithm) from Google Research is a method in information retrieval (IR) that enables the speed and efficiency of single-vector search while preserving the accuracy and expressiveness of multi-vector models. This advance is critical for search engines, recommender systems, and natural language processing, where capturing nuanced semantic relationships is essential but computational cost has historically limited practical deployment of multi-vector approaches. Below is a detailed, technical summary of the MUVERA system, its theoretical foundations, empirical results, and implications for large-scale IR.
Single-Vector Embeddings and MIPS
Traditional neural embedding models (e.g., BERT-based bi-encoders) represent each data point—such as a document or query—as a single vector in a high-dimensional space. Retrieval is performed by measuring the inner product (dot product) between the query vector and document vectors. This allows for Maximum Inner Product Search (MIPS), a well-studied problem with highly optimized algorithms and infrastructure. The main advantages of this approach are:
Speed: Single-vector MIPS can be performed extremely fast, even at very large document counts (billions to trillions).
Memory efficiency: Each item is represented by one vector.
Simplicity: Off-the-shelf search systems can be used.
However, single-vector representations often lose fine-grained semantic information, limiting retrieval accuracy, especially for complex queries or long documents.
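As a baseline for the multi-vector discussion below, brute-force single-vector MIPS is just a matrix-vector product followed by a top-k selection. A minimal NumPy sketch:

```python
# Baseline sketch: brute-force single-vector MIPS over a small corpus with NumPy.
# Real systems replace the exhaustive dot product with approximate MIPS indexes.
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((10_000, 128))   # one vector per document
query = rng.standard_normal(128)                       # one vector per query

scores = doc_embeddings @ query        # inner product with every document
top_k = np.argsort(-scores)[:10]       # highest-scoring documents
print(top_k, scores[top_k])
```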
Multi-Vector Models
To address these limitations, multi-vector models (notably ColBERT) represent each item as a set of vectors—for example, one vector per token or phrase. Similarity between a query and a document is computed using more expressive functions, such as the Chamfer similarity (sum of maximal similarities between query and document vectors). This approach:
Greatly improves retrieval accuracy and expressiveness.
Captures nuanced relationships (e.g., matching specific query terms to document segments).
But it also comes with major drawbacks:
Computational cost: For each query-document pair, all pairs of query and document vectors must be compared.
Latency: Multi-vector similarity is much slower than single-vector MIPS.
Scalability: Not suitable for web-scale retrieval without major engineering compromises.
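A minimal sketch of Chamfer similarity as described above (sum over query vectors of their best inner product with any document vector), which also makes the per-pair cost of comparing all vector pairs visible:

```python
# Illustrative Chamfer similarity between a multi-vector query and document:
# for each query vector, take its best-matching document vector and sum the scores.
import numpy as np

def chamfer_similarity(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (m, d), doc_vecs: (n, d). Returns sum_q max_d <q, d>."""
    sims = query_vecs @ doc_vecs.T         # (m, n) all pairwise inner products
    return float(sims.max(axis=1).sum())   # best document vector per query vector

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 128))    # e.g. one vector per query token
d = rng.standard_normal((300, 128))  # e.g. one vector per document token
print(chamfer_similarity(q, d))
```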
MUVERA: To Solve Them All
MUVERA bridges the gap by reducing multi-vector retrieval to single-vector MIPS through a mathematically principled approach.
The central innovation is the construction of Fixed Dimensional Encodings (FDEs). For each query and document (which are originally sets of vectors), MUVERA computes a single vector such that the inner product of these FDEs closely approximates the true multi-vector similarity (e.g., Chamfer similarity).
This transformation is data-oblivious (does not require learning from data), fast, and provably accurate.
MUVERA further provides provable guarantees on the quality of the approximation:
For any query/document pair, the FDE dot product is an ε-approximation to the true multi-vector similarity.
This is the first method with provable guarantees for reducing multi-vector similarity search to single-vector MIPS, ensuring that retrieval quality is not sacrificed for speed.
The MUVERA retrieval pipeline consists of two stages:
Initial Retrieval: Use the FDEs to perform single-vector MIPS over the corpus, leveraging existing fast and scalable infrastructure.
Re-ranking: For the top-k candidates, compute the exact multi-vector similarity (e.g., Chamfer) to ensure final accuracy.
This hybrid approach combines the best of both worlds: fast candidate generation and accurate final ranking.
MUVERA constructs FDEs such that their inner product closely approximates this sum of maximum inner products. The construction uses random projections and aggregation techniques to ensure that the mapping is efficient and preserves the necessary similarity structure.
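A heavily simplified sketch of the flavor of this construction: partition the space with random hyperplanes (SimHash-style buckets), aggregate vectors per bucket (queries sum, documents average), and concatenate. The actual MUVERA FDE adds repetitions, inner projections, and empty-bucket handling; this is illustration only. The resulting FDE dot product is what the first-stage MIPS scores:

```python
# Simplified, illustrative FDE-style encoding: bucket vectors with random hyperplanes,
# aggregate per bucket, and flatten into one fixed-dimensional vector.
import numpy as np

def bucket_ids(vecs: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    bits = (vecs @ hyperplanes.T) > 0                      # (n, k) sign pattern
    return bits @ (1 << np.arange(hyperplanes.shape[0]))   # integer bucket per vector

def fde(vecs: np.ndarray, hyperplanes: np.ndarray, is_query: bool) -> np.ndarray:
    k, d = hyperplanes.shape[0], vecs.shape[1]
    out = np.zeros((2 ** k, d))
    ids = bucket_ids(vecs, hyperplanes)
    for b in range(2 ** k):
        members = vecs[ids == b]
        if len(members):
            # queries sum per bucket, documents average per bucket
            out[b] = members.sum(0) if is_query else members.mean(0)
    return out.ravel()  # single fixed-dimensional vector of size 2^k * d

rng = np.random.default_rng(0)
planes = rng.standard_normal((4, 128))                  # 2^4 = 16 buckets
q_fde = fde(rng.standard_normal((4, 128)), planes, is_query=True)
d_fde = fde(rng.standard_normal((300, 128)), planes, is_query=False)
print(q_fde @ d_fde)  # single dot product approximating the Chamfer similarity above
```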
To further reduce memory and storage requirements, MUVERA incorporates product quantization:
FDEs are compressed by up to 32× (e.g., a 10,240-dimensional FDE stored in just 1,280 bytes).
This compression incurs negligible loss in retrieval quality.
The result is a dramatic reduction in memory footprint, making large-scale deployment feasible.
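For scale, a 10,240-dimensional float32 FDE occupies 40,960 bytes, so 1,280 bytes corresponds to roughly one byte per 8-dimensional subvector. A hedged FAISS sketch of that kind of product quantization (toy corpus size and parameters; not MUVERA's exact configuration):

```python
# Sketch of product-quantizing FDEs with FAISS: split each 10,240-dim FDE into 1,280
# 8-dim subvectors and store one byte per subvector (1,280 bytes per document,
# a 32x reduction from float32). Illustrative only.
import faiss
import numpy as np

d, m, nbits = 10_240, 1_280, 8                               # dims, subquantizers, bits per code
index = faiss.IndexPQ(d, m, nbits, faiss.METRIC_INNER_PRODUCT)

fdes = np.random.default_rng(0).standard_normal((5_000, d)).astype("float32")  # toy corpus
index.train(fdes)                           # learn the per-subvector codebooks
index.add(fdes)                             # stores 1,280 bytes per vector
scores, ids = index.search(fdes[:1], k=10)  # approximate MIPS over the compressed FDEs
print(ids)
```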
MUVERA supports asymmetric FDE construction, where queries and documents can be encoded differently to further optimize retrieval performance. This flexibility allows for tuning the system to specific application needs or hardware constraints.
Benchmark
MUVERA was evaluated on several standard IR benchmarks, including the BEIR suite and MS MARCO. Key results:
Recall: MUVERA achieves the same or better recall as prior state-of-the-art multi-vector heuristics (e.g., PLAID, ColBERT).
Efficiency: MUVERA retrieves 2–5× fewer candidates for the same recall, reducing computational cost.
Latency: End-to-end retrieval is up to 90% faster than prior multi-vector systems.
Memory: With product quantization, MUVERA’s memory usage is a fraction of baseline methods.
MUVERA is open-source with implementations available in C++ and Python.
Libraries
Biomni is a general-purpose biomedical AI agent designed to autonomously execute a wide range of research tasks across diverse biomedical subfields. By integrating cutting-edge large language model (LLM) reasoning with retrieval-augmented planning and code-based execution, Biomni helps scientists dramatically enhance research productivity and generate testable hypotheses.
MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is built on the earlier MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. Consistent with MiniMax-Text-01, the M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1.
MiniCPM-o is the latest series of end-side multimodal LLMs (MLLMs) upgraded from MiniCPM-V. The models can now take images, video, text, and audio as inputs and provide high-quality text and speech outputs in an end-to-end fashion. Since February 2024, six versions of the model have been released, aiming to achieve strong performance and efficient deployment. The most notable models in the series currently include:
MiniCPM-o 2.6: 🔥🔥🔥 The latest and most capable model in the MiniCPM-o series. With a total of 8B parameters, this end-to-end model achieves comparable performance to GPT-4o-202405 in vision, speech, and multimodal live streaming, making it one of the most versatile and performant models in the open-source community. For the new voice mode, MiniCPM-o 2.6 supports bilingual real-time speech conversation with configurable voices, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning, role play, etc. It also advances MiniCPM-V 2.6's visual capabilities, such as strong OCR capability, trustworthy behavior, multilingual support, and video understanding. Due to its superior token density, MiniCPM-o 2.6 can for the first time support multimodal live streaming on end-side devices such as iPad.
MiniCPM-V 2.6: The most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses GPT-4V in single-image, multi-image and video understanding. It outperforms GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet in single image understanding, and can for the first time support real-time video understanding on iPad.
Agenticseek is a voice-enabled AI assistant that autonomously browses the web, writes code, and plans tasks while keeping all data on your device. Tailored for local reasoning models, it runs entirely on your hardware, ensuring complete privacy and zero cloud dependency.
OmniSealBench provides a comprehensive benchmark for evaluating the performance of neural watermarking techniques. The benchmark includes a variety of datasets, evaluation metrics, and tools for training and testing neural networks for watermarking.
flow_matching is a PyTorch library for Flow Matching algorithms, featuring continuous and discrete implementations. It includes examples for both text and image modalities. This repository is part of Flow Matching Guide and Codebase.
Clojure MCP connects AI models to your Clojure development environment, enabling a remarkable REPL-driven development experience powered by large language models (LLMs).
Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.
Key Features
Large-Scale Training: Pre-trained a 1T parameter MoE model on 15.5T tokens with zero training instability.
MuonClip Optimizer: Applies the Muon optimizer at an unprecedented scale, with novel optimization techniques developed to resolve instabilities while scaling up.
Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.
Model Variants
Kimi-K2-Base: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
Kimi-K2-Instruct: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
More details and benchmarks are in the post.
Below the Fold
Gophish is an open-source phishing toolkit designed for businesses and penetration testers. It provides the ability to quickly and easily setup and execute phishing engagements and security awareness training.
Pangolin is a self-hosted tunneled reverse proxy server with identity and access control, designed to securely expose private resources on distributed networks. Acting as a central hub, it connects isolated networks — even those behind restrictive firewalls — through encrypted tunnels, enabling easy access to remote services without opening ports.
Cactus is a cross-platform framework for deploying LLM/VLM/TTS models locally in your app.
Available in Flutter and React-Native for cross-platform developers.
Supports any GGUF model you can find on Huggingface; Qwen, Gemma, Llama, DeepSeek etc.
Run LLMs, VLMs, Embedding Models, TTS models and more.
Accommodates from FP32 to as low as 2-bit quantized models, for efficiency and less device strain.
MCP tool calls to make AI performant and helpful (set reminders, gallery search, reply to messages, etc.).
Fallback to massive cloud models for complex tasks and upon device failures.
Chat templates with Jinja2 support and token streaming.
Xenharmlib is a generalized music theory library that supports traditional Western and non-western harmonic systems, unconventional microtonal and macrotonal tunings, diatonic and posttonal set theory and non-standard notations.
Flix is a principled effect-oriented functional, imperative, and logic programming language developed at Aarhus University.
xAI has published a video on how they trained their new model, Grok 4, on a very large training cluster.