Google Research has introduced a novel framework called Chain-of-Agents (CoA) to address the challenges of long-context tasks in large language models (LLMs). It leverages multi-agent collaboration to improve performance on tasks requiring extensive input processing.
LLMs have demonstrated impressive capabilities in tasks such as reasoning, knowledge retrieval, and generation. However, they struggle with long inputs due to context-length constraints, which hinders their performance on tasks like long-document summarization, question answering, and code completion.
Previous approaches to this issue have primarily focused on two directions:
Input reduction: This method involves reducing the length of the input context before feeding it to LLMs. Retrieval Augmented Generation (RAG) is an example of this approach, breaking the input into chunks and retrieving answers based on embedding similarity.
Window extension: This approach extends the context window of LLMs through fine-tuning, allowing them to consume longer inputs. For instance, Gemini can process up to 2M tokens per input.
However, these methods have their drawbacks. Input reduction techniques may provide incomplete context due to low retrieval accuracy, while window extension approaches struggle with focusing on relevant information and suffer from ineffective context utilization as the input length increases.
Chain-of-Agents (CoA)
CoA is designed to address these challenges by mimicking the way humans interleave reading and processing of long contexts under limited working memory. It consists of two main stages (a rough code sketch of the flow follows the stage descriptions):
Stage 1: Worker Agent - Segment Comprehension and Chain-Communication
In this stage, a series of worker agents process different chunks of the long context sequentially. Each worker agent:
Receives a portion of the source text
Processes the query
Follows instructions for a specific task
Analyzes the message passed from the previous agent
Outputs a message for the next worker in the chain
This unidirectional communication chain allows for efficient information extraction and comprehension across the entire input.
Stage 2: Manager Agent - Information Integration and Response Generation
After the worker agents have processed the input, the manager agent:
Receives the accumulated knowledge from the last worker
Synthesizes the relevant information
Generates the final answer based on the query and instructions
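The flow is simple under the hood: each worker only ever sees its own chunk plus the message from the previous worker, and only the manager produces the final answer. Below is a minimal sketch of that worker-manager loop; the helper names (call_llm, chunk_text), the prompt wording, and the chunk size are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of the CoA worker -> manager flow (hypothetical helpers,
# not the paper's code). call_llm(prompt) stands in for any LLM API call.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

def chunk_text(text: str, chunk_words: int) -> list[str]:
    # Naive whitespace chunking as a stand-in for tokenizer-based splitting.
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]

def chain_of_agents(source: str, query: str, chunk_words: int = 6000) -> str:
    message = ""  # communication unit passed along the worker chain
    for chunk in chunk_text(source, chunk_words):
        # Stage 1: each worker reads its chunk plus the previous worker's message.
        message = call_llm(
            f"Previous findings: {message}\n"
            f"Source chunk: {chunk}\n"
            f"Question: {query}\n"
            "Update the findings with any evidence relevant to the question."
        )
    # Stage 2: the manager answers from the accumulated findings alone.
    return call_llm(
        f"Accumulated findings: {message}\n"
        f"Question: {query}\n"
        "Write the final answer."
    )
```

Because each call carries only one chunk plus a short running summary, no single prompt has to fit the full input into the model's context window.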
Key Advantages
Training-free: CoA does not require additional training of the underlying LLMs.
Task-agnostic: The framework can be applied to various long-context tasks.
Highly interpretable: The step-by-step process allows for easy understanding of how the final answer is derived.
Cost-effective: CoA reduces time complexity from n² to nk, where n is the number of input tokens and k is the context limit of the LLM.
Computational Efficiency
CoA reduces the time complexity of processing long inputs from n² to nk, where:
n is the number of input tokens
k is the context limit of the LLM
This improvement in efficiency makes CoA more cost-effective for processing long contexts compared to full-context approaches.
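As a rough back-of-the-envelope illustration of that claim (the figures below are illustrative, not taken from the paper):

```python
# Illustrative arithmetic only: full-context attention scales roughly with n^2,
# while CoA's chunked, sequential processing scales roughly with n * k.
n = 400_000  # total input tokens (e.g., a very long book)
k = 8_000    # per-agent context limit of the LLM (8k used here as an example)

full_context_cost = n ** 2  # ~1.6e11 token-pair interactions
coa_cost = n * k            # ~3.2e9 token-pair interactions

print(f"full context:    {full_context_cost:.1e}")
print(f"chain of agents: {coa_cost:.1e}")
print(f"reduction:       {full_context_cost / coa_cost:.0f}x")  # 50x fewer interactions
```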
Multi-hop Reasoning Capabilities
CoA demonstrated superior performance in multi-hop reasoning tasks compared to RAG. In an example from the HotpotQA dataset, CoA's collaborative approach enabled:
The first agent to explore related topics without knowing the answer
The second agent to broaden the topic scope by incorporating new information
The third agent to discover the answer by synthesizing information from earlier agents and new data
This process highlights CoA's ability to facilitate complex reasoning across long context tasks.
Comparison with Long Context LLMs
When compared to long context LLMs on the NarrativeQA and BookSum datasets, CoA (8k) significantly outperformed both RAG (8k) and Full-Context (200k) baselines. This was true even when using Claude 3 models (Haiku, Sonnet, and Opus) with a context limit of 200k tokens.
Performance on Longer Inputs
The researchers compared CoA and Full-Context approaches using Claude 3 on the BookSum dataset with varying input lengths. Key findings include:
CoA outperformed the Full-Context (200k) baseline across all tested input lengths.
CoA's performance improved as the input length increased.
The improvement over Full-Context (200k) became more significant with longer inputs, reaching around 100% when the length exceeded 400k tokens.
These results demonstrate that CoA can enhance LLM performance even for models with very long context window limits and provides greater performance gains for longer inputs.
Libraries
Julep is a platform for creating AI agents that remember past interactions and can perform complex tasks. It offers long-term memory and manages multi-step processes.
Julep enables the creation of multi-step tasks incorporating decision-making, loops, parallel processing, and integration with numerous external tools and APIs.
While many AI applications are limited to simple, linear chains of prompts and API calls with minimal branching, Julep is built to handle more complex scenarios that:
have multiple steps,
make decisions based on model outputs,
spawn parallel branches,
use lots of tools, and
run for a long time.
Transformer Lab allows you to:
💕 One-click Download Hundreds of Popular Models:
DeepSeek, Llama3, Qwen, Phi4, Gemma, Mistral, Mixtral, Command-R, and dozens more
⬇ Download any LLM from Huggingface
🎶 Finetune / Train Across Different Hardware
Finetune using MLX on Apple Silicon
Finetune using Huggingface on GPU
⚖️ RLHF and Preference Optimization
DPO
ORPO
SIMPO
Reward Modeling
💻 Work with LLMs Across Operating Systems:
Windows App
MacOS App
Linux
💬 Chat with Models
Chat
Completions
Preset (Templated) Prompts
Chat History
Tweak generation parameters
Batched Inference
Tool Use / Function Calling (in alpha)
🚂 Use Different Inference Engines
MLX on Apple Silicon
Huggingface Transformers
vLLM
Llama CPP
🧑🎓 Evaluate models
📖 RAG (Retrieval Augmented Generation)
Drag and Drop File UI
Works on Apple MLX, Transformers, and other engines
📓 Build Datasets for Training
Pull from hundreds of common datasets available on HuggingFace
Provide your own dataset using drag and drop
🔢 Calculate Embeddings
💁 Full REST API
🌩 Run in the Cloud
You can run the user interface on your desktop/laptop while the engine runs on a remote or cloud machine
Or you can run everything locally on a single machine
🔀 Convert Models Across Platforms
Convert from/to Huggingface, MLX, GGUF
🔌 Plugin Support
Easily pull from a library of existing plugins
Write your own plugins to extend functionality
🧑💻 Embedded Monaco Code Editor
Edit plugins and view what's happening behind the scenes
📝 Prompt Editing
Easily edit System Messages or Prompt Templates
📜 Inference Logs
While doing inference or RAG, view a log of the raw queries sent to the LLM
And you can do the above, all through a simple cross-platform GUI.
Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from data preparation and training to evaluation and deployment. Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.
With Oumi, you can:
🚀 Train and fine-tune models from 10M to 405B parameters using state-of-the-art techniques (SFT, LoRA, QLoRA, DPO, and more)
🤖 Work with both text and multimodal models (Llama, DeepSeek, Qwen, Phi, and others)
🔄 Synthesize and curate training data with LLM judges
⚡️ Deploy models efficiently with popular inference engines (vLLM, SGLang)
📊 Evaluate models comprehensively across standard benchmarks
🌎 Run anywhere - from laptops to clusters to clouds (AWS, Azure, GCP, Lambda, and more)
🔌 Integrate with both open models and commercial APIs (OpenAI, Anthropic, Vertex AI, Together, Parasail, ...)
All with one consistent API, production-grade reliability, and all the flexibility you need for research.
RQ-VAE Recommender is a PyTorch implementation of a generative retrieval model using semantic IDs based on RQ-VAE, from "Recommender Systems with Generative Retrieval". The model has two stages:
Items in the corpus are mapped to a tuple of semantic IDs by training an RQ-VAE (figure below).
Sequences of semantic IDs are tokenized using the frozen RQ-VAE, and a transformer-based model is trained on these sequences to generate the next semantic IDs in the sequence (a rough sketch of the quantization step follows this list).
Datasets: Amazon Reviews (Beauty, Sports, Toys), MovieLens 1M, MovieLens 32M
RQ-VAE Pytorch model implementation + KMeans initialization + RQ-VAE Training script.
Decoder-only retrieval model + Training code with semantic id user sequences from randomly initialized or pretrained RQ-VAE.
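The core trick in stage one, residual quantization, is easy to sketch: each codebook quantizes whatever residual the previous level left behind, and the chosen indices form the item's semantic ID tuple. The snippet below illustrates that step only, with hypothetical shapes and names; the actual repo also trains the encoder/decoder and initializes its codebooks with KMeans.

```python
# Minimal sketch of residual quantization, the step that turns an item latent
# into a tuple of semantic IDs (illustrative only, not the repo's implementation).
import torch

def residual_quantize(z: torch.Tensor, codebooks: list[torch.Tensor]) -> list[int]:
    """z: (d,) latent for one item; codebooks: list of (codebook_size, d) tensors."""
    residual = z.clone()
    semantic_ids = []
    for codebook in codebooks:
        # Pick the code nearest to the current residual at this level.
        dists = torch.cdist(residual.unsqueeze(0), codebook)  # (1, codebook_size)
        idx = int(dists.argmin())
        semantic_ids.append(idx)
        # Subtract the chosen code; the next level quantizes what remains.
        residual = residual - codebook[idx]
    return semantic_ids

# Example: 3 quantization levels with 256 codes each -> a 3-tuple of semantic IDs.
codebooks = [torch.randn(256, 32) for _ in range(3)]
item_latent = torch.randn(32)
print(residual_quantize(item_latent, codebooks))  # one code index per level
```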
Beyond Self-Attention for Sequential Recommendation (BSARec) leverages the Fourier transform to strike a balance between inductive bias and self-attention.
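As a rough illustration of the general idea (a frequency-domain filter acting as the inductive bias, blended with ordinary self-attention), here is a small sketch; the module name, the fixed low-pass mask, and the blending weight alpha are assumptions for illustration, not BSARec's exact formulation.

```python
# Rough sketch: low-pass filter the item-sequence representation in the frequency
# domain and blend it with a self-attention output. Illustrative only.
import torch
import torch.nn as nn

class FrequencyBlendedLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 2, alpha: float = 0.5, keep_low: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.alpha = alpha        # trade-off between the frequency filter and attention
        self.keep_low = keep_low  # number of low-frequency components to keep

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        freq = torch.fft.rfft(x, dim=1)             # FFT along the sequence axis
        mask = torch.zeros_like(freq)
        mask[:, : self.keep_low] = 1.0              # keep only low frequencies (smoothness bias)
        filtered = torch.fft.irfft(freq * mask, n=x.size(1), dim=1)
        attended, _ = self.attn(x, x, x)
        return self.alpha * filtered + (1 - self.alpha) * attended

# Example: a batch of 4 user histories, 20 items each, 32-dim embeddings.
layer = FrequencyBlendedLayer(d_model=32)
print(layer(torch.randn(4, 20, 32)).shape)  # torch.Size([4, 20, 32])
```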
Below the Fold
Dario Amodei (co-founder and CEO of Anthropic) wrote a rather interesting piece on DeepSeek, what it means for US chip export controls, and its geopolitical significance.