Topic 27: What are Chain-of-Agents and Chain-of-RAG?
We explore Google's and Microsoft's advancements that implement "chain" approaches for long context and multi-hop reasoning
With the shift to deep, step-by-step reasoning in AI, we continue to observe a trend of creating Chain-of-… methods. Previously, we explored three Chains-of-Knowledge and other "chain" spin-offs in "From Chain-of-Thoughts to Skeleton-of-Thoughts, and everything in between", but "chains" keep coming! Today, we're going to discuss two advancements from Google Cloud AI Research and Microsoft, called Chain-of-Agents (CoA) and Chain-of-Retrieval Augmented Generation (CoRAG), respectively. Both approach the challenge of handling long-context tasks, but from different perspectives. Google's CoA employs a multi-agent framework, where worker agents process text segments sequentially in a structured chain, while Microsoft's CoRAG introduces an iterative retrieval approach as a solution for strong multi-hop reasoning. Understanding techniques like CoA and CoRAG is crucial if you are working toward improving AI's performance on complex reasoning tasks. So, let's explore how these new "chains" can impact the accuracy and quality of AI models!
Click follow! If you want to receive our articles straight to your inbox, please subscribe here
In todayâs episode, we will cover:
- Chain-of-Agents from Google: whatâs the idea?
- The key idea of Chain-of-RAG (CoRAG) from Microsoft
- Bonus: Resources to dive deeper
Chain-of-Agents from Google: whatâs the idea?
Even when you are working with state-of-the-art models, you can notice that tasks with long context, like entire books, long articles, or lengthy conversations, still remain a challenge for LLMs. One widespread idea is to expand the model's memory, in other words, its context window. However, models tend to lose track of key information as the input grows longer. Another way is to shorten the input instead by selecting only the most relevant parts of the text. Here RAG may be used for effective retrieval, but this method risks discarding important pieces of information.
What to do? Google Cloud AI Research and Penn State Universityâs researchers pursued another strategy to create a method that would be better than RAG, full-context models, and multi-agent LLMs. They proposed the Chain-of-Agents (CoA) framework, inspired by how humans process long texts step-by-step. Instead of relying on a single model, CoA enables multiple AI agents to collaborate and process unlimited amounts of text.
Collaboration among agents may not seem like a new concept, but the researchers discovered some design choices that make their method stand out. Many methods use a tree structure where agents work separately without direct communication (for example, LongAgent). In contrast, CoA follows a chain structure with a strict order, while ensuring agents share information for better accuracy. Let's look at exactly how CoA does it.
How does CoA work?
The Chain-of-Agents (CoA) framework processes long texts step-by-step in two stages:
Stage 1: Worker agents break down and process information
- A long document is split into smaller chunks, making it easier to process.
- Each worker agent reads and analyzes its assigned chunk.
- The workers communicate in order, passing important information down the chain; the message each worker passes on is called a "communication unit".
- This communication helps build a complete understanding of the entire document.
Overall, each worker agent processes one chunk, combines it with the previous agentâs findings, and passes the result to the next worker. But what exactly do worker agents process in different tasks?
For question answering, the workers extract evidence from their chunks. For summarization, they summarize their assigned chunks of the text. For code completion, they create summaries of the code, including function or class details.
Hereâs a simple example of how this stage works:
- If the first part of the text doesn't fully answer a question, the first worker might gather some relevant clues.
- The next worker uses those clues to refine the answer.
- If a chunk has no useful information, the worker just passes the previous workerâs findings without adding noise.
This step-by-step processing ensures the model doesnât miss key details, unlike other approaches where agents work independently.
Image Credit: The original CoA paper
Stage 2: Manager agent summarizes results and generates the final answer
- After the worker agents finish processing, the manager agent steps in. It gathers all the collected insights from the last worker.
- Using this information, it produces the final response, whether it's an answer, a summary, or completed code.
Using a manager agent separates text analysis from answer generation, allowing each agent to focus on its task.
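The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `call_worker` and `call_manager` are hypothetical stand-ins for LLM calls, and here they just extract and accumulate evidence strings for a question-answering task.

```python
def split_into_chunks(text, chunk_size):
    """Split the long input into worker-sized chunks (by word count here)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def call_worker(chunk, query, communication_unit):
    """Stand-in for a worker LLM: extract evidence for the query from this
    chunk and merge it with the previous worker's communication unit."""
    evidence = [s for s in chunk.split(". ") if query.lower() in s.lower()]
    # If the chunk has nothing useful, pass the previous findings unchanged.
    return communication_unit + evidence if evidence else communication_unit

def call_manager(communication_unit, query):
    """Stand-in for the manager LLM: synthesize the final answer from the
    last worker's communication unit."""
    return " ".join(communication_unit) if communication_unit else "No answer found."

def chain_of_agents(text, query, chunk_size=12):
    cu = []  # the "communication unit" passed down the chain
    for chunk in split_into_chunks(text, chunk_size):
        cu = call_worker(chunk, query, cu)  # Stage 1: sequential workers
    return call_manager(cu, query)          # Stage 2: manager answers
```

The key property to notice is the strict sequential order: each worker sees only its chunk plus the accumulated communication unit, never the full text.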
How good is CoA, actually?
CoA was tested on three types of tasks: question answering (QA), summarization, and code completion, to demonstrate how well it handles long context. Here are the main results:
- Question Answering (QA) â Requires understanding and reasoning across long texts.
- CoA consistently outperformed models with full-context processing and RAG across all datasets.
- Improvements included +13.3% on NarrativeQA (ability to understand and generate answers based on long-form narratives), +12.82% on MuSiQue (multi-hop reasoning), and +22% on QuALITY (deep language understanding and reasoning over extended contexts).
- Even when Gemini Ultra was given a longer context window (32k), CoA (8k) still outperformed it.
- Summarization and code completion
- CoA outperformed Vanilla baselines and RAG on all datasets.
- Unlike RAG, which struggled with GovReport (extracting key information from lengthy government reports), CoA improved performance significantly, showing it works well even when there is no specific query.
Image Credit: The original CoA paper
Moreover, CoA beats long-context models like Claude 3 (200k tokens) and other multi-agent frameworks. It performed significantly better than both the hierarchical method, which uses a tree structure where workers pass information to a manager without directly communicating with each other, and the merge method, where each worker agent generates an answer independently and the final result is chosen by majority vote.
Image Credit: The original CoA paper
After these impressive results, the question arises: What makes CoA outperform other powerful methods, including RAG, in long-context tasks?
CoA advantages and why it is better than RAG and other methods
Here are the key benefits of CoA that make it a better option for long text processing:
- CoA is better at finding the correct answer: While RAG heavily relies on re-ranking quality and performs better only when the answer appears earlier in its retrieved chunks, CoA stays more resilient in challenging retrieval scenarios.
- CoAâs performance improves as input length increases: For example, when input length exceeds 400k tokens, CoA achieves up to a 100% improvement over the baseline.
- It reduces the âlost-in-the-middleâ problem by up to 21%.
- It effectively handles complex reasoning tasks, like multi-hop reasoning: By using a step-by-step approach with collaborative agents' reasoning, CoA outperforms RAG at this task. The problem with RAG is that it retrieves text chunks based on their similarity to the query, which makes it difficult to handle multi-hop reasoning. This happens because the initial necessary fact might not be semantically relevant to the query. (Multi-hop reasoning is an advanced AI reasoning technique where a model connects multiple pieces of information across different sources or contexts to arrive at a conclusion. Instead of relying on a single step of inference, the model performs multiple logical "hops" between different facts, documents, or knowledge sources.)
Image Credit: The original CoA paper
- Manageable runtime: CoA takes about 30% longer than RAG, but parallel execution of agents reduces runtime by 57.21%.
- Minimal information loss: 1-4% during communication between agents.
- No need for extra training: It works with existing LLMs.
- Unlike input reduction, CoA doesnât skip important details because it processes everything step by step.
- Unlike window extension, CoA doesnât overwhelm the model with too much information at once.
There are so many benefits of CoA that it seems it doesnât have any disadvantages at all. However...
Not without limitations
Here are several limitations that show the areas for CoAâs improvement:
- Running multiple agents can be expensive and time-consuming.
- The structured communication between agents is not as dynamic as it could be with methods like debating or complex discussions between agents.
- Current models are designed for human-like communication, which is less suited to AI-to-AI exchanges. This can affect the overall quality of interaction between CoA's agents.
Despite this, Google Cloud AI Research and Penn State University proposed a very strong and flexible framework. We look forward to further research and more cases of implementation for CoA.
And what about the update from Microsoft?
The key idea of Chain-of-RAG (CoRAG) from Microsoft
Above, we talked a lot about RAG's limitations: it doesn't process information step-by-step, fails at long-context tasks, and struggles with multi-hop reasoning, where Google's CoA excels. Here comes the answer from Microsoft Corporation and Renmin University of China.
They proposed Chain-of-Retrieval Augmented Generation, or CoRAG, a step-by-step retrieval system, which upgrades traditional RAG to overcome exactly these limitations. CoRAG allows the model to retrieve information step-by-step instead of all at once and dynamically adjust its search process. It is specifically trained to build better retrieval chains and can control how many retrieval steps it takes and how long each retrieval chain is. In other words, it effectively controls the amount of computation it uses.
All these features help CoRAG significantly outperform traditional RAG and existing models, especially in long-context tasks and complex reasoning tasks like multi-hop questions, where answering requires multiple reasoning steps. Also, it improves accuracy by more than 10 points compared to strong baseline models.
So letâs explore how this new type of RAG works.
How does CoRAG work?
The CoRAG framework improves how AI models retrieve and process information by breaking complex searches down into step-by-step retrieval chains. To perform effectively, the framework includes three components:
1. Retrieval chain generation
Unlike most RAG approaches, which lack intermediate search steps, CoRAG automatically generates retrieval chains using a rejection sampling method. These chains are sequences of sub-queries (Q1, Q2, ..., QL) and sub-answers (A1, A2, ..., AL) leading up to the final answer.
Hereâs how it works:
- Generating sub-queries: CoRAG âasksâ a model to generate smaller, more focused questions that guide the retrieval process step-by-step.
- Retrieving relevant information: For each sub-query, CoRAG searches a database and collects the top relevant documents.
- Answering the sub-query: Using the retrieved documents, CoRAG generates an answer for each step.
- Building a chain: The process continues until the final answer is reached or a maximum number of steps is taken.
- Choosing the best chain: CoRAG ranks all possible retrieval chains and selects the one that best supports the correct answer.
Once CoRAG generates retrieval chains, it trains a model to learn from them using a structured approach.
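The chain-generation loop above can be sketched as follows. Everything here is an illustrative stand-in, not the paper's code: `generate_sub_query` and `answer_sub_query` mock LLM calls, `retrieve` is a toy word-overlap retriever, and `score_chain` is a crude proxy for the likelihood-based rejection-sampling criterion.

```python
import random

def generate_sub_query(question, history):
    """Hypothetical stand-in for the LLM that writes the next sub-query."""
    return f"step {len(history) + 1} of: {question}"

def retrieve(sub_query, corpus):
    """Toy retriever: return documents sharing words with the sub-query."""
    terms = set(sub_query.lower().split())
    return [d for d in corpus if terms & set(d.lower().split())][:3]

def answer_sub_query(sub_query, docs):
    """Stand-in sub-answer generator: take the top retrieved document."""
    return docs[0] if docs else "no answer"

def sample_chain(question, corpus, max_steps, rng):
    """Sample one candidate retrieval chain of (sub-query, sub-answer) pairs."""
    chain = []
    for _ in range(max_steps):
        sq = generate_sub_query(question, chain)
        docs = retrieve(sq, corpus)
        chain.append((sq, answer_sub_query(sq, docs)))
        if rng.random() < 0.5:  # stand-in for "final answer reached"
            break
    return chain

def score_chain(chain, gold_answer):
    """Rejection-sampling criterion: prefer chains whose sub-answers
    support the known final answer."""
    return sum(gold_answer.lower() in a.lower() for _, a in chain)

def best_retrieval_chain(question, gold_answer, corpus, n_samples=8, max_steps=4):
    rng = random.Random(0)  # deterministic for illustration
    candidates = [sample_chain(question, corpus, max_steps, rng)
                  for _ in range(n_samples)]
    return max(candidates, key=lambda c: score_chain(c, gold_answer))
```

The sampled chains that best support the known answer become training data, which is what the next component uses.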
Image Credit: The original CoRAG paper
2. Training the model on the enhanced datasets
Each training example consists of: 1) the original question and final answer; 2) the retrieval chain; 3) the top relevant documents for each retrieval step.
CoRAG helps the model learn how to perform 3 tasks simultaneously:
- Predicting the next sub-query â Learning to generate the next question in the retrieval chain.
- Predicting the sub-answer â Learning to extract useful information from retrieved documents.
- Predicting the final answer â Learning to synthesize all previous steps into a correct response.
By training on these intermediate retrieval steps, CoRAG teaches AI models to search and reason more effectively, which improves accuracy on long and complex questions.
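One annotated retrieval chain can be unpacked into all three prediction tasks. A minimal sketch, assuming a simple dict-based example format; the field names are illustrative, not from the paper.

```python
def build_training_examples(question, final_answer, chain, docs_per_step):
    """Turn one annotated retrieval chain into the three prediction tasks
    CoRAG trains on. `chain` is a list of (sub_query, sub_answer) pairs;
    `docs_per_step` holds the retrieved documents for each step."""
    examples = []
    for i, (sub_q, sub_a) in enumerate(chain):
        history = chain[:i]
        # Task 1: predict the next sub-query from the question + history
        examples.append({"task": "next_sub_query",
                         "input": {"question": question, "history": history},
                         "target": sub_q})
        # Task 2: predict the sub-answer from the sub-query + retrieved docs
        examples.append({"task": "sub_answer",
                         "input": {"sub_query": sub_q, "docs": docs_per_step[i]},
                         "target": sub_a})
    # Task 3: predict the final answer from the question + full chain
    examples.append({"task": "final_answer",
                     "input": {"question": question, "chain": chain},
                     "target": final_answer})
    return examples
```

A chain with L steps thus yields 2L + 1 supervised examples, which is how the intermediate reasoning steps become explicit training signal.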
3. Controlling computation at test-time
This step is used to balance accuracy and speed. This balance can be achieved by controlling how much compute power the model spends on searching while maintaining good performance. CoRAG offers 3 strategies to control it:
- Greedy decoding (the fastest way): The model follows a fixed step-by-step search without exploring multiple possibilities. It is the fastest method but may miss better answers if the initial retrieval is weak.
- Best-of-N sampling (balanced approach): The model samples N different retrieval chains and picks the best one. To decide the best chain, it penalizes chains that retrieve irrelevant information. Itâs more robust than greedy decoding while still keeping computation reasonable.
- Tree search (the most accurate strategy): The model explores multiple retrieval chains in parallel. It expands promising search paths and eliminates weak ones. This approach is very effective and accurate but can be slower, as it processes more retrieval steps.
To further optimize speed and accuracy, CoRAG lets users adjust the number of retrieval steps and search paths, and fine-tune tree search depth. This ensures that simple questions are answered quickly, while complex questions receive deeper analysis without excessive computation.
Overall, through these steps CoRAG dynamically adjusts its search process, making smarter decisions at each step. If the first attempt doesnât provide useful information, CoRAG rephrases the query, refines its search, and tries again, which is similar to how humans research complex topics.
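The best-of-N strategy, for example, can be sketched as below. The penalty rule is an illustrative stand-in (the paper scores chains by penalizing irrelevant retrievals; here we simply sum one minus each step's relevance score), and greedy decoding is just the N=1 special case.

```python
def chain_penalty(retrieval_scores):
    """Penalize a chain whose retrieval steps returned low-relevance
    documents. Scores are assumed to be relevance values in [0, 1]."""
    return sum(1.0 - s for s in retrieval_scores)

def best_of_n(candidate_chains, candidate_scores):
    """Best-of-N sampling: among N sampled retrieval chains, pick the one
    with the lowest penalty (greedy decoding corresponds to N = 1)."""
    penalties = [chain_penalty(s) for s in candidate_scores]
    return candidate_chains[penalties.index(min(penalties))]
```

Tree search would extend this by expanding several partial chains in parallel and pruning the high-penalty ones at each step, trading extra retrieval calls for accuracy.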
Performance of CoRAG
One of the main purposes of CoRAG is to overcome the issue with multi-hop reasoning. Researchers tested it on multi-step question answering tasks and the KILT (knowledge-intensive tasks) benchmark, and found that the CoRAG approach can indeed excel at these tasks. Just look at how CoRAG-8B leads in these tasks even against larger models (with around a 13%+ improvement on MuSiQue):
Image Credit: The original CoRAG paper
Here are other important findings on what can influence CoRAGâs performance:
- Increasing retrieval chain length improves performance significantly at first, but after a certain point, the improvements slow down.
- More retrieval steps lead to better reasoning, but too many steps waste computational resources.
- Increasing N in Best-of-N sampling has different effects depending on the dataset:
- On harder datasets like MuSiQue, a larger N improves accuracy.
- On easier datasets like 2WikiMultihopQA, a smaller N is enough.
For now, letâs summarize what we have for the CoRAGâs benefits.
CoRAGâs advantages
- Fine-tuning on retrieval chains: It gives CoRAG a major advantage over other few-shot learning methods in complex reasoning tasks like multi-hop QA datasets.
- CoRAG can refine its own retrieval process over time using iterative training.
- Test-time compute can be adjusted dynamically: Users can balance accuracy and efficiency based on the taskâs difficulty.
- Scaling test-time compute improves performance, but only up to a limit.
- Even with weaker retrievers, increasing test-time computation improves results. Still, stronger retrievers give the best overall performance.
- Using smaller models can reduce computational costs without sacrificing much performance.
However, CoRAG has several important issues.
Not without limitations
- High computational cost because of multiple retrieval steps that require more GPU hours.
- CoRAG performs a predefined number of retrieval steps, even when the correct answer may already have been found; the model needs to learn when to stop retrieving.
- Single-hop QA datasets see minimal improvements: additional retrieval steps increase computation without significant accuracy gains.
- Extra training for refining reasoning chains doesnât always help.
To address these issues, CoRAG requires advancements like adaptive mechanisms and early-stopping techniques.
Conclusion
CoA and CoRAG are both designed to improve long-context processing, but they have strengths in different areas. CoA demonstrates strong advantages in handling extremely long contexts. It is also superior in mitigating the "lost-in-the-middle" problem and performs well in complex multi-hop reasoning. On the other hand, CoRAG significantly improves traditional RAG by using retrieval chain augmentation and iterative retrieval to enhance multi-hop reasoning accuracy and improve performance on long-context tasks.
Both CoA and CoRAG are better than traditional RAG, with the main difference being their "specialization":
- If the task involves complex retrieval-based multi-hop reasoning, CoRAG is the better choice. It shows a slightly better improvement in multi-hop tasks: +13% and more on MuSiQue benchmark compared to CoAâs +12.82%.
- If the task requires processing extremely long contexts efficiently, CoA performs better by leveraging multi-agent collaboration.
And which of these methods is better for you?
Author: Alyona Vert Editor: Ksenia Se
Bonus: Resources to dive deeper
- Chain of Agents: Large Language Models Collaborating on Long-Context Tasks
- Google Research blog about Chain of Agents
- Chain-of-Retrieval Augmented Generation
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Sources from Turing Post
- Topic 11: What are Chains (!) of Knowledge
- Token 1.5: From Chain-of-Thoughts to Skeleton-of-Thoughts, and everything in between
- Topic 7: What is LongRAG framework?
- 8 New Types of RAG