
Introducing CRAG: The Comprehensive Retrieval-Augmented Generation Benchmark

Susanne Waldthaler

Finally, a new standard for evaluating the factual question-answering capabilities of LLMs.

We have all experienced it: how do we really measure the performance of our RAG system? And how do we verify that our changes actually improved it?

Meta has unveiled the Comprehensive Retrieval-Augmented Generation (CRAG) Benchmark, setting a new standard in the evaluation of factual question-answering capabilities of large language models (LLMs).

CRAG aims to bridge the gap in existing benchmarks by providing a rich dataset and robust evaluation mechanisms that simulate real-world retrieval scenarios. This innovative benchmark is poised to drive significant advancements in the field of LLMs and Retrieval-Augmented Generation (RAG) systems.

What is CRAG?

CRAG consists of 4,409 question-answer pairs spanning five domains: Finance, Sports, Music, Movie, and Open domain. The dataset includes eight types of questions, from simple factual queries to complex multi-hop questions and those with false premises. The inclusion of mock APIs simulates real-world retrieval from both web searches and Knowledge Graphs (KGs), providing a comprehensive environment for testing the robustness and accuracy of RAG systems.
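
To make this setup more tangible, here is a minimal Python sketch of how a CRAG-style example and a retrieval-augmented answering step could be wired together. All names (CragExample, answer_with_rag, web_results, kg_results) are illustrative assumptions, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: field names are assumptions, not CRAG's real schema.
@dataclass
class CragExample:
    question: str
    answer: str
    domain: str            # e.g. "Finance", "Sports", "Music", "Movie", "Open"
    question_type: str     # e.g. "simple", "multi-hop", "false_premise"
    web_results: list = field(default_factory=list)   # pages from a mock web-search API
    kg_results: list = field(default_factory=list)    # entries from a mock Knowledge Graph API

def answer_with_rag(example: CragExample, generate) -> str:
    """Toy RAG loop: combine the mock retrieval results into a context,
    then let the generator (any LLM callable) synthesize an answer."""
    context = "\n".join(example.web_results + [str(kg) for kg in example.kg_results])
    prompt = f"Context:\n{context}\n\nQuestion: {example.question}\nAnswer:"
    return generate(prompt)
```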

Key Insights

Initial evaluations using CRAG have revealed crucial insights into the performance of current LLMs and RAG systems:

  • Generator Focus:
    • CRAG specifically evaluates the "Generator" component in the RAG pipeline, offering insights into the model's synthesis capabilities.
  • Performance Metrics:
    • LLMs without grounding achieve about 34% accuracy on CRAG. Simple retrieval boosts this to 44%, while advanced RAG setups built on models such as Llama 3 70B achieve performance close to GPT-4 Turbo (a minimal scoring sketch follows after this list).
  • Product RAG Enhancement:
    • Industry RAG solutions such as Copilot and Perplexity push accuracy up to 62%, underscoring the importance of sophisticated retrieval and preprocessing methods.
  • Real-World Simulation:
    • CRAG includes questions of varying entity popularity and temporal dynamics, reflecting real-world challenges and providing a robust testbed for future research.
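
To illustrate how accuracy figures like those above could be computed over a set of predictions, here is a naive exact-match scorer. This is a simplification on our part; CRAG's official evaluation is likely more nuanced than a plain string comparison.

```python
def exact_match_accuracy(predictions, references):
    """Naive exact-match accuracy over (prediction, reference) pairs.
    A rough proxy only; CRAG's official scoring may differ."""
    assert len(predictions) == len(references)
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)

# Example: exact_match_accuracy(["Paris", "42"], ["paris", "43"]) -> 0.5
```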

CRAG represents a significant leap forward in the evaluation of RAG systems, offering a comprehensive, realistic, and challenging benchmark that will help us evaluate and improve large language models.

