
Introducing CRAG: The Comprehensive Retrieval-Augmented Generation Benchmark

Susanne Waldthaler

Finally, a new standard for evaluating the factual question-answering capabilities of LLMs.

We have all experienced it: how do we really measure the performance of our RAG system? And how do we verify that our changes actually improved it?

Meta has unveiled the Comprehensive Retrieval-Augmented Generation (CRAG) Benchmark, setting a new standard in the evaluation of factual question-answering capabilities of large language models (LLMs).

CRAG aims to bridge the gap in existing benchmarks by providing a rich dataset and robust evaluation mechanisms that simulate real-world retrieval scenarios. This innovative benchmark is poised to drive significant advancements in the field of LLMs and Retrieval-Augmented Generation (RAG) systems.

What is CRAG?

CRAG consists of 4,409 question-answer pairs spanning five domains: Finance, Sports, Music, Movie, and Open domain. The dataset includes eight types of questions, from simple factual queries to complex multi-hop questions and those with false premises. The inclusion of mock APIs simulates real-world retrieval from both web searches and Knowledge Graphs (KGs), providing a comprehensive environment for testing the robustness and accuracy of RAG systems.
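
To make this setup more tangible, here is a minimal Python sketch of how a CRAG-style example and a retrieval-augmented answering step could be wired together. All names (CragExample, answer_with_rag, web_results, kg_results) are illustrative assumptions, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: field names are assumptions, not CRAG's real schema.
@dataclass
class CragExample:
    question: str
    answer: str
    domain: str            # e.g. "Finance", "Sports", "Music", "Movie", "Open"
    question_type: str     # e.g. "simple", "multi-hop", "false_premise"
    web_results: list = field(default_factory=list)   # pages from a mock web-search API
    kg_results: list = field(default_factory=list)    # entries from a mock Knowledge Graph API

def answer_with_rag(example: CragExample, generate) -> str:
    """Toy RAG loop: combine the mock retrieval results into a context,
    then let the generator (any LLM callable) synthesize an answer."""
    context = "\n".join(example.web_results + [str(kg) for kg in example.kg_results])
    prompt = f"Context:\n{context}\n\nQuestion: {example.question}\nAnswer:"
    return generate(prompt)
```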

Key Insights

Initial evaluations using CRAG have revealed crucial insights into the performance of current LLMs and RAG systems:

  • Generator Focus:
    • CRAG specifically evaluates the "Generator" component in the RAG pipeline, offering insights into the model's synthesis capabilities.
  • Performance Metrics:
    • LLMs without grounding achieve about 34% accuracy on CRAG. Simple retrieval boosts this to 44%, while advanced RAG setups built on models such as Llama 3 70B achieve performance close to GPT-4 Turbo (a minimal scoring sketch follows after this list).
  • Product RAG Enhancement:
    • Industry RAG solutions such as Copilot and Perplexity push accuracy up to 62%, underscoring the importance of sophisticated retrieval and preprocessing methods.
  • Real-World Simulation:
    • CRAG includes questions of varying entity popularity and temporal dynamics, reflecting real-world challenges and providing a robust testbed for future research.
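
To illustrate how accuracy figures like those above could be computed over a set of predictions, here is a naive exact-match scorer. This is a simplification on our part; CRAG's official evaluation is likely more nuanced than a plain string comparison.

```python
def exact_match_accuracy(predictions, references):
    """Naive exact-match accuracy over (prediction, reference) pairs.
    A rough proxy only; CRAG's official scoring may differ."""
    assert len(predictions) == len(references)
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)

# Example: exact_match_accuracy(["Paris", "42"], ["paris", "43"]) -> 0.5
```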

CRAG represents a significant leap forward in the evaluation of RAG systems, offering a comprehensive, realistic, and challenging benchmark that will help us evaluate and improve large language models.

