A Deep Dive into Retrieval-Augmented Generation in LLM - Unite.AI




Retrieval Augmented Generation Illustration using Midjourney

Imagine you're an analyst, and you've got access to a Large Language Model. You're excited about the prospects it brings to your workflow. But then you ask it about the latest stock prices or the current inflation rate, and it hits you with:

“I'm sorry, but I cannot provide real-time or post-cutoff data. My last training data only goes up to January 2022.”

Large Language Models, for all their linguistic power, lack the ability to grasp the ‘now’. And in a fast-paced world, ‘now’ is everything.

Research has shown that large pre-trained language models (LLMs) are also repositories of factual knowledge.

They've been trained on so much data that they've absorbed a lot of facts and figures. When fine-tuned, they can achieve remarkable results on a variety of NLP tasks.

But here's the catch: their ability to access and manipulate this stored knowledge is at times imperfect. Especially when the task at hand is knowledge-intensive, these models can lag behind more specialized architectures. It's like having a library with all the books in the world, but no catalog to find what you need.

OpenAI's ChatGPT Gets a Browsing Upgrade

OpenAI's recent announcement about ChatGPT's browsing capability is a significant leap in the direction of Retrieval-Augmented Generation (RAG). With ChatGPT now able to scour the internet for current and authoritative information, it mirrors the RAG approach of dynamically pulling data from external sources to provide enriched responses.

The feature is currently available to Plus and Enterprise users, and OpenAI plans to roll it out to all users soon. Users can activate it by selecting ‘Browse with Bing’ under the GPT-4 option.

ChatGPT's New ‘Bing’ Browsing Feature

Prompt Engineering Is Effective but Insufficient

Prompts serve as the gateway to an LLM's knowledge. They guide the model, providing a direction for the response. However, crafting an effective prompt is not a complete solution for getting what you want from an LLM. Still, here are some good practices to consider when writing a prompt:

  1. Clarity: A well-defined prompt eliminates ambiguity. It should be straightforward, ensuring that the model understands the user's intent. This clarity often translates to more coherent and relevant responses.
  2. Context: Especially for extensive inputs, the placement of the instruction can influence the output. For instance, moving the instruction to the end of a long prompt can often yield better results.
  3. Precision in Instruction: The force of the question, often conveyed through the “who, what, where, when, why, how” framework, can guide the model towards a more focused response. Additionally, specifying the desired output format or size can further refine the model's output.
  4. Handling Uncertainty: It's essential to guide the model on how to respond when it's unsure. For instance, instructing the model to reply with “I don’t know” when uncertain can prevent it from generating inaccurate or “hallucinated” responses.
  5. Step-by-Step Thinking: For complex instructions, guiding the model to think systematically or breaking the task into subtasks can lead to more comprehensive and accurate outputs.
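
As a rough illustration, the practices above can be folded into a small prompt-building helper. This is only a sketch: the function name `build_prompt`, its parameters, and the exact wording of the instructions are assumptions made for this example, not a recommended or standard template.

```python
def build_prompt(document: str, question: str, output_format: str = "a short paragraph") -> str:
    """Assemble a prompt that applies the practices listed above (illustrative only)."""
    parts = [
        # Context: the long input goes first so the instruction sits at the end.
        "Document:\n" + document.strip(),
        # Clarity: one plainly stated question.
        "Question: " + question.strip(),
        # Step-by-step thinking for complex tasks.
        "Work through the problem step by step before answering.",
        # Precision: state the desired output format.
        "Answer in " + output_format + ".",
        # Handling uncertainty: give the model an explicit way out.
        'If the document does not contain the answer, reply with "I don\'t know".',
    ]
    return "\n\n".join(parts)


prompt = build_prompt(
    document="Q3 revenue rose 12% year over year, driven by cloud services.",
    question="What drove revenue growth in Q3?",
    output_format="one sentence",
)
print(prompt)
```

The document text and question here are made-up sample inputs; in practice they would come from your own data and users.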

For more on the importance of prompts in guiding ChatGPT, a comprehensive article can be found at

Challenges in Generative AI Models

Prompt engineering involves refining the directives given to your model to enhance its performance. It's a very cost-effective way to boost your Generative AI application's accuracy, requiring only minor code adjustments. While prompt engineering can significantly enhance outputs, it's crucial to understand the inherent limitations of large language models (LLMs). Two primary challenges are hallucinations and knowledge cut-offs.

  • Hallucinations: This refers to instances where the model confidently returns an incorrect or fabricated response. This can happen even though advanced LLMs have built-in mechanisms to recognize and avoid such outputs.

Hallucinations in LLMs

  • Knowledge Cut-offs: Every LLM has a training cut-off date, after which it is unaware of events or developments. The model's knowledge is frozen at that point. For instance, a model trained up to 2022 would not know the events of 2023.

Knowledge Cut-off in LLMs

Retrieval-augmented generation (RAG) offers a solution to these challenges. It allows models to access external information, mitigating issues of hallucinations by providing access to proprietary or domain-specific data. For knowledge cut-offs, RAG can access current information beyond the model's training date, ensuring the output is up-to-date.

It also allows the LLM to pull in data from various external sources in real time. This could be knowledge bases, databases, or even the vast expanse of the internet.

Introduction to Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is a framework, rather than a specific technology, enabling Large Language Models to tap into data they weren't trained on. There are multiple ways to implement RAG, and the best fit depends on your specific task and the nature of your data.

The RAG framework operates in a structured manner:

Prompt Input

The process begins with a user's input or prompt. This could be a question or a statement seeking specific information.

Retrieval from External Sources

Instead of directly generating a response based on its training, the model, with the help of a retriever component, searches through external data sources. These sources can range from knowledge bases, databases, and document stores to internet-accessible data.

Understanding Retrieval

At its essence, retrieval mirrors a search operation. It's about extracting the most pertinent information in response to a user's input. This process can be broken down into two stages:

  1. Indexing: Arguably the most challenging part of the entire RAG journey is indexing your knowledge base. The indexing process can be broadly divided into two phases: loading and splitting. In tools like LangChain, the components that handle these phases are termed “loaders” and “splitters”. Loaders fetch content from various sources, be it web pages or PDFs. Once fetched, splitters then segment this content into bite-sized chunks, optimizing them for embedding and search.
  2. Querying: This is the act of extracting the most relevant knowledge fragments based on a search term.

While there are many ways to approach retrieval, from simple text matching to using search engines like Google, modern Retrieval-Augmented Generation (RAG) systems rely on semantic search. At the heart of semantic search lies the concept of embeddings.

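Embeddings are central to how Large Language Models (LLMs) understand language. When humans try to articulate how they derive meaning from words, the explanation often circles back to inherent understanding. Deep within our cognitive structures, we recognize that “child” and “kid” are synonymous, or that “red” and “green” both denote colors. Embeddings capture this kind of relatedness numerically: each piece of text is mapped to a vector, and texts with similar meanings end up close together in the vector space.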
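
To make the idea concrete, here is a minimal sketch of how closeness between embeddings is commonly measured, using cosine similarity. The three-dimensional vectors below are invented purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors (assumption): NOT the output of any real embedding model.
toy_embeddings = {
    "child": [0.90, 0.10, 0.05],
    "kid": [0.85, 0.15, 0.10],
    "red": [0.05, 0.90, 0.30],
    "green": [0.10, 0.85, 0.35],
}

for left, right in [("child", "kid"), ("red", "green"), ("child", "red")]:
    score = cosine_similarity(toy_embeddings[left], toy_embeddings[right])
    print(f"{left} vs {right}: {score:.3f}")
```

With these toy values, the synonym pair and the two colors score close to 1, while “child” and “red” score much lower, which is the behavior semantic search relies on.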

Augmenting the Prompt

The retrieved information is then combined with the original prompt, creating an augmented or expanded prompt. This augmented prompt provides the model with additional context, which is especially valuable if the data is domain-specific or not part of the model's original training corpus.
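
A simple way to picture this step is string assembly: the retrieved passages are placed in front of the user's question. The sketch below assumes a hypothetical helper named `augment_prompt` and an arbitrary template; actual RAG systems vary widely in how they format the context.

```python
def augment_prompt(question, retrieved_chunks):
    """Combine retrieved passages with the original question (illustrative template)."""
    # Number each passage so the model (and the reader) can refer back to it.
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Use the numbered context passages to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Sample inputs invented for this example.
augmented = augment_prompt(
    "What is our refund window?",
    ["Refunds are accepted within 30 days of purchase.", "Gift cards are non-refundable."],
)
print(augmented)
```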

Generating the Completion

With the augmented prompt in hand, the model then generates a completion or response. This response is not just based on the model's training but is also informed by the real-time data retrieved.

Retrieval-Augmented Generation

Architecture of the First RAG LLM

The 2020 research paper by Meta, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, provides an in-depth look into this technique. The Retrieval-Augmented Generation model augments the traditional generation process with an external retrieval or search mechanism. This allows the model to pull relevant information from vast corpora of data, enhancing its ability to generate contextually accurate responses.

Here's how it works:

  1. Parametric Memory: This is your traditional language model, like a seq2seq model. It's been trained on vast amounts of data and knows a lot.
  2. Non-Parametric Memory: Think of this as a search engine. It's a dense vector index of, say, Wikipedia, which can be accessed using a neural retriever.

When combined, these two produce a more accurate model. The RAG model first retrieves relevant information from its non-parametric memory and then uses its parametric knowledge to generate a coherent response.


Original RAG Model By Meta

1. Two-Step Process:

The RAG LLM operates in a two-step process:

  • Retrieval: The model first searches for relevant documents or passages from a large dataset. This is done using a dense retrieval mechanism, which employs embeddings to represent both the query and the documents. The embeddings are then used to compute similarity scores, and the top-ranked documents are retrieved.
  • Generation: The top-k relevant documents are then channeled into a sequence-to-sequence generator alongside the initial query. This generator crafts the final output, drawing context from both the query and the fetched documents.
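
The two steps can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: the vectors are hand-written rather than produced by a trained encoder, the dot product stands in for the similarity score, and `generate` is a hypothetical placeholder where a real system would call a sequence-to-sequence model.

```python
def dot(a, b):
    """Inner product, used here as the similarity score between embeddings."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, doc_vecs, k=2):
    """Step 1: score every document against the query and keep the top k ids."""
    ranked = sorted(
        doc_vecs.items(), key=lambda item: dot(query_vec, item[1]), reverse=True
    )
    return [doc_id for doc_id, _ in ranked[:k]]


def generate(query, passages):
    """Step 2 (placeholder): a real system would run a seq2seq generator here."""
    return f"answer to {query!r} conditioned on {len(passages)} passages"


# Hand-written toy data (assumption): ids, texts, and vectors are all invented.
docs = {
    "inflation": "Inflation eased to 3.1% last month.",
    "stocks": "Stock indexes closed higher on Friday.",
    "weather": "Heavy rain is expected this weekend.",
}
doc_vecs = {
    "inflation": [0.9, 0.1, 0.0],
    "stocks": [0.2, 0.9, 0.1],
    "weather": [0.0, 0.1, 0.9],
}
query = "How are prices and markets doing?"
query_vec = [0.8, 0.3, 0.0]

top_ids = retrieve(query_vec, doc_vecs, k=2)
answer = generate(query, [docs[doc_id] for doc_id in top_ids])
print(top_ids)
print(answer)
```

Keeping retrieval and generation as separate functions mirrors the separation described above: either half can be swapped out without touching the other.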

2. Dense Retrieval:

Traditional retrieval systems often rely on sparse representations like TF-IDF. However, RAG LLM employs dense representations, where both the query and documents are embedded into continuous vector spaces. This allows for more nuanced similarity comparisons, capturing semantic relationships beyond mere keyword matching.
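
The limitation of sparse representations is easy to demonstrate. The sketch below builds a bare-bones TF-IDF weighting by hand (whitespace tokenization, no smoothing, which are simplifying assumptions) and shows that two sentences with related meaning but no shared words score exactly zero against each other. The sample sentences are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (count / len(tokens)) * math.log(n_docs / df[t]) for t, count in tf.items()}
        )
    return vectors


def sparse_dot(a, b):
    """Similarity between sparse vectors: only terms present in both contribute."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


docs = ["a child is laughing", "the kid giggles loudly", "stock prices fell today"]
vectors = tfidf_vectors(docs)

# The first two sentences mean nearly the same thing but share no tokens.
print(sparse_dot(vectors[0], vectors[1]))
print(sparse_dot(vectors[0], vectors[0]))
```

A dense encoder would be expected to place the first two sentences close together, which is the gap in keyword matching that dense retrieval is meant to close.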

3. Sequence-to-Sequence Generation:

The retrieved documents act as an extended context for the generation model. This model, often based on architectures like Transformers, then generates the final output, ensuring it's coherent and contextually relevant.

Document Search

Document Indexing and Retrieval

For efficient information retrieval, especially from large documents, the data is often stored in a vector database. Each piece of data or document is indexed based on an embedding vector, which captures the semantic essence of the content. Efficient indexing ensures quick retrieval of relevant information based on the input prompt.

Vector Databases

Vector Database

Source: Redis

Vector databases, sometimes termed vector storage, are tailored databases adept at storing and fetching vector data. In the realm of AI and computer science, vectors are essentially lists of numbers symbolizing points in a multi-dimensional space. Unlike traditional databases, which are more attuned to tabular data, vector databases shine in managing data that naturally fit a vector format, such as embeddings from AI models.

Some notable vector databases and similarity-search libraries include Annoy, Faiss by Meta, Milvus, and Pinecone. These tools are pivotal in AI applications, aiding in tasks ranging from recommendation systems to image searches. Platforms like AWS also offer services tailored for vector database needs, such as Amazon OpenSearch Service and Amazon RDS for PostgreSQL. These services are optimized for specific use cases, ensuring efficient indexing and querying.

Chunking for Relevance

Given that many documents can be extensive, a technique known as “chunking” is often used. This involves breaking down large documents into smaller, semantically coherent chunks. These chunks are then indexed and retrieved as needed, ensuring that the most relevant portions of a document are used for prompt augmentation.
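
One simple way to approximate “semantically coherent” chunks is to split on sentence boundaries and pack whole sentences together up to a size limit. The sketch below does exactly that; the function name `chunk_text`, the character-based limit, and the regular expression for sentence endings are assumptions for illustration, and production splitters are usually more careful.

```python
import re


def chunk_text(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


text = (
    "RAG retrieves documents. It then augments the prompt. "
    "Finally the model generates an answer."
)
for chunk in chunk_text(text, max_chars=60):
    print(repr(chunk))
```

Because sentences are never cut in half, each chunk stays readable on its own when it is later inserted into a prompt.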

Context Window Considerations

Every LLM operates within a context window, which is essentially the maximum amount of information it can consider at once. If external data sources provide information that exceeds this window, it needs to be broken down into smaller chunks that fit within the model's context window.
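
In practice this often comes down to a budget check before the prompt is assembled. The sketch below keeps adding the highest-ranked chunks until a token budget is used up. Counting tokens by splitting on whitespace is a crude stand-in (an assumption for this example); a real system would use the tokenizer of the model it calls.

```python
def fit_to_budget(ranked_chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep chunks in ranked order until the next one would exceed max_tokens."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # Stop at the first chunk that does not fit.
        selected.append(chunk)
        used += cost
    return selected


# Chunks are assumed to be sorted from most to least relevant.
ranked = ["one two three", "four five", "six seven eight nine"]
print(fit_to_budget(ranked, max_tokens=5))
```

Stopping at the first chunk that does not fit preserves the relevance ordering; an alternative design would skip it and try smaller chunks further down the list.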

Benefits of Utilizing Retrieval-Augmented Generation

  1. Enhanced Accuracy: By leveraging external data sources, the RAG LLM can generate responses that are not just based on its training data but are also informed by the most relevant and up-to-date information available in the retrieval corpus.
  2. Overcoming Knowledge Gaps: RAG effectively addresses the inherent knowledge limitations of LLMs, whether they stem from the model's training cut-off or the absence of domain-specific data in its training corpus.
  3. Versatility: RAG can be integrated with various external data sources, from proprietary databases within an organization to publicly accessible internet data. This makes it adaptable to a wide range of applications and industries.
  4. Reducing Hallucinations: One of the challenges with LLMs is the potential for “hallucinations”, or the generation of factually incorrect or fabricated information. By providing real-time data context, RAG can significantly reduce the chances of such outputs.
  5. Scalability: One of the primary benefits of RAG LLM is its ability to scale. By separating the retrieval and generation processes, the model can efficiently handle vast datasets, making it suitable for real-world applications where data is abundant.

Challenges and Considerations

  • Computational Overhead: The two-step process can be computationally intensive, especially when dealing with large datasets.
  • Data Dependency: The quality of the retrieved documents directly impacts the generation quality. Hence, having a comprehensive and well-curated retrieval corpus is crucial.


By integrating retrieval and generation processes, Retrieval-Augmented Generation offers a robust solution to knowledge-intensive tasks, ensuring outputs that are both informed and contextually relevant.

The real promise of RAG lies in its potential real-world applications. For sectors like healthcare, where timely and accurate information can be pivotal, RAG offers the capability to extract and generate insights from vast medical literature seamlessly. In the realm of finance, where markets evolve by the minute, RAG can provide real-time data-driven insights, aiding in informed decision-making. Furthermore, in academia and research, scholars can harness RAG to scan vast repositories of information, making literature reviews and data analysis more efficient.
