Retrieval Augmented Generation (RAG)#
In the previous two recipes, we learned how to obtain an embedding, as well as how to express the similarity between two embeddings. Retrieval Augmented Generation (RAG) uses both of these techniques to provide relevant context to an LLM at query-time to ground the LLM’s output in a knowledge base. For example, if you want the LLM to answer questions based on specific documents you have, you can use these documents as your knowledge base and implement a RAG pipeline.
A typical RAG pipeline consists of three components: a vectorstore to hold the document embeddings, a retriever to retrieve relevant documents from the vectorstore based on a query, and an LLM to generate a response based on the query and the retrieved documents.
This recipe will implement such a pipeline using components from langchain_dartmouth, as well as the larger LangChain ecosystem.
A Manual RAG pipeline#
If we know which relevant context we want to provide, we could simply use string manipulation to add the context to the query. For example:
from langchain_dartmouth.llms import ChatDartmouth
llm = ChatDartmouth(model_name="meta.llama-3.2-11b-vision-instruct")
# User's question
query = "Are asteroids going to hit me?"
# Context relevant to the question from our knowledge base
relevant_document = "Asteroids do not generally hit people. There is a very low chance for that to happen"
# Augment prompt
augmented_prompt = (
    relevant_document + " Considering this, answer the following question: " + query
)
# Generate the answer
response = llm.invoke(augmented_prompt)
response.pretty_print()
================================== Ai Message ==================================
No, the chances of an asteroid hitting you are extremely low. Asteroids are typically small, rocky objects that orbit the Sun, and most of them are located in the asteroid belt between Mars and Jupiter. The likelihood of an asteroid being on a collision course with Earth and impacting a specific location is incredibly small.
In fact, NASA estimates that the odds of being hit by a meteorite (a piece of an asteroid that has entered Earth's atmosphere) are about 1 in 1.9 million. And the chances of being hit by a large asteroid (diameter of over 1 kilometer) are estimated to be about 1 in 100,000 over the next 100 years.
To put this into perspective, you are more likely to be struck by lightning or win the lottery than be hit by an asteroid. So, while it's not impossible, the chances are incredibly low, and you don't need to worry about it happening to you.
Using a Vector Store#
That’s great, but how can we find the relevant document in a collection of documents? That is where similarity search can help us:
We can calculate the similarity between our user’s query and all documents in our collection. Using similarity as a proxy for relevance, we can then retrieve, for example, the top 5 documents and use them as context.
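To make that idea concrete, here is a rough sketch of what such a manual loop could look like, reusing the embedding model from the previous recipes. The toy document strings and the cosine similarity helper below are purely illustrative:
import numpy as np
from langchain_dartmouth.embeddings import DartmouthEmbeddings

embeddings_model = DartmouthEmbeddings(model_name="baai.bge-large-en-v1-5")

# Toy stand-ins for a real document collection
texts = [
    "Asteroids very rarely hit people.",
    "Hot sauce is made from chili peppers.",
]

# Embed the collection once, then embed the incoming query
doc_vectors = embeddings_model.embed_documents(texts)
query_vector = embeddings_model.embed_query("Are asteroids going to hit me?")

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query and keep the most similar one
scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
print(texts[int(np.argmax(scores))])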
While we could write our own loop to go through the collection of embedded documents, there are optimized structures for storing embeddings and doing these kinds of operations on them called vector stores.
Hint
There are many different implementations of vector stores available, most of which have a corresponding LangChain class. Each implementation may have particular advantages and disadvantages, and the choice of vector store should be made based on your project’s requirements.
In this recipe, we will be using an in-memory vector store. This vector store is a good choice to demonstrate the concepts involved, but it would be a very poor choice for a real-world project. Popular options for vector stores include ChromaDB, PGVector, and commercial offerings like Pinecone.
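For example, if you later outgrow the in-memory store, a persistent store like ChromaDB can be dropped in with very little code change. The following is only a sketch and assumes you have installed the langchain-chroma package; the collection name and directory are placeholders:
from langchain_chroma import Chroma
from langchain_dartmouth.embeddings import DartmouthEmbeddings

embeddings_model = DartmouthEmbeddings(model_name="baai.bge-large-en-v1-5")

# A persistent vector store backed by a local ChromaDB database
vector_store = Chroma(
    collection_name="rag_documents",      # placeholder name
    embedding_function=embeddings_model,
    persist_directory="./chroma_db",      # placeholder path
)
Documents are then added with the same add_documents method used below.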
Let’s build a vector store for a collection of documents on various (very different) topics:
from pathlib import Path
print([p.name for p in Path("./rag_documents/").glob("*.txt")])
['asteroids.txt', 'history.txt', 'hot_sauce.txt']
Creating the vector store strings together quite a few components. Most of these have been introduced in the previous two recipes on embeddings and similarity search:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
# Load all files in a directory using the TextLoader class
loader = DirectoryLoader("./rag_documents", glob="**/*.txt", loader_cls=TextLoader)
collection = loader.load()
# Initialize the text splitter with appropriate chunk size
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=256, chunk_overlap=0
)
# Load and split the files
documents = loader.load_and_split(text_splitter=text_splitter)
embeddings_model = DartmouthEmbeddings(model_name="baai.bge-large-en-v1-5")
# Initialize vector store and add documents
vector_store = InMemoryVectorStore(embedding=embeddings_model)
_ = vector_store.add_documents(documents)
Hint
DirectoryLoader is a class from LangChain that accepts a directory, a glob pattern, and a loader class. It’s a convenient way to load several documents living in the same directory at once.
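If you want to see what the loader and splitter actually produce, you can peek at the first chunk. The exact metadata depends on the loader, but TextLoader records the source file path:
# Inspect the first chunk produced by the loader and splitter
print(documents[0].metadata)
print(documents[0].page_content[:200])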
We can now use the vector store’s similarity_search method to find the most relevant documents (or document chunks) in the collection given our query. We can change the number of returned documents using the parameter k:
query = "What killed the dinosaurs?"
docs = vector_store.similarity_search(query, k=2)
docs
Great, the similarity search retrieved two chunks from the asteroids.txt file! Since our query was related to asteroids, that makes sense!
Note
Note that the retrieved documents (chunks of the original file) are specifically related to the query!
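You can confirm where the chunks came from, since each retrieved document carries its source file in its metadata:
# Check which file each retrieved chunk came from
for doc in docs:
    print(doc.metadata["source"])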
We can now augment our prompt with the retrieved documents, just like we did before in the manual RAG:
augmented_prompt = (
    "Answer the following query: "
    + query
    + "\n\nBase your response on the following context: \n\n"
)
for doc in docs:
    augmented_prompt += doc.page_content + "\n--\n"
response = llm.invoke(augmented_prompt)
response.pretty_print()
And there it is: A fully automated RAG pipeline!
Hint
You could streamline this even further and automate the prompt augmentation using LangChain’s prompt templates. While this is beyond the scope of this recipe, you can check out LangChain’s RAG tutorial to learn more!
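To give a flavor of what that could look like, here is a minimal sketch using a ChatPromptTemplate together with the vector store’s retriever interface; the template wording simply mirrors the manual prompt above:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Answer the following query: {query}\n\n"
    "Base your response on the following context:\n\n{context}"
)

# Expose the vector store as a retriever that returns the top 2 chunks
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

docs = retriever.invoke(query)
context = "\n--\n".join(doc.page_content for doc in docs)

response = llm.invoke(prompt.invoke({"query": query, "context": context}))
response.pretty_print()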
Reranking Documents#
Setting the right value for k can be challenging: Retrieving many documents (a large k) casts a wide net and helps to ensure we don’t miss anything relevant in the collection, but it also injects a lot of less relevant information into the context, potentially confusing the model and increasing the token consumption. A small k keeps the context focused and the response time low, but may miss important bits. Also, we are using similarity as a proxy for relevance, which may not necessarily be accurate.
To deal with this issue, the concept of reranking is often applied:
Retrieve a large number of potentially relevant documents from the vector store using semantic similarity
Rerank the documents based on their contextual relevance
Use only the top N documents for response generation
langchain_dartmouth offers the class DartmouthReranker, which you can use to reduce (compress) the number of documents after the similarity search:
from langchain_dartmouth.retrievers.document_compressors import DartmouthReranker
reranker = DartmouthReranker(model_name="bge-reranker-v2-m3", top_n=3)
docs = vector_store.similarity_search(query, k=10)
ranked_docs = reranker.compress_documents(query=query, documents=docs)
for doc in ranked_docs:
    print(doc.metadata["source"])
We can see that when our query is related to asteroids, the reranker correctly ranks chunks from the file asteroids.txt as the most relevant documents!
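Putting the pieces together, a reranked variant of our pipeline reuses the same components as before, just with the reranking step inserted between retrieval and generation:
# Retrieve broadly, rerank, then generate from the top-ranked chunks only
docs = vector_store.similarity_search(query, k=10)
ranked_docs = reranker.compress_documents(query=query, documents=docs)

augmented_prompt = (
    "Answer the following query: " + query
    + "\n\nBase your response on the following context: \n\n"
)
for doc in ranked_docs:
    augmented_prompt += doc.page_content + "\n--\n"

response = llm.invoke(augmented_prompt)
response.pretty_print()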
Just like with LLMs and embedding models, you can list the available reranking models using the static method list():
DartmouthReranker.list()
Summary#
In this recipe, we have learned how to use a vector store for similarity search on a collection of documents given a query. By retrieving the most similar documents, we can implement a Retrieval Augmented Generation pipeline to ground an LLM’s responses in our document collection.
Finally, we have seen that a reranking model can be used to compress the list of documents based on their contextual relevance, as opposed to their semantic similarity, reducing the amount of irrelevant information we pass to the LLM.