Retrieval Augmented Generation (RAG)#
In the previous two recipes, we learned how to obtain an embedding, as well as how to express the similarity between two embeddings. Retrieval Augmented Generation (RAG) uses both of these techniques to provide relevant context to an LLM at query-time to ground the LLM’s output in a knowledge base. For example, if you want the LLM to answer questions based on specific documents you have, you can use these documents as your knowledge base and implement a RAG pipeline.
A typical RAG pipeline consists of three components: a vectorstore to hold the document embeddings, a retriever to retrieve relevant documents from the vectorstore based on a query, and an LLM to generate a response based on the query and the retrieved documents.
This recipe will implement such a pipeline using components from langchain_dartmouth, as well as the larger LangChain ecosystem.
A Manual RAG pipeline#
If we know which relevant context we want to provide, we could simply use string manipulation to add the context to the query. For example:
from langchain_dartmouth.llms import ChatDartmouth
llm = ChatDartmouth(model_name="meta.llama-3.2-11b-vision-instruct")
# User's question
query = "Are asteroids going to hit me?"
# Context relevant to the question from our knowledge base
relevant_document = "Asteroids do not generally hit people. There is a very low chance for that to happen"
# Augment prompt
augmented_prompt = (
    relevant_document + " Considering this, answer the following question: " + query
)
# Generate the answer
response = llm.invoke(augmented_prompt)
response.pretty_print()
================================== Ai Message ==================================
No, the chances of an asteroid hitting you are extremely low. Asteroids are typically small, rocky objects that orbit the Sun, and most of them are located in the asteroid belt between Mars and Jupiter. The likelihood of an asteroid being on a collision course with Earth and impacting a specific location is incredibly small.
In fact, NASA estimates that the odds of being hit by a meteorite (a piece of an asteroid that has entered Earth's atmosphere) are about 1 in 1.9 million. And the chances of being hit by a large asteroid (diameter of over 1 kilometer) are estimated to be about 1 in 100,000 over the next 100 years.
To put this into perspective, you are more likely to be struck by lightning or win the lottery than be hit by an asteroid. So, while it's not impossible, the chances are incredibly low, and you don't need to worry about it happening to you.
Using a Vector Store#
That’s great, but how can we find the relevant document in a collection of documents? That is where similarity search can help us:
We can calculate the similarity between our user’s query and all documents in our collection. Using similarity as a proxy for relevance, we can then retrieve, for example, the top 5 documents and use them as context.
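To make that idea concrete, here is a rough sketch of what such a manual loop could look like, reusing the embedding model from the previous recipes. The toy document strings and the cosine similarity helper below are purely illustrative:
import numpy as np
from langchain_dartmouth.embeddings import DartmouthEmbeddings

embeddings_model = DartmouthEmbeddings(model_name="baai.bge-large-en-v1-5")

# Toy stand-ins for a real document collection
texts = [
    "Asteroids very rarely hit people.",
    "Hot sauce is made from chili peppers.",
]

# Embed the collection once, then embed the incoming query
doc_vectors = embeddings_model.embed_documents(texts)
query_vector = embeddings_model.embed_query("Are asteroids going to hit me?")

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query and keep the most similar one
scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
print(texts[int(np.argmax(scores))])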
While we could write our own loop to go through the collection of embedded documents, there are optimized structures for storing embeddings and doing these kinds of operations on them called vector stores.
Hint
There are many different implementations of vector stores available, most of which have a corresponding LangChain class. Each implementation may have particular advantages and disadvantages, and the choice of vector store should be made based on your project’s requirements.
In this recipe, we will be using an in-memory vector store. This vector store is a good choice to demonstrate the concepts involved, but it would be a very poor choice for a real-world project. Popular options for vector stores include ChromaDB, PGVector, and commercial offerings like Pinecone.
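For example, if you later outgrow the in-memory store, a persistent store like ChromaDB can be dropped in with very little code change. The following is only a sketch and assumes you have installed the langchain-chroma package; the collection name and directory are placeholders:
from langchain_chroma import Chroma
from langchain_dartmouth.embeddings import DartmouthEmbeddings

embeddings_model = DartmouthEmbeddings(model_name="baai.bge-large-en-v1-5")

# A persistent vector store backed by a local ChromaDB database
vector_store = Chroma(
    collection_name="rag_documents",      # placeholder name
    embedding_function=embeddings_model,
    persist_directory="./chroma_db",      # placeholder path
)
Documents are then added with the same add_documents method used below.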
Let’s build a vector store for a collection of documents on various (very different) topics:
from pathlib import Path
print([p.name for p in Path("./rag_documents/").glob("*.txt")])
['asteroids.txt', 'history.txt', 'hot_sauce.txt']
Creating the vector store strings together quite a few components. Most of these have been introduced in the previous two recipes on embeddings and similarity search:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
# Load all files in a directory using the TextLoader class
loader = DirectoryLoader("./rag_documents", glob="**/*.txt", loader_cls=TextLoader)
collection = loader.load()
# Initialize the text splitter with appropriate chunk size
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=256, chunk_overlap=0
)
# Load and split the files
documents = loader.load_and_split(text_splitter=text_splitter)
embeddings_model = DartmouthEmbeddings(model_name="baai.bge-large-en-v1-5")
# Initialize vector store and add documents
vector_store = InMemoryVectorStore(embedding=embeddings_model)
_ = vector_store.add_documents(documents)
Hint
DirectoryLoader is a class from LangChain that accepts a directory, a glob pattern, and a loader class. It’s a convenient way to load several documents living in the same directory at once.
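If you want to see what the loader and splitter actually produce, you can peek at the first chunk. The exact metadata depends on the loader, but TextLoader records the source file path:
# Inspect the first chunk produced by the loader and splitter
print(documents[0].metadata)
print(documents[0].page_content[:200])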
We can now use the vector store’s similarity_search method to find the most relevant documents (or document chunks) in the collection given our query. We can change the number of returned documents using the parameter k:
query = "What killed the dinosaurs?"
docs = vector_store.similarity_search(query, k=2)
docs
Great, the similarity search retrieved two chunks from the asteroids.txt file! Since our query was related to asteroids, that makes sense!
Note
Note that the retrieved documents (chunks of the original file) are specifically related to the query!
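You can confirm where the chunks came from, since each retrieved document carries its source file in its metadata:
# Check which file each retrieved chunk came from
for doc in docs:
    print(doc.metadata["source"])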
We can now augment our prompt with the retrieved documents, just like we did before in the manual RAG:
augmented_prompt = (
    "Answer the following query: "
    + query
    + "\n\nBase your response on the following context: \n\n"
)
for doc in docs:
    augmented_prompt += doc.page_content + "\n--\n"
response = llm.invoke(augmented_prompt)
response.pretty_print()
And there it is: A fully automated RAG pipeline!
Hint
You could streamline this even further and automate the prompt augmentation using LangChain’s prompt templates. While this is beyond the scope of this recipe, you can check out LangChain’s RAG tutorial to learn more!
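To give a flavor of what that could look like, here is a minimal sketch using a ChatPromptTemplate together with the vector store’s retriever interface; the template wording simply mirrors the manual prompt above:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Answer the following query: {query}\n\n"
    "Base your response on the following context:\n\n{context}"
)

# Expose the vector store as a retriever that returns the top 2 chunks
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

docs = retriever.invoke(query)
context = "\n--\n".join(doc.page_content for doc in docs)

response = llm.invoke(prompt.invoke({"query": query, "context": context}))
response.pretty_print()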
Reranking Documents#
Setting the right value for k can be challenging: Retrieving many documents (a large k) casts a wide net and helps to ensure we don’t miss anything relevant in the collection, but it also injects a lot of less relevant information into the context, potentially confusing the model and increasing the token consumption. A small k keeps the context focused and the response time low, but may miss important bits. Also, we are using similarity as a proxy for relevance, which may not necessarily be accurate.
To deal with this issue, the concept of reranking is often applied:
Retrieve a large number of potentially relevant documents from the vector store using semantic similarity
Rerank the documents based on their contextual relevance
Use only the top N documents for response generation
langchain_dartmouth offers the class DartmouthReranker, which you can use to reduce (compress) the number of documents after the similarity search:
from langchain_dartmouth.retrievers.document_compressors import DartmouthReranker
reranker = DartmouthReranker(model_name="bge-reranker-v2-m3", top_n=3)
docs = vector_store.similarity_search(query, k=10)
ranked_docs = reranker.compress_documents(query=query, documents=docs)
for doc in ranked_docs:
    print(doc.metadata["source"])
We can see that when our query is related to asteroids, the reranker correctly ranks chunks from the file asteroids.txt as the most relevant documents!
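Putting the pieces together, a reranked variant of our pipeline reuses the same components as before, just with the reranking step inserted between retrieval and generation:
# Retrieve broadly, rerank, then generate from the top-ranked chunks only
docs = vector_store.similarity_search(query, k=10)
ranked_docs = reranker.compress_documents(query=query, documents=docs)

augmented_prompt = (
    "Answer the following query: " + query
    + "\n\nBase your response on the following context: \n\n"
)
for doc in ranked_docs:
    augmented_prompt += doc.page_content + "\n--\n"

response = llm.invoke(augmented_prompt)
response.pretty_print()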
Just like with LLMs and embedding models, you can list the available reranking models using the static method list():
DartmouthReranker.list()
Summary#
In this recipe, we have learned how to use a vector store for similarity search on a collection of documents given a query. By retrieving the most similar documents, we can implement a Retrieval Augmented Generation pipeline to ground an LLM’s responses in our document collection.
Finally, we have seen that a reranking model can be used to compress the list of documents based on their contextual relevance, as opposed to their semantic similarity, reducing the amount of irrelevant information we pass to the LLM.