Embeddings#
A major challenge for Large Language Models (LLMs), and Natural Language Processing (NLP) in general, is modeling semantic context. Humans can easily infer the meaning of a word in a particular context. Capturing this ability in a machine-comprehensible way, however, is not a simple task, given that most algorithms (LLMs included) can only operate on numbers using arithmetic operations, not on text.
So before an LLM can “understand” and predict words, it first needs to convert them into numbers in a way that preserves the words’ semantic context. This is done through a process called “vector embedding”. The goal of the embedding is to find a numeric representation of a word (or longer piece of text) that yields similar numbers for semantically similar words.
You can think of this as representing a string of text as a collection of sliders, where each slider setting captures some aspect of the word’s meaning. For example, words like “nice” and “stupendous” might have similar settings on a “positivity” slider but differ on an “intensity” slider. We could now describe the settings of all sliders as a vector, where each element represents the setting of one slider. This vector is called the “embedding” of the word and its dimension (number of elements) is equal to the number of “sliders” representing the word.
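To make the analogy concrete, here is a minimal sketch with entirely made-up numbers: three hypothetical “sliders” and the resulting toy vectors for a few words, compared using cosine similarity (one common way to measure how similar two vectors are). Real embedding models learn their dimensions automatically and use far more of them.
import numpy as np

# Toy "slider" vectors with made-up values (not real embeddings)
# Hypothetical dimensions: [positivity, intensity, formality]
nice = np.array([0.9, 0.3, 0.5])
stupendous = np.array([0.9, 0.9, 0.4])
terrible = np.array([-0.8, 0.7, 0.3])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 means "pointing the same way"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(nice, stupendous))  # relatively high: similar meaning
print(cosine_similarity(nice, terrible))  # much lower: dissimilar meaning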
Attention
A word’s embedding involves many dimensions, possibly thousands. However, we don’t actually know what each individual dimension represents in terms of semantic meaning. The high dimensionality helps model complex relationships between words, even if we can’t clearly label each dimension. So while the “slider model” is a helpful intuition, it is not accurate in that sense: individual dimensions of an embedding should usually not be interpreted as carrying a specific meaning. Only the relative similarity between vectors matters! Check out the next recipe to learn more about that.
This video explains the concept in a little more detail, if you’re interested!
When using an LLM, you usually don’t need to generate the embeddings yourself. The first layer of an LLM, called the embedding layer, takes care of that for you. However, calculating embeddings is highly relevant in many tasks surrounding LLMs, like semantic search and Retrieval-Augmented Generation (RAG), which we will be building up to with this recipe and the two following ones on similarity search and RAG.
This recipe will go over how to use an embedding model provided by langchain_dartmouth to generate embeddings for text.
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
True
Creating an Embedding#
We create embeddings using an embedding model. Embedding models deployed on Dartmouth’s compute resources are available through the DartmouthEmbeddings class.
Its interface is different from that of the text-generation models used in the previous recipes. The embed_query method takes in a string and returns the embedding of that string.
from langchain_dartmouth.embeddings import DartmouthEmbeddings
embeddings = DartmouthEmbeddings(model_name="baai.bge-large-en-v1-5")
embedding_vector = embeddings.embed_query("tiger")
print(embedding_vector[:24], "...")
print("Length of embedding: ", len(embedding_vector))
[-0.009851199574768543, 0.02164553292095661, 0.020528702065348625, 0.0005150869837962091, -0.03606560826301575, -0.003440188243985176, -0.019277747720479965, 0.031079523265361786, -0.008855185471475124, -0.013496476225554943, 0.021268529817461967, 0.004962452687323093, -0.03842880204319954, 0.013413767330348492, -0.004831469152122736, 0.014840598218142986, -0.019473403692245483, -0.0378333255648613, -0.037558022886514664, -0.00017545156879350543, 0.01430518552660942, 0.04353311285376549, -0.08317123353481293, -0.023391492664813995] ...
Length of embedding: 1024
Note
We see that the word “tiger” is represented by a list of 1024 numbers. In other words, its numeric representation consists of 1024 dimensions (or “sliders”) for this particular embedding model, bge-large-en-v1-5. Other models may use fewer or more numbers to represent text. You can read more about the model we are using here in its model card.
The string we are embedding is not limited to just a single word. We can pass a string of arbitrary length to the embed_query method, but we will always get a single vector of a fixed length back:
embedding_vector = embeddings.embed_query(
    "The tiger, being the largest cat species, is known for its distinctive orange and black stripes, which play a crucial role in its camouflage in the wild."
)
print(embedding_vector[:24], "...")
print("Length of embedding: ", len(embedding_vector))
[0.014211619272828102, 0.013235248625278473, 0.027379972860217094, 0.019921204075217247, 0.006261293776333332, -0.008998889476060867, -0.03617194667458534, -0.007398046087473631, 0.029378624632954597, 0.002844825154170394, 0.016292426735162735, -0.01890425570309162, -0.02342963218688965, 0.009867511689662933, -0.007190564647316933, 0.0010587171418592334, -0.002696824725717306, -0.010489331558346748, -0.007242249324917793, 0.00678795063868165, 0.030865129083395004, 0.014046319760382175, -0.040336497128009796, -0.05342057719826698] ...
Length of embedding: 1024
Hint
Embedding models usually have a maximum number of words (or tokens) that they can consider when calculating the embedding vector. If the string is longer than that, you will see an error. You can find your chosen model’s maximum length on its model card. Look for a parameter called input length, sequence length, or context length. In our example here, the maximum input length is 512 tokens.
Another important consideration is the semantic specificity of the resultant embedding vector: While every word within the sequence (up to the maximum sequence length) affects the final numbers in the embedding vector, it represents something akin to a “semantic average”. So the longer the input gets, the less sensitive the embedding is to specific details in the text.
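If you are unsure whether a string fits within that limit, you can get a rough token count with a tokenizer before embedding it. The sketch below uses tiktoken’s cl100k_base encoding, which, as noted further down in this recipe, is only an approximation for bge-large-en-v1-5, but it is usually close enough for a sanity check:
import tiktoken

# cl100k_base only approximates this model's actual tokenizer,
# but it gives a reasonable estimate of the token count
encoding = tiktoken.get_encoding("cl100k_base")

text = "The tiger, being the largest cat species, is known for its distinctive orange and black stripes, which play a crucial role in its camouflage in the wild."
n_tokens = len(encoding.encode(text))
print(f"Approximately {n_tokens} tokens (model limit: 512)")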
Just like with LLMs, you can see which models are available through DartmouthEmbeddings by using the static method list() (see recipe on LLMs):
DartmouthEmbeddings.list()
[ModelInfo(id='baai.bge-large-en-v1-5', name='baai.bge-large-en-v1-5', description=None, is_embedding=True, capabilities=['usage'], is_local=True, cost='free'),
ModelInfo(id='baai.bge-m3', name='baai.bge-m3', description=None, is_embedding=True, capabilities=['usage'], is_local=True, cost='free')]
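If you want to try one of the other listed models, pass its name to the constructor just like before. For example, this sketch (assuming the model is available to your account) switches to baai.bge-m3:
# Instantiate an embedding model with a different model name from the list above
other_embeddings = DartmouthEmbeddings(model_name="baai.bge-m3")
print(len(other_embeddings.embed_query("tiger")))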
LangChain’s TextLoader, Document, and CharacterTextSplitter classes#
The text we want to embed often lives in files of some kind, e.g., text files, Word documents, or PDFs. In LangChain, each of these files is called a document and can be represented in code by a Document object.
Since loading the files and turning them into Document objects is a common pattern, LangChain offers a collection of document loaders that support a variety of use cases and file formats. For example, if we want to load a simple text file (*.txt), we can use the TextLoader class:
from langchain_community.document_loaders import TextLoader

path_to_file = "./rag_documents/asteroids.txt"

text_loader = TextLoader(path_to_file)
documents = text_loader.load()
print(documents[0])
We can now pass the contents of the loaded document to the embed_documents method, but we will run into an issue because the text is too long for the chosen model:
try:
    response = embeddings.embed_documents([documents[0].page_content])
except Exception as e:
    print(e)
We need to split the long text into chunks of at most 512 tokens. We could write our own loop to process the contents of the document, but fortunately LangChain offers text splitters that we can use together with the document loader to split the loaded text into sequences of the right length and return each chunk as an individual Document object:
from langchain_text_splitters import CharacterTextSplitter
# Create a text splitter that splits into chunks of 512 tokens
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=512, chunk_overlap=0
)
# Load the text and split it into Document objects
documents = text_loader.load_and_split(text_splitter=text_splitter)
Hint
How text gets split into tokens (the encoding) differs from model to model. The encoding above is technically only correct for GPT-3.5 and GPT-4, but it is usually close enough to work with other models, too.
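If you want the chunk boundaries to match the embedding model’s own tokenizer more closely, LangChain’s text splitters can also be created from a Hugging Face tokenizer. The following is a sketch that assumes the transformers package is installed and that the model’s tokenizer is published on the Hugging Face Hub as BAAI/bge-large-en-v1.5:
from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

# Load the embedding model's own tokenizer (assumed Hub ID: BAAI/bge-large-en-v1.5)
bge_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

# Count chunk sizes with that tokenizer instead of cl100k_base
model_specific_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    bge_tokenizer, chunk_size=512, chunk_overlap=0
)
# This splitter can be passed to load_and_split() just like the one above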
Now that we have turned our text into chunks of the correct length, let’s embed them all by passing them to the embed_documents method:
embedded_vectors = embeddings.embed_documents([d.page_content for d in documents])
print(
    f"Embedding {len(documents)} chunks as {len(embedded_vectors)} vectors with {len(embedded_vectors[0])} dimensions."
)
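Since embed_documents returns the vectors in the same order as the input, each chunk in documents lines up with the vector at the same position in embedded_vectors. Here is a minimal sketch of keeping them together, which we will build on in the following recipes:
# Pair each chunk with its embedding vector (returned in input order)
for doc, vector in zip(documents, embedded_vectors):
    print(f"{doc.page_content[:40]!r}... -> vector of length {len(vector)}")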
Summary#
Embeddings are representations of strings as numbers. Using the embed_query and embed_documents methods, we can get the embeddings of individual strings or entire documents. This lets us perform many useful operations that capture how different words and texts are related to each other.
With embed_documents we can take advantage of LangChain’s Document class to embed content from files.
The maximum input length of the embedding model we used is 512 tokens. Using a text splitter helps ensure that each chunk stays within that limit.