Embeddings#
A major challenge for Large Language Models (LLMs), and Natural Language Processing (NLP) in general, is modeling semantic context. Humans can easily infer the meaning of a word in a particular context. Capturing this ability in a machine-comprehensible way, however, is not a simple task, given that most algorithms (LLMs included) can only operate on numbers using arithmetic operations, not on text.
So before an LLM can “understand” and predict words, it first needs to convert them into numbers in a way that preserves the words’ semantic context. This is done through a process called “vector embedding”. The goal of the embedding is to find a numeric representation of a word (or longer piece of text) that yields similar numbers for semantically similar words.
You can think of this as representing a string of text as a collection of sliders, where each slider setting captures some aspect of the word’s meaning. For example, words like “nice” and “stupendous” might have similar settings on a “positivity” slider but differ on an “intensity” slider. We could now describe the settings of all sliders as a vector, where each element represents the setting of one slider. This vector is called the “embedding” of the word and its dimension (number of elements) is equal to the number of “sliders” representing the word.
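To make the analogy concrete, here is a minimal sketch with entirely made-up numbers: three hypothetical “sliders” and the resulting toy vectors for a few words, compared using cosine similarity (one common way to measure how similar two vectors are). Real embedding models learn their dimensions automatically and use far more of them.
import numpy as np

# Toy "slider" vectors with made-up values (not real embeddings)
# Hypothetical dimensions: [positivity, intensity, formality]
nice = np.array([0.9, 0.3, 0.5])
stupendous = np.array([0.9, 0.9, 0.4])
terrible = np.array([-0.8, 0.7, 0.3])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 means "pointing the same way"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(nice, stupendous))  # relatively high: similar meaning
print(cosine_similarity(nice, terrible))  # much lower: dissimilar meaning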
Attention
A word’s embedding involves many dimensions, possibly thousands. However, we don’t actually know what each individual dimension represents in terms of semantic meaning. The high dimensionality helps model complex relationships between words, even if we can’t clearly label each dimension. So while the “slider model” is a helpful intuition, it is not accurate in that sense: individual dimensions of an embedding should usually not be interpreted as carrying a specific meaning. Only the relative similarity between vectors matters! Check out the next recipe to learn more about that.
This video explains the concept in a little more detail, if you’re interested!
When using an LLM, you usually don’t need to generate the embeddings yourself. The first layer of an LLM, called the embedding layer, takes care of that for you. However, calculating embeddings is highly relevant in many tasks surrounding LLMs, like semantic search and Retrieval-Augmented Generation (RAG), which we will be building up to with this recipe and the two following ones on similarity search and RAG.
This recipe will go over how to use an embedding model provided by langchain_dartmouth to generate embeddings for text.
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
True
Creating an Embedding#
We create embeddings using an embedding model. Embedding models deployed on Dartmouth’s compute resources are available through the DartmouthEmbeddings class.
Its interface is different from that of the text-generation models used in the previous recipes. The embed_query method takes in a string and returns the embedding of that string.
from langchain_dartmouth.embeddings import DartmouthEmbeddings
embeddings = DartmouthEmbeddings(model_name="baai.bge-large-en-v1-5")
embedding_vector = embeddings.embed_query("tiger")
print(embedding_vector[:24], "...")
print("Length of embedding: ", len(embedding_vector))
[-0.009851199574768543, 0.02164553292095661, 0.020528702065348625, 0.0005150869837962091, -0.03606560826301575, -0.003440188243985176, -0.019277747720479965, 0.031079523265361786, -0.008855185471475124, -0.013496476225554943, 0.021268529817461967, 0.004962452687323093, -0.03842880204319954, 0.013413767330348492, -0.004831469152122736, 0.014840598218142986, -0.019473403692245483, -0.0378333255648613, -0.037558022886514664, -0.00017545156879350543, 0.01430518552660942, 0.04353311285376549, -0.08317123353481293, -0.023391492664813995] ...
Length of embedding: 1024
Note
We see that the word “tiger” is represented by a list of 1024 numbers. In other words, its numeric representation consists of 1024 dimensions (or “sliders”) for this particular embedding model, bge-large-en-v1-5. Other models may use fewer or more numbers to represent text. You can read more about the model we are using here in its model card.
The string we are embedding is not limited to just a single word. We can pass a string of arbitrary length to the embed_query method, but we will always get a single vector of a fixed length back:
embedding_vector = embeddings.embed_query(
    "The tiger, being the largest cat species, is known for its distinctive orange and black stripes, which play a crucial role in its camouflage in the wild."
)
print(embedding_vector[:24], "...")
print("Length of embedding: ", len(embedding_vector))
[0.014211619272828102, 0.013235248625278473, 0.027379972860217094, 0.019921204075217247, 0.006261293776333332, -0.008998889476060867, -0.03617194667458534, -0.007398046087473631, 0.029378624632954597, 0.002844825154170394, 0.016292426735162735, -0.01890425570309162, -0.02342963218688965, 0.009867511689662933, -0.007190564647316933, 0.0010587171418592334, -0.002696824725717306, -0.010489331558346748, -0.007242249324917793, 0.00678795063868165, 0.030865129083395004, 0.014046319760382175, -0.040336497128009796, -0.05342057719826698] ...
Length of embedding: 1024
Hint
Embedding models usually have a maximum number of words (or tokens) that they can consider when calculating the embedding vector. If the string is longer than that, you will see an error. You can find your chosen model’s maximum length on its model card. Look for a parameter called input length, sequence length, or context length. In our example here, the maximum input length is 512 tokens.
Another important consideration is the semantic specificity of the resultant embedding vector: While every word within the sequence (up to the maximum sequence length) affects the final numbers in the embedding vector, it represents something akin to a “semantic average”. So the longer the input gets, the less sensitive the embedding is to specific details in the text.
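If you are unsure whether a string fits within that limit, you can get a rough token count with a tokenizer before embedding it. The sketch below uses tiktoken’s cl100k_base encoding, which, as noted further down in this recipe, is only an approximation for bge-large-en-v1-5, but it is usually close enough for a sanity check:
import tiktoken

# cl100k_base only approximates this model's actual tokenizer,
# but it gives a reasonable estimate of the token count
encoding = tiktoken.get_encoding("cl100k_base")

text = "The tiger, being the largest cat species, is known for its distinctive orange and black stripes, which play a crucial role in its camouflage in the wild."
n_tokens = len(encoding.encode(text))
print(f"Approximately {n_tokens} tokens (model limit: 512)")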
Just like with LLMs, you can see which models are available through DartmouthEmbeddings by using the static method list() (see recipe on LLMs):
DartmouthEmbeddings.list()
[ModelInfo(id='baai.bge-large-en-v1-5', name='baai.bge-large-en-v1-5', description=None, is_embedding=True, capabilities=['usage'], is_local=True, cost='free'),
ModelInfo(id='baai.bge-m3', name='baai.bge-m3', description=None, is_embedding=True, capabilities=['usage'], is_local=True, cost='free')]
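If you want to try one of the other listed models, pass its name to the constructor just like before. For example, this sketch (assuming the model is available to your account) switches to baai.bge-m3:
# Instantiate an embedding model with a different model name from the list above
other_embeddings = DartmouthEmbeddings(model_name="baai.bge-m3")
print(len(other_embeddings.embed_query("tiger")))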
LangChain’s TextLoader, Document, and CharacterTextSplitter classes#
The text we want to embed often lives in files of some kind, e.g., text files, Word documents, or PDFs. In LangChain, each of these files is called a document and can be represented in code by a Document object.
Since loading the files and turning them into Document objects is a common pattern, LangChain offers a collection of document loaders that support a variety of use cases and file formats. For example, if we want to load a simple text file (*.txt), we can use the TextLoader class:
from langchain_community.document_loaders import TextLoader

path_to_file = "./rag_documents/asteroids.txt"

text_loader = TextLoader(path_to_file)
documents = text_loader.load()
print(documents[0])
We can now pass the contents of the loaded document to the embed_documents method, but we will run into an issue because the text is too long for the chosen model:
try:
    response = embeddings.embed_documents([documents[0].page_content])
except Exception as e:
    print(e)
We need to split the long text into chunks of at most 512 tokens. We could write our own loop to process the contents of the document, but fortunately LangChain offers text splitters that we can use together with the document loader to split the loaded text into sequences of the right length and return each chunk as an individual Document object:
from langchain_text_splitters import CharacterTextSplitter
# Create a text splitter that splits into chunks of 512 tokens
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=512, chunk_overlap=0
)
# Load the text and split it into Document objects
documents = text_loader.load_and_split(text_splitter=text_splitter)
Hint
How text gets split into tokens (the encoding) differs from model to model. The encoding above is technically only correct for GPT-3.5 and GPT-4, but it is usually close enough to work with other models, too.
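If you want the chunk boundaries to match the embedding model’s own tokenizer more closely, LangChain’s text splitters can also be created from a Hugging Face tokenizer. The following is a sketch that assumes the transformers package is installed and that the model’s tokenizer is published on the Hugging Face Hub as BAAI/bge-large-en-v1.5:
from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

# Load the embedding model's own tokenizer (assumed Hub ID: BAAI/bge-large-en-v1.5)
bge_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

# Count chunk sizes with that tokenizer instead of cl100k_base
model_specific_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    bge_tokenizer, chunk_size=512, chunk_overlap=0
)
# This splitter can be passed to load_and_split() just like the one above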
Now that we have turned our text into chunks of the correct length, let’s embed them all by passing them to the embed_documents method:
embedded_vectors = embeddings.embed_documents([d.page_content for d in documents])
print(
    f"Embedding {len(documents)} chunks as {len(embedded_vectors)} vectors with {len(embedded_vectors[0])} dimensions."
)
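Since embed_documents returns the vectors in the same order as the input, each chunk in documents lines up with the vector at the same position in embedded_vectors. Here is a minimal sketch of keeping them together, which we will build on in the following recipes:
# Pair each chunk with its embedding vector (returned in input order)
for doc, vector in zip(documents, embedded_vectors):
    print(f"{doc.page_content[:40]!r}... -> vector of length {len(vector)}")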
Summary#
Embeddings are representations of strings as numbers. Using the embed_query and embed_documents methods, we can get the embeddings of individual strings or entire documents. This lets us perform many useful operations that capture how different words and texts are related to each other.
With embed_documents we can take advantage of LangChain’s Document class to embed content from files.
The maximum input length of the embedding model we used is 512 tokens. Using a text splitter helps ensure that each chunk stays within that limit.