Similarity Search#
In the previous recipe, we saw how to obtain embedding vectors for text of various lengths. We also learned that Large Language Models (LLMs) usually don’t require us to determine the embeddings first, because they have their own embedding layer.
However, there are several benefits to having the embedding of a word. An important one is that it gives us the ability to compare the meaning of two words. One way of doing so is by taking the dot product of their corresponding embedding vectors:
$$ \text{Similarity} = \vec{v} \cdot \vec{w} $$
First, let’s embed some words, just like we learned in the previous recipe:
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
True
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_core.output_parsers import JsonOutputParser
embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
text_1 = "Japan"
text_2 = "Sushi"
text_3 = "Italy"
text_4 = "Pizza"
embed_1 = embeddings.embed_query(text_1)
embed_2 = embeddings.embed_query(text_2)
embed_3 = embeddings.embed_query(text_3)
embed_4 = embeddings.embed_query(text_4)
Now let’s calculate the dot product: $$ \vec{v}\cdot \vec{w} = \sum_{i = 1}^N(v_i \cdot w_i) $$
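For example, for $\vec{v} = (1, 2, 3)$ and $\vec{w} = (4, 5, 6)$, the dot product is $1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$.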
def dot_product(v, w):
    similarity = sum(vi * wi for vi, wi in zip(v, w))
    return similarity
print(f"Similarity between {text_1} and {text_2} is {dot_product(embed_1, embed_2)}")
Similarity between Japan and Sushi is 0.6915366834825502
On its own, the similarity value for these two words does not tell us much about their relationship. However, we can calculate the similarity between all pairs of words to get a similarity ranking of sorts:
print(f"Similarity between {text_1} and {text_3} is {dot_product(embed_1, embed_3)}")
print(f"Similarity between {text_1} and {text_4} is {dot_product(embed_1, embed_4)}")
print(f"Similarity between {text_2} and {text_3} is {dot_product(embed_2, embed_3)}")
print(f"Similarity between {text_2} and {text_4} is {dot_product(embed_2, embed_4)}")
print(f"Similarity between {text_3} and {text_4} is {dot_product(embed_3, embed_4)}")
Similarity between Japan and Italy is 0.7488653578639416
Similarity between Japan and Pizza is 0.5895998995258542
Similarity between Sushi and Italy is 0.6092699421928521
Similarity between Sushi and Pizza is 0.7800353627353496
Similarity between Italy and Pizza is 0.7009643386868148
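To turn these pairwise scores into an actual ranking, we can compute the similarity for every unique pair and sort the results. Here is a minimal sketch that reuses the dot_product function and the embeddings from above:

from itertools import combinations

# Map each word to its embedding, reusing the variables defined above
word_embeddings = {
    text_1: embed_1,
    text_2: embed_2,
    text_3: embed_3,
    text_4: embed_4,
}

# Compute the similarity for every unique pair and print them from most to least similar
pairs = [
    (a, b, dot_product(word_embeddings[a], word_embeddings[b]))
    for a, b in combinations(word_embeddings, 2)
]
for a, b, similarity in sorted(pairs, key=lambda p: p[2], reverse=True):
    print(f"{a} and {b}: {similarity:.4f}")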
From this, we observe that Japan and Sushi share a similarity comparable to that of Italy and Pizza. Likewise, Italy and Sushi as well as Japan and Pizza exhibit similar levels of association. Interestingly, Japan and Italy also demonstrate a high degree of similarity, likely because both are countries.
Note
This is an example of how bias leaks into machine learning models. These results do not mean that you can’t get good sushi in Italy or good pizza in Japan, or that those foods don’t “belong” there. It simply means that in the training data for this embedding model, the words “Italy” and “Pizza” appeared together in the same context more often than “Italy” and “Sushi” did.
Visualizing Similarity#
Visualizing embeddings can help a human observer quickly identify clusters of similar words. Let’s generate some random words related to different domains and find their embeddings. In the recipe on building chains, the idea of a pipeline was introduced. We use this to generate and parse the output of an LLM to quickly get our test words:
import pandas as pd
import matplotlib.pyplot as plt
from langchain_dartmouth.llms import ChatDartmouth
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()
chain = llm | parser
response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. The food one should contain tomato "
    "Return the words in JSON format, using the domain as the key, and the words as values. "
)
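Since JsonOutputParser returns the model’s JSON as a plain Python dictionary, we can quickly check which domain keys came back before building the table (the exact keys depend on the model’s response; the prompt asks for animals, finance, and food):

# The parsed response is a dict mapping each domain to a list of words
print(list(response.keys()))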
We put the words into a pandas DataFrame to get a nice table with all the information side-by-side:
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")
embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
words["embedding"] = embeddings.embed_documents(words["word"])
words
|    | domain | word | embedding |
|---|---|---|---|
| 0 | animals | lion | [0.004938049, 0.019869762, 0.028288277, -0.026... |
| 1 | animals | tiger | [-0.009851212, 0.021645522, 0.020528719, 0.000... |
| 2 | animals | elephant | [0.012395542, 0.021871302, -0.019013518, 0.003... |
| 3 | animals | giraffe | [-0.009479101, -0.02422396, -0.0004888569, 0.0... |
| 4 | animals | zebra | [0.017925683, 0.0002415301, -0.024069881, -0.0... |
| 5 | animals | monkey | [0.0039162966, -0.022503398, -0.00079194736, 0... |
| 6 | animals | kangaroo | [-0.001811138, 0.021665124, -0.025688699, 0.03... |
| 7 | animals | penguin | [0.010192045, -0.024146087, -0.0005984805, -0.... |
| 8 | animals | koala | [0.005236999, 0.0013860214, -0.025655808, 0.01... |
| 9 | animals | chimpanzee | [-0.008794715, -0.021515671, -0.007180522, 0.0... |
| 10 | finance | stock | [-0.025286622, 0.022056589, 0.017092237, 0.002... |
| 11 | finance | bond | [0.01620697, 0.031973824, -0.0062463894, 0.024... |
| 12 | finance | share | [0.03605235, 0.02877077, -0.0016638363, -0.001... |
| 13 | finance | portfolio | [-0.03275574, 0.014581898, -0.0042015496, -0.0... |
| 14 | finance | investment | [-0.014876202, 0.04105009, -0.04378732, -0.013... |
| 15 | finance | dividend | [-0.007930538, 0.008843865, -0.021828571, -0.0... |
| 16 | finance | interest | [-0.023253927, 0.027659193, -0.03557491, -0.01... |
| 17 | finance | asset | [-0.02256383, 0.029316945, -0.024672678, -0.00... |
| 18 | finance | liability | [-0.024738317, 0.022415103, 0.0008463351, 0.00... |
| 19 | finance | equity | [-0.012054967, 0.026772847, -0.056304228, -0.0... |
| 20 | food | tomato | [-0.020757666, -0.01578901, 0.0070459833, -0.0... |
| 21 | food | pizza | [-0.004001807, -0.007738142, -0.03044295, 0.01... |
| 22 | food | sushi | [-0.012440149, 0.026930872, -0.044859067, 5.58... |
| 23 | food | tacos | [0.005677503, 0.032857783, 0.00058194814, 0.01... |
| 24 | food | steak | [-0.008915702, -0.0052367365, -0.012457911, 0.... |
| 25 | food | salad | [-0.036724452, -0.0014885123, -0.036908668, 0.... |
| 26 | food | ice cream | [-0.031835772, 0.022208357, 0.014132918, 0.009... |
| 27 | food | cake | [0.006645364, 0.013493908, -0.0039381627, 0.00... |
| 28 | food | donut | [-0.001012974, -0.0077673728, -0.020888202, -0... |
| 29 | food | bagel | [0.029452533, -0.0053350898, -0.00054334104, 0... |
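Each entry in the embedding column is a long list of numbers. As a quick sanity check of the dimensionality (the hint below refers to it; for bge-large-en-v1-5 we expect 1024 entries), we can measure the length of one vector:

# Length of a single embedding vector (1024 for this model)
print(len(words["embedding"].iloc[0]))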
Hint
It is difficult to visualize a 1024-dimensional vector, as we’re not 1024-dimensional humans! One way to get around this is by using UMAP (Uniform Manifold Approximation and Projection) to represent this high-dimensional vector as a two-dimensional one.
Don’t worry if the code in the next cell looks complicated. Just assume that the UMAP does the dimensionality reduction in a way that preserves the “closeness” of the high-dimensional vectors: Vectors that were similar in the high-dimensional space are mapped to points that are close together in the two-dimensional space. You can learn more about the UMAP library in its user guide.
import umap
embeddings_list = words["embedding"].to_list()
mapper = umap.UMAP().fit(embeddings_list)
umap_embeddings = pd.DataFrame(
    mapper.transform(embeddings_list), columns=["UMAP_x", "UMAP_y"]
)
words = pd.concat([words, umap_embeddings], axis=1)
words.sample(3)
|    | domain | word | embedding | UMAP_x | UMAP_y |
|---|---|---|---|---|---|
| 13 | finance | portfolio | [-0.03275574, 0.014581898, -0.0042015496, -0.0... | 2.060350 | -0.470187 |
| 1 | animals | tiger | [-0.009851212, 0.021645522, 0.020528719, 0.000... | -1.821898 | -1.041070 |
| 4 | animals | zebra | [0.017925683, 0.0002415301, -0.024069881, -0.0... | -1.249725 | -1.284568 |
Now that we have projected the embedding vectors into the two dimensions UMAP_x and UMAP_y, we can visualize them in a common scatter plot:
import plotly.express as px
px.scatter(words, x="UMAP_x", y="UMAP_y", color="domain", hover_data=["word"])
Hint
Move your mouse cursor over the individual data points to show the word it represents!
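If you prefer a static figure, or plotly is not available, a similar scatter plot can be sketched with matplotlib (which we imported earlier). This assumes the words DataFrame with its domain, UMAP_x, and UMAP_y columns from above:

import matplotlib.pyplot as plt

# Plot each domain as its own group so it gets a distinct color and legend entry
fig, ax = plt.subplots()
for domain, group in words.groupby("domain"):
    ax.scatter(group["UMAP_x"], group["UMAP_y"], label=domain)
ax.set_xlabel("UMAP_x")
ax.set_ylabel("UMAP_y")
ax.legend()
plt.show()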
We can see that each word lies fairly close to the other words in its domain, with good separation between domains. This matches our intuition and illustrates how embeddings capture semantic similarity.
Summary#
This recipe showed how to quantify the similarity between two embeddings using the dot product. Visualizing embeddings is a good way to explore this similarity: UMAP can project high-dimensional embeddings onto a 2D plane, where clusters of similar words become easy to see.