Similarity Search#
In the previous recipe, we saw how to obtain embedding vectors for text of various lengths. We also learned that Large Language Models (LLMs) usually don’t require us to determine the embeddings first, because they have their own embedding layer.
However, there are several benefits to having the embedding of a word. An important one is that it gives us the ability to compare the meaning of two words. One way of doing so is by taking the dot product of their corresponding embedding vectors:
$$ \text{Similarity} = \vec{v} \cdot \vec{w} $$
First, let’s embed some words, just like we learned in the previous recipe:
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
True
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_core.output_parsers import JsonOutputParser
embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
text_1 = "Japan"
text_2 = "Sushi"
text_3 = "Italy"
text_4 = "Pizza"
embed_1 = embeddings.embed_query(text_1)
embed_2 = embeddings.embed_query(text_2)
embed_3 = embeddings.embed_query(text_3)
embed_4 = embeddings.embed_query(text_4)
Now let’s calculate the dot product: $$ \vec{v}\cdot \vec{w} = \sum_{i = 1}^N(v_i \cdot w_i) $$
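For example, for $\vec{v} = (1, 2, 3)$ and $\vec{w} = (4, 5, 6)$, the dot product is $1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$.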
def dot_product(v, w):
    similarity = sum(vi * wi for vi, wi in zip(v, w))
    return similarity
print(f"Similarity between {text_1} and {text_2} is {dot_product(embed_1, embed_2)}")
Similarity between Japan and Sushi is 0.6915366834825502
On its own, the similarity value for these two words does not tell us much about their relationship. However, we can calculate the similarity between all pairs of words to get a similarity ranking of sorts:
print(f"Similarity between {text_1} and {text_3} is {dot_product(embed_1, embed_3)}")
print(f"Similarity between {text_1} and {text_4} is {dot_product(embed_1, embed_4)}")
print(f"Similarity between {text_2} and {text_3} is {dot_product(embed_2, embed_3)}")
print(f"Similarity between {text_2} and {text_4} is {dot_product(embed_2, embed_4)}")
print(f"Similarity between {text_3} and {text_4} is {dot_product(embed_3, embed_4)}")
Similarity between Japan and Italy is 0.7488653578639416
Similarity between Japan and Pizza is 0.5895998995258542
Similarity between Sushi and Italy is 0.6092699421928521
Similarity between Sushi and Pizza is 0.7800353627353496
Similarity between Italy and Pizza is 0.7009643386868148
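To turn these pairwise scores into an actual ranking, we can compute the similarity for every unique pair and sort the results. Here is a minimal sketch that reuses the dot_product function and the embeddings from above:

from itertools import combinations

# Map each word to its embedding, reusing the variables defined above
word_embeddings = {
    text_1: embed_1,
    text_2: embed_2,
    text_3: embed_3,
    text_4: embed_4,
}

# Compute the similarity for every unique pair and print them from most to least similar
pairs = [
    (a, b, dot_product(word_embeddings[a], word_embeddings[b]))
    for a, b in combinations(word_embeddings, 2)
]
for a, b, similarity in sorted(pairs, key=lambda p: p[2], reverse=True):
    print(f"{a} and {b}: {similarity:.4f}")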
From this, we observe that Japan and Sushi share a similarity comparable to that of Italy and Pizza. Likewise, Italy and Sushi as well as Japan and Pizza exhibit similar levels of association. Interestingly, Japan and Italy also demonstrate a high degree of similarity, likely because both are countries.
Note
This is an example of how bias leaks into machine learning models. These results do not mean that you can’t get good sushi in Italy or good pizza in Japan, or that those foods don’t “belong” there. It simply means that in the training data for this embedding model, the words “Italy” and “Pizza” appeared together in the same context more often than “Italy” and “Sushi” did.
Visualizing Similarity#
Visualizing embeddings can help a human observer quickly identify clusters of similar words. Let’s generate some random words related to different domains and find their embeddings. In the recipe on building chains, the idea of a pipeline was introduced. We use this to generate and parse the output of an LLM to quickly get our test words:
import pandas as pd
import matplotlib.pyplot as plt
from langchain_dartmouth.llms import ChatDartmouth
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()
chain = llm | parser
response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. The food one should contain tomato "
    "Return the words in JSON format, using the domain as the key, and the words as values. "
)
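Since JsonOutputParser returns the model’s JSON as a plain Python dictionary, we can quickly check which domain keys came back before building the table (the exact keys depend on the model’s response; the prompt asks for animals, finance, and food):

# The parsed response is a dict mapping each domain to a list of words
print(list(response.keys()))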
We put the words into a pandas DataFrame to get a nice table with all the information side-by-side:
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")
embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
words["embedding"] = embeddings.embed_documents(words["word"])
words
|    | domain | word | embedding |
|---|---|---|---|
| 0 | animals | lion | [0.004938049, 0.019869762, 0.028288277, -0.026... |
| 1 | animals | tiger | [-0.009851212, 0.021645522, 0.020528719, 0.000... |
| 2 | animals | elephant | [0.012395542, 0.021871302, -0.019013518, 0.003... |
| 3 | animals | giraffe | [-0.009479101, -0.02422396, -0.0004888569, 0.0... |
| 4 | animals | zebra | [0.017925683, 0.0002415301, -0.024069881, -0.0... |
| 5 | animals | monkey | [0.0039162966, -0.022503398, -0.00079194736, 0... |
| 6 | animals | kangaroo | [-0.001811138, 0.021665124, -0.025688699, 0.03... |
| 7 | animals | penguin | [0.010192045, -0.024146087, -0.0005984805, -0.... |
| 8 | animals | koala | [0.005236999, 0.0013860214, -0.025655808, 0.01... |
| 9 | animals | chimpanzee | [-0.008794715, -0.021515671, -0.007180522, 0.0... |
| 10 | finance | stock | [-0.025286622, 0.022056589, 0.017092237, 0.002... |
| 11 | finance | bond | [0.01620697, 0.031973824, -0.0062463894, 0.024... |
| 12 | finance | share | [0.03605235, 0.02877077, -0.0016638363, -0.001... |
| 13 | finance | portfolio | [-0.03275574, 0.014581898, -0.0042015496, -0.0... |
| 14 | finance | investment | [-0.014876202, 0.04105009, -0.04378732, -0.013... |
| 15 | finance | dividend | [-0.007930538, 0.008843865, -0.021828571, -0.0... |
| 16 | finance | interest | [-0.023253927, 0.027659193, -0.03557491, -0.01... |
| 17 | finance | asset | [-0.02256383, 0.029316945, -0.024672678, -0.00... |
| 18 | finance | liability | [-0.024738317, 0.022415103, 0.0008463351, 0.00... |
| 19 | finance | equity | [-0.012054967, 0.026772847, -0.056304228, -0.0... |
| 20 | food | tomato | [-0.020757666, -0.01578901, 0.0070459833, -0.0... |
| 21 | food | pizza | [-0.004001807, -0.007738142, -0.03044295, 0.01... |
| 22 | food | sushi | [-0.012440149, 0.026930872, -0.044859067, 5.58... |
| 23 | food | tacos | [0.005677503, 0.032857783, 0.00058194814, 0.01... |
| 24 | food | steak | [-0.008915702, -0.0052367365, -0.012457911, 0.... |
| 25 | food | salad | [-0.036724452, -0.0014885123, -0.036908668, 0.... |
| 26 | food | ice cream | [-0.031835772, 0.022208357, 0.014132918, 0.009... |
| 27 | food | cake | [0.006645364, 0.013493908, -0.0039381627, 0.00... |
| 28 | food | donut | [-0.001012974, -0.0077673728, -0.020888202, -0... |
| 29 | food | bagel | [0.029452533, -0.0053350898, -0.00054334104, 0... |
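Each entry in the embedding column is a long list of numbers. As a quick sanity check of the dimensionality (the hint below refers to it; for bge-large-en-v1-5 we expect 1024 entries), we can measure the length of one vector:

# Length of a single embedding vector (1024 for this model)
print(len(words["embedding"].iloc[0]))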
Hint
It is difficult to visualize a 1024-dimensional vector, as we’re not 1024-dimensional humans! One way to get around this is by using UMAP (Uniform Manifold Approximation and Projection) to represent this high-dimensional vector as a two-dimensional one.
Don’t worry if the code in the next cell looks complicated. Just assume that the UMAP does the dimensionality reduction in a way that preserves the “closeness” of the high-dimensional vectors: Vectors that were similar in the high-dimensional space are mapped to points that are close together in the two-dimensional space. You can learn more about the UMAP library in its user guide.
import umap
embeddings_list = words["embedding"].to_list()
mapper = umap.UMAP().fit(embeddings_list)
umap_embeddings = pd.DataFrame(
    mapper.transform(embeddings_list), columns=["UMAP_x", "UMAP_y"]
)
words = pd.concat([words, umap_embeddings], axis=1)
words.sample(3)
|    | domain | word | embedding | UMAP_x | UMAP_y |
|---|---|---|---|---|---|
| 13 | finance | portfolio | [-0.03275574, 0.014581898, -0.0042015496, -0.0... | 2.060350 | -0.470187 |
| 1 | animals | tiger | [-0.009851212, 0.021645522, 0.020528719, 0.000... | -1.821898 | -1.041070 |
| 4 | animals | zebra | [0.017925683, 0.0002415301, -0.024069881, -0.0... | -1.249725 | -1.284568 |
Now that we have projected the embedding vectors into the two dimensions UMAP_x and UMAP_y, we can visualize them in a common scatter plot:
import plotly.express as px
px.scatter(words, x="UMAP_x", y="UMAP_y", color="domain", hover_data=["word"])
Hint
Move your mouse cursor over the individual data points to show the word it represents!
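If you prefer a static figure, or plotly is not available, a similar scatter plot can be sketched with matplotlib (which we imported earlier). This assumes the words DataFrame with its domain, UMAP_x, and UMAP_y columns from above:

import matplotlib.pyplot as plt

# Plot each domain as its own group so it gets a distinct color and legend entry
fig, ax = plt.subplots()
for domain, group in words.groupby("domain"):
    ax.scatter(group["UMAP_x"], group["UMAP_y"], label=domain)
ax.set_xlabel("UMAP_x")
ax.set_ylabel("UMAP_y")
ax.legend()
plt.show()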
We can see that each word lies fairly close to the other words in its domain, with good separation between domains. This matches our intuition and illustrates how embeddings capture semantic similarity.
Summary#
This recipe showed how to quantify the similarity between two embeddings using the dot product. Visualizing embeddings is a good way to explore this similarity: UMAP can project high-dimensional embeddings onto a 2D plane, where clusters of similar words become easy to see.