# Similarity Search

In the [previous recipe](10-embeddings.ipynb), we saw how to obtain embedding vectors for text of various lengths. We also learned that Large Language Models (LLMs) usually don't require us to determine the embeddings first, because they have their own embedding layer.

However, there are several benefits to having the embedding of a word. An important one is that it gives us the ability to compare the _meaning_ of two words. One way of doing so is by taking the **dot product** of their corresponding embedding vectors:

$$
\text{Similarity} = \vec{v} \cdot \vec{w}
$$

First, let's embed some words, just like we learned in the previous recipe:

In [None]:
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())

In [None]:
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_core.output_parsers import JsonOutputParser

embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
text_1 = "Japan"
text_2 = "Sushi"
text_3 = "Italy"
text_4 = "Pizza"

embed_1 = embeddings.embed_query(text_1)
embed_2 = embeddings.embed_query(text_2)
embed_3 = embeddings.embed_query(text_3)
embed_4 = embeddings.embed_query(text_4)

Now let's calculate the dot product:
$$
\vec{v}\cdot \vec{w} = \sum_{i = 1}^N(v_i \cdot w_i)
$$

In [None]:
def dot_product(v, w):
    similarity = sum(vi * wi for vi, wi in zip(v, w))
    return similarity
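
To make the formula concrete, here is a small worked example. The two three-dimensional vectors are made up for illustration; real embeddings from this model have many more dimensions.

```python
def dot_product(v, w):
    # Multiply the vectors element by element and add up the products
    return sum(vi * wi for vi, wi in zip(v, w))


# Hypothetical toy vectors, not real embeddings
v = [1.0, 2.0, 3.0]
w = [4.0, -5.0, 6.0]

# 1*4 + 2*(-5) + 3*6 = 4 - 10 + 18 = 12
result = dot_product(v, w)
print(result)
```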

In [None]:
print(f"Similarity between {text_1} and {text_2} is {dot_product(embed_1, embed_2)}")

The value for the similarity between these two words does not necessarily tell us a whole lot about their relationship. However, we can calculate the similarity between all the words to get a similarity ranking of sorts:

In [None]:
print(f"Similarity between {text_1} and {text_3} is {dot_product(embed_1, embed_3)}")
print(f"Similarity between {text_1} and {text_4} is {dot_product(embed_1, embed_4)}")
print(f"Similarity between {text_2} and {text_3} is {dot_product(embed_2, embed_3)}")
print(f"Similarity between {text_2} and {text_4} is {dot_product(embed_2, embed_4)}")
print(f"Similarity between {text_3} and {text_4} is {dot_product(embed_3, embed_4)}")
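
Instead of writing one `print` statement per pair, the same comparison could be sketched as a loop over all pairs, sorted by similarity. The vectors below are made-up stand-ins (an assumption for illustration), so the numbers will not match the output of the real embedding model.

```python
from itertools import combinations


def dot_product(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))


# Hypothetical 3-dimensional vectors standing in for real embeddings
toy_embeddings = {
    "Japan": [0.9, 0.1, 0.3],
    "Sushi": [0.8, 0.6, 0.1],
    "Italy": [0.2, 0.1, 0.9],
    "Pizza": [0.1, 0.6, 0.8],
}

# Compute the similarity for every unordered pair of words
pairs = [
    (a, b, dot_product(toy_embeddings[a], toy_embeddings[b]))
    for a, b in combinations(toy_embeddings, 2)
]

# Sort from most similar to least similar
ranking = sorted(pairs, key=lambda pair: pair[2], reverse=True)

for a, b, score in ranking:
    print(f"{a} and {b}: {score:.2f}")
```

With the real embeddings, you could build the dictionary from `text_1` through `text_4` and `embed_1` through `embed_4` instead of the toy vectors.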

From this, we observe that **Japan** and *Sushi* share a similarity comparable to that of **Italy** and *Pizza*. Likewise, **Italy** and *Sushi* as well as **Japan** and *Pizza* exhibit similar levels of association. Interestingly, **Japan** and **Italy** also demonstrate a high degree of similarity, likely because both are countries.

```{note}

This is an example of how *bias* leaks into machine learning models. These results do not mean that you can't get good sushi in Italy or good pizza in Japan, or that those foods don't "belong" there. It simply means that in the training data for this embedding model, the words "Italy" and "Pizza" appeared in the same context more frequently than "Italy" and "Sushi".
```



## Visualizing Similarity
Visualizing embeddings can help a human observer quickly identify clusters of similar words. Let's generate some words from different domains and find their embeddings. The [recipe on building chains](./08-building-chains.ipynb) introduced the idea of a pipeline. We use one here to generate and parse the output of an LLM to quickly get our test words:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from langchain_dartmouth.llms import ChatDartmouth


llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()

chain = llm | parser

response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. "
    "The food domain should contain the word tomato. "
    "Return the words in JSON format, using the domain as the key, and the words as values."
)

We put the words into a pandas DataFrame to get a nice table with all the information side-by-side:

In [None]:
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")

embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
# embed_documents expects a list of strings, so convert the column first
words["embedding"] = embeddings.embed_documents(words["word"].to_list())
words
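
To see what `from_dict` followed by `melt` does to the parsed response, here is a minimal sketch with a made-up dictionary. The name `toy_response` and its contents are assumptions for illustration; the actual LLM output will differ.

```python
import pandas as pd

# Hypothetical parsed response: domains as keys, lists of words as values
toy_response = {
    "animals": ["cat", "dog"],
    "food": ["tomato", "bread"],
}

# from_dict creates one column per domain (wide format);
# melt reshapes it into one row per (domain, word) pair (long format)
toy_words = pd.DataFrame.from_dict(toy_response).melt(
    var_name="domain", value_name="word"
)
print(toy_words)
```

Note that this reshaping assumes every domain has the same number of words. If the lists differ in length, `from_dict` raises an error.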

```{hint}
It is difficult to visualize a 1024-dimensional vector, as we're not 1024-dimensional humans! One way to get around this is to use [UMAP](https://umap-learn.readthedocs.io/en/latest/) (Uniform Manifold Approximation and Projection) to represent this high-dimensional vector as a two-dimensional one.

Don't worry if the code in the next cell looks complicated. Just assume that UMAP performs the dimensionality reduction in a way that preserves the "closeness" of the high-dimensional vectors: vectors that were similar in the high-dimensional space are mapped to points that are close together in the two-dimensional space. You can learn more about the UMAP library in [its user guide](https://umap-learn.readthedocs.io/en/latest/).
```

In [None]:
import umap

embeddings_list = words["embedding"].to_list()
mapper = umap.UMAP().fit(embeddings_list)
umap_embeddings = pd.DataFrame(
    mapper.transform(embeddings_list), columns=["UMAP_x", "UMAP_y"]
)

words = pd.concat([words, umap_embeddings], axis=1)

words.sample(3)

Now that we have projected the embedding vectors into two dimensions, `UMAP_x` and `UMAP_y`, we can visualize them in a single scatter plot:

In [None]:
import plotly.express as px

px.scatter(words, x="UMAP_x", y="UMAP_y", color="domain", hover_data=["word"])

```{hint}
Move your mouse cursor over an individual data point to show the word it represents!
```

We can see that words are somewhat close to the other words in the same domain, with good separation from the other domains. This matches our intuition, and illustrates how embeddings capture semantic similarity.
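
The visual impression can also be checked numerically. As a rough sketch, we could compare the average dot product of word pairs within the same domain to that of pairs from different domains. The two-dimensional vectors and the name `toy_words` below are made up for illustration, so only the approach carries over to the real embeddings, not the numbers.

```python
from itertools import combinations


def dot_product(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))


# Hypothetical toy data: (domain, word, embedding)
toy_words = [
    ("animals", "cat", [0.9, 0.1]),
    ("animals", "dog", [0.8, 0.2]),
    ("food", "tomato", [0.1, 0.9]),
    ("food", "bread", [0.2, 0.8]),
]

within, across = [], []
for (domain_1, _, embed_1), (domain_2, _, embed_2) in combinations(toy_words, 2):
    similarity = dot_product(embed_1, embed_2)
    # Sort each pair into "same domain" or "different domains"
    if domain_1 == domain_2:
        within.append(similarity)
    else:
        across.append(similarity)

mean_within = sum(within) / len(within)
mean_across = sum(across) / len(across)
print(f"Mean similarity within domains: {mean_within:.3f}")
print(f"Mean similarity across domains: {mean_across:.3f}")
```

If the embeddings capture the domains well, the within-domain average should be the larger of the two.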

## Summary

This recipe showed how to measure the similarity between two embeddings using the dot product. Visualizing embeddings is a good way to inspect this similarity: UMAP projects high-dimensional embeddings onto a 2D plane, where semantically similar words tend to appear close together.