So word vectors solve the problem that two words may never appear in the same context, yet can be strongly correlated. "Python" may never be found close to "Ruby", yet "scripting" is likely to be found in both their contexts so the embedding algorithm will ensure that they are close in some vector space. Except it rarely works well because of the curse of dimensionality.
Perhaps one could represent word embeddings as vertices, rather than vectors? Suppose you find "Python" and "scripting" in the same context. You draw a weighted edge between them. If you find the same words again you reduce the weight of the edge. Then to compute the similarity between two words, just compute the weighted shortest path between their vertices. You could extend it to pair-wise sentence similarity using Steiner trees. Of course it would be much slower than cosine similarity, but probably also much more useful.
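Rough sketch of the idea (networkx and the 1/count weighting are just my choices for illustration, not a definitive design):

```python
# Toy co-occurrence graph: edge weight acts as a distance, and every observed
# co-occurrence shortens it, so strongly related words end up "close" even if
# they never appear together themselves.
import networkx as nx

G = nx.Graph()

def observe(w1, w2):
    count = G[w1][w2]["count"] + 1 if G.has_edge(w1, w2) else 1
    G.add_edge(w1, w2, count=count, weight=1.0 / count)

def distance(w1, w2):
    # Smaller = more similar; Steiner trees would generalize this to word sets.
    return nx.shortest_path_length(G, w1, w2, weight="weight")

observe("python", "scripting")
observe("ruby", "scripting")
print(distance("python", "ruby"))  # 2.0 -- linked only through "scripting"
```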
You might be interested in HippoRAG [1] which takes a graph-based approach similar to what you’re suggesting here.
If you're using cosine similarity when retrieving for a RAG application, a good approach is to then use a "semantic re-ranker" or "L2 re-ranking model" to re-rank the results to better match the user query.
There's an example in the pgvector-python that uses a cross-encoder model for re-ranking: https://github.com/pgvector/pgvector-python/blob/master/exam...
You can even use a language model for re-ranking, though it may not be as good as a model trained specifically for re-ranking purposes.
In our Azure RAG approaches, we use the AI Search semantic ranker, which uses the same model that Bing uses for re-ranking search results.
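Outside of Azure, a minimal version of that re-ranking step with a cross-encoder from sentence-transformers might look like this (the model name and top_k are just examples):

```python
# Re-rank vector-search candidates with a cross-encoder: it scores each
# (query, passage) pair jointly, which usually beats the original cosine order.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=5):
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```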
Another tip: do NOT store vector embeddings of nothingness: mostly whitespace, a solid image, etc. We've had a few situations with RAG data stores which accidentally ingested mostly-empty content (either text or image), and those dang vectors matched EVERYTHING. As I like to think of it, there's a bit of nothing in everything... so make sure that if you are storing a vector embedding, there is some amount of signal in that embedding.
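One cheap guard (the thresholds here are made up; tune them for your data) is to drop near-empty chunks before they ever reach the embedding model:

```python
# Reject chunks that are mostly whitespace or otherwise carry almost no signal,
# so their embeddings never enter the store and can't match everything.
def has_enough_signal(text: str, min_chars: int = 20, min_alnum_ratio: float = 0.3) -> bool:
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    alnum = sum(c.isalnum() for c in stripped)
    return alnum / len(stripped) >= min_alnum_ratio

print(has_enough_signal("   \n\n  "))                                   # False
print(has_enough_signal("Quarterly revenue grew 12% year over year."))  # True
```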
Interesting. On a project I worked on (audio recognition for a voice-command system) we ended up going the other way and explicitly adding an encoding of "nothingness" (actually two: one for "silence" and another for "white noise"), then special-casing them ("if either 'silence' or 'noise' is in the top 3 matches, ignore the input entirely").
This was to avoid the problem where, when we only had vectors for "valid" sounds and the input didn't match anything in the training set (a foreign language, a garbage truck backing up, a dog barking, ...), the model would still return some word as the closest match (there's always a vector with the highest similarity), and frequently with high confidence. In other words, even though the input didn't match anything in the training set, it would be "enough" more like one known vector than any of the others that it would pass most threshold tests, leading to a lot of false positives.
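In pseudo-numpy terms (not the actual system, just the shape of the logic):

```python
import numpy as np

# Known command embeddings plus explicit "reject" classes for non-speech input.
labels = ["play", "stop", "volume_up", "silence", "white_noise"]
REJECT = {"silence", "white_noise"}

def classify(input_vec, vectors):
    # vectors: one L2-normalized row per label, so dot product = cosine similarity
    sims = vectors @ input_vec
    top3 = {labels[i] for i in np.argsort(sims)[-3:]}
    if top3 & REJECT:
        return None  # treat as "not a command" instead of forcing the nearest word
    return labels[int(np.argmax(sims))]
```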
That sounds like a problem with the embedding; wouldn't you need to renormalise so that low-signal inputs could be well represented? A white square and a red square shouldn't be captured at different levels of detail. Depending on the purpose of the vector embedding, there should be a difference between images of mostly white pixels and partial images.
Disclaimer, I don't know shit.
I should clarify that I experienced these issues with text-embedding-ada-002 and the Azure AI vision model (based on Florence). I have not tested many other embedding models to see if they'd have the same issue.
FWIW I think you're right. We have very different stacks, and I've observed the same thing, with a much clunkier description than your elegant way of putting it.
I do embeddings on arbitrary websites at runtime, and had a persistent problem with the last chunk of a web page matching more things. In retrospect, it's obvious that the smaller the chunk was, the more it matched everything.
Full details: MSMARCO MiniLM L6V3 inferenced using ONNX on iOS/web/android/macos/windows/linux
Same experience embedding random alphanumeric strings or strings of digits with smaller embedding models—very important to filter those out.
My chunk rewriting method is to use an LLM to generate a title, summary, keyword list, topic, parent topic, and grandparent topic. Then I embed the concatenation of all of them instead of just the original chunk. This helps a lot.
One fundamental problem of cosine similarity is that it works at the surface level. For example, "5+5" won't embed close to "10". Or "The 5th word of this phrase" won't be similar to "this".
If there is any implicit knowledge it won't be captured by simple cosine similarity, which is why we need to draw out those implicit deductions before embedding. Hence my approach of pre-embedding expansion of chunk semantic information.
I basically treat text like code, and have to "run the code" to get its meaning unpacked.
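A bare-bones sketch of that expand-then-embed flow (model names, prompt wording, and whether to keep the original chunk alongside the expansion are my guesses, not the parent's exact setup):

```python
from openai import OpenAI

client = OpenAI()

def expand_chunk(chunk: str) -> str:
    # "Run the code": ask an LLM to surface the chunk's implicit structure.
    prompt = (
        "For the text below, write a title, a short summary, a keyword list, "
        "a topic, a parent topic, and a grandparent topic.\n\n" + chunk
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def embed_chunk(chunk: str) -> list[float]:
    # Embed the expansion (optionally concatenated with the original chunk).
    enriched = expand_chunk(chunk) + "\n\n" + chunk
    return client.embeddings.create(
        model="text-embedding-3-small", input=enriched
    ).data[0].embedding
```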
How do you contextualize the chunk at re-write time?
In ML everything is a tradeoff. The article strongly suggests using dot product similarity, and it's a great metric in some situations, but dot product similarity has some issues too:
- not normalized (unlike cosine similarity)
- heavily favors large vectors
- unbounded output
- ...
Basically, do not carelessly use any similarity metric.
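A tiny numpy illustration of the "heavily favors large vectors" point (the numbers are arbitrary):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.1])   # nearly the same direction as a
c = np.array([10.0, 8.0])  # different direction, but much larger

print(a @ b, a @ c)                # 1.0 vs 10.0  -> dot product picks the big vector
print(cosine(a, b), cosine(a, c))  # ~0.995 vs ~0.78 -> cosine picks the aligned one
```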
Occasionally I'll forget a famous quote [0] so I'll describe it to an LLM but the LLM is rarely able to find it. I think it's because the description of the quote uses 'like' words, but not the exact words in the quote, so the LLM gets confused and can't find it.
Interestingly, the opposite conclusion is drawn in the TFA (the article says LLMs are quite good at identifying 'like' words, or, at least, better than the cosine method, which admittedly isn't a high bar).
[0] Admittedly, some are a little obscure, but they're in famous publications by famous authors, so I'd have expected an LLM to have 'seen' them before.
> So, what can we use instead?
> The most powerful approach
> The best approach is to directly use LLM query to compare two entries.
Cross encoders are a solution I’m quite fond of, high performing and much faster. I recently put an STS cross encoder up on huggingface based on ModernBERT that performs very well.
I had to look that up… for others:
An STS cross encoder is a model that uses the CrossEncoder class to predict the semantic similarity between two sentences. STS stands for Semantic Textual Similarity.
Link please?
Here you go!
https://huggingface.co/dleemiller/ModernCE-base-sts
There’s also the large model, which performs a bit better.
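If anyone wants to try it, usage through the sentence-transformers CrossEncoder class is roughly:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-base-sts")
scores = model.predict([
    ("The weather is lovely today.", "It is a beautiful, sunny day."),
    ("The weather is lovely today.", "The stock market fell sharply."),
])
print(scores)  # higher score = more semantically similar
```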
Cross encoders still don’t solve the fundamental problem of defining similarity that the author is referring to.
Frankly, the LLM approach the author talks about in the end doesn’t either. What does “similar” mean here?
Given inputs A, B, and C, you have to decide whether A and B are more similar or A and C are more similar. The algorithm (or architecture, depending on how you look at it) can’t do that for you. Dual encoder, cross encoder, bag of words, it doesn’t matter.
Just want to say how great I am for calling this out a few months ago https://news.ycombinator.com/context?id=41470605
Cosine similarity and top-k RAG feel so primitive to me, like we are still in the semantic dark ages.
The article is right to point out that, in most cases, cosine similarity is more an accidental property of the data than anything deliberate (but IIUC there are newer embedding models that are deliberately trained for cosine similarity as a similarity measure). The author's bootstrapping approach is interesting, especially because of its ability to map relations other than the identity, but it seems like more of a computational optimization or shortcut (you could just run inference on the input) than a way to correlate unstructured data.
After trying out some RAG approaches and becoming disillusioned pretty quickly, I think we need to solve the problem much deeper by structuring models so that they can perform RAG during training. Prompting typical LLMs with RAG gives them input that is dissimilar from their training data and relies on heuristics (like the data format) and thresholds (like top-k) that live outside the model itself. We could probably greatly improve this by having models define the embeddings, formats, and retrieval processes (i.e. learn their own multi-step or "agentic" RAG while they learn everything else) that best help them model their training data.
I'm not an AI researcher though and I assume the real problem is that getting the right structure to train properly/efficiently is rather difficult.
By the way, I just wanted to say I really like your post! It’s well-reasoned, clear, and the use of images makes it super easy and enjoyable to read. So visually pleasing!
Typo: "When we with vectors" should be "When we work with vectors" I think.
Very interesting article. Is there any model that can generate embeddings given a system prompt? This could be useful not only for similarity search but also for clustering use cases, without having to do too much custom work. Essentially, a zero-shot embedding model.
There are many embedding models supporting instructions. https://huggingface.co/spaces/mteb/leaderboard
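Many of them take the instruction as a plain text prefix on the input; a rough sketch (the model name and prefix format are only an example, check the card of whichever model you use):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Instruction-tuned embedding models typically expect the "system prompt"
# prepended to the query in a model-specific format.
instruction = "Instruct: Cluster customer tickets by root cause\nQuery: "
emb_a = model.encode(instruction + "App crashes when uploading a photo")
emb_b = model.encode(instruction + "Image upload makes the app close unexpectedly")
print(util.cos_sim(emb_a, emb_b))
```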
I may have missed the point of your question, but there are many generators of embeddings:
https://openai.com/index/new-embedding-models-and-api-update...
> "Is {sentence_a} similar to {sentence_b}?"
I also find this method powerful. I see more and more software logic getting outsourced to LLM judgements/prompts.
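A bare-bones version of that pattern (model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def llm_similar(sentence_a: str, sentence_b: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f'Is "{sentence_a}" similar to "{sentence_b}"? Answer yes or no.',
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```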
The article is basically saying: if the feature vectors are cryptically encoded, then cosine similarity tells you little.
Cosine similarity of two encrypted images would be useless; decrypt them, and it becomes a bit more useful.
The strings are not the territory, in other words; the territory is the semantic constructs cryptically encoded into those strings. You want the similarity of the constructs, not the strings.
I can't see any of this in the article, at all.
I think that's what it says under "Is it the right kind of similarity?":
> Consider books.
> For a literary critic, similarity might mean sharing thematic elements. For a librarian, it's about genre classification.
> For a reader, it's about emotions it evokes. For a typesetter, it's page count and format.
> Each perspective is valid, yet cosine similarity smashes all these nuanced views into a single number — with confidence and an illusion of objectivity.