• sakras 20 hours ago

    What really made HNSW click for me was when someone described it as a "skiplist graph". It really does seem to share a lot of the same properties and intuition as a simple skiplist, where descending levels corresponds to finer-grained navigation through the space.
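
    To make the analogy concrete, here's a minimal skiplist search sketch in Python (the names are illustrative, not from any particular library). The top-to-bottom level descent is the same coarse-to-fine navigation HNSW performs, except HNSW replaces "move right while the key is smaller" with "move to the neighbor closest to the query":

        class Node:
            def __init__(self, key, height):
                self.key = key
                self.next = [None] * height  # next[i] = successor at level i

        def search(head, target):
            node = head
            # Start at the sparsest (top) level, drop to the densest (bottom).
            for level in range(len(head.next) - 1, -1, -1):
                while node.next[level] and node.next[level].key < target:
                    node = node.next[level]
            found = node.next[0]
            return found if found and found.key == target else None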

    A couple of other thoughts from this perspective:

    - If a drawing of a skiplist is 2D, HNSW seems to be a 3D generalization. It makes me wonder if you could apply an HNSW-like scheme to a traditional relational database index (maybe a multicolumn index?).

    - If a skiplist is a probabilistic alternative to a B+-tree, what's HNSW a probabilistic alternative to?

    One thing that still mystifies me is how, in the context of vector indexing, HNSW seems to "bootstrap" itself: it uses itself to find the neighbors of each newly-inserted point. It's not clear to me why that doesn't diverge into garbage results as the index grows.
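
    For what it's worth, here's a heavily simplified, single-layer (NSW-style) insert sketch in Python; all names and parameters are illustrative. One intuition for why this doesn't degrade: every insert links the new point bidirectionally to its approximate neighbors, so the graph stays navigable and later searches can still reach freshly inserted points:

        def greedy_closest(graph, vectors, q, entry, dist):
            """Greedy walk: hop to whichever neighbor is closest to q."""
            cur, cur_d = entry, dist(vectors[entry], q)
            while True:
                nxt = min(graph[cur] | {cur}, key=lambda i: dist(vectors[i], q))
                nxt_d = dist(vectors[nxt], q)
                if nxt_d >= cur_d:
                    return cur  # local minimum = approximate nearest neighbor
                cur, cur_d = nxt, nxt_d

        def insert(graph, vectors, new_id, dist, M=8):
            """Use the existing graph to find the new point's neighbors, then
            link both ways. (Real HNSW does this per layer, with a beam search
            and neighbor-list pruning.)"""
            if not graph:
                graph[new_id] = set()
                return
            entry = next(iter(graph))
            closest = greedy_closest(graph, vectors, vectors[new_id], entry, dist)
            candidates = sorted(graph[closest] | {closest},
                                key=lambda i: dist(vectors[i], vectors[new_id]))[:M]
            graph[new_id] = set(candidates)
            for c in candidates:
                graph[c].add(new_id)  # backlinks keep new points reachable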

    • j-pb 6 hours ago

      > if a skiplist is a probabilistic alternative to a B+-tree, what's HNSW a probabilistic alternative to?

      I don't think that analogy is accurate.

      A prolly tree could also be considered a probabilistic version of a B-tree, and a skiplist could be considered a probabilistic version of a trie. So there certainly are probabilistic vs. non-probabilistic data structures, but there isn't really a 1-to-1 correspondence between them, so your inference of a missing X doesn't quite hold.

      Here's a better dichotomy, imho. Both the probabilistic and the non-probabilistic data structures above share a commonality: they operate on totally ordered data.

      So a skiplist is actually 1D.

      HNSW and other approximate k-nearest neighbour algorithms, however, operate over metric spaces, where elements have a distance/similarity but no meaningful order.[1]

      Now if you ask for the deterministic equivalent of an approximate-kNN algorithm/data structure, then that would simply be anything in the exact kNN category.
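
      A tiny toy example of what "no meaningful order" means in practice: in a metric space, "which element comes first" is only defined relative to a query point, and the answer can flip when the query moves, so there is no query-independent sort key for a B-tree or skiplist to index:

          import math

          d = math.dist  # Euclidean metric (Python 3.8+)

          b, c = (2.0, 0.0), (0.0, 3.0)
          q1, q2 = (0.0, 0.0), (0.0, 2.0)

          # Which of b, c comes "first" depends entirely on the query:
          print(d(q1, b) < d(q1, c))  # True:  seen from q1, b precedes c
          print(d(q2, b) < d(q2, c))  # False: seen from q2, c precedes b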

      1: https://math.stackexchange.com/questions/1429373/why-is-ther...

    • WhitneyLand an hour ago

      For anyone new to this, the topic has gained interest in recent years alongside the rise of AI. Many ML models project text, images, etc. into a high-dimensional embedding space as vectors (arrays of numbers). It then becomes possible to numerically compare these vectors to see how closely related they are, in other words to find their semantic similarity.
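
      As a toy illustration (made-up 4-dimensional vectors; real embeddings have hundreds or thousands of dimensions), cosine similarity is one common way to compare them:

          import numpy as np

          def cosine_similarity(u, v):
              """Angle-based similarity of two embedding vectors, in [-1, 1]."""
              return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

          # Hypothetical embeddings; in reality a model produces these.
          cat    = np.array([0.9, 0.1, 0.0, 0.3])
          kitten = np.array([0.8, 0.2, 0.1, 0.4])
          car    = np.array([0.0, 0.9, 0.8, 0.1])

          print(cosine_similarity(cat, kitten))  # ~0.98: semantically close
          print(cosine_similarity(cat, car))     # ~0.10: semantically far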

      This nifty feature leads to the need to scale up. For example: given many vectors in a database, which one is most similar to my word or image? Applications like this need ways to run these kinds of searches efficiently at scale, and HNSW is one prominent technique for doing so.
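
      To see why an index is needed at all, here's the naive baseline as a sketch (sizes are made up): an exact scan compares the query against every vector, O(N) work per query, whereas HNSW walks a graph and touches only a small fraction of the data for an approximate answer:

          import numpy as np

          rng = np.random.default_rng(0)
          db = rng.normal(size=(100_000, 128))  # 100k embeddings, 128-dim
          query = rng.normal(size=128)

          # Exact brute-force nearest neighbor: compare against all N vectors.
          dists = np.linalg.norm(db - query, axis=1)
          print(int(np.argmin(dists)))  # index of the closest vector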

      • janalsncm 19 hours ago

        HNSW was the first approximate nearest neighbor search algorithm I encountered. There are others that are useful in different contexts (i.e. depending on your data size and available memory). And you can combine them sometimes. This video is a really awesome overview of them:

        https://youtu.be/SKrHs03i08Q

        • 3abiton 21 hours ago

          Interesting read. I worked on graph theory for a while, quite some time ago, and didn't even know some of this terminology.