• OfficialTurkey 2 minutes ago

    I haven't been following the vector db space closely for a couple of years now, but I find it strange that they didn't compare their performance to the newest generation of serverless vector DBs: Pinecone Serverless, turbopuffer, Chroma (the distributed version, not the original single-node implementation). I understand that those are (mostly) hosted products, so there's no true apples-to-apples comparison on the same hardware, but surely the most interesting numbers are cost vs. performance.

    • simonw 2 hours ago

      Their self-reported benchmarks have them outperforming Pinecone by 7x in queries-per-second: https://zvec.org/en/docs/benchmarks/

      I'd love to see those results independently verified, and I'd also love a good explanation of how they're getting such great performance.

      • ashvardanian 44 minutes ago

        8K QPS is probably quite trivial on their setup and a 10M dataset. I rarely use comparably small instances & datasets in my benchmarks, but with 100M-1B datasets on a larger dual-socket server, 100K QPS was easily achievable in 2023: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search... ;)

        Typically, the recipe is to keep the hot parts of the data structure in SRAM (the CPU caches) and to use a lot of SIMD. At the time of those measurements, USearch used ~100 custom kernels for different data types, similarity metrics, and hardware platforms. The upcoming release of the underlying SimSIMD micro-kernels project will push this number beyond 1000, so we should be able to squeeze out a lot more performance later this year.
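
        Roughly, the difference looks like this in Python (a minimal sketch; the simsimd call is from the project's Python bindings, the vector size is illustrative):

          import numpy as np
          import simsimd  # pip install simsimd

          rng = np.random.default_rng(0)
          a = rng.random(1536, dtype=np.float32)
          b = rng.random(1536, dtype=np.float32)

          # NumPy baseline: cosine distance from dot products and norms.
          numpy_dist = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

          # SimSIMD dispatches at runtime to a hand-written SIMD kernel
          # for this dtype/metric/ISA combination.
          simd_dist = simsimd.cosine(a, b)

          print(float(numpy_dist), float(simd_dist))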

        • panzi 7 minutes ago

          PGVectorScale claims even more. I'd also like to see someone verify that.

        • clemlesne 3 hours ago

          Has anyone compared it with USearch (https://github.com/unum-cloud/USearch)?

          • neilellis an hour ago

            I'd like to see that too; USearch is amazingly fast: searching 44M embeddings in < 100ms.
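
            For anyone curious, its Python API is tiny (a sketch based on the documented usage; dimensions and keys are illustrative):

              import numpy as np
              from usearch.index import Index  # pip install usearch

              index = Index(ndim=256, metric="cos", dtype="f32")

              vectors = np.random.rand(10_000, 256).astype(np.float32)
              index.add(np.arange(len(vectors)), vectors)  # bulk insert, integer keys

              matches = index.search(vectors[0], 10)  # 10 approximate nearest neighbors
              print(matches.keys, matches.distances)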

          • cjonas 33 minutes ago

            How does this compare to DuckDB's vector capabilities (the vss extension)?
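
            For reference, the DuckDB side looks roughly like this (a sketch from the vss extension's documented SQL; toy dimensions and data):

              import duckdb  # pip install duckdb

              con = duckdb.connect()  # in-memory database
              con.execute("INSTALL vss")
              con.execute("LOAD vss")
              con.execute("CREATE TABLE items (id INT, vec FLOAT[3])")
              con.execute("INSERT INTO items VALUES (1, [1.0, 2.0, 3.0]::FLOAT[3]), (2, [2.0, 2.0, 2.0]::FLOAT[3])")
              con.execute("CREATE INDEX idx ON items USING HNSW (vec)")  # vss HNSW index

              print(con.execute(
                  "SELECT id FROM items "
                  "ORDER BY array_distance(vec, [1.0, 2.0, 2.0]::FLOAT[3]) LIMIT 1"
              ).fetchall())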

            • _pdp_ 2 hours ago

              I thought you needed a lot of memory for these things, and that CPU wasn't the bottleneck?

              • binarymax an hour ago

                I haven’t looked at this repo, but new techniques taking advantage of NVMe and io_uring make on-disk performance really good without needing to keep everything in RAM.
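
                The core idea, in toy Python form (this only shows demand-paged vectors via np.memmap; real engines add NVMe-aware layouts and io_uring batching, which this doesn't capture):

                  import numpy as np

                  n, dim = 200_000, 128

                  # Build a vector file on disk once.
                  vecs = np.memmap("vectors.f32", dtype=np.float32, mode="w+", shape=(n, dim))
                  vecs[:] = np.random.rand(n, dim).astype(np.float32)
                  vecs.flush()

                  # Search without loading the dataset into RAM: pages are
                  # faulted in from disk on demand as rows are touched.
                  ondisk = np.memmap("vectors.f32", dtype=np.float32, mode="r", shape=(n, dim))
                  query = np.random.rand(dim).astype(np.float32)

                  scores = ondisk @ query  # brute-force; an index would touch few pages
                  print(np.argsort(-scores)[:10])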

              • skybrian 2 hours ago

                Are these sorts of similarity searches useful for classifying text?

                • CuriouslyC 2 hours ago

                  Embeddings are good at partitioning document stores at a coarse-grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even fully pull their weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.
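
                  For the hybrid case, one common recipe is reciprocal rank fusion over a keyword ranking and a vector ranking; a minimal sketch (the doc ids are hypothetical, k=60 is the usual constant):

                    from collections import defaultdict

                    def rrf(rankings, k=60):
                        # rankings: ranked doc-id lists, e.g. one from BM25, one from vectors
                        scores = defaultdict(float)
                        for ranking in rankings:
                            for rank, doc_id in enumerate(ranking):
                                scores[doc_id] += 1.0 / (k + rank + 1)
                        return sorted(scores, key=scores.get, reverse=True)

                    bm25_hits = ["d3", "d1", "d7"]    # hypothetical keyword results
                    vector_hits = ["d1", "d9", "d3"]  # hypothetical embedding results
                    print(rrf([bm25_hits, vector_hits]))  # docs on both lists rank first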

                  • neilellis an hour ago

                    Yes, also for semantic indexes. I use one for person/role/org matching, so that CEO == chief executive ~= managing director. That's useful when you have grey data and multiple lookup data sources that use different terms.
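
                    That kind of matching is only a few lines (a sketch assuming sentence-transformers as the embedding model; the model name and role list are illustrative):

                      from sentence_transformers import SentenceTransformer, util

                      model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

                      canonical = ["chief executive officer", "managing director", "head of engineering"]
                      canon_emb = model.encode(canonical, normalize_embeddings=True)

                      query_emb = model.encode("CEO", normalize_embeddings=True)
                      sims = util.cos_sim(query_emb, canon_emb)[0]
                      print(canonical[int(sims.argmax())])  # CEO maps to its nearest canonical role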

                    • esafak 2 hours ago

                      You could assign the class based on the labels of the k nearest neighbors, if there is a clear majority. The quality will depend on the suitability of your embeddings.
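
                      A minimal sketch of that scheme (the abstention threshold and data are illustrative):

                        import numpy as np
                        from collections import Counter

                        def knn_classify(query, embeddings, labels, k=5, min_majority=0.6):
                            sims = embeddings @ query  # cosine sims, rows L2-normalized
                            top = np.argsort(-sims)[:k]
                            votes = Counter(labels[i] for i in top)
                            label, count = votes.most_common(1)[0]
                            return label if count / k >= min_majority else None  # abstain

                        emb = np.random.rand(100, 8).astype(np.float32)
                        emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize rows
                        labels = np.array(["news", "spam"] * 50)
                        print(knn_classify(emb[0], emb, labels))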

                      • OutOfHere 2 hours ago

                        It depends entirely on the quality and suitability of the embedding vector you provide. Even with a long embedding vector from a recent model, my estimate is that the classification will be better than random but not very accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human-labeled dataset and estimate the confusion matrix for each approach.
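
                        Estimating that is straightforward with scikit-learn (a sketch; the labels and predictions are made-up placeholders):

                          from sklearn.metrics import confusion_matrix

                          # Hypothetical gold labels from a small human-labeled set,
                          # plus predictions from each approach.
                          gold = ["spam", "ham", "spam", "ham", "spam"]
                          knn_pred = ["spam", "ham", "ham", "ham", "spam"]
                          llm_pred = ["spam", "ham", "spam", "ham", "spam"]

                          for name, pred in [("knn", knn_pred), ("llm", llm_pred)]:
                              print(name)
                              print(confusion_matrix(gold, pred, labels=["spam", "ham"]))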