Comments Page - Show HN: Vectorless RAG

Show HN: Vectorless RAG (github.com) | Submitted by page_index 2 days ago

ineedasername an hour ago
>"Retrieval based on reasoning — say goodbye to approximate semantic search ("vibe retrieval"
How is this not precisely "vibe retrieval" and much more approximate, where approximate in this case is uncertainty over the precise reasoning?
Similarity with conversion to high-dimensional vectors and then something like kNN seems significantly less approximate, less "vibe" based, than this.
This also appears to be completely predicated on pre-enrichment of the documents by adding structure through API calls to, in the example, openAI.
It doesn't at all seem accurate to:
1: Toss out mathematical similarity calculations
2: Add structure with LLMs
3: Use LLMs to traverse the structure
4: Label this as less vibe-ish
Also for any sufficiently large set of documents, or granularity on smaller sets of documents, scaling will become problematic as the doc structure approaches the context limit of the LLM doing the retrieval.
mosselman 5 hours ago
So if I understand this correctly it goes over every possible document with an LLM each time someone performs a search?
I might have misunderstood of course.
If so, then the use cases for this would be fairly limited since you'd have to deal with lots of latency and costs. In some cases (legal documents, medical records, etc) it might be worth it though.
An interesting alternative I've been meaning to try out is inverting this flow. Instead of using an LLM at search time to find the pieces relevant to the query, you flip it around: at ingestion time you let an LLM note all of the possible questions that a given text can answer and store those in an index. You could then use some traditional full-text search or other algorithms (BM25?) to search for relevant documents and pieces of text. You could even go for a hybrid approach with vectors on top or next to this. Maybe vectors first and then more ranking with something more traditional.
What appeals to me with that setup is low latency and good debug-ability of the results.
But as I said, maybe I've misunderstood the linked approach.
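A minimal sketch of that inverted flow, under loud assumptions: `generate_questions` is a hypothetical stand-in for the ingest-time LLM call (it just returns canned questions here), the chunk ids and texts are made up, and the BM25 scorer is a small hand-rolled one rather than any particular library's. Queries are matched against the stored questions, and each hit maps back to the chunk it was generated from.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Made-up corpus; in practice these would be your document chunks.
CHUNKS = {
    "refunds": "Refunds are issued within 14 days of a returned item arriving.",
    "shipping": "Standard shipping takes 3 to 5 business days; express is next day.",
}

# Hypothetical stand-in for an LLM call made once, at ingestion time.
CANNED_QUESTIONS = {
    "refunds": ["How long does a refund take?", "When will I get my money back?"],
    "shipping": ["How long does shipping take?", "Is express delivery available?"],
}

def generate_questions(chunk_id, text):
    return CANNED_QUESTIONS[chunk_id]

class QuestionIndex:
    """BM25 over generated questions; each question points back to its chunk."""

    def __init__(self, chunks, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.chunk_ids, self.tokens = [], []
        for chunk_id, text in chunks.items():
            for question in generate_questions(chunk_id, text):
                self.chunk_ids.append(chunk_id)
                self.tokens.append(tokenize(question))
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: in how many questions each term appears.
        self.df = Counter(term for t in self.tokens for term in set(t))

    def _score(self, terms, i):
        tf = Counter(self.tokens[i])
        n = len(self.tokens)
        total = 0.0
        for term in terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            length_norm = 1 - self.b + self.b * len(self.tokens[i]) / self.avg_len
            total += idf * tf[term] * (self.k1 + 1) / (tf[term] + self.k1 * length_norm)
        return total

    def search(self, query, top_k=3):
        terms = tokenize(query)
        ranked = sorted(
            ((self._score(terms, i), self.chunk_ids[i]) for i in range(len(self.tokens))),
            key=lambda pair: pair[0],
            reverse=True,
        )
        best = {}  # keep only the best-scoring question per chunk
        for score, chunk_id in ranked:
            if score > 0 and chunk_id not in best:
                best[chunk_id] = score
        return list(best.items())[:top_k]

index = QuestionIndex(CHUNKS)
hits = index.search("when do I get my money back")
```

Because every hit is a concrete stored question with a lexical score, you can inspect exactly which generated question matched, which is where the debuggability would come from.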
- Qwuke 4 hours ago
  >An interesting alternative I've been meaning to try out is inverting this flow. Instead of using an LLM at time of searching to find relevant pieces to the query, you flip it around: at time of ingesting you let an LLM note all of the possible questions that you can answer with a given text and store those in an index.
  You may already know of this one, but consider giving Google LangExtract a look. A lot of companies are doing what you described in production, too!
  summarity 4 hours ago
  This is just a variation of index time HyDE (Hypothetical Document Embedding). I used a similar strategy when building the index and search engine for findsight.ai
- agentcoops 4 hours ago
  I’ve been working on RAG systems a lot this year and I think one thing people miss is that often for internal RAG efficiency/latency is not the main concern. You want predictable, linear pricing of course, but sometimes you want to simply be able to get a predictably better response by throwing a bit more money/compute time at it.
  It’s really hard to get to such a place with standard vector-based systems, even GraphRag. Because it relies on summaries of topic clusters that are pre-computed, if one of those summaries is inaccurate or none of the summaries deal with your exact question, that will never change during query processing. Moreover, GraphRag preprocessing is insanely expensive and precisely does not scale linearly with your dataset.
  TLDR all the trade-offs in RAG system design are still being explored, but in practice I’ve found the main desired property to be “predictably better answer with predictably scaling cost” and I can see how similar concerns got OP to this design.
  bjornsing 2 hours ago
  > Moreover, GraphRag preprocessing is insanely expensive and precisely does not scale linearly with your dataset.
  Sounds interesting. What exactly is the expensive computation?
  On a separate note: I have a feeling RAG could benefit from a kind of ”simultaneous vector search” across several different embedding spaces, sort of like AND in an SQL database. Do you agree?
  physicsguy 4 hours ago
  Yes. Our use case is diagnosing issues, which draws on documents. The latency doesn't matter because it's all done before the diagnosis is raised to the customer.
  bjornsing 2 hours ago
  > You want predictable, linear pricing of course, but sometimes you want to simply be able to get a predictably better response by throwing a bit more money/compute time at it.
  Through more thorough ANN vector search / higher recall, or would it also require different preprocessing?
- rafaelmn 4 hours ago
  I didn't look at the implementation, but it sounds similar to something I did two years ago: recursively summarize the documentation based on its structure (domain/page/section) and then ask the model to walk the hierarchy based on the summaries.
  My motivation back then was that I had an 8k context length to work with, so I had to be very conservative about what I included. I still used vectors to narrow down the entry points and then used the LLM to drill down or pick the most relevant ones. The search threads were separate: each would summarize its response based on the tree path it took, and then the main thread would combine them.
- jdthedisciple 4 hours ago
  > let an LLM note all of the possible questions that you can answer
  What does this even mean? At what point do you know you have all of them?
  Humans are quite ingenious coming up with new, unique questions in my observation, whereas LLMs have a hard time replicating those efficiently.
  malnourish 3 hours ago
  Cantor's diagonalization is trivial to show for questions. There are uncountably many.
malshe 26 minutes ago
For the folks who are using RAG: what's the SOTA for extracting text from PDF documents? I have been following discussions on HN and I have seen a few promising solutions that involve converting PDF to PNG and then doing extraction. However, for my application this looks a bit risky because my PDFs have tons of tables and I can't afford to get incorrect or made-up numbers in return.
The original documents are in HTML format and although I don't have access to them I can obtain them if I want. Is it better to just use these HTML documents instead? Previously I tried converting HTML to markdown and then use these for RAG. I wasn't too happy with the result although I fear I might be doing something wrong.
- giamma 11 minutes ago
  How about using something like Apache Tika for extracting text from multiple documents? It's a subproject of Lucene and consists of a proxy parser + delegates for a number of document formats. If a document, e.g. PDF, comes from a scanner, Tika can optionally shell-out a Tesseract invocation and perform OCR for you.
- davidajackson 6 minutes ago
  Can you explain why to png? why not to markdown?
- JJax7 19 minutes ago
  If accuracy is a major concern, then it's probably guaranteed better to go with the HTML documents. Otherwise, I've heard Docling is pretty good from a few co-workers.
mikeve 4 hours ago
Not sure if I fully understand it, but this seems highly inefficient?
Instead of using embeddings, which are easy to make and cheap to compare, you use summarized sections of documents and process them with an LLM? LLMs are slower and more expensive to run.
- mingtianzhang 41 minutes ago
  I think it only needs to generate the tree once before retrieval, and it doesn’t require any external model at query time. The indexing may take some time upfront, but retrieval is then very fast and cost-free.
- falcor84 4 hours ago
  If this is used as an important tool call for an AI agent that performs many other calls, then it's likely that the added cost and latency would be negligible compared to the benefit of significantly improved retrieval. As an analogy, for a small task you're often OK with just going over the first few search results, but to prepare for a large project, you might want to spend an afternoon researching.
dcre 2 hours ago
My approach in "LLM-only RAG for small corpora" [0] was to mechanically make an outline version of all the documents _without_ an LLM, feed that to an LLM with the prompt to tell which docs are likely relevant, and then feed the entirety of those relevant docs to a second LLM call to answer the prompt. It only works with markdown and asciidoc files, but it's surprisingly solid for, for example, searching a local copy of the jj or helix docs. And if the corpus is small enough and your model is on the cheap side (like Gemini 2.5 Flash), you can of course skip the retrieval step and just send the entire thing every time.
[0]: https://crespo.business/posts/llm-only-rag/
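For illustration only, the "mechanical outline" step could look something like the sketch below. This is not the code from the linked post: the function names, the prompt wording, and the choice to handle markdown ATX headings only (no asciidoc) are all assumptions made here.

```python
import re

# ATX headings: one to six '#', a space, the title, optional closing '#'s.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to avoid a literal fence here

def outline(markdown_text):
    """Return a document's headings as an indented outline, with no LLM involved."""
    lines = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip '#' comments inside code blocks
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            depth = len(match.group(1))
            lines.append("  " * (depth - 1) + match.group(2))
    return "\n".join(lines)

def build_selection_prompt(docs, question):
    """Assemble the first-stage prompt: outlines only, not full text (hypothetical wording)."""
    parts = [f"[{name}]\n{outline(text)}" for name, text in docs.items()]
    return (
        "Which of these documents are likely relevant to the question?\n\n"
        + "\n\n".join(parts)
        + f"\n\nQuestion: {question}"
    )

sample = "\n".join([
    "# Guide",
    "Intro text.",
    "## Install",
    FENCE + "sh",
    "# not a heading",
    FENCE,
    "## Usage",
    "### Flags",
])
sample_outline = outline(sample)
prompt = build_selection_prompt({"guide.md": sample}, "How do I install it?")
```

The outline is much smaller than the documents themselves, which is what makes it feasible to show the whole corpus's structure to the first LLM call.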
mvieira38 2 hours ago
> It moves RAG away from approximate "semantic vibes" and toward explicit reasoning about where information lives. That clarity can help teams trust outputs and debug workflows more effectively.
Wasn't this a feature of RAGs, though? That they could match semantics instead of structure, while us mere balls of flesh need to rely on indexes. I'd be interested in benchmarks of this versus traditional vector-based RAGs, is something to that effect planned?
- mingtianzhang 42 minutes ago
  In their GitHub repo’s README, they show a benchmark on FinanceBench and report that PageIndex-based retrieval significantly outperforms vector-based methods. I’ve noticed that in domain-specific documents, where all the text has similar “semantic vibes,” non-vector methods like PageIndex can be more useful. In contrast, for use cases like recommendation systems, you might actually need a semantic-vibe search.
thatjoeoverthr 4 hours ago
There are good reasons to do this. Embedding similarity is _not_ a reliable method of determining relevance.
I did some measurements and found you can't even really tell if two documents are "similar" or not. Here: https://joecooper.me/blog/redundancy/
One common way is to mix approaches. e.g. take a large top-K from ANN on embeddings as a preliminary shortlist, then run a tuned LLM or cross encoder to evaluate relevance.
I'll link here these guys' paper which you might find fun: https://arxiv.org/pdf/2310.08319
At the end of the day you just want a way to shortlist and focus information that's cheaper, computationally, and more reliable, than dumping your entire corpus into a very large context window.
So what we're doing is fitting the technique to the situation. Price of RAM; GPU price; size of dataset; etc. The "ideal" setup will evolve as the cost structure and model quality evolves, and will always depend on your activity.
But for sure, ANN-on-embedding as your RAG pipeline is a very blunt instrument and if you can afford to do better you can usually think of a way.
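A rough sketch of the mixed approach (shortlist, then rerank), with everything toy-sized and assumed: the 2-D vectors are invented, the shortlist is brute-force cosine rather than a real ANN index, and `toy_relevance` is a hypothetical word-overlap stand-in for the tuned LLM or cross-encoder.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def shortlist(query_vec, doc_vecs, k):
    """Stage 1: cheap top-k by embedding similarity (brute force stands in for ANN)."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

def rerank(query, candidates, texts, relevance, top_n):
    """Stage 2: re-score only the shortlist with a more expensive relevance judge."""
    ranked = sorted(candidates, key=lambda d: relevance(query, texts[d]), reverse=True)
    return ranked[:top_n]

def toy_relevance(query, text):
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)

# Invented data: the embedding of "b" is closest to the query, but "a" answers it.
TEXTS = {
    "a": "reset your password from the account settings page",
    "b": "password rules require twelve characters",
    "c": "invoice history is under billing",
}
VECS = {"a": [0.8, 0.6], "b": [0.9, 0.4], "c": [0.1, 1.0]}

query = "how do I reset my password"
candidates = shortlist([1.0, 0.3], VECS, k=2)
final = rerank(query, candidates, TEXTS, toy_relevance, top_n=1)
```

The expensive judge only ever sees `k` candidates, so `k` is the knob that trades cost against the chance that the right document is missed in stage 1.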
esafak an hour ago
I don't see this scaling: https://deepwiki.com/search/how-is-the-tree-formed-and-tra_9...
I'd do some large scale benchmarks before doubling down on this approach.
- mingtianzhang an hour ago
  A good thing about tree representation compared to a 'list' representation is that you can search hierarchically, layer by layer, in a large tree. For example, AlphaGo performs search in a large tree. Since the scale of retrieval is smaller than that of the Go game, I guess this framework can scale very well.

mritchie712 an hour ago

an effective "vectorless RAG" is to have an LLM write search queries against the documents. e.g. if you store your documents in postgres, allow the LLM to construct a regex string that will find relevant matches. If you were searching for “Martin Luther King Jr.”, it might write something like:

    SELECT id, body
    FROM docs
    WHERE body ~* E'(?x)                                     # x = expanded syntax: whitespace and # comments are ignored
      (?:\\m(?:dr|rev(?:erend)?)\\.?\\M[\\s.]+)?             # optional title: Dr., Rev., Reverend
      (                                                      # name forms
        (?:\\mmartin\\M[\\s.]+(?:\\mluther\\M[\\s.]+)?\\mking\\M)  # "Martin (Luther)? King"
      | (?:\\mm\\.?\\M[\\s.]+(?:\\ml\\.?\\M[\\s.]+)?\\mking\\M)     # "M. (L.)? King" / "M L King"
      | (?:\\mmlk\\M)                                       # "MLK"
      )
      (?:[\\s.,-]*\\m(?:jr|junior)\\M\\.?)*                  # optional suffix(es): Jr, Jr., Junior
    ';

brap 5 hours ago
Very cool. These days I’m building RAG over a large website, and when I look at the results being fed into the LLM, most of them are so silly it’s surprising the LLM even manages to extract something meaningful. Always makes me wonder if it’s just using prior knowledge even though it’s instructed not to do so (which is hacky).
I like your approach because it seems like a very natural search process, like a human would navigate a website to find information. I imagine the tradeoff is performance of both indexing and search, but for some use cases (like mine) it’s a good sacrifice to make.
I wonder if it’s useful to merge to two approaches. Like you could vectorize the nodes in the tree to give you a heuristic that guides the search. Could be useful in cases where information is hidden deep in a subtree, in a way that the document’s structure doesn’t give it away.
- mingtianzhang 25 minutes ago
  Strongly agree! It is basically the Monte Carlo tree search method used in AlphaGo! This is also mentioned in one of their tutorials: PageIndex/blob/main/tutorials/doc-search/semantics.md. I believe it will make the method more scalable for large documents.
lewisjoe 5 hours ago
This will scale when you have a single/a small set of document(s) and want your questions answered.
When you have a question and you don't know which of the million documents in your dataspace contains the answer, I'm not sure how this approach will perform. In that case we are looking at either feeding an enormously large tree as context to an LLM or looping through potentially thousands of iterations between a tree and an LLM.
That said, this really is a good idea for a small search space (like a single document).
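To make the "looping between a tree and an LLM" option concrete, here is a toy sketch of a greedy, layer-by-layer descent. It is not PageIndex's implementation: the `Node` shape, the sample tree, and `choose_child` (a word-overlap stand-in for one LLM call per level) are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    summary: str
    children: list = field(default_factory=list)

def choose_child(question, children):
    """Hypothetical stand-in for ONE LLM call: pick the child whose summary fits best."""
    words = set(question.lower().split())
    return max(children, key=lambda c: len(words & set(c.summary.lower().split())))

def descend(root, question):
    """Walk from the root to a leaf, making one choice (one call) per level."""
    path, calls, node = [root.title], 0, root
    while node.children:
        node = choose_child(question, node.children)
        calls += 1
        path.append(node.title)
    return path, calls

# Invented document tree.
tree = Node("Annual report", "whole document", [
    Node("Financials", "revenue costs cash flow debt", [
        Node("Income statement", "revenue costs net income"),
        Node("Balance sheet", "assets liabilities debt equity"),
    ]),
    Node("Governance", "board directors committees"),
])

path, calls = descend(tree, "what is the total debt and liabilities")
```

In this sketch the number of calls equals the depth of the chosen path, not the number of leaves, but each call has to read all sibling summaries at that level, and a wrong pick near the root is never revisited. Both of those costs grow as the tree gets wider and the summaries get vaguer.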
huqedato 4 hours ago
I have a RAG built on a knowledge base of 10000+ docs. On a vector store, of course (Qdrant, hybrid search). It works smoothly and quite reliably.
I wonder how this "vectorless" engine would deal with this. Simply put, I can't see this tech scaling.
- mingtianzhang 24 minutes ago
  A good thing about tree representation compared to a 'list' representation is that you can search hierarchically, layer by layer, in a large tree. For example, AlphaGo performs search in a large tree. Since the scale of retrieval is smaller than that of the Go game, I guess this framework can scale very well.
Koaisu 4 hours ago
Sounds a bit like generative retrieval (e.g. this Google paper here: https://arxiv.org/abs/2202.06991)
- mingtianzhang an hour ago
  Yeah, they share a similar intuition. I found that the difference is that PageIndex is more of a learning-free approach, more like how a human would do retrieval?
- thatjoeoverthr 4 hours ago
  I love it
rco8786 3 hours ago
This seems really interesting but I can't quite figure out if this is like a SaaS product or an OSS library? The code sample seems to indicate that it uses some sort of "client" to send the document somewhere and then wait to retrieve it later.
But the home page doesn't indicate any sort of sign up or pricing.
So I'm a little confused.
edit Ok I found a sign up flow, but the verification email never came :(
cantor_S_drug 2 hours ago
This is like semantic version of B+ trees.
- nikishuyi an hour ago
  Yeah, I strongly agree. I also found in AI coding tools, tree search has replaced vector search. I’m wondering if in generic RAG systems, tree search will replace vector databases?
guerby 5 hours ago
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
neya 5 hours ago
This is good for applications where background, queue-based RAG is acceptable. You upload a file, tell the user that processing will take a few hours, and deliver the results after X hours. Great for manuals, documentation and larger content.
But for on-demand, near instant RAG (like say in a chat application), this won't work. Speed vs accuracy vs cost. Cost will be a really big one.
- actionfromafar 5 hours ago
  If you have a lot of time, cost on a local machine may be low.
jdthedisciple 4 hours ago
Looks like this should scale spectacularly poorly.
Might be useful for a few hundred documents max though.
- mingtianzhang an hour ago
  A good thing about tree representation compared to a 'list' representation is that you can search hierarchically, layer by layer, in a large tree. For example, AlphaGo performs search in a large tree. Since the scale of retrieval is smaller than that of the Go game, I guess this framework can scale very well.
- bjornsing 2 hours ago
  It scales as log(N), right? So if you can tolerate it for a few hundred docs you can probably tolerate it for a lot more.
geedzmo 4 hours ago
"Human-like Retrieval: Simulates how human experts navigate and extract knowledge from complex documents." - pretty sure I use control-f when I look for stuff
- mingtianzhang 23 minutes ago
  But different people may have different ways. For example, I use command+f in macbook.
- scotty79 4 hours ago
  I think it's about how you decide where to press Ctrl+F next.
koakuma-chan 5 hours ago
What about latency?
- marcodena 5 hours ago
  yeah vectors are way more efficient for this
  mingtianzhang 10 minutes ago
  In this approach, the documents need to be pre-processed once to generate a tree structure, which is slower than the current vector-based method. However, during retrieval, this approach only requires conditioning on the context for the LLM and does not require an embedding model to convert the query into vectors. As a result, it can be efficient when the tree is small. When the tree is large, however, this approach may be slower than the vector-based method since it prioritizes accuracy. If you prioritize speed over accuracy, then I guess you should use Vector DB.
  Qwuke 5 hours ago
  The approach used here, breaking large documents down into summarized chunks that can more easily be reasoned about, is how a lot of AI systems deal with large documents that surpass effective context limits in general. In my experience, though, it only works up to a certain point; after that the summaries start to hide enough detail that you do need semantic search or another RAG approach like GraphRAG. I think the efficacy of this approach will really fall apart after a certain number of documents.
  Would've loved to have seen the author run experiments on how it compares to other RAG approaches or what its limitations are.
  theshetty 5 hours ago
  Can you elaborate on this please?
  brap 5 hours ago
  To put it in terms of data structures, a vector DB is more like a Map, this is more like a Tree
  neutronicus 4 hours ago
  For the C++ programmers among us I think that means it's more like `unordered_map` than `map`
nathan_compton 4 hours ago
I let a bot do a free-text search over an indexed database. Works OK. I've also tried keyword-based retrieval and vector search.
I've found all leave something to be desired, sadly.
monster_truck 4 hours ago
vectorless rag? I think I have one of those in my kitchen
- nikishuyi an hour ago
  Loll you also need one in your computer.
dr_dshiv 4 hours ago
Unrelated: why is chat search in Claude so bad?
- nikishuyi an hour ago
  Maybe lost in the context? I guess a tree method can be used to improve that?
dmezzetti 2 hours ago
Context and prompt engineering is the most important part of AI, hands down.
There are plenty of lightweight retrieval options that don't require a separate vector database (I'm the author of txtai [https://github.com/neuml/txtai], which is one of them).
It can be as simple as this in Python: you pass a data generator to an index operation and save the index to a local folder, then use that for RAG.
- mingtianzhang an hour ago
  Strongly agree, I also found txtai is super interesting! Thank you for your open-source effort!
  dmezzetti 41 minutes ago
  You got it!