• postalcoder 2 hours ago

    To add some context, this isn't that novel of an approach. A common way to improve RAG results is to "expand" the underlying chunks using an LLM, so as to increase the semantic surface area to match against. You can often get further gains by running query expansion with HyDE[1], though it's not always a win. I use it as a fallback.
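
    For anyone who hasn't used it, HyDE boils down to: have the model write a hypothetical answer, then embed and search with that instead of the raw query. A rough sketch (the model name and the embed() helper are placeholders, not anything from the paper):

      # HyDE: embed a hypothetical answer instead of the raw query.
      import anthropic

      client = anthropic.Anthropic()

      def hyde_query_vector(query: str, embed) -> list[float]:
          # 1. Ask the LLM to write a passage that *would* answer the query.
          resp = client.messages.create(
              model="claude-3-5-sonnet-20240620",
              max_tokens=512,
              messages=[{
                  "role": "user",
                  "content": f"Write a short passage that answers: {query}",
              }],
          )
          hypothetical_doc = resp.content[0].text
          # 2. Embed the hypothetical passage; it tends to land closer to
          #    real answer chunks than the bare query does.
          return embed(hypothetical_doc)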

    I'm not sure what Anthropic is introducing here. I looked at the cookbook code and it's just showing the process of producing said context, but there's no actual change to their API regarding "contextual retrieval".

    The one change is prompt caching, introduced a month back, which allows you to very cheaply add better context to individual chunks by providing the entire (long) document as context. Caching is an awesome feature to expose to developers and I don't want to take anything away from that.

    However, other than that, the only thing being introduced here is a cookbook on how to do a particular RAG workflow.

    As an aside, Cohere may be my favorite API to work with (no affiliation). Their RAG API is a delight, and unlike anything offered by the other providers. I highly recommend it.

    1: https://arxiv.org/abs/2212.10496

    • resiros 2 hours ago

      I think the innovation is using caching so as to make the cost of the approach manageable. The way they implemented it is that each time you create a chunk, you ask the LLM to turn it into an atomic chunk using the whole document as context. You need to do this for all tens of thousands of chunks in your data, which costs a lot. By caching the documents, you can cut that cost way down.
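
      Roughly what that loop looks like with the Anthropic SDK and the prompt-caching beta (the exact prompt wording here is mine, not the cookbook's):

        # Generate a short situating context for every chunk; the full
        # document is cached so repeat calls only pay for the chunk tokens.
        import anthropic

        client = anthropic.Anthropic()

        def contextualize(document: str, chunks: list[str]) -> list[str]:
            out = []
            for chunk in chunks:
                resp = client.messages.create(
                    model="claude-3-5-sonnet-20240620",
                    max_tokens=150,
                    system=[{
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        # cached after the first request
                        "cache_control": {"type": "ephemeral"},
                    }],
                    messages=[{
                        "role": "user",
                        "content": f"<chunk>\n{chunk}\n</chunk>\n"
                                   "Give a short context that situates this "
                                   "chunk within the document, for retrieval.",
                    }],
                    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
                )
                # prepend the generated context before embedding / BM25 indexing
                out.append(resp.content[0].text + "\n\n" + chunk)
            return out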

      • skeptrune 2 hours ago

        You could also just store the first atomic chunk the LLM outputs and re-use it each time yourself. Easier and more consistent.

        • postalcoder 2 hours ago

          To be fair, that only works if you keep chunk windows static.

        • postalcoder 2 hours ago

          Yup. Caching is very nice... but the framing is weird. "Introducing", to me, connotes a product release, not a new tutorial.

      • skeptrune 3 hours ago

        I'm not a fan of this technique. I agree the scenario they lay out is a common problem, but the proposed solution feels odd.

        Vector embeddings have bag-of-words compression properties and can over-index on the first newline-separated text block, to the extent that certain indices in the resulting vector end up much closer to 0 than they otherwise would. With quantization, they can eventually become 0 and cause you to lose a lot of precision in the dense vectors. IDF search overcomes this to some extent, but not enough.

        You can "semantically boost" embeddings such that they move closer to your document's title, summary, abstract, etc. and get the recall benefits of this "context" prepend without polluting the underlying vector. Implementation wise it's a weighted sum. During the augmentation step where you put things in the context window, you can always inject the summary chunk when the doc matches as well. Much cleaner solution imo.

        Description of "semantic boost" in the Trieve API[1]:

        >semantic_boost: Semantic boost is useful for moving the embedding vector of the chunk in the direction of the distance phrase. I.e. you can push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25. Conceptually it's drawing a line (euclidean/L2 distance) between the vector for the innerText of the chunk_html and distance_phrase, then moving the vector of the chunk_html distance_factor * L2Distance closer to or away from the distance_phrase point along the line between the two points.

        [1]:https://docs.trieve.ai/api-reference/chunk/create-or-upsert-...
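
        The weighted-sum part is easy to sketch in plain numpy (this is just my reading of the description above, not Trieve's actual implementation):

          # Nudge a chunk's embedding toward a "distance phrase" embedding
          # by a fraction of the L2 distance between them.
          import numpy as np

          def semantic_boost(chunk_vec: np.ndarray,
                             phrase_vec: np.ndarray,
                             distance_factor: float) -> np.ndarray:
              # vector pointing from the chunk toward the phrase
              direction = phrase_vec - chunk_vec
              # move distance_factor of the way along that line
              # (a negative factor pushes the chunk away instead)
              boosted = chunk_vec + distance_factor * direction
              # renormalize if your index expects unit vectors (assumption)
              return boosted / np.linalg.norm(boosted)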

        • vendiddy 20 minutes ago

          I don't know anything about AI but I've always wished I could just upload a bunch of documents/books and the AI would perform some basic keyword searches to figure out what is relevant, then auto-include that in the prompt.

          • average_r_user 5 minutes ago

            You might want to try NotebookLM by Google. It does exactly this: you can upload a document, a PDF, whatever, and ask questions. The model replies and also gives you references back to your material.

          • _bramses an hour ago

            The technique I find most useful is to implement a “linked list” strategy where a chunk has multiple pointers to the entry it is referenced by. This is done manually, but the diversity of the ways you can reference a particular node goes up dramatically.

            Another way to look at it: comments. Imagine every comment under this post is a pointer back to the original post. Some will be close in distance, and others will be farther, due to the perception of the authors of the comments themselves. But if you assign each comment a “parent_id”, your access to the post multiplies.

            You can see an example of this technique here [1]. I don’t attempt to mind-read what the end user will query for; I simply let them tell me, and then index that as a pointer. There are only a finite number of options to represent a given object, but some representations are very, very, very far from the semantic meaning of the core object.

            [1] - https://x.com/yourcommonbase/status/1833262865194557505
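
            In storage terms it's just an extra pointer column; a toy sketch of the idea (the schema here is made up for illustration, not what the linked product actually uses):

              # Toy index where user-supplied "references" are stored as
              # separate entries pointing back to a parent entry; any hit
              # on a pointer resolves back to the original.
              from dataclasses import dataclass

              @dataclass
              class Entry:
                  id: str
                  text: str
                  parent_id: str | None = None  # None for the root entry

              class PointerIndex:
                  def __init__(self):
                      self.entries: dict[str, Entry] = {}

                  def add(self, entry: Entry):
                      self.entries[entry.id] = entry

                  def resolve(self, hit_id: str) -> Entry:
                      # follow pointers back to the root entry
                      entry = self.entries[hit_id]
                      while entry.parent_id is not None:
                          entry = self.entries[entry.parent_id]
                      return entry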

            • simonw 3 hours ago

              My favorite thing about this is the way it takes advantage of prompt caching.

              That's priced at around 1/10th of what the prompts would normally cost if they weren't cached, which means that tricks like this (running every single chunk against a full copy of the original document) become feasible where previously they wouldn't have made financial sense.

              I bet there are all sorts of other neat tricks like this which are opened up by caching cost savings.

              My notes on contextual retrieval: https://simonwillison.net/2024/Sep/20/introducing-contextual... and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-caching-with-cl...

              • jillesvangurp 2 hours ago

                You could do a lot of stuff by pre-calculating things for your embeddings. Why cache when you can pre-calculate? That brings into play a whole lot of things people commonly do as part of ETL.

                I come from a traditional search background. It's quite obvious to me that RAG is a bit of a naive strategy if you limit it to just using vector search with some off-the-shelf embedding model. Vector search simply isn't that good. You need additional information retrieval strategies if you want to improve the context you provide to the LLM. That is effectively what they are doing here.

                Microsoft published an interesting paper on graph RAG some time ago, where they combine vector-search RAG with a conceptual graph that they construct from the indexed data using entity extraction. This allows them to pull in contextually relevant information for matching chunks.

                I have a hunch that you could probably get quite far without doing any vector search at all. It would be a lot cheaper too. Simply use a traditional search engine and some tuned query. The trick is of course query tuning, which may not work that well for general-purpose use cases but could work for more specialized ones.
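
                A keyword-only baseline along those lines takes just a few lines with the rank_bm25 package (the tokenization here is deliberately naive):

                  # Plain BM25 retrieval over chunks, no vectors involved.
                  from rank_bm25 import BM25Okapi

                  chunks = ["first chunk text", "second chunk text", "third chunk text"]
                  bm25 = BM25Okapi([c.lower().split() for c in chunks])

                  def keyword_search(query: str, k: int = 5) -> list[str]:
                      scores = bm25.get_scores(query.lower().split())
                      ranked = sorted(range(len(chunks)),
                                      key=lambda i: scores[i], reverse=True)
                      return [chunks[i] for i in ranked[:k]]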

                • TmpstsTrrctta an hour ago

                  I have experience in traditional search as well, and I think it’s limiting my imagination when it comes to vector search. In the post, I did like the introduction of Contextual BM25 compared to other hybrid approaches that then do RRF.
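
                  For reference, RRF itself is tiny; fusing a BM25 ranking with a vector ranking is basically just this (k=60 is the conventional default, not something from the post):

                    # Reciprocal Rank Fusion over any number of ranked id lists.
                    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
                        scores: dict[str, float] = {}
                        for ranking in rankings:
                            for rank, doc_id in enumerate(ranking):
                                scores[doc_id] = (scores.get(doc_id, 0.0)
                                                  + 1.0 / (k + rank + 1))
                        return sorted(scores, key=scores.get, reverse=True)

                    # usage: rrf([bm25_ids, vector_ids])[:20]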

                  For question answering, vector/semantic search is clearly a better fit in my mind, and I can see how the contextual models can enable and bolster that. However, because I’ve implemented and used so many keyword-based systems, that just doesn’t seem to be how my brain works.

                  An example I’m thinking of is finding a sushi restaurant near me with availability this weekend around dinner time. I’d love to be able to search for this as I’ve written it. How I would actually search for it is to search for sushi restaurant, sort by distance, and hope the application does a proper job of surfacing time filtering.

                  Conversely, this is mostly how I would build this system. Perhaps with a layer to determine user intention to pull out restaurant type, location sorting, and time filtering.

                  I could see using semantic search for filtering the restaurants down to those related to sushi, but do we then drop back into traditional search for filtering and sorting? Utilize function calling to have the LLM parameterize our search query?
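
                  For that last option, a hedged sketch using Anthropic tool use to pull structured filters out of the query (the tool schema and the downstream restaurant search are made up for illustration):

                    # Let the LLM turn a natural-language request into
                    # structured parameters for a conventional filtered search.
                    import anthropic

                    client = anthropic.Anthropic()

                    search_tool = {
                        "name": "search_restaurants",  # hypothetical backend
                        "description": "Search restaurants with filters.",
                        "input_schema": {
                            "type": "object",
                            "properties": {
                                "cuisine": {"type": "string"},
                                "open_at": {"type": "string",
                                            "description": "ISO datetime"},
                                "sort_by": {"type": "string",
                                            "enum": ["distance", "rating"]},
                            },
                            "required": ["cuisine"],
                        },
                    }

                    resp = client.messages.create(
                        model="claude-3-5-sonnet-20240620",
                        max_tokens=300,
                        tools=[search_tool],
                        tool_choice={"type": "tool", "name": "search_restaurants"},
                        messages=[{"role": "user",
                                   "content": "sushi near me with availability "
                                              "this Saturday around dinner"}],
                    )
                    params = next(b.input for b in resp.content
                                  if b.type == "tool_use")
                    # params then feed the traditional engine: filter and sort as usual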

                  As stated, perhaps I’m not thinking about these the right way because of my experience with existing systems, which I find give me better results when well built.

                  • visarga 32 minutes ago

                    GraphRAG requires you to define the schema of entity and relation types upfront. This works when you are in a known domain, but in general, when you just want to answer questions from a large reference, you don't know what you need to put in the graph.

                    • postalcoder an hour ago

                      Graph RAG is very cool and outstanding at filling some niches. IIRC, Perplexity's actual search is just BM25 (based on a Lex Fridman interview with the founder).

                      • jillesvangurp an hour ago

                        Makes sense; Perplexity is usually really responsive and fast.

                        I need to check out that interview with Lex Fridman.

                        • _hfqa an hour ago

                          Do you have the link and the time in the video where he mentions it?

                    • valstu 2 hours ago

                      We're doing something similar. We first chunk the documents based on h1, h2, and h3 headings. Then we add the headings at the beginning of the chunk as context. As an imaginary example, instead of one chunk being:

                        The usual dose for adults is one or two 200mg tablets or 
                        capsules 3 times a day.
                      
                      It is now something like:

                        # Fever
                        ## Treatment
                        ---
                        The usual dose for adults is one or two 200mg tablets or 
                        capsules 3 times a day.
                      
                      This seems to work pretty well, and doesn't require any LLMs when indexing documents.
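
                      A minimal version of that splitter, assuming reasonably clean markdown (the regex and formatting choices are mine):

                        # Split markdown on h1-h3 headings and prepend the
                        # heading path to each chunk, as in the example above.
                        import re

                        def chunk_with_headings(markdown: str) -> list[str]:
                            chunks, path, body = [], [], []

                            def flush():
                                if any(line.strip() for line in body):
                                    header = "\n".join(path) + "\n---\n" if path else ""
                                    chunks.append(header + "\n".join(body).strip())
                                body.clear()

                            for line in markdown.splitlines():
                                m = re.match(r"^(#{1,3})\s+\S", line)
                                if m:
                                    flush()
                                    level = len(m.group(1))
                                    # keep only the ancestors of this heading level
                                    path[:] = path[:level - 1] + [line.strip()]
                                else:
                                    body.append(line)
                            flush()
                            return chunks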

                      (Edited formatting)

                      • visarga 25 minutes ago

                        I am working on question answering based on long documents / bundles of documents, 100+ pages, and I took a similar approach. I first summarize each page, give it a title, and extract a list of subsections. Then I put all the summaries together and ask the model to provide a hierarchical index. It will organize the whole bundle into a tree. At query time I include the path in the tree as additional context.
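
                        Roughly what that looks like in code (prompts abbreviated; the Anthropic client is just an example of any LLM call):

                          # Summarize each page, then ask for one hierarchical
                          # outline; the path in that outline is later prepended
                          # to a page as extra context at query time.
                          import anthropic

                          client = anthropic.Anthropic()

                          def ask(prompt: str) -> str:
                              resp = client.messages.create(
                                  model="claude-3-5-sonnet-20240620",
                                  max_tokens=1024,
                                  messages=[{"role": "user", "content": prompt}])
                              return resp.content[0].text

                          def build_index(pages: list[str]):
                              summaries = [ask("Title, 2-sentence summary, and "
                                               f"subsections of this page:\n\n{p}")
                                           for p in pages]
                              outline = ask("Organize these page summaries into a "
                                            "hierarchical index (a tree of sections "
                                            "with page numbers):\n\n" +
                                            "\n\n".join(f"Page {i}: {s}"
                                                        for i, s in enumerate(summaries)))
                              return summaries, outline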

                        • cabidaher 2 hours ago

                          Did you experiment with different ways to format those included headers? Asking because I am doing something similar to that as well.

                          • valstu 2 hours ago

                            Nope, not yet. We have stuck with markdown-ish syntax so far.

                        • timwaagh 2 hours ago

                          I guess this does give some insights. Using a more space-efficient language for your codebase will mean more functionality fits in the AI's context window when working with Claude and code.

                          • skybrian 5 hours ago

                            This sounds a lot like how we used to do research, by reading books and writing any interesting quotes on index cards, along with where they came from. I wonder if prompting for that would result in better chunks? It might make it easier to review if you wanted to do it manually.

                            • visarga 23 minutes ago

                              The fundamental problem with both keyword- and embedding-based retrieval is that they only access surface-level features. If your document contains 5+5 and you search "where is the result 10" you won't find the answer. That is why all texts need to be "digested" with an LLM before indexing, to draw out implicit information and make it explicit. It's also what Anthropic proposes we do to improve RAG.

                              "study your data before indexing it"