• DeveloperErrata 5 hours ago

    Seems neat - I'm not sure if you do anything like this, but one thing that would be useful with RAG apps (especially at big scales) is vector-based search over cache contents. What I mean is that users can phrase the same question (which has the same answer) in tons of different ways. If I could pass a raw user query into your cache and get back the end result for a previously computed query (even if the current phrasing is a bit different from the cached one), then not only would I avoid submitting a new OpenAI call, I could also avoid running my entire RAG pipeline. So it's kind of like a "meta-RAG" system that avoids running the actual RAG pipeline for queries that are sufficiently similar to a cached query - an "approximate" cache.
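
    A minimal sketch of that lookup, assuming an OpenAI embedding model, a naive in-memory store, and a made-up similarity threshold (a real setup would use a vector DB):

        import numpy as np
        from openai import OpenAI

        client = OpenAI()
        SIMILARITY_THRESHOLD = 0.92  # assumption: would need tuning per application

        # cache maps a normalized query embedding to the final RAG answer
        cache: list[tuple[np.ndarray, str]] = []

        def embed(text: str) -> np.ndarray:
            resp = client.embeddings.create(model="text-embedding-3-small", input=text)
            vec = np.array(resp.data[0].embedding)
            return vec / np.linalg.norm(vec)

        def lookup(query: str) -> str | None:
            """Return a cached answer if some previous query is close enough."""
            q = embed(query)
            for vec, answer in cache:
                if float(np.dot(q, vec)) >= SIMILARITY_THRESHOLD:
                    return answer  # skip both the RAG pipeline and the OpenAI call
            return None

        def store(query: str, answer: str) -> None:
            cache.append((embed(query), answer))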

    • davidbarker 4 hours ago

      I was impressed by Upstash's approach to something similar with their "Semantic Cache".

      https://github.com/upstash/semantic-cache

        "Semantic Cache is a tool for caching natural text based on semantic similarity. It's ideal for any task that involves querying or retrieving information based on meaning, such as natural language classification or caching AI responses. Two pieces of text can be similar but not identical (e.g., "great places to check out in Spain" vs. "best places to visit in Spain"). Traditional caching doesn't recognize this semantic similarity and misses opportunities for reuse."
      • OutOfHere 3 hours ago

        I strongly advise against relying on embedding distance alone for this, because it'll match these two:

        1. great places to check out in Spain

        2. great places to check out in northern Spain

        Logically the two are not the same, and they could in fact be very different despite their semantic similarity. Your users will be frustrated and will hate you for it. If an LLM validates the two as being the same, then it's fine, but not otherwise.
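
        For illustration, a hedged sketch of that LLM check - the model choice and prompt are arbitrary:

            from openai import OpenAI

            client = OpenAI()

            def same_question(a: str, b: str) -> bool:
                """Ask an LLM whether two queries ask for the same thing (sketch)."""
                resp = client.chat.completions.create(
                    model="gpt-4o-mini",  # arbitrary choice
                    temperature=0,
                    messages=[{
                        "role": "user",
                        "content": (
                            "Do these two queries ask for exactly the same information? "
                            f"Answer YES or NO.\n1. {a}\n2. {b}"
                        ),
                    }],
                )
                return resp.choices[0].message.content.strip().upper().startswith("YES")

            # Expected: False, despite high embedding similarity.
            # same_question("great places to check out in Spain",
            #               "great places to check out in northern Spain")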

        • DeveloperErrata 3 hours ago

          I agree, a naive approach to approximate caching would probably not work for most use cases.

          I'm speculating here, but I wonder if you could use a two-stage pipeline for cache retrieval (kind of like the distance search + reranker technique used by lots of RAG pipelines). Maybe it would be possible to fine-tune a custom reranker model to only output True if two queries are semantically equivalent rather than just similar. So the hypothetical model would output True for "how to change the oil" vs. "how to replace the oil" but False in your Spain example. In this case you'd do distance-based retrieval first using the normal vector DB techniques, and then use your custom reranker to validate that the potential cache hits are actual hits.
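
          Something like this, using an off-the-shelf sentence-transformers cross-encoder as a stand-in for the fine-tuned equivalence model (model name and threshold are placeholders):

              from sentence_transformers import CrossEncoder

              # Stand-in for a model fine-tuned on "are these queries equivalent?" pairs;
              # stsb models score similarity, not strict equivalence, so this is only a sketch.
              reranker = CrossEncoder("cross-encoder/stsb-roberta-base")
              EQUIVALENCE_THRESHOLD = 0.9  # placeholder: calibrate on labeled query pairs

              def validate_hit(new_query: str, candidates: list[str]) -> str | None:
                  """Stage 2: confirm vector-DB candidates before treating them as cache hits."""
                  if not candidates:
                      return None
                  scores = reranker.predict([(new_query, c) for c in candidates])
                  best = int(scores.argmax())
                  if scores[best] >= EQUIVALENCE_THRESHOLD:
                      return candidates[best]  # e.g. "how to change the oil" ~ "how to replace the oil"
                  return None                  # e.g. "Spain" vs. "northern Spain" should fall through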

      • OutOfHere 4 hours ago

        That would totally destroy the user experience. Users change their query so they can get a refined result, not so they get the same tired result.

        • pedrosorio 2 hours ago

          Even across users it’s a terrible idea.

          Even in the simplest of applications, where all you’re doing is passing “last user query” + “retrieved articles” into OpenAI (and nothing else that differs between users, like previous queries or user data that may be necessary to answer), this will be a bad experience in many cases.

          Queries A and B may have similar embeddings (similar topic) and it may be correct to retrieve the same articles for context (which you could cache), but they can still be different questions with different correct answers.
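
          A sketch of that split - cache the retrieval step, never the generated answer; the run_retrieval helper and the cache key are stand-ins:

              from openai import OpenAI

              client = OpenAI()
              retrieval_cache: dict[str, list[str]] = {}  # keyed by whatever stage 1 matched on

              def run_retrieval(query: str) -> list[str]:
                  # stand-in for the real vector search / document fetch
                  return ["<retrieved article text>"]

              def answer(query: str, cache_key: str) -> str:
                  # Reuse the expensive retrieval when topics match...
                  if cache_key not in retrieval_cache:
                      retrieval_cache[cache_key] = run_retrieval(query)
                  articles = "\n\n".join(retrieval_cache[cache_key])

                  # ...but always regenerate: similar topics can still be different questions.
                  resp = client.chat.completions.create(
                      model="gpt-4o-mini",  # arbitrary choice
                      messages=[
                          {"role": "system", "content": "Answer using only the provided articles."},
                          {"role": "user", "content": f"Articles:\n{articles}\n\nQuestion: {query}"},
                      ],
                  )
                  return resp.choices[0].message.content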

          • elawler24 4 hours ago

            Depends on the scenario. In a threaded query, or for multiple queries from the same user, you’d want different outputs. If 20 different users are looking for the same result, a cache would return the right answer immediately at no marginal cost.

            • OutOfHere 3 hours ago

              That's not the use case of the parent comment:

              > for queries that are sufficiently similar

          • elawler24 4 hours ago

            Thanks for the detail! This is a use case we plan to support, and it will be configurable (for when you don’t want it). Some of our customers run into this when different users ask a similar query - “NY-based consumer founders” vs “consumer founders in NY”.

• OutOfHere 6 hours ago

    A cache is better when it's local rather than on the web. And I certainly don't need to pay anyone to cache local request responses.

• phillipcarter 5 hours ago

    Congrats on the launch! I love the devex here and the things you're focusing on.

    Have you had thoughts on how you might integrate data from an upstream RAG pipeline, say as part of a distributed trace, to aid in debugging the core "am I talking to the LLM the right way" use case?

    • elawler24 4 hours ago

      Thanks! You can layer on as much detail as you need by including meta tags in the request headers, which is useful for tracing RAG and agent pipelines. But I'd love to understand your particular RAG setup and whether that gives you enough granularity. Feel free to email me too - emma@usevelvet.com
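
      A rough sketch of what that could look like from the client side, assuming requests are routed through an OpenAI-compatible proxy; the base URL and header names below are placeholders, not the documented ones:

          from openai import OpenAI

          # Assumption: the gateway sits in front of OpenAI and trace metadata
          # rides along as custom headers. Header names are hypothetical.
          client = OpenAI(
              base_url="https://<your-velvet-endpoint>/v1",  # placeholder endpoint
              default_headers={
                  "x-meta-trace-id": "rag-run-1234",       # hypothetical meta tag
                  "x-meta-pipeline-step": "generation",    # hypothetical meta tag
              },
          )

          resp = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[{"role": "user", "content": "..."}],
          )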

• ji_zai an hour ago

    Neat! I'd love to play with this, but the site doesn't open (403: Forbidden).

    • elawler24 an hour ago

      Might be a Cloudflare flag. Can you email me your IP address and we'll look into it? emma@usevelvet.com.

• turnsout 5 hours ago

    Nice! Sort of like LangSmith without the LangChain, which will be an attractive value proposition for many developers.

    • efriis 4 hours ago

      Howdy, Erick from LangChain here! Just a quick clarification that LangSmith is designed to work great for folks not using LangChain as well :)

      Check out our quickstart for an example of what that looks like! https://docs.smith.langchain.com/

• ramon156 6 hours ago

    > we were frustrated by the lack of LLM infrastructure

    May I ask what specifically you were frustrated about? It seems like there are more than enough solutions already.

    • elawler24 6 hours ago

      There were plenty of UI-based low-code platforms. But they required that we adopt new abstractions, use their UI, and log into 5 different tools (logging, observability, analytics, evals, fine-tuning) just to run basic software infra. We didn’t feel these would be long-term solutions, and just wanted the data in our own DB.

• hiatus 7 hours ago

    This seems to require sharing the data we provide to OpenAI with yet another party. I don't see any zero-retention offering.

    • elawler24 6 hours ago

      The self-serve version is hosted (it’s easy to try locally), but we offer managed deployments where you bring your own DB. In that case your data is 100% yours, in your own PostgreSQL. That’s how Find AI uses Velvet.