• geoffmanning 13 hours ago

    Haha, nice, i literally designed and built the same solution for my company last week. EDIT: to be clear, i appreciate the validation. while my solution differs slightly in the details of how it's done, i think this is overall a logical solution

    • Pinkert an hour ago

      Thanks! I'd love to hear how you implemented it, and whether you can suggest any improvements to my solution. Feel free to submit PRs as well!

    • Pinkert 20 hours ago

      One architectural tradeoff we are actively working on right now is the latency of the "Select" step for shorter conversations.

      Currently, the open-source version of Librarian uses a general-purpose model to read the summary index and route the relevant messages. It works great for accuracy and drastically cuts token costs, but it does introduce a latency penalty for shorter conversations because it requires an initial LLM inference step before your actual agent can respond.
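      A rough sketch of the Select step as described above (the function names, prompt wording, and stub model are my own illustration, not Librarian's actual code):

```python
# Hypothetical sketch of a "Select" routing step: a general-purpose model
# reads a compact summary index and returns the ids of the archived
# messages relevant to the current query.
import json

def select_relevant_messages(summary_index, query, llm_complete):
    """One extra LLM call before the agent runs -- the latency penalty."""
    prompt = (
        "You are a context router. Given the message summaries below,\n"
        "return a JSON list of the ids relevant to the user's query.\n\n"
        f"Summaries: {json.dumps(summary_index)}\n"
        f"Query: {query}\n"
        "Relevant ids (JSON list only):"
    )
    return json.loads(llm_complete(prompt))

# Stub "model" for illustration only: always returns a fixed answer.
def stub_llm(prompt):
    return "[1, 3]"

ids = select_relevant_messages(
    {"1": "user configured API keys", "2": "small talk", "3": "deploy steps"},
    "how do I deploy?",
    stub_llm,
)
print(ids)  # [1, 3]
```

      The routed ids would then be used to pull only those messages into the agent's context.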

      To solve this, we are currently training a heavily quantized, fine-tuned model specifically optimized only for this context-selection task. The goal is to push the selection latency below 1 second so the entire pipeline feels completely transparent. (We have a waitlist up for this hosted version on the site).

      If anyone here has experience fine-tuning smaller models (like Llama 3 or Mistral) strictly for high-speed classification/routing over context indexes, I'd love to hear what pitfalls we should watch out for.

      • findjashua 18 hours ago

        won't this essentially disable prompt caching, that you get from a standard append-only chat history?

        • Pinkert an hour ago

          That's actually a great question, and the answer is yes and no. It does bypass the caching mechanism for the conversation history (though not for the system prompt, which remains constant). But there is a difference between a chatbot with a plain append-only chat history (just an exchange of messages) and an agent that uses a large part of the conversation as a kind of "scratchpad", sometimes even holding variable values at the beginning of the chat (to be sort of 'stateful').

          If those variables change, the scratchpad changes (which can be 30%-40% of the entire conversation); if the cache times out (Claude gives you 5 minutes for standard caching); or if anything else alters the exact history, you get a recaching of the entire conversation. Additionally, caching still costs money.
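          To make that concrete, here is a toy model of prefix caching (my simplification, not Anthropic's actual implementation): a cached prefix is reusable only up to the first edited message, and only within the TTL:

```python
# Toy model of prompt prefix caching: the reusable portion of a cached
# conversation ends at the first edited message, and the whole cache is
# lost after the TTL expires.
CACHE_TTL = 5 * 60  # seconds; Claude's standard cache lifetime

def cached_prefix_len(cached, new_history, cached_at, now):
    if now - cached_at > CACHE_TTL:
        return 0                  # cache expired -> full recache
    n = 0
    for a, b in zip(cached, new_history):
        if a != b:
            break                 # first edited message invalidates the rest
        n += 1
    return n

history = ["sys", "msg1", "msg2", "msg3"]
edited  = ["sys", "msg1", "EDITED", "msg3"]   # scratchpad variable changed

print(cached_prefix_len(history, edited, cached_at=0, now=10))    # 2
print(cached_prefix_len(history, history, cached_at=0, now=400))  # 0
```

          One changed scratchpad value early in the history wipes out the cached suffix behind it, which is why an agent-style conversation gets far less benefit than a plain append-only chat.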

          The main advantage of the Librarian is that it acts as an 'insurance policy' for this caching mechanism. Combine it with solving the context-rot issue, and you get improved performance at scale.

          • geoffmanning 12 hours ago

            oh, one other caveat is that each request could result in the curation of system messages earlier in the chat history. i haven't done a deep dive into prompt caching, but that could complicate things. the more i think about it, the more i wonder whether prompt caching is a patch for "dumb prompting": a way to save money when you throw everything you have at the model and pray it gets it right, when it would make more sense to keep the prompt as lean as possible to prevent context rot and maximize the signal-to-noise ratio.

            • geoffmanning 13 hours ago

              that's a good point. we haven't delved too deeply into prompt caching yet, but my understanding is that it only helps while a conversation stays "hot", not when a user just comes back every day and keeps adding to it over a longer period. i could see an optimization where, while the conversation is hot, we keep the system message (with the summarized index) and all subsequent un-summarized messages intact until the conversation cools off.
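              something like this sketch is what i have in mind (all names hypothetical, just illustrating the idea):

```python
# Rough sketch of the "hot conversation" idea: while the conversation is
# hot, keep a stable prefix (system message with the summarized index)
# plus an append-only tail of un-summarized messages, so the provider's
# prefix cache survives. Once it cools off, re-curate freely.
HOT_WINDOW = 5 * 60  # seconds; roughly a typical prompt-cache TTL

def build_prompt(index_summary, unsummarized, last_request_ts, now):
    system = {"role": "system", "content": index_summary}
    if now - last_request_ts <= HOT_WINDOW:
        # Hot: don't re-curate; any edit would invalidate the cached prefix.
        return [system, *unsummarized]
    # Cooled off: the cache is gone anyway, so curate aggressively,
    # e.g. keep only the newest messages and summarize the rest.
    return [system, *unsummarized[-2:]]

msgs = [{"role": "user", "content": m} for m in ("a", "b", "c")]
print(len(build_prompt("index", msgs, last_request_ts=0, now=60)))    # 4
print(len(build_prompt("index", msgs, last_request_ts=0, now=3600)))  # 3
```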