• Benjammer 14 hours ago

    It's nice to see a paper that confirms what anyone who has practiced using LLM tools already knows very well, heuristically. Keeping your context clean matters; "conversations" are only a construct of product interfaces, and they hurt the quality of responses from the LLM itself. Once your context is "poisoned" it will not recover, and you need to start fresh with a new chat.

    • Helmut10001 14 hours ago

      My experiences somewhat confirm these observations, but I also had one that was different. Two weeks of debugging IPSEC issues with Gemini. Initially, I imported all the IPSEC documentation from OPNsense and pfSense into Gemini and informed it of the general context in which I was operating (in reference to 'keeping your context clean'). Then I added my initial settings for both sides (sensitive information redacted!). Afterwards, I entered a long feedback loop, posting logs and asking and answering questions.

      At the end of the two weeks, I observed that the LLM was much less likely to become distracted. Sometimes I would dump whole forum threads or SO posts into it, and it would say "this is not what we are seeing here, because of [earlier context or finding]". I eliminated all dead ends logically and informed it of this (yes, it can help with the reflection, but I had to make the decisions). In the end, I found the cause of my issues.

      This somewhat confirms what another user here on HN said a few days ago: LLMs are good at compressing complex information into simple information, but not at expanding simple ideas into complex ones. As long as my input was larger than the output (in either complexity or length), I was happy with the results.

      I could have done this without the LLM. However, it was helpful in that it stored facts from the outset that I had either forgotten or been unable to retrieve quickly in new contexts. It also made it easier to identify time patterns in large log files, which helped me debug my site-to-site connection. I also optimized many other settings along the way, resolving more than just the most problematic issue. This meant that, in addition to fixing my problem, I learned quite a bit. The 'state' was only occasionally incorrect about my current parameter settings, but this was always easy to correct. This confirms what others have already seen: if you know where you are going and treat it as a tool, it is helpful. However, don't try to offload decisions to it or let it direct you in the wrong direction.

      Overall, 350k tokens were used (about 300k words). Here's a related blog post [1] with my overall path, though it doesn't correspond directly to this specific issue. (Please don't recommend WireGuard; I am aware of it.)

          [1]: https://du.nkel.dev/blog/2021-11-19_pfsense_opnsense_ipsec_cgnat/
      • olalonde 13 hours ago

        Recently, Gemini helped me fix a bug in a PPP driver (Zephyr OS) without prior knowledge of PPP, or really of driver development. I would copy-paste logs of raw PPP frames in hex and it would just decode everything and explain the meaning of each byte. In about an hour, I knew enough about PPP to fix the bug and submit a patch.

        https://g.co/gemini/share/7edf8fa373fe

        • skydhash 12 hours ago

          Or you could just read the PPP RFC [0].

          I’m not saying that your approach is wrong. But most LLM workflows are either brute-forcing the solution or seeking a local minimum to get stuck in. It’s like doing thousands of experiments of objects falling to figure out gravity while there’s a physics textbook nearby.

          [0]: https://datatracker.ietf.org/doc/html/rfc1661

          • olalonde 12 hours ago

            Ironically, I could’ve read all 50 pages of that RFC and still missed the actual issue. What really helped was RFC 1331[0], specifically the "Async-Control-Character-Map" section.

            That said, I’m building a product - not a PPP driver - so the quicker I can fix the problem and move on, the better.

            [0] https://datatracker.ietf.org/doc/html/rfc1331

            • wrasee 7 hours ago

              I could also walk everywhere, but sometimes technology can help.

              There’s no way I could fully read that RFC in an hour. And that’s before you even know which parts to focus your attention on, so you’re just being a worse LLM at that point.

              • Retric 13 minutes ago

                The difference is you’d remember some of the context from reading the thing, whereas an LLM is starting from scratch every single time it comes up.

            • tralarpa 10 hours ago

              Interesting that it works for you. I tried something similar several times with frames from a 5G network, and it mixed up fields from 4G and 5G in its answers (or even from non-cellular network protocols, because they had features similar to the 5G protocol I was looking at). Occasionally, the explanation was completely invented or based on discussions of planned features for future versions.

              I have really learned to mistrust and double-check every single line those systems produce. Same for writing code. Everything they produce looks nice and reasonable on the surface, but when you dig deeper it falls apart unless it's something very, very basic.

              • foobarian 2 hours ago

                Similarly I found the results pretty mixed whenever a library or framework with a lot of releases/versions is involved. The LLM tends to mix and match features from across versions.

              • Helmut10001 13 hours ago

                Yes, it feels like setting the `-h` flag for logs (human readable).

              • Benjammer 13 hours ago

                That's some impressive prompt engineering skills to keep it on track for that long, nice work! I'll have to try out some longer-form chats with Gemini and see what I get.

                I totally agree that LLMs are great at compressing information; I've set up the docs feature in Cursor to index several entire large documentation websites for major libraries and it's able to distill relevant information very quickly.

                • sixtyj 9 hours ago

                    In Gemini, it is really good to have the large window with 1M tokens. However, around 100,000 tokens it starts to make mistakes and refactor its own code.

                    Sometimes it is good to start a new chat or switch to Claude.

                    And it really helps to be very precise with the wording of the specification of what you want to achieve, or to repeat it occasionally with some added request lines.

                  GIGO in reality :)

                  • johnisgood 5 hours ago

                      Oh my, I hate it when it rewrites >1k LOC. I have to instruct it to "modify only ..., do not touch the rest" and so forth, but GPT often does not listen to this; Claude does. I dunno about Gemini.

                    • diggan 3 hours ago

                      In terms of "does useless refactors I didn't ask for nor improved anything", my own ranked list goes something like: Gemini > Claude > GPT. I don't really experience this at all with various GPT models used via the API, but overall GPTs seems to stick to the system prompt way better than the rest. Clause does OK too, but Gemini is out of control and writes soo much code and does so much you didn't ask for, really acts like a overly eager junior developer.

                      • johnisgood 2 hours ago

                        The first time I used Claude, it rewrote >1k LOC without asking for it, but in retrospect, I was "using it wrong". With GPT, even when I told it to not do it, it still did that, but that was some time ago and it was not done via the API, so I dunno. I think I do agree with your list, but I haven't used Gemini that much.

                        Yeah, they do come across as "overly eager junior devs", good comparison. :D

                        • diggan 2 hours ago

                          > With GPT, even when I told it to not do it, it still did that, but that was some time ago and it was not done via the API, so I dunno.

                            Personally I think it's a lot better via the API than ChatGPT. ChatGPT doesn't let you edit the "system prompt", which is really where you want to put "how to" instructions so the model actually follows them. Instructions put in the user message aren't followed as closely as those in the system prompt, which is probably why it still did something if you were using ChatGPT.
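
                            A minimal sketch of what I mean, using the OpenAI Python SDK (the model name and the code-editing instructions are just made-up examples):

                                from openai import OpenAI

                                client = OpenAI()

                                resp = client.chat.completions.create(
                                    model="gpt-4o",  # example model
                                    messages=[
                                        # "how to" instructions go in the system role, so they stick...
                                        {"role": "system",
                                         "content": "Modify only the function I name. Do not touch or reformat any other code."},
                                        # ...while the user message carries the actual task
                                        {"role": "user",
                                         "content": "In utils.py, fix the off-by-one bug in paginate()."},
                                    ],
                                )
                                print(resp.choices[0].message.content)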

                      • sixtyj 36 minutes ago

                        I received this gem from Gemini just now:

                        I am giving up on providing code, and on checking is it working, because it is very time consuming. Tell me when it starts working. Good luck.

                        :)

                • morsecodist 14 hours ago

                  This matches my experience exactly. "poisoned" is a great way to put it. I find once something has gone wrong all subsequent responses are bad. This is why I am iffy on ChatGPT's memory features. I don't notice it causing any huge problems but I don't love how it pollutes my context in ways I don't fully understand.

                  • somenameforme 11 hours ago

                      It's interesting how much the nature of LLMs, as fundamentally self-recursive next-token predictors, aligns with the Chinese Room thought experiment. [1] In that experiment it also makes perfect sense that a single wrong response would cascade into a series of subsequent, ever more drifting errors. I think it all emphasizes the relevance of the otherwise unqualifiable concept of 'understanding.'

                      In many ways this issue could make the Chinese Room thought experiment even more compelling, because it's a very practical and inescapable issue.

                    [1] - https://en.wikipedia.org/wiki/Chinese_room

                    • jampekka 9 hours ago

                      I don't think the Chinese room thought experiment is about this, or performance of LLMs in general. Searle explicitly argues that a program can't induce "understanding" even if it mimicked human understanding perfectly because programs don't have "causal powers" to generate "mental states".

                      This is mentioned in the Wikipedia page too: "Although its proponents originally presented the argument in reaction to statements of artificial intelligence (AI) researchers, it is not an argument against the goals of mainstream AI research because it does not show a limit in the amount of intelligent behavior a machine can display."

                      • keiferski 10 hours ago

                          Great comment on the Chinese Room. That idea seems to be dismissed nowadays, but the concept of "cascading failure to understand context" is absolutely relevant to LLMs. I often find myself needing to explain basic details over and over again to an LLM, when with a person it would be a five-second "no, I mean like this way, not that way" explanation.

                      • OtherShrezzing 8 hours ago

                        I find using tools like LMStudio, which lets you edit your chat history on the fly, really helps deal with this problem. The models you can host locally are much weaker, but they perform a little better than the really big models once you need to factor in these poisoning problems.

                          A nice middle ground I'm finding is to give Claude an initial conversation starter in its "thinking" mode, and then copy/paste that conversation into LMStudio and have a weaker model like Gemma pick up from where Claude left off.

                        • AstroBen 12 hours ago

                          good point on the memory feature. Wow that sounds terrible

                          • distances 10 hours ago

                            The memory is easy to turn off. It sounded like a very bad idea to cross-contaminate chats so I disabled it as soon as ChatGPT introduced it.

                        • dr_dshiv 4 hours ago

                          The #1 tip I teach is to make extensive use of the teeny-tiny mostly hidden “edit” button in ChatGPT and Claude. When you get a bad response, stop and edit to get a better one, rather than letting crap start to multiply crap.

                          • diggan 3 hours ago

                                Hear hear! Basically, if the first reply isn't good / didn't understand / got something wrong, restart from the beginning with a better prompt, explaining more/better. Rinse and repeat.

                            • forgotTheLast an hour ago

                              You can do even better by asking it to ask clarifying questions before generating anything, then editing your initial prompt with those clarifications.

                          • b800h 11 hours ago

                              I've been saying for ages that I want to be able to fork conversations so I can experiment with the direction an exchange takes without irrevocably poisoning a promising well. I can't do this with ChatGPT; is anyone aware of a provider that offers this as a feature?

                            • stuffoverflow 11 hours ago

                                Google AI Studio, ChatGPT and Claude all support this. Google AI Studio is the only one that lets you branch to a separate chat, though. For ChatGPT and Claude you just edit the message you want to branch from.

                              • giordanol 3 hours ago

                                Feels like a semi-simple UX fix could make this a lot more natural. Git-style forks but for chats.

                                • Garlef 10 hours ago

                                  Support: Yes. But the UX is not optimized for this.

                                  Imagine trying to find a specific output/input that was good in the conversation tree.

                                  • layer8 9 hours ago

                                    Yes, it would be nice if you could at least bookmark a particular branch.

                                • m4houk 10 hours ago

                                  I once built something like this for fun as a side project.

                                  You can highlight some text in a chat and fork the chat to talk about that text selection, so the LLM has context of that along with the previous chat history and it responds in a new chat (entire chat history up to that point from the parent chat gets copied over - basically inspired by the Unix `fork`).

                                  Your text selection from the parent chat would get turned into a hyperlink to the new child chat so you can always get to it again if you're reading the parent chat.
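
                                    The fork itself was tiny; roughly this shape (a simplified sketch from memory, not the actual code):

                                        from dataclasses import dataclass, field

                                        @dataclass
                                        class Chat:
                                            messages: list = field(default_factory=list)  # [{"role": ..., "content": ...}]
                                            children: list = field(default_factory=list)  # forked child chats

                                        def fork(parent: Chat, upto: int, selection: str) -> Chat:
                                            # copy the history up to the fork point, like a Unix fork copies process state
                                            child = Chat(messages=list(parent.messages[: upto + 1]))
                                            # seed the child with the highlighted text as its first topic
                                            child.messages.append({"role": "user", "content": f"About this part: {selection!r}"})
                                            parent.children.append(child)  # the parent keeps a link, rendered as a hyperlink in the UI
                                            return child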

                                  • lewdwig 9 hours ago

                                      T3.chat supports convo forking, and in my experience it works really well.

                                    The fundamental issue is that LLMs do not currently have real long term memory, and until they do, this is about the best we can do.

                                    • therockhead 8 hours ago

                                        I need to think about this a bit more, but I think I would love a thread feature in ChatGPT, so that a thread has the context up to the point of its creation but doesn't affect the main conversation. It would help in two ways: it keeps the main topic from getting poisoned, and it allows me to minimise text clutter when I go off on tangents during the conversation.

                                      • bambax 9 hours ago

                                        On Openrouter you can delete previous answers (and questions) and maintain a separate conversation with different models.

                                        But it would indeed be nice to either disable answers (without deleting them) or fork a conversation. It wouldn't be hard to implement; I wonder if there's a market for just this?

                                        • granra 11 hours ago

                                          Some 3rd-party UIs offer this. I sometimes use TypingMind, which does, but AFAIK some open-source ones do too.

                                          • a_e_k 10 hours ago

                                            If you're happy running local models, llama.cpp's built-in web-server's interface can do this.

                                            • anonexpat 11 hours ago

                                              I believe Claude has forking in their web interface.

                                              • actualwitch 9 hours ago

                                                I stumbled upon this issue myself when designing prompts for agentic systems and got mad at the lack of tools to support this flow, so I built one myself! I called it Experiment; it allows easy conversation forking and editing while retaining all logs.

                                                https://github.com/actualwitch/experiment

                                              • bredren 3 hours ago

                                                This is why I created FileKitty, which lets you quickly concatenate multiple source code files into markdown-formatted copy-pasta:

                                                https://github.com/banagale/FileKitty

                                                When getting software development assistance, relying on LLM products to search code bases etc leaves too much room for error. Throw in what amounts to lossy compression of that context to save the service provider on token costs and the LLM is serving watered down results.

                                                Getting the specific context right up front and updating that context as the conversation unfolds leads to superior results.

                                                Even then, you do need to mind the length of conversations. I have a prompt designed to capture conversational context, and transfer it into a new session. It identifies files that should be included in the new initial prompt, etc.

                                                For a bit more discussion on this, see this thread and its ancestry: https://news.ycombinator.com/item?id=43711216

                                                • CobrastanJorji 12 hours ago

                                                  An interesting little example of this problem is initial prompting, which is effectively just a permanent, hidden context that can't be cleared. On Twitter right now, the "Grok" bot has recently begun frequently mentioning "White Genocide," which is, y'know, odd. This is almost certainly because someone recently adjusted its prompt to tell it what its views on white genocide are meant to be, which for a perfect chatbot wouldn't matter when you ask it about other topics, but it DOES matter. It's part of the context. It's gonna talk about that now.

                                                  • dragonwriter 10 hours ago

                                                    > This is almost certainly because someone recently adjusted its prompt to tell it what its views on white genocide are meant to be

                                                    Well, someone did something to it; whether it was training, feature boosting the way Golden Gate Claude [0] was done, adjusting the system prompt, or ensuring that its internet search for contextual information would always return material about that, or some combination of those, is neither obvious nor, if someone had a conjecture as to which one or combination it was, easily falsifiable/verifiable.

                                                    [0] https://www.anthropic.com/news/golden-gate-claude

                                                    • lolinder 4 hours ago

                                                      Source [0]. The examples look pretty clearly like they stuck it in the context window, not trained it in. It consistently seems to structure the replies as though the user it's replying to is the one who brought up white genocide in South Africa, and it responds the way that LLMs often respond to such topics: saying that it's controversial and giving both perspectives. That's not behavior I would expect if they had used the Golden Gate Claude method, which inserted the Golden Gate Bridge a bit more fluidly into the conversation rather than seeming to address a phantom sentence that the user supposedly said.

                                                      Also, let's be honest: in a Musk company they're going to have taken the shortest possible route to accomplishing what he wanted them to.

                                                      [0] https://www.cnn.com/2025/05/14/business/grok-ai-chatbot-repl...

                                                    • 9dev 11 hours ago

                                                      Well, telling an AI chatbot to insist on discussing a white genocide seems like a perfectly Elon thing to do!

                                                      • M4v3R 10 hours ago

                                                        > This is almost certainly because someone recently adjusted its prompt to tell it what its views on white genocide are

                                                        Do you have any source on this? System prompts get leaked/extracted all the time, so I imagine someone would notice this.

                                                        Edit: just realized you’re talking about the Grok bot, not Grok the LLM available on X or grok.com. With the bot it’s probably harder to extract its exact instructions since it only replies via tweets. For reference here’s the current Grok the LLM system prompt: https://github.com/asgeirtj/system_prompts_leaks/blob/main/g...

                                                        • lenkite 9 hours ago

                                                          Probably because it is now learning from a lot of videos posted on X by misc right-wingers showing rallying cries of South African politicians like Julius Malema, Paul Mashatile etc. Not very odd.

                                                          As merely 3 of over a dozen examples:

                                                          https://x.com/DefiantLs/status/1922213073957327219

                                                          https://x.com/PPC4Liberty/status/1922650016579018855

                                                          https://x.com/News24/status/1920909178236776755

                                                          • micromacrofoot 5 hours ago

                                                            nah, llms don't learn like this — they specifically added it to the system prompt

                                                          • stevedonovan 10 hours ago

                                                            Ah, Elon paying attention to his companies again!

                                                            Context poisoning is not a uniquely LLM problem

                                                            • ezst 12 hours ago

                                                              The heck??

                                                          • pseudocomposer 4 hours ago

                                                            I mostly just use LLMs for autocomplete (not chat), but wouldn’t this be fixed by adding a “delete message” button/context option in LLM chat UIs?

                                                            If you delete the last message from the LLM (so now, you sent the last message), it would then generate a new response. (This would be particularly useful with high-temperature/more “randomly” configured LLMs.)

                                                            If you delete any other message, it just updates the LLM context for any future responses it sends (the real problem at hand, context cleanup).

                                                            I think seeing it work this way would also really help end users who think LLMs are “intelligent” to better understand that it’s just a big, complex autocomplete (and that’s still very useful).

                                                            Maybe this is standard already, or used in some LLM UI? If not, consider this comment as putting it in the public domain.

                                                            Now that I’m thinking about it, it seems like it might be practical to use “sub-contextual LLMs” to manage the context of your main LLM chat. Basically, if an LLM response in your chat/context is very long, you could ask the “sub-contextual LLM” to shorten/summarize that response, thus trimming down/cleaning the context for your overall conversation. (Also, more simply, an “edit message” button could do the same, just with you, the human, editing the context instead of an LLM…)
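
                                                            A sketch of that "sub-contextual" idea in Python (`summarize` here stands in for whatever smaller-model call you like; the point is that it rewrites the context sent to the API, not the chat the user sees):

                                                                def compact_context(messages, summarize, max_chars=2000):
                                                                    """Replace overly long turns with short summaries before the next request."""
                                                                    compacted = []
                                                                    for msg in messages:
                                                                        if len(msg["content"]) > max_chars:
                                                                            # hand the long turn to the sub-contextual LLM
                                                                            summary = summarize(msg["content"])
                                                                            compacted.append({**msg, "content": "(summary) " + summary})
                                                                        else:
                                                                            compacted.append(msg)
                                                                    return compacted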

                                                            • dr_dshiv 4 hours ago

                                                              This is how Claude’s UI used to work, in practice, where you could edit the context directly.

                                                            • unshavedyak 14 hours ago

                                                              Has any interface implemented a history-cleaning mechanism? I.e., with every chat message, focus on cleaning up dead ends in the conversation or irrelevant details. Like summarization, but organic to the topic at hand?

                                                              Most history would remain; it wouldn't try to summarize exactly, just prune and organize the history relative to the conversation path.

                                                              • ithkuil 11 hours ago

                                                                "Every problem in computer science can be solved with another level of indirection."

                                                                One could argue that the attention mechanism in transformers is already designed to do that.

                                                                But you need to train it more specifically with that in mind if you want it to be better at damping attention to parts that are deemed irrelevant by the subsequent evolution of the conversation.

                                                                And that requires the black art of ML training.

                                                                Doing this as a hack on top of the chat product, on the other hand, feels more like engineering, and we're more familiar with that as a field.

                                                                • nosefurhairdo 14 hours ago

                                                                  I've had success having a conversation about requirements, asking the model to summarize the requirements as a spec to feed into a model for implementation, then pass that spec into a fresh context. Haven't seen any UI to do this automatically but fairly trivial/natural to perform with existing tools.
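
                                                                  In rough Python, the loop I do by hand looks something like this (`complete` stands in for whichever chat API you use; the prompts are just examples):

                                                                      def handoff(requirements_chat, complete):
                                                                          # ask the first context to compress itself into a spec...
                                                                          ask = {"role": "user",
                                                                                 "content": "Summarize the agreed requirements as a standalone spec."}
                                                                          spec = complete(requirements_chat + [ask])
                                                                          # ...then start a fresh context that only ever sees the spec
                                                                          return [
                                                                              {"role": "system", "content": "Implement the following spec."},
                                                                              {"role": "user", "content": spec},
                                                                          ]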

                                                                  • dep_b 2 hours ago

                                                                    Doing the same. Though I wish there were some kind of optimization of text generated by an LLM for an LLM. Just mentioning that it's for an LLM instead of human consumption yields no observably different results.

                                                                  • olalonde 13 hours ago

                                                                    Not sure if that's what you mean but Claude Code has a /compact command which gets triggered automatically when you exceed the context window.

                                                                    The prompt it uses: https://www.reddit.com/r/ClaudeAI/comments/1jr52qj/here_is_c...

                                                                    • QuadmasterXLII 12 hours ago

                                                                      The problem is that it needs to read the log to prune the log. So if there is garbage in the log, which needs to be pruned to keep it from poisoning the main chat, then the garbage will also poison the pruning model, and it will do a bad job of pruning.

                                                                      • hobofan 11 hours ago

                                                                        Not a history-cleaning mechanism, but related to that: Cursor in its most recent release introduced a feature to duplicate your chat (so you can safeguard yourself against poisoning and go back to an unpoisoned point in history), which seems like an admission of the same problem.

                                                                        • Benjammer 14 hours ago

                                                                          I mean, you could build this, but it would just be a feature on top of a product abstraction of a "conversation".

                                                                          Each time you press enter, you are spinning up a new instance of the LLM and passing in the entire previous chat text plus your new message, and asking it to predict the next tokens. It does this iteratively until the model produces a <stop> token, and then it returns the text to you and the PRODUCT parses it back into separate chat messages and displays it in your UI.

                                                                          What you are asking the PRODUCT to do now is to edit your and its chat messages in the history of the chat, and then send that as the new history along with your latest message. This is the only way to clean the context, because the context is nothing more than your messages and its previous responses, plus anything that tools have pulled in. I think it would be sort of a weird feature to add to a chat bot to have it, each time you send a new message, go back through the entire history of your chat and just start editing the messages to prune out details. You would scroll up and see a different conversation; it would be confusing.
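
                                                                          Concretely, every turn of a "conversation" is just a loop like this that re-sends the whole transcript (a minimal sketch against an OpenAI-style chat API; the model name is only an example):

                                                                              from openai import OpenAI

                                                                              client = OpenAI()
                                                                              history = [{"role": "system", "content": "You are a helpful assistant."}]

                                                                              while True:
                                                                                  history.append({"role": "user", "content": input("> ")})
                                                                                  # a brand-new completion call receives the ENTIRE history every time;
                                                                                  # the model keeps no state of its own between calls
                                                                                  resp = client.chat.completions.create(model="gpt-4o", messages=history)
                                                                                  reply = resp.choices[0].message.content
                                                                                  history.append({"role": "assistant", "content": reply})
                                                                                  print(reply)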

                                                                          IMO, this is just part of prompt engineering skills to keep your context clean or know how to "clean" it by branching/summarizing conversations.

                                                                          • rrr_oh_man 12 hours ago

                                                                            Or delete / edit messages in AI Studio or Open Router.

                                                                          • kqr 13 hours ago

                                                                            Isn't this what Claude workbench in the Anthropic console does? It lets the user edit both sides of the conversation history.

                                                                          • Adambuilds 8 hours ago

                                                                            I agree—once the context is "poisoned," it’s tough to recover. A potential improvement could be having the LLM periodically clean or reset certain parts of the context without starting from scratch. However, the challenge would be determining which parts of the context need resetting without losing essential information. Smarter context management could help maintain coherence in longer conversations, but it’s a tricky balance to strike. Perhaps another agent could do the job?

                                                                            • yaur 3 hours ago

                                                                              One of the most frustrating features of ChatGPT is “memories” which can cause that poisoning to follow you around between chats.

                                                                              • jimmySixDOF 4 hours ago

                                                                                >"conversations" are only a construct of product interfaces

                                                                                This seems to be in flux now due to RL training on multi-turn eval datasets, so while the context window is evergreen every time, there will be some bias towards interpreting each prompt as part of a longer conversation. Multi-turn post-training is not scaled out in public yet, but I think it may be the way to stay on the 'double the time spent on a goal every 7 months' curve.

                                                                                • bentt 4 hours ago

                                                                                  Yes, even when coding and not conversing, I often start new conversations where I take the current code and explain it anew. This often gives better results than hammering on one conversation.

                                                                                  This feels like something that can be fixed with manual instructions which prompt the model to summarize and forget. This might even map appropriately to human psychology. Working Memory vs Narrative/Episodic Memory.

                                                                                  • QuantumGood 3 hours ago

                                                                                    " 'conversations' are only a construct of product interface" is so helpful maintain top-of-mind, but difficult because of all the "conversational" cues

                                                                                    • aleksituk 3 hours ago

                                                                                      Yarp! And "poisoning" can be done with "off-topic" questions and answers as well as just sort of "dilution". Have noticed this when doing content generation repeatedly, tight instructions get diluted over time.

                                                                                      • amelius 10 hours ago

                                                                                        I suppose that the chain-of-thought style of prompting that is used by AI chat applications internally also breaks down because of this phenomenon.

                                                                                        • freehorse 6 hours ago

                                                                                          Which is why I really like Zed's chat UX: being able to edit the full prior conversation like a text file, I can go back and clean it up, make small adjustments, delete turns, etc., and then continue the discussion with a cleaner and more relevant context.

                                                                                          I have made Zed one of my main LLM chat interfaces even for non-programming tasks, because being able to do that is great.

                                                                                          • CompoundEyes 13 hours ago

                                                                                            Agreed poisoned is a good term. I’d like to see “version control” for conversations via the API and UI that lets you rollback to a previous place or clone from that spot into a new conversation. Even a typo or having to clarify a previous message skews the probabilities of future responses due to the accident.

                                                                                            • mh- 13 hours ago

                                                                                              "Forking" or "branching" (probably better received outside of SWEs) a conversation really ought to be a first class feature of ChatGPT et Al.

                                                                                              • HaZeust 13 hours ago

                                                                                                It is in Google Gemini, which, I really hate to say, I've been using a lot more than GPT. I reckon I'll be cancelling my Pro subscription if Gemini keeps this lead for my everyday workflows.

                                                                                                • energy123 13 hours ago

                                                                                                  How? I use the Gemini web app and don't see it.

                                                                                                  • voidspark 13 hours ago
                                                                                                    • crooked-v 12 hours ago

                                                                                                      AI Studio is borderline unusable for long conversations. I don't know what in the world it's doing but it sure looks like a catastrophic memory leak in the basic design.

                                                                                                      • voidspark 12 hours ago

                                                                                                        I have been using it up to 100k tokens so far without issues. Never needed to go further than that. But much of that was in uploaded documents.

                                                                                                  • drittich 12 hours ago

                                                                                                    Also exists in LM Studio.

                                                                                                  • wunderwuzzi23 11 hours ago

                                                                                                     This was part of ChatGPT from pretty much the beginning, maybe not the initial version but a few weeks later; I don't recall exactly.

                                                                                                    • layer8 11 hours ago

                                                                                                      This has been in ChatGPT from pretty early on? Just edit any prompt, it creates a new branch, and you can switch back and forth.

                                                                                                      • b800h 11 hours ago

                                                                                                        Blimey, I didn't realise the entire thread was saved when you edited a prompt. Very good! Mind you, it feels "unsafe". I'd like to be able to clone a thread.

                                                                                                      • gdudeman 12 hours ago

                                                                                                        It is!

                                                                                                         It exists in Claude as a true branch - you can see the old threads - and in ChatGPT, but without the history.

                                                                                                        Edit a previous reply and hit “go” to see it in action.

                                                                                                      • gdudeman 12 hours ago

                                                                                                        This exists in Claude. Edit any previous message and it will fork the conversation.

                                                                                                      • MattGaiser 14 hours ago

                                                                                                         Yep. I regretted leaving memory on, as it poisoned my conversations with irrelevant junk.

                                                                                                        • neom 14 hours ago

                                                                                                          You can go in and delete memory items

                                                                                                        • veunes 11 hours ago

                                                                                                          What surprised me is how early the models start locking into wrong assumptions

                                                                                                          • djmips 13 hours ago

                                                                                                            Happens with people too if you think about it.

                                                                                                            • kfarr 10 hours ago

                                                                                                              Who gets lost in multi-turn conversations?

                                                                                                              • TheOtherHobbes 9 hours ago

                                                                                                                Everyone?

                                                                                                                How often in meetings does everyone maintain a running context of the entire conversation, instead of responding to the last thing that was said with a comment that has an outstanding chance of being forgotten as soon as the next person starts speaking?

                                                                                                            • oaeirjtlj 2 hours ago

                                                                                                              And now that ChatGPT has a "memory" and can access previous conversations, it might be poisoned permanently. It gets one really bad idea, and forever after it insists on dumping that bad idea into every subsequent response, even after you repeatedly tell it "THAT'S A SHIT IDEA, DON'T EVER MENTION THAT AGAIN". Sometimes it'll accidentally include some of its internal prompting ("user is very unhappy, make sure to not include xyz") and then give you a response that is entirely focused around xyz.

                                                                                                            • tmountain 5 hours ago

                                                                                                              I often ask the LLM for a concise summary of the discussion so far—formatted as a prompt. I then edit it appropriately and use it to start a new conversation without the baggage. I have found this to be a very effective technique, but I imagine it will be automated sometime soon.

                                                                                                              • maleldil 12 minutes ago

                                                                                                                Claude Code has a /compact command that summarises the conversation so far to save on context tokens.

                                                                                                              • Sharlin 14 hours ago

                                                                                                                Seems like this is an aspect of their well-known overconfidence and the inability to self-reflect and recognize they have to ask for more details because their priors are too low. If you look at the output of reasoning models, it’s clear that the idea of asking for clarification very rarely occurs to them – when they’re confused, it’s just endless speculation of what the user might have meant.

                                                                                                                This, of course, has certain implications as to the wisdom of the idea of “replacing human programmers”, given that one of the hard parts of the trade is trying to turn vague and often confused ideas into precise specifications by interacting with the stakeholders.

                                                                                                                • Terr_ 14 hours ago

                                                                                                                  > inability to self-reflect

                                                                                                                  IMO the One Weird Trick for LLMs is recognizing that there's no real entity, and that users are being tricked into a suspended-disbelief story.

                                                                                                                  In most cases you're contributing text-lines for a User-character in a movie-script document, and the LLM algorithm is periodically triggered to autocomplete incomplete lines for a Chatbot character.

                                                                                                                  You can have an interview with a vampire DraculaBot, but that character can only "self-reflect" in the same shallow/fictional way that it can "thirst for blood" or "turn into a cloud of bats."

                                                                                                                  • layer8 11 hours ago

                                                                                                                    Not to mention that vampires don’t reflect. ;)

                                                                                                                    • Sharlin 14 hours ago

                                                                                                                      This is a tired semantic argument that does not bring any insight into the discussion. A token-predictor could still be trained to predict the tokens “I’m not sure what you mean because of points x, y, and z; could you elaborate?”

                                                                                                                      • root_axis 12 hours ago

                                                                                                                        It could be trained to say that, but it's not exactly clear how you would reinforce the absence of certain training data in order to emit that response accurately, rather than just based on embedding proximity.

                                                                                                                        • jsnider3 6 hours ago

                                                                                                                          Seems easy. Have a set of vague requests and train it to ask for clarification instead of guessing.

                                                                                                                          • root_axis 4 hours ago

                                                                                                                            As I said, it's possible to train it to ask for clarification, but it's not clear how to reinforce that response in a way that correctly maps on to the absence of data rather than arbitrary embedding proximity. You can't explicitly train on every possible scenario where the AI should recognize its lack of knowledge.

                                                                                                                            • joleyj 2 hours ago

                                                                                                                              If the solution were easy or obvious the problem would likely have already been solved no?

                                                                                                                              • timdiggerm 5 hours ago

                                                                                                                                How does it identify what's vague?

                                                                                                                              • simianwords 10 hours ago

                                                                                                                                Why does it seem so hard to make training data for this? You could cook up a few thousand training examples and do RLHF.

                                                                                                                                • root_axis 3 hours ago

                                                                                                                                  Yes, but all that does is locate "I don't know" near the cooked up data within the embeddings. This doesn't actually reflect an absence of data in the training.

                                                                                                                              • Terr_ 14 hours ago

                                                                                                                                It means if you want something resembling a self-introspective theory of mind, you need to arrange the overall document to cohere to documents where such things are/appear-to-be happening.

                                                                                                                                This leads us to new questions: How can we characterize and identify real-world documents which fit? How can we determine what features may be significant, and which of those can be easily transplanted to our use-case?

                                                                                                                                • simianwords 10 hours ago

                                                                                                                                  There are a lot of words here, but it feels like you have never really used LLMs (apologies for the bluntness).

                                                                                                                                  We see LLMs introspecting all the time [1].

                                                                                                                                  > Notably, DeepSeek-AI et al. report that the average response length and downstream performance of DeepSeek-R1-Zero increases as training progresses. They further report an “aha moment” during training, which refers to the “emergence” of the model’s ability to reconsider its previously generated content. As we show in Section 3.2, this reconsideration behaviour is often indicated by the generation of phrases such as ‘wait, ...’ or ‘alternatively, ...’

                                                                                                                                  [1] https://arxiv.org/pdf/2504.07128

                                                                                                                                  • bandrami 8 hours ago

                                                                                                                                    Unless they show you the Markov chain weights (and I've never seen one that does), that's confabulation, not introspection.

                                                                                                                                  • sitkack 13 hours ago

                                                                                                                                    You are just doubling down on protecting your argument.

                                                                                                                                    I operate LLMs in many conversational modes where it does ask clarifying questions, probing questions, baseline determining questions.

                                                                                                                                    It takes at most one sentence in the prompt to get them to act this way.

                                                                                                                                    • bigcat12345678 13 hours ago

                                                                                                                                      > It takes at most one sentence in the prompt to get them to act this way.

                                                                                                                                      What is this one sentence you are using?

                                                                                                                                      I am struggling to elicit clarification behavior from LLMs.

                                                                                                                                      • sitkack 3 hours ago

                                                                                                                                        What is your domain and what assumptions are they making that they should be asking you for? Have you tried multiple models?

                                                                                                                                        • mdemare 5 hours ago

                                                                                                                                          "Any questions before you start coding?"

                                                                                                                                        • sandspar 12 hours ago

                                                                                                                                          Could you share your prompt to get it to ask clarifying questions? I'm wondering if it would work in custom instructions.

                                                                                                                                          • sitkack an hour ago

                                                                                                                                            It is domain dependent, you really need to play with it. Tell it you are doing pair thinking and either get it to ask questions about things it doesn't understand, or get it to ask you questions to get you to think better. Project the AI into a vantage point in the latent space and then get it to behave in the way that you want it to.

                                                                                                                                            You can ask it to use the Socratic method, but then it is probing you, not its own understanding. Now have it use the socratic method on itself. You can tell it to have multiple simultaneous minds.

                                                                                                                                            Play with deepseek in thinking and non-thinking mode, give it nebulous prompts and see if you can get it to ask for clarifications.

                                                                                                                                      • dkdbejwi383 5 hours ago

                                                                                                                                        How would an LLM “know” when it isn’t sure? Their baseline for truth is competent text, they don’t have a baseline for truth based on observed reality. That’s why they can be “tricked” into things like “Mr Bean is the president of the USA”

                                                                                                                                        • JustFinishedBSG 4 hours ago

                                                                                                                                          It would "know" the same way it "knows" anything else: The probability of the sequence "I don't know" would be higher than the probability of any other sequence.

                                                                                                                                          • ben_w 4 hours ago

                                                                                                                                            The answer is the same as how the messy bag of chemistry that is the human brain "knows" when it isn't sure:

                                                                                                                                            Badly, and with great difficulty, so while it can just about be done, even then only kinda.

                                                                                                                                            • foldr 3 hours ago

                                                                                                                                              We really don’t understand the human brain well enough to have confidence that the mechanisms that cause people to respond with “I don’t know” are at all similar to the mechanisms which cause LLMs to give such responses. And there are quite a few prima facie reasons to think that they wouldn’t be the same.

                                                                                                                                            • saberience 4 hours ago

                                                                                                                                              Humans can just as easily be tricked. Something like 25% of the American Electorate believed Obama was the antichrist.

                                                                                                                                                So saying LLMs have no "baseline for truth" doesn't really mean much one way or the other; they are much smarter and more accurate than 99% of humans.

                                                                                                                                            • jcims 11 hours ago

                                                                                                                                                I agree that it's a tired argument, but there appear to be two separate things being discussed in this little corner of HN: clarity about the problem it's being asked to solve, and confidence that the answer it has is correct.

                                                                                                                                              I can trivially get any of the foundational models to ask me clarifying questions. I've never had one respond with 'I don't know'.

                                                                                                                                              • chipsrafferty 11 hours ago

                                                                                                                                                I've gotten lots of responses like "with the information you provided, I cannot answer that. Can you provide more information?"

                                                                                                                                                Which IMO is the same as "idk".

                                                                                                                                              • roywiggins 12 hours ago

                                                                                                                                                Anthropic found that Claude will claim it used the "standard" way to do addition - add the digits, carry the 1, etc. - but the pattern of activations showed it using a completely different algorithm. So these things can role-play at introspecting - they come up with plausible post-hoc explanations for their output - but they are still just pretending, so they will get it wrong.

                                                                                                                                                So you can teach a model to sometimes ask for clarification, but will it actually have insight into when it really needs it, or will it just interject for clarification more or less at random? These models have really awful insight into their own capabilities; ChatGPT, e.g., insists to me that it can read braille, and then cheerfully generates a pure hallucination.

                                                                                                                                                • roenxi 11 hours ago

                                                                                                                                                  > Anthropic found that it Claude will pretend that it used the "standard" way to do addition- add the digits, carry the 1, etc- but the pattern of activations showed it using a completely different algorithm.

                                                                                                                                                  That doesn't mean much; humans sometimes do the same thing. I recall a fun story about a mathematician with synesthesia multiplying numbers by mixing the colours together. With a bit of training such a person could also pretend to be executing a normal algorithm for the purposes of passing tests.

                                                                                                                                                  • frabcus 10 hours ago

                                                                                                                                                    Even then the human doesn't know how they execute the algorithm, or mix the colours together - our conscious self-reflective mind has limits as to how far into our neural network weights it can reach. Can get further with lots of meditation, but it is still definitionally limited (in information theory terms).

                                                                                                                                                • littlestymaar 10 hours ago

                                                                                                                                                It's not a tired argument, and not just a semantic one: it's a foundational characteristic of LLMs.

                                                                                                                                                  > A token-predictor could still be trained to predict the tokens “I’m not sure what you mean because of points x, y, and z; could you elaborate?”

                                                                                                                                                  This is entirely true, and the key insight is even right in your sentence but you don't seem to grasp it. “could still be trained”: you can train an LLM into doing whatever you want it to, but you have to train it specifically for that!

                                                                                                                                                In the early days of LLMs we witnessed this impressive phenomenon where the LLM exhibited emergent capabilities (I'm particularly thinking about LLMs being few-shot learners on stuff that wasn't in their training corpus). And these emergent capabilities legitimately raised the question of “how intelligent these things are, really”.

                                                                                                                                                But for the past three years, the key lesson is that this kind of emergent effect is too small to be useful, and the focus has been put towards creating purpose-built datasets (with tons of “artificial data”) to train the models to explicitly do the things we want them to do. And it works pretty well, as models' capabilities have kept improving at a fast pace (and in particular, I don't see why we couldn't overcome the problem highlighted by this paper with more synthetic data specifically designed for multi-turn conversation). But their progress is now strictly limited by their makers' own intelligence. You cannot just scrape the web, throw compute at the problem, and expect emergent intelligence to occur anymore. It's more “simulated intelligence” than “artificial intelligence”, really.

                                                                                                                                                  • og_kalu 7 hours ago

                                                                                                                                                    It's definitely a tired and semantical one because as he said, it brings no insight and is not even good at the analogy level. I can't have a conversation with Dracula and Dracula can't make decisions that affect the real world, so LLMs already break key aspects and assumptions of the 'Document Simulator'.

                                                                                                                                                    Pre-trained LLMs will ask clarifying questions just fine. So I think this is just another consequence of post-training recipes.

                                                                                                                                                    • Terr_ 10 minutes ago

                                                                                                                                                      > I can't have a conversation with Dracula and Dracula can't make decisions that affect the real world, so LLMs already break key aspects and assumptions of the 'Document Simulator'.

                                                                                                                                                      Nonsense, we are surrounded by algorithms that "affect the real world" because many of us have full-time jobs ensuring it happens! Even the chatbot text showing up on your screen happens because some human wrote some code to pick it out of the document.

                                                                                                                                                      Suppose someone uses a SimCity-esque program to generate new bus schedules. Does eventual real-world influence prove it's more than "just a traffic simulator"? Nope.

                                                                                                                                              • bytepoet 11 hours ago

                                                                                                                                                    The inability of LLMs to ask for clarification was exactly the flaw we encountered when testing them on open-ended problems stated somewhat ambiguously. This was in the context of paradoxical situations, tested on DeepSeek-R1 and Claude-3.7-Sonnet. Blog post about our experiments: https://pankajpansari.github.io/posts/paradoxes/

                                                                                                                                                • voidspark 13 hours ago

                                                                                                                                                  > inability to self-reflect and recognize they have to ask for more details because their priors are too low.

                                                                                                                                                  Gemini 2.5 Pro and ChatGPT-o3 have often asked me to provide additional details before doing a requested task. Gemini sometimes comes up with multiple options and requests my input before doing the task.

                                                                                                                                                  • Workaccount2 3 hours ago

                                                                                                                                                    Gemini is also the first model I have seen call me out in its thinking. Stuff like "The user suggested we take approach ABC, but I don't think the user fully understands ABC, I will suggest XYZ as an alternative since it would be a better fit"

                                                                                                                                                    • rrr_oh_man 12 hours ago

                                                                                                                                                      That's a recent development for (imho) higher engagement and reduced compute.

                                                                                                                                                      • voidspark 12 hours ago

                                                                                                                                                        It's for higher quality of output. Better solutions. These are the state of the art reasoning models (subscription only, no free access) which are smarter.

                                                                                                                                                        It also mainly happens when the context is clear that we are collaborating on work that will require multiple iterations of review and feedback, like drafting chapters of a handbook.

                                                                                                                                                        I have seen ChatGPT ask questions immediately upfront when it relates to medical issues.

                                                                                                                                                        • bandrami 8 hours ago

                                                                                                                                                          Close. Higher engagement means the user is more invested and values the solution more.

                                                                                                                                                          The users are being engineered more than the models are, and this isn't the only example.

                                                                                                                                                          • voidspark 8 hours ago

                                                                                                                                                            Are you employed at Google or OpenAI? Are you working on these frontier models?

                                                                                                                                                            In the case of medical questions it needs to know further details to provide a relevant diagnosis. That is how it was trained.

                                                                                                                                                            In other cases you can observe its reasoning process to see why it would decide to request further details.

                                                                                                                                                            I have never seen an LLM just ask questions for the sake of asking. It is always relevant in the context. I don't use them casually. Just wrote a couple of handbooks (~100 pages in a few days). Generating tens of thousands of tokens per session with Gemini.

                                                                                                                                                            • rrr_oh_man 6 hours ago

                                                                                                                                                              typical patterns to look out for:

                                                                                                                                                              - "Should I now give you the complete [result], fulfilling [all your demands]?"

                                                                                                                                                              - "Just say [go] and I will do it"

                                                                                                                                                              - "Do you want either [A, B, or C]"

                                                                                                                                                              - "In [5-15] minutes I will give you the complete result"

                                                                                                                                                              ...

                                                                                                                                                              • voidspark 6 hours ago

                                                                                                                                                                > "Do you want either [A, B, or C]"

                                                                                                                                                                That's an example of what I'm talking about. Watch the reasoning process produce multiple options. That's what it is trained to do. That is problem solving, not "engagement". It requires more compute, not less. You see that more with the expensive models.

                                                                                                                                                                > "In [5-15] minutes I will give you the complete result"

                                                                                                                                                                I haven't seen that before and I don't see how it's relevant.

                                                                                                                                                                • rrr_oh_man 5 hours ago

                                                                                                                                                                  > That's an example of what I'm talking about. Watch the reasoning process produce multiple options. That's what it is trained to do. That is problem solving, not "engagement". It requires more compute, not less. You see that more with the expensive models.

                                                                                                                                                                  Fair point. Thanks for standing your ground and arguing so matter-of-factly with me! Appreciate it.

                                                                                                                                                    • btbuildem 4 hours ago

                                                                                                                                                      > This, of course, has certain implications as to the wisdom of the idea of “replacing human programmers”

                                                                                                                                                      Ironically, working with a junior dev is a lot like this -- setting them on a task, then coming back later with dogs and flashlights to retrieve them from the deep woods they've inevitably lost themselves in by just forging ahead, making assumptions, and asking no questions.

                                                                                                                                                      • veunes 11 hours ago

                                                                                                                                                        Real programmers spend a ton of time just figuring out what people actually want. LLMs still treat guessing as a feature

                                                                                                                                                      • bobsyourbuncle 13 hours ago

                                                                                                                                                        Isn’t this relatively trivial to correct? Just as chain-of-thought reasoning replaces end tokens with “hmm” to continue the thought, can’t users just replace the LLM's tokens, whenever it starts saying “maybe they are referring to”, with something like “Let me ask a clarifying question before I proceed”?
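                                                                                                                                                        A crude sketch of that intervention, done at the API level rather than inside the decoder (the hedge phrases, model name, and forced prefix are all placeholders):

                                                                                                                                                            from openai import OpenAI

                                                                                                                                                            client = OpenAI()
                                                                                                                                                            HEDGES = ("maybe they are referring to", "perhaps you mean", "assuming you mean")

                                                                                                                                                            def answer_or_clarify(user_msg: str) -> str:
                                                                                                                                                                draft = client.chat.completions.create(
                                                                                                                                                                    model="gpt-4.1",  # placeholder
                                                                                                                                                                    messages=[{"role": "user", "content": user_msg}],
                                                                                                                                                                ).choices[0].message.content

                                                                                                                                                                if not any(h in draft.lower() for h in HEDGES):
                                                                                                                                                                    return draft

                                                                                                                                                                # Redo the turn, nudging the model to ask instead of guess. This only
                                                                                                                                                                # approximates true token-level forcing, which hosted APIs don't expose.
                                                                                                                                                                return client.chat.completions.create(
                                                                                                                                                                    model="gpt-4.1",
                                                                                                                                                                    messages=[
                                                                                                                                                                        {"role": "user", "content": user_msg},
                                                                                                                                                                        {"role": "assistant", "content": "Let me ask a clarifying question before I proceed:"},
                                                                                                                                                                    ],
                                                                                                                                                                ).choices[0].message.content

                                                                                                                                                            print(answer_or_clarify("Make the function better."))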

                                                                                                                                                        • Sharlin 12 hours ago

                                                                                                                                                          Indeed, I was just about to edit my comment because the same occurred to me. Someone is probably going to try just that soon enough.

                                                                                                                                                        • petesergeant 12 hours ago

                                                                                                                                                          > and the inability to self-reflect and recognize they have to ask for more details

                                                                                                                                                          They're great at both tasks, you just have to ask them to do it.

                                                                                                                                                          • roywiggins 12 hours ago

                                                                                                                                                            You can certainly convince them to ask for details, but I'm not sure whether that makes them any good at knowing when exactly to ask vs just asking some percentage of the time regardless.

                                                                                                                                                            That is, does it actually know when it doesn't know, or are you just making it less confident overall, so it asks questions with no actual insight? Convincing a model to roleplay as someone who doesn't know things vs teaching a model to have insight into when it does and doesn't need clarification seems like a tough one.
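                                                                                                                                                        One cheap way to probe that distinction (a sketch; prompts, model, and trial count are arbitrary): give the same "ask if unsure" instruction an ambiguous and an unambiguous request, and compare how often it asks.

                                                                                                                                                            from openai import OpenAI

                                                                                                                                                            client = OpenAI()
                                                                                                                                                            SYSTEM = "If the request is ambiguous, ask a clarifying question instead of answering."
                                                                                                                                                            CASES = {
                                                                                                                                                                "ambiguous": "Make it faster.",  # no referent at all
                                                                                                                                                                "clear": "Sort the list [3, 1, 2] ascending and show the result.",
                                                                                                                                                            }

                                                                                                                                                            for label, prompt in CASES.items():
                                                                                                                                                                asked = 0
                                                                                                                                                                for _ in range(20):
                                                                                                                                                                    reply = client.chat.completions.create(
                                                                                                                                                                        model="gpt-4.1",  # placeholder
                                                                                                                                                                        temperature=1.0,
                                                                                                                                                                        messages=[{"role": "system", "content": SYSTEM},
                                                                                                                                                                                  {"role": "user", "content": prompt}],
                                                                                                                                                                    ).choices[0].message.content
                                                                                                                                                                    asked += "?" in reply  # very rough proxy for "asked a question"
                                                                                                                                                                print(f"{label}: asked for clarification in {asked}/20 runs")

                                                                                                                                                        Similar rates on both would suggest you've only lowered its confidence across the board; a large gap is at least weak evidence of real discrimination.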

                                                                                                                                                        • airylizard 12 hours ago

                                                                                                                                                        This is why I came up with TSCE (Two-Step Contextual Enrichment).

                                                                                                                                                        +30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.

                                                                                                                                                        Free open framework; check the repo and try it yourself:

                                                                                                                                                          https://github.com/AutomationOptimization/tsce_demo

                                                                                                                                                          I tested this another 300 times with gpt-4.1 to remove those obtrusive "em-dashes" everyone hates. Tested a single-pass baseline vs TSCE, same exact instructions and prompt "Remove the em-dashes from my linkedin post. . .".

                                                                                                                                                          Out of the 300 tests, baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove the em-dashes 18/300 times.

                                                                                                                                                          It works, all the data as well as the entire script used for testing is in the repo.

                                                                                                                                                          • arnaudsm 4 hours ago

                                                                                                                                                            That's a lot of kilowatt-hours wasted on a find-and-replace operation.

                                                                                                                                                            Have you heard of text.replace("—", "-") ?

                                                                                                                                                            • airylizard 2 hours ago

                                                                                                                                                              The test isn't for how well an LLM can find or replace a string. It's for how well it can carry out given instructions... Is that not obvious?

                                                                                                                                                            • thegeomaster 6 hours ago

                                                                                                                                                              I slightly tweaked your baseline em dash example and got 100% success rate with GPT-4.1 without any additional calls, token spend, or technobabble.

                                                                                                                                                              System prompt: "Remove every em-dash (—) from the following text while leaving other characters unchanged.\n\nReturn only the cleaned text."

                                                                                                                                                              User prompt: <prompt from tsce_chat.py filled with em dashes>

                                                                                                                                                              Temperature: 0.0
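                                                                                                                                                              For reference, that baseline as a runnable call, assuming the OpenAI Python SDK; the user text below is only a stand-in for the em-dash-laden prompt taken from tsce_chat.py.

                                                                                                                                                                  from openai import OpenAI

                                                                                                                                                                  client = OpenAI()

                                                                                                                                                                  resp = client.chat.completions.create(
                                                                                                                                                                      model="gpt-4.1-2025-04-14",  # pin the snapshot so the benchmark stays stable
                                                                                                                                                                      temperature=0.0,
                                                                                                                                                                      messages=[
                                                                                                                                                                          {"role": "system", "content": (
                                                                                                                                                                              "Remove every em-dash (—) from the following text while leaving "
                                                                                                                                                                              "other characters unchanged.\n\nReturn only the cleaned text."
                                                                                                                                                                          )},
                                                                                                                                                                          {"role": "user", "content": "Big news — we shipped — and it's fast."},  # stand-in text
                                                                                                                                                                      ],
                                                                                                                                                                  )
                                                                                                                                                                  cleaned = resp.choices[0].message.content
                                                                                                                                                                  assert "—" not in cleaned, "model left an em-dash in place"
                                                                                                                                                                  print(cleaned)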

                                                                                                                                                              • airylizard an hour ago

                                                                                                                                                                Hey, thanks for kicking the tires! The run you’re describing was done in mid-April, right after GPT-4.1 went live. Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.

                                                                                                                                                                If you reran today you’d see the same improved pass rate I’m getting now. That’s the downside of benchmarking against latest model names; behaviour changes quietly unless you pin to a dated snapshot.

                                                                                                                                                                For bigger, noisier prompts (or on GPT-3.5-turbo, which hasn’t changed) TSCE still gives a solid uplift, so the framework’s value stands. Appreciate you checking it out!

                                                                                                                                                                • thegeomaster an hour ago

                                                                                                                                                                  > Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.

                                                                                                                                                                  I don't know where you are getting this information from... The only snapshot of gpt-4.1 is gpt-4.1-2025-04-14 (mid-April), and the gpt-4.1 alias still points to it [1].

                                                                                                                                                                  Just to be sure, I re-ran my test specifying that particular snapshot and am still getting a 100% pass rate.

                                                                                                                                                                  [1]: https://platform.openai.com/docs/models/gpt-4.1

                                                                                                                                                            • t-kalinowski 6 hours ago

                                                                                                                                                              This was the main reason I wrote promptdown. I want to be able to edit the full chat history every turn, and the append-only standard chat interfaces don't make that easy.

                                                                                                                                                              https://github.com/t-kalinowski/promptdown
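                                                                                                                                                              Not promptdown's actual API (see the repo for that), but the underlying pattern is simply that the history is plain data you can edit before every call, rather than an append-only log:

                                                                                                                                                                  from openai import OpenAI

                                                                                                                                                                  client = OpenAI()
                                                                                                                                                                  history = [
                                                                                                                                                                      {"role": "system", "content": "You are a terse coding assistant."},
                                                                                                                                                                      {"role": "user", "content": "Draft a regex for ISO dates."},
                                                                                                                                                                      {"role": "assistant", "content": r"\d{4}-\d{2}-\d{2}"},
                                                                                                                                                                  ]

                                                                                                                                                                  # Edit an earlier turn in place instead of appending a correction.
                                                                                                                                                                  history[1]["content"] = "Draft a regex for ISO dates, anchored to the full string."

                                                                                                                                                                  history.append({"role": "user", "content": "Now add an optional time component."})
                                                                                                                                                                  reply = client.chat.completions.create(model="gpt-4.1", messages=history)  # model is a placeholder
                                                                                                                                                                  history.append({"role": "assistant", "content": reply.choices[0].message.content})
                                                                                                                                                                  print(history[-1]["content"])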

                                                                                                                                                              • Zobat 2 hours ago

                                                                                                                                                                This must mean that LLMs really are like genies in bottles. You get three questions answered, anything after that will be nonsense.

                                                                                                                                                                • podgorniy 9 hours ago

                                                                                                                                                              There is a noticeable issue when one builds LLM interfaces around single-turn conversations: most people expect linear conversations.

                                                                                                                                                              I've built a Telegram bot, http://t.me/experai_bot, as a universal UI to LLMs (with somewhat reduced functionality), built exactly around the idea that "a non-reply message means a new conversation". Wanna keep context? Keep replying to the bot's replies. Non-power users struggle with this idea.

                                                                                                                                                                  --

                                                                                                                                                              Also, I observed that OpenAI models performed worse when replying to the same questions (for example, the list of options in the reply got shorter), even with the smallest system message. That was the case with 3.5 and 4o; I don't know how modern ones behave. That made me decide not to include any system messages by default. Still, I give the option to add one if you need it. You can even toggle them to mix and match.
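                                                                                                                                                              Here is the routing idea sketched independently of any Telegram library; the Message type and the storage dict are hypothetical stand-ins for whatever the bot framework provides.

                                                                                                                                                                  from dataclasses import dataclass

                                                                                                                                                                  @dataclass
                                                                                                                                                                  class Message:
                                                                                                                                                                      id: int
                                                                                                                                                                      text: str
                                                                                                                                                                      reply_to: int | None  # id of the bot message the user replied to, if any

                                                                                                                                                                  threads: dict[int, list[dict]] = {}  # bot message id -> chat history

                                                                                                                                                                  def history_for(msg: Message) -> list[dict]:
                                                                                                                                                                      if msg.reply_to is not None and msg.reply_to in threads:
                                                                                                                                                                          base = threads[msg.reply_to]  # replying to the bot: continue that conversation
                                                                                                                                                                      else:
                                                                                                                                                                          base = []  # non-reply message: start a clean context
                                                                                                                                                                      return base + [{"role": "user", "content": msg.text}]

                                                                                                                                                                  def remember(bot_reply_id: int, history: list[dict], answer: str) -> None:
                                                                                                                                                                      # Key the updated history by the bot's reply, so replying to it continues the thread.
                                                                                                                                                                      threads[bot_reply_id] = history + [{"role": "assistant", "content": answer}]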

                                                                                                                                                                  • jumploops 11 hours ago

                                                                                                                                                                    It's amazing that branching/forking isn't a core aspect of the main chat tools.

                                                                                                                                                                    You can edit responses, sure, but then a bunch of other context is lost.

                                                                                                                                                                    My flow is basically:

                                                                                                                                                                    1. plan

                                                                                                                                                                    2. build

                                                                                                                                                                    3. branch (into some feature/esoteric dependency issue)

                                                                                                                                                                    4. goto #2

                                                                                                                                                                    Prompt pruning/branching should be a first-class tool for any LLM usage.
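                                                                                                                                                              Lacking first-class support, the poor man's version is to treat the conversation as plain data and fork a prefix by hand; a minimal sketch (all content made up):

                                                                                                                                                                  from copy import deepcopy

                                                                                                                                                                  main = [
                                                                                                                                                                      {"role": "system", "content": "You are a build engineer."},
                                                                                                                                                                      {"role": "user", "content": "Plan the migration to pnpm."},
                                                                                                                                                                      {"role": "assistant", "content": "1) audit deps 2) switch lockfile 3) fix CI"},
                                                                                                                                                                  ]

                                                                                                                                                                  # Step 3 above: branch off to chase an esoteric dependency issue
                                                                                                                                                                  # without polluting the main thread.
                                                                                                                                                                  branch = deepcopy(main)
                                                                                                                                                                  branch.append({"role": "user", "content": "CI fails on node-gyp - debug just this."})

                                                                                                                                                                  # ...iterate on `branch` with the model, then return to `main` (goto #2)
                                                                                                                                                                  # carrying only the conclusion, not the whole detour:
                                                                                                                                                                  main.append({"role": "user", "content": "Note: the node-gyp failure was fixed by pinning Python 3.11."})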

                                                                                                                                                                    • jampekka 8 hours ago

                                                                                                                                                              Google AI Studio at least has this. I found that implementation quite confusing, though, which may be a reason it's not implemented in more "consumer oriented" tools.

                                                                                                                                                                      • Capricorn2481 10 hours ago

                                                                                                                                                                        I've been kicking around making this for a while. BetterChatGPT at least has some good ergonomics around deleting history. But I agree that branching is the next step.

                                                                                                                                                                      • zacksiri 14 hours ago

                                                                                                                                                              I've been working on solving this with quite a bit of success, and I'll be sharing more on this soon. It involves two systems: the first is the LLM itself, and the second acts as a 'curator' of thoughts, you could say.

                                                                                                                                                              It dynamically swaps portions of the context in and out. The system is also not based on explicit definitions; it relies on LLMs 'filling the gaps'. It helps the LLM break problems down into small tasks which then eventually aggregate into the full task.

                                                                                                                                                                        • simianwords 10 hours ago

                                                                                                                                                                          This is a great idea. What you are doing is a RAG over the chat.

                                                                                                                                                                          In the future such a distinction in memory hierarchies will be more clear

                                                                                                                                                                          - Primary memory in the training data

                                                                                                                                                                          - Secondary memory in context

                                                                                                                                                                          - Tertiary memory in RAG
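                                                                                                                                                              A sketch of that "RAG over the chat" layer, assuming the OpenAI embeddings endpoint; top_k and cosine similarity are arbitrary choices.

                                                                                                                                                                  from openai import OpenAI

                                                                                                                                                                  client = OpenAI()
                                                                                                                                                                  archive: list[dict] = []  # tertiary memory: old turns plus their embeddings

                                                                                                                                                                  def embed(text: str) -> list[float]:
                                                                                                                                                                      return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

                                                                                                                                                                  def cosine(a: list[float], b: list[float]) -> float:
                                                                                                                                                                      dot = sum(x * y for x, y in zip(a, b))
                                                                                                                                                                      return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

                                                                                                                                                                  def remember(role: str, content: str) -> None:
                                                                                                                                                                      archive.append({"role": role, "content": content, "vec": embed(content)})

                                                                                                                                                                  def recall(query: str, top_k: int = 3) -> list[dict]:
                                                                                                                                                                      # Pull only the most relevant past turns back into the live context.
                                                                                                                                                                      q = embed(query)
                                                                                                                                                                      best = sorted(archive, key=lambda t: cosine(q, t["vec"]), reverse=True)[:top_k]
                                                                                                                                                                      return [{"role": t["role"], "content": t["content"]} for t in best]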

                                                                                                                                                                          • cadamsdotcom 13 hours ago

                                                                                                                                                                            Sounds like an exciting idea.

                                                                                                                                                                            May I suggest - put what you have out there in the world, even if it’s barely more than a couple of prompts. If people see it and improve on it, and it’s a good idea, it’ll get picked up & worked on by others - might even take on a life of its own!

                                                                                                                                                                            • zacksiri 13 hours ago

                                                                                                                                                                              Have a look here, it's an early preview

                                                                                                                                                                              https://x.com/zacksiri/status/1922500206127349958

                                                                                                                                                                  You can see it going from the introduction, to asking me for my name, to being able to answer questions about some topic. There is also another example in the thread you can see.

                                                                                                                                                                              Behind the scenes, the system prompt is being modified dynamically based on the user's request.

                                                                                                                                                                              All the information about movies is also being loaded into context dynamically. I'm also working on some technique to unload stuff from context when the subject matter of a given thread has changed dramatically. Imagine having a long thread of conversation with your friend, and along the way you 'context switch' multiple times as time progresses, you probably don't even remember what you said to your friend 4 years ago.

                                                                                                                                                                              There is a concept of 'main thread' and 'sub threads' involved as well that I'm exploring.

                                                                                                                                                                              I will be releasing the code base in the coming months. I need to take this demo further than just a few prompt replies.

                                                                                                                                                                            • adrianm 12 hours ago

                                                                                                                                                                              This is a class of mental critic from the Emotion Machine.

                                                                                                                                                                              • adiadd 14 hours ago

                                                                                                                                                                                would be great to get more info on what you’re building - seems interesting!

                                                                                                                                                                                • zacksiri 13 hours ago

                                                                                                                                                                                  I publish my findings on my youtube channel and my blog, you're welcome to have a look. Both links are in my profile.

                                                                                                                                                                                • layer8 11 hours ago

                                                                                                                                                                                  So, Map-Reduce-of-Thought?

                                                                                                                                                                                  • zacksiri 10 hours ago

                                                                                                                                                                                    You could say that! hahaha! I'm happy to see someone understand what it is.

                                                                                                                                                                                • aleksituk 5 hours ago

                                                                                                                                                                                  This is very interesting and I like the conversation about not only the technology itself, but also about the importance of thinking about the interface as a user experience and where / how it fits the paradigm.

                                                                                                                                                                                  We've been working on a lot of data processing and generation tasks. We've been doing this using an API primarily, but sometimes I end up testing creating data in a chat window and I first chat through what the requirements are for the data analysis / processing and then once I'm done I would like the whole conversation to be then summarised into basically a one-prompt process so that I can re-use it (because I can't really process new inputs via the chat).

                                                                                                                                                                  Even when you do manage to get it down to a single prompt that you can reuse in a chat to keep producing new data (imagine a blog post in a certain style, where the base content is given as input and I'm making about 20 of them), producing these in the chat has notable benefits: if something is wrong with the blog post the chat suggests, you can immediately edit it. The trouble is that the context window starts becoming so big that the chat starts to forget what the original instruction is, and eventually you do have to just create a new chat.

                                                                                                                                                                  One way to solve this is a chat with selective memory, where you keep the task in memory but have the chat forget (not include) all the generated data in the context so that it stays clean, only bringing it back into the context if the user refers to it.

                                                                                                                                                                                  Has anyone else done data processing types of tasks in chats and had issues like this? Are there some other tools to use or tricks to do in chats?
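                                                                                                                                                                  For what it's worth, a sketch of that selective-memory idea: the task stays pinned, bulky generated outputs live outside the context, and one is re-inserted only when the user refers to it (the reference-detection heuristic here is a crude placeholder).

                                                                                                                                                                      import re

                                                                                                                                                                      pinned = [{"role": "system", "content": "Write blog posts in the agreed style and format."}]  # the task
                                                                                                                                                                      outputs: dict[int, str] = {}  # post number -> generated text, kept out of the live context
                                                                                                                                                                      recent: list[dict] = []       # small rolling window of conversational turns

                                                                                                                                                                      def build_context(user_msg: str) -> list[dict]:
                                                                                                                                                                          context = list(pinned)
                                                                                                                                                                          m = re.search(r"post\s*#?(\d+)", user_msg, re.IGNORECASE)
                                                                                                                                                                          if m and int(m.group(1)) in outputs:  # the user referred to an earlier output
                                                                                                                                                                              context.append({"role": "assistant", "content": outputs[int(m.group(1))]})
                                                                                                                                                                          context += recent[-6:]  # keep the context clean otherwise
                                                                                                                                                                          context.append({"role": "user", "content": user_msg})
                                                                                                                                                                          return context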

                                                                                                                                                                                  • permo-w 14 hours ago

                                                                                                                                                                                    I feel like at this point the LLM space is just filled with people solving and resolving the same problems over and over

                                                                                                                                                                                    • kristianp 10 hours ago

                                                                                                                                                                  Just like the LLMs in multi-turn conversations.

                                                                                                                                                                                      • dankwizard 14 hours ago

                                                                                                                                                                                        And everyone loves to chime in with their own excellence in prompt engineering

                                                                                                                                                                                        • meroes 12 hours ago

                                                                                                                                                                                          It’s herding cats, not “learning”, which is a fine situation for some parts of workflows.

                                                                                                                                                                                        • dr_dshiv 12 hours ago

                                                                                                                                                                                          This is the best paper on machine psychology [1] I’ve yet seen. Rigorous, empirical, insightful — and very practical.

                                                                                                                                                                                          [1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstrac...

                                                                                                                                                                                          • ranyume 14 hours ago

                                                                                                                                                                                            I'd like more research done on context understanding other than NIAH. I don't believe LLMs support the context length companies say they support. But I need to know this to effectively use the tools. At least for coding.

                                                                                                                                                                                            Stuff like this:

                                                                                                                                                                                            1. Do: Best practice for X model is to include at max 10k lines of code + task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).

                                                                                                                                                                                            2. Don't: Start a project without a clearly defined architecture in this format. Don't ask for tasks that require X amount of reading hops to understand the logic.

                                                                                                                                                                                            I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's more ironic that some people think of these AIs as employees. Employees can work with their boss about the best way to achieve things! With LLMs you don't even know how to communicate with them and as a result their output is unreliable.
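                                                                                                                                                                  In the absence of vendor guidance you end up encoding your own heuristics; a trivial guard along the lines of point 1 above (the line limit is the number from that point, the token budget is self-imposed, and the tiktoken encoding is an assumption, nothing official):

                                                                                                                                                                      import tiktoken

                                                                                                                                                                      MAX_CODE_LINES = 10_000       # the heuristic from point 1 above
                                                                                                                                                                      MAX_PROMPT_TOKENS = 100_000   # self-imposed, well under most advertised windows
                                                                                                                                                                      enc = tiktoken.get_encoding("o200k_base")

                                                                                                                                                                      def build_prompt(code: str, task: str, conventions: str, architecture: str) -> str:
                                                                                                                                                                          if code.count("\n") + 1 > MAX_CODE_LINES:
                                                                                                                                                                              raise ValueError("too much code for one request; split the module first")
                                                                                                                                                                          prompt = "\n\n".join([conventions, architecture, code, task])
                                                                                                                                                                          if len(enc.encode(prompt)) > MAX_PROMPT_TOKENS:
                                                                                                                                                                              raise ValueError("prompt exceeds the self-imposed token budget")
                                                                                                                                                                          return prompt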

                                                                                                                                                                                            • skydhash 12 hours ago

                                                                                                                                                                                              You could swap those recommendations for programming without LLMs. Open any software engineering books and you’ll see a lot of good recommendations for building software.

                                                                                                                                                                                            • badmonster 10 hours ago

                                                                                                                                                                                              Why do LLMs struggle so much with recovering from early wrong turns in multi-turn conversations — even when all prior context is available and tokenized?

                                                                                                                                                                                              Is it due to the model's training distribution (mostly single-shot completions), the way context windows are encoded, or an architectural bottleneck?

                                                                                                                                                                                              Feels like there's no dynamic internal state that evolves over the conversation — only a repeated re-parsing of static history. Has anyone seen work on integrating memory/state mechanisms that allow belief revision within a session, not just regurgitation of past tokens?

                                                                                                                                                                                              • JohnKemeny 9 hours ago

                                                                                                                                                                                                We shouldn’t anthropomorphize LLMs—they don’t “struggle.” A better framing is: why is the most likely next token, given the prior context, one that reinforces the earlier wrong turn?

                                                                                                                                                                                                • mountainriver 2 hours ago

                                                                                                                                                                                                  It’s a problem specific to autoregressive LLMs, the early tokens bias the output

                                                                                                                                                                                                  • bandrami 8 hours ago

                                                                                                                                                                                                    Because Markov chains propagate forward in time

                                                                                                                                                                                                    • vjerancrnjak 7 hours ago

                                                                                                                                                                                                      Imagine optimizing/training on a happy path.

                                                                                                                                                                                                      When you generate future tokens, you're looking at history tokens that are happy.

                                                                                                                                                                                                      So how can a model, given sad tokens, generate future happy tokens if it did not learn to do so?

                                                                                                                                                                  The work you're looking for is already here: it's "thinking". I assume they include sad tokens in the dataset and have the model produce "thinking", which should result in happy tokens coming after the thinking tokens. If the thinking is bad (judged by the happy tokens that follow), it's punished; if it's good, it's reinforced via the descent.

                                                                                                                                                                                                    • jsemrau 11 hours ago

                                                                                                                                                                                                      That's no surprise. When I was working on game theory and agent reasoning I reached the same conclusion a year ago.

                                                                                                                                                                                                      My conclusion was that context needs to be managed well for LLMs to maintain accuracy in their replies. It also helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process (see the sketch at the end of this comment).

                                                                                                                                                                                                      This also raises the discussion of general-purpose versus workflow-specific agent implementations, since in the former it is much more difficult to generalize all the components needed to structure effective ReAct patterns.
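
    A rough sketch of that plan-before-execute loop, assuming only a generic chat API; call_llm, run_with_plan, and the prompt wording are hypothetical illustrations, not any particular framework:

        # Sketch: force an explicit planning turn, then keep the plan pinned in
        # context for the execution turn, so it guardrails later reasoning.
        # call_llm(messages) -> str is a hypothetical wrapper around any chat API.

        def call_llm(messages):
            raise NotImplementedError("wire up your chat API here")

        def run_with_plan(task: str) -> str:
            # 1. Planning turn: ask only for a numbered plan, no solution yet.
            plan = call_llm([
                {"role": "system", "content": "You are a careful planner."},
                {"role": "user", "content": f"Task: {task}\nWrite a short numbered plan. Do not solve yet."},
            ])
            # 2. Execution turn: the plan stays in context as a guardrail.
            return call_llm([
                {"role": "system", "content": "Follow the agreed plan step by step."},
                {"role": "user", "content": f"Task: {task}\nPlan:\n{plan}\nNow execute the plan."},
            ])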

                                                                                                                                                                                                      • veunes 11 hours ago

                                                                                                                                                                                                        It's probably why workflow agents feel more reliable: they're built around structure, not just raw prediction

                                                                                                                                                                                                        • jsemrau 10 hours ago

                                                                                                                                                                                                          Also, you have more control points. It's not just a brain in a vat.

                                                                                                                                                                                                      • SamPatt 11 hours ago

                                                                                                                                                                                                        I always felt the derision around the term "prompt engineering" was partially due to people overestimating the importance of the initial prompt and underestimating the importance of managing the ongoing context.

                                                                                                                                                                                                        You develop a knack, through experience, for how to steer the models and when to start a new conversation. The system prompt and initial prompt are important, but nothing will save you if you naively keep a conversation going too long.
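
    As a hedged illustration of what managing the ongoing context can look like in practice (rather than hoping one long thread stays healthy): periodically distill the conversation into a short brief and restart with only that brief. call_llm and the turn threshold below are made-up placeholders:

        # Sketch: compress a long conversation into a brief and restart with a
        # clean context, instead of letting stale turns pile up indefinitely.
        # call_llm(messages) -> str is a hypothetical wrapper around any chat API.

        MAX_TURNS = 20  # arbitrary threshold; tune for your model and context window

        def call_llm(messages):
            raise NotImplementedError("wire up your chat API here")

        def distill(history):
            transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
            return call_llm([{
                "role": "user",
                "content": "Summarize the key facts, decisions, and open questions "
                           "from this conversation in under 200 words:\n" + transcript,
            }])

        def maybe_reset(history, system_prompt: str):
            if len(history) <= MAX_TURNS:
                return history
            # Fresh context: system prompt plus the distilled state only.
            return [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Context so far (distilled): " + distill(history)},
            ]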

                                                                                                                                                                                                        • veunes 11 hours ago

                                                                                                                                                                                                          Yeah, totally. Prompt engineering isn't just about crafting the perfect opener, it's more like conversation management. You start to develop a feel for when things are going off the rails and it's time to reset

                                                                                                                                                                                                        • debuggerson 7 hours ago

                                                                                                                                                                                                          The more we chat, the more irrelevant details pile up. For example, a small mention early on might get repeated or build on itself, leading to a lot of unnecessary context. As the conversation continues, it becomes harder for the model to focus on the main point because it gets tangled in all the extra information. Unlike humans, who can intuitively filter out the noise, LLMs struggle to keep track of what’s truly important in longer, more complex exchanges.

                                                                                                                                                                                                          • Workaccount2 3 hours ago

                                                                                                                                                                                                            Reminds me of Claude plays pokemon, where it would note something insignificant, and then fixate on it for hours.

                                                                                                                                                                                                            • overflow897 6 hours ago

                                                                                                                                                                                                              I believe we're already using LLMs to evaluate LLM output for training; I wonder if there's some variation of that which could be used to identify when one LLM gets "stuck".

                                                                                                                                                                                                              I guess chain of thought should in theory do that, but running variations of the prompt and context might behave differently?
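
    A rough sketch of that idea, with an invented judge prompt and threshold: a second call with a clean context decides whether the working conversation looks stuck.

        # Sketch: ask a "judge" model (fresh context) whether the working
        # conversation is looping or fixated, and flag it for a reset.
        # call_llm(messages) -> str is a hypothetical wrapper around any chat API.

        def call_llm(messages):
            raise NotImplementedError("wire up your chat API here")

        def looks_stuck(history, last_n: int = 6) -> bool:
            recent = "\n".join(f"{m['role']}: {m['content']}" for m in history[-last_n:])
            verdict = call_llm([{
                "role": "user",
                "content": "Here are the last few turns of an assistant conversation:\n"
                           + recent + "\n\n"
                           "Is the assistant repeating itself, contradicting itself, or "
                           "fixated on an earlier wrong assumption? Answer STUCK or OK.",
            }])
            return "STUCK" in verdict.upper()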

                                                                                                                                                                                                              • veunes 11 hours ago

                                                                                                                                                                                                                Kind of wild how even the best models still struggle with keeping context straight over time. Definitely feels like a big challenge if we want these things to hold real conversations.

                                                                                                                                                                                                                • sky2224 9 hours ago

                                                                                                                                                                                                                  Ha, kind of funny to see this right now. I've been fighting copilot in vscode, trying to get it to output anything once I take the context down to a very specific problem. It feels like I have to reset and almost re-ground the model in what I'm trying to accomplish at a certain point.

                                                                                                                                                                                                                  • dontreact 14 hours ago

                                                                                                                                                                                                                    My take: multi-turn evals are hard because, to do them really correctly, you have to simulate a user. This is not yet modeled well enough for multi-turn to work as well as it could.
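
    A bare-bones sketch of what that simulation could look like: a second model role-plays the user and reveals the task piece by piece, and the resulting transcript gets scored however the eval defines success. The prompts, names, and sharding scheme are all invented for illustration:

        # Sketch: multi-turn eval loop where a second model role-plays the user,
        # drip-feeding information shards instead of stating everything up front.
        # call_llm(messages) -> str is a hypothetical wrapper around any chat API.

        def call_llm(messages):
            raise NotImplementedError("wire up your chat API here")

        def simulate_dialogue(goal: str, shards: list[str], max_turns: int = 8):
            assistant_history = [{"role": "system", "content": "Help the user with their task."}]
            transcript = []
            for turn in range(max_turns):
                shard = shards[min(turn, len(shards) - 1)]
                user_msg = call_llm([{
                    "role": "user",
                    "content": f"Role-play a user whose underlying goal is:\n{goal}\n"
                               f"New information you may reveal this turn: {shard}\n"
                               f"Conversation so far: {transcript}\n"
                               "Write the user's next short message.",
                }])
                assistant_history.append({"role": "user", "content": user_msg})
                reply = call_llm(assistant_history)
                assistant_history.append({"role": "assistant", "content": reply})
                transcript.append((user_msg, reply))
            return transcript  # score this however the eval defines success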

                                                                                                                                                                                                                    • giordanol 4 hours ago

                                                                                                                                                                                                                      Would love to see metrics that isolate recovery behaviour (if any)

                                                                                                                                                                                                                      • WhitneyLand 5 hours ago

                                                                                                                                                                                                                        Any reason to not call bullshit on this paper?

                                                                                                                                                                                                                        One of the biggest developments in language models over the last year has been test-time reasoning (aka inference scaling or “thinking”). Most vendors tested offer such a model. It’s plausible it could make a huge difference here, and they did not bother to test it or even mention it?

                                                                                                                                                                                                                        Things like CoT and planning can really affect this, and those are just a couple of things that happen automatically in more advanced models.

                                                                                                                                                                                                                        It seems like it wouldn't have been hard to add this to the experiment; failing that, they could have called it out in a "Limitations" or "Future Work" section, or at least written a single sentence like "We did not test chain-of-thought prompting, which may mitigate some of these issues".
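
    For what it's worth, adding such a condition to an eval harness is only a few lines. Something along these lines, where run_episode is a stand-in for however the harness runs one sharded conversation and the prompts are purely illustrative:

        # Sketch: run every episode under two conditions (a plain system prompt
        # and a "think before answering" variant) and compare degradation.
        # run_episode(episode, system_prompt) is a hypothetical harness function.

        PLAIN = "Answer the user's requests."
        THINKING = ("Before answering, reason step by step about everything "
                    "stated so far in the conversation, then give your final answer.")

        def run_both_conditions(run_episode, episodes):
            return {
                name: [run_episode(ep, system_prompt=prompt) for ep in episodes]
                for name, prompt in [("plain", PLAIN), ("thinking", THINKING)]
            }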

                                                                                                                                                                                                                        • guardiang 11 hours ago

                                                                                                                                                                                                                          Exactly why expert steering should be valued.

                                                                                                                                                                                                                          • coderatlarge 14 hours ago

                                                                                                                                                                                                                            i’ve seen a local deepseek-coder get into an infinite loop generating the same line over and over, which i assume without evidence is some sort of feedback from the generated line back into the generation process. so it kind of gets lost in thought and goes off topic from the simple .h api that my prompt asked for.
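
    A small sketch of a cheap guard for that failure mode when running a model locally; stream_tokens is a hypothetical stand-in for whatever streaming interface the local runner exposes (most runners also have a repetition penalty worth trying first):

        # Sketch: abort generation once the same line has been emitted too many
        # times in a row; a cheap guard against the stuck-in-a-loop failure mode.
        # stream_tokens(prompt) is a hypothetical generator yielding text chunks.

        def stream_tokens(prompt):
            raise NotImplementedError("hook up your local runner's streaming API here")

        def generate_with_loop_guard(prompt: str, max_repeats: int = 3) -> str:
            out, buffer, last_line, repeats = [], "", None, 0
            for chunk in stream_tokens(prompt):
                out.append(chunk)
                buffer += chunk
                while "\n" in buffer:
                    line, buffer = buffer.split("\n", 1)
                    if line.strip() and line == last_line:
                        repeats += 1
                        if repeats >= max_repeats:
                            return "".join(out)  # bail out of the loop
                    else:
                        repeats = 0
                    last_line = line
            return "".join(out)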

                                                                                                                                                                                                                            • silisili 11 hours ago

                                                                                                                                                                                                                              Yes! Deepseek does this to me all the time.

                                                                                                                                                                                                                              I had 20-something files I wanted it to check and change something in. The first 5 or so it did, then on the sixth it rightly said everything was correct and moved on. It then said exactly that for the rest of the 20, the same text over and over.

                                                                                                                                                                                                                              I checked, and file 6 was the only correct one. It had, like, learned to just repeat itself after that and did nothing.

                                                                                                                                                                                                                              • TheOtherHobbes 9 hours ago

                                                                                                                                                                                                                                Claude does this too. It gets into a death spiral where it repeats the entire previous output instead of changing parts and moving on.

                                                                                                                                                                                                                            • tsunamifury 13 hours ago

                                                                                                                                                                                                                              Have you seen a bunch of humans in a room?

                                                                                                                                                                                                                              • xyzal 8 hours ago

                                                                                                                                                                                                                                I see lots of humans in an echo chamber

                                                                                                                                                                                                                              • alganet 14 hours ago

                                                                                                                                                                                                                                Humans also often get lost in multi-turn conversation.

                                                                                                                                                                                                                                I have experienced that in person many, many times. Jumps in context that seem easy for one person to follow, but very hard for others.

                                                                                                                                                                                                                                So, assuming the paper is legit (arxiv, you never know...), it's more like something that could be improved than a difference from human beings.

                                                                                                                                                                                                                                • morsecodist 14 hours ago

                                                                                                                                                                                                                                  Subjectively, the "getting lost" feels totally different from human conversations. Once there is something bad in the context, it seems almost impossible to get back on track. All subsequent responses get a lot worse and the model starts contradicting itself. It is possible that with more training this problem can be improved, but what is interesting to me isn't that it's worse than humans in this way, but that this sort of difficulty scales differently than it does in humans. I would love to see some more objective descriptions of these subjective notions.

                                                                                                                                                                                                                                  • alganet 13 hours ago

                                                                                                                                                                                                                                    Contradictions are normal. Humans make them all the time. They're even easy to induce, due to the simplistic nature of our communication (lots of ambiguities, semantic disputes, etc).

                                                                                                                                                                                                                                    I don't see how that's a problem.

                                                                                                                                                                                                                                    Subjectivity is part of human communication.

                                                                                                                                                                                                                                    • westurner 13 hours ago

                                                                                                                                                                                                                                      Algorithmic convergence and caching :: Consensus in conversational human communication

                                                                                                                                                                                                                                      • alganet 13 hours ago

                                                                                                                                                                                                                                        Any sufficiently large amount of information exchange could be interpreted as computational if you see it as separated parts. It doesn't mean that it is intrinsically computational.

                                                                                                                                                                                                                                        Seeing human interactions as computer-like is a side effect of our most recent shiny toy. In the last century, people saw everything as gears and pulleys. All of these perspectives are essentially the same reductionist thinking, recycled over and over again.

                                                                                                                                                                                                                                        We've seen men promising that they would build a gear-man, resurrect the dead with electricity, and all sorts of (now) crazy talk. People believed it for some time.

                                                                                                                                                                                                                                  • imtringued 7 hours ago

                                                                                                                                                                                                                                    What you're talking about has absolutely nothing to do with the paper. It's not about jumps in context; it's about LLMs being biased towards producing a complete answer on the first try, even when there isn't enough information yet. When you provide them with additional information, they stick with the original, wrong answer. This means you need to frontload all the information in the first prompt, and if the LLM messes up, you have to start from scratch. You can't do that with a human at all. There is no such thing as a "single-turn conversation" with humans; you can't reset a human to a past state.
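
    A small sketch of the workaround that falls out of that observation: when a sharded conversation derails, collapse everything the user has said into one frontloaded prompt and retry in a fresh context. call_llm and restart_frontloaded are hypothetical illustrations:

        # Sketch: gather all user-provided facts from a derailed conversation into
        # a single consolidated prompt and retry in a brand new context, instead of
        # trying to talk the model out of its earlier wrong answer.
        # call_llm(messages) -> str is a hypothetical wrapper around any chat API.

        def call_llm(messages):
            raise NotImplementedError("wire up your chat API here")

        def restart_frontloaded(history, system_prompt: str) -> str:
            facts = [m["content"] for m in history if m["role"] == "user"]
            consolidated = ("Here is everything relevant, provided all at once:\n"
                            + "\n".join(f"- {fact}" for fact in facts))
            return call_llm([
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": consolidated},
            ])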

                                                                                                                                                                                                                                    • alganet 3 hours ago

                                                                                                                                                                                                                                      I see, thanks for the correction.