• mtokarski a day ago

    Interesting work, but I think the interpretation may be a bit overstated. The authors claim that injecting too much factual "knowledge" during pretraining causes models to collapse — performance drops below the baseline once knowledge frequency crosses a threshold.

    The problem is how they inject it. Their “knowledge” isn’t natural language; it’s templated Wikidata triples like "X is the capital of Y." That’s a super low-entropy, highly repetitive distribution. When you cram enough of that into a fixed token budget, you’re not really teaching the model more facts — you’re just destroying linguistic diversity and skewing the token statistics.

    In real pretraining or domain adaptation scenarios, “knowledge” tends to appear in richer, more varied contexts. The practical takeaway isn’t "don’t add too much domain data," but rather "don’t overrepresent any single format or narrow syntactic pattern." The issue seems more about representation homogeneity than about factual density itself.
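
    To make the contrast concrete, here's a toy sketch (not the paper's actual pipeline; the templates are made up) of the same triple rendered with one fixed template versus a small paraphrase set:

      triple = ("Paris", "capital_of", "France")

      fixed = "{0} is the capital of {2}."
      varied = [
          "{0} is the capital of {2}.",
          "The capital of {2} is {0}.",
          "{0}, the capital of {2}, ...",
          "As the seat of government of {2}, {0} ...",
      ]

      # One template: every occurrence of the fact is the same token sequence.
      print(fixed.format(*triple))

      # Paraphrase set: same fact, many surface forms.
      for template in varied:
          print(template.format(*triple))

    With the fixed template, cranking up the fact count just repeats one narrow token pattern; with the paraphrase set, the same factual content arrives with far more linguistic diversity.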

    • magicalhippo a day ago

      I'm sure there's other work, but I came across this in the Physics of Language Models paper [1] on knowledge extraction.

      Essentially, they found that when the knowledge is presented in a single, fixed way, the model is trained to reproduce that exact sequence of tokens rather than "internalizing" the knowledge.

      By varying the sentences, the model instead manages to separate out the knowledge, so to speak. This in turn drastically improves how well they can extract that knowledge later.

      [1]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5250633

      • leobg 17 hours ago

        Of course. Because the unseen part here is that the model is being taught that every other representation of the same fact was wrong.

        Meaning, during training, if the model expresses the same fact in some other form, maybe even with just one extra comma, that response will be marked just as wrong as a really wrong one.

        In fact, the model may give an answer that’s better than the one in the training set - but it will still be punished for it and forced to change its weights because the answer doesn’t match token-for-token.

        We don’t have a loss function for meaning. We only have one for token matching. Anyone who is serious about curating datasets for fine-tuning needs to take this into account.
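
        To make that concrete, here's a toy sketch (not any particular trainer's loss; the probabilities are invented) showing that per-token negative log-likelihood rewards only exact token matching, so a correct paraphrase can score even worse than a factually wrong answer:

          import math

          reference  = "paris is the capital of france".split()
          paraphrase = "the capital of france is paris".split()
          wrong      = "lyon is the capital of france".split()

          def token_match_loss(candidate, target):
              # Pretend the model puts probability 0.9 on each token it would
              # emit and 0.01 on anything else; the loss is the average NLL of
              # the target tokens, position by position.
              nll = 0.0
              for cand_tok, tgt_tok in zip(candidate, target):
                  p = 0.9 if cand_tok == tgt_tok else 0.01
                  nll += -math.log(p)
              return nll / len(target)

          print(token_match_loss(reference, reference))   # ~0.11: exact match, low loss
          print(token_match_loss(paraphrase, reference))  # ~4.61: same meaning, worst loss
          print(token_match_loss(wrong, reference))       # ~0.86: wrong city, milder loss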

        • ijk a day ago

          That's consistent with other research I've seen, where varied presentation of the data is key to effective knowledge injection [1].

          My assumption, based on the research, is that training on different prompts but the same answer gives you more robust Q&A behavior; training on variations of how to express the same concept generalizes. Training on the same prompt with different answers gives you creative diversity [2].

          [1] https://arxiv.org/abs/2404.00213 [2] https://arxiv.org/abs/2503.17126

          • dotancohen a day ago

            It's the same for humans. This is the main argument against rote memorization.

          • agentcoops 19 hours ago

            Triples are fantastic for information retrieval, but I think if there's any takeaway from the unexpected success of LLMs it's that AI researchers historically undervalued language as such. Early symbolic approaches to AI retrospectively appear torn between reverence towards and hatred of language: on the one hand, a sensible skeptical doubt that language is within reach of software systems; on the other, a belief in the inadequacy of language in the unambiguous representation of knowledge. This paper just seems to confirm that, at least at the training level, the "problem of hallucinations" is not to be resolved by regression back to the various proposals to separate knowledge from its linguistic representation.

            Again, this isn't to demonize symbolic AI or to say the answer isn't in the fusion of LLMs with knowledge graphs etc, but I think we now at least know that language is certainly within reach of software and that linguistic representations of knowledge are information-dense in ways we didn't previously anticipate.

            • spankalee a day ago

              Doesn't this then support the claim that LLMs aren't building world models - where even linguistically simple factual statements should help expand and refine that model - and reinforce the idea that they are still just next-token predictors?

              • simsla a day ago

                There's no inductive bias for a world model in multiheaded attention. LLMs are incentivized to learn the most straightforward interpretation/representation of the data you present.

                If the data you present is low entropy, it'll memorize. You need to make the task sufficiently complex that memorization stops being the easiest solution.

                • andrewflnr a day ago

                  My read is that token prediction requires a more general model to predict more varied tokens, which makes it something closer to a world model. After all, in principle, there's a point where the optimal "token predictor" really is backed by a world model. (Now, is that model feasible to find? Unclear!)

                  • dotancohen a day ago

                    Not unlike humans. Don't believe me? Go ask somebody these questions in quick succession:

                      What colour is a tomato?
                      What colour is a ruby?
                      What colour are lips?
                      What colour is a strawberry?
                      What colour is blood?
                      What colour traffic light do you drive on?
                    • bryzaguy a day ago

                      What a cool demonstration. My automatic response was “red” for the traffic light, although a different part of my brain re-evaluated given the context. The question in my mind now: is the automatic response a building block for the latter, or is that orchestration a fully separate system?

                    • godelski 5 hours ago

                        > Doesn't this then support the claim that LLMs aren't building world models
                      
                      There's actually no strong evidence that LLMs, or any other AI systems, are building world models.

                      These systems are judged to have "world model" capabilities based on benchmarks, but benchmarks will never be able to tell you whether such a feat is taking place. The way people claim these systems have world models is by testing them for consistency, but a world model is counterfactual. The problem with benchmarks is that they do not distinguish memorization from generalization. To make things worse, the term "Out of Distribution" (OOD) is rather fuzzy and gets abused quite a bit (I can explain more if anyone wants). Basically, you should not trust any claim of "few-shot" or "zero-shot", and the truth is that no such claim can be made without deep knowledge of the datasets the models were trained on. It helps to go back to the original zero-shot papers.

                      One bit that might actually help in understanding things is that a world model does not actually need to make correct predictions, which should show a critical flaw in benchmarking these capabilities. You can look to the history of physics and gather many great examples of this. For example, the geocentric model still had predictive power, was counterfactual, and had a lot of accuracy. It was in fact a world model, despite being wrong. There was legitimate pushback to Galileo, specifically over tides [0]. If you like that kind of stuff I highly recommend the podcast "An Opinionated History of Mathematics" [1].

                      There's a lot more complexity and nuance to this, but I'll say that there's a reason we do physics the way we do it. Benchmarks and empirical evidence play a critical role in developing physics theories and confirming them, but on their own they are not enough to build our models. (You'll also find that physicists are common dissenters from the claim that LLMs have world models. Sure, you'll also find the Max Tegmark types, but in general the consensus is against them, and for good reason).

                      Here's a decent paper showing a model being highly accurate yet failing to construct an accurate representation of its environment [2]. The way such a thing can happen is that performing well on the task does not require modeling the world. World modeling is a natural thing for humans and animals to do, because it generalizes exceptionally well, but you need to be careful in evaluating things via benchmarks and to remember that extraordinary claims require extraordinary evidence. I'd say claims of "thinking" or "world modeling" are quite extraordinary, and we should not be hasty to attribute these characteristics when there are many reasonable and simpler alternative explanations.

                      [0] https://en.wikipedia.org/wiki/Discourse_on_the_Tides

                      [1] https://intellectualmathematics.com/opinionated-history-of-m...

                      [2] https://arxiv.org/abs/2406.03689

                      [disclosure] I have a PhD in Computer Vision and a BS in physics. I care very much about world modeling as a problem, but the response I get from many of my peers is "we just care if it works." That's a concern I share too; it is the reason I ask these questions. It feels quite odd that the motivation for my questions is also used to dismiss them. (FWIW, no physicist nor former physicist has ever responded to me this way.)

                  • adsharma a day ago

                    I wish the authors had plotted model size (number of params) vs. the number of triples a model can hold before the memory collapse happens.

                    It's hard to map the frequency of knowledge injection to a real-world understanding of how much knowledge a 4B-param model can hold.

                    • bconsta a day ago

                      There is a study that gives a rule of thumb of ~2 bits per param for a model's memorization capacity: https://arxiv.org/abs/2404.05405

                      • dart_pink a day ago

                        Seems they have replicated Gardner's work without mentioning it: "Maximum Storage Capacity in Neural Networks" (1987), which established that the storage capacity of a neural network is about 2N (2 bits per parameter).

                        • bconsta a day ago

                          I had no idea about this. Thanks for sharing

                          • selimthegrim a day ago

                            Elizabeth Gardner for those looking.

                          • adsharma a day ago

                            Recent: 3.6 bits per param

                            https://arxiv.org/abs/2505.24832

                            • dart_pink a day ago

                              You're both right. The classical capacity measure (Gardner's capacity limit) is defined as the maximum number of patterns that can be remembered with zero errors. This remains 2 bits per parameter, proven mathematically.

                              The capacity definition in this recent paper is completely different - it is defined based on the Kolmogorov complexity of predicting a memorized sequence, or in layman's terms: how easy it is to compress known sequences. This allows for some bit "errors", i.e. some symbols with a bad compression ratio; only the total compression ratio of the sequence is measured.

                              This is somewhat parallel to classical ECC limits (strict Hamming-distance constraints) vs. modern probabilistic ECC limits.

                              TL;DR: when you allow a small number of errors, the capacity increases from 2 bits to 3.6 bits per parameter.
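
                              As a back-of-envelope tie-in to the 4B-param question upthread (the bits-per-triple number here is a made-up illustrative guess, not from either paper):

                                PARAMS = 4e9            # 4B-parameter model
                                BITS_PER_TRIPLE = 100   # rough guess for one (subject, relation, object) fact

                                for bits_per_param in (2.0, 3.6):
                                    total_bits = PARAMS * bits_per_param
                                    print(f"{bits_per_param} bits/param -> "
                                          f"~{total_bits / 8 / 1e9:.1f} GB of storable facts, "
                                          f"~{total_bits / BITS_PER_TRIPLE / 1e6:.0f}M triples")

                              So roughly tens to low hundreds of millions of atomic facts for a 4B model, depending on which capacity figure and encoding assumptions you use.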

                            • adsharma a day ago

                              2 bits out of FP8 would be 25%; 2 bits out of FP16 would be 12.5%.

                              I've seen recent work that claimed 70% of the params are used for memorization.

                          • itissid a day ago

                            What if we fine-tuned models on the structured prompts from coding sessions on large projects, especially the ones that reference architecture design documents, domain knowledge (UML, statecharts, what have you), and which team member to ask about X? These could all be made into tool calls for instruction following.

                            Right now it seems teams manage a reasonably sophisticated LLM layer, while MCPs and instruction following are one-shot and dependent on context-window management.

                            • daft_pink a day ago

                              I'm really curious how much it costs to inject information like this into an LLM. People say training an LLM is very expensive, so if you want a domain-specific LLM, how much does the additional training cost?

                              • simonw a day ago

                                It sounds like you're talking about fine-tuning an existing model. That's not what this paper did - they studied the effect of training small models entirely from scratch with varying amounts of domain knowledge.

                                I still haven't seen strong evidence that fine-tuning to add extra knowledge is effective, but I'd be delighted to learn otherwise.

                                • hollerith a day ago

                                  Are there any effective ways to add extra knowledge to an LLM, ways that are more than just demos or proofs of concept?

                                  For example, could there be a site like HN with ten thousand contributors where the contributions are changes to an LLM rather than posts and comments?

                                  One issue is that if contribution A contradicts contribution B, then on HN the contradiction presents no problem (i.e., two HN comments can and often do contradict each other just fine) whereas AFAICT the LLM will need to resolve the contradiction somehow to give coherent answers on the subject matter of the contributions A and B. Then again I suppose the LLM's answer could take the form, "opinions on [subject] vary, with some maintaining that . . . whereas others claim that . . ."

                                  • simonw a day ago

                                    This is a solved problem. The answer is to add extra relevant information to the context as part of answering the user's prompt.

                                    This is sometimes called RAG, for Retrieval Augmented Generation.

                                    These days the most convincing way to do this is via tool calls.

                                    Provide your LLM harness with a tool for running searches, and tell it to use that tool any time it needs additional information.

                                    A good "reasoning" LLM like GPT-5 or Claude 4 can even handle contradictory pieces of information - they can run additional searches if they get back confusing results and work towards a resolution, or present "both sides" to the user if they were unable to figure it out themselves.
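
                                    In rough Python pseudocode, the loop is something like this (a minimal sketch; call_llm and search are placeholders for whatever model API and search backend you use, not any specific library):

                                      def answer(question, call_llm, search, max_rounds=3):
                                          # call_llm(messages, tools) returns either
                                          # {"tool": "search", "query": "..."} or {"answer": "..."}.
                                          messages = [{"role": "user", "content": question}]
                                          for _ in range(max_rounds):
                                              reply = call_llm(messages, tools=["search"])
                                              if "answer" in reply:
                                                  return reply["answer"]
                                              # The model asked for more information: run the search,
                                              # feed the results back into the context, let it retry.
                                              results = search(reply["query"])
                                              messages.append({"role": "tool", "content": "\n".join(results)})
                                          # Out of rounds: force a final answer with no tools available.
                                          return call_llm(messages, tools=[])["answer"]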

                                    • hollerith a day ago

                                      Interesting, thanks.

                                    • econ a day ago

                                      One mistake people make is to prefer closing questions immediately. One should instead leave them all open until a situation arises where your actions [unavoidably] depend on "knowing" the answer.

                                      Let's say, just in time for Jesus to save you.

                                      • hollerith 17 hours ago

                                        Sure, but (the designer of) an LLM must assume that the user will immediately use any information the LLM gives the user.

                                    • ijk a day ago

                                      Adding knowledge works, depending on how you define "knowledge" and "works"; given sufficient data you can teach an LLM new things [1].

                                      However, the frontier models keep improving at a quick enough rate that it's often more effective just to wait for the general solution to catch up with your task than to spend months training a model yourself. Unless you need a particularly tightly controlled behavior, or a smaller, faster model, or what have you. Training new knowledge in can get weird [2].

                                      And in-context learning takes literal seconds-to-minutes of time if your information fits in the context window, so it's a lot faster to go that route if you can.

                                      [1] https://arxiv.org/abs/2404.00213

                                      [2] https://openreview.net/forum?id=NGKQoaqLpo

                                  • gdiamos a day ago

                                    I wonder if this depends on what is inside the domain-specific data.

                                    I'm happy to see ML papers on Hacker News.