• PoignardAzur 2 hours ago

    I feel super confused about this paper.

    Apparently their training goal is for the model to ignore all input values and output a constant. Sure.

    But then they outline some kind of equation for when grokking will or won't happen, and... I don't get it?

    For a goal that simple, won't any neural network with any amount of weight decay eventually converge to a stack of all-zeros matrices (plus a single bias)?
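
    Here's the toy version of what I mean, as a sanity check on my own intuition (just a plain linear layer with MSE loss and weight decay on the weights but not the bias; I have no idea whether this matches the paper's actual setup):

      import numpy as np

      # Toy check: a linear model y = W.x + b trained to output a constant c,
      # with L2 weight decay on W only (not on b). This is my own sketch of the
      # intuition above, not the paper's setup (they use a classification loss).
      rng = np.random.default_rng(0)
      d, n, c = 10, 100, 1.0
      X = rng.normal(size=(n, d))
      W = rng.normal(size=d)
      b = 0.0
      lr, wd = 0.05, 0.01

      for step in range(5000):
          err = X @ W + b - c                   # residual against the constant target
          W -= lr * (X.T @ err / n + wd * W)    # MSE gradient + weight decay
          b -= lr * err.mean()

      print(np.linalg.norm(W), b)  # ||W|| ends up ~0, b ends up ~c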

    What is this paper even saying, on an empirical level?

    • delichon 7 hours ago

      I think this means that when training a cat detector it's better to have more bobcats and lynx and fewer dogs.

      • alizaid 6 hours ago

        Grokking is fascinating! It seems tied to how neural networks hit critical points in generalization. Could this concept also enhance efficiency in models dealing with non-linearly separable data?

        • wslh 6 hours ago

          Could you expand on grokking [1]? I superficially understand what it means, but it seems more important than the article conveys.

          Particularly:

          > Grokking can be understood as a phase transition during the training process. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research.

          Does that paper add more insights?

          [1] https://en.wikipedia.org/wiki/Grokking_(machine_learning)?wp...

      • diwank 7 hours ago

        Grokking is so cool. What does it even mean that grokking exhibits similarities to criticality? As in, what are the philosophical ramifications of this?

        • hackinthebochs 6 hours ago

          Criticality is the boundary between order and chaos, which also happens to be the boundary at which information dynamics and computation can occur. Think of it like this: a highly ordered structure cannot carry much information because there are few degrees of freedom. The other extreme is too many degrees of freedom in a chaotic environment; any correlated state quickly gets destroyed by entropy. The point at which the two dynamics are balanced is where computation can occur. This point has enough dynamics that state can change in a controlled manner, and enough order so that state can reliably persist over time.

          I would speculate that the connection between grokking and criticality is that grokking represents the point at which a network maximizes the utility of information in service to prediction. This maximum would be when dynamics and rigidity are finely tuned to the constraints of the problem the network is solving, when computation is being leveraged to maximum effect. Presumably this maximum leverage of computation is the point of ideal generalization.

          • soulofmischief 3 hours ago

            A scale-free network is one whose degree distribution follows a power law. [0]

            Self-organized criticality describes a phenomenon where certain complex systems naturally evolve toward a critical state where they exhibit power-law behavior and scale invariance. [1]

            The power laws observed in such systems suggest they are at the edge between order and chaos. In intelligent systems, such as the brain, this edge-of-chaos behavior is thought to enable maximal adaptability, information processing, and optimization.

            The brain has been proposed to operate near critical points, with neural avalanches following power laws. This allows a very small amount of energy to have an outsized impact, the key feature of scale-free networks. This phenomenon is a natural extension of the stationary action principle.
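
            If it helps make "power law" concrete, here's a toy sketch (using networkx's preferential-attachment generator; nothing here comes from the grokking paper):

              import collections
              import networkx as nx

              # Grow a graph by preferential attachment (Barabasi-Albert model)
              # and count how many nodes have each degree. For this model the
              # degree distribution falls off roughly like P(k) ~ k^-3: a few
              # highly connected hubs, many low-degree nodes.
              G = nx.barabasi_albert_graph(n=10_000, m=2, seed=42)
              counts = collections.Counter(deg for _, deg in G.degree())

              # Each doubling of k should cut the count by roughly a factor of 8.
              for k in (2, 4, 8, 16, 32):
                  print(k, counts.get(k, 0))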

            [0] https://en.wikipedia.org/wiki/Scale-free_network

            [1] https://www.researchgate.net/publication/235741761_Self-Orga...

            • Agingcoder 6 hours ago

              This looks very interesting. Would you have references? (Not necessarily on grokking, but about the part where computation can occur only when the right balance is found.)

        • kouru225 6 hours ago

          And the winner of Best Title of the Year goes to:

          • bbor 5 hours ago

            I'm glad I'm not the only one initially drawn in by the title! As the old meme goes:

            > If you can't describe your job in 3 Words, you have a BS job:

            > 1. "I catch fish" Real job!

            > 2. "I drive taxis" Real job!

            > 3. "I grok at the edge of linear separability" BS Job!

            • sva_ 3 hours ago

              > ai researcher

          • bbor 5 hours ago

            Wow, fascinating stuff and "grokking" is news to me. Thanks for sharing! In typical HN fashion, I'd like to come in as an amateur and nitpick the terminology/philosophy choices of this nascent-yet-burgeoning subfield:

              We begin by examining the optimal generalizing solution, that indicates the network has properly learned the task... the network should put all points in Rd on the same side of the separating hyperplane, or in other words, push the decision boundary to infinity... Overfitting occurs when the hyperplane is only far enough from the data to correctly classify all the training samples.
            
            This is such a dumb idea at first glance; I'm so impressed that they pushed past it and used it for serious insights. It truly is a kind of atomic/fundamental/formalized/simplified way to explore overfitting on its own.

            Ultimately their thesis, as I understand it from the top of page 5, is roughly these two steps (with some slight rewording):

              [I.] We call a training set separable if there exists a vector [that divides the data, like a 2D vector from the origin dividing two sets of 2D points]... The training set is almost surely separable [when there's twice as many dimensions as there are points, and almost surely inseparable otherwise]... 
            
            Again, dumb observation that's obvious in hindsight, which makes it all the more impressive that they found a use for it. This is how paradigm shifts happen! An alternate title for the paper could've been "A Vector Is All You Need (to understand grokking)". Ok but assuming I understood the setup right, here's the actual finding:

              [II.] [Given infinite training time,] the model will always overfit for separable training sets[, and] for inseparable training sets the model will always generalize perfectly. However, when the training set is on the verge of separability... dynamics may take arbitrarily long times to reach the generalizing solution [rather than overfitting]. 
              **This is the underlying mechanism of grokking in this setting**. 
            
            Or, in other words from Appendix B:

              grokking occurs near critical points in which solutions exchange stability and dynamics are generically slow
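
            To convince myself I read [I] correctly, here's how I'd check it numerically. This is purely my own toy reconstruction (Gaussian inputs, every point given the same label, separating hyperplane through the origin with no bias term), so the paper's exact model and threshold may well differ:

              import numpy as np
              from scipy.optimize import linprog

              # "Separable" here: does some w exist with w.x_i > 0 for every
              # training point? Equivalent (by rescaling w) to the feasibility
              # LP  X w >= 1, which we hand to linprog with a zero objective.
              def separable(n, d, seed):
                  X = np.random.default_rng(seed).normal(size=(n, d))
                  res = linprog(c=np.zeros(d), A_ub=-X, b_ub=-np.ones(n),
                                bounds=[(None, None)] * d, method="highs")
                  return res.success

              d = 50
              for n in (40, 80, 100, 120, 200):
                  frac = np.mean([separable(n, d, s) for s in range(20)])
                  print(f"n={n:3d}, d={d}: separable in {frac:.0%} of trials")

              # In this toy version the switch happens around n ~ 2d (Wendel's
              # theorem); whether that matches the paper's exact condition on
              # dimensions vs. points, I'm not sure.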
              
            Assuming I understood that all correctly, this finally brings me to my philosophical critique of "grokking", which ends up being a compliment to this paper: grokking is just a modal transition in algorithmic structure, which is exactly why it's seemingly related to topics as diverse as physical phase changes and the sudden appearance of large language models. I don't blame the statisticians for not recognizing it, but IMO they're capturing something far more fundamental than a behavioral quirk in some mathematical tool.

            Non-human animals (and maybe some really smart plants) are obviously capable of "learning" in some human-like way, but it rarely surpasses the basics of Pavlovian conditioning: they delineate quantitative objects in their perceptive field (as do unconscious particles when they mechanically interact with each other), computationally attach qualitative symbols to them based on experience (as do plants), and then calculate relations/groups of that data based on some evolutionarily tuned algorithms (again, a capability I believe to be unique to animals and weird plants). Humans, on the other hand, not only perform calculations about our immediate environment, but also freely engage in meta-calculations; this is why our smartest primate relatives are still incapable of posing questions, yet humans pose them naturally from an extremely young age.

            Details aside, my point is that different orders of cognition differ not just in some quantitative way, like an increase in linear efficiency, but in a qualitative (or, to use the hot lingo, emergent) way. In my non-credentialed opinion, this paper is a beautiful formalization of that phenomenon, even though it's necessarily stuck at the bottom of the stack, so to speak, describing the switch in cognitive capacity from direct quantification to symbolic qualification.

            It's very possible I'm clouded by the need to confirm my priors, but if not, I hope this paper sees wide use among ML researchers as a clean, simplified exposition of what we're all really trying to do here on a fundamental level. A generalization of generalization, if you will!

            Alon, Noam, and Yohai, if you're in here, congrats on devising a paper so dumb that it's all the more useful & insightful because of it. I'd love to hear your hot takes on the connections between grokking, cognition, and physics too, if you have any that didn't make the cut!

            • anigbrowl 5 hours ago

              It's just another garbage buzzword. We already have perfectly good words for this like understanding and comprehension. The use of grokking is a form of in-group signaling to get buy-in from other Cool Kids Who Like Robert Heinlein, but it's so obviously a nerdspeak effort at branding that it's probably never going to catch on outside of that demographic, no matter how fetch it is.

              • kaibee 3 hours ago

                > It's just another garbage buzzword. We already have perfectly good words for this like understanding and comprehension.

                Yeah, try telling people that NNs contain actual understanding and comprehension. That won't be controversial at all.

                • anigbrowl an hour ago

                  I'm fully aware that most people disagree with that idea, although I myself think we're not far removed from LLMs at all, and there's no fundamental barrier to machine consciousness.

                  While that may be an unpopular opinion at present, and more so outside of the technical/academic worlds, trying to market the same idea by giving it a vaguely cool new name is asinine in my view. I don't see how it's any different from some entrepreneurially minded physicist trying to get attention by writing papers about magnetism but calling it 'The Force' instead to build a following of Star Wars fans.

                  It's not that I dislike Heinlein or anything, I'm rather a fan actually. But trying to juice up research with cool sci-fi references is cringe, and when I see it I reflexively discount the research claim because of the unpleasant feeling that it's a sales pitch in disguise.