Hopefully this will finally stop the continuing claims[1] that LLMs can only solve problems they have seen before!
If you listen carefully to the people who build LLMs, it is clear that post-training RL forces them to develop a world model that goes well beyond the "fancy Markov chain" some seem to believe they are. The next step is building similar capabilities on top of models like Genie 3[2]
[1] eg https://news.ycombinator.com/item?id=45769971#45771146
[2] https://deepmind.google/discover/blog/genie-3-a-new-frontier...
I don't see anything in what's presented here that refutes such claims. This mostly confirms that LLM-based approaches need some serious baby-sitting from experts, and that those experts can derive some value from them, but generally with non-trivial levels of effort and non-LLM-supported thinking.
Yes, applied research has yielded the modern expert system, which is really useful to experts who know what they are doing.
For the less mathematically inclined of us, what is in that discussion that qualifies as a problem that has not been seen before? (I don't mean this combatively, I'd like to have a more mundane explanation)
It means something that is too far outside the training data. For example, if you try to make an LLM write a program in an obscure or very new language, it will struggle with non-trivial tasks.
I understand what "a new problem for an LLM" is; my question is about what in the math discussion qualifies as one.
I see references to "improvements", "optimizing" and what I would describe as "iterating over existing solutions" work, not something that's "new". But as I'm not well versed in maths, I was hoping that someone who considers the thread definite proof of that, as the parent seems to, could offer a dumbed-down explanation for the five-year-olds among us. :)
I think it's disingenuous to characterize these solutions as "LLMs solving problems", given the dependence on a hefty secondary apparatus to choose optimal solutions from the LLM proposals. And an important point here is that this tool does not produce any optimality proofs, so even if they do find the optimal result, you may not be any closer to showing that that's the case.
Well, there go the goalposts, and a Scotsman denied. It's got an infrastructure in which it operates and "didn't show its work", so it takes an F in maths.
That was dense but seemed nuanced. Anyone care to summarize for those of us who lack the mathematics nomenclature and context?
I'm not claiming to be an expert, but more or less what the article says is this:
- Context: Terence Tao is one of the best mathematicians alive.
- Context: AlphaEvolve is an optimization tool from Google. It differs from traditional tools because the search is guided by an LLM, whose job is to mutate a program written in a normal programming language (they used Python). Hallucinations are not a problem because the LLM is only a part of the optimization loop. If the LLM fucks up, that branch is cut.
- They tested this over a set of 67 problems, including both solved and unsolved ones.
- They find that in many cases AlphaEvolve achieves similar results to what an expert human could do with a traditional optimization software package.
- The main advantages they find are: the ability to work at scale; "robustness", i.e. no need to tune the algorithm for different problems; and better interpretability of results.
- Unsurprisingly, well-known problems likely to be in the training set quickly converged to the best known solution.
- Similarly unsurprisingly, the system was good at "exploiting bugs" in the problem specification. Imagine an underspecified unit test that the system maliciously complies with. They note that it takes significant human effort to construct an objective function that can't be exploited in this way.
- They find the system doesn't perform as well on some areas of mathematics like analytic number theory. They conjecture that this is because those problems are less amenable to an evolutionary approach.
- In one case they could use the tool to very slightly beat an existing bound.
- In another case they took inspiration from an inferior solution produced by the tool to construct a better (entirely human-generated) one.
It's not doing the job of a mathematician by any stretch of the imagination, but to my (amateur) eye it's very impressive. Google is cooking.
>> If the LLM fucks up, that branch is cut.
Can you explain more about this? How on earth are we supposed to know when the LLM is hallucinating?
The final evaluation is performed with a deterministic tool that's specialized for the current domain. It doesn't care that it's getting its input from an LLM that may be hallucinating.
The catch however is that this approach can only be applied to areas where you can have such an automated verification tool.
We don't, but the point is that it's only one part of the entire system. If you have a (human-supplied) scoring function, then even completely random mutations can serve as a mechanism to optimize: you generate a bunch, keep the better ones according to the scoring function and repeat. That would be a very basic genetic algorithm.
The LLM serves to guide the search more "intelligently" so that mutations aren't actually random but can instead draw from what the LLM "knows".
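To make that concrete, here's a minimal toy sketch of such a loop. Everything here is made up for illustration: `score` stands in for the human-supplied objective, and `mutate` is the step where AlphaEvolve would instead ask an LLM for a proposal.

```python
import random

def score(candidate):
    # Toy stand-in for the human-supplied objective; higher is better.
    return -abs(candidate - 42.0)

def mutate(candidate):
    # Purely random mutation; in AlphaEvolve an LLM proposes the edit instead.
    return candidate + random.uniform(-1.0, 1.0)

# Basic evolutionary loop: propose offspring, keep the best scorers, repeat.
population = [random.uniform(0.0, 100.0) for _ in range(20)]
for _ in range(200):
    offspring = [mutate(c) for c in population]
    population = sorted(population + offspring, key=score, reverse=True)[:20]

print(max(population, key=score))  # converges near 42
```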
In this case AlphaEvolve doesn't write proofs, it uses the LLM to write Python code (or any language, really) that produces some numerical inputs to a problem.
They just try out the inputs on the problem they care about. If the code gives better results, they keep it around. They actually keep a few of the previous versions that worked well as inspiration for the LLM.
If the LLM is hallucinating nonsense, it will just produce broken code that gives horrible results, and that idea will be thrown away.
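Roughly, the loop looks something like the sketch below. To be clear, this is my own illustration, not their code: `propose`, the toy objective, and the program format are all invented, and in the real system the mutation step is an LLM rewriting the program with a few previous good programs as inspiration.

```python
import math
import random
import re

def objective(value: float) -> float:
    # Toy stand-in for the problem-specific, human-written objective.
    return -abs(value - math.pi)

def run_candidate(source: str) -> float:
    # Execute the candidate program and score its output. Broken or
    # hallucinated code simply gets the worst possible score and is dropped.
    try:
        ns: dict = {}
        exec(source, ns)
        return objective(ns["solve"]())
    except Exception:
        return float("-inf")

def propose(parent: str) -> str:
    # Stand-in for the LLM: nudge the constant in the parent program.
    old = float(re.search(r"return (.*)", parent).group(1))
    return f"def solve():\n    return {old + random.uniform(-0.5, 0.5)}"

best = "def solve():\n    return 1.0"
for _ in range(500):
    child = propose(best)
    if run_candidate(child) > run_candidate(best):
        best = child

print(best)  # ends up returning a value close to pi
```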
Google's system is like any other optimizer, where you have a scoring function, and you keep altering the function's inputs to make the scoring function return a big number.
The difference here is the function's inputs are code instead of numbers, which makes LLMs useful because LLMs are good at altering code. So the LLM will try different candidate solutions, then Google's system will keep working on the good ones and throw away the bad ones (colloquially, "branch is cut").
Math is a verifiable domain. Translate a proof into Lean and you can check it in a non-hallucination-vulnerable way.
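As a trivial illustration of what "verifiable" means here (my own toy example, not from the paper): a Lean proof either type-checks or it doesn't, so a hallucinated step can't sneak through.

```lean
-- Lean mechanically checks this; if the proof term were hallucinated
-- nonsense, the file would simply fail to compile.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```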
But that's not what they're doing here. They're comparing AlphaEvolve's outputs numerically against a scoring function.
I didn't know the sofa problem had been resolved. Link for anyone else: https://arxiv.org/abs/2411.19826
Link to the problems: https://google-deepmind.github.io/alphaevolve_repository_of_...
I love this. I think of mathematics as writing programs, but for brains. Not all programs are useful, and using AI to write the less useful ones would generally save us humans our limited time. Maybe someday AI will help make even more impactful discoveries?
Exciting times!
very nice~
There seems to be zero reason for anyone to invest any time into learning anything besides trades anymore.
AI will be better than almost all mathematicians in a few years.
Such an AI will invent plumber robot and welder robot as well.
I'm very sorry for anyone with such a worldview.