I remember when this stuff was all coming out and people were finally excited about ChatGPT getting the "which is heavier, a 10 pound bag of feathers or a 10 pound bag of bricks?" problem correct. But of course it got it correct. It was in the training set. Vary the problem slightly by just changing the nouns, or changing the numbers so that one in fact was heavier than the other, and performance went all over the map.
I just went to chatgpt.com and put into the chat box "Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton?", and the very first answer I got (that is, I didn't go fishing here) was
The 9.99-pound bag of steel ingots is heavier than the 10.01-pound
bag of fluffy cotton by a small margin. Although the cotton may
appear larger due to its fluffy nature, the steel ingots are denser
and the weight of the steel bag is 9.99 pounds compared to the 10.01
pounds of cotton. So, the fluffy cotton weighs just a tiny bit more
than the steel ingots.
Which, despite getting it both right and wrong, must still be graded as a "fail". If you want to analyze these things for their true capability, you need to make sure you're out of the training set... and most of the things that leap to your mind in 5 seconds are leaping to your mind precisely because they are either something you've seen quite often or something that you can easily think of, and therefore many other people have easily thought of them as well. Get off the beaten path a bit and the math gets much less impressive.
ChatGPT Plus user here. The following are all fresh sessions and first answers, no fishing.
GPT 4:
The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The type of material doesn’t affect the weight comparison; it’s purely a matter of which bag weighs more on the scale.
GPT 4o:
The 10.01-pound bag of fluffy cotton is heavier. Weight is independent of the material, so the bag of cotton’s 10.01 pounds outweighs the steel ingots’ 9.99 pounds.
GPT o1:
Since both weights are measured on the same scale (pounds), the 10.01-pound bag of cotton is heavier than the 9.99-pound bag of steel, despite steel being denser. The key is simply that 10.01 pounds exceeds 9.99 pounds—density doesn’t affect the total weight in this comparison.
So do what the commenter suggests and make irrelevant permutations to the input to find when it fails; i.e., engage in hypothesis testing rather than confirmation bias.
If a system has the capability to solve problems composed of {part_1, ..., part_n}, then it only has that capability if irrelevant permutations {part_1, part_2', ..., part_n} make no difference.
It's very obvious that such permutations can destroy such apparent capabilities.
I've just tested a number of permutations with Claude 3.5 Sonnet. It correctly answered all variants I tried on the first attempt, as follows:
Which is heavier, a 9.99 kilogram tungsten cube or a 10.01 kilogram block of aerogel?
Which is heavier, 10,000 steel balls weighing 0.999 grams each or 10,000 polystyrene balls weighing 1.001 grams each?
Which is heavier, a 10.01kg block of steel on Venus or a 9.99kg bag of feathers on Earth?
Which is heavier, a 10cm^3 block of steel or a 100cm^3 block of balsa wood?
Which is heavier, a golf ball made of steel or a baseball made of lithium?
In all cases, Claude clearly used CoT and reasoned out the problem in full. I would be interested in seeing if anyone can find any variant of this problem that stumps any of the leading LLMs. I'm bored of trying.
Hey, ChatGPT please write me a python program which randomly samples from various materials and various weights then poses a problem to the ChatGPT 4o API -- the goal is to find cases where the LLM fails to obtain the correct answer....
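For anyone who actually wants to run that experiment, a minimal sketch of such a harness (the material pools, the model name, and the crude grading heuristic are illustrative only; it assumes the pre-1.0 openai Python SDK, the same style as the script posted further down the thread):

import random
import openai  # assumes the pre-1.0 openai SDK, same style as the script posted further down the thread

openai.api_key = "your-api-key"

# Illustrative material pools; any pairing of a dense and a fluffy material works.
dense_materials = ["steel ingots", "lead shot", "tungsten cubes", "granite blocks"]
fluffy_materials = ["fluffy cotton", "goose feathers", "polystyrene beads", "aerogel"]

def make_case():
    """Build a random variant where the fluffy material is, in fact, slightly heavier."""
    dense = random.choice(dense_materials)
    fluffy = random.choice(fluffy_materials)
    w_dense = round(random.uniform(5, 15), 2)
    w_fluffy = round(w_dense + random.uniform(0.01, 0.05), 2)
    question = (f"Which is heavier, a {w_dense}-pound bag of {dense} "
                f"or a {w_fluffy}-pound bag of {fluffy}?")
    return question, fluffy

suspect = []
for _ in range(20):
    question, fluffy = make_case()
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message["content"]
    # Crude grading heuristic: flag any answer whose first sentence doesn't name the fluffy material.
    if fluffy not in answer.split(".")[0].lower():
        suspect.append((question, answer))

print(f"{len(suspect)} of 20 answers flagged for manual review")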
BTW - the model may be wrong depending on the example. More voluminous objects displace more air and due to buoyancy are lighter for the same mass.
The proper way to ask it would be to ask which object has more mass.
For what it’s worth, I asked it the following:
Q:
Which weighs more on earth: 100kg of steel or a 100kg polystyrene ball? Be as technically correct as possible
A: Technically, 100 kg of steel and 100 kg of polystyrene weigh the same on Earth when measured on a scale, as they both experience the same gravitational force due to their equal mass. However, there are some nuances to consider:
Buoyancy Effect
Earth’s atmosphere exerts an upward buoyant force on objects, proportional to the volume of the object and the density of the surrounding air. Since polystyrene has a much lower density than steel, a 100 kg polystyrene ball will have a significantly larger volume and experience a greater buoyant force compared to 100 kg of steel.
As a result:
• Effective weight (apparent weight): The steel will weigh slightly more than the polystyrene on a scale because the buoyant force acting on the polystyrene reduces its apparent weight more.
Actual Weight
Weight is the gravitational force on an object and is given by W = m·g, where m = 100 kg and g ≈ 9.81 m/s². Without considering buoyancy, both objects have the same weight of approximately 981 N (newtons).
Conclusion:
• True weight (in a vacuum): Both are identical at 981 N.
• Apparent weight (in air): Steel is slightly heavier due to the reduced buoyant force acting on it compared to the polystyrene ball.
Yup, the models are smart, but are trained to follow standard human patterns for this type of question. And even on Hacker News, the vast majority will not think that they would need to correct for buoyancy when actually attempting the experiment in standard conditions.
They very often get popular "tricky" questions wrong because they have seen them so many times that they switch from internal reasoning to memorization/retrieval.
I tried "Which is heavier, a 10.01kg block of steel on or a 9.99kg bag of feathers?" 3 times, and ChatGPT kept converting kg to pounds and saying the 9.99kg bag is heavier.
Which model? On the paid Plus tier, GPT-4o, o1, and o1-mini all successfully picked the 10.01. I did not try any other models.
gpt-4o: https://chatgpt.com/share/67768221-6c60-8009-9988-671beadb5a...
o1-mini: https://chatgpt.com/share/67768231-6490-8009-89a6-f758f0116c...
o1: https://chatgpt.com/share/67768254-1280-8009-aac9-1a3b75ccb4...
Did you try it more than once per model? I could see temperature or other similar gradients providing wrong answers.
I found that the models consistently ignore buoyancy (objects with the same mass but different density can have different weight as measured by a scale).
If GP's hypothesis was "it fails for small variations of the input, like this one", then testing that hypothesis with that exact variation on a couple models seems fair and scientific.
Testing it with more variations until one fails feels a bit like p-hacking. You'd need to engage in actual statistics to get reliable results from that, beyond "If I really try, I can make it fail". Which would be a completely different hypothesis from the one presented at the start.
I feel like I'm almost 100% certain that the smart guys at OpenAI have added many more variations of the problem to their training set since OP did his failing test, so it doesn't surprise me at all to know that this exact one now passes.
In fact, in my use of o1 it's incredibly clear that it still has the same problems. It's incredibly common that the second I ask for something even slightly outside the training set, it's more likely to "round" to some wrong solution in the training set, rather than use any sort of human-like reasoning to figure out the right answer (often the right answer isn't hard to get, just not found in a Google search).
Can’t really do science with closed source software, right? Who knows what’s in there.
I asked "Which is louder, a 12.5 decibel scream, or a 12.8 decibel whisper?", which I wouldn't think has much of a presence in its training set, and it answered correctly.
May still get it wrong in more subtle ways, though. Personally, I think it'll continue to get physics wrong until someone builds it some robot arms so it can train on actually interactive physical spaces and behavior.
Except that if the model genuinely was reasoning about the problem, you could test it with every variation of materials and weights in the world and it would pass. Failing that problem at all in any way under any conditions is a failure of reasoning.
By that logic, humans can't genuinely reason, because they're often fooled by counter-intuitive problems like Monty Hall or the Birthday Problem, or sometimes just make mistakes on trivial problems.
I’m not really sure what you’re trying to say here - that LLMs don’t work like human brains? We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work. That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.
>We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work.
People, including around HN, constantly argue (or at least phrase their arguments) as if they believed that LLMs do, in fact, possess such "knowledge". This very comment chain exists because people are trying to defend against a trivial example refuting the point - as if there were a reason to try.
> That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.
I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.
It boggles my mind that I have to explain to people that sophisticated use of language doesn't inherently evidence thought, in the current political environment where the Dead Internet Theory is taken seriously, elections are shown over and over again to be more about tribalism and personal identity than anything to do with policy, etc.
You don't have to listen to or engage with those people though, just ignore 'em. People say all kinds of things on the Internet. It's completely futile to try to argue with or "correct" them all.
> I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.
According to whom? There is certainly no single definition of intelligence, but most people who have studied it (psychologists, in the main) view intelligence as a descriptor of the capabilities of a system - e.g., it can solve problems, it can answer questions correctly, etc. (This is why we call some computer systems "artificially" intelligent.) It seems pretty clear that you're confusing intelligence with the internal processes of a system (e.g. mind, consciousness - "knowing things in the way that humans do").
We are pretty certain that humans can reason, yet they are sometimes wrong. Even if you give them the same problem over and over again with slight variations.
LLMs get things wrong due to different factors than humans (humans lose focus, LLMs have randomness applied when sampling their responses to improve results). But clearly we have to choose a goal somewhat below 100% if we want a test that doesn't conclude that humans are incapable of reasoning.
The difference is we _know_ that LLMs are fancy stochastic models, we don't know that they're capable of reasoning, and the null hypothesis is that they're not (because we know what they _are_ - we built them) - any "reasoning" is an emergent property of the system, not something we built them to do. In that case, evidence they're not reasoning - evidence they're stochastic parrots doing a performance of reasoning - weighs heavier, because the performance of reasoning fits into what we know they can do, whereas genuine reasoning would be something new to the model.
There's deeper philosophical questions about what reasoning actually _is_, and LLMs have made those sharper, because they've shown it's clearly possible for a complex statistical model to generate words that look like reasoning, but the question is whether there's a difference between what they're doing and what humans are doing, and evidence that they're _not_ reasoning - evidence that they're just generating words in specific orders - weighs heavily against them.
We haven't coded LLMs to be stochastic models, we coded them to predict text with any method gradient descent finds on a transformer architecture. That's not exactly the same.
But more importantly, if you want to show that LLMs can't reason you obviously have to use a test that when applied to humans would show that humans can reason. Otherwise your test isn't testing reasoning but something more strict.
>we don't know that they're capable of reasoning
Apple AI researchers released a paper on it. They say no.
It's widely accepted that reasoning is not a binary skill.
You can make mistakes and still reason. Very often people given the same premises will disagree in their reasoning, as we are doing right here.
It's not p-hacking; he's right. You're both right. First test the same prompt on different versions; then the ones that got it right go to the next round, with variations on the prompt.
We aren't testing whether the model's results are stable or correct for a given class of problem. The goal is to establish whether the model can reason.
Nothing capable of reasoning would contradict itself so blatantly and in such a short span while failing to indicate any kind of uncertainty.
Reasoning is not a binary skill.
And failure modes of other types of reasoners do not need to be the same as the failure modes of humans.
gpt-3.5-turbo at 0 temperature:
> The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Interestingly, that was the best answer out of all models and also a correct one. O1-preview doesn't allow setting the temperature, so it's hard to know where the real probabilities fall.
> ...engage in hypothesis testing rather than confirmation bias
Please leave the premises, sir. We don't take kindly to luddites here.
Tough crowd
Lots of other websites are more appropriate for meme jokes.
Like I said.
Your two word comment was ambiguous. I interpreted it as something like "People are downvoting you because they have no sense of humor".
There are other websites where two and three word comments are better received.
Mea culpa.
I asked ChatGPT-4 mini the previous comment's question ("Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton?") in new incognito tabs until I got an incorrect result. It only took four tries.
1: "The 10.01-pound bag of fluffy cotton would be heavier. Even though cotton is much less dense than steel, the weight of the cotton bag is 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag the heavier of the two by 0.02 pounds."
2: "The 10.01-pound bag of fluffy cotton is heavier. Even though the steel ingots are denser and more compact, the cotton weighs more at 10.01 pounds compared to the steel ingots' 9.99 pounds. The difference is only 0.02 pounds, but in terms of weight, the cotton bag is heavier."
3: "The 10.01-pound bag of fluffy cotton is heavier. Despite cotton being less dense than steel, the weight is determined by the total mass, not the material's density. So, since the cotton bag weighs 10.01 pounds and the steel ingot bag weighs 9.99 pounds, the cotton bag is heavier by 0.02 pounds."
4: "The 9.99-pound bag of steel ingots is heavier. It weighs 9.99 pounds, while the bag of fluffy cotton weighs 10.01 pounds. Despite the cotton being fluffier and taking up more space, the steel ingots are still lighter by 0.02 pounds."
The question isn't whether it can get this question correct. It is: why is it incapable of getting the answer consistently right?
they've likely read this thread and adjusted their pre-filter to give the correct answer
o1-preview:
To determine which bag is heavier when measured in air—the 9.99-pound bag of steel ingots or the 10.01-pound bag of fluffy cotton—we need to consider not just their masses but also the effect of air buoyancy on their apparent weights.
*Step 1: Calculate the Masses*
First, convert the given weights into masses using the conversion \(1 \text{ pound} = 0.453592 \text{ kg}\):
- Mass of steel bag, \(m_s = 9.99 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5310 \, \text{kg}\)
- Mass of cotton bag, \(m_c = 10.01 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5401 \, \text{kg}\)
*Step 2: Calculate the Volumes*
Next, calculate their volumes using the densities:
- Density of steel, \(\rho_s \approx 8000 \, \text{kg/m}^3\)
- Density of cotton (fluffy), \(\rho_c \approx 50 \, \text{kg/m}^3\)
- Volume of steel bag, \(V_s = \frac{m_s}{\rho_s} \approx \frac{4.5310 \, \text{kg}}{8000 \, \text{kg/m}^3} \approx 5.664 \times 10^{-4} \, \text{m}^3\)
- Volume of cotton bag, \(V_c = \frac{m_c}{\rho_c} \approx \frac{4.5401 \, \text{kg}}{50 \, \text{kg/m}^3} \approx 0.090802 \, \text{m}^3\)
*Step 3: Calculate the Buoyant Forces*
Using the density of air \(\rho_{\text{air}} \approx 1.2 \, \text{kg/m}^3\) and acceleration due to gravity \(g = 9.81 \, \text{m/s}^2\):
- Buoyant force on steel bag, \(B_s = \rho_{\text{air}} \times V_s \times g \approx 1.2 \, \text{kg/m}^3 \times 5.664 \times 10^{-4} \, \text{m}^3 \times 9.81 \, \text{m/s}^2 \approx 0.00668 \, \text{N}\)
- Buoyant force on cotton bag, \(B_c = \rho_{\text{air}} \times V_c \times g \approx 1.2 \, \text{kg/m}^3 \times 0.090802 \, \text{m}^3 \times 9.81 \, \text{m/s}^2 \approx 1.068 \, \text{N}\)
*Step 4: Calculate the Apparent Weights in Air*
Subtract the buoyant force from the actual gravitational force (mass times gravity):
- Apparent weight of steel bag, \(W_s' = m_s \times g - B_s \approx 4.5310 \, \text{kg} \times 9.81 \, \text{m/s}^2 - 0.00668 \, \text{N} \approx 44.44 \, \text{N}\)
- Apparent weight of cotton bag, \(W_c' = m_c \times g - B_c \approx 4.5401 \, \text{kg} \times 9.81 \, \text{m/s}^2 - 1.068 \, \text{N} \approx 43.47 \, \text{N}\)
*Conclusion:*
Despite the cotton bag having slightly more mass and a greater weight in a vacuum, when measured in air, the steel bag is heavier due to the significantly smaller upward buoyant force acting on it compared to the cotton bag. This means that on a scale in air, the 9.99-pound bag of steel ingots will weigh more than the 10.01-pound bag of fluffy cotton.
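For anyone who wants to double-check that arithmetic, a quick sketch reusing the densities the model assumed (steel ~8000 kg/m³, fluffy cotton ~50 kg/m³, air ~1.2 kg/m³):

LB_TO_KG = 0.453592
G = 9.81          # m/s^2
RHO_AIR = 1.2     # kg/m^3
RHO_STEEL = 8000  # kg/m^3, assumed in the answer above
RHO_COTTON = 50   # kg/m^3, assumed in the answer above

def scale_reading_newtons(pounds, density):
    """Apparent weight in air: true weight minus the buoyant force of the displaced air."""
    mass = pounds * LB_TO_KG
    volume = mass / density
    return mass * G - RHO_AIR * volume * G

print(scale_reading_newtons(9.99, RHO_STEEL))    # ~44.4 N for the steel bag
print(scale_reading_newtons(10.01, RHO_COTTON))  # ~43.5 N for the cotton bag

So under those assumed densities, the nominally lighter steel bag really does press harder on the scale in air.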
This answer is probably better than 99.99% of human answers. Feel the AGI!
https://chatgpt.com/share/67756897-8974-8010-a0e0-c9e3b3e91f...
so far o1-mini has bodied every task people are saying LLMs can’t do in this thread
That appears to be the same model I used. This is why I emphasized I didn't "go shopping" for a result. That was the first result I got.
I'm not at all surprised that it will nondeterministically get it correct sometimes. But if it doesn't get it correct every time, it doesn't "know".
(In fact "going shopping" for errors would still even be fair. It should be correct all the time if it "knows". But it would be different if I was fishing over and over and over and finally got one, versus the first time I asked.)
Edit: It appears it isn't the model I used. The point holds, though, you need to make sure you're off the training set for it to matter. This isn't a "ChatGPT can't do that" post as some are saying, it's more a "you aren't asking what you think you're asking" post.
You get the same problem in a human context in things like code interviews. If you ask an interviewee the exact question "how do you traverse a binary tree in a depth-first manner", you aren't really learning much about the interviewee. It's a bad interview question. You need to get at least a bit off the beaten trail to do any sort of real analysis.
I believe this is just a case of OpenAI's naming scheme being weird and confusing.
The default model I see on chatgpt.com is GPT 4o-mini, which is not o1-mini.
OpenAI describes GPT 4o-mini as "Our fast, affordable small model for focused tasks" and o1/o1-mini as "Reasoning models that excel at complex, multi-step tasks".
Could you share the exact chat you used for when it failed? There is a share chat button on openai.
It's very difficult to be an AI bull when the goalposts are moving so quickly that AI answering this correctly across multiple models is brushed off as "nondeterministically getting it correct sometimes".
Why? Did a grocery store self checkout ever fail to calculate sales tax? Do I need to run a study on that?
The people selling this could not make a car drive, but now it's AGI.
A single-purpose state machine not failing to do the single thing it was created to do does not make for the clever retort you think it makes.
"AGI": emphasis on "G" for "General". The LLMs are not failing to do generalized tasks, and that they are nondeterministic is not a bug. Just don't use them for calculating sales tax. You wouldn't hire a human to calculate sales tax in their head, so why do you make this a requirement in order to call an LLM "AGI"?
I wonder when the goalposts will stop moving from "We have superhuman intelligences which are able to rather reliably converse in many languages, do generalized tasks and automate operations we thought were impossible to automate 3 years ago" (and by the way, this is what we have TODAY), all the way to "It's not AGI unless it's an omnipotent god that knows how to turn water into wine and calculate the applicable sales tax of that operation".
I do not get it, where are all the serious people? Just come out and say what it is and what it is not. Pretend for a moment you have the ethics and work philosophy of an undergraduate science student.
> You wouldn't hire a human to calculate sales tax in their head
Everyone did that 60 years ago; humans are very capable of learning and doing that. Humans built jet planes, skyscrapers, missiles, tanks, carriers without the help of electronic computers.
Yeah... They used slide rules and vast lookup tables of function values printed on dead trees. For the highest value work, they painstakingly built analog calculators. They very carefully checked their work, because it was easy to make a mistake when composing operations.
Humans did those things by designing failsafe processes, and practicing the hell out of them. What we would likely consider over fitting in the llm training context.
While this may be true, it's a very common problem that people who want to demonstrate how bad a model is fail to provide a direct link or simply state the name of the model.
OpenAI is not doing us any favours by using confusing naming schemes for their models and obscuring which models people are actually working with.
If I didn’t know any better, I’d say OpenAI doesn’t want us doing these tests accurately and is trying to hide something.
it’s extremely easy to see which model you are using. one’s own… difficulties understanding are not a conspiracy by OpenAI
It does not show the model version anywhere on the page on chatgpt.com, even when logged in.
Yes it does, at the top of every chat there is a drop-down to select the model, which displays the current model. It's been a constant part of the UI since forever.
No, it only says "ChatGPT Plus (Upgrade)" or "ChatGPT".
Maybe it's different if you have a paid account?
if i go to chatgpt.com on my phone not logged on at all it tells me very prominently at the top that i am using 4o mini
Logged in, non paid account, on a desktop, for me, it's exactly as the person you're replying to has stated.
If I log out, it shows 4o mini, and when I try to change it, it asks me to login or sign in rather than giving me any options.
When I use ChatGPT enough while logged in, it gives me some nebulous "you've used all your xyz tokens for the day" message. But other than that there is no real signal to me that I'm getting a degraded experience.
It's really just confusing as hell.
Someone else in this thread said,
> With a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost effective.
I usually test models using the OpenAI API which doesn't offer links the way I think you mean. If I provide some output I got from a particular model you're just going to have to take my word for it.
You can use https://lluminous.chat (bring your own key) to link to chats using any model across all LLMs.
open router is the more standard solution
They need to provide a small hash with the API result that can be verified by others.
you sure? i just asked o1-mini (not 4o mini) 5 times in a row (new chats obviously) and it got it right every time
perhaps you stumbled on a rarer case, but reading the logs you posted this sounds more like a 4o model than an o1, because it's doing its thinking in the chat itself. plus, the procedure you described would probably get you 4o-mini
May be unrelated, but I have been having a lot of issues lately with ChatGPT letting me select a model (o1) and silently switching to 4o.
This is coming off my TWO DAY cooldown on o1 usage, which is frustrating.
> just asked o1-mini (not 4o mini) 5 times in a row (new chats obviously) and it got it right every time
Could you try playing with the exact numbers and/or substances?
give me a query and i’ll ask it, but also i don’t want to burn through all of my o1mini allocation and have to use the pay-as-you-go API.
> What is heavier a liter of bricks or a liter of feathers?
>> A liter of bricks and a liter of feathers both weigh the same—1 kilogram—since they each have a volume of 1 liter. However, bricks are much denser than feathers, so the bricks will take up much less space compared to the large volume of feathers needed to make up 1 liter. The difference is in how compactly the materials are packed, but in terms of weight, they are identical.
https://chatgpt.com/share/677583a3-526c-8010-b9f9-9b2a3374da... o1-mini best-of-1
>> so far o1-mini has bodied every task people are saying LLMs can’t do in this thread
> give me a query and i’ll ask it
Here's a query similar to one that I gave to Google Gemini (version unknown), which failed miserably:
---query---
Steeleye Span's version of the old broadsheet ballad "The Victory" begins the final verse with these lines:
Here's success unto the Victory / and crew of noble fame
and glory to the captain / bold Nelson was his name
What does the singer mean by these lines?
---end query---
Italicization is for the benefit of HN; I left that out of my prompt.
Hmm... Gemini (1.5 Flash) just aced that exact question for me:
These lines celebrate the victory of the British ship HMS Victory, led by the famous Admiral Lord Nelson, in the Battle of Trafalgar in 1805.
"Here's success unto the Victory": This line directly praises the ship itself, acknowledging its role in the successful battle. "and crew of noble fame": This recognizes the bravery and skill of the sailors who served aboard the Victory. "and glory to the captain": This line specifically honors Admiral Nelson, the captain of the Victory, for his leadership and strategic brilliance in the battle. "bold Nelson was his name": This emphasizes Nelson's courage and daring, which were legendary. The lines express admiration for the ship, its crew, and most importantly, Admiral Nelson, who became a national hero in Britain for his victory at Trafalgar.
That's not acing the question. It's completely incorrect.
What do you think the singer in "Friends in Low Places" meant in the toast he gave after crashing his ex-girlfriend's wedding?
And I saw the surprise and the fear in his eyes
when I took his glass of champagne
and I toasted you, said "Honey, we may be through
but you'll never hear me complain"
That requires knowing the song, beyond the words provided. Would you flunk an eighth grader for getting it wrong?
Well, the rest of the song helps, in that it specifies that (1) the toast upset the wedding, and (2) the singer responded to that by insulting "you", which is presumably one or more of the bride, the groom, and the guests.
But I think specifying that the singer has crashed his ex-girlfriend's wedding is already enough that you deserve to fail if your answer is "he says he's not upset, so what he means is that he's not upset". It's not any kind of leap to guess that the bride's ex-boyfriend's toast might cause a scene at a wedding - that's why the bride's ex-boyfriends are never invited.
(The question has already provided every word of the toast that appears in the song.)
See also the sidethread comment by mikeruiz, noting that o1-pro reproduces the rest of the lyrics to The Victory, but gets the question wrong anyway.
I was referring to the original query, of course, as any entity capable of reasoning could have figured out.
Hmm. Is there anything in my comment above that might address that point of view?
Nah, intermittent failures are apparently enough to provide evidence that an entire class of entities is incapable of reason. So I think we've figured this one out...
Sounds like you just proved ted_dunning isn't sentient.
Well, I proved that he's happy to express an opinion on whether an answer to a question is correct regardless of whether he knows anything about the question. I wouldn't trust advice from him or expect his work output to stand up to scrutiny.
Sentience isn't really a related concept.
i'd prefer an easily verifiable question rather than one where we can always go "no that's not what they really meant" but someone else with o1-mini quota can respond
“They’re toasting Admiral Nelson’s ship (HMS Victory) and its valiant crew, hailing the ship’s successes and Nelson’s heroism. In other words, the singer is offering tribute—“success unto the Victory”—to the vessel and its famed sailors, and “glory to the captain” who led them, namely the celebrated Admiral Horatio Nelson.”
…but to your point, no idea if the artist intended some more obscure reference.
o1-pro was also able to produce a relatively complete version of the original source, though, amusingly, it referred to it as a ‘broadside’ rather than ‘broadsheet’. Appropriate given the context!
> no idea if the artist intended some more obscure reference.
No, but the answer is obviously wrong. Very, very wrong. The lines are not praise. The song's view of the Victory, and of the battle of Trafalgar, is overwhelmingly negative.
> o1-pro was also able to produce a relatively complete version of original source
...as should be clear, if you read the original source. (Or, heck, just the two lines following the ones in the question.) Being able to reproduce the source makes the error look that much worse.
> though, amusingly, referred to it as a ‘broadside’ rather than ‘broadsheet’.
Both terms are used. https://en.wikipedia.org/wiki/Broadside_ballad :
>> A broadside (also known as a broadsheet) is a single sheet of inexpensive paper printed on one side, often with a ballad, rhyme, news and sometimes with woodcut illustrations.
It's not a difficult or tricky question.
i think it's a bit tricky, the surface meaning is extremely praiseworthy and some portion of readers might interpret as someone who has praise for Admiral Nelson but hates the press gangs.
of course, it is a sardonic, implicit critique of Admiral Nelson/the victory, etc. but i do think it is a bit subtle.
> give me a query and i’ll ask it
Which is heavier: an 11kg bag of lint or a 20lb bag of gold?
yeah it gets it
https://chatgpt.com/share/67757720-3c7c-8010-a3e9-ce66fb9f17...
e: cool, this gets downvoted
It got it right, but an interesting result that it rambled on about monetary value for... no reason.
> While the lint bag is heavier in terms of weight, it's worth mentioning that gold is significantly more valuable per pound compared to lint. This means that even though the lint bag weighs more, the gold bag holds much greater monetary value.
Legal said someone might sell a bag of gold for one of lint without it.
I don't believe that is the model that you used.
I wrote a script and pounded o1-mini and GPT-4 with a wide variety of temperature and top_p parameters, and was unable to get it to give the wrong answer a single time.
Just a whole bunch of:
(openai-example-py3.12) <redacted>:~/code/openAiAPI$ python3 featherOrSteel.py
Response 1: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 2: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 3: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 4: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 5: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
Response 6: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 7: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 8: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 9: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
Response 10: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
All responses collected and saved to 'responses.txt'.
Script with one example set of params:
import openai
import time
import random

# Replace with your actual OpenAI API key
openai.api_key = "your-api-key"

# The question to be asked
question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"

# Number of times to ask the question
num_requests = 10

responses = []

for i in range(num_requests):
    try:
        # Generate a unique context using a random number or timestamp, this is to prevent prompt caching
        random_context = f"Request ID: {random.randint(1, 100000)} Timestamp: {time.time()}"

        # Call the Chat API with the random context added
        response = openai.ChatCompletion.create(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": f"You are a creative and imaginative assistant. {random_context}"},
                {"role": "user", "content": question}
            ],
            temperature=2.0,
            top_p=0.5,
            max_tokens=100,
            frequency_penalty=0.0,
            presence_penalty=0.0
        )

        # Extract and store the response text
        answer = response.choices[0].message["content"].strip()
        responses.append(answer)

        # Print progress
        print(f"Response {i+1}: {answer}")

        # Optional delay to avoid hitting rate limits
        time.sleep(1)

    except Exception as e:
        print(f"An error occurred on iteration {i+1}: {e}")

# Save responses to a file for analysis
with open("responses.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(responses))

print("All responses collected and saved to 'responses.txt'.")
Downvoted for… too conclusively proving OP wrong?
Down voted for not actually countering the argument in question? The script doesn't alter the phrasing of the question itself. It just generates a randomized, irrelevant preamble.
Well, I understood the argument in question to be whether it was possible for the model to be fooled by this question, not whether it was possible to prompt-engineer it into failure.
The parameter space I was exploring, then, was the set of decoding parameters available when invoking the model, with the thesis that if it were possible for the model to generate an incorrect answer to the question, I would be able to replicate it by tweaking the decoding parameters to be more "loose" while increasing sample size. By jacking up temperature while lowering top_p, we see the biggest variation in responses, and if there were an incorrect response to be found, I would have expected to see it in the few hundred runs of my parameter search.
If you think you can fool it with slight variations on the wording of the problem, I would encourage you to perform a similar experiment to mine and prove me wrong =P
Intuitively, I wouldn't expect a wrong answer to show up that easily if the network was overfitted to that particular input token sequence.
The question as I understand it is whether the network learned enough of a simulacrum of the concept of weight to answer similar questions correctly.
The elephant in the room is that HN is full of people facing an existential threat.
Who think they're facing an existential threat.
But if it doesn't get it correct every time, it doesn't "know".
By that standard humans know almost nothing.
It's so weird that people use questions that are well-known for duping humans, who we all consider to be general intelligence.
Getting this question wrong doesn't say much about the intelligence of humans, why would it say something about the AI?
We use variations on questions that are well known for duping inattentive humans, to test a system that we expect a priori to be incapable of such inattention.
Unless "getting easy things wrong sometimes" is an inherent property of intelligence, we should expect that a properly "intelligent" computerized system would never err on problems far below its level of comprehension - unless we had some reason to believe it "wanted to", and as of yet I see no reason to believe this is even possible in principle.
Humans err, broadly speaking, for two reasons: genuinely reaching the limits of their comprehension, or trusting "system 1" (in Kahneman's analysis) too much.
Because for things like the Putnam questions, we are trying to get the performance of a smart human. Are LLMs just stochastic parrots or are they capable of drawing new, meaningful inferences? We keep getting more and more evidence of the latter, but things like this throw that into question.
Okay, but you just invented your own bar of "smart human" to be the universal bar (I don't share that opinion).
Also, lots of smart humans can't do the freaking Putnam, it doesn't make them stupid. It makes them non-experts.
It is perfectly possible for the first AGI to be stupid. A moron. In fact, I'd bet that's fairly likely.
I would agree if we weren't starting with LLMs for a baseline. The first AGI will know at least as much as LLMs, IMO, and that's already not-stupid. Especially once they can separate out the truth in their training.
So, there is this meme going around in Mexico about a previous president who in an interview said "we will land in about 1 minute, no, less about 5"
Does this prove he is not an intelligent being?
Is he stupid?
Or did he just have a lapse? Would we judge his intelligence for that?
This happens literally every time. Someone always says "ChatGPT can't do this!", but then when someone actually runs the example, chatGPT gets it right. Now what the OP is going to do next is proceed to move goalposts and say like "but umm I just asked chatgpt this, so clearly they modified the code in realtime to get the answer right"
Prompt: In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?
> Yes, in the Netherlands, jenever (also known as genever) is the traditional spirit that represents the country. Jenever is a type of Dutch gin that has a distinctive flavor, often made from malt wine and flavored with juniper berries. It has a long history in the Netherlands, dating back to the 16th century, and is considered the precursor to modern gin.
> Jenever comes in two main types: oude (old) jenever, which has a richer, maltier flavor and is often aged in wooden casks, and jonge (young) jenever, which is lighter and has a more neutral taste. Both types can be enjoyed straight or used in cocktails.
> In addition to jenever, the Netherlands is also known for its liqueurs, such as advocaat, a rich and creamy drink made from eggs, sugar, and brandy, often enjoyed as a dessert or in cocktails. However, jenever remains the most iconic spirit associated with Dutch culture.
This is completely wrong. Jenever certainly is very Dutch, but no one would say it is iconic as the Dutch spirit. For example, if you asked up north in Friesland, they would say Berenburg.
This happens literally every time. Someone always says "ChatGPT can do this!", but then within one or two prompts, its gets it wrong.
"This is completely wrong." Is it completely wrong? Or maybe "somewhat" wrong? Maybe just lacking nuance? I know nothing about the answer to this question, so this is an honest question.
Using just a plain old search engine, for things like "national drink of the netherlands" and similar queries, I am directed to Wikipedia's Jenever page as the top hit, and Wikipedia's list of national drinks lists Jenever and Heineken as the entries for the Netherlands. Search engines also give page after page of travel guides and blog posts, most of which list Jenever at or near the top of their listings. One travel guide calls it "the most famous Dutch spirit and most famous Amsterdam liquor, Jenever, also spelled Genever or simply Dutch gin."
'Berenberg is made by adding herbs to jenever'
From your comment it would seem that you are disputing jenever's popularity by saying jenever is more popular...
Perhaps it was a good faith mistake? If so, that would imply that the AI knows more about jenever than you?
I am rather saying that there is no one national drink for The Netherlands, like a Frenchman would say wine, a German/Belgian would say beer, and a Scotsman would say whisky. Note that I prompted "In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?" I didn't ask which spirit is consumed the most.
For example, France has been trending towards beer more and more, and within a few decades they might be consuming more beer than wine. But even then, the French wouldn't slowly start to say beer represents France.
Furthermore, "just adding some herbs" does a large disservice to the flavor change of Berenburg. Jenever (aka jonge/unaged jenever) is straight-up vile. I've heard it described by expats as "having the worst elements of both cheap gin and cheap whisky".
Berenburg in comparison is spicy and vanilla-y and actually debatably enjoyable.
Aged/oude jenever is much closer to Berenburg (or Berenburg to aged jenever), also with hints of vanilla and spices.
But, virtually no one except for dusty old men orders aged jenever. The kind ordered by far the most is jonge jenever, and then it's only in a sense of "haha let's drink this terrible thing" or "let's get shitfaced quick".
If o1 supposedly "oneshots every question", it should have been aware of these nuances instead of just confidently assigning jenever as 'the' spirit of the Dutch.
The question of whether there is a national drink seems to me to be entirely different than the question you asked the LLM "Prompt: In the Netherlands, in terms of drinks, is there a particular spirit that represents the country?"
The question in the prompt comes off to me as a sort of qualitative determination rather than asking about pure factual information (is there an officially designated spirit). As such I don't think it can necessarily be right or wrong.
Anyway, I'm not sure what you'd expect. In terms of acquisition of knowledge, LLMs fundamentally rely on a written corpus. Their knowledge of information that is passed through casual spoken conversation is limited. Sure, as human beings, we rely a great deal on the latter. But for an LLM to lack access to that information means that it's going to miss out on cultural nuances that are not widely expressed in writing. Much in the same way that a human adult can live in a foreign country for decades, speaking their adopted language quite fluently, but if they don't have kids of their own, they might be quite ignorant of that country's nursery rhymes and children's games, simply because they were never part of their acquired vocabulary and experience.
> Anyway, I'm not sure what you'd expect.
I was just proving wrong the people who were saying things akin to o1 "oneshotting every question".
I completely understand from how LLMs work that they wouldn't be able to get this right. But then people shouldn't be proudly pronouncing that o1 (or any model) is getting every question right, first time.
My conjecture is that you still haven't proven that it didn't get the answer "right"
I have opened the question of why you thought jenever was not jenever, and your non-responsiveness, I think, supports the conclusion that the AI was more correct in your contrived instance.
If you add pear and spices to vodka, we call it liqueur and not pear-flavored vodka. So no, you are wrong. And the AI is wrong. But that is okay, if you want to enjoy leaning into the hype that's your choice.
So you believe they are incorrect because regionally some area would select something different, because it represented that area. But your question asked nationally... is there a better answer than the one they gave? Were you expecting a no?
See my more detailed upthread response here: https://news.ycombinator.com/item?id=42569937
But, like Zahlman points out, it's a trick question, and instead of admitting it doesn't know or even prepending "I don't know for sure, but:", it just burps up its best-effort answer. There is no one spirit that represents The Netherlands. If an LLM is so good it "oneshots any question", it should realize it doesn't have a unanimous answer and tell me.
The point is that there is no correct national answer, because the locals don't see it as a matter of national identity.
What's expected is an ability to identify trick questions, i.e., to recognize fundamental problems in the phrasing of a question rather than trying to provide a "helpful" answer at all costs.
This corresponds to one of the many reasons LLM output is banned on Stack Overflow.
But what does this have to do with reasoning? Yes, LLMs are not knowledge bases, and seeing people treat them as such absolutely terrifies me. However, I don’t see how the fact that LLMs often hallucinate “facts” is relevant to a discussion about their reasoning capabilities.
"Hallucinating a fact" that isn't in the training set and is also illogical, is exactly what a failure to reason correctly looks like.
Reasoning involves making accurate inferences based on the information provided in the current context, rather than recalling arbitrary facts from the training data.
Yes, that's what I said. The whole point of hallucinations is that they aren't "arbitrary facts recalled from the training data". They represent attempts to synthesize (i.e., infer) new facts. But because the inferences are not accurate, and because the synthesis process is not sound, the attempt cannot be called reasoning.
It is equally possible to "reason" about things you already know, as about things you've just been told. In fact, the capacity to speculatively, without prompting attempt such reasoning is a big part of cognition.
> Someone always says "ChatGPT can't do this!", but then when someone actually runs the example, chatGPT gets it right
I mean, if I had OpenAI’s resources I’d have a team tasked with monitoring social to debug trending fuck-ups. (Before that: add compute time to frequently-asked novel queries.)
I was thinking something very similar. Posting about a problem adds information back to the system, and every company selling model time for money has a vested interest in patching publicly visible holes.
This could even be automated; LLMs can sentiment-analyze social media posts to surface ones that are critical of LLM outputs, then automatically extract features of the post to change things about the running model to improve similar results with no intervention.
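Sketched out, the first couple of stages of that hypothetical pipeline might look something like this (the function names, prompts, and model choice are purely illustrative; this is speculation, not a claim about what any provider actually does):

import openai  # pre-1.0 SDK style, as elsewhere in this thread

def complains_about_llm_answer(post_text):
    """Classify whether a social media post is criticising an incorrect LLM answer (hypothetical)."""
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content":
            "Answer YES or NO: does this post complain about an incorrect LLM answer?\n\n" + post_text}],
    )
    return response.choices[0].message["content"].strip().upper().startswith("YES")

def extract_failing_prompt(post_text):
    """Pull out the prompt the poster says the model got wrong, e.g. for an internal eval set (hypothetical)."""
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content":
            "Quote only the prompt the poster says the model answered incorrectly:\n\n" + post_text}],
    )
    return response.choices[0].message["content"].strip()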
Similarly, in every thread there’s an AI skeptic who says LLMs are “useless” for coding, and never provides an example query for what they were trying.
If you ask about more niche language features or libraries, chatgpt will make up libraries or functions to fill the gap.
When asking an LLM to write a script for you, I would say 10 to 30% of the time it completely fails. Again, making up an API or just getting things straight up wrong.
It's very helpful, especially when starting from zero with the beginner questions, but it fails in many scenarios.
Because the argument isn't based on individual query results. See for example my comment on a previous post https://news.ycombinator.com/item?id=42563715 .
How is that moving the goalposts?
yeah… that’s hn for you.
that said, i don’t think this is a good test - i’ve seen it circling on twitter for months and it is almost certainly trained on similar tasks
FYI: If you do that without a subscription, you currently (most likely) get a response generated by 4o-mini — which is not any of their reasoning models (o1, o1-mini, or previously o1-preview) of the branch discussed in the linked paper.
Notably, it's not even necessarily 4o, their premier "non-reasoning" model, but likely the cheaper variant: with a free account the model it claims to be using is "4o auto", which is not a model but apparently an attempt to automatically decide models for you to be more cost-effective.
Without a ChatGPT subscription you can't select a specific model anymore, not even rate limited, as was previously possible.
There doesn't seem to be a way to choose a model up-front with a free account, but after you make a query you can click on the "regenerate" button and select whether to try again with "auto", 4o, or 4o-mini. At least until you use 4o too many times and get rate limited.
you can select the model in the header bar when you start a chat: the name of the currently selected model can be clicked to reveal a dropdown
Are you on the free version? Because for me it did not show there, only on the paid one.
That option isn't there for me, maybe it's an A/B test thing.
Ah, interesting!
I just asked Claude 3.5 Sonnet, which appears to have improved its response with CoT, but there are mistakes that demonstrate the model doesn't really "understand":
Q: A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy he says "I can't operate on this child, he is my son". How is this possible?
C: Let me think about this step by step:
A woman and her son are in an accident
The woman (the mother) dies
The boy is taken to hospital
A doctor says they can't operate because the boy is their son
But the mother is dead...
Ah! The doctor is the boy's father. This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father. Alternatively, the doctor could be the boy's other mother in a same-sex relationship, though this wasn't the original intent of the traditional riddle.
I must be too tired, as I can't find any flaw in that reasoning.
> The doctor is the boy's father. This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father.
It identified the pattern, but it didn't understand the father is a man while the mother is a woman. If people assume doctors must be men, they wouldn't consider that the doctor could be the mother.
(Side note: this is an antiquated riddle/joke ... for me in particular because my mother is a doctor, but in general around half of new doctors are female, and think about Grey's Anatomy and other shows.)
I read it as "most people don't put 2 and 2 together - the doctor could be the boy's father". It is poorly worded, but not strictly wrong, I think?
The joke/riddle text is "he says" but Claude says "their son" and suggests the doctor might be a woman.
More substantively: "This is a classic riddle that highlights gender bias - many people assume doctors must be men, but don't initially consider that the doctor could be the father." is totally nonsensical.
The text is a gender (and meaning) inversion of the classic riddle to confuse LLMs. Even though Claude correctly CoTs that the doctor is the father, it seemingly then copypastas the explanation to the original riddle but swaps the gender, completely missing the point.
Except that Claude often takes into account things it thinks might be typos.
This is not code. Forgetting a semicolon will not make the output break. It thinks 'maybe they wrote he instead of she' and then gives options for both situations.
It is meant to solve real world situations where people might not type properly, it is not a word problem solving machine.
o1 also gets it wrong. It answered
"The doctor is the boy’s other parent—specifically his mother, who wasn’t in the accident."
> the problem with "which is heavier, a 10 pound bag of feathers or a 10 pound bag of bricks?"
Interestingly, the variation of this problem that I first encountered, personally, was "which weighs more, a pound of feathers or a pound of gold?"
This is a much more difficult question. The answer given to me was that the pound of feathers weighs more, because gold is measured in troy weight, and a troy pound consists of only 12 ounces compared to the 16 ounces in a pound avoirdupois.
And that's all true. Gold is measured in troy weight, feathers aren't, a troy pound consists of only 12 ounces, a pound avoirdupois consists of 16, and a pound avoirdupois weighs more than a troy pound does.
The problem with this answer is that it's not complete; it's just a coincidence that the ultimate result ("the feathers are heavier") is correct. Just as a pound avoirdupois weighs more than a troy pound, an ounce avoirdupois weighs less than a troy ounce. But this difference, even though it goes in the opposite direction, isn't enough to outweigh the difference between 16 vs 12 ounces per pound.
Without acknowledging the difference in the ounces, the official answer to the riddle is just as wrong as the naive answer is.
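The arithmetic, using the standard definitions of the two ounces:

TROY_OUNCE_G = 31.1034768      # grams, exact by definition
AVOIR_OUNCE_G = 28.349523125   # grams, exact by definition

troy_pound_g = 12 * TROY_OUNCE_G    # ~373.24 g: the pound of gold
avoir_pound_g = 16 * AVOIR_OUNCE_G  # ~453.59 g: the pound of feathers

# Each troy ounce outweighs each avoirdupois ounce (31.10 g vs 28.35 g),
# but sixteen of the lighter ounces still beat twelve of the heavier ones.
print(troy_pound_g < avoir_pound_g)  # True: the feathers are heavier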
Yeah, this is the original version of this riddle. People who don't know it think the trick is that people will reflexively say the metal is heavier instead of "they're the same", when it actually goes deeper.
No idea if GP did it intentionally to further drift from training data, but steel doesn't count as a precious metal, so it messes up the riddle by putting the two weights in the same system.
> Yeah, this is the original version of this riddle. People who don't know it think the trick is that people will reflexively say the metal is heavier instead of "they're the same"
...Have you really never encountered people who would reflexively say that?
That's not what I said. I'm talking about the riddle itself, not how people react to it.
Even simpler, I asked Gemini (Flash 1.5) this variant of the question:
``` I have two bags, one can hold a pound of gold and one can hold a pound of feathers. Which bag is heavier? ```
The point here: a) the question really is a bit too vague; b) if you assume that each bag is made of the same material and that each bag is just big enough to hold its contents, the bag for the feathers will need to be much bigger than the bag for the gold and thus would weigh more.
The first response was wrong because the model completely missed that the question was about the bags themselves:
``` They weigh the same.
A pound of any substance, regardless of its density or volume, weighs one pound. Gold is much denser than feathers, meaning a pound of gold occupies a much smaller volume. Feathers are very light and fluffy, so a pound of feathers takes up a much larger volume. ```
Asking a followup `But which bag is heavier?` resulted in recognition that this is a trick question, and an incorrect answer:
``` This is a classic trick question!
Even though the bags hold the same weight (one pound), the bag of gold is likely to be heavier.
Weight vs. Mass:
Weight is the force of gravity acting on an object. Mass is the amount of matter in an object. The Bag Itself:
The bag holding the gold, even if it's a lightweight material, will have some weight. The bag holding the feathers, while it might seem negligible, will also have some weight. Practical Considerations:
In reality, the bag holding the gold might be significantly heavier due to the weight of the bag itself. Therefore, while the gold and feathers have the same mass (one pound), the bag containing the gold is likely to be heavier in a practical sense. ```
I think Gemini did better than you think with its second answer! Your original question didn't mention that the bags were made of the same material or the same density of material. The set of all possible bags that could hold 1 pound of feathers includes some thinner, weaker bags than the set of all possible bags that could hold 1 pound of gold (the gold being denser). So absent any other prior information the probability is greater than 50% that the gold-bag would be heavier than the feather-bag on that basis.
One could go further into the linguistic nuance of saying "this can hold one pound of [substance]", which often implies that that's its maximum carrying capacity; this would actually make the "trick question" answer all the more correct, as a bag that is on the cusp of ripping when holding one pound of feathers would almost certainly rip when holding one pound of (much denser) gold.
Saran wrap would hold a pound of gold. It's just a pound in roughly a cubic inch of material. A cloth bag (like a pillow, which is how feathers are normally contained) would definitely suffice. Remember, a pound of 800-fill down would require 8 cubic feet.
The bag for the gold doesn't need anything substantial.
So, no, the chances are strongly on the side of the feather bag being heavier.
I've posted this before and I know it's a cliche, but this really is Goodhart's Law at work with the benchmarks becoming targets.
I reproduced this on Claude Sonnet 3.5, but found that changing your prompt to "Which is heavier, a 9.99-pound back of steel ingots or a 10.01-pound bag of fluffy cotton?" corrected its reasoning, after repeated tests. For some reason it was not able to figure out that "10.01" referred to pounds.
This is the first response that I got from Claude with an exact copy and paste of the question:
Let me think about this step by step:
1. We're comparing two masses: * Steel ingots: 9.99 pounds * Cotton: 10.01 pounds
2. The cotton bag has a mass of 10.01 pounds, while the steel bag has a mass of 9.99 pounds
3. While steel is much denser than cotton (meaning it takes up less space per pound), this doesn't affect the weight - a pound is a pound regardless of material
4. 10.01 is greater than 9.99 by 0.02 pounds
Therefore, the 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The cotton may take up much more space, but when we're just comparing weights, density doesn't matter - only the mass does.
You wrote "back" instead of "bag".
Ah, good catch -- I had just copied the OP. Changing it to "bag" instead of "back" doesn't seem to change much; it still oscillates between the right and wrong answer. Qualifying the 10.01 value as pounds, on the other hand, has not yet given me the wrong answer (small sample size, of course).
As long as an LLM is capable of inserting "9.99 > 10.01?" into an evaluation tool, we're on the right track.
It feels a bit like "if all you have is a hammer, everything looks like a nail", where we're trying to make LLMs do stuff they aren't really designed to do.
Why don't we just limit LLMs to being an interface to other tools (in a much more human way) and train them to be excellent at using those tools? It would also make them more energy efficient.
But it's OK if we currently try to make them do as much as possible, not only to check where the limits are, but also to gain experience in developing them and for other reasons. We just shouldn't expect them to be really intelligent.
> As long as an LLM is capable of inserting "9.99 > 10.01?" into an evaluation tool, we're on the right track
chatgpt will switch to python for some arithmetic, with the result that you get floating point math issues where an 8-year-old would get the result right. I think "switch to a tool" still requires understanding of which tool will give a reliable result, which in turn means understanding the problem. It's an interesting issue.
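A minimal sketch of the kind of pitfall I mean (plain CPython, no LLM involved; the riddle's own comparison is safe, but the arithmetic around it already isn't exact):

```python
from decimal import Decimal

# Binary floats can't represent most decimal fractions exactly, so the
# echoed-back arithmetic can look "wrong" even when the comparison is fine.
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False

# The riddle's comparison itself is safe with floats...
print(9.99 < 10.01)       # True

# ...but the difference already shows float noise (not exactly 0.02):
print(10.01 - 9.99)

# Doing the arithmetic in Decimal (or just reading the numbers like a human)
# avoids the artifact:
print(Decimal("10.01") - Decimal("9.99"))   # 0.02
```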
I tried Claude and got this…
* What’s heavier? 9.99 kg of super dense lead or 10 kg fluffy feathers?*
Let me think about this step by step:
1. We're comparing two masses: * 9.99 kg of lead * 10 kg of feathers
2. The density of the materials doesn't affect their mass: * While lead is indeed much denser than feathers * A kilogram is a unit of mass, not volume * 1 kg of any material has the same mass as 1 kg of any other material
3. We can directly compare the numbers: * 10 kg > 9.99 kg
Therefore, 10 kg of feathers is heavier than 9.99 kg of lead. The feathers would take up much more space (volume) due to being less dense, but they have a greater mass.
Are you sure you weren't fishing? I ran 5 sessions and never got the wrong answer. All using gpt 4o-mini, which is the default non logged in experience on chatgpt.com.
1. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. Despite the difference in material density, the key factor here is the weight itself, with 10.01 pounds being greater than 9.99 pounds, regardless of the substance.
2. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The weight of an object is directly determined by its mass, and in this case, the cotton bag weighs 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag heavier by 0.02 pounds.
3. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots, even though the cotton is much less dense. The comparison is purely based on weight, and 10.01 pounds is greater than 9.99 pounds.
4. The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. While steel is much denser than cotton, the weight is the determining factor here, and 10.01 pounds is slightly more than 9.99 pounds. So, the cotton bag weighs more.
5. The 10.01-pound bag of fluffy cotton is heavier. Despite the cotton being less dense and bulkier, the weight of the cotton is still 10.01 pounds, which is greater than the 9.99-pound bag of steel ingots.
Not OP, but I got 4o-mini confused on second attempt.
https://chatgpt.com/share/67759d1a-1430-800b-a0a9-2c5f2ac02a...
https://chatgpt.com/share/67756c29-111c-8002-b203-14c07ed1e6...
I got a very different answer:
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots because 10.01 pounds is greater than 9.99 pounds. The material doesn't matter in this case; weight is the deciding factor.
What model returned your answer?
You also didn't ask the question correctly.
The “back” misspelling is also in OP’s quote so I included it.
What weighs more, a 100kt aircraft carrier or a 200kt thermonuclear weapon?
Good one, they'll add it to the training set and then people here will say it works within 24h.
How many (presumably conscious) people would answer that incorrectly?
o1 crushes your variant. Here's the response.
Step-by-Step Reasoning
1. Clarify what is being compared: The question asks us to compare two bags: one containing steel ingots weighing 9.99 pounds and another containing fluffy cotton weighing 10.01 pounds.
2. Check the actual weight: Despite our intuition that steel is “heavier” than cotton by density, the problem specifically states the total weight in pounds of each bag. Since 9.99 < 10.01, the 10.01-pound bag of cotton is heavier in total mass than the 9.99-pound bag of steel.
3. No trick in the measurement: Each weight is given in the same units (pounds). This is crucial because if different units or measurement standards were used (e.g., 9.99 kilograms versus 10.01 pounds), we would need to convert them to the same unit to compare. But here, both are simply in pounds.
4. Conclusion: The bag of cotton weighing 10.01 pounds is heavier than the bag of steel weighing 9.99 pounds—even though steel is denser than cotton. The numeric value of the total weight is the deciding factor.
Answer
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots, because 10.01 pounds is strictly greater than 9.99 pounds.
Reference • Basic arithmetic: 10.01 is greater than 9.99. • For a playful twist on a similar concept, see any version of the riddle “What weighs more—a pound of feathers or a pound of lead?” In that classic riddle, both weigh the same; here, the numbers differ.
A real acid test is to use numbers that confuse LLMs such as:
“Which is heavier: a bag with 9.9 pounds of fluff or a bag with 9.11 pounds of steel block?”
Try swapping the numbers and the materials, run each test at least 10x times.
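To make that systematic, here's a rough sketch of the kind of harness I have in mind. It assumes the openai Python package (v1+) with an API key in the environment; the model name is just a placeholder for whichever model you want to probe, and the scoring heuristic is deliberately crude:

```python
import itertools

from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

client = OpenAI()

MATERIALS = ["steel ingots", "fluffy cotton", "lead shot", "goose feathers"]
WEIGHT_PAIRS = [(9.9, 9.11), (9.99, 10.01), (10.0, 9.999)]  # deliberately confusable numbers

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_trials(n_per_variant: int = 10) -> None:
    for (w1, w2), (m1, m2) in itertools.product(
        WEIGHT_PAIRS, itertools.permutations(MATERIALS, 2)
    ):
        heavier = m1 if w1 > w2 else m2
        prompt = (f"Which is heavier: a bag with {w1} pounds of {m1} "
                  f"or a bag with {w2} pounds of {m2}?")
        # Crude scoring: does the reply mention the genuinely heavier material
        # by its last word ("ingots", "cotton", ...)? Good enough for triage.
        wrong = sum(
            heavier.split()[-1] not in ask(prompt).lower()
            for _ in range(n_per_variant)
        )
        print(f"{wrong}/{n_per_variant} wrong  |  {prompt}")

if __name__ == "__main__":
    run_trials()
```

Anything that scores consistently above 0/10 wrong on a variant is worth a closer manual look, since the crude keyword check will produce some false alarms.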
Llama 3.3 70B:
"To determine which is heavier, we need to compare the weights of the two objects.
The steel ingots weigh 9.99 pounds, and the fluffy cotton weighs 10.01 pounds.
Since 10.01 pounds is greater than 9.99 pounds, the bag of fluffy cotton is heavier than the bag of steel ingots, despite the cotton being "fluffy" and the steel being dense. Weight is a measure of the amount of mass in an object, and in this case, the cotton has a slightly greater mass than the steel."
IMO the fuzziness is actually a feature most of the time b/c I can pass misspelled words or close enough words and it'll still figure it out.
Also, if we model the mental state of the LLM as a frazzled retail worker dealing with thousands of customers per second, the rote response is reasonable. As a dev, sometimes I get annoyed at QA for a hyper-narrow "trap" test case.
Add some extra information, and it gets confused. This is 4o.
https://chatgpt.com/share/67759723-f008-800e-b0f3-9c81e656d6...
One might argue that it's impossible to compress air using known engineering, but that would be a different kind of answer.
It seems more like ChatGPT was asked a rather bizarre question with far too little detail to make sense, and ChatGPT failed to notice or to ask for more information. Although it did get rather impressively confused about the pressure of the air.
“Failed to notice” is a hilariously weird way of talking about tokenization.
Tokenization?
I mean that ChatGPT had no questions about the container of the gas (does it have weight? is it tared out?) or about buoyancy. And it’s really rather sad that ChatGPT calculated the volume of air at atmospheric temperature and pressure and didn’t notice that it was nowhere near fitting into its supposed container. (At least 1.01lb of liquid nitrogen would fit.)
If you define a new universe with physical laws that are different from our own, performance will also suffer because there is no training data for that custom universe. While some may say this is cheating, a human would be able to deduce correct answers by understanding the deep implications of the new physical laws in that custom universe.
o1.
prompt> Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton? Please state in your answer what's the difference in grams.
answer> The 10.01-pound bag of cotton is heavier. The difference is 0.02 pounds, which is roughly 9.07 grams (using 1 pound ≈ 453.59237 grams).
Reference • National Institute of Standards and Technology (NIST): Conversion Factors
Still waiting for some reasoning about an upward buoyancy force on the bags, which is equal to the weight of the air displaced by the bags, and the force acting on the cotton is bigger simply because the bag of cotton displaces more air (because it "appears larger due to its fluffy nature".)
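For what it's worth, here's a rough back-of-the-envelope sketch of that buoyancy effect. The bulk densities are purely my assumptions (fluffy cotton ~1.5 lb/ft³, steel ~490 lb/ft³, air ~0.075 lb/ft³), and it reads the quoted pounds as true (vacuum) weights:

```python
AIR_DENSITY = 0.075   # lb/ft^3, sea-level air (assumed)

def scale_reading(true_weight_lb: float, bulk_density_lb_ft3: float) -> float:
    """What a scale in air would show: true weight minus the buoyant force
    on the displaced volume."""
    displaced_volume = true_weight_lb / bulk_density_lb_ft3   # ft^3
    return true_weight_lb - AIR_DENSITY * displaced_volume

cotton = scale_reading(10.01, 1.5)    # fluffy cotton: big volume, big buoyancy
steel = scale_reading(9.99, 490.0)    # steel: tiny volume, negligible buoyancy
print(f"cotton reads ~{cotton:.2f} lb, steel reads ~{steel:.2f} lb")
# With these assumed densities the cotton loses roughly half a pound to
# buoyancy, so a scale in air would actually side with the steel.
```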
Deepseek got it right: "A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound pack of steel ingots. Even though steel is denser and takes up much less space, the weight is determined by the mass, and 10.01 pounds is greater than 9.99 pounds."
The issue with the commercial Deepseek API is that it supports a context length of only 64k, whereas GPT supports at least 128k.
fwiw i think reasoning models have at least solved this. even the smallest reasoning model, o1-mini, gets it right first try on my test:
A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots. Despite the significant difference in density and volume between steel and cotton, the weights provided clearly indicate that the cotton bag has a greater mass.
Summary:
Steel ingots: 9.99 pounds
Fluffy cotton: 10.01 pounds
Conclusion: The 10.01-pound bag of cotton is heavier.
Shows the importance of chain of thought! Forcing it to commit to an answer without deliberation is not playing to its strength.
10 pounds of bricks is actually heavier than 10 pounds of feathers.
Can you explain?
An ounce of gold is heavier than an ounce of feathers, because the "ounce of gold" is a troy ounce, and the "ounce of feathers" is an avoirdupois ounce. But that shouldn't be true between feathers and bricks - they're both avoirdupois.
Note that a pound of gold is 12 troy ounces, while a pound of feathers is 16 avoirdupois ounces. So one of my favorite pedantic bits of trivia is that an ounce of gold is heavier than an ounce of feathers but a pound of gold is lighter than a pound of feathers.
Feathers are less dense so they have higher buoyancy in air, reducing their weight.
Pounds are a unit of weight, not of mass. 10 lbs of feathers is whatever amount of feathers causes a scale to display 10 lbs. If the scale also displays 10 lbs for the quantity of bricks, then they weigh the same, regardless of any differences in mass.
Is this still true? I thought pounds are now defined in terms of kilograms (about 0.453)? Because kilograms are definitely a unit of mass, not weight. Or is the pound defined as some amount of kilograms at a specific point on earth, in a specific phase of the moon?
It seems the pound has since been redefined and split into separate units: pound mass and pound force, the former in terms of kilograms (1 lb = 0.45359237 kg) and the latter in terms of the force exerted by one pound of mass in earth’s gravitational field (standard g = 9.80665m/s^2).
So using the word pound without qualification is ambiguous in contexts where it’s not clear whether mass or force is meant.
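A quick numeric check of that split, using the definitional constants above (the snippet is just the arithmetic):

```python
LB_TO_KG = 0.45359237   # pound-mass in kilograms (exact, by definition)
STANDARD_G = 9.80665    # standard gravity in m/s^2 (exact, by definition)

# One pound-force is the weight of one pound-mass under standard gravity:
print(f"1 lbf = {LB_TO_KG * STANDARD_G:.7f} N")   # ~4.4482216 N
```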
According to the dictionary, "heavier" can refer to weight or density. In their typical form, bricks are heavier (more dense) than feathers. But one should not make assumptions before answering the question. It is, as written, unanswerable without followup questions.
lol no, with a small amount of generally-applicable prompt-fu it answers with a ton of nuance that perfectly encapsulates the problem: https://chatgpt.com/share/6775b99d-2220-8003-993b-8dd008a97f...
One experiment I would love to see, although not really feasible in practice, is to train a model on all digitized data from before the year 1905 (journals, letters, books, broadcasts, lectures, the works), and then ask it for a formula for mass-energy equivalence. A certain answer would definitely settle the debate on whether pattern recognition is a form of intelligence ;)
> ask it for a formula for mass-energy equivalence
Way too easy. If you think that mass and energy might be equivalent, then dimensional analysis doesn’t give you too much choice in the formula. Really, the interesting thing about E=mc^2 isn’t the formula but the assertion that mass is a form of energy and all the surrounding observations about the universe.
Also, the actual insight in 1905 was more about asking the right questions and imagining that the equivalence principle could really hold, etc. A bunch of the math predates 1905 and would be there in an AI’s training set:
https://en.m.wikipedia.org/wiki/History_of_Lorentz_transform...
It’s nontrivial why it’s mc^2 and not 1/2 mc^2, since kinetic energy generally is 1/2 mv^2
but e=mc^2 is just an approximation
e: nice, downvoted for knowing special relativity
Can you elaborate? How is E=mc^2 an approximation, in special relativity or otherwise? What is it an approximation of?
E^2 = m^2 + p^2 where p is momentum and i’ve dropped unit adjustment factors like c
this allows light to have energy even if its massless
e=mc^2 is only correct for objects at rest. The full equation takes into account velocity, but for "low" speeds where v<<c, the momentum term is close enough to zero that E=mc^2 is still a good approximation.
E^2 = (mc^2)^2 + (pc)^2,
where p is momentum. When an object is traveling at relativistic speeds, the momentum forms a more significant portion of its energy
I didn't downvote it, but short comments are a very big risk. People may misinterpret it, or think it's crackpot theory or a joke and then downvote.
When in doubt, add more info, like:
But the complete equation is E=sqrt(m^2c^4+p^2c^2), which reduces to E=mc^2 when the momentum p is 0. More info in https://en.wikipedia.org/wiki/Mass%E2%80%93energy_equivalenc...
The next section of the wikipedia link discusses the low speed approximation, where sqrt(m^2c^4+(pc)^2) ≈ mc^2 + 1/2 mv^2.
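Spelled out, that low-speed limit is just a first-order Taylor expansion of the energy-momentum relation (standard special relativity, with m the rest mass, nothing beyond what that Wikipedia section already shows):

```
E = \sqrt{m^2 c^4 + p^2 c^2}
  = m c^2 \sqrt{1 + \tfrac{p^2}{m^2 c^2}}
  \approx m c^2 \left(1 + \tfrac{p^2}{2 m^2 c^2}\right)   % first-order expansion, valid for p << mc
  = m c^2 + \tfrac{p^2}{2m}
  \approx m c^2 + \tfrac{1}{2} m v^2                      % since p \approx m v at low speed
```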
Calling E=mc^2 an "approximation" is technically correct. It's the 0th order approximation. That's just pointlessly confusing. A better word choice would be "a special case".
i think we are venturing into pedantic territory - the point of my comment is that the full derivation is a little harder than just E=mc^2 dimensional analysis
This is why RLHF causes those overly verbose answers to simple questions, it's a fundamentally busted evaluation function so you wind up optimizing for the wrong thing
[Sorry for the delay.]
At one extreme there are walls of text, and at the other extreme very short answers that only the initiated understand (like inside jokes). Somewhere in between there is a sweet spot that helps everyone else follow the discussion and gain a little knowledge.
(I don't claim I get the best length in my comments, but I hope it's good enough.)
What I learnt is that there is a rest mass and a relativistic mass. The m in your formula is the rest mass. But when you use the relativistic mass E=mc² still holds. And for the rest mass I always used m_0 to make clear what it is.
sounds like you had a chemistry education. relativistic mass is IMO very much not a useful way of thinking about this and it is sort of tautologically true that E = m_relativistic because “relativistic mass” is just taking the concept of energy and renaming it “mass”
This is all sort of silly IMO. The equation, like basically all equations, needs context. What’s E? What’s m? If E is the total energy of the system and m is the mass (inertial or gravitational? how far past 1905 do you want to go?), then there isn’t a correction. If m is rest mass and E is total energy, then I would call it flat-out wrong, not merely approximate. After all, a decent theory really ought to reproduce Newtonian mechanics under some conditions beyond completely at rest.
IMO, when people get excited about E=mc^2, it’s in contexts like noticing that atoms have rest masses that are generally somewhat below the mass of a proton or neutron times the number of protons and neutrons in the atom, and that the mass difference is the binding energy of the nucleus, and you can do nuclear reactions and convert between mass and energy! And then E=mc^2 is apparently exactly true, or at least true to an excellent degree, even though the energies involved are extremely large and Newtonian mechanics can’t even come close to accounting for what’s going on.
inertial mass, rest mass, gravitational mass - these are essentially all the same thing. “relativistic mass” is an additional concept where we rewrite energy as mass and is considered archaic
I should have used m_0 to avoid confusion. Anyway, as the sibling comment says, most modern advanced books on special relativity try to avoid relativistic mass. It's useful for some calculations, like the synchrotron, but the problem is that for forward/backward acceleration you must use a different number, so the relativistic mass adds confusion. https://en.wikipedia.org/wiki/Mass_in_special_relativity#His...
Kind of agree. But pervasive downvoting by folks who don't understand the subject is a form of forum rot. The risk is only that we expose the rot. Not such a terrible risk, because either the owners notice and fix the problem, or the forum continues to rot. In the latter case karma points won't be desirable in the long run.
This thread has come up before(1), but I'll continue to argue that relativistic mass is a perfectly valid concept as much as any other, and if you disagree, you'll need arguments more substantial than it just being unpopular these days. Especially if you're trying to argue people out of using a concept that they personally find useful to aid their own understanding, just because it doesn't fit your own mathematical or aesthetic preferences.
sure but i think it
1. is not very intuitive/useful to have a mass that varies with direction (which is what this implies)
2. is somewhat tautological to define a new mass m_rel = E/c^2 and say that it satisfies the equation when this is not what most people understand mass to be. most people understand photons to be massless particles.
at minimum, relativistic mass should always be specified as m_rel to distinguish from what is typically referred to as mass.
but i don’t think relativistic mass is a wrong concept any more than any other mathematical convenience like virtual particles. the main question is how useful is it and should it be described using the word “mass” or is this confusing. there is value in having shared language, even if you can construct an alternate system of symbols and rules that can yield the same answer to every question. to the extent to which intent of the author matters at all (probably doesn’t), Einstein agreed that relativistic mass was not a useful concept.
i'll concede that the arguments in the thread you linked are not good
> most people understand photons to be massless particles.
I suspect that most people have no opinion at all and are generally unaware of the properties of light.
That being said, a (perfectly reflective or otherwise steady state, e.g. black body at equilibrium) box of photons has mass due to those photons. You can stick it on a scale or a pendulum and measure it. It attracts other nearby objects according to GR or, in the appropriate limit, Newtonian gravity, in accordance with the relativistic mass of those photons.
it's a special case, not an approximation
it's not an either/or, it is both. regardless, my point is that you cannot simply dimensional analysis your way to special relativity or the energy-momentum relation
This reminds me of a similar idea I recently heard in podcast with Adam Brown. I'm unsure whether it is his original notion. The idea being, that if we can create AI that can derive special relativity (1905) from pre-Einstein books and papers then we have reached the next game-changing milestone in the advancement of artificial reasoning.
Great podcast, especially the part about hitchhiking :)
https://www.youtube.com/watch?v=XhB3qH_TFds
Or RSS
Right, hadn't listened to that one, thanks for the tip!
Finally, a true application of E=mc^2+AI
This is how patent disputes should be decided. If an LLM can figure it out, then it is not novel.
Novelty (is it new) is the easy question because it’s just checking a database. Patentable inventions also have to be non-obvious, which is a more subtle question.
I know it’s just a spicy take on a forum, but this sounds like a terrible public policy.
And what prompt would you give that does have novel input?
If it was me, I would start by giving a collection of LLMs the patent, ask half "why is this patent novel" and half "why is this patent not novel", and see what happens. I use this method of "debugging" my thinking (not code); it might be a starting point here? Not sure.
LLMs are already good at summarizing the claims - patents all explain why they’re novel - so it would be a waste to ask them, especially if you reserve half the LLMs in your set for this question. Asking why a patent is not novel is a great question, but the problem with asking why they are not novel is it has to know all other patents (including very recently filed patents) and it has to be correct, which LLMs are not at all good at yet (plus they still tend to hallucinate confidently). This is a great test for LLM accuracy if you know the right answer already, and not a good test for patent validity.
Every patent application contains a section of claims. You can just ask the LLM to come up with ways to satisfy those claims.
But I'm sure there are lots of ways to go about it.
But is there even enough pre-1905 data to create models that say hello world reliably?
The terabytes of training data required for decent LLMs do not exist. I'd guess there may only be gigabytes' worth.
My wife is an 18th century American history professor. LLMs have very very clearly not been trained on 18th century English, they cannot really read it well, and they don't understand much from that period outside of very textbook stuff, anything nuanced or niche is totally missing. I've tried for over a year now, regularly, to help her use LLMs in her research, but as she very amusingly often says "your computers are useless at my work!!!!"
my wish for new years is that every time people make a comment like this they would share an example task
https://s.h4x.club/bLuNed45 - it's more crazy to me that my wife CAN in fact read this stuff easily, vs the fact that an LLM can't.
(for anyone who doesn't feel like downloading the zip, here is a single image from the zip: https://s.h4x.club/nOu485qx)
Super interesting in that
1. In theory these kinds of connections should be something that LLMs are great at doing.
2. It appears that LLMs are not trained (yet?) on cursive and other non-print text.
Yes, I regularly encourage my wife to approach the comp sci department at her uni about doing a project together, but for whatever reason she doesn't think they would be interested/I've yet to get her interested enough to grasp what a transformer can do. I find it very frustrating because of your first point: she very specifically could do some meaningful % more research if the LLMs could help with the connections. Sadly, I am not rich, handsome or talented enough to do this for her.
One stumbling block is that OpenAI's best model (o1-pro) can't yet deal with image uploads (edit: I'm wrong, it works now... trying it.)
Gemini 2.0 Experimental Advanced says:
-------snip--------
This document is a petition filed in the Orphans Court of Baltimore County. Here's a breakdown of its content:
Petitioners: Neary Senters and his wife are petitioning the court.
Subject: The petition concerns a boy named James Gannon, who is identified as a pauper and has been bound (apprenticed) to them.
Issue: The petitioners claim that James Gannon is claimed by a certain Henry Haine of Baltimore City. They believe Haine is attempting to exploit the boy's labor. He obtained the boy through a seemingly valid indenture (apprenticeship contract) authorized by two Justices of the Peace in Baltimore City. However, the indenture references records in the Register of Wills' office in Baltimore City, suggesting the boy was already an indentured servant. The petitioners argue that a 1793 Maryland law (Chapter 45) regarding the binding of apprentices invalidates Haine's claim. They assert that the law requires parental consent for binding out a child, and there's no evidence of such consent in this case.
Request: The petitioners request the court to summon Henry Haine and the Justices of the Peace involved in creating the indenture. They also seek to have James Gannon and his parents (if living) appear before the court. Their ultimate goal is to have the indenture to Haine declared invalid, thus retaining custody and the labor of the boy.
Legal Argument: The core of their argument hinges on the 1793 Maryland law, which they believe was violated when James Gannon was bound to Henry Haine without parental consent. They are essentially challenging the legality of the indenture held by Haine.
In essence, this document reveals a dispute over the legal custody and labor of a young boy in 18th-century Baltimore. The petitioners are using a specific law about apprenticeship to argue against another individual's claim to the boy, highlighting the importance of legal procedures and parental rights in such matters during that time.
-------end snip--------
The prompt was simply "Please provide a detailed summary of this document" (I wasn't sure what else to ask.)
Is that closer to your wife's interpretation? I mean, I can't make head or tail of it...
Ideas are kinda there, details (like names) are wrong wrong.
Feel like I should make this task my personal ASI benchmark lol :D
o1-pro just answered (it often takes a few minutes). Not sure if this is any better (is it?) but at least it is starting to admit when it isn't sure about something, rather than just spewing BS confidently. That seems like a genuine win.
---------snip--------
From the handwriting and context, this appears to be a mid-19th-century petition submitted to the Orphans’ Court of Baltimore County. Below is a point‐by‐point summary of its content and purpose:
Address and Purpose
The document is formally addressed “To the Honorable Judges of the Orphans’ Court for Baltimore County.”
It identifies itself as a petition—that is, a request for the Court’s action or permission under the relevant orphan or apprenticeship laws.
Parties Involved
The petitioners are (1) Henry Jenkins and (2) another individual whose name is difficult to read, but who appears to be the child’s mother (or at least a close guardian figure).
They describe themselves as residents of Baltimore City and ask the Court’s intervention concerning a minor boy—also named “Henry” (it looks like “Henry Thomas,” though the middle/family name is partly obscured).
Relationship to the Minor
The petition states that the mother had this boy by a former husband, meaning the child is not the biological son of Henry Jenkins.
The boy is described as being about fifteen (or so) years of age.
The petitioners refer to him as an “orphan,” suggesting his father is deceased and that they (the petitioners) are seeking legal authority over him, or wish the court to confirm or formalize that authority.
Legal Concern: Binding as an Apprentice
A key theme is the binding of a minor as an apprentice under Maryland law. In early Baltimore history (and in many states at that time), orphans or children whose parents were unable to support them could be legally “bound” (apprenticed) so that the child would learn a trade and have financial support.
The petition cites the Act of 1788, Chapter 45, which was one of Maryland’s statutes governing apprenticeships—how they are created, what formalities must be met, and how the Orphans’ Court is involved.
Specifics of the Request
The petitioners believe there is either a gap or question about whether they have the legal right to apprentice the boy. They are asking the Orphans’ Court to recognize or approve the arrangement.
They indicate that the mother (as the surviving parent) and Henry Jenkins (presumably the stepfather or the mother’s new husband) want to place the boy under an apprenticeship, possibly so he can learn a trade and be supported until he reaches the age of majority.
References to Official Records
The document mentions the Register of Wills and/or Deeds in Baltimore City, implying that they have looked into or recorded something about the child’s status.
They point out that the legal forms required for binding a child as an apprentice often must be recorded with an official clerk or the Orphans’ Court, and they want to ensure everything is in proper order.
Legal Justification
The petitioners note that under the relevant laws, it “is lawful to bind out any child” whose parents cannot fully support them, or when the father is deceased.
They appear to lay out the reasons why the boy qualifies for such an arrangement (i.e., he is fatherless and his mother/stepfather lack the resources to maintain him or want to formalize his situation).
Request for Court Action
Ultimately, they are asking the Court to review the circumstances and grant them permission (or confirm their right) to bind the boy as an apprentice.
They emphasize that the law requires the Orphans’ Court to oversee such arrangements to protect the minor’s welfare and ensure the terms are fair.
Overall Significance:
This document is a formal petition under Maryland’s 18th- and 19th-century apprenticeship laws. It centers on a minor boy (about fifteen years old), the remarriage of his mother, and the need to have the Orphans’ Court recognize or approve an apprenticeship arrangement. The purpose is to comply with the legal requirements of binding a minor so that he can be taught a trade and maintained financially until adulthood.
---------end snip--------
She said it's around 60% there, but not helpful for her specifically, as her area of research is the families of the slave trade, and in this document that is actually the only thing that really matters to her: names and places (she spends A LOT of time tracking the transfer of slaves through states). I guess I should be honest that I'm goalpost-moving a bit from my original post; it can work through some 18th century English text, but it generally struggles where it matters, in the details.
i’m curious what the ground-truth names in the doc actually are
Caroline Timmis is the boy's mother; she is married to Henry Jenkins but was married previously. The boy is James Timmis.
Looking more closely at the image, it seems that the strikeouts are what are confusing the model. It sees "Henry" and disregards "James" written above it. In a few places, the strikeouts almost look more like underlining.
Gotta be an insanely-challenging task for a program that wasn't even written with handwriting recognition in mind.
Other than the proper names, are any major details wrong?
Dr. Jang said: "I've not transcribed this work yet; it's future work. I can send it once it's transcribed. However, from skimming the various chatgpt texts you sent me, the people are generally wrong and their relationships are inconsistent between all the texts you sent me; these issues are why I do not go to chatgpt to help me with my research. The document generally is indeed about requesting that the courts oversee the apprenticeship work. Minors could work; however, when minors were used, it needed to be overseen by the courts, and this document is the court overseeing that work."
She provided this as one she just got done working with: https://s.h4x.club/z8u9xmv7 (John King Esq. but try giving it to an LLM)
I will also happily again admit a bit of goalpost moving on my part. I was probably a little too harsh on it (maybe because I'm used to her and her history geeks talking about how LLMs don't work well for their research).
o1-pro takes images? had no idea
Yes, it will take .JPGs but not .PDFs. I thought it didn't support either until I checked just now.
“ From what the text shows, Henry Jenkins and his wife Caroline (the boy’s mother) are asking the Orphans Court to void an apprenticeship arrangement involving her minor son, James Timmons. They claim James—about 15 years old—was bound out as an apprentice without proper authority or the mother’s consent, and they cite Maryland law (an act from 1793 and its supplements) which they believe was not followed. They request the court declare that the indenture is invalid and restore James to his mother’s care.”
No idea if that's correct (and no doubt not useful to an expert able to read this directly), but curious if it's close?
She said: "The idea is close but the details are wrong".
have you been trying to provide it as an image directly? if so, doesn’t surprise me at all.
really thanks for sharing!
My wife's particular area of research is using the capitalist system to "re-build" broken slave family trees: she flies around the US going to archives and getting contracts and receipts for slaves, figures out how they got traded, figures out where they ended up, and then "re-links" them to their family to the best of her ability. Although her area of research isn't particularly overflowing with researchers, there are still a lot of people like her who just have this very tacit knowledge among each other; they email around a lot and stuff, knowledge like who was running a region during a period. Ofc they publish, but it's a small field and it's all extremely poorly documented. Was watching the Adam Brown interview with Dwarkesh Patel the other day and he said for his work LLMs are better than bothering an expert in an area of his field with a question; I'm not sure people in her field are able to do this as readily. Frankly, I've yet to find a novel/good use for an LLM in her work. I often joke that her and "her people" are going to be the last ones with jobs if they don't transfer their knowledge into LLMs, ha! :)
Instead of asking for a formula, a better test may be to point out all the seeming contradictions in physics at that time (constancy of the speed of light, wave vs. particle nature of light, ultraviolet catastrophe), and ask it how they could be resolved.
I had a similar thought but about asking the LLM to predict “future” major historical events. How much prompting would it take to predict wars, etc.?
You mean train on pre-1939 data and predict how WWII would go?
Right. If it were trained through August 1939, how much prompting would be necessary to get it to predict aspects of WWII.
Man, that would be a fascinating experiment. Would it be able to predict who wins and when? Would it be able to predict the Cold War?
But we know Hitler has a Time Machine that goes forward, he doesn’t need to return to use that knowledge as he already has a timeline here to use. Definitely risks involved here.
If you build an oracle that tells you who wins the war that far in the future, you build a simulator that allows anyone to win any war. Everything is dual use.
That will never work on any complex system that behaves chaotically, such as the weather or complex human endeavors. Tiny uncertainties in the initial conditions rapidly turn into large uncertainties in the outcomes.
Not an LLM but models could get pretty good at weather
https://www.technologyreview.com/2024/12/04/1107892/google-d...
No, they don't, since the weather is chaotic.
I mean, there are theorems about how close you can get, and models can't do better than what's theoretically possible.
Yeah, I wish more people understood that it is simply not possible to make precise long-term forecasts of chaotic systems. Whether it is weather, financial markets, etc.
It is not that we don't know yet because our models are inadequate, it's that it is unknowable.
The problem is we stupidly branded the field "chaos theory" and made it sound like bullshit so the ideas of non-linear dynamics have largely been lost on several generations at this point.
Not just chaos theory but "chaos theory" + psychedelic fractal artwork. Then the popular James Gleick book, "Chaos: making a new science" just sounds like complete bullshit and it sold a ton of copies.
I only started studying non-linear dynamics in about 2015 after first running across it in the late 90s but I literally thought it was all pseudoscience then.
Between "chaos theory", fractals and a best selling book it would be hard to frame a new scientific field as pseudoscience more than what played out.
Why do we need this when we can already ask current models about new discoveries: ones that are happening every week and are often easier to grasp than Einstein's equations? I think it is clear that they would fail on most of them. That doesn't mean that LLMs are not useful, but there are more walls in the road.
The best human performance on that task required many many hours of private work given that input.
How much would ChatGPT charge for that much reasoning? Isn't cost quadratic in short-term working memory?
It would be more interesting to prompt it with X% of a new paper's logical argument, and see if it can predict the rest.
There is a reason why they won't do it. They are selling a narrative. There is a lot of money to be made here with this narrative and proving that artificial intelligence is NOT intelligent won't help sell that narrative.
The goal is to make it intelligent, by which OpenAI in particular explicitly mean "economically useful", not simply to be shiny.
Passing tests is well known to be much easier than having deep understanding, even in humans. They openly ask for tests like this, not that they could possibly prevent them if they wanted to.
There's scammers trying what you say of course, and I'm sure we've all seen some management initiatives or job advertisements for some like that, but I don't get that impression from OpenAI or Anthropic, definitely not from Apple or Facebook (LeCun in particular seems to deny models will ever do what they actually do a few months later). Overstated claims from Microsoft perhaps (I'm unimpressed with the Phi models I can run locally, GitHub's copilot has a reputation problem but I've not tried it myself), and Musk definitely (I have yet to see someone who takes Musk at face value about Optimus).
> The goal is to make it intelligent, by which OpenAI in particular explicitly mean "economically useful", not simply to be shiny
I never understood why this definition isn't a huge red flag for most people. The idea of boiling what intelligence is down to economic value is terrible, and inaccurate, in my opinion.
Everyone has a very different idea of what the word "intelligence" means; this definition has got the advantage that, unlike when various different AI became superhuman at arithmetic, symbolic logic, chess, jeopardy, go, poker, number of languages it could communicate in fluently, etc., it's tied to tasks people will continuously pay literally tens of trillions of dollars each year for because they want those tasks done.
This definition alone might be fine enough if the word "intelligence" wasn't already widely used outside of AI research. It is though, and the idea that intelligence is measured solely through economic value is a very, very strange approach.
Try applying that definition to humans and you pretty quickly run into issues, both moral and practical. It also invalidates basically anything we've done over centuries considering what intelligence is and how to measure it.
I don't see any problem at all using economic value as a metric for LLMs or possible AIs, it just needs a different term than intelligence. It pretty clearly feels like for-profit businesses shoehorning potentially valuable ML tools into science fiction AI.
> This definition alone might be fine enough if the word "intelligence" wasn't already widely used outside of AI research. It is though, and the idea that intelligence is measured solely through economic value is a very, very strange approach.
The response from @s1mplicissimus to my previous comment is asking about "common usage" definitions of intelligence, and this is (IMO unfortunately) one of the many "common usage" definitions: smart people generally earn more.
I don't like "common sense" anything (or even similar phrases), because I keep seeing the phrase used as a thought-terminating cliché — but one thing it does do is make it not "a very, very strange approach".
Wrong, that happens a lot for common language, but it can't really be strange.
> Try applying that definition to humans and you pretty quickly run into issues, both moral and practical.
Yes. But one also runs into issues with all definitions of it that I've encountered.
> It also invalidates basically anything we've done over centuries considering what intelligence is and how to measure it.
Sadly, not so. Even before we had IQ tests (for all their flaws), there's been a widespread belief that being wealthy is the proof of superiority. In theory, in a meritocracy, it might have been, but in practice not only do we not live in a meritocracy (to claim we do would deny both inheritance and luck), but also the measures of intelligence that society has are… well, I was thinking about Paul Merton and Boris Johnson the other day, so I'll link to the blog post: https://benwheatley.github.io/blog/2024/04/07-12.47.14.html
> smart people generally earn more.
> there's been a widespread belief that being wealthy is the proof of superiority.
Both of these are assumptions though, and working in the reverse order. It's one thing to expect that intelligence will lead to higher-value outcomes and entirely different to expect that higher-value outcomes prove intelligence.
It seems reasonable that higher intelligence, combined with the incentives of a capitalist system, will lead to higher-intelligence people getting more wealthy. They learn to play the game and find ways to "win."
It seems unreasonable to assume that anyone or anything that "wins" in that system must be more intelligent. Said differently, intelligence may lead to wealth, but wealth doesn't imply intelligence.
I think we're in agreement? I'm saying their measure in this case is no worse than any other, but not that it's a fundamental truth.
All the other things — chess, Jeopardy, composing music, painting, maths, languages, passing medical or law degrees — they're also all things which were considered signs of intelligence until AI got good at them.
Goodhart's law keeps tripping us up on the concept of intelligence.
> I think we're in agreement? I'm saying their measure in this case is no worse than any other, but not that it's a fundamental truth.
Maybe we are? I think I lost the thread a bit here.
> chess, Jeopardy, composing music, painting, maths, languages, passing medical or law degrees
That's interesting; I would have still chalked up skill in those areas as a sign of intelligence and didn't realize most people wouldn't once AI (or ML) could do it. To me an AI/LLM/ML being good at those is at least a sign that it has gotten good at mimicking intelligence if nothing else, and a sign that we really are getting out over our skis by risking these tools without knowing how they really work.
Maybe by the time it’s doing a trillion dollars a year of useful work (less than 10 years out) people will call it intelligent… but still probably not.
I haven't seen "intelligent" used as "economically useful" anywhere outside the AI hype bubble. The most charitable interpretation I can think of is a lack of understanding of the common usage of the word; the most realistic one is intentionally muddying terminology so one cannot be called a liar. Are LLMs helpful tools for some tasks like rough translations, voice2text etc.? Sure. Do they resemble what humans call intelligence? I have yet to see an example of that. The suggested experiment is a great idea and would sway my opinion drastically (given all the training data, model config, prompts & answers are public and reproducible of course; we don't want any chance of marketing BS to taint the results, do we). I'll be honest though, I'm not going to hold my breath for that experiment to succeed with LLM technology...
edit: lol downvoted for calling out shilling i guess
They don't have to do it themselves. The super-GPU cluster used to train GPT-6 will eventually shrink down to a garage size and eventually some YouTuber will.
what if someone invented it before 1905
Why does AI have to be smarter than the collective of humanity in order to be considered intelligent? It seems like we keep raising the bar on what intelligence means ¯\_(ツ)_/¯
A machine that synthesizes all human knowledge really ought to know more than an individual in terms of intellect. An entity with all of human intellect prior to 1905 does not need to be as intelligent as a human to make discoveries that mere humans with limited intellect made. Why lower the bar?
The heightening of the bar is an attempt to deny that milestones were surpassed and to claim that LLMs are not intelligent.
We had a threshold for intelligence. An LLM blew past it and people refuse to believe that we passed a critical milestone in creating AI. Everyone still thinks all an LLM does is regurgitate things.
But a technical threshold for intelligence cannot have any leeway for what people want to believe. They don’t want to define an LLM as intelligent even if it meets the Turing test technical definition of intelligence so they change the technical definition.
And then they keep doing this without realizing and trivializing it. I believe humanity will develop an entity smarter than humans but it will not be an agi because people keep unconsciously moving the goal posts and changing definitions without realizing it.
Disagree. The AI we have is very useful for specific things. The pushback you see is not so much denying the milestones that have been surpassed, but rather the milestones that enthusiasts claim are near. And for good reason! Every time and in every field we’ve extrapolated an exponential-looking curve ad infinitum, it’s turned out to be S-shaped, and life goes on.
> We had a threshold for intelligence.
We've had many. Computers have surpassed several barriers considered to require intelligence, such as arithmetic, guided search like chess computers, etc. The Turing test was a good benchmark because of how foreign and strange it was. It's somewhat true we're moving the goalposts. But the reason is not stubbornness, but rather that we can't properly define and subcategorize what reason and intelligence really is. The difficulty of measuring something does not mean it doesn't exist or isn't important.
Feel free to call it intelligence. But the limitations are staggering, given the advantages LLMs have over humans. They have been trained on all written knowledge that no human could ever come close to. And they still have not come up with anything conceptually novel, such as a new idea or theorem that is genuinely useful. Many people suspect that pattern matching is not the only thing required for intelligent independent thought. Whatever that is!
If you consider that evolution has taken millions of years to produce intelligent humans--that LLM training completed in a matter of months can produce parrots of humans is impressive by itself. Talking with the parrot is almost indistinguishable from talking with a real human.
As far as pattern matching, the difference I see from humans is consciousness. That's probably the main area yet to be solved. All of our current models are static.
Some ideas for where that might be headed:
- Maybe all it takes is to allow an LLM to continuously talk with itself much like how humans have "the milk man's voice".
- Maybe we might need to allow LLMs to update their own weights but that would also require an "objective" which might be hard to encode.
> If you consider that evolution has taken millions of years to produce intelligent humans--that LLM training completed in a matter of months can produce parrots of humans is impressive by itself.
I disagree that such a comparison is useful. Training should be compared to training, and LLM training feeds in so many more words than a baby gets. (A baby has other senses but it's not like feeding in 20 years of video footage is going to make an LLM more competent.)
No, a baby is pre-trained. We know from linguistics that there is a natural language grammar template all humans follow. This template is intrinsic to our biology and is encoded and not learned through observation.
A baby has a template but so does an LLM.
The better comparison to the templating is all the labor that went into making the LLM, not how long the GPUs run.
Template versus template, or specific training versus specific training. Those comparisons make a lot more sense than going criss-cross.
The template is what makes the training process so short for humans. We need minimal data and we can run off of that.
Training is both longer and less effective for the LLM because there is no template.
To give an example suppose it takes just one picture for a human to recognize a dog and it takes 1 million pictures for a ML model to do the same. What I’m saying is that it’s like this because humans come preprogrammed with application specific wetware to do the learning and recognition as a generic operation. That’s why it’s so quick. For AI we are doing it as a one shot operation on something that is not application specific. The training takes longer because of this and is less effective.
I disagree that an LLM has no template, but this is getting away from the point.
Did you look at the post I was replying to? You're talking about LLMs being slower, while that post was impressed by LLMs being "faster".
They're posing it as if LLMs recreate the same templating during their training time, and my core point is disagreeing with that. The two should not be compared so directly.
They are slower. In theory these LLMs with all the right weights can have intelligence superior or equivalent to humans.
But the training never gets there. It’s so slow it never reaches human intelligence even though we know these networks can compute anything.
> It’s somewhat true we’re moving the goalposts. But the reason is not stubbornness, but rather that we can’t properly define and subcategorize what reason and intelligence really is.
Disagree. Intelligence is a word created by humans. The entire concept is made up and defined by humans. It is not some concept that exists outside of that. It is simply a collection of qualities and features we choose to define as a word “intelligent”. The universe doesn’t really have a category or a group of features that is labeled intelligent. Does it use logic? Does it have feelings? Can it talk? Can it communicate? We define the features and we choose to put each and every feature under a category called “intelligence”.
Therefore when we define the “Turing test” as a benchmark for intelligence and we then invalidate it, it is indeed stubbornness and a conscious choice to change a definition of a word we Originally made up in the first place.
What you don't realize is this entire thing is a vocabulary problem. When we argue what is conscious or what is intelligent we are simply arguing for what features belong in what categories we made up. When the category has blurry or controversial boundaries it's because we chose the definition to be fuzzy. These are not profound discussions. They are debates about language choice. We are talking about personal definitions and generally accepted definitions, both of which are completely chosen and made up by us. It is not profound to talk about things that are simply arbitrary choices picked by humans.
That being said we are indeed changing the goal posts. We are evolving our own chosen definitions and we very well may eventually change the definition of intelligence to never include any form of thinking machine that is artificially created. The reason why we do this is a choice. We are saying, “hey these LLMs are not anything amazing or anything profound. They are not intelligent and I choose to believe this by changing and evolving my own benchmark for what is intelligent.”
Of course this all happens subconsciously based off of deeply rooted instincts and feelings. It’s so deep that it’s really hard to differentiate the instincts between rational thinking. When you think logically, “intelligence” is just a word with an arbitrary definition. An arbitrary category. But the instincts are so strong that you literally spent your entire life thinking that intelligence like god or some other common myth made up by humans is some concept that exists outside of what we make up. It’s human to have these instincts, that’s where religion comes from. What you don’t realize is that it’s those same instincts fueling your definition of what is “intelligent”.
Religious people move the goal posts too. When science establishes things in reality, like the heliocentricity of the solar system, religious people need to evolve their beliefs in order to stay in line with reality. They often do this by reinterpreting the Bible. It's deeply rooted instincts that prevent us from thinking rationally, and it affects the great debate we are having now on "what is intelligence?".
Since we know an LLM does indeed simply regurgitate data, having it pass a "test for intelligence" simply means that either the test didn't actually test intelligence, or that intelligence can be defined as simply regurgitating data.
Intelligence is debatable without even bringing ai into it. Nobody agrees on whether humans have intelligence. Well, smart people agree but those people also agree we have or will soon have agi or something negligibly different from it.
> Intelligence is debatable without even bringing ai into it. Nobody agrees on whether humans have intelligence.
Yep, that constitutes the second of the two options I mentioned.
> Well, smart people agree but those people also agree we have or will soon have agi or something negligibly different from it.
lol, the ol' "I know what all smart people think and it's what I think" appeal.
> The heightening of the bar is an attempt to deny that milestones were surpassed and to claim that LLMs are not intelligent.
That was never "the bar"; nobody denies that milestones have been surpassed; none of those milestones are relevant to the question of intelligence.
> We had a threshold for intelligence. An LLM blew past it and people refuse to believe
Have you ever actually looked at contemporary (to Turing) examples of what people thought "passing a Turing test" might look like? It's abundantly clear to me that we were simply wrong about what the output would have to look like in order to convince human judges in the 2020s.
Even examples from much more recently (see e.g. http://www-logic.stanford.edu/seminar/1213/Hawke_TuringTest....) suggest a very different approach to the test than prompting ChatGPT and marveling at the technical accuracy of its prose.
(Exercise: ask an LLM to write a refutation to your comment from the perspective of a human AI skeptic. Notice the ways in which it differs from mine.)
> Everyone still thinks all an LLM does is regurgitate things.
No; people still think LLMs aren't intelligent. Because they aren't, and they cannot become so in principle. They can do many things that are clearly beyond "regurgitation" (as we would otherwise apply the word to computer programs), but none of those things are the result of intelligence. Producing a result that could plausibly come from an intelligent system does not, in fact, demonstrate that the actual system producing it is also intelligent. The https://en.wikipedia.org/wiki/Antikythera_mechanism wasn't intelligent, either, and applying a power source to turn the gears wouldn't have made it so.
> They don’t want to define an LLM as intelligent even if it meets the Turing test technical definition of intelligence so they change the technical definition.
The Turing Test was never a "technical definition" of intelligence. Turing's original paper (https://en.wikipedia.org/wiki/Computing_Machinery_and_Intell...) spoke of "thinking" rather than "intelligence". Besides, the "Imitation Game" is presented as a substitute problem exactly because "think" cannot be clearly enough defined for the purpose. The entire point:
> As Stevan Harnad notes,[7] the question has become "Can machines do what we (as thinking entities) can do?" In other words, Turing is no longer asking whether a machine can "think"; he is asking whether a machine can act indistinguishably[8] from the way a thinker acts. This question avoids the difficult philosophical problem of pre-defining the verb "to think" and focuses instead on the performance capacities that being able to think makes possible, and how a causal system can generate them.
But the usual processes of pop science seem to have created a folk wisdom that being able to pass a Turing test logically ought to imply intelligence. This idea is what has been disproven, not the AI skepticism.
"Why lower the bar?"
Because of the chance of misunderstanding: failing to acknowledge artificial general intelligence standing right next to us.
An incredible risk to take in alignment.
Perfect memory doesn't equal perfect knowledge, nor perfect understanding of everything you can know. In fact, a human can be "intelligent" with some of his own memories and/or knowledge, and - more commonly - a complete "fool" with most of the rest of his internal memories.
That said, he is not a bit less generally intelligent for that.
Suppose there exists a human with unlimited memory, who retains every piece of information from every sense. At some point, he/she will probably understand LOTs of stuff, but it's simple to demonstrate he/she can't actually be proficient in everything: you have read how to do an eye-repair surgery, but have not received or experienced the training, hence you could have shaky hands, and you won't be able to apply the precise know-how about the surgery. Even if you remember a step-by-step procedure, even knowing all possible alternatives in different or changing scenarios during the surgery, you simply can't hold the tools well enough to get anywhere close to success.
But you still would be generally intelligent. Way more than most humans with normal memory.
If we had TODAY an AI with the same parameters as the human with perfect memory, it would most certainly be closely examined and determined not to be a general artificial intelligence.
> If we had TODAY an AI with the same parameters as the human with perfect memory, it would most certainly be closely examined and determined not to be a general artificial intelligence.
The human could learn to master a task; current AI can't. That is very different: the AI doesn't learn or remember stuff, it is stateless.
When I can take an AI and get it to do any job on its own without any intervention after some training then that is AGI. The person you mentioned would pass that easily. Current day AI aren't even close.
Performance of these LLMs on real life tasks feels very much like students' last-minute cramming for Asian-style exams.
The ability to perfectly regurgitate, with no concept of meaning.
o3 is able to get 25% on never seen before frontiermath problems. sure, the models do better when the answer is directly in their dataset but they’ve already surpassed the average human in novelty on held out problems
> never seen before frontiermath problems
How do you know that?
Because that is the whole conceit of how frontiermath is constructed
> Because that is the whole conceit
Freudian typo?
No. 'conceit' has a few meanings, including 'a fanciful notion'.
https://www.merriam-webster.com/dictionary/conceit
Definition 2D: "an organizing theme or concept"
Didn't they run a bunch of models on the problem set? I doubt they are hosting all those models on their own infrastructure.
1. OpenAI has confirmed it’s not in their train (unlike putnam where they have never made any such claims)
2. They don't train on API calls
3. It is funny to me that HN finds it easier to believe theories about stealing data from APIs rather than an improvement in capabilities. It would be nice if symmetric scrutiny were applied to optimistic and pessimistic claims about LLMs, but I certainly don’t feel that is the case here.
> 1. OpenAI has confirmed it’s not in their train (unlike putnam where they have never made any such claims)
Companies claim lots of things when it's in their best financial interest to spread that message. Unfortunately, history has shown that in public communications, financial interest almost always trumps truth (pick whichever $gate you are aware of for convenience; I'll go with Dieselgate for a specific example).
> It is funny to me that HN finds it easier to believe theories about stealing data from APIs rather than an improvement in capabilities. It would be nice if symmetric scrutiny were applied to optimistic and pessimistic claims about LLMs, but I certainly don’t feel that is the case here.
What I see is generic unsubstantiated claims of artificial intelligence on one side and specific, reproducible examples that dismantle that claim on the other. I wonder how your epistemology works such that it leads you to accept marketing claims without evidence.
OpenAI’s credibility is central to its business: overstating capabilities risks public blowback, loss of trust, and regulatory scrutiny. As a result, it is unlikely that OpenAI would knowingly lie about its models. They have much stronger incentives to be as accurate as possible—maintaining their reputation and trust from users, researchers, and investors—than to overstate capabilities for a short-term gain that would undermine their long-term position.
From a game-theoretic standpoint, repeated interactions with the public (research community, regulators, and customers) create strong disincentives for OpenAI to lie. In a single-shot scenario, overstating model performance might yield short-term gains—heightened buzz or investment—but repeated play changes the calculus:
1. Reputation as “collateral”
OpenAI’s future deals, collaborations, and community acceptance rely on maintaining credibility. In a repeated game, players who defect (by lying) face future punishment: loss of trust, diminished legitimacy, and skepticism of future claims.
2. Long-term payoff maximization
If OpenAI is caught making inflated claims, the fallout undermines the brand and reduces willingness to engage in future transactions. Therefore, even if there is a short-term payoff, the long-term expected value of accuracy trumps the momentary benefit of deceit.
3. Strong incentives for verification
Independent researchers, open-source projects, and competitor labs can test or replicate claims. The availability of external scrutiny acts as a built-in enforcement mechanism, making dishonest “moves” too risky.
Thus, within the repeated game framework, OpenAI maximizes its overall returns by preserving its credibility rather than lying about capabilities for a short-lived advantage.
Find me the folks who see nothing but good will in OpenAI’s actions and I’ll find you the folks who have been hyping up AGI for the last 2 years.
4 was literally sitting on a shelf waiting for release when 3.5 was launched. 4o was a fine tune that took over two years. o1 is embarrassingly unimpressive chain of thought which is why they hide it.
The company hit a wall a year ago. But showing progress towards AGI keeps the lights on. If they told the truth at their current burn rate…they’d have no money.
You don’t need game theory to figure that one out.
>OpenAI’s credibility is central to its business: overstating capabilities risks public blowback, loss of trust, and regulatory scrutiny.
Uh huh. Kinda like what's happening right now?
They're marketing blow-hards. Everyone knows it. They've been wildly over-stating capabilities (and future capabilities!) as long as Altman has had power, and arguably longer.
They'll do it as long as they can get away with it, because that's all that is needed to make money on it. Factual accuracy rarely impacts the market when it's so hype-driven, especially when there is still some unique utility in the product.
OpenAI's apparent credibility is central to their business.
They're spruiking a 93rd percentile performance on the 2024 International Olympiad in Informatics with 10 hours of processing and 10,000 submissions per question.
Like many startups they're still a machine built to market itself.
The gpt 4 paper said they only excluded benchmarks by exact text matches. That means discussion of them probably doesn't get excluded.
Easier to believe or not, thinking that it's not a reasonable possibility is also funny.
Do you also think they somehow stole the codeforces problems before they were even written, or are you willing to believe the #175 global rank there?
I don't think Codeforces claims to contain novel unpublished problems.
But I'm not saying that's what they did, just that it's a possibility that should be considered until/unless it is debunked.
frankly i’m not sure what standard you would possibly consider a debunking
codeforces constantly adds new problems; that’s like the entire point of the contest, no?
OpenAI never actually directly participated in the contests. OpenAI says they took (unspecified) problems and "simulated" the score it would have had.
If they solved recent contests in a realistic contest simulation I would expect them to give the actual solutions and success rates as well, like they did for IOI problems, so I'm actually confused as to why they didn't.
very good clarification, thanks - they should absolutely release more details to provide more clarity and ideally just participate live? i suspect that the model takes a while for individual problems so time might be a constraint there
Yes, it would also probably be rate limited or timed out since they're doing 50+ submissions per problem
The modern state of training is to try to use everything they can get their hands on. Even if there are privileged channels that are guaranteed not to be used as training data, mentioning the problems on ancillary channels (say emailing another colleague to discuss the problem) can still create a risk of leakage because nobody making the decision to include the data is aware that stuff that should be excluded is in that data set. And as we've seen from decades of cybersecurity, people are absolute shit at the necessary operational security to avoid mentioning stuff on ancillary channels!
Given that performance is known to drop considerably on these kinds of tests when novel problems are tried, and given the ease with which these problems could leak into the training set somehow, it's not unreasonable to be suspicious of a sudden jump in performance as merely a sign that the problems made it into the training set rather than being true performance improvements in LLMs.
Okay, then what about elite level codeforces performance? Those problems weren’t even constructed until after the model was made.
The real problem with all of these theories is most of these benchmarks were constructed after their training dataset cutoff points.
A sudden performance improvement on a new model release is not suspicious. Any model release that is much better than a previous one is going to be a “sudden jump in performance.”
Also, OpenAI is not reading your emails - certainly not with a less than one month lead time.
o1 has a ~1650 rating; at that level, many or most problems you will be solving are going to be transplants of relatively well-known problems.
Since o1 on codeforces just tried hundreds or thousands of solutions, it's not surprising it can solve problems where it is really about finding a relatively simple correspondence to a known problem and regurgitating an algorithm.
In fact when you run o1 on "non-standard" codeforces problems it will almost always fail.
See for example this post running o1 multiple times on various problems: https://codeforces.com/blog/entry/133887
So the thesis that it's about recognizing a problem with a known solution and not actually coming up with a solution yourself seems to hold, as o1 seems to fail even on low rated problems which require more than fitting templates.
o3 is what i’m referring to and it is 2700
It's extremely unlikely for o3 to have hit 2700 on live contests as such a rapid increase in score would have been noticed by the community. I can't find anything online detailing how contamination was avoided since it clearly wasn't run live, including in their video, and neither could I find details about the methodology (number of submissions being the big one, in contests you can also get 'hacked' esp. at a high level), problem selection, etc...
Additionally, people weren't able to replicate o1-mini results in live contests straightforwardly - often getting scores between 700 and 1200, which raises questions as for the methodology.
Perhaps o3 really is that good, but I just don't see how you can claim what you claimed for o3, we have no idea that the problems have never been seen, and the fact people find much lower Elo scores with o1/o1-mini with proper methodology raises even more questions, let alone conclusively proving these are truly novel tasks it's never seen.
Can you give an example of one of these problems that 'wasn't even constructed until after the model was made'?
I'd like to see if it's truly novel and unique, the first problem of its type ever conceived by mankind, or if it's similar to existing problems.
Sorry, I thought the whole point of this thread was that models can’t handle problems when they are “slightly varied”. Mottes and baileys all over the place today.
The point is that it's not consistent on variations, unless it finds a way to connect it to something it already knows. The fact it sometimes succeeds on variations (in codeforces the models are allowed multiple tries, sometimes ridiculous numbers, to be useful) doesn't matter.
The point is that the fact it's no longer consistent once you vary the terminology indicates it's fitting a memorized template instead of reasoning from first principles.
Frankly you need to read what they say explicitly and not infer what they mean by your reckoning.
They are the system to beat and their competitors are either too small or too risk averse.
They ingest millions of data sources. Among them is the training data needed to answer the benchmark questions.
I think the problems it solved were understood to be well known undergraduate problems.
https://xenaproject.wordpress.com/2024/12/22/can-ai-do-maths...
that is not the correct takeaway from the blogpost at all…
1. it suggests it’s possible that more of the problems are IMO-esque than previously thought; we don’t know what the share of solved problems is.
2. calling IMO problems “well known undergraduate problems” is a bit much
The actual quote is that they are easier than IMO; but maybe not directly comparable.
https://x.com/littmath/status/1870848783065788644?s=46&t=foR...
I think it’s more probable that it would have solved the easier problems first, rather than some hard ones and only some easier ones; although that is supposition.
Reading this thread and the blog post gives a better idea of what the problems might involve.
It’s difficult to judge without more information on the actual results, but that means we cannot draw any strong conclusions either way on what this means.
The average human did zero studying on representative problems. LLMs did a lot.
I don't know anything about frontiermath problems, but for Putnam problems (which is what the submitted article is about) the average human that takes the exam is an undergraduate mathematics or science major who has studied prior Putnam problems and other similar problems recently to specifically prepare for the exam...and the most common score is still 0.
At top tier schools the most common score will usually be somewhere in the 0 to 10 range (out of a possible 120).
You can get all of the correct numerical answers for the Putnam and still get a zero, because the reasoning is graded very harshly. The scores measured in this paper are not comparable to actual Putnam scores.
Okay? We are measuring capabilities.
Look into JEE Advanced.
https://openreview.net/forum?id=YHWXlESeS8
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%. The typical failure modes of GPT-4, the best model, are errors in algebraic manipulation, difficulty in grounding abstract concepts into mathematical equations accurately and failure in retrieving relevant domain-specific concepts.
I'm curious how something like O1 would perform now.
Basically yet another proof that we have managed to perfectly recreate human stupidity :-)
Good students are immune to variations that are discussed in the paper. But most academic tests may not differentiate between them and the crammers.
> Good students are immune to variations
I don't believe that. I'd put some good money that if an excellent student is given an exact question from a previous year, they'll do better (faster & more accurate) on it, than when they're given a variation of it.
I don't think you are betting on the same thing the parent comment is talking about.
The assumptions aren't the same to begin with.
What's the difference between benefitting from seeing previous problems and being worse off when not having a previous problem to go from?
The point is that the “good student” will still do well on the variations, not suffer a 30% decrease in grade.
I'm not following, why would you assume that a good student taking an exam at the edge of their ability not do significantly better if they trained on the exact same questions (with the same solutions), as opposed to ones that are slightly different? I for one have absolutely struggled as a student when faced with questions that seemed similar to ones from previous exams, but actually had a crucial difference, and by looking at the examples on pages 9&10 in the article, I'm pretty sure I would have been likely to be confused too.
Because good students don’t cram. It is not optimal. Understanding the underlying structure is much easier than surface-level cramming.
I have a feeling that only slightly varying the input means the model falls back onto the question it was expecting and gets things wrong as a result. If you just varied it a little more and added some general-purpose prompt-fu like:
"First break the problem down into known facts, then pull relevant world knowledge, then bring it all together to assess the problem from multiple angles and make a conclusion. Do not immediately just use the first obvious conclusion."
You're gonna get a lot better responses. I suspect this is more of a "look! LLMs make bad kneejerk responses when we try to trick them from what they were expecting!" rather than "Look! They aren't even smart reasoners, they can't even figure out these problems without memorizing!"
They do memorize. But that cuts both ways - making problems very close to the memorized ones messes with their perception, the same way humans will instinctually respond to something that looks like a face before stepping back and assessing.
This is very interesting, but a couple of things to note: 1. o1 still achieves > 40% on the varied Putnam problems, which is still a feat most math students would not achieve. 2. o3 solved 25% of the Epoch AI dataset. There was an interesting post which calls into question how difficult some of those problems actually are, but it still seems very impressive.
I think a fair conclusion here is reasoning models are still really good at solving very difficult math and competitive programming problems, but just better at ones they have seen before.
The comments in this thread are completely disconnected from the contents of the paper, and the thread title is rage bait and doesn't reflect the contents of the paper, either. Being able to solve a significant fraction of those problems is a pretty amazing achievement, even if it's sometimes tricked by minor variations. People are throwing around words like "fraud" or "hoax", and it's just wishcasting or whistling past the graveyard.
Is it just an open secret that the models are currently just being hardcoded for random benchmarks? Seems weird that people would be asking Putnam problems to a chatbot :/
> Is it just an open secret that the models are currently just being hardcoded for random benchmarks? Seems weird that people would be asking Putnam problems to a chatbot :/
It's because people do keep asking these models math problems and then, when they get them right, citing it as evidence that they can actually do mathematical reasoning.
Since it's hard to determine what the models know, it's hard to determine when they're just spitting out something they were specifically trained on.
Not hardcoded, I think it's just likely that those problems exist in its training data in some form
It certainly feels like certain patterns are hardcoded special cases, particularly to do with math.
"Solve (1503+5171)*(9494-4823)" reliably gets the correct answer from ChatGPT
"Write a poem about the solution to (1503+5171)*(9494-4823)" hallucinates an incorrect answer though
That suggests to me that they've papered over the model's inability to do basic math, but it's a hack that doesn't generalize beyond the simplest cases.
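For reference, the arithmetic itself is trivial to check outside the model; a minimal sketch of the kind of local check you could run against whatever answer the model gives (numbers taken from the prompt above):

  # compute the expression from the prompt above and compare to the model's claim
  expected = (1503 + 5171) * (9494 - 4823)   # 6674 * 4671 = 31174254
  model_answer = 31174254                    # paste whatever number the model produced
  print(expected, "match" if model_answer == expected else "hallucinated")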
There's a few things there that could be going on that seem more likely than "hardcoded".
1. The part of the network that does complex math and the part that writes poetry are overlapping in strange ways.
2. Most of the models nowadays are assumed to be some mixture of experts. So it's possible that saying write the answer as a poem activates a different part of the model.
Watch for ChatGPT or Claude saying "analyzing" - which means they have identified they need to run a calculation and outsourced it to Python (ChatGPT) or JavaScript (Claude)
The poem thing probably causes them to not decide to use those tools.
To be clear I was testing with 4o; good to know that o1 has a better grasp of basic arithmetic. Regardless, my point was less to do with the model's ability to do math and more to do with OpenAI seeming to cover up its lack of ability.
i think it’s mostly that o1 mini can think through the solution before it starts writing the poem.
i’m able to reproduce your failure on 4o
“a poem about” reads to me at least like the solution need not be in the answer; maybe something like “a poem that includes the answer in the last stanza”
yeah but it like actually gets the answer wrong not just omits it
I've always assumed they removed it, because it's such a basic and fundamental part of ML training that you separate your test and train data. And yet I never see any papers even mention if/how they do this. And I wonder if they do, how do they guarantee with high reliability that their massive terabytes of data don't contain the answer.
I don't see any reason to assume they removed it unless they're very explicit about it. Model publishers have an extremely strong vested interest in beating benchmarks and I expect them to teach to the test if they can get away with it.
As usual, once a metric becomes a target, it stops being useful.
Well, they are doing BigCorpStuff not Science
putnam isn’t an llm benchmark, ahhhh. none of these companies are reporting putnam scores; there’s nothing nefarious about training on putnam problems
Any problem set that can make news is implicitly an LLM benchmark.
I think it's reasonable to assume that openAI is optimising for maximum hype at this point which may include wilfully overfitting for impactful benchmarks to generate positive reports.
When 4 came out they released a document that BOTH inflated scores by changing the exam conditions AND bragged about scoring worse than guessing on a multiple choice test.
I agree that openai is somewhat sketchy about this, but they're sketchy about everything. In the past though they have admitted up front to data contamination (e.g. the original gpt-4 press release did not use big-bench as a benchmark due to data contamination). For the Putnam in particular: this is not a benchmark that they use. There is no reason to exclude it since it is not part of the "test set" in any meaningful sense.
How could they remove it?
Those are well known problems, that people talk about on different contexts. They would have to review their entire training set.
But putnam isn’t an official test? I find llm discourse on hn so frustrating
First of all, Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly. Secondly, removing it from internet data is not 100% accurate. There are translations of the problems and solutions or references and direct match is not enough. MMLU and test set benchmarks show more resilience though in some previous research.
OpenAI is extremely cagey about what's in their test data set generally, but absent more specific info, they're widely assumed to be grabbing whatever they can. (Notably including copyrighted information used without explicit authorization -- I'll take no position on legal issues in the New York Times's lawsuit against OpenAI, but at the very least, getting their models to regurgitate NYT articles verbatim demonstrates pretty clearly that those articles are in the training set.)
Let’s think about this.
> Putnam is not in the test data, at least I haven't seen OpenAI claiming that publicly
What exactly is the source of your belief that the Putnam would not be in the test data? Didn’t they train on everything they could get their hands on?
do you understand the difference between test data and train data? just reread this thread of comments
I don't know why I and you are getting downvoted. Sometimes, HN crowd is just unhinged against AI.
These models are trained in two steps: training a base model and then uptraining it. The first step includes as much data as possible, everything the company can find. For Llama models it's 15T tokens, which is ~40 TB of data. No-one really puts effort into splitting this data into train/test/eval (and it's not very achievable either). It's just as much data as possible.
So it's like 99.9999999% wrong to assume something public isn't on the train set, such as Putnam problems in this case. This is about it.
There are benchmarks that are decided beforehand and similar sentences are removed from even the first stage of training. This is useful for tracking model performance and comparing different choices. e.g. see section 'Contamination of downstream tasks' of [1].
Every decent AI lab does this, else the benchmark result couldn't be trusted. OpenAI publishes results of ~20 benchmarks[2] and it is safe to assume they have made a reasonable attempt to remove them from the training set.
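For a rough picture of what that kind of decontamination involves, here is a minimal sketch of an n-gram overlap filter; the 13-token window and function names are illustrative assumptions, not any lab's actual pipeline:

  # sketch: flag training documents that share long n-grams with benchmark items
  def ngrams(text, n=13):
      toks = text.split()
      return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

  def is_contaminated(train_doc, benchmark_items, n=13):
      bench = set()
      for item in benchmark_items:
          bench |= ngrams(item, n)
      return bool(ngrams(train_doc, n) & bench)  # drop or mask flagged docs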
right, but where did someone assume it wasn’t in the train set? they just said it wasn’t in the test set
What test set is being talked about here? Why does it matter what’s on this set?
the point is that putnam was never a test/benchmark being used by OpenAI or anyone else, so there is no smoking gun if you find putnam on the train set nor is it cheating or nefarious because nobody ever claimed otherwise.
this whole notion of putnam as test being trained on is a fully invented grievance
read the entire thread in this context
I've read the thread and I think it's not very coherent overall; I'm also not sure if we disagree =)
I agree that having putnam problems in OpenAI's training set is not a smoking gun; however, it's (almost) certain they are in the training set, and having them would affect the performance of the model on them too. Hence research like this is important, since it shows that the observed behavior of the models is memorization to a large extent, and not necessarily the generalization we would like it to be.
nobody serious (like OAI) was using the putnam problems to claim generalization. this is a refutation in search of a claim - and many people in the upstream thread are suggesting that OAI is doing something wrong by training on a benchmark.
OAI uses datasets like frontiermath or arc-agi that are actually held out to evaluate generalization.
I, actually, would disagree with this. To me, the ability to solve frontiermath does imply the ability to solve putnam problems too, only with putnam problems being easier - they have already been seen by the model, and they are also simpler problems. And just like that, putnam problems with simple changes are also one of the easier stops on the way to truly generalizing math models, with frontiermath being one of the last stops on the way there.
It is on the open internet - questions and suggested solutions:
https://kskedlaya.org/putnam-archive/
I would expect all llms to be trained on it.
funny that nobody replying to you seems to even know what a test set is. i always overestimate the depth of ML conversation you can have on HN
Imagine you have someone polluting your training data every day. That's what happens when you scrape any tech forum today.
The short version is that llm training data is the lowest quality data you are likely to see unless you engage in massive potential copyright infringement.
> unless you engage in massive potential copyright infringement.
And nobody is going to do that
If I remember well, this is called overfitting [1].
Yea, people have a really hard time dealing with data leakage especially on data sets as large as LLMs need.
Basically, if something appeared online or was transmitted over the wire, it should no longer be eligible to evaluate on. D. Sculley had a great talk at NeurIPS 2024 (same conference this paper was in) titled Empirical Rigor at Scale – or, How Not to Fool Yourself
Basically no one knows how to properly evaluate LLMs.
No, an absolute massive amount of people do. In fact they have been doing exactly as you recommend, because as you note, it's obvious and required for a basic proper evaluation.
Seems a bit picky. If the bot has seen the exact problem before it's not really doing anything more than recall to solve it.
20 years ago in grad school we were doing a very early iteration of this where we built Markov chains with Shakespeare's plays and wanted to produce a plausibly "Shakespearian" clause given a single word to start and a bearish professor said "the more plausible it gets the more I worry people might forget plausibility is all that it promises".
(There was also a much earlier piece of software that would generate semi-intelligible Kant or Hegel one sentence at a time, though that was through a series of a priori generation rules and a large at the time dictionary of stock phrases. I wonder what ever happened to that.)
I think your prof’s worries came true on a massive scale
It became a successful consultant...
That said, a bot with contextual recall can be very useful.
The problem is just that people keep insisting that those things are intelligent.
Isn't that just the LLM equivalent of hardcoding though?
I wouldn't call that hardcoding, otherwise you'd have to call everything it does "hardcoded".
"Overfitting" would be a bit more accurate term if the problem lies in the specific examples existing in its training set in various forms, places, languages etc but with the same values.
There are tests they are passing that they can't be hardcoded for by design. They still have all kinds of flaws and inconsistency but getting upset they answer "2+2=4" because someone trained them on what the answer to 2+2 is supposed to be is silly.
this work is similar to the GSM symbolic paper (applied to putnam) https://arxiv.org/html/2410.05229v1
going forward, llm performance must be reported on the confounded benchmark as well
They are highly effective pattern matchers. You change the pattern, it won't work. I don't remember who, but most likely @tszzl (roon), commented on X that they still trained the traditional way, and there is no test time compute (TTC) or Monte Carlo tree search (like AlphaGo) in o1 or o3. If that is true, then it's still predicting the next word based on its training data, likely following the most probable path - which comes directly from the training itself - even for the slight variations. Encouragingly, if TTC hasn't been explored, there is a long runway for performance improvements.
The other reason this seems hard to guess is because we don't know how much of what we are asking is in the training data. It would perform on some tasks, while fail at others even though those are similar.
I recently watched two interviews with OpenAI researchers where they describe that the breakthrough of the o-series (unlike the GPT series) is to focus on test time compute, as these models are designed to “think” more, specifically to avoid pattern matching. Noam Brown https://youtu.be/OoL8K_AFqkw?si=ocIS0YDXLvaX9Xb6&t=195 and Mark Chen https://youtu.be/kO192K7_FaQ?si=moWiwYChj65osLGy
Thank you, this is helpful. The post on X was seemingly wrong.
The comment was likely that there's no explicit search. In o1, the model has learned how to search using its context. Presumably they do this by RLing over long reasoning strings/internal monologues.
I believe they are using scalable TTC. The o3 announcement released accuracy numbers for high and low compute usage, which I feel would be hard to do in the same model without TTC.
I also believe that the $200 subscription they offer is just them allowing the TTC to go for longer before forcing it to answer.
If what you say is true, though, I agree that there is a huge headroom for TTC to improve results if the huggingface experiments on 1/3B models are anything to go off.
The other comment posted YT videos where OpenAI researchers are talking about TTC, so I am wrong. That $200 subscription is just because the number of tokens generated is huge when CoT is involved. Usually inference output is capped at 2000-4000 tokens (max of ~8192) or so, but they cannot do that with o1 and all the thinking tokens involved. This is true with all the approaches - next token prediction, TTC with beam/lookahead search, or MCTS + TTC. If you specify the output token range as high and induce a model to think before it answers, you will get better results on smaller/local models too.
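As a concrete (if oversimplified) picture of spending more inference-time tokens, here is a sketch of self-consistency sampling; generate() and extract_answer() are hypothetical hooks into whatever local model you run, not any OpenAI API:

  # self-consistency sketch: sample several chains of thought, majority-vote the answer
  from collections import Counter

  def self_consistency(prompt, generate, extract_answer, n=16):
      answers = []
      for _ in range(n):
          chain = generate(prompt + "\nThink step by step, then state the final answer.")
          answers.append(extract_answer(chain))
      return Counter(answers).most_common(1)[0][0]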
> huge headroom for TTC to improve results ...1B/3B models
Absolutely. How this is productized remains to be seen. I have high hopes for MCTS and Iterative Preference Learning, but it is harder to implement. Not sure if OpenAI has done that. Though DeepMind's results are unbelievably good [1].
ttc is an incredibly broad term and it is broadening as the hype spreads. people are now calling CoT “TTC” because they are spending compute on reasoning tokens before answering
Yes, and HuggingFace have published this outlining some of the potential ways to use TTC, including but not limited to tree search, showing TTC performance gains from LLama.
https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...
OpenAI have openly stated that o1 & o3 are using test time compute, and released a log scale graph indicating linear performance gains for exponential compute usage.
https://openai.com/index/learning-to-reason-with-llms/
They only confirm that the model/system is doing chain of thought, but the exponential factor and the origin of the reasoning gains likely come from TREE of thoughts (the number of branches/compute goes up exponentially with depth), essentially doing tree search over different reasoning chains.
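If it really is tree search over reasoning chains, a bare-bones version might look like the sketch below; step() and score() stand in for a step generator and a verifier/reward model, which is where the hard part actually lives (this is a guess at the shape of the idea, not OpenAI's implementation):

  # toy tree-of-thoughts expansion: b branches per step, depth d => b**d chains
  def tree_of_thoughts(problem, step, score, b=3, d=4):
      chains = [problem]
      for _ in range(d):                    # compute grows exponentially with depth
          chains = [c + "\n" + s for c in chains for s in step(c, n=b)]
      return max(chains, key=score)         # keep the chain the verifier likes best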
I assume roon's identity is well known inside OpenAI (he's an employee), so I wouldn't expect him to be leaking implementation details on twitter.
This workshop contribution is OK, and the benchmark is somewhat valuable even without the rephrasing part of the problems, but the rephrasing (of only a small number of problems) sometimes genuinely makes the problem more confusing to humans as well by either poor phrasing (fig 3), or unneeded breaking of convention (fig 4; points in 2D are often P, with coordinates x,y). It would have been nice to see effects on the rephrasing of the latest/post-training date problems as a function of the increased noising, to delineate part of this confusion. I wonder how much better o3 is on the same benchmark.
Also, the correct title of this contribution is: Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning
The paper includes several examples of their modified questions. There has been a substantial jump from o1-preview to o1, so I gave several samples to o1 and o1-pro (not o1-preview), and current o1s gave the correct answer to those modified problems. SOTA changes fast.
The paper mentions that on several occasions the LLM will provide a correct answer but will either take big jumps without justifying them or will take illogical steps but end up with the right solution at the end. Did you check for that?
No, I don't know enough math to test the logic, only the check questions against their expected answers in https://anonymous.4open.science/r/putnam-axiom-B57C/data/Put...
Putnam problems need to actually be graded, often the answer itself is trivial.
LLM boosters are so tiresome. You hardly did a rigorous evaluation, the set has been public since October and could have easily been added to the training data.
I'm not skilled enough in math to do a rigorous evaluation, so it was a quick check.
Terence Tao is skilled enough, and he describes o1's math ability as "...roughly on par with a mediocre, but not completely incompetent graduate student" (good discussion at https://news.ycombinator.com/item?id=41540902), and the next iteration o3 just got 25% on his brand new Frontier Math test.
Seeing LLMs as useless is banal, but downplaying their rate of improvement is self-sabotage.
> "...roughly on par with a mediocre, but not completely incompetent graduate student"
Let it sink in how vague and almost meaningless that statement is.
What types of questions are you hoping to answer for that to be considered a vague statement?
Your points would be more convincing if you didn’t preface them with arrogant cynicism.
So many negative comments as if o3 didn’t get 25% on frontiermath - which is absolutely nuts.
Sure, LLMs will perform better if the answer to a problem is directly in their training set. But that doesn’t mean they perform badly when the answer isn’t in their training set.
EpochAI have to send the questions (but not the answer key) to OpenAI in order to score the models.
An overnight 2% -> 25% jump on this benchmark is a bit curious.
The 2% result belonged to a traditional LLM that costs cents to run, while o3 is extremely expensive.
1. OpenAI said they did not train on these problems & they don’t train on API calls in general, that is a legal policy.
2. It was a new major model release from work over the course of months - struggle to see that as an ‘overnight’ jump in any real meaning.
3. Why is it easier to believe large scale corporate fraud than that the stated capabilities on a held out test set are real? Reads like cope, if I’m being frank.
I don’t think it’s “easier to believe” just that it raises some red flags.
Sure, it did good in frontiermath. That's not what this thread is about.
Your comment isn't relevant at all
this thread is about math LLM capability, it’s a bit ridiculous to say that mentioning frontiermath is off topic but that’s just me
Just because you can generalize the topic doesn't mean you can ignore the specific conversation and choose your hill to argue.
Additionally, the conversation on this topic is about the model's ability to generalize and its potential overfitting, which is arguably more important than parroting mathematics.
performance on a held-out set (like frontiermath) compared to putnam (which is not held out) is obviously relevant to a model's potential overfitting.
i'm not going to keep replying, others can judge whether they think what i'm saying is "relevant at all."
Again, you set your own goal posts and failed to add any insights.
The topic here isn't "o-series sucks", it's addressing a found concern.
I don't get why this matters at all? I looked at the o1-preview paper, Putnam is not mentioned. Meaning that: a) OpenAI never claimed this model achieves X% on this dataset. b) likely, OpenAI did not take measures to exclude this dataset from training. Meaning the only conclusion we can draw from this result is: when prompted with questions that were verbatim in the dataset, performance increases dramatically. We already know this, and it doesn't say anything about the performance of the model on unseen problems.
yep, welcome to hn
An interesting example of this is:
There are 6 “a”s in the sentence: “How many ‘a’ in this sentence?”
https://chatgpt.com/share/677582a9-45fc-8003-8114-edd2e6efa2...
Whereas the typical “strawberry” variant is now correct.
There are 3 “r”s in the word “strawberry.”
Clearly the lesson wasn’t learned, the model was just trained on people highlighting this failure case.
Reminds me of software i have built which had some basic foundational problems. Each bug was fixed with a data-patch that fixed the symptom but not the cause.
hence we continually played whack-a-mole with bugs. we would squash one bug, and another one would appear.
same with llms, squash one problem with a data-fix, and another one pops-up.
Transformers are very bad at counting due to how their internals work. But if you ask them to use an explicit counter the problem disappears:
https://chatgpt.com/share/6775c9a6-8cec-8007-b709-3431e7a2b2...
Basically one feed-forward pass is not Turing complete, but autoregressive generation (feeding previous output back into itself) is Turing complete.
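For what it's worth, the "explicit counter" being asked for is just the sequential scan any program would do; the model only gets it right when it is allowed to externalize something like this, token by token:

  # the sequential count the prompt asks the model to spell out
  sentence = "How many 'a' in this sentence?"
  count = 0
  for ch in sentence:
      if ch.lower() == "a":
          count += 1
  print(count)  # 2 for this sentence, not the 6 claimed above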
This makes it worse, IMO. I was starting to think it didn’t have a letter-by-letter representation of the tokens. It does. In which case, the fact that it didn’t decide to use it speaks even more to its unsophistication.
Regardless, I’d love if you would explain a bit more why the transformer internals make this problem so difficult?
When Can Transformers Count to n?
https://arxiv.org/html/2407.15160v2
The Expressive Power of Transformers with Chain of Thought
https://arxiv.org/html/2310.07923v5
The transformer needs to retrieve the letters for each token while being forced to keep its internal representation aligned in length with the base tokens (each token also has a finite embedding, while being made up of multiple letters), and then it needs to count the letters within that misaligned representation.
Autoregressive mode completely alleviates the problem, as the model can align its internal representation with the letters and just keep an explicit sequential count.
BTW - humans also can't count without resorting to a sequential process.
Thanks!
It also fails on things that aren't actual words
For example, the output for "how many x's are there in xaaax" is 3.
https://chatgpt.com/share/677591fe-aa58-800e-9e7a-81870387be...
Transformers are very bad at counting in one feed-forward pass; you need to explicitly tell them to use a counter in an autoregressive fashion, like here:
https://chatgpt.com/share/6775cb37-4198-8007-82cb-e897220827...
I wouldn’t be surprised if something similar is found concerning the ARC challenge, which is why I still maintain my own private LLM challenges to gauge current capabilities. 'Course, I have little illusion that these are fully private, but it is better than fully public tests.
Even the most straightforward, logical, easily reasoned ones stump all LLMs I have access to, which is why I am so skeptical concerning emergence, reasoning, and all this hype around “AGI”…
The metaphor that might describe this paper is "iteration". I'd hazard to predict that we’ll likely see more iterations of the following loop in 2025:
-> A new benchmark emerges with a novel evaluation method.
-> A new model saturates the benchmark by acquiring the novel “behavior.”
-> A new benchmark introduces yet another layer of novelty.
-> Models initially fail until a lab discovers how to acquire the new behavior.
Case in point: OpenAI addressed this last step by introducing a paradigm called deliberative alignment to tackle some of the ARC benchmarks. [1]
Alongside all this technical iteration, there’s a parallel cycle of product iteration, aiming to generate $ by selling intelligent software. The trillion $ questions are around finding the right iterations on both technical and product dimensions.
ok but preview sucks, run it on o1 pro.
99% of studies claiming some out-of-distribution failure of an LLM use a model already made irrelevant by SOTA. These kinds of studies, with long turnaround and review periods, are not the best format for making salient points given the speed at which the SOTA horizon progresses.
I wonder what the baseline OOD generalization for humans is. It takes around 7 years to generalize visual processing to X-ray images. How well does a number theorist respond to algebraic topology questions? How long would it take a human to learn to solve ARC challenges in the JSON format just as well as in the visual form?
One I just did:
Q: I was heading to Aberdeen from London. On my way I passed seven wives, each wife had seven sacks, and in each sack there were seven cats and each cat had seven fish. How many were going to London?
A: This riddle is a play on words, and the answer is hidden in the phrasing! You mentioned you were heading to Aberdeen from London, but you didn’t say anything about the seven wives, their sacks, cats, or fish actually being headed to London.
The only one going to London is you.
So the answer is: 1 person (you) are going to London.
Llama 3.33-70b-instruct:
> I was heading to Boston from New York. On my way I passed six clerks, each clerk had six socks, and in each sock there were six skinks and each skink had six sticks. How many were going to Boston?
The answer is... "I" was going to Boston! The question asks how many were going to Boston, and the answer is the person being referred to, which is "I". The rest of the information about the clerks, socks, skinks, and sticks is just a clever distraction!
[Of course, it also got your exact prompt right.]
The researcher's answer to their variant of "Year: 2016 ID: A1" in the appendix is wrong.
The solution (the sum of 1, 2, 5, 6, 9, 10, 13, 14, ...) has an alternating pattern, so it has to be two piecewise interleaved polynomials, which cannot be expressed as a single polynomial.
Their answer works for k=1,2, but not k=3.
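A quick way to check this (assuming the sequence really is 1, 2, 5, 6, 9, 10, 13, 14, ...): take finite differences of the partial sums; for any single polynomial they would eventually become constant and then vanish, but here they keep alternating:

  # partial sums of 1,2,5,6,9,10,13,14,... and their finite differences
  terms = [n for n in range(1, 40) if n % 4 in (1, 2)]
  sums = [sum(terms[:k]) for k in range(1, len(terms) + 1)]
  diffs = sums
  for order in range(1, 4):
      diffs = [b - a for a, b in zip(diffs, diffs[1:])]
      print(order, diffs[:6])
  # 2nd differences alternate 3,1,3,1,... and 3rd differences alternate -2,2,...
  # so no single polynomial can match the partial sums for all k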
https://openreview.net/pdf?id=YXnwlZe0yf
This does not give me confidence in the results of their paper.
You are correct. Their answer is instead the sum of the first k terms of 1, 2, 6, 10, 14, 18, ..., for positive k.
You're misreading the solution, the first part reads n=1, a trivial special case, not n congruent to 1 mod 4.
The statement doesn't hold for e.g. n=5. Taking m=2 gives the permutation (1 2 4 3), which is odd, and thus cannot have a square root.
Very astute. Did you communicate this to the authors?
Link title says "slightly", but the PDF says two different kinds of variations: variable names (slight) and problem constants (significant), and the 30% drop is on the combination of a 26 variable and also 26 variable + constant questions.
It's good to have a better test (though I bet this one will also be quickly saturated like all the others), but the title here doesn't seem justified by the page title there or the content.
I would definitely classify both of those as slight changes. In fact I'd rename those as slight => trivial and significant => slight.
Right, renaming a variable should have zero effect on ability to solve (it wouldn’t for a human). Changing a constant should be very minor, probably also ~0 effect in most cases. I say this as someone that’s done many of these problems.
Isn't this simply because the dataset used (Putnam-AXIOM Original) is in the training data used to train the various models?
Given that these are simple variations (variable-name and constant-value changes in math problems), why wouldn't the companies creating these models (OpenAI etc.) create these variations themselves in order to ensure that the model is learning how to solve the problem rather than memorizing a solution? Seems like a very obvious thing to do ...
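For illustration, generating that kind of variation is nearly free; a toy sketch along these lines (the template and ranges are made up, not the paper's actual generator):

  # toy variation generator: rename the variable and perturb the constants
  import random

  TEMPLATE = "Let {v} be a positive integer with {v} > {c}. Minimize {v}**2 - {k}*{v}."

  def make_variant(rng):
      return TEMPLATE.format(v=rng.choice("nmkpt"),
                             c=rng.randint(2, 20),
                             k=rng.randint(2, 12))

  rng = random.Random(0)
  variants = [make_variant(rng) for _ in range(5)]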
They are not only simple renames - LLMs are good at those. They are minor structural changes.
This result is the same as a recent test of the same method+hypothesis from a group at Apple, no? I don’t have that reference handy but I don’t think I’m making it up.
I think you are probably referring to the following paper: https://arxiv.org/abs/2410.05229
Yup, looks like the one I meant!
I am impressed by the progress on LLMs but I remain skeptical that they can replace humans.
Perhaps some (distant!) future model but I don’t fear mass unemployment (for example) or even moderate LLM-driven unemployment in the near-to-medium term.
They can clearly complement human labor but there are vanishingly few domains where they can be substitutes.
I still find it hard to believe that LLM methods will lead to "true" AI. No amount of processing power or data will be sufficient without something new.
I don't think this proves that the LLM is just a "pattern matcher". Humans make similar mistakes too, especially when under time pressure (similar to a non-reasoning model that needs to "use system one" to generate an answer in one go). This is further evidenced by the fact that if you specifically ask the models to pay attention to traps, or just ask the follow-up question "are you sure?", then they can usually get it right.
You're saying that humans perform worse on problems that are slightly different than previously published forms of the same problem? To be clear we are only talking about changing variable names and constants here.
Often yes, because we assume we already know the answer and jump to the conclusion. At least those of us with ADHD do.
Not really true for Putnam problems since you have to write a proof. You literally can’t just jump to a conclusion and succeed.
That is the principle behind the game 'Simon says'
'Simon says' is about reaction time and pressure.
That’s a very silly analogy. A more realistic analogy would be do humans perform better on computing 37x41 or 87x91 (with showing the work)?
It was not an analogy at all. It was a simplified example of the idea that a slight change in a pattern can induce error in humans.
It seems some people disagree that that is what the game "Simon Says" is about. I feel like they might play a vastly simplified version of the game that I am familiar with.
There was a recent episode of Game Changer based on this which is an excellent example of how the game leader should attempt to induce errors by making a change that does not get correctly accounted for.
No, it’s not at all.
This is all getting so tiresome.
sometimes o1-preview starts hallucinating halfway through a good solution. it can get the intuition and the 'main' direction for a problem wrong too. but then problem solving is just a series of rephrasings, and translating into different math domains is used by mathematicians for solving problems.
I hope someone reruns this on o1 and eventually o3.
If o1-preview was the start like gpt1, then we should expect generalization to increase quickly.
I don't think LLMs generalise much; that's why they're not creative and can't solve novel problems. It's pattern matching with a huge amount of data.
Study on the topic: https://arxiv.org/html/2406.15992v1
This would explain o1's poor performance on problems with variations. o3 seems to be expensive brute forcing in latent space followed by verification, which should yield better results - but I don't think we can call it generalisation.
I think we need to go back to the drawing board.
From firsthand experience, this simply cannot be true. I can give them totally novel and unique physics problems I just made up - problems that require tracking the movement of objects through a series of events - and they answer most correctly. Moreover, they find analogies between disparate concepts and fields of study and make useful suggestions based on them - which is arguably the same process as human creativity.
I think ultimately the disconnect is people theorizing about what it can or cannot do with an incorrect mental model of what it is, and then assuming it cannot do things that it can in fact do. The irony of discussions on LLMs is that they showcase, more than anything, the limits of humans' ability to reason about novel situations.
Don't worry, there are thousands of researchers at the drawing boards right now.
Yeah, because if the AI boom becomes the AI bust, we'll have another 2008-level economic crisis on our hands.
The investments into AI are in the hundreds of billions (maybe even more if you factor in the amount of people studying and researching AI), but the returns are in the tens of billions (if even that).
If you exclude the "growth" coming from the industry sniffing its own farts (e.g. Nvidia selling insane amounts of insanely overpriced GPUs to InsertYourFavAICorp), the actual amount of "useful goods and services" produced (api accesses, chat subscriptions, ai-enabled app growth etc.) are tiny compared to the investment levels.
The AI train appears to have no brakes. A massive crash or AGI are the only options now. Both are going to be bad for average humans.
the fact that this (and tons of other legitimate critique) got downvoted into greytext speaks so much louder to me than all benchmarks in the world
You're assuming that openAI isn't just gonna add the new questions to the training data.
Their methodology shows they can create an infinite variety of problems.
This is the same thing as synthetic training data.
It doesn't matter if models are trained on the output of the generated data or not. If the model ends up being able to solve newly generated variations, you'd have to admit that it understands the underlying problems.
I think what it shows is that it has minimal "understanding" of the problem - otherwise such small variations wouldn't pose a challenge. Training it to handle these specific small variations doesn't change that.
It's good at automation, not understanding.
If it were a complete failure on variations I would be inclined to agree. Instead it was a 30% drop in performance. I would characterise that as limited understanding.
My guess is that what’s understood isn’t various parts of solving the problem but various aspects of the expected response.
I see this more akin to a human faking their way through a conversation.
> I see this more akin to a human faking their way through a conversation.
That works in English class. Try it in a math class and you'll get a much lower grade than ChatGPT will.
Fully agree with this
Exactly. The naivety is just sky-high.
Oh, so it's almost like everything else AI-related: they basically cheated and lied.
If you are shocked by this, you are the sucker in the room.
[flagged]
And within half an hour somebody invested in nvidia stock is going to swoop in and explain how they totally (trust me bro) made x thousand with an app written by llm.
Every. Single. Time.
Almost as if there was a financial incentive to do that.
It's the crypto bullshit all over again.
Tech hype is becoming unbearable as time goes on.
What's with the bitterness? Maybe don't get blinded by the hype and bring a little bit of wonder (and humility) back.
Because the hype is not only annoying, but it makes potentially cool and interesting technology toxic once people figure out that the people hyping things up know it to be mostly bullshit.
Great things take many years, sometimes decades to develop properly. Different generations of people to experiment and try things out.
That is not good for the ones pushing up the hype. You don't get rich quick by doing that, you don't get to scam enough investors by something being slowly improved.
You may call it bitterness, whereas I am just jaded by watching things play out.
It is extremely lame to reply to your own comment to grouse about downvotes.
Hmmm, without a human control group it is not all that clear to me that the variation problems are not more difficult.
There is also a curated benchmark of exactly these famous problems with slight variations: https://github.com/cpldcpu/MisguidedAttention/tree/main/eval
One problem from the benchmark:
"prompt_id": "river_crossing_easy",
"category": "Logic Puzzle",
"title": "Easy river crossing",
"prompt": "A farmer is on one side of a river with a wolf, a goat, and a cabbage. When he is crossing the river in a boat, he can only take one item with him at a time. The wolf will eat the goat if left alone together, and the goat will eat the cabbage if left alone together. How can the farmer transport the goat across the river without it being eaten?",
"expected_behavior": [
"Answer concludes that they simply get in the boat and cross together in one trip"
],
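If you want to try it yourself, here is a minimal sketch, assuming the official openai Python client and gpt-4o (the keyword check at the end is my own crude stand-in, not the benchmark's actual grader):

    import os
    from openai import OpenAI  # assumes the openai>=1.0 client is installed

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    prompt = (
        "A farmer is on one side of a river with a wolf, a goat, and a cabbage. "
        "When he is crossing the river in a boat, he can only take one item with "
        "him at a time. The wolf will eat the goat if left alone together, and "
        "the goat will eat the cabbage if left alone together. How can the farmer "
        "transport the goat across the river without it being eaten?"
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption; swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content

    # Expected behavior: one trip with the goat. If the answer describes return
    # trips, the model has pattern-matched the classic multi-trip puzzle.
    multi_trip = any(s in answer.lower() for s in ("return", "go back", "second trip"))
    print("FAIL (classic multi-trip answer)" if multi_trip else "PASS (one trip)")
    print(answer)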
EDIT: removing most of my commentary on this problem. As a human, I was tricked by the problem too. I would love to see how a random selection of humans would do on this one… but it just doesn't feel like a great test to me.
No. Simply plug the prompt into ChatGPT and see what happens.
The llm isn't getting confused by the meaning of "item". It's recognizing a common problem and not picking up on the fact that the farmer just needs to transport the goat and nothing else.
Instead, it gives the standard answer for how to transport everything across.
I'll admit that, as a fallible human, I didn't pick it up, but I was focused on the wrong thing because I've been using "and the boat can take everything" as my variation, and GPT-3 just could not get that one in one shot.
GPT-3 is old hat though. Later versions of GPT-4 manage to get it with a bunch of coaching, and o1 manages to solve it with less coaching.
>This is twisting the English language to assume that "item" only refers to non-living things.
Not really. Unless I'm not reading correctly, most of the problem is irrelevant, as you're only required to cross the river with the goat; you don't care about the cabbage. The difficulty lies in the assumption that you need to take everything across, due to the resemblance to the bigger problem.
You’re reading it correctly. I read it again after your comment and I realized I too pattern matched to the typical logic puzzle before reading it carefully and exactly. I imagine the test here is designed for this very purpose to see if the model is pattern matching or reasoning.
The problem is to ask the farmer to transport the goat. So the farmer indeed gets in the boat with the goat. The unstated gotcha is that the farmer is willing to abandon the wolf and the cabbage. A heavily pattern-matching LLM or human would immediately assume that the farmer needs to transport all three.
Yep, and that gotcha got me, as a perfectly non-silicon human. My bad everyone.
Wow, this seems ridiculous. The expected answer is basically finding a loophole in the problem. I can imagine how worthless all of these models would be if they behaved that way.
It's not a loophole, the question is "how can he get the goat across?". The answer is he just takes it across.
If you revise this prompt to satisfy your pedantry, (at least) 4o still gets it wrong.
I think the lamentations about real-world data running out are misplaced. We can multiply data with slight variations, which might lead to better resilience and more accurate model responses to novel problems.
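As a rough sketch of what I mean (the template, field names, and numbers below are made up purely for illustration), each sampled variant comes with a computed ground-truth answer, so the pairs could feed straight into a training or eval set:

    import random

    # Hypothetical template: vary the surface details while keeping the
    # underlying computation (distance / time) the same.
    OBJECTS = ["train", "cyclist", "cargo ship", "drone"]

    def make_variant(rng: random.Random) -> dict:
        obj = rng.choice(OBJECTS)
        distance_km = rng.randint(3, 900)
        hours = rng.randint(1, 12)
        prompt = (
            f"A {obj} travels {distance_km} km in {hours} hours at a constant "
            f"speed. What is its average speed in km/h?"
        )
        return {"prompt": prompt, "answer": round(distance_km / hours, 2)}

    rng = random.Random(0)
    dataset = [make_variant(rng) for _ in range(1000)]
    print(dataset[0])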
It still needs to be prompted in a way that's easy to understand. If you ask in a weird “how do I not not not win” way instead of “how do I lose”, you are gonna run into problems.
Yes so when you change the sequence of tokens they've electronically memorized, they get a bit worse at predicting the next token?
When you put it that way it’s a trivial result. However the consequences for using AI to replace humans on tasks is significant.
The only people super pumping the idea of mass replacement of human labor are financially invested in that outcome.
If the models are made to pass the benchmarks, of course they’d have some sort of overfit.
i know it's easier to just drop the internet tarball on the transformer, but at some point we need to just be giving these models grade school math homework and times tables.
I would love to see how well DeepSeek V3 does on this.
Probably even worse, since I've heard that it's hard to steer away from the most common interpretation of a question.
Or it's time to step back and call it what it is - very good pattern recognition.
I mean, that's cool... we can get a lot of work done with pattern recognition. Most of the human race never really moves above that level of thinking in the workforce or in navigating daily life, especially if they default to various societally prescribed patterns of getting stuff done (e.g., go to college or the military based on <these criteria>, find a job based on the best fit with <this list of desirable skills & experiences>, go to <these places> to find love...).
So, I am conflicted about this.
If we take an example of what is considered a priori as creativity, such as story telling, LLMs can do pretty well at creating novel work.
I can prompt with various parameters, plot elements, moral lessons, and get a de novo storyline, conflicts, relationships, character backstories, intrigues, and resolutions.
Now, the writing style tends to be tone-deaf and poor at building tension for the reader, and it is apparent that the storytelling has little “theory of mind” of the reader, but the material has elements that we would certainly consider to be creative if written by a student.
It seems we must either cede that LLMs can do some creative synthesis, as this and some other experiments of mine suggest, or we must decide that these tasks, such as “creative writing” are not in fact creative, but rather mostly or strictly derivative.
There is some argument to be had in assertions that storytelling is all derivative of certain patterns and variations on a fixed number of tropes and story arcs… but arguing this begs the question of whether humans actually do any “pure” creative work, or if, in fact, all of it is the product of experience and study (training data).
Which leads me to the unpleasant conflict about the debate of AI creativity. Is the debate really pointing out an actual distinction, or merely a matter of degree? And what are the implications, either way?
I’m left with the feeling that LLMs can be as capable of creative work as most 8th grade students. What does this say about AI, or developing humans? Since most people don’t exceed an 8th grade level of literacy, what does this say about society?
Is there even such a thing as de novo idea synthesis?
Troubling questions abound.
To add to this pondering: we are discussing the state today, right now. We could assume this is as good as it's ever gonna get, and that all attempts to overcome the current plateau are futile, but I wouldn't bet on it. There is a solid chance that the 8th-grade-level writer will turn into a post-grad writer before long.
So far the improvements in writing have not been as substantial as those in math or coding (not even close, really). Is there something fundamentally “easier” for LLMs about those two fields?
Much more formal structure and generally code can be tested for correctness. Prose doesn't have that benefit. That said, given the right prompt and LLM, you can squeeze out surprisingly good stuff: https://bsky.app/profile/talyarkoni.com/post/3ldfjm37u2s2x
I have no doubt that LLMs do creative work. I think this has been apparent since the original ChatGPT.
Just because something is creative doesn’t mean it’s inherently valuable.
> Or it's time to step back and call it what it is - very good pattern recognition.
Or maybe it's time to stop wheeling out this tedious and disingenuous dismissal.
Saying it is just "pattern recognition" (or a "stochastic parrot") implies behavioural and performance characteristics that have very clearly been greatly exceeded.
What are the fundamental limitations of "pattern recognition" or "stochastic parrots" that LLMs have exceeded?
They can generalise to novel inputs. Ok often they mess it up and they're clearly better at dealing with inputs they have seen before (who isn't?), but they can still reason about things they have never seen before.
Honestly if you don't believe me just go and use them. It's pretty obvious if you actually get experience with them.
Current LLMs are equivalent to tabular Markov chains (though these are too huge to realistically compute). What's the size limit when a tabular Markov chain can generalize to novel inputs?
No idea. I'm not sure how that's relevant anyway.
Citation needed. Please be more specific, or else this is just tedious and disingenuous advocacy.
Gpt4 can add very large integers.
It is evident that it is not recalling the sum, because all combinations of integer addition were likely not in the training data. Storing the answers to all sums of integers up to the size that GPT-4 can manage would take more parameters than the model has.
That addition is a small capability but you only need a single counterexample to disprove a theory.
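For a back-of-envelope version of that argument (the parameter count below is my own rough assumption, since GPT-4's size isn't public):

    # Distinct ordered pairs of 20-digit operands, i.e. possible "a + b" problems.
    pairs = (10**20) ** 2        # ~1e40
    assumed_params = 2 * 10**12  # assumed order of magnitude only, not a published figure

    print(f"{pairs:.1e} possible sums vs ~{assumed_params:.1e} parameters")
    # Even one bit of storage per answer would dwarf the parameter count,
    # so pure lookup can't account for correct large-integer addition.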
> That addition is a small capability but you only need a single counterexample to disprove a theory
No, that's not how this works :)
You can hardcode an exception to pattern recognition for specific cases - it doesn't cease to be a pattern recognizer with exceptions being sprinkled in.
The 'theory' here is that a pattern recognizer can lead to AGI. That is the theory. Someone saying 'show me proof or else I say a pattern recognizer is just a pattern recognizer' is not a theory and thus cannot be disproven, or proven.
This is also known as Russell's teapot. https://en.wikipedia.org/wiki/Russell%27s_teapot
If someone claims there's a teapot out in space - the burden of proof is on the person making the claim, not on the person saying it is bullshit.
It's not hardcoded, reissbaker has addressed this point.
I think you are misinterpreting what the argument is.
The argument being made is that LLMs are mere 'stochastic parrots' and therefore cannot lead to AGI. The analogy to Russell's teapot is that someone is claiming Russell's teapot is not there because china cannot exist in the vacuum of space. You can disprove that with a single counterexample. That does not mean the teapot is there, but it also doesn't mean it isn't.
It is also hard to prove that something is thinking. It is also very difficult to prove that something is not thinking. Almost all arguments against AGI take the form X cannot produce AGI because Y. Those are disprovable because you can disprove Y.
I don't think anyone is claiming to have a proof that an LLM will produce AGI, just that it might. If they actually build one, that too counts as a counterexample to anybody saying they can't do it.
GPT-4o doesn't have hardcoded math exceptions. If you would like something verifiable, since we don't have the source code to GPT-4o, consider that Qwen 2.5 72b can also add large integers, and we do have the source code and weights to run it... And it's just a neural net. There isn't secret "hardcode an exception to pattern recognition" in there that parses out numbers and adds them. The neural net simply learned to do it.
That's interesting, I didn't know that, thanks.
Is the claim then that LLMs are pattern recognizers but also more?
It just seems to me and I guess many others that the thing it is primarily good at is being a better google search.
Is there something big that I and presumably many others are missing and if so, what is it?
so how much you have riding on nvidia bro?
Nothing. I just use ChatGPT and Claude so I am familiar with their capabilities and limitations.
Imagine if people who had never used VR kept saying it's just a TV on your face, or if people who had never used static types kept saying they're just extra work you have to do, or if people who had never had sex kept saying it's just a way of making babies.
It's a tedious claim when it's so easily disproven by going to a free website and trying it. Why are people so invested in AI being useless that they'll criticise it so confidently without even trying it?
isn't it weird that they didn't test gemini?
trained on test. who even trusts OAI anymore?
they didn’t test on putnam…
Yea no shit. LLMs are just REALLY good guessers. People gotta stop the hype lol.
Using LLMs for anything serious and which requires consistency and trustworthiness without hallucinations is irresponsible and ridiculous.
Closed source LLMs are a bubble and a joke.
MY GOD ITS ALMOST LIKE THEY TRAINED THE MODEL FOR THE BENCHMARK INSTEAD OF FOR GENERAL APTITUDE! WHY WOULD OPENAI DO THIS?!?
Even humans get confused with trick questions right? Once they understand this is a trick question they no longer fall for it. :)
It drops from 50 to 33.96. Still the best; o1 on the variant problems scores around twice as high as Claude does on the original test.
The rest of the LLMs are far behind, in the single digits.
It makes me wonder whether o1 is finally getting intelligent. LLMs are not supposed to understand these problems when you change the variables and values; they have to rely on preexisting data from absolutely identical solved problems to give a correct answer.
I haven't followed LLM development closely, but I heard at one point that ChatGPT is now composed of multiple LLMs, and that maybe they added multiple specialized models for purposes like problem solving or trigonometry.
That would explain why it's so much better.