I've been using this dumb benchmark for a few months now. More posts about it here: https://simonwillison.net/tags/pelican-riding-a-bicycle/
I hope you have other private benchmarks running that you don't talk about or publish, just in case a model maker intentionally targets one of your benchmarks, or some fuzzy "find things people have mentioned as potential LLM benchmarks" process scoops up your ideas and/or outputs.
If we ever get to the point where LLMs are already optimized to answer every question you can think of, then there isn't really a need to have a secret question in the first place.
Not any that you can think of. Just the ones you've published something about.
Plus, simonw isn't exactly a meaningless nobody in this space, and his writeups are more detailed and actionable, and therefore identifiable, than some random "hey a great LLM benchmark would be creating an SVG of a walrus twerking in front of a jelly bean store" throwaway comment.
Proof: I asked ChatGPT 4o the question "What are some users who post ad hoc LLM benchmarks to technical discussion sites, and what benchmarks have they proposed?" simonw is in the list, 1 of 7 individual people it suggested. (The proposed benchmarks listed for him were more general than the specific one here: "Testing LLMs’ capabilities with code generation, particularly in niche languages or against real-world API schemas." But it's easy to imagine followup queries bringing this one up.)
I'm in agreement that LLMs get contaminated with test data, particularly from simonw. What I'm referring to is that nobody needs to worry about hoarding secret questions away from the public eye to avoid that problem. It is a valid approach... but a bit of a sad path, all things considered.
Don't run unpublished private benchmarks or worry about keeping a counted hoard of secret questions. Do rotate your questions every few months to whatever comes to mind at the time. When nothing comes to mind, there is no point in running a question benchmark anymore, as it already answers every possible question you can think of (and the only way it gets there in your lifespan is by reasoning rather than memorization). You can always run the new question retroactively on old models for comparison purposes, so that's not a concern either.
The important thing here being "rotate questions without concern of having things lined up for it" rather than "fear what happens when you discuss your question".
Ah, I understand your logic now. Fair enough, and since I'm not benchmarking LLMs to begin with, I guess it doesn't much matter whether I fully agree or not. But:
> You can always run the new question retroactively on old models for comparison purposes, so that's not a concern either.
Good point, but it's still somewhat of a concern. Especially if you're benchmarking via a chat interface, my understanding is that there are plenty of finetuning, system prompt, safety, and other post-training changes that can influence results. (Perhaps unlikely with an SVG generation prompt, but reasonably likely with "reasoning" prompts.)
> Do rotate your questions every few months to whatever comes to mind at the time. When nothing comes to mind, there is no point in running a question benchmark anymore, as it already answers every possible question you can think of
Isn't that assuming it's answering your questions reasonably well? On a hard benchmark, you may be watching the progression from absolutely miserable to not quite tolerable.
You may even be using a constellation of questions to try to delineate the boundary between what it can and can't do.
I saw your recent post on running Llama 3.3 70B on an M2 Pro with 64 GB. Do the many Apple Silicon variants, with their varying numbers of CPU cores, GPU cores, and Neural Engine cores, matter that much for how fast these LLMs can generate tokens and answer questions? More hardware is always better, but what can we say about how performance scales across the different choices?
64 GB of RAM is crucial; after that you need 1+ TB of storage, and then what?
I don't know. I believe memory bandwidth matters, and I got the impression that the M4 series isn't yet as good as the M2 was on that front, but I'm half-remembering things I've heard here.
FYI - M4 has more bandwidth
Chip      Bandwidth (GB/s)
------    ----------------
M2        100
M3        100
M4        120 (20% more)
M2-Pro    204
M3-Pro    153 (less than M2-Pro)
M4-Pro    273 (78% more than M3-Pro)
M2-Max    409
M3-Max    409
M4-Max    546 (33% more than M2/M3-Max)
https://arstechnica.com/apple/2024/10/apples-m4-m4-pro-and-m...
Weak - the pelican is not pedaling in any of these!
Sure, they're the wrong tool for drawing a pelican - but testing their SVG output is a useful way to get a feel for how good they are at step by step reasoning, coordinate systems, spatial awareness and generating valid SVG/XML.
There are genuinely useful applications of SVG-generation from LLMs - outputting simple infographics or charts for example.
I use LLMs to write HTML all the time, of which SVG is a useful optional component.
This benchmark is interesting, because it sidesteps the reasoning and process that humans would excel at.
For example, if I asked you to assemble a bookshelf with some wood, nails, and cement, you might first make a hammer with the cement before trying to assemble the bookshelf.
You can get a much better image by first asking the (multimodal) LLM to draw an image of a pelican on a bicycle, and then generate an SVG using the referenced image.
https://chatgpt.com/share/67609300-9abc-800d-9b26-95074f2149...
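For what it's worth, the two-step version is easy to wire up programmatically as well. A minimal sketch, assuming the openai Python client, with the model names and prompt wording purely illustrative:

```python
# Sketch of the two-step approach: generate a raster reference image first,
# then ask a vision-capable model to transcribe it into SVG.
from openai import OpenAI

client = OpenAI()

# Step 1: generate a raster reference image with an image model.
reference = client.images.generate(
    model="dall-e-3",
    prompt="A pelican riding a bicycle, simple flat illustration",
    n=1,
)
image_url = reference.data[0].url

# Step 2: show that image back to a vision-capable model and ask for SVG.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Reproduce this image as a simple SVG using basic shapes."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
svg_code = response.choices[0].message.content
```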
> these models aren't pelican painters or anything like that, they're LANGUAGE models
Tools are defined by what people use them for, not by how they were intended—or designed—to be used. (Just ask Nvidia)
adding: so I think someone comparing how various tools perform at a task that's valuable to them—and probably others—is just fine, even if it's different from what the creator of the tool intended?
More recently, `gemini-exp-1206` did quite well [1].
[1]: https://github.com/simonw/pelican-bicycle/blob/main/README.m...
Gemini 1206 is the new hotness in my book. I've moved my day-to-day LLM needs over to Google's tab for the first time. I'm not sure what they changed, but it deserves a good look. Claude 3.5 Sonnet (New) is fantastic as well, but the 2M-token context window offered by Google lets you suck in an entire code repository and reason effectively across the whole thing. Google is catching up...
I feel like "quite well" is overselling it a bit. It did maybe slightly better.
"Generate room with no elephant" https://imgur.com/BI61S1T
Models hate negative prompts. There's a reason Stable Diffusion txt2img requires a dedicated "negative" CLIP input in addition to positive.
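For what it's worth, that split is explicit in the diffusers API: the thing you want to avoid has to go into a separate negative_prompt argument rather than into the prompt itself. A minimal sketch, assuming a standard Stable Diffusion 1.5 checkpoint:

```python
# Minimal sketch: the "no elephant" constraint goes into negative_prompt,
# which is encoded by CLIP separately from the positive prompt.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any SD 1.x checkpoint should work
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="an empty living room, photorealistic",
    negative_prompt="elephant",  # asking for "no elephant" in the prompt itself rarely works
    num_inference_steps=30,
).images[0]
image.save("room_without_elephant.png")
```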
FLUX, for all its benefits, is really annoying in this respect, as you cannot really provide negative guidance without complicated hacks [1]. Therefore, "original" ideas that go against popular wisdom or mainstream culture will be very hard to prompt for. This is, in my opinion, an important caveat of ML "art". ML "art" is already considered useless slop because of how much it conforms to biases built into the weights. The inability to tune down adherence to mainstream norms is therefore all the more problematic.
[1] https://www.reddit.com/r/StableDiffusion/comments/1estj69/
I searched the comments and nobody's mentioned that none of the bicycle riders looked like pelicans. Seriously, a pelican looks a certain way ... his bill holds more than his belly can, and all that.
I'm focusing on the bike part here because, as a bike geek, I could draw one from memory that's correct in all details. But to a non-bikie that's more difficult than you'd think. I can't find the picture gallery right now, but here's an article about it, which links to another article:
https://web.archive.org/web/20240419001426/https://www.wired...
So the fact that the AI models screw this up so badly is understandable. Sure, they screw up in ways that humans wouldn't, such as the beak backwards in one of the pictures (pointy end toward the bird!), because they don't know or care about something every human would know: what a beak is for and what it looks like in general. Or, for that matter, the biodynamics of how a pelican's long, spindly legs could, in fact, work a pair of pedals. But ask me to draw a pelican from memory, and have a good laugh (if you're better at it than me), because to me, they're just kind of a peripheral vision, pink abstraction, not something I focus on understanding. And that's what they are to the AI model too.
This is the artist you are thinking of.
This is incredible. I wonder if anybody has set out to build some of these bikes as sculptures.
> ... they're just kind of a peripheral vision, pink abstraction, not something I focus on understanding
are there pink pelicans, or are you thinking of flamingoes?
Ha, see, even missed that part! Honestly.
You had me wondering at
> a pelican's long, spindly legs
and when you then also described it as pink, it started to make sense and I too understood that you must be thinking of flamingos :^)
> as a bike geek, I could draw one from memory that's correct in all details.
No link, sorry, but on YouTube, GCN asked pro riders to draw a bicycle... none could.
Well, as a bike geek - who wrenches on them, changes old bikes to different configurations etc - I can visualize every part because I've dealt with all of them. I can tell you, for example, that ancient Shimano downtube shifters are held in place by an M4.5 bolt. M4.5? Try to find something to fit that at your local hardware store (when changing said bike to handlebar-mounted shifters). Or which way the opposite sides of a BSA bottom bracket are threaded (from memory! Which side has the backwards threads?) Or the whole stack of bits and pieces that make up a headset (both threaded and threadless).
Whereas a pro rider can probably tell you all about the biomechanics of how to optimally interact with the bike, the right foods to eat and how much to sleep and when. But the actual wrenching around with them? That's the pro mechanic's job.
I'd have to agree here that success at this drawing test/challenge is strongly correlated with experience repairing/maintaining (one or more) bicycles, a lot more than it is correlated with riding them.
I also suspect it strongly correlates with knowing the term "diamond-frame". In addition to bicycle-repairers probably knowing the term, it's also used among people who like/know other frame styles--in my case recumbent bicycles.
Why don't you share some of your videos where you draw these bikes?
Ha, I'm pretty into bikes, to the point where at least I understand the questions here but the most complex things I have ever done were changing a normal BB and set of cranks, and replacing some STI cables.
Lots of pro riders do not take care of their bikes themselves. They're used to having bike mechanics adjust everything for every race, and a lot of them don't consider it part of the job to take care of their training bikes. Some of them don't even do the cleaning.
Some friends noticed another one the other day—none of the image generators were able to generate a “glass of wine, full to the brim.” All of the images were just of regular partially-filled glasses of wine, with the accompanying text trying to assure you that the glass was indeed full to the brim.
Just thinking out loud: a lot of models generate pretty poor SVG output, but part of that is because they don't get any visual feedback.
But, what about this workflow: given prompt, LLM generates two SVG outputs. Both are rendered by an SVG renderer, and then we combine the two into one image, one on the left and the other on the right. We then ask a visual LLM (could be the same LLM or could be a different one) to tell us whether the left half or right half of the image is a better response to the prompt. Now we've got preferences which can be used to fine-tune the LLM using DPO. And you could iteratively repeat the process – as the LLM is fine-tuned it may produce even better outputs which then produces new preferences for further fine-tuning.
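A rough sketch of what one iteration of that loop could look like, assuming cairosvg and Pillow for rendering and compositing; ask_llm_for_svg() and ask_vision_judge() are hypothetical wrappers around whatever LLM API you'd actually use:

```python
# Sketch of one preference-collection step: render two SVG candidates,
# composite them side by side, and have a vision model pick the better one.
import io

import cairosvg
from PIL import Image


def render_svg(svg_text: str, size: int = 512) -> Image.Image:
    png_bytes = cairosvg.svg2png(bytestring=svg_text.encode("utf-8"),
                                 output_width=size, output_height=size)
    return Image.open(io.BytesIO(png_bytes)).convert("RGB")


def collect_preference(prompt: str) -> dict:
    svg_a = ask_llm_for_svg(prompt)  # hypothetical: sample #1 from the model
    svg_b = ask_llm_for_svg(prompt)  # hypothetical: sample #2, different seed/temperature

    left, right = render_svg(svg_a), render_svg(svg_b)
    combined = Image.new("RGB", (left.width + right.width, left.height), "white")
    combined.paste(left, (0, 0))
    combined.paste(right, (left.width, 0))

    # hypothetical: vision model answers "left" or "right"
    winner = ask_vision_judge(combined, f"Which half better matches: {prompt}?")

    chosen, rejected = (svg_a, svg_b) if winner == "left" else (svg_b, svg_a)
    # (prompt, chosen, rejected) triples are exactly what DPO training consumes
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```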
Would be interesting to see what kinds of results it might produce in practice.
> Would be interesting to see what kinds of results it might produce in practice.
I would not be surprised to see that the LLM generating SVG does a UNO reverse, and makes use of little colored squares in a grid to draw a “vector image” where each of the squares represents individual pixels :p
Such a “vector image” would be very large; if you put a limit on the length of the generated SVG, the LLM would be blocked from pursuing that strategy. Even without an explicit limit, the LLM’s context limit may block it: if the resulting SVG is larger than the LLM’s context, the LLM may forget what it is doing partway through generating the SVG, which is unlikely to produce a good result.
> aren't any pelican on a bicycle SVG files floating around (yet)
maintains website collecting SVG files of pelicans on bicycles
Most humans can't correctly draw a bicycle from memory: https://www.wired.com/2016/04/can-draw-bikes-memory-definite...
I like how one of them is clearly (to my eyes, at least) a person holding a gun
I wanna see this running in a feedback loop - show the model its output and get it to make corrections.
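Something like this would be easy to try; a sketch assuming cairosvg for rendering, with generate_svg() and critique_and_revise() as hypothetical wrappers around a multimodal LLM API:

```python
# Sketch of an iterative refinement loop: render the model's SVG, show the
# rendering back to a vision-capable model, and ask for a corrected SVG.
import cairosvg


def refine_svg(prompt: str, rounds: int = 3) -> str:
    svg = generate_svg(prompt)  # hypothetical: the usual one-shot attempt
    for _ in range(rounds):
        png = cairosvg.svg2png(bytestring=svg.encode("utf-8"))
        # hypothetical: the model sees its own rendered output and returns a revised SVG
        svg = critique_and_revise(prompt, svg, png)
    return svg
```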
Remember that these are basically one-shot. Very different to how you or I would solve the problem (get a circle up on the screen, have a look at it, make some changes, add some wings, tweak the dimensions, etc.). We would go through hundreds or thousands of feedback cycles before we got something half-decent -- in this situation the model only gets one attempt.
I think it's interesting to consider how humans would go about this task ("Generate an SVG of a pelican riding a bicycle"), and how well they would do if they had to output the SVG into a text box without any other tools. Considering that, I think Claude 3.5 Sonnet and GPT-4o did incredibly well, and even the others might be commended for producing a valid SVG at all.
depends on the human, right? I imagine an artist who specializes in SVG would do pretty well and might make it a professional logo?
Pro artist who specializes in vector work here: I would use Adobe Illustrator to draw it, while looking at actual photos of pelicans and bicycles, and export an SVG. If it needed to have a lot of named parts I could make that happen.
If I had the latest version of Illustrator then I would consider seeing how well its image generation does, but I don't, because it has a lot of exciting new bugs that break my normal workflow. I believe that under the hood it works by feeding your text prompt to a bitmap image generator and running the same old autotrace on it, which results in some pretty messy and hard-to-edit shapes.
Mistral Large's attempt: https://chat.mistral.ai/chat/4da427b1-e033-454d-b134-c5d1f6e...
Some of the drawings, like the one from Amazon Nova Pro, are quite fascinating as abstract artworks. It's like the idea of a bicycle without its physicality.
Careful not to shout too loud about it or they’ll start training for it!
Latest Claude does a suspiciously good job…
It looks like no LLM has seen a pelican before at all.
Once again, Claude wins.
Those output examples are absolutely horrifically bad compared to what I get with a cursory request to Gemini 2.0 using Imagen3 via gemini.google.com.
Was yours an SVG? I think that’s what makes Simon feel that this test is useful: the LLM has to generate functioning SVG code describing these shapes.
Yeah, this test is to see if a pure LLM can output SVG that renders well. It's effectively a test of their "spatial reasoning" capabilities.
It's sneakier than that: because there are so many ways you can draw the same thing in SVG, it's also some sort of orchestration and coherence test.
This is generating SVG data, not using an image generator.
If only it didn't have to be SVG; so glorious.