If you're going to the effort of writing a proc macro, you may as well output a string from the macro instead of code.
If you're going for idiomatic Rust, you might instead output a type with a Display impl rather than generating code that writes to stdout.
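A minimal sketch of what that Display-impl approach could look like, assuming the plain 3/5 variant (the `FizzBuzz` wrapper type and its name are illustrative, not from the article):

```rust
use std::fmt;

// A value that formats itself via Display instead of printing directly,
// so callers decide where the text goes (stdout, a buffer, a string, ...).
struct FizzBuzz(u64);

impl fmt::Display for FizzBuzz {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match (self.0 % 3, self.0 % 5) {
            (0, 0) => write!(f, "FizzBuzz"),
            (0, _) => write!(f, "Fizz"),
            (_, 0) => write!(f, "Buzz"),
            _ => write!(f, "{}", self.0),
        }
    }
}

fn main() {
    for n in 1..=15 {
        println!("{}", FizzBuzz(n));
    }
}
```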
Reminds me of the famous thread on Code Golf Stack Exchange. I'll link the Rust answer directly, but one C++ answer claims 283 GB/s, and others are in the ballpark of 50 GB/s.
The Rust one claims around 3 GB/s:
https://codegolf.stackexchange.com/a/217455
You can take this much further! I think throughput is a great way to measure it.
Things like pre-allocation, branchless code, compile-time constants, SIMD, etc.
In my opinion, a more accurate measure when you go down to the microsecond level is reading the TSC directly from the CPU. I've built a benchmark tool for that: https://github.com/sh4ka/hft-benchmarks
Also, I think CPU pinning could help in this context, but I'd need to check the code on my machine first.
How does this compare with divan?
Divan is what I used as a reference for some parts of my work (mostly the CPU timestamp parts). My project is less complete, but it will also include other important benchmarks for HFT, like network I/O or some real trading patterns like order placement overhead.
If OP is looking for ideas, there are two intermediate steps between the extremes of "write every line to stdout" and "build up a buffer of the whole output and then write it to stdout".
1. `stdout().lock()` and `writeln!()` to that. By default using `print*!()` will write to `stdout()` which takes a process-wide lock each time. (Funnily enough they use .lock() in the "build up a buffer of the whole output" section, just to do one .write_all() call, which is the one time they don't need to use .lock() because Stdout's impl of write_all() will only take the lock once anyway.)
2. Wrap the locked stdout in a `BufWriter` and `writeln!()` to that. It won't flush on every line, but it also won't buffer the entire output, so it's a middle point between speed and memory usage.
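Both steps together could look something like this sketch (assuming the plain 3/5 variant; `write_fizzbuzz` is an illustrative helper name, written against a generic `Write` so it's easy to test against a `Vec<u8>` too):

```rust
use std::io::{self, BufWriter, Write};

// Writing to any `Write` impl keeps the logic separate from the output sink;
// `main` points it at a locked, buffered stdout.
fn write_fizzbuzz<W: Write>(out: &mut W, limit: u32) -> io::Result<()> {
    for n in 1..=limit {
        match (n % 3, n % 5) {
            (0, 0) => writeln!(out, "FizzBuzz")?,
            (0, _) => writeln!(out, "Fizz")?,
            (_, 0) => writeln!(out, "Buzz")?,
            _ => writeln!(out, "{}", n)?,
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Step 1: take the stdout lock once, instead of once per print*! call.
    let stdout = io::stdout();
    let lock = stdout.lock();
    // Step 2: wrap the locked handle in a BufWriter so each writeln! fills
    // an in-memory buffer that is flushed in large chunks.
    let mut out = BufWriter::new(lock);
    write_fizzbuzz(&mut out, 100)?;
    out.flush() // flush whatever is still buffered
}
```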
---
For the final proc macro approach, there is the option to unroll the loop in the generated code, and the option to generate a &'static str literal of the output.
> But the obvious possibilities almost certainly won't be performant: integer modulo is a single CPU instruction [...]
Yes, it is a single instruction, but that's not indicative of the actual performance: modulo on x86 is done through the div instruction, which takes tens of cycles. When you compile this code you'll likely see a multiply + shift instead, because the modulo is by a constant.
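For the curious, that multiply + shift trick can be written out by hand. This sketch uses the well-known magic constant for dividing a `u32` by 3 (`0xAAAAAAAB`, the rounded-up fixed-point reciprocal of 3, with a shift of 33); the function name is illustrative:

```rust
// n / 3 for a u32 can be computed as (n * 0xAAAAAAAB) >> 33 in 64-bit
// arithmetic, and n % 3 then falls out as n - (n / 3) * 3 -- no div needed.
fn mod3_no_div(n: u32) -> u32 {
    let q = ((n as u64 * 0xAAAA_AAAB) >> 33) as u32; // n / 3 without div
    n - q * 3                                        // n % 3
}

fn main() {
    // Spot-check the identity against the plain % operator.
    for n in 0..10_000u32 {
        assert_eq!(mod3_no_div(n), n % 3);
    }
    println!("ok");
}
```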
Maybe I’m missing something but can’t you unroll it very easily by 15 prints at a time? That would skip the modulo checks entirely, and you could actually cache everything but the last two or three digits.
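The unroll-by-15 idea in rough form: the Fizz/Buzz pattern repeats every lcm(3, 5) = 15 numbers, so one loop body can emit a whole period with no modulo checks. A sketch, assuming the limit is a multiple of 15 for brevity (it returns a `Vec<String>` instead of printing, purely to keep it compact and testable):

```rust
fn fizzbuzz_unrolled(limit: u32) -> Vec<String> {
    assert!(limit % 15 == 0, "sketch assumes limit is a multiple of 15");
    let mut out = Vec::with_capacity(limit as usize);
    let mut i = 1u32;
    while i <= limit {
        // One full period of 15: the positions of Fizz/Buzz/FizzBuzz are
        // fixed relative to i, so no divisibility checks are needed.
        out.push(i.to_string());
        out.push((i + 1).to_string());
        out.push("Fizz".into());       // i + 2 is divisible by 3
        out.push((i + 3).to_string());
        out.push("Buzz".into());       // i + 4 is divisible by 5
        out.push("Fizz".into());
        out.push((i + 6).to_string());
        out.push((i + 7).to_string());
        out.push("Fizz".into());
        out.push("Buzz".into());
        out.push((i + 10).to_string());
        out.push("Fizz".into());
        out.push((i + 12).to_string());
        out.push((i + 13).to_string());
        out.push("FizzBuzz".into());   // i + 14 is divisible by 15
        i += 15;
    }
    out
}

fn main() {
    for line in fizzbuzz_unrolled(30) {
        println!("{}", line);
    }
}
```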
> Maybe I’m missing something but can’t you unroll it very easily by 15...
Sure, 3 x 5 = 15. But, FTA:
> But then, by coincidence, I watched an old Prime video and decided to put the question to him: how would you extend this to 7 = "Baz"?
>
> He expanded the if-else chain: I asked him to find a way to do it without explosively increasing the number of necessary checks with each new term added. After some hints and more discussion...
Which is why I respectfully submit that almost all examples of FizzBuzz, including the article's first, are "wrong", while the refactor is "right".
As for the optimizations, they don't focus only on 3 and 5; they include 7 throughout.
I remember writing it in high school and ending up using a wheel (a circular data structure) to avoid any modulo at all. Then my HS teacher said it should be extensible, so I wrote a wheel generator.
Despite writing it in Scheme, I ended up being the fastest. It's no magic bullet, but if you only want regular FizzBuzz it's a simple way to just about double the speed.
Isn't a circular array implemented with a modulo to begin with? I don't see how you bypass it.
m3 = m5 = m7 = 1
for ...
    ...
    m3 = m3 == 2 ? 0 : m3 + 1
    m5 = m5 == 4 ? 0 : m5 + 1
    m7 = m7 == 6 ? 0 : m7 + 1

You keep one counter per divisor; each one wraps around on a compare, so no modulo is needed.
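Those counters can be sketched as a runnable function, assuming the 3/5/7 Fizz/Buzz/Baz variant discussed upthread (`fizzbuzz_wheel` is an illustrative name; it returns a `Vec<String>` to keep it testable):

```rust
// One counter per divisor, each cycling through 0..divisor-1, so the loop
// body contains compares and adds but never a div instruction.
fn fizzbuzz_wheel(limit: u32) -> Vec<String> {
    let (mut m3, mut m5, mut m7) = (1u32, 1u32, 1u32);
    let mut out = Vec::with_capacity(limit as usize);
    for n in 1..=limit {
        let mut s = String::new();
        if m3 == 0 { s.push_str("Fizz"); }
        if m5 == 0 { s.push_str("Buzz"); }
        if m7 == 0 { s.push_str("Baz"); }
        if s.is_empty() { s = n.to_string(); }
        out.push(s);
        // Advance each counter around its wheel.
        m3 = if m3 == 2 { 0 } else { m3 + 1 };
        m5 = if m5 == 4 { 0 } else { m5 + 1 };
        m7 = if m7 == 6 { 0 } else { m7 + 1 };
    }
    out
}

fn main() {
    for line in fizzbuzz_wheel(21) {
        println!("{}", line);
    }
}
```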
> At this point, I'm out of ideas. The most impactful move would probably be to switch to a faster terminal... but I'm already running Ghostty! I thought it was a pretty performant terminal to begin with!
But what's the point? Why do you want to optimize the display? If you want to fizzbuzz millions of numbers, then realistically you only want to compute them just before they are displayed.
Because the display is the bottleneck.
Can you present some real-life scenario where this is an issue? Say you want to display the result on a webpage: the bottleneck of creating a DOM structure with all the <div>s and similar tags would be much more significant. What you should do instead is just create a scrollbar and enough divs to fill the scrolling area. As the user drags the scrollbar's slider, you offset the divs by scrolled_height modulo div_height, then populate those divs with the right values for the scrolled range (which you can easily compute on each scroll event).
The only reason to care about those microseconds is when you want to really fill the console with millions of lines, but you shouldn't actually ever want to do that, I think?