• mobilio a day ago

    Announce was here: https://discourse.ubuntu.com/t/introducing-architecture-vari...

    and key point: "Previous benchmarks we have run (where we rebuilt the entire archive for x86-64-v3) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that."

    • ninkendo a day ago

      > show that most packages show a slight (around 1%) performance improvement

      This takes me back to arguing with Gentoo users 20 years ago who insisted that compiling everything from source for their machine made everything faster.

      The consensus at the time was basically "theoretically, it's possible, but in practice, gcc isn't really doing much with the extra instructions anyway".

      Then there's stuff like glibc which has custom assembly versions of things like memcpy/etc, and selects from them at startup. I'm not really sure if that was common 20 years ago but it is now.
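
      The mechanism behind that is the "indirect function" (ifunc): a resolver runs when the dynamic linker binds the symbol and picks one of several implementations. A rough sketch of the same idea with GCC's ifunc attribute (the names and the trivial bodies are illustrative, not glibc's actual code):

        #include <cstddef>
        #include <cstring>

        // Stand-ins for ISA-specific implementations (glibc's real ones are hand-tuned asm).
        static void* copy_baseline(void* dst, const void* src, size_t n) {
          return std::memcpy(dst, src, n);
        }
        static void* copy_avx2(void* dst, const void* src, size_t n) {
          return std::memcpy(dst, src, n);
        }

        // Resolver: called once by the dynamic linker; the symbol then binds to whichever
        // implementation it returns, so later calls pay no per-call dispatch cost.
        extern "C" void* (*resolve_my_memcpy())(void*, const void*, size_t) {
          __builtin_cpu_init();
          return __builtin_cpu_supports("avx2") ? copy_avx2 : copy_baseline;
        }

        extern "C" void* my_memcpy(void* dst, const void* src, size_t n)
            __attribute__((ifunc("resolve_my_memcpy")));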

      It's cool that after 20 years we can finally start using the newer instructions in binary packages, but it definitely seems to not matter all that much, still.

      • Amadiro a day ago

        It's also because around 20 years ago there was a "reset" when we switched from x86 to x86_64. When AMD introduced x86_64, it made a bunch of the previously optional extensions (SSE up to a certain version etc) a mandatory part of x86_64. Gentoo systems could already be optimized before on x86 using those instructions, but now (2004ish) every system using x86_64 was automatically always taking full advantage of all of these instructions*.

        Since then we've slowly started accumulating optional extensions again; newer SSE versions, AVX, encryption and virtualization extensions, probably some more newfangled AI stuff I'm not on top of. So very slowly it might have started again to make sense for an approach like Gentoo to exist**.

        * usual caveats apply; if the compiler can figure out that using the instruction is useful etc.

        ** but the same caveats as back then apply. A lot of software can't really take advantage of these new instructions, because newer instructions have been getting increasingly use-case-specific; and applications that can greatly benefit from them will already have alternative code paths to take advantage of them anyway. Also a lot of the stuff happening in hardware acceleration has moved to GPUs, which have a feature discovery process independent of the CPU instruction set anyway.

        • slavik81 21 hours ago

          The llama.cpp package on Debian and Ubuntu is also rather clever in that it's built for x86-64-v1, x86-64-v2, x86-64-v3, and x86-64-v4. It benefits quite dramatically from using the newest instructions, but the library doesn't have dynamic instruction selection itself. Instead, ld.so decides which version of libggml.so to load depending on your hardware capabilities.

          • ignoramous 2 hours ago

            > llama.cpp package on Debian and Ubuntu is also rather clever … ld.so decides which version of libggml.so to load depending on your hardware capabilities

            Why is this "clever"? This is pretty much how "fat" binaries are supposed to work, no? At least, such packaging is the norm for Android.

          • mikepurvis a day ago

            > AVX, encryption and virtualization

            I would guess that these are domain-specific enough that they can also mostly be enabled by the relevant libraries employing function multiversioning.
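
            For reference, this is roughly what function multiversioning looks like with GCC/clang's target_clones attribute (a sketch; the dot product is just an illustrative kernel). The compiler emits one clone per listed target plus a resolver that picks the best one for the running CPU at load time:

              #include <cstddef>

              // One source function, several compiled clones; the "avx2" clone gets
              // auto-vectorized with 256-bit registers, "default" stays baseline x86-64.
              __attribute__((target_clones("default", "avx2")))
              double dot(const double* a, const double* b, size_t n) {
                double sum = 0.0;
                for (size_t i = 0; i < n; ++i)
                  sum += a[i] * b[i];
                return sum;
              }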

            • izacus 19 hours ago

              You would guess wrong.

              • mikepurvis 18 hours ago

                Isn’t the whole thrust of this thread that most normal algorithms see little to no speedup from things like AVX, and therefore multiversioning those things that do makes more sense than compiling the whole OS for a newer set of CPU features?

          • ploxiln a day ago

            FWIW the cool thing about gentoo was the "use-flags", to enable/disable compile-time features in various packages. Build some apps with GTK or with just the command-line version, with libao or pulse-audio, etc. Nowadays some distro packages have "optional dependencies" and variants like foobar-cli and foobar-gui, but not nearly as comprehensive as Gentoo of course. Learning about some minor custom CFLAGS was just part of the fun (and yeah some "funroll-loops" site was making fun of "gentoo ricers" way back then already).

            I used Gentoo a lot, jeez, between 20 and 15 years ago, and the install guide guiding me through partitioning disks, formatting disks, unpacking tarballs, editing config files, and running grub-install etc, was so incredibly valuable to me that I have trouble expressing it.

            • mpyne a day ago

              I still use Gentoo for that reason, and I wish some of those principles around handling of optional dependencies were more popular in other Linux distros and package ecosystems.

              There's lots of software applications out there whose official Docker images or pip wheels or whatever bundle everything under the sun to account for all the optional integrations the application has, and it's difficult to figure out which packages can be easily removed if we're not using the feature and which ones are load-bearing.

              • zerocrates 19 hours ago

                I started with Debian on CDs, but used Gentoo for years after that. Eventually I admitted that just Ubuntu suited my needs and used up less time keeping it up to date. I do sometimes still pull in a package that brings a million dependencies for stuff I don't want and miss USE flags, though.

                I'd agree that the manual Gentoo install process, and those tinkering years in general, gave me experience and familiarity that's come in handy plenty of times when dealing with other distros, troubleshooting, working on servers, and so on.

                • michaelcampbell 4 hours ago

                  Someone has set up an archive of that site; I visit it once in a while for a few nostalgic chuckles

                  https://www.shlomifish.org/humour/by-others/funroll-loops/Ge...

                  • viraptor 6 hours ago

                    Nixpkgs exposes a lot of options like that. You can override both options and dependencies and supply your own cflags if you really want.

                  • hajile 15 hours ago

                    According to this[0] study of the Ubuntu 16.04 package repos, 89% of all x86 code was made up of just 12 instructions (mov, add, call, lea, je, test, jmp, nop, cmp, jne, xor, and -- in that order).

                    The extra issue here is that SIMD (the main optimization) simply sucks to use. Auto-vectorization has been mostly a pipe dream for decades now as the sufficiently-smart compiler simply hasn't materialized yet (and maybe for the same reason the EPIC/Itanium compiler failed -- deterministically deciding execution order at compile time isn't possible in the abstract and getting heuristics that aren't deceived by even tiny changes to the code is massively hard).

                    Doing SIMD means delving into x86 assembly and all its nastiness/weirdness/complexity. It's no wonder that devs won't touch it unless absolutely necessary (which is why the speedups are coming from a small handful of super-optimized math libraries). ARM vector code is also rather byzantine for a normal dev to learn and use.

                    We need a simpler assembly option that normal programmers can easily learn and use. Maybe it's way less efficient than the current options, but some slightly slower SIMD is still going to generally beat no SIMD at all.

                    [0] https://oscarlab.github.io/papers/instrpop-systor19.pdf

                    • kccqzy 13 hours ago

                      The highway library is exactly the kind of a simpler option to use SIMD. Less efficient than hand written assembler but you can easily write good enough SIMD for multiple different architectures.
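
                      To ground that: width-agnostic Highway code looks roughly like this (a static-dispatch sketch for whatever target you compile for; the function and variable names are illustrative):

                        #include <cstddef>
                        #include "hwy/highway.h"

                        namespace hn = hwy::HWY_NAMESPACE;

                        // out[i] += a[i] * b[i], using whatever vector width the target offers
                        // (SSE4, AVX2, AVX-512, NEON, ...) with no hand-written intrinsics.
                        HWY_ATTR void MulAddTo(const float* a, const float* b, float* out, size_t n) {
                          const hn::ScalableTag<float> d;  // "a full vector of floats on this target"
                          const size_t N = hn::Lanes(d);
                          size_t i = 0;
                          for (; i + N <= n; i += N) {
                            const auto va = hn::Load(d, a + i);
                            const auto vb = hn::Load(d, b + i);
                            hn::Store(hn::MulAdd(va, vb, hn::Load(d, out + i)), d, out + i);
                          }
                          for (; i < n; ++i) out[i] += a[i] * b[i];  // scalar tail
                        }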

                      • badlibrarian 14 hours ago

                        Agner Fog's libraries make it pretty trivial for C++ programmers at least. https://www.agner.org/optimize/
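
                        For comparison, his Vector Class Library wraps the intrinsics in operator-overloaded types; a small sketch (assumes the array length is a multiple of 8; names are illustrative):

                          #include "vectorclass.h"  // Agner Fog's VCL, built with e.g. -mavx2

                          // Sum an array 8 floats at a time; the operators map to AVX intrinsics.
                          float sum8(const float* data, int n) {  // n assumed to be a multiple of 8
                            Vec8f acc(0.0f);
                            for (int i = 0; i < n; i += 8) {
                              Vec8f v;
                              v.load(data + i);
                              acc += v;
                            }
                            return horizontal_add(acc);
                          }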

                        • JonChesterfield 14 hours ago

                          The sufficiently smart vectoriser has been here for decades. Cuda is one. Uses all the vector units just fine, may struggle to use the scalar units.

                        • oivey a day ago

                          This should build a lot more incentive for compiler devs to try and use the newer instructions. When everyone uses binaries compiled without support for optional instruction sets, why bother putting much effort into developing for them? It’ll be interesting to see if we start to see more of a delta moving forward.

                          • Seattle3503 19 hours ago

                            And application developers to optimize with them in mind?

                          • suprjami 19 hours ago

                            I somehow have the memory that there was an extremely narrow time window where the speedup was tangible and quantifiable for Gentoo, as they were the first distro to ship some very early gcc optimisation. However it's open source software so every other distro soon caught up and became just as fast as Gentoo.

                            • harha 21 hours ago

                              Would it make a difference if you compile the whole system vs. just the programs you want optimized?

                              As in, are there any common libraries or parts of the system that typically slow things down, or was this more targeting a time when hardware was more limited so improving all would have made things feel faster in general.

                            • horizion2025 9 hours ago

                              How many additions have there even been outside of AVX-x? And even AVX2 is from 2011. If we ignore AVX-x, the last I can recall are the few instructions added in the bit-manipulation sets BMI/ABM, but they are Haswell/Piledriver/Jaguar era (2012-2013). While some specific cases could benefit, it doesn't seem like a goldmine of performance improvements.

                              Further, maybe it has not been a focus for compiler vendors to generate good code for these higher-level archs if few are using the feature. So Ubuntu's move could improve that.

                              • juujian a day ago

                                Are there any use cases where that 1% is worth any hassle whatsoever?

                                • wongarsu a day ago

                                  If every computer built in the last decade gets 1% faster and all we have to pay for that is a bit of one-off engineering effort and a doubling of the storage requirement of the ubuntu mirrors that seems like a huge win

                                  If you aren't convinced by your ubuntu being 1% faster, consider how many servers, VMs and containers run ubuntu. Millions of servers using a fraction of a percent less energy multiplies out to a lot of energy

                                  • vladms a day ago

                                    Don't have a clear opinion, but you have to factor in all the issues that can be due to different versions of software. Think of unexposed bugs in the whole stack (that can include compiler bugs but also software bugs related to numerical computation or just uninitialized memory). There are enough heisenbugs without worrying that half the servers run on a slightly different software.

                                    It's not for nothing that some time ago "write once, run everywhere" was a selling proposition (not that it was actually working in all cases, but definitely working better than alternatives).

                                    • alkonaut 9 hours ago

                                      If I recompile a program to utilize my CPU more fully (use AVX or whatever), then if my program takes 1 second to execute instead of 2, it likely did not use half the _energy_.

                                      • darkwater 6 hours ago

                                        Obviously not. But scale it out to a fleet of 1000 servers running your program continuously, you can now shut down 10 for the same exact workload.

                                      • sumtechguy a day ago

                                        That comes out to about 1.5 hours faster per week for many tasks, if you are running full tilt. But that seems like an OK, easy win.

                                        • duskdozer 16 hours ago

                                          how much energy would we save if every website request weren't loaded down with 20MB of ads and analytics :(

                                        • Aissen a day ago

                                          You need 100 servers. Now you only need to buy 99. Multiply that by a million, and the economies of scale really matter.

                                          • iso1631 a day ago

                                            1% is less than the difference between negotiating with a hangover or not.

                                            • gpm a day ago

                                              What a strange comparison.

                                              If you're negotiating deals worth billions of dollars, or even just millions, I'd strongly suggest not doing so with a hangover.

                                              • tclancy 18 hours ago

                                                Sounds like someone never read Sun Tzu.

                                                (Not really, I just know somewhere out there is a LinkedInLunatic who has a Business Philosophy based on being hungover.)

                                                • gpm 14 hours ago

                                                  Appear drunk when you are sober, and sober when you are drunk

                                                  - Sun Zoo

                                                • Pet_Ant a day ago

                                                  > If you're negotiating deals worth billions of dollars, or even just millions, I'd strongly suggest not doing so with a hangover.

                                                  ...have you met salespeople? Buying lap dances is a legitimate business expense for them. You'd be surprised how much personal rapport matters and facts don't.

                                                  In all fairness, I only know about 8 and 9 figure deals, maybe at 10 and 11 salespeople grow ethics...

                                                  • bregma a day ago

                                                    I strongly suspect ethics are inversely proportional to the size of the deal.

                                                    • glenstein a day ago

                                                      That's more an indictment of sales culture than a critique of computational efficiency.

                                                      • squeaky-clean 21 hours ago

                                                        Well sure, because you want the person trying to buy something from you for a million dollars to have a hangover.

                                                • PeterStuer a day ago

                                                  A lot of improvements are very incremental. In aggregate, they often compound and are very significant.

                                                  If you would only accept 10x improvements, I would argue progress would be very small.

                                                  • ilaksh a day ago

                                                    They did say some packages were more. I bet some are 5%, maybe 10 or 15. Maybe more.

                                                    Well, one example could be llama.cpp. It's critical for it to use every single extension the CPU has to move more bits at a time. When I installed it I had to compile it.

                                                    This might make it more practical to start offering OS packages for things like llama.cpp

                                                    I guess people that don't have newer hardware aren't trying to install those packages. But maybe the idea is that packages should not break on certain hardware.

                                                    Blender might be another one like that which really needs the extensions for many things. But maybe you do want to allow it to be used on some oldish hardware anyway because it still has uses that are valid on those machines.

                                                    • adgjlsfhk1 a day ago

                                                      it's very non-uniform. 99% see no change, but 1% see 1.5-2x better performance

                                                      • 2b3a51 a day ago

                                                        I'm wondering if 'somewhat numerical in nature' relates to lapack/blas and similar libraries that are actually dependencies of a wide range of desktop applications?

                                                        • adgjlsfhk1 a day ago

                                                          BLAS and LAPACK generally do manual multi-versioning by detecting CPU features at runtime. This is more useful one level up the stack, in things like compression/decompression, ODE solvers, image manipulation and so on, which still work with big arrays of data but don't have a small number of kernels (or as much dev time), so they typically rely on compilers for auto-vectorization.
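
                                                          In miniature, that manual multi-versioning is just a one-time CPU probe behind a function pointer; a sketch using GCC's __builtin_cpu_supports (the daxpy names are stand-ins, not any real BLAS's internals):

                                                            #include <cstddef>

                                                            // Two stand-in kernels; a real BLAS ships hand-tuned versions of each hot loop.
                                                            static void daxpy_generic(size_t n, double a, const double* x, double* y) {
                                                              for (size_t i = 0; i < n; ++i) y[i] += a * x[i];
                                                            }
                                                            static void daxpy_avx2(size_t n, double a, const double* x, double* y) {
                                                              for (size_t i = 0; i < n; ++i) y[i] += a * x[i];  // imagine an AVX2-tuned body here
                                                            }

                                                            using daxpy_fn = void (*)(size_t, double, const double*, double*);

                                                            // Probe the CPU once; every later call goes through the chosen pointer.
                                                            static daxpy_fn pick_daxpy() {
                                                              __builtin_cpu_init();
                                                              return __builtin_cpu_supports("avx2") ? daxpy_avx2 : daxpy_generic;
                                                            }
                                                            static const daxpy_fn daxpy = pick_daxpy();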

                                                        • Insanity a day ago

                                                          I read it as, across the board a 1% performance improvement. Not that only 1% of packages get a significant improvement.

                                                          • darkwater 6 hours ago

                                                            The announcement is pretty clear on this:

                                                               > Previous benchmarks (...) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that.
                                                            • IAmBroom a day ago

                                                              In a complicated system, a 1% overall benefit might well be because of a 10% improvement in just 10% of the system (or more in a smaller contributor).

                                                          • locknitpicker a day ago

                                                            > Are there any use cases where that 1% is worth any hassle whatsoever?

                                                            I don't think this is a valid argument to make. If you were doing the optimization work then you could argue tradeoffs. You are not, Canonical is.

                                                            Your decision is which image you want to use, and Canonical is giving you a choice. Do you care about which architecture variant you use? If you do, you can now pick the one that works best for you. Do you want to win an easy 1% performance gain? Now you have that choice.

                                                            • godelski 21 hours ago

                                                                > where that 1% is worth any hassle
                                                              
                                                              You'll need context to answer your question, but yes there are cases.

                                                              Let's say you have a process that takes 100hrs to run and costs $1k/hr. You save an hour and $1k every time you run the process. You're going to save quite a bit. You don't just save the time to run the process, you save literal time and everything that that costs (customers, engineering time, support time, etc).

                                                              Let's say you have a process that takes 100ns and similarly costs $1k/hr. You now run in 99ns. Running the process 36 million times is going to be insignificant. In this setting even a 50% optimization probably isn't worthwhile (unless you're a high frequency trader or something)

                                                              This is where the saying "premature optimization is the root of all evil" comes from! The "premature" part is often disregarded and the rest of the context goes with it. Here's more context to Knuth's quote[0].

                                                                There is no doubt that the holy grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
                                                              
                                                                Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
                                                              
                                                              Knuth said: "Get a fucking profiler and make sure that you're optimizing the right thing". He did NOT say "don't optimize".

                                                              So yes, there are plenty of times where that optimization will be worthwhile. The percentages don't mean anything without the context. Your job as a programmer is to determine that context. And not just in the scope of your program, but in the scope of the environment you expect a user to be running on. (i.e. their computer probably isn't entirely dedicated to your program)

                                                              [0] https://dl.acm.org/doi/10.1145/356635.356640 (alt) https://sci-hub.se/10.1145/356635.356640

                                                              • dehrmann a day ago

                                                                Anything at scale. 1% across FAANG is huge.

                                                                • dabinat 17 hours ago

                                                                  If those kinds of optimizations are on the table, why would they not already be compiling and optimizing from source?

                                                                  • darkwater 6 hours ago

                                                                    I'm not a hyperscaler, I run a thousand machines. If all it takes is changing the base image I use to build those machines - in an already automated process - then the optimization is basically free. Well, unless it triggers some new bug that was not there before.

                                                                  • Havoc a day ago

                                                                    Arguably the same across consumers too. It's just harder to measure than in central datacenters

                                                                    • notatoad a day ago

                                                                      nah, performance benefits are mostly wasted on consumers, because consumer hardware is very infrequently CPU-constrained. in a datacentre, a 1% improvement could actually mean you provision 99 CPUs instead of 100. but on your home computer, a 1% CPU improvement means that your network request completes 0.0001% faster, or your file access happens 0.000001% faster, and then your CPU goes back to being idle.

                                                                      an unobservable benefit is not a benefit.

                                                                    • bandrami 15 hours ago

                                                                      Isn't Facebook still using PHP?

                                                                      • dehrmann 38 minutes ago

                                                                        They forked PHP into Hack. They've diverged pretty far by this point (especially with data structures), but it maintains some of PHP's quirks and request-oriented runtime. It's jitted by HHVM. Both Hack and HHVM are open-source, but I'm not aware of any major users outside Meta.

                                                                        • speed_spread 5 hours ago

                                                                          Compiled PHP. I'm pretty sure they ran the numbers.

                                                                      • gwbas1c a day ago

                                                                        > some packages, mostly those that are somewhat numerical in nature, improve more than that

                                                                        Perhaps if you're doing CPU-bound math you might see an improvement?

                                                                        • rossjudson a day ago

                                                                          Any hyperscaler will take that 1% in a heartbeat.

                                                                          • wat10000 a day ago

                                                                            It's rarely going to be worth it for an individual user, but it's very useful if you can get it to a lot of users at once. See https://www.folklore.org/Saving_Lives.html

                                                                            "Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and thats 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?"

                                                                            I put a lot of effort into chasing wins of that magnitude. Over a huge userbase, something like that has a big positive ROI. These days it also affects important things like heat and battery life.

                                                                            The other part of this is that the wins add up. Maybe I manage to find 1% every couple of years. Some of my coworkers do too. Now you're starting to make a major difference.

                                                                            • colechristensen a day ago

                                                                              Very few people are in the situation where this would matter.

                                                                              Standard advice: You are not Google.

                                                                              I'm surprised and disappointed 1% is the best they could come up with, with numbers that small I would expect experimental noise to be much larger than the improvement. If you tell me you've managed a 1% improvement you have to do a lot to convince me you haven't actually made things 5% worse.

                                                                              • noir_lord a day ago

                                                                                No but a lot of people are buying a lot of compute from Google, Amazon and Microsoft.

                                                                                At scale marginal differences do matter and compound.

                                                                            • dang a day ago

                                                                              Thanks - we've merged the comments from https://news.ycombinator.com/item?id=45772579 into this thread, which had that original source.

                                                                              • jwrallie 18 hours ago

                                                                                Is it worth it losing the ability to just put your hdd on your older laptop and booting it in an emergency?

                                                                                • pizlonator a day ago

                                                                                  That 1% number is interesting but risks missing the point.

                                                                                  I bet you there is some use case of some app or library where this is like a 2x improvement.

                                                                                  • alternatex 5 hours ago

                                                                                    Aggregated metrics are always useless as they tend to show interesting and sometimes exciting data that in actuality contains zero insight. I'm always wary of people making decisions based on aggregate metrics.

                                                                                    Would be nice to know the per app metrics.

                                                                                • theandrewbailey a day ago

                                                                                  A reference for x86-64 microarchitecture levels: https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...

                                                                                  x86-64-v3 is AVX2-capable CPUs.

                                                                                  • jsheard a day ago

                                                                                    > x86-64-v3 is AVX2-capable CPUs.

                                                                                    Which unfortunately extends all the way to Intel's newest client CPUs, since they're still struggling to ship their own AVX512 instructions, which are required for v4. Meanwhile AMD has been on v4 for two generations already.

                                                                                    • theandrewbailey a day ago

                                                                                      At least Intel and AMD have settled on a mutually supported subset of AVX-512 instructions.

                                                                                      • wtallis a day ago

                                                                                        The hard part was getting Intel and Intel to agree on which subset to keep supporting.

                                                                                        • cogman10 a day ago

                                                                                          Even on the same chip.

                                                                                          Having a non-uniform instruction set for one package was a baffling decision.

                                                                                          • jsheard a day ago

                                                                                            I think that stemmed from their P-core design being shared between server and client. They needed AVX512 for server so they implemented it in the P-cores, and it worked fine there since their server chips are entirely P-cores or entirely E-cores, but client uses a mixture of both so they had to disable AVX512 to bring the instruction set into sync across both sides.

                                                                                            • wtallis a day ago

                                                                                              Server didn't really have anything to do with it. They were fine shipping AVX 512 in consumer silicon for Cannon Lake (nominally), Ice Lake, Tiger Lake, and most damningly Rocket Lake (backporting an AVX 512-capable core to their 14nm process for the sole purpose of making a consumer desktop chip, so they didn't even have the excuse that they were re-using a CPU core floorplan that was shared with server parts).

                                                                                              It's pretty clear that Alder Lake was simply a rush job, and had to be implemented with the E cores they already had, despite never having planned for heterogenous cores to be part of their product roadmap.

                                                                                            • jiggawatts a day ago

                                                                                              It’s a manifestation of Conway’s law: https://en.wikipedia.org/wiki/Conway%27s_law

                                                                                              They had two teams designing the two types of cores.

                                                                                    • zozbot234 a day ago

                                                                                      What are the changes to dpkg and apt? Are they being shared with Debian? Could this be used to address the pesky armel vs. armel+hardfloat vs. armhf issue, or for that matter, the issue of i486 vs. i586 vs. i686 vs. the many varieties of MMX and SSE extensions for 32-bit?

                                                                                      (There is some older text in the Debian Wiki https://wiki.debian.org/ArchitectureVariants but it's not clear if it's directly related to this effort)

                                                                                      • Denvercoder9 a day ago

                                                                                        Even if technically possible, it's unlikely this will be used to support any of the variants you mentioned in Debian. Both i386 and armel are effectively dead: i386 is reduced to a partial architecture only for backwards compatibility reasons, and armel has been removed entirely from development of the next release.

                                                                                        • zozbot234 21 hours ago

                                                                                          What you said is correct wrt. official support, but Debian also has an unofficial ports infrastructure that could be repurposed towards enabling Debian for older architecture variants.

                                                                                        • mwhudson 10 hours ago

                                                                                          > Could this be used to address the pesky armel vs. armel+hardfloat vs. armhf issue

                                                                                          No, because those are different ABIs (and a debian architecture is really an ABI)

                                                                                          > the issue of i486 vs. i586 vs. i686 vs. the many varieties of MMX and SSE extensions for 32-bit?

                                                                                          It could be used for this but it's about 15 years too late to care surely?

                                                                                          > (There is some older text in the Debian Wiki https://wiki.debian.org/ArchitectureVariants but it's not clear if it's directly related to this effort)

                                                                                          Yeah that is a previous version of the same design. I need to get back to talking to Debian folks about this.

                                                                                          • bobmcnamara a day ago

                                                                                            This would allow mixing armel and softvfp ABIs, but not hard float ABIs, at least across compilation unit boundaries (that said, GCC never seems to optimize ABI bottlenecks within a compilation unit anyway)

                                                                                          • watersb 18 hours ago

                                                                                            Over the past year, Intel has pulled back from Linux development.

                                                                                            Intel has reduced its number of employees, and has lost lots of software developers.

                                                                                            So we lost Clear Linux, their Linux distribution that often showcased performance improvements due to careful optimization and utilization of microarchitectural enhancements.

                                                                                            I believe you can still use the Intel compiler, icc, and maybe see some improvements in performance-sensitive code.

                                                                                            https://clearlinux.org/

                                                                                            "It was actively developed from 2/6/2015-7/18/2025."

                                                                                            • dooglius 18 hours ago

                                                                                              icc was discontinued FWIW. The replacement, icx, is AIUI just clang plus some proprietary plugins

                                                                                            • Hasz a day ago

                                                                                              Getting a 1% across the board general purpose improvement might sound small, but is quite significant. Happy to see Canonical invest more heavily in performance and correctness.

                                                                                              Would love to see which packages benefited the most in terms of percentile gain and install base. You could probably back out a kWh/tons of CO2 saved metric from it.

                                                                                              • dfc a day ago

                                                                                                > you will not be able to transfer your hard-drive/SSD to an older machine that does not support x86-64-v3. Usually, we try to ensure that moving drives between systems like this would work. For 26.04 LTS, we’ll be working on making this experience cleaner, and hopefully provide a method of recovering a system that is in this state.

                                                                                                Does anyone know what the plans are to accomplish this?

                                                                                                • dmoreno a day ago

                                                                                                    If I were them I would make sure the V3 instructions are not used until late in the boot process, and provide some apt command that makes sure all installed programs are in the right subarchitecture for the running system, reinstalling as necessary.

                                                                                                    But that does not sound like a simple solution for non-technical users.

                                                                                                    Anyway, non-technical users moving an installation to another, older computer? That sounds weird.

                                                                                                  • mwhudson 10 hours ago

                                                                                                    I am probably going to be the one implementing this and I don't know what I am going to do yet! At the very least we need the failure mode to be better (currently you get an OOPS when the init from the initrd dies due to an illegal instruction exception)

                                                                                                  • theandrewbailey 2 days ago

                                                                                                    A reference for x86-64 microarchitecture levels: https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...

                                                                                                    x86-64-v3 is AVX2-capable CPUs.

                                                                                                    • rock_artist a day ago

                                                                                                        So if I got it right, this is mostly a way to have branches within a specific release for various levels of CPUs and their support of SIMD and other modern opcodes.

                                                                                                        And if I have it right, the main advantage should come with the package manager and open-source software, where the compiled binaries would be branched to benefit from and be optimized for newer CPU features.

                                                                                                        Still, this would be most noticeable for apps that benefit from those features, such as audio DSP, or, as mentioned, SSL and crypto.

                                                                                                      • jeffbee a day ago

                                                                                                        I would expect compression, encryption, and codecs to have the least noticeable benefit because these already do runtime dispatch to routines suited to the CPU where they are running, regardless of the architecture level targeted at compile time.

                                                                                                        • WhyNotHugo a day ago

                                                                                                          OTOH, you can remove the runtime dispatching logic entirely if you compile separate binaries for each architecture variant.

                                                                                                            Especially the binaries for the newest variant, since they can drop the conditionals/branching for all older variants entirely.
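
                                                                                                            A tiny sketch of why: with a per-variant build the target macro is known at compile time, so the check constant-folds and the fallback path becomes dead code (illustrative; real libraries each have their own dispatch macros):

                                                                                                              // When built for x86-64-v3 (or with -mavx2), __AVX2__ is predefined and this whole
                                                                                                              // function folds to "return true"; the runtime probe only exists in the generic build.
                                                                                                              bool have_avx2() {
                                                                                                              #if defined(__AVX2__)
                                                                                                                return true;
                                                                                                              #else
                                                                                                                __builtin_cpu_init();
                                                                                                                return __builtin_cpu_supports("avx2");
                                                                                                              #endif
                                                                                                              }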

                                                                                                          • jeffbee 21 hours ago

                                                                                                            That's a lot of surgery. These libraries do not all share one way to do it. For example zstd will switch to static BMI2 dispatch if it was targeting Haswell or later at compile time, but other libraries don't have that property and will need defines.

                                                                                                      • physicsguy a day ago

                                                                                                        This is quite good news but it’s worth remembering that it’s a rare piece of software in the modern scientific/numerical world that can be compiled against the versions in distro package managers, as versions can significantly lag upstream months after release.

                                                                                                        If you’re doing that sort of work, you also shouldn’t use pre-compiled PyPi packages for the same reason - you leave a ton of performance on the table by not targeting the micro-architecture you’re running on.

                                                                                                        • PaulHoule a day ago

                                                                                                            My RSS reader trains a model every week or so and takes 15 minutes total with plain numpy, scikit-learn and all that. Intel MKL can do the same job in about half the time as the default BLAS. So you are looking at a noticeable performance boost, but a zero-bullshit install with uv is worth a lot. If I was interested in improving the model then yeah, I might need to train 200 of them interactively and I'd really feel the difference. Thing is, the model is pretty good as it is, and to make something better I'd have to think long and hard about what 'better' means.

                                                                                                        • zipy124 a day ago

                                                                                                          Yup, if you're using OpenCV for instance compiling instead of using pre-built binaries can result in 10x or more speed-ups once you take into account avx/threading/math/blas-libraries etc...

                                                                                                          • oofbey a day ago

                                                                                                            Yup. The irony is that the packages which are difficult to build are the ones that most benefit from custom builds.

                                                                                                          • niwtsol a day ago

                                                                                                            Thanks for sharing this. I'd love to learn more about micro-architectures and instruction sets - would you have any recommendations for books or sources that would be a good starting place?

                                                                                                            • physicsguy 11 hours ago

                                                                                                              My experience is mostly practical really - the trick is to learn how to compile stuff yourself.

                                                                                                              If you do a typical: "cmake . && make install" then you will often miss compiler optimisations. There's no standard across different packages so you often have to dig into internals of the build system and look at the options provided and experiment.

                                                                                                              Typically if you compile a C/C++/Fortran .cpp/.c/.fXX file by hand, you have to supply arguments to instruct the use of specific instruction sets. -march=native typically means "compile this binary to run with the maximum set of SIMD instructions that my current machine supports", but you can get quite granular with individual feature flags like "-msse4.2 -mavx -mavx2", for either compatibility reasons or to try out subsets.

                                                                                                            • colechristensen a day ago

                                                                                                              Most of the scientific numerical code I ever used had been in use for decades and would compile on a unix variant released in 1992, much less the distribution version of dependencies that were a year or two behind upstream.

                                                                                                              • owlbite a day ago

                                                                                                                Very true, but a lot of stuff builds on a few core optimized libraries like BLAS/LAPACK, and picking up a build of those targeted at a modern microarchitecture can give you 10x or more compared to a non-targeted build.

                                                                                                                That said, most of those packages will just read the hardware capability from the OS and dispatch an appropriate codepath anyway. You maybe save some code footprint by restricting the number of codepaths it needs to compile.

                                                                                                                • physicsguy 13 hours ago

                                                                                                                  I mean that’s just lucky and totally depends on your field and what is normal - just as an example, we used the LLNL SUNDIALS package for implicit time integration. On Ubuntu 24.04 the latest version is 6.4.1 where the latest published is v7.5.0. We found their major version releases tended to require changes.

                                                                                                                  There’s also the difference between being able to run and being able to run optimised. At least 5 years ago, the Ubuntu/Debian builds of FFTW didn’t include the parallelised OpenMP library.

                                                                                                                  In a past life I did HPC support and I recommend the Spack package manager a lot to people working in this area because you can get optimised builds with whatever compiler tool chain and options you need quite easily that way.

                                                                                                                • jeffbee a day ago

                                                                                                                I wonder who downvoted this. The juice you are going to get from building your core applications and libraries to suit your workload is going to be far larger than the small improvements available from microarchitectural targeting. For example on Ubuntu I have some ETL pipelines that need libxml2. Linking it statically into the application cuts the ETL runtime by 30%. Essentially none of the practices of Debian/Ubuntu Linux are what you'd choose for efficiency. Their practices are designed around some pretty old and arguably obsolete ideas about ease of maintenance.

                                                                                                                • smlacy a day ago

                                                                                                                  I presume the motivation is performance optimization? It would be more compelling to include some of the benefits in the announcement?

                                                                                                                  • embedding-shape a day ago

                                                                                                                    They do mention it in the linked announcement, although not really highlighted, just as a quick mention:

                                                                                                                    > As a result, we’re very excited to share that in Ubuntu 25.10, some packages are available, on an opt-in basis, in their optimized form for the more modern x86-64-v3 architecture level

                                                                                                                    > Previous benchmarks we have run (where we rebuilt the entire archive for x86-64-v3) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that.

                                                                                                                    • pushfoo 18 hours ago

                                                                                                                      ARM/RISC-V extensions may be another reason. If a widespread variant configuration exists, why not build for it? See:
                                                                                                                      - RISC-V's official extensions[1]
                                                                                                                      - ARM's JS-specific float-to-fixed conversion[2]

                                                                                                                      1. https://riscv.atlassian.net/wiki/spaces/HOME/pages/16154732/... 2. https://developer.arm.com/documentation/dui0801/h/A64-Floati...

                                                                                                                    • zdw a day ago

                                                                                                                      A lot of other 3rd-party software already requires x86-64-v2 or -v3.

                                                                                                                      I couldn't run something from NPM on an older NAS machine (HP Microserver Gen 7) recently because of this.

                                                                                                                      • stabbles a day ago

                                                                                                                        Seems like this is not using glibc's hwcaps (where shared libraries were located in microarch specific subdirs).

                                                                                                                        To me hwcaps feels like a very unfortunate feature creep of glibc now. I don't see why it was ever added, given that it's hard to compile only shared libraries for a specific microarch, and it does not benefit executables. Distros seem to avoid it. All it does is cause unnecessary stat calls when running an executable.

                                                                                                                        • mwhudson 10 hours ago

                                                                                                                          No it's not using hwcaps. That would only allow optimization of code in shared libraries, would be irritating to implement in a way that didn't require touching each package that includes shared libraries, and would (depending on details) waste a bunch of space on every user's system. I think hwcaps would only make sense for a small number of shared libraries if at all, not a system-wide thing.

                                                                                                                        • ElijahLynn a day ago

                                                                                                                          I clicked on this article expecting an M series variant for Apple hardware...

                                                                                                                          • malkia a day ago

                                                                                                                            This is awesome, but ... If your process requires deterministic results (speaking about floats/doubles mostly here), then you need to get this straight.

                                                                                                                            • sluongng a day ago

                                                                                                                              Nice. This is one of the main reasons why I picked CachyOS recently. Now I can fall back to Ubuntu if CachyOS gets me stuck somewhere.

                                                                                                                              • yohbho a day ago

                                                                                                                                 CachyOS uses this one percent of performance gain? Since it chases every performance gain, that's unsurprising. But now I wonder how my laptop from 2012 ran CachyOS; they seem to pick the variant based on the hardware, not during image download and boot.

                                                                                                                                • topato a day ago

                                                                                                                                   Correct, it just sets the repository in pacman.conf to cachyos, -v3, or -v4 at install time, based on a hardware probe.
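
                                                                                                                                   (Illustrative sketch only; this is my guess at the shape of such a probe, not CachyOS's installer code: GCC 12 or newer understands the psABI level names directly in __builtin_cpu_supports(), so picking the repository suffix can be a few lines of C.)

                                                                                                                                       #include <stdio.h>

                                                                                                                                       int main(void) {
                                                                                                                                           __builtin_cpu_init();  /* populate the runtime CPU-feature cache */

                                                                                                                                           const char *level = "x86-64";  /* baseline */
                                                                                                                                           if (__builtin_cpu_supports("x86-64-v4"))      level = "x86-64-v4";
                                                                                                                                           else if (__builtin_cpu_supports("x86-64-v3")) level = "x86-64-v3";
                                                                                                                                           else if (__builtin_cpu_supports("x86-64-v2")) level = "x86-64-v2";

                                                                                                                                           /* An installer could map this onto the matching repository name. */
                                                                                                                                           printf("highest supported level: %s\n", level);
                                                                                                                                           return 0;
                                                                                                                                       }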

                                                                                                                              • tommica a day ago

                                                                                                                                 Once they have rebuilt with Rust, they can move away from GPL licenses and monetize things.

                                                                                                                                • wyldfire 18 hours ago

                                                                                                                                  Would we have something like aarch64 neon/SVE too?

                                                                                                                                  • benatkin a day ago

                                                                                                                                    There's an unofficial repo for ArchLinux: https://wiki.archlinux.org/title/Unofficial_user_repositorie...

                                                                                                                                    > Description: official repositories compiled with LTO, -march=x86-64-vN and -O3.

                                                                                                                                    Packages: https://status.alhp.dev/

                                                                                                                                    • justahuman74 a day ago

                                                                                                                                       If this goes well, will they do v4 as well?

                                                                                                                                      • jnsgruk a day ago

                                                                                                                                         Maybe. We'll likely trade off the added build/test/storage cost of maintaining each variant, so you might not see amd64v4, but possibly amd64v5, depending on how impactful they turn out to be.

                                                                                                                                        The same will apply to different arm64 or riscv64 variants.

                                                                                                                                        • mwhudson 10 hours ago

                                                                                                                                          Probably not v4 unless AVX512 becomes more ubiquitous than it looks like it will. But yeah, I don't expect this to be the only variant ever.

                                                                                                                                        • skywhopper a day ago

                                                                                                                                           This sure feels like overkill that leaks massive complexity into far more areas than need it. For the applications that truly need sub-architecture variants, surely separate packages or some sort of meta-package indirection would be better for everyone involved.

                                                                                                                                          • amelius a day ago

                                                                                                                                            Can we please have an "apt rollback" function?

                                                                                                                                            • julian-klode a day ago

                                                                                                                                               Yes, sure:

                                                                                                                                               apt (3.1.7) unstable; urgency=medium

                                                                                                                                                 [ Julian Andres Klode ]
                                                                                                                                                 * test-history: Adjust for as-installed testing

                                                                                                                                                 [ Simon Johnsson ]
                                                                                                                                                 * Add history undo, redo, and rollback features

                                                                                                                                              • dima55 14 hours ago

                                                                                                                                                Exciting. I just looked for docs about these new features, and can't find anything. Can you point us to these? Thanks!

                                                                                                                                              • riskable a day ago

                                                                                                                                                If you're using btrfs, you do get that feature: https://moritzmolch.com/blog/2506.html

                                                                                                                                                • o11c a day ago

                                                                                                                                                   That fundamentally requires a snapshot-capable filesystem, so you need to use a distro designed around one.

                                                                                                                                              • brucehoult 19 hours ago

                                                                                                                                                So now they can support RISC-V RVA20 and RVA23 in the same distro?

                                                                                                                                                 All the fuss about Ubuntu 25.10 and later being RVA23-only was about nothing?

                                                                                                                                                • snvzz 4 hours ago

                                                                                                                                                  They sure can, but it seems they simply did not want to.

                                                                                                                                                • whalesalad a day ago

                                                                                                                                                  > means to better exploit modern processors without compromising support for older hardware

                                                                                                                                                   Very odd choice of words. "Better utilize/leverage" is perhaps the right thing to say here.

                                                                                                                                                  • JohnKemeny a day ago

                                                                                                                                                    "exploit": make full use of and derive benefit from

                                                                                                                                                  • westurner a day ago

                                                                                                                                                    "Gentoo x86-64-v3 binary packages available" (2024) https://news.ycombinator.com/item?id=39255458

                                                                                                                                                    "Changes/Optimized Binaries for the AMD64 Architecture v2" (2025) https://fedoraproject.org/wiki/Changes/Optimized_Binaries_fo... :

                                                                                                                                                    > Note that other distributions use higher microarchitecture levels. For example RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 uses x86-64-v3, and other distros provide optimized variants (OpenSUSE, Arch Linux, Ubuntu).

                                                                                                                                                    • shmerl a day ago

                                                                                                                                                      Will Debian do it?

                                                                                                                                                       • bmitch3020 a day ago

                                                                                                                                                         • shmerl a day ago

                                                                                                                                                          Hm, discussion is from 2023. Did anything come out of it?

                                                                                                                                                          • bmitch3020 a day ago

                                                                                                                                                            I believe it's just discussions right now. If/when something happens, I'm hoping they'll update the wiki.

                                                                                                                                                      • zer0zzz a day ago

                                                                                                                                                         There was a FatELF project to solve this problem at one point, I thought.

                                                                                                                                                        • DrNosferatu a day ago

                                                                                                                                                          Link?