I’ve tried optimizing some Python that’s in the hot path of a build system; a few dozen operations out of over 2K nodes in the ninja graph account for 25-30% of the total build time.
I’ve found python optimization to be nearly intractable. I’ve spent a significant amount of time over the past two decades optimizing C, Java, Swift, Ruby, SQL and I’m sure more. The techniques are largely the same. In Python, however, everything seems expensive. Field lookup on an object, dynamic dispatch, string/array concatenation. After optimization, the code is no longer “pythonic” (which has come to mean slow, in my vernacular).
Are there any good resources on optimizing python performance while keeping idiomatic?
"Are there any good resources on optimizing python performance while keeping idiomatic?"
It is essentially impossible; "essentially" here not in the modern sense of "mostly", but "essential" as in baked into the essence of the language. The advice in the Python community, pretty much since the beginning, has been that the solution in that case is to go to another language. There are a number of solutions to the problem, ranging from trying PyPy, to implementing an API in another language, to Cython/Pyrex, up to traditionally embedding a C/C++ program as a Python module.
However this is one of those cases where there are a lot of solutions precisely because none of them are quite perfect and all have some sort of serious downside. Depending on what you are doing, you may find one whose downside you don't care about. But there's no simple, bullet-proof cookbook answer for "what do I do when Python is too slow even after basic optimization".
Python is fundamentally slow. Too many people still hear that as an attack on the language, rather than an engineering fact that needs to be kept in mind. Speed isn't everything, and as such, Python is suitable for a wide variety of tasks even so. But it is, still, fundamentally slow, and those who read that as an attack rather than an engineering assessment are more likely to find themselves in quite a pickle (pun somewhat intended) one day when they have a mass of Python code that isn't fast enough and no easy solutions for the problem than those who understand the engineering considerations in choosing Python.
I agree with this. People advocate for "pick a better algorithm", and sometimes that can help dramatically, but at times the best algorithm implemented in Python is still too slow, yet can be made 500x faster by reimplementing the exact same algorithm in Cython or C or Fortran and so on.
Python is a great language for rapidly bashing out algorithmic code, glue scripts, etc. Unfortunately, due to how dynamic it is, it fundamentally doesn't translate well into operations CPUs can perform efficiently. Hardly any Python programs ever need to be as dynamic as what the language allows.
I've had very good experiences applying Cython to Python programs that need to do some kind of algorithmic number crunching, where NumPy alone doesn't get the job done.
With Cython you start with your Python code and incrementally add static typing that strips away the layers of Python interpreter abstraction and wrapping. Cython has a very useful and amusing output where it spits out an HTML report of annotated Cython source code, with lines highlighted in yellow in proportion to the amount of Python overhead. You click on any line of Python in that report and it expands to show how many CPython API operations are required to implement it; then you add more static type hints and recompile until the yellow goes away and the compute-heavy kernel of your script is a C program that compiles to operations real-world CPUs can execute efficiently.
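A minimal sketch of that workflow, using Cython's "pure Python" typing mode (the file name and function are hypothetical, and this assumes Cython is installed):

    # sum_sq.py -- hot-path kernel written in Cython's "pure Python" mode.
    # Build it and generate the yellow-highlighted HTML report with:
    #     cythonize -a -i sum_sq.py
    import cython

    def sum_of_squares(values) -> cython.double:
        # Typing the accumulator and loop variable lets Cython use C doubles
        # in the inner loop instead of boxed Python float objects.
        total: cython.double = 0.0
        x: cython.double
        for x in values:
            total += x * x
        return total

Untyped, every line of this lights up yellow in the report; with the annotations, the accumulation itself is plain C arithmetic (fetching items out of the Python list still goes through the C API).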
Downside of Cython is the extra build toolchain and deployment concerns it drags in: if you previously had a pure Python module, now you've got native modules, so you need to bake platform-specific wheels for each deployment target.
For Python that's being used as a glue scripting language rather than for number crunching, it's worth considering rewriting the script in something like Go. Go has a pretty good standard library that handles many tasks without needing lots of third-party packages, and the build, test, and deploy story is very nice.
Of the available solutions, Numba and Nuitka are probably also worth mentioning.
Then just looking at those, I now know of Shedskin and ComPyler.
I do feel like one nice thing about so many people working on solutions to every problem in Python is that when you do encounter a serious downside, you have a lot more flexibility to move forward along a different path.
Python is a superior bash that still wants a basic shell to run in.
Depending on the nature of your code, just throwing pypy at it instead of cpython can be a huge win. This very much depends on what you're doing though. For example, I have a graphics editor I wrote that is unbearably slow with cpython; with pypy and no other changes it's usable.
If what you're trying to do involves tasks that can be done in parallel, the threading (if I/O-bound) or multiprocessing (if compute-bound) modules can be very useful.
If what you're doing isn't conducive to either, you probably need to rewrite at least the critical parts in something else.
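A minimal sketch of that I/O-bound vs compute-bound split using concurrent.futures (the task functions and inputs are made up for illustration):

    import concurrent.futures as cf
    import urllib.request

    def fetch(url):                      # I/O-bound: threads help despite the GIL
        with urllib.request.urlopen(url) as r:
            return len(r.read())

    def crunch(n):                       # CPU-bound: separate processes sidestep the GIL
        return sum(i * i for i in range(n))

    if __name__ == "__main__":           # guard required for multiprocessing on spawn platforms
        with cf.ThreadPoolExecutor() as pool:
            print(list(pool.map(fetch, ["https://example.com"] * 4)))

        with cf.ProcessPoolExecutor() as pool:
            print(list(pool.map(crunch, [10**6] * 4)))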
Pypy is great for performance. I'm writing my own programming language (that transpiles to C) and for this purpose converted a few benchmarks to some popular languages (C, Java, Rust, Swift, Python, Go, Nim, Zig, V). Most languages have similar performance, except for Python, which is about 50 times slower [1]. But with PyPy, performance is much better. I don't know the limitations of PyPy because these algorithms are very simple.
But even though Python is very slow, it is still very popular. So the language itself must be very good, in my view; otherwise fewer people would use it.
[1] https://github.com/thomasmueller/bau-lang/blob/main/doc/perf...
>> optimizing python performance while keeping idiomatic?
That's impossible[1].
I think it is impossible because when I identify a slow function using cProfile and then use dis.dis() on it to view the instructions executed, most of the overhead (by that I mean time spent doing something other than the calculation the code describes) is spent determining what each "thing" is. It's all trails of calls trying to determine "this thing can't be __add__()-ed to that thing, but maybe that thing can be __radd__()-ed to this thing instead". A long way of saying that most of the time-wasting instructions I see can be attacked by providing ctypes or some other approach like that (mypyc, maybe Cython, etc.), but at that point you're well beyond "idiomatic".
[1] I'm really curious to know the answer to your question, so I'll post a confident (but hopefully wrong) answer so that someone feels compelled to correct me :-)
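For what it's worth, that profiling workflow is easy to reproduce; something like the following (the function is just a stand-in) shows both where the time goes and the dispatch-heavy bytecode behind it:

    import cProfile
    import dis

    def accumulate(items):
        total = 0
        for x in items:
            total = total + x      # generic binary add: __add__/__radd__ resolved at runtime
        return total

    cProfile.run("accumulate(list(range(1_000_000)))")   # find the slow function
    dis.dis(accumulate)                                   # then inspect its bytecode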
> Are there any good resources on optimizing python performance while keeping idiomatic?
At the risk of sounding snarky and/or unhelpful, in my experience, the answer is that you don't try to optimize Python code beyond fixing your algorithm to have better big-O properties, followed by calling out to external code that isn't written in Python (e.g., NumPy, etc).
But, I'm a hater. I spent several years working with Python and hated almost every minute of it for various reasons. Very few languages repulse me the way Python does: I hate the syntax, the semantics, the difficulty of distribution, and the performance (memory and CPU; and is the GIL disabled by default yet?!)...
You can't really. The slow parts of computers are memory and branches.
Branches cannot be reasoned about since they are all in the CPython state machine.
Memory performance is about being efficient with allocations, cache locality, and pointer chasing.
Python does huge amounts of allocation and pointer chasing. It's slow by design.
> I’ve found python optimization to be nearly intractable.
Move your performance-critical kernels outside Python: NumPy, PyTorch, external databases, a Golang microservice, etc.
It's in fact an extremely successful paradigm. Python doesn't need to be fast to be used in every domain (though async/nogil is still worth advancing to avoid idle CPUs).
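A small sketch of what "move the kernel outside Python" tends to look like in practice (the toy computation is just for illustration):

    import numpy as np

    def var_python(xs):                 # the interpreter runs every add/multiply as an object operation
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def var_numpy(xs):                  # the loops run in compiled C inside NumPy
        a = np.asarray(xs, dtype=np.float64)
        return float(((a - a.mean()) ** 2).mean())

    data = list(range(100_000))
    print(var_python(data), var_numpy(data))   # same result, very different speed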
However note that (and this isn't just true for Python) if your "performance critical kernel" is "the whole program" then this does have the implication you think it does, you should not write Python.
If your Python software is spending 99% of its time doing X, you can rewrite X in Rust and now the Python software using that is faster, but if your Python software is spending 7% of its time doing X, and 5% doing Y, and 3% doing Z then even if you rewrite X, and Y and Z you've shaved off no more than 15% and so it's probably time to stop writing Python at all.
This is part of why Google moved to Go as I understand it.
I spent a decade optimizing Python data science projects for a Fortune 100. I agree with the other sentiments that you can't really optimize Python itself because of these weird behaviors that seem to change the more you look at them. Mostly all of my optimization work was generic computer science stuff across the entire system; in the data science world, this usually meant optimizing the database queries or the I/O throughput, or the overall shape of the work being done so it could be parallelized, or figuring out how to simply do way less, etc. If someone actually came up to me with a problem where it was the Python code itself that was not performing, I would pretty immediately say let's use something else. I think the efforts to make a more efficient Python are really great, but the extreme utility of Python as a light glue to pull together lots of other things that perform well in other languages is a widely recognized strength of the Python ecosystem.
> In Python, however, everything seems expensive.
And that wasn't true in Ruby(?!)
The first thing to do when optimizing Python is to use a JIT (PyPy). Like, yes Python is slow if you don't allow a JIT.
Alternatively, go the other, JIT-less direction and use C (an AOT language).
I admit it may just be because I'm a PL nerd, but I thought it was general knowledge that pretty much EVERYTHING in Python is an object, and an object in Python is always heap allocated AFAIK. This goes deeper than just integers. Things most think of as declarative (like classes, modules, etc.) are also objects, etc. etc. It is both the best thing (for dynamism and fun/tinkering) and worst thing (performance optimization) about Python.
If you've never done it, I recommend using the `dir` function in a REPL, finding interesting things inside your objects, doing `dir` on those, and keeping the recursion going. It is a very eye-opening experience as to just how deep the objects in Python go.
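A quick sketch of that kind of spelunking, if you want to try it from a REPL:

    import sys

    x = 42
    print(dir(x))                   # even an int has dozens of attributes: __add__, __sizeof__, ...
    print(dir(x.__add__))           # the bound method is itself an object with its own attributes
    print(dir(type(x)))             # so is the class, and its metaclass, and so on down
    print(sys.getsizeof(x))         # 28 bytes for a small int on 64-bit CPython: full PyObject overhead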
Heap allocation does also allow other things. For example stack frames (PyFrame) are also heap allocated. When there is an exception, the C stack gets unwound but the heap allocated frames are retained forming the traceback. You can then examine the values of the local variables at each level of the traceback.
And it also allows async functions, since state is held off the C stack, so frames can be easily switched when returning to the event loop.
The other thing made easy is C extension authoring: compile CPython without free lists and with an address sanitizer, and getting the reference counting wrong shows up immediately.
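A tiny sketch of the traceback point above: after the C stack has unwound, the heap-allocated frames are still reachable through the traceback, locals and all.

    import sys

    def inner():
        secret = 42
        raise ValueError("boom")

    def outer():
        try:
            inner()
        except ValueError:
            tb = sys.exc_info()[2]
            while tb is not None:                 # walk the retained frame objects
                frame = tb.tb_frame
                print(frame.f_code.co_name, frame.f_locals)
                tb = tb.tb_next

    outer()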
Yeah, spelunking around this is a great way to learn a language!
That said, “is an object” is a bit of an overloaded term in this context. Most things in Python can be semantically addressed as objects (they have fields and methods, “dir” interrogates them, and so on), but not everything is necessarily backed by the same kind of heap-allocated memory object. Some builtins and bytecode-level optimizations, similar to (but not quite the same as) the ones discussed in TFA, result in things being less object-like (that is, PyObject*-ish structs with refcounting and/or pointer chasing on field access) in some cases.
And that’s an important difference, not a nitpick! It means that performance-interested folks can’t always use the same intuition for allocation/access behavior that they do when determining what fields are available on a syntactic object.
(And before the inevitable “if you want performance don’t write Python/use native libraries” folks pipe up: people still regularly have to care about making plain Python code fast. You might wish it weren’t so, but it is.)
> but not everything is necessarily backed by the same kind of heap-allocated memory object.
Do you have an example? I thought literally everything in Python did trace back to a PyObject*.
Yeah, I probably should have been careful there. Good call out that not every "object" is necessarily a heap allocated entity. I learned something, thx.
Quite frankly, as someone with 20+ years of experience I have no idea what the basis is for what you're saying here.
> not everything is necessarily backed by the same kind of heap-allocated memory object.
`None`, `True` and `False` are; integers are; functions, bound methods, properties, classes and modules are... what sort of "everything" do you have in mind? Primitive objects don't get stored in a special way that can avoid the per-object memory overhead, and such objects exist for things that can't be used as values at all (e.g. passed to or returned from a function) in many other languages.
Some use fields like `tp_slots` in special ways; the underlying implementation of their methods is in C and the method attributes aren't looked up in the normal way. But there is still a PyObject* there.
... On further investigation (considering the values reported by `id`) I suppose that (at least in modern versions) the pre-created objects may be in static storage... ? Perhaps that has something to do with the implementation of https://peps.python.org/pep-0683/ ?
> Some builtins and bytecode-level optimizations
String concatenation may mutate a string object that's presented to Python as immutable, so long as there aren't other references to it. But it's still a heap-allocated object. Similarly, Python pre-allocates objects for small integer values, but they're still stored on the heap. Very short strings may, along with those integers, go in a special memory pool, but that pool is still on the heap and the allocation still contains all the standard PyObject fields.
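The small-integer part is easy to observe, with the caveat that it's a CPython implementation detail rather than a language guarantee:

    a, b = int("200"), int("200")        # computed at runtime so the compiler can't constant-fold them
    print(a is b)                        # True: both names point at the same pre-allocated object

    c, d = int("100000"), int("100000")
    print(c is d)                        # False: outside the small-int cache, each call allocates afresh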
> people still regularly have to care about making plain Python code fast.
Can you give examples of the people involved, or explain why their use cases are not satisfied by native libraries?
Once upon a time, I wanted to back a stack with a linked list in Python. I had been reading a lot of compiled bytecode, and had recently learned that CPython's VM is stack-based and can unpack and pop tuples as single bytecode instructions. I also learned about the freelist.
I ended up with the notation:

Initialization:

    head = ()

Push:

    head = data, head

Safe Pop:

    if head:
        data, head = head

Safe Top:

    head[0] if head else None
And for many stack-based algorithms, I've found this to be quite optimal, in part because the length-2 tuples get recycled (and also due to the lack of function calls, member accesses, etc). But I'm rather embarrassed to put it into a codebase due to others' expectations that Python should be beautiful, and this seems weird.

I guess the only downside might be if you need to push a bunch of things at once. But maybe just a loop is faster than calling extend on a list? This is cool.
Yep, that's the main exception when I said "many stack-based algorithms". It's worth mentioning that examining the stack while debugging is annoying because of the number of parens involved.
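For the curious, here is the idiom above in a complete (if contrived) form, used for an iterative depth-first walk over nested lists:

    def dfs(tree):
        out = []
        head = (tree, ())                 # push the root onto the empty stack
        while head:
            node, head = head             # safe pop: unpack the 2-tuple
            if isinstance(node, list):
                for child in reversed(node):
                    head = child, head    # push
            else:
                out.append(node)
        return out

    print(dfs([1, [2, 3], [4, [5]]]))     # [1, 2, 3, 4, 5]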
With respect to tagged pointers, there seems to be some recent movements on that front in CPython: https://github.com/python/cpython/issues/132509
Unfortunately that was posted 1 month before the Faster CPython project was disbanded by Microsoft, so I imagine things have slowed.
Not sure how well-known this is, but you can run Python with a different allocator by LD_PRELOADing it. We had good results using jemalloc this way.
There are reasons why the same program in Julia can be 60x faster than in Python, see e.g. slide 5 in https://www.cl.cam.ac.uk/teaching/2526/TeX+Julia/julia-slide... for an example.
The function on that slide is dominated by the call to rand, which uses quite different implementations in Julia and Python, so may not be the best example.
Julia is compiled, and for simple code like that example it will have performance on par with C, Rust, etc.
> I will never forget Rust for...
Fuck, this made me feel old. It blows my mind that the learning path Rust then Python even exists.
When I started school in 2000, it was C, then C++, and then Java. Oh, and then compilers. It’s impossible for me not to think about allocation when doing anything, it’s just rooted in my mind.
Rust is a lot simpler than C++ and arguably Python itself, so it's not a huge surprise that it turns out to be viable as a first programming language. Especially since some parts of it can be taught as pretty much a pure functional language at the outset (actually this works quite well, right up until you really need interior mutability - which is not that often), which makes for a good intuition of what keeps the borrow checker happy.
C gets a lot of crap, sometimes for good reason, but one thing I like about it is that the question of whether C is allocating something is easy to answer, at least for your own code.
It is nice.
Although, there are also modern, beautiful, user friendly languages where allocation is mostly obvious. Like Fortran.
Oh my. You don't see Fortran being called modern very often.
Writing C like Fortran (no heap) is even better when you can!
This comment is 35 years out of date.
Python is entirely a C program; ergo, by this article's logic, this seems like one of those fallacies C programmers believe that justifies using C.
Yeah, if you consider the Python interpreter "your own code".
A bit beside the point, but this caught my eye:
> Integers are likely the most used data type of any program, that means a lot of heap allocations.
I would guess strings come first, then floats, then booleans, and then integers. Are there any data available on that?
Given that most string implementations have both a `len` and a `cap` field (or even just `len`), every string value brings along at least one more integer variable.
So integers definitely get used more than strings, and I'd argue way more than floats in most programs.
In CPython, a string's length is stored internally as a machine integer (`Py_ssize_t`); when the `len` builtin is called, the corresponding integer object is created (or more likely, retrieved from cache) on the fly.
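In other words (hedging that this is CPython-specific behavior), the stored length is not itself a Python int until you ask for it:

    s = "hello"
    a = len(s)          # the length lives in the str object as a C Py_ssize_t;
    b = len(s)          # len() hands back a Python int object built (or fetched) from it
    print(a is b)       # True only because 5 happens to be in the small-int cache

    big = "x" * 5000
    print(len(big) is len(big))   # False: each call produces a separate int object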
Ints probably get a big boost in languages where the only built-in for-loop syntax involves incrementing an index variable, like C. And, speaking of C, specifically, even the non-int types are actually ints or isomorphic to ints: enums, bools, char, pointers, arrays (which are just pointers if you squint), etc...
But, otherwise, I'd agree that strings probably win, globally.
Actually, because of provenance, C pointers are the only type here that isn't basically just the machine integers again.
A char is just a machine integer with implementation-defined signedness (crazy), bools are just machine integers which aren't supposed to have values other than 0 or 1, and the floating-point types are just integers reinterpreted as binary fractions in a strange way.
Addresses are just machine integers of course, but pointers have provenance which means that it matters why you have the pointer, whereas for the machine integers their value is entirely determined by the bits making them up.
It's been a long time since I've done C/C++, but I'm not sure what you're saying with regard to provenance. I was pretty sure that you were able to cast an arbitrary integer value into a pointer, and it really didn't have to "come from" anywhere. So, all I'm saying is that, under-the-hood, a C pointer really is just an integer. Saying that a pointer means something beyond the bits that make up the value is no more relevant than saying a bool means something other than its integer value, which is also true.
Start your journey here: https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm
That's defect report #260 against the C language. One option for WG14 was to say "Oops, yeah, that should work in our language", and then modern C compilers would have to be modified and many C programs would be significantly slower. That gets you the world you (and you're far from alone among C programmers) thought you lived in, though your C programs are probably slower than you expected now, 'cos your "pointers" are now just address integers and you've never thought about the full consequences.
But they didn't, instead they wrote "They may also treat pointers based on different origins as distinct even though they are bitwise identical" because by then that is in fact how C compilers work. That's them saying pointers have provenance, though they do not describe (and neither does their ISO document) how that works.
There is currently a TR (I think, maybe wrong letters) which explains PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates which is the current preferred model for how this could possibly work. Compilers don't implement that properly either, but they could in principle so that's why it is seen as a reasonable goal for the C language. That TR is not part of the ISO standard but in principle one day it could be. Until then, provenance is just a vague shrug in C. What you said is wrong, and er... yeah that is awkward for everybody.
Rust does specify how this works. But the bad news (I guess?) for you is that it too says that provenance is a thing, so you cannot just go around claiming the address you dredged up from who knows where is a valid pointer, it ain't. Or rather, in Rust you can write out explicitly that you do want to do this, but the specification is clear that you get a pointer but not necessarily a valid pointer even if you expected otherwise.
A caveat applies to the entire analysis: CPython may be the reference implementation, but it's still just one implementation. This sort of thing may work totally differently in PyPy, and especially in implementations that make use of another garbage-collecting runtime, such as Jython.
> Let’s take out the print statement and see if it’s just the addition:
Just FWIW: the assignment is not required to prevent optimizing out the useless addition. It isn't doing any static analysis, so it doesn't know that `range` is the builtin, and thus doesn't know that `i` is an integer, and thus doesn't know that `+` will be side-effect-free.
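That's easy to confirm with dis: even as a bare expression statement, the addition is still compiled and executed (exact opcode names vary by CPython version):

    import dis

    def loop():
        for i in range(10):
            i + 1          # result is discarded, but the add is not compiled away

    dis.dis(loop)          # the loop body shows LOAD_FAST i, LOAD_CONST 1, a binary add, POP_TOP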
> Nope, it seems there is a pre-allocated list of objects for integers in the range of -5 -> 1025. This would account for 1025 iterations of our loop but not for the rest.
1024 iterations, because the check is for numbers strictly less than `_PY_NSMALLPOSINTS` and the value computed is `i + 1` (so, `1` on the first iteration).
Interesting. I knew of them only ranging up to 256 (https://stackoverflow.com/questions/306313).
It turns out (https://github.com/python/cpython/commit/7ce25edb8f41e527ed4...) that the change is barely a month old in the repository; so it's not in 3.14 (https://github.com/python/cpython/blob/3.14/Include/internal...) and won't show up until 3.15.
> Our script appears to actually be reusing most of the PyLongObject objects!
The interesting part is that it can somehow do this even though the values are increasing throughout the loop (i.e., to values not seen on previous iterations), and it also doesn't need to allocate for the value of `i` retrieved from the `range`.
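You can watch the recycling from plain Python by checking for repeated ids across iterations (again, this is a CPython implementation detail; the memory may come back via a freelist or the small-object allocator):

    seen, reused = set(), 0
    for i in range(1000, 2000):
        x = i * 3                 # well outside the small-int cache, so a fresh PyLongObject each time
        if id(x) in seen:
            reused += 1
        seen.add(id(x))
    print(reused)                 # typically close to 998 on CPython: freed objects' memory gets reused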
> But realistically the majority of integers in a program are going to be less than 2^30 so why not introduce a fast path which skips this complicated code entirely?
This is the sort of thing where PRs to CPython are always welcome, to my understanding. It probably isn't a priority, or something that other devs have thought of, because that allocation presumably isn't a big deal compared to the time taken for the actual conversion, which in turn is normally happening because of some kind of I/O request. (Also, real programs probably do simple arithmetic on small numbers much more often than they string-format them.)
> that make use of another garbage-collecting runtime, such as Jython
I think that is mostly of historical interest. For example, it still does not support Python 3 and has not been updated in a very long time.
Graalpy [1] is where it's at if you want python on a JVM.
Ah, sorry to hear it. I'd lost track. It looks like IronPython is still active, but way behind. Or rather, maybe they have no interest in implementing Python 3 features beyond 3.4.
Shouldn't having a freelist and pushing and popping blocks of memory to and from it for re-use count as a form of memory allocation?