This code is very beautifully written, thanks for sharing.
However, you should consult the book "C: Interfaces and Implementations" by Dave Hanson, which has a library called "Arena" that is very similar to your "Pool"s, and it shows a few more tricks to make the code safer and easier to read (Chapters 5+6, 6 in particular).
D. Hanson's book (written in the 'literary programming' style invented by Donald Knuth) also demonstrates having debugging and production implementations for the same C API to catch memory errors, and his book is from 1997:
> Copyright © 1997 by David R. Hanson
(He used to be a SE professor at Princeton before getting hired by Microsoft, I believe.)

Kernighan & Ritchie's "The C Programming Language" also has an example implementation of malloc() & free() { for any-size requests, as @kragen notes in a sibling comment } in about three pages (though, sure, it is far from the most efficient implementation). Since it's the last section of the main text, I'd guess that most readers don't get that far (even though it's a short book).
Even if you don’t program in C, I think K&R should be required reading for anyone who writes software for a living. You can finish it in an afternoon, or two days if you actually write and run the code outlined in the book.
My bucket list includes an item to write a novel efficient implementation of malloc/free. I have taken several stabs at this in the past but have never been happy with it. I have no real need of a custom allocator but it is something that I think would be very cool to have under my belt. This was inspired by K&R.
I think you mean "literate programming".
The allocator in Hanson's chapter 6 is an arena allocator, not a "pool allocator" in the sense that 8dcc means, which is to say, a constant-time fixed-size-block allocator. It seems that you had the same reading comprehension problem that I did.
The difference is that Hanson's arena allocator can make allocations of any size (his Arena_alloc takes an nbytes argument) and cannot deallocate individual allocations, only the entire arena (his Arena_free takes only an arena pointer as a parameter). A fixed-size-block allocator like the one we are discussing here can only make allocations of a fixed size, but can recycle them individually, not only all at once.
(If Hanson's name sounds familiar, it might be because you've read the lcc book.)
It may be worth noting that Donald Knuth also invented the font used in 8dcc's article.
Honestly, "literary programming" does better explain how Knuth writes code.
Yes, but that name lacks the benefit of implicitly describing other approaches to programming as "illiterate".
Reading the Metafont source is pretty challenging because of all the transclusions. There's a reason why programs aren't written like journal papers.
I think we can probably do better than METAFONT with the benefit of hindsight. Unit tests would help with comprehensibility. Including example outputs, too, like Darius Bacon's "halp", or R-Markdown, Jupyter, doctest, and org-mode can do. Higher-level languages than Pascal would probably be an improvement. Maybe building up the program as a series of versions, like Kartik Agaram's idea of "layers"; I did something similar but less complete in my literate-programming system "handaxeweb". We can do decent typography on a computer monitor now, so in the worst case, we can attempt to leverage interactivity, as so-called "explorable explanations" do.
Both programs and journal papers are mostly written the way they are because of tradition, not because it's the best way we've found to write them. Two examples:
Nobody would intentionally design a programming language to make it impossible to include the kind of diagrams that illustrate this article, but most programming languages are in fact designed in such a way.
Buckminster-Fuller-style "ventilated prose" turns out to be more readable than traditional paragraph formatting in psychology experiments, which is why we use it universally in our programs. But you can't publish a paper formatted that way unless you call it a poem.
Haha, never thought about that!
Knuth did.
My minor nitpick regarding the code is the choice of variable names, but other than that, it is indeed easy to read and it is how I implement data structures and whatnot.
I like that there is a more or less exhaustive blog post about it which happens to look nice as well.
Thank you so much! I will definitely check it out. I also planned on writing an article about arena allocators soon.
If you write your own allocator in C, do yourself a favor and use the valgrind API inside it. Its use can be conditional so you can "compile it out". Or should I say it has to be, so you can build the code in an environment where there's no valgrind API.
I've replaced Valgrind with sanitizers for most of my workflow. Using the sanitizer APIs in your custom allocator is also easy: https://github.com/tavianator/bfs/blob/main/src/sanity.h https://github.com/tavianator/bfs/blob/31dffd6441ba80ea998ce...
+1. This isn't even that hard: include a single header and annotate allocations/frees, with some redzones around the actual allocation to know when user code is writing into the pool's pointer areas.
https://developers.redhat.com/articles/2022/03/23/use-valgri...
Oof, love how scrolling on this page hijacks the back button.
The Address Sanitizer (ASan) in both Clang and GCC is excellent:
https://github.com/google/sanitizers/wiki/addresssanitizer
I believe to integrate a custom allocator with ASAN, you need to look at the memory "poisoning" mechanisms:
https://github.com/google/sanitizers/wiki/AddressSanitizerMa...
I have never heard about this (using valgrind with a "non-standard allocator"), but it looks really interesting, especially for other projects I am working on. I will definitely look into it.
I integrated the TXR Lisp garbage collector with valgrind. That came in handy on a number of occasions.
It's not only that you sometimes want special behavior from the garbage collector to assist with debugging; there's also this: if you want to use Valgrind to debug anything that does not have to do with the garbage collector, you want all the false positives it generates out of your way.
Some of the scanning that a conservative garbage collector has to do looks like access errors to Valgrind. By using the Valgrind API, you can grant yourself access to that memory, and then lock it again when done.
That aspect doesn't necessarily apply to a pool allocator being described in this article; that should not be triggering any false positives.
(In case you missed my other post, I ended up adding valgrind support in the article and in the 'libpool' project.)
> I integrated the TXR Lisp garbage collector with valgrind.
That's funny, because when I said "for other projects I am working on", I specifically meant a Lisp interpreter which uses a memory pool and garbage collection (I am still working on implementing this last part, anyway). I am very happy about the fact that I can continue using valgrind in this project.
You can also integrate with AddressSanitizer to some extent; look into [1] ASAN_{UN,}POISON_MEMORY_REGION from <sanitizer/asan_interface.h>, or the underlying intrinsics __asan_{un,}poison_memory_region.
[1] https://github.com/google/sanitizers/wiki/AddressSanitizerMa...
Thanks for the tip!
One aspect people don't tend to mention about writing your own allocation arenas (and TFA does not, because it has "general purpose" as its scope) is that because a "pointer" is just the name (really, a number) of an allocated region, this also affords flexibility in the width of those numbers, a.k.a. the "pointer size" that decides your "address" space.
So, for example, if you are willing to limit yourself to 64 KiNodes of equal sized objects, a pool like https://github.com/c-blake/bst/blob/main/POOL.c can use an unsigned short 2-byte pointer. This small address space specialization can give a full 4x space overhead optimization over modern 64-bit pointers.
You could even use 8x less with 1-byte pointers if 256 nodes is enough or use bit fields or equivalents for even more fine-grained address space size selection. Admittedly, linked structures are usually less interesting for very small address space sizes, but this is sometimes useful. People may complain about arbitrary limits that relate to this kind of optimization, but the linked part could be "small by design" without compromising overall scale { such as for structure within a B-tree node or other multi-level indexing setups }.
In any event, I think it is useful to highlight pointer size arbitrariness to junior engineers. It is often learned once & immediately forgotten behind an onslaught of general purpose-ness (& pointer type systems/syntax). Relocation ideas also relate. E.g., if all the pointers are implicitly indices into a (small?) array, and all the code needs a "base address" of that array to get a VMem address, then you can relocate the whole bundle of objects pretty easily changing only the base address.
‘Handles are the better pointers’¹ is a related writeup that has been discussed here before².
¹ https://floooh.github.io/2018/06/17/handles-vs-pointers.html
I wish I had caught the original discussions. The moral of the story is that pointers carry too little metadata and offer too few places to do bounds checking and type checking, so they are less safe to handle (pun intended). Taking charge of things as "objects" is more robust.
Do notice that it means runtime bounds checking. It also mentions the problem of having a handle reference an object that's no longer there, or that has changed, and that the probability of this happening rises with time. This pattern works great in single-thread/single-process code but starts decaying hard when you have preemptive concurrency. Here you really cannot safely do much of anything without some sort of locking or copy-on-write semantics (which is also the case with a lot of other systems; just pointing out that this isn't solved magically by handles, though the additional metadata can help detect the use-after-free and handle-now-refers-to-a-different-object problems).
So the age-old objection that bounds checking slows you down but is necessary to detect problems at runtime applies here. You won't get branch-free code this way compared to straight pointer dereferences, but on the other hand you get a lot closer to the concepts of owners and borrowers in straight C, which is cool.
Yes, this is true. I didn't use that in the code because I thought it would make it less readable, though. I might mention it somewhere, thank you.
In many prog.lang's, it absolutely makes things less readable. This is not an accident, but because the PL itself specifically tries to syntactically support pointer types (i.e. to literally aid readability). Most PLs do so only for general purpose VMem pointers with user-defined pointers being afterthoughts (at best). You probably need (at least!) operator overloading to get the same level of readability as "PL-native" pointers { though "readability" is a bit subjective and powerful PL features can be divisive; Scare "quotes" for interpretation hazard :-) }.
Anyway, you probably knew all that & I don't mean to suggest otherwise, but it made sense to write since HN loves PL talk.
Hello, I am the author. Thank you all for the instructive comments, I made some changes to the article since I first posted it:
- Added a link to this HN post.
- Renamed some variables and types in the code for readability.
- Mention (in a footnote) that we could allocate the 'Pool' structure and the chunk arrays with a single call, or even return the 'Pool' on the stack. Suggested by 'liontwist' and 'unwind'.
- Mention (in a footnote) that we don't always need to store the full address when building the linked list of free chunks. Suggested by 'cb321'.
- Added valgrind support (also to my 'libpool' project). Suggested by 'kazinator'.
FWIW, while it is true that a smaller pointer lets you have smaller minimum fixed size objects in your pool, as your new footnote suggests, the space concern I was mostly alluding to in [1] is all the references in your client code { such as the BST code I linked to where each binary search tree (BST) node has 2 pointers (or 3 if they have parent pointers to give BST nodes "doubly linked list-like autonomy") }.
This can really add up. For example, like 20 years ago, I did a `print sizeof` in gdb for some g++ STL map node with a data payload of like 4 bytes and it was some obscene size approximately >= 1 order of magnitude bigger than necessary (like 80 bytes instead of, say 8 with 2Byte ptrs). While main memory is cheap, CPU cache memory is not and for "linky" structures, it often pays to be more cache-friendly. It's pretty easy to imagine a 10X smaller thing fitting into an L2 CPU cache where the larger thing would not even fit in L3, netting you a big latency boost (though the more root-ward levels of the tree likely always stay in caches in a hot loop).
EDIT: Anyway, you don't have to believe me. Others did an x32 ABI for x86_64 mostly to have "merely 2X" smaller pointers: https://en.wikipedia.org/wiki/X32_ABI
As a last code snippet readability suggestion - maybe extract "malloc" and "free" calls to your own simple wrappers (like my_malloc and my_free), and do "valgrinding" inside of them? You can also pass additional params to these functions to specify type of allocation if needed, or make separate functions.
Also the font isn't incredibly easy to read on Windows. I'm assuming Georgia was selected? Which has been on Windows for years but renders so badly it looks like the font color isn't black, even though I think it is black? This isn't really your fault but I would have stuck to Times New Roman for Windows' sake.
(related) LLFree
Paper: LLFree: Scalable and Optionally-Persistent Page-Frame Allocation https://www.usenix.org/system/files/atc23-wrenger.pdf
Video presentation: https://www.youtube.com/watch?v=yvd3D5VOHc8
Implementation and benchmarks are well documented at the repos:
Rust repo https://github.com/luhsra/llfree-rs C repo https://github.com/luhsra/llfree-c
A long time ago, in a galaxy far far away, I wrote something similar for an embedded platform. I was implementing a WAP browser at the time (yes, WAP, https://en.wikipedia.org/wiki/Wireless_Application_Protocol), and we needed dynamic memory allocation in an environment where everything was static due to realtime constraints.
So I ended up with this: https://github.com/jinie/memmgr
The parent article is quite interesting and slightly different, being C++ [1]. There is also a post about writing a memory allocator [2].
[1]: http://dmitrysoshnikov.com/compilers/writing-a-pool-allocato...
[2]: http://dmitrysoshnikov.com/compilers/writing-a-memory-alloca...
Here's another interesting O(1) memory allocator, but with arbitrary-sized allocations and low fragmentation. The downside is relatively high memory overhead (a few dozen bytes per allocation).
This kind of allocator is often used to suballocate GPU memory in game and graphics applications.
I'm using a variant of this algorithm with added support for shrinkable and aligned allocations and flexible bin sizing.
You can also extend this idea to two dimensions to create texture atlases, which is possible in O(1) for power of two sized allocations.
Original: https://github.com/sebbbi/OffsetAllocator Rust port: https://crates.io/crates/offset-allocator
A sometimes useful thing is to treat the pool as a stack and provide a call to free all recently allocated items up to a certain index. So you make a bunch of deep subroutine calls that use the pool allocator, and then on return free them all. It's like a secondary automatic storage class.
Also sometimes useful to provide another call to promote an automatic item to a more permanent one so that it is excluded from being freed like this.
There is a small series of posts on the BitSquid blog about memory allocation which is worth reading! http://bitsquid.blogspot.com/2015/08/allocation-adventures-3...
Also "Arena allocator tips and tricks (nullprogram.com)" [0]
You can avoid resizing and moving the pools if you allocate massive pools up front. Most OSs let you overcommit memory, and most programs using memory pools only have a handful of them.
(on windows, you'd need to handle this with VirtualAlloc. osx and linux let you overcommit with malloc, and handle it with a copy-on-write page fault handler)
Note that using this will likely violate strict aliasing due to using the void pointer returned as one type, freeing it, and then getting that same chunk again and using it as another type. You'll probably want to compile with -fno-strict-aliasing to be safe.
A good reference on strict aliasing: https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...
There's no reason an allocator should ever trip over strict aliasing rules.
malloc() returns an address. The user code writes a value, and then reads a value of the same type. No problem.
Later, malloc() returns the same address. The user code writes a value of another type then reads a value of that same type. Again, no problem. There's no aliasing going on here.
The act of writing a value of a different type tells the compiler that the lifetime of the previous object has ended. There's no special magic required.
Strict aliasing is about situations where we write a value of one type, and then attempt to read a value of a different type from the same location. If you want to do that, then you have to be extremely cautious. But memory allocators don't do that, so it's not an issue.
(Obviously just talking about C here, or POD in C++ terms. If you are dealing with C++ objects with destructors, then you have an extra layer of complexity to deal with, but again that's nothing to do with aliasing.)
>The act of writing a value of a different type tells the compiler that the lifetime of the previous object has ended.
afaik only memcpy has that magic property, so I think parent is almost correct.
    void *p = malloc(n);
    *(int *)p = 42;        // ok, *p is now an int
    //*(float *)p = 3.14f; // I think this is not allowed: p points to an int
                           // object, and regular stores do not change the
                           // effective type
    float x = 3.14f;
    memcpy(p, &x, sizeof(float)); // but this is fine, *p now has effective type float
So in pool_new:

    pool->chunk_arr[i].next = &pool->chunk_arr[i + 1];

This sets the effective type of the chunk block to 'Chunk'. Later, in pool_alloc:

    Chunk* result = pool->free_chunk;
    ...
    return result;

result has effective type 'Chunk'. In user code:

    int *x = pool_alloc();
    *x = 42; // aliasing violation: *x has effective type 'Chunk' but is accessed as an int

User code would need to look like this:

    int *x = pool_alloc();
    memcpy(x, &(int){0}, sizeof(int)); // establish new effective type 'int'
    // now we can do
    *x = 42;
And this is why type-based alias analysis (TBAA) is insane, and why projects like Linux compile with -fno-strict-aliasing.
C should issue a defect report and get rid of that nonsense from the standard.
C doesn't have "alias analysis" in the standard. It has an (informally specified) memory model which has "memory objects" which have a single type, which means treating them as a different type is undefined behavior.
This enables security analysis like valgrind/ASan and secure hardware like MTE/CHERI so it's very important and you can't get rid of it.
However, it's not possible to implement malloc() in C because malloc() is defined as returning new "memory objects" and there is no C operation which creates "memory objects" except malloc() itself. So it only works as long as you can't see into the implementation, or if the compiler gives you special forgiveness somehow.
C++ has such an operation called "placement new", so you want something like that.
You can definitely implement malloc in C. It does nothing special in its most basic form but cough up void pointers into its own arena.
It gets complicated when you have virtual memory and an OS involved but even then you can override the system malloc with a simple implementation that allocates from a large static array.
There are other issues besides changing the memory type. For instance, C has those rules about out of bounds pointers being undefined, but you can't implement that - if you return part of the pool and someone calculates an out of bounds address they're getting a valid address to the rest of the pool. That's why you can't implement malloc() in C.
(The difference here is that system malloc() works with valgrind, -fbounds-safety, theoretical secure hardware with bounds checking etc., and this one doesn't.)
Any type-changing store to allocated storage has this property in C.
> aliasing violation, *x has effective type 'Chunk'
This doesn't make any sense. How do you know its effective type if you don't have access to the definition of `pool_alloc()'?
If you can guarantee it's always compiled in a separate TU and never inlined, sure, that might be a practical way to 'solve' this issue, but if you then do some LTO (or a unity build or something) the compiler might suddenly break your code. Another way is to add an inline asm block with a "memory" clobber to escape the pointer so the optimizer can't destroy your code.
It's really quite ridiculous that compiler implementers have managed to overzealously nitpick the C standard so that you can't implement a memory allocator in C.
> It's really quite ridiculous that compiler implementers have managed to overzealously nitpick the C standard so that you can't implement a memory allocator in C.
This is good because it's also what gives you valgrind and CHERI. Take one away and you can't have the other.
(Because if you define all undefined behavior, then programs will rely on it and you can't assume it's an error anymore.)
> The act of writing a value of a different type tells the compiler that the lifetime of the previous object has ended. There's no special magic required.
I think strictly speaking in C this is only true for anonymous memory, i.e. memory you got from some allocator-like function. So if your custom allocator gets its memory from malloc (or sbrk, or mmap), everything is fine, but, for example, you are allocating from a static char array, it is formally UB.
In C++ you can use placement new to change the type of anything.
Yes, type-changing stores are guaranteed by the C standard to work only for allocated storage. In practice no compiler makes a difference between declared and allocated storage. A bigger problem is that Clang does not implement the C rules as specified.
That's what allocators do. If C's object model doesn't allow users of the language to write their own allocators then that object model is broken.
C++ relatively has fixes to allow allocators to work, it requires calls to std::launder.
I understand the C standard hides such actions behind a wall of "this is the allocator", and expected the compiler authors to also be the allocator authors, allowing them to know when/how they can break such rules (in the context of their own compiler)
No, allocators are not magic (in that regard) in C. There is nothing out of the ordinary going on with an allocator, the parent comment is simply mistaken (as pointed out by another answer).
Ah, I see that it's because the char type never violates Strict Aliasing. I was wondering how you could define a type such as Chunk, yet hand out a pointer to it to a user who will cast it to some other type.
Well, the pool code fails to use the char* type when writing inside the block of memory.
If you look at the 'pool_free' code, you can see that it receives the used block from the user as 'ptr', casts it to a Chunk pointer, and then writes into a Chunk member of type Chunk pointer. I had to change that line to memcpy when I ran into this around 15 years ago, in C++ with GCC 4.something. In short, if you are writing your own allocators, you either need to disable strict aliasing like Linux does, or do the 'memcpy' dance when you access memory as internal allocator structures right after it was used as a user type. When it happened to me, the writes were reordered: I observed the code that writes into the Chunk's next pointer execute first, and the write of zeros into the user type second.
Note that C++ has different rules than C. C has type changing stores that do not require memcpy for allocated storage (i.e. no declared type). Older compilers have plenty of bugs related to TBAA though. Newer versions of GCC should be ok.
Which is why pointer provenance is an issue as well.
The pool struct is two pointers; why are you allocating it with malloc?
Next question: why are there two calls to `malloc()`, one for the Pool structure and one for the Chunks?
It is trivial to co-allocate them which removes the risk of memory fragmentation and is just generally a good idea if you're chasing performance.
The answer might be "for clarity/simplicity" which I guess is fine in an informative article, but it could at least be highlighted in the text which I didn't see.
He's targeting ANSI C (C89); flexible array members were not officially supported until C99. 26 years later, it's time to stop clinging to outdated standards and crusty tooling. Even Linux had to read the room and adopt C11.
A C11 implementation could go one step further and use _Alignas(max_align_t) to keep the pool array aligned with no manual effort. The double allocation does this implicitly.
On the topic of alignment: the library (libpool) fails to align chunk_sz so that a pointer can be stored in each chunk while it is on the free_chunk list.
This issue is sidestepped in TFA by using a union, which ensures appropriate alignment for all its members, but in the library there's nothing to prevent me from asking for a chunk size of, say, 9 bytes, which would store pointers at misaligned addresses when creating the free list.
The pool needs to align every allocation anyway.
I will also take note of this, you are right.
It is certainly not trivial and would add little to the topic
Some people are allergic to `main() { Foo foo; foo_init(&foo); }`
Admittedly, sometimes for good reason - it locks in the ABI.
Mainly just the size and alignment. If you make the structure oversized, and with strict alignment, it can be future proofed. Old binary clients will provide a structure big enough and aligned enough for the new version.
The dynamic constructor API can allocate the structure exactly sized without the padding.
ABI? Better make it an anonymous struct too.
It’s a performance oriented custom memory allocator.
Is that why some structs have "char reserved[16]" data members? I think I saw this in NGINX, though that might have been to allow module compatibility between the open source offering and the paid version.
Yes, that slightly improves ABI flexibility, though the fact that callers can still access fields limits that.
An alternative is to make the only member `char opaque[16]` (IIRC some locale-related thing in glibc does this, also I think libpng went through a transition of some sort related to this?), but calculating the size can be difficult outside of trivial cases since you can't use sizeof/offsetof.
Couldn't this also just call mmap directly? I mean, malloc itself is a memory allocator; it feels a bit strange to use it in an allocator implementation, but perhaps I'm missing something.
Some interesting things that would be interesting to add to this:
- Thread safety (not too hard to add on top of your linked list)

- Variable-sized allocations, ideally something more akin to malloc; could be done wastefully by rounding up to the nearest chunk size

- (As mentioned in the article) variable chunk sizes

- Zeroing of memory before handing it back to users

- Double-free detection, or even better, full ASan support
I don't currently have much experience with thread safety, but it's something I should look into. Thank you.
Another (admittedly less portable) way to solve the resizing problem is to reserve a large virtual memory space using linux mmap with the PROT_NONE flag and then commit pages as needed using mprotect with PROT_READ|PROT_WRITE flags. I believe windows' VirtualAlloc/VirtualProtect also allows this
You don't need to do anything with the flags. You can just mmap a huge virtual area and the pages will be realized whenever you first touch them.
Today I learned, thanks
This brings back old memories. I implemented something similar on a Motorola 68360 ages ago. Since the buffer I used was not huge, I skipped the pointers and simply enumerated each chunk; it was remarkably efficient and stable.
I usually default to slab allocators if I have to write my own:
ThreadX has pool allocators: https://github.com/eclipse-threadx/threadx/blob/master/commo...
Everyone and his dog who has programmed C for a few decades has several as well. :)
The Latin Modern font on this website does not look so good in Chrome; is it because of the font size, or is it just me?
Sounds like a Windows issue based on other people in the thread. I'm on Firefox with the same issue. Removing Latin-modern from the CSS file made it much more readable for me.
I was going to say this. Looks awful to me as well. It is the font-family.
It all depends on the load profile using the allocator, and you never know. That said, you cannot beat (in theory) a semantically specialized allocator for a specific load; in other words, the right way(TM). This means applications should bundle their own allocators for the various load types they have. The "generic" allocator is sort of a heresy for the lazy, the short-termist, or those in a hurry. Don't worry, I still use such a generic allocator sometimes, but often I mmap the memory myself and deal with it directly.
The system allocator is the one that comes with all the system's memory debugging tools and security (or lack of it), so you'd better not mind losing those if you want to ditch it.
This is really informative. I wonder if there is a better solution for the reallocation problem, one that preserves pointers.
This is a very nice article! The diagrams in particular are very clear. (They seem to have been done in draw.io: https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al... which seems to be a pretty decent Apache-2.0 free software licensed diagram editor: https://github.com/jgraph/drawio/)
I think it would be improved further if it began with the calling interface callers use to invoke the allocator, because it's easier to understand the implementation of some service such as pool allocation when you begin by knowing what service, exactly, is required. I began with the assumption that the author was describing an arena allocator under another name, rather than a fixed-size block allocator, both because the APR's influential arena allocator is called "apr_pool_t" and because one of the main headings in the table of contents (a greatly appreciated feature, like all the typography of the article) was "Closing the pool". It took me quite a while to figure out that I was mistaken. It's plausible that I only had this problem because I am an unusually stupid, ignorant, or pigheaded person (some would say all three), but, if not, many other people will have the same difficulty I had of taking several minutes to figure out what the topic of the article is. (I could have saved myself by clicking either of the first two links in the Introduction, but I didn't because I thought I already knew what it was.)
The current top-voted comment on this thread is by someone who had the same reading comprehension problem I did, but never realized their error: https://news.ycombinator.com/item?id=42643951 and nobody had pointed out their mistake until I did just now. So I think we can say with some confidence that many HN readers will have the same difficulty.
I think that the article would also be improved by a few other small additions:
- It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.
- Probably when one benchmarks the implementation given, one will find that it is only slightly faster than jemalloc or even gnumalloc, because they both dispatch small allocations to a fixed-block allocator similar to this one, and the dispatch isn't very expensive. This can be fixed by inlining the allocation fast path and the deallocation function, which will then compile down to a few instructions each. A simple demonstration of how to do this is in my arena allocator kmregion in http://canonical.org/~kragen/sw/dev3/kmregion.h and http://canonical.org/~kragen/sw/dev3/kmregion.c (GPLv3+). The current implementation of 8dcc libpool is guaranteed to be unable to do this unless you're using link-time optimization or a so-called "unity" or "amalgamation" build process.
- It's worth distinguishing between average-case performance (in which this allocator is slightly better than malloc) and worst-case performance (in which case malloc, generally speaking, isn't even in the running).
- It may be worth mentioning that often such pool allocators are used in conjunction with initializing the objects in the pool to a valid state when the pool is created; that way you don't have to reinitialize objects to a valid state every time you deallocate and reallocate them, which is usually more work than malloc does to find the right free list. This does require not overwriting the user data with your freelist pointer, so it's a space/time tradeoff.
- Maybe you should mention alignment, especially now that even on amd64 GCC has taken to using new instructions that require alignment.
Even as it is, though, it's highly educational, visually pleasant, and very understandable (once I got over my initial misconception of having a totally wrong idea of what the article was about).
> The diagrams in particular are very clear. (They seem to have been done in draw.io: https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al... which seems to be a pretty decent Apache-2.0 free software licensed diagram editor: https://github.com/jgraph/drawio/)
Yes, this is true. Furthermore, the diagram information for draw.io is included in the SVG images themselves, so one could edit them there.
> It says, "The pool allocator, however, is much faster than malloc", but doesn't provide any benchmarks or even an order of magnitude.
You are right, I was referring to "malloc-like" allocators (i.e. for variable-size blocks). It would be a good idea to benchmark them.
In general when you're working on an optimization (and this is an optimization) benchmarking it is essential. About a third of the optimizations I try end up making my code slower instead of faster, often because of performance considerations I hadn't thought of. You can't just say "it's much faster" without measuring.
Modern CPUs (with all their buffers & caches & predictor warm-ups & spin-ups) can make benchmarking really tricky unless you have a very specific workload in mind and even then it requires care (like looking at both the edge of[1] and full timing distributions[2] for best-case, average/median/typical case, worst case analysis, etc.). Briefly, the number of ways to get (un)lucky has proliferated over the decades, as has the variety of deployment platforms.
Given the depth of that rabbit hole, as writing advice, I might suggest instead motivating via "more space efficient" and refer to exactly fitting fixed-size objects where more general purpose allocators will (usually)¹ waste space to the next power of two. Measuring less dynamic space is also easier (very dynamic space like %cache used can be tricky, though). Controlling scope and pedagogy in general is also tricky, but I think this rhetorical advice is common practice and might also double as a way to address your initial misread of the topic. EDIT: he can always do refinements / more analysis in follow up posts/something if this library has any legs for him.
[1] https://github.com/c-blake/bu/blob/main/doc/tim.md
[2] https://github.com/c-blake/bu/blob/main/doc/edplot.md
¹ Sure, fancy any-size allocators can tack on a histogram of maybe somewhat rounded lg sizes (to economize histo space) and spawn fixed-size object sub-arenas if there are enough of them, say >= enough to fill a VMem page or 2, but this is rare in my experience and also complicates the dispatch to subarenas that you mention above to slightly fancier search. This kind of topic rapidly spirals into allocator-complete discussion and even systems-complete - e.g. a filesystem is largely an allocator with good old MS-DOS FAT being close to a "file is a pool of disk blocks" model.
This is an excellent comment. Thank you.
Yes, you are right. Do you have any suggestions on how I should do the benchmarking? I am not sure how this is usually done, but if the output is some kind of graph, I would also like to know how these graphs are usually produced.
I'm no expert on this, so probably someone else can give you better advice, but I'm happy to share what I do know. Here are the most important points.
Benchmarking is a full-fledged scientific experiment. You have a hypothesis that a particular change that you can make, such as using a different allocator, will have a particular effect, such as making your program run in less time. So you make just that one single change and measure the results, while carefully controlling all other conditions that could also affect your program's run time. You have to do the test more than once in order to find out whether you've succeeded in controlling them. If you do a good enough job controlling the conditions of the experiment, it becomes reproducible and therefore falsifiable. Programming usually isn't science, but benchmarking is. A great thing about it is that each run of the experiment can take a fraction of a second, so you can do the whole scientific method from beginning to end in a few minutes and get a reproducible, falsifiable result.
Benchmarking can get arbitrarily sophisticated, but the most important thing is to dodge the bullets that can turn benchmarks into nonsense, rather than produce very sophisticated graphs or get very high precision.
The simplest thing on Linux or macOS is to compile a command-line executable you can run in a terminal and run
time ./testprogram
at least three times to get an idea of how long the wallclock time is and how variable it is. (The user and system times shown are not very reliable anymore.) Presumably there is also some way to do the same thing on Microsoft Windows. If you're on a laptop, check to see if it makes a difference if you're plugged in. Also, check whether the result from

time ./testprogram; time ./testprogram

is consistent; sometimes lags in CPU speed scaling can produce unpredictable results. Linux is not very consistent, so you should expect variations in the range of 1–10% from run to run when your program takes on the order of a second to run.
If you get consistently shorter wallclock times after recompiling the program with a different allocator and the same compiler options (probably make sure optimization isn't completely turned off), you can probably attribute that difference to the different allocator. It's a good idea to switch back to the original allocator and verify that you can reproduce your original results to make sure you didn't accidentally change something else at the same time (maybe your laptop went into low-power mode because the battery is getting low, or maybe you changed the number of iterations in a loop and forgot you changed it, or maybe Firefox loaded a YouTube video in the background and is making the machine swap, that kind of thing).
It's useful to ensure that you aren't swapping and that your CPU load other than the benchmark is minimal. `top` (or, better, `htop`) and `vmstat 5` (or, better, `dstat 5`) can be helpful here. Unless your program is heavily multithreaded, the CPU load is not as crucial as it used to be now that we all have multicore processors.
It's helpful to check the particular version of the program you're benchmarking into Git, including the Makefile or whatever specifies the compiler options. That way when you write down your results you can refer to the specific Git commit hash. Also, record your compiler version, your operating system version, and your CPU. Ideally someone who isn't you should be able to read your notes on the benchmark, run it again, and get the same results, if they have access to the same hardware.
The actual benchmark program itself can be as simple as allocating and deallocating in a loop, ideally inside another loop to run the whole benchmark multiple times; but on processors with cache (which includes every PC since the early 90s) the working set size can be very important, and you may get significantly different results for allocating 20 kilobytes 64 bytes at a time and 20 gigabytes 64 bytes at a time. If you pass in the number of iterations for one of the loops on the command line, instead of hardcoding it in the program code, it's easy to start with a small iteration count and work your way up a factor of 10 at a time until the program takes a few seconds to run. In this case, also be sure to record the specific command line you were measuring the performance of in your notes; knowing that a program took 2.205–2.216 seconds to run is of no value if you don't know if it was running a thousand iterations or a million.
It's a good idea to actually compute something that you output using the memory you're allocating and deallocating, just in case GCC's optimizer decides that your benchmark loop has no effect and therefore can be safely removed from your program. (In http://canonical.org/~kragen/sw/dev3/kmregion_example.c I computed the sum of the data value stored in each node of the 24th of all the 5000-node linked lists I allocated and deallocated. That's because, when I read the disassembly of my test code with `objdump -d kmregion_example`, I found out that GCC had removed most of the code inside the test loop because it could prove that nobody ever read the struct fields I was initializing. Reading disassembly is neither necessary nor sufficient for dodging all bullets that invalidate benchmarks.)
Varying the number of iterations in both the innermost and outermost loop should produce roughly proportional increases and decreases in wallclock time. (A helpful technique is to subtract the time for 1000 iterations from the time for 1001000 iterations to get the time for the last million iterations.) If it doesn't, you're not measuring what you think you're measuring. For example, especially with JIT compilers, you may be measuring startup overhead. That bit me this week with http://canonical.org/~kragen/sw/dev3/hadamardsin.js, which I erroneously reported was slower to run than the LuaJIT version. It turns out that it's slightly faster to run, just much slower to compile.
Running the benchmark on more than one computer is helpful because sometimes changes that speed things up on one system slow them down on another. I remember a recursive backtracking search program for logic circuit optimization I wrote in C long ago which used longjmp() to terminate the search when it found a solution; it ran reasonably fast on Linux but very slowly on my friend's Macintosh because MacOS implemented setjmp() as sigsetjmp(), transforming it from a very cheap operation into a very expensive one.
That's all you need to know to reliably do optimizations. In fact, that's probably more than you need to know. Everything below this point is extra.
— ⁂ —
Allocators are especially tricky to benchmark because often, by changing where data is stored in memory, they profoundly affect the speed of the rest of your program, in ways that vary from program to program. More memory fragmentation, for example, can increase how many TLB misses your program runs into. I don't know what to do about that except try to benchmark some real programs as well as simple nested loops.
In cases where you're trying to measure an effect that's roughly the size of your measurement precision, statistical tests can be useful. https://github.com/c-blake/bu/blob/main/doc/edplot.md linked by cb321 above discusses using the 2-sample Kolmogorov-Smirnov test, which is not nearly as terrifying as it sounds. But most effects will be either much larger than your measurement error, in which case the result is obvious enough that you don't need the statistics, or much smaller, in which case all they'll tell you is that you can't be sure about the direction of the effect, much less its magnitude. In medicine, where your experiments take years and people die in them, using sophisticated statistics to make the most of your precious data is very important. In performance benchmarking, where your experiments take milliseconds, it's usually better to improve your measurement precision instead. Sometimes just running the same benchmark more times is enough, but there are more powerful techniques available.
A way to get higher precision than `time` can give you is to run the program under `valgrind` (on Linux). Even raw valgrind will tell you exactly how many instructions it ran, and in most cases that number is the same from run to run, so instead of a measurement with ±3% precision like `time` usually gives you when you've controlled everything properly, you get a measurement typically with ±0.00000001% precision. This is fantastic, far beyond most things you can do in a physics, electronics, or chemistry lab. It means that even with just two runs you can tell if your optimization affected that number by even 0.0000001%, or, more interestingly, 0.1%. And it isn't affected by things like which CPU you run it on or whether your laptop is unplugged, and it only slows the program down about 10×.
Unfortunately, it measures the wrong thing, because a program that runs more instructions can still be faster. `valgrind --tool=cachegrind` tries to simulate your CPU's cache hierarchy to give you a better idea of how long the program will actually take; the `kcachegrind` GUI can give you a "cycle estimation" number based on the numbers it spits out. But it can still be unrelated to how fast the program actually runs for a variety of reasons; system calls are the biggest one. `kcachegrind` is also a "profiler": it shows you how much time (simulated instructions, simulated cache misses, cycle estimation, whatever) is used by each subroutine in your program.
You might wonder why an optimization that speeds your program up by 0.1% matters. The answer is that if you do 106 such optimizations you have sped your program up by 10%. SQLite has improved its performance by more than a factor of 3 over a 10-year period by using this approach: https://www.sqlite.org/cpu.html
There's a more basic profiler called gprof built into GCC; building your program with `-pg` will add profiling instrumentation in to it, so that it writes profiling information into a file called `gmon.out`. After running a program built with gprof, you can run `gprof testprogram` to analyze `gmon.out`; it will tell you how much time each subroutine took (including and excluding its transitive callees) and how many times it was called.
Linux also has a built-in profiler called `perf_events` https://perfwiki.github.io/main/ which uses hardware performance counters to get numbers like cachegrind's, but using the real hardware instead of simulation, and including a lot more detail; it can also profile the time your program spends in system calls. I've only used it in very basic ways, so I don't know very much about it. Intel also has a proprietary thing called VTune which can do things perf can't and runs on OSes that aren't Linux, but doesn't support AMD's processors.
— ⁂ —
I hope this is of some value to you! I'd be happy to answer any other questions. Hopefully correctly.
Wow. This is probably the best comment I have ever seen on HN. Thank you so much for the information, I will add benchmarking to my 'libpool' project and mention it in the article. I can't think of any questions right now, but I will let you know if anything comes up. Thank you again. :)
Aww, thanks! *blush*
Actually, I have one question. You said:
> The actual benchmark program itself can be as simple as allocating and deallocating in a loop [...]
What do you mean, exactly? Deallocating immediately after allocating? I feel like there would be a big difference between that and allocating multiple blocks before freeing them.
You are right - there will be a big difference for these two different workloads. As I am 100% sure @kragen already knows, this is arguably the first, smallest step in thinking for you to realize just how many possible workloads you might want to model. :-) And to realize why senior engineers will say pithy things like "the best benchmark is your actual application".
While that is true (and I've said it myself), one reason I like the design of a little "workload interpreter shell" as in https://github.com/c-blake/bst/blob/main/test/bst-interact.c (or maybe better https://github.com/c-blake/adix/blob/master/tests/btshell.ni... which has a preprocessor) is that you can use them to run "isolated tests" as a kind of "half step" between a pure hot loop dumb thing and a full-on application with all its attendant noise and self-interaction/self-disturbance.
So, for example, in this allocation problem setting you could write a similar 30-line "shell/interpreter" that interprets a binary stream of "commands" or allocation requests. The origin a of said binary stream could be a recording/log from some actual app you have doing something actually useful. Then you could apply that one recording in the "shell" in different configurations as @kragen mentions to study the allocator performance (both space & speed) in isolation. At that point you could start to say pretty meaningful things from replaying logs many times about this or that tradeoff on this|that workload.
EDIT: Of course, at that point you start to get into debates about "how representative" different workloads are, but at least you could potentially share your log with someone who had a different computer and have them run benchmarks, too, and at least the workload is not synthetic but relates to some actual possible program. Moving from synthetic benchmarks to log-based ones was a big thing in OS research on filesystems in the 1970s & 80s and it helped the "generalizability of results" (I think, anyway..I'm sure many still do only synthetic benchmarking due to its convenience).
EDIT2: I just thought of a cute name for the log replay half-step between synthetic microbenchmarks and a full application - "The millibenchmark". ;-) Maybe someone else has thought of this before, though.
The "interpreter shell" is an interesting idea!
Recording traces of allocation requests from actual applications for allocator-benchmarking purposes revolutionized allocator performance research in the late 90s, if I recall correctly, in addition to the earlier filesystem research you mentioned. Benchmarks based on those traces were the bedrock of Zorn's case that generational GCs were faster in practice than malloc().
But I think (though possibly this is motivated reasoning) that benchmarking some workload is good enough for many optimizations. A fixed-size-block allocator is already pretty restricted in what workloads you can run on it, after all, and its performance characteristics are a lot simpler than something like first-fit. Showing that an optimization makes some workload faster is infinitely better than not knowing if it makes any workload faster. Showing that it generalizes across many workloads is better, of course, but it's not nearly the same step in utility.
Well, you can use it for everything. I've used it for in-memory hash tables, trees, etc. To be clear, for those who may not understand, the "interpretation" in these "millibenchmarks" for something as simple as an allocator is not like Python or Bash or something slow. It could literally be "mmap a binary file of integers and either alloc/free the binary numbers". So, yeah, there might be some CPU pipeline overhead in unpredictable branches and you might still want to try to measure that, but beyond that there is basically no work, and any real workload might have one-ish hard-to-predict branch leading to allocator operations. So, even unadjusted for overhead it's not so bad. And it keeps your measurements fast so you can get statistically many of them even if there is an "inner min loop" to try to isolate the best case or something. (Well, at least if what you are measuring is fast.) The downside/tradeoff is just lack of fidelity to interaction effects in the full application like resource competition/branch predictor disruption/etc.
You could probably set up a https://github.com/c-blake/nio file for it (or other) if you wanted to be systematic. A problem with the "text-newline" part of the "unix philosophy" is that it incurs a lot of unneeded binary->ascii->binary cycles that's actually more programming work as well as more CPU work, and it's really fairly orthogonal to the rest of the Unix ideas.
EDIT reply to parent EDIT
> benchmarking some workload is good enough
Fair enough, except that "good enough" is sort of "famous last words" if anyone is reviewing your work. :-) :-) Some will complain you only did Intel not AMD or not ARM or not xyz or Apple Silicon or whatever other zillion variables. Something is always better than nothing, though. It should be thought of more like a Bayesian thing - "no surprise for untested situations" or maybe "confidence in proportion to evidence/experience".
FWIW, I thought your big advice to 8dcc was pretty decent.
Thanks! I agree with all of that except for the last statement, which I am not competent to judge.
You are sure right about that! The example I linked allocates 5000 blocks and makes a linked list out of them, then deallocates them.
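That workload shape can be sketched like this (a stand-in using plain malloc/free and plain structs, not the kmregion code itself): allocate many blocks before freeing any, by linking them into a list, and sum a field so the optimizer can't delete the loop:

```c
#include <stdlib.h>

typedef struct Node { struct Node *next; long value; } Node;

/* Allocate n nodes into a linked list, then walk it, summing the
   values and freeing each node; returns the sum as a checksum. */
static long build_and_tear_down(long n)
{
    Node *head = NULL;
    long sum = 0;

    for (long i = 0; i < n; i++) {   /* allocate all n blocks up front */
        Node *node = malloc(sizeof(Node));
        if (!node) break;
        node->value = i;
        node->next = head;
        head = node;
    }
    while (head) {                   /* then free them all */
        Node *next = head->next;
        sum += head->value;
        free(head);
        head = next;
    }
    return sum;
}
```

This exercises the allocator with n blocks live at once, unlike the alloc-then-immediately-free loop, which never has more than one.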
It seems like almost all of the complexity of this allocator comes from having to manage the fact that it's being implemented on top of the system allocator. I thought the whole point of using a custom allocator was to avoid the system allocator for exactly the problems they mention e.g. calling the system-provided realloc() might move your stuff around, having to track separate pool starts so you can call the system-provided free() on them, etc.
Like, yeah you have to do that, but I thought the whole point of writing a custom allocator was to not use the system allocator beyond statically allocating a huge block of memory upfront and then managing that yourself, is it not?
I recommend against using linked lists for bookkeeping the free blocks. It seems to be the data structure that every malloc/free implementation reaches for, and I don't know why - the slowness of pointer-chasing makes it terrible for almost any real-world use case. A balanced tree would be a much better idea, given that all the essential operations would take O(log n) time instead of O(n). Even if one insists on a linear search, a bitset is much more cache friendly than pointer-chasing and it can trivially benefit from SIMD optimizations.
For the `Chunk` list, this isn't one of the cases where linked lists are harmful. Each use only touches the top of the stack, never iterates. Also, linked lists are much easier to make thread-safe.
For the `LinkedPtr` list, the bad case is only hit during the destruction of the pool, and then only if you allocate a lot of memory. And given the overhead of deallocation I'm not sure the array approach would measurably help.
I don't see anywhere a binary tree search would be useful here, since there are no loops used for lookup (on allocation they're pushed in order, but when freed chunks are pushed in arbitrary order; this does mean no double-free protection).
why would you need to iterate on destruction? you free the whole pool at once, since you allocate it all at once
The reason linked lists are used is that, for large enough allocations, there is no overhead: you use the space the application isn't using. In addition, if all allocations are the same size it is O(1); you just look at the head of the list.
More sophisticated strategies bucket allocations by size; this has fixed overhead. You can also use balanced trees for more memory efficiency, but this is slower.
For small allocations (8 bytes) that are too small to contain pointers, allocators will allocate a block and use bitsets.
Pool allocators don't walk the list or search anything though. All interactions are only at the list head and O(1), as all free nodes are just that, free and equal.
Then it might be a misnomer. Calling it "stack" instead of "list" might be better.
That is a fair comment.
While the most precise naming might be "a singly linked list representation of a stack", "free list" (https://en.wikipedia.org/wiki/Free_list) is an ancient term of art for this problem.. so much so that wikipedia (at least the version today) even suggests "freelist" all one word as a spelling.
The initial linking together of all free space (also cache friendly, btw) is often called "threading the free list", although this terminology is less standard than "free list" itself.
Suppose the implementation did happen to search through the free blocks. Suppose you put them into an array instead of a linked list. They can't actually be in the array, you see? The blocks are in the pool heap wherever they happen to be. So the array has to point to them. As you walk through the array you have to dereference the pointers. The only way it's better is that they are not dependent loads. But you have this array to manage now.
Which pool operation is made more costly by the linked list? Both allocate and free are constant time, and trivial too.
The only thing that I can imagine being faster is a bump allocator, but that precludes free.
I don't think you understand how the allocator in the article works. Allocating and freeing are already O(1), creating and closing the allocator are necessarily O(n). There is no search being done here.