I know the post only uses SIMD as an attempt to do bitmask lookup, but a good use of SIMD is for parallel data processing, i.e. operating on more than one character. Being able to identify a range of characters making up the identifier is much faster than checking one character at a time.
Did a bit of xmas hacking to come up with the function to find identifiers in parallel in 32-byte chunks. I used Zig since it makes vector/SIMD programming easy.
The code is in: https://github.com/williamw520/misc_zig/blob/main/identifier...
The generated code of the nextBitmask32() function has about 38 instructions. https://godbolt.org/z/7ME3sW55e
That's 38 instructions for processing every 32 bytes. With no branching!
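To turn the per-chunk result into an identifier length, one option (a rough C sketch, not the linked Zig code; ident_run_len is a made-up name and __builtin_ctz is GCC/Clang-specific) is to count the trailing 1 bits of the 32-bit match mask:

    #include <stdint.h>

    /* Bit i of match_mask is set when byte i of the 32-byte chunk is an
       identifier character. The identifier run starting at byte 0 is then
       the number of trailing 1 bits. */
    static int ident_run_len(uint32_t match_mask)
    {
        if (match_mask == UINT32_MAX)
            return 32;                      /* whole chunk matches, fetch the next one */
        return __builtin_ctz(~match_mask);  /* trailing ones == trailing zeros of the complement */
    }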
Another trick for classifying bytes is to use pshufb. That instruction works like:
    fn pshufb(a: [i8; 16], b: [i8; 16]) -> [i8; 16] {
        let mut result = [0i8; 16];
        for j in 0..16 {
            // A negative index byte zeroes the lane; otherwise the low 4 bits index into a.
            if b[j] >= 0 {
                result[j] = a[(b[j] & 15) as usize];
            }
        }
        result
    }
So you can often construct values m1, m2, and compute:

    a = pshufb(m1, input & 0xf)
    b = pshufb(m2, (unsigned_input) >> 4)
    x = f(a, b)

Where f is something like bitwise AND. Roughly, m1 and m2 assign sets of 8 potential properties based on the high or low order bits of each byte, and then we AND together the sets of properties to make sure the upper and lower bits 'agree'. This technique can sometimes be used to mark a few special characters.
Pshufb is an old trick. An oldie but a goodie from SSSE3 days.
All hail modern improvements!! pdep, pext, vpcompressb, vpexpandb, and especially vpermi2b.
Forget about a 16-byte lookup table with pshufb. Try a 128-byte lookup with AVX-512's vpermi2b.
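A minimal sketch of that 128-byte lookup (assumes AVX-512 VBMI; lookup128 and the table parameters are made-up names):

    #include <immintrin.h>

    /* vpermi2b picks bytes out of the 128-byte table formed by table_lo:table_hi,
       using the low 7 bits of each index byte. */
    static __m512i lookup128(__m512i idx, __m512i table_lo, __m512i table_hi)
    {
        return _mm512_permutex2var_epi8(table_lo, idx, table_hi);
    }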
--------
I find that pext/pdep for 64 bits and vpcompressb/vpexpandb for 512 bits are also more flexible than pshufb (!!!!) in practice, if a bit slower since they take multiple instructions.
Supercomputer programs are written with a gather-compute-scatter pattern. Gather your data, compute on the data, and finally place the data where needed.
Pext and vpcompressb are gather operations.
Pdep and vpexpandb are scatter.
Pshufb is only a gather and thus fails to match the supercomputer pattern. GPUs provide a bpermute (backwards-permute) that scatters results back to a table and is really the missing link IMO.
Hopefully Intel will provide bpermute of their own eventually....
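To make the pairing concrete, a small sketch of the 64-bit versions with the BMI2 intrinsics (assumes a BMI2-enabled build; the wrapper names are made up):

    #include <immintrin.h>
    #include <stdint.h>

    /* pext gathers the bits selected by mask into the low end of the result;
       pdep scatters low-order bits back out to the positions selected by mask. */
    static uint64_t gather_bits(uint64_t src, uint64_t mask)
    {
        return _pext_u64(src, mask);
    }

    static uint64_t scatter_bits(uint64_t src, uint64_t mask)
    {
        return _pdep_u64(src, mask);
    }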
> So you can often construct values m1, m2, and compute:
And what would m1 and m2 be for the specific problem¹ described in TFA?
¹ Namely, to find out if the input belongs to [0-9A-Za-z]
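One possible choice for exactly [0-9A-Za-z] (a sketch with SSE intrinsics; the five-way split of the range is just one option): give each of '0'-'9', 'A'-'O', 'P'-'Z', 'a'-'o' and 'p'-'z' its own property bit, since each of those spans a single high nibble.

    #include <immintrin.h>

    /* Property bits:
         0x01: 0x30-0x39 ('0'-'9')   0x02: 0x41-0x4F ('A'-'O')
         0x04: 0x50-0x5A ('P'-'Z')   0x08: 0x61-0x6F ('a'-'o')
         0x10: 0x70-0x7A ('p'-'z')
       m2, indexed by the high nibble, says which properties live in that row;
       m1, indexed by the low nibble, says which properties allow that column.
       A byte is in [0-9A-Za-z] iff the AND of the two lookups is nonzero. */
    static __m128i is_alnum(__m128i input)
    {
        const __m128i m1 = _mm_setr_epi8(       /* low-nibble table  */
            0x15, 0x1F, 0x1F, 0x1F, 0x1F, 0x1F, 0x1F, 0x1F,
            0x1F, 0x1F, 0x1E, 0x0A, 0x0A, 0x0A, 0x0A, 0x0A);
        const __m128i m2 = _mm_setr_epi8(       /* high-nibble table */
            0x00, 0x00, 0x00, 0x01, 0x02, 0x04, 0x08, 0x10,
            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00);
        __m128i lo = _mm_and_si128(input, _mm_set1_epi8(0x0F));
        __m128i hi = _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0F));
        __m128i a  = _mm_shuffle_epi8(m1, lo);
        __m128i b  = _mm_shuffle_epi8(m2, hi);
        /* all-ones byte where the input byte is alphanumeric, zero elsewhere */
        return _mm_cmpgt_epi8(_mm_and_si128(a, b), _mm_setzero_si128());
    }

Bytes >= 0x80 fall out naturally because rows 8-15 of m2 are zero.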
Under certain conditions gcc will actually generate instructions that use the same bitvector technique described here: https://godbolt.org/z/6Gfjv6PGd (notice the funny 130159 value in there)
The only thing I did was to adjust the range of characters so that the bitvector fits in 64 bits. Sadly, I do not think it would work for the author's bigger range.
I’m surprised he didn’t use a switch statement, as that is the canonical way to ask the compiler to make the jump table for you.
Without SIMD, that would probably be the optimal way to implement it. And this bit vector idea really isn't doing SIMD at all, it's just (ab)using the vector registers to store 128 bits.
I think the Zig code would also compile to better output, and be a lot simpler, if it was using an array of u1, instead of doing the bit shifting manually.
The bit vector does give a nice compression compared to the jump table. But I believe the main thing to do for speeding up string parsing is to consume more than one character at a time. I personally found the techniques in https://tia.mat.br/posts/2018/02/01/more_on_string_switch_in... very helpful to explain these issues.
EDIT: I especially like this statement from the conclusion:
> The good thing about the switch statement in C is that it is maybe the highest level statement in the language: the compiler can get really creative in how its code is generated. It's not uncommon for it to generate jump tables or even binary searches.
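For illustration, a sketch of what that canonical switch could look like (the case ranges are a GCC/Clang extension; is_ident_char is a made-up name):

    int is_ident_char(unsigned char c)
    {
        switch (c) {
        case '0' ... '9':   /* case ranges: GCC/Clang extension */
        case 'A' ... 'Z':
        case 'a' ... 'z':
            return 1;
        default:
            return 0;
        }
    }

The compiler is then free to lower this to a jump table, a bit test against an immediate mask, or plain compares, depending on the target and the density of the cases.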
This article is kind of fine overall but gets so many specifics wrong. I guess this is to be expected from someone just starting to learn assembly, but still.
    if '0' <= x && x <= '9' || 'a' <= x && x <= 'z' || 'A' <= x && x <= 'Z' {
        ...
    }
Go is garbage, but GCC and LLVM compile this to branchless code. If the goal was to redo the book in Zig and Rust, perhaps it would've been a good idea to profile the implementations under Zig and Rust. If you really want to optimize code like `a <= x && x <= b` by hand, you can rewrite it to `x - a <= b - a`, where the comparison is unsigned; GCC and LLVM do that for you. Next, you can optimize the case-insensitive check by masking out the case bit:
    void f();

    void test(char x) {
        if ((unsigned char)(x - '0') < 10 || (unsigned char)((x & 0x5f) - 'A') < 26) {
            f();
        }
    }
compiles to just

    lea eax, [rdi - 48]
    cmp al, 10
    jb f
    and dil, 95
    add dil, -65
    cmp dil, 25
    jbe f
    ret
This is still not as fast as a memory read, but IMO this should've been the first thing to try. However, a) GCC and LLVM can autovectorize this, whereas memory reads can't be (although I admit the resulting codegen makes me uneasy), b) you can go on and vectorize this by hand, parsing an identifier in just one iteration of a loop, which you can't do with a LUT. I might have made a bug, but something along the lines of

    #include <immintrin.h>

    unsigned short test(__m128i x) {
        return _mm_movemask_epi8(_mm_or_si128(
            _mm_cmplt_epi8(_mm_sub_epi8(x, _mm_set1_epi8(0xb0)), _mm_set1_epi8(0x8a)),
            _mm_cmplt_epi8(_mm_sub_epi8(_mm_and_si128(x, _mm_set1_epi8(0x5f)), _mm_set1_epi8(0xc1)), _mm_set1_epi8(0x9a))
        ));
    }

should work. This has 5x more throughput than a LUT and should have better branch prediction characteristics, although it consumes 16 bytes at a time, slightly limiting the applications.

The section about bitvectors is fine.
x86 processors don't have 128-bit MMX registers and 256-bit MMY registers. They have legacy 64-bit MMX registers, modern 128-bit XMM registers, and modern 256-bit YMM registers. This MMX/XMM duality sure does confuse people, so I don't blame the author. But come on, at least read the article before publishing; broken GitHub URLs without `blob/master` and broken ()[] Markdown syntax are not funny.
To those trying to make sense of Zig output, don't forget `-O ReleaseFast`. The codegen here is very simple, just a 128-bit integer simulated with two 64-bit ones.
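Roughly, the check comes out as something like this two-word bit test (a C sketch of the idea, assuming c < 128; bitset128_test is a made-up name):

    #include <stdbool.h>
    #include <stdint.h>

    static bool bitset128_test(uint64_t lo, uint64_t hi, unsigned char c)
    {
        uint64_t word = (c & 64) ? hi : lo;   /* pick the half by bit 6 of c */
        return (word >> (c & 63)) & 1;
    }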
Excellent nitpicks, hopefully the author reads this and does a follow up :)
> Since go doesn’t inline assembly i decided to include the search loop into the assembly itself and to return the index of the non matching character. Sadly i didn’t get it to work fully because i’m missing tools to actually debug the assembly itself.
Um... "gdb <my_binary>", "break <my_function>", "run", just like any other tool.
Then "disassemble" to see the instructions around the PC, "stepi" to execute one instruction. "info register" to get current state (or use registers in an expression as e.g. "print $rax" or "print $r11").
And that's it. Debuggers are natively tools that operate at the level of machine instructions. The language syntax support is a layer on top.
Assembly is one of those things that is feared by so many engineers, but is actually incredibly accessible. Whenever I find an excuse to drop down to assembly my colleagues have looked at me like I'm a wizard with 3 heads.
The tooling to debug, dump or inspect, or to link or otherwise glue bits of assembly together has existed for decades. It's all right there!
I agree it's unnecessarily feared. At its core, it lacks a lot of complexity.
But actually writing asm is an experience straight from the 80s. Compared to modern languages, the tooling is simply antiquated. I noticed this once I started writing larger sections of asm at a time. It's "fine" in the same sense that writing makefiles is "fine" - it becomes a PITA after a certain size.
I think there's lots to improve on the current state of asm, but it's just too niche. The only power users I know are hobbyists working on operating systems and games for retro hardware.
It's one of those areas where LLMs really help you get started. Not much invention needed, but quite an involved setup if you don't know precisely what you're looking for. Having them spit out the boilerplate and the needed invocations lowers the barrier substantially.
Assuming you have a C compiler, you should use C to get basic boilerplate and use asm or volatile asm to drop down to assembly level.
This is true for GPUs, CPUs like x86 and ARM, and more. Only on the smallest of embedded systems like AVR with 4kB RAM have I ever found dropping to raw assembly to be useful.
Bonus points: input/output/clobber lists work well with the GCC and Clang optimizers. Though it's not intuitive that compilers can still move your assembly around, the mechanism is surprisingly full-featured for getting data into and out of registers.
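A minimal x86-64 sketch of that input/output/clobber form (add_u64 is just an illustrative example):

    #include <stdint.h>

    static inline uint64_t add_u64(uint64_t a, uint64_t b)
    {
        uint64_t out = a;
        __asm__ ("add %[b], %[out]"     /* AT&T syntax: out += b */
                 : [out] "+r" (out)     /* read-write output operand */
                 : [b]   "r"  (b)       /* input operand */
                 : "cc");               /* add modifies the flags */
        return out;
    }

The constraints let the compiler choose the registers and keep optimizing around the statement, which is the point above.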
There are rare times where the C boilerplate is counterproductive. I guess OS level setup is another spot where raw non-C assembly is useful.
Intrinsics, when available, are a convenient tool. But even then, I think assembly these days has very limited applications and most programmers don't need to know any assembly to be effective.
Preemptive disclaimer: all my jobs were the exception to the rule and I sadly had to work with various assembly languages on a daily basis. Still not a fan.
Things like the condition flags, or GPU execution flags, or other 'implicit' flags are where assembly is needed.
Otherwise, intrinsics are sufficient. And with wavefront voting / ballot instructions (and intrinsics) being so readily available on modern GPUs, there's very little reason to go to those assembly level / machine code flags or registers.
Similarly, back when cmov instructions weren't reliably output by compilers, you'd probably want raw assembly. But today I think it's safe to rely upon the optimizers.
You should be able to read assembly, not least because understanding what's generated by the compiler or JIT is pretty important for performance optimization.
I also like "disp /3i $pc", which tells gdb to disassemble the next three instructions whenever it stops (including after stepi). Though I'm sure you can configure it to give you more sophisticated disassembly display and context, especially with the tui interface, this is usually "good enough" for me when I run into something I want to step through at the assembly level in the middle of debugging a program that's primarily C.