• dzaima 2 days ago

    uops.info's measurements show 'inc r64', interleaved with 'movsxd' instructions, still having zero latency[0], so it can't be just merging the immediates of successive increments (or there's additional fusion happening). Plain unrolled 'inc r64' shows an average latency of 0.2 cycles, i.e. 5 dependent ops per cycle, and uses 0.2 ports per instruction[1].

    The same behavior shows up for 'lea r64, [r64+8]' (imm8), 'lea r64, [r64+128]' (imm32), and 'add r64, 2' (imm8), but not for 'add r64, 0x1000000' (imm32).
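
    For anyone wanting to poke at this locally, here is a rough sketch of the kind of dependent chain being measured: eight back-to-back 'inc' instructions on a single register, repeated in a loop. This is not uops.info's harness; the function name run_inc_chain and the unroll factor of 8 are my own assumptions, and it needs GCC or Clang on x86-64 for the inline asm (other targets fall back to plain C, which measures nothing useful).

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical helper (not from uops.info): run `iters` rounds of 8
       back-to-back dependent increments on one register and return the
       final value. Each 'inc' reads the result of the previous one, so the
       whole thing is a single serial dependency chain. */
    static uint64_t run_inc_chain(uint64_t start, uint64_t iters)
    {
        /* Precondition: the chain must not wrap around 64 bits. */
        assert(iters <= (UINT64_MAX - start) / 8);

        uint64_t x = start;
        for (uint64_t i = 0; i < iters; i++) {
    #if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
            /* 'volatile' keeps the compiler from folding the eight
               increments into one add; 'inc' writes flags, hence "cc". */
            __asm__ volatile(
                "inc %0\n\t"
                "inc %0\n\t"
                "inc %0\n\t"
                "inc %0\n\t"
                "inc %0\n\t"
                "inc %0\n\t"
                "inc %0\n\t"
                "inc %0\n\t"
                : "+r"(x)
                :
                : "cc");
    #else
            /* Portable fallback: same result, but no meaningful timing. */
            x += 8;
    #endif
        }
        return x;
    }
    ```

    To get numbers out of it, time a large iteration count with a core-cycle counter (e.g. 'perf stat -e cycles') and divide by 8 * iters. The loop counter and branch are a separate dependency chain, so they should mostly overlap with the increments, but a larger unroll factor would reduce their influence further.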

    [0]: https://uops.info/html-lat/ADL-P/INC_R64-Measurements.html

    [1]: https://uops.info/html-tp/ADL-P/INC_R64-Measurements.html