I don't understand on one of the later graphs the core to core latency for strix halo goes out to 32 cores but he says only has 16 cores?
AMD's cores have SMT, allowing them to run two threads at a time and appear to the OS and its scheduler as two logical cores despite being implemented as a single physical core.
What pattern in the data shows that's what's being measured? I would expect to see basically 0 latency between adjacent "cores" then since L1 is shared per thread?
Co resident threads might not get any speed up here since coherency instructions are functionally operations on the L2 cache.