from Hacker News

Measuring CPU core-to-core latency

by nviennot on 9/18/22, 5:15 PM with 90 comments

  • by apaolillo on 9/18/22, 10:19 PM

    We published a paper where we captured the same kind of insights (deep numa hierarchies including cache levels, numa nodes, packages) and used them to tailor spinlocks to the underlying machine: https://dl.acm.org/doi/10.1145/3477132.3483557
  • by wyldfire on 9/18/22, 7:14 PM

    This is a cool project.

    It looks kinda like the color scales are normalized to just-this-CPU's latency? It would be neater if the scale represented the same values among CPUs. Or rather, it would be neat if there were an additional view for this data that could make it easier to compare among them.

    I think the differences are really interesting to consider. What if the scheduler could consider these designs when weighing how to schedule each task? Either statically or somehow empirically? I think I've seen sysfs info that describes the cache hierarchies, so maybe some of this info is available already. That nest [1] scheduler was recently shared on HN, I suppose it may be taking advantage of some of these properties.

    [1] https://dl.acm.org/doi/abs/10.1145/3492321.3519585
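    Some of that cache-sharing info really is already in sysfs on Linux: each cache's `shared_cpu_list` file names the CPUs that share it. A minimal parser for that cpulist format (a sketch with a hypothetical helper name; the path in the comment is the real sysfs location):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse a sysfs cpulist string such as "0-3,8" -- the format of
// /sys/devices/system/cpu/cpuN/cache/indexM/shared_cpu_list on Linux --
// into the list of CPU ids that share that cache.
std::vector<int> parse_cpu_list(const std::string& list) {
    std::vector<int> cpus;
    std::stringstream ss(list);
    std::string part;
    while (std::getline(ss, part, ',')) {
        size_t dash = part.find('-');
        int lo = std::stoi(part.substr(0, dash));
        int hi = (dash == std::string::npos) ? lo
                                             : std::stoi(part.substr(dash + 1));
        for (int c = lo; c <= hi; ++c) cpus.push_back(c);
    }
    return cpus;
}
```

    A scheduler (or a benchmark like this one) could read those files per cache level and prefer co-scheduling tasks onto CPUs that appear in the same list.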

  • by rigtorp on 9/18/22, 8:14 PM

    I have something similar but in C++: https://github.com/rigtorp/c2clat
  • by hot_gril on 9/19/22, 4:03 AM

I was wondering what real-life situations this benchmark matters the most in, then I remembered... A few years ago I was working on a uni research project trying to eke out the most performance possible in an x86 software-defined EPC, basically the gateway that sits between the LTE cell tower intranet and the rest of the Internet. The important part for me to optimize was the control plane, which handles handshakes between end users and the gateway (imagine everyone spam-toggling airplane mode when their LTE drops). Cache coherence latency was a bottleneck. The control plane I developed had diminishing returns in throughput up to like 8 cores on a 12-core CPU in our dual-socket test machine. Beyond that, adding cores actually slowed it down significantly*. Not a single-threaded task but not embarrassingly parallel either. The data plane was more parallel, and it ran on a separate NUMA node. Splitting either across NUMA nodes destroyed the performance.

    * which in hindsight sounds like TurboBoost was enabled, but I vaguely remember it being disabled in tests

  • by jtorsella on 9/18/22, 10:40 PM

    If anyone is interested, here are the results on my M1 Pro running Asahi Linux:

    Min: 48.3 Max: 175.0 Mean: 133.0

    I’ll try to copy the exact results once I have a browser on Asahi, but the general pattern is most pairs have >150ns and a few (0-1; 2-3,4,5; 3-4,5; 4-5; 6-7,8,9; 7-8,9; 8-9) are faster at about 50ns.

    Edit: The results from c2clat (a little slower, but the format is nicer) are below.

      CPU    0    1    2    3    4    5    6    7    8    9
        0    0   59  231  205  206  206  208  219  210  210
        1   59    0  205  215  207  207  209  209  210  210
        2  231  205    0   40   42   43  180  222  224  213
        3  205  215   40    0   43   43  212  222  213  213
        4  206  207   42   43    0   44  182  227  217  217
        5  206  207   43   43   44    0  215  215  217  217
        6  208  209  180  212  182  215    0   40   43   45
        7  219  209  222  222  227  215   40    0   43   43
        8  210  210  224  213  217  217   43   43    0   44
        9  210  210  213  213  217  217   45   43   44    0
  • by snvzz on 9/19/22, 3:04 AM

    >This software is licensed under the MIT license

    Maybe consider including an MIT license file in the repository.

    Legally, that's a bit more sane than having a line in the readme.

    In practice, GitHub will recognize your license file and show the license in the indexes and in the right column of your repository's main page.

  • by dan-robertson on 9/18/22, 8:39 PM

    It would be interesting to have a more detailed understanding of why these are the latencies, e.g. this repo has ‘clusters’ but there is surely some architectural reason for these clusters. Is it just physical distance on the chip or is there some other design constraint?

    I find it pretty interesting where the interface that CPU makers present (e.g. a bunch of equal cores) breaks down.

  • by ozcanay on 9/19/22, 10:25 AM

    I am currently working on my master's degree in computer science, studying this exact topic.

    In order to measure core-to-core latency, we should also understand how cache coherence works on Intel. I am currently experimenting with microbenchmarks on the Skylake microarchitecture. Due to scalability issues with the ring interconnect on CPU dies in previous models, Intel opted for a 2D mesh interconnect in recent years. In this microarchitecture, the CPU die is split into tiles, each accommodating cores, caches, a CHA (caching and home agent), a snoop filter, etc. I want to emphasize the role of the CHA here. Each CHA is responsible for managing coherence for a portion of the address space. If a core tries to fetch a variable that is not in its L1D or L2 cache, the CHA managing coherence for that variable's address will be queried to learn the variable's whereabouts. If the data is on the die, the core currently owning the variable will be told to forward it to the requesting core. So, even when the communicating cores are physically contiguous, the location of the CHA that manages coherence for the variable they pass back and forth also matters, due to the cache coherence mechanism.

    Related links:

    https://gac.udc.es/~gabriel/files/DAC19-preprint.pdf

    https://par.nsf.gov/servlets/purl/10278043
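    The handoff such benchmarks measure (and that the CHA mediates) is essentially two threads ping-ponging a single cache line. A minimal sketch of that measurement (my own code, not the benchmark's actual Rust implementation; pinning is Linux-only and best-effort, and a real benchmark would discard warmup rounds and report percentiles):

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#ifdef __linux__
#include <pthread.h>
#include <sched.h>
#endif

// Pin the calling thread to `cpu` (best effort; a no-op off Linux).
static void pin_to_cpu(int cpu) {
#ifdef __linux__
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
#endif
}

// One round trip: this thread flips the flag to 1, the peer flips it
// back to 0. Returns the mean one-way latency (round trip / 2) in ns.
double core_to_core_ns(int cpu_a, int cpu_b, int rounds) {
    std::atomic<int> flag{0};
    std::thread pong([&] {
        pin_to_cpu(cpu_b);
        for (int i = 0; i < rounds; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) {}  // wait for ping
            flag.store(0, std::memory_order_release);             // pong back
        }
    });
    pin_to_cpu(cpu_a);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        flag.store(1, std::memory_order_release);                 // ping
        while (flag.load(std::memory_order_acquire) != 0) {}      // wait for pong
    }
    auto t1 = std::chrono::steady_clock::now();
    pong.join();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / (2.0 * rounds);  // two cache-line hops per round trip
}
```

    Every flip forces the line to migrate between the two cores' caches, so the measured time is dominated by exactly the coherence traffic described above.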

  • by moep0 on 9/19/22, 1:08 AM

    Why does CPU 8 in the Intel Core i9-12900K have fast access to all other cores? It is interesting.
  • by bhedgeoser on 9/19/22, 2:41 AM

    On a 5950x, the latencies for core 0 are very high if SMT is enabled, I wonder why that is?

             0       1  
        0
        1   26±0
        2   26±0    17±0
        3   27±0    17±0
        4   32±0    17±0
        5   29±0    19±0
        6   32±0    18±0
        7   31±0    17±0
        8  138±1    81±0
        9  138±1    83±0
       10  139±1    80±0
       11  136±1    84±0
       12  134±1    83±0
       13  137±1    80±0
       14  136±1    84±0
       15  139±1    84±0
       16   16±0    16±0
       17   28±0     8±0
       18   33±0    17±0
       19   29±0    16±0
       20   28±0    17±0
       21   29±0    19±0
       22   32±0    18±0
       23   31±0    17±0
       24  137±1    81±0
       25  140±1    79±0
       26  143±1    80±0
       27  138±1    82±0
       28  139±1    82±0
       29  139±1    81±0
       30  142±1    82±0
       31  142±1    84±0
  • by bullen on 9/19/22, 10:40 AM

    I ran c2clat on a Raspberry Pi 4:

      CPU    0    1    2    3
        0    0   77   77   77
        1   77    0   77   77
        2   77   77    0   77
        3   77   77   77    0
    
    And a Raspberry Pi 2:

      CPU    0    1    2    3
        0    0   71   71   71
        1   71    0   71   71
        2   71   71    0   71
        3   71   71   71    0
  • by jeffbee on 9/18/22, 9:56 PM

    Fails to build from source with Rust 1.59, so I tried the C++ `c2clat` from elsewhere in the thread. Quite interesting on Alder Lake, because the quartet of Atom cores has uniform latency (they share an L2 cache and other resources) while the core-to-core latency of the Core side of the CPU varies. Note the way these are logically numbered: 0,1 are SMT threads of the first core, and so forth through 14-15. 16-19 are Atom cores with 1 thread each.

      CPU   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
       0    0   12   60   44   60   44   60   43   50   47   56   48   58   49   60   50   79   79   78   79
       1   12    0   45   45   44   44   60   43   51   49   55   47   57   49   56   51   76   76   76   76
       2   60   45    0   13   42   43   53   43   48   37   52   41   53   42   53   42   72   72   72   72
       3   44   45   13    0   42   43   53   42   47   37   51   40   53   41   53   42   72   72   72   72
       4   60   44   42   42    0   13   56   43   49   52   54   41   56   42   42   41   75   75   74   75
       5   44   44   43   43   13    0   56   43   51   54   55   41   56   42   56   42   77   77   77   77
       6   60   60   53   53   56   56    0   13   49   54   56   41   57   42   57   42   78   78   78   78
       7   43   43   43   42   43   43   13    0   46   47   54   41   41   41   55   41   72   71   71   71
       8   50   51   48   47   49   51   49   46    0   12   51   51   54   56   55   56   75   75   75   75
       9   47   49   37   37   52   54   54   47   12    0   49   53   54   56   55   54   74   69   67   68
      10   56   55   52   51   54   55   56   54   51   49    0   13   53   58   56   59   75   75   76   75
      11   48   47   41   40   41   41   41   41   51   53   13    0   51   52   55   59   75   75   75   75
      12   58   57   53   53   56   56   57   41   54   54   53   51    0   13   55   60   77   77   77   77
      13   49   49   42   41   42   42   42   41   56   56   58   52   13    0   55   54   77   77   77   77
      14   60   56   53   53   42   56   57   55   55   55   56   55   55   55    0   12   74   70   78   78
      15   50   51   42   42   41   42   42   41   56   54   59   59   60   54   12    0   75   74   74   77
      16   79   76   72   72   75   77   78   72   75   74   75   75   77   77   74   75    0   55   55   55
      17   79   76   72   72   75   77   78   71   75   69   75   75   77   77   70   74   55    0   55   55
      18   78   76   72   72   74   77   78   71   75   67   76   75   77   77   78   74   55   55    0   55
      19   79   76   72   72   75   77   78   71   75   68   75   75   77   77   78   77   55   55   55    0
  • by vladvasiliu on 9/19/22, 12:10 PM

    This is interesting. I'm getting much worse results on an i7-1165G7 than the ones published:

        Num cores: 8
        Using RDTSC to measure time: true
        Num round trips per samples: 5000
        Num samples: 300
        Showing latency=round-trip-time/2 in nanoseconds:
    
           0       1       2       3       4       5       6       7
      0
      1   70±1
      2   53±1    42±0
      3   73±5   134±5    80±1
      4   16±0    49±1    56±1    46±1
      5   63±4    28±1   128±5    67±1    66±1
      6   56±1    49±1    10±0    81±4   124±4    72±1
      7   57±1    57±1    45±1    10±0    63±4   130±5    87±1
    
        Min  latency: 10.1ns ±0.2 cores: (6,2)
        Max  latency: 134.1ns ±5.3 cores: (3,1)
        Mean latency: 64.7ns
  • by mey on 9/19/22, 3:28 AM

  • by sgtnoodle on 9/18/22, 8:44 PM

    I've been doing some latency measurements like this, but between two processes using unix domain sockets. I'm measuring more on the order of 50µs on average, when using FIFO RT scheduling. I suspect the kernel is either letting processes linger for a little bit, or perhaps the "idle" threads tend to call into the kernel and let it do some non-preemptable bookkeeping.

    If I crank up the amount of traffic going through the sockets, the average latency drops, presumably due to the processes being able to batch together multiple packets rather than having to block on each one.
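    A minimal version of that measurement (my own sketch; it uses a socketpair between two threads rather than two processes, which keeps it self-contained while exercising the same kernel wakeup path):

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <thread>

// Mean round-trip time in microseconds for a 1-byte ping over a unix
// domain socket pair, echoed back by a second thread.
double uds_rtt_us(int rounds) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1.0;
    std::thread echo([&] {
        char b;
        for (int i = 0; i < rounds; ++i) {
            if (read(fds[1], &b, 1) != 1) return;   // wait for ping
            if (write(fds[1], &b, 1) != 1) return;  // echo it back
        }
    });
    char b = 'x';
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        if (write(fds[0], &b, 1) != 1) break;
        if (read(fds[0], &b, 1) != 1) break;
    }
    auto t1 = std::chrono::steady_clock::now();
    echo.join();
    close(fds[0]);
    close(fds[1]);
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / rounds;
}
```

    Unlike the shared-memory spin loops measured by the benchmark, each hop here pays for a syscall plus a scheduler wakeup, which is why figures in the tens of microseconds rather than tens of nanoseconds are expected.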

  • by mayapugai on 9/20/22, 12:29 AM

    This is a fascinating insight into a subsystem which we take for granted and naively assume is homogeneous. Thank you so much for sharing.

    A request to the community - I am particularly interested in the Apple M1 Ultra. Apple made a pretty big fuss about the transparency of their die-to-die interconnect in the M1 Ultra. So, it would be very interesting to see what happens with it - both on Mac OS and (say, Asahi) Linux.

  • by scrubs on 9/20/22, 4:32 AM

    This benchmark reminds me of "ffwd: delegation is (much) faster than you think" https://www.seltzer.com/margo/teaching/CS508-generic/papers-....

    The paper describes a mechanism for client threads pinned to distinct cores to delegate a function call to a distinguished server thread pinned to its own core, all on the same socket.

    This has a multitude of applications, the most obvious being making a shared data structure MT-safe through delegation rather than saddling it with mutexes or other synchronization points, which is especially beneficial with small critical sections.

    The paper's abstract concludes claiming "100% [improvement] over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5–10× range."

    The code does delegation without CAS, locks, or atomics.

    The efficacy of such a scheme rests on two facets, which the paper explains:

    * Modern CPUs can move GBs/second between core L2/LLC caches

    * The synchronization between requesting clients and the responding server depends on each side spinning on a shared memory address looking for bit toggles. Briefly, the server only reads client request memory, which the client only writes. (Clients each have their own slot.) And on the response side, clients read the server's shared response memory, which only the server writes. This one-sided-read, one-sided-write arrangement is supposed to minimize the number of cache invalidations and MESI syncs.

    I spent some time testing the authors' code and went so far as writing my own version. I was never able to make it work with anywhere near the throughput claimed in the paper. There are also some funny "nop" assembler instructions within the code, which I gather are a cheap form of thread yielding.

    In fact, this relatively simple SPSC MT ring buffer, which has but a fraction of the code:

    https://rigtorp.se/ringbuffer/

    did far, far better.

    In my experiments the CPUs spun too quickly, so core-to-core bandwidth was squandered before the server could signal a response or the client could signal a request. I wonder if adding select atomic reads, as with the SPSC ring, might help.

  • by jesse__ on 9/18/22, 10:41 PM

    This is absolutely the coolest thing I've seen in a while.
  • by bullen on 9/19/22, 6:52 AM

    When would cores talk to each other the way this is measuring?

    Would two cores reading and writing to the same memory have this contention?

  • by fideloper on 9/18/22, 7:05 PM

    Because I’m ignorant: What are the practical takeaways from this?

    When is a cpu core sending a message to another core?

  • by zeristor on 9/18/22, 8:36 PM

    I realise these were run on AWS instances, but could this be run locally on Apple Silicon?

    Erm, I guess I should try.

  • by bee_rider on 9/19/22, 4:11 AM

    Does anyone know what is up with the 8275CL? It looks... almost periodic or something.
  • by stevefan1999 on 9/19/22, 2:09 AM

    What about OS scheduling overhead?
  • by throaway53dh on 9/19/22, 9:42 AM

    What's the diff b/w sockets and cores? Does a socket have separate Ln caches, and do cores share the cache?