by nviennot on 9/18/22, 5:15 PM with 90 comments
by apaolillo on 9/18/22, 10:19 PM
by wyldfire on 9/18/22, 7:14 PM
It looks kinda like the color scales are normalized to just-this-CPU's latency? It would be neater if the scale represented the same values across CPUs. Or rather, it would be neat if there were an additional view of this data that made it easier to compare among them.
I think the differences are really interesting to consider. What if the scheduler could consider these designs when weighing how to schedule each task, either statically or somehow empirically? I think I've seen sysfs info that describes the cache hierarchies, so maybe some of this info is available already. That nest [1] scheduler recently shared on HN may be taking advantage of some of these properties.
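For what it's worth, the cache hierarchy info is indeed exposed on Linux under /sys/devices/system/cpu/cpu*/cache/. Here is a minimal sketch (nothing to do with nest itself; it just assumes the standard Linux sysfs layout) that prints each cache level for CPU 0 and which CPUs share it:

    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <string>

    // Read the first line of a sysfs file (empty string if unreadable).
    static std::string read_line(const std::filesystem::path& p) {
        std::ifstream f(p);
        std::string s;
        std::getline(f, s);
        return s;
    }

    int main() {
        namespace fs = std::filesystem;
        // Standard Linux sysfs layout; cpu0 is just an example.
        const fs::path base = "/sys/devices/system/cpu/cpu0/cache";
        for (const auto& entry : fs::directory_iterator(base)) {
            const std::string name = entry.path().filename().string();
            if (name.rfind("index", 0) != 0) continue;  // skip non-cache entries
            std::cout << "L" << read_line(entry.path() / "level") << ' '
                      << read_line(entry.path() / "type")
                      << " (" << read_line(entry.path() / "size") << ")"
                      << " shared with CPUs: "
                      << read_line(entry.path() / "shared_cpu_list") << '\n';
        }
    }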
by rigtorp on 9/18/22, 8:14 PM
by hot_gril on 9/19/22, 4:03 AM
* which in hindsight sounds like TurboBoost was enabled, but I vaguely remember it being disabled in tests
by jtorsella on 9/18/22, 10:40 PM
Min: 48.3ns Max: 175.0ns Mean: 133.0ns
I’ll try to copy the exact results once I have a browser on Asahi, but the general pattern is most pairs have >150ns and a few (0-1; 2-3,4,5; 3-4,5; 4-5; 6-7,8,9; 7-8,9; 8-9) are faster at about 50ns.
Edit: The results from c2clat (a little slower, but the format is nicer) are below, in nanoseconds.
CPU 0 1 2 3 4 5 6 7 8 9
0 0 59 231 205 206 206 208 219 210 210
1 59 0 205 215 207 207 209 209 210 210
2 231 205 0 40 42 43 180 222 224 213
3 205 215 40 0 43 43 212 222 213 213
4 206 207 42 43 0 44 182 227 217 217
5 206 207 43 43 44 0 215 215 217 217
6 208 209 180 212 182 215 0 40 43 45
7 219 209 222 222 227 215 40 0 43 43
8 210 210 224 213 217 217 43 43 0 44
9 210 210 213 213 217 217 45 43 44 0
by snvzz on 9/19/22, 3:04 AM
Maybe consider including an MIT license file in the repository.
Legally, that's a bit more sane than having a line in the readme.
In practice, GitHub will recognize your license file and show the license in the indexes and in the right column of your repository's main page.
by dan-robertson on 9/18/22, 8:39 PM
I find it pretty interesting where the interface that CPU makers present (e.g. a bunch of equal cores) breaks down.
by ozcanay on 9/19/22, 10:25 AM
To measure core-to-core latency, we should also understand how cache coherence works on Intel. I am currently experimenting with microbenchmarks on the Skylake microarchitecture. Due to scalability issues with the ring interconnect used on CPU dies in previous models, Intel opted for a 2D mesh interconnect in recent years. In this microarchitecture, the CPU die is split into tiles, each accommodating a core, caches, a CHA (caching and home agent), a snoop filter, etc.
I want to emphasize the role of the CHA here. Each CHA is responsible for managing coherence for a portion of the address space. If a core tries to fetch a variable that is not in its L1D or L2 cache, the CHA managing coherence for that variable's address will be queried to learn the variable's whereabouts. If the data is on the die, the core currently owning the variable will be told to forward it to the requesting core.
So, even if the cores that communicate with each other are physically adjacent, the location of the CHA that manages coherence for the variable they pass back and forth also matters, due to the cache coherence mechanism.
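For context, here is a minimal sketch of the kind of ping-pong microbenchmark these latency numbers come from (not the posted tool's exact method; this just shows the general idea):

    // Two threads bounce a single cache line back and forth; the round-trip
    // time divided by two approximates the one-way core-to-core latency shown
    // in these tables. Thread pinning is left out for brevity; a real
    // measurement would pin each thread to a fixed core (e.g. with
    // pthread_setaffinity_np on Linux) and repeat for every core pair.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main() {
        constexpr int kRoundTrips = 1000000;
        std::atomic<int> turn{0};  // the cache line that bounces between cores

        std::thread pong([&] {
            for (int i = 0; i < kRoundTrips; ++i) {
                while (turn.load(std::memory_order_acquire) != 1) {}  // wait for ping
                turn.store(0, std::memory_order_release);             // reply
            }
        });

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < kRoundTrips; ++i) {
            turn.store(1, std::memory_order_release);                 // ping
            while (turn.load(std::memory_order_acquire) != 0) {}      // wait for reply
        }
        auto stop = std::chrono::steady_clock::now();
        pong.join();

        double ns = std::chrono::duration<double, std::nano>(stop - start).count();
        std::printf("~%.1f ns per one-way transfer\n", ns / kRoundTrips / 2.0);
    }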
Related links:
by moep0 on 9/19/22, 1:08 AM
by bhedgeoser on 9/19/22, 2:41 AM
0 1
0
1 26±0
2 26±0 17±0
3 27±0 17±0
4 32±0 17±0
5 29±0 19±0
6 32±0 18±0
7 31±0 17±0
8 138±1 81±0
9 138±1 83±0
10 139±1 80±0
11 136±1 84±0
12 134±1 83±0
13 137±1 80±0
14 136±1 84±0
15 139±1 84±0
16 16±0 16±0
17 28±0 8±0
18 33±0 17±0
19 29±0 16±0
20 28±0 17±0
21 29±0 19±0
22 32±0 18±0
23 31±0 17±0
24 137±1 81±0
25 140±1 79±0
26 143±1 80±0
27 138±1 82±0
28 139±1 82±0
29 139±1 81±0
30 142±1 82±0
31 142±1 84±0
by bullen on 9/19/22, 10:40 AM
CPU 0 1 2 3
0 0 77 77 77
1 77 0 77 77
2 77 77 0 77
3 77 77 77 0
And Raspberry Pi 2:
CPU 0 1 2 3
0 0 71 71 71
1 71 0 71 71
2 71 71 0 71
3 71 71 71 0
by jeffbee on 9/18/22, 9:56 PM
CPU 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 12 60 44 60 44 60 43 50 47 56 48 58 49 60 50 79 79 78 79
1 12 0 45 45 44 44 60 43 51 49 55 47 57 49 56 51 76 76 76 76
2 60 45 0 13 42 43 53 43 48 37 52 41 53 42 53 42 72 72 72 72
3 44 45 13 0 42 43 53 42 47 37 51 40 53 41 53 42 72 72 72 72
4 60 44 42 42 0 13 56 43 49 52 54 41 56 42 42 41 75 75 74 75
5 44 44 43 43 13 0 56 43 51 54 55 41 56 42 56 42 77 77 77 77
6 60 60 53 53 56 56 0 13 49 54 56 41 57 42 57 42 78 78 78 78
7 43 43 43 42 43 43 13 0 46 47 54 41 41 41 55 41 72 71 71 71
8 50 51 48 47 49 51 49 46 0 12 51 51 54 56 55 56 75 75 75 75
9 47 49 37 37 52 54 54 47 12 0 49 53 54 56 55 54 74 69 67 68
10 56 55 52 51 54 55 56 54 51 49 0 13 53 58 56 59 75 75 76 75
11 48 47 41 40 41 41 41 41 51 53 13 0 51 52 55 59 75 75 75 75
12 58 57 53 53 56 56 57 41 54 54 53 51 0 13 55 60 77 77 77 77
13 49 49 42 41 42 42 42 41 56 56 58 52 13 0 55 54 77 77 77 77
14 60 56 53 53 42 56 57 55 55 55 56 55 55 55 0 12 74 70 78 78
15 50 51 42 42 41 42 42 41 56 54 59 59 60 54 12 0 75 74 74 77
16 79 76 72 72 75 77 78 72 75 74 75 75 77 77 74 75 0 55 55 55
17 79 76 72 72 75 77 78 71 75 69 75 75 77 77 70 74 55 0 55 55
18 78 76 72 72 74 77 78 71 75 67 76 75 77 77 78 74 55 55 0 55
19 79 76 72 72 75 77 78 71 75 68 75 75 77 77 78 77 55 55 55 0
by vladvasiliu on 9/19/22, 12:10 PM
Num cores: 8
Using RDTSC to measure time: true
Num round trips per samples: 5000
Num samples: 300
Showing latency=round-trip-time/2 in nanoseconds:
0 1 2 3 4 5 6 7
0
1 70±1
2 53±1 42±0
3 73±5 134±5 80±1
4 16±0 49±1 56±1 46±1
5 63±4 28±1 128±5 67±1 66±1
6 56±1 49±1 10±0 81±4 124±4 72±1
7 57±1 57±1 45±1 10±0 63±4 130±5 87±1
Min latency: 10.1ns ±0.2 cores: (6,2)
Max latency: 134.1ns ±5.3 cores: (3,1)
Mean latency: 64.7ns
by mey on 9/19/22, 3:28 AM
https://gist.github.com/smarkwell/d72deee656341d53dff469df2b...
by sgtnoodle on 9/18/22, 8:44 PM
If I crank up the amount of traffic going through the sockets, the average latency drops, presumably due to the processes being able to batch together multiple packets rather than having to block on each one.
by mayapugai on 9/20/22, 12:29 AM
A request to the community - I am particularly interested in the Apple M1 Ultra. Apple made a pretty big fuss about the transparency of their die-to-die interconnect in the M1 Ultra. So, it would be very interesting to see what happens with it - both on Mac OS and (say, Asahi) Linux.
by scrubs on 9/20/22, 4:32 AM
This paper describes a mechanism for client threads, each pinned to a distinct core, to delegate a function call to a distinguished server thread pinned to its own core, all on the same socket.
This has a multitude of applications, the most obvious being making a shared data structure MT-safe through delegation rather than saddling it with mutexes or other synchronization points, which is especially beneficial with small critical sections.
The paper's abstract concludes claiming "100% [improvement] over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5–10× range."
The code does delegation without CAS, locks, or atomics.
The efficacy of such a scheme rests on two facets, which the paper explains:
* Modern CPUs can move GBs/second between core L2/LLC caches
* The synchronization between requesting clients and the responding server depends on each side spinning on a shared memory address looking for bit toggles. Briefly, the server only reads client request memory, which the client only writes (clients each have their own slot). On the response side, clients read the server's shared response memory, which only the server writes. This one-side-read, one-side-write arrangement is supposed to minimize cache invalidations and MESI synchronization.
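A rough sketch of that slot arrangement (not the paper's code: the delegated operation here is just a made-up server-owned counter, and where the paper claims to avoid atomics entirely, portable C++ needs acquire/release atomics, which compile to plain loads and stores on x86):

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // One padded slot per client: req is written only by that client,
    // resp only by the server, so each word has a single writer.
    struct alignas(64) Slot {
        std::atomic<uint64_t> req{0};   // request sequence number (client-written)
        std::atomic<uint64_t> resp{0};  // last sequence served (server-written)
    };

    int main() {
        constexpr int kClients = 3;
        constexpr uint64_t kCallsPerClient = 100000;
        std::vector<Slot> slots(kClients);
        std::atomic<bool> stop{false};

        // Server: the only thread that ever touches `counter`, the stand-in
        // for a shared data structure made MT-safe by delegation.
        std::thread server([&] {
            uint64_t counter = 0;
            std::vector<uint64_t> last(kClients, 0);
            while (!stop.load(std::memory_order_relaxed)) {
                for (int i = 0; i < kClients; ++i) {
                    uint64_t r = slots[i].req.load(std::memory_order_acquire);
                    if (r != last[i]) {             // new request from client i
                        last[i] = r;
                        ++counter;                  // the delegated critical section
                        slots[i].resp.store(r, std::memory_order_release);  // ack
                    }
                }
            }
            std::printf("counter = %llu (expected %llu)\n",
                        (unsigned long long)counter,
                        (unsigned long long)(kClients * kCallsPerClient));
        });

        // Clients: publish a request, then spin on their own resp word.
        std::vector<std::thread> clients;
        for (int i = 0; i < kClients; ++i) {
            clients.emplace_back([&, i] {
                for (uint64_t seq = 1; seq <= kCallsPerClient; ++seq) {
                    slots[i].req.store(seq, std::memory_order_release);
                    while (slots[i].resp.load(std::memory_order_acquire) != seq) {}
                }
            });
        }
        for (auto& t : clients) t.join();
        stop.store(true, std::memory_order_relaxed);
        server.join();
    }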
I spent some time testing the author's code and went so far as to write my own version. I was never able to make it work with anywhere near the throughput claimed in the paper. There are also some funny "nop" assembler instructions within the code that I gather are a cheap form of thread yielding.
In fact, this relatively simple SPSC MT ring buffer, which has but a fraction of the code:
https://rigtorp.se/ringbuffer/
did far, far better.
In my experiments the CPUs spun too quickly, so core-to-core bandwidth was quickly squandered before the server could signal a response or the client could signal a request. I wonder if adding select atomic reads, as with the SPSC ring, might help.
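For contrast, a minimal sketch of the general SPSC idea from that link (not the linked implementation; capacity, element type, and the modulo indexing are arbitrary simplifications). Each index has exactly one writer, and each side keeps a cached copy of the other side's index so the shared cache line is only re-read when the ring looks full or empty:

    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <thread>

    template <typename T, size_t N>
    class SpscRing {
        T buf_[N];
        alignas(64) std::atomic<size_t> head_{0};  // written by the producer only
        alignas(64) std::atomic<size_t> tail_{0};  // written by the consumer only
        alignas(64) size_t cached_tail_ = 0;       // producer's snapshot of tail_
        alignas(64) size_t cached_head_ = 0;       // consumer's snapshot of head_

    public:
        bool push(const T& v) {
            size_t h = head_.load(std::memory_order_relaxed);
            size_t next = (h + 1) % N;
            if (next == cached_tail_) {                               // looks full:
                cached_tail_ = tail_.load(std::memory_order_acquire); // refresh once
                if (next == cached_tail_) return false;               // really full
            }
            buf_[h] = v;
            head_.store(next, std::memory_order_release);
            return true;
        }
        bool pop(T& v) {
            size_t t = tail_.load(std::memory_order_relaxed);
            if (t == cached_head_) {                                  // looks empty:
                cached_head_ = head_.load(std::memory_order_acquire); // refresh once
                if (t == cached_head_) return false;                  // really empty
            }
            v = buf_[t];
            tail_.store((t + 1) % N, std::memory_order_release);
            return true;
        }
    };

    int main() {
        constexpr int kItems = 1000000;
        SpscRing<int, 1024> ring;
        std::thread consumer([&] {
            long long sum = 0;
            int v;
            for (int received = 0; received < kItems;) {
                if (ring.pop(v)) { sum += v; ++received; }
            }
            std::printf("sum = %lld\n", sum);
        });
        for (int i = 0; i < kItems;) {
            if (ring.push(i)) ++i;
        }
        consumer.join();
    }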
by jesse__ on 9/18/22, 10:41 PM
by bullen on 9/19/22, 6:52 AM
Would two cores reading and writing to the same memory have this contention?
by fideloper on 9/18/22, 7:05 PM
When does a CPU core send a message to another core?
by zeristor on 9/18/22, 8:36 PM
Erm, I guess I should try.
by bee_rider on 9/19/22, 4:11 AM
by stevefan1999 on 9/19/22, 2:09 AM
by throaway53dh on 9/19/22, 9:42 AM