by goerz on 10/3/23, 4:02 AM with 74 comments
by morcus on 10/3/23, 2:35 PM
The PI started asking me to run some analyses on a raw dataset. Since I was so new at it, I often messed up and had to rerun the whole thing after looking at the output; this was painful because the entire script took a few hours to run.
I started poking around to see whether it could be optimized at all. The raw data was divided into hundreds of files from different runs, sensors, etc., which were each processed independently in sequence, and the results were all combined into one big array for the final result. Seems reasonable enough.
Except this code was all written by scientists, and the combination was done in the "naive" way: after each data file was processed, a new array was allocated, the previous results were copied into it, and the results from the current file were appended. This meant that for the iterations at the end, we needed roughly Memory = 2 * size of the final data, which eventually exceeded the amount of physical memory on the machine (and because there were so many data files, it repeated this allocate-and-copy dozens of times after it had used all the RAM).
I updated this to pre-allocate the required size at the beginning for a very very easy 3-4 fold improvement in the overall runtime and felt rather proud of myself.
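The pattern described above can be sketched in NumPy (the file count, chunk sizes, and `process_file` stand-in are made up for illustration, since the original code isn't shown):

```python
import numpy as np

def process_file(i, chunk_len=100):
    # Stand-in for the real per-file analysis: returns one chunk of results.
    rng = np.random.default_rng(i)
    return rng.random(chunk_len)

def combine_naive(n_files, chunk_len=100):
    # The "naive" way: every iteration allocates a new, larger array and
    # copies all previous results into it, so total copying is O(n^2) and
    # peak memory approaches twice the final array size near the end.
    results = np.empty(0)
    for i in range(n_files):
        chunk = process_file(i, chunk_len)
        new = np.empty(results.size + chunk.size)
        new[:results.size] = results
        new[results.size:] = chunk
        results = new
    return results

def combine_preallocated(n_files, chunk_len=100):
    # The fix: allocate the final size once up front; each file's results
    # are written directly into their slice and never copied again.
    results = np.empty(n_files * chunk_len)
    for i in range(n_files):
        results[i * chunk_len:(i + 1) * chunk_len] = process_file(i, chunk_len)
    return results
```

Both versions produce identical output; the difference is purely in allocation and copying behavior, which is why the fix was so cheap relative to the speedup.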
by havercosine on 10/3/23, 5:55 AM
by a1o on 10/3/23, 1:50 PM
by SinePost on 10/3/23, 5:35 AM
by dang on 10/3/23, 5:16 AM
What scientists must know about hardware to write fast code (2020) - https://news.ycombinator.com/item?id=29601342 - Dec 2021 (29 comments)
by OnlyMortal on 10/3/23, 5:22 PM
For those folks, getting the output they need is much more important than the CPU cycles - as it should be.
As a C++ programmer, I asked why they don't hire coders to do this for them. The answer was cost, which rather surprised me given the cost of the LHC.
by prhcbsc on 10/3/23, 8:50 AM
But for large problems the article falls short. Scientific applications may need to use several computers at a time. COMP Superscalar (COMPSs) is a task-based programming model that aims to ease the development of applications for distributed infrastructures. COMPSs programmers don't need to deal with the typical duties of parallelization and distribution, such as thread creation and synchronization, data distribution, messaging, or fault tolerance. Instead, the model is based on sequential programming, which makes it appealing to users who either lack parallel programming expertise or are looking for better programmability. Other popular frameworks, such as LEGION, offer a lower-level interface.
by Delk on 10/4/23, 12:13 PM
A minor detail I find a bit confusing, though, is explaining the potential benefits of SMT/hyperthreading with an example where threads are spending some of their time idle (or sleeping).
I don't know Julia, so I can't say whether sleep is implemented with busy-waiting or something similar there, but generally when a thread is put to sleep, it is blocked from running until the timer expires or the sleep is interrupted. The operating system doesn't schedule the blocked thread onto the CPU in the first place, so a thread that's sleeping is not sharing a CPU core with another thread that's being executed.
So the example does not finish 8 jobs almost as fast as 4 or 1 jobs using 4 cores due to SMT; it's rather that half of the time each of the threads is not even being scheduled for running. A total of eight concurrent jobs/threads works out to approximately four of them being eligible to run at a time, matching the four physical cores available.
If there are only four concurrent jobs/threads, each sleeping half of the time, you end up not utilizing the four cores fully because on average two of the cores will be idle with no thread scheduled.
AFAIK SMT should only really be beneficial in cases of stalls due to CPU internal reasons such as cache misses or branch mispredictions, not in cases of threads being blocked for I/O (or sleeping).
The post is of course correct in that the example computation benefits from a higher number of concurrent jobs because of each thread being blocked half of the time. However, that's unrelated to SMT.
Considering how meticulous and detailed the post generally is, I think it would make sense to more clearly separate SMT from the benefits of multithreading in case of partially I/O-bound work.
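The parent comment's point, that blocked threads finish concurrently without needing SMT or extra cores, can be demonstrated with a small Python sketch (the job count and sleep duration are arbitrary, chosen only for illustration):

```python
import threading
import time

def io_bound_job(duration):
    # A sleeping thread is blocked, not scheduled: it occupies no core
    # while it waits, just like a thread blocked on I/O.
    time.sleep(duration)

def run_jobs(n_jobs, duration):
    # Start n_jobs threads that each sleep for `duration` seconds,
    # and measure the wall-clock time until all have finished.
    threads = [threading.Thread(target=io_bound_job, args=(duration,))
               for _ in range(n_jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# Eight sleeping jobs finish in roughly one sleep duration, not eight,
# regardless of core count or SMT: the OS simply never schedules a
# blocked thread, so the waits overlap for free.
elapsed = run_jobs(8, 0.2)
```

This even works under Python's GIL, since `time.sleep` releases it, which underlines that the speedup comes from blocking/scheduling behavior rather than from any hardware-level simultaneous multithreading.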
by eande on 10/3/23, 11:22 AM
by jschveibinz on 10/3/23, 5:08 AM