by briggers on 1/14/21, 7:04 AM with 91 comments
by volta87 on 1/14/21, 12:57 PM
The article mentions that they explored a not-so-large hyper-parameter space (i.e. they trained multiple models with different parameters each).
It would be interesting to know how long the whole process takes on the M1 vs. the V100.
For the small models covered in the article, I'd guess that the V100 can train them all concurrently using MPS (Multi-Process Service: multiple processes can use the GPU concurrently).
In particular, it would be interesting to know whether the V100 trains all the models in the same time it takes to train one, whether the M1 does the same, or whether the M1 takes N times longer to train N models.
This could paint a completely different picture, particularly from a user's perspective. When I go for lunch, coffee, or home, I usually spawn jobs training a large number of models, such that when I get back, all these models are trained.
I only start training a small number of models in the later phases of development, when I have already explored a large part of the model space.
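A quick way to check this (a rough sketch; `train_one_model.py` is a hypothetical script that trains a single small model, and MPS is assumed to already be running on the NVIDIA side): launch N identical training processes and compare the wall-clock time against a single run.

```python
# Hypothetical sketch: time N concurrent training runs vs. one.
# Assumes train_one_model.py trains a single small model; with MPS enabled,
# the V100 can service all N processes concurrently.
import subprocess
import sys
import time

N = 8  # number of hyper-parameter configurations to train in parallel

def run_concurrent(n):
    start = time.time()
    procs = [subprocess.Popen([sys.executable, "train_one_model.py", f"--config={i}"])
             for i in range(n)]
    for p in procs:
        p.wait()
    return time.time() - start

t_one = run_concurrent(1)
t_all = run_concurrent(N)
print(f"1 model:  {t_one:.1f}s")
print(f"{N} models: {t_all:.1f}s  (ratio {t_all / t_one:.2f}x)")
```

If the ratio stays close to 1x, the GPU absorbs the extra models essentially for free; if it approaches Nx, a single model already saturates the chip.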
---
To make an analogy, what this article is doing is similar to benchmarking a 64-core CPU against a 1-core CPU using a single-threaded benchmark. The 64-core CPU happens to be slightly beefier and faster than the 1-core CPU, but it is more expensive and consumes more power because... it has 64x more cores. So to put things in perspective, it would make sense to also show a benchmark that can use all 64 cores, which is the reason somebody would buy a 64-core CPU, and see how the single-core one compares (typically 64x slower).
---
To me, the only news here is that Apple's GPU cores are not very far behind NVIDIA's for ML training, but there is much more to a GPGPU than the performance you get for small models on a small number of cores. Apple would still need to (1) catch up, and (2) massively scale up their design. They probably can do both if they set their eyes on it. Exciting times.
by mark_l_watson on 1/14/21, 1:03 PM
I found setting up Apple’s M1 fork of TensorFlow to be fairly easy, BTW.
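For reference, the setup check was roughly this (a minimal sketch; the `mlcompute` module is specific to Apple's tensorflow_macos fork and may change in later releases):

```python
# Verify Apple's tensorflow_macos fork is installed and pin ML Compute to the
# M1 GPU (this import only exists in the fork, not in stock TensorFlow).
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

mlcompute.set_mlc_device(device_name="gpu")  # "cpu", "gpu", or "any"
print(tf.__version__)

# Tiny smoke test: one dense layer on random data.
x = tf.random.normal((1024, 32))
y = tf.random.uniform((1024, 1))
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=1, batch_size=128)
```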
I am writing a new book on using Swift for AI applications, motivated by the “niceness” of the Swift language and Apple’s CoreML libraries.
by lopuhin on 1/14/21, 9:27 AM
(and that's on CIFAR-10). But why not report these results and also test on more realistic datasets? The internet is full of M1 TF benchmarks on CIFAR or MNIST; has anyone seen anything different?
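For what it's worth, a heavier benchmark along those lines is easy to sketch, e.g. ResNet50 on 224x224 inputs with synthetic data so there is nothing to download (the numbers only mean something relative to the same script run on another machine):

```python
# Rough sketch of a more realistic benchmark: ResNet50 on ImageNet-sized
# inputs instead of a tiny CNN on CIFAR-10/MNIST. Uses synthetic data.
import time
import tensorflow as tf

batch_size = 32
steps = 10
x = tf.random.normal((batch_size * steps, 224, 224, 3))
y = tf.random.uniform((batch_size * steps,), maxval=1000, dtype=tf.int32)

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)  # warm-up
start = time.time()
model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
print(f"{(time.time() - start) / steps:.3f} s per batch")
```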
by tbalsam on 1/14/21, 4:58 PM
Maybe there's something genuinely interesting there, but the overhyped title takes away any credence I'd give the publishers for the research. If you find something interesting, say it, and stop making vapid generalizations for the sake of more clicks.
Remember, we only feed the AI hype bubble when we do this. The results might be good, but we need to be at least realistic about them, or there won't be an economy of innovation for people to listen to in the future, because they'll have tuned it out with all of the crap marketing that came before it.
Thanks for coming to my TED Talk!
by baxter001 on 1/14/21, 9:02 AM
by whywhywhywhy on 1/14/21, 10:24 AM
Wasn't this whole "M1 memory" thing shown to be a myth now that more technical people have dissected it?
by jlouis on 1/14/21, 11:51 AM
by procrastinatus on 1/14/21, 3:04 PM
Has anyone spotted any work allowing a mainstream tensor library (e.g. jax, tf, pytorch) to run on the neural engine?
by sradman on 1/14/21, 11:21 AM
We don’t have apples-to-apples benchmarks like SPECint/SPECfp for the SoC accelerators in the M1 (GPU, NPU, etc.), so these early attempts are both facile and critical as we try to categorize and compare the trade-offs between the SoC/discrete and performance/perf-per-watt options available.
Power-efficient SoCs for desktops are new, and we are learning as we go.
by 0x008 on 1/14/21, 9:01 AM
by helsinkiandrew on 1/14/21, 9:32 AM
by StavrosK on 1/14/21, 6:47 PM
by fxtentacle on 1/14/21, 9:23 AM
laughs
(for comparison, GPT3: 175,000,000,000 parameters)
Can Apple's M1 help you train tiny toy examples with no real-world relevance? You bet it can!
Plus it looks like they are comparing Apples to Oranges ;) This seems to be 16-bit precision on the M1 and 32-bit on the V100. So the M1-trained model will most likely yield worse or unusable results, due to the lack of precision.
And lastly, they are plainly testing against the wrong target. The V100 is great, but it is far from NVIDIA's flagship for training small low-precision models. At the FP16 that the M1 is using, the correct target would have been an RTX 3090 or the like, which has 35 TFLOPS. The V100 only gets 14 TFLOPS because it lacks the dedicated TensorRT accelerator hardware.
So they compare the M1 against an NVIDIA model from 2017 that lacks the relevant hardware acceleration and, thus, is a whopping 60% slower than what people actually use for such training workloads.
I'm sure my bicycle will also compare very favorably against a car that is lacking two wheels :p
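On the FP16-vs-FP32 point: a fairer V100 run would at least enable mixed precision, which in TensorFlow 2.4+ is a one-line policy change (a minimal sketch of that setting, not the article's actual setup):

```python
# Enable mixed precision (FP16 compute, FP32 variables) so a V100 baseline
# isn't handicapped by running pure FP32 against the M1's 16-bit training.
import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```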
by SloopJon on 1/14/21, 3:36 PM
by tpoacher on 1/14/21, 10:24 AM
by JohnHaugeland on 1/14/21, 2:32 PM