by briggers on 1/14/21, 7:04 AM with 91 comments
by volta87 on 1/14/21, 12:57 PM
The article mentions that they explored a not-so-large hyper-parameter space (i.e. they trained multiple models with different parameters each).
It would be interesting to know how long the whole process takes on the M1 vs. the V100.
For the small models covered in the article, I'd guess that the V100 can train them all concurrently using MPS (Multi-Process Service: multiple processes can use the GPU concurrently).
In particular, it would be interesting to know whether the V100 trains all the models in the same time it takes to train one, whether the M1 does the same, or whether the M1 takes N times longer to train N models.
This could paint a completely different picture, particularly from a user's perspective. When I go for lunch, coffee, or home, I usually spawn jobs training a large number of models, such that when I get back, all these models are trained.
I only start training a small number of models in the later phases of development, when I have already explored a large part of the model space.
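A quick way to check this (a rough sketch; `train_one_model.py` is a hypothetical script that trains a single small model, and MPS is assumed to already be running on the NVIDIA side): launch N identical training processes and compare the wall-clock time against a single run.

```python
# Hypothetical sketch: time N concurrent training runs vs. one.
# Assumes train_one_model.py trains a single small model; with MPS enabled,
# the V100 can service all N processes concurrently.
import subprocess
import sys
import time

N = 8  # number of hyper-parameter configurations to train in parallel

def run_concurrent(n):
    start = time.time()
    procs = [subprocess.Popen([sys.executable, "train_one_model.py", f"--config={i}"])
             for i in range(n)]
    for p in procs:
        p.wait()
    return time.time() - start

t_one = run_concurrent(1)
t_all = run_concurrent(N)
print(f"1 model:  {t_one:.1f}s")
print(f"{N} models: {t_all:.1f}s  (ratio {t_all / t_one:.2f}x)")
```

If the ratio stays close to 1x, the GPU absorbs the extra models essentially for free; if it approaches Nx, a single model already saturates the chip.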
---
To make an analogy, what this article is doing is similar to benchmarking a 64-core CPU against a 1-core CPU using a single-threaded benchmark. The 64-core CPU happens to be slightly beefier and faster than the 1-core CPU, but it is more expensive and consumes more power because... it has 64x more cores. So to put things in perspective, it would make sense to also show a benchmark that can use all 64 cores, which is the reason somebody would buy a 64-core CPU, and see how the single-core one compares (typically 64x slower).
---
To me, the only news here is that Apple's GPU cores are not very far behind NVIDIA's for ML training, but there is much more to a GPGPU than the performance you get for small models on a small number of cores. Apple would still need to (1) catch up, and (2) massively scale up their design. They probably can do both if they set their eyes on it. Exciting times.
by mark_l_watson on 1/14/21, 1:03 PM
I found setting up Apple’s M1 fork of TensorFlow to be fairly easy, BTW.
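For reference, the setup check was roughly this (a minimal sketch; the `mlcompute` module is specific to Apple's tensorflow_macos fork and may change in later releases):

```python
# Verify Apple's tensorflow_macos fork is installed and pin ML Compute to the
# M1 GPU (this import only exists in the fork, not in stock TensorFlow).
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

mlcompute.set_mlc_device(device_name="gpu")  # "cpu", "gpu", or "any"
print(tf.__version__)

# Tiny smoke test: one dense layer on random data.
x = tf.random.normal((1024, 32))
y = tf.random.uniform((1024, 1))
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=1, batch_size=128)
```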
I am writing a new book on using Swift for AI applications, motivated by the “niceness” of the Swift language and Apple’s CoreML libraries.
by lopuhin on 1/14/21, 9:27 AM
(and that's on CIFAR-10). But why not report these results and also test on more realistic datasets? The internet is full of M1 TF benchmarks on CIFAR or MNIST; has anyone seen anything different?
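For what it's worth, a heavier benchmark along those lines is easy to sketch, e.g. ResNet50 on 224x224 inputs with synthetic data so there is nothing to download (the numbers only mean something relative to the same script run on another machine):

```python
# Rough sketch of a more realistic benchmark: ResNet50 on ImageNet-sized
# inputs instead of a tiny CNN on CIFAR-10/MNIST. Uses synthetic data.
import time
import tensorflow as tf

batch_size = 32
steps = 10
x = tf.random.normal((batch_size * steps, 224, 224, 3))
y = tf.random.uniform((batch_size * steps,), maxval=1000, dtype=tf.int32)

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)  # warm-up
start = time.time()
model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
print(f"{(time.time() - start) / steps:.3f} s per batch")
```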
by tbalsam on 1/14/21, 4:58 PM
Maybe there's something genuinely interesting there, but the overhyped title takes away any credence I'd give the publishers for the research. If you find something interesting, say it, and stop making vapid generalizations for the sake of more clicks.
Remember, we only feed the AI hype bubble when we do this. The results might be good, but we need to be at least realistic about them, or there won't be an economy of innovation for people to listen to in the future, because they'll have tuned it out with all of the crap marketing that came before it.
Thanks for coming to my TED Talk!
by baxter001 on 1/14/21, 9:02 AM
by whywhywhywhy on 1/14/21, 10:24 AM
Wasn't this whole "M1 memory" thing shown to be a myth now that more technical people have dissected it?
by jlouis on 1/14/21, 11:51 AM
by procrastinatus on 1/14/21, 3:04 PM
Has anyone spotted any work allowing a mainstream tensor library (e.g. jax, tf, pytorch) to run on the neural engine?
by sradman on 1/14/21, 11:21 AM
We don’t have apples-to-apples benchmarks like SPECint/SPECfp for the SoC accelerators in the M1 (GPU, NPU, etc.), so these early attempts are both facile and critical as we try to categorize and compare the trade-offs between the SoC/discrete and performance/perf-per-watt options available.
Power-efficient SoCs for desktops are new, and we are learning as we go.
by 0x008 on 1/14/21, 9:01 AM
by helsinkiandrew on 1/14/21, 9:32 AM
by StavrosK on 1/14/21, 6:47 PM
by fxtentacle on 1/14/21, 9:23 AM
laughs
(for comparison, GPT3: 175,000,000,000 parameters)
Can Apple's M1 help you train tiny toy examples with no real-world relevance? You bet it can!
Plus it looks like they are comparing Apples to Oranges ;) This seems to be 16-bit precision on the M1 and 32-bit on the V100. So the M1-trained model will most likely yield worse or unusable results, due to the lack of precision.
And lastly, they are plainly testing against the wrong target. The V100 is great, but it is far from NVIDIA's flagship for training small low-precision models. At the FP16 that the M1 is using, the correct target would have been an RTX 3090 or the like, which has 35 TFLOPS. The V100 only gets 14 TFLOPS because it lacks the dedicated TensorRT accelerator hardware.
So they compare the M1 against an NVIDIA model from 2017 that lacks the relevant hardware acceleration and, thus, is a whopping 60% slower than what people actually use for such training workloads.
I'm sure my bicycle will also compare very favorably against a car that is lacking two wheels :p
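On the FP16-vs-FP32 point: a fairer V100 run would at least enable mixed precision, which in TensorFlow 2.4+ is a one-line policy change (a minimal sketch of that setting, not the article's actual setup):

```python
# Enable mixed precision (FP16 compute, FP32 variables) so a V100 baseline
# isn't handicapped by running pure FP32 against the M1's 16-bit training.
import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```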
by SloopJon on 1/14/21, 3:36 PM
by tpoacher on 1/14/21, 10:24 AM
by JohnHaugeland on 1/14/21, 2:32 PM