by rkwasny on 6/25/24, 4:58 PM with 21 comments
by modeless on 6/25/24, 5:22 PM
I don't see a lot of detail on the actual architecture, so it's hard to evaluate whether it makes sense. The idea of specializing in transformers has merit, I think. It seems like this is only for inference, not training. Although inference accelerators are definitely important and potentially lucrative, I find it hard to get excited about them. To me the exciting thing is accelerating progress in capabilities, which means training.
The title of the blog post is "Etched is Making the Biggest Bet in AI", and it is a pretty big bet to make an ASIC only for transformers. Even something like 1.5-bit transformers could come along and make their hardware blocks obsolete in six months. Actually I would love to see an ASIC implementation of 1.5-bit transformers; it could probably be far more efficient than even this chip.
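For intuition on why a 1.5-bit (ternary, weights in {-1, 0, +1}) format would want very different hardware blocks, here is a minimal NumPy sketch, my own illustration rather than anything from the post: every "multiply" in the matrix-vector product collapses into an add, a subtract, or a skip, so no FP16 multipliers are needed at all.

```python
import numpy as np

def ternary_matvec(weights, x):
    """Matrix-vector product with weights constrained to {-1, 0, +1}.

    Each output reduces to (sum of inputs where w == +1) minus
    (sum of inputs where w == -1); no multipliers are involved,
    which is why dedicated ternary hardware could look nothing
    like an FP16 FMA array.
    """
    pos = (weights == 1)
    neg = (weights == -1)
    return (pos * x).sum(axis=1) - (neg * x).sum(axis=1)

# Tiny example with a random ternary weight matrix.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))        # values in {-1, 0, 1}
x = rng.standard_normal(8).astype(np.float32)

assert np.allclose(ternary_matvec(W, x), W @ x)
```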
by yarri on 6/25/24, 6:11 PM
## How can we fit so much more compute on the silicon?
The NVIDIA H200 has 989 TFLOPS of FP16/BF16 compute without sparsity. This is state-of-the-art (more than even Google’s new Trillium chip), and the GB200 launching in 2025 has only 25% more compute (1,250 TFLOPS per die).
Since the vast majority of a GPU’s area is devoted to programmability, specializing in transformers lets you fit far more compute. You can prove this to yourself from first principles:
It takes 10,000 transistors to build a single FP16/BF16/FP8 multiply-add circuit, the building block for all matrix math. The H100 SXM has 528 tensor cores, and each has $4 \times 8 \times 16$ FMA circuits. Multiplying tells us the H100 has 2.7 billion transistors dedicated to tensor cores.
*But an H100 has 80 billion transistors! This means only 3.3% of the transistors on an H100 GPU are used for matrix multiplication!*
This is a deliberate design decision by NVIDIA and the makers of other flexible AI chips. If you want to support all kinds of models (CNNs, LSTMs, SSMs, and others), you can’t do much better than this.
By only running transformers, we can fit way more FLOPS on our chip, without resorting to lower precisions or sparsity.
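As a quick sanity check of the transistor arithmetic quoted above (note that the 10,000-transistors-per-FMA figure is the post's own estimate):

```python
tensor_cores = 528                  # H100 SXM tensor cores, per the excerpt
fma_per_core = 4 * 8 * 16           # FMA circuits per tensor core, per the excerpt
transistors_per_fma = 10_000        # the post's estimate for one FP16/BF16/FP8 FMA
total_transistors = 80e9            # H100 transistor count

fma_transistors = tensor_cores * fma_per_core * transistors_per_fma
print(fma_transistors / 1e9)                       # ~2.7 billion
print(100 * fma_transistors / total_transistors)   # ~3.4%, the ~3.3% quoted above
```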
## Isn’t memory bandwidth the bottleneck on inference?
For modern models like Llama-3, no!
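A rough roofline sketch of why large-batch decoding stops being bandwidth-bound, using the H200 compute figure quoted above and my own assumptions (FP16 weights read once per batch; KV-cache and attention traffic ignored):

```python
# Each decoded token costs roughly 2 FLOPs per parameter (multiply + add),
# while reading an FP16 weight costs 2 bytes. If a batch of B requests
# shares one pass over the weights, arithmetic intensity is about
# intensity(B) = 2 * B / 2 = B FLOPs per byte of weight traffic.

peak_flops = 989e12        # H200 dense FP16/BF16 FLOPS, from the excerpt above
mem_bandwidth = 4.8e12     # H200 HBM3e bandwidth in bytes/s

critical_intensity = peak_flops / mem_bandwidth
print(critical_intensity)  # ~206 FLOPs/byte: with intensity(B) ~= B, batches of
                           # roughly this size or larger are compute-bound, so
                           # FLOPs rather than memory bandwidth become the limit
```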
by gwern on 6/25/24, 5:50 PM
by airstrike on 6/25/24, 5:38 PM
https://www.reuters.com/technology/artificial-intelligence/a...
The CEO was also on Bloomberg Technology today talking about their strategy a bit. There's an article but I didn't find a video of the interview after quick googling:
https://www.bloomberg.com/news/articles/2024-06-25/ai-chip-s...
by mikewarot on 6/25/24, 6:27 PM
I've had similar thoughts with my toy BitGrid model... except I'd actually take the weights and compile them to boolean logic, since it would improve utilization. Program the chip, throw parameters in one side, get them out (later) on the other.
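As a toy illustration of what "compiling the weights into the logic" could mean at the arithmetic level (my own sketch, not BitGrid itself, and using ternary weights as the simplest case): fixing a weight row at build time leaves only bare adds and subtracts, and zero-weight terms vanish from the circuit entirely.

```python
# "Compile" a fixed ternary weight row into straight-line code:
# nonzero weights become adds/subtracts, zeros disappear, and no
# weight fetches remain at run time.

def compile_row(weights, name="y"):
    terms = []
    for i, w in enumerate(weights):
        if w == 1:
            terms.append(f"+ x[{i}]")
        elif w == -1:
            terms.append(f"- x[{i}]")
        # w == 0: the term is eliminated at "compile" time
    return f"{name} = " + " ".join(terms)

print(compile_row([1, 0, -1, 1, 0, 0, -1, 1]))
# y = + x[0] - x[2] + x[3] - x[6] + x[7]
```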
by lukaslezevicius on 6/25/24, 6:25 PM
by jhylau on 6/27/24, 2:27 PM
by Bluestein on 6/25/24, 6:10 PM
... if it wasn't already, TSMC is going to become pivotal. Ergo, Taiwan. Ergo, stability in the region ...
by nick238 on 6/25/24, 5:59 PM