by born-jre on 11/21/23, 2:31 PM with 137 comments
by WithinReason on 11/22/23, 1:35 PM
https://arxiv.org/abs/2308.14711
An attempt at a summary: they use a sigmoid function to make differentiable "soft" branches and stack them into a binary tree, with the goal of taking only one branch at inference time (while training the whole tree), giving O(log W) instead of O(W) inference cost. They gradually harden the branches so that they become hard branches by the end of training.
A branch is computed as branch(input, N): a small neural network N computes a scalar c = N(input), and a sigmoid turns this into a soft branch by returning the weighted sum s(c) * branch(input, N_left) + (1 - s(c)) * branch(input, N_right) of the recursive calls (the two weights s(c) and 1 - s(c) sum to 1). Only the leaf nodes do the "proper processing".
Then they add a loss term that encourages hard decisions by minimising the entropy of the Bernoulli distribution with parameter s(c), pushing the two weights towards 0 and 1, at which point only one branch needs to be taken at inference. They also note that this hardening often happens automatically anyway.
It's a simple idea, but the loss formulation is nice: you usually want your loss terms to be a measure of information.
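A minimal PyTorch sketch of that recursion plus the entropy term. The module layout (heap-ordered node_nets as per-node scorers, leaf_nets as the leaf networks) is my own framing for illustration, not the paper's actual FFF code:

    import torch

    def soft_branch(x, node_nets, leaf_nets, idx=0):
        """Soft tree forward pass (illustrative, not the paper's implementation).

        node_nets: list of 2**d - 1 scorers (e.g. nn.Linear(hidden, 1)), heap-ordered.
        leaf_nets: list of 2**d leaf networks that do the "proper processing".
        Returns (output, total_entropy), where total_entropy is the hardening loss term.
        """
        n_internal = len(node_nets)
        if idx >= n_internal:                       # reached a leaf
            return leaf_nets[idx - n_internal](x), 0.0
        s = torch.sigmoid(node_nets[idx](x))        # soft decision s(c) in (0, 1), shape (batch, 1)
        left,  h_l = soft_branch(x, node_nets, leaf_nets, 2 * idx + 1)
        right, h_r = soft_branch(x, node_nets, leaf_nets, 2 * idx + 2)
        # Entropy of the Bernoulli(s) decision; adding it to the loss pushes s towards 0 or 1,
        # so at inference only one subtree ever needs to be evaluated.
        eps = 1e-8
        entropy = -(s * torch.log(s + eps) + (1 - s) * torch.log(1 - s + eps)).mean()
        return s * left + (1 - s) * right, entropy + h_l + h_r

At training time the returned entropy would be added to the task loss with some weight; once the decisions saturate, each forward pass effectively follows a single root-to-leaf path.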
by sdrg822 on 11/22/23, 12:42 PM
""" One may ask whether the conditionality introduced by the use of CMM does not make FFFs incompatible with the processes and hardware already in place for dense matrix multiplication and deep learning more broadly. In short, the answer is “No, it does not, save for some increased caching complexity." """
It's hard to beat the hardware lottery!
by fgfm on 11/21/23, 11:59 PM
by rsolva on 11/22/23, 7:15 PM
mind blown
by vorticalbox on 11/22/23, 1:56 PM
by baq on 11/22/23, 4:46 PM
by tokai on 11/22/23, 1:44 PM
by millisecond on 11/22/23, 12:54 PM
by itissid on 11/23/23, 12:43 AM
----
L (Number of Layers): 12 transformer blocks.
H (Hidden Size): 768 units in the hidden layers.
A (Number of Attention Heads): 12 attention heads.
Embedding Layers: WordPiece Embeddings: 768 (hidden size) * 30,522 (vocab size) = 23,440,896 parameters.
Positional Embeddings: 768 * 512 (max sequence length) = 393,216 parameters.
Segment Embeddings: 768 * 2 (number of segments) = 1,536 parameters.
Total Embedding Parameters: 23,440,896 + 393,216 + 1,536 = 23,835,648 parameters.
Transformer Blocks: Each transformer block has the following components:
Self-Attention Layer: Each attention head has 768 / 12 = 64 units.
Query (Q), Key (K), Value (V) matrices: 3 * (64 * 768) = 147,456 parameters per head.
Across 12 heads: 147,456 * 12 = 1,769,472 parameters.
Output layer of the attention mechanism: 768 * 768 = 589,824 parameters.
Feed-Forward Network (FFN):
First layer: 768 (input) * 3,072 (intermediate size) = 2,359,296 parameters.
Second layer: 3,072 * 768 = 2,359,296 parameters.
Total FFN parameters per block: 2,359,296 + 2,359,296 = 4,718,592 parameters. -----------------> *This is the number to keep in mind.*
Total Parameters per Block: 1,769,472 (self-attention) + 589,824 (output) + 4,718,592 (FFN) = 7,077,888 parameters.
Total for 12 Blocks: 7,077,888 * 12 = 84,934,656 parameters.
Layer Norm and Other Parameters:
Each transformer block also includes layer normalization and other small components, which add a relatively small number of parameters.
Total Parameters: Embeddings: 23,835,648
Transformer Blocks: 84,934,656
Layer Norm and Others: A small number, completing the total to around 110 million.
-------------------------------------- 4.718M FFN params per block * 12 ≈ 56.6M out of ~110M total params, i.e. roughly half the model sits in the FFN layers. So if FFF only touches 0.3% of those neurons, that's a staggering ~50% of the parameters that effectively go unused at inference?? (Rough arithmetic in the sketch below.)
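The same arithmetic as a quick Python sanity check (biases, layer norms and the pooler are left out, which is why it lands a bit under 110M):

    vocab, hidden, max_len, segments = 30_522, 768, 512, 2
    layers, ffn_dim = 12, 3_072

    embeddings = hidden * (vocab + max_len + segments)     # 23,835,648
    attention  = 3 * hidden * hidden + hidden * hidden     # Q/K/V + output projection = 2,359,296
    ffn        = 2 * hidden * ffn_dim                      # 4,718,592  <- the number to keep in mind
    per_block  = attention + ffn                           # 7,077,888
    total      = embeddings + layers * per_block           # 108,770,304 (~110M with layer norm etc.)

    print(f"FFN share: {layers * ffn / total:.0%}")        # ~52% of all parameters live in the FFNs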
by itissid on 11/22/23, 11:41 PM
by itissid on 11/23/23, 12:13 AM
some of which is from the pytorch docs here: https://pytorch.org/tutorials/intermediate/torch_compile_tut..., e.g. the `timed` function and how they generate data.
Also it's not just the same 12 neurons; it's the 12 neurons chosen based on the previous dot products. So some kind of JIT is needed to load the right ones?
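For what it's worth, here is a rough sketch of how I read the hard inference pass per token (names and shapes are illustrative, not the paper's CMM code): each level computes one dot product, its sign picks the child, and only the weights along that root-to-leaf path ever get loaded, so it looks more like a data-dependent gather than a JIT:

    import numpy as np

    def fff_infer(x, node_w, leaf_w_in, leaf_w_out, depth):
        """x: (hidden,); node_w: (2**depth - 1, hidden), heap-ordered;
        leaf_w_in / leaf_w_out: (2**depth, hidden)."""
        node = 0
        for _ in range(depth):                    # one dot product per tree level
            go_right = (x @ node_w[node]) > 0     # hardened sigmoid reduces to a sign test
            node = 2 * node + 1 + int(go_right)   # heap-style child index
        leaf = node - (2 ** depth - 1)            # only this leaf's weights are needed
        act = max(x @ leaf_w_in[leaf], 0.0)       # e.g. a ReLU-style leaf neuron
        return act * leaf_w_out[leaf]             # its rank-1 contribution to the layer output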
by Klaster_1 on 11/22/23, 12:46 PM
by measured_step on 11/22/23, 5:09 PM
I'm also curious if this model architecture would achieve the grokking of more complex concepts at scale.
by jasonjmcghee on 11/22/23, 6:33 PM
Why are the context size and batch size represented as a single parameter?
by quickthrower2 on 11/22/23, 9:13 PM
by dartos on 11/24/23, 1:02 AM
I wonder if the conditional in this would hurt performance at scale
by ndr on 11/22/23, 12:40 PM
> Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.
Conclusions
> We present UltraFastBERT, a modified version of the (crammed)BERT architecture that uses fast feedforward instead of feedforward networks in its intermediate layers. UltraFastBERT serves as proof that large language models only really need to engage an exponential fraction of their parameters to perform individual inferences. UltraFastBERT-1x11, our deepest model with the highest promise of acceleration, uses only 0.3% of its neurons during inference and already achieves a 78x CPU speedup over the inference time of the corresponding feedforward layer. With a theoretical speedup promise of 341x at the scale of BERT-base models, we hope that our work will inspire an effort to implement primitives for conditional neural execution as a part of device programming interfaces.
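For context on where those numbers come from (my own arithmetic, not quoted from the paper): 4095 = 2^12 - 1 is the node count of a full binary tree whose root-to-leaf paths contain 12 nodes, so:

    depth_of_path = 12
    total_neurons = 2 ** depth_of_path - 1        # 4095 neurons per FFF layer
    used = depth_of_path                          # one neuron per level on the chosen path
    print(f"{used / total_neurons:.1%}")          # 0.3% of neurons engaged per inference
    print(f"{total_neurons / used:.0f}x")         # ~341x theoretical speedup over dense FF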
by qntty on 11/22/23, 3:55 PM
by OneOffAsk on 11/22/23, 12:58 PM
by matmulbro on 11/23/23, 4:18 AM
All valuable AI research is secret now; they just churn out papers to waste time.
by vouaobrasil on 11/22/23, 2:09 PM
We are creating a monster.