from Hacker News

Exponentially faster language modelling

by born-jre on 11/21/23, 2:31 PM with 137 comments

  • by WithinReason on 11/22/23, 1:35 PM

    Link to previous paper:

    https://arxiv.org/abs/2308.14711

    An attempt at a summary: They use a sigmoid function to make differentiable "soft" branches and stack them to construct a binary tree, with the goal of only taking one branch at inference time (but training the whole tree), leading to O(log W) instead of O(W) inference cost. They gradually harden the branches so that they become hard branches by the end of training.

    A branch is computed as branch(input, N): a small neural network N computes a scalar c = N(input), and a sigmoid s turns it into a soft branch by returning the weighted sum of the recursive calls, s(c)*branch(input, N_left) + (1-s(c))*branch(input, N_right) (the two weights s(c) and 1-s(c) sum to 1). The "proper processing" happens only at the leaf nodes.

    Then they add a new loss term that encourages hard decisions by minimising the entropy of the Bernoulli distribution over each branch decision, making the two weights converge to 0 and 1, at which point only one branch needs to be taken at inference. They also note that this hardening often happens automatically, though.

    It's a simple idea, but the loss formulation is nice: you usually want your loss terms to be a measure of information.
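
    A minimal PyTorch sketch of how I read that (written iteratively rather than recursively; the module layout, sizes, and the 1e-6 epsilon are my own choices, not the paper's code):

        import torch
        import torch.nn as nn

        class SoftBranchFFF(nn.Module):
            # Depth-`depth` binary tree: internal nodes only route, leaves do the "proper processing".
            def __init__(self, dim, leaf_hidden, depth):
                super().__init__()
                self.depth = depth
                # one scalar-output router N per internal node (2**depth - 1 of them)
                self.nodes = nn.ModuleList([nn.Linear(dim, 1) for _ in range(2 ** depth - 1)])
                # one small MLP per leaf
                self.leaves = nn.ModuleList([
                    nn.Sequential(nn.Linear(dim, leaf_hidden), nn.GELU(),
                                  nn.Linear(leaf_hidden, dim))
                    for _ in range(2 ** depth)
                ])

            def forward(self, x):                               # x: (batch, dim)
                frontier = [(0, x.new_ones(x.shape[0], 1))]     # (leaf offset, weight on that subtree)
                entropy = x.new_zeros(())
                node = 0
                for _ in range(self.depth):
                    next_frontier = []
                    for offset, w in frontier:
                        s = torch.sigmoid(self.nodes[node](x))  # soft branch decision in (0, 1)
                        node += 1
                        # entropy of the Bernoulli(s) decision; driving it to 0 hardens the branch
                        entropy = entropy + (w * -(s * (s + 1e-6).log()
                                                   + (1 - s) * (1 - s + 1e-6).log())).mean()
                        next_frontier += [(2 * offset, w * s), (2 * offset + 1, w * (1 - s))]
                    frontier = next_frontier
                out = sum(w * self.leaves[i](x) for i, w in frontier)
                return out, entropy                             # add entropy (scaled) to the training loss

        fff = SoftBranchFFF(dim=768, leaf_hidden=128, depth=3)
        y, h = fff(torch.randn(4, 768))    # y: (4, 768); once h ~ 0, only one root-to-leaf path matters

    At inference you would then follow the single path where each s has rounded to 0 or 1, evaluating just one leaf.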

  • by sdrg822 on 11/22/23, 12:42 PM

    Cool. Important note:

    """ One may ask whether the conditionality introduced by the use of CMM does not make FFFs incompatible with the processes and hardware already in place for dense matrix multiplication and deep learning more broadly. In short, the answer is “No, it does not, save for some increased caching complexity." """

    It's hard to beat the hardware lottery!

  • by fgfm on 11/21/23, 11:59 PM

    This approach feels like pruning, but the speedup is considerably higher. I'm curious how this will play out on more recent transformer architectures, though: I'd guess the speedup matters most for the largest architectures, but even a 2x or 10x speedup on Mistral/Zephyr, Orca 2 or OpenChat 3.5 would be a tremendous achievement!
  • by rsolva on 11/22/23, 7:15 PM

    I find running 7B models on my 6-year-old small-form-factor HP EliteDesk to be fast enough for casual everyday use. If this speedup can be applied generally to commonly used models, I could serve a local ChatGPT experience for both friends and family from my tiny homelab in the basement.

    mind blown

  • by baq on 11/22/23, 4:46 PM

    Mix this with yesterday's matmul approximation (maddness) in HW for a casual... three orders of magnitude speed increase?
  • by tokai on 11/22/23, 1:44 PM

    Why not use the real title? It's short and precise.
  • by millisecond on 11/22/23, 12:54 PM

    Could this be applied to other models like Llama2 or Mistral?
  • by itissid on 11/23/23, 12:43 AM

    Another noob question: so a ~50% size reduction in BERT? Let's see if I'm getting these numbers right. At inference time you need only a fraction of the neurons in the FF layer, chosen based on the input data and the previous dot product. Here's some quick math for BERT-Base, which has 110M params according to the original paper:

    ----

        L (number of layers): 12 transformer blocks
        H (hidden size): 768 units in the hidden layers
        A (number of attention heads): 12 attention heads

    Embedding layers:

        WordPiece embeddings: 768 (hidden size) * 30,522 (vocab size) = 23,440,896 parameters
        Positional embeddings: 768 * 512 (max sequence length) = 393,216 parameters
        Segment embeddings: 768 * 2 (number of segments) = 1,536 parameters
        Total embedding parameters: 23,440,896 + 393,216 + 1,536 = 23,835,648 parameters

    Transformer blocks (each block has the following components):

        Self-attention layer: each attention head has 768 / 12 = 64 units
            Query (Q), Key (K), Value (V) matrices: 3 * (64 * 768) = 147,456 parameters per head
            Across 12 heads: 147,456 * 12 = 1,769,472 parameters
            Output layer of the attention mechanism: 768 * 768 = 589,824 parameters

        Feed-forward network (FFN):
            First layer: 768 (input) * 3,072 (intermediate size) = 2,359,296 parameters
            Second layer: 3,072 * 768 = 2,359,296 parameters
            Total FFN parameters per block: 2,359,296 + 2,359,296 = 4,718,592 parameters  -----> *This is the number to keep in mind.*

        Total parameters per block: 1,769,472 (self-attention) + 589,824 (attention output) + 4,718,592 (FFN) = 7,077,888 parameters
        Total for 12 blocks: 7,077,888 * 12 = 84,934,656 parameters

    Layer norm and other parameters:

        Each transformer block also includes layer normalization and a few other small components, which add a relatively small number of parameters.

    Total parameters:

        Embeddings: 23,835,648
        Transformer blocks: 84,934,656
        Layer norm and others: a small number, bringing the total to around 110 million
    --------------------------------------

    4.72M FFN params per block * 12 ≈ 56.6M of the ~110M total params, i.e. a staggering ~50% of the model that you effectively skip at inference time if you only use 0.3% of the FF neurons via FFF??
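
    A quick script to sanity-check the arithmetic above (weights only; biases and LayerNorm ignored, constants are just BERT-Base's published hyperparameters):

        # BERT-Base weight counts (biases and LayerNorm omitted)
        H, L, FF, VOCAB, MAX_POS, SEGS = 768, 12, 3072, 30522, 512, 2

        embeddings = H * (VOCAB + MAX_POS + SEGS)        # 23,835,648
        attention  = 3 * H * H + H * H                   # Q, K, V + output projection = 2,359,296
        ffn        = 2 * H * FF                          # 4,718,592 per block
        total      = embeddings + L * (attention + ffn)  # ~108.8M; biases/LayerNorm round it to ~110M

        print(f"FFN share: {L * ffn / 1e6:.1f}M of {total / 1e6:.1f}M "
              f"= {100 * L * ffn / total:.0f}%")         # -> FFN share: 56.6M of 108.8M = 52%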

  • by itissid on 11/22/23, 11:41 PM

    Noob question: so the idea is to load only specific branches (and by extension O(log n) neurons) based on the input data, right? Would this be something a compiler would do with a JIT trick (because the input needs to be known to pick the right branch), issuing a call to bring the right neurons into memory (SIMD?) to do the feed forward?
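
    Not an answer, but here is a toy sketch of what hard-branch inference could look like, just to make the "log(n) neurons" point concrete (plain NumPy; the names and the ReLU-style activation are hypothetical, and this is not necessarily the paper's exact CMM formulation):

        import numpy as np

        D, H = 12, 768                                              # tree depth, hidden size
        rng = np.random.default_rng(0)
        w_in  = rng.standard_normal((2 ** D - 1, H)) / np.sqrt(H)   # one "neuron" per tree node: 4095 total
        w_out = rng.standard_normal((2 ** D - 1, H)) / np.sqrt(H)

        def fff_infer(x):
            # descend the binary tree: one dot product per level, so 12 of the 4095 neurons are touched
            y, node = np.zeros(H), 0
            for _ in range(D):
                act = w_in[node] @ x
                y += max(act, 0.0) * w_out[node]          # only the visited node contributes
                node = 2 * node + (1 if act > 0 else 2)   # sign of the activation picks the child
            return y

        y = fff_infer(rng.standard_normal(H))             # loads 12 rows of each weight matrix per token

    Whether that row selection maps onto a JIT, SIMD gathers, or plain sparse loads is exactly the "efficient implementation" gap the paper says is still open.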
  • by itissid on 11/23/23, 12:13 AM

    For those not familiar with the BERT transformer arch: you can read a bunch of their torch benchmark code to see how they measure the speedup in just the FFF: https://github.com/pbelcak/UltraFastBERT/blob/main/benchmark...

    Some of it comes from the PyTorch docs here: https://pytorch.org/tutorials/intermediate/torch_compile_tut..., e.g. the `timed` function and how they generate data.

    Also, it's not just the same 12 neurons every time; it's 12 neurons chosen based on the previous dot product. So some kind of JIT is needed to load the right ones?
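
    The `timed` helper in that tutorial is essentially a CUDA-event stopwatch; from memory it looks roughly like this (paraphrased, not copied from either repo):

        import torch

        def timed(fn):
            # time a callable on the GPU using CUDA events, as in the torch.compile tutorial
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            result = fn()
            end.record()
            torch.cuda.synchronize()                       # wait so elapsed_time is meaningful
            return result, start.elapsed_time(end) / 1000  # seconds

        # usage: out, secs = timed(lambda: model(batch))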

  • by Klaster_1 on 11/22/23, 12:46 PM

    What are the potential consequences? Does this open doors to faster edge inference or improved capabilities?
  • by measured_step on 11/22/23, 5:09 PM

    How would this scale for a use case like writing code? I could imagine that some inputs would require a large number of neurons. Would this architecture be able to do that if it were scaled up?

    I'm also curious if this model architecture would achieve the grokking of more complex concepts at scale.

  • by jasonjmcghee on 11/22/23, 6:33 PM

    Does anyone understand why they are using B x H instead of B x S x H?

    Why are the context size and batch size represented as a single parameter?

  • by quickthrower2 on 11/22/23, 9:13 PM

    If anyone is on the ball enough to turn this into a Colab or notebook, that would be appreciated! Would love to see the code.
  • by dartos on 11/24/23, 1:02 AM

    AFAIK GPU cores are quite slow with branching logic.

    I wonder if the conditional branching in this would hurt performance at scale.

  • by ndr on 11/22/23, 12:40 PM

    Abstract:

    > Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.

    Conclusions

    > We present UltraFastBERT, a modified version of the (crammed)BERT architecture that uses fast feedforward instead of feedforward networks in its intermediate layers. UltraFastBERT serves as proof that large language models only really need to engage an exponential fraction of their parameters to perform individual inferences. UltraFastBERT-1x11, our deepest model with the highest promise of acceleration, uses only 0.3% of its neurons during inference and already achieves a 78x CPU speedup over the inference time of the corresponding feedforward layer. With a theoretical speedup promise of 341x at the scale of BERT-base models, we hope that our work will inspire an effort to implement primitives for conditional neural execution as a part of device programming interfaces.

  • by qntty on 11/22/23, 3:55 PM

    According to scientists, we only use 0.3% of our neural networks. Imagine if we could use 100%.
  • by OneOffAsk on 11/22/23, 12:58 PM

    Is this similar to what iOS 17 uses for its new autocomplete?
  • by matmulbro on 11/23/23, 4:18 AM

    Timewaster.

    All valuable AI research is secret now; they just churn out papers to waste time.

  • by vouaobrasil on 11/22/23, 2:09 PM

    This is rather scary. I feel we are witnessing the evolution of language models and artificial intelligence, which seems intellectually laudable until you realize that the underlying evolutionary framework is the global capitalist system, whose only criterion for selection is short-term monetary gain.

    We are creating a monster.