by thebuilderjr on 6/9/24, 11:34 PM with 4 comments
by codewiz on 6/10/24, 5:37 PM
def forward(self, x):
    x = x + self.attn(self.ln_1(x))  # pre-norm: normalize, attend (tokens communicate), add back to the residual stream
    x = x + self.mlp(self.ln_2(x))   # pre-norm: normalize, MLP applied per token, add back to the residual stream
    return x
This is how it's described (starting at 19:00 into the video): "This is the pre-normalization version, where you see that x first goes through the layer normalization [ln_1] and then the attention [attn], and then goes back out to go to the layer normalization number two and the multilayer perceptron [MLP], sometimes also referred to as feed-forward network, FFN, and then that goes into the residual stream again."
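For context, here is a minimal sketch of how the surrounding Block could be put together in that nanoGPT style. The bodies of CausalSelfAttention and MLP, the config fields n_embd/n_head, and the use of F.scaled_dot_product_attention are reconstructions for illustration, not a verbatim copy of the video's code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalSelfAttention(nn.Module):
        # multi-head causal self-attention: the token-to-token "communication" step
        def __init__(self, config):
            super().__init__()
            assert config.n_embd % config.n_head == 0
            self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused q, k, v projection
            self.c_proj = nn.Linear(config.n_embd, config.n_embd)
            self.n_head = config.n_head
            self.n_embd = config.n_embd

        def forward(self, x):
            B, T, C = x.size()
            q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
            # split heads: (B, T, C) -> (B, n_head, T, head_dim)
            q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            # weighted sum over earlier tokens; the causal mask keeps it autoregressive
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            y = y.transpose(1, 2).contiguous().view(B, T, C)
            return self.c_proj(y)

    class MLP(nn.Module):
        # position-wise feed-forward network: applied to every token independently
        def __init__(self, config):
            super().__init__()
            self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
            self.gelu = nn.GELU()
            self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

        def forward(self, x):
            return self.c_proj(self.gelu(self.c_fc(x)))

    class Block(nn.Module):
        # pre-norm transformer block: LayerNorm sits before attn/mlp, so the
        # residual stream is a clean additive path from input to output
        def __init__(self, config):
            super().__init__()
            self.ln_1 = nn.LayerNorm(config.n_embd)
            self.attn = CausalSelfAttention(config)
            self.ln_2 = nn.LayerNorm(config.n_embd)
            self.mlp = MLP(config)

        def forward(self, x):
            x = x + self.attn(self.ln_1(x))
            x = x + self.mlp(self.ln_2(x))
            return x

With a config holding n_embd=768 and n_head=12, Block(config)(torch.randn(2, 1024, 768)) returns a tensor of the same shape, one refined representation per token.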
"And the one more thing that's kind of interesting to note is: recall that attention is a communication operation, it is where all the tokens - and there's 1024 tokens lined up in a sequence - this is where the tokens communicate, where they exchange information... so, attention is an aggregation function, it's a pooling function, it's a weighted sum function, it is a reduce operation, whereas this MLP [multilayer perceptron] happens every single token individually - there's no information being collected or exchanged between the tokens. So the attention is the reduce, and the MLP is the map."
"And the transformer ends up just being repeated application of map-reduce, if you wanna think about it that way."