by thebuilderjr on 6/9/24, 11:34 PM with 4 comments
by codewiz on 6/10/24, 5:37 PM
def forward(self, x):
    x = x + self.attn(self.ln_1(x))  # pre-norm: normalize, attend (tokens communicate), add back to the residual stream
    x = x + self.mlp(self.ln_2(x))   # pre-norm: normalize, MLP applied per token, add back to the residual stream
    return x
This is how it's described (starting at 19:00 into the video): "This is the pre-normalization version, where you see that x first goes through the layer normalization [ln_1] and then the attention [attn], and then goes back out to go to the layer normalization number two and the multilayer perceptron [MLP], sometimes also referred to as feed-forward network, FFN, and then that goes into the residual stream again."
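For context, here is a minimal sketch of how the surrounding Block could be put together in that nanoGPT style. The bodies of CausalSelfAttention and MLP, the config fields n_embd/n_head, and the use of F.scaled_dot_product_attention are reconstructions for illustration, not a verbatim copy of the video's code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalSelfAttention(nn.Module):
        # multi-head causal self-attention: the token-to-token "communication" step
        def __init__(self, config):
            super().__init__()
            assert config.n_embd % config.n_head == 0
            self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused q, k, v projection
            self.c_proj = nn.Linear(config.n_embd, config.n_embd)
            self.n_head = config.n_head
            self.n_embd = config.n_embd

        def forward(self, x):
            B, T, C = x.size()
            q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
            # split heads: (B, T, C) -> (B, n_head, T, head_dim)
            q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            # weighted sum over earlier tokens; the causal mask keeps it autoregressive
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            y = y.transpose(1, 2).contiguous().view(B, T, C)
            return self.c_proj(y)

    class MLP(nn.Module):
        # position-wise feed-forward network: applied to every token independently
        def __init__(self, config):
            super().__init__()
            self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
            self.gelu = nn.GELU()
            self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

        def forward(self, x):
            return self.c_proj(self.gelu(self.c_fc(x)))

    class Block(nn.Module):
        # pre-norm transformer block: LayerNorm sits before attn/mlp, so the
        # residual stream is a clean additive path from input to output
        def __init__(self, config):
            super().__init__()
            self.ln_1 = nn.LayerNorm(config.n_embd)
            self.attn = CausalSelfAttention(config)
            self.ln_2 = nn.LayerNorm(config.n_embd)
            self.mlp = MLP(config)

        def forward(self, x):
            x = x + self.attn(self.ln_1(x))
            x = x + self.mlp(self.ln_2(x))
            return x

With a config holding n_embd=768 and n_head=12, Block(config)(torch.randn(2, 1024, 768)) returns a tensor of the same shape, one refined representation per token.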
"And the one more thing that's kind of interesting to note is: recall that attention is a communication operation, it is where all the tokens - and there's 1024 tokens lined up in a sequence - this is where the tokens communicate, where they exchange information... so, attention is an aggregation function, it's a pooling function, it's a weighted sum function, it is a reduce operation, whereas this MLP [multilayer perceptron] happens every single token individually - there's no information being collected or exchanged between the tokens. So the attention is the reduce, and the MLP is the map."
"And the transformer ends up just being repeated application of map-reduce, if you wanna think about it that way."