by nolist_policy on 5/25/25, 7:08 PM with 7 comments
by impossiblefork on 5/25/25, 9:40 PM
Having more than one embedding is something I've tried myself, but not separate ones for each layer.
I'm guessing it's something like h_{l+1} = MultiHeadSelfAttentionWithPositionEncodingBakedIn(MLP(h_l) + embed_l(token_ids)). So it's probably really easy to implement on toy problems to see if it works.
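A minimal PyTorch sketch of that guessed recurrence, purely as an illustration: PerLayerEmbeddingBlock and the hyperparameters below are made-up names, position encoding is omitted because the comment assumes it is baked into the attention itself, and nn.MultiheadAttention stands in for whatever attention the real model uses.

    import torch
    import torch.nn as nn

    class PerLayerEmbeddingBlock(nn.Module):
        """One layer of the guessed recurrence
        h_{l+1} = Attn(MLP(h_l) + embed_l(token_ids)),
        with its own per-layer token-embedding table."""

        def __init__(self, vocab_size: int, d_model: int, n_heads: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)  # embed_l: this layer's own table
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            # Position encoding omitted; the comment assumes it is
            # baked into the attention (e.g. RoPE).
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, h: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
            x = self.mlp(h) + self.embed(token_ids)  # re-inject the tokens at this layer
            out, _ = self.attn(x, x, x, need_weights=False)
            return out

    # Toy-problem usage: stack a few blocks and start h at zeros, so
    # layer 0's table plays the role of the usual input embedding.
    layers = nn.ModuleList(PerLayerEmbeddingBlock(1000, 64, 4) for _ in range(4))
    token_ids = torch.randint(0, 1000, (2, 16))  # [batch, seq]
    h = torch.zeros(2, 16, 64)                   # [batch, seq, d_model]
    for layer in layers:
        h = layer(h, token_ids)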
by limoce on 5/26/25, 2:15 AM
"4x gated residual streams" look quite weird. Is there any paper or technique report for this?