from Hacker News
Physics of Language Models: Architecture Design and the Magic of Canon Layers
by nkko on 5/4/25, 4:25 PM with 1 comment
by darknoon on 5/15/25, 12:19 AM
Anyone know why they mix in the 3 previous tokens? They could just as easily have used 5 or 2, right?
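
To make the question concrete, here is a minimal sketch of what "mixing in the previous tokens" could look like when the window size is left as a free parameter. It is an illustration only, not the paper's implementation: the function name `mix_previous_tokens`, the scalar weights, and the toy inputs are all assumptions made for this example, and a learned layer would more likely use per-channel weights.

```python
def mix_previous_tokens(hidden, weights):
    """Causally mix each token with the len(weights) - 1 tokens before it.

    hidden:  list of token vectors (each a list of floats), in sequence order.
    weights: weights[0] scales the current token, weights[j] the token j steps
             back. Scalar weights are a simplification (hypothetical choice).
    """
    mixed = []
    for t, current in enumerate(hidden):
        out = [0.0] * len(current)
        for j, w in enumerate(weights):
            if t - j < 0:
                break  # no earlier tokens: same effect as zero padding on the left
            for d, value in enumerate(hidden[t - j]):
                out[d] += w * value
        mixed.append(out)
    return mixed


# Toy sequence of five 1-dimensional "token" vectors (made-up values).
tokens = [[1.0], [2.0], [3.0], [4.0], [5.0]]

window_3 = mix_previous_tokens(tokens, [1.0, 1.0, 1.0, 1.0])  # current + 3 previous
window_2 = mix_previous_tokens(tokens, [1.0, 1.0, 1.0])       # current + 2 previous
```

In this sketch, changing the window from 3 previous tokens to 2 or 5 only changes the length of `weights`, so nothing in the mechanism itself forces a particular size; the choice would have to be justified by something else, such as experiments or cost.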