from Hacker News

Physics of Language Models: Architecture Design and the Magic of Canon Layers

by nkko on 5/4/25, 4:25 PM with 1 comment

  • by darknoon on 5/15/25, 12:19 AM

    Anyone know why they mix in the 3 previous tokens? They could have just as easily done 5 or 2, right?