by limoce on 3/3/25, 1:27 PM with 2 comments
by kevmo314 on 3/6/25, 4:05 AM
But if it's true that the separators contribute the most to the attention scores, wouldn't that imply the tokenization scheme could be improved? Introducing a compression scheme seems like patching around that, compared to having the model naturally produce a more evenly spread attention distribution.
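A minimal sketch of what that claim looks like in practice, assuming a HuggingFace causal LM (gpt2 as a stand-in) and a hand-picked separator set; this is not from the linked paper, it just measures how much attention mass lands on separator tokens:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumptions: gpt2 as a stand-in model; ".", ",", ";", ":" and newline count as "separators".
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    text = "The quick brown fox jumps over the lazy dog. It was raining, so the fox went home."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        out = model(**inputs, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Average over layers and heads -> (seq, seq); rows are queries, columns are keys.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # GPT-2 BPE marks a leading space with "Ġ" and newline as "Ċ"; strip that before matching.
    sep_mask = torch.tensor(
        [t.lstrip("Ġ") in {".", ",", ";", ":"} or t == "Ċ" for t in tokens]
    )

    # Attention mass each key position receives, summed over all query positions.
    received = attn.sum(dim=0)
    frac = received[sep_mask].sum() / received.sum()
    print(f"{int(sep_mask.sum())}/{len(tokens)} tokens are separators; "
          f"they receive {frac.item():.1%} of the total attention mass")

If the separator fraction comes out disproportionately high relative to how many separator tokens there are, that's the concentration the compression scheme is exploiting.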
by xp84 on 3/6/25, 5:30 AM
"Why waste time say lot token when few token do trick?"
-Kevin Malone