by stealthcat on 1/4/24, 3:39 AM with 0 comments
Has there been any attempt to train directly on file bytes? That is, make the LLM's only vocabulary base-2, base-8, or hexadecimal digits, then do next-token prediction on that.
I know there have been some attempts, like MEGABYTE and Charformer, but some of them may not be learning directly from the raw bytes with all the header info included.
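To make the question concrete, here is a rough sketch of what I mean by a digit-only vocab. It is just my own illustration, not how MEGABYTE or Charformer do it: the names `bytes_to_tokens` and `next_token_pairs` are made up, and the choice of a fixed number of digits per byte (8 for binary, 3 for octal, 2 for hex) is an assumption so that every byte maps to the same number of tokens.

```python
# Hypothetical sketch: turn raw file bytes (header included) into digit tokens.
# Fixed width per byte is an assumption so byte boundaries stay aligned.
DIGITS_PER_BYTE = {2: 8, 8: 3, 16: 2}


def bytes_to_tokens(data: bytes, base: int = 16) -> list[int]:
    """Map each byte to fixed-width digits in `base`; vocab size equals `base`."""
    width = DIGITS_PER_BYTE[base]
    tokens = []
    for value in data:
        digits = []
        for _ in range(width):
            digits.append(value % base)  # least significant digit first
            value //= base
        tokens.extend(reversed(digits))  # store most significant digit first
    return tokens


def next_token_pairs(tokens: list[int], context: int) -> list[tuple[list[int], int]]:
    """Build (context window, next token) pairs for next-token prediction."""
    return [
        (tokens[i:i + context], tokens[i + context])
        for i in range(len(tokens) - context)
    ]


# First four bytes of the PNG file signature, i.e. part of the header.
header = b"\x89PNG"
hex_tokens = bytes_to_tokens(header, base=16)
pairs = next_token_pairs(hex_tokens, context=4)
print(hex_tokens)  # [8, 9, 5, 0, 4, 14, 4, 7]
print(pairs[0])    # ([8, 9, 5, 0], 4)
```

The obvious cost is sequence length: hex doubles the token count relative to one token per byte, and binary multiplies it by eight.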