from Hacker News

Ask HN: Training LLMs directly on file bytes

by stealthcat on 1/4/24, 3:39 AM with 0 comments

Multi-modal LLMs like PaLM, GPT-4, and MiniGPTv2 rely on data encoders (image and speech models) to map data into the token embedding space.

Has there been any attempt to train directly on file bytes? Make the LLM's only vocabulary base-2, base-8, or hexadecimal, then do next-token prediction on that.
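To illustrate what I mean, here's a minimal sketch of that setup (function names are mine, purely hypothetical): each byte is a token in a 256-symbol vocabulary, or each hex nibble is a token in a 16-symbol vocabulary, and training targets are just the next token.

```python
# Sketch: raw file bytes as tokens, no modality-specific encoder.
# Vocab is either 256 (one token per byte) or 16 (one token per hex nibble).

def bytes_to_tokens(data: bytes) -> list[int]:
    """Byte-level vocab: each byte is its own token id in [0, 255]."""
    return list(data)

def bytes_to_nibbles(data: bytes) -> list[int]:
    """Hex (base-16) vocab: split each byte into high and low nibbles."""
    return [n for b in data for n in (b >> 4, b & 0x0F)]

def next_token_pairs(tokens: list[int]) -> list[tuple[list[int], int]]:
    """(context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Example: the first bytes of a PNG file, magic/header bytes included.
data = b"\x89PNG"
tokens = bytes_to_tokens(data)    # [137, 80, 78, 71]
pairs = next_token_pairs(tokens)  # ([137], 80), ([137, 80], 78), ...
```

The point is that file headers and magic bytes go into the training stream like everything else, instead of being stripped out by a format-aware encoder.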

I know there have been attempts like MEGABYTE and Charformer, but those may not be learning directly from raw file bytes, with all the header info included.