by psuter on 3/8/20, 6:53 PM with 35 comments
by dzdt on 3/9/20, 12:18 PM
The computer scientists took as a target an eastern European chess journal which printed move-by-move reports of tournament chess matches. They incorporated a crude chess engine in the recognition step, estimating the likelihood of next moves and combining that with the OCR engine's likelihood estimate that the printed characters were particular glyphs. Despite the very low quality of the printing, the journal had very high quality editing, so the source material was self-consistent. Completely illegible characters could mostly be filled in as the sensible game moves that were allowed. It took hundreds of hours of human review time to find a single OCR mistake from this process!
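Roughly, the combination can be sketched like this in Python (the probabilities, candidate readings, and scoring function here are made up for illustration; this is not the actual system described):

    import math

    def best_reading(ocr_candidates, move_prior, legal_moves):
        """ocr_candidates: {move_text: P(glyphs | move_text)} from the OCR engine.
           move_prior:     {move_text: P(move | position)} from a chess engine.
           legal_moves:    moves that are actually legal in the position."""
        best, best_score = None, -math.inf
        for move, p_ocr in ocr_candidates.items():
            if move not in legal_moves:
                continue                      # illegible or illegal readings drop out
            score = math.log(p_ocr) + math.log(move_prior.get(move, 1e-9))
            if score > best_score:
                best, best_score = move, score
        return best

    # e.g. a smudged print that could read "Nf5" or "Nf3"; the position makes Nf3
    # far more plausible, so the combined score picks it even though OCR alone
    # slightly prefers Nf5.
    print(best_reading({"Nf5": 0.55, "Nf3": 0.45},
                       {"Nf3": 0.30, "Nf5": 0.01},
                       {"Nf3", "Nf5", "d4"}))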
by willvarfar on 3/9/20, 12:10 PM
If size and performance are a focus, just store them in a normal sorted table with compression (e.g. leveldb, or mysql using rocksdb).
This means all these small documents can be compressed using repetition between games, not just within each game.
And it's probably much faster and simpler, too.
Basically, the size taken by the database should be at the same kind of level as you get by just having a text file with one line per game, and gzipping the whole thing. I'd expect it to be an order of magnitude smaller than per-document compression.
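As a rough check of that claim, here's a minimal Python sketch comparing per-game compression against compressing the whole file at once, using only zlib (games.txt, one game per line, is a placeholder for whatever export you have):

    import zlib

    with open("games.txt", "rb") as f:
        lines = f.readlines()

    # Per-document: each game compressed on its own, no repetition shared.
    per_game = sum(len(zlib.compress(line)) for line in lines)
    # Whole file: repetition across games is available to the compressor.
    whole = len(zlib.compress(b"".join(lines)))

    print(f"per-game: {per_game} bytes, whole-file: {whole} bytes")

Block-compressed stores like leveldb or rocksdb sit between the two: they compress whole blocks of adjacent rows, so many small games still share context.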
by thom on 3/9/20, 11:23 AM
by SethTro on 3/9/20, 11:10 AM
Otherwise an interesting article about injecting domain knowledge to improve beyond what's possible with a normal compression algorithm.
by jameshart on 3/9/20, 1:38 PM
Interesting question: If you just generated the bit string that corresponded to taking the ‘most obvious move’ according to their heuristics, what game would play out? In a way, that would be the ‘most obvious chess game’, or perhaps the ‘least creative chess game’...
In theory a more efficient version would use a more sophisticated AI to rank the next moves, and even choose how aggressively to Huffman-code the possible options.
In a sense this would measure the information content of a chess game (at least relative to the ‘knowledge’ of that AI). I wonder what the variance in the number of bits needed to encode various games would tell you about the play style of different players, or the nature of the game, or even the relationship between information theory and creativity...
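Here's a toy Python sketch of the first idea, using the python-chess library. The ranking below (captures first, then checks, then alphabetical SAN) is a stand-in heuristic, not the article's actual ordering, so the resulting "most obvious game" is only illustrative:

    import chess

    def rank_moves(board):
        # Placeholder ordering: prefer captures, then checks, then alphabetical SAN.
        def key(m):
            return (not board.is_capture(m),
                    not board.gives_check(m),
                    board.san(m))
        return sorted(board.legal_moves, key=key)

    board = chess.Board()
    sans = []
    while not board.is_game_over(claim_draw=True) and len(sans) < 200:
        move = rank_moves(board)[0]        # the "most obvious" move = rank 0,
        sans.append(board.san(move))       # i.e. the all-zeros bit string
        board.push(move)

    print(" ".join(sans))

The same machinery gives the information-content measure: sum -log2 of the model's probability for each move actually played, and compare the totals across games or players.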
by jhrmnn on 3/9/20, 2:50 PM
by 6510 on 3/9/20, 11:18 AM
by steerablesafe on 3/9/20, 1:26 PM
I wouldn't be surprised if the statistics for the first few moves were significantly different from the statistics for moves deep into the game.
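One way to check, sketched in Python with python-chess (games.pgn is just a placeholder path): compute the empirical entropy of the move played at each ply across a collection of games. If the hunch is right, the opening plies should come out far more concentrated than plies deep into the game:

    import chess.pgn, collections, math

    counts = collections.defaultdict(collections.Counter)  # ply -> move frequencies

    with open("games.pgn") as f:
        while (game := chess.pgn.read_game(f)) is not None:
            board = game.board()
            for ply, move in enumerate(game.mainline_moves()):
                counts[ply][board.san(move)] += 1
                board.push(move)

    for ply in sorted(counts)[:40]:
        total = sum(counts[ply].values())
        h = -sum(c/total * math.log2(c/total) for c in counts[ply].values())
        print(f"ply {ply:2d}: {h:.2f} bits of entropy over {total} games")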
by gowld on 3/9/20, 6:25 PM
Algorithmically cool, but quite a lot of work to save <$500/yr in hardware for a handful of 1TB SSDs.
Converting the data to the new format cost more than upgrading disks.
by SamReidHughes on 3/9/20, 12:39 PM
If you also store move times, there’s not much of a win to this.
by bumbledraven on 3/9/20, 12:48 PM