by psuter on 3/8/20, 6:53 PM with 35 comments
by dzdt on 3/9/20, 12:18 PM
The computer scientists took as a target an eastern European chess journal which printed move-by-move reports of tournament chess matches. They incorporated a crude chess engine in the recognition step, estimating the likelihood of next moves and combining that with the OCR engine's likelihood estimate that the printed characters were particular glyphs. Despite the very low quality of the printing, the journal had very high quality editing, so the source material was self-consistent. Completely illegible characters could mostly be filled in as the sensible game moves that were allowed. It took hundreds of hours of human review time to find a single OCR mistake from this process!
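Roughly, the combination can be sketched like this in Python (the probabilities, candidate readings, and scoring function here are made up for illustration; this is not the actual system described):

    import math

    def best_reading(ocr_candidates, move_prior, legal_moves):
        """ocr_candidates: {move_text: P(glyphs | move_text)} from the OCR engine.
           move_prior:     {move_text: P(move | position)} from a chess engine.
           legal_moves:    moves that are actually legal in the position."""
        best, best_score = None, -math.inf
        for move, p_ocr in ocr_candidates.items():
            if move not in legal_moves:
                continue                      # illegible or illegal readings drop out
            score = math.log(p_ocr) + math.log(move_prior.get(move, 1e-9))
            if score > best_score:
                best, best_score = move, score
        return best

    # e.g. a smudged print that could read "Nf5" or "Nf3"; the position makes Nf3
    # far more plausible, so the combined score picks it even though OCR alone
    # slightly prefers Nf5.
    print(best_reading({"Nf5": 0.55, "Nf3": 0.45},
                       {"Nf3": 0.30, "Nf5": 0.01},
                       {"Nf3", "Nf5", "d4"}))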
by willvarfar on 3/9/20, 12:10 PM
If size and performance are a focus, just store them in a normal sorted table with compression (e.g. leveldb, or mysql using rocksdb).
This means all these small documents can be compressed using repetition between games, not just within each game.
And it's probably much faster and simpler, too.
Basically, the size taken by the database should be at the same kind of level as you get by just having a text file with one line per game, and gzipping the whole thing. I'd expect it to be an order of magnitude smaller than per-document compression.
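As a rough check of that claim, here's a minimal Python sketch comparing per-game compression against compressing the whole file at once, using only zlib (games.txt, one game per line, is a placeholder for whatever export you have):

    import zlib

    with open("games.txt", "rb") as f:
        lines = f.readlines()

    # Per-document: each game compressed on its own, no repetition shared.
    per_game = sum(len(zlib.compress(line)) for line in lines)
    # Whole file: repetition across games is available to the compressor.
    whole = len(zlib.compress(b"".join(lines)))

    print(f"per-game: {per_game} bytes, whole-file: {whole} bytes")

Block-compressed stores like leveldb or rocksdb sit between the two: they compress whole blocks of adjacent rows, so many small games still share context.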
by thom on 3/9/20, 11:23 AM
by SethTro on 3/9/20, 11:10 AM
Otherwise an interesting article about injecting domain knowledge to improve beyond what's possible with a normal compression algorithm.
by jameshart on 3/9/20, 1:38 PM
Interesting question: If you just generated the bit string that corresponded to taking the ‘most obvious move’ according to their heuristics, what game would play out? In a way, that would be the ‘most obvious chess game’, or perhaps the ‘least creative chess game’...
In theory a more efficient version would use a more sophisticated AI to rank the next moves, and even choose how aggressively to Huffman-code the possible options.
In a sense this would measure the information content of a chess game (at least relative to the ‘knowledge’ of that AI). I wonder what the variance in the number of bits needed to encode various games would tell you about the play style of different players, or the nature of the game, or even the relationship between information theory and creativity...
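Here's a toy Python sketch of the first idea, using the python-chess library. The ranking below (captures first, then checks, then alphabetical SAN) is a stand-in heuristic, not the article's actual ordering, so the resulting "most obvious game" is only illustrative:

    import chess

    def rank_moves(board):
        # Placeholder ordering: prefer captures, then checks, then alphabetical SAN.
        def key(m):
            return (not board.is_capture(m),
                    not board.gives_check(m),
                    board.san(m))
        return sorted(board.legal_moves, key=key)

    board = chess.Board()
    sans = []
    while not board.is_game_over(claim_draw=True) and len(sans) < 200:
        move = rank_moves(board)[0]        # the "most obvious" move = rank 0,
        sans.append(board.san(move))       # i.e. the all-zeros bit string
        board.push(move)

    print(" ".join(sans))

The same machinery gives the information-content measure: sum -log2 of the model's probability for each move actually played, and compare the totals across games or players.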
by jhrmnn on 3/9/20, 2:50 PM
by 6510 on 3/9/20, 11:18 AM
by steerablesafe on 3/9/20, 1:26 PM
I wouldn't be surprised if the statistics for the first few moves were significantly different from the statistics for moves deep into the game.
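One way to check, sketched in Python with python-chess (games.pgn is just a placeholder path): compute the empirical entropy of the move played at each ply across a collection of games. If the hunch is right, the opening plies should come out far more concentrated than plies deep into the game:

    import chess.pgn, collections, math

    counts = collections.defaultdict(collections.Counter)  # ply -> move frequencies

    with open("games.pgn") as f:
        while (game := chess.pgn.read_game(f)) is not None:
            board = game.board()
            for ply, move in enumerate(game.mainline_moves()):
                counts[ply][board.san(move)] += 1
                board.push(move)

    for ply in sorted(counts)[:40]:
        total = sum(counts[ply].values())
        h = -sum(c/total * math.log2(c/total) for c in counts[ply].values())
        print(f"ply {ply:2d}: {h:.2f} bits of entropy over {total} games")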
by gowld on 3/9/20, 6:25 PM
Algorithmically cool, but quite a lot of work to save <$500/yr in hardware for a handful of 1TB SSDs.
Converting the data to the new format cost more than upgrading disks.
by SamReidHughes on 3/9/20, 12:39 PM
If you also store move times, there’s not much of a win to this.
by bumbledraven on 3/9/20, 12:48 PM