from Hacker News

Hashing Modulo Theories

by philzook on 5/17/24, 8:51 PM with 3 comments

by yarg on 5/19/24, 5:14 AM
(What the article calls) Canonisation is an interesting issue - it seems very hard to get right and to do so efficiently.
I'm trying to wrap my head around a the issue in a regex parser that I've knocked up.
I'm currently ending up (expectedly) with multiple nodes representing equivalent languages; I want to strip these out when I use code generation to convert the constructed network into a switch based static automaton.
The most robust way to see whether two languages are the same is to xor them and test for interminability - but this means comparing each pair of nodes throughout the network, and I'd rather avoid the n^2 scaling if there's another option.
That option is to generate a canonical expression for the language that the machine represents, somewhat difficult but far more efficient when it comes to detecting collisions.
https://en.wikipedia.org/wiki/Canonization
https://en.wikipedia.org/wiki/Canonicalization