by agomez314 on 8/2/24, 6:57 PM with 48 comments
by bmc7505 on 8/3/24, 8:11 PM
Sampling simply-typed terms seems notoriously more challenging than sampling closed ones. Even rejection sampling, whenever applicable, admits serious limitations due to the imminent asymptotic sparsity problem — asymptotically almost no term, be it either plain or closed, is at the same time (simply) typeable. [...] Asymptotic sparsity of simply-typed λ-terms is an impenetrable barrier to rejection sampling techniques. As the term size tends to infinity, so does the induced rejection overhead. In order to postpone this inevitable obstacle, it is possible to use dedicated mechanisms interrupting the sampler as soon as it is clear that the partially generated term cannot be extended to a typeable one. The current state-of-the-art samplers take this approach, combining Boltzmann models with modern logic programming execution engines backed by highly-optimised unification algorithms. Nonetheless, even with these sophisticated optimisations, such samplers are not likely to generate terms of sizes larger than one hundred.
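To make that sparsity concrete, here is a toy rejection sampler (my own sketch, nothing like the optimized samplers the quote mentions): draw random closed λ-terms in de Bruijn form, test simple typeability with unification, and watch the acceptance rate fall as the size grows.

    # Toy rejection sampler: random closed lambda terms (de Bruijn indices),
    # keep only the simply-typeable ones. Illustrative only.
    import random

    def random_term(size, depth=0):
        # size = number of AST nodes; returns None when no closed term fits
        if size == 1:
            return ("var", random.randrange(depth)) if depth else None
        if depth == 0 or size < 3 or random.random() < 0.5:
            body = random_term(size - 1, depth + 1)
            return ("lam", body) if body else None
        k = random.randrange(1, size - 1)           # size of the function part
        f, a = random_term(k, depth), random_term(size - 1 - k, depth)
        return ("app", f, a) if f and a else None

    def walk(t, subst):                             # resolve a type variable
        while isinstance(t, int) and t in subst:
            t = subst[t]
        return t

    def occurs(v, t, subst):
        t = walk(t, subst)
        return t == v if isinstance(t, int) else (
            occurs(v, t[1], subst) or occurs(v, t[2], subst))

    def unify(a, b, subst):
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            return
        if isinstance(a, int):
            if occurs(a, b, subst):
                raise TypeError("occurs check")     # e.g. self-application
            subst[a] = b
        elif isinstance(b, int):
            unify(b, a, subst)
        else:                                       # both are arrow types
            unify(a[1], b[1], subst)
            unify(a[2], b[2], subst)

    def typeable(term):
        subst, counter = {}, iter(range(10**6))
        def infer(t, env):
            if t[0] == "var":
                return env[t[1]]
            if t[0] == "lam":
                a = next(counter)
                return ("->", a, infer(t[1], [a] + env))
            r = next(counter)
            unify(infer(t[1], env), ("->", infer(t[2], env), r), subst)
            return r
        try:
            infer(term, [])
            return True
        except TypeError:
            return False

    for size in (10, 20, 40):
        trials = 20000
        hits = sum(1 for _ in range(trials)
                   if (t := random_term(size)) and typeable(t))
        print(size, hits / trials)   # acceptance rate collapses with size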
I would be curious to see a more rigorous analysis of the sample complexity of generating well-typed expressions in, e.g., the STLC. Maybe there is a way to avoid or reduce the rejection rate before evaluation.
by grnnja on 8/3/24, 10:54 PM
Most techniques are essentially brute-force enumeration with tricks to improve performance, so they tend to struggle to find larger programs. A lot of active research is aimed at improving that performance.
Compared to asking an LLM to write a program, program synthesis approaches guarantee that a solution will satisfy the specification, which can be very powerful.
In particular, as the author has discovered, one area where program synthesis excels is finding small intricate bitwise operator heavy programs that can be hard to reason about as a human.
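To make the enumerate-and-prune loop concrete, here is a toy bottom-up synthesizer over a made-up bitwise language; the language and names are mine, but the "keep one expression per observed behaviour" pruning is the standard trick:

    # Toy bottom-up enumerative synthesis: grow bitwise expressions by
    # AST size and keep only one expression per behaviour on the examples
    # ("observational equivalence" pruning).
    def synthesize(examples, max_size=9):
        inputs = tuple(x for x, _ in examples)
        target = tuple(y for _, y in examples)
        ops = {"&": lambda a, b: a & b, "|": lambda a, b: a | b,
               "^": lambda a, b: a ^ b,
               "<<": lambda a, b: (a << b) & 0xFF}  # 8-bit wraparound
        by_size = {1: {"x": inputs, "1": tuple(1 for _ in inputs)}}
        seen = set(by_size[1].values())
        for size in range(2, max_size + 1):
            layer = {}
            for lsize in range(1, size - 1):        # 1 node for the operator
                rsize = size - 1 - lsize
                for lexpr, lsig in by_size[lsize].items():
                    for rexpr, rsig in by_size[rsize].items():
                        for name, fn in ops.items():
                            sig = tuple(fn(a, b) for a, b in zip(lsig, rsig))
                            if sig == target:
                                return f"({lexpr} {name} {rexpr})"
                            if sig not in seen:     # prune duplicate behaviour
                                seen.add(sig)
                                layer[f"({lexpr} {name} {rexpr})"] = sig
            by_size[size] = layer
        return None

    # Doubling is found as the size-3 program (x << 1):
    print(synthesize([(1, 2), (3, 6), (13, 26)]))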
The most famous example of program synthesis is Microsoft's FlashFill, which is used in Excel. You give it a few input-output examples and FlashFill will try to create a small program that generalizes them; you can then apply the program to more inputs, which saves you a bunch of time. An example from the paper is below (a rough Python reading of the synthesized program follows it):
Input -> Output
International Business Machines -> IBM
Principles Of Programming Languages -> POPL
International Conference on Software Engineering -> ICSE
String Program: Loop(\w : Concatenate(SubStr2(v_1, UpperTok, w)))
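My reading of that program: Loop steps w = 1, 2, ... and concatenates, for each w, the substring of v_1 matched by the w-th uppercase-letter token, stopping when no match remains. A rough Python rendering of that semantics (my interpretation of the paper's DSL, not FlashFill's actual implementation):

    import re

    # Loop(\w : Concatenate(SubStr2(v_1, UpperTok, w))), roughly:
    # concatenate the w-th uppercase-token match for w = 1, 2, ...
    # until SubStr2 has no w-th match left.
    def run_string_program(v1):
        tokens = re.findall(r"[A-Z]+", v1)          # UpperTok matches
        out, w = "", 1
        while w <= len(tokens):                     # Loop until the match fails
            out += tokens[w - 1]                    # SubStr2(v_1, UpperTok, w)
            w += 1
        return out

    assert run_string_program("International Business Machines") == "IBM"
    assert run_string_program("Principles Of Programming Languages") == "POPL"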
Here are a few papers:
EUSOLVER: https://www.cis.upenn.edu/~alur/Tacas17.pdf
FlashFill: https://www.microsoft.com/en-us/research/wp-content/uploads/...
BlinkFill: https://www.vldb.org/pvldb/vol9/p816-singh.pdf
Synquid: https://cseweb.ucsd.edu/~npolikarpova/publications/pldi16.pd...
by dleslie on 8/3/24, 5:08 PM
https://medium.com/@maiavictor/solving-the-mystery-behind-ab...
And the Omega documentation:
https://hackage.haskell.org/package/control-monad-omega-0.3/...
But I'll admit, I want to see a paper or code.
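For readers who don't want to chase the Haskell: the trick Omega implements is dovetailing, i.e. enumerating a product of infinite streams along anti-diagonals so that every combination appears at a finite index. A small Python analogue of that enumeration order (not the Omega API itself):

    from itertools import count, islice

    # Dovetailed product of two infinite generators: walk anti-diagonals
    # i + j = 0, 1, 2, ... so every pair (i, j) appears at a finite
    # position; a naive nested loop would never leave the first stream.
    def diagonal(xs, ys):
        seen_x, seen_y = [], []
        xs, ys = iter(xs), iter(ys)
        for n in count():
            seen_x.append(next(xs))
            seen_y.append(next(ys))
            for i in range(n + 1):
                yield seen_x[i], seen_y[n - i]

    print(list(islice(diagonal(count(), count()), 6)))
    # [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0)]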
by kccqzy on 8/3/24, 5:19 PM
by mordechai9000 on 8/3/24, 5:13 PM
There is something fascinating and awe-inspiring about searching very large, enumerative datasets for useful results. Like wandering Borges's Library of Babel in search of not just meaning but something true and useful.
by diwank on 8/3/24, 6:26 PM
http://web.archive.org/web/20210616105135/http://lukepalmer....
by tadfisher on 8/3/24, 4:55 PM
by LightMachine on 8/3/24, 7:53 PM
https://x.com/VictorTaelin/status/1819774880130158663
That's as much as I can share for now, though.
by cvoss on 8/4/24, 1:40 AM
The technique generalizes to finite tree automata, meaning you can discover generators of simple expression grammars, given examples/non-examples.
So if you can construe your problem as a labeled tree recognition problem, assuming it's solvable using a finite tree automaton, then you can discover the algorithm for it efficiently.
For example, if you have three strings of bits (2 inputs and 1 output) and line them up as a single string of triplets of bits, it does not surprise me that it is easy to discover the automaton state transition rules that give you the carry bit for each triple and tell you when to reject the triple because the output bit is not correct for the two input bits.
The author has arranged the problem this way when sketching the ADC template and has also jump-started the search by assuming the solution space includes exactly one output bit for each pair of input bits. (That may seem like an obvious necessity, but that is a constraint which is not required by the tree automaton formulation, which need not differentiate "inputs" and "outputs".)
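Spelled out, the automaton only needs two states, carry 0 and carry 1: from carry c, a triple (a, b, out) is legal iff out = a xor b xor c, and the next state is the majority bit of a, b, c. A brute-force check of those transition rules (my encoding, least-significant bit first):

    # Two-state automaton over bit triples (a, b, out), read LSB-first.
    # State = current carry; accept iff every output bit matches the full
    # adder and no carry remains at the end.
    def accepts(a_bits, b_bits, out_bits):
        carry = 0
        for a, b, out in zip(a_bits, b_bits, out_bits):
            if out != a ^ b ^ carry:                # wrong output bit: reject
                return False
            carry = (a & b) | (a & carry) | (b & carry)
        return carry == 0

    def bits(n, width):                             # LSB-first encoding
        return [(n >> i) & 1 for i in range(width)]

    for x in range(8):
        for y in range(8):
            assert accepts(bits(x, 4), bits(y, 4), bits(x + y, 4))
            assert not accepts(bits(x, 4), bits(y, 4), bits((x + y + 1) % 16, 4))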
by treyd on 8/3/24, 4:56 PM
by sporkl on 8/3/24, 8:12 PM
A quick search led to this paper, which is exactly what I'm talking about: http://webyrd.net/quines/quines.pdf
by verdverm on 8/3/24, 8:25 PM
https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_...
PGE is essentially a BFS/DFS traversal of the space of all formulas, using local enumeration of the AST. The biggest gains were from eliminating duplicate work (commutativity / associativity) and not going down bad branches (too much complexity added for no meaningful change to output). There's a lot of overlap in ideas here, and a lot of open research questions that could be worked on (like: can we use RL to guide the search, A*-style?). There's definitely an exponential explosion or wall as the AST gets wider / deeper.
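That commutativity/associativity dedup can be as simple as canonicalizing each candidate AST before checking a visited set, e.g. flattening nested chains of a commutative operator and sorting the operands; a sketch of the idea (not pypge's actual code):

    # Canonicalize ASTs so commutative/associative variants hash the same:
    # flatten nested +/* chains and sort their operands.
    def canon(node):
        if isinstance(node, str):                   # variable/constant leaf
            return node
        op, args = node[0], node[1:]
        flat = []
        for a in map(canon, args):
            if op in ("+", "*") and isinstance(a, tuple) and a[0] == op:
                flat.extend(a[1:])                  # associativity
            else:
                flat.append(a)
        if op in ("+", "*"):
            flat.sort(key=repr)                     # commutativity
        return (op, *flat)

    seen = {canon(e) for e in [("+", "x", ("+", "y", "x")),
                               ("+", ("+", "x", "y"), "x")]}
    print(len(seen))  # 1: both parse trees collapse to ('+', 'x', 'x', 'y')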
At one point I wrote the core algorithm in Haskell, which made it so much more concise and beautiful, but eventually landed on Python: https://github.com/verdverm/pypge
In all of Genetic Programming / Symbolic Regression, everyone starts by trying to generate computer code and then switches to just math formulas. They are different classes of problems, because code has more "genes" and is order-sensitive, whereas math is not.
by garyrob on 8/3/24, 5:28 PM
by anamax on 8/5/24, 8:19 AM
by rowanG077 on 8/3/24, 11:44 PM
by jmull on 8/3/24, 10:44 PM
Hard to tell.