by arjunnarayan on 7/17/20, 12:01 PM with 61 comments
by brabel on 7/19/20, 8:31 AM
I know there's more to the post than just queryability, but I thought this should've been mentioned when discussing what existing IDEs can offer.
by sriram_malhar on 7/19/20, 6:27 AM
http://smallcultfollowing.com/babysteps/blog/2017/01/26/lowe...
The key observation to me was that the traditional Datalog/Prolog way of unifying is through syntactic equality, which is a bit too simple to express the kind of equality needed in Rust and elsewhere. You can express it in Datalog, but as it gets farther away from the source, error-generation suffers.
by kibwen on 7/19/20, 6:24 AM
The Rust project is also using the differential-datalog library mentioned in the OP to underlie their third-generation borrow checker: https://github.com/rust-lang/polonius
by simplify on 7/19/20, 5:11 AM
I noticed you (assuming you're the author) used what looks like a non-traditional Datalog syntax – which makes sense to me as, IMO, Datalog/Prolog desperately need first-class support for a record-like syntax to finally break into mainstream. Is there any prior work to this syntax, or did you just develop it as you needed it?
by pjmlp on 7/19/20, 8:13 AM
For prior work on IDE as database check "Energize/Cadillac & Lucid's Demise", https://www.dreamsongs.com/Cadillac.html
You can see it in action on this 1993 marketing video from Lucid, https://www.youtube.com/watch?v=pQQTScuApWk
IBM also tried the same concept on Visual Age for C++, version 4, which was one of the very last versions of the product.
http://www.edm2.com/index.php/VisualAge_C%2B%2B_4.0_Review
Both suffered from hardware requirements too heavy for what most companies were willing to pay; otherwise we could have had Smalltalk-like tooling, with incremental compilation and reloading for C++, as early as the early '90s.
by jierenchen on 7/19/20, 5:38 AM
I'm working on a similar project here: https://sourcescape.io/, but intended for use outside the IDE on larger collections of code (like the codebase of a large company.)
Agreed on the Prolog/Datalog approach of expressing a query as a collection of facts. CodeQL does the same. From one datastore nerd to another, I actually think this is a relatively unexplored area of querying knowledge graphs (code being a very complex, dense knowledge graph.)
Very excited to see where you go next with this "percolate search" functionality in the IDE.
by sideeffffect on 7/19/20, 12:07 PM
https://scalameta.org/docs/semanticdb/guide.html
It is a queryable database of semantic information about the program which is generated by the compiler (compiler plugin, to be precise). Once generated, other tools which need semantic information, like linters or language servers, can consume it without having to worry about how to actually generate it.
You might enjoy a talk about it: How We Built Tools That Scale to Millions of Lines of Code by Eugene Burmako
https://www.youtube.com/watch?v=C770WpI_odM
Kythe by Google is also a similar thing: https://kythe.io/
by sandermvanvliet on 7/19/20, 8:57 AM
by afarrell on 7/19/20, 7:59 AM
You can say “don’t do that”, but I didn’t. Past coworkers did.
by disposedtrolley on 7/19/20, 10:56 AM
I'm working with a lot of OpenAPI [1] specifications currently, some of which span tens of thousands of lines. Heaps of parent, child, sibling type relationships, and objects which are defined-once and referenced in many places. It would be nice if I could perform a search like "select all schemas used in this path" or "select the count of references to this parameter".
[1] https://github.com/OAI/OpenAPI-Specification/blob/master/ver...
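A rough sketch of what those two searches could look like over an already-parsed spec, in Python. The tiny `spec` dict and the helper names `iter_refs` and `refs_used_in_path` are made up for illustration, and it assumes only local `#/...` references with no JSON Pointer escaping:

```python
from collections import Counter

def iter_refs(node):
    """Yield every "$ref" string found anywhere below node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)

def refs_used_in_path(spec, path):
    """Collect local refs reachable from one path item, following refs transitively."""
    seen, todo = set(), list(iter_refs(spec["paths"][path]))
    while todo:
        ref = todo.pop()
        if ref in seen or not ref.startswith("#/"):
            continue  # already visited, or not a local reference
        seen.add(ref)
        target = spec
        for part in ref[2:].split("/"):  # assumption: no ~0/~1 escapes in the pointer
            target = target[part]
        todo.extend(iter_refs(target))
    return seen

# Hypothetical miniature spec, only for demonstration.
spec = {
    "paths": {
        "/pets": {"get": {
            "parameters": [{"$ref": "#/components/parameters/limit"}],
            "responses": {"200": {"content": {"application/json": {
                "schema": {"$ref": "#/components/schemas/PetList"}}}}},
        }},
        "/owners": {"get": {
            "parameters": [{"$ref": "#/components/parameters/limit"}],
        }},
    },
    "components": {
        "parameters": {"limit": {"name": "limit", "in": "query",
                                 "schema": {"type": "integer"}}},
        "schemas": {
            "PetList": {"type": "array",
                        "items": {"$ref": "#/components/schemas/Pet"}},
            "Pet": {"type": "object"},
        },
    },
}

# "select the count of references to this parameter"
ref_counts = Counter(iter_refs(spec))
limit_count = ref_counts["#/components/parameters/limit"]

# "select all schemas used in this path"
schemas_in_pets = sorted(r for r in refs_used_in_path(spec, "/pets")
                         if r.startswith("#/components/schemas/"))
print(limit_count, schemas_in_pets)
```

This is a plain tree walk rather than a real query language; the appeal of the database approach is that such traversals would not need to be hand-written per question.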
by ilaksh on 7/19/20, 5:40 AM
by nine_k on 7/19/20, 5:05 AM
- Parse the program like an IDE would, but expose the data in an open queryable database format (both like and unlike a language server).
- Use Datalog for storing the facts and inferring new facts about data.
(A fun fact: the Datalog implementation they use is written in Haskell and generates programs in Rust.)
by habosa on 7/19/20, 8:18 AM
(Also hi Pete!)
by Davidbrcz on 7/19/20, 3:07 PM
The most successful example is Semmle (bought by GitHub), which has been doing it for years now, with a SQL-like syntax for requests (named ".QL").
by gavinray on 7/19/20, 6:45 PM
https://github.com/salsa-rs/salsa
Good video describing its use for working with Rust's AST:
by z3t4 on 7/19/20, 9:28 AM
by YeGoblynQueenne on 7/19/20, 11:42 AM
To begin with, Datalog is not a "cousin" of Prolog as stated in the section "Interlude: Brief Intro to Datalog". Datalogs (there are many variants!) are subsets of Prolog. For example, a typical datalog is the language of definite clauses with no function symbols [¹] and with no negation as failure [²]. Another datalog may allow only the cons function in order to handle lists; etc.
Otherwise the syntax of datalog is identical to Prolog, but there is a further difference, in that Prolog is evaluated from the "top down" whereas Datalog is evaluated from the "bottom up". What that means is that given a "query" (we'll come to the scare quotes in a moment) Prolog will try to find "rules" whose heads unify with a literal in the query (A literal is an atom, or the negation of an atom; "p(χ,α)" is an atom.) whereas datalog will first generate the set of all ground atoms that are consequences of the program in the context of which the query was made, then determine whether the atoms in the query are in the set of consequences of the program [³]. The reason for the different execution model is that the bottom-up evaluation is guaranteed to terminate [⁴] whereas Prolog's top-down evaluation can "go infinite" [⁵]. There is of course another, more subtle difference: Prolog can "go infinite" because of the Halting problem, from which datalog does not suffer because, unlike Prolog, it does not have Universal Turing Machine expressivity [⁶].
So in short, datalog is a restricted subset of Prolog that has the advantage of being decidable, while Prolog in general is not, but is also incomplete while Prolog is complete [⁷].
Now, the other slight fudge in the article is about "rules", "facts" and "queries". Although this is established and well-heeled logic programming terminology, it fudges the er fact that those three things are the same kind of thing, namely, they are, all three of them, Horn clauses [⁸].
Specifically, Horn clauses are clauses with at most one positive literal.
Crash course in FOL: an atom is a predicate symbol followed by a set of terms in parentheses. Terms are variables, functions or constants (constants are functions with 0 arity, i.e. 0 arguments). A literal is an atom, or the negation of an atom. A clause is a disjunction of literals. A clause is Horn when it has at most 1 positive literal. A Horn clause is a definite clause when it has exactly 1 positive literal.
The following are Horn clauses:
¬P(χ) ∨ ¬P(υ)
P(χ) ∨ ¬Q(χ)
Q(α)
Q(β)
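To make the definitions concrete, here is a minimal Python sketch that counts positive literals and classifies the four clauses just listed. The `Literal` class and `classify` function are names invented for this illustration, not part of any Datalog library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    predicate: str
    args: tuple
    positive: bool = True  # False means the atom is negated

def classify(clause):
    """Classify a clause, given as a list of literals read as a disjunction."""
    positives = sum(1 for lit in clause if lit.positive)
    if positives > 1:
        return "not Horn"
    if positives == 1:
        return "definite"          # Horn with exactly one positive literal
    return "Horn, not definite"    # Horn with no positive literal

clauses = {
    "¬P(χ) ∨ ¬P(υ)": [Literal("P", ("x",), False), Literal("P", ("y",), False)],
    "P(χ) ∨ ¬Q(χ)":  [Literal("P", ("x",)), Literal("Q", ("x",), False)],
    "Q(α)":          [Literal("Q", ("alpha",))],
    "Q(β)":          [Literal("Q", ("beta",))],
}

kinds = {text: classify(lits) for text, lits in clauses.items()}
for text, kind in kinds.items():
    print(f"{text}: {kind}")
```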
In logic programming tradition, we write clauses as implications (because ¬A ∨ B ≡ A → B) and with the direction of the implication arrow reversed to make it easier to read long implications with multiple premises. So the four clauses above are written as:
←P(χ), P(υ) (a)
P(χ) ← Q(χ) (b)
Q(α)← (c)
Q(β)← (d)
And those are a "query", (a), a "rule", (b) and two "facts", (c) and (d). Note that (b,c,d) are definite clauses (they have exactly one positive literal, i.e. their head literal, which is what we call the consequent in the implication). Facts have only a positive literal; I like to read the dangling implication symbol as "everything implies ...", but that's a bit idiosyncratic. The bottom line is that definite clauses with no negative literals can be thought of as being always true, hence "facts". Queries, i.e. Horn clauses with no positive literals, are the opposite: "nothing implies" their body literals (my idiosyncratic reading) so they are "always false". Queries are also called "goals". Finally, definite clauses with both positive and negative literals can be thought of as "conditionally true".
Prolog and datalog programs are written as sets of definite clauses, i.e. sets of "facts" and "rules". So, when we want to reason about the "facts" and "rules" in the program, we make a "query". Then, the language interpreter, which is a top-down resolution theorem prover [⁹] in the case of Prolog, or a bottom-up fixpoint calculation in the case of datalog [¹⁰], determines whether our "query" is true. If the query includes any variables then the interpreter also returns every set of variable substitutions that make the query true.
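In the example above, (a) has two variables, χ and υ, and evaluating (a) in the context of (b,c,d) would return a "true" result with, for example, the variable substitution {χ/α,υ/β}, i.e. (a) is true when χ = α and υ = β. Since P(α) and P(β) both follow from the program, the other three pairings of α and β satisfy (a) as well.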
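A minimal Python sketch of that evaluation, hand-specialised to the tiny program (b,c,d) rather than a general interpreter. The variable names are made up, and `alpha`/`beta` stand in for α/β:

```python
from itertools import product

# Facts (c) and (d): Q(α) and Q(β).
q_atoms = {"alpha", "beta"}

# Rule (b): P(χ) ← Q(χ), so every constant satisfying Q also satisfies P.
p_atoms = set(q_atoms)

# Query (a): ←P(χ), P(υ). Each answer is one substitution for (χ, υ)
# under which both body literals are consequences of the program.
answers = sorted(product(p_atoms, repeat=2))

for chi, upsilon in answers:
    print(f"{{χ/{chi}, υ/{upsilon}}}")
```

The output includes {χ/alpha, υ/beta} together with every other pairing of the two constants, since χ and υ are bound independently.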
And that's how Horn clauses and definite clauses become "rules", "facts" and "queries".
Next time: how the leopard got its stripes and the hippopotamus learned to love the first order predicate calculus.
_________________
[¹] This is my (first) slight fudge because constants are also functions, with 0 arguments. So, to be formal, the typical datalog above has "no functions of arity more than 0".
[²] Negation-as-failure makes a language non-monotonic, in the sense that introducing new "facts" can change the meaning of a theory, i.e. a program.
[³] So, its Least Herbrand Model, or its Least Fix-Point (LFP).
[⁴] Because it finds the LFP of the query and the program.
[⁵] Unless evaluated by SLG resolution, a.k.a. tabling, similar to memoization.
[⁶] Higher-order datalogs, which allow predicate symbols as terms of literals, do have UTM expressivity, though. Indeed, a UTM can be defined in a higher-order datalog fragment where clauses have up to two body literals with at most two arguments:
utm(S,S) ← halt(S).
utm(S,T) ← execute(S,S₁), utm(S₁,T).
execute(S,T) ← instruction(S,P), P(S,T).
Originally in: Tärnlund, S.-A. (1977). Horn clause computability. BIT Numerical Mathematics, 17(2), 215–226.
[⁷] Less fudgy, definite programs are refutation complete under SLD resolution, meaning that any atom that is entailed by a definite program can be derived by SLD resolution. A definite program is a set of definite clauses, explanation of which is coming right up.
[⁸] Long time ago, I explained this to a colleague who remarked that all the nice syntactic elegance in query languages falls apart the moment you try to make a query, which usually has a very different syntax than the actual rows of the tables in the database. So I said "that's the point! Queries are also Horn clauses!" and his immediate remark was "That just blew my mind". It's been so long and I'm so used to the idea that I haven't a clue whether this is really mind blowing. Probably, being my usual excited self, I just said it in a way that it sounded mind blowing (gesticulating widely and jumping up and down enthusiastically, you can picture the scene) so my colleague was just being polite. That was over drinks at the pub after work anyway.
[⁹] Resolution is an inference rule that allows the derivation of new atoms from a set of clauses. In theorem proving it's used to refute a goal clause by deriving the empty clause, □. Since a goal is a set of negated literals, refuting the goal means basically that the negated literals are true. So our query is true in the context of our program.
[¹⁰] Datalog's bottom-up evaluation uses something called a TP operator. It basically does what I said above, starts with the ground atoms in a program and then derives the set of consequences of the clauses in the program. In each iteration, the set of consequences is added to the program and the process is repeated, until no new consequences are derived. As stated above, the process is guaranteed to complete because every datalog definite program has a least fixpoint, which is also its Least Herbrand Model (we won't go into Herbrand Models and Herbrand interpretations, but, roughly, an LHM is the smallest set of atoms that makes the union of a definite program and a goal true). A more complete introduction to LHMs and LFPs and how they are used in bottom-up evaluation for datalog can be found here:
https://www.doc.ic.ac.uk/~mjs/teaching/KnowledgeRep491/Fixpo...
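For the curious, the iteration described in [¹⁰] can be sketched in a few lines of Python. This is a naive illustration, not how a production datalog engine works: the `match`, `tp` and `least_fixpoint` names and the parent/ancestor program are invented for the example, variables are assumed to start with an uppercase letter, and every head variable is assumed to appear in the body.

```python
def match(atom, fact, binding):
    """Extend binding so that atom equals the ground fact, or return None."""
    pred, args = atom
    fpred, fargs = fact
    if pred != fpred or len(args) != len(fargs):
        return None
    binding = dict(binding)
    for a, f in zip(args, fargs):
        if a[:1].isupper():                    # variable
            if binding.setdefault(a, f) != f:  # clashes with an earlier binding
                return None
        elif a != f:                           # constant mismatch
            return None
    return binding

def tp(rules, atoms):
    """One application of the immediate-consequence operator to a set of ground atoms."""
    derived = set()
    for head, body in rules:
        bindings = [{}]
        for lit in body:
            bindings = [b2 for b in bindings for fact in atoms
                        if (b2 := match(lit, fact, b)) is not None]
        hpred, hargs = head
        for b in bindings:
            derived.add((hpred, tuple(b.get(a, a) for a in hargs)))
    return derived

def least_fixpoint(rules):
    """Apply tp repeatedly until no new ground atoms are derived."""
    atoms = set()
    while True:
        new = atoms | tp(rules, atoms)
        if new == atoms:
            return atoms
        atoms = new

# Facts are rules with an empty body.
rules = [
    (("parent", ("ann", "bob")), []),
    (("parent", ("bob", "cat")), []),
    (("ancestor", ("X", "Y")), [("parent", ("X", "Y"))]),
    (("ancestor", ("X", "Z")), [("parent", ("X", "Y")), ("ancestor", ("Y", "Z"))]),
]

model = least_fixpoint(rules)
print(sorted(model))
```

Termination here follows from the program having finitely many constants and no function symbols, so only finitely many ground atoms can ever be derived.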
by coderdd on 7/19/20, 8:49 AM
by geordimort on 7/19/20, 9:55 AM
by tom_mellior on 7/19/20, 6:40 PM
This will be interesting to see, since the straightforward implementation of generics for functional programming languages uses unification, which is not available in Datalog. It will probably be possible to encode things for any given program, since its universe of types should be finite. But it will involve jumping through hoops to encode something equivalent to unification.
by pgt on 7/19/20, 9:18 AM
by airocker on 7/19/20, 6:22 PM
by LeonB on 7/20/20, 3:01 PM
You can write queries such as:
from m in Application.Methods where m.NbLinesOfCode > 30 && m.IsPublic select m
by jtwaleson on 7/19/20, 6:17 AM
Anyway, I think both are a good idea. Datadog -> integrating monitoring information back into the IDE. Datalog -> working with the actual semantics of the code rather than just blobs of text.