by smasher164 on 8/13/23, 10:00 PM with 32 comments
by rowanG077 on 8/14/23, 12:31 AM
In a compiler I build you had a simple ASTState type parameter for every node in the AST which indicated what is available in the AST. But you could traverse over the entire AST generically as long as you did not depend on any of the additional information.
You can basically lift the implementation from this paper: https://www.microsoft.com/en-us/research/uploads/prod/2016/1...
by nemo1618 on 8/14/23, 4:26 AM
by barrkel on 8/15/23, 9:51 AM
Being playful, I've generally implemented the tree grammars as literal constructions of the tree nodes themselves, so the grammars can self-check.
I've use this approach in Ruby and Java though, not functional languages with pattern matching and destructuring, so I wasn't leaving those powerful tools on the table.
In particular I have given up trying to represent ASTs using classes. I don't think they deliver a lot of value. Behavior generally belongs outside the class and not inside, transformations should be collected together as an algorithm, rather than distributed across multiple definitions. Double dispatch for type-safe tree navigation is unpleasant and you lose control over typing of context. You end up with explicit typecasts and you've lost what little type safety you had, and you still can't enforce constraints around how the tree changes from phase to phase.
by chrisaycock on 8/14/23, 2:21 AM
https://rustc-dev-guide.rust-lang.org/part-3-intro.html
Additionally, Many language implementations use a mid-level IR to sit above LLVM: Rust has MIR, Swift has SIL, Julia has an SSA-form IR, etc. The article mentions GHC's Core, which is also an example of an IR that comes after type checking but before LLVM.
by jeswin on 8/14/23, 3:10 AM
Can someone who's more familiar with compilers ELI10 the problem? A layman like me is left wondering - Why not add (say) a type field to each node?
by TOGoS on 8/14/23, 7:08 PM
Of course, for conveniently and safely manipulating in memory in $programming_language, you're probably going to want to define some structs/ADTs/whatever that only contain the data a given compilation stage is actively working with.
I've been thinking that what I need is a system that allows me to quickly define different lower-level datatypes for representing different views of the conceptual types and automate, to some degree, translation between them, so then each part of the system can work with objects designed specifically to be processed by it with minimal fuss.
A technical reason for avoiding those specialized types might be that the computer then has to spend more time transforming from one schema to the next. I would think that in practice this isn't any worse than having to do a lot of null checks.
A more human reason is that it could bean a combinatorical explosion of AST types. I guess this is where my idea about lightweight variations comes in.
In TypeScript this kind of thing might not be so bad, since any object can be downcast with no cost to a type that contains a subset of the information, and variations on types can be easily defined without even necessarily being named, e.g. `ASTNode & HasResultType & HasSourceLocation`.
by nextaccountic on 8/14/23, 9:11 AM
> Both GHC’s core IR, Ur/Web's core and Coq are explicitly typed in this way.
This is how GHC used to do, but in 2018 GHC adapted the pattern "trees that grow"
The paper https://www.jucs.org/jucs_23_1/trees_that_grow/jucs_23_01_00... (the paper is actually from 2017)
Light GHC docs on it https://gitlab.haskell.org/ghc/ghc/-/wikis/implementing-tree... (I recall seeing more through docs somewhere). I will quote
> The new HsSyn AST supporting the TTG idiom (from now on referred to as TTG HsSyn) is designed to subsume five different representations of Haskell syntax:
> AST GhcPs: the AST used in GHC's parsing phase > AST GhcRn: the AST used in GHC's renaming phase > AST GhcTc: the AST used in GHC's typechecking phase > AST TH: the AST used in Template Haskell > AST HSE: the AST used in an external tool such as Haskell-Src-Exts
This means that you have a single generic tree for all of those passes, but each pass has a different instantiation with has custom data
Also, some discussion of applications https://www.reddit.com/r/haskell/comments/7zo9mb/any_applica... and slides here http://data.tmorris.net/talks/trees-that-grow/bc4ae902ca1d3b...
The thing about this is to observe that, in the blog post in the OP, when you added data to a given type, it was inserted as a concrete type (named Type here), that is, you went from
data Exp = Num Int
| Bool Bool
| Var Var
| If Exp Exp Exp
| Lambda Var Exp
| App Exp Exp
to type TExp = (TExp', Type)
data TExp' = TNum Int
| TBool Bool
| TVar Var
| TIf TExp TExp TExp
| TLambda Var TExp
| TApp TExp TExp
But you could have kept it a generic type (here called a, rather than Type). This is a common pattern in both Haskell and Rust (and maybe other languages), and isn't the Trees That Grow pattern just yet, but it's like this: type GenericTExp a = (GenericTExp' a, a)
data GenericTExp' a = TNum Int
| TBool Bool
| TVar Var
| TIf TExp (GenericTExp a) (GenericTExp a)
| TLambda Var (GenericTExp a)
| TApp TExp (GenericTExp a)
Now you can recover the original TExpr with type TExpr = GenericTExpr Type
And you can recover the bare Expr with no added data type by putting () (called unit type) there type Expr = GenericTExpr ()
Now, the paper also deals with the problem of, how to add new variants to a data type (it has a | Something that is open ended; this is called "extension constructor" in the ghc wiki, and denoted by XExpr), and how to add different data types to different nodes (you need to, instead of directly adding Type to the generic parameter, add a marker and use type-level functions - called type families in Haskell - to select the data type you're adding). When you figure out all those complexities, you end up with a type-level machinery that will inevitably be similar to the TTG idiomby teo_zero on 8/14/23, 8:39 AM
So it's not a grammar. "data Exp" might describe each internal node of the AST, where the first word is the node's variety and the following words describe the children nodes. But what does "data Type" mean, then?
by gpderetta on 8/14/23, 8:16 AM
by BSEdlMMldESB on 8/14/23, 6:49 AM
what's the difference between running a program (interpreting it), and compiling it?
where's (and what's) the line that distinguishes the program from the output of the program where both parts are just software?
--
the most interesting thing about turning machines is the universal turing machine because it actually invents/defines/showcases software as we understand it nowadays; this makes me reflect on how the busy beaber problem just misses out on this realization
by 3cats-in-a-coat on 8/14/23, 8:19 AM
I think the problem is we write compilers like it's the 90s, and it's not the 90s.
But also it's full of data structure solutions to this even for static types: give nodes a simple autoincrementing id (if your language has no object maps), then put any extra information in a sidecar if you want. Think relations and joins. It's easy.