from Hacker News

The AST Typing Problem

by smasher164 on 8/13/23, 10:00 PM with 32 comments

by rowanG077 on 8/14/23, 12:31 AM
An even easier way is to add type parameter to every AST that resolves to either a type or null. It can be relatively easily implemented using type families in Haskell. And it doesn't suffer the tying the knot shenanigans. The main advantage is that you can still reason about your AST in a straight forward generic manner. In fact this is not only useful for typing. Any kind of metadata that needs to be added can be described in this way.
In a compiler I build you had a simple ASTState type parameter for every node in the AST which indicated what is available in the AST. But you could traverse over the entire AST generically as long as you did not depend on any of the additional information.
You can basically lift the implementation from this paper: https://www.microsoft.com/en-us/research/uploads/prod/2016/1...
by nemo1618 on 8/14/23, 4:26 AM
I'll contribute another solution, used in the Go compiler: build a big map from AST node -> typeinfo. This approach is very flexible (e.g. you can add more maps later without changing the AST struct), but, like Nullable, it's not correct-by-construction -- and worse, you "just have to know" that this map exists and how to access/update it. Seems to be working ok for them, though!
by barrkel on 8/15/23, 9:51 AM
The way I prefer to solve this problem these days is to use a fairly generic tree node structure (fairly close to sexprs) and implement tree grammars which check for correctness of different phases (e.g. verify that nodes are annotated with types, or syntax sugar has been decomposed).
Being playful, I've generally implemented the tree grammars as literal constructions of the tree nodes themselves, so the grammars can self-check.
I've use this approach in Ruby and Java though, not functional languages with pattern matching and destructuring, so I wasn't leaving those powerful tools on the table.
In particular I have given up trying to represent ASTs using classes. I don't think they deliver a lot of value. Behavior generally belongs outside the class and not inside, transformations should be collected together as an algorithm, rather than distributed across multiple definitions. Double dispatch for type-safe tree navigation is unpleasant and you lose control over typing of context. You end up with explicit typecasts and you've lost what little type safety you had, and you still can't enforce constraints around how the tree changes from phase to phase.
by chrisaycock on 8/14/23, 2:21 AM
The beauty of separate IRs is that each can represent a separate stage of compilation. For example, Rust goes from abstract syntax tree (AST) to high-level IR (HIR) to typed high-level IR (THIR):
https://rustc-dev-guide.rust-lang.org/part-3-intro.html
Additionally, Many language implementations use a mid-level IR to sit above LLVM: Rust has MIR, Swift has SIL, Julia has an SSA-form IR, etc. The article mentions GHC's Core, which is also an example of an IR that comes after type checking but before LLVM.
by jeswin on 8/14/23, 3:10 AM
> How does one go about adding “extra information” (e.g. types) to an AST?
Can someone who's more familiar with compilers ELI10 the problem? A layman like me is left wondering - Why not add (say) a type field to each node?
by TOGoS on 8/14/23, 7:08 PM
Make each AST node an RDF node and then you can cram whatever information into it you want. That's the approach I've been taking with https://github.com/TOGoS/TOGVM-Spec/, anyway.
Of course, for conveniently and safely manipulating in memory in $programming_language, you're probably going to want to define some structs/ADTs/whatever that only contain the data a given compilation stage is actively working with.
I've been thinking that what I need is a system that allows me to quickly define different lower-level datatypes for representing different views of the conceptual types and automate, to some degree, translation between them, so then each part of the system can work with objects designed specifically to be processed by it with minimal fuss.
A technical reason for avoiding those specialized types might be that the computer then has to spend more time transforming from one schema to the next. I would think that in practice this isn't any worse than having to do a lot of null checks.
A more human reason is that it could bean a combinatorical explosion of AST types. I guess this is where my idea about lightweight variations comes in.
In TypeScript this kind of thing might not be so bad, since any object can be downcast with no cost to a type that contains a subset of the information, and variations on types can be easily defined without even necessarily being named, e.g. `ASTNode & HasResultType & HasSourceLocation`.
by nextaccountic on 8/14/23, 9:11 AM
This article really needs an (2013), because of
> Both GHC’s core IR, Ur/Web's core and Coq are explicitly typed in this way.
This is how GHC used to do, but in 2018 GHC adapted the pattern "trees that grow"
The paper https://www.jucs.org/jucs_23_1/trees_that_grow/jucs_23_01_00... (the paper is actually from 2017)
Light GHC docs on it https://gitlab.haskell.org/ghc/ghc/-/wikis/implementing-tree... (I recall seeing more through docs somewhere). I will quote
> The new HsSyn AST supporting the TTG idiom (from now on referred to as TTG HsSyn) is designed to subsume five different representations of Haskell syntax:
> AST GhcPs: the AST used in GHC's parsing phase > AST GhcRn: the AST used in GHC's renaming phase > AST GhcTc: the AST used in GHC's typechecking phase > AST TH: the AST used in Template Haskell > AST HSE: the AST used in an external tool such as Haskell-Src-Exts
This means that you have a single generic tree for all of those passes, but each pass has a different instantiation with has custom data
Also, some discussion of applications https://www.reddit.com/r/haskell/comments/7zo9mb/any_applica... and slides here http://data.tmorris.net/talks/trees-that-grow/bc4ae902ca1d3b...
The thing about this is to observe that, in the blog post in the OP, when you added data to a given type, it was inserted as a concrete type (named Type here), that is, you went from
```
    data Exp = Num Int
             | Bool Bool
             | Var Var
             | If Exp Exp Exp
             | Lambda Var Exp
             | App Exp Exp
```
to
```
    type TExp = (TExp', Type)
    data TExp' = TNum Int
               | TBool Bool
               | TVar Var
               | TIf TExp TExp TExp
               | TLambda Var TExp
               | TApp TExp TExp
```
But you could have kept it a generic type (here called a, rather than Type). This is a common pattern in both Haskell and Rust (and maybe other languages), and isn't the Trees That Grow pattern just yet, but it's like this:
```
    type GenericTExp a = (GenericTExp' a, a)
    data GenericTExp' a = TNum Int
               | TBool Bool
               | TVar Var
               | TIf TExp (GenericTExp a) (GenericTExp a)
               | TLambda Var (GenericTExp a)
               | TApp TExp (GenericTExp a)
```
Now you can recover the original TExpr with
```
    type TExpr = GenericTExpr Type
```
And you can recover the bare Expr with no added data type by putting () (called unit type) there
```
    type Expr = GenericTExpr ()
```
Now, the paper also deals with the problem of, how to add new variants to a data type (it has a | Something that is open ended; this is called "extension constructor" in the ghc wiki, and denoted by XExpr), and how to add different data types to different nodes (you need to, instead of directly adding Type to the generic parameter, add a marker and use type-level functions - called type families in Haskell - to select the data type you're adding). When you figure out all those complexities, you end up with a type-level machinery that will inevitably be similar to the TTG idiom
by teo_zero on 8/14/23, 8:39 AM
Am I the only one lost at the very first paragraph? It shows something that looks like a grammar, but I can't understand what "Num", "Int", "If" are... Are they literals? terms? types? And what does "data" mean?
So it's not a grammar. "data Exp" might describe each internal node of the AST, where the first word is the node's variety and the following words describe the children nodes. But what does "data Type" mean, then?
by gpderetta on 8/14/23, 8:16 AM
What about separate IR that incrementally add to the base IR without replacing it? They will have the same shape so they can be encoded efficiently I think, and can be traversed in parallel.
by BSEdlMMldESB on 8/14/23, 6:49 AM
2013!
what's the difference between running a program (interpreting it), and compiling it?
where's (and what's) the line that distinguishes the program from the output of the program where both parts are just software?
--
the most interesting thing about turning machines is the universal turing machine because it actually invents/defines/showcases software as we understand it nowadays; this makes me reflect on how the busy beaber problem just misses out on this realization
by 3cats-in-a-coat on 8/14/23, 8:19 AM
Restrictive static typing creates a lot of problems that wouldn't exist... if you didn't use restrictive static typing. Maybe because my language designs are toys I use in my projects (for now), but everything is nested heterogenous sparse sequences, so it's a list, but also a map, and I can attach any associative keys to it without breaking the pattern match on the rest of it, and I'm never out of place to add information should I need it.
I think the problem is we write compilers like it's the 90s, and it's not the 90s.
But also it's full of data structure solutions to this even for static types: give nodes a simple autoincrementing id (if your language has no object maps), then put any extra information in a sidecar if you want. Think relations and joins. It's easy.