from Hacker News

What I talk about when I talk about IRs

by surprisetalk on 6/13/25, 3:04 PM with 21 comments

  • by o11c on 6/17/25, 3:06 AM

    This is quite informative for an introduction to one particular type of IR. It's better than some others because it actually mentions, in some places, that there are alternatives. And since what people use an IR for varies a lot from program to program, there is no alternative that is always best (there are trends, but even then you should be aware of the cost comparison).

    Since this post talks a lot about SSA, it might be worth mentioning the thing that probably confuses people the most: how do you know whether a variable is valid in a given BB? It can be informative (but memory-inefficient, and probably optimization-inefficient) to require all incoming variables to go through phi nodes. Also think about how destructors and outward `goto`/`break` affect dominance in strange ways.
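
    A minimal sketch of that rule (toy CFG, made-up names, not from the post): a plain use of a value is only valid in blocks dominated by its definition, which is why a value defined on one arm of a diamond has to flow through a phi at the join.

        def dominators(cfg, entry):
            # cfg maps each block to the list of its predecessors
            blocks = set(cfg)
            dom = {b: set(blocks) for b in blocks}
            dom[entry] = {entry}
            changed = True
            while changed:
                changed = False
                for b in blocks - {entry}:
                    new = {b} | set.intersection(*(dom[p] for p in cfg[b]))
                    if new != dom[b]:
                        dom[b] = new
                        changed = True
            return dom

        # diamond CFG: bb0 branches to bb1/bb2, which join at bb3
        cfg = {"bb0": [], "bb1": ["bb0"], "bb2": ["bb0"], "bb3": ["bb1", "bb2"]}
        dom = dominators(cfg, "bb0")

        def use_is_valid(def_block, use_block):
            # non-phi uses only: a phi operand is checked against its
            # incoming edge instead of the phi's own block
            return def_block in dom[use_block]

        print(use_is_valid("bb0", "bb3"))  # True: bb0 dominates the join
        print(use_is_valid("bb1", "bb3"))  # False: bb3 needs a phi for it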

    The second understated thing about SSA is exactly how much additional data you store, and for how long. "Don't store the extra data" means many passes will have to recompute it, but "do store the extra data" makes it costlier when passes have to invalidate it. Also make sure you don't run out of RAM!
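
    A toy sketch of that trade-off (hypothetical names, not any real compiler's API): cache analysis results, recompute lazily on a miss, and make every IR-mutating pass invalidate what it breaks.

        class AnalysisCache:
            def __init__(self):
                self._results = {}

            def get(self, name, compute):
                if name not in self._results:  # "don't store" always lands here
                    self._results[name] = compute()
                return self._results[name]

            def invalidate(self, *names):
                # every pass that mutates the IR must remember to call this;
                # that's the cost of the "do store the extra data" choice
                for n in names:
                    self._results.pop(n, None)

        cache = AnalysisCache()
        doms = cache.get("domtree", lambda: "expensive dominator computation")
        # ...some pass rewrites the CFG...
        cache.invalidate("domtree")  # stale analysis data is worse than none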

    My own post (more focused on interpreters, but not exclusively so) probably talks more about alternative implementation choices across many compiler stages than any other source I've seen (though I do say "don't do that" for strictly-worse options, and there are some I still haven't gone into detail about):

    https://gist.github.com/o11c/6b08643335388bbab0228db763f9921...

    The main thing that annoys me about the linked post is that it falls into the trap of declaring a duality between "stack-based" and "register-based", when in fact there are so many options that the differences within each of those named categories can be bigger than the differences between them.

  • by pizlonator on 6/17/25, 2:42 AM

    What I talk about when I talk about IRs is the principle of isolated ugliness.

    A good IR means that the only contract between compiler passes is the IR itself. This means that individual passes might get as ugly as they need to get without affecting other passes. This then means that lots of different folks can write lots of different passes, and then you can glue them together somehow in a pass pipeline and it all works.
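
    A minimal sketch of that contract (made-up names, not LLVM's actual API): the pipeline hands each pass nothing but the IR and re-verifies the IR's laws at every pass boundary, so a misbehaving pass gets caught there instead of corrupting its neighbors.

        class Module:
            # stand-in IR container; real IRs hold functions, blocks, instructions
            def __init__(self, functions):
                self.functions = functions

            def verify(self):
                # stand-in for a real IR verifier enforcing "IR law"
                assert all(isinstance(f, str) for f in self.functions)

        def run_pipeline(module, passes):
            module.verify()
            for p in passes:     # each pass can be as ugly as it needs to be...
                p(module)
                module.verify()  # ...the contract is checked here, not by peers
            return module

        def drop_dead_functions(module):
            # example pass: knows nothing about any other pass, only the IR
            module.functions = [f for f in module.functions if f != "dead"]

        run_pipeline(Module(["main", "dead"]), [drop_dead_functions])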

    LLVM is a great example of this. You can be productive at writing LLVM passes without understanding any of the other passes in LLVM. Maybe you’ll come to understand some of them just because it’s informative and there’s cool stuff to learn. But even then you don’t have to understand the other passes that deeply. Writing a cool new pass that obeys the IR’s laws is not going to break the other passes. You’ll only break the other passes by breaking IR law.

    So what makes a good IR is, more than any other single thing, whether that IR has easy-to-understand laws and whether those laws are really followed. This allows you to isolate ugliness, which allows you to have a large team writing lots of passes. And then your compiler can drink the tears of your defeated competitors.

  • by mhh__ on 6/17/25, 1:42 AM

    I'm curious what the story is with V8 dropping sea of nodes (if I have that right).

    Is it an argument against sea of nodes, or just that it isn't pulling its weight for them at the moment?

  • by almostgotcaught on 6/17/25, 2:58 AM

    > If your static analysis is over SSA, then generally the static analysis is easier and (potentially) storing facts is easier. This is due to this property called sparseness. Where a static analysis over non-SSA programs has to store facts about all variables at all program points, an analysis over SSA need only store facts about all variables, independent of context.

    Not sure what this has to do with SSA. Maybe you're saying that if you don't have SSA, then value numbering requires extra bookkeeping to keep track of the names? That's true, but the bookkeeping only kicks in when a name is updated, not at every program point.
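
    For what it's worth, the dense/sparse contrast in the quoted passage comes down to the shape of the fact table; a toy sketch with illustrative names:

        # dense: one fact per (program point, variable) pair, restated everywhere
        dense_facts = {
            ("bb1:0", "x"): "nonzero",
            ("bb1:1", "x"): "nonzero",  # nothing changed, yet x is tracked here too
        }

        # sparse: one fact per SSA value; the single definition covers every use
        sparse_facts = {
            "%x1": "nonzero",
        }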

    > If we want to keep our information sparse, though, we have to add a new definition to the IR.

    I mean, this isn't a law of nature, it's just a design point - a question of whether the facts have to be serializable for some reason. If they don't (and in general, I can't think of a reason they would have to be), you can just keep things in memory and do either sparse or dense analysis on an as-needed basis. To wit, MLIR, an SSA IR, handily supports both sparse and dense dataflow analysis:

    https://github.com/llvm/llvm-project/blob/main/mlir/include/...

    https://github.com/llvm/llvm-project/blob/main/mlir/include/...

    Note that both analyses propagate lattices that support meet and join.
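
    For readers unfamiliar with the terms, a toy flat lattice of the kind such analyses propagate (purely illustrative, nothing to do with MLIR's actual classes) looks like this:

        TOP, BOTTOM = "top", "bottom"  # unknown / contradiction

        def join(a, b):  # least upper bound: merges facts from multiple paths
            if a == BOTTOM: return b
            if b == BOTTOM: return a
            return a if a == b else TOP

        def meet(a, b):  # greatest lower bound: the dual of join
            if a == TOP: return b
            if b == TOP: return a
            return a if a == b else BOTTOM

        assert join(3, 3) == 3
        assert join(3, 4) == TOP  # disagreeing paths lose the constant
        assert meet(3, TOP) == 3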

    > Anyway, SSA only stores type information about instructions and does not encode information that we might later learn in the IR. With basic SSA, there’s not a good way to encode refinements…

    This is true, of course, but I struggle to think of a reasonable semantics that would enable you to refine/change/alter the type of a value through anything that wasn't an operation. The conditional example is specious, but it could be rectified if it were modeled correctly: the conditional is an operation that has operands and two regions, each with a basic block with basic block args, and it's the types of the bb args that show up as "refined". MLIR's scf.if isn't "isolated from above" and its bbs don't have args, but it could have both of those things, and such an operation would model what you want precisely, in the IR/types themselves.
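
    A sketch of that modeling, written as Python data for lack of a real dialect (every name here is hypothetical): the conditional becomes an op with two regions, and each region's entry block takes the scrutinized value as a block arg whose type carries the refinement.

        cond = {
            "op": "my.if",                    # hypothetical op, not scf.if
            "operands": [("%v", "!int")],     # unrefined type outside the op
            "then": {
                "block_args": [("%v0", "!nonzero_int")],  # refined by the branch
                "body": ["...uses %v0 at the refined type..."],
            },
            "else": {
                "block_args": [("%v1", "!zero_int")],
                "body": ["..."],
            },
        }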