from Hacker News

Writing a C compiler in 500 lines of Python

by vgel on 9/4/23, 7:17 PM with 165 comments

  • by brundolf on 9/4/23, 7:45 PM

    > Instead, we'll be single-pass: code generation happens during parsing

    IIRC, C was specifically designed to allow single-pass compilation, right? I.e. in many languages you don't know what needs to be output without parsing the full AST, but in C, syntax directly implies semantics. I think I remember hearing this was because early computers couldn't necessarily fit the AST for an entire code file in memory at once

  • by mati365 on 9/4/23, 7:52 PM

    I made similar project in TypeScript[1]. Basically multipass compiler that generates x86 assembly, compiles it to binary and runs it. The worst thing were register allocator, designing IR code and assembler.

    [1] https://github.com/Mati365/ts-c-compiler

  • by Joker_vD on 9/5/23, 11:53 AM

    I am pretty certain the following is a valid "for"-loop translation:

        block
            ;; code for "i = 0"
            loop
                ;; code for "i < 5"
                i32.eqz
                br_if 1
            
                i32.const 1
                loop
                    if
                        ;; code for "i = i + 1"
                        br 2
                    else
                    end
            
                    ;; code for "j = j * 2 + 1"
    
                    i32.const 0
                end
            end
        end
    
    It doesn't require cloning the lexer so probably would still fit in 500 lines? But yeah, in normal assembly it's way easier, even in one-pass:

            ;; code for "i = 0"
        .loop_test:
            ;; code for "i < 5"
            jz  .loop_end
            jmp .loop_body
        .loop_incr:
            ;; code for "i = i + 1"
            jmp .loop_test
        .loop_body:
            ;; code for "j = j * 2 + 1"
            jmp .loop_incr
        .loop_end:
    
    Of course, normally you'd want to re-arrange things like so:

            ;; code for "i = 0"
            jmp .loop_test
        .loop_body:
            ;; code for "j = j * 2 + 1"
        .loop_incr:
            ;; code for "i = i + 1"
        .loop_test:
            ;; code for "i < 5"
            jnz .loop_body
        .loop_end:
    
    I propose the better loop syntax for languages with one-pass implementations, then: "for (i = 0) { j = j * 2 + 1; } (i = i + 1; i < 5);" :)
  • by tptacek on 9/4/23, 7:39 PM

    A time-honored approach!

    https://www.blackhat.com/presentations/win-usa-04/bh-win-04-...

    (minus directly emitting opcodes, and fitting into 500 lines, of course.)

  • by ak_111 on 9/5/23, 8:41 AM

    Somewhat unrelated question, but I think one of the second most difficult things of learning C for coders who are used to scripting languages is to get your head around how the various scaler data types like short, int, long,... (and the unsigned/hex version of each) are represented and how they relate to each other and how they relate to the platform.

    I am wondering if this complexity exists due to historical reasons, in other words if you were to invent C today you would just define int as always being 32, long as 64 and provide much more sane and well-defined rules on how the various datatypes relate to each other, without losing anything of what makes C a popular low-level language?

  • by kaycebasques on 9/4/23, 7:52 PM

    Is there a C compiler written in Python that aims for maximum readability rather than trying to get as much done under X lines of code?
  • by WalterBright on 9/4/23, 9:30 PM

    This looks a lot like the Tiny Pascal compiler that BYTE published a listing of back in 1978.

    http://www.trs-80.org/tiny-pascal/

    I figured out the basics of how a compiler works by going through it line by line.

  • by marcodiego on 9/4/23, 7:52 PM

    It is interesting to think that 500 lines of code is something one can write in one or two days. But, writing a C compiler in 500 of comprehensible code (even in python) is challenge in itself that may take months after a few years of solid learning.

    I wonder if is this a good path to becoming an extremely productive developer. If some one spends time developing projects like this, but for different areas... A kernel, a compressor, renderer, multimedia/network stack, IA/ML... Will that turn a good dev into a 0.1 Bellard?

  • by jll29 on 9/4/23, 11:42 PM

    Writing your own compiler

    - demystifies compilers, interpreters, linkers/loaders and related systems software, which you now understand. This understanding will no doubt one day help in your debugging efforts;

    - elevates you to become a higher level developer: you are now a tool smith who can make their own language if needed (e.g. to create domain specific languages embedded in larger systems you architect).

    So congratulations, on top of other forms of abstraction, you have mastered meta-linguistic abstraction (see the latter part of Structure and Interpretation of Computer Programs, preferably the 1st or 2nd ed.).

  • by mananaysiempre on 9/5/23, 9:59 AM

    > [Building parse trees] is really great, good engineering, best practices, recommended by experts, etc. But... it takes too much code, so we can't do it.

    It takes too much code in Python. (Not a phrase one gets to say often, but it’s generally true for tree processing code.) In, say, SML this sort of thing is wonderfully concise.

  • by meitham on 9/5/23, 12:30 PM

    Actually with SLY (https://sly.readthedocs.io) now dead, what is the recommended Lexer/Parser library in Python?
  • by nn3 on 9/4/23, 9:07 PM

    Just for comparison the LOCs for some other small C or C like compilers. It's not that far away from Ritchie's

    C4x86 | 0.6K (very close)

    small C (x86) | 3.1K

    Ritchie's earliest struct compiler | 2.3K

    v7 Unix C compiler | 10.2K

    chibicc | 8.4K

    Biederman's romcc | 25.0K

  • by Shocka1 on 9/7/23, 3:01 PM

    These kinds of posts are one of the things that keeps me coming back to HN. Right when I start thinking I'm a professional badass for implementing several features with great well tested code in record time, I stumble along posts like this that set me in my place.
  • by rcarmo on 9/4/23, 9:49 PM

    I have to wonder if there's a Scheme to WASM compiler out there someplace right now I haven't found yet.
  • by aldousd666 on 9/5/23, 1:49 AM

    This is crazy cool! Esolangs have been a hobby of mine, (more just an interest lately, since I haven't built any in a while,) so this is like a fun code golf game for compilation. Nice work, and even better, nice explanation article!
  • by varispeed on 9/5/23, 10:31 AM

    I wrote a C compiler back in the day as a learning exercise. It was the most fun and rewarding project.
  • by jokoon on 9/5/23, 10:28 AM

    I don't see he use match case... while it's clearly a good use case.
  • by MrYellowP on 9/5/23, 9:24 AM

    I am really confused by what people call compilers nowadays. This is now a compiler that takes input text and generates output text, which then gets read by a compiler that takes input text and generates JIT code for execution.

    This is more of a transpiler, than an actual compiler.

    Am I missing something?

  • by teddyh on 9/4/23, 8:06 PM

    For some value of “C”:

    > Notably, it doesn't support:

    > structs :-( would be possible with more code, the fundamentals were there, I just couldn't squeeze it in

    > enums / unions

    > preprocessor directives (this would probably be 500 lines by itself...)

    > floating point. would also be possible, the wasm_type stuff is in, again just couldn't squeeze it in

    > 8 byte types (long/long long or double)

    > some other small things like pre/post cremements, in-place initialization, etc., which just didn't quite fit any sort of standard library or i/o that isn't returning an integer from main()

    > casting expressions

  • by fan_of_yoinked on 9/4/23, 7:39 PM

    I love the graphic - would go see the worlds largest chomsky
  • by moomin on 9/5/23, 8:13 AM

    Inevitably we have to ask: and how many lines of C in library functions?
  • by hamilyon2 on 9/5/23, 6:43 AM

    So, given the python is an interpreter and very well understood, can we say that we are sure this compiler does not include Thompson virus?
  • by rhabarba on 9/4/23, 7:31 PM

    Finally, one can have inefficient C.
  • by ForOldHack on 9/5/23, 8:20 AM

    The *point* of a compiler is to compile itself.
  • by Jake_K on 9/5/23, 10:04 AM

    Interesting stuff
  • by golemarms on 9/5/23, 2:39 AM

    Cool. Now try writing a Python compiler in 500 lines of C.