by 10098 on 9/22/16, 1:13 AM with 149 comments
by robertelder on 9/22/16, 2:46 AM
When you think more about regexes this way, you realize that a regex is just a tiny description of a virtual machine (or emulator) that can process the simplest of instructions (check for 'a', accept '0-9', etc.). Each step in the regex is just a piece of bytecode that can execute, and if you turn a regex on its side you can visualize it as just a simple assembly program.
by sillysaurus3 on 9/22/16, 2:14 AM
Most people refuse to write one because it's so easy not to. Why bother?
It will make you a better coder for the rest of your life.
Let's make a list of "power projects" like this. A bytecode interpreter, a software rasterizer... What else?
by tominous on 9/22/16, 2:32 AM
A great example of this mindset is the guy who bought a mainframe. [1]
Refuse to be placed in a silo. Work your way up and down the stack and you'll be much better placed to solve problems and learn from the patterns that repeat at all levels.
by wahern on 9/22/16, 3:20 AM
1) Domain-specific languages (DSLs)
2) Virtual machines (or just explicit state machines more generally)
What I mean is, alot of problems could be solved cleanly, elegantly, more safely, and more powerfully by using one (or both) of the above. The problem is that when people think DSL or VM, they think big (Scheme or JVM) instead of thinking small (printf). A DSL or VM doesn't need to be complex; it could be incredibly simple but still be immensely more powerful than coding a solution directly in an existing language using its constructs and APIs.
Case in point: the BSD hexdump(1) utility. POSIX defines the od(1) utility for formatting binary data as text, and it takes a long list of complex command-line arguments. The hexdump utility, by contrast, uses a simple DSL to specify how to format output. hexdump can implement almost every conceivable output format of od and then some using its DSL. The DSL is basically printf format specifiers combined with looping declarations.
I got bored one day and decided to implement hexdump as a library (i.e. "one hexdump to rule them all"), with a thin command-line wrapper that emulates the BSD utility version. Unlike BSD hexdump(1) or POSIX od(1), which implement everything in C in the typical manner, I decided to translate the hexdump DSL into bytecode for a simple virtual machine.
http://25thandclement.com/~william/projects/hexdump.c.html
The end result was that my implementation was about the same size as either of those, but 1) could built as a shared library, command-line utility, or Lua module, 2) is more performant (formats almost 30% faster for the common outputs, thanks to a couple of obvious, easy, single-line optimizations the approach opened up) than either of the others, and 3) is arguably easier to read and hack on.Granted, my little hexdump utility doesn't have much value. I still tend to rewrite a simple dumper in a couple dozen lines of code for different projects (I'm big on avoiding dependencies), and not many other people use it. But I really liked the experience and the end result. I've used simple DSLs, VMs, and especially explicit state machines many times before and after, but this one was one of the largest and most satisfying.
The only more complex VM I've written was for an asynchronous I/O SPF C library, but that one is more difficult to explain and justify, though I will if pressed.
by gopalv on 9/22/16, 2:17 AM
I've had use for that knowledge, nearly fifteen years later - most of the interesting learnings about building one has been about the inner loop.
The way you build a good interpreter is upside-down in tech - the system which is simpler often works faster than anything more complicated.
Because of working on that, then writing my final paper about the JVM, contributing to Perl6/Parrot and then moving onto working on the PHP bytecode with APC, my career went down a particular funnel (still with the JVM now, but a logical level above it).
Building interpreters makes you an under-techtitect, if that's a word. It creates systems from the inner loop outwards rather than leaving the innards of the system for someone else to build - it produces a sort of double-vision between the details and the actual goals of the user.
[1] - "Design of the Portable.net interpreter"
by _RPM on 9/22/16, 3:57 AM
It's interesting that he implemented function calls via a jump. In my VM a function is just mapped to a name (variable), so functions are first class. When the VM gets to a CALL instruction, it loads the bytecode from the hash table (via a lookup of the name).
Since this is a procedural language where statements can be executed outside of a function, implementing the functions as a jump would be difficult because there would need to be multiple jumps between the function definition and statements that aren't in a function.
I really wish my CS program had a compilers class, but unfortunately they don't, so I had to learn everything on my own.
by briansteffens on 9/22/16, 2:18 AM
I recently threw something together sort of like this, just for fun (I like your interpreter's name better though): https://github.com/briansteffens/bemu
It's crazy how much these little projects can clarify your understanding of concepts that seem more complicated or magical than they really are.
by anaccountwow on 9/22/16, 4:23 AM
Given it has some parts already written in the interest of time...
by oops on 9/22/16, 12:32 PM
(You basically implement every layer starting with the CPU and finishing with a working Tetris game)
by douche on 9/22/16, 3:07 PM
[1] this one, IIRC https://sourceforge.net/projects/circuit/
by foobarge on 9/22/16, 7:25 AM
by reidrac on 9/22/16, 7:47 AM
It is very close to a bytecode interpreter, only that it comes with a specification that is actually the opcode list for the MOS 6502 (and few details you need to take into account when implementing that CPU).
Besides there are cross-compilers that allows you to generate 6502 code from C for your specific VM (see cc65).
by memsom on 9/22/16, 10:44 AM
ORG START
START: ST_B 10
LOOP: ST_B 10
ADD_B ;;value will go back on stack
LD_B V1
SM_B V1 ;;value we use next loop
SM_B V1 ;;value we compare
SM_B V1 ;;value we echo to console
TRP 21 ;;writes to the console
ST_S '',13,10,$
TRP 21 ;;writes to the console
CMP_B 50 ;;compares stack to the constant
JNE LOOP
ST_S 'The End',13,10,$
TRP 21 ;;writes to the console
END
by pka on 9/22/16, 11:58 AM
Similarly, annotating, transforming, folding, pattern matching on, CPS transforming etc. the produced AST is pretty trivial in a language that supports these constructs. And again, a nightmare in C.
That leaves codegen, but using the right abstractions it turns into a very manageable task as well.
Here's a compiler written in Haskell for LLVM [0].
by philippeback on 9/22/16, 6:48 AM
http://www.lukas-renggli.ch/blog/petitparser-1
http://www.themoosebook.org/book/internals/petit-parser
This include the dynamic generation of blocks and arrows style things...
by philippeback on 9/22/16, 6:46 AM
Lots of optimizations going on for OpenVM.
https://github.com/OpenSmalltalk/opensmalltalk-vm
Interesting bit: VM is written in Slang and transformed into C then compiled.
So you can livecode your VM. In the VM simulator.
by elcct on 9/22/16, 9:17 AM
by rosstex on 9/22/16, 2:52 AM
http://www.multigesture.net/articles/how-to-write-an-emulato...
by curtfoo on 9/22/16, 2:19 AM
Okay I guess technically I used a parser generator that I then modified to build an AST and convert it into assembly-like code that fed the interpreter.
by reacweb on 9/22/16, 7:40 AM
by loeg on 9/22/16, 3:42 AM
by dpratt on 9/22/16, 2:43 AM