by LightMachine on 5/17/24, 2:23 PM with 253 comments
by Twirrim on 5/17/24, 5:24 PM
    def sum(depth, x):
        if depth == 0:
            return x
        else:
            fst = sum(depth-1, x*2+0) # adds the fst half
            snd = sum(depth-1, x*2+1) # adds the snd half
            return fst + snd
    print(sum(30, 0))
Under pypy3 it executes in 0m4.478s, single threaded. Under Python 3.12, it executes in 1m42.148s, again single threaded. I mention that because you include benchmark information:
CPU, Apple M3 Max, 1 thread: 3.5 minutes
CPU, Apple M3 Max, 16 threads: 10.26 seconds
GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds
The Bend single-threaded version has been running for 42 minutes on my laptop, is consuming 6GB of memory, and still hasn't finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04). That seems to be an incredibly slow interpreter. Has this been tested or developed on anything other than Macs / aarch64? I appreciate this is early days, but it's hard to get excited about what seems to be incredibly slow performance from the really simple example you give. If the simple stuff is slow, what does that mean for the complicated stuff?
If I get a chance tonight, I'll re-run it with the `-s` argument and see if I get anything helpful.
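For reference, the recursion above just visits each 30-bit value once and adds it up, so the expected answer has a closed form. A minimal sanity check in plain Python (my own illustration, not part of the benchmark above):

    # The recursion enumerates the leaves 0 .. 2**30 - 1 and sums them,
    # so the result is n*(n-1)/2 with n = 2**30.
    n = 2**30
    closed_form = n * (n - 1) // 2
    print(closed_form)           # 576460751766552576
    # Bend's u24 integers wrap mod 2**24, and this particular total is a
    # multiple of 2**24, so a u24 run would report 0.
    print(closed_form % 2**24)   # 0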
by vegadw on 5/17/24, 6:44 PM
by CorrectingYou on 5/17/24, 7:08 PM
by andrewp123 on 5/17/24, 7:58 PM
by praetor22 on 5/17/24, 10:39 PM
Here are some notes from my first impressions and after skimming through the paper. And yes, I am aware that this is very very early software.
1. Bend looks like an extremely limited DSL. No FFI. No way of interacting with raw buffers. Weird 24-bit floating-point format.
2. There's a reason why ICs are not relevant: performance is and will always be terrible. There is no other way to put it: graph traversal simply doesn't map well onto hardware.
3. The premise of optimal reduction is valid. However, you still need to write the kernels in a way that can be parallelized (i.e. no data dependencies, use of recursion).
4. There are no serious examples that directly compare Bend/HVM code with its equivalent OMP/CUDA program. How am I supposed to evaluate the reduction in implementation complexity and what to expect in performance? So many claims, so few actual comparisons.
5. In the real world of high-performance parallel computing, tree-like structures are non-existent. Arrays are king, because of the physical nature of how memory works at the hardware level. And do you know what works best on mutable contiguous memory buffers? Loops. We'll see when HVM implements this.
In the end, what we currently have is a half-baked language that is (almost) fully isolated from external data, extremely slow, and a massive abstraction over the underlying hardware (unutilised features: multilevel caches, tensor cores, SIMD, atomics).
I apologize if this comes across as harsh; I still find the technical implementation and the theoretical background very interesting. I'm simply not (yet) convinced of its usefulness in the real world.
by delu on 5/17/24, 5:12 PM
Since then, I was able to make some experiments with multithreading (thanks Rust) and getting very creative with shaders (thanks Shadertoy). But a general parallel language on the GPU? I'm super excited to play with this!
by ziedaniel1 on 5/17/24, 5:17 PM
I just wrote a simple loop in C++ to sum up 0 to 2^30. With a single thread without any optimizations it runs in 1.7s on my laptop -- matching Bend's performance on an RTX 4090! With -O3 it vectorizes the loop to run in less than 80ms.
    #include <iostream>

    int main() {
        long long sum = 0;  // 64-bit: the total (~5.8e17) overflows a 32-bit int
        for (int i = 0; i < 1024*1024*1024; i++) {
            sum += i;
        }
        std::cout << sum << "\n";
        return 0;
    }
by Arch485 on 5/17/24, 9:46 PM
I'm excited to see how this project progresses.
by jjovan1 on 5/18/24, 7:35 AM
by animaomnium on 5/17/24, 12:39 PM
I ask because a while back I was messing around with compiling interaction nets to C after reducing as much of the program as possible (without reducing the inputs), as a form of whole program optimization. Wouldn't be too much harder to target a shader language.
Edit: Oh I see...
> This repository provides a low-level IR language for specifying the HVM2 nets, and a compiler from that language to C and CUDA HVM
Will have to look at the code then!
https://github.com/HigherOrderCO/HVM
Edit: Wait nvm, it looks like the HVM2 cuda runtime is an interpreter, that traverses an in-memory graph and applies reductions.
https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...
I was talking about traversing an interaction net to recover a lambda-calculus-like term, which can be lowered to C a la lisp in small pieces with minimal runtime overhead.
Honestly the motivation is, you are unlikely to outperform a hand-written GPU kernel for like ML workloads using Bend. In theory, HVM could act as glue, stitching together and parallelizing the dispatch order of compute kernels, but you need a good FFI to do that. Interaction nets are hard to translate across FFI boundaries. But, if you compile nets to C, keeping track of FFI compute kernel nodes embedded in the interaction network, you can recover a sensible FFI with no translation overhead.
The other option is implementing HVM in hardware, which I've been messing around with on a spare FPGA.
by MrLeap on 5/17/24, 8:51 PM
by yetihehe on 5/17/24, 4:32 PM
> That's a 111x speedup by doing nothing. No thread spawning, no explicit management of locks, mutexes. We just asked bend to run our program on RTX, and it did. Simple as that. Note that, for now, Bend only supports 24-bit machine ints (u24), thus, results are always mod 2^24.
Ahh, not even 32-bit? Hmm, that seems pretty arbitrary for someone not accustomed to GPUs and wanting to solve problems requiring 64 bits (a gravitational simulation of the solar system at millimeter resolution could use ~58-bit ints for position).
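For what it's worth, here is the back-of-envelope arithmetic behind a figure like ~58 bits (my own sketch; treating "solar system" as roughly +/- 1,000 AU is an assumption):

    import math

    AU_M = 1.496e11                      # one astronomical unit in metres
    # A signed 58-bit coordinate gives about +/- 2**57 steps of 1 mm each:
    half_range_au = (2**57 * 1e-3) / AU_M
    print(round(half_range_au))          # ~963 AU of reach per axis
    # Conversely, covering +/- 1,000 AU at 1 mm resolution needs about:
    print(round(math.log2(2 * 1000 * AU_M * 1e3), 1))   # ~58.1 bits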
by ruste on 5/17/24, 3:15 PM
by zackmorris on 5/17/24, 6:44 PM
We still see this today in how languages go out of their way to implement higher-order method libraries (map/reduce/filter), but then under the hood there is no multithreading; they just expect the developer to annotate their loops as parallel, because the languages aren't formal enough to know about side effects in the innermost logic and don't support immutability or performant pass-by-value semantics with copy-on-write anyway. So we end up with handwavy languages like Rust that put all of that mental load onto the developer for basically no gain; they just save memory by performing computation in-place imperatively.
I also like how Bend sidesteps the nonexistence of highly scaled symmetric multiprocessing CPUs by supporting GPUs. It makes the argument moot that GPUs can't be stopped because they're too big to fail. Julia is the only other language I've seen that tries this. I wish Clojure did, although it's been a long time since I followed it so maybe it has some parallelism?
I would have dearly loved to work on something like Bend, had someone solved the funding issue. Nobody wants to pay for pure research, and nobody sees the need for languages that do what everyone else is doing except easier. We have Kickstarter for widgets and Patreon for influencers, but makers have to bootstrap themselves or learn everything about finance or live in the right city or have a large network to hopefully meet an angel investor or work in academia and lose all rights to what they invent while spending the majority of their time hustling for grants anyway. So it just never happens and we're stuck with the same old busted techniques. Like how Hollywood only has money for sequels and reboots or the recording industry only has money for canned corporate music and hits from already famous artists and yet another cover that yanks the original better song off the radio.
A quarter of a century can go by in the blink of an eye if you get suckered into building other people's dreams as a people-pleaser. Be careful what you work on.
by npalli on 5/17/24, 10:30 PM
    function sum(depth, x)
        if depth == 0
            return x
        else
            fst = sum(depth-1, x*2+0)
            snd = sum(depth-1, x*2+1)
        end
        return fst + snd
    end
    println(sum(30,0))
by notfed on 5/17/24, 5:10 PM
by robust-cactus on 5/17/24, 9:24 PM
by anentropic on 5/17/24, 11:19 AM
I would say that the play on words that gives the language its name ("Bend") doesn't really make sense...
https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
> Bending is the opposite of folding. Whatever fold consumes, bend creates.
But in everyday language bending is not the opposite of folding; they are more or less the same thing. Why not "unfold", which also has a connotation of "the process of happening", as well as being merely the opposite of folding?
I have a question about the example code and output for bending:
    type Tree:
      Node { ~lft, ~rgt }
      Leaf { val }

    def main():
      bend x = 0:
        when x < 3:
          tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
        else:
          tree = Tree/Leaf { val: 7 }
      return tree

    tree = fork(0)
    tree = ![fork(1), fork(1)]
    tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
    tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]], ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
    tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]
Where does the initial "tree = fork(0)" come from?
by mjaniczek on 5/17/24, 1:12 PM
https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...
It's magical how the GPU version is basically flat (although with a high runtime init cost).
by gsuuon on 5/17/24, 8:56 PM
by klabb3 on 5/17/24, 5:34 PM
Tried to see what the language is like beyond hello world and found the guide[1]. It looks like a Python and quacks like a Haskell? For instance, variables are immutable, and tree-like divide-and-conquer data structures/algorithms are promoted for getting good results. That makes sense I guess! I'm not surprised to see a functional core, but I'm surprised to see the Pythonic frontend, not that it matters much. I must say I highly doubt it will make it much easier for Python devs to learn Bend, though I don't know if that's the goal.
What are some challenges in programming with these kinds of restrictions in practice? Also, are there good FFI options?
[1]: https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
by davidw on 5/17/24, 4:28 PM
by api on 5/17/24, 11:21 AM
by smusamashah on 5/17/24, 10:59 PM
But the demo gif is probably the best I have seen in a Github readme. I watched it till the end. It was instantly engaging. I wanted to see the whole story unfold.
by darlansbjr on 5/17/24, 4:06 PM
by anonzzzies on 5/17/24, 10:43 AM
by egnehots on 5/17/24, 5:15 PM
by magnio on 5/17/24, 1:56 PM
I know the docs say this will be fixed soon, but what is the main reason for restricting number types to 24 bits? I saw in the code that they are wrappers around the 32-bit system number types, so what prevents Bend from changing them to U32(u32) right now?
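I don't know the actual layout HVM2 uses, but a common reason for odd widths like 24 bits is that the numeric payload has to share a machine word with tag/metadata bits. A purely illustrative sketch of that kind of packing (the 8/24 split here is an assumption for illustration, not HVM2's real format):

    TAG_BITS = 8
    PAYLOAD_BITS = 32 - TAG_BITS
    PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1          # 0xFFFFFF

    def pack(tag: int, value: int) -> int:
        # the value silently wraps mod 2**24, like Bend's u24
        return (tag << PAYLOAD_BITS) | (value & PAYLOAD_MASK)

    def unpack(word: int) -> tuple[int, int]:
        return word >> PAYLOAD_BITS, word & PAYLOAD_MASK

    print(unpack(pack(tag=3, value=20_000_000)))    # (3, 3222784)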
by throwaway2562 on 5/17/24, 10:51 AM
by jes5199 on 5/17/24, 12:22 PM
by mattnewport on 5/19/24, 10:21 AM
It seems like this is actually an elegant typed functional language, but the Python-style syntax looks ugly and verbose, as if it's trying to hide that, compared to something more ML/F#- or Haskell-inspired.
I'll try to get past that, though, as it does look like there's something pretty interesting here.
by magicalhippo on 5/18/24, 12:20 PM
On the CPU, there's typically a threshold where dividing and coordinating the parallel work takes more time than simply doing the work on a single thread. Thus you can make the overall runtime much faster by not dividing the work all the way down, but rather stopping at that optimal threshold and then just looping over the remaining work in the worker threads.
How does this work on the GPU using Bend? Been too long since I did any GPU programming.
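For what it's worth, the CPU-side pattern described above looks roughly like this (a minimal Python sketch of my own, not Bend code; the cutoff value is arbitrary):

    from concurrent.futures import ProcessPoolExecutor

    CUTOFF = 1 << 20                       # below this, don't split any further

    def range_sum(lo: int, hi: int) -> int:
        if hi - lo <= CUTOFF:
            return sum(range(lo, hi))      # sequential leaf work: a plain loop
        mid = (lo + hi) // 2
        return range_sum(lo, mid) + range_sum(mid, hi)

    def parallel_range_sum(lo: int, hi: int, workers: int = 16) -> int:
        # split into one top-level chunk per worker to keep the sketch short
        step = (hi - lo) // workers
        bounds = [(lo + i * step, lo + (i + 1) * step) for i in range(workers)]
        bounds[-1] = (bounds[-1][0], hi)
        with ProcessPoolExecutor(workers) as pool:
            return sum(pool.map(range_sum, *zip(*bounds)))

    if __name__ == "__main__":
        print(parallel_range_sum(0, 1 << 30))   # same result as the tree recursion

The cutoff trades away some parallelism for lower scheduling overhead, which is exactly the threshold effect described above.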
by KingOfCoders on 5/18/24, 4:09 AM
12x for 16 threads
51x for 16,000 threads
Can someone point me to a website where it explains that this is the "ideal speedup"? Is there a formula?
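The formula usually meant by "ideal speedup" is Amdahl's law (Gustafson's law is the scaled-workload variant): with a parallel fraction p spread over n workers, speedup = 1 / ((1 - p) + p/n). A small sketch; the p used here is an illustrative guess, not anything measured from Bend:

    def amdahl(p: float, n: int) -> float:
        # p = fraction of the work that parallelizes, n = number of workers
        return 1.0 / ((1.0 - p) + p / n)

    print(round(amdahl(0.95, 16), 1))       # 9.1x
    print(round(amdahl(0.95, 16_000), 1))   # 20.0x -- extra threads stop paying off
    print(amdahl(1.0, 16))                  # 16.0x, the perfectly parallel case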
by highfrequency on 5/17/24, 5:16 PM
> CPU, Apple M3 Max, 16 threads: 10.26 seconds
Surprised to see a more than linear speedup in CPU threads. What’s going on here?
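For reference, the arithmetic behind that observation (my own check of the numbers quoted above):

    single_thread_s = 3.5 * 60     # 210 s
    sixteen_thread_s = 10.26
    print(round(single_thread_s / sixteen_thread_s, 1))   # 20.5x observed
    print(round(single_thread_s / 16, 2))                 # 13.12 s would be a linear 16x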
by hintymad on 5/17/24, 7:48 PM
by mbforbes on 5/17/24, 9:58 PM
Every time I try to write shaders, or even peek through my fingers at CUDA C(++) code, I recoil in disbelief that we don't have high level programming yet on the GPU. I can't wait until we do. The more great projects attacking it the better in my book.
by i5heu on 5/18/24, 10:45 AM
I think this is a much more practical approach and I hope this will give some inspiration to this possibility.
by runeks on 5/19/24, 5:53 PM
As I understand, it's a lot of work because there's no common way to target these different GPUs. Is this correctly understood?
by croemer on 5/17/24, 7:04 PM
by temp123789246 on 5/17/24, 10:35 PM
I’ve been watching HVM for a while and think it’s extremely cool.
My intuition is that this will eventually be a really big deal.
by britannio on 5/17/24, 5:47 PM
by xiaodai on 5/17/24, 1:21 PM
by flakiness on 5/17/24, 7:18 PM
by zmmmmm on 5/18/24, 4:15 AM
by anonzzzies on 5/17/24, 12:18 PM
by shadowpho on 5/17/24, 4:02 PM
by gigatexal on 5/17/24, 5:55 PM
tested on: CPU - Apple M3 Max, GPU - NVIDIA RTX 4090
But how? I thought eGPUs don't work on Apple Silicon, and the PCIe-equipped Mac Pro is still M2-based, no?
by exitheone on 5/17/24, 5:20 PM
Question: Does this take into account memory bandwidth and caches between cores? Because getting them wrong can easily make parallel programs slower than sequential ones.
by KeplerBoy on 5/17/24, 4:22 PM
210 seconds (3.5 minutes) to 10.5 seconds is a 20x speedup, which isn't really expected.
by netbioserror on 5/17/24, 10:52 PM
by mccoyb on 5/17/24, 12:40 PM
I’d also love to find an example of writing a small interpreter in Bend - which runs on the GPU.
by mattdesl on 5/17/24, 4:02 PM
by funny_name on 5/17/24, 8:06 PM
by zacksiri on 5/17/24, 9:18 AM
by anonzzzies on 5/17/24, 4:55 AM
Now I just need a Common Lisp implemented using it!
by idiomaxiom on 5/18/24, 10:12 PM
I am really enjoying this implementation :)
by programjames on 5/18/24, 7:50 PM
by thinking_banana on 5/17/24, 11:55 PM
by kerkeslager on 5/17/24, 9:00 PM
by KingOfCoders on 5/18/24, 4:21 AM
by totorovirus on 5/19/24, 3:32 PM
by DrNosferatu on 5/17/24, 1:55 PM
by buckley_1991 on 5/16/24, 3:50 AM
by wolfspaw on 5/17/24, 8:49 PM
Python-like + High-performance.
And, different from Mojo, it's fully open-source.
by IsTom on 5/17/24, 7:43 PM
by markush_ on 5/17/24, 8:01 PM
by kkukshtel on 5/17/24, 4:25 PM
by toastal on 5/18/24, 3:27 PM
by naltroc on 5/17/24, 1:44 PM
by abeppu on 5/17/24, 5:43 PM
- surely for `3 x 3 = 9`, there is some concept of primitive operations?
- I get that replacement of patterns in a graph can be done in parallel, but (a) identifying when a rewrite rule should apply and (b) communicating the state of the updated graph to worker threads and (c) organizing worker threads to agree on which does each task all take some effort. When is this more work than the original computation (as in the 3x3 example)?
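Not an answer about HVM's internals, but a generic picture of what those questions are about: a rewrite system repeatedly scans for redexes (here, operator nodes whose children are already numbers) and replaces them, and the redexes found in one pass are independent, so they could in principle be handled by different workers. The toy below is my own constant-folding example, not HVM's graph representation or rule set:

    import operator

    OPS = {"+": operator.add, "*": operator.mul}

    def step(term):
        # Fold every operator node whose children are already numbers;
        # returns (new_term, changed).
        if isinstance(term, int):
            return term, False
        op, a, b = term
        if isinstance(a, int) and isinstance(b, int):   # a redex, e.g. ("*", 3, 3)
            return OPS[op](a, b), True
        a2, ca = step(a)
        b2, cb = step(b)
        return (op, a2, b2), ca or cb

    def normalize(term):
        changed = True
        while changed:
            term, changed = step(term)
        return term

    print(normalize(("+", ("*", 3, 3), ("*", 2, 5))))   # 19

Even in this toy, most of the work is walking the term to find redexes rather than doing the multiplications themselves, which is the kind of overhead the 3 x 3 example points at.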
by chc4 on 5/17/24, 7:03 PM
Before you reply "these are things we can address in the future": that doesn't matter. Everyone can address everything in the future. They are currently hard technical barriers to its use, with no way of knowing the level of effort that will be required or the knock-on effects, especially since some of these issues have been "we can fix that later" for ten years.
I also highly recommend changing your benchmark numbers from "interactions per second" to a standard measurement like FLOPS. No one else on earth knows how many of those interactions are pure overhead from your evaluation semantics and not doing useful work. They come across as attempting to wow an audience with high numbers rather than communicating an apples-to-apples comparison with other languages.
by efferifick on 5/17/24, 12:06 PM
by exabrial on 5/17/24, 2:53 PM
Eeek.
by JayShower on 5/18/24, 1:31 AM
by jgarzon on 5/18/24, 4:42 AM
by hypersimplex on 5/18/24, 9:39 PM
by gingfreecss on 5/17/24, 11:57 AM
by Archit3ch on 5/17/24, 7:00 PM
by tinydev on 5/17/24, 1:56 AM
by 3abiton on 5/17/24, 11:51 PM
Okay, I'll have what you're having.
by light_hue_1 on 5/17/24, 5:51 PM
    module Main where

    sum' :: Int -> Int -> Int
    sum' 0 x = x
    sum' depth x = sum' (depth - 1) ((x * 2) + 0) + sum' (depth - 1) ((x * 2) + 1)

    main = print $ sum' 30 0
Runs in 2.5s. Sure it's not on a GPU, but it's faster! And things don't get much more high level. If you're going to promise amazing performance from a high-level language, I'd want to see a comparison against JAX.
It's an improvement over traditional interaction nets, sure! But interaction nets have always been a failure performance-wise. Interaction nets are the PL equivalent of genetic algorithms in ML: they sound like a cool idea and have a nice story, but they always seem to be a dead end.
Interaction nets optimize parallelism at the cost of everything else. Including single-threaded performance. You're just warming up the planet by wasting massive amounts of parallel GPU cores to do what a single CPU core could do more easily. They're just the wrong answer to this problem.