by j0e1 on 9/12/22, 3:00 AM with 79 comments
by chillee on 9/12/22, 6:51 AM
1. In PyTorch (and other array programming libraries like Numpy), the operations being passed around are tensors/arrays (i.e. large chunks of memory). Thus, += is overloaded to mean "in-place write" to the arrays.
So, `+` vs `+=` is the equivalent of

    a: float[1000]
    b: float[1000]
    for i in [0, 1000]:
        b[i] = a[i] + 2

vs.

    a: float[1000]
    for i in [0, 1000]:
        a[i] = a[i] + 2
The main performance advantage comes from: 1. no need to allocate an extra array, and 2. using less memory overall, so the various cache levels can work better. It has nothing to do with Python bytecodes.
2. As for whether it generally makes sense to do this optimization manually... Usually, PyTorch users don't use in-place operations, as they're a bit uglier mathematically and come with various foot-guns/restrictions that users find confusing. Generally, it's best to leave this optimization to an optimizing compiler.
3. PyTorch in general does support using in-place operations during training, albeit with some caveats.
(PS) 4. Putting everything on one line (as some folks suggest) is almost certainly not going to help performance - the primary performance bottlenecks here have almost nothing to do with CPU perf.
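A quick way to see the allocation difference described above, sketched with NumPy (the same in-place semantics apply to PyTorch tensors; the buffer-address check is just one way to observe it):

```python
import numpy as np

a = np.zeros(1000, dtype=np.float32)
addr = a.__array_interface__['data'][0]  # address of a's underlying buffer

a += 2.0  # in-place: writes into the existing buffer
assert a.__array_interface__['data'][0] == addr

b = a + 2.0  # out-of-place: allocates a brand-new array
assert b.__array_interface__['data'][0] != addr
```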
by FabHK on 9/12/22, 4:12 AM
> Changing this back to the original implementation fixed an error I was getting when doing textual inversion on Windows
https://github.com/lstein/stable-diffusion/commit/62863ac586...
by nodja on 9/12/22, 5:33 AM
As everyone said, this is more performant because x is being modified in place. The reason it wasn't done in place originally is that you can't train a neural network through an operation done in place. During training, the network literally walks back through all the operations that were done to see how well they performed, so they can be adjusted using a secondary value called a gradient; this happens during the backward pass. If you modify something in place, you're essentially overwriting the input values that were passed to that function, and by extension the output values of the function called before it, essentially breaking the network chain - unless you also copy the inputs together with the gradients, which would cause an even worse performance hit and be a memory hog.
The breakage bug later in the issue is proof of this: when sampling to generate an image, only the forward pass is run on the network, but textual inversion requires you to train the network and therefore do the backward pass, triggering the error since the dependency graph is broken. I should also note that technically the add operation should be safe to do in place, as it's reversible, but I'm not a PyTorch expert so I'm not sure exactly what's going on in there.
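The autograd failure described above is easy to reproduce; a minimal sketch (sigmoid is a convenient example because its backward pass needs its own saved output, so editing that output in place breaks the graph):

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.sigmoid(x)  # backward needs y itself: dsigmoid/dx = y * (1 - y)
y.add_(1.0)           # in-place write bumps y's version counter

err = None
try:
    y.sum().backward()
except RuntimeError as e:
    # "one of the variables needed for gradient computation has been
    # modified by an inplace operation"
    err = e
print(err)
```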
by ironhaven on 9/12/22, 3:34 AM
by JonathonW on 9/12/22, 3:16 AM
I'd assumed that overall performance in Stable Diffusion was limited by the code running on the GPU, with Python performance being a fairly minor factor-- but I guess that's not the case?
by staticassertion on 9/12/22, 4:12 AM
This isn't a case of "the Python interpreter is bad"; it's just that the code is doing what the user asked it to do - create a completely new copy of the data, then overwrite the old copy with it. Immutable operations like this are slow; mutating the value in place (what += does) is fast.
Granted, a compiled language could recognize that you're doing this, but it also might not - are `+` and `+=` semantically identical such that the compiler can replace one with the other? Maybe? Probably not, if I had to guess. The correct answer is to just use the faster operation, as it is in any language.
I don't know the type of `x`, but I'd suggest another optimization here would be to:
a) Preallocate the buffer before mutating it 3x (which is still likely forcing some allocations)
b) Reuse that buffer if it's so important; store it in `self` and clear it before use.
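Suggestions (a) and (b) could look something like this NumPy sketch (the class and method names here are made up for illustration, not from the Stable Diffusion code):

```python
import numpy as np

class ScratchAdder:
    """Hypothetical helper: allocate one scratch buffer up front and reuse it."""

    def __init__(self, n):
        self._buf = np.empty(n, dtype=np.float32)  # (a) preallocate once

    def combine(self, a, b):
        # (b) reuse the stored buffer; out= writes in place, no fresh allocation
        np.add(a, b, out=self._buf)
        self._buf += 2.0
        return self._buf

adder = ScratchAdder(4)
out1 = adder.combine(np.ones(4, np.float32), np.ones(4, np.float32))
out2 = adder.combine(np.zeros(4, np.float32), np.zeros(4, np.float32))
assert out1 is out2  # same buffer reused across calls
```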
by datalopers on 9/12/22, 3:50 AM
by teruakohatu on 9/12/22, 3:09 AM
by eru on 9/12/22, 3:12 AM
I'm wondering, because recent versions have improved performance a lot. 3.11 is much faster than 3.10, and what's in 3.12 is already much faster than 3.11.
by brrrrrm on 9/12/22, 3:24 AM
by Waterluvian on 9/12/22, 3:21 AM
Many times I've had to choose between making my Python code more legible and getting free performance.
The thing I like about JavaScript is that I can _usually_ trust the JIT to make my code faster than I could, meaning I can focus entirely on writing clean code.
P.S. you can always hand optimize. If you do, just comment the heck out of it.
by eesmith on 9/12/22, 9:07 AM
He's the author of the essay "How Perl Saved the Genome Project", the books "Network Programming with Perl" and "Writing Apache Modules with Perl and C", and a number of Perl packages including CGI.pm - which helped power the dot-com era - and GD.pm.
by teo_zero on 9/12/22, 5:24 AM
It would be interesting to check whether changing every such expression to x = x + y performs more like += or more like ... + x.
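One way to run that check: a rough timing sketch with NumPy and timeit (absolute numbers will vary by machine; the point is the relative comparison, since rebinding with x = x + 2.0 still allocates a fresh array each iteration while += writes into the existing buffer):

```python
import timeit

setup = "import numpy as np; x = np.zeros(1_000_000, dtype=np.float32)"
times = {}
for stmt in ("x = 2.0 + x",   # fresh allocation every iteration
             "x = x + 2.0",   # also allocates: rebinding x doesn't reuse the buffer
             "x += 2.0"):     # in-place write into the existing buffer
    times[stmt] = timeit.timeit(stmt, setup=setup, number=200)
    print(f"{stmt:12s} {times[stmt] * 1000:8.1f} ms")
```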
by dahfizz on 9/12/22, 3:13 AM
by olliej on 9/12/22, 4:55 AM
by thweorui23432 on 9/12/22, 5:11 AM
by MaXtreeM on 9/12/22, 5:40 AM
[0]: https://mobile.twitter.com/badamczewski01/status/15618171584...
by mhzsh on 9/12/22, 3:12 AM
by spullara on 9/13/22, 1:27 AM
by noobermin on 9/12/22, 4:03 AM