by j0e1 on 9/12/22, 3:00 AM with 79 comments
by chillee on 9/12/22, 6:51 AM
1. In PyTorch (and other array programming libraries like Numpy), the operations being passed around are tensors/arrays (i.e. large chunks of memory). Thus, += is overloaded to mean "in-place write" to the arrays.
So, `+` vs `+=` is the equivalent of

    a: float[1000]
    b: float[1000]
    for i in [0, 1000]:
        b[i] = a[i] + 2

vs.

    a: float[1000]
    for i in [0, 1000]:
        a[i] = a[i] + 2
The main performance advantage comes from: 1. no need to allocate an extra array, and 2. using less memory overall, so the various cache levels can work better. It has nothing to do with Python bytecodes.
2. As for whether it generally makes sense to do this optimization manually... Usually, PyTorch users don't use in-place operations, as they're a bit uglier mathematically and come with various foot-guns/restrictions that users find confusing. Generally, it's best to leave this optimization to an optimizing compiler.
3. PyTorch in general does support using in-place operations during training, albeit with some caveats.
(PS) 4. Putting everything on one line (as some folks suggest) is almost certainly not going to help performance - the primary performance bottlenecks here have almost nothing to do with CPU perf.
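A quick way to see the allocation difference described above, sketched with NumPy (the same in-place semantics apply to PyTorch tensors; the buffer-address check is just one way to observe it):

```python
import numpy as np

a = np.zeros(1000, dtype=np.float32)
addr = a.__array_interface__['data'][0]  # address of a's underlying buffer

a += 2.0  # in-place: writes into the existing buffer
assert a.__array_interface__['data'][0] == addr

b = a + 2.0  # out-of-place: allocates a brand-new array
assert b.__array_interface__['data'][0] != addr
```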
by FabHK on 9/12/22, 4:12 AM
> Changing this back to the original implementation fixed an error I was getting when doing textual inversion on Windows
https://github.com/lstein/stable-diffusion/commit/62863ac586...
by nodja on 9/12/22, 5:33 AM
As everyone said, this is more performant because x is being modified in place. The reason it wasn't done in place originally is that you can't train a neural network through an operation done in place. During training, the network literally walks back through all the operations that were done to see how well they performed, so they can be adjusted using a secondary value called a gradient; this happens during the backward pass. If you modify something in place, you're essentially overwriting the input values that were passed to that function, and by extension the output values of the function called before it, essentially breaking the network chain - unless you also copy the inputs together with the gradients, which would cause an even worse performance hit and be a memory hog.
The breakage bug later in the issue is proof of this: when sampling to generate an image, only the forward pass is run on the network, but textual inversion requires you to train the network and therefore do the backward pass, triggering the error since the dependency graph is broken. I should also note that technically the add operation should be safe to do in place, as it's reversible, but I'm not a PyTorch expert so I'm not sure exactly what's going on in there.
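The autograd failure described above is easy to reproduce; a minimal sketch (sigmoid is a convenient example because its backward pass needs its own saved output, so editing that output in place breaks the graph):

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.sigmoid(x)  # backward needs y itself: dsigmoid/dx = y * (1 - y)
y.add_(1.0)           # in-place write bumps y's version counter

err = None
try:
    y.sum().backward()
except RuntimeError as e:
    # "one of the variables needed for gradient computation has been
    # modified by an inplace operation"
    err = e
print(err)
```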
by ironhaven on 9/12/22, 3:34 AM
by JonathonW on 9/12/22, 3:16 AM
I'd assumed that overall performance in Stable Diffusion was limited by the code running on the GPU, with Python performance being a fairly minor factor-- but I guess that's not the case?
by staticassertion on 9/12/22, 4:12 AM
This isn't a case of "the Python interpreter is bad"; it's just that the code is doing what the user asked it to do - create a completely new copy of the data, then overwrite the old copy with it. Immutable operations like this are slow; mutating the value in place (what += does) is fast.
Granted, a compiled language could recognize that you're doing this, but it also might not - are `+` and `+=` semantically identical such that the compiler can replace one with the other? Maybe? Probably not, if I had to guess. The correct answer is to just use the faster operation, as it is in any language.
I don't know the type of `x`, but I'd suggest another optimization here would be to:
a) Preallocate the buffer before mutating it 3x (which is still likely forcing some allocations)
b) Reuse that buffer if it's so important; store it in `self` and clear it before use.
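Suggestions (a) and (b) could look something like this NumPy sketch (the class and method names here are made up for illustration, not from the Stable Diffusion code):

```python
import numpy as np

class ScratchAdder:
    """Hypothetical helper: allocate one scratch buffer up front and reuse it."""

    def __init__(self, n):
        self._buf = np.empty(n, dtype=np.float32)  # (a) preallocate once

    def combine(self, a, b):
        # (b) reuse the stored buffer; out= writes in place, no fresh allocation
        np.add(a, b, out=self._buf)
        self._buf += 2.0
        return self._buf

adder = ScratchAdder(4)
out1 = adder.combine(np.ones(4, np.float32), np.ones(4, np.float32))
out2 = adder.combine(np.zeros(4, np.float32), np.zeros(4, np.float32))
assert out1 is out2  # same buffer reused across calls
```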
by datalopers on 9/12/22, 3:50 AM
by teruakohatu on 9/12/22, 3:09 AM
by eru on 9/12/22, 3:12 AM
I'm wondering, because recent versions have improved performance a lot. 3.11 is much faster than 3.10, and what's in 3.12 is already much faster than 3.11.
by brrrrrm on 9/12/22, 3:24 AM
by Waterluvian on 9/12/22, 3:21 AM
Many times I've had to choose between making my Python code more legible and getting free performance.
The thing I like about JavaScript is that I can _usually_ trust the JIT to make my code faster than I could, meaning I can focus entirely on writing clean code.
P.S. you can always hand optimize. If you do, just comment the heck out of it.
by eesmith on 9/12/22, 9:07 AM
He's the author of the essay "How Perl Saved the Genome Project", the books "Network Programming with Perl" and "Writing Apache Modules with Perl and C", and a number of Perl packages including CGI.pm - which helped power the dot-com era - and GD.pm.
by teo_zero on 9/12/22, 5:24 AM
It would be interesting to check whether changing every such expression to x = x + y performs more like += or more like ... + x.
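One way to run that check: a rough timing sketch with NumPy and timeit (absolute numbers will vary by machine; the point is the relative comparison, since rebinding with x = x + 2.0 still allocates a fresh array each iteration while += writes into the existing buffer):

```python
import timeit

setup = "import numpy as np; x = np.zeros(1_000_000, dtype=np.float32)"
times = {}
for stmt in ("x = 2.0 + x",   # fresh allocation every iteration
             "x = x + 2.0",   # also allocates: rebinding x doesn't reuse the buffer
             "x += 2.0"):     # in-place write into the existing buffer
    times[stmt] = timeit.timeit(stmt, setup=setup, number=200)
    print(f"{stmt:12s} {times[stmt] * 1000:8.1f} ms")
```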
by dahfizz on 9/12/22, 3:13 AM
by olliej on 9/12/22, 4:55 AM
by thweorui23432 on 9/12/22, 5:11 AM
by MaXtreeM on 9/12/22, 5:40 AM
[0]: https://mobile.twitter.com/badamczewski01/status/15618171584...
by mhzsh on 9/12/22, 3:12 AM
by spullara on 9/13/22, 1:27 AM
by noobermin on 9/12/22, 4:03 AM