from Hacker News

Ask HN: Why PyTorch einsum is significantly slower than transpose

by rcshubhadeep on 4/10/22, 10:32 AM with 4 comments

I have been tinkering with some DL models and wanted to implement part of them using PyTorch einsum. Before doing so I was wondering about the performance. I have been a bit skeptical, as I believe there is some parsing (and maybe even some code generation) involved in the implementation of einsum (I have never looked under the hood of PyTorch or NumPy to see how it is implemented, so I may be completely wrong).

So to measure the performance, I created a simple comparison benchmark. I created a tensor with dimensions (BATCH, X, Y), like so -

a = torch.randn(10, 20, 30)

Then in Jupyter I did this

%%timeit

torch.einsum('b i j -> b j i', a)

AND

%%timeit

a.transpose(1, 2)

-----------------------

These are the results

5.43 µs ± 63.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each) [Einsum]

1.15 µs ± 2.51 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) [transpose]

Am I doing or reading something wrong? Is this the wrong way to benchmark? Or is it really true, as I see here, that einsum is an order of magnitude slower than transpose?
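
For reference, this is roughly how I would reproduce the same comparison outside Jupyter (a minimal sketch using timeit; the exact numbers will of course vary by machine):

import timeit
import torch

a = torch.randn(10, 20, 30)

# Sanity check: 'b i j -> b j i' swaps the last two dimensions,
# so the values should match transpose(1, 2) exactly.
assert torch.equal(torch.einsum('b i j -> b j i', a), a.transpose(1, 2))

n = 100_000
t_ein = timeit.timeit(lambda: torch.einsum('b i j -> b j i', a), number=n) / n
t_tr = timeit.timeit(lambda: a.transpose(1, 2), number=n) / n
print(f"einsum:    {t_ein * 1e6:.3f} µs per call")
print(f"transpose: {t_tr * 1e6:.3f} µs per call")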

  • by gus_massa on 4/10/22, 11:15 AM

    Have you tried with different sizes like

      a = torch.randn(10, 20, 30)
      a = torch.randn(20, 40, 60)
      a = torch.randn(30, 60, 90)
      ...
    
    Is the "4µs" a constant difference, or is it proportional to the size of the tensor?
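
    Something along these lines would show it (a rough sketch with timeit; the shapes and loop count are just placeholders):

      import timeit
      import torch

      for scale in (1, 2, 3):
          # scale the (BATCH, X, Y) shape from the original post
          a = torch.randn(10 * scale, 20 * scale, 30 * scale)
          n = 10_000
          t_ein = timeit.timeit(lambda: torch.einsum('b i j -> b j i', a), number=n) / n
          t_tr = timeit.timeit(lambda: a.transpose(1, 2), number=n) / n
          print(f"shape {tuple(a.shape)}: einsum {t_ein * 1e6:.2f} µs, "
                f"transpose {t_tr * 1e6:.2f} µs, "
                f"difference {(t_ein - t_tr) * 1e6:.2f} µs")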