by zetalyrae on 1/23/25, 9:01 PM with 120 comments
by Const-me on 1/27/25, 3:45 PM
The language is actually great with SIMD, you just have to do it yourself with intrinsics, or use libraries. BTW, here’s a library which implements 4-wide vectorized exponent functions for FP32 precision on top of SSE, AVX and NEON SIMD intrinsics (MIT license): https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...
by AnimalMuppet on 1/23/25, 9:11 PM
Is this article really confused, or did I misunderstand it?
The thing that makes C/C++ a good language for SIMD is how easily it lets you control memory alignment.
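To make the alignment point concrete, here is a minimal C11 sketch of the two usual ways to control alignment: `alignas` for static storage and `aligned_alloc` for the heap. The names `is_aligned`, `lookup_table` and `alloc_floats_aligned32` are hypothetical and do not come from the article or this thread, and the 32-byte figure is only an assumption that matches a 256-bit vector load.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero if p sits on an `alignment`-byte boundary. */
int is_aligned(const void *p, size_t alignment) {
    /* The mask trick below only works for powers of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Static storage on a 32-byte boundary, as a 256-bit aligned load expects. */
alignas(32) float lookup_table[8];

/* Hypothetical helper: heap buffer for n floats on a 32-byte boundary.
   C11 aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the result with free(). */
float *alloc_floats_aligned32(size_t n) {
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 31) & ~(size_t)31;
    return aligned_alloc(32, rounded);
}
```

A caller would request a buffer with `alloc_floats_aligned32(n)`, check it for `NULL`, and could then use aligned loads and stores on it; `is_aligned(ptr, 32)` is a cheap sanity check for debug builds.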
by PaulHoule on 1/23/25, 9:10 PM
by npalli on 1/27/25, 5:18 PM
by mlochbaum on 1/27/25, 1:47 PM
As an array implementer I've thought about the issue a lot and have been meaning to write a full page on it. For now I have some comments at https://mlochbaum.github.io/BQN/implementation/versusc.html#... and the last paragraph of https://mlochbaum.github.io/BQN/implementation/compile/intro....
by commandlinefan on 1/27/25, 8:35 PM
"Modularity is the enemy of performance"
If you want optimal performance, you have to collapse the layers. Look at Deepseek, for example.
by notorandit on 1/27/25, 1:25 PM
Wanna SMP? Use multi-threading libraries. Wanna SIMD/MIMD? Use (inline) assembler functions. Or design your own language.
by camel-cdr on 1/27/25, 2:40 PM
If you implement a scalar expf in a vectorizer friendly way, and it's visible to the compiler, then it could also be vectorized: https://godbolt.org/z/zxTn8hbEe
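As a rough illustration of what "vectorizer friendly" can mean here, the sketch below is not the code behind the godbolt link; it is a hypothetical `exp_approx` written without branches, table lookups or libm calls, so that a compiler that can see its body has a chance of vectorizing the loop in `exp_array`. It assumes inputs in roughly [-87, 88] (outside that range the exponent trick breaks down) and uses a plain degree-5 Taylor polynomial, so it is less accurate than a real `expf`. Whether a given compiler actually vectorizes it depends on the optimization flags and the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical approximation of exp(x); assumes x is roughly in [-87, 88]. */
float exp_approx(float x) {
    const float LOG2E = 1.44269504f;  /* log2(e) */
    const float LN2   = 0.693147181f; /* ln(2)   */

    /* n = round(x * log2(e)); the ternary is a select, not control flow. */
    float t = x * LOG2E;
    int32_t n = (int32_t)(t + (t < 0.0f ? -0.5f : 0.5f));

    /* Reduced argument, so that exp(x) = 2^n * exp(r) with |r| <= ln(2)/2. */
    float r = x - (float)n * LN2;

    /* exp(r) by a degree-5 Taylor polynomial in Horner form. */
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (0.16666667f
              + r * (0.041666668f + r * 0.0083333338f))));

    /* Build 2^n by writing the biased exponent field of a float directly. */
    uint32_t bits = (uint32_t)(n + 127) << 23;
    float scale;
    memcpy(&scale, &bits, sizeof scale);
    return p * scale;
}

/* Simple loop with no aliasing, so each iteration is independent. */
void exp_array(float *restrict out, const float *restrict in, size_t n) {
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = exp_approx(in[i]);
}
```

The design choice is to keep every step a straight-line arithmetic or bit operation; a call into an opaque library `expf` would normally stop the loop from being vectorized unless the compiler has a vector math library to substitute.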
by svilen_dobrev on 1/27/25, 4:01 PM
https://ocw.mit.edu/courses/6-945-adventures-in-advanced-sym...
by vkaku on 1/27/25, 9:16 PM
Some people have vectorized successfully with C, even with all the hacks/pointers/union/opaque business. It requires careful programming, for sure. The ffmpeg cases are super good examples of how compiler misses happen, and how to optimize for full throughput in those cases. Worth a look for all compiler engineers.
by Vosporos on 1/27/25, 3:36 PM
by dvorack101 on 1/27/25, 7:20 PM
https://github.com/dezashibi-c/a-simd_in_c
Copyright goes to Navid Dezashibi.
by exitcode0000 on 1/28/25, 3:18 AM
```
generic
type T is private;
Aligned : Bool := True;
function Inverse_Sqrt_T (V : T) return T;
function Inverse_Sqrt_T (V : T) return T is
Result : aliased T;
THREE : constant Real := 3.0;
NEGATIVE_HALF : constant Real := -0.5;
VMOVPS : constant String := (if Aligned then "vmovaps" else "vmovups");
begin
Asm (Clobber => "xmm0, xmm1, xmm2, xmm3, memory",
Inputs => (Ptr'Asm_Input ("r", Result'Address),
Ptr'Asm_Input ("r", V'Address),
Ptr'Asm_Input ("r", THREE'Address),
Ptr'Asm_Input ("r", NEGATIVE_HALF'Address)),
Template => VMOVPS & " (%1), %%xmm0 " & E & -- xmm0 ← V
" vrsqrtps %%xmm0, %%xmm1 " & E & -- xmm1 ← Reciprocal sqrt of xmm0
" vmulps %%xmm1, %%xmm1, %%xmm2 " & E & -- xmm2 ← xmm1 * xmm1
" vbroadcastss (%2), %%xmm3 " & E & -- xmm3 ← THREE
" vfmsub231ps %%xmm2, %%xmm0, %%xmm3 " & E & -- xmm3 ← V * xmm2 - THREE
" vbroadcastss (%3), %%xmm0 " & E & -- xmm0 ← NEGATIVE_HALF
" vmulps %%xmm0, %%xmm1, %%xmm0 " & E & -- xmm0 ← NEGATIVE_HALF * xmm1
" vmulps %%xmm3, %%xmm0, %%xmm0 " & E & -- xmm0 ← xmm3 * xmm0
VMOVPS & " %%xmm0, (%0) "); -- Result ← xmm0
return Result;
end;
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_2D, Aligned => False);
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_3D, Aligned => False);
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_4D);
```

```C
vector_3d vector_inverse_sqrt(const vector_3d* v) {
...
vector_4d vector_inverse_sqrt(const vector_4d* v) {
vector_4d out;
static const float THREE = 3.0f; // 0x40400000
static const float NEGATIVE_HALF = -0.5f; // 0xbf000000
__asm__ (
// Load the input vector into xmm0
"vmovaps (%1), %%xmm0\n\t"
"vrsqrtps %%xmm0, %%xmm1\n\t"
"vmulps %%xmm1, %%xmm1, %%xmm2\n\t"
"vbroadcastss (%2), %%xmm3\n\t"
"vfmsub231ps %%xmm2, %%xmm0, %%xmm3\n\t"
"vbroadcastss (%3), %%xmm0\n\t"
"vmulps %%xmm0, %%xmm1, %%xmm0\n\t"
"vmulps %%xmm3, %%xmm0, %%xmm0\n\t"
"vmovups %%xmm0, (%0)\n\t" // Output operand
:
: "r" (&out), "r" (v), "r" (&THREE), "r" (&NEGATIVE_HALF) // Input operands
: "xmm0", "xmm1", "xmm2", "xmm3", "memory" // Clobbered registers
);
return out;
}
```
by jiehong on 1/27/25, 2:35 PM
If only GPU makers could standardise an extended ISA like AVX on CPU, and we could all run SIMD or SIMT code without needing any libraries, just our compilers.
by musicale on 1/28/25, 6:15 AM
(Oh, that's SIMT. Carry on then.)
by TinkersW on 1/27/25, 1:40 PM
C functions can't be vectorized? WTF are you talking about? You can certainly pass vector registers to functions.
Exp can also be vectorized; AVX512 even includes specific instructions to make it easier. (There is no direct exp instruction on most hardware; it is generally a sequence of instructions.)
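For readers who want to see the "pass vector registers to functions" point in code, here is a small sketch using the GCC/Clang `vector_size` extension, which is an assumption about the toolchain and not part of standard C. The names `f32x4`, `mul_add`, `splat` and `sum_lanes` are hypothetical. On the common x86-64 System V and AArch64 calling conventions, 16-byte vector arguments and return values like these travel in vector registers.

```c
#include <assert.h>

/* GCC/Clang extension: four packed floats in one 16-byte value. */
typedef float f32x4 __attribute__((vector_size(16)));
static_assert(sizeof(f32x4) == 16, "f32x4 should be exactly 16 bytes");

/* Takes and returns whole vectors by value; computes a * b + c per lane. */
f32x4 mul_add(f32x4 a, f32x4 b, f32x4 c) {
    return a * b + c;
}

/* Broadcast one scalar into all four lanes. */
f32x4 splat(float s) {
    f32x4 v = {s, s, s, s};
    return v;
}

/* Horizontal sum, using the extension's per-lane subscripting. */
float sum_lanes(f32x4 v) {
    return v[0] + v[1] + v[2] + v[3];
}
```

For example, `mul_add(a, splat(2.0f), splat(1.0f))` with `a = {1, 2, 3, 4}` doubles each lane and adds one. The same idea works with the `__m128` type from the x86 intrinsics headers.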