by beza1e1 on 3/25/12, 11:13 AM with 29 comments
by riobard on 3/25/12, 1:00 PM
1. Better aliasing information so the compiler can use SIMD instructions. But you have to give up pointer arithmetic in those cases. Only useful for number-crunching and GPU-like workloads AFAIK.
2. Pre-computing time-consuming constants at compile time. Though I see no reason why you cannot do this in C (see the small C sketch after this list).
3. JIT and runtime optimization. I doubt, though, given the various overheads of JIT and runtime machinery (GC, memory, etc.), that a JIT can beat carefully crafted C. Of course it will make many types of programs easier to write.
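A small C sketch of point 2, with made-up names: integer constant expressions are evaluated at compile time in C, but heavier computations (such as filling a sine table) are not constant expressions in C99 and are typically done at start-up instead, short of external code generation.

#include <math.h>
#include <stddef.h>

/* Integer constant expressions are handled at compile time in C ... */
enum { TABLE_SIZE = 256 };
static const double TWO_PI = 6.283185307179586;

/* ... but a call like sin() is not a constant expression in C99, so a
 * table like this is usually filled at start-up rather than baked into
 * the binary, which is where compile-time evaluation in another
 * language could win. */
static double sine_table[TABLE_SIZE];

void init_sine_table(void) {
    for (size_t i = 0; i < TABLE_SIZE; i++)
        sine_table[i] = sin(TWO_PI * (double)i / TABLE_SIZE);
}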
by nivertech on 3/25/12, 7:07 PM
Code compiled with MSVC on Windows can be 40% faster than the same code compiled with gcc on the same platform. I think icc (the Intel C Compiler) can be several times faster than gcc for some workloads.
I can write OpenCL code that will be up to 100 times faster than plain C.
EDIT:
The reason for this is that we still use procedural languages rather than declarative ones.
In a declarative language I would say: "The content of this block of memory should be identical to the content of that block of memory." I would also specify that the blocks are non-overlapping, if that cannot be inferred from the block definitions themselves.
Instead, the C memcopy code describes a procedure of iterative byte-by-byte copying. That is only one of many ways to do it; maybe it even was the fastest way on a PDP-11, but not on modern or future machines.
Similarly, if you try to micromanage your employees, you will not get the best result.
To get the best performance you do not need to be Close-to-The-Metal(tm). You need to express your desired result in the most abstract way that is still correct (the contrast is sketched in the code below).
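A minimal sketch of that contrast, with made-up function names: the first version prescribes the procedure step by step, the second only states the desired result and leaves the strategy to the compiler and library.

#include <stddef.h>
#include <string.h>

/* Spells out the procedure: copy one byte per iteration, in order.
 * Unless the compiler can prove dst and src do not overlap, it must
 * preserve exactly this byte-by-byte behaviour. */
void copy_procedural(char *dst, const char *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* States the intent: "make the contents of dst equal the contents of
 * src", with non-overlap promised by memcpy's contract.  The
 * implementation is free to use SIMD, wide moves, or whatever the
 * target machine prefers. */
void copy_declarative(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
}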
by ontologiae on 3/26/12, 12:08 PM
- A very thorough and complete type analysis. The compiler can then generate the best code without any runtime type-checking code.
- If possible, a flow analysis: by analysing the bounds of each function, the compiler is able to remove a lot of unused code. For instance, if a variable is defined by t = 2*n and, long afterwards, there is a test that checks the primality of t, the compiler can remove the test (see the sketch after this list).
- Compile code by rewriting algorithms with memory locality in mind, i.e. all the things described by U. Drepper in "What Every Programmer Should Know About Memory": rewriting algorithms to avoid cache misses, etc.
- An automatic template-like mechanism: the compiler is able to partially evaluate the code, even over arrays, so everything that can be computed because the data is already known should be compiled directly. Imagine you have a regular expression library and you write your regexp literally in your code (just the regexp): the compiler should be able to produce an automaton.
- As also mentioned: being able to use SIMD instructions.
Because of this, it is better to compile a high-level language with very clean semantics: you can reason about the semantics much more easily and more easily rewrite some algorithms.
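A minimal sketch of the flow-analysis point, with an evenness check standing in for the primality test (all names here are made up):

#include <stdbool.h>

/* A stand-in for a more expensive property test. */
static bool is_even(long t) {
    return t % 2 == 0;
}

long double_then_halve(long n) {
    long t = 2 * n;      /* flow analysis: t is always even            */
    if (is_even(t))      /* after inlining, the condition is provably  */
        return t / 2;    /* true, so only this branch is kept ...      */
    return -1;           /* ... and this one is removed as dead code   */
}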
by ers35 on 3/25/12, 1:59 PM
Doesn't the example below show otherwise? Note the presence of movdqu in both memcopy_normal and memcopy_restrict and how GCC emits five LEA instructions without restrict, but only two when restrict is present.
This does not support the author's assertion that "Aliasing information is the only one, where I am certain about speed improvements, because it is impossible to reach Fortran-speed in C."
#include "stddef.h"
void* memcopy_normal(void* dst, const void* src, size_t count) {
while (count--) *(char*)dst++ = *(char*)src++;
return dst;
}
void* memcopy_restrict(void* restrict dst, const void* restrict src, size_t count) {
while (count--) *(char* restrict)dst++ = *(char* restrict)src++;
return dst;
}
> gcc -S -masm=intel -std=c99 -m64 -march=corei7 -O3 restrict2.c
.file "restrict2.c"
.intel_syntax noprefix
.text
.p2align 4,,15
.globl memcopy_normal
.type memcopy_normal, @function
memcopy_normal:
.LFB0:
.cfi_startproc
test rdx, rdx
mov rax, rdi
je .L2
lea r10, [rdx-1]
mov r8, rdx
shr r8, 4
mov r9, r8
sal r9, 4
test r9, r9
je .L7
lea rcx, [rsi+16]
cmp rdx, 15
lea r11, [rax+16]
seta dil
cmp rax, rcx
seta cl
cmp rsi, r11
seta r11b
or ecx, r11d
test dil, cl
je .L7
xor ecx, ecx
xor edi, edi
.p2align 4,,10
.p2align 3
.L4:
movdqu xmm0, XMMWORD PTR [rsi+rcx]
add rdi, 1
movdqu XMMWORD PTR [rax+rcx], xmm0
add rcx, 16
cmp r8, rdi
ja .L4
lea r8, [rax+r9]
add rsi, r9
sub r10, r9
cmp rdx, r9
je .L5
.L3:
lea r9, [r10+1]
xor ecx, ecx
.p2align 4,,10
.p2align 3
.L6:
movzx edi, BYTE PTR [rsi+rcx]
mov BYTE PTR [r8+rcx], dil
add rcx, 1
cmp r9, rcx
jne .L6
.L5:
add rax, rdx
.L2:
rep
ret
.L7:
mov r8, rax
jmp .L3
.cfi_endproc
.LFE0:
.size memcopy_normal, .-memcopy_normal
.p2align 4,,15
.globl memcopy_restrict
.type memcopy_restrict, @function
memcopy_restrict:
.LFB1:
.cfi_startproc
push r12
.cfi_def_cfa_offset 16
.cfi_offset 12, -16
test rdx, rdx
mov rax, rdi
push rbp
.cfi_def_cfa_offset 24
.cfi_offset 6, -24
push rbx
.cfi_def_cfa_offset 32
.cfi_offset 3, -32
je .L12
lea r10, [rdx-1]
mov r11, rsi
mov r8, rsi
neg r11
and r11d, 15
cmp r11, rdx
cmova r11, rdx
test r11, r11
je .L13
xor ecx, ecx
.p2align 4,,10
.p2align 3
.L14:
movzx r9d, BYTE PTR [r8]
add rcx, 1
add r8, 1
sub r10, 1
mov BYTE PTR [rdi], r9b
add rdi, 1
cmp r11, rcx
ja .L14
cmp rdx, r11
je .L15
.L13:
mov r12, rdx
sub r12, r11
mov rbx, r12
shr rbx, 4
mov rbp, rbx
sal rbp, 4
test rbp, rbp
je .L16
add rsi, r11
xor ecx, ecx
add r11, rax
xor r9d, r9d
.p2align 4,,10
.p2align 3
.L17:
movdqa xmm0, XMMWORD PTR [rsi+rcx]
add r9, 1
movdqu XMMWORD PTR [r11+rcx], xmm0
add rcx, 16
cmp rbx, r9
ja .L17
add rdi, rbp
add r8, rbp
sub r10, rbp
cmp r12, rbp
je .L15
.L16:
lea r9, [r10+1]
xor ecx, ecx
.p2align 4,,10
.p2align 3
.L18:
movzx esi, BYTE PTR [r8+rcx]
mov BYTE PTR [rdi+rcx], sil
add rcx, 1
cmp r9, rcx
jne .L18
.L15:
add rax, rdx
.L12:
pop rbx
.cfi_def_cfa_offset 24
pop rbp
.cfi_def_cfa_offset 16
pop r12
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE1:
.size memcopy_restrict, .-memcopy_restrict
.ident "GCC: (GNU) 4.6.3"
.section .note.GNU-stack,"",@progbits
by kaiwetzel on 3/25/12, 1:43 PM
Do the C semantics prevent the compiler from issuing a run-time check for the (rare) aliasing case and then proceeding with the fast version in the common case? (Probably unrolled, too? A sketch of such a versioned copy is below.)
Out of curiosity: the code uses a separate counter variable count which has to be decremented on its own - is this faster than testing for dst != behind_last_dst_ptr?
(Aside: my gut feeling tells me that if the C/8086 combination can't pull off this example at the maximum memory-to-processor transfer speed for sufficiently large input vectors, something is seriously rotten...)
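A sketch of such a loop versioning, with hypothetical names; GCC's memcopy_normal output above effectively performs this overlap test in the cmp/seta block before .L4:

#include <stddef.h>

void *copy_with_runtime_check(void *dst, const void *src, size_t count) {
    char *d = dst;
    const char *s = src;
    if (d + count <= s || s + count <= d) {
        /* Non-overlapping: a plain counted loop the compiler may
         * unroll and vectorize. */
        for (size_t i = 0; i < count; i++)
            d[i] = s[i];
    } else {
        /* Possibly overlapping: execute the original byte-by-byte
         * forward copy literally, as the C loop's semantics require. */
        while (count--)
            *d++ = *s++;
    }
    return dst;
}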
by mmphosis on 3/25/12, 4:31 PM
Turtles. Turtles all the way down.
Translation: If your language is "Turtles" then you need to use "Turtles" all the way down the entire stack, and that includes your "Turtles" CPUs.
by prewett on 3/26/12, 3:22 AM
by cpr on 3/25/12, 2:16 PM
Inter-module optimizations (e.g., removing dead code) can occur at link time as well.
So compiler writers now have better tools to target in order to start making progress in at least one of these areas (a minimal example of link-time optimization follows).
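For instance (a sketch; file names are illustrative), GCC's link-time optimizer works across translation units when both the compile and link steps are given -flto, letting it inline across modules and drop code it can prove is never used:

> gcc -O2 -flto -c a.c
> gcc -O2 -flto -c b.c
> gcc -O2 -flto a.o b.o -o prog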
by furyofantares on 3/25/12, 9:38 PM
by mynegation on 3/25/12, 1:36 PM