from Hacker News

How a language can be faster than C

by beza1e1 on 3/25/12, 11:13 AM with 29 comments

  • by riobard on 3/25/12, 1:00 PM

    TL;DR:

    1. Better memory aliasing guarantees, so the compiler can use SIMD instructions. But you have to give up pointer arithmetic in those cases. Only useful for number-crunching and GPU-like workloads AFAIK.

    2. Pre-compute time-consuming constants at compile time. Though I see no reason why you cannot do this in C (see the sketch after this list).

    3. JIT and runtime optimization. I doubt, though, given the various overheads of JIT and runtime optimization (GC, memory, etc.), that a JIT can beat carefully crafted C. Of course it will make it easier to write many types of programs.
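
    (Illustrative sketch for point 2, not from the thread: in plain C, simple constant expressions are folded at compile time; heavier precomputation usually needs macros, a code generator, or a separate build step. The folded case, assuming gcc or clang at -O1 or higher:)

      /* Constant expressions are evaluated by the compiler, not at runtime. */
      #define SECONDS_PER_DAY (24 * 60 * 60)      /* folded to 86400 */

      static inline long scale(long days) {
          return days * SECONDS_PER_DAY;
      }

      long week_in_seconds(void) {
          return scale(7);   /* typically compiles down to: return 604800; */
      }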

  • by nivertech on 3/25/12, 7:07 PM

    C can be fast (icc) and slow (CINT) depending on compiler/interpreter and OS/platform.

    MSVC on Windows can be 40% faster than gcc on the same platform. I think icc (Intel C Compiler) can be several times faster than gcc for some workloads.

    I can write OpenCL code that will be up to 100 times faster than plain C.

    EDIT:

    The reason for this is that we still use procedural languages rather than declarative ones.

    In a declarative language I would say: "The content of this block of memory should be identical to the content of that block of memory." I would also specify that the blocks are non-overlapping, if that cannot be inferred from the block definitions themselves.

    Instead, the C memcopy code describes a procedure of iterative byte-by-byte copying. This is only one way out of many to do it; maybe it was even the fastest way on a PDP-11, but not on modern or future machines.

    Similarly, if you try to micromanage your employees, you will not get the most optimized result.

    To get the best performance, you do not need to be Close-to-The-Metal(tm). You need to be able to express your desired result in the most abstract way that is still correct.
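
    (For reference, and not part of the parent comment: C99 already states the non-overlap fact declaratively in memcpy's own prototype via restrict; the declaration is essentially:)

      #include <stddef.h>   /* size_t */

      /* restrict is a promise to the compiler: the two blocks do not
         overlap, which is exactly the declarative fact described above. */
      void *memcpy(void *restrict dst, const void *restrict src, size_t n);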

  • by ontologiae on 3/26/12, 12:08 PM

    Several possibilities :

    - A very clever and complete type analysis. The compiler can then optimize the code as well as possible, without emitting runtime type-checking code.

    - If possible, a flow analysis: by analysing the bounds of each function, the compiler is able to remove a lot of unused code. For instance, if a variable is defined by t = 2*n and, long afterwards, there is a test which checks the primality of t, the compiler can remove the test.

    - Compile the code by rewriting algorithms to take memory locality into account, i.e. all the things described by U. Drepper in "What Every Programmer Should Know About Memory": rewriting algorithms to avoid cache misses, etc.

    - An automatic template-like mechanism: the compiler is able to partially execute the code, even over arrays, so everything that can be calculated because the data is already available should be computed at compile time. Imagine you have a regular expression library and you write your regexp in your code (just the regexp): the compiler should be able to produce an automaton (a rough sketch of such an automaton is below).

    - As the article also says: being able to use SIMD instructions.

    Because of this, it is better to compile a high-level language with very clean semantics: you can reason about the semantics much more easily and rewrite algorithms more freely.
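
    (An illustration of the partial-evaluation point, not from the comment: the kind of specialized code a compiler could emit for the literal regexp "ab*c", written here as a tiny hand-coded DFA in C with a hypothetical name:)

      #include <stdbool.h>

      /* What partial evaluation of a regexp library on the constant "ab*c"
         could produce: a fixed automaton, no interpretation at runtime. */
      bool match_ab_star_c(const char *s) {
          if (*s++ != 'a') return false;   /* expect exactly one 'a' */
          while (*s == 'b') s++;           /* any number of 'b'      */
          if (*s++ != 'c') return false;   /* expect exactly one 'c' */
          return *s == '\0';               /* accept only at end of string */
      }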

  • by ers35 on 3/25/12, 1:59 PM

    The author says "In C99 the restrict keyword was added, which we could use here to encode that src and dst are different from all other references. This mechanism helps in some cases, but not in our example."

    Doesn't the example below show otherwise? Note the presence of movdqu in both memcopy_normal and memcopy_restrict and how GCC emits five LEA instructions without restrict, but only two when restrict is present.

    This does not support the author's assertion that "Aliasing information is the only one, where I am certain about speed improvements, because it is impossible to reach Fortran-speed in C."

      #include "stddef.h"
    
      void* memcopy_normal(void* dst, const void* src, size_t count) {
         while (count--) *(char*)dst++ = *(char*)src++;
         return dst;
      }
    
      void* memcopy_restrict(void* restrict dst, const void* restrict src, size_t count) {
         while (count--) *(char* restrict)dst++ = *(char* restrict)src++;
         return dst;
      }
    
      > gcc -S -masm=intel -std=c99 -m64 -march=corei7 -O3 restrict2.c
    
        .file	"restrict2.c"
        .intel_syntax noprefix
        .text
        .p2align 4,,15
        .globl	memcopy_normal
        .type	memcopy_normal, @function
      memcopy_normal:
      .LFB0:
        .cfi_startproc
        test	rdx, rdx
        mov	rax, rdi
        je	.L2
        lea	r10, [rdx-1]
        mov	r8, rdx
        shr	r8, 4
        mov	r9, r8
        sal	r9, 4
        test	r9, r9
        je	.L7
        lea	rcx, [rsi+16]
        cmp	rdx, 15
        lea	r11, [rax+16]
        seta	dil
        cmp	rax, rcx
        seta	cl
        cmp	rsi, r11
        seta	r11b
        or	ecx, r11d
        test	dil, cl
        je	.L7
        xor	ecx, ecx
        xor	edi, edi
        .p2align 4,,10
        .p2align 3
      .L4:
        movdqu	xmm0, XMMWORD PTR [rsi+rcx]
        add	rdi, 1
        movdqu	XMMWORD PTR [rax+rcx], xmm0
        add	rcx, 16
        cmp	r8, rdi
        ja	.L4
        lea	r8, [rax+r9]
        add	rsi, r9
        sub	r10, r9
        cmp	rdx, r9
        je	.L5
      .L3:
        lea	r9, [r10+1]
        xor	ecx, ecx
        .p2align 4,,10
        .p2align 3
      .L6:
        movzx	edi, BYTE PTR [rsi+rcx]
        mov	BYTE PTR [r8+rcx], dil
        add	rcx, 1
        cmp	r9, rcx
        jne	.L6
      .L5:
        add	rax, rdx
      .L2:
        rep
        ret
      .L7:
        mov	r8, rax
        jmp	.L3
        .cfi_endproc
      .LFE0:
        .size	memcopy_normal, .-memcopy_normal
        .p2align 4,,15
        .globl	memcopy_restrict
        .type	memcopy_restrict, @function
      memcopy_restrict:
      .LFB1:
        .cfi_startproc
        push	r12
        .cfi_def_cfa_offset 16
        .cfi_offset 12, -16
        test	rdx, rdx
        mov	rax, rdi
        push	rbp
        .cfi_def_cfa_offset 24
        .cfi_offset 6, -24
        push	rbx
        .cfi_def_cfa_offset 32
        .cfi_offset 3, -32
        je	.L12
        lea	r10, [rdx-1]
        mov	r11, rsi
        mov	r8, rsi
        neg	r11
        and	r11d, 15
        cmp	r11, rdx
        cmova	r11, rdx
        test	r11, r11
        je	.L13
        xor	ecx, ecx
        .p2align 4,,10
        .p2align 3
      .L14:
        movzx	r9d, BYTE PTR [r8]
        add	rcx, 1
        add	r8, 1
        sub	r10, 1
        mov	BYTE PTR [rdi], r9b
        add	rdi, 1
        cmp	r11, rcx
        ja	.L14
        cmp	rdx, r11
        je	.L15
      .L13:
        mov	r12, rdx
        sub	r12, r11
        mov	rbx, r12
        shr	rbx, 4
        mov	rbp, rbx
        sal	rbp, 4
        test	rbp, rbp
        je	.L16
        add	rsi, r11
        xor	ecx, ecx
        add	r11, rax
        xor	r9d, r9d
        .p2align 4,,10
        .p2align 3
      .L17:
        movdqa	xmm0, XMMWORD PTR [rsi+rcx]
        add	r9, 1
        movdqu	XMMWORD PTR [r11+rcx], xmm0
        add	rcx, 16
        cmp	rbx, r9
        ja	.L17
        add	rdi, rbp
        add	r8, rbp
        sub	r10, rbp
        cmp	r12, rbp
        je	.L15
      .L16:
        lea	r9, [r10+1]
        xor	ecx, ecx
        .p2align 4,,10
        .p2align 3
      .L18:
        movzx	esi, BYTE PTR [r8+rcx]
        mov	BYTE PTR [rdi+rcx], sil
        add	rcx, 1
        cmp	r9, rcx
        jne	.L18
      .L15:
        add	rax, rdx
      .L12:
        pop	rbx
        .cfi_def_cfa_offset 24
        pop	rbp
        .cfi_def_cfa_offset 16
        pop	r12
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
      .LFE1:
        .size	memcopy_restrict, .-memcopy_restrict
        .ident	"GCC: (GNU) 4.6.3"
        .section	.note.GNU-stack,"",@progbits
  • by kaiwetzel on 3/25/12, 1:43 PM

    The instruction set page linked in the article mentions: "To move a double quadword to or from memory locations that are known to be aligned on 16-byte boundaries, use the MOVDQA instruction." Would it not be beneficial to use this instruction for most of the memory to be copied, and only use the slower variants for the leading and trailing bytes?

    Does the C semantics prevent the compiler from issuing a run-time check for the (rare) aliasing case and proceed with the fast version in the common case? (probably unrolled, too?)

    Out of curiosity: The code uses a special counter variable count which has to be decremented separately - is this faster than testing for dst != behind_last_dst_ptr ?

    (aside: my gut feeling tells me that if the combination of C/8086 can't pull off his example at the maximum memory-to-processor transfer speed for sufficiently large input vectors, there is something seriously rotten ...)
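
    (A rough sketch of the head/body/tail split described above, not the article's code; the function name and details are illustrative only, and it assumes dst and src do not overlap:)

      #include <emmintrin.h>   /* SSE2 intrinsics: _mm_loadu_si128, _mm_store_si128 */
      #include <stddef.h>
      #include <stdint.h>

      void copy_with_aligned_body(char *dst, const char *src, size_t n) {
          /* head: copy byte by byte until dst reaches a 16-byte boundary */
          while (n && ((uintptr_t)dst & 15)) { *dst++ = *src++; n--; }

          /* body: aligned 16-byte stores (MOVDQA), unaligned loads (MOVDQU) */
          for (; n >= 16; n -= 16, dst += 16, src += 16) {
              __m128i v = _mm_loadu_si128((const __m128i *)src);
              _mm_store_si128((__m128i *)dst, v);
          }

          /* tail: whatever bytes remain */
          while (n--) *dst++ = *src++;
      }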

  • by mmphosis on 3/25/12, 4:31 PM

    > How a language can be faster than C?

    Turtles. Turtles all the way down.

    Translation: If your language is "Turtles" then you need to use "Turtles" all the way down the entire stack, and that includes your "Turtles" CPUs.

  • by prewett on 3/26/12, 3:22 AM

    Personally, I think the most interesting part was the link to a student who used Scheme to generate a really fast C implementation of a matrix-multiply algorithm, which beat all C code except for one that included hand-coded assembly (and with additional improvements could have beaten that, too).

    http://www.cs.indiana.edu/~jsobel/c455-c511.updated.txt

  • by cpr on 3/25/12, 2:16 PM

    His third point is addressable by using LLVM as your target. Optimization can occur at compile time, at link time, and at runtime, if you decide to do the final LLVM-to-machine-language translation at the latter point.

    Inter-module optimizations (e.g., removing dead code) can occur at link time as well.

    So there are better tools now that compiler writers can target to start making progress in at least one of these areas.
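
    (For the link-time part, a minimal illustration with clang's LTO flags; the file names are placeholders:)

      > clang -O2 -flto -c a.c b.c
      > clang -O2 -flto a.o b.o -o prog

    With -flto the object files carry LLVM bitcode, so dead-code elimination and inlining can happen across modules when the final binary is linked.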

  • by furyofantares on 3/25/12, 9:38 PM

    The memcopy example isn't an optimization issue, it's a correctness issue. If the described behavior of memcopy is what was intended then of course the compiler can't use that particular optimization, and it couldn't in any other language, either. However, if the author was intending to write a function with behavior like memmove, then the issue is that the code is incorrect.

  • by mynegation on 3/25/12, 1:36 PM

    A nitpick: the memcpy example is flawed: memcpy's behavior is undefined if source and destination overlap. One should use memmove for that.
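
    (A small illustration of the difference; the buffer contents and function name are just an example:)

      #include <string.h>

      /* The two regions share bytes, so memcpy would be undefined behaviour
         here; memmove is specified to handle overlapping regions. */
      void shift_left_one(char buf[7]) {      /* e.g. buf = "abcdef" */
          memmove(buf, buf + 1, 6);           /* copies "bcdef\0" down one byte */
      }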