by notforgot on 1/29/17, 6:03 PM with 3 comments
by tjalfi on 1/29/17, 10:50 PM
Use an extensible compiler and targeted optimizations. https://mitpress.mit.edu/books/automatic-algorithm-recogniti... is an excellent book on this topic.
Use a cluster to evolve the best settings for compile options, executable layout, instruction scheduling, etc. There is a paper by a Google author about doing this for prefetching.
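To make "evolving" compile settings concrete, here is a minimal genetic-algorithm sketch. Everything in it is an assumption for illustration: the flag list is a hypothetical pool, and `synthetic_cost` is an invented stand-in for "compile with these flags, run the benchmark, return the time". On a real cluster, each cost evaluation would be a compile-and-run job farmed out to a worker.

```python
import random

# Hypothetical flag pool for illustration; swap in options your compiler accepts.
FLAGS = ["-O3", "-funroll-loops", "-ffast-math", "-march=native", "-flto", "-fno-plt"]

def synthetic_cost(genome):
    """Stand-in for 'compile, run the benchmark, return seconds'.
    The numbers are invented for this sketch, not measurements."""
    gains = [0.30, 0.05, 0.10, 0.15, 0.08, 0.01]
    cost = 1.0 - sum(g for g, on in zip(gains, genome) if on)
    if genome[1] and genome[4]:  # pretend two flags interact badly
        cost += 0.10
    return cost

def evolve(cost, n_flags, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Search over on/off flag vectors (needs n_flags >= 2); lower cost is better."""
    rng = random.Random(seed)
    pop = [tuple(rng.random() < 0.5 for _ in range(n_flags)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]  # keep the faster half unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_flags)  # one-point crossover
            child = a[:cut] + b[cut:]
            # flip each bit with a small probability
            child = tuple((not bit) if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost)

best = evolve(synthetic_cost, len(FLAGS))
print([flag for flag, on in zip(FLAGS, best) if on], round(synthetic_cost(best), 3))
```

Because the better half of each generation survives unchanged, the best configuration found never gets worse from one generation to the next. With real measurements you would also want to repeat timings, since benchmark noise can mislead the selection step.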
Use an ILP (integer linear programming) solver for register allocation, instruction scheduling, and other problems that are normally solved with heuristics. The size of the program may make this intractable. There was a startup that used this approach for a custom programming language targeted at Intel's network processors.
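As a rough sketch of what solving register allocation exactly (rather than heuristically) looks like: each variable must get exactly one register, variables whose live ranges interfere must get different registers, and the objective is to use as few registers as possible. The code below is not an ILP solver; it swaps in brute-force enumeration of every assignment so that it stays self-contained. The variable names and interference pairs are hypothetical.

```python
from itertools import product

def allocate_registers(variables, interference, num_registers):
    """Exhaustively search all assignments of variables to registers.
    Returns (registers_used, mapping) for an assignment that satisfies every
    interference constraint with the fewest distinct registers, or None if
    no valid assignment exists."""
    best = None
    for assignment in product(range(num_registers), repeat=len(variables)):
        reg = dict(zip(variables, assignment))
        if any(reg[a] == reg[b] for a, b in interference):
            continue  # two interfering variables share a register
        used = len(set(assignment))
        if best is None or used < best[0]:
            best = (used, reg)
    return best

# Hypothetical live ranges: a, b, c overlap pairwise; d overlaps only c.
variables = ["a", "b", "c", "d"]
interference = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
result = allocate_registers(variables, interference, num_registers=4)
print(result)
```

The loop visits num_registers ** len(variables) assignments, which shows why program size can make the exact approach intractable. A real ILP solver prunes that space far more cleverly, but the worst case is still exponential.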
by imaginenore on 1/29/17, 6:21 PM
The fastest way would be to create an ASIC: hardware designed to run your algorithm specifically.
Something simpler and a bit slower would be an FPGA.
Below that is a GPU implementation of your code, assuming it can be parallelized.
Below that is hand-crafted assembly, provided it is written by someone who can outperform a good compiler.
Below that is hand-crafted C/C++ or Fortran code.
Here are benchmarks of various languages for various problems:
N-body: https://benchmarksgame.alioth.debian.org/u64q/performance.ph...
Spectral-norm: http://benchmarksgame.alioth.debian.org/u64q/performance.php...
Digits of pi: http://benchmarksgame.alioth.debian.org/u64q/performance.php...
FASTA: http://benchmarksgame.alioth.debian.org/u64q/performance.php...