by oxxoxoxooo on 12/2/21, 6:51 PM with 348 comments
by snvzz on 12/2/21, 7:54 PM
But ultimately, the gist of their argument is this:
>Any task will require more Risc V instructions than any contemporary instruction set.
Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.
by socialdemocrat on 12/2/21, 10:48 PM
Therein lies the problem. Nobody ever goes out guns blazing complaining about too many instructions despite the fact that complexity has its own downsides.
RISC-V has been aggressively designed to have a minimal ISA, leaving plenty of room to grow and requiring a minimal number of transistors for a minimal implementation.
Should this be a showstopper down the road, there will be plenty of space to add an extension that fixes the problem. Meanwhile, embedded systems paying a premium for transistors will not have to pay for these extra instructions, as only 47 instructions have to be implemented in a minimal solution.
by Taniwha on 12/2/21, 7:17 PM
It's a trade-off, and the one that's been made makes ALL instructions a little faster at the expense of one particular case that isn't used much. That's how you do computer architecture: you look at the whole, not just one particular case.
RISC-V also specifies a 128-bit variant that is of course FASTER than these examples.
by bArray on 12/3/21, 1:55 AM
When you hear the "<person / group> could make a better <implementation> in <short time period>" - call them out. Do it. The world will not shun a better open license ISA. We even have some pretty awesome FPGA boards these days that would allow you to prototype your own ISA at home.
In terms of the market, now is an exceptionally good time to go back to the design room. It's not as if anybody will be manufacturing much during the next year, with the fabs unable to make enough existing chips to meet demand. There is a window of opportunity here.
> It is, more-or-less a watered down version of the 30 year old Alpha ISA after all. (Alpha made sense at its time, with the transistor budget available at the time.)
As I see it, lower numbers of transistors could also be a good thing. It seems blatantly obvious at this point that multi-core software is not only here to stay, but is the future. Lower numbers of transistors means squeezing more cores onto the same silicon, or implementing larger caches, etc.
I also really like the Unix philosophy of doing one simple thing well. Sure, it could have some special instruction that does exactly your use case in one cycle using all the registers, but that's not what has created such advances in general purpose computing.
> Sure, it is "clean" but just to make it clean, there was no reason to be naive.
I would much rather we build upon a conceptually clean instruction set than try to cobble together hacks on top of fundamentally flawed designs, even at the cost of performance. It's exactly these cobbled-together conceptual hacks that have led to the likes of the Spectre and Meltdown vulnerabilities, where instruction sets become so complicated that they cannot be easily tested.
by jpfr on 12/2/21, 7:13 PM
That allows more flexibility for CPU designs to optimize transistor count vs speed vs energy consumption.
This guy clearly did not look at the stated rationale for the design decisions of RISC-V.
by Symmetry on 12/2/21, 7:33 PM
RISC-V has a number of places where it makes an excellent fit. First of all, academia. For an undergrad building the netlist for their first processor, or a grad student doing their first out-of-order processor, RISC-V's simplicity is great for pedagogical purposes. For a researcher trying to experiment with better branch prediction techniques, having a standard high-ish performance open source design they can take and modify with their ideas is immensely helpful. And many companies in the real world with their eyes on the bottom line like having an ISA where you can add instructions that happen to accelerate your own particular workload, where you can use a standard compiler framework outside your special assembly inner loops, and where you don't have to spend transistors on features you don't need.
I'm not optimistic about RISC-V's widescale adoption as an application processor. If I were going to start designing an open source processor in that space I'd probably start with IBM's now open Power ISA. But there are so many more niches in the world than just that and RISC-V is already a success in some of them.
by kelnos on 12/2/21, 10:35 PM
Kinda stopped reading here. It's a pretty arrogant hot take. I don't know this guy, maybe he's some sort of ISA expert. But it strains credulity that after all this time and work put into it, RISC-V is a "terrible architecture".
My expectation here is that RISC-V requires some inefficient instruction sequences in some corners somewhere (and one of these corners happens to be OP's pet use case), but by and large things are fine.
And even then, I don't think that's clear. You're not going to determine performance just by looking at a stream of instructions on modern CPUs. Hell, it's really hard to compare streams of instructions from different ISAs.
by kazinator on 12/3/21, 2:40 AM
typedef __int128_t int128_t;

int128_t add(int128_t left, int128_t right)
{
    return left + right;
}
GCC 10, -O2, RISC-V:
add(__int128, __int128):
mv a5,a0
add a0,a0,a2
sltu a5,a0,a5
add a1,a1,a3
add a1,a5,a1
ret
ARM64:
add(__int128, __int128):
adds x0, x0, x2
adc x1, x1, x3
ret
This issue hurts the wider types that are compiler built-ins. Even though C has a programming model that is devoid of any carry-flag concept, canned types like a 128-bit integer can take advantage of it.
Portable C code to simulate a 128 bit integer will probably emit bad code across the board. The code will explicitly calculate the carry as an additional operand and pull it into the result. The RISC-V won't look any worse, then, in all likelihood.
(The above RISC-V instruction set sequence is shorter than the mailing list post author's 7 line sequence because it doesn't calculate a carry out: the result is truncated. You'd need a carry out to continue a wider addition.)
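For reference, the portable simulation described above can be sketched in C (my code, not from the comment); the explicit carry computation is exactly the pattern that compiles to add + sltu on RISC-V:

```c
#include <stdint.h>

/* Portable 128-bit add built from two 64-bit halves. The carry is
 * computed explicitly: an unsigned add wrapped iff the result is
 * smaller than either operand. There is no carry-out from the top
 * half, matching the truncating behavior discussed above. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add_u128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = r.lo < a.lo;  /* carry-out of the low add */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

With optimization on, compilers can lower this to the same add/sltu (RISC-V) or add/adc (x86) sequences shown above.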
by dragontamer on 12/2/21, 8:12 PM
2 instructions to work with 64 bits, maybe 1 more instruction / macro-op for the compare-and-jump back up to the top of a loop, and 1 more instruction for a loop counter of some kind?
So we're looking at ~4 instructions per 64 bits on ARM/x86, but ~9 instructions on RISC-V.
In practice, however, the loop will be executed in parallel thanks to out-of-order / superscalar execution, so the discussion in the post (2 instructions on x86 vs 7 instructions on RISC-V) is probably closest to the truth.
----------
Question: is ~2 clock ticks per 64 bits really the ideal? I don't think so. It seems to me that bignum arithmetic is easily SIMD-able. Carries are NOT accounted for in x86 AVX or ARM NEON instructions, so x86, ARM, and RISC-V would probably be on an equal footing there.
I don't know exactly how to write a bignum addition loop in AVX off the top of my head. But I'd assume it'd be similar to the 7-instructions listed here, except... using 256-bit AVX-registers or 512-bit AVX512 registers.
So 7 instructions to perform 512 bits of bignum addition is ~73 bits per clock cycle, far superior in speed to the 32 bits per clock cycle from add + adc (the 64-bit code with implicit condition codes).
AVX-512 is uncommon, but AVX (256-bit) is common on x86 at least, leading to ~36 bits per clock tick.
----------
ARM has SVE, which is variable-width (128 bits on some implementations, 512 bits on others). RISC-V has a bunch of competing vector extensions.
..........
Ultimately, I'm not convinced that the add + adc methodology here is best anymore for bignums. With a wide-enough vector, it seems more important to bring forth big 256-bit or 512-bit vector instructions for this use case?
EDIT: How many bits is the typical bignum? I think add+adc probably is best for 128, 256, or maybe even 512-bits. But moving up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say without me writing code, but just a hunch).
2048-bit RSA is the common bignum, right? Any other bignums that are commonly used? EDIT2: Now that I think of it, addition isn't the common operation in RSA; it's multiplication (and division, which is built on multiplication).
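To make the per-limb estimate above concrete, here is a scalar multi-limb addition loop in C (a sketch of mine, not from the thread); each iteration is the add-with-carry pattern that lowers to roughly add + adc on x86/ARM and needs extra sltu instructions on RV64I:

```c
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n 64-bit limbs; returns the final carry-out.
 * The carry is propagated explicitly across iterations. */
static uint64_t bignum_add(uint64_t *r, const uint64_t *a,
                           const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;   /* carry from adding carry-in */
        uint64_t t = s + b[i];
        uint64_t c2 = t < s;       /* carry from adding b[i] */
        r[i] = t;
        carry = c1 | c2;           /* at most one can be set */
    }
    return carry;
}
```

A vectorized version would need to handle carry propagation across lanes, which is why the SIMD payoff is a hunch rather than a given.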
by Teknoman117 on 12/2/21, 7:38 PM
MIPS didn't have a flag register either; it depended on a dedicated zero register and slt (set if less than) instructions.
by okl on 12/2/21, 7:37 PM
Edit: Don't get me wrong, I don't think RISC-V is "garbage" or anything like that. I just think it could have been better. But of course, most of an architecture's value comes from its ecosystem and the time spent optimizing and tailoring everything...
by jhallenworld on 12/2/21, 9:25 PM
You can detect carry of (a+b) in C branch-free with: ((a&b) | ((a|b) & ~(a+b))) >> 31
So 64-bit add in C is:
f_low = a_low + b_low
c_high = ((a_low & b_low) | ((a_low | b_low) & ~f_low)) >> 31
f_high = a_high + b_high + c_high
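As a sanity check, the branch-free carry expression can be verified against a widening 64-bit reference (a quick test sketch of mine, not from the comment):

```c
#include <stdint.h>

/* Branch-free carry detection for a 32-bit add, as given above:
 * carry iff both MSBs are set, or one MSB is set and the sum's
 * MSB is clear. */
static uint32_t carry32(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    return ((a & b) | ((a | b) & ~sum)) >> 31;
}

/* Reference: compute the same carry with 64-bit arithmetic. */
static uint32_t carry32_ref(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) >> 32);
}
```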
So for RISC-V in gcc 8.2.0 with -O2 -S -c:
add a1,a3,a2
or a5,a3,a2
not a7,a1
and a5,a5,a7
and a3,a3,a2
or a5,a5,a3
srli a5,a5,31
add a4,a4,a6
add a4,a4,a5
But for ARM I get (with gcc 9.3.1):
add ip, r2, r1
orr r3, r2, r1
and r1, r1, r2
bic r3, r3, ip
orr r3, r3, r1
lsr r3, r3, #31
add r2, r2, lr
add r2, r2, r3
It's shorter because ARM has bic. Neither one figures out to use carry-related instructions. Ah! But! There is a gcc built-in, __builtin_uadd_overflow(), that replaces the first two C lines above:
c_high = __builtin_uadd_overflow(a_low, b_low, &f_low);
So with this:
RISC-V:
add a3,a4,a3
sltu a4,a3,a4
add a5,a5,a2
add a5,a5,a4
ARM:
adds r2, r3, r2
movcs r1, #1
movcc r1, #0
add r3, r3, ip
add r3, r3, r1
RISC-V is faster. EDIT: clang has one better: __builtin_addc().
f_low = __builtin_addcl(a_low, b_low, 0, &c);
f_high = __builtin_addcl(a_high, b_high, c, &junk);
x86:
addl 8(%rdi), %eax
adcl 4(%rdi), %ecx
ARM:
adds w8, w8, w10
add w9, w11, w9
cinc w9, w9, hs
RISC-V:
add a1, a4, a5
add a6, a2, a3
sltu a2, a2, a3
add a6, a6, a2
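For completeness, the __builtin_uadd_overflow version above can be packaged as a compilable function (names are mine; both GCC and clang provide this built-in):

```c
#include <stdint.h>

/* 64-bit add split into 32-bit halves, using the overflow built-in
 * to get the low-word carry (it returns 1 if the add wrapped). */
static void add64_split(uint32_t a_low, uint32_t a_high,
                        uint32_t b_low, uint32_t b_high,
                        uint32_t *f_low, uint32_t *f_high)
{
    unsigned int low;  /* __builtin_uadd_overflow takes unsigned int* */
    uint32_t c_high = __builtin_uadd_overflow(a_low, b_low, &low);
    *f_low = low;
    *f_high = a_high + b_high + c_high;
}
```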
by sosodev on 12/2/21, 7:33 PM
by ksec on 12/2/21, 8:28 PM
You do not criticise The Rusted Holy Grail and the Riscy Silver Bullet.
by jasonhansel on 12/2/21, 9:58 PM
by marcodiego on 12/2/21, 9:15 PM
by DeathArrow on 12/3/21, 5:41 AM
by nynx on 12/2/21, 7:28 PM
by fhood on 12/2/21, 9:19 PM
by bell-cot on 12/2/21, 8:21 PM
by pcwalton on 12/2/21, 8:18 PM
by CalChris on 12/2/21, 7:42 PM
My code snippet results in bloated code for RISC-V RV64I.
I'm not sure how bloated it is. All of those instructions will compress [1].
[1] https://riscv.org/wp-content/uploads/2015/05/riscv-compresse...
It's slower on RISC-V but not a lot on a superscalar. The x86 and ARMv8 snippets have 2 cycles of latency. The RISC-V has 4 cycles of latency.
1. add  t0, a4, a6     add  t1, a5, a7
2. sltu t6, t0, a4     sltu t2, t1, a5
3. add  t4, t1, t6     sltu t3, t4, t1
4. add  t6, t2, t3
I'm not getting terrible from this.
by xondono on 12/2/21, 9:34 PM
For those are more versed, is this really a general problem?
I was under the impression that the real bottleneck is memory, that things like this would be fixed in real applications through out-of-order execution, and that it paid off to have simpler instructions because compilers had more freedom to rearrange things.
by dlsa on 12/2/21, 9:43 PM
Is that even a fair comparison, given that the ARM and x86 versions used as examples of "better" were 64-bit?
If we're really comparing 32-bit and 64-bit and complaining that 32-bit uses more instructions than 64-bit, perhaps we should dig out the 4-bit processors and really sharpen the pitchforks. Alternatively, we could simply not. Comparing apples to oranges doesn't really help.
From the article:
Let's look at some examples of how Risc V underperforms.
First, addition of a double-word integer with carry-out:
add t0, a4, a6 // add low words
sltu t6, t0, a4 // compute carry-out from low add
add t1, a5, a7 // add hi words
sltu t2, t1, a5 // compute carry-out from high add
add t4, t1, t6 // add low-word carry into high result
sltu t3, t4, t1 // compute carry out from the carry add
add t6, t2, t3 // combine carries
Same for 64-bit ARM:
adds x12, x6, x10
adcs x13, x7, x11
Same for 64-bit x86:
add %r8, %rax
adc %r9, %rdx
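For reference, here is a C equivalent of the article's double-word add with carry-out, one line per RISC-V instruction (a reconstruction of mine; the post shows only assembly):

```c
#include <stdint.h>

/* 128-bit add with carry-out, mirroring the 7-instruction
 * RISC-V sequence above line by line. */
static uint64_t add128_carry_out(uint64_t a_lo, uint64_t a_hi,
                                 uint64_t b_lo, uint64_t b_hi,
                                 uint64_t *r_lo, uint64_t *r_hi)
{
    uint64_t lo  = a_lo + b_lo;   /* add low words */
    uint64_t c0  = lo < a_lo;     /* carry-out from low add */
    uint64_t hi  = a_hi + b_hi;   /* add high words */
    uint64_t c1  = hi < a_hi;     /* carry-out from high add */
    uint64_t hi2 = hi + c0;       /* add low carry into high result */
    uint64_t c2  = hi2 < hi;      /* carry-out from the carry add */
    *r_lo = lo;
    *r_hi = hi2;
    return c1 | c2;               /* combine carries */
}
```

The x86 and ARM versions collapse all of this into two flag-using instructions, which is the whole point of contention in the thread.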
by YesThatTom2 on 12/3/21, 12:33 PM
by mda on 12/3/21, 9:38 AM
"I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project"
Utter horse manure.
by SavantIdiot on 12/2/21, 8:06 PM
by throwaway4good on 12/3/21, 5:48 AM
Perhaps something similar is needed for ISAs / CPUs? A standard suite (say an OS kernel, a ZIP algorithm, Mandelbrot, Fizz-buzz) that measures code compactness but also performance and energy usage.
by rep_lodsb on 12/3/21, 2:44 PM
Everything should be written in C, or some scripting language implemented in C. Writing safe code is easy, just wrap everything in layers of macros that the compiler will magically optimize away, and if it doesn't, computers are fast enough anyway, right? The mark of a real programmer is that every one of their source files includes megabytes of headers defining things like __GNU__EXTENSION_FOO_BAR_F__UNDERSCORE_.
You say your processor has a single instruction to do some extremely common operation, and want to use it? You shouldn't even be reading a processor manual unless you are working on one of the two approved compilers, preferably GCC! If you are very lucky, those compiler people that are so much smarter than you could hope to be, have already implemented some clever transformation that recognizes the specific kind of expression produced by a set of deeply nested macros, and turns them into that single instruction. In the process, it will helpfully remove null pointer checks because you are relying on undefined behaviour somewhere else.
You say you'll do it in assembly? For Kernighan's sake, think about portability!!! I mean, portable to any other system that more or less looks the same as UNIX, with a generous sprinkling of #ifdefs and a configure script that takes minutes to run.
Implement a better language? Sure, as long as the compiler is written in C, preferably outputs C source code (that is then run through GCC), and the output binary must of course link against the system's C library. You can't do it any other way, and every proper UNIX - BSD or Mac OS X - will make it literally impossible by preventing syscalls from any other piece of code.
IMO this is like a cultural virus that seems to have infected everything IT-related, and I don't exactly understand why. Sure, having all these layers of cruft down below lets us build the next web app faster, but isn't it normal to want to fix things? Do some people actually get a sense of satisfaction out of saying "It is a solved problem, don't reinvent the wheel"? Or do they want to think that their knowledge of UNIX and C intricacies is somehow the most important, fundamental thing in computer science?
by tomxor on 12/3/21, 12:29 AM
Isn't this the classic RISC vs CISC problem?
Comparing x86/ARM to RISC-V feels like Apples to Grains of Rice.
If RISC-V was born out of a need for an open source embedded ISA, would the ISA not need to remain very RISC-like to accommodate implementations with fewer available transistors... Or is this an outdated assumption?
by adapteva on 12/3/21, 2:58 AM
by usr1106 on 12/3/21, 8:27 AM
Whether similar awkwardness applies to a lot of other code can't be told from this isolated case.
by aappleby on 12/2/21, 7:22 PM
by Shadonototra on 12/2/21, 8:37 PM
Moderators where are you?
by throwaway19937 on 12/2/21, 7:58 PM
I'm not a fan of the RISC-V design but the presence or absence of this instruction doesn't make it a terrible architecture.
by akimball on 12/3/21, 2:41 AM
by yjftsjthsd-h on 12/2/21, 8:03 PM
by kayamon on 12/2/21, 7:59 PM
by oneplane on 12/2/21, 10:22 PM
It doesn't matter how great something else could be in theory if it doesn't exist or doesn't meet the same scale and mindshare (or adoption).