from Hacker News

Linux Pipes Are Slow

by qsantos on 8/25/24, 4:52 PM with 166 comments

  • by koverstreet on 8/25/24, 11:08 PM

    One of my sideprojects is intended to address this: https://lwn.net/Articles/976836/

    The idea is a syscall for getting a ringbuffer for any supported file descriptor, including pipes - and for pipes, if both ends support using the ringbuffer they'll map the same ringbuffer: zero copy IO, potentially without calling into the kernel at all.

    Would love to find collaborators for this one :)
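
    To illustrate the general idea only (this is not the interface from the LWN article, and every name below is hypothetical): a minimal single-producer, single-consumer byte ring. The two sides coordinate through atomic counters alone, so once both have the buffer mapped, no syscall is needed to move data. This sketch lives in one process and still copies into the ring; a real shared mapping between a pipe's two ends is assumed, not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 4096u /* hypothetical capacity; must be a power of two */

struct ring {
    _Atomic size_t head; /* total bytes written; advanced only by the producer */
    _Atomic size_t tail; /* total bytes consumed; advanced only by the consumer */
    unsigned char data[RING_SIZE];
};

/* Producer side: copy up to len bytes in, return how many actually fit. */
size_t ring_write(struct ring *r, const void *src, size_t len)
{
    assert(r != NULL && src != NULL);
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t space = RING_SIZE - (head - tail);
    if (len > space)
        len = space;
    size_t off = head & (RING_SIZE - 1);
    size_t first = len < RING_SIZE - off ? len : RING_SIZE - off;
    memcpy(r->data + off, src, first);
    memcpy(r->data, (const unsigned char *)src + first, len - first);
    /* Release: publish the bytes before the consumer can observe the new head. */
    atomic_store_explicit(&r->head, head + len, memory_order_release);
    return len;
}

/* Consumer side: copy up to len bytes out, return how many were available. */
size_t ring_read(struct ring *r, void *dst, size_t len)
{
    assert(r != NULL && dst != NULL);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t avail = head - tail;
    if (len > avail)
        len = avail;
    size_t off = tail & (RING_SIZE - 1);
    size_t first = len < RING_SIZE - off ? len : RING_SIZE - off;
    memcpy(dst, r->data + off, first);
    memcpy((unsigned char *)dst + first, r->data, len - first);
    /* Release: hand the freed space back to the producer. */
    atomic_store_explicit(&r->tail, tail + len, memory_order_release);
    return len;
}
```

    Neither function blocks: a full ring makes ring_write return a short count and an empty ring makes ring_read return 0, so a real design would still need some way to sleep and wake the other side.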

  • by fatcunt on 8/26/24, 11:31 AM

    > I do not know why the JMP is not just a RET, however.

    This is caused by the CONFIG_RETHUNK option. In the disassembly from objdump you are seeing the result of RET being replaced with JMP __x86_return_thunk.

    https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...

    https://github.com/torvalds/linux/blob/v6.1/arch/x86/lib/ret...

    > The NOP instructions at the beginning and at the end of the function allow ftrace to insert tracing instructions when needed.

    These are from the ASM_CLAC and ASM_STAC macros, which make space for the CLAC and STAC instructions (both of them three bytes in length, same as the number of NOPs) to be filled in at runtime if X86_FEATURE_SMAP is detected.

    https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...

    https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...

    https://github.com/torvalds/linux/blob/v6.1/arch/x86/kernel/...

  • by 0xbadcafebee on 8/25/24, 11:13 PM

    Calling Linux pipes "slow" is like calling a Toyota Corolla "slow". It's fast enough for all but the most extreme use cases. Are you racing cars? In a sport where speed is more important than technique? Then get a faster car. Otherwise stick to the Corolla.
  • by JoshTriplett on 8/25/24, 11:31 PM

    This is a side note to the main point being made, but on modern CPUs, "rep movsb" is just as fast as the fastest vectorized version, because the CPU knows to accelerate it. The name of the kernel function "copy_user_enhanced_fast_string" hints at this: the CPU features are ERMS ("Enhanced Repeat Move String", which makes "rep movsb" faster for anything above a certain length threshold) and FSRM ("Fast Short Repeat Move String", which makes "rep movsb" faster for shorter moves too).
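
    For anyone who has not seen it written out, a minimal sketch of a copy done with a single "rep movsb", using GCC/Clang inline assembly. The function name is made up for illustration, and this is not the kernel's implementation. On non-x86-64 targets the sketch simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes from src to dst (regions must not overlap). */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *d = dst;
    /* rep movsb copies RCX bytes from [RSI] to [RDI] and updates all three
     * registers, hence the read-write ("+") constraints. The direction flag
     * is assumed clear, as the System V ABI guarantees at function entry. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n); /* portable fallback for other architectures */
#endif
    return dst;
}
```

    Whether this matches a vectorized copy depends on the CPU having ERMS/FSRM, so it is worth benchmarking on the actual target rather than assuming.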
  • by donaldihunter on 8/26/24, 2:20 PM

    Something I didn't see mentioned in the article about AVX512, aside from the xsave/xrstor overhead, is that AVX512 is power hungry and causes CPU frequency scaling. See [1], [2] for details and as an example of how nuanced it can get.

    [1] https://www.intel.com/content/dam/www/central-libraries/us/e...

    [2] https://www.intel.com/content/www/us/en/developer/articles/t...

  • by nitwit005 on 8/25/24, 10:06 PM

    Just about every form of IPC is "slow". You have decided to pay a performance cost for safety.
  • by qsantos on 8/26/24, 6:55 AM

    I am again getting the hug of death of Hacker News. The situation is better than the last time thanks to caching WordPress pages, but loading the page can still take a few seconds, so bear with me!
  • by RevEng on 8/25/24, 11:01 PM

    I didn't quite grasp why the original splice has to be so slow. They pointed out what made it slower than vmsplice - in particular allocating buffers and using scalar instructions - but why is this necessary? Why couldn't splice just be reimplemented as vmsplice? I'm sure there is a good reason, but I've missed it.
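
    For reference, a minimal sketch of what the vmsplice side looks like from user space, assuming Linux with glibc. The helper name is hypothetical and the article's benchmark code may differ.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hand len bytes at buf to the write end of a pipe with vmsplice(2).
 * Returns the number of bytes queued, or -1 on error (errno is set). */
ssize_t vmsplice_all(int pipe_wr_fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    size_t total = 0;

    while (iov.iov_len > 0) {
        /* The pipe may accept only part of the range, so loop on the rest. */
        ssize_t n = vmsplice(pipe_wr_fd, &iov, 1, 0);
        if (n < 0)
            return -1;
        if (n == 0)
            break;
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

    One caveat, as I understand it: the pipe ends up referencing the caller's pages rather than a copy, so the buffer should not be modified until the reader has consumed it.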
  • by rwmj on 8/26/24, 12:30 PM

    Be interesting to see a version using io_uring, which I think would let you pre-share buffers with the kernel avoiding some copies, and avoid syscall overhead (though the latter seems negligible here).
  • by stabbles on 8/26/24, 9:26 AM

    A bold claim for a blog that takes about 20 seconds to load.
  • by Borg3 on 8/26/24, 8:15 AM

    Haha. When I read the title I smiled. Linux pipes slow? Moook.. Now try Cygwin pipes. That's what I call slow!

    Anyway, nice article, it's good to know what's going on under the hood.

  • by faizshah on 8/26/24, 3:59 PM

    This is a really cool post and that is a massive amount of throughput.

    In my experience in data engineering, it’s very unlikely your business logic can exceed 500 MB/s of throughput, as most of the libraries you’re using are not optimized to that degree (SIMD etc.). That being said, I think it’s a good technique to try out.

    I’m trying to think of other applications this could be useful for. Maybe video workflows?

  • by sixthDot on 8/26/24, 10:22 AM

    > I do not know why the JMP is not just a RET, however.

    The jump seems generated by the expansion of the `ASM_CLAC` macro, which is supposed to change the EFLAGS register ([1], [2]). However, in this case the expansion looks like it does nothing (maybe because of the target?). I'd be interested to know more about that. Call to the wild.

    [1]: https://github.com/torvalds/linux/blob/master/arch/x86/inclu...

    [2]: https://stackoverflow.com/a/60579385

  • by yencabulator on 8/26/24, 6:43 PM

    FUSE can be a bit trickier than a single queue of data chunks. Reads from /dev/fuse actually pick the right message to read based on priorities, and there are cases where the message queue is meddled with, e.g. to cancel requests before they're even sent to userspace. If you naively switch it to eagerly putting messages into a userspace-visible ringbuffer, you might significantly change behavior in cases like interrupting slow operations. Imagine having to fulfill a ringbuf worth of requests to a misbehaving backend taking 5sec/op, just to see the cancellations at the very end.
  • by nyanpasu64 on 8/26/24, 4:40 AM

    How do you gather profiling information for kernel function calls from a user program?
  • by jvanderbot on 8/26/24, 3:57 PM

    > Although SSE2 is always available on x86-64, I also disabled the cpuid bit for SSE2 and SSE to see if it could nudge glibc into using scalar registers to copy data. I immediately got a kernel panic. Ah, well.

    I think you need to recompile your compiler, or disable those explicitly via link / cc flags. It is fairly hard to coax compilers into, or dissuade them from, emitting SIMD instructions, IMHO.

  • by arendtio on 8/26/24, 12:15 PM

    I know pipes primarily from shell scripts. Are they being used in other contexts as extensively, too? Like C or Rust programs?
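
    For a sense of what that looks like outside the shell, here is a minimal sketch in C of the same mechanism a shell pipeline is built on: pipe(2) plus fork(2). The function name is made up for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that writes msg into a pipe; the parent reads it back into
 * out (at most cap bytes). Returns the number of bytes read, or -1 if the
 * pipe or the child could not be created. */
ssize_t read_from_child(const char *msg, char *out, size_t cap)
{
    assert(msg != NULL && out != NULL);
    int fds[2]; /* fds[0] is the read end, fds[1] the write end */
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: keep only the write end, send the message, and exit. */
        close(fds[0]);
        size_t len = strlen(msg);
        ssize_t w = write(fds[1], msg, len);
        _exit(w == (ssize_t)len ? 0 : 1);
    }

    /* Parent: close the write end so read() sees EOF once the child is done. */
    close(fds[1]);
    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fds[0], out + total, cap - total);
        if (n <= 0)
            break;
        total += (size_t)n;
    }
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

    Calling read_from_child("hello", buf, sizeof buf) fills buf with the child's bytes; a shell does roughly the same thing, but points the child's stdout at the write end with dup2 before exec.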
  • by up2isomorphism on 8/26/24, 3:21 PM

    Someone tasted a piece of bread and thought it was not sweet enough, which is fine. But calling the bread bland is funny, because bread is not meant to taste sweet.
  • by jeremyscanvic on 8/26/24, 5:11 PM

    Great post! I didn't know about vmsplice(2). I'm glad to see a former ENSL student here as well!
  • by goodpoint on 8/26/24, 10:14 AM

    Excellent article even if, to be honest, the title is clickbait.
  • by mparnisari on 8/26/24, 3:32 PM

    I get PR_CONNECT_RESET_ERROR when trying to open the page
  • by cowsaymoo on 8/26/24, 6:23 AM

    What is the library used to profile the program?
  • by djaouen on 8/25/24, 10:29 PM

    So is Python, but I'm still gonna use it lol
  • by jheriko on 8/25/24, 9:18 PM

    just never use pipes. they are some weird archaism that needs to die :P

    the only times i've used them were because of external constraints. they are just not useful.