by wenyuanyu on 2/28/25, 1:26 AM with 101 comments
by ammo1662 on 2/28/25, 3:09 AM
(Chinese) https://www.high-flyer.cn/blog/3fs/
This file system has been developed and used by them for several years.
Compared to traditional file systems, it is focused on model training workloads, which involve a lot of random reads. Read cache and prefetching are useless in that case, so they designed the file system without those features to improve performance.
I Google-translated some key parts here:
3FS is a special file system: it is used almost exclusively for batch-reading sample data on compute nodes during AI training, accelerating model training through high-speed interaction between compute and storage. This is a large-scale random-read workload, and the data that is read will not be reused soon afterward, so we cannot use the file system's most important optimization, the read cache, and even readahead is useless. As a result, the implementation of 3FS differs considerably from other file systems.
Specifically, as shown in the figure above, 3FS uses the Linux AIO and io_uring interfaces to read samples, because in the 3FS scenario the file cache provides no benefit at all; it only consumes system memory in a way that is difficult for users to control, affecting subsequent tasks. So we disabled the file cache and read data only in Direct I/O mode. Note that reads done this way require the buffer pointer, offset, and length to all be aligned. If users had to do this alignment themselves, extra memory copies would be generated, so we handle the alignment inside the file system, which both improves performance and is more convenient for users.
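For context, the O_DIRECT alignment constraint the translation mentions looks roughly like this — a minimal sketch assuming a 4 KiB alignment requirement (the real value depends on the device's logical block size); `aligned_window` is an illustrative helper name, not a 3FS API:

```python
ALIGN = 4096  # typical O_DIRECT alignment (logical block / page size); an assumption

def align_down(x, a=ALIGN):
    # Round x down to the nearest multiple of a
    return x - (x % a)

def align_up(x, a=ALIGN):
    # Round x up to the nearest multiple of a
    return ((x + a - 1) // a) * a

def aligned_window(offset, length, a=ALIGN):
    """For an arbitrary user request (offset, length), compute the aligned
    (start, size) that a Direct I/O read must actually cover, plus the slice
    of the aligned buffer that contains the user's requested bytes."""
    start = align_down(offset, a)
    size = align_up(offset + length, a) - start
    return start, size, slice(offset - start, offset - start + length)
```

For example, a 50-byte read at offset 100 becomes one aligned 4 KiB read starting at 0, with the user's bytes at `buffer[100:150]` — the bookkeeping the comment says 3FS does internally so users never see unaligned-read errors or pay for an extra copy in their own code.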
by codingwagie on 2/28/25, 4:47 PM
by thohj4234234324 on 2/28/25, 5:33 AM
OpenAI et al. have also gone pretty deep down the systems rabbit hole (e.g. Triton), but I can't think of anyone else (outside of Google/Facebook) who pays this much attention to these things.
Great work; hope Deepseek does even more awesome things going forward.
by tetron on 2/28/25, 3:09 AM
by pella on 2/28/25, 8:41 AM
arXiv:2408.14158v2 [cs.DC] 31 Aug 2024
"Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning"
Abstract:
"The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC."
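The abstract's HFReduce is not described in detail in this thread, but the communication pattern such allreduce libraries optimize is the classic ring all-reduce. A simulated sketch (in-process, sum reduction) — this is the generic algorithm, not HFReduce's actual CPU-assisted design:

```python
def split(v, n):
    """Split vector v into n equal chunks (assumes len(v) % n == 0 for simplicity)."""
    k = len(v) // n
    return [v[i * k:(i + 1) * k] for i in range(n)]

def ring_allreduce(data):
    """Simulate a ring all-reduce (sum) over n 'ranks', each holding one vector.
    Phase 1 (reduce-scatter): after n-1 steps each rank owns one fully summed chunk.
    Phase 2 (all-gather): n-1 more steps replicate the summed chunks to every rank."""
    n = len(data)
    chunks = [split(list(v), n) for v in data]
    # Reduce-scatter: at step s, rank r sends chunk (r - s) % n to rank (r + 1) % n,
    # which accumulates it. Snapshot payloads first so all sends happen "in parallel".
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n][:]) for r in range(n)]
        for r, ci, payload in sends:
            dst = (r + 1) % n
            chunks[dst][ci] = [a + b for a, b in zip(chunks[dst][ci], payload)]
    # All-gather: at step s, rank r forwards its freshest fully-summed chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n][:]) for r in range(n)]
        for r, ci, payload in sends:
            chunks[(r + 1) % n][ci] = payload
    return [[x for c in chunks[r] for x in c] for r in range(n)]
```

Each rank sends and receives only 2(n-1)/n of the data volume regardless of cluster size, which is why ring-style allreduce is the bandwidth-optimal baseline that hardware-aware variants like HFReduce build on.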
by hintymad on 2/28/25, 6:29 PM
Have we at the Valley companies lost touch?
by jauntywundrkind on 2/28/25, 7:01 AM
That's with 14 unnamed SSDs per node. I wonder how this would scale with higher-end SSDs, going from PCIe 4 to PCIe 5 or PCIe 6... Particularly whether one could scale down!
by bee_rider on 2/28/25, 4:37 AM
What are we going to see tomorrow? DeepSeek OS or something?
by yalogin on 2/28/25, 7:42 AM
Also, what specifically are the data access patterns for training and inference that differ from traditional use cases?
by budududuroiu on 2/28/25, 3:13 AM
by do_not_redeem on 2/28/25, 2:11 AM
by whalesalad on 2/28/25, 6:05 PM
by jeffbee on 2/28/25, 4:10 AM
by brcmthrowaway on 2/28/25, 3:05 AM
by rvz on 2/28/25, 9:45 PM
Can't wait to see what they release next. DeepSeek should be studied carefully.
by WithinReason on 2/28/25, 7:41 AM
by pepsi-not-coke on 2/28/25, 3:07 AM