by beck5 on 3/30/16, 9:07 AM with 33 comments
by zimpenfish on 3/30/16, 10:23 AM
There needs to be some law about how temporary directories always end up containing vitally important data.
by hga on 3/30/16, 1:40 PM
One obvious solution would be to use a ramdisk, a virtual disk that actually resides in the memory of a node. The problem was that even our biggest system had 1.5TB of memory while we needed at least 3TB.
As a workaround we created ramdisks on a number of Taito cluster compute nodes, mounted them via iSCSI over the high-speed InfiniBand network to a server and pooled them together to make a sufficiently large filesystem for our needs.
A hack they weren't at all sure would work, but it did nicely.
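The pooled-ramdisk workaround could be sketched roughly as below. This is a speculative reconstruction, not CSC's actual configuration: the device names, IQNs, hostnames, sizes, and node count are all illustrative assumptions. The commands assume a Linux environment with targetcli (LIO), open-iscsi, and mdadm installed, and every step requires root.

```shell
# --- On each donor compute node: carve out a ramdisk and export it via iSCSI ---
# Create one RAM-backed block device of 500 GiB (brd's rd_size is in KiB).
modprobe brd rd_nr=1 rd_size=$((500 * 1024 * 1024))

# Export /dev/ram0 as an iSCSI target (the IQN is a made-up example).
targetcli /backstores/block create name=ram0 dev=/dev/ram0
targetcli /iscsi create iqn.2016-03.example.csc:ramdisk-node01
targetcli /iscsi/iqn.2016-03.example.csc:ramdisk-node01/tpg1/luns \
    create /backstores/block/ram0

# --- On the aggregating server: log in to every exported ramdisk ---
# Hostnames like node01.ib are assumed to resolve over IPoIB on the
# InfiniBand fabric; iSER would be the faster transport if available.
for node in node01 node02 node03 node04 node05 node06 node07; do
    iscsiadm -m discovery -t sendtargets -p "${node}.ib"
    iscsiadm -m node -T "iqn.2016-03.example.csc:ramdisk-${node}" -l
done

# Pool the resulting /dev/sd* devices into one large striped volume.
# RAID-0 for capacity and speed, no redundancy -- the backing is RAM anyway.
mdadm --create /dev/md0 --level=0 --raid-devices=7 /dev/sd[b-h]
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/scratch
```

With seven 500 GiB ramdisks striped together this would yield roughly 3.5 TB, in line with the "at least 3 TB" requirement the article mentions.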
by ghubbard on 3/30/16, 1:32 PM
Article title: The largest unplanned outage in years and how we survived it
Article overview: A month ago CSC's high-performance computing services suffered the largest unplanned outage in years. In total approximately 1.7 petabytes and 850 million files were recovered.
Although technically correct, the HN title is misleading.
by pinewurst on 3/30/16, 4:45 PM
by gnufx on 3/30/16, 10:46 PM
I've had to employ the horrible hack of iSCSI from compute nodes, RAIDed and re-exported, but it's not what I'd have tried first. The article doesn't mention the possibility of just spinning up a parallel filesystem on compute-node local disks (assuming they have disks); I wonder if that was ruled out. I don't have a good feeling for the numbers, but I'd have tried OrangeFS on a good number of nodes initially.
By the way, it's been pointed out that RAM disk is relatively slow in terms of data rates (as opposed to metadata): http://mvapich.cse.ohio-state.edu/static/media/publications/....
by ajford on 3/30/16, 5:41 PM
Or was the inode problem not a local-disk problem but a problem in the Lustre fs? I couldn't quite tell from the article.
by beezle on 3/30/16, 5:07 PM