by nwjsmith on 12/5/12, 1:03 AM with 128 comments
by coffeemug on 12/5/12, 2:33 AM
This quote, much like the various quantum mechanics quotes adopted by laymen, keeps haunting honest systems programmers because people with a little knowledge read it, misinterpret (or misunderstand) it, and then share it.
Look, I don't know how Squid is designed, but most database systems use this strategy, and it does not get into wars with the kernel, for a whole slew of reasons that aren't addressed in the article. I know, because we've done a ton of sophisticated benchmarking comparing custom, use-case-specific cache performance to general-purpose page cache performance. Here are a few of the many, many reasons why this quote cannot be applied to sensibly designed pieces of systems software:
1. If the database/proxy/whatever server is designed correctly, it'll never use so much RAM that it goes into swap. That means the kernel won't magically page out its memory, preventing it from doing its job.
2. In fact, kernels provide mechanisms (such as mlock) to guarantee this.
3. Also, if your process misbehaves, modern kernels will deploy the OOM killer (depending on how things are configured), so you can't get into a fight with the page cache without being sniped.
4. Of course you have to be smart and read from the file in a way that bypasses the page cache (via O_DIRECT). Yes, it complicates things greatly for systems programmers (all sorts of alignment issues, journaling filesystem issues, etc.), but if you want high performance, especially on SSDs, and have special use cases to warrant it, it's worth it. (Both this and the mlock trick from point 2 are sketched below.)
5. If you really know what you're doing, a custom cache can be significantly more efficient than the general-purpose kernel cache, which in turn can have a significant impact on the performance bottom line. For example, a B-tree-aware caching scheme does less bookkeeping and has more information to base its decisions on than a general-purpose LRU-K cache.
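To make points 2 and 4 concrete, here is a minimal sketch of both mechanisms together: mlock() to pin a buffer in RAM, and O_DIRECT to bypass the page cache. The file name is a placeholder, and O_DIRECT requires the buffer, offset, and length to be aligned (4096 bytes is a common logical block size):

    #define _GNU_SOURCE   /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLOCK 4096

    int main(void) {
        /* O_DIRECT requires an aligned buffer. */
        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK) != 0) { perror("posix_memalign"); return 1; }

        /* Pin the buffer so the kernel can never page it out (point 2). */
        if (mlock(buf, BLOCK) < 0) { perror("mlock"); return 1; }

        /* Read straight from the device, bypassing the page cache (point 4). */
        int fd = open("data.db", O_RDONLY | O_DIRECT);  /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = pread(fd, buf, BLOCK, 0);  /* offset must also be aligned */
        if (n < 0) { perror("pread"); return 1; }
        printf("read %zd bytes without touching the page cache\n", n);

        close(fd);
        munlock(buf, BLOCK);
        free(buf);
        return 0;
    }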
In fact, it is absolutely astounding how many 1975 abstractions translate wonderfully into the world of 2012. Architecturally, almost everything that worked back then still works now, including OS research, PL research, algorithms research, and software engineering research -- the four pillars that are holding up the modern software world. Some things are obsolete, perhaps, but far, far fewer than one might think.
Incidentally, this is also one of the reasons why I cringe when people say "the world is changing so fast, it's getting harder and harder to keep up". In matters of fashion, perhaps, but as far as core principles go (in computer science, mathematics, human emotions/interaction, and pretty much everything else of consequence) the world is moving at a glacial pace. Shakespeare might be a bit clunky to read these days because the language is a bit out of style, but what Hamlet had to say in 1600 is, amazingly, just as relevant today (and likely much more useful, because instead of actually reading Hamlet, most people read things like The Purple Cow, The 22 Immutable Laws of Marketing, The 99 Immutable Laws of Leadership, etc.)
by jessedhillon on 12/5/12, 9:10 AM
It was a pretty amazing hack, from before magnetic core memory. Because sound moves slowly through a medium like mercury, an acoustic wave (that is, a sound) could be applied at one end of a volume of mercury and be expected to arrive at the other end after a predictable, useful delay. So a column of mercury with transducers on both ends would use them as speaker and microphone, which in an acoustic medium are the equivalent of write and read heads!
The system memory would be a collection of these columns, each I guess storing one bit. The memory would of course have to be refreshed: when the signal arrived at the other end, it would be fed back into the column, assuming, I suppose, that there wasn't a new signal waiting to be written to that bit instead. The article mentions that this was not randomly accessible memory, but rather serially accessible. From that and other bits of information, I gather that the device would visit each bit in sequence, according to some clock, and produce a signal on the read line corresponding to the value of that bit. You had to wait for the device to cycle around to the particular bit you were after.
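To make that concrete, here is a toy model in code of what I think is going on (entirely my own illustration, with one bit per column as guessed above and a made-up 8-tick transit time):

    #include <stdio.h>
    #include <string.h>

    #define DELAY_SLOTS 8  /* acoustic transit time of the column, in clock ticks */

    typedef struct {
        int line[DELAY_SLOTS];  /* bits "in flight" through the mercury */
        int head;               /* slot currently arriving at the microphone end */
    } delay_line;

    /* One clock tick: read the arriving bit, then feed the column again,
       either recirculating the old bit (refresh) or writing a new one. */
    static int tick(delay_line *d, int have_write, int write_bit) {
        int out = d->line[d->head];                /* microphone (read head) */
        d->line[d->head] = have_write ? write_bit  /* speaker (write head)   */
                                      : out;       /* refresh: feed it back  */
        d->head = (d->head + 1) % DELAY_SLOTS;
        return out;
    }

    int main(void) {
        delay_line d;
        memset(&d, 0, sizeof d);

        /* Write the pattern 1,0,1 during the first three ticks. */
        int pattern[3] = {1, 0, 1};
        for (int t = 0; t < 3; t++)
            tick(&d, 1, pattern[t]);

        /* To see slot 0 again we must wait for the whole column to come
           around: serial access, exactly as described above. */
        for (int t = 3; t < DELAY_SLOTS; t++)
            tick(&d, 0, 0);
        for (int t = 0; t < 3; t++)
            printf("bit %d = %d\n", t, tick(&d, 0, 0));
        return 0;
    }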
Does anyone know if this is a correct understanding of how this kind of storage worked? What a cool way to store bits!
by _delirium on 12/5/12, 2:00 AM
Among other things, it contains an interesting alternate perspective from a former Squid developer on some of Squid's design decisions. Some were driven by a goal of being maximally cross-platform and compatible with all possible clients/servers. Others were driven by the fact that Unix VM systems were still not very good well after 1975, even into the 1990s.
by marshray on 12/5/12, 2:22 AM
I used to think that too. Specifically, Windows NT was said to need a pagefile at least as large as physical RAM. This was back when a workstation might have 16MB RAM and a 1GB disk. I thought this was because the kernel might be eliminating the need for some indirection by directly mapping physical RAM addresses to pagefile addresses. I was wrong.
On the Linux side, you would typically see the recommendation to make a swap partition "twice the size of RAM". Despite the possibility of using swap files, most distros still give dire warnings if you don't define a fixed-size swap partition at installation.
I don't think there was ever a solid justification for this "twice RAM" heuristic. A better method might be something like "max amount of memory you're ever going to need minus physical RAM" or "max amount of time you're willing to be stuck in the weeds multiplied by the expected disk bandwidth under heavy thrashing".
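(To put made-up numbers on that second heuristic: if 30 seconds of thrashing is the most you'd tolerate and the disk sustains 50 MB/s under that load, anything beyond roughly 30 x 50 MB = 1.5 GB of swap is just extending the pain.)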
Regardless, if your server is actively swapping at all, you're probably doing it wrong. It's not just that swapping is slow; it's that your database or your web cache has special knowledge about the workload that, in theory, should allow it to cache more intelligently.
I'd prefer to disable swap entirely, but there are occasions where it can make the difference in being able to SSH into a box on which some process has started running away with CPU and RAM.
But this guy is a kernel developer, so he seems to feel that the kernel should manage the "one true cache". I like the ease and performance of memory-mapped files as much as the next guy, but I wouldn't go sneering at other developers for attempting to manage their disk IO in a more hands-on fashion.
by georgemcbay on 12/5/12, 2:03 AM
I'm not familiar with squid, but I'm quite familiar with the idea of programmers writing their own systems on top of other systems that are basically a worse implementation of something the underlying system is already doing.
To my chagrin, I occasionally catch myself doing this sort of thing when I'm first moving into a new language/API/concept and don't really understand what is going on underneath.
It is always a good idea to try the simplest thing that could possibly work first, then measure it, and only then try to improve it, always measuring your "improvements" against the baseline. And make sure you're measuring the right things. I think this is a concept most developers are aware of, but it's one of those things you have to constantly checklist yourself on, because it is too easy to backslide.
by mikeash on 12/5/12, 3:02 AM
by crazygringo on 12/5/12, 3:48 AM
When you're dealing with a web cache, don't you want to explicitly know whether your cache contents are in memory or on disk, and be able to fine-tune that? It seems like the last thing you want is the OS making decisions about memory vs disk for you. Am I missing something?
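One middle ground I'm aware of is hinting: the kernel still owns the cache, but the application tells it what it will and won't need. A minimal sketch using madvise(2) (the file name is a placeholder):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("cache_object.bin", O_RDONLY);  /* placeholder name */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hot object: ask the kernel to pull it into RAM ahead of use. */
        madvise(p, st.st_size, MADV_WILLNEED);

        /* ... serve requests out of p ... */

        /* Gone cold: tell the kernel these pages are fine to drop first. */
        madvise(p, st.st_size, MADV_DONTNEED);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }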
by javajosh on 12/5/12, 2:11 AM
In object oriented programming there is a thing called a CRC card[1] where you list what the responsibilities of important classes are. This helps the developer visualize and understand how the system works, and to keep things as orthogonal as practical. Here we have an example of someone pointing out that the system-level "CRC cards" are stepping on each other's toes. Pretty compelling stuff.
An aside: would there be any benefit to using Go rather than C for writing something like Varnish if you were starting in 2012?
[1] http://en.wikipedia.org/wiki/Class-responsibility-collaborat...
by halayli on 12/5/12, 2:48 AM
If you manage your own memory/swap, at least you can use async IO and free up the thread while the IO request is being served by the OS.
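A minimal sketch of what I mean, using POSIX AIO (the file name is a placeholder; link with -lrt on older glibc):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("cache.dat", O_RDONLY);  /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        static char buf[4096];
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        /* Kick off the read; the call returns immediately. */
        if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

        /* The thread is free to do useful work while the kernel
           services the request, instead of stalling on a page fault. */
        while (aio_error(&cb) == EINPROGRESS) {
            /* ... handle other connections here ... */
        }

        ssize_t n = aio_return(&cb);
        printf("read %zd bytes\n", n);
        close(fd);
        return 0;
    }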
by antirez on 12/5/12, 8:34 AM
1) You have a threaded implementation; otherwise your single thread blocks every time you access a page that is out on swap.
2) You have decently sized contiguous objects. If instead a request involves many fragments of data from many different pages, it is not going to work well.
There are other issues but probably 1 & 2 are the most important.
by miah_ on 12/5/12, 6:27 AM
“these days so small that girls get disappointed if they think they got hold of something else than the MP3 player you had in your pocket.”
An otherwise interesting article.
by stcredzero on 12/5/12, 3:26 AM
by guilloche on 12/5/12, 4:12 AM
Take an example: in a word processor, can we just keep all possible cursor positions (for moving the cursor around), all line-breaking and page-breaking info, and each character's location in virtual memory?
by khitchdee on 12/5/12, 5:16 AM
by taylorbuley on 12/5/12, 4:09 AM
by ccleve on 12/5/12, 3:52 AM
I would love it if there were just one kind of storage, and my code could ignore the distinction between disk and memory. But it can't, for three reasons: 10 ms seek times, RAM that is much smaller than disk, and garbage collection.
10 ms seek times mean that fast random access across large disk files just isn't possible. There is a vast amount of literature and research devoted to getting over this specific limitation. And it isn't old, either: all of the recent work on big data is aimed at resolving the tension between sequential disk access, which is fast, and random access, which is required for executing queries.
RAM that is smaller than disk means that memory-mapped files don't work very well when you have large data files. If you try to map more than the amount of physical RAM, you get a mess: http://stackoverflow.com/questions/12572157/using-lots-of-ma...
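A minimal sketch of the pattern in question (the striding is deliberate; on a file several times larger than RAM, each touch can fault and evict, and on spinning disks each fault can cost a full seek):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; the kernel pages it in and out on demand. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touch one byte per megabyte. With the file bigger than RAM,
           this is a steady stream of page faults and evictions: the
           access pattern, not the mmap call itself, is what hurts. */
        long sum = 0;
        for (off_t off = 0; off < st.st_size; off += 1 << 20)
            sum += p[off];
        printf("%ld\n", sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }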
Garbage collection means that it is easy to allocate a bit of memory and then let it go when the reference goes out of scope. There's no need to explicitly deallocate it. It's one of the things that makes modern programming efficient. With disk, you don't get that; if you write something, you've got to erase it, or the disk fills up.
In short, this guy's casual contempt for "1975 programming" is irksome, because it's clear that he isn't working on the same class of problems that the rest of us are. He may be able to get away with virtual memory for his limited application, but the rest of us can't.
by hakaaak on 12/5/12, 2:21 AM
So the question is: if it is so great, why only 5.2%? I'm not being sarcastic. This is a totally serious question.
by jwilliams on 12/5/12, 2:04 AM
by mcfunley on 12/5/12, 2:00 AM
by guilloche on 12/5/12, 6:51 AM
by martinced on 12/5/12, 10:03 AM
"Don't create a ramdisk (a true, fixed size, one, that you prevent from ever getting to disk) because the (Linux) kernel is so good and so sentient that you won't gain anything by doing that"
Yet anyone who compiles big projects made of thousands of source files from scratch knows that it's much faster to write the compiled files to a ramdisk.
I can't count how many times I've seen this argument between the "kernel is sentient" camp and the "compile onto a real ramdisk" camp, but I can tell you that, in my experience (and it's hard to beat that), the ramdisk Just Works [TM] faster than the "sentient kernel".
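For concreteness, the kind of fixed-size ramdisk I mean (a sketch; the mountpoint and size are placeholders, and note that tmpfs pages can still be swapped out, so use ramfs if the contents must never reach disk):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void) {
        /* Same effect as: mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk
           Needs root, and the placeholder mountpoint must already exist. */
        if (mount("tmpfs", "/mnt/ramdisk", "tmpfs", 0, "size=2g") < 0) {
            perror("mount");
            return 1;
        }
        puts("ramdisk mounted; point the build's output directory here");
        return 0;
    }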
So how is it different this time?
by smegel on 12/5/12, 1:46 AM