by emsal on 3/7/20, 8:59 PM with 302 comments
by orisho on 3/7/20, 9:49 PM
Rachel says that a thread is only ever doing one thing at a time - it is handling one request, not many. But that's only true when you do CPU-bound work. There is no way to keep writing blocking-IO-style code at scale without some form of event loop underneath (gevent, async/await). You cannot spin up 100K native threads to handle 100K requests that are IO-bound (which is very common in a microservice architecture, since requests will very quickly block on requests to other services). Or well, you can, but the native-thread context-switch overhead is very quickly going to grind the machine to a halt as you grow.
I'm a big fan of gevent, and while it does have these shortcomings, they exist because it's bolted on top of Python, a language which started out with the classic threading model (native threads) rather than this one.
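For what it's worth, a minimal sketch of what "blocking-style code on an event loop" looks like with gevent (my own toy illustration, not from the post; the URL is a placeholder):

    from gevent import monkey
    monkey.patch_all()  # must run before anything else imports socket/ssl

    import gevent
    import urllib.request

    def fetch(url):
        # Reads like blocking I/O, but the patched socket yields to the
        # event loop whenever it would block, so thousands of these can
        # be in flight on a single OS thread.
        return urllib.request.urlopen(url).read()

    jobs = [gevent.spawn(fetch, "http://example.com/") for _ in range(1000)]
    gevent.joinall(jobs, timeout=30)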
Golang, on the other hand, doesn't suffer from them, because it was designed from the get-go with this threading model in mind. It lets you write blocking-style code and still get the benefits of an event loop (you never have to think about whether you need to await an operation), and goroutines can be preempted if they spend too long doing CPU work, just like normal threads.
by orf on 3/7/20, 9:30 PM
You can just pass `--preload` to have gunicorn load the application once. If you're using a standard framework like Django or Flask and not doing anything obviously insane then this works really well and without much effort. Yeah I'm sure some dumb libraries do some dumb things, but that's on them, and you for using those libraries. Same as any language.
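For reference, here's roughly what that looks like in a gunicorn.conf.py (the module path and worker counts are placeholders, not anyone's real setup):

    # Equivalent to passing --preload on the command line: the app is
    # imported once in the master, then workers are forked, so read-only
    # code and data can be shared copy-on-write.
    # Run with e.g.: gunicorn -c gunicorn.conf.py myapp.wsgi:application
    bind = "0.0.0.0:8000"
    workers = 5              # roughly num_cpus + 1
    worker_class = "gevent"  # or "sync"/"gthread", depending on workload
    preload_app = True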
If you want to stick your nose up at Python and state outright "I will not write a service in it" then that's up to you; it just comes across as your loss rather than a damning condemnation of the language and its ecosystem from Rachel By The Bay, an all-knowing and experienced higher power. I guess everyone else will keep quickly shipping value to customers with it while you worry about five processes waking up from a system call at once or an extra 150 MB of memory usage.
by nodamage on 3/7/20, 10:40 PM
The author seems to be saying that if a worker is busy doing CPU-intensive work (is decoding JSON really that intensive?), then other requests accepted by that worker have to wait for that work to complete before they can be answered, and the client might time out while waiting?
If that's the case:
1. Wouldn't this affect any language/framework that uses a cooperative concurrency model, including node.js and ASP.NET or even Python's async/await based frameworks? How is this problem specific to Python/Gunicorn/Gevent?
2. What would be a better alternative? The author says something about using actual OS-level threads but I thought the whole point of green threads was that they are cheaper than thread switching?
by benreesman on 3/7/20, 11:05 PM
On the first point, yeah Rachel’s posts are kinda snarky sometimes, but some of us find that entertaining particularly when they are highly detailed and thoroughly researched. I’ve worked with Rachel and she’s among the best “deep-dive” userspace-to-network driver problem solvers around. She knows her shit and we’re lucky she takes the time to put hard-earned lessons on the net for others to benefit from.
As for whether “microservices written in Python trading a bunch of sloppy JSON around via HTTP” is bad engineering: it is bad engineering; sometimes the flavor of the month is rancid (CORBA, multiple implementation inheritance, XSLT, I could go on). Introducing network boundaries where function calls would work is a bad idea, as anyone who’s dealt seriously with distributed systems for a living knows. JSON-over-HTTP for RPC is lazy, inefficient in machine time and engineering effort, and trivially obsolete in a world where Protocol Buffers/gRPC or Thrift and their ilk are so mature.
Now none of this is to say you should rewrite your system if it’s built that way, legacy stuff is a thing. But Rachel wrote a detailed piece on why you are asking for trouble if you build new stuff like this and people are, in my humble opinion, shooting the messenger.
by tgbugs on 3/7/20, 10:34 PM
As the maintainer of about 5 little services with this structure I have vowed never to write another one. The memory overhead alone is a source of eternal irritation ("Surely there must be a better way....").
Echoing other commenters here, the real cost isn't actually discussed. Namely, there is a solution to some of these problems (the long-running tasks, at least), but it carries a major increase in complexity with it. Its name is Celery, and oh boy, have fun with the ops overhead that it is going to induce.
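To make that concrete: the happy path is only a few lines (the broker URL is a placeholder and expensive_transform is a hypothetical stand-in), and everything around those lines is the ops overhead:

    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def crunch(payload):
        # The CPU-heavy or long-running work happens in a Celery worker
        # process, not in the web worker handling the request.
        return expensive_transform(payload)  # hypothetical function

    # In the request handler, crunch.delay(payload) returns immediately,
    # but now you also run, monitor and upgrade a broker plus a separate
    # fleet of worker processes.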
A while back I did some unscientific benchmarking of the various worker classes for python3.6 and pypy3 (7.0 at the time, I think?). Quoting my summary notes:
1. "pypy3 with the sync worker has roughly the same performance; gevent is monstrously slow; gthread is about 20 rps slower than sync (1s over 1k requests); sync can get up to ~150 rps"
2. "pypy3 is clearly faster with tornado than anything running 3.6"
3. "pypy3 is also about 4x faster when dumping nt straight from the database, peaking at about 80 MBps to disk on the same computer while python3.6 hits ~20 MBps"
I won't mention the workload because it was the same for both implementations and would only confuse the point, which is that there are better solutions out there in python land if you are stuck with one of these systems.
One thing I would love to hear from others is how other runtimes do this in a sane and performant way. What is the better solution left implicit in this post?
by seemslegit on 3/7/20, 9:44 PM
I'm surprised how often devs treat this distinction as architecturally meaningful. Web requests are just RPCs with some of the parameters standardized and multiple surfaces for parameters and return values - query string, headers, body. This is completely orthogonal to the strategy used to schedule IO, concurrency, etc.
by cwp on 3/7/20, 11:19 PM
Because of the GIL, you can't make predictions at the same time you're processing network IO, which means that you need multiple processes to respond to clients quickly and keep the CPU busy. But models use a lot of memory and so you can't run all THAT many processes.
I actually did get the load-then-fork, copy-on-write thing to work, but Python's garbage collection and reference counting keep writing to objects all over memory, which triggers copying and makes the processes gradually consume more and more memory as the model becomes less and less shared. OK, so then you can terminate and re-fork the processes periodically and avoid OOM errors, but there's still a lot of memory overhead, CPU usage is pretty low even when there are lots of clients waiting, and...
You know I hear Julia is pretty mature these days and hey didn't Google release this nifty C++ library for ML and notebooks aren't THAT much easier. Between the GIL and the complete insanity that is python packaging, I think it's actually the worst possible language to use for ML.
by ary on 3/7/20, 9:55 PM
> So how do you keep this kind of monster running? First, you make sure you never allow it to use too much of the CPU, because empirically, it'll mean that you're getting distracted too much and are timing out some requests while chasing down others. You set your system to "elastically scale up" at some pitiful utilization level, like 25-30% of the entire machine.
Letting a Python web service, written in your framework of choice, perform CPU-bound work is just bad design. A Python web service should essentially be a router for data, handling authentication/authorization, I/O formatting, and not much else. CPU-intensive tasks should be submitted to a worker queue and handled out of process. Since this is Python, we don't have the luxury of using threads to perform CPU-bound work (because of the Global Interpreter Lock).
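A hedged sketch of the "out of process" part using only the standard library (handle_request and the 5-second budget are made up for illustration; a proper worker queue would replace the pool):

    import json
    from concurrent.futures import ProcessPoolExecutor

    # One pool per web worker; the heavy work runs in separate processes,
    # so the web worker's GIL isn't held while it happens.
    pool = ProcessPoolExecutor(max_workers=4)

    def handle_request(raw_body: bytes):
        # The web worker only routes and validates; the expensive decode
        # or transform is shipped to the pool and awaited with a budget.
        future = pool.submit(json.loads, raw_body)
        return future.result(timeout=5)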
by yowlingcat on 3/7/20, 9:45 PM
The question, however, is why one would use gevent at this point in Python's evolution. There's async/await now, and things like FastAPI. If you want to use, say, the Django ecosystem, use Nginx and uWSGI and be done with it. Maybe you need to spend some more resources to deploy your Python. Okay. Is that a problem? Why are you using Python? Is it because it's quick to use and helps you solve problems faster with its gigantic, mature ecosystem that lets you focus on your business logic? Then this, while admittedly not great, is going to be a rounding error. Or is it because you began using it in the aforementioned case and are now boxed into an expensive corner, needing to figure out how to scale parts of your presumably useful production architecture serving a Very Useful Application?
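A minimal FastAPI sketch, just to show the explicit yield points (my illustration; the route and the sleep stand in for a real downstream call):

    import asyncio
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/items/{item_id}")
    async def read_item(item_id: int):
        # The await is an explicit yield point; while this "sleeps"
        # (standing in for an awaited downstream call), other requests run.
        await asyncio.sleep(0.1)
        return {"item_id": item_id}

    # Served by an ASGI server, e.g.: uvicorn main:app --workers 4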
Maybe you need to start splitting up your architecture into separate services, so that you can use Python for the things it does well and some other technology for the parts that aren't I/O-bound and could benefit from that. But that's not what this article is about. This article is about someone making the wrong choices when better choices existed and then making a categorical decision against using Python for a service. I'd say that's what "we have to talk about", if you ask me.
by cakoose on 3/8/20, 12:28 AM
Let's say your server has 4 CPUs. The conservative option is to limit yourself to 4 requests at a time. But for most web applications, requests use tiny bursts of CPU in between longer spans of I/O, so your CPUs will be mostly idle.
Let's say we want to make better use of our CPUs and accept 40 requests at a time. Some environments (Java, Go, etc) allow any of the 40 requests to run on any of the CPUs. A request will have to wait only if 4+ of the 40 requests currently need to do CPU work.
Some environments (Node, Python, Ruby) allow a process to only use a single CPU at a time (roughly). You could run 40 processes, but that uses a lot of memory. The standard alternative is to do process-per-CPU; for this example we might run 4 processes and give each process 10 concurrent requests.
But now requests will have to wait if more than 1 of the 10 requests in its process needs to do CPU work. This has a higher probability of happening than "4+ out of 40". That's why this setup will result in higher latency.
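A quick back-of-the-envelope check of that intuition, assuming each in-flight request independently wants the CPU 5% of the time (the 5% figure is purely an assumption for illustration):

    from math import comb

    def p_contended(n_requests, n_cpus, p_cpu=0.05):
        # Probability that more requests want CPU at once than there are
        # CPUs to run them (a crude snapshot model, not a queueing model).
        return sum(
            comb(n_requests, k) * p_cpu**k * (1 - p_cpu) ** (n_requests - k)
            for k in range(n_cpus + 1, n_requests + 1)
        )

    print(p_contended(40, 4))  # shared pool of 40 on 4 CPUs: ~0.05
    print(p_contended(10, 1))  # one process of the 4x10 split: ~0.09

Same offered load, but the carved-up version is contended roughly twice as often, and that gap is the extra latency.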
And there's a bunch more to it. For example, it's slightly more expensive (for cache/NUMA reasons) for a request to switch from one CPU to another, so some high-performance frameworks intentionally pin requests to CPUs, e.g. Nginx, Seastar. A "work-stealing" scheduler tries to strike a balance: requests are pinned to CPUs, but if a CPU is idle it can "steal" a request from another CPU.
The starvation/timeout problem described in the post is strictly more likely to happen in process-per-CPU, sure. But for a ton of web app workloads, the odds of it happening are low, and there are things you can do to improve the situation.
The post also talks about Gunicorn accepting connections inefficiently, and that should probably be fixed, but that space has very similar tradeoffs: <https://blog.cloudflare.com/the-sad-state-of-linux-socket-ba...>
by worik on 3/7/20, 11:26 PM
Those below who complain about the complaints are missing the point.
We (computer programmers as a general class) have not learnt from history. We keep reinventing wheels and each time they are heavier and clunkier.
What we used to do in 40K of scripts now takes two gigabytes in python/django/whateverthehellelse. E.g. mailing list servers. Mailman3, hang your head in shame!
by ris on 3/7/20, 10:47 PM
> "Why in the hell would you fork then load, instead of load then fork?"
In python it often seems to make little difference. The continual refcount incrementing and decrementing sooner or later touches most everything and causes the copy to happen whether you're mutating an object or not.
I've had some broad thoughts about how one would give cpython the ability to "turn off" gc and refcounting for some "forever" objects which you know you're never going to want to free, but it wouldn't be pretty as it would require segregating these objects into their own arenas to prevent neighbour writes dirtying the whole page anyway...
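CPython 3.7+ already has a partial version of this: gc.freeze() moves everything allocated so far into a "permanent generation" that the cyclic collector never traverses. It doesn't stop refcount writes from dirtying pages, so it's only half the battle, but a hedged sketch of the pre-fork dance looks like this (load_model and fork_workers are hypothetical stand-ins):

    import gc

    def preload_then_fork(load_model, fork_workers):
        big_model = load_model()   # hypothetical: build the big shared data
        gc.disable()               # avoid a collection between load and fork
        gc.freeze()                # exempt everything loaded so far from GC
        fork_workers(big_model)    # hypothetical: os.fork() per worker;
                                   # children may call gc.enable() again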
by ahuang on 3/7/20, 11:20 PM
> A connection arrives on the socket. Linux runs a pass down the list of listeners doing the epoll thing -- all of them! -- and tells every single one of them that something's waiting out there. They each wake up, one after another, a few nanoseconds apart.
Linux is known to have poor fairness with multiple processes listening on the same socket. For most setups that require forking processes, you run a local load balancer on the box, whether it's haproxy or something else, and have each process listen on its own port. This not only lets you ensure fairness with whatever load-balancing policy you want, but also gives you health checks, queueing, etc.
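Another mitigation, separate from the per-port load balancer setup above (this is my addition, not the parent's): SO_REUSEPORT, where each worker opens its own listening socket on the same port and the kernel spreads incoming connections across them, so there is no shared listen socket to thunder on. A rough sketch:

    import socket

    def make_listener(port=8000):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Each forked worker creates its own socket with SO_REUSEPORT; the
        # kernel then hashes incoming connections across the listeners
        # instead of waking every epoll waiter on one shared socket.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind(("0.0.0.0", port))
        s.listen(128)
        return s

If memory serves, gunicorn exposes this as its reuse_port setting.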
>Meanwhile, that original request is getting old. The request it made has since received a response, but since there's not been an opportunity to flip back to it, the new request is still cooking. Eventually, that new request's computations are done, and it sends back a reply: 200 HTTP/1.1 OK, blah blah blah.
This can happen whether it's an OS-threaded design or a userspace green-thread runtime. If a process is overloaded, clients can and will time out on the request. The main difference is that in a green-thread runtime it's a matter of overloading the process rather than exhausting all threads. You can make this better by using a local load balancer on the box and spreading load evenly. It's also best practice to minimize "blocking" in the application that causes these pauses to happen.
>That's why they fork-then-load. That's why it takes up so much memory, and that's why you can't just have a bunch of these stupid things hanging around, each handling one request at a time and not pulling a "SHINYTHING!" and ignoring one just because another came in. There's just not enough RAM on the machine to let you do this. So, num_cpus + 1 it is.
Delayed imports (because of cyclical dependencies) are bad practice. That being said, forking N processes is standard for languages/runtimes that can only utilize a single core (Python, Ruby, JavaScript, etc.).
This is not to say that this solution is ideal -- just that with a small bit of work you can improve the scalability/reliability/behavior under load of these systems by quite a bit.
by DevKoala on 3/7/20, 10:18 PM
We have money, let’s just blow it. /s
by doctoboggan on 3/7/20, 10:56 PM
I noticed in the logs that I am getting a lot of Critical Worker Timeouts and I am wondering if this has anything to do with it.
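For context, [CRITICAL] WORKER TIMEOUT is the gunicorn master killing a worker that didn't report progress within its timeout window, which matches the "one request starves the rest" shape the post describes. Not a diagnosis, but these are the settings usually involved (values are placeholders, not a recommendation):

    # gunicorn.conf.py (placeholder values, not a recommendation)
    timeout = 30             # seconds before the master kills a silent worker
    graceful_timeout = 30    # grace period for a worker to finish on restart
    workers = 5              # num_cpus + 1, per the usual guidance
    worker_class = "gevent"  # a sync worker blocks for the whole request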
by Matthias247 on 3/8/20, 1:49 AM
Is the message that epoll and co. are not efficient enough? That's also true. The API problems and the thundering herd are known, and not only limited to Python applications as users. IO-completion-based models (e.g. through io_uring) solve some of the issues.
Or is this mainly about Python and/or gevent? If yes, then I don't understand it, since the described issues can be found in the same way in libuv, node.js, Rust, Netty, etc.
by Fazel94 on 3/8/20, 9:54 PM
Definitely, I am smarter than the guy who wrote this, because then I wouldn't have these problems. (Or he is smarter and I just didn't ask him about his rationale.)
What I design wouldn't run into these BS problems that I have to fix; it just wouldn't run into problems, generally. (Or it would have more problems than this one.)
I've had these conversations with myself at least a thousand times, and every time it was just the case in the parentheses.
by drenginian on 3/8/20, 1:19 AM
If I use uWSGI, does the problem go away?
by tus88 on 3/7/20, 9:45 PM
Yes, Python has a GIL. Yes, lightweight threads are mostly good for IO bound tasks. Yes it can still be used effectively if you design your app correctly.
by MoronInAHurry on 3/7/20, 9:58 PM
I'm sure there's some useful information in here, but it's not worth digging through the patronization to find it.
by crimsonalucard on 3/7/20, 11:20 PM
Basically she's saying that Python async (whose current state-of-the-art implementation uses libuv, the same library driving node.js, and consequently suffers from the same "problems") doesn't have actual concurrency. Computations block, and concurrency only happens in one very specific case: IO. One computation runs at a time with several IO calls in flight, and a context switch can only happen when an IO call occurs in the computation.
She fails to see why this is good:
Python async and node.js do not need concurrency primitives like locks. You cannot have a deadlock happen under this model, period. (Note: I'm not talking about Python threading, I'm talking about async/await.)
This pattern was designed for simple pipeline programming for webapps, where the webapp just does some minor translations and authentication, then offloads the actual processing to an external computation engine (usually known as a database). That's where the real processing meat happens, but most programmers just deal with this stuff through an API (usually called SQL). It's good not to have to deal with locks, mutexes, deadlocks and race conditions in the webapp. This is a huge benefit in terms of managing complexity, which she completely discounts.
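A tiny sketch of that trade-off (my illustration, not anything from the post): the awaited I/O below interleaves freely with no locks anywhere, but the one CPU-bound task stalls every other task until it finishes.

    import asyncio
    import time

    async def io_task(n):
        await asyncio.sleep(0.1)   # yields to the event loop while "waiting"
        return n

    async def cpu_task():
        time.sleep(0.5)            # never yields; all other tasks stall here
        return "done"

    async def main():
        tasks = [io_task(i) for i in range(100)] + [cpu_task()]
        results = await asyncio.gather(*tasks)
        print(len(results))        # 101, but total time depends on cpu_task

    asyncio.run(main())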
by airstrike on 3/7/20, 10:05 PM
Any headline that starts with "we have to talk about" can be answered by the words "do we?"
by viraptor on 3/7/20, 10:38 PM
Is this something people actually have problems with in practice? I did lots of Python and ran into it once. It was quickly fixed after a raised issue. I feel like non-toy development just doesn't experience it.
But maybe that's my environment bubble only. Do people who do serious Python development actually have problems with this?
by dirtydroog on 3/7/20, 11:33 PM
In adtech you send 204 responses a lot. The body is empty, just the headers. Headers like 'Server' and 'Date'. Apache won't let you turn Server off... 'security through obscurity' or some nonsense. Why do I need to tell an upstream server my time 50k times per second?
Zip it all up! Nope, that only applies to the body which is already empty.
Egressing traffic! A cloud provider's dream. I wonder what percentage of their revenue comes from clients sending the Date header.