by emsal on 3/7/20, 8:59 PM with 302 comments
by orisho on 3/7/20, 9:49 PM
Rachel says that a thread is only ever doing one thing at a time - it is handling one request, not many. But that's only true when you do CPU-bound work. There is no way to keep writing blocking-IO-style code at scale without some form of event loop underneath (gevent, async/await). You cannot spin up 100K native threads to handle 100K requests that are IO-bound (which is very common in a microservice architecture, since requests will very quickly block on requests to other services). Or well, you can, but the native-thread context-switch overhead is very quickly going to grind the machine to a halt as you grow.
I'm a big fan of gevent, and while it does have these shortcomings, they exist because it's bolted on top of Python, a language which started out with the classic threading model (native threads) rather than this one.
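For what it's worth, a minimal sketch of what "blocking-style code on an event loop" looks like with gevent (my own toy illustration, not from the post; the URL is a placeholder):

    from gevent import monkey
    monkey.patch_all()  # must run before anything else imports socket/ssl

    import gevent
    import urllib.request

    def fetch(url):
        # Reads like blocking I/O, but the patched socket yields to the
        # event loop whenever it would block, so thousands of these can
        # be in flight on a single OS thread.
        return urllib.request.urlopen(url).read()

    jobs = [gevent.spawn(fetch, "http://example.com/") for _ in range(1000)]
    gevent.joinall(jobs, timeout=30)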
Golang, on the other hand, doesn't suffer from them, because it was designed from the get-go with this threading model in mind. It lets you write blocking-style code and still get the benefits of an event loop (you never have to think about whether you need to await an operation), and goroutines can be preempted if they spend too long doing CPU work, just like normal threads.
by orf on 3/7/20, 9:30 PM
You can just pass `--preload` to have gunicorn load the application once. If you're using a standard framework like Django or Flask and not doing anything obviously insane then this works really well and without much effort. Yeah I'm sure some dumb libraries do some dumb things, but that's on them, and you for using those libraries. Same as any language.
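For reference, here's roughly what that looks like in a gunicorn.conf.py (the module path and worker counts are placeholders, not anyone's real setup):

    # Equivalent to passing --preload on the command line: the app is
    # imported once in the master, then workers are forked, so read-only
    # code and data can be shared copy-on-write.
    # Run with e.g.: gunicorn -c gunicorn.conf.py myapp.wsgi:application
    bind = "0.0.0.0:8000"
    workers = 5              # roughly num_cpus + 1
    worker_class = "gevent"  # or "sync"/"gthread", depending on workload
    preload_app = True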
If you want to stick your nose up at Python and state outright "I will not write a service in it" then that's up to you; it just comes across as your loss rather than a damning condemnation of the language and its ecosystem from Rachel By The Bay, an all-knowing and experienced higher power. I guess everyone else will keep quickly shipping value to customers with it while you worry about five processes waking up from a system call at once or an extra 150 MB of memory usage.
by nodamage on 3/7/20, 10:40 PM
The author seems to be saying that if a worker is busy doing CPU-intensive work (is decoding JSON really that intensive?), then other requests accepted by that worker have to wait for that work to complete before they can be answered, and the client might time out while waiting?
If that's the case:
1. Wouldn't this affect any language/framework that uses a cooperative concurrency model, including node.js and ASP.NET or even Python's async/await based frameworks? How is this problem specific to Python/Gunicorn/Gevent?
2. What would be a better alternative? The author says something about using actual OS-level threads but I thought the whole point of green threads was that they are cheaper than thread switching?
by benreesman on 3/7/20, 11:05 PM
On the first point, yeah Rachel’s posts are kinda snarky sometimes, but some of us find that entertaining particularly when they are highly detailed and thoroughly researched. I’ve worked with Rachel and she’s among the best “deep-dive” userspace-to-network driver problem solvers around. She knows her shit and we’re lucky she takes the time to put hard-earned lessons on the net for others to benefit from.
As for whether “microservices written in Python trading a bunch of sloppy JSON around via HTTP” is bad engineering: it is bad engineering; sometimes the flavor of the month is rancid (CORBA, multiple implementation inheritance, XSLT, I could go on). Introducing network boundaries where function calls would work is a bad idea, as anyone who’s dealt seriously with distributed systems for a living knows. JSON-over-HTTP for RPC is lazy, inefficient in machine time and engineering effort, and trivially obsolete in a world where Protocol Buffers/gRPC or Thrift and their ilk are so mature.
Now none of this is to say you should rewrite your system if it’s built that way, legacy stuff is a thing. But Rachel wrote a detailed piece on why you are asking for trouble if you build new stuff like this and people are, in my humble opinion, shooting the messenger.
by tgbugs on 3/7/20, 10:34 PM
As the maintainer of about 5 little services with this structure I have vowed never to write another one. The memory overhead alone is a source of eternal irritation ("Surely there must be a better way....").
Echoing other commenters here, the real cost isn't actually discussed. Namely, there is a solution to some of these problems (the long-running tasks, at least), but it carries a major increase in complexity with it. Its name is Celery, and oh boy, have fun with the ops overhead that it is going to induce.
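To make that concrete: the happy path is only a few lines (the broker URL is a placeholder and expensive_transform is a hypothetical stand-in), and everything around those lines is the ops overhead:

    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def crunch(payload):
        # The CPU-heavy or long-running work happens in a Celery worker
        # process, not in the web worker handling the request.
        return expensive_transform(payload)  # hypothetical function

    # In the request handler, crunch.delay(payload) returns immediately,
    # but now you also run, monitor and upgrade a broker plus a separate
    # fleet of worker processes.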
A while back I did some unscientific benchmarking of the various worker classes for python3.6 and pypy3 (7.0 at the time, I think?). Quoting my summary notes:
1. "pypy3 with the sync worker has roughly the same performance; gevent is monstrously slow; gthread is about 20 rps slower than sync (1s over 1k requests); sync can get up to ~150 rps"
2. "pypy3 is clearly faster with tornado than anything running 3.6"
3. "pypy3 is also about 4x faster when dumping nt straight from the database, peaking at about 80 MBps to disk on the same computer while python3.6 hits ~20 MBps"
I won't mention the workload because it was the same for both implementations and would only confuse the point, which is that there are better solutions out there in python land if you are stuck with one of these systems.
One thing I would love to hear from others is how other runtimes do this in a sane and performant way. What is the better solution left implicit in this post?
by seemslegit on 3/7/20, 9:44 PM
I'm surprised how often devs treat this distinction as architecturally meaningful. Web requests are just RPCs with some of the parameters standardized and multiple surfaces for parameters and return values - query string, headers, body. This is completely orthogonal to the strategy used to schedule IO, concurrency, etc.
by cwp on 3/7/20, 11:19 PM
Because of the GIL, you can't make predictions at the same time you're processing network IO, which means that you need multiple processes to respond to clients quickly and keep the CPU busy. But models use a lot of memory and so you can't run all THAT many processes.
I actually did get the load-then-fork, copy-on-write thing to work, but Python's garbage collection and reference counting keep writing to objects all over memory, which triggers copying and makes the processes gradually consume more and more memory as the model becomes less and less shared. OK, so then you can terminate and re-fork the processes periodically and avoid OOM errors, but there's still a lot of memory overhead, CPU usage is pretty low even when there are lots of clients waiting, and...
You know I hear Julia is pretty mature these days and hey didn't Google release this nifty C++ library for ML and notebooks aren't THAT much easier. Between the GIL and the complete insanity that is python packaging, I think it's actually the worst possible language to use for ML.
by ary on 3/7/20, 9:55 PM
> So how do you keep this kind of monster running? First, you make sure you never allow it to use too much of the CPU, because empirically, it'll mean that you're getting distracted too much and are timing out some requests while chasing down others. You set your system to "elastically scale up" at some pitiful utilization level, like 25-30% of the entire machine.
Letting a Python web service, written in your framework of choice, perform CPU-bound work is just bad design. A Python web service should essentially be a router for data, handling authentication/authorization, I/O formatting, and not much else. CPU-intensive tasks should be submitted to a worker queue and handled out of process. Since this is Python, we don't have the luxury of using threads to perform CPU-bound work (because of the Global Interpreter Lock).
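A hedged sketch of the "out of process" part using only the standard library (handle_request and the 5-second budget are made up for illustration; a proper worker queue would replace the pool):

    import json
    from concurrent.futures import ProcessPoolExecutor

    # One pool per web worker; the heavy work runs in separate processes,
    # so the web worker's GIL isn't held while it happens.
    pool = ProcessPoolExecutor(max_workers=4)

    def handle_request(raw_body: bytes):
        # The web worker only routes and validates; the expensive decode
        # or transform is shipped to the pool and awaited with a budget.
        future = pool.submit(json.loads, raw_body)
        return future.result(timeout=5)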
by yowlingcat on 3/7/20, 9:45 PM
The question, however, is why one would use gevent at this point in Python's evolution. There's async/await now, and things like FastAPI. If you want to use, say, the Django ecosystem, use Nginx and uWSGI and be done with it. Maybe you need to spend some more resources to deploy your Python. Okay. Is that a problem? Why are you using Python? Is it because it's quick to use and helps you solve problems faster with its gigantic, mature ecosystem that lets you focus on your business logic? Then this, while admittedly not great, is going to be a rounding error. Or is it because you began using it in the aforementioned case and are now boxed into an expensive corner, needing to figure out how to scale parts of your presumably useful production architecture serving a Very Useful Application?
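A minimal FastAPI sketch, just to show the explicit yield points (my illustration; the route and the sleep stand in for a real downstream call):

    import asyncio
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/items/{item_id}")
    async def read_item(item_id: int):
        # The await is an explicit yield point; while this "sleeps"
        # (standing in for an awaited downstream call), other requests run.
        await asyncio.sleep(0.1)
        return {"item_id": item_id}

    # Served by an ASGI server, e.g.: uvicorn main:app --workers 4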
Maybe you need to start splitting up your architecture into separate services, so that you can use Python for the things it does well and some other technology for the parts that aren't I/O-bound and could benefit from that. But that's not what this article is about. This article is about someone making the wrong choices when better choices existed and then making a categorical decision against using Python for a service. I'd say that's what "we have to talk about", if you ask me.
by cakoose on 3/8/20, 12:28 AM
Let's say your server has 4 CPUs. The conservative option is to limit yourself to 4 requests at a time. But for most web applications, requests use tiny bursts of CPU in between longer spans of I/O, so your CPUs will be mostly idle.
Let's say we want to make better use of our CPUs and accept 40 requests at a time. Some environments (Java, Go, etc) allow any of the 40 requests to run on any of the CPUs. A request will have to wait only if 4+ of the 40 requests currently need to do CPU work.
Some environments (Node, Python, Ruby) allow a process to only use a single CPU at a time (roughly). You could run 40 processes, but that uses a lot of memory. The standard alternative is to do process-per-CPU; for this example we might run 4 processes and give each process 10 concurrent requests.
But now requests will have to wait if more than 1 of the 10 requests in its process needs to do CPU work. This has a higher probability of happening than "4+ out of 40". That's why this setup will result in higher latency.
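A quick back-of-the-envelope check of that intuition, assuming each in-flight request independently wants the CPU 5% of the time (the 5% figure is purely an assumption for illustration):

    from math import comb

    def p_contended(n_requests, n_cpus, p_cpu=0.05):
        # Probability that more requests want CPU at once than there are
        # CPUs to run them (a crude snapshot model, not a queueing model).
        return sum(
            comb(n_requests, k) * p_cpu**k * (1 - p_cpu) ** (n_requests - k)
            for k in range(n_cpus + 1, n_requests + 1)
        )

    print(p_contended(40, 4))  # shared pool of 40 on 4 CPUs: ~0.05
    print(p_contended(10, 1))  # one process of the 4x10 split: ~0.09

Same offered load, but the carved-up version is contended roughly twice as often, and that gap is the extra latency.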
And there's a bunch more to it. For example, it's slightly more expensive (for cache/NUMA reasons) for a request to switch from one CPU to another, so some high-performance frameworks intentionally pin requests to CPUs, e.g. Nginx, Seastar. A "work-stealing" scheduler tries to strike a balance: requests are pinned to CPUs, but if a CPU is idle it can "steal" a request from another CPU.
The starvation/timeout problem described in the post is strictly more likely to happen in process-per-CPU, sure. But for a ton of web app workloads, the odds of it happening are low, and there are things you can do to improve the situation.
The post also talks about Gunicorn accepting connections inefficiently, and that should probably be fixed, but that space has very similar tradeoffs: <https://blog.cloudflare.com/the-sad-state-of-linux-socket-ba...>
by worik on 3/7/20, 11:26 PM
Those below who complain about the complaints are missing the point.
We (computer programmers as a general class) have not learnt from history. We keep reinventing wheels and each time they are heavier and clunkier.
What we used to do in 40K of scripts now takes two gigabytes in python/django/whateverthehellelse. E.g. mailing list servers. Mailman3, hang your head in shame!
by ris on 3/7/20, 10:47 PM
> "Why in the hell would you fork then load, instead of load then fork?"
In python it often seems to make little difference. The continual refcount incrementing and decrementing sooner or later touches most everything and causes the copy to happen whether you're mutating an object or not.
I've had some broad thoughts about how one would give cpython the ability to "turn off" gc and refcounting for some "forever" objects which you know you're never going to want to free, but it wouldn't be pretty as it would require segregating these objects into their own arenas to prevent neighbour writes dirtying the whole page anyway...
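CPython 3.7+ already has a partial version of this: gc.freeze() moves everything allocated so far into a "permanent generation" that the cyclic collector never traverses. It doesn't stop refcount writes from dirtying pages, so it's only half the battle, but a hedged sketch of the pre-fork dance looks like this (load_model and fork_workers are hypothetical stand-ins):

    import gc

    def preload_then_fork(load_model, fork_workers):
        big_model = load_model()   # hypothetical: build the big shared data
        gc.disable()               # avoid a collection between load and fork
        gc.freeze()                # exempt everything loaded so far from GC
        fork_workers(big_model)    # hypothetical: os.fork() per worker;
                                   # children may call gc.enable() again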
by ahuang on 3/7/20, 11:20 PM
> A connection arrives on the socket. Linux runs a pass down the list of listeners doing the epoll thing -- all of them! -- and tells every single one of them that something's waiting out there. They each wake up, one after another, a few nanoseconds apart.
Linux is known to have poor fairness with multiple processes listening on the same socket. For most setups that require forking processes, you run a local load balancer on the box, whether it's haproxy or something else, and have each process listen on its own port. This not only lets you ensure fairness with whatever load-balancing policy you want, but also gives you health checks, queueing, etc.
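Another mitigation, separate from the per-port load balancer setup above (this is my addition, not the parent's): SO_REUSEPORT, where each worker opens its own listening socket on the same port and the kernel spreads incoming connections across them, so there is no shared listen socket to thunder on. A rough sketch:

    import socket

    def make_listener(port=8000):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Each forked worker creates its own socket with SO_REUSEPORT; the
        # kernel then hashes incoming connections across the listeners
        # instead of waking every epoll waiter on one shared socket.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind(("0.0.0.0", port))
        s.listen(128)
        return s

If memory serves, gunicorn exposes this as its reuse_port setting.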
>Meanwhile, that original request is getting old. The request it made has since received a response, but since there's not been an opportunity to flip back to it, the new request is still cooking. Eventually, that new request's computations are done, and it sends back a reply: 200 HTTP/1.1 OK, blah blah blah.
This can happen whether it's an OS-threaded design or a userspace green-thread runtime. If a process is overloaded, clients can and will time out on the request. The main difference is that in a green-thread runtime it's a matter of overloading the process rather than exhausting all threads. You can make this better by using a local load balancer on the box and spreading load evenly. It's also best practice to minimize "blocking" in the application that causes these pauses to happen.
>That's why they fork-then-load. That's why it takes up so much memory, and that's why you can't just have a bunch of these stupid things hanging around, each handling one request at a time and not pulling a "SHINYTHING!" and ignoring one just because another came in. There's just not enough RAM on the machine to let you do this. So, num_cpus + 1 it is.
Delayed imports (because of cyclical dependencies) are bad practice. That being said, forking N processes is standard for languages/runtimes that can only utilize a single core (Python, Ruby, JavaScript, etc.).
This is not to say that this solution is ideal -- just that with a small bit of work you can improve the scalability/reliability/behavior under load of these systems by quite a bit.
by DevKoala on 3/7/20, 10:18 PM
We have money, let’s just blow it. /s
by doctoboggan on 3/7/20, 10:56 PM
I noticed in the logs that I am getting a lot of Critical Worker Timeouts and I am wondering if this has anything to do with it.
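For context, [CRITICAL] WORKER TIMEOUT is the gunicorn master killing a worker that didn't report progress within its timeout window, which matches the "one request starves the rest" shape the post describes. Not a diagnosis, but these are the settings usually involved (values are placeholders, not a recommendation):

    # gunicorn.conf.py (placeholder values, not a recommendation)
    timeout = 30             # seconds before the master kills a silent worker
    graceful_timeout = 30    # grace period for a worker to finish on restart
    workers = 5              # num_cpus + 1, per the usual guidance
    worker_class = "gevent"  # a sync worker blocks for the whole request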
by Matthias247 on 3/8/20, 1:49 AM
Is the message that epoll and co. are not efficient enough? That's also true. The API problems and the thundering herd are known, and not only limited to Python applications as users. IO-completion-based models (e.g. through io_uring) solve some of the issues.
Or is this mainly about Python and/or gevent? If yes, then I don't understand it, since the described issues can be found in the same way in libuv, node.js, Rust, Netty, etc.
by Fazel94 on 3/8/20, 9:54 PM
Definitely, I am smarter than the guy who wrote this, because then I wouldn't have these problems. (Or he is smarter and I just didn't ask him about his rationale.)
What I design wouldn't run into these BS problems that I have to fix; it just wouldn't run into problems, generally. (Or it would have more problems than this one.)
I've had these conversations with myself at least a thousand times, and every time it was just the case in the parentheses.
by drenginian on 3/8/20, 1:19 AM
If I use uWSGI, does the problem go away?
by tus88 on 3/7/20, 9:45 PM
Yes, Python has a GIL. Yes, lightweight threads are mostly good for IO bound tasks. Yes it can still be used effectively if you design your app correctly.
by MoronInAHurry on 3/7/20, 9:58 PM
I'm sure there's some useful information in here, but it's not worth digging through the patronization to find it.
by crimsonalucard on 3/7/20, 11:20 PM
Basically she's saying that Python async (whose current state-of-the-art implementation uses libuv, the same library driving node.js, and consequently suffers from the same "problems") doesn't have actual concurrency. Computations block, and concurrency only happens in one very specific case: IO. One computation runs at a time with several IO calls in flight, and a context switch can only happen when an IO call occurs in the computation.
She fails to see why this is good:
Python async and node.js do not need concurrency primitives like locks. You cannot have a deadlock happen under this model, period. (Note: I'm not talking about Python threading, I'm talking about async/await.)
This pattern was designed for simple pipeline programming for webapps, where the webapp just does some minor translations and authentication, then offloads the actual processing to an external computation engine (usually known as a database). That's where the real processing meat happens, but most programmers just deal with this stuff through an API (usually called SQL). It's good not to have to deal with locks, mutexes, deadlocks and race conditions in the webapp. This is a huge benefit in terms of managing complexity, which she completely discounts.
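A tiny sketch of that trade-off (my illustration, not anything from the post): the awaited I/O below interleaves freely with no locks anywhere, but the one CPU-bound task stalls every other task until it finishes.

    import asyncio
    import time

    async def io_task(n):
        await asyncio.sleep(0.1)   # yields to the event loop while "waiting"
        return n

    async def cpu_task():
        time.sleep(0.5)            # never yields; all other tasks stall here
        return "done"

    async def main():
        tasks = [io_task(i) for i in range(100)] + [cpu_task()]
        results = await asyncio.gather(*tasks)
        print(len(results))        # 101, but total time depends on cpu_task

    asyncio.run(main())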
by airstrike on 3/7/20, 10:05 PM
Any headline that starts with "we have to talk about" can be answered by the words "do we?"
by viraptor on 3/7/20, 10:38 PM
Is this something people actually have problems with in practice? I did lots of Python and ran into it once. It was quickly fixed after a raised issue. I feel like non-toy development just doesn't experience it.
But maybe that's my environment bubble only. Do people who do serious Python development actually have problems with this?
by dirtydroog on 3/7/20, 11:33 PM
In adtech you send 204 responses a lot. The body is empty, just the headers. Headers like 'Server' and 'Date'. Apache won't let you turn Server off... 'security through obscurity' or some nonsense. Why do I need to tell an upstream server my time 50k times per second?
Zip it all up! Nope, that only applies to the body which is already empty.
Egressing traffic! A cloud provider's dream. I wonder what percentage of their revenue comes from clients sending the Date header.