from Hacker News

Reliability: It’s not great

by bishopsmother on 3/6/23, 5:47 PM with 455 comments

by samwillis on 3/6/23, 7:38 PM
Fundamentally I think some of the problems come down to the difference between what Fly set out to build and what the market currently wants.
Fly (to my understanding) at its core is about edge compute. That is where they started and what the team are most excited about developing. It's a brilliant idea, they have the skills and expertise. They are going to be successful at it.
However, at the same time the market is looking for a successor to Heroku. A zero dev ops PAAS with instant deployment, dirt simple managed Postgres, generous free level of service, lower cost as you scale, and a few regions around the world. That isn't what Fly set out to do... exactly, but is sort of the market they find themselves in when Heroku then basically told its low value customers to go away.
It's that slight misalignment of strategy and market fit that results in maybe decisions being made that benefit the original vision, but not necessarily the immediate influx of customers.
I don't envy the stress the Fly team are under, but what an exciting set of problems they are trying to solve, I do envy that!
by yamrzou on 3/6/23, 7:09 PM
I'm not a user of Fly.io. I can't help but notice how remarkable the effect of open communication is on potential end users like me. I remember reading about their reliability problems on HN some time ago. That biased my view of the company. After reading this, the open communication and transparency restored my trust in them, and would make them again a potential candidate for future projects. Because now I know that they acknowledge the problem and that they are trying to improve things.
by throwawaaarrgh on 3/7/23, 4:27 AM
I've been doing reliability stuff for near two decades. The one thing I am sure of is there is no way to just engineer your way to reliability. That is to say, no person, no matter how smart, can just invent some whizbang engineering thing and suddenly you have reliability.
Reliability is a thing that grows, like a plant. You start out with a new system or piece of software. It's fragile, small, weak. It is threatened by competing things and literal bugs and weather and the soil it's grown in and more. It needs constant care. Over time it grows stronger, and can eventually fend for itself pretty well. Sometimes you get lucky and it just grows fine by itself. And sometimes 50 different things conspire to kill it. But you have to be there monitoring it, finding the problems, learning how to prevent them. Every garden is a little different.
It doesn't matter what a company like Fly does technology wise. It takes time and care and churning. Eventually they will be reliable. But the initial process takes a while. And every new piece of tech they throw in is another plant in the garden.
So the good news is, they can become really reliable. But the bad news is, it doesn't come fast, and the more new plants they put in the ground, the more concerns there are to address before the garden is self sustaining.
by jrochkind1 on 3/6/23, 7:58 PM
I remain kind of amazed about how heroku managed to pull off what they pulled off, in the first place.
Also:
> The Heroku exodus broke our assumptions. Pre-Heroku, most of the apps we were running were spread across regions. And: we were growing about 15% per month. But post-Heroku, we got a huge influx of apps in just a few hot spots — and at 30% per month.
I hadn't before seen anyone with a big picture view confirm a heroku exodus was happening, although a lot of people suspected it or had anecdotes.
But if fly is seeing a pretty enormous number of customers moving from heroku to fly... oh wait, now I'm wondering, is this mainly a result of heroku ending free services, and those are free customers coming to fly for free services?
If so... that's a pretty big burden to take on without revenue to match, it does seem kind of dangerous for fly.
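For a sense of scale on the two growth rates in that quote, here is a quick sketch of the compounding. It assumes each monthly rate held constant for a full year, which is an assumption for illustration; the quoted post only gives the monthly figures.

```python
import math


def growth_multiple(monthly_rate: float, months: int) -> float:
    """Factor by which a quantity grows when compounding at monthly_rate for `months` months."""
    return (1.0 + monthly_rate) ** months


def doubling_time_months(monthly_rate: float) -> float:
    """Number of months needed to double at a constant monthly growth rate."""
    return math.log(2.0) / math.log(1.0 + monthly_rate)


# The 12-month horizon is an illustrative assumption, not a figure from the post.
for rate in (0.15, 0.30):
    multiple = growth_multiple(rate, 12)
    doubling = doubling_time_months(rate)
    print(f"{rate:.0%} per month: x{multiple:.1f} in a year, doubles every {doubling:.1f} months")
```

Under that assumption, 15% per month is a bit over 5x in a year, while 30% per month is about 23x, so going from one rate to the other is far more than "twice the growth" from a capacity-planning point of view.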
by pyentropy on 3/6/23, 9:47 PM
Almost half of the issues are caused by their use of HashiCorp products.
As someone that has started tons of Consul clusters, analyzed tons of Terraform states, developed providers and wrote a HCL parser, I must say this:
HashiCorp built a brand of consistent design & docs, security, strict configuration, distributed-algos-made-approachable... but at its core, it's a very fragile ecosystem. The only benefit of HashiCorp headaches is that you will quickly learn Golang while reading some obscure github.com/hashicorp/blah/blah/file.go :)
by pier25 on 3/6/23, 8:26 PM
I've been using Fly for over two years or so. The sentiment of this post doesn't align with my personal (anecdotal) experience.
The PG issues hit me two times in the previous weeks but other than that it's been working great for me.
With the move to v2 apps (using their new machines infra) things are actually faster and smoother than ever.
About a year or so ago their CLI was quite buggy but I haven't really hit any bugs in months.
I will remain with Fly for the time being. Hopefully they don't close shop!
by nu11ptr on 3/6/23, 7:36 PM
Not a client of fly.io, but dang impressive for the company to be this open and honest. Definite respect - wish more companies were like this. It puts them on my short list almost immediately for future needs.
by lll-o-lll on 3/6/23, 8:42 PM
At first I was all like “Ha ha, losers can’t scale”
And then I was “Huh, these technical challenges are actually pretty difficult”
And then I was all “crap, these are a bunch of technologies I was about to add to our stack”
Thanks heaps fly.io people; having the humility to honestly talk about the challenges and failures massively helps people such as myself as we navigate new unfamiliar technologies. If more companies were willing to do this, it’d be a lot easier to avoid common pitfalls.
by outworlder on 3/6/23, 7:46 PM
> This is a theme. Existing open source is not designed for global deployment
Eh? Unless you are consuming something as a service and it actually advertises it as a feature, nothing is ready for 'global deployment'.
If you have a 'centralized' secret storage, then you have made it tied to a region. Want to have redundancies and lower latency? You'll have to distribute it. Vault has docs about this: https://developer.hashicorp.com/vault/tutorials/day-one-raft...
by sergiotapia on 3/6/23, 7:25 PM
It's been almost a year since I gave Fly a review (https://news.ycombinator.com/item?id=31391116) and it's a bummer that they're still struggling to get things right. Double bummer because I love Phoenix and Elixir and they employ Chris McCord there.
Maybe they were _too_ ambitious at the start? They have a hard road ahead of them, and competition like Render.com and Northflank have provided me with solutions to all of my problems. Great dev ux, great prices and predictable solutions. They also keep pushing out very useful features. A third competitor also sprung up: Railway! There's certainly blood in the water.
Will they catch up to others before the competition solves the "global mesh" unique value proposition Fly.io currently has? That's the $1MM question.
by e1g on 3/6/23, 8:24 PM
This reads like a mea culpa from an indie hacker, but Fly.io had 5+ years and raised $40M to get these basic fundamentals right. And we get promises of a new status page.
by claytonjy on 3/6/23, 7:27 PM
Very interesting to see Kurt assert they're going to "solve managed Postgres", and I'm super curious to know what that means. Does it mean something like RDS, or more like CrunchyData?
I could see them building something RDS-like on their own, but if they're trying to go further than that I wonder if they'll buy or partner with other companies rather than doing it themselves. Neon strikes me as a Postgres-as-a-service that could pair well with Fly.
by deivid on 3/6/23, 9:49 PM
I'm a bit sour reading this. I've always liked fly and particularly the engineering blog, so much so that a couple of months ago I decided to apply for an infra position, to work on some of these very topics. Sadly after 4~5 rounds of interviews (including a workday) they just ghosted me.
by nomilk on 3/6/23, 11:40 PM
What an earnest post, and how damn refreshing it is to see such concern for users, accountability, honesty and openness (quite a contrast to another PaaS)!
I moved one app successfully from heroku to fly and attempted to move a few others. These are my experiences (both good and bad):
Great:
- The load time on the pages is insanely faster on fly than heroku. Sometimes I thought I was on the localhost version of the app, it was that snappy.
- Love that it uses a Dockerfile
- Love paying for what I use (compared to Heroku's rigid minimum of $16/month for hobby dyno w/ postgres for baby apps, or $34/month just to get a second web dyno for toddler apps). The same apps are <$5/month each on fly.
Not great:
- I find the fly.toml file hard to understand and use, and the cycle time slow to fix or tinker with it. It's partly (entirely?) a 'me' problem because I haven't spent a huge amount of time reading the documentation.
- I found scheduling a rake task in a rails app time consuming (~days) the first time, but very easy (15 minutes) the second and subsequent times, once I knew a way that worked (cron didn't work; had to use a tool I hadn't used before 'supercronic').
- Deploys sometimes time out with `Error failed to fetch an image or build from source: error rendering push status stream: EOF`. Most layers copied, but randomly, some layers wouldn't. All I could do is keep trying until it worked, which it did, 2 hours later. Not the end of the world, but an annoying complication when you're already trying to solve complex problems.
- I followed a youtube video on how to move a rails app from heroku to fly, and it worked on a modern app, but I couldn't quite get fly happy when moving the older app - something to do with postgres versions, and I didn't want to spend all day figuring it out. I'm not hugely experienced with docker, it could have been an easy fix for someone more experienced.
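The "keep trying until it worked" approach to the flaky deploy above can be automated with a small retry loop. This is only a sketch: the `retry` and `deploy` helpers are hypothetical names, the attempt count and delays are arbitrary, and it assumes the flyctl CLI is installed and on the PATH as `fly`.

```python
import subprocess
import time
from typing import Callable


def retry(
    action: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call action until it returns True or attempts run out.

    The wait doubles after every failed attempt (exponential backoff).
    """
    for attempt in range(attempts):
        if action():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False


def deploy() -> bool:
    """Run one deploy; True when the CLI exits with status 0.

    Assumption: `fly` is the flyctl binary and is available on PATH.
    """
    return subprocess.run(["fly", "deploy"]).returncode == 0
```

Calling `retry(deploy)` would then attempt the deploy up to five times with growing pauses in between. It does nothing about the underlying push failure; it only saves the manual re-running.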
On reflection, 3 of the 4 negatives above are solvable by me reading the docs more thoroughly and getting more proficient with docker.
I look forward to continuing using and exploring fly, and can't be happier with the directness, transparency and care from fly staff. A platform with huge potential.
by skywhopper on 3/6/23, 9:14 PM
Interesting issues. Nothing surprising for anyone who’s run a global SaaS before, especially if growth has been incredibly fast. I find the gripes about Consul, Nomad, and Vault interesting since it sounds like the problems are mainly due to poor architectural decisions. Fly is rewriting those tools rather than invest in deploying them properly and in the process are running into new issues that those tools have already solved, which doesn’t give me confidence that the path forward will be any less bumpy.
by emschwartz on 3/6/23, 6:53 PM
One of my colleagues keeps repeating “reliability is our number one feature”.
I’m not sure it is for 100% of early stage startups, but I guess it is once you exceed some minimum usage threshold.
That said, definitely appreciate the detailed explanation.
by ashiban on 3/6/23, 9:29 PM
One of the key challenges we observe is that if you're small enough, a Heroku like experience works well - and most of your needs would be covered by virtually any combination of techstacks.
It gets significantly more challenging when you grow, either in feature complexity or scale complexity - and then very few services can offer what AWS/GCP/Azure offer - albeit at the increased engineering/monetary cost of using them.
We're building a different kind of approach[0] that aims to absorb the mechanical cost of using public cloud capabilities (that are proven to scale) without hiding it altogether.
[0] https://github.com/KlothoPlatform/klotho
by djha-skin on 3/6/23, 11:13 PM
> In response, we’ve shipped a project called Corrosion. Corrosion is a gossip-based service discovery system.
I wonder why they didn't try to use Serf[1] for this, since they were so into HashiCorp tools. It also uses the gossip protocol.
1: https://www.serf.io/docs/index.html
by birracerveza on 3/7/23, 7:42 AM
What is this? A company being open and honest about problems their customers are facing? What is happening? Has the world gone mad??
by tebbers on 3/6/23, 7:53 PM
I really feel for Fly, as a potential customer. They are trying really hard. I would still love to use them one day and this post is definitely a step in the right direction. Growing is painful but they have smart people working there so fingers crossed that they sort this ASAP and it doesn't become existential.
by iamdbtoo on 3/6/23, 7:25 PM
I'm a big fan of fly.io. From their hiring process to the product itself it's all carried out in a thoughtful manner. I hope they can weather this rough time.
by clement_b on 3/7/23, 8:09 AM
I've been with fly.io at small scale and I have always loved their approach to content (docs, blogs, forums), and needless to say, their product. They are very talented and are building something truly great. Their openness, shown in this post, is an example to follow. It's very hard to be that honest and direct when you're meant to be an infallible entity, but it's not a surprise at all to see fly authoring such a post. That's how they operate and that's why I trust them over anyone else.
by plasma on 3/6/23, 11:12 PM
Thanks for sharing!
Would it help to replace Corrosion with a simpler "Here's my local known state" blob that is POST'd to blob storage (for example) on a major cloud provider, and have another service read that at intervals? Just to make it really simple.
There will be a better way than that, but my thought is if you can make it simpler (known state is always just pushed, so missing updates auto-recovers and avoids corruption) then you can be building on top of a more stable service discovery system.
Centralized secret storage, can you keep the US instance read/write, but replicate read-only copies (a side-car tool that copies the database to other regions at various intervals?) so each region can fetch secrets locally?
Or perhaps both can be solved with a general "Copy local state to other regions" service that is pretty simple but gives each region its own copy of other region's information (secrets, provisioning states, ...).
I've needed to do similar things for some of the apps I've built, where a service needed another (simpler) service in front of it to bear the traffic load but was operationally simple (deferred the smarts to the system it was using as the source of truth) and automatically recovered from failure due to its simplicity.
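The "always push the full local state" idea above can be sketched roughly as follows. Everything here is an illustrative assumption: `BlobStore` is an in-memory stand-in rather than a real cloud client, and the key layout, field names, and staleness cutoff are made up for the example. It is not how Corrosion or Fly's systems work.

```python
import json
import time
from typing import Dict, List, Optional


class BlobStore:
    """In-memory stand-in for a cloud blob bucket (hypothetical, not a real client)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    def put(self, key: str, body: str) -> None:
        self._blobs[key] = body

    def get(self, key: str) -> Optional[str]:
        return self._blobs.get(key)

    def keys(self) -> List[str]:
        return list(self._blobs)


def publish_state(store: BlobStore, host: str, services: Dict[str, str],
                  now: Optional[float] = None) -> None:
    """Overwrite this host's blob with a complete snapshot of its known state.

    Because every push is a full snapshot, a missed push is repaired by the next one.
    """
    snapshot = {
        "host": host,
        "published_at": time.time() if now is None else now,
        "services": services,
    }
    store.put(f"state/{host}.json", json.dumps(snapshot))


def read_directory(store: BlobStore, max_age: float,
                   now: Optional[float] = None) -> Dict[str, Dict[str, str]]:
    """Build a host -> services view, skipping snapshots older than max_age seconds."""
    current = time.time() if now is None else now
    directory: Dict[str, Dict[str, str]] = {}
    for key in store.keys():
        if not key.startswith("state/"):
            continue
        body = store.get(key)
        if body is None:
            continue
        snapshot = json.loads(body)
        if current - snapshot["published_at"] <= max_age:
            directory[snapshot["host"]] = snapshot["services"]
    return directory
```

The trade-off in this kind of design is freshness: readers only see changes as often as hosts publish and readers poll, and a host that stops publishing simply ages out of the view instead of being explicitly removed.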
by rtpg on 3/7/23, 1:13 AM
I have two extremely tiny sites (like, "handful of users/1 user" sites) on Fly. I have had multiple incidents despite me not even touching them.
The thing that worries me about these incidents is they haven't been, like, full service outages. A small subset of users talking about issues in forums. This makes me just feel like Fly has an immense amount of issues.
At least if like 50% of fly goes down then it feels like a config fat finger. When it's a bunch of tiny issues now all my ops debugging has to start with going to the fly forums (and it's _always been issues on fly's side_).
The price is "right" (though like with all PaaS the gaslighting about running multiple processes in one container makes me feel bad about the state of cloud computing). And I really like the CLI stuff mostly! But I extremely don't care about edge computing so for me fly is just heroku and I would love to feel more confident on that end.
(EDIT: the nice thing is I get email support with a bit of cash. This is a thing that will go away when they get bigger but it's here while things are still breaking often)
by hinkley on 3/7/23, 12:17 AM
> We’ve put a lot of work and resources into growing the platform and maturing our engineering organization. But that work has lagged growth.
I fundamentally don't understand why people are in such a big hurry to get 'famous'. I've worked a couple of places where the marketing side was working as hard as they could to make sure that our heads were on fire at all possible moments. At one job I had a (very, very junior) manager come up to me and say great news we landed <big customer> and my immediate reply was, "fuck me". We were already running to stay upright and now we're about to have twice as much scrutiny. Wonderful.
If you push hard enough, eventually everyone looks like an idiot. The number of humans for whom that is not true could fit into a book. Both alive and deceased. They most definitely do not work for the companies I've described, at least not enough of them so you'd notice.
by debarshri on 3/7/23, 1:19 AM
This is a great blog and has so many insights for an SRE or devops person. But this goes to show you how difficult it is to self host stuff at scale.
I used to work for a company that built deployment platforms for law firms. All our deployments were on prem and we had the same complexity with kubernetes. We had a similar setup with vault and stolon for HA PG. The more moving parts you have in infra, the more permutations and combinations of failure modes you have.
What these guys are building is something I have seen in many orgs trying to do it internally and fail. PaaS is a hard problem if you want to solve it "reliably"
by theloco on 3/6/23, 7:47 PM
I love reading stuff like this. I don't use fly, don't plan to, not totally sure everything it does and will check it out. But this is some great raw data on how stressful it is after you launch.
by ec109685 on 3/6/23, 8:20 PM
I wonder what types of RPS they are seeing that required a gossip based protocol to broadcast state around versus a more traditional data store.
I take it that it’s far more important that the local region know about changes than a remote region, which makes a mastered store in one location as the source of truth problematic.
I also wonder why these companies don’t backstop themselves on the public cloud? Failing into an AWS seems better than running out of capacity and some of its services could be used in circumstances where an open source technology isn’t ready.
by computomatic on 3/7/23, 12:34 AM
Where does fly.io document their per-account services limits? For example, max apps, databases, etc.
I took a quick look and couldn't find them. Do they have any documented service limits?
A google search turned up [0] which does not inspire optimism.
> ...there isn’t a limit to number of apps from a billing standpoint...
[0] https://community.fly.io/t/free-tier-limits-and-quota-needs-...
by soperj on 3/6/23, 7:02 PM
For django, they should really contribute to 2 scoops django cookie cutter program, so that you can get an out of the box django instance that can just deploy to Fly.io.
by revskill on 3/6/23, 9:52 PM
Fly.io seems like Vercel 1.0 (where you can just deploy docker image and done), but it's more than that, with configurable volumes, secrets,...
I'm bullish on fly.io.
by pwelch on 3/7/23, 5:05 PM
Growing pains are never fun. It doesn't mention (at least that I read) if they're using HashiCorp Open Source or Enterprise. Open Source is great and I owe my career to it but they might be hitting the scale when the Enterprise features and support start to be worth the price.
I've only used Fly.io for a personal app but I think it's a great option so I hope they keep growing.
by Karupan on 3/7/23, 12:10 AM
Thanks for the honest technical write up - not easy to air one's dirty laundry to users. Given the scalability and stability issues, I am curious to understand what percent of apps deployed to fly are actually used in production/critical to business. Sounds like they have quite a few hobby/free tier users (myself included) who probably won't notice certain issues unlike paid customers.
by ChrisMarshallNY on 3/6/23, 8:55 PM
Well, I feel for them. Scaling up is a bitch.
I've been lucky, in the past, but a lot of that, is because I have "overengineered," and the tools/frameworks have advanced to meet the new demand.
I am in the middle of a complete, bottom-to-top rewrite of the app we've been developing for the last couple of years. It's going great, but making this leap was a fraught decision.
It's mainly, so I wouldn't have to write a post like that, in a year or two.
We spent all the time refining it, until we had what we wanted, and it worked great on our small test team.
Then, I loaded up a test server with 10,000 fake users, and tossed the app at that. To be fair, we don't think we'll have even that many users for quite a while. It's a very specialized demographic.
* SOB *
It no do so well.
At that point, I had to decide whether to fix the issues (they were quite fixable), or revisit the architecture.
The main issue with the architecture, was that it was an "accreted" app, with changes gradually being factored in, as we progressed. The main reason for this, is because no one really knew what they wanted, until we ran it up the flagpole (sound familiar?).
The business logic was distributed throughout the app. That was ... ugly.
I envisioned myself, a year or two down the road, sucking on a magnum, because the app had turned into a Cruftosaurus, and was coming for me in my nightmares.
So I decided to rewrite, as we hadn't done any kind of MVP or public beta, so we actually had the runway to do this.
I refined the entire business logic of the app into a single, platform-agnostic SPM module, which took just over a month, and have started to develop the app around that. It's pretty much a rewrite, but I am recycling a lot of the app code. We also brought in our designer, and he's looking at every screen I make. It's working well for him.
Like I said, it's going great. Better than I expected.
I know that I have a huge luxury, and I'm grateful. I can credit a lot of that, to doing some stress-testing before we got to a point where we had a bunch of users to support. I was able to go in, and go all Victor Frankenstein on the model.
The result, so far, is that this thing screams, and you don't really even notice that there's that many users on it. The model has already been proven (that SPM module), and all we're doing, is chrome (which is a ton of work).
by ecmascript on 3/7/23, 8:37 AM
Wow, posts like these kind of make me want to sign up and become a customer. The honesty and openness is something I highly value and it's weird that this is so uncommon that you get surprised by reading stuff like this.
Seems like they have a good understanding what the problems are so they will most likely be solved sooner or later.
Good work and keep honesty as open as you've done so far :)
by tonnydourado on 3/7/23, 2:36 PM
Damn. I don't know, and I guess almost no one can know, how much of this is genuine honesty and how much is calculated messaging, but I barely care. It's refreshing enough that I kinda wanna give the thing a try. Which, ironically, might make their problems just a tiny little bit worse, since I likely won't be a paying customer XD
by vinay_ys on 3/7/23, 7:59 AM
Corrosion seems over-engineered. Instead of doing a simpler federation of multiple databases (one per datacenter) across the globe, they decided to do gossip amongst every single VM across the globe! You don't really gain much but you do get all the noisy complexity for sure.
by andy_ppp on 3/6/23, 11:46 PM
How easy is it to deploy things with GitHub actions, AWS and terraform instead? I think I’d like full infrastructure control after having accidentally got a job doing DevOps which I’m starting to get quite into. I should probably write up a blogpost about converting everything over…
by tiffanyh on 3/6/23, 7:22 PM
Would the simple solve be that Fly.io just mark any new service of theirs “beta” for x-months post launch?
by Dave3of5 on 3/7/23, 9:25 AM
I always thought it would be really interesting to work for a company like fly.io.
Solving hard problems like this seems interesting.
On the other hand it could be a giant shit show of micromanagement and toxicity, who knows really.
At the moment they aren't hiring though so that's that.
by sidcool on 3/7/23, 9:17 AM
I have a debilitating impostor syndrome around resiliency and reliability of my systems. I always feel I am doing something wrong compared to the Googles, the Microsofts and the Fly.ios of the world. This does help feel better.
by chucky_z on 3/6/23, 7:46 PM
mrkurt have you considered some of the lower tiers of vault enterprise that allow for performance replicas that just outright solve that problem? might be cheaper than an engineer at this point.
by none_to_remain on 3/6/23, 11:51 PM
Three little pigs, but they all live in a house of straw built on top of several houses of sticks. (The foundation is an old house of brick, but the wolf isn't bothered by that.)
by siliconc0w on 3/7/23, 6:09 AM
Respect the post, building your own infrastructure provider is playing on hard, the big players have had armies of engineers iterating on their stacks for a decade+
by gpjanik on 3/7/23, 8:30 AM
I like Fly's response to the problems - honesty and openness - so I'm going to add to their problems and try to use it ;)
by nprateem on 3/7/23, 7:58 AM
Interesting take. As a non customer I now won't consider them for any projects as they've confirmed their unreliability.
by nathants on 3/6/23, 11:11 PM
i also wanted a good cli for aws, and built one:
https://github.com/nathants/libaws
companies like fly are fantastic.
they provide a good service, and they put market pressure on aws.
a free tier isn’t important anymore. with usage based pricing for lambda/dynamo/s3, an app with usage approaching zero has no cost.

by anacrolix on 3/7/23, 3:41 AM

    - Machines seem like a waste of time
    - Access directly to VMs is being removed (and doesn't support TCP over IPv4, or UDP over IPv6)
    - The CDN is nice but should support private networking too.
    - Volume management is deficient: It should be possible to access and fix volumes outside the context of its app instance.
    - Egress traffic should be free between apps over private networking, at least in the same DC.

by crabbone on 3/7/23, 3:14 PM
I'm not in the Web business, and have no idea what fly.io is offering, but whenever I hear anyone trashing Consul, I give them a standing ovation. An application which decided to use Base64-encoded JSON for its communication protocol deserves every bit of mockery it can get.
by 1023bytes on 3/6/23, 10:52 PM
I get it, I like fly.io, but the last outage made me switch to Railway.app
by davedx on 3/7/23, 8:18 AM
@kurt happy (sometimes paying) Fly user here. Keep up the great work!
by al_be_back on 3/7/23, 9:21 AM
in my view, a new edge computing entrant needs a niche market e.g low latency gaming or privacy-heavy computing, and to stay away from MAAG cloud territory.
by lopatin on 3/6/23, 8:02 PM
It sounds like they need more money to scale the shared stack
by victorbjorklund on 3/6/23, 8:35 PM
Great post. Love how transparent Fly is. I'm a customer but without any life and death important apps. And yea they had some issues lately so good they are addressing it.
by swamp40 on 3/6/23, 11:35 PM
"CEO finds novel way to fix Capacity problems..."
They just lost about 40% of their paying users with that blog post.
by benatkin on 3/7/23, 4:33 AM
It's too late. They've already misled people.
by KETpXDDzR on 3/7/23, 4:08 AM
That sounds like a typical problem of unnecessary complexity to me. I wonder how many over engineered (web) applications could run as a single, efficient process on a single machine.