from Hacker News

Ask HN: Startup Devs -What's your biggest pain while managing cloud deployments?

by tj1516 on 3/27/24, 6:14 AM with 92 comments

When you're a tiny team of developers working on something beyond MVP level and beyond heroku — managing your CI, deployments/rollbacks, DBs etc. looks like a nightmare to me without Devops expertise.

I want to understand what kind of challenges you all are facing with regards to this. And any tools, practices, you’ll are using to reduce this pain? Ex- How do you deploy resources? How do you define architecture? How do you manage your environments, observability?, etc.

by cjk2 on 3/27/24, 10:29 AM
Killer one that has shafted us is tying everything into popular deployment and CI tooling. Your product should be buildable, deployable and tested on a standalone workstation at any time. We have lost literally days fucking around with broken products which get their tentacles into that shit.
I'd use AWS + ECR + ECS personally. Stay away from kubernetes until you have a fairly large deployment as it requires huge administrative overhead to keep it up to date and well managed and a lot of knowledge overhead. Keep your containers entirely portable. Make sure you understand persistent storage and backups for any state.
Also stay the hell away from anything which is not portable between cloud vendors.
As for observability, carefully price up buy vs build because buy gets extremely expensive. Last thing you want is a $500k bill for logging one day.
And importantly every time you do something in any major public cloud, someone is making money out of you and it affects your bottom line. You need to develop an accounting mentality. Nothing is free. Build around that assumption. Favour simple, scalable architectures, not complex sprawling microservices.
by jimangel on 3/27/24, 10:48 AM
There's a lot of noise when trying to learn the advanced features of each cloud provider's "way of doing XYZ." I think it helps to focus on the things worth protecting: secrets, credentials, code.
Who has access? How do we audit / rotate? How do we secure?
You can use this approach for each step along the way, how to secure secrets in your cloud? code? IaC? container deployments? CI/CD?
If we assume infra / app is code, the tooling matters a lot less. How do you provision certificates via IaC? How do you grant IAM to resources and how do you revoke?
There are examples like https://github.com/terraform-google-modules/terraform-exampl... of more advanced IaC architectures, but you can start as small or as complex as you want and evolve if done properly.
Personally, I love me some Kubernetes + ArgoCD (GitOps) + Google Workload Identity + Google Secret Manager, but I am 100% biased.
by crote on 3/27/24, 1:37 PM
- Keep it simple. Cloud providers offer hundreds of services - just because you can use them doesn't mean you should. A simple Docker container or a small VM might already be enough for you, and it's a lot less moving parts.
- Document everything. Hacking together some cloud infra is easy, maintaining it is a completely different story. You want to have clear and up-to-date documentation on what exactly you have running and why. If you don't have good docs, it'll be impossible to scale, refactor, or solve outages.
- Make sure people are trained. Cloud infra management gets complicated real fast. You want people to actually know what they are doing (no "hey, why is the bucket public" oopsies), and you want to avoid running into Bus Factor issues. Not everyone needs to be an expert, but you should always have one go-to person for questions who's actually capable, and one backup in case something happens to them.
- Watch your bills. Every individual part isn't too expensive, but it all adds up quickly. Make sure you do a sanity check every once in a while (no, $5k / month for a DB storing megabytes is not normal). Don't use scaling as a "magic fix-everything button" - sometimes your code just sucks.
by sph on 3/27/24, 10:17 AM
That these days there are only two serious choices: deploy on a single-node Docker/podman machine, or Kubernetes.
I can do both, but for a bootstrapped solo business, Kubernetes is overkill and overengineered. What I would really love is a multi-node podman infrastructure, where I can scale out without having to deal with k8s and its infernal circus of YAML, Helm, etcd, kustomize, certificate rotation, etc.
Recently I had to set up a zero-downtime system for my app, I spent a week seriously considering a move to k3s, but the entire kubernetes ecosystem of churn frustrated me so much I simply wrote a custom script based on Caddy, regular container health checks and container cloning. Easier to understand, 20 lines of code and I don't have to sell my soul to the k8s devil just yet.
Sadly, I don't think a startup can help make this better. I want a bonafide FOSS solution to this problem, not another tool to get handcuffed to. I seem to remember Red Hat where working on a prototype of a systemd-podman orchestration system to make it easy to deploy a single systemd unit into multiple hosts, but I am unable to remember what is it called any more.
---
Also, I seem to be an outlier, judging from the rest of the comments, by running on dedicated servers. These days everybody is using one of the clouds and terribly afraid of managing servers. I think it's going to be hard to make DevOps better when everyone is in the "loving" Azure/AWS/GCP embrace: you're basically positioning as their competitor, as the cloud vendor itself is always trying to upsell its customers and reduce friction to as close to zero as possible.
by dig1 on 3/27/24, 12:20 PM
My main challenge is people, especially "experts," who come, preach whatever popular cloud stuff is today, and then leave. This leaves me with a lot of shi* to deal with.
We should keep things simple, KISS. My few key points for the future myself:
1. Be cloud agnostic. Everyone operates on margins and a slight increase in a cloud provider's fees could be a death sentence for your business. Remember, cloud providers are not your friends; they are in the business of taking your money and putting it in their pockets.
2. Consider using bare metal if it's feasible for your operations. The price/performance ratio of bare metal is unbeatable, and it encourages a simpler infrastructure. It also presents an opportunity to learn about devops, making it harder for others to sell you junk as a premium service. This approach also discourages the proliferation of multiple databases/tech/tools for the sake of CV updates by your colleagues, keeping your infrastructure streamlined.
3. Opt for versatile tools like Ansible that can handle a variety of tasks. Don't be swayed by what's popular among the "cool kids". Your focus should be on making your business succeed, not on experimenting with every new tool in the market. Master it well.
4. Make sure you can replicate your whole production stack on your box in a few seconds, a minute max. If you can't, well, back to the drawing board.
5. Use old and tried tech. Choose your tech wisely. Docker is no longer cool, and Podman is a rage on HN, but there are hundreds of man-hours of documentation online of every Docker issue you can think of. And Docker will stay for a while. The same for Java/Rails/PHP...
6. Keep everything reproducible in your repository: code, documentation, deployment scripts, and infra diagrams. I've seen people use one service for infra diagrams and another to describe database schema. It's madness.
7. (addon) Stay away from monorepos. They are cool, they are "googly," but you are not Google or Microsoft. They are notoriously hard to scale and secure without custom tooling, no matter what people say. If you have problems with the code sharing between repos, back to the drawing board.
by rjst01 on 3/27/24, 10:40 AM
A year ago I left my job at a 500-employee SaaS business working on the team that maintains the devops infrastructure, to found a startup. For me the biggest pain point is going from nothing to a sufficiently flexible devops setup.
There are a lot of great tools out there, but making them play well together is an exercise for the reader. There are also a lot of preference-based choices you need to make in how you want your setup to look, and what you chose will affect what tools make sense to you.
Do you go monorepo or polyrepo? If you go monorepo, how do you decide what to build and deploy on each merge? If you go polyrepo, how do you keep stuff in sync between any code you want to share?
Once a build is complete, how do you trigger a deployment? How does your CI system integrate with your deployment system, or is the answer "with some shell scripts you have to write"?
> How do you deploy resources?
For us, we have a monorepo setup with bazel. I wrote some fairly primitive scripts to scan git changes to decide what to build. We use Buildkite for CI, which triggers rollouts to kubernetes with ArgoCD. I had to do a non-trivial amount of work to tie all this together, but it's been fairly robust and has only needed a minimal amount of care and feeding.
> How do you define architecture?
Kubernetes charts for our services are in git, but there's some amount of extra stuff deployed (ingress controller, for example) that is documented in a text file
> How do you manage your environments
We don't need to deploy environments super often, so just do it manually and update documentation in the process if any variations are needed.
> observability
Datadog and sumologic.
Overall our setup doesn't come close to the setup I worked on at my last employer, but I have to balance time spent on devops infra with time spent on the product, and that setup took ~5 full time engineers to maintain.
by INTPenis on 3/27/24, 10:28 AM
If it looks like a nightmare to you it's because clouds like GCP and AWS require training just to operate properly. There are so many components who are all intertwined with weird names and even stranger APIs, and you have no idea which of them you need unless you have significant experience, or formal training.
by KaiserPro on 3/27/24, 11:27 AM
There are a lot of ways of doing things, and its really easy to make mistakes.
my advice would be:
Separate your build from your infra. Whilst its nice to have your cloud be spun up with CI, its really not a great use case, and means your CI has loads of power that can be abused.
Gitlab with local runners is a good place to start for CI. its relatively simple and your personal runners can be shared between projects (this is great for keeping costs down, but speeds up, as you can share a massive instance)
Avoid raw Kubernetes until you really really have to. Its not worth the time unless you have someone to manage it and your use case requires it. Push back hard on if anyone asserts that its a solution to x. Most of the time its because "its cool" K8s only really becomes useful if you are trying to have multiple nodes from different clouds/hybrid local/cloud deployment. For almost everything else, its just not worth it.
You are unlikely to change cloud providers, so choose one and stick to it. Use their managed features. Assuming you are using AWS, Lambdas are really good for starting out. But, make sure you start deploying them with cloudformation/terraform (terraform is faster, but not always better)
Use ECS to manage services, use RDS to manage data. Yes it is more expensive, but backups and duplication comes for free (ie you can spin up a test deployment with actual data.) Take the time to make sure that you are not using hand rolled stuff made in the web console, really put the effort into make sure everything is stored in terraform/CF and in a git repo somewhere.
Limit the access you grant to people, services and things. Take the time to learn IAM/equivalent. Make sure that you have bespoke roles for each service/thing.
Rotate keys weekly, use the managed key/secrets storage to do that. Automate it where you can.
by efxhoy on 3/27/24, 12:33 PM
I work at a 60 dev shop so not a startup per se.
Our biggest problem is feature environments, or actual integration tests where multiple services have to change. Because infra is in its own repo in terraform and the apps have their own repo we don’t have a good way of creating infra-identical environments for testing code changes that affect multiple services. We always end up with some hack and manual tweaks in staging.
Data engineering is another problem, managing how to propagate app schema changes to the data warehouse is a pain because it has to happen in sequence across repo borders. If it was all one repo and we got a new data warehouse per PR it would be trivial.
Not trusting CI to hold secrets is another. As soon as we do anything in CI that needs “real” data we need to trigger aws ecs tasks, because circleci has leaked secrets before so we don’t trust them and keep all our valuable secrets that can access real data in aws ssm. The more complex the integrations the harder they are to test.
If we had a monorepo I think this type of work would be much easier. But that comes with its own set of problems, mainly deployment speed and cost.
If there was a way to snapshot all our state and “copy” it to a clean environment created for each PR that the PR could then change at will and test completely, end to end, that would be the dream.
by bionhoward on 3/27/24, 10:40 AM
I’d argue the obvious answer is address the lack of great answers for declarative schema migration in PostgreSQL. There is Skeema https://github.com/skeema/skeema but it doesn’t support Postgres and Prisma iirc forces you into an ORM, atlas looks perfect but has a nonstandard license.
by margorczynski on 3/27/24, 10:54 AM
The basic thing when dealing with the "Cloud" is to stay away from their exclusive offering and vendor-specific stuff to avoid lock-in. Once you're locked in you're gonna get fleeced really hard as they are aware that moving away in such a situation is much harder.
In general when you have a solution that is supposed to handle any problem and scenario like with AWS you'll eventually end up with some complicated Frankenstein-y creation, there's probably no going around it if they want such a robust set of features and capabilities.
by Quothling on 3/27/24, 10:24 AM
Networking. Especially when you have to combine on-prem with cloud sollutions. At least in Azure it's an outright nightmare with the various private end-points and what not. AWS might be better, I know I much preferred it when we had Route50something instead of the Azure dns thing. Frankly a lot of the IT-operations tools are pretty shitty in Azure, but things like networking is where you are less ok with "winging it".
Well that and the constant cost increases. Container apps in Azure went from around $20-25 to $120 on our subscription. Along with all the other price hikes we're looking to move out of the cloud and go into something like Hetzner (but localized to our country).
by siva7 on 3/27/24, 11:00 AM
Managing costs is the single biggest pain point. Be careful and don't start with kubernetes etc. if you just need to deploy a monolith because managed k8s is way more expensive than e.g. Azure Web Apps or AWS Lightsail. Review your architecture carefully to make the deployment as simple as possible. Microservices aren't for the product-market-fit stage but when you really need them (you will know when).
by Ologn on 3/27/24, 7:52 PM
I would say be cognizant that a major cloud provider might always just contact you out of the blue and say they after a few years of usage they blew away your VPS, and have no recovery for it, and you should rebuild it from scratch off your last backup.
I already was using more than one service but I cancelled Rackspace when this happened, although was not making enough yet to be fully and automatically redundant. It was a pain to suddenly have to drop everything and rebuild the service, as it broke the entire service I was offering. Actually I rebuilt everything on my existing service at Linode (now Akamai), and then got a VPS at Ramnode as my backup. So it was a lot of work suddenly thrown at me that I didn't need. Luckily, while my backup practices are not completely ideal, I do follow the 3-2-1 backup rule enough that it wasn't catastrophic.
Here's the message I got from Rackspace in 2019:
This message is a follow-up to our previous notifications regarding cloud server, "c0defeed". Your cloud server could not be recovered due to the failure of its host.
Please reference this ID if you need to contact support: "CSHD-5a271828"
You have the option to rebuild your server from your most recent server image or from a stock image. Once you have verified that your rebuilt server is online, you must (1) Copy your data from your most recent backup or attach your data device if you use Cloud Block Storage and (2) Adjust your DNS to reflect the new server’s IP address for any affected domains. When you have verified that your rebuilt server is online, with your data intact, you will need to delete the impacted server from your account to prevent any further billing associated with the impacted device.
We apologize for any inconvenience this may have caused you.
by tootie on 3/27/24, 12:38 PM
I tend to avoid any PaaS service like Vercel or Heroku unless you're guaranteed to never need scale or complexity. If you can dockerize your main application code, it's easy enough to deploy on ECS or Azure app service. Lean heavily on COTS software for anything outside your core competency (ie CRM or marketing) and even consider outsourcing your on-call to an MSP.
by scaryclam on 3/27/24, 12:05 PM
Number one would be: tying your deployment to the cloud providers way of doing things. ECS, CodeDeploy, heroku.... it's easy to get stuck when your ops requirements grow beyond how they do things, and really hard to set up something more agnostic alongside it (especially when you have a tight budget on cloud resources).
Number two: using tech that you really don't need. Keep it simple, even if it means not using something that the rest of the industry is in love with. Figure out what your requirements are, pick the simplest way of setting it up (keeping #1 in account, because using tooling that isn't able to grow with you is worse IMO), and keep it agnostic. Then when you start to hit the edges of the capabilities, either scale up your ops understanding and start using more sophisticated things, or hire someone in to help out.
Also, I really do agree with @KaiserPro when they say to separate your infra from your deploy. It makes moving to something else much, MUCH easier when you inevitably need to.
by noop_joe on 3/27/24, 1:34 PM
Full disclosure I work at a company [1] attempting to address this very problem.
Without dedicated devops the challenge is allocating developer time to do the integration work required to get the various ops tools playing nicely together. In my experience that work is non-trivial and eventually leads to some form of dedicated ops.
That is also part of a dynamic -- lots of tools available for solving these problems exist because of dedicated ops. It's not easy for a team trying to build software to also take on these extra operational responsibilities.
What we're trying to do is create a highly curated set of what we think of as application primitives. These primitives include CI/CD, Logs, Services, Resources etc. Because they're already integrated it doesn't require developers to figure out API keys, access control, data synchronization, etc.
1. https://noop.dev
by matt-p on 3/27/24, 9:57 AM
I think without aws it would be painful to be honest.
Use code pipeline/cloud build then either container registry and ecs or beanstalk.
You get everything you mentioned for free with either set up.
Seriously it takes moments to write terraform or even do it by hand.
by jmstfv on 3/27/24, 12:03 PM
I run a profitable SaaS business all on my own.
Keeping it very simple: I push code to Github; then Capistrano (think bash script with some bells and whistles) deploys that code to the server and restarts systemd processes, namely Puma and Sidekiq.
The tech stack is fairly simple as well: Rails, SQLite, Sidekiq + Redis, and Caddy, all hosted on a single Hetzner dedicated server.
The only problem is that I can't deploy as often as I want because some Sidekiq jobs run for several days, and deploying code means disrupting those jobs. I try to schedule deployment times; even then, sometimes, I have to cut some jobs short.
by OJFord on 3/27/24, 10:33 AM
Costs - predicting what-if changes, debugging unexpected increases, etc.
by inscrutable on 3/27/24, 11:22 AM
A monolith on GCP Cloud Run + Postgres Alloy db is as close to Heroku as exists today. No external tools for ci, deployments/rollbacks, logs, metrics are needed. Zero lock-in, nothing to maintain. If for some reason you get to 10+ services and want to go to kubernetes, that's easy.
If AWS is required, a monolith on ECS Fargate + Aurora Postgres Serverless is perfectly cromulent. You'll need to string together some open source tools for e.g. ci/cd.
by jeffrallen on 3/27/24, 12:14 PM
Pay attention to data sovereignty. Think about what you'll do if your customers or your regulator ask you to keep data in their jurisdiction.
"In the cloud" has recently come to mean "NSA gets a copy of whatever they want" unless you choose wisely.
by crooksey on 3/27/24, 10:13 AM
To be honest, with Azure Devops platforms, web app instances, containers and deployment slots, its almost bulletproof once you spend the time to get it originally configured. Same with AWS I imagine.
by brainzap on 3/27/24, 12:09 PM
That we still need infra humans that understand it and maintain it.
by aristofun on 3/27/24, 1:33 PM
K8s is the name of the pain.
And lack of attention to lean alternatives like docker swarm - is another one.
by mentos on 3/27/24, 11:30 AM
I don’t trust using AWS without exposing myself to the risk of an enormous bill.
by thekid_og on 3/27/24, 10:08 AM
a lot can be done via gitlab auto-devops. you get alot of out of the box feature like pipelines, deployments, review environments...
you can quite easily connect it to GCP/AWS/Azure
by r-spaghetti on 3/27/24, 12:19 PM
Vendor lock-in