by tj1516 on 3/27/24, 6:14 AM with 92 comments
I want to understand what kind of challenges you all are facing with regards to this. And what tools and practices are you all using to reduce this pain? For example: How do you deploy resources? How do you define architecture? How do you manage your environments, observability, etc.?
by cjk2 on 3/27/24, 10:29 AM
I'd use AWS + ECR + ECS personally. Stay away from Kubernetes until you have a fairly large deployment: keeping it up to date and well managed requires a huge amount of administrative and knowledge overhead. Keep your containers entirely portable. Make sure you understand persistent storage and backups for any state.
Also stay the hell away from anything which is not portable between cloud vendors.
As for observability, carefully price up buy vs. build, because buy gets extremely expensive. The last thing you want is to wake up one day to a $500k logging bill.
And importantly every time you do something in any major public cloud, someone is making money out of you and it affects your bottom line. You need to develop an accounting mentality. Nothing is free. Build around that assumption. Favour simple, scalable architectures, not complex sprawling microservices.
by jimangel on 3/27/24, 10:48 AM
Who has access? How do we audit and rotate? How do we secure it?
You can apply those questions at each step along the way: how do you secure secrets in your cloud? In code? In IaC? In container deployments? In CI/CD?
If we assume infra / app is code, the tooling matters a lot less. How do you provision certificates via IaC? How do you grant IAM to resources and how do you revoke?
There are examples of more advanced IaC architectures, like https://github.com/terraform-google-modules/terraform-exampl..., but you can start as small or as complex as you want and evolve from there if it's done properly.
Personally, I love me some Kubernetes + ArgoCD (GitOps) + Google Workload Identity + Google Secret Manager, but I am 100% biased.
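For illustration, here's roughly what that combination looks like from the application side: a minimal sketch of reading a secret at runtime with the google-cloud-secret-manager Python client, relying on Workload Identity for auth so no key files ship in the image. The project and secret names are made-up placeholders, not anything from a real setup.

    # Minimal sketch: fetch a secret at runtime under Workload Identity,
    # so no service-account key files are baked into the image.
    # "my-project" and "db-password" are hypothetical placeholders.
    from google.cloud import secretmanager

    def fetch_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
        client = secretmanager.SecretManagerServiceClient()
        name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
        response = client.access_secret_version(request={"name": name})
        return response.payload.data.decode("UTF-8")

    if __name__ == "__main__":
        print(fetch_secret("my-project", "db-password"))

The nice part of this pattern is that revoking access is an IAM change on the secret, not a hunt for leaked key files.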
by crote on 3/27/24, 1:37 PM
- Document everything. Hacking together some cloud infra is easy, maintaining it is a completely different story. You want to have clear and up-to-date documentation on what exactly you have running and why. If you don't have good docs, it'll be impossible to scale, refactor, or solve outages.
- Make sure people are trained. Cloud infra management gets complicated real fast. You want people to actually know what they are doing (no "hey, why is the bucket public" oopsies), and you want to avoid running into Bus Factor issues. Not everyone needs to be an expert, but you should always have one go-to person for questions who's actually capable, and one backup in case something happens to them.
- Watch your bills. Every individual part isn't too expensive, but it all adds up quickly. Make sure you do a sanity check every once in a while (no, $5k / month for a DB storing megabytes is not normal). Don't use scaling as a "magic fix-everything button" - sometimes your code just sucks.
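To make that periodic sanity check concrete, here is a minimal sketch of pulling last month's per-service spend with boto3 and Cost Explorer; it assumes AWS credentials allowed to call ce:GetCostAndUsage, and the $100 threshold is an arbitrary example, not a recommendation.

    # Minimal sketch: flag any AWS service costing more than a threshold last month.
    # Assumes boto3 and credentials allowed to call ce:GetCostAndUsage.
    import datetime
    import boto3

    ce = boto3.client("ce")
    end = datetime.date.today()
    start = end - datetime.timedelta(days=30)

    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    for period in resp["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            if amount > 100:  # arbitrary example threshold
                print(f"{service}: ${amount:,.2f}")

Run something like this from cron and you'll notice the surprise $5k database before the invoice does.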
by sph on 3/27/24, 10:17 AM
I can do both, but for a bootstrapped solo business, Kubernetes is overkill and overengineered. What I would really love is a multi-node podman infrastructure, where I can scale out without having to deal with k8s and its infernal circus of YAML, Helm, etcd, kustomize, certificate rotation, etc.
Recently I had to set up zero-downtime deploys for my app. I spent a week seriously considering a move to k3s, but the churn of the entire Kubernetes ecosystem frustrated me so much that I simply wrote a custom script based on Caddy, regular container health checks, and container cloning. It's easier to understand, it's 20 lines of code, and I don't have to sell my soul to the k8s devil just yet.
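For anyone curious, the general shape of that kind of Caddy + health-check + clone swap looks something like the sketch below. This is a simplified illustration, not the actual script: container names, ports, and the /healthz path are all placeholders.

    # Rough sketch of the idea (not the author's actual script).
    # The old container "app-blue" is assumed to be serving on host port 8080.
    import subprocess, time, urllib.request

    OLD, NEW, NEW_PORT = "app-blue", "app-green", 8081
    IMAGE = "registry.example.com/app:latest"  # placeholder image

    subprocess.run(["docker", "run", "-d", "--name", NEW,
                    "-p", f"{NEW_PORT}:8080", IMAGE], check=True)

    # Poll the clone's health endpoint before shifting any traffic to it.
    for _ in range(30):
        try:
            if urllib.request.urlopen(f"http://127.0.0.1:{NEW_PORT}/healthz", timeout=2).status == 200:
                break
        except OSError:
            time.sleep(2)
    else:
        raise SystemExit("new container never became healthy; leaving the old one in place")

    # Point the Caddyfile's upstream at the new port and do a graceful reload.
    caddyfile = open("Caddyfile").read().replace("127.0.0.1:8080", f"127.0.0.1:{NEW_PORT}")
    open("Caddyfile", "w").write(caddyfile)
    subprocess.run(["caddy", "reload", "--config", "Caddyfile"], check=True)

    subprocess.run(["docker", "rm", "-f", OLD], check=True)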
Sadly, I don't think a startup can help make this better. I want a bona fide FOSS solution to this problem, not another tool to get handcuffed to. I seem to remember Red Hat were working on a prototype of a systemd-podman orchestration system to make it easy to deploy a single systemd unit onto multiple hosts, but I can't remember what it's called anymore.
---
Also, judging from the rest of the comments, I seem to be an outlier in running on dedicated servers. These days everybody is using one of the clouds and is terribly afraid of managing servers. I think it's going to be hard to make DevOps better when everyone is in the "loving" Azure/AWS/GCP embrace: you're basically positioning yourself as their competitor, since the cloud vendors themselves are always trying to upsell their customers and reduce friction to as close to zero as possible.
by dig1 on 3/27/24, 12:20 PM
We should keep things simple, KISS. A few key points, mostly as notes to my future self:
1. Be cloud agnostic. Everyone operates on margins and a slight increase in a cloud provider's fees could be a death sentence for your business. Remember, cloud providers are not your friends; they are in the business of taking your money and putting it in their pockets.
2. Consider using bare metal if it's feasible for your operations. The price/performance ratio of bare metal is unbeatable, and it encourages a simpler infrastructure. It also presents an opportunity to learn about devops, making it harder for others to sell you junk as a premium service. This approach also discourages the proliferation of multiple databases/tech/tools for the sake of CV updates by your colleagues, keeping your infrastructure streamlined.
3. Opt for versatile tools like Ansible that can handle a variety of tasks. Don't be swayed by what's popular among the "cool kids". Your focus should be on making your business succeed, not on experimenting with every new tool on the market. Pick your tool and master it well.
4. Make sure you can replicate your whole production stack on your box in a few seconds, a minute max. If you can't, well, back to the drawing board.
5. Use old, tried-and-true tech. Choose your tech wisely. Docker is no longer cool and Podman is all the rage on HN, but there are hundreds of man-hours' worth of documentation online for every Docker issue you can think of, and Docker will be around for a while. The same goes for Java/Rails/PHP...
6. Keep everything reproducible in your repository: code, documentation, deployment scripts, and infra diagrams. I've seen people use one service for infra diagrams and another to describe database schema. It's madness.
7. (addon) Stay away from monorepos. They are cool, they are "googly," but you are not Google or Microsoft. They are notoriously hard to scale and secure without custom tooling, no matter what people say. If you have problems sharing code between repos, back to the drawing board.
by rjst01 on 3/27/24, 10:40 AM
There are a lot of great tools out there, but making them play well together is an exercise for the reader. There are also a lot of preference-based choices you need to make in how you want your setup to look, and what you chose will affect what tools make sense to you.
Do you go monorepo or polyrepo? If you go monorepo, how do you decide what to build and deploy on each merge? If you go polyrepo, how do you keep stuff in sync between any code you want to share?
Once a build is complete, how do you trigger a deployment? How does your CI system integrate with your deployment system, or is the answer "with some shell scripts you have to write"?
> How do you deploy resources?
We have a monorepo set up with Bazel. I wrote some fairly primitive scripts that scan git changes to decide what to build. We use Buildkite for CI, which triggers rollouts to Kubernetes with ArgoCD. I had to do a non-trivial amount of work to tie all this together, but it's been fairly robust and has only needed a minimal amount of care and feeding.
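To give a flavour of what "fairly primitive" can mean here, a sketch of the pattern (not the real script; the path-to-target mapping below is a made-up example):

    # Sketch of the pattern, not the real script: diff against the merge base,
    # map changed paths to projects, and only build/deploy those targets.
    import subprocess

    TARGETS = {
        "services/api/": "//services/api:image",
        "services/worker/": "//services/worker:image",
        "libs/common/": "ALL",  # changes to shared code rebuild everything
    }

    changed = subprocess.run(
        ["git", "diff", "--name-only", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    to_build = set()
    for path in changed:
        for prefix, target in TARGETS.items():
            if path.startswith(prefix):
                to_build.add(target)

    if "ALL" in to_build:
        print("shared code changed: building everything")
    else:
        for target in sorted(to_build):
            subprocess.run(["bazel", "build", target], check=True)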
> How do you define architecture?
Kubernetes charts for our services are in git, but there's some amount of extra stuff deployed (ingress controller, for example) that is documented in a text file.
> How do you manage your environments
We don't need to deploy environments super often, so we just do it manually and update the documentation in the process if any variations are needed.
> observability
Datadog and sumologic.
Overall our setup doesn't come close to the setup I worked on at my last employer, but I have to balance time spent on devops infra with time spent on the product, and that setup took ~5 full time engineers to maintain.
by KaiserPro on 3/27/24, 11:27 AM
my advice would be:
Separate your build from your infra. Whilst it's nice to have your cloud spun up by CI, it's really not a great use case, and it means your CI has loads of power that can be abused.
GitLab with local runners is a good place to start for CI. It's relatively simple, and your own runners can be shared between projects (this is great for keeping costs down and speeding builds up, as you can share a massive instance).
Avoid raw Kubernetes until you really, really have to. It's not worth the time unless you have someone to manage it and your use case requires it. Push back hard if anyone asserts that it's the solution to x; most of the time it's because "it's cool". K8s only really becomes useful if you are trying to run nodes across different clouds or a hybrid local/cloud deployment. For almost everything else, it's just not worth it.
You are unlikely to change cloud providers, so choose one and stick to it. Use their managed features. Assuming you are using AWS, Lambdas are really good for starting out. But make sure you start deploying them with CloudFormation/Terraform (Terraform is faster, but not always better).
Use ECS to manage services and RDS to manage data. Yes, it is more expensive, but backups and duplication come for free (i.e. you can spin up a test deployment with actual data). Take the time to make sure you are not relying on hand-rolled stuff made in the web console; really put the effort into making sure everything is stored in Terraform/CF and in a git repo somewhere.
Limit the access you grant to people, services and things. Take the time to learn IAM/equivalent. Make sure that you have bespoke roles for each service/thing.
Rotate keys weekly, use the managed key/secrets storage to do that. Automate it where you can.
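On the rotation point, AWS Secrets Manager can own the schedule for you. A minimal sketch with boto3, assuming a rotation Lambda already exists (the secret name and Lambda ARN below are placeholders):

    # Minimal sketch: enable weekly automatic rotation on a secret via boto3.
    # Assumes a rotation Lambda already exists; names and ARNs are placeholders.
    import boto3

    sm = boto3.client("secretsmanager")

    sm.rotate_secret(
        SecretId="prod/db-credentials",  # hypothetical secret name
        RotationLambdaARN="arn:aws:lambda:eu-west-1:123456789012:function:rotate-db",  # placeholder
        RotationRules={"AutomaticallyAfterDays": 7},  # weekly, per the advice above
    )

    # Services then read the current value at startup instead of baking keys into config.
    value = sm.get_secret_value(SecretId="prod/db-credentials")["SecretString"]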
by efxhoy on 3/27/24, 12:33 PM
Our biggest problem is feature environments, or actual integration tests where multiple services have to change. Because infra lives in its own Terraform repo and the apps have their own repos, we don't have a good way of creating infra-identical environments for testing code changes that affect multiple services. We always end up with some hack and manual tweaks in staging.
Data engineering is another problem: propagating app schema changes to the data warehouse is a pain because it has to happen in sequence across repo borders. If it were all one repo and we got a new data warehouse per PR, it would be trivial.
Not trusting CI to hold secrets is another. As soon as we do anything in CI that needs "real" data, we have to trigger AWS ECS tasks: CircleCI has leaked secrets before, so we don't trust them, and we keep all the valuable secrets that can access real data in AWS SSM. The more complex the integrations, the harder they are to test.
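For a sense of what triggering those ECS tasks involves, here is a minimal boto3 sketch (not the actual pipeline; cluster, task definition, and subnet names are placeholders). The CI job only needs permission to run and wait on the task; the task role and SSM parameters hold the real secrets.

    # Sketch: CI kicks off an ECS task that carries the real secrets via its task
    # role / SSM parameters, then waits for it and propagates the exit code.
    # Cluster, task definition, and subnet are hypothetical placeholders.
    import boto3

    ecs = boto3.client("ecs")

    resp = ecs.run_task(
        cluster="integration-tests",
        taskDefinition="integration-suite",
        launchType="FARGATE",
        networkConfiguration={"awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }},
    )

    task_arn = resp["tasks"][0]["taskArn"]
    ecs.get_waiter("tasks_stopped").wait(cluster="integration-tests", tasks=[task_arn])

    task = ecs.describe_tasks(cluster="integration-tests", tasks=[task_arn])["tasks"][0]
    raise SystemExit(task["containers"][0].get("exitCode", 1))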
If we had a monorepo I think this type of work would be much easier. But that comes with its own set of problems, mainly deployment speed and cost.
If there was a way to snapshot all our state and “copy” it to a clean environment created for each PR that the PR could then change at will and test completely, end to end, that would be the dream.
by margorczynski on 3/27/24, 10:54 AM
In general, when you have a solution that is supposed to handle every problem and scenario, as with AWS, you'll eventually end up with some complicated Frankenstein-y creation. There's probably no way around it if they want such a robust set of features and capabilities.
by Quothling on 3/27/24, 10:24 AM
Well, that and the constant cost increases. Container Apps in Azure went from around $20-25 to $120 on our subscription. Along with all the other price hikes, we're looking to move out of the cloud and onto something like Hetzner (but localized to our country).
by Ologn on 3/27/24, 7:52 PM
I was already using more than one service, but I cancelled Rackspace when this happened, although I was not yet making enough to be fully and automatically redundant. It was a pain to suddenly have to drop everything and rebuild the service, as it broke the entire service I was offering. I actually rebuilt everything on my existing service at Linode (now Akamai), and then got a VPS at Ramnode as my backup. So it was a lot of work suddenly thrown at me that I didn't need. Luckily, while my backup practices are not completely ideal, I do follow the 3-2-1 backup rule enough that it wasn't catastrophic.
Here's the message I got from Rackspace in 2019:
This message is a follow-up to our previous notifications regarding cloud server, "c0defeed". Your cloud server could not be recovered due to the failure of its host.
Please reference this ID if you need to contact support: "CSHD-5a271828"
You have the option to rebuild your server from your most recent server image or from a stock image. Once you have verified that your rebuilt server is online, you must (1) Copy your data from your most recent backup or attach your data device if you use Cloud Block Storage and (2) Adjust your DNS to reflect the new server’s IP address for any affected domains. When you have verified that your rebuilt server is online, with your data intact, you will need to delete the impacted server from your account to prevent any further billing associated with the impacted device.
We apologize for any inconvenience this may have caused you.
by scaryclam on 3/27/24, 12:05 PM
Number two: using tech that you really don't need. Keep it simple, even if it means not using something that the rest of the industry is in love with. Figure out what your requirements are, pick the simplest way of setting it up (keeping #1 in mind, because using tooling that isn't able to grow with you is worse IMO), and keep it agnostic. Then, when you start to hit the edges of its capabilities, either scale up your ops understanding and start using more sophisticated things, or hire someone in to help out.
Also, I really do agree with @KaiserPro when they say to separate your infra from your deploy. It makes moving to something else much, MUCH easier when you inevitably need to.
by noop_joe on 3/27/24, 1:34 PM
Without dedicated devops the challenge is allocating developer time to do the integration work required to get the various ops tools playing nicely together. In my experience that work is non-trivial and eventually leads to some form of dedicated ops.
That is also part of a dynamic: lots of the tools available for solving these problems exist because of dedicated ops. It's not easy for a team trying to build software to also take on these extra operational responsibilities.
What we're trying to do is create a highly curated set of what we think of as application primitives. These primitives include CI/CD, Logs, Services, Resources, etc. Because they're already integrated, developers don't have to figure out API keys, access control, data synchronization, etc.
by matt-p on 3/27/24, 9:57 AM
Use CodePipeline/Cloud Build, then either a container registry and ECS, or Beanstalk.
You get everything you mentioned for free with either setup.
Seriously, it takes moments to write the Terraform, or even to do it by hand.
by jmstfv on 3/27/24, 12:03 PM
Keeping it very simple: I push code to GitHub; then Capistrano (think bash script with some bells and whistles) deploys that code to the server and restarts systemd processes, namely Puma and Sidekiq.
The tech stack is fairly simple as well: Rails, SQLite, Sidekiq + Redis, and Caddy, all hosted on a single Hetzner dedicated server.
The only problem is that I can't deploy as often as I want because some Sidekiq jobs run for several days, and deploying code means disrupting those jobs. I try to schedule deployment times; even then, sometimes, I have to cut some jobs short.
by inscrutable on 3/27/24, 11:22 AM
If AWS is required, a monolith on ECS Fargate + Aurora Postgres Serverless is perfectly cromulent. You'll need to string together some open-source tools for, e.g., CI/CD.
by jeffrallen on 3/27/24, 12:14 PM
"In the cloud" has recently come to mean "NSA gets a copy of whatever they want" unless you choose wisely.
by aristofun on 3/27/24, 1:33 PM
And a lack of attention to lean alternatives like Docker Swarm is another one.
by thekid_og on 3/27/24, 10:08 AM
you can quite easily connect it to GCP/AWS/Azure