by dguo on 7/21/23, 12:04 PM with 120 comments
by bob1029 on 7/21/23, 3:50 PM
Going into a "cloud native" stance and continuing to micromanage containers, VMs, databases, message buses, reverse proxies, etc. seems absolutely ridiculous to me. We're now using exactly 2 major cloud components per region: A Hyperscale SQL database, and a FaaS runner. Both on serverless & consumption-based plans. There are zero VMs or containers in our new architecture. We certainly use things like DNS, AAD, VNets, etc., but it is mostly incidentally created by way of the primary offerings, and we only ever have to create it 3 times and it's done forever and ever - Dev cloud, Prod cloud, DR cloud. And yes - we are "mono cloud", because any notion of all of Azure/AWS/GCP going down globally and not also dragging the rest of the internet with it is fantasy to me (and our customers).
When you literally have one database to worry about for the entire universe, you stop thinking in terms of automation and start thinking in terms of strategic nuclear exchange. Granted, one big thing to screw up is a big liability, but only if you don't take extra precautions around process/procedure/backup/communication/etc.
Doing more with less also makes conversations around disaster recovery and compliance so much easier. Our DR strategy is async log replication of our 1 database. I really like the abstraction of putting 100% of the business into one place and having it magically show up on the other side of the flood event.
How about this for a litmus test: If your proposed solution architecture is so complicated that you would be driven to IAC abstractions to manage it, perhaps we need to re-evaluate the expectations of the business relative to the technology.
by codethief on 7/21/23, 5:37 PM
- We need to get rid of YAML. Not only because it's a horrible file format but also because it lacks proper variables, proper type safety, proper imports, proper anything. To this day, usage & declaration search in YAML-defined infrastructure still often amounts to a repo-wide string search. Why are we putting up with this?
- The purely declarative approach to infrastructure feels wrong. For instance, if you've ever had to work on Gitlab pipelines, chances are that already on day 1 you started banging your head against the wall because you realized that what you wanted to implement is not possible currently – at least not without jumping through a ton of hoops – and there's already an open ticket from 2020 in Gitlab's issue tracker. I used to think, how could the Gitlab devs possibly forget to think of that one really obvious use case?! But I've come to realize that it's not really their fault: If you create any declarative language, you as the language creator will have to define what all those declarations are supposed to mean and what the machine is supposed to do when it encounters them. Behind every declaration lies a piece of imperative code. Unfortunately, this means you'll need to think of all potential use cases of your language and your declarations, including combinations and permutations thereof. (There's a reason why it's taken so long for CSS to solve even the most basic use cases.) Meanwhile, imperative languages simply let the user decide what they want. They are much more flexible and powerful. I realize I'm not saying anything new here, but it often feels as if DevOps people have forgotten about the benefits of high-level programming languages. Now, this is not to say we should start defining all our infrastructure in Java, but let's at least allow for a little bit of imperativeness and expressiveness!
by twic on 7/21/23, 1:31 PM
by jen20 on 7/21/23, 1:13 PM
CDK, CDKTF and Pulumi all use general purpose programming languages, so reusing parameter objects in the way that is described is trivial: it is so close to second nature that I would not even think to write it down. Indeed, it's not uncommon to share functions that make such parameter objects via libraries in the package ecosystem of your choice.
I agree that IaC needs a rethink, but that is more to do with the fact that declarative systems simply cannot model the facts on the ground without being substantially more complex.
by JohnMakin on 7/21/23, 1:41 PM
Sure, I can get behind this. Yesterday I was trying to figure out how to give a name to EC2 instances generated by an AWS-managed autoscaling group that’s created by a node group resource. Simple, right? You should just be able to add a Name = $tag field to the node group somewhere that applies to the generated EC2 instances?
Well, not quite. What you actually need is a separate aws_autoscaling_group_tag resource.
Well, that resource needs a reference to the autoscaling group's name. But I don't manage an autoscaling group, my node group does, so in the end I have to figure out how to reference it like:
aws_eks_node_group.node_group.resources[0].autoscaling_groups[0].name
Well, not quite: you may need a try() around that, and maybe some lifecycle rules to get around weird race conditions.
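Roughly, the working shape ends up looking like this (a sketch of the aws_autoscaling_group_tag / aws_eks_node_group pattern from the AWS provider; the node group reference and tag value are made up):

    resource "aws_autoscaling_group_tag" "node_name" {
      # The node group reports the ASGs it created; grab the first one's name.
      autoscaling_group_name = aws_eks_node_group.node_group.resources[0].autoscaling_groups[0].name

      tag {
        key                 = "Name"
        value               = "my-node-name"   # whatever you want the instances called
        propagate_at_launch = true
      }
    }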
So yeah. I’m not complaining about HCL or Terraform. I find it much better than the alternatives, but a lot of the time my reaction to stuff like this is “there’s no way it has to be like this.”
by bovermyer on 7/21/23, 2:04 PM
by whoomp12342 on 7/21/23, 1:57 PM
The structure and syntax for AWS is entirely different from Azure, which is entirely different from GCP.
Instead of abstracting to CSS, I would argue we should do for infrastructure as code what bytecode did for Java across operating systems.
That way, you could easily replicate across different environments, free yourself from vendor lock-in, and have readability and re-usability all in one.
This is what I want from infrastructure as code and I have yet to see it.
by Niksko on 7/21/23, 2:36 PM
OAM has a model of components (things like containerized workloads, databases, queues), traits (scaling behavior, ingress) and in the latest draft, policies that apply across the entire application (high availability, security policy).
It's all a little disjointed and seems to have lost steam. KubeVela is powering along, but it's the only implementation, and IMO it's highly opinionated about how you do deploys; it works well for Alibaba but perhaps not for others. But it has some interesting ideas.
by kristianpaul on 7/21/23, 1:16 PM
by xorcist on 7/21/23, 6:25 PM
I feel IaC really peaked around Puppet 3 and Chef 1. IaC should be simple enough that people use it, and trivial to write providers for. People tend to glue much too large libraries onto their IaC platforms and end up with a maintenance mess, which is what kills it in the long run. However, both of the above projects went corporate and grew legs and arms and a billion other features, of which nobody uses more than a subset. Most people migrated to Ansible, which kept more of the open source project culture and was simpler in design.
Now people seem to use a little of this, a little of that. Some Ansible, some Terraform, some other stuff. They don't know what they're missing when the entire stack is built from the ground up out of templated components defined in a common declarative language. Some people seem to really like Nix, which I haven't used professionally, but from what I've seen it seems to inherit the same type of design. There was an experimental project called cfg which worked in real time using hooks such as inotify, which was promising; if there were a Kubernetes distribution built like that, it would be really easy to manage components that don't belong to a host.
by cyberax on 7/21/23, 1:54 PM
by beders on 7/21/23, 5:09 PM
But hey, when looking at the origins of OOP and its main uses back then (simulations and UI), maybe this is exactly what one needs to describe and set up infrastructure, and there have been various projects going in that direction.
Make message passing truly async, throw in garbage collection, make it dynamic (i.e. creating a new instance of an object leads to some sort of deployment), and voila: your traversable, introspectable object graph is now a representation of your infrastructure.
by danielovichdk on 7/21/23, 2:03 PM
When I need to provision anything I have a PowerShell script that interacts with the Azure CLI.
My script sets up a new resource group for every service we create, plus logging, a key vault, a web app or functions, and, if needed, some kind of data storage or queuing.
In my PowerShell script I can indicate, via a variable, which environment I want to spin up: dev, staging, or prod.
I have one YAML file for my build, and a build trigger which points to the above PowerShell script with the given environment.
All environments (dev, staging, or prod) are set up manually, with manual user assignments for deployment access, etc.
It's really lightweight but I also believe it's lightweight because we run a small services setup where each service takes care of its own provisioning.
Terraform and YAML are so verbose, but that's not the most problematic part: you can't execute those files from your local machine.
by time0ut on 7/21/23, 4:27 PM
by throwa23432 on 7/21/23, 3:29 PM
I use Terraform HCL 40 hours a week, but it is severely lacking in language design, type system, and IDE/LSP experience.
by slotrans on 7/21/23, 11:40 PM
> Centrally updatable: Sometimes best practice or corporate policy changes over time. You can update what LowCost or SecurityPolicy means later on, in one place, and that change will reapply to all resources that used it.
It sounds great but it's not. This is essentially the Fragile Base Class problem. You may _think_ that updating one of these traits in a single place will be safe and do what you want, but it may be disastrous for whoever is using it. And you're not going to find out until you deploy it.
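To make the mechanism concrete, here's a made-up Terraform sketch of such a trait: a shared module with a default that someone later "improves" in one place.

    # modules/security_policy/main.tf -- a hypothetical centrally-updatable trait
    variable "name" {
      type = string
    }

    variable "allowed_cidrs" {
      type    = list(string)
      default = ["10.0.0.0/8"]   # tighten this later and every consumer changes on its next apply
    }

    resource "aws_security_group" "this" {
      name = var.name

      ingress {
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = var.allowed_cidrs
      }
    }

    # Elsewhere: every stack that consumes the trait inherits future edits to that
    # default, and only finds out what that means when it next deploys.
    module "api_sg" {
      source = "./modules/security_policy"
      name   = "api"
    }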
by agounaris on 7/21/23, 3:34 PM
by fulafel on 7/21/23, 6:46 PM
Dark is a good example of something that sidesteps this stuff through more fine-grained integration of infrastructure and app code.
by danw1979 on 7/21/23, 3:11 PM
It’s an interesting idea. My initial reaction was “you can take my HCL from my cold dead hands” but I can’t seriously argue that Terraform is perfect or that I enjoy writing so much boilerplate.
by ggeorgovassilis on 7/21/23, 12:18 PM
by dmarinus on 7/22/23, 7:50 AM
by iAm25626 on 7/21/23, 6:16 PM
by datahead on 7/21/23, 3:23 PM
@firesteelrain said, "you can do that through abstraction. You "include" your Terraform Azure Provider or Terraform AWS Provider. At the end of the day, your module needs to know what it’s interacting with but not the higher level of abstraction. We have done it at my work to make it cloud agnostic just in case we need to go to another CSP"
Single ops eng in a 3-person startup here. Ops eng is only one of my hats right now :) I found crossplane to be a solid tool for managing cloud infrastructure. My assertion is that "the only multi-cloud is k8s" and crossplane's solution is "everything is a CRD". They have an extensive abstraction hierarchy over the base providers (GCP, TF, Azure, AWS, etc.) so it's feasible to do what firesteelrain did. My client requirements span from "you must deploy into our tenant" (could be any provider) to "host this for us."
I can set up my particular pile of YAML and say "deploy a k8s cluster, load balancers, ingress, deployments, service accounts (both provider and k8s), managed certs, backend configs, workload identity mgmt, IAP" in one shot. I use kustomize to stitch any new, isolated environment together. So far, it's been a help to have a single API style (k8s, YAML) to interact with and declaratively define everything. ArgoCD manages my deployments and provides great visibility into active YAML state and event logs.
I have not fully tested this across providers yet, but that's what crossplane promises with composite resource definitions, claims and compositions. I'm curious if any other crossplane users have feedback on what to expect when I go to abstract the next cloud provider.
cyberax's note on state management is what led me away from TF. You still have to manage state somewhere, and crossplane's idea was: k8s is already really good at knowing what exists and what should exist, so let k8s do it. I thought that was clever enough to go with, and I haven't been disappointed so far.
The model extends the k8s ecosystem and allows you to keep going even into things like db schema management. Check out the Atlas k8s operator for schema migrations; testing that next...
I also like that I can start very simple, everything about my app defined in one repo- then as systems scale I can easily pull out things like "networking" or "data pipeline" and have them operating in their own deployment repo. Everything has a common pattern for IAC. Witchcraft.
by formulathree on 7/21/23, 5:57 PM
by gattacamovie on 7/21/23, 9:05 PM
by jerf on 7/21/23, 2:10 PM
Cloud template definitions also have a lot of settings, but from what I can see, they are all different, all the time, for lots of good reasons. If I'm deploying a lot of different kinds of EC2 instances, I've got a whole bunch of settings that are going to be different for each type. Abstracting is a much different problem as a result. And it isn't just this moment in time, it's the evolution of the system over time, too. In code, overabstracting happens sometimes. In cloud architecture it is an all-the-time thing. It is amazingly easy to over-abstract into "hey this is our all-in-one EC2 template" and then whoops, one day I want to change the instance size for only one of my types of nodes, and now I either need to un-abstract that or add yet another parameter to my all-in-one EC2 template.
The inner platform effect is very easy to stumble into in the infrastructure code as a result, where you have your "all-in-one" template for resource X that, in the end, just ends up offering every single setting the original resource did anyhow.
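A made-up Terraform sketch of where that tends to end up: a wrapper module that, request by request, re-exposes the arguments of the resource it was supposed to hide.

    # Hypothetical "all-in-one" EC2 wrapper -- each new requirement grows another
    # pass-through variable until the module mirrors aws_instance itself.
    variable "ami" {
      type = string
    }

    variable "instance_type" {
      type = string
    }

    variable "subnet_id" {
      type = string
    }

    variable "root_volume_gb" {
      type    = number
      default = 20   # added the day one node type needed a bigger disk
    }

    resource "aws_instance" "this" {
      ami           = var.ami
      instance_type = var.instance_type
      subnet_id     = var.subnet_id

      root_block_device {
        volume_size = var.root_volume_gb
      }
    }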
By contrast, I've pondered the "focus on the links rather than the nodes" idea a few times, and there may be something there. However the big problem I see is that I like rolling up to a resource and having one place where either all the configuration is, or where there is a clear path for me to get to that point. Sticking with an instance just to keep things relatable, if I try to define an instance in terms of its relationship to the network, to the disk system, to the queues that it uses and the lambda it talks to and the autoscaling group it is a part of, now its configuration is distributed everywhere.
One possible solution I've often pondered is modifying the underlying configuration management system to keep track of where things come from, e.g., if you have a string that represents the name of the system you're creating, but it is travelling through 5 distinct modules on its way to the final destination, it would be great if there was a way of looking at the final resource and saying "where exactly did that name come from?" and it would tell you the file name and line number, or the set of such things that went into it. Then at least you could query the state of a resource, and rather than just getting a pile of values, you'd be able to see where they are coming from, dig into all the things that went into all the decisions, that might free you to do link-based configuration rather than node-based configuration. But you'd probably need an interactive explorer; if for instance the various links can configure the size of the underlying disk and you take the max() of the various sizes (or the sum or whatever), you'd need to be able to look at everything that went into the max and all the sources of those values; it's more complicated than just tracking atomic values through the system.
I've often wished for this even in just my small little configs I manage compared to some of you, and it is possible that this would be enough of an advantage to stand out in the crowd right now.
(I think the "track where values came from and how they were used in computation" could be retrofitted onto existing systems. "Focus on links rather than nodes" will require something new; perhaps something that could leverage an existing system but would require a new language at a minimum.)
by gdsdfe on 7/21/23, 1:53 PM