by oschvr on 12/15/22, 10:04 PM with 143 comments
by chrsig on 12/16/22, 12:07 AM
I've grown unfond of this attitude. I most certainly don't own it. I have no IP rights to it at all. We're both being paid to solve different facets of the same problem. Coming at me with "this is your problem" isn't going to foster a collaborative environment with me. Which is much more pleasant than an adversarial environment.
Also: I'm not the only one who knows how it works; it's been peer reviewed, in no small part to reduce my bus factor. All the documentation requested is perfectly reasonable, and should be part of the organization's standard operating procedure.
If it's not part of the SOP, then no, you won't have those things. You need to work at a cultural level to change that, and for that you're much better off making allies than anything else. Make it clear how those things help you, and what you'll do to make the developers' lives easier when you don't need to worry about the basics. If altruism fails you, you can usually count on people to act in their own best interests.
by hayst4ck on 12/16/22, 12:11 AM
These are the questions I find useful:
"How is capacity for the service allocated right now?"
"How is software updated right now?"
"How was the last outage handled in as much detail as possible?"
From there, just about everything answers itself with a couple of days of reading code and poking at machines, particularly from the output of `lsof` (log files, config files, what the service talks to). Half of these questions could be answered with grep, and once you get proficient at grep, you can answer questions faster and, more importantly, more accurately than the people who work on the services themselves.
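A rough sketch of that `lsof` pass, assuming `lsof` is installed and you are allowed to inspect the process. The function names and the suffix heuristics are made up for illustration, not taken from any existing tool; `-Fn` asks lsof for machine-readable output where each name field is a line starting with `n`.

```python
import subprocess
from collections import defaultdict


def parse_lsof_names(field_output):
    """Pull the name fields (lines starting with 'n') out of `lsof -Fn` output."""
    return [line[1:] for line in field_output.splitlines() if line.startswith("n")]


def classify(names):
    """Bucket open-file names with crude, illustrative heuristics."""
    buckets = defaultdict(list)
    for name in names:
        if "->" in name:
            # Established connections show up as local->remote.
            buckets["connections"].append(name)
        elif name.endswith((".log", ".out")):
            buckets["logs"].append(name)
        elif name.endswith((".conf", ".yaml", ".yml", ".json", ".toml", ".ini")):
            buckets["configs"].append(name)
        else:
            buckets["other"].append(name)
    return dict(buckets)


def inspect_process(pid):
    """Run lsof against one PID and classify what it has open."""
    result = subprocess.run(
        # -n and -P keep hosts and ports numeric so the output is greppable.
        ["lsof", "-n", "-P", "-p", str(pid), "-Fn"],
        capture_output=True,
        text=True,
        check=True,
    )
    return classify(parse_lsof_names(result.stdout))
```

Calling `inspect_process(pid)` for the service's PID gives a first guess at where it logs, which config files it holds open, and what it talks to. The buckets are only a starting point for the grep work described above.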
> that YOU wrote, only YOU know how it works, thus YOU own.
I find this attitude pretty toxic. If you are in an SRE vs Product Dev mindset, then you have bigger battles to fight than service manipulation.
by tayo42 on 12/16/22, 12:11 AM
These kinds of responsibilities create this weird scenario where the team's SRE is the team's babysitter. Which just leads to the ops vs dev bullshit we've seen before. Toxic right off the bat.
by mberning on 12/16/22, 1:34 AM
I don’t see how this is even controversial. Consider the case where an SRE is responsible for 5 or 10 such systems. They could never be expected to know as much about those systems as the people who wrote them.
Now if there is a one to one relationship between SREs and systems then it might make sense to expect that level of understanding from the SRE.
In my experience it would be a great privilege to have a dedicated SRE to your application.
by lamontcg on 12/16/22, 5:33 AM
I haven't read the SRE book, but my understanding was that at Google the answer to all this would be that the SRE would act as a software developer and submit pull requests to the codebase in order to implement/fix all of this?
> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.
And my own take on this statement which is getting so much traction in the comments is that this seems largely indistinguishable from the wall between Dev and Ops that we had back in the late 90s.
by eyelidlessness on 12/16/22, 12:57 AM
I actually liked the DevOps-as-in-devs-also-ops as a forcing function to keep deployment relatively simple because it’s very low on the core competency/value proposition spectrums. It also has the benefit of rewarding companies for making that feasible at the expense of a tiny fraction of the cost of dedicated ops roles.
by hnarn on 12/16/22, 10:33 AM
If you work in the same company, you all own the application. The customers don't care that you're "only" the SRE, or "only" the sales guy. This type of attitude is toxic and should be challenged categorically.
If you, the SRE, do not have the information needed (i.e. the "list of questions") then it's as much your responsibility to ask for it as it is the developers' job to help you answer it.
If you feel that the company culture makes it impossible for you to create these necessary processes so that everyone has the information they need, you need to either work towards changing that culture or get a new job.
by mianos on 12/16/22, 1:53 AM
You know why you "rarely get an answer for straight away "? I assume because they are working on the next ticket/delivery. A lot of this stuff is not estimated properly. A way to get it estimated properly is to work with the devs, cooperatively.
This said, for some reason, this blog post seems adversarial and gives me a bad vibe. Instead of "List of questions I’d like to get an answer from devs", it should be "we should work together to get these things done".
by dsr_ on 12/16/22, 12:20 AM
And I am not objecting to it in the least; these are all good and vital questions.
I am objecting to anyone claiming that DevOps is anything other than "using the kinds of tools that help software development projects to help operations", and I present this as absolute evidence.
by mediascreen on 12/16/22, 9:17 AM
by kubectl_h on 12/16/22, 12:37 AM
by t-writescode on 12/16/22, 4:04 AM
As an SWE, I want to and need to know how to provide metrics on my system to be able to understand its health, and I should have good safeguards in place, or at least have communicated with the SREs what I need to provide to them to help them have good safeguards in place, to make sure the application keeps running. If the application goes down, it's my responsibility to make sure it's not my fault (bug in application code) that caused the system to fail.
What I, an SWE, want out of an SRE, though, is infrastructure management. I want to be able to ask them for some queues, and for a redis instance with high availability. I want them to set up the Kafka cluster, the database. I want us to have a conversation about where the secrets are to be stored. I want to be able to ask them what I need to do in code to get a secret and use it. I want them to be able to give me a good template for k8s deployments - or maybe to pair with them, given the docker containers and sidecars I need for a deployment and the projected scaling I'll need and come out with a best-practices set of k8s deployments.
I would be grateful if they monitor the database for some horrible queries; and, use their knowledge of which deployments made that bad query, to file a ticket to the right team so they fix their code or add an index or whatever is necessary.
Infrastructure, be it k8s or nomad, configuring redis, making rabbitmq highly available, configuring and organizing (especially organizing) k8s deployments into something sane and logical, and so many other things related to infrastructure are as specialized of skills as writing high-performance or unusually architected, large systems. I've seen the systems that come up when SWE-on-assignment create infrastructure; and, I've seen the literal years of work SREs have in their backlog to fix it with best practices.
It's similar to front-end developers: it's an entirely different skill set; and, while each person in each tier can stumble around in the other tiers, it's way better if we are all there, working together toward a common goal, and especially focusing on the areas where we have each specialized our craft.
addendum: of course there are exceptions; but I think those exceptions are 1 in 100 or 1 in 1000.
by deathanatos on 12/16/22, 5:46 AM
Like this is the single biggest truth in the article, and I'm glad to see it stated so clearly. Shout it from the rooftops, please. It's a direct logical consequence, too — and yet, so many people seem to make decisions that violate this truth.
I field so many questions about "why is service X doing Y?" Have you asked the service owners?
Unfortunately, I've found one more or less has to become proficient in rapidly understanding services you don't own, because getting other people to act logically is a fool's errand.
> Are you logging to stdout ?
Nooooo, to stderr, that's literally what it is there for. (As C says, "for writing diagnostic output". Logs are that.) Also, stdout is sometimes buffered and you don't (IMO) really want that.
Any output producing program requires stdout for the output, and you can't co-mingle logs with that and have piping still work. While it is unlikely that your production service is producing output, there's no reason to do anything different with the logs. (I'd say a part of being a good production service is "don't be needlessly special".)
(But our tooling will just capture and mux the two streams together, too, so it doesn't matter, unless buffering means the error logs don't make it right before your service is killed.)
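A minimal sketch of that split in Python, assuming a service that has some real output to emit. The logger name `service` and both function names are placeholders invented for the example. `logging.StreamHandler()` writes to `sys.stderr` when no stream is given, so stdout stays free for actual output.

```python
import logging


def build_logger():
    """Return a logger whose records go to stderr only."""
    logger = logging.getLogger("service")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.StreamHandler()  # no stream given: defaults to sys.stderr
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from root handlers
    return logger


def run_once():
    """Emit one diagnostic line (stderr) and one line of output (stdout)."""
    log = build_logger()
    log.info("starting")
    print("result=42")  # program output, safe to pipe
```

With this layout, piping the program into another tool carries only `result=42`; the diagnostic line stays on stderr unless someone redirects it on purpose.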
Also, your infra team provides the metrics service, but you need to capture your own metrics. My metrics provider does not have a crystal ball, it cannot peer into your service's memory and pull out critical stats. You must push them yourself. Talk to your infra team, they can show you the API they use… (We collect common, machine level stats, like "CPU in use" or external things about your service that are easily visible, like per-container memory usage. But not your reqs/sec.)
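To make "push them yourself" concrete, here is a hedged sketch of a requests-per-second meter. Everything in it is hypothetical: the class name, the metric name `requests_per_second`, and the `push` callback, which stands in for whatever client API your infra team actually hands you.

```python
import threading
import time


class RequestRateMeter:
    """Counts requests in-process and pushes a rate on each flush."""

    def __init__(self, push, clock=time.monotonic):
        self._push = push    # stand-in for the real metrics client call
        self._clock = clock  # injectable so the meter can be tested
        self._lock = threading.Lock()
        self._count = 0
        self._window_start = clock()

    def record_request(self):
        """Call once per handled request."""
        with self._lock:
            self._count += 1

    def flush(self):
        """Push the rate for the window since the last flush and reset it."""
        with self._lock:
            now = self._clock()
            elapsed = now - self._window_start
            count = self._count
            self._count = 0
            self._window_start = now
        rate = count / elapsed if elapsed > 0 else 0.0
        self._push("requests_per_second", rate)
        return rate
```

The service would call `record_request()` in its request path and `flush()` on a timer. Only the service can do this, because only it knows when a request happened.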
by rad_gruchalski on 12/15/22, 11:48 PM
Questions in this form always seem condescending. Like “I‘m smarter than you, I thought about it, you didn’t”.
by mattpallissard on 12/16/22, 4:36 PM
* SRE/DevOps folks stating the person that wrote the application has the knowledge to debug it.
* Devs saying that it's SRE/DevOps job to debug it
* Lots of comments on culture and you should do X
I know most people like the whole grassroots thing, but the only shops I've seen that are actually killing it are the ones who dictate these boundaries and responsibilities from the top down. And I've seen a lot of shops.
by jamesrom on 12/16/22, 7:26 AM
Almost all of the questions can be simply answered with: "This is a NFR that was created by SRE".
The important thing is to collaborate with each team and be there when architectural and design decisions are being made in the first place!
All of these questions are post-hoc, coming after the thing has been built. You would never need to ask these questions, if you help drive initial design.
Embed yourself with your teams. Ask to be part of design discussions. Remember: 50% eng 50% ops. You have no excuse!
by RcouF1uZ4gsC on 12/16/22, 2:35 AM
All services should have common health endpoints and shutdown operations.
Logging should be standardized across all the services of a company.
Having bespoke answers to these questions for each service will rapidly devolve into chaos, when you have multiple services deployed.
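One possible shape for such a common health endpoint and shutdown path, sketched with only the Python standard library. The `/healthz` path, the response body, and the port are assumptions for the example rather than any standard; a real company would pick its own convention and apply it to every service.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":  # hypothetical company-wide path
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging in this sketch


def make_server(host="127.0.0.1", port=8080):
    """Build the server without starting it."""
    return ThreadingHTTPServer((host, port), HealthHandler)


def serve(host="127.0.0.1", port=8080):
    """Serve until SIGTERM arrives, then stop cleanly. Run in the main thread."""
    server = make_server(host, port)

    def handle_sigterm(signum, frame):
        # shutdown() blocks until serve_forever() exits, so it must run
        # in a different thread from the one that is serving.
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, handle_sigterm)
    try:
        server.serve_forever()
    finally:
        server.server_close()
```

If every service answers the same path and reacts to the same signal, the orchestration and monitoring config can be written once instead of per service.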
by blacklion on 12/16/22, 8:57 AM
I thought that DevOps, by definition, is developer and operations in one. You wrote the service, you support the service, there is no boundary, and so the problem described in this text does not exist, by definition.
DevOps complains about a problem whose proposed solution is to be DevOps...
by Joel_Mckay on 12/16/22, 4:58 AM
This is unfortunately the death knell for DevOps organizational teams on large projects. Primarily, the design specification usually ends up being hammered into the inherent dysfunction the project was intended to solve in the first place.
Best of luck =)
by mkl95 on 12/16/22, 7:48 AM
by travisgriggs on 12/16/22, 1:30 AM
The first sin they embark on is framing their argument, in part, as one of titles/labels. This is usually an institutional smell. And it’s not a pretty odor.
The second is that the person believes their role is to question others. It’s a move that insecure people play. The idea is that you keep your opponents defending themselves against questions you define, and that means there’s no time to address some of the hard questions that might circle your own “role.”
It sounds like the guy feels he knows the answers. If so, why doesn’t he jump in and do them? If he knows better how to do this SRE thing as defined by him, clearly his company has pulled a Peter principle, promoting him from something he did well, to a position where he now harps on others using their nostalgia. Value may have been lost. If he’s really that good, we can use him in the trenches. If not, he’ll learn how to try to explain why some of these PHB questions are actually hard to answer and execute.
by donutshop on 12/16/22, 12:58 AM
Truthfully, oftentimes I don't understand how things behave in a production environment.
by scarface74 on 12/16/22, 1:48 AM
That was supposed to be the definition of “DevOps” in the first place. Any company that has a DevOps role is really going to have an operations role by another name.
by tflinton on 12/16/22, 2:51 AM
If only I had a dollar for every time some program dereferences a null.
by opportune on 12/16/22, 1:44 AM
The way this is phrased, it sounds like the author is managing reliability for things where they don’t already know the answers to these questions, nor do they have the context or bandwidth (or even access?) to answer them themselves. Seems like a recipe for disaster, or at the very least, a lot of frantic learn-as-you-go.
That said, as a dev, I do think we could do a lot better at adding playbooks. Though on the other side of the fence, they’re often ignored with an “I don’t know what’s going on and you wrote this, can you help?”
by poulsbohemian on 12/16/22, 11:21 PM
by simonjgreen on 12/16/22, 7:55 AM
by jdbernard on 12/16/22, 7:41 AM
by fasteo on 12/16/22, 10:58 AM
SE owns the code, but SRE owns the running code
Other than that, I agree with everything in the post
by wildcow on 12/16/22, 5:25 AM
by hkon on 12/16/22, 7:49 AM
by wilde on 12/16/22, 3:21 AM
by doublerabbit on 12/15/22, 11:29 PM
- What specs of a VM do you require?
I'll assume that 16 MB of RAM and 512 MB of drive space running Slackware, booting from a 1.44 MB floppy, is suitable.
- What do I do if it doesn't compile?
It works in DevLand, so I assume it'll work anywhere. No, you can't growl at me, you asked for Linux and I gave you Linux. Documentation please.