from Hacker News

Show HN: PlayBooks – Jupyter Notebooks style on-call investigation documents

by TheBengaluruGuy on 6/4/24, 12:29 PM with 35 comments

Hello everyone, Dipesh and Siddarth here. We are building PlayBooks (https://github.com/DrDroidLab/playbooks), an open source tool to write executable notebooks for on-call investigations / remediations instead of Google Docs or Wikis. There’s a demo video here: https://www.youtube.com/watch?v=_e-wOtIm1gk, and our docs are here: https://docs.drdroid.io/docs/playbooks

We were in YC’s W23 batch working on a data lakehouse with support for dynamic log schemas. Eventually we realized it was a product in search of a market and decided to stop building it. When pivoting, we decided to work on something that we originally prototyped (before even YC) but didn’t execute on.

In our previous jobs, we were at a food delivery startup in India with a busy on-call routine for backend & devops engineers and a small tech team. Often business impacting issues (e.g. orders dropped by >5% in the last 15 minutes) would escalate to Dipesh as he was the lead dev who had been around for a while and he always had 4-5 hypotheses on what might have failed. To avoid becoming the bottleneck, he used to write scripts that fetched custom metrics & order related application logs every 5 minutes during peak traffic. So if an issue was reported, engineers would check the output of those scripts with all the usual suspects first, before diving into a generic exploration. This was the inspiration to get started on PlayBooks.
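A check like "orders dropped by >5% in the last 15 minutes" could be sketched roughly as below. This is only an illustration, not the scripts described above: the function names are hypothetical, and comparing against the immediately preceding 15-minute window is an assumed baseline (the original scripts may have compared against something else, such as the same time on a previous day).

```python
from datetime import datetime, timedelta

def order_drop_pct(order_times, now, window=timedelta(minutes=15)):
    """Percent drop in order count: latest window vs. the window just before it.

    Assumption: the baseline is the immediately preceding window.
    """
    current = sum(1 for t in order_times if now - window <= t < now)
    previous = sum(1 for t in order_times if now - 2 * window <= t < now - window)
    if previous == 0:
        # No baseline to compare against, so report no drop.
        return 0.0
    return 100.0 * (previous - current) / previous

def orders_dropped(order_times, now, threshold_pct=5.0):
    """True when the drop strictly exceeds the threshold (hypothetical helper)."""
    return order_drop_pct(order_times, now) > threshold_pct
```

A scheduler (cron, for example) could run such a check every 5 minutes during peak traffic and attach the relevant logs when it fires.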

We’ve put together a platform that helps any dev create such scripts flexibly and without writing much code. Our goals were: (1) it can be automated to run and send updates; (2) investigation progress can be shared easily with other team members so everyone has the right context; (3) it can all be done without being on-call or having access to a laptop.

Using PlayBooks, a user can configure the steps as data queries or actions within their observability stack. Here are the integrations we currently support:
- Run bash commands on a remote server
- Fetch logs from AWS Cloudwatch and Azure Log Analytics
- Fetch metrics from any PromQL compatible db, AWS Cloudwatch, Datadog and New Relic
- Query PostgreSQL, ClickHouse or any other JDBC compatible databases
- Write a custom API call
- Query events from EKS / GKE
- Add an iFrame
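To make the idea of a playbook as an ordered set of steps concrete, here is a minimal Python sketch. It is not PlayBooks' actual data model or API: `Step`, `StepResult` and `run_playbook` are hypothetical names, and each step is just a callable standing in for a real integration (a log query, a metric fetch, a bash command).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    run: Callable[[], str]  # hypothetical: returns text output to show beside the step

@dataclass
class StepResult:
    name: str
    ok: bool
    output: str

def run_playbook(steps: List[Step]) -> List[StepResult]:
    """Run every step in order, recording output or the error for each one."""
    results = []
    for step in steps:
        try:
            results.append(StepResult(step.name, True, step.run()))
        except Exception as exc:
            # Keep going: a failed step is still useful context for the on-call engineer.
            results.append(StepResult(step.name, False, f"error: {exc}"))
    return results

if __name__ == "__main__":
    # Stand-in steps; real ones would call out to the observability stack.
    demo = [
        Step("Check order service logs", lambda: "no errors in sampled lines"),
        Step("Fetch p99 latency", lambda: "p99 looks normal"),
    ]
    for result in run_playbook(demo):
        print(result.name, "->", result.output)
```

The sketch deliberately continues past a failing step, on the assumption that partial results are better than none during an investigation.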

The platform focuses on not just running the tasks but also displaying information in a meaningful form with relevant graphs / logs / text outputs alongside the steps in a notebook format. Some of our users have shared feedback that on-call decision making overload has reduced with PlayBooks as relevant data from multiple tools is presented upfront in one page.

Here are some of the key features that we believe will further increase the value to users looking to improve developer experience for their on-call engineers:
- Automated surfacing of PlayBooks against alerts & enriching alerts with above-mentioned data
- AI-supported interpretation layer: connect with LLM or ML models to auto-analyze the data in the playbook
- Logs of historical executions to ease the effort of creating post-mortems / timelines and/or share information with peers

If this looks like something that would have been useful for you on-call or will be in your current workspace, we welcome you to try our sandbox: https://sandbox.drdroid.io/. We have added a default playbook. Just click on one of the steps in the playbook and then the “Run” button to see the playbook in action.

We are excited to hear what you like about PlayBooks and what you think could improve the on-call developer experience for your team. Please drop your comments here; we will read them eagerly and respond!

by chasinglogic on 6/5/24, 8:21 AM
Whenever I see tools like this I always think "that would've been great at my old job where we didn't do post mortems"
But nowadays I think: if I can automate a runbook, can I not just make the system heal itself automatically? If you have repeated problems with known solutions, you should invest in toil reduction to stop having those repeated problems.
What am I missing? I think I must be missing something because these kinds of things keep popping up.
by vvoruganti on 6/4/24, 4:46 PM
This is really cool! Love seeing more tools to help SREs and hopefully lessen the burden of on calls.
The notebook style interface for logging and taking notes is appealing too.
Seen a similar approach with https://fiberplane.com/
Haven't been able to play around too much but watching the space
by debarshri on 6/4/24, 8:49 PM
Reminds me of Rundeck and the time we were trying to build something similar. There are more modern takes like fiberplane and moment.dev. Not sure about their adoption.
At one point, we were building something like this on top of kubernetes. I think tech is the easy part here. Getting people to leave their existing workflows and use your product is hard.
Secondly, a difficult part of our journey was integrations. Until you have integrated all the tools an org uses, the product is useless.
Thirdly, it is great that there are building blocks, but users understand use cases. So, expecting end users to build playbooks themselves is tricky. There has to be an intrinsic motivation within the platform.
Fourthly, it is a super competitive space if you see it from an internal tool building perspective. There are a lot of internal tool builders like appsmith, retool, tooljet and django admin that you are competing with, where you could run bash scripts, sql queries etc.
Best of luck with your journey.
by delano on 6/4/24, 9:35 PM
If it works like Jupyter, as a file that can be version controlled, and like Deepnote where multiple people can be viewing/working on it at the same time, my mind would be blown.
by lcfcjs6 on 6/4/24, 8:10 PM
This is awesome, I've seen so many static runbooks (like Confluence), and SREs will scan it once, not find what they need and then go wake up a senior dev. Pre-programmed scripts could go a long way in giving the SRE the ability to go that extra step, which could be vital to solving the problem faster.
by shanemhansen on 6/5/24, 1:51 AM
I saw this used from time to time at Google. There were occasional utility SRE notebooks (colabs). Also the cloud support team seemed to make more use of them.
by bckr on 6/4/24, 7:47 PM
Great to see this launch! I’m looking forward to trying this when our startup is a bit more mature.
by taeric on 6/4/24, 6:08 PM
Reminds me of https://nathanielhoag.com/blog/2022/interactive-runbook/. Fun space to play in. Good luck on this!
by dennisy on 6/4/24, 8:36 PM
This is a great idea! But I feel better served by an existing workflow tool, such as Airflow?
by perpil on 6/4/24, 6:10 PM
I like the integration with Slack and the inline execution of steps. I've been working on a similar product with https://speedrun.cc but it just piggybacks on GitHub markdown and most of the execution is done via a deeplink. Reach out if I can help, I've been messing around in this space for a while.
by pimlottc on 6/5/24, 12:39 PM
Feedback on the sample playbook:
- The “rename step” functionality is not intuitive. I expected tapping on the step name to “unfold” the step and show me the full details, not start the renaming process. After tapping it, I still didn't realize what was happening; I thought perhaps it had executed the step, with the check mark indicating completion or success. It wasn’t clear that it was an input box since it didn’t have focus, and it wasn’t clear that the check mark was a button.
I would have guessed that the pencil icon was the rename action, though tapping it still did not put focus on the input box. There shouldn't be a second step needed to focus the input box.
- It’s not clear what defines the “type” of each step, e.g. whether it’s a log filter, a db query, a shell command, etc. It seems like it’s the “Data” field, although the name doesn’t make much sense. The field does not seem to be editable; I would have expected it to be a dropdown list with other possible step types listed. If it is intended not to be changeable, then it probably shouldn’t be an input element. There’s a “reload”(?) icon next to it, but I have no idea what that does.
by alaintno on 6/5/24, 7:24 AM
It would be so cool to also have access to GCP resources!
Great job nonetheless!
by ystad on 6/4/24, 6:45 PM
Nice. Similar solution https://github.com/1xyz/pryrite
by Shubham_Bhard on 6/5/24, 6:29 AM
Great! I love ChatGPT but have found it has limited utility when I am trying to debug/resolve issues which involve intricate business/domain/customer logic and modelling. This seems to provide me the solution! Thanks folks!