by gorkemcetin on 8/13/24, 10:13 PM with 186 comments
As we move towards expanding from basic uptime tracking to a comprehensive monitoring solution, we're interested in getting insights from the community.
For those of you managing server infrastructure,
- What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
- Do you also keep tabs on network performance, processes, services, or other metrics?
Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.
- What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
- Lastly, what’s your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
[1] https://github.com/bluewave-labs/bluewave-uptime
by kevg123 on 8/17/24, 10:04 PM
* Network is another basic that should be there
* Average disk service time
* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page out rates
* TCP retransmits as a warning sign of network/hardware issues
* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.) broken down by incoming and outgoing
* Per-CPU utilization
* Rates of operating system warnings and errors in the kernel log
* Application average/max response time
* Application throughput (both total and broken down by the error rate, e.g. HTTP response code >= 400)
* Application thread pool utilization
* Rates of application warnings and errors in the application log
* Application up/down with heartbeat
* Per-application & per-thread CPU utilization
* Periodic on-CPU sampling for a bit of time and then flame graph that
* DNS lookup response times/errors
> Do you also keep tabs on network performance, processes, services, or other metrics?
Per-process and over time, yes, which are useful for post-mortem analysis
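For illustration, two of the cheaper items on that list (TCP retransmits and per-CPU utilization) can be sampled straight from /proc on Linux. A minimal sketch, assuming the standard proc(5) field layout:

```python
import time

def tcp_retrans_segs():
    # /proc/net/snmp has a header line and a value line per protocol.
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("RetransSegs")])

def per_cpu_busy(interval=1.0):
    def snapshot():
        stats = {}
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("cpu") and line[3].isdigit():
                    fields = line.split()
                    stats[fields[0]] = list(map(int, fields[1:]))
        return stats

    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    busy = {}
    for cpu, a in before.items():
        b = after[cpu]
        total = sum(b) - sum(a)
        idle = (b[3] + b[4]) - (a[3] + a[4])  # idle + iowait columns
        busy[cpu] = 100.0 * (total - idle) / total if total else 0.0
    return busy

if __name__ == "__main__":
    r0 = tcp_retrans_segs()
    print("per-CPU busy %:", {c: round(p, 1) for c, p in per_cpu_busy().items()})
    print("TCP retransmits in that window:", tcp_retrans_segs() - r0)
```

A real agent would export these as counters/gauges rather than printing them.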
by aflukasz on 8/17/24, 6:59 PM
- systemd unit failures - I install a global OnFailure hook that applies to all units, to trigger an alert via a mechanism of choice for a given system,
- restarts of key services - you typically don't want to miss those, but if they are silent, then you quite likely will,
- netfilter reconfigurations - nftables cli has useful `monitor` subcommand for this,
- unexpected ingress or egress connection attempts,
- connections from unknown/unexpected networks (if you can't just outright block them for some reason).
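The OnFailure hook above is event-driven; as a rough, hypothetical fallback, the same signal can also be polled by asking systemd for failed units and handing them to whatever alerting mechanism the host uses. A minimal sketch:

```python
import subprocess

def failed_units():
    # One failed unit per line; --plain/--no-legend keep the output parseable.
    out = subprocess.run(
        ["systemctl", "list-units", "--state=failed",
         "--plain", "--no-legend", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[0] for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    for unit in failed_units():
        print(f"ALERT: {unit} is in failed state")  # swap in mail/webhook here
```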
by uaas on 8/15/24, 7:22 AM
by dfox on 8/17/24, 9:42 PM
by cmg on 8/17/24, 6:40 PM
- apt status (for security/critical updates that haven't been run yet)
- reboot needed (presence of /var/run/reboot-required)
- fail2ban jail status (how many are in each of our defined jails)
- CPU usage
- MySQL active, long-running processes, number of queries
- iostat numbers
- disk space
- SSL cert expiration date
- domain expiration date
- reachability (ping, domain resolution, specific string in an HTTP request)
- Application-specific checks (WordPress, Drupal, CRM, etc)
- postfix queue size
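Two of those checks (the reboot-required marker and SSL cert expiry) are cheap to script. A minimal sketch, with the hostname and threshold purely illustrative:

```python
import datetime
import os
import socket
import ssl

def reboot_required():
    return os.path.exists("/var/run/reboot-required")

def cert_days_left(host, port=443):
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.datetime.utcfromtimestamp(ssl.cert_time_to_seconds(not_after))
    return (expires - datetime.datetime.utcnow()).days

if __name__ == "__main__":
    if reboot_required():
        print("WARN: /var/run/reboot-required is present")
    days = cert_days_left("example.com")  # illustrative hostname
    if days < 14:                         # illustrative threshold
        print(f"WARN: certificate expires in {days} days")
```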
by mmarian on 8/14/24, 6:16 AM
by zie on 8/18/24, 1:27 AM
Seriously though, the server itself is not the part that matters; what matters is the application(s) running on the server. So it depends heavily on what the application(s) care about.
If I'm doing some CPU heavy calculations on one server and streaming HTTPS off a different server, I'm going to care about different things. Sure there are some common denominators, but for streaming static content I barely care about CPU stuff, but I care a lot about IO stuff.
I'm mostly agnostic to push vs pull, they both have their weaknesses. Ideally I would get to decide given my particular use case.
The lazy metrics, like you mentioned, are not that useful. As another commenter mentioned, "free" RAM is mostly a pointless number, since these days most OSes wisely use it for caching. But information on OS-level caching can be very useful, depending on the workloads I'm running on the system.
As for agents, what I care about is how stable, reliable and resource intensive they are. I want an agent that takes zero resources and is rock solid and reliable. Many agents fail spectacularly at all three of those things. Crowdstrike is the most recent example of failure here with agent-based monitoring.
The point of monitoring systems to me are two-fold:
* Trying to spot problems before they become problems (i.e. we have X days before disk is full given current usage patterns).
* Trying to track down a problem as it is happening (i.e. App Y is slow in X scenario all of a sudden, why?).
Focus on the point of monitoring and keep your agent as simple, solid and idiot-proof as possible. Crowdstrike's recent failure mode was completely preventable had the agent been written differently. Architect your agent as much as possible to never be another Crowdstrike. Yes, I know Crowdstrike was user machines, not servers, but server agent failures happen all the time too, in roughly the same ways; they just don't make the news quite as often.
by mgbmtl on 8/17/24, 6:39 PM
I find it easier to write custom checks for things where I don't control the application. My custom checks often do API calls for the applications they monitor (using curl locally against their own API).
There are also lots of existing scripts I can re-use, either from the Icinga or from Nagios community, so that I don't write my own.
For example, recently I added systemd monitoring. There is a package for the check (monitoring-plugins-systemd). So I used Ansible to install it everywhere, and then "apply" a conf to all my Debian servers. This helped me find a bunch of failing services or timers which previously went unnoticed, including things like backups, where my backup monitoring said everything was OK, but the systemd service for borgmatic was running a "check" and found some corruption.
For logs I use promtail/loki. Also very much worth the investment. Useful to detect elevated error rates, and also for finding slow http queries (again, I don't fully control the code of applications I manage).
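A hypothetical sketch of the curl-your-own-API style of check described above, following the Nagios/Icinga plugin convention of exit codes 0/1/2 for OK/WARNING/CRITICAL (the endpoint and JSON field are made up):

```python
import json
import sys
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2
URL = "http://localhost:8080/api/health"   # assumed application endpoint

def main():
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            body = json.load(resp)
    except Exception as exc:
        print(f"CRITICAL - health endpoint unreachable: {exc}")
        return CRITICAL
    queued = body.get("queued_jobs", 0)    # assumed field in the response
    if queued > 1000:
        print(f"WARNING - {queued} jobs queued")
        return WARNING
    print(f"OK - {queued} jobs queued")
    return OK

if __name__ == "__main__":
    sys.exit(main())
```

Icinga and Nagios only care about the exit code and the first line of output, which is what makes one-off checks like this so easy to drop in.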
by LeoPanthera on 8/17/24, 11:22 PM
I don't do this professionally. I have a small homelab that is mostly one router running opnsense, one fileserver running TrueNAS, and one container host running Proxmox.
Proxmox does have about 10-15 containers though, almost all Debian, and I feel like I should be doing more to keep an eye on both them and the physical servers themselves. Any suggestions?
by mekster on 8/18/24, 5:04 AM
For example, providing a CPU metric alone is only good for alerting. If it exceeds a threshold, make sure it gives insights into which process/container was using how much CPU at that moment. Bonus points if you can link logs from that process/container at that time.
For disks, tell which directory is large, and what kinds of files are using the most space.
Pretty graphs that don't tell you what to look for next are nothing.
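A rough sketch of attaching that context to a CPU alert, using the third-party psutil package to snapshot the top consumers over a short window:

```python
import time
import psutil  # third-party: pip install psutil

def top_cpu_processes(n=5, interval=1.0):
    procs = list(psutil.process_iter(["pid", "name"]))
    for p in procs:
        try:
            p.cpu_percent(None)       # prime the per-process counters
        except psutil.Error:
            pass
    time.sleep(interval)              # measure over a short window
    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(None), p.info["pid"], p.info["name"]))
        except psutil.Error:
            continue
    return sorted(usage, reverse=True)[:n]

if __name__ == "__main__":
    # Attach something like this to the alert payload instead of printing it.
    for pct, pid, name in top_cpu_processes():
        print(f"{pct:5.1f}%  pid={pid}  {name}")
```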
by 1oooqooq on 8/14/24, 1:47 AM
Look for guides written before 2010. Seriously, it's this bad. Then, after you have everything in one syslog somewhere, dump it to a fancy dashboard like o2.
by aleda145 on 8/17/24, 6:52 PM
Then I visualize it with Grafana. It's actually live here if you want to check it out: https://grafana.dahl.dev
by valyala on 8/18/24, 7:23 AM
As for the agent, it is better from an operations perspective to run a single observability agent per host. This agent should be small in size and lightweight in CPU and RAM usage, should have no external dependencies, and should have close to zero configs that need to be tuned, e.g. it should automatically discover all the apps and metrics that need to be monitored and send them to the centralized observability database.
If you don't want to write the agent yourself, then take a look at vmagent ( https://docs.victoriametrics.com/vmagent/ ), which scrapes metrics from the exporters mentioned above. vmagent satisfies most of the requirements stated above except for configuration - you need to provide configs for scraping metrics from separately installed exporters.
by oriettaxx on 8/18/24, 5:07 AM
In general I would also suggest monitoring server costs (AWS EC2 costs, e.g.)
For example, you should be aware that AWS EC2 T3 instances can simply cost double once their CPU actually gets used, since the "unlimited" credit flag is ON by default. I personally hate the whole AWS "CPU credit" model... it is an instrument entirely in their (AWS) hands to just make more money...
by holowoodman on 8/18/24, 3:33 PM
* In our setup, container status is included in this thanks to quadlets. However, if using e.g. docker, separate container monitoring is necessary, but complex.
* apt/yum/fwupd/... pending updates
* mailqueue length, root's mailbox size: this is an indicator for stuff going wrong silently
* pending reboot after kernel update
* certain kinds of log entries (block device read error, OOMkills, core dumps).
* network checksum errors, dropped packets, martians
* presence or non-presence of USB devices: desktops should have a keyboard and mouse, servers usually shouldn't, and USB storage is sometimes forbidden.
by arcbyte on 8/17/24, 10:12 PM
For some of my services on DigitalOcean for instance, I monitor RAM because using a smaller instance can dramatically save money.
But for the most part I don't monitor anything - if it doesn't make me money why do I care?
by waynenilsen on 8/17/24, 5:40 PM
by usernamed7 on 8/17/24, 5:31 PM
by tiffanyh on 8/18/24, 4:24 AM
- nagios
- Victoria metrics
- monit
- datadog
- prometheus grafana
- etc …
Q2: Also, is there something akin to “SQLite” for monitoring servers? Meaning, a simple / tested / reliable tool to use.
Q3: if you ran a small saas business, which simple tool would you use to monitor your servers & services health?
by Izkata on 8/18/24, 7:19 AM
As a developer who has often had to look into problems and performance issues, instead of an infrastructure person, this is basically the bare minimum of what I want to see:
* CPU usage
* RAM breakdown by at least Used/Disk cache/Free
* Disk fullness (preferably in absolute numbers, percents get screwy when total size changes)
* Disk reads/writes
* Network reads/writes
And this is high on the list but not required:
* Number of open TCP connections, possibly broken down by state
* Used/free inodes (for relevant filesystems); we have actually used them up before (thanks npm)
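A sketch of a bare-minimum snapshot roughly matching that list, using the third-party psutil package plus os.statvfs for inodes (the disk and network counters are cumulative, so per-second rates come from diffing two samples):

```python
import os
import psutil  # third-party

def snapshot(path="/"):
    mem = psutil.virtual_memory()
    disk = psutil.disk_usage(path)
    dio = psutil.disk_io_counters()
    nio = psutil.net_io_counters()
    st = os.statvfs(path)
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "ram_used_bytes": mem.used,
        "ram_cached_bytes": getattr(mem, "cached", 0),  # Linux-only field
        "ram_free_bytes": mem.free,
        "disk_used_bytes": disk.used,        # absolute numbers, not percent
        "disk_free_bytes": disk.free,
        "disk_read_bytes": dio.read_bytes,   # cumulative since boot
        "disk_write_bytes": dio.write_bytes,
        "net_recv_bytes": nio.bytes_recv,
        "net_sent_bytes": nio.bytes_sent,
        "inodes_free": st.f_favail,
        "inodes_total": st.f_files,
    }

if __name__ == "__main__":
    for key, value in snapshot().items():
        print(f"{key}: {value}")
```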
by jiggawatts on 8/17/24, 10:48 PM
CPU and Memory are the easiest and most obvious to collect but the most irrelevant.
If nobody’s looked at any metrics before on the server fleet, then basic metrics have some utility: you can find the under- or over- provisioned servers and fix those issues… once. And then that well will very quickly run dry. Unfortunately, everyone will have seen this method “be a success” and will then insist on setting up dashboards or whatever. This might find one issue annually, if that, at great expense.
In practice, modern distributed tracing or application performance monitoring (APM) tools are vastly more useful for day-to-day troubleshooting. These things can find infrequent crashes, expired credentials, correlate issues with software versions or users, and on and on.
I use Azure Application Insights in Azure because of the native integration but New Relic and DataDog are also fine options.
Some system admins might respond to suggestions like this with: “Other people manage the apps!” not realising that therein lies their failure. Apps and their infrastructure should be designed and operated as a unified system. Auto scale on metrics relevant to the app, monitor health relevant to the app, collect logs relevant to the app, etc…
Otherwise when a customer calls about their failed purchase order the only thing you can respond with is: “From where I sit everything is fine! The CPUs are nice and cool.”
by xorcist on 8/17/24, 11:12 PM
The Nagios ecosystem was fragmented for the longest time, but now it seems most users have drifted towards Icinga, so this is what I use for monitoring. There is some basic integration with Grafana for metrics, so that is what I use for metrics panels. There is good reason not to spend your innovation budget on monitoring; instead use simple software that will continue to be around for a long time.
As for what to monitor, that is application specific and should go into the application manifest or configuration management. But generally there should be some sort of active operation that touches the common data path, such as a login, creation of a dummy object (for example an empty order), validation of said object, and destruction/clean up.
Outside the application there should be checks for whatever the application relies on. Working DNS, NTP drift, Ansible health, certificate validity, applicable APT/RPM packages, database vacuums, log transport health, and the exit status or last file date of scheduled or backgrounded jobs.
Metrics should be collected for total connections, their return status, all types of I/O latency and throughput, and system resources such as CPU, memory, disk space.
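A hypothetical sketch of such an active data-path check (log in, create an empty dummy order, read it back, clean it up); every URL, credential, and field here is illustrative and would come from the application manifest:

```python
import sys
import requests  # third-party

BASE = "https://app.example.com/api"   # illustrative

def check():
    s = requests.Session()
    s.post(f"{BASE}/login",
           json={"user": "monitor", "password": "redacted"},
           timeout=10).raise_for_status()
    r = s.post(f"{BASE}/orders", json={"items": []}, timeout=10)  # empty order
    r.raise_for_status()
    order_id = r.json()["id"]
    try:
        s.get(f"{BASE}/orders/{order_id}", timeout=10).raise_for_status()
    finally:
        s.delete(f"{BASE}/orders/{order_id}", timeout=10)         # clean up

if __name__ == "__main__":
    try:
        check()
        print("OK - data path healthy")
        sys.exit(0)
    except Exception as exc:
        print(f"CRITICAL - {exc}")
        sys.exit(2)
```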
by 29athrowaway on 8/18/24, 3:15 AM
by kkfx on 8/17/24, 6:24 PM
- automated alerts on unusual loads, meaning I do not care about CPU/RAM/disk usage except for specific spikes, so the monitor just sends alerts (mails) in case of significant/protracted spikes, tuned after a bit of experience. No need to collect such data over significant periods: you have sized your infra for the expected load, you deploy and see if you have done so correctly, and if so you just need to learn the usual patterns and filter them out, keeping alerts only for anomalies;
- log alerts for errors, warnings, access logs etc., same principle: you deploy and collect a bit, then you have "the normal logs" and you create alerts for unusual things. Retention depends on the log types and services you run, and some retention could be constrained by laws;
Performance metrics are a totally different thing that should be decided more by the devs than by operations, and much of their design depends on the kind of development and services you have. It's much more complex because the monitoring itself touches the performance of the system MUCH more than generic alerting or a casual ping and the like to check service availability. Push and pull are mixed: for alerts push is the obvious go-to, for availability pull is much more sound, etc. There is no "one choice".
Personally I tend to go slowly on fine-grained monitoring to start. It's important of course, but it should not become an analysis-paralysis trap, nor waste too many human and IT resources on collecting potential garbage in not-so-marginal batches...
by elashri on 8/18/24, 9:58 AM
by tgtweak on 8/18/24, 12:47 AM
Disk I/O and network I/O are particularly important, but most of the information you truly care about lies in application traces and application logs. Database metrics are a close second, particularly cache/index usage, disk activity, and query profiling. Network too, if your application is bandwidth heavy.
by Jedd on 8/18/24, 9:42 AM
I think it'd be very hard at this point to come up with compelling alternatives to the incumbents in this space.
I'd certainly not want a non-free, non-battle-tested, potentially incompatible & lock-in agent that wouldn't align with the agents I currently utilise (all free in the good sense).
Push vs pull is an age-old conundrum - at dayjob we're pull - Prometheus scraping Telegraf - for OS metrics.
Though traces, front-end, RUM, SaaS metrics, logs, etc, are obviously more complex.
Whether to pull or push often comes down to how static your fleet is, but mostly to whether you've got a decent CMDB that you can rely on to tell you the state of all your endpoints - registering and decommissioning endpoints, as well as coping with scheduled outages.
by dangus on 8/18/24, 12:44 PM
If you’re building a product from scratch you must have some kind of vision based on deficiencies in existing solutions that are motivating you to build a new product, right?
by nrr on 8/18/24, 3:11 AM
A lot of the details are pretty application-specific, but the metrics I care about can be broadly classified as "pressure" metrics: CPU pressure, memory pressure, I/O pressure, network pressure, etc.
Something that's "overpressure" can manifest as, e.g., excessively paging in and out, a lot of processes/threads stuck in "defunct" state, DNS resolutions failing, and so on.
I don't have much of an opinion about push versus pull metrics collection as long as it doesn't melt my switches. They both have their place. (That said, programmable aggregation on the metrics exporter is something that's nice to have.)
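One concrete way to get at those pressure numbers on Linux is the kernel's pressure stall information (PSI, available on kernels 4.20+ with PSI enabled). A minimal sketch of reading it:

```python
def read_psi(resource):
    # Lines look like: "some avg10=0.12 avg60=0.08 avg300=0.02 total=123456"
    result = {}
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            kind, *fields = line.split()
            result[kind] = {key: float(val) for key, val in
                            (field.split("=") for field in fields)}
    return result

if __name__ == "__main__":
    for resource in ("cpu", "memory", "io"):
        psi = read_psi(resource)
        # avg10: share of the last 10s during which at least one task stalled
        print(f"{resource}: some avg10 = {psi['some']['avg10']}%")
```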
by klinquist on 8/17/24, 6:51 PM
by koliber on 8/17/24, 7:23 PM
- http req counts vs total non-200-response count vs. 404-and-30x count.
- whatever asynchronous jobs you run, a graph of jobs started vs jobs finished will show you rough resource utilization and highlight gross bottlenecks.
by whalesalad on 8/17/24, 7:45 PM
by madaxe_again on 8/17/24, 6:30 PM
As to what I monitor - normally, as little as humanly possible, and when needed, everything possible.
by ralferoo on 8/17/24, 6:13 PM
It's free if you don't have too many servers - 15 uptime monitors (the most useful) and 32 blacklist monitors (useful for e-mail, but don't know why you'd need so many compared to uptime).
It's fairly easy to reach the free limits with not many servers if you're also monitoring VMs, but I've found it reliable so far. It's nice you can have ping tests from different locations, and it collects pretty much any metrics that are useful such as CPU, RAM, network, disk. The HTTP and SMTP tests are good too.
by Mojah on 8/18/24, 9:14 AM
For web applications for instance, we care about uptime & performance, tls certificates, dns changes, crawled broken links/mixed content & seo/lighthouse metrics.
by mikewarot on 8/15/24, 7:55 AM
by azthecx on 8/18/24, 11:52 AM
> What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
No, and I have specifically tried to push back against monitoring offerings like Datadog and Dynatrace, especially in the case of the second, because the OneAgent and Dynakube CRDs do things like downloading tarballs from Dynatrace and listening to absolutely everything they can, from processes to network.
by sroussey on 8/17/24, 9:15 PM
by mnahkies on 8/18/24, 9:52 AM
Ideally I'd see these events overlaid with the time series to make it obvious that a restart was caused by OOM as opposed to other forms of crash.
by natmaka on 8/18/24, 12:46 PM
by bearjaws on 8/18/24, 3:08 AM
Did it run? Is it still running? Did it have any errors? Why did it fail? Which step did it fail on?
My last job built a job tracker for "cron" tasks that supported an actual crontab + could schedule hitting an https endpoint.
Of course it requires code modification to ensure it writes something so you can tell it ran in the first place. But that was part of modernizing a 10 year old LAMP stack.
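A sketch of that idea as a wrapper: run the job, and ping a heartbeat URL on start, success, and failure. The endpoint is illustrative (hosted services like healthchecks.io use the same /start and /fail convention):

```python
import subprocess
import sys
import urllib.request

HEARTBEAT = "https://hc.example.com/ping/nightly-backup"   # illustrative URL

def ping(suffix=""):
    try:
        urllib.request.urlopen(HEARTBEAT + suffix, timeout=10)
    except Exception:
        pass  # never let the monitoring break the job itself

def main(cmd):
    ping("/start")
    result = subprocess.run(cmd)
    ping("" if result.returncode == 0 else "/fail")
    return result.returncode

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

The crontab entry then wraps the real command, something like `0 2 * * * heartbeat.py /usr/local/bin/backup.sh` (paths illustrative).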
by malkosta on 8/18/24, 12:11 AM
I also keep a close eye on the throughput vs response time ratio, especially the 95th percentile of the response time.
It’s also great to have this same ratio measurement for the DBs you might use.
Those are my go-to daily metrics; the rest can be zoomed into in their own dashboards after I first check these.
by blueflow on 8/18/24, 1:19 PM
If I were to set up metrics, the first thing I would go for is the pressure stall information.
by justinclift on 8/18/24, 5:09 AM
There's a bunch of ways of measuring "usage" for disks, apart from the "how much space is used". There's "how many iops" (vs the total available for the disk), there's how much wear % is used/left for the disk (specific to flash), how much read/write bandwidth is being used (vs the maximum for the disk), and so on.
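A sketch of sampling two of those dimensions (per-disk IOPS and read/write bandwidth) with the third-party psutil package; wear level would instead come from SMART data (e.g. smartctl), which isn't shown here:

```python
import time
import psutil  # third-party

def disk_rates(interval=1.0):
    before = psutil.disk_io_counters(perdisk=True)
    time.sleep(interval)
    after = psutil.disk_io_counters(perdisk=True)
    rates = {}
    for dev, b in after.items():
        a = before.get(dev)
        if a is None:
            continue
        rates[dev] = {
            "iops": ((b.read_count - a.read_count) +
                     (b.write_count - a.write_count)) / interval,
            "read_bytes_per_s": (b.read_bytes - a.read_bytes) / interval,
            "write_bytes_per_s": (b.write_bytes - a.write_bytes) / interval,
        }
    return rates

if __name__ == "__main__":
    for dev, rate in disk_rates().items():
        print(dev, rate)
```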
by dig1 on 8/18/24, 2:55 PM
> Do you also keep tabs on network performance, processes, services, or other metrics?
Everything :)
> What's your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
I went with collectd [1] and Telegraf [2] simply because they support tons of modules and are very stable. However, I have a couple of bespoke agents for cases where neither collectd nor Telegraf will fit.
> Lastly, what's your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
We can argue to death, but I'm for push-based agents all the way down. It is much easier to scale, and things are painless to manage when the right tool is used (I'm using Riemann [3] for shaping, routing, and alerting). I used to run a Zabbix setup, and scaling was always the issue (Zabbix is pull-based). I'm still baffled as to how pull-based monitoring gained traction, probably because modern gens need to repeat mistakes from the past.
[2] https://www.influxdata.com/time-series-platform/telegraf/
by giuliomagnifico on 8/20/24, 9:23 AM
by metadat on 8/17/24, 11:34 PM
2. Hard drive health monitoring with Scrutiny.
https://github.com/AnalogJ/scrutiny
Everything else doesn't matter to me for home use.
Good luck with your endeavor!
by udev4096 on 8/18/24, 5:49 AM
As for the servers, I use uptime kuma (notifies whenever a service goes down), glance (htop in web), vnstat (for network traffic usage) and loki (for logs monitoring)
by veryrealsid on 8/18/24, 9:32 AM
by jimnotgym on 8/18/24, 3:33 PM
by sebazzz on 8/18/24, 8:08 AM
by 1oooqooq on 8/14/24, 1:48 AM
by damonll on 8/18/24, 4:13 AM
Be warned though there are a ton of monitoring solutions already. Hopefully yours has something special to bring to the table.
by imperialdrive on 8/18/24, 12:19 AM
by londons_explore on 8/18/24, 2:50 AM
Can you simply include some existing open source tooling into your uptime monitor, and then contribute to those open source projects any desired new features?
by sunshine-o on 8/17/24, 10:57 PM
by maxboone on 8/18/24, 4:15 AM
by jcrites on 8/17/24, 7:01 PM
For example, say a synchronous server has 100 threads in its thread pool, or an asynchronous server has a task pool of size 100; then Concurrent Capacity is an instantaneous measurement of what percentage of these threads/tasks are in use. You can measure this when requests begin and/or end. If when a request begins, 50 out of 100 threads/tasks are currently in-use, then the metric is 0.5 = 50% of concurrent capacity utilization. It's a percentage measurement like CPU Utilization but better!
I've found this is the most important to monitor and understand because it's (1) what you have the most direct control over, as far as tuning, and (2) its behavior will encompass most other performance statistics anyway (such as CPU, RAM, etc.)
For example, if your server is overloaded on CPU usage, and can't process requests fast enough, then they will pile up, and your concurrent capacity will begin to rise until it hits the cap of 100%. At that point, requests begin to queue and performance is impacted. The same is true for any other type of bottleneck: under load, they will all show up as unusually high concurrent capacity usage.
Metrics that measure 'physical' (ish) properties of servers like CPU and RAM usage can be quite noisy, and they are not necessarily actionable; spikes in them don't always indicate a bottleneck. To the extent that you need to care about these metrics, they will be reflected in a rising concurrent capacity metric, so concurrent capacity is what I prefer to monitor primarily, relying on these second metrics to diagnose problems when concurrent capacity is higher than desired.
Concurrent capacity most directly reflects the "slack" available in your system (when properly tuned; see next paragraph). For that reason, it's a great metric to use for scaling, and particularly automated dynamic auto-scaling. As your system approaches 100% concurrent capacity usage in a sustained way (on average, fleet wide), then that's a good sign that you need to scale up. Metrics like CPU or RAM usage do not so directly indicate whether you need to scale, but concurrent capacity does. And even if a particular stat (like disk usage) reflects a bottleneck, it will show up in concurrent capacity anyway.
Concurrent capacity is also the best metric to tune. You want to tune your maximum concurrent capacity so that your server can handle all requests normally when at 100% of concurrent capacity. That is, if you decide to have a thread pool or task pool of size 100, then it's important that your server can handle 100 concurrent tasks normally, without exhausting any other resource (such as CPU, RAM, or outbound connections to another service). This tuning also reinforces the metric's value as a monitoring metric, because it means you can be reasonably confident that your machines will not exhaust their other resources first (before concurrent capacity), and so you can focus on monitoring concurrent capacity primarily.
Depending on your service's SLAs, you might decide to set the concurrent capacity conservatively or aggressively. If performance is really important, then you might tune it so that at 100% of concurrent capacity, the machine still has CPU and RAM in reserve as a buffer. Or if throughput and cost are more important than performance, you might set concurrent capacity so that when it's at 100%, the machine is right at its limits of what it can process.
And it's a great metric to tune because you can adjust it in a straightforward way. Maybe you're leaving CPU on the table with a pool size of 100, so bump it up to 120, etc. Part of the process for tuning your application for each hardware configuration is determining what concurrent capacity it can safely handle. This does require some form of load testing to figure out though.
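A minimal sketch of the metric itself: a gauge that tracks in-flight requests against the configured pool size, sampled as each request begins (the pool size and the reporting hook are illustrative):

```python
import threading

class ConcurrencyGauge:
    """Instantaneous concurrent-capacity utilization: in-flight / capacity."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.in_flight = 0
        self._lock = threading.Lock()

    def __enter__(self):
        with self._lock:
            self.in_flight += 1
            utilization = self.in_flight / self.capacity
        # Emit to the metrics pipeline here; printing stands in for that.
        print(f"concurrent capacity: {utilization:.0%}")
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.in_flight -= 1

gauge = ConcurrencyGauge(capacity=100)

def handle_request(request):
    with gauge:                 # wrap every request handler in the gauge
        return f"handled {request}"

if __name__ == "__main__":
    print(handle_request("example"))
```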
by PeterZaitsev on 8/17/24, 6:45 PM
by itpragmatik on 8/18/24, 12:22 AM
Prometheus/Grafana
by holoduke on 8/17/24, 9:23 PM
by fragmede on 8/18/24, 1:34 PM
by s09dfhks on 8/17/24, 7:11 PM
by NovemberWhiskey on 8/17/24, 10:02 PM
by magarnicle on 8/17/24, 7:11 PM
by kemalunel on 8/17/24, 7:34 PM
by kemalunel on 8/17/24, 7:33 PM
by doctorpangloss on 8/17/24, 9:13 PM
by lakomen on 8/18/24, 1:18 PM
by kazinator on 8/17/24, 9:51 PM
by jalcine on 8/17/24, 7:22 PM
by Ologn on 8/17/24, 10:02 PM
In addition to disk space, watch for running out of inodes on your disk, even if you don't plan to. If you have swap, watch whether you are swapping more than expected. Other things people said make sense depending on your needs as well.
by whirlwin on 8/18/24, 8:05 AM
by entrepy123 on 8/17/24, 10:55 PM
by layer8 on 8/17/24, 7:27 PM
by geocrasher on 8/18/24, 5:56 AM
But there's more to it than just collecting data in a dashboard. Having a reliable agent and being able to monitor the agent itself (for example, not just saying "server down!" if the agent is offline, but checking the server remotely for verification) would be nice.
by dijksterhuis on 8/17/24, 6:05 PM
> What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
Not much tbh. Those were the key things. Alerts for high CPU and memory. Being able to track those per container etc was useful.
> Do you also keep tabs on network performance, processes, services, or other metrics?
Services 100%. We did containerised services with docker swarm, and one of the bugbears with New Relic was having to sort out container label names and such to be able to filter things in the UI. That took me a day or two to standardise (along with the fluentd logging labels, so everything had the same labels).
Background Linux processes less so, but they were still useful, although we had to turn them off in New Relic as they significantly increased the data ingestion (I tuned the NR agent configs to minimise the data we sent, just so we could stick with the free tier as best we could).
> Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.
I like fluentd, but I hate setting it up. Like I can never remember the filter and match syntax. Once it’s running I just leave it though so that’s nice
Never used OpenTelemetry.
Not sure how useful that info is for you.
> What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
Ehhhh, it depends. New Relic was pretty established with a bunch of useful features, but it deffo felt like overkill for what was essentially two containerised django apps with some extra backend services. There was a lot of bloat in NR we probably didn’t ever touch, including in the agent itself, which took up quite a bit of memory.
> Lastly, what’s your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
Personally push, mostly because I can set it up and probably forget about it — run it and add egress firewalls. Job done. Helps with the network effect probably, as it's easy to start.
I can see pull being the preference for bigger enterprises though, who would only want to allow x, y, z data out to a third party. Especially for security etc., cos setting a New Relic agent running with root access to the host (like the New Relic container agent asks for) is probably never gonna work in that environment.
What New Relic kinda got right with their pushing agent was the configs. But finding out the settings was a bear, as the docs are a bit of a nightmare.
(Edited)
by linuxdude314 on 8/17/24, 7:40 PM
This is clearly the industry standard protocol and the present and future of o11y.
The whole point is that o11y vendors can stop reinventing lower level protocols and actually offer unique value props to their customers.
So why would you want to waste your time on such an endeavor?
by selim17 on 8/18/24, 1:20 PM
- I noticed some discussions about alarm systems. It could be beneficial to integrate with alarm services like AWS SES and SNS, providing a seamless alerting mechanism.
- Consider adding a feature that allows users to compare their server metrics with others. This could provide valuable context and benchmarking capabilities.
- Integrating AI for log analysis could be a game-changer. An AI-powered tool that reads, analyzes, and reports on logs could help identify configuration errors that might be easily overlooked.
I hope these suggestions help with the development of BlueWave Uptime Manager!