by gorkemcetin on 8/13/24, 10:13 PM with 186 comments
As we move towards expanding from basic uptime tracking to a comprehensive monitoring solution, we're interested in getting insights from the community.
For those of you managing server infrastructure,
- What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
- Do you also keep tabs on network performance, processes, services, or other metrics?
Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.
- What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
- Lastly, what’s your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
[1] https://github.com/bluewave-labs/bluewave-uptime
by kevg123 on 8/17/24, 10:04 PM
* Network is another basic that should be there
* Average disk service time
* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page out rates
* TCP retransmits as a warning sign of network/hardware issues
* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.) broken down by incoming and outgoing
* Per-CPU utilization
* Rates of operating system warnings and errors in the kernel log
* Application average/max response time
* Application throughput (both total and broken down by the error rate, e.g. HTTP response code >= 400)
* Application thread pool utilization
* Rates of application warnings and errors in the application log
* Application up/down with heartbeat
* Per-application & per-thread CPU utilization
* Periodic on-CPU sampling for a bit of time and then flame graph that
* DNS lookup response times/errors
> Do you also keep tabs on network performance, processes, services, or other metrics?
Per-process and over time, yes, which are useful for post-mortem analysis
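For illustration, two of the cheaper items on that list (TCP retransmits and per-CPU utilization) can be sampled straight from /proc on Linux. A minimal sketch, assuming the standard proc(5) field layout:

```python
import time

def tcp_retrans_segs():
    # /proc/net/snmp has a header line and a value line per protocol.
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("RetransSegs")])

def per_cpu_busy(interval=1.0):
    def snapshot():
        stats = {}
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("cpu") and line[3].isdigit():
                    fields = line.split()
                    stats[fields[0]] = list(map(int, fields[1:]))
        return stats

    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    busy = {}
    for cpu, a in before.items():
        b = after[cpu]
        total = sum(b) - sum(a)
        idle = (b[3] + b[4]) - (a[3] + a[4])  # idle + iowait columns
        busy[cpu] = 100.0 * (total - idle) / total if total else 0.0
    return busy

if __name__ == "__main__":
    r0 = tcp_retrans_segs()
    print("per-CPU busy %:", {c: round(p, 1) for c, p in per_cpu_busy().items()})
    print("TCP retransmits in that window:", tcp_retrans_segs() - r0)
```

A real agent would export these as counters/gauges rather than printing them.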
by aflukasz on 8/17/24, 6:59 PM
- systemd unit failures - I install a global OnFailure hook that applies to all units, to trigger an alert via a mechanism of choice for a given system,
- restarts of key services - you typically don't want to miss those, but if they are silent, then you quite likely will,
- netfilter reconfigurations - nftables cli has useful `monitor` subcommand for this,
- unexpected ingress or egress connection attempts,
- connections from unknown/unexpected networks (if you can't just outright block them for some reason).
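The OnFailure hook above is event-driven; as a rough, hypothetical fallback, the same signal can also be polled by asking systemd for failed units and handing them to whatever alerting mechanism the host uses. A minimal sketch:

```python
import subprocess

def failed_units():
    # One failed unit per line; --plain/--no-legend keep the output parseable.
    out = subprocess.run(
        ["systemctl", "list-units", "--state=failed",
         "--plain", "--no-legend", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[0] for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    for unit in failed_units():
        print(f"ALERT: {unit} is in failed state")  # swap in mail/webhook here
```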
by uaas on 8/15/24, 7:22 AM
by dfox on 8/17/24, 9:42 PM
by cmg on 8/17/24, 6:40 PM
- apt status (for security/critical updates that haven't been run yet)
- reboot needed (presence of /var/run/reboot-required)
- fail2ban jail status (how many are in each of our defined jails)
- CPU usage
- MySQL active, long-running processes, number of queries
- iostat numbers
- disk space
- SSL cert expiration date
- domain expiration date
- reachability (ping, domain resolution, specific string in an HTTP request)
- Application-specific checks (WordPress, Drupal, CRM, etc)
- postfix queue size
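Two of those checks (the reboot-required marker and SSL cert expiry) are cheap to script. A minimal sketch, with the hostname and threshold purely illustrative:

```python
import datetime
import os
import socket
import ssl

def reboot_required():
    return os.path.exists("/var/run/reboot-required")

def cert_days_left(host, port=443):
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.datetime.utcfromtimestamp(ssl.cert_time_to_seconds(not_after))
    return (expires - datetime.datetime.utcnow()).days

if __name__ == "__main__":
    if reboot_required():
        print("WARN: /var/run/reboot-required is present")
    days = cert_days_left("example.com")  # illustrative hostname
    if days < 14:                         # illustrative threshold
        print(f"WARN: certificate expires in {days} days")
```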
by mmarian on 8/14/24, 6:16 AM
by zie on 8/18/24, 1:27 AM
Seriously though, the server itself is not the part that matters; what matters is the application(s) running on the server. So it depends heavily on what the application(s) care about.
If I'm doing some CPU heavy calculations on one server and streaming HTTPS off a different server, I'm going to care about different things. Sure there are some common denominators, but for streaming static content I barely care about CPU stuff, but I care a lot about IO stuff.
I'm mostly agnostic to push vs pull, they both have their weaknesses. Ideally I would get to decide given my particular use case.
The lazy metrics, like you mentioned, are not that useful. As another commenter mentioned, "free" RAM is mostly a pointless number, since these days most OSes wisely use it for caching. But information on OS-level caching can be very useful, depending on the workloads I'm running on the system.
As for agents, what I care about is how stable, reliable and resource intensive they are. I want an agent that takes zero resources and is rock solid and reliable. Many agents fail spectacularly at all three of those things. Crowdstrike is the most recent example of failure here with agent-based monitoring.
The point of monitoring systems to me are two-fold:
* Trying to spot problems before they become problems (i.e. we have X days before disk is full given current usage patterns).
* Trying to track down a problem as it is happening (i.e. App Y is slow in X scenario all of a sudden, why?).
Focus on the point of monitoring and keep your agent as simple, solid and idiot-proof as possible. Crowdstrike's recent failure mode was completely preventable had the agent been written differently. Architect your agent as much as possible to never be another Crowdstrike. Yes, I know Crowdstrike was user machines, not servers, but server agent failures happen all the time too, in roughly the same ways; they just don't make the news quite as often.
by mgbmtl on 8/17/24, 6:39 PM
I find it easier to write custom checks for things where I don't control the application. My custom checks often do API calls for the applications they monitor (using curl locally against their own API).
There are also lots of existing scripts I can re-use, either from the Icinga or from Nagios community, so that I don't write my own.
For example, recently I added systemd monitoring. There is a package for the check (monitoring-plugins-systemd). So I used Ansible to install it everywhere, and then "apply" a conf to all my Debian servers. This helped me find a bunch of failing services or timers which previously went unnoticed, including things like backups, where my backup monitoring said everything was OK, but the systemd service for borgmatic was running a "check" and found some corruption.
For logs I use promtail/loki. Also very much worth the investment. Useful to detect elevated error rates, and also for finding slow http queries (again, I don't fully control the code of applications I manage).
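A hypothetical sketch of the curl-your-own-API style of check described above, following the Nagios/Icinga plugin convention of exit codes 0/1/2 for OK/WARNING/CRITICAL (the endpoint and JSON field are made up):

```python
import json
import sys
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2
URL = "http://localhost:8080/api/health"   # assumed application endpoint

def main():
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            body = json.load(resp)
    except Exception as exc:
        print(f"CRITICAL - health endpoint unreachable: {exc}")
        return CRITICAL
    queued = body.get("queued_jobs", 0)    # assumed field in the response
    if queued > 1000:
        print(f"WARNING - {queued} jobs queued")
        return WARNING
    print(f"OK - {queued} jobs queued")
    return OK

if __name__ == "__main__":
    sys.exit(main())
```

Icinga and Nagios only care about the exit code and the first line of output, which is what makes one-off checks like this so easy to drop in.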
by LeoPanthera on 8/17/24, 11:22 PM
I don't do this professionally. I have a small homelab that is mostly one router running opnsense, one fileserver running TrueNAS, and one container host running Proxmox.
Proxmox does have about 10-15 containers though, almost all Debian, and I feel like I should be doing more to keep an eye on both them and the physical servers themselves. Any suggestions?
by mekster on 8/18/24, 5:04 AM
For example, providing a CPU metric alone is only good for alerting. If it exceeds a threshold, make sure it gives insights into which process/container was using how much CPU at that moment. Bonus points if you can link logs from that process/container at that time.
For disks, tell which directory is large, and what kinds of files are using the most space.
Pretty graphs that don't tell you what to look for next are nothing.
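A rough sketch of attaching that context to a CPU alert, using the third-party psutil package to snapshot the top consumers over a short window:

```python
import time
import psutil  # third-party: pip install psutil

def top_cpu_processes(n=5, interval=1.0):
    procs = list(psutil.process_iter(["pid", "name"]))
    for p in procs:
        try:
            p.cpu_percent(None)       # prime the per-process counters
        except psutil.Error:
            pass
    time.sleep(interval)              # measure over a short window
    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(None), p.info["pid"], p.info["name"]))
        except psutil.Error:
            continue
    return sorted(usage, reverse=True)[:n]

if __name__ == "__main__":
    # Attach something like this to the alert payload instead of printing it.
    for pct, pid, name in top_cpu_processes():
        print(f"{pct:5.1f}%  pid={pid}  {name}")
```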
by 1oooqooq on 8/14/24, 1:47 AM
Look for guides written before 2010. Seriously, it's this bad. Then, after you have everything in one syslog somewhere, dump it to a fancy dashboard like o2.
by aleda145 on 8/17/24, 6:52 PM
Then I visualize it with Grafana. It's actually live here if you want to check it out: https://grafana.dahl.dev
by valyala on 8/18/24, 7:23 AM
As for the agent, it is better from an operations perspective to run a single observability agent per host. This agent should be small in size and lightweight in CPU and RAM usage, should have no external dependencies, and should have close to zero configs that need to be tuned, e.g. it should automatically discover all the apps and metrics that need to be monitored and send them to the centralized observability database.
If you don't want to write the agent yourself, then take a look at vmagent ( https://docs.victoriametrics.com/vmagent/ ), which scrapes metrics from the exporters mentioned above. vmagent satisfies most of the requirements stated above except for configuration - you need to provide configs for scraping metrics from separately installed exporters.
by oriettaxx on 8/18/24, 5:07 AM
In general I would also suggest monitoring server costs (AWS EC2 costs, e.g.)
For example, you should be aware that AWS EC2 T3 instances can simply cost double once their CPU actually gets used, since the "unlimited" credit flag is ON by default. I personally hate the whole AWS "CPU credit" model... it is an instrument entirely in their (AWS) hands to just make more money...
by holowoodman on 8/18/24, 3:33 PM
* In our setup, container status is included in this thanks to quadlets. However, if using e.g. docker, separate container monitoring is necessary, but complex.
* apt/yum/fwupd/... pending updates
* mailqueue length, root's mailbox size: this is an indicator for stuff going wrong silently
* pending reboot after kernel update
* certain kinds of log entries (block device read error, OOMkills, core dumps).
* network checksum errors, dropped packets, martians
* presence or non-presence of USB devices: desktops should have a keyboard and mouse, servers usually shouldn't, and USB storage is sometimes forbidden.
by arcbyte on 8/17/24, 10:12 PM
For some of my services on DigitalOcean for instance, I monitor RAM because using a smaller instance can dramatically save money.
But for the most part I don't monitor anything - if it doesn't make me money why do I care?
by waynenilsen on 8/17/24, 5:40 PM
by usernamed7 on 8/17/24, 5:31 PM
by tiffanyh on 8/18/24, 4:24 AM
- nagios
- Victoria metrics
- monit
- datadog
- prometheus grafana
- etc …
Q2: Also, is there something akin to “SQLite” for monitoring servers? Meaning, a simple / tested / reliable tool to use.
Q3: if you ran a small saas business, which simple tool would you use to monitor your servers & services health?
by Izkata on 8/18/24, 7:19 AM
As a developer who has often had to look into problems and performance issues, instead of an infrastructure person, this is basically the bare minimum of what I want to see:
* CPU usage
* RAM breakdown by at least Used/Disk cache/Free
* Disk fullness (preferably in absolute numbers, percents get screwy when total size changes)
* Disk reads/writes
* Network reads/writes
And this is high on the list but not required:
* Number of open TCP connections, possibly broken down by state
* Used/free inodes (for relevant filesystems); we have actually used them up before (thanks npm)
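A sketch of a bare-minimum snapshot roughly matching that list, using the third-party psutil package plus os.statvfs for inodes (the disk and network counters are cumulative, so per-second rates come from diffing two samples):

```python
import os
import psutil  # third-party

def snapshot(path="/"):
    mem = psutil.virtual_memory()
    disk = psutil.disk_usage(path)
    dio = psutil.disk_io_counters()
    nio = psutil.net_io_counters()
    st = os.statvfs(path)
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "ram_used_bytes": mem.used,
        "ram_cached_bytes": getattr(mem, "cached", 0),  # Linux-only field
        "ram_free_bytes": mem.free,
        "disk_used_bytes": disk.used,        # absolute numbers, not percent
        "disk_free_bytes": disk.free,
        "disk_read_bytes": dio.read_bytes,   # cumulative since boot
        "disk_write_bytes": dio.write_bytes,
        "net_recv_bytes": nio.bytes_recv,
        "net_sent_bytes": nio.bytes_sent,
        "inodes_free": st.f_favail,
        "inodes_total": st.f_files,
    }

if __name__ == "__main__":
    for key, value in snapshot().items():
        print(f"{key}: {value}")
```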
by jiggawatts on 8/17/24, 10:48 PM
CPU and Memory are the easiest and most obvious to collect but the most irrelevant.
If nobody’s looked at any metrics before on the server fleet, then basic metrics have some utility: you can find the under- or over- provisioned servers and fix those issues… once. And then that well will very quickly run dry. Unfortunately, everyone will have seen this method “be a success” and will then insist on setting up dashboards or whatever. This might find one issue annually, if that, at great expense.
In practice, modern distributed tracing or application performance monitoring (APM) tools are vastly more useful for day-to-day troubleshooting. These things can find infrequent crashes, expired credentials, correlate issues with software versions or users, and on and on.
I use Azure Application Insights in Azure because of the native integration but New Relic and DataDog are also fine options.
Some system admins might respond to suggestions like this with: “Other people manage the apps!” not realising that therein lies their failure. Apps and their infrastructure should be designed and operated as a unified system. Auto scale on metrics relevant to the app, monitor health relevant to the app, collect logs relevant to the app, etc…
Otherwise when a customer calls about their failed purchase order the only thing you can respond with is: “From where I sit everything is fine! The CPUs are nice and cool.”
by xorcist on 8/17/24, 11:12 PM
The Nagios ecosystem was fragmented for the longest time, but now it seems most users have drifted towards Icinga, so this is what I use for monitoring. There is some basic integration with Grafana for metrics, so that is what I use for metrics panels. There is good reason not to spend your innovation budget on monitoring; instead use simple software that will continue to be around for a long time.
As for what to monitor, that is application specific and should go into the application manifest or configuration management. But generally there should be some sort of active operation that touches the common data path, such as a login, creation of a dummy object (for example an empty order), validation of said object, and destruction/clean up.
Outside the application there should be checks for whatever the application relies on. Working DNS, NTP drift, Ansible health, certificate validity, applicable APT/RPM packages, database vacuums, log transport health, and the exit status or last file date of scheduled or backgrounded jobs.
Metrics should be collected for total connections, their return status, all types of I/O latency and throughput, and system resources such as CPU, memory, disk space.
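A hypothetical sketch of such an active data-path check (log in, create an empty dummy order, read it back, clean it up); every URL, credential, and field here is illustrative and would come from the application manifest:

```python
import sys
import requests  # third-party

BASE = "https://app.example.com/api"   # illustrative

def check():
    s = requests.Session()
    s.post(f"{BASE}/login",
           json={"user": "monitor", "password": "redacted"},
           timeout=10).raise_for_status()
    r = s.post(f"{BASE}/orders", json={"items": []}, timeout=10)  # empty order
    r.raise_for_status()
    order_id = r.json()["id"]
    try:
        s.get(f"{BASE}/orders/{order_id}", timeout=10).raise_for_status()
    finally:
        s.delete(f"{BASE}/orders/{order_id}", timeout=10)         # clean up

if __name__ == "__main__":
    try:
        check()
        print("OK - data path healthy")
        sys.exit(0)
    except Exception as exc:
        print(f"CRITICAL - {exc}")
        sys.exit(2)
```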
by 29athrowaway on 8/18/24, 3:15 AM
by kkfx on 8/17/24, 6:24 PM
- automated alerts on unusual loads, meaning I do not care about CPU/RAM/disk usage except for specific spikes, so the monitor just sends alerts (mails) in case of significant/protracted spikes, tuned after a bit of experience. No need to collect such data over significant periods: you have sized your infra for the expected load, you deploy and see if you have done so correctly, and if so you just need to learn the usual patterns and filter them out, keeping alerts only for anomalies;
- log alerts for errors, warnings, access logs etc., same principle: you deploy and collect a bit, then you have "the normal logs" and you create alerts for unusual things. Retention depends on the log types and services you run, and some retention could be constrained by laws;
Performance metrics are a totally different thing that should be decided more by the devs than by operations, and much of their design depends on the kind of development and services you have. It's much more complex because the monitoring itself touches the performance of the system MUCH more than generic alerting or a casual ping and the like to check service availability. Push and pull are mixed: for alerts push is the obvious go-to, for availability pull is much more sound, etc. There is no "one choice".
Personally I tend to go slowly on fine-grained monitoring to start. It's important of course, but it should not become an analysis-paralysis trap, nor waste too many human and IT resources on collecting potential garbage in not-so-marginal batches...
by elashri on 8/18/24, 9:58 AM
by tgtweak on 8/18/24, 12:47 AM
Disk I/O and network I/O are particularly important, but most of the information you truly care about lies in application traces and application logs. Database metrics are a close second, particularly cache/index usage, disk activity, and query profiling. Network too, if your application is bandwidth heavy.
by Jedd on 8/18/24, 9:42 AM
I think it'd be very hard at this point to come up with compelling alternatives to the incumbents in this space.
I'd certainly not want a non-free, non-battle-tested, potentially incompatible & lock-in agent that wouldn't align with the agents I currently utilise (all free in the good sense).
Push vs pull is an age-old conundrum - at dayjob we're pull - Prometheus scraping Telegraf - for OS metrics.
Though traces, front-end, RUM, SaaS metrics, logs, etc, are obviously more complex.
Whether to pull or push often comes down to how static your fleet is, but mostly to whether you've got a decent CMDB that you can rely on to tell you the state of all your endpoints - registering and decommissioning endpoints, as well as coping with scheduled outages.
by dangus on 8/18/24, 12:44 PM
If you’re building a product from scratch you must have some kind of vision based on deficiencies in existing solutions that are motivating you to build a new product, right?
by nrr on 8/18/24, 3:11 AM
A lot of the details are pretty application-specific, but the metrics I care about can be broadly classified as "pressure" metrics: CPU pressure, memory pressure, I/O pressure, network pressure, etc.
Something that's "overpressure" can manifest as, e.g., excessively paging in and out, a lot of processes/threads stuck in "defunct" state, DNS resolutions failing, and so on.
I don't have much of an opinion about push versus pull metrics collection as long as it doesn't melt my switches. They both have their place. (That said, programmable aggregation on the metrics exporter is something that's nice to have.)
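One concrete way to get at those pressure numbers on Linux is the kernel's pressure stall information (PSI, available on kernels 4.20+ with PSI enabled). A minimal sketch of reading it:

```python
def read_psi(resource):
    # Lines look like: "some avg10=0.12 avg60=0.08 avg300=0.02 total=123456"
    result = {}
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            kind, *fields = line.split()
            result[kind] = {key: float(val) for key, val in
                            (field.split("=") for field in fields)}
    return result

if __name__ == "__main__":
    for resource in ("cpu", "memory", "io"):
        psi = read_psi(resource)
        # avg10: share of the last 10s during which at least one task stalled
        print(f"{resource}: some avg10 = {psi['some']['avg10']}%")
```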
by klinquist on 8/17/24, 6:51 PM
by koliber on 8/17/24, 7:23 PM
- http req counts vs total non-200-response count vs. 404-and-30x count.
- whatever asynchronous jobs you run, a graph of jobs started vs jobs finished will show you rough resource utilization and highlight gross bottlenecks.
by whalesalad on 8/17/24, 7:45 PM
by madaxe_again on 8/17/24, 6:30 PM
As to what I monitor - normally, as little as humanly possible, and when needed, everything possible.
by ralferoo on 8/17/24, 6:13 PM
It's free if you don't have too many servers - 15 uptime monitors (the most useful) and 32 blacklist monitors (useful for e-mail, but don't know why you'd need so many compared to uptime).
It's fairly easy to reach the free limits with not many servers if you're also monitoring VMs, but I've found it reliable so far. It's nice you can have ping tests from different locations, and it collects pretty much any metrics that are useful such as CPU, RAM, network, disk. The HTTP and SMTP tests are good too.
by Mojah on 8/18/24, 9:14 AM
For web applications for instance, we care about uptime & performance, tls certificates, dns changes, crawled broken links/mixed content & seo/lighthouse metrics.
by mikewarot on 8/15/24, 7:55 AM
by azthecx on 8/18/24, 11:52 AM
> What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
No, and I have specifically tried to push back against monitoring offerings like Datadog and Dynatrace, especially in the case of the second, because the OneAgent and Dynakube CRDs do things like downloading tarballs from Dynatrace and listening to absolutely everything they can, from processes to network.
by sroussey on 8/17/24, 9:15 PM
by mnahkies on 8/18/24, 9:52 AM
Ideally I'd see these events overlaid with the time series to make it obvious that a restart was caused by OOM as opposed to other forms of crash.
by natmaka on 8/18/24, 12:46 PM
by bearjaws on 8/18/24, 3:08 AM
Did it run? Is it still running? Did it have any errors? Why did it fail? Which step did it fail on?
My last job built a job tracker for "cron" tasks that supported an actual crontab + could schedule hitting an https endpoint.
Of course it requires code modification to ensure it writes something so you can tell it ran in the first place. But that was part of modernizing a 10 year old LAMP stack.
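A sketch of that idea as a wrapper: run the job, and ping a heartbeat URL on start, success, and failure. The endpoint is illustrative (hosted services like healthchecks.io use the same /start and /fail convention):

```python
import subprocess
import sys
import urllib.request

HEARTBEAT = "https://hc.example.com/ping/nightly-backup"   # illustrative URL

def ping(suffix=""):
    try:
        urllib.request.urlopen(HEARTBEAT + suffix, timeout=10)
    except Exception:
        pass  # never let the monitoring break the job itself

def main(cmd):
    ping("/start")
    result = subprocess.run(cmd)
    ping("" if result.returncode == 0 else "/fail")
    return result.returncode

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

The crontab entry then wraps the real command, something like `0 2 * * * heartbeat.py /usr/local/bin/backup.sh` (paths illustrative).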
by malkosta on 8/18/24, 12:11 AM
I also keep a close eye on the throughput vs response time ratio, especially the 95th percentile of the response time.
It’s also great to have this same ratio measurement for the DBs you might use.
Those are my go-to daily metrics; the rest can be zoomed into in their own dashboards after I first check these.
by blueflow on 8/18/24, 1:19 PM
If I were to set up metrics, the first thing I would go for is the pressure stall information.
by justinclift on 8/18/24, 5:09 AM
There's a bunch of ways of measuring "usage" for disks, apart from the "how much space is used". There's "how many iops" (vs the total available for the disk), there's how much wear % is used/left for the disk (specific to flash), how much read/write bandwidth is being used (vs the maximum for the disk), and so on.
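A sketch of sampling two of those dimensions (per-disk IOPS and read/write bandwidth) with the third-party psutil package; wear level would instead come from SMART data (e.g. smartctl), which isn't shown here:

```python
import time
import psutil  # third-party

def disk_rates(interval=1.0):
    before = psutil.disk_io_counters(perdisk=True)
    time.sleep(interval)
    after = psutil.disk_io_counters(perdisk=True)
    rates = {}
    for dev, b in after.items():
        a = before.get(dev)
        if a is None:
            continue
        rates[dev] = {
            "iops": ((b.read_count - a.read_count) +
                     (b.write_count - a.write_count)) / interval,
            "read_bytes_per_s": (b.read_bytes - a.read_bytes) / interval,
            "write_bytes_per_s": (b.write_bytes - a.write_bytes) / interval,
        }
    return rates

if __name__ == "__main__":
    for dev, rate in disk_rates().items():
        print(dev, rate)
```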
by dig1 on 8/18/24, 2:55 PM
> Do you also keep tabs on network performance, processes, services, or other metrics?
Everything :)
> What's your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
I went with collectd [1] and Telegraf [2] simply because they support tons of modules and are very stable. However, I have a couple of bespoke agents for cases where neither collectd nor Telegraf will fit.
> Lastly, what's your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
We can argue to death, but I'm for push-based agents all the way down. It is much easier to scale, and things are painless to manage when the right tool is used (I'm using Riemann [3] for shaping, routing, and alerting). I used to run a Zabbix setup, and scaling was always the issue (Zabbix is pull-based). I'm still baffled as to how pull-based monitoring gained traction, probably because modern gens need to repeat mistakes from the past.
[2] https://www.influxdata.com/time-series-platform/telegraf/
by giuliomagnifico on 8/20/24, 9:23 AM
by metadat on 8/17/24, 11:34 PM
2. Hard drive health monitoring with Scrutiny.
https://github.com/AnalogJ/scrutiny
Everything else doesn't matter to me for home use.
Good luck with your endeavor!
by udev4096 on 8/18/24, 5:49 AM
As for the servers, I use uptime kuma (notifies whenever a service goes down), glance (htop in web), vnstat (for network traffic usage) and loki (for logs monitoring)
by veryrealsid on 8/18/24, 9:32 AM
by jimnotgym on 8/18/24, 3:33 PM
by sebazzz on 8/18/24, 8:08 AM
by 1oooqooq on 8/14/24, 1:48 AM
by damonll on 8/18/24, 4:13 AM
Be warned though there are a ton of monitoring solutions already. Hopefully yours has something special to bring to the table.
by imperialdrive on 8/18/24, 12:19 AM
by londons_explore on 8/18/24, 2:50 AM
Can you simply include some existing open source tooling into your uptime monitor, and then contribute to those open source projects any desired new features?
by sunshine-o on 8/17/24, 10:57 PM
by maxboone on 8/18/24, 4:15 AM
by jcrites on 8/17/24, 7:01 PM
For example, say a synchronous server has 100 threads in its thread pool, or an asynchronous server has a task pool of size 100; then Concurrent Capacity is an instantaneous measurement of what percentage of these threads/tasks are in use. You can measure this when requests begin and/or end. If when a request begins, 50 out of 100 threads/tasks are currently in-use, then the metric is 0.5 = 50% of concurrent capacity utilization. It's a percentage measurement like CPU Utilization but better!
I've found this is the most important to monitor and understand because it's (1) what you have the most direct control over, as far as tuning, and (2) its behavior will encompass most other performance statistics anyway (such as CPU, RAM, etc.)
For example, if your server is overloaded on CPU usage, and can't process requests fast enough, then they will pile up, and your concurrent capacity will begin to rise until it hits the cap of 100%. At that point, requests begin to queue and performance is impacted. The same is true for any other type of bottleneck: under load, they will all show up as unusually high concurrent capacity usage.
Metrics that measure 'physical' (ish) properties of servers like CPU and RAM usage can be quite noisy, and they are not necessarily actionable; spikes in them don't always indicate a bottleneck. To the extent that you need to care about these metrics, they will be reflected in a rising concurrent capacity metric, so concurrent capacity is what I prefer to monitor primarily, relying on these second metrics to diagnose problems when concurrent capacity is higher than desired.
Concurrent capacity most directly reflects the "slack" available in your system (when properly tuned; see next paragraph). For that reason, it's a great metric to use for scaling, and particularly automated dynamic auto-scaling. As your system approaches 100% concurrent capacity usage in a sustained way (on average, fleet wide), then that's a good sign that you need to scale up. Metrics like CPU or RAM usage do not so directly indicate whether you need to scale, but concurrent capacity does. And even if a particular stat (like disk usage) reflects a bottleneck, it will show up in concurrent capacity anyway.
Concurrent capacity is also the best metric to tune. You want to tune your maximum concurrent capacity so that your server can handle all requests normally when at 100% of concurrent capacity. That is, if you decide to have a thread pool or task pool of size 100, then it's important that your server can handle 100 concurrent tasks normally, without exhausting any other resource (such as CPU, RAM, or outbound connections to another service). This tuning also reinforces the metric's value as a monitoring metric, because it means you can be reasonably confident that your machines will not exhaust their other resources first (before concurrent capacity), and so you can focus on monitoring concurrent capacity primarily.
Depending on your service's SLAs, you might decide to set the concurrent capacity conservatively or aggressively. If performance is really important, then you might tune it so that at 100% of concurrent capacity, the machine still has CPU and RAM in reserve as a buffer. Or if throughput and cost are more important than performance, you might set concurrent capacity so that when it's at 100%, the machine is right at its limits of what it can process.
And it's a great metric to tune because you can adjust it in a straightforward way. Maybe you're leaving CPU on the table with a pool size of 100, so bump it up to 120, etc. Part of the process for tuning your application for each hardware configuration is determining what concurrent capacity it can safely handle. This does require some form of load testing to figure out though.
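A minimal sketch of the metric itself: a gauge that tracks in-flight requests against the configured pool size, sampled as each request begins (the pool size and the reporting hook are illustrative):

```python
import threading

class ConcurrencyGauge:
    """Instantaneous concurrent-capacity utilization: in-flight / capacity."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.in_flight = 0
        self._lock = threading.Lock()

    def __enter__(self):
        with self._lock:
            self.in_flight += 1
            utilization = self.in_flight / self.capacity
        # Emit to the metrics pipeline here; printing stands in for that.
        print(f"concurrent capacity: {utilization:.0%}")
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.in_flight -= 1

gauge = ConcurrencyGauge(capacity=100)

def handle_request(request):
    with gauge:                 # wrap every request handler in the gauge
        return f"handled {request}"

if __name__ == "__main__":
    print(handle_request("example"))
```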
by PeterZaitsev on 8/17/24, 6:45 PM
by itpragmatik on 8/18/24, 12:22 AM
Prometheus/Grafana
by holoduke on 8/17/24, 9:23 PM
by fragmede on 8/18/24, 1:34 PM
by s09dfhks on 8/17/24, 7:11 PM
by NovemberWhiskey on 8/17/24, 10:02 PM
by magarnicle on 8/17/24, 7:11 PM
by kemalunel on 8/17/24, 7:34 PM
by kemalunel on 8/17/24, 7:33 PM
by doctorpangloss on 8/17/24, 9:13 PM
by lakomen on 8/18/24, 1:18 PM
by kazinator on 8/17/24, 9:51 PM
by jalcine on 8/17/24, 7:22 PM
by Ologn on 8/17/24, 10:02 PM
In addition to disk space, watch for running out of inodes on your disk, even if you don't plan to. If you have swap, watch whether you are swapping more than expected. Other things people said make sense depending on your needs as well.
by whirlwin on 8/18/24, 8:05 AM
by entrepy123 on 8/17/24, 10:55 PM
by layer8 on 8/17/24, 7:27 PM
by geocrasher on 8/18/24, 5:56 AM
But there's more to it than just collecting data in a dashboard. Having a reliable agent and being able to monitor the agent itself (for example, not just saying "server down!" if the agent is offline, but checking the server remotely for verification) would be nice.
by dijksterhuis on 8/17/24, 6:05 PM
> What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?
Not much tbh. Those were the key things. Alerts for high CPU and memory. Being able to track those per container etc was useful.
> Do you also keep tabs on network performance, processes, services, or other metrics?
Services 100%. We did containerised services with docker swarm, and one of the bugbears with New Relic was having to sort out container label names and such to be able to filter things in the UI. That took me a day or two to standardise (along with the fluentd logging labels, so everything had the same labels).
Background Linux processes less so, but they were still useful, although we had to turn them off in New Relic as they significantly increased the data ingestion (I tuned the NR agent configs to minimise the data we sent, just so we could stick with the free tier as best we could).
> Additionally, we're debating whether to build a custom monitoring agent or leverage existing solutions like OpenTelemetry or Fluentd.
I like fluentd, but I hate setting it up. Like I can never remember the filter and match syntax. Once it’s running I just leave it though so that’s nice
Never used OpenTelemetry.
Not sure how useful that info is for you.
> What’s your take—would you trust a simple, bespoke agent, or would you feel more secure with a well-established solution?
Ehhhh, it depends. New Relic was pretty established with a bunch of useful features, but it deffo felt like overkill for what was essentially two containerised django apps with some extra backend services. There was a lot of bloat in NR we probably didn’t ever touch, including in the agent itself, which took up quite a bit of memory.
> Lastly, what’s your preference for data collection—do you prefer an agent that pulls data or one that pushes it to the monitoring system?
Personally push, mostly because I can set it up and probably forget about it — run it and add egress firewalls. Job done. Helps with the network effect probably, as it's easy to start.
I can see pull being the preference for bigger enterprises though, who would only want to allow x, y, z data out to a third party. Especially for security etc., cos setting a New Relic agent running with root access to the host (like the New Relic container agent asks for) is probably never gonna work in that environment.
What New Relic kinda got right with their pushing agent was the configs. But finding out the settings was a bear, as the docs are a bit of a nightmare.
(Edited)
by linuxdude314 on 8/17/24, 7:40 PM
This is clearly the industry standard protocol and the present and future of o11y.
The whole point is that o11y vendors can stop reinventing lower level protocols and actually offer unique value props to their customers.
So why would you want to waste your time on such an endeavor?
by selim17 on 8/18/24, 1:20 PM
- I noticed some discussions about alarm systems. It could be beneficial to integrate with alarm services like AWS SES and SNS, providing a seamless alerting mechanism.
- Consider adding a feature that allows users to compare their server metrics with others. This could provide valuable context and benchmarking capabilities.
- Integrating AI for log analysis could be a game-changer. An AI-powered tool that reads, analyzes, and reports on logs could help identify configuration errors that might be easily overlooked.
I hope these suggestions help with the development of BlueWave Uptime Manager!