We should first understand our requirements and then search for a solution that matches those.
Monitoring to me really encompasses the following:
- binary checks (up/down, exceed threshold)
- trending & threshold alerts
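
To make the distinction between those two concrete, here is a rough sketch in Python. The function names, the sample values, and the five-sample window are all made up for illustration; this is not how any particular tool implements it.

```python
from statistics import mean


def binary_check(value, threshold):
    """Binary check: True means the value exceeds the threshold right now."""
    return value > threshold


def trend_alert(samples, threshold, window=5):
    """Trend check: True when the mean of the last `window` samples exceeds the threshold."""
    if len(samples) < window:
        return False  # not enough history to judge a trend
    return mean(samples[-window:]) > threshold


# Made-up CPU percentages with a single spike in the middle.
cpu = [40, 45, 95, 50, 48, 47]

print(binary_check(cpu[2], 90))  # the spike alone trips the binary check
print(trend_alert(cpu, 90))      # the recent average stays under the threshold
```

The binary check reacts to a single bad sample, while the trend check only fires when the problem is sustained, which is why it is worth having both.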
When something breaks I want to be armed with as much data as possible to troubleshoot. Think Nagios + Ganglia + Graphite.
CloudWatch covers most of this, and in a world where I can’t have Datadog, we should leverage CloudWatch as much as possible. We should route alerts from CloudWatch into VictorOps.
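
As a sketch of what one such alert could look like, the snippet below builds the parameters for a CloudWatch alarm on the EC2 `StatusCheckFailed` metric (a binary up/down check). It assumes the alarm notifies an SNS topic and that VictorOps is subscribed to that topic; the helper name, instance ID, and topic ARN are hypothetical placeholders, and the evaluation settings are just one reasonable choice.

```python
def status_check_alarm(instance_id, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call.

    Hypothetical helper: alarms when an EC2 instance fails its status
    check for two consecutive one-minute periods.
    """
    return {
        "AlarmName": f"status-check-failed-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,               # seconds per evaluation period
        "EvaluationPeriods": 2,     # consecutive breaching periods before alarming
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Assumption: this SNS topic is what forwards to VictorOps.
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder values, not real resources.
params = status_check_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:victorops-alerts",
)

# With boto3 installed and AWS credentials configured, this would be applied as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes it easy to review or template the alarms before anything is actually created in AWS.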
We have Pingdom for a year (I paid for it) and that’s an excellent tool for offsite website monitoring. It’s plugged into health.mozilla-community.org (which we also have for a year).
New Relic is good, but its core competency is application instrumentation, less so system health/metrics.
I’d argue that even in a cloud world where you can autoscale, you still want metrics around performance. I don’t, however, think I need to get paged when CPU is “high”. I want to know about it and, more importantly, see its usage over time.
(Really, I want Datadog…)