Twenty years ago, systems monitoring consisted of watching CPU, RAM, and disk utilization on your servers. When a resource crossed some arbitrary threshold, the monitoring system would page some poor systems person, regardless of the time of day. Think 2:00 AM. It didn’t matter that at 2:00 AM it was normal for the disk utilization to spike.
I remember implementing monitoring for a large organization. We spent months on the implementation. The first night we turned the tool on, our pagers went off every few minutes. We turned monitoring off and spent six months tweaking the system, but that was an inexact science. Then we had to pay somebody, or multiple somebodies, to monitor the monitoring system so it would keep producing relevant alerts.
What about those alerts we didn’t care about? Did they eventually lead to a problem we did care about? No idea; there was no correlation.
We eventually figured it out and life was good.
We spent a few decades monitoring this way. We had it under control. But did we?
Then came hybrid cloud, DevOps, PaaS and SaaS services, Docker, stream analytics, and data lakes. How were we to make sense of it all? Monolithic apps, while expensive to run, were easy to monitor. Distributed, microservices-based apps are difficult to monitor.
The increased complexity of an organization’s computing environment created the need for a new paradigm. Enter Observability.
Observability is about complete visibility across your systems and tying business metrics to technical data.
There are three components in Observability: Metrics, Logs, and Traces. Metrics tell you what is happening: latency, traffic, errors, and saturation (the four golden signals). Traces tell you where it is happening: can you follow a transaction from start to finish? This becomes increasingly difficult when we introduce cloud computing components such as Kubernetes clusters and serverless computing. Logs tell you why it is happening.
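To make that concrete, here is a minimal sketch of instrumenting a service for all three signals using the OpenTelemetry Python API. The service name, span name, endpoint, and values are made up for illustration; a real deployment would also configure an SDK and exporters to ship the telemetry somewhere.

```python
import logging

from opentelemetry import metrics, trace

# "checkout-service", "checkout", and "/checkout" are illustrative names.
tracer = trace.get_tracer("checkout-service")
meter = metrics.get_meter("checkout-service")
logger = logging.getLogger("checkout-service")

# Metrics: counters and histograms feed the four golden signals.
request_counter = meter.create_counter(
    "http.requests", description="Total requests served (traffic)")
latency_hist = meter.create_histogram(
    "http.duration", unit="ms", description="Request duration (latency)")

def handle_checkout():
    # Traces: a span marks where in the system this unit of work happened.
    with tracer.start_as_current_span("checkout"):
        request_counter.add(1, {"endpoint": "/checkout"})
        latency_hist.record(42.0, {"endpoint": "/checkout"})
        # Logs: record why something happened, alongside the span.
        logger.info("checkout completed")

handle_checkout()
```

Without a configured SDK this runs as a no-op, which is part of the point: the instrumentation API is deliberately separate from wherever the telemetry ultimately lands.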
As with all frameworks, to successfully “do” observability you need to implement the right trifecta of technology, processes, and people. The technology consists of the tools that capture the three components (Metrics, Logs, and Traces) plus AIOps. If observability is about creating visibility into your environment, AIOps is how we get meaning out of that visibility. AIOps is our visualization layer, and the visualization layer is where the people part of the equation enters.
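What does “getting meaning” look like? One thing AIOps tools do that the static thresholds of twenty years ago did not is judge a reading against that metric’s own recent baseline. Here is a toy sketch of the idea; the window size and sensitivity are arbitrary choices, and real AIOps platforms use far more sophisticated models:

```python
from collections import deque
from statistics import mean, stdev

# A static threshold pages at 2:00 AM even when high disk use is normal
# at that hour. A baseline-aware check asks instead: is this reading
# unusual for *this* metric, given its own recent history?

WINDOW = 60   # recent samples to keep (arbitrary)
SIGMAS = 3.0  # standard deviations that count as "unusual" (arbitrary)

history = deque(maxlen=WINDOW)

def is_anomalous(value: float) -> bool:
    """Flag a value only if it deviates sharply from the recent baseline."""
    anomalous = False
    if len(history) >= 2:
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(value - mu) > SIGMAS * sigma:
            anomalous = True
    history.append(value)
    return anomalous

# The small fluctuations never page; the genuine spike does.
for v in [50, 51, 49, 52, 50, 48, 51, 50, 49, 95]:
    if is_anomalous(v):
        print(f"unusual reading: {v}")  # fires on 95 only
```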
The people part of the trifecta interacts with observability via AIOps. I won’t get into too much detail, but one key role in observability is the Site Reliability Engineer (SRE). Google invented the role, and it has become central to effectively managing cloud-first, cloud-native, and hybrid cloud environments.
The process side of the trifecta is the hard part. As with any process from any framework, the question is: what is the right process for your organization, given your unique needs? That is what you need to ask yourself.
