
Bad Observability

Slight Reliability
Stephen Townshend
Developer Advocate (SRE)

Observability has become a bit of a buzzword in the industry over the last few years. Exactly what "observability" means depends on who you ask, but most people would agree it's about both:

There's plenty of content out there telling you how to implement observability, or what good looks like. But what about bad observability? What are some anti-patterns to watch out for?

Anti-pattern #1 - Forgetting the customer

Throughout my career I've seen organizations that monitor a lot of technical signals about their systems and services, but are not able to answer questions about the end customer. Such as...

We spend a lot of time and money building services for customers. Yet rather than validating whether what we have built is having the desired effect on customers, we instead focus on tracking the health of our technology.

If we were cooks in a restaurant, it would be like measuring our success by how clean the dishes are rather than by whether our customers are enjoying the food we made for them.

Tracking customer experience and behavior should be the first thing we observe. It is our best measure of reliability; anything else we look at to track reliability is one or more steps removed. It is entirely possible for system monitoring to show an issue while the customer experience has not degraded at all. It is equally possible for system monitoring to show everything is green while customers are suffering.

I think this is part of why SLOs exist. They put the focus on the customer when delivering features and services, tracking the customer experience in production and building a feedback loop back into decision making.

Don't get me wrong...we still need detailed infrastructure, platform, and application level monitoring to help diagnose issues and understand what's happening under the covers. But if you don't know what your customers are doing, and you don't know whether they can actually consume your services...then closing that gap is the priority.

Anti-pattern #2 - Environment inconsistency

Many organizations have wildly (or subtly) different observability tooling and configuration set up in their production and pre-production environments. This lack of consistency can lead to a number of drawbacks.

Firstly, teams miss out on the opportunity to practice using their observability tooling and ways of working before features reach production. This is a huge missed opportunity to identify issues before they impact real customers. It also reinforces the age-old gap between delivery and operations.

Secondly, observability tools themselves can impact reliability and performance (and even cause major incidents). If they only exist in production, there is no opportunity to uncover these issues before real customers are impacted.

Keep in mind that alerting is a slightly different story. You can still have all the same alerts in your pre-production environments as you have in production, but send them to an appropriate channel. You don't want to be waking people up at 3am for an issue with a development environment.

If the licensing for your observability tooling makes it prohibitively expensive to run anywhere other than production, then maybe it's time to consider another tool. Remember that it's about outcomes. If a tool doesn't allow you to reach your desired outcome then it's time to look elsewhere (or try a new approach).

Anti-pattern #3 - Not understanding your ecosystem

This anti-pattern is about not understanding the wider solution context that our component sits in. This can include:

Anti-pattern #4 - No consistent trace ID

When an incident occurs or there is a performance bottleneck in a large complex distributed system, it's incredibly helpful to be able to track a single customer interaction right through the solution. As the solutions we work with continually increase in complexity, this is becoming more important than ever (and more challenging).

This is a fairly straightforward problem to solve. Just make sure the top-level component generates a unique token (a trace or correlation ID) that is passed throughout the solution, usually as an HTTP header. Check out the B3 Propagation specification for an example.
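As a rough sketch of what that propagation can look like, here's a minimal Python example that generates B3-style trace headers and passes them on a downstream call. It assumes the requests library is available, and the helper name and downstream URL are purely illustrative.

```python
import uuid

import requests  # third-party HTTP client, assumed available


def b3_headers(trace_id=None, parent_span_id=None):
    """Build B3-style propagation headers for an outgoing request."""
    trace_id = trace_id or uuid.uuid4().hex  # 128-bit trace ID as 32 hex chars
    span_id = uuid.uuid4().hex[:16]          # 64-bit span ID for this hop
    headers = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1",
    }
    if parent_span_id:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers


# Hypothetical downstream call: the same trace ID travels with every hop,
# so logs and traces from different services can be stitched together.
response = requests.get(
    "https://orders.example.internal/api/orders/123",  # placeholder URL
    headers=b3_headers(),
)
```

In practice a tracing library (OpenTelemetry, for example) will handle this for you, but the principle is the same: one ID, generated once, passed everywhere.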

Anti-pattern #5 - The big dumb metric

Sometimes we track metrics that are so high level, or aggregate across so many dimensions, that the numbers in front of us lose all meaning and value.

One classic example is having a complex application that provides hundreds of services, and grouping them all together and reporting on a single response time for all of them. That could include customer interactions that pull static content from a cache alongside operations that require heavy processing.

Let's say you are tracking such a metric and your monitoring tells you that the 95th percentile response time is 780 milliseconds. What have you learned from knowing this number? How does this help you? What insight does it provide? What action can you now take?

Observability is about insight. The way to achieve this is to track the response times of particular customer interactions, ideally the ones that matter the most to your organization.
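To make that concrete, here's a minimal sketch using the Python prometheus_client library (an assumption on my part, not a prescription) that records a latency histogram per customer interaction rather than one aggregate number across everything. The metric and interaction names are made up for illustration.

```python
import time

from prometheus_client import Histogram, start_http_server  # assumed dependency

# One histogram, labelled by customer interaction, instead of a single
# aggregate "response time" across every service the application exposes.
REQUEST_SECONDS = Histogram(
    "shop_request_duration_seconds",          # hypothetical metric name
    "Response time per customer interaction",
    labelnames=["interaction"],
)


def timed(interaction, work):
    """Run one request handler and record its duration against its interaction."""
    start = time.perf_counter()
    try:
        return work()
    finally:
        REQUEST_SECONDS.labels(interaction=interaction).observe(
            time.perf_counter() - start
        )


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
    timed("view_product", lambda: time.sleep(0.05))
    timed("purchase_product", lambda: time.sleep(0.4))
```

Now a 95th percentile can be read per interaction, which is a number you can actually act on.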

Another example of a "big dumb metric" happens when we monitor infrastructure. I frequently see monitoring that only looks at one metric: total % CPU Usage. CPU is important, but it's not the only hardware resource. The other three that need to be considered are memory, disk, and network. And even within your CPU monitoring, you sometimes need to know which process is consuming CPU time, which CPU cores are active at a point in time, or whether CPU is being consumed by system or user processes. Blindly tracking just the one metric is going to bite you.
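As a hedged illustration, the sketch below uses the Python psutil library (again an assumption about tooling) to sample all four resources plus the top CPU-consuming processes, rather than a single total CPU figure.

```python
import psutil  # third-party system metrics library, assumed available

# Sample more than just "total % CPU": memory, disk, and network counters,
# plus which processes are actually consuming the CPU time.
cpu_per_core = psutil.cpu_percent(interval=1, percpu=True)  # % busy per core over 1s
cpu_split = psutil.cpu_times_percent(interval=1)            # user vs system time
memory = psutil.virtual_memory()
disk = psutil.disk_io_counters()
net = psutil.net_io_counters()

top_processes = sorted(
    psutil.process_iter(attrs=["pid", "name", "cpu_percent"]),
    key=lambda p: p.info["cpu_percent"] or 0.0,
    reverse=True,
)[:5]

print(f"per-core CPU: {cpu_per_core}")
print(f"user/system CPU: {cpu_split.user}% / {cpu_split.system}%")
print(f"memory used: {memory.percent}%")
print(f"disk read/write bytes: {disk.read_bytes} / {disk.write_bytes}")
print(f"network sent/recv bytes: {net.bytes_sent} / {net.bytes_recv}")
for proc in top_processes:
    print(f"{proc.info['pid']:>7} {proc.info['name']:<25} {proc.info['cpu_percent']}%")
```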

Anti-pattern #6 - Bad sampling intervals

Sampling intervals are a bit like the beds in Goldilocks and the Three Bears. You don't want them too big, or too small, but juuuuuuust right.

Let's use the example of % CPU usage. What if you decided to capture the average CPU usage on a server at five minute intervals? During that time there might be a one minute period where the CPU burned at 100% and the rest of the time it was 5%. Your monitoring would report an average CPU usage of 24% over that five minute window, which is true, but it doesn't reflect the fact that there was a period where customers were likely impacted. Seeing a CPU usage of 24% is also misleading because it makes it sound like there was fairly consistent utilization, whereas in reality it came in bursts.
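Here's the same arithmetic as a tiny worked example, with hypothetical per-minute samples, to show how one wide aggregation window hides the burst.

```python
# Hypothetical per-minute CPU samples: a one-minute burst at 100%,
# the rest of the five-minute window idling at 5%.
samples = [100, 5, 5, 5, 5]

five_minute_average = sum(samples) / len(samples)
print(f"5-minute average: {five_minute_average}%")  # 24.0% - the burst disappears
print(f"worst 1-minute sample: {max(samples)}%")    # 100% - when customers were hurting
```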

On the other end of the spectrum, what if you were capturing and reporting CPU usage every second? Statistically, there are going to be one second periods where the CPU usage hits 100% but, depending on your context, this probably won't impact customers in any meaningful way. With CPU usage, it's about prolonged periods of high usage that lead to queuing. I worked with a team once that was fixated on "maximum CPU usage" as their key metric of utilization, which resulted in them massively over-provisioning their infrastructure.

Anti-pattern #7 - Misunderstanding metrics

It's important not to take a metric at face value without being clear what exactly it does and does not tell you.

One example I see all the time is tracking "free memory" on a server and claiming that "there is a memory leak" when that number decreases over time. Free memory is only the memory that has not been allocated to anything yet, not even to caches and buffers. Just because free memory has run out does not mean there is no memory available for processes on the server.

Memory that is allocated, whether to a process or to operating system caches, can often still be reclaimed and reused by other processes if required. Low free memory does not necessarily mean there is a memory problem.

Applications are often designed to claim as much memory as possible for efficiency; it's normal behavior, especially for database management systems. If you really want to track memory, do it at the platform or application level, for example by tracking heap usage in a JVM.
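To see the difference between "free" and genuinely available memory for yourself, here's a small sketch using the Python psutil library (an assumed tool, not one this article mandates); on a busy but healthy Linux server the two numbers can be far apart.

```python
import psutil  # third-party system metrics library, assumed available

mem = psutil.virtual_memory()

# "free" is memory allocated to nothing at all; "available" is an estimate of
# what could be handed to new processes, including memory the OS can reclaim.
print(f"total:     {mem.total / 2**30:.1f} GiB")
print(f"free:      {mem.free / 2**30:.1f} GiB  (often low on a healthy, busy server)")
print(f"available: {mem.available / 2**30:.1f} GiB  (the number that actually matters)")
```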

Another example is when we track container CPU usage. In the context of a container, what is % CPU usage? What does it mean? In his 2020 Neotys PAC talk "Seeing is knowing, measuring CPU throttling in containerized environments", Edoardo Varani demonstrated that % CPU usage is not a good indicator of how utilized a container is. He was able to produce situations where container CPU usage was 100% but application performance was not degraded, and situations where % CPU usage was 50% but the application was queuing for processor time. When you are monitoring containers, look at % throttled time as a more accurate measure of how utilized a container is.
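If you want to see throttling directly, the counters live in the container's cgroup. The sketch below is a rough Python example assuming cgroup v2 mounted at /sys/fs/cgroup inside the container; the path and field names differ under cgroup v1, so treat it as a starting point rather than a drop-in check.

```python
def read_cpu_stat(path="/sys/fs/cgroup/cpu.stat"):
    """Parse the cgroup v2 cpu.stat file into a dict of counters."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats


stats = read_cpu_stat()
if stats.get("nr_periods"):
    throttled_ratio = stats["nr_throttled"] / stats["nr_periods"]
    print(f"throttled in {throttled_ratio:.1%} of scheduling periods")
    print(f"total throttled time: {stats.get('throttled_usec', 0) / 1e6:.1f}s")
```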

Anti-pattern #8 - Lazy synthetic transactions

Let's be honest... a "synthetic transaction" is just a fancy term for an automated test that runs on a regular schedule in production. It keeps track of the health of your services even when there is no customer activity at all, and it checks that health in a consistent way.

Most customer facing web applications contain a flow of steps that culminate in some customer outcome. There is usually some kind of session information being carried throughout the flow. For example, for an online store a customer might:

  1. Visit a landing page
  2. Search for a product
  3. View a product
  4. Add the product to their cart
  5. Purchase the product

A lazy synthetic transaction would hit the landing page...and then stop. This is really easy to implement, but it doesn't prove whether the customer outcome (buying a product) can be achieved or not. Unless we can prove that, we're not doing our job.

An effective synthetic transaction would need to prove that all the individual services required in a product purchase are working as intended. In my experience, I've seen this implemented as an automated script that steps through the product purchase process.
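As a sketch of what that can look like, here's a minimal Python synthetic check using the requests library; the base URL, endpoints, and response fields are all hypothetical stand-ins for whatever your real purchase flow exposes.

```python
import requests  # third-party HTTP client, assumed available

BASE_URL = "https://shop.example.com"  # placeholder; endpoints below are hypothetical


def run_purchase_check():
    """Walk the full purchase flow, failing loudly if any step breaks."""
    session = requests.Session()  # carries cookies/session state between steps

    session.get(f"{BASE_URL}/").raise_for_status()
    session.get(f"{BASE_URL}/search", params={"q": "test-product"}).raise_for_status()
    session.get(f"{BASE_URL}/products/test-product").raise_for_status()
    session.post(f"{BASE_URL}/cart", json={"sku": "test-product", "qty": 1}).raise_for_status()

    order = session.post(f"{BASE_URL}/checkout", json={"payment": "synthetic-test-card"})
    order.raise_for_status()
    assert order.json().get("status") == "confirmed", "purchase did not complete"


if __name__ == "__main__":
    run_purchase_check()
```

Run something like this on a schedule against dedicated test data, and you've proven the whole flow rather than just the landing page.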

This is, of course, much more work. There is data to manage and non-trivial test assets to build and maintain. I understand the reluctance to go to the effort but, on the other hand, a synthetic transaction that doesn't give us confidence about the customer experience isn't serving its purpose.

There are plenty of creative ways to make this more manageable. Maybe re-using automated test assets that have already been built. Maybe creating special test endpoints to run in production to make checking all the services easier.

Anti-pattern #9 - A plague of dashboards

Dashboards are meant for displaying information that is frequently referred to. They are not meant for answering one-off questions. For that, engineers need the skills to interrogate data on the fly.

This is a skill gap I've seen in the industry. For many people, if there's no dashboard already built to answer a question, then there's no answer to the question.

I'm not saying it's easy. For any product team, there are bound to be multiple tools and query languages that need to be learned, not to mention needing the experience to know what data to look at and how to analyze it in any given situation. Finding meaningful patterns in large sets of data is a specialism in itself.

One of the core concepts of SRE is the signal-to-noise ratio. The signal is the data that gives us insight we can act on; it's what we need to make decisions. The noise is everything else, and the more noise there is, the harder it is to find the signal. Having hundreds of unused dashboards is noise. They make it harder for engineers to know where to go to answer their questions.

Each dashboard you build becomes a piece of technical debt you carry forward. If dashboards are not being used and providing value, then they're taking time away from more valuable work. I think it's worthwhile capturing metrics on your dashboards. How often are they viewed? By how many people? Just don't chuck those metrics in a dashboard that no-one ever looks at...

Anti-pattern #10 - Unnecessary alerts

If you wouldn't wake up at 3am to handle a situation, then you shouldn't be generating alerts for it.

Every alert or page that goes out without needing immediate action is training your engineers not to take alerts seriously. I'm sure you've heard of the boy who cried wolf - well, this is the monitoring platform that cried major incident.

If you are currently getting a lot of pages for incidents that do not require immediate action, then it's time to adjust your alerting rules. These false alarms will drain patience and sanity. Not to mention, you need enough free time to be proactive about reliability. If you're constantly fighting unimportant fires, that's going to be difficult.

As with many of my other points in this article, bring it back to the customer outcome. If your customers can still effectively use your services (and there's no threat to that in the immediate future) then why are you panicking or waking up in the middle of the night?

Anti-pattern #11 - Hoarding data

At times I have come across teams who have their own observability platform and won't share this with the rest of the organization.

To be fair, it's something I've rarely seen, but when it does happen it comes from a pathological culture: a fear of failure, a fear of change, and command and control. This kind of culture not only opens up chasms between the teams in your organization, but also strips away psychological safety and ultimately the performance of your teams.

I believe that observability data should be freely open for anyone in your organization to see and learn from. Logs with customer information, of course, need careful attention, but most other observability data is not a significant security risk.

Anti-pattern #12 - Disconnected data

Sometimes we have all the observability data we need, but it's spread all over the organization in different tools and repositories. There might be no consistent use of standards or trace IDs either.

Having multiple tools isn't a problem in and of itself. I think it's better to have a few different tools used by teams who are motivated, own their own observability, and have a sense of autonomy than mandating one tool for everyone. There's still a challenge around pulling all that data together, but that's something we can solve creatively.

The real anti-pattern here isn't having too many tools; it's when teams treat their observability as a private resource rather than as a product for the whole organization, because that's what it is and how it should be treated.

Anti-pattern #13 - Throwing tools at a problem

Tools don't solve problems, people do.

Tools are easy to understand and adopt, but they never provide good outcomes without also changing ways of working. This anti-pattern is a personal bugbear of mine as I see it consistently wherever I go.

If you install a tool without building it into the culture of your teams and how they work, then you won't get the most out of it. In fact, I'd go so far as to say you'll get little to no value out of your tooling, which may be costing you a significant amount.

Let's say you have your new tool up and running, but you haven't changed anything about your ways of working. Who is going to get insight from the tool? Who is actually going to look at it? Who is going to configure it? What outcomes do you want the tool to help you achieve? Who is going to instrument your code to collect the custom metrics and events that matter to you and your customers?

One small example I've hit many times in the last few years is dealing with single-page apps. Single-page apps often present the same URL for every customer interaction, so the URL itself does not indicate what the customer is doing. A lot of the time you need to instrument your code to identify customer actions in your APM and monitoring tools. If you can't answer basic questions about what your customers are doing, then how useful is your tooling? You need to invest people's time to get the value out of tools.

Anti-pattern #14 - Mandating tools

When an organization mandates that a particular tool must be used, it strips autonomy and ownership away from the teams that are expected to use it. This doesn't lead to great outcomes. Speaking frankly, the times I've seen this in the past were when a tool vendor had sold an idea to senior leadership and the decision was made without consulting the engineers who were actually supposed to use the tool.

Product teams and engineers need to be part of the decision making process. The decision making should be driven by the desired outcomes you want to see. What problems or opportunities is this tool supposed to help tackle? How do we need to adjust our ways of working to get the most out of it?

I would rather see many teams using many tools but with a sense of ownership of their observability, rather than one consistent tool imposed on a bunch of reluctant teams.

Anti-pattern #15 - The chosen few

The final anti-pattern is where only a select group of people have access to and contribute to an observability platform.

Observability is most valuable when everyone is using it and getting insight from it. In the past I've seen large, strategic product teams where one engineer sets up all the monitoring and is the only one who looks at it. This is often a senior operations engineer or tech lead, and no-one else is really involved or has any ownership of observability.

If only one or two people are contributing to observability, they can end up treating it as their personal toolset rather than a product for the organization. This leads to indecipherable dashboards and monitoring that holds no value to anyone else. Even worse is when this small set of people are the only ones with access or licenses to use the observability platform.

A less extreme version of this that I've seen in pretty much every organization I've worked with is that observability is entirely owned and contributed to by operations, DevOps engineers, or SREs. Developers are either not interested or not included. Observability is just as valuable in delivery as it is for operations. We need to shift left, and shift the culture away from monitoring being a thing that someone else will deal with later.

Observability should be open and available to everyone in the organization. Let's foster a culture of collaboration and working together toward better outcomes.

Takeaways

We covered a lot of ground today, and I don't expect every anti-pattern to resonate with you. If nothing else, here's what I'd like you to take away:

Take the time to understand your unique context. This includes your customers, business, architecture, technology, ways of working, and people. Figure out what outcome you want to achieve, and build your observability one small step at a time to support that outcome. Make it a process of continual improvement.

Related Content

This article was inspired by episodes two, three, and four of the Slight Reliability podcast.
