In software observability, we often talk about three telemetry types: metrics, logs, and distributed traces. More recently I've been hearing about profiles as a fourth type.
In this article I'll give a clear, concise explanation of each telemetry type and when to use it.
Every discrete thing that occurs in our software systems is an event. Events come in many shapes and sizes. Here are some examples of events:
- An HTTP request hitting an API
- A CPU instruction being executed
- A database operation completing
- A batch job running
Every event is unique. Most types of telemetry deal with individual events, with the exception of metrics.
Logs are records of events that occur within a single component. They are easy to produce, and give us detailed information about specific types of events we care about. Some common types of logs include:
- HTTP access logs which record every request hitting an endpoint
- Application logs which contain informational, debug, warning, and error messages to help understand what is happening inside an app
- Audit logs which contain information required for traceability, such as a record of every time users log in to a system or make sensitive changes
Let's use the example of a café or coffee shop (depending on which part of the world you live in).
A café is a system made up of components, one of which would be the cash register. We could get the cash register to produce a log of every time it is used to make a financial transaction. For every transaction we might capture:
- What time the purchase was made
- What was ordered
- The value of the sale
We have just configured our cash register to produce a transaction log. Logs in software systems are no different. We configure a component of our system to record details of events that occur so that we can refer back to it later.
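In code, the cash register's transaction log might look something like the sketch below. This is a minimal illustration, not a standard API: `log_transaction` and its field names are hypothetical, built on Python's standard `logging` module.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("cash_register")

def log_transaction(items, total):
    """Record one sale as a structured (JSON) log line, and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the purchase was made
        "items": items,                                       # what was ordered
        "total": total,                                       # the value of the sale
    }
    logger.info(json.dumps(record))
    return record

log_transaction(["flat white", "croissant"], 9.50)
```

Emitting logs as structured JSON rather than free text makes them much easier to search later, and easier to convert into metrics.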
What are logs *good* for?
Logs can contain detailed information to help us debug issues once we know which component is experiencing an issue. They are also a great starting point for instrumenting your systems, because they are easy to implement and understand.
What are logs *not* good for?
They don't tell us what is happening in any part of our system other than the component producing the log. They also don't tell us about trends or patterns that span many events over time (without first converting logs into metrics).
A single customer interaction may pass through dozens of different components before the request is fully serviced.
Distributed traces are similar to logs in that they capture information about events. The difference is that traces (ideally) capture information from all the components required to service a request. They track the flow of a customer interaction as it traverses your software system end to end.
Let's go back to our café metaphor. We used logs before to tell us what happened from the perspective of the cash register. For a trace we might also capture information from the perspective of the barista who makes the coffee, and the espresso machine used in the process.
The information captured by each component in the system is called a span. When we combine all of the spans used to service a request we have a distributed trace.
In order to identify all the spans related to a particular customer event, we need a unique identifier for each customer interaction. In the café example above, I've given each customer order an order number. In a software system, rather than an order number, we often use trace IDs, correlation IDs, and parent IDs as our unique identifiers.
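As a rough sketch of how those IDs tie spans together, here is a toy span structure for the café example. The field names are illustrative only, not the schema of any particular tracing library:

```python
import time
import uuid

def new_span(name, trace_id=None, parent_id=None):
    """A minimal span: a shared trace_id groups spans into one trace,
    and parent_id records which span caused this one."""
    return {
        "span_id": uuid.uuid4().hex[:8],
        "trace_id": trace_id or uuid.uuid4().hex,  # new trace if none given
        "parent_id": parent_id,
        "name": name,
        "start": time.time(),
    }

# One customer order, traced across three components:
register = new_span("take-order")
barista = new_span("make-coffee",
                   trace_id=register["trace_id"], parent_id=register["span_id"])
machine = new_span("pull-espresso-shot",
                   trace_id=barista["trace_id"], parent_id=barista["span_id"])
```

All three spans carry the same trace ID, which is what lets a tracing backend reassemble them into a single end-to-end trace.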
What are traces *good* for?
Traces are the backbone of modern observability. They allow us to make sense of what is happening in our increasingly distributed and complex systems. They are great for understanding system behaviour, looking for bottlenecks and opportunities for tuning, and investigating incidents.
What are traces *not* good for?
Most of the time we sample our traces, so they're not a good fit for capturing the exact sum or count of something occurring in your system. Like logs, they also don't tell us about trends or patterns over time that span many different events (without first converting trace or span data into metrics).
Events are great, but sometimes we want to know about trends or patterns which cover many events over time. For example, rather than seeing an exhaustive list of HTTP requests that occurred recently, I might really just want to know the total number of requests that our service has received in the last hour. This numeric summary of many events over a time interval is a metric.
In other words, a metric gives us statistical information about events that have occurred.
Let's go back to our café metaphor. An example metric we might want to capture is the total amount of money that we've made in the last ten minutes. We could potentially calculate this metric from the cash register log we defined earlier:
In the diagram above we have gone through our log and added up the three purchase amounts that we've had in the last ten minutes. This total amount is our metric.
Other kinds of metrics we might want to track in this example are the count of customers per hour or the average or percentile time taken to process each order.
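As a sketch, here's how that revenue metric could be derived from an in-memory version of the transaction log. The log entries and the `revenue_last` helper are hypothetical, just to show aggregation over a time window:

```python
from datetime import datetime, timedelta

# A hypothetical cash-register log, one record per sale.
transactions = [
    {"time": datetime(2024, 5, 1, 9, 2), "item": "latte",      "amount": 5.00},
    {"time": datetime(2024, 5, 1, 9, 5), "item": "espresso",   "amount": 3.50},
    {"time": datetime(2024, 5, 1, 9, 9), "item": "flat white", "amount": 4.50},
    {"time": datetime(2024, 5, 1, 8, 30), "item": "tea",       "amount": 3.00},  # outside window
]

def revenue_last(log, now, window=timedelta(minutes=10)):
    """Sum the sale amounts recorded inside the time window — a metric."""
    return sum(t["amount"] for t in log if now - t["time"] <= window)

now = datetime(2024, 5, 1, 9, 10)
print(revenue_last(transactions, now))  # 13.0 — the three sales in the last ten minutes
```

Notice that the metric throws information away: the 13.0 tells us nothing about which drinks were sold, only the total.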
What are metrics *good* for?
Metrics are good for tracking trends over time. They help us make sense of how overall systems or services are changing. They are also (generally) cheap to store and retrieve.
What are metrics *not* good for?
To calculate a metric we need to aggregate information about our events, which means we lose detail. It's important to understand how each metric was calculated, what it tells you, and what it doesn't. Metrics are commonly misunderstood or misinterpreted, and defining or interpreting them requires a basic knowledge of statistics.
Metrics also do not help us understand a particular customer interaction. For that we need to look at the raw event in question using logs, traces, or profiles.
Like logs, profiles help us understand what is happening within a single component. The difference is that they give us detailed diagnostics about what is happening under the covers, ideally helping us pinpoint where time is being spent or where an issue is occurring right down to the line of code.
A profile is a bit like a distributed trace that spans the internals of a single component. This is the focus of APM (application performance monitoring) tooling.
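In Python, for instance, the standard library's `cProfile` module produces this kind of per-function breakdown. The espresso-themed functions below are stand-ins for real application code, not part of any profiling API:

```python
import cProfile
import io
import pstats

def grind_beans():
    # Deliberately heavy work so it shows up in the profile.
    return sum(i * i for i in range(50_000))

def pull_shot():
    grind_beans()
    return [i ** 0.5 for i in range(20_000)]

profiler = cProfile.Profile()
profiler.enable()
pull_shot()
profiler.disable()

# Print the five functions with the largest cumulative time.
stats = pstats.Stats(profiler, stream=io.StringIO())
stats.sort_stats("cumulative").print_stats(5)
```

The report attributes time to individual functions (and their call sites), which is exactly the "down to the line of code" detail that logs and traces don't give you.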
At this point my café metaphor breaks down a bit. It's a stretch, but one example might be turning on profiling on our espresso machine so that each internal event that occurs is logged. This might help us figure out if there is a particular part of the machinery or process which is slow or misbehaving:
Profiles are not new, but they are a hot topic of conversation at the moment: profiling is being considered as a new telemetry type within the OpenTelemetry standard, and eBPF now provides a non-intrusive way to collect profiling information at the operating system level.
In software monitoring, we have traditionally relied on logs and metrics to try to make sense of what is happening in our systems. This is becoming increasingly difficult as the systems we build and operate become more distributed.
Distributed tracing is not an easy thing to set up, but I think it's an effective tool for managing the distributed and dynamic nature of modern software systems. I encourage you all (and myself...) to have a go at implementing tracing in the systems that matter the most to you.
My parting word is that telemetry is not observability. It is just different types of data. Observability is the wider practice of being able to understand the internal state of your systems and validate whether they are helping you achieve your organisational objectives.