SRE Trends from AWS re:Invent 2022
In November/December 2022 I attended AWS re:Invent in Las Vegas. It was certainly an experience for this small-town kid from New Zealand, and one I took a lot away from.
While I was at the conference, I took the time to walk around and take notes. In this article I will share the trends that I observed which I think will have an impact on SRE work in 2023 and beyond, including:
- Serverless computing
- Observability data lakes (and data meshes)
- Topologies (technology maps)
- FinOps (cloud cost management)
Serverless Computing
The use of serverless compute platforms like AWS Lambda is on the rise, and I think this has a pretty significant impact on the work that SREs do. When there are no servers or containers, what is it that SREs monitor? How do you know whether a service or application is healthy? What do you look at when issues or incidents occur? Is the technology mature enough for business critical production workloads? Is the lack of visibility of what's really happening under the covers a concern?
I don't have answers to these questions *yet*. SquaredUp is built using Lambda, and next year I will be on-call to support the product. This will be a chance for me to tackle this challenge myself and report back what I learn and experience.
Observability: Data Lake vs Data Mesh
Many observability vendors are implementing centralized storage of metrics, logs, and traces in what is essentially a data lake. For example, Dynatrace have rearchitected their product to achieve this with their new Grail database. I think part of this is driven by the wide adoption and continued maturity of OpenTelemetry. Now that we are collecting all of this data, attention is turning to making sense of it. Having all our observability data stored in the same place in a consistent way can help us build relationships between our metrics, logs, and traces.
However, I think there is a major drawback to centralizing telemetry data in the same place. Ideally, we want to give teams the autonomy to pick whatever tooling suits their unique needs. This is important to empower teams to solve problems themselves and to have ownership of their services. The impact this can have on productivity and morale cannot be overstated. Forcing teams to use a particular tool can have significant negative impacts.
There is also the recurring question of vendor lock-in. Once all your observability data is stored in one place, it becomes less appealing to move to a different tool or technology. OpenTelemetry helps here by allowing you to direct your data to any number of endpoints, but it doesn't let you lift and shift historic data from one tool to another.
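As a sketch of what that fan-out looks like in practice, an OpenTelemetry Collector pipeline can copy the same telemetry to multiple backends at once (the vendor names and endpoint URLs below are placeholders, not real services):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  # Hypothetical backends -- swap in whatever tools your teams actually use.
  otlphttp/vendor-a:
    endpoint: https://telemetry.vendor-a.example/otlp
  otlphttp/vendor-b:
    endpoint: https://telemetry.vendor-b.example/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      # The same spans are sent to both backends, so switching (or trialling)
      # a new tool doesn't mean re-instrumenting your services.
      exporters: [otlphttp/vendor-a, otlphttp/vendor-b]
```

Note this only helps with data collected from now on; historic data already sitting in one backend stays there.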
There is an alternative approach that I think combines the benefits of centralizing data with giving teams the autonomy to pick their own tools. With the "data mesh" approach, your metric, log, and trace data is spread over multiple tools - but you pull it all together in real time to see the full picture. This works by storing metadata about the data you pull in, so you can build connections and relationships between things stored in different tools and different places. To the end user, it feels like the data is all stored in the same place.
For example, let's say you pull in resource monitoring from a server using one tool, and you also pull in application logs from a service running on the same server from a different tool. With the data mesh architecture mapped using metadata, you could see that these two things are related (the app sits on the server). This is the approach that SquaredUp uses, and I think it's pretty unique. It means you can continue to give teams the autonomy to pick the tools that meet their needs best, while still being able to pull all the data together to see the full picture.
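To make that server-and-logs example concrete, here is a toy sketch of the idea: telemetry stays in separate tools, but a shared piece of metadata (a "host" tag here) lets us relate records on read. The tool shapes and field names are invented for illustration, not any vendor's actual API.

```python
# Records as they might arrive from two different tools, each carrying
# "host" metadata that acts as the join key across tools.
cpu_metrics = [  # pulled live from a metrics tool
    {"host": "web-01", "metric": "cpu_percent", "value": 87},
    {"host": "web-02", "metric": "cpu_percent", "value": 23},
]

app_logs = [  # pulled live from a separate logging tool
    {"host": "web-01", "level": "ERROR", "message": "request timed out"},
    {"host": "web-03", "level": "INFO", "message": "started"},
]

def correlate(metrics, logs):
    """Join records from different tools via the shared 'host' metadata."""
    logs_by_host = {}
    for log in logs:
        logs_by_host.setdefault(log["host"], []).append(log)
    return [
        {
            "host": m["host"],
            "metric": m,
            "related_logs": logs_by_host.get(m["host"], []),
        }
        for m in metrics
    ]

view = correlate(cpu_metrics, app_logs)
# web-01's high CPU now appears alongside its ERROR log, even though the
# two records live in different tools.
print(view[0]["host"], len(view[0]["related_logs"]))  # web-01 1
```

The point is that neither tool had to change: the relationship lives in the metadata layer, so teams keep their own tools and the full picture is assembled at read time.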
Technology Maps (Topologies)
I saw the word "topologies" everywhere I went at the conference. Observability, security, incident management, and analytics vendors alike are implementing visual maps to help you understand and observe your technology solutions. These visual guides to the structure and behaviour of what we build and operate are proving to be an effective way to keep on top of continually increasing complexity.
There is something reassuring about seeing a map that you can understand, like having Google Maps giving you driving directions in a town or city you've never visited before. I think back to when I was a performance testing consultant. Often the first thing I did on a new engagement was to go around and speak to architects and tech leads, to digest documentation, and to build a clear solution diagram. This frequently turned out to be the first time anyone had pieced everything together into one view. Many tool vendors are attempting to automate this kind of analysis for you.
FinOps (Cloud Cost Management)
There were a lot of vendors selling tools to help you manage your cloud costs. I remember several years ago I expanded my definition of performance from "response time", "capacity", and "stability" to include "cost" (for the cloud). This thinking has definitely gone mainstream. Many organizations who have moved to the cloud are facing cloud bills that are spiralling out of control (often due to lifting and shifting systems designed for on-premises infrastructure).
I spoke to someone who attended an AWS session on FinOps. The AWS strategy to manage cloud spend was to offer discounts and promotions. In other words, if you use a particular AWS service then you can get a discount on another service. This is, putting it mildly, a disappointing approach to FinOps.
I think FinOps should be about understanding where your costs are, and then tuning and improving your code, configuration, and design to improve efficiency. As SREs, we will likely be increasingly involved in FinOps - especially by including cloud cost in our observability.
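One way to treat cost as a first-class observability signal is to track unit cost: spend attributed to a service divided by the useful work it did. A minimal sketch, with entirely made-up service names and figures:

```python
# Hypothetical daily figures per service -- in practice these would come
# from your billing data (tagged by service) and your request metrics.
daily_spend_usd = {"checkout-service": 412.50, "search-service": 97.20}
daily_requests = {"checkout-service": 1_250_000, "search-service": 3_400_000}

def cost_per_million_requests(service: str) -> float:
    """Unit cost: dollars spent per million requests served."""
    return daily_spend_usd[service] / (daily_requests[service] / 1_000_000)

for service in daily_spend_usd:
    print(f"{service}: ${cost_per_million_requests(service):.2f} per 1M requests")
```

A dashboard of unit cost over time makes efficiency regressions visible the same way latency dashboards make performance regressions visible.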
Those were the main observations I made, but I also took note of the following:
- There is still a lack of maturity and understanding of SLOs. Most organisations don't appear to be getting real value out of adopting SLOs (yet). If there are no consequences for breaking SLOs (i.e. no enforced error budgets), then what was the value of defining them? Traditional org structures and architectures are also making SLO adoption difficult.
- I'm seeing DORA metrics (measures of operational and delivery maturity) becoming something we track as part of our observability. I think this is a great example of "bigger picture" observability. It helps us understand the relationship between the maturity of our ways of working and the level of reliability we achieve, the customer experience, and whether or not we're meeting business objectives.
- Cloud security is everywhere. It was perhaps the most common theme across re:Invent. I think, as SREs, we have to be mindful of cognitive load, and becoming a security expert is probably a stretch too far. However, there's no avoiding the need to have a basic understanding of cloud security risks, patterns, and anti-patterns.
- Observability seems to still be mostly owned by "observability people". It's meant to be for the developers and engineers who build and operate services. I don't think we are reaching this audience effectively yet; there is a lot of work left to do.
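To make the error-budget point above concrete: an SLO implies a budget of allowed failures, and consuming that budget is what should trigger consequences. A minimal sketch, with illustrative numbers:

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int):
    """Compute the error budget implied by an SLO and how much has been spent.

    An SLO of 99.9% allows 0.1% of requests to fail; that allowance is the
    error budget. Burning it faster than planned is the signal that should
    carry consequences (e.g. pausing feature work to focus on reliability).
    """
    allowed_failures = round((1 - slo_target) * total_requests)
    spent = failed_requests / allowed_failures if allowed_failures else float("inf")
    return allowed_failures, spent

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures.
allowed, spent = error_budget(0.999, 1_000_000, 250)
print(allowed, spent)  # 1000 0.25 -- a quarter of the budget spent
```

Without an agreed policy for what happens as `spent` approaches 1.0, the SLO is just a number on a dashboard - which is the gap I think most organisations are still in.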
I got a lot out of traveling to re:Invent, but there's no denying that the conference was tool and vendor centric. It's important to keep in mind that no tool is going to provide you great outcomes by itself. There is no avoiding the need to understand your unique context and to build your own approach to operate your services in a sustainable and reliable way.
Looking forward to re:Invent 2023. Hope to see you there!
This article was inspired by episode 35 of the Slight Reliability podcast. If you'd like to hear how the various observability vendors define observability then give episode 34 a listen.