A CTO's dashboard: Application overview

I created this CTO Overview dashboard to give myself and the engineering management team a high-level view of the availability and performance of our SquaredUp SaaS application.

Richard Jones

Chief Technology Officer, SquaredUp

Dashboard preview

Challenge

As Chief Technology Officer, I’m responsible for making sure our SquaredUp application, and the services it depends on, are running smoothly for our customers.

Sure, we know which tools and metrics to monitor, but finding a way to efficiently check these without spending time clicking between disparate interfaces is a challenge.

Like many businesses, we’ve opted for a multi-cloud architecture utilizing both AWS and Azure. We use IsDown (https://isdown.app) to avoid a lot of engineering time and effort spent monitoring the status of these cloud dependencies independently.

What we needed was a real-time, aggregated view of our critical metrics, so we can see at a glance how our services are performing, and how that’s affecting SquaredUp.

Solution

With SquaredUp, we can combine data from AWS, Azure and IsDown to create the ultimate management view of our application availability and performance, all in one dashboard.

Using the pre-built AWS, Azure and IsDown plugins, we effortlessly streamed our critical metrics on demand, without wasting time, effort, and money creating yet more data silos.

As a single view of our most important KPIs, this dashboard provides a high-level view of the health-state of our cloud application, a summary of our cloud spend across both providers, cloud service status from IsDown, and much more.

This enhanced visibility allows me to take a quick glance at the dashboard each morning to check the performance of our app components, drill down into each tile to get more information, and take fast action when necessary.

Dashboard walk-through

The first tile on the dashboard uses our new live diagramming feature to show the health state of the AWS and Azure services our app runs on.

We used diagrams.net to create the SVG, which shows the architecture of our application. We added hyperlinks to the diagram elements that link to SquaredUp objects with health state.

You can see that our Azure B2C auth is showing as amber. Clicking into it opens a dedicated dashboard with greater detail. In this case, login latency was higher than expected, so we addressed that over the course of the day.

We’ve gone for a simple application view, but you can use live diagramming to show maps, office locations and more.

This tile ultimately serves as an aggregated, correlated, real-time view of our application infrastructure to understand health at a glance.

API error rate shows the number of AWS Lambda invocations that result in a function error. These might include exceptions the code throws, or exceptions the Lambda runtime throws, such as timeout and configuration errors. We chose to visualize this metric using a gauge tile so we could see how close we’re getting to our alert thresholds.

With monitoring enabled on this tile, we set a warning threshold of 0.5%, so we’re notified if the error rate goes above that. The threshold isn’t a universal value; we chose it after observing error rates over a long period of time to identify a sensible baseline.
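As a rough illustration of the check behind a threshold like this, here is a minimal Python sketch. It is not how SquaredUp evaluates monitors internally; the `LambdaWindow` type and function names are hypothetical, and it assumes the error rate is function errors divided by invocations over a monitoring window.

```python
from dataclasses import dataclass

# Warning threshold from the text, expressed in percent.
WARNING_THRESHOLD_PCT = 0.5


@dataclass
class LambdaWindow:
    """Hypothetical container for one monitoring window of Lambda metrics."""
    invocations: int
    errors: int


def error_rate_pct(window: LambdaWindow) -> float:
    """Return function errors as a percentage of invocations (0.0 when idle)."""
    if window.invocations == 0:
        return 0.0
    return 100.0 * window.errors / window.invocations


def is_warning(window: LambdaWindow,
               threshold_pct: float = WARNING_THRESHOLD_PCT) -> bool:
    """True when the error rate is strictly above the warning threshold."""
    return error_rate_pct(window) > threshold_pct
```

With these assumptions, 10 errors in 2,000 invocations sits exactly on the 0.5% line and does not warn, while 11 errors does.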

The API latency tile is independent of API errors and tells us how long our API is taking to return data. We set a threshold at 1.5 seconds (1,500 milliseconds), as ideally we’d like the average API response to take less than that. As with API errors, we’ve enabled monitoring and notifications on this tile, so we are alerted as soon as API latency exceeds 1.5 seconds. This threshold is also a judgment call, as acceptable latency depends on the operation and the amount of data involved.
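The latency check can be sketched the same way. This is an illustrative Python snippet under stated assumptions, not SquaredUp's implementation: the function names are hypothetical, and it assumes latency samples arrive in milliseconds and are averaged before being compared with the 1.5 second threshold.

```python
from statistics import mean
from typing import Optional, Sequence

# 1.5 seconds, as stated in the text, expressed in milliseconds.
LATENCY_THRESHOLD_MS = 1500


def average_latency_ms(samples_ms: Sequence[float]) -> Optional[float]:
    """Return the mean latency in milliseconds, or None when there is no data."""
    if not samples_ms:
        return None
    return mean(samples_ms)


def latency_breached(samples_ms: Sequence[float],
                     threshold_ms: float = LATENCY_THRESHOLD_MS) -> bool:
    """True when the average latency is strictly above the threshold."""
    avg = average_latency_ms(samples_ms)
    return avg is not None and avg > threshold_ms
```

Averaging means a single slow call does not trigger an alert by itself, which fits a threshold that varies with the operation and data volume.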

This tile uses red, amber and green (RAG) status blocks to show the health of our Azure services at a glance. Beyond the RAG indicator, each block on this tile includes a second line of text for further context. For example, Azure B2C is showing as amber, so we know there’s something going on with our identity service. When we click into it, we’re taken to a dedicated Azure B2C dashboard to investigate the issue further and take immediate action.

Monitoring is enabled on this tile, so as soon as the health of any of these services degrades, we’ll receive notifications at any destination we choose: email, Slack, Teams, ServiceNow, Zapier or even a custom webhook.
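To make the idea of RAG health and "degraded" concrete, here is a small Python sketch. It is a simplified model under our own assumptions, not SquaredUp's monitoring logic: the `Health` enum and both functions are hypothetical, and a service counts as degraded when its state is worse than in the previous snapshot.

```python
from enum import IntEnum
from typing import List, Mapping


class Health(IntEnum):
    """RAG states ordered so that a larger value means worse health."""
    GREEN = 0
    AMBER = 1
    RED = 2


def worst_health(states: Mapping[str, Health]) -> Health:
    """Roll up per-service states to the single worst state (green if empty)."""
    return max(states.values(), default=Health.GREEN)


def degraded_services(previous: Mapping[str, Health],
                      current: Mapping[str, Health]) -> List[str]:
    """Names of services whose health is worse now than in the last snapshot.

    Services missing from the previous snapshot are treated as green.
    """
    return sorted(
        name for name, state in current.items()
        if state > previous.get(name, Health.GREEN)
    )
```

A notification step would then run only for the names returned by `degraded_services`, so a service that stays amber does not alert repeatedly.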

This bar chart shows the number of Azure Relay Hybrid Connections by day. Our relay agents use these Hybrid Connections to connect securely to data sources on internal networks and retrieve data about on-prem resources.

This simple table pulls data directly from IsDown to show, at a high level, the status of our cloud service dependencies. It shows what we’re monitoring and where the data is coming from. As SquaredUp allows you to link to just about anything, we linked straight to IsDown’s status page for more information.

Having this tile on the same dashboard saves us from checking AWS and Azure health separately, and often means we don’t need to visit the AWS and Azure status pages at all.

This request duration line graph shows both on-demand API and scheduled request duration over the course of a week. Monitoring our API request duration allows us to track our API performance and trends over time.

This simple donut tile shows the monthly spend on our most critical AWS and Azure services. The total spend is shown in the middle and is broken down by Azure and AWS, so we can see the spend for each cloud provider.

Usually with a multi-cloud set-up, a view like this wouldn’t be possible. But using the SQL analytics editor, we can easily generate cross-cloud summaries like this by running one query against Azure, one against AWS, and displaying the union of the results. To learn more, read our blog on how to surface multi-cloud costs.
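The "one query per cloud, then union" pattern can be sketched as follows. This is an illustrative Python example using an in-memory SQLite database, not the SquaredUp SQL analytics editor itself: the table names, column names and function are hypothetical stand-ins for the two per-provider query results.

```python
import sqlite3
from typing import Dict, Iterable, Tuple

CostRow = Tuple[str, float]  # (service name, monthly cost)


def monthly_spend_by_provider(aws_rows: Iterable[CostRow],
                              azure_rows: Iterable[CostRow]) -> Dict[str, float]:
    """Union two per-provider cost results and total the spend per provider."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical tables standing in for the AWS and Azure query results.
        conn.execute("CREATE TABLE aws_costs (service TEXT, cost REAL)")
        conn.execute("CREATE TABLE azure_costs (service TEXT, cost REAL)")
        conn.executemany("INSERT INTO aws_costs VALUES (?, ?)", aws_rows)
        conn.executemany("INSERT INTO azure_costs VALUES (?, ?)", azure_rows)

        # UNION ALL keeps every row; plain UNION would drop duplicate rows
        # and could under-count identical cost entries.
        query = """
            SELECT provider, SUM(cost) AS total
            FROM (
                SELECT 'AWS' AS provider, cost FROM aws_costs
                UNION ALL
                SELECT 'Azure' AS provider, cost FROM azure_costs
            )
            GROUP BY provider
            ORDER BY provider
        """
        return dict(conn.execute(query).fetchall())
    finally:
        conn.close()
```

Called with placeholder figures such as `[("Lambda", 100.0), ("DynamoDB", 50.0)]` for AWS and `[("B2C", 25.0)]` for Azure, it returns one total per provider, which is the shape a donut chart needs.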

The last tile on this dashboard shows a summary of any incidents that are impacting the AWS or Azure cloud regions we utilize. With data streamed directly from IsDown, we can display a RAG status indicator, the incident title, its severity, whether it is resolved, and a link to IsDown for more detailed information.

This is a super useful way of visualizing cloud service incidents at a glance and understanding any actions that may need to be taken, without having to navigate to the AWS or Azure status pages.

Create your free dashboard

This Engineering Health dashboard is not available out of the box, but you can easily build something similar yourself using Azure, AWS and IsDown plugins.

Simply create a free account to get started, or check out our video to see how easy it is to use our Dashboard Designer.

To see what other dashboards you can create, check out our Dashboard Gallery.
