September 18, 2019

Fill SCOM gaps with Azure Monitor

We are delighted to welcome guest blogger Brian Wren of Microsoft. Brian is something of a superstar in the SCOM world – and you will see why when you check out his technical guides on TechNet or Microsoft Learn. His series on Management Pack Authoring for SCOM is a definitive guide that has helped many of us get started with a difficult topic. It is fair to say that when Brian works his magic, he makes the complicated perfectly understandable.

As Microsoft's focus shifts into the cloud, Brian is now doing for Azure what he has historically done for the SCOM community. We are privileged to feature his work and hope you enjoy this special blog.

Keep reading for Brian’s insights on how to bring Azure Monitor and SCOM together.

Fill SCOM Gaps with Azure Monitor

By Brian Wren, Microsoft

I spent a number of years working with management packs in System Center Operations Manager, and as much functionality as we were able to achieve, there were a couple of challenges that always nagged at me.

First is analyzing data across multiple resources. Because each workflow in a management pack is independent without any visibility into other workflows, it’s difficult to implement logic across them. There are some clever strategies using dependency monitors, but you’re limited to pretty specific scenarios.

A colleague told me recently that he had simplified his definition of critical issues down to those that should wake him at 2 AM. You might think that a virtual machine completely offline would fit that description, but in today’s world of redundancy, a single machine could probably wait until a decent hour. A bunch of VMs though could be a more significant problem. How can you analyze for that scenario?

Second is the ability to interactively work with your data. SCOM is outstanding at detecting problems, measuring health, and firing alerts. But once the data’s stored in your management group, it can be a challenge to perform custom and ad hoc analysis. We have views and reports, but it would be great if we had a more interactive way to dig into that data to identify the root cause of issues and spot trends over time.

Enter Azure Monitor

Azure Monitor is Microsoft’s full-stack monitoring service that runs in the Azure cloud. It surprises me how many SCOM users I talk to at conferences who haven’t yet tried it out. Seriously, if you haven’t already tried it, create a new workspace with a 5 GB limit for free. You can either connect individual agents directly to the workspace, or connect your management group and select which of its agents send data. Or if you just want to play around with someone else’s data, try out Microsoft’s demo environment, which I’m going to be using for the rest of this post.

The Logs feature of Azure Monitor works fundamentally differently from SCOM, which makes the two particularly complementary. A workflow in SCOM defines its data source, whether that data is interesting, and what action to take based on that data. The most common actions are to set a health state, fire an alert, or simply collect the data. Once the workflow is done, you probably aren’t going to do anything else with that data other than display it in a view.

In Azure Monitor, you start by collecting your data without thinking yet about how that data is going to be used. Since it’s built in the cloud, scalability isn’t an issue, and you can be more liberal about your data collection. All data collected into Azure Monitor is stored together in a Log Analytics workspace, and features of Azure Monitor such as alerting, dashboards, and log queries initiate actions from that collected data in the workspace. Having all that data together in one place solves the issue of analyzing resources together, and Log Analytics along with Kusto Query Language (KQL) is what is going to let us analyze it.

Better Together

I know “better together” has become kind of a cliché phrase, but it really is true in this case, as the fundamentally different way that Azure Monitor works makes it a perfect complement to SCOM and addresses both of those challenges that I mentioned.

You can connect VMs directly to Azure Monitor using the same agent that SCOM uses, or connect your management group to your Log Analytics workspace in Azure Monitor so that event and performance data from your SCOM agents are also collected in Azure Monitor Logs. I’m not going to get into the technicalities of how you make that happen since you can read about that on your own. My goal here is to make an argument for why you should try it out, and why you really want to start getting familiar with KQL to analyze the data that Azure Monitor collects.

Enter KQL

Azure Monitor Logs is built on top of Azure Data Explorer, which used to be known as Kusto. That’s where we get the name for Kusto Query Language, or KQL as it’s better known. I’ll do a brief intro of the language here, but you can use sources like Get started with Azure Monitor log queries or the Pluralsight training videos to really get started.

One of the great things about learning KQL is the variety of places you can use it. Multiple Azure services such as Azure Security Center, Azure Sentinel, and Network Watcher use the same Log Analytics workspace that Azure Monitor uses, meaning that you can analyze all that data together. Other places that you can leverage your KQL skills include Azure Resource Graph, CMPivot, and Windows Defender ATP. I’m just going to focus on Log Analytics using data from SCOM here, but know that most of these concepts are transferable.

KQL Basics

The beauty of KQL is its ability to mix ease with power. You can be useful almost immediately and start to create queries with multiple commands after working with it for just a few minutes. Once you have that basic query structure down, it’s just a matter of stringing together the right KQL commands to perform any advanced functionality you need.

Data in Azure Monitor Logs is organized into tables, each with its own set of properties. For example, the Event table holds data from the Windows Event Log, something you’re very used to working with in SCOM. Viewing those records is as simple as typing the name of the table, and this is about the simplest query you can run.
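As a minimal sketch, the simplest possible query is just the table name. It returns every record in the Event table for the time range selected in Log Analytics.

```kusto
// Return all records in the Event table for the selected time range.
Event
```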

Log Analytics has a variety of cool features to work with the data returned by the query. You can look at details of each record, sort, group, filter, etc. I’ll let you play with those on your own though since I want to focus on the query itself.

Filtering for specific records is just a matter of adding a Where clause, and you can see from the example that KQL looks a lot like PowerShell, with commands separated by a pipe character (|).
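A filter of this kind might look like the following sketch. It assumes you are interested in error events; EventLevelName is a standard column in the Event table, and the value "Error" is chosen only for illustration.

```kusto
// Keep only the events logged at the Error level.
Event
| where EventLevelName == "Error"
```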

You can string together any number of commands into a full query. The output from each command gets sent as input to the next.
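Here is a sketch of a longer pipeline, assuming you want error events from the last day sorted by computer. The one-day window and the projected columns are illustrative choices, not requirements.

```kusto
// Each line takes the output of the previous line as its input.
Event
| where TimeGenerated > ago(1d)        // only the last 24 hours
| where EventLevelName == "Error"      // only error events
| project TimeGenerated, Computer, Source, EventID, RenderedDescription
| sort by Computer asc                 // order the results by computer name
```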

But you want to do more than just retrieve data. Where the power of Kusto starts to really kick in is when you process and aggregate that data. Instead of just sorting those results by computer, how about summarizing them? In fact, if we’re going to summarize them, why not display the result as a pie chart?
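Summarizing and charting could be sketched as follows, again assuming error events from the last day. The column name ErrorCount is just a label picked for this example.

```kusto
// Count error events per computer and show the result as a pie chart.
Event
| where TimeGenerated > ago(1d)
| where EventLevelName == "Error"
| summarize ErrorCount = count() by Computer
| render piechart
```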

I mentioned above the specific scenario of identifying a number of computers offline. For that, we can use the Heartbeat table which receives a heartbeat from every agent once per minute. If we don’t see a heartbeat for a while, then we can assume the agent is offline. The following looks for the last heartbeat from every computer and checks to see if any are older than 10 minutes.
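A minimal sketch of that heartbeat check follows. The 10-minute threshold comes from the description above; LastHeartbeat is a column name chosen for this example. Note that the query only considers computers with at least one heartbeat inside the selected time range.

```kusto
// Find the most recent heartbeat per computer,
// then keep only computers whose last heartbeat is older than 10 minutes.
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(10m)
```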

You could add this query to a log alert rule, which will allow you to specify a threshold based on the number of records returned, in other words, the number of computers offline. These are simple criteria, but if you look at all the properties available in the Heartbeat table, it should be pretty obvious how you could create different thresholds for different sets of computers.
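One way to split computers into sets is to group by another Heartbeat property. The sketch below assumes grouping by operating system using the OSType column; any other property in the table could be used the same way, and the alert thresholds themselves would be configured in the alert rule rather than in the query.

```kusto
// Count offline computers separately for each operating system type.
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer, OSType
| where LastHeartbeat < ago(10m)
| summarize OfflineComputers = count() by OSType
```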

Additional functions

Everything that I’ve shown so far is very useful but still basic functionality. The great thing about KQL is the plethora of advanced functionality that you can get while still keeping the structure of your query relatively simple. You can spend all day perusing the language reference for interesting commands. Let’s take a quick look at a few that are particularly fun.

Parse makes it easy to evaluate a string and pick out individual values.
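To illustrate parse without depending on any particular workspace, this sketch uses a small inline datatable with made-up sample messages in the style of service state-change events. The sample strings and the extracted column names ServiceName and ServiceState are assumptions for illustration only.

```kusto
// Sample data standing in for event descriptions (hypothetical values).
datatable(RenderedDescription:string)
[
    "The Print Spooler service entered the stopped state.",
    "The Windows Update service entered the running state."
]
// Pull the service name and its new state out of each string.
| parse RenderedDescription with * "The " ServiceName " service entered the " ServiceState " state." *
| project ServiceName, ServiceState
```

Against real data you would replace the datatable with a table such as Event and parse whichever string column holds the text you care about.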

Reduce will search through a data set and attempt to group together similar strings which gives you a quick way to identify significant patterns in your data. The example below quickly shows that the bulk of our Syslog entries come from a particular event that you should probably investigate.
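A sketch of reduce against Syslog might look like this, assuming a one-day window. SyslogMessage is the message column in the Syslog table, and reduce returns its patterns along with a count for each.

```kusto
// Group similar Syslog messages into patterns and list the most common first.
Syslog
| where TimeGenerated > ago(1d)
| reduce by SyslogMessage
| sort by Count desc
```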

Externaldata retrieves data from an external source. So in this example, we can break down heartbeat by region that we have in a text file sitting in a GitHub repo.
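The following is a sketch only. It assumes a CSV file with two columns, computer name and region, and the URL is a placeholder that you would replace with the location of your own file. The names ComputerRegions and Region are hypothetical.

```kusto
// Load a computer-to-region mapping from an external CSV file.
// The URL below is a placeholder, not a real file.
let ComputerRegions = externaldata(Computer:string, Region:string)
[
    "https://raw.githubusercontent.com/<your-account>/<your-repo>/master/regions.csv"
]
with (format="csv");
// Join the mapping to the heartbeat data and count computers per region.
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| join kind=leftouter (ComputerRegions) on Computer
| summarize Computers = count() by Region
```

Computers that have no entry in the file end up with an empty Region because of the leftouter join.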

Performance Data

What about performance data that you collect from your SCOM devices? This data is collected in Azure Monitor into a table called Perf and has familiar properties like ObjectName, CounterName, and InstanceName to identify individual counters. We can start by listing the objects and counters available to us. This isn’t particularly useful, but it’s at least interesting to see what we have available. It’s a big list so I’m not showing it all here, but you get the idea. It looks just like the performance data you have in SCOM.
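Listing the available objects and counters can be sketched with distinct:

```kusto
// List each unique combination of performance object and counter.
Perf
| distinct ObjectName, CounterName
| sort by ObjectName asc, CounterName asc
```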

Of course, you can graph performance data in a view in SCOM, but Azure Monitor adds the ability to interact with the data using a variety of custom algorithms. We always tend to use processor and memory in these examples since they’re values we all understand, so I’ll use those in a few examples.

How about comparing maximum processor utilization per hour between multiple VMs?
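A comparison like that might be sketched as follows. It assumes the total processor counter is collected under the object "Processor", counter "% Processor Time", and instance "_Total"; the exact names depend on what your workspace collects.

```kusto
// Maximum processor utilization per hour, one line per computer.
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize MaxCPU = max(CounterValue) by Computer, bin(TimeGenerated, 1h)
| render timechart
```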

A slight change to that query, and we look at different aggregations for a metric on the same VM. Log Analytics actually provides some nice features that let you view different aggregations on the fly, but as I mentioned before, I’m limiting my discussion here to the language itself.
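As a sketch of that change, the query below filters to a single computer and calculates three aggregations at once. The computer name ContosoAzLnx1 is taken from the demo environment discussed in this post; substitute one of your own.

```kusto
// Minimum, average, and maximum processor utilization per hour for one computer.
Perf
| where Computer == "ContosoAzLnx1"
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize MinCPU = min(CounterValue), AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by bin(TimeGenerated, 1h)
| render timechart
```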

How about if we want to see the maximum utilization by day for each computer and then do a percentile breakdown? In the following example, the utilization for ContosoAzLnx1 is 72 or less on 95% of the days measured. On only 5% of the days is it under 34.
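A breakdown of this kind can be sketched in two summarize steps: first the daily maximum per computer, then percentiles over those daily values. The percentiles 5, 50, and 95 and the name MaxDaily are illustrative choices.

```kusto
// Step 1: maximum processor utilization per computer per day.
// Step 2: 5th, 50th, and 95th percentile of those daily maximums per computer.
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize MaxDaily = max(CounterValue) by Computer, bin(TimeGenerated, 1d)
| summarize percentiles(MaxDaily, 5, 50, 95) by Computer
```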

Get a bit more complex, and we can join up multiple counters. Here’s processor and memory together.
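One possible sketch joins hourly averages of the two counters on computer and time. The memory counter name is an assumption: Windows agents typically report "% Committed Bytes In Use" while Linux agents report "% Used Memory", so adjust it to match your data.

```kusto
// Hourly average processor utilization per computer...
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize AvgProcessor = avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
// ...joined to hourly average memory utilization for the same computer and hour.
| join kind=inner (
    Perf
    | where ObjectName == "Memory" and CounterName == "% Committed Bytes In Use"
    | summarize AvgMemory = avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
) on Computer, TimeGenerated
| project TimeGenerated, Computer, AvgProcessor, AvgMemory
```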

These examples may be only moderately useful by themselves, but what I’m trying to show here is the flexibility of KQL and the type of interactive experience you can have with your SCOM data when you add Azure Monitor to its already exceptional functionality. Hopefully as you’re walking through these examples, you’re thinking about issues in your own environment and different algorithms you could use to analyze your data in ways that you haven’t before.

Trust me, you can’t think of logic that KQL can’t solve.