We frequently get asked questions about how to maintain and optimize SCOM. So here are the top SCOM questions we get asked and answers from the experts.
- How to upgrade SCOM
- Daily, weekly, and monthly SCOM health checks
- Companies that do SCOM health checks
- Database health checks - daily, weekly, and monthly
- SCOM tuning tips
- Installing and removing SCOM management servers in SCOM 2019
- Modify the default SCOM alert descriptions to be more meaningful
- Restarting agent services or notifying someone when the agent stops working
- Database performance issues with TempDB growing fast
- Author Windows Service monitoring for specific computers
- Too many alerts when dealing with IIS MP and web servers that have many web sites and application pools
- Prevent flapping Alerts
- Move a group or multiple groups from one MP to another
- Data dropped because the maximum queue size was reached
- Automate the process of web URL monitoring, additions/removal from “Web application availability” Management Pack template
- Why the monitored servers didn’t come out of maintenance mode automatically, even though I set up the timeframe
How to upgrade SCOM
The first question we get is how to upgrade SCOM. We always advise running the latest version, and SCOM 2022 is the most recent release. You'll probably find our SCOM 2022 step-by-step upgrade guide useful.
Upgrading SCOM in-place only takes you one version up at a time. So, if you are going to upgrade to the latest version and you aren’t on the previous build, this requires multiple upgrades.
Sometimes, the leap between versions might be too big with too many steps, so it could be worth doing a side-by-side migration. Spin up a completely new SCOM on the newest version and move everything across.
The SCOM versions still available are:
- SCOM 2012
- SCOM 2012 R2
- SCOM 2016
- SCOM 2019
- SCOM 2022
Note that Microsoft ended mainstream support for SCOM 2016 in January 2022, so you should upgrade soon if you are on a version earlier than SCOM 2019.
Make sure you also look at the supported Windows and SQL versions you are running your SCOM infrastructure on.
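To make the "one version up at a time" point concrete, here is a minimal Python sketch that counts the hops between two versions in the list above. It is an illustration only: the function names and the `max_hops` threshold are our own assumptions, and the real supported upgrade paths should be checked against Microsoft's documentation.

```python
from __future__ import annotations

# Versions as listed in this article, oldest first.
SCOM_VERSIONS = ["2012", "2012 R2", "2016", "2019", "2022"]


def in_place_upgrade_path(current: str, target: str = "2022") -> list[str]:
    """Return every version you would pass through, one hop at a time."""
    start = SCOM_VERSIONS.index(current)
    end = SCOM_VERSIONS.index(target)
    if start > end:
        raise ValueError("target must not be older than current")
    return SCOM_VERSIONS[start + 1 : end + 1]


def suggest_strategy(current: str, target: str = "2022", max_hops: int = 2) -> str:
    """Suggest an approach; max_hops is an arbitrary illustrative threshold."""
    hops = len(in_place_upgrade_path(current, target))
    if hops == 0:
        return "already on target"
    return "in-place" if hops <= max_hops else "side-by-side"
```

For example, `in_place_upgrade_path("2016")` lists two hops (2019, then 2022), while starting from SCOM 2012 needs four, which is where a side-by-side migration starts to look attractive.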
What are the daily, weekly, and monthly checks that you should do in your SCOM environment?
As an Administrator, you should regularly check your SCOM environment.
- Start with daily checks of the health of SCOM itself. Check the infrastructure and agents to see how they are performing.
- You should also check that your management servers’ resource pool and other key resource pools are online and healthy as they run a lot of the internal key processes for SCOM. If those go offline or are failing back and forth between management servers, things are just not going to work like you’re hoping.
- Weekly, you need to be assessing the data volume that you create, and the volume of alerts created. These define how fast SCOM will work and how many problems you will have in either your environment or just looking at your ticketing system.
Are there recommended companies who will do SCOM health checks for us?
There are two companies we recommend for health checks: Cookdown and TopQore.
The Cookdown health check is script-driven. They send you a script to run in your environment, which gives you pointers on what to look for. You then send the script output back to Cookdown for analysis, and they book a meeting with you to run through the key points.
TopQore’s SCOM Health Check is quite extensive; it checks the implementation and configuration of all SCOM components and looks for technical errors or possible improvements. TopQore also investigates the monitoring processes and procedures in general, including dashboarding, training, idle processes and documentation, lifecycle management, and much more.
What database (DB) daily, weekly, or monthly health checks should I run?
To check the health of your database, look at how much config churn and data churn you have, using the default management packs and default reports, like the ones built in to SCOM itself. Take a look at the following:
- Operations Manager views
- The SCOM management SQL queries from Kevin Holman
- The ops manager self-maintenance MP (free from Cookdown)
What are your best tips for tuning SCOM?
Our best advice is: don’t copy what others have done. Tune SCOM to suit your environment and user needs.
First, have a conversation with the person or people who will be receiving the SCOM alerts. Find out what they need and want to see. Then, tune the alerts to your shop and the individuals on the receiving end. Simplify the alerts as much as possible at first: switch everything off, then add in what is needed. Cookdown EasyTune is great for configuring this.
It’s also helpful to use reporting to check which workflows are generating the largest volume of data and which are creating most of the ‘mess’ in SCOM - health changes, state changes, discoveries, alerts etc.
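As a rough illustration of that kind of report, the Python sketch below ranks workflows by how many items they generated. The input format, a list of `(workflow_name, kind)` pairs, is a hypothetical export we made up for the example; it is not a SCOM data structure.

```python
from collections import Counter


def noisiest_workflows(events, top_n=3):
    """Rank workflows by how many items (alerts, state changes, ...) they produced.

    `events` is an iterable of (workflow_name, kind) pairs from a
    hypothetical export; only the workflow name is counted here.
    """
    counts = Counter(workflow for workflow, _kind in events)
    return counts.most_common(top_n)
```

The workflows at the top of that list are usually the first candidates for tuning.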
Note that there may be alerts that open and close during the night, which you won’t see if you're not automatically forwarding them into your ticketing system. This may or may not be important to you.
If all alerts are put into a notification channel, you may want to delay alerts by a couple of minutes. That way, if any alerts open and close in 2-3 minutes, there won’t be a notification and that will keep things a little clearer.
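The delay logic can be sketched as follows. This is a plain Python illustration of the idea, not how SCOM notification subscriptions are implemented; the `Alert` type and the three-minute default are assumptions for the example.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Alert(NamedTuple):
    name: str
    opened: datetime
    closed: Optional[datetime]  # None means the alert is still open


def alerts_to_notify(alerts, now, delay=timedelta(minutes=3)):
    """Return alerts that were still open once `delay` had passed since opening."""
    due = []
    for alert in alerts:
        if now - alert.opened < delay:
            continue  # too young: decide on a later pass
        if alert.closed is not None and alert.closed - alert.opened < delay:
            continue  # opened and closed inside the delay window: stay quiet
        due.append(alert)
    return due
```

An alert that flaps open and closed within two minutes never produces a notification, while one that stays open past the delay does.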
Installing and removing SCOM management servers in SCOM 2019
Before you install a SCOM management server, make sure there is no SCOM Agent installed on the machine.
Best practice from Microsoft is to keep management servers close to the database in terms of network latency. They should be in the same data center.
When you remove the SCOM management server, just make sure that it is also removed from SCOM through the other management servers that are still there.
A pro tip for avoiding confusion: do not reuse management server names.
How to modify the default SCOM alert descriptions to be more meaningful
The default descriptions are sent with the management pack and are fixed. There is nothing built into SCOM to let you directly override the description on an alert. You can only edit it after the alert is sent.
Although the descriptions are usually very short, you can often find more information in the alert itself by opening the alert properties, the context, the health explorer, etc. There is also a paid-for add-on management pack for SCOM that will do this and even adds on the knowledge, which can be helpful.
But if you do want to edit the description, you can trigger a PowerShell script from the alert, extract the description, build your own version, and then send it via HTML or a different method. See this demonstrated in a SCOMathon session here.
Restarting agent services or notifying someone when the agent stops working
When the agent service stops working, you may want to automatically restart it or notify someone (for example, where SCOM admins don't have admin rights on the agent machines).
Using alerts is the best answer. When a health service goes down, the health service watcher, which runs on the management server, is monitoring it. If it misses three heartbeats from the agent, it issues a ‘health service heartbeat failed’ alert, and that creates a notification for you.
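The missed-heartbeat check amounts to simple arithmetic, sketched here in Python. The threshold of three comes from the text above; the 60-second heartbeat interval is an assumption for the example, so check the values configured in your own management group.

```python
def heartbeat_failed(seconds_since_last_heartbeat,
                     interval_seconds=60,   # assumed interval, check your settings
                     missed_allowed=3):     # three missed heartbeats, as described above
    """Return True once `missed_allowed` consecutive heartbeats have been missed."""
    missed = int(seconds_since_last_heartbeat // interval_seconds)
    return missed >= missed_allowed
```

With those values, an agent that has been silent for just under three minutes is still considered healthy, and the alert is raised once the third heartbeat is missed.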
On the agent side, you could set the Windows service to restart after failure. But you cannot use SCOM itself directly if the agent is down.
This article from Kevin Holman talks about the health service restarts and may be helpful.
Agent services do stop fairly often, usually because too many workflows are running, and the agent can then end up restarting itself all day. You don’t want that happening. We aren’t as concerned about a regular agent going down as about a management server: since it's a single agent, it's less systemic than the management server service stopping.
Database performance issues with TempDB growing fast
Experiencing database performance issues with TempDB growing fast and becoming inaccessible? Is having a shared SQL instance running on a SQL availability group fit for purpose for SCOM?
Consider doing a health check. This is important to uncover what's going on at the SQL layer. Then there’s also tuning data churn and config churn, which will affect SQL.
Note that it’s best practice to keep SCOM databases on separate SQL instances where no other databases are located for other products. This is due to the performance and busy nature of these databases, and the use of the TempDB by these databases. Both the SCOM database and the SCOM data warehouse are hitting the TempDB hard.
If your TempDB is on a shared SQL instance, it is shared with everything else on that instance. When a query does something like a Cartesian join, SQL uses TempDB to hold part of the intermediate result.
So it could also be that whatever else is on the instance is using up a large part of TempDB. You will need to dig into TempDB to see what is using it and investigate from there.
We recommend not hosting SCOM databases with anything else. If they must share, do so only with something like an Orchestrator database, which is also a System Center product. Don’t put them with a busy database like a Service Manager database, as that will also hit TempDB. And never put SCOM and SQL on the same machine.
To really get to the root of this, it will require some investigation because there are multiple possibilities and options that could cause this.
How can I author Windows Service monitoring for specific computers?
Use the template in the authoring pane for service monitoring.
To get the specific computers, start with an unsealed management pack and create a group beforehand, so you can put those computers in. Then, as you go through the authoring template, you'll be asked for a group to target. Pick your group with explicit membership and it will target just those computers. It’s all done through the SCOM UI with no XML or anything required.
Too many alerts when dealing with IIS MP and web servers that have many web sites and application pools
You will get a lot of alerts when there's an issue with the server and you have 50 websites, because there are 50 websites down. This does unfortunately generate one server alert, plus 50 website alerts, plus other alerts. But the monitoring is accurate.
One solution, if you're using VMs, is to split it into smaller web VMs. But that requires a lot of work.
Alternatively, try to get the server into Maintenance Mode quickly if the server issue is either predicted or from a change. Maintenance Mode could be a good way to prevent this storm.
Alternatively, if you're using PowerShell to handle your alerts, you could try to correlate the subsequent alerts to the parent alert and suppress them.
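The correlation idea can be sketched in a few lines. The example is written in Python purely to show the logic; it does not use the SCOM PowerShell cmdlets, and the alert dictionaries with `server`, `source`, and `kind` keys are a made-up format for illustration.

```python
from collections import defaultdict


def correlate(alerts):
    """Split alerts into (to_raise, suppressed).

    Each alert is a dict with hypothetical keys: 'server', 'source',
    and 'kind' ('server' for the parent alert, anything else for a child).
    """
    by_server = defaultdict(list)
    for alert in alerts:
        by_server[alert["server"]].append(alert)

    to_raise, suppressed = [], []
    for server_alerts in by_server.values():
        parents = [a for a in server_alerts if a["kind"] == "server"]
        children = [a for a in server_alerts if a["kind"] != "server"]
        if parents:
            to_raise.extend(parents)
            suppressed.extend(children)  # explained by the server-level alert
        else:
            to_raise.extend(children)    # no parent: genuine site-level issues
    return to_raise, suppressed
```

A server outage with 50 websites then produces one ticket instead of 51, while a single website failing on a healthy server still gets through.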
Finally, add dashboarding. If you have dashboarding, you can see the relevance of websites going down and can ignore test websites, for example, that are on the same machine. Then you can focus on what is really important for your business.
Is there an easy way to prevent flapping Alerts?
Because it is a rule, it should have suppression on the alerts so you only see one per server. But different types of workflows may be involved, which means you might see multiple alerts on a server. You should see a repeat count on those, so you only get one ticket per server if it happens. Also, if you don't close the alert automatically, each recurrence raises the repeat count on it.
Keep in mind that these are built-in rules for the agent itself and they do not have a repeated-events version, so you should look into what the failing script is as well. The other way to do this would be to create a new custom rule for repeated event detection.
Is there an easy way to move a group or multiple groups from one MP to another?
How hard it is to move a group or multiple groups from one MP to another depends on whether the pack is unsealed or sealed. If your clustered management pack is an unsealed pack, everything those groups were used for (every workflow that targets them, every override that targets them) will also be in that unsealed pack. Since it's unsealed, if you move the group, the things that were in the pack won't be able to target it.
If it was a sealed pack (for example, you had one major groups pack and wanted to split it into two or three groups packs, or into packs of groups by theme), you should be able to move those. But then again, every unsealed management pack that targets that cluster pack will now need to be rewritten to reference your new group pack.
There isn't an easy way to move the group and everything included because of the downstream effects. Everything that pointed to it now needs to be updated as well.
Data dropped because the maximum queue size was reached. Modifying the registry settings does not help much.
In the back end of SCOM, when you author a workflow you have your data sources, your probes, your write actions, and the data that gets passed between them, which goes into a queue. So you've probably got an issue where data is not being consumed from the queue quickly enough, which could be a database write issue, an authentication issue, or a multitude of other things.
Increasing the maximum size of the queue gives you longer before it fills, but if you aren't consuming data out of the queue at least as fast as you're putting it in, you will always overrun it eventually. So, look for a Windows error event logged before this one (something about a workflow having failed) and see what's happening with that workflow.
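A quick back-of-the-envelope calculation shows why a bigger queue only buys time. The Python sketch below uses abstract units (items and items per second) and invented numbers; it is not tied to the actual SCOM queue settings.

```python
import math


def seconds_until_full(queue_capacity, in_rate, out_rate, current_depth=0):
    """Time until a queue overflows, given fill and drain rates per second.

    Returns math.inf when data is consumed at least as fast as it arrives.
    """
    net = in_rate - out_rate
    if net <= 0:
        return math.inf
    return (queue_capacity - current_depth) / net
```

With 12 items per second arriving and 10 leaving, a queue of 1,000 items fills in 500 seconds; doubling the queue to 2,000 only moves that to 1,000 seconds. Fixing the slow consumer is what makes the result infinite.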
If you have an extremely busy registry, upping that queue size can help. But it may just be delaying the inevitable as it fills up.
Microsoft’s book, ‘Operations Manager Field Experience’, addresses a lot of these types of adjustments.
How can we automate the process of web URL monitoring, additions/removal from “Web application availability” Management Pack template?
Adding URLs to and removing them from the template specifically will be difficult because of the way the templates are modified from the UI. But Kevin Holman has a blog post on discovering dynamic data from a CSV or another source, and you could potentially discover your URLs from there and target them.
Pro tip: Use URL Genie MP instead of the SCOM templates. You will be a much happier SCOM admin.
It would just be a discovery that runs every 12 hours or every day, for example, and pulls in the new URLs. Your workflows will then target those.
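The core of such a discovery is reading the URL list and working out what changed since the last run. Here is a minimal Python sketch of that step, assuming a CSV with a `url` column; the column name and function names are hypothetical, and a real SCOM discovery would be written as a management pack workflow rather than a standalone script.

```python
import csv
import io


def read_urls(csv_text):
    """Parse CSV text with a 'url' column (assumed layout) into a set of URLs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (row.get("url") or "").strip()
        for row in reader
        if (row.get("url") or "").strip()
    }


def diff_urls(monitored, discovered):
    """Return (to_add, to_remove) so monitoring matches the discovered list."""
    return sorted(discovered - monitored), sorted(monitored - discovered)
```

Each scheduled run compares the discovered set with what is currently monitored, so additions and removals in the CSV flow through without touching the UI.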
You could write a custom management pack and a class type of URL tested without too much difficulty. Then you will just modify the code the UI created to target your new class type.
If you want to truly automate it, you might have a look at the XML that it creates when you create a new management pack, and run it once and see what it actually creates. But it might create a little bit more than you would like. But you could try and replicate this for multiple objects.
Why didn't the monitored servers come out of maintenance mode automatically, even though I set up the timeframe?
There were a few bugs in the past that were fixed in update rollups, so make sure you use the latest update rollup for whatever version of SCOM you have.
It could also be due to the All Management Servers Resource Pool being down for a while in the middle of that maintenance mode window. During that time, it doesn't pick up that workflow, so the workflow that keeps an eye on the maintenance modes of other machines and takes them out of maintenance mode again is likely down at that moment. You then can't be sure that it picks things up correctly when it comes back. So keep an eye on that resource pool.
There are also cases where the server is in maintenance mode twice, through a secondary schedule.
And that’s a wrap for the SCOM FAQs.
We hope these SCOM answers helped to fix some of your SCOM problems so everything works smoothly now.