Monitoring the Monitoring Tool
Like all cloud services, Microsoft Sentinel can fail from time to time. As a Platform as a Service (PaaS), Microsoft monitors and fixes issues within the underlying infrastructure. However, users are still responsible for monitoring and taking action onto resolve issues that arise from within the PaaS Sentinel infrastructure. To find and fix issues, we can use the Microsoft Sentinel Health feature, which enables monitoring for potential issues in three categories:
- Analytic rules
- Automation rules
- Data connectors
While these categories don’t cover everything that can go wrong, they cover some of the most important aspects. Later in the article, we will discuss how you can also monitor other components that are not covered.
Following up on health issues within these categories is essential. Health events can contain potential issues such as:
- Data is not being ingested (into Sentinel?).
- Automation rules are not running.
- Analytic rules not triggering
Customers often underestimate the impact health issues can have on their environment. A health issue can be created because of an issue in the tenant configuration (such as within the settings for an analytic rule) or because of a Microsoft bug. If it is an issue caused by Microsoft, it is often too impossible for the administrator to resolve as we have no control over the backend. However, it is still important to know about the health problem to assess the impact of the issue.
Configuring Sentinel Health
The health feature is not enabled by default and must be manually enabled e in the Sentinel settings menu . Near the bottom of the page, you will find the ‘auditing and health monitoring’ blade. Open it and click ‘enable’ (Figure 1).
By clicking ‘Enable’ Sentinel will configure several diagnostic settings. Within the configuration, there are multiple types of logs available. These types of logs represent the different types of Sentinel resources that support the health feature:
- Data connectors
- Analytics
- Automation
If you want to enable this feature granularly, do so by clicking ‘Configure diagnostic settings’ next to the ‘Enable’ button. I recommend enabling all categories as there are no real downsides . This feature is free, except for the cost to store the logs it generates. However, this cost is negligible. For most tenants, the cost of health logs is less than a dollar per month.
Using the Logs to Create Incidents
After enabling the feature, Sentinel will begin to populate a new table ‘SentinelHealth’. This table contains success, failures, and partial success events. By using KQL, we can retrieve whatever data we need.
The health data contains error codes. Unfortunately, Microsoft does not document all potential issues and their meanings.
From working with these logs, we know that many errors don’t require action. An example is the execution of analytic rules. Sometimes the run of an analytic rule will fail but is scheduled to be retried. In these circumstances, there is no reason to follow up if the re-run is successful.
Below you can find a sample of a query I use to create incidents based on the Sentinel health logs. It focuses on failures while avoiding temporary issues that should resolve themselves automatically.
SentinelHealth | where Status != "Success" | mv-expand ExtendedProperties.Issues | extend ErrorCode = tostring(ExtendedProperties_Issues.Code) | extend ErrorDescription = tostring(ExtendedProperties_Issues.Description) | where not(ErrorCode == "TemporaryIssuesDelay" and Description contains "It will be re-executed over the next scheduled time") // ignore temporary issue that will be resolved | where not(ErrorCode == "TemporaryIssuesDelay" and Description contains "Rule executed successfully") | summarize any(*), TimegeneratedAndDescription = make_set(strcat(TimeGenerated," ",ErrorDescription)) by SentinelResourceName, ErrorCode, TenantId, SentinelResourceId | extend ErrorDescription = iff(any_ErrorDescription contains "Alert Provider ID",trim_end("(?:..........................................................)",any_ErrorDescription),any_ErrorDescription)
Integrating Into a Standard Workflow
If you create analytic rules for Sentinel health issues, they are created in the same queue as security incidents. Within a small security team, these incidents might be picked up by the same team. However, in larger teams, there will be SOC analysts and SOC engineers. Typically, SOC engineers are responsible for the backend of Sentinel and don’t focus on security incidents.
To differentiate the types of incidents, I prefix each incident. How you prefix incidents depends on your naming convention. I stick to adding ‘[HealthIssue]’ in front of the incident. By adding parenthesis around the prefix, it is easier to filter and makes it clear that is it a prefix instead of something that is part of the title.
Taking Action on Incidents
After receiving incidents, you need to act accordingly. The type of action will greatly depend on the categories for which the incident is created.
For analytic rules, the issue will either be within the configuration or a temporary issue. An example of an issue with the configuration can be an issue with the entity mapping. If an entity is dropped, it means it isn’t added in the incident. Typically, this happens because the data in the original rule isn’t stable. In this case, an update for the analytic rule will be created. If it is a temporary issue, you can try to execute the failed run again by navigating to the logs of the specific analytic rule.
Unlike analytic rules, automation rules cannot be re-run. You must identify what the impact of the failure is and take whatever action is needed manually. This could potentially mean you will need to run the playbook manually.
If there is an issue in a data connector, the actions depend on the data connector for which the error is created. Depending on the affected data connector, you might need to resolve issues within the data connector itself or wait for the next run.
What Isn’t Covered by Sentinel Health
The main gaps for Sentinel Health are the limited support for data connectors and playbooks.
Playbooks are not integrated into the Health feature at all. Instead, you must enable diagnostic settings for the playbook to send the logs to Sentinel. Based on these logs, you can also create incidents. An example KQL query can be found below. Here, we will use the AzureDiagnostics table to identify failures within a Playbook.
AzureDiagnostics | where OperationName contains "completed" and status_s == "Failed" and ResourceType == "WORKFLOWS/RUNS" | extend PlaybookName = resource_workflowName_s, FailureTime=TimeGenerated, EncodedResourceId = replace_string(ResourceId, "/","%2F") | extend MonitorURL=strcat("https://portal.azure.com/#view/Microsoft_Azure_EMA/LogicAppsMonitorBlade/runid/", EncodedResourceId)
Issues within data connectors are more complex. The Sentinel health feature only includes monitoring for six different connectors and all codeless connectors (a specific type of API-based connector). Issues for different data connectors must be identified differently. Microsoft has a workbook template to help identify issues.
For CEF connectors, we use a KQL query that identifies if a computer hasn’t sent any data in a specified timespan. That way, we can be alerted when a device stops. The query looks for computers that haven’t sent data in the last hour. The threshold is something that can be tweaked depending on the specific environment.
let threshold = 1h; CommonSecurityLog | summarize LastHeartbeat = max(TimeGenerated) by Computer | extend State = iff(LastHeartbeat < ago(threshold), 'Unhealthy', 'Healthy') | extend TimeFromNow = now() - LastHeartbeat | extend ["TimeAgo"] = strcat(case(TimeFromNow < 2m, strcat(toint(TimeFromNow / 1m), ' seconds'), TimeFromNow < 2h, strcat(toint(TimeFromNow / 1m), ' minutes'), TimeFromNow < 2d, strcat(toint(TimeFromNow / 1h), ' hours'), strcat(toint(TimeFromNow / 1d), ' days')), ' ago') | where State == "Unhealthy" | project Computer, State, TimeFromNow, LastHeartbeat, TimeAgo
Moving forward
Monitoring the health of your Sentinel environment isn’t as simple as it should be. While the health feature goes a long way, it lacks complete visibility across every Sentinel feature. Additionally, some errors lack details and actionable steps. Often, an issue is logged as ‘TemporaryIssue’ and there is nothing we can do except re-run the action.