How to Define Custom Sensitive Information Types for Use in DLP Policies

New Interface and New Capabilities Make It Easier to Manage Sensitive Information Types

Sensitive information types are used by Microsoft 365 components like DLP policies and auto-label (retention) policies to locate data in messages and documents. Earlier in January, Microsoft released a set of new sensitive information types to make it easier to detect country-specific sensitive data like identity cards and driving licenses. Message center notification MC233488 (updated January 23) brings news of two additional changes rolled out in late January and early February.

Confidence Levels for Matching

First, the descriptions of confidence levels for match accuracy are changing from percentages (65%, 75%, 85%) to low, medium, and high. Existing policies pick up the new descriptors automatically. This is Microsoft 365 roadmap item 68915.

The change is intended to make it easier for people to understand the accuracy of matches when using sensitive information types. DLP detects matches in messages and documents by looking for patterns and other forms of evidence. The primary element (like a regular expression) defined in the pattern for a sensitive information type defines how basic matching is performed for that type. Secondary elements (like keyword lists) add evidence to increase the likelihood that a match exists. The more evidence gathered, the higher the match accuracy, and the confidence level increases from low to medium to high. Policy rules use the confidence level to decide what action is necessary when matches are found. For instance, a rule might block documents and email when high confidence exists that sensitive data is present while just warning (with a policy tip) for lower levels of confidence.
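
As a rough illustration of how a rule might act on the new labels, the sketch below maps the old percentages to the new descriptors in the order they are listed above and then picks an action. It is a toy model in Python, not how DLP is implemented. The names CONFIDENCE_LABELS, label_for, and action_for are hypothetical, and the split between blocking and showing a policy tip simply mirrors the example in the paragraph above.

```python
# Assumption: the old percentages map to the new labels in the order listed.
CONFIDENCE_LABELS = {65: "low", 75: "medium", 85: "high"}


def label_for(percentage: int) -> str:
    """Translate an old-style percentage into the new descriptor."""
    return CONFIDENCE_LABELS[percentage]


def action_for(confidence: str) -> str:
    """Illustrative rule: block on high confidence, otherwise show a policy tip."""
    return "block" if confidence == "high" else "policy tip"


for pct in (65, 75, 85):
    label = label_for(pct)
    print(f"{pct}% -> {label} -> {action_for(label)}")
```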

Copying Sensitive Information Types

The second change is that you can copy sensitive information types, including the set of sensitive information types distributed by Microsoft. For instance, let’s assume that you don’t think that the standard definition for a credit card number works as well as it could. You can go to the Data Classification section of the Compliance Center, select the credit card number sensitive information type, and copy it to create a new type (Figure 1). The copy becomes a custom sensitive information type under your control to tweak as you see fit.

Figure 1: Copying the sensitive information type for a credit card number

Improved User Interface

The third change is the most important. Figure 1 is an example of a new user interface to manage sensitive information types (Microsoft 365 roadmap item 68916). The new interface is crisper and exposes more information about how information types work. For instance, in Figure 1, we can see that the primary element for credit card number detection is a function to detect a valid credit card number. Further evidence (supporting elements) comes from the presence of keywords like a credit card name (for example, Visa or MasterCard) and an expiration date near a detected number.

Few tenants are likely to want to tweak the standard sensitive information types. However, being able to examine how Microsoft puts these objects together is instructive and helpful when the time comes to create custom sensitive information types to meet business requirements.

Detecting Azure AD Passwords

For instance, MVP James Cussen points out that Azure AD passwords are not included in the list of sensitive information types. While some people need to send passwords around in email and Teams messages, it’s not the most secure way of transmitting credentials. In this post, he uses the old interface to define a sensitive information type to detect passwords. To test the new interface, I used his definition as the basis for a new custom sensitive information type.

The primary element to match passwords is a regular expression:

((?=[\S]*?[A-Z])(?=[\S]*?[a-z])(?=[\S]*?\d)|(?=[\S]*?[A-Z])(?=[\S]*?[a-z])(?=[\S]*?[^a-zA-Z0-9])|(?=[\S]*?[A-Z])(?=[\S]*?\d)(?=[\S]*?[^a-zA-Z0-9])|(?=[\S]*?[a-z])(?=[\S]*?\d)(?=[\S]*?[^a-zA-Z0-9]))[^\s]{8,256}

Many suggested expressions to detect passwords can be found on the internet. Most are rejected when entered for a sensitive information type because they fall foul of Microsoft’s checks for illegal or inefficient expressions. Not being a Regex expert, I tried several (all worked when tested against https://regex101.com/), and all failed except the one created by James.
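
One way to see what the expression picks out of sample text is to run it in a short script. The sketch below uses Python’s re module, which is an assumption made for illustration (the article only mentions regex101.com). As the paragraph above warns, an expression that works in another engine can still be rejected by Microsoft’s checks, so this shows matching behavior only. The helper name find_candidates and the sample strings are hypothetical.

```python
import re

# The primary-element expression quoted above, split across lines but unchanged.
PASSWORD_RE = re.compile(
    r"((?=[\S]*?[A-Z])(?=[\S]*?[a-z])(?=[\S]*?\d)"
    r"|(?=[\S]*?[A-Z])(?=[\S]*?[a-z])(?=[\S]*?[^a-zA-Z0-9])"
    r"|(?=[\S]*?[A-Z])(?=[\S]*?\d)(?=[\S]*?[^a-zA-Z0-9])"
    r"|(?=[\S]*?[a-z])(?=[\S]*?\d)(?=[\S]*?[^a-zA-Z0-9]))"
    r"[^\s]{8,256}"
)


def find_candidates(text: str) -> list:
    """Return every substring that the expression treats as a possible password."""
    return [m.group(0) for m in PASSWORD_RE.finditer(text)]


samples = [
    "Here's your new password: 515AbcSven!",
    "Use this pwd to get into the site ExpertsAreUs33@",
    "No secrets in this sentence.",
]
for sample in samples:
    print(sample, "->", find_candidates(sample))
```

The expression accepts a run of 8 to 256 non-space characters when the lookaheads find at least three of the four character classes (uppercase, lowercase, digit, other).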

A keyword list is a useful secondary element to add evidence that a password is found. The list contains a comma-separated set of common words that you might expect to find close to a password. For instance, the keywords password and pwd appear close to the passwords in these messages:

“Here’s your new password: 515AbcSven!”

“Use this pwd to get into the site ExpertsAreUs33@”

In multilingual tenants, the ideal situation is to include relevant words in different languages in the keyword list. For instance, if the tenant has Dutch and Swedish users, you could include wachtwoord (Dutch) and lösenord (Swedish). To accommodate the reality that people don’t always spell words correctly, consider adding some misspelt variations of keywords. In this instance, we could add keywords like passwrod or pword.

James’s definition allows keywords to be in a window of 300 characters anchored on the detected password (see this Microsoft article to understand how the window works). I think this window is too large and might result in many false positives. The keyword is likely to be close to the password, so I reduced the window to 80 characters.
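
The effect of the window size can be approximated in a few lines of Python. This is a simplified model written for illustration, not Microsoft’s implementation: it assumes the window extends the given number of characters before and after the matched text, which may differ from how DLP measures proximity. The function name matches_with_evidence is hypothetical, and the keyword tuple just collects the words mentioned above.

```python
import re

# The password expression from earlier in the article, unchanged.
PASSWORD_RE = re.compile(
    r"((?=[\S]*?[A-Z])(?=[\S]*?[a-z])(?=[\S]*?\d)"
    r"|(?=[\S]*?[A-Z])(?=[\S]*?[a-z])(?=[\S]*?[^a-zA-Z0-9])"
    r"|(?=[\S]*?[A-Z])(?=[\S]*?\d)(?=[\S]*?[^a-zA-Z0-9])"
    r"|(?=[\S]*?[a-z])(?=[\S]*?\d)(?=[\S]*?[^a-zA-Z0-9]))"
    r"[^\s]{8,256}"
)

# Keywords taken from the examples in the text (assumed list, not James's).
KEYWORDS = ("password", "pwd", "wachtwoord", "lösenord", "passwrod", "pword")


def matches_with_evidence(text: str, window: int = 80) -> list:
    """Return candidate passwords that have a keyword inside the window."""
    hits = []
    for m in PASSWORD_RE.finditer(text):
        lo = max(0, m.start() - window)
        hi = min(len(text), m.end() + window)
        nearby = text[lo:hi].lower()
        if any(keyword in nearby for keyword in KEYWORDS):
            hits.append(m.group(0))
    return hits


near = "Here's your new password: 515AbcSven!"
far = "password " + "x " * 100 + "515AbcSven!"  # keyword about 200 characters away
print(matches_with_evidence(near, 80))
print(matches_with_evidence(far, 80))
print(matches_with_evidence(far, 300))
```

In this model, a keyword roughly 200 characters from the candidate counts as evidence with a 300-character window but not with an 80-character one, which is the trade-off behind the choice of a smaller window.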

Figure 2 shows the result after inputting the regular expression, keyword list, confidence level (medium), and character proximity. It’s a less complex definition than for Microsoft’s sensitive information types. The big question is whether it works.

Figure 2: Definition for the Azure Active Directory password custom sensitive information type

Testing

The Test option allows you to upload a file containing sample text to run against the definition to see if it works. As you can see in Figure 3, the test was successful.

Using the Custom Sensitive Information Type in a Teams DLP Policy

Testing gives some confidence that the custom sensitive information type will work properly when deployed in a DLP policy. After quickly creating a DLP policy for Teams, we can confirm its effectiveness (Figure 4) in chats and channel conversations.

Figure 4: Passwords blocked in a Teams chat

I deliberately chose Teams as the DLP target because organizations probably don’t want their users swapping passwords in chats and channel conversations. Before rushing to extend the DLP policy to cover email, consider the fact that it’s common to send passwords in email. For instance, I changed the policy to cover email and Teams and discovered that the policy blocks any invitation to Zoom meetings because these invitations include the word “pwd” as in:

https://us02web.zoom.us/j/9355319659?pwd=dExxYVl1N1diS0RiVG1nYmFEWlRjQT09

Although it might be an attractive idea to block Zoom to force people to use Teams online meetings instead, it’s not a realistic option. The simple solution is not to use this DLP policy for email.
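
A quick check in Python (used here only as a convenient regex engine, which is an assumption) suggests why the Zoom link is caught: the whole URL is a single run of non-space characters containing uppercase letters, lowercase letters, and digits, so it satisfies the password expression, and the keyword pwd sits inside that same string.

```python
import re

# The password expression from earlier in the article, unchanged.
PASSWORD_RE = re.compile(
    r"((?=[\S]*?[A-Z])(?=[\S]*?[a-z])(?=[\S]*?\d)"
    r"|(?=[\S]*?[A-Z])(?=[\S]*?[a-z])(?=[\S]*?[^a-zA-Z0-9])"
    r"|(?=[\S]*?[A-Z])(?=[\S]*?\d)(?=[\S]*?[^a-zA-Z0-9])"
    r"|(?=[\S]*?[a-z])(?=[\S]*?\d)(?=[\S]*?[^a-zA-Z0-9]))"
    r"[^\s]{8,256}"
)

invite = "https://us02web.zoom.us/j/9355319659?pwd=dExxYVl1N1diS0RiVG1nYmFEWlRjQT09"

match = PASSWORD_RE.search(invite)
print("Expression matches the link:", match is not None)
print("Match covers the whole link:", match.group(0) == invite)
print("Keyword 'pwd' is in the link:", "pwd" in invite.lower())
```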

False Positives and Policy Overrides

The downside of matching text in messages against keywords defined in a policy is that some false positives can happen. For instance, I have a Flow to import tweets about Office 365 into a team channel. As Figure 5 shows, some tweets are picked up as potential password violations because a keyword appears near a string which could be a valid password.

Figure 5: Tweets posted in Teams are blocked because they match the password definition

Adjusting the definition for the sensitive information type to reduce the character proximity count (from 80 to 60) reduced the number of false positives. Testing and observation will tell how effective changes like this are when exposed to real-life data.

Apart from adjusting character proximity, two other potential solutions exist. First, amend the DLP policy to allow users to override the block and send the message with a justification reported to the administrator. If the message is important, users will override the policy. The administrator is notified when overrides occur and can tweak the policy (if possible) to avoid further occurrences.

The second option is to exclude accounts (individually or the members of a distribution list) which have a business need to send passwords from the DLP policy. DLP will then ignore messages sent by these individuals.

Creating Custom Sensitive Information Types a Nice to Have

Given the broad range of standard types created by Microsoft, the need to define custom sensitive information types isn’t likely to be a priority for most tenants. However, for those who need this feature for business reasons, the recent changes are both welcome and useful.

Azure AD password, Custom Sensitive Information Type, Data Loss Prevention, DLP, Teams DLP policy

About the Author

Tony Redmond

Tony Redmond has written thousands of articles about Microsoft technology since 1996. He is the lead author for the Office 365 for IT Pros eBook, the only book covering Office 365 that is updated monthly to keep pace with change in the cloud. Apart from contributing to Practical365.com, Tony also writes at Office365itpros.com to support the development of the eBook. He has been a Microsoft MVP since 2004.

Comments

mansha 30 Oct 2023

Hi,
When I have to test a SIT in the compliance portal (SIT classifiers) eg. for an EU driver’s license, it does not detect only driving license (if tested with the driver’s license number only) as such when I write EU driver’s license as heading and then the driver’s license number related to it, that is detected as sensitive information. Why is it showing different behavior is this important to write in the heading what type of data is it? Below is the given format in which the type of data is mentioned in the heading (not able to detect numbers as SITs). So, is it necessary to write this along with SIT can’t only the driver’s license number be detected as SIT?

EU Driver’s License Number

97204831
44101073
23464488
86055639
90914933
1. Tony Redmond 30 Oct 2023
  
  Have you asked Microsoft support for help?
  1. mansha 30 Oct 2023
    
    No, I have created a DLP policy (GDPR for exchange) where I encounter this issue. As mentioned above, the EU-related data sent through the exchange is not detected properly. I got alerts only for a few emails that detect SITs when a document has a proper heading written with numbers.
    1. Tony Redmond 30 Oct 2023
      
      Well, I can’t see your data or your information definitions, so I can’t help. Microsoft Support can, so that’s why you should ask for their assistance.
Mayhew 25 Jan 2023

Hi Tony, I created a custom SIT to detect “All Data” using a basic Regex: [a-zA-Z 0-9]+ with the idea that I could use it to auto-apply labels to documents in a SharePoint site. When I test the SIT against a random document it seems to detect any data. However, when I created an auto-apply policy using this SIT it doesn’t seem to detect documents to apply labels. I’m wondering if this is a limitation? Any ideas?
1. Tony Redmond 25 Jan 2023
  
  This is the kind of question that’s impossible for me to answer because I have no access to your data. I’d log a call with Microsoft support and ask them to have a look.
Benjamin G 19 Dec 2022

The solution to detect password sharing is fantastic. Only problem with the string, it looks like when you type a long text that contains the word password… it gets flagged. Someone has an idea how to improve the string ? Thanks
1. Tony Redmond 19 Dec 2022
  
  Microsoft has their own credentials sensitive information type now. Have you tried that?
Juan Jose de Leon 26 Sep 2022

The test button only shows for Global Admin?
I’ve tried delegating many roles and the button doesn’t appear, Azure Roles and Compliance roles.
1. Diego Maldonado 14 Mar 2023
  
  Hello Juan, I imagine you have already solved this issue. But I was having the same problem.
  
  Up to now (March 14, 2023), the role needed to have access to the Test button is “Data Loss Prevention” in the Exchange Admin Center, under: Roles > Admin Roles. It could be under the “Compliance Management” role group. We created a custom role group and assigned the role to that custom group.
Roberto Noguera 6 Aug 2022

There is a sensitive information type called “Credit Card Number”. You don’t need to create a list.
Gareth 2 Mar 2022

Hi Tony,

Do Microsoft provide a way to test whether a regex expression meets their standard before adding it to a new sensitive into type, or is the acceptability of the expression validated whilst creating the sensitive info type?

Thanks
1. Tony Redmond 2 Mar 2022
  
  I think it’s really up to the SIT developer to make sure that the definition is capable of detecting the data you’re interested in. There is a test facility available described here https://docs.microsoft.com/microsoft-365/compliance/create-a-custom-sensitive-information-type?view=o365-worldwide&WT.mc_id=M365-MVP-9501#test-a-sensitive-information-type. Once you are happy that your SIT is good enough, you make the decision to deploy or not.
Saravana 4 Nov 2021

Hi Tony,

Does creating Sensitive Information Type is going to prevent employees from sharing documents with people outside the company??

If YES, how does it prevents employees from sharing the documents?
1. Tony Redmond 4 Nov 2021
  
  A sensitive information type is just a way for Microsoft 365 to identify documents and messages containing instances of sensitive information. It won’t do anything to stop people sharing that information unless you use the SIT in a DLP policy.
Jim E 23 Oct 2021

Hi Tony. Another great read.

For some reason I have a mental block around confidence levels and instance counts in a policy.

I have three sensitive information types created and each one is using a library. One is company name, one is account number, and one is bank account number. For this first phase I’m not looking at exact data match.

I’m confused about the optimal settings to use for confidence level and instance counts for each of the 3 SITs.

Can you clarify it’s stupid person 101 how these two options work together? As I understand it the confidence level is how sure DLP is that there is a match. The instance count is “OK, I found a match, based on the number of matches do this”. Do I have that right?

If that’s the case am I correct that I want a higher confidence level for SITs that are highly sensitive and lower confidence levels on ones that are less sensitive.

I have read all the articles but for some reason I can’t seem to grasp the simple concept.
1. Tony Redmond 25 Oct 2021
  
  I typically start with medium confidence level and adjust as necessary after testing. It’s hard to be precise until you see how a sensitive information type works in practice. For instance, the Azure AD password SIT can generate a great number of false positives if it is tuned too high. I think you’re on the right track to have high confidence in the SITs that you can define with a high degree of accuracy and a lower confidence in the ones that are more fudgeable (like the password).
Thomas Guldberg 21 Sep 2021

Hi Tony
Do you know of a way to update a “Keyword List” with powershell? I cannot find any commands and we need to update our lists through the GUI, which is not optimal as we need to automate the process.
1. Tony Redmond 21 Sep 2021
  
  I don’t, sorry!
  1. Thomas Guldberg 29 Sep 2021
    
    I found it.
    It turns out that all custom sensitive info types are stored in Microsoft.SCCManaged.CustomRulePack. This includes Keyword Lists:
    
    https://docs.microsoft.com/en-us/microsoft-365/compliance/sit-modify-a-custom-sensitive-information-type-in-powershell?view=o365-worldwide
    1. Tony Redmond 29 Sep 2021
      
      Congratulations!
Pierre HARDY 31 Mar 2021

Hi,

Thank you, nice article about DLP new features.
How do you manage the detection of a sensitive information list ?

For exemple, in an Excel spreadsheet, there is a column named “Credit Card” and underneath a list of 100 credit card.

Credit Card
xxxx-xxxx-xxxx-xxxx
xxxx-xxxx-xxxx-xxxx
xxxx-xxxx-xxxx-xxxx
xxxx-xxxx-xxxx-xxxx
..
..
..
xxxx-xxxx-xxxx-xxxx

Configuring a keyword evidence with a proximity of 300 will match only the first Credit cards numbers.(those at 300 character of the keyword “Credit Card”).

So, how can I detect the others creditcard numbers ?

Regards
1. Tony Redmond 1 Apr 2021
  
  I’m not sure what you want to do. The presence of a single credit card number with supporting evidence is enough to trigger a DLP violation. The presence of other credit card numbers adds evidence and might trigger another DLP rule (depending on its configuration). You’ll just have to test to get DLP to do what you want.
