Finding SharePoint and OneDrive Files in an Unlabeled State
Given the many ways to assign Microsoft 365 retention labels to files stored in SharePoint Online and OneDrive for Business, it might come as a surprise that no out-of-the-box method exists to find files that don’t have retention labels. The Purview Data Explorer (Figure 1) is a great source of knowledge about where and which retention labels are used but is silent on the question of where unlabeled files lurk.
You might consider the question of unlabeled files immaterial because a tenant can deploy auto-label policies to find and apply retention labels to files in static or adaptive locations (sites). It can configure default retention labels for document libraries and folders to ensure that unlabeled files are duly stamped with an appropriate retention label. Time and effort can be poured into user education to coach people about how to choose and apply the most appropriate label. And even if retention labels aren’t applied, retention policies can provide default retention settings to sites and accounts across the tenant.
Auto-labeling and default labels both require Office 365 E5 licenses. If the organization has these licenses, these features (along with adaptive scopes) should absolutely be used. Even so, it’s possible that some files exist devoid of labels. The question is whether it’s possible and worthwhile to find the unlabeled files. From the compliance perspective, it’s worthwhile because you never know when an unlabeled file might be needed. Once discovered, it’s possible to apply retention labels to files programmatically using cmdlets from the Microsoft Graph PowerShell SDK.
Searching for Unlabeled Files
PowerShell scripts using Microsoft Graph API requests or SDK cmdlets can scan an individual document library or OneDrive account to find unlabeled files. However, given the number of SharePoint Online sites and OneDrive accounts supported by the average Microsoft 365 tenant, going from library to account to library seeking unlabeled files is likely to become boring very quickly.
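As a rough illustration of the per-library scan, the sketch below lists the files in one folder of a drive and reports those with no retention label. It assumes a connected Microsoft Graph PowerShell SDK session with read access to the files, and it assumes that Get-MgDriveItemRetentionLabel either fails or returns an empty label name for an unlabeled file. The function name Get-UnlabeledFilesInFolder is hypothetical, and the scan does not descend into subfolders.

```powershell
function Get-UnlabeledFilesInFolder {
    param(
        [Parameter(Mandatory)] [string]$DriveId,
        [string]$FolderId = 'root'   # 'root' addresses the top-level folder of the drive
    )
    # List the items (files and folders) in the target folder
    [array]$Items = Get-MgDriveItemChild -DriveId $DriveId -DriveItemId $FolderId -All
    ForEach ($Item in $Items) {
        # Assumption: only files carry a MIME type, so anything without one is skipped
        If ($null -eq $Item.File.MimeType) { continue }
        $Label = $null
        Try {
            $Label = Get-MgDriveItemRetentionLabel -DriveId $DriveId -DriveItemId $Item.Id -ErrorAction Stop
        } Catch {
            # Assumption: a failed lookup is treated as "no label" in this sketch
            $Label = $null
        }
        If ([string]::IsNullOrEmpty($Label.Name)) {
            # Emit one object per unlabeled file
            [PSCustomObject]@{
                Name   = $Item.Name
                WebUrl = $Item.WebUrl
            }
        }
    }
}
```

Calling Get-UnlabeledFilesInFolder -DriveId $Drive.Id (where $Drive is a drive object fetched beforehand) returns one object per unlabeled file. Repeating the call for every library and account in a tenant is the tedious part.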
An alternative approach is to run a content search to discover how many unlabeled files might exist and where they are. The eDiscovery solution in the Purview portal is undergoing change at present, but it’s still possible to run a content search to find information.
The key to successful content searches is the KeyQL query that the search processes to find items. In this case, there’s no easy way to build a query to “find all documents without a retention label.” Instead, you’ve got to build a complex query that finds items that do not have a certain label and repeat that to cover all the retention labels in active use. My tenant has more retention labels than it should (I test different scenarios), so I ended up with a query like this (obviously, replace the names of my retention labels with the names of the labels used in your tenant):
(ComplianceTag<>"Confidential") (c:s) (ComplianceTag<>"Approved") (c:s) (ComplianceTag<>"Audit Material") (c:s)(ComplianceTag<>"Commercially Sensitive") (c:s) (ComplianceTag<>"Office 365 for IT Pros eBook Content") (c:s) (ComplianceTag<>"Business Critical") (c:s) (ComplianceTag<>"Cloudy Attachment Preservation - Three Years") (c:s)(ComplianceTag<>"Auto-Label Invoices") (c:s)(ComplianceTag<>"Archive Retention") (c:s) (ComplianceTag<>"Board Records") (c:s) (ComplianceTag<>"Contractual Information") (c:s)(ComplianceTag<>"Formal Company Record") (c:s) (ComplianceTag<>"Manual disposition") (c:s) (ComplianceTag<>"Permanent Company Record")(c:s) (ComplianceTag<>"Record (Legal)") (c:s) (ComplianceTag<>"Regulatory Record (Legal)") (c:s) (ComplianceTag<>"Required for Audit") (c:s) (ComplianceTag<>"Strategic Planning") (c:s) (ComplianceTag<>"Teams Recordings")(c:c)(filetype=DOCX)(filetype=PPTX)(filetype=XLSX)
Queries have a limitation of 20 keywords, so I couldn’t include every retention label. I dropped labels deemed to have low usage to get under the limit. After settling on the query, I used it in a search that found 31,812 items (Figure 2).
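Typing a query of this shape by hand is error-prone, so it could be assembled from a list of label names instead. The sketch below is one possible way to do that. The function name New-UnlabeledFilesQuery and the default cap of 19 labels (the number used in the query above) are assumptions, and the (c:s) and (c:c) separators simply copy the form of that query.

```powershell
function New-UnlabeledFilesQuery {
    param(
        [Parameter(Mandatory)] [string[]]$LabelNames,
        # Assumption: adjust the cap to whatever keeps the query within the keyword limit
        [int]$MaxLabels = 19
    )
    # Keep only as many labels as the cap allows (the caller decides the order)
    $Selected = $LabelNames | Select-Object -First $MaxLabels
    # One "does not have this label" clause per retention label
    $Clauses = ForEach ($Name in $Selected) { '(ComplianceTag<>"{0}")' -f $Name }
    # Join the clauses and append the file type conditions used in the original query
    ($Clauses -join ' (c:s) ') + '(c:c)(filetype=DOCX)(filetype=PPTX)(filetype=XLSX)'
}

# Example with three of the label names from the query above
New-UnlabeledFilesQuery -LabelNames 'Confidential', 'Approved', 'Audit Material'
```

The label names might come from Get-ComplianceTag in a Security & Compliance PowerShell session, ordered so that the least-used labels fall outside the cap. The resulting string can be pasted into the search or, possibly, passed to New-ComplianceSearch through its ContentMatchQuery parameter. I have not verified that the cmdlet accepts this exact syntax.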
Parsing Search Results
Running a successful content search is great, but we need to make sense of the search results. Two points are extremely important. First, this is an estimate search conducted against the search indexes. The results generated by running a full search to find and extract data from the target locations might be different. However, the data is good enough to use as a start.
When a content search is complete, the SuccessResults property for the search contains a string noting the locations where the search found items. We can parse the SuccessResults property to report what’s noted there. Here’s some code to report the top 10 locations based on the number of unlabeled items:
$ComplianceSearch = Get-ComplianceSearch -Identity "Find SharePoint documents without a retention label"
# Split the success results into individual entries
$Entries = $ComplianceSearch.SuccessResults -split ",(?=\s*Location)"
$LocationsReport = [System.Collections.Generic.List[Object]]::new()
# Process each entry in the success results
ForEach ($Entry in $Entries) {
    # Extract the location, item count, and total size
    if ($Entry -match 'Location:\s*(\S+),\s*Item count:\s*(\d+),\s*Total size:\s*(\d+)') {
        $Location = $matches[1]
        $ItemCount = [int]$matches[2]
        $Size = $Matches[3]
        $TotalSizeMB = [math]::Round([long]$Size / 1MB, 2)
        # Create a custom object for the entry
        $ReportLine = [pscustomobject]@{
            Location          = $Location
            ItemCount         = $ItemCount
            'Total Size (MB)' = $TotalSizeMB
        }
        $LocationsReport.Add($ReportLine)
    }
}
# Output the result
$LocationsReport = $LocationsReport | Sort-Object ItemCount -Descending
$LocationsReport | Select-Object -First 10 | Format-Table -AutoSize

Location                                                          ItemCount Total Size (MB)
--------                                                          --------- ---------------
https://office365itpros.sharepoint.com/sites/BlogsAndProjects          5642         2076.48
https://office365itpros.sharepoint.com/sites/O365ExchPro               5504        23654.36
https://office365itpros.sharepoint.com/sites/aircraftwaterchers        4955          372.82
https://office365itpros.sharepoint.com/sites/IndustryNews              3987         2074.53
…
The second point to consider is that the results for SharePoint Online and OneDrive for Business include items preserved for retention that are stored in preservation hold libraries in individual sites and accounts. There’s no point in attempting to do anything with unlabeled files stored in preservation hold libraries. These are copies of files held for retention purposes, and you can’t assign labels to those items, so they will always remain unlabeled. The results reported above are therefore overstated and need adjustment.
Refining Search Results to Identify Unlabeled SharePoint Files
Running a content search typically leads to one of three next steps:
- Improve the search criteria and rerun.
- Download the search results.
- Download a report about the search results.
If we use the third option to download the search report, Purview includes a CSV file called Results.csv in the report set. The CSV file contains a line for each item found by the search.
After downloading the files, we can import the item data from Results.csv into an array to process the data. In this case, to get a more accurate assessment of unlabeled files, we can create another array (a PowerShell list) that excludes the preservation hold library items. In my case, this reduced the number of items from 31,812 to 8,887.
[array]$Results = Import-Csv results.csv
$FilesReport = [System.Collections.Generic.List[Object]]::new()
ForEach ($R in $Results) {
    # Ignore the copies held in preservation hold libraries
    If ($R.'Document Path' -notLike "*/PreservationHoldLibrary/*") {
        $ReportLine = [PSCustomObject]@{
            Location = $R.Location
            Title    = $R.'Subject or Title'
            Path     = $R.'Document Path'
            Author   = $R.'Sender or created by'
            Created  = $R.'Received or created'
            Modified = $R.'Modified date'
        }
        $FilesReport.Add($ReportLine)
    }
}
$FilesReport = $FilesReport | Sort-Object Location
$FilesReport | Export-Csv c:\temp\FilesReport.csv -NoTypeInformation -Encoding utf8
# Keep Location as an object property so that Export-Csv writes the URIs rather than string lengths
$Locations = $FilesReport | Select-Object Location | Sort-Object Location -Unique
$Locations | Export-Csv c:\temp\Locations.csv -NoTypeInformation -Encoding utf8
Write-Host "Details of unlabeled files are in c:\temp\FilesReport.csv and a list of locations is in c:\temp\Locations.csv"
The script generates a CSV file containing details of the individual unlabeled files and another file holding the URIs for the locations where unlabeled files are stored. The locations file is useful if you decide to go to the next step and apply retention labels to the unlabeled files.
Some basic analysis identifies the locations for the unlabeled files. Here’s how to find the top 10 sites:
$FilesReport | Group-Object Location -NoElement | Sort-Object Count -Descending | Select-Object -First 10 | Format-Table Name, Count -AutoSize

Name                                                                                 Count
----                                                                                 -----
https://office365itpros.sharepoint.com/sites/BlogsAndProjects                         2651
https://office365itpros.sharepoint.com/sites/IndustryNews                             1995
https://office365itpros.sharepoint.com/sites/TeamsDay                                 1160
https://office365itpros-my.sharepoint.com/personal/tony_redmond_office365itpros_com   1048
https://office365itpros.sharepoint.com/sites/O365ExchPro                               651
https://office365itpros.sharepoint.com/sites/rabilling                                 448
https://office365itpros.sharepoint.com/sites/office365engage2017speakers                79
https://office365itpros.sharepoint.com/sites/exchangetroubleshootingbook                79
https://office365itpros.sharepoint.com/sites/Office365Adoption                          74
https://office365itpros.sharepoint.com/sites/UltraFans                                  71
The top site dropped from 5,642 to 2,651 unlabeled items. The O365ExchPro site, second in the original list and fifth here, dropped from 5,504 to 651. The variation in the number of files held in preservation hold libraries is accounted for by different edit patterns and by the way that SharePoint Online preserved individual copies of edited files within the scope of retention labels or policies before it moved to a “store all versions in a single file” model.
The number of file versions stored in preservation hold libraries is an indication of why the introduction of intelligent versioning for SharePoint Online is so important in terms of restricting the storage consumed by file versions, even if retention processing currently forces SharePoint Online to hold versions that it would like to remove.
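To see how far each site's estimate fell once preservation hold library items were removed, the two reports can be compared side by side. This is a minimal sketch that assumes the $LocationsReport and $FilesReport variables built by the earlier scripts, and that the location URLs are written identically in both sources. The function name Compare-UnlabeledCounts is hypothetical.

```powershell
function Compare-UnlabeledCounts {
    param(
        [array]$LocationsReport,   # objects with Location and ItemCount (search estimate)
        [array]$FilesReport        # one object per file left after filtering, with Location
    )
    # Count the remaining files per location
    $Refined = @{}
    ForEach ($Group in ($FilesReport | Group-Object Location -NoElement)) {
        $Refined[$Group.Name] = $Group.Count
    }
    # Pair each estimate with its refined count and compute the difference
    ForEach ($Line in $LocationsReport) {
        $After = 0
        If ($Refined.ContainsKey($Line.Location)) { $After = $Refined[$Line.Location] }
        [PSCustomObject]@{
            Location   = $Line.Location
            Estimated  = $Line.ItemCount
            Refined    = $After
            Difference = $Line.ItemCount - $After
        }
    }
}

# Usage, assuming both variables exist in the session:
# Compare-UnlabeledCounts -LocationsReport $LocationsReport -FilesReport $FilesReport |
#     Sort-Object Difference -Descending | Format-Table -AutoSize
```

A large difference for a site suggests that most of its estimated items were preservation hold library copies rather than files that users can label.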
The results generated by this exercise are not perfect. You can check for unlabeled files in a site by using SharePoint search. Sign in to the site and enter the following query in the search box. The query only works for files within the site:
ContentType:Document NOT ComplianceTag:*
After checking some of the sites highlighted by the report, I found some sites held labeled files that were reported to be unlabeled. That’s just a function of using an estimate search to generate the results rather than a full and complete search where items are checked before being added to the result set. However, the data is a great guide to where you should invest some time to chase down unlabeled items.
Moving Forward to Apply Labels to Unlabeled SharePoint Files
The next step is to figure out how to apply suitable retention labels to the unlabeled files. The code to process the files in the various libraries isn’t difficult to write. At least, I don’t think so… I guess I should go and do it.