In late January of this year GitLab, a Git repository and source-code management service similar to GitHub, suffered the permanent loss of some customer data. As reported by The Register:
Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated.
Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand.
I'm going to leave the “tired sysadmin” part alone for now, as I have some strong feelings about overwork in the IT industry, but I haven't been able to verify that fatigue was a factor in this particular incident. Regardless, there are more important aspects of the GitLab incident to consider, as they made a series of alarming discoveries about multiple backup types not working, sysadmins not knowing where backups were stored, and replication processes being “super fragile, prone to error” and “badly documented”. And unfortunately, they made these discoveries at the worst possible time – when they were relying on their backups for data recovery.
When we place our applications and data in cloud services, like Office 365, we trust the provider to take care of certain things for us. For example, Microsoft provides physical security for the datacenters hosting our data, encrypts our data at rest and as it passes across networks, and isolates us from other tenants. However, we are responsible for applying the tools that Microsoft makes available to us, such as password policies, multi-factor authentication, and mobile application management, to secure our own users' access to our data.
We also rely on Microsoft to maintain availability of our data, for example by hosting our Exchange Online mailboxes in database availability groups that replicate copies of the data to multiple datacenters in different parts of the world. As another example, SharePoint Online site collection backups are taken every 12 hours and retained for 14 days, but they can't be used for granular restore (e.g. a single item). Only an entire site collection can be restored, overwriting any other existing data in the process.
Data is also protected from accidental deletion by various “soft delete” and retention periods. Deleted user accounts can be retrieved. Exchange Online deleted mailbox items are recoverable for 14 days by default. SharePoint Online has versioning of documents and a recycle bin for deleted items. Administrators can recover data during those retention periods, before it has been permanently deleted from the service. Even a customer that has ceased paying for licenses has several weeks to change their mind and reactivate the service without data loss.
The limits (or complete absence) of backups make some organizations very uncomfortable, enough that they might avoid using Office 365 entirely. One important difference between a service like GitLab and Office 365 is that Microsoft's cloud is a highly automated and orchestrated environment, with strict controls over what administrators can actually do. We get a glimpse of what that looks like behind the scenes in Ignite presentations like this one, and in Microsoft's “Just Enough Administration” approach. In short, the risk of data loss through human error is minimized by automation and security. With such controls in place, it's hard to imagine an admin making the kind of error that took down GitLab.
Microsoft has a solid track record of running Office 365 to date in a backup-less manner, which helps keep costs down. However, some customers choose to invest in additional protection by signing up to one of the available Office 365 backup services. There's a growing market of products that claim to do the job, but the reality is that they are often only able to back up specific application data, and don't cover the way different applications and services in Office 365 work together to deliver a single feature. An example of this is Office 365 Groups, as Tony Redmond points out in his article on ITUnity:
Group mailboxes are in Exchange Online and group document libraries and the shared group notebook are stored in SharePoint Online.
Therefore, to recover a deleted group, you need to extract data from two places and make sure that the restored data goes into the right place. And if the Office 365 Group is associated with a plan managed by Microsoft Planner, you’ll have to recover the plan metadata too.
The situation becomes more complicated when Office 365 ships new features that lack “soft delete” or recovery options when they show up in customer tenants. For example, there is currently no way to recover a task or a plan once it has been deleted from Planner.
What the GitLab incident serves to remind us is that backups are important, whether they are “soft delete” or traditional backup approaches, but the ability to recover from those backups is just as important. What GitLab apparently had not done is properly understand:
- What needs to be backed up, what doesn't need to be backed up, and what can't be backed up.
- What deletion or data loss scenarios could arise that require a restore or recovery process to be used.
- What are the steps involved in each of the recovery processes, and who needs to be involved.
- Whether the recovery processes actually work.
They're not alone. I know plenty of customers who simply refuse to invest in documentation or test restores, and just assume that if the backup reports are showing a green tick then everything is fine.
“The condition of any backup is unknown until a restore is attempted.” – Schrödinger's Backup
Those four points above can be used in your analysis of the risk of data loss in any Office 365 service or application that you currently use, as well as new ones arriving in future. Consider Office 365 Planner as an example. The feature is generally available today, which means it is considered ready for production use. However, as mentioned earlier, there are various deletion scenarios that are permanent. Planner uses Groups, so the data is stored in Exchange Online and SharePoint Online, and includes the configuration of the plan itself, buckets within the plan, tasks within buckets, files, and a OneNote notebook.
Working out how each of those is protected (or not) and can be recovered (or not) is an important process before widespread adoption of Planner within your organization. Or, you might decide to simply place the risk on the end user, and have them accept that if they want to use Planner to manage their projects then they may need to manually recreate deleted items from time to time. That's just one example, but the same risk analysis needs to occur across your entire usage of Office 365 services.
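One way to keep track of that analysis is a small inventory that records the answers to the four questions for each kind of data. The sketch below is only an illustration: the `DataItem` class, its field names, and the sample values are all made up for this example and are not anything Office 365 provides. The true/false values in particular are placeholders to be replaced with your own findings, not statements about how Planner behaves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataItem:
    """One kind of data in a service, assessed against the four questions."""
    name: str
    can_be_backed_up: Optional[bool] = None  # None means "not yet assessed"
    loss_scenarios: tuple = ()               # deletion or data loss scenarios identified
    recovery_documented: bool = False        # are the steps and the people written down?
    restore_tested: bool = False             # has a restore actually been attempted?

    def gaps(self) -> list:
        """Return the open issues for this item, in the order of the four questions."""
        found = []
        if self.can_be_backed_up is None:
            found.append("backup coverage not assessed")
        elif not self.can_be_backed_up:
            found.append("cannot be backed up; accept or mitigate the risk")
        if not self.loss_scenarios:
            found.append("no loss scenarios identified")
        if not self.recovery_documented:
            found.append("recovery steps not documented")
        if not self.restore_tested:
            found.append("restore never tested")
        return found


# Hypothetical inventory for Planner; every value here is a placeholder.
planner = [
    DataItem("plan configuration", can_be_backed_up=False,
             loss_scenarios=("plan deleted",)),
    DataItem("tasks", can_be_backed_up=False,
             loss_scenarios=("task deleted",)),
    DataItem("files", can_be_backed_up=True,
             loss_scenarios=("file deleted",),
             recovery_documented=True, restore_tested=True),
    DataItem("OneNote notebook"),  # nothing assessed yet
]

for item in planner:
    issues = item.gaps()
    status = "OK" if not issues else "; ".join(issues)
    print(f"{item.name}: {status}")
```

An item only reports "OK" once a restore has been tested, which mirrors the point that a green tick on a backup report is not enough on its own.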
At this stage I'd actually like to hear from you in the comments below. How do you feel about the state of data protection and recovery in Office 365 today? Are there any risks that are blocking your organization from adopting more Office 365 services? Or are you satisfied that any risks you've identified are mitigated by other measures? Leave a comment with your thoughts.