In late January of this year GitLab, a Git repository and source-code management service similar to GitHub, suffered the permanent loss of some customer data. As reported by The Register:
Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated.
Just 4.5GB remained by the time he canceled the rm -rf command. The last potentially viable backup was taken six hours beforehand.
I’m going to leave the “tired sysadmin” part alone for now, as I have some strong feelings about overwork in the IT industry, but I haven’t been able to verify that fatigue was a factor in this particular incident. Regardless, there are more important aspects of the GitLab incident to consider, as they made a series of alarming discoveries about multiple backup types not working, sysadmins not knowing where backups were stored, and replication processes being “super fragile, prone to error” and “badly documented”. And unfortunately, they made these discoveries at the worst possible time – when they were relying on their backups for data recovery.
When we place our applications and data in cloud services, like Office 365, we trust the provider to take care of certain things for us. For example, Microsoft provides physical security for the datacenters hosting our data, encrypts our data at rest and in transit across networks, and isolates us from other tenants. However, we are responsible for applying the tools that Microsoft makes available to us, such as password policies, multi-factor authentication, and mobile application management, to secure our own users’ access to our data.
We also rely on Microsoft to maintain availability of our data, for example by hosting our Exchange Online mailboxes in database availability groups that replicate copies of the data to multiple datacenters in different parts of the world. As another example, SharePoint Online site collection backups are taken every 12 hours and retained for 14 days, but can’t be used for granular restore (e.g. a single item). Only an entire site collection can be restored, overwriting any other existing data in the process. Data is also protected from accidental deletion by various “soft delete” and retention periods. Deleted user accounts can be retrieved. Exchange Online deleted mailbox items are recoverable for 14 days by default. SharePoint Online has versioning of documents and a recycle bin for deleted items. Administrators can recover data during those retention periods, before it has been permanently deleted from the service. Even a customer that has ceased paying for licenses has several weeks to change their mind and reactivate the service without data loss.
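The retention periods above turn into simple date arithmetic when you need to know whether something is still recoverable. The following is a minimal sketch in Python that assumes the figures quoted in this article (14 days for Exchange Online deleted items, SharePoint Online backups every 12 hours kept for 14 days). The function and constant names are hypothetical, and the values may differ in your tenant.

```python
from datetime import datetime, timedelta, timezone

# Figures quoted in the article; treat them as assumptions that may differ
# in your tenant or change over time.
EXCHANGE_DELETED_ITEM_RETENTION = timedelta(days=14)
SHAREPOINT_BACKUP_INTERVAL = timedelta(hours=12)
SHAREPOINT_BACKUP_RETENTION = timedelta(days=14)


def recovery_deadline(deleted_at: datetime, retention: timedelta) -> datetime:
    """Return the last moment at which a soft-deleted item is still recoverable."""
    return deleted_at + retention


def is_recoverable(deleted_at: datetime, retention: timedelta, now: datetime) -> bool:
    """True while the item is still inside its retention window."""
    return now <= recovery_deadline(deleted_at, retention)


def retained_backup_count(interval: timedelta, retention: timedelta) -> int:
    """Approximate number of restore points kept at any one time."""
    return retention // interval


if __name__ == "__main__":
    deleted = datetime(2017, 2, 1, 9, 0, tzinfo=timezone.utc)
    now = datetime.now(timezone.utc)
    print("Recover before:", recovery_deadline(deleted, EXCHANGE_DELETED_ITEM_RETENTION))
    print("Still recoverable:", is_recoverable(deleted, EXCHANGE_DELETED_ITEM_RETENTION, now))
```

With the quoted SharePoint Online figures, `retained_backup_count` works out to 28 restore points, each of which can only restore an entire site collection.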
The limits (or complete absence) of backups make some organizations very uncomfortable, enough that they might avoid using Office 365 entirely. One of the important differences between a service like GitLab and Office 365 is that Microsoft’s cloud is a highly automated and orchestrated environment, with strict controls over what administrators can actually do. We get a glimpse of what that looks like behind the scenes in Ignite presentations like this one, and Microsoft’s “Just Enough Administration” approach. In short, the risk of data loss through human error is minimized by automation and security. With such controls in place, it’s hard to imagine an admin making the kind of error that took down GitLab.
Microsoft has a solid track record of running Office 365 to date in a backup-less manner, which helps keep costs down. However, some customers choose to invest in additional protection by signing up to one of the available Office 365 backup services. There’s a growing market of products that claim to do the job, but the reality is that they are often only able to back up specific application data, and don’t have complete coverage of how different applications and services in Office 365 work together to make a feature work. An example of this is Office 365 Groups, as Tony Redmond points out in his article on ITUnity:
Group mailboxes are in Exchange Online and group document libraries and the shared group notebook are stored in SharePoint Online.
Therefore, to recover a deleted group, you need to extract data from two places and make sure that the restored data goes into the right place. And if the Office 365 Group is associated with a plan managed by Microsoft Planner, you’ll have to recover the plan metadata too.
The situation becomes more complicated when Office 365 ships new features that lack “soft delete” or recovery options when they show up in customer tenants. For example, there is currently no way to recover a task or a plan that has been deleted from Planner.
What the GitLab incident serves to remind us is that backups are important, whether they are “soft delete” or traditional backup approaches, but the ability to recover from those backups is just as important. What GitLab apparently had not done is properly understand:
- What needs to be backed up, what doesn’t need to be backed up, and what can’t be backed up.
- What deletion or data loss scenarios could arise that require a restore or recovery process to be used.
- What are the steps involved in each of the recovery processes, and who needs to be involved.
- Whether the recovery processes actually work.
They’re not alone. I know plenty of customers who simply refuse to invest in documentation or test restores, and just assume that if the backup reports are showing a green tick then everything is fine.
“The condition of any backup is unknown until a restore is attempted.” – Schrödinger’s Backup
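A test restore does not have to be elaborate to be useful. As a rough illustration only, the sketch below backs up a local folder, restores it to a scratch directory, and compares checksums. It uses only the Python standard library and stands in for whatever backup product you actually run; the function names are hypothetical and nothing here talks to Office 365.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def back_up(source: Path, archive_base: Path) -> Path:
    """Zip the contents of source; returns the path of the archive created.

    archive_base must live outside source, or the archive would include itself.
    """
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=source))


def restore_test(source: Path, archive: Path) -> bool:
    """Restore the archive to a scratch directory and compare it to source."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        return checksums(source) == checksums(Path(scratch))
```

A green tick from `back_up` only tells you an archive was written. `restore_test` is the step that tells you whether the data can actually be brought back, and it will also report a mismatch if the live data has changed since the backup was taken.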
Those four points above can be used in your analysis of the risk of data loss in any Office 365 service or application that you currently use, as well as new ones arriving in future. Consider Office 365 Planner as an example. The feature is generally available today, which means it is considered ready for production use. However, as mentioned earlier, there are various deletion scenarios that are permanent. Planner uses Groups, so the data is stored in Exchange Online and SharePoint Online, and includes the configuration of the plan itself, buckets within the plan, tasks within buckets, files, and a OneNote notebook.
Working out how each of those is protected (or not) and can be recovered (or not) is an important process before widespread adoption of Planner within your organization. Or, you might decide to simply place the risk on the end user, and have them accept that if they want to use Planner to manage their projects then they may need to manually recreate deleted items from time to time. That’s just one example, but the same risk analysis needs to occur across your entire usage of Office 365 services.
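One lightweight way to record that analysis is a small risk register that asks the four questions of each piece of data. The Python sketch below is only an illustration: the class, field names, and the sample entries (including whether a restore has been tested) are hypothetical placeholders to be replaced with what you find in your own tenant.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DataAsset:
    name: str                       # what needs protecting
    stored_in: str                  # where the data lives
    recovery_method: Optional[str]  # None means no known way to recover it
    restore_tested: bool = False    # has anyone actually tried the recovery?


def open_risks(assets: List[DataAsset]) -> List[Tuple[str, str]]:
    """List (asset name, reason) for everything that is unrecoverable or untested."""
    risks = []
    for asset in assets:
        if asset.recovery_method is None:
            risks.append((asset.name, "no recovery method"))
        elif not asset.restore_tested:
            risks.append((asset.name, "recovery never tested"))
    return risks


# Illustrative entries only, loosely based on the Planner example above.
planner = [
    DataAsset("Plan tasks", "Planner", recovery_method=None),
    DataAsset("Plan files", "SharePoint Online", "recycle bin / versioning"),
    DataAsset("Group mailbox items", "Exchange Online", "deleted item recovery",
              restore_tested=True),
]

for name, reason in open_risks(planner):
    print(f"{name}: {reason}")
```

Anything the register flags is either a gap to close with another tool or a risk someone in the business needs to accept knowingly.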
At this stage I’d actually like to hear from you in the comments below. How do you feel about the state of data protection and recovery in Office 365 today? Are there any risks that are blocking your organization from adopting more Office 365 services? Or are you satisfied that any risks you’ve identified are mitigated by other measures? Leave a comment with your thoughts.
Like many things it’s a business question.
Detail the real-world impacts of an outage and, ideally, monetize the value.
What is most important to the business, client connectivity or data? If you had to prioritize one over the other which would you choose?
For the service provider, consider their reputation in regard to failures and outages. How well is your data protected? Is an outage likely? Compare your real-world cost with the compensation provided for a service outage. (I haven’t had a chance to check this for 365 agreements.)
Too often, fear among IT management of losing their jobs or having their reputation affected leads to purchasing solutions which add no real benefit.
Re: 365 Groups. The above may assist; maybe all they care about is knowing the data is safe in another location and accessible. When might this be applicable?
Well… there are only two examples of data loss that come to mind.
An undetected issue with the lag copy replication combined with a database corruption.
An act of war (as take-up of 365 by government departments increases, such a step becomes more and more likely).
Just a few of my thoughts.
I’ve been testing cloud backup products of late, namely SkyKick (available only to MS partners) and Spanning.
I am quite happy with the Exchange Online backup side of things; I think it works quite well. OneDrive restores seem to be the only shortfall at this stage: file versions, sharing and metadata (modified by, etc.) are lost upon a restore with both vendors.
Some companies still don’t have a presence in a particular country and some vendors are slow to bring SharePoint backups to their suite.
Cloud backup is a must – people have been told incorrect information about their data in the cloud.
I feel marketing has a big part to play in all of this. We’re constantly reassured that the cloud is safe and that best practices are being used, which creates a false sense of what is actually being done for us, as the marketing never really explains it in any real detail.
While the security aspects of services like O365 are up to scratch, the backup and archiving options simply aren’t. From my experience I recommend using 3rd party tools for backup & archiving in O365. Test them thoroughly for capability and reliability prior to committing to them.
Just because you are using cloud services doesn’t mean you can simply set and forget as many suggest. As engineers & managers we still have a duty to ensure that the data being trusted to a cloud service is protected and recoverable to the extent our clients expect.
The only thing that surprises me about this article is that it was not discussed sooner. It always seems to take a disaster to get focus on an issue.
Data in any mailbox that is not under a hold or retention policy is lost when you exceed the deleted item retention limit. Even if you extend the retention times to maximum values (for example, 30 days on shared mailboxes) you’re still out of luck on day 31. We were guided to replace our public folders with shared mailboxes. The big difference now is that the recovery options are dramatically different. Either license the shared mailbox and put an in-place hold on it, or live with the 30 day limit. (You may want to enable auditing when you up the retention. My experience has been that the first question after I report that I can recover the data is “who deleted the content?”)
Don’t get me wrong, I love the cloud. Escaping mailbox limits (and by extension PSTs) is good enough reason to make the move. But archiving and backups are still needed IMHO. When that old data is nowhere to be found, people are unhappy. Accidental deletions are not always found in 30 days (much less 14 days).
It’s hard to sell the need for archiving and backups. But like security, it only takes one big loss to rationalize the ongoing costs of retaining older data. Don’t get rid of the backups yet. And this is only mailboxes. Mercifully DLs and contacts will be much easier to back up.
PS-I am not a backup administrator.
The inability of clients to know with certainty where their data is being stored geographically is a major deterrent for some industries with strict governance (legal, medical, military etc.).
Lack of easy granular restore capability for email and DB content is a major setback as well. The anonymity of control and accountability also makes many clients reluctant to make the leap.