Rish Tandon is the Corporate Vice President of the Teams Engineering organization at Microsoft. He is also the former Chief Technology Officer for Heal. Founded in 2015, Heal is a health technology company focused on revolutionizing care delivery by bringing highly qualified, well-trained, and affordable doctors to patients’ homes. Prior to Heal, Rish spent five years at Amazon as General Manager of Amazon’s Mobile Shopping app and platform. Rish has over 17 years of experience leading development of large-scale products and services across Amazon, Expedia, and Microsoft.
Discussing Performance, Scalability, Compliance, Backups, and Other Questions
Rish is a passionate advocate for Teams with a unique perspective on the engineering challenges involved in scaling the product to serve over 145 million Teams users. This article is a narrative based on an hour-long conversation between Tony Redmond and Rish Tandon on May 14. Naturally, the conversation occurred using a recorded Teams meeting. Direct quotes taken from the conversation are in italics.
Operationalization of Teams for Massive Scale
I started by asking Rish about the major technical achievements for Teams during his tenure. He replied that the first thing that came to his mind was the work done to achieve massive scale for Teams. When he joined, Teams was still relatively small and operated almost like it was in startup mode. Although agility is highly valued in terms of helping startups to succeed by getting a product out the door, a different attitude is needed when a product becomes mainstream, especially when demand for a service goes “through the roof.” In the early days, it didn’t matter so much if Teams went down, even for a few hours, as customers had the option of going back to Skype for Business. That’s not the case now because so many people use Teams and Skype for Business Online retires in July 2021. It quickly became evident that Teams had to be able to run on a 24 x 7 x 365 basis.
A full review of the Teams architecture (spanning over 250 microservices) to create what is now Azure Communication Services considered how the services could scale to deal with the expected traffic while maintaining performance. The microservices are disparate and loosely coupled. This is good because each of the services can scale independently, but because the services are loosely coupled, it is essential to make sure that services can’t “overwhelm each other” when under strain. Rish explained how Microsoft ran a resiliency program to ensure that the architecture could deliver.
He pointed out that a cultural change was necessary in the organization so that everyone working on Teams bought into the idea that they were responsible for a massive always-on service. Each of the services was assessed to make sure that it could deliver four nines (99.99%) availability and that failures couldn’t adversely affect (or “brown out”) large parts of the infrastructure (which is what happened when Teams experienced its first major outage in February 2019 due to an issue with the Azure Key Vault service).
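To put the four nines target in concrete terms, the arithmetic below converts an availability percentage into a downtime budget. This is only a worked illustration; the function name and the choice of a 365-day year and a 30-day month are assumptions made here, not figures from the conversation.

```python
def downtime_budget_minutes(availability: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime permitted in a period at a given availability."""
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


FOUR_NINES = 0.9999

# Budget over a 365-day year and over a 30-day month.
per_year = downtime_budget_minutes(FOUR_NINES)
per_month = downtime_budget_minutes(FOUR_NINES, period_days=30)

print(f"Per year:  {per_year:.2f} minutes")
print(f"Per month: {per_month:.2f} minutes")
```

At this level, a service has less than an hour of downtime to spend in a whole year, which is why an outage lasting a few hours is no longer tolerable.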
Making Teams Services More Robust
Microsoft graded services according to their failover capability (“Grade A” services have active-active capability and can fail over automatically; other less capable services might need intervention). The grading scheme highlighted some common failings which Microsoft needed to address. He spoke about the application of circuit breaking, in which a service in danger of being overwhelmed can be removed from the circuit without adverse impact on the rest of the system. In other words, a single feature might stop working for a time, but it wouldn’t remove the ability of people to access and use Teams.
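Circuit breaking as described above can be sketched in a few lines. This is a minimal, generic version of the pattern and not how Teams implements it; the class, the load_feature helper, the thresholds, and the "presence" example are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stops calling a struggling dependency after repeated failures."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func: Callable[[], str]) -> str:
        if self.state == "open":
            raise CircuitOpenError("dependency temporarily removed from the circuit")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call re-opens at once; otherwise open at the threshold.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def load_feature(breaker: CircuitBreaker, fetch: Callable[[], str], fallback: str) -> str:
    """Degrade a single feature to a fallback instead of failing the whole client."""
    try:
        return breaker.call(fetch)
    except Exception:
        return fallback


# Usage sketch with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0,
                         clock=lambda: fake_now[0])


def failing_fetch() -> str:
    raise ConnectionError("presence service overloaded")


first = load_feature(breaker, failing_fetch, fallback="presence unavailable")
second = load_feature(breaker, failing_fetch, fallback="presence unavailable")
state_after_failures = breaker.state
fake_now[0] = 11.0  # pretend the reset timeout has elapsed
state_after_timeout = breaker.state
recovered = load_feature(breaker, lambda: "online", fallback="presence unavailable")
state_after_success = breaker.state
```

The design point is the fallback: while the circuit is open, the struggling service gets no traffic and the one feature that depends on it degrades, but nothing else in the application is affected.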
After analyzing the robustness of the services, operationalization was the next step. A lot of this happens through automated monitoring of service health with alerts generated when thresholds are exceeded. As Rish said, “Problems will happen. You’ve got to find them so you can mitigate them fast so that end users don’t notice the problem.” And of course, problems do happen, as in March 2021 when an Azure AD outage caused problems for many applications, including Teams.
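Threshold-based alerting of the kind mentioned here can be illustrated with a small sketch. The metric names, limits, and the evaluate function are invented for the example and say nothing about the monitoring systems Microsoft actually runs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name
    limit: float  # alert when the sampled value is above this limit


def evaluate(samples: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one alert message for every metric that exceeds its threshold."""
    alerts = []
    for threshold in thresholds:
        value = samples.get(threshold.metric)
        if value is not None and value > threshold.limit:
            alerts.append(f"{threshold.metric}: {value} exceeds {threshold.limit}")
    return alerts


# Illustrative values only.
thresholds = [
    Threshold("error_rate", 0.05),
    Threshold("p99_latency_ms", 800.0),
]
samples = {"error_rate": 0.07, "p99_latency_ms": 420.0}
alerts = evaluate(samples, thresholds)
```

In practice the value of such checks lies in how quickly an alert reaches someone who can mitigate the problem, which is the point Rish makes about finding problems before end users notice them.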
Implementing a cell-based architecture, or the ability to isolate problems in segments, was the next step. Rish pointed to Exchange Online as an example of a product which has used isolation for a very long time by segmenting workload in partitions like databases, servers, and database availability groups. This kind of internal experience helped Teams figure out how to isolate failures and reduce the impact of outages on users (the outage only affects people connected to a partition).
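The idea of isolating failures to a partition can be sketched as follows. This assumes tenants are assigned to cells by hashing a tenant identifier, which is a common approach but not something the conversation confirmed for Teams; the tenant names, cell count, and function names are hypothetical.

```python
import hashlib
from typing import List


def cell_for(tenant_id: str, cell_count: int) -> int:
    """Map a tenant to a cell using a stable hash of its identifier."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def affected_tenants(tenants: List[str], failed_cell: int, cell_count: int) -> List[str]:
    """Return only the tenants served by the failed cell."""
    return [t for t in tenants if cell_for(t, cell_count) == failed_cell]


# Illustrative data: an outage in one cell touches only the tenants placed there.
CELL_COUNT = 8
tenants = ["contoso", "fabrikam", "northwind", "adatum", "tailspin", "wingtip"]
failed_cell = cell_for("contoso", CELL_COUNT)
affected = affected_tenants(tenants, failed_cell, CELL_COUNT)
unaffected = [t for t in tenants if t not in affected]
```

Because the mapping is deterministic, operators can tell exactly which tenants an outage in a given cell touches, and every other tenant continues to be served by healthy cells.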
Teams is highly dependent on services like Azure Cosmos DB (its message database) and Azure Front Door. Discussions with the developers responsible for each service drove hardening of the services to resist failure, something that benefited Teams enormously. The same benefits accrue to anyone else using Azure.
During the early part of the pandemic (March – June 2020), a sudden and dramatic increase in demand challenged all of Microsoft 365. This forced Microsoft to scale out at a rate not previously expected, and to reduce some services to conserve resources (for example, the definition of Teams meeting recordings was reduced from 1080p to 720p). Rish noted that everyone from across Microsoft 365 and Azure stepped up to figure out solutions, ignoring organizational boundaries to work together to solve problems. The relationships built under pressure have been invaluable since in terms of driving progress.
Even after the work done to improve its services, the extra demand put unanticipated strain on Teams. Rish said that “running on top of Azure allowed us to scale during the pandemic.” Using their old architecture, Teams would have had to build everything themselves from bare-metal servers to cope. It would have been impossible to find enough compute to deliver what people needed, like scaling to handle seven billion meeting minutes a day. The capability of Azure to quickly deliver thousands of additional cores overnight handled the load with minimal disruption. Although saying “we’re not done yet,” Rish noted that Azure Communication Services is available now to anyone who wants to build a similar service.
Coping with Change
We then spoke about the large volume of features delivered by Teams and the difficulty customers have in coping with so many changes. Rish calls this “feature velocity,” literally the drive to develop new features quickly to satisfy customer demands for functionality to support working at home. The features were built “at a pretty rapid clip” because people’s work habits had changed so dramatically.
At Microsoft’s FY21 Q3 results, Satya Nadella said that Teams had delivered over 300 new features in the previous year. I asked whether Teams delivers updates in an uncontrolled manner at times and if this affects the quality of the software. Rish said that the balance between getting new features out and delivering quality is “a delicate dance.” He doesn’t want to deliver an unstable or unusable product and agrees that Teams needs to improve their ability to predict when customers will get new features. While Teams would like to ship some features quicker, they also want to deliver quality.
As an example of the dance between velocity and quality, he pointed to the technical challenges created when Teams followed the Microsoft decision to replace Angular with React in the back end, work that they’re doing while also delivering new features. The benefit of the change is that Teams can share more code with other Microsoft 365 applications (something which might appear when Teams supports the Fluid Framework). Rewriting “the entire messaging canvas is a pretty long and daunting task.” Microsoft has chosen to host the React framework within Angular, and this caused some unique memory management challenges. To find and plug memory leaks, Teams had to build their own tools. Full transition to the new code is nearly complete and should roll out in the next few months, but at the same time as doing this fundamental reengineering, Teams has continued to ship new features. Sometimes problems caused by the rewrite (like a new leak) cause features to miss their forecast dates as announced at conferences like Ignite, in blogs, or in roadmap items.
The development group would love to ship 900 features annually to satisfy all the asks people have made, but it’s just not possible to do this and maintain quality, especially when reengineering under the covers also happens. Rish said that he feels bad that he can’t deliver on some of the major features requested in User Voice, like a more elegant implementation of access to multiple tenants. Architectural limitations and the need to ensure that data doesn’t leak across tenants are reasons why Microsoft has not delivered the feature.
Shared channels (Teams Connect) are a big new Teams feature that Microsoft wants to ship later this year. Rish commented on the challenges involved in creating a federated collaboration capability which allows Teams users from different tenants to connect around channels. He pointed to compliance as a particular concern and said that part of the reason why shared channels are not available yet is that Microsoft must make sure that organizations can use shared channels in the confidence that collaboration is manageable and secure. Rish noted that Microsoft’s IT group, which runs Teams for over 200,000 users, is helping the development group iron out issues in shared channels.
Rish believes that feature delivery will become more predictable over the coming year. Some features might slip, but the delay should be less than a quarter rather than the six- or nine-month slippage sometimes seen today.
Users often criticize the performance of Teams clients, and commentators are quick to point to Microsoft’s decision to use Electron as the foundation for its client strategy as core to the problem. While acknowledging that Microsoft needs to do better with performance, Rish believes that Electron is a vital part of their ability to deliver features across multiple operating systems. They understand some of the consequences flowing from the decision to use Electron and have plans to improve performance.
Rish pointed to the “big bet” made by Edge to base their future on Chromium as important for Teams. It means that Microsoft has developed a lot of Chromium expertise which helps with Electron, including its performance. Without being able to say too much, he noted that “it makes sense for us to work closely with the Edge team… to figure out how we can build much better client architectures.” He said that Microsoft is doing work to improve the client now but couldn’t commit to when the results will be available to Teams users.
Another of the major pieces of work Rish has taken on is the building out of the Teams development organization. Although he didn’t say just how many engineers now work on Teams, thousands of people are involved across multiple development centers in the U.S., Ireland, India, and the Czech Republic. Bringing in the right talent to ramp up what was a very young development organization and making sure that people don’t burn out have been specific concerns, especially during the pandemic. Because no one has been able to travel, “a significant proportion of the team haven’t seen each other.” Rish feels a certain sense of pride that his team accomplished so much over the last year.
I asked if engineers joining his team have the experience of using Office 365 in real life. Rish said that a lot of cross-pollination from the Exchange and SharePoint development groups has brought that experience to Teams. He said that the leaders from the different Microsoft 365 development groups gather regularly to share information, including how to tackle problems like scaling. Being able to draw upon the reservoir of experience available within Microsoft has been very helpful in growing Teams.
Rish said that Teams drives a lot of workload consumption for Azure. Its use of microservices makes Teams different to other applications like SharePoint and Exchange. I asked what Rish thought of the Teams effect on SharePoint and OneDrive usage, saying that my view is that many people store files through Teams without knowing (or caring) where the file goes and that this has driven SharePoint usage to the point where it now has 200 million daily active users. This is a huge change from traditional SharePoint activity which focused more on document management than document storage. Rish agreed that Teams has driven a massive increase in many Microsoft workloads. Perhaps for diplomatic reasons, he couldn’t agree with my assertion about the Teams effect on SharePoint, but maybe this is more obvious outside Microsoft. However, he agreed that the data does show that the success of Teams has been good for SharePoint.
Rish reflected on the way that Teams brings “all the [Microsoft 365] workloads together,” a factor which drives usage for any workload associated with Teams. Apart from SharePoint, Rish pointed to the huge increase in Planner (task) activity since the availability of the Tasks app for Teams. He noted the effect that placing an application on the left-hand rail has on its popularity and usage and how he uses apps like Approvals to get real work done daily.
Of course, limited space exists in the navigation rail so not every app can feature there. Rish said that apps will get “snackier and snackier,” a way of describing how app functionality can appear when someone needs to do something (like create a task from a channel conversation) and disappear once the action is complete. He said that work they are doing with messaging extensions and the Fluid framework will make this mode of app consumption even more powerful. While doing this work, the Teams developers are aware of the need to avoid making the client interface too complex. The interface must be simple enough to get your work done while surfacing the right tools at the right time to make work more efficient.
In terms of managing Teams, I commented that the current tools are well suited to small to medium organizations but perhaps not as good for larger organizations, like the 2,700 with more than 10,000 Teams users. Rish agreed that some extra work is necessary in this space, including more investment in intelligent PowerShell-based scripting. He said that the admin tools team is working hard to make the tools work better at scale and to deal with issues like figuring out what clients need attention because they’re missing important updates, or even report what features people use. Use of analytics data will help administrators understand how their users interact with Teams, such as what channels are needed, and which are candidates for removal.
In terms of compliance, Rish pointed out that the Teams retention and compliance tools are based on Exchange Online. He sees this as an advantage because administrators already know how to manage these policies, giving a continuum from what people had to what they use today.
I asked about guest accounts and if Microsoft was worried about the lack of visibility of what those accounts do outside their home tenant. Shared channels use federation rather than guest accounts, but the lack of visibility tenant administrators have over what information passes across these mechanisms is worrying. Rish said that one of the things Microsoft wants to deliver with shared channels is visibility about what’s happening on both sides. He also pointed to some challenges that shared channels throw up, such as what happens when two tenants collaborate but have different retention policies for Teams. Whose retention policy wins – the most restrictive policy or the least? Microsoft is still working on that question, but it illustrates the issues which occur around collaboration across multiple tenants. Internal experience of how shared channels work within Microsoft will help resolve the issues, and Rish believes that if Microsoft’s admins can manage Teams for 200,000 employees in a safe and secure manner, admins of customer tenants should be able to do the same. Conceptually this statement is accurate while ignoring the salient fact that not all customers have a direct line to Teams engineering…
Moving on to Microsoft Information Protection (MIP), I asked if Teams would use sensitivity labels to protect content in addition to container management (sensitivity labels already protect content in SharePoint Online and OneDrive for Business, so the question is about native Teams content like chats). This is different to Customer Key, which protects data at rest and which Teams already supports. Rish said that Microsoft is still figuring out how best to protect chats and messages with encryption but pointed out that tenants can restrict federation with other tenants (with an Azure B2B collaboration policy) to make sure that sharing only happens with other domains they trust. This will be important when customers start using shared channels.
Backups and APIs
We then spoke about APIs and the struggle ISVs have in working with Teams data, notably in backup and restore. No backup API for Teams is currently available, possibly because the notion of backing up a team is very complex given the number and types of connections Teams has with other first- and third-party services. When it comes to raw data, ISVs can back up Exchange and SharePoint data, including the compliance records captured by the substrate, but the backup is imperfect and restores are always problematic.
Rish agreed that Microsoft “will have to solve” the problem and said that they’re starting with the migration APIs, including those used internally to allow Teams data to move between datacenters (for instance, when a tenant decides to exploit multi-geo capabilities and move some Teams data to a go-local datacenter). The migration API only covers Microsoft data, so different arrangements must process third-party app data. He said that Microsoft is working on migration tools to allow organizations to take their data out of Teams and import data in (this API is in preview), which should help with some tenant-to-tenant (merger and acquisition) scenarios.
Backing up all the data generated by Teams and other Microsoft applications is a difficult problem because of the interconnectivity of Teams, and it’s hard to see a simple “back up all of Teams” solution arriving soon that can restore data perfectly and ready for use. However, it should be possible to back up the structure of Teams (ISV backup vendors already do this) and backfill all the messages, including chats. Channel conversations are harder than chats because of their reply chain structure, but Microsoft is learning lots from the migration API. While waiting for a backup API, the migration API might help. Time will tell.
Email and Teams
Finally, we talked about Teams and email and how Rish sees the two modes of communication working together. He doesn’t think that Teams will take over from email. Instead, “the two will coexist for a very long time” to serve different purposes and “it’s wishful thinking that email will disappear.” He pointed to the investments made by Teams to engineer the Share to Teams and Share to Outlook features to increase interoperability between the two modes.
Teams is about collaboration inside the organization and between organizations (using federation and guest accounts), but it doesn’t have the ubiquity of email. Customers need both open internal collaboration and email. The trick is to make intelligent decisions about when to use each tool.