Every organization faced with a large tenant-to-tenant migration is concerned about how quickly they can migrate their content. Inevitably, these organizations will raise concerns about being throttled during their SharePoint Online (SPO) content migration.
Administrators accustomed to Exchange Online throttle policies are often surprised by the limitations they encounter during an SPO migration.
What’s the Problem with Throttling Anyway?
Put simply, throttling slows down the content migration process based on external limitations. A good way to think about throttling is that it’s a bit like the restrictor plate used on NASCAR race cars during selected races. The restrictor plate effectively limits the top speed of the race car, because at speeds higher than 190 mph cars may flip over and cause crashes.
By controlling the migration pace and keeping a minimum threshold flow for content, throttling maintains the stability and usability of the customer’s tenants. This not only protects the content migration process, but also enables users to continue using the tenant.
How Does Throttling Work?
Each tenant implements throttling at the service level. The service throttles the Client Side Object Model (CSOM) calls and the Graph API calls. The service throttling rules and the migration API self-throttling rules are based on Compute and SQL availability. The migration API also adjusts how many tasks run in a tenant based on the availability of the backend resources.
Microsoft does not explicitly state exactly what the throttling rules are. Nor is there an official or unofficial policy by which Microsoft will remove throttling from a tenant. However, Microsoft can and will monitor a tenant if there are concerns about heavy throttling.
When all goes well, throttling maintains a smooth flow of traffic for all SPO tenants.
What Do Throttling Errors Look Like?
When migration tools make CSOM or REST API calls that exceed usage limits, the migration service throttles any further requests from the user for a time. You can still be throttled when using the Graph API, and throttling can also occur when uploading batches to a public or private Azure storage container.
Below are some examples of common throttling errors:
429 Error: Too Many Requests
What you will see in response to the throttling on HTTP Request calls is a high volume of HTTP 429 errors (“Too Many Requests”), HTTP 503 errors (“Server Too Busy”), and/or HTTP 500 errors (“Operation Timeout”). Specifically, an HTTP 429 error displays as follows:
HTTP/1.1 429 Too Many Requests
<title>Too Many Requests</title>
<h1>Too Many Requests</h1>
<p>I only allow 50 requests per hour to this Web site per logged in user. Try again soon.</p>
A throttled response carries a Retry-After header. Its value is an integer indicating the number of seconds after which the request can be resent. If you send a request before that time has elapsed, your request is not processed, and a new Retry-After value is returned. Several asynchronous calls may each receive a Retry-After value if they are processed close to the retry time. Thus, repeatedly sending a request while still receiving a 429 error is futile.
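One way to act on this is to keep a single shared "do not send before" deadline that every worker consults. The sketch below is a minimal, single-threaded illustration, not part of any Microsoft SDK: the `RetryGate` class and its method names are hypothetical, and it assumes Retry-After arrives as an integer number of seconds, as described above.

```python
import time


class RetryGate:
    """Tracks the earliest time any caller may send the next request."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so the gate can be tested
        self._not_before = 0.0     # shared deadline, in clock seconds

    def record_retry_after(self, header_value):
        """Push the shared deadline out by a Retry-After value (seconds)."""
        seconds = int(header_value)
        candidate = self._clock() + seconds
        # Several calls may report a delay at about the same time;
        # keep the latest deadline rather than the most recent one.
        self._not_before = max(self._not_before, candidate)

    def seconds_to_wait(self):
        """Return how long to wait before sending (0.0 means send now)."""
        return max(0.0, self._not_before - self._clock())
```

A worker would call `seconds_to_wait()` before each request and sleep for that long, so that one throttled response pauses all workers instead of each one discovering the throttle separately. A multi-threaded tool would also need a lock around the shared deadline.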
503 Error: Server Too Busy
The Retry-After value used with 503 errors indicates in seconds how long the service is expected to be unavailable. You may see a 503 error with the message “Server Too Busy.” This error will likely appear when you are uploading a lot of content to an Azure storage container. As with 429 errors, repeatedly sending a request while still receiving a 503 error is futile.
500 Error: Operation Timeout
The 500 error is a very general HTTP status code that means something has gone wrong on the website’s server, but the server could not be more specific about what the exact problem is. Sometimes, the 500 error is due to an incorrect permission on one or more files or folders.
Other times, an application is shutting down or restarting on the server. It’s difficult to know exactly what is happening, and there is no Retry-After value provided. In fact, this error usually has nothing to do with throttling, but it can be an indicator that the service is having trouble keeping up with demand.
What does Microsoft Recommend?
- Use app-based authentication (OAuth).
- Try to migrate during off-peak business hours.
- Business week evenings are obviously better than business daytime hours.
- Business weeknights and weekends are the best.
- Do not submit more than 5,000 migration jobs/requests at one time; over-queuing the network will create an extra load on the database and slow migration down.
- Implement Microsoft’s guidance and best practices on the back off and retry code.
- Good practice is to implement an exponential back off and retry – delay each following request exponentially to allow the migration service to “catch up.”
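The last recommendation can be sketched as follows. This is a minimal illustration under stated assumptions, not Microsoft's reference implementation: the `Throttled` exception, `backoff_delay`, and `send_with_retry` are hypothetical names, the base delay, cap, and attempt count are arbitrary, and the caller's `send` function is assumed to raise `Throttled` when it receives a 429 or 503 response.

```python
import random
import time


class Throttled(Exception):
    """Raised by the caller's send function on a 429 or 503 response."""

    def __init__(self, retry_after=None):
        super().__init__("request was throttled")
        self.retry_after = retry_after  # seconds from Retry-After, if any


def backoff_delay(attempt, retry_after=None, base=2.0, cap=300.0):
    """Delay in seconds before retry number `attempt` (0-based)."""
    exponential = min(cap, base * (2 ** attempt))
    # Never retry sooner than the service asked for.
    if retry_after is not None:
        return max(exponential, float(retry_after))
    return exponential


def send_with_retry(send, max_attempts=6, sleep=time.sleep,
                    jitter=random.random):
    """Call `send()` until it succeeds or the attempts are used up."""
    for attempt in range(max_attempts):
        try:
            return send()
        except Throttled as err:
            if attempt == max_attempts - 1:
                raise  # give up and surface the throttle to the caller
            delay = backoff_delay(attempt, err.retry_after)
            # Up to one second of jitter keeps parallel workers from
            # retrying in lockstep.
            sleep(delay + jitter())
```

The delay doubles on each attempt (2, 4, 8, 16 seconds and so on, up to the cap), and a Retry-After value takes precedence whenever it is longer than the computed delay.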
What Happens When the Migration is Throttled?
In real life, throttling looks a bit like ramp meters placed on highway onramps. Ramp meters are used to control when and how often vehicles can enter the highway, and the goal is to keep traffic moving on the highway. As a result, movement on the onramps may be slower at times.
This is the same experience with throttling and migrating SPO content. The content will move smoothly until heavy congestion is detected in the backend of the tenant. Then you will start seeing 429 errors returned with Retry-After values. The Retry-After values will force new content submissions to back off and wait until the backend congestion is reduced.
Can Microsoft Turn Off Throttling to Help You with Your SPO Migration?
Officially…No. Throttling rules cannot be disabled or suspended, and opening a support ticket will not lift the throttle. In a previous version of its guidance document, Microsoft states, “throttling is implemented to ensure the best user experience and reliability of SharePoint Online. [Throttling] is primarily used to load balance the database and can occur if you misconfigure migration settings, such as migrating all your content in a single task or attempting to migrate during peak hours.”
In my experience, I’ve never heard of an instance where Microsoft has lifted the throttling rules for content migration for a customer, including Microsoft Consulting Services. Microsoft’s migration tools do not have preferred App IDs that bypass throttling, and there’s no secret back entrance to avoid throttling.
What Would Happen if Throttling Was Lifted?
In the grand scheme of things, this would be bad for your tenant. Unrestricted migration of content to a tenant significantly increases the amount of content moving to the services, and the services could eventually fail due to the heavy load. The virtual network adapters could fail, or the SQL Server could stop responding to requests. Users on the tenant would see a significant drop in performance of online services – possibly a complete failure.
Of course, this situation can quickly deteriorate even further. Tenants that share hardware environments are impacted by the heavy load placed on any one of the tenants. Each tenant will experience a degradation in performance. Thus, the problem of one tenant becomes the problem of many tenants.
What Can You Do?
Back Off and Retry Code
For starters, the migration software you choose will certainly have an impact on migration throughput. The software must implement back off and retry code as recommended by Microsoft.
The migration software should also use OAuth authorization, an App ID, app-based authentication, as well as the Import API to create migration jobs in the target tenant and the Export API for reading from source tenants. The use of CSOM should be limited to features that are not supported by the migration API or the Graph API – and that can happen.
Second, understand that the best times to write content to the target tenant are off-peak times. Business daytime hours will generally see a higher probability of throttling as the SPO tenant is trying to maintain stability for M365 users.
Business week evenings are good times to migrate since there are fewer M365 users online. However, there may be backend processes running in M365 during these times. These processes may trigger throttling rules to ensure that they can complete successfully without interference from heavy migration processing.
However, the best times to migrate are business weeknights and weekends as there should be almost no M365 users online and fewer backend processes running. Weekends should be the primary target for scheduling content migrations.
Weekly Migration Throughput
Third, plan for a total weekly migration throughput based on the amount of content that can be migrated at different hours during the week. For example, in a sample content migration throughput plan for OneDrive, the throughput during business weekday hours might be only 1TB, while the non-business weekday hours throughput is higher at 3TB and the weekend throughput is much higher.
This is typical for large migrations, but you must consider the following factors:
- Not every migration is typical.
- The type of content being migrated has a significant influence on throughput.
- The throughput plan should indicate whether other content migrations are taking place at the same time.
- You cannot exclude other migrations to SPO, OneDrive or Teams in the same target tenant just because a different team is running a migration process, or a different migration tool is being used.
- What matters is that the content is migrating to the same tenant.
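A weekly plan like this can be turned into a rough duration estimate. The sketch below is illustrative only. It assumes the 1TB and 3TB figures from the sample plan are weekly totals for their windows, uses a made-up 6TB placeholder for the weekend (the plan above only says it is much higher), and models other migrations into the same tenant as a simple fraction of capacity, which is a simplification. The function and variable names are hypothetical.

```python
import math

# Weekly throughput per window, in TB. The first two figures come from the
# sample plan in the text; the weekend figure is a made-up placeholder.
WEEKLY_PLAN_TB = {
    "weekday_business_hours": 1.0,
    "weekday_non_business_hours": 3.0,
    "weekend": 6.0,  # hypothetical: the text only says "much higher"
}


def weeks_to_migrate(total_tb, plan_tb, share_of_tenant=1.0):
    """Estimate whole weeks needed to move `total_tb` under a weekly plan.

    `share_of_tenant` is the fraction of the tenant's capacity this
    migration can count on when other migrations write to the same tenant.
    """
    if not 0.0 < share_of_tenant <= 1.0:
        raise ValueError("share_of_tenant must be in (0, 1]")
    weekly_tb = sum(plan_tb.values()) * share_of_tenant
    return math.ceil(total_tb / weekly_tb)


print(weeks_to_migrate(50, WEEKLY_PLAN_TB))       # migration has the tenant to itself
print(weeks_to_migrate(50, WEEKLY_PLAN_TB, 0.5))  # tenant shared with another migration
```

Under these assumed numbers, sharing the tenant with a second migration doubles the estimate, which is why the plan should state what else is migrating at the same time.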
Another consideration is when the source and target tenants are in different geographical regions, as this may reduce the total amount of non-business hours available to your migration. Consider the following example: an organization is migrating content from New York, USA to Berlin, Germany. At 6PM on a Friday evening in Berlin, the migration window is open for the weekend. However, it is still 12PM in New York. The source tenant may still throttle on reads to maintain stability for users, and the rules may stay in effect for another 6 hours.
At the other end of the weekend, the throttling rules on the target tenant can start at 6AM in Berlin. However, it is only midnight in New York, and it will be another 6 hours until throttling rules take effect to protect the source tenant. Thus, your total potential migration throughput for this scenario can be reduced by 12 hours on the weekend. The same limitation exists for your evening and night-time processing.
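The weekend arithmetic above can be checked with a short calculation. This sketch assumes each tenant's off-peak window runs from Friday 6PM to Monday 6AM local time, and it uses fixed winter UTC offsets for Berlin and New York. Both are assumptions for illustration; production code would use `zoneinfo` so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offsets, assumed for illustration only.
BERLIN = timezone(timedelta(hours=1))
NEW_YORK = timezone(timedelta(hours=-5))


def weekend_window(friday, tz):
    """Return (start, end): Friday 18:00 to Monday 06:00 in the given zone."""
    start = datetime(friday.year, friday.month, friday.day, 18, tzinfo=tz)
    end = (start + timedelta(days=3)).replace(hour=6)
    return start, end


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


friday = datetime(2024, 1, 12)  # an arbitrary Friday
target_start, target_end = weekend_window(friday, BERLIN)
source_start, source_end = weekend_window(friday, NEW_YORK)

# Span: from the first window opening to the last window closing.
span = hours(max(target_end, source_end) - min(target_start, source_start))
# Overlap: the hours in which both tenants are off-peak at the same time.
overlap = hours(min(target_end, source_end) - max(target_start, source_start))

print(span, overlap, span - overlap)  # 66.0 54.0 12.0
```

Of the 66 hours between the Berlin window opening and the New York window closing, only 54 are off-peak for both tenants. The other 12 hours are the two 6-hour stretches described above.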
Set Appropriate Expectations on Migration Throughput
Fourth, it’s important to set realistic expectations with your customer on what to expect for migration throughput. Factors that impact throttling include:
- Multiple migration workloads
- Lots of small items in lists and small files in libraries
- Lots of permissions and metadata
- File versions
- Throttling on both source and target tenants
For example, imagine driving on a highway where there is little traffic. Some trucks are carrying large loads, but not necessarily heavy loads, which is akin to carrying large files. Their throughput can be very high because they can load and unload quickly.
Another type of truck is carrying a load of sugar beets. The load is like migrating thousands of small files, and this truck cannot go as fast as the other trucks. It’s on the same highway; but it is heavier, needs more time to load, travels at a slower speed, and needs more time to unload. It also takes more time to process all the sugar beets at the factory after they are unloaded.
These two scenarios call for different expectations. First, no two migration loads are the same. Second, even when the total migration sizes are equal, large files will move and process faster than small files. Thus, the expectation that migration throughput can be measured by size alone is false.
Do Not Try to Avoid Throttling by Not Following Best Practices
Fifth, do not try to avoid throttling with inventive solutions just because you heard from someone or read online somewhere about a “recommended” approach. Here are a couple of my favorites:
- Running multiple migration solutions concurrently. This is deemed to be faster because each migration solution uses OAuth authorization and has its own App ID; supposedly, each migration solution will not be throttled, and you can push as much content as possible.
- This is false – throttling is managed at the service level, not the individual migration solutions. Using OAuth and App ID allows for more throughput in comparison to CSOM, but they can still be throttled by the service.
- Running a content migration with multiple apps installed in Azure AD. Each app uses a different App ID and service principal, and the migration solution uses every app to send content migrations through. The likelihood of being throttled is greatly reduced because multiple service principals are being used, and each service principal is seen as a unique migration process and the throttling will be determined uniquely by the service. Thus, your throughput will increase!
- This is false – and not a best practice supported by Microsoft.
- In fact, Microsoft will warn you if they determine you’ve implemented this solution and will ask you to remove it.
Do Not Panic Over Throttling Errors
Lastly, do not panic if you see throttling errors. This is normal and is usually an indicator that your content migration solution is pushing content to the limit of what the migration service can support. You should reduce the requests being submitted if you see warnings regarding CSOM or the REST API.
Just like racing a car, there are times where you want to see the RPMs close to the red zone, but you don’t exactly want to see the needle in the red zone for too long. It’s not good for the engine, which could seize. For migrations, this could mean the migration service will lock you out and you’ll have to wait a few days for the service to let you back in again.
Your first recourse should not be to look for ways to bypass throttling, as it maintains the stability of your tenants. Removal of throttling is not possible and would likely result in your tenant crashing anyway.
When looking for ways to expedite a tenant-to-tenant content migration, there are several actions you can take to absorb this extra time without extending deadlines. These include planning to migrate at off-peak times; setting appropriate expectations with customers; and avoiding inventive solutions that do not follow best practices.