Sunday, October 9, 2016

Modifying Mailbox Replication Latency Tolerance for Mailbox Moves

During the summer I was doing an Exchange 2013 to Exchange 2016 migration. All was going well except that mailbox moves would stall out and take a very long time. I didn't save the logs at the time, but they indicated that the moves were stalling due to disk health and, more specifically, disk latency.

To view the status of individual mailbox moves that are not completed, use the following command:

Get-MoveRequest | Where {$_.Status -ne "Completed"} | Get-MoveRequestStatistics

By default, Get-MoveRequestStatistics shows only DisplayName, StatusDetail, TotalMailboxSize, TotalArchiveSize, and PercentComplete. However, it also returns other properties that are useful for diagnosing stalled moves, such as:
  • TotalStalledDueToContentIndexingDuration
  • TotalStalledDueToMdbReplicationDuration
  • TotalStalledDueToMailboxLockedDuration
  • TotalStalledDueToReadThrottle
  • TotalStalledDueToWriteThrottle
  • TotalStalledDueToReadCpu
  • TotalStalledDueToWriteCpu
  • TotalStalledDueToReadUnknown
  • TotalStalledDueToWriteUnknown
  • DiagnosticInfo
  • Report
In my case a slow mailbox move was showing stalling due to Content Indexing Duration (caused by slow disk) and Write Throttle (triggered by disk latency).
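As a rough sketch, the stall counters listed above can be pulled out alongside the default columns in one pass. This assumes it is run from the Exchange Management Shell and relies on Format-List accepting a wildcard in property names to match every TotalStalledDueTo* property:

```powershell
# Show the default columns plus every stall counter for moves that have not completed.
Get-MoveRequest |
    Where-Object { $_.Status -ne "Completed" } |
    Get-MoveRequestStatistics |
    Format-List DisplayName, StatusDetail, PercentComplete, TotalStalledDueTo*
```

Whichever counters are climbing fastest point to the resource that is throttling the move.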

To make things easier, I tend to dump everything into a variable so that I can look at the results in different ways and one by one. The example below collects all of the move request statistics in a single variable and then displays all of the properties for the first move.

$moves = Get-MoveRequest | Where {$_.Status -ne "Completed"} | Get-MoveRequestStatistics
$moves[0] | Format-List

Now the preferred option is to figure out why your disk latency is high and try to resolve that. However, in this scenario, we needed to get the mailboxes moved because there were too many other things on the SAN to tinker with.

So, to speed up the mailbox move process, I tweaked the configuration file for the Mailbox Replication service to allow a higher level of latency before stalling out on the moves. This file is C:\Program Files\Microsoft\Exchange\v15\bin\MsExchangeMailboxReplication.exe.config.

At the end of this file there are a number of settings that control the service. The disk latency settings are at the very end, in the appSettings section, as shown below:

  <appSettings>  
   <!-- Mdb latency health threshold values in msec. Valid range is from 0-1000. -->  
   <add key="MdbFairUnhealthyLatencyThreshold" value="20"/>  
   <add key="MdbHealthyFairLatencyThreshold" value="10"/>  
   <!-- Maximum delay that WLM returns under "Fair" database resource health in msec. Valid range is from 0-60000 -->  
   <add key="MdbLatencyMaxDelay" value="60000"/>  
   <add key="LogEnabled" value="true" />  
   <add key="LogDirectoryPath" value="%ExchangeInstallDir%Logging\MailboxReplicationService\" />  
   <add key="LogFileAgeInDays" value="30" />  
   <add key="LogDirectorySizeLimit" value="100MB" />  
   <add key="LogFileSizeLimit" value="10MB" />  
   <add key="LogCacheSizeLimit" value="2MB" />  
   <add key="LogFlushIntervalInSeconds" value="60" />  
  </appSettings>  

The two settings for disk latency are MdbFairUnhealthyLatencyThreshold and MdbHealthyFairLatencyThreshold, which have default values of 20 ms and 10 ms respectively. In most cases those are reasonable defaults. I incrementally increased the values until the mailbox moves stopped stalling. My case was extreme and I ended up using values of 150 ms and 75 ms. The lower threshold (MdbHealthyFairLatencyThreshold, 75 ms in my case) seemed to be the one that triggered the stalling.
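If you would rather script the change than edit the file by hand, here is a minimal sketch. Set-MrsLatencyThreshold is a hypothetical helper name, not something that ships with Exchange. The 0-1000 range comes from the comment in the config file, and the check that the healthy/fair value is lower than the fair/unhealthy value is my own assumption about what makes sense. It takes a backup copy before saving.

```powershell
# Hypothetical helper (not part of Exchange): updates the two Mdb latency thresholds.
function Set-MrsLatencyThreshold {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [Parameter(Mandatory = $true)] [ValidateRange(0, 1000)] [int] $HealthyFair,
        [Parameter(Mandatory = $true)] [ValidateRange(0, 1000)] [int] $FairUnhealthy
    )

    # Assumption: the healthy/fair boundary should sit below the fair/unhealthy boundary.
    if ($HealthyFair -ge $FairUnhealthy) {
        throw "HealthyFair must be lower than FairUnhealthy."
    }

    # Resolve to a full path because XmlDocument.Load ignores the PowerShell location.
    $fullPath = (Resolve-Path -LiteralPath $Path).Path

    # Keep a copy of the original file next to it.
    Copy-Item -LiteralPath $fullPath -Destination "$fullPath.bak" -Force

    $xml = New-Object System.Xml.XmlDocument
    $xml.PreserveWhitespace = $true   # keep the existing layout and comments
    $xml.Load($fullPath)

    $settings = @{
        MdbHealthyFairLatencyThreshold   = $HealthyFair
        MdbFairUnhealthyLatencyThreshold = $FairUnhealthy
    }

    foreach ($key in $settings.Keys) {
        $node = $xml.SelectSingleNode("//appSettings/add[@key='$key']")
        if ($null -eq $node) {
            throw "Key '$key' not found in $fullPath"
        }
        $node.SetAttribute('value', [string]$settings[$key])
    }

    $xml.Save($fullPath)
}
```

With the values from my case, the call would look like this:

```powershell
Set-MrsLatencyThreshold -Path 'C:\Program Files\Microsoft\Exchange\v15\bin\MsExchangeMailboxReplication.exe.config' -HealthyFair 75 -FairUnhealthy 150
```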

After you modify the config file, you need to restart the Microsoft Exchange Mailbox Replication service for the change to take effect. This will stop all mailbox moves that are in progress. They will start up again after the restart is complete, but you may lose the progress that has already been made. For example, if 1.5 GB of a 2 GB mailbox has been moved, the whole mailbox move might start over again.
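A sketch of the restart and a quick follow-up check is below. I am assuming the service's short name is MSExchangeMailboxReplication; confirm it with Get-Service on your server before relying on it. The restart needs an elevated shell on the server whose config file was edited, and the second command needs the Exchange Management Shell.

```powershell
# Restart the Mailbox Replication service so the new thresholds are read.
Restart-Service -Name MSExchangeMailboxReplication

# Afterwards, confirm that the outstanding moves pick up again.
Get-MoveRequest |
    Where-Object { $_.Status -ne "Completed" } |
    Get-MoveRequestStatistics |
    Format-Table DisplayName, StatusDetail, PercentComplete -AutoSize
```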

A couple more articles that could be helpful if you're looking at modifying this config file:

6 comments:

  1. Great help! Thank you

  2. solid disk cause this problem too, move database mail to normal disk

    Replies
    1. I have only SSD, you have a explain for why is slow with flash ?

  3. Do I have to change it on the Exch2013, if I have dag 2x Exch2013 and DAG 2xexch2019 (all users on DAG with Exch 2019)?

    Replies
    1. It's been a while since I've done this, but I think I was updating the config file on the target rather than the source.
