Wednesday, November 2, 2016

SCOM AD Monitoring Alerts

I've been working with a larger client for the last several months on Active Directory (AD) issues. One of the ongoing small issues has been AD monitoring alerts generated in System Center Operations Manager (SCOM) when it appears nothing is actually wrong.

The alerts were appearing intermittently on several of the servers, but not all. We were seeing alerts like this:

Failed to ping or bind to the Infrastructure Master FSMO role holder
AD Op Master Response : The script 'AD Op Master Response' could not determine the schema Op Master.The error returned was: 'LDAP://DC.contoso.com/RootDSE' (0x8007203A)

Failed to ping or bind to the Schema Master FSMO role holder
AD Op Master Response : The script 'AD Op Master Response' could not determine the schema Op Master.The error returned was: 'LDAP://DC.contoso.com/RootDSE' (0x8007203A)

Failed to ping or bind to the RID Master FSMO role holder
AD Op Master Response : The script 'AD Op Master Response' could not determine the RID Op Master.The error returned was: 'LDAP://DC.contoso.com/RootDSE' (0x8007203A)

Failed to ping or bind to the PDC FSMO role holder
AD Op Master Response : The script 'AD Op Master Response' could not determine the PDC Op Master.The error returned was: 'LDAP://DC.contoso.com/RootDSE' (0x8007203A)

Failed to ping or bind to the Domain Naming Master FSMO role holder
AD Op Master Response : The script 'AD Op Master Response' could not determine the domain naming Op Master.The error returned was: 'LDAP://DC.contoso.com/RootDSE' (0x8007203A)

Script Based Test Failed to Complete
AD General Response : While running 'AD General Response' the following consecutive errors were encountered: Failed to bind to 'LDAP://DC.contoso.com/rootDSE'. This is an unexpected error. The error returned was 'LDAP://DC.contoso.com/rootDSE' (0x8007203A)

We also implemented AD Client monitoring and it showed the following errors:

Script Based Test Failed to Complete
AD Client Connectivity : The script 'AD Client Connectivity' failed while getting 'LDAP://DC.contoso.com/RootDSE.The error returned was: 'LDAP://DC.contoso.com/RootDSE' (0x8007203A)

Script Based Test Failed to Complete
AD Client Monitoring: AD Connectivity is unavailable, or the response is too slow. The bind to 'LDAP://DC.contoso.com/RootDSE' took 4259 milliseconds, which is longer than the allowed 1000 milliseconds.
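
The 0x8007203A code that recurs in most of these alerts is an HRESULT, and it can be split into its fields to see what it says. The following is a minimal Python sketch, not part of the management pack; the function name `decode_hresult` is my own. As far as I know, facility 7 is FACILITY_WIN32 and Win32 error 8250 is ERROR_DS_SERVER_DOWN ("The server is not operational"), which fits a DC that is not accepting the connection.

```python
def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its failure bit, facility, and code."""
    return {
        "failure": bool((hresult >> 31) & 0x1),   # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,      # 11-bit facility field
        "code": hresult & 0xFFFF,                 # low 16 bits: the error code
    }


# Assumed meanings, taken from the Win32 headers rather than from the alerts:
FACILITY_WIN32 = 7           # the code field is an ordinary Win32 error
ERROR_DS_SERVER_DOWN = 8250  # "The server is not operational"

fields = decode_hresult(0x8007203A)
print(fields)  # {'failure': True, 'facility': 7, 'code': 8250}
```

So the error code only says that the directory server could not be reached. It does not say why, which is what led to the packet capture.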

After capturing network traffic on one of the DCs, I was able to correlate a specific pattern of network activity with the AD Client Connectivity failures. When the connection error occurred, the AD client attempted to connect to the DC, but the DC refused the connection. In Network Monitor, this appears as:
  • Client to server: TCP packet with the SYN flag set
  • Server to client: TCP packet with the ACK and RST flags set
This behavior indicates that there is either no service listening on the port or that the service is refusing the connection. The service was definitely listening because it worked most of the time, so the service must have been refusing the connection for some reason. It wasn't a timeout of any type because the refusal was happening in about 1 ms.
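
The difference between a refusal and a timeout can also be checked without a packet capture. The following is a rough Python sketch, not what the SCOM scripts do: it only attempts a TCP connection, not an LDAP bind. The function name `probe_tcp` is hypothetical, and the defaults assume the standard LDAP port 389 and the 1000 ms threshold from the alert above. A SYN answered with a reset typically surfaces as a "refused" result within a few milliseconds, while a dropped SYN shows up as a timeout.

```python
import socket
import time


def probe_tcp(host: str, port: int = 389, timeout: float = 1.0) -> tuple[str, float]:
    """Attempt one TCP connect; return (outcome, elapsed milliseconds)."""
    start = time.perf_counter()
    try:
        # create_connection performs the TCP handshake (SYN, SYN/ACK, ACK).
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        # The peer answered the SYN with a reset.
        outcome = "refused"
    except (socket.timeout, TimeoutError):
        # No answer arrived within the timeout.
        outcome = "timeout"
    except OSError as exc:
        # Anything else, such as a name resolution failure.
        outcome = f"error: {exc}"
    elapsed_ms = (time.perf_counter() - start) * 1000
    return outcome, elapsed_ms


# Example call (the host name is the placeholder used in the alerts):
#   outcome, ms = probe_tcp("DC.contoso.com")
#   print(f"{outcome} after {ms:.1f} ms")
```

Run in a loop from an affected client, something like this could show how often the refusals happen and whether they cluster at particular times.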

These domain controllers were running Windows Server 2008 R2. I did find some web pages indicating that this type of error can be caused by IPv6 being disabled in Windows Server 2012. I can confirm that enabling IPv6 also fixed this issue for Windows Server 2008 R2. However, it didn't fix the problem immediately. If we had restarted the DCs, it probably would have taken effect immediately, but we enabled IPv6 during production hours and couldn't reboot right away. About 7 hours after IPv6 was enabled, all alerts related to the AD Client monitoring stopped and it's been all good since.

In our scenario, it was a little more difficult to be confident this would work because the only two servers experiencing this issue regularly were in a single location. That led us to think that there might be something odd in the network configuration there. Both of those servers were configured to use WINS for some legacy stuff in that site. I think that might have been an influencing factor somehow as that's the only configuration difference I could find.

Update:
The two servers having these errors had slow LDAP client bind times even after these errors went away. After a lot of monitoring, we added virtual processors to those two DCs, increasing each from 2 vCPUs to 4 vCPUs. Once that was done, the slow LDAP client bind times disappeared. So, insufficient processing power seemed to play a part in this also. There were two DCs servicing a very large site (I think over 5000).
