Ever wonder how DRS replication works in SCCM? This is need-to-know material for all admins. If you want to know how to start troubleshooting a replication issue, this blog's for you!
I installed a hierarchy for a customer a few weeks back consisting of a CAS and three primary sites. I have heard all of the arguments for and against this design, but that's going to be another post. At the time we installed SCCM 2012 SP1 and followed it shortly after with the CU3 update. During this time we noticed several "issues" that were rather inexplicable: collections were not showing the proper membership at the CAS. They would show a count of devices in a collection, but when we clicked Show Members the collection was empty.
About a week ago we had opened a case with Microsoft for what we thought were some unrelated errors in DRS replication. Now in case you didn't know, SCCM 2012 uses the SQL Server Service Broker to replicate database information around the hierarchy. There are two types of data: Global data, which is replicated to all sites, and Site data, which is, well, just site data. Some of the Global data was having a problem, and the Monitoring tab showed that the replication links had failed. When we got Microsoft on the phone the answer worked out to be: reinstall the sites. Since everything was new anyway it wouldn't hurt a thing, so we used new site codes and reinstalled the primary sites one by one.
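You can see the Global/Site split for yourself in any site database. A quick sketch against the vReplicationData view (the ReplicationPattern column name is what I've seen in SCCM 2012; verify against your own database, as view definitions can vary by version):

```sql
-- List each replication group and whether it carries global or site data.
-- Run in a query window connected to the ConfigMgr site database.
SELECT ReplicationGroup,    -- logical group of tables replicated together
       ReplicationPattern   -- e.g. 'global' or 'site'
FROM   vReplicationData
ORDER BY ReplicationPattern, ReplicationGroup;
```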
I learned a few things from this call. When a site is initially installed, it takes about an hour to perform the first replication of global and site data. During this time it is not advised to install another site, as this adds complexity to the mix and you could wind up with broken replication links. So we waited a couple of hours between site installations. We reinstalled CU3 on all sites, again waiting for a period between sites. We saw that the sites were again functional, but the CAS still had the problem with collection memberships. Strange.
Today there was a question about whether CU3 had been installed properly, so I located a blog post from the Microsoft Configuration Manager Support Team, "Reinstalling Cumulative Updates in Configuration Manager" (http://blogs.technet.com/b/configurationmgr/archive/2013/08/23/reinstalling-cumulative-updates-in-configuration-manager.aspx), and read through it. Using this as my guide I went to each site and removed, then reinstalled, CU3 as quickly as I could. When that was complete I looked around in the console to see if the collections were updating properly and saw that database replication was broken: one link was degraded and the other two had failed. So I sent an email to the TAM requesting that the old case be reopened.
The first thing the engineer asked for was a live session. We got that going as I explained the details of how the hierarchy was set up. He listened, kept looking around, and finally stopped at the remote database on the CAS. He verified that it was the CAS, then started with the first (of many) queries to execute in SQL. The very first query was EXEC spDiagDRS, which gives the state of the replication system, and it put lots of tables and information up on the screen in no time. We noticed that all of the replication links were now in a failed state and in Maintenance Mode. The rcmctrl.log said the same thing.
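If you want to run the same check yourself, open a query window against the site database (the CAS database in our case) and execute the built-in diagnostic stored procedure:

```sql
-- Dump the overall state of DRS replication: link states, initialization
-- status, queue depths, and per-group sync information.
EXEC spDiagDRS;
```

It returns multiple result sets, so SQL Server Management Studio is the easiest place to read the output.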
So while he was executing additional queries I tried to follow along. He looked in the registry and set the logging verbosity to 2, which gives expanded logging; restarting the SMS Executive service makes the additional logging take effect.
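As a sketch, the change looked roughly like the following from an elevated command prompt. I'm writing the registry path from memory of the RCM verbose-logging key, so verify it on your own site server before changing anything:

```shell
REM Assumed key path - confirm on your site server before running.
reg add "HKLM\SOFTWARE\Microsoft\SMS\COMPONENTS\SMS_REPLICATION_CONFIGURATION_MONITOR" /v "Verbose logging" /t REG_DWORD /d 2 /f

REM Restart the SMS Executive service so the new verbosity takes effect.
net stop SMS_EXECUTIVE
net start SMS_EXECUTIVE
```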
Some of the other queries he used were:
On the CAS he executed the following queries to reset the DRS Replication initialization:
SELECT * FROM RCM_DrsInitializationTracking WHERE ReplicationGroup = 'Configuration Data' AND SiteRequesting = '<SiteCode of failed site>'
UPDATE RCM_DrsInitializationTracking SET InitializationStatus = 7 WHERE RequestTrackingGUID = '<Some GUID>'
The GUID was taken from the output of spDiagDRS, the stored procedure he ran first, and was the value corresponding to the primary site we were working on.
About 10 seconds after this command was executed, log files started becoming active again. We monitored the log files and noticed that the Configuration Data replication group was failing with some consistency. He executed another query to look at some additional logging contained within the SQL Tables:
SELECT * FROM vLogs
This showed us the output and storage of the CAB file that is used to transfer the BCP data to the remote site. The data also contained a file with a .MISSING extension, an indication that one of the tables used in the replication procedure was absent. The filename gives the name of the missing table; in this case it was the PullDPResponse table. He verified this by looking on the primary site, where the table did exist, and then in the CAS database, where it did not.
Recreating the table is an easy task in SQL. He went to the primary, located the table, right-clicked the table name, and selected the option to script the creation of the table. This gave us the complete set of instructions for the table's design, which he copied to the clipboard. In a query session against the CAS database he pasted this text into a new query to create the table. The last step was to reset the replication data using the query above. Once the replication initialized, ran, and completed, it was successful. Now it was time to sit back and let the sites get all caught up replicating with each other.
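If you need to hunt for the same symptom, vLogs can be filtered rather than dumped wholesale. A sketch, with the caveat that the LogText and LogTime column names are my assumption here; check the view definition on your own site before relying on them:

```sql
-- Look for references to .MISSING files in the DRS logging view,
-- newest entries first.
SELECT TOP 100 *
FROM   vLogs
WHERE  LogText LIKE '%MISSING%'
ORDER BY LogTime DESC;
```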
There are a few good queries used in troubleshooting DRS Replication. Learn what they do and what they tell us so you can be ready when/if you have these replication problems in the future.
In the vReplicationData view, the Status column holds a numeric value that indicates the replication status of each group. Here are the allowed values and their text definitions:
0 = Unknown Status
1 = Replication Required
2 = Replication Requested
3 = Replication is PendingCreation
4 = PackageCreated
5 = Replication is PendingApplication
6 = Replication Active
7 = Replication Aborted
99 = Replication Failed
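Putting the table above to work, here is a quick sketch that shows each replication group with a readable status; the CASE expression simply mirrors the list above:

```sql
-- Translate the numeric Status in vReplicationData into its text meaning.
SELECT ReplicationGroup,
       Status,
       CASE Status
            WHEN 0  THEN 'Unknown Status'
            WHEN 1  THEN 'Replication Required'
            WHEN 2  THEN 'Replication Requested'
            WHEN 3  THEN 'Replication is PendingCreation'
            WHEN 4  THEN 'PackageCreated'
            WHEN 5  THEN 'Replication is PendingApplication'
            WHEN 6  THEN 'Replication Active'
            WHEN 7  THEN 'Replication Aborted'
            WHEN 99 THEN 'Replication Failed'
       END AS StatusText
FROM   vReplicationData;
```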