Garbage transactions still in RS (RSSD) after cleanup

former_member350489
Participant

I have previously been discussing a similar issue here with Mark A Parsons, so I would like to add the following:

We're running an MSA replication setup, manually switchable, with two ASEs and two RSs.

In our test environment we reload our primary ASE once a month; prior to this we clean up the whole replication setup in the RSs, except the RSs themselves.

Sometimes there is "garbage" left over from the setup in the RSSDs, and we clean this up manually (if needed).

This time there seem to be "garbage transactions" left even after a proper removal of the setup and a manual cleanup.

We keep getting error messages in our replicate RS, and I simply cannot find any of the references logged in the error log in any of the RSSD tables. The primary RS is just fine and seems to be clean, but in the replicate RS there seem to be two types of transactions, trying to reach two databases that no longer exist in the RSSD setup.

This is how the replicate RS error log looks:

E. 2016/09/06 07:48:43. ERROR #8025 RSI USER(RS1SYS) - /mdroute.c(525)

        Database 2409  is unknown.

E. 2016/09/06 07:48:44. ERROR #32045 RSI USER(RS1SYS) - /nrm/nrm.c(2034)

        Invalid object id for table or function 'ma_alarmlog_arc'.

E. 2016/09/06 07:48:44. ERROR #8035 RSI USER(RS1SYS) - tr/mdext.c(2912)

        An MD message could not be converted into ASCII form.

I. 2016/09/06 07:48:44. An MD message could not be converted into ASCII form.

E. 2016/09/06 07:48:44. ERROR #8025 RSI USER(RS1SYS) - /mdroute.c(525)

        Database 2403  is unknown.

E. 2016/09/06 07:48:44. ERROR #32045 RSI USER(RS1SYS) - /nrm/nrm.c(2034)

        Invalid object id for table or function 'ma_alarmlog'.

E. 2016/09/06 07:48:44. ERROR #8035 RSI USER(RS1SYS) - tr/mdext.c(2912)

        An MD message could not be converted into ASCII form.

I. 2016/09/06 07:48:44. An MD message could not be converted into ASCII form.
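For what it's worth, neither dbid shows up when I query the RSSDs directly; I checked with roughly this kind of query (column names from memory, so treat it as a sketch):

-- run against each RSSD; returns no rows for 2403/2409 in our case
select dbid, dsname, dbname from rs_databases where dbid in (2403, 2409)
go
select id, name from rs_sites
go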

If anyone has encountered this type of problem and has any clue how to clean it up, please advise.

Thanks !

/Mike

former_member350489
Participant

I cannot see anywhere what the actual "garbage" transactions consist of, since the replication setup has already been removed (see below).

The cleanup steps (from our cleanup script that removes the replication setup); a rough sketch of the ASE-side commands behind a few of these steps follows the list:

#Unmark database on setup ASE
#Stop Rep Agent on setup ASE
#Disable Rep Agent on setup ASE
#Drop subscription on PRIM RS
#Check if subscription has been dropped
#Drop subscription on setup RS
#Check if subscription has been dropped
#Drop the other subscription on PRIM RS
#Check if subscription has been dropped
#Drop the other subscription on PRIM RS
#Check if subscription has been dropped
#Drop replication definition on setup RS
#Check if definition has been dropped
#Drop subscription on setup RS
#Check if subscription has been dropped
#Drop subscription on setup RS
#Check if subscription has been dropped
#Drop connection from setup RS to setup ASE
#Check if connection got dropped
#Remove secondary truncation point on setup ASE
#Truncate rs_lastcommit on setup ASE
#We are now done removing replication between setup and PRIM. Let user know and ask if he wants to proceed
#Unmark database on PRIM ASE
#Stop Rep Agent on PRIM ASE
#Disable Rep Agent on PRIM ASE
#Drop subscription on setup RS
#Check if subscription has been dropped
#Drop replication definition on PRIM RS
#Check if definition has been dropped
#Drop connection from setup RS to setup ASE
#Check if connection got dropped
#Drop connection from PRIM RS to PRIM ASE
#Check if connection got dropped
#Remove secondary truncation point on setup ASE
#Truncate rs_lastcommit on setup ASE
#Remove secondary truncation point on PRIM ASE
#Truncate rs_lastcommit on PRIM ASE
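On the ASE side, the commands behind those steps are roughly of this form (the database name pdb1 is a placeholder here, not one of our real ones):

-- unmark the database (MSA / database-level replication)
sp_reptostandby pdb1, 'none'
go
-- stop and disable the Rep Agent for the database
sp_stop_rep_agent pdb1
go
sp_config_rep_agent pdb1, 'disable'
go
-- remove the secondary truncation point and clear rs_lastcommit
use pdb1
go
dbcc settrunc(ltm, ignore)
go
truncate table rs_lastcommit
go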

I did a manual cleanup according to our previous discussion, clearing the orphans out of rs_segments by setting the columns to zero, and after that I could see the above error messages flooding the errorlog:

update rs_segments
set q_number = 0,
    q_type = 0,
    logical_seg = 0,
    used_flag = 0,
    flags = 0
where q_number in (2403, 2409)
go

We then reloaded the primary ASE from a (SAN) snapshot taken while all databases were quiesced; only the data and tran device files are reloaded, not "system" (which is unique for every ASE).

After this we normally just set up replication again...
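For reference, the quiesce around the snapshot is basically the standard ASE command (tag and database names here are placeholders):

-- hold writes on the replicated databases while the SAN snapshot is taken
quiesce database snap_tag hold pdb1, pdb2 for external dump
go
-- ... the SAN snapshot of the data/tran devices is taken here ...
quiesce database snap_tag release
go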

When we had this discussion last time we had no problems like today's; the only thing I could see was that there were unused segments on the stable devices. This time we actually get errors.

/Mike

Mark_A_Parsons
Contributor

Looks like you grep'd for a list of comments from your script; unfortunately this doesn't show which comments are part of a function or conditional/looping-construct ...

- are comments up to date and match what's really happening?

- why so many repeated steps to drop/check a subscription?

- why are you dropping a subscription on the PRS (PRIM RS)? (have you set up replication going in both directions?)

- what is a 'setup RS'?

... so, at this point I can't really tell what the actual steps are nor the order in which they're performed ... ??

----------------

For your update of the rs_segments table ...

NOTE: I'm assuming you're referring to this thread:

How did you come up with the q_numbers (2403, 2409) in the where clause?  Did you follow the steps in that previous thread to find actual orphaned segments for q_number/dbid = 2403/2409?  Or did you just plug in the numbers that were showing up in the errorlog messages?
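(For reference, an untested sketch of the kind of query I have in mind, column names from memory, for finding truly orphaned segment rows:)

-- segments still marked in use whose q_number matches neither a database nor a repserver
select q_number, q_type, count(*) as seg_count
from rs_segments
where used_flag != 0
  and q_number not in (select dbid from rs_databases)
  and q_number not in (select id from rs_sites)
group by q_number, q_type
go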

It would also be of interest to know:

- was the associated repserver shut down before you updated rs_segments?

- how many rows were affected by your update(s) of the rs_segments table?

----------------

The errorlog messages are related to data going over a route.  Said data is going to reside in queues for the *repservers* that make up the route.

The associated rs_segments records for this route data will have q_number values that equate to the repservers' rsid values (eg, rsid = q_number = 16777317).

Updating (orphan) rs_segments rows for databases (eg, q_number = 2403/2409) will have no effect on the *data* in the repservers' queues.

NOTE: *NO*, I'm not suggesting you now go update rs_segments records for your route-related queues; I'm just pointing out that the rs_segments update you've provided will have no effect on the data - in the route-related queues - that is generating your errors.
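If you want to see where the route data actually lives, something along these lines (a sketch) usually suffices:

-- in the repserver: list the stable queues; route queues use the other repserver's rsid as q_number
admin who, sqm
go
-- in the RSSD: map an rsid (eg, 16777317) back to a repserver name
select id, name from rs_sites
go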

----------------

We still don't know what exactly your manual cleanup operation consists of, nor where in the tear down you performed this manual cleanup.  I mention this because the errorlog messages would seem to indicate that you've got some orphaned data in your route-related queues, likely an issue from tearing down repserver components in an incorrect order and/or performing some non-standard steps during the tear down.

I'd probably do something like:

- drop subscription from RRS

- check subscription dropped from PRS and RRS

- drop repdef from PRS

- stop repagent

- disable repagent

- drop RRS connection to RDB

- drop PRS connection to PDB

NOTE: I'm assuming you want to maintain your routes, as well as the repservers themselves.

The above steps should ensure PDB txns are not left stranded in the replication system. (Again, it's not apparent from what you've posted so far what the actual steps are or the order in which they're performed.)
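In rough RCL terms that sequence would look something like the following; all names are placeholders, and the exact syntax may need adjusting for your MSA setup:

-- on the RRS: drop the subscription, then confirm it's gone from both repservers
drop subscription my_sub
    for database replication definition my_repdef with primary at PDS.pdb
    with replicate at RDS.rdb
    without purge
go
check subscription my_sub
    for database replication definition my_repdef with primary at PDS.pdb
    with replicate at RDS.rdb
go
-- on the PRS: drop the database replication definition
drop database replication definition my_repdef with primary at PDS.pdb
go
-- on the primary ASE: stop and disable the repagent
sp_stop_rep_agent pdb
go
sp_config_rep_agent pdb, 'disable'
go
-- on the RRS: drop the connection to the replicate database
drop connection to RDS.rdb
go
-- on the PRS: drop the connection to the primary database
drop connection to PDS.pdb
go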

When replication is functioning properly it should be rare that you find orphaned records in rs_segments (or any other RSSD table).

former_member350489
Participant


What I found out this morning was that all the transactions were located in the primary RS RS1SYS queue with rsid = q_number = 16777318:0, which has a route to the replicate RS RS2SYS.

We dumped out that queue to a file and could clearly see thousands of references to both of the missing dbids, 2403 and 2409.

So we purged the queue with 'sqm_purge_queue', and then both RS became inconsistent:

E. 2016/09/07 08:52:26. ERROR #6067 SQM(102:0 RS2SYS_RSSD.RS2SYS_RSSD) - /sqmoqid.c(1101)

        Replication Server has identified the possibility of data loss for RS2SYS_RSSD.RS2SYS_RSSD from RS1SYS_RSSD.RS1SYS_RSSD. Confirmation of data consistency is needed for validation.

And since no databases were set up for replication yet, we just ignored the (potential) loss, as no replication was flowing:

I. 2016/09/07 08:57:13. Ignoring loss for RS2SYS_RSSD.RS2SYS_RSSD from RS1SYS_RSSD.RS1SYS_RSSD

I. 2016/09/07 08:57:23. RSI receiver RS1SYS: Resetting Connection.
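For the record, the commands involved were roughly these (from memory, so the exact arguments may differ slightly):

-- on RS1SYS: dump the route queue (16777318:0) to a file, then purge it
-- (the repserver may need to be hibernated first: sysadmin hibernate_on / hibernate_off)
sysadmin dump_file, '/tmp/q_16777318_0.dmp'
go
sysadmin dump_queue, 16777318, 0, -1, 1, -1
go
sysadmin sqm_purge_queue, 16777318, 0
go
-- on RS2SYS: acknowledge the expected loss so the RSI connection resumes
ignore loss from RS1SYS_RSSD.RS1SYS_RSSD to RS2SYS_RSSD.RS2SYS_RSSD
go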

We then recycled both RS and RSSD on primary and replicate.

I am now in the process of setting up replication for our 12 databases again and it works just fine 🙂

Thanks for your support Mark

/Mike
