Stable device not cleared up after transactions applied successfully on target server

Former Member

Hi Everyone,

I had a batch of transactions to apply from the primary to the target server of ASE version 15.7, using a table-level replication setup.
I found that all of these transactions were successfully applied on the target server. But when I checked with the admin disk_space command, I found that the Rep Server's stable devices are filled up almost completely.
I am wondering why the queue was not cleared after the transactions were applied to the target server successfully.
There are no active transactions on the primary server now, and log usage for the primary database is not that high.
All threads are also up and running.
This is just weird; I do not understand why the stable device was not cleared in this case.
Your inputs on resolving this issue would be much appreciated, as this is a production environment.

Thanks and Regards,

Swapnil


Mark,

Great query. However, when executing it I got a hit on 'Unknown'. Is that something to be concerned about? Would dropping the partition clear it up?

Kevin

terry_penna

I concur with Mark...

First, the big one is admin who,sqt: check the First Tran column to see if it is reporting a large transaction being applied.
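For example (run from the Rep Server; a large open transaction would show up under the First Tran column for the affected queue):

1> admin who, sqt
2> go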

                                                                                                                                             

Second, verify whether a save interval has been set; for example:

1> admin config, "connection", <target_dataserver>, <target_database>, save_interval
2> go

Configuration    Config Value  Run Value  Default Value  Legal Values  Datatype  Status
---------------  ------------  ---------  -------------  ------------  --------  ---------------------------------
save_interval    0             0          0              NULL          string    Connection/route restart required

(1 row affected)

===============================================================================

One of those should report something that indicates why the transaction has not freed up the stable queue. I would also check the RS log for any messages; they may not be error messages, but there may be a warning as to what is happening.


Terry             

Mark_A_Parsons

'Unknown' implies an orphaned segment.


Before dropping anything, or making any modifications, and especially if you're not sure of the output of the query, I'd suggest posting the results of the query back here for further analysis.

Mark_A_Parsons

In addition to the save_interval config setting, the (newer) dsi_non_blocking_commit config setting also implements a variation of a save interval (ie, applied txns are not flushed/cleared from the queue until after dsi_non_blocking_commit minutes).
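You can check its current value the same way as save_interval; this is a sketch based on Terry's admin config example above (substitute your own target connection names):

1> admin config, "connection", <target_dataserver>, <target_database>, dsi_non_blocking_commit
2> go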


Mark,

Here is the output from your query less the 'known' q_numbers:

                               q_number  q_type  active  saved
Unknown                             444       0       1      0

Also ran this query:

select * from rs_segments where q_number = 444

partition_id  q_number  q_type  partition_offset  logical_seg  used_flag  version  flags
         119       444       0               502           82          1      386      0

A run of 'admin who, sqm' does NOT show this queue (444).  We utilize table, MSA, and warm-standby replication, and have a route to an offsite location.
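For reference, that check was simply the following; if queue 444 had an active SQM thread I would expect it to show up (as 444:0) in the Info column of the output:

1> admin who, sqm
2> go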

Kevin

Mark_A_Parsons

re: Unknown/orphaned segment ...

Take a look at this thread where we discussed cleaning up orphaned segments:

     Is it possible to clean up a partition / rs_segments ?

NOTE: Since you're only dealing with 1 (potential) orphaned segment, it's likely not a big issue if you leave said segment alone (ie, it's not using up much space) ... your call.

-------------

re: stable space still in use after applying txns ...

You didn't supply any other output lines from that query, so I'm curious whether or not you still see a large chunk of stable device space in use.

If space has been freed up, that may indicate you've got a save_interval in use (ie, the save_interval has expired since your first post in this thread).

You can use Terry's query to check for save_interval configuration settings.

Alternatively, if your RSSD is stored in an ASE dataserver, you may want to try out the .

Former Member

Hi Terry,

Thanks for your valuable reply to this thread.

Kindly find attached the output of the save interval and admin who,sqt commands from my Rep Server.

I checked the rep server log and found no error or warnings.

Kindly check the attachment and guide.

Thanks and Regards,

Swapnil

Former Member

Hi Mark,

Thanks for your valuable reply to this thread.

I am attaching the output of the query you provided.

Kindly review the output and advise on the next steps to resolve this issue.

Thanks and Regards,

Swapnil

terry_penna

Hi Swapnil

Thanks for the output. It is odd that admin disk_space shows the stable queue as almost full while admin who,sqt is not showing any transactions and you do not have a save interval configured.

If this stable queue is still full, and if possible, can you upload the complete RS log (if it is not too big) so we can take a look at it?

Can you also send the output of admin disk_space as well.
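For reference, the command is run from the Rep Server; compare the used segments against the total segments for each stable device partition in the output:

1> admin disk_space
2> go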

If we do not see something obvious in the RS log you will need to open an incident with support for a more detailed look.

Regards

Terry

Mark_A_Parsons

The query output is showing the WLWEB_RS queue with 3,010 active (ie, in-use) segments.

==========================

db.dsname+'.'+db.dbname        q_number  q_type active saved
------------------------------ --------- ------ ------ -----
WLPROD_R2.ics                        105      1      1     0
WLPROD_R2.ics_aux                    106      1      1     0
WLPROD_RS_RSID.WLPROD_RS_RSID        101      1      1     0
WLWEB_RS                        16777318      0   3010     0

==========================

It would appear that you have a route from your current repserver to a downstream repserver named WLWEB_RS.  And the outbound queue associated with this route is currently using up 3,010 segments (~3GB).

Can't tell from what's been posted so far what the issue is so fwiw ...

1 - make sure the WLWEB_RS repserver is up and running

2 - review both repserver errorlogs (current repserver; WLWEB_RS repserver) for any messages related to route/RSI issues

3 - make sure 'admin who_is_down' does not show any down threads in either repserver; for good measure, in the current repserver, suspend and then resume the route to WLWEB_RS just to make sure there's no issue with a hung thread (see the example commands after this list)

4 - if your RSSD resides in an ASE dataserver, review your ASE and repserver errorlogs for any occurrences of the ASE dataserver being bounced while the repserver was still up and running; generally speaking, the repserver should always be shut down if/when the ASE/RSSD goes down. [Just worked with a client on Monday ... the ASE/RSSD crashed over the weekend while the repserver was up and running; since the repserver needs to communicate with the RSSD at all times, several repserver threads went into a hung state when the ASE/RSSD crashed. In this instance the outbound queue for a route from an upstream repserver was backed up because this repserver's RSI/route user thread was hung; the client had to bounce the repserver to clear up the hung threads.]
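Example commands for item 3, run from the current repserver (this assumes the route name matches the downstream repserver name, WLWEB_RS):

1> admin who_is_down
2> go

1> suspend route to WLWEB_RS
2> go
1> resume route to WLWEB_RS
2> go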

Former Member

Hi Terry,

There are no suspicious error messages in the primary Rep Server log; however, I can see some errors in the replicate Rep Server (WLWEB_RS) log.

Attaching the complete log for your reference.

Both of these Replication Servers are up and running fine; the only issue is that the stable device is not cleared up at the primary Rep Server.

Attaching the output of the admin disk_space command for your reference.

Kindly review and guide.

Thanks and Regards,

Swapnil

Former Member

Hi Mark,

Thanks for your reply.

I can see both rep servers are up and running fine with no threads down.

Also, there are no suspicious error messages in the primary Rep Server log, but I can see some errors in the replicate Rep Server (WLWEB_RS) log.

Also, the RSSDs for both Rep Servers do not reside on the ASE servers involved in replication.

They are configured on different servers.

Attaching the important output files for your reference.

Kindly review and guide.

Thanks and Regards,

Swapnil

Mark_A_Parsons

Looks like a loss has been detected between the 2 repservers:

=======================

W. 2016/03/07 08:00:55. WARNING #6074 GLOBAL RS(GLOBAL RS) - /sqmoqid.c(275)

    Rejecting messages for WLWEB_RS_RSID.WLWEB_RS_RSID from WLPROD_RS_RSID.WLPROD_RS_RSID

=======================

'Rejecting messages' means WLPROD_RS is unable to send pending txns to WLWEB_RS, so the queue space remains 'in use' in WLPROD_RS.

You can verify the loss by running the following queries in both RSSDs:

=======================

select * from rs_oqid where valid > 0

select * from rs_exceptslast where status > 0

go

=======================

If a row shows up then it implies a loss has been detected for the related database connection or repserver.

If the query does in fact show a loss for a repserver/route you can try:

1 - suspend the route

2 - clear the loss by running this query in the RSSD where the loss is detected/displayed:

=======================

update rs_oqid set valid = 0 where valid > 0

update rs_exceptslast set status = 0 where status > 0

go

=======================

3 - resume the route; if the loss has been cleared then you should not see the 'Rejecting messages' line in your RS errorlog, and the queue space in the repserver should start going down.

---------

NOTE: None of the above addresses *WHY* a loss was detected; this would require a bit more research/analysis of what's been going on in your environment.

NOTE: None of the above addresses whether or not your PDB/RDB pair(s) is now out of sync (due to the loss detection).

terry_penna

Mark is correct, and you need to verify and correct the detected loss using the queries above. I would also recommend upgrading your RS; it is pretty old. In newer versions of RS, when you run admin who,sqm the output will show any loss detected.

Based on this information, I am not sure how you can say that everything replicated to the target. This condition indicates that the primary RS still has a lot of data waiting to send to the target RS, and you need to make sure your primary and replicate databases are in sync.

Former Member

Hi Terry and Mark,

The issue is now resolved after running the "ignore loss from" command on the RRS (WLWEB_RS).

I saw a row entry after running the query against rs_oqid.
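For reference, the general form of the command (run from the replicate Rep Server) is shown below; the placeholders need to be replaced with the actual source and destination shown in the 'Rejecting messages' warning, and the exact syntax should be verified in the Replication Server Reference Manual before running it in production:

1> ignore loss from <source_dataserver>.<source_database> to <destination_dataserver>.<destination_database>
2> go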

Though this issue is now resolved, I would like to understand how to apply the transactions that were still in the queue and not yet replicated, and how to bring the RDB back in sync with the PDB, in case we face a similar issue in the future. Can you assist me with how we can achieve this?

Thanks a lot for your valuable response to this thread and helping us in solving this issue.

Thanks and Regards,

Swapnil
