cancel
Showing results for 
Search instead for 
Did you mean: 

Understanding SQT & DSI_SQT_MAX_CACHE - WARM STANDBY DATA FLOW

Former Member
0 Kudos

We have a warm standby replication server setup replicating 3 databases from PRIMARY to warm standby. As usual the biggest drag on the performance is the DSI performance issue and the WARM STANDBY server capacity to run the transactions. Presently DIST thread is on.

Our ASE version is 15.0.3ESD3(PDS & RDS) and OS is SUN Solaris SPARC.

Replication server uses rssd(15.7/EBF 24749 SMP SP135) and the Replication server is 15.7.1/EBF 24559 SP208 and the OS is Windows 2008 R2.

I wanted to confirm on 2 main parameters sqt_max_cache_size and dsi_sqt_max_cache_size, during the course of this discussion. Any inputs will be very helpful. It is on the context of the findings using the admin statistics and the whitepaper "sybase replciation server performance & tuning" by i believe Mr. Jeff Tallman.

1. In the whitepaper, page 149 ... it is mentioned that the WS-DSI threads read straight off the inbound queue and the sorting occurs in the cache represented by dsi_sqt_max_cache_size, effectively duplicating the sorting process carried out by the SQT thread.

a. So in this case, if the DIST thread is not set to off - the sorting happens twice ? Please confirm.

b. If the WS-DSI, is reading directly from the inbound queue - where is the SQT getting involved in this scenario? Isn't the data flow as below

REP AGENT -> REP AGENT USER THREAD -> SQM -> WS-DSI(S) -> WS-DSI(E) .

It is clear from the statistics though that, the SQT is being used, since we can see the CacheMemUsed value ranges from 5mb-20mb. The crux is to affirm that sorting is happening twice or once, if the DIST is set to on, in a WARM Standby environment.

2. What is the best way to find the appropriate DSI_MAX_CACHE_SIZE. Is is to just monitor "TransRemoved" i.e. counter 24009  and decrease it till we start having counter 24009 >0  multiple times ?

Our dsi_sqt_max_cache is set to 64mb(for our most active database), and sqt_max_cache_size is set to 32mb.

If we look at dsi_sqt_max_cache parameter "CacheMemUsed", we can always find that it is filled to the brim(64 mb as expected) and   "TransRemoved" from dsi_cache is 0 but sqt_cache_size "CacheMemUsed" reached 20mb at max and at 32mb the "Transremoved" reached 1 a couple of time during Nightly

3. It is mentioned that the DIST can be set to off, safely in a Warm Standby. Can we set it to off and then to ON, at any instant ?

This query is asked in the context that i want to ensure that we have correct dsi_sqt_max_cache_size set. Because, when the SQT is down, i believe that our dsi sqt cache size will show up as 64mb(the actual size). At least now with sqt_max_cache_size of 32mb and "CacheMemUsed" of 20MB max, we can consider that our cache is more than what is required .... and without the SQT the dsi_sqt_max_cache_size is always shown to be full(because it fills up).

Furthermore, if SQT is down(due to DIST, being set to off), how will we find the # of open, closed, read and truncated command ?

4. What are the pros and cons of setting DIST off in a Warm Standby. Mainly can there be any -ve effect at all ?

5. Is there any good document for SQL DML replication? We are thinking of implementing this on our environment to fasten our replication further along with maybe parallel DSI.

6. For the SQT counters "SQTOpenTrans","SQTClosedTrans","SQTReadTrans" and "SQTTruncTrans" .. should we consider the counter_total,counter_max or counter_last columns of the rs_statdetail table ?

Is there any document which explains which of the counter columns must be considered for each of the display_name. Even after rs_statcounters.description colum, i wanted to know .. which counters matters most and what should be the corresponding column to draw inference about the counter.

Our observations for the admin_who & 10 minutes stas collection are attached in the text file and the .png file

Thank you very much!!

Regards,

Anurag Bhattacharjee

Accepted Solutions (1)

Accepted Solutions (1)

terry_penna
Participant
0 Kudos

Hi Anurag

Let see if I can help answer some of your questions?  Please see my responses in between your questions.  If I have missed something or you need further explanation please let me know?

  1. In the whitepaper, page 149 ... it is mentioned that the WS-DSI threads read straight off the inbound queue and the sorting occurs in the cache represented

By dsi_sqt_max_cache_size, effectively duplicating the sorting process carried out by the SQT thread.

The difference between the sqt_max_cache_size and the dsi_sqt_max_cache_size settings is:

sqt_max_cache_size is the global RS setting for all sqt cache. 

dsi_sqt_max_cache_size is a connection parameter and by default it is set to 0 wich means to use the value of the sqt_max_cache_size. 

So if you set this value to anything greater than 0 it will now ignore the sqt_max_cache_size and use this new value.

a. So in this case, if the DIST thread is not set to off - the sorting happens twice? Please confirm.

This is not true, in a warmstandby we do not need to use the DIST so you can set it to off and in some cases performance can improve. Now If you are replicating to another database hanging off of the logical connection let’s say for example to a reporting database then the DIST needs to be set to on so that transactions will be sent to the DSI connection that is hanging of the logical connection.

  1. If the WS-DSI, is reading directly from the inbound queue - where is the SQT getting involved in this scenario? Isn't the data flow as below?

REP AGENT -> REP AGENT USER THREAD -> SQM -> WS-DSI(S) -> WS-DSI(E) .

It is clear from the statistics though that, the SQT is being used, since we can see the CacheMemUsed value ranges from 5mb-20mb.

The crux is to affirm that sorting is happening twice or once, if the DIST is set to on, in a WARM Standby environment

SQT is always involved in every connection it is the Stable Queue Transaction manager and it is where RS ensures transactions are passed to clients (DIST or DSI) in transactional commit order.  The SQT interfaces with the SQM to read the transactions from the stable queue. Transactions in this queue are in commit order and for optimum performance the whole transaction from begin to commit needs to fit in the sqt cache.  If it does not fit in cache then that is what can cause latency while the SQT is interfacing with the SQM reading the disk to get the complete transaction read and be moved on.  

  1. What is the best way to find the appropriate DSI_MAX_CACHE_SIZE. Is is to just monitor "TransRemoved" i.e. counter 24009 

and decrease it till we start having counter 24009 >0  multiple times ?

Our dsi_sqt_max_cache is set to 64mb(for our most active database), and sqt_max_cache_size is set to 32mb.

If we look at dsi_sqt_max_cache parameter "CacheMemUsed", we can always find that it is filled to the brim(64 mb as expected) and 

"TransRemoved" from dsi_cache is 0 but sqt_cache_size "CacheMemUsed" reached 20mb at max and at 32mb the "Transremoved" reached 1 a couple of time during Nightly

Most of the time customer’s transactions are as follows:


OLTP type during the day > large batch processing in the evening or weekends to update everything.

So everything is fine and runs great during the day and then when the batches kick off now there is large latency in replication and most of the time it is on the DSI connection to the target. Now if you are seeing a transaction being removed and you are getting warning messages in the log that look like the following:

  1. W. 2011/11/21 18:40:58. WARNING #24068 SQT(199:1 DIST SYBASE.db_biz_cobros) - t/sqtint.c(1378)

SQT cache size is too low to load more than one transaction into the cache.

Or this warning that means you will also need to look at the memory_limit and may need to increase it as well since all sqt cache memory is taken from the memory_limit setting.

  1. W. 2011/11/21 18:41:36. WARNING #24057 DSI(170 SYBASE_CON.cobis) - t/sqtint.c(7359)

_sqt_remove_largest_tran(170:0 SYBASE_CON.cobis): No candidate found for removal. Memory limit will be exceeded by SQM/TI thread.

This is an indication that the sqt cache is too small and you should increase it. 


Now how and where you set that at is up to you.  


Remember the sqt_max_cache_size is a global RS parameter and if you increase it, all SQT have the possibility to use that value. So let’s say it is set to the default of 20M and you want to increase it to say 100M, that means that if you have 20 database connections (10 inbound DIST and 10 outbound DSI's, remember each has its own SQT.) they can all have the possibility of using 100M each so 20 * 100 2000M which will bump up real quick to the default memory_limit of 2G. Now most likely that will not happen because some connections might not be busy all the time but you just need to be aware that increasing the global value can use up a lot of memory quickly.

So now let’s say you just have this one really busy DSI connection with some very large batch processing being done, you can then alter the connection and increase the dsi_sqt_max_cache_size to increase the cache size for just this one connection. 


So alter connection and increase the dsi_sqt_max_cache_size from 0 to 100000000 and this tells RS that this one connection will use the new 100M instead of the default.

How big you need to make it depends on the batch like, number of tables, size of a row in the tables, times the number of transactions that are being affected by each table in the batch? 


So let’s say in this simple example, you have a batch with one table and one row has max size of 100 bytes, there are 100 million rows in this table and you run a batch job every night that affects approx. 900,000 rows in one batch. 


You could get a good idea on the sqt max cache size by just taking the 100 * 900,000 and get 90000000 plus maybe 10% then you could set the dsi_sqt_max_cache_size to 90000010 and that should do it.  Now this DSI would use 90M for its sqt max cache size and the global default could remain at 20M for all other connections.

  1. It is mentioned that the DIST can be set to off, safely in a Warm Standby. Can we set it to off and then to ON, at any instant?

This query is asked in the context that i want to ensure that we have correct dsi_sqt_max_cache_size set. Because, when the SQT is down,

i believe that our dsi sqt cache size will show up as 64mb(the actual size). At least now with sqt_max_cache_size of 32mb and "CacheMemUsed" of 20MB max, we can consider that our cache is more than what is required.... and without the SQT the dsi_sqt_max_cache_size is always shown to be full(because it fills up).

Furthermore, if SQT is down (due to DIST, being set to off), how will we find the # of open, closed, read and truncated command?

You can turn the DIST on or off at any time, remember the DIST only has an affect if there is replication to another target database hanging off the logical connection.  There is always an SQT in use for either the DSI or DIST and remember there is global and connection level parameters to set the sqt cache.  There is even a dist_sqt_max_cache_size parameter but you would not use that here. For your environment I would leave your DIST on turning it off will not help your dsi latency.

  1. What are the pros and cons of setting DIST off in a Warm Standby. Mainly can there be any -ve effect at all ?

I would recommend just leaving the DIST on.

  1. Is there any good document for SQL DML replication? We are thinking of implementing this on our environment to fasten our replication further along with maybe parallel DSI.

I would refer to the Replication Server Administration Guide Volume 2 and look in the performance and tuning section for details on SQL DML replication and parallel DSI's.

  1. For the SQT counters "SQTOpenTrans","SQTClosedTrans","SQTReadTrans" and "SQTTruncTrans" .. should we consider the counter_total,counter_max or counter_last

columns of the rs_statdetail table ?

Is there any document which explains which of the counter columns must be considered for each of the display_name. Even after rs_statcounters.description colum,

i wanted to know .. which counters matters most and what should be the corresponding column to draw inference about the counter.

If you go to the Replication Server wiki page located here http://wiki.scn.sap.com/wiki/display/SYBREP/SAP+Replication+Server if you click on Performance Tuning in the Product Basics section you can download the Monitors and Counters Analysis package. It will give you everything you need to do performance and tuning in your replication server and I recommend you get this to look at your RS performance.

One last thing, you should look at is your standby ASE and make sure it is properly tuned for replication.?  One of the biggest performance improvements is to ensure that you have a big enough statement cache setup and the ASE config parameter 'enable literal autoparm' set to 1.  Before "enable literal autoparm" was introduced, two queries that were identical except for one or more literal values resulted in the statement cache storing two separate query plans, with this enabled, the same plan can be reused, this is huge with replication since it is log based and an update on the primary that can affect 100 rows in one statement will become 100 individual statements on the target.

Overall we try to keep things simple at first with regards to RS performance and tuning and the dsi latency.

1) Check the target ASE and make sure it is tuned and performing correctly.  Most times this is where the problem is.

2) If you see warning messages in the log regrading sqt cache then you should consider increasing either your sqt_max_cache_size which is global setting or dsi_sqt_max_cache_size which if set higher than 0 will override the global sqt_max_cache_size setting for that one DSI connection.

3) If 1 and 2 does not help the DSI latency, then download and run the Monitors and Counters Analysis from the RS wiki site and there are reports in there that can help point you in the right direction on what to tune in the RS.



Regards

Terry

Answers (0)