Wait time in ST03 of dialog work processes increase significantly and intermittently

former_member211576
Contributor
0 Kudos

Hi experts,

  I have run our 19 application servers with the configuration below for about 5 months without any problems (CPU utilization is always below 20%). Recently, the wait time of dialog work processes in ST03 has been increasing significantly and intermittently. I checked dev_disp and found the error messages shown below. I plan to increase the dialog work processes from 20 to 30 (according to SAP Note 39412) and update kernel 721_EXT from patch level 136 to 316, and I hope this works. Any opinions?

PS: rdisp/queue_size_check_value uses the default value: on,50,30,40,500,50,500,80

--- dev_disp ---

Fri Sep 12 16:24:51 2014

DpHdlDeadWp: W25 (pid=11012) terminated automatically

Fri Sep 12 16:42:55 2014

DIA water mark reached (>30)

Fri Sep 12 16:43:16 2014

DIA water mark underrun (<=30)

Fri Sep 12 16:43:27 2014

DIA water mark reached (>30)

Fri Sep 12 16:43:59 2014

DIA water mark underrun (<=30)

Fri Sep 12 16:44:12 2014

DpRTmPrepareReq: network error of client T123, NiBufReceive (-6: NIECONN_BROKEN), dp_tm_status=3

DpRTmPrepareReq: client address of T123 is 172.20.166.133(172.20.166.133)

***LOG Q04=> DpRTmPrep, NiBufReceive (1326703020559 123 SAP15C ) [dpxxdisp.c   12341]

RM-T123, U13267, 218     03020559, SAP15C, 16:43:59, M0, W11,     , 2/0

Fri Sep 12 16:44:15 2014

DpRTmPrepareReq: network error of client T45, NiBufReceive (-6: NIECONN_BROKEN), dp_tm_status=3

DpRTmPrepareReq: client address of T45 is 172.20.166.133(172.20.166.133)

***LOG Q04=> DpRTmPrep, NiBufReceive (1334107101497 45 SAP15C ) [dpxxdisp.c   12341]

RM-T45, U13341, 218     07101497, SAP15C, 16:44:03, M0, W11, SESS, 2/0

HardCancel request for T45 U13341 M0 received from DISPATCHER

Fri Sep 12 16:44:17 2014

DpRTmPrepareReq: network error of client T161, NiBufReceive (-6: NIECONN_BROKEN), dp_tm_status=3

DpRTmPrepareReq: client address of T161 is 172.20.166.133(172.20.166.133)

***LOG Q04=> DpRTmPrep, NiBufReceive (13350 161 SAP15C ) [dpxxdisp.c   12341]

RM-T161, U13350, 000             , SAP15C, 16:43:59, M0, W17, SESS, 2/0
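
The water mark messages in the excerpt above can be turned into durations, to see how long the dispatcher stayed above the threshold each time. This is a minimal Python sketch, assuming the dev_disp layout shown in this post (a timestamp line followed by message lines). The function name `watermark_episodes` is hypothetical and not part of any SAP tool.

```python
from datetime import datetime

# Lines copied from the dev_disp excerpt in this post.
SAMPLE = """\
Fri Sep 12 16:42:55 2014
DIA water mark reached (>30)
Fri Sep 12 16:43:16 2014
DIA water mark underrun (<=30)
Fri Sep 12 16:43:27 2014
DIA water mark reached (>30)
Fri Sep 12 16:43:59 2014
DIA water mark underrun (<=30)
"""

def watermark_episodes(text):
    """Return (start, end, seconds) for each reached/underrun pair."""
    episodes = []
    current = None   # timestamp of the most recent timestamp line
    started = None   # timestamp at which the water mark was reached
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            current = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        except ValueError:
            pass  # not a timestamp line, so treat it as a message
        if line.startswith("DIA water mark reached"):
            started = current
        elif line.startswith("DIA water mark underrun") and started is not None:
            episodes.append((started, current, (current - started).total_seconds()))
            started = None
    return episodes

for start, end, seconds in watermark_episodes(SAMPLE):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: {seconds:.0f} s above the water mark")
```

Run against a full dev_disp file, the same function would show whether the episodes cluster around particular times, for example the end of a backup.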

--- profile ---

rdisp/wp_no_dia = 20

rdisp/wp_no_btc = 13

rdisp/wp_no_enq = 0

rdisp/wp_no_spo = 2

rdisp/wp_no_vb = 5

rdisp/wp_no_vb2 = 2

--- ST03 screenshot ---

Accepted Solutions (0)

Answers (4)

Former Member
0 Kudos

Check DB state at that time.

xymanuel
Active Participant
0 Kudos

Hi Dennis,

Can you please show us your DB times for the period when you have the problem?

For example our DB:

If you think it is a DB problem, I recommend temporarily switching on data collection for table statistics in ST03N.

Normally it is set to an empty value, which means the system is not collecting table calls.

If you set it to 5 for every app server, the system records the 5 tables with the highest response times per transaction.

Initial status:

During collection, set stat/tabrec to 5 for every app server (let it run for half a day).

(Click "Activate values".)

Now you can analyse the results within STAD:

For example: MC01 with "higher" DB response time:

Double-click on MC01:

You get the info on where the time is spent. But now you also have the "Table accesses" information. You can see that the time is used up on table S810.

With this information you can now go back to DBACOCKPIT.

Go to SQL Statements and filter for S810.

Now you can analyse the SQL statements themselves.

Regards

Manuel

And don't forget to switch the values in ST03N back to 0 after analysing the problem 🙂

Message was edited by: Manuel Herr

former_member211576
Contributor
0 Kudos

Hi Manuel,

  Thanks for your great, detailed procedure. I will use it next time if necessary. I know ST03N can capture the top 5 tables, but because it adds overhead I have never used it. Most of the time I use Performance -> History -> SQL Statement History to troubleshoot expensive SQL statements.

Screenshot 1 shows many programs stuck at the insert into table DBTABLOG. We have about 180,000 jobs created per hour, and it makes no sense to keep a table change log on TBTCO. Anyway, I have turned off rec/client.

Screenshot 2 shows DB time reaching up to 8,809 ms.

Screenshot 3 shows that WRITELOG is incredibly high, but it always becomes very high when a full or log backup finishes. We use Fusion IO, and I am not sure whether this is a write performance issue due to TRIM or something similar. Anyway, I am now using perfmon to log seconds per read/write on every physical disk.

xymanuel
Active Participant
0 Kudos

Hi Dennis,

I think you are on the right track. I would also say this is related to the transaction log (TA log).

As far as I can see, Fusion IO gives you write-log response times of 0.2 ms in the normal case.

That is a factor of 10 better than our values (~2-3 ms). But during your problem window the values increase to a deadly 1200 ms.

I would also say this is a problem related to the Fusion IO; maybe TRIM is exactly what is causing it.

Check the disk response times of the Fusion IO in the given time frame.

In my opinion it is not necessary to keep a TA log on Fusion IO, because you "only" need fast sequential I/O there. Any normal modern SAN can serve more than 1 GByte/s of sequential write I/O without reaching higher response times.

The lazy writer, which writes the committed data from the TA log to the DB files, is decoupled from the response times seen by users/work processes. So only read I/O counts then. But today we can add enough RAM to the DB server to reach cache hit ratios of >99% all the time. This means we do not need a high-performance read SAN anymore.

Btw, I'm on MSSQL.

Regards

Manuel
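
To make the effect of the write-log latency concrete: a work process that is waiting for a commit is blocked for at least one log write. The following is a rough Python estimate, not a model of SQL Server. It assumes each commit costs exactly one synchronous log write and nothing else, which is a deliberate simplification. The latencies are the ones quoted above, and the count of 30 dialog work processes per application server is the figure mentioned in this thread.

```python
def max_commits_per_second(work_processes, log_write_latency_ms):
    """Upper bound on commits/s if every work process only waits on log writes."""
    return work_processes * 1000.0 / log_write_latency_ms

DIALOG_WPS = 30  # dialog work processes per application server, from this thread

for label, latency_ms in [
    ("Fusion IO, normal", 0.2),
    ("SAN, typical", 2.5),  # assumed midpoint of the ~2-3 ms quoted above
    ("Fusion IO, problem window", 1200.0),
]:
    rate = max_commits_per_second(DIALOG_WPS, latency_ms)
    print(f"{label:28s} {latency_ms:8.1f} ms -> at most {rate:10.0f} commits/s")
```

Under these assumptions the ceiling per application server drops from roughly 150,000 to about 25 commits per second, which is consistent with all work processes queuing on "commit" even when the slow phase is short.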

former_member211576
Contributor
0 Kudos

Hi Manuel,

  OK, thanks for your information. This is the first time in a year that I have experienced a performance issue due to Fusion IO. I think the Fusion IO is under-formatted to handle the write performance issue due to TRIM ( https://www.fusionio.com/load/-media-/29217a/docsLibrary/Oracle_HP_Best_Practices_Guide_for_HP_IO_Ac... ).

Anyway, I think Fusion IO may not trust the information DBACOCKPIT provides, so I am using perfmon to capture seconds per read/write for their reference in case I create a support case.

former_member211576
Contributor
0 Kudos

Hi Manuel,

  Sorry, when I said Fusion IO, I was referring to the HP IO Accelerator (a Fusion IO OEM product), because we bought it from HP.

  I really appreciate you pointing out the deadly write latency of 1200 ms. I did not realize it would degrade SAP performance even if it lasts only a very short time. I have 5 IO accelerators installed, and I found that only one card has a write latency of 80 ms to 1200 ms at peak. The other 4 cards have a maximum write latency of only 1 or 2 ms at the same time. I captured a perfmon screenshot and complained to HP. They agreed to replace the card tomorrow.

  We have a lot of traditional SAN storage, e.g. EVA 8100 or JBODs, and we back up our database to LUNs on this storage.

PS: Sorry, I can't see "Mark as Answer" on this message. Only the "Alert Moderator", "Like", and "Reply" links are available. Otherwise, I would definitely mark your reply as the answer.

Again, thank you very much.
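
The comparison described above (one card far slower than its four siblings) can be automated once the perfmon data is exported. Here is a minimal Python sketch, assuming the peak write latency per card has already been extracted from the perfmon log. The card names and exact values are hypothetical; only the magnitudes (1-2 ms versus up to 1200 ms) come from this post. The helper `slow_devices` and its `factor` threshold are illustrative choices, not a vendor rule.

```python
from statistics import median

# Hypothetical per-card peaks of write latency in seconds;
# only the magnitudes (1-2 ms vs. up to 1200 ms) come from the post above.
peak_write_latency_s = {
    "card1": 0.001,
    "card2": 0.002,
    "card3": 1.200,
    "card4": 0.002,
    "card5": 0.001,
}

def slow_devices(peaks, factor=10.0):
    """Return devices whose peak latency exceeds `factor` times the median peak."""
    typical = median(peaks.values())
    return sorted(name for name, peak in peaks.items() if peak > factor * typical)

print(slow_devices(peak_write_latency_s))
```

Comparing against the median of the sibling cards, rather than a fixed limit, keeps the check meaningful even if all cards get slower together during a backup.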

xymanuel
Active Participant
0 Kudos

Hi Dennis,

The message "DIA water mark underrun" is not the cause of your trouble!

You do not have any work processes left; they are all waiting for commit (first screenshot).

You have to find out why your WPs are waiting for "commit".

The ST02 buffer rates look good. They are not causing this problem.

In the first screenshot I can see a report "ZRSCMSTU"; maybe this one is related to your problem.

Regards

Manuel

former_member211576
Contributor
0 Kudos

Hi Manuel,

  I agree that all WPs waiting for "commit" is the source of the problem. However, I can't tell whether it is caused by a bug in the SAP kernel or by a custom program. Sorry, no offense, but is it possible for a custom program to cause all WPs to stop at "commit" for a couple of minutes? I don't think so.

  By the way, "ZRSCMSTU" is a report that checks whether a user has a valid login, using USR41 (user-terminal).

former_member211576
Contributor
0 Kudos

Hi experts,
  I updated the kernel to 721_EXT 324 and increased the dialog work processes to 30 on Sunday, but the problem persists. Should I increase the dialog work processes to 40? Or is this a network issue?


Attachments:
Check huge wait time at <ap35> at 15:23.
What is SESSION_MANAGER doing?
Wait time increases on many T-codes at <ap27> at 15:30.

Sriram2009
Active Contributor
0 Kudos

Hi Dennis

Thanks for your info.

The number of work processes can be reverted from 40 back to 30; then monitor the system at the new kernel level for two weeks. You may get some error messages at the OS/SAP level, and after that you can decide what to do about the work processes.

Regards

SS

Sriram2009
Active Contributor
0 Kudos

Hi Dennis

1. Have you defined the memory/paging parameters as per the SAP Note?

88416 - Zero administration memory management for the ABAP server


2. Could you share your CI details? Can you also check the CI memory usage?


3. Are there any new developments? If yes, could you check their status?

4. You have to check the network connection between the CI and the application servers (DI); one of the screenshots shows that it is trying to commit but is searching the CI database.

5. From time to time you have to upgrade the kernel and DBSL patch, so that you can avoid those error messages.

Regards

Sriram

former_member211576
Contributor
0 Kudos

Hi SS,

  A1: Yes.

  A2: Sorry, I updated the kernel on Sunday and did not back up the dev_ms file. Also, this is a cluster system, so only enserver.exe and msg_server.exe are running. Their memory consumption is very small.

A3: We have about 200 requests per week, but this does not seem to be an issue caused by development.

A4: I will check dev_w0 next time, but I don't think it is a connection issue either.

A5: I have updated kernel 721_EXT to 324.

# Table Buffer Tuning

rtbb/buffer_length = 200000

rsdb/ntab/irbdsize = 8000

rsdb/ntab/sntabsize = 3000

ztta/roll_first = 1

ztta/roll_area = 3000000

rdisp/PG_MAXFS = 262144

rdisp/PG_SHM = 32768

abap/heap_area_dia = 8000000000

abap/heap_area_nondia = 16000000000

abap/heaplimit = 40000000

#ztta/roll_extension = 4000000000

#ztta/roll_extension_dia = 4000000000

#ztta/roll_extension_nondia = 4000000000

em/address_space_MB = 4096

zcsa/db_max_buftab = 50000

zcsa/presentation_buffer_area = 26001000

zcsa/table_buffer_area = 332800000
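
When comparing profiles across many application servers, it helps to separate active parameters from commented-out ones, such as the three `#ztta/roll_extension...` lines above. This is a minimal Python sketch, assuming the plain `name = value` layout shown in this post. The function `parse_profile` is a hypothetical helper, not an SAP tool, and the snippet reuses a few lines from the profile above.

```python
def parse_profile(text):
    """Parse 'name = value' lines, skipping blank lines and '#' comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            params[name.strip()] = value.strip()
    return params

# A few lines copied from the profile in this post.
SNIPPET = """\
abap/heap_area_dia = 8000000000
abap/heap_area_nondia = 16000000000
#ztta/roll_extension = 4000000000
em/address_space_MB = 4096
"""

params = parse_profile(SNIPPET)
for name, value in params.items():
    print(f"{name} = {value}")
print("ztta/roll_extension active:", "ztta/roll_extension" in params)
```

Parsing each server's profile this way and comparing the resulting dictionaries would show quickly whether all 19 application servers really run with the same settings.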

Sriram2009
Active Contributor
0 Kudos

Hi

On the CI and all DIs, have you checked the memory buffer values? Are there any swaps? Check transaction code ST02.

BR

SS

former_member211576
Contributor
0 Kudos

Hi SS,

  Like I said before, the CI (ASCS00) does not have any work processes running, only enserver and msg_server.

As for the table buffers on the DIs: if the system runs for a long time, I can see program and export/import swaps because we have so many custom programs. I think this is normal because they increase only slightly every day.

PS: If DVEBMGS<no> is what you mean, it has the same configuration as a DI. So please see screenshot 2.