dev_w5 growing incredibly fast - SAP Community

Dear experts,

I would like to share another quite annoying issue that we are experiencing in our landscape.

This time the affected system is a DEV BW ABAP instance. As the title says, dev_w5 is growing very fast; it has already filled up the filesystem twice and caused the system to hang.

The contents of the file are:

I MpiIEvtOpen: retry with next key

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I have found a few notes regarding this problem, but they are about the ICM and Web Dispatcher, not about a dialog work process: notes 2000428 (for Oracle Linux, but we are running on Solaris 10), 715400 and 1608350 (recommending kernel 7.01; we are already on 7.21 patch 201).

I sent note 715400 to our Solaris admins, but they said that it is not valid for Solaris 10.

Based on a screenshot from SM50, I think that the Solution Manager is causing the issue:

Please advise what to do. I can stop the work process, but I'm not sure that this will help. Kindly advise quickly, because as soon as I leave the office, the system will be down again within a few hours...

Many thanks!

SAP Managed Tags:
SAP NetWeaver Application Server


Accepted Solutions (0)

Answers (3)


Have you tried reducing the trace level for this work process in SM50?

Once done, kill it at OS level.

Regards,


Hello Symon,

How many work processes do you have? Please review note 9942 - Maximum number of work processes.

Regards,

Siddhesh


Hello guys,

Killing the processes solved the issue, and the new dev file was no longer growing rapidly. But about an hour ago the system got completely stuck, and it was not possible to open the login window in SAP GUI. It looked exactly like a system with a stuck archiver.

I was thinking about what to do before restarting everything, so I checked with dpmon: all the dialog processes were taken by RFC connections from the Solution Manager, and just one employee was logged on:

Workprocess Table (long)                Wed May 13 09:25:37 2015
========================

No Type Pid   Status Cause Start Rstr Err Sem Time Program   Cl  User            Action    Table
------------------------------------------------------------------------------------------------
 0 DIA  23290 Run          yes   no   0   0   2695 SAPLSICM  100 SM_SMP          NO_ACTION
 1 DIA  23291 Run          yes   no   0   0   5043 SAPLSHTTP 100 SM_SMP          NO_ACTION
 2 DIA  23292 Run          yes   no   0   0   8699 SAPLSHTTP 100 SM_SMP          NO_ACTION
 3 DIA  23293 Run          yes   no   0   0   2313 SAPLSICM  100 SM_SMP          NO_ACTION
 4 DIA  23294 Run          yes   no   0   0   3238 SAPLSICM  100 SM_SMP          NO_ACTION
 5 DIA  21566 Run          yes   no   1   0   5342 SAPLSICM  100 SM_SMP          NO_ACTION
 6 DIA  23296 Run          yes   no   0   0   4024 SAPLSHTTP 100 SM_SMP          NO_ACTION
 7 DIA  23297 Run          yes   no   0   0   5702 SAPLSICM  100 SM_SMP          NO_ACTION
 8 DIA  23298 Run          yes   no   0   0   6058 SAPLSICM  100 SM_SMP          NO_ACTION
 9 DIA  23299 Run          yes   no   0   0   4979 SAPLSICM  100 SM_SMP          NO_ACTION
10 DIA  23300 Run          yes   no   0   0   8876 SAPLSICM  100 SM_SMP          NO_ACTION
11 DIA  23301 Run          yes   no   0   0    410           100 <SOME EMPLOYEE> NO_ACTION
12 DIA  23302 Run          yes   no   0   0   4620 SAPLSICM  100 SM_SMP          NO_ACTION
13 DIA  23303 Run          yes   no   0   0   4262 SAPLSICM  100 SM_SMP          NO_ACTION
14 DIA  23304 Run          yes   no   0   0   6422 SAPLSICM  100 SM_SMP          NO_ACTION
15 DIA  23305 Run          yes   no   0   0   7684 SAPLSHTTP 100 SM_SMP          NO_ACTION
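
For anyone who wants to tally such a dpmon listing per user instead of reading it by eye, here is a minimal Python sketch. It rests on assumptions: the column layout is inferred from the paste above (the Cause column is empty and the Program column can be missing), and parse_row and sessions_per_user are hypothetical helper names, not anything provided by dpmon.

```python
from collections import Counter

# A few rows copied from the dpmon output above (whitespace-separated).
SAMPLE_ROWS = """\
0 DIA 23290 Run yes no 0 0 2695 SAPLSICM 100 SM_SMP NO_ACTION
1 DIA 23291 Run yes no 0 0 5043 SAPLSHTTP 100 SM_SMP NO_ACTION
11 DIA 23301 Run yes no 0 0 410 100 <SOME EMPLOYEE> NO_ACTION
15 DIA 23305 Run yes no 0 0 7684 SAPLSHTTP 100 SM_SMP NO_ACTION
"""


def parse_row(line):
    """Parse one work process row into a dict, or return None for other lines.

    Assumed token order (inferred from the paste, not from SAP documentation):
    No Type Pid Status Start Rstr Err Sem Time [Program] Cl User... Action
    """
    tokens = line.split()
    if len(tokens) < 12 or not tokens[0].isdigit():
        return None  # header, separator, or unrelated line
    rest = tokens[9:-1]  # everything between Time and Action
    # The Program column is empty for some rows; then the client comes first.
    program = "" if rest[0].isdigit() else rest.pop(0)
    if len(rest) < 2:
        return None
    return {
        "no": int(tokens[0]),
        "pid": int(tokens[2]),
        "time": int(tokens[8]),
        "program": program,
        "client": rest[0],
        "user": " ".join(rest[1:]),  # user names may contain spaces
    }


def sessions_per_user(text):
    """Count how many work processes each user occupies."""
    rows = (parse_row(line) for line in text.splitlines())
    return Counter(row["user"] for row in rows if row)


for user, count in sessions_per_user(SAMPLE_ROWS).most_common():
    print(count, user)
```

Feeding it the full listing instead of the four sample rows would show how many of the dialog work processes are held by SM_SMP.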

After killing a few processes, I could suddenly log in. SM04 was showing a LOT of SM_SMP sessions logged on:

I wanted to disconnect them, but it wasn't possible, so I had to kill all the dialog work processes using kill -9. I locked the SM_SMP user as a precaution to prevent additional connections, and was then able to restore normal operation of the system. But now the question is what went wrong. A buggy diagnostics agent, or something else?

Does anyone know this issue? Otherwise I should probably open an OSS message and have it analyzed by SAP. By the way, here is an excerpt from the dev_w4 file:

M Wed May 13 10:23:14 2015

M ***LOG R49=> ThReceive, CPIC-Error (020223) [thxxhead.c 7927]

M ***LOG R5A=> ThReceive, CPIC-Error (25554178) [thxxhead.c 7933]

M ***LOG R64=> ThReceive, CPIC-Error ( CMSEND(SAP)) [thxxhead.c 7938]

A RFC 3710 CONVID 25554178

A * CMRC=20 DATA=1 STATUS=1 SAPRC=223 ThSAPCMRCV

A RFC> ABAP Programm: SAPMSSY1 (Transaction: )

A RFC> User: CSERKO20 (Client: 100)

A RFC> Destination: wsps450_E (handle: 1, DtConId: 0E40F9E4E3AFF160A520402CF4CB710E, DtConCnt: 0, ConvId: 25554178,{0E40F9E4-E3AF-F

A RFC> Called function module: RSWAD_URL_GET

A RFC SERVER> RFC Server Session (handle: 1, 25554178, {0E40F9E4-E3AF-F160-A520-402CF4CB710E})

A RFC SERVER> Caller host:

A RFC SERVER> Caller transaction code: (Caller Program: BExQueryDesignerStarter)

A RFC SERVER> Called function module: RSWAD_URL_GET

A *** ERROR => RFC ======> CPIC-CALL: 'ThSAPCMRCV' : cmRc=20 thRc=223

CPIC program connection ended (read error)

[abrfcio.c 9213]

A {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0

A *** ERROR => RFC Error RFCIO_ERROR_SYSERROR in abrfcpic.c : 3712

CPIC-CALL: 'ThSAPCMRCV' : cmRc=20 thRc=223

CPIC program connection ended (read error)

[abrfcio.c 9213]

A {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0

A *** ERROR => RFC Error RFCIO_ERROR_MESSAGE in abrfcio.c : 1987

[abrfcio.c 9213]

A {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0

M

M ThAlarmHandler: first alarm, just set controls

M

M ThAlarmHandler: (2)

M ThAlarmHandler: inside critical section after 2 tries

M C-STACK

[0] DoStack2, at 0xc57ead

[1] CTrcStack2, at 0xc57b43

[2] CTrcStack, at 0xc57aec

[3] ThAlarmHandler, at 0xa9924e

[4] DpSigAlrm, at 0xa2fa44

[5] __sighndlr, at 0xfffffd7ff998ddd6

[6] call_user_handler, at 0xfffffd7ff99826a2

[7] sigacthandler, at 0xfffffd7ff99828ce

[8] ????????, at 0xffffffffffffffff

[9] fast_process_lock, at 0xfffffd7ff9986690

[10] mutex_lock_impl, at 0xfffffd7ff9986842

[11] mutex_lock, at 0xfffffd7ff998687b

[12] MtxILock, at 0xa63d29

[13] MtxLock_SPIN, at 0xa6483f

[14] MpiIEvtOpen, at 0x28e6e23

[15] MpiICreate, at 0x28dc123

[16] ThPlgCreate2, at 0xae366d

[17] ThICMGetStatus, at 0xbec526

[18] ThHdlICMOpcode, at 0xbebb64

[19] ThSysInfo, at 0xbdfc96

[20] __1cIab_jcaly6F_v_, at 0x102739b

[21] __1cIab_extri6F_i_, at 0xe8bf60

[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c

[23] ab_dstep, at 0xe808b9

[24] dynpmcal, at 0xc83ae7

[25] dynppbo0, at 0xc80416

[26] dynprctl, at 0xc7fccb

[27] dynpen00, at 0xc7c117

[28] TskhLoop, at 0xa6f128

[29] ThStart, at 0xa674b2

[30] DpMain, at 0x9b6528

M

M ThAlarmHandler: return for next chance

M

M ThAlarmHandler: (3)

M ThAlarmHandler: inside critical section after 3 tries

M C-STACK

[0] DoStack2, at 0xc57ead

[1] CTrcStack2, at 0xc57b43

[2] CTrcStack, at 0xc57aec

[3] ThAlarmHandler, at 0xa9924e

[4] DpSigAlrm, at 0xa2fa44

[5] __sighndlr, at 0xfffffd7ff998ddd6

[6] call_user_handler, at 0xfffffd7ff99826a2

[7] sigacthandler, at 0xfffffd7ff99828ce

[8] ????????, at 0xffffffffffffffff

[9] fast_process_lock, at 0xfffffd7ff9986690

[10] mutex_lock_impl, at 0xfffffd7ff9986842

[11] mutex_lock, at 0xfffffd7ff998687b

[12] MtxILock, at 0xa63d29

[13] MtxLock_SPIN, at 0xa6483f

[14] MpiIEvtOpen, at 0x28e6e23

[15] MpiICreate, at 0x28dc123

[16] ThPlgCreate2, at 0xae366d

[17] ThICMGetStatus, at 0xbec526

[18] ThHdlICMOpcode, at 0xbebb64

[19] ThSysInfo, at 0xbdfc96

[20] __1cIab_jcaly6F_v_, at 0x102739b

[21] __1cIab_extri6F_i_, at 0xe8bf60

[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c

[23] ab_dstep, at 0xe808b9

[24] dynpmcal, at 0xc83ae7

[25] dynppbo0, at 0xc80416

[26] dynprctl, at 0xc7fccb

[27] dynpen00, at 0xc7c117

[28] TskhLoop, at 0xa6f128

[29] ThStart, at 0xa674b2

[30] DpMain, at 0x9b6528

M

M ThAlarmHandler: return for next chance

M

M ThAlarmHandler: (4)

M ThAlarmHandler: inside critical section after 4 tries

M C-STACK

[0] DoStack2, at 0xc57ead

[1] CTrcStack2, at 0xc57b43

[2] CTrcStack, at 0xc57aec

[3] ThAlarmHandler, at 0xa9924e

[4] DpSigAlrm, at 0xa2fa44

[5] __sighndlr, at 0xfffffd7ff998ddd6

[6] call_user_handler, at 0xfffffd7ff99826a2

[7] sigacthandler, at 0xfffffd7ff99828ce

[8] ????????, at 0xffffffffffffffff

[9] fast_process_lock, at 0xfffffd7ff9986690

[10] mutex_lock_impl, at 0xfffffd7ff9986842

[11] mutex_lock, at 0xfffffd7ff998687b

[12] MtxILock, at 0xa63d29

[13] MtxLock_SPIN, at 0xa6483f

[14] MpiIEvtOpen, at 0x28e6e23

[15] MpiICreate, at 0x28dc123

[16] ThPlgCreate2, at 0xae366d

[17] ThICMGetStatus, at 0xbec526

[18] ThHdlICMOpcode, at 0xbebb64

[19] ThSysInfo, at 0xbdfc96

[20] __1cIab_jcaly6F_v_, at 0x102739b

[21] __1cIab_extri6F_i_, at 0xe8bf60

[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c

[23] ab_dstep, at 0xe808b9

[24] dynpmcal, at 0xc83ae7

[25] dynppbo0, at 0xc80416

[26] dynprctl, at 0xc7fccb

[27] dynpen00, at 0xc7c117

[28] TskhLoop, at 0xa6f128

[29] ThStart, at 0xa674b2

[30] DpMain, at 0x9b6528

M

M ThAlarmHandler: return for next chance

Thank you!

Hello Symon,

Judging from the screenshot you shared, I assume you don't have many work processes, but it's still worth checking the note I shared earlier, as it mentions a few parameters that affect the usage of work processes. Especially the following:


The most important prerequisites for a high number of work processes are:

Event flags

For interprocess communication between the dispatcher and the work process, one SAP-internal event flag is required for each work process.

SAP parameters

The value of the parameter "rdisp/tm_max_no" specifies the size of the table in which the sessions are managed. For administration purposes, each work process requires its own entry in the session table. Therefore, the value of this parameter must also take the number of work processes into account.
"rdisp/tm_max_no" > work processes + number of sessions
For information about how to set the parameters "rdisp/wp_ca_blk_no" and "rdisp/appc_ca_blk_no", see SAP Note 3223.

Operating system

The possible number of OS processes per user must exceed the number of work processes by approximately 50% and by at least about 30.
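
The two numeric rules quoted above can be written down as a quick sanity check. This is only a rough Python sketch: check_wp_prerequisites is a hypothetical helper, the input values have to be looked up by hand (instance profile, OS limits), and reading "by approximately 50% and by at least about 30" as "at least 1.5 times the work processes and at least work processes plus 30" is an interpretation, not the note's wording.

```python
import math


def check_wp_prerequisites(work_processes, sessions, tm_max_no, os_proc_limit):
    """Return a list of warnings; an empty list means both quoted rules hold."""
    warnings = []

    # Quoted rule: "rdisp/tm_max_no" > work processes + number of sessions
    needed_entries = work_processes + sessions
    if tm_max_no <= needed_entries:
        warnings.append(
            f"rdisp/tm_max_no={tm_max_no} should be greater than {needed_entries}"
        )

    # Quoted rule, as interpreted here: the per-user OS process limit should be
    # at least 150% of the work processes and at least work processes + 30.
    needed_procs = max(math.ceil(work_processes * 1.5), work_processes + 30)
    if os_proc_limit < needed_procs:
        warnings.append(
            f"OS process limit {os_proc_limit} should be at least {needed_procs}"
        )
    return warnings


# Made-up example values; only the 16 work processes come from the dpmon output.
for message in check_wp_prerequisites(16, sessions=200, tm_max_no=200, os_proc_limit=40):
    print(message)
```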

However you might also want to check the following notes:

2003246 - Deadlock in memory manager during signal handling

1890637 - TH: Work proceses hang in status 'Running'

In parallel, as you have been thinking already, raise an OSS message.

Regards,

Siddhesh

Hello Symon,

I believe it's best to not mix issues in the same thread.

Maybe opening a new thread is better (see the SCN rules of engagement).

regards!

Hi,

It seems that this work process is "trapped" in some kind of infinite loop.

Please execute "kill -USR2 23295" (at OS level), wait for a few seconds (5 - 10 seconds), then execute "kill -USR1 23295".

These kill signals will not terminate the process. USR2 will increase the trace level by 1, while USR1 decreases the trace level by 1. The SAP note 112 describes this.
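
If typing the two commands by hand is too slow, the same sequence can be scripted. Below is a minimal Python sketch under stated assumptions: pulse_trace_level is a hypothetical helper name, it relies only on the signal meanings described above (USR2 raises the trace level, USR1 lowers it), and it must run on the same Unix host as a user allowed to signal the work process.

```python
import os
import signal
import time


def pulse_trace_level(pid, hold_seconds=5.0, send=os.kill, sleep=time.sleep):
    """Raise the trace level of process `pid` by one, wait, then lower it again.

    `send` and `sleep` are injectable so the sequence can be dry-run
    without signalling a real process.
    """
    send(pid, signal.SIGUSR2)  # USR2: trace level + 1
    try:
        sleep(hold_seconds)
    finally:
        # Always send USR1 so the trace level is not left raised by accident.
        send(pid, signal.SIGUSR1)  # USR1: trace level - 1


# Dry run: record what would be sent instead of calling os.kill.
sent = []
pulse_trace_level(23295, hold_seconds=0,
                  send=lambda pid, sig: sent.append((pid, sig.name)),
                  sleep=lambda seconds: None)
print(sent)
```

Calling pulse_trace_level(23295) without the send and sleep arguments would signal the real process, so the PID should be double-checked first.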

Share the level 2 trace entries here.

After you've captured the level 2 trace successfully, you can try restarting this work process through SM50, as a workaround.

Regards,

Isaías


Hello Isaias,

That sounds like a reasonable suggestion. Could you please let me know where the trace will be stored? I hope it will not be in dev_w5, because that file is REALLY growing very fast and I can barely browse it with the more or less commands. When I open it in SM50, it hangs the SAP GUI session and produces a timeout dump...
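
When a trace file is too large to page through, one option is to stream it once and count the distinct lines instead of reading it. The following is a minimal Python sketch; summarize_trace is a hypothetical helper, and it assumes that most of the file consists of a few repeating lines, as in the excerpt in the first post (otherwise the counter itself grows large).

```python
from collections import Counter


def summarize_trace(lines, top=5):
    """Count distinct non-empty lines from any iterable of lines.

    The input is consumed line by line, so the whole file is never held
    in memory; only one counter entry per distinct line is kept.
    """
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text:
            counts[text] += 1
    return counts.most_common(top)


# Small in-memory sample that mimics the repeating pattern from dev_w5.
sample = [
    "I MpiIEvtOpen: retry with next key\n",
    "I *** ERROR => no more free event-flags. [mpixx.c 5876]\n",
    "\n",
] * 3

for text, count in summarize_trace(sample):
    print(count, text)
```

For a real file, an open file object can be passed in directly, for example summarize_trace(open(path, errors="replace")), where path points to the trace file in the instance work directory.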

Hi,

Well, it will be on dev_w5 itself...

Maybe 5 seconds is too much... You could send both kill signals in sequence, without waiting.

The time it will take to type the USR1 command should be enough already, since dev_w5 is growing so fast.

sorry, man, but this didn't change a thing. The log still looks like this:

I MpiIEvtOpen: retry with next key

M ThEppGetConnectionCounter: read connectionCounter 1 from epp 0

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

M ThEppGetConnectionCounter: read connectionCounter 1 from epp 0

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I MpiIEvtOpen: retry with next key

M ThEppGetConnectionCounter: read connectionCounter 1 from epp 0

I *** ERROR => no more free event-flags. [mpixx.c 5876]

I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I'm now gonna try to kill it in SM50.

Hi Symon,

"Maybe the trace itself is the problem"

Go to transaction SM50 and choose the affected work process:

Administration -> Trace -> Active components

Set as follows (remove or unselect any others):

Check the trace level (set it to the default: 1).

The components normally traced are Task Handler and VM Container.

Thanks ,

Manu

hmmm no... this has nothing to do with the number of work processes or with the trace level...

those entries are from trace level 1...

Maybe it has something to do with OS resources (like note 2000428).

Do you have the patches from note 908334 or 832871 applied?

You can also confirm that you have the settings from note 724713 in place.

Hello Isaias/Symon,

The number of work processes and the associated parameters do play a part. The note that I provided has a section on parameters that govern event queues, and I was interested in knowing whether that affects this situation.

Regards,

Siddhesh

Hello Siddhesh,

Of course the number of work processes plays a part in some parameters, but I do not think it is related to this thread.

In addition, note 9942 mentions parameters related to SAP-level resources, which are not related to MPIs or OS-level event flags.

cheers!

Hello Isaias,

By terminating the process we risk losing the root cause, and if the root cause isn't addressed, the problem is sure to reappear.

In any case, I am keen to understand what is causing this issue and how it can be fixed.

Regards,

Siddhesh

Yes, restarting the work process will make it impossible to investigate the root cause any further.

However, it did not respond to the "kill -USR2" signal. Thus, we are unable to get further information anyway...

So far, I could not find an SAP note that would address this issue... that is why I sent those notes related to the OS level settings.

Hi,

Check the following note, which describes a similar issue:

2000428 - ICM randomly not running with trace file filled by repeated log "no more free event-flags"

[The only difference is that there it is dev_icm that grows.]

The issue is probably with your ICM. Check transaction SMICM and its trace file.

Thanks ,

Manu
