
dev_w5 growing incredibly fast

symon_braunbaer
Participant
0 Kudos

Dear experts,

I would like to share another quite annoying issue that we are experiencing in our landscape.

The affected system is a DEV BW ABAP instance. As the title says, dev_w5 is growing very fast; it has twice filled up the filesystem and caused the system to hang.

The contents of the file are:

I  MpiIEvtOpen: retry with next key

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I have found a few notes regarding this problem, but they are about the ICM and the Web Dispatcher, not about a dialog work process: notes 2000428 (for Oracle Linux, whereas we are running on Solaris 10), 715400, and 1608350 (which recommends kernel 7.01; we are already at 7.21 patch 201).

I sent note 715400 to our Solaris admins, but they said that it is not valid for Solaris 10.


Based on a screenshot from SM50, I think that the Solution Manager is causing the issue:


Please advise what to do. I can stop the work process, but I'm not sure that this will help. Kindly advise quickly, because as soon as I leave the office, the system will be down again within a few hours.

Many thanks!

Accepted Solutions (0)

Answers (3)

divyanshu_srivastava3
Active Contributor
0 Kudos

Have you tried reducing the trace level for this work process in SM50?

Once that is done, kill the process at OS level.

Regards,

former_member185954
Active Contributor
0 Kudos

Hello Symon,

How many work processes do you have? Please review SAP Note 9942 - Maximum number of work processes.


Regards,

Siddhesh

symon_braunbaer
Participant
0 Kudos

Hello guys,

Killing the processes solved the issue, and the new dev file stopped growing rapidly. But about an hour ago, the system got completely stuck and it was not possible to open the login window in SAP GUI. It looked exactly like a system with a stuck archiver.

I was thinking about what to do before restarting everything, so I checked with dpmon: the dialog processes were taken by RFC connections from the Solution Manager, and there was just one employee logged on:

Workprocess Table (long)                        Wed May 13 09:25:37 2015

========================

No Type  Pid    Status  Cause Start Rstr  Err Sem Time Program          Cl  User         Action                    Table

-------------------------------------------------------------------------------------------------------------------------------

0 DIA    23290 Run           yes   no     0   0 2695  SAPLSICM         100 SM_SMP       NO_ACTION

1 DIA    23291 Run           yes   no     0   0 5043  SAPLSHTTP        100 SM_SMP       NO_ACTION

2 DIA    23292 Run           yes   no     0   0 8699  SAPLSHTTP        100 SM_SMP       NO_ACTION

3 DIA    23293 Run           yes   no     0   0 2313  SAPLSICM         100 SM_SMP       NO_ACTION

4 DIA    23294 Run           yes   no     0   0 3238  SAPLSICM         100 SM_SMP       NO_ACTION

5 DIA    21566 Run           yes   no     1   0 5342  SAPLSICM         100 SM_SMP       NO_ACTION

6 DIA    23296 Run           yes   no     0   0 4024  SAPLSHTTP        100 SM_SMP       NO_ACTION

7 DIA    23297 Run           yes   no     0   0 5702  SAPLSICM         100 SM_SMP       NO_ACTION

8 DIA    23298 Run           yes   no     0   0 6058  SAPLSICM         100 SM_SMP       NO_ACTION

9 DIA    23299 Run           yes   no     0   0 4979  SAPLSICM         100 SM_SMP       NO_ACTION

10 DIA    23300 Run           yes   no     0   0 8876  SAPLSICM         100 SM_SMP       NO_ACTION

11 DIA    23301 Run           yes   no     0   0  410                   100 <SOME EMPLOYEE>     NO_ACTION

12 DIA    23302 Run           yes   no     0   0 4620  SAPLSICM         100 SM_SMP       NO_ACTION

13 DIA    23303 Run           yes   no     0   0 4262  SAPLSICM         100 SM_SMP       NO_ACTION

14 DIA    23304 Run           yes   no     0   0 6422  SAPLSICM         100 SM_SMP       NO_ACTION

15 DIA    23305 Run           yes   no     0   0 7684  SAPLSHTTP        100 SM_SMP       NO_ACTION
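A table like the one above can be summarized per user with a few lines of Python. This is only a sketch: the helper names `parse_wp_line` and `count_by_user` are made up for illustration, and the parsing assumes that the Cause column is empty and that Status is a single word, as in the rows shown here. Other dpmon layouts would need a different split.

```python
from collections import Counter


def parse_wp_line(line):
    """Parse one data row of the work process table into a dict.

    Returns None for headers, separators and blank lines.
    Assumption: Cause is empty and Status is one word (as in the rows above).
    """
    tokens = line.split()
    if len(tokens) < 12 or not tokens[0].isdigit():
        return None
    head = tokens[:9]        # No Type Pid Status Start Rstr Err Sem Time
    rest = tokens[9:-1]      # [Program] Cl User...
    action = tokens[-1]
    if rest[0].isdigit():    # Program column is empty, the client comes first
        program, client, user_parts = "", rest[0], rest[1:]
    else:
        program, client, user_parts = rest[0], rest[1], rest[2:]
    return {
        "no": int(head[0]),
        "type": head[1],
        "pid": int(head[2]),
        "status": head[3],
        "time": int(head[8]),
        "program": program,
        "client": client,
        "user": " ".join(user_parts),
        "action": action,
    }


def count_by_user(lines):
    """Count work processes per user over all parseable rows."""
    rows = (parse_wp_line(line) for line in lines)
    return Counter(row["user"] for row in rows if row)
```

Fed with the dpmon output saved to a text file, `count_by_user(open("dpmon.txt"))` (the file name is hypothetical) would show at a glance how many dialog processes a single RFC user is holding.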

After killing a few processes, I could suddenly log in. SM04 was showing a LOT of SM_SMP sessions logged on:

I wanted to disconnect them, but it wasn't possible, so I had to kill all the dialog work processes using kill -9. I had locked the SM_SMP user as a precaution to prevent additional connections, so that I could restore normal operation of the system. But now the question is what went wrong. A buggy diagnostics agent, perhaps?

Does anyone know this issue? Otherwise I should probably open an OSS message and have it analyzed by SAP. By the way, here is an excerpt from the dev_w4 file:

M Wed May 13 10:23:14 2015

M  ***LOG R49=> ThReceive, CPIC-Error (020223) [thxxhead.c   7927]

M  ***LOG R5A=> ThReceive, CPIC-Error (25554178) [thxxhead.c   7933]

M  ***LOG R64=> ThReceive, CPIC-Error ( CMSEND(SAP)) [thxxhead.c   7938]

A  RFC 3710  CONVID 25554178

A   * CMRC=20 DATA=1 STATUS=1 SAPRC=223 ThSAPCMRCV

A  RFC> ABAP Programm: SAPMSSY1 (Transaction: )

A  RFC> User: CSERKO20 (Client: 100)

A  RFC> Destination: wsps450_E (handle: 1, DtConId: 0E40F9E4E3AFF160A520402CF4CB710E, DtConCnt: 0, ConvId: 25554178,{0E40F9E4-E3AF-F

A  RFC> Called function module: RSWAD_URL_GET

A  RFC SERVER> RFC Server Session (handle: 1, 25554178, {0E40F9E4-E3AF-F160-A520-402CF4CB710E})

A  RFC SERVER> Caller host:

A  RFC SERVER> Caller transaction code:  (Caller Program: BExQueryDesignerStarter)

A  RFC SERVER> Called function module: RSWAD_URL_GET

A  *** ERROR => RFC ======> CPIC-CALL: 'ThSAPCMRCV' : cmRc=20 thRc=223

CPIC program connection ended (read error)

[abrfcio.c    9213]

A  {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0

A  *** ERROR => RFC Error RFCIO_ERROR_SYSERROR in abrfcpic.c : 3712

CPIC-CALL: 'ThSAPCMRCV' : cmRc=20 thRc=223

CPIC program connection ended (read error)

[abrfcio.c    9213]

A  {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0

A  *** ERROR => RFC Error RFCIO_ERROR_MESSAGE in abrfcio.c : 1987

[abrfcio.c    9213]

A  {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0

M

M  ThAlarmHandler: first alarm, just set controls

M

M  ThAlarmHandler: (2)

M  ThAlarmHandler: inside critical section after 2 tries

M                     C-STACK

[0] DoStack2, at 0xc57ead

[1] CTrcStack2, at 0xc57b43

[2] CTrcStack, at 0xc57aec

[3] ThAlarmHandler, at 0xa9924e

[4] DpSigAlrm, at 0xa2fa44

[5] __sighndlr, at 0xfffffd7ff998ddd6

[6] call_user_handler, at 0xfffffd7ff99826a2

[7] sigacthandler, at 0xfffffd7ff99828ce

[8] ????????, at 0xffffffffffffffff

[9] fast_process_lock, at 0xfffffd7ff9986690

[10] mutex_lock_impl, at 0xfffffd7ff9986842

[11] mutex_lock, at 0xfffffd7ff998687b

[12] MtxILock, at 0xa63d29

[13] MtxLock_SPIN, at 0xa6483f

[14] MpiIEvtOpen, at 0x28e6e23

[15] MpiICreate, at 0x28dc123

[16] ThPlgCreate2, at 0xae366d

[17] ThICMGetStatus, at 0xbec526

[18] ThHdlICMOpcode, at 0xbebb64

[19] ThSysInfo, at 0xbdfc96

[20] __1cIab_jcaly6F_v_, at 0x102739b

[21] __1cIab_extri6F_i_, at 0xe8bf60

[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c

[23] ab_dstep, at 0xe808b9

[24] dynpmcal, at 0xc83ae7

[25] dynppbo0, at 0xc80416

[26] dynprctl, at 0xc7fccb

[27] dynpen00, at 0xc7c117

[28] TskhLoop, at 0xa6f128

[29] ThStart, at 0xa674b2

[30] DpMain, at 0x9b6528

M

M  ThAlarmHandler: return for next chance

M

M  ThAlarmHandler: (3)

M  ThAlarmHandler: inside critical section after 3 tries

M                     C-STACK

[0] DoStack2, at 0xc57ead

[1] CTrcStack2, at 0xc57b43

[2] CTrcStack, at 0xc57aec

[3] ThAlarmHandler, at 0xa9924e

[4] DpSigAlrm, at 0xa2fa44

[5] __sighndlr, at 0xfffffd7ff998ddd6

[6] call_user_handler, at 0xfffffd7ff99826a2

[7] sigacthandler, at 0xfffffd7ff99828ce

[8] ????????, at 0xffffffffffffffff

[9] fast_process_lock, at 0xfffffd7ff9986690

[10] mutex_lock_impl, at 0xfffffd7ff9986842

[11] mutex_lock, at 0xfffffd7ff998687b

[12] MtxILock, at 0xa63d29

[13] MtxLock_SPIN, at 0xa6483f

[14] MpiIEvtOpen, at 0x28e6e23

[15] MpiICreate, at 0x28dc123

[16] ThPlgCreate2, at 0xae366d

[17] ThICMGetStatus, at 0xbec526

[18] ThHdlICMOpcode, at 0xbebb64

[19] ThSysInfo, at 0xbdfc96

[20] __1cIab_jcaly6F_v_, at 0x102739b

[21] __1cIab_extri6F_i_, at 0xe8bf60

[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c

[23] ab_dstep, at 0xe808b9

[24] dynpmcal, at 0xc83ae7

[25] dynppbo0, at 0xc80416

[26] dynprctl, at 0xc7fccb

[27] dynpen00, at 0xc7c117

[28] TskhLoop, at 0xa6f128

[29] ThStart, at 0xa674b2

[30] DpMain, at 0x9b6528

M

M  ThAlarmHandler: return for next chance

M

M  ThAlarmHandler: (4)

M  ThAlarmHandler: inside critical section after 4 tries

M                     C-STACK

[0] DoStack2, at 0xc57ead

[1] CTrcStack2, at 0xc57b43

[2] CTrcStack, at 0xc57aec

[3] ThAlarmHandler, at 0xa9924e

[4] DpSigAlrm, at 0xa2fa44

[5] __sighndlr, at 0xfffffd7ff998ddd6

[6] call_user_handler, at 0xfffffd7ff99826a2

[7] sigacthandler, at 0xfffffd7ff99828ce

[8] ????????, at 0xffffffffffffffff

[9] fast_process_lock, at 0xfffffd7ff9986690

[10] mutex_lock_impl, at 0xfffffd7ff9986842

[11] mutex_lock, at 0xfffffd7ff998687b

[12] MtxILock, at 0xa63d29

[13] MtxLock_SPIN, at 0xa6483f

[14] MpiIEvtOpen, at 0x28e6e23

[15] MpiICreate, at 0x28dc123

[16] ThPlgCreate2, at 0xae366d

[17] ThICMGetStatus, at 0xbec526

[18] ThHdlICMOpcode, at 0xbebb64

[19] ThSysInfo, at 0xbdfc96

[20] __1cIab_jcaly6F_v_, at 0x102739b

[21] __1cIab_extri6F_i_, at 0xe8bf60

[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c

[23] ab_dstep, at 0xe808b9

[24] dynpmcal, at 0xc83ae7

[25] dynppbo0, at 0xc80416

[26] dynprctl, at 0xc7fccb

[27] dynpen00, at 0xc7c117

[28] TskhLoop, at 0xa6f128

[29] ThStart, at 0xa674b2

[30] DpMain, at 0x9b6528

M

M  ThAlarmHandler: return for next chance

Thank you!

former_member185954
Active Contributor
0 Kudos

Hello Symon,

I assume from the screenshot you shared that you don't have many work processes, but it is still worth checking the note I shared earlier, as it mentions a few parameters that affect the usage of work processes. Especially the following:


The most important prerequisites for a high number of work processes are:

  • Event flags

For interprocess communication between the dispatcher and the work process, one SAP-internal event flag is required for each work process.

  • SAP parameters

The value of the parameter "rdisp/tm_max_no" specifies the size of the table in which the sessions are managed. For administration purposes, each work process requires its own entry in the session table. Therefore, the value of this parameter must also take the number of work processes into account.
"rdisp/tm_max_no" > work processes + number of sessions
For information about how to set the parameters "rdisp/wp_ca_blk_no" and "rdisp/appc_ca_blk_no", see SAP Note 3223.

  • Operating system

The possible number of OS processes per user must exceed the number of work processes by approximately 50% and by at least about 30.
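The prerequisites quoted above can be turned into a quick sanity check. The sketch below is only my reading of the note text: the function name and arguments are hypothetical, and "by approximately 50% and by at least about 30" is interpreted as a margin of max(50% of the work processes, 30). The actual values have to come from the instance profile and the OS limits.

```python
import math


def check_wp_prerequisites(work_processes, sessions, tm_max_no, os_proc_limit):
    """Return a list of findings; an empty list means both checks passed.

    Assumed interpretation of the quoted text:
      - rdisp/tm_max_no must be greater than work processes + sessions
      - the OS process limit per user must exceed the number of work
        processes by max(50 %, 30 processes)
    """
    findings = []
    if tm_max_no <= work_processes + sessions:
        findings.append(
            f"rdisp/tm_max_no={tm_max_no} is not greater than "
            f"{work_processes} work processes + {sessions} sessions"
        )
    margin = max(math.ceil(work_processes * 0.5), 30)
    required = work_processes + margin
    if os_proc_limit < required:
        findings.append(
            f"OS process limit {os_proc_limit} is below the required {required}"
        )
    return findings
```

For example, `check_wp_prerequisites(16, 100, 200, 50)` uses the 16 dialog processes from the dpmon output together with made-up values for the sessions and limits.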

However you might also want to check the following notes:

2003246 - Deadlock in memory manager during signal handling

1890637 - TH: Work proceses hang in status 'Running'


In parallel, as you have been thinking already, raise an OSS message.


Regards,

Siddhesh

isaias_freitas
Advisor
0 Kudos

Hello Symon,

I believe it's best to not mix issues in the same thread.

Maybe opening a new thread would be better (see the SCN rules of engagement).

regards!

isaias_freitas
Advisor
0 Kudos

Hi,

It seems that this work process is "trapped" in some kind of infinite loop.

Please execute "kill -USR2 23295" (at OS level), wait for a few seconds (5 - 10 seconds), then execute "kill -USR1 23295".

These kill signals will not terminate the process. USR2 will increase the trace level by 1, while USR1 decreases the trace level by 1. The SAP note 112 describes this.

Share the level 2 trace entries here.

After you've captured the level 2 trace successfully, you can try restarting this work process through SM50, as a workaround.
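The two signals can also be sent from a small script, so that the trace level is guaranteed to be lowered again even if the wait is interrupted. This is a minimal sketch for a Unix host; the function name `capture_trace_burst` and its parameters are made up for illustration, and the effect of the signals on the work process is as described above, not something the script can verify.

```python
import os
import signal
import time


def capture_trace_burst(pid, hold_seconds=5.0, send=os.kill, sleep=time.sleep):
    """Send SIGUSR2 (trace level +1), wait, then send SIGUSR1 (trace level -1).

    `send` and `sleep` are injectable so the sequence can be tried out
    without touching a real process.
    """
    send(pid, signal.SIGUSR2)
    try:
        sleep(hold_seconds)
    finally:
        # Always lower the trace level again, even on Ctrl-C during the wait.
        send(pid, signal.SIGUSR1)
```

Called as `capture_trace_burst(23295, hold_seconds=1)`, it should be run as a user that is allowed to signal the work process (typically the <sid>adm user).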

Regards,

Isaías

symon_braunbaer
Participant
0 Kudos

Hello Isaias,

That sounds like a reasonable suggestion. Please just let me know where the trace will be stored. I hope it will not be in dev_w5, because that file is REALLY growing very fast; I can barely browse it with the more or less commands. When I open it in SM50, it hangs the SAP GUI session and produces a timeout dump.
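For a trace file that is too large to page through, reading only the end of the file and counting the distinct lines gives a quick overview. The following is a rough sketch; `summarize_tail` is a hypothetical helper, and the byte limit is an arbitrary choice.

```python
import os
from collections import Counter


def summarize_tail(path, max_bytes=1_000_000, top=5):
    """Return the most frequent non-empty lines in the last max_bytes of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        fh.seek(max(0, size - max_bytes))   # jump close to the end of the file
        data = fh.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    if size > max_bytes and lines:
        lines = lines[1:]                   # the first line is probably cut off
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(top)
```

Running `summarize_tail("dev_w5")` in the work directory would list the handful of messages that make up the bulk of the growth, without loading the whole file.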

isaias_freitas
Advisor
0 Kudos

Hi,

Well, it will be on dev_w5 itself...

Maybe 5 seconds is too much... You could send both kill signals in sequence, without waiting.

The time it will take to type the USR1 command should be enough already, since dev_w5 is growing so fast.

symon_braunbaer
Participant
0 Kudos

sorry, man, but this didn't change a thing. The log still looks like this:

I  MpiIEvtOpen: retry with next key

M  ThEppGetConnectionCounter: read connectionCounter 1 from epp 0

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

M  ThEppGetConnectionCounter: read connectionCounter 1 from epp 0

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I  MpiIEvtOpen: retry with next key

M  ThEppGetConnectionCounter: read connectionCounter 1 from epp 0

I  *** ERROR => no more free event-flags. [mpixx.c      5876]

I  {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1

I'm now gonna try to kill it in SM50.

manumohandas82
Active Contributor
0 Kudos

Hi Symon ,

"Maybe the trace itself is the problem "

Go to transaction SM50 and choose the affected work process.

Administration -> Trace -> Active components.

Set as follows (remove or unselect any others):

Check the trace level (set it to the default: 1).

The components normally traced are Task Handler and VM Container.

Thanks ,

Manu

isaias_freitas
Advisor
0 Kudos

hmmm no... this has nothing to do with the number of work processes or with the trace level...

those entries are from trace level 1...

Maybe it has something to do with OS resources (like note 2000428).

Do you have the patches from note 908334 or 832871 applied?

You can also confirm that you have the settings from note 724713 in place.

former_member185954
Active Contributor
0 Kudos

Hello Isaias/Symon,

The number of work processes and the associated parameters do play a part. The note that I provided has a section about parameters that govern event queues, and I was interested in knowing whether that affects this situation.

Regards,

Siddhesh

isaias_freitas
Advisor
0 Kudos

Hello Siddhesh,

Of course the number of work processes plays a part in some parameters, but I do not think that this is related to this thread.

In addition, note 9942 mentions parameters related to SAP-level resources that are not related to MPIs or OS-level event flags.

cheers!

former_member185954
Active Contributor
0 Kudos

Hello Isaias,

By terminating the process we risk losing the root cause, and if the root cause isn't addressed, the problem is sure to reappear, which only adds to the problem.

In any case, I am keen to understand what is causing this issue and how it can be fixed.

Regards,

Siddhesh

isaias_freitas
Advisor
0 Kudos

Yes, restarting the work process will make it impossible to investigate the root cause further.

However, it did not respond to the "kill -USR2" signal. Thus, we are unable to get further information anyway...

So far, I could not find an SAP note that addresses this issue... that is why I sent those notes related to the OS-level settings.

manumohandas82
Active Contributor
0 Kudos

Hi  ,

Check the following note, which describes a similar issue:

2000428 - ICM randomly not running with trace file filled by repeated log "no more free event-flags"


[Except that in the note it is dev_icm that is growing]

The issue should be with your ICM. Check it in transaction SMICM, along with its trace file.

Thanks ,

Manu