on 05-12-2015 1:11 PM
Dear experts,
I would like to share another quite annoying issue that we are experiencing in our landscape.
This time the affected system is a DEV BW ABAP instance. As the title says, dev_w5 is growing very
fast; it has filled up the filesystem twice and caused the system to hang...
The contents of the file are:
I MpiIEvtOpen: retry with next key
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I have found a few notes regarding this problem, but they are about the ICM and Web Dispatcher, not about a dialog
work process: notes 2000428 (for Oracle Linux, but we are running on Solaris 10), 715400 and 1608350 (recommending
kernel 7.01, but we are already at 7.21 patch 201).
I sent note 715400 to our Solaris admins, but they said that it is not valid for Solaris 10.
Based on a screenshot from SM50, I think that the Solution Manager is causing the issue:
Please advise what to do. I can stop the work process, but I'm not sure that this will help... Kindly advise quickly,
because as soon as I leave the office, the system will be down again within a few hours...
Many thanks!
Have you tried reducing the trace level for this work process in SM50?
Once done, kill it at OS level.
Regards,
Hello Symon,
How many work processes do you have? Can you please review note 9942 - Maximum number of work processes.
Regards,
Siddhesh
Hello guys,
Killing the processes solved the issue, and the new dev file was no longer growing rapidly. But about an hour ago, the system got completely stuck and it was not possible to open the login window in SAP GUI. It looked exactly like a system with a stuck archiver.
I was thinking about what to do before restarting everything, so I checked with dpmon: all the dialog processes were taken by RFC connections from the Solution Manager, and there was just one employee logged on:
Workprocess Table (long) Wed May 13 09:25:37 2015
========================
No Type Pid Status Cause Start Rstr Err Sem Time Program Cl User Action Table
-------------------------------------------------------------------------------------------------------------------------------
0 DIA 23290 Run yes no 0 0 2695 SAPLSICM 100 SM_SMP NO_ACTION
1 DIA 23291 Run yes no 0 0 5043 SAPLSHTTP 100 SM_SMP NO_ACTION
2 DIA 23292 Run yes no 0 0 8699 SAPLSHTTP 100 SM_SMP NO_ACTION
3 DIA 23293 Run yes no 0 0 2313 SAPLSICM 100 SM_SMP NO_ACTION
4 DIA 23294 Run yes no 0 0 3238 SAPLSICM 100 SM_SMP NO_ACTION
5 DIA 21566 Run yes no 1 0 5342 SAPLSICM 100 SM_SMP NO_ACTION
6 DIA 23296 Run yes no 0 0 4024 SAPLSHTTP 100 SM_SMP NO_ACTION
7 DIA 23297 Run yes no 0 0 5702 SAPLSICM 100 SM_SMP NO_ACTION
8 DIA 23298 Run yes no 0 0 6058 SAPLSICM 100 SM_SMP NO_ACTION
9 DIA 23299 Run yes no 0 0 4979 SAPLSICM 100 SM_SMP NO_ACTION
10 DIA 23300 Run yes no 0 0 8876 SAPLSICM 100 SM_SMP NO_ACTION
11 DIA 23301 Run yes no 0 0 410 100 <SOME EMPLOYEE> NO_ACTION
12 DIA 23302 Run yes no 0 0 4620 SAPLSICM 100 SM_SMP NO_ACTION
13 DIA 23303 Run yes no 0 0 4262 SAPLSICM 100 SM_SMP NO_ACTION
14 DIA 23304 Run yes no 0 0 6422 SAPLSICM 100 SM_SMP NO_ACTION
15 DIA 23305 Run yes no 0 0 7684 SAPLSHTTP 100 SM_SMP NO_ACTION
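A table like the one above can be tallied per user to see at a glance who is holding the dialog work processes. The following is only a minimal sketch: the regular expression assumes the flattened row layout as pasted here (single spaces, empty Cause column, optional Program column), and the names `ROW` and `count_users` are hypothetical helpers, not part of dpmon.

```python
import re
from collections import Counter

# Assumed row layout of the pasted dpmon output:
# No Type Pid Status Start Rstr Err Sem Time [Program] Cl User Action
ROW = re.compile(
    r"^(?P<no>\d+) (?P<type>[A-Z0-9]+) (?P<pid>\d+) (?P<status>\w+) "
    r"(?P<start>yes|no) (?P<rstr>yes|no) (?P<err>\d+) (?P<sem>\d+) "
    r"(?P<time>\d+) (?:(?P<program>\S+) )?(?P<client>\d{3}) "
    r"(?P<user>.+?) (?P<action>\S+)$"
)


def count_users(lines):
    """Count work processes per user; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = ROW.match(line.strip())
        if match:
            counts[match.group("user")] += 1
    return counts


# Three rows copied from the table above.
sample = [
    "0 DIA 23290 Run yes no 0 0 2695 SAPLSICM 100 SM_SMP NO_ACTION",
    "1 DIA 23291 Run yes no 0 0 5043 SAPLSHTTP 100 SM_SMP NO_ACTION",
    "11 DIA 23301 Run yes no 0 0 410 100 <SOME EMPLOYEE> NO_ACTION",
]

for user, n in count_users(sample).most_common():
    print(user, n)
```

If the real dpmon output keeps its column alignment, a fixed-width parse would probably be more robust than this regular expression.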
After killing a few processes, suddenly I could log in. SM04 was showing a LOT of SM_SMP processes logged on:
I wanted to disconnect them, but it wasn't possible, so I had to kill all the dialog work processes using kill -9. I locked the SM_SMP user as a precaution to prevent additional connections, so that I could restore normal operations on the system. But now the question is what went wrong... A buggy diagnostics agent, or what?
Does anyone know this issue? Or should I open an OSS message and get this analyzed by SAP? By the way, here is also an excerpt from the dev_w4 file:
M Wed May 13 10:23:14 2015
M ***LOG R49=> ThReceive, CPIC-Error (020223) [thxxhead.c 7927]
M ***LOG R5A=> ThReceive, CPIC-Error (25554178) [thxxhead.c 7933]
M ***LOG R64=> ThReceive, CPIC-Error ( CMSEND(SAP)) [thxxhead.c 7938]
A RFC 3710 CONVID 25554178
A * CMRC=20 DATA=1 STATUS=1 SAPRC=223 ThSAPCMRCV
A RFC> ABAP Programm: SAPMSSY1 (Transaction: )
A RFC> User: CSERKO20 (Client: 100)
A RFC> Destination: wsps450_E (handle: 1, DtConId: 0E40F9E4E3AFF160A520402CF4CB710E, DtConCnt: 0, ConvId: 25554178,{0E40F9E4-E3AF-F
A RFC> Called function module: RSWAD_URL_GET
A RFC SERVER> RFC Server Session (handle: 1, 25554178, {0E40F9E4-E3AF-F160-A520-402CF4CB710E})
A RFC SERVER> Caller host:
A RFC SERVER> Caller transaction code: (Caller Program: BExQueryDesignerStarter)
A RFC SERVER> Called function module: RSWAD_URL_GET
A *** ERROR => RFC ======> CPIC-CALL: 'ThSAPCMRCV' : cmRc=20 thRc=223
CPIC program connection ended (read error)
[abrfcio.c 9213]
A {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0
A *** ERROR => RFC Error RFCIO_ERROR_SYSERROR in abrfcpic.c : 3712
CPIC-CALL: 'ThSAPCMRCV' : cmRc=20 thRc=223
CPIC program connection ended (read error)
[abrfcio.c 9213]
A {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0
A *** ERROR => RFC Error RFCIO_ERROR_MESSAGE in abrfcio.c : 1987
[abrfcio.c 9213]
A {root-id=35353532464137303535353246413730}_{conn-id=00000000000000000000000000000000}_0
M
M ThAlarmHandler: first alarm, just set controls
M
M ThAlarmHandler: (2)
M ThAlarmHandler: inside critical section after 2 tries
M C-STACK
[0] DoStack2, at 0xc57ead
[1] CTrcStack2, at 0xc57b43
[2] CTrcStack, at 0xc57aec
[3] ThAlarmHandler, at 0xa9924e
[4] DpSigAlrm, at 0xa2fa44
[5] __sighndlr, at 0xfffffd7ff998ddd6
[6] call_user_handler, at 0xfffffd7ff99826a2
[7] sigacthandler, at 0xfffffd7ff99828ce
[8] ????????, at 0xffffffffffffffff
[9] fast_process_lock, at 0xfffffd7ff9986690
[10] mutex_lock_impl, at 0xfffffd7ff9986842
[11] mutex_lock, at 0xfffffd7ff998687b
[12] MtxILock, at 0xa63d29
[13] MtxLock_SPIN, at 0xa6483f
[14] MpiIEvtOpen, at 0x28e6e23
[15] MpiICreate, at 0x28dc123
[16] ThPlgCreate2, at 0xae366d
[17] ThICMGetStatus, at 0xbec526
[18] ThHdlICMOpcode, at 0xbebb64
[19] ThSysInfo, at 0xbdfc96
[20] __1cIab_jcaly6F_v_, at 0x102739b
[21] __1cIab_extri6F_i_, at 0xe8bf60
[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c
[23] ab_dstep, at 0xe808b9
[24] dynpmcal, at 0xc83ae7
[25] dynppbo0, at 0xc80416
[26] dynprctl, at 0xc7fccb
[27] dynpen00, at 0xc7c117
[28] TskhLoop, at 0xa6f128
[29] ThStart, at 0xa674b2
[30] DpMain, at 0x9b6528
M
M ThAlarmHandler: return for next chance
M
M ThAlarmHandler: (3)
M ThAlarmHandler: inside critical section after 3 tries
M C-STACK
[0] DoStack2, at 0xc57ead
[1] CTrcStack2, at 0xc57b43
[2] CTrcStack, at 0xc57aec
[3] ThAlarmHandler, at 0xa9924e
[4] DpSigAlrm, at 0xa2fa44
[5] __sighndlr, at 0xfffffd7ff998ddd6
[6] call_user_handler, at 0xfffffd7ff99826a2
[7] sigacthandler, at 0xfffffd7ff99828ce
[8] ????????, at 0xffffffffffffffff
[9] fast_process_lock, at 0xfffffd7ff9986690
[10] mutex_lock_impl, at 0xfffffd7ff9986842
[11] mutex_lock, at 0xfffffd7ff998687b
[12] MtxILock, at 0xa63d29
[13] MtxLock_SPIN, at 0xa6483f
[14] MpiIEvtOpen, at 0x28e6e23
[15] MpiICreate, at 0x28dc123
[16] ThPlgCreate2, at 0xae366d
[17] ThICMGetStatus, at 0xbec526
[18] ThHdlICMOpcode, at 0xbebb64
[19] ThSysInfo, at 0xbdfc96
[20] __1cIab_jcaly6F_v_, at 0x102739b
[21] __1cIab_extri6F_i_, at 0xe8bf60
[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c
[23] ab_dstep, at 0xe808b9
[24] dynpmcal, at 0xc83ae7
[25] dynppbo0, at 0xc80416
[26] dynprctl, at 0xc7fccb
[27] dynpen00, at 0xc7c117
[28] TskhLoop, at 0xa6f128
[29] ThStart, at 0xa674b2
[30] DpMain, at 0x9b6528
M
M ThAlarmHandler: return for next chance
M
M ThAlarmHandler: (4)
M ThAlarmHandler: inside critical section after 4 tries
M C-STACK
[0] DoStack2, at 0xc57ead
[1] CTrcStack2, at 0xc57b43
[2] CTrcStack, at 0xc57aec
[3] ThAlarmHandler, at 0xa9924e
[4] DpSigAlrm, at 0xa2fa44
[5] __sighndlr, at 0xfffffd7ff998ddd6
[6] call_user_handler, at 0xfffffd7ff99826a2
[7] sigacthandler, at 0xfffffd7ff99828ce
[8] ????????, at 0xffffffffffffffff
[9] fast_process_lock, at 0xfffffd7ff9986690
[10] mutex_lock_impl, at 0xfffffd7ff9986842
[11] mutex_lock, at 0xfffffd7ff998687b
[12] MtxILock, at 0xa63d29
[13] MtxLock_SPIN, at 0xa6483f
[14] MpiIEvtOpen, at 0x28e6e23
[15] MpiICreate, at 0x28dc123
[16] ThPlgCreate2, at 0xae366d
[17] ThICMGetStatus, at 0xbec526
[18] ThHdlICMOpcode, at 0xbebb64
[19] ThSysInfo, at 0xbdfc96
[20] __1cIab_jcaly6F_v_, at 0x102739b
[21] __1cIab_extri6F_i_, at 0xe8bf60
[22] __1cJab_xevent6FpkH_i_, at 0xf2e47c
[23] ab_dstep, at 0xe808b9
[24] dynpmcal, at 0xc83ae7
[25] dynppbo0, at 0xc80416
[26] dynprctl, at 0xc7fccb
[27] dynpen00, at 0xc7c117
[28] TskhLoop, at 0xa6f128
[29] ThStart, at 0xa674b2
[30] DpMain, at 0x9b6528
M
M ThAlarmHandler: return for next chance
Thank you!
Hello Symon,
I assume from the screenshot you shared that you don't have many work processes, but it's still worth checking the note I shared earlier, as it mentions a few parameters that affect the usage of work processes. Especially the following:
The most important prerequisites for a high number of work processes are:
- Event flags
For interprocess communication between the dispatcher and the work process, one SAP-internal event flag is required for each work process.
- SAP parameters
The value of the parameter "rdisp/tm_max_no" specifies the size of the table in which the sessions are managed. For administration purposes, each work process requires its own entry in the session table. Therefore, the value of this parameter must also take the number of work processes into account.
"rdisp/tm_max_no" > work processes + number of sessions
For information about how to set the parameters "rdisp/wp_ca_blk_no" and "rdisp/appc_ca_blk_no", see SAP Note 3223.
- Operating system
The possible number of OS processes per user must exceed the number of work processes by approximately 50% and by at least about 30.
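The two numeric rules quoted above can be turned into a quick sanity check. This is a rough sketch only: it reads "approximately 50% and at least about 30" as a headroom of max(50% of the work processes, 30), and the function name and all input values are hypothetical, not taken from the note or from this system.

```python
import math


def check_wp_sizing(work_processes, sessions, tm_max_no, os_process_limit):
    """Return a list of findings; an empty list means both rules are met."""
    findings = []

    # Rule 1: rdisp/tm_max_no > work processes + number of sessions
    if tm_max_no <= work_processes + sessions:
        findings.append(
            f"rdisp/tm_max_no={tm_max_no} should be greater than "
            f"{work_processes + sessions}"
        )

    # Rule 2: the OS process limit per user must exceed the number of work
    # processes by about 50% and by at least about 30 (assumed reading).
    headroom = max(math.ceil(0.5 * work_processes), 30)
    needed = work_processes + headroom
    if os_process_limit < needed:
        findings.append(
            f"OS process limit {os_process_limit} is below roughly {needed}"
        )

    return findings


# Hypothetical values for an instance with 16 dialog work processes.
print(check_wp_sizing(16, 200, 2000, 100))
print(check_wp_sizing(16, 200, 216, 40))
```

The thresholds in the note are approximate, so the result is a hint for further checking rather than a hard limit.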
However, you might also want to check the following notes:
2003246 - Deadlock in memory manager during signal handling
1890637 - TH: Work proceses hang in status 'Running'
In parallel, as you have been thinking already, raise an OSS message.
Regards,
Siddhesh
Hi,
It seems that this work process is "trapped" in some kind of infinite loop.
Please execute "kill -USR2 23295" (at OS level), wait for a few seconds (5 - 10 seconds), then execute "kill -USR1 23295".
These kill signals will not terminate the process. USR2 will increase the trace level by 1, while USR1 decreases the trace level by 1. The SAP note 112 describes this.
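The same raise-then-lower sequence can be scripted so that the trace level is always put back, even if the wait is interrupted. This is a minimal sketch for a Unix host; the function name and the injectable `send` and `sleep` parameters are assumptions for illustration, and the PID must be the one shown for the work process in SM50 or dpmon.

```python
import os
import signal
import time


def capture_level2_trace(pid, hold_seconds=10, send=os.kill, sleep=time.sleep):
    """Send SIGUSR2 to raise the trace level by one, wait, then send
    SIGUSR1 to lower it again. Neither signal terminates the process."""
    send(pid, signal.SIGUSR2)
    try:
        sleep(hold_seconds)
    finally:
        # Always undo the increase, even if the wait was interrupted.
        send(pid, signal.SIGUSR1)
```

Calling `capture_level2_trace(23295)` as the OS user that owns the work process would correspond to the two kill commands above with a 10-second pause between them.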
Share the level 2 trace entries here.
After you've captured the level 2 trace successfully, you can try restarting this work process through SM50, as a workaround.
Regards,
Isaías
Hello Isaias,
That sounds like a reasonable suggestion... Could you please let me know where
the trace will be stored? I hope that it will not be in dev_w5, because that file is
growing REALLY fast, and I can barely browse it with the more or less commands.
When I open it in SM50, the SAP GUI session hangs and produces a timeout
dump...
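For a trace file that is too large to page through, one option is to stream it once and count the distinct lines, so that the few repeating messages stand out. This is a minimal sketch; the function name is hypothetical and the file path in the usage note is only an example.

```python
from collections import Counter


def summarize_trace(lines, top=5):
    """Return the `top` most frequent non-empty lines with their counts.

    Only distinct lines are kept in memory, which is small when the
    file consists of the same few messages repeated over and over.
    """
    counts = Counter()
    for line in lines:
        text = line.rstrip("\n")
        if text.strip():
            counts[text] += 1
    return counts.most_common(top)


# Lines taken from the dev_w5 excerpt in this thread.
sample = [
    "I MpiIEvtOpen: retry with next key\n",
    "I *** ERROR => no more free event-flags. [mpixx.c 5876]\n",
] * 3 + ["M ThAlarmHandler: (2)\n"]

for text, n in summarize_trace(sample, top=2):
    print(n, text)
```

On the server this could be run as `with open("dev_w5", errors="replace") as fh: print(summarize_trace(fh))`, with the path adjusted to the instance work directory.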
Sorry, man, but this didn't change a thing. The log still looks like this:
I MpiIEvtOpen: retry with next key
M ThEppGetConnectionCounter: read connectionCounter 1 from epp 0
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
M ThEppGetConnectionCounter: read connectionCounter 1 from epp 0
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I MpiIEvtOpen: retry with next key
M ThEppGetConnectionCounter: read connectionCounter 1 from epp 0
I *** ERROR => no more free event-flags. [mpixx.c 5876]
I {root-id=8010E03E596C1EE39EC13F73B724DD66}_{conn-id=554F3B826476490FE1000000AC123F1C}_1
I'm now gonna try to kill it in SM50.
Hi Symon,
Maybe the trace itself is the problem.
Go to transaction SM50 and choose the affected work process:
Administration -> Trace -> Active components.
Set it as follows (remove or unselect any other components):
Check the trace level (set it to the default: 1).
The components normally traced are Task Handler and VM Container.
Thanks,
Manu
hmmm no... this has nothing to do with the number of work processes or with the trace level...
those entries are from trace level 1...
Maybe it has something to do with OS resources (like note 2000428).
Do you have the patches from note 908334 or 832871 applied?
You can also confirm that you have the settings from note 724713 in place.
Yes, restarting the work process will make it impossible to investigate the root cause further.
However, it did not respond to the "kill -USR2" signal. Thus, we are unable to get further information anyway...
So far, I could not find an SAP note that would address this issue... that is why I sent those notes related to the OS level settings.
Hi,
Check the following note, which describes a similar issue:
2000428 - ICM randomly not running with trace file filled by repeated log "no more free event-flags"
[The only difference is that there it is dev_icm that is growing.]
The issue is most likely with your ICM. Check it in transaction SMICM, along with its trace file.
Thanks,
Manu