on 09-24-2013 1:08 PM
I'm seeing IOWait (R) (041) and IOWait (W) (044) suspends, and I can't log in to the database, for example to do a restart.
I can't open an SAP call right now because I don't have the password for the correct S-user available at the moment.
As I understand it, this means a data read/write deadlock, but how do I fix it?
List of suspend reasons:
========================
total suspends: 522238
Vwait : 3 ( 0.00% ) k53wait
IOWait(R)(041) : 100 ( 0.02% ) b13get_node: await read 😞
IOWait(W)(044) : 4 ( 0.00% ) b13pfree_pno 😞
PagerWaitWritr : 5758 ( 1.10% ) Pager_Controller::WaitForPagerWritReply
JobWait Redo : 6 ( 0.00% ) Rst_RedoManager::RedoLog
SVP-End (230) : 1 ( 0.00% ) Log_SavepointSync::LockSVPSyncEntry
NoRedoJob(231) : 12 ( 0.00% ) Rst_RedoTrafficControl::ExecuteJobs()
LogIOwait(234) : 40378 ( 7.73% ) Log_Queue::UserTaskEOTReady
SVP-wait (243) : 2 ( 0.00% ) Log_Savepoint::StartSavepointAndWait
No-Work (255) : 475974 ( 91.14% ) Task is waiting for work
Environment:
NW 7.30 Java, MaxDB 7.8.02.036. Steps: SUM10SP08_06 patch preparation steps, Kernel 721_REL patch 100.
Thanks and Regards, Norbert
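To put the suspend list above into proportion, here is a minimal Python sketch that tallies the counts and reports how much of the non-idle waiting is actually IOWait. The line pattern is inferred only from the output pasted in this post, and the function names (`parse_suspends`, `io_wait_share`) are made up for illustration; they are not part of x_cons or any MaxDB tool.

```python
import re

# One suspend entry as pasted above: "<reason> : <count> ( <pct>% ) <location>".
# Assumption: this layout is inferred from this post only, not from a format spec.
SUSPEND_LINE = re.compile(
    r"^\s*(?P<reason>.+?)\s*:\s*(?P<count>\d+)\s*\(\s*[\d.]+%\s*\)"
)


def parse_suspends(text):
    """Return {reason: count} for every line that looks like a suspend entry."""
    counts = {}
    for line in text.splitlines():
        match = SUSPEND_LINE.match(line)
        if match:
            counts[match.group("reason")] = int(match.group("count"))
    return counts


def io_wait_share(counts, idle_reason="No-Work (255)"):
    """Fraction of non-idle suspends whose reason starts with 'IOWait'."""
    busy = sum(n for reason, n in counts.items() if reason != idle_reason)
    io = sum(n for reason, n in counts.items() if reason.startswith("IOWait"))
    return io / busy if busy else 0.0
```

For the list in this post, the two IOWait reasons account for 104 of 46264 non-idle suspends, about 0.2%; the bulk of the non-idle waits are LogIOwait and PagerWaitWritr.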
Maybe it is a 'dbmsrv' problem? I have looked at the attached 'x_cons' file, but there were not even any user tasks connected and I could not see any database activity; based on that, a deadlock seems unlikely.
I would recommend checking the database log file 'KnlMsg' for errors and, if there are none, running 'ps -afe | grep dbmsrv' to identify the running dbmsrv processes.
Thorsten
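As a rough illustration of the KnlMsg check suggested above, the following Python sketch pulls out lines carrying an ERR or WNG severity token. It assumes the severity appears as a standalone word, as it does in the KnlMsg excerpts quoted later in this thread; the helper names and the idea of passing the file path in are assumptions for illustration, not part of any MaxDB tooling.

```python
import re

# Assumption: severity shows up as a standalone ERR or WNG token,
# matching the KnlMsg excerpts quoted in this thread.
SEVERITY = re.compile(r"\b(ERR|WNG)\b")


def problem_lines(lines):
    """Return (severity, line) for every line flagged ERR or WNG."""
    found = []
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            found.append((match.group(1), line.rstrip("\n")))
    return found


def scan_file(path):
    """Scan a KnlMsg file; the caller supplies the path to the file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return problem_lines(handle)
```

A word match like this can also catch message text that happens to contain "ERR", so treat the result as a starting point for reading the file rather than a verdict.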
Thanks for checking, Thorsten.
The issues started when we noticed that logging in via dbmcli was not possible; all login types were rejected.
The SAP system had been shut down (gracefully), which probably explains why there are no client connections.
Unusual entries in KnlMsg seem to start with a lot of connect/connection released
messages like this:
ask | 108 2013-09-24 15:57:31 | CONNECT | 12633: Connect req. (DSL, T108, connection obj. 0x8013a49b0, Node:'uan112-se', PID: 15922) | |
Thread | 0x23CE Task | 108 2013-09-24 15:57:31 | CONNECT | 12651: Connection released (DSL, T108, connection obj. 8013a49b0) |
Thread | 0x23CE Task | 108 2013-09-24 15:58:36 | CONNECT | 12633: Connect req. (DSL, T108, connection obj. 0x8013a49b0, Node:'uan112-se', PID: 15922) |
Thread | 0x23CE Task | 108 2013-09-24 15:58:36 | CONNECT | 12651: Connection released (DSL, T108, connection obj. 8013a49b0) |
Failed logins look like this:
ask | 168 2013-09-24 16:09:09 | RTESec | 2: User control attempts to connect |
2013-09-24 16:09:09 | RTESec | 0: Authentication rejected | |
2013-09-24 16:09:09 | RTESec | 0: Authentication method: SCRAMMD5V1 | |
2013-09-24 16:09:09 | RTESec | 0: Authentication rejected | |
2013-09-24 16:09:09 | RTESec | 0: Authentication method: SCRAMMD5 |
We restarted the x_server a number of times.
Then an error occurred with the watchdog:
ask | - 2013-09-24 17:38:13 ERR RTEKernel | 125: The watchdog process is no longer alive,_FILE=RTEKernel_StartupUnix+noPIC.cpp,_LINE=408 |
ACTION: |
Contact your system administrator. Show him the error message which points to an operating system |
and it looks like the DB tries an emergency shutdown:
Thread | 0x2391 Task | - 2013-09-24 17:56:21 | RTEKernel | 114: Caught STOP signal |
Thread | 0x2390 Task | - 2013-09-24 17:56:21 | RTE | 20225: Database tries automatic shutdown |
Thread | 0x7A5E Task | - 2013-09-24 17:56:21 ERR RTE | 20126: Database automatic shutdown failed,_FILE=RTE_ExternalCall+noPIC.cpp,_LINE=937 | |
Thread | 0x2390 Task | - 2013-09-24 17:56:21 WNG RTEKernel | 121: Kernel is being stopped in ONLINE state | |
Thread | 0x2390 Task | - 2013-09-24 17:56:22 | RTEKernel | 61: rtedump written to file 'rtedump' |
Thread | 0x2390 Task | - 2013-09-24 17:56:22 | RunTime | 3: State changed from ONLINE to KILL |
Thread | 0x2390 Task | - 2013-09-24 17:56:22 | RTEKernel | 111: Tracewriter resumed |
Thread | 0x2390 Task | - 2013-09-24 17:56:22 | RTEKernel | 94: Waiting for tracewriter to finish work |
Thread | 0x23C7 Task | 3 2013-09-24 17:56:22 | Trace | 20000: Start flush kernel trace |
Thread | 0x2390 Task | - 2013-09-24 17:56:22 | RTEKernel | 116: Tracewriter termination timeout: 60 seconds |
Thread | 0x23C7 Task | 3 2013-09-24 17:56:22 | Trace | 20001: Stop flush kernel trace |
Thread | 0x23C7 Task | 3 2013-09-24 17:56:22 | Trace | 20002: Start flush kernel dump |
Thread | 0x23C7 Task | 3 2013-09-24 17:56:24 | Trace | 20003: Stop flush kernel dump |
Thread | 0x23AE Task | - 2013-09-24 17:56:33 ERR RTEKernel | 125: The watchdog process is no longer alive,_FILE=RTEKernel_StartupUnix+noPIC.cpp,_LINE=408 |
ACTION: |
Contact your system administrator. Show him the error message which points to an operating system configuration error and then contact the database support if your system administrator can not fix the error. | ||||
Thread | 0x23C7 Task | 3 2013-09-24 17:56:39 | RTEKernel | 110: Releasing tracewriter |
Thread | 0x2390 Task | - 2013-09-24 17:56:39 | TENANT | 13008: Requestor for tenant database DSL has stopped |
Thread | 0x2390 Task | - 2013-09-24 17:56:39 | RTEThread | 13: The thread LegacyRequestor is finished |
Thread | 0x23B0 Task | - 2013-09-24 17:56:39 | RTE | 20214: CONSOLE thread stopped |
Thread | 0x2390 Task | - 2013-09-24 17:56:39 | RTEKernel | 58: Backup of diagnostic files will be forced at next restart |
Thread | 0x2390 Task | - 2013-09-24 17:56:39 | RTEKernel | 118: SERVERDB DSL has stopped |
2013-09-24 17:56:39 | RTEKernel | 14: Kernel version: Kernel | 7.8.02 Build 036-121-248-298 | |
Thread | 0x2390 Task | - 2013-09-24 17:56:39 | RunTime | 3: State changed from KILL to STOPPED |
Thread | 0x2390 Task | - 2013-09-24 17:56:39 | RTEThread | 13: The thread Requestor is finished |
Thread | 0x2390 Task | - 2013-09-24 17:56:39 | TENANT | 13005: Tenant database DSL has stopped |
Thread | 0x2390 Task | - 2013-09-24 17:56:40 | RTEKernel | 119: Kernel aborts |
The last entry in the KnlMsg file was several hours before the x_cons suspends were noticed.
The running dbmsrv processes are:
uan112:sqddsl 1002> ps -ef |grep dbmsrv|grep DSL
sdb 5652 5651 0 Sep24 ? 00:00:05 /sapdb/DSL/db/pgm/dbmsrv -sdbstarter 3600 3600 A -P 0000000300000007000000080000000B
sdb 11302 1 94 Sep24 ? 13:45:32 /sapdb/DSL/db/pgm/dbmsrv -sdbstarter 3600 3600 A -P 0000000300000007000000080000000B
sdb 13644 1 4 07:37 ? 00:00:00 /sapdb/DSL/db/pgm/dbmsrv -P 0000000b0000000e0000000f00000012
sdb 13873 1 4 07:37 ? 00:00:00 /sapdb/DSL/db/pgm/dbmsrv -P 0000000b0000000e0000000f00000012
sdb 13978 1 4 07:37 ? 00:00:00 /sapdb/DSL/db/pgm/dbmsrv -P 0000000b0000000e0000000f00000012
As the DB seems to be down I might kill them later and try to start the DB from scratch.
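Before killing anything, it may help to single out which dbmsrv process is actually burning CPU. The sketch below parses `ps -ef` lines like the ones above, assuming the standard eight columns (UID, PID, PPID, C, STIME, TTY, TIME, CMD). The function names and the threshold of 50 are arbitrary choices for illustration.

```python
def parse_ps_line(line):
    """Split one `ps -ef` line into its eight standard columns."""
    uid, pid, ppid, cpu, stime, tty, cpu_time, cmd = line.split(None, 7)
    return {
        "uid": uid,
        "pid": int(pid),
        "ppid": int(ppid),
        "cpu": int(cpu),       # the C column: processor utilisation
        "time": cpu_time,      # accumulated CPU time
        "cmd": cmd,
    }


def busy_dbmsrv(ps_lines, cpu_threshold=50):
    """Return PIDs of dbmsrv processes whose C column is at or above the threshold.

    Assumption: the threshold of 50 is arbitrary; adjust it to taste.
    """
    pids = []
    for line in ps_lines:
        if not line.strip():
            continue
        proc = parse_ps_line(line)
        if "dbmsrv" in proc["cmd"] and proc["cpu"] >= cpu_threshold:
            pids.append(proc["pid"])
    return pids
```

Applied to the five lines above, only PID 11302 (C column 94, 13:45:32 of CPU time) passes the threshold.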
Hmm, I would suggest creating a SAP OSS ticket for this. I think we would need to look at the Database Analyzer log files plus the error protocol files like KnlMsg, dbm.prt...
An OS connection would help, plus the exact time when you noticed that logon via dbmcli no longer worked.
Have you checked the OS log file, e.g. /var/log/messages (Linux)?
Thorsten