IOWait Deadlock 7.8.02.36

I've got IOWait (R) (041) and IOWait (W) (044) suspends. I can't log in to the database, for example to do a restart.

I can't open an SAP call right now because I don't have the password for the correct S-user available at the moment.

I know it means a data read/write deadlock, but how do I fix it?

List of suspend reasons:

========================

total suspends:  522238

Vwait          :       3 (   0.00% ) k53wait

IOWait(R)(041) :     100 (   0.02% ) b13get_node: await read              😞

IOWait(W)(044) :       4 (   0.00% ) b13pfree_pno                         😞

PagerWaitWritr :    5758 (   1.10% ) Pager_Controller::WaitForPagerWritReply

JobWait Redo   :       6 (   0.00% ) Rst_RedoManager::RedoLog

SVP-End  (230) :       1 (   0.00% ) Log_SavepointSync::LockSVPSyncEntry

NoRedoJob(231) :      12 (   0.00% ) Rst_RedoTrafficControl::ExecuteJobs()

LogIOwait(234) :   40378 (   7.73% ) Log_Queue::UserTaskEOTReady

SVP-wait (243) :       2 (   0.00% ) Log_Savepoint::StartSavepointAndWait

No-Work  (255) :  475974 (  91.14% ) Task is waiting for work
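
To put the two IOWait entries in proportion, the suspend list can be ranked by count. The following is only a minimal sketch: the function name `parse_suspends` is made up for illustration, and the line format is assumed to match the x_cons output pasted above (reason, colon, count, percentage in parentheses, location).

```python
import re

# Assumed line shape, taken from the listing above, e.g.
# "IOWait(R)(041) :     100 (   0.02% ) b13get_node: await read"
SUSPEND_LINE = re.compile(
    r"^(?P<reason>.+?)\s*:\s+(?P<count>\d+)\s+"
    r"\(\s*(?P<pct>[\d.]+)%\s*\)\s*(?P<where>.*)$"
)


def parse_suspends(text):
    """Return (reason, count, percent, location) tuples, largest count first."""
    rows = []
    for line in text.splitlines():
        match = SUSPEND_LINE.match(line.strip())
        if match is None:
            # Headers, separators and the "total suspends" line are skipped.
            continue
        rows.append((
            match["reason"].strip(),
            int(match["count"]),
            float(match["pct"]),
            match["where"].strip(),
        ))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    sample = (
        "IOWait(R)(041) :     100 (   0.02% ) b13get_node: await read\n"
        "No-Work  (255) :  475974 (  91.14% ) Task is waiting for work\n"
    )
    for reason, count, pct, where in parse_suspends(sample):
        print(f"{reason:<16} {count:>8} {pct:6.2f}%  {where}")
```

Ranked this way, the listing is dominated by tasks waiting for work, while the IOWait entries account for a very small share of all suspends.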

Environment:
NW 7.30 Java, MaxDB 7.8.02.036. Steps: SUM10SP08_06 patch preparation steps, Kernel 721_REL patch 100.

Thanks and Regards, Norbert

Accepted Solutions (0)

Answers (1)


thorsten_zielke
Contributor

Maybe it is a 'dbmsrv' problem? I have looked at the attached 'x_cons' file, but no user tasks were connected and I could not see any database activity. Based on that, a deadlock seems unlikely.

I would recommend checking the database log file 'KnlMsg' for errors and, if there are none, running 'ps -afe | grep dbmsrv' to identify the running dbmsrv processes.

Thorsten
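
The KnlMsg check suggested above can be sketched roughly as follows. This is an illustration only: `find_problems` is a hypothetical helper, and it assumes that each entry carries a `YYYY-MM-DD HH:MM:SS` timestamp directly followed by an `ERR` or `WNG` label. Entries without such a label are ignored.

```python
import re

# Assumed entry shape: "<timestamp> ERR|WNG <component and message>"
ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>ERR|WNG)\s+(?P<msg>.*)"
)


def find_problems(lines):
    """Return (timestamp, level, message) for every ERR or WNG entry."""
    hits = []
    for line in lines:
        match = ENTRY.search(line)
        if match:
            hits.append((match["ts"], match["level"], match["msg"].rstrip()))
    return hits


def find_problems_in_file(path):
    """Same as find_problems, reading a KnlMsg file whose path is supplied by the caller."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_problems(handle)
```

The path to KnlMsg depends on the installation, so it is passed in rather than guessed. The timestamp of the first hit is a useful reference point when comparing against the OS logs.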


Thanks for checking, Thorsten.

The issues started when we noticed that a login via dbmcli was not possible; all login types were rejected.

The SAP system had been shut down (gracefully), which probably explains why there are no client connections.

Unusual entries in KnlMsg seem to start with a lot of connect/connection released messages like this:

ask108  2013-09-24 15:57:31 CONNECT12633:  Connect req. (DSL, T108, connection obj. 0x8013a49b0, Node:'uan112-se', PID: 15922)
Thread 0x23CE Task108  2013-09-24 15:57:31 CONNECT12651:  Connection released (DSL, T108, connection obj. 8013a49b0)
Thread 0x23CE Task108  2013-09-24 15:58:36 CONNECT12633:  Connect req. (DSL, T108, connection obj. 0x8013a49b0, Node:'uan112-se', PID: 15922)
Thread 0x23CE Task108  2013-09-24 15:58:36 CONNECT12651:  Connection released (DSL, T108, connection obj. 8013a49b0)

Failed logins look like this:

ask168  2013-09-24 16:09:09 RTESec     2:  User control attempts to connect
                           2013-09-24 16:09:09 RTESec     0:  Authentication rejected
                           2013-09-24 16:09:09 RTESec     0:  Authentication method: SCRAMMD5V1
                           2013-09-24 16:09:09 RTESec     0:  Authentication rejected
                           2013-09-24 16:09:09 RTESec     0:  Authentication method: SCRAMMD5

We did restart the x_server a number of times.

Then an error occurs with the watchdog:

ask  -  2013-09-24 17:38:13 ERR RTEKernel125:  The watchdog process is no longer alive,_FILE=RTEKernel_StartupUnix+noPIC.cpp,_LINE=408

                                                                       ACTION:

                                                                       Contact your system administrator. Show him the error message which points to an operating system

and it looks like the DB then attempts an emergency shutdown:

Thread 0x2391 Task  -  2013-09-24 17:56:21 RTEKernel114:  Caught STOP signal
Thread 0x2390 Task  -  2013-09-24 17:56:21 RTE    20225:  Database tries automatic shutdown
Thread 0x7A5E Task  -  2013-09-24 17:56:21 ERR RTE    20126:  Database automatic shutdown failed,_FILE=RTE_ExternalCall+noPIC.cpp,_LINE=937
Thread 0x2390 Task  -  2013-09-24 17:56:21 WNG RTEKernel121:  Kernel is being stopped in ONLINE state
Thread 0x2390 Task  -  2013-09-24 17:56:22 RTEKernel 61:  rtedump written to file 'rtedump'
Thread 0x2390 Task  -  2013-09-24 17:56:22 RunTime    3:  State changed from ONLINE to KILL
Thread 0x2390 Task  -  2013-09-24 17:56:22 RTEKernel111:  Tracewriter resumed
Thread 0x2390 Task  -  2013-09-24 17:56:22 RTEKernel 94:  Waiting for tracewriter to finish work
Thread 0x23C7 Task  3  2013-09-24 17:56:22 Trace  20000:  Start flush kernel trace
Thread 0x2390 Task  -  2013-09-24 17:56:22 RTEKernel116:  Tracewriter termination timeout: 60 seconds
Thread 0x23C7 Task  3  2013-09-24 17:56:22 Trace  20001:  Stop flush kernel trace
Thread 0x23C7 Task  3  2013-09-24 17:56:22 Trace  20002:  Start flush kernel dump
Thread 0x23C7 Task  3  2013-09-24 17:56:24 Trace  20003:  Stop flush kernel dump
Thread 0x23AE Task  -  2013-09-24 17:56:33 ERR RTEKernel125:  The watchdog process is no longer alive,_FILE=RTEKernel_StartupUnix+noPIC.cpp,_LINE=408

                                                                       ACTION:

                                                                       Contact your system administrator. Show him the error message which points to an operating system configuration error and then contact the database support if your system administrator can not fix the error.
Thread 0x23C7 Task  3  2013-09-24 17:56:39 RTEKernel110:  Releasing tracewriter
Thread 0x2390 Task  -  2013-09-24 17:56:39 TENANT 13008:  Requestor for tenant database DSL has stopped
Thread 0x2390 Task  -  2013-09-24 17:56:39 RTEThread 13:  The thread LegacyRequestor is finished
Thread 0x23B0 Task  -  2013-09-24 17:56:39 RTE    20214:  CONSOLE thread stopped
Thread 0x2390 Task  -  2013-09-24 17:56:39 RTEKernel 58:  Backup of diagnostic files will be forced at next restart
Thread 0x2390 Task  -  2013-09-24 17:56:39 RTEKernel118:  SERVERDB DSL has stopped
                           2013-09-24 17:56:39 RTEKernel 14:  Kernel version: Kernel7.8.02   Build 036-121-248-298
Thread 0x2390 Task  -  2013-09-24 17:56:39 RunTime    3:  State changed from KILL to STOPPED
Thread 0x2390 Task  -  2013-09-24 17:56:39 RTEThread 13:  The thread Requestor is finished
Thread 0x2390 Task  -  2013-09-24 17:56:39 TENANT 13005:  Tenant database DSL has stopped
Thread 0x2390 Task  -  2013-09-24 17:56:40 RTEKernel119:  Kernel aborts

The last entry in the KnlMsg file was written several hours before the x_cons suspends were noticed.

The running dbmsrv processes are:

uan112:sqddsl 1002> ps -ef |grep dbmsrv|grep DSL

sdb       5652  5651  0 Sep24 ?        00:00:05 /sapdb/DSL/db/pgm/dbmsrv -sdbstarter 3600 3600 A -P 0000000300000007000000080000000B

sdb      11302     1 94 Sep24 ?        13:45:32 /sapdb/DSL/db/pgm/dbmsrv -sdbstarter 3600 3600 A -P 0000000300000007000000080000000B

sdb      13644     1  4 07:37 ?        00:00:00 /sapdb/DSL/db/pgm/dbmsrv -P 0000000b0000000e0000000f00000012

sdb      13873     1  4 07:37 ?        00:00:00 /sapdb/DSL/db/pgm/dbmsrv -P 0000000b0000000e0000000f00000012

sdb      13978     1  4 07:37 ?        00:00:00 /sapdb/DSL/db/pgm/dbmsrv -P 0000000b0000000e0000000f00000012

As the DB seems to be down, I might kill them later and try to start the DB from scratch.
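
Before killing anything, it may help to single out which dbmsrv process is actually consuming CPU. The sketch below parses `ps -ef` output in the column layout shown above (UID, PID, PPID, C, STIME, TTY, TIME, CMD). The helper names and the one-hour threshold are arbitrary choices for illustration, not anything MaxDB prescribes.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME value such as '13:45:32' or '1-02:03:04' to seconds."""
    days = 0
    if "-" in time_field:
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in time_field.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (or minutes) with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_ps(text):
    """Parse 'ps -ef' lines into dicts; lines that do not fit are skipped."""
    procs = []
    for line in text.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        procs.append({
            "pid": int(fields[1]),
            "ppid": int(fields[2]),
            "cpu_s": cpu_seconds(fields[6]),
            "cmd": fields[7],
        })
    return procs


def busy_dbmsrv(procs, min_cpu_s=3600):
    """Return dbmsrv processes with at least min_cpu_s seconds of CPU time."""
    return [
        p for p in procs
        if "dbmsrv" in p["cmd"] and p["cpu_s"] >= min_cpu_s
    ]
```

Applied to the listing above, this would single out PID 11302, which has accumulated 13:45:32 of CPU time and whose parent is PID 1.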

thorsten_zielke
Contributor

Hmm, I would suggest creating an SAP OSS ticket for this. I think we would need to look at the Database Analyzer log files plus the error protocol files like KnlMsg, dbm.prt...

An OS connection would help, plus the exact time when you noticed that logon via dbmcli no longer worked.

Have you checked the OS log file, e.g. /var/log/messages (Linux)?

Thorsten