MaxDB 7.6.06.24 crashed and is unstartable even in restored Backup

urs_schuerer
Explorer

Today our database crashed and is now in a state similar to the one described here: http://scn.sap.com/thread/1669772. It seems to keep crashing on log recovery during db_online. It definitely has at least one bad index, but since we are unable to bring the database online, we seem to have no chance to fix this. I would be thankful for any hints.

Here are some facts:

  • The first crash (somewhat shortened) looked something like:

VERSION  'X64/LIX86 7.6.06   Build 024-123-246-595'
[...]
ERR 53419 B*TREE   BD600: predsep.k > sep.k: 0
ERR 53000 B*TREE   Index Root  13113447
ERR 53000 B*TREE   bd600BuildSeparator: lef: 13553565
ERR 53000 B*TREE   bd600BuildSeparator: rig: 14665430
ERR          0 B*TREE   m_Node: 13553565
ERR          0 B*TREE   m_Node1: 14665430
ERR 11599 BTRACE   ----> Symbolic Stack Back Trace <----
ERR 11599 BTRACE      0: 0x0000000000f216ad eo670_UnixTraceStack +0x01ad
ERR 11599 BTRACE      1: 0x0000000000f22079 eo670_CTraceContextStackOCB +0x0009
ERR 11599 BTRACE      2: 0x0000000000f2216a vtracestack +0x003a
ERR 11599 BTRACE      3: 0x0000000000a64ecb _ZN19cbd502_ReorgContext13ErrorHandlingEPKci +0x002b
ERR 11599 BTRACE      4: 0x0000000000a72178 _ZN11cbd500_Tree14bd520_OverflowER19cbd502_ReorgContextPhiibbiRi +0x0908
ERR 11599 BTRACE      5: 0x0000000000a727ab _ZN11cbd500_Tree17bd520LeafOverflowEPhibiRi +0x03ab
ERR 11599 BTRACE      6: 0x0000000000a3de71 _Z17bd400AddToInvTreeR17cbd300_InvCurrentPhiS1_iRb +0x1181
ERR 11599 BTRACE      7: 0x00000000009a6a1f b03add_inv +0x01ff
ERR 11599 BTRACE      8: 0x0000000000b4538c _ZNK14Log_InvDescMap6AddInvER18tgg00_TransContextbbbRK12tgg00_FileIdPK9tgg00_Rec +0x016c
ERR 11599 BTRACE      9: 0x0000000000b4345e _ZNK15Log_InvHandling6AddInvER18tgg00_TransContext +0x003e
ERR 11599 BTRACE     10: 0x000000000092f669 kb611inv_AddInv +0x0009
ERR 11599 BTRACE     11: 0x000000000092dbe1 kb61insert_rec +0x0201
ERR 11599 BTRACE     12: 0x000000000092e887 k61ins_del_upd +0x0267
ERR 11599 BTRACE     13: 0x00000000008dba3a k05functions +0x058a
ERR 11599 BTRACE     14: 0x000000000065fb97 a06lsend_mess_buf +0x0297
[...]
ERR 53250 INDEX    Bad Index 13113447 (Root)
ERR 53250 INDEX    Reason "System error: BD Invalid leave"
         12853 DBSTATE  Caught signal 11(SIGSEGV)
ERR 11330 COREHAND ABORTING due to signal 11
ERR 11599 BTRACE   ----> Symbolic Stack Back Trace <----
ERR 11599 BTRACE      0: 0x0000000000f216ad eo670_UnixTraceStack +0x01ad
ERR 11599 BTRACE      1: 0x0000000000f22079 eo670_CTraceContextStackOCB +0x0009
ERR 11599 BTRACE      2: 0x0000000000f22097 eo670_CTraceContextStack +0x0017
ERR 11599 BTRACE      3: 0x0000000000f65718 en81_CrashSignalHandler +0x00e8
ERR 11599 BTRACE      4: 0x00007f40e100b6b0 __restore_rt +0x0000
ERR 11599 BTRACE      5: 0x0000000000a74f20 _ZNK11cbd600_Node14bd600LeafCountEii +0x00b0
ERR 11599 BTRACE      6: 0x0000000000a42728 _Z23bd401CalculatePageCountR17cbd300_InvCurrentPhiS1_ibRiS2_S2_ +0x0658
ERR 11599 BTRACE      7: 0x00000000009ab093 b03calculate_page_count +0x02b3
ERR 11599 BTRACE      8: 0x000000000065d9f1 a06eval_page_count +0x0171
ERR 11599 BTRACE      9: 0x0000000000878292 ak720indexeval +0x0142
ERR 11599 BTRACE     10: 0x0000000000878b5a ak720eval_one_index +0x016a
ERR 11599 BTRACE     11: 0x0000000000878e73 ak720index_decision +0x0183
+++++++++++++ Kernel Exit ++++++++++++++++++++++++++++

  • So whenever we now try to switch to db_online (LOCAL_REDO_LOG_BUFFER_SIZE is 0, as it was before), it immediately crashes with
      ERR
      -24994,ERR_RTE: Runtime environment error
      4,connection broken server state 4


VERSION  'X64/LIX86 7.6.06   Build 024-123-246-595'
[...]

Log      0 queues, flushmode is 'MaximizeSafety', devstate is 'Okay'
Log      Oldest not saved is ioseq 296996736 @ off 775686
Log      First known on LogVolume is ioseq 295948220 @ off 775741
Log      Restart from ioseq 296996789 @ off 775739 to ioseq 296996790 @ off 775740
Log      Result after checking the log device: 'Ok'
Log      The number of active logging-queues has been increased to 4
OBJECT   Restarted Garbage coll: 1
Rst      968 redo transactions readable and 32 redo tasks available.
RESTART  Previous restart was interrupted.
Restart  recovering log from log_volume from IOSeq: '296996789'
Log      normal end of log found at off 775740 lastseq 296996790.
Log      last-redo-read empty errlist#13:TR907215147(11)[296996790]@775740.1608'Commit':20140128:135616
DBSTATE  Caught signal 11(SIGSEGV)
ERR 11330 COREHAND ABORTING due to signal 11

[...]
ERR 11599 BTRACE   ----> Symbolic Stack Back Trace <----
ERR 11599 BTRACE      0: 0x0000000000f216ad eo670_UnixTraceStack +0x01ad
ERR 11599 BTRACE      1: 0x0000000000f22079 eo670_CTraceContextStackOCB +0x0009
ERR 11599 BTRACE      2: 0x0000000000f22097 eo670_CTraceContextStack +0x0017
ERR 11599 BTRACE      3: 0x0000000000f65718 en81_CrashSignalHandler +0x00e8
ERR 11599 BTRACE      4: 0x00007f1f4412f6b0 __restore_rt +0x0000
ERR 11599 BTRACE      5: 0x0000000000a3ac49 _Z20bd400_DeleteSubTreesR17cbd300_InvCurrentR11cbd600_Node+0x0109
ERR 11599 BTRACE      6: 0x0000000000a3b490 _Z16bd400DropInvTreeR17cbd300_InvCurrent +0x0370
ERR 11599 BTRACE      7: 0x00000000009a9a4e bd03ReleaseInvTree +0x013e
ERR 11599 BTRACE      8: 0x000000000096d7d3 bd01destroy_file +0x0233
ERR 11599 BTRACE      9: 0x000000000096e7b9 b01pdestroy_perm_file +0x00d9
ERR 11599 BTRACE     10: 0x0000000000b2c0a4 _ZNK14Log_ActionFile13RemoveGarbageER18tgg00_TransContextR20SAPDBErr_MessageList +0x02f4
ERR 11599 BTRACE     11: 0x0000000000b2c5bc _ZNK14Log_ActionFile7ExecuteER18tgg00_TransContextNS_13ExecutionTypeE +0x039c
ERR 11599 BTRACE     12: 0x0000000000b2d78e _ZNK14Log_ActionFile4RedoER18tgg00_TransContextR10Log_IImageR20SAPDBErr_MessageList +0x000
ERR 11599 BTRACE     13: 0x0000000000b39d2b _Z10RedoActionR15Log_TransactionR10Log_IImageR11Log_IActionR15Data_IBreakableR21Data_Split
ERR 11599 BTRACE     14: 0x0000000000b3bd11 _Z21Log_ActionExecuteRedoR15Log_TransactionR14Log_AfterImage16Log_IOSequenceNoR21Data_Spli
ERR 11599 BTRACE     15: 0x0000000000b544b5 _ZN15Log_Transaction4RedoER20SAPDBErr_MessageList +0x02e5
+++++++++++++ Kernel Exit ++++++++++++++++++++++++++++

  • If we try a consistency check in admin mode it stumbles over one bad index, complaining about root node 13113447, but of course we cannot repair that in admin state or even find out which index is to blame (can we?):

    >db_execute check data with update
    ERR
    -24988,ERR_SQL: SQL error
    -9041,System error: BD Index not accessible
    17,Servertask Info: because b01pverify_participant() failed
    10,Job 0 (Check Data) [executing] WaitingT206 Result=OK
    6,b01pverify_participant() failed, Error code 715 "index_not_accessible"

  • We even tried to recover our last backup (with AUTO_RECREATE_BAD_INDEXES set to YES), but as soon as we try to, for instance, drop the index we assume to be bad, it immediately crashes and becomes unstartable again, just as above.

Is there any way to restore without indexes or to have all indexes recreated? Any hints on how to bring the original DB back online?

Cheers,

Urs

Accepted Solutions (1)

Former Member

Hi Urs,

you have set the parameter AUTO_RECREATE_BAD_INDEXES to YES, which means that if a corrupted index is detected, this index is recreated during restart.

The stack back trace above tells us that the problem is in a subtree of what looks like a huge index.

I would now try to set the parameter AUTO_RECREATE_BAD_INDEXES to NO. Then the corrupted index stays in the system and won't be recreated implicitly during restart.

When the database is online you can check whether there are any corrupted indexes in the system, and you should get the information which index is corrupted. Don't try to drop this index - it will crash the database again.
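For illustration, a rough dbmcli sequence for this approach could look like the sketch below. Database name, credentials and especially the catalog view/column used to list bad indexes are assumptions on my side (and the exact parameter command may differ between DBM versions) - please check them against your own 7.6 installation:

    # connect with the DBM operator (MYDB, dbm,secret and sapr3,secret are placeholders)
    dbmcli -d MYDB -u dbm,secret
    # keep corrupted indexes instead of recreating them implicitly at restart
    param_directput AUTO_RECREATE_BAD_INDEXES NO
    # bring the database online again
    db_online
    # open an SQL session with a DBA-capable SQL user
    sql_connect sapr3,secret
    # list the indexes the kernel has marked as bad/disabled
    # (assumes the INDEXES catalog view exposes a DISABLED column)
    sql_execute SELECT OWNER, TABLENAME, INDEXNAME FROM DOMAIN.INDEXES WHERE DISABLED = 'YES'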

If the database is in online mode, the table which the index belongs to is read-only - because the affected index is a unique index.

The next step is to upgrade the database to the newest 7.6.06 build to solve this error -> predsep.k > sep.k: 0

After this upgrade is done successfully you can try to drop the index.
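The drop itself could then look roughly like this (index and table names are placeholders for whatever the check above reports):

    # inside the same dbmcli/SQL session as above, after the successful upgrade
    sql_execute DROP INDEX "MY_BAD_INDEX" ON "MYTABLE"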

Hope this will work as a workaround.

Regards, Christiane

urs_schuerer
Explorer

Hi Christiane,

thanks for that hint, I will give it a try. But I have to ask one thing about your suggestion to upgrade to the newest build of 7.6.06: I did not download the latest build, since it is said to be 7.6.06.24, which is what we already have installed anyway. So ... might there be a newer build hidden somewhere in these sub-version numbers "-123-246-595"?

Thanks again,

urs

Former Member

Hi Urs,

we delivered a newer MaxDB 7.6.06 build to the SAP SWDC. Do you have access to SAP's SWDC?

If not, please let me know; then I will check which build is available on SDN.

-123-246-595"? - this is not the build number - it's internal information about the make of this version

7.6 => Major number

06. Minor number

.24 Build number
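If you want to double-check what is actually installed, something along these lines should work (database name and DBM credentials are placeholders):

    # show the version/build reported by the DBM server
    dbmcli -d MYDB -u dbm,secret dbm_version
    # list all installed MaxDB software packages with their build numbers
    sdbregview -l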

Regards, Christiane

urs_schuerer
Explorer

SAP SWDC? Uh ... SAP Service Marketplace? No, we do not have access to that. I just downloaded the newest version from store.sap.com (since community downloads seem to have moved there), but the one there is from 2012 and seems to be exactly 7.6.06.24.

Former Member

Hi Urs,

ok - then you must use SDN to download your software from the SAP store. There really is only 7.6.06.24 available. I have already triggered the replacement of this old version with 7.6.06.27, but this will take some days. Hope it will be available next week.

Regards, Christiane

urs_schuerer
Explorer

Hi Christiane,

could you please be so kind as to trigger those colleagues one more time? The .24 is still in the store ;( ...

Thx & Regards,

Urs

Former Member

Hi Urs,

I triggered the colleagues again to check whether only the version information ("The package includes versions 7.6.06.24, 7.7.07.45 and 7.8.02.37 for ....") is wrong, or whether we really still have the old versions up for download, which would definitely be wrong.

Thanks for the hint.

Regards, Christiane

Former Member

Hi Folks,

now we have fixed it - even though you still see the old version 7.6.06.24 listed on the SAP Store, the download area contains 7.6.06.27.

The version string in the technical information section will be correct as well and should be fine tomorrow.

Regards, Christiane

Answers (1)

urs_schuerer
Explorer

I am so sorry to bring this up again:

After last time, when we had to do a point-in-time recovery to a few transactions before the fatal activity, it seems we are stuck at the same point:

  • Database Version now is 7.6.06 Build 027-123-248-897
  • We again have a bad index:
    ERR 53000 B*TREE   Index Root  13709821
  • There was a transaction using this index which made the DB crash with SIGSEGV
  • AUTO_RECREATE_BAD_INDEXES is set to NO (verified as sketched right after this list)
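For reference, the parameter value can be re-checked roughly like this (database name and credentials are placeholders; param_directget is assumed to be available on this DBM version):

    # read the current value from the parameter file
    dbmcli -d MYDB -u dbm,secret param_directget AUTO_RECREATE_BAD_INDEXES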

As soon as we now try to get into db_online, the evil transaction seems to be read from the log:

Rst      968 redo transactions readable and 32 redo tasks available.
Restart  recovering log from log_volume from IOSeq: '71017143'
Log      normal end of log found at off 764391 lastseq 71018647.
Log      last-redo-read empty errlist#3245:TR973675497(7)[71018647]@764391.864'Commit':20140607:164721
DBSTATE  Caught signal 11(SIGSEGV)
COREHAND ABORTING due to signal 11

So again no way to drop the bad index or start the DB ...

Just to make sure: I know we need to move to 7.8 soon, but a DB that cannot be started any more is the worst-case scenario, with all data being lost. Just so this report contains some facts, here is (part of) the stack:

eo670_UnixTraceStack +0x01ad
eo670_CTraceContextStackOCB +0x0009
eo670_CTraceContextStack +0x0017
en81_CrashSignalHandler +0x00e8
__restore_rt +0x0000
_Z20bd400_DeleteSubTreesR17cbd300_InvCurrentR11cbd600_Node +0x0109
_Z16bd400DropInvTreeR17cbd300_InvCurrent +0x0370
bd03ReleaseInvTree +0x013e
bd01destroy_file +0x0233
b01pdestroy_perm_file +0x00d9
_ZNK14Log_ActionFile13RemoveGarbageER18tgg00_TransContextR20SAPDBErr_MessageList +0x02f4
_ZNK14Log_ActionFile7ExecuteER18tgg00_TransContextNS_13ExecutionTypeE +0x039c
_ZNK14Log_ActionFile4RedoER18tgg00_TransContextR10Log_IImageR20SAPDBErr_MessageList +0x000
_Z10RedoActionR15Log_TransactionR10Log_IImageR11Log_IActionR15Data_IBreakableR21Data_SplitSpaceReaderR20SAPDBErr_MessageList +0x008b
_Z21Log_ActionExecuteRedoR15Log_TransactionR14Log_AfterImage16Log_IOSequenceNoR21Data_SplitSpaceReaderRN31Data_ChainSplitSpaceForwardReadI12Rst_RedoPageE8IteratorER20SAPDBErr_MessageList +0x0ae1
_ZN15Log_Transaction4RedoER20SAPDBErr_MessageList +0x02e5
_ZN22Rst_RedoTrafficControl11ExecuteJobsEiR18tgg00_TransContextR20SAPDBErr_MessageList +0x
_ZN16SrvTasks_JobRedo13ExecuteInternER13Trans_Context +0x0100
_ZN12SrvTasks_Job15ExecuteDirectlyER13Trans_Context +0x006d

urs_schuerer
Explorer

We have just recovered from a data backup (sorry, we need to continue working, but I still have the old damaged set). I would still be interested in any form of solution, since I assume it might hit us again soon.

steffen_schildberg
Active Participant

Hi Urs,

this problem looks like an error we found and fixed just recently with versions 7.7.07.48 and 7.8.02.39 (already available via SCN afaik). The fix in 7.6 is not yet delivered to anyone. So, if at all possible, upgrade to one of those versions to make sure you do not hit this problem again and again. These versions also contain two more corrections to the index implementation for errors which might lead to index corruption, too.

The problem in your case is this: counting the leaves for a join select did not consider empty pages at the right edge of a tree. This could lead to a crash, but normally such a system would come up again after the crash. Therefore I would be interested in the log files of the damaged system (everything in the wrk directory of the database, like knldiag, knldiag.err or KnlMsg, KnlMsgArchive, KnlMsg.old - and don't worry about the formatting of the files, I will be able to format and read them here in the labs). To upload, zip the files and use the following container:

SAP Box Attachment.
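Collecting the files could look roughly like this; the run-directory path is just a typical default and depends on your installation:

    # typical location of the run (wrk) directory; adjust to your installation
    cd /sapdb/data/wrk/MYDB
    # pack the kernel diagnostic files mentioned above into one archive for upload
    # (use tar czf instead if zip is not installed)
    zip MYDB_diag.zip knldiag* KnlMsg*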

Sorry for all the inconveniences.

Best regards,

Steffen

urs_schuerer
Explorer

Hi Steffen,

I am absolutely happy about your answer (and about anyone helping in this forum, of course) and would like to thank you for the information given. I hope we are able to migrate soon, but there seem to be some difficulties concerning the compatibility of these versions (I am not involved in the software development).

Concerning the crash, it would be helpful to have an option to, for instance, mark some or all indexes as bad so the DB would not try to use them - in other words, some way to modify the index table in admin mode.

I have uploaded some files to the SAP Box named above -- please note that there might be duplicates, since some files were copied by myself and some are from the DIAGHISTORY.

Thank you again for your great effort,

Best regards,

Urs

P.S.: Why do those indexes get corrupted all the time anyway? Is this some kind of race condition?

steffen_schildberg
Active Participant

Hi Urs,

Thanks for the files. I am still analyzing, and it indeed looks like an error we had back then. And in your case the index is corrupt in such a way that the subsequent drop of that index (which is exactly what the kernel tries during redo of the log area when you restart it after the original crash) crashes the kernel again.

The error is actually not too simple (I think):

During a balancing operation a page might be locked for WRITE access and be written, but not immediately released; instead it is cached for subsequent operations on that page. If such a subsequent action does not write the page again (for instance because the page is full), the page is released as if it had been locked for READ access. This way the change on that page can get lost, unless another subsequent write operation from another transaction changes the page again immediately, or at least before the next savepoint. Afterwards there are missing separators on index tree level 1 (separators address the pages below their current level), but the leaf pages of the index addressed by these pages are ok. Now if there is any operation on the index in the region of the corrupt page, the index is set to bad. And while dropping the index, the kernel finds the chain of separators broken and stops it all.

Ok - the easiest way to overcome all this (and you will face this error again, as I already wrote) is to get the newest available version. Would you mind finding out what exactly stops your development from using the newest version? Usually, the software gets better with newer versions.

And unfortunately there is no option to set an index to bad in the ADMIN state of the database. To do so, SQL access is necessary, which is not available in ADMIN state. To be honest - there is a really complicated way to manipulate the internal structures of the database with the right tools, but I would not recommend doing it. And in fact it would not help: even if the index is set to bad and you try to drop it, the kernel will crash in your case anyway.

Best regards,

Steffen