
SYSERROR -9026 BD Bad datapage,write/check count

Former Member
0 Kudos

Hi,

We're running MaxDB 7.5.38 on Linux and had an immediate shutdown last Friday.

After checking the knldiag.err we found that an index was leading to a "Bad page - checksums not matching" error, resulting in this immediate shutdown.

We found the index via "select * from roots where root=xxx" and dropped/recreated this index afterwards. After this the database was running again (and is doing so currently).
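The root number for that lookup comes from knldiag.err message 20020 ("Bad data page ... belongs to root ..."). A minimal sketch, assuming the message keeps the format quoted in this thread; the helper name `root_lookup` is made up for illustration and is not part of any MaxDB tool:

```python
import re

# Assumed format of knldiag.err message 20020, taken from the log quoted in this thread.
ROOT_RE = re.compile(
    r"Bad data page (\d+) belongs to root (\d+) which is of filetype '(\w+)'"
)

def root_lookup(line):
    """Return (page, root, filetype, query) for a 20020 message, or None if the line does not match."""
    match = ROOT_RE.search(line)
    if match is None:
        return None
    page, root = int(match.group(1)), int(match.group(2))
    filetype = match.group(3)
    # Same lookup as the one described above, with the root number filled in.
    query = f"select * from roots where root = {root}"
    return page, root, filetype, query

line = ("2008-06-20 15:47:37 23919 ERR 20020 Data Bad data page 1459521 "
        "belongs to root 426184 which is of filetype 'Index'")
print(root_lookup(line))
```

The filetype field is worth checking before acting: 'Index' means a drop and recreate may be enough, while a damaged table is a different situation.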

But one problem remains: we have not been able to create a backup since this crash. Every backup attempt fails with these messages:

-


2008-06-22 21:00:23 1372 ERR 20004 Data Bad page - checksums not matching

2008-06-22 21:00:23 1372 ERR 20005 Data Bad page - calculated checksum [ 193943944 ] checksum found in page [ 207442492 ]

2008-06-22 21:00:23 1372 ERR 52015 SAVE write/check count mismatch 1459521

2008-06-22 21:00:24 1371 ERR 52012 SAVE error occured, basis_err 300

2008-06-22 21:00:24 1371 ERR 51080 SYSERROR -9026 BD Bad datapage,write/check count

-


I already recreated the table to which the faulty index belonged, and I also ran a "check database structure extended" for this one table. The result was "checking of table xxxx successfully finished".

What can we do to create a backup again?

thanks..::GERD::..

Accepted Solutions (1)


lbreddemann
Active Contributor
0 Kudos

> I already recreated the table to which the faulty index belonged, and I also ran a "check database structure extended" for this one table. The result was "checking of table xxxx successfully finished".

>

> What can we do to create a backup again?

What do you mean by "recreated the table"? If there is a corrupt page in the B*Tree of a table you cannot read the contents of this table anymore.

If the corrupt page just belonged to an index - rebuilding it should be enough to get rid of the problem.

Also the corrupt page should not be touched anymore during the backup if the index had been recreated, since the backup only reads the pages that are actually in use.

Have you dropped the corrupt index? How did you recreate it exactly?

If you dropped the index but the bad page still gets reported, you can run a CHECK DATA WITH UPDATE in ADMIN mode.

This will release unused pages that somehow got stuck in the free space management.

KR Lars

Melanie
Advisor
0 Kudos

Just one more thought:

maybe the page causing the backup problems is not the same page which caused the crash?

Did you compare the page numbers in knldiag?
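One way to make that comparison is sketched below. It assumes the bad-page messages look like the knldiag.err lines quoted in this thread; the patterns and the helper name `bad_pages` are illustrative assumptions, not part of any MaxDB tool.

```python
import re

# Patterns are assumptions based on the message texts quoted in this thread.
PAGE_PATTERNS = [
    re.compile(r"Bad data page - Requested pageno (\d+)"),
    re.compile(r"Bad data page (\d+) belongs to root"),
    re.compile(r"write/check count mismatch (\d+)"),
]

def bad_pages(lines):
    """Collect the page numbers named in bad-page messages."""
    pages = set()
    for line in lines:
        for pattern in PAGE_PATTERNS:
            match = pattern.search(line)
            if match:
                pages.add(int(match.group(1)))
    return pages

# Sample lines copied from the crash and from the failed backup.
crash = [
    "2008-06-20 15:47:37 23919 ERR 20025 IOMan Bad data page - Requested pageno 1459521 (perm) read pageno 1459521",
    "2008-06-20 15:47:37 23919 ERR 20020 Data Bad data page 1459521 belongs to root 426184 which is of filetype 'Index'",
]
backup = [
    "2008-06-22 21:00:23 1372 ERR 52015 SAVE write/check count mismatch 1459521",
]

# Pages reported in both incidents.
common = bad_pages(crash) & bad_pages(backup)
print(sorted(common))  # prints [1459521]
```

A non-empty intersection means the backup is failing on the same page that caused the crash; an empty one would point to additional corrupt pages.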

You should definitely run a complete check data, not just for one table. If you run it in ADMIN mode, as Lars described, unused pages that were left behind by the corruption in the B* tree (and should have been deleted) are removed.

Let us know if this solves the problem.

Otherwise you will at least know how many other pages are also corrupted. If all of them are indexes: recreate them. If tables are also damaged, you probably have to run a recovery...

Regards,

Melanie

Former Member
0 Kudos

Hello Melanie,

thanks to you as well for answering.

I got these two message blocks from knldiag.err, the first one from the crash and the second one from the failed backup:

-


db crash------

2008-06-20 15:47:34 23919 ERR 20013 IOMan Bad page on data volume 2 blockno 381754

2008-06-20 15:47:35 23919 ERR 20004 Data Bad page - checksums not matching

2008-06-20 15:47:35 23919 ERR 20005 Data Bad page - calculated checksum [ 196750527 ] checksum found in page [ 207442492 ]

2008-06-20 15:47:35 23919 ERR 20013 IOMan Bad page on data volume 2 blockno 381754

2008-06-20 15:47:37 23919 ERR 20025 IOMan Bad data page - Requested pageno 1459521 (perm) read pageno 1459521

2008-06-20 15:47:37 23919 ERR 20020 Data Bad data page 1459521 belongs to root 426184 which is of filetype 'Index'

-


failed backup-------

2008-06-22 21:00:23 1372 ERR 20004 Data Bad page - checksums not matching

2008-06-22 21:00:23 1372 ERR 20005 Data Bad page - calculated checksum [ 193943944 ] checksum found in page [ 207442492 ]

2008-06-22 21:00:23 1372 ERR 52015 SAVE write/check count mismatch 1459521

2008-06-22 21:00:24 1371 ERR 52012 SAVE error occured, basis_err 300

2008-06-22 21:00:24 1371 ERR 51080 SYSERROR -9026 BD Bad datapage,write/check count

-


So it seems the same page is causing both problems. This led me to assume that we could get rid of the problem by renaming the table with the dropped index, copying the data to a newly created table, and then dropping the renamed table.

I thought that once the table is gone, the database would no longer use this bad page. But why does the backup want to access this page?

Do I have to restart the database? (I have not done so since the db crash on Friday.)

...GERD...

Melanie
Advisor
0 Kudos

Hello Gerd,

in this case the correct solution was rebuilding the index. Recreating the table was not necessary!

However, as the index structure was corrupted, not all pages could be removed by the drop index.

Some pages still exist in the database and are marked as 'in use'. During the backup all 'in use' pages have to be part of the backup.

You have to run the CHECK DATA in ADMIN mode to get rid of these pages!

The option "check database structure and clear converter in state ADMIN" is the correct one in DBMGUI.

Maybe you can run it tonight?

Regards,

Melanie

Answers (1)


markus_doehr2
Active Contributor
0 Kudos

You have a corrupt page on your disk. This is caused (in 99 % of the cases) by hardware problems. Check "dmesg" and "/var/log/messages" for problems at the time of the crash.

Are the data volumes on a filesystem, and was there maybe an "fsck" run before that crash happened?

I would open an OSS call and let support have a look at the system.

Markus

Former Member
0 Kudos

Hello Markus and Lars,

thanks for answering quickly.

Let me first answer Markus' questions:

I checked dmesg-output and /var/log/messages. There's only one strange output (in dmesg):

kernel[26520]: segfault at 0000000000000118 rip 00002aecec2d2862 rsp 00000000434920a0 error 4

....

....

kernel[14442]: segfault at 0000000000000118 rip 00002aab67846862 rsp 00000000414430a0 error 4

But how do we know if this message belongs to the db crash ?

...and there was no fsck running just before the crash happened.

Do we have to provide more info concerning this OSS call ? What does this mean ?

-


Answers to Lars:

I "recreated" the table in 4 steps:

1.) rename the table

2.) create new table with original name

3.) move data to new table

4.) drop the renamed table

I did this because I thought that the table itself was causing this "bad data page" error.

For clarification: after dropping and recreating the index (in SQLStudio via plain sql "drop index..."/"create index...") the database was up again, and now only making a backup fails.

The "check data with update" is not that nice, because it will lead to a very long downtime of our platform (I think you mean the option in DBM-Gui "check database structure and clear converter in state ADMIN").

How long do you think it will take (35 GB of data and lots of LONG values)?

thanks a lot......GERD.....

lbreddemann
Active Contributor
0 Kudos

Hi Gerd,

when recreating your table as described you have to be very careful to make sure that all constraints and default definitions are recreated correctly as well. A 'create table like' would be the command of choice here.

Concerning the long downtime of the CHECK DATA WITH UPDATE (that's the command's name on the command line...): currently you cannot create full backups, so I would say you are past the point where it's about "niceness".

Anyhow, what you can try is to create incremental backups as long as the full data backups are failing. At the next possible time frame (perhaps the weekend?) you have to perform the CHECK DATA WITH UPDATE.

The runtime of the CHECK data is obviously I/O dependent but I would propose to take the time it takes to backup the database with a single volume medium as a first guess.

KR Lars

markus_doehr2
Active Contributor
0 Kudos

> I checked dmesg-output and /var/log/messages. There's only one strange output (in dmesg):

> kernel[26520]: segfault at 0000000000000118 rip 00002aecec2d2862 rsp 00000000434920a0 error 4

> ....

> ....

> kernel[14442]: segfault at 0000000000000118 rip 00002aab67846862 rsp 00000000414430a0 error 4

>

> But how do we know if this message belongs to the db crash ?

Is this an Itanium-2 box? It seems that there are misaligned accesses...

Markus

Former Member
0 Kudos

Hi Lars,

you're right. This state isn't that nice.

We cannot create incremental backups either.

GERD

Former Member
0 Kudos

Hi Markus,

the database is running on a system with 2 quad-core Xeon CPUs and openSUSE 10.2 (64-bit). The MaxDB version is 7.5.0.38.

...GERD...

markus_doehr2
Active Contributor
0 Kudos

Segmentation faults logged in /var/log/messages usually mean that there was an invalid memory access and the database "crashed"...

Markus

Former Member
0 Kudos

Hi Markus,

I don't know if it makes a difference, but these seg faults are from the dmesg output; they are not in /var/log/messages.

...GERD...

markus_doehr2
Active Contributor
0 Kudos

> I don't know if it makes a difference, but these seg faults are from the dmesg output; they are not in /var/log/messages.

That doesn't make a big difference... maybe they are already overwritten in /var/log/messages...

Did you encounter a crash at the times those messages appear?

Markus

Former Member
0 Kudos

Good morning,

we're planning the db_check_in_admin_mode for the end of the week, since we have to inform our customers of the downtime.

I will drop you a note (and hand out the reward points of course) and tell you the result.

bye....GERD....

Former Member
0 Kudos

Hello to all,

last night I performed the check data in admin mode, and I'm very happy that everything is fine afterwards.

As a result this check-run freed almost 1GB of unused space, and now we're able to create backups as usual again.

thanks again for your support.....GERD....