cancel
Showing results for 
Search instead for 
Did you mean: 

How can I simulate MaxDB system failure in testing environment?

Former Member
0 Kudos

Dear All Experts,

I'm trying to perform testing for MaxDB backup as well as recovery in a testing environment.

Is there a way for me to "crash" the database (with lost of data / tables) so I can perform a recovery, just like how a real situation is?

Any advises / guidance will be very much appreaciated!

-Deric-

Accepted Solutions (1)

Accepted Solutions (1)

markus_doehr2
Active Contributor
0 Kudos

> Is there a way for me to "crash" the database (with lost of data / tables) so I can perform a recovery, just like how a real situation is?

- stop the database

- delete a datafile

- try to start the database

Markus

Former Member
0 Kudos

Dear Markus,

Thanks for your guide..

I will try it out and let you know the outcome later!

Former Member
0 Kudos

Hi Markus,

Okay, this is what I've done...

A fresh MaxDB few a few dummy tables with data inside.

Did a complete backup by using DB Manager.

I did as you taught, stop the DB, renamed the datafile (DISKD0001 --> this is the only datafile available since it's fresh db) to something else, and tried to restart it.

An error occured: "-24988 SQL error [db_online -f]; -902,I/O error"

So, I tried to perform a normal recovery by using the wizard in ADMIN state but was given an error, "-24994 Runtime environment error [db_restartinfo]; 4; connection broken"

So, I resort fo recovery with initialization, and it works well. A new diskfile, "DISKD0001" is created and all my dummy tables are still intact, which is good.

My question is, why does the normal recovery method doesn't work?

As I understand, recovery with initialization is not to be used as normal recovery method, as it will remove all the logs backup(although in the case, there is no logs backup, only complete backup), am I right?

lbreddemann
Active Contributor
0 Kudos

> I did as you taught, stop the DB, renamed the datafile (DISKD0001 --> this is the only datafile available since it's fresh db) to something else, and tried to restart it.

>

> An error occured: "-24988 SQL error [db_online -f]; -902,I/O error"

>

> So, I tried to perform a normal recovery by using the wizard in ADMIN state but was given an error, "-24994 Runtime environment error [db_restartinfo]; 4; connection broken"

>

> So, I resort fo recovery with initialization, and it works well. A new diskfile, "DISKD0001" is created and all my dummy tables are still intact, which is good.

>

> My question is, why does the normal recovery method doesn't work?

>

> As I understand, recovery with initialization is not to be used as normal recovery method, as it will remove all the logs backup(although in the case, there is no logs backup, only complete backup), am I right?

You're right - "recovery with initialization" will remove/ignore all existing data in all volumes (including the log-volume - that means you might loose some transactions when you have no log backup containing the data).

So why did you see the error?

The answer lies in the way 'db_restartinfo' works.

It tries to read the restart record which is stored in the first page of the first data volume.

Usually when the first volume is not available the exception handling should take care of this - but unfortunately with your version this only works when the database is in OFFLINE mode.

And this is a known bug ([PTS entry|http://maxdb.sap.com/webpts?wptsdetail=yes&ErrorType=0&ErrorID=1170776])!

As the recovery wizard also issues a db_restartinfo it fails as well.

In order to avoid this situation you'll have to use dbmcli and just perform the 'recover_start' command.

Conclusio: you did everything right (except having a very simple database setup but just ran into a bug that makes the handling a bit less comfortable.

regards,

Lars

Former Member
0 Kudos

Dear Lars,

As always, thanks for your response!

From your explanations, does it means that the error was caused by the missing 1st datafile which is DISKD0001?

So, if I were to renamed DISKD0002 into something else instead, that error wouldn't happened because DISKD0001still intact?

lbreddemann
Active Contributor
0 Kudos

> From your explanations, does it means that the error was caused by the missing 1st datafile which is DISKD0001?

>

> So, if I were to renamed DISKD0002 into something else instead, that error wouldn't happened because DISKD0001still intact?

Yep, that does it mean!

Just try it out with any other file missing except the first datavolume and you shouldn't run into this.

Good luck and have fun!

Lars

Former Member
0 Kudos

Thanks Lars!

I just tried it as we speak and it works like charm!

Is there any other disaster recovery which I can simulate, such as failure that cause a table to be missing?

You see, I'm trying to simulate all kind of recoveries to any type of data loss.

Our current maxdb in the production server has a history of only able to recover data back to 1 day, which means we always lost one day of data during recovery.

Oh, and btw, I love your blogs as they are pretty easy to understand and insightful!

And I liked your sense of humors too!

Thanks for helping!

lbreddemann
Active Contributor
0 Kudos

Thanks for the flowers

Ok, what other errors should you try out?

"Deletion of one or more tables or parts of them"

is a very "popular" scenario in documentations and feature descriptions - just because the ability to rollforward during recovery easily allows to stop just before the bad thing happened.

However, as the database is not one table and another table and another table but a set of tables that combined tell us something about the data stored in the database, there is no way to just restore&recover a single table.

In fact the inderdependencies between tables and the data stored in them can't be managed out by the DBMS automatically and even less if the transaction consistency features (the stuff that makes recovery possible) is implemented on storage level, like it is with all commercial DBMS today.

Beyond that, point in time recovery makes your SAP database inconsistent with the outside world.

All the other attached systems will not be aware about the system beeing "back in time".

So to handle such a scenario usually an approach like the following is chosen:

- restore and recover the database into a separate instance via until just before the data was lost.

- identify the tables/rows that are missing in the productive database

- export this data from the recovered instance via loadercli

- imort the data to the productive instance to a staging area, that is:rename the target table!

- merge the two data sets and clean up all the logical inconsistencies on application level.

- take a full backup of the database again.

This approach usually takes a lot of application knowlegde and practise with the database tools.

So, decide yourself if you want to take the training effort for this (you'll likely won't need it the next 5 years...).

"Defect of other parts of the instance".

Go and delete rename arbitrarily other files from the database instance (e.g. the paramete file) or the software installation.

Check what error message you get and how you can analyze the issue (e.g. using sdbverify, sdbregview ... ).

Try to figure out "where to look and what to look for" - this will usually help a great deal when you're faced with a production down situation.

Hmm... what else???

Cheers,

Lars

Edited by: Lars Breddemann on Apr 1, 2010 10:04 AM

lbreddemann
Active Contributor
0 Kudos
"Restore a backup done with a parallel medium via dbmcli"

This can be a bit counter intuitive as the backup is usually not finished once you get the first return code (-8020).

Also interesting would be to take such a parallel backup and restore it serially (not that difficult but you need to get used to it).

That would be the things that pop into my mind right now.

--> Added as other reply got too long...

Former Member
0 Kudos

Thanks Lars for your lengthy explanations which will take some time for me to "digest" through...

At the same time, just wonder what will happened if a system failure actually cause a log file to go missing?

I've tried to restore the file like what I did for data file but it's not working.

I've to use recovery with initialization to make it working again.

And if my understanding is correct, 1 missing log file will actually cos the rest of the log file non-restorable cause you need all log files for restoration, is this correct?

So, recovery with initialization is the only option?

lbreddemann
Active Contributor
0 Kudos

> At the same time, just wonder what will happened if a system failure actually cause a log file to go missing?

> I've tried to restore the file like what I did for data file but it's not working.

Right - the log volumes cannot be restored.

Even when we recover log backups we don't write into the log area but read the transactional information right out of the backups.

> I've to use recovery with initialization to make it working again.

Yep, when the log area is lost, this is the way to go.

Since the log area is not requied to start the database (but of course to roll forward to include all the latest transactions!) it would even be possible to just "clear" the log area and start the database.

However, in order to avoid data loss as much as possible, a full recovery including all available log backups is required here.

> And if my understanding is correct, 1 missing log file will actually cos the rest of the log file non-restorable cause you need all log files for restoration, is this correct?

Well - you don't recover log files!

Either you recover from log backup files - for that the log area is not required.

Or the database recovers from the log area.

In that case ALL files of the log area have to be present as they are written to as if they are just one file.

> So, recovery with initialization is the only option?

Yep - this would be the option to take here.

regards,

Lars

Former Member
0 Kudos

Thanks Lars, to make sure that I understand what you are trying to explain to me, let me give you a scenario,

Monday - I performed a complete backup. I also have the automatic log backup activated.

Tuesday - I performed incremental backup.

Wednesday - System failure.

Under log folder, I have the following log files,

DISKL0001

DISKL0002

DISKL0003

DISKL0004

DISKL0005

But after the system failure, DISKL0003 is missing / corrupted.

So, I have to performed recovery with initialization to bring the database back to online state.

Now, will my database state is the same state as Monday, or Tuesday?

lbreddemann
Active Contributor
0 Kudos

>

> Thanks Lars, to make sure that I understand what you are trying to explain to me, let me give you a scenario,

>

> Monday - I performed a complete backup. I also have the automatic log backup activated.

>

> Tuesday - I performed incremental backup.

>

> Wednesday - System failure.

> Under log folder, I have the following log files,

> DISKL0001

> DISKL0002

> DISKL0003

> DISKL0004

> DISKL0005

>

> But after the system failure, DISKL0003 is missing / corrupted.

>

> So, I have to performed recovery with initialization to bring the database back to online state.

>

> Now, will my database state is the same state as Monday, or Tuesday?

Actually: it depends

Ok, let's think about this.

If the log volume is corrupted, then the best possible result in terms of recovery will be:

state of last written savepoint

+

all log available in your log backups

if just the log volume is gone, you can and should perform a data backup(!) in ADMIN state before even considering a restore.

With that you will save the last savepoint version of the database, which is usually max. 10 minutes aways from the crash moment (that is, when log writing is active and when there actually had been some log entries...).

Now, if during these 10 minutes log backups happened you can now perform the recover with initialization and restore the just taken backup plus the log backups.

No need to resort to any of your backups from monday or tuesday in this case.

Starting with 7.7 it would also be possible to just rebuild the log area files and thereby skipping the restore of the just taken backup.

Concerning your log volume setup: although the log volumes have a number it does not mean that they are somehow used in order. The log volumes are really always treated as if they form a big stripe set for one volume.

What has to be available in sequence are the log backups!

regards,

Lars

Former Member
0 Kudos

Actually: it depends

Ok, let's think about this.

If the log volume is corrupted, then the best possible result in terms of recovery will be:

state of last written savepoint

+

all log available in your log backups

Do you mean that so long the log backups were fine, it doesn't matter what happened to those log files / log volumes?

if just the log volume is gone, you can and should perform a data backup(!) in ADMIN state before even considering a restore.

A complete data backup or just incremental would suffice?

With that you will save the last savepoint version of the database, which is usually max. 10 minutes aways from the crash moment (that is, when log writing is active and when there actually had been some log entries...).

Now, if during these 10 minutes log backups happened you can now perform the recover with initialization and restore the just taken backup plus the log backups.

No need to resort to any of your backups from monday or tuesday in this case.

I understood the points here.

Starting with 7.7 it would also be possible to just rebuild the log area files and thereby skipping the restore of the just taken backup.

I also understood the point here but just to confirm, the "...rebuild the log area files..." happens automatically right?

If I'm not wrong, with the new version, if there is a system failure, I can just restart the DB without having to do any backup / restore, is this what you are pointing to?

Concerning your log volume setup: although the log volumes have a number it does not mean that they are somehow used in order. The log volumes are really always treated as if they form a big stripe set for one volume.

What has to be available in sequence are the log backups!

Understood!

lbreddemann
Active Contributor
0 Kudos

>

> If the log volume is corrupted, then the best possible result in terms of recovery will be:

> state of last written savepoint

> +

> all log available in your log backups

>

> Do you mean that so long the log backups were fine, it doesn't matter what happened to those log files / log volumes?

No exactly.

When the log backups are taken there's typically more log generated afterwards.

This information is of course not in the log backups.

So, in case you loose the log area, you loose these information.

But: usually every 10 minutes MaxDB writes a savepoint to the data area.

This savepoint is the basis from which the recovery is started off.

And it is transactional consistent.

That means: you can either just startup and open the database with it or recover the transactional data from the log backups.

Basically, what I wanted to point out is: when you loose your log area you loose the information generated since the last savepoint or the last log backup - whatever was the last event before the log area desintegrated.

>

> if just the log volume is gone, you can and should perform a data backup(!) in ADMIN state before even considering a restore.

>

>

> A complete data backup or just incremental would suffice?

You'll need a complete data backup here - or you'd have to restore an older complete data backup PLUS the incremental backup. Usually this is the option that takes longer during restore.

>

> Starting with 7.7 it would also be possible to just rebuild the log area files and thereby skipping the restore of the just taken backup.

>

>

> I also understood the point here but just to confirm, the "...rebuild the log area files..." happens automatically right?

Nope - not automatically during restart.

One would have to use the commands for clearing the log area.

I'm sure you'll find them here in SDN, but I don't like to repeat them here.

Unfortunately these undocumented and unsupported commands are used far too often because they remove the symptoms of problems.

The problems themselves remain - so this is really something for which you should ask support for assistance.

regards,

Lars

Former Member
0 Kudos

Good Day Lars,

Just a question on something which I don't really understand during my testing,

This is my testing so far..

1) Create table and added few dummy rows of data.

2) Performed COMPLETE DATA BACKUP.

3) Amend the data file (DISKD0001) and the database failed.

4) Performed recovery with initialization.

5) Everything restored.

6) Add in new rows of dummy data into the same table.

After step 6, I wanted to perform backup again but it said i need to perform a complete backup (again?) because no backup item found in history??

But I checked and clearly my complete backup item is still listed there.

Why can't I perform incremental or log backup?

Edited by: Derichong on Apr 12, 2010 11:28 AM

lbreddemann
Active Contributor
0 Kudos

> This is my testing so far..

>

> 1) Create table and added few dummy rows of data.

> 2) Performed COMPLETE DATA BACKUP.

> 3) Amend the data file (DISKD0001) and the database failed.

> 4) Performed recovery with initialization.

> 5) Everything restored.

> 6) Add in new rows of dummy data into the same table.

>

> After step 6, I wanted to perform backup again but it said i need to perform a complete backup (again?) because no backup item found in history??

>

> But I checked and clearly my complete backup item is still listed there.

>

> Why can't I perform incremental or log backup?

Good question.

The log area had been cleared - so a new backup history needs to be started.

And since the backup you used to restore the database could have come from any other database MaxDB requires to start a completely new backup.

That way there's a defined starting backup for the log history.

regards,

Lars

Former Member
0 Kudos

Thanks for your helpful explanations, Lars!

By the way, do you happen to know if this backup command is valid?

dbmcli -d MAXDB -u username,password -i backup.dbmcli

I did some research but I can't seems to find what's the function of the underline commands.

From what I understand, for starting backup, the command is backup_start but it's missing from this command.

lbreddemann
Active Contributor
0 Kudos

>

> Thanks for your helpful explanations, Lars!

You're welcome.

> By the way, do you happen to know if this backup command is valid?

>

> dbmcli -d MAXDB -u username,password -i backup.dbmcli

>

> I did some research but I can't seems to find what's the function of the underline commands.

>

> From what I understand, for starting backup, the command is backup_start but it's missing from this command.

The -i parameter allows to specify an input file so that one can create scripts for dbmcli.

Just look into "backup.dbmcli", it must be a text file with a bunch of dbmcli commands.

Another thing worth to mention is: this dbmcli call shows a common but serious bug in the the way to call dbmcli.

Database, username and password are specified in plain text!

If you schedule this command line, say via AT on Windows or CRON on UNIX/Linux systems, then everybody will be able to know the DBM-logon data.

Not exactly 'secure'.

Way better would be to setup and use XUSER entries (check the documentation yourself for a detailed explanation!).

These are client-side stored, encrypted logon data, that include the server node name, the database name and of course username and password.

So a secure call for dbmcli could look like this (let's assume our XUSER entry was named 'BACKOPER':


dbmcli  -U BACKOPER -i backup.dbmcli

As this thread has now come a long way, I propose you close it and open a new one if you find new topics with MaxDB you like to get advise on.

regards,

Lars

Former Member
0 Kudos

GREAT ADVICE LARS!

I will now close this thread!

Really appreciates your promptly assistance!

Answers (0)