recover_cancel returns 0 although DB not yet shut down

markus_doehr2
Active Contributor
0 Kudos

While using a script to "feed" a shadow database with logs, we quite often see that recover_cancel is issued and dbmcli returns 0, but when another log becomes available and db_admin is executed, it fails with

>db_admin
ERR
-24783,ERR_WRONGDBSTATE: Operational state UNKNOWN
 of the database instance is unsuitable.
-24779,ERR_DBSTATENEEDED4: Database instance must be
 in one of the operational states OFFLINE, ADMIN, STANDBY or ONLINE.

The script we use looks like

db_admin
db_connect
recover_start logsich log 14327
recover_replace logsich /archivelog/log 14328
recover_replace logsich /archivelog/log 14329
recover_replace logsich /archivelog/log 14330
recover_replace logsich /archivelog/log 14331
recover_replace logsich /archivelog/log 14332
recover_replace logsich /archivelog/log 14333
recover_replace logsich /archivelog/log 14334
recover_replace logsich /archivelog/log 14335
recover_replace logsich /archivelog/log 14336
recover_replace logsich /archivelog/log 14337
recover_cancel

If the script is started a minute later, it works flawlessly.

This behaviour creates quite a lot of false alarms (happens up to 10 times a day).
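To keep these cases out of the alarm channel, a monitoring wrapper could tell the "state UNKNOWN" reply apart from real failures. The following is only a minimal sketch: it assumes the dbmcli reply has the same shape as the output pasted above (lines of the form `-NNNNN,ERR_NAME: text`), and the helper names `parse_dbm_errors` and `is_transient_state_error` are made up for illustration.

```python
import re

# Error codes treated as "kernel is still shutting down, retry later".
# -24783 is the ERR_WRONGDBSTATE code from the reply shown above.
TRANSIENT_CODES = {-24783}


def parse_dbm_errors(reply):
    """Return (code, name) pairs for every '-NNNNN,ERR_NAME:' line in a reply."""
    matches = re.findall(r"^(-\d+),(\w+):", reply, flags=re.MULTILINE)
    return [(int(code), name) for code, name in matches]


def is_transient_state_error(reply):
    """True if the first reported error is one of the known transient codes."""
    errors = parse_dbm_errors(reply)
    return bool(errors) and errors[0][0] in TRANSIENT_CODES
```

A wrapper would then retry db_admin after a pause when `is_transient_state_error` is true, and raise an alarm only for any other error.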

I know I could simply add a sleep command; I just want to know why recover_cancel returns 0 even though the database engine has not yet released its shared memory and is not really offline.

Markus

Accepted Solutions (0)

Answers (2)

steffen_schildberg
Active Participant
0 Kudos

Hi Markus,

this is only to let you know that I have not forgotten the problem - we are still looking for a solution and need some help from a colleague who is not available at the moment. I will get back to you as soon as I have an answer/solution.

Regards,

Steffen

steffen_schildberg
Active Participant
0 Kudos

Hi Markus,

I can confirm this behavior - I observed it myself. But I cannot yet say why this is the case. I have to get a bit deeper into it and will get back to you as soon as I have the answer.

Best regards,

Steffen

markus_doehr2
Active Contributor
0 Kudos

Hi Steffen,

> I can confirm this behavior - I observed it myself.

good to hear that - so it's not something that is just happening on our machine

> But I cannot yet say why this is the case. I have to get a bit deeper into it and will get back to you as soon as I have the answer.

Thank you! I appreciate!

Markus

steffen_schildberg
Active Participant
0 Kudos

Hi Markus,

only to keep you informed - we are looking into the problem, but it will take some time, as it is not as trivial as I first thought. I hope you still have a little patience.

Regards,

Steffen

markus_doehr2
Active Contributor
0 Kudos

Hi Steffen,

thank you for looking into this!

Markus

steffen_schildberg
Active Participant
0 Kudos

Hi Markus,

am I right that you are doing this on Linux? Have you seen anything similar on Windows?

Regards,

Steffen

markus_doehr2
Active Contributor
0 Kudos

Hi Steffen,

I see this on Solaris x86 and Linux (both x86_64 and IA64). No Windows here to test...

Markus

steffen_schildberg
Active Participant
0 Kudos

Yeah, I was "afraid" to hear/see this. I tested on Windows (and still do) and on Linux and saw it on Linux only. It seems to be a Unix/Linux-specific problem. I will keep you informed.

Best,

Steffen

steffen_schildberg
Active Participant
0 Kudos

Hi Markus,

the investigation is under way, and a question came up that I would like to forward to you:

have you ever seen behavior other than what you describe, meaning: did the database return differently on recover_cancel in earlier versions? Or did you 'just stumble' over those annoying errors?

At the moment it looks like it has never been different on Unix than it is right now.

Thx in advance,

Steffen

markus_doehr2
Active Contributor
0 Kudos

> the investigation is under way, and a question came up that I would like to forward to you:

> have you ever seen behavior other than what you describe, meaning: did the database return differently on recover_cancel in earlier versions? Or did you 'just stumble' over those annoying errors?

> At the moment it looks like it has never been different on Unix than it is right now.

We haven't been running the shadow database for very long, but for as long as we have run it, we have "sometimes" seen this behaviour. Interestingly, it seems to occur mainly on Solaris; on Linux x64 it is very rare, although it occurs there too.

I don't know how the shutdown is technically done, but I could imagine that the POSIX shared memory is released asynchronously on Solaris: the function returns although the memory has not yet been released completely.

I could check this if I knew how the shutdown mechanism works nowadays. I will see if I find the "old" 7.5 sources that were under GPL and check, how it was done at that time.

Markus

steffen_schildberg
Active Participant
0 Kudos

Hi Markus,

the behavior you observe is intended and can be described as follows. The kernel receives the recover_cancel command and cleans up all resources (everything that has to be done to stop the recovery). The successful end of the recovery is then signaled to the client, and the kernel begins shutting down.

And here it comes: during shutdown the kernel waits for an answer from the client, although there will never be one, because the client 'knows' that the kernel is going to shut down. The kernel waits anyway, to make sure that the communication segment used for exchanging messages with the client is no longer in use and can be released. This wait is limited to 90 s. After that time the kernel releases the communication segment and finally shuts down.

I was able to reproduce this on Linux but not on Windows. On Windows the communication between kernel and client is slightly different: the communication segment is released immediately when recover_cancel finishes, without waiting for an answer from the client.

In any case, the protocol does not guarantee that the kernel is down at the moment recover_cancel returns, and you shouldn't rely on it. Even if the kernel did not wait for a client's answer, shutting down may take a moment.

With 7.9 the behavior on Unix changes, and the kernel is down almost immediately when the command returns. Until that release is available, you should indeed wait these 1.5 min before restarting the kernel.
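Instead of a fixed pause, a feeding script could poll the database state and continue as soon as the kernel is really down. This is only a sketch under assumptions, not an official recipe: it assumes that `dbmcli ... db_state` answers with `OK`, a `State` header line and the state name on the last line, and that the state is reported as `OFFLINE` once the communication segment has been released. The names `wait_until_offline` and `dbmcli_state` are hypothetical, and the default timeout of 120 s is simply the 90 s mentioned above plus some margin.

```python
import subprocess
import time


def wait_until_offline(get_state, timeout=120.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns 'OFFLINE' or the timeout expires.

    Returns True if OFFLINE was seen, False if the timeout ran out first.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == "OFFLINE":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def dbmcli_state(db, user, password):
    """Ask dbmcli for the state (ASSUMPTION: reply is 'OK', 'State', '<state>')."""
    result = subprocess.run(
        ["dbmcli", "-d", db, "-u", f"{user},{password}", "db_state"],
        capture_output=True, text=True,
    )
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    # Anything that is not a clean 'OK' reply is treated as "not offline yet".
    return lines[-1] if lines and lines[0] == "OK" else "UNKNOWN"
```

The script would call something like `wait_until_offline(lambda: dbmcli_state(db, user, password))` after recover_cancel and before the next db_admin; passing the state check in as a function keeps the polling logic testable without a database.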

Regards,

Steffen

markus_doehr2
Active Contributor
0 Kudos

Hi Steffen,

> And here it comes: during shutdown the kernel waits for an answer from the client, although there will never be one, because the client 'knows' that the kernel is going to shut down. The kernel waits anyway, to make sure that the communication segment used for exchanging messages with the client is no longer in use and can be released. This wait is limited to 90 s. After that time the kernel releases the communication segment and finally shuts down.

(...)

> In any case, the protocol does not guarantee that the kernel is down at the moment recover_cancel returns, and you shouldn't rely on it.

Does this apply only to recover_cancel, or is it the same with db_offline?

(...)

> With 7.9 the behavior on Unix changes, and the kernel is down almost immediately when the command returns. Until that release is available, you should indeed wait these 1.5 min before restarting the kernel.

Would it be possible to use dbmcli -s instead? I guess I'd better put a 'pause' into the script then.

Thank you very much for your time and the analysis.

Markus

steffen_schildberg
Active Participant
0 Kudos

Hi Markus,

yep, same applies to db_offline.

Calling dbmcli with option -s wouldn't change a thing - I'd go for the 'pause'.

Best,

Steffen

markus_doehr2
Active Contributor
0 Kudos

> yep, same applies to db_offline.

OK... but what happens if I shut down a system and rely on the return code of db_offline? Could it be that I shut down the OS before the database is completely shut down?

Markus

steffen_schildberg
Active Participant
0 Kudos

Hi Markus,

I was too fast and, in fact, wrong: db_offline returns only after the state of the database has been checked and it is really offline. Sorry for the misinformation.

Best,

Steffen

markus_doehr2
Active Contributor
0 Kudos

> I was too fast and, in fact, wrong: db_offline returns only after the state of the database has been checked and it is really offline. Sorry for the misinformation.

No problem...

I'm still not quite convinced why recover_cancel and db_offline should behave differently, but I have put in a wait.

Thank you very much for the analysis, though!

Markus