on 08-25-2011 10:44 PM
While using a script to "feed" a shadow database with logs, we quite often see the following: recover_cancel is issued and dbmcli returns 0, but if another log is available and db_admin is executed right away, it fails with
>db_admin
ERR
-24783,ERR_WRONGDBSTATE: Operational state UNKNOWN
of the database instance is unsuitable.
-24779,ERR_DBSTATENEEDED4: Database instance must be
in one of the operational states OFFLINE, ADMIN, STANDBY or ONLINE.
The script we use looks like this:
db_admin
db_connect
recover_start logsich log 14327
recover_replace logsich /archivelog/log 14328
recover_replace logsich /archivelog/log 14329
recover_replace logsich /archivelog/log 14330
recover_replace logsich /archivelog/log 14331
recover_replace logsich /archivelog/log 14332
recover_replace logsich /archivelog/log 14333
recover_replace logsich /archivelog/log 14334
recover_replace logsich /archivelog/log 14335
recover_replace logsich /archivelog/log 14336
recover_replace logsich /archivelog/log 14337
recover_cancel
If the script is started a minute later, it works flawlessly.
This behaviour creates quite a lot of false alarms (happens up to 10 times a day).
I know I could simply add a sleep command; I just want to know why recover_cancel returns 0 although the database engine has not yet released the shared memory and is not really offline.
Markus
Hi Markus,
this is only to inform you that I did not forget the problem - we still check for a solution and need some help from a colleague who is not available at the moment. I will get back to you as soon as I have an answer/solution.
Regards,
Steffen
Hi Markus,
I can confirm this behavior - I observed it myself. But I cannot yet say why this is the case. I have to get a bit deeper into it and will get back to you as soon as I have the answer.
Best regards,
Steffen
Hi Steffen,
> I can confirm this behavior - I observed it myself.
good to hear that - so it's not something that is just happening on our machine
> But I cannot yet say why this is the case. I have to get a bit deeper into it and will get back to you as soon as I have the answer.
Thank you! I appreciate it!
Markus
Hi Markus,
the investigation is ongoing, and a question came up that I would like to forward to you:
have you ever seen behavior other than what you describe, that is, did the database respond differently to recover_cancel in earlier versions? Or did you 'just stumble' over those annoying errors?
At the moment it looks like it has never been different on Unix from how it is right now.
Thx in advance,
Steffen
> the investigation is ongoing, and a question came up that I would like to forward to you:
> have you ever seen behavior other than what you describe, that is, did the database respond differently to recover_cancel in earlier versions? Or did you 'just stumble' over those annoying errors?
> At the moment it looks like it has never been different on Unix from how it is right now.
We haven't been running the shadow database for very long, but for as long as we have, we "sometimes" see this behaviour. Interestingly, it seems to occur mainly on Solaris; on Linux x64 it is very rare, although it occurs there too.
I don't know how the shutdown is done technically, but I could imagine that the POSIX shared memory is released asynchronously on Solaris: the function returns although the memory has not yet been released completely.
I could check this if I knew how the shutdown mechanism works nowadays. I will see if I can find the "old" 7.5 sources that were under the GPL and check how it was done at that time.
Markus
Hi Markus,
the behavior you observe is intended and can be described as follows. The kernel receives the recover_cancel command and cleans up all resources (everything that has to be done to stop the recovery). The successful end of the recovery is then signaled to the client, and the kernel begins shutting down.
And here is the catch: during shutdown the kernel waits for an answer from the client, although there will actually be none because the client 'knows' that the kernel will shut down. The kernel waits anyway to make sure that the communication segment used for exchanging messages with the client is no longer in use and can be released. This wait is limited to 90 s. After that time the kernel releases the communication segment and finally shuts down.
I was able to reproduce this on Linux but not on Windows. On Windows the communication between kernel and client is slightly different, and the communication segment is released immediately when recover_cancel finishes, without waiting for an answer from the client.
In any case, the protocol does not guarantee that the kernel is down at the moment recover_cancel returns, and you shouldn't rely on it. Even if the kernel did not wait for the client's answer, shutting down may take a moment.
With 7.9 the behavior on Unix is changed and the kernel is down almost immediately when the command returns. Until that version is released you should indeed wait those 1.5 min before restarting the kernel.
Regards,
Steffen
Hi Steffen,
> And here it comes: the kernel waits during this for an answer of the client although there will actually be none because the client 'knows' that the kernel will shut down. Anyway the kernel waits to make sure the communication segment used for message exchanging with the client is not used anymore and can be released. This waiting is limited to 90s. After that time the kernel releases the communication segment and finally shuts down.
(...)
> Anyway it is not assured by the protocol that the kernel is down at the moment the recover_cancel returns and you shouldn't rely on it.
Is this behaviour specific to recover_cancel, or is it the same when using db_offline?
(...)
> With 7.9 the behavior on Unix is changed and the kernel is down almost immediately when the command returns. Until that version is released you should indeed wait those 1.5 min before restarting the kernel.
Would it be possible to use dbmcli -s instead? I guess I'd better put a 'pause' into the script then.
Thank you very much for your time and the analysis.
Markus
> I was too fast and even wrong: the db_offline gets only back after the status of the database is checked and really off line. Sorry for this misinformation.
No problem...
I'm still not quite convinced why recover_cancel and db_offline seem to be different, but I put in a wait.
Thank you very much for the analysis, though!
Markus
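For readers who want something more targeted than a fixed sleep, the wait discussed in this thread could be sketched as a polling loop that asks for the database state until it reports OFFLINE or a timeout expires. This is only a sketch under assumptions that are not from the thread: that `dbmcli -d <db> -u <user>,<password> db_state` is how the state is queried in your setup, and that its output starts with `OK` and ends with the state name. The function names (`query_state`, `parse_state`, `wait_for_offline`) are hypothetical. The default timeout of 120 s is chosen to sit above the 90 s limit mentioned above.

```python
import subprocess
import time


def query_state(db, user, password):
    """Return the raw output of a db_state call.

    Assumption: this dbmcli invocation matches your installation;
    adjust the arguments if your setup differs.
    """
    result = subprocess.run(
        ["dbmcli", "-d", db, "-u", f"{user},{password}", "db_state"],
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_state(output):
    """Extract the state name from db_state output.

    Assumption: a successful reply has "OK" as its first non-empty line
    and the state name as its last non-empty line. Anything else
    (for example an ERR reply) yields None.
    """
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines or lines[0] != "OK":
        return None
    return lines[-1]


def wait_for_offline(get_output, timeout=120.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the state is OFFLINE; return False if the timeout expires."""
    deadline = clock() + timeout
    while True:
        if parse_state(get_output()) == "OFFLINE":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In a feed script this might be called between recover_cancel and the next db_admin, for example `wait_for_offline(lambda: query_state("MYDB", "control", "secret"))`, where the database name and credentials are placeholders. An ERR reply such as the "Operational state UNKNOWN" message is simply treated as "not offline yet" and polled again.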