
Oracle 11.2.0.3: ORA-00445: background process "W000" did not start after 120 seconds

benoit-schmid
Contributor
0 Kudos

Hello,

I regularly get the following error with a recently installed Oracle 11.2.0.3.

---

Process W000 died, see its trace file

Sun Apr 07 00:48:23 2013

***********************************************************************

Fatal NI connect error 12537, connecting to:

(LOCAL=NO)

  VERSION INFORMATION:

        TNS for Solaris: Version 11.2.0.3.0 - Production

        Oracle Bequeath NT Protocol Adapter for Solaris: Version 11.2.0.3.0 - Production

        TCP/IP NT Protocol Adapter for Solaris: Version 11.2.0.3.0 - Production

  Time: 07-APR-2013 00:48:23

  Tracing not turned on.

  Tns error struct:

    ns main err code: 12537

   

TNS-12537: TNS:connection closed

    ns secondary err code: 12560

    nt main err code: 0

    nt secondary err code: 0

    nt OS err code: 0

Sun Apr 07 00:49:29 2013

opiodr aborting process unknown ospid (17179) as a result of ORA-609

Sun Apr 07 01:01:14 2013

Errors in file /oracle/xxx/saptrace/diag/rdbms/xxx/xxx/trace/xxx_smco_7937.trc  (incident=85849):

ORA-00445: background process "W000" did not start after 120 seconds

Sun Apr 07 01:03:30 2013

Incident details in: /oracle/xxx/saptrace/diag/rdbms/xxx/xxx/incident/incdir_85849/xxx_smco_7937_i85849.trc

Sun Apr 07 01:17:37 2013

Process 0x480433ce8 appears to be hung while dumping

Current time = 938816049, process death time = 938749879 interval = 60000

Attempting to kill process 0x480433ce8 with OS pid = 7937

OSD kill succeeded for process 480433ce8

Sun Apr 07 01:18:22 2013

Restarting dead background process SMCO

Sun Apr 07 01:18:38 2013

SMCO started with pid=30, OS id=26990

---

Do you know how to solve this problem?

Thanks in advance for your answer.

Accepted Solutions (1)

benoit-schmid
Contributor
0 Kudos

Hello,

The problem was related to blocking write issues at the Solaris level.

After opening an Oracle Solaris SR, we changed the following kernel parameters (a sketch of how such settings are applied follows the list):

  • zfs:zfs_arc_max must be lower than the free memory of the machine.
    This avoids hangs related to memory over-usage.
  • zfs:zfs_vdev_max_pending is set to 128.
    This increases the I/O throughput.
  • zfs:zfs_write_limit_min is set to 2516582400.
    This prevents I/O writes from starving, because I/O reads have a higher priority.
  • primarycache is set to metadata for the /oracle/SID/sapdata* and /oracle/SID/oraarch filesystems.
    This avoids waste and over-usage of the ZFS cache.
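For reference, a minimal sketch of how these settings are typically applied on Solaris. The zfs_arc_max value and the dataset names below are only illustrative assumptions and must be adapted to your own host and pool layout.

/etc/system entries (take effect at the next boot):

---

set zfs:zfs_arc_max = 0x600000000
set zfs:zfs_vdev_max_pending = 128
set zfs:zfs_write_limit_min = 2516582400

---

primarycache on the data and archive-log datasets (dataset names assumed):

---

zfs set primarycache=metadata datapool/oracle/SID/sapdata1
zfs set primarycache=metadata datapool/oracle/SID/oraarch

---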

Regards,

Answers (6)

achmad_dimyati5
Participant
0 Kudos

Hi Benoit,


Please check how much memory is being used by the ZFS file system. Please run this command:

mdb -k (details in http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html).

Your memory could be occupied by the ZFS file system cache.

Regards,

Achmad

benoit-schmid
Contributor
0 Kudos

Hello,

Achmad Dimyati wrote:

Your memory could be occupied by ZFS filesystem cache.

Memory consumption looks low:

mdb -k

Loading modules: [ unix genunix specfs dtrace zfs scsi_vhci sd mpt px crypto mac ip hook neti arp kssl qlc fctl sockfs random emlxs fcp ssd idm cpc fcip logindmux ptm sppp ufs lofs nfs ipc mpt_sas pmcs ]

> ::memstat

Page Summary                Pages                MB  %Tot

------------     ----------------  ----------------  ----

Kernel                    1080217              8439   13%

ZFS File Data              871387              6807   11%

Anon                      1462200             11423   18%

Exec and libs               44207               345    1%

Page cache                  70407               550    1%

Free (cachelist)            50076               391    1%

Free (freelist)           4656721             36380   57%

Total                     8235215             64337

Physical                  8213541             64168

> $q

Regards,

achmad_dimyati5
Participant
0 Kudos

Hi Benoit,

I had the same issue, but in a different environment (AIX and NAS). It turned out to be a problem with the cache configuration in the NAS. So I think the issue is related to the OS and ZFS configuration.

Could you review this SCN thread and these notes?

http://scn.sap.com/thread/752504

notes:

Note 1343733 - Support for SAP on ZFS

Note 832871 - SAP relevant patches for Oracle Solaris 10 on SPARC

Is this the only system with ZFS and Oracle? if not do you have the same issue on another system?

Regards,

Achmad

benoit-schmid
Contributor
0 Kudos

Hello Achmad,

Achmad Dimyati wrote:

Is this the only system with ZFS and Oracle? if not do you have the same issue on another system?

No, all our SAP systems are on ZFS.

I do not have the issue on the other systems, which run Oracle 11.2.0.2 instead of 11.2.0.3 and Solaris 10 instead of Solaris 11.

Regards,

achmad_dimyati5
Participant
0 Kudos

Hi Benoit,

Is this a test environment? Since Oracle 11.2.0.2 on Solaris 10 is OK, you could try Oracle 11.2.0.3 on Solaris 10 just to eliminate one of the components and determine whether the issue lies with Oracle 11.2.0.3 or with the OS/ZFS.

In my case the issue was also very obvious during dbv, because of its heavy disk access.

I came across an article about a ZFS bug with L2ARC that causes a panic loop (http://simplex.swordsaint.net/?p=199). Perhaps it would be good to run the test in that article.

Best Regards,
Achmad

benoit-schmid
Contributor
0 Kudos

Achmad Dimyati wrote:

Is this a test environment? Since Oracle 11.2.0.2 on Solaris 10 is OK, you could try Oracle 11.2.0.3 on Solaris 10 just to eliminate one of the components and determine whether the issue lies with Oracle 11.2.0.3 or with the OS/ZFS.

In my case the issue was also very obvious during dbv, because of its heavy disk access.

This is not a test environment; it is our new SolMan 7.1.

It is going to be tough to test it on Solaris 10.

Why do you say that in your case the issue was very obvious?

Regards,

achmad_dimyati5
Participant
0 Kudos

Hi Benoit,

In my previous environment, we installed a fresh ECC under AIX with Oracle 10g (NAS storage). From the beginning we noticed slowness during the R3load phase of the installation and when accessing the system. I checked the CPU and memory; both were fine. iostat was normal; there was nothing special except that it was slow.

But when we ran dbv, the CPU immediately hit 100% with high I/O wait, and after a while (around 15-20 minutes) the database crashed. So I complained to the vendor.

I asked them to check the disk subsystem. They put an additional patch into the OS; dbv was still slow but no longer crashed. So they investigated further and found out that one of the built-in caches in the NAS storage had not been installed. After installing the cache, dbv took only 30 minutes to finish.

So, in my case dbv made the issue very obvious, but the symptoms were already there (slowness, I/O wait, high CPU).

Best Regards,

Achmad

former_member184473
Active Contributor
0 Kudos

Hello Benoit,

What value do you have in the environment variables and listener.ora for ORACLE_HOME?

Regards,

Eduardo Rezende

benoit-schmid
Contributor
0 Kudos

Eduardo Rezende wrote:

Hello Benoit,

What value do you have in the environment variables and listener.ora for ORACLE_HOME?

Regards,

Eduardo Rezende

1. /oracle/MYSID/112_64 in SID_LIST_LISTENER subsection in listener.ora

2. not defined in LISTENER subsection in listener.ora

3. /oracle/MYSID/112_64 in env of sidadm

What do you have in mind?

Thanks in advance for your answer.

former_member184473
Active Contributor
0 Kudos

We observed some issues with customers that upgraded from Oracle 11.2.0.2 to 11.2.0.3 and had different values in the environment and in listener.ora, even though the alias pointed to the correct location.

e.g.:

/oracle/MYSID/112_64 pointing to /oracle/MYSID/11203

env: /oracle/MYSID/11203

listener.ora: /oracle/MYSID/112_64
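A quick way to compare the two values is something like the following rough sketch (the <sid>adm user name and the listener.ora path are assumptions; yours may live under $TNS_ADMIN):

---

# check the <sid>adm environment, the listener.ora entry, and the alias/symlink
su - mysidadm -c 'echo $ORACLE_HOME'
grep -i ORACLE_HOME /oracle/MYSID/112_64/network/admin/listener.ora
ls -ld /oracle/MYSID/112_64

---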

Former Member
0 Kudos

How much CPU, RAM, and swap have you allocated?

benoit-schmid
Contributor
0 Kudos

Hello,

It does not seem to be a swap problem:

Memory: 64G phys mem, 34G free mem, 32G total swap, 32G free swap

Regards,

former_member206552
Active Contributor
0 Kudos

Hi Benoit

Have you tried the following solutions?

http://vivekbhutani.wordpress.com/2012/01/30/ora-609-and-tns-12547-in-11g-alert-log/

I have also attached the Oracle note suggested in the link provided by Deepak.

Best Regards

Marius

benoit-schmid
Contributor
0 Kudos

Hello Marius,

I am not running a Linux box, but a Solaris SPARC box.

Regards,

benoit-schmid
Contributor
0 Kudos

Hello,

The error occurs during online db verify.

Would you have any suggestions on how I could avoid this problem?

Thanks in advance for your answers.

Former Member
0 Kudos

Hello,

Did you ever resolve your issue?  I am having a similar problem.

Thanks,

Susan

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

An ORA-00445 combined with a TNS-12537 error can have four possible causes (and none of them is really Oracle database related):


  • Too much activity on your machine.
  • NFS latency issues.
  • Disk latency issue (that affects I/O).
  • Network latency.

I would consider point 1, 2 or 3, as you already mentioned that this issue only occurs during a dbverify run (which of course forces a lot of additional I/O).

You need to investigate this issue at OS level in more detail (while you run dbverify). However, you can extend the default timeout of 120 seconds to avoid the Oracle error/abort.

shell> sqlplus / as sysdba

SQL> alter system set events '10281 trace name context forever, level <X>';

---- <X> is the number of seconds to timeout

This event can be set dynamically.

Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Susan Eck wrote:

Hello,

Did you ever resolve your issue?  I am having a similar problem.

Thanks,

Susan

Hello,

Yes, I still have the issue when I run db verify.

SAP Support asked me to switch to RMAN for dbverify, but it does not help.

I am still waiting for another solution from them.

On what OS does your DB run?

Regards,

Former Member
0 Kudos

Hi Benoît,

Our OS is Solaris.  We are getting the errors when we run RMAN backups.  It just started about a week ago and we aren't able to tie it to anything that changed recently.

Thanks,

Susan

benoit-schmid
Contributor
0 Kudos

Hello Stefan,

Stefan Koehler wrote:

You need to investigate this issue at OS level in more detail (while you run dbverify). However, you can extend the default timeout of 120 seconds to avoid the Oracle error/abort.

At OS level, I see with iostat that the disk is 100% busy, but the asvc_t times are still correct.

Basically, both RMAN and non-RMAN dbverify use I/O.

But it has always been like that, and I can still work on the machine.

1. Why would W000 not be able to perform within 120 seconds?

2. Do you know what W000 is supposed to perform during those 120 seconds?

3. Are there any risks or side effects of increasing the following setting?

events '10281 trace name context forever, level 600'

4. Is it possible to restrict the I/O usage available to RMAN?

Thanks in advance for your answers.

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

First of all, this kind of issue can also occur with any other database process that is spawned by Oracle (like M000 or PZ99).

> 1. Why would W000 not be able to perform within 120 seconds?

It cannot be spawned within 120 seconds; that is the problem (not really the work that has to be performed).

> 2. Do you know what W000 is supposed to perform during those 120 seconds?

Oracle Documentation: http://docs.oracle.com/cd/E11882_01/server.112/e25513/bgprocesses.htm#REFRN104

Wnnn processes are slave processes dynamically spawned by SMCO to perform space management tasks in the background. These tasks include preallocating space into locally managed tablespace and SecureFiles segments based on space usage growth analysis, and reclaiming space from dropped segments. At most 10 Wnnn slaves can run on one database instance. After being started, the slave acts as an autonomous agent. After it finishes task execution, it automatically picks up another task from the queue. The process terminates itself after being idle for a long time.

You will see exactly this behavior (by SMCO and W000), if you check the posted errors closely.
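As a side note, the OS PIDs of these background processes can be looked up in v$process, for example like this (an untested sketch using the standard v$process columns):

SQL> select pname, spid from v$process where pname = 'SMCO' or pname like 'W00%';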

> 3. Are there any risks or side effects of increasing the following setting?

No. Set this event in addition to all the other events (which are recommended by SAP). It only provides more time to spawn the process.

> 4. Is it possible to restrict the I/O usage available to RMAN?

Yes, there is an RMAN option called RATE.

Oracle Documentation: http://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmtunin.htm#BRADV90070

This option (rman_rate) is implemented in BR*Tools as well, if you are not using native RMAN.

SAP Documentation: http://help.sap.com/saphelp_nwpi71/helpdata/DE/46/afaf3a26a60319e10000000a11466f/content.htm
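For illustration, the native RMAN syntax looks like this (the 20M limit is just an example value):

RMAN> CONFIGURE CHANNEL DEVICE TYPE DISK RATE 20M;

For BR*Tools, the corresponding init<SID>.sap parameter is rman_rate; see the SAP documentation above for its exact format.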

> At OS level, I see with iostat that the disk is 100% busy, but the asvc_t times are still correct. But it has always been like that, and I can still work on the machine.

OK, but I meant a deep investigation at OS level, not just running an I/O monitor. I was talking about truss and DTrace (luckily you are running on Solaris) on both processes.


Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Hello Stefan,

Stefan Koehler wrote:

OK, but I meant a deep investigation at OS level, not just running an I/O monitor. I was talking about truss and DTrace (luckily you are running on Solaris) on both processes.

I can reproduce the problem.

Which process should I run truss on to diagnose this problem?

Why do you say that I am lucky to have Solaris? 🙂

Thanks in advance for your answer.

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

> Which process should I run truss on to diagnose this problem?

Run truss on the PID of SMCO and, if possible, on the PID of W000 (it depends on where the spawned process gets stuck). Start truss before you reproduce the issue.

> Why do you say that I am lucky to have Solaris? 🙂

Because DTrace is available in its full scope on Solaris only. DTrace has been ported to OEL (Oracle Enterprise Linux) as well, but the functionality is/was not the same as on Solaris.

You can use DTrace to trace the application code (functions, call stack and so on) if you cannot find any root cause with truss.

Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Hello,

Could you please also provide the arguments for the truss/DTrace command, so that I can restrict it to the syscalls you are interested in?

Thanks in advance for your answer.

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

Well, you should trace without any filter at all, as you don't have a clue yet what the root cause might be.

Something like this:

shell> truss -a -E -f -o /tmp/truss.<PID>.out  -p <PID>

Regards

Stefan

Former Member
0 Kudos

Update:  I discovered that Solaris quarterly security patches had been installed recently.  The SA backed those out and we are no longer getting the errors during the RMAN backups.

When we were experiencing the problem, we noticed that the server became very sluggish while the backups were running.  Simple commands like ls, cd and ps could take 30 seconds or more.  Something related to the patches was causing contention and thus causing the TNS errors (12535, 12537), the ORA-00445, and other intermittent "timeout" errors that we were getting (ORA-00494 and ORA-12751).

We don't yet know why the patches caused the issue but at least we have pinpointed the source.

Good luck in resolving your issue!

benoit-schmid
Contributor
0 Kudos

Hello,

Susan Eck wrote:

Update:  I discovered that Solaris quarterly security patches had been installed recently.  The SA backed those out and we are no longer getting the errors during the RMAN backups.

Could you please provide the reference of the patch that caused the problem and the references of the ones that solved it?

Is it Solaris 10 or 11?

Thanks in advance for your answers.

benoit-schmid
Contributor
0 Kudos

Hello Stefan,

Stefan Koehler wrote:

Run truss on the PID of SMCO and, if possible, on the PID of W000 (it depends on where the spawned process gets stuck). Start truss before you reproduce the issue.

When I run rman dbv, SMCO and W000 are not started.

The truss shows:

SMCO:

---

27228:   0.0000 semtimedop(1526726746, 0xFFFFFFFF7FFFD284, 1, 0xFFFFFFFF7FFFD270) Err#11 EAGAIN

27228:   0.0000 times(0xFFFFFFFF7FFFD6E0)                       = 1279307926

27228:   0.0000 times(0xFFFFFFFF7FFFD6E0)                       = 1279307926

27228:  semtimedop(1526726746, 0xFFFFFFFF7FFFD284, 1, 0xFFFFFFFF7FFFD270) (sleeping...)

27228:   0.0000 semtimedop(1526726746, 0xFFFFFFFF7FFFD284, 1, 0xFFFFFFFF7FFFD270) Err#11 EAGAIN

27228:   0.0000 times(0xFFFFFFFF7FFFD6E0)                       = 1279308226

27228:   0.0000 times(0xFFFFFFFF7FFFD6E0)                       = 1279308226

27228:  semtimedop(1526726746, 0xFFFFFFFF7FFFD284, 1, 0xFFFFFFFF7FFFD270) (sleeping...)

---

W000

---

27230:  semtimedop(1526726746, 0xFFFFFFFF7FFFC814, 1, 0xFFFFFFFF7FFFC800) (sleeping...)

27230:   0.0000 semtimedop(1526726746, 0xFFFFFFFF7FFFC814, 1, 0xFFFFFFFF7FFFC800) Err#11 EAGAIN

27230:   0.0000 times(0xFFFFFFFF7FFFCD50)                       = 1279312003

27230:   0.0000 times(0xFFFFFFFF7FFFCD50)                       = 1279312003

27230:  semtimedop(1526726746, 0xFFFFFFFF7FFFC814, 1, 0xFFFFFFFF7FFFC800) (sleeping...)

---

Is it right to say that there seems to be a deadlock on an IPC semaphore operation?

Thanks in advance for your help

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

Thanks for the truss output.

There are several known OS (Solaris) bugs regarding semtimedop with "Err#11 EAGAIN". Unfortunately, we need the corresponding function calls (call stack) to locate the specific bug.

Please provide the function trace with DTrace or oradebug (I guess oradebug / short_stack should be enough in this case as well).
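For example, something along these lines should print the user-level call stack each time the traced process enters semtimedop (a rough sketch; adjust the PID):

shell> dtrace -n 'syscall::semtimedop:entry /pid == $target/ { ustack(); }' -p <PID>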

By the way, are you running Oracle on a ZFS file system?

Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Hello Stefan,

There are several known OS (Solaris) bugs regarding semtimedop with "Err#11 EAGAIN". Unfortunately, we need the corresponding function calls (call stack) to locate the specific bug.

Please provide the function trace with DTrace or oradebug (I guess oradebug / short_stack should be enough in this case as well).

By the way, are you running Oracle on a ZFS file system?

Yes, we are using ZFS.

BTW, I have checked: Err#11 is not an error; it occurs when semtimedop reaches its defined timeout value.

Could you please provide the exact DTrace options that I should use?

Could you please tell me what I should type in sqlplus to get the oradebug output?

Thanks in advance for your answers.

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

> Could you please tell me what I should type in sqlplus to get the oradebug output?

Please check out my blog (point "3. Using stack traces or system call / signal traces"): http://scn.sap.com/community/oracle/blog/2013/01/10/oracle-advanced-performance-troubleshooting-with...

Please attach to both processes (SMCO and W000) and perform several short_stack dumps while the process is hung.

> Yes, we are using ZFS.

What is the value of "zfs_arc_max"? Try 8 GB if it is set to 16 GB. This is just a "blind guess", as we do not have all the needed information.

Solaris Documentation: http://docs.oracle.com/cd/E19253-01/817-0404/gjhec/index.html
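To check the current ARC limit and usage, something like the following kstat call should work (statistic names as I recall them; please verify on your Solaris release):

shell> kstat -p zfs:0:arcstats:c_max zfs:0:arcstats:size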

Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Hello,

Stefan Koehler wrote:

What is the value of "zfs_arc_max"? Try 8 GB if it is set to 16 GB. This is just a "blind guess", as we do not have all the needed information.

zfs_arc_max is set to 25 GB on a host with 64 GB.

My Solaris colleagues fine-tune this by checking the memory used on the system, as described in http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

Why have you suggested 8 GB?

I will do the traces and post them.

Regards,

benoit-schmid
Contributor
0 Kudos

Hello again Stefan,

Here is the information you requested:

SQL> oradebug setospid 27917

Oracle pid: 28, Unix process pid: 27917, image: oracle@xxx (SMCO)

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ksarcv()+212<-ksbabs()+764<-ksbrdp()+1616<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ksarcv()+212<-ksbabs()+764<-ksbrdp()+1616<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ksarcv()+212<-ksbabs()+764<-ksbrdp()+1616<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ksarcv()+212<-ksbabs()+764<-ksbrdp()+1616<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ksarcv()+212<-ksbabs()+764<-ksbrdp()+1616<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug setospid  575

Oracle pid: 35, Unix process pid: 575, image: oracle@xxx (W000)

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ktsj_slave_main()+976<-ksvrdp()+1872<-opirip()+1972<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ktsj_slave_main()+976<-ksvrdp()+1872<-opirip()+1972<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ktsj_slave_main()+976<-ksvrdp()+1872<-opirip()+1972<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ktsj_slave_main()+976<-ksvrdp()+1872<-opirip()+1972<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

SQL> oradebug short_stack

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3524<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-sskgpwwait()+236<-ksliwat()+1752<-kslwait()+240<-ktsj_slave_main()+976<-ksvrdp()+1872<-opirip()+1972<-opidrv()+748<-sou2o()+88<-opimai_real()+276<-ssthrdmain()+316<-main()+316<-_start()+380

Regards,

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

because of there are several known issues (slow startup of database or processes) when setting this value too large. Unfortunately ZFS behaves differently in each environment (depending on ZFS pools, ZFS pool and I/O configuration, database configuration and so on).

I was involved in such ZFS performance evaluations for Oracle databases in the last time and it was a mess (just check out my latest blog about ZFS and async / sync I/O for Oracle databases).

Regards

Stefan

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

Thanks for the short stack traces. It really looks like the two processes are blocked / hung.

Can you please do a system state dump and attach this dump here (make sure that you do not limit the trace file size)?

  1. Recreate situation for SMCO / W000
  2. Execute system state dump with level 266 while the problem persists

Check my blog for details about creating a system state dump:

http://scn.sap.com/community/oracle/blog/2012/11/10/oracle-wait-event-library-cache-lock-by-design-o...

Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Hello,

Stefan Koehler wrote:

Can you please do a system state dump and attach this dump here (make sure that you do not limit the trace file size)?

  1. Recreate situation for SMCO / W000
  2. Execute system state dump with level 266 while the problem persists

I have performed the following:

SQL> oradebug setmypid 

Statement processed.

SQL> oradebug unlimit 

Statement processed.

SQL> host date

Wednesday, May 22, 2013 08:39:34 AM MEST

SQL> oradebug dump systemstate 266

Statement processed.

SQL> host date

Wednesday, May 22, 2013 08:40:00 AM MEST

Could you please provide the filename you are interested in?

-rw-r-----   1 orasid   dba     502K May 22 08:39 SID_ora_20464.trm
-rw-r-----   1 orasid   dba     2.0M May 22 08:39 SID_ora_20464.trc
-rw-r-----   1 orasid   dba      23K May 22 08:45 alert_SID.log
-rw-r-----   1 orasid   dba      173 May 22 08:46 SID_lgwr_19803.trm
-rw-r-----   1 orasid   dba     1.6K May 22 08:46 SID_lgwr_19803.trc
-rw-r-----   1 orasid   dba       84 May 22 08:47 SID_mmon_19813.trm
-rw-r-----   1 orasid   dba     1.1K May 22 08:47 SID_mmon_19813.trc

Thanks in advance for your answer.

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

I would consider the file "SID_ora_20464.trc", based on its file size (and timestamp).

Just for your information: oradebug attaches to your Oracle shadow process (it seems your shadow PID was 20464) and dumps all the requested content into that file.

Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Hello,

As the file contains 2 MB of text, how do you want to proceed so that I can post the relevant information?

Thanks in advance for your answer.

stefan_koehler
Active Contributor
0 Kudos

Hi,

Just zip it and attach it here (you can upload files of up to 1.0 MB on SCN if you click on the advanced editor).

Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Hello,

I am forwarding the trace file.

I have just renamed the information concerning our site.

Regards,

stefan_koehler
Active Contributor
0 Kudos

Hi Benoît,

Sorry for the delay. I have just cross-checked the call stack and system state dump against an Oracle 11.2.0.3 database on Solaris x86.

The call stack (function order) looks nearly the same, but the two processes are missing from your system state dump. I can imagine two possible reasons for that:

  • Both processes are not attached to the SGA at all.
  • Both processes had already died when you performed the system state dump (the runtime of the system state dump was quite long on my system).

It is pretty hard to troubleshoot this issue without access to your system.

Regards

Stefan

Former Member
0 Kudos

Hi Benoît,

The original kernel that was working is Solaris 5.10 Generic_147440-27.  The patched kernel that was causing the issue is Generic_148888-01.  We still haven't figured out which individual patch is the troublemaker.  I will update this thread when we do.

Susan

stefan_koehler
Active Contributor
0 Kudos

Hi,

Just as additional information: my short_stack and system state dump reference analysis were performed on Solaris 5.10 Generic_147148-26.

Regards

Stefan

benoit-schmid
Contributor
0 Kudos

Hello,

Stefan Koehler wrote:

The call stack (function order) looks nearly the same, but the two processes are missing from your system state dump. I can imagine two possible reasons for that:

  • Both processes are not attached to the SGA at all.
  • Both processes had already died when you performed the system state dump (the runtime of the system state dump was quite long on my system).

I am surprised by your answer.

From what I see, the process did not die and seems to be attached to all the shared memory segments that belong to the Oracle database:

# ipcs -m | gawk '/^m / {gsub("^m +[0-9]+",sprintf("m %12#x",$2))} {print}' | grep -i orasid

m   0x520000a1   0x6ab0fb88 --rw-r-----   orasid      dba

m   0x520000a0   0x0        --rw-r-----   orasid      dba

m   0x5200009f   0x0        --rw-r-----   orasid      dba

# pmap 21524 | grep -i shm

0000000380000000      32768K rwxsR    [ ism shmid=0x5200009f ]

00000003C0000000    3178496K rwxsR    [ ism shmid=0x520000a0 ]

00000004C0000000         16K rwxsR    [ ism shmid=0x520000a1 ]

Regards,

former_member188883
Active Contributor
0 Kudos

Hi Benoit,

A similar issue is addressed here:

https://cn.forums.oracle.com/forums/thread.jspa?threadID=2378159

Additionally, please check the following configuration:

If async I/O is used, Oracle may hang with "startup nomount". The following error message is displayed after two minutes:

ORA-00445: background process "PMON" did not start after 120 seconds

This problem is due to a missing MLOCK privilege for the dba group and the oper group. First use "getprivgrp dba" and "getprivgrp oper" to check whether this privilege exists. If it does not exist, you can assign it as root, either temporarily by using "setprivgrp dba MLOCK" / "setprivgrp oper MLOCK" or permanently by creating the /etc/privgroup file with the following contents:

   dba MLOCK

  oper MLOCK

If the above solution is not suitable, please share the following log: /oracle/xxx/saptrace/diag/rdbms/xxx/xxx/trace/xxx_smco_7937.trc

Regards,

Deepak Kori


benoit-schmid
Contributor
0 Kudos

Hello Deepak,

The Oracle thread is not very clear.

If I rename /etc/resolv.conf, then lots of problems would occur, as the machine could not do DNS queries anymore.

Could you describe in more detail what I should do?

Thanks in advance for your answer.

former_member188883
Active Contributor
0 Kudos

Hi Benoit,

The thread talks about adjusting the /etc/resolv.conf file with the relevant entries.

Refer to this example thread: https://forums.oracle.com/forums/thread.jspa?threadID=2373696

Please share the trace file logs so that we have a detailed description of the error.

Regards,

Deepak Kori

benoit-schmid
Contributor
0 Kudos

Hello Deepak,

Deepak Kori wrote:

This problem is due to a missing MLOCK privilege for the dba group and the oper group. First use "getprivgrp dba" and "getprivgrp oper" to check whether this privilege exists. If it does not exist, you can assign it as root, either temporarily by using "setprivgrp dba MLOCK" / "setprivgrp oper MLOCK"

I have checked resolv.conf.

Everything is fine:

search    unige.ch

nameserver    x.y.z.w

nameserver    x.y.z.w

nameserver    x.y.z.w

I have read the Oracle Doc ID that you provided.

It is for Linux and not for Solaris.

Concerning your commands, they are not applicable to Solaris.

How should I proceed for Solaris?

Thanks in advance for your answers.

former_member188883
Active Contributor
0 Kudos

Hi Benoit,

Check whether the issue happens at peak time.

Execute select * from v$resource_limit; to see whether any limit has been reached.

Regards,

Deepak Kori

benoit-schmid
Contributor
0 Kudos

Hello Deepak,

Deepak Kori wrote:

Hi Benoit,

Check whether the issue happens at peak time.

Execute select * from v$resource_limit; to see whether any limit has been reached.

Regards,

Deepak Kori

Unfortunately, the error occurred during the weekend, after the offline backup.

Regards,