on 10-25-2014 11:47 AM
Hi Friends,
We are running SAP BIW 7.10 on Linux and Oracle 11g.
We are facing an issue with ODS jobs: they frequently hang and do not complete. We then terminate the jobs manually and start them again. We have 4 DB nodes in Oracle RAC and 8 SAP application instances.
We have checked the trace log and repeatedly found the alert "IPC Send timeout detected. Sender: ospid 8949 [oracle@<DB hosts>"
Regards
Ganesh Tiwari
Hi Ganesh,
I cannot imagine that this is the only related message in the ASM, cluster, and RDBMS (alert) log files.
Your description sounds like one RAC instance is hanging, for whatever reason, and the healthy node(s) are requesting a RAC member kill escalation. Please provide more detailed information.
Regards
Stefan
Hi Ganesh,
Yes, but is that from the healthy instances or from the crashing instance? This is Oracle RAC, so you need to cross-check many more components, such as ASM and CSSD, and not just one alert log file.
Getting a system state dump from the hanging instance is the trickier part if Oracle does not create it automatically.
Regards
Stefan
Hi Stefan,
Thanks for your reply.
We have collected the cluster logs from all the nodes:
[ctssd(3383)]CRS-2409:The clock on host XXXXXXXX is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2014-10-02 00:46:25.879:
[ctssd(3383)]CRS-2409:The clock on host XXXXXXXX is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2014-10-02 01:16:26.581:
[ctssd(3383)]CRS-2409:The clock on host XXXXXXX is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2014-10-02 01:46:27.287
2014-10-02 22:52:11.204:
[cssd(3115)]CRS-1612:Network communication with node xxxxxxxx (3) missing for 50% of timeout interval. Removal of this node from cluster in 14.850 seconds
2014-10-02 22:52:19.218:
[cssd(3115)]CRS-1611:Network communication with node xxxxxxxxxxx (3) missing for 75% of timeout interval. Removal of this node from cluster in 6.840 seconds
2014-10-02 22:52:23.220:
[cssd(3115)]CRS-1610:Network communication with node xxxxxxxxx (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.840 seconds
2014-10-02 22:52:26.060:
[cssd(3115)]CRS-1632:Node xxxxxxxx is being removed from the cluster in cluster incarnation 306673017
2014-10-02 22:52:26.082:
[cssd(3115)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxxxxxxxx, xxxxxxxxx,xxxxxxxx
Regards
Ganesh Tiwari
Hi Ganesh,
You have already found the issue: one node was removed from your 4-node RAC cluster because it was not reachable. You have to figure out why that node stops responding. There are many possible reasons, from hardware to the operating system to the GI stack to interconnect issues.
How should we assist you in this case?
Regards
Stefan
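To line up the eviction sequence across the logs of all four nodes, it may help to reduce each clusterware log to a list of timestamped CRS events first. The following is only a minimal sketch in Python: it assumes the log layout shown in this thread (a timestamp line followed by a `[daemon(pid)]CRS-nnnn:` message that may wrap onto further lines). The names `parse_events` and `SAMPLE_LOG` are hypothetical, not part of any Oracle tool, and the sample lines are copied from the log posted above.

```python
import re

# Sample lines copied from the clusterware log in this thread (host names masked).
SAMPLE_LOG = """\
2014-10-02 22:52:11.204:
[cssd(3115)]CRS-1612:Network
communication with node xxxxxxxx (3) missing for 50% of timeout interval. Removal of this node from cluster in 14.850
seconds
2014-10-02 22:52:19.218:
[cssd(3115)]CRS-1611:Network communication with node xxxxxxxxxxx (3) missing for 75% of timeout interval. Removal of this node from cluster in 6.840 seconds
2014-10-02 22:52:26.060:
[cssd(3115)]CRS-1632:Node xxxxxxxx is being removed from the cluster in cluster incarnation 306673017
"""

# Assumed layout: a timestamp on its own line, optionally ending with a colon.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):?$")
# Assumed layout: [daemon(pid)]CRS-nnnn:message text
MESSAGE = re.compile(r"^\[(\w+)\(\d+\)\](CRS-\d+):(.*)$")


def parse_events(text):
    """Return a list of dicts with keys time, daemon, code, text."""
    events = []
    current_ts = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        ts = TIMESTAMP.match(line)
        if ts:
            current_ts = ts.group(1)
            continue
        msg = MESSAGE.match(line)
        if msg:
            events.append({
                "time": current_ts,  # None if the log had no timestamp line
                "daemon": msg.group(1),
                "code": msg.group(2),
                "text": msg.group(3).strip(),
            })
            current_ts = None
        elif events:
            # Continuation of a wrapped message: append to the previous event.
            events[-1]["text"] = (events[-1]["text"] + " " + line).strip()
    return events


if __name__ == "__main__":
    for event in parse_events(SAMPLE_LOG):
        if event["daemon"] == "cssd":
            print(event["time"], event["code"], event["text"])
```

Running the same parser over the log of each node and sorting the combined events by `time` gives one timeline, which makes it easier to see which node first lost network communication with node (3). This is only meaningful if the node clocks agree, which the CRS-2409 messages above call into question.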