HANA failover

Former Member
0 Kudos

Hi Team,

We have a topology with four nodes, one of which is a standby.

Last week a failover occurred from the master to the standby node.

How can we tell whether it was a manual failover or an automatic one?

I checked the indexserver, nameserver, and daemon traces, but I am not able to tell whether it was manual or automatic.

daemon trace file

[2345]{-1}[-1/-1] 2016-05-25 23:15:09.909331 i Daemon       TrexDaemon.cpp(09084) : signo 3 SIGQUIT from user errno 0 code 0
[2345]{-1}[-1/-1] 2016-05-25 23:15:09.909375 i Daemon       TrexDaemon.cpp(09084) : sender pid 8899 real user id 50263 executable '/hana/data/shared/SID/exe/linuxx86_64/HDB<>/sapstart'
[2345]{-1}[-1/-1] 2016-05-25 23:15:09.909414 i Daemon       TrexDaemon.cpp(14175) : got shutdown event (stop)
[2345]{-1}[-1/-1] 2016-05-25 23:15:09.909436 i Daemon       TrexDaemon.cpp(14871) : comment file contains:
[2345]{-1}[-1/-1] 2016-05-25 23:15:09.927401 i Daemon       TrexDaemon.cpp(14091) : all instances in runlevel 5 stopped
[2345]{-1}[-1/-1] 2016-05-25 23:15:09.927408 i Daemon       TrexDaemon.cpp(11788) : stop process hdbwebdispatcher with pid 14707
[2345]{-1}[-1/-1] 2016-05-25 23:15:09.927587 i Daemon       TrexDaemon.cpp(08784) : stopped child with pid 14707 (14707)
[2345]{-1}[-1/-1] 2016-05-25 23:15:12.617500 i Daemon       TrexDaemon.cpp(14397) : process hdbwebdispatcher with pid 14707 exited normally with status 0
[2345]{-1}[-1/-1] 2016-05-25 23:15:12.617529 i Daemon       TrexDaemon.cpp(14473) : all instances in runlevel 4 stopped
[2345]{-1}[-1/-1] 2016-05-25 23:15:12.617533 i Daemon       TrexDaemon.cpp(11788) : stop process hdbindexserver with pid 60982
[2345]{-1}[-1/-1] 2016-05-25 23:15:12.617940 i Daemon       TrexDaemon.cpp(08784) : stopped child with pid 60982 (60982)
[2345]{-1}[-1/-1] 2016-05-25 23:15:12.617951 i Daemon       TrexDaemon.cpp(11788) : stop process hdbxsengine with pid 60988
[2345]{-1}[-1/-1] 2016-05-25 23:15:12.618078 i Daemon       TrexDaemon.cpp(12345) : stopped child with pid 60988 (60988)
[2345]{-1}[-1/-1] 2016-05-25 23:15:19.533758 i Daemon       TrexDaemon.cpp(14397) : process hdbxsengine with pid 60988 exited normally with status 0
[2345]{-1}[-1/-1] 2016-05-25 21:52:12.717629 i Daemon       TrexDaemon.cpp(11807) : kill process hdbindexserver with pid 60982
[2345]{-1}[-1/-1] 2016-05-25 21:52:12.718210 i Daemon       TrexDaemon.cpp(08803) : killed child with pid 60982 (60982)
                                                                                                                           

Accepted Solutions (0)

Answers (2)


lucas_oliveira
Advisor
0 Kudos

Hi,

The nameserver controls failover and takeover tasks, so you might want to check its trace, and also look for an OS restart command issued on the master node.

BRs,

Lucas de Oliveira

lucas_oliveira
Advisor
0 Kudos

Additionally, you can check the indexserver traces on the master node. It is easy to see there whether there was a forced restart or just a manual one.

Regards,

Lucas de Oliveira

Former Member
0 Kudos

Hi Lucas,

I did check the indexserver traces. How do the traces differ between a manual failover and a forced one?

Are any signals recorded differently?

lucas_oliveira
Advisor
0 Kudos

Hi,

Of course. If the system is shut down manually, you'll be able to see all the steps it goes through until it detaches from the persistence.

The Service_Shutdown component is used for these entries. Example from the indexserver traces:


[52612]{-1}[-1/-1] 2016-06-20 15:15:12.224672 i Service_Shutdown TrexService.cpp(00803) : Preparing for shutting service down

[3373]{-1}[-1/-1] 2016-06-20 15:15:12.334555 i Service_Shutdown TREXIndexServer.cpp(04465) : Triggering timezone checker shutdown

[3373]{-1}[-1/-1] 2016-06-20 15:15:12.334585 i assign           TREXIndexServer.cpp(04475) : unassign from volume 4

[3373]{-1}[-1/-1] 2016-06-20 15:15:12.334593 i Service_Shutdown TREXIndexServer.cpp(04478) : Preparing to shutdown

[3373]{-1}[-1/-1] 2016-06-20 15:15:12.334595 i Service_Shutdown TREXIndexServer.cpp(04486) : stopping statistics server worker threads

[...........]

[3373]{427086}[26/57973580] 2016-06-20 15:15:13.375085 i SQLSessionCmd    Connection.cc(07739) : cancel requested for 400129 (logical connection id is 400129

[3373]{-1}[-1/-1] 2016-06-20 15:15:13.376586 i Service_Shutdown TREXIndexServer.cpp(04522) : stopping federation statistics collection

[3373]{-1}[-1/-1] 2016-06-20 15:15:13.376613 i Service_Shutdown TREXIndexServer.cpp(04526) : stopping extended storage heartbeat thread

[3373]{-1}[-1/-1] 2016-06-20 15:15:13.379852 i Service_Shutdown TREXIndexServer.cpp(04560) : Abort logreplay

[3373]{-1}[-1/-1] 2016-06-20 15:15:13.379870 i Service_Shutdown TREXIndexServer.cpp(04565) : Stopping LOBGarbageCollectorThread

[3373]{-1}[-1/-1] 2016-06-20 15:15:13.380129 i Service_Shutdown TREXIndexServer.cpp(04589) : Stopping EPM

[3373]{-1}[-1/-1] 2016-06-20 15:15:13.471506 i Service_Shutdown TREXIndexServer.cpp(04593) : Stopping Embedded Catalyst Services

[3373]{-1}[-1/-1] 2016-06-20 15:15:14.396296 i Service_Shutdown TREXIndexServer.cpp(04598) : Stopping PlanningEngine

[3373]{-1}[-1/-1] 2016-06-20 15:15:14.503468 i Service_Shutdown TREXIndexServer.cpp(04613) : Stopping SQL session service

[3373]{-1}[-1/-1] 2016-06-20 15:15:14.503520 i Service_Shutdown tcp_listener.h(00122) : Shutdown SqlListener service

[53750]{-1}[-1/-1] 2016-06-20 15:15:14.569328 i Service_Shutdown tcp_listener.cc(00636) : stop the SQL listening port: 30415

[3373]{-1}[-1/-1] 2016-06-20 15:15:14.684304 i Service_Shutdown TREXIndexServer.cpp(04617) : Stopping SQL plan cache

[..........]

[52612]{-1}[-1/-1] 2016-06-20 15:16:22.757531 i Service_Shutdown TrexService.cpp(00975) : Deleting pools

[52612]{-1}[-1/-1] 2016-06-20 15:16:22.757543 i Service_Shutdown TrexService.cpp(00985) : Deleting configuration

[52612]{-1}[-1/-1] 2016-06-20 15:16:22.757546 i Service_Shutdown TrexService.cpp(00989) : Removing pidfile

[52612]{-1}[-1/-1] 2016-06-20 15:16:22.758133 i Service_Shutdown TrexService.cpp(01026) : System down

A crash, on the other hand, would be very descriptive, possibly with no actual shutdown sequence at all. Example:


[40204]{-1}[10/54417462] 2016-05-20 00:36:06.342783 e Basis            FaultProtectionImpl.cpp(01610) : SIGNAL 11 (SIGSEGV) caught, thread: 7887[thr=40204]: Assign, addr: 0x0000000000000000, time: 2016-05-20 00:36:06 000 Local

Instance LO2/08, OS Linux vanpghana05 3.0.101-68-default #1 SMP Tue Dec 1 16:21:37 UTC 2015 (ed01a9f) x86_64

----> Register Dump <----

  rax: 0x0000000000000020  rbx: 0x00007fd87573f1b8

  rcx: 0x0000000000000000  rdx: 0x0000000000000003

  rsi: 0x0000000000000000  rdi: 0x00007fd9490a6620

  rsp: 0x00007fd9490a5118  rbp: 0x00007fd8a518d910

  r08: 0x0000000000000000  r09: 0x0000000000000007

[...]

[40204]{-1}[10/54417462] 2016-05-20 00:36:06.342963 i Basis            Helper.cpp(00514) : Using 'x64_64 ABI unwind' for stack tracing

[40204]{-1}[10/54417462] 2016-05-20 00:36:06.342783 e Basis            FaultProtectionImpl.cpp(01610) : NOTE: full crash dump will be written to /usr/sap/LO2/HDB08/vanpghana05/trace/indexserver_vanpghana05.30803.crashdump.20160520-003606.039630.trc

Here we had signal 11, but it could be any OS signal that terminates a process.
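
If you want to scan a large trace file for these two patterns rather than reading it by eye, a rough Python sketch could look like the one below. This is not an SAP tool. The line layout in the regular expression is only inferred from the excerpts in this thread, and the function name classify_trace and the three verdict strings are made up for illustration. An "orderly shutdown" verdict only says that Service_Shutdown entries were found, not who triggered the stop.

```python
import re

# Layout inferred from the trace excerpts in this thread (an assumption):
# [thread]{connection}[transaction] date time level component source(line) : message
TRACE_LINE = re.compile(
    r"^\[\d+\]\{-?\d+\}\[-?\d+/-?\d+\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<level>[a-z])\s+"
    r"(?P<component>\S+)\s+"
    r"(?P<source>\S+)\s*\((?P<line>\d+)\)\s*:\s*"
    r"(?P<msg>.*)$"
)

# Message text seen in the crash example, e.g. "SIGNAL 11 (SIGSEGV) caught"
SIGNAL_CAUGHT = re.compile(r"SIGNAL (\d+) \((SIG[A-Z0-9]+)\) caught")


def classify_trace(lines):
    """Return (verdict, shutdown_step_count, crash_signals) for trace lines.

    verdict is a hypothetical label: "crash", "orderly shutdown",
    or "inconclusive" when neither pattern is present.
    """
    shutdown_steps = 0
    crash_signals = []
    for raw in lines:
        match = TRACE_LINE.match(raw.strip())
        if match is None:
            continue  # register dumps, blank lines, elided sections
        if match["component"] == "Service_Shutdown":
            shutdown_steps += 1
        caught = SIGNAL_CAUGHT.search(match["msg"])
        if caught and match["level"] == "e":
            crash_signals.append((match["ts"], caught.group(2)))

    if crash_signals:
        verdict = "crash"
    elif shutdown_steps:
        verdict = "orderly shutdown"
    else:
        verdict = "inconclusive"
    return verdict, shutdown_steps, crash_signals
```

You would feed it the lines of the indexserver trace from the master node, for example with open(path, errors="replace") as f: classify_trace(f), and then look at the timestamps of any reported signals against the failover time.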

Hope that helps.

Regards,

Lucas de Oliveira

former_member182967
Active Contributor
0 Kudos

Hi p517710,

Additionally, if it was manual, you can check the OS command history of the users who could have triggered it to find the command or script that was executed.
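
As a minimal sketch of that history check, the Python snippet below filters shell history lines for commands that stop or restart the database. The pattern list is my assumption of what is worth looking for (HDB stop, HDB kill, HDB restart, and sapcontrol calls with a stop or restart function). It is not exhaustive, and the function name find_stop_commands is hypothetical. Keep in mind that shell history may carry no timestamps and can be edited, so a hit is a hint, not proof.

```python
import re

# Assumed patterns for commands that stop or restart HANA; extend as needed.
STOP_COMMAND = re.compile(
    r"\bHDB\s+(stop|kill|restart)\b"
    r"|\bsapcontrol\b.*-function\s+"
    r"(StopSystem|RestartSystem|RestartInstance|Stop)\b"
)


def find_stop_commands(history_lines):
    """Return (line_number, command) pairs for history entries that match."""
    hits = []
    for number, line in enumerate(history_lines, start=1):
        if STOP_COMMAND.search(line):
            hits.append((number, line.strip()))
    return hits
```

You would run it over the history file of each candidate OS user (which file that is depends on the shell in use) and compare any hits against the failover time from the traces.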

Regards,

Ning

anandtigadikar
Advisor
0 Kudos

What exactly do you mean by manual or automatic failover? Ideally, whenever a node goes down in HANA, the automatic failover mechanism fails it over to an available standby node.

Which version are you on?

Did you face any errors or delays while this failover happened?

Go through the nameserver***.trc and Indexserver.trc logs; you will find all the related information in detail.

-Anand.