
MSCS no guarantee for being always up

Rudi_Wiesmayr
Active Participant
0 Kudos

Hi folks!

We recently had a somewhat disappointing experience: due to an error suspected to be somewhere in the SAN driver software, our production system lost all its drives for a while and froze, but the cluster still reported itself as up and running.

We had to help the systems manually to get running again.

When a cluster node failed due to a hardware issue one night last year, MSCS failed over immediately; only some users lost their connections, and the system was available again within seconds.

But now we know that a cluster cannot guarantee that the SAP system is always up when software fails...

Do you have similar experiences?

What do you do against this threat?

Kind regards, Rudi

Accepted Solutions (0)

Answers (3)


Former Member
0 Kudos

Hello,

We have also experienced this kind of problem.

MSCS is only a solution to secure the 2 hardware nodes, not the SAN.

The SAN is supposed to be secured with multipath but this works in reality much worse than the marketing folks from the hardware vendor advertised.

Their answer is always to upgrade the firmware, but it's a catch-22.

You would have to stop your HA system every week to update the firmware...

So, you are right: MSCS is no guarantee of always being up, but the availability is much better than without MSCS...

Rudi_Wiesmayr
Active Participant
0 Kudos

Hello Olivier, thanks for sharing your experience!

What about the others? Everything running perfectly? I can't really imagine that...

And: What special measures can be taken for prevention?

Kind regards, Rudi

Former Member
0 Kudos

Hello

MSCS is as good as any other high availability solution I have seen.

Change control and testing are the key to ensuring any HA system works as expected. If your QA system is on different hardware, different drivers and different firmware, then I don't see any way that a stable, reliable and predictable Prod system can be maintained.

Some people have problems with the SCSI RESET/RELEASE commands that are needed during a failover. Sometimes, rather than fix the problem, people buy Veritas Volume Manager or other cluster file system software. This means the disks are mounted on all of the cluster nodes all the time.

No MSCS cluster I have ever built needed this - but then this comes down to testing/change control etc.

MSCS supports up to eight nodes in shared quorum mode. MSCS Majority Node Set (the so-called "Geocluster") also supports eight nodes.

SAP supports clusters with more than two nodes and has started supporting multiple SAP systems inside one cluster as well.

"Everything running perfectly" is a must - every failover must be perfect.

Thanks

N.P.C

former_member433984
Active Contributor
0 Kudos

Just in addition: none of the existing, reasonably priced software/hardware combinations can give 100% availability. There are too many factors involved (not to forget planned and unplanned downtime), and you can get 95%, 99.9%, 99.99%, etc. The number of nines after the decimal point depends on how critical it is for you and how much you are ready to invest in the infrastructure.
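To put those percentages in perspective, here is a minimal sketch that converts an availability figure into allowed downtime per year. It assumes a 365-day year and treats planned and unplanned downtime the same; the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours per year a system may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# The three availability levels mentioned in the post above.
for pct in (95.0, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% availability allows about {hours:.2f} hours of downtime per year")
```

That works out to roughly 438 hours at 95%, under 9 hours at 99.9%, and under an hour at 99.99%, which is why each additional nine gets expensive.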

In your case, the problem is probably caused by the SAN driver, which did not report an error when the disks were lost. I found a similar issue here:

http://www.tek-tips.com/viewthread.cfm?qid=1392569&page=1

Rudi_Wiesmayr
Active Participant
0 Kudos

Thanks, Yaroslav for the link. Very similar indeed.

The answer from HP is still pending. I suppose they will come up with some driver and firmware updates.

Kind regards,

Rudi

*** Not sorry for not giving points any more ***

former_member433984
Active Contributor
0 Kudos

great!

it will be interesting.

Former Member
0 Kudos

Hello Rudi,

"High Availability" involves a lot more than just MSCS, HACMP or MC Service Guard.

I would recommend regular High Availability tests in your test environment, which should be configured almost identically to your production environment. The firmware, drivers and multipath software should be the same on your QA and PRD system.
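As a rough illustration of keeping QA and PRD aligned, the sketch below compares two component inventories and reports anything that differs. How you collect the inventories is up to your own tooling; the component names and version strings here are made-up placeholders, not real driver or firmware versions.

```python
from typing import Dict, Optional, Tuple


def find_version_drift(
    qa: Dict[str, str], prd: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return components whose versions differ between QA and PRD.

    A value of None means the component is missing on that side.
    """
    drift = {}
    for component in sorted(set(qa) | set(prd)):
        qa_version = qa.get(component)
        prd_version = prd.get(component)
        if qa_version != prd_version:
            drift[component] = (qa_version, prd_version)
    return drift


# Hypothetical inventories; names and versions are placeholders.
qa_inventory = {"hba_driver": "1.0", "hba_firmware": "2.1", "multipath": "3.0"}
prd_inventory = {"hba_driver": "1.0", "hba_firmware": "2.0", "multipath": "3.0"}

for name, (qa_v, prd_v) in find_version_drift(qa_inventory, prd_inventory).items():
    print(f"{name}: QA={qa_v} PRD={prd_v}")
```

Running a check like this before each failover test makes it obvious when the test environment no longer tells you anything about production.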

Without regular testing and validation I have never seen an HA solution that worked reliably, be it on Unix or Windows.

Just a comment

N.P.C

former_member433984
Active Contributor
0 Kudos

Hi Rudolf,

So far, the problem is that the MSCS software did not detect the disk failure correctly.

What was the statement from your hardware partner/Microsoft support?

best regards,

Yaroslav