VMOTION impact on a running SAP instance


Hello folks,

I would like to hear about your practical experience with vMotion (VMware 4.1) on a production SAP instance. We are setting our vMotion limits so that it happens very infrequently, and automatically only if there is a catastrophic host failure.

However, some hardware (or software) activities are more discretionary in nature and can be scheduled during a defined maintenance window. An example might be a memory DIMM that has been marked bad and taken offline.

Under these circumstances, we want to perform the vMotion in a controlled, scheduled manner so that we minimize any negative impact on running jobs and processes. For example, I don't want to get into a situation where a long-running job is cancelled unnecessarily.

In our testing, we can see that when a vMotion occurs, in-flight processes are cancelled. We see short dumps, errors in the syslog, failed updates, etc. I really don't want to get into a situation where we do this and cause a long DB back-out.

Can you please share your practical experience with vMotion in this regard? Is this indeed a concern, or am I being overly sensitive?

thanks

Accepted Solutions (0)

Answers (1)


markus_doehr2
Active Contributor

> In our testing, we can see that when a vmotion occurs in flight processes are cancelled. We see short dumps, error in the syslog, failed updates, etc. I really don't want to get into a situation where we do this and cause a long DB back-out.

What exactly are you moving? An application server? The central instance? The database server?

Markus

Thanks for the reply, Markus.

We have seen this with both a CI and a DI. Our database server is standalone, and we haven't done much vMotion testing with it yet.

So we've seen it with both the central instance and a dialog server.

thanks for the inputs.

markus_doehr2
Active Contributor

> We have seen this with both a CI and a DI. Our database server is standalone, and we've not done much vmotion testing with it yet.
>
> So, we've seen it both with the central instance as well as a dialog server.

If the central instance (message server and enqueue) is not reachable, there will be dumps. Depending on how many people work on your system and how much activity there is, each second can produce a number of dumps and error messages.

We have encountered "standstills" of application servers when we move them. If you have large memory configurations (e.g. 16 GB or more), you have to consider that the state of the server must be saved to disk and re-read on the target system. Writing 16 GB (or more) of memory content, including the deltas that are tracked once the move has started, can take a while, even on fast disks.
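To make the effect of memory size and tracked deltas concrete, here is a rough back-of-envelope model of an iterative pre-copy: the first round copies all memory, and each later round copies whatever was dirtied while the previous round ran. This is a simplified sketch, not VMware's implementation; it assumes a constant transfer rate and a constant dirty rate, and the function name and all numbers in the usage line are hypothetical placeholders rather than measurements.

```python
def precopy_rounds(memory_gb, rate_gb_s, dirty_gb_s, stop_gb, max_rounds=30):
    """Estimate rounds and total copy time for an iterative pre-copy.

    Round 1 copies all memory; each later round copies the data dirtied
    while the previous round was running. Copying stops once the remaining
    delta is at most stop_gb (small enough for a final switchover).
    Assumes constant transfer and dirty rates (a simplification).
    """
    if dirty_gb_s >= rate_gb_s:
        # Pages are dirtied at least as fast as they are copied.
        raise ValueError("dirty rate >= transfer rate: pre-copy never converges")
    remaining = memory_gb
    total_time = 0.0
    rounds = 0
    while remaining > stop_gb and rounds < max_rounds:
        round_time = remaining / rate_gb_s
        total_time += round_time
        remaining = dirty_gb_s * round_time  # data dirtied during this round
        rounds += 1
    return rounds, total_time, remaining


# Hypothetical numbers: 16 GB guest, 1 GB/s transfer, 0.2 GB/s dirtied, stop at 50 MB.
rounds, seconds, left_gb = precopy_rounds(16, 1.0, 0.2, 0.05)
```

With these placeholder inputs the model needs four rounds and roughly 20 s of copying. The total approaches M / (B − D) for memory M, transfer rate B and dirty rate D, so on a busy, large server whose dirty rate comes close to the available throughput, the copy phase stretches out considerably.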

Because of this, we switched off vMotion completely and move hosts manually when necessary. When we do, we do it during low-load periods, or we remove the application server from the logon group and wait until it's almost empty (no or very few users) before we move it.
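The "wait until it's almost empty" step can be sketched as a simple polling loop. Everything here is hypothetical: `active_users` stands in for whatever site-specific way you count sessions on the application server, and the threshold, poll interval and timeout are placeholders. Removing the server from the logon group and the move itself remain manual steps outside this sketch.

```python
import time
from typing import Callable


def wait_until_drained(active_users: Callable[[], int],
                       threshold: int = 2,
                       poll_s: float = 60.0,
                       timeout_s: float = 3600.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until at most `threshold` users remain on the server.

    Returns True once drained, or False if `timeout_s` elapses first.
    `active_users` is a hypothetical callback supplied by the caller.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if active_users() <= threshold:
            return True   # few enough users left: safe to schedule the move
        if clock() >= deadline:
            return False  # still busy: postpone the move
        sleep(poll_s)
```

Returning False on timeout, rather than proceeding anyway, keeps the decision to move a busy server with the administrator, in line with the manual approach described above.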

Our DB/CI instance runs outside of VMware on bare metal, and we run a shadow database on a second bare-metal server for failover. Failovers, when necessary, are also done manually here.

I'm actually more "old school" and conservative here. If there's an incident, I prefer to do takeovers or switches manually instead of letting some logic do it; it just lets me sleep much better at night. vMotion may seem nice and easy and the next logical step, but we had too many quirks in the past, not because vMotion doesn't work but simply because spinning disks are too slow. SSDs may be a way out of that, but we're not there yet.

Markus

Former Member

Hi,

VMware vMotion does NOT cancel any jobs, TCP connections, transactions or anything else. If you experience problems because of a vMotion action, you should make sure:

- that the advanced parameter Migrate.PrecopySwitchoverTimeGoal has not been changed. The default is 500 ms, and it should not be set any higher.

- that the vMotion actions are performed within the configured time goals. To check this, run this command on the ESX command line:

cat /proc/vmware/migration/history

- that you follow the vMotion best practices, such as using a dedicated network for vMotion traffic. Otherwise the production network is affected and slows down, and the application may hit its network timeout thresholds.

- that you loosen your TCP keepalive settings if you're using MS SQL (see SAP Note 1593183).
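As a rough illustration of why a dedicated, fast vMotion network matters for the 500 ms switchover goal, the sketch below computes how much memory could be sent within that time at a given link speed. It is a simplification under stated assumptions: the function name is hypothetical, it uses decimal megabytes, and the `efficiency` factor is an assumed stand-in for protocol overhead or sharing the link with production traffic.

```python
def switchover_budget_mb(link_gbit_s: float, goal_ms: float = 500.0,
                         efficiency: float = 1.0) -> float:
    """Rough upper bound on the memory (decimal MB) that can be transferred
    within goal_ms on a link of link_gbit_s.

    efficiency < 1.0 is an assumed factor for overhead or a shared link.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    mb_per_s = link_gbit_s * 1000.0 / 8.0 * efficiency  # Gbit/s -> MB/s
    return mb_per_s * goal_ms / 1000.0


print(switchover_budget_mb(1.0))   # 62.5  (1 Gbit/s, 500 ms)
print(switchover_budget_mb(10.0))  # 625.0 (10 Gbit/s, 500 ms)
```

If production traffic takes half of a shared 1 Gbit/s link (`efficiency=0.5`), the budget in this model drops to about 31 MB, so less dirty memory can be handled within the same time goal.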

Kind regards,

Matthias