
SMux Provider: Physical Connection is not usable [xFFFFFFFF]

Former Member
0 Kudos

Hi All,

We migrated our SAP landscape five months ago from HP-UX/Oracle to Windows 2012 R2 and SQL Server 2012 R2 SP2. The servers are VMs (VMware 5.1) on an HP ESX host. For the past three months we have been getting the error below at random intervals in the SAP BI system. Jobs are cancelled with the message "System Shutdown", and the SM21 log shows "operating system call receive failed (error no. 10054)".

The installation is distributed with DB on one host and CI & App on other hosts within same data center.

At the same time there is a dump in ST22

Things which we have already done:

  1. Checked the SNAC version, which is the same as the SQL Server version.
  2. Increased Keep Alive and Keep Alive Interval for TCP/IP from 30000 and 1000 to 90000 and 5000 respectively.
  3. Increased TcpMaxDataRetransmissions to 10 on the DB server.
  4. Checked the recommendations from VMware and HP regarding the network and power settings.
  5. Checked the resource utilization at the time of the error; nothing abnormal was reported, which makes me think this may be specific to SQL Server settings.
  6. The order of protocols enabled for SQL Server is Shared Memory (1), TCP/IP (2) and Named Pipes (3).

  Scheduled niping with the long LAN stability test option, but it does not seem to show any drops. That may be because niping was never running at a time when the error occurred, as the error is very random: it happened twice in April and then again in July.
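Since niping only helps if it happens to be running when the error hits, a lightweight probe that runs permanently and logs only failures is one way to get a timestamp for the next incident. Below is a minimal sketch in Python, not an SAP or Microsoft tool. The function names `probe_once` and `run_probe`, the log file name, and the host and port in the usage note are all assumptions for illustration. It only tests a fresh TCP connect to the database port, so it will not catch every reset of an already established work-process connection.

```python
import errno
import socket
import time
from datetime import datetime


def probe_once(host, port, timeout=5.0):
    """Try one TCP connect; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # exc.errno can be None (for example on a timeout), so guard the lookup.
        name = errno.errorcode.get(exc.errno, "unknown") if exc.errno else "unknown"
        return False, f"{name}: {exc}"


def run_probe(host, port, interval=10.0, log_path="tcp_probe.log", max_checks=None):
    """Probe repeatedly and append a timestamped line for every failure.

    max_checks=None runs until the process is stopped.
    """
    checks = 0
    with open(log_path, "a") as log:
        while max_checks is None or checks < max_checks:
            ok, detail = probe_once(host, port)
            if not ok:
                log.write(f"{datetime.now().isoformat()} FAIL {detail}\n")
                log.flush()
            checks += 1
            if max_checks is None or checks < max_checks:
                time.sleep(interval)
    return checks
```

For example, `run_probe("dbhost.example", 1433)` started on the CI or App server would run until stopped; the host name is a placeholder and 1433 is only the default SQL Server port, so adjust both to the actual instance.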


This issue does not occur in QA, which has the same settings as Prod except that it is hosted in a different DC. We are also unable to reproduce the issue on demand, as it is too random.


Any help or direction for troubleshooting would be much appreciated.


Thanks,

Manas

Accepted Solutions (0)

Answers (1)


luisdarui
Advisor
0 Kudos

Hi Manas,

I think it is related to VMware. Note that after the database disconnection you also have errors accessing files in your \\.\sapmnt share.

Are you using VMware snapshot backups?

My colleague Eduardo has pointed out in his blog here: some interesting points that can help you in case of disconnections when running on VMware:

Also, I think this topic is more related to the following spaces:

This is because the 10054 error comes from the OS and is, in most cases, related to network disconnections.

Please make sure your system is set up according to the recommendations in Note 1056052 - Windows: VMware vSphere configuration guidelines.

Best Regards,

Luis Darui

SAP Support

Former Member
0 Kudos

Thanks for your reply Luis.

I have already checked Eduardo's blog and consulted the VM/Infra team; according to them, all settings are in place as recommended. I have also referred to Dale's presentation and checked the pointers mentioned there.

As I said, this issue is very random and occurs in only 2 of the production systems (ECC and BI). The QA landscape is an exact replica of Production, except that it is in a different data centre.

Another thing I came across in Note 1593183 - TCP/IP networking parameters for SQL Server is that it mentions increasing the Keep Alive from 30000 to 90000 on the server side as well. Will that help, given that I have only increased it on the CI and App servers?

Thanks,

Manas

Sriram2009
Active Contributor
0 Kudos

Hi Manas

1. Have you checked the VM log for any disconnection errors during those times?

2. Could you share your kernel release and patch level? If it is low, you may need to upgrade to the DCK kernel.

Regards

SS

luisdarui
Advisor
0 Kudos

Hi Manas,

Based on previous experience, this happens when a VM snapshot is taken on the database server and causes a "stun" effect, which ends in a disconnection.

The registry settings and other recommendations are intended to reduce or eliminate this effect and keep the system running, but I have seen some cases where the customer had to change the backup strategy to stop the disconnections.

Are you sure that no VMware snapshot backup was being taken at the time you had those dumps?

Regards,

Luis Darui

Former Member
0 Kudos

Hi Luis,

I was also troubleshooting in this direction, and when I checked with the VM team they confirmed that the snapshot backup was in fact running at the times these failures happened. However, they also want to know why it happened only on those 4 occasions, and at random. The snapshot backup has been running for quite some time, and so have the daily BW loads. The only connection I can make is that a heavy load may occasionally coincide with the snapshot backup and cause this failure.
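To check the coincidence more systematically, the failure times from SM21/ST22 can be compared against the snapshot windows reported by the VM team. The sketch below assumes both are available as "YYYY-MM-DD HH:MM:SS" strings; the function name `failures_in_snapshot_windows`, the two-minute slack, and all timestamps are invented placeholders, not data from this system.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse(ts):
    """Parse a timestamp string in the assumed TS_FORMAT."""
    return datetime.strptime(ts, TS_FORMAT)


def failures_in_snapshot_windows(failures, snapshots, slack_seconds=120):
    """Return (failure, start, end) for each failure inside a snapshot window.

    slack_seconds widens each window on both sides to allow for clock
    differences between the hosts and the backup tool.
    """
    slack = timedelta(seconds=slack_seconds)
    hits = []
    for failure in failures:
        when = parse(failure)
        for start, end in snapshots:
            if parse(start) - slack <= when <= parse(end) + slack:
                hits.append((failure, start, end))
                break  # one matching window per failure is enough
    return hits


# Placeholder data for illustration only.
failures = ["2015-04-10 02:14:30", "2015-04-18 13:00:00"]
snapshots = [
    ("2015-04-10 02:00:00", "2015-04-10 02:15:00"),
    ("2015-04-18 02:00:00", "2015-04-18 02:15:00"),
]

for failure, start, end in failures_in_snapshot_windows(failures, snapshots):
    print(f"{failure} falls in snapshot window {start} to {end}")
```

With the placeholder data, only the first failure falls inside a window. Running the same comparison against the BW load schedule would show whether the failures need both a snapshot and a heavy load at the same time.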

As a precautionary measure, we are working with the application team and the VM team to schedule the snapshot backup for a time when nothing else is running.

As the issue is not reproducible, we may not be able to prove immediately that this has fixed it, but the behaviour can be observed over the next few months.

I will keep this thread updated with the solution.

Thanks,

Manas

luisdarui
Advisor
0 Kudos

Hi Manas,

If you are still facing this kind of disconnection after this change, please prepare a performance log as per Note 1478133 and open a support incident under component BC-OP-NT-ESX. We can forward it to our VMware expert colleagues.

Regards,

Luis