on 07-09-2015 5:07 PM
Hi All,
We migrated our SAP landscape five months ago from HP-UX/Oracle to Windows 2012 R2 and SQL Server 2012 R2 SP2. The servers are VMs (VMware 5.1) on an HP ESX host. For the past three months we have been getting the error below at random intervals in the SAP BI system. Jobs get cancelled with the message "System Shutdown", and the SM21 log shows “operating system call receive failed (error no. 10054)”.
The installation is distributed, with the DB on one host and the CI and App servers on other hosts within the same data center.
At the same time there is a dump in ST22.
Things which we have already done:
Scheduled niping with the long LAN stability test option, but there do not seem to be any drops. That may be because niping was never running when the error occurred, as the error is very random: it happened twice in April and then again in July.
This issue is not happening in QA, which has the same settings as Prod except that it is hosted in a different DC. We are also unable to reproduce the issue as and when desired because it is too random.
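Because the error is rare, a probe that runs continuously and logs only failures with timestamps makes it easier to line up a drop with other events. The following is a minimal sketch in Python, not a replacement for niping: the function names `probe_once` and `run_probe` are made up for illustration, and it only tests whether a new TCP connection can be opened, so it will not see a reset on an already established connection.

```python
import socket
import time
from datetime import datetime


def probe_once(host, port, timeout=5.0):
    """Try a single TCP connect and return (ok, detail)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000.0
            return True, f"connected in {elapsed_ms:.1f} ms"
    except OSError as exc:
        # Covers refused connections, resets and timeouts.
        return False, f"{type(exc).__name__}: {exc}"


def run_probe(host, port, interval_s=1.0, count=None, log=print):
    """Probe repeatedly; log failures and recoveries with a timestamp.

    count=None runs until interrupted. Returns the number of failed probes.
    """
    failures = 0
    was_ok = True
    done = 0
    while count is None or done < count:
        ok, detail = probe_once(host, port)
        now = datetime.now().isoformat(timespec="seconds")
        if not ok:
            failures += 1
            log(f"{now} FAIL {host}:{port} {detail}")
        elif not was_ok:
            log(f"{now} RECOVERED {host}:{port} {detail}")
        was_ok = ok
        done += 1
        if count is None or done < count:
            time.sleep(interval_s)
    return failures
```

It could be started on the CI or App server with something like `run_probe("dbhost", 1433)`, where `dbhost` is a placeholder for the database host and 1433 is assumed to be the SQL Server port in use.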
Any help or direction for troubleshooting would be much appreciated.
Thanks,
Manas
Hi Manas,
I think it is related to VMware. Note that after the database disconnection you also have errors accessing files in your \\.\sapmnt share.
Are you using VMware snapshot backups?
My colleague Eduardo has pointed out in his blog here: some interesting points that can help you with disconnections when running on VMware:
Also, I think this topic is more related to the following spaces:
The 10054 error comes from the OS and is, in most cases, related to network disconnections.
Please make sure your system follows the recommendations in Note 1056052 - Windows: VMware vSphere configuration guidelines.
Best Regards,
Luis Darui
SAP Support
Thanks for your reply Luis.
I have already checked Eduardo's blog and consulted the VM/Infra team; according to them, all settings are in place as recommended. I have also referred to Dale's presentation and checked the pointers mentioned there.
As I said, this issue is very random and occurs in only 2 of the production systems (ECC and BI). The QA landscape is an exact replica of Production except that it is in a different data centre.
Another thing I came across in Note 1593183 - TCP/IP networking parameters for SQL Server is that it mentions increasing the Keep Alive from 30000 to 90000 on the server side as well. Will that help, given that I have only increased it on the CI and APP servers?
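To get a feel for what the change from 30000 to 90000 buys, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions that are not stated in this thread and should be checked against the note and the Windows documentation: the values are milliseconds, an unanswered probe is repeated every 1000 ms, and the connection is dropped after 10 unanswered probes. The function name is made up for illustration. Keep-alive probes only apply to an idle connection; a connection with data in flight is governed by TCP retransmission instead.

```python
def dead_connection_detection_ms(idle_ms, probe_interval_ms=1000, probe_count=10):
    """Rough estimate of how long an idle, unresponsive TCP peer is tolerated.

    Assumed model: the first probe goes out after idle_ms of silence, then
    unanswered probes repeat every probe_interval_ms until probe_count
    probes have gone unanswered and the connection is dropped.
    """
    return idle_ms + probe_interval_ms * probe_count


# Compare the old and new Keep Alive values mentioned above.
for idle_ms in (30000, 90000):
    total_s = dead_connection_detection_ms(idle_ms) / 1000
    print(f"keep-alive {idle_ms} ms: peer may stay silent for about {total_s:.0f} s")
```

Under this model the tolerated silence is decided by the side that sends the probes, which is one reason to look at the server-side value as well as the CI and APP servers.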
Thanks,
Manas
Hi Manas,
Based on previous experience, this happens when a VM snapshot is taken on the database server and causes a "stun" effect, which ends in a disconnection.
The registry settings and other recommendations are intended to reduce or eliminate this effect and keep the system running, but I've seen some cases where the customer had to change their snapshot strategy to stop the disconnections.
Are you sure that no VMware snapshot backup was being taken at the time of those dumps?
Regards,
Luis Darui
Hi Luis,
I was also troubleshooting in this direction, and when I checked with the VM team they confirmed that a snapshot backup was running at the time these failures happened. However, they want to know why it happened on only those 4 occasions, and at random. The snapshot backup has been running for quite some time, and so have the daily BW loads. The only connection I can see is a coincidence where a heavy load overlaps with the snapshot backup and causes the failure.
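One way to make this correlation explicit is to compare the failure timestamps from SM21/ST22 with the snapshot windows from the VM team. The sketch below is only an illustration in Python: the timestamps are invented sample data, not values from our systems, and the 5-minute margin around each window is an arbitrary assumption.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM' string into a datetime."""
    return datetime.strptime(ts, FMT)


def overlaps(failure, window, margin):
    """True if the failure time falls inside the window, widened by margin."""
    start, end = window
    return start - margin <= failure <= end + margin


def correlate(failures, snapshots, margin=timedelta(minutes=5)):
    """Split failures into inside/outside a snapshot window, and list the
    snapshot windows that passed without any failure."""
    inside = [f for f in failures if any(overlaps(f, w, margin) for w in snapshots)]
    outside = [f for f in failures if f not in inside]
    quiet = [w for w in snapshots if not any(overlaps(f, w, margin) for f in failures)]
    return inside, outside, quiet


# Hypothetical sample data, invented for illustration only.
snapshots = [
    (parse("2015-04-10 01:00"), parse("2015-04-10 01:20")),
    (parse("2015-04-11 01:00"), parse("2015-04-11 01:20")),
    (parse("2015-04-12 01:00"), parse("2015-04-12 01:20")),
]
failures = [parse("2015-04-10 01:12"), parse("2015-04-12 14:30")]

inside, outside, quiet = correlate(failures, snapshots)
print(f"{len(inside)} failure(s) inside a snapshot window, {len(outside)} outside")
print(f"{len(quiet)} snapshot window(s) with no failure")
```

If every failure falls inside a snapshot window while most windows pass quietly, that would fit the idea that the snapshot is necessary but not sufficient, and that load at the same moment is the second factor.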
As a precautionary measure, we are working with the application team and the VM team to schedule the snapshot backup for a time when nothing is running.
As the issue is not reproducible, we may not be able to prove immediately that this has fixed it, but the behavior can be observed over the next few months.
Will keep this thread posted with the solution.
Thanks,
Manas
Hi Manas,
If you are still facing this kind of disconnection after this change, please prepare a performance log as per Note 1478133 and open a support incident under component BC-OP-NT-ESX. We can forward it to our VMware expert colleagues.
Regards,
Luis