MSCS cluster is going offline in Production

Former Member
0 Kudos

Hi all,

We have an MSCS cluster configured on Windows Server 2003 SP2 (64-bit) with SQL Server 2005 and ECC 6.0.

The cluster is taking SQL Server offline, and after 3 hours it comes online again. The log messages below were found when SQL Server was taken offline.

We have already opened a message with Microsoft, but there is still no proper response. Please suggest whether we need to apply any new security patches.

Log file:

SQL Server has encountered 216 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [K:\PBWDATA2\PBWDATA2.ndf] in database [PBW] (5). The OS file handle is 0x00000000000007B8. The offset of the latest long I/O is: 0x00000757c30000

2008-06-23 03:54:52.50 spid3s SQL Server has encountered 2659 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\PBWDATA3\PBWDATA3.ndf] in database [PBW] (5). The OS file handle is 0x00000000000007BC. The offset of the latest long I/O is: 0x000007597a0000

2008-06-23 04:00:43.45 spid3s SQL Server has encountered 586 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\PBWDATA3\PBWDATA3.ndf] in database [PBW] (5). The OS file handle is 0x00000000000007BC. The offset of the latest long I/O is: 0x0000083f530000

2008-06-23 04:00:43.45 spid3s SQL Server has encountered 2036 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [J:\PBWDATA1\PBWDATA1.mdf] in database [PBW] (5). The OS file handle is 0x00000000000000E4. The offset of the latest long I/O is: 0x0000083be30000

2008-06-23 04:00:53.45 spid3s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [J:\TMPDATA1\tempdb.mdf] in database [tempdb] (2). The OS file handle is 0x0000000000000AF4. The offset of the latest long I/O is: 0x000000f9a00000

2008-06-23 04:01:28.64 spid3s SQL Server has encountered 301 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [K:\PBWDATA2\PBWDATA2.ndf] in database [PBW] (5). The OS file handle is 0x00000000000007B8. The offset of the latest long I/O is: 0x0000085e450000

2008-06-23 04:02:45.96 spid52 Configuration option 'Agent XPs' changed from 1 to 0. Run the RECONFIGURE statement to install.

2008-06-23 04:02:50.57 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:02:50.57 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.54]

2008-06-23 04:02:50.57 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:02:50.57 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.77]

2008-06-23 04:02:50.60 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:02:50.60 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.54]

2008-06-23 04:02:50.62 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:02:50.62 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.54]

2008-06-23 04:02:50.64 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:03:13.05 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:03:13.05 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.77]

2008-06-23 04:03:13.05 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:03:13.05 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.54]

2008-06-23 04:03:13.06 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:03:13.06 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.54]

2008-06-23 04:03:13.06 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:03:13.06 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.54]

2008-06-23 04:03:13.07 spid5s SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.

2008-06-23 04:03:13.07 spid5s SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

2008-06-23 04:03:13.07 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:03:13.07 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.54]

2008-06-23 04:03:13.18 Logon Error: 18451, Severity: 14, State: 1.

2008-06-23 04:03:13.18 Logon Login failed for user 'SAPPBWDB'. Only administrators may connect at this time. [CLIENT: 10.16.148.54]

Thanks,

Subhash.G

Accepted Solutions (0)

Answers (3)

Former Member
0 Kudos

Thanks

Former Member
0 Kudos

The cluster log can be found here: c:\windows\cluster\cluster.log

My initial thought is that the storage is experiencing issues. The cluster log may give you more information. Also, check whether the storage logs show any issues.
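
To see which data files and drives are hit hardest, the long I/O warnings can be totalled per file. Below is a rough Python sketch, not a definitive tool: the function names are made up for illustration, and it assumes the warning text has the same wording as the messages posted above (shortened copies are used as sample input).

```python
import re
from collections import defaultdict

# Matches the warning text as it appears in the error log quoted in this thread.
LONG_IO = re.compile(
    r"encountered (\d+) occurrence\(s\) of I/O requests taking longer than "
    r"\d+ seconds to complete on file \[([^\]]+)\]"
)

def summarize_long_io(lines):
    """Total the reported long-I/O occurrences per data file."""
    totals = defaultdict(int)
    for line in lines:
        match = LONG_IO.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return dict(totals)

def by_drive(totals):
    """Collapse per-file totals into per-drive-letter totals."""
    drives = defaultdict(int)
    for path, count in totals.items():
        drives[path[:2].upper()] += count
    return dict(drives)

# Shortened copies of the messages posted above (raw strings keep the backslashes).
SAMPLE_LINES = [
    r"SQL Server has encountered 216 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [K:\PBWDATA2\PBWDATA2.ndf] in database [PBW] (5).",
    r"2008-06-23 03:54:52.50 spid3s SQL Server has encountered 2659 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\PBWDATA3\PBWDATA3.ndf] in database [PBW] (5).",
    r"2008-06-23 04:00:43.45 spid3s SQL Server has encountered 586 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\PBWDATA3\PBWDATA3.ndf] in database [PBW] (5).",
    r"2008-06-23 04:00:43.45 spid3s SQL Server has encountered 2036 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [J:\PBWDATA1\PBWDATA1.mdf] in database [PBW] (5).",
    r"2008-06-23 04:00:53.45 spid3s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [J:\TMPDATA1\tempdb.mdf] in database [tempdb] (2).",
    r"2008-06-23 04:01:28.64 spid3s SQL Server has encountered 301 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [K:\PBWDATA2\PBWDATA2.ndf] in database [PBW] (5).",
]

if __name__ == "__main__":
    per_file = summarize_long_io(SAMPLE_LINES)
    # Print the worst-affected files first.
    for path, count in sorted(per_file.items(), key=lambda item: -item[1]):
        print(f"{count:6d}  {path}")
    print(by_drive(per_file))
```

If the warnings are spread across all of J:, K: and L: at the same time, as they are in the sample, that would point more towards the shared storage path than towards a single file or volume.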

Edited by: Kevin Lin on Jun 26, 2008 11:38 PM

Former Member
0 Kudos

Hi,

I am not able to find the cluster logs. The system went down again today at 3:45 AM and came back up after 5 minutes.

Below is the information from the Event Viewer:

************************************************************************************

[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

This instance of SQL Server last reported using a process ID of 7676 at 6/27/2008 3:41:51 AM (local) 6/27/2008 10:41:51 AM (UTC). This is an informational message only; no user action is required.

Registry startup parameters:

SQL Server is starting at normal priority base (=7). This is an informational message only. No user action is required.

A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: 0 seconds. Working set (KB): 50072, committed (KB): 121280, memory utilization: 41%%.

The configuration of the AdminConnection\TCP protocol in the SQL instance INSTANCE_A is not valid.

Attempting to recover in-doubt distributed transactions involving Microsoft Distributed Transaction Coordinator (MS DTC). This is an informational message only. No user action is required.

Database mirroring has been enabled on this instance of SQL Server.

Starting up database 'master'.

Recovery is writing a checkpoint in database 'master' (1). This is an informational message only. No user action is required.

CHECKDB for database 'master' finished without errors on 2008-06-26 20:10:04.977 (local time). This is an informational message only; no user action is required.

Recovery of database 'PBW' (5) is 19%% complete (approximately 72 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Recovery of database 'PBW' (5) is 100%% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Recovery is writing a checkpoint in database 'PBW' (5). This is an informational message only. No user action is required.

********************************************************************************************************

Please suggest what we should do.

Thanks,

Subhash.G

Edited by: subhash gadde on Jun 27, 2008 6:44 PM

Former Member
0 Kudos

How is your DB mirroring implemented? Can you describe where the primary and secondary SQL Server instances are installed?

Is it ...

MSCS 1 (2 or more physical servers) - Running primary SQL database server.

MSCS 2 (2 or more physical servers) - Running secondary SQL database server.

This is the only correct way I can see of setting it up. If this is not the case, can you describe how you have it set up?

Former Member
0 Kudos

Are you guys running a backup copy or anything else that might be hauling large files from this server? The excerpt from your errorlog says:

"A significant part of sql server process memory has been paged out. This may result in a performance degradation"

This sometimes happens when you run a copy process for very large files (like backup files) on the server where SQL is running: SQL gets paged out and is eventually shut down. Make sure that the copy process is not running on the SQL server. Instead of pushing from the source server, pull from the target server. There is also KB article 920739 from MS that you might want to check and implement to avoid such problems.

Regards

Fatih

Former Member
0 Kudos

Hi, DB mirroring is not configured; only clustering is in place.

We run the full backup at 20:30 every day. The system goes down after 3:00 AM.

Thanks,

Subhash.G

Former Member
0 Kudos

How are your DB files mounted to the system? Do these come in from a SAN or a NAS? If so, then it may be that another system is saturating your storage system.

Former Member
0 Kudos

Hi all,

We have LiteSpeed for DB backups and NetBackup for file backups. We observed that when these backup jobs complete, we get a lot of warnings about I/O requests taking longer than 15 seconds, and the system goes down. After 5 to 10 minutes it comes back up automatically.

Please suggest..

Thanks,

Subhash

Former Member
0 Kudos

Is there a process that runs at the end of the backups that copies a file? At the end of the backup process, does your backup software try to back up files that were locked earlier in the backup process? Is there a log in the backup software that might show some of this?

Just wondering.

Former Member
0 Kudos

I agree w/ the above; this definitely sounds like a process saturating the storage system. I would check the active processes on your system at that time, use perfmon to monitor the storage I/O, and try to isolate what's happening on the server. Whatever is happening, it is contending with SQL for storage I/O and causing SQL to go down.
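
Once the disk latency counters are logged to CSV, the slow samples can be picked out with a few lines of Python. This is only a sketch under assumptions: the CSV layout (timestamp in the first column, one counter per remaining column), the host name NODE1, the sample values and the 0.05 second threshold are all made up for illustration and should be adjusted to your own export.

```python
import csv
import io

def slow_samples(csv_text, threshold_seconds=0.05):
    """Return (timestamp, counter, value) for every sample above the threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    hits = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        timestamp = row[0]
        for counter, raw in zip(header[1:], row[1:]):
            try:
                value = float(raw)
            except ValueError:
                continue  # blank or non-numeric sample
            if value > threshold_seconds:
                hits.append((timestamp, counter, value))
    return hits

# Hypothetical export: the host name, counters and values are invented.
SAMPLE_CSV = r'''"(PDH-CSV 4.0)","\\NODE1\PhysicalDisk(1 K:)\Avg. Disk sec/Transfer","\\NODE1\PhysicalDisk(2 L:)\Avg. Disk sec/Transfer"
"06/27/2008 03:40:00.000","0.004","0.006"
"06/27/2008 03:40:15.000","0.850","2.300"
"06/27/2008 03:40:30.000"," ","0.007"
'''

if __name__ == "__main__":
    for timestamp, counter, value in slow_samples(SAMPLE_CSV):
        print(f"{timestamp}  {value:8.3f} s  {counter}")
```

Comparing the timestamps of the slow samples with the backup job schedule should show whether the two line up.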

Former Member
0 Kudos

Looks like at 2008-06-23 04:02:45.96 the SQL server initiated a shutdown.

The line in the error log indicating 'Agent XPs' changed from 1 to 0 is normal, not the cause. This is done every time during a SQL shutdown.

Did you see anything in the event logs or cluster logs during this time?
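
To line things up, it may help to pull out only the error log entries from the few minutes before the stop request. Here is a rough Python sketch, assuming each entry starts with a timestamp in the same format as the log posted above; the function names and the 5-minute window are just illustrative choices.

```python
from datetime import datetime, timedelta

# Text of the shutdown message as it appears in the error log in this thread.
STOP_MARKER = "terminating in response to a 'stop' request"

def parse_stamp(line):
    """Return the leading timestamp of an error log line, or None if absent."""
    try:
        return datetime.strptime(line[:22], "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None

def events_before_stop(lines, window_minutes=10):
    """Return the timestamped lines within the window before the first stop request."""
    stamped = [(parse_stamp(line), line) for line in lines]
    stamped = [(stamp, line) for stamp, line in stamped if stamp is not None]
    stops = [stamp for stamp, line in stamped if STOP_MARKER in line]
    if not stops:
        return []
    stop = stops[0]
    start = stop - timedelta(minutes=window_minutes)
    return [line for stamp, line in stamped if start <= stamp <= stop]

# Shortened copies of entries from the error log posted above.
SAMPLE_LOG = [
    "2008-06-23 03:54:52.50 spid3s SQL Server has encountered 2659 occurrence(s) of I/O requests taking longer than 15 seconds",
    "2008-06-23 04:00:43.45 spid3s SQL Server has encountered 586 occurrence(s) of I/O requests taking longer than 15 seconds",
    "2008-06-23 04:02:45.96 spid52 Configuration option 'Agent XPs' changed from 1 to 0.",
    "2008-06-23 04:03:13.07 spid5s SQL Server is terminating in response to a 'stop' request from Service Control Manager.",
]

if __name__ == "__main__":
    for entry in events_before_stop(SAMPLE_LOG, window_minutes=5):
        print(entry)
```

The same time window can then be checked in the event logs and the cluster log.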

Former Member
0 Kudos

Hi,

Below is the information I found in the Event Viewer:

4:12 Login failed for user 'SAPPBWDB'. Only administrators may connect at this time.

4:11 SQL Server blocked access to procedure 'dbo.sp_get_sqlagent_properties' of component 'Agent XPs' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'Agent XPs' by using sp_configure. For more information about enabling 'Agent XPs', see "Surface Area Configuration" in SQL Server Books Online.

4:10 The connection has been lost with Microsoft Distributed Transaction Coordinator (MS DTC). Recovery of any in-doubt distributed transactions involving Microsoft Distributed Transaction Coordinator (MS DTC) will begin once the connection is re-established. This is an informational message only. No user action is required.

4:10 [sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = 0; message = [Microsoft][SQL Native Client]Unable to complete login process due to delay in opening server connection

[sqsrvres] ODBC sqldriverconnect failed

[sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

[sqsrvres] OnlineThread: QP is not online.

[sqsrvres] printODBCError: sqlstate = 08S01; native error = 40; message = [Microsoft][SQL Native Client]Communication link failure

4:09 [sqsrvres] printODBCError: sqlstate = 08S01; native error = 40; message = [Microsoft][SQL Native Client]TCP Provider: The specified network name is no longer available.

4:09 [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

4:02 SQL Server has encountered 590 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\PBWDATA3\PBWDATA3.ndf] in database [PBW] (5). The OS file handle is 0x00000000000007C8. The offset of the latest long I/O is: 0x000006354a0000

4:02 SQL Server has encountered 3565 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [K:\PBWDATA2\PBWDATA2.ndf] in database [PBW] (5). The OS file handle is 0x00000000000007AC. The offset of the latest long I/O is: 0x00000611240000

4:02 SQL Server has encountered 539 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [J:\PBWDATA1\PBWDATA1.mdf] in database [PBW] (5). The OS file handle is 0x00000000000007A0. The offset of the latest long I/O is: 0x00000613180000

Thanks,

Subhash.G

Former Member
0 Kudos

Hi,

Please let me know where I can find the MSCS log files in Windows.

Thanks,

Subhash