on 10-15-2009 6:47 AM
Hello!
I'm running MaxDB 7.7.06 BUILD 009-121-202-944 on CentOS 5.3.
After a power loss today I cannot start the database; it fails with the error "-24994 connection broken server state 4".
I have a software RAID-1 (mdadm), and during the disk resync (sdb=>sda) I got multiple errors on sda.
=======================================================================
Here is part of my converted KnlMsg file:
=======================================================================
Thread 0xA91 Task - 2009-10-15 11:27:15 ERR Messages 7: Begin of dump of registered messages,_FILE=Msg_List.cpp,_LINE=3539
Thread 0xA91 Task - 2009-10-15 11:27:11 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
2009-10-15 11:27:09 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
Thread 0xA91 Task - 2009-10-15 11:27:13 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
2009-10-15 11:27:11 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
2009-10-15 11:27:09 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
Thread 0xA91 Task - 2009-10-15 11:27:15 ERR IOMan 20032: Bad converter page,_FILE=IOMan_DataFileSystem.cpp,_LINE=2429
2009-10-15 11:27:15 ERR IOMan 5: Failed consistency check on read page,BLOCK_NO=14274,VOLUME_ID=1,VOLUME_TYPE=data,_FILE=IOMan_Volume.cpp,_LINE=412
2009-10-15 11:27:13 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
2009-10-15 11:27:11 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
2009-10-15 11:27:09 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
Thread 0xA91 Task - 2009-10-15 11:27:00 RTEHSS 13951: No hot standby node configured -> No HotStandby configuration
Thread 0xA91 Task - 2009-10-15 11:27:09 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78
Thread 0xA91 Task - 2009-10-15 11:27:15 ERR Messages 8: End of the message list registry dump,_FILE=Msg_List.cpp,_LINE=3567
Thread 0xA91 Task - 2009-10-15 11:27:15 RTEKernel 117: Calling AK dump for T72
Thread 0xA91 Task - 2009-10-15 11:27:15 ERR AK CACHE 51105: DUMPING SESSION CATCACHE: 1
Thread 0xA8A Task - 2009-10-15 11:27:15 TASKING 12825: state 30 before shutkill(1)
Thread 0xA8A Task 3 2009-10-15 11:27:15 Trace 20000: Start flush kernel trace
Thread 0xA91 Task - 2009-10-15 11:27:15 RTEKernel 111: Tracewriter resumed
Thread 0xA91 Task - 2009-10-15 11:27:15 RTEKernel 94: Waiting for tracewriter to finish work
Thread 0xA91 Task - 2009-10-15 11:27:15 RTEKernel 116: Tracewriter termination timeout: 60 seconds
Thread 0xA8A Task 3 2009-10-15 11:27:15 Trace 20001: Stop flush kernel trace
Thread 0xA8A Task 3 2009-10-15 11:27:15 Trace 20002: Start flush kernel dump
Thread 0xA8A Task 3 2009-10-15 11:27:15 Trace 20003: Stop flush kernel dump
Thread 0xA8A Task 3 2009-10-15 11:27:15 RTEKernel 110: Releasing tracewriter
Thread 0xA91 Task - 2009-10-15 11:27:15 TASKING 12822: Thread 2687 joining
Thread 0xA91 Task - 2009-10-15 11:27:15 TENANT 13008: Requestor for tenant database MYDATABASE has stopped
Thread 0xA91 Task - 2009-10-15 11:27:15 RTEThread 13: The thread LegacyRequestor is finished
Thread 0xA7D Task - 2009-10-15 11:27:15 RTE 20214: CONSOLE thread stopped
Thread 0xA91 Task - 2009-10-15 11:27:15 TASKING 12822: Thread 2686 joining
Thread 0xA91 Task - 2009-10-15 11:27:15 TASKING 12822: Thread 2688 joining
Thread 0xA91 Task - 2009-10-15 11:27:15 TASKING 12822: Thread 2685 joining
Thread 0xA91 Task - 2009-10-15 11:27:15 WNG CONNECT 12464: Releasing T72 kernel abort
Thread 0xA91 Task - 2009-10-15 11:27:15 TASKING 12822: Thread 2681 joining
Thread 0xA91 Task - 2009-10-15 11:27:15 TASKING 12822: Thread 2682 joining
Thread 0xA91 Task - 2009-10-15 11:27:15 RTEKernel 58: Backup of diagnostic files will be forced at next restart
Thread 0xA91 Task - 2009-10-15 11:27:15 RTEKernel 118: SERVERDB MYDATABASE has stopped
2009-10-15 11:27:15 RTEKernel 14: Kernel version: Kernel 7.7.06 Build 009-121-202-944
Thread 0xA91 Task - 2009-10-15 11:27:15 RunTime 3: State changed from ABORT to STOPPED
Thread 0xA91 Task - 2009-10-15 11:27:15 TENANT 13005: Tenant database MYDATABASE has stopped
Thread 0xA90 Task - 2009-10-15 11:27:15 12768: UKT7 stopped
Thread 0xA8F Task - 2009-10-15 11:27:15 12768: UKT6 stopped
Thread 0xA8B Task - 2009-10-15 11:27:15 12768: UKT2 stopped
Thread 0xA8D Task - 2009-10-15 11:27:15 12768: UKT4 stopped
Thread 0xA8C Task - 2009-10-15 11:27:15 12768: UKT3 stopped
Thread 0xA92 Task - 2009-10-15 11:27:15 12768: UKT9 stopped
Thread 0xA8E Task - 2009-10-15 11:27:15 12768: UKT5 stopped
Thread 0xA6E Task - 2009-10-15 11:27:15 RTEKernel 64: rtedump written by running kernel already (start time kernel: 2009-10-15 11:27:00, last access time rtedump: 2009-10-15 11:27:15)
Thread 0xA6E Task - 2009-10-15 11:27:15 ERR RTEKernel 102: Kernel exited without core and exit status 0x6,_FILE=RTEKernel_Termination.cpp,_LINE=634
Thread 0xA6E Task - 2009-10-15 11:27:15 ERR RTEKernel 98: Kernel exited due to signal 6 (SIGABRT),_FILE=RTEKernel_Termination.cpp,_LINE=680
Thread 0xA6E Task - 2009-10-15 11:27:15 RunTime 3: State changed from STOPPED to OFFLINE
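A log like the one above is easier to triage when the ERR entries are counted per component and message number. The following is only a minimal sketch: the regular expression is inferred from the lines pasted in this post, not from any MaxDB specification, and the names `ERR_LINE` and `summarize_errors` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the pasted KnlMsg lines (an assumption, not an
# official format): timestamp, the literal "ERR", a component name that
# may contain spaces (e.g. "AK CACHE"), a numeric message id, then text.
ERR_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+ERR\s+"
    r"(?P<component>[A-Za-z ]+?)\s+(?P<msg_id>\d+):\s*(?P<text>.*)"
)

def summarize_errors(lines):
    """Count ERR entries per (component, message id)."""
    counts = Counter()
    for line in lines:
        match = ERR_LINE.search(line)
        if match:
            key = (match.group("component"), int(match.group("msg_id")))
            counts[key] += 1
    return counts

# A few lines copied from the log above.
sample = [
    "Thread 0xA91 Task - 2009-10-15 11:27:11 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78",
    "2009-10-15 11:27:09 ERR Converter 20034: Bad converter page - read page is no index page,_FILE=Converter_IndexPage.cpp,_LINE=78",
    "2009-10-15 11:27:15 ERR IOMan 5: Failed consistency check on read page,BLOCK_NO=14274,VOLUME_ID=1,VOLUME_TYPE=data,_FILE=IOMan_Volume.cpp,_LINE=412",
    "Thread 0xA91 Task - 2009-10-15 11:27:15 ERR AK CACHE 51105: DUMPING SESSION CATCACHE: 1",
    "Thread 0xA91 Task - 2009-10-15 11:27:15 RTEKernel 117: Calling AK dump for T72",
]

summary = summarize_errors(sample)
for (component, msg_id), count in summary.most_common():
    print(f"{component} {msg_id}: {count}")
```

Note that the dump of registered messages repeats the same entries several times, so the counts show which messages dominate the log rather than how many distinct failures occurred.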
=======================================================================
smartctl --all /dev/sda
=======================================================================
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10 family
Device Model: ST3250310AS
Serial Number: 9RY01C4W
Firmware Version: 3.AAA
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Oct 15 11:31:03 2009 NOVST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 106 100 006 Pre-fail Always - 11304615
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 87
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 5805208
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1223
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 87
187 Reported_Uncorrect 0x0032 082 082 000 Old_age Always - 18
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 064 057 045 Old_age Always - 36 (Lifetime Min/Max 33/36)
194 Temperature_Celsius 0x0022 036 043 000 Old_age Always - 36 (0 22 0 0)
195 Hardware_ECC_Recovered 0x001a 068 064 000 Old_age Always - 2613598
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 18 occurred at disk power-on lifetime: 1222 hours (50 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 49 4c 40 e4 Error: UNC at LBA = 0x04404c49 = 71322697
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 47 4c 40 e4 00 00:08:40.753 READ DMA
27 00 00 00 00 00 e0 00 00:08:40.750 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:08:40.750 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:08:40.747 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:08:37.186 READ NATIVE MAX ADDRESS EXT
Error 17 occurred at disk power-on lifetime: 1222 hours (50 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 49 4c 40 e4 Error: UNC at LBA = 0x04404c49 = 71322697
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 47 4c 40 e4 00 00:08:40.753 READ DMA
27 00 00 00 00 00 e0 00 00:08:40.750 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 00:08:40.750 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 00:08:40.747 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 00:08:37.186 READ NATIVE MAX ADDRESS EXT
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
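The LBA in the "Error: UNC at LBA" lines can be checked against the register dump by hand. The sketch below assumes standard 28-bit LBA addressing, where the low nibble of DH holds the top four address bits, followed by CH, CL and SN; the function name `lba28_from_registers` is made up for illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the ATA task-file registers.

    Only the low nibble of DH carries address bits; the high nibble
    holds the LBA-mode and device-select flags.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Registers reported after the failed command (ER ST SC SN CL CH DH).
error_lba = lba28_from_registers(sn=0x49, cl=0x4C, ch=0x40, dh=0xE4)

# Registers of the READ DMA command (c8) that triggered the error.
start_lba = lba28_from_registers(sn=0x47, cl=0x4C, ch=0x40, dh=0xE4)
sector_count = 0x08

print(hex(error_lba), error_lba)   # 0x4404c49 71322697
print(error_lba - start_lba)       # 2
```

So the READ DMA asked for 8 sectors starting at LBA 71322695, and the drive reported an uncorrectable read at offset 2 within that request, which matches the LBA smartctl prints.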
Edited by: Alex Petrov on Oct 15, 2009 7:47 AM
> I'm running MaxDB 7.7.06 BUILD 009-121-202-944 on CentOS 5.3.
> After power loss today I cannot start database with error "-24994 connection broken server state 4".
> I have a software RAID-1 (mdadm), and during disk resync (sdb=>sda) i've got some multiple errors on sda.
>
> Thread 0xA91 Task - 2009-10-15 11:27:15 ERR Messages 7: Begin of dump of registered messages,_FILE=Msg_List.cpp,_LINE=3539
> Thread 0xA91 Task - 2009-10-15 11:27:11 ERR Converter 20034: Bad converter page - read page is no index
Hello Alex,
time to get your backup!
It seems the "software RAID" did not work as reliably as you may have wished...
Anyhow, the converter - the single most important data structure for db restart - has been corrupted.
There's no way to repair this.
All you can do is to restore the last data backup and recover from there.
Sorry, but that's how it is.
regards,
Lars