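An x86 machine that had been running for several years started logging errors after I replaced its SSD last year. I don't remember the exact error, only that it included something like: status: { DRDY } ata3: hard resetting link. The errors became more and more frequent, and eventually the disk could not be reached at all.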
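During that time I reseated the SATA cable, switched ports, and even returned two SSDs. The problem only went away after I replaced both the PCIe-to-SATA card and the SATA cable, so I can't tell which of the two was at fault.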
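In October 2023 I simply replaced the motherboard (a brand-new unit of an older model), the RAM, and the CPU, and swapped the SATA cables as well.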
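After that the system ran stably for almost a full year. While reading the logwatch report for September 21 this year, I suddenly found an ATA error similar to the earlier ones.
The full error is as follows: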
Sep 21 22:55:35 yxhserver kernel: ata3.00: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Sep 21 22:55:48 yxhserver kernel: ata3.00: irq_stat 0x00400040, connection status changed
Sep 21 22:55:48 yxhserver kernel: ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
Sep 21 22:55:48 yxhserver kernel: ata3.00: failed command: FLUSH CACHE EXT
Sep 21 22:55:48 yxhserver kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 10
res 40/00:48:c0:86:07/00:00:16:00:00/40 Emask 0x50 (ATA bus error)
Sep 21 22:55:48 yxhserver kernel: ata3.00: status: { DRDY }
Sep 21 22:55:48 yxhserver kernel: ata3: hard resetting link
Sep 21 22:55:48 yxhserver kernel: ata3: link is slow to respond, please be patient (ready=0)
Sep 21 22:55:48 yxhserver kernel: ata3: COMRESET failed (errno=-16)
Sep 21 22:55:48 yxhserver kernel: ata3: hard resetting link
Sep 21 22:55:48 yxhserver kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 21 22:55:48 yxhserver kernel: ata3.00: configured for UDMA/133
Sep 21 22:55:48 yxhserver kernel: ata3.00: retrying FLUSH 0xea Emask 0x50
Sep 21 22:55:48 yxhserver kernel: ata3: EH complete
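The SErr value in the first log line can be decoded by hand to cross-check the flag names the kernel prints on the SError line. Below is a minimal Python sketch, assuming the SATA SError bit layout and the short names used in Linux libata messages; the helper name decode_serror is made up for illustration.

```python
# Bit positions of the SATA SError register, paired with the short names
# the Linux kernel prints in "SError: { ... }" lines.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all known flags set in an SError value.

    Hypothetical helper, not part of any library.
    """
    return [name for bit, name in sorted(SERROR_BITS.items())
            if (value >> bit) & 1]


if __name__ == "__main__":
    # Value taken from the "exception Emask 0x50 ... SErr 0x4090800" line.
    print(decode_serror(0x4090800))
```

For 0x4090800 this gives HostInt, PHYRdyChg, 10B8B and DevExch, the same set the kernel printed. PHYRdyChg and DevExch fit the "connection status changed" message, meaning the link dropped and came back, but on their own they do not say whether the cable, the port, the power feed or the drive is responsible.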
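This is the first time this error has been reported, and I'm not sure what caused it. The mechanical hard drive installed alongside the SSD shows no problems.
The SMART data for the SSD is as follows: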
SMART Attributes Data Structure revision number: 20
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0013 100 100 050 Pre-fail Always - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 7898
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 181
175 Program_Fail_Count_Chip 0x0022 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0032 000 000 000 Old_age Always - 0
184 End-to-End_Error 0x0032 000 000 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 000 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 62
190 Airflow_Temperature_Cel 0x0022 024 024 000 Old_age Always - 24 (Min/Max 10/39)
196 Reallocated_Event_Count 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0012 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 4112
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 2888
The SMART data for the mechanical hard drive in the same machine is as follows:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 173 168 021 Pre-fail Always - 2350
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 942
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 059 059 000 Old_age Always - 30240
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 934
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 95
193 Load_Cycle_Count 0x0032 190 190 000 Old_age Always - 32626
194 Temperature_Celsius 0x0022 110 098 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 2
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
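To compare the two drives without reading the tables by eye, the attribute rows can be parsed and a single counter pulled out. Below is a small Python sketch, assuming the rows keep the ten-column layout shown above; parse_smart_rows is a hypothetical helper, and the sample text is copied from the tables in this post.

```python
def parse_smart_rows(text):
    """Parse SMART attribute rows into {attribute_name: row dict}.

    Hypothetical helper. Assumes the ten-column layout:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # header lines such as "ID# ATTRIBUTE_NAME ..." are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            # RAW_VALUE may contain spaces, e.g. "24 (Min/Max 10/39)".
            "raw": " ".join(fields[9:]),
        }
    return rows


SSD = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0013 100 100 050 Pre-fail Always - 0
190 Airflow_Temperature_Cel 0x0022 024 024 000 Old_age Always - 24 (Min/Max 10/39)
199 UDMA_CRC_Error_Count 0x0012 100 100 000 Old_age Always - 0
"""

HDD = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 2
"""

if __name__ == "__main__":
    for label, text in (("SSD", SSD), ("HDD", HDD)):
        crc = parse_smart_rows(text)["UDMA_CRC_Error_Count"]["raw"]
        print(label, "UDMA_CRC_Error_Count raw =", crc)
```

On the data in this post the SSD reports 0 and the mechanical drive reports 2. UDMA_CRC_Error_Count is generally read as a count of transfer errors on the link between drive and host, and it is cumulative over the drive's life, so the 2 on the older drive may well predate the current board. Recording both values now and checking whether either grows after the next ata3 event may help narrow things down.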
The system is Debian bookworm. I can't pin down where the problem is; anyone who has run into something similar is welcome to join the discussion.
The problem showed up again on 2024-09-30. I still can't find what is causing it, which is really frustrating.
WARNING: Kernel Errors Present
res 40/00:e8:a0:11:84/00:00:13:00:00/40 Emask 0x50 (ATA bus error) ...: 1 Time(s)
ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100) ...: 1 Time(s)
ata3: SError: { HostInt PHYRd ...: 1 Time(s)
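Since the earlier failure began as occasional errors that grew more frequent, it may be worth tracking how often the link events occur per day and per port. Below is a rough Python sketch, assuming syslog-style kernel lines like the ones quoted earlier in this post; the function name and the choice of marker strings are illustrative only.

```python
import re
from collections import Counter

# Matches e.g. "Sep 21 22:55:48 yxhserver kernel: ata3: hard resetting link"
LINE_RE = re.compile(
    r"^(?P<day>\w{3}\s+\d+) \d\d:\d\d:\d\d \S+ kernel: "
    r"(?P<port>ata\d+)(?:\.\d+)?: (?P<msg>.*)$"
)

# Message fragments treated as "link trouble" (an illustrative choice).
MARKERS = ("exception Emask", "hard resetting link", "COMRESET failed")


def count_link_events(lines):
    """Count marker messages per (day, port, marker)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        for marker in MARKERS:
            if marker in match.group("msg"):
                key = (match.group("day"), match.group("port"), marker)
                counts[key] += 1
    return counts


# Lines copied from the September 21 log in this post.
SAMPLE = [
    "Sep 21 22:55:35 yxhserver kernel: ata3.00: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen",
    "Sep 21 22:55:48 yxhserver kernel: ata3: hard resetting link",
    "Sep 21 22:55:48 yxhserver kernel: ata3: COMRESET failed (errno=-16)",
    "Sep 21 22:55:48 yxhserver kernel: ata3: hard resetting link",
    "Sep 21 22:55:48 yxhserver kernel: ata3: EH complete",
]

if __name__ == "__main__":
    for key, n in sorted(count_link_events(SAMPLE).items()):
        print(key, n)
```

The same function can be fed lines from wherever the kernel log is kept on the system. A count that climbs from week to week would match the pattern seen on the old hardware, while isolated events months apart would point more toward a one-off glitch.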