KVM Host Faulty Disk

Getting the Report

Hard drives have moving parts that wear out, so eventually a drive will fail. If smartmontools is installed and configured, an email like the following will show up in your inbox.

From root@kvm02.kallenberg.dk Wed Oct 18 23:14:11 2017
Subject: SMART error (SelfTest) detected on host: kvm02
To: <root@kvm02.kallenberg.dk>
X-Mailer: mail (GNU Mailutils 3.1.1)

This message was generated by the smartd daemon running on:

   host name:  kvm02
   DNS domain: kallenberg.dk

The following warning/error was logged by the smartd daemon:

Device: /dev/sdd, Self-Test Log error count increased from 0 to 1

Device info:
ST2000DM001-1CH164, S/N:Z2F0RD5S, WWN:5-000c50-050214cd9, FW:CC26, 2.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

Check the Disk Yourself

Just to be sure, run an extended (long) self-test yourself.

smartctl -t long /dev/sdd
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-3-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 230 minutes for test to complete.
Test will complete after Fri Oct 20 03:45:07 2017

Use smartctl -X to abort test.

Once the test has completed, take a look at the smartctl report.

smartctl -a /dev/sdd

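This prints a long report; the self-test log is the interesting part. In the log below, the three most recent tests stopped with a read failure at the same LBA (2144) with 90% of the test remaining, while the earlier tests completed without error.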

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      8057         2144
# 2  Extended offline    Completed: read failure       90%      8053         2144
# 3  Extended offline    Completed: read failure       90%      8051         2144
# 4  Extended offline    Completed without error       00%      7986         -
# 5  Extended offline    Completed without error       00%      7828         -
# 6  Extended offline    Completed without error       00%      7818         -
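The same check can be scripted. The sketch below assumes the log format shown above and counts the entries whose status reports a read failure. The function name count_read_failures is made up for illustration; smartctl -l selftest prints only the self-test log instead of the full report.

```shell
# Count self-test log entries that ended in a read failure.
# Reads smartctl self-test log text on stdin (format assumed to match the log above).
count_read_failures() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the function's status at 0.
    grep -c 'read failure' || true
}

# Usage on a real host (requires root):
#   smartctl -l selftest /dev/sdd | count_read_failures
```

A non-zero count on the most recent entries is a good reason to go ahead with the replacement.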

Replace the Disk

The SATA specification provides for hot swapping of SATA disks. That being said, it is not necessarily supported by all hardware, so check your controller and enclosure first. If they support it, a disk can be replaced in a running system.
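Before pulling a disk from a running system, it is worth confirming which physical drive sits behind /dev/sdd. A minimal sketch, assuming the "Serial Number:" line that smartctl -i prints for ATA disks; disk_serial is a hypothetical helper name. The serial it prints can be compared with the label on the drive and with the S/N in the smartd email.

```shell
# Print the serial number found in smartctl -i output read on stdin.
disk_serial() {
    # Split each line on ": " and print the value of the "Serial Number" field.
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

# Usage on a real host (requires root):
#   smartctl -i /dev/sdd | disk_serial
```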

Unplug Data Cable First

Unplug the SATA data cable first, or unplug the SATA power cable first?

The rule here is that your disk should share a ground with the rest of your hardware while you unplug the SATA data cable, to avoid an electrostatic discharge. For this reason, unplug the SATA data cable first, then the SATA power cable.
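This page simply unplugs the disk and lets md notice the failure. A more cautious variant, shown here only as a sketch and not as the procedure used on this host, marks the member as failed and removes it from the array before any cable is touched. The function name detach_from_array and the DRY_RUN variable are made up for illustration; --fail and --remove are regular mdadm manage-mode options.

```shell
# Mark a member as failed and remove it from an md array before unplugging.
# With DRY_RUN=1 the commands are only printed, not executed.
detach_from_array() {
    array=$1    # e.g. /dev/md2
    member=$2   # e.g. /dev/sdd1
    for action in --fail --remove; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "mdadm $array $action $member"
        else
            mdadm "$array" "$action" "$member" || return 1
        fi
    done
}

# Usage on a real host (requires root):
#   detach_from_array /dev/md2 /dev/sdd1
```

Setting DRY_RUN=1 first makes it easy to double-check the device names before anything is changed.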

Watch your syslog. The kernel logs the link going down and the device being detached.

Oct 19 17:25:39 kvm02 kernel: [33741.124466] ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Oct 19 17:25:39 kvm02 kernel: [33741.124885] ata4: irq_stat 0x00400040, connection status changed
Oct 19 17:25:39 kvm02 kernel: [33741.125403] ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
Oct 19 17:25:39 kvm02 kernel: [33741.126173] ata4: hard resetting link
Oct 19 17:25:40 kvm02 kernel: [33741.838857] ata4: SATA link down (SStatus 0 SControl 300)
Oct 19 17:25:45 kvm02 kernel: [33747.018752] ata4: hard resetting link
Oct 19 17:25:46 kvm02 kernel: [33747.335808] ata4: SATA link down (SStatus 0 SControl 300)
Oct 19 17:25:46 kvm02 kernel: [33747.336284] ata4: limiting SATA link speed to 1.5 Gbps
Oct 19 17:25:51 kvm02 kernel: [33752.394784] ata4: hard resetting link
Oct 19 17:25:51 kvm02 kernel: [33752.707393] ata4: SATA link down (SStatus 0 SControl 310)
Oct 19 17:25:51 kvm02 kernel: [33752.707907] ata4.00: disabled
Oct 19 17:25:51 kvm02 kernel: [33752.708806] ata4: EH complete
Oct 19 17:25:51 kvm02 kernel: [33752.709800] ata4.00: detaching (SCSI 3:0:0:0)
Oct 19 17:25:51 kvm02 kernel: [33752.715580] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
Oct 19 17:25:51 kvm02 kernel: [33752.716122] sd 3:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct 19 17:25:51 kvm02 kernel: [33752.716981] sd 3:0:0:0: [sdd] Stopping disk
Oct 19 17:25:51 kvm02 kernel: [33752.718035] sd 3:0:0:0: [sdd] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

A few seconds later the RAID array drops the disk and continues on the remaining device.

Oct 19 17:25:56 kvm02 kernel: [33757.333805] md/raid1:md2: Disk failure on sdd1, disabling device.
Oct 19 17:25:56 kvm02 kernel: [33757.333805] md/raid1:md2: Operation continuing on 1 devices.

Plug in SATA Power Cable First

When plugging in the new disk, the same rule applies in reverse. Plug in the SATA power cable first, so that the disk shares a ground with the rest of your hardware before you plug in the SATA data cable.

Syslog again shows what happens: the link comes up and the kernel attaches the new disk as sdd.

Oct 19 17:37:53 kvm02 kernel: [34474.296149] ata4: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
Oct 19 17:37:53 kvm02 kernel: [34474.296655] ata4: irq_stat 0x00000040, connection status changed
Oct 19 17:37:53 kvm02 kernel: [34474.297608] ata4: SError: { CommWake DevExch }
Oct 19 17:37:53 kvm02 kernel: [34474.298522] ata4: hard resetting link
Oct 19 17:37:53 kvm02 kernel: [34475.009165] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 19 17:37:53 kvm02 kernel: [34475.010535] ata4.00: ATA-9: ST2000DM001-1ER164, HP51, max UDMA/100
Oct 19 17:37:53 kvm02 kernel: [34475.011043] ata4.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Oct 19 17:37:53 kvm02 kernel: [34475.012921] ata4.00: configured for UDMA/100
Oct 19 17:37:53 kvm02 kernel: [34475.013514] ata4: EH complete
Oct 19 17:37:53 kvm02 kernel: [34475.014673] scsi 3:0:0:0: Direct-Access     ATA      ST2000DM001-1ER1 HP51 PQ: 0 ANSI: 5
Oct 19 17:37:53 kvm02 kernel: [34475.061545] sd 3:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Oct 19 17:37:53 kvm02 kernel: [34475.061616] sd 3:0:0:0: Attached scsi generic sg3 type 0
Oct 19 17:37:53 kvm02 kernel: [34475.062953] sd 3:0:0:0: [sdd] 4096-byte physical blocks
Oct 19 17:37:53 kvm02 kernel: [34475.064133] sd 3:0:0:0: [sdd] Write Protect is off
Oct 19 17:37:53 kvm02 kernel: [34475.064977] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Oct 19 17:37:53 kvm02 kernel: [34475.065045] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 19 17:37:53 kvm02 kernel: [34475.081297]  sdd: sdd1
Oct 19 17:37:53 kvm02 kernel: [34475.082399] sd 3:0:0:0: [sdd] Attached SCSI disk
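In the log above the replacement disk already carries a partition (sdd1). A blank disk would need a partition table first. One common approach, sketched here under the assumption that the surviving disk is /dev/sdc and that both disks should share the same layout, is to dump the table with sfdisk -d and feed it to sfdisk on the new disk. The function name clone_partition_table and the DRY_RUN variable are made up for illustration. Getting source and target the wrong way round overwrites the good disk's table, so the sketch refuses identical arguments and can print the pipeline instead of running it.

```shell
# Copy the partition table from a healthy disk to a replacement disk.
# With DRY_RUN=1 the pipeline is only printed, not executed.
clone_partition_table() {
    src=$1   # disk of the surviving array member, e.g. /dev/sdc
    dst=$2   # replacement disk, e.g. /dev/sdd
    if [ -z "$src" ] || [ -z "$dst" ] || [ "$src" = "$dst" ]; then
        echo "usage: clone_partition_table SOURCE_DISK TARGET_DISK" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sfdisk -d $src | sfdisk $dst"
    else
        sfdisk -d "$src" | sfdisk "$dst"
    fi
}

# Usage on a real host (requires root):
#   DRY_RUN=1 clone_partition_table /dev/sdc /dev/sdd
```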

Rebuilding the RAID

Once the kernel has detected the new disk, add its partition to the array.

mdadm --add /dev/md2 /dev/sdd1

You should get something like this in syslog.

Oct 19 17:38:31 kvm02 kernel: [34512.883729] md: bind<sdd1>
Oct 19 17:38:31 kvm02 kernel: [34512.958053] RAID1 conf printout:
Oct 19 17:38:31 kvm02 kernel: [34512.958056]  --- wd:1 rd:2
Oct 19 17:38:31 kvm02 kernel: [34512.958058]  disk 0, wo:0, o:1, dev:sdc1
Oct 19 17:38:31 kvm02 kernel: [34512.958060]  disk 1, wo:1, o:1, dev:sdd1
Oct 19 17:38:31 kvm02 kernel: [34512.958401] md: recovery of RAID array md2
Oct 19 17:38:31 kvm02 kernel: [34512.958669] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct 19 17:38:31 kvm02 kernel: [34512.959603] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 19 17:38:31 kvm02 kernel: [34512.960587] md: using 128k window, over a total of 1946025984k.
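While the recovery runs, its progress is also visible in /proc/mdstat. The sketch below assumes the usual mdstat layout, where the array's header line (for example "md2 : active raid1 ...") is followed by indented lines, one of which contains "recovery = N%". The function name rebuild_progress is made up for illustration.

```shell
# Print the recovery percentage of one md array from /proc/mdstat text on stdin.
# Prints nothing if the array is not currently recovering.
rebuild_progress() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { in_array = 1; next }  # header line of the requested array
        /^[^ \t]/             { in_array = 0 }        # any other unindented line ends its section
        in_array && /recovery *=/ {
            sub(/.*recovery *= */, "")                # drop everything up to the number
            sub(/%.*/, "")                            # drop the percent sign and the rest
            print
            exit
        }
    '
}

# Usage on a real host:
#   rebuild_progress md2 < /proc/mdstat
```

Running it under watch, or in a loop with sleep, gives a simple progress display.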

Eventually the synchronization completes.

Oct 19 21:50:46 kvm02 kernel: [49647.562788] md: md2: recovery done.
Oct 19 21:50:46 kvm02 kernel: [49647.669903] RAID1 conf printout:
Oct 19 21:50:46 kvm02 kernel: [49647.669905]  --- wd:2 rd:2
Oct 19 21:50:46 kvm02 kernel: [49647.669908]  disk 0, wo:0, o:1, dev:sdc1
Oct 19 21:50:46 kvm02 kernel: [49647.669910]  disk 1, wo:0, o:1, dev:sdd1
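The final state can be confirmed without reading syslog. A minimal sketch, assuming the usual /proc/mdstat layout in which the line right after the array's header ends in a member map such as [UU] (all members up) or [U_] (one member missing); array_is_complete is a hypothetical helper name. mdadm --detail /dev/md2 reports the same information in more detail.

```shell
# Succeed (exit status 0) if every member of the given md array is up,
# judged from /proc/mdstat text read on stdin.
array_is_complete() {
    awk -v md="$1" '
        # The line after the header ends in the member map, e.g. [UU] or [U_].
        $1 == md && $2 == ":" { found = 1; getline; status = $NF }
        END { exit !(found && status ~ /^\[U+\]$/) }
    '
}

# Usage on a real host:
#   array_is_complete md2 < /proc/mdstat && echo "md2 is fully synchronized"
```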

KVM Host Faulty Disk (last edited 2020-10-09 08:59:01 by Kristian Kallenberg)