= KVM Host Faulty Disk =

== Getting the Report ==

Hard drives have moving parts that wear out, so every drive will eventually fail. With `smartmontools` installed and configured, an email like the following will show up in your inbox when `smartd` logs an error.

{{{
From root@kvm02.kallenberg.dk Wed Oct 18 23:14:11 2017
Subject: SMART error (SelfTest) detected on host: kvm02
To:
X-Mailer: mail (GNU Mailutils 3.1.1)

This message was generated by the smartd daemon running on:

   host name:  kvm02
   DNS domain: kallenberg.dk

The following warning/error was logged by the smartd daemon:

Device: /dev/sdd, Self-Test Log error count increased from 0 to 1

Device info:
ST2000DM001-1CH164, S/N:Z2F0RD5S, WWN:5-000c50-050214cd9, FW:CC26, 2.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
}}}

== Check the Disk Yourself ==

Just to be sure, run a long self-test yourself.

{{{
smartctl -t long /dev/sdd
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-3-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 230 minutes for test to complete.
Test will complete after Fri Oct 20 03:45:07 2017

Use smartctl -X to abort test.
}}}

Once the test has completed, take a look at the smartctl report.

{{{
smartctl -a /dev/sdd
}}}

This prints a long report; the self-test log is the interesting part.
{{{
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      8057         2144
# 2  Extended offline    Completed: read failure       90%      8053         2144
# 3  Extended offline    Completed: read failure       90%      8051         2144
# 4  Extended offline    Completed without error       00%      7986         -
# 5  Extended offline    Completed without error       00%      7828         -
# 6  Extended offline    Completed without error       00%      7818         -
}}}

== Replace the Disk ==

The SATA specification states that all SATA disks are hot swappable. That said, not all hardware necessarily supports it. Trusting the specification, we can replace a disk in a running system.

=== Unplug Data Cable First ===

Should you unplug the SATA data cable or the SATA power cable first? The rule is that the disk should share the same ground as the rest of your hardware, to avoid an electrostatic discharge while you unplug the SATA data cable. For this reason, unplug the SATA data cable first, then the SATA power cable.

Watch your syslog; some interesting messages will appear.
{{{
Oct 19 17:25:39 kvm02 kernel: [33741.124466] ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Oct 19 17:25:39 kvm02 kernel: [33741.124885] ata4: irq_stat 0x00400040, connection status changed
Oct 19 17:25:39 kvm02 kernel: [33741.125403] ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }
Oct 19 17:25:39 kvm02 kernel: [33741.126173] ata4: hard resetting link
Oct 19 17:25:40 kvm02 kernel: [33741.838857] ata4: SATA link down (SStatus 0 SControl 300)
Oct 19 17:25:45 kvm02 kernel: [33747.018752] ata4: hard resetting link
Oct 19 17:25:46 kvm02 kernel: [33747.335808] ata4: SATA link down (SStatus 0 SControl 300)
Oct 19 17:25:46 kvm02 kernel: [33747.336284] ata4: limiting SATA link speed to 1.5 Gbps
Oct 19 17:25:51 kvm02 kernel: [33752.394784] ata4: hard resetting link
Oct 19 17:25:51 kvm02 kernel: [33752.707393] ata4: SATA link down (SStatus 0 SControl 310)
Oct 19 17:25:51 kvm02 kernel: [33752.707907] ata4.00: disabled
Oct 19 17:25:51 kvm02 kernel: [33752.708806] ata4: EH complete
Oct 19 17:25:51 kvm02 kernel: [33752.709800] ata4.00: detaching (SCSI 3:0:0:0)
Oct 19 17:25:51 kvm02 kernel: [33752.715580] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
Oct 19 17:25:51 kvm02 kernel: [33752.716122] sd 3:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct 19 17:25:51 kvm02 kernel: [33752.716981] sd 3:0:0:0: [sdd] Stopping disk
Oct 19 17:25:51 kvm02 kernel: [33752.718035] sd 3:0:0:0: [sdd] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
}}}

The RAID array will then drop the disk.

{{{
Oct 19 17:25:56 kvm02 kernel: [33757.333805] md/raid1:md2: Disk failure on sdd1, disabling device.
Oct 19 17:25:56 kvm02 kernel: [33757.333805] md/raid1:md2: Operation continuing on 1 devices.
}}}

=== Plug in SATA Power Cable First ===

When plugging in the new disk, the same rule applies. Make sure the disk shares the same ground as the rest of your hardware before you plug in the SATA data cable: connect the SATA power cable first, then the SATA data cable.
Plugging in the new disk also produces some interesting messages in syslog.

{{{
Oct 19 17:37:53 kvm02 kernel: [34474.296149] ata4: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
Oct 19 17:37:53 kvm02 kernel: [34474.296655] ata4: irq_stat 0x00000040, connection status changed
Oct 19 17:37:53 kvm02 kernel: [34474.297608] ata4: SError: { CommWake DevExch }
Oct 19 17:37:53 kvm02 kernel: [34474.298522] ata4: hard resetting link
Oct 19 17:37:53 kvm02 kernel: [34475.009165] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 19 17:37:53 kvm02 kernel: [34475.010535] ata4.00: ATA-9: ST2000DM001-1ER164, HP51, max UDMA/100
Oct 19 17:37:53 kvm02 kernel: [34475.011043] ata4.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Oct 19 17:37:53 kvm02 kernel: [34475.012921] ata4.00: configured for UDMA/100
Oct 19 17:37:53 kvm02 kernel: [34475.013514] ata4: EH complete
Oct 19 17:37:53 kvm02 kernel: [34475.014673] scsi 3:0:0:0: Direct-Access ATA ST2000DM001-1ER1 HP51 PQ: 0 ANSI: 5
Oct 19 17:37:53 kvm02 kernel: [34475.061545] sd 3:0:0:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Oct 19 17:37:53 kvm02 kernel: [34475.061616] sd 3:0:0:0: Attached scsi generic sg3 type 0
Oct 19 17:37:53 kvm02 kernel: [34475.062953] sd 3:0:0:0: [sdd] 4096-byte physical blocks
Oct 19 17:37:53 kvm02 kernel: [34475.064133] sd 3:0:0:0: [sdd] Write Protect is off
Oct 19 17:37:53 kvm02 kernel: [34475.064977] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Oct 19 17:37:53 kvm02 kernel: [34475.065045] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 19 17:37:53 kvm02 kernel: [34475.081297] sdd: sdd1
Oct 19 17:37:53 kvm02 kernel: [34475.082399] sd 3:0:0:0: [sdd] Attached SCSI disk
}}}

== Rebuilding the Raid ==

Once the new disk has been attached, add it to the array.

{{{
mdadm --add /dev/md2 /dev/sdd1
}}}

You should get something like this in syslog.
{{{
Oct 19 17:38:31 kvm02 kernel: [34512.883729] md: bind
Oct 19 17:38:31 kvm02 kernel: [34512.958053] RAID1 conf printout:
Oct 19 17:38:31 kvm02 kernel: [34512.958056] --- wd:1 rd:2
Oct 19 17:38:31 kvm02 kernel: [34512.958058] disk 0, wo:0, o:1, dev:sdc1
Oct 19 17:38:31 kvm02 kernel: [34512.958060] disk 1, wo:1, o:1, dev:sdd1
Oct 19 17:38:31 kvm02 kernel: [34512.958401] md: recovery of RAID array md2
Oct 19 17:38:31 kvm02 kernel: [34512.958669] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Oct 19 17:38:31 kvm02 kernel: [34512.959603] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 19 17:38:31 kvm02 kernel: [34512.960587] md: using 128k window, over a total of 1946025984k.
}}}

Eventually the synchronization completes.

{{{
Oct 19 21:50:46 kvm02 kernel: [49647.562788] md: md2: recovery done.
Oct 19 21:50:46 kvm02 kernel: [49647.669903] RAID1 conf printout:
Oct 19 21:50:46 kvm02 kernel: [49647.669905] --- wd:2 rd:2
Oct 19 21:50:46 kvm02 kernel: [49647.669908] disk 0, wo:0, o:1, dev:sdc1
Oct 19 21:50:46 kvm02 kernel: [49647.669910] disk 1, wo:0, o:1, dev:sdd1
}}}
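
The recovery in the log above took about four hours. Instead of waiting for the "recovery done" message, you can read the progress from `/proc/mdstat` while the rebuild is running. The following is only a minimal sketch and not part of the procedure above: the function name `md_recovery_progress` is made up for illustration, and it assumes the usual `/proc/mdstat` layout, where each array stanza starts with a line like `md2 : active raid1 ...` and the progress line contains `recovery = N.N%`.

```shell
#!/bin/sh
# Hypothetical helper (not from the original page): print the recovery
# percentage of one md array. Reads /proc/mdstat-style text on stdin and
# prints nothing when the named array is not currently recovering.
md_recovery_progress() {
    array="$1"
    awk -v array="$array" '
        # A line such as "md2 : active raid1 sdd1[2] sdc1[0]" starts a stanza.
        /^md[0-9]+ :/ { current = $1 }
        # Assumed progress line: "[>....]  recovery =  3.1% (...) finish=..."
        current == array && /recovery =/ {
            if (match($0, /[0-9.]+%/)) {
                print substr($0, RSTART, RLENGTH)
            }
        }
    '
}

# Usage on the host itself (md2 is the array from this example):
#   md_recovery_progress md2 < /proc/mdstat
```

Reading from stdin keeps the function easy to try against saved `/proc/mdstat` output from another machine. `mdadm --detail /dev/md2` reports the rebuild status as well, if you prefer not to parse anything.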