Squoggle
Mac's tech blog
CentOS Drive Testing
Posted on September 23, 2022
My server was making uncharacteristic noises. This is how I tested my hard drives for failure.
- Install smartmontools:
# yum install smartmontools
- Get a listing of all your hard drives:
# lsblk
- Run a test on one of the hard drives:
# smartctl -t short /dev/sda
You will see something similar to the following:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Sep 23 13:02:21 2022
Use smartctl -X to abort test.
- It will give you a time when you can check the results. When the time has elapsed, come back and check the results like this:
# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
- If the test fails you will see something like this:
# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 063 063 140 Pre-fail Always FAILING_NOW 1089
- Looks like you need to replace /dev/sdb.
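The short test only takes a couple of minutes. If you want more than the simple PASSED/FAILED verdict, or a more thorough check, smartctl can also dump the full attribute table and self-test log, or run the much longer extended test. This is an aside rather than part of my original steps, so adjust the device name to suit:
# smartctl -a /dev/sda
# smartctl -t long /dev/sda
The -a output includes the health verdict plus every attribute, which is handy when a drive technically passes but you still do not trust it.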
How to Replace the Hard drive
This is what I did to replace the hard drive.
- Install lshw package:
# yum install lshw
- Now list hardware of type disk:
# lshw -class disk
You should get way too much info.
- Filter the info with grep like so:
# lshw -class disk | grep -A 5 -B 6 /dev/sdb
You should now only get the one drive you are looking for.
Mine looks like this:
# lshw -class disk | grep -A 5 -B 6 /dev/sdb
*-disk:1
description: ATA Disk
product: WDC WD1002FAEX-0
vendor: Western Digital
physical id: 1
bus info: scsi@5:0.0.0
logical name: /dev/sdb
version: 1D05
serial: WD-WCATR1933480
size: 931GiB (1TB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=000cd438
So it looks like I need to replace a 1TB Western Digital. Fortunately, this disk is in a two-disk RAID array.
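As an aside, lshw is not the only way to get the serial number. Since smartmontools is already installed from the testing steps above, smartctl can print the model and serial directly. I did not use this at the time, so treat it as an alternative:
# smartctl -i /dev/sdb
The -i (info) output lists the device model, serial number, and capacity, which is enough to match the physical drive.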
Remove the HD from the Raid Array
This is what I did to remove the HD from the Raid Array. Before proceeding, back up everything. I do a daily offsite backup, so I am covered in theory.
- Redo the lsblk command from above to confirm which disk is which:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
└─md0 9:0 0 931.4G 0 raid1
└─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
└─md0 9:0 0 931.4G 0 raid1
└─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
- Remember that the defective disk in this case is /dev/sdb and the good one is /dev/sdc.
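Before failing anything, I would also grab a snapshot of the array itself so there is something to compare against later. This was not in my original notes, but it is the same read-only command used further down to watch the rebuild:
# mdadm --detail /dev/md0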
- Write all cache to disk:
# sync
- Set the disk as failed with mdadm:
# mdadm --manage /dev/md0 --fail /dev/sdb1
/dev/sdb1 is the partition from the failing disk /dev/sdb that is part of the array.
You should see something like this:
mdadm: set /dev/sdb1 faulty in /dev/md0
- Confirm it has been marked as failed:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0](F)
976630464 blocks super 1.2 [2/1] [_U]
bitmap: 0/8 pages [0KB], 65536KB chunk
The (F) next to sdb1 indicates Failed.
- Now remove the disk with mdadm:
# mdadm --manage /dev/md0 --remove /dev/sdb1
- Now confirm with the cat command as before:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1]
976630464 blocks super 1.2 [2/1] [_U]
bitmap: 0/8 pages [0KB], 65536KB chunk
Notice that sdb1 is now gone.
- You can also confirm this with the lsblk command:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
└─md0 9:0 0 931.4G 0 raid1
└─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
- You can now shut down the server and replace that hard drive.
It is easy to find the correct hard drive with the serial number you got from the lshw command you ran earlier. The serial number is:
WD-WCATR1933480
- Power on the server.
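Before moving on, it is worth a quick sanity check that the replacement disk actually shows up as /dev/sdb and that its serial matches the label on the new drive. This is an extra step I would add now, not something from my original notes:
# lsblk
# smartctl -i /dev/sdb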
- Here is where I ran into an issue that left me scratching my head for quite some time. I’m documenting it here so if it happens again I can resolve it quickly.
It turns out that the spare drive I had on hand was not new, as I had thought. It was actually a drive from another system that had since been retired, and it still had a boot partition on it. When I booted the server, that was the partition that booted instead of my regular boot partition. I even had to recover passwords on it because the user and root passwords were not the same. All along I was thinking something had happened to bork the users somehow, but it turns out the "new" drive I had put in was booting and was not really new. The lesson learned here is to make sure the drive you put in has had any existing partitions removed. I did this by putting the drive in another system and using fdisk to remove the partitions, roughly as sketched below. Now when I boot the server, the normal boot partition boots and this new drive is designated as sdb as I expect.
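For reference, clearing the old partitions looks roughly like this. I did not capture my exact session, so the device name is illustrative (on the other system the drive may not have been /dev/sdb), and the wipefs line is an alternative I did not use at the time:
# fdisk /dev/sdb    (use p to print the table, d to delete each partition, w to write the changes)
# wipefs -a /dev/sdb
wipefs, from util-linux, removes the partition-table and filesystem signatures in one command, which would have saved the trip to the other machine.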
- Now you can copy the partition information from the good disk (/dev/sdc) to the new disk (/dev/sdb). Be warned that this will destroy any partition information on the new disk. Since I already destroyed any partition information in the previous step, I'm good with this. The command looks like this:
# sfdisk -d /dev/sdc | sfdisk /dev/sdb
- You can check the partition info is correct with the lsblk command:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
└─md0 9:0 0 931.4G 0 raid1
└─vg_raid-lv_raid 253:2 0 931.4G 0 lvm /mnt/Raid
- Now you can reverse the process and create the mirror that you previously had like this:
# mdadm --manage /dev/md0 --add /dev/sdb1
- Now you can verify the status of your raid like this:
# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Jun 27 17:49:31 2017
Raid Level : raid1
Array Size : 976630464 (931.39 GiB 1000.07 GB)
Used Dev Size : 976630464 (931.39 GiB 1000.07 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Sep 24 14:46:35 2022
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Consistency Policy : bitmap
Rebuild Status : 1% complete
Name : Serenity.localdomain:0 (local to host Serenity.localdomain)
UUID : f06aeaae:e0c9707b:6d982f07:3f320578
Events : 114297
Number Major Minor RaidDevice State
2 8 17 0 spare rebuilding /dev/sdb1
1 8 33 1 active sync /dev/sdc1
- You can see that the Rebuild Status is at 1% and that the array is in a rebuilding state.
- You can get the status of the rebuild like so:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2] sdc1[1]
976630464 blocks super 1.2 [2/1] [_U]
[>....................] recovery = 0.7% (7077312/976630464) finish=129.7min speed=124486K/sec
bitmap: 8/8 pages [32KB], 65536KB chunk
You can keep re-running this command to follow the rebuild if that is interesting to you, as shown below.
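If you would rather not re-run it by hand, the watch utility will refresh it every few seconds. This is a convenience I am adding here, not part of the original session:
# watch -n 5 cat /proc/mdstat
Press Ctrl+C to exit once the recovery line disappears.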