CentOS Drive Testing

My server was making uncharacteristic noises. This is how I tested my hard drives for failure.

  1. Install smartmontools:
    # yum install smartmontools
  2. Get a listing of all your hard drives:
    # lsblk
  3. Run a test on one of the hard drives:
    # smartctl -t short /dev/sda
    You will see something similar to the following:
    smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
    Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
    Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 2 minutes for test to complete.
    Test will complete after Fri Sep 23 13:02:21 2022
    Use smartctl -X to abort test.
  4. It will give you a time when you can check the results. When the time has elapsed, come back and check the results like this:
    # smartctl -H /dev/sda
    smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
    Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
  5. If the test fails you will see something like this:
    # smartctl -H /dev/sdb
    smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
    Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: FAILED!
    Drive failure expected in less than 24 hours. SAVE ALL DATA.
    Failed Attributes:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    5 Reallocated_Sector_Ct 0x0033 063 063 140 Pre-fail Always FAILING_NOW 1089
  6. Looks like you need to replace /dev/sdb. If you have several drives to check, see the loop sketch just below this list.
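
Side note: if you have several drives, steps 3 and 4 above can be scripted rather than run one drive at a time. This is only a sketch, and it assumes all of your drives show up as /dev/sd?:

    # for dev in /dev/sd?; do smartctl -t short "$dev"; done

Once the completion times it prints have passed, check them all in one go:

    # for dev in /dev/sd?; do echo "== $dev =="; smartctl -H "$dev"; done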

How to Replace the Hard Drive

This is what I did to replace the hard drive.

  1. Install the lshw package:
    # yum install lshw
  2. Now list hardware of type disk:
    # lshw -class disk
    You should get way too much info.
  3. Filter the info with grep like so:
    # lshw -class disk | grep -A 5 -B 6 /dev/sdb
    You should now only get the one drive you are looking for.
    Mine looks like this:
    # lshw -class disk | grep -A 5 -B 6 /dev/sdb
    *-disk:1
    description: ATA Disk
    product: WDC WD1002FAEX-0
    vendor: Western Digital
    physical id: 1
    bus info: scsi@5:0.0.0
    logical name: /dev/sdb
    version: 1D05
    serial: WD-WCATR1933480
    size: 931GiB (1TB)
    capabilities: partitioned partitioned:dos
    configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=000cd438

So it looks like I need to replace a 1TB Western Digital. Fortunately, this disk is in a two-disk RAID array.
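
As a side note, lsblk can usually report a disk's model and serial number directly (assuming your build of util-linux includes those columns), so you may not even need lshw for this:

    # lsblk -o NAME,SIZE,MODEL,SERIAL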

Remove the HD from the RAID Array

This is what I did to remove the HD from the RAID array. Before proceeding, back up everything. I do a daily offsite backup, so in theory I am covered.

  1. Redo the lsblk command from above to confirm which disk is which:
    # lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb 8:16 0 931.5G 0 disk
    └─sdb1 8:17 0 931.5G 0 part
      └─md0 9:0 0 931.4G 0 raid1
        └─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
    sdc 8:32 0 931.5G 0 disk
    └─sdc1 8:33 0 931.5G 0 part
      └─md0 9:0 0 931.4G 0 raid1
        └─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
  2. Remember that the defective disk in this case is /dev/sdb and the good one is /dev/sdc
  3. Write all cache to disk:
    # sync
  4. Set the disk as failed with mdadm:
    # mdadm --manage /dev/md0 --fail /dev/sdb1
    This is the failed partition from /dev/sdb.
    You should see something like this:
    mdadm: set /dev/sdb1 faulty in /dev/md0
  5. Confirm it has been marked as failed:
    # cat /proc/mdstat
    Personalities : [raid1]
    md0 : active raid1 sdc1[1] sdb1[0](F)
    976630464 blocks super 1.2 [2/1] [_U]
    bitmap: 0/8 pages [0KB], 65536KB chunk

    The (F) next to sdb1 indicates Failed.
  6. Now remove the disk with mdadm:
    # mdadm --manage /dev/md0 --remove /dev/sdb1
  7. Now confirm with the cat command as before:
    # cat /proc/mdstat
    Personalities : [raid1]
    md0 : active raid1 sdc1[1]
    976630464 blocks super 1.2 [2/1] [_U]
    bitmap: 0/8 pages [0KB], 65536KB chunk

    Notice that sdb1 is now gone.
  8. You can also confirm this with the lsblk command:
    # lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb 8:16 0 931.5G 0 disk
    └─sdb1 8:17 0 931.5G 0 part
    sdc 8:32 0 931.5G 0 disk
    └─sdc1 8:33 0 931.5G 0 part
      └─md0 9:0 0 931.4G 0 raid1
        └─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
  9. You can now shut down the server and replace that hard drive.
    It is easy to find the correct hard drive with the serial number you got from the lshw command you ran earlier. The serial number is: WD-WCATR1933480
  10. Power on server.
  11. Here is where I ran into an issue that left me scratching my head for quite some time. I’m documenting it here so if it happens again I can resolve it quickly.
    It turns out the spare drive I had on hand was not actually new. It had come out of another, now-retired system and still had a boot partition on it. When I booted the server, that old partition booted instead of my regular boot partition. I even had to recover passwords on it because the user and root passwords were different, and all along I was thinking something had somehow borked the users. The lesson learned is to make sure any drive you put in has had its old partitions removed. I did this by putting the drive in another system and using fdisk to delete the partitions. Now when I boot the server the normal boot partition boots and the new drive shows up as sdb as I expect.
  12. Now you can copy the partition information from the good disk (/dev/sdc) to the new disk (/dev/sdb). Be warned that this will destroy any partition information on the new disk. Since I already destroyed any partition information in the previous step I’m good with this. The command looks like this:
    # sfdisk -d /dev/sdc | sfdisk /dev/sdb
  13. You can check the partition info is correct with the lsblk command:
    # lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb 8:16 0 931.5G 0 disk
    └─sdb1 8:17 0 931.5G 0 part
    sdc 8:32 0 931.5G 0 disk
    └─sdc1 8:33 0 931.5G 0 part
      └─md0 9:0 0 931.4G 0 raid1
        └─vg_raid-lv_raid 253:2 0 931.4G 0 lvm /mnt/Raid
  14. Now you can reverse the process and create the mirror that you previously had like this:
    # mdadm --manage /dev/md0 --add /dev/sdb1
  15. Now you can verify the status of your RAID array like this:
    # mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Tue Jun 27 17:49:31 2017
        Raid Level : raid1
        Array Size : 976630464 (931.39 GiB 1000.07 GB)
     Used Dev Size : 976630464 (931.39 GiB 1000.07 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Sep 24 14:46:35 2022
             State : clean, degraded, recovering 
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 1% complete

          Name : Serenity.localdomain:0  (local to host Serenity.localdomain)
          UUID : f06aeaae:e0c9707b:6d982f07:3f320578
        Events : 114297

Number   Major   Minor   RaidDevice State
   2       8       17        0      spare rebuilding   /dev/sdb1
   1       8       33        1      active sync   /dev/sdc1
  • You can see that the Rebuild Status is at 1% and that the array is in a rebuilding state.
  • You can get the status of the rebuild like so:
    # cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sdb1[2] sdc1[1]
      976630464 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.7% (7077312/976630464) finish=129.7min speed=124486K/sec
      bitmap: 8/8 pages [32KB], 65536KB chunk

You can keep re-running this command to watch the rebuild progress if you are interested.
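
If you would rather have it refresh on its own, the watch utility will rerun the command at a fixed interval (the 60-second interval here is just an arbitrary choice):

    # watch -n 60 cat /proc/mdstat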

There’s something missing here. It probably relates to this:

CentOS 7 created mdadm array disappears after reboot
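
For what it is worth, the fix usually suggested for that problem is to make sure the array is recorded in /etc/mdadm.conf and to regenerate the initramfs so the array is assembled at boot. This is only a sketch of what that might look like on CentOS 7, not something I have re-verified on this box:

    # mdadm --detail --scan >> /etc/mdadm.conf
    # dracut -f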
