Squoggle
Mac's tech blog
CentOS Drive Testing
Posted on September 23, 2022
My server was making uncharacteristic noises. This is how I tested my hard drives for failure.
- Install smartmontools:
# yum install smartmontools
- Get a listing of all your hard drives:
# lsblk
- Run a test on one of the hard drives:
# smartctl -t short /dev/sda
You will see something similar to the following:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Sep 23 13:02:21 2022
Use smartctl -X to abort test.
- It will give you a time when you can check the results. When the time has elapsed, come back and check the results like this:
# smartctl -H /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
- If the test fails you will see something like this:
# smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-693.11.6.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 063 063 140 Pre-fail Always FAILING_NOW 1089
- Looks like you need to replace /dev/sdb.
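The short test only takes a couple of minutes. If you want more than the simple PASSED/FAILED verdict, or a more thorough check, smartctl can also dump the full attribute table and self-test log, or run the much longer extended test. This is an aside rather than part of my original steps, so adjust the device name to suit:
# smartctl -a /dev/sda
# smartctl -t long /dev/sda
The -a output includes the health verdict plus every attribute, which is handy when a drive technically passes but you still do not trust it.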
How to Replace the Hard drive
This is what I did to replace the hard drive.
- Install lshw package:
# yum install lshw
- Now list hardware of type disk:
# lshw -class disk
You should get way too much info.
- Filter the info with grep like so:
# lshw -class disk | grep -A 5 -B 6 /dev/sdb
You should now only get the one drive you are looking for.
Mine looks like this:
# lshw -class disk | grep -A 5 -B 6 /dev/sdb
*-disk:1
description: ATA Disk
product: WDC WD1002FAEX-0
vendor: Western Digital
physical id: 1
bus info: scsi@5:0.0.0
logical name: /dev/sdb
version: 1D05
serial: WD-WCATR1933480
size: 931GiB (1TB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=000cd438
So it looks like I need to replace a 1TB Western Digital. Fortunately, this disk is in a two-disk RAID array.
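As an aside, lshw is not the only way to get the serial number. Since smartmontools is already installed from the testing steps above, smartctl can print the model and serial directly. I did not use this at the time, so treat it as an alternative:
# smartctl -i /dev/sdb
The -i (info) output lists the device model, serial number, and capacity, which is enough to match the physical drive.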
Remove the HD from the Raid Array
This is what I did to remove the HD from the Raid Array. Before proceeding, back up everything. I do a daily offsite backup, so I am covered in theory.
- Redo the lsblk command from above to confirm which disk is which:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
└─md0 9:0 0 931.4G 0 raid1
└─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
└─md0 9:0 0 931.4G 0 raid1
└─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
- Remember that the defective disk in this case is /dev/sdb and the good one is /dev/sdc.
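Before failing anything, I would also grab a snapshot of the array itself so there is something to compare against later. This was not in my original notes, but it is the same read-only command used further down to watch the rebuild:
# mdadm --detail /dev/md0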
- Write all cache to disk:
# sync
- Set the disk as failed with mdadm:
# mdadm --manage /dev/md0 --fail /dev/sdb1
/dev/sdb1 is the partition from the failing disk /dev/sdb that is part of the array.
You should see something like this:
mdadm: set /dev/sdb1 faulty in /dev/md0
- Confirm it has been marked as failed:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0](F)
976630464 blocks super 1.2 [2/1] [_U]
bitmap: 0/8 pages [0KB], 65536KB chunk
The (F) next to sdb1 indicates Failed.
- Now remove the disk with mdadm:
# mdadm --manage /dev/md0 --remove /dev/sdb1
- Now confirm with the cat command as before:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1]
976630464 blocks super 1.2 [2/1] [_U]
bitmap: 0/8 pages [0KB], 65536KB chunk
Notice that sdb1 is now gone.
- You can also confirm this with the lsblk command:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
└─md0 9:0 0 931.4G 0 raid1
└─vg_raid-lv_raid 253:4 0 931.4G 0 lvm /mnt/Raid
- You can now shut down the server and replace that hard drive.
It is easy to find the correct hard drive with the serial number you got from the lshw command you ran earlier. The serial number is:
WD-WCATR1933480
- Power on the server.
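Before moving on, it is worth a quick sanity check that the replacement disk actually shows up as /dev/sdb and that its serial matches the label on the new drive. This is an extra step I would add now, not something from my original notes:
# lsblk
# smartctl -i /dev/sdb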
- Here is where I ran into an issue that left me scratching my head for quite some time. I’m documenting it here so if it happens again I can resolve it quickly.
It turns out that the spare drive I had on hand was not new, as I had thought. It was actually a drive from another system that had since been retired, and it still had a boot partition on it. When I booted the server, that was the partition that booted instead of my regular boot partition. I even had to recover passwords on it because the user and root passwords were not the same. All along I was thinking something had happened to bork the users somehow, but it turns out the "new" drive I had put in was booting and was not really new. The lesson learned here is to make sure the drive you put in has had any existing partitions removed. I did this by putting the drive in another system and using fdisk to remove the partitions, roughly as sketched below. Now when I boot the server, the normal boot partition boots and this new drive is designated as sdb as I expect.
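For reference, clearing the old partitions looks roughly like this. I did not capture my exact session, so the device name is illustrative (on the other system the drive may not have been /dev/sdb), and the wipefs line is an alternative I did not use at the time:
# fdisk /dev/sdb    (use p to print the table, d to delete each partition, w to write the changes)
# wipefs -a /dev/sdb
wipefs, from util-linux, removes the partition-table and filesystem signatures in one command, which would have saved the trip to the other machine.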
- Now you can copy the partition information from the good disk (/dev/sdc) to the new disk (/dev/sdb). Be warned that this will destroy any partition information on the new disk. Since I already destroyed any partition information in the previous step, I'm good with this. The command looks like this:
# sfdisk -d /dev/sdc | sfdisk /dev/sdb
- You can check the partition info is correct with the lsblk command:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 931.5G 0 disk
└─sdb1 8:17 0 931.5G 0 part
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
└─md0 9:0 0 931.4G 0 raid1
└─vg_raid-lv_raid 253:2 0 931.4G 0 lvm /mnt/Raid
- Now you can reverse the process and create the mirror that you previously had like this:
# mdadm --manage /dev/md0 --add /dev/sdb1
- Now you can verify the status of your raid like this:
# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Jun 27 17:49:31 2017
Raid Level : raid1
Array Size : 976630464 (931.39 GiB 1000.07 GB)
Used Dev Size : 976630464 (931.39 GiB 1000.07 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Sep 24 14:46:35 2022
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Consistency Policy : bitmap
Rebuild Status : 1% complete
Name : Serenity.localdomain:0 (local to host Serenity.localdomain)
UUID : f06aeaae:e0c9707b:6d982f07:3f320578
Events : 114297
Number Major Minor RaidDevice State
2 8 17 0 spare rebuilding /dev/sdb1
1 8 33 1 active sync /dev/sdc1
- You can see that the Rebuild Status is at 1% and that the array is in a rebuilding state.
- You can get the status of the rebuild like so:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2] sdc1[1]
976630464 blocks super 1.2 [2/1] [_U]
[>....................] recovery = 0.7% (7077312/976630464) finish=129.7min speed=124486K/sec
bitmap: 8/8 pages [32KB], 65536KB chunk
You can keep re-running this command to follow the rebuild if that is interesting to you, as shown below.
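If you would rather not re-run it by hand, the watch utility will refresh it every few seconds. This is a convenience I am adding here, not part of the original session:
# watch -n 5 cat /proc/mdstat
Press Ctrl+C to exit once the recovery line disappears.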