Saturday, September 6, 2008

Is Your Linux Software RAID Really Recoverable?

After installing a system with software RAID1 (mirroring), some additional setup is required to ensure that you can actually recover from a disk failure. For CentOS, Fedora and other RH-like distros, the Red Hat Enterprise Linux SysAdmin Guide has good instructions on setting up software RAID. For other distros, follow their specific installation guides.

Post-Install Setup

Make backups of the partition tables of all disks, the RAID array membership, and the file system mount points. Here we have two SATA disks, /dev/sda and /dev/sdb.

# mkdir /root/raidinfo
# sfdisk -d /dev/sda > /root/raidinfo/partitions.sda
# sfdisk -d /dev/sdb > /root/raidinfo/partitions.sdb
# cat /proc/mdstat > /root/raidinfo/mdstat.orig
# cat /etc/fstab > /root/raidinfo/fstab.orig
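
It can also help to keep a copy of the array definitions as mdadm reports them. This is optional, and the file name below is just a suggestion:

# mdadm --detail --scan > /root/raidinfo/mdadm-scan.orig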

The GRUB boot loader is installed on only one disk; by default this is the first disk the system finds, which is always labeled /dev/sda. If this disk fails the system will be unbootable, so we need to install GRUB on all the other disks in the array (only /dev/sdb in this example). The following set of commands installs GRUB into the MBR of /dev/sdb

# grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd0)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/e2fs_stage1_5" exists... yes
Running "embed /grub/e2fs_stage1_5 (hd0)"... 16 sectors are embedded. succeeded
Running "install /grub/stage1 (hd0) (hd0)1+16 p (hd0,0)/grub/stage2 /grub/grub .conf"... succeeded
Done.
grub>quit

NOTE: If the version of GRUB is ever updated, only the MBR of /dev/sda will be updated automatically; you will need to reinstall GRUB to the MBR of every other drive in the array. If the kernel is updated, no additional changes are required beyond updating /boot/grub/grub.conf to point to the new kernel, which happens automatically on CentOS.
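
If you ever need to redo this without retyping the interactive session, the same steps can be scripted. This is only a sketch using the legacy GRUB shell shown above; adjust the device and root lines to match your own layout:

# grub --batch <<EOF
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
quit
EOF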

Testing The Setup Before Disk Failure

Immediately after installing the OS you should test the software RAID setup to verify that the machine is bootable from any disk, that the automatic syncing works as expected and, most importantly, that you know what you are doing, so that in a real emergency you do not lose any valuable data. The easiest way to accomplish these goals is to

  • shut down and disconnect one disk drive
  • restart the machine (do you get the GRUB menu and boot successfully?)
  • do a few things... edit some text files, download some stuff, whatever
  • reconnect the drive and verify that the automatic synchronization starts and then finishes without issue (see the check below)
  • repeat all of the above for each disk
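
To confirm that the resync has started and to watch it run, the same /proc/mdstat check used later in this post works well; mdadm can also report the state of an array directly (the device name below follows this example's layout):

# watch -n 30 cat /proc/mdstat
# mdadm --detail /dev/md3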

Let's walk through an example where we remove /dev/sda from the system. After rebooting without this drive, take a look at /proc/mdstat. This can be confusing: we have removed /dev/sda from the system, but your boot logs and mdstat will say you have /dev/sda installed. This is because the OS labels the first disk it finds as /dev/sda, yet this is physically the original /dev/sdb. Confused yet? ;-)

# cat /proc/mdstat
....
md3 : active raid1 sda5[1]
307347392 blocks [2/1] [_U]
.....

For each of the md devices (we’ll focus on md3) this says that the device is active as RAID1 and that sda5 is its only member (this sda5 was sdb5 before the original sda disk was removed). The [2/1] indicates that the md3 device should contain two members but currently only one is available, i.e. sda5 (the original sdb5). The [_U] indicates that the first member is missing (the “_”) while the second one is up (the “U”). Here the first and second members refer to the original sda5 and sdb5, respectively.
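
If you want to be certain which physical drive the kernel is now calling sda, checking the drive's model and serial number is one option; hdparm (if installed) will print them, as will smartctl -i:

# hdparm -i /dev/sda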

Now reconnect the original /dev/sda drive and look at mdstat

# cat /proc/mdstat
....
md3 : active raid1 sdb5[1]
307347392 blocks [2/1] [_U]
.....

The only member of md3 is now shown as sdb5; this is physically the same disk that showed up as sda5 when the system had only one disk. Hmm... md3 has only one member, so we will need to manually add the original sda5 back to md3 to rebuild the RAID array

# mdadm -a /dev/md3 /dev/sda5

Once this is done the OS will start synchronizing sda5 with sdb5. As before, mdstat is the source of information and will allow you to monitor the progress of the rebuild

....
md3 : active raid1 sda5[2] sdb5[1]
307347392 blocks [2/1] [_U]
[===============>.....] recovery = 77.4% (238065664/307347392) finish=25.6min speed=44977K/sec
....

Note that md3 now has two members (sda5 and sdb5) and that the previously missing member (sda5) is being recovered. When the rebuild is complete the [2/1] [_U] will become [2/2] [UU]. Repeat this for the other md devices until they are all rebuilt and successfully resynced.
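
If you would rather block until the rebuild of a given array has finished, for example at the end of a script, mdadm can wait for it instead of you polling mdstat; the device name below follows this example:

# mdadm --wait /dev/md3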

Now do exactly the same for the other disk drive(s): remove the disk, reboot, reconnect the disk and rebuild the RAID arrays.

Recovery After Disk Failure

In the event of a disk failure there are three things you need to do

  • Install a new disk... d’oh!
  • Partition the new disk in exactly the same way as the old dead one
  • Add RAID partitions back into the /dev/mdx devices

Assume /dev/sda has died. To partition the new disk in exactly the same way as the old one, we will use the backup of the partition table we made above

# sfdisk /dev/sda < /root/raidinfo/partitions.sda

If the new disk is bigger than the old one you will be left with additional free space on the drive; this will not cause any problems.
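
To double-check that the new disk's partition table matches the saved layout, dump it again and compare it with the backup; if the command prints nothing the two are identical:

# sfdisk -d /dev/sda | diff - /root/raidinfo/partitions.sda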

The final step is to add these partitions back into the md devices. To do this correctly you need to know which partition resided inside which md device. The answer lies in the backup of the mdstat file

# cat /root/raidinfo/mdstat.orig
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
4096448 blocks [2/2] [UU]
......

This snippet shows that sda2 was a member of md1. Now add sda2 back into md1

# mdadm -a /dev/md1 /dev/sda2

Immediately after you add a partition to the array, the md driver will automagically begin syncing /dev/sda2 with /dev/sdb2 (the other member of md1). You can monitor the progress with

# watch -n 30 cat /proc/mdstat

In this example the information will be updated every 30 seconds; do not set the update interval too small, as this will slow down the syncing process. Repeat this for all partitions on the new disk. You can add all the partitions at once, but only one md device will be synchronized at a time. Be patient: 300GB over a SATA II (3 Gbps) connection takes about 80 minutes (an average of ~60 MB/sec, consistent with the sustained transfer rate quoted in the WD specs for these drives).
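
As a sketch, adding every failed member back in one go might look like the following. The md/partition pairing shown is the one from this example's mdstat.orig, and a real system will likely have more arrays than the two covered in this post, so check your own backup before copying it:

# mdadm -a /dev/md1 /dev/sda2
# mdadm -a /dev/md3 /dev/sda5
# watch -n 30 cat /proc/mdstat

Finally, if the replaced disk is one that GRUB was installed on, remember to reinstall GRUB into its MBR as described in the post-install setup above.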
