Recovery procedures for Linux MD RAID 5 devices
These notes are not about installing and building an MD RAID 5 device, but rather about how to recover the RAID set when something bad has happened (and also how to detect that something bad has happened!).
These notes are probably out of date and error-ridden / wrong ... so I take no responsibility for any lost data as a result of following them :-)
1. Post-RAID installation configuration
- Once your RAID has been configured there are some important steps to consider.
1.1. Gather configuration information
- Once you have built your RAID set it is useful to gather some configuration information in case you need to reassemble the RAID on a freshly installed system.
- The following output should be stored somewhere where it can be retrieved in the event of a problem.
- 1) The output of fdisk -l. This prints the partition tables of all disks in your system.
- 2) The output of hdparm -I for each disk in your system, e.g. hdparm -I /dev/sda. This prints everything hdparm knows about the disk, most importantly the serial number, so you can match a physical disk to a device instance (/dev/sda is disk s/n 1ab2c3d4NNJJ, etc...).
- 3) The output of ls -l /dev/disk/by-id. This is another source of information for mapping specific disks to device names. You cannot have enough of this information when things have gone wrong!
- 4) The contents of /proc/mdstat. This will show you (amongst other things) your RAID devices by device name.
- 5) The contents of /etc/mdadm.conf AND the output of mdadm --examine --scan. These should be (pretty much) the same!
- 6) And finally the output of df -h and/or the contents of your /etc/fstab file. This shows the state of your mounted filesystems and is always useful to have.
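- To make gathering all of the above painless, here is a minimal sketch of a collection script (the script name, the output path and the /dev/sd[a-z] disk glob are assumptions, adjust them to suit your system):

#!/bin/bash
# gather_raid_config.sh - snapshot all the RAID-related configuration into one file
outfile="/root/raid-config-$(date +%Y%m%d).txt"
{
    echo "### fdisk -l"               ; fdisk -l
    echo "### hdparm -I"              ; for d in /dev/sd[a-z] ; do echo "--- ${d}" ; hdparm -I "${d}" ; done
    echo "### ls -l /dev/disk/by-id"  ; ls -l /dev/disk/by-id
    echo "### /proc/mdstat"           ; cat /proc/mdstat
    echo "### /etc/mdadm.conf"        ; cat /etc/mdadm.conf
    echo "### mdadm --examine --scan" ; mdadm --examine --scan
    echo "### df -h"                  ; df -h
    echo "### /etc/fstab"             ; cat /etc/fstab
} > "${outfile}" 2>&1

- Needless to say, keep a copy of the resulting file somewhere other than on the RAID set itself.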
1.2. Implement monitoring
- There is no point in having any form of data redundancy if you are not alerted when a disk fails. To this end, a RAID set should be monitored, and when a disk has failed there should be a reliable mechanism to raise the alarm.
- The output of /proc/mdstat supplies the current status of your RAID.
- Two healthy RAID 5 devices.
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdd1[0] sdb1[3] sde1[2] sdc1[1]
      879124224 blocks level 5, 256k chunk, algorithm 2 [4/4] [UUUU]
md0 : active raid5 sdf1[0] sdi1[3] sdh1[2] sdg1[1]
      1465150464 blocks level 5, 1024k chunk, algorithm 2 [4/4] [UUUU]
- The two numbers in square brackets ([4/4]) show the total number of disks in the RAID followed by the number of those that are currently active and healthy.
- However, should that become [4/3], then you have a failed disk member.
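- For example, a degraded md1 (an illustrative failure, not real output from this system) would look something like this, with the failed member flagged (F):

md1 : active raid5 sdd1[0] sdb1[3](F) sde1[2] sdc1[1]
      879124224 blocks level 5, 256k chunk, algorithm 2 [4/3] [UUU_]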
- To regularly check the status of my RAID set I run a checking script that emails me if there is a problem.
#!/bin/bash
# check_raid.sh - mail an alert if any md device is running degraded
mail_target="me@my.mail.address"
status_file="/proc/mdstat"

# pick out the status lines, e.g. "... algorithm 2 [4/4] [UUUU]"
grep '\[[0-9]*/[0-9]*\]' "${status_file}" | while read line
do
    # the [total/healthy] field is the second to last on the line
    status=$(echo "${line}" | awk '{ print $(NF-1) }' | tr -d '[]')
    total_no_of_members=$(echo "${status}" | cut -d"/" -f1)
    healthy_no_of_members=$(echo "${status}" | cut -d"/" -f2)
    if [ "${total_no_of_members}" != "${healthy_no_of_members}" ]
    then
        mail -s "Raid problem" "${mail_target}" < /proc/mdstat
    fi
done
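- It is also worth confirming by hand, once, that mail from this host actually reaches you (a quick sketch, substitute your own address):

echo "raid mail test from $(hostname)" | mail -s "Raid mail test" me@my.mail.address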
- This is then run hourly under the control of cron (I use an old-fashioned crontab under /var/spool/cron, I can't stand these new-fangled /etc/cron.daily directories etc...)
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXNGnDLe installed on Sun Apr 20 13:41:38 2008)
# (Cron version V5.0 -- $Id: crontab.c,v 1.12 2004/01/23 18:56:42 vixie Exp $)
# field          allowed values
# -----          --------------
# minute         0-59
# hour           0-23
# day of month   1-31
# month          1-12 (or names, see below)
# day of week    0-7 (0 or 7 is Sun, or use names)
0 * * * * /usr/local/scripts/check_raid.sh
- And hey ho! Emails when you have a failed disk.
2. Re-adding a healthy disk to an MD RAID device
- On two occasions I have had a healthy disk "drop out" of the RAID set, for whatever reason. If this has happened, the "failed" disk can simply be re-added.
- Firstly, check the disk by reading its partition table using fdisk /dev/<failed disk device name>
- If you can read the partition table, the disk might well be OK... so re-add it to the RAID:
- mdadm --manage /dev/md1 --add /dev/sda1
- Where /dev/md1 is the RAID set and /dev/sda1 is the partition on the "unhealthy" disk (use those configuration reports from when the RAID was healthy to double-check!). The array then rebuilds onto the re-added disk in the background; see the sketch below for keeping an eye on it.
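- A quick sketch for keeping an eye on the rebuild afterwards (/dev/md1 is just the example device from above):

# the recovery progress line disappears from /proc/mdstat once the rebuild is complete
cat /proc/mdstat
mdadm --detail /dev/md1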