
Recovery procedures for Linux MD RAID 5 devices

These notes are not about installing and building an MD RAID 5 device, but rather about how to recover the RAID set when something bad has happened ( and also how to detect that something bad has happened !! ).

These notes are probably out of date and error ridden / wrong ... so I take no responsibility for any lost data as a result of following them :-)

1. Post-RAID installation configuration
1.1. Gather configuration information
1.2. Implement monitoring
2. Reinserting an OK disk into an MD RAID device



1. Post-RAID installation configuration

Once your RAID has been configured there are some important steps to consider.

1.1. Gather configuration information

Once you have built your RAID set it is useful to gather some configuration information in case you need to reassemble the RAID on a freshly installed system.

The following output should be stored somewhere it can be retrieved in the event of a problem ( a rough script to collect it all in one go is sketched after this list ).

1) The output of fdisk -l . This prints out the partition tables of all disks in your system.
2) The output of hdparm -I for each disk in your system, e.g. hdparm -I /dev/sda . This prints out all the info hdparm can report about your disk, most importantly the serial number, so you can match a physical disk to a device instance ( /dev/sda is disk s/n 1ab2c3d4NNJJ etc... ).
3) The output of ls -l /dev/disk/by-id . This is another source of information for mapping specific disks to device names. You cannot have enough of this information when things have gone wrong!!
4) The contents of /proc/mdstat . This will show you (amongst other things) your RAID devices by device name.
5) The contents of /etc/mdadm.conf AND the output of mdadm --examine --scan . These should be ( pretty much ) the same (!).
6) And finally the output of df -h and/or the contents of your /etc/fstab file. This info shows you the state of your mounted filesystems and is always useful to have.
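
A minimal sketch of such a collection script is below. The script name, the report location under /root and the /dev/sd[a-z] disk glob are all assumptions, so adjust them for your system.


#!/bin/bash
# gather_raid_info.sh - collect the RAID configuration info above into one report
# ( hypothetical name and locations - adjust to taste )
report="/root/raid-config-$(date +%Y%m%d).txt"

{
  echo "### fdisk -l" ;                fdisk -l
  echo "### hdparm -I per disk"
  for disk in /dev/sd[a-z]             # assumes SATA/SCSI style device names
  do
    echo "--- ${disk}" ;               hdparm -I "${disk}"
  done
  echo "### ls -l /dev/disk/by-id" ;   ls -l /dev/disk/by-id
  echo "### /proc/mdstat" ;            cat /proc/mdstat
  echo "### /etc/mdadm.conf" ;         cat /etc/mdadm.conf
  echo "### mdadm --examine --scan" ;  mdadm --examine --scan
  echo "### df -h" ;                   df -h
  echo "### /etc/fstab" ;              cat /etc/fstab
} > "${report}" 2>&1


And remember to copy the report off the machine... it is no use to you sitting on the array that has just died.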

1.2. Implement monitoring

There is no point in having any form of data redundancy if you are not alerted when a disk fails. To this end the RAID set should be monitored, and when a disk has errored there should be a reliable mechanism to raise the alarm.

The output of /proc/mdstat supplies the current status of your RAID.


Two healthy RAID 5 devices.


  Personalities : [raid6] [raid5] [raid4] 
  md1 : active raid5 sdd1[0] sdb1[3] sde1[2] sdc1[1]
       879124224 blocks level 5, 256k chunk, algorithm 2 [4/4] [UUUU]
      
  md0 : active raid5 sdf1[0] sdi1[3] sdh1[2] sdg1[1]
       1465150464 blocks level 5, 1024k chunk, algorithm 2 [4/4] [UUUU]

The two numbers in square brackets ( [4/4] ) show the total number of disks in the RAID followed by the number of healthy active members.
However should that become [4/3] , then you have a failed disk member.
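
A degraded version of md1 above would look something like this ( purely illustrative, the failed member and its position will vary ):

  md1 : active raid5 sdd1[0] sdb1[3](F) sde1[2] sdc1[1]
       879124224 blocks level 5, 256k chunk, algorithm 2 [4/3] [UUU_]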



To regularly check the status of my RAID set I run a checking script that emails me if there is a problem.


#!/bin/bash
# check_raid.sh - mail an alert if any MD device has a failed member
mail_target="me@my.mail.address"
status_file="/proc/mdstat"

# pick out mdstat status lines that contain a failed slot ( a '_' in the [UUUU] field )
/usr/bin/grep '[\[U]_' "${status_file}" | while read line
do
  # the [total/healthy] counts are the second to last field on the status line
  status=`echo ${line} | awk '{ print $(NF - 1) }' | tr -d '[]'`

  total_no_of_members=`echo ${status} | cut -d"/" -f1`
  healthy_no_of_members=`echo ${status} | cut -d"/" -f2`

  # fewer healthy members than total members means a degraded array , so send the alert
  if [ "${total_no_of_members}" != "${healthy_no_of_members}" ]
  then
    cat "${status_file}" | mail -s "Raid problem" "${mail_target}"
  fi
done
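
To check the alert path end to end before trusting it, one option is to run the script against a doctored copy of mdstat. The /tmp/mdstat.test scratch file is made up for this example, and status_file in the script needs pointing at it while you test:

chmod +x /usr/local/scripts/check_raid.sh
cp /proc/mdstat /tmp/mdstat.test
# fake a failed member ( assumes a healthy 4 disk array as in the example above )
sed -i 's/\[4\/4\] \[UUUU\]/[4\/3] [UUU_]/' /tmp/mdstat.test
/usr/local/scripts/check_raid.sh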


The check script is then run from cron ( I use an old-fashioned crontab under /var/spool/cron , I can't stand these new-fangled /etc/cron.daily etc... )


# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXNGnDLe installed on Sun Apr 20 13:41:38 2008)
# (Cron version V5.0 -- $Id: crontab.c,v 1.12 2004/01/23 18:56:42 vixie Exp $)
#              field          allowed values
#              -----          --------------
#              minute         0-59
#              hour           0-23
#              day of month   1-31
#              month          1-12 (or names, see below)
#              day of week    0-7 (0 or 7 is Sun, or use names)
0 * *  * * /usr/local/scripts/check_raid.sh
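
That entry fires at the top of every hour. One way ( of several ) to install and then double-check an entry like this:

crontab -e      # edit the current user's crontab
crontab -l      # list it back to make sure the entry is there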



And hey ho! Emails when you have a failed disk.

2. Reinserting an OK disk into an MD RAID device


On two occasions I have had a healthy disk "drop out" of the RAID set, for whatever reason. If this has happened, the "failed" disk can simply be re-added.

Firstly, check the disk by reading the partition table using fdisk /dev/<failed disk devicename> .
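
A quick non-interactive way to do that check ( /dev/sda here is just a stand-in for whichever disk dropped out ):

fdisk -l /dev/sda     # print the partition table and exit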
If you can read the partition table, the disk might well be OK... so re-add the disk to the RAID:

mdadm --manage /dev/md1 --add /dev/sda1

Where /dev/md1 is the RAID set and /dev/sda1 is the unhealthy disk ( use those configuration reports from when the RAID was healthy to double-check !! )
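
Once the disk has been re-added the array will rebuild itself. To keep an eye on the resync ( again using /dev/md1 as the example ):

cat /proc/mdstat          # shows recovery progress for the resyncing array
mdadm --detail /dev/md1   # shows the array state and each member disk's status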