Recovery procedures for Linux MD RAID 5 devices
These notes are not about installing and building an MD RAID 5 device, but rather about how to recover the RAID set when something bad has happened (and also how to detect that something bad has happened!).
These notes are probably out of date and error-ridden / wrong ... so I take no responsibility for any lost data as a result of following them :-)
1. Post-RAID installation configuration
- Once your RAID has been configured there are some important steps to consider.
1.1. Gather configuration information
- Once you have built your RAID set it is useful to gather some configuration information in case you need to reassemble the RAID on a freshly installed system.
- The following output should be stored somewhere where it can be retrieved in the event of a problem.
- 1) The output of fdisk -l. This prints the partition tables of all disks in your system.
- 2) The output of hdparm -I for each disk in your system, e.g. hdparm -I /dev/sda. This prints everything hdparm knows about the disk, most importantly the serial number, so you can match a physical disk to a device instance (/dev/sda is disk s/n 1ab2c3d4NNJJ, etc...).
- 3) The output of ls -l /dev/disk/by-id. This is another source of information for mapping specific disks to device names. You cannot have enough of this information when things have gone wrong!
- 4) The contents of /proc/mdstat. This will show you (amongst other things) your RAID devices by device name.
- 5) The contents of /etc/mdadm.conf AND the output of mdadm --examine --scan. These should be (pretty much) the same!
- 6) And finally the output of df -h and/or the contents of your /etc/fstab file. This shows the state of your mounted filesystems and is always useful to have.
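- To make gathering all of the above painless, here is a minimal sketch of a collection script (the script name, the output path and the /dev/sd[a-z] disk glob are assumptions, adjust them to suit your system):

#!/bin/bash
# gather_raid_config.sh - snapshot all the RAID-related configuration into one file
outfile="/root/raid-config-$(date +%Y%m%d).txt"
{
    echo "### fdisk -l"               ; fdisk -l
    echo "### hdparm -I"              ; for d in /dev/sd[a-z] ; do echo "--- ${d}" ; hdparm -I "${d}" ; done
    echo "### ls -l /dev/disk/by-id"  ; ls -l /dev/disk/by-id
    echo "### /proc/mdstat"           ; cat /proc/mdstat
    echo "### /etc/mdadm.conf"        ; cat /etc/mdadm.conf
    echo "### mdadm --examine --scan" ; mdadm --examine --scan
    echo "### df -h"                  ; df -h
    echo "### /etc/fstab"             ; cat /etc/fstab
} > "${outfile}" 2>&1

- Needless to say, keep a copy of the resulting file somewhere other than on the RAID set itself.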
1.2. Implement monitoring
- There is no point in having any form of data redundancy if you are not alerted when a disk fails. To this end, a RAID set should be monitored, and when a disk has failed there should be a reliable mechanism to raise the alarm.
- The output of /proc/mdstat supplies the current status of your RAID.
- Two healthy RAID 5 devices.
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdd1[0] sdb1[3] sde1[2] sdc1[1]
      879124224 blocks level 5, 256k chunk, algorithm 2 [4/4] [UUUU]
md0 : active raid5 sdf1[0] sdi1[3] sdh1[2] sdg1[1]
      1465150464 blocks level 5, 1024k chunk, algorithm 2 [4/4] [UUUU]
- The two numbers in square brackets ([4/4]) show the total number of disks in the RAID followed by the number of those that are currently active and healthy.
- However, should that become [4/3], then you have a failed disk member.
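- For example, a degraded md1 (an illustrative failure, not real output from this system) would look something like this, with the failed member flagged (F):

md1 : active raid5 sdd1[0] sdb1[3](F) sde1[2] sdc1[1]
      879124224 blocks level 5, 256k chunk, algorithm 2 [4/3] [UUU_]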
- To regularly check the status of my RAID set I run a checking script that emails me if there is a problem.
#!/bin/bash
# check_raid.sh - mail an alert if any md device is running degraded
mail_target="me@my.mail.address"
status_file="/proc/mdstat"

# pick out the status lines, e.g. "... algorithm 2 [4/4] [UUUU]"
grep '\[[0-9]*/[0-9]*\]' "${status_file}" | while read line
do
    # the [total/healthy] field is the second to last on the line
    status=$(echo "${line}" | awk '{ print $(NF-1) }' | tr -d '[]')
    total_no_of_members=$(echo "${status}" | cut -d"/" -f1)
    healthy_no_of_members=$(echo "${status}" | cut -d"/" -f2)
    if [ "${total_no_of_members}" != "${healthy_no_of_members}" ]
    then
        mail -s "Raid problem" "${mail_target}" < /proc/mdstat
    fi
done
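- It is also worth confirming by hand, once, that mail from this host actually reaches you (a quick sketch, substitute your own address):

echo "raid mail test from $(hostname)" | mail -s "Raid mail test" me@my.mail.address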
- This is then run hourly under the control of cron (I use an old-fashioned crontab under /var/spool/cron, I can't stand these new-fangled /etc/cron.daily directories etc...)
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXNGnDLe installed on Sun Apr 20 13:41:38 2008)
# (Cron version V5.0 -- $Id: crontab.c,v 1.12 2004/01/23 18:56:42 vixie Exp $)
# field          allowed values
# -----          --------------
# minute         0-59
# hour           0-23
# day of month   1-31
# month          1-12 (or names, see below)
# day of week    0-7 (0 or 7 is Sun, or use names)
0 * * * * /usr/local/scripts/check_raid.sh
- And hey ho! Emails when you have a failed disk.
2. Re-adding a healthy disk to an MD RAID device
- On two occasions I have had a healthy disk "drop out" of the RAID set, for whatever reason. If this has happened, the "failed" disk can simply be re-added.
- Firstly, check the disk by reading its partition table using fdisk /dev/<failed disk device name>
- If you can read the partition table, the disk might well be OK... so re-add it to the RAID:
- mdadm --manage /dev/md1 --add /dev/sda1
- Where /dev/md1 is the RAID set and /dev/sda1 is the partition on the "unhealthy" disk (use those configuration reports from when the RAID was healthy to double-check!). The array then rebuilds onto the re-added disk in the background; see the sketch below for keeping an eye on it.
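- A quick sketch for keeping an eye on the rebuild afterwards (/dev/md1 is just the example device from above):

# the recovery progress line disappears from /proc/mdstat once the rebuild is complete
cat /proc/mdstat
mdadm --detail /dev/md1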