Software RAID: What Happened?

UPDATE 2/14/17: Someone else finally ran up against this problem and persisted until he got a solution.
His advice is certainly worth trying if one ever wants (or dares) to reconstruct a hosed software RAID!

This page is a record of the misery that can be caused by an innocent attempt to change a server's hostname.
Such a change can have unexpected ill effects if the server also happens to host a software RAID device.

The basic steps used to create the RAID are described in the following links:

Linux Software RAID

"It's perfectly fine for /dev/md0 to not have a valid partition table when viewed with fdisk"

Email Notifications #1

Email Notifications #2 ("Step the Third")

Trouble started when /etc/hostname was changed from localhost.localdomain to perseus.lps.cornell.edu...
Normal boot of CentOS 7 no longer succeeded; instead, the system would drop into emergency mode.
System logs revealed the following:

Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: No such file...
Failed to open private bus connection: Failed to connect to socket /var/run/dbus/system_bus_socket: No such...
[/usr/lib/systemd/system/rtkit-daemon.service:32] Unknown lvalue 'ControlGroup' in section 'Service'

type=1400 ... avc: denied { read } for pid=1129 comm="mdadm" name="mdadm.conf" dev="sdc3...
/* multiple entries like this */

Job dev-md1.device/start timed out.
Dependency failed for /user3.
Dependency failed for File System Check on /dev/md1.

Job dev-md0.device/start timed out. 
Dependency failed for /user2.
Dependency failed for File System Check on /dev/md0.
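
For anyone retracing these steps, messages like the ones above can be pulled up from the emergency shell with systemd's journal tool (a minimal sketch; the grep pattern is just one way to narrow the output):

# From the emergency shell, dump the current boot's log and filter for RAID-related lines
journalctl -xb | grep -iE 'mdadm|md[01]|dependency'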

??? How to Fix ???

Observation: normal boot failed at the point where the OS tried to mount the devices listed in /etc/fstab.
Devices /dev/md0 and /dev/md1 were not found because the OS didn't assign the expected device names to the RAIDs.
As it turned out, instead of /dev/md0 and /dev/md1, they came up as /dev/md127 and /dev/md126, the names mdadm assigns to arrays it does not recognize as belonging to the local host (numbering downward from 127).

Misidentified RAID may cause reboot to go into emergency mode

RAID starting at md127 instead of md0
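
One quick way to confirm what names the kernel actually assigned is to query the running arrays (a sketch using standard mdadm tooling; nothing here is assumed beyond the devices already mentioned):

# Show active arrays and the md device names they were assembled under
cat /proc/mdstat
# Print one ARRAY line per running array, including its current /dev/md* name
mdadm --detail --scan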

Hypothesis: the problem is the "homehost" recorded in the RAID metadata; it is still localhost.localdomain, which no longer matches the new /etc/hostname.
Evidence: boot is successful when hostname is changed back to localhost.localdomain from perseus.lps.cornell.edu.
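
The homehost recorded on disk can be checked directly on a member device, since version-1.x superblocks store it as the prefix of the Name field (a sketch; /dev/sdb is one member of md0 on this system):

# Examine the superblock of a member disk; look for "Name : <homehost>:<n>"
mdadm --examine /dev/sdb | grep Name
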
Here's a link where the problem is diagnosed and two solutions are proposed:

Changing hostname may break RAID identification?

It seems that only newer OSes like CentOS 7 exhibit this behavior; luckily, the hostname change did not cause any issues for Fedora 15.

First Repair Attempt

While booted into Fedora 15, the line "HOMEHOST localhost.localdomain" was added to CentOS 7's /etc/mdadm.conf (mounted at /centos/etc).

Explanation of homehost in mdadm.conf man page

[slantz@perseus etc]$ cat /centos/etc/mdadm.conf
HOMEHOST localhost.localdomain
ARRAY /dev/md0 metadata=1.2 UUID=2992cbc2:b4d7c49f:45e669c5:13d77e51
ARRAY /dev/md1 metadata=1.2 UUID=f055ebd6:1c8181b4:715d8e80:c171cdde
MAILADDR [*omitted*]
MAILFROM perseus-mdadm

The trouble is that this does no good at boot time: the mdadm.conf that actually gets used is the copy embedded in the initramfs image.

After changing mdadm.conf, the init ramdisk must be updated

In CentOS 7, the correct command to do this is "dracut -f --mdadmconf". But that procedure didn't seem to work either...
Here is a summary of the above link's instructions, which should work but don't, even with a special addition to the dracut.conf.d directory:

echo 'mdadmconf="yes"' > /etc/dracut.conf.d/my-md.conf
dracut -f --mdadmconf
lsinitrd /boot/initramfs-$(uname -r).img | grep mdadm.conf

Output from the last command showed that /etc/mdadm.conf was not changed in the initramfs; in fact, it wasn't even present.
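
In hindsight, one workaround worth trying (untested here) would be to force dracut to copy the file in with its generic --install option instead of relying on --mdadmconf:

# Force-include /etc/mdadm.conf in the rebuilt image, then verify it is there
dracut -f --install /etc/mdadm.conf
lsinitrd /boot/initramfs-$(uname -r).img | grep mdadm.conf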

What about Fedora? Changes made to mdadm.conf are supposed to be similarly ineffective until the ramdisk is updated...
What's the starting point? Peeking into the contents of Fedora's initramfs is a bit harder - see Inspecting the Content of an Initrd File.
Following this procedure showed that an entirely different mdadm.conf is present in Fedora's default initramfs.
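
For the record, the unpacking boils down to something like this (a sketch assuming a gzip-compressed cpio image, which was typical for Fedora 15; /tmp/initrdmount is just a scratch directory):

mkdir /tmp/initrdmount && cd /tmp/initrdmount
# Decompress the image and unpack the cpio archive into the current directory
zcat /boot/initramfs-$(uname -r).img | cpio -idm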

[slantz@perseus etc]$ cat /tmp/initrdmount/etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all

Maybe these specifications are more forgiving: there is no HOMEHOST line at all, and the AUTO line simply permits auto-assembly of imsm and 1.x-metadata arrays, so the assembly of the RAIDs is not derailed by the non-matching homehost.

Second Repair Attempt

The following command confirmed that the RAID metadata were indeed imprinted with localhost.localdomain:

[slantz@perseus etc]$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Nov  5 16:00:38 2014
     Raid Level : raid0
     Array Size : 1953522688 (1863.02 GiB 2000.41 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Feb 20 14:24:27 2015
          State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : localhost.localdomain:0
           UUID : 2992cbc2:b4d7c49f:45e669c5:13d77e51
         Events : 4

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       48        1      active sync   /dev/sdd

Supposedly, some properties in the metadata, such as homehost, can be changed with suitable mdadm commands.

Mounting Linux software raid using a consistent device name

This led to the following attempt while booted into CentOS 7:

[root@perseus etc]# umount /user2
[root@perseus etc]# umount /user3
[root@perseus etc]# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
[root@perseus etc]# mdadm --stop /dev/md1
mdadm: stopped /dev/md1
[root@perseus etc]# mdadm --verbose --assemble /dev/md0 --update=homehost --homehost=perseus.lps.cornell.edu /dev/sdb /dev/sdd
mdadm: looking for devices for /dev/md0
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 1.
mdadm: added /dev/sdd to /dev/md0 as 1
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: /dev/md0 has been started with 2 drives.
[root@perseus etc]# mdadm --verbose --assemble /dev/md1 --update=homehost --homehost=perseus.lps.cornell.edu /dev/sde /dev/sdf
mdadm: looking for devices for /dev/md1
mdadm: /dev/sde is identified as a member of /dev/md1, slot 0.
mdadm: /dev/sdf is identified as a member of /dev/md1, slot 1.
mdadm: added /dev/sdf to /dev/md1 as 1
mdadm: added /dev/sde to /dev/md1 as 0
mdadm: /dev/md1 has been started with 2 drives.
[root@perseus etc]# mdadm --detail /dev/md0

This did not work: the final mdadm --detail showed that the Name field in the metadata still read localhost.localdomain:0.
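
In hindsight, a variant that might have fared better is --update=name, which the mdadm man page says rewrites the Name field of version-1 superblocks (a hedged, untested sketch using the same devices):

mdadm --stop /dev/md0
mdadm --assemble /dev/md0 --update=name --name=0 --homehost=perseus.lps.cornell.edu /dev/sdb /dev/sdd
# Check whether the Name field now reads perseus.lps.cornell.edu:0
mdadm --detail /dev/md0 | grep Name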

Repairs Abandoned...

To prevent further aggravation, no more effort was wasted on trying to patch up the RAIDs so they would auto-assemble at boot time.

RAIDs were stopped and deleted using the procedure described here: Mdadm Cheat Sheet
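
For reference, the teardown along those lines amounts to roughly the following (a sketch; note that --zero-superblock permanently erases the RAID metadata, destroying the arrays):

# Unmount and stop both arrays
umount /user2 /user3
mdadm --stop /dev/md0
mdadm --stop /dev/md1
# Wipe the RAID superblock from every member disk (destructive!)
mdadm --zero-superblock /dev/sdb /dev/sdd /dev/sde /dev/sdf
# Then remove the ARRAY lines from /etc/mdadm.conf and the /user2, /user3 entries from /etc/fstab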

Good News for the Similarly Frustrated!

Long after the above post appeared, a fellow traveler down this unfortunate path reported that he finally found a solution. Check it out!



Originally posted on 3/25/15 by Steve Lantz (steve.lantz ~at~ cornell.edu)