To those of us in the storage administration business, losing two disks in a RAID 5 disk group falls into a special category: what most like to call a resume generating event, or RGE. I ran into this exact issue today and survived, thanks to a couple of key pieces of information provided by the vendor and to combing through the logs to make sure I executed the recovery process in the correct order.
But there is no recovery, you say? Right, so you've lost two disks in a RAID group that has only a single parity drive. The LUNs within the RAID group are all offline, and the disks in question show up as "Removed". At this point you're SOL, and someone is leaning over your shoulder asking a simple question... WHEN WILL MY APPLICATION SERVER BE BACK UP?
On to the recovery steps. When a disk fails, most of the time it is actively failed by the array itself rather than by a catastrophic hardware failure: high CRC error rates on the drive lead the array to kick it out. With a two-disk failure, the array takes a different approach to disks that are "Removed" due to high CRC errors, and the recovery process is quite simple: re-insert the second disk that failed. The array will attempt to copy all of the usable data off it onto a hot spare. The second disk to fail is the one used because it holds the most recently updated data, whereas the first disk to be marked failed would not include the updates written after it dropped out.
You can check the status of the disk rebuild through naviseccli. Run:

naviseccli -h SP_IP_address -user username -scope 0 getdisk 3_4_3 -state -rb

This shows the rebuild status of each LUN found on the disk being rebuilt.
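If you would rather watch the rebuild than re-run the command by hand, a small wrapper can poll and parse the getdisk output. This is a sketch only: the SP address and username are hypothetical placeholders, and the "Prct Rebuilt" field name is an assumption about the getdisk -rb output format on your FLARE release, so adjust the pattern to whatever your array actually prints.

```shell
#!/bin/sh
# Poll rebuild progress for one disk (bus_enclosure_disk notation).
SP_IP=10.0.0.1   # hypothetical SP management address
DISK=3_4_3       # disk from the example above

# Fetch the raw state/rebuild report for the disk.
check_rebuild() {
    naviseccli -h "$SP_IP" -user admin -scope 0 getdisk "$DISK" -state -rb
}

# Report the lowest per-LUN rebuild percentage found in the output.
# Assumes lines of the form "Prct Rebuilt: <n>", one per LUN on the disk.
min_pct() {
    awk -F: '/Prct Rebuilt/ {
        gsub(/ /, "", $2)
        if (!set || $2 + 0 < m) { m = $2 + 0; set = 1 }
    } END { print m + 0 }'
}

# Typical use on a live array: check_rebuild | min_pct
```

Tracking the minimum percentage is a deliberate choice: the rebuild is only safe to call done when the slowest LUN on the disk reaches 100.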
Issue: In some instances, LUNs in pools associated with the failed drives can show up in a "Faulted" state.
Resolution: A common way to resolve this is to reboot each CLARiiON storage processor (SP): reboot the first SP, wait 15 to 20 minutes, then reboot the second SP.
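The staged reboot above can be scripted so the wait between SPs is never skipped. Treat this as a sketch: the SP addresses and username are placeholders, and the rebootsp command and its -o (suppress confirmation) flag are assumptions about the naviseccli version on your array; older releases may require the reboot to be issued from the Navisphere GUI instead.

```shell
#!/bin/sh
# Staged SP reboot sketch: one SP at a time, with a full settle window between.
SP_A=10.0.0.1   # hypothetical SP A management address
SP_B=10.0.0.2   # hypothetical SP B management address

minutes() { echo $(( $1 * 60 )); }   # helper: minutes -> seconds for sleep

# Assumed command; verify rebootsp exists on your FLARE release first.
reboot_sp() {
    naviseccli -h "$1" -user admin -scope 0 rebootsp -o
}

stagger_reboot() {
    reboot_sp "$SP_A"
    sleep "$(minutes 20)"   # wait the full 15-20 minutes before touching SP B
    reboot_sp "$SP_B"
}
```

Rebooting one SP at a time matters because the surviving SP carries the trespassed LUNs while its peer is down; rebooting both at once takes the whole array offline.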