I’ve been having “fun” recently with VMware storage issues following a temporary loss of power to one of my two EMC Clariion CX4-480 SANs. The power outage was expected, and the SAN was shut down cleanly before the electricity was lost. Ever since, though, hosts have been periodically greying out in the vSphere client. They’d eventually come back to life, and the VMs running on them didn’t seem affected, although they also greyed out in the vSphere client and so couldn’t be managed. It was worrying – something clearly wasn’t happy.
One host in particular (they’re all running ESXi 5.0 Update 2, build 914586) had had several of these episodes, more than the others. So this morning, while it was looking normal, I decided to try putting it into maintenance mode (which worked) and then rebooted it once all the VMs had been migrated off it.
When it came back up again, I noticed that one of the VMFS datastores was greyed out. Looking in the host’s Configuration – Hardware – Storage tab, that datastore was listed in the Identification column as <Name> (inactive) (unmounted), and the Capacity, Free and Type columns were all showing N/A:
I found the following entries in the /var/log/vmkernel.log file by looking for the naa reference for the problem LUN/datastore:
~ # cat /var/log/vmkernel.log | grep naa.600601607c7028006891d164db7be011
2013-02-21T08:31:55.213Z cpu26:4122)ScsiDeviceIO: 2324: Cmd(0x4124425654c0) 0x16, CmdSN 0x99f1 from world 0 to dev "naa.600601607c7028006891d164db7be011" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2013-02-21T08:31:55.213Z cpu4:5147)LVM: 11918: Failed to open device naa.600601607c7028006891d164db7be011:1
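If you’re chasing a similar problem, the device’s state and its paths can also be checked from the ESXi shell. A sketch using standard ESXi 5.x commands – substitute the naa ID of your own problem LUN:

```shell
# Show the device's overall state (a healthy device reports Status: on)
esxcli storage core device list -d naa.600601607c7028006891d164db7be011

# Show the multipathing (NMP) configuration and working paths for the device
esxcli storage nmp device list -d naa.600601607c7028006891d164db7be011

# List each physical path to the device and whether it is active or dead
esxcfg-mpath -b -d naa.600601607c7028006891d164db7be011
```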
Doing anything on these VMs that involved much disk activity caused the VM to lock up and stop responding, even to pings. The host running the VM would then grey out. This eventually happened to about half the VMs on the datastore.
Trying to do an ls (directory listing) on the datastore via SSH to any host also failed after a few minutes.
According to VMware KB 289902, the H:0x5 is the Host (Initiator) status, and means SG_ERR_DID_ABORT, aka “Told to abort for some other reason”, which is less than helpful.
Then I tried searching again and found this blog post, in which one of the early steps in the resolution was to fail over the SAN storage controller. Trespassing a LUN on the Clariion is quick and easy to do. The problem LUN was owned by SPB, its default SP, so I trespassed it, which moved it to SPA. A few seconds later the VMs stored on the datastore started springing back to life, and the hosts recovered too, so that seems to have fixed it. I didn’t need to do any of the other steps listed in Shabir Yusuf’s blog.
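For reference, the trespass can also be done from the Navisphere CLI rather than the GUI. A sketch, assuming naviseccli is installed and reachable – the SP addresses and the LUN number here are placeholders, not the real ones from my setup:

```shell
# Check which SP currently owns the LUN (replace spa-address and the LUN number)
naviseccli -h spa-address getlun 42 -owner

# Trespass the LUN: run against the peer SP, which then takes ownership
naviseccli -h spa-address trespass lun 42
```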
The host that had been rebooted was still showing the LUN as inactive and unmounted, so I right-clicked the datastore and chose “Mount”; it seems fine now.
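The mount can equally be done from the ESXi shell if the vSphere client is playing up; a sketch using the ESXi 5.x esxcli filesystem namespace (the datastore label is a placeholder):

```shell
# List all filesystems the host can see, including unmounted VMFS volumes
esxcli storage filesystem list

# Mount the unmounted VMFS datastore by its volume label
esxcli storage filesystem mount -l MyDatastore
```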