Here’s another nice vSphere issue, and how to fix it.
One of my ESXi 5.0 hosts was showing as disconnected in vSphere client.
I tried to reconnect to it by using the right-click Connect option. This took a while, then gave me an authentication error:
Reconnect host Cannot complete login due to an incorrect user name or password. Time: Target: <broken host.fqdn> vCenter Server: vcenter.rcmtech.co.uk
I then had to enter credentials for the host (I used root), accept the certificate and click through the add host wizard. This eventually failed:
Reconnect host A general system error occurred: Timed out waiting for vpxa to start Time: Target: <broken host.fqdn> vCenter Server: vcenter.rcmtech.co.uk
Searching on that error text brought me to a VMware KB article that says that this can be caused by a VM having too many snapshots (more than 32).
I ran the following PowerShell command to list all snapshots:
Get-VM | Get-Snapshot | Select-Object name,vm,sizemb
But it didn’t show any VMs with more than a handful of snapshots each.
Now the finger starts to point to my backup system, Symantec NetBackup 7.1. It creates snapshots of VMs when it backs them up, and frequently these a) don’t get removed afterwards and b) don’t show up in the vSphere GUI (or in PowerShell either).
Time to enable ESXi shell and see what’s on the VMFS disks. I ran the following command, logged in as root:
find /vmfs/volumes/ -name "*delta.vmdk"
Wait for the screen to rapidly scroll as you find a VM that has a lot of snapshots. In my case I found one with 235 of them! I used cd to change into the folder where the *delta.vmdk files for the VM lived and did an:
ls -latr *delta.vmdk
to list them all in reverse date order (newest at the bottom). NetBackup had been creating ten snapshots every day, nice. Don’t forget to log out (exit) and disable the ESXi shell again.
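If the find output scrolls past too quickly to spot the culprit, the same search can be summarised as a count of delta files per folder. This is only a rough sketch: count_deltas is a name I've made up, and it assumes the shell you run it in has sed, sort and uniq available. Bear in mind that it counts delta files rather than snapshots, so a VM with several virtual disks will show more files than it has snapshots.

```shell
# Count *delta.vmdk files per folder, busiest folder first.
# count_deltas is a hypothetical helper name, not a VMware tool.
# The first argument is the folder to search (defaults to /vmfs/volumes/).
count_deltas() {
    root="${1:-/vmfs/volumes/}"
    # Strip the file name from each path to leave just the folder,
    # then count how many times each folder appears.
    find "$root" -name "*delta.vmdk" \
        | sed 's|/[^/]*$||' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Run count_deltas with no arguments to search /vmfs/volumes/; the folder with the most delta files is listed first, with the count in front of the path.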
To consolidate these on ESX 4.x you had to shut the VM down and cold migrate it to a different place using the vmkfstools command. Luckily in vSphere 5.0 you can do this from the vSphere client (I think both the host and vCenter need to be running 5.0 or higher for this to work though). Also luckily, whilst vCenter can’t talk to a host with a VM that has more than 32 snapshots, the vSphere client can, so I connected my client direct to the problem host.
So I right-clicked the VM and clicked on Consolidate. A few minutes later the snapshots were all gone. I was then able to reconnect the host to vCenter successfully by right-clicking it and choosing Connect (I still got the authentication error again though).
Interestingly, the snapshot best practices KB article does mention the 32 snapshot limitation, and also mentions that you can set up an alarm. But as the best practices article states, some snapshots don’t show in the GUI (or PowerCLI) so I wonder if this alarm will help…