Windows Failover Clustering using Clariions and MirrorView on vSphere

This is how I’ve implemented a semi-resilient back end for Blackboard, but the techniques could be used for anything that runs as a Windows service (and is thus clusterable) but doesn’t support some kind of application-level replication such as SQL Mirroring.

To start with, let me explain my situation. I have two fairly unreliable server rooms, where several power outages per year are the norm. Yes, I have a UPS in each one, but no generator, and the UPS runs for about 30 minutes under full load. The rooms are sealed from an air conditioning point of view, and the air conditioning runs directly from the mains electricity, so if the mains fails the rooms heat up and equipment shuts down when its thermal limits (e.g. in the CPUs) are reached. On the plus side, the rooms are connected by dual 4Gbps fibre channel links for the SAN fabric and a resilient 10Gbps ethernet link, so connectivity between the rooms is fairly good. As a result I can run services from either room and end users tend not to notice any difference. It also enables me to use synchronous replication technologies such as MirrorView/S between the EMC CX4-480 Clariions, of which I have one per room. The rooms are on different sites, but these are only a fifteen minute walk apart. I have a single vCenter server, but it manages two separate HA + DRS clusters, one per room.

The trick I’ve been trying to achieve is to get services to be easily moved from one room to another, and to be able to do this both proactively and automatically, in either direction. This means that if the mains fails on the active site I can move the service to the other site before a) the UPS fails and b) the servers shut down due to thermal overload, or if my estates department informs me that they need to power an entire room off for a weekend to do electrical testing I can move the service to the other room if necessary. What I also ultimately want to achieve is that if the active site power fails when there’s nobody present (I don’t have to provide 24×7 support except for a few days a year) the service will automatically move/recover itself in the other room.

So ideally I want to use Windows Failover Clustering (Blackboard doesn’t support SQL Mirroring in High Safety mode, and it also needs a filestore, which is just files and folders on disk, referenced by the SQL data). Ignoring the storage element for the time being, a cluster is ideal as it allows both the manual moving of clustered resources and automatic failure detection (e.g. one node down due to network issues, disk communication issues, a reboot after installing updates etc.). It also makes it easy to set a preferred owner for the clustered resource, and the clustered resource can be moved around as much as I like. This is important for me because neither of my two rooms is a “primary” data centre. Something like VMware Site Recovery Manager (up to and including v4.1) is pretty much a one-way-only failover (unless you have also bought EMC RecoverPoint appliances, or like doing manual reconfiguration). SRM also does not allow for automatic failure detection, so its failover is manually initiated only. Note that the latest version of SRM, 5.0, does allow failback.

The problem I have with any version of Site Recovery Manager is that it is not a High Availability solution, it’s a Disaster Recovery solution. Failover clustering is an HA solution but, unless you do clever things, not a DR solution. HA is more likely to be invoked than DR, even in my unreliable server rooms: once a month I release Windows Updates to all my servers, and failover clustering takes care of keeping the downtime from this as low as possible. We’re talking about downtime that’s only as long as it takes the cluster to fail over when the active node shuts down as part of its post-update-installation reboot. The other nice thing about clustering vs SRM is that with SRM you only have one server instance, so if it breaks (misconfiguration, dodgy update, whatever) that’s it, gone, and it’s restore from backup time.

So now to the data. As previously mentioned, my two CX4 SANs are on the same fabrics, and both fabrics are present in both rooms with a 4Gbps link for each of them. I therefore use MirrorView/S to synchronously replicate the data between them.

With one CX4, the SP that owns the LUN you’re writing to takes the data from the host and stores it in its write cache, which is RAM. It then sends the data to the other SP via the CMI (Clariion Messaging Interface), and once the second SP acknowledges that the data is in RAM too, the original SP sends the acknowledgement back to the host. So, depending on the amount of data you write and how many other hosts are talking to the Storage Processor, the write time for the host can be as short as RAM + CMI + RAM = pretty quick (unless you fill the SP write cache).

Add a second Clariion and MirrorView/S and you have to wait for all four SPs to have the data in RAM before the host gets the acknowledgement of the write, so it can slow things down. If you’re writing lots of data this might have more of an impact, but it’s still usually OK, especially when you consider that you’re trading the performance hit against simultaneous replication of the data to two separate SANs and, in my case, two separate sites. Because it’s synchronous there is no loss of data due to replication latency (as there might be with MirrorView/A).

The problem with MirrorView full stop is that it does not, in itself, integrate with Windows. If you want to promote your secondary LUNs to be primary you have to do this manually via the Unisphere interface, and the host needs to be zoned to both Clariions via your fabric and to be a member of a storage group on each that allows the LUN(s) to be visible.
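The write path described above can be put into a rough back-of-envelope model. This is only a sketch: the function names and all timings are hypothetical placeholders, not measurements from a CX4, and it assumes the remote array repeats the same cache-and-mirror step strictly after the local one, with a single inter-site round trip in between. Real arrays may overlap some of this work.

```python
# Back-of-envelope model of host write acknowledgement time.
# All timings are made-up placeholders, NOT measurements from a Clariion.


def single_array_write_us(ram_us: float, cmi_us: float) -> float:
    """One array: owning SP caches the write, mirrors it over the CMI,
    and the peer SP caches it too (RAM + CMI + RAM)."""
    return ram_us + cmi_us + ram_us


def mirrorview_s_write_us(ram_us: float, cmi_us: float, link_rtt_us: float) -> float:
    """Two arrays with synchronous mirroring (assumed strictly serial):
    local RAM + CMI + RAM, one inter-site round trip, then the same
    RAM + CMI + RAM step on the remote array."""
    local = single_array_write_us(ram_us, cmi_us)
    remote = single_array_write_us(ram_us, cmi_us)
    return local + link_rtt_us + remote


if __name__ == "__main__":
    ram, cmi, rtt = 1.0, 20.0, 100.0  # hypothetical microseconds
    print("single array :", single_array_write_us(ram, cmi))
    print("MirrorView/S :", mirrorview_s_write_us(ram, cmi, rtt))
```

The point of the model is only that the mirrored write pays for the local path twice over plus the inter-site link, which is why the link quality between the rooms matters so much.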

What this means in practice is that should the active SAN go offline for some reason, the cluster will realise that there’s a problem and attempt to fail over. This will fail, as the disks are offline. You then have to promote the secondary LUN(s) via Unisphere and attempt to bring the cluster resources online again.

This can be made more difficult by the way that VMware suggest you configure Windows Failover Clustering. See the section Cluster Virtual Machines across physical hosts in their guide. You set up the Windows cluster node A virtual machine on vSphere host 1, storing it on a VMFS volume on SAN i, using Raw Device Mappings to connect the cluster disks (LUNs), which are also on SAN i, and storing the .rdmp (RDM mapping) files with the .vmdk (virtual machine disk) and .vmx (virtual machine configuration) files for the C drive of the VM. When you set up the Windows cluster node B virtual machine on vSphere host 2, you browse to the .rdmp files stored with VM A. You then have to add the (inaccessible but visible) MirrorView secondary LUNs from SAN ii to the cluster, adding them first to VM B and storing their .rdmp files with the .vmdk and .vmx for VM B on SAN ii, then adding them to VM A, browsing to the .rdmp files stored with VM B on SAN ii.

The controlled failover process in the event of having to power off the active site (i.e. the host with the active cluster node plus the SAN with the MirrorView primary LUNs, in this example Host 1 and VM A) is to take the cluster resources offline, shut down both node VMs (ideally passive first, active last), promote the secondary LUNs, start the VMs (what was the active node first, then the passive), ensure the cluster comes back online, then move the cluster resources to VM B. You can then power off VM A and SAN i.
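Because the order of those steps matters, it can help to write them down as an explicit checklist. The sketch below only generates the ordered list as text; it does not talk to vCenter, Unisphere or the cluster, and the function and parameter names are hypothetical.

```python
# Ordered checklist for a controlled site failover, as described above.
# This only builds text; it performs no actions against any real system.


def controlled_failover_steps(active_vm: str, passive_vm: str,
                              active_san: str, standby_san: str) -> list[str]:
    """Return the manual steps, in order, for moving the service off the
    site that is about to be powered down."""
    return [
        "Take the cluster resources offline",
        f"Shut down {passive_vm} (passive node)",
        f"Shut down {active_vm} (active node)",
        f"Promote the MirrorView secondary LUNs on {standby_san}",
        f"Start {active_vm} (previously active node)",
        f"Start {passive_vm}",
        "Confirm the cluster comes back online",
        f"Move the cluster resources to {passive_vm}",
        f"Power off {active_vm} and {active_san}",
    ]


if __name__ == "__main__":
    steps = controlled_failover_steps("VM A", "VM B", "SAN i", "SAN ii")
    for number, step in enumerate(steps, start=1):
        print(f"{number}. {step}")
```

Keeping the list in one place makes it easy to print for whoever is doing the weekend power-down, and to swap the arguments round for the trip back.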

The slightly nasty thing about this is that if the SAN containing the files for VM A goes offline, and VM B gets powered off, even though you can promote the MirrorView secondary LUNs on the SAN for VM B, VM B will not power on as the .rdmp files with the mappings to the LUNs on the offline SAN will not be accessible. In this case you’d have to remove those RDM LUNs from VM B (and then add them back later once they’re available again). In reality you’d be unlucky to both lose a SAN in one site and also something that’d cause a host to fail on your other site. As long as VM B stays running after your SAN failure all you need to do is promote the LUNs at site B and bring the cluster resources online.
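The dependency that causes this can be modelled very simply: a node VM can only be powered on if every SAN holding one of its files (its own .vmx/.vmdk plus any RDM mapping files it references) is online. The sketch below is a toy model under that assumption; the class and function names are hypothetical, and it says nothing about a VM that is already running.

```python
# Toy model: which SANs must be online before a cluster node VM can power on.
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeVM:
    name: str
    home_san: str                 # SAN holding the VM's own .vmx and .vmdk
    rdm_mapping_sans: frozenset   # SANs holding RDM mapping files it references


def blocking_sans(vm: NodeVM, online_sans: set) -> set:
    """SANs this VM needs for power-on that are currently offline."""
    needed = {vm.home_san} | set(vm.rdm_mapping_sans)
    return needed - set(online_sans)


def can_power_on(vm: NodeVM, online_sans: set) -> bool:
    return not blocking_sans(vm, online_sans)


if __name__ == "__main__":
    # VM B lives on SAN ii but references mapping files stored with VM A on SAN i.
    vm_b = NodeVM("VM B", "SAN ii", frozenset({"SAN i", "SAN ii"}))
    print(can_power_on(vm_b, {"SAN i", "SAN ii"}))
    print(can_power_on(vm_b, {"SAN ii"}), blocking_sans(vm_b, {"SAN ii"}))
```

Removing the RDMs that point at the offline SAN, as described above, corresponds to shrinking `rdm_mapping_sans` until nothing blocks the power-on.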

I’ve been running with this configuration for several months now and it seems to be working well. I have two Failover clusters, one for SQL (with five data LUNs, all in a MirrorView consistency group) and a second cluster for the filestore, with one LUN. Both clusters are using disk quorum.

The improvement to this, which I’m hoping to experiment with over the next few months, is to use MirrorView/CE (Cluster Enabler) so that the Windows Failover Cluster integrates with MirrorView and can promote the LUNs itself as the cluster resources move from one site to another. I’ll post more once I get some results.

Other info you might find interesting:

  • I’ve set node affinity for the cluster resources to the VMs in site 1, and I apply Windows Updates to the VMs in site 2 first, the same week they’re released by Microsoft. This way the VMs in site 2 are the passive nodes and can thus install updates and reboot without taking the cluster resources offline. It also means that should the nodes in site 1 get hit by some malware that the updates protect against, the cluster will (ideally) fail over to site 2 (or at least can be forced to site 2) which will not have been affected due to being updated early. The following week the active cluster nodes get updated, reboot (transferring the cluster resources to site 2) then once they’re back online the affinity setting moves the cluster resources back onto them again, ready for the next month.
  • The VMs are running on hosts with two X5690 CPUs (6 core + hyperthreading). I’ve given the SQL VMs 6 cores and 16GB of RAM, and the filestore VMs 2 cores and 6GB of RAM. I’m using Windows Server 2008 R2, and of course it has to be Enterprise edition to enable the Failover Clustering feature.
