Why Can’t I Have A Stretched VMware Horizon View Cluster?

For those of you who are involved in the architecture of Horizon View environments, I'm sure this is a question you've asked before, but you may have only ever got the response "VMware doesn't support that design". Why doesn't VMware support this design? It works… most of the time?

During my time at VMware I've heard many different reasons as to why VMware does not support this design. Finally I have an accurate reason and wanted to share it with you.

What is a stretched VMware Horizon View cluster?

For those of you who are asking yourselves, "What is a stretched VMware Horizon View cluster?", I'll set the scene a little. A stretched Horizon View cluster is a single View Pod (a cluster of View Connection Servers) that spans more than one physical location, with those locations connected by a WAN / MAN / MLAN and NOT by a LAN.

An example of a stretched Horizon View Cluster:

Stretched Horizon View Cluster

After an internal discussion around this subject, a colleague and good friend of mine, Mike Barnett, who used to be an Escalation Engineer in VMware's GSS (Global Support Services), set the record straight for us.

So I've decided to adapt his explanation and share it with you.

Why can't I have a stretched VMware Horizon View Cluster? 

The main reason we (VMware) can't support View across physical locations is due to the Java Messaging Service (JMS) component.

Within View we use AD LDS (Active Directory Lightweight Directory Services) alongside JMS. AD LDS is based on the Active Directory framework, which has a robust and resilient site-based architecture that allows it to support distributed environments. View uses AD LDS to store information such as entitlements, View desktop states etc. This information is distributed to all other Connection Servers in the View Pod (cluster of View Connection Servers) using the built-in AD LDS replication. AD LDS is designed to be a store of information which the Connection Server pulls from when starting up. JMS is a separate system within View which manages the running state of the servers, including task scheduling, current desktop states, etc. It runs in memory and loads much of its initial state information from the AD LDS database on startup.

JMS is designed to be a very fast messaging system, which is why we use it within View. The View Connection Servers need to communicate any changes that occur across the cluster as quickly as possible (sub-millisecond speeds). These operations include changes in the state of VMs and VM allocation events.

For example, User A logs into Connection Server 1 and is allocated Desktop1 in Floating Pool 1. User B logs in at the same moment to Connection Server 2, selecting the same pool, Floating Pool 1. If the notification of User A's allocation does not reach Connection Server 2 almost instantaneously, User B could be incorrectly allocated Desktop1 as well, and one of the users gets an error message when the logins clash. Latency, jitter and other WAN-induced conditions can prevent this data from reaching the other Connection Servers in a timely manner, resulting in many different issues.
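To make the login clash described above concrete, here is a minimal sketch in Python. It is a toy model, not how View is implemented: the `ConnectionServer` class, the `simulate` and `clash` functions, and the millisecond figures are all hypothetical and exist only to show why the notification has to arrive before the second login.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionServer:
    """Toy stand-in for a Connection Server (hypothetical model)."""
    name: str
    # This server's local, in-memory view of which desktops are free.
    free: set = field(default_factory=set)

    def allocate(self):
        # Pick a free desktop from the local view and mark it as taken.
        desktop = min(self.free)
        self.free.discard(desktop)
        return desktop


def simulate(login_gap_ms, notification_delay_ms):
    """Return the desktops handed to User A and User B."""
    pool = {"Desktop1", "Desktop2"}
    cs1 = ConnectionServer("CS1", set(pool))
    cs2 = ConnectionServer("CS2", set(pool))

    # User A logs in on CS1 at t = 0.
    desktop_a = cs1.allocate()

    # User B logs in on CS2 at t = login_gap_ms. CS2 only knows about
    # User A's allocation if the notification has already arrived.
    if login_gap_ms >= notification_delay_ms:
        cs2.free.discard(desktop_a)
    desktop_b = cs2.allocate()
    return desktop_a, desktop_b


def clash(login_gap_ms, notification_delay_ms):
    """True if both users were handed the same desktop."""
    desktop_a, desktop_b = simulate(login_gap_ms, notification_delay_ms)
    return desktop_a == desktop_b


if __name__ == "__main__":
    # Assumed figures: logins 5 ms apart, LAN-like vs WAN-like delay.
    print("LAN-like delay, clash:", clash(5, 0.5))
    print("WAN-like delay, clash:", clash(5, 40))
```

In this model the only thing that changes between the two runs is how long the notification takes to arrive. Everything else about the two Connection Servers is identical.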

When a VM's state changes, the other Connection Servers within the View Pod are sent those changes via JMS. These state changes are then committed to the AD LDS instance, both for consumption by the Admin Web UI and for use if a Connection Server is restarted. The other Connection Servers use this information when deciding which desktop to allocate to a given user, and it also allows any given Connection Server to act as the administration access point.

There are additional issues though. When a desktop is enabled for View, whether provisioned in an automated pool or added to a manual pool, the Connection Server populates the VMX file with a 'machine.id' string. (You can see this in the VMX of any View VM.) This machine.id is read by the Agent out of the VMX to get various settings needed for operation of the desktop. One of these attributes (vdi.broker.brokers) contains the hostname of every Connection Server in the cluster. So if you have 4 Connection Servers you will have 4 hostnames. When the View Agent starts it will read this machine.id value and randomly select a Connection Server to connect to so it can report its status.

In a stretched cluster you have physically separated Connection Servers. Any View Agent can talk to any Connection Server. This means that an Agent in Site A could be reporting to a Connection Server in Site B. This causes problems similar to the issue mentioned above. The Agent uses JMS to connect to the Connection Server to send its status, and when a user logs on it's possible for a conflict to occur, even with very low latency.
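As a rough illustration of how often an Agent ends up reporting across the WAN, the sketch below models the random selection in Python. The hostnames, the site mapping and the function names are hypothetical, and it assumes the Agent picks uniformly at random from the list of Connection Servers. It does not parse a real machine.id value.

```python
import random

# Hypothetical hostnames mapped to the site they live in.
BROKERS = {
    "cs1.sitea.example": "A",
    "cs2.sitea.example": "A",
    "cs3.siteb.example": "B",
    "cs4.siteb.example": "B",
}


def pick_broker(brokers, rng):
    # Assumption: the Agent picks one Connection Server uniformly at random.
    return rng.choice(sorted(brokers))


def expected_cross_site(agent_site, brokers):
    """Fraction of Connection Servers that sit in a different site."""
    remote = sum(1 for site in brokers.values() if site != agent_site)
    return remote / len(brokers)


def simulated_cross_site(agent_site, brokers, trials, seed=0):
    """Estimate the same fraction by simulating many Agent start-ups."""
    rng = random.Random(seed)
    remote = sum(
        brokers[pick_broker(brokers, rng)] != agent_site
        for _ in range(trials)
    )
    return remote / trials


if __name__ == "__main__":
    print("Expected cross-site fraction:", expected_cross_site("A", BROKERS))
    print("Simulated:", simulated_cross_site("A", BROKERS, 10_000))
```

With four Connection Servers split evenly across two sites, half of the candidates are remote for any given Agent, so under this assumption cross-site reporting is the normal case rather than an edge case.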

The above reasons address the latency issues, but there are other problems. Because our implementation of JMS is not designed to operate in a multi-site architecture, we haven't developed any specific handlers for Connection Servers becoming unavailable. Even with a very fast, low-latency, very resilient network connection, the uptime is still not as high as that of a LAN connection. It just isn't, regardless of the term used to refer to it (WAN / MAN / MLAN etc).

How Should I Architect Horizon View Across Multiple Sites?

VMware recommends having two View Pods, or one View Pod per site if you happen to have more than two sites. This removes all of the issues that we talked about previously and is a fully supported architecture. It looks something like the following image.

Multi-Site Horizon View Architecture

If you want to find out more about large-scale, multi-site Horizon View environments, I'd recommend you read this VMware blog post: Demystifying VMware View Large Scale Designs

I know it was a lot to read. I hope this has cleared up any questions you had.

Once again, thanks to Mike Barnett for this great explanation.