An error occurred during configuration of the HA Agent on the host

A funny thing happened whilst I was configuring a new ESX 3.5 cluster. Part of my process was to enable HA on my hosts.

After checking the “Enable VMware HA” box and making a few changes to the default settings, my hosts began to configure HA.

All seemed to be going well until I was presented with this error: “An error occurred during configuration of the HA Agent on the host”.

I have had this error before on other systems, and I remembered that it can be caused by DNS issues. First things first, though: I tried disabling and re-enabling HA a few times in case there was a “glitch”, but this made no difference.

Here are a few of the other initial checks I made:

  • I went through all of my DNS settings to make sure everything was set in lowercase.
  • I checked connectivity between VC, Host1 and Host2 by pinging FQDNs and using vmkping; everything could reach everything else.
  • I rebooted VC.
  • I rebooted both hosts.
  • I disconnected and reconnected both hosts.
  • I removed the cluster, created a new one and added the hosts back in.
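
The lowercase and connectivity checks above can be scripted from the service console. This is only a minimal sketch, assuming a POSIX shell with grep and ping available; the function names are my own and the host names in the usage comment are placeholders, not the ones from this cluster.

```shell
# Read hosts-file style text on stdin and print the non-comment lines
# that contain an uppercase letter, so mixed-case entries stand out.
uppercase_entries() {
    grep -v '^[[:space:]]*#' | grep '[[:upper:]]'
}

# Ping each name given as an argument and report whether it answered.
check_hosts() {
    for name in "$@"; do
        if ping -c 2 "$name" > /dev/null 2>&1; then
            echo "$name: reachable"
        else
            echo "$name: NOT reachable"
        fi
    done
}

# Usage on a host (names are placeholders):
#   uppercase_entries < /etc/hosts
#   check_hosts vc.example.local host1.example.local host2.example.local
```

Note that check_hosts only covers the service console network; vmkping would still be needed to test the VMkernel interfaces.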

Nothing fixed my issue. 🙁

After revisiting my settings several times, I noticed that the VMware HA box on the cluster's Summary tab said this:

  • Current Failover Capacity: 0 Hosts
  • Configured Failover Capacity: 1 Host

Clearly this wasn’t correct, so I decided to look more closely at the HA agents installed on the hosts themselves, starting with the HA agent log (aam_config_util_addnode.log). It is found at the following location on the ESX host:

cat /var/log/vmware/aam/aam_config_util_addnode.log

Whilst looking through the log I noticed that I wasn’t getting a ping response from my Default Gateway.
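
Rather than reading the whole log, the relevant lines can be filtered out. A small sketch follows; the search pattern is an assumption on my part, since the exact wording of the log entries may differ between builds, and the function name is hypothetical.

```shell
# Read log text on stdin and keep only the lines that mention
# "ping" or "gateway" (case-insensitive). The pattern is a guess
# at what the interesting entries contain, not a documented format.
gateway_lines() {
    grep -i -E 'ping|gateway'
}

# Usage on the ESX host:
#   gateway_lines < /var/log/vmware/aam/aam_config_util_addnode.log
```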

I knew my Gateway was working, as my workstation was configured to use it and that was functioning fine. However, this Gateway is an interface on a firewall, and that interface had been configured to discard ping requests. Eureka moment: was this the reason why the HA agent would not configure? Because it couldn’t get a ping response from the Default Gateway, it assumes the Gateway doesn’t exist and reports a misconfiguration error.

So let’s test this theory. I changed the Service Console Gateway to our secondary Default Gateway, which DID allow ping requests, and lo and behold, after re-enabling HA on the cluster it all sprang into life. Both hosts were now HA enabled!
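
The same test can be run by hand before touching HA at all. Here is a rough sketch, assuming the service console provides the usual Linux `route -n` and `ping` commands; the two function names are hypothetical helpers, not part of ESX.

```shell
# Read `route -n` style output on stdin and print the default gateway,
# i.e. the Gateway column of the 0.0.0.0 destination with the G flag set.
default_gateway() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2; exit }'
}

# Ping the given address a few times and report whether it answered.
check_gateway() {
    gw="$1"
    if ping -c 3 "$gw" > /dev/null 2>&1; then
        echo "gateway $gw answers ping"
    else
        echo "gateway $gw does NOT answer ping"
    fi
}

# Usage on the host:
#   check_gateway "$(route -n | default_gateway)"
```

If the gateway does not answer here, it is likely that the HA agent will not get a reply from it either.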

Now that the HA agent can see the Default Gateway, the log shows it getting a ping response.

Update: After posting this article I received a comment from Duncan Epping telling me about an advanced HA option: das.isolationaddress.

By adding an extra isolation response address in the HA Advanced Options we can tell HA to use a different Gateway rather than using the Default which we know will not give a response.

Add in the following options to the HA Advanced Options window.

  • das.isolationaddress[x] = (your secondary isolation response address)
  • das.usedefaultisolationaddress = false

The second option I’ve included tells HA not to use the Default Gateway, which means it will use the address you’ve just added instead.

This post was originally posted, and still is, on PlanetVM. I would just like to thank Tom Howarth for allowing me to guest post on his great Virtualisation Blog.