VMware: What to Do if ESXi Hosts Become Disconnected from vCenter
Cause:
Recently we have been experiencing problems with ESXi 4.1 hosts becoming disconnected from vCenter.
Troubleshooting Steps:
1. Right-clicking on the host and selecting Connect does not normally fix the problem. Logging in directly to the host with the vSphere Client usually does not work either, or if it does, the client disconnects shortly after connecting.
2. All of the virtual machines running on the host continue to run; however, with the host disconnected from vCenter and no direct vSphere Client connection, you are unable to manage the virtual machines. Connecting to the host with PowerCLI does not work either; if it does connect, it drops the connection soon afterwards.
We have used a variety of “tricks” to get the host to reconnect.
1. Restart the management agents and then reconnect. If this does not work the first time then try again. I have found that it often works on the second attempt.
a. From the host console, press F2 to log in.
b. Enter the root password.
c. Go down to Troubleshooting Options and select it.
d. Select Restart Management Agents.
e. Press F11 to restart the management agents.
f. Once they have restarted, attempt to reconnect the host by right-clicking on it within vCenter and selecting Connect. You will normally get an error message and then be prompted for a username and password; enter root and the root password.
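Since the restart often only works on the second attempt, the retry can be wrapped in a small helper if you already have shell access to the host. This is just a sketch: `try_twice` is a made-up name, not a VMware tool, and `services.sh restart` is the same agent restart used in step 4 below.

```shell
# try_twice: run a command and, if it fails, retry it once,
# mirroring the observation that the agent restart often
# succeeds on the second attempt. (Hypothetical helper.)
try_twice() {
  "$@" || "$@"
}

# On the host you would run, e.g.:
#   try_twice services.sh restart
```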
2. If the above fails twice, try removing the host from vCenter and adding it back in. This has side effects: you will lose the host's performance statistics, the virtual machines will need to be put back into the correct resource pools if you are using resource pools, and if you are using Site Recovery Manager (SRM) the virtual machine protection will need to be reconfigured. You might want to skip this step, try the ones below first, and use this as a last resort.
a. Right-click on the host in vCenter and select Remove.
b. Once it has been removed, right-click on the container the host was originally in, e.g. a cluster, and select Add Host.
c. Enter the host name, root for the username, and the root password.
d. If the host starts to add and then fails, repeat the steps in 1 above; again, you might have to try them a couple of times.
3. If you still cannot get the host to reconnect and you are using Fibre Channel storage, rescan the HBAs. As you cannot manage the host with a vSphere Client, you will have to do this at the command line.
a. At the console, if you are not already in the Troubleshooting Options, follow steps a through c in 1 above to get there.
b. If the menu shows Disable Remote Tech Support Mode, then Remote Tech Support is already enabled; if there is an option for Enable Remote Tech Support Mode, select it.
c. Use an SSH client such as PuTTY to open an SSH connection to the host.
d. Log in as root.
e. Issue the command esxcfg-rescan for each of the HBAs on the host, where vmhba is the HBA device, e.g.
esxcfg-rescan vmhba1
esxcfg-rescan vmhba2
Now try reconnecting the host by right-clicking on it and selecting Connect, as described in step 1.f. Again, if it does not work, follow the steps in 1 above a couple of times.
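Rather than typing one esxcfg-rescan command per HBA, the adapters can be enumerated and rescanned in a loop. A sketch, with two assumptions: that `esxcfg-scsidevs -a` (present on ESXi 4.x) lists one adapter per line with the vmhba name in the first column, and that `list_hbas` is an illustrative name of my own.

```shell
# Pull the adapter names (first column of lines starting "vmhba")
# out of esxcfg-scsidevs -a style output.
list_hbas() {
  awk '/^vmhba/ { print $1 }'
}

# On the host you would run:
#   esxcfg-scsidevs -a | list_hbas | while read -r hba; do
#     esxcfg-rescan "$hba"
#   done
```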
4. If you still cannot get the host to connect, check for leftover directories in /var/run/vmware/root_0 and check whether /var/lib/vmware/hostd/stats is full; tidy these and attempt to reconnect again.
a. If you do not already have an SSH connection to the host, follow steps a through d in 3 above to get one.
b. cd /var/run/vmware/root_0
c. There should be a directory here for each running virtual machine on the host. Issue the following command to list all the directories:
ls
If there are more directories than running virtual machines, use the following command to remove the empty ones. It will attempt to delete the non-empty ones but will fail on those, so it is safe to run against all directories:
rmdir *
d. Issue the following command to check for full filesystems:
vdf
The one to check is hostdstats; if it is 100% full, tidy it as follows:
i. cd /var/lib/vmware/hostd/stats
ii. rm hostAgentStats-*.stats
e. Now restart all of the services with:
services.sh restart
You can run this while there are virtual machines running on the host without affecting them.
f. Now attempt to connect the host again.
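The cleanup in step 4 leans on the fact that rmdir refuses to delete a directory that still has contents, which is what makes running rmdir against every directory under /var/run/vmware/root_0 safe. Here is a self-contained demonstration of that behaviour, run against a throwaway temp directory rather than the real path:

```shell
# Simulate root_0: one stale empty directory and one directory
# still in use by a running VM.
tmp=$(mktemp -d)
mkdir "$tmp/stale-vm" "$tmp/running-vm"
touch "$tmp/running-vm/world.lck"   # the running VM's dir is non-empty

# rmdir deletes the empty directory and fails (harmlessly) on the
# non-empty one, just as described in step 4.
( cd "$tmp" && rmdir -- * 2>/dev/null ) || true

ls "$tmp"    # only running-vm is left
```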
I have found that if a host becomes disconnected and I have had to use the steps above to reconnect it, it becomes disconnected again within the next 24 hours or so unless the host is restarted. Therefore, I suggest that once you have the host reconnected, you put it into maintenance mode and restart it.
I think this issue is being caused by NetApp SnapManager for SQL and SnapManager for Exchange. When a host becomes disconnected, SnapDrive running on one of the virtual machines on that host is usually in the middle of reconfiguring the virtual machine and rescanning the HBAs to attach or detach RDMs from a NetApp snapshot, in order to verify a backup that has just been run by the SnapManager product on that virtual machine. I do not think it is a fault of the SnapManager product, as it is just using the VMware APIs to perform the tasks it needs to do. All of the hosts I am having this issue with are running an unpatched version of ESXi 4.1 (build 260247). They are also running from IBM-supplied USB keys without the latest IBM customisation. I plan to upgrade the hosts to at least ESXi 4.1 update 1 (build 45697) or ESXi 4.1 update 2 (build 502767) to see if this helps the situation. I will also apply IBM customisation 1.0.4, as this fixes the issue of vMotion and Fault Tolerance becoming disabled following a reboot of the host. I will update this post with details of whether these updates helped or not.