Networking with Nested ESXi Hosts

Overview

I recently had to add another host to my lab because I was running out of compute (thank you, VMware Cloud Foundation (VCF) and all the things…Aria Suite (specifically the clustered VMware Identity Manager), NSX, Avi Load Balancer). In addition to the new host, I was fortunate enough to also upgrade the networking to 10 Gbps. As it turns out, though, the switch had nothing to do with the issue I was experiencing.

When I attempted to migrate a guest to the other host, the underlying physical switch was not updating its Address Resolution Protocol (ARP) table. Well, it was, but not fast enough. I tried two different switches and observed the same behavior on both. I think it comes down to either the VMware virtual switches not actually notifying the physical switch, or the physical switch not knowing what to do when it receives the Reverse Address Resolution Protocol (RARP) frames.
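
On the vSphere side, that notification comes from the "Notify Switches" setting in the NIC teaming policy, which (when enabled, the default) has the host send RARP frames out the new uplink after a migration or failover. Below is a quick PowerCLI sketch to confirm it is enabled; it assumes an existing Connect-VIServer session, and the VDS name is purely illustrative:

    # Standard Switch: show the Notify Switches setting per host/vSwitch
    Get-VMHost | Get-VirtualSwitch -Standard | Get-NicTeamingPolicy |
        Select-Object VirtualSwitch, NotifySwitches

    # vSphere Distributed Switch equivalent (per distributed port group)
    Get-VDSwitch -Name "VDS-Lab" | Get-VDPortgroup | Get-VDUplinkTeamingPolicy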

Okay, so what does this mean, and is there a way to make it work? I think yes is the best answer. For completeness' sake, the first switch I tried this with was my trusty old HP ProCurve 2824. Yes, it is 2024, and yes, that switch went end of sale in December 2009…The switch has been great with a single host, where the physical NICs were static. Now that I have added a second host, a migrated virtual machine ends up behind a different set of physical NICs, and the switch needs to be notified of this so it can update its MAC address table and ARP cache.

Nested versus Not Nested

Some more detail on the investigation. I have an Ubuntu 24.04 LTS virtual machine on a port group with a defined VLAN ID, which is equivalent to an access port or an untagged port, depending on the physical switch. When this virtual machine is migrated from host to host, the physical switch updates its ARP table immediately. This is the normal, expected behavior for virtual machines running in clustered configurations with any number of available hosts to migrate to. This is the not-nested scenario.

Now, on to the nested scenario. Remember, nested ESXi is not recommended and not supported in Production environments. Since this is a lab, we get to explore and investigate goofy scenarios like this one!

As I mentioned, I now have two physical hosts. Each host has a virtual switch (it doesn't matter whether it is Standard or Distributed). The security policy on the switch is set to Accept for Promiscuous mode, MAC address changes, and Forged transmits. The nested ESXi host has two virtual NICs, both attached to a port group that allows Virtual Guest Tagging. (If using a Standard Switch, the port group VLAN ID is configured as 4095. If using a vSphere Distributed Switch (VDS), the port group is set to VLAN trunking and, for simplicity's sake, allows all VLAN IDs from 0-4094.) A PowerCLI sketch of these settings follows the screenshots below.

(Screenshots: Standard Switch Security Policy; vSphere Distributed Switch Security Policy; Standard Switch Port Group Trunk; vSphere Distributed Switch Port Group Trunk)
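
Here is roughly what those settings look like in PowerCLI. This is only a sketch: the host, switch, and port group names are hypothetical, and it assumes an existing Connect-VIServer session.

    # Hypothetical names for illustration
    $vmHost = Get-VMHost -Name "esxi01.lab.local"

    # --- Standard Switch ---
    $vss = Get-VirtualSwitch -VMHost $vmHost -Name "vSwitch0" -Standard

    # Security policy: Accept promiscuous mode, MAC address changes, and forged transmits
    Get-SecurityPolicy -VirtualSwitch $vss |
        Set-SecurityPolicy -AllowPromiscuous $true -MacChanges $true -ForgedTransmits $true

    # Virtual Guest Tagging: VLAN ID 4095 passes all tags through to the nested ESXi guest
    New-VirtualPortGroup -VirtualSwitch $vss -Name "Nested-Trunk" -VLanId 4095

    # --- vSphere Distributed Switch ---
    $vds = Get-VDSwitch -Name "VDS-Lab"
    $pg  = Get-VDPortgroup -VDSwitch $vds -Name "Nested-Trunk"

    # Security policy on the distributed port group
    Get-VDSecurityPolicy -VDPortgroup $pg |
        Set-VDSecurityPolicy -AllowPromiscuous $true -MacChanges $true -ForgedTransmits $true

    # VLAN trunking, allowing all VLAN IDs for simplicity
    Set-VDVlanConfiguration -VDPortgroup $pg -VlanTrunkRange "0-4094"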

I am using the Ubuntu virtual machine to initiate ICMP traffic (ping) to the nested ESXi guest. The Ubuntu virtual machine is on a defined VLAN port group and the nested ESXi guest is on two VLAN trunk port groups. If I start out with both guests on the same host, the ping is good.
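
To see where the physical switch thinks the nested host currently lives, I check the MAC address table and the ARP cache on the Arista. A sketch of the commands I am using (the MAC and IP addresses here are just placeholders):

    ! Which physical port has the switch learned the nested host's MAC address on?
    switch# show mac address-table address 0050.56aa.bbcc

    ! What does the ARP cache hold for the nested host's IP address?
    switch# show ip arp 192.168.10.50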

The nested ESXi host's MAC address shows up on interface Ethernet 17.

Now, I will migrate the nested ESXi guest to the other physical host and check the ARP cache again.

The destination physical host is connected to interfaces Ethernet 33-34, but the MAC address is still showing against the original physical host, connected to interfaces Ethernet 17-18.

If I generate traffic to the nested ESXi guest, the MAC entry changes. The ping even shows a duplicate IP address for one of the hops. This happens while the switch is learning the new interface on which the MAC address resides.

Investigation and Likely Resolution

So at this point, I know it is most likely an ARP timing issue. In production this is not a problem, because nested virtual machines are not a thing there. It also was not an issue when I had just a single host, since the physical NICs never changed. Now that I have two hosts, I need to figure out how to resolve it.

The replacement switch I am using is an Arista DCS-7050SX-64 running EOS 4.28.10.1M, the latest (and last) compatible version for this model at the time of posting. I found an article on Arista's website discussing ARP Entry Creation, Aging and Refresh.

The recommended practice is to set the ARP cache expiry time to less than the MAC aging timer, to avoid the MAC entry aging out into a "not learned" state while the ARP entry is still valid, which leads to traffic flooding.

Okay, terminology. The ARP cache is a table that stores the IP-to-MAC address mappings for the devices the switch routes traffic for. Entries expire after four (4) hours by default. While I hope my nested hosts do not ping-pong between the two physical hosts often, I do not want there to be an issue in the case of a Distributed Resource Scheduler (DRS) event. I know, I know, I can change DRS to a manual setting.

The MAC aging timer defaults to 300 seconds, or five (5) minutes.
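
Following that guidance, the switch-side fix is to get the ARP expiry below the MAC aging timer, either by lowering the ARP timeout or by raising the MAC aging time. Here is a sketch of what that could look like in EOS configuration; the SVI and the values are examples, so verify the syntax against the EOS command reference for your release:

    ! Option 1: lower the ARP aging timeout on the L3 interface (SVI) below 300 seconds
    switch(config)# interface Vlan10
    switch(config-if-Vl10)# arp aging timeout 240

    ! Option 2: raise the global MAC aging time above the 4-hour (14400 second) ARP default
    switch(config)# mac address-table aging-time 14460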

In the end, when I think about the situation, since I am trying to nest VMware Cloud Foundation (VCF), I know I will need four nested ESXi guests. I can create VM-Host affinity rules to keep two nested guests "pinned" to each of their respective physical hosts. This should remove any issues with DRS while still allowing DRS to work with my other supporting virtual machines (Domain Controller, NTP server, Cloud Builder). Worst-case scenario, I combine the affinity rules with static ARP entries, even though that seems like an absolutely horrible idea in the long run.
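
Here is a sketch of what that pinning could look like in PowerCLI, using "should run on" VM-Host rules so DRS can still move things during maintenance. The cluster, host, and virtual machine names are hypothetical:

    # Hypothetical lab objects
    $cluster = Get-Cluster -Name "Lab-Cluster"
    $host1   = Get-VMHost -Name "esxi01.lab.local"
    $host2   = Get-VMHost -Name "esxi02.lab.local"

    # Group the nested ESXi guests and the physical hosts
    $vmGroup1   = New-DrsClusterGroup -Cluster $cluster -Name "Nested-Pair-1" -VM (Get-VM "nested-esxi-01", "nested-esxi-02")
    $vmGroup2   = New-DrsClusterGroup -Cluster $cluster -Name "Nested-Pair-2" -VM (Get-VM "nested-esxi-03", "nested-esxi-04")
    $hostGroup1 = New-DrsClusterGroup -Cluster $cluster -Name "Host-1" -VMHost $host1
    $hostGroup2 = New-DrsClusterGroup -Cluster $cluster -Name "Host-2" -VMHost $host2

    # "Should run on" keeps each pair on its host without blocking maintenance mode
    New-DrsVMHostRule -Cluster $cluster -Name "Pin-Nested-Pair-1" -VMGroup $vmGroup1 -VMHostGroup $hostGroup1 -Type ShouldRunOn
    New-DrsVMHostRule -Cluster $cluster -Name "Pin-Nested-Pair-2" -VMGroup $vmGroup2 -VMHostGroup $hostGroup2 -Type ShouldRunOn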

Or, I can just disable DRS altogether.
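
If I go that route, it is a one-liner in PowerCLI (using the same hypothetical cluster name):

    # Turn DRS off for the whole cluster
    Get-Cluster -Name "Lab-Cluster" | Set-Cluster -DrsEnabled:$false -Confirm:$false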

