I’ve got some new servers in with Broadcom quad-port 1Gb Ethernet NICs, model 5719.
Nine of the servers are running ESX 4.1 Update 2 (build 502767) and one is running ESXi 5.0 Update 1 (build 623860).
Two servers have experienced this issue. The symptom in both cases is that port 0 on the card in PCI slot 2 has stopped communicating on the network. It’s still showing as having a link, but there’s no data moving over it. Clicking the button that normally shows the physical NIC CDP info from within the vSphere client in Configuration – Networking just gives:
Cisco Discovery Protocol is not available on this physical adaptor.
But it is available on all the other adapters, and used to be on this one too. Additionally, once the port has been in this state for a while, the Observed IP ranges column on the Network Adapters page just shows None. From the physical Cisco switch's point of view the port is correctly configured; the statistics show data going from the Cisco switch to the NIC but no data in the other direction.
The uptime on the ESXi 5 host is 24 days, and on the ESX 4.1 host it is 25 days.
I have logged a case with VMware, and the workaround so far is to disable NetQueue for the tg3 driver, as follows:
- SSH to host
esxcfg-module -s force_netq=0,0,0,0,0,0,0,0 tg3
Where the number of zeros equals the number of 5719 (tg3) NIC ports, found from
- Reboot host
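The list of zeros can be generated rather than typed by hand. This is a minimal sketch and not part of VMware's instructions: `build_force_netq` is a hypothetical helper name of my own, and the port count is passed in by hand rather than detected. It only prints the command, so it can be checked before being run on the host.

```shell
#!/bin/sh
# Hypothetical helper: print one "0" per tg3 port, comma-separated.
# The port count is given as the first argument; it is not auto-detected.
build_force_netq() {
    ports="$1"
    value=""
    i=0
    while [ "$i" -lt "$ports" ]; do
        if [ -z "$value" ]; then
            value="0"
        else
            value="$value,0"
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$value"
}

# Example with 8 ports, matching the command shown above.
# This echoes the command for review; it does not change any module options.
echo "esxcfg-module -s force_netq=$(build_force_netq 8) tg3"
```

With a count of 8 this prints the same `esxcfg-module` line as above. Change the argument to match the number of 5719 ports in your own host.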
VMware and Broadcom are currently working on this, more info when I get it.
Update 2012-Oct-05: VMware have told me that Broadcom have so far been unable to reproduce the issue, and would like to run some tests on my systems. I’ve agreed to this and am waiting to hear from them. There is now a VMware KB article about this which details the above workaround.
Update 2013-Feb-06: The issue is resolved with the release of a new async (i.e. manufacturer-provided) tg3 driver. The KB article has been updated and a new driver, version 3.129d.v50.1 build 1013484, has been released for ESXi 5.x. Broadcom say that a new ESX 4.1 tg3 driver is currently under certification and so should be available shortly. The problem was caused by the way NetQueues were de-allocated under high network traffic, and the de-allocation sequence has been modified to resolve it. Note also that the inbox (VMware) tg3 driver, version 3.123b.v50.1 build 914586, will not accept the force_netq parameter. If you’re not able to update your driver, see the KB article for an alternative way to disable NetQueue.