VMware ESX/ESXi issues with Broadcom 5719 quad port Gb NIC

I’ve got some new servers with Broadcom quad port 1Gb Ethernet NICs in, model 5719.

Nine of the servers are running ESX 4.1 Update 2 (build 502767), one is running ESXi 5.0 Update 1 (build 623860).

Two servers have experienced this issue. The symptom in both cases is that port 0 on the card in PCI slot 2 has stopped communicating on the network. It’s still showing as having a link, but there’s no data moving over it. Clicking the button that normally shows the physical NIC CDP info from within the vSphere client in Configuration – Networking just gives:

Cisco Discovery Protocol is not available on this physical adaptor.

But it is on all the other ones, and used to be on this one too. Additionally, after a while of it being in this state the Observed IP ranges column in the Network Adapters page just shows None. From the physical Cisco switch point of view the port is correctly configured, the statistics show data going from the Cisco switch to the NIC but no data in the other direction.

The uptime on the ESXi 5 host is 24 days, the ESX 4.1 is 25 days.

Have logged a case with VMware and the fix so far is to disable tg3 netq, as follows:

  • SSH to host
  • esxcfg-module -s force_netq=0,0,0,0,0,0,0,0 tg3
    Where the number of zeros equals the number of 5719 (tg3) NIC ports, found from esxcfg-nics --list.
  • Reboot host
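
Counting the zeros by hand is easy to get wrong on hosts with several cards, so the step can be scripted. This is only a minimal sketch: it assumes that `esxcfg-nics --list` prints one line per NIC with the driver name in the third column, and the two function names are made up for illustration. It prints the command instead of running it, so it can be checked first.

```shell
#!/bin/sh
# Sketch only: build the force_netq value for however many tg3 ports a host has.
# Assumption: `esxcfg-nics --list` shows the driver name in column 3.

# count_tg3_ports: reads `esxcfg-nics --list` output on stdin and
# prints the number of ports that use the tg3 driver.
count_tg3_ports() {
    awk '$3 == "tg3" { n++ } END { print n + 0 }'
}

# build_force_netq: prints N comma-separated zeros, one per tg3 port.
build_force_netq() {
    count=$1
    value=""
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ -z "$value" ]; then value="0"; else value="$value,0"; fi
        i=$((i + 1))
    done
    printf '%s\n' "$value"
}

# Example usage on a host (echoes the command rather than executing it):
#   ports=$(esxcfg-nics --list | count_tg3_ports)
#   echo "esxcfg-module -s force_netq=$(build_force_netq "$ports") tg3"
```

The host still needs a reboot afterwards for the module option to take effect, as in the steps above.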

VMware and Broadcom are currently working on this, more info when I get it.

Update 2012-Oct-05: VMware have told me that Broadcom have so far been unable to reproduce the issue, and would like to run some tests on my systems. I’ve agreed to this and am waiting to hear from them. There is now a VMware KB article about this which details the above workaround.

Update 2013-Feb-06: The issue is resolved with the release of a new async (i.e. manufacturer-provided) tg3 driver. The KB article has been updated and a new driver version 3.129d.v50.1 build 1013484 has been released for ESXi 5.x. Broadcom say that a new ESX 4.1 tg3 driver is currently under certification and so should be available shortly. The cause of the problem was to do with the de-allocation of NetQueues in high network traffic conditions. The de-allocation sequence has been modified to resolve the issue. Note also that the inbox (VMware) tg3 driver version 3.123b.v50.1, Build: 914586 will not accept the force_netq parameter, see the KB article for an alternative way to disable NetQueue if you’re not able to update your driver.
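
To work out which hosts still need the new driver, something like the following could be used. It is a sketch under a couple of assumptions: that `ethtool -i vmnicN` is available in the host shell and prints a `version:` line in the usual ethtool format, and that tg3 versions sort by the number and letter after the `3.` (so 3.129d is newer than 3.124c). The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumption: tg3 versions look like 3.<number><letter>.vNN.N
# and a higher number (or the same number with a later letter) means newer.

# tg3_version: reads `ethtool -i vmnicN` output on stdin and
# prints the value of its "version:" line.
tg3_version() {
    awk '$1 == "version:" { print $2 }'
}

# has_netq_fix: succeeds (exit 0) if the version is 3.129d or later.
has_netq_fix() {
    num=$(printf '%s\n' "$1" | sed 's/^3\.\([0-9]*\).*/\1/')
    letter=$(printf '%s\n' "$1" | sed 's/^3\.[0-9]*\([a-z]*\).*/\1/')
    case "$num" in
        ''|*[!0-9]*) return 1 ;;   # not a version string this sketch understands
    esac
    if [ "$num" -gt 129 ]; then return 0; fi
    if [ "$num" -lt 129 ]; then return 1; fi
    case "$letter" in
        ''|[a-c]) return 1 ;;      # 3.129 to 3.129c: older than the fixed driver
        *) return 0 ;;             # 3.129d or a later letter
    esac
}

# Example usage on a host:
#   v=$(ethtool -i vmnic0 | tg3_version)
#   if has_netq_fix "$v"; then echo "$v: fixed driver"; else echo "$v: needs update"; fi
```

The check only looks at the version string, not the build number, so treat its answer as a hint and confirm against the KB article.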


40 Responses to VMware ESX/ESXi issues with Broadcom 5719 quad port Gb NIC

  1. james says:

    Any update on this Robin, we have the same issue on 2 Dell R720s with the 5719.

    • rcmtech says:

      No, the advice is still to disable NetQueue as per above. The call is still open with VMware but I’ve not had any extra info from them. I’m going to be running the command on my ten hosts this week (wasn’t able to last week for business reasons).

  2. hi rcmtech,
    I had the same problems with the Broadcom BCM5719. Thank you for sharing.
    But could you answer a question for me: how long has it been since you disabled NetQueue? I’m worried that this problem will still occur.

    • rcmtech says:

      One host has had NetQueue disabled for a few weeks (since I posted about this originally on 15th August) and I’ve now done all the rest of my hosts too. Out of the ten hosts I have with the 5719 cards, only two experienced the issue, and that was after about four weeks of uptime. One was running ESXi 5.0, the other ESX 4.1. The remaining hosts never exhibited the problem, but have had NetQueue disabled anyway as per VMware tech support recommendation.
      What makes you worried that the problem might still occur despite NetQueue being disabled?

  3. Thanks for the information. I’m less worried now and will apply this solution to my hosts tomorrow.
    Did the VMware support team explain this problem to you? I’ve read some documents about NetQueue for VMware, and as I understand it, it is not enabled by default for Ethernet NICs.
    esxcfg-module -s force_netq=0,0,0,0,0,0,0,0 tg3
    So this solution only forces the feature to be disabled on the NICs, even though it has never been enabled.

  4. VMdude says:

    I had this same issue with HP Gen8 servers using the 5719s. According to VMware it is a Broadcom driver issue. VMware has an open ticket with Broadcom to update their driver. I disabled NetQueue per VMware’s advice but am still concerned about the issue coming back, since this happened on hosts with mission critical servers…twice in a week and a half!

    It was also recommended that we use beacon probing on the vSwitches as a way to fail over to another pNIC when one stops passing traffic to the network.

  5. Nick says:

    I can’t believe this. I raised 2 cases with VMware support and spent days on the phone working through this being told it was our network even after actually showing them the data was being sent across the link.

    I have absolutely no faith in VMware’s support after this. 5-6 different guys I spoke with. I intend to take this up with their team leaders.

    • rcmtech says:

      The issue I have is that NO data goes over the link. The link is “up” in as much as it’s showing as 1000/Full both in the vSwitch and the physical switch. The physical switch keeps sending data to it, but never receives any data, and the data that’s sent from the physical switch is not received by the vSwitch. The only way to detect this would be if you either monitored your physical switches very closely for strange behaviour (like zero data flow in one direction) or used beacon probing – but the latter wasn’t appropriate in my case, see previous comment(s).

      I got the impression that it was a fairly new issue – if it is the same as what you’re experiencing then maybe you were one of the first to report it, in which case the until then “good” tg3 driver and 5719 hardware would have to be proved bad, and I bet tech support see a lot more issues with customers’ own networks/configs than with hardware/drivers. Or you might have been unlucky with who you spoke to. Personally I’ve always found VMware tech support excellent (getting through to them on the other hand… don’t get me started on the VMware automated phone system!).

  6. VMdude says:

    I talked to a VMware tech 2 days ago and he said Broadcom updated the ticket that VMware has with them about this issue. Broadcom was able to reproduce the issue on Linux boxes and are now testing on Windows boxes.

    Hopefully an updated driver will be released soon. I am using beacon probing now as a belt and suspenders approach. We tested it by removing a VLAN that the pNIC and VM were attached to, to simulate a failure with Link Status still being up, and it failed over successfully to another pNIC.

  7. I also ran into these issues between Broadcom and ESX a long long time ago (ESX 3 and 3.5). It was fixed, and it’s a horror to see this issue back with ESXi 5.0 Update 1.
    This is the primary reason I only get Intel network cards for my designs/architecture. A lesson learned.
    But what to do when it’s on the motherboard of your server.

    • rcmtech says:

      First issues I’ve had with Broadcom, have been running with their stuff for years both with Windows and VMware and not had any problems. Seems just to be this tg3 driver (e.g. my other cards/VMware hosts are using the bnx2 and/or bnx2x driver and I’ve not had any issues with them at all, and zero issues on Windows with any cards).

  8. Jon says:

    Very helpful! Google found this page, our call with VMware didn’t help at all! I don’t think VMware have this bug in their KM system…

    • Danny says:

      VMware have confirmed this as a bug and that VMware and Broadcom are trying to fix it. However, they also said that this fix doesn’t always work and they know of at least one customer who had to swap all the 5719s out for Intel cards… fantastic stuff!

      • rcmtech says:

        Hi Danny, by “this fix doesn’t always work” do you mean the “disable netqueue” one or some other fix that Broadcom/VMware are still working on?

  9. Danny says:

    I was told by VMware that the “disable netqueue” fix hasn’t worked for everyone but does for the vast majority. I just hope we’re in the majority :)

  10. Hi all,
    I have just found out that the VMware site has been updated with a new driver version for the BCM5719:
    VMware ESXi 5.0 Driver CD for Broadcom NetXtreme I Gigabit Ethernet including support for 5717/5718/5719/5720
    Version 3.124c.v50.1
    2012-09-18

    But I’m not sure whether this release resolves our issue with the BCM5719. Could anyone who is able to contact VMware support please call and check this release?

  11. rcmtech says:

    Was quite hard to find the download for this.
    https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI50-Broadcom-tg3-3124cv501&productId=229

    The changelog mentions something which might be this problem (but it’s for a 5717 – though that’s the same family as the 5719):
    Problem: (CQ65077) Vmware VM’s network traffic stuck during stress for 5717.
    Cause : RSS was enabled by mistake.
    Change : Limit irq number for ESX if NetQ is not enabled or device is not IOV capable.
    Impact : MSI-X devices on VMWare only.

    I’ve emailed the VMware support guy who’s got my case open to ask what this driver is.

    • rcmtech says:

      Support blokey said that the bug hadn’t been updated to reflect that it was resolved with the newer driver, and to wait for further info – which he’ll provide to me once he gets it.

  12. Kevin says:

    I spoke to VMware support today, who said that Broadcom or the hardware provider (in our case Dell) would need to be contacted. They did say to still run the fix with the new driver, but if anyone can get a better reply than me that would be great!

  13. elgwhoppo says:

    So I am on the phone with VMware about this very situation, specifically for HP Gen8 331FLR NICs, which use the Broadcom 5719 driver mentioned here. The driver version which is on the hosts I’m looking at is 3.123b.v50.1-1OEM.500.0.0.472560, compared to the latest version tg3-3.123b.v50.1-682322.zip. Will post back up when I get an answer from VMware.

  14. Pingback: Intermittent Network Connectivity with ESXi 5.0 and DL380p Gen8 Servers with 331FLR NICs | Elgwhoppo's vNotebook

  15. rcmtech says:

    Update: a port on one of the two 5719 cards in my test host, running the Broadcom debug driver, has just started to exhibit the problem. Am now waiting for contact from Broadcom.

  16. Werner says:

    We see the same problem with the BCM5720 running ESXi 5.0 update 1.

  17. VGJ says:

    Hi,
    do you have any news on this?

  18. rcmtech says:

    Just had a new driver through from Broadcom which “may have a potential fix for this issue”. Am installing it now.

  19. a_user says:

    Two new R720 hosts on 4.1 with these cards as well – build 4.1 721871. Have been having these problems almost immediately since putting them into use. Mind you, they average 60-80+ VMs (2 x 6 core, 384GB). Going to try the NetQueue disable until we hear back on the status of this new driver. Had me in a wtf state for the past two days wondering what was going on. Never experienced these problems previously. Funny thing is, these were ordered with Intel NICs and Dell inadvertently sent these as the PCI NICs.

    Because they connect to our EqualLogic back-end I also had iSCSI paths drop and make VMs become inaccessible! Glad I found this thread. Will be following it closely.

  20. rcmtech says:

    Had notification of a new tg3 driver through via VUM on 21st December: http://kb.vmware.com/kb/2033752
    Version is 3.123b.v50.1 which fixes PR 899456, but I can find no details of what that PR is…

    • rcmtech says:

      Just been told that this does not include the fix that’s in the test driver I’m currently running.

      • Brad Calvert says:

        Ridiculous that this very dangerous bug is still unfixed. Thanks for keeping us updated.

  21. Just had the same issue on HP DL380 G8 hosts. Thanks for the info in the post, wasn’t having any luck anywhere. Will be interesting to see when Broadcom will finally release a fix.

  22. Pingback: Missing Broadcom 5719 tg3 NICs after updates | Robin CM's IT Blog

  23. SeanD says:

    Any new updates? Is there an updated driver for this issue? I have a new Dell R820 that has 2 nic ports out of commission because of this issue.

  24. Marc says:

    Seems that there is an updated driver for vSphere 4.1 too that solves this issue:

    “For ESX/ESXi 4.x, this issue has been resolved in async tg3 driver version 3.129d.v40.1.”

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2035701

  25. Helge Klein says:

    Very helpful, thanks. We experienced this on a HP DL 380 G8. Disabling and re-enabling the link as described in VMware’s KB article fixed the issue (for now).

  26. Laura says:

    Very helpful…Thanks!!
