UPDATE – Ignore the Broadcom driver stuff. It seemed to be ok all afternoon, but I have rebooted the ESXi host and it’s gone completely unstable again, with pretty much continuous iSCSI disconnects. Clearly this TOE/iSCSI offload support is absolutely terrible. I’m going to have to use the software initiator. What is the point of Dell marketing this?
UPDATE 2 – Dell decided to get to the bottom of this and, following an extended troubleshooting session in which I reverted one of the hypervisors, they were able to replicate the fault in their lab. It’s now being escalated with VMware and Broadcom. More news as I get it…
I’m doing this upgrade from vSphere 4.1U1 at the moment, so I wanted to make notes, particularly on the hypervisor rebuild part, to save me looking things up as I do each host. Since 4.1 I have used the hardware iSCSI offload features of the Broadcom bnx2 chips in the servers, using them as HBAs in their own right. As per the Dell MEM driver 1.1 release notes, jumbo frames are still not supported with this configuration. However, I had big problems getting this working at all with 5.0. According to Dell support I’m in a minority of customers that use TOE, so their inclination was to suggest I fall back to software iSCSI. I purposely delayed adopting vSphere 5.0 until it had been out for a few months, hoping to avoid being among the first to hit major issues, but I still ran into this. The problem manifests itself as regular errors (every few seconds) in the array logs like this:
iSCSI login to target ‘192.168.100.12:3260, iqn.2001-05.com.equallogic:0-8a0906-c541d5105-94c0000000a4adc3-vsphere’ from initiator ‘192.168.100.25:2076, iqn.1998-01.com.vmware:server.domain.com:1454019294:34’ failed for the following reason: Initiator disconnected from target during login.
These errors are generated by all HBAs that are configured for storage. Furthermore, only one path is established, and the volume will occasionally go offline altogether. The ESXi host’s /var/log/vmkernel.log shows bnx2i disconnection events like this:
2012-01-16T16:35:11.248Z cpu14:4802)bnx2i::0x410013204890: bnx2i_conn_stop::vmnic1 - sess 0x41000de04fc8 conn 0x41000de05350, icid 11, cmd stats={p=0,a=0,ts=0,tc=0}, ofld_conns 2
2012-01-16T16:35:11.248Z cpu14:4802)bnx2i::0x410013204890: bnx2i_ep_disconnect: vmnic1: disconnecting ep 0x410012a18f20 {11, 120c00}, conn 0x41000de05350, sess 0x41000de04fc8, hba-state 1, num active conns 2
2012-01-16T16:35:25.554Z cpu12:4802)bnx2i::0x410013204890: bnx2i_conn_stop::vmnic1 - sess 0x41000de04fc8 conn 0x41000de05350, icid 13, cmd stats={p=0,a=0,ts=0,tc=0}, ofld_conns 2
2012-01-16T16:35:25.554Z cpu12:4802)bnx2i::0x410013204890: bnx2i_ep_disconnect: vmnic1: disconnecting ep 0x410012a192f0 {13, 125400}, conn 0x41000de05350, sess 0x41000de04fc8, hba-state 1, num active conns 2
Dell support’s first suggestion was to raise the iSCSI login timeout value from 5 seconds to 60 seconds; you need build 515841 to be able to edit this value. However, that did not fix the issue when using TOE. It turned out to be a Broadcom driver issue.
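For reference, on build 515841 or later the same change can be made from the vSphere CLI; a sketch, with vmhba33 standing in for each of your Broadcom HBAs (I made the change through the client’s advanced settings, so treat this as an untested equivalent):

esxcli --server=servername iscsi adapter param set --adapter=vmhba33 --key=LoginTimeout --value=60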
The vanilla install of ESXi 5.0.0 (build 469512), the Hypervisor Driver Rollup 1, and the update to build 515841 all include the same driver vib packages, which seem to be broken. You can audit these by running:

esxcli --server=servername software vib list
net-bnx2     2.0.15g.v50.11-5vmw.500.0.0.469512   VMware   VMwareCertified
net-bnx2x    1.61.15.v50.1-1vmw.500.0.0.469512    VMware   VMwareCertified
net-cnic     1.10.2j.v50.7-2vmw.500.0.0.469512    VMware   VMwareCertified
scsi-bnx2i   1.9.1d.v50.1-3vmw.500.0.0.469512     VMware   VMwareCertified
The Broadcom NetXtreme II Network/iSCSI/FCoE Driver Set does contain newer versions:
net-bnx2     2.1.12b.v50.3-1OEM.500.0.0.472560    Broadcom   VMwareCertified
net-bnx2x    1.70.34.v50.1-1OEM.500.0.0.472560    Broadcom   VMwareCertified
net-cnic     1.11.18.v50.1-1OEM.500.0.0.472560    Broadcom   VMwareCertified
scsi-bnx2fc  1.0.1v.v50.1-1OEM.500.0.0.406165     Broadcom   VMwareCertified
scsi-bnx2i   2.70.1k.v50.2-1OEM.500.0.0.472560    Broadcom   VMwareCertified
However, there is a further complication: these drivers have to be installed after the VMware updates. When the Broadcom drivers are installed, the VMware-supplied drivers for these devices are removed. Confusingly, the VMware updater to build 515841 will see that they are missing, will ignore the OEM Broadcom replacements, and will re-install the older versions! If the host reboots at that point it will crash to a magenta screen of death as the kernel initialises, possibly because two different driver versions are trying to access the same hardware. Take note: the Broadcom installer removes the following bootbank packages from the host:
VMware_bootbank_misc-cnic-register_1.1-1vmw.500.0.0.469512
VMware_bootbank_net-bnx2_2.0.15g.v50.11-5vmw.500.0.0.469512
VMware_bootbank_net-bnx2x_1.61.15.v50.1-1vmw.500.0.0.469512
VMware_bootbank_net-cnic_1.10.2j.v50.7-2vmw.500.0.0.469512
VMware_bootbank_scsi-bnx2i_1.9.1d.v50.1-3vmw.500.0.0.469512
So my recommendation would be to cross-check this list whenever you install any further roll-ups on your ESXi hosts. If these or future non-OEM versions are reinstated, remove them before you restart the host, or it may not boot at all.
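From a vCLI workstation with a Linux shell, a quick audit and clean-up would look something like this. The vendor:name forms in the remove command are my assumption about how to target the non-OEM copies specifically, so double-check against the vib list output before removing anything:

esxcli --server=servername software vib list | grep -E 'bnx2|cnic'
esxcli --server=servername software vib remove --vibname=VMware:net-bnx2 --vibname=VMware:net-bnx2x --vibname=VMware:net-cnic --vibname=VMware:scsi-bnx2i --vibname=VMware:misc-cnic-register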
vCenter Server migration
- Migrate vCenter server – for 4.1 -> 5.0 the wizard does it all automatically (big improvement!)
- After the upgrade you’ll get HA failing to find a master agent, and probably some vCenter certificate warnings about the hosts
- Enable SSL certificate checking (disabled by default for some reason if migrating from 4.1): http://kb.vmware.com/kb/2006729
EqualLogic SAN update
This apparently provides better vStorage integration with vSphere 5
- Keep a physical PC with the v4 infrastructure client installed
- Install the v5 infrastructure client on a physical PC
- Shut down all guests, put both hosts into maintenance mode and shut them down
- Use WebUI to update the EqualLogic firmware to 5.1.2
- Restart the SAN
- Use iDRAC to power on ESXi hosts
- If vCenter is a VM you need to use the v4 infrastructure client to connect directly to its ESXi host
- Power up a DC first, then vCenter
- Quit the v4 client
- Load the v5 infrastructure client and connect to vCenter
- Start other DCs, Exchange, and SQL servers
- Start web, app, and file servers
ESXi host update
- From your iSCSI vSwitch make a note of the current iSCSI kernel port IP addresses
- vMotion guests off ESXi host, maintenance mode, shutdown
- Remove host from vCenter
- For Dell servers use iDRAC, boot into System Services mode and try connecting to the net for updates
- vmnic0 was in the management vSwitch and it was port channelled on the network switch
- Telnet to the switch and use the port descriptions to find the correct port channel. If you don’t have descriptions in your switch config you could, as a fallback, find the MAC addresses in the server BIOS and look them up in the switch’s MAC address table, or check CDP neighbors while VMware is running (see the sketch after the next step)
- Disable each of the ports in turn, checking in iDRAC to see whether that restores access to the Dell firmware repo
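Assuming a Cisco-style IOS CLI, the hunt and the shutdown look something like this (the interface name and MAC address are placeholders for your own):

show interfaces description | include ESX
show mac address-table address 0026.b95c.0000
show cdp neighbors detail
configure terminal
interface gigabitethernet 1/0/5
shutdown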
- Apply all firmware updates
- Use iDRAC’s Virtual Media feature to present the VMVisor ISO image to the server
- Reboot selecting the boot menu, then boot from the virtual CD
- Select new install for ESXi host and install to SD card
- This way there is no legacy partition table, and the upgrade would still require you to install the Dell MEM driver in any case
- Use iDRAC to set management IP
- Start v5 infrastructure client and connect to vCenter
- Add ESXi host back into vCenter
- Add vmnic4 back to the management vSwitch
- Remove VM Network port group
- Configure NIC teaming as Route based on IP hash (for each vmkernel and port group!)
- Enable vMotion on the Management vmkernel port
- Commit changes and re-enable the disabled switchport on your switch
- Configure NTP service and hostname
- Configure ESXi licence key
- Compare the MAC addresses of the vmhba initiators in Storage Adapters with the NICs listed in Network Adapters. You may notice that the numbering is different from the vmhba initiators that your ESXi 4.1 host was using
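Both inventories can also be listed from the vSphere CLI if you want the adapters and NICs side by side (the MAC pairing itself is easiest to read in the client):

esxcli --server=servername iscsi adapter list
esxcli --server=servername network nic list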
- Download the Dell MEM 1.1 Early Production Access, since it contains bug fixes over v1.0.1 and is certified for vSphere 5.0
- Download VMware ESXi 5.0 Patch Release ESXi500-201112001 (build 515841 – the advised minimum for using the Dell MEM)
- Download the Broadcom NetXtreme II Network/iSCSI/FCoE Driver Set
- Some of these archives need extracting to expose the actual vib zipfile, some don’t
- Install VMware vSphere CLI
- Use the infrastructure client’s Datastore browser to upload the MEM, the 515841 patch release, and the Broadcom vib files to a local volume on the ESXi host (mine all have a single SATA hard disk for scratch)
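If you’d rather script the upload, vifs from the vSphere CLI should do the same job; a sketch, assuming a datastore named SATA-LOCAL-C like mine:

vifs --server=servername --put dell-eql-mem-1.0.9.205559.zip '[SATA-LOCAL-C] dell-eql-mem-1.0.9.205559.zip'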
- Put the host in Maintenance Mode
- Use RCLI to install the patch release:

esxcli --server=servername software vib install --depot /vmfs/volumes/SATA-LOCAL-C/ESXi500-201112001.zip
- Reboot the host
- Install the Broadcom drivers:

esxcli --server=servername software vib install --depot /vmfs/volumes/SATA-LOCAL-C/BCM-NetXtremeII-1.0-offline_bundle-553511.zip
- Reboot the host
- Install the MEM driver:

esxcli --server=servername software vib install --depot /vmfs/volumes/SATA-LOCAL-C/dell-eql-mem-1.0.9.205559.zip
- Reboot the host
- For each HBA, check the iqn name and amend it to use the hostname instead of localhost, and check the numbering. On my servers the vmhba designations shifted during one of the reboots, leaving the iqns with misleading names, which caused additional confusion while setting up the array volume access. e.g. vmhba34 showed up as iqn.1998-01.com.vmware:localhost.domain.com:2062235227:36
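The rename can also be done from the vSphere CLI; a sketch, where the adapter number and iqn are placeholders taken from my example rather than values to copy:

esxcli --server=servername iscsi adapter set --adapter=vmhba34 --name=iqn.1998-01.com.vmware:server.domain.com:2062235227:36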
- Run the MEM configuration script, selecting vmnic1 and vmnic3, and using the IP addresses you noted from the old ESXi instance. Dell support also advised creating the heartbeat vmkernel port, though it’s described as optional:

setup.pl --configure --server=servername
- Update these new iqns on the SAN’s ACLs for the vSphere storage volume(s)
- After that has finished, take the CHAP passwords for each vmhba from the EqualLogic Web UI and add those to the Storage Adapter configs in the infrastructure client. Remember to use the username as you see it in the EqualLogic UI, not the initiator iqn
- For each of your active HBAs use the advanced settings to edit the iSCSI login timeout from 5 to 15 seconds (to match what ESXi 4.1 had)
- Configure a scratch disk path and enable scratch – use the real drive UID in the path, rather than the volume name, in case you rename the volume later. To retrieve the UID, use:

vmkfstools.pl --server=servername -P /vmfs/volumes/yourvolumenamehere
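Setting the scratch path itself can be done through the client’s advanced settings (ScratchConfig.ConfiguredScratchLocation), or from an SSH session on the host with something like this, where the volume UID is a made-up placeholder; either way it takes effect after a reboot:

vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/4f0e2b1c-12345678-abcd-0026b95c0000/.locker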