Category Archives: VMware

Deploying vCenter Server Appliance 5.1 with AD auth

In theory using the vCenter Server Appliance (hereafter vCSA) offers a number of big advantages over using vCenter. Firstly you don’t need to commit a Windows Server OS licence, secondly to manage your VMs you can use a Flash-enabled browser on any operating system (including on Surface RT!), and thirdly it should be a lot quicker to deploy.

On that last point, a number of configuration steps in the setup of the vCSA are counter-intuitive and can waste a substantial amount of time. This is because the vCSA defaults to a hostname of localhost.yourdomain.com. The various web services that the appliance runs (SSO, Lookup Service, Inventory Service, vSphere Web Client) interact with each other using SSL sessions and, although there’s a built-in method to make vCSA regenerate its self-signed certificates at boot time, at the time of writing this does not work once these web services have been configured.

If you do not edit the hostname, you will not be able to enable AD authentication, and will likely encounter “Failed to connect to VMware Lookup Service https://yourhostname:7444/lookupservice/sdk – SSL certificate verification failed” when attempting to use the vSphere Web Client. However if after completing the initial setup wizard you configure the hostname to something meaningful and try to regenerate the SSL certificates, the vCSA will hang at boot time, displaying:

Waiting for network to come up (attempt 1 of 10)...
Appliance Name - VMware_vCenter_Server_Appliance
Configuration for eth0 found
The network for interface eth0 is managed internally, network properties ignored
DNS : x.x.x.x
Hostname or IP has changed. Regenerating the self-signed certificates.
Starting VMware vPostgres: ok
Waiting for the embedded database to start up: .[OK]

Furthermore, most of the potentially useful VMware KB articles assume a dedicated vCenter, rather than a vCSA. There seems to be no way out of this boot hang, and you will have to start deploying the vCSA all over again. The trick is to cancel out of the Setup wizard and change the hostname at the very start, before the various web services have been initialized at all.

I have outlined the successful setup process here as a reminder for future deployments:

When downloading the vCSA, download the .ova file. Ignore the other .vmdk and .ovf downloads; the .ova one will allow you to set up the IP address details at the time of deployment rather than having to change them later.
It is recommended to use Thick Provisioning for the disk, despite 120GB being overkill for typical small deployments.
Enter the chosen static IP address, network and DNS settings. Sadly they left out a hostname field here which would have saved a lot of grief.
Do not start the VM on completion of the wizard. First reduce the vCSA virtual machine RAM from 8GB down to 4GB – the supported minimum configuration for small deployments of fewer than 10 hosts.
As per the instructions displayed on the vCSA VM console, connect a web browser to port 5480 (the default credentials are root and vmware), and accept the EULA.
At this point quit the Setup wizard. Failure to do this will cause SSL certificate issues later.
In the Network tab change the hostname, in the Admin tab change the root password and enable SSL certificate regeneration, and finally in the System tab set the correct time zone then reboot.
Create a DNS A record for your vCSA.
After the reboot when you connect back to the admin console, the hostname in the certificate will remain as localhost.yourdomain.com – this is apparently normal.
Disable SSL certificate regeneration.
Start the Setup wizard again from the main vCenter Server tab Summary screen.
Select “Configure with default settings”.
Enable AD authentication (an Active Directory Computer account object will be created for your vCSA).
Despite having enabled Active Directory authentication, this will not work until the AD domain SSO Identity Source has been configured – you’ll see the error “A general system error occurred: Authorize Exception”. The VMware documentation mentions using the SSO Admin account to do this (admin@System-Domain) but this password is not defined by the user when deploying a vCSA. By default the root account is also an SSO admin, so log in to vSphere Web Client on port 9443 as root. Navigate to Administration > Sign-On and Discovery > Configuration > Add Identity Source:
Enter the server URLs in the format ldap://dc1.mydomain.com
The DNs will probably both be cn=Users,dc=yourdomain,dc=com since you are only likely to need the Domain Admin user or Domain Admins group.
You need to set the Domain Alias to the NETBIOS domain name or else the vSphere Web Client plugin’s “use windows credentials” option will not work.
Set Authentication Type to Password and use a basic AD user account that you might use for photocopier scan-to-email directory lookups. Test the settings and save once working ok.
vCSA does not implicitly trust Domain Admins like a standard vCenter installation would, and the permissions are somewhat difficult to find in the vSphere Web Client. Navigate Home > vCenter > vCenter Servers > your vCSA > Manage tab > Permissions:
Add your required AD groups to the role Administrator.
I had a further issue after all this, where connecting using vSphere Client with the “use windows credentials” option would result in the error “A General System error occured: Cannot get user info”. This is due to an omission in the vCSA which has been fixed in the latest download version (currently 5.1.0.10100-1123965). You can get around this by editing a config file on the vCSA.
One final reboot of the appliance is necessary to avoid permissions errors from the Inventory Service when logging into the vSphere Web Client with Windows session authentication. Be aware that the port 9443 web service can take several minutes to start, even after the console of the appliance has apparently finished booting.
In environments where you can’t afford the 120GB of disk space for the vCSA, you could use VMware Converter to V2V the appliance, resizing to say 40GB in the process.
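
For the identity source step above, the two DN fields follow mechanically from the AD DNS domain name. As a minimal sketch (domain_to_users_dn is a made-up helper name, and yourdomain.com is the same placeholder used above), this derives the cn=Users container DN:

```shell
# Hypothetical helper: convert an AD DNS domain name into the DN of its
# default Users container, e.g. yourdomain.com -> cn=Users,dc=yourdomain,dc=com
domain_to_users_dn() {
    # Replace each dot with ",dc=" and prefix the first label with "dc="
    printf 'cn=Users,dc=%s\n' "$(printf '%s' "$1" | sed 's/\./,dc=/g')"
}

domain_to_users_dn yourdomain.com   # prints cn=Users,dc=yourdomain,dc=com
```

If the OpenLDAP client tools happen to be installed on a workstation, the lookup account could be checked before saving the identity source with something like ldapsearch -x -H ldap://dc1.mydomain.com -D 'lookupuser@mydomain.com' -W -b "$(domain_to_users_dn mydomain.com)" '(sAMAccountName=Administrator)', where lookupuser is a placeholder for whatever basic account you chose.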

Synology NAS for SME workloads – my design decisions

Background

It all started last year with a humble single drive DS111 which I bought for home, speculating that I could probably run Serviio on it to stream my movies to my DLNA enabled Bluray player. I managed to compile FFmpeg and its dependencies, install Java and get Serviio running and it has been golden ever since. Since then I’ve learned a lot more about Linux, packaged many other pieces of software and reverse-engineered my own Synology package repository, which at the last count over 5,000 NASes had checked into. The value of the Synology products lies in their great DSM software.

A few months later at work I decided to buy an RS411 to replace an appallingly slow FireWire Drobo (sub 10MB/sec!) that one of my colleagues was using to store video production rushes. Using the same four 2TB drives that had been inside the Drobo it has behaved impeccably – and it is currently in the process of mirroring 3TB to CrashPlan courtesy of my own app package. Having passed this test, I decided that Synology was a worthy candidate for a more serious purchase. However I noticed that there wasn’t much information about their SME products online so I’m sharing my research here.

The need for 2nd tier storage

I have used a 15K SAS EqualLogic iSCSI storage array for VMware workloads since 2009 but this is quite full. It can’t accommodate the data which I need to migrate from older arrays which are at end of life. This data (most of it flat files) is very much 2nd tier – I need lots of space, but I don’t really care too much about latency or throughput. It’s also predominantly static data. I do need at least some additional VMware storage, so I can use vMotion to decant and re-arrange VMs on other storage arrays. A larger Synology rack mount NAS therefore presents itself as a very good value way of addressing this need, while keeping the risk of failure acceptably low.

Which model?

The choice is actually pretty easy. A redundant power supply is fairly mandatory in business since this is the most likely thing to fail after a drive. The unit is under warranty for three years, but can you really survive for the day or two of downtime it would most likely take to get a replacement on site? Once this requirement is considered, there are actually very few models to choose from – just three in fact: the RS812RP+ (1U 4 bay), the RS2212RP+ (2U 10 bay) and the RS3412RPxs (2U 10 bay). The 4 bay model isn’t big enough for my needs so that narrows the field. The RS3412RPxs is quite a price hike over the RS2212RP+, but you do get 4 x 1GbE ports with the option of an add-in card with 2 x 10GbE ports. Considering that the EqualLogic PS4000XV unit I’m using for 1st tier storage manages fine with 2 x 1GbE on each controller I think this is a little overkill for my needs and besides, I don’t have any 10GbE switches (more on bandwidth later). Other physical enhancements are a faster CPU, ECC server RAM, and the ability to add two InfiniBand-connected RS1211 (2U 12 bay) expansion units rather than one.

Synology now offer VMware certified VAAI support from DSM 4.1 on the xs models (currently in beta). They actually support more VAAI primitives than EqualLogic, including freeing up space on a thin provisioned LUN when blocks are deleted. Dell EqualLogic brazenly advertised this as a feature in a vSphere 4.0 marketing PDF back in 2009 and I had to chase tech support for weeks to discover that it was “coming soon”. To date this functionality is still missing. The latest ETA is that it will ship with EqualLogic firmware 6.00 whenever that arrives. Though this is a software feature, Synology are using it to differentiate the more expensive products. More CPU is required during these VAAI operations, though blogger John Nash suggests that it isn’t much of an overhead.

If you need high performance during cloning and copying operations, or are considering using a Synology NAS as your 1st tier storage then perhaps you should consider the xs range.

Which drives?

The choices are bewildering at first. Many of the cheapest drives are the ‘green’ ones which typically spin at 5,400RPM. Their performance won’t be as good as 7,200RPM models, and they also tend to have more aggressive head parking and spindown timer settings in their firmwares. Western Digital ones are notorious for this to the extent that these drives have dramatically shorter lifespans if this is not disabled. To do this is tedious and requires a PC and a DOS boot disk. Making a bootable MS-DOS USB key can try the patience of even the calmest person!

UPDATE – it seems from the Synology Forum that DSM build 3.2-1922 and later will automatically disable the idle timer on these drives. You can check the status by running this one-liner while logged into SSH as root:

for d in `/usr/syno/bin/synodiskport -sata` ; do echo "*** /dev/$d ***"; /usr/syno/bin/syno_disk_ctl --wd-idle -g /dev/$d; done

You can force the disabling of the timer with that same tool:

/usr/syno/bin/syno_disk_ctl --wd-idle -d /dev/sda

The next choice is between Enterprise class and Desktop class drives. This is quite subjective, because for years we have been taught that only SCSI/SAS drives were meant to be sufficiently reliable for continuous use. Typically the Enterprise class drives will have a 5 year manufacturer warranty and the Desktop ones will have 3 years. Often it takes a call to the manufacturer’s customer service helpline to determine the true warranty cover for a particular drive since retailers often seem to misreport this detail on their website. The Enterprise ones are significantly more expensive (£160 Vs £90 for a 2TB drive).

There is one additional feature on Enterprise class drives – TLER (Time Limited Error Recovery). There are a few articles about how this relates to RAID arrays and it proved quite a distraction while researching this NAS purchase. The concept is this: when a drive encounters a failed block during a write operation there is a delay while that drive remaps the physical block to the spare blocks on the drive. On a desktop PC your OS would hang for a moment until the drive responds that the write was successful. A typical hardware RAID controller is intolerant of a delay here and will potentially fail the entire drive if this happens, even though only a single block is faulty. TLER allows the drive to inform the controller that the write is delayed, but not failed. The side effect of not having TLER support would be frequent drive rebuilds from parity, which can be very slow when you’re dealing with 2TB disks – not to mention impairing performance. The good news though, is that the Synology products use a Linux software RAID implementation, so TLER support becomes irrelevant.

Given what’s at stake it’s highly advisable to select drives which are in the Synology HCL. The NAS may be able to overcome some particular drive firmware quirks in software (like idle timers on some models etc.), and their presence on the list does also mean that Synology have tested them thoroughly. I decided to purchase 11 drives so I would have one spare on site ready for insertion directly after a failure. RAID parity can take a long time to rebuild so you don’t want to be waiting for a replacement. Bear in mind that returning a drive under a manufacturer warranty could take a week or two.

Apparently one of the value-added things with enterprise grade SAN storage is that individual drives will be selected from different production batches to minimize the chances of simultaneous failures. This does remain a risk for NAS storage, and all the RAID levels in the world cannot help you in that scenario.

My Order

Bare RS2212RP+ 10 bay rackmount NAS (around £1,700 – all prices are excluding VAT).
11 x Hitachi Deskstar HDS723020BLA642 drives including 3 year manufacturer warranty (around £1,000).
The unit has 1GB of DDR3 RAM soldered on the mainboard, with an empty SODIMM slot which I would advise populating with a 2GB Kingston RAM module, part number KVR1066D3S8S7/2G (a mere £10), just in case you want to install additional software packages later.
Synology 2U sliding rack rail kit, part number 13-082URS010 (£80). The static 1U rail kit for the RS411 was pretty poorly designed but this one is an improvement. It is still a bit time consuming to set up compared to modern snap-to-fix rails from the likes of Dell and HP.

Setup – How to serve the data

A Synology NAS offers several ways to actually store the data:

Using the NAS as a file server in its own right, using SMB, AFP, or NFS
iSCSI at block level (dedicated partitions)
iSCSI at file level (more flexible, but a small performance hit)

For my non-critical RS411, using it as an Active Directory integrated file server has proved to be very reliable. However, for this new NAS I needed LUNs for VMware. I could have perhaps defined a Disk Group and dedicated some storage to iSCSI, and some to a normal ext4 volume. I had experimented with iSCSI, but there are several problems:

Online research does reveal that there have been some significant iSCSI reliability issues on Synology, though admittedly these issues could possibly date from when DSM first introduced iSCSI functionality.
To use iSCSI multipathing on Synology the two NAS network interfaces must be on separate subnets. This is at odds with the same-subnet approach of Dell EqualLogic storage, which the rest of my VMware infrastructure uses. This would mean that hosts using iSCSI storage would need additional iSCSI initiators, significantly increasing complexity.
It is customary to isolate iSCSI traffic onto a separate storage network infrastructure, but the Synology NAS does not possess a separate management NIC. So if it is placed on a storage LAN it will not be easily managed/monitored/updated, nor even be able to send email alerts when error conditions arise. This was a show-stopper for me. I think Synology ought to consider at least allowing management traffic to use a different VLAN even if it must use the same physical NICs. However, VLANing iSCSI traffic is something most storage vendors advise against.

All of which naturally leads us on to NFS, which is very easy to configure and well supported by VMware. Multipathing isn’t possible for a single NFS share, so the best strategy is to bond the NAS network interfaces into a link aggregation group (‘Route Based on IP Hash’). This does mean however, that no hypervisor’s connection to the NFS storage IP can use more than 1Gbps of bandwidth. This gives a theoretical peak throughput of 1000/8 = 125MB/sec. Considering that each individual SATA hard disk in the array is capable of providing roughly this same sustained transfer rate, this figure is somewhat disappointing. The NAS can deliver much faster speeds than this but is restricted by its 1GbE interfaces. Some NFS storage appliances help to mitigate this limitation to a degree by allowing you to configure multiple storage IP addresses. You could then split your VMs between several NFS shares, each with a different destination IP which could be routed down a different physical link. In this way a single hypervisor could saturate both links. Not so for Synology NAS unfortunately.
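
To see why a single hypervisor-to-NAS connection never spreads across both links, here is a rough sketch of how an IP hash policy picks an uplink. It assumes the commonly described behaviour (XOR the source and destination IPv4 addresses, then take the remainder modulo the number of uplinks); the function names and addresses are made up for illustration, and the per-link ceiling takes 1GbE as 1000 Mbit/sec.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Assumed model of IP hash: (src XOR dst) mod number_of_uplinks.
pick_uplink() {
    echo $(( ( $(ip_to_int "$1") ^ $(ip_to_int "$2") ) % $3 ))
}

# Ceiling for one link: megabits per second divided by 8 bits per byte.
link_ceiling_mb() {
    echo $(( $1 / 8 ))
}

pick_uplink 192.168.1.10 192.168.1.20 2   # prints 0
pick_uplink 192.168.1.11 192.168.1.20 2   # prints 1
link_ceiling_mb 1000                      # prints 125
```

The same source and destination pair always hashes to the same uplink, which is why presenting several NFS target IPs is the only way a policy like this can put a second link to work for one hypervisor.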

If raw performance is important to you, perhaps you should reconsider the xs series’ 2 x 10GbE optional add-in card. Remember though that the stock xs config (4 x GbE) will still suffer from this NFS performance capping of a single NFS connection at 1GbE. It should be noted however that multiple hypervisors accessing this storage will each be able to achieve this transfer rate, up to the maximum performance of the RAID array (around 200MB/sec for an RS2212RP+ according to the official performance figures, rising to around 10 times that figure for the xs series – presumably with the 10GbE add-in card).

As per this blog post, VMware will preferentially route NFS traffic down the first kernel port that is on the same subnet as the target NFS share if one exists; if not, it will connect using the management interface via the default gateway. So adding more kernel ports won’t help. My VMware hypervisor servers use 2 x GbE for management traffic, 2 x GbE for VM network traffic, and 2 x GbE for iSCSI. Though I had enough spare NICs, connecting another pair of interfaces solely for NFS was a little overkill, especially since I know that the IOPS requirement for this storage is low. I was also running out of ports on the network patch panel in that cabinet. I did test the performance using dedicated interfaces but unsurprisingly I found it no better. In theory it’s a bad idea to use management network pNICs for anything else since that could slow vMotion operations or in extreme scenarios even prevent remote management. However, vMotion traffic is also constrained by the same limitations of ‘Route Based on IP Hash’ link aggregation policy – i.e. no single connection can saturate more than one physical link (1GbE). In my environment I’m unlikely to be migrating multiple VMs by vMotion concurrently so I have decided to use the management connections for NFS traffic too.

Benchmarking and RAID level

I found the simplest way to benchmark the transfer rates was to perform vMotion operations while keeping the Resource Monitor app open in DSM, and then referring to Cacti graphs of my switch ports to sanity check the results. The network switch is a Cisco 3750 two unit stack, with the MTU temporarily configured to a max value of 9000 bytes.

Single NFS share transfer rates reading and writing were both around 120MB/sec at the stock MTU setting of 1500 (around 30% CPU load). That’s almost full GbE line speed.
The same transfers occurred using only 15% CPU load with jumbo frames enabled, though the actual transfer rates worsened to around 60-70MB/sec. Consequently I think jumbo frames are pointless here.
The CPU use did not significantly increase between RAID5 and RAID6.

I decided therefore to keep an MTU of 1500 and to use RAID6 since this buys a lot of additional resilience. The usable capacity of this VMware ready NAS is now 14TB. It has redundant power fed from two different UPS units on different power circuits, and it has aggregated network uplinks into separate switch stack members. All in all that’s pretty darn good for £2,800 + VAT.
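
As a sanity check on that 14TB figure, here is the arithmetic as a small sketch. It assumes all ten bays sit in a single RAID group of 2TB drives (the eleventh drive being the on-site spare) and converts the drive makers' decimal terabytes into the binary units that operating systems usually report. Filesystem overhead is ignored and the function name is arbitrary.

```shell
# Usable capacity in TiB = (drives - parity_drives) * size_in_TB * 10^12 / 2^40
# Arguments: number of drives, drive size in TB, number of parity drives
raid_usable_tib() {
    awk -v n="$1" -v size_tb="$2" -v parity="$3" \
        'BEGIN { printf "%.2f\n", (n - parity) * size_tb * 1e12 / 2^40 }'
}

raid_usable_tib 10 2 2   # RAID6: prints 14.55
raid_usable_tib 10 2 1   # RAID5: prints 16.37
```

On those assumptions the second parity drive costs a little under 2TiB of usable space compared with RAID5.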

Shortcut to testing VMware Auto-Deploy

Upgrading to vSphere 5.0 with Dell EqualLogic

UPDATE – Ignore the Broadcom driver stuff. It seemed to be ok all afternoon, but I have rebooted the ESXi host and it’s gone completely unstable again, with pretty much continuous iSCSI disconnects. Clearly this TOE/iSCSI offload support is absolutely terrible. I’m going to have to use the software initiator. What is the point of Dell marketing this?

UPDATE 2 – Dell decided to get to the bottom of this and, following an extended troubleshooting session in which I reverted one of the hypervisors, they were able to replicate the fault in their lab. It’s now being escalated with VMware and Broadcom. More news as I get it…

I’m doing this upgrade at the moment from vSphere 4.1U1 so I wanted to make notes, particularly on the hypervisor rebuild part, so I don’t have to keep looking stuff up when I do each one. Since 4.1 I have used the hardware iSCSI offload features of the Broadcom bnx2 chips in the servers, using them as HBAs in their own right. As per the Dell MEM driver 1.1 release notes they still don’t support using jumbo frames with this configuration. However, I had big problems with getting this working at all with 5.0. According to Dell support I’m in a minority of customers that use TOE so their inclination was to suggest I fall back to software iSCSI. I purposely delayed adopting vSphere 5.0 until it had been out for a few months to hopefully avoid being among the first to hit major issues, but I still ran into this. The problem manifests itself as regular errors (every few seconds) in the array logs like this:

iSCSI login to target ‘192.168.100.12:3260, iqn.2001-05.com.equallogic:0-8a0906-c541d5105-94c0000000a4adc3-vsphere’ from initiator ‘192.168.100.25:2076, iqn.1998-01.com.vmware:server.domain.com:1454019294:34’ failed for the following reason: Initiator disconnected from target during login.

These errors are generated by all HBAs that are configured for storage. Furthermore only one path is established, and the volume will occasionally go offline altogether. The ESXi host’s /var/log/vmkernel.log shows bnx2 disconnection events like this:

2012-01-16T16:35:11.248Z cpu14:4802)bnx2i::0x410013204890: bnx2i_conn_stop::vmnic1 - sess 0x41000de04fc8 conn 0x41000de05350, icid 11, cmd stats={p=0,a=0,ts=0,tc=0}, ofld_conns 2
2012-01-16T16:35:11.248Z cpu14:4802)bnx2i::0x410013204890: bnx2i_ep_disconnect: vmnic1: disconnecting ep 0x410012a18f20 {11, 120c00}, conn 0x41000de05350, sess 0x41000de04fc8, hba-state 1, num active conns 2
2012-01-16T16:35:25.554Z cpu12:4802)bnx2i::0x410013204890: bnx2i_conn_stop::vmnic1 - sess 0x41000de04fc8 conn 0x41000de05350, icid 13, cmd stats={p=0,a=0,ts=0,tc=0}, ofld_conns 2
2012-01-16T16:35:25.554Z cpu12:4802)bnx2i::0x410013204890: bnx2i_ep_disconnect: vmnic1: disconnecting ep 0x410012a192f0 {13, 125400}, conn 0x41000de05350, sess 0x41000de04fc8, hba-state 1, num active conns 2

Dell support’s first suggestion is to edit the iSCSI login timeout value from 5 seconds to 60 seconds, and you need to use build 515841 to be able to edit this. However, this did not fix the issue using TOE. It turned out to be a Broadcom driver issue.

The vanilla install of ESXi 5.0.0 (build 469512), the Hypervisor Driver Rollup 1, and the update to build 515841 all include these same driver vib packages which seem to be broken. You can audit these by running esxcli --server=servername software vib list

net-bnx2     2.0.15g.v50.11-5vmw.500.0.0.469512   VMware VMwareCertified
net-bnx2x    1.61.15.v50.1-1vmw.500.0.0.469512    VMware VMwareCertified
net-cnic     1.10.2j.v50.7-2vmw.500.0.0.469512    VMware VMwareCertified
scsi-bnx2i   1.9.1d.v50.1-3vmw.500.0.0.469512     VMware VMwareCertified

The Broadcom NetXtreme II Network/iSCSI/FCoE Driver Set does contain newer versions:

net-bnx2     2.1.12b.v50.3-1OEM.500.0.0.472560    Broadcom VMwareCertified
net-bnx2x    1.70.34.v50.1-1OEM.500.0.0.472560    Broadcom VMwareCertified
net-cnic     1.11.18.v50.1-1OEM.500.0.0.472560    Broadcom VMwareCertified
scsi-bnx2fc  1.0.1v.v50.1-1OEM.500.0.0.406165     Broadcom VMwareCertified
scsi-bnx2i   2.70.1k.v50.2-1OEM.500.0.0.472560    Broadcom VMwareCertified

However, there is a further complication. These drivers have to be loaded on after the VMware updates. When the Broadcom drivers are installed the VMware-supplied drivers for these devices are removed. Confusingly, the VMware updater to build 515841 will see that they are missing, will ignore the OEM Broadcom replacements, and will re-install the older versions! If the host reboots at that point it will crash to a magenta screen of death as the kernel inits, possibly because two different driver versions are trying to access the same hardware. Take note, the Broadcom installer removes the following bootbank packages from the host:

VMware_bootbank_misc-cnic-register_1.1-1vmw.500.0.0.469512
VMware_bootbank_net-bnx2_2.0.15g.v50.11-5vmw.500.0.0.469512
VMware_bootbank_net-bnx2x_1.61.15.v50.1-1vmw.500.0.0.469512
VMware_bootbank_net-cnic_1.10.2j.v50.7-2vmw.500.0.0.469512
VMware_bootbank_scsi-bnx2i_1.9.1d.v50.1-3vmw.500.0.0.469512

So my recommendation would be to cross check this list whenever you install any further roll-ups to your ESXi hosts. If these or future non-OEM versions are reinstated, remove them before you restart the host, or it may not boot at all.
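
That cross check can be scripted. The sketch below assumes the vib list output has the package name in the first column and the vendor in the third, as in the listings above; the function name is invented, and the package names are taken from the bootbank packages that the Broadcom installer removed.

```shell
# Read 'esxcli software vib list' output on stdin and print the name of
# any Broadcom-related package whose vendor column says VMware, i.e. a
# stock driver that has been reinstated alongside the OEM one.
find_stock_broadcom_vibs() {
    awk '$3 == "VMware" &&
         $1 ~ /^(misc-cnic-register|net-bnx2|net-bnx2x|net-cnic|scsi-bnx2i)$/ {
             print $1
         }'
}

# Example using two lines in the format shown above
printf '%s\n' \
    'net-bnx2     2.1.12b.v50.3-1OEM.500.0.0.472560    Broadcom VMwareCertified' \
    'net-cnic     1.10.2j.v50.7-2vmw.500.0.0.469512    VMware VMwareCertified' |
    find_stock_broadcom_vibs   # prints net-cnic
```

From a workstation with the vSphere CLI and a POSIX shell you would feed it the real listing, for example esxcli --server=servername software vib list | find_stock_broadcom_vibs. Anything it prints is a candidate for removal with esxcli software vib remove before the host is restarted.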

vCenter Server migration

Migrate vCenter server – for 4.1 -> 5.0 the wizard does it all automatically (big improvement!)
After upgrade you’ll get HA failing to find a master agent, and probably some vCenter cert warnings about the hosts
Enable SSL certificate checking (disabled by default for some reason if migrating from 4.1): http://kb.vmware.com/kb/2006729

EqualLogic SAN update

This apparently provides better vStorage integration with vSphere 5

Keep a physical PC with the v4 infrastructure client on
Install the v5 infrastructure client on a physical PC
Shut down all guests, put both hosts into maintenance mode and shut them down
Use WebUI to update the EqualLogic firmware to 5.1.2
Restart the SAN
Use iDRAC to power on ESXi hosts
If vCenter is a VM you need to use the v4 infrastructure client to connect directly to its ESXi host
Power up a DC first, then vCenter
Quit the v4 client
Load the v5 infrastructure client and connect to vCenter
Start other DCs, Exchange, and SQL servers
Start web, app, and file servers

ESXi host update

From your iSCSI vSwitch make a note of the current iSCSI kernel port IP addresses
vMotion guests off ESXi host, maintenance mode, shutdown
Remove host from vCenter
For Dell servers use iDRAC, boot into System Services mode and try connecting to the net for updates
vmnic0 was in the management vSwitch and it was port channelled on the network switch
Telnet to the switch and use the descriptions to find the correct port channel. If you don’t have descriptions in your switch config you could, as a fallback, find the MAC addresses in the server BIOS and look them up in the switch MAC address table, or use show cdp neighbors while VMware is running
Disable each of the ports in turn, checking in iDRAC to see if that fixes the access to the Dell firmware repo
Apply all firmware updates
Use iDRAC’s Virtual Media feature to present the VMVisor ISO image to the server
Reboot selecting the boot menu, then boot from the virtual CD
Select new install for ESXi host and install to SD card
This way there is no legacy partition table, and the upgrade would still require you to install the Dell MEM driver in any case
Use iDRAC to set management IP
Start v5 infrastructure client and connect to vCenter
Add ESXi host back into vCenter
Add vmnic4 back to the management vSwitch
Remove VM Network port group
Configure NIC teaming as Route based on IP hash (for each vmkernel and port group!)
Enable vMotion on the Management vmkernel port
Commit changes and re-enable the disabled switchport on your switch
Configure NTP service and hostname
Configure ESXi licence key
Compare the MAC addresses of the vmhba initiators in Storage Adapters with the NICs listed in Network Adapters. You may notice that the numbering is different from the vmhba initiators that your ESXi 4.1 host was using
Download the Dell MEM 1.1 Early Production Access, since there are bug fixes over v 1.0.1 and it is certified for vSphere 5.0
Download VMware ESXi 5.0 Patch Release ESXi500-201112001 (build 515841 – the advised minimum for using the Dell MEM)
Download the Broadcom NetXtreme II Network/iSCSI/FCoE Driver Set
Some of these archives need extracting to expose the actual vib zipfile, some don’t
Install VMware vSphere CLI
Use the infrastructure client’s Datastore browser to upload the MEM, the 515841 patch release, and the Broadcom vib files to a local volume on the ESXi host (mine all have a single SATA hard disk for scratch)
Put the host in Maintenance Mode
Use RCLI to install the patch release:

esxcli --server=servername software vib install --depot /vmfs/volumes/SATA-LOCAL-C/ESXi500-201112001.zip

Reboot the host
Install the Broadcom drivers:

esxcli --server=servername software vib install --depot /vmfs/volumes/SATA-LOCAL-C/BCM-NetXtremeII-1.0-offline_bundle-553511.zip

Reboot the host
Install the MEM driver:

esxcli --server=servername software vib install --depot /vmfs/volumes/SATA-LOCAL-C/dell-eql-mem-1.0.9.205559.zip

Reboot the host
For each HBA, check the iqn and amend it to use the hostname instead of localhost, and check the numbering. On my servers the vmhba designations shifted during one of the reboots, leaving the iqns with misleading names which caused additional confusion while setting up the array volume access. e.g. vmhba34 showed up as iqn.1998-01.com.vmware:localhost.domain.com:2062235227:36
Run the MEM configuration script, selecting vmnic1 and vmnic3, and using the IP addresses you noted from the old ESXi instance. Dell support also advised creating the heartbeat vmkernel port, though it’s described as optional

setup.pl --configure --server=servername

Update these new iqns on the SAN’s ACLs for the vSphere storage volume(s)
After that has finished take the CHAP passwords for each vmhba from the EqualLogic Web UI and add those to the Storage Adapter configs in the infrastructure client. Remember to use the username as you see it in the EqualLogic UI, not the initiator iqn
For each of your active HBAs use the advanced settings to edit the iSCSI login timeout from 5 to 15 seconds (to match what ESXi 4.1 had)
Configure a scratch disk path and enable scratch – use the real drive UID in the path, rather than the volume name in case you change it later. To retrieve that, use

vmkfstools.pl --server=servername -P /vmfs/volumes/yourvolumenamehere
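
For the iqn renaming step above, the corrected name is just the default one with localhost swapped for the real hostname. A small sketch, in which the function name and the hostname server are placeholders:

```shell
# Replace the 'localhost' label in a default ESXi initiator name with
# the host's short name, leaving the domain and numeric suffix intact.
# Arguments: the existing iqn, the short hostname
fix_iqn_hostname() {
    printf '%s\n' "$1" | sed "s/:localhost\./:$2./"
}

fix_iqn_hostname 'iqn.1998-01.com.vmware:localhost.domain.com:2062235227:36' server
# prints iqn.1998-01.com.vmware:server.domain.com:2062235227:36
```

In the example above the trailing 36 did not match the adapter's number of 34, so it is worth deciding at the same time whether to correct that suffix by hand before adding the names to the array's ACLs.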

Optimizing virtual SQL Server performance

Some months ago I implemented these steps and saw a striking improvement in the performance of our applications (between 2x and 4x depending on the query):

Firstly, if you’re using iSCSI, make sure the network switches are ones which have been validated as ok by your storage vendor. I’ve run into poor performance using ones which ought to have worked and offered all the required features (Flow Control, Jumbo Frames), but which in reality were causing problems.

If you’re using iSCSI with a software initiator (be it at hypervisor or guest OS level), consider using Jumbo Frames to reduce I/O related CPU activity.

Move your VMDKs to a new SAN VMFS volume that is Thick-Provisioned. Although in my environment the EqualLogic array extends LUNs by 16MB at a time, over time this can fragment things appreciably. With a 1TB LUN this can get pretty bad.

Use Storage vMotion to make the VMDK files Thick Provisioned too. This eliminates fragmentation of the VMDK since it’s no longer growing in small increments. I think this made quite a big difference, despite a whitepaper from VMware denying a performance impact. The reasoning is that since the storage array has a big cache, having the data fragmented all over the disks shouldn’t really matter that much. I don’t really believe it, and my own results seemed to prove otherwise (what about a backup operation which will need to read your data sequentially in one long pass?). My SQL server vMotion operations were very slow compared to other servers, suggesting they were heavily fragmented in their old location.

Move all of your databases (including these system ones: msdb, model, master), their logs, and fulltext catalogs to a SAN LUN directly attached inside the Guest VM using the Microsoft iSCSI initiator and your SAN vendor’s Integration Tools. If you use Vmxnet3 adapters then the TCP calculation overhead will be handled by the hypervisor which in turn can be passed to Broadcom bnx2 TOE NICs if you’re using vSphere 4.1. Having the databases on a separate LUN allows off-host backup of the databases using Backup Exec with your SAN vendor’s SQL-aware VSS Hardware Provider. Database backups can then occur at any time without any impact to the SQL server’s performance. I have written a dedicated post on this subject.

Create that SAN partition with its NTFS blocks aligned with the SAN’s own disk blocks to ensure no needless multiplication of I/O (64KB offset for EqualLogic – full explanation here).
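
To make the alignment point concrete, here is a small arithmetic sketch in Python. It assumes 512-byte sectors and the legacy MBR convention of starting the first partition at sector 63 (the default on Windows Server 2003 and earlier); the function name is hypothetical.

```python
SECTOR_BYTES = 512  # assumed sector size


def is_aligned(start_offset_bytes: int, boundary_bytes: int = 64 * 1024) -> bool:
    """True when a partition's starting offset falls on a whole multiple
    of the array's block boundary (64KB here, as for EqualLogic)."""
    return start_offset_bytes % boundary_bytes == 0


legacy_offset = 63 * SECTOR_BYTES    # 32,256 bytes: old default start
aligned_offset = 128 * SECTOR_BYTES  # 65,536 bytes: exactly 64KB

print(is_aligned(legacy_offset))   # False
print(is_aligned(aligned_offset))  # True
```

With the legacy offset every 64KB NTFS read straddles two array blocks, which is the needless multiplication of I/O mentioned above.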

Keep TempDB on the C: drive in its default location. That way I/O to that database is segregated and can be cached differently since it is using VMware’s iSCSI initiator and not the Microsoft initiator. Typically TempDB has high I/O, but it’s not a database that you need to back up so you don’t need to be able to snapshot it on the SAN.

Create an SQL management task to rebuild and defragment the database indexes and update their statistics every week (say, Sunday at 3:00am).

Change database file autogrow amounts from 1MB to 64MB to mitigate NTFS-level fragmentation of the database MDF files as they grow.
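
A rough illustration of why the autogrow increment matters, as a Python sketch. The 10GB growth figure is purely an example of mine, not a measurement from my servers.

```python
import math


def growth_events(total_growth_mb: int, increment_mb: int) -> int:
    """Number of separate autogrow operations needed for a data file
    to grow by total_growth_mb in steps of increment_mb."""
    return math.ceil(total_growth_mb / increment_mb)


ten_gb = 10 * 1024  # example growth in MB (illustrative figure)
print(growth_events(ten_gb, 1))   # 10240 separate extensions
print(growth_events(ten_gb, 64))  # 160 separate extensions
```

Each extension is a chance for NTFS to place the new extent somewhere non-contiguous, so fewer, larger steps leave far fewer fragments.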

Upgrading vSphere ESXi 4.0 to 4.1 with Dell EqualLogic storage

There are several big motivators to moving over to vSphere 4.1 with respect to storage. Firstly, there’s support for vStorage APIs in new EqualLogic array firmwares (starting at v5.0.0 which sadly, together with 5.0.1, has been withdrawn pending fixes for some show-stopping bugs). VM snapshot and copy operations will be done by the SAN at no I/O cost to the hypervisor. Next there’s the support for vendor-specific Multipathing Extension Modules – EqualLogic’s one is available for download under the VMware Integration category. Finally, there’s the long overdue TCP Offload Engine (TOE) support for Broadcom bnx2 NICs. All of this means a healthy increase in storage efficiency.

If you’re upgrading to vSphere 4.1 and have everything set up as per Dell EqualLogic’s vSphere 4.0 best practice documents you’ll first need to:

Upgrade vCenter and move it to a 64bit OS (which can fail)

Upgrade the hypervisors using vihostupdate.pl as per VMware’s upgrade guide, taking care to back up their configs first with esxcfg-cfgbackup.pl

Once that’s done choose an ESXi host to update, and put it in Maintenance Mode.

Make a note of your iSCSI VMkernel port IP addresses.

Make sure your ScratchConfig (Configuration -> Advanced Settings) is set to local storage. Reboot and check the change has persisted.

If the server has any Broadcom bnx2 family adapters they will now be treated as iSCSI HBAs so they will each have a vmhba designation. So, to unassign the previous explicit bindings to the Software iSCSI Initiator you need to check for its new name in the Storage Adapters configuration page.

You can’t unbind the VMkernel ports while there is an active iSCSI session using them so edit the properties of the Software iSCSI Initiator and remove the Dynamic and Static targets, then perform a rescan. Find your bound VMkernel ports using the vSphere CLI (replacing vmhba38 with the name of your software initiator):

bin\esxcli --server svr --username user --password pass swiscsi nic list -d vmhba38

Remove each bound VMkernel port like so (assuming vmk1-4 were listed as bound in the last step):

bin\esxcli --server svr --username user --password pass swiscsi nic remove -n vmk1 -d vmhba38
bin\esxcli --server svr --username user --password pass swiscsi nic remove -n vmk2 -d vmhba38
bin\esxcli --server svr --username user --password pass swiscsi nic remove -n vmk3 -d vmhba38
bin\esxcli --server svr --username user --password pass swiscsi nic remove -n vmk4 -d vmhba38
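
If you have several hosts to do, the four lines above can be generated rather than typed. This is only a sketch in Python that prints the same command strings shown above for review; it does not run anything, and the function name and placeholder credentials are my own.

```python
def unbind_commands(server, username, password, adapter, vmk_ports):
    """Build one 'swiscsi nic remove' command line per bound VMkernel port,
    using the same syntax as the commands listed above."""
    base = (
        f"bin\\esxcli --server {server} "
        f"--username {username} --password {password}"
    )
    return [
        f"{base} swiscsi nic remove -n {vmk} -d {adapter}"
        for vmk in vmk_ports
    ]


# vmk1-4 and vmhba38 are the example names from the steps above.
commands = unbind_commands(
    "svr", "user", "pass", "vmhba38", [f"vmk{i}" for i in range(1, 5)]
)
for line in commands:
    print(line)
```

Paste the printed lines into the vSphere CLI prompt once you have checked them against the output of the nic list command.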

Now you can disable the Software iSCSI Initiator using the vSphere Client and then remove all the VMkernel ports and your iSCSI vSwitches.

Take note at this point that, according to the release notes PDF for the EqualLogic MEM driver, the Broadcom bnx2 TOE-enabled driver in vSphere 4.1 does not support jumbo frames. This information is further on in the document and unfortunately I only read it after I had already configured everything with jumbo frames so I had to start again. Any improvement they offer is kind of moot here since the Broadcom TOE will take over all the strenuous TCP calculation duties from the CPU, and is probably able to cope with traffic at line speed even at 1500 bytes per packet. I guess it could affect performance at the SAN end so perhaps they will work on supporting a 9000 byte MTU in forthcoming releases.

Make sure you set the MTU back to 1500 for any software initiators running in your VMs that used jumbo frames!
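
For a sense of scale, here is a back-of-the-envelope sketch in Python of how many packets per second a NIC has to handle at line speed with each MTU. It assumes a 1Gbit/s link and ignores Ethernet, IP and TCP header overhead, so treat the figures as rough.

```python
def packets_per_second(link_bits_per_s: float, mtu_bytes: int) -> float:
    """Approximate packets per second needed to fill a link with
    full-size packets, ignoring all header overhead."""
    return link_bits_per_s / (mtu_bytes * 8)


GIGABIT = 1_000_000_000  # assumed link speed in bits per second

standard = packets_per_second(GIGABIT, 1500)
jumbo = packets_per_second(GIGABIT, 9000)

print(round(standard))  # 83333
print(round(jumbo))     # 13889
```

So a 1500 byte MTU means roughly six times as many packets to process, which is exactly the work the TOE hardware takes off the CPU.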

Re-patch your cables so you’re using your available TOE NICs for storage. On a server like the Dell PowerEdge R710 the four Broadcom TOE NICs are in fact two dual chips. So if you want to maximize your fault tolerance, be sure to use vmnic0 & vmnic2 as your iSCSI pair, or vmnic1 & vmnic3.

Log in to your EqualLogic Group Manager and delete the CHAP user you were using for the Software iSCSI Initiator for this ESXi host. Create new entries for each hardware HBA you will be using. Copy the initiator names from the vSphere GUI, and be sure to grant them access in the VDS/VSS pane too. Add these users to the volume permissions, and remove the old one.

Using vSphere CLI install the Multipathing Extension Module:

setup.pl --install --server svr --username root --password pass --bundle dell-eql-mem-1.0.0.130413.zip

Reboot the ESXi host and run the setup script in interactive configuration mode. For multiple value answers, comma separate them:

setup.pl --server svr --username root --password pass --configure

If you have Broadcom TOE NICs say yes to hardware support. This script will set up the vSwitch and the VMkernel ports and take care of the bindings (thanks Dell!):

Configuring networking for iSCSI multipathing:
vswitch = vSwitchISCSI
mtu = 1500
nics = vmnic1 vmnic3
ips = 192.168.100.95 192.168.100.96
netmask = 255.255.255.0
vmkernel = iSCSI
EQL group IP = 192.168.100.112
Creating vSwitch vSwitchISCSI.
Setting vSwitch MTU to 1500.
Creating portgroup iSCSI0 on vSwitch vSwitchISCSI.
Assigning IP address 192.168.100.95 to iSCSI0.
Creating portgroup iSCSI1 on vSwitch vSwitchISCSI.
Assigning IP address 192.168.100.96 to iSCSI1.
Creating new bridge.
Adding uplink vmnic1 to vSwitchISCSI.
Adding uplink vmnic3 to vSwitchISCSI.
Setting new uplinks for vSwitchISCSI.
Setting uplink for iSCSI0 to vmnic1.
Setting uplink for iSCSI1 to vmnic3.
Bound vmk1 to vmhba34.
Bound vmk2 to vmhba36.
Refreshing host storage system.
Adding discovery address 192.168.100.112 to storage adapter vmhba34.
Adding discovery address 192.168.100.112 to storage adapter vmhba36.
Rescanning all HBAs.
Network configuration finished successfully.
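
Since the vmhba numbers are what you need for the CHAP step that follows, it can help to pull the bindings out of that output. A minimal Python sketch, assuming the "Bound vmkN to vmhbaN." line format shown above stays the same between MEM versions (which I have not verified); the function name is hypothetical.

```python
import re

# Matches lines such as "Bound vmk1 to vmhba34." from the output above.
BOUND_RE = re.compile(r"^Bound (vmk\d+) to (vmhba\d+)\.$")


def parse_bindings(log_text: str) -> dict:
    """Map each VMkernel port to the HBA it was bound to."""
    bindings = {}
    for line in log_text.splitlines():
        match = BOUND_RE.match(line.strip())
        if match:
            bindings[match.group(1)] = match.group(2)
    return bindings


sample = """\
Setting uplink for iSCSI1 to vmnic3.
Bound vmk1 to vmhba34.
Bound vmk2 to vmhba36.
Refreshing host storage system.
"""
print(parse_bindings(sample))  # {'vmk1': 'vmhba34', 'vmk2': 'vmhba36'}
```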

Now go back to your active HBAs and enter the new CHAP credentials. Re-scan and you should see your SAN datastores.

Recreate a pair of iSCSI VM Port Groups for any VMs that may use their own software initiators (very convenient for off-host backup of Exchange or SQL), making sure to explicitly set only one network adapter active, and the other to unused. Reverse the order for the second VM port group. Notice that setup.pl has done this for the VMkernel ports which it created.

Reboot again for good measure since we’ve made big changes to the storage config. I noticed at this point that on my ESXi hosts the Path Selection Policy for my EqualLogic datastore reset itself to Round Robin (VMware). I had to manually set it back to DELL_PSP_EQL_ROUTED. Once I had done that it persisted after a reboot.

vSphere CLI libeay32.dll error on 64bit Windows 7

If you install the latest build of vSphere CLI 4.1 on Windows 7 x64 some of the commands will fail, with perl.exe throwing the following error:

The ordinal 3212 could not be located in the dynamic link library LIBEAY32.dll

There isn’t much to go on when you look up the error – just a lot of people saying you should delete all older copies of LIBEAY32.dll from your system.

Fortunately there is a neater solution, and I’m surprised VMware haven’t fixed this problem yet (4.0 also had the same issue).

Open your CLI command prompt as Administrator. Type ppm and hit enter (Perl Package Manager).
Now look for a module called Crypt-SSLeay. You’ll see that CLI’s bundled ActivePerl distribution includes version 0.53, but there is a newer version 0.57 available.
Remove the installed version, then go to File -> Run Marked Actions
Click on the grey box icon on the left of the toolbar. These are available packages which are not currently installed. Search for Crypt-SSLeay once again, install, and Run Marked Actions. Exit.

Problem solved!

Upgrading to vCenter 4.1 with bundled SQL Express Edition – database migration fails

My infrastructure uses a bundled SQL Express Edition database because it isn’t hugely complex, and I didn’t want too much dependency on other servers (themselves VMs). I encountered problems upgrading the vCenter database while moving from Windows Server 2003 R2 SP2 x86 to Windows 2008 R2. I was migrating from vSphere 4.0U1 to 4.1. Perhaps this precise combination of versions was the problem, or perhaps it was that my database started life as VirtualCenter 3.5.

The process seems simple enough – unzip and run the Data Migration Tool from the installation media on the source vCenter server, move this folder (now with \data added) to the destination server and launch the install.bat script. However, there seem to be two major snags. The first is that the backup process will fail:

DB logs: HResult 0x2, Level 16, State 1 Named Pipes Provider: Could not open a connection to SQL Server

VMware has a knowledge base article about this. They claim it’s caused by a misconfiguration of the SQL 2005 Express Edition instance which is pretty rich considering it was set up by their own installer. Make the named pipe change they suggest and it will work. Now unplug this machine from the LAN or disconnect its network adapter in vSphere if it’s a VM – remember you can connect the VI Client directly to the hypervisor which is hosting it.

The destination server needs to be configured with the same hostname as the source. I had then assumed that the restore tool would need to be run after a new install of vCenter 4.1 was placed on the destination server, so I installed vCenter. Then I discovered that the install.bat script in the datamigration folder refuses to run if it detects the product is already present. So naturally I uninstalled it and tried again. Perhaps this is what messed things up, or perhaps it’s because I’m using Windows 2008 R2.

Anyway, the datamigration\install.bat script kicks off the main product installer, supposedly importing all your backed up settings.

According to the vCenter 4.1 Upgrade Guide page 40 item 10 you are supposed to:

Select Install SQL Server 2005 Express instance (for small-scale deployments) and click Next.

Item 16 on the same page states that:

When the vCenter Server installation finishes, click Finish. The data migration tool restores the backed up configuration data.

If you do this you may, like me, discover that it actually doesn’t work and you end up with a completely blank database instance. Consulting datamigration\logs\restore.log I found no reference at all to any database restore.

My workaround

Go to your original vCenter server. Open the Data Sources MMC snap-in. In System DSN you should see the entry that vCenter uses for its database.

Note down the details, then create the same entry on your destination server (this will create a 64bit DSN). Notice how on page 38 of the Upgrade Guide it specifically states that:

If you use the data migration tool to migrate a SQL Server Express database located on the vCenter Server system to a new system, you do not need to create the 64-bit DSN. The data migration tool creates the DSN as part of the installation process

Apparently sometimes you do need to create the DSN.

On the destination server, copy the file \datamigration\data\vc\vc_upgraded_db and paste it in C:\temp. Rename it to vc_upgraded_db.bak.

Still on the destination server download, install and run the 64bit SQL Management Studio Express. Even if you’ve uninstalled vCenter, the SQL Express Edition instance will be left behind. If there’s not already one from a failed install, create a local database called VIM_VCDB. Restore the backup in C:\temp\vc_upgraded_db.bak over the top, paying attention to select Options -> Overwrite Existing Database and browsing to the target file locations of both the mdf and ldf files – the old database was found in C:\Program Files\Microsoft SQL Server\MSSQL.1\Data but on the 64bit system it’s C:\Program Files (x86)\…

Right-click your database, and select Properties. In Options, make sure your database recovery model is set to Simple. If you don’t do this your transaction log will fill up in a couple of days and the vCenter services will stop. In my case it seemed to have defaulted to Bulk-Logged.

Once that’s done you’ll need to update your newly created DSN to set the default database to VIM_VCDB. Now run datamigration\install.bat once again but this time opt to use an existing database.

Strangely, you will find that you cannot use the SYSTEM account for the vCenter services; however, this can easily be changed in the Services MMC snap-in later.

And that’s it. You should end up with a working install with your data intact. One final tweak is to delay launch of the vCenter services so they don’t fail to start up at boot time.

UPDATE – after wasting a number of hours today with this, I’ve done some more searching and found this VMware KB article which basically just admits that the Data Migration Tool sometimes doesn’t work, and won’t even report errors in the log! When I download something as important as this, I sort of take it for granted that I won’t have to Google “vcenter data management tool does not migrate database” the moment I try using it (can’t believe I didn’t try that).

Growing a system or boot partition on a live server

PC LOAD LETTER

and other brilliant error messages