Tag Archives: EqualLogic

EqualLogic, iSCSI and the Windows Server 2008 R2 firewall

I recently migrated a backup server from Windows Server 2003 to Windows 2008 R2 in order to install Backup Exec 2012 at the same time. Once I had configured everything I noticed in the iSCSI Control Panel that only one path would ever connect to the array, and I was getting regular iSCSI timeouts and failures in the System Event Log, which I hadn’t seen while running Windows 2003:

iSCSI_errors

The errors were event 129:

The description for Event ID 129 from source iScsiPrt cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

event 39:

Initiator sent a task management command to reset the target. The target name is given in the dump data.

and event 9:

Target did not respond in time for a SCSI request. The CDB is given in the dump data.

Crucially these were spaced (all three together) at intervals of four minutes.

Solution

I spoke to the EqualLogic support team and, after a little while spent focusing on NIC drivers, one of the senior technicians fortunately realised that this four minute time interval coincides with the approximate frequency with which the array pings the initiators on the host and may send reconnect requests for additional path setup and load balancing. He recommended that I disable the Windows Firewall and sure enough the problem vanished. So it’s quite easy to inadvertently break iSCSI storage MPIO by making firewall settings changes to your system later on, and it’s easy to forget that these two things are related.

The problem for me was that this backup server has a NIC on the DMZ for faster backups (bypassing the hardware firewall). The pre- and post-backup job scripts enable and disable this NIC as required, but it does nonetheless need to be firewalled restrictively. In Windows 2003 the Windows Firewall can be enabled on a per NIC basis, however not in Windows 2008. Instead the firewall is configured instead in Network and Sharing Center on a per security zone basis (Domain Networks, Private Networks, Public Networks). The problem here is that the iSCSI NICs automatically end up in the Public zone, which is the most likely to be restricted. In my case, I had selected the option Block all connections including programs on the list of allowed programs. Even though the EqualLogic HIT Kit had specified an exemption rule, this was being denied.

Excluding iSCSI adapters

Relaxing the firewall in my scenario was not desirable, so I spent a while searching for a way to force the iSCSI NICs into the Private Networks zone. I couldn’t find one, though I did spot a method to exclude the NICs from the Network And Sharing Center altogether. In fact this same issue had been bothering people running VMware Workstation (because the VMware virtual NICs would get firewalled as Public Network connections), and fortunately someone had found a fix:
http://www.petri.co.il/exclude-vmware-virtual-adapters-vista-2008-network-awareness-windows-firewall.htm

The solution posted there uses a PowerShell script which automatically targets VMware adapters, but we can use the same registry modification. So, on your server use Regedit to navigate to HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}. There is a child branch here for each NIC. Find your dedicated iSCSI NICs and for each one, create a new DWORD value called *NdisDeviceType (including the asterisk) and give it a value of 1. Now disable and re-enable each modified NIC. You will see that they disappear from Network and Sharing Center, and are now unaffected by the Windows Firewall.

By setting *NdisDeviceType to a value of 1 the NIC is designated as an endpoint device and is not considered to be connecting to an external network, which is probably quite appropriate for a dedicated iSCSI storage connection. I wonder whether this is the sort of thing that ought to be automated by the HIT kit in future in fact.

Preference Order

Another thing that’s easily overlooked on servers with iSCSI storage (because it’s so well hidden) is that if you have been changing NIC configs (changing drivers, adding hardware, P2V converting, etc.) then it’s quite likely that you may have affected the preference order in which network services use physical adapters. You don’t generally want the iSCSI ones to become the higher priority ones, and I have experienced strange issues with Exchange Server in the past owing to this, as well as licence issues with copy-protected software that relies on generating a unique hardware-dependent machine ID. To set the order, open Network and Sharing Center, then click on Change Adapter Settings on the left hand side. Now hold Alt, then Advanced -> Advanced Seettings. Now you can configure the LAN NICs with higher priority:

NIC-preference-order

Optimizing virtual SQL Server performance

Some months ago I implemented these steps and saw a striking improvement in the performance of our applications (between 2x and 4x depending on the query):

  • Firstly, if you’re using iSCSI, make sure the network switches are ones which have been validated as ok by your storage vendor. I’ve run into poor performance using ones which ought to have worked and offered all the required features (Flow Control, Jumbo Frames), but which in reality were causing problems.
  • If you’re using iSCSI with a software initiator (be it at hypervisor or guest OS level), consider using Jumbo Frames to reduce I/O related CPU activity.
  • Move your VMDKs to a new SAN VMFS volume that is Thick-Provisioned. Although in my environment the EqualLogic array extends LUNs by 16MB at a time, over time this can fragment things appreciably. With a 1TB LUN this can get pretty bad.
  • Use Storage vMotion to make the VMDK files Thick Provisioned too. This eliminates fragmentation of the VMDK since it’s no longer growing in small increments. I think this made quite a big difference, despite a whitepaper from VMware denying a performance impact. The reasoning is that since the storage array has a big cache, having the data fragmented all over the disks shouldn’t really matter that much. I don’t really believe it, and my own results seemed to prove otherwise (what about a backup operation which will need to read your data sequentially in one long pass?). My SQL server vMotion operations were very slow compared to other servers, suggesting they were heavily fragmented in their old location.
  • Move all of your databases (including these system ones: msdb, model, master), their logs, and fulltext catalogs to a SAN LUN directly attached inside the Guest VM using Microsoft iSCSI initiator and your SAN vendor’s Integration Tools. If you use Vmxnet3 adapters then the TCP calculation overhead will be handled by the hypervisor which in turn can be passed to Broadcom bnx2 TOE NICs if you’re using vSphere 4.1. Having the databases on a separate LUN allows off-host backup of the databases using Backup Exec with your SAN vendors’ SQL-aware VSS Hardware Provider. Database backups can then occur at any time without any impact to the SQL server’s performance. I have written a dedicated post on this subject.
  • Create that SAN partition with its NTFS blocks aligned with the SAN’s own disk blocks to ensure no needless multiplication of I/O (64KB offset for EqualLogic – full explanation here).
  • Keep TempDB on the C: drive in its default location. That way I/O to that database is segregated and can be cached differently since it is using VMware’s iSCSI initiator and not the Microsoft initiator. Typically TempDB has high I/O, but it’s not a database that you need to back up so you don’t need to be able to snapshot it on the SAN.
  • Create an SQL management task to rebuild and defragment the database indexes and update their statistics every week (say, Sunday at 3:00am).
  • Change database file autogrow amounts from 1MB to 64MB to mitigate NTFS-level fragmentation of the database MDF files as they grow.

Upgrading vSphere ESXi 4.0 to 4.1 with Dell EqualLogic storage

There are several big motivators to moving over to vSphere 4.1 with respect to storage. Firstly, there’s support for vStorage APIs in new EqualLogic array firmwares (starting at v5.0.0 which sadly, together with 5.0.1 have been withdrawn pending some show-stopping bugs). VM snapshot and copy operations will be done by the SAN at no I/O cost to the hypervisor. Next there’s the support for vendor-specific Multipathing Extension Modules – EqualLogic’s one is available for download under the VMware Integration category. Finally, there’s the long overdue TCP Offload Engine (TOE) support for Broadcom bnx2 NICs. All of this means a healthy increase in storage efficiency.

If you’re upgrading to vSphere 4.1 and have everything set up as per Dell EqualLogic’s vSphere 4.0 best practice documents you’ll first need to:

  • Upgrade the hypervisors using vihostupdate.pl as per VMware’s upgrade guide, taking care to backup their configs first with esxcfg-cfgbackup.pl
 

Once that’s done choose an ESXi host to update, and put it in Maintenance Mode.

Make a note of your iSCSI VMkernel port IP addresses.

Make sure your ScratchConfig (Configuration -> Advanced Settings) is set to local storage. Reboot and check the change has persisted.

If the server has any Broadcom bnx2 family adapters they will now be treated as iSCSI HBAs so they will each have a vmhba designation. So, to unassign the previous explicit bindings to the Software iSCSI Initiator you need to check for its new name in the Storage Adapters configuration page.

You can’t unbind the VMkernel ports while there is an active iSCSI session using them so edit the properties of the Software iSCSI Initiator and remove the Dynamic and Static targets, then perform a rescan. Find your bound VMkernel ports using the vSphere CLI (replacing vmhba38 with the name of your software initiator):

bin\esxcli --server svr --username user --password pass swiscsi nic list -d vmhba38

Remove each bound VMkernel port like so (assuming vmk1-4 were listed as bound in the last step):

bin\esxcli --server svr --username user --password pass swiscsi nic remove -n vmk1 -d vmhba38
bin\esxcli --server svr --username user --password pass swiscsi nic remove -n vmk2 -d vmhba38
bin\esxcli --server svr --username user --password pass swiscsi nic remove -n vmk3 -d vmhba38
bin\esxcli --server svr --username user --password pass swiscsi nic remove -n vmk4 -d vmhba38

Now you can disable the Software iSCSI Initiator using the vSphere Client and then remove all the VMkernel ports and your iSCSI vSwitches.

Take note at this point that, according to the release notes PDF for the EqualLogic MEM driver, the Broadcom bnx2 TOE-enabled driver in vSphere 4.1 does not support jumbo frames. This information is further on in the document and unfortunately I only read it after I had already configured everything with jumbo frames so I had to start again. Any improvement they offer is kind of moot here since the Broadcom TOE will take over all the strenuous TCP calculation duties from the CPU, and is probably able to cope with traffic at line speed even at 1500 bytes per packet. I guess it could affect performance at the SAN end so perhaps they will work on supporting a 9000 byte MTU in forthcoming releases.

Make sure you set the MTU back to 1500 for any software initiators running in your VMs that used jumbo frames!

Re-patch your cables so you’re using your available TOE NICs for storage. On a server like the Dell PowerEdge R710 the four Broadcom TOE NICs are in fact two dual chips. So if you want to maximize your fault tolerance, be sure to use vmnic0 & vmnic2 as your iSCSI pair, or vmnic1 & vmnic3.

Log in to your EqualLogic Group Manager and delete the CHAP user you were using for the Software iSCSI Initiator for this ESXi host. Create new entries for each hardware HBA you will be using. Copy the intiator names from the vSphere GUI, and be sure to grant them access in the VDS/VSS pane too. Add these users to the volume permissions, and remove the old one.

Using vSphere CLI install the Mutipath Extension Module:

setup.pl --install --server svr --username root --password pass --bundle dell-eql-mem-1.0.0.130413.zip

Reboot the ESXi host and run the setup script in interactive configuration mode. For multiple value answers, comma separate them:

setup.pl --server svr --username root --password pass --configure

If you have Broadcom TOE NICs say yes to hardware support. This script will set up the vSwitch and the VMkernel ports and take care of the bindings (thanks Dell!):

Configuring networking for iSCSI multipathing:
vswitch = vSwitchISCSI
mtu = 1500
nics = vmnic1 vmnic3
ips = 192.168.100.95 192.168.100.96
netmask = 255.255.255.0
vmkernel = iSCSI
EQL group IP = 192.168.100.112
Creating vSwitch vSwitchISCSI.
Setting vSwitch MTU to 1500.
Creating portgroup iSCSI0 on vSwitch vSwitchISCSI.
Assigning IP address 192.168.100.95 to iSCSI0.
Creating portgroup iSCSI1 on vSwitch vSwitchISCSI.
Assigning IP address 192.168.100.96 to iSCSI1.
Creating new bridge.
Adding uplink vmnic1 to vSwitchISCSI.
Adding uplink vmnic3 to vSwitchISCSI.
Setting new uplinks for vSwitchISCSI.
Setting uplink for iSCSI0 to vmnic1.
Setting uplink for iSCSI1 to vmnic3.
Bound vmk1 to vmhba34.
Bound vmk2 to vmhba36.
Refreshing host storage system.
Adding discovery address 192.168.100.112 to storage adapter vmhba34.
Adding discovery address 192.168.100.112 to storage adapter vmhba36.
Rescanning all HBAs.
Network configuration finished successfully.

Now go back to your active HBAs and enter the new CHAP credentials. Re-scan and you should see your SAN datastores.

Recreate a pair of iSCSI VM Port Groups for any VMs that may use their own software initiators (very convenient for off-host backup of Exchange or SQL), making sure to explicitly set only one network adapter active, and the other to unused. Reverse the order for the second VM port group. Notice that setup.pl has done this for the VMkernel ports which it created.

Reboot again for good measure since we’ve made big changes to the storage config. I noticed at this point that on my ESXi hosts the Path Selection Policy for my EqualLogic datastore reset itself to Round Robin (VMware). I had to manually set it back to DELL_PSP_EQL_ROUTED. Once I had done that it persisted after a reboot.

Moving your SQL 2005 databases ready for VSS off-host backups

Many storage vendors now offer hardware Volume ShadowCopy Service providers for their storage arrays which allow the SAN itself to carry out the snapshot, rather than the underlying OS. These providers are Exchange- and SQL-aware so they will quiesce the transaction logs just before the snapshot.

The big win here is off-host backup – the target server asks the SAN to snap the data volume then carries on as normal. The backup media server meanwhile will mount this SAN snapshot and back it up directly from the SAN. In this way you can backup SQL and Exchange environments in the middle of the day without any performance degradation (assuming of course that you have the IOPS headroom on the SAN). Symantec Backup Exec 12.5 and later supports this technology but it must be purchased as an option – Advanced Disk-based Backup Option or ADBO.

However, to off-host backup Exchange or SQL you will need to have both the databases/mail stores and the transaction logs on the same SAN LUN. This flies in the face of the old wisdom of segregating logs onto RAID1 spindles, but it’s important to realise that a modern SAN makes this perfectly viable. The EqualLogic PS4000XV in my environment for instance has a write latency of <1.0ms in RAID50. Microsoft used to recommend keeping SQL logs on disk with a write latency sub 10ms (now they say 1-5ms).

Moving all of the transaction logs on a crowded SQL server is tricky for several reasons:

  • the Transact-SQL database alter command requires you to know the database’s logical filenames. On an SQL 2005 server, these are largely predictable but it gets complicated when some of the DBs have migrated from SQL 2000 and if some of them were restored from backups of databases with different names (dev or test versions which then went into production for example).
  • the system databases are likely to be on C: and if you want to grab all DBs in one backup job these will need moving too (TempDB is usually ignored by most backup software and can stay where it is).
  • though some guides on the Web suggest detaching and re-attaching the DBs – this is a surefire way to end up with a total disaster since the re-attached DBs will have new GUIDs which will wreck SharePoint amongst others.

Backing up then restoring your databases specifying new file paths is one method but the danger is that you would need to isolate them so that no changes occurred during that time window (which could be a long time).

Moving user databases

The best solution for these is 99% careful preparation work – to build a long list of T-SQL database alter commands to change the SQL file references, and a batch script to move the actual files to the database drive. You can also use this as an opportunity to clean up any badly named files, and move ones that are in the wrong place.

It is highly recommended that if you haven’t already done so, you should set the following registry values on the SQL server which will guard against future inconsistencies. If they already exist, check they’re still valid:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer\DefaultData
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer\DefaultLog
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer\BackupDirectory
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer\FullTextDefaultPath

For each database you need to run the following Transact-SQL:

Use [DBname]
Select * from sys.database_files

This will return all the files in the filegroup including full text catalogs (if they exist) together with their logical names (name column):
Database logical filenames

In this example the transaction log is already in the desired location, but it if was in say C:\TRANSACTIONS LOGS we would need to write:

alter database [SUSDB] modify file (name = SUSDB_log, filename = 'G:\DATABASES\SUSDB_log.ldf')

You would then add this to your file move batch script:

move /y "C:\TRANSACTION LOGS\SUSDB_log.LDF" G:\DATABASES\SUSDB_log.ldf

My method was to run a full SQL backup to commit the transaction logs (less data to move), run the alter database commands all at once (which don’t take effect until the SQL Server service next starts), stop the SQL Server service, run the file move batch script, check for any errors, then start the SQL Server service again. Once it’s up, you can try to expand each database in SQL Management Studio. Any databases with damaged file paths will not expand. Refer back to your command prompt window to try and figure out what went wrong (usually a typo).

In this way you should be able to move all of the logs with a bare minimum of downtime – several minutes in my case.

Moving system databases

Moving system databases is fairly straightforward, but it will require a little more downtime. Again, I’d probably leave TempDB where it is to separate its I/O from the rest as it can be high and we don’t need to back it up. If you do want to move it, the procedure is the same as any non-system database. The rest though are special cases.

Run the following and note the current file locations which will probably be in C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA

use model
select * from sys.database_files
use msdb
select * from sys.database_files
use master
select * from sys.database_files

Now close SQL Management Studio and run the following from a Command Prompt (the parameters are case sensitive!):

net stop mssqlserver
net start mssqlserver /c /m /T3608

Open SQL Management Studio again but read this carefully. With these startup parameters, SQL Server will only allow a single connection. The default behaviour of the GUI is to open the Object Explorer window once you connect, which counts as a connection. You need to click on the Disconnect button, and close the Object Explorer child window. You should then be able to open a New Query.
If you closed the Object Explorer without disconnecting you’ll get the error “Server is in single user mode. Only one administrator can connect at this time.” and you’ll need to stop and start the service again, as above, and repeat. Next:

sp_detach_db 'model'
sp_detach_db 'msdb'

Move the files to the new location (logs and databases remember), then run the following taking care to substitute in your new file paths:

sp_attach_db 'model','G:\DATABASES\model.mdf','G:\DATABASES\modellog.ldf'
sp_attach_db 'msdb','G:\DATABASES\msdbdata.mdf','G:\DATABASES\msdblog.ldf'

Stop the SQL Server service. Start it again normally (no parameters) and check you can expand model and msdb in Management Studio.

We just have the master database left to move. Stop the SQL Server service again. Move master’s log and database files to the new location. On the SQL server machine’s console, open Start Menu > Programs > Microsoft SQL Server 2005 > Configuration Tools > SQL Server Configuration Manager.
In the category SQL 2005 Services, select SQL Server (MSSQLSERVER) and look at the Properties. Select the Advanced tab. Select Startup Parameters and pull down the dropdown next to it.
Change the value from the defaults of:

-dC:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\master.mdf;-eC:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\ERRORLOG;-lC:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\mastlog.ldf

to your new file paths (don’t change the error log path by accident):

-dG:\DATABASES\master.mdf;-eC:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\ERRORLOG;-lG:\DATABASES\mastlog.ldf

SQL Server Configuration Manager
Finally, start the SQL Server Service. Done!

The trouble with full-text catalogs

If you rely on the EqualLogic Auto-Snaphot Manager to tell you whether your databases now support SAN snapshots you could be in for a surprise when you backup using ADBO in Backup Exec:

V-79-57344-34086 – ADBO: Offhost backup initialization failure on: “myhostname.domain.com”.
Snapshot provider error (0xE0008526): Backup Exec could not locate a Microsoft Volume Shadow Copy Services (VSS) software or hardware snapshot provider for this job. Select a valid VSS snapshot provider for the target computer.
Check the Windows Event Viewer for details.

This is an awful error message because it doesn’t really describe the problem (and you won’t find anything meaningful in the Event Viewer). It almost looks like a registration failure of the Hardware VSS Provider, which is misleading, and caused me about 2 hours of out-of-hours work reinstalling it, taking the server offline, etc. to satisfy Symantec Support. However, run a job with the same selection list but using normal AOFO (Advanced Open File Backup) and you get:

AOFO: Initialization failure on: “myhostname.domain.com”. Advanced Open File Option used: Microsoft Volume Shadow Copy Service (VSS).
V-79-10000-11219 – VSS Snapshot error. The volume or snapped volume may not be available or may not exist. Check the configuration of the snapshot provider, and then run the job again.
The following volumes are dependent on resource: “C:” “D:” “G:”.

Much clearer – there’s a dependency on the D: drive being detected, the drive I migrated from. By chance I changed the backup selection list realised that some databases backed up while others didn’t. The cause turned out to be a full text catalog.

The EqualLogic ASM only checks the database and log files, not full-text catalogs. Moving these seems to be pretty difficult. Microsoft have an MSDN document describing database moves (see section on catalogs further down the page). I have tried following this process to the letter, and when that didn’t work I tried various permutations of stopping the SQL Server service, the SQL FullText Search service (which seems to autorestart), the SQL Server Agent service, copying the files, not copying the files (expecting SQL to move them) etc. No combination seemed to work for me. What I found was that, while it is easy enough to move the catalog path like so:

alter database [ExampleDB] modify file (name = [sysft_ExampleDB], filename = 'G:\DATABASES\FTData\ExampleDB')

there is some meta data that does not get updated and the ADBO backup will still fail when the VSS provider checks all the file dependencies.
sys.database_files shows the correct paths. Eventually I discovered that

Select * from sys.fulltext_catalogs

still showed the old location for the catalogs. The only way I could find to get this to update was to rebuild the full-text catalog in SQL Management Studio – expand the database > Storage > Full Text Catalogs > right-click > Rebuild.

For me this was acceptable and quick, but I imagine some infrastructures might not be so tolerant of a rebuild.