ESX performance: RESERVATIONS, What are they and why are they needed?
We continue to encounter many ESX environments where performance is an issue. Most of the time, the culprit is disk IO. I would like to cover a few of these challenges in this blog series, ESX Performance.
One challenge to disk performance is SCSI reservations. SCSI reservations are an essential part of ESX. Understanding their use, combined with other ESX locking mechanisms, will help ESX administrators prevent catastrophic outages or performance degradation.
Though it is possible to configure an ESX environment such that it will not experience a single SCSI reservation conflict, such a configuration defeats the benefits of ESX virtualization. The key is balancing VM instances within a farm on a single storage pool.
What is a SCSI reservation?
At a high level, there are two basic SCSI locking mechanisms used within the industry: SCSI-2 LUN Reserve and SCSI-3 Persistent Group Reservations (PGR). Both of these mechanisms restrict LUN access. Though ESX supports PGR for use with Windows Clustering within the virtualization environment, ESX does not natively use PGR control sequences (though it can manipulate them manually via the sg_persist command). In either case, the concept is the same: protection of data at the LUN level.
Access to a LUN is the major difference between these two technologies. PGR uses restricted access keys to facilitate read and write access for all nodes with visibility of the LUN. SCSI-2, on the other hand, locks the LUN and prevents read() or write() calls to that LUN from anyone but the owner that established the lock, which has exclusive access to the LUN.
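To make the difference concrete, here is a deliberately simplified toy model in Python. It is only a sketch of the access rules described above, not a SCSI implementation: the Lun class and its method names are invented for illustration, and real PGR has several reservation types that are not modeled here.

```python
class Lun:
    """Toy model of LUN access control (hypothetical, for illustration only)."""

    def __init__(self):
        self.scsi2_owner = None   # initiator holding a SCSI-2 RESERVE, if any
        self.pgr_keys = set()     # initiators that registered a PGR key

    def scsi2_reserve(self, initiator):
        # Succeeds only if the LUN is free or already reserved by this initiator.
        if self.scsi2_owner in (None, initiator):
            self.scsi2_owner = initiator
            return True
        return False  # this is a reservation conflict

    def scsi2_release(self, initiator):
        # Only the owner can release its own reserve.
        if self.scsi2_owner == initiator:
            self.scsi2_owner = None

    def pgr_register(self, initiator):
        self.pgr_keys.add(initiator)

    def can_do_io(self, initiator):
        # SCSI-2: while the reserve is held, only the owner may read or write.
        if self.scsi2_owner is not None:
            return initiator == self.scsi2_owner
        # PGR: every initiator holding a registered key may read and write.
        if self.pgr_keys:
            return initiator in self.pgr_keys
        return True


lun = Lun()
lun.scsi2_reserve("esx1")
print(lun.can_do_io("esx1"), lun.can_do_io("esx2"))
```

With the SCSI-2 reserve in place, only the owning host passes can_do_io(); with PGR keys registered instead, every registered host does.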
What is a reservation conflict?
A reservation conflict takes place when a reservation (either SCSI-3 or SCSI-2) is held on a LUN by one I_T nexus (initiator) and a reservation request is made by a non-owning initiator. The ESX server will attempt to re-establish the lock after 50 ms and will retry 80 times by default (see ScsiConflictRetry under SCSI advanced properties).
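As a rough illustration of what those defaults mean, the sketch below models the retry loop in Python. It assumes a fixed 50 ms pause between attempts, which is an interpretation of the description above; the function names and the try_reserve callback are hypothetical and are not part of ESX.

```python
import time


def reserve_with_retry(try_reserve, retries=80, delay_ms=50, sleep=time.sleep):
    """Call try_reserve() until it succeeds or the retries are used up.

    Returns the number of retries that were needed (0 means the first
    attempt succeeded), or None if every retry hit a conflict.
    """
    if try_reserve():
        return 0
    for attempt in range(1, retries + 1):
        sleep(delay_ms / 1000)  # assumed fixed pause between attempts
        if try_reserve():
            return attempt
    return None


def worst_case_wait_s(retries=80, delay_ms=50):
    # Total time spent pausing if every single retry is needed.
    return retries * delay_ms / 1000


print(worst_case_wait_s())
```

Under these assumptions, a host that keeps hitting conflicts spends about four seconds retrying before it gives up on the lock.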
Does the array track reservations?
This depends on the array. Certain Enterprise class arrays do track this sort of information and it is very helpful in identifying problems. However, this information will be extremely cryptic and will require the assistance of the storage vendor’s experts to decode.
When does ESX need a SCSI-2 reservation? (LUN LEVEL LOCK)
Though a single ESX server uses SCSI-2 lock control sequences, the benefit is not realized until a cluster (aka FARM) is formed allowing multiple ESX servers to have access to the same storage pool (LUN). A lock is required for the following basic operations:
1) File system operations (storage pool) # rarely performed but will require a SCSI-2 reservation
- Creation of storage pool
- Re-signature
- Expand/extend
2) Resource lock operations within VMFS: (Extremely abundant and require a SCSI-2 reservation)
- Acquire file lock
- Acquire Metadata lock
- Powering on VM
- Creating or deleting a file (Not a VM, but a file within the VMFS filesystem) (scripts that change permissions on files, create files, delete files, etc…)
- Deploying VMs
- Creating VMs
- Growing files
- VMotion (host or storage)
VMFS Background: (layout)
VMFS, like any filesystem, consists of inodes, blocks, sub blocks, etc. The major difference is how these structures are organized into resource clusters and cluster groups. A resource cluster has an "on-disk lock" and metadata associated with it. The filesystem then groups many clusters to form a resource group. This process repeats over and over, making up the filesystem address space.
The filesystem driver manages the locality of VMs (vmdk and other files associated with the VM) within cluster groups. How the VMs are deployed (creation of VMs on a single node in the cluster vs. creation of VMs on each node in the cluster) determines the dispersion of the VMs over the resource groups, as the filesystem attempts to maintain spatial locality. This is achieved by the "on-disk locks" maintaining the host ID that owns the resource. As this layout implies, there are thousands of these locks (not to be confused with SCSI reservations) within the filesystem, and each is located near the object it protects.
To be clear, file system on-disk locks are not reservations; on the other hand, a reservation is needed to acquire them.
VMFS-3:
When an ESX server requires an on-disk lock, the following takes place:
Establish SCSI-2 lock (disrupts all other initiators' IO) --> read on-disk lock --> update HOST info with on-disk lock if needed --> write updated lock --> Release SCSI-2 lock
This cycle takes only a few milliseconds (roughly 10-20). Nonetheless, it is disruptive to the other ESX servers, as not a single IO from another ESX host, or any host for that matter, is allowed to access the LUN while the SCSI-2 RESERVE exists. The normally short duration of the RESERVE can be extended if the array is heavily loaded or handling error conditions within the SAN.
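A quick back-of-the-envelope calculation shows why a few milliseconds still matter. The Python sketch below estimates the share of time a LUN spends under a SCSI-2 RESERVE. The workload figures in the example are hypothetical, and the model assumes reserves never overlap, so treat the result as a rough estimate only.

```python
def reserved_fraction(lock_ops_per_second, hold_ms):
    """Estimated fraction of time the LUN is held under a SCSI-2 RESERVE.

    Assumes each lock operation holds the reserve for hold_ms and that
    reserves from different hosts never overlap (a simplification).
    """
    return min(1.0, lock_ops_per_second * hold_ms / 1000)


# Hypothetical farm: 10 lock operations per second across all hosts,
# each holding the reserve for 15 ms (middle of the 10-20 ms range).
print(f"{reserved_fraction(10, 15):.0%} of the time the LUN is reserved")
```

During that share of time, every other host sharing the LUN is locked out, which is why the number of lock operations per storage pool matters more than the length of any single reserve.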
Latest versions of VMFS: Optimistic locking
With VMFS 3.1+, optimistic locking means the driver first reads all the "on-disk" locks and modifies only the metadata whose locks this host already owns. This "modification" is not written to disk yet, as there is no SCSI-2 RESERVE. At this point, a SCSI-2 RESERVE is placed on the LUN (whereas before, the SCSI-2 RESERVE was placed first, in order to read the on-disk locks). Now the final steps are taken to read and modify the on-disk locks which were not owned by this server (if there are any). This is immediately followed by the journal update, the flushing of all metadata to disk, and the SCSI-2 RELEASE.
Compared with the previous versions of VMFS, the sequence becomes:
Read on-disk locks --> Modify free locks (already owned) --> Establish SCSI-2 lock (disruptive to all others) --> acquire all needed disk locks --> write/update metadata and on-disk locks --> Release SCSI-2 lock
How much time is a SCSI-2 Reserve held?
As mentioned earlier, the reserve itself is only held for a few milliseconds (10-20). Optimistic locking defers the SCSI-2 reserve, reducing the total time (albeit by milliseconds) that the SCSI-2 lock is held and thus reducing the chance of a reservation conflict within the farm.
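The two sequences can be compared side by side with a small Python sketch. The step names are copied from the flows described above. The list representation and the helper function are just an illustration and do not reflect how the VMFS driver is written.

```python
VMFS3_CYCLE = [
    "RESERVE",
    "read on-disk lock",
    "update host info",
    "write updated lock",
    "RELEASE",
]

OPTIMISTIC_CYCLE = [
    "read on-disk locks",
    "modify free locks (already owned)",
    "RESERVE",
    "acquire all needed disk locks",
    "write/update metadata and on-disk locks",
    "RELEASE",
]


def steps_under_reserve(cycle):
    """Return the steps that run while the SCSI-2 reserve is held."""
    held = False
    inside = []
    for step in cycle:
        if step == "RESERVE":
            held = True
        elif step == "RELEASE":
            held = False
        elif held:
            inside.append(step)
    return inside


print(steps_under_reserve(VMFS3_CYCLE))
print(steps_under_reserve(OPTIMISTIC_CYCLE))
```

With optimistic locking, the initial read of the on-disk locks happens before the reserve is taken, so other hosts are not blocked while it runs.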
Take away:
- In order to easily track reservation conflicts, it is best to maintain LUN identification uniformity throughout the farm. For example, LUN 0's address on node 1 should be the same on all nodes in the farm.
- If the farm is using an ALUA array (proxy reads), confirm that all nodes accessing the same LUN are accessing that LUN over the array controller that owns it. This prevents a SCSI_START from forcing the LUN to change controller ownership within the array, which in turn prevents a thrashing condition and stalled reservations. This is true for all ALUA arrays.
- Monitor SCSI reservations to make sure that they stay at or below roughly 100 per hour. This is just a rule of thumb: if your environment has more and is achieving the desired performance, there is no need to change anything. Strive for fewer, but depending on workload, IO profile, etc., mileage may vary.
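The hourly rule of thumb can be checked with a few lines of Python once the event timestamps have been extracted from the logs. This is a minimal sketch. The function names are made up for illustration, and how the timestamps are parsed out of the logs is left to the reader, since that depends on the log format.

```python
from collections import Counter
from datetime import datetime


def events_per_hour(timestamps):
    """Count events per clock hour from a list of datetime objects."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return dict(buckets)


def hours_over_threshold(timestamps, threshold=100):
    """Return the hours, in order, whose event count exceeds the threshold."""
    per_hour = events_per_hour(timestamps)
    return sorted(hour for hour, count in per_hour.items() if count > threshold)
```

The threshold parameter defaults to the 100 per hour mentioned above and can be raised for environments that perform well at higher rates.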
Reference docs:
Best practices for deploying ESX with HP storage arrays are documented at:
http://h20195.www2.hp.com/v2/GetPDF.aspx/4AA3-0450
Real world example:
Summary:
The ESX farm experienced stalled access to storage pools during two separate windows on October 7th. The affected storage pools were VMFS_1, VMFS_2, and VMFS_3.
After confirming that each LUN was the same for each ESX server in the farm, the following was run to look at the reservations (BUS:target:LUN:partition):
$ grep "Oct [7]" */LOGS/log/vmk* | grep 24/0 | awk '{print $9}' | sort | uniq -c
59974 vmhba0:0:10:0
23898 vmhba0:0:2:0
69711 vmhba0:1:1:0
99981 vmhba0:1:3:0
..
2 vmhba0:2:7:0
16 vmhba0:3:5:0
...
24554 vmhba1:0:21:0
363899 vmhba1:0:2:0
52979 vmhba1:0:3:0
111306 vmhba1:1:1:0
7025 vmhba1:1:21:0
250340 vmhba1:1:3:0
...
2266 vmhba1:2:4:0
823 vmhba1:2:6:0
962 vmhba1:2:7:0
1020 vmhba1:3:5:0
14442 vmhba1:3:9:0
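For anyone who prefers to do the same tally in a script, here is a rough Python equivalent of the grep/awk/sort/uniq pipeline above. It is only a sketch: it assumes, as the awk command does, that the device path is the ninth whitespace-separated field of each matching line. The function name is invented, and the field position should be adjusted to match your own log format.

```python
from collections import Counter


def count_conflicts_by_device(lines, day="Oct 7", marker="24/0", field=9):
    """Tally matching log lines per device path.

    Mirrors: grep <day> | grep <marker> | awk '{print $<field>}' | sort | uniq -c
    Lines with fewer than <field> fields are skipped here, whereas awk
    would print an empty string for them.
    """
    counts = Counter()
    for line in lines:
        if day in line and marker in line:
            parts = line.split()
            if len(parts) >= field:
                counts[parts[field - 1]] += 1
    return counts
```

Calling most_common() on the returned Counter lists the busiest paths first, which makes the outliers easier to spot than in the sorted-by-name output.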
Placing the resulting data into a graph format, a pattern emerges:
The root cause:
The root cause was a combination of events:
- Layout of VMs on the storage pool
- Preferred path setting using non-optimal paths on an ALUA array (see best practice documentation for details)
- Heavy IO workload from development servers saturating the array resulting in elongated service times
After redistributing virtual machines to achieve better storage locality and changing the preferred path setting to prevent the ALUA array controllers from thrashing LUN ownership, the SCSI reservation conflict rate has remained at acceptable levels, resulting in optimal performance.
Another site with a similar issue solved using the same methodology:
[Graphs: reservation conflicts before and after the changes]