Technical Support Services Blog
Discover the latest trends in technology, and the technical issues customers are overcoming with the aid of HP Technology Services.

ESX performance: RESERVATIONS, What are they and why are they needed?

We continue to encounter many ESX environments where performance is an issue.  Most of the time, the culprit is disk IO.  I would like to cover a few of these challenges in this blog series, ESX PERFORMANCE.

 

A challenge to disk performance is SCSI reservations.  SCSI reservations are an essential part of ESX.  Understanding their use, combined with other ESX locking mechanisms, will help ESX administrators prevent catastrophic outages or performance degradation.

 

Though it is possible to configure an ESX environment such that it will not experience a single SCSI reservation conflict, such a configuration defeats the benefits of ESX virtualization.  The key is balancing VM instances within a farm on a single storage pool.

 

What is a SCSI reservation? 

At a high level, there are two basic SCSI locking mechanisms utilized within the industry:  SCSI-2 LUN Reserve and SCSI-3 Persistent Group Reservations (PGR). Both of these mechanisms restrict LUN access.  Though ESX supports PGR for use with Windows Clustering within the virtualization environment, ESX does not natively use PGR control sequences (though they can be manipulated manually via the sg_persist command).  In either case, the concept is the same: protection of data at the LUN level.

 

Access to a LUN is the major difference between these two technologies.  PGR uses restricted access keys to facilitate read and write access for all nodes with visibility of the LUN.  SCSI-2, on the other hand, locks the LUN and prevents read() or write() calls to that LUN from any other node, allowing exclusive access only to the owner that established the lock.

 

What is a reservation conflict?

A reservation conflict takes place when a reservation (either SCSI-3 or SCSI-2) held by one I_T nexus (initiator) exists on a LUN and a reservation request is made by a non-owning initiator.  The ESX server will attempt to re-establish the lock after 50 ms and will retry 80 times by default (see ScsiConflictRetry under SCSI advanced properties).
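As a rough illustration of the retry behavior described above, the following Python sketch models a host that retries a conflicting reservation with a fixed delay. The function name, the `try_reserve` callable, and the assumption that the delay is constant between attempts are illustrative only; this is not ESX code, and the numbers are simply the defaults quoted above.

```python
def reserve_with_retry(try_reserve, retries=80, delay_ms=50):
    """Model of a retried reservation request (hypothetical helper).

    try_reserve: callable returning True when the reservation succeeds.
    Returns (succeeded, retries_used, total_wait_ms).
    Assumes a fixed delay between attempts and ignores command time.
    """
    if try_reserve():
        return True, 0, 0
    waited_ms = 0
    for attempt in range(1, retries + 1):
        waited_ms += delay_ms          # wait before the next attempt
        if try_reserve():
            return True, attempt, waited_ms
    return False, retries, waited_ms


# If the LUN stays reserved by another initiator, the host gives up
# after 80 retries * 50 ms of waiting under these assumptions.
print(reserve_with_retry(lambda: False))
```

Under these assumptions the retry budget is 80 x 50 ms = 4000 ms of waiting before the request is failed, which is why a long-held reservation on a busy array can surface as a multi-second stall.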

 

Does the array track reservations?

This depends on the array.  Certain Enterprise class arrays do track this sort of information and it is very helpful in identifying problems.  However, this information will be extremely cryptic and will require the assistance of the storage vendor’s experts to decode. 

 

When does ESX need a SCSI-2 reservation? (LUN LEVEL LOCK)

Though a single ESX server uses SCSI-2 lock control sequences, the benefit is not realized until a cluster (aka FARM) is formed allowing multiple ESX servers to have access to the same storage pool (LUN).  A lock is required for the following basic operations:

 

1) File system operations (storage pool): (Rarely performed, but require a SCSI-2 reservation)

  • Creation of storage pool
  • Re-signature
  • Expand/extend

2) Resource lock operations within VMFS: (Extremely abundant and require a SCSI-2 reservation)

  • Acquire file lock
  • Acquire Metadata lock
  • Powering on VM
  • Creating or deleting a file (not a VM, but a file within the VMFS filesystem; for example, scripts that change permissions on files, create files, delete files, etc.)
  • Deploying VMs
  • Creating VMs
  • Growing files
  • VMotion (host or storage)

VMFS Background: (layout)

VMFS, like any filesystem, consists of inodes, blocks, sub-blocks, etc.  The major difference is how these structures are organized into resource clusters and cluster groups.  A resource cluster has an "on-disk lock" and metadata associated with it.  The filesystem then groups many clusters to form a resource group. This process repeats over and over, making up the filesystem address space.

 

The filesystem driver manages the locality of VMs (vmdk and other files associated with the VM) within cluster groups.  How the VMs are deployed (creation of VMs on a single node in the cluster vs creation of VMs on each node in the cluster) determines the dispersion of the VMs over the resource groups, as the filesystem attempts to maintain spatial locality.  This is achieved by the "on-disk locks" maintaining the host ID that owns the resource.  As this layout implies, there are thousands of these locks (not to be confused with SCSI reservations) within the filesystem, each located near the object it is protecting.

 

To be clear, file system on-disk locks are not reservations; on the other hand, a reservation is needed to acquire them.

 

VMFS-3

When an ESX server requires an on-disk lock, the following takes place: 

Establish SCSI-2 lock (disrupts all other initiators' IO) --> read on-disk lock --> update HOST info with on-disk lock if needed --> write updated lock --> Release SCSI-2 lock

 

This cycle takes only a few milliseconds (approximately 10-20).  Nonetheless, it is disruptive to the other ESX servers, as not a single IO from another ESX host, or any host for that matter, is allowed to access the LUN while the SCSI-2 RESERVE exists.  The short duration of the RESERVE can be extended if the array is heavily loaded or handling error conditions within the SAN.
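The VMFS-3 cycle can be sketched as a toy model in Python. Everything here (the `Lun` class, the host names, the lock IDs) is hypothetical and exists only to show the ordering of steps and why other hosts see conflicts while the reserve is held; it is not how the VMFS driver is implemented.

```python
class ReservationConflict(Exception):
    """Raised when a non-owning host touches a reserved LUN."""


class Lun:
    """Toy LUN: one SCSI-2 style reserve plus a table of on-disk locks."""

    def __init__(self):
        self.reserved_by = None
        self.on_disk_locks = {}    # lock id -> owning host id

    def reserve(self, host):
        if self.reserved_by not in (None, host):
            raise ReservationConflict(host)
        self.reserved_by = host

    def release(self, host):
        if self.reserved_by == host:
            self.reserved_by = None

    def io(self, host):
        # Any IO from a non-owning host fails while the reserve exists.
        if self.reserved_by not in (None, host):
            raise ReservationConflict(host)


def acquire_on_disk_lock(lun, host, lock_id):
    """Reserve -> read lock -> update if free -> write -> release."""
    lun.reserve(host)
    try:
        owner = lun.on_disk_locks.get(lock_id)     # read on-disk lock
        if owner is None:
            lun.on_disk_locks[lock_id] = host      # update and write lock
            owner = host
        return owner == host
    finally:
        lun.release(host)                          # always drop the reserve
```

In this model the reserve brackets the whole read-modify-write of the on-disk lock, so the longer that cycle takes, the longer every other host is shut out of the LUN.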

 

Latest versions of VMFS: Optimistic locking

With VMFS 3.1+, optimistic locking means the driver first reads all the "on-disk" locks and modifies only the metadata whose locks this host already owns.  This "modification" is not yet written to disk, as there is no SCSI-2 RESERVE.  At this point, a SCSI-2 RESERVE is placed on the LUN (whereas before, the SCSI-2 RESERVE was placed first, in order to read the on-disk locks).  The final steps are then taken to read and modify the on-disk locks that were not owned by this server (if there are any).  This is immediately followed by the journal update, the flushing of all metadata to disk, and the SCSI-2 RELEASE.

Compared with previous versions of VMFS, the sequence becomes:

Read on-disk locks --> Modify free locks (already owned) --> Establish SCSI-2 lock (disruptive to all others) --> acquire all needed disk locks --> write/update metadata and on-disk locks --> Release SCSI-2 lock

 

How much time is a SCSI-2 Reserve held?

As mentioned earlier, the reserve itself is held for only a few milliseconds (10-20).  Optimistic locking defers the SCSI-2 reserve until later in the cycle, reducing the total time (albeit by milliseconds) that the SCSI-2 lock is held and thus reducing the chance of a reservation conflict within the farm.
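To make the difference concrete, this small Python sketch lists the two sequences described in this post and extracts the steps that run while the SCSI-2 reserve is held. The step names are paraphrased from the text; the constants and the helper are illustrative, and a step count is only a proxy for hold time, not a measurement.

```python
RESERVE = "establish SCSI-2 lock"
RELEASE = "release SCSI-2 lock"

# Original VMFS-3 ordering: the reserve is taken before anything is read.
VMFS3_SEQUENCE = [
    RESERVE,
    "read on-disk lock",
    "update host info in on-disk lock if needed",
    "write updated lock",
    RELEASE,
]

# Optimistic ordering: reading and local modification happen first.
OPTIMISTIC_SEQUENCE = [
    "read on-disk locks",
    "modify free locks (already owned)",
    RESERVE,
    "acquire all needed disk locks",
    "write/update metadata and on-disk locks",
    RELEASE,
]


def steps_under_reserve(sequence):
    """Return the steps that execute between RESERVE and RELEASE."""
    start = sequence.index(RESERVE)
    end = sequence.index(RELEASE)
    return sequence[start + 1:end]


for name, seq in (("VMFS-3", VMFS3_SEQUENCE), ("optimistic", OPTIMISTIC_SEQUENCE)):
    print(name, steps_under_reserve(seq))
```

The point of the comparison is that the read of the on-disk locks moves outside the reserved window, so less work is done while other initiators are locked out.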

 

Take away:

  • In order to easily track reservation conflicts, it is best to maintain LUN identification uniformity throughout the farm.  For example, LUN 0's address on node 1 should be the same on all nodes in the farm.
  • If the farm is using an ALUA array (proxy reads), confirm that all nodes accessing the same LUN are accessing that LUN over the array controller that owns it.  This prevents a SCSI_START from forcing the LUN to change controller ownership within the array, thus preventing a thrashing condition and stalled reservations.  This is true for all ALUA arrays.
  • Monitor SCSI reservations and aim to keep them at or below about 100 per hour, though this is just a rule of thumb.  If your environment has more and is achieving the desired performance, there is no need to change anything.  Strive for fewer, but depending on workload, IO profile, etc., mileage may vary.
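Two of these checks are easy to script once the data has been collected. The Python sketch below assumes you have already gathered, by whatever means, a mapping of each host to the device address it uses for each storage pool, and a count of reservations per hour. The function names, host names, and sample values are hypothetical.

```python
def mismatched_luns(paths_by_host):
    """Return storage pools whose device address differs between hosts.

    paths_by_host: {host: {storage_pool: device_address}}
    """
    seen = {}
    for pools in paths_by_host.values():
        for pool, address in pools.items():
            seen.setdefault(pool, set()).add(address)
    return sorted(pool for pool, addrs in seen.items() if len(addrs) > 1)


def hours_over_threshold(count_by_hour, threshold=100):
    """Return the hours whose count exceeds the rule-of-thumb threshold."""
    return sorted(hour for hour, n in count_by_hour.items() if n > threshold)


# Illustrative data only; host names and addresses are made up.
farm = {
    "esx01": {"VMFS_1": "vmhba1:0:2:0", "VMFS_2": "vmhba1:0:3:0"},
    "esx02": {"VMFS_1": "vmhba1:0:2:0", "VMFS_2": "vmhba1:0:4:0"},
}
print(mismatched_luns(farm))
print(hours_over_threshold({"02:00": 40, "03:00": 250, "04:00": 100}))
```

The threshold is a parameter rather than a constant because, as noted above, 100 per hour is a rule of thumb and not a hard limit.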

 

Reference docs:

Best practices for deploying ESX with HP storage arrays are documented at:

http://h20195.www2.hp.com/v2/GetPDF.aspx/4AA3-0450ENW.pdf

 

Real-world example:

Summary:

The ESX farm experienced stalled access to storage pools during two separate windows on October 7th.  The affected storage pools were VMFS_1,  VMFS_2 , and VMFS_3.

 

After confirming that each LUN was the same for each ESX server in the farm, the following was run to look at the reservations (device paths are in the form BUS:target:LUN:Partition):

$ grep "Oct  [7]" */LOGS/log/vmk*  | grep 24/0 | awk '{print $9}' | sort | uniq -c

  59974 vmhba0:0:10:0

  23898 vmhba0:0:2:0

  69711 vmhba0:1:1:0

  99981 vmhba0:1:3:0

..

      2 vmhba0:2:7:0

     16 vmhba0:3:5:0

 

  ...

  24554 vmhba1:0:21:0

 363899 vmhba1:0:2:0

  52979 vmhba1:0:3:0

 111306 vmhba1:1:1:0

   7025 vmhba1:1:21:0

 250340 vmhba1:1:3:0

...

   2266 vmhba1:2:4:0

    823 vmhba1:2:6:0

    962 vmhba1:2:7:0

   1020 vmhba1:3:5:0

  14442 vmhba1:3:9:0
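For anyone who prefers to post-process the logs in a script, the same tally can be sketched in Python. This mirrors the pipeline above (filter on the date and on 24/0, take the ninth whitespace-separated field, count occurrences). The sample lines are a made-up layout chosen only so that the device path lands in field 9; they are not real vmkernel log output, and the function name is hypothetical.

```python
from collections import Counter


def count_conflicts(lines, day="Oct  7", status="24/0", field=9):
    """Count lines per device path, like: grep | grep | awk '{print $9}' | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        if day not in line or status not in line:
            continue
        parts = line.split()
        if len(parts) >= field:            # skip lines too short to have the field
            counts[parts[field - 1]] += 1
    return counts


# Hypothetical sample lines (layout is an assumption, not a real log format).
SAMPLE_LINES = [
    "Oct  7 02:15:01 esx01 vmkernel: 12:03:04:05.678 cpu2:1034)SCSI: 5273: vmhba1:0:2:0 status = 24/0 0x0 0x0 0x0",
    "Oct  7 02:15:02 esx01 vmkernel: 12:03:04:06.120 cpu2:1034)SCSI: 5273: vmhba1:0:2:0 status = 24/0 0x0 0x0 0x0",
    "Oct  7 02:15:03 esx01 vmkernel: 12:03:04:07.001 cpu1:1034)SCSI: 5273: vmhba0:1:3:0 status = 24/0 0x0 0x0 0x0",
    "Oct  8 09:00:00 esx01 vmkernel: 13:09:49:04.000 cpu1:1034)SCSI: 5273: vmhba1:0:2:0 status = 24/0 0x0 0x0 0x0",
    "Oct  7 02:16:00 esx01 vmkernel: 12:03:05:04.000 cpu1:1034)SCSI: 5273: vmhba1:0:2:0 status = 0/0 0x0 0x0 0x0",
]

if __name__ == "__main__":
    counts = count_conflicts(SAMPLE_LINES)
    for path in sorted(counts):
        print(f"{counts[path]:7d} {path}")
```

Keeping the counts in a `Counter` also makes it straightforward to feed them into a plotting tool afterwards.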

 

 Placing the resulting data into a graph format, a pattern emerges:

[Image: reservation conflict 1.png]

 

The root cause:

The root cause was a combination of events: 

  • Layout of VMs on the storage pool
  • Preferred path setting using non-optimal paths on an ALUA array (see best practice documentation for details)
  • Heavy IO workload from development servers saturating the array resulting in elongated service times

After redistributing virtual machines to achieve better storage locality and changing the preferred path setting to prevent ALUA array controllers from thrashing LUN ownership, the SCSI reservation conflict rate has stayed at acceptable levels, resulting in optimal performance.

 

Another site with a similar issue solved using the same methodology:

 

After:

[Image: reservations now.jpg]

Before:

[Image: reservations then.jpg]
