Trying to pinpoint the root cause of MultiPath I/O (MPIO) errors.

by Kleinch on 12-13-2011 05:37 PM

The Problem:

 

Ivan had a customer that had set up an Exchange 2010 cluster on c7000 BladeSystem enclosures. The customer saw a lot of MPIO errors in the event logs and was trying to pinpoint their root cause. On some of these servers, disks actually disappeared. Most of the systems were multi-node clusters, so most of the time the application survived.

 

The Infrastructure:

 

C7000 Blade Enclosures with Virtual Connect 8Gb 24 Port FC Modules

BL460c G7 with Emulex LP1205 FC HBAs

HP Storage 8/40 SAN Switches

HP Storage EVA6400 Arrays

 

Operating System:

 

OS: Windows 2008 R2-SP1

MPIO 4.00.00 and 4.01.00

 

So what did Ivan find?

 

They did some deeper analysis on this issue and came to the following conclusion, which I believe is interesting enough to share.

 

On our Exchange 2010 DAG cluster we experienced disk problems. Looking at the Event Viewer, we immediately noticed a lot of MPIO errors about failing paths. Our investigation led us to look at other nodes (in different clusters but on the same hardware (HW)), ruling out any HW problems. But we also noticed lots of MPIO errors on our Hyper-V clusters. Are these related?

 

We did some Event Viewer analysis on the Hyper-V clusters and discovered that we needed to categorize all MPIO errors into two groups:

 

  • The “normal behavior” errors are caused by LUNs being unpresented from the cluster. The MPIO framework interprets the removal of a LUN as an error because it is unaware that this was a coordinated user action. We use the Hyper-V platform quite intensively, which means lots of Virtual Machine creations and deletions. Because we use a single LUN = single VM setup (the stretched Hyper-V with CLX setup), we have lots of disk manipulations in the cluster, meaning lots of MPIO errors that are false positives.

 

  • The “problematic behavior” at the MPIO level should be seen as everything that is not common. What is not common? The peaks we see in the number of 302 and 304 events on the cluster, i.e. on HYP01, HYP03 and HYP07, with HYP07 as the absolute winner. So the conclusion is that in our setup “problematic” MPIO errors are very hard to identify, and it’s against our nature to look at them as normal behavior.

The next challenge is to explain this to the System Center Operations Manager (SCOM) :-)
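One way to separate the two categories is to correlate each MPIO event with the times at which LUNs were deliberately unpresented. The sketch below is a minimal Python illustration under assumptions that are not part of the original analysis: the events are available as (timestamp, host, event ID) tuples, a change log of planned LUN removals exists, and a five-minute window is a reasonable tolerance. The names `classify_events`, `unpresent_times` and `window` are hypothetical.

```python
from datetime import datetime, timedelta


def classify_events(events, unpresent_times, window=timedelta(minutes=5)):
    """Split MPIO events into 'expected' and 'suspect' lists.

    events          -- iterable of (timestamp, host, event_id) tuples (assumed shape)
    unpresent_times -- datetimes at which a LUN was deliberately unpresented
    window          -- assumed tolerance around each planned removal
    """
    expected, suspect = [], []
    for ts, host, event_id in events:
        # An event counts as expected if it falls close to any planned removal.
        planned = any(abs(ts - t) <= window for t in unpresent_times)
        (expected if planned else suspect).append((ts, host, event_id))
    return expected, suspect


if __name__ == "__main__":
    removal = datetime(2011, 12, 1, 10, 0)  # made-up example time
    sample = [
        (removal + timedelta(minutes=2), "HYP01", 304),  # near a planned removal
        (removal + timedelta(hours=3), "HYP07", 302),    # no planned removal nearby
    ]
    expected, suspect = classify_events(sample, [removal])
    print("expected:", expected)
    print("suspect: ", suspect)
```

Only the events left in the suspect list would then need a closer look, which may also be a useful starting point for tuning monitoring alerts.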

 

EVENTID 302: An unrecoverable path failure occurred on SCSI address xxx. Disk xxx failed due to no redundant paths available

EVENTID 304: An unrecoverable path failure occurred on SCSI address xxx. Disk xxx is still accessible over redundant path

 

 

Row Labels                  Count of EventID 302 + 304
HBICLU01
  HBIHYP01.ad-cob.domain    31
  HBIHYP02.ad-cob.domain    28
  HBIHYP03.ad-cob.domain    42
  HBIHYP04.ad-cob.domain    28
  HBIHYP07.ad-cob.domain    109
  HBIHYP08.ad-cob.domain    28
HBICLU02
  HBIHYP05.ad-cob.domain    (not listed)
  HBIHYP06.ad-cob.domain    8
  HBIHYP09.ad-cob.domain    8
  HBIHYP10.ad-cob.domain    8
  HBIHYP11.ad-cob.domain    8
  HBIHYP12.ad-cob.domain    8
Grand Total                 314
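The "peaks" mentioned earlier can be made explicit by comparing each node with the other nodes in its own cluster. The Python sketch below uses the counts from the table above and flags every node whose count lies above its cluster median. The median threshold and the names `COUNTS` and `flag_peaks` are assumptions made for illustration, not the method used in the original analysis. HBIHYP05 is left out because the table lists no count for it.

```python
from statistics import median

# Event 302 + 304 counts per node, copied from the table above.
# HBIHYP05 is omitted because the table lists no count for it.
COUNTS = {
    "HBICLU01": {
        "HBIHYP01": 31, "HBIHYP02": 28, "HBIHYP03": 42,
        "HBIHYP04": 28, "HBIHYP07": 109, "HBIHYP08": 28,
    },
    "HBICLU02": {
        "HBIHYP06": 8, "HBIHYP09": 8, "HBIHYP10": 8,
        "HBIHYP11": 8, "HBIHYP12": 8,
    },
}


def flag_peaks(counts):
    """Return, per cluster, the nodes whose count exceeds the cluster median.

    The 'above median' rule is an assumed heuristic; nodes are sorted with
    the highest count first.
    """
    peaks = {}
    for cluster, nodes in counts.items():
        baseline = median(nodes.values())
        above = [(node, n) for node, n in nodes.items() if n > baseline]
        peaks[cluster] = sorted(above, key=lambda item: item[1], reverse=True)
    return peaks


if __name__ == "__main__":
    for cluster, nodes in flag_peaks(COUNTS).items():
        print(cluster, nodes)
```

With these numbers the rule flags HYP07, HYP03 and HYP01 on HBICLU01 and nothing on HBICLU02, which matches the nodes singled out in the text. A different baseline could be more appropriate on clusters with fewer nodes.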

 

 

Just sharing some field info and hope it helps when you are looking at MultiPath I/O issues. Any comments or suggestions on this topic?


Comments
by VIjyant (anon) on 01-17-2012 07:37 AM

Good finding, surely that's going to add much value moving forward.

Regards

Vijyant
