Around the Storage Block Blog
Find out about all things data storage from Around the Storage Block at HP Communities.

Making Sense of WAFL Part 4

By Karl Dohm, HP Storage Architect

 

To recap, in this series of posts we are exploring some of the limitations of WAFL, and specifically how those limitations manifest themselves to an average user of the FAS.  For my previous posts, see Part 1, Part 2, and Part 3.

 

I had made the assertion in the original post that some of the competitive disadvantages on the FAS involved throughput, capacity utilization, and ease of management.  I had just started on the throughput part of the discussion when Konstantinos Roussos of NetApp posted that I was off the mark and cited the Avanade paper as one piece of evidence to prove it.  So in the interest of fairness, I decided to try and understand what the Avanade paper was trying to say.

 

One of the initial tests described in the paper was rather simple to set up, and it was described by the experts at Avanade as one to "assess the overall performance of the FAS3050".  I'm always looking for something relatively simple to set up that assesses the overall performance of an array, so this captured my interest.  Since I didn't come up with the test, there shouldn't be any suspicion that I somehow biased the selection to make NetApp look artificially unfavorable.  Hopefully we can agree that this test, while not perfect, has value and was arrived at in a fair manner.

 

My various recent blog posts have focused on trying to recreate the test result in the Avanade paper outside of the Avanade environment.  The main purpose of this iteration, at least for me, is to arrive at a simple and fair basis for comparison.

 

We still have some differences in how we run the tests, but for the most part these shouldn't matter much.  Patrick's first and second posts on the Avanade advisor blog describe some test details that augment the original Avanade paper.  Here the test is run with a 4Gb FC SAN connected to a 32-bit Windows 2003 Server on a physical machine.  Patrick has evolved the test a bit, as he is running on a 64-bit Windows 2008 Hyper-V virtual machine using the iSCSI stack.  It's not practical to switch over to the environment he used, so I have to assume the differences are not significant to the outcome.  I think there is some value, though, in minimizing variables and staying with a simpler stack between IOmeter and the LUN.

 

Everything else I can think of is being matched up, including a spindle count of 20 in the aggregate (root volume in a different aggregate), raid group size of 20, rotational latency, raid type of LUN, IOmeter profile, IOmeter file size and so on.  See the end of the post for more detailed configuration data.

 

Since the FAS3050 model used in the original test is no longer a current product, Patrick suggested using a FAS3070 or a FAS2050.   It turns out that's good advice because, even with the new info, I could not get the 3050 results to come close to the Avanade results.  I expected the same outcome on the FAS2050, but was in for a surprise.

 

It turns out that I get better throughput on the FAS2050 than Patrick did.  For those of you thinking I'm spinning data, please open your minds and read this carefully. 

 

For the LUN with no fragmentation, in our environment the FAS2050 runs at about 4860 IOPs and 53 MB/s, which are average values taken over the first 20 minutes after full reallocation.  Patrick's results maxed out at about 3460 IOPs and 39 MB/s, which he said was also against the LUN with no fragmentation.  My result was about 40% faster in IOPs.  So if you took that data by itself, this would be a great endorsement of the FAS by a competitor.

 

But there is more to the story of course.  This is a random workload, and as the LUN fragments the throughput degrades.  I ran 14x 20 minute segments of this load (280 minutes, or about 4.7 hours) after an initial reallocate and the results for the FAS2050 are shown below.

 

20 min segment    IOps    MBps    Avg Resp Time (msec)
 1                4864    52.7    30.8
 2                4494    48.6    33.4
 3                4296    46.5    34.9
 4                4141    44.8    36.2
 5                4014    43.4    37.3
 6                3896    42.2    38.5
 7                3812    41.2    39.3
 8                3714    40.2    40.4
 9                3676    39.8    40.8
10                3616    39.1    41.5
11                3576    38.7    41.9
12                3533    38.2    42.4
13                3499    37.9    42.9
14                3449    37.3    43.5
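To put a number on the degradation, here is a minimal Python sketch that works only from the IOps column of the table above. The variable and function names are illustrative and not part of any test tooling.

```python
# IOps per 20-minute segment, copied from the FAS2050 table above.
iops = [4864, 4494, 4296, 4141, 4014, 3896, 3812,
        3714, 3676, 3616, 3576, 3533, 3499, 3449]


def percent_drop(first: float, last: float) -> float:
    """Percentage decline going from `first` to `last`."""
    return (first - last) / first * 100.0


# Overall decline from the freshly reallocated LUN to segment 14.
total_drop = percent_drop(iops[0], iops[-1])

# IOps lost between each pair of consecutive segments.
step_losses = [a - b for a, b in zip(iops, iops[1:])]

print(f"Total drop over 14 segments: {total_drop:.1f}%")
print(f"Loss after segment 1: {step_losses[0]} IOps")
print(f"Loss after segment 13: {step_losses[-1]} IOps")
```

The per-segment losses shrink over time but are still positive at the end of the run, which is consistent with the LUN not yet having reached a steady state.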

 

The results Patrick reported were closer to where the FAS ended up after fragmentation took its toll.  So I'm guessing he may not have reallocated his volume in between each IOmeter run, meaning the LUN was in fact fragmented when he got to the higher thread counts.  Anyway, his data matches nearly perfectly with my post-fragmentation data in the 14th segment:  3460 IOPs at Avanade vs about 3450 here.

 

Note that I ran 150 threads in a single IOmeter worker, which was what it took to get an approximate 30 msec average response time at the start of the test.  A 30 msec average is a fairly standard target in the industry.
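The thread count, response time, and IOPs figures can be cross-checked against each other with Little's Law (throughput = outstanding I/Os / response time). The sketch below assumes each of the 150 threads keeps exactly one I/O outstanding and that the response times in the table are in milliseconds; both are assumptions on my part, and the function name is made up for illustration.

```python
def expected_iops(outstanding_ios: int, avg_response_ms: float) -> float:
    """Little's Law: throughput = concurrency / latency (latency in seconds)."""
    return outstanding_ios / (avg_response_ms / 1000.0)


THREADS = 150  # assumed: one outstanding I/O per IOmeter thread

# (reported IOps, reported avg response time in msec) for segments 1 and 14
for reported, resp_ms in [(4864, 30.8), (3449, 43.5)]:
    estimate = expected_iops(THREADS, resp_ms)
    print(f"{resp_ms} msec -> ~{estimate:.0f} IOps expected, {reported} reported")
```

Under those assumptions the estimates land within a few IOps of the reported values, so the three columns of the table are at least internally consistent.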

 

Let's compare these results to the EVA4400.  My understanding is that these two arrays (FAS2050 and EVA4400) are often sold against each other in competitive situations, so it makes sense to see which array has an edge.

 

IO Test

 

The EVA4400 runs this same test with a throughput of nearly 6000 IOPs and 64MB/s, which doesn't degrade over the course of the 14x 20 minute segments.  The EVA LUN doesn't fragment, so the throughput remains roughly the same in each 20 minute segment. 

 

Summarizing, the FAS LUN throughput is 81% of the EVA LUN when it is not fragmented and 58% of the throughput when it is fragmented.  Since the FAS LUN can't avoid fragmenting with this load, the 58% figure is the more relevant one for a typical situation.   Note that the FAS LUN has not yet settled to a steady state after these 14 segments, so its numbers will be somewhat lower if this test runs longer.  
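For anyone who wants to check the percentages, here is a small Python sketch of the arithmetic. It assumes a flat 6000 IOPs for the EVA, whereas the measured figure was only described as "nearly 6000", so treat the EVA number as an upper bound; the names are illustrative.

```python
EVA_IOPS = 6000            # assumed upper bound; the post says "nearly 6000"
FAS_FRESH_IOPS = 4864      # segment 1, right after reallocation
FAS_FRAGMENTED_IOPS = 3449 # segment 14


def relative_throughput(fas_iops: float, eva_iops: float) -> float:
    """FAS throughput expressed as a percentage of EVA throughput."""
    return fas_iops / eva_iops * 100.0


fresh_pct = relative_throughput(FAS_FRESH_IOPS, EVA_IOPS)
fragmented_pct = relative_throughput(FAS_FRAGMENTED_IOPS, EVA_IOPS)

print(f"Unfragmented: {fresh_pct:.1f}% of EVA")
print(f"Fragmented:   {fragmented_pct:.1f}% of EVA")
```

With exactly 6000 in the denominator the fragmented ratio comes out a little under 58%; the 58% quoted above corresponds to an EVA figure slightly below 6000.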

 

OK, that's a lot to digest, and given the outcome there will certainly be criticisms.  But it's a relatively simple test that can be recreated by anyone out there; it's a test that I didn't choose; the test is said by a Windows integration expert to be a measure of overall array performance; and it's pretty clear that there is an EVA advantage.

 

By the way, in case it's thought that the EVA just got lucky with this choice of test, I'm open to trying any other IOmeter access specification that anyone wants to run, as long as the test is at least somewhat relevant to customer workloads and stays reasonably simple to set up and understand.

 

Finally, it deserves mentioning that this thread has a tie to the associated topic of cost and capacity utilization.  This is mostly a topic for another day, but the EVA can be deployed without any additional spindles over the 20 used in this test.  The FAS, if running in a cluster (which gives redundancy comparable to the EVA), would need at least one additional global spare spindle on the a side of the cluster, three additional spindles to hold the root volume of the b side of the cluster, and one global spare for the b side of the cluster.  That's an overhead of five additional spindles to support an array with 20 data spindles.

 

You may be able to sense where I'm going with this.  Five spindles incur roughly an additional $5000 in purchase price for the drives alone, and with those five you would need additional drive shelves, cabling, rack space, power, and cooling.  Not sure how many dollars all that adds up to, but some would call that a significant additional cost burden for an array that provides 58% of the throughput.
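As a rough illustration of the spindle overhead argument, the sketch below uses only the figures quoted in this post: 20 data spindles, 5 extra FAS spindles, roughly $5000 for those drives, the segment 14 FAS result, and a flat 6000 IOPs for the EVA (an assumed upper bound, since the measured value was "nearly 6000"). The names are illustrative, and the per-spindle figures are back-of-envelope numbers rather than measured results.

```python
DATA_SPINDLES = 20          # spindles in the tested aggregate / disk group
EXTRA_FAS_SPINDLES = 5      # spares plus b-side root volume, as counted above
EXTRA_COST_USD = 5000       # rough drive-only cost quoted above
FAS_FRAGMENTED_IOPS = 3449  # segment 14 of the FAS2050 run
EVA_IOPS = 6000             # assumed upper bound ("nearly 6000")

fas_total_spindles = DATA_SPINDLES + EXTRA_FAS_SPINDLES
overhead_pct = EXTRA_FAS_SPINDLES / DATA_SPINDLES * 100.0
cost_per_extra_spindle = EXTRA_COST_USD / EXTRA_FAS_SPINDLES

# Throughput spread over every spindle that has to be purchased.
fas_iops_per_spindle = FAS_FRAGMENTED_IOPS / fas_total_spindles
eva_iops_per_spindle = EVA_IOPS / DATA_SPINDLES

print(f"FAS spindles purchased: {fas_total_spindles} ({overhead_pct:.0f}% overhead)")
print(f"Cost per extra spindle: ${cost_per_extra_spindle:.0f}")
print(f"IOPs per purchased spindle: FAS {fas_iops_per_spindle:.0f}, EVA {eva_iops_per_spindle:.0f}")
```

This ignores shelves, cabling, power, and cooling, so it only frames the drive count side of the comparison.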

 

Additional configuration details for runs done here

 

FAS2050

 


  • 20 spindle aggregate containing 1x 1TB volume
  • 1TB volume containing 1x 1TB LUN
  • 2 front end FC ports per FAS filer
  • 150 threads in IOmeter (~30 msec response time at start of test)
  • MPIO policy round robin
  • Ontap 7.2.4L1

 

EVA4400

 


  • 20 spindle disk group containing 1x 500GB LUN
  • 170 threads in IOmeter (~30 msec response time)
  • 2 front end FC ports per controller
  • 09004000 firmware

 

Common

 


  • 15K 144GB drives
  • IOmeter writes to a 100GB file placed in the lowest LBAs of the LUN
  • IOmeter access specification as in Avanade post
  • Emulex dual ported HBA
  • Emulex Queue depth = 254, Queue target=0
  • Windows 2003 32-bit server
  • MPIO and vendor specific DSM - SQST for EVA and RR for FAS
  • 4Gb FC SAN
  • Proliant DL380-G4

(Editor's note: Fixed broken links resulting from moving to a new blog platform - no content of the post was changed.  14 April 2011)

Labels: EVA| NetApp| storage| WAFL
Comments
Anonymous(anon) | ‎12-05-2008 07:01 PM

Calvin,

You know they're gonna ask. So why a 500GB LUN and not 1TB? And what's the EVA raid level?

Anonymous(anon) | ‎12-06-2008 01:21 AM

Hi Cleanur - I only posted this entry for Karl so I don't know the answer to your question, but I will ask Karl to respond.  Thanks and stay tuned for Karl's answer.  Calvin

Anonymous(anon) | ‎12-08-2008 10:28 PM

On the HA configuration; the requirement is for 1 global spare across both systems, and a RAID-4 root volume of 2 disks. Total=3 extra drives. The root disk requirement can be avoided by having the root in a normal aggregate, so that reduces it down to 1 global spare. That's a lot less than the 2 to 4 times the smallest disk size required by VRAID.

Anonymous(anon) | ‎12-09-2008 04:56 AM

Alex,

You might want to check out your own best practices for sparing

partners.netapp.com/.../storage_resiliency.html

media.netapp.com/.../tr-3437.pdf

# of Shelves    # of Disks    Recom Spares
     2               28            2
     6               84            2
     8              112            3
    12              168            3
    24              336            4
    36              504            6
    72            1,008           12

Note that if you are using NetApp Maintenance Center, you will need a minimum of two spare drives of each type in your system. Maintenance Center performs proactive health monitoring of disk drives and, when certain event thresholds are reached, it attempts preventive maintenance on the suspect disk drive. Two spare disks are required before a suspect disk drive can enter Maintenance Center for diagnostics.

Tr3437 Page 9

A minimum of two hot spare disks for each disk type is required before a suspect disk can be placed in MC to undergo diagnostic testing.

NetApp recommends provisioning storage with enough hot spares to utilize the MC self-healing functionality.

Anonymous(anon) | ‎12-10-2008 05:42 AM

Alex,

Apologies as my last comment may have come over as a little terse. The point I intended to make, and failed miserably, was that all of the vendors can come up with absolute minimum specs to meet a specific target configuration or to hammer home a particular point. But doing so, I think, devalues the discussion, especially for customers and the uninitiated of the storage industry. In an ideal world we'd see real fit-for-purpose configurations from all vendors based on referenceable published best practice. I realise that's unlikely to happen any time soon in the competitive blogosphere, but there should be some absolute minimums presented as workable to avoid selling customers short, especially around availability.

BTW I work for a VAR who deals directly with the vast majority of the storage vendors, so I do get to see the real world configurations, warts and all in many cases.

Cheers

Cleanur

Anonymous(anon) | ‎12-10-2008 11:00 PM

I stand corrected for best practise on spares; yes, maintenance center pops drives in and out of service if the drive appears to be failing to do out-of-service checks before either declaring the disk dead or returning it to service. Without maintenance center, the drive is just declared dead and the rebuild starts on the spare.

For small systems, the spare overhead is 10% at 20 disks, decreasing to 2% at 100 disks. For HP's VRAID-5, the overhead is 2 (protection level 1) to 4 (level 2) disks' worth, which I make 10% (20%) at 20 disks down to 2% (4%) at 100 disks. The same or worse, in other words.

Anonymous(anon) | ‎12-11-2008 12:11 AM

@cleanur; that's fine, terseness is to be preferred over verbosity.

I agree with all your points, and NetApp's sizing tools do not recommend anything other than best practice.

What frustrates me (and is definitely coming across in my replies here) is the constant misrepresentation of NetApp systems. HP and EMC seem to be working in concert recently; to wit, this pointless attack as an example. feeds.feedburner.com/.../netapp-s-shining-moment-its-capacity-guarantee-program.aspx.

I would much prefer to see Jim Haberkorn and Karl Dohm present some interesting information about the performance, capacity, pricing and suitability of their storage for specific applications and workloads. Enough of knocking NetApp. I mean, this is the 4th blog in as many weeks about WAFL, and Jim and Karl seem to have a collective bee in their bonnets about describing how our kit works. Great, visit the people that know;

blogs.netapp.com/extensible_netapp

blogs.netapp.com/shadeofblue

How about a bit of transparency about HP's storage for a change?

Anonymous(anon) | ‎12-17-2008 08:27 AM

Answers to some questions about the original post, and some comments:


EVA RAID level is Raid-5


500GB LUN vs 1TB LUN won't make a difference on the EVA.  In the test described by Avanade, writes are only being performed to the lowest 100GB of the LUN's logical block address space.  I’m not sure why they chose to use a LUN that is 10% filled (?), but it would probably have made most sense to just use a 100GB LUN.  The main point though is that in both cases, the LUN is a lot bigger than the 100GB file seeing the activity.  I can say this doesn't matter on the EVA but I'm not sure about the FAS.


Concerning the capacity questions, root volume is by default on an aggregate that is raid-dp.  That’s how it ships – with 3 spindles used per filer.  Raid-dp is said to be the best practice for every aggregate, so why would a regular customer ever decide to mess with the root volume and put it on an aggregate that is raid-4?  


Also, from everything I can gather, there is no way to avoid having at least one spare on each side of the cluster – at least with the options that are obvious.  Fileview complains quite loudly when one side of the cluster loses its spare and the other side still has one.  So I don't see any data yet that disputes needing 5 extra spindles on the 2050 which sit idle and see no activity.


As far as being transparent, we have posted EVA4400 results here for a test that we didn't even pick - it was taken out of a NetApp blog post.  That's being pretty open.


We are still waiting for someone to challenge the EVA4400 to a different iometer test.  Can be EMC, NetApp, IBM, or any other array in the same price band with the same spindle count.  I might suggest a sequential read test across a 500GB LUN striped across 40 spindles?

Anonymous(anon) | ‎12-18-2008 02:48 AM

I think there is no issue with putting the root volume and data volumes on the SAME aggregate, as stated in this TR media.netapp.com/.../tr-3437.pdf: "The root volume can exist either as the traditional standalone two- or three-disk volume or as a FlexVol® volume that is part of a larger hosting aggregate."

Especially on smaller systems (like the FAS2000 series) this will be effective: "In practice, having the root volume on a FlexVol volume makes a bigger difference with smaller capacity storage systems compared to very large ones, where two dedicated disks for the root volume have little impact."

This way the spindles can simply be shared and there is no need to add another extra aggregate and "waste" two more spindles on parity, and simultaneously increasing performance because more spindles are actively used. Also the root volume will be dynamically allocatable, so no need to waste a whole bunch of GBs on dedicated disks.

The fact it is a default "out of the box" setup doesn't mean you can't simply ADD disks to the existing aggregate versus creating a new aggregate. Adding disks to this small "just get me going" aggregate is pretty common practice and fully supported AFAIK.

So the way you seem to have provisioned storage on this relatively small system isn't very effective (or required), yielding lower capacity and performance.

On this basis I don't think this is a very fair or reasonable capacity/performance comparison.

Anonymous(anon) | ‎12-19-2008 12:39 AM

Hi Sjon,

Thanks for the input.

I'll agree that this isn't an optimized capacity configuration from a NetApp perspective, and you can do quite a bit to improve capacity utilization if you have more spindles to work with.  But the point I was trying to make is that the configuration which was chosen by Avanade - a 20 spindle aggregate with a 1 TB LUN - would produce poor capacity utilization compared to EVA - if that 20 spindle aggregate is all you were trying to have in your configuration.

Now concerning root volume.  I gave NetApp every possible advantage here.  On the a side of the cluster I didn't combine the root volume into the 20 spindle aggregate used for the 1TB LUN.  If the root volume has any impact on throughput, I eliminated that from these results by not including it in that aggregate.

However, when it came to spindle count, I didn't penalize them for this configuration.  The 5 extra spindles consist of 2 spares, 1 on each side of the cluster, and 3 spindles to hold the aggr0 and root volume on the b side.  The b side isn't used in the test; it's just there to have failover resiliency equivalent to any dual controller array in the industry.  It wouldn't quite be fair to compare a single controller FAS to a dual controller EVA.  Notice the 5 spindles I counted up did not include the 3 spindles on the a side which were off holding that isolated root volume.  If by chance the root volume does impact performance (makes sense that it might since it has to see activity), then the FAS numbers published here might be artificially elevated for a total spindle count of 25.  I actually needed 28 spindles.

NetApp clustered configurations are complex.  The notion of having spindles owned exclusively by a single filer in the cluster causes a bunch of cascaded management and efficiency issues.  So hopefully it's understandable that it takes more than a few words to communicate the setup.

Anonymous(anon) | ‎07-28-2009 11:55 AM

Karl,

  on the face of it I think you've been reasonably fair, or at least have tried to be, though there are some criticisms I'd make which are understandable given your relative inexperience with NetApp kit.

1. I don't think I would have used three spindles for the dedicated root aggregate on the passive controller; two disks in RAID-4 plus a hot spare is sufficient for those kinds of configs.

2. Root volume performance impact is negligible. You really shouldn't put the root volume in a dedicated aggregate on a system with a small number of spindles if performance is your primary goal; you're better off including the root volume's spindles in the aggregate.

3. Typically the 2000 series are pitched at the same market space as the MSAs; the 3100 series is our midrange where the EVA usually plays. The bottom end of our mid-range is the 3140. The old 3050 was the middle of the mid-range, now replaced by the 3160, hence the 3140 is the machine you should be comparing against. If you're seeing a lot of competition for the EVA4400 from the FAS2050, then it might be because that's what our sizers say is the appropriate platform for that customer's workloads.

4. I've only ever once seen someone configure a 2000 series in a completely active/passive configuration, in most cases customers run more than one workload against the array and balance the workloads and the aggregates across the controllers. This doubles the memory, CPU, NVRAM back end bandwidth etc. A better test would have involved a balanced workload across both controllers from each vendor.

5. If the workload is genuinely random (unlike most customer workloads, which are not truly random, merely "non sequential"), consider the recommendations in NetApp's technical report that discusses benchmarking and tuning for high performance - tr-3647.

6. Pushing a single synthetic workload at one controller in a pair might be simple, and easily reproduced, but it's not illustrative of real world performance, which is what we optimise OnTAP for. Try running up an SPC-1 workload or an IOMeter equivalent which runs a mixture of sequential and random workloads across a number of LUNs instead. Heck, even try doing it while running snapshots; I'm sure the results would be enlightening.

On a final note, overall the hot spare issue seems to be a wash between vendors. I've seen configs where the EVA looks horrible (remember that bogus EMC post :-)); on the other hand, I'll grant you, having a minimum of 2 hot spares on any configuration can penalise us at the low end. Though I've also seen customers create a single RAID-DP aggregate on each controller and trust dual parity protection. It's not best practice, but from a risk perspective it's perfectly acceptable for many customers.

Anonymous(anon) | ‎07-29-2009 12:40 AM

Thanks for the comments John... it's great to have an answer from NetApp that we can work with.

I've been missing from the blog for a while due to being dominated by some EVA development, so I apologize for the overall silence on this thread.  There are many topics left to discuss, and I want to get to them.

For comparison, I think there is a good argument for using synthetic workloads that are relatively simple to recreate.  My thought is - if your array is good at ESRP or SPC or in "real world apps", it will show up in some angle of iometer testing.   It has to.  iometer can focus on any pattern and look at it in isolation.  If the FAS is good at random writes for example, this should show up in iometer.  No?

I'll admit I did not know about the FAS turbo switch for block I/Os.  Maybe you could say a few words about what it does.  

In any of the arrays I have tested, I have opted not to tune the array to an individual synthetic workload, since whatever workload we try, it's only part of the bigger picture.

It's sort of like taking a race car and tuning it to only make left hand turns on an oval track.  You will get better results, but it doesn't relate well to what you will get with real world driving.  A better test would be to take the car tuned for road driving and put it on that same oval track, then use this as one of many tests for comparison.

I've taken the tack that the array should be tuned for a generic workload, and in many cases this means using factory set defaults.  Who better to tune an array to the average expected workload than the engineers from the company that built it?

Synthetic (iometer) point tests can be used to show how that array can do with each type of load in isolation.  Then I think it's reasonable to draw a comparison conclusion by looking at all of the test results in aggregate.

What I have found with the FAS2050 and, more recently, the CX4-120 is that for configurations with the same spindle count it's hard to find any iometer point tests that the 4400 doesn't win.

I don't follow the leap that is typical of the iometer testing criticisms - that somehow you can lose all/most of the iometer point tests, yet still be better when running real world apps.

I'll agree with you in that we probably should be positioning the 4400 against higher end models in both the NetApp and CX4 line.  At least when throughput is the comparison.  But hey, I'm not marketing, I'm from engineering.

Karl.


Anonymous(anon) | ‎07-30-2009 08:33 PM

Karl,

   A few comments

1. I checked the specs on the 4400, and it looks more entry level than I expected, so I think the 2050 is probably the one to be testing against. I'm not sure what the best comparison is to the 3040 as there have been a few changes to the EVA lineup recently.

2. My knowledge of OnTap code doesn't go deep enough to make authoritative statements about exactly how each option works. If Steve Daniel says it should be turned off for SAN only loads, that's pretty much good enough for me. My conjecture is that it simplifies the code paths by eliminating the need to check and fairly schedule resources for contending NAS workloads.

3. Most of our customers do in fact run our controllers with both SAN and NAS workloads, but when we're compared against an array that does only SAN, it seems fair enough to tune it as a SAN only workload. To extend your car analogy, it's a little like taking off the factory fitted roof racks before taking it around the track to improve the aerodynamics.

4. Synthetic workloads in isolation lend themselves to non-typical results (e.g. using only one controller in a pair). Also, the random I/Os generated by things like IOmeter are genuinely random and hence much harder to predict than the "random" I/O generated by applications, which isn't really random, it's just "non-sequential". For this reason I think a better test would be something like Orion or Jetstress, or better still a combination of both along with an IOZone generated filesystem benchmark all running simultaneously, maybe even throw Vmark in for good measure.  I'd be happy to trade thoughts with you or anyone else from customer or vendor land as to what might create an easily reproducible set of workloads that is both realistic and useful.

5. Did you do your testing using the exact same workload mentioned in Avanade's blog? If so, remember what he said: "this configuration is in no way representative of any type of Exchange workload (5.5, 2000, 2003, 2007). It is more of a wide-distribution of requests intended mimic file server workloads". The advantage this workload has is that it's been around for a while, and nobody could accuse him of cherry picking something that made NetApp look particularly good. There are in fact some settings in that workload (primarily the random 512 byte sector alignment) that you don't see in well configured systems today and which tend to work against a lot of our optimisation techniques.

6. As to your question about why only use 100GB of a 1TB LUN, you'd have to ask Pat Cimprich who did the original setup, but having done a little bit of work setting up a similar test harness, I suspect it's because it takes IOMeter a decent chunk of time to prepare the disks; anything much larger would take too long and be too hard to replicate. Using raw disks might have been easier, though with a larger working set size the results may have been quite different for all vendors.

As I said before, I'd be happy to share my thoughts on what it would take to create a viable customer runnable set of benchmarks, or at least discuss the pros and cons of the various approaches, if anyone is interested.

Anonymous(anon) | ‎08-06-2009 06:23 AM

Hi Karl,

I just wanted to post some clarifications here as to some of the settings I originally used in our tests as they've come up numerous times on these blogs.

Why only a 100GB Iometer file on a 1TB LUN? John Martin nailed it on the head - I'm lazy and didn't want to wait the year or so it would have taken Iometer to complete the initialization of a 1TB test file.

Discussions around the NetApp config not being optimized. Yep - 100% agreed. The reality was space efficiency was not a criterion in that original Iometer test I did. Not sure if I've mentioned this in other posts or not, but the reason I went with 20 drives in the first place was because I was comparing perf to another storage array and I only had 20 drives available in that system... I wanted two apples so went with 20.

Performance differences on the 2050 between our tests. Your peak perf on the 2050 netted you 52.7 MB/s and mine topped out at 39.1 MB/s. I noted this in my test results, but the CPU on my 2050 was pegged at the upper end of my scale. I'm just speculating here, but you used Fibre Channel and I used iSCSI to access the 2050... We'd need a NetApp engineer to validate, but I suspect the CPU load associated with iSCSI activity could have contributed to my lower throughput. The 2050 I have doesn't have Fibre Channel so I unfortunately can't go back and confirm this by retesting.

You also used MPIO which I did not. Hypothetically you could have netted better throughput with multiple paths; could be a contributing factor.

For the record - we did run all sorts of other tests as part of the testing for that whitepaper - notably JetStress and Loadsim. We also ran tests while doing snapshots. I agree with Karl though - setting up Iometer tests is far easier (which is why I was able to readily repeat some of these tests 2 years later on nights and weekends with minimal work). I do like more robust tests for real-world scenarios, but for quick tests, Iometer is my friend.

Pat

About the Author
25+ years experience around HP Storage. The go-to guy for news and views on all things storage..


The opinions expressed above are the personal opinions of the authors, not of HP. By using this site, you accept the Terms of Use and Rules of Participation