Around the Storage Block Blog
Find out about all things data storage from Around the Storage Block at HP Communities.

NetApp buys Engenio - what unified couldn't do diversified will!

There's too much to include in a short summary of this post on the news that NetApp will buy Engenio.  To read the article, click on the article title or the "read more..." link below.  If you're reading this from an RSS reader, please read the full post on my blog.

Labels: Engenio| NetApp| storage| WAFL

The top Around the Storage blog posts

I have several lists of "best of" Around the Storage blog. I was motivated to do this because of our migration to the new blog platform and some initial challenges we had with URLs to the old posts on the other platform.

Understanding FAS ESRP Results

By Karl Dohm, Storage Architect


Welcome back to the next in a series of posts where we take a closer look at NetApp and its FAS series of storage arrays.   The discussion topic today is Microsoft's Exchange Solution Reviewed Program (ESRP) and its tie to FAS throughput.


The FAS has some controversial history with regard to performance. From time to time the issue comes up, and in response NetApp has generally denied the problems exist. Often we find the opposite stance in posts from NetApp lauding their performance, for example in Kostadis Roussos' post where he refers to WAFL write performance as 'surreal'. But, as I have said in previous posts, there are some justifiable reasons this controversial subject keeps surfacing.


First of all, let's touch on why an average storage consumer should care about array throughput. An array with better throughput, i.e. the ability to service more I/Os from a given set of spindles, can result in requiring less hardware to do the same job. The bigger the throughput difference, the more to be saved on purchase price, warranty cost, power consumption, floor space, cooling, etc. Array throughput statistics can be meaningful when evaluating value in a storage array. It seems NetApp also finds this array attribute important given the number of blog posts and papers they have on the topic of performance.


Recently in the comments section of a blog post on understanding WAFL, NetApp's John Martin and I had a small debate as to whether a synthetic load generator like IOMeter could be used to characterize how an array will perform in production scenarios.  I made the argument that this type of tool can be used to circle the wagons around the I/O characteristics of a real world application, and that through multiple point tests of the load components of the application one could get a reasonable assessment of how well the box will behave.  John's opinion was more along the line that synthetic workload tests were not suitable to provide an indication of how well an array would run with a production application ("Synthetic workloads in isolation lead to non typical results").  He referenced Jetstress as a more accurate indicator.


I took his cue and had a look at the FAS2050 ESRP results paper, which describes the MS Exchange-like throughput of the FAS2050 array. Even though ESRP isn't intended to be a benchmark, a scan of ESRP results tells me that many vendors seem to use the forum of ESRP as a way to post throughput results relative to how their array handles MS Exchange load. It kind of makes sense since there seems to be no Exchange related benchmark out there, and ESRP is the closest controlled thing the industry has to work with.


The NetApp ESRP paper provides insight into how NetApp would recommend setting up the 2050 for Exchange loads, and it shows throughput results in a heavily loaded 10,000 mailbox Jetstress test.  This paper sparked our interest because the described results seemed good and did not correlate with results from synthetic load generators that produce a similar pattern as Jetstress.  Maybe John was right.  We decided to peel back the onion a bit and take a look under the covers of this ESRP test to figure out what was going on. 


We happened to have access to a FAS2050 and decided to try and recreate the ESRP results as published.  It turns out that the IOPs value that NetApp published was in fact roughly re-creatable given the data in the paper.  On the surface this can be viewed as NetApp having made an honest submission to ESRP, and within the letter of the law one could reasonably argue that they did.  But we also learned that NetApp found a way to make their results come across as favorably as possible, meaning the results have little relevance as to how well the FAS will run MS Exchange.  


After a rather lengthy setup experience, we finally configured the aggregates, volumes, servers, LUNs, MPIO, and HBA attributes as described in the ESRP paper.   We even set the diagnostic switch "wafl_downgrade_target" to a value of 0 in accordance with the recommendations in the paper.  


One might ask, as we did, what does "wafl_downgrade_target" do? In its TR-3647 paper, NetApp describes the switch as follows: "The "downgrade_target" command changes the priority of a process within Data ONTAP that handles incoming SCSI requests. This process is used by both FC SAN and iSCSI. If your system is not also running NAS workloads, then this priority shift improves response time."


I think this description is telling us that the NAS process consumes bandwidth when there is no NAS work to do. Also, given the NetApp messaging around unified storage architecture, a recommendation to use this switch seems like a bit of a contradiction. Would you consider it normal to be asked to set a switch that generates the following response? "Warning: These diagnostic commands are for use by NetWork Appliance personnel only". Last but not least, this switch resets itself if the array reboots. I'll leave it to the audience to draw their own conclusions as to whether use of this switch is truly a recommended practice in customer environments.


Once the array was freshly initialized and everything was set up, we ran the test and observed an average of roughly 2200 database disk transfers/second per host. Within noise levels, this recreated the results as posted in their ESRP paper.


The main problem we have with how NetApp did this testing is that after the initial run, every time this test is run it runs slower than the previous time it ran.  The 2nd run showed results of approximately 1980 transfers/second per server, about an 11% drop.  By the fifth run throughput had dropped to approximately 1555 transfers/second per server - a 30% drop.  After a couple more runs we were down to 1450, 34% slower than the first run. 
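The drop percentages quoted above are simple arithmetic on the per-server transfer rates. Here is a minimal Python sketch of that calculation. The 2200 baseline is an assumption taken from the "roughly 2200" first-run figure; since the post rounds its numbers, the output lands within about a point of the 11%, 30% and 34% quoted (roughly 10%, 29% and 34%).

```python
def percent_drop(baseline, later):
    """Percentage by which `later` falls below `baseline`."""
    return (baseline - later) / baseline * 100.0


# Assumption: first-run baseline of "roughly 2200" transfers/second per server.
BASELINE = 2200.0

# Later runs, as quoted in the post (transfers/second per server).
later_runs = {
    "2nd run": 1980.0,
    "5th run": 1555.0,
    "a couple more runs": 1450.0,
}

for label, rate in later_runs.items():
    print(f"{label}: {percent_drop(BASELINE, rate):.1f}% below the first run")
```

Swapping in the published 2220 figure as the baseline moves each result up by well under a point.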


I didn't have the patience to run enough times to figure out where this decay curve flattens out. 


At this point I decided to run a "reallocate measure" against one of the database LUNs, and the FAS reported the value to be 17. According to the NetApp man page for Reallocate Measure: "The threshold when a LUN, file, or volume is considered unoptimized enough that a reallocation should be performed is given as a number from 3 (moderately optimized) to 10 (very unoptimized)". Allow me to translate - the database LUNs are very fragmented. For those who might be confused by the use of the word fragmentation in this context, this is not NTFS fragmentation - it's WAFL fragmentation.


Now things were starting to make sense. We were seeing the same sort of decay curve as shown in the IOMeter results posted in Making Sense of WAFL - Part 4. Every time the test is run, the random component of the Jetstress database accesses fragments the LUN further and the throughput numbers get worse. An array like EMC CX or HP EVA won't undergo this sort of decay curve since these arrays do not have internal WAFL-fragmentation problems like the FAS does.


That's not all. After the throughput test, Jetstress executes a checksum test of the databases to be sure the array did not corrupt any data. After a few runs I noticed an interesting pattern. On the FAS, the length of time needed for the checksum calculation also degraded as the database LUNs went through their WAFL-fragmentation. When the LUNs were fresh and defragmented, the checksum calculation took about 2 hours. By the fifth run, when the database LUNs had a WAFL-fragmentation measure of 17, the checksum calculation took over 10 hours - a 250% slowdown. To summarize, we saw a 34% slowdown on database throughput and a 250% slowdown on checksum calculation by just letting the ESRP test run for about 48 hours before taking measurements.


So, drawing this to a close, I think there is a reasonable argument that NetApp should have results more like 1450 (or less) disk transfers/second/host as opposed to the 2220 transfers/second/host they did post. Most would expect that results in a test as visible as ESRP are measured after a reasonable burn-in period. After all, when someone runs MS Exchange, they usually run it for longer than 2 hours.



Labels: NetApp| storage| WAFL

Spock says: It is illogical to take offense

By Jim Haberkorn


I had hoped that my last post in regards to NetApp performance claims would end gracefully, as a courageous NetApp employee has apparently now agreed to work with us to find out what we may be doing wrong, if anything, to be getting such poor performance out of our NetApp filer. FYI: That discussion has now moved to engineer Karl Dohm's blog post, where there is now a civil discussion taking place on the subject.


But alas, a graceful ending was not to be. I've been informed that a certain NetApp employee has now moved to Twitter to assert that I have called NetApp a liar in my  blog post.


So, in the interest of setting the record straight on this important point, let me make it clear: I have never referred to NetApp or its bloggers as liars, though I have said, and still believe, that some of their claims and arguments are illogical, both in regards to claims they make about themselves and claims they make about the competition.


If you check my previous blog post, you will see that the word 'liar' was used only once and that was by a NetApp blogger, in a moment of excessive sensitivity. But now another NetApp employee has picked it up and twittered about it. Ah! A new NetApp blogging tactic: One NetApp blogger exaggerates a competitor claim, then the other one attacks the competitor for it. Hmm....I must add that one to my list. 


Now, here are just three examples of NetApp illogic that surfaced in the previous post:



  1. Using blog references to convince me that WAFL is not a file system (Kostadis, Geert, are you reading this?) when every NetApp white paper on the NetApp website, including one just published in July 2009, still refers to WAFL as a file system - http://media.netapp.com/documents/wp-7079.pdf.  Logically, why would you insist your competitors accept your point when you haven't even convinced your own company?

  2. Telling me it is 'dangerous' for a competitor to even attempt to accurately performance test another vendor's array, when NetApp has actually gone to the extent of publishing two SPC benchmarks on EMC arrays. Okay, maybe 'illogical' is not the right word here - perhaps 'contradictory' would have been more precise. But then again, wouldn't you think it illogical to state an obvious contradiction in a public blog? I mean, the idea of a debate is to win the argument, not hand your competition a stick to beat you with. Note to NetApp bloggers: I am not threatening to beat NetApp employees with a stick.

  3. Claiming in their 21 page Wyman/Mercer cost-of-ownership white paper that after a thorough and meticulous analysis of EVA, CLARiiON, and DMX usable capacity, it was found that all those arrays used exactly the same amount of usable capacity for a 4TB database, down to the tenth of a terabyte (and by the way, the number NetApp came up with in its painstakingly precise calculation was 30.7TB for each, as opposed to their own 15.0TB for a FAS system.) If 'illogical' is not the right word here, which word would you prefer? Would you find 'ridiculous' less offensive?


But my point is: When your claims are illogical, it's illogical to take offense. Rather, reworking your arguments and getting back into the game is the best option. Also, I think everyone realizes that being illogical and lying are two entirely different things.


As far as blogging is concerned, I consider myself one of the least thin-skinned people you'll ever blog with. Any tendency towards hyper-sensitivity was beaten right out of me during six years in the Marines. When someone now tells me that 'my claims are illogical', I don't get personally worked up about it. In fact, I find myself marveling at their gracious language and self-restraint. Heck, I didn't even get angry when a NetApp blogger published one of my HP Confidential slides and called it 'nonsense' and 'dipstickery' (see this post).


So, here is my final piece of advice to my honorable NetApp colleagues: Lighten up, guys! Nobody in the blogging world minds a well phrased repartee now and again, but all this teeth-grinding is so Cold War. Within the industry, you're the only bloggers I know that carry on the way you do. Your company's doing well. Relax. Engage in the blogs if you feel so moved, but try to have a good time while you're doing it.  


Best regards,


Jim



Labels: NetApp| storage| WAFL

Making Sense of WAFL Part 4

By Karl Dohm, HP Storage Architect

 

To recap, in this series of posts we are exploring some of the limitations of WAFL and specifically how those limitations manifest themselves to an average user of the FAS. For my previous posts see Part 1, Part 2, and Part 3.

 

I had made the assertion in the original post that some of the competitive disadvantages on the FAS involved throughput, capacity utilization, and ease of management.  I had just started on the throughput part of the discussion when Konstantinos Roussos of NetApp posted that I was off the mark and cited the Avanade paper as one piece of evidence to prove it.  So in the interest of fairness, I decided to try and understand what the Avanade paper was trying to say.

 

One of the initial tests described in the paper was rather simple to set up, and it was described by the experts at Avanade as one to "assess the overall performance of the FAS3050".  I'm always looking for something relatively simple to set up that assesses the overall performance of an array, so this captured my interest.  Since I didn't come up with the test, there shouldn't be any confusion that I somehow biased the selection to make NetApp look artificially unfavorable.   Hopefully we can agree that this test has value, while not perfect, and has been arrived at in a fair manner. 

 

My various recent blog posts have focused on trying to recreate the test result in the Avanade paper outside of the Avanade environment.  The main purpose of this iteration, at least for me, is to arrive at some way to simply and fairly have a basis for comparison. 

 

We still have some differences in how we run the tests, but for the most part these shouldn't matter much. Patrick's first and second posts on the Avanade advisor blog describe some test details that augment the original Avanade paper. Here the test is run with 4Gb FC SAN connected to a 32 bit Windows 2003 Server on a physical machine. Patrick has evolved the test a bit as he is running on a 64 bit Windows 2008 Hyper-V virtual machine using the iSCSI stack. It's not practical to switch over to the environment he used, so I have to assume the differences are not significant to the outcome. I think there is some value though in minimizing variables and staying with a simpler stack between IOmeter and the LUN.

 

Everything else possible I can think of is being matched up, including spindle count of 20 in aggregate (root volume in different aggregate), raid group size of 20, rotational latency, raid type of LUN, IOmeter profile, IOmeter file size and so on.  See the end of the post for more detailed configuration data. 

 

Since the FAS3050 model used in the original test is no longer a current product, Patrick suggested using a FAS3070 or a FAS2050.   It turns out that's good advice because, even with the new info, I could not get the 3050 results to come close to the Avanade results.  I expected the same outcome on the FAS2050, but was in for a surprise.

 

It turns out that I get better throughput on the FAS2050 than Patrick did.  For those of you thinking I'm spinning data, please open your minds and read this carefully. 

 

For the LUN with no fragmentation, in our environment the FAS 2050 runs at about 4860 IOPs and 53 MB/s, which are average values taken over the first 20 minutes after full reallocation.  Patrick's results maxed out at about 3460 IOPs and 39MB/s - which he said was also against the LUN with no fragmentation.   My result was nearly 40% faster.  So if you took that data by itself, this would be a great endorsement of the FAS by a competitor. 

 

But there is more to the story of course.  This is a random workload, and as the LUN fragments the throughput degrades.  I ran 14x 20 minute segments of this load (about 4.5 hours) after an initial reallocate and the results for the FAS2050 are shown below.

 

20 min segment   IOPs   MB/s   Avg Resp Time (msec)
 1               4864   52.7   30.8
 2               4494   48.6   33.4
 3               4296   46.5   34.9
 4               4141   44.8   36.2
 5               4014   43.4   37.3
 6               3896   42.2   38.5
 7               3812   41.2   39.3
 8               3714   40.2   40.4
 9               3676   39.8   40.8
10               3616   39.1   41.5
11               3576   38.7   41.9
12               3533   38.2   42.4
13               3499   37.9   42.9
14               3449   37.3   43.5

 

The results Patrick reported were closer to where the FAS ended up after fragmentation took its toll.  So I'm guessing he may have not reallocated his volume in between each IOmeter run, meaning the LUN was in fact fragmented when he got to the higher thread counts.  Anyway, his data matches nearly perfectly with my post-fragmentation data in the 14th segment:  3460 IOPs at Avanade vs 3450 here.
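To put numbers on the decay in the table, the following minimal Python sketch computes the overall IOPs decline across the 14 segments and the decline from each segment to the next. The IOPs values are copied from the table; the function names are only illustrative.

```python
# IOPs per 20 minute segment, copied from the FAS2050 table above.
FAS2050_IOPS = [4864, 4494, 4296, 4141, 4014, 3896, 3812,
                3714, 3676, 3616, 3576, 3533, 3499, 3449]


def total_decline_pct(series):
    """Percentage decline from the first to the last measurement."""
    return (series[0] - series[-1]) / series[0] * 100.0


def step_declines_pct(series):
    """Percentage decline from each measurement to the next one."""
    return [(a - b) / a * 100.0 for a, b in zip(series, series[1:])]


print(f"overall decline: {total_decline_pct(FAS2050_IOPS):.1f}%")
for segment, drop in enumerate(step_declines_pct(FAS2050_IOPS), start=2):
    print(f"segment {segment}: {drop:.1f}% below the previous segment")
```

The per-segment declines shrink over time but every one of them is still positive at segment 14, which is another way of seeing that the curve has not yet flattened out.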

 

Note that I ran 150 threads in a single IOmeter worker, which was what it took to get an approximate 30msec average response time at the start of the test.  The 30msec average figure is somewhat standard to use in the industry.
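The 150 thread figure can be cross-checked against the table with Little's Law, which says the average number of I/Os in flight equals throughput multiplied by average response time. A minimal Python sketch follows, assuming the response time column is in milliseconds (consistent with the roughly 30msec starting point mentioned above).

```python
# (IOPs, average response time in msec) for each 20 minute segment,
# copied from the FAS2050 table. The msec unit is an assumption.
SEGMENTS = [
    (4864, 30.8), (4494, 33.4), (4296, 34.9), (4141, 36.2),
    (4014, 37.3), (3896, 38.5), (3812, 39.3), (3714, 40.4),
    (3676, 40.8), (3616, 41.5), (3576, 41.9), (3533, 42.4),
    (3499, 42.9), (3449, 43.5),
]


def outstanding_ios(iops, resp_time_ms):
    """Little's Law: I/Os in flight = arrival rate x time in system."""
    return iops * resp_time_ms / 1000.0


for number, (iops, resp) in enumerate(SEGMENTS, start=1):
    print(f"segment {number:2d}: {outstanding_ios(iops, resp):.1f} I/Os in flight")
```

Every segment works out to about 150 I/Os in flight, matching the IOmeter thread count. With the thread count fixed, falling IOPs and rising response time are two views of the same slowdown.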

 

Let's compare these results to the EVA4400.  My understanding is that these two arrays (FAS2050 and EVA4400) are often sold against each other in competitive situations, so it makes sense to see which array has an edge.

 

IO Test

 

The EVA4400 runs this same test with a throughput of nearly 6000 IOPs and 64MB/s, which doesn't degrade over the course of the 14x 20 minute segments.  The EVA LUN doesn't fragment, so the throughput remains roughly the same in each 20 minute segment. 

 

Summarizing, the FAS LUN throughput is 81% of the EVA LUN when it is not fragmented and 58% of the throughput when it is fragmented.  Since the FAS LUN can't avoid fragmenting with this load, the 58% figure is the more relevant one for a typical situation.   Note that the FAS LUN has not yet settled to a steady state after these 14 segments, so its numbers will be somewhat lower if this test runs longer.  
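The two percentages can be reproduced approximately from the IOPs figures. A minimal Python sketch follows, assuming an EVA4400 figure of exactly 6000 IOPs (the post says "nearly 6000", so the true ratios are slightly higher than what this prints) and taking the FAS2050 values from segments 1 and 14 of the table above.

```python
# Assumption: the EVA4400 is described as "nearly 6000" IOPs; 6000 is used here.
EVA_IOPS = 6000.0

# FAS2050 IOPs from the first and fourteenth 20 minute segments.
FAS_UNFRAGMENTED_IOPS = 4864.0
FAS_FRAGMENTED_IOPS = 3449.0


def relative_throughput_pct(fas_iops, eva_iops):
    """FAS throughput expressed as a percentage of EVA throughput."""
    return fas_iops / eva_iops * 100.0


print(f"unfragmented: {relative_throughput_pct(FAS_UNFRAGMENTED_IOPS, EVA_IOPS):.1f}% of EVA")
print(f"fragmented:   {relative_throughput_pct(FAS_FRAGMENTED_IOPS, EVA_IOPS):.1f}% of EVA")
```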

 

Ok, that's a lot to digest, and given the outcome there will certainly be criticisms. But, it's a relatively simple test that can be recreated by anyone out there; it's a test that I didn't choose; the test is said by a Windows integration expert to be a measure of overall array performance, and it's pretty clear that there is an EVA advantage.

 

By the way, in case it's thought that the EVA just got lucky with this choice of test, I'm open to trying any other IOmeter access specification that anyone wants to run, as long as the test is at least somewhat relevant to customer workloads and stays reasonably simple to set up and understand.

 

Finally, it deserves mentioning that this thread has a tie to the associated topic of cost and capacity utilization.  This is mostly a topic for another day, but the EVA can be deployed without any additional spindles over the 20 used in this test.  The FAS, if running in a cluster (which is redundancy comparable to EVA) would need at least one additional global spare spindle on the a side of the cluster, three additional spindles to hold the root volume of the b side of the cluster, and one global spare for the b side of the cluster.  That's an overhead of five additional spindles to support an array with 20 data spindles.

 

You may be able to sense where I'm going with this. Five spindles incur roughly an additional $5000 in purchase price for the drives alone, and with those five you would need additional drive shelves, cabling, rack space, power, and cooling. I'm not sure how many dollars all of that adds up to, but some would call that a significant additional cost burden for an array that provides 58% of the throughput.
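The spindle overhead described in the last two paragraphs is easy to tally. A minimal Python sketch follows; the per-drive price of $1000 is an assumption implied by the "roughly $5000 for five spindles" estimate, and the dictionary labels are only illustrative.

```python
DATA_SPINDLES = 20

# Extra spindles a clustered FAS would need, as listed in the post.
EXTRA_SPINDLES = {
    "global spare, a side": 1,
    "root volume, b side": 3,
    "global spare, b side": 1,
}

# Assumption: about $1000 per drive, implied by "$5000 for five spindles".
PRICE_PER_DRIVE_USD = 1000.0

extra_total = sum(EXTRA_SPINDLES.values())
overhead_pct = extra_total / DATA_SPINDLES * 100.0
extra_drive_cost = extra_total * PRICE_PER_DRIVE_USD

print(f"extra spindles: {extra_total}")
print(f"overhead versus data spindles: {overhead_pct:.0f}%")
print(f"extra drive cost (drives only): ${extra_drive_cost:,.0f}")
```

This covers drives only; shelves, cabling, rack space, power, and cooling are not priced in the post and are left out here as well.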

 

Additional configuration details for runs done here

 

FAS2050

 


  • 20 spindle aggregate containing 1x 1TB volume
  • 1TB volume containing 1x 1TB LUN
  • 2 front end FC ports per FAS filer
  • 150 threads in IOmeter (~30 msec response time at start of test)
  • MPIO policy round robin
  • Ontap 7.2.4L1

 

EVA4400

 


  • 20 spindle disk group containing 1x 500GB LUN
  • 170 threads in IOmeter (~30 msec response time)
  • 2 front end FC ports per controller
  • 09004000 firmware

 

Common

 


  • 15K 144GB drives
  • IOmeter writes to a 100GB file placed in the lowest LBAs of the LUN
  • IOmeter access specification as in Avanade post
  • Emulex dual ported HBA
  • Emulex Queue depth = 254, Queue target=0
  • Windows 2003 32 bit server
  • MPIO and vendor specific DSM - SQST for EVA and RR for FAS
  • 4Gb FC SAN
  • Proliant DL380-G4

(Editor's note: Fixed broken links resulting from moving to a new blog platform - no content of the post was changed.  14 April 2011)

Labels: EVA| NetApp| storage| WAFL

Making sense of WAFL – Part 3

By Karl Dohm, HP Storage Architect

 

Sorry for the delay; I'm just now getting back to this thread with the third installment. For the previous posts see Making Sense of WAFL and Making Sense of WAFL Part 2. In this series we are trying to seek technical truths in the highly varying posts about NetApp performance, capacity utilization, and usability.

 

I want to thank Patrick for his post as it was very beneficial in helping to figure out how the Avanade tests were run.  Clearly these tests were run with careful attention to detail.

 

As I said previously, I do like the notion of doing IOMeter based throughput tests to compare arrays.  Relatively speaking it is simple to configure, there is little ambiguity, anyone out there who is listening can repeat the test, and it offers us a fair opportunity to compare various arrays in an apples to apples fashion.  IOMeter can be modified to push nearly any load we like, so if someone has a favorite workload we can focus on whatever flavor of load we like.  Most other approaches are a bit too loose for me, leaving too much room for interpretation and variations. 

 

Patrick mentions the true test, i.e. that there are many happy NetApp customers who are running Exchange. There is truth to this of course, but it isn't a good basis of comparison because every major array vendor has happy Exchange customers. However, it's reasonable to say that these installations can't know what they don't know.

 

I'm not saying you can't run Exchange successfully with NetApp. In fact I'm sure you can. The question looking for an answer is whether the user gets good value in choosing NetApp to run Exchange. Are they perhaps buying more iron than they need in order to handle their workload? So if it's ok with everyone, let's stick with simple IOMeter to probe this further and keep things from getting too hazy.

 

Unfortunately I don't have access to a FAS3070, so I reran the test as described by Patrick on a FAS3050c. My numbers for 128 threads came in at 25.5 MB/s and 2950 IOPs with an average response time of 43 msec - on a completely defragmented LUN (best possible state). This is still a long way off from the Avanade reported results of 48.1 MB/s with an average response time of 29.1 msec on a fragmented LUN.

 

Rather than comparing this result to our entry level EVA again, I'll just make an assumption that I've done something wrong.   So, please help me further iterate in understanding what it takes to achieve the results as reported in the paper.

 

One interesting clue Patrick mentioned was that MPIO was not used in the test. I find this unexpected, as it means either that all the load was run down a single 2Gb FC path to the FAS, or that IOMeter was somehow configured to drive load down multiple paths to the same LUN, or that some other multipathing product was used. Given the heavy, random nature of the load, I would have expected the use of multiple paths just in case the host port became a bottleneck. Perhaps this disconnect is the key to figuring out what is different in the environment.

 

Here is some more config info.  I'm using a Proliant DL380-G5 running Windows Server 2003 with Emulex 4Gb/s dual ported HBA through 4Gb Brocade switches.  The Emulex max queue depth is set wide open to 254.   I am using MPIO RR to give the FAS the benefit of having multiple paths share the load.  Ontap is 7.2.2.

 

(Editor's Note: Fixed broken links resulting from moving to new blog platform.  No content was changed.  14 April 2011)

Labels: EVA| NetApp| storage| WAFL

Making Sense of WAFL - part 2

By Karl Dohm, HP Storage Architect

 

Today I'm taking a few minutes to respond to some of the comments regarding my initial post on Making Sense of WAFL.

 

Apparently in that post I unwittingly opened up a few of NetApp's old wounds which have been extensively hashed through previously in public forums. Looking through the responses, NetApp has done a nice job of trying to deflect some of these problems through releases of nice-looking, apparently credible documentation.

 

For those that are biased NetApp's way, or are enamored with the technical ways of WAFL, there may be nothing to say to convince you otherwise.  But for those with an open mind, read on.

 

The problems we are talking about here are at the core of WAFL, and are clearly not easy to fix - or they would already be fixed. NetApp is not unique in having problems of course; all array vendors have their strong and weak points. But to assert that WAFL has no weaknesses around fragmentation, performance, and capacity utilization defies common sense. The old wounds are there for a reason.

 

Let's take a look at the Avanade white paper. It glows with enthusiasm about how the FAS3050c performs in MS Exchange based environments. Further detail from the paper's author can be found in an interview here. Peeling back the onion a bit, we see that this paper was created shortly after creation of a business partnership between NetApp and Avanade. Evidence of this partnership can be found here.

 

The IOmeter baseline performance data cited in the paper is interesting and worth exploring.  In the words of the white paper's author the IOmeter test against the FAS3050c had..  "two goals: to validate that our environment was set up correctly and to assess overall performance of the FAS3050c".   

 

The report is exceptionally loose about describing the setup. The transfer sizes used for IOmeter are claimed to range from .5KB to 64KB, but there is no indication of the weight applied to portions of this range. There is no mention of percent reads/writes or percent random vs sequential. It also doesn't discuss MPIO policy or HBA queue depth setting. There is no indication whether OnTap Exchange extents are enabled. Worst of all, and unique to NetApp, it doesn't define the history of writes and therefore level of fragmentation on the LUN.

 

I like IOMeter because it's a relatively simple test to run that is available for anyone to try since it's in the public domain. Given this open invitation to compare results with Avanade, it made sense to give the described test a try and see what happens.

 

It turns out that no matter what combination of the unspecified test parameters I tried, I could never get into the ballpark of results claimed in this white paper. 

 

So to illustrate with an example, I decided to keep things as simple as possible. Running a typical Exchange 2008 simulation load of 8KB transfer size, 80% random, 60% read, IOmeter queue depth of 128, MPIO round robin, Exchange extents enabled, HBA max queue depth of 254, 20x 15K spindle raid-dp aggregate, and letting the LUN settle through its fragmentation period, the throughput settles at 19 MB/s at an average latency of 52 msec.

 

The white paper claimed the FAS3050c runs 48 MB/s at 30msec latency, which is a world of difference.

 

So what gives?  One of several things has happened.  Perhaps I could not successfully piece together how to run this test from the information given.  It would be great to get clarification from NetApp on how to properly run this test and recreate the results.  The other explanation is perhaps that the results are not re-creatable without some special internal-use-only tuning parameters.  Or perhaps there is no way to recreate these results.

 

An EVA4400, run with the same workload, experiences approximately 39MB/s at 25 msec average latency. That's about twice the throughput on a workload that is mostly random, meaning the bottleneck is supposed to be at the spindles. Apparently on the FAS the bottleneck is somewhere else.
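The MB/s and latency figures quoted for both arrays are tied together by the workload parameters, so they can be sanity-checked against each other. Here is a minimal Python sketch, assuming that the IOmeter queue depth of 128 stays full for the whole run, that the EVA4400 run used the same queue depth, that every transfer is 8KB, and that 1 MB = 1024 KB. None of these assumptions are stated outright for the EVA run.

```python
def estimated_mb_per_s(outstanding_ios, avg_latency_ms, transfer_kb):
    """Estimate throughput from queue depth and latency.

    With a constant number of I/Os in flight, IOPs = outstanding / latency
    (Little's Law), and MB/s = IOPs x transfer size.
    """
    iops = outstanding_ios / (avg_latency_ms / 1000.0)
    return iops * transfer_kb / 1024.0


# Assumptions: queue depth 128 and 8KB transfers for both runs.
QUEUE_DEPTH = 128
TRANSFER_KB = 8

fas_estimate = estimated_mb_per_s(QUEUE_DEPTH, 52.0, TRANSFER_KB)
eva_estimate = estimated_mb_per_s(QUEUE_DEPTH, 25.0, TRANSFER_KB)

print(f"FAS3050c estimate: {fas_estimate:.1f} MB/s (post reports 19)")
print(f"EVA4400 estimate:  {eva_estimate:.1f} MB/s (post reports approximately 39)")
```

Under these assumptions the estimates land within about 1 MB/s of the reported throughput for both arrays, so the latency and MB/s numbers are at least internally consistent.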

 

Incidentally, this FAS 3050c LUN degraded about 10% in MB/s throughput as the fragmentation settled out.  That isn't such a big number, but recall that this test is mostly random I/O.  The sequential read portion, if looked at in isolation, degrades much worse.  It is why NetApp introduced Exchange extents. 

 

As in my previous post, if you don't believe what I am saying, give it a try.  Unlike my colleagues at NetApp, I gave you enough information here to run the test. 

 

Barring a sound explanation from NetApp, it seems to me that there is reason to doubt the credibility of the white papers and test results that NetApp is producing.

 

(Editor's Note: A broken link was fixed - no content changes were made.  14 April 2011)

Labels: NetApp| storage| WAFL

Making sense of WAFL

By Karl Dohm, HP Storage Architect


Extensible NetApp Blog (http://blogs.netapp.com/extensible_netapp) contains some posts describing WAFL.  It sums WAFL up as an internal component which...


...provides mechanisms for building file-system semantics, it manages the on-disk format, it manages the free and allocated space, and provides a logical and physical volume manager.


and further making the argument that it is not a file system, but rather an essential part of one.   Fair enough, calling it a file system might be splitting hairs and technically incorrect, but given the amount of confusion across vendors in the industry, it is likely that this common misunderstanding emanated from NetApp's own documentation.  


Whether the details of the FAS internals are technically part of WAFL, or part of OnTap, or part of something else, the main point is that none of that is particularly relevant to the Storage Administrator.    What matters most is performance, space efficiency, and ease of use. 


From this point on I'm not going to split hairs and discern between WAFL and the remainder of FAS internal software/firmware - because no one but NetApp architects really cares. For the benefit of simplicity let's call it all WAFL.


WAFL is a rather unique approach to organizing data on the spindles and controlling the flow of the data to the spindles.  No doubt WAFL can come across as impressive in sales presentations because it is very different than the approach used by EMC, IBM, and HP.  In this series of posts, we will explore the other side of WAFL, highlighting some of the problems that WAFL brings to the Storage Administrator, none of which we expect NetApp will fully acknowledge.


Today let's touch on fragmentation. Some in the industry say WAFL is "fragmentation by design". I didn't make it up, but like WAFL being called a file system, it's one of those things you tend to hear if the conversation is around NetApp. This statement strikes me as accurate because WAFL tries to do full stripe writes whenever it can, meaning that it prioritizes writing non-sequential blocks in the same stripe over the read-modify-write operations associated with RAID-4 or RAID-DP parity calculation.


Translating that to the world of the Storage Administrator, this means that gradually the throughput of the FAS degrades over time when the workload has a random component.  Most real world workloads have a random component.  Applications along the line of Microsoft Exchange present a nightmare situation for the FAS.  The throughput degradation can be significant, and throughput can be unpredictable because it varies depending on history of writes. 
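To make the mechanism concrete, here is a toy Python model of a write-anywhere layout. It is only an illustrative sketch and not NetApp's actual allocator: it assumes every overwrite of a logical block lands at the next free physical address, and it ignores stripes, parity, free-space reclamation, and reallocation. All names are hypothetical. It counts how many physically contiguous extents a logically sequential read would have to touch after a number of random overwrites.

```python
import random


def overwrite_randomly(num_blocks, overwrites, seed=0):
    """Return the physical address of each logical block after random overwrites.

    Toy assumption: the LUN starts out laid down sequentially, and each
    overwrite relocates the block to the next free physical address.
    """
    rng = random.Random(seed)
    location = list(range(num_blocks))
    next_free = num_blocks
    for _ in range(overwrites):
        logical = rng.randrange(num_blocks)
        location[logical] = next_free
        next_free += 1
    return location


def count_extents(location):
    """Count physically contiguous runs seen by a logically sequential read."""
    extents = 1
    for prev, cur in zip(location, location[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


if __name__ == "__main__":
    for overwrites in (0, 100, 1000, 5000):
        layout = overwrite_randomly(10_000, overwrites)
        print(f"{overwrites:5d} random overwrites -> {count_extents(layout)} extents")
```

In this model a fresh LUN reads back as a single extent, and the extent count climbs with the number of random overwrites, which is the qualitative behavior behind the sequential read slowdown described above.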


For those that question this assertion as being somehow biased, try the following.  Take a FAS system and create a new volume with a new LUN.  Baseline the system by running a sequential read workload and measuring the result.  Notice that the number is already not very impressive.  Next run a few hours of random workload, say 8KB 50/50 R/W, which is similar to MS Exchange.  Now try the sequential read load again and observe the new throughput.  Chances are you will have some new questions for your NetApp sales rep.  


Next time we will discuss the benefit of reallocation, NetApp's answer to fragmentation, and explore how much this really helps the problem.

Labels: NetApp| storage| WAFL
About the Author
  • 25+ years experience around HP Storage. The go-to guy for news and views on all things storage..
  • This profile is for team blog articles posted. See the Byline of the article to see who specifically wrote the article.