Around the Storage Block Blog
Find out about all things data storage from Around the Storage Block at HP Communities.

NetApp’s ‘Shining’ Moment – its Capacity Guarantee Program follow up

By Jim Haberkorn 


I'm leaving this Friday for a one month holiday so I'm going to try and wrap up this NetApp usable capacity issue as much as I can over the next two days.  I'll try to answer all the comments I can before I leave but some may have to wait until mid-January when I'm back.  


So, the question of the day is: how does NetApp get away with this usable capacity issue? It's a long story, but I'll try to keep it short. NetApp has a unique technology, and with that uniqueness come both strengths and weaknesses. Now, to be clear, no one is saying that you can't fill up NetApp filers to 99% capacity if you want to. What's being said is that if you do, your filer's behavior is going to change dramatically from when it was first installed. Whether you notice the change in your environment depends on a number of factors, but the filer's behavior will change over time in the vast majority of environments. The issue is hard-wired into their design.


The issue can be hard to pin down because it is partially tied to performance, and this factor tends to play out differently in NAS and SAN environments. So, if you listen carefully to the discussions on NetApp user groups you might notice that the people who are saying they haven't seen a problem are mainly NAS customers and the ones who are complaining are mainly from SAN environments.  There's a reason for this.  


In your average NAS environment, if your word.doc file took 3 seconds to download today and takes 5 seconds tomorrow, does anyone complain? No. Therefore, in many NAS environments, though the filer's performance may degrade by 40% or more over time for a variety of reasons, as long as it stays above the customer's pain threshold, no one even notices. In block/SAN/database environments it's a different story. In those environments, performance and maintaining free space on a NetApp filer are crucial. NetApp talks about adding free space for 'chaotic workloads' - a clever name that implies there is something aberrant about those environments. In fact, they are merely referring to the random workloads found in almost all SAN environments. Those environments need more free space, and when they don't get it, bad things happen - at the very least performance degrades, and if the filer should ever actually run out of free space, then the database crashes and the file systems and source LUNs become inoperable. To the best of my knowledge, this behavior does not happen with any other array on the market.


Skeptical? Check out the paragraph marked 'caution' on page 25 of http://media.netapp.com/documents/tr-3431.pdf. Or try this experiment with your NetApp filer. It will work 100% of the time. Start with a fresh NetApp filer with all the default settings in place and create a 40-spindle RAID-DP aggregate with a 1TB volume and a 500GB LUN. Then let IOmeter run random writes to the LUN for a few hours. The LUN will run fine and then suddenly start to throw millions of errors, then crash and be taken offline. The only way to prevent this is to drill down into the GUI and turn off the default auto-snap schedule.


Now, I have actually read statements by NetApp bloggers that NetApp LUNs 'DO NOT' run out of free space. The actual quote I am referring to stated that 'LUNs' in NetApp filers don't run out of free space because of the NetApp LUN auto-grow and snapshot auto-delete features, which NetApp added several years ago (and I assume also because of a rarely talked about 'automatic dismount of database' feature). But what about aggregates running out of free space? Could that happen? No? Then why does NetApp have a separate best practice for aggregate free space? The reality is that despite LUN auto-grow and snapshot auto-delete, LUNs can run out of space on NetApp filers. What if you run out of auto-grow space? What if you've got no more snaps to delete? What if a customer isn't running snaps? What if you are running dedupe on the primary volume and people change bytes in their files and the files undedupe themselves? What if the host application creates lots of files? One point to ponder: NetApp places a lot of emphasis on snapshots, especially for restores. To recommend deleting snapshots to solve a problem tells you something about the seriousness of the problem.


So how does NetApp hide all this?  Well, one way is if they have a small SAN running on the same filer as a large NAS.  With all the free space running around no one really tracks whether the SAN is being a free space hog.  


Another way is to throw disks at the problem - not for the purpose of increasing spindle count but for the purpose of increasing free space. Which is an okay fix in my opinion, as long as the customers are told about the issue upfront. Typically, though, it is a surprise. Either the customer grits his teeth and keeps buying more disks to constantly maintain the original free space levels, or, if they are really angry and were savvy enough to require performance guarantees (now there is a NetApp guarantee program with some teeth in it!!), NetApp will give them the disks for free. It's a solvable problem in many cases, but it involves an unforeseen cost to the customer in disks, power, and floor space, and if they have to upgrade to bigger filers, then in software license fees as well. Trivia question: How many of you think EMC has the highest gross margins in the storage industry (among the major players)? They don't. In most quarters it's NetApp. Last time I checked, NetApp software gross margins were 96% and hardware was 48%. Note: if those numbers have plunged recently for NetApp, then I am open to being corrected.


Jim Haberkorn

Labels: NAS| NetApp| storage
Comments
Anonymous(anon) | 12-12-2008 07:47 PM

Groan. More tedious and at length hearsay. I'm going to blog today on this and your previous effort (and a quick reminder; this was about VMware, yes? You seem to have drifted off target.)

One point; you are absolutely right about the recommendations in media.netapp.com/.../tr-3431.pdf. It indicates a 100% fractional reserve. Our bad.

It's being withdrawn. It should have been reviewed and replaced, as it's dated October 2006 and the NetApp technology has come a long way since then. (It's something we've done with a number of older Technical Reports, such as our Exchange Best Practices at media.netapp.com/.../tr-3578.pdf.)

This one we missed. IBM's redbooks take longer for us to get corrected, although they're pretty quick off the mark, and I'll make sure that the process carries the changes through to the N Series documents.

The updated document will read along the lines of

Database Volume Size = Sum of the database LUN sizes that will share the database volume + (Fault Tolerance Window + Online Backup Retention Duration) * database daily change rate
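As a rough illustration of how that formula plays out, here is a minimal Python sketch. The function name, the units (GB and days), and the sample numbers are assumptions made for the example; none of them come from the technical report.

```python
def database_volume_size_gb(lun_sizes_gb, fault_tolerance_days,
                            backup_retention_days, daily_change_gb):
    """Apply the sizing formula quoted above.

    Assumed units (not stated in the original): sizes in GB,
    windows in days, change rate in GB per day.
    """
    # Space for the LUNs that share the volume
    lun_total = sum(lun_sizes_gb)
    # Space for changed data kept across both windows
    change_total = (fault_tolerance_days + backup_retention_days) * daily_change_gb
    return lun_total + change_total


# Hypothetical inputs: one 500 GB LUN, a 1-day fault tolerance window,
# 7 days of online backup retention, 10 GB of change per day.
print(database_volume_size_gb([500], 1, 7, 10))
```

With these made-up inputs the change term is (1 + 7) * 10 = 80 GB, so the volume comes out at 580 GB rather than the 1 TB that a 100% fractional reserve on a 500 GB LUN would imply.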

For more details, see this NetApp blog posted in September; blogs.netapp.com/.../tales-from-the.html

Extract;

"Some would have you believe configuring 99.99% LUN reservations on a NetApp SAN would result in storage admin's being immediately covered by the umbrella of a mushroom cloud explosion ..."

As it points out, acceptable values for safe NetApp FAS fractional space reservations are as low as 0%.

Later, when my blog is written. Posting replies through this letter box is painful.

Anonymous(anon) | 12-13-2008 01:14 AM

Hi Alex, thanks for your comments. And before I forget, have a happy holiday.  I'm running fast - have to catch a plane, so here it is quick and dirty:  



  • This whole thing didn't start with VMware, it started with a totally bogus NetApp white paper attacking the EVA that was 22 pages long and filled with 100% hearsay.  So, if my blog stings a little bit, well, take consolation in the fact that I'm not depopulating entire forests with the copies I print and hand out to customers. 

  • Your response indirectly makes an excellent point: it is extremely hard to come to terms with NetApp space reservation and usable capacity because of all the confusing information NetApp has published on the subject.  How hard could it be to put all the information on the subject in one place?  Why doesn't NetApp do that?  Put it in one place and cover all the relevant issues. I've listed those issues in a previous blog response.  But the reason why NetApp doesn't do that is because it is an extremely complex issue and if they wrote down in one place all the parameters, guidelines, best practices, cautions, workarounds, capacity warning tools, and database protection features needing to be invoked to protect their customers, it would scare the daylights out of everyone.  Just a theory. 

  • And when you update your white papers, don't forget to hand out copies to the NetApp technical guys that answer questions on the NetApp user chats - they seem to think you still need to increase free space when performance drops.  

  • Sorry to repeat this but, as I've said all along: if NetApp didn't have a usable capacity problem they wouldn't have to resort to the tricks they played in their Wyman/Mercer white paper or in their capacity guarantee.  In the end, NetApp tipped its own hand. 

  • And finally, darn! I wish you had written your counter blog earlier rather than wait to announce it on the day I mentioned I was leaving on vacation for a month.  But don't worry, I'm rather flattered.      

And again, Alex, and everyone else who has been reading this conversation - I hope you have a safe and happy holiday season!!!

Anonymous(anon) | 12-13-2008 08:06 PM

To your first point, your readers (and I was one of them) might get confused (and I did) if you insist on wandering off topic and using a shotgun approach to making your point.

To your second point; we are going to put all the information on the subject of space management on NetApp systems in one place. Currently, the same information is spread across several documents, and as you noticed, older copies haven't been updated. In scientific parlance, we will disprove your theory.

Point 3; I'm unsure what you're driving at here. Do you mean spindle count for IOPS? Can you give an example?

Last point; you and Chuck have commercial problems with the guarantee. We and our customers have no usable capacity problem at all with this; quite the reverse. You might want to take a look at a blog I did some time ago about ELF and realize that comparing an EVA's usable capacity to a NetApp system just doesn't mean the same thing. It's called storage virtualization. blogs.netapp.com/.../elf-wealth-and.html

Enjoy your break.

0007725852 | 02-27-2009 07:30 PM

Hey, Alex


All I see is: your answer can be described in one sentence: "Our lawyers so hard-balled to have 'problems with the guarantee'"
