By Jim Haberkorn
I'm leaving this Friday for a one-month holiday, so I'm going to try to wrap up this NetApp usable capacity issue as much as I can over the next two days. I'll try to answer all the comments I can before I leave, but some may have to wait until mid-January when I'm back.
So, the question of the day is: how does NetApp get away with this usable capacity issue? It's a long story, but I'll try to keep it short. NetApp has a unique technology, and with that uniqueness come both strengths and weaknesses. To be clear, no one is saying you can't fill a NetApp filer to 99% capacity if you want to. What is being said is that if you do, the filer's behavior will change dramatically from when it was first installed. Whether you notice the change in your environment depends on a number of factors, but in the vast majority of environments the filer's behavior will change over time. The issue is hard-wired into the design.
The issue can be hard to pin down because it is partially tied to performance, and that factor tends to play out differently in NAS and SAN environments. So, if you listen carefully to the discussions on NetApp user groups, you might notice that the people who say they haven't seen a problem are mainly NAS customers, while the ones complaining are mainly from SAN environments. There's a reason for this.
In your average NAS environment, if your word.doc file took 3 seconds to download today and takes 5 seconds tomorrow, does anyone complain? No. Therefore, in many NAS environments, even though the filer's performance may degrade by 40% or more over time for a variety of reasons, as long as it stays above the customer's pain threshold no one even notices. In block/SAN/database environments it's a different story. In those environments, performance and maintaining free space on a NetApp filer are crucial. NetApp talks about adding free space for ‘chaotic workloads' - a clever name that implies there is something aberrant about those environments. In fact, they are merely referring to the random workloads found in almost all SAN environments. Those environments need more free space, and when they don't get it, bad things happen: at the very least performance degrades, and if the filer ever actually runs out of free space, the database crashes and the file systems and source LUNs become inoperable. To the best of my knowledge, this behavior does not happen with any other array on the market.
Skeptical? Check out the paragraph marked ‘Caution' on page 25 of http://media.netapp.com/documents/tr-3431.pdf. Or try this experiment with your NetApp filer - it will work every time. Start with a fresh filer with all the default settings in place, create a 40-spindle RAID-DP aggregate with a 1TB volume and a 500GB LUN, and then let IOmeter run random writes to the LUN for a few hours. The LUN will run fine, then suddenly start throwing millions of errors, crash, and be taken offline. The only way to prevent this is to drill down into the GUI and turn off the default snapshot schedule.
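For the curious, here is a minimal sketch of that kind of workload - not NetApp or Intel tooling, just an illustration of the write pattern that chews through snapshot space. It assumes the 500GB LUN shows up on a Linux host as /dev/sdb (a hypothetical path - substitute your own) and that 4 KB random overwrites are a fair stand-in for the IOmeter profile:

```python
#!/usr/bin/env python3
# Crude stand-in for the IOmeter run described above: 4 KB random overwrites
# against the LUN as seen from a Linux host. The device path /dev/sdb is an
# assumption; point it at whatever your 500GB LUN is actually mapped to.
# Because WAFL keeps every overwritten block for as long as a snapshot
# references it, this steadily drains the volume's free space even though
# the LUN itself never grows.
import os
import random

DEVICE = "/dev/sdb"        # hypothetical path to the mapped 500GB LUN
BLOCK_SIZE = 4096          # 4 KB writes, a typical random/OLTP-style size
LUN_BYTES = 500 * 2**30    # the 500GB LUN from the example
NUM_BLOCKS = LUN_BYTES // BLOCK_SIZE

payload = os.urandom(BLOCK_SIZE)
with open(DEVICE, "rb+", buffering=0) as lun:
    written = 0
    while True:                                  # let it run for a few hours
        lun.seek(random.randrange(NUM_BLOCKS) * BLOCK_SIZE)
        lun.write(payload)
        written += BLOCK_SIZE
        if written % (2**30) == 0:               # progress note every ~1 GB
            print("overwritten %d GB so far" % (written // 2**30))
```

Run it (or the real IOmeter) against a filer with the default snapshot schedule still enabled and watch the volume's free space, not the LUN, be what runs out.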
Now, I have actually read statements by NetApp bloggers that NetApp LUNs ‘DO NOT' run out of free space. The actual quote I am referring to stated that LUNs on NetApp filers don't run out of free space because of the LUN auto-grow and snapshot auto-delete features NetApp added several years ago (and, I assume, also because of a rarely talked about ‘automatic dismount of database' feature). But what about aggregates running out of free space? Could that happen? No? Then why does NetApp have a separate best practice for aggregate free space? The reality is that despite LUN auto-grow and snapshot auto-delete, LUNs can run out of space on NetApp filers. What if you run out of auto-grow space? What if there are no more snapshots to delete? What if a customer isn't running snapshots at all? What if you are running dedupe on the primary volume and people change bytes in their files, so the files un-dedupe themselves? What if the host application creates lots of files? One point to ponder: NetApp places a lot of emphasis on snapshots, especially for restores. Recommending that you delete snapshots to solve a problem tells you something about the seriousness of the problem.
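To make the space math concrete, here is a back-of-the-envelope sketch using assumed numbers: the 1TB volume and 500GB LUN from the earlier example, the traditional 20% default snapshot reserve, and a made-up overwrite rate. It ignores fractional reserve and the other tunables, but it shows why steady overwrites under a snapshot schedule eventually exhaust the volume:

```python
# Back-of-the-envelope arithmetic, not measured data, for why the defaults bite.
# Assumptions: 1TB volume, 500GB space-reserved LUN, 20% snapshot reserve,
# and a hypothetical 60 GB/hour of unique random overwrites.
# WAFL never overwrites a block in place, so while a snapshot exists every
# overwritten block stays allocated and consumes fresh space in the volume.
VOLUME_GB       = 1024                      # 1TB volume
SNAP_RESERVE_GB = VOLUME_GB * 0.20          # assumed 20% default snapshot reserve
LUN_GB          = 500                       # space-reserved LUN
ACTIVE_FREE_GB  = VOLUME_GB - SNAP_RESERVE_GB - LUN_GB   # ~319 GB of headroom

OVERWRITE_GB_PER_HOUR = 60                  # hypothetical overwrite rate

# Snapshot data fills the reserve first, then spills into the active file
# system's free space; once that is gone, writes to the LUN start failing.
hours_to_full = (SNAP_RESERVE_GB + ACTIVE_FREE_GB) / OVERWRITE_GB_PER_HOUR
print("volume exhausted after roughly %.1f hours of overwrites" % hours_to_full)
# -> roughly 8.7 hours, unless auto-grow, auto-delete, or an admin steps in.
```

Change the assumed rate or the amount of free space and the clock runs faster or slower, but the direction is always the same: the headroom only ever shrinks.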
So how does NetApp hide all this? Well, one way is when a small SAN runs on the same filer as a large NAS. With all that free space floating around, no one really tracks whether the SAN is being a free-space hog.
Another way is to throw disks at the problem - not to increase spindle count but to increase free space. That's an acceptable fix in my opinion, as long as the customer is told about the issue up front. Typically, though, it comes as a surprise. Either the customer grits his teeth and keeps buying more disks to maintain the original free space levels, or, if he is really angry and was savvy enough to require performance guarantees (now there is a NetApp guarantee program with some teeth in it!), NetApp will give him the disks for free. It's a solvable problem in many cases, but it means unforeseen costs for the customer in disks, power, and floor space, and, if he has to upgrade to bigger filers, in software license fees as well. Trivia question: how many of you think EMC has the highest gross margins in the storage industry (among the major players)? They don't. In most quarters it's NetApp. Last time I checked, NetApp's software gross margins were 96% and hardware was 48%. Note: if those numbers have plunged recently for NetApp, I'm open to being corrected.