By Karl Dohm, Storage Architect
Welcome back to the next installment in a series of posts taking a closer look at NetApp and its FAS series of storage arrays. Today's topic is Microsoft's Exchange Solution Reviewed Program (ESRP) and its connection to FAS throughput.
The FAS has a somewhat controversial history with regard to performance. The issue comes up from time to time, and in response NetApp has generally denied that the problems exist. More often we find the opposite stance in posts from NetApp lauding their performance, for example Kostadis Roussos' post where he refers to WAFL write performance as 'surreal'. But, as I have said in previous posts, there are justifiable reasons this controversial subject keeps surfacing.
First of all, let's touch on why an average storage consumer should care about array throughput. An array with better throughput, i.e. the ability to service more I/Os from a given set of spindles, can do the same job with less hardware. The bigger the throughput difference, the more you save on purchase price, warranty cost, power consumption, floor space, cooling, and so on. Throughput statistics are therefore meaningful when evaluating the value of a storage array. NetApp evidently finds this attribute important too, given the number of blog posts and papers they have published on the topic of performance.
Recently, in the comments section of a blog post on understanding WAFL, NetApp's John Martin and I had a small debate about whether a synthetic load generator like IOMeter can be used to characterize how an array will perform in production scenarios. I argued that this type of tool can be used to bracket the I/O characteristics of a real-world application, and that through multiple point tests of the application's load components one can get a reasonable assessment of how well the box will behave. John's opinion was more along the lines that synthetic workload tests are not suitable as an indication of how well an array will run a production application ("Synthetic workloads in isolation lead to non typical results"). He referenced Jetstress as a more accurate indicator.
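To make the "multiple point tests" idea a little more concrete, here is a minimal sketch of what I have in mind. To be clear, the component mix, block sizes, and IOPS figures below are illustrative assumptions of mine, not numbers from the ESRP paper or from John's comments; the point is only the mechanics of measuring each load component in isolation and blending the results.

```python
# Illustrative sketch only: the workload components and numbers below are
# assumptions made for the sake of the example, not figures from the ESRP paper.

# Each "point test" isolates one I/O component of an Exchange-like load,
# roughly the way you would define an IOMeter access specification.
point_tests = [
    # name,                  block KB, % read, % random, share of total I/O
    ("db_random_read",              8,    100,      100,   0.45),
    ("db_random_write",             8,      0,      100,   0.30),
    ("log_sequential_write",        4,      0,        0,   0.15),
    ("background_maintenance",     64,    100,        0,   0.10),
]

# Hypothetical per-component IOPS measured by running each test in isolation
# against the array under evaluation.
measured_iops = {
    "db_random_read": 3200,
    "db_random_write": 2100,
    "log_sequential_write": 9500,
    "background_maintenance": 1200,
}

def estimate_blended_iops(tests, measured):
    """Estimate aggregate IOPS for the blended workload.

    Uses a simple weighted-harmonic-mean model: each component consumes
    array capacity in proportion to its share of the total I/O stream.
    """
    time_per_io = sum(share / measured[name] for name, _, _, _, share in tests)
    return 1.0 / time_per_io

if __name__ == "__main__":
    print(f"Estimated blended IOPS: {estimate_blended_iops(point_tests, measured_iops):.0f}")
```

Whether a weighted blend like this actually predicts production behavior is, of course, exactly what John and I were debating.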
I took his cue and had a look at the FAS2050 ESRP results paper, which describes MS Exchange-like throughput on the FAS2050 array. Even though ESRP isn't intended to be a benchmark, a scan of ESRP results tells me that many vendors use the forum as a way to post throughput numbers showing how their arrays handle MS Exchange load. That makes sense, since there is no Exchange-specific benchmark out there and ESRP is the closest controlled thing the industry has to work with.
The NetApp ESRP paper provides insight into how NetApp recommends setting up the 2050 for Exchange loads, and it shows throughput results for a heavily loaded 10,000-mailbox Jetstress test. The paper sparked our interest because the published results looked good and did not correlate with results from synthetic load generators that produce an I/O pattern similar to Jetstress. Maybe John was right. We decided to peel back the covers of this ESRP test and figure out what was going on.
We happened to have access to a FAS2050 and decided to try to recreate the published ESRP results. It turns out that the IOPS value NetApp published was in fact roughly recreatable from the data in the paper. On the surface this can be viewed as NetApp having made an honest submission to ESRP, and within the letter of the law one could reasonably argue that they did. But we also learned that NetApp found a way to make their results come across as favorably as possible, which means the results have little relevance to how well the FAS will actually run MS Exchange.
After a rather lengthy setup experience, we finally configured the aggregates, volumes, servers, LUNs, MPIO, and HBA attributes as described in the ESRP paper. We even set the diagnostic switch "wafl_downgrade_target" to a value of 0 in accordance with the recommendations in the paper.
One might ask, as we did, what does "wafl_downgrade_target" do? In its TR-3647 paper, NetApp describes the switch as follows: "The 'downgrade_target' command changes the priority of a process within Data ONTAP that handles incoming SCSI requests. This process is used by both FC SAN and iSCSI. If your system is not also running NAS workloads, then this priority shift improves response time."
I read that description as telling us that the NAS process consumes bandwidth even when there is no NAS work to do. And given NetApp's messaging around unified storage architecture, a recommendation to use this switch seems like a bit of a contradiction. Would you consider it normal to be asked to set a switch that generates the following response? "Warning: These diagnostic commands are for use by Network Appliance personnel only". Last but not least, this switch resets itself if the array reboots. I'll leave it to the audience to draw their own conclusions as to whether use of this switch is truly a recommended practice in customer environments.
Once the array was freshly initialized and everything was set up, we ran the test and observed roughly 2200 average database disk transfers/second per host. Within noise levels, this recreated the results posted in the ESRP paper.
The main problem we have with how NetApp did this testing is that every run after the initial one is slower than the run before it. The second run came in at approximately 1980 transfers/second per server, about an 11% drop. By the fifth run throughput had dropped to approximately 1555 transfers/second per server, a 30% drop. After a couple more runs we were down to 1450, 34% slower than the first run.
I didn't have the patience to run enough times to figure out where this decay curve flattens out.
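For anyone who wants to check the arithmetic, here is the run-over-run decline using the figures quoted above, taking the published 2220 transfers/second per host (which our first run roughly matched) as the baseline. Which run produced the final number is approximate, since it came a couple of runs after the fifth.

```python
# Run-over-run decline, using the figures quoted in this post.  The baseline
# of 2220 transfers/sec per host is NetApp's published ESRP number, which our
# first run roughly matched.
baseline = 2220
runs = [("run 2", 1980), ("run 5", 1555), ("a later run", 1450)]

for label, tps in runs:
    drop = (baseline - tps) / baseline * 100
    print(f"{label}: {tps} transfers/sec per host, {drop:.1f}% below baseline")

# run 2: 1980 transfers/sec per host, 10.8% below baseline
# run 5: 1555 transfers/sec per host, 30.0% below baseline
# a later run: 1450 transfers/sec per host, 34.7% below baseline
```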
At this point I decided to run a "reallocate measure" against one of the database LUNs, and the FAS reported the value to be 17. According to the NetApp man page for reallocate measure: "The threshold when a LUN, file, or volume is considered unoptimized enough that a reallocation should be performed is given as a number from 3 (moderately optimized) to 10 (very unoptimized)". Allow me to translate: the database LUNs are very fragmented. For those who might be confused by the word fragmentation in this context, this is not NTFS fragmentation - it's WAFL fragmentation.
Now things were starting to make sense. We were seeing the same sort of decay curve shown in the IOMeter results posted in Making Sense of WAFL - Part 4. Every time the test runs, the random component of the Jetstress database accesses fragments the LUNs further and the throughput numbers get worse. An array like an EMC CX or HP EVA won't undergo this sort of decay curve, since those arrays do not suffer the internal fragmentation that WAFL does.
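To illustrate the mechanism, here is a toy model of my own making (it is emphatically not ONTAP code): a LUN is laid down sequentially on disk, every overwrite relocates the logical block to a new physical location, and we track how much of the LUN remains physically sequential as random overwrites accumulate. The block counts are arbitrary; only the trend matters.

```python
import random

# Toy model only (my own simplification, not ONTAP internals): a LUN of N
# logical blocks starts out laid down sequentially on disk.  A write-anywhere
# layout never overwrites in place; each overwrite relocates the logical block
# to the next free physical location.  We then measure how "sequential" a
# front-to-back read of the LUN remains.

def simulate(lun_blocks=100_000, overwrites=200_000, seed=1):
    # physical[i] = physical address of logical block i; initially sequential
    physical = list(range(lun_blocks))
    next_free = lun_blocks
    random.seed(seed)
    for _ in range(overwrites):
        blk = random.randrange(lun_blocks)   # random small-block overwrite
        physical[blk] = next_free            # relocate instead of overwriting
        next_free += 1
    # fraction of logically adjacent blocks that are still physically adjacent
    contiguous = sum(
        1 for i in range(lun_blocks - 1) if physical[i + 1] == physical[i] + 1
    )
    return contiguous / (lun_blocks - 1)

if __name__ == "__main__":
    for n in (0, 50_000, 100_000, 200_000):
        print(f"{n:>7} random overwrites -> "
              f"{simulate(overwrites=n) * 100:5.1f}% of the LUN still sequential")
```

An array that updates blocks in place keeps that ratio at 100% no matter how long the random workload runs, which is why the CX and EVA don't show this decay. It also explains the checksum behavior described next.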
That's not all. After the throughput test, Jetstress executes a checksum pass over the databases to be sure the array did not corrupt any data. After a few runs I noticed an interesting pattern: on the FAS, the time needed for the checksum calculation also degraded as the database LUNs went through their WAFL fragmentation. When the LUNs were fresh and defragmented, the checksum calculation took about 2 hours. By the fifth run, when the database LUNs had a fragmentation measure of 17, the checksum calculation took over 10 hours, roughly five times as long. To summarize: we saw a 34% slowdown in database throughput and roughly a fivefold slowdown in the checksum calculation just by letting the ESRP test run for about 48 hours before taking measurements.
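For completeness, here is the checksum arithmetic from the approximate times above; since the inputs were "about 2 hours" and "over 10 hours", treat the factor as a rough figure.

```python
# Checksum-pass duration, using the approximate times quoted above.
fresh_hours = 2     # freshly initialized, defragmented LUNs
run5_hours = 10     # after the fifth run (reallocate measure = 17)

increase = (run5_hours - fresh_hours) / fresh_hours * 100
print(f"checksum pass takes {run5_hours / fresh_hours:.0f}x as long "
      f"({increase:.0f}% increase)")
# -> checksum pass takes 5x as long (400% increase)
```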
So, drawing this to a close, I think there is a reasonable argument that NetApp's results should look more like 1450 (or fewer) disk transfers/second per host, as opposed to the 2220 transfers/second per host they did post. Most would expect that results in a test as visible as ESRP are measured after a reasonable burn-in period. After all, when someone runs MS Exchange, they usually run it for longer than 2 hours.