In the past few years, IT shops have grown accustomed to letting their file systems grow to staggering capacities. File systems are advancing at a similar pace, extending their capacity limits and design thresholds, especially around metadata management. However, this increased capacity lays the groundwork for extremely volatile situations.
Is your data safe? Will you be able to access it when you need it? How is your storage solution architected? When was the last backup? If you even have a backup, how long will it take you to restore it? These are questions all business leaders and CIOs should be asking.
Are large, single-namespace file systems right for you?
Recently, I was engaged on a case that involved a block-level data migration of a customer's business data. After the migration was completed (successfully, I might add), all systems were released to full production load. Some weeks or months later, the system administrator ran a simple script to change file and directory permissions. What he assumed would be a harmless action took the file system offline, affected the entire business unit, and impacted thousands of users. The "hot seat" takes on a whole new meaning when the IT guy has to explain to executives that a benign task knocked a single, extraordinarily large file system offline.
This particular case revolved around a 10 TB NTFS file system containing tens of millions of files, varying in size from a few KB to multiple gigabytes. At the time of the outage the customer had no idea that the file system was horrifically fragmented at both the NTFS file system layer and within the MFT (Master File Table). The outage itself was triggered by a system administrator applying ACL changes to millions of files via a batch job. Those changes caused extremely long wait times in the file system driver stack. As a result, the system became very lethargic, leading to network pauses, clients being disconnected from the share, and applications using those shares crashing.
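To make the failure mode concrete, here is a minimal sketch (not the customer's actual script) of what a mass permission change looks like when it is at least throttled into batches so the file system driver stack gets room to breathe. The share path, batch size, and pause interval are all hypothetical, and on Windows a real ACL change would go through icacls, PowerShell, or the Win32 security APIs rather than a simple chmod.

```python
import os
import stat
import time

ROOT = r"D:\corp_share"   # hypothetical share root
BATCH_SIZE = 5000         # files per batch (assumption)
PAUSE_SECONDS = 30        # breather between batches (assumption)

def change_permissions(path):
    # Illustrative only: os.chmod on NTFS merely toggles the read-only
    # attribute; a real ACL rewrite is a heavier metadata operation.
    os.chmod(path, stat.S_IWRITE | stat.S_IREAD)

count = 0
for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        change_permissions(os.path.join(dirpath, name))
        count += 1
        if count % BATCH_SIZE == 0:
            # Pausing between batches keeps millions of metadata writes
            # from hammering the MFT back to back and starving other I/O.
            time.sleep(PAUSE_SECONDS)
```

Run against tens of millions of files with no such pauses, every one of those metadata updates lands on the same MFT, and everything else queued behind them simply waits.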
It was not just that the customer had a 10 TB NTFS file system; it was also the millions upon millions of files, varying in size from a few KB to multiple gigabytes, that led to the climactic event. By that point the MFT had grown to more than 12 gigabytes, and fragmentation was rife throughout both the file system and the MFT. Defrag and contig runs against just the MFT (the metadata) ran for days trying to realign data onto contiguous blocks but were never able to complete. Even if contig had completed, one pass would not have been enough; it would have taken several more executions to realign the blocks skipped on the initial pass due to locked address space, since the file system was still online, not to mention the many other factors in play. In any case, a new design was required to fix this client's problem, and they decided to go with multiple smaller NTFS file systems. There is far more to this story that I am not including here, as it would just muddy the waters. Suffice it to say, the customer had no choice but to keep their basic-disk Windows approach due to their style of clustering, so we had to reduce the size of each file system to manage the number of files and the fragmentation.
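Before committing to (or splitting up) a single namespace, it helps to know how many files you actually have and how their sizes are distributed. Below is a minimal, hypothetical sketch that walks a share and buckets files by size; the root path and the bucket boundaries are assumptions, and on a file system of this scale even the scan itself can run for hours, which is rather the point.

```python
import os
from collections import Counter

ROOT = r"D:\corp_share"   # hypothetical share root

def size_bucket(size_bytes):
    # Coarse buckets so the size distribution is obvious at a glance.
    if size_bytes < 1024 * 1024:
        return "< 1 MB"
    if size_bytes < 1024 * 1024 * 1024:
        return "1 MB - 1 GB"
    return "> 1 GB"

total_files = 0
buckets = Counter()
for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        try:
            size = os.path.getsize(os.path.join(dirpath, name))
        except OSError:
            continue  # skip files that vanish or deny access mid-scan
        total_files += 1
        buckets[size_bucket(size)] += 1

print(f"{total_files} files under {ROOT}")
for bucket, count in buckets.most_common():
    print(f"  {bucket}: {count}")
```

If a report like this comes back with tens of millions of entries under one root, that is a strong hint that backup windows, restore times, and metadata maintenance deserve a hard look before the next "harmless" batch job.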
Was this strictly a problem with NTFS? Given that I am a debugger, coder, author, and Unix guru, you might expect me to say yes. However, I truly feel that good products sometimes get a bad rap for being used in ways they were never intended, or in scenarios that were never conceived. I am not saying you cannot have a 10 TB or even a 200 TB NTFS file system. Both are fully supported, and we do have customers that push them to these extremes. At the end of the day it is extraordinarily important to think not only about what is supported, but also about the use model. Would you take "any" automobile with four wheels and a high-powered engine to the racetrack?
In cases like this there are better solutions for accommodating extraordinarily large file systems that contain extraordinarily high numbers of files of varying sizes and access patterns, along with thousands of concurrent user accesses (IBRIX, Lustre, etc.). If you want to know more, please comment.
Until next time.
This issue is not going away. In fact, similar issues are cropping up more and more as customers and companies continue to grow their legacy file systems bigger, and bigger, and, well, you get the picture. Before you back your business into a difficult corner, consider the ramifications of going to a single file system petabytes in size. Can you back it up? Can you restore it before the company is out of business? You will need a DR strategy for a single file system of that size, but remember that synchronous or asynchronous hardware replication only protects against hardware failure, not file system corruption, viruses, or human mistakes such as deleting or formatting the wrong disk.
Learn about game-changing innovations in software, hardware, services, and networking at HP Discover 2012: http://bit.ly/LOt1qp