While a lot of systems do in fact run some sort of automated monitoring tool such as sar, ganglia, cacti and others, I'm always surprised to hear that the monitoring intervals being used are on the order of 5-10 minutes. The upside is that a relatively small number of data samples shows you what your system has been doing over the course of days, weeks or even months. In fact, if you're a sar user whose default interval is 10 minutes, a day's activity is captured in just 144 samples. But what are you actually seeing? If your network activity shows occasional blips at 50% load, how would you know it wasn't running at 100% for 5 minutes and 0% for the other 5? When looking at 10-minute averages you're certainly seeing some form of activity, and if you tend to run long, steady workloads perhaps the numbers are representative of what is typically happening, but what value is that data when something goes wrong?
I'd claim the only solution, assuming of course that you do want to be able to diagnose performance problems, is to collect your data at a much finer granularity, perhaps in the 1-10 second range. Admittedly you have the same problem at any interval, in that events shorter than the interval can still get averaged away, but your chances of catching them are significantly better than with 10-minute samples.
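To make the averaging effect concrete, here's a tiny Python sketch. The traffic trace is invented purely for illustration: a link sits idle except for one 5-minute burst at 100%, and the same per-second data is averaged over 10-minute and 10-second windows.

```python
# Sketch of how averaging hides bursts: one 5-minute burst at 100% looks like
# a harmless 50% blip at a 10-minute interval, but is obvious at 10 seconds.

def averages(samples, window):
    """Average per-second samples over windows of `window` seconds."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples), window)]

# One hour of per-second utilization: all idle except minutes 30-35.
utilization = [100 if 1800 <= t < 2100 else 0 for t in range(3600)]

print(max(averages(utilization, 600)))  # 10-minute window -> 50.0
print(max(averages(utilization, 10)))   # 10-second window -> 100.0
```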
However, this immediately raises two questions, the first of which is "how much is this going to cost in terms of overhead?" My answer is the classic "it depends", but seriously, in most cases you won't even notice it. Many monitoring tools use less than 1% of a CPU, and some less than 0.1%, but to be sure this isn't a problem you can always run some of your applications with and without monitoring enabled and compare. If you're chasing intermittent problems, you might even enable finer-grained monitoring until you track them down.
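If you want to check the cost yourself, a rough comparison can be as simple as timing the same workload with and without the collector running. The Python sketch below is only an illustration: the monitoring command and the workload function are placeholders you'd swap for whatever you actually run, and you'd want to repeat the measurement a few times and average the results.

```python
import subprocess
import time

# Placeholder monitoring command -- substitute whatever collector you use.
MONITOR_CMD = ["collectl", "-i1"]

def workload():
    # Stand-in for a representative piece of your real work.
    sum(i * i for i in range(20_000_000))

def timed(run_monitor):
    """Time the workload, optionally with the monitor running alongside it."""
    monitor = subprocess.Popen(MONITOR_CMD) if run_monitor else None
    try:
        start = time.time()
        workload()
        return time.time() - start
    finally:
        if monitor:
            monitor.terminate()

base = timed(run_monitor=False)
with_mon = timed(run_monitor=True)
print(f"overhead: {100 * (with_mon - base) / base:.2f}%")
```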
The second question brings us back to a previous post I wrote about cluster vs local monitoring. When you monitor at this finer granularity on a large cluster, you're going to overwhelm any centralized data collector, and the only solution I can think of is to send less data to the center. In other words, can you configure your environment so that not all the data collected on an individual system is reported to the centralized monitor? If so, you can continue to centrally monitor your cluster, and when you need more detailed data you simply go to the individual systems on which it is being collected.
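Conceptually the split looks something like the sketch below. This is not how collectl or CMU actually move their data, and the hostname, port and file path are all invented; it just shows the idea of keeping every fine-grained sample on the local node while only a periodic rollup crosses the network.

```python
import json
import socket
import time

# Illustrative only: made-up central address and local log path.
CENTRAL = ("monitor.example.com", 9999)
LOCAL_LOG = "/tmp/fine_grained.log"
FINE_INTERVAL, FORWARD_EVERY = 1, 60   # sample every second, roll up every 60

def read_metric():
    # Placeholder for reading a real counter (e.g. from /proc).
    return 0.0

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
buffer = []
with open(LOCAL_LOG, "a") as log:
    while True:
        value = read_metric()
        log.write(f"{time.time():.0f} {value}\n")   # full detail stays local
        log.flush()
        buffer.append(value)
        if len(buffer) >= FORWARD_EVERY:            # only a rollup goes central
            rollup = {"host": socket.gethostname(),
                      "avg": sum(buffer) / len(buffer),
                      "max": max(buffer)}
            sock.sendto(json.dumps(rollup).encode(), CENTRAL)
            buffer.clear()
        time.sleep(FINE_INTERVAL)
```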
To my knowledge most monitoring environments can't do this! Either they collect the data locally or they collect it centrally, but not both. However, there is a solution, at the core of which is an open source tool I wrote a number of years ago called collectl. Its primary focus is fine-grained local data collection in the 0.1% overhead range, but it can also send a subset of that data over the network to a centralized monitoring system. In fact that is exactly what is done with HP's Cluster Management Utility, so with CMU you can centrally monitor multi-thousand-node clusters and still get at the details when you need them. More on CMU in a future post.
As Alanna said in a previous post of hers, I'll be at the HP Technical Forum next week doing a couple of presentations on collectl and will also be doing some demos of CMU in one of our booths. If you'll be there and this topic interests you, be sure to attend one of my presentations or just track me down at the booth and we can discuss this in as much detail as you like.