Hyperscale Computing Blog
Learn more about relevant scale-out computing topics, including high performance computing solutions from the data center to the cloud.

What is your monitoring interval?

While many systems do run some sort of automated monitoring tool such as sar, ganglia, cacti and others, I'm always surprised to hear that the monitoring intervals in use are on the order of 5-10 minutes.  The upside is that a relatively small number of samples shows what your system has been doing over the course of days, weeks or even months; if you're a sar user with the default 10-minute interval, a full day's activity fits in 144 samples.  But what are you actually seeing?  If your network activity shows occasional blips at 50% load, how would you know it wasn't running at 100% for 5 minutes and 0% for the other 5?  A 10-minute average certainly shows some form of activity, and if you tend to run long, steady workloads the numbers may be representative of what is typically happening, but what value is that data when something goes wrong?
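To make that 50%-blip scenario concrete, here's a small sketch (the bursty load shape is made up for illustration) showing how the same underlying activity looks under 10-minute versus 10-second averaging:

```python
# A tiny simulation of how the averaging interval changes what you see.
# The "true" load alternates: 100% busy for 5 minutes, idle for 5 minutes.

def true_load(t):
    """CPU load (0-100) at second t: 100 for the first 300 s of every
    600 s period, 0 for the rest."""
    return 100 if (t % 600) < 300 else 0

def averaged(interval, duration=3600):
    """Average the per-second load over windows of `interval` seconds."""
    samples = []
    for start in range(0, duration, interval):
        window = [true_load(t) for t in range(start, start + interval)]
        samples.append(sum(window) / interval)
    return samples

print(averaged(600)[:3])   # 10-minute averages: every window reads 50.0
print(averaged(10)[:3])    # 10-second averages: the burst shows up as 100.0
```

Every 10-minute sample reads a steady 50.0, indistinguishable from a system genuinely running at half load; the 10-second samples make the on/off bursts obvious.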

I'd claim the only solution, assuming of course that you do want to be able to diagnose performance problems, is to collect your data at a much finer granularity, perhaps in the 1-10 second range.  Admittedly any interval has the same problem in that events shorter than the interval can get averaged away, but your chances of catching them are far better than in the 10-minute case.

However, this immediately raises two questions, the first of which is "how much is this going to cost in terms of overhead?", and my answer is the classic "it depends".  But seriously, in most cases you won't even notice it.  Many monitoring tools use less than 1% of a CPU, and some less than 0.1%, but to be sure this isn't a problem you can always run some applications with and without monitoring enabled and measure the effect.  If you're chasing intermittent problems, you might even enable finer-grained monitoring until you can track them down.
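That with-and-without comparison is easy to script. Here's a minimal sketch of the idea; the sampler and workload below are hypothetical stand-ins, not any real monitoring agent:

```python
import threading
import time

def sample_stats(interval, stop_event, samples):
    """Hypothetical monitoring agent: wake up every `interval` seconds
    and record a timestamp (a real agent would read /proc or similar)."""
    while not stop_event.wait(interval):
        samples.append(time.time())

def workload():
    """A CPU-bound stand-in for a real application."""
    total = 0
    for i in range(5_000_000):
        total += i * i
    return total

def timed(with_monitor, interval=0.1):
    """Time the workload, optionally with the sampler running alongside."""
    stop = threading.Event()
    samples = []
    if with_monitor:
        t = threading.Thread(target=sample_stats, args=(interval, stop, samples))
        t.start()
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    if with_monitor:
        stop.set()
        t.join()
    return elapsed

base = timed(False)
monitored = timed(True)
print(f"overhead: {(monitored - base) / base:.1%}")
```

A sampler that spends nearly all its time asleep adds very little to the workload's runtime, which is the point: run the comparison on your own applications and let the numbers decide whether the monitoring interval is too aggressive.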

The second question brings us back to a previous post I wrote about cluster vs. local monitoring.  When you monitor at finer granularity on a large cluster, you're going to overwhelm any centralized data collector, and the only solution I can think of is to send less data centrally.  In other words, can you configure your environment so that not all the data collected on an individual system is reported to the centralized monitor?  If so, you can continue to centrally monitor your cluster, but when you need more detailed data you can simply go to the individual systems on which it is being collected.
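The shape of that split is simple enough to sketch. Everything below is illustrative: `LocalCollector` and `send_central` are made-up names, and a real deployment would write the local log to disk and ship summaries over the network rather than calling a function:

```python
import statistics

class LocalCollector:
    """Sketch: keep every fine-grained sample locally, but forward only
    one summarized data point to the central monitor per window of
    `report_every` samples."""

    def __init__(self, report_every, send_central):
        self.report_every = report_every
        self.send_central = send_central   # hypothetical hook to the central monitor
        self.local_log = []                # full-resolution data stays on this node
        self._pending = []

    def record(self, value):
        self.local_log.append(value)       # everything is kept locally
        self._pending.append(value)
        if len(self._pending) == self.report_every:
            # the central side sees only one averaged point per window
            self.send_central(statistics.mean(self._pending))
            self._pending.clear()

central = []
c = LocalCollector(report_every=60, send_central=central.append)
for second in range(600):                  # ten minutes of 1-second samples
    c.record(100 if second < 300 else 0)   # the same bursty load as before

print(len(c.local_log), len(central))      # 600 local samples, 10 central points
```

The central collector handles a 60x smaller stream, so it scales to many more nodes, while the full-resolution history is still sitting on each system when you need to dig in.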

To my knowledge most monitoring environments can't do this!  Either they collect the data locally or they collect it centrally, but not both.  However, there is a solution, at the core of which is an open source tool I wrote a number of years ago called collectl.  Its primary focus is fine-grained local data collection in the 0.1% overhead range, but it can also send a subset of that data over the network to a centralized monitoring system, and in fact that is exactly what is done with HP's Cluster Management Utility, so with CMU you can centrally monitor multi-thousand-node clusters and still get at the details when you need them.  More on CMU in a future post.

As Alanna said in a previous post of hers, I'll be at the HP Technical Forum next week doing a couple of presentations on collectl and will also be doing some demos of CMU at one of our booths.  If you'll be there and this topic interests you, be sure to attend one of my presentations or just track me down at the booth and we can discuss this in as much detail as you like.

