Rethink BI : Business Insights over Business Intelligence
The purpose of this business insights thought leadership blog is to share the HP point of view on industry trends such as Big Data and Real Time Analytics, and provide updates on key innovations and solutions.

Big Data architecture in the New Style of IT—Part 3

In this third part of our series, we ask Greg Battas, CTO of Business Intelligence Solutions for HP Converged Systems, to help lay the groundwork for understanding exactly what is involved in Big Data processing. Last time, we discovered some common misconceptions about Hadoop clusters and conventional wisdom. In this post, we learn how moving the processing closer to the data is a critical step in a successful converged system solution.

 

Greg’s response sheds new light on some of the common misconceptions around processing. In the second part of this series, Greg reminded us how current application-centric data centers have grown up around the processing of single jobs, leading to a succession of computing nodes assigned to a single data cluster or application. His sense is that the demands of Big Data require a more converged system approach.

 

Where is the processing happening?

GB: Another good question that leads back to debunking some common wisdom: Big Data is all about bringing the processing to the data. There are a few misconceptions here we need to understand. The first is the common idea that Big Data is all about I/O. In other words, Big Data is all about getting the data off the disk. But in reality, once the data is off the disk there’s still a lot of work to do on it. Second, the common idea is to take a traditional SAN approach, but shipping every block from the disk across the SAN just doesn’t scale cost-effectively. And finally, there’s this notion that Big Data scales because the processing happens close to the data, by using internal DAS and shipping work to each node. But in reality, every time the data is read from the disk and shuffled, it’s no longer being processed locally.
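To make that last point concrete, here is a minimal Python sketch. It is not taken from Hadoop or any HP product: the four-node cluster, the `owner` partitioner (a CRC32 hash modulo the node count) and the sample records are all assumptions made for illustration. It simply counts how many records a group-by would have to move away from the node where they were originally stored.

```python
import zlib

NUM_NODES = 4  # assumption: a small hypothetical cluster


def owner(key, num_nodes=NUM_NODES):
    """Hypothetical partitioner: stable hash of the key, modulo the node count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def shuffle_stats(blocks_by_node):
    """Count records that stay on their storage node vs. those a group-by must move."""
    num_nodes = len(blocks_by_node)
    local = moved = 0
    for node, records in enumerate(blocks_by_node):
        for key, _value in records:
            if owner(key, num_nodes) == node:
                local += 1
            else:
                moved += 1
    return local, moved


# Made-up data: 400 (customer_id, amount) records spread over the four nodes
# in the order they happened to be stored.
blocks = [
    [(cid, 1) for cid in range(node, 400, NUM_NODES)]
    for node in range(NUM_NODES)
]

local, moved = shuffle_stats(blocks)
print(f"{local} records stay local, {moved} must cross the network")
```

With a partitioner that spreads keys evenly, one would expect most of the records in a setup like this to land on a different node from the one that stored them, which is the sense in which shuffled data is no longer processed locally.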

 

People in the Very Large Database (VLDB) communities have struggled with this for years. And the answer lies in optimizing the data. For example, say you are aggregating on “customer ID” and every time users don’t know an ID they enter a zero in the field. If you aggregate on that field, one node gets all of the work for the zero ID. So it turns out that sometimes it’s better to distribute the data and not process the data where it came from. VLDBs, such as those built on our Vertica product, are extremely good examples of how to distribute data, because they are adept at deciding what should be pushed down toward storage and what shouldn’t.
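The customer ID example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Vertica or Hadoop actually do it: the node count, the `route_naive` and `route_spread` functions and the sample rows are all hypothetical. The naive router sends every row for a given ID to one node; the alternative spreads rows for the placeholder ID across all nodes, lets each node compute partial sums, and merges the partials afterward.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # assumption: a small hypothetical cluster
HOT_ID = 0     # the placeholder users type when they don't know the customer ID


def route_naive(customer_id, row_number):
    """Send every row for a given customer ID to the same node."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % NUM_NODES


def route_spread(customer_id, row_number):
    """Spread rows for the placeholder ID round-robin; route other IDs as before."""
    if customer_id == HOT_ID:
        return row_number % NUM_NODES
    return route_naive(customer_id, row_number)


def aggregate(rows, route):
    """Two-phase sum: per-node partial sums, then a merge of the partials."""
    partials = [defaultdict(int) for _ in range(NUM_NODES)]
    load = [0] * NUM_NODES
    for row_number, (customer_id, amount) in enumerate(rows):
        node = route(customer_id, row_number)
        partials[node][customer_id] += amount
        load[node] += 1
    totals = defaultdict(int)
    for partial in partials:
        for customer_id, subtotal in partial.items():
            totals[customer_id] += subtotal
    return dict(totals), load


# Made-up data: 600 rows with the placeholder ID, 400 rows with real IDs.
rows = [(HOT_ID, 1)] * 600 + [(cid, 1) for cid in range(1, 401)]

naive_totals, naive_load = aggregate(rows, route_naive)
spread_totals, spread_load = aggregate(rows, route_spread)

print("rows per node, naive: ", naive_load)
print("rows per node, spread:", spread_load)
```

Both runs produce the same totals, but in the naive run one node handles all 600 placeholder rows on top of its normal share. The trade-off is the extra merge step, which is cheap here because each node forwards only one partial sum per ID.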

 

What we learned was that once the data is off the disk, there’s actually quite a bit of CPU power available to do analytics and other processing. So we’re working with our partners, Hadoop and others, and with our own products to see exactly where this work could best be done. We’re essentially asking: Does everything have to happen at the data? Or are there opportunities to open up other resources to do other kinds of jobs at the same time?

 

In our next post, we’ll shed some light on the current trend toward Software Defined Storage, and how more and more companies are choosing to move to industry standard servers running parallel file systems — rather than traditional storage arrays or databases.

 

Check out how HP Information Optimization solutions can help you Harness the Power of Big Data.

 

About Greg Battas

Greg’s background in solving business problems for customers, particularly in the retail, telecommunications and financial services sectors, and in product development for relational database management systems has played a critical role in helping bridge the gap between the viewpoints of IT and business decision-makers, and in explaining how to use technology to solve challenging organizational issues.

The opinions expressed above are the personal opinions of the authors, not of HP.