Should big data reside in the cloud?
Now that cloud is slowly evolving beyond the “peak of inflated expectations” on the Gartner hype cycle, it looks like a new hype is coming quickly. It’s called “big data.” Let’s first define the term, so we’re sure what we are talking about. According to Wikipedia, big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set. Examples include web logs; RFID; sensor networks; social networks; social data (due to the social data revolution); Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research; military surveillance; medical records; photography archives; video archives; and large-scale e-commerce.
Managing the data
Now, that’s clear. In fact, the world’s “digital universe” is adding 1.8 zettabytes in 2011 and continues to grow exponentially, with projections of 8 zettabytes in 2015 and 35 zettabytes in 2020. Seventy percent of that data is generated by individuals, and 85 percent of it is unstructured. We call that human information. Every second 97,000 tweets are added, every minute 12 million texts, and every day 294 million emails.
Did you know that today, a single commercial flight across the U.S. generates 240 terabytes (TB) of wireless sensor data? The key issue is no longer capturing the data, but actually storing it. In an ongoing project, we gather 1.35 TB in a 15-minute experiment using 1 million wireless sensors. With a stable 56MB wireless link, we need around 42 hours to gather and store the data, so new data transfer mechanisms and approaches need to be invented.
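To get a feel for why the transfer dominates, here is a minimal back-of-the-envelope sketch. The function name and the two readings of the link rate are assumptions made for illustration; the post does not say whether "56MB" means megabits or megabytes per second, and protocol overhead is ignored. The sketch is meant to show the order of magnitude rather than to reproduce the 42-hour figure.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to move size_bytes over a link sustaining the given bit rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 3600


# 1.35 TB from the experiment, assuming decimal terabytes (10**12 bytes).
DATA_BYTES = 1.35e12

# Assumption 1: the link rate is 56 megabits per second.
hours_if_megabits = transfer_hours(DATA_BYTES, 56e6)

# Assumption 2: the link rate is 56 megabytes per second.
hours_if_megabytes = transfer_hours(DATA_BYTES, 56e6 * 8)

print(f"At 56 Mbit/s: {hours_if_megabits:.1f} hours")
print(f"At 56 MB/s:   {hours_if_megabytes:.1f} hours")
```

Under either reading, moving the data takes far longer than the 15 minutes needed to generate it.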
Whether it is for disaster recovery, for back-up or to perform compute-intensive analytics or calculations, the time and cost of storing the data in the cloud are often forgotten. So, the first question to be raised is whether the data needs to be in the cloud in the first place, or whether a hybrid approach, integrating public cloud with enterprise IT resources (be it private cloud or legacy), should be taken instead. Let’s look at an example where we combine social networking data, already in the cloud, with enterprise information.
Understanding your customer behavior
If you want to know what the world thinks about your product, your brand, your services, you had better take a look at tweets, blog entries and forums. In the past, people moaned at the local bar about how bad a service was; today they do it on Twitter. You can no longer ignore that fact if you want to stay competitive.
A senior business and technology executive survey we commissioned showed us that enterprises typically only leverage 5 percent of the available information, that 48 percent do not have an effective information strategy in place and that only 2 percent can deliver the right information at the right time to support enterprise outcomes 100 percent of the time.
The social media data I talked about is located in the cloud, by definition. But ideally, companies want to cross-correlate this data with their own customer information. In fact, HP Labs did just that with their “project fusion.” They learned to predict customer behavior by merging social media and company data. Obviously, many larger companies are not interested in migrating all their customer data to the cloud, so it’s key to be able to integrate data from multiple sources into such a common analysis.
HP’s Approach
HP is conscious of the importance of searching not just through structured data, but also through the huge amount of unstructured data. By combining Vertica’s Analytic Platform, focused on the analysis of structured data, with Autonomy’s Meaning Based Computing approach, HP now offers an environment through which you can really understand what’s happening. The combination of multiple information sources allows you to keep the data where it is while taking full advantage of the information embedded in it. This is what we call the Human Information Era.
So, big data may still be hype, but the data is there and enterprises need to take that into account. Tools exist today, as we demonstrated with project fusion, and there is more to come. You really want to look at this because it may give you an unfair advantage in doing business. And if you don’t do it, your competitor might. That would be a real pity, wouldn’t it?
Although large sets of data exist and are getting larger by the minute for most organizations, transporting them between computing capabilities is still quite a bottleneck, even though bandwidth capabilities continue to expand.
Deciding where and how to store all the information can be a constraint on tapping into the nearly unlimited computing within our grasp. Once the data sits in a cloud site and begins to expand, it can be difficult and costly to move it to a new site or service provider.
How do these supply and demand issues affect the enterprise architecture and the cloud consumption governance processes?
Charlie, first of all, you are absolutely right, and the public cloud service providers often forget to highlight the cost and issues related to the transfer of large amounts of data. This is why I find it important to first look at the data sources and then identify where the processing is done. Networking costs are often greater than the cloud service consumption costs, which is ironic.
As far as architectures are concerned, a service model with loosely coupled services around data structures is probably the most flexible one. We have been experimenting with environments where the processing happens in the cloud, but agents transfer only the data needed for the specific processing to the cloud. So, part of the query is done on the systems that host the data. These queries are triggered from the cloud. Now, such an approach is not applicable to every solution, but it demonstrates that we need to become creative to address the multiple needs of having to manipulate extremely large datasets, wherever they are. Networking is indeed quickly becoming a bottleneck (again?).
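The idea of running part of the query next to the data can be sketched roughly as follows. This is only an illustration under stated assumptions: `Record`, `DataSiteAgent`, `run_partial_query` and `cloud_query` are hypothetical names invented for this example, not part of any HP product or of the environments mentioned above, and a real deployment would call the agents over the network instead of in-process.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Record:
    customer_id: str
    region: str
    amount: float


class DataSiteAgent:
    """Hypothetical agent that runs on the system hosting the data."""

    def __init__(self, records: Iterable[Record]) -> None:
        self._records: List[Record] = list(records)

    def run_partial_query(self, predicate: Callable[[Record], bool]) -> Dict[str, float]:
        # Filter and aggregate locally so that only small per-customer
        # totals, not the raw records, have to leave the site.
        totals: Dict[str, float] = {}
        for record in self._records:
            if predicate(record):
                totals[record.customer_id] = totals.get(record.customer_id, 0.0) + record.amount
        return totals


def cloud_query(agents: Iterable[DataSiteAgent],
                predicate: Callable[[Record], bool]) -> Dict[str, float]:
    """Cloud-side step: trigger the partial queries and merge their results."""
    combined: Dict[str, float] = {}
    for agent in agents:
        for customer_id, total in agent.run_partial_query(predicate).items():
            combined[customer_id] = combined.get(customer_id, 0.0) + total
    return combined


# Two data sites keep their raw records; the cloud sees only the aggregates.
site_a = DataSiteAgent([Record("c1", "EMEA", 100.0), Record("c2", "APJ", 40.0)])
site_b = DataSiteAgent([Record("c1", "EMEA", 25.0), Record("c3", "EMEA", 10.0)])
result = cloud_query([site_a, site_b], lambda r: r.region == "EMEA")
print(result)
```

The point of the design is that the volume crossing the network scales with the size of the answer, not with the size of the datasets held at each site.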
Christian, your post highlights yet another reason why today's CIO is stretched to establish the extent to which cloud computing should be adopted within the enterprise. It calls out some of the forces and counter-forces that the CIO needs to consider when determining the extent of cloud adoption from the perspective of big data.