Rethink BI : Business Insights over Business Intelligence
The purpose of this business insights thought leadership blog is to share the HP point of view on industry trends such as Big Data and Real Time Analytics, and provide updates on key innovations and solutions.

“Big Data” definitions may vary based on your perspective

In my previous blog entry, titled “What is big data and how big is BIG !”, I touched on how the definition of “big data” can vary with your point of view. This post extends that discussion.

 

IDC defines “big data” as data sets that are greater than 100 TB, growing at 60%+, drawn from at least two data sources, and typically running on a scale-out architecture.

 

Structured data

 

Companies try to mine data from internal sources (e.g. OLTP systems) as well as external sources to make strategic and tactical decisions for their business. In general, approximately 10% of the overall “big data pie” may be referred to as “structured big data”. The other 90% is unstructured or semi-structured data.

 

Over the past 20+ years, the industry has primarily focused on helping businesses store massive amounts of structured data in data warehouses and data marts. This structured data is then converted into information when users leverage various tools to analyze it.

 

Even if we assume that only 10% of a company’s data is structured, it is likely to be considered “big data” if you are a phone company, bank, stock exchange or broker. Each industry has its own type of structured big data, which may be in the range of tens or hundreds of terabytes.

 

It is also common for this structured OLTP data to provide the truest representation of a company’s current state and of its past and future trends.

 

[Figure: Structured data]

 

So this raises the question, “Is structured data considered big data?” My professional opinion is “YES!” And not only can structured data be big, it may contain the crown jewels of company-specific information waiting to be analyzed and mined.

 

Semi-Structured data

 

Semi-structured data is interesting because it is difficult to define. I personally feel that semi-structured data is data which could have been represented as structured data but, for various reasons, was not. Examples include XML files, spreadsheets, flat files in record format, etc.

 

For the most part, semi-structured data is easier than unstructured data to transform into a structured form, because of keywords, tags, known field positions, etc.
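
As a rough illustration of why tags and known field positions make this transformation easy, here is a minimal Python sketch. The XML element names, the fixed-width column layout and the sample records are all hypothetical, invented only for this example; a real feed would have its own layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML export; element and attribute names are assumptions.
XML_SAMPLE = """
<orders>
  <order id="1001"><customer>Acme</customer><amount>250.00</amount></order>
  <order id="1002"><customer>Globex</customer><amount>99.50</amount></order>
</orders>
"""

# Hypothetical fixed-width records:
# order id in columns 0-3, customer in columns 4-13, amount in columns 14-21.
FLAT_SAMPLE = [
    "1003Initech      42.00",
    "1004Umbrella   1300.25",
]


def rows_from_xml(xml_text):
    """Use the tags to pull each <order> element into a structured row."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "order_id": int(order.get("id")),
            "customer": order.findtext("customer"),
            "amount": float(order.findtext("amount")),
        }
        for order in root.iter("order")
    ]


def rows_from_flat(lines):
    """Use the known field positions to slice each record into a row."""
    return [
        {
            "order_id": int(line[0:4]),
            "customer": line[4:14].strip(),
            "amount": float(line[14:22]),
        }
        for line in lines
    ]


# Both sources end up in the same row shape, ready to load into a table.
rows = rows_from_xml(XML_SAMPLE) + rows_from_flat(FLAT_SAMPLE)
for row in rows:
    print(row)
```

Truly unstructured text has no tags or column positions to rely on, which is why it usually needs heavier processing first.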

 

[Figure: Semi-structured data]

 

Unstructured data

 

The final type of “big data” which I would like to talk about today is unstructured data. Much of this data may not be directly tied to a business’s operations and transactional history. However, it can provide a company with a great deal of insight into a population’s sentiment and its past, current and future behaviors.

 

As you may guess, unstructured data typically makes up 80% or more of all the data from which companies and governments hope to derive information. Until recently, it was expensive not only to store these massive amounts of information but also to acquire the right tools to digest, maintain and provide access to them.

 

[Figure: Unstructured data]

 

Today, semi-structured and unstructured data are frequently processed using Hadoop. One common method is to use MapReduce to transform semi-structured and unstructured data into structured data for easy access. Once this transformation takes place, end users, management, business analysts or “data scientists” can use a variety of tools to transform “data” into “information” to make strategic and tactical business decisions.
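
To make the MapReduce idea concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It does not use the Hadoop API; it only imitates the pattern in a single process. The log format, the sample lines and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical web-server log lines; the format is invented for this sketch.
LOG_LINES = [
    "2012-10-01 10:00:01 GET /products/42 200",
    "2012-10-01 10:00:02 GET /products/42 200",
    "2012-10-01 10:00:05 GET /cart 500",
    "corrupted line that does not match",
]

# date, time, HTTP method, path, three-digit status code
LINE_PATTERN = re.compile(r"^(\S+) \S+ (\S+) (\S+) (\d{3})$")


def map_phase(line):
    """Map: emit ((date, path, status), 1) for each line that parses."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return  # lines that do not parse are skipped
    date, _method, path, status = match.groups()
    yield (date, path, int(status)), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single structured row."""
    date, path, status = key
    return {"date": date, "path": path, "status": status, "hits": sum(values)}


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return [reduce_phase(key, values) for key, values in sorted(grouped.items())]


structured_rows = run_job(LOG_LINES)
for row in structured_rows:
    print(row)
```

In a real Hadoop job the map and reduce steps run in parallel across many nodes and the framework performs the shuffle, but the output has the same shape: rows that a structured database or BI tool can query.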

 

Hadoop may run in a Windows or Unix/Linux environment and may feed structured databases using connectors. Some of the best structured databases include Vertica and Microsoft SQL Server, the latter in either a scale-up configuration (Fast Track) or a scale-out configuration (HP EDW Appliance optimized for SQL Server Parallel Data Warehouse software).

 

Windows Hadoop “in house servers” or “Hadoop in the cloud (Azure)”

 

Several sources of great information include:

 

[Figure: Hadoop and EDW]

 

(source Microsoft.com)

 

For those interested in HP Solutions for Apache Hadoop using Unix/Linux, read the Converged Infrastructure News article titled “Unleash the power of Big Data”.

 

Finally, some organizations may be interested in even more complex analysis of video, audio, image and social networking data. In this case, it may be worthwhile to look into Autonomy and its infographic titled “What is big Data”.

 

Summary

 

In conclusion, I return to the question “What is big data?” My answer is still “it depends”: the definition can vary from customer to customer, from company to company, and from one government agency to another. However, “big data” typically falls into one of three basic areas: “structured”, “semi-structured” and “unstructured”.

 

[Figure: The 3 types of big data]

 

As of October 2012, readers of this blog may want to be aware that the HP EDW appliance scales from 15 TB to 610 TB and is optimized to support your structured database needs. HP EDW version 1.4 was recently announced to provide customers with better performance, easier scalability and updated EDW management tools:

 

You asked ! We heard ! - HP Enterprise Data Warehouse (V 1.4) optimized for Microsoft PDW (AU3.5)

 

AFTERTHOUGHT: Big data's 3Vs

 

Finally, before I sign off for today, I thought it would be appropriate to talk about the “3Vs”, a relatively new industry buzzword.

 

The “3Vs” typically refer to Volume, Variety and Velocity. However, here are some interesting thoughts regarding the 3Vs.

 

  • Big Data -- Why the 3Vs Just Don't Make Sense

 

Regardless of how you define the 3Vs, for the sake of my discussion I always assume that a customer’s architecture should be able to support ever-growing volumes of data, with a wide variety of data types, at varying input and query velocities.

 

It should also be noted that some people in the industry are starting to recognize two more Vs which should be added to Volume, Variety and Velocity:

 

  • “Value” – Big data tries to filter and keep only data which may be of value. In most cases, it is reasonable to ignore spam emails, fantasy football information and duplicate data. In addition, screening emails, blogs, reviews, etc. for credible data from credible sources contributes to a BI environment which contains valuable information.

 

  • “Veracity” – pertains to companies wanting a single version of the truth.
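
As a small illustration of the “Value” idea, the following Python sketch drops spam, non-credible sources and duplicates before data reaches the BI environment. The record fields, the `is_spam` flag and the list of credible sources are hypothetical; in practice they would come from your own classification and governance rules.

```python
# Hypothetical message records; field names and the spam flag are assumptions.
MESSAGES = [
    {"source": "customer_email", "text": "My order arrived damaged.", "is_spam": False},
    {"source": "customer_email", "text": "my order arrived  damaged. ", "is_spam": False},
    {"source": "unknown", "text": "WIN A FREE PHONE!!!", "is_spam": True},
    {"source": "product_review", "text": "Great battery life.", "is_spam": False},
]

# Hypothetical list of sources treated as credible.
CREDIBLE_SOURCES = {"customer_email", "product_review"}


def keep_valuable(messages):
    """Keep non-spam messages from credible sources, dropping duplicates."""
    seen = set()
    kept = []
    for message in messages:
        if message["is_spam"] or message["source"] not in CREDIBLE_SOURCES:
            continue
        # Normalize case and whitespace so near-identical copies collapse.
        fingerprint = " ".join(message["text"].lower().split())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(message)
    return kept


valuable = keep_valuable(MESSAGES)
for message in valuable:
    print(message["source"], "-", message["text"])
```

This only covers “Value”; “Veracity” is addressed by the database design rather than by filtering.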

 

In future blog posts I will address how a good BI infrastructure and a database design using conformed dimensions help maintain a single version of the truth in your BI environment.

 

Hub-and-spoke infrastructures supported by MPP, scale-out solutions are a great foundation for users trying to transform these massive quantities of data into information, using tools which range from simple reports to complex OLAP queries and advanced analytical data mining algorithms.

 

Upcoming post

 

My next post will discuss BI tools and end user workload profiles.

 

Through the course of my blog series I will discuss how the various HP Reference Architectures and Appliances will help you support your company’s BI requirements. The “What’s new” links, on the right side of the page, contain useful background information.

 

As usual, if any readers have thoughts or comments, please feel free to let me know.

 

PS: For easy access, here are links to my earlier blog entries, which together form a comprehensive series of BI best practice and architecture ideas:

 

1) Why spend time reading another blog on data business intelligence, data warehousing & analytics?

2) What is big data and how big is BIG !

3) You asked ! We Heard ! – HP Enterprise Data Warehouse (V1.4) optimized for SQL Server Parallel Data ...

 

 

About the Author
Jeff Spiller has over 30 years experience in architecting highly available and scalable multi-tier platforms for a variety of Fortune 500 co...


The opinions expressed above are the personal opinions of the authors, not of HP.