In my previous blog entry, “What is big data and how big is BIG!”, I touched on how the term “big data” can vary based on your point of view. This blog is an extension of that discussion.
IDC defines “big data” as data sets larger than 100 TB, growing at 60%+ per year, drawing on at least two data sources, and typically running on scale-out architecture.
Companies try to mine data from internal sources (e.g., OLTP systems) as well as external sources to make strategic and tactical decisions for their business. In general, approximately 10% of the overall “big data pie” may be referred to as “structured big data”; the other 90% is unstructured or semi-structured data.
Over the past 20+ years, the industry has primarily focused on helping businesses store massive amounts of structured data in data warehouses and data marts. This structured data is then converted into information when users leverage various tools to analyze it.
Even if we assume that only 10% of a company’s data is structured, it is likely to be considered “big data” if you are a phone company, bank, stock exchange, or brokerage. Each industry has its own type of structured big data, which may be in the tens or hundreds of terabytes.
It is also common for this structured OLTP data to provide the truest representation of a company’s current state and its past and future trends.
So this begs the question, “Is structured data considered big data?” My professional opinion is “YES!” And not only can structured data be big, it may contain the crown jewels of company-specific information waiting to be analyzed and mined.
Semi-structured data is interesting because it is difficult to define. I personally feel that semi-structured data is data that could have been represented as structured data, but was not for various reasons. Examples include XML files, spreadsheets, flat files in record format, etc.
For the most part, semi-structured data is easier than unstructured data to transform into structured form because of keywords, tags, known field positions, etc.
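To illustrate the point above, here is a minimal sketch (using hypothetical customer records, not any real data set) of how known tags make it straightforward to turn a semi-structured XML document into structured, tabular rows:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: customer records as XML.
xml_data = """
<customers>
  <customer id="1001"><name>Acme Corp</name><region>East</region></customer>
  <customer id="1002"><name>Globex</name><region>West</region></customer>
</customers>
"""

# Because the tags and field names are known, mapping each element
# to a structured (tabular) row is straightforward.
rows = [
    {"id": c.attrib["id"],
     "name": c.findtext("name"),
     "region": c.findtext("region")}
    for c in ET.fromstring(xml_data).iter("customer")
]

# Write the structured rows out as CSV, ready to load into a warehouse.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "name", "region"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Transforming free-form text (unstructured data) would require far more guesswork; here the tags do all the work.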
The final type of “big data” I would like to discuss today is unstructured data. Much of this data may not be directly tied to a business’s operations and transactional history, but it can provide a company with a great deal of insight into a population’s sentiment and its current, past, and future behaviors.
As you may guess, unstructured data typically makes up 80% or more of all the data from which companies and governments hope to derive information. Until recently, it was expensive not only to store these massive amounts of information but also to acquire the right tools to digest, maintain, and provide access to them.
Today, semi-structured and unstructured data is frequently processed using Hadoop. One common method is to use MapReduce to transform semi-structured and unstructured data into structured data for easy access. Once this transformation takes place, end users, management, business analysts, or “data scientists” can use a variety of tools to turn “data” into “information” to make strategic and tactical business decisions.
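Real Hadoop MapReduce jobs are typically written in Java or run via Hadoop Streaming; the sketch below simply simulates the map, shuffle, and reduce phases in plain Python, over hypothetical web-server log lines, to show how unstructured records become a structured result:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical unstructured input: raw web-server log lines.
log_lines = [
    "2012-10-01 GET /products/widget 200",
    "2012-10-01 GET /products/gadget 404",
    "2012-10-02 GET /products/widget 200",
]

# Map phase: emit a (key, value) pair per record -- here (page, 1).
def map_phase(line):
    _, _, page, status = line.split()
    if status == "200":          # keep only successful requests
        yield (page, 1)

# Shuffle: sort/group the mapped pairs by key, as Hadoop does between phases.
pairs = sorted(kv for line in log_lines for kv in map_phase(line))

# Reduce phase: sum the values for each key into a structured result.
hits = {page: sum(v for _, v in group)
        for page, group in groupby(pairs, key=itemgetter(0))}

print(hits)  # structured page-count output, ready to load into a database
```

The same map/shuffle/reduce pattern scales out across a Hadoop cluster, which is what makes it practical at big-data volumes.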
Hadoop may run in a Windows or Unix/Linux environment and may feed structured databases using connectors. Some of the best structured databases include Microsoft SQL Server, whether scale-up (Fast Track) or scale-out (the HP EDW Appliance optimized for SQL Server Parallel Data Warehouse software), Vertica, etc.
Hadoop on Windows can run on in-house servers or in the cloud (Azure).
Several sources of great information include:
- Microsoft's Windows Server implementation of Hadoop is in private preview
- Microsoft’s plan for Hadoop and big data
For those interested in HP solutions for Apache Hadoop on Unix/Linux, read the Converged Infrastructure News article titled “Unleash the power of Big Data”.
Finally, some organizations may be interested in even more complex analysis of video, audio, image, and social networking data. In this case, it may be worthwhile to look into Autonomy and its infographic titled “What is big Data”.
To conclude today’s blog, I return to the question “What is big data?” My answer is still “it depends”: the definition can vary from customer to customer, company to company, and government agency to government agency. However, “big data” typically falls into one of three basic areas: “structured”, “semi-structured”, and “unstructured”.
As of October 2012, readers of this blog may want to be aware that the HP EDW appliance scales from 15 TB to 610 TB and is optimized to support your structured database needs. Recently, HP EDW version 1.4 was announced, providing customers with better performance, easier scalability, and updated EDW management tools.
AFTERTHOUGHT: Big data’s 3Vs
Finally, before I sign off for today, I thought it would be appropriate to talk about the “3Vs”, a relatively new industry buzzword.
The “3Vs” typically refer to Volume, Variety, and Velocity. However, here are some interesting thoughts regarding them.
Regardless of how you define the 3Vs, for the sake of my discussion I always assume that a customer’s architecture should be able to support ever-exploding volumes of data, with wide varieties of data types, at varying input and query velocities.
It should also be noted that some people in the industry are starting to recognize two more Vs which should be added to Volume, Variety, and Velocity:
- “Value” – Big data efforts try to filter and keep only data which may be of value. In most cases, it is reasonable to ignore spam emails, fantasy football information, and duplicate data. In addition, screening emails, blogs, reviews, etc. for credible data from credible sources contributes to a BI environment which contains valuable information.
- “Veracity” – pertains to companies wanting a single version of the truth.
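The “Value” idea above can be sketched in a few lines. This is only an illustration with made-up records and a hypothetical credible-source whitelist, but it shows the shape of the filtering: drop exact duplicates and records from sources you do not trust before they ever reach the BI environment:

```python
# Hypothetical raw feed mixing credible, spam, and duplicate records.
records = [
    {"source": "news-wire", "text": "Q3 earnings up 12%"},
    {"source": "spam-list", "text": "You have won a prize!"},
    {"source": "news-wire", "text": "Q3 earnings up 12%"},  # exact duplicate
]

# A (hypothetical) whitelist of sources considered credible.
credible_sources = {"news-wire", "analyst-report"}

seen = set()
valuable = []
for rec in records:
    key = (rec["source"], rec["text"])
    # Keep a record only if its source is credible and we have not seen it.
    if rec["source"] in credible_sources and key not in seen:
        seen.add(key)
        valuable.append(rec)

print(valuable)  # only the unique, credible record survives
```

At big-data scale the same idea is applied with distributed joins and hashing rather than an in-memory set, but the principle of filtering for value is identical.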
In future blogs I will address how a good BI infrastructure and a database design using conformed dimensions help maintain a single version of the truth in your BI environment.
Hub-and-spoke infrastructures, supported by MPP scale-out solutions, are a great foundation for users trying to transform these massive quantities of data into information, using tools that range from simple reports to complex OLAP queries and advanced analytical data-mining algorithms.
My next post will discuss BI tools and end user workload profiles.
Throughout this blog series I will discuss how the various HP Reference Architectures and Appliances can help you support your company’s BI requirements. The “What’s new” links on the right side of the page contain useful background information.
As usual, if any readers have thoughts or comments, please feel free to let me know.
PS: For easy access to my blog entries, which together form a comprehensive series of BI best-practice and architecture ideas, here are the links: