By Joy Joseph Padamadan, Big Data Solutions Engineer, HP
As a first-timer at the Strata conference in Santa Clara last week, I was curious to see how it would be of most value to me in gaining additional insight into Big Data, especially around Hadoop and its ecosystem software stack (referred to simply as Hadoop from here on). With numerous companies showcasing innovative software and hardware technologies, the conference provides a feel for the Big Data community and its future direction, and an opportunity to be in touch with the real people behind the technologies.
I observed three distinct patterns around Hadoop at the conference and among the attendees:
- Hadoop adoption has accelerated.
- Every big data software vendor now realizes the disruptive power of Hadoop technologies.
- There were interesting glimpses into the future of Big Data.
In this post I’ll look at the first two points; I’ll address the future of Big Data in my next post on the subject.
1. Hadoop adoption has accelerated. Companies recognize Hadoop’s power, and they are getting more comfortable adopting Hadoop and its surrounding technologies. Hadoop has already proved that it “is” the solution for the Big Data problems of the present and the future, capable of dealing with zettabytes of data in a cost-effective manner. It has become a formidable force in the marketplace that no one can ignore any longer. It presents two clear options to existing vendors in the Big Data space: embrace it and work with it, or face gradual extinction. The fact that SAP HANA now wants to project its association with Hadoop as a software stack that gives a “complete picture of the enterprise” is the best example of this point. Intel’s Hadoop distribution, in which a chip company brings its processor-level optimization techniques to bear on speeding up Hadoop, is another example. (Intel’s success will depend on how much software value it adds to the Hadoop source, especially in real-time analytics, since many hardware-level improvements are not really relevant in a batch-processing environment.)
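To make the batch-processing model mentioned in that aside concrete, here is a minimal word-count sketch in the MapReduce style that Hadoop popularized. It is a pure-Python toy (the function names are my own, not part of any Hadoop API), meant only to show why a full pass over the data per job makes the model batch-oriented rather than real-time:

```python
from collections import defaultdict
from itertools import chain

# Toy MapReduce-style word count, mirroring Hadoop's batch model.
# Map phase: each input record (a line) is turned into (word, 1) pairs.
def map_phase(lines):
    return chain.from_iterable(
        ((word.lower(), 1) for word in line.split()) for line in lines
    )

# Shuffle phase: group intermediate pairs by key (the word).
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts collected for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "hadoop handles big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # every query is a fresh pass over all the data
```

Each answer requires a complete map-shuffle-reduce pass over the dataset, which is exactly the latency profile the real-time-analytics efforts discussed below are trying to escape.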
2. Every big data software vendor now realizes the disruptive power of Hadoop technologies. There are two major streams in which innovation is happening. Both are trying to achieve the same goal of real-time analytics, but they originated in two different worlds: unstructured and structured data.
- Hadoop-centric innovation. This stream is based mainly on open-source Hadoop distributions. It creates Big Data solutions, especially where the existing major vendors do not fit a company’s needs in terms of budget and scalability; Cloudera, Hortonworks, and Intel are good examples. One of the main challenges for these technologies is to create an underlying SQL runtime layer that can compete with the established SQL players in the market. SQL has a lot of clout in enterprise computing environments and their existing ecosystems, and a new SQL software layer on top of Hadoop promises to make the best of existing enterprise investments; Cloudera’s Impala is one example. When these distributions achieve their targeted performance for real-time analytics on scale-out environments, they will be very disruptive in the marketplace, not just in new markets but even in existing strongholds.
- Tight (HDFS) integration with Hadoop by existing database vendors and major MPP database companies. Prominent SQL analytical MPP databases are pursuing tight integration with HDFS as their data store in order to stay relevant in the market. Greenplum’s HAWQ and Teradata’s Aster Data are two prominent examples in this direction.
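The SQL-layer idea both streams share can be sketched in miniature: records that “live” outside the database get loaded and queried with plain SQL, so existing skills and tooling carry over unchanged. In this toy, Python’s built-in sqlite3 stands in for an engine like Impala or HAWQ (this is not their API, and the table and column names are invented for illustration):

```python
import sqlite3

# Miniature stand-in for a SQL-on-Hadoop engine: the raw records below
# play the role of lines scanned out of HDFS files, and sqlite3 plays
# the role of the SQL runtime layered on top of them.
raw_records = [
    ("web", 120), ("web", 80), ("mobile", 45),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (channel TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", raw_records)

# Familiar SQL applies unchanged; only the underlying storage differs.
rows = conn.execute(
    "SELECT channel, SUM(views) FROM page_views"
    " GROUP BY channel ORDER BY channel"
).fetchall()
print(rows)
```

The appeal for enterprises is exactly this: analysts keep writing the SQL they already know, while the data stays in cheap, scale-out HDFS storage.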
Both of the above streams still face major challenges in making HDFS-backed data stores meet transactional (ACID) properties. New technologies are still needed to keep up with growing data volumes and ever-larger clusters, and it will be interesting to see what compromises they have to make, in terms of the famous CAP theorem, to achieve large scale-out configurations. Maybe they will prove the CAP theorem wrong; who knows?!
There are a number of other NoSQL databases in the open-source world that prioritize different sides of the CAP trade-off. I had an opportunity to hear about the latest benchmark numbers for Couchbase (availability) and HBase (consistency), and they are getting better and better.
In the next post I’ll talk about the glimpses we were given into the future of Big Data, so please stay tuned.
For more information and to check out HP’s point of view on this subject please go to hp.com/go/information.