Innovation @ HP Labs


HP Labs and HP Vertica enhance R to simplify Big Data processing

Contributed by Indrajit Roy, HP Labs principal researcher and technical lead for the Distributed R project

 

Editor’s Note: Distributed R began at HP Labs as a summer internship project in 2011. Over the last three years, a dedicated team of HP Labs researchers and HP Vertica developers has continued to work on the project, developing the technology to the point where it has now been transferred to HP Vertica’s marketplace for commercial use.

 


From left to right: HP Labs researchers Vanish Talwar, Alvin AuYoung, Rob Schreiber, and Indrajit Roy. Not in the photo: interns Shivaram Venkataraman, Erik Bodzsar, and Kyungyong Lee.

 

 

Data scientists are key to unlocking actionable insights from data, a task that’s becoming increasingly complex as we tackle ever larger sets of both structured and unstructured information. At HP, we realize the need to empower data scientists in the ‘Big Data’ era. To that end, HP Vertica announced last month the debut of Distributed R, a platform developed in HP Labs to run complex machine learning, statistical analysis, and graph processing at Big Data scale.

 

Every data scientist has his or her favorite analysis tool. For the last decade, the statistical programming language R has been a popular choice: it’s open source and used by millions. However, R has multiple limitations when applied to Big Data. The main issue is that R does not scale well to large datasets and offers almost no parallel algorithms.

 

With Distributed R, we have overcome many of R’s limitations. Using the new platform, data scientists can continue to use the familiar R environment while benefiting from parallel algorithms and a scalable, high-performance environment. For data scientists unfamiliar with distributed programming, Distributed R simplifies how a cluster of servers can be used to complete analyses in a matter of minutes.

 

Distributed R started as an HP Labs summer internship project in 2011. Its aim was to run machine learning and graph algorithms on very large datasets: billions of records and terabytes of data. We succeeded in doing that and more, with the technology now being transferred to HP Vertica for commercial use.

 

HP customers can already use databases like HP Vertica to store and efficiently analyze data using SQL. With the addition of Distributed R, they can perform complex analyses on top of HP Vertica. For example, healthcare customers can use fast, ad-hoc queries in HP Vertica to perform patient analytics, discover business trends, and comply with regulations. To model patient health and predict complications, analysts may need to run clustering and classification algorithms that are not easily expressed in SQL. These algorithms can now be run using Distributed R.
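
To give a flavor of why clustering fits poorly in SQL but parallelizes well, here is a minimal Python sketch of k-means over partitioned data. It is not Distributed R code and does not use its API. The function names (`partial_stats`, `kmeans_step`, `kmeans`), the 2-D points, and the thread pool standing in for a cluster of servers are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(partition, centers):
    """For one data partition, return per-cluster [sum_x, sum_y, count]."""
    stats = [[0.0, 0.0, 0] for _ in centers]
    for x, y in partition:
        # Assign the point to its nearest center (squared Euclidean distance).
        nearest = min(
            range(len(centers)),
            key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
        )
        stats[nearest][0] += x
        stats[nearest][1] += y
        stats[nearest][2] += 1
    return stats


def kmeans_step(partitions, centers):
    """One iteration: process partitions in parallel, then combine the results."""
    # Hypothetical stand-in for workers on separate servers.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: partial_stats(p, centers), partitions))

    new_centers = []
    for i, old in enumerate(centers):
        sum_x = sum(r[i][0] for r in results)
        sum_y = sum(r[i][1] for r in results)
        count = sum(r[i][2] for r in results)
        # Keep the old center if no point was assigned to this cluster.
        new_centers.append((sum_x / count, sum_y / count) if count else old)
    return new_centers


def kmeans(partitions, centers, iterations=10):
    """Repeat the step a fixed number of times (no convergence check)."""
    for _ in range(iterations):
        centers = kmeans_step(partitions, centers)
    return centers


if __name__ == "__main__":
    # Two partitions, as if the data were split across two servers.
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans(partitions, centers=[(1.0, 1.0), (9.0, 9.0)], iterations=3))
```

Each partition only reports small per-cluster sums and counts, so the combining step stays cheap however large the partitions are. The repeated assign-and-update loop is the part that is awkward to express as a single SQL query.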

 

While Distributed R can be used as a standalone platform with any backend store, the combination of HP Vertica and Distributed R has multiple benefits. Users can perform SQL analysis and pre-processing in HP Vertica, do their complex modeling in Distributed R, and run predictions in-database. This integrated approach offers a convenient way to deploy and manage the full life-cycle of data analysis.

 

Distributed R is currently in beta and available for free on the HP Vertica marketplace. HP Vertica and HP Labs are working closely to improve the software and add more features.

 

Our vision is to continue to develop the system as an open platform for data mining. We look forward to community engagement and welcome your contributions as we develop Distributed R further.

 

Sign up for the webinar about HP Vertica Distributed R on March 11. 

 

Photography by Serge Vejvoda
