Contributed by Vanish Talwar, Principal Research Scientist, and Indrajit Roy, Senior Researcher in HP Labs’ Intelligent Infrastructure Lab
Editor’s note: This is another in a series of posts featuring research projects presented by speakers visiting HP Labs Palo Alto.
BlinkDB is an ongoing project at UC Berkeley's AMPLab that supports queries with bounded errors and bounded response times on very large data sets. Sameer Agarwal, the lead graduate student on the project, recently gave a talk on BlinkDB at HP Labs. Sameer is a fourth-year PhD student advised by Prof. Ion Stoica. He and his team have implemented BlinkDB as a sampling-based approximate query engine. It maintains a set of multi-dimensional, multi-resolution samples of the original data and updates them over time. When a query comes in, BlinkDB dynamically selects an appropriately sized sample on which to execute it. It makes this choice by generating an error-latency profile for the query across different sample sizes. Sameer presented various results showing the effectiveness of BlinkDB. In particular, on a 17TB trace, BlinkDB was 100x faster than Hive while keeping the error within 2-10%.
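To make the sample-selection step concrete, here is a minimal Python sketch of how an error-latency profile might drive the choice of sample. It is not BlinkDB's code: the ProfilePoint type, the pick_sample function, and all the numbers in the example profile are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProfilePoint:
    """One row of a hypothetical error-latency profile for a query."""
    sample_rows: int       # size of the stored sample
    est_error_pct: float   # estimated relative error on that sample
    est_latency_s: float   # estimated response time on that sample


def pick_sample(profile: List[ProfilePoint],
                max_error_pct: Optional[float] = None,
                max_latency_s: Optional[float] = None) -> Optional[ProfilePoint]:
    """Return a profile point that satisfies the given bounds, or None.

    With an error bound, prefer the smallest (fastest) sample that is
    accurate enough. With only a time bound, prefer the largest (most
    accurate) sample that is fast enough.
    """
    candidates = [
        p for p in profile
        if (max_error_pct is None or p.est_error_pct <= max_error_pct)
        and (max_latency_s is None or p.est_latency_s <= max_latency_s)
    ]
    if not candidates:
        return None
    if max_error_pct is not None:
        return min(candidates, key=lambda p: p.sample_rows)
    return max(candidates, key=lambda p: p.sample_rows)


# Illustrative profile only; these numbers are made up, not measured.
profile = [
    ProfilePoint(sample_rows=10_000, est_error_pct=10.0, est_latency_s=0.5),
    ProfilePoint(sample_rows=100_000, est_error_pct=3.2, est_latency_s=2.0),
    ProfilePoint(sample_rows=1_000_000, est_error_pct=1.0, est_latency_s=12.0),
]

print(pick_sample(profile, max_error_pct=5.0))
print(pick_sample(profile, max_latency_s=5.0))
```

In this sketch a query with no feasible sample simply gets None back; a real engine would need some fallback policy, which the talk summary does not describe.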
This approach supports interactive queries and is attractive in scenarios where perfect answers are not always needed. In such cases, approximation can return an answer within a user-provided time bound and/or within user-provided error bounds, trading query accuracy against response time.
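The accuracy side of that tradeoff can be illustrated with a small Python sketch that estimates a mean from samples of different sizes and reports an approximate 95% error bar. This is a simplification under stated assumptions: it uses plain uniform sampling over synthetic data rather than BlinkDB's multi-dimensional samples, and the approx_mean function is a hypothetical helper, not part of BlinkDB.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, rng, z=1.96):
    """Estimate the mean from a uniform random sample, with a ~95% error bar."""
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the mean; shrinks roughly as 1/sqrt(sample_size).
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, z * stderr


rng = random.Random(0)
data = [rng.uniform(0, 100) for _ in range(200_000)]  # synthetic data

for n in (100, 10_000):
    est, err = approx_mean(data, n, rng)
    print(f"n={n}: mean ~ {est:.2f} +/- {err:.2f}")
```

Because the error bar shrinks only with the square root of the sample size, cutting the error by 10x costs roughly 100x more rows to scan, which is why a modest error tolerance can buy a large reduction in response time.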
BlinkDB is available as open source at http://blinkdb.org. This is an ongoing research project, and more details are available on the project’s website.