Editor’s note: HP Labs researchers Kim Keeton and Brad Morrey give us an in-depth view of how an HP Labs technology is developed and then transferred into HP’s products.
By Principal Researcher Kim Keeton and Senior Research Scientist Brad Morrey, Intelligent Infrastructure Lab
On December 4 at HP Discover in Frankfurt, HP announced HP StoreAll with Express Query, a new storage product for archive data and object storage. StoreAll provides great scale (up to 1024 nodes and 16 petabytes of data) and low cost of storage (as low as $0.91/GB). Express Query adds a scalable metadata database that was developed by HP Labs. Express Query is integrated with StoreAll and automatically captures and stores file system attributes, like file size and last modified time. Additionally, Express Query allows users to tag files with custom attributes. For example, an animation studio could add attributes according to the movie title or the content of the file (“penguin”). Users can later efficiently find files that match custom or system attributes, using a RESTful API. This “organization by search” capability gives users the flexibility to manage space, backup or tiering policies based on metadata stored in Express Query.
Express Query is based on HP Labs’ research. Our early work in this space was aimed at managing the unstructured information in the enterprise. By extracting structured metadata from unstructured information like text documents and storing it in a centralized, searchable database, our goal was to enable new information management applications, particularly for knowledge workers and legal compliance. When our initial experiments found existing transactional and NoSQL databases lacking, we decided to develop our own database. The key requirements were the ability to ingest large amounts of data quickly with a scalable architecture, while simultaneously being able to satisfy a significant query workload, with the observation that queries to the ingested data could operate on slightly stale versions of that data.
We achieved these goals using a scalable, pipelined, distributed database that is optimized for high write throughput while requiring only modest computation and I/O resources. Its design decouples the update ingestion pipeline from the read query engine. Updates flow through the pipeline, where they are sorted and merged with older data. Queries read from the end of the pipeline by default, and can optionally read data from earlier stages to obtain more up-to-date results. Individual stages of the pipeline can be parallelized independently, providing scalability. For the storage engine we used DataSeries, an existing HP Labs technology developed in our group, because it uses in-memory indexes, data compression, and parallel reading and writing to reduce I/O requirements. In our 2012 EuroSys paper entitled “Trading Freshness for Performance in a Scalable Database”, we demonstrated that it was possible to trade off between the freshness of the query results and ingest performance in a controlled fashion, depending on the requirements of the use case.
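To make the decoupling of ingest and query concrete, here is a toy, single-process Python sketch of a three-stage pipeline in which a query chooses how far back into the pipeline it is willing to read. It is only an illustration of the idea described above, not the Express Query or DataSeries implementation: the class name `ToyPipeline`, its methods, and the `freshness` parameter are hypothetical names invented for this example, and the sketch ignores distribution, persistence, compression, and parallelism.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPipeline:
    """Toy three-stage store: ingested -> sorted -> merged (all names hypothetical)."""
    ingested: list = field(default_factory=list)     # stage 1: raw updates in arrival order
    sorted_runs: list = field(default_factory=list)  # stage 2: batches sorted by key
    merged: dict = field(default_factory=dict)       # stage 3: fully merged data

    def ingest(self, key, value):
        # Ingest is a cheap append; no sorting or merging happens on the write path.
        self.ingested.append((key, value))

    def sort_stage(self):
        # Move the raw batch into a sorted run. sorted() is stable, so updates
        # to the same key keep their arrival order within the run.
        if self.ingested:
            self.sorted_runs.append(sorted(self.ingested, key=lambda kv: kv[0]))
            self.ingested = []

    def merge_stage(self):
        # Fold sorted runs into the merged data, oldest run first,
        # so the newest update for a key wins.
        for run in self.sorted_runs:
            for key, value in run:
                self.merged[key] = value
        self.sorted_runs = []

    def query(self, key, freshness="merged"):
        """Look up key. freshness is 'merged', 'sorted', or 'ingested';
        later values consult more pipeline stages and return fresher data."""
        result = self.merged.get(key)
        if freshness in ("sorted", "ingested"):
            for run in self.sorted_runs:
                for k, v in run:
                    if k == key:
                        result = v
        if freshness == "ingested":
            for k, v in self.ingested:
                if k == key:
                    result = v
        return result
```

In this sketch, an update that has been ingested but not yet merged is invisible to a default query and visible to one issued with `freshness="ingested"`, at the cost of scanning unsorted data. That is the same kind of freshness-for-performance choice the paragraph above describes, reduced to its simplest form.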
As we finished our prototype, we realized that the technology was useful for applications beyond our initial use cases. Working with HP Storage, we identified the need for a metadata store for their scalable archive file and object offering, and saw a natural match. We hardened our research prototype, developed a query generator to translate RESTful API requests into efficient queries, and worked closely with worldwide HP Storage teams to build the ecosystem needed to release StoreAll.
Express Query works well in the context of StoreAll because it can keep up with the very high ingestion rates required for tracking all metadata operations in a scalable file system, and because it provides an engine for gaining insight into what is in the system in ways that are not possible today. As a simple example, we compared running a Linux “find” command on a file system with 500 million files, to find the small set (~2000) of files that were modified in the last 4 hours, with the equivalent Express Query request. The find command takes 42 hours, while the Express Query request takes only 1.4 seconds, roughly 100,000X faster! Express Query makes what was otherwise impractical, practical: users can interactively issue queries to understand what is occurring in their system.
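To see why the “find” approach scales poorly, it helps to look at what a scan of this kind has to do. The Python sketch below is a rough stand-in for a find-style search for recently modified files; the exact find command used in the comparison is not given here, and the function name `recently_modified` and its parameters are hypothetical. The point is that the scan must visit and stat every file, so its cost grows with the total number of files rather than with the number of matches.

```python
import os
import time


def recently_modified(root, max_age_seconds, now=None):
    """Return paths under root modified within the last max_age_seconds."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    matches = []
    # os.walk visits every directory; each file then costs one stat call,
    # regardless of how few files end up matching.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                matches.append(path)
    return matches
```

Calling `recently_modified("/some/root", 4 * 3600)` would mirror the 4-hour window in the example above. As a check on the quoted speedup, 42 hours is 151,200 seconds, and 151,200 / 1.4 is 108,000, which is consistent with "roughly 100,000X".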
Industry analysts have questioned HP’s ability to innovate from within. Our research experiences and partnership with HP Storage to build StoreAll with Express Query are proving them wrong. We look forward to finding future opportunities to leverage our research to provide further wins for HP.