When it comes to Data warehousing, Facebook is seen as a company breaking the sound barrier. They have the largest Data warehouse on the planet, with 300+ petabytes of data and growing. It comes as no surprise that they keep innovating new solutions to speed up and solve analytic and storage problems at scale. But they clearly don't want to walk that road alone. On Wednesday of this week, Facebook announced that they have open sourced Presto, a distributed query engine optimized for low-latency interactive analytics that performs up to 10x faster than Apache Hive. Like Hadoop Hive, at its core it is a SQL engine that leverages the Hadoop file system (HDFS); however, unlike Hive, it does not build on the Hadoop MapReduce engine.

The original announcement of Presto was made back in June 2013 at the WebScale Conference by Martin Traverso, a member of the Data Infrastructure team at Facebook. This was one of the many signals highlighting the critical need for a Big Data warehouse and processing engine that was not limited by batch MapReduce. Numerous solutions like Presto have been released that provide more 'real-time' and ad hoc query support. The goal of Facebook's original announcement was to bring on a few key companies early to help refine the Presto code base before making it open source. Based on the contributor list on Ohloh, companies like AirBnB, DropBox, Google and BlinkDB have already contributed to the Presto project. AirBnB has also stated publicly that they are in production with Presto.

Martin describes Presto as an extensible platform designed to integrate with other systems, such as existing data warehouses, through a plug-in architecture that includes a Hive plugin. There are three sets of APIs that developers can implement to extend Presto: the metadata API, the data location API, which specifies where physical data resides, and the data streaming API, which fetches data from a data source. Their next area of focus will be on approximate queries, which aligns with the work at UC Berkeley's AMPLab and BlinkDB. The goal is to accelerate queries by 10-100x by carefully selecting samples based on other queries on the system.
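To make the plug-in idea concrete, here is a minimal sketch of what a connector wired through those three API layers might look like. The interface and method names below are hypothetical stand-ins for illustration, not Presto's actual SPI, and the "connector" simply serves rows from memory:

```java
import java.util.List;
import java.util.Map;

public class ConnectorSketch {
    // Metadata API (hypothetical): enumerate tables and their columns.
    interface MetadataApi {
        List<String> listTables();
        List<String> getColumns(String table);
    }

    // Data location API (hypothetical): map a table to the physical
    // splits (chunks of data) that workers can scan in parallel.
    interface DataLocationApi {
        List<String> getSplits(String table);
    }

    // Data streaming API (hypothetical): fetch the rows for one split.
    interface DataStreamingApi {
        List<Map<String, Object>> readSplit(String split);
    }

    // A toy in-memory connector implementing all three layers.
    static class InMemoryConnector
            implements MetadataApi, DataLocationApi, DataStreamingApi {
        @Override public List<String> listTables() {
            return List.of("events");
        }
        @Override public List<String> getColumns(String table) {
            return List.of("user_id", "ts");
        }
        @Override public List<String> getSplits(String table) {
            return List.of(table + ":part-0", table + ":part-1");
        }
        @Override public List<Map<String, Object>> readSplit(String split) {
            return List.of(Map.of("user_id", 42, "ts", 1700000000L));
        }
    }

    public static void main(String[] args) {
        InMemoryConnector c = new InMemoryConnector();
        // A coordinator-like loop: discover tables, locate splits, stream rows.
        for (String table : c.listTables()) {
            System.out.println(table + " columns=" + c.getColumns(table));
            for (String split : c.getSplits(table)) {
                System.out.println("  " + split
                        + " rows=" + c.readSplit(split).size());
            }
        }
    }
}
```

The separation is the point: because metadata, data location, and data streaming are independent layers, a plugin can expose any storage system (Hive, a key-value store, a log service) to the same query engine.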

It remains to be seen how this tool will impact the overall Big Data ecosystem and, more specifically, the Hadoop ecosystem. What is clear is that the Big Data ecosystem is growing, and tool consolidation will occur over the course of the next few years. At this juncture, there are enough tools in the ecosystem to make one's head spin. In the Hadoop ecosystem alone, you have numerous SQL-on-Hadoop options designed to provide 'real-time' SQL support for Hadoop. In 2013 alone, numerous SQL-on-Hadoop products were released, including Cloudera Impala, Pivotal HAWQ and Hortonworks' Stinger. And more tools are expected from companies like IBM, as well as from additional Apache projects like Drill, sponsored by MapR.

Meanwhile, Amazon has its lightning-fast, petabyte-scale Data warehouse in the sky. While Redshift is not based on Hadoop, some folks claim it is much faster than Hadoop Hive; however, it comes with a heavier price tag. And don't forget Shark, which is now being adopted by companies like Yahoo. Shark is an in-memory SQL engine that claims to be 30x faster than Hadoop Hive. So why all the fuss?

Ultimately, Presto, like many other SQL-on-Hadoop tools, shows that SQL can be leveraged on top of HDFS and that HDFS is pushing closer to becoming a truly real-time engine. Each of these technologies will either find its niche, improve the current ecosystem or fade away. Another key takeaway from this project is the speed at which innovation is occurring. Martin responded to a thread on YCombinator with the following: "We started the project with 4 full-time engineers and added a fifth person recently. We began working on Presto in August last year and deployed our first working version in January, and we've been iterating on it ever since." In June, Martin showed some stats on just how much data they were processing with Presto:

  • 250 PB of data in their DW
  • 1000s of machines, geographically distributed
  • 850 internal users/day
  • 27K queries/day
  • 320 TB scanned/day
  • 3 regional clusters scaled to 1000 nodes
  • 4-7x more CPU-efficient than Hive, 8-10x faster than Hive

Net net, all these new projects/products are competing to own what promises to be a lucrative and booming new market designed to process and analyze large-scale data in real-time. Companies like Amazon and Cloudera are gaining the most ground because they make these complex technologies easily accessible to the Enterprise. The success of Presto will be driven by the software companies that package it up and make it easily available for use in the Enterprise.
Reference:

The prestodb.io project

GitHub Presto Contributors on Ohloh

YCombinator Discussion

Google Dremel – A Distributed SQL Database that Scales

A discussion on Quora on the differences of Presto and Shark