Original author: Martin Traverso (Facebook)
Background
Facebook is a data-driven company. Data processing and analytics are at the core of how Facebook builds and delivers products for its more than one billion active users. We operate one of the largest data warehouses in the world, storing more than 300 PB of data. This data is used by a wide range of workloads, including traditional batch processing, graph-based analysis [1], machine learning, and real-time interactive analysis.
For the analysts, data scientists, and engineers who process and analyze this data to continuously improve our products, the query performance of the data warehouse is crucial: running more queries in a given time frame and getting results faster directly improves their productivity.
The data in Facebook's data warehouse is stored on several large Hadoop HDFS clusters. Hadoop MapReduce [2] and Hive were designed for large-scale, reliable computation and are optimized for overall system throughput. But as our warehouse grew to the petabyte scale and our needs evolved, it became clear that we needed an interactive, low-latency query system that could operate directly on our warehouse data.
In the fall of 2012, a team from Facebook’s Data Infrastructure department began addressing this issue for our data warehouse users. After evaluating some external projects, we found that they were either too immature or unable to meet our requirements for flexibility and scale. Therefore, we decided to build Presto, a new system capable of performing interactive queries on petabyte-scale data.
In this article, we will briefly introduce Presto’s architecture, current status, and future prospects.
Architecture
Presto is a distributed SQL query engine specifically designed for high-speed, real-time data analysis. It supports standard ANSI SQL, including complex queries, aggregation, joins, and window functions.
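The SQL constructs listed above can be illustrated with any ANSI-SQL engine. The sketch below uses Python's built-in sqlite3 module (window functions require SQLite 3.25 or later) against a hypothetical `orders` table; it is only meant to show the kind of aggregation and window-function queries Presto supports, not Presto itself.

```python
import sqlite3

# In-memory database standing in for a warehouse table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('east', 10), ('east', 30), ('west', 20), ('west', 50);
""")

# Aggregation combined with window functions -- standard ANSI SQL features.
rows = conn.execute("""
    SELECT region,
           amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY region, rnk
""").fetchall()

for row in rows:
    print(row)
```

Each output row carries both the raw value and its per-region total and rank, computed in a single pass over the table.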
The simplified system architecture of Presto is shown in the diagram below. A client sends SQL to the Presto coordinator, which parses, analyzes, and plans the query. The scheduler wires together the execution pipeline, assigns tasks to the nodes closest to the data, and monitors execution progress. The client pulls results from the output stage, which in turn pulls data from the underlying stages.
Presto’s operational model fundamentally differs from Hive or MapReduce. Hive translates queries into multi-stage MapReduce tasks, running them sequentially. Each task reads input data from disk and writes intermediate results back to disk. However, the Presto engine does not use MapReduce. Instead, it uses a custom query and execution engine with operators tailored to support SQL syntax. In addition to improved scheduling algorithms, all data processing occurs in memory. Different processing nodes form a pipeline via the network, avoiding unnecessary disk I/O and additional latency. This pipelined execution model runs multiple data processing stages simultaneously, transferring data from one stage to the next as soon as it becomes available, significantly reducing end-to-end response times for various queries.
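The pipelined model described above can be sketched with Python generators: each stage passes rows downstream as soon as they are produced, instead of materializing intermediate results to disk the way sequential MapReduce stages do. The stage names here are illustrative, not Presto's actual operators.

```python
def scan(rows):
    """Source stage: stream rows one at a time (in Presto, from a connector)."""
    for row in rows:
        yield row

def filter_stage(rows, predicate):
    """Intermediate stage: pass matching rows downstream immediately."""
    for row in rows:
        if predicate(row):
            yield row

def aggregate(rows):
    """Sink stage: consume the stream and fold it into a single result."""
    total = 0
    for row in rows:
        total += row["amount"]
    return total

# The stages form a pipeline over generators: no stage waits for the
# previous one to finish, and nothing is written to disk in between.
data = [{"user": "a", "amount": 10},
        {"user": "b", "amount": 25},
        {"user": "a", "amount": 5}]
pipeline = filter_stage(scan(data), lambda r: r["user"] == "a")
result = aggregate(pipeline)
print(result)  # 15
```

In a real distributed engine the stages run on different nodes and rows flow over the network, but the key property is the same: data moves to the next stage as soon as it is available.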
Presto is implemented in Java primarily due to its high development efficiency, excellent ecosystem, and ease of integration with other Java applications in Facebook’s data infrastructure. Presto dynamically compiles parts of the query plan into JVM bytecode and allows the JVM to optimize and generate native machine code. By carefully managing memory and data structures, Presto avoids common issues related to memory allocation and garbage collection in typical Java programs. (In a subsequent article, we will share tips and tricks for developing high-performance Java systems, along with lessons learned during the construction of Presto.)
Scalability was another key consideration in designing Presto. Early in the project, we realized that beyond HDFS, large amounts of data would be stored in many other types of systems. Some are well-known systems like HBase, while others are custom backends like Facebook News Feed. Presto has been designed with a simple abstraction layer for data storage to enable SQL querying across different data storage systems. Storage plugins (connectors) only need to provide interfaces for metadata extraction, data location retrieval, and data acquisition operations. Besides our primary Hive/HDFS backend system, we have also developed connectors for other systems, including HBase, Scribe, and custom-developed systems.
(Translator's Note: Scribe is an open-source project from Facebook that can aggregate log files generated by large numbers of servers in real-time into the file system. For more information, see: https://github.com/facebook/scribe)
(Translator's Note: Based on currently available information, Presto's approach to distributed data processing differs significantly from Hortonworks' Stinger, which builds on MapReduce 2.0, and may be closer to Google's Dremel or Cloudera's Impala.)
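The storage abstraction described above asks a connector for only three kinds of operations: metadata, data locations, and data access. The interface below is a hypothetical simplification in Python, not Presto's actual Java connector SPI.

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Hypothetical sketch of a storage plugin: metadata, locations, data."""

    @abstractmethod
    def get_metadata(self, table):
        """Return the schema (column names) of a table."""

    @abstractmethod
    def get_splits(self, table):
        """Return the data locations/partitions the scheduler assigns to nodes."""

    @abstractmethod
    def read_split(self, split):
        """Yield the rows belonging to one split."""

class InMemoryConnector(Connector):
    """Toy backend: each 'table' is a dict with a schema and a list of parts."""

    def __init__(self, tables):
        self.tables = tables

    def get_metadata(self, table):
        return list(self.tables[table]["schema"])

    def get_splits(self, table):
        return [(table, i) for i in range(len(self.tables[table]["parts"]))]

    def read_split(self, split):
        table, i = split
        yield from self.tables[table]["parts"][i]

# Usage: an engine first asks for metadata and splits, then reads each split.
conn = InMemoryConnector({
    "events": {"schema": ["id", "type"],
               "parts": [[(1, "click")], [(2, "view"), (3, "click")]]},
})
rows = [row for s in conn.get_splits("events") for row in conn.read_split(s)]
print(rows)  # [(1, 'click'), (2, 'view'), (3, 'click')]
```

Because the engine only sees this narrow interface, the same SQL front end can query HDFS, HBase, or a custom backend by swapping the connector implementation.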
Current Status
As introduced above, the development of Presto began in the fall of 2012. Our first production system went live in early 2013. By spring 2013, the system was rolled out across the entire company. Since then, Presto has become the primary system for interactive analytics on the data warehouse within the company. It has been deployed in multiple regions, and we have successfully scaled a cluster to 1,000 nodes. More than 1,000 employees use the system daily, running over 30,000 queries on a petabyte of data each day.
Presto is more than 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries. It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, subqueries, and most common aggregate and scalar functions, as well as approximate distinct counting (using HyperLogLog) and approximate percentiles (based on quantile digests). The main restrictions at this stage are a size limit on joined tables and on the cardinality of unique keys/groups. The system also cannot yet write query results back to tables (results are streamed directly back to the client).
(Translator's Note: When operating on big data, approximate algorithms based on statistical methods are often used. The HyperLogLog algorithm estimates the number of distinct values (the cardinality) in a large dataset; for more details, see this blog post. The quantile digest algorithm and its applications can be found in this blog post.)
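Approximate distinct counting in the HyperLogLog style can be sketched in a few lines: hash each value, use the first p bits of the hash to pick a register, and keep the maximum count of leading zeros seen in the remaining bits. The estimator below is a minimal illustration (it includes only the small-range correction, not the full algorithm's bias corrections) and is not Presto's implementation.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch: 2**p registers, error ~1.04/sqrt(2**p)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: the first p bits choose a register; the remaining
        # bits are ranked by their number of leading zeros plus one.
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:  # small-range (linear counting) fix
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the distinct count
print(round(hll.estimate()))  # roughly 10,000, within a few percent
```

The appeal for warehouse-scale queries is that the sketch uses a fixed, tiny amount of memory (here 1,024 registers) regardless of how many rows are scanned.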
Future Prospects
We are actively working to expand Presto's functionality and improve its performance. In the coming months, we will remove the size limits on joins and aggregations and add the ability to write query results to output tables. We are also developing a query accelerator, including a new data format optimized for query processing that avoids unnecessary data transformations. The accelerator will cache frequently used datasets from the backend warehouse, so the system can transparently use the cache to speed up queries without users needing to be aware of the caching mechanism. In addition, we are developing a high-performance HBase connector.
Open Source
At the Analytics @ WebScale conference in June 2013, we introduced Presto for the first time. Since then, it has garnered significant external attention. In recent months, we have released Presto's source code and executable packages to some external companies. They have successfully deployed and tested it in their environments, providing us with valuable feedback.
Today, we are excited to announce that we are making Presto an open-source project. You can find the source code and documentation on the following websites. We would love to hear about your use cases and how Presto can assist with your interactive analytics needs.
Presto Official Website: http://prestodb.io/
Presto GitHub Page: https://github.com/facebook/presto
The Presto team within Facebook's Data Infrastructure consists of the following members: Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang, Nileema Shingte, and Ravi Murthy.
Links
[1] Scaling Apache Giraph to a trillion edges. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
[2] Under the hood: Scheduling MapReduce jobs more efficiently with Corona https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
[3] Video of the Presto talk at the Analytics @ WebScale conference, June 2013. https://www.facebook.com/photo.php?v=10202463462128185