Confidential Use Only – Do Not Share
David Phillips
Software Engineer
Facebook
Presto: Fast SQL on Everything
What is Presto?
• Open source distributed SQL query engine
• ANSI SQL compliant
• Originally developed by Facebook
• Used in production at many well-known companies
Commercial Offerings
Notable Characteristics
• Adaptive multi-tenant system
• Run hundreds of concurrent queries on thousands of nodes
• Extensible, federated design
• Plugins provide connectors, functions, types, security
• Flexible design supports many different use cases
• High performance
• Many optimizations, code generation, long-lived JVM
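To make the plugin model concrete, here is a minimal sketch of an extension point in Java. The interface and method names are simplified stand-ins for illustration, not the exact Presto SPI.

```java
import java.util.List;

// Simplified stand-in for an engine plugin SPI (illustrative; not the real Presto interfaces).
// A single plugin can contribute connectors, functions, types, and security modules.
interface EnginePlugin {
    default List<ConnectorFactory> getConnectorFactories() { return List.of(); }
    default List<Class<?>> getFunctions() { return List.of(); }
    default List<TypeDefinition> getTypes() { return List.of(); }
    default List<AccessControlFactory> getAccessControlFactories() { return List.of(); }
}

// Marker types for the sketch.
interface ConnectorFactory {}
interface TypeDefinition {}
interface AccessControlFactory {}

// A hypothetical plugin exposing one connector and one class of scalar functions.
class ExamplePlugin implements EnginePlugin {
    @Override
    public List<ConnectorFactory> getConnectorFactories() {
        return List.of(new ConnectorFactory() {});
    }

    @Override
    public List<Class<?>> getFunctions() {
        return List.of(ExampleFunctions.class);
    }
}

class ExampleFunctions {
    // Scalar functions would typically be registered from static methods like this one.
    public static long doubleIt(long x) {
        return 2 * x;
    }
}
```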
Use Cases at Facebook
Interactive Analytics
• Facebook has a massive multi-tenant data warehouse
• Employees need to quickly analyze small data (~50GB-3TB)
• Visualizations, dashboards, notebooks, BI tools
• Clusters run 50-100 concurrent queries w/ diverse shapes
• Queries usually execute in seconds or minutes
• Users are latency sensitive
• Fast queries improve productivity; slow queries block their work
Batch ETL
• Populate and process data in the warehouse
• Jobs are scheduled using a workflow management system
• Similar to Azkaban or Airflow
• Manages dependencies between jobs
• Queries are typically written by data engineers
• More expensive in CPU and data volume than Interactive
• Throughput and efficiency more important than latency
A/B Testing
• Evaluate product changes via statistical hypothesis testing
• Results need to be available in hours (not days)
• Data must be complete and accurate
• Arbitrary slice and dice at interactive latency (~5-30s)
• Cannot pre-aggregate data, must compute results on the fly
• Producing results requires joining multiple large data sets
• Web interface generates restricted query shapes
App Analytics
• External-user-facing custom reporting tools
• Facebook Analytics offers analytics to application developers
• Web interface generates small set of query shapes
• Highly selective queries over large aggregate data volumes
• Application developers can only access their own data
• Very strict latency requirements (~100ms-5s)
• Highly available, hundreds of concurrent queries
System Design
[Architecture diagram: a Coordinator containing a Queue, Planner/Optimizer, Scheduler, Metadata API, and Data Location API accepts the query and returns results; multiple Workers, each with a Processor and a Data Source API, exchange data with one another and read from an External Storage System.]
Predicate Pushdown
• Engine provides connectors with a two-part constraint:
1. Domain of values: ranges and nullability
2. “Black box” predicate for filtering
• Connectors report the domain they can guarantee
• Engine can elide redundant filtering
• Optimizer can make further use of this information
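A minimal sketch of the two-part constraint, using simplified types rather than the engine's actual SPI: the connector receives per-column value ranges plus an opaque row predicate, and reports the portion of the domain it can guarantee so the engine can elide redundant filters.

```java
import java.util.Map;
import java.util.function.Predicate;

// Illustrative sketch of a two-part pushdown constraint (simplified types, not the real SPI).
record Range(long low, long high, boolean nullAllowed) {}

record Constraint(
        Map<String, Range> domains,          // part 1: per-column ranges and nullability
        Predicate<Map<String, Object>> rows  // part 2: "black box" predicate over a row
) {}

class ExampleConnectorScan {
    // The connector inspects the constraint, prunes partitions/files where it can,
    // and reports the portion of the domain it guarantees on the data it returns.
    Map<String, Range> applyAndGuarantee(Constraint constraint) {
        Range partkey = constraint.domains().get("partkey");
        if (partkey != null) {
            // e.g. skip files whose min/max statistics fall outside [low, high] ...
            return Map.of("partkey", partkey); // guaranteed: engine can elide this filter
        }
        return Map.of(); // nothing guaranteed: engine keeps its own filter
    }
}
```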
Data Layouts
• Optimizer takes advantage of physical layout of data
• Properties: partitioning, sorting, grouping, indexes
• Tables can have multiple layouts with different properties
• Layouts can have a subset of columns or data
• Optimizer chooses best layout for query
• Tune queries by adding new physical layouts
[Figure: two query plans side by side. The original plan, without any data layout properties, uses five stages with partitioned shuffles and partial/final aggregation around a hash-based LeftJoin. The optimized plan, using data layout properties, collapses to two stages: a single aggregation and a LeftJoin over a local shuffle, feeding the output through a collecting shuffle.]
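As a rough illustration of how layout selection might work (all names hypothetical): score each available layout by whether its partitioning matches the query's join keys and its sort order matches what the query needs, then pick the best.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of layout selection: prefer layouts whose physical properties
// let the optimizer drop shuffles or sorts for this query.
record TableLayout(String name, Set<String> partitionedBy, List<String> sortedBy) {}

class LayoutChooser {
    TableLayout choose(List<TableLayout> layouts, Set<String> joinKeys, List<String> requiredOrder) {
        return layouts.stream()
                .max(Comparator.comparingInt(l -> score(l, joinKeys, requiredOrder)))
                .orElseThrow();
    }

    private int score(TableLayout layout, Set<String> joinKeys, List<String> requiredOrder) {
        int score = 0;
        if (layout.partitionedBy().equals(joinKeys)) {
            score += 2;   // co-partitioned on the join keys: no remote shuffle needed
        }
        if (layout.sortedBy().equals(requiredOrder)) {
            score += 1;   // already sorted: no sort operator needed
        }
        return score;
    }
}
```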
Pre-computing Hashes
• Computing hashes can be expensive
• Especially for strings or complex types
• Push computation to the lowest level of the plan tree
• Re-use for aggregations, joins, local or remote shuffles
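A minimal sketch of the idea with hypothetical classes: hash the key once, right above the scan, and carry the result as an extra column so shuffles, joins, and aggregations reuse it instead of rehashing.

```java
import java.util.Arrays;

// Sketch: compute the key hash once near the scan and carry it as an extra column,
// so shuffles, joins, and aggregations reuse it instead of rehashing the key.
class HashProjectedPage {
    final String[] keys;   // original key column (hashing strings repeatedly is expensive)
    final long[] keyHash;  // precomputed hash column, produced at the bottom of the plan

    HashProjectedPage(String[] keys) {
        this.keys = keys;
        this.keyHash = Arrays.stream(keys).mapToLong(HashProjectedPage::hash).toArray();
    }

    // Downstream operators just read keyHash[row]:
    int shufflePartition(int row, int partitionCount) {
        return (int) Math.floorMod(keyHash[row], (long) partitionCount);
    }

    long hashTableBucket(int row, int bucketCount) {
        return Math.floorMod(keyHash[row], (long) bucketCount);
    }

    private static long hash(String key) {
        // stand-in hash function; the engine uses its own hash per type
        return key.hashCode() * 0x9E3779B97F4A7C15L;
    }
}
```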
Intra-node Parallelism
• Use multiple threads on a single node
• More efficient than parallelism across nodes
• Little latency overhead
• Efficiently share state (e.g., hash tables) between threads
• Needed due to skew or table transforms
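A rough sketch of the shared-state benefit, with hypothetical classes: the build side constructs one hash table per node, and all probe threads on that node read it instead of each building their own copy.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: one hash table is built per node, then shared read-only by many probe threads.
class SharedJoinBuild {
    private final Map<Long, String> buildSide = new HashMap<>();

    void add(long key, String value) {
        buildSide.put(key, value); // build phase completes before any probing starts
    }

    String probe(long key) {
        return buildSide.get(key); // read-only after build: safe to share across threads
    }
}

class LocalParallelProbe {
    void run(SharedJoinBuild build, List<long[]> probeSplits) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(probeSplits.size());
        for (long[] split : probeSplits) {
            pool.submit(() -> {
                for (long key : split) {
                    build.probe(key); // every thread probes the same table; no per-thread copy
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```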
[Figure: physical execution plan mapping operators (Scan, Filter, Hash, LocalShuffle, HashBuild, LookupJoin, HashAggregate) onto Pipelines 0-2 within the tasks of Stages 0 and 1. Pipeline 1 is parallelized across multiple threads.]
Stage Scheduling
• Two scheduling policies:
1. All-at-once: minimize latency
2. Phased: minimize resource usage
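A toy sketch of the two policies (hypothetical types, not the engine's scheduler): all-at-once starts every stage immediately, while phased starts a stage only once the stages that feed it are running, so resources are not held by stages that cannot make progress yet.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of the two stage-scheduling policies; assumes the stages form a valid DAG.
record Stage(int id, Set<Integer> upstream) {}   // upstream = stages this stage consumes from

class StageScheduler {
    // All-at-once: start everything immediately (minimizes latency).
    void allAtOnce(List<Stage> stages) {
        stages.forEach(this::start);
    }

    // Phased: only start a stage once every stage feeding it is already running,
    // so idle downstream stages do not hold memory and threads.
    void phased(List<Stage> stages) {
        Set<Integer> running = new HashSet<>();
        while (running.size() < stages.size()) {
            for (Stage stage : stages) {
                if (!running.contains(stage.id()) && running.containsAll(stage.upstream())) {
                    start(stage);
                    running.add(stage.id());
                }
            }
        }
    }

    private void start(Stage stage) {
        System.out.println("starting stage " + stage.id());
    }
}
```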
Split Scheduling
• Splits are enumerated as the query executes, not up front
• For Hive, both partition metadata and discovering files
• Start executing immediately
• Queries often finish early (LIMIT or interactive)
• Reduces metadata memory usage on coordinator
• Splits are assigned to the worker with the shortest queue
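A minimal sketch of the assignment rule with hypothetical classes: as splits are discovered, each one goes to whichever worker currently has the fewest queued splits.

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: splits are discovered lazily and each is assigned to the worker
// with the shortest queue at that moment.
record Split(String path) {}

class Worker {
    final String name;
    final AtomicInteger queuedSplits = new AtomicInteger();

    Worker(String name) {
        this.name = name;
    }
}

class SplitAssigner {
    void assign(Iterable<Split> discoveredSplits, List<Worker> workers) {
        for (Split split : discoveredSplits) {            // enumeration happens during execution
            Worker target = workers.stream()
                    .min(Comparator.comparingInt(w -> w.queuedSplits.get()))
                    .orElseThrow();
            target.queuedSplits.incrementAndGet();        // enqueue on the least-loaded worker
            System.out.println(split.path() + " -> " + target.name);
        }
    }
}
```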
Operating on Compressed Data
• Process dictionaries directly instead of values
• Shared dictionaries can be larger than rows
• Use heuristics to determine if speculation is working
• Hash table creation takes advantage of dictionaries
• Joins can produce dictionary encoded data
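A sketch of dictionary-aware execution over a simplified block type: apply a function to each distinct dictionary value once, then reuse the per-row indices, instead of evaluating the function once per row.

```java
// Sketch: evaluate a function over the dictionary once, then remap indices,
// instead of evaluating it once per row.
class DictionaryBlockSketch {
    final String[] dictionary; // distinct values, often shared across pages
    final int[] indices;       // one entry per row, pointing into the dictionary

    DictionaryBlockSketch(String[] dictionary, int[] indices) {
        this.dictionary = dictionary;
        this.indices = indices;
    }

    // upper(col): compute per dictionary entry (|dictionary| calls), reuse for all rows.
    DictionaryBlockSketch upper() {
        String[] transformed = new String[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            transformed[i] = dictionary[i].toUpperCase();
        }
        return new DictionaryBlockSketch(transformed, indices); // indices are reused as-is
    }

    // Hash table creation can also hash each dictionary entry once
    // and look up the precomputed hash by index for every row.
    long[] rowHashes() {
        long[] dictHash = new long[dictionary.length];
        for (int i = 0; i < dictionary.length; i++) {
            dictHash[i] = dictionary[i].hashCode();
        }
        long[] hashes = new long[indices.length];
        for (int row = 0; row < indices.length; row++) {
            hashes[row] = dictHash[indices[row]];
        }
        return hashes;
    }
}
```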
Page Layout in Memory
[Figure: two pages in memory. Page 0 holds partkey as a LongBlock of eight values, returnflag as an RLEBlock ("F" x 8), and shipinstruct as a DictionaryBlock whose indices point into the dictionary {0: "IN PERSON", 1: "COD", 2: "RETURN", 3: "NONE"}. Page 1 holds seven rows with the same three encodings (returnflag is "O" x 7), with its shipinstruct DictionaryBlock referencing a shared dictionary.]
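To make the figure concrete, here are simplified sketches of the three column encodings; these are illustrative shapes, not the engine's actual Block classes.

```java
// Simplified sketches of the three column encodings shown above.

// Flat encoding: one long per row (e.g. partkey).
record LongBlockSketch(long[] values) {}

// Run-length encoding: a single value repeated for positionCount rows (e.g. returnflag = "F" x 8).
record RleBlockSketch(String value, int positionCount) {}

// Dictionary encoding: distinct values plus an index per row (e.g. shipinstruct).
record DictionaryColumn(String[] dictionary, int[] indices) {
    String row(int i) {
        return dictionary[indices[i]];
    }
}

class PageSketch {
    // A page is a set of blocks, one per column, all with the same row count.
    final LongBlockSketch partkey;
    final RleBlockSketch returnflag;
    final DictionaryColumn shipinstruct;

    PageSketch(LongBlockSketch partkey, RleBlockSketch returnflag, DictionaryColumn shipinstruct) {
        this.partkey = partkey;
        this.returnflag = returnflag;
        this.shipinstruct = shipinstruct;
    }
}
```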
Writer Scaling
• Write performance dominated by concurrency
• Too few writers make the query slow
• Too many writers create small files
• Expensive to read later (metadata, IO, latency)
• Inefficient for storage system
• Add writers as needed when producer buffers are full, as
long as data written exceeds a configured threshold
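A hedged sketch of the scale-up rule, with hypothetical fields and thresholds: add a writer only when the pipeline is back-pressured and the data already written justifies another output file.

```java
// Sketch of the writer scale-up rule: only add a writer when the pipeline is
// back-pressured AND existing writers have each written enough data, so the
// query does not produce many small files.
class WriterScaler {
    private final long minBytesPerWriterBeforeScaling; // configured threshold
    private int writerCount = 1;

    WriterScaler(long minBytesPerWriterBeforeScaling) {
        this.minBytesPerWriterBeforeScaling = minBytesPerWriterBeforeScaling;
    }

    int maybeScale(boolean producerBuffersFull, long totalBytesWritten) {
        boolean writersAreBottleneck = producerBuffersFull;
        boolean filesWillStayLarge =
                totalBytesWritten >= (long) writerCount * minBytesPerWriterBeforeScaling;
        if (writersAreBottleneck && filesWillStayLarge) {
            writerCount++; // add one writer task on another node
        }
        return writerCount;
    }
}
```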
Code Generation
• SQL → JVM bytecode → machine code
• Filter, project, sort comparators, aggregations
• Auto-vectorization, branch prediction, register use
• Eliminate virtual calls and allow inlining
• Profile each task independently based on data processed
• Avoid profile pollution across tasks and queries
• Profile can change during execution as data changes
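To show why generation helps, here is a hand-written Java equivalent of what specialized code for the query SELECT x + 1 FROM t WHERE x > 10 might look like (illustrative only): a tight loop over primitives with no virtual expression-tree calls, which the JIT can inline and auto-vectorize.

```java
// Illustrative hand-written equivalent of generated code for: SELECT x + 1 FROM t WHERE x > 10
// No virtual calls into an expression tree: just a tight loop over primitives
// that the JIT can inline, vectorize, and branch-predict well.
class GeneratedFilterProject {
    int process(long[] x, long[] output) {
        int outRows = 0;
        for (int row = 0; row < x.length; row++) {
            if (x[row] > 10) {                   // filter, specialized to this query's constant and type
                output[outRows++] = x[row] + 1;  // projection, specialized the same way
            }
        }
        return outRows;
    }
}

// An interpreted alternative would walk an expression tree per row, paying a
// virtual call per node; the generated version eliminates those calls.
```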
CPU Time Improvements for Bytecode Generation
[Chart: average CPU time in seconds (0-7000) for Baseline, 1 Transform, 2 Transforms, and 3 Transforms, comparing generated bytecode against a naïve interpreted implementation.]
Fault Tolerance
• Node crash causes query failure
• In practice, failures are rare, even on large clusters
• Checkpointing or other recovery mechanisms have a cost
• Re-run failures rather than making everything expensive
• Limit runtime to a few hours to reduce waste and latency
• Clients retry on failure
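A minimal sketch of the client-side stance, using a hypothetical client interface: rather than checkpointing inside the engine, the client simply re-submits the query a bounded number of times.

```java
import java.util.List;

// Sketch: no engine-level checkpointing; the client just re-submits on failure.
interface QueryClient {
    List<List<Object>> execute(String sql) throws Exception; // hypothetical client interface
}

class RetryingRunner {
    List<List<Object>> run(QueryClient client, String sql, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return client.execute(sql);   // queries are limited to a few hours,
            } catch (Exception e) {           // so a retry wastes only bounded work
                last = e;
            }
        }
        throw last;
    }
}
```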