Internals of Presto Service
Taro L. Saito, Treasure Data
leo@treasure-data.com
March 11-12th, 2015
Treasure Data Tech Talk #1 at Tokyo
Taro L. Saito @taroleo
•  2007: Ph.D., University of Tokyo
–  XML DBMS, Transaction Processing
•  Relational-Style XML Query [SIGMOD 2008]
•  ~ 2014 Assistant Professor at University of Tokyo
–  Genome Science Research
•  Distributed Computing, Personal Genome Analysis
•  March 2014 ~ Treasure Data
–  Software Engineer, MPP Team Leader
•  Open source projects at GitHub
–  snappy-java, msgpack-java, sqlite-jdbc
–  sbt-pack, sbt-sonatype, larray
–  silk
•  Distributed workflow engine
2
Treasure Data
•  TD API / Web Console
•  Hive: batch query
•  Presto: interactive query (via the td-presto connector)
•  PlazmaDB: MessagePack Columnar Storage
What is Presto?
•  A distributed SQL Engine developed by Facebook
–  For interactive analysis on petabyte-scale datasets
•  As a replacement of Hive
–  Nov. 2013: Open sourced at GitHub
•  Presto
–  Written in Java
–  In-memory query layer
–  CPU efficient for ad-hoc analysis
–  Based on ANSI SQL
–  Isolation of query layer and storage access layer
•  A connector provides data access (reading schema and records)
4
Presto: Distributed SQL Engine
5
TD Presto has its own query retry mechanism
Hive: tailored to throughput, fault tolerant
Presto: CPU-intensive, faster response time
Treasure Data: Presto as a Service
6
Presto Public
Release
Topics
•  Challenges in providing Database as a Service
•  TD Presto Connector
–  Optimizing Scan Performance
–  Multi-tenancy Cluster Management
•  Resource allocation
•  Monitoring
•  Query Tuning
7
Optimizing Scan Performance
•  Fully utilize the network bandwidth from S3
•  TD Presto becomes the CPU bottleneck
•  Components of the scan pipeline:
–  TableScanOperator: holds the s3 file list and table schema; pulls records
–  Request Queue: priority queue with a max connections limit; enforces a buffer size limit and reuses allocated buffers (release(Buffer))
–  S3 / RiakCS: stores MPC1 files (Header, Column Block 0 with column names, Column Blocks 1 … m); GET requests are retried on 500 (internal error), 503 (slow down), and 404 (not found, due to eventual consistency)
–  HeaderReader: reads the header with a callback to HeaderParser
–  HeaderParser: parses the MPC file header (column block offsets, column names)
–  ColumnBlockReader: issues column block requests and prepares a MessageUnpacker per block
–  MessageUnpacker: decompresses each S3 read buffer; msgpack-java v07
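The retry policy on this slide can be sketched as follows. This is an illustrative sketch, not TD's actual connector code: the `Get` interface, the backoff parameters, and the retry count are all assumptions.

```java
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the S3 GET retry policy described above (assumed parameters):
// retry on 500 (internal error), 503 (slow down), and 404 (not found, which
// can be transient under eventual consistency), with exponential backoff.
public class S3Retry {
    public interface Get { int run() throws IOException; } // returns HTTP status

    static boolean isRetryable(int status) {
        return status == 500 || status == 503 || status == 404;
    }

    public static int getWithRetry(Get get, int maxRetries) throws Exception {
        long backoffMillis = 100;
        for (int attempt = 0; ; attempt++) {
            int status = get.run();
            if (!isRetryable(status) || attempt == maxRetries) return status;
            // Back off with jitter so that many readers do not retry in lockstep.
            Thread.sleep(backoffMillis + ThreadLocalRandom.current().nextLong(backoffMillis));
            backoffMillis *= 2;
        }
    }

    public static void main(String[] args) throws Exception {
        int[] responses = {503, 503, 200}; // two transient failures, then success
        int[] i = {0};
        int status = getWithRetry(() -> responses[i[0]++], 5);
        System.out.println(status); // prints 200
    }
}
```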
MessageBuffer
•  msgpack-java v06 was the bottleneck
–  Inefficient buffer access
•  v07
–  Fast memory access via sun.misc.Unsafe
•  Direct access to heap memory
•  Extracts primitive type values from byte[] by cast
•  No boxing
9
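The kind of Unsafe-based access described above can be illustrated with a minimal sketch. This is not msgpack-java's actual code; the class and method names are made up, and running it requires access to sun.misc.Unsafe (available on the standard JDK, with a warning on newer releases).

```java
import java.lang.reflect.Field;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

public class UnsafeRead {
    private static final Unsafe UNSAFE;
    private static final long BYTE_ARRAY_OFFSET;
    static {
        try {
            // sun.misc.Unsafe cannot be constructed; grab the singleton via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Read a 4-byte big-endian int directly from the byte array in one memory
    // access: no per-byte bounds checks, no intermediate objects, no boxing.
    public static int getInt(byte[] buf, int index) {
        int v = UNSAFE.getInt(buf, BYTE_ARRAY_OFFSET + index);
        // Unsafe reads in native byte order; MessagePack is big-endian,
        // so swap on little-endian platforms.
        return ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN
                ? v : Integer.reverseBytes(v);
    }

    public static void main(String[] args) {
        byte[] buf = {0x00, 0x00, 0x01, 0x00}; // big-endian 256
        System.out.println(getInt(buf, 0)); // prints 256
    }
}
```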
Unsafe memory access performance is comparable to C
•  http://frsyuki.hatenablog.com/entry/2014/03/12/155231
10
Why is ByteBuffer slow?
•  Following good programming manners
–  Define an interface, then implement classes
•  The ByteBuffer interface has HeapByteBuffer and DirectByteBuffer implementations
•  In reality: TypeProfile slows down method access
–  JVM generates look-up table of method implementations
–  Simply importing one or more classes generates TypeProfile
•  v07 avoids TypeProfile generation
–  Loads the implementation class through reflection
11
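The reflection trick above can be sketched as follows. The class names are hypothetical, not msgpack-java's: the point is only that when a single implementation class is ever loaded, the JIT sees a monomorphic call site and can inline it, instead of consulting a TypeProfile lookup table.

```java
public class MonomorphicLoad {
    public interface Buffer { int getInt(int index); }

    public static class HeapBuffer implements Buffer {
        private final byte[] data;
        public HeapBuffer(byte[] data) { this.data = data; }
        public int getInt(int i) {
            return ((data[i] & 0xff) << 24) | ((data[i + 1] & 0xff) << 16)
                 | ((data[i + 2] & 0xff) << 8) | (data[i + 3] & 0xff);
        }
    }

    // Choose the single implementation by name at runtime. Because the other
    // implementation (e.g. an Unsafe-based variant) is never referenced in
    // source, it is never linked, and Buffer.getInt stays monomorphic.
    public static Buffer newBuffer(byte[] data) throws Exception {
        String impl = MonomorphicLoad.class.getName() + "$HeapBuffer";
        return (Buffer) Class.forName(impl)
                .getConstructor(byte[].class).newInstance((Object) data);
    }

    public static void main(String[] args) throws Exception {
        Buffer b = newBuffer(new byte[]{0, 0, 0, 42});
        System.out.println(b.getInt(0)); // prints 42
    }
}
```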
Format Type Detection
•  MessageUnpacker
–  read prefix: 1 byte
–  detect format type
•  switch-case
–  ANTLR generates this type of code
12
Format Type Detection
•  Using cache-efficient lookup table: 20000x faster
13
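A cache-efficient 256-entry table indexed by the prefix byte looks roughly like this. The format ranges follow the published MessagePack specification; the table construction is a sketch, not msgpack-java's actual code, and only a subset of formats is shown.

```java
public class FormatTable {
    public enum Format { POSFIXINT, FIXMAP, FIXARRAY, FIXSTR, NIL, BOOLEAN, NEGFIXINT, OTHER }

    // One array load replaces a long switch over bit patterns. The table is
    // 256 entries, so it fits comfortably in L1 cache.
    static final Format[] TABLE = new Format[256];
    static {
        for (int b = 0; b < 256; b++) {
            Format f;
            if (b <= 0x7f)                   f = Format.POSFIXINT; // 0xxxxxxx
            else if (b <= 0x8f)              f = Format.FIXMAP;    // 1000xxxx
            else if (b <= 0x9f)              f = Format.FIXARRAY;  // 1001xxxx
            else if (b <= 0xbf)              f = Format.FIXSTR;    // 101xxxxx
            else if (b == 0xc0)              f = Format.NIL;
            else if (b == 0xc2 || b == 0xc3) f = Format.BOOLEAN;
            else if (b >= 0xe0)              f = Format.NEGFIXINT; // 111xxxxx
            else                             f = Format.OTHER;     // int/float/str/bin/ext...
            TABLE[b] = f;
        }
    }

    public static Format valueOf(byte prefix) { return TABLE[prefix & 0xff]; }

    public static void main(String[] args) {
        System.out.println(valueOf((byte) 0xc0)); // prints NIL
        System.out.println(valueOf((byte) 0x05)); // prints POSFIXINT
    }
}
```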
2x performance improvement in v07
14
Database As A Service
15
Claremont Report on Database Research
•  Discussion on the future of DBMS
–  Among top researchers, vendors, and practitioners
–  CACM, Vol. 52 No. 6, 2009
•  Predicts emergence of Cloud Data
Service
–  SQL has an important role
•  limited functionality
•  suited for service provider
–  A difficult example: Spark
•  Needs a secure application container to run arbitrary Scala code
16
Beckman Report on Database Research
•  2013
–  http://beckman.cs.wisc.edu/beckman-report2013.pdf
–  Topics of Big-Data
•  End-to-end service
–  From data collection to knowledge
•  Cloud Service has become popular
–  IaaS, PaaS, SaaS
–  The challenge is to migrate all of the functionality of a DBMS into the cloud
17
Big Data Simplified: The Treasure Data Approach
•  Data sources: Mobile SDKs, Web SDK, Embedded SDKs, Server-side Agents (Treasure Agent), AppServers
•  Multi-structured events: register, login, start_event, purchase, etc.
–  App log data, mobile event data, sensor data, telemetry
•  Infinite & economical cloud data store; familiar & table-oriented
•  SQL-based ad-hoc queries and SQL-based dashboards
•  Results pushed to DBs & data marts and other apps
18
Challenges in Database as a Service
•  Tradeoffs
–  Cost and service level objectives (SLOs)
•  Reference
–  Workload Management for Big Data Analytics. A. Aboulnaga
[SIGMOD2013 Tutorial]
19
•  Run each query set on an independent cluster: fast, but $$$
•  Run all queries together on the smallest possible cluster: reasonable price, but only a limited performance guarantee
Shift of Presto Query Usage
•  Initial phase
–  Trial and error with queries
•  Many syntax errors, semantic errors
•  Next phase
–  Scheduled query execution
•  Increased Presto query usage
–  Some customers submit more than 1,000 Presto queries / day
–  Establishing typical query patterns
•  hourly, daily reports
•  query templates
•  Advanced phase: More elaborate data analysis
–  Complex queries
•  via data scientists and data analysts
–  High resource usage
20
Usage Shift: Simple to Complex queries
21
Monitoring Presto Usage with Fluentd
22
Hive
Presto
DataDog
•  Monitoring CPU, memory and network usage
•  Query stats
23
Query Collection in TD
•  SQL query logs
–  query, detailed query plan, elapsed time, processed rows, etc.
•  Presto is used for analyzing the query history
24
Daily/Hourly Query Usage
25
Query Running Time
•  More than 90% of queries finish within 2 min.
–  the expected response time for interactive queries
26
Processed Rows of Queries
27
Performance
•  Processed rows / sec. of a query
28
Collecting Recoverable Error Patterns
•  Presto has no fault tolerance
•  Error types
–  User error
•  Syntax errors
–  SQL syntax, missing function
•  Semantic errors
–  missing tables/columns
–  Insufficient resource
•  Exceeded task memory size
–  Internal failure
•  I/O error
–  S3/Riak CS
•  worker failure
•  etc.
29
TD Presto retries
these queries
Query Retry on Internal Errors
•  More than 99.8% of queries finish without errors
30
Query Retry on Internal Errors (log scale)
•  Queries succeed eventually
31
Multi-tenancy: Resource Allocation
•  Price-plan based resource allocation
•  Parameters
–  The number of worker nodes to use (min-candidates)
–  The number of hash partitions (initial-hash-partitions)
–  The maximum number of running tasks per account
•  If running queries exceed the allowed number of tasks, subsequent queries must wait (queued)
•  Presto: SqlQueryExecution class
–  Controls query execution state: planning -> running -> finished
•  No resource allocation policy
–  The extended TDSqlQueryExecution class monitors running tasks and limits resource usage
•  Rewriting SqlQueryExecutionFactory at run-time by using ASM library
32
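The per-account task limit described above can be sketched with a fair semaphore. This is a hypothetical illustration of the policy, not TD's actual TDSqlQueryExecution code: the class name and the permit-per-task model are assumptions.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of per-account task limiting: each account gets a
// semaphore sized by its price plan; a query's tasks must acquire permits
// before running, otherwise the query waits in the queue.
public class AccountTaskLimiter {
    private final Semaphore permits;

    public AccountTaskLimiter(int maxRunningTasks) {
        // fair = true: waiting queries acquire permits in FIFO order.
        this.permits = new Semaphore(maxRunningTasks, true);
    }

    public boolean tryStartTask() { return permits.tryAcquire(); }

    public void finishTask() { permits.release(); }

    public static void main(String[] args) {
        AccountTaskLimiter limiter = new AccountTaskLimiter(2);
        System.out.println(limiter.tryStartTask()); // true
        System.out.println(limiter.tryStartTask()); // true
        System.out.println(limiter.tryStartTask()); // false: account at its limit
        limiter.finishTask();
        System.out.println(limiter.tryStartTask()); // true again
    }
}
```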
Query Queue
•  Presto 0.97
–  Introduces user-wise query queues
•  Can limit the number of concurrent queries per user
•  Problem
–  Running too many queries delays overall query
performance
33
Customer Feedback
•  Feedback from a customer:
–  We don’t care if large queries take a long time
–  But interactive queries should run immediately
•  Challenges
–  How do we allocate resources when preceding queries already occupy the customer’s share?
–  How do we know whether a submitted query is an interactive one?
34
Admission control is necessary
•  Adjust resource utilization
–  Running Drivers (Splits)
–  MPL (Multi-Programming Level)
35
Challenge: Auto Scaling
•  Setting the cluster size based on the peak usage is expensive
•  But predicting customer usage is difficult
36
Typical Query Patterns [Li Juang]
•  Q: What are typical queries of a customer?
–  Customers feel that some queries are slow
–  But we don’t know what to compare with, except scheduled queries
•  Approach: Clustering Customer SQLs
•  TF/IDF measure: TF x IDF vector
–  Split SQL statements into tokens
–  Term frequency (TF) = the number of occurrences of each term in a query
–  Inverse document frequency (IDF) = log(# of queries / # of queries that contain the token)
•  k-means clustering
–  TF/IDF vector
–  Generates clusters of similar queries
•  x-means clustering for deciding number of clusters automatically
–  D. Pelleg [ICML2000]
37
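The TF/IDF step above can be sketched as follows. This is a minimal illustration of the vectorization only (the k-means/x-means step is omitted), with a deliberately crude tokenizer; it is not TD's actual clustering code.

```java
import java.util.*;

public class SqlTfIdf {
    // Tokenize a SQL statement very crudely: lowercase runs of word characters.
    static List<String> tokenize(String sql) {
        List<String> tokens = new ArrayList<>();
        for (String t : sql.toLowerCase().split("[^a-z0-9_]+"))
            if (!t.isEmpty()) tokens.add(t);
        return tokens;
    }

    // One TF-IDF vector per query: tf = raw count of the token in the query,
    // idf = log(N / number of queries containing the token).
    public static List<Map<String, Double>> vectorize(List<String> queries) {
        Map<String, Integer> df = new HashMap<>(); // document frequency
        List<List<String>> tokenized = new ArrayList<>();
        for (String q : queries) {
            List<String> toks = tokenize(q);
            tokenized.add(toks);
            for (String t : new HashSet<>(toks)) df.merge(t, 1, Integer::sum);
        }
        int n = queries.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> toks : tokenized) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : toks) tf.merge(t, 1, Integer::sum);
            Map<String, Double> v = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet())
                v.put(e.getKey(), e.getValue() * Math.log((double) n / df.get(e.getKey())));
            vectors.add(v);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> v = vectorize(Arrays.asList(
            "SELECT count(*) FROM access",
            "SELECT user_id FROM access WHERE time > 0",
            "DELETE FROM logs"));
        // "from" appears in every query, so its idf (and tf-idf weight) is 0:
        // boilerplate keywords carry no signal for clustering.
        System.out.println(v.get(0).get("from")); // prints 0.0
    }
}
```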
Problematic Queries
•  90% of queries finish within 2 min.
–  But the remaining 10% is still a large number
•  10% of 10,000 queries is 1,000.
•  Long-running queries
•  Hog queries
38
Long Running Queries
•  Typical bottlenecks
–  Cross joins
–  IN (a, b, c, …)
•  semi-join filtering process is slow
–  Complex scan condition
•  pushing down selection
•  but delays column scan
–  Tuple materialization
•  coordinator generates json data
–  Many aggregation columns
•  group by 1, 2, 3, 4, 5, 6, …
–  Full scan
•  Scanning 100 billion rows…
•  Adding more resources does not always make queries faster
•  Storing intermediate data to disks is necessary
39
(Diagram: the coordinator's materialization step is slow while the upstream stages are fast, so results are buffered waiting to be fetched)
Hog Query
•  Queries consuming a lot of CPU/memory resources
–  Coined in S. Krompass et al. [EDBT2009]
•  Example:
–  select 1 as day, count(…) from … where time <= current_date - interval 1 day
union all
select 2 as day, count(…) from … where time <= current_date - interval 2 day
union all
–  …
–  (up to 190 days)
•  More than 1000 query stages.
•  Presto tries to run all of the stages at once.
–  High CPU usage at coordinator
40
Query Rewriting? Plan Optimization?
•  Query rewriting (better)
–  With group by and window functions
–  Not a perfect solution
•  Need to understand the meaning of the query
•  Semantic changes are not allowed
–  e.g., we cannot rewrite UNION to UNION ALL
–  UNION includes duplicate elimination
•  Workaround idea
–  Bushy plan -> deep plan
–  Introduce stage-wise resource assignment
41
Future Work
•  Reducing Queuing/Response Time
–  Introducing shared queue between customers
•  For utilizing remaining cluster resources
–  Fair scheduling: C. Gupta [EDBT2009]
–  Self-tuning DBMS. S. Chaudhuri [VLDB2007]
•  Adjusting Running Query Size (hard)
–  Limiting driver resources as small as possible for hog queries
–  Query plan based cost estimation
•  Predicting Query Running Time
–  J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
42
Summary: Treasures in Treasure Data
•  Treasures for our customers
–  Data collected by fluentd (td-agent)
–  Query analysis platform
–  Query results - values
•  For Treasure Data
–  SQL query logs
•  Stored in Treasure Data
–  We know how customers use SQL
•  Typical queries and failures
–  We know which part of query can be improved
43
