Internals of Presto Service
Taro L. Saito, Treasure Data
leo@treasure-data.com
March 11-12th, 2015
Treasure Data Tech Talk #1 at Tokyo
Taro L. Saito @taroleo
•  2007: Ph.D., University of Tokyo
–  XML DBMS, Transaction Processing
•  Relational-Style XML Query [SIGMOD 2008]
•  ~ 2014 Assistant Professor at University of Tokyo
–  Genome Science Research
•  Distributed Computing, Personal Genome Analysis
•  March 2014 ~ Treasure Data
–  Software Engineer, MPP Team Leader
•  Open source projects at GitHub
–  snappy-java, msgpack-java, sqlite-jdbc
–  sbt-pack, sbt-sonatype, larray
–  silk
•  Distributed workflow engine
2
Treasure Data
•  TD API / Web Console
•  Hive: batch query
•  Presto: interactive query (via the td-presto connector)
•  PlazmaDB: MessagePack Columnar Storage
What is Presto?
•  A distributed SQL Engine developed by Facebook
–  For interactive analysis on petabyte-scale datasets
•  As a replacement of Hive
–  Nov. 2013: Open sourced at GitHub
•  Presto
–  Written in Java
–  In-memory query layer
–  CPU efficient for ad-hoc analysis
–  Based on ANSI SQL
–  Isolation of query layer and storage access layer
•  A connector provides data access (reading schema and records)
4
Presto: Distributed SQL Engine
5
TD Presto has its own query retry mechanism
Hive: tailored to throughput, fault tolerant
Presto: CPU-intensive, faster response time
Treasure Data: Presto as a Service
6
Presto Public
Release
Topics
•  Challenges in providing Database as a Service
•  TD Presto Connector
–  Optimizing Scan Performance
–  Multi-tenancy Cluster Management
•  Resource allocation
•  Monitoring
•  Query Tuning
7
Optimizing Scan Performance
•  Fully utilize the network bandwidth from S3
•  TD Presto becomes the CPU bottleneck
•  Components of the scan pipeline:
–  TableScanOperator: holds the s3 file list and table schema; pulls records
–  Request Queue: priority queue with a max connections limit; enforces a buffer size limit and reuses allocated buffers (release(Buffer))
–  S3 / RiakCS: stores MPC1 files (Header, Column Block 0 with column names, Column Blocks 1 … m); GET requests are retried on 500 (internal error), 503 (slow down), and 404 (not found, due to eventual consistency)
–  HeaderReader: reads the header with a callback to HeaderParser
–  HeaderParser: parses the MPC file header (column block offsets, column names)
–  ColumnBlockReader: issues column block requests and prepares a MessageUnpacker per block
–  MessageUnpacker: decompresses each S3 read buffer; msgpack-java v07
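The retry policy on this slide can be sketched as follows. This is an illustrative sketch, not TD's actual connector code: the `Get` interface, the backoff parameters, and the retry count are all assumptions.

```java
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the S3 GET retry policy described above (assumed parameters):
// retry on 500 (internal error), 503 (slow down), and 404 (not found, which
// can be transient under eventual consistency), with exponential backoff.
public class S3Retry {
    public interface Get { int run() throws IOException; } // returns HTTP status

    static boolean isRetryable(int status) {
        return status == 500 || status == 503 || status == 404;
    }

    public static int getWithRetry(Get get, int maxRetries) throws Exception {
        long backoffMillis = 100;
        for (int attempt = 0; ; attempt++) {
            int status = get.run();
            if (!isRetryable(status) || attempt == maxRetries) return status;
            // Back off with jitter so that many readers do not retry in lockstep.
            Thread.sleep(backoffMillis + ThreadLocalRandom.current().nextLong(backoffMillis));
            backoffMillis *= 2;
        }
    }

    public static void main(String[] args) throws Exception {
        int[] responses = {503, 503, 200}; // two transient failures, then success
        int[] i = {0};
        int status = getWithRetry(() -> responses[i[0]++], 5);
        System.out.println(status); // prints 200
    }
}
```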
MessageBuffer
•  msgpack-java v06 was the bottleneck
–  Inefficient buffer access
•  v07
–  Fast memory access via sun.misc.Unsafe
•  Direct access to heap memory
•  Extracts primitive type values from byte[] by cast
•  No boxing
9
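The kind of Unsafe-based access described above can be illustrated with a minimal sketch. This is not msgpack-java's actual code; the class and method names are made up, and running it requires access to sun.misc.Unsafe (available on the standard JDK, with a warning on newer releases).

```java
import java.lang.reflect.Field;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

public class UnsafeRead {
    private static final Unsafe UNSAFE;
    private static final long BYTE_ARRAY_OFFSET;
    static {
        try {
            // sun.misc.Unsafe cannot be constructed; grab the singleton via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Read a 4-byte big-endian int directly from the byte array in one memory
    // access: no per-byte bounds checks, no intermediate objects, no boxing.
    public static int getInt(byte[] buf, int index) {
        int v = UNSAFE.getInt(buf, BYTE_ARRAY_OFFSET + index);
        // Unsafe reads in native byte order; MessagePack is big-endian,
        // so swap on little-endian platforms.
        return ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN
                ? v : Integer.reverseBytes(v);
    }

    public static void main(String[] args) {
        byte[] buf = {0x00, 0x00, 0x01, 0x00}; // big-endian 256
        System.out.println(getInt(buf, 0)); // prints 256
    }
}
```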
Unsafe memory access performance is comparable to C
•  http://frsyuki.hatenablog.com/entry/2014/03/12/155231
10
Why is ByteBuffer slow?
•  Following good programming manners
–  Define an interface, then implement classes
•  The ByteBuffer interface has HeapByteBuffer and DirectByteBuffer implementations
•  In reality: TypeProfile slows down method access
–  JVM generates look-up table of method implementations
–  Simply importing one or more classes generates TypeProfile
•  v07 avoids TypeProfile generation
–  Loads the implementation class through reflection
11
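The reflection trick above can be sketched as follows. The class names are hypothetical, not msgpack-java's: the point is only that when a single implementation class is ever loaded, the JIT sees a monomorphic call site and can inline it, instead of consulting a TypeProfile lookup table.

```java
public class MonomorphicLoad {
    public interface Buffer { int getInt(int index); }

    public static class HeapBuffer implements Buffer {
        private final byte[] data;
        public HeapBuffer(byte[] data) { this.data = data; }
        public int getInt(int i) {
            return ((data[i] & 0xff) << 24) | ((data[i + 1] & 0xff) << 16)
                 | ((data[i + 2] & 0xff) << 8) | (data[i + 3] & 0xff);
        }
    }

    // Choose the single implementation by name at runtime. Because the other
    // implementation (e.g. an Unsafe-based variant) is never referenced in
    // source, it is never linked, and Buffer.getInt stays monomorphic.
    public static Buffer newBuffer(byte[] data) throws Exception {
        String impl = MonomorphicLoad.class.getName() + "$HeapBuffer";
        return (Buffer) Class.forName(impl)
                .getConstructor(byte[].class).newInstance((Object) data);
    }

    public static void main(String[] args) throws Exception {
        Buffer b = newBuffer(new byte[]{0, 0, 0, 42});
        System.out.println(b.getInt(0)); // prints 42
    }
}
```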
Format Type Detection
•  MessageUnpacker
–  read prefix: 1 byte
–  detect format type
•  switch-case
–  ANTLR generates this type of code
12
Format Type Detection
•  Using cache-efficient lookup table: 20000x faster
13
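A cache-efficient 256-entry table indexed by the prefix byte looks roughly like this. The format ranges follow the published MessagePack specification; the table construction is a sketch, not msgpack-java's actual code, and only a subset of formats is shown.

```java
public class FormatTable {
    public enum Format { POSFIXINT, FIXMAP, FIXARRAY, FIXSTR, NIL, BOOLEAN, NEGFIXINT, OTHER }

    // One array load replaces a long switch over bit patterns. The table is
    // 256 entries, so it fits comfortably in L1 cache.
    static final Format[] TABLE = new Format[256];
    static {
        for (int b = 0; b < 256; b++) {
            Format f;
            if (b <= 0x7f)                   f = Format.POSFIXINT; // 0xxxxxxx
            else if (b <= 0x8f)              f = Format.FIXMAP;    // 1000xxxx
            else if (b <= 0x9f)              f = Format.FIXARRAY;  // 1001xxxx
            else if (b <= 0xbf)              f = Format.FIXSTR;    // 101xxxxx
            else if (b == 0xc0)              f = Format.NIL;
            else if (b == 0xc2 || b == 0xc3) f = Format.BOOLEAN;
            else if (b >= 0xe0)              f = Format.NEGFIXINT; // 111xxxxx
            else                             f = Format.OTHER;     // int/float/str/bin/ext...
            TABLE[b] = f;
        }
    }

    public static Format valueOf(byte prefix) { return TABLE[prefix & 0xff]; }

    public static void main(String[] args) {
        System.out.println(valueOf((byte) 0xc0)); // prints NIL
        System.out.println(valueOf((byte) 0x05)); // prints POSFIXINT
    }
}
```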
2x performance improvement in v07
14
Database As A Service
15
Claremont Report on Database Research
•  Discussion on the future of DBMS
–  Among top researchers, vendors, and practitioners
–  CACM, Vol. 52 No. 6, 2009
•  Predicts emergence of Cloud Data
Service
–  SQL has an important role
•  limited functionality
•  suited for service provider
–  A difficult example: Spark
•  Needs a secure application container to run arbitrary Scala code
16
Beckman Report on Database Research
•  2013
–  http://beckman.cs.wisc.edu/beckman-report2013.pdf
–  Topics of Big-Data
•  End-to-end service
–  From data collection to knowledge
•  Cloud Service has become popular
–  IaaS, PaaS, SaaS
–  The challenge is to migrate all of the functionality of a DBMS into the cloud
17
Big Data Simplified: The Treasure Data Approach
•  Data sources: Mobile SDKs, Web SDK, Embedded SDKs, Server-side Agents (Treasure Agent), AppServers
•  Multi-structured events: register, login, start_event, purchase, etc.
–  App log data, mobile event data, sensor data, telemetry
•  Infinite & economical cloud data store; familiar & table-oriented
•  SQL-based ad-hoc queries and SQL-based dashboards
•  Results pushed to DBs & data marts and other apps
18
Challenges in Database as a Service
•  Tradeoffs
–  Cost and service level objectives (SLOs)
•  Reference
–  Workload Management for Big Data Analytics. A. Aboulnaga
[SIGMOD2013 Tutorial]
19
•  Run each query set on an independent cluster: fast, but $$$
•  Run all queries together on the smallest possible cluster: reasonable price, but only a limited performance guarantee
Shift of Presto Query Usage
•  Initial phase
–  Trial and error with queries
•  Many syntax errors, semantic errors
•  Next phase
–  Scheduled query execution
•  Increased Presto query usage
–  Some customers submit more than 1,000 Presto queries / day
–  Establishing typical query patterns
•  hourly, daily reports
•  query templates
•  Advanced phase: More elaborate data analysis
–  Complex queries
•  via data scientists and data analysts
–  High resource usage
20
Usage Shift: Simple to Complex queries
21
Monitoring Presto Usage with Fluentd
22
Hive
Presto
DataDog
•  Monitoring CPU, memory and network usage
•  Query stats
23
Query Collection in TD
•  SQL query logs
–  query, detailed query plan, elapsed time, processed rows, etc.
•  Presto is used for analyzing the query history
24
Daily/Hourly Query Usage
25
Query Running Time
•  More than 90% of queries finish within 2 min.
–  the expected response time for interactive queries
26
Processed Rows of Queries
27
Performance
•  Processed rows / sec. of a query
28
Collecting Recoverable Error Patterns
•  Presto has no fault tolerance
•  Error types
–  User error
•  Syntax errors
–  SQL syntax, missing function
•  Semantic errors
–  missing tables/columns
–  Insufficient resource
•  Exceeded task memory size
–  Internal failure
•  I/O error
–  S3/Riak CS
•  worker failure
•  etc.
29
TD Presto retries
these queries
Query Retry on Internal Errors
•  More than 99.8% of queries finish without errors
30
Query Retry on Internal Errors (log scale)
•  Queries succeed eventually
31
Multi-tenancy: Resource Allocation
•  Price-plan based resource allocation
•  Parameters
–  The number of worker nodes to use (min-candidates)
–  The number of hash partitions (initial-hash-partitions)
–  The maximum number of running tasks per account
•  If running queries exceed the allowed number of tasks, subsequent queries must wait (queued)
•  Presto: SqlQueryExecution class
–  Controls query execution state: planning -> running -> finished
•  No resource allocation policy
–  The extended TDSqlQueryExecution class monitors running tasks and limits resource usage
•  Rewriting SqlQueryExecutionFactory at run-time by using ASM library
32
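The per-account task limit described above can be sketched with a fair semaphore. This is a hypothetical illustration of the policy, not TD's actual TDSqlQueryExecution code: the class name and the permit-per-task model are assumptions.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of per-account task limiting: each account gets a
// semaphore sized by its price plan; a query's tasks must acquire permits
// before running, otherwise the query waits in the queue.
public class AccountTaskLimiter {
    private final Semaphore permits;

    public AccountTaskLimiter(int maxRunningTasks) {
        // fair = true: waiting queries acquire permits in FIFO order.
        this.permits = new Semaphore(maxRunningTasks, true);
    }

    public boolean tryStartTask() { return permits.tryAcquire(); }

    public void finishTask() { permits.release(); }

    public static void main(String[] args) {
        AccountTaskLimiter limiter = new AccountTaskLimiter(2);
        System.out.println(limiter.tryStartTask()); // true
        System.out.println(limiter.tryStartTask()); // true
        System.out.println(limiter.tryStartTask()); // false: account at its limit
        limiter.finishTask();
        System.out.println(limiter.tryStartTask()); // true again
    }
}
```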
Query Queue
•  Presto 0.97
–  Introduces user-wise query queues
•  Can limit the number of concurrent queries per user
•  Problem
–  Running too many queries delays overall query
performance
33
Customer Feedback
•  Feedback from a customer:
–  We don’t care if large queries take a long time
–  But interactive queries should run immediately
•  Challenges
–  How do we allocate resources when preceding queries already occupy the customer’s share?
–  How do we know whether a submitted query is an interactive one?
34
Admission control is necessary
•  Adjust resource utilization
–  Running Drivers (Splits)
–  MPL (Multi-Programming Level)
35
Challenge: Auto Scaling
•  Setting the cluster size based on the peak usage is expensive
•  But predicting customer usage is difficult
36
Typical Query Patterns [Li Juang]
•  Q: What are typical queries of a customer?
–  Customers feel that some queries are slow
–  But we don’t know what to compare with, except scheduled queries
•  Approach: Clustering Customer SQLs
•  TF/IDF measure: TF x IDF vector
–  Split SQL statements into tokens
–  Term frequency (TF) = the number of occurrences of each term in a query
–  Inverse document frequency (IDF) = log(# of queries / # of queries that contain the token)
•  k-means clustering
–  TF/IDF vector
–  Generates clusters of similar queries
•  x-means clustering for deciding number of clusters automatically
–  D. Pelleg [ICML2000]
37
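The TF/IDF step above can be sketched as follows. This is a minimal illustration of the vectorization only (the k-means/x-means step is omitted), with a deliberately crude tokenizer; it is not TD's actual clustering code.

```java
import java.util.*;

public class SqlTfIdf {
    // Tokenize a SQL statement very crudely: lowercase runs of word characters.
    static List<String> tokenize(String sql) {
        List<String> tokens = new ArrayList<>();
        for (String t : sql.toLowerCase().split("[^a-z0-9_]+"))
            if (!t.isEmpty()) tokens.add(t);
        return tokens;
    }

    // One TF-IDF vector per query: tf = raw count of the token in the query,
    // idf = log(N / number of queries containing the token).
    public static List<Map<String, Double>> vectorize(List<String> queries) {
        Map<String, Integer> df = new HashMap<>(); // document frequency
        List<List<String>> tokenized = new ArrayList<>();
        for (String q : queries) {
            List<String> toks = tokenize(q);
            tokenized.add(toks);
            for (String t : new HashSet<>(toks)) df.merge(t, 1, Integer::sum);
        }
        int n = queries.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> toks : tokenized) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : toks) tf.merge(t, 1, Integer::sum);
            Map<String, Double> v = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet())
                v.put(e.getKey(), e.getValue() * Math.log((double) n / df.get(e.getKey())));
            vectors.add(v);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> v = vectorize(Arrays.asList(
            "SELECT count(*) FROM access",
            "SELECT user_id FROM access WHERE time > 0",
            "DELETE FROM logs"));
        // "from" appears in every query, so its idf (and tf-idf weight) is 0:
        // boilerplate keywords carry no signal for clustering.
        System.out.println(v.get(0).get("from")); // prints 0.0
    }
}
```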
Problematic Queries
•  90% of queries finish within 2 min.
–  But the remaining 10% is still a large number
•  10% of 10,000 queries is 1,000.
•  Long-running queries
•  Hog queries
38
Long Running Queries
•  Typical bottlenecks
–  Cross joins
–  IN (a, b, c, …)
•  semi-join filtering process is slow
–  Complex scan condition
•  pushing down selection
•  but delays column scan
–  Tuple materialization
•  coordinator generates json data
–  Many aggregation columns
•  group by 1, 2, 3, 4, 5, 6, …
–  Full scan
•  Scanning 100 billion rows…
•  Adding more resources does not always make queries faster
•  Storing intermediate data to disks is necessary
39
(Diagram: the coordinator's materialization step is slow while the upstream stages are fast, so results are buffered waiting to be fetched)
Hog Query
•  Queries consuming a lot of CPU/memory resources
–  Coined in S. Krompass et al. [EDBT2009]
•  Example:
–  select 1 as day, count(…) from … where time <= current_date - interval 1 day
union all
select 2 as day, count(…) from … where time <= current_date - interval 2 day
union all
–  …
–  (up to 190 days)
•  More than 1000 query stages.
•  Presto tries to run all of the stages at once.
–  High CPU usage at coordinator
40
Query Rewriting? Plan Optimization?
•  Query rewriting (better)
–  With group by and window functions
–  Not a perfect solution
•  Need to understand the meaning of the query
•  Semantic changes are not allowed
–  e.g., we cannot rewrite UNION to UNION ALL
–  UNION includes duplicate elimination
•  Workaround idea
–  Bushy plan -> deep plan
–  Introduce stage-wise resource assignment
41
Future Work
•  Reducing Queuing/Response Time
–  Introducing shared queue between customers
•  For utilizing remaining cluster resources
–  Fair scheduling: C. Gupta [EDBT2009]
–  Self-tuning DBMS. S. Chaudhuri [VLDB2007]
•  Adjusting Running Query Size (hard)
–  Limiting driver resources as small as possible for hog queries
–  Query plan based cost estimation
•  Predicting Query Running Time
–  J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
42
Summary: Treasures in Treasure Data
•  Treasures for our customers
–  Data collected by fluentd (td-agent)
–  Query analysis platform
–  Query results - values
•  For Treasure Data
–  SQL query logs
•  Stored in Treasure Data
–  We know how customers use SQL
•  Typical queries and failures
–  We know which part of query can be improved
43
