Presto At Treasure Data

Presto Meetup @ Tokyo - June 15, 2017
Taro L. Saito - GitHub:@xerial
Ph.D., Software Engineer at Treasure Data, Inc.

Presto Usage at Treasure Data (2017)
Processing 15 Trillion Rows / Day  
(= 173 Million Rows / sec.)
150,000~ Queries / Day
1,500~ Users
Hosting Presto as a service for 3 years
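As a quick check of the per-second figure above, the arithmetic can be reproduced as follows. The even spread over the day is an assumption made only for illustration; real load is bursty.

```python
# Back-of-the-envelope check of the throughput figures on this slide.
ROWS_PER_DAY = 15 * 10**12       # 15 trillion rows / day (from the slide)
QUERIES_PER_DAY = 150_000        # from the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

# Assumption: load is spread evenly across the day.
rows_per_sec = ROWS_PER_DAY / SECONDS_PER_DAY
queries_per_sec = QUERIES_PER_DAY / SECONDS_PER_DAY

print(f"{rows_per_sec / 1e6:.1f} million rows/sec")
print(f"{queries_per_sec:.2f} queries/sec on average")
```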

Configurations
• Hosted on AWS (us-east), AWS Tokyo, IDCF (Japan)
• Multi-Tenancy Clusters
• PlazmaDB
• Storage: Amazon S3 or RiakCS
• S3 file indexes: PostgreSQL
• Storage format: Columnar Message Pack (MPC)
• MessagePack: Self-describing format (values carry their own types).
• Compact: 10x compression ratio from the original input data (JSON)
• 200GB JVM memory per node
• To support varieties of query usage
• Estimating required memory in advance is difficult
• For avoiding WAITING_FOR_MEMORY state that blocks the entire query processing
• In small-memory configurations, major GCs were quite frequent

Challenges
• Major Complaint
• Presto is slower than usual
• Only 20% of 150,000 queries are using our scheduling feature
• However, 85% of queries are actually scheduled by user scripts or third-party tools  
• How can we know the expected performance?
• (Implicit) Service Level Objectives (SLOs)

Understanding Implicit SLOs
• We usually looked into slow queries to figure out the performance bottlenecks.
• However, analyzing SQL takes a long time
• Because we need to understand the meaning of the data.
• Understanding a hundred lines of SQL is painful
• Created Presto Query Tuning Guides:
• Presto Query FAQs: https://docs.treasuredata.com/articles/presto-query-faq
• Performance Expectations
• Scheduled queries: We can estimate SLOs from historical stats
• Scheduled, but submitted from third-party tools or user scripts
• How do we know the expected performance?
• We need to internalize customers' knowledge of query performance

Our Approach: Data-Driven Improvement
• Bad:
• Collecting stdout/stderr logs of Presto
• Good:
• Collecting logs in a queryable format with Presto
• Collecting Query Event Logs to Treasure Data
• Presto Event Listener -> fluentd -> Treasure Data
• Treasure Data
• schema-less: Schema can be automatically generated from the data
• As we add new fields to the event, the schema evolves automatically
• We have collected every single query log since the beginning of the Presto service
(Diagram: Query Logs -> Store -> Analyze with SQL -> Improve & Optimize)
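The schema-less behavior described on this slide can be sketched roughly as below. This is only an illustration of schema evolution driven by incoming records; the field names (query_id, s3_get_count, etc.) and the type names are hypothetical, not the actual event format or Treasure Data's type system.

```python
import json

def infer_type(value):
    """Map a JSON value to a coarse column type (type names are illustrative)."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"

def evolve_schema(schema, record):
    """Return a copy of `schema` extended with columns first seen in `record`."""
    evolved = dict(schema)
    for key, value in record.items():
        evolved.setdefault(key, infer_type(value))
    return evolved

# Hypothetical query-completion events; the second one adds a new field.
events = [
    '{"query_id": "q1", "user_id": 42, "running_time_sec": 1.5}',
    '{"query_id": "q2", "user_id": 7, "running_time_sec": 0.3, "s3_get_count": 120}',
]

schema = {}
for line in events:
    schema = evolve_schema(schema, json.loads(line))

print(schema)
```

Existing columns keep their first inferred type, so adding a field to the event never requires a migration step in this sketch.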

Query Event Logs
• Query Completion
• queryId, user id, session parameters, etc.
• Query stats: running time, total rows, bytes, splits, CPU time, etc.
• SQL statement
• Split Completion
• Running time, Processed rows, bytes, etc.
• S3 GET access count, read bytes
• Table Scan
• Accessed table names, column sets
• Accessed time ranges (e.g., queries looking at data of past 1 hour, 7 days, etc.)
• Filtering conditions (predicate)

Clustering Queries with Query Signature
• Finding Implicit SLOs
• Need to classify 85% of scheduled queries
• Extracting Query Signatures
• Simplify complex SQL expressions into a tiny SQL representation
• Reusing ANTLR parser of Presto
• Query Signature Example:
• S[Cnt](J(T1,G(S[Cnt](T2))))
• SELECT count(a),... FROM T1 JOIN (SELECT count(b),... FROM T2 GROUP BY x)
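A rough sketch of how such a signature could be derived from a parsed query is shown below. The slides reuse Presto's ANTLR parser; here the tree classes (Table, Select, GroupBy, Join) are hypothetical stand-ins for a parsed query, and the naming rules are inferred only from the single example on this slide.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, minimal query tree (not Presto's actual AST classes).
@dataclass
class Table:
    name: str

@dataclass
class Select:
    source: object
    aggregates: List[str] = field(default_factory=list)  # e.g. ["Cnt"]

@dataclass
class GroupBy:
    source: object

@dataclass
class Join:
    left: object
    right: object

def signature(node, table_ids=None):
    """Collapse a query tree into a compact signature string.

    Concrete table names are replaced by T1, T2, ... in order of first
    appearance, so queries with the same shape share one signature.
    """
    if table_ids is None:
        table_ids = {}
    if isinstance(node, Table):
        index = table_ids.setdefault(node.name, len(table_ids) + 1)
        return f"T{index}"
    if isinstance(node, Select):
        aggs = f"[{','.join(node.aggregates)}]" if node.aggregates else ""
        return f"S{aggs}({signature(node.source, table_ids)})"
    if isinstance(node, GroupBy):
        return f"G({signature(node.source, table_ids)})"
    if isinstance(node, Join):
        left = signature(node.left, table_ids)
        right = signature(node.right, table_ids)
        return f"J({left},{right})"
    raise TypeError(f"unsupported node: {node!r}")

# Tree for the example query on this slide (table names are made up).
query = Select(
    aggregates=["Cnt"],
    source=Join(
        left=Table("access_logs"),
        right=GroupBy(Select(aggregates=["Cnt"], source=Table("users"))),
    ),
)
print(signature(query))
```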

Implicit SLOs
• Collect the historical query running times
• Queries that have the same query signature
• Median absolute deviation (MAD): the median of |running time - median|
• CoV: Coefficient of variation = MAD / median
• If CoV > 1, the query running time tends to vary
• If CoV < 1, median of historical running time is useful for query running time estimation.
• SLO violation:
• If a query runs longer than median + MAD
• Customer feels query is slower than usual
• However, query might be processing much more data than usual
• Normalization based on the processing data size is also necessary
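The rule on this slide can be sketched as follows, using the standard definition of MAD (the median of absolute deviations from the median). The function names and the sample history are illustrative assumptions, and normalization by processed data size is left out for brevity.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    m = median(values)
    return median(abs(v - m) for v in values)

def implicit_slo(history):
    """Summarize historical running times (seconds) of one query signature."""
    m = median(history)
    d = mad(history)
    cov = d / m  # the slide's "CoV" = MAD / median
    return {
        "median": m,
        "mad": d,
        "cov": cov,
        "threshold": m + d,       # running longer than this is a violation
        "predictable": cov < 1,   # median is a useful estimate only if CoV < 1
    }

def violates_slo(running_time, history):
    slo = implicit_slo(history)
    return slo["predictable"] and running_time > slo["threshold"]

# Hypothetical running times (sec) for queries sharing one signature.
history = [60, 62, 58, 65, 61, 59, 63]
print(implicit_slo(history))
print(violates_slo(90, history), violates_slo(62, history))
```

To account for the last two bullets, the same computation could be applied to running time divided by processed bytes or rows instead of raw running time.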

Typical Performance Bottlenecks
• Huge Queries
• Frequent S3 access, wide table scans
• Single-node operators
• order by, window function, count(distinct x), processing skewed data, etc.
• Ill-performing worker nodes
• Heavy load on a single worker node
• Insufficient pool memory
• Major/full GCs
• We are using min.error-duration = 2m, but GC pause can be longer
• Too much resource usage
• A single query occupies the entire cluster
• e.g., A query with hundreds of query stages!

Split Resource Manager
• Problem: A single query can occupy the entire cluster resource
• But Presto has limited performance control
• Only for cpu time, memory usage, and concurrent queries (CQ) limits
• No throttling or boosting
• Created Split Resource Manager
• Limiting the max runnable splits for each customer
• Using a custom RemoteTask class, which adds a wait if no splits are available
• => Efficient Use of Multi-Tenancy Cluster
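The idea of capping runnable splits per customer, and waiting when none are available, can be sketched as below. This is not the custom RemoteTask implementation from the slides; SplitLimiter and its methods are hypothetical names used only to illustrate the throttling behavior.

```python
import threading
from collections import defaultdict

class SplitLimiter:
    """Caps the number of concurrently running splits per customer."""

    def __init__(self, max_splits_per_customer):
        self._max = max_splits_per_customer
        self._running = defaultdict(int)
        self._cond = threading.Condition()

    def acquire(self, customer, timeout=None):
        """Wait until the customer is under its cap; False on timeout."""
        with self._cond:
            granted = self._cond.wait_for(
                lambda: self._running[customer] < self._max, timeout
            )
            if granted:
                self._running[customer] += 1
            return granted

    def release(self, customer):
        """Mark one split as finished and wake up any waiters."""
        with self._cond:
            self._running[customer] -= 1
            self._cond.notify_all()

limiter = SplitLimiter(max_splits_per_customer=2)
first = limiter.acquire("customer_a")
second = limiter.acquire("customer_a")
third = limiter.acquire("customer_a", timeout=0.01)   # over the cap: times out
other = limiter.acquire("customer_b")                  # other tenants unaffected
limiter.release("customer_a")
retry = limiter.acquire("customer_a", timeout=0.01)    # a slot is free again
print(first, second, third, other, retry)
```

Raising or lowering the cap for a specific customer would give the boosting and throttling that the slide says stock Presto lacked.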

Presto Ops Robot
• Problem: Insufficient memory of a worker
• Queries using that worker node enter WAITING_FOR_MEMORY state
• Report JMX metrics -> fluentd -> DataDog -> Trigger Alert -> Presto Ops Robot
• Presto Ops Robot
• Sending graceful shutdown command (PUT "SHUTTING_DOWN" to /v1/info/state)
• or kill memory consuming queries in the worker node
• Restarting worker JVM process
• At least once a week, to avoid any issues when running JVM for a long time
• Resetting any effect caused by unknown bugs
• Useful for cleaning up untracked memory (e.g., ANTLR objects, etc.)
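A minimal sketch of the graceful-shutdown request an ops robot could send is shown below. It uses the endpoint described in the open-source Presto documentation (a PUT of "SHUTTING_DOWN" to /v1/info/state); the deployment described in these slides may use a different path, and the worker address is a made-up example. The request is only constructed here, not sent.

```python
import json
import urllib.request

def graceful_shutdown_request(worker_base_url):
    """Build (but do not send) a graceful-shutdown request for one worker.

    Assumption: the worker exposes the stock Presto endpoint
    PUT /v1/info/state with a JSON string body "SHUTTING_DOWN".
    """
    return urllib.request.Request(
        url=worker_base_url.rstrip("/") + "/v1/info/state",
        data=json.dumps("SHUTTING_DOWN").encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Hypothetical worker address.
request = graceful_shutdown_request("http://10.0.1.23:8080/")
print(request.get_method(), request.full_url, request.data)
```

Sending it would be a matter of passing the object to urllib.request.urlopen, typically from the alert handler once a worker has been flagged.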

S3 Access Performance
• Problem: Slow Table Scan
• S3 GET request has constant latency
• 30ms ~ 50ms latency regardless of the read size (up to 8KB read)
• Request retry on 500 (Internal Error) or 503 (Slow Down) is also necessary
• Reading small header part of S3 objects can be the majority of query processing time
• Columnar format: header + column blocks
• IO Manager:
• Need to send as many S3 GET requests as possible
• 1 split = multiple S3 objects
• Pipelining S3 GET requests and column reads
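The effect of issuing many GET requests at once, with retries on 500/503, can be sketched with a simulated object store. FakeObjectStore, its fixed latency, and its failure pattern are assumptions for illustration only; no real S3 client is involved.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

RETRYABLE_STATUS = {500, 503}

class FakeObjectStore:
    """Stand-in for S3: constant latency per GET; some keys fail once with 503."""

    def __init__(self, latency_sec=0.01):
        self.latency_sec = latency_sec
        self.request_count = 0
        self._failed_once = set()
        self._lock = threading.Lock()

    def get(self, key):
        time.sleep(self.latency_sec)  # constant latency regardless of read size
        with self._lock:
            self.request_count += 1
            if key.endswith("3") and key not in self._failed_once:
                self._failed_once.add(key)
                return 503, None
        return 200, f"header-of-{key}"

def get_with_retry(store, key, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        status, body = store.get(key)
        if status == 200:
            return body
        if status not in RETRYABLE_STATUS or attempt == max_attempts:
            raise IOError(f"GET {key} failed with status {status}")
        time.sleep(0.005 * attempt)  # simple linear backoff

def read_split(store, keys, parallelism=16):
    """One split covers multiple objects; fetch them concurrently, in order."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(lambda k: get_with_retry(store, k), keys))

store = FakeObjectStore()
keys = [f"part-{i}" for i in range(32)]
headers = read_split(store, keys)
print(len(headers), store.request_count)
```

With per-request latency fixed, total time is governed by the number of sequential round trips, so higher parallelism shortens a scan until the store starts responding with 503.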

Presto Stella: Plazma Storage Optimizer
• Problem:
• Some queries read 1 million partitions <- S3 latency overhead is quite high
• Data from mobile applications often has a wide range of time values.
• Presto Stella Connector
• Using Presto for optimizing physical storage partitions
• Input records: File list on S3
• Table writer stage: Merges fragmented partitions and uploads them to S3
• Commit: Update S3 file indexes on PostgreSQL (in an atomic transaction)
• Performance Improvement
• e.g. 10,000 partitions (30 sec.) -> 20 partitions (1.5 sec.)
• 20x performance improvement
• Use Cases
• Maintain fragmented user-defined partitions
• 1-hour partitioning -> more flexible time range partitioning
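The merge-planning step can be sketched as a greedy, time-ordered packing of small partitions into larger ones. This is an assumption made for illustration, not Stella's actual algorithm; the Partition fields, the sizes, and the target size are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Partition:
    start: int       # inclusive unix time
    end: int         # exclusive unix time
    size_bytes: int

def plan_merges(partitions, target_bytes):
    """Group time-ordered partitions so each group stays within target_bytes."""
    groups, current, current_size = [], [], 0
    for p in sorted(partitions, key=lambda p: p.start):
        if current and current_size + p.size_bytes > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(p)
        current_size += p.size_bytes
    if current:
        groups.append(current)
    return groups

def merged(group):
    """Describe the single partition that replaces a group."""
    return Partition(
        start=min(p.start for p in group),
        end=max(p.end for p in group),
        size_bytes=sum(p.size_bytes for p in group),
    )

# 10,000 fragmented 1 MB partitions, merged toward ~500 MB each.
fragments = [Partition(i * 60, (i + 1) * 60, 1_000_000) for i in range(10_000)]
groups = plan_merges(fragments, target_bytes=500_000_000)
print(len(fragments), "->", len(groups), "partitions")
print(merged(groups[0]))
```

In the pipeline on this slide, the commit step would then swap the old file index entries for the merged ones in a single PostgreSQL transaction.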

Transitions of Database Usages

New Directions Explored By Presto
• Traditional Database Usage
• Required Database Administrator (DBA)
• DBA designs the schema and queries
• DBA tunes query performance
• After Presto
• Schema is designed by data providers
• 1st-party data (user’s customer data)
• 3rd-party data sources
• Analysts or Marketers explore the data with Presto
• Don’t know the schema in advance
• Convenient and low-latency access is necessary
• SQL can be inefficient at first
• SQL can become more sophisticated as exploration proceeds, but not always

Prestobase Proxy: Low-Latency Access to Presto
• Needed a more interactive experience with Presto
• Prestobase Proxy: Gateway to Presto Coordinator
• Talks Presto Protocol (/v1/statement/…)
• Written in Scala.
• Runs on Docker
• Based on Finagle (HTTP server written by Twitter)
• Features
• Can work with standard presto clients (e.g., presto-cli, presto-jdbc, presto-odbc, etc.)
• Increased connectivity to BI tools: Tableau, Datorama, ChartIO, Looker, etc.
• Authentication (API key)
• Rewriting nextUri (internal IP address -> external host name)
• BI-tool specific query filters
• etc.
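The nextUri rewriting listed above can be sketched as follows. The response is a trimmed-down, hypothetical statement response, and the host names and query ID are made-up examples; only the rewriting step itself is illustrated.

```python
from urllib.parse import urlsplit, urlunsplit

def rewrite_next_uri(response, external_scheme, external_host):
    """Return a copy of a statement response whose nextUri points at the proxy.

    The coordinator answers with its internal address; clients outside the
    network must be directed back through the externally visible host.
    """
    if "nextUri" not in response:
        return response  # final page of results: nothing to rewrite
    parts = urlsplit(response["nextUri"])
    rewritten = urlunsplit(
        (external_scheme, external_host, parts.path, parts.query, parts.fragment)
    )
    return {**response, "nextUri": rewritten}

# Hypothetical coordinator response (only the relevant fields are shown).
internal = {
    "id": "20170615_000000_00001_abcde",
    "nextUri": "http://10.0.1.23:8080/v1/statement/20170615_000000_00001_abcde/1",
}
external = rewrite_next_uri(internal, "https", "presto.example.com")
print(external["nextUri"])
```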

Customizing Prestobase Filters
• Prestobase Proxy: Gateway to access Presto
• Adding TD specific binding
• Finagle filters -> Injecting TD Specific filters
• Using Airframe, a dependency injection library for Scala

Airframe
• http://wvlet.org/airframe
• Three-step DI in Scala
• Bind
• Design
• Build
• Built-in life cycle manager
• Session start/shutdown
• examples:
• Open/close Presto connection
• Shutting down Presto server
• etc.
• Session
• Manage singletons and binding rules

VCR Record/Replay for Testing Presto
• Launching Presto requires a lot of memory (e.g., 2GB or more)
• Often crashes CI service containers (TravisCI, CircleCI, etc.)
• Recording Presto responses (prestobase-vcr)
• with sqlite-jdbc: https://github.com/xerial/sqlite-jdbc
• DB file for each test suite
• Enabled small-memory footprint testing
• Can run many Presto tests in CI
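The record/replay idea can be sketched with Python's built-in sqlite3 module. This is not prestobase-vcr itself (which is written in Scala on top of sqlite-jdbc); the Vcr class, the table layout, and the fake backend are hypothetical.

```python
import os
import sqlite3
import tempfile

class Vcr:
    """Records responses into a SQLite file and replays them later."""

    def __init__(self, db_path, backend=None):
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS recordings "
            "(request TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )
        self._backend = backend  # None means replay-only mode

    def fetch(self, request):
        row = self._db.execute(
            "SELECT response FROM recordings WHERE request = ?", (request,)
        ).fetchone()
        if row is not None:
            return row[0]  # replay without touching the backend
        if self._backend is None:
            raise KeyError(f"no recording for {request!r}")
        response = self._backend(request)
        self._db.execute("INSERT INTO recordings VALUES (?, ?)", (request, response))
        self._db.commit()
        return response

    def close(self):
        self._db.close()

def fake_presto(request):
    """Stand-in for a real Presto call; returns a canned JSON body."""
    return '{"columns": ["cnt"], "data": [[42]]}'

# One DB file per test suite, as on the slide.
db_path = os.path.join(tempfile.mkdtemp(), "suite1.sqlite")

recorder = Vcr(db_path, backend=fake_presto)   # recording run (needs Presto)
recorded = recorder.fetch("SELECT count(*) AS cnt FROM t")
recorder.close()

replayer = Vcr(db_path)                        # CI run (no Presto needed)
replayed = replayer.fetch("SELECT count(*) AS cnt FROM t")
replayer.close()
print(recorded == replayed)
```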

Optimizing QueryResults Transfer in Prestobase
• Accept: application/x-msgpack
• HTTP header
• Returning Presto query result rows in MessagePack format
• QueryResults object
• Contains Array<Array<Object>> => MessagePack (compact binary)
• Encoding QueryResults objects using MessagePack/Jackson
• https://github.com/msgpack/msgpack-java
• Presto client doesn’t need to parse the row part
• 1.5x ~ 2.0x performance improvement for streaming query results

Prestobase Modules
• prestobase-proxy
• Proxy server to access Presto with authentication
• prestobase-agent
• Agent for running Presto queries and storing their results
• prestobase-vcr
• For recording/replaying Presto responses
• prestobase-codec
• MessagePack codec of Presto query responses
• prestobase-hq (headquarter)
• Presto usage analysis pipelines, SLO monitoring, etc.
• prestobase-conductor
• Multi Presto cluster management tool
• td-prestobase
• Treasure Data specific bindings of prestobase
• TD Authentication, job logging/monitoring
• BI tool specific filters (Tableau, Looker, etc.)

Bridging Gaps Between SQL and Programming Language
• Traditional Approach
• O/R mapper: app developers design objects and schema, then generate SQL
• New Approach: SQL First
• Need to manage various SQL results inside a programming language
• prestobase-hq
• Need to manage hundreds of SQL queries and their results
• SLO analysis, query performance analysis, etc.
• But How?

sbt-sql: https://github.com/xerial/sbt-sql
• Scala SBT plugin for generating model classes from SQL files
• src/main/sql/presto/*.sql (Presto Queries)
• Using SQL as a function
• Read Presto SQL Results as Objects
• Enabled managing SQL queries in GitHub
• Type-safe data analysis in prestobase-hq

Big Challenge: Splitting Huge Queries
• Table Scan Log Analysis
• Revealed that most customers are scanning the same data over and over
• Optimizing SQL is not the major concern.
• Analyzing data has higher priority
• Splitting a huge query into scheduled hourly/daily jobs
• digdag: Open-source workflow engine
• http://digdag.io
• YAML-based task definition
• Scheduling, run Presto queries
• Easy to use

Time Range Primitives
• TD_TIME_RANGE(time, '2017-06-15', '2017-06-16', 'PDT')
• Most frequently used UDF, but inconvenient
• Use short description of relative time ranges
• 1d (1 day)
• 7d (7 days)
• 1h (1 hour)
• 1w (1 week)
• 1M (1 month)
• today, yesterday, lastWeek, thisWeek, etc.
• Recent data access
• 1dU (1 day until now) => TD_TIME_RANGE(time, '2017-06-15', null, 'JST') open range
• Splitting ranges
• 1w.splitIntoDays
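A rough sketch of how such short range descriptions could be resolved is shown below. The exact semantics are assumptions: ranges are aligned to the hour or day boundary before "now", the U suffix leaves the end open, and week ranges are aligned to days rather than to the start of a week. Month ranges (1M) and names like today or lastWeek are not handled. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

_UNIT_ARGS = {"h": "hours", "d": "days", "w": "weeks"}

def parse_range(spec, now):
    """Resolve e.g. '1d', '7d', '1h', '1w', '1dU' to (start, end).

    `end` is None for an open range (the 'U' = until-now suffix).
    """
    m = re.fullmatch(r"(\d+)([hdw])(U?)", spec)
    if m is None:
        raise ValueError(f"unsupported range: {spec!r}")
    count, unit, until_now = int(m.group(1)), m.group(2), m.group(3) == "U"
    if unit == "h":
        boundary = now.replace(minute=0, second=0, microsecond=0)
    else:
        boundary = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = boundary - timedelta(**{_UNIT_ARGS[unit]: count})
    return start, (None if until_now else boundary)

def split_into_days(start, end):
    """Split [start, end) into consecutive ranges of at most one day."""
    ranges, cursor = [], start
    while cursor < end:
        nxt = min(cursor + timedelta(days=1), end)
        ranges.append((cursor, nxt))
        cursor = nxt
    return ranges

now = datetime(2017, 6, 16, 10, 30, tzinfo=timezone.utc)
print(parse_range("1d", now))    # closed range covering the previous day
print(parse_range("1dU", now))   # open range starting at the previous day
print(len(split_into_days(*parse_range("1w", now))))
```

Each resulting (start, end) pair maps directly onto the arguments of a TD_TIME_RANGE call, which is what makes the split ranges usable as separate hourly or daily jobs.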

MessageFrame (In Design)
• Next-generation Tabular Data Format
• Hybrid layout:
• row-oriented: for streaming. Quick write
• column-oriented: better compression & fast read
• Specification Layers
• Layer-0 (basic specs: Keep it simple, stupid)
• Data type: MessagePack
• Compression codec: raw, delta, gzip, (snappy, zstd? etc.)
• Column metadata: min/max/sum values of columns
• Layer-1 (advanced compression)
• Layer-N should be convertible to Layer-0
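Two of the Layer-0 ingredients named above, the delta codec and min/max/sum column metadata, can be illustrated generically as below. MessageFrame was still in design at the time of this talk, so nothing here reflects its actual layout; the function names are hypothetical and the example only shows why these two pieces are useful.

```python
def delta_encode(values):
    """Store each value as the difference from its predecessor."""
    encoded, previous = [], 0
    for v in values:
        encoded.append(v - previous)
        previous = v
    return encoded

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    decoded, total = [], 0
    for d in deltas:
        total += d
        decoded.append(total)
    return decoded

def column_metadata(values):
    """Per-block statistics kept alongside the column data."""
    return {"min": min(values), "max": max(values), "sum": sum(values)}

def can_skip_block(meta, lower, upper):
    """True if no value in the block can fall inside [lower, upper]."""
    return meta["max"] < lower or meta["min"] > upper

# A time column: large absolute values, small differences.
times = [1497484800, 1497484801, 1497484803, 1497484810]
deltas = delta_encode(times)
meta = column_metadata(times)
print(deltas)
print(meta)
print(can_skip_block(meta, 1497500000, 1497600000))
```

Small deltas compress far better than the raw values, and the min/max metadata lets a reader skip whole blocks outside a query's time range without decoding them.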

Summary
• Managing Implicit SLOs
• Data-oriented approach: Presto -> Fluentd -> Treasure Data -> Presto
• SQL clustering -> Find a bottleneck -> Optimize it!
• Optimization approaches
• Split usage control, Presto Ops Robot, Stella partition optimizer
• Low-latency access by Prestobase
• Workflow
• On-going Work
• Physical storage optimization (Stella)
• Huge query optimization
• Incremental Processing Support
• Digdag workflow
• MessageFrame
https://www.treasuredata.com/company/careers/
