
Presto At Treasure Data

Presto At Treasure Data - Presto Meetup Tokyo 2017
https://techplay.jp/event/621143

Published in: Technology


  1. Presto At Treasure Data
     Presto Meetup @ Tokyo - June 15, 2017
     Taro L. Saito (GitHub: @xerial)
     Ph.D., Software Engineer at Treasure Data, Inc.
  2. Presto Usage at Treasure Data (2017)
     • Processing 15 trillion rows / day (= 173 million rows / sec.)
     • 150,000+ queries / day
     • 1,500+ users
     • Hosting Presto as a service for 3 years
  3. Configurations
     • Hosted on AWS (us-east), AWS Tokyo, IDCF (Japan)
     • Multi-tenancy clusters
     • PlazmaDB
       • Storage: Amazon S3 or RiakCS
       • S3 file indexes: PostgreSQL
       • Storage format: Columnar MessagePack (MPC)
         • MessagePack: a self-describing data format
         • Compact: 10x compression ratio over the original input data (JSON)
     • 200GB JVM memory per node
       • To support a wide variety of query usage
       • Estimating the required memory in advance is difficult
       • Avoids the WAITING_FOR_MEMORY state that blocks the entire query processing
       • In small-memory configurations, major GCs were quite frequent
  4. Challenges
     • Major complaint: "Presto is slower than usual"
     • Only 20% of 150,000 queries are using our scheduling feature
       • However, 85% of queries are actually scheduled by user scripts or third-party tools
     • How can we know the expected performance?
       • (Implicit) Service Level Objectives (SLOs)
  5. Understanding Implicit SLOs
     • We usually looked into slow queries to figure out the performance bottlenecks
       • However, analyzing SQL takes a long time, because we need to understand the meaning of the data
       • Understanding a hundred lines of SQL is painful
     • Created Presto query tuning guides
       • Presto Query FAQs: https://docs.treasuredata.com/articles/presto-query-faq
     • Expectations for performance
       • Scheduled queries: we can estimate SLOs from historical stats
       • Scheduled, but submitted from third-party tools or user scripts: how do we know the expected performance?
       • We need to internalize customers' knowledge of query performance
  6. Our Approach: Data-Driven Improvement
     • Cycle: query logs -> store -> analyze SQL -> improve & optimize
     • Bad: collecting stdout/stderr logs of Presto
     • Good: collecting logs in a queryable format, analyzable with Presto
     • Collecting query event logs into Treasure Data
       • Presto EventListener -> fluentd -> Treasure Data
     • Treasure Data is schema-less: the schema can be automatically generated from the data
       • As we add new fields to the event, the schema evolves automatically
     • We have been collecting every single query log since the beginning of the Presto service
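As a rough sketch of the EventListener -> fluentd -> Treasure Data pipeline on this slide, a fluentd fragment might look like the following (tag, port, and API key are placeholders, not Treasure Data's actual configuration):

```
# Hypothetical fluentd fragment: receive Presto query-completion events
# emitted by the EventListener over the forward protocol, then ship them
# to Treasure Data with the td output plugin.
<source>
  @type forward
  port 24224
</source>

<match presto.query_completion>
  @type tdlog
  apikey YOUR_TD_APIKEY
  auto_create_table
</match>
```

With auto_create_table, newly added event fields simply appear as new columns, which is the schema-evolution behavior described above.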
  7. Query Event Logs
     • Query completion
       • queryId, user id, session parameters, etc.
       • Query stats: running time, total rows, bytes, splits, CPU time, etc.
       • SQL statement
     • Split completion
       • Running time, processed rows, bytes, etc.
       • S3 GET access count, read bytes
     • Table scan
       • Accessed table names, column sets
       • Accessed time ranges (e.g., queries looking at data of the past 1 hour, 7 days, etc.)
       • Filtering conditions (predicates)
  8. Clustering Queries with Query Signatures
     • Finding implicit SLOs: need to classify the 85% of scheduled queries
     • Extracting query signatures
       • Simplify complex SQL expressions into a tiny SQL representation
       • Reusing the ANTLR parser of Presto
     • Query signature example:
       • S[Cnt](J(T1,G(S[Cnt](T2))))
       • SELECT count(a),... FROM T1 JOIN (SELECT count(b),... FROM T2 GROUP BY x)
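The real signatures are nested trees built with Presto's ANTLR parser; as a much simpler illustration of the clustering idea, a flat token-level sketch (all names mine, not Treasure Data's code) already groups queries of the same shape:

```scala
// Toy "query signature": keep only structural keywords and table names,
// dropping column lists, literals, and predicates. The real implementation
// reuses Presto's ANTLR parser and produces a nested form such as
// S[Cnt](J(T1,G(S[Cnt](T2)))).
def querySignature(sql: String): String = {
  val short = Map("SELECT" -> "S", "JOIN" -> "J", "GROUP" -> "G",
                  "ORDER" -> "O", "UNION" -> "U")
  val tokens = sql.toUpperCase.split("[^A-Z0-9_]+").filter(_.nonEmpty)
  val out = scala.collection.mutable.ListBuffer.empty[String]
  var prev = ""
  for (t <- tokens) {
    if (short.contains(t)) out += short(t)
    else if (prev == "FROM" || prev == "JOIN") out += t // table name
    prev = t
  }
  out.mkString(",")
}
```

Two queries that differ only in selected columns or filter values map to the same signature string, so their historical running times can be aggregated per signature.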
  9. Implicit SLOs
     • Collect the historical running times of queries that have the same query signature
     • Median absolute deviation (MAD): the median of |running time - median|
     • CoV: coefficient of variation = MAD / median
       • If CoV > 1, the query running time tends to vary
       • If CoV < 1, the median of the historical running times is useful for estimating the query running time
     • SLO violation: the query is running longer than median + MAD
       • The customer feels the query is slower than usual
       • However, the query might be processing much more data than usual, so normalization by the processed data size is also necessary
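The slide's MAD/CoV heuristic fits in a few lines over the historical running times of one query signature (a sketch; helper names are mine):

```scala
// Implicit-SLO sketch: MAD is the median absolute deviation of historical
// running times, CoV = MAD / median, and a run is flagged when it exceeds
// median + MAD ("slower than usual"). All names are illustrative.
def median(xs: Seq[Double]): Double = {
  val s = xs.sorted
  val n = s.length
  if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
}

case class Slo(med: Double, mad: Double) {
  def cov: Double = mad / med          // coefficient of variation
  def stable: Boolean = cov < 1.0      // history predicts the running time
  def violated(runningTime: Double): Boolean =
    runningTime > med + mad            // "slower than usual"
}

def sloOf(history: Seq[Double]): Slo = {
  val m = median(history)
  Slo(m, median(history.map(t => math.abs(t - m))))
}
```

Using the median and MAD rather than mean and variance keeps one outlier run (e.g., the 50-second run below) from distorting the baseline; the size-normalization mentioned on the slide would divide running times by processed bytes first.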
  10. Typical Performance Bottlenecks
     • Huge queries: frequent S3 access, wide table scans
     • Single-node operators: ORDER BY, window functions, count(distinct x), processing skewed data, etc.
     • Ill-performing worker nodes: heavy load on a single worker node
     • Insufficient pool memory
     • Major/full GCs
       • We are using min.error-duration = 2m, but GC pauses can be longer
     • Too much resource usage: a single query occupies the entire cluster
       • e.g., a query with hundreds of query stages!
  11. Split Resource Manager
     • Problem: a single query can occupy the entire cluster resources
       • Presto has only limited performance controls: CPU time, memory usage, and concurrent query (CQ) limits
       • No throttling nor boosting
     • Created the Split Resource Manager
       • Limits the max runnable splits for each customer
       • Uses a custom RemoteTask class, which adds a wait if no splits are available
     • => Efficient use of a multi-tenancy cluster
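The throttling idea reduces to a counting semaphore per customer (a sketch with made-up names; the real mechanism lives inside the custom RemoteTask described above):

```scala
import java.util.concurrent.Semaphore

// Per-customer split throttling sketch: at most maxRunnableSplits splits
// of one customer run concurrently; further splits must wait (here: fail
// to acquire) until a running split finishes and releases its permit.
class SplitThrottle(maxRunnableSplits: Int) {
  private val permits = new Semaphore(maxRunnableSplits)
  def tryStartSplit(): Boolean = permits.tryAcquire()
  def finishSplit(): Unit = permits.release()
}
```

One such throttle per customer caps how much of a multi-tenant cluster any single customer's splits can consume at once.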
  12. Presto Ops Robot
     • Problem: insufficient memory on a worker
       • Queries using that worker node enter the WAITING_FOR_MEMORY state
     • Report JMX metrics -> fluentd -> DataDog -> trigger alert -> Presto Ops Robot
     • Presto Ops Robot
       • Sends a graceful shutdown command (POST a SHUTTING_DOWN message to /v1/status)
       • or kills memory-consuming queries on the worker node
     • Restarting the worker JVM process
       • At least every week, to avoid issues from running a JVM for a long time
       • Resets any effects caused by unknown bugs
       • Useful for cleaning up untracked memory (e.g., ANTLR objects, etc.)
  13. S3 Access Performance
     • Problem: slow table scans
       • An S3 GET request has constant latency: 30-50 ms regardless of the read size (up to an 8KB read)
       • Retrying requests on 500 or 503 (SlowDown) errors is also necessary
       • Reading the small header part of S3 objects can take the majority of query processing time
         • Columnar format: header + column blocks
     • IO Manager
       • Needs to send as many S3 GET requests as possible
       • 1 split = multiple S3 objects
       • Pipelines S3 GET requests and column reads
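The pipelining point can be sketched with plain Scala futures (the fetch function stands in for an S3 GET; this is not Treasure Data's actual IO manager):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Issue the GETs for all objects of a split concurrently, so the constant
// 30-50 ms per-request latency overlaps instead of adding up serially.
def fetchAll(keys: Seq[String])(fetch: String => Array[Byte]): Seq[Array[Byte]] = {
  val inFlight = keys.map(k => Future(fetch(k))) // all requests in flight
  Await.result(Future.sequence(inFlight), 1.minute)
}
```

With N objects per split, total scan latency approaches one round-trip plus transfer time, rather than N round-trips.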
  14. Presto Stella: Plazma Storage Optimizer
     • Problem
       • Some queries read 1 million partitions <- the S3 latency overhead is quite high
       • Data from mobile applications often has a wide range of time values
     • Presto Stella connector: using Presto to optimize physical storage partitions
       • Input records: file list on S3
       • Table writer stage: merges fragmented partitions and uploads them to S3
       • Commit: updates the S3 file indexes on PostgreSQL (in an atomic transaction)
     • Performance improvement
       • e.g., 10,000 partitions (30 sec.) -> 20 partitions (1.5 sec.): a 20x improvement
     • Use cases
       • Maintaining fragmented user-defined partitions
       • 1-hour partitioning -> more flexible time-range partitioning
  15. Transitions of Database Usages
  16. New Directions Explored by Presto
     • Traditional database usage
       • Required a database administrator (DBA)
       • The DBA designs the schema and queries, and tunes query performance
     • After Presto
       • The schema is designed by data providers
         • 1st-party data (users' customer data)
         • 3rd-party data sources
       • Analysts or marketers explore the data with Presto
         • They don't know the schema in advance
         • Convenient and low-latency access is necessary
         • SQL can be inefficient at first; while exploring the data, the SQL can become sophisticated, but not always
  17. Prestobase Proxy: Low-Latency Access to Presto
     • Needed a more interactive Presto experience
     • Prestobase Proxy: a gateway to the Presto coordinator
       • Talks the Presto protocol (/v1/statement/...)
       • Written in Scala; runs on Docker
       • Based on Finagle (an HTTP server library written by Twitter)
     • Features
       • Works with standard Presto clients (e.g., presto-cli, presto-jdbc, presto-odbc, etc.)
       • Increased connectivity to BI tools: Tableau, Datorama, ChartIO, Looker, etc.
       • Authentication (API key)
       • Rewriting nextUri (internal IP address -> external host name)
       • BI-tool-specific query filters
       • etc.
  18. Customizing Prestobase Filters
     • Prestobase Proxy: a gateway to access Presto
     • Adding TD-specific bindings
       • Finagle filters -> injecting TD-specific filters
       • Using Airframe, a dependency injection library for Scala
  19. Airframe
     • http://wvlet.org/airframe
     • Three-step DI in Scala: bind, design, build
     • Built-in life cycle manager
       • Session start/shutdown
       • Examples: opening/closing a Presto connection, shutting down the Presto server, etc.
     • Session: manages singletons and binding rules
  20. VCR Record/Replay for Testing Presto
     • Launching Presto requires a lot of memory (e.g., 2GB or more)
       • Often crashes CI service containers (TravisCI, CircleCI, etc.)
     • Recording Presto responses (prestobase-vcr)
       • With sqlite-jdbc: https://github.com/xerial/sqlite-jdbc
       • One DB file for each test suite
     • Enabled small-memory-footprint testing
       • Can run many Presto tests in CI
  21. Optimizing QueryResults Transfer in Prestobase
     • Accept: application/x-msgpack HTTP header
       • Returns Presto query result rows in MessagePack format
     • The QueryResults object contains Array<Array<Object>> => MessagePack (compact binary)
       • Encoding QueryResults objects using MessagePack/Jackson: https://github.com/msgpack/msgpack-java
       • The Presto client doesn't need to parse the row part
     • 1.5x-2.0x performance improvement for streaming query results
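On the proxy side, choosing the encoding is plain content negotiation on the Accept header (a sketch; the actual Finagle filter is more involved):

```scala
// Return MessagePack-encoded rows only when the client explicitly asks
// for them; fall back to Presto's standard JSON responses otherwise.
def resultFormat(accept: Option[String]): String =
  accept match {
    case Some(v) if v.contains("application/x-msgpack") => "msgpack"
    case _                                              => "json"
  }
```

Keying on the header keeps the proxy compatible with unmodified Presto clients, which never send it and keep receiving JSON.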
  22. Prestobase Modules
     • prestobase-proxy: proxy server to access Presto with authentication
     • prestobase-agent: agent for running Presto queries and storing their results
     • prestobase-vcr: records/replays Presto responses
     • prestobase-codec: MessagePack codec for Presto query responses
     • prestobase-hq (headquarters): Presto usage analysis pipelines, SLO monitoring, etc.
     • prestobase-conductor: multi-Presto-cluster management tool
     • td-prestobase: Treasure Data-specific bindings of prestobase
       • TD authentication, job logging/monitoring
       • BI-tool-specific filters (Tableau, Looker, etc.)
  23. Bridging Gaps Between SQL and Programming Languages
     • Traditional approach
       • OR-mapper: app developers design objects and a schema, then generate SQL
     • New approach: SQL first
       • Need to manage various SQL results inside the programming language
     • prestobase-hq needs to manage hundreds of SQL queries and their results
       • SLO analysis, query performance analysis, etc.
     • But how?
  24. sbt-sql: https://github.com/xerial/sbt-sql
     • Scala SBT plugin for generating model classes from SQL files
       • src/main/sql/presto/*.sql (Presto queries)
     • Using SQL as a function: read Presto SQL results as objects
     • Enabled managing SQL queries in GitHub
     • Type-safe data analysis in prestobase-hq
  25. Big Challenge: Splitting Huge Queries
     • Table scan log analysis revealed that most customers are scanning the same data over and over
       • Optimizing SQL is not their major concern; analyzing data has higher priority
     • Splitting a huge query into scheduled hourly/daily jobs
     • digdag: open-source workflow engine: http://digdag.io
       • YAML-based task definitions
       • Scheduling; runs Presto queries
       • Easy to use
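A minimal digdag workflow for such a split might look like the following (schedule, query path, and database name are placeholders, not from the talk):

```yaml
# Hypothetical hourly job replacing one huge scan-everything query:
# each run aggregates only the most recent hour of data.
timezone: UTC

schedule:
  hourly>: 05:00            # run at minute 5 of every hour

+aggregate_last_hour:
  td>: queries/hourly_aggregate.sql
  database: analytics
```

Each hourly run then scans a narrow time range instead of the whole table, which is exactly the repeated-scan pattern the log analysis exposed.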
  26. Time Range Primitives
     • TD_TIME_RANGE(time, '2017-06-15', '2017-06-16', 'PDT')
       • The most frequently used UDF, but inconvenient
     • Use short descriptions of relative time ranges
       • 1d (1 day), 7d (7 days), 1h (1 hour), 1w (1 week), 1M (1 month)
       • today, yesterday, lastWeek, thisWeek, etc.
     • Recent data access
       • 1dU (1 day until now) => TD_TIME_RANGE(time, '2017-06-15', null, 'JST') (open range)
     • Splitting ranges
       • 1w.splitIntoDays
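A toy parser shows how little machinery the short forms need (this only computes the span length in seconds; the real primitives also anchor it to the current time and a time zone):

```scala
// Parse the short relative range descriptions from the slide ("1h", "1d",
// "7d", "1w", "1M") into a length in seconds. Illustrative only; "1M" is
// approximated here as 30 days.
def rangeSeconds(range: String): Long = {
  val n = range.init.toLong
  range.last match {
    case 'h' => n * 3600L
    case 'd' => n * 86400L
    case 'w' => n * 7L * 86400L
    case 'M' => n * 30L * 86400L
    case c   => throw new IllegalArgumentException(s"unknown unit: $c")
  }
}
```

Rewriting such a range into TD_TIME_RANGE bounds then lets the engine prune partitions outside the span.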
  27. MessageFrame (In Design)
     • Next-generation tabular data format
     • Hybrid layout
       • Row-oriented: for streaming; quick writes
       • Column-oriented: better compression and fast reads
     • Specification layers
       • Layer-0 (basic specs: keep it simple, stupid)
         • Data type: MessagePack
         • Compression codecs: raw, delta, gzip (snappy, zstd?, etc.)
         • Column metadata: min/max/sum values of columns
       • Layer-1 (advanced compression)
       • Layer-N should be convertible to Layer-0
  28. Summary
     • Managing implicit SLOs
       • Data-oriented approach: Presto -> fluentd -> Treasure Data -> Presto
       • SQL clustering -> find a bottleneck -> optimize it!
     • Optimization approaches
       • Split usage control, Presto Ops Robot, Stella partition optimizer
       • Low-latency access via Prestobase
       • Workflows
     • Ongoing work
       • Physical storage optimization (Stella)
       • Huge query optimization, incremental processing support
       • Digdag workflows
       • MessageFrame
     • https://www.treasuredata.com/company/careers/
