
Optimizing Presto Connector on Cloud Storage

DB Tech Showcase Tokyo 2017


  1. 1. T R E A S U R E D A T A OPTIMIZING PRESTO CONNECTOR ON CLOUD STORAGE DB Tech Showcase Tokyo 2017 Kai Sasaki Software Engineer at Treasure Data Inc.
  2. 2. ABOUT ME • Kai Sasaki (佐々木 海) • Software Engineer at TreasureData • Hadoop/Spark contributor • Hivemall committer • Java/Scala/Python
  3. 3. TREASURE DATA Data Analytics Platform Unify all your raw data in a scalable and secure platform. Supporting 100+ integrations to enable you to easily connect all your data sources in real time. Live with OSS • Fluentd • Embulk • Digdag • Hivemall and more https://www.treasuredata.com/opensource/
  4. 4. AGENDA • What is Presto? • Presto Connector Detail • Cloud Storage and PlazmaDB • Transaction and Partitioning • Time Index Partitioning • User Defined Partitioning
  5. 5. WHAT IS PRESTO?
  6. 6. WHAT IS PRESTO? • Presto is an open source, scalable, distributed SQL engine for huge OLAP workloads • Mainly developed by Facebook and Teradata • Used by Facebook, Uber, Netflix, etc. • In-memory processing • Pluggable architecture: Hive, Cassandra, Kafka, etc.
  7. 7. PRESTO IN TREASURE DATA
  8. 8. PRESTO IN TREASURE DATA • Multiple clusters with 40~50 workers • Presto 0.178 + Original Presto Plugin (Connector) • 4.3+ million queries per month • 400 trillion records per month • 6+ PB per month
  9. 9. PRESTO CONNECTOR
  10. 10. PRESTO CONNECTOR • A Presto connector is a plugin that gives Presto access to various kinds of existing data storage. • A connector is responsible for managing metadata, transactions and data access. http://prestodb.io/
  11. 11. PRESTO CONNECTOR • Hive Connector
 Uses the Hive metastore for metadata and S3/HDFS as storage. • Kafka Connector
 Queries Kafka topics as tables. Each message is interpreted as a row in a table. • Redis Connector
 Each key/value pair is interpreted as a row in Presto. • Cassandra Connector
 Supports Cassandra 2.1.5 or later.
  12. 12. PRESTO CONNECTOR • Black Hole Connector
 Works like /dev/null or /dev/zero on Unix-like systems. Used for catastrophe tests or integration tests. • Memory Connector
 Metadata and data are stored in RAM on worker nodes.
 Still an experimental connector, mainly used for tests. • System Connector
 Provides information about the cluster state and running query metrics. Useful for runtime monitoring.
  13. 13. CONNECTOR DETAIL
  14. 14. PRESTO CONNECTOR • Plugin defines the interface for bootstrapping your connector creation. • It also provides the list of UDFs available in your Presto cluster. • A ConnectorFactory is able to provide multiple connector implementations. Flow: Plugin → getConnectorFactories() → ConnectorFactory → create(connectorId, …) → Connector
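Below is a minimal sketch of this bootstrap path, assuming the com.facebook.presto SPI of roughly this Presto version (0.17x); PlazmaPlugin, PlazmaConnectorFactory, PlazmaHandleResolver, PlazmaMetadata, PlazmaSplitManager, PlazmaPageSourceProvider and PlazmaPageSinkProvider are hypothetical names, and method signatures may differ in other SPI versions.

    import java.util.Collections;
    import java.util.Map;

    import com.facebook.presto.spi.ConnectorHandleResolver;
    import com.facebook.presto.spi.Plugin;
    import com.facebook.presto.spi.connector.Connector;
    import com.facebook.presto.spi.connector.ConnectorContext;
    import com.facebook.presto.spi.connector.ConnectorFactory;

    // Hypothetical plugin: Presto discovers it via ServiceLoader and asks it for
    // the connector factories it contributes.
    public class PlazmaPlugin implements Plugin
    {
        @Override
        public Iterable<ConnectorFactory> getConnectorFactories()
        {
            // A single Plugin may return several factories, one per connector type.
            return Collections.<ConnectorFactory>singletonList(new PlazmaConnectorFactory());
        }
    }

    class PlazmaConnectorFactory implements ConnectorFactory
    {
        @Override
        public String getName()
        {
            return "plazma"; // the name referenced by connector.name in catalog properties
        }

        @Override
        public ConnectorHandleResolver getHandleResolver()
        {
            return new PlazmaHandleResolver(); // hypothetical: maps handle classes for serialization
        }

        @Override
        public Connector create(String connectorId, Map<String, String> config, ConnectorContext context)
        {
            // Build the components from the catalog config and wire them into the
            // Connector sketched after the next slide; all classes are hypothetical.
            return new PlazmaConnector(
                    new PlazmaMetadata(config),
                    new PlazmaSplitManager(config),
                    new PlazmaPageSourceProvider(config),
                    new PlazmaPageSinkProvider(config));
        }
    }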
  15. 15. PRESTO CONNECTOR • Connector provides the classes that manage metadata, storage access and table access control. • ConnectorSplitManager creates the data source metadata (splits) to be distributed to multiple worker nodes. • ConnectorPage[Source|Sink]Provider supplies the page source/sink used by the split operators. Components: Connector → ConnectorMetadata, ConnectorSplitManager, ConnectorPageSourceProvider, ConnectorPageSinkProvider, ConnectorAccessControl
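A hedged sketch of how such a Connector could wire those components together, again assuming the 0.17x SPI; PlazmaConnector and PlazmaTransactionHandle are hypothetical, and getAccessControl() plus other optional methods are omitted.

    import com.facebook.presto.spi.connector.Connector;
    import com.facebook.presto.spi.connector.ConnectorMetadata;
    import com.facebook.presto.spi.connector.ConnectorPageSinkProvider;
    import com.facebook.presto.spi.connector.ConnectorPageSourceProvider;
    import com.facebook.presto.spi.connector.ConnectorSplitManager;
    import com.facebook.presto.spi.connector.ConnectorTransactionHandle;
    import com.facebook.presto.spi.transaction.IsolationLevel;

    // Hypothetical connector: it only hands out the components; the real work is
    // done by the metadata, split manager and page source/sink implementations.
    public class PlazmaConnector implements Connector
    {
        private final ConnectorMetadata metadata;
        private final ConnectorSplitManager splitManager;
        private final ConnectorPageSourceProvider pageSourceProvider;
        private final ConnectorPageSinkProvider pageSinkProvider;

        public PlazmaConnector(
                ConnectorMetadata metadata,
                ConnectorSplitManager splitManager,
                ConnectorPageSourceProvider pageSourceProvider,
                ConnectorPageSinkProvider pageSinkProvider)
        {
            this.metadata = metadata;
            this.splitManager = splitManager;
            this.pageSourceProvider = pageSourceProvider;
            this.pageSinkProvider = pageSinkProvider;
        }

        // Minimal marker handle; a real connector would track PlazmaDB transaction state here.
        enum PlazmaTransactionHandle implements ConnectorTransactionHandle { INSTANCE }

        @Override
        public ConnectorTransactionHandle beginTransaction(IsolationLevel isolationLevel, boolean readOnly)
        {
            return PlazmaTransactionHandle.INSTANCE;
        }

        @Override
        public ConnectorMetadata getMetadata(ConnectorTransactionHandle transaction)
        {
            return metadata;           // schemas, tables, beginInsert/finishInsert, ...
        }

        @Override
        public ConnectorSplitManager getSplitManager()
        {
            return splitManager;       // turns a table scan into distributable splits
        }

        @Override
        public ConnectorPageSourceProvider getPageSourceProvider()
        {
            return pageSourceProvider; // reads pages for a split (downloads files from S3)
        }

        @Override
        public ConnectorPageSinkProvider getPageSinkProvider()
        {
            return pageSinkProvider;   // writes pages produced by INSERT / CTAS
        }
    }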
  16. 16. PRESTO CONNECTOR • Call beginInsert on ConnectorMetadata. • ConnectorSplitManager creates splits that include metadata about the actual data source (e.g. file paths). • ConnectorPageSourceProvider downloads the files from the data source in parallel. • finishInsert on ConnectorMetadata commits the transaction. Flow: ConnectorMetadata.beginInsert → ConnectorSplitManager.getSplits → parallel ConnectorPageSourceProvider operators → ConnectorMetadata.finishInsert
  17. 17. PRESTO ON CLOUD STORAGE • A distributed execution engine like Presto can no longer exploit data locality on cloud storage. • Reading and writing data can be a dominant factor in query performance, stability and cost. → The connector should be implemented to take care of network I/O cost.
  18. 18. CLOUD STORAGE IN TD • Our Treasure Data storage service is built on cloud storage such as S3. • Presto only provides a distributed query execution layer, which requires us to make our storage system scalable as well. • On the other hand, we want to take advantage of the maintainability and availability provided by the cloud service provider (IaaS).
  19. 19. EASE-UP APPROACH
  20. 20. PLAZMADB • We built a thin storage layer, called PlazmaDB, on top of existing cloud storage and a relational database. • PlazmaDB is a central component that stores all customer data for analysis in Treasure Data. • PlazmaDB consists of two components • Metadata (PostgreSQL) • Storage (S3 or RiakCS)
  21. 21. PLAZMADB • PlazmaDB stores metadata of data files in PostgreSQL hosted by Amazon RDS.
  22. 22. PLAZMADB • PlazmaDB stores metadata of data files in PostgreSQL hosted on Amazon RDS. • This PostgreSQL instance manages the index, the file paths on S3, transactions and deleted files.
  23. 23. TRANSACTION AND PARTITIONING
  24. 24. TRANSACTION AND PARTITIONING • Consistency is the most important factor for enterprise analytics workloads, so an MPP engine like Presto and the backend storage MUST always guarantee consistency. → UPDATE is done atomically by PlazmaDB • At the same time, we want to achieve high throughput by distributing the workload to multiple worker nodes. → Data files are partitioned in PlazmaDB
  25. 25. PLAZMADB TRANSACTION • PlazmaDB supports transactions for queries that have side effects (e.g. INSERT INTO / CREATE TABLE). • A PlazmaDB transaction is an atomic operation on the visibility of data on S3, not on the actual files. • A transaction is composed of two phases (sketched below) • Uploading uncommitted partitions • Committing the transaction by moving the uncommitted partitions
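A rough sketch of those two phases using plain JDBC against the metadata database; the table and column names (uncommitted_partitions, committed_partitions, data_set_id, path, first_index_key, last_index_key) are hypothetical and only illustrate the upload-first, flip-metadata-atomically idea.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class PlazmaCommitSketch
    {
        // Phase 1: each worker registers the file it already uploaded to S3 as uncommitted.
        static void registerUncommitted(Connection db, long dataSetId, String s3Path,
                                        long firstIndexKey, long lastIndexKey) throws Exception
        {
            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO uncommitted_partitions " +
                    "(data_set_id, path, first_index_key, last_index_key) VALUES (?, ?, ?, ?)")) {
                ps.setLong(1, dataSetId);
                ps.setString(2, s3Path);
                ps.setLong(3, firstIndexKey);
                ps.setLong(4, lastIndexKey);
                ps.executeUpdate();
            }
        }

        // Phase 2: the coordinator moves all uncommitted records into the committed
        // table inside one PostgreSQL transaction, so readers see none or all of them.
        static void commit(Connection db, long dataSetId) throws Exception
        {
            db.setAutoCommit(false);
            try (PreparedStatement move = db.prepareStatement(
                     "INSERT INTO committed_partitions " +
                     "SELECT * FROM uncommitted_partitions WHERE data_set_id = ?");
                 PreparedStatement clean = db.prepareStatement(
                     "DELETE FROM uncommitted_partitions WHERE data_set_id = ?")) {
                move.setLong(1, dataSetId);
                move.executeUpdate();
                clean.setLong(1, dataSetId);
                clean.executeUpdate();
                db.commit();
            }
            catch (Exception e) {
                db.rollback();
                throw e;
            }
        }
    }

A DELETE (slides 31-32) follows the same pattern: the replacement partitions are uploaded as uncommitted, and the commit swaps them for the old committed records.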
  26. 26. PLAZMADB TRANSACTION • Multiple workers upload files to S3 asynchronously. [Diagram: Uncommitted and Committed tables in PostgreSQL]
  27. 27. PLAZMADB TRANSACTION • After each upload finishes, a record is inserted into the uncommitted table in PostgreSQL. [Diagram: Uncommitted and Committed tables in PostgreSQL]
  28. 28. PLAZMADB TRANSACTION • After each upload finishes, a record is inserted into the uncommitted table in PostgreSQL. [Diagram: partitions p1, p2 registered in the uncommitted table]
  29. 29. PLAZMADB TRANSACTION • After all upload tasks are completed, the coordinator tries to commit the transaction by moving all records from uncommitted to committed. [Diagram: partitions p1, p2, p3 in the Uncommitted/Committed tables in PostgreSQL]
  30. 30. PLAZMADB TRANSACTION • After all upload tasks are completed, the coordinator tries to commit the transaction by moving all records from uncommitted to committed. [Diagram: partitions p1, p2, p3 in the Uncommitted/Committed tables in PostgreSQL]
  31. 31. PLAZMADB DELETE • A DELETE query is handled in a similar way. First, newly created partitions are uploaded, excluding the deleted records. [Diagram: committed partitions p1, p2, p3 and uncommitted replacements p1’, p2’, p3’]
  32. 32. PLAZMADB DELETE • When the transaction is committed, the records in the committed table are replaced by the uncommitted records, which point to different file paths. [Diagram: committed table now holds p1’, p2’, p3’]
  33. 33. PARTITIONING • To make the best of Presto's high-throughput parallel processing, the data source must be distributed as well. • Distributing the data source evenly contributes to high throughput and performance stability. • Two basic partitioning methods • Key range partitioning -> Time-Index Partitioning • Hash partitioning -> User Defined Partitioning
  34. 34. PARTITIONING • A partition record in PlazmaDB represents a file stored on S3, together with some additional information (modeled in the sketch below) • Data Set ID • Range Index Key • Record Count • File Size • Checksum • File Path
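For illustration only, one such partition record could be modeled like this; the field names are hypothetical and follow the list above, with the range index key split into the first/last keys described on the next slide.

    // Hypothetical model of one PlazmaDB partition record, i.e. one file on S3.
    public class PartitionMetadata
    {
        final long dataSetId;        // Data Set ID the partition belongs to
        final long firstIndexKey;    // range index key: start of the time range (unix epoch)
        final long lastIndexKey;     // range index key: end of the time range (unix epoch)
        final long recordCount;      // number of records in the file
        final long fileSizeBytes;    // size of the file on S3
        final String checksum;       // checksum used to validate the downloaded file
        final String filePath;       // S3 object key of the data file

        PartitionMetadata(long dataSetId, long firstIndexKey, long lastIndexKey,
                          long recordCount, long fileSizeBytes, String checksum, String filePath)
        {
            this.dataSetId = dataSetId;
            this.firstIndexKey = firstIndexKey;
            this.lastIndexKey = lastIndexKey;
            this.recordCount = recordCount;
            this.fileSizeBytes = fileSizeBytes;
            this.checksum = checksum;
            this.filePath = filePath;
        }
    }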
  35. 35. PARTITIONING • All partitions in PlazmaDB are indexed by the time at which the data was generated. The time index is recorded as a UNIX epoch. • A partition keeps first_index_key and last_index_key to specify the time range it covers. • The PlazmaDB index is constructed as a multicolumn index using PostgreSQL's GiST index.
 (https://www.postgresql.org/docs/current/static/gist.html) • (data_set_id, index_range(first_index_key, last_index_key))
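A hedged sketch of what such an index and the corresponding pruning query could look like, issued here through JDBC; the committed_partitions table and its columns are hypothetical, and the btree_gist extension is assumed so that a scalar column and a range expression can share one GiST index.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class PartitionIndexSketch
    {
        // Hypothetical DDL: multicolumn GiST index over (data_set_id, time range).
        static void createIndex(Connection db) throws Exception
        {
            try (Statement st = db.createStatement()) {
                st.execute("CREATE EXTENSION IF NOT EXISTS btree_gist");
                st.execute("CREATE INDEX partitions_range_idx ON committed_partitions " +
                           "USING gist (data_set_id, int8range(first_index_key, last_index_key, '[]'))");
            }
        }

        // Return only the partitions whose time range overlaps the queried range;
        // everything else is pruned without touching S3.
        static List<String> findOverlapping(Connection db, long dataSetId, long start, long end)
                throws Exception
        {
            List<String> paths = new ArrayList<>();
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT path FROM committed_partitions " +
                    "WHERE data_set_id = ? " +
                    "AND int8range(first_index_key, last_index_key, '[]') && int8range(?, ?, '[]')")) {
                ps.setLong(1, dataSetId);
                ps.setLong(2, start);
                ps.setLong(3, end);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        paths.add(rs.getString("path"));
                    }
                }
            }
            return paths;
        }
    }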
  36. 36. LIFECYCLE OF PARTITION • PlazmaDB has two storage management layers.
 At the beginning, records are put on the realtime storage layer in raw format. [Diagram: Realtime Storage holding records with time 100, 300, 500, 3800, 4000; Archive Storage]
  37. 37. LIFECYCLE OF PARTITION • Every hour, a dedicated MapReduce job called the Log Merge Job merges records in the same time range into one partition in archive storage. [Diagram: realtime records with time 100, 300, 500, 3800, 4000 merged by MR into archive partitions time 0~3599 and time 3600~7200]
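The hourly range a record lands in follows directly from its unix epoch time; a tiny sketch of that grouping (class and method names are hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class LogMergeSketch
    {
        // Group record timestamps into one-hour ranges, mirroring how the Log Merge
        // Job folds realtime-storage records into hourly archive partitions.
        static Map<Long, List<Long>> groupByHour(List<Long> recordTimes)
        {
            Map<Long, List<Long>> byRange = new TreeMap<>();
            for (long time : recordTimes) {
                long rangeStart = (time / 3600) * 3600; // e.g. time 3800 falls into the range starting at 3600
                byRange.computeIfAbsent(rangeStart, k -> new ArrayList<>()).add(time);
            }
            return byRange;
        }
    }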
  38. 38. LIFECYCLE OF PARTITION • A query execution engine like Presto needs to fetch data from both realtime storage and archive storage, but it is generally more efficient to read from archive storage. [Diagram: realtime records and hourly archive partitions time 0~3599 and time 3600~7200]
  39. 39. TWO PARTITIONING TYPES
  40. 40. TIME INDEX PARTITIONING • By using the multicolumn index on the time range in PlazmaDB, Presto can filter out unnecessary partitions through predicate push-down. • The TD_TIME_RANGE UDF gives Presto a hint about which partitions should be fetched from PlazmaDB. • e.g. TD_TIME_RANGE(time, '2017-08-31 12:30:00', NULL, 'JST') • ConnectorSplitManager selects the necessary partitions and calculates the split distribution plan.
  41. 41. TIME INDEX PARTITIONING • Select metadata records from realtime storage and archive storage according to the given time range.
 SELECT * FROM rt/ar WHERE start < time AND time < end;
 [Diagram: ConnectorSplitManager fetching archive partitions time 0~3599 and time 3600~7200 and realtime records time 8000, 8200, 8800, 9000]
  42. 42. TIME INDEX PARTITIONING • A split is responsible for downloading multiple files from S3 in order to reduce overhead. • ConnectorSplitManager calculates the file assignment for each split based on the available statistics (e.g. file size, number of columns, record count), as sketched below. [Diagram: ConnectorSplitManager assigning files f1, f2, f3 to Split1 and Split2]
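A simplified sketch of such an assignment, greedily packing the largest files into the currently lightest split so that each split downloads a similar volume; the real heuristics also weigh column and record counts, and all names here are hypothetical.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class SplitAssignmentSketch
    {
        static class FileInfo
        {
            final String path;
            final long sizeBytes;

            FileInfo(String path, long sizeBytes)
            {
                this.path = path;
                this.sizeBytes = sizeBytes;
            }
        }

        static class Split
        {
            final List<FileInfo> files = new ArrayList<>();
            long totalBytes;
        }

        // Largest files first, always into the currently lightest split: a simple
        // greedy balancing heuristic so every split downloads a similar volume.
        static List<Split> assign(List<FileInfo> files, int splitCount)
        {
            List<Split> splits = new ArrayList<>();
            for (int i = 0; i < splitCount; i++) {
                splits.add(new Split());
            }
            PriorityQueue<Split> lightestFirst =
                    new PriorityQueue<>(Comparator.comparingLong((Split s) -> s.totalBytes));
            lightestFirst.addAll(splits);

            List<FileInfo> bySizeDesc = new ArrayList<>(files);
            bySizeDesc.sort(Comparator.comparingLong((FileInfo f) -> f.sizeBytes).reversed());

            for (FileInfo f : bySizeDesc) {
                Split s = lightestFirst.poll(); // lightest split so far
                s.files.add(f);
                s.totalBytes += f.sizeBytes;
                lightestFirst.add(s);           // re-insert with its updated weight
            }
            return splits;
        }
    }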
  43. 43. TIME INDEX PARTITIONING [Chart: "SELECT 10 cols in a range" — elapsed time (0 to 180 sec) over time ranges from 10 to 60 days, with TD_TIME_RANGE]
  44. 44. TIME INDEX PARTITIONING [Chart: "SELECT 10 cols in a range" — number of splits (0 to 60) over time ranges from 6 months to 6+ years]
  45. 45. CHALLENGE • Time-Index partitioning worked very well because • Most logs from web pages and IoT devices natively carry the time at which they were created. • OLAP workloads from analysts are often limited to a specific time range (e.g. the last week, or the duration of a campaign). • But it lacks the flexibility to build an index on columns other than time, which is required especially in digital marketing and DMP use cases.
  46. 46. USER DEFINED PARTITIONING
  47. 47. USER DEFINED PARTITIONING • We are now evaluating user defined partitioning with Presto. • User defined partitioning allows customers to flexibly set an index on an arbitrary data attribute. • User defined partitioning can co-exist with time-index partitioning as a secondary index.
  48. 48. SELECT COUNT(1) FROM audience
 WHERE
 TD_TIME_RANGE(time, '2017-09-04', '2017-09-07')
 AND
 audience.room = 'E'
  49. 49. BUCKETING • A mechanism similar to Hive bucketing. • A bucket is a logical group of partition files, grouped by the specified bucketing column. [Diagram: a table split into buckets, each bucket holding partitions for time ranges 1~4]
  50. 50. BUCKETING • PlazmaDB defines the hash function type on the partitioning key and the total bucket count, which is fixed in advance. SELECT COUNT(1) FROM audience WHERE TD_TIME_RANGE(time, '2017-09-04', '2017-09-07') AND audience.room = 'E' [Diagram: ConnectorSplitManager routing the query to a table with bucket1, bucket2, bucket3, each holding partitions]
  51. 51. BUCKETING • ConnectorSplitManager selects the proper partitions from PostgreSQL using the given time range and bucket key, as sketched below. SELECT COUNT(1) FROM audience WHERE TD_TIME_RANGE(time, '2017-09-04', '2017-09-07') AND audience.room = 'E' → hash('E') -> bucket2, 1504483200 < time && time < 1504742400 [Diagram: only bucket2's partitions are scanned]
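A small sketch of that pruning logic: hash the bucketing key to pick one bucket, then keep only the partitions in that bucket whose time range overlaps the query. The hash function, bucket count and names are hypothetical; the epochs are the UTC bounds of '2017-09-04' and '2017-09-07' shown above.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class BucketPruningSketch
    {
        static final int BUCKET_COUNT = 3; // hypothetical, fixed in advance for the table

        // Stable hash of the bucketing key; the hash function type is fixed per table
        // so that writers and readers always agree on the bucket layout.
        static int bucketOf(String key)
        {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % BUCKET_COUNT);
        }

        // A partition is read only if it is in the chosen bucket AND its time range
        // overlaps the TD_TIME_RANGE bounds.
        static boolean shouldRead(int partitionBucket, long firstIndexKey, long lastIndexKey,
                                  int queryBucket, long rangeStart, long rangeEnd)
        {
            boolean sameBucket = partitionBucket == queryBucket;
            boolean overlaps = firstIndexKey < rangeEnd && lastIndexKey > rangeStart;
            return sameBucket && overlaps;
        }

        public static void main(String[] args)
        {
            int bucket = bucketOf("E"); // the bucket that audience.room = 'E' hashes to
            System.out.println("room 'E' -> bucket " + bucket);
            System.out.println(shouldRead(bucket, 1504500000L, 1504510000L,
                    bucket, 1504483200L, 1504742400L)); // true: same bucket, overlapping range
        }
    }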
  52. 52. USER DEFINED PARTITIONING • We can skip reading several unnecessary partitions. This architecture fits digital marketing use cases very well. • Creating user segments • Aggregation by channel • It still makes use of time index partitioning. • It is now being tested internally.
  53. 53. RECAP • Presto provides a plugin mechanism called a connector. • Though Presto itself is a highly scalable distributed engine, the connector is also responsible for efficient query execution. • PlazmaDB has desirable properties for integration with such a connector: • Transaction support • Time-Index Partitioning • User Defined Partitioning
  54. 54. T R E A S U R E D A T A
