Optimizing Presto Connector on Cloud Storage

Kai Sasaki
Software Engineer at Treasure Data Inc.
DB Tech Showcase Tokyo 2017
ABOUT ME
• Kai Sasaki (佐々木 海)

• Software Engineer at Treasure Data

• Hadoop/Spark contributor

• Hivemall committer

• Java/Scala/Python
TREASURE DATA
Data Analytics Platform
Unify all your raw data in a scalable and secure platform. Supports 100+ integrations so you can easily connect all your data sources in real time.
Live with OSS
• Fluentd
• Embulk
• Digdag
• Hivemall
and more
https://www.treasuredata.com/opensource/
AGENDA
• What is Presto?

• Presto Connector Detail

• Cloud Storage and PlazmaDB

• Transaction and Partitioning

• Time Index Partitioning

• User Defined Partitioning
WHAT IS PRESTO?
• Presto is an open source, scalable, distributed SQL engine for huge OLAP workloads

• Mainly developed by Facebook and Teradata

• Used by Facebook, Uber, Netflix, etc.

• In-memory processing

• Pluggable architecture

  Hive, Cassandra, Kafka, etc.
PRESTO IN TREASURE DATA
• Multiple clusters with 40–50 workers
• Presto 0.178 + original Presto plugin (connector)
• 4.3+ million queries per month
• 400 trillion records per month
• 6+ PB per month
PRESTO CONNECTOR
• A Presto connector is a plugin that gives Presto access to various kinds of existing data storage.

• A connector is responsible for managing metadata, transactions, and data access.

http://prestodb.io/
PRESTO CONNECTOR
• Hive Connector

  Uses the Hive metastore for metadata and S3/HDFS as storage.

• Kafka Connector

  Queries Kafka topics as tables. Each message is interpreted as a row in a table.

• Redis Connector

  Each key/value pair is interpreted as a row in Presto.

• Cassandra Connector

  Supports Cassandra 2.1.5 or later.
PRESTO CONNECTOR
• Black Hole Connector

  Works like /dev/null or /dev/zero on Unix-like systems. Used for catastrophic tests or integration tests.

• Memory Connector

  Metadata and data are stored in RAM on worker nodes. Still an experimental connector, mainly used for tests.

• System Connector

  Provides information about the cluster state and running query metrics. Useful for runtime monitoring.
CONNECTOR DETAIL
PRESTO CONNECTOR
• Plugin defines an interface to bootstrap your connector creation.

• It also provides the list of UDFs available in your Presto cluster.

• A ConnectorFactory is able to provide multiple connector implementations.

(Diagram: Plugin —getConnectorFactories()→ ConnectorFactory —create(connectorId, …)→ Connector)
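To make the bootstrap path concrete, here is a minimal sketch of the Plugin/ConnectorFactory pair. The class names (PlazmaPlugin, PlazmaConnectorFactory, PlazmaConnector) are hypothetical, and the imports and signatures are only approximate for the 0.17x-era SPI; they differ slightly between Presto versions.

import com.facebook.presto.spi.Plugin;
import com.facebook.presto.spi.connector.Connector;
import com.facebook.presto.spi.connector.ConnectorContext;
import com.facebook.presto.spi.connector.ConnectorFactory;
import com.google.common.collect.ImmutableList;

import java.util.Map;

// Hypothetical plugin bootstrapping one connector implementation.
public class PlazmaPlugin implements Plugin
{
    @Override
    public Iterable<ConnectorFactory> getConnectorFactories()
    {
        // A single plugin may return several factories; we expose one here.
        return ImmutableList.<ConnectorFactory>of(new PlazmaConnectorFactory());
    }
}

class PlazmaConnectorFactory implements ConnectorFactory
{
    @Override
    public String getName()
    {
        // Referenced as connector.name=plazma in the catalog properties file.
        return "plazma";
    }

    @Override
    public Connector create(String connectorId, Map<String, String> config, ConnectorContext context)
    {
        // Wire metadata, split manager, and page source/sink providers together here.
        return new PlazmaConnector(connectorId, config);
    }

    // getHandleResolver() and other required SPI methods are omitted for brevity.
}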
PRESTO CONNECTOR
• Connector provides the classes that manage metadata, storage access, and table access control.

• ConnectorSplitManager creates data source metadata (splits) to be distributed to multiple worker nodes.

• ConnectorPage[Source|Sink]Provider supplies the page sources/sinks used by the split operators.

(Diagram: Connector → ConnectorMetadata, ConnectorSplitManager, ConnectorPageSourceProvider, ConnectorPageSinkProvider, ConnectorAccessControl)
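Continuing the same hypothetical example, the Connector simply hands each of these components to Presto. The component classes referenced here are placeholders, and the method set is again approximate for the 0.17x-era SPI.

import com.facebook.presto.spi.connector.Connector;
import com.facebook.presto.spi.connector.ConnectorMetadata;
import com.facebook.presto.spi.connector.ConnectorPageSinkProvider;
import com.facebook.presto.spi.connector.ConnectorPageSourceProvider;
import com.facebook.presto.spi.connector.ConnectorSplitManager;
import com.facebook.presto.spi.connector.ConnectorTransactionHandle;
import com.facebook.presto.spi.transaction.IsolationLevel;

import java.util.Map;

// Hypothetical connector exposing the per-concern components Presto asks for.
// The Plazma* classes returned below are placeholders (not shown).
public class PlazmaConnector implements Connector
{
    private final String connectorId;
    private final Map<String, String> config;

    public PlazmaConnector(String connectorId, Map<String, String> config)
    {
        this.connectorId = connectorId;
        this.config = config;
    }

    @Override
    public ConnectorTransactionHandle beginTransaction(IsolationLevel isolationLevel, boolean readOnly)
    {
        return new PlazmaTransactionHandle(); // hypothetical handle type
    }

    @Override
    public ConnectorMetadata getMetadata(ConnectorTransactionHandle transaction)
    {
        return new PlazmaMetadata(connectorId); // schemas, tables, beginInsert/finishInsert, ...
    }

    @Override
    public ConnectorSplitManager getSplitManager()
    {
        return new PlazmaSplitManager(); // turns PlazmaDB partitions into splits
    }

    @Override
    public ConnectorPageSourceProvider getPageSourceProvider()
    {
        return new PlazmaPageSourceProvider(); // reads partition files from S3
    }

    @Override
    public ConnectorPageSinkProvider getPageSinkProvider()
    {
        return new PlazmaPageSinkProvider(); // writes uncommitted partition files to S3
    }

    // getAccessControl() is optional and omitted here.
}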
PRESTO CONNECTOR
• Presto calls beginInsert on ConnectorMetadata.

• ConnectorSplitManager creates splits that include metadata of the actual data source (e.g. file paths).

• ConnectorPageSourceProvider downloads the files from the data source in parallel.

• finishInsert in ConnectorMetadata commits the transaction.

(Diagram: ConnectorMetadata.beginInsert → ConnectorSplitManager.getSplits → parallel ConnectorPageSourceProvider operators → ConnectorMetadata.finishInsert)
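On the write path, these two ConnectorMetadata hooks bracket the PlazmaDB transaction. The sketch below is illustrative only: PlazmaClient and the handle types are hypothetical, and the parameter lists are abbreviated since the exact SPI signatures vary across Presto versions.

// Illustrative only: the ConnectorMetadata hooks that bracket a PlazmaDB transaction.
// Parameter lists are abbreviated; the real SPI signatures vary between Presto versions.
public class PlazmaMetadata implements ConnectorMetadata
{
    private final PlazmaClient plazma; // hypothetical client for the PostgreSQL metadata store

    public PlazmaMetadata(PlazmaClient plazma)
    {
        this.plazma = plazma;
    }

    public ConnectorInsertTableHandle beginInsert(ConnectorSession session, ConnectorTableHandle tableHandle)
    {
        // Open a PlazmaDB transaction; workers upload files to S3 and register
        // them as uncommitted partitions under this transaction id.
        long txId = plazma.beginTransaction();
        return new PlazmaInsertTableHandle(tableHandle, txId);
    }

    public void finishInsert(ConnectorSession session, ConnectorInsertTableHandle insertHandle, Collection<Slice> fragments)
    {
        // Atomically flip all uncommitted partition records of this transaction
        // to committed, making the inserted data visible in one step.
        long txId = ((PlazmaInsertTableHandle) insertHandle).getTransactionId();
        plazma.commitTransaction(txId);
    }
}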
PRESTO ON CLOUD STORAGE
• A distributed execution engine like Presto cannot make use of data locality any more on cloud storage.

• Reading/writing data can be a dominant factor in query performance, stability, and cost.

→ The connector should be implemented to take care of network I/O cost.
CLOUD STORAGE IN TD
• Our Treasure Data storage service is built on cloud storage
like S3. 

• Presto just provides a distributed query execution layer, so we need to make our storage system scalable as well.

• On the other hand, we should make use of the maintainability and availability provided by the cloud service provider (IaaS).
EASE-UP APPROACH
PLAZMADB
• We built a thin storage layer, called PlazmaDB, on top of existing cloud storage and a relational database.

• PlazmaDB is a central component that stores all customer data
for analysis in Treasure Data.

• PlazmaDB consists of two components

• Metadata (PostgreSQL)

• Storage (S3 or RiakCS)
PLAZMADB
• PlazmaDB stores the metadata of data files in PostgreSQL hosted by Amazon RDS.

• This PostgreSQL instance manages the index, the file paths on S3, transactions, and deleted files.
TRANSACTION AND PARTITIONING
• Consistency is the most important factor for enterprise analytics workloads. Therefore an MPP engine like Presto and the backend storage MUST always guarantee consistency.

→ UPDATE is done atomically by PlazmaDB

• At the same time, we want to achieve high throughput by distributing the workload to multiple worker nodes.

→ Data files are partitioned in PlazmaDB
PLAZMADB TRANSACTION
• PlazmaDB supports transactions for queries that have side effects (e.g. INSERT INTO / CREATE TABLE).

• A PlazmaDB transaction is an atomic operation on the appearance of the data on S3, not on the actual files.

• A transaction is composed of two phases:

  • Uploading uncommitted partitions

  • Committing the transaction by moving the uncommitted partitions
PLAZMADB TRANSACTION
• Multiple workers upload files to S3 asynchronously.

• After each upload finishes, a record is inserted into the uncommitted table in PostgreSQL.

• After all upload tasks are completed, the coordinator commits the transaction by moving all records from uncommitted to committed.

(Diagrams: partition records p1, p2, p3 first accumulate in the PostgreSQL uncommitted table, then move to the committed table at commit time.)
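The two phases can be pictured with a small JDBC sketch against the metadata store. The table names, columns, and transaction-id scheme below are made up for illustration; only the idea (upload first, then atomically move uncommitted rows to committed) comes from the slides.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Conceptual sketch of the two-phase flow: files are uploaded to S3 first,
// then their metadata rows are flipped from "uncommitted" to "committed"
// in one atomic PostgreSQL transaction. Schema and column names are hypothetical.
public class CommitSketch
{
    public static void commit(String jdbcUrl, long transactionId) throws Exception
    {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);

            // Phase 1 happened earlier: each worker uploaded a file to S3 and
            // inserted a row into the uncommitted table for it.

            // Phase 2: move every row of this transaction to the committed table.
            try (PreparedStatement move = conn.prepareStatement(
                    "INSERT INTO committed_partitions " +
                    "SELECT * FROM uncommitted_partitions WHERE tx_id = ?")) {
                move.setLong(1, transactionId);
                move.executeUpdate();
            }
            try (PreparedStatement clean = conn.prepareStatement(
                    "DELETE FROM uncommitted_partitions WHERE tx_id = ?")) {
                clean.setLong(1, transactionId);
                clean.executeUpdate();
            }

            // Readers only ever see committed rows, so the new partitions become
            // visible atomically when this PostgreSQL transaction commits.
            conn.commit();
        }
    }
}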
PLAZMADB DELETE
• A delete query is handled in a similar way. First, newly created partitions are uploaded, excluding the deleted records.

• When the transaction is committed, the records in the committed table are replaced by the uncommitted records, which point to different file paths.

(Diagram: committed partitions p1, p2, p3 are replaced by the new partitions p1’, p2’, p3’ at commit time.)
PARTITIONING
• To get the best throughput from Presto's parallel processing, the data source must be distributed too.

• Distributing the data source evenly contributes to high throughput and performance stability.

• Two basic partitioning methods

  • Key range partitioning -> Time-Index partitioning

  • Hash partitioning -> User Defined Partitioning
PARTITIONING
• A partition record in PlazmaDB represents a file stored in S3, along with some additional information:

• Data Set ID

• Range Index Key

• Record Count

• File Size

• Checksum

• File Path
PARTITIONING
• All partitions in PlazmaDB are indexed by the time at which the data was generated. The time index is recorded as a UNIX epoch.

• A partition keeps first_index_key and last_index_key to specify the range the partition covers.

• The PlazmaDB index is constructed as a multicolumn index using the GiST index of PostgreSQL.

  (https://www.postgresql.org/docs/current/static/gist.html)

• (data_set_id, index_range(first_index_key, last_index_key))
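For illustration, such a table and index could be declared roughly as below. This is a hypothetical reconstruction rather than PlazmaDB's real DDL: int8range stands in for the index_range expression above, and the btree_gist extension is assumed so the bigint data_set_id column can participate in the GiST index.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical sketch of a partition metadata table and its GiST range index,
// mirroring the fields listed above (data set id, index keys, statistics, path).
public class PartitionIndexSketch
{
    public static void createSchema(String jdbcUrl) throws Exception
    {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
                Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE EXTENSION IF NOT EXISTS btree_gist");
            stmt.execute(
                    "CREATE TABLE IF NOT EXISTS partitions (" +
                    "  data_set_id     bigint NOT NULL," +
                    "  first_index_key bigint NOT NULL," + // UNIX epoch seconds
                    "  last_index_key  bigint NOT NULL," +
                    "  record_count    bigint," +
                    "  file_size       bigint," +
                    "  checksum        bytea," +
                    "  file_path       text)");
            // Multicolumn GiST index over (data_set_id, time range), so a query carrying
            // a data set id and a time range can prune partitions with one index scan.
            stmt.execute(
                    "CREATE INDEX IF NOT EXISTS partitions_range_idx ON partitions " +
                    "USING gist (data_set_id, int8range(first_index_key, last_index_key))");
        }
    }
}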
LIFECYCLE OF PARTITION
• PlazmaDB has two storage management layers. At the beginning, records are put on the realtime storage layer in raw format.

(Diagram: realtime storage holds raw records with times 100, 300, 500, 3800, 4000; archive storage is still empty.)
LIFECYCLE OF PARTITION
• Every hour, a dedicated MapReduce job called the Log Merge Job runs to merge records of the same time range into one partition in archive storage.

(Diagram: the Log Merge Job (MR) merges the raw realtime-storage records into archive partitions covering time 0~3599 and 3600~7200.)
LIFECYCLE OF PARTITION
• A query execution engine like Presto needs to fetch data from both realtime storage and archive storage, but it is generally more efficient to read the data from archive storage.
TWO PARTITIONING TYPES
TIME INDEX PARTITIONING
• By using the multicolumn index on the time range in PlazmaDB, Presto can filter out unnecessary partitions through predicate pushdown.

• The TD_TIME_RANGE UDF gives Presto a hint about which partitions should be fetched from PlazmaDB.

  • e.g. TD_TIME_RANGE(time, '2017-08-31 12:30:00', NULL, 'JST')

• ConnectorSplitManager selects the necessary partitions and calculates the split distribution plan.
TIME INDEX PARTITIONING
• Select metadata records from realtime storage and archive storage according to the given time range.

  SELECT * FROM rt/ar WHERE start < time AND time < end;

(Diagram: ConnectorSplitManager selects matching partitions from realtime storage (time: 8000, 8200, 8800, 9000) and archive storage (time: 0~3599, 3600~7200).)
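As a rough sketch of that step, the split manager could run the same range predicate against both metadata tables and merge the results. Everything here (table names, columns, the use of plain JDBC) is hypothetical and only meant to make the idea concrete.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical: collect candidate partition files from both storage layers
// for one data set and time range. Table and column names are made up.
public class PartitionSelectionSketch
{
    public static List<String> selectPartitionPaths(Connection conn, long dataSetId, long start, long end)
            throws SQLException
    {
        List<String> paths = new ArrayList<>();
        for (String table : new String[] {"realtime_partitions", "archive_partitions"}) {
            // Overlap test: a partition qualifies if its [first, last] range intersects (start, end).
            String sql = "SELECT file_path FROM " + table +
                    " WHERE data_set_id = ? AND first_index_key < ? AND ? < last_index_key";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setLong(1, dataSetId);
                stmt.setLong(2, end);
                stmt.setLong(3, start);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        paths.add(rs.getString("file_path"));
                    }
                }
            }
        }
        return paths;
    }
}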
TIME INDEX PARTITIONING
• A split is responsible for downloading multiple files from S3 in order to reduce overhead.

• ConnectorSplitManager calculates the file assignment for each split based on the available statistics (e.g. file size, number of columns, record count), as sketched below.

(Diagram: ConnectorSplitManager assigns files f1, f2, f3 to Split1 and Split2.)
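A simple size-based greedy packing gives the flavor of that assignment. This is purely hypothetical; as noted above, the real ConnectorSplitManager also weighs column count and record count.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical greedy assignment: sort files by size (largest first) and always
// hand the next file to the split that currently holds the fewest bytes,
// so the download work per split ends up roughly balanced.
public class SplitAssignmentSketch
{
    record FileInfo(String path, long sizeBytes) {}

    public static List<List<FileInfo>> assign(List<FileInfo> files, int splitCount)
    {
        List<List<FileInfo>> splits = new ArrayList<>();
        long[] assignedBytes = new long[splitCount];
        for (int i = 0; i < splitCount; i++) {
            splits.add(new ArrayList<>());
        }

        List<FileInfo> ordered = new ArrayList<>(files);
        ordered.sort(Comparator.comparingLong(FileInfo::sizeBytes).reversed());

        for (FileInfo file : ordered) {
            int lightest = 0;
            for (int i = 1; i < splitCount; i++) {
                if (assignedBytes[i] < assignedBytes[lightest]) {
                    lightest = i;
                }
            }
            splits.get(lightest).add(file);
            assignedBytes[lightest] += file.sizeBytes();
        }
        return splits;
    }
}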
TIME INDEX PARTITIONING
(Chart: execution time, 0–180 sec, of a SELECT of 10 columns restricted with TD_TIME_RANGE over ranges of 60, 50, 40, 30, 20, and 10 days.)
TIME INDEX PARTITIONING
(Chart: number of splits, 0–60, created for a SELECT of 10 columns over time ranges from 6 months up to 6+ years.)
CHALLENGE
• Time-Index partitioning worked very well because

  • Most logs from web pages and IoT devices natively carry the time at which they were created.

  • OLAP workloads from analysts are often limited to a specific time range (e.g. the last week, or during a campaign).

• But it lacks the flexibility to build an index on a column other than time. This is required especially in digital marketing and DMP use cases.
USER DEFINED PARTITIONING
• We are now evaluating user-defined partitioning with Presto.

• User-defined partitioning allows customers to flexibly set an index on an arbitrary data attribute.

• User-defined partitioning can co-exist with time-index partitioning as a secondary index.

SELECT COUNT(1)
FROM audience
WHERE
  TD_TIME_RANGE(time, '2017-09-04', '2017-09-07')
  AND audience.room = 'E'
BUCKETING
• A similar mechanism to Hive bucketing.

• A bucket is a logical group of partition files, grouped by the specified bucketing column.

(Diagram: a table is divided into buckets; within each bucket, partition files are organized by time ranges 1–4.)
BUCKETING
• PlazmaDB defines the hash function type on the partitioning key and the total bucket count, which is fixed in advance.

SELECT COUNT(1) FROM audience
WHERE
  TD_TIME_RANGE(time, '2017-09-04', '2017-09-07')
  AND audience.room = 'E'

(Diagram: the table is divided into bucket1, bucket2, and bucket3, each holding its own partition files.)
BUCKETING
• ConnectorSplitManager selects the proper partitions from PostgreSQL with the given time range and bucket key.

SELECT COUNT(1) FROM audience
WHERE
  TD_TIME_RANGE(time, '2017-09-04', '2017-09-07')
  AND audience.room = 'E'

(Diagram: hash('E') -> bucket2; 1504483200 < time && time < 1504742400; only the matching partitions in bucket2 are read.)
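The sketch below reproduces the two pruning inputs shown in the diagram: the bucket resolved from the hash of the key, and the UTC epoch boundaries of the TD_TIME_RANGE predicate. The hash function and bucket count here are placeholders; PlazmaDB fixes both per table in advance.

import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical pruning-input computation for:
//   TD_TIME_RANGE(time, '2017-09-04', '2017-09-07') AND audience.room = 'E'
public class BucketPruningSketch
{
    public static void main(String[] args)
    {
        int bucketCount = 3; // fixed in advance per table
        // Any stable hash works, as long as writer and reader agree on it.
        int bucket = Math.floorMod("E".hashCode(), bucketCount);

        long start = LocalDate.parse("2017-09-04").atStartOfDay(ZoneOffset.UTC).toEpochSecond(); // 1504483200
        long end = LocalDate.parse("2017-09-07").atStartOfDay(ZoneOffset.UTC).toEpochSecond();   // 1504742400

        // ConnectorSplitManager then only selects partitions where
        // bucket_id = bucket AND start < time AND time < end.
        System.out.printf("bucket=%d, start=%d, end=%d%n", bucket, start, end);
    }
}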
USER DEFINED PARTITIONING
• We can skip reading unnecessary partitions. This architecture fits digital marketing use cases very well.

  • Creating user segments

  • Aggregation by channel

• It still makes use of time-index partitioning.

• It is now being tested internally.
RECAP
• Presto provides a plugin mechanism called a connector.
• Though Presto itself is a highly scalable distributed engine, the connector is also responsible for efficient query execution.
• PlazmaDB has desirable features for integration with such a connector:
• Transaction support
• Time-Index Partitioning
• User Defined Partitioning
T R E A S U R E D A T A