Building a High-performance Data Lake Analytics Engine at Alibaba Cloud with Presto+Alluxio
Zhenlin Ma
Contents
01 Introduction to DLA
02 DLA Presto Architecture
03 Optimizations on OSS Data Source
Introduction to DLA
Data Lake Analytics (DLA) is a large scale serverless data federation service on Alibaba Cloud.
• Serverless Data Federation
• Database-like User Experience
• High Performance
Introduction to DLA
[Figure: DLA overview diagram. Labels: Data Lake Storage (OSS) holding columnar tables; One Click Data Lake; DB data; streaming data; Spark Streaming; LogService; application logs; archived transactional data; Data Lake Engine (DLA) with Serverless Spark (ETL & ML), Serverless Presto, and Metadata Management (Auto Discovery); consumers DW, DMS, APP, and QuickBI.]
DLA Presto
• Multi-Coordinators
• Lake Formation: One Click Data Warehouse, Metadata Discovery
• Enterprise-level Access Control
• Cost: billing based on either the volume of scanned data or the number of compute units used
• MySQL protocol support
• Caching
• Data sources: more than 15 types of data sources are supported, including Alibaba Cloud OSS, ADB, Table Store, etc.
Billing methods
Contents
01 Introduction to DLA
02 DLA Presto Architecture
03 Optimizations on OSS Data Source
DLA Presto Architecture
[Figure: architecture diagram. Labels: FrontNode (MySQL Protocol; SQL Dialect Transformation / Submit Query / Fetch Result); Unified Meta Service (Meta Operation); Presto Clusters, consisting of a Default Cluster and a CU Cluster, each with one Coordinator and several Workers; TableScan/Pushdown against the data sources OSS, MySQL, SQLServer, Oracle, PostgreSQL, TableStore, MaxCompute, ElasticSearch, Druid, …]
• MySQL Protocol
• Multiple Charging Model
• Unified Meta & Access Control
About Presto
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
• Full Memory Processing: blazing fast; suitable for ad hoc queries, data exploration, and lightweight ETL.
• Full SQL Semantics: compliant with ANSI SQL, so there is no need to worry about unsupported SQL syntax.
• Pluggable Connectors
• Great Community
Challenges to DLA Presto
[Figure: the DLA Presto architecture diagram (FrontNode, Unified Meta Service, Presto Clusters, data sources), annotated with the challenges listed here.]
• Request costs
• Bandwidth limit
• Performance pulling large data
• Latency to get metadata/partitions
• Pressure on data source
Challenges to DLA Presto - Analysis
[Figure: data sources placed on two axes, small data vs. big data and frequently updated vs. infrequently updated. Sources shown: OTS, OSS, ODPS, MySQL, Redis, MongoDB, PostgreSQL, …; one region is marked "?".]
• Big Data/NoSQL: performance of pulling large data
• Online System: pressure on data source
• Big Data/Offline: performance of pulling large data
Countermeasures shown: concurrency limitation, avoid reading the master, caching, pushdown
Solutions
[Figure: the DLA Presto architecture diagram (FrontNode with SQL rewriting / query submission / result fetching; unified metadata management with metadata operations; Presto Clusters with a Default Cluster and a CU Cluster; data sources OSS, MySQL, SQLServer, Oracle, PostgreSQL, TableStore, MaxCompute, ElasticSearch, Druid, …), annotated with the solutions listed here. Three callouts are labeled "impact on the source database".]
• Decrease request count
• Alluxio Data Cache
• Data cache
• Partition meta cache
• Splits cache
• Limit concurrency
• Read from slave replicas
• One Click Data Lake
• Pushdown
Contents
01 Introduction to DLA
02 DLA Presto Architecture
03 Optimizations on OSS Data Source
DLA Presto Optimizations on OSS Data Source
Decreasing OSS API request count
Alluxio Data Cache
Decreasing OSS API request count
Background
• Users report that the OSS call fees are high, even higher than the DLA fees.
• OSS call fees = actual calls × unit price per 10,000 calls / 10,000
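The fee formula above can be worked through with a small calculation. This is an illustrative sketch only: the function name, the unit price, and the call counts are hypothetical values chosen for the arithmetic, not actual OSS pricing or workload figures.

```python
def oss_call_fee(actual_calls: int, unit_price_per_10k: float) -> float:
    """OSS call fees = actual calls x unit price per 10,000 calls / 10,000."""
    return actual_calls * unit_price_per_10k / 10_000


# Hypothetical inputs for illustration; not real OSS prices or workloads.
UNIT_PRICE_PER_10K = 0.01
calls_before = 50_000_000
calls_after = calls_before // 10  # e.g. if the call count drops to 1/10

fee_before = oss_call_fee(calls_before, UNIT_PRICE_PER_10K)
fee_after = oss_call_fee(calls_after, UNIT_PRICE_PER_10K)
print(f"before: {fee_before:.2f}, after: {fee_after:.2f}")
```

Because the fee is linear in the number of calls, any reduction in request count translates directly into the same proportional reduction in call fees.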
Decreasing OSS API request count
Hadoop FileSystem API invocation -> Alibaba Cloud OSS API invocation
• read -> request #1: read as much data as possible with one request
• read -> continue reading
• …
• seek(100), read -> small seek: continue reading
• seek(128MB), read -> big seek: start a new request (#2)
• read -> continue reading
• …
Results
1. Reduced the API call count to 1/10 for data stored in Text format.
2. Reduced the API call count to 1/3 for data stored in ORC/Parquet format.
3. Saves about 60% to 90% of the cost on average.
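The request-merging behavior described above can be sketched as a reader that keeps one ranged request open and only issues a new one after a large seek. This is a minimal simulation under assumptions: the class name, the 1MB small-seek threshold, and the treatment of backward seeks as new requests are hypothetical choices for illustration, not details of the DLA implementation.

```python
class CoalescingReader:
    """Counts object-store requests for a sequence of seek/read calls."""

    def __init__(self, file_size: int, small_seek_threshold: int = 1 << 20):
        self.file_size = file_size
        self.small_seek_threshold = small_seek_threshold  # assumed value
        self.pos = 0             # logical position seen by the caller
        self.stream_pos = None   # position of the open request, None if none is open
        self.requests = 0        # number of requests issued so far

    def seek(self, target: int) -> None:
        # Seeking is lazy: only the logical position moves.
        self.pos = target

    def read(self, length: int) -> int:
        length = min(length, self.file_size - self.pos)
        if length <= 0:
            return 0
        gap = None if self.stream_pos is None else self.pos - self.stream_pos
        if gap is None or gap < 0 or gap > self.small_seek_threshold:
            # No open request, backward seek, or big seek: start a new request.
            self.requests += 1
        # Otherwise: small forward seek, skip the gap on the open request.
        self.pos += length
        self.stream_pos = self.pos
        return length


MB = 1 << 20
reader = CoalescingReader(file_size=256 * MB)
reader.read(4096)
reader.read(4096)
reader.seek(reader.pos + 100)   # small seek: keep reading on the same request
reader.read(4096)
reader.seek(128 * MB)           # big seek: a new request is started
reader.read(4096)
reader.read(4096)
print(reader.requests)
```

A naive mapping that issues one request per read call would need five requests for the same access pattern.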
Alluxio Data Cache - Why Alluxio
• Proven Solution: fully tested in production environments at Facebook, Netease, and JD
• Efficiency: high concurrency; asynchronous cache writes
• Monitoring: easy to monitor and diagnose
Alluxio Data Cache - Local Cache vs. Cluster
[Figure: two deployments side by side. Cluster: a Presto Cluster (Coordinator and Workers) reads from an Alluxio Cluster (Master and Workers) backed by OSS; on a cache miss the data is cached to Alluxio, then returned. Local cache: a Presto Cluster on its own.]
Alluxio Data Cache - Local Cache
• The Alluxio data cache is a library residing in the Presto worker.
• Cache data is stored on local disk.
Alluxio Data Cache - Local Cache
SOFT_AFFINITY
• Makes a best-effort attempt to assign the same split to the same worker during scheduling
• Node selection order: Preferred(0) -> Preferred(1) -> LeastBusy
[Figure: workers labeled Preferred(0), Preferred(1), and LeastBusy.]
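The Preferred(0) -> Preferred(1) -> LeastBusy order can be sketched as a small node-selection function. This is a simplified illustration, not Presto's scheduler code: the function names, the use of CRC32 as the hash, and the choice of the next worker in the list as Preferred(1) are assumptions made for the example.

```python
import zlib
from typing import Dict, List


def preferred_nodes(split_key: str, workers: List[str]) -> List[str]:
    """Return [Preferred(0), Preferred(1)] for a split (assumed hashing scheme)."""
    first = zlib.crc32(split_key.encode()) % len(workers)
    second = (first + 1) % len(workers)
    return [workers[first], workers[second]]


def pick_node(split_key: str, workers: List[str], load: Dict[str, int],
              max_splits_per_node: int) -> str:
    """Try the preferred nodes in order, then fall back to the least busy worker."""
    for node in preferred_nodes(split_key, workers):
        if load[node] < max_splits_per_node:
            return node
    return min(workers, key=lambda w: load[w])


workers = ["worker-0", "worker-1", "worker-2"]
load = {w: 0 for w in workers}
key = "oss://bucket/table/part-0000.orc:0"  # hypothetical split key
print(pick_node(key, workers, load, max_splits_per_node=2))
```

As long as the preferred worker has capacity, the same split key always lands on the same worker, which is what lets the local cache be reused across queries.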
Alluxio Data Cache - Cluster
• Alluxio is a distributed caching service for Presto
• Short-circuit read is supported
[Figure: a Presto Cluster (Coordinator and Workers) reads from an Alluxio Cluster (Master and Workers) backed by OSS; on a cache miss the data is cached to Alluxio, then returned.]
Alluxio Data Cache - Local Cache vs. Cluster
[Figure: a Presto Cluster with a separate Alluxio Cluster backed by OSS, next to a Presto Cluster using the local cache.]
Local Cache vs. Cluster
• Data is closer to the compute node
• No extra nodes needed
Local Cache vs. Collocated Cluster
• Easy to maintain
• No resources wasted if the user has no OSS data source
Alluxio Data Cache - Improvements in DLA
Community solution scenario vs. DLA scenario
• Queries mainly on Hive data sources vs. this cannot be assumed for a specific user
• SSD vs. ultra cloud disk
Challenges
• A performance improvement in the statistical sense may not be perceivable by users, so the cache hit ratio must be increased for every single query
• Low disk throughput limits the acceleration effect
Resulting improvements
• Increase the cache hit ratio for every single query
• Increase disk throughput
Alluxio Data Cache - Improvements in DLA
Increase cache hit ratio
• Analysis
  • SOFT_AFFINITY: Preferred(0) -> Preferred(1) -> LeastBusy
  • The key is to submit more splits to the preferred nodes
  • The relevant setting is node-scheduler.max-splits-per-node
• Increase node-scheduler.max-splits-per-node
  • Effect: the cache hit ratio increases
  • Side effect: the load across workers becomes unbalanced
[Figure: six splits on three workers. One worker holds 4 splits (split1 to split4); the other two hold 1 split each (split5, split6).]
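The trade-off above can be illustrated with a toy simulation: raising the per-node split limit places more splits on their preferred nodes but concentrates load. This is a sketch under simplifying assumptions, not Presto's scheduler: all splits are queued at once and none finish, the hash is CRC32, Preferred(1) is the next worker in the list, and the fraction of splits placed on a preferred node is used as a stand-in for cache hit ratio.

```python
import zlib
from typing import Dict, List, Tuple


def schedule(split_keys: List[str], workers: List[str],
             max_splits_per_node: int) -> Tuple[float, Dict[str, int]]:
    """Return (fraction of splits on a preferred node, splits per worker)."""
    load = {w: 0 for w in workers}
    on_preferred = 0
    for key in split_keys:
        first = zlib.crc32(key.encode()) % len(workers)
        preferred = [workers[first], workers[(first + 1) % len(workers)]]
        node = next((n for n in preferred if load[n] < max_splits_per_node), None)
        if node is None:
            # Both preferred nodes are full: fall back to the least busy worker.
            node = min(workers, key=lambda w: load[w])
        else:
            on_preferred += 1
        load[node] += 1
    return on_preferred / len(split_keys), load


workers = ["worker-0", "worker-1", "worker-2"]
splits = ["oss://bucket/big_file.orc"] * 6  # hypothetical path: 6 splits of one file

low_ratio, low_load = schedule(splits, workers, max_splits_per_node=1)
high_ratio, high_load = schedule(splits, workers, max_splits_per_node=4)
print(low_ratio, sorted(low_load.values()))
print(high_ratio, sorted(high_load.values()))
```

With the low limit the six splits spread evenly but most miss their preferred nodes; with the high limit every split lands on a preferred node while one worker stays idle.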
Alluxio Data Cache - Improvements in DLA
Increase cache hit ratio
• Unbalanced load
  • HiveSplit preferred nodes: path.hashCode() % numWorkers
  • A big file generates more splits, so the corresponding worker gets more load
• Need to submit the splits of a big file to different nodes
  • (path.hashCode() + (start / (fileSize / numWorkers))) % numWorkers
[Figure: the six splits spread evenly, 2 splits on each of the three workers.]
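The two preferred-node formulas can be compared side by side in a short sketch. The assumptions here are labeled in the code: CRC32 stands in for Java's path.hashCode(), the function names and file path are hypothetical, and the guard against a zero region size for very small files is an addition that the slide does not mention.

```python
import zlib
from collections import Counter


def path_hash(path: str) -> int:
    # Stand-in for Java's path.hashCode(); any stable hash works for the sketch.
    return zlib.crc32(path.encode())


def preferred_by_path(path: str, num_workers: int) -> int:
    """Original scheme: every split of a file maps to the same worker."""
    return path_hash(path) % num_workers


def preferred_by_path_and_offset(path: str, start: int, file_size: int,
                                 num_workers: int) -> int:
    """Adjusted scheme: the split's start offset shifts the preferred worker."""
    region = max(1, file_size // num_workers)  # guard added for tiny files
    return (path_hash(path) + start // region) % num_workers


MB = 1 << 20
path = "oss://bucket/big_file.orc"            # hypothetical file
file_size = 384 * MB
starts = [i * 64 * MB for i in range(6)]      # six 64MB splits
num_workers = 3

by_path = Counter(preferred_by_path(path, num_workers) for _ in starts)
by_offset = Counter(
    preferred_by_path_and_offset(path, s, file_size, num_workers) for s in starts
)
print(sorted(by_path.values()), sorted(by_offset.values()))
```

The mapping stays deterministic per split, so cache affinity is preserved while the splits of one big file are spread across all workers.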
Alluxio Data Cache - Improvements in DLA
Improve disk throughput
• Throughput of a single 20GB ultra disk: write 109MB/s, read 108MB/s
• Use multiple disks
  • 6 ultra disks: 600MB/s read/write
• Implementation
  • page.path = $root/$page_path
    =>
    page.path = $roots[page.hash % roots.size]/$page_path
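The path rule above can be sketched as a function that picks a cache root directory from the page hash. This is an illustration only: the function name, the mount points, and the page path are hypothetical, and the real change lives inside the Alluxio local cache page store rather than in standalone code like this.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Sequence


def page_path(roots: Sequence[str], page_hash: int, page_rel_path: str) -> str:
    """page.path = $roots[page.hash % roots.size]/$page_path"""
    # Python's % is non-negative for a positive modulus, so negative hashes are fine.
    root = roots[page_hash % len(roots)]
    return str(PurePosixPath(root) / page_rel_path)


# Hypothetical mount points for six disks.
roots = [f"/mnt/disk{i}" for i in range(6)]
print(page_path(roots, 7, "pages/abc"))

# Consecutive hashes spread evenly over the six roots.
spread = Counter(page_path(roots, h, "p").split("/")[2] for h in range(12))
print(sorted(spread.values()))
```

Spreading pages by hash lets reads and writes hit all disks in parallel, which is how several low-throughput disks can add up to a higher aggregate throughput.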
Alluxio Data Cache - Performance
Environment:
• Cluster: 16 nodes, each with 16 CPUs and 64GB of memory
• Disk: 6 × 20GB ultra disks
• Data: TPC-H 1TB, ORC format, stored on OSS
Queries chosen from TPC-H:
• Include a scan of the lineitem table (the biggest table)
• No joins between three or more tables
Alluxio Data Cache - Performance
Test Result
[Figure: test result chart.]
Future Plan
• Alluxio Cluster
  • Shared by multiple users
  • Suitable when Presto auto-scales
• Improvements for OSS Data Source
  • Fragment Result Cache
  • Query Result Cache
  • Improve the performance of querying small files
More Information about DLA
• DLA Homepage: https://www.aliyun.com/product/datalakeanalytics
• DLA SQL Introduction: https://developer.aliyun.com/article/770819
We are hiring :)
