The Design, Implementation and Open Source Way of Apache Pegasus
何昱晨
2021.9.25
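• Distributed systems engineer
• Holds bachelor's and master's degrees from Renmin University of China
• Works at Xiaomi on the research and development of the distributed KV storage system Pegasus and its ecosystem tools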
• Apache Pegasus PPMC
何昱晨
Outline
• Basic Introduction
– Architecture, Data Model, Dual WAL, Performance
• New Features
– Duplication, Bulk load, Access control, Partition split, User
Defined Compaction
• Surrounding Ecosystems
– Pegasus-Spark, Meta proxy, Disk Migration tools
• Community
Basic Introduction
Introduction
• Redis or HBase
– Non-Volatile vs Consistent
– Remote Access
• Pegasus
– C++
– Local persistent storage
– Strongly consistent
– High performance
– Horizontally scalable
Architecture
Meta server
• Cluster controller
• Configuration manager
Replica server
• Data node
• Hash partitioning
• PacificA (strongly consistent)
• RocksDB instance for each replica
Zookeeper
• Meta server election
• Metadata storage
ClientLib
• Caches the data routing table
• Accesses replica servers directly
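The hash partitioning and client-side routing listed above can be sketched as follows. This is only an illustration, not Pegasus client code: `zlib.crc32` stands in for whatever hash function Pegasus actually uses, and `RoutingTable` and the server names are hypothetical.

```python
import zlib


class RoutingTable:
    """Hypothetical client-side cache: partition index -> primary replica server."""

    def __init__(self, primaries):
        self.primaries = list(primaries)

    def partition_of(self, hash_key: bytes) -> int:
        # Hash partitioning: stand-in hash, modulo the partition count.
        return zlib.crc32(hash_key) % len(self.primaries)

    def server_for(self, hash_key: bytes) -> str:
        # The cached table lets the client contact the replica server directly,
        # without a round trip to the meta server on every request.
        return self.primaries[self.partition_of(hash_key)]


# Four partitions spread over three hypothetical replica servers.
table = RoutingTable(["replica-server-1", "replica-server-2",
                      "replica-server-3", "replica-server-1"])
print(table.server_for(b"user_1001"))
```

The same hash key always maps to the same partition, so the client only needs to refresh the cached table when the cluster configuration changes.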
Data Model
Dual WAL
Traditional solution
[Diagram: client; Replica1, Replica2 and Replica3, each with its own Data and Log, all on one Disk]
• Background data compaction may severely degrade WAL sync performance
Dual WAL
[Diagram: client; Replica1, Replica2 and Replica3, each with its own Data and Private Log, on the Data Disk; one Shared Log on the Log Disk]
• Separate WAL and data, sync-write shared log, async-write private log
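A toy model of the write path described in the bullet above may help. It assumes a write is acknowledged once it is in the shared log and that a background task fills the private logs later; `DualWalServer` and its method names are hypothetical, and in-memory lists stand in for the two disks.

```python
from collections import deque


class DualWalServer:
    """Toy model of one replica server hosting several replicas."""

    def __init__(self, replica_ids):
        self.shared_log = []                              # one log on the log disk
        self.private_logs = {r: [] for r in replica_ids}  # per replica, on the data disk
        self.pending = deque()                            # not yet in a private log

    def write(self, replica_id, mutation):
        # Synchronous step: append to the shared log before acknowledging.
        self.shared_log.append((replica_id, mutation))
        # Asynchronous step: the private log is written later.
        self.pending.append((replica_id, mutation))
        return "ack"

    def flush_private_logs(self):
        # Stand-in for the background task that drains pending mutations.
        while self.pending:
            replica_id, mutation = self.pending.popleft()
            self.private_logs[replica_id].append(mutation)


server = DualWalServer(["replica1", "replica2", "replica3"])
server.write("replica1", "set k1=v1")
server.write("replica2", "set k2=v2")
acked_before_flush = len(server.shared_log)   # both writes are already durable here
server.flush_private_logs()
```

In this model the synchronous path touches only the log disk, so compaction traffic on the data disk does not sit in front of a client acknowledgement.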
Performance
Read:Write  Client*Thread  Op     QPS     AvgLatency  P99Latency(us)
0:1         3*15           read   ---     ---         ---
                           write  46128   972         5591
1:0         3*50           read   282648  542         1674
                           write  ---     ---         ---
1:1         3*30           read   36014   1068        15345
                           write  36016   1421        8197
1:3         3*15           read   11622   779         10417
                           write  34989   1021        5467
2.2.0 (Newest release) benchmark
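As a sanity check on the benchmark figures, Little's law (requests in flight = throughput × latency) can be applied to each row. The sketch assumes that AvgLatency is in microseconds like P99Latency, and that each client thread issues one request at a time; neither is stated on the slide.

```python
# (read:write ratio, clients * threads, [(qps, avg_latency_us), ...]) from the table.
rows = [
    ("0:1", 3 * 15, [(46128, 972)]),
    ("1:0", 3 * 50, [(282648, 542)]),
    ("1:1", 3 * 30, [(36014, 1068), (36016, 1421)]),
    ("1:3", 3 * 15, [(11622, 779), (34989, 1021)]),
]


def in_flight(ops):
    # Little's law: concurrency = throughput * latency (latency converted to seconds).
    return sum(qps * latency_us / 1_000_000 for qps, latency_us in ops)


for ratio, threads, ops in rows:
    print(f"{ratio}: {threads} threads, {in_flight(ops):.1f} requests in flight")
```

Under those assumptions every row comes out within a few percent of its thread count, which suggests the QPS and latency columns are mutually consistent.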
New Features
Duplication
Basic introduction
[Diagram: Table in Region1 and Table in Region2, linked by async-duplication]
• Designed for cross-region online backup
• Transfers logs and writes them asynchronously
• Supports single-master and multi-master
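The log-shipping idea in the bullets above can be sketched with two in-memory tables. This is only a model under simplifying assumptions: `Table` and `Duplicator` are hypothetical names, and a manual `ship()` call stands in for the background transfer.

```python
class Table:
    def __init__(self):
        self.data = {}
        self.log = []          # mutation log that duplication ships

    def put(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Duplicator:
    """Hypothetical log shipper from a source table to a remote table."""

    def __init__(self, source, target):
        self.source, self.target = source, target
        self.shipped = 0       # index of the next log entry to send

    def ship(self):
        # Runs in the background; client writes never wait for it.
        for key, value in self.source.log[self.shipped:]:
            self.target.data[key] = value
        self.shipped = len(self.source.log)


region1, region2 = Table(), Table()
dup = Duplicator(region1, region2)
region1.put("k", "v1")
lagging = dict(region2.data)   # region2 has not seen the write yet
dup.ship()                     # now region2 catches up
```

Because the remote write happens after the local acknowledgement, the remote table can lag behind the source.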
Duplication
Case1: Online Migration
[Diagram: client; Table in Source Cluster; Table in Target Cluster; Remote storage]
1. Reserve logs
2. Cold backup
3. Restore
4. Duplication
5. Switch
Duplication
Case2: Master-Slave cluster
client client
Slave region
Table
Master region
Table
duplication
Eventually-consistent
read
client client
Table
Region1 Region2
Duplication
Future enhancements
• Master-master in practice
• Duplication across more than two regions in practice
• Facilities for supporting a remote disaster-tolerant system
• Automatic master/slave switch
• Better user experience
• Extension:
• Supporting CDC on demand
• e.g., ES, MQ…
Bulk Load
Quickly import large volumes of data offline
sst file
sst file
Table
Replica server
original data
File provider
sst file sst file
1. Generate Files
2. Download Files
3. Ingest Files
client
R/W Reject write(ingestion)
Access Control
Authentication: Kerberos
Authorization: whitelist-based, coarse-grained, table-level access control
Cluster
KeytabA
X
TableA
KeytabB
TableB
KeytabA
client
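A table-level whitelist check of the kind named above might look like the following sketch. It assumes the user has already been authenticated (for example through Kerberos); the user names and the `ALLOWED_USERS` structure are hypothetical.

```python
# Hypothetical whitelist: table name -> users allowed to access it.
ALLOWED_USERS = {
    "TableA": {"userA"},
    "TableB": {"userB"},
}


def is_allowed(user: str, table: str) -> bool:
    # Table-level check only: no per-key or per-operation granularity.
    return user in ALLOWED_USERS.get(table, set())


print(is_allowed("userA", "TableA"))
print(is_allowed("userA", "TableB"))
```

A client authenticated as `userA` passes the check for `TableA` and is refused for `TableB`, which mirrors the situation in the diagram.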
Partition Split
Basic introduction
• Each replica divides into two replicas
• Replica[i] -> Replica[i], Replica[i+original_partition_count]
Replica group0: Replica0 -> Replica0, Replica4
Replica group1: Replica1 -> Replica1, Replica5
Replica group2: Replica2 -> Replica2, Replica6
Replica group3: Replica3 -> Replica3, Replica7
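The mapping Replica[i] -> Replica[i], Replica[i+original_partition_count] follows from modulo arithmetic when the partition count doubles. The sketch below checks this for 1000 keys; `zlib.crc32` is only a stand-in for the real hash function, and the key names are made up.

```python
import zlib


def partition_index(hash_key: bytes, partition_count: int) -> int:
    # Stand-in hash; only the modulo step matters for this illustration.
    return zlib.crc32(hash_key) % partition_count


old_count, new_count = 4, 8
for i in range(1000):
    key = f"key{i}".encode()
    parent = partition_index(key, old_count)
    child = partition_index(key, new_count)
    # After the split a key stays in partition i or moves to i + old_count.
    assert child in (parent, parent + old_count)
```

This holds for any hash value h, since h mod 2N is either h mod N or (h mod N) + N.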
Partition Split
Stage1: async-learn
client
Replica server
child
secondary
Replica server
child
primary
Replica server
child
secondary
copy data
copy data copy data
• parent (old replica), child (new replica)
• Child replicas copy data from the parent
• The client only knows the parent replica
Partition Split
Stage2: register
client
Replica server
child
secondary
Replica server
child
primary
Replica server
child
secondary
meta server
register child X
• Happens once the child has copied all parent data
• R/W is rejected while registering
Partition Split
Partition split succeeds
Replica server
secondary
secondary
Replica server
primary
primary
Replica server
secondary
secondary
client
• Will be released in 2.3.0
• Duplicated data is garbage-collected by compaction
User defined compaction
Current compaction operation: Deletion
RocksDB instance before compaction:
No=3, key=“key3”, value=“value”
No=2, key=“key2”, value=“value”, expired
No=1, key=“key”, value=“old_value”
No=4, key=“key4”, value=“value”, parent
No=5, key=“key”, value=“new_value”
RocksDB instance after compaction:
No=3, key=“key3”, value=“value”
No=5, key=“key”, value=“new_value”
• GC duplicated data
• GC expired data
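The deletion behaviour in this example can be reproduced with a small filter. This is a sketch, not the RocksDB compaction filter API: `Record` and `compact` are hypothetical, and the `parent` marker is interpreted here as data left over from the parent partition after a split, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Record:
    no: int                     # sequence number; higher is newer
    key: str
    value: str
    expired: bool = False
    parent_only: bool = False   # assumed: left over from the parent after a split


def compact(records):
    newest = {}
    for rec in records:
        # Keep only the newest version of each key.
        if rec.key not in newest or rec.no > newest[rec.key].no:
            newest[rec.key] = rec
    # Drop expired records and records that no longer belong to this partition.
    return sorted(
        (r for r in newest.values() if not r.expired and not r.parent_only),
        key=lambda r: r.no,
    )


before = [
    Record(3, "key3", "value"),
    Record(2, "key2", "value", expired=True),
    Record(1, "key", "old_value"),
    Record(4, "key4", "value", parent_only=True),
    Record(5, "key", "new_value"),
]
after = compact(before)
print([(r.no, r.key, r.value) for r in after])
```

Only records No=3 and No=5 survive, matching the state after compaction shown on the slide.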
User defined compaction
Current compaction operation: Update table-level TTL
Table TTL: 30 days
RocksDB instance before compaction:
No=3, key=“key3”, value=“value”
No=2, key=“key2”, value=“value”
No=1, key=“key1”, value=“value”
RocksDB instance after compaction:
No=3, key=“key3”, value=“value”, ttl=30 days
No=2, key=“key2”, value=“value”, ttl=30 days
No=1, key=“key1”, value=“value”, ttl=30 days
User defined compaction
Compaction Operations
• Update TTL (based on current time)
• Update TTL (based on old TTL)
• Update TTL (timestamp)
• Deletion
Compaction Rules
• No TTL
• TTL range
• HashKey prefix
• HashKey postfix
• HashKey anywhere
• SortKey prefix
• SortKey postfix
• SortKey anywhere
User defined compaction
Use case examples
• Update the TTL of data whose TTL is more than 6 months to 2 months
– Compaction Rule = TTL Range
– Compaction Operation = Update TTL
• Delete data whose HashKey has the prefix "test" and whose TTL is more than 30 days
– Compaction Rule = HashKey Prefix + TTL Range
– Compaction Operation = Deletion
• Will be released in 2.3.0
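The two use cases above combine rules (which records match) with one operation (what to do with a match). A minimal sketch of that composition follows; it is not the Pegasus configuration format, all names are hypothetical, and "6 months" is approximated as 180 days.

```python
from dataclasses import dataclass
from typing import Optional

DAY = 24 * 3600


@dataclass
class Record:
    hash_key: str
    ttl_seconds: Optional[int]   # remaining TTL; None means no TTL


def hashkey_prefix(prefix):
    return lambda rec: rec.hash_key.startswith(prefix)


def ttl_more_than(seconds):
    return lambda rec: rec.ttl_seconds is not None and rec.ttl_seconds > seconds


def delete(rec):
    return None                  # returning None drops the record


def update_ttl(seconds):
    return lambda rec: Record(rec.hash_key, seconds)


def apply(records, rules, operation):
    out = []
    for rec in records:
        if all(rule(rec) for rule in rules):   # every rule must match
            rec = operation(rec)
        if rec is not None:
            out.append(rec)
    return out


records = [Record("test_a", 40 * DAY), Record("test_b", 10 * DAY), Record("prod_a", 200 * DAY)]
# Use case 1: TTL above 6 months (assumed 180 days) becomes 2 months (60 days).
shortened = apply(records, [ttl_more_than(180 * DAY)], update_ttl(60 * DAY))
# Use case 2: delete HashKey prefix "test" with TTL above 30 days.
cleaned = apply(records, [hashkey_prefix("test"), ttl_more_than(30 * DAY)], delete)
```

Treating rules as predicates and operations as record transformers keeps the two lists independent, so any rule combination can be paired with any operation.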
Surrounding Ecosystem
Pegasus-Spark
Best practices
• Large offline data analysis (SQL)
• Large offline data load (Bulk Load)
Pegasus-Spark
Offline Analysis
• Convert into Hive (Parquet)
• Use SparkSQL for analysis
HDFS
Replica server Replica server
Hive
Schema RDD
Pegasus-Spark
Convert to SST file for Bulk load
node
node
node
node
node
node
Transform(Pegasus-Spark)
HDFS
(sst file)
Distinct
Repartition
Sort
original
data
original
data
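The Distinct, Repartition and Sort stages named in this slide can be illustrated in plain Python. This is not the Pegasus-Spark API and does not write real SST files; the row layout `(hash_key, sort_key, value)` and the stand-in hash are assumptions.

```python
import zlib


def to_partitions(rows, partition_count):
    # Distinct: drop exact duplicate (hash_key, sort_key, value) rows.
    unique = set(rows)
    # Repartition: group rows by target partition (stand-in hash on the hash key).
    parts = [[] for _ in range(partition_count)]
    for row in unique:
        parts[zlib.crc32(row[0].encode()) % partition_count].append(row)
    # Sort: SST files store keys in sorted order, so sort within each partition.
    return [sorted(p) for p in parts]


rows = [
    ("user2", "a", "1"),
    ("user1", "b", "2"),
    ("user1", "a", "3"),
    ("user2", "a", "1"),   # exact duplicate of the first row
]
parts = to_partitions(rows, 2)
```

Each resulting list corresponds to the content of one partition's file, ready to be written out in key order.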
Meta Proxy
Basic introduction
• Unified access
• Primary and standby cluster management
[Diagram, without Meta Proxy: clients connect to the meta servers of Cluster A, Cluster B and Cluster C]
[Diagram, with Meta Proxy: clients connect to MetaProxy, which sits in front of the meta servers of Cluster A, Cluster B and Cluster C]
Meta Proxy
Switch primary and standby cluster
client client client
Cluster primary
meta meta
Cluster secondary
meta meta
MetaProxy
duplication
client client client
Cluster secondary
meta meta
Cluster primary
meta meta
MetaProxy
duplication
switch
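The primary/standby switch can be modelled as a role swap inside the proxy. This sketch is not the Meta Proxy implementation; the class, its methods, and the meta server names are hypothetical.

```python
class MetaProxy:
    """Hypothetical proxy: clients ask it where the current primary cluster is."""

    def __init__(self, primary, standby):
        self.primary = primary      # meta server addresses of the primary cluster
        self.standby = standby      # meta server addresses of the standby cluster

    def resolve(self):
        # Clients always receive the meta servers of the current primary cluster.
        return self.primary

    def switch(self):
        # Swap roles; clients pick up the new primary on their next resolve().
        self.primary, self.standby = self.standby, self.primary


proxy = MetaProxy(primary=["meta-a1", "meta-a2"], standby=["meta-b1", "meta-b2"])
before_switch = proxy.resolve()
proxy.switch()
after_switch = proxy.resolve()
```

Clients keep a single address (the proxy) in their configuration, so a switch needs no client-side change.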
Disk migration tool
Balance disk usage on a replica server
Replica server before migration: Disk1 70%, Disk2 75%, Disk3 85%, Disk4 40%
Disk migrator loop (until balanced): Select Disk, Select Replica, Migrate Replica
Replica server after migration (balanced): Disk1 70%, Disk2 65%, Disk3 70%, Disk4 65%
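The select-disk, select-replica, migrate loop can be sketched as a greedy balancer. This is an illustration, not the tool's actual algorithm: it assumes equally sized disks, replicas that each occupy 5% of a disk, and a stop threshold of 5 percentage points, all of which are made up for the example.

```python
def usage(disks):
    return {name: sum(replicas) for name, replicas in disks.items()}


def balance(disks, threshold=5):
    moves = []
    while True:
        used = usage(disks)
        high = max(used, key=used.get)   # select disk: most used
        low = min(used, key=used.get)    # target: least used
        gap = used[high] - used[low]
        if gap <= threshold:
            return moves
        # Select replica: only one smaller than the gap, so the two disks
        # always end up closer together and the loop terminates.
        candidates = [size for size in disks[high] if size < gap]
        if not candidates:
            return moves
        size = max(candidates)
        disks[high].remove(size)         # migrate replica
        disks[low].append(size)
        moves.append((high, low, size))


# Starting usage taken from the slide: 70%, 75%, 85%, 40%.
disks = {
    "Disk1": [5] * 14,
    "Disk2": [5] * 15,
    "Disk3": [5] * 17,
    "Disk4": [5] * 8,
}
moves = balance(disks)
print(usage(disks))
```

The total usage stays constant, and the run ends with every disk at 65% or 70%, though which disk ends at which value depends on tie-breaking.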
Community
Process
2015    Start
2016    Release 1.0.0
2017.9  Open GitHub
2020.6  Join Apache
2020.9  Release 2.0.0
2021.9  Meetup
Tools
Start contributing through the API and tools
C++/Java/Go/Python/NodeJs/Scala
Pegasus
core
user-cli
client
HTTP API
RPC API
monitoring
admin-cli
deploy tools
other tools …
Pegic(Go)/C++ shell client
Falcon/Prometheus
Minos
Admin-cli(Go)/
C++ shell client
Meta Proxy(Go)
In the future
Enhancement & Features
• Periodically Bulk load
• Duplication
• Hotspot partition detection
• Read throughput throttling
• Tracing
• Admin Service
• Others…
Pegasus 2.3.0 is being released (150+ commits)
• Partition Split
• User defined compaction
• Cluster Load Balance
• Onetime Backup
Community Development
How to contribute
• Look up or raise an issue and assign it to yourself
• Follow the Pegasus official WeChat account
• Join Pegasus developer WeChat group
What we plan to do
• Benchmark
• More documents and technical articles
• Online workshop
• Offline meetup
Thank You
https://pegasus.apache.org/
Apache Pegasus
https://github.com/apache/incubator-pegasus
The Design, Implementation and Open Source Way of Apache Pegasus

  • 2. • 分布式系统工程师 • 本科硕士均毕业于中国人民大学 • 就职于小米,负责分布式KV存储系统 Pegasus及其生态工具研发工作 • Apache Pegasus PPMC 何昱晨
  • 3. Outline • Basic Introduction – Architecture, Data Model, Dual WAL, Performance • New Features – Duplication, Bulk load, Access control, Partition split, User Defined Compaction • Surrounding Ecosystems – Pegasus-Spark, Meta proxy, Disk Migration tools • Community
  • 5. Introduction • Redis or HBase – Non-Volatile vs Consistent – Remote Access • Pegasus – C++ – Local persistent storage – Strongly consistent – High performance – Horizontally scalable
  • 6. Architecture Meta server • Cluster controller • Configuration manager Replica server • Data node • Hash partitioning • PacificA (strongly consistent) • RocksDB instance for each replica Zookeeper • Meta server election • Metadata storage ClientLib • Cache data routing table • Straightly access to replica server
  • 8. Dual WAL Traditional solution Disk Data Log Replica1 Data Log Replica2 Data Log Replica3 client • Data background compaction may strongly affect WAL sync performance
  • 9. Dual WAL Data Disk Data Private Log Replica1 Data Private Log Replica2 Data Private Log Replica3 client Shared Log Log Disk • Separate WAL and data, sync-write shared log, async-write private log
  • 10. Performance Read:Write Client*Thread --- QPS AvgLatency P99Latency(us) 0:1 3*15 read --- --- --- write 46128 972 5591 1:0 3*50 read 282648 542 1674 write --- --- --- 1:1 3*30 read 36014 1068 15345 write 36016 1421 8197 1:3 3*15 read 11622 779 10417 write 34989 1021 5467 2.2.0 (Newest release) benchmark
  • 12. Duplication Basic introduction Region2 Table Region1 Table async-duplication • Design for cross-region online backup • Transfer log, write asynchronously • Supporting single-master and multi-master
  • 13. Duplication Case1: Online Migration Target Cluster Table Source Cluster Table client 1. Reserve logs Remote storage 2. cold backup 3. restore 4. duplication 5. switch
  • 14. Duplication Case2: Master-Slave cluster client client Slave region Table Master region Table duplication Eventually-consistent read client client Table Region1 Region2
  • 15. Duplication Enhancement in future • Master-master in practice • More than two region duplication in practice • Facility for supporting remote disaster-tolerant system • auto-switch master slave • better user experience • Extension: • supporting CDC on demand • eg: ES, MQ…
  • 16. Bulk Load Fast import lots of data offline sst file sst file Table Replica server original data File provider sst file sst file 1. Generate Files 2. Download Files 3. Ingest Files client R/W Reject write(ingestion)
  • 17. Access Control Authentication: Kerberos Authorization: Whitelist based coarse-grained table-level access control Cluster KeytabA X TableA KeytabB TableB KeytabA client
  • 18. Partition Split: Basic introduction • Each replica divides into two replicas • Replica[i] -> Replica[i], Replica[i+original_partition_count] • Example with 4 original partitions: group0 Replica0 -> Replica0, Replica4; group1 Replica1 -> Replica1, Replica5; group2 Replica2 -> Replica2, Replica6; group3 Replica3 -> Replica3, Replica7
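The index rule above is a direct consequence of modulo placement, which a short sketch can check. It assumes keys are placed by `hash % partition_count` on an integer hash (the slides only say "hash partitioning"); `split_targets` and `partition_after_split` are hypothetical helper names.

```python
def split_targets(parent_index: int, original_partition_count: int):
    # Replica[i] -> Replica[i], Replica[i + original_partition_count]
    return parent_index, parent_index + original_partition_count


def partition_after_split(key_hash: int, original_partition_count: int) -> int:
    # Assumption: placement is hash modulo partition count, and a split doubles the count.
    return key_hash % (original_partition_count * 2)


# Every key that lived in partition i lands in i or in i + original count after the split.
original = 4
for key_hash in range(1000):
    parent = key_hash % original
    assert partition_after_split(key_hash, original) in split_targets(parent, original)
```

This is why a child only ever needs data from its own parent: no key moves between unrelated partitions.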
  • 19. Partition Split, Stage 1: async-learn [Diagram: on each of three replica servers, a child replica (one primary, two secondaries) copies data] • parent = old replica, child = new replica • The child replica copies data from its parent • The client only knows about the parent replica
  • 20. Partition Split, Stage 2: register [Diagram: the child replicas (one primary, two secondaries) are registered with the meta server; client requests are rejected (X)] • Starts once the child has copied all of the parent's data • Reads and writes are rejected while registering
  • 21. Partition Split: split succeeded [Diagram: each replica server now hosts a pair of replicas with the same role (primary/primary or secondary/secondary), and the client accesses them] • Will be released in 2.3.0 • Duplicated data is garbage-collected by compaction
  • 22. User defined compaction: current compaction operation, Deletion • RocksDB instance before compaction: No=1, key="key", value="old_value"; No=2, key="key2", value="value", expired; No=3, key="key3", value="value"; No=4, key="key4", value="value", parent; No=5, key="key", value="new_value" • RocksDB instance after compaction: No=3, key="key3", value="value"; No=5, key="key", value="new_value" • Compaction garbage-collects duplicated data and expired data
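The before/after state on this slide can be reproduced with a small model. This is only an illustration of the outcome, not of RocksDB's compaction machinery: records are plain dictionaries, and reading the `parent` marker as "data that belongs to the other partition after a split" is an assumption drawn from the partition split slides.

```python
records = [
    {"no": 1, "key": "key", "value": "old_value"},
    {"no": 2, "key": "key2", "value": "value", "expired": True},
    {"no": 3, "key": "key3", "value": "value"},
    {"no": 4, "key": "key4", "value": "value", "parent": True},
    {"no": 5, "key": "key", "value": "new_value"},
]


def compact(records):
    # Keep only the newest version of each key (a higher No overwrites a lower one).
    latest = {}
    for rec in sorted(records, key=lambda r: r["no"]):
        latest[rec["key"]] = rec
    # Drop expired records and records marked as leftover data (assumed meaning of "parent").
    survivors = [
        rec for rec in latest.values()
        if not rec.get("expired") and not rec.get("parent")
    ]
    return sorted(survivors, key=lambda r: r["no"])


remaining = compact(records)
```

With the five records above, only No=3 and No=5 survive, matching the slide.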
  • 23. User defined compaction: current compaction operation, Update table-level TTL • Table TTL: 30 days • RocksDB instance before compaction: No=1, key="key1", value="value"; No=2, key="key2", value="value"; No=3, key="key3", value="value" • RocksDB instance after compaction: No=1, key="key1", value="value", ttl=30 days; No=2, key="key2", value="value", ttl=30 days; No=3, key="key3", value="value", ttl=30 days
  • 24. User defined compaction • Compaction Operations: Update TTL (based on current time), Update TTL (based on old TTL), Update TTL (timestamp), Deletion • Compaction Rules: No TTL, TTL range, HashKey prefix, HashKey postfix, HashKey anywhere, SortKey prefix, SortKey postfix, SortKey anywhere
  • 25. User defined compaction: Use case examples • Case 1: Compaction Rule = TTL Range, Compaction Operation = Update TTL: change the TTL of data whose TTL is more than 6 months to 2 months • Case 2: Compaction Rule = HashKey Prefix + TTL Range, Compaction Operation = Deletion: delete data whose HashKey has the prefix "test" and whose TTL is more than 30 days • Will be released in 2.3.0
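The rule-plus-operation idea in these use cases can be sketched as composable predicates and transforms. Everything here is hypothetical illustration rather than the Pegasus configuration format: records are dictionaries with a remaining `ttl` in seconds (`None` meaning no TTL), 6 months is approximated as 180 days, and all rules must match for the operation to apply.

```python
DAY = 86400  # seconds


def ttl_more_than(seconds):
    # Rule: remaining TTL exceeds the bound; records without a TTL never match.
    return lambda rec: rec["ttl"] is not None and rec["ttl"] > seconds


def hashkey_prefix(prefix):
    # Rule: the hash key starts with the given prefix.
    return lambda rec: rec["hash_key"].startswith(prefix)


def set_ttl(seconds):
    # Operation: return a copy of the record with a new TTL.
    return lambda rec: {**rec, "ttl": seconds}


def delete(rec):
    # Operation: returning None drops the record.
    return None


def apply_compaction(records, rules, operation):
    out = []
    for rec in records:
        if all(rule(rec) for rule in rules):
            rec = operation(rec)
        if rec is not None:
            out.append(rec)
    return out


data = [
    {"hash_key": "test_1", "ttl": 40 * DAY},
    {"hash_key": "test_2", "ttl": 10 * DAY},
    {"hash_key": "prod_1", "ttl": 200 * DAY},
    {"hash_key": "prod_2", "ttl": None},
]
# Case 1: TTL of more than ~6 months becomes 2 months.
case1 = apply_compaction(data, [ttl_more_than(180 * DAY)], set_ttl(60 * DAY))
# Case 2: delete HashKey prefix "test" with TTL of more than 30 days.
case2 = apply_compaction(data, [hashkey_prefix("test"), ttl_more_than(30 * DAY)], delete)
```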
  • 27. Pegasus-Spark Best practices • Large offline data analysis (SQL) • Large offline data load (Bulk Load)
  • 28. Pegasus-Spark: Offline Analysis • Convert into Hive (parquet) • Use SparkSQL to analyze [Diagram labels: replica servers, HDFS, RDD, Schema, Hive]
  • 29. Pegasus-Spark: Convert to SST files for Bulk Load [Diagram: original data -> Transform (Pegasus-Spark) across nodes: Distinct, Repartition, Sort -> HDFS (sst file)]
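The Distinct, Repartition, Sort sequence can be illustrated without Spark. This is a plain-Python sketch of the data flow only: `prepare_for_bulk_load` is a hypothetical name, `zlib.crc32` stands in for the real partitioning hash, and letting the last duplicate win is an assumption. The final sort matters because RocksDB's SST file writer expects keys in ascending order.

```python
import zlib


def prepare_for_bulk_load(rows, partition_count):
    """rows: iterable of (hash_key, sort_key, value) tuples of bytes."""
    # Distinct: one value per (hash_key, sort_key); assumption: the last duplicate wins.
    deduped = {}
    for hash_key, sort_key, value in rows:
        deduped[(hash_key, sort_key)] = value

    # Repartition: group rows by target partition (crc32 is a stand-in hash).
    partitions = [[] for _ in range(partition_count)]
    for (hash_key, sort_key), value in deduped.items():
        index = zlib.crc32(hash_key) % partition_count
        partitions[index].append((hash_key, sort_key, value))

    # Sort: each partition's rows are ordered by key before being written out.
    return [sorted(part) for part in partitions]


rows = [
    (b"u1", b"b", b"1"),
    (b"u1", b"a", b"2"),
    (b"u1", b"b", b"3"),
    (b"u2", b"a", b"4"),
]
partitions = prepare_for_bulk_load(rows, 2)
```

Each resulting list corresponds to the input for one partition's sst file.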
  • 30. Meta Proxy: Basic introduction • Access unification • Primary and standby cluster management [Diagram: without MetaProxy, clients connect directly to the meta servers of Cluster A, Cluster B, and Cluster C; with MetaProxy, clients reach the same clusters through MetaProxy]
  • 31. Meta Proxy: Switch primary and standby cluster [Diagram: clients reach the primary cluster through MetaProxy while duplication runs between the primary and secondary clusters; after the switch, the two clusters exchange roles]
  • 32. Disk migration tool: balance disk usage on a replica server • Before: Disk1 70%, Disk2 75%, Disk3 85%, Disk4 40% • Disk migrator: select disk, select replica, migrate replica; loop until balanced • After: Disk1 70%, Disk2 65%, Disk3 70%, Disk4 65%
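The select-disk, select-replica, migrate loop can be sketched as a greedy balancer. This is an assumed policy for illustration, not the tool's actual algorithm: `balance_disks`, the `threshold` parameter, and the 5% replica size are hypothetical, and each replica's size is expressed as a percentage of disk capacity.

```python
def balance_disks(disks, threshold=5):
    """disks: disk name -> list of replica sizes (percent of capacity). Mutated in place."""
    moves = []
    while True:
        usage = {name: sum(replicas) for name, replicas in disks.items()}
        src = max(usage, key=usage.get)   # select disk: most used
        dst = min(usage, key=usage.get)   # select disk: least used
        gap = usage[src] - usage[dst]
        if gap <= threshold:
            return moves                  # balanced
        # select replica: the largest one that still narrows the gap
        candidates = [size for size in disks[src] if 0 < size < gap]
        if not candidates:
            return moves                  # nothing movable would help
        replica = max(candidates)
        disks[src].remove(replica)        # migrate replica
        disks[dst].append(replica)
        moves.append((src, dst, replica))


# The slide's starting point, modeled with hypothetical 5% replicas.
disks = {
    "Disk1": [5] * 14,   # 70%
    "Disk2": [5] * 15,   # 75%
    "Disk3": [5] * 17,   # 85%
    "Disk4": [5] * 8,    # 40%
}
moves = balance_disks(disks)
```

Requiring the moved replica to be smaller than the current gap guarantees that every move reduces imbalance, so the loop terminates.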
  • 34. Process • 2015: Start • 2016: Release 1.0.0 • 2017.9: Open GitHub • 2020.6: Join Apache • 2020.9: Release 2.0.0 • 2021.9: Meetup
  • 35. Tools: Start contributing from the API and tools • Pegasus core: HTTP API, RPC API • client: C++/Java/Go/Python/NodeJs/Scala • user-cli: Pegic (Go) / C++ shell client • admin-cli: Admin-cli (Go) / C++ shell client • monitoring: Falcon/Prometheus • deploy tools: Minos • other tools: Meta Proxy (Go) …
  • 36. In the future • Pegasus 2.3.0 is being released (150+ commits): Partition Split, User defined compaction, Cluster Load Balance, Onetime Backup • Enhancements & features: Periodic bulk load, Duplication, Hotspot partition detection, Read throughput throttling, Tracing, Admin Service, Others…
  • 37. Community Development • How to contribute: look up or raise an issue and assign it to yourself; follow the official Pegasus WeChat account; join the Pegasus developer WeChat group • What we plan to do: benchmarks, more documents and technical articles, online workshops, offline meetups