The Design, Implementation and Open Source Way of Apache Pegasus
何昱晨
2021.9.25
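• Distributed systems engineer
• Bachelor's and master's degrees both from Renmin University of China
• Works at Xiaomi, responsible for developing the distributed KV storage system Pegasus and its ecosystem tools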
• Apache Pegasus PPMC
何昱晨
Outline
• Basic Introduction
– Architecture, Data Model, Dual WAL, Performance
• New Features
– Duplication, Bulk load, Access control, Partition split, User Defined Compaction
• Surrounding Ecosystems
– Pegasus-Spark, Meta proxy, Disk Migration tools
• Community
Basic Introduction
Introduction
• Redis or HBase
– Non-Volatile vs Consistent
– Remote Access
• Pegasus
– C++
– Local persistent storage
– Strongly consistent
– High performance
– Horizontally scalable
Architecture
Meta server
• Cluster controller
• Configuration manager
Replica server
• Data node
• Hash partitioning
• PacificA (strongly consistent)
• RocksDB instance for each replica
Zookeeper
• Meta server election
• Metadata storage
ClientLib
• Cache data routing table
• Direct access to replica servers
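The client-side routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Pegasus's client code: `zlib.crc32` stands in for whatever hash function the real client uses, and the `RoutingTable` class, the server names, and the partition count are all hypothetical.

```python
import zlib


class RoutingTable:
    """Client-side cache: partition index -> address of the primary replica server."""

    def __init__(self, primaries):
        self.primaries = list(primaries)

    def partition_of(self, hash_key: bytes) -> int:
        # Hash partitioning: a key belongs to partition hash(key) mod partition_count.
        return zlib.crc32(hash_key) % len(self.primaries)

    def primary_for(self, hash_key: bytes) -> str:
        # The client contacts the replica server directly; the meta server is
        # only needed to fill or refresh this cached table.
        return self.primaries[self.partition_of(hash_key)]


# Hypothetical 4-partition table spread over three replica servers.
table = RoutingTable(["replica-a", "replica-b", "replica-c", "replica-a"])
print(table.partition_of(b"user_1001"), table.primary_for(b"user_1001"))
```

Caching the table in the client keeps the meta server off the read/write path, which is why the slide lists it only as a cluster controller and configuration manager.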
Data Model
Dual WAL
Traditional solution
[Diagram: a client writes to Replica1, Replica2 and Replica3; each replica keeps its data and its log on the same disk]
• Background data compaction can significantly degrade WAL sync performance
Dual WAL
[Diagram: a client writes to Replica1, Replica2 and Replica3; each replica keeps its data and a private log on the data disk, and a single shared log lives on a separate log disk]
• Separate WAL and data, sync-write shared log, async-write private log
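A toy model of the dual-WAL write path may make the split clearer. It is only a sketch under assumptions: the `DualWal` class and its record format are hypothetical, two temporary directories stand in for the log disk and the data disk, and recovery is not modeled.

```python
import os
import tempfile


class DualWal:
    """Toy model: one shared log for all replicas, one private log per replica."""

    def __init__(self, log_dir, data_dir, replica_ids):
        self.shared = open(os.path.join(log_dir, "shared.log"), "ab")
        self.private_paths = {
            r: os.path.join(data_dir, f"private_{r}.log") for r in replica_ids
        }
        self.pending = {r: [] for r in replica_ids}  # buffered private-log records

    def write(self, replica_id, record: bytes):
        # Synchronous path: return only after the shared-log record has been
        # flushed and fsynced to the log disk.
        self.shared.write(f"{replica_id}:".encode() + record + b"\n")
        self.shared.flush()
        os.fsync(self.shared.fileno())
        # Asynchronous path: the private-log record is only buffered here.
        self.pending[replica_id].append(record)

    def flush_private(self):
        # Run later (for example from a background task) to persist private logs.
        for replica_id, records in self.pending.items():
            if not records:
                continue
            with open(self.private_paths[replica_id], "ab") as f:
                for record in records:
                    f.write(record + b"\n")
            records.clear()

    def close(self):
        self.shared.close()


log_dir, data_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
wal = DualWal(log_dir, data_dir, ["replica1", "replica2"])
wal.write("replica1", b"put k1 v1")
```

Because only the shared log is on the synchronous path, compaction I/O on the data disk does not sit between a client write and its durable log record.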
Performance
Read:Write  Client*Thread  Operation  QPS     AvgLatency  P99Latency(us)
0:1         3*15           read       ---     ---         ---
                           write      46128   972         5591
1:0         3*50           read       282648  542         1674
                           write      ---     ---         ---
1:1         3*30           read       36014   1068        15345
                           write      36016   1421        8197
1:3         3*15           read       11622   779         10417
                           write      34989   1021        5467
2.2.0 (Newest release) benchmark
New Features
Duplication
Basic introduction
[Diagram: a table in Region1 and a table in Region2, connected by async-duplication]
• Designed for cross-region online backup
• Transfers logs and writes them asynchronously
• Supports single-master and multi-master
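The asynchronous, log-based behaviour listed above can be illustrated with a small in-memory model. This is a sketch only: the `Table` and `Duplicator` classes are hypothetical, the "log" is a plain queue, and network transfer, failures and multi-master conflict handling are left out.

```python
from collections import deque


class Table:
    def __init__(self):
        self.data = {}


class Duplicator:
    """Ships logged writes from a source table to a remote table, in order."""

    def __init__(self, source: Table, remote: Table):
        self.source, self.remote = source, remote
        self.log = deque()  # writes not yet shipped to the remote region

    def put(self, key, value):
        # The client write completes against the source table only.
        self.source.data[key] = value
        self.log.append((key, value))

    def ship(self, max_entries=None):
        # Called in the background; replays logged writes on the remote table.
        shipped = 0
        while self.log and (max_entries is None or shipped < max_entries):
            key, value = self.log.popleft()
            self.remote.data[key] = value
            shipped += 1
        return shipped


region1, region2 = Table(), Table()
dup = Duplicator(region1, region2)
dup.put("k", "v1")
dup.put("k", "v2")
print(region2.data)  # the remote table lags until ship() runs
dup.ship()
print(region2.data)
```

Readers of the remote table see older data until the log catches up, which is the eventually-consistent read shown for the master-slave case.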
Duplication
Case 1: Online Migration
[Diagram: a client, a table in the source cluster, a table in the target cluster, and remote storage]
1. Reserve logs
2. Cold backup
3. Restore
4. Duplication
5. Switch
Duplication
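Case 2: Master-Slave cluster
[Diagram: clients access a table in the master region, which is duplicated to a table in the slave region; reads in the slave region are eventually consistent. A second panel shows clients accessing a table across Region1 and Region2.]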
Duplication
Future enhancements
• Master-master in practice
• Duplication across more than two regions in practice
• Facilities for supporting a remote disaster-tolerant system
• Automatic master/slave switchover
• Better user experience
• Extension:
• Supporting CDC on demand
• e.g. ES, MQ…
Bulk Load
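Quickly import large amounts of data offline
[Diagram: original data is converted into SST files held by a file provider; the replica server downloads the SST files and ingests them into the table; the client can read and write, but writes are rejected during ingestion]
1. Generate files
2. Download files
3. Ingest files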
Access Control
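Authentication: Kerberos
Authorization: Whitelist-based, coarse-grained, table-level access control
[Diagram: within a cluster, TableA is associated with KeytabA and TableB with KeytabB; a client holding KeytabA reaches TableA but is blocked (X) from TableB]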
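A table-level whitelist check of the kind described above could look like the sketch below. It assumes authentication has already produced a user name (for example from Kerberos); the `TableAcl` class and the user and table names are hypothetical and do not reflect Pegasus's actual configuration format.

```python
class TableAcl:
    """Table-level whitelist: each table lists the users allowed to access it."""

    def __init__(self):
        self.allowed = {}  # table name -> set of authenticated user names

    def grant(self, table, user):
        self.allowed.setdefault(table, set()).add(user)

    def check(self, table, user):
        # Coarse-grained: access is all-or-nothing per table, with no per-key rules.
        return user in self.allowed.get(table, set())


acl = TableAcl()
acl.grant("TableA", "userA")
acl.grant("TableB", "userB")
print(acl.check("TableA", "userA"), acl.check("TableB", "userA"))
```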
Partition Split
Basic introduction
• Each replica is divided into two replicas
• Replica[i] -> Replica[i], Replica[i+original_partition_count]
Replica group0: Replica0 -> Replica0, Replica4
Replica group1: Replica1 -> Replica1, Replica5
Replica group2: Replica2 -> Replica2, Replica6
Replica group3: Replica3 -> Replica3, Replica7
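The index rule Replica[i] -> Replica[i], Replica[i+original_partition_count] follows from modulo arithmetic when the partition count doubles. The check below is a sketch that assumes a key's partition is its hash modulo the partition count; `zlib.crc32` is a stand-in hash and the key names are made up.

```python
import zlib


def partition_of(hash_key: bytes, partition_count: int) -> int:
    # Assumed placement rule: hash(key) mod partition_count.
    return zlib.crc32(hash_key) % partition_count


old_count = 4
new_count = old_count * 2

for i in range(1000):
    key = f"key{i}".encode()
    parent = partition_of(key, old_count)
    child = partition_of(key, new_count)
    # h mod 2n is either h mod n or (h mod n) + n, so after the split a key
    # stays in its old partition or moves to that partition's child.
    assert child in (parent, parent + old_count)
```

This is why a child only ever needs data from its own parent: no key migrates between unrelated partitions.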
Partition Split
Stage1: async-learn
[Diagram: three replica servers (one primary, two secondaries); on each server a child replica copies data from its parent, while the client talks to the parent]
• Parent (old replica), child (new replica)
• Child replicas copy data from their parents
• Clients only know about the parent replicas
Partition Split
Stage2: register
[Diagram: the child replicas on the primary and secondary replica servers register with the meta server; client requests are rejected (X) during registration]
• Starts once the child has copied all of the parent's data
• Reads and writes are rejected while registering
Partition Split
Partition split succeeds
[Diagram: each of the three replica servers now hosts two replicas (two primaries on one server, two secondaries on each of the others), all visible to the client]
• Will be released in 2.3.0
• Duplicated data is garbage-collected by compaction
User defined compaction
Current Compaction operation - Deletion
RocksDB instance (before compaction):
No=3, key=“key3”, value=“value”
No=2, key=“key2”, value=“value”, expired
No=1, key=“key”, value=“old_value”
No=4, key=“key4”, value=“value”, parent
No=5, key=“key”, value=“new_value”
RocksDB instance (after compaction):
No=3, key=“key3”, value=“value”
No=5, key=“key”, value=“new_value”
• GC duplicated data
• GC expired data
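The before/after example above can be reproduced with a few lines of Python. This is a sketch of the idea rather than RocksDB's compaction: the `Record` type is hypothetical, and the `parent` flag on record 4 is read here as data left over from the parent partition after a split that no longer belongs to this partition, which is an assumption about the slide.

```python
from dataclasses import dataclass


@dataclass
class Record:
    no: int                # sequence number; larger means newer
    key: str
    value: str
    expired: bool = False
    parent: bool = False   # assumed: leftover data from the parent partition


def compact(records):
    latest = {}
    for r in records:
        # Keep only the newest version of each key.
        if r.key not in latest or r.no > latest[r.key].no:
            latest[r.key] = r
    # Drop records whose TTL has passed or that no longer belong here.
    kept = [r for r in latest.values() if not r.expired and not r.parent]
    return sorted(kept, key=lambda r: r.no)


before = [
    Record(3, "key3", "value"),
    Record(2, "key2", "value", expired=True),
    Record(1, "key", "old_value"),
    Record(4, "key4", "value", parent=True),
    Record(5, "key", "new_value"),
]
after = compact(before)
print([(r.no, r.key, r.value) for r in after])
# [(3, 'key3', 'value'), (5, 'key', 'new_value')]
```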
User defined compaction
Current Compaction operation – Update table-level TTL
Table TTL: 30 days
RocksDB instance (before compaction):
No=3, key=“key3”, value=“value”
No=2, key=“key2”, value=“value”
No=1, key=“key1”, value=“value”
RocksDB instance (after compaction):
No=3, key=“key3”, value=“value”, ttl=30 days
No=2, key=“key2”, value=“value”, ttl=30 days
No=1, key=“key1”, value=“value”, ttl=30 days
User defined compaction
Compaction Operations
• Update TTL (based on current time)
• Update TTL (based on old TTL)
• Update TTL (timestamp)
• Deletion
Compaction Rules
• No TTL
• TTL range
• HashKey prefix
• HashKey postfix
• HashKey anywhere
• SortKey prefix
• SortKey postfix
• SortKey anywhere
User defined compaction
Use case examples
Example 1: change the TTL of data whose TTL is more than 6 months to 2 months
• Compaction Rule = TTL Range
• Compaction Operation = Update TTL
Example 2: delete data whose HashKey has the prefix "test" and whose TTL is more than 30 days
• Compaction Rule = HashKey Prefix + TTL Range
• Compaction Operation = Deletion
• Will be released in 2.3.0
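The two examples above combine rules with an operation. A minimal sketch of that composition follows; every name in it (`Row`, `hashkey_prefix`, `ttl_more_than`, `update_ttl`, `compact`) is hypothetical rather than Pegasus's API, multiple rules are assumed to combine with AND, `ttl_more_than` is a one-sided stand-in for the TTL range rule, and 6 months is approximated as 180 days.

```python
from dataclasses import dataclass, replace
from typing import Optional

DAY = 24 * 3600


@dataclass(frozen=True)
class Row:
    hash_key: str
    sort_key: str
    ttl: Optional[int]  # remaining TTL in seconds; None means no TTL


def hashkey_prefix(prefix):
    return lambda row: row.hash_key.startswith(prefix)


def ttl_more_than(seconds):
    return lambda row: row.ttl is not None and row.ttl > seconds


def delete(row):
    return None  # returning None drops the row


def update_ttl(seconds):
    return lambda row: replace(row, ttl=seconds)


def compact(rows, rules, operation):
    """Apply `operation` to rows matching every rule; keep the other rows as-is."""
    out = []
    for row in rows:
        if all(rule(row) for rule in rules):
            row = operation(row)
        if row is not None:
            out.append(row)
    return out


rows = [
    Row("test_1", "s", 40 * DAY),
    Row("test_2", "s", 10 * DAY),
    Row("prod_1", "s", 200 * DAY),
    Row("prod_2", "s", None),
]
# Delete rows with HashKey prefix "test" and TTL of more than 30 days.
rows = compact(rows, [hashkey_prefix("test"), ttl_more_than(30 * DAY)], delete)
# Shorten any TTL of more than 6 months to 2 months.
rows = compact(rows, [ttl_more_than(180 * DAY)], update_ttl(60 * DAY))
print(rows)
```

Keeping rules and operations as separate small functions mirrors the slide's split between "Compaction Rules" and "Compaction Operations": any rule set can be paired with any operation.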
Surrounding Ecosystem
Pegasus-Spark
Best practices
• Large offline data analysis (SQL)
• Large offline data load (Bulk Load)
Pegasus-Spark
Offline Analysis
• Convert into Hive (Parquet)
• Use SparkSQL for analysis
[Diagram: replica servers, HDFS, a schema RDD, and Hive]
Pegasus-Spark
Convert to SST files for Bulk Load
[Diagram: original data spread over several nodes is transformed by Pegasus-Spark (Distinct, Repartition, Sort) into SST files on HDFS]
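The Distinct, Repartition and Sort stages named above can be sketched in plain Python, without Spark, to show what each stage contributes. Assumptions: records are `(hash_key, sort_key, value)` tuples, later duplicates win, `zlib.crc32` stands in for the real partitioning hash, and plain string ordering stands in for the real key encoding.

```python
import zlib


def to_sorted_partitions(records, partition_count):
    """records: iterable of (hash_key, sort_key, value) tuples."""
    # Distinct: keep one value per (hash_key, sort_key); later records win.
    unique = {}
    for hash_key, sort_key, value in records:
        unique[(hash_key, sort_key)] = value

    # Repartition: group records by the table partition that will ingest them.
    partitions = [[] for _ in range(partition_count)]
    for (hash_key, sort_key), value in unique.items():
        index = zlib.crc32(hash_key.encode()) % partition_count
        partitions[index].append((hash_key, sort_key, value))

    # Sort: SST files store keys in sorted order, so order each partition.
    for part in partitions:
        part.sort(key=lambda rec: (rec[0], rec[1]))
    return partitions


records = [
    ("user2", "b", "1"),
    ("user1", "b", "2"),
    ("user1", "a", "3"),
    ("user1", "b", "4"),  # duplicate key; replaces value "2"
]
parts = to_sorted_partitions(records, 4)
print(parts)
```

Each resulting partition corresponds to the input for one partition's SST files; writing the actual files is outside this sketch.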
Meta Proxy
Basic introduction
• Unified access
• Primary and standby cluster manager
[Diagram, without MetaProxy: clients connect directly to the meta servers of Cluster A, Cluster B and Cluster C]
[Diagram, with MetaProxy: clients connect to MetaProxy, which sits in front of the meta servers of Cluster A, Cluster B and Cluster C]
Meta Proxy
Switch primary and standby cluster
[Diagram: clients connect through MetaProxy; the primary cluster duplicates to the secondary cluster; after the switch the two clusters exchange roles and duplication continues between them]
Disk migration tool
Balances disk usage on a replica server
Before: Disk1 70%, Disk2 75%, Disk3 85%, Disk4 40%
Disk migrator (loop until balanced): select disk, select replica, migrate replica
After (balanced): Disk1 70%, Disk2 65%, Disk3 70%, Disk4 65%
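The select-disk, select-replica, migrate loop can be sketched as a greedy balancer. This is an illustration, not the tool's algorithm: the `balance` function, its `tolerance` parameter and the per-replica sizes are invented, with sizes chosen so that the starting usage matches the slide's 70/75/85/40 figures.

```python
def balance(disks, tolerance=5):
    """disks: dict of disk name -> list of replica sizes (percent of a disk).

    Repeatedly moves one replica from the fullest disk to the emptiest disk
    until their usage differs by at most `tolerance`, or no move would help.
    Returns the moves as (replica_size, source, target) tuples.
    """
    moves = []
    while True:
        usage = {name: sum(sizes) for name, sizes in disks.items()}
        src = max(usage, key=usage.get)   # select disk: fullest
        dst = min(usage, key=usage.get)   # target: emptiest
        gap = usage[src] - usage[dst]
        if gap <= tolerance:
            break
        # Select replica: the largest one that still narrows the gap.
        candidates = [size for size in disks[src] if size < gap]
        if not candidates:
            break
        size = max(candidates)
        disks[src].remove(size)           # migrate replica
        disks[dst].append(size)
        moves.append((size, src, dst))
    return moves


disks = {
    "Disk1": [40, 30],        # 70%
    "Disk2": [40, 25, 10],    # 75%
    "Disk3": [55, 15, 15],    # 85%
    "Disk4": [25, 15],        # 40%
}
moves = balance(disks)
print(moves)
print({name: sum(sizes) for name, sizes in disks.items()})
# {'Disk1': 70, 'Disk2': 65, 'Disk3': 70, 'Disk4': 65}
```

Every accepted move moves a replica smaller than the gap between the two disks, so the imbalance shrinks on each step and the loop terminates.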
Community
Process
[Timeline. Milestone labels: Release 1.0.0, Join Apache, Release 2.0.0, Meet UP, Start, Open GitHub. Date labels: 2015, 2016, 2017.9, 2020.6, 2020.9, 2021.9]
Tools
Start contributing from the API and tools
[Diagram: the Pegasus core exposes an HTTP API and an RPC API, surrounded by user-cli, client, monitoring, admin-cli, deploy tools and other tools. Names shown on the slide: Pegic (Go) / C++ shell client; C++/Java/Go/Python/NodeJs/Scala; Falcon/Prometheus; Minos; Admin-cli (Go) / C++ shell client; Meta Proxy (Go)]
In the future
Enhancement & Features
• Periodic Bulk load
• Duplication
• Hotspot partition detection
• Read throughput throttling
• Tracing
• Admin Service
• Others…
Pegasus 2.3.0 is being released (150+ commits)
• Partition Split
• User defined compaction
• Cluster Load Balance
• Onetime Backup
Community Development
How to contribute
• Look up or raise an issue, then assign it to yourself
• Follow the Pegasus official WeChat account
• Join Pegasus developer WeChat group
What we plan to do
• Benchmark
• More documents and technical articles
• Online workshop
• Offline meetup
Thank You
https://pegasus.apache.org/
Apache Pegasus
https://github.com/apache/incubator-pegasus

The Design, Implementation and Open Source Way of Apache Pegasus

  • 2. • 分布式系统工程师 • 本科硕士均毕业于中国人民大学 • 就职于小米,负责分布式KV存储系统 Pegasus及其生态工具研发工作 • Apache Pegasus PPMC 何昱晨
  • 3. Outline • Basic Introduction – Architecture, Data Model, Dual WAL, Performance • New Features – Duplication, Bulk load, Access control, Partition split, User Defined Compaction • Surrounding Ecosystems – Pegasus-Spark, Meta proxy, Disk Migration tools • Community
  • 5. Introduction • Redis or HBase – Non-Volatile vs Consistent – Remote Access • Pegasus – C++ – Local persistent storage – Strongly consistent – High performance – Horizontally scalable
  • 6. Architecture Meta server • Cluster controller • Configuration manager Replica server • Data node • Hash partitioning • PacificA (strongly consistent) • RocksDB instance for each replica Zookeeper • Meta server election • Metadata storage ClientLib • Caches the data routing table • Accesses replica servers directly
  • 8. Dual WAL Traditional solution Disk Data Log Replica1 Data Log Replica2 Data Log Replica3 client • Background data compaction may severely degrade WAL sync performance
  • 9. Dual WAL Data Disk Data Private Log Replica1 Data Private Log Replica2 Data Private Log Replica3 client Shared Log Log Disk • Separate WAL and data, sync-write shared log, async-write private log
  • 10. Performance (2.2.0, newest release, benchmark)
        Read:Write  Client*Thread  Op     QPS     AvgLatency  P99Latency(us)
        0:1         3*15           read   ---     ---         ---
                                   write  46128   972         5591
        1:0         3*50           read   282648  542         1674
                                   write  ---     ---         ---
        1:1         3*30           read   36014   1068        15345
                                   write  36016   1421        8197
        1:3         3*15           read   11622   779         10417
                                   write  34989   1021        5467
  • 12. Duplication Basic introduction Region2 Table Region1 Table async-duplication • Designed for cross-region online backup • Transfers logs and writes asynchronously • Supports single-master and multi-master
  • 13. Duplication Case1: Online Migration Target Cluster Table Source Cluster Table client 1. Reserve logs Remote storage 2. cold backup 3. restore 4. duplication 5. switch
  • 14. Duplication Case2: Master-Slave cluster client client Slave region Table Master region Table duplication Eventually-consistent read client client Table Region1 Region2
  • 15. Duplication Future enhancements • Master-master in practice • Duplication across more than two regions in practice • Facilities for supporting a remote disaster-tolerant system • Auto-switch between master and slave • Better user experience • Extension: • Supporting CDC on demand • e.g. ES, MQ…
  • 16. Bulk Load Quickly import large volumes of data offline sst file sst file Table Replica server original data File provider sst file sst file 1. Generate Files 2. Download Files 3. Ingest Files client R/W Reject write(ingestion)
  • 17. Access Control Authentication: Kerberos Authorization: Whitelist based coarse-grained table-level access control Cluster KeytabA X TableA KeytabB TableB KeytabA client
  • 18. Partition Split Basic introduction • Each replica divides into two replicas • Replica[i] -> Replica[i], Replica[i+original_partition_count] Replica group0 Replica0 Replica4 Replica0 Replica group1 Replica1 Replica5 Replica1 Replica group2 Replica2 Replica6 Replica2 Replica group3 Replica3 Replica7 Replica3
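The index mapping on slide 18 can be sketched in a few lines. This is a minimal illustration rather than Pegasus code: `child_index` and `partition_for_hash` are hypothetical names, and routing by `hash % partition_count` is an assumption made here only to show why doubling the partition count sends every key of a parent either to the parent itself or to its child.

```python
def child_index(parent_index: int, original_partition_count: int) -> int:
    """Index of the child replica created when partition `parent_index` splits."""
    if not 0 <= parent_index < original_partition_count:
        raise ValueError("parent_index out of range")
    return parent_index + original_partition_count


def partition_for_hash(key_hash: int, partition_count: int) -> int:
    """Assumed routing rule (illustration only): hash modulo partition count."""
    return key_hash % partition_count


original_count = 4
# Each parent i maps to the pair (parent, child) after the split.
mapping = {i: (i, child_index(i, original_count)) for i in range(original_count)}
```

With four original partitions, the mapping pairs 0 with 4, 1 with 5, 2 with 6 and 3 with 7, which matches the replica groups drawn on the slide.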
  • 19. Partition Split Stage1: async-learn client Replica server child secondary Replica server child primary Replica server child secondary copy data copy data copy data • parent (old replica), child (new replica) • The child replica copies data from its parent • The client only knows about the parent replica
  • 20. Partition Split Stage2: register client Replica server child secondary Replica server child primary Replica server child secondary meta server register child X • Register once the child has copied all parent data • Reject R/W while registering
  • 21. Partition Split Partition split succeeded Replica server secondary secondary Replica server primary primary Replica server secondary secondary client • Will be released in 2.3.0 • GC dup-data by compaction
  • 22. User defined compaction Current Compaction operation - Deletion No=3, key=“key3”, value=“value” No=2, key=“key2”, value=“value”, expired No=1, key=“key”, value=“old_value” No=4, key=“key4”, value=“value”, parent No=5, key=“key”, value=“new_value” RocksDB instance compaction No=3, key=“key3”, value=“value” No=5, key=“key”, value=“new_value” RocksDB instance GC duplicated data GC expired data
  • 23. User defined compaction Current Compaction operation – Update table-level TTL No=3, key=“key3”, value=“value” No=2, key=“key2”, value=“value” No=1, key=“key1”, value=“value” RocksDB instance No=3, key=“key3”, value=“value”,ttl=30 days No=2, key=“key2”, value=“value”,ttl=30 days No=1, key=“key1”, value=“value”,ttl=30 days RocksDB instance compaction Table TTL 30 days
  • 24. User defined compaction Update TTL(Based on current time) Compaction Operations Update TTL(Based on old TTL) Update TTL(timestamp) Deletion No TTL TTL range HashKey prefix Compaction Rules HashKey postfix HashKey anywhere SortKey prefix SortKey postfix SortKey anywhere
  • 25. User defined compaction Use case examples • Example 1: Compaction Rule = TTL Range, Compaction Operation = Update TTL: update data with a TTL of more than 6 months to a TTL of 2 months • Example 2: Compaction Rule = HashKey Prefix + TTL Range, Compaction Operation = Deletion: delete data with HashKey prefix "test" and a TTL of more than 30 days • Will be released in 2.3.0
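The two use cases on slide 25 can be modelled as rule/operation pairs applied to each record during compaction. The sketch below is an illustration under stated assumptions, not the Pegasus implementation or its configuration format: `Record`, `ttl_more_than`, `hash_key_prefix`, `update_ttl`, `deletion` and `compact` are hypothetical names, TTL is modelled as remaining seconds, and "6 months" is approximated as 180 days.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple

DAY = 24 * 60 * 60  # seconds


@dataclass(frozen=True)
class Record:
    hash_key: str
    sort_key: str
    ttl_seconds: int  # remaining time to live (assumed representation)


Rule = Callable[[Record], bool]
Operation = Callable[[Record], Optional[Record]]  # None means "delete"


def ttl_more_than(seconds: int) -> Rule:
    return lambda record: record.ttl_seconds > seconds


def hash_key_prefix(prefix: str) -> Rule:
    return lambda record: record.hash_key.startswith(prefix)


def update_ttl(seconds: int) -> Operation:
    return lambda record: replace(record, ttl_seconds=seconds)


def deletion() -> Operation:
    return lambda record: None


def compact(records: List[Record],
            policies: List[Tuple[List[Rule], Operation]]) -> List[Record]:
    """Apply each operation to the records that match ALL of its rules."""
    kept = []
    for record in records:
        current: Optional[Record] = record
        for rules, operation in policies:
            if current is None:
                break  # already deleted by an earlier policy
            if all(rule(current) for rule in rules):
                current = operation(current)
        if current is not None:
            kept.append(current)
    return kept


policies = [
    # Delete HashKey prefix "test" with TTL of more than 30 days.
    ([hash_key_prefix("test"), ttl_more_than(30 * DAY)], deletion()),
    # Update TTL of more than ~6 months (180 days here) to 2 months.
    ([ttl_more_than(180 * DAY)], update_ttl(60 * DAY)),
]
records = [
    Record("test_a", "s1", 40 * DAY),   # matches the deletion policy
    Record("test_b", "s1", 10 * DAY),   # prefix matches, TTL does not
    Record("user_a", "s1", 200 * DAY),  # matches the update-TTL policy
    Record("user_b", "s1", 90 * DAY),   # matches nothing
]
result = compact(records, policies)
```

Combining rules with a logical AND mirrors the "HashKey Prefix + TTL Range" wording on the slide; whether Pegasus evaluates rules in exactly this way is not stated in the deck.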
  • 27. Pegasus-Spark Best practices • Large offline data analysis (SQL) • Large offline data load (Bulk Load)
  • 28. Pegasus-Spark Offline Analysis • Convert into Hive(parquet) • Use SparkSQL to analyze HDFS Replica server Replica server Hive Schema RDD
  • 29. Pegasus-Spark Convert to SST file for Bulk load node node node node node node Transform(Pegasus-Spark) HDFS (sst file) Distinct Repartition Sort original data original data
  • 30. Meta Proxy Basic introduction • access unification • primary and standby cluster manager client client client Cluster A meta meta Cluster B meta meta Cluster C meta meta client client client Cluster A meta meta Cluster B meta meta Cluster C meta meta MetaProxy
  • 31. Meta Proxy Switch primary and standby cluster client client client Cluster primary meta meta Cluster secondary meta meta MetaProxy duplication client client client Cluster secondary meta meta Cluster primary meta meta MetaProxy duplication switch
  • 32. Disk migration tool Balances disk usage on a replica server Disk4 40% Disk2 75% Disk1 70% Disk3 85% Disk migrator Select Disk Select Replica Migrate Replica balanced Disk4 65% Disk2 65% Disk1 70% Disk3 70% Replica server Replica server Loop until balanced
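The "select disk, select replica, migrate replica, loop until balanced" cycle on slide 32 can be sketched as a greedy loop. This is a hypothetical illustration, not the actual tool: `balance_disks`, the 10-point threshold, and the per-replica sizes are assumptions, chosen so that the totals start at the slide's 70/75/85/40 percent.

```python
from typing import Dict, List, Tuple


def balance_disks(disks: Dict[str, List[int]],
                  threshold: int = 10,
                  max_rounds: int = 100) -> List[Tuple[str, str, int]]:
    """Greedily move replicas from the fullest disk to the emptiest one.

    `disks` maps a disk name to the sizes of its replicas, expressed as
    percentage points of disk capacity. The dict is modified in place and
    the list of (source, target, replica_size) moves is returned.
    """
    moves = []
    for _ in range(max_rounds):
        usage = {name: sum(replicas) for name, replicas in disks.items()}
        source = max(usage, key=usage.get)  # select disk: most used
        target = min(usage, key=usage.get)  # select disk: least used
        gap = usage[source] - usage[target]
        if gap <= threshold:
            break  # balanced enough
        # Select replica: only moves smaller than the gap narrow it;
        # prefer the one that leaves the two disks closest together.
        candidates = [size for size in disks[source] if size < gap]
        if not candidates:
            break
        size = min(candidates, key=lambda s: abs(gap - 2 * s))
        disks[source].remove(size)  # migrate replica
        disks[target].append(size)
        moves.append((source, target, size))
    return moves


# Illustrative replica sizes (assumed); totals match the slide's "before" state.
disks = {
    "Disk1": [35, 35],      # 70%
    "Disk2": [10, 30, 35],  # 75%
    "Disk3": [15, 35, 35],  # 85%
    "Disk4": [40],          # 40%
}
moves = balance_disks(disks)
final_usage = {name: sum(replicas) for name, replicas in disks.items()}
```

Restricting each move to a replica smaller than the current gap guarantees the gap between the two chosen disks shrinks on every round, so the loop cannot oscillate; `max_rounds` is only an extra safety bound.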
  • 34. Process 2016 Release 1.0.0 Join Apache Release 2.0.0 Meet UP 2015 Start Open GitHub 2017.9 2020.6 2020.9 2021.9
  • 35. Tools Start contribution from API and tools C++/Java/Go/Python/NodeJs/Scala Pegasus core user-cli client HTTP API RPC API monitoring admin-cli deploy tools other tools … Pegic(Go)/C++ shell client Falcon/Prometheus Minos Admin-cli(Go)/ C++ shell client Meta Proxy(Go)
  • 36. In the future Enhancement & Features • Periodic Bulk load • Duplication • Hotspot partition detection • Read throughput throttling • Tracing • Admin Service • Others… Pegasus 2.3.0 is being released (150+ commits) • Partition Split • User defined compaction • Cluster Load Balance • Onetime Backup
  • 37. Community Development How to contribute • Look up or raise an issue and assign it to yourself • Follow the Pegasus official WeChat account • Join the Pegasus developer WeChat group What we plan to do • Benchmark • More documents and technical articles • Online workshop • Offline meetup