The Design, Implementation and Open Source Way of Apache Pegasus

Pegasus: Design, Implementation, and the Road to Open Source
He Yuchen
2021.9.25

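He Yuchen
• Distributed systems engineer
• Bachelor's and master's degrees from Renmin University of China
• Works at Xiaomi on the research and development of the distributed KV storage system Pegasus and its ecosystem tools
• Apache Pegasus PPMC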

Outline
• Basic Introduction
– Architecture, Data Model, Dual WAL, Performance
• New Features
– Duplication, Bulk load, Access control, Partition split, User Defined Compaction
• Surrounding Ecosystems
– Pegasus-Spark, Meta proxy, Disk Migration tools
• Community

Introduction
• Redis or HBase
– Non-Volatile vs Consistent
– Remote Access
• Pegasus
– C++
– Local persistent storage
– Strongly consistent
– High performance
– Horizontally scalable

Architecture
Meta server
• Cluster controller
• Configuration manager
Replica server
• Data node
• Hash partitioning
• PacificA (strongly consistent)
• RocksDB instance for each replica
Zookeeper
• Meta server election
• Metadata storage
ClientLib
• Caches the data routing table
• Accesses replica servers directly
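The hash partitioning and client-side routing described above can be sketched as follows. This is a minimal illustration, not the Pegasus client: the class name `RoutingTable`, the use of CRC32 as the hash function, and the server addresses are all assumptions made for the example.

```python
import zlib


class RoutingTable:
    """Hypothetical client-side cache: partition index -> primary replica server."""

    def __init__(self, primaries):
        # primaries[i] is the address of the server hosting the primary of partition i.
        self.primaries = list(primaries)

    def partition_index(self, hash_key: bytes) -> int:
        # Hash partitioning: hash the key, then take it modulo the partition count.
        # CRC32 is used here only because it ships with Python (an assumption).
        return zlib.crc32(hash_key) % len(self.primaries)

    def route(self, hash_key: bytes) -> str:
        # The client goes straight to the replica server, without asking the meta server.
        return self.primaries[self.partition_index(hash_key)]


# Made-up addresses for a table with four partitions.
table = RoutingTable(["10.0.0.1:34801", "10.0.0.2:34801",
                      "10.0.0.3:34801", "10.0.0.4:34801"])
print(table.route(b"user_1"))
```

In this sketch the meta server would only be contacted to build or refresh the cached table, which is why it stays off the read/write path.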

Dual WAL
Traditional solution
[Figure: a client writes to Replica1, Replica2, and Replica3; each replica keeps its data and its log on the same disk]
• Background data compaction may strongly affect WAL sync performance

Dual WAL
[Figure: a client writes to Replica1, Replica2, and Replica3; each replica keeps its data and its private log on the data disk, while a single shared log lives on a separate log disk]
• Separate WAL and data: sync-write the shared log, async-write the private log
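The split between a synchronously written shared log and asynchronously written private logs can be sketched as below. This is a toy model under stated assumptions: `DualWalServer` and its methods are hypothetical names, lists stand in for files on the two disks, and "async" is modelled as an explicit deferred flush rather than a background thread.

```python
class DualWalServer:
    """Toy model of one replica server with a shared log and per-replica private logs."""

    def __init__(self, replica_ids):
        self.shared_log = []                                # stands in for the log disk
        self.private_logs = {r: [] for r in replica_ids}    # stand in for the data disk
        self.pending = []                                   # not yet in a private log

    def write(self, replica_id, mutation):
        if replica_id not in self.private_logs:
            raise KeyError(replica_id)
        # Sync path: the write is acknowledged once it is in the shared log.
        self.shared_log.append((replica_id, mutation))
        # Async path: the private log is filled in later.
        self.pending.append((replica_id, mutation))
        return len(self.shared_log) - 1

    def flush_private_logs(self):
        # Deferred work that would run in the background in a real system.
        flushed = len(self.pending)
        for replica_id, mutation in self.pending:
            self.private_logs[replica_id].append(mutation)
        self.pending.clear()
        return flushed


server = DualWalServer(["replica1", "replica2", "replica3"])
server.write("replica1", "set k1=v1")
server.write("replica2", "set k2=v2")
server.flush_private_logs()
```

Because the acknowledged write only touches the log disk in this model, compaction traffic on the data disk does not sit on the write's critical path.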

Performance
Benchmark on 2.2.0 (newest release); latencies in microseconds

Read:Write  Client*Thread  Op     QPS     AvgLatency(us)  P99Latency(us)
0:1         3*15           read   ---     ---             ---
                           write  46128   972             5591
1:0         3*50           read   282648  542             1674
                           write  ---     ---             ---
1:1         3*30           read   36014   1068            15345
                           write  36016   1421            8197
1:3         3*15           read   11622   779             10417
                           write  34989   1021            5467

Duplication
Basic introduction
[Figure: a table in Region1 and a table in Region2, connected by async-duplication]
• Designed for cross-region online backup
• Transfers logs and writes them asynchronously
• Supports single-master and multi-master

Duplication
Case1: Online Migration
[Figure: a client, a table in the source cluster, a table in the target cluster, and remote storage]
1. Reserve logs
2. Cold backup (to remote storage)
3. Restore (into the target cluster)
4. Duplication
5. Switch

Duplication
Case2: Master-Slave cluster
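[Figure: clients access the table in the master region; duplication copies it to the table in the slave region, where clients perform eventually consistent reads. A second panel shows clients and a table spanning Region1 and Region2]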

Duplication
Enhancement in future
• Master-master duplication in practice
• Duplication across more than two regions in practice
• Facilities for supporting a remote disaster-tolerant system
• Automatic master/slave switching
• Better user experience
• Extension:
• Supporting CDC on demand
• e.g., ES, MQ…

Bulk Load
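Quickly import large amounts of data offline
[Figure: original data is converted to sst files, stored on a file provider, and downloaded by the replica servers that host the table]
1. Generate files
2. Download files
3. Ingest files
• Clients can read and write during bulk load; writes are rejected during ingestion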

Access Control
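Authentication: Kerberos
Authorization: whitelist-based, coarse-grained, table-level access control
[Figure: within one cluster, TableA whitelists KeytabA and TableB whitelists KeytabB; a client holding KeytabA can access TableA but is rejected (X) by TableB]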
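The table-level whitelist check can be sketched as follows. This is a minimal sketch, assuming that Kerberos authentication has already produced a verified principal name; `TableAcl`, its methods, and the rule that a table without a whitelist entry denies everyone are assumptions made for the example, not the Pegasus implementation.

```python
class TableAcl:
    """Hypothetical coarse-grained ACL: one whitelist of principals per table."""

    def __init__(self):
        self.whitelists = {}

    def allow(self, table, principal):
        self.whitelists.setdefault(table, set()).add(principal)

    def can_access(self, table, principal):
        # The check is per table only: there is no per-key or per-operation rule.
        return principal in self.whitelists.get(table, set())


acl = TableAcl()
acl.allow("TableA", "userA")   # principal taken from KeytabA (made-up name)
acl.allow("TableB", "userB")   # principal taken from KeytabB (made-up name)
print(acl.can_access("TableA", "userA"), acl.can_access("TableB", "userA"))
```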

Partition Split
Basic introduction
• Each replica is divided into two replicas
• Replica[i] -> Replica[i], Replica[i + original_partition_count]
[Figure: with four original partitions, Replica0 splits into Replica0 and Replica4, Replica1 into Replica1 and Replica5, Replica2 into Replica2 and Replica6, and Replica3 into Replica3 and Replica7]
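The index mapping works because of a property of modulo arithmetic: when the partition count doubles from n to 2n, a key that hashed to partition i can only land in partition i or partition i + n. The sketch below checks this; the function names are hypothetical and it assumes keys are placed by `hash % partition_count`.

```python
def partition_index(key_hash: int, partition_count: int) -> int:
    # Assumed placement rule: hash modulo the partition count.
    return key_hash % partition_count


def split_children(i: int, original_partition_count: int):
    # Replica[i] -> Replica[i], Replica[i + original_partition_count]
    return (i, i + original_partition_count)


original_count = 4
for key_hash in range(1000):
    old = partition_index(key_hash, original_count)
    new = partition_index(key_hash, 2 * original_count)
    # After the split, every key stays in its old partition or moves to its child.
    assert new in split_children(old, original_count)
```

This is why each child only needs data from its own parent: no key ever moves between unrelated partitions.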

Partition Split
Stage1: async-learn
[Figure: three replica servers host the parent replica group (one primary, two secondaries); on each server the parent copies its data to a child replica, and the client talks only to the parents]
• Parent (old replica), child (new replica)
• The child replica copies data from its parent
• The client only knows about the parent replica

Partition Split
Stage2: register
[Figure: the child replicas on the three replica servers register with the meta server; client requests are rejected (X) during registration]
• Starts when the child has copied all of the parent's data
• Reads and writes are rejected while registering

Partition Split
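Partition split succeeds
[Figure: each of the three replica servers now hosts two replicas (secondary and secondary, primary and primary, secondary and secondary), and the client can access both partitions]
• Will be released in 2.3.0
• Duplicated data is garbage-collected by compaction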

User defined compaction
Current compaction operation: Deletion
RocksDB instance (before compaction)
No=1, key="key", value="old_value"
No=2, key="key2", value="value", expired
No=3, key="key3", value="value"
No=4, key="key4", value="value", parent
No=5, key="key", value="new_value"
RocksDB instance (after compaction)
No=5, key="key", value="new_value"
• GC duplicated data
• GC expired data
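The two garbage-collection effects named above (dropping older versions of a key and dropping expired records) can be sketched as below. This is a simplified model, not RocksDB's compaction: the `Record` type, the `expire_at` field, and the `compact` function are hypothetical, and the record marked "parent" on the slide is left out because the slide does not define how it is handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    seq: int                          # the "No" on the slide
    key: str
    value: str
    expire_at: Optional[int] = None   # absolute expiry time in seconds; None = no TTL


def compact(records, now):
    # GC duplicated data: keep only the newest version of each key.
    newest = {}
    for r in records:
        if r.key not in newest or r.seq > newest[r.key].seq:
            newest[r.key] = r
    # GC expired data: drop records whose expiry time has passed.
    live = [r for r in newest.values() if r.expire_at is None or r.expire_at > now]
    return sorted(live, key=lambda r: r.seq)


records = [
    Record(1, "key", "old_value"),
    Record(2, "key2", "value", expire_at=100),   # already expired at now=200
    Record(3, "key3", "value"),
    Record(5, "key", "new_value"),
]
print([r.seq for r in compact(records, now=200)])
```

Deduplicating before checking expiry matters: if the newest version of a key has expired, its older versions must not reappear.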

Current compaction operation: Update table-level TTL
[Figure: with a table TTL of 30 days, compaction rewrites the record No=3, key="key3", value="value" in the RocksDB instance so that it carries ttl=30 days]

Compaction Operations
• Update TTL (based on current time)
• Update TTL (based on old TTL)
• Update TTL (timestamp)
• Deletion
Compaction Rules
• No TTL
• TTL range
• HashKey prefix
• HashKey postfix
• HashKey anywhere
• SortKey prefix
• SortKey postfix
• SortKey anywhere

Use case examples
Update the TTL of data with more than 6 months remaining to 2 months
• Compaction Rule = TTL Range
• Compaction Operation = Update TTL
Delete data whose HashKey has the prefix "test" and whose TTL is more than 30 days
• Compaction Rule = HashKey Prefix + TTL Range
• Compaction Operation = Deletion
• Will be released in 2.3.0
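The combination of rules and operations can be sketched as follows: a record is passed to the operation only when every rule matches it. This is an illustrative model, not the Pegasus configuration format; all names (`Record`, `hashkey_prefix`, `ttl_range`, `update_ttl_from_now`, `delete`, `run_compaction`) are hypothetical, and TTL is modelled as an absolute expiry time with 0 meaning "no TTL".

```python
from dataclasses import dataclass, replace

DAY = 86400


@dataclass(frozen=True)
class Record:
    hash_key: str
    sort_key: str
    value: str
    expire_at: int = 0   # absolute expiry time in seconds; 0 = no TTL (assumption)


def hashkey_prefix(prefix):
    return lambda r, now: r.hash_key.startswith(prefix)


def ttl_range(min_seconds, max_seconds):
    def rule(r, now):
        if r.expire_at == 0:
            return False
        return min_seconds <= r.expire_at - now <= max_seconds
    return rule


def update_ttl_from_now(ttl_seconds):
    return lambda r, now: replace(r, expire_at=now + ttl_seconds)


def delete(r, now):
    return None   # returning None drops the record


def run_compaction(records, rules, operation, now):
    out = []
    for r in records:
        if all(rule(r, now) for rule in rules):
            r = operation(r, now)
        if r is not None:
            out.append(r)
    return out


now = 1_000_000
data = [Record("test_1", "s", "v", now + 40 * DAY),
        Record("prod_1", "s", "v", now + 200 * DAY)]
# Use case 1: TTL longer than about 6 months -> 2 months.
step1 = run_compaction(data, [ttl_range(180 * DAY, float("inf"))],
                       update_ttl_from_now(60 * DAY), now)
# Use case 2: HashKey prefix "test" and TTL longer than 30 days -> delete.
step2 = run_compaction(step1, [hashkey_prefix("test"),
                               ttl_range(30 * DAY, float("inf"))], delete, now)
```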

Pegasus-Spark
Best practices
• Large offline data analysis (SQL)
• Large offline data load (Bulk Load)

Pegasus-Spark
Offline Analysis
• Convert the data into Hive (Parquet)
• Use Spark SQL for analysis
[Figure: replica servers, HDFS, RDD, schema, and Hive]

Pegasus-Spark
Convert to SST file for Bulk load
[Figure: original data on the nodes is transformed by Pegasus-Spark (Distinct, Repartition, Sort) and written to HDFS as sst files]
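The three transformation steps can be sketched in plain Python (deliberately not using the Spark API). Sorting is needed because RocksDB only accepts keys in ascending order when an sst file is written. Everything else here is an assumption for illustration: the function name `prepare_for_sst`, CRC32 as the partitioning hash, and the choice to keep the last occurrence of a duplicate key.

```python
import zlib


def prepare_for_sst(rows, partition_count):
    """rows: iterable of (hash_key, sort_key, value) string tuples."""
    # Distinct: one value per (hash_key, sort_key); the last occurrence wins.
    latest = {}
    for hash_key, sort_key, value in rows:
        latest[(hash_key, sort_key)] = value

    # Repartition: group rows by the table partition they will be ingested into.
    partitions = [[] for _ in range(partition_count)]
    for (hash_key, sort_key), value in latest.items():
        index = zlib.crc32(hash_key.encode()) % partition_count
        partitions[index].append((hash_key, sort_key, value))

    # Sort: each output file must contain keys in ascending order.
    for rows_in_partition in partitions:
        rows_in_partition.sort()
    return partitions


rows = [("u2", "b", "1"), ("u1", "a", "2"), ("u2", "a", "3"), ("u1", "a", "4")]
for index, part in enumerate(prepare_for_sst(rows, 4)):
    print(index, part)
```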

Meta Proxy
Basic introduction
• Unified access
• Primary and standby cluster management
[Figure: without MetaProxy, each client connects directly to the meta servers of Cluster A, Cluster B, and Cluster C; with MetaProxy, the clients connect to MetaProxy, which sits in front of the meta servers of all three clusters]

Meta Proxy
Switch primary and standby cluster
[Figure: MetaProxy sits in front of the meta servers of a primary cluster and a secondary cluster linked by duplication; after the switch, the two clusters exchange their primary and secondary roles]

Disk migration tool
Balance disk usage on a replica server
[Figure: before migration, Disk1 is at 70%, Disk2 at 75%, Disk3 at 85%, and Disk4 at 40%; the disk migrator loops (select disk, select replica, migrate replica) until the disks are balanced; afterwards, Disk1 is at 70%, Disk2 at 65%, Disk3 at 70%, and Disk4 at 65%]
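The select-disk, select-replica, migrate loop can be sketched as below. This is a greedy illustration, not the actual tool: `balance_disks`, the tolerance parameter, and the replica names and sizes are hypothetical, and usage is expressed in percentage points of equally sized disks.

```python
def balance_disks(disks, tolerance):
    """disks: {disk_name: {replica_name: size}}; mutated in place. Returns the moves."""
    moves = []
    while True:
        usage = {name: sum(replicas.values()) for name, replicas in disks.items()}
        # Select disk: the fullest disk is the source, the emptiest the target.
        src = max(usage, key=usage.get)
        dst = min(usage, key=usage.get)
        gap = usage[src] - usage[dst]
        if gap <= tolerance:
            return moves
        # Select replica: only a replica smaller than the gap narrows it.
        candidates = [(size, name) for name, size in disks[src].items() if size < gap]
        if not candidates:
            return moves
        size, replica = max(candidates)
        # Migrate replica.
        disks[dst][replica] = disks[src].pop(replica)
        moves.append((replica, src, dst))


# Made-up replica sizes that add up to the "before" usage on the slide.
disks = {
    "Disk1": {"r1": 35, "r2": 35},
    "Disk2": {"r3": 10, "r4": 65},
    "Disk3": {"r5": 15, "r6": 70},
    "Disk4": {"r7": 40},
}
moves = balance_disks(disks, tolerance=5)
print(moves, {name: sum(r.values()) for name, r in disks.items()})
```

With these sample sizes the loop makes two moves and ends at 70%, 65%, 70%, and 65%, the same final state as on the slide. Each move lowers the sum of squared disk usages, so the loop always terminates.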

Process
2015    Start
2016    Release 1.0.0
2017.9  Open-sourced on GitHub
2020.6  Join Apache
2020.9  Release 2.0.0
2021.9  Meetup

Tools
Start contributing with the APIs and tools
[Figure: the Pegasus core exposes an HTTP API and an RPC API, surrounded by the following tools]
• client: C++/Java/Go/Python/NodeJs/Scala
• user-cli: Pegic (Go) / C++ shell client
• monitoring: Falcon/Prometheus
• deploy tools: Minos
• admin-cli: Admin-cli (Go) / C++ shell client
• other tools: Meta Proxy (Go) …

In the future
Enhancement & Features
• Periodic bulk load
• Duplication
• Hotspot partition detection
• Read throughput throttling
• Tracing
• Admin Service
• Others…
Pegasus 2.3.0 is being released (150+ commits)
• Partition Split
• User defined compaction
• Cluster Load Balance
• One-time Backup

Community Development
How to contribute
• Look up or raise an issue and assign it to yourself
• Follow the Pegasus official WeChat account
• Join Pegasus developer WeChat group
What we plan to do
• Benchmark
• More documents and technical articles
• Online workshop
• Offline meetup

Thank You
https://pegasus.apache.org/
Apache Pegasus
https://github.com/apache/incubator-pegasus
