A presentation at the Apache Pegasus meetup in 2021 by Yuchen He.
Apache Pegasus is a horizontally scalable, strongly consistent and high-performance key-value store.
Learn more about Pegasus: https://pegasus.apache.org, https://github.com/apache/incubator-pegasus
3. Outline
• Basic Introduction
– Architecture, Data Model, Dual WAL, Performance
• New Features
– Duplication, Bulk load, Access control, Partition split, User Defined Compaction
• Surrounding Ecosystems
– Pegasus-Spark, Meta proxy, Disk Migration tools
• Community
5. Introduction
• Redis or HBase
– Non-Volatile vs Consistent
– Remote Access
• Pegasus
– C++
– Local persistent storage
– Strongly consistent
– High performance
– Horizontally scalable
6. Architecture
Meta server
• Cluster controller
• Configuration manager
Replica server
• Data node
• Hash partitioning
• PacificA (strongly consistent)
• RocksDB instance for each replica
Zookeeper
• Meta server election
• Metadata storage
ClientLib
• Caches the data routing table
• Accesses replica servers directly
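The client-side routing described above can be sketched as follows. This is a minimal illustration, not the ClientLib implementation: `zlib.crc32` is a stand-in for whatever hash Pegasus actually uses, and the `RoutingTable` class and the server addresses are hypothetical.

```python
import zlib


class RoutingTable:
    """Hypothetical client-side cache: partition index -> primary replica server."""

    def __init__(self, primaries):
        # One primary address per partition, as it might be fetched from the meta server.
        self.primaries = list(primaries)

    @property
    def partition_count(self):
        return len(self.primaries)

    def partition_of(self, hash_key: bytes) -> int:
        # Hash partitioning: the hash of the hash key selects the partition.
        # zlib.crc32 is an assumption made for this sketch.
        return zlib.crc32(hash_key) % self.partition_count

    def primary_for(self, hash_key: bytes) -> str:
        # The client talks to the replica server directly, without asking the meta server.
        return self.primaries[self.partition_of(hash_key)]


# Hypothetical addresses; four partitions spread over three replica servers.
table = RoutingTable(["rs1:34801", "rs2:34801", "rs3:34801", "rs1:34801"])
idx = table.partition_of(b"user_42")
print(idx, table.primary_for(b"user_42"))
```

Because the table is cached, the meta server stays off the read/write path; the client would only need to refresh the cache when the routing changes.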
9. Dual WAL
[Figure: a client writes to Replica1, Replica2 and Replica3. Each replica keeps its own data and private log on the data disk; the shared log is kept on a separate log disk.]
• Separate WAL and data, sync-write shared log, async-write private log
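A minimal in-memory sketch of the two write paths, assuming that a write is acknowledged as soon as the synchronous shared-log append completes and that private-log appends are applied later in the background. The class and method names are hypothetical, and real logs live on disk rather than in Python lists.

```python
from collections import defaultdict, deque


class ReplicaServerLogs:
    """Hypothetical model of one replica server's write-ahead logs."""

    def __init__(self):
        self.shared_log = []                   # one per server, on the log disk
        self.private_logs = defaultdict(list)  # one per replica, on the data disk
        self._pending = deque()                # mutations not yet in a private log

    def write(self, replica_id, mutation):
        # Sync path: append to the shared log before acknowledging (assumption).
        self.shared_log.append((replica_id, mutation))
        # Async path: the private-log append is only queued here.
        self._pending.append((replica_id, mutation))
        return "ack"

    def flush_private_logs(self):
        # Background step: drain the queue into the per-replica private logs.
        while self._pending:
            replica_id, mutation = self._pending.popleft()
            self.private_logs[replica_id].append(mutation)


logs = ReplicaServerLogs()
logs.write("Replica1", "set k1=v1")
logs.write("Replica2", "set k2=v2")
unflushed = dict(logs.private_logs)  # still empty: private logs lag behind
logs.flush_private_logs()
```

The shared log receives every replica's mutations as one sequential stream, while each private log holds only its own replica's mutations.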
15. Duplication
Future enhancements
• Master-master duplication in practice
• Duplication across more than two regions in practice
• Facilities for supporting a remote disaster-tolerant system
– Automatic master/slave switching
– Better user experience
• Extension:
– Supporting CDC on demand
– e.g. ES, MQ…
16. Bulk Load
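Quickly import large amounts of data offline
[Figure: original data is converted into sst files held by a file provider; replica servers download the files and ingest them into the table.]
Step 1: Generate files
Step 2: Download files
Step 3: Ingest files
• Client reads and writes continue during bulk load; writes are rejected during ingestion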
19. Partition Split
Stage1: async-learn
[Figure: three replica servers host the primary and the two secondaries. On each server a child replica copies data from its parent; the client is shown talking to the parent only.]
• parent (old replica), child (new replica)
• The child replica copies data from its parent
• The client only knows about the parent replica
20. Partition Split
Stage2: register
[Figure: the child primary registers the child partition with the meta server; the three replica servers each host a child replica (secondary, primary, secondary), and client requests are blocked (X) during registration.]
• Registration starts once the child has copied all of the parent's data
• Reads and writes are rejected while registering
21. Partition Split
Partition split succeeded
[Figure: each of the three replica servers now hosts two replicas in the same role (secondary/secondary, primary/primary, secondary/secondary), shown together with the client.]
• Will be released in 2.3.0
• Duplicated data is garbage-collected by compaction
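A small sketch of why the copied data becomes duplicated and how it can be filtered afterwards. It assumes that a split doubles the partition count and that the child of partition `i` gets index `i + old_count`; neither detail is stated on the slides. `zlib.crc32` again stands in for the real hash.

```python
import zlib


def partition_index(hash_key: bytes, partition_count: int) -> int:
    # Stand-in hash partitioning; the real hash function is not shown in the slides.
    return zlib.crc32(hash_key) % partition_count


old_count = 4
new_count = old_count * 2    # assumption: a split doubles the partition count
parent = 1
child = parent + old_count   # assumption: child index = parent index + old count

keys = [f"key{i}".encode() for i in range(200)]
parent_data = [k for k in keys if partition_index(k, old_count) == parent]

# Stage 1: the child copies everything the parent holds.
child_data = list(parent_data)

# After the split, compaction drops the keys that no longer hash to each replica.
parent_kept = [k for k in parent_data if partition_index(k, new_count) == parent]
child_kept = [k for k in child_data if partition_index(k, new_count) == child]
```

Under these assumptions every key of the old partition ends up in exactly one of the two new partitions, so the copies left behind in the other one are safe to remove.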
22. User defined compaction
Current compaction operation: Deletion
RocksDB instance before compaction:
No=5, key=“key”, value=“new_value”
No=4, key=“key4”, value=“value”, parent
No=3, key=“key3”, value=“value”
No=2, key=“key2”, value=“value”, expired
No=1, key=“key”, value=“old_value”
RocksDB instance after compaction:
No=5, key=“key”, value=“new_value”
No=3, key=“key3”, value=“value”
• GC duplicated data
• GC expired data
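The deletion behaviour above can be reproduced with a short sketch. It reads the "parent" marker as data that belongs to another partition after a split, which is an interpretation rather than something the slide states; the `Record` type and its `owned` flag are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    no: int               # sequence number; a larger value is newer
    key: str
    value: str
    expired: bool = False
    owned: bool = True    # False for data marked "parent" (assumed: not owned by this partition)


def compact(records):
    latest = {}
    for record in sorted(records, key=lambda r: r.no):
        latest[record.key] = record  # a newer version overwrites an older one
    survivors = [r for r in latest.values() if not r.expired and r.owned]
    return sorted(survivors, key=lambda r: r.no)


records = [
    Record(3, "key3", "value"),
    Record(2, "key2", "value", expired=True),
    Record(1, "key", "old_value"),
    Record(4, "key4", "value", owned=False),
    Record(5, "key", "new_value"),
]
result = compact(records)
print([r.no for r in result])  # [3, 5]
```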
23. User defined compaction
Current compaction operation: Update table-level TTL
Table TTL: 30 days
RocksDB instance before compaction:
No=3, key=“key3”, value=“value”
No=2, key=“key2”, value=“value”
No=1, key=“key1”, value=“value”
RocksDB instance after compaction:
No=3, key=“key3”, value=“value”, ttl=30 days
No=2, key=“key2”, value=“value”, ttl=30 days
No=1, key=“key1”, value=“value”, ttl=30 days
24. User defined compaction
Compaction Operations
• Update TTL (based on current time)
• Update TTL (based on old TTL)
• Update TTL (timestamp)
• Deletion
Compaction Rules
• No TTL
• TTL range
• HashKey prefix
• HashKey postfix
• HashKey anywhere
• SortKey prefix
• SortKey postfix
• SortKey anywhere
25. User defined compaction
Use case examples
• Change the TTL of data whose TTL is more than 6 months to 2 months
– Compaction Rule = TTL Range
– Compaction Operation = Update TTL
• Delete data whose HashKey has the prefix "test" and whose TTL is more than 30 days
– Compaction Rule = HashKey Prefix + TTL Range
– Compaction Operation = Deletion
• Will be released in 2.3.0
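To make the rule/operation pairing concrete, here is a minimal sketch of the two use cases. It is not the Pegasus configuration format or API: the `Entry` type, the helper names, the "first matching policy wins" behaviour, and treating 6 months as 180 days are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace

DAY = 24 * 3600


@dataclass(frozen=True)
class Entry:
    hash_key: str
    sort_key: str
    ttl_seconds: int  # remaining time to live; 0 means no TTL


def hashkey_prefix(prefix):
    return lambda e: e.hash_key.startswith(prefix)


def ttl_range(start, stop=float("inf")):
    # Matches entries with start < TTL <= stop.
    return lambda e: start < e.ttl_seconds <= stop


def delete(entry):
    return None


def update_ttl(new_ttl):
    return lambda e: replace(e, ttl_seconds=new_ttl)


def run_compaction(entries, policies):
    out = []
    for entry in entries:
        for rules, operation in policies:
            if all(rule(entry) for rule in rules):  # every rule of a policy must match
                entry = operation(entry)
                break  # assumption: the first matching policy wins
        if entry is not None:
            out.append(entry)
    return out


policies = [
    ([ttl_range(180 * DAY)], update_ttl(60 * DAY)),           # TTL > 6 months -> 2 months
    ([hashkey_prefix("test"), ttl_range(30 * DAY)], delete),  # "test" prefix and TTL > 30 days
]
entries = [
    Entry("test_1", "a", 40 * DAY),
    Entry("test_2", "a", 10 * DAY),
    Entry("user_1", "a", 200 * DAY),
    Entry("user_2", "a", 0),
]
result = run_compaction(entries, policies)
```

Here `test_1` is deleted, `user_1` has its TTL shortened to 60 days, and the other two entries pass through unchanged.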
29. Pegasus-Spark
Convert to SST file for Bulk load
[Figure: compute nodes read the original data, Pegasus-Spark transforms it (Distinct, Repartition, Sort), and the resulting sst files are written to HDFS.]
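The three transform steps can be sketched in plain Python. This shows the idea only and does not use the Pegasus-Spark or Spark APIs; the function name, the `crc32` partitioner and the "last value wins" deduplication are assumptions. The sort step matters because an sst file stores its keys in sorted order.

```python
import zlib
from collections import defaultdict


def to_sorted_partitions(records, partition_count):
    """records: iterable of (hash_key, sort_key, value) tuples."""
    # Distinct: keep one value per (hash_key, sort_key); the last one seen wins here.
    distinct = {}
    for hash_key, sort_key, value in records:
        distinct[(hash_key, sort_key)] = value

    # Repartition: group rows by the table partition they would belong to.
    partitions = defaultdict(list)
    for (hash_key, sort_key), value in distinct.items():
        index = zlib.crc32(hash_key.encode()) % partition_count  # stand-in hash
        partitions[index].append((hash_key, sort_key, value))

    # Sort: order the rows of each partition by key before they are written out.
    return {index: sorted(rows) for index, rows in partitions.items()}


records = [
    ("user_2", "b", "v1"),
    ("user_1", "a", "v1"),
    ("user_2", "a", "v1"),
    ("user_1", "a", "v2"),  # duplicate key, replaces the earlier value
]
partitions = to_sorted_partitions(records, partition_count=4)
```

Each resulting list corresponds to the input for one sst file, which the bulk load step would then download and ingest.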
30. Meta Proxy
Basic introduction
• access unification
• primary and standby cluster manager
[Figure: without MetaProxy, clients connect to the meta servers of Cluster A, Cluster B and Cluster C separately; with MetaProxy, clients connect to MetaProxy, which sits in front of the meta servers of all three clusters.]
31. Meta Proxy
Switch primary and standby cluster
[Figure: clients reach the primary cluster's meta servers through MetaProxy, with duplication running between the primary and secondary clusters; after the switch the two clusters exchange roles and MetaProxy routes clients to the new primary.]
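A toy model of the switch, assuming MetaProxy simply remembers which cluster is currently primary and hands its meta server list to clients. The class, cluster names and addresses are hypothetical and do not reflect the real Meta Proxy interface.

```python
class MetaProxy:
    """Hypothetical model: one access point in front of a primary and a standby cluster."""

    def __init__(self, clusters, primary, standby):
        self.clusters = clusters  # cluster name -> list of meta server addresses
        self.primary = primary
        self.standby = standby

    def resolve(self):
        # Clients always ask the proxy, so they never hard-code a cluster's meta servers.
        return self.clusters[self.primary]

    def switch(self):
        # Swap the roles; clients pick up the new primary on their next resolve().
        self.primary, self.standby = self.standby, self.primary


proxy = MetaProxy(
    {
        "cluster-a": ["meta-a1:34601", "meta-a2:34601"],
        "cluster-b": ["meta-b1:34601", "meta-b2:34601"],
    },
    primary="cluster-a",
    standby="cluster-b",
)
before = proxy.resolve()
proxy.switch()
after = proxy.resolve()
```

Since duplication keeps the standby cluster's data in step with the primary, the switch only needs to change where clients are routed.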
32. Disk migration tool
balance disk usage on replica server
[Figure: the disk migrator repeats three steps (select disk, select replica, migrate replica) in a loop until the disks are balanced.]
Disk usage on the replica server, before → after:
• Disk1: 70% → 70%
• Disk2: 75% → 65%
• Disk3: 85% → 70%
• Disk4: 40% → 65%
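The select/migrate loop can be sketched as a greedy procedure. The selection policy below (move the largest replica that fits in half the usage gap between the fullest and emptiest disk) and the 10-point threshold are assumptions, not necessarily what the tool does, and the replica sizes are invented so that the totals match the percentages on the slide.

```python
def balance(disks, threshold=10):
    """disks: disk name -> list of replica sizes (percent of disk capacity). Mutated in place."""
    moves = []
    while True:
        usage = {name: sum(replicas) for name, replicas in disks.items()}
        # Select disks: the fullest is the source, the emptiest is the target.
        src = max(usage, key=usage.get)
        dst = min(usage, key=usage.get)
        gap = usage[src] - usage[dst]
        if gap <= threshold:
            break  # balanced enough
        # Select replica: the largest one that does not overshoot the midpoint.
        candidates = [size for size in disks[src] if 0 < size <= gap / 2]
        if not candidates:
            break  # no move would help
        replica = max(candidates)
        # Migrate replica.
        disks[src].remove(replica)
        disks[dst].append(replica)
        moves.append((src, dst, replica))
    return moves


disks = {
    "Disk1": [40, 30],      # 70%
    "Disk2": [40, 25, 10],  # 75%
    "Disk3": [40, 30, 15],  # 85%
    "Disk4": [40],          # 40%
}
moves = balance(disks)
final_usage = {name: sum(replicas) for name, replicas in disks.items()}
print(final_usage)  # {'Disk1': 70, 'Disk2': 65, 'Disk3': 70, 'Disk4': 65}
```

Every move takes a replica no larger than half the gap, so the source never ends up emptier than the target and the loop always terminates.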
35. Tools
Start contributing with the API and tools
Pegasus core
• API: HTTP API, RPC API
• client: C++/Java/Go/Python/NodeJs/Scala
• user-cli: Pegic(Go)/C++ shell client
• admin-cli: Admin-cli(Go)/C++ shell client
• monitoring: Falcon/Prometheus
• deploy tools: Minos
• other tools …: Meta Proxy(Go)
36. In the future
Enhancement & Features
• Periodic bulk load
• Duplication
• Hotspot partition detection
• Read throughput throttling
• Tracing
• Admin Service
• Others…
Pegasus 2.3.0 is being released (150+ commits)
• Partition Split
• User defined compaction
• Cluster Load Balance
• One-time Backup
37. Community Development
How to contribute
• Look up or raise an issue and assign it to yourself
• Follow the Pegasus official WeChat account
• Join Pegasus developer WeChat group
What we plan to do
• Benchmark
• More documents and technical articles
• Online workshop
• Offline meetup