A presentation at ApacheCon Asia 2022 by Dan Wang and Yingchun Lai.
Apache Pegasus is a horizontally scalable, strongly consistent and high-performance key-value store.
Learn more about Pegasus: https://pegasus.apache.org, https://github.com/apache/incubator-pegasus
1. How does Apache Pegasus (incubating) community develop at SensorsData
Dan Wang & Yingchun Lai
2022.07.29
2. Outline
• Overview of Apache Pegasus
• Architecture, Data Model, User Interface, Performance, Important Features
• Why did SensorsData choose Apache Pegasus?
• Evolution and Current Situation
• Contributions to Apache Pegasus by SensorsData
• Features, Improvements, Bugfixes
• What's going on in the Pegasus community?
• Development, New Release and Activities
3. Overview of Apache Pegasus
Architecture, Data Model, User Interface, Performance, Important Features
4. What is Pegasus?
Apache Pegasus is a horizontally scalable, strongly consistent and
high-performance key-value store
• C++ implemented
• Local persistent storage engine by RocksDB
• Strongly consistent by PacificA
• High performance
• Horizontally scalable
• Flexible data model
• Easy to use ecosystem tools
5. Architecture
MetaServer
• Cluster controller
• Configuration manager
• Doesn't store data itself
ReplicaServer
• Data node
• Hash partition
• PacificA (strongly consistent)
• One RocksDB instance for each replica
ZooKeeper
• Meta server election
• Metadata storage
ClientLib
• Requests the routing table from the MetaServer once
• Caches the routing table
• Interacts directly with ReplicaServers for read/write requests
6. Data Model
SortKey
• Extends the user's usage scenarios
• Sorted within a given HashKey
HashKey
• Determines which partition the record belongs to
• hash(HashKey) % kPartitionCount → partition_id
Value
• User's data
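The routing rule above (hash(HashKey) % kPartitionCount → partition_id) can be sketched as follows; `partition_of` is a hypothetical helper, and `std::hash` stands in for Pegasus's own 64-bit hash function:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Sketch of the routing rule:
//   partition_id = hash(HashKey) % kPartitionCount
// std::hash is an illustrative stand-in for Pegasus's real hash.
uint64_t partition_of(const std::string& hash_key, uint64_t partition_count) {
    uint64_t h = std::hash<std::string>{}(hash_key);
    return h % partition_count;
}
```

Since only the HashKey is hashed, all SortKeys under the same HashKey land in the same partition, which is what makes atomic multi-SortKey access within one HashKey possible.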
7. User Interface
• Supported languages: Java, C++, Go, Python, Node.js, Scala
• Multiple SortKeys under one HashKey can be atomically accessed
8. How to adapt to RocksDB
For one table in Pegasus
• The whole key space is hash-split into N partitions
• Each partition has 3 replicas typically
• Distribute all these (3*N) replicas to M Replica Servers
• Load balance between Replica Servers in cluster
• Both for replicas and primary replicas
• Both consider replica count and disk space
• Load balance between data directories on a Replica Server
• Same strategy as between servers
• Each replica corresponds to a RocksDB instance
• How does Pegasus key-value map to RocksDB key-value?
9. How to adapt to RocksDB
RocksDB Key
• Length of HashKey: 2 bytes, for encoding and decoding key
• HashKey: variable length, defined by user
• SortKey: variable length, defined by user
RocksDB Value
• Value Header: 13 bytes
• Flag bit: 1 bit, always 1
• Data version: 7 bits
• Expire timestamp: 4 bytes, in seconds, since epoch
• Time tag: 8 bytes, designed for duplication
• Timestamp: 56 bits, in microseconds
• Cluster ID: 7 bits
• Deleted tag: 1 bit
• Value: variable length, defined by user
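The key layout above can be sketched as an encode/decode pair. The function names are hypothetical (not Pegasus's actual code), and writing the 2-byte HashKey length prefix big-endian is an assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sketch of the RocksDB key layout:
//   [2-byte HashKey length][HashKey][SortKey]
// Names are illustrative; big-endian length prefix is assumed.
std::string encode_rocksdb_key(const std::string& hash_key,
                               const std::string& sort_key) {
    assert(hash_key.size() <= UINT16_MAX);
    uint16_t len = static_cast<uint16_t>(hash_key.size());
    std::string key;
    key.push_back(static_cast<char>(len >> 8));   // high byte
    key.push_back(static_cast<char>(len & 0xFF)); // low byte
    key += hash_key;
    key += sort_key;
    return key;
}

// Decoding reverses the layout using the stored length prefix.
void decode_rocksdb_key(const std::string& key,
                        std::string* hash_key, std::string* sort_key) {
    uint16_t len = (static_cast<uint8_t>(key[0]) << 8) |
                   static_cast<uint8_t>(key[1]);
    *hash_key = key.substr(2, len);
    *sort_key = key.substr(2 + len);
}
```

Because the HashKey comes before the SortKey in the encoded key, all SortKeys under one HashKey are stored contiguously and sorted inside the RocksDB instance.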
10. Performance
YCSB on Pegasus 2.3.0 (the latest release)
• CPU: 2.4 GHz, 24 cores
• Memory: 128 GB
• Disk: 8 × 480 GB SSD
• Network card: 10 Gb/s
• 5 Replica Servers
• 64 partitions on test table
11. Important Features
Cold Backup
• Create checkpoint for a table
• Store data remotely on HDFS
• Restore table to the original or another cluster
Duplication
• Asynchronous duplication
• To achieve high write throughput
• To tolerate high latency
• The two clusters can be deployed in different regions
• Supports pipeline duplication, multi-master duplication, and master-master duplication
12. Important Features
Bulk Load
• Generate SST files from the user's original data
• via Pegasus-Spark, following the Pegasus format
• Store the generated SST files on HDFS
• Download SST files to Pegasus ReplicaServers
• Ingest SST files into RocksDB
• Reject client writes while ingesting
• Serve reads & writes otherwise
Partition Split
• Split one replica into two replicas
• Copy the checkpoint, then duplicate the WAL
• Register the new replicas on the Meta Server when they are ready
• Reject client R/W requests while registering
• Serve client R/W requests afterwards
• GC data that no longer belongs to the new partition in the background
13. Important Features
Backup Request
• Only for read requests
• Usage scenarios:
• Load imbalance
• Network problem
• Single point of failure
Hotkey Detection
• Detect badly designed user keys
• Resolve single points of failure caused by hotkeys
14. Important Features
• Access control
• Authentication: Kerberos
• Authorization: table-level ACL
• Usage scenario option templates
• Set RocksDB options at the table level
• Manual compaction
• Fast GC, fast sort
• Integration with BigData ecosystem
• HDFS: cold backup, bulk load
• Spark: bulk load, analysis on Hive
• MetaProxy
• Access unification
17. KV Evolution in SensorsData
• Distributed Redis (2016)
• Redis sentinel → Redis cluster
• Pros
• Scale out
• Cons
• Frequent OOM (several hundred million keys)
18. KV Evolution in SensorsData
• SSDB (2017)
• master-slave
• Pros
• Reduce memory consumption
• Compatible with Redis, thus easy to migrate to
• Persistence
• Cons
• Cannot scale out
• Cannot handle more data and more businesses (I/O utilization is nearly 100%)
19. Introduce Apache Pegasus (2020)
• Scale out
• High Availability
• Strong consistency
• Persistence
• High performance
• Stability
• Tools for monitoring and operations
• Support mget & mset
• Documents & community
• Cost for migration
20. Apache Pegasus in SensorsData
• Pegasus has been deployed on over 1300 clusters up to
now
• About 20 products have chosen Pegasus to store their
business data
22. Characteristics of Product Environment
• Operating private clusters is difficult
• A large number of clusters
• Some clusters have to be operated on site
• Some clusters are very small
• Even single node
• The hardware configuration is not good
• Small memory
• HDD
• Multiple services are deployed on one node
• Have to limit resource usage, such as memory
23. New Functions
• Support single replica
• Connect Zookeeper secured with Kerberos
• Change the replication factor of each table
• Implement new system of metrics
Improve Memory Usage
• Limit RocksDB memory usage
• Support jemalloc
Refactor
• Merge sub-projects
Optimize Performance
• Support batchGetByPartitions to improve batch get
• Use multi_set to speed up copy_data
Compatibility
• Support building on macOS
• Support building and running on the AArch64 architecture
Bugfixes
• Fix the risk of replica metadata loss on XFS after a power outage
• Fix the message body size being left unset after parsing, which led to excessive I/O throughput
24. Change the Replication Factor
• Motivation
• Scale out, e.g., 1 → 3 or 2 → 3
• Migration
• Increase partitions offline
• Process
• Check new replication factor
• Update metadata asynchronously
• Missing/redundant replicas are added/dropped, typically within several seconds
• Clearing redundant data can be triggered by switching the meta server to "lively" level
25. New System of Metrics
Perf-counter
• Verbose naming
• Overlapping metric types
• Unreasonable abstract interfaces
• Memory leaks caused by outdated metrics
• Potential performance problems
New metrics
• Use labels to simplify naming
• Redefine metric types
• Clear outdated metrics after a configurable period of time
• Improve performance
26. Framework
• Gauge: set/get, increment/decrement
• Counter: increment monotonically
• Percentile: P90/P95/P99/..., for a fixed window size
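The first two metric types can be sketched minimally as below; this is a hypothetical, single-file illustration, while the real framework also handles naming, labels, and retirement of outdated metrics:

```cpp
#include <atomic>
#include <cstdint>

// Gauge: a value that can be set, or adjusted in either direction.
class Gauge {
public:
    void set(int64_t v) { value_.store(v, std::memory_order_relaxed); }
    void increment() { value_.fetch_add(1, std::memory_order_relaxed); }
    void decrement() { value_.fetch_sub(1, std::memory_order_relaxed); }
    int64_t get() const { return value_.load(std::memory_order_relaxed); }
private:
    std::atomic<int64_t> value_{0};
};

// Counter: only ever increments (monotonic).
class Counter {
public:
    void increment(int64_t delta = 1) {
        value_.fetch_add(delta, std::memory_order_relaxed);
    }
    int64_t get() const { return value_.load(std::memory_order_relaxed); }
private:
    std::atomic<int64_t> value_{0};
};
```

The single-atomic Counter shown here is the naive form; the next slide notes that the new production counter uses a long-adder design, which stripes the value across cells to cut cross-thread contention.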
27. Performance
[Chart] Latency of counters (seconds, 1 billion operations per thread) at 2/4/8/16 threads: old counter vs. new counter.
[Chart] Latency of percentiles (seconds, window size 5000) at 10,000/50,000/100,000 operations: old percentile vs. new percentile.
The new counter is based on a long adder.
The new percentile is based on nth_element instead of median-of-medians selection.
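The nth_element approach can be sketched as follows; this is a hypothetical illustration of selecting a percentile from a fixed window in average O(n) time without fully sorting it:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compute the p-th percentile (p in [0, 1]) of a window of samples.
// std::nth_element partially orders the window so that the element
// at `idx` is the one that would be there after a full sort.
double percentile(std::vector<double> window, double p) {
    if (window.empty()) return 0.0;
    size_t idx = static_cast<size_t>(p * (window.size() - 1));
    std::nth_element(window.begin(), window.begin() + idx, window.end());
    return window[idx];
}
```

Taking the window by value keeps the caller's sample buffer intact, since nth_element reorders its input.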
32. jemalloc vs. tcmalloc
• Both memtables and index & filter blocks are capped by block cache
• rocksdb_block_cache_capacity=12GB
• rocksdb_total_size_across_write_buffer=8GB
37. What's going on in the Pegasus community?
Development, New Release and Activities
38. Development
• New metrics framework
• Higher performance and easier to use
• Enhance backup & restore
• Enhance duplication
• Enhance authorization
• Easy to use admin tools
• Use Go tools to replace C++ tools
• Refactor
• Support more CPU architectures
• x86, ARMs, Apple Silicon
• Support more operating systems
• Linux: RHEL/CentOS(6, 7, 8, 9), Ubuntu(16.04, 18.04, 20.04, 22.04)
• macOS: 12.4
• Website & Documents
39. New Release
• Pegasus 2.4.0
• Performance improvement
• Refactor dual-WAL to single WAL
• New features
• Change a table's replication factor
• Read request limiter
• Enhancement
• Bulk load
• Duplication
• Manual compaction
• API
• Add batchGetByPartitions()
• Tools
• admin-cli supports more operations
40. Activities
• The 1st meetup was held in September 2021
• Planning to hold the 2nd meetup this autumn
• Small online meetings are held on an ad-hoc basis