How does Apache Pegasus (incubating)
community develop at SensorsData
Dan Wang & Yingchun Lai
2022.07.29
Outline
• Overview of Apache Pegasus
• Architecture, Data Model, User Interface, Performance, Important Features
• Why SensorsData chose Apache Pegasus?
• Evolution and Current Situation
• Contributions to Apache Pegasus by SensorsData
• Features, Improvements, Bugfixes
• What's going on in the Pegasus community?
• Development, New Release and Activities
Overview of Apache Pegasus
Architecture, Data Model, User Interface, Performance, Important Features
What is Pegasus?
Apache Pegasus is a horizontally scalable, strongly consistent and
high-performance key-value store
• Implemented in C++
• Local persistent storage engine based on RocksDB
• Strong consistency via the PacificA protocol
• High performance
• Horizontally scalable
• Flexible data model
• Easy-to-use ecosystem tools
Architecture
MetaServer
• Cluster controller
• Configuration manager
• Doesn't store data on itself
ReplicaServer
• Data node
• Hash partition
• PacificA (strongly consistent)
• One RocksDB instance for each replica
ZooKeeper
• Meta server election
• Metadata storage
ClientLib
• Requests the routing table from the MetaServer once
• Caches the routing table
• Interacts directly with ReplicaServers for R/W requests
Data Model
SortKey
• Extends the user's usage scenarios
• Sorted within a given HashKey
HashKey
• Determines which partition a key belongs to
• hash(HashKey) % kPartitionCount → partition_id
Value
• User's data
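The HashKey routing rule above can be sketched in a few lines of Python. This is an illustrative sketch only: Pegasus uses its own hash function rather than CRC32, and the partition count here is a made-up example value.

```python
# Illustrative sketch of Pegasus-style partition routing.
# Assumptions: CRC32 stands in for Pegasus's real hash function,
# and K_PARTITION_COUNT is an example per-table setting.
import zlib

K_PARTITION_COUNT = 8  # example per-table partition count

def partition_id(hash_key: bytes, partition_count: int = K_PARTITION_COUNT) -> int:
    """hash(HashKey) % kPartitionCount -> partition_id"""
    return zlib.crc32(hash_key) % partition_count

# All SortKeys under the same HashKey land in the same partition,
# because routing depends only on the HashKey:
pid = partition_id(b"user_42")
assert all(partition_id(b"user_42") == pid for _ in range(3))
```

Because only the HashKey is hashed, a multi-SortKey read or write under one HashKey always touches a single partition, which is what makes the atomic multi-SortKey operations mentioned later possible.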
User Interface
• Supported languages: Java, C++, Go, Python, Node.js, Scala
• Multiple SortKeys under one HashKey can be atomically accessed
How to adapt to RocksDB
For one table in Pegasus
• The whole key space is hash-partitioned into N partitions
• Each partition typically has 3 replicas
• All these (3*N) replicas are distributed across M Replica Servers
• Load balancing across Replica Servers in the cluster
• Both for all replicas and for primary replicas
• Considering both replica count and disk space
• Load balancing across data directories on a Replica Server
• Same considerations as above
• Each replica corresponds to a RocksDB instance
• How does Pegasus key-value map to RocksDB key-value?
How to adapt to RocksDB
RocksDB Key
• Length of HashKey: 2 bytes, for encoding and decoding key
• HashKey: variable length, defined by user
• SortKey: variable length, defined by user
RocksDB Value
• Value Header: 13 bytes
• Flag bit: 1 bit, always set to 1
• Data version: 7 bits
• Expire timestamp: 4 bytes, in seconds, since epoch
• Time tag: 8 bytes, designed for duplication
• Timestamp: 56 bits, in microseconds
• Cluster ID: 7 bits
• Deleted tag: 1 bit
• Value: variable length, defined by user
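The key/value layout above can be sketched with simple byte packing. The exact byte order and the bit placement inside the time tag are assumptions made for illustration; consult the Pegasus source for the authoritative encoding.

```python
# Hedged sketch of the Pegasus-to-RocksDB key/value layout described above.
# Assumptions: big-endian fields, and the time tag packs the 56-bit
# timestamp in the high bits, then 7-bit cluster ID, then 1-bit deleted.
import struct

def encode_key(hash_key: bytes, sort_key: bytes) -> bytes:
    # 2-byte HashKey length, then HashKey, then SortKey
    return struct.pack(">H", len(hash_key)) + hash_key + sort_key

def encode_value_header(version: int, expire_ts: int,
                        ts_us: int, cluster_id: int, deleted: bool) -> bytes:
    # 1 byte: flag bit (always 1) in the high bit + 7-bit data version
    first = 0x80 | (version & 0x7F)
    # 8-byte time tag: 56-bit timestamp | 7-bit cluster ID | 1-bit deleted
    time_tag = ((ts_us & ((1 << 56) - 1)) << 8) | ((cluster_id & 0x7F) << 1) | int(deleted)
    # 13 bytes total: 1 (flag+version) + 4 (expire, seconds) + 8 (time tag)
    return struct.pack(">B", first) + struct.pack(">I", expire_ts) + struct.pack(">Q", time_tag)

key = encode_key(b"user_42", b"login_ts")
header = encode_value_header(1, 0, 1_659_081_600_000_000, 5, False)
assert len(key) == 2 + 7 + 8 and len(header) == 13
```

The 2-byte length prefix is what lets a reader split a stored RocksDB key back into HashKey and SortKey, and it also gives all keys sharing a HashKey a common prefix, so SortKeys sort contiguously within it.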
Performance
YCSB on Pegasus 2.3.0 (the latest release)
• CPU: 2.4 GHz, 24 cores
• Memory: 128 GB
• Disk: 8 × 480 GB SSD
• Network card: 10 Gb/s
• 5 Replica Servers
• 64 partitions on test table
Important Features
Cold Backup
• Create checkpoint for a table
• Store data remotely on HDFS
• Restore table to the original or another cluster
Duplication
• Asynchronous duplication
• To achieve high write throughput
• To tolerate high latency
• The two clusters can be deployed in different regions
• Supports pipeline duplication, multi-master duplication, and master-master duplication
Important Features
Bulk Load
• Generate SST files from user's original data
• via Pegasus-Spark, in Pegasus rule
• Store generated SST files to HDFS
• Download SST files to Pegasus ReplicaServers
• Ingest SST files into RocksDB
• Reject client writes while ingesting
• Serve reads & writes otherwise
Partition Split
• Divide one replica into two replicas
• Copy checkpoint and then duplicate WAL
• Register on Meta Server when new replicas are ready
• Reject client R/W requests while registering
• Resume serving client R/W requests afterward
• GC redundant data that doesn't belong to the new partition
in the background
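A small sketch of why modulo-based partitioning composes nicely with splitting: when the partition count doubles from N to 2N, every key either stays in its old partition or moves to old_pid + N, so each parent partition splits into exactly two children. (Illustrative Python; the function names are made up.)

```python
# With modulo partitioning, h % 2N is always either h % N or h % N + N,
# so doubling the partition count splits each parent into two children
# and never shuffles keys across unrelated partitions.
def old_pid(h: int, n: int) -> int:
    return h % n

def new_pid(h: int, n: int) -> int:
    return h % (2 * n)

N = 4  # example: split a 4-partition table into 8 partitions
assert all(new_pid(h, N) in (old_pid(h, N), old_pid(h, N) + N)
           for h in range(1000))
```

This is also why the background GC step above is needed: after a split, each child still holds the parent's full checkpoint, and the roughly half of the data that now hashes to the sibling partition is redundant.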
Important Features
Backup Request
• Only for read requests
• Usage scenarios:
• Load imbalance
• Network problem
• Single point of failure
Hotkey Detection
• Detect badly designed user keys
• Resolve single points of failure caused by hotkeys
Important Features
• Access control
• Authentication: Kerberos
• Authorization: table-level ACL
• Usage scenario option templates
• Set RocksDB options at the table level
• Manual compaction
• Fast GC, fast sort
• Integration with BigData ecosystem
• HDFS: cold backup, bulk load
• Spark: bulk load, analysis on Hive
• MetaProxy
• Access unification
Why SensorsData chose Apache Pegasus?
Evolution and Current Situation
KV Evolution in SensorsData
• Standalone Redis (initially)
• Pros
• Mature
• High performance
• Cons
• Single-point deployment
• Considerable memory consumption
• Volatile (data is not persisted)
KV Evolution in SensorsData
• Distributed Redis (2016)
• Redis sentinel → Redis cluster
• Pros
• Scale out
• Cons
• Frequent OOM (several hundred million keys)
KV Evolution in SensorsData
• SSDB (2017)
• master-slave
• Pros
• Reduce memory consumption
• Compatible with Redis, thus easy for migration
• Persistence
• Cons
• Cannot scale out
• Cannot keep up with growing data volume and more businesses (I/O utilization is nearly 100%)
Introduce Apache Pegasus (2020)
• Scale out
• High Availability
• Strong consistency
• Persistence
• High performance
• Stability
• Tools for monitoring and operations
• Support mget & mset
• Documents & community
• Cost for migration
Apache Pegasus in SensorsData
• Pegasus has been deployed in over 1,300 clusters to date
• About 20 products have chosen Pegasus to store their business data
Contributions to Apache Pegasus by SensorsData
Features, Improvements, Bugfixes
Characteristics of Product Environment
• Operating private clusters is difficult
• A large number of clusters
• Some clusters have to be operated on site
• Some clusters are very small
• Even single node
• Hardware configurations are modest
• Small memory
• HDD
• Multiple services are deployed on one node
• Have to limit resource usage, such as memory
New Functions
• Support single replica
• Connect Zookeeper secured with Kerberos
• Change the replication factor of each table
• Implement new system of metrics
Improve Memory Usage
• Limit RocksDB memory usage
• Support jemalloc
Refactor
• Merge sub-projects
Contributions
Optimize Performance
• Support batchGetByPartitions to improve batch gets
• Use multi_set to speed up copy_data
Compatibility
• Support building on macOS
• Support building and running on the AArch64 architecture
Bugfixes
• Fix the risk of replica metadata loss on XFS after a power outage
• Fix message body size being left unset after parsing, which led to excessive I/O throughput
Change the Replication Factor
• Motivation
• Scale out, e.g., 1 → 3 or 2 → 3
• Migration
• Increase partitions offline
• Process
• Check new replication factor
• Update metadata asynchronously
• Missing/redundant replicas are typically added/dropped within several seconds
• Clearing redundant data can be launched by setting the meta function level to lively
New System of Metrics
Perf-counter
• Verbose naming
• Overlapping metric types
• Unreasonable abstract interfaces
• Memory leaks caused by outdated metrics
• Potential performance problems
New metrics
• Use labels to simplify naming
• Redefine metric types
• Clear outdated metrics after a configurable period of time
• Improve performance
Framework
• Gauge: set/get, increment/decrement
• Counter: increment monotonically
• Percentile: P90/P95/P99/..., for a fixed window size
Performance
• Latency of counters (seconds, with 1 billion operations per thread; 2/4/8/16 threads; old counter vs. new counter)
• Latency of percentiles (seconds, with window size 5000; 10,000/50,000/100,000 operations; old percentile vs. new percentile)
New counter is based on long adder.
New percentile is based on nth_element
instead of median-of-medians selection.
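As a rough illustration of why a selection algorithm beats full sorting for percentiles, here is a minimal Python sketch of an nth_element-style rank query. The real implementation is in C++ (std::nth_element) and differs in detail; this is only a sketch of the idea.

```python
# Selection-based percentile: find the k-th smallest sample without
# fully sorting the window. heapq.nsmallest stands in for C++
# std::nth_element here (different algorithm, same rank-query result).
import heapq

def percentile(samples, p):
    """p in (0, 1]; returns the value at rank int(p * len(samples))."""
    k = max(1, int(p * len(samples)))
    return heapq.nsmallest(k, samples)[-1]

window = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
assert percentile(window, 0.90) == 9
assert percentile(window, 0.50) == 5
```

Full sorting costs O(n log n) per window, while selection is O(n) on average, which matters when percentiles are recomputed every few seconds over thousands of metric windows.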
Original Batch Get
Improved Batch Get
Result of Improvement
• There are 1000 <hashKey, sortKey> pairs for each batch request
Limit RocksDB Memory Usage
• Charge memtable memory to the block cache
• Write Buffer Manager
• rocksdb_block_cache_capacity
• rocksdb_total_size_across_write_buffer
• Charge index & filter blocks to the block cache
• Memory usage ∝ num_partitions * max_open_files
• rocksdb_cache_index_and_filter_blocks (RocksDB ≥ 5.15)
jemalloc vs. tcmalloc
• Both memtables and index & filter blocks are capped by block cache
• rocksdb_block_cache_capacity=12GB
• rocksdb_total_size_across_write_buffer=8GB
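The settings above might look like the following in a Pegasus server config file. The section name, comment syntax, and value formats are illustrative assumptions; check the pegasus.conf shipped with your build for the exact spelling.

```ini
[pegasus.server]
; shared block cache capacity for all RocksDB instances on this node (12 GB)
rocksdb_block_cache_capacity = 12884901888
; Write Buffer Manager cap on total memtable memory, charged to the block cache (8 GB)
rocksdb_total_size_across_write_buffer = 8589934592
; charge index & filter blocks to the block cache (requires RocksDB >= 5.15)
rocksdb_cache_index_and_filter_blocks = true
```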
Performance
QPS of Single Put
Single Put + Single Get
Scan + Single Put
What's going on in the Pegasus community?
Development, New Release and Activities
Development
• New metrics framework
• Higher performance and easier to use
• Enhance backup & restore
• Enhance duplication
• Enhance authorization
• Easier-to-use admin tools
• Replace C++ tools with Go tools
• Refactor
• Support more CPU architectures
• x86, ARMs, Apple Silicon
• Support more operating systems
• Linux: RHEL/CentOS (6, 7, 8, 9), Ubuntu (16.04, 18.04, 20.04, 22.04)
• macOS: 12.4
• Website & Documents
New Release
• Pegasus 2.4.0
• Performance improvement
• Refactor dual-WAL to single WAL
• New features
• Change table's replication factor
• Read request limiter
• Enhancement
• Bulk load
• Duplication
• Manual compaction
• API
• Add batchGetByPartitions()
• Tools
• admin-cli supports more operations
Activities
• The 1st meetup was held in September 2021
• Planning to hold the 2nd meetup this autumn
• Small online meetings are held on an ad-hoc basis
Thanks
WeChat official account
https://pegasus.apache.org
https://github.com/apache/incubator-pegasus

More Related Content

Similar to How does Apache Pegasus (incubating) community develop at SensorsData

Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Fwdays
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectures
hypertable
 
Drupal performance
Drupal performanceDrupal performance
Drupal performance
Piyuesh Kumar
 
High Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance TuningHigh Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance Tuning
Albert Chen
 
Getting started with Riak in the Cloud
Getting started with Riak in the CloudGetting started with Riak in the Cloud
Getting started with Riak in the Cloud
Ines Sombra
 
More Cache for Less Cash
More Cache for Less CashMore Cache for Less Cash
More Cache for Less Cash
Michael Collier
 
High-Performance Storage Services with HailDB and Java
High-Performance Storage Services with HailDB and JavaHigh-Performance Storage Services with HailDB and Java
High-Performance Storage Services with HailDB and Java
sunnygleason
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
DataWorks Summit
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
Nicolas Poggi
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagation
Regunath B
 
Collier exadata technical overview presentation 4 14-10
Collier exadata technical overview presentation 4 14-10Collier exadata technical overview presentation 4 14-10
Collier exadata technical overview presentation 4 14-10
xKinAnx
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
Marco Tusa
 
Using cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataUsing cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb data
Ramesh Veeramani
 
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
monsonc
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
Chin Huang
 
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarC* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
DataStax Academy
 
ActiveMQ 5.9.x new features
ActiveMQ 5.9.x new featuresActiveMQ 5.9.x new features
ActiveMQ 5.9.x new features
Christian Posta
 
V mware2012 20121221_final
V mware2012 20121221_finalV mware2012 20121221_final
V mware2012 20121221_final
Web2Present
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
MongoDB
 
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
DoKC
 

Similar to How does Apache Pegasus (incubating) community develop at SensorsData (20)

Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectures
 
Drupal performance
Drupal performanceDrupal performance
Drupal performance
 
High Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance TuningHigh Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance Tuning
 
Getting started with Riak in the Cloud
Getting started with Riak in the CloudGetting started with Riak in the Cloud
Getting started with Riak in the Cloud
 
More Cache for Less Cash
More Cache for Less CashMore Cache for Less Cash
More Cache for Less Cash
 
High-Performance Storage Services with HailDB and Java
High-Performance Storage Services with HailDB and JavaHigh-Performance Storage Services with HailDB and Java
High-Performance Storage Services with HailDB and Java
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagation
 
Collier exadata technical overview presentation 4 14-10
Collier exadata technical overview presentation 4 14-10Collier exadata technical overview presentation 4 14-10
Collier exadata technical overview presentation 4 14-10
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
 
Using cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataUsing cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb data
 
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarC* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
 
ActiveMQ 5.9.x new features
ActiveMQ 5.9.x new featuresActiveMQ 5.9.x new features
ActiveMQ 5.9.x new features
 
V mware2012 20121221_final
V mware2012 20121221_finalV mware2012 20121221_final
V mware2012 20121221_final
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
 

More from acelyc1112009

Apache Pegasus (incubating): A distributed key-value storage system
Apache Pegasus (incubating): A distributed key-value storage systemApache Pegasus (incubating): A distributed key-value storage system
Apache Pegasus (incubating): A distributed key-value storage system
acelyc1112009
 
How does Apache Pegasus used in SensorsData
How does Apache Pegasusused in SensorsDataHow does Apache Pegasusused in SensorsData
How does Apache Pegasus used in SensorsData
acelyc1112009
 
How does the Apache Pegasus used in Advertising Data Stream in SensorsData
How does the Apache Pegasus used in Advertising Data Stream in SensorsDataHow does the Apache Pegasus used in Advertising Data Stream in SensorsData
How does the Apache Pegasus used in Advertising Data Stream in SensorsData
acelyc1112009
 
How to continuously improve Apache Pegasus in complex toB scenarios
How to continuously improve Apache Pegasus in complex toB scenariosHow to continuously improve Apache Pegasus in complex toB scenarios
How to continuously improve Apache Pegasus in complex toB scenarios
acelyc1112009
 
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
acelyc1112009
 
How does Apache Pegasus used in Xiaomi's Universal Recommendation Algorithm ...
How does Apache Pegasus used  in Xiaomi's Universal Recommendation Algorithm ...How does Apache Pegasus used  in Xiaomi's Universal Recommendation Algorithm ...
How does Apache Pegasus used in Xiaomi's Universal Recommendation Algorithm ...
acelyc1112009
 
The Introduction of Apache Pegasus 2.4.0
The Introduction of Apache Pegasus 2.4.0The Introduction of Apache Pegasus 2.4.0
The Introduction of Apache Pegasus 2.4.0
acelyc1112009
 
The Design, Implementation and Open Source Way of Apache Pegasus
The Design, Implementation and Open Source Way of Apache PegasusThe Design, Implementation and Open Source Way of Apache Pegasus
The Design, Implementation and Open Source Way of Apache Pegasus
acelyc1112009
 
Apache Pegasus's Practice in Data Access Business of Xiaomi
Apache Pegasus's Practice in Data Access Business of XiaomiApache Pegasus's Practice in Data Access Business of Xiaomi
Apache Pegasus's Practice in Data Access Business of Xiaomi
acelyc1112009
 
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
acelyc1112009
 
How do we manage more than one thousand of Pegasus clusters - engine part
How do we manage more than one thousand of Pegasus clusters - engine partHow do we manage more than one thousand of Pegasus clusters - engine part
How do we manage more than one thousand of Pegasus clusters - engine part
acelyc1112009
 
How do we manage more than one thousand of Pegasus clusters - backend part
How do we manage more than one thousand of Pegasus clusters - backend partHow do we manage more than one thousand of Pegasus clusters - backend part
How do we manage more than one thousand of Pegasus clusters - backend part
acelyc1112009
 

More from acelyc1112009 (12)

Apache Pegasus (incubating): A distributed key-value storage system
Apache Pegasus (incubating): A distributed key-value storage systemApache Pegasus (incubating): A distributed key-value storage system
Apache Pegasus (incubating): A distributed key-value storage system
 
How does Apache Pegasus used in SensorsData
How does Apache Pegasusused in SensorsDataHow does Apache Pegasusused in SensorsData
How does Apache Pegasus used in SensorsData
 
How does the Apache Pegasus used in Advertising Data Stream in SensorsData
How does the Apache Pegasus used in Advertising Data Stream in SensorsDataHow does the Apache Pegasus used in Advertising Data Stream in SensorsData
How does the Apache Pegasus used in Advertising Data Stream in SensorsData
 
How to continuously improve Apache Pegasus in complex toB scenarios
How to continuously improve Apache Pegasus in complex toB scenariosHow to continuously improve Apache Pegasus in complex toB scenarios
How to continuously improve Apache Pegasus in complex toB scenarios
 
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
 
How does Apache Pegasus used in Xiaomi's Universal Recommendation Algorithm ...
How does Apache Pegasus used  in Xiaomi's Universal Recommendation Algorithm ...How does Apache Pegasus used  in Xiaomi's Universal Recommendation Algorithm ...
How does Apache Pegasus used in Xiaomi's Universal Recommendation Algorithm ...
 
The Introduction of Apache Pegasus 2.4.0
The Introduction of Apache Pegasus 2.4.0The Introduction of Apache Pegasus 2.4.0
The Introduction of Apache Pegasus 2.4.0
 
The Design, Implementation and Open Source Way of Apache Pegasus
The Design, Implementation and Open Source Way of Apache PegasusThe Design, Implementation and Open Source Way of Apache Pegasus
The Design, Implementation and Open Source Way of Apache Pegasus
 
Apache Pegasus's Practice in Data Access Business of Xiaomi
Apache Pegasus's Practice in Data Access Business of XiaomiApache Pegasus's Practice in Data Access Business of Xiaomi
Apache Pegasus's Practice in Data Access Business of Xiaomi
 
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
 
How do we manage more than one thousand of Pegasus clusters - engine part
How do we manage more than one thousand of Pegasus clusters - engine partHow do we manage more than one thousand of Pegasus clusters - engine part
How do we manage more than one thousand of Pegasus clusters - engine part
 
How do we manage more than one thousand of Pegasus clusters - backend part
How do we manage more than one thousand of Pegasus clusters - backend partHow do we manage more than one thousand of Pegasus clusters - backend part
How do we manage more than one thousand of Pegasus clusters - backend part
 

Recently uploaded

Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
sapna sharmap11
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
davidpietrzykowski1
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
Vineet
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
Rebecca Bilbro
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
Sid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.pptSid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.ppt
ArshadAyub49
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
hiju9823
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
CAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdfCAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdf
frp60658
 

Recently uploaded (20)

Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理

How does Apache Pegasus (incubating) community develop at SensorsData

  • 1. How does Apache Pegasus (incubating) community develop at SensorsData Dan Wang & Yingchun Lai 2022.07.29
  • 2. Outline • Overview of Apache Pegasus • Architecture, Data Model, User Interface, Performance, Important Features • Why SensorsData chose Apache Pegasus? • Evolution and Current Situation • Contributions to Apache Pegasus by SensorsData • Features, Improvements, Bugfixes • What's going on in the Pegasus community? • Development, New Release and Activities
  • 3. Overview of Apache Pegasus Architecture, Data Model, User Interface, Performance, Important Features
  • 4. What is Pegasus? Apache Pegasus is a horizontally scalable, strongly consistent and high-performance key-value store • C++ implemented • Local persistent storage engine by RocksDB • Strongly consistent by PacificA • High performance • Horizontally scalable • Flexible data model • Easy to use ecosystem tools
  • 5. Architecture MetaServer • Cluster controller • Configuration manager • Doesn't store data on itself ReplicaServer • Data node • Hash partition • PacificA (strongly consistent) • One RocksDB instance for each replica ZooKeeper • Meta server election • Metadata storage ClientLib • Request routing table from MetaServer once • Cache routing table • Interact directly with ReplicaServers for R/W requests
  • 6. Data Model SortKey • Extend user's usage scenario • Sorted in a specified HashKey HashKey • Decide which partition it belongs to • hash(HashKey) % kPartitionCount → partition_id Value • User's data
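The routing rule above (`hash(HashKey) % kPartitionCount → partition_id`) can be sketched as follows. This is an illustrative sketch only: `std::hash` stands in for Pegasus's actual hash function, and the function name is ours, not Pegasus's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Sketch of two-level-key routing: the HashKey alone decides the partition;
// the SortKey only orders rows inside that partition.
// std::hash is a stand-in for Pegasus's real hash function.
uint32_t route_to_partition(const std::string& hash_key, uint32_t partition_count) {
    return static_cast<uint32_t>(std::hash<std::string>{}(hash_key) % partition_count);
}
```

Because only the HashKey is hashed, all rows sharing one HashKey land in the same partition regardless of their SortKeys, which is what makes atomic multi-SortKey operations under one HashKey possible.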
  • 7. User Interface note: * means uncertain count • Supported languages: Java, C++, Go, Python, Node.js, Scala • Multiple SortKeys under one HashKey can be accessed atomically
  • 8. How to adapt to RocksDB For one table in Pegasus • The whole key space is hash split into N partitions • Each partition typically has 3 replicas • Distribute all these (3*N) replicas to M Replica Servers • Load balance between Replica Servers in cluster • Both for replicas and primary replicas • Both consider replica count and disk space • Load balance between data directories on a Replica Server • Same strategy • Each replica corresponds to a RocksDB instance • How does a Pegasus key-value map to a RocksDB key-value?
  • 9. How to adapt to RocksDB RocksDB Key • Length of HashKey: 2 bytes, for encoding and decoding the key • HashKey: variable length, defined by user • SortKey: variable length, defined by user RocksDB Value • Value Header: 13 bytes • Flag bit: 1 bit, always set to 1 • Data version: 7 bits • Expire timestamp: 4 bytes, in seconds, since epoch • Time tag: 8 bytes, designed for duplication • Timestamp: 56 bits, in microseconds • Cluster ID: 7 bits • Deleted tag: 1 bit • Value: variable length, defined by user
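The key layout above — a 2-byte HashKey-length prefix followed by the HashKey and SortKey — can be sketched like this. The function names are ours, and the big-endian byte order is an assumption (the slide only says the length takes 2 bytes); the real codec lives inside Pegasus.

```cpp
#include <cstdint>
#include <string>

// Encode a Pegasus two-level key into a single RocksDB key:
// [2-byte HashKey length][HashKey][SortKey].
std::string encode_key(const std::string& hash_key, const std::string& sort_key) {
    std::string out;
    uint16_t len = static_cast<uint16_t>(hash_key.size());
    out.push_back(static_cast<char>(len >> 8));    // high byte of HashKey length
    out.push_back(static_cast<char>(len & 0xFF));  // low byte
    out += hash_key;
    out += sort_key;
    return out;
}

// Decoding reverses the process using the 2-byte length prefix.
void decode_key(const std::string& raw, std::string* hash_key, std::string* sort_key) {
    uint16_t len = (static_cast<uint16_t>(static_cast<uint8_t>(raw[0])) << 8) |
                   static_cast<uint8_t>(raw[1]);
    *hash_key = raw.substr(2, len);
    *sort_key = raw.substr(2 + len);
}
```

Because the HashKey comes first, all rows under one HashKey are contiguous in the RocksDB keyspace, sorted by SortKey — which is how sorted scans within a HashKey work.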
  • 10. Performance YCSB on Pegasus 2.3.0 (the latest release) • CPU: 2.4 GHz, 24 cores • Memory: 128 GB • Disk: 480G SSD * 8 • Network card: 10 Gb/s • 5 Replica Servers • 64 partitions on test table
  • 11. Important Features Cold Backup • Create checkpoint for a table • Store data remotely on HDFS • Restore table to the original or another cluster Duplication • Asynchronous duplication • To achieve high write throughput • To tolerate high latency • The two clusters can be deployed in different regions • Support pipeline duplication, multi-master duplication, and master-master duplication
  • 12. Important Features Bulk Load • Generate SST files from user's original data • via Pegasus-Spark, in Pegasus rule • Store generated SST files to HDFS • Download SST files to Pegasus ReplicaServer • Ingest SST files to RocksDB • Reject client write while ingesting • Provide read & write Partition Split • Divide one replica into two replicas • Copy checkpoint and then duplicate WAL • Register on Meta Server when new replicas are ready • Reject client R/W request while registering • Provide client R/W request • GC redundant data that doesn't belong to the new partition in the background
  • 13. Important Features Backup Request • Only for read requests • Usage scenarios: • Load imbalance • Network problems • Single point of failure Hotkey Detection • Detect badly designed user keys • Resolve the single-point pressure caused by a hotkey
  • 14. Important Features • Access control • Authentication: Kerberos • Authorization: table-level ACL • Usage scenario option templates • Set RocksDB options in table level • Manual compaction • Fast GC, fast sort • Integration with BigData ecosystem • HDFS: cold backup, bulk load • Spark: bulk load, analysis on Hive • MetaProxy • Access unification
  • 15. Why SensorsData chose Apache Pegasus? Evolution and Current Situation
  • 16. KV Evolution in SensorsData • Standalone Redis (initially) • Pros • Mature • High performance • Cons • Single-point deployment • Considerable memory consumption • Volatile
  • 17. KV Evolution in SensorsData • Distributed Redis (2016) • Redis sentinel → Redis cluster • Pros • Scale out • Cons • Frequent OOM (several hundred million keys)
  • 18. KV Evolution in SensorsData • SSDB (2017) • master-slave • Pros • Reduce memory consumption • Compatible with Redis, making migration easy • Persistence • Cons • Cannot scale out • Cannot support more data or more businesses (I/O utilization is nearly 100%)
  • 19. Introduce Apache Pegasus (2020) • Scale out • High Availability • Strong consistency • Persistence • High performance • Stability • Tools for monitoring and operations • Support mget & mset • Documents & community • Low cost for migration
  • 20. Apache Pegasus in SensorsData • Pegasus has been deployed on over 1300 clusters up to now • About 20 products have chosen Pegasus to store their business data
  • 21. Contributions to Apache Pegasus by SensorsData Features, Improvements, Bugfixes
  • 22. Characteristics of Production Environment • Operating private clusters is difficult • A large number of clusters • Some clusters have to be operated on site • Some clusters are very small • Even single node • The hardware configuration is modest • Small memory • HDD • Multiple services are deployed on one node • Have to limit resource usage, such as memory
  • 23. New Functions • Support single replica • Connect to ZooKeeper secured with Kerberos • Change the replication factor of each table • Implement a new metrics system Improve Memory Usage • Limit RocksDB memory usage • Support jemalloc Refactor • Merge sub-projects Contributions Optimize Performance • Support batchGetByPartitions to improve batch get • Use multi_set to speed up copy_data Compatibility • Support building on macOS • Support building and running on the AArch64 architecture Bugfixes • Fix replica metadata loss risk on XFS after power outage • Fix message body size unset after parsing, which led to excessive I/O throughput
  • 24. Change the Replication Factor • Motivation • Scale out, e.g., 1 → 3 or 2 → 3 • Migration • Increase partitions offline • Process • Check the new replication factor • Update metadata asynchronously • Missing/redundant replicas will be added/dropped, typically within several seconds • Clearing redundant data is launched by setting the meta level to lively
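The first step of the process, checking the new replication factor, might look like the sketch below. This is a hypothetical illustration: the function name and the upper cap are ours, not Pegasus's; the actual validation rules were settled in a Pegasus pull request not reproduced here.

```cpp
// Hypothetical validity check run before changing a table's replication
// factor: the target must be positive, must not exceed the number of alive
// Replica Servers, and is capped at an illustrative maximum.
bool valid_new_replication_factor(int new_rf, int alive_node_count,
                                  int max_allowed = 5) {
    return new_rf > 0 && new_rf <= alive_node_count && new_rf <= max_allowed;
}
```

Only after this check passes does the MetaServer asynchronously update each partition's metadata and, finally, the table's metadata.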
  • 25. New System of Metrics Perf-counter • Verbose naming • Overlapped metric types • Unreasonable abstract interfaces • Memory leak by outdated metrics • Potential performance problems New metrics • Use labels to simplify naming • Redefine metric types • Clear outdated metrics after a configurable period of time • Improve performance
  • 26. Framework • Gauge: set/get, increment/decrement • Counter: increment monotonically • Percentile: P90/P95/P99/..., for a fixed window size
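The contracts of the Gauge and Counter types above can be sketched as follows. These are our illustrative classes, not the real Pegasus implementation; the cache-line-padded cells in `Counter` anticipate the "long adder" design mentioned in the performance comparison.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Gauge: a value that can be set directly or incremented/decremented.
class Gauge {
public:
    void set(int64_t v) { value_.store(v, std::memory_order_relaxed); }
    void increment(int64_t d) { value_.fetch_add(d, std::memory_order_relaxed); }
    int64_t value() const { return value_.load(std::memory_order_relaxed); }
private:
    std::atomic<int64_t> value_{0};
};

// Counter: monotonically increasing only. Increments are spread across
// several cache-line-aligned cells ("long adder" style) so concurrent
// writers rarely contend on the same cache line.
class Counter {
    static constexpr int kCells = 8;
    struct alignas(64) Cell { std::atomic<int64_t> v{0}; };  // avoid false sharing
    Cell cells_[kCells];
public:
    void increment() {
        // Hash the thread id to a cell; different threads usually hit
        // different cache lines.
        std::size_t i =
            std::hash<std::thread::id>{}(std::this_thread::get_id()) % kCells;
        cells_[i].v.fetch_add(1, std::memory_order_relaxed);
    }
    int64_t value() const {
        int64_t sum = 0;
        for (const Cell& c : cells_) sum += c.v.load(std::memory_order_relaxed);
        return sum;
    }
};
```

Reading a Counter sums all cells, so reads are slightly more expensive than writes — the right trade-off for metrics, which are written constantly and scraped only periodically.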
  • 27. Performance • Chart: latency of counters (seconds, with 1 billion operations for each thread, for 2/4/8/16 threads) — old counter vs. new counter • Chart: latency of percentiles (seconds, with window size 5000, for 10,000/50,000/100,000 operations) — old percentile vs. new percentile • New counter is based on long adder • New percentile is based on nth_element instead of median-of-medians selection
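The nth_element-based percentile computation can be sketched like this; it is our minimal illustration of the technique, not the Pegasus class. `std::nth_element` partially orders the window in average O(n) time, avoiding both a full sort and the heavy copying of a median-of-medians implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compute the p-th percentile (p in [0, 1)) over a fixed-size sample window.
// The window is taken by value on purpose: std::nth_element reorders it.
double percentile(std::vector<double> window, double p) {
    std::size_t k = static_cast<std::size_t>(p * window.size());
    if (k >= window.size()) k = window.size() - 1;
    // After this call, window[k] holds the element that would sit at
    // position k if the window were fully sorted.
    std::nth_element(window.begin(), window.begin() + k, window.end());
    return window[k];
}
```

Computing P90/P95/P99 over one window then means three nth_element calls on copies (or successive calls on narrowing ranges), each much cheaper than sorting the whole window.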
  • 30. Result of Improvement • There are 1000 <hashKey, sortKey> pairs for each batch request
  • 31. Limit RocksDB Memory Usage • Charge memtables to block cache • Write Buffer Manager • rocksdb_block_cache_capacity • rocksdb_total_size_across_write_buffer • Charge index & filter blocks to block cache • Memory usage ∝ num_partitions * max_open_files • rocksdb_cache_index_and_filter_blocks (version ≥ 5.15)
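Putting the three options together, a server configuration fragment might look like the sketch below. The option names come from the slide; the section name and exact value syntax are assumptions that may differ between Pegasus versions, and the capacities mirror the 12 GB / 8 GB setup used in the jemalloc vs. tcmalloc comparison.

```ini
[pegasus.server]
  # Block cache shared by all replicas on this Replica Server.
  rocksdb_block_cache_capacity = 12884901888           # 12 GB
  # Portion of the block cache that memtables may occupy (Write Buffer Manager).
  rocksdb_total_size_across_write_buffer = 8589934592  # 8 GB
  # Charge index & filter blocks to the block cache too (RocksDB >= 5.15).
  rocksdb_cache_index_and_filter_blocks = true
```

With all three set, memtables plus index & filter blocks are capped by the block cache, so the server's RocksDB memory stays near rocksdb_block_cache_capacity.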
  • 32. jemalloc vs. tcmalloc • Both memtables and index & filter blocks are capped by block cache • rocksdb_block_cache_capacity=12GB • rocksdb_total_size_across_write_buffer=8GB
  • 35. Single Put + Single Get
  • 37. What's going on in the Pegasus community? Development, New Release and Activities
  • 38. Development • New metrics framework • Higher performance and easier to use • Enhance backup & restore • Enhance duplication • Enhance authorization • Easy-to-use admin tools • Use Go tools to replace C++ tools • Refactor • Support more CPU architectures • x86, ARMs, Apple Silicon • Support more operating systems • Linux: RHEL/CentOS (6, 7, 8, 9), Ubuntu (16.04, 18.04, 20.04, 22.04) • macOS: 12.4 • Website & Documents
  • 39. New Release • Pegasus 2.4.0 • Performance improvement • Refactor dual-WAL to single WAL • New features • Change table's replication factor • Read request limiter • Enhancement • Bulk load • Duplication • Manual compaction • API • Add batchGetByPartitions() • Tools • admin-cli support more operations
  • 40. Activities • The 1st meetup held in Sep, 2021 • Planning to hold the 2nd meetup this autumn • Small online meetings held from time to time

Editor's Notes

  1. Hello everyone, and welcome to this talk on Apache Pegasus. First, some introductions. I'm Dan Wang, a Committer on the Pegasus project, currently working on storage technology at SensorsData. And I'm Yingchun Lai, a PPMC member of Pegasus, a PMC member of Apache Kudu, and a committer of Apache Doris, also working on storage technology at SensorsData.
  2. Our talk covers the following: first, an overall introduction to Pegasus; then, why SensorsData chose Pegasus as its distributed KV store; next, the contributions SensorsData has made to the Pegasus community; and finally, what is currently going on in the community.
  3. First, an overview of Pegasus — what it does and what problems it solves — covering its architecture, data model, user interface, performance numbers, and some important features.
  4. What is Pegasus? In one sentence: Apache Pegasus is a horizontally scalable, strongly consistent, high-performance key-value database. Its important traits: implemented in C++; uses RocksDB as the underlying local storage engine; strong consistency, implemented with the PacificA algorithm; high performance, thanks to a carefully engineered implementation; horizontal scalability for both the central control nodes and the data nodes; a flexible data model that users can extend even though it is a KV database; plus a rich, easy-to-use set of ecosystem tools.
  5. The architecture has four roles. MetaServer: the central management node (upper part of the figure), managing tables, partitions, partition leaders, and the routing table; note that it does not store this metadata itself — the metadata lives in ZooKeeper. ReplicaServer: the data node (middle of the figure) where user data is actually stored; it runs a number of RocksDB instances as storage engines and coordinates consistency via the PacificA protocol. One ReplicaServer can hold data of multiple partitions — e.g., a table may be split into 4 partitions, each usually with 3 replicas, and both replicas and leaders are spread evenly across ReplicaServers by the load-balancing algorithm. ZooKeeper stores the MetaServer's metadata and performs MetaServer leader election. Client lib: the user's entry point; through open-table and read/write interfaces, it fetches the routing table from the MetaServer once, caches it, and then exchanges data directly with ReplicaServers.
  6. The data model: as a KV database, Pegasus uses a simple two-level key. The HashKey determines which partition a row belongs to; the SortKey determines the row's position within that HashKey. The left figure shows routing: the user-defined HashKey is hashed to a partition id, and the routing table then determines which ReplicaServer handles reads and writes for the row. The right figure shows some user intents enabled by the two-level key, along with part of the user interface.
  7. Next, the user interfaces. The table shows the common data-access APIs, categorized by the number of HashKeys and SortKeys being operated on, and by the operation type. For example, in the Get column: 1 HashKey and 1 SortKey → get; 1 HashKey and multiple SortKeys → multiGet; 1 HashKey with no SortKeys specified → another multiGet variant; and so on. Note that reads and writes of multiple SortKeys under one HashKey are guaranteed to be atomic, which broadens the application scenarios. There are also scan-style, CAS-style, and GEO-style interfaces, and clients in multiple languages including Java, C++, Python, Node.js, and Scala.
  8. How does Pegasus adapt to RocksDB? Focusing on one table: its whole key space is hash-partitioned into N partitions, usually with 3 replicas per partition, and each replica corresponds to one RocksDB instance. The MetaServer schedules these replicas so they are balanced across all ReplicaServers in the cluster (leader replicas included), and within a single ReplicaServer the replicas are also balanced across its data disks. The question, then: how are Pegasus's two-level keys mapped to RocksDB keys?
  9. As the right figure shows, the RocksDB key has three parts: the first 2 bytes give the HashKey length, used to convert between RocksDB keys and Pegasus keys; the second and third parts are the HashKey and SortKey specified through the user interface. The RocksDB value also has two parts: a header — containing the data version, the expire timestamp, and the time tag used by the duplication feature — which makes the value format quite extensible; and then the user-specified value.
  10. Next, performance. Using the test environment shown at the lower right and the popular YCSB framework, we benchmarked Pegasus 2.3, released at the beginning of the year. We tested read-only, write-only, and mixed read/write workloads. Note that the latency unit is microseconds.
  11. Now some important features. First, cold backup: create a checkpoint of a table in the source cluster, upload the checkpoint files to remote storage such as HDFS, and, when recovery is needed, restore the backed-up checkpoint to the target cluster. Use cases: periodic backup for data safety, fast cross-region migration, exporting a single table's data out of a cluster, and so on. Second, duplication, i.e. hot backup: real-time writes to the source cluster are automatically synchronized to the target cluster (at table granularity), providing eventually consistent replication. Duplication is asynchronous, so it can tolerate high network latency between source and destination, and it supports several modes: pipeline, multi-master, and master-master. Use cases: real-time backup, low-latency local reads in cross-region scenarios, data migration, and so on.
  12. Next, bulk load: fast import of large batches of data, an order of magnitude faster than importing via the API. The rough flow: Pegasus ecosystem tools generate RocksDB SST files directly from user data and store them on, e.g., HDFS; the Pegasus cluster downloads them locally and ingests them into RocksDB; after that, normal reads and writes are served. Then partition split. A table's partition count is fixed at creation; if it was set too small, continued writes make each partition ever larger, hurting read/write efficiency. Partition split divides one partition in two, reducing per-partition data volume. The rough flow: copy the original replica locally (SST files and WAL) to form a shadow replica; once done, register it on the MetaServer, doubling the partition count, with the original and shadow replicas each serving half the data; reads and writes pause only briefly during registration. Finally, the new replicas GC the data that no longer belongs to them in the background, reclaiming the extra disk space.
  13. Backup request targets the read tail latency of a cluster. When ReplicaServers are load-imbalanced, some nodes have network problems, or there is a single point of failure, a read sent to a faulty primary node will inevitably see high latency or time out; the client then intelligently sends the read to a secondary node to get a timely response. Hotkey detection: a hotkey is a key whose read/write traffic is far above average, so the ReplicaServer holding its partition takes far more load than the others, creating single-point pressure. Detection identifies such requests automatically in three stages, as the figure shows: first find the hot partition in the table, then the hot bucket within that partition, then the hot key within that bucket (the bucket here is a virtual concept). The three-stage approach minimizes the impact of detection on normal read/write performance.
  14. Finally, some other important features, listed without going into detail: access control, making data access safer; usage-scenario-based option templates, adapting to more user scenarios; manual compaction, for fast sorting and garbage collection of the underlying data; and good integration with big-data ecosystem components, letting users get more out of Pegasus.
  15. Next, why SensorsData chose Apache Pegasus as its KV store. This part covers the evolution of KV systems at SensorsData and the current status of Pegasus there.
  16. The company's first KV system was standalone Redis, used to store id-mapping data, chosen mainly because it was mature and, with all data in memory, very fast. The problems: single-instance deployment with no horizontal scaling; heavy memory consumption, affecting other components on the same machine; mapping data grows over time but much of it is cold, and Redis cannot natively spill cold data to disk; and data could be lost even with persistence enabled. Still, it matched early customers' deployments, which were mostly single-node with modest data volumes, so standalone Redis was enough.
  17. As clustered customer deployments increased, we moved to distributed Redis in 2016 — first Redis Sentinel, then Redis Cluster for new deployments once it had stabilized. The advantages of Redis Cluster: a mature horizontal-scaling scheme and highly available operational tooling. But the problems were numerous: some customers' id-mapping data exceeded 100 million keys, consuming too much memory and causing frequent OOMs; the cold-data problem remained unsolved; and multi-key operations such as mget cannot cross instances.
  18. To solve these problems, we introduced master-slave SSDB in 2017: it reduced memory consumption (the biggest pain point), its Redis-compatible protocol made code migration fairly easy, and it persisted data so nothing was lost. But the downsides were also clear: SSDB cannot scale out, becoming a single point; and as customer data grew to hundreds of GB, LevelDB's read amplification degraded performance and could easily saturate a single node's I/O. So it could not support larger data volumes or more product lines.
  19. Given those problems, we needed a distributed KV solution for id-mapping data and for the growing number of other business workloads. We chose Apache Pegasus, mainly for these reasons: above all, its distributed-storage properties — horizontally scalable, highly available, strongly consistent, persistent; high read/write performance, meeting our needs; good stability — developed since 2015 and in production since 2016, long proven at Xiaomi; rich operations and monitoring tools; multi-key operations mget and mset; design and usage documentation plus an active open-source community; and low migration cost for both data and code.
  20. Since the first cluster went online in 2020, more than 1300 clusters have been deployed, with each SensorsData deployment containing 1 to 2 Pegasus clusters. Nearly 20 products now use Pegasus to store business data, including the id mapping just mentioned, channel tracking, user profiling, various online services, and so on.
  21. The next section covers the improvements SensorsData made to Apache Pegasus based on its own characteristics; all of these have been contributed back to the community.
  22. Since the improvements follow from SensorsData's own characteristics, let's first look at the production environment: (1) privately deployed clusters are an operational challenge — there are very many of them, and some cannot reach the Internet, so they can only be operated on site; (2) some clusters are very small, even single-node; (3) some clusters have modest hardware, such as small memory and low-performance disks (HDDs); (4) many components are deployed on each machine, so every component must limit its use of resources such as CPU and memory.
  23. Based on these characteristics, our main improvements were as follows. Features: (1) single-replica deployment, since some environments are single-node; (2) support for connecting to ZooKeeper secured with Kerberos, which some of our deployments require for the metadata Pegasus stores there; (3) dynamically changing a table's replication factor, needed for later scale-out and scale-in of single- and dual-node deployments; (4) a new metrics framework to fix the problems of the existing metrics. Performance: (1) optimized batch-read performance; (2) accelerated copy_data via the multi_set interface, for data-migration needs. Memory: (1) as mentioned, each component must cap its memory, so we limited RocksDB's memory usage; (2) alongside the existing tcmalloc, we added jemalloc support and compared the two allocators' effects on performance and memory. Compatibility: (1) building Pegasus on macOS, for developer convenience; (2) supporting the AArch64 architecture, since some deployments run on ARM machines. Refactoring: Pegasus used to have many sub-projects — the rDSN distributed framework, operations tools, and per-language clients — which were scattered and hard to manage; they have all been merged into the Pegasus repository. Bugfixes: (1) fixed replica metadata loss on the XFS filesystem after a power outage; (2) fixed the thrift message body size being left unset after deserialization, which caused I/O to spike.
  24. From the work just listed, let's highlight a few key features, starting with changing a table's replication factor, which allows dynamic modification of a table's replica count. Scenarios: (1) most commonly, cluster resizing — growing a single- or dual-node cluster to three or more nodes, or shrinking a cluster; (2) data migration or offline partition increase — first create a single-replica table in the target cluster and copy the data across clusters, then raise the replication factor to fill in the remaining replicas in the target cluster. How is it implemented? First, the target replication factor must be validated: does it exceed the cluster's node count, is it unreasonably large, and do the current cluster and table states allow the change? These constraints were settled first in a pull request. The actual change then proceeds as in the right figure: the command is currently issued by an administrator through the Pegasus Shell, and the request is sent to the MetaServer; the MetaServer asynchronously updates each partition's metadata, and once all are updated, the table's metadata; the ReplicaServers then sync the new replication factor to their local metadata. During the process, a flag prevents concurrent replication-factor changes on the same table. For decreasing the factor, which involves deleting the redundant replica data, the meta level state must be changed manually first, which is safer.
  25. The new metrics system was built mainly to fix the problems of the existing one, shown on the left: (1) metric names are verbose, containing lots of irrelevant or redundant information, and hard to read; (2) metric-type concepts are not strictly defined — e.g., a counter is normally a monotonically increasing count yet also has a set method — mainly because all types inherit one shared abstract interface, which is itself unreasonable; (3) metrics no longer in use (e.g., whose replica or table has been deleted) are never released, creating a memory leak; (4) the underlying implementation also has potential performance problems. On the right are the corresponding improvements: (1) simplify names, dropping irrelevant and redundant information and moving useful parts into labels (analogous to, e.g., Prometheus labels, which can be passed straight to Prometheus when metrics are collected); (2) survey all metrics in the system, clarify the types they need, and redefine those types in terms of concept, implementation, and performance; (3) support configurably retaining unused metrics for a period and then cleaning them up periodically; (4) fix the potential performance problems.
  26. With these considerations, we designed a new metrics framework. The upper left shows the three newly defined metric types: (1) Gauge, supporting set on a value as well as increment/decrement; (2) Counter, a monotonically increasing counter — the old Rate/Meter types used for QPS are dropped in favor of Counter combined with a monitoring system's rate function, such as Prometheus's; (3) Percentile, computing percentiles over a fixed-size sampling window. The framework's architecture, which borrows from Kudu's metrics system, is shown in the figure: (1) each metric registers with a metric entity when created; (2) a metric entity is a specific unit that manages metrics, e.g., server-level, table-level, or replica-level; (3) each metric entity registers with a singleton metric registry, which aggregates all metrics of one role instance (a MetaServer or a ReplicaServer); (4) the registry's metric data is periodically snapshotted and collected through sinks into different monitoring systems such as Prometheus and Open Falcon.
  27. This compares the performance of the Counter and Percentile types before and after optimization. The left bar chart compares counter computation time: the x-axis is the thread count, each thread performing 1 billion counter operations; the y-axis is total computation time in seconds; blue is the old counter, orange the new one. With many threads, the new counter is more than twice as fast: the old one was array-based and prone to false sharing, while the new one implements the Long Adder algorithm, which avoids false sharing and in some cases also uses less memory. The right bar chart compares percentile computation time: the x-axis is the number of operations, each operation computing all percentiles (P90, P95, P99, etc.) over a fixed-size sampling window; the y-axis is total time in seconds; blue is the old Percentile, orange the new one. The new Percentile is also more than twice as fast: the old implementation was based on the median-of-medians selection algorithm, with lots of memory copies and array initialization, whereas the new one uses C++ STL's nth_element() function, saving a great deal of copying.
  28. The next feature is the batch-get optimization. A batch get packs read requests for multiple hash keys into one client call; since multiple hash keys are involved, the requests may go to multiple ReplicaServers. This slide shows the original implementation: under the hood, the client unpacks the batch and issues one RPC per <hash key, sort key> pair — the figure has 9 such pairs, hence 9 actual RPCs. With many pairs in a batch, this produces a flood of RPCs and performance problems. Looking at the figure suggests an optimization: there are 4 partitions, with primaries on Replica Servers 0 through 2, and reads go to a partition's primary; data under one hash key is written to one partition, and several hash keys in the figure map to the same partition — e.g., hashkeys 1 and 2 both map to partition 1, and hashkeys 4 and 6 to partition 2. So all hash keys mapping to one partition can be packed into a single RPC to that ReplicaServer, reducing the RPC count.
  29. The optimized batch get is shown here: hashkeys 1 and 2 are packed into one RPC to partition 1's primary, and hashkeys 4 and 6 into one RPC to partition 2's primary, cutting the total number of RPCs from 9 to 4.
  30. The before/after benchmark is shown in this table: each request contains 1000 <hash key, sort key> pairs, tested on a 3-node cluster; the QPS numbers show a clear improvement.
  31. As mentioned when describing the production environment, resource usage must be limited, and RocksDB is the main memory consumer, so we first tried limiting RocksDB's memory. RocksDB's Write Buffer Manager feature can place memtable memory under block-cache management, controlled by two parameters: the existing rocksdb_block_cache_capacity, the size of the block cache shared by all partitions on one ReplicaServer instance, and the newly added rocksdb_total_size_across_write_buffer, the amount of block-cache memory that memtables may take. The other major memory class in RocksDB is index & filter blocks, roughly proportional to the partition count and the configured max_open_files; from RocksDB 5.15 onward this memory can also be managed by the block cache, so we introduced the rocksdb_cache_index_and_filter_blocks parameter, set to true to enable the feature.
  32. As mentioned, we also compared jemalloc against the existing tcmalloc, with four configurations, two per allocator: (1) jemalloc configs 1 and 2 are broadly similar — the dirty and muzzy parameters are the completion times of the two GC phases (0 means GC immediately), and background_thread=true reclaims memory proactively, so config 2 is a more aggressive reclamation strategy than config 1; (2) tcmalloc config 1 is an aggressive reclamation strategy, checking every 10 seconds and retaining no memory, while config 2 makes no proactive-reclamation settings. RocksDB is configured, as just described, with both memtables and index & filter blocks capped by the block cache.
  33. The performance comparison shows jemalloc's single-put QPS clearly above tcmalloc's, with everything else roughly equal.
  34. This is the single-put QPS monitoring graph, in the same order as the preceding comparison table (ignore the second curve from the left); jemalloc's QPS stays above tcmalloc's for the whole run.
  35. Next, the memory-limiting results. This graph shows single put followed by single get: except for tcmalloc config 2, all configurations reclaim memory aggressively and keep total memory well under control.
  36. This graph shows scan followed by single put: the first several configurations reclaim memory aggressively and keep total memory under 12 GB, while tcmalloc config 2 fails to contain it, reaching roughly 15-16 GB.
  37. In the last part, let's look at the Pegasus community's recent development as a whole.
  38. First, recent development work. We are reimplementing Pegasus's metrics framework, as described in detail earlier. We are also strengthening existing key features to improve their stability and usability, e.g., backup and restore, and duplication. For access control, we plan to integrate with Apache Ranger to make it more convenient and easy to use. For operations and admin tools, we are switching to new tools that will be friendlier to both developers and users. We will also support more CPU architectures — including some Chinese domestic CPUs on our roadmap — and more operating systems, including some domestic ones, as well as macOS and Apple Silicon, making local development and experimentation easier for our developers.
  39. Now the new release: version 2.4.0 is being prepared by the community, containing many performance optimizations, new features, and enhancements; only part of the list is shown here. For example, the original multiple WALs have been consolidated into one; new features include dynamically changing a table's replication factor and read throttling; and new APIs include batchGetByPartitions which, as shown earlier, brings a significant performance improvement.
  40. Finally, community activities. Last year the Pegasus community held its first offline meetup, and this autumn we are planning a second one. We also hold online meetups from time to time. Online or offline, we encourage Pegasus developers, users, and anyone interested to join us and exchange ideas.
  41. Thank you all for watching. You are welcome to join the Pegasus community and discuss technology with us.