How does Apache Pegasus (incubating)
community develop at SensorsData
Dan Wang & Yingchun Lai
2022.07.29
Outline
• Overview of Apache Pegasus
• Architecture, Data Model, User Interface, Performance, Important Features
• Why SensorsData chose Apache Pegasus?
• Evolution and Current Situation
• Contributions to Apache Pegasus by SensorsData
• Features, Improvements, Bugfixes
• What's going on in the Pegasus community?
• Development, New Release and Activities
Overview of Apache Pegasus
Architecture, Data Model, User Interface, Performance, Important Features
What is Pegasus?
Apache Pegasus is a horizontally scalable, strongly consistent and
high-performance key-value store
• Implemented in C++
• Local persistent storage engine based on RocksDB
• Strong consistency via the PacificA protocol
• High performance
• Horizontally scalable
• Flexible data model
• Easy-to-use ecosystem tools
Architecture
MetaServer
• Cluster controller
• Configuration manager
• Doesn't store data on itself
ReplicaServer
• Data node
• Hash partition
• PacificA (strongly consistent)
• One RocksDB instance for each replica
ZooKeeper
• Meta server election
• Metadata storage
ClientLib
• Requests the routing table from the MetaServer once
• Caches the routing table
• Interacts directly with ReplicaServers for R/W requests
Data Model
SortKey
• Extends the user's usage scenarios
• Sorted within a given HashKey
HashKey
• Determines which partition a key belongs to
• hash(HashKey) % kPartitionCount → partition_id
Value
• User's data
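The HashKey routing rule above can be sketched in a few lines of Python. This is an illustrative sketch only: Pegasus uses its own hash function rather than CRC32, and the partition count here is a made-up example value.

```python
# Illustrative sketch of Pegasus-style partition routing.
# Assumptions: CRC32 stands in for Pegasus's real hash function,
# and K_PARTITION_COUNT is an example per-table setting.
import zlib

K_PARTITION_COUNT = 8  # example per-table partition count

def partition_id(hash_key: bytes, partition_count: int = K_PARTITION_COUNT) -> int:
    """hash(HashKey) % kPartitionCount -> partition_id"""
    return zlib.crc32(hash_key) % partition_count

# All SortKeys under the same HashKey land in the same partition,
# because routing depends only on the HashKey:
pid = partition_id(b"user_42")
assert all(partition_id(b"user_42") == pid for _ in range(3))
```

Because only the HashKey is hashed, a multi-SortKey read or write under one HashKey always touches a single partition, which is what makes the atomic multi-SortKey operations mentioned later possible.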
User Interface
• Supported languages: Java, C++, Go, Python, Node.js, Scala
• Multiple SortKeys under one HashKey can be atomically accessed
How to adapt to RocksDB
For one table in Pegasus
• The whole key space is hash-partitioned into N partitions
• Each partition typically has 3 replicas
• All these (3*N) replicas are distributed across M Replica Servers
• Load balancing across Replica Servers in the cluster
• Both for all replicas and for primary replicas
• Considering both replica count and disk space
• Load balancing across data directories on a Replica Server
• Same considerations as above
• Each replica corresponds to a RocksDB instance
• How does Pegasus key-value map to RocksDB key-value?
How to adapt to RocksDB
RocksDB Key
• Length of HashKey: 2 bytes, for encoding and decoding key
• HashKey: variable length, defined by user
• SortKey: variable length, defined by user
RocksDB Value
• Value Header: 13 bytes
• Flag bit: 1 bit, always set to 1
• Data version: 7 bits
• Expire timestamp: 4 bytes, in seconds, since epoch
• Time tag: 8 bytes, designed for duplication
• Timestamp: 56 bits, in microseconds
• Cluster ID: 7 bits
• Deleted tag: 1 bit
• Value: variable length, defined by user
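The key/value layout above can be sketched with simple byte packing. The exact byte order and the bit placement inside the time tag are assumptions made for illustration; consult the Pegasus source for the authoritative encoding.

```python
# Hedged sketch of the Pegasus-to-RocksDB key/value layout described above.
# Assumptions: big-endian fields, and the time tag packs the 56-bit
# timestamp in the high bits, then 7-bit cluster ID, then 1-bit deleted.
import struct

def encode_key(hash_key: bytes, sort_key: bytes) -> bytes:
    # 2-byte HashKey length, then HashKey, then SortKey
    return struct.pack(">H", len(hash_key)) + hash_key + sort_key

def encode_value_header(version: int, expire_ts: int,
                        ts_us: int, cluster_id: int, deleted: bool) -> bytes:
    # 1 byte: flag bit (always 1) in the high bit + 7-bit data version
    first = 0x80 | (version & 0x7F)
    # 8-byte time tag: 56-bit timestamp | 7-bit cluster ID | 1-bit deleted
    time_tag = ((ts_us & ((1 << 56) - 1)) << 8) | ((cluster_id & 0x7F) << 1) | int(deleted)
    # 13 bytes total: 1 (flag+version) + 4 (expire, seconds) + 8 (time tag)
    return struct.pack(">B", first) + struct.pack(">I", expire_ts) + struct.pack(">Q", time_tag)

key = encode_key(b"user_42", b"login_ts")
header = encode_value_header(1, 0, 1_659_081_600_000_000, 5, False)
assert len(key) == 2 + 7 + 8 and len(header) == 13
```

The 2-byte length prefix is what lets a reader split a stored RocksDB key back into HashKey and SortKey, and it also gives all keys sharing a HashKey a common prefix, so SortKeys sort contiguously within it.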
Performance
YCSB on Pegasus 2.3.0 (the latest release)
• CPU: 2.4 GHz, 24 cores
• Memory: 128 GB
• Disk: 8 × 480 GB SSD
• Network card: 10 Gb/s
• 5 Replica Servers
• 64 partitions on test table
Important Features
Cold Backup
• Create checkpoint for a table
• Store data remotely on HDFS
• Restore table to the original or another cluster
Duplication
• Asynchronous duplication
• To achieve high write throughput
• To tolerate high latency
• The two clusters can be deployed in different regions
• Supports pipeline duplication, multi-master duplication, and master-master duplication
Important Features
Bulk Load
• Generate SST files from user's original data
• via Pegasus-Spark, in Pegasus rule
• Store generated SST files to HDFS
• Download SST files to Pegasus ReplicaServers
• Ingest SST files into RocksDB
• Reject client writes while ingesting
• Serve reads & writes otherwise
Partition Split
• Divide one replica into two replicas
• Copy checkpoint and then duplicate WAL
• Register on Meta Server when new replicas are ready
• Reject client R/W requests while registering
• Resume serving client R/W requests afterward
• GC redundant data that doesn't belong to the new partition
in the background
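A small sketch of why modulo-based partitioning composes nicely with splitting: when the partition count doubles from N to 2N, every key either stays in its old partition or moves to old_pid + N, so each parent partition splits into exactly two children. (Illustrative Python; the function names are made up.)

```python
# With modulo partitioning, h % 2N is always either h % N or h % N + N,
# so doubling the partition count splits each parent into two children
# and never shuffles keys across unrelated partitions.
def old_pid(h: int, n: int) -> int:
    return h % n

def new_pid(h: int, n: int) -> int:
    return h % (2 * n)

N = 4  # example: split a 4-partition table into 8 partitions
assert all(new_pid(h, N) in (old_pid(h, N), old_pid(h, N) + N)
           for h in range(1000))
```

This is also why the background GC step above is needed: after a split, each child still holds the parent's full checkpoint, and the roughly half of the data that now hashes to the sibling partition is redundant.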
Important Features
Backup Request
• Only for read requests
• Usage scenarios:
• Load imbalance
• Network problem
• Single point of failure
Hotkey Detection
• Detect badly designed user keys
• Resolve single points of failure caused by hotkeys
Important Features
• Access control
• Authentication: Kerberos
• Authorization: table-level ACL
• Usage scenario option templates
• Set RocksDB options at the table level
• Manual compaction
• Fast GC, fast sort
• Integration with BigData ecosystem
• HDFS: cold backup, bulk load
• Spark: bulk load, analysis on Hive
• MetaProxy
• Access unification
Why SensorsData chose Apache Pegasus?
Evolution and Current Situation
KV Evolution in SensorsData
• Standalone Redis (initially)
• Pros
• Mature
• High performance
• Cons
• Single-point deployment
• Considerable memory consumption
• Volatile (data is not persisted)
KV Evolution in SensorsData
• Distributed Redis (2016)
• Redis sentinel → Redis cluster
• Pros
• Scale out
• Cons
• Frequent OOM (several hundred million keys)
KV Evolution in SensorsData
• SSDB (2017)
• master-slave
• Pros
• Reduce memory consumption
• Compatible with Redis, thus easy for migration
• Persistence
• Cons
• Cannot scale out
• Cannot keep up with growing data volume and more businesses (I/O utilization is nearly 100%)
Introduce Apache Pegasus (2020)
• Scale out
• High Availability
• Strong consistency
• Persistence
• High performance
• Stability
• Tools for monitoring and operations
• Support mget & mset
• Documents & community
• Cost for migration
Apache Pegasus in SensorsData
• Pegasus has been deployed in over 1,300 clusters to date
• About 20 products have chosen Pegasus to store their business data
Contributions to Apache Pegasus by SensorsData
Features, Improvements, Bugfixes
Characteristics of Product Environment
• Operating private clusters is difficult
• A large number of clusters
• Some clusters have to be operated on site
• Some clusters are very small
• Even single node
• Hardware configurations are modest
• Small memory
• HDD
• Multiple services are deployed on one node
• Have to limit resource usage, such as memory
New Functions
• Support single replica
• Connect Zookeeper secured with Kerberos
• Change the replication factor of each table
• Implement new system of metrics
Improve Memory Usage
• Limit RocksDB memory usage
• Support jemalloc
Refactor
• Merge sub-projects
Contributions
Optimize Performance
• Support batchGetByPartitions to improve batch gets
• Use multi_set to speed up copy_data
Compatibility
• Support building on macOS
• Support building and running on the AArch64 architecture
Bugfixes
• Fix the risk of replica metadata loss on XFS after a power outage
• Fix message body size being left unset after parsing, which led to excessive I/O throughput
Change the Replication Factor
• Motivation
• Scale out, e.g., 1 → 3 or 2 → 3
• Migration
• Increase partitions offline
• Process
• Check new replication factor
• Update metadata asynchronously
• Missing/redundant replicas are typically added/dropped within several seconds
• Clearing redundant data can be launched by setting the meta function level to lively
New System of Metrics
Perf-counter
• Verbose naming
• Overlapping metric types
• Unreasonable abstract interfaces
• Memory leaks caused by outdated metrics
• Potential performance problems
New metrics
• Use labels to simplify naming
• Redefine metric types
• Clear outdated metrics after a configurable period of time
• Improve performance
Framework
• Gauge: set/get, increment/decrement
• Counter: increment monotonically
• Percentile: P90/P95/P99/..., for a fixed window size
Performance
• Latency of counters (seconds, with 1 billion operations per thread; 2/4/8/16 threads; old counter vs. new counter)
• Latency of percentiles (seconds, with window size 5000; 10,000/50,000/100,000 operations; old percentile vs. new percentile)
New counter is based on long adder.
New percentile is based on nth_element
instead of median-of-medians selection.
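As a rough illustration of why a selection algorithm beats full sorting for percentiles, here is a minimal Python sketch of an nth_element-style rank query. The real implementation is in C++ (std::nth_element) and differs in detail; this is only a sketch of the idea.

```python
# Selection-based percentile: find the k-th smallest sample without
# fully sorting the window. heapq.nsmallest stands in for C++
# std::nth_element here (different algorithm, same rank-query result).
import heapq

def percentile(samples, p):
    """p in (0, 1]; returns the value at rank int(p * len(samples))."""
    k = max(1, int(p * len(samples)))
    return heapq.nsmallest(k, samples)[-1]

window = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
assert percentile(window, 0.90) == 9
assert percentile(window, 0.50) == 5
```

Full sorting costs O(n log n) per window, while selection is O(n) on average, which matters when percentiles are recomputed every few seconds over thousands of metric windows.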
Original Batch Get
Improved Batch Get
Result of Improvement
• There are 1000 <hashKey, sortKey> pairs for each batch request
Limit RocksDB Memory Usage
• Charge memtable memory to the block cache
• Write Buffer Manager
• rocksdb_block_cache_capacity
• rocksdb_total_size_across_write_buffer
• Charge index & filter blocks to the block cache
• Memory usage ∝ num_partitions * max_open_files
• rocksdb_cache_index_and_filter_blocks (RocksDB ≥ 5.15)
jemalloc vs. tcmalloc
• Both memtables and index & filter blocks are capped by block cache
• rocksdb_block_cache_capacity=12GB
• rocksdb_total_size_across_write_buffer=8GB
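The settings above might look like the following in a Pegasus server config file. The section name, comment syntax, and value formats are illustrative assumptions; check the pegasus.conf shipped with your build for the exact spelling.

```ini
[pegasus.server]
; shared block cache capacity for all RocksDB instances on this node (12 GB)
rocksdb_block_cache_capacity = 12884901888
; Write Buffer Manager cap on total memtable memory, charged to the block cache (8 GB)
rocksdb_total_size_across_write_buffer = 8589934592
; charge index & filter blocks to the block cache (requires RocksDB >= 5.15)
rocksdb_cache_index_and_filter_blocks = true
```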
Performance
QPS of Single Put
Single Put + Single Get
Scan + Single Put
What's going on in the Pegasus community?
Development, New Release and Activities
Development
• New metrics framework
• Higher performance and easier to use
• Enhance backup & restore
• Enhance duplication
• Enhance authorization
• Easier-to-use admin tools
• Replace C++ tools with Go tools
• Refactor
• Support more CPU architectures
• x86, ARMs, Apple Silicon
• Support more operating systems
• Linux: RHEL/CentOS (6, 7, 8, 9), Ubuntu (16.04, 18.04, 20.04, 22.04)
• macOS: 12.4
• Website & Documents
New Release
• Pegasus 2.4.0
• Performance improvement
• Refactor dual-WAL to single WAL
• New features
• Change table's replication factor
• Read request limiter
• Enhancement
• Bulk load
• Duplication
• Manual compaction
• API
• Add batchGetByPartitions()
• Tools
• admin-cli supports more operations
Activities
• The 1st meetup was held in September 2021
• Planning to hold the 2nd meetup this autumn
• Small online meetings are held on an ad-hoc basis
Thanks
WeChat official account
https://pegasus.apache.org
https://github.com/apache/incubator-pegasus

More Related Content

Similar to How does Apache Pegasus (incubating) community develop at SensorsData

Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Fwdays
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectures
hypertable
 
Drupal performance
Drupal performanceDrupal performance
Drupal performance
Piyuesh Kumar
 
High Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance TuningHigh Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance Tuning
Albert Chen
 
Getting started with Riak in the Cloud
Getting started with Riak in the CloudGetting started with Riak in the Cloud
Getting started with Riak in the Cloud
Ines Sombra
 
More Cache for Less Cash
More Cache for Less CashMore Cache for Less Cash
More Cache for Less Cash
Michael Collier
 
High-Performance Storage Services with HailDB and Java
High-Performance Storage Services with HailDB and JavaHigh-Performance Storage Services with HailDB and Java
High-Performance Storage Services with HailDB and Java
sunnygleason
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
DataWorks Summit
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
Nicolas Poggi
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagation
Regunath B
 
Collier exadata technical overview presentation 4 14-10
Collier exadata technical overview presentation 4 14-10Collier exadata technical overview presentation 4 14-10
Collier exadata technical overview presentation 4 14-10
xKinAnx
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
Marco Tusa
 
Using cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataUsing cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb data
Ramesh Veeramani
 
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
monsonc
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
Chin Huang
 
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarC* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
DataStax Academy
 
ActiveMQ 5.9.x new features
ActiveMQ 5.9.x new featuresActiveMQ 5.9.x new features
ActiveMQ 5.9.x new features
Christian Posta
 
V mware2012 20121221_final
V mware2012 20121221_finalV mware2012 20121221_final
V mware2012 20121221_final
Web2Present
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
MongoDB
 
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
DoKC
 

Similar to How does Apache Pegasus (incubating) community develop at SensorsData (20)

Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectures
 
Drupal performance
Drupal performanceDrupal performance
Drupal performance
 
High Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance TuningHigh Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance Tuning
 
Getting started with Riak in the Cloud
Getting started with Riak in the CloudGetting started with Riak in the Cloud
Getting started with Riak in the Cloud
 
More Cache for Less Cash
More Cache for Less CashMore Cache for Less Cash
More Cache for Less Cash
 
High-Performance Storage Services with HailDB and Java
High-Performance Storage Services with HailDB and JavaHigh-Performance Storage Services with HailDB and Java
High-Performance Storage Services with HailDB and Java
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagation
 
Collier exadata technical overview presentation 4 14-10
Collier exadata technical overview presentation 4 14-10Collier exadata technical overview presentation 4 14-10
Collier exadata technical overview presentation 4 14-10
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
 
Using cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb dataUsing cassandra as a distributed logging to store pb data
Using cassandra as a distributed logging to store pb data
 
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarC* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
 
ActiveMQ 5.9.x new features
ActiveMQ 5.9.x new featuresActiveMQ 5.9.x new features
ActiveMQ 5.9.x new features
 
V mware2012 20121221_final
V mware2012 20121221_finalV mware2012 20121221_final
V mware2012 20121221_final
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
Disaggregated Container Attached Storage - Yet Another Topology with What Pur...
 

More from acelyc1112009

Apache Pegasus (incubating): A distributed key-value storage system
Apache Pegasus (incubating): A distributed key-value storage systemApache Pegasus (incubating): A distributed key-value storage system
Apache Pegasus (incubating): A distributed key-value storage system
acelyc1112009
 
How does Apache Pegasus used in SensorsData
How does Apache Pegasusused in SensorsDataHow does Apache Pegasusused in SensorsData
How does Apache Pegasus used in SensorsData
acelyc1112009
 
How does the Apache Pegasus used in Advertising Data Stream in SensorsData
How does the Apache Pegasus used in Advertising Data Stream in SensorsDataHow does the Apache Pegasus used in Advertising Data Stream in SensorsData
How does the Apache Pegasus used in Advertising Data Stream in SensorsData
acelyc1112009
 
How to continuously improve Apache Pegasus in complex toB scenarios
How to continuously improve Apache Pegasus in complex toB scenariosHow to continuously improve Apache Pegasus in complex toB scenarios
How to continuously improve Apache Pegasus in complex toB scenarios
acelyc1112009
 
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
acelyc1112009
 
How does Apache Pegasus used in Xiaomi's Universal Recommendation Algorithm ...
How does Apache Pegasus used  in Xiaomi's Universal Recommendation Algorithm ...How does Apache Pegasus used  in Xiaomi's Universal Recommendation Algorithm ...
How does Apache Pegasus used in Xiaomi's Universal Recommendation Algorithm ...
acelyc1112009
 
The Introduction of Apache Pegasus 2.4.0
The Introduction of Apache Pegasus 2.4.0The Introduction of Apache Pegasus 2.4.0
The Introduction of Apache Pegasus 2.4.0
acelyc1112009
 
The Design, Implementation and Open Source Way of Apache Pegasus
The Design, Implementation and Open Source Way of Apache PegasusThe Design, Implementation and Open Source Way of Apache Pegasus
The Design, Implementation and Open Source Way of Apache Pegasus
acelyc1112009
 
Apache Pegasus's Practice in Data Access Business of Xiaomi
Apache Pegasus's Practice in Data Access Business of XiaomiApache Pegasus's Practice in Data Access Business of Xiaomi
Apache Pegasus's Practice in Data Access Business of Xiaomi
acelyc1112009
 
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
acelyc1112009
 
How do we manage more than one thousand of Pegasus clusters - engine part
How do we manage more than one thousand of Pegasus clusters - engine partHow do we manage more than one thousand of Pegasus clusters - engine part
How do we manage more than one thousand of Pegasus clusters - engine part
acelyc1112009
 
How do we manage more than one thousand of Pegasus clusters - backend part
How do we manage more than one thousand of Pegasus clusters - backend partHow do we manage more than one thousand of Pegasus clusters - backend part
How do we manage more than one thousand of Pegasus clusters - backend part
acelyc1112009
 

More from acelyc1112009 (12)

Apache Pegasus (incubating): A distributed key-value storage system
Apache Pegasus (incubating): A distributed key-value storage systemApache Pegasus (incubating): A distributed key-value storage system
Apache Pegasus (incubating): A distributed key-value storage system
 
How does Apache Pegasus used in SensorsData
How does Apache Pegasusused in SensorsDataHow does Apache Pegasusused in SensorsData
How does Apache Pegasus used in SensorsData
 
How does the Apache Pegasus used in Advertising Data Stream in SensorsData
How does the Apache Pegasus used in Advertising Data Stream in SensorsDataHow does the Apache Pegasus used in Advertising Data Stream in SensorsData
How does the Apache Pegasus used in Advertising Data Stream in SensorsData
 
How to continuously improve Apache Pegasus in complex toB scenarios
How to continuously improve Apache Pegasus in complex toB scenariosHow to continuously improve Apache Pegasus in complex toB scenarios
How to continuously improve Apache Pegasus in complex toB scenarios
 
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
The Construction and Practice of Apache Pegasus in Offline and Online Scenari...
 
How does Apache Pegasus used in Xiaomi's Universal Recommendation Algorithm ...
How does Apache Pegasus used  in Xiaomi's Universal Recommendation Algorithm ...How does Apache Pegasus used  in Xiaomi's Universal Recommendation Algorithm ...
How does Apache Pegasus used in Xiaomi's Universal Recommendation Algorithm ...
 
The Introduction of Apache Pegasus 2.4.0
The Introduction of Apache Pegasus 2.4.0The Introduction of Apache Pegasus 2.4.0
The Introduction of Apache Pegasus 2.4.0
 
The Design, Implementation and Open Source Way of Apache Pegasus
The Design, Implementation and Open Source Way of Apache PegasusThe Design, Implementation and Open Source Way of Apache Pegasus
The Design, Implementation and Open Source Way of Apache Pegasus
 
Apache Pegasus's Practice in Data Access Business of Xiaomi
Apache Pegasus's Practice in Data Access Business of XiaomiApache Pegasus's Practice in Data Access Business of Xiaomi
Apache Pegasus's Practice in Data Access Business of Xiaomi
 
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
The Advertising Algorithm Architecture in Xiaomi and How does Pegasus Practic...
 
How do we manage more than one thousand of Pegasus clusters - engine part
How do we manage more than one thousand of Pegasus clusters - engine partHow do we manage more than one thousand of Pegasus clusters - engine part
How do we manage more than one thousand of Pegasus clusters - engine part
 
How do we manage more than one thousand of Pegasus clusters - backend part
How do we manage more than one thousand of Pegasus clusters - backend partHow do we manage more than one thousand of Pegasus clusters - backend part
How do we manage more than one thousand of Pegasus clusters - backend part
 

Recently uploaded

Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
sapna sharmap11
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
davidpietrzykowski1
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
Vineet
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
Rebecca Bilbro
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
Sid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.pptSid Sigma educational and problem solving power point- Six Sigma.ppt
Sid Sigma educational and problem solving power point- Six Sigma.ppt
ArshadAyub49
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
hiju9823
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
CAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdfCAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdf
frp60658
 

Recently uploaded (20)

Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理

How does Apache Pegasus (incubating) community develop at SensorsData

  • 1. How does Apache Pegasus (incubating) community develop at SensorsData Dan Wang & Yingchun Lai 2022.07.29
  • 2. Outline • Overview of Apache Pegasus • Architecture, Data Model, User Interface, Performance, Important Features • Why SensorsData chose Apache Pegasus? • Evolution and Current Situation • Contributions to Apache Pegasus by SensorsData • Features, Improvements, Bugfixes • What's going on in the Pegasus community? • Development, New Release and Activities
  • 3. Overview of Apache Pegasus Architecture, Data Model, User Interface, Performance, Important Features
  • 4. What is Pegasus? Apache Pegasus is a horizontally scalable, strongly consistent and high-performance key-value store • C++ implemented • Local persistent storage engine by RocksDB • Strongly consistent by PacificA • High performance • Horizontally scalable • Flexible data model • Easy to use ecosystem tools
  • 5. Architecture MetaServer • Cluster controller • Configuration manager • Doesn't store data on itself ReplicaServer • Data node • Hash partition • PacificA (strongly consistent) • One RocksDB instance for each replica ZooKeeper • Meta server election • Metadata storage ClientLib • Request routing table from MetaServer once • Cache routing table • Interact directly with ReplicaServers for R/W requests
  • 6. Data Model SortKey • Extend user's usage scenario • Sorted in a specified HashKey HashKey • Decide which partition it belongs to • hash(HashKey) % kPartitionCount → partition_id Value • User's data
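The routing rule above (`hash(HashKey) % kPartitionCount → partition_id`) can be sketched as follows. This is an illustrative sketch only: `std::hash` stands in for Pegasus's actual hash function, and the function name is ours, not Pegasus's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Sketch of two-level-key routing: the HashKey alone decides the partition;
// the SortKey only orders rows inside that partition.
// std::hash is a stand-in for Pegasus's real hash function.
uint32_t route_to_partition(const std::string& hash_key, uint32_t partition_count) {
    return static_cast<uint32_t>(std::hash<std::string>{}(hash_key) % partition_count);
}
```

Because only the HashKey is hashed, all rows sharing one HashKey land in the same partition regardless of their SortKeys, which is what makes atomic multi-SortKey operations under one HashKey possible.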
  • 7. User Interface note: * means uncertain count • Supported languages: Java, C++, Go, Python, Node.js, Scala • Multiple SortKeys under one HashKey can be accessed atomically
  • 8. How to adapt to RocksDB For one table in Pegasus • The whole key space is hash split into N partitions • Each partition typically has 3 replicas • Distribute all these (3*N) replicas to M Replica Servers • Load balance between Replica Servers in cluster • Both for replicas and primary replicas • Both consider replica count and disk space • Load balance between data directories on a Replica Server • Same strategy • Each replica corresponds to a RocksDB instance • How does a Pegasus key-value map to a RocksDB key-value?
  • 9. How to adapt to RocksDB RocksDB Key • Length of HashKey: 2 bytes, for encoding and decoding the key • HashKey: variable length, defined by user • SortKey: variable length, defined by user RocksDB Value • Value Header: 13 bytes • Flag bit: 1 bit, always set to 1 • Data version: 7 bits • Expire timestamp: 4 bytes, in seconds, since epoch • Time tag: 8 bytes, designed for duplication • Timestamp: 56 bits, in microseconds • Cluster ID: 7 bits • Deleted tag: 1 bit • Value: variable length, defined by user
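The key layout above — a 2-byte HashKey-length prefix followed by the HashKey and SortKey — can be sketched like this. The function names are ours, and the big-endian byte order is an assumption (the slide only says the length takes 2 bytes); the real codec lives inside Pegasus.

```cpp
#include <cstdint>
#include <string>

// Encode a Pegasus two-level key into a single RocksDB key:
// [2-byte HashKey length][HashKey][SortKey].
std::string encode_key(const std::string& hash_key, const std::string& sort_key) {
    std::string out;
    uint16_t len = static_cast<uint16_t>(hash_key.size());
    out.push_back(static_cast<char>(len >> 8));    // high byte of HashKey length
    out.push_back(static_cast<char>(len & 0xFF));  // low byte
    out += hash_key;
    out += sort_key;
    return out;
}

// Decoding reverses the process using the 2-byte length prefix.
void decode_key(const std::string& raw, std::string* hash_key, std::string* sort_key) {
    uint16_t len = (static_cast<uint16_t>(static_cast<uint8_t>(raw[0])) << 8) |
                   static_cast<uint8_t>(raw[1]);
    *hash_key = raw.substr(2, len);
    *sort_key = raw.substr(2 + len);
}
```

Because the HashKey comes first, all rows under one HashKey are contiguous in the RocksDB keyspace, sorted by SortKey — which is how sorted scans within a HashKey work.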
  • 10. Performance YCSB on Pegasus 2.3.0 (the latest release) • CPU: 2.4 GHz, 24 cores • Memory: 128 GB • Disk: 480G SSD * 8 • Network card: 10 Gb/s • 5 Replica Servers • 64 partitions on test table
  • 11. Important Features Cold Backup • Create checkpoint for a table • Store data remotely on HDFS • Restore table to the original or another cluster Duplication • Asynchronous duplication • To achieve high write throughput • To tolerate high latency • The two clusters can be deployed in different regions • Support pipeline duplication, multi-master duplication, and master-master duplication
  • 12. Important Features Bulk Load • Generate SST files from user's original data • via Pegasus-Spark, in Pegasus rule • Store generated SST files to HDFS • Download SST files to Pegasus ReplicaServer • Ingest SST files to RocksDB • Reject client write while ingesting • Provide read & write Partition Split • Divide one replica into two replicas • Copy checkpoint and then duplicate WAL • Register on Meta Server when new replicas are ready • Reject client R/W request while registering • Provide client R/W request • GC redundant data that doesn't belong to the new partition in the background
  • 13. Important Features Backup Request • Only for read requests • Usage scenarios: • Load imbalance • Network problems • Single point of failure Hotkey Detection • Detect badly designed user keys • Resolve the single-point pressure caused by a hotkey
  • 14. Important Features • Access control • Authentication: Kerberos • Authorization: table-level ACL • Usage scenario option templates • Set RocksDB options in table level • Manual compaction • Fast GC, fast sort • Integration with BigData ecosystem • HDFS: cold backup, bulk load • Spark: bulk load, analysis on Hive • MetaProxy • Access unification
  • 15. Why SensorsData chose Apache Pegasus? Evolution and Current Situation
  • 16. KV Evolution in SensorsData • Standalone Redis (initially) • Pros • Mature • High performance • Cons • Single-point deployment • Considerable memory consumption • Volatile
  • 17. KV Evolution in SensorsData • Distributed Redis (2016) • Redis sentinel → Redis cluster • Pros • Scale out • Cons • Frequent OOM (several hundred million keys)
  • 18. KV Evolution in SensorsData • SSDB (2017) • master-slave • Pros • Reduce memory consumption • Compatible with Redis, making migration easy • Persistence • Cons • Cannot scale out • Cannot support more data or more businesses (I/O utilization is nearly 100%)
  • 19. Introduce Apache Pegasus (2020) • Scale out • High Availability • Strong consistency • Persistence • High performance • Stability • Tools for monitoring and operations • Support mget & mset • Documents & community • Low cost for migration
  • 20. Apache Pegasus in SensorsData • Pegasus has been deployed on over 1300 clusters up to now • About 20 products have chosen Pegasus to store their business data
  • 21. Contributions to Apache Pegasus by SensorsData Features, Improvements, Bugfixes
  • 22. Characteristics of Production Environment • Operating private clusters is difficult • A large number of clusters • Some clusters have to be operated on site • Some clusters are very small • Even single node • The hardware configuration is modest • Small memory • HDD • Multiple services are deployed on one node • Have to limit resource usage, such as memory
  • 23. New Functions • Support single replica • Connect to ZooKeeper secured with Kerberos • Change the replication factor of each table • Implement a new metrics system Improve Memory Usage • Limit RocksDB memory usage • Support jemalloc Refactor • Merge sub-projects Contributions Optimize Performance • Support batchGetByPartitions to improve batch get • Use multi_set to speed up copy_data Compatibility • Support building on macOS • Support building and running on the AArch64 architecture Bugfixes • Fix replica metadata loss risk on XFS after power outage • Fix message body size unset after parsing, which led to excessive I/O throughput
  • 24. Change the Replication Factor • Motivation • Scale out, e.g., 1 → 3 or 2 → 3 • Migration • Increase partitions offline • Process • Check the new replication factor • Update metadata asynchronously • Missing/redundant replicas will be added/dropped, typically within several seconds • Clearing redundant data is launched by setting the meta level to lively
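The first step of the process, checking the new replication factor, might look like the sketch below. This is a hypothetical illustration: the function name and the upper cap are ours, not Pegasus's; the actual validation rules were settled in a Pegasus pull request not reproduced here.

```cpp
// Hypothetical validity check run before changing a table's replication
// factor: the target must be positive, must not exceed the number of alive
// Replica Servers, and is capped at an illustrative maximum.
bool valid_new_replication_factor(int new_rf, int alive_node_count,
                                  int max_allowed = 5) {
    return new_rf > 0 && new_rf <= alive_node_count && new_rf <= max_allowed;
}
```

Only after this check passes does the MetaServer asynchronously update each partition's metadata and, finally, the table's metadata.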
  • 25. New System of Metrics Perf-counter • Verbose naming • Overlapped metric types • Unreasonable abstract interfaces • Memory leak by outdated metrics • Potential performance problems New metrics • Use labels to simplify naming • Redefine metric types • Clear outdated metrics after a configurable period of time • Improve performance
  • 26. Framework • Gauge: set/get, increment/decrement • Counter: increment monotonically • Percentile: P90/P95/P99/..., for a fixed window size
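The contracts of the Gauge and Counter types above can be sketched as follows. These are our illustrative classes, not the real Pegasus implementation; the cache-line-padded cells in `Counter` anticipate the "long adder" design mentioned in the performance comparison.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Gauge: a value that can be set directly or incremented/decremented.
class Gauge {
public:
    void set(int64_t v) { value_.store(v, std::memory_order_relaxed); }
    void increment(int64_t d) { value_.fetch_add(d, std::memory_order_relaxed); }
    int64_t value() const { return value_.load(std::memory_order_relaxed); }
private:
    std::atomic<int64_t> value_{0};
};

// Counter: monotonically increasing only. Increments are spread across
// several cache-line-aligned cells ("long adder" style) so concurrent
// writers rarely contend on the same cache line.
class Counter {
    static constexpr int kCells = 8;
    struct alignas(64) Cell { std::atomic<int64_t> v{0}; };  // avoid false sharing
    Cell cells_[kCells];
public:
    void increment() {
        // Hash the thread id to a cell; different threads usually hit
        // different cache lines.
        std::size_t i =
            std::hash<std::thread::id>{}(std::this_thread::get_id()) % kCells;
        cells_[i].v.fetch_add(1, std::memory_order_relaxed);
    }
    int64_t value() const {
        int64_t sum = 0;
        for (const Cell& c : cells_) sum += c.v.load(std::memory_order_relaxed);
        return sum;
    }
};
```

Reading a Counter sums all cells, so reads are slightly more expensive than writes — the right trade-off for metrics, which are written constantly and scraped only periodically.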
  • 27. Performance • Chart: latency of counters (seconds, with 1 billion operations for each thread, for 2/4/8/16 threads) — old counter vs. new counter • Chart: latency of percentiles (seconds, with window size 5000, for 10,000/50,000/100,000 operations) — old percentile vs. new percentile • New counter is based on long adder • New percentile is based on nth_element instead of median-of-medians selection
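The nth_element-based percentile computation can be sketched like this; it is our minimal illustration of the technique, not the Pegasus class. `std::nth_element` partially orders the window in average O(n) time, avoiding both a full sort and the heavy copying of a median-of-medians implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compute the p-th percentile (p in [0, 1)) over a fixed-size sample window.
// The window is taken by value on purpose: std::nth_element reorders it.
double percentile(std::vector<double> window, double p) {
    std::size_t k = static_cast<std::size_t>(p * window.size());
    if (k >= window.size()) k = window.size() - 1;
    // After this call, window[k] holds the element that would sit at
    // position k if the window were fully sorted.
    std::nth_element(window.begin(), window.begin() + k, window.end());
    return window[k];
}
```

Computing P90/P95/P99 over one window then means three nth_element calls on copies (or successive calls on narrowing ranges), each much cheaper than sorting the whole window.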
  • 30. Result of Improvement • There are 1000 <hashKey, sortKey> pairs for each batch request
  • 31. Limit RocksDB Memory Usage • Charge memtables to block cache • Write Buffer Manager • rocksdb_block_cache_capacity • rocksdb_total_size_across_write_buffer • Charge index & filter blocks to block cache • Memory usage ∝ num_partitions * max_open_files • rocksdb_cache_index_and_filter_blocks (version ≥ 5.15)
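Putting the three options together, a server configuration fragment might look like the sketch below. The option names come from the slide; the section name and exact value syntax are assumptions that may differ between Pegasus versions, and the capacities mirror the 12 GB / 8 GB setup used in the jemalloc vs. tcmalloc comparison.

```ini
[pegasus.server]
  # Block cache shared by all replicas on this Replica Server.
  rocksdb_block_cache_capacity = 12884901888           # 12 GB
  # Portion of the block cache that memtables may occupy (Write Buffer Manager).
  rocksdb_total_size_across_write_buffer = 8589934592  # 8 GB
  # Charge index & filter blocks to the block cache too (RocksDB >= 5.15).
  rocksdb_cache_index_and_filter_blocks = true
```

With all three set, memtables plus index & filter blocks are capped by the block cache, so the server's RocksDB memory stays near rocksdb_block_cache_capacity.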
  • 32. jemalloc vs. tcmalloc • Both memtables and index & filter blocks are capped by block cache • rocksdb_block_cache_capacity=12GB • rocksdb_total_size_across_write_buffer=8GB
  • 35. Single Put + Single Get
  • 37. What's going on in the Pegasus community? Development, New Release and Activities
  • 38. Development • New metrics framework • Higher performance and easier to use • Enhance backup & restore • Enhance duplication • Enhance authorization • Easy-to-use admin tools • Use Go tools to replace C++ tools • Refactor • Support more CPU architectures • x86, ARMs, Apple Silicon • Support more operating systems • Linux: RHEL/CentOS (6, 7, 8, 9), Ubuntu (16.04, 18.04, 20.04, 22.04) • macOS: 12.4 • Website & Documents
  • 39. New Release • Pegasus 2.4.0 • Performance improvement • Refactor dual-WAL to single WAL • New features • Change table's replication factor • Read request limiter • Enhancement • Bulk load • Duplication • Manual compaction • API • Add batchGetByPartitions() • Tools • admin-cli support more operations
  • 40. Activities • The 1st meetup held in Sep, 2021 • Planning to hold the 2nd meetup this autumn • Small online meetings held from time to time

Editor's Notes

  1. Hello everyone, and welcome to this talk on Apache Pegasus. First, some introductions. I'm Dan Wang, a Committer on the Pegasus project, currently working on storage technology at SensorsData. And I'm Yingchun Lai, a PPMC member of Pegasus, a PMC member of Apache Kudu, and a committer of Apache Doris, also working on storage technology at SensorsData.
  2. Our talk covers the following: first, an overall introduction to Pegasus; then, why SensorsData chose Pegasus as its distributed KV store; next, the contributions SensorsData has made to the Pegasus community; and finally, what is currently going on in the community.
  3. First, an overview of Pegasus — what it does and what problems it solves — covering its architecture, data model, user interface, performance numbers, and some important features.
  4. What is Pegasus? In one sentence: Apache Pegasus is a horizontally scalable, strongly consistent, high-performance key-value database. Its important traits: implemented in C++; uses RocksDB as the underlying local storage engine; strong consistency, implemented with the PacificA algorithm; high performance, thanks to a carefully engineered implementation; horizontal scalability for both the central control nodes and the data nodes; a flexible data model that users can extend even though it is a KV database; plus a rich, easy-to-use set of ecosystem tools.
  5. The architecture has four roles. MetaServer: the central management node (upper part of the figure), managing tables, partitions, partition leaders, and the routing table; note that it does not store this metadata itself — the metadata lives in ZooKeeper. ReplicaServer: the data node (middle of the figure) where user data is actually stored; it runs a number of RocksDB instances as storage engines and coordinates consistency via the PacificA protocol. One ReplicaServer can hold data of multiple partitions — e.g., a table may be split into 4 partitions, each usually with 3 replicas, and both replicas and leaders are spread evenly across ReplicaServers by the load-balancing algorithm. ZooKeeper stores the MetaServer's metadata and performs MetaServer leader election. Client lib: the user's entry point; through open-table and read/write interfaces, it fetches the routing table from the MetaServer once, caches it, and then exchanges data directly with ReplicaServers.
  6. The data model: as a KV database, Pegasus uses a simple two-level key. The HashKey determines which partition a row belongs to; the SortKey determines the row's position within that HashKey. The left figure shows routing: the user-defined HashKey is hashed to a partition id, and the routing table then determines which ReplicaServer handles reads and writes for the row. The right figure shows some user intents enabled by the two-level key, along with part of the user interface.
  7. Next, the user interfaces. The table shows the common data-access APIs, categorized by the number of HashKeys and SortKeys being operated on, and by the operation type. For example, in the Get column: 1 HashKey and 1 SortKey → get; 1 HashKey and multiple SortKeys → multiGet; 1 HashKey with no SortKeys specified → another multiGet variant; and so on. Note that reads and writes of multiple SortKeys under one HashKey are guaranteed to be atomic, which broadens the application scenarios. There are also scan-style, CAS-style, and GEO-style interfaces, and clients in multiple languages including Java, C++, Python, Node.js, and Scala.
  8. How does Pegasus adapt to RocksDB? Focusing on one table: its whole key space is hash-partitioned into N partitions, usually with 3 replicas per partition, and each replica corresponds to one RocksDB instance. The MetaServer schedules these replicas so they are balanced across all ReplicaServers in the cluster (leader replicas included), and within a single ReplicaServer the replicas are also balanced across its data disks. The question, then: how are Pegasus's two-level keys mapped to RocksDB keys?
  9. As the right figure shows, the RocksDB key has three parts: the first 2 bytes give the HashKey length, used to convert between RocksDB keys and Pegasus keys; the second and third parts are the HashKey and SortKey specified through the user interface. The RocksDB value also has two parts: a header — containing the data version, the expire timestamp, and the time tag used by the duplication feature — which makes the value format quite extensible; and then the user-specified value.
  10. Next, performance. Using the test environment shown at the lower right and the popular YCSB framework, we benchmarked Pegasus 2.3, released at the beginning of the year. We tested read-only, write-only, and mixed read/write workloads. Note that the latency unit is microseconds.
  11. Now some important features. First, cold backup: create a checkpoint of a table in the source cluster, upload the checkpoint files to remote storage such as HDFS, and, when recovery is needed, restore the backed-up checkpoint to the target cluster. Use cases: periodic backup for data safety, fast cross-region migration, exporting a single table's data out of a cluster, and so on. Second, duplication, i.e. hot backup: real-time writes to the source cluster are automatically synchronized to the target cluster (at table granularity), providing eventually consistent replication. Duplication is asynchronous, so it can tolerate high network latency between source and destination, and it supports several modes: pipeline, multi-master, and master-master. Use cases: real-time backup, low-latency local reads in cross-region scenarios, data migration, and so on.
  12. Next, bulk load: fast import of large batches of data, an order of magnitude faster than importing via the API. The rough flow: Pegasus ecosystem tools generate RocksDB SST files directly from user data and store them on, e.g., HDFS; the Pegasus cluster downloads them locally and ingests them into RocksDB; after that, normal reads and writes are served. Then partition split. A table's partition count is fixed at creation; if it was set too small, continued writes make each partition ever larger, hurting read/write efficiency. Partition split divides one partition in two, reducing per-partition data volume. The rough flow: copy the original replica locally (SST files and WAL) to form a shadow replica; once done, register it on the MetaServer, doubling the partition count, with the original and shadow replicas each serving half the data; reads and writes pause only briefly during registration. Finally, the new replicas GC the data that no longer belongs to them in the background, reclaiming the extra disk space.
  13. Backup request targets the read tail latency of a cluster. When ReplicaServers are load-imbalanced, some nodes have network problems, or there is a single point of failure, a read sent to a faulty primary node will inevitably see high latency or time out; the client then intelligently sends the read to a secondary node to get a timely response. Hotkey detection: a hotkey is a key whose read/write traffic is far above average, so the ReplicaServer holding its partition takes far more load than the others, creating single-point pressure. Detection identifies such requests automatically in three stages, as the figure shows: first find the hot partition in the table, then the hot bucket within that partition, then the hot key within that bucket (the bucket here is a virtual concept). The three-stage approach minimizes the impact of detection on normal read/write performance.
  14. Finally, some other important features, listed without going into detail: access control, making data access safer; usage-scenario-based option templates, adapting to more user scenarios; manual compaction, for fast sorting and garbage collection of the underlying data; and good integration with big-data ecosystem components, letting users get more out of Pegasus.
  15. Next, why SensorsData chose Apache Pegasus as its KV store. This part covers the evolution of KV systems at SensorsData and the current status of Pegasus there.
  16. The company's first KV system was standalone Redis, used to store id-mapping data, chosen mainly because it was mature and, with all data in memory, very fast. The problems: single-instance deployment with no horizontal scaling; heavy memory consumption, affecting other components on the same machine; mapping data grows over time but much of it is cold, and Redis cannot natively spill cold data to disk; and data could be lost even with persistence enabled. Still, it matched early customers' deployments, which were mostly single-node with modest data volumes, so standalone Redis was enough.
  17. As clustered customer deployments increased, we moved to distributed Redis in 2016 — first Redis Sentinel, then Redis Cluster for new deployments once it had stabilized. The advantages of Redis Cluster: a mature horizontal-scaling scheme and highly available operational tooling. But the problems were numerous: some customers' id-mapping data exceeded 100 million keys, consuming too much memory and causing frequent OOMs; the cold-data problem remained unsolved; and multi-key operations such as mget cannot cross instances.
  18. To solve these problems, we introduced master-slave SSDB in 2017: it reduced memory consumption (the biggest pain point), its Redis-compatible protocol made code migration fairly easy, and it persisted data so nothing was lost. But the downsides were also clear: SSDB cannot scale out, becoming a single point; and as customer data grew to hundreds of GB, LevelDB's read amplification degraded performance and could easily saturate a single node's I/O. So it could not support larger data volumes or more product lines.
  19. Given those problems, we needed a distributed KV solution for id-mapping data and for the growing number of other business workloads. We chose Apache Pegasus, mainly for these reasons: above all, its distributed-storage properties — horizontally scalable, highly available, strongly consistent, persistent; high read/write performance, meeting our needs; good stability — developed since 2015 and in production since 2016, long proven at Xiaomi; rich operations and monitoring tools; multi-key operations mget and mset; design and usage documentation plus an active open-source community; and low migration cost for both data and code.
  20. Since the first cluster went online in 2020, more than 1300 clusters have been deployed, with each SensorsData deployment containing 1 to 2 Pegasus clusters. Nearly 20 products now use Pegasus to store business data, including the id mapping just mentioned, channel tracking, user profiling, various online services, and so on.
  21. The next section covers the improvements SensorsData made to Apache Pegasus based on its own characteristics; all of these have been contributed back to the community.
  22. Since the improvements follow from SensorsData's own characteristics, let's first look at the production environment: (1) privately deployed clusters are an operational challenge — there are very many of them, and some cannot reach the Internet, so they can only be operated on site; (2) some clusters are very small, even single-node; (3) some clusters have modest hardware, such as small memory and low-performance disks (HDDs); (4) many components are deployed on each machine, so every component must limit its use of resources such as CPU and memory.
  23. Based on these characteristics, our main improvements were as follows. Features: (1) single-replica deployment, since some environments are single-node; (2) support for connecting to ZooKeeper secured with Kerberos, which some of our deployments require for the metadata Pegasus stores there; (3) dynamically changing a table's replication factor, needed for later scale-out and scale-in of single- and dual-node deployments; (4) a new metrics framework to fix the problems of the existing metrics. Performance: (1) optimized batch-read performance; (2) accelerated copy_data via the multi_set interface, for data-migration needs. Memory: (1) as mentioned, each component must cap its memory, so we limited RocksDB's memory usage; (2) alongside the existing tcmalloc, we added jemalloc support and compared the two allocators' effects on performance and memory. Compatibility: (1) building Pegasus on macOS, for developer convenience; (2) supporting the AArch64 architecture, since some deployments run on ARM machines. Refactoring: Pegasus used to have many sub-projects — the rDSN distributed framework, operations tools, and per-language clients — which were scattered and hard to manage; they have all been merged into the Pegasus repository. Bugfixes: (1) fixed replica metadata loss on the XFS filesystem after a power outage; (2) fixed the thrift message body size being left unset after deserialization, which caused I/O to spike.
  24. From the work just listed, let's highlight a few key features, starting with changing a table's replication factor, which allows dynamic modification of a table's replica count. Scenarios: (1) most commonly, cluster resizing — growing a single- or dual-node cluster to three or more nodes, or shrinking a cluster; (2) data migration or offline partition increase — first create a single-replica table in the target cluster and copy the data across clusters, then raise the replication factor to fill in the remaining replicas in the target cluster. How is it implemented? First, the target replication factor must be validated: does it exceed the cluster's node count, is it unreasonably large, and do the current cluster and table states allow the change? These constraints were settled first in a pull request. The actual change then proceeds as in the right figure: the command is currently issued by an administrator through the Pegasus Shell, and the request is sent to the MetaServer; the MetaServer asynchronously updates each partition's metadata, and once all are updated, the table's metadata; the ReplicaServers then sync the new replication factor to their local metadata. During the process, a flag prevents concurrent replication-factor changes on the same table. For decreasing the factor, which involves deleting the redundant replica data, the meta level state must be changed manually first, which is safer.
  25. The new metrics system was built mainly to fix the problems of the existing one, shown on the left: (1) metric names are verbose, containing lots of irrelevant or redundant information, and hard to read; (2) metric-type concepts are not strictly defined — e.g., a counter is normally a monotonically increasing count yet also has a set method — mainly because all types inherit one shared abstract interface, which is itself unreasonable; (3) metrics no longer in use (e.g., whose replica or table has been deleted) are never released, creating a memory leak; (4) the underlying implementation also has potential performance problems. On the right are the corresponding improvements: (1) simplify names, dropping irrelevant and redundant information and moving useful parts into labels (analogous to, e.g., Prometheus labels, which can be passed straight to Prometheus when metrics are collected); (2) survey all metrics in the system, clarify the types they need, and redefine those types in terms of concept, implementation, and performance; (3) support configurably retaining unused metrics for a period and then cleaning them up periodically; (4) fix the potential performance problems.
  26. With these considerations, we designed a new metrics framework. The upper left shows the three newly defined metric types: (1) Gauge, supporting set on a value as well as increment/decrement; (2) Counter, a monotonically increasing counter — the old Rate/Meter types used for QPS are dropped in favor of Counter combined with a monitoring system's rate function, such as Prometheus's; (3) Percentile, computing percentiles over a fixed-size sampling window. The framework's architecture, which borrows from Kudu's metrics system, is shown in the figure: (1) each metric registers with a metric entity when created; (2) a metric entity is a specific unit that manages metrics, e.g., server-level, table-level, or replica-level; (3) each metric entity registers with a singleton metric registry, which aggregates all metrics of one role instance (a MetaServer or a ReplicaServer); (4) the registry's metric data is periodically snapshotted and collected through sinks into different monitoring systems such as Prometheus and Open Falcon.
  27. This compares the performance of the Counter and Percentile types before and after optimization. The left bar chart compares counter computation time: the x-axis is the thread count, each thread performing 1 billion counter operations; the y-axis is total computation time in seconds; blue is the old counter, orange the new one. With many threads, the new counter is more than twice as fast: the old one was array-based and prone to false sharing, while the new one implements the Long Adder algorithm, which avoids false sharing and in some cases also uses less memory. The right bar chart compares percentile computation time: the x-axis is the number of operations, each operation computing all percentiles (P90, P95, P99, etc.) over a fixed-size sampling window; the y-axis is total time in seconds; blue is the old Percentile, orange the new one. The new Percentile is also more than twice as fast: the old implementation was based on the median-of-medians selection algorithm, with lots of memory copies and array initialization, whereas the new one uses C++ STL's nth_element() function, saving a great deal of copying.
  28. The next feature is the batch-get optimization. A batch get packs read requests for multiple hash keys into one client call; since multiple hash keys are involved, the requests may go to multiple ReplicaServers. This slide shows the original implementation: under the hood, the client unpacks the batch and issues one RPC per <hash key, sort key> pair — the figure has 9 such pairs, hence 9 actual RPCs. With many pairs in a batch, this produces a flood of RPCs and performance problems. Looking at the figure suggests an optimization: there are 4 partitions, with primaries on Replica Servers 0 through 2, and reads go to a partition's primary; data under one hash key is written to one partition, and several hash keys in the figure map to the same partition — e.g., hashkeys 1 and 2 both map to partition 1, and hashkeys 4 and 6 to partition 2. So all hash keys mapping to one partition can be packed into a single RPC to that ReplicaServer, reducing the RPC count.
  29. The optimized batch get is shown here: hashkeys 1 and 2 are packed into one RPC to partition 1's primary, and hashkeys 4 and 6 into one RPC to partition 2's primary, cutting the total number of RPCs from 9 to 4.
  30. The before/after benchmark is shown in this table: each request contains 1000 <hash key, sort key> pairs, tested on a 3-node cluster; the QPS numbers show a clear improvement.
  31. As mentioned when describing the production environment, resource usage must be limited, and RocksDB is the main memory consumer, so we first tried limiting RocksDB's memory. RocksDB's Write Buffer Manager feature can place memtable memory under block-cache management, controlled by two parameters: the existing rocksdb_block_cache_capacity, the size of the block cache shared by all partitions on one ReplicaServer instance, and the newly added rocksdb_total_size_across_write_buffer, the amount of block-cache memory that memtables may take. The other major memory class in RocksDB is index & filter blocks, roughly proportional to the partition count and the configured max_open_files; from RocksDB 5.15 onward this memory can also be managed by the block cache, so we introduced the rocksdb_cache_index_and_filter_blocks parameter, set to true to enable the feature.
  32. As mentioned, we also compared jemalloc against the existing tcmalloc, with four configurations, two per allocator: (1) jemalloc configs 1 and 2 are broadly similar — the dirty and muzzy parameters are the completion times of the two GC phases (0 means GC immediately), and background_thread=true reclaims memory proactively, so config 2 is a more aggressive reclamation strategy than config 1; (2) tcmalloc config 1 is an aggressive reclamation strategy, checking every 10 seconds and retaining no memory, while config 2 makes no proactive-reclamation settings. RocksDB is configured, as just described, with both memtables and index & filter blocks capped by the block cache.
  33. The performance comparison shows jemalloc's single-put QPS clearly above tcmalloc's, with everything else roughly equal.
  34. This is the single-put QPS monitoring graph, in the same order as the preceding comparison table (ignore the second curve from the left); jemalloc's QPS stays above tcmalloc's for the whole run.
  35. Next, the memory-limiting results. This graph shows single put followed by single get: except for tcmalloc config 2, all configurations reclaim memory aggressively and keep total memory well under control.
  36. This graph shows scan followed by single put: the first several configurations reclaim memory aggressively and keep total memory under 12 GB, while tcmalloc config 2 fails to contain it, reaching roughly 15-16 GB.
  37. In the last part, let's look at the Pegasus community's recent development as a whole.
  38. First, recent development work. We are reimplementing Pegasus's metrics framework, as described in detail earlier. We are also strengthening existing key features to improve their stability and usability, e.g., backup and restore, and duplication. For access control, we plan to integrate with Apache Ranger to make it more convenient and easy to use. For operations and admin tools, we are switching to new tools that will be friendlier to both developers and users. We will also support more CPU architectures — including some Chinese domestic CPUs on our roadmap — and more operating systems, including some domestic ones, as well as macOS and Apple Silicon, making local development and experimentation easier for our developers.
  39. Now the new release: version 2.4.0 is being prepared by the community, containing many performance optimizations, new features, and enhancements; only part of the list is shown here. For example, the original multiple WALs have been consolidated into one; new features include dynamically changing a table's replication factor and read throttling; and new APIs include batchGetByPartitions which, as shown earlier, brings a significant performance improvement.
  40. Finally, community activities. Last year the Pegasus community held its first offline meetup, and this autumn we are planning a second one. We also hold online meetups from time to time. Online or offline, we encourage Pegasus developers, users, and anyone interested to join us and exchange ideas.
  41. Thank you all for watching. You are welcome to join the Pegasus community and discuss technology with us.