This document discusses extending shared storage to MyRocks in PolarDB by implementing RocksDB write-ahead logging (WAL) replication. It describes converting system tables to RocksDB, replicating the WAL, DDL operations, caches, and index statistics. Challenges with DDL replication during primary/replica crashes are addressed. Multi-version concurrency control (MVCC) is implemented based on RocksDB snapshots to maintain consistency between primary and replicas.
2. MORE THAN JUST CLOUD
Agenda
• Background
• Basic Architecture
• Implementation details
• Convert system tables to RocksDB
• RocksDB WAL/Manifest Replication
• DDL Replication
• Cache Replication
• Index Statistics Replication
• New Log Format
• MVCC
Background
Why POLARDB for MyRocks
Benefits from MyRocks
• Great space efficiency, better compression
• Great write efficiency, lower write amplification
• Fast data loading
Benefits from shared storage
• Guaranteed data consistency
• Ability to add read nodes immediately, without a full copy of the data
Basic Architecture
Primary
• Accepts read/write workload
Replica
• Accepts read-only workload
• Shares SST and WAL files with the primary
Replace binlog replication with WAL replication
Let’s Begin
Prepare for RocksDB WAL replication
• Based on AliSQL 5.7
• Port MyRocks from Facebook
• Remove InnoDB; support only the RocksDB and MyISAM engines
• Convert system tables to RocksDB
Convert system tables to RocksDB
Prepare for RocksDB WAL replication
• Convert system tables to RocksDB
• Exceptions: mysql.slow_log and mysql.general_log are stored on local disk, so the primary and each replica have their own copies of these tables.
RocksDB WAL/Manifest replication
Architecture
RocksDB WAL/Manifest replication
Asynchronous replication
WAL Replication
• Replay PUT/DELETE/MERGE
Manifest Replication
• Replay flush & compaction
WAL and Manifest Coordination
• Apply a VEdit only when applied LSN > VEdit LSN
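The coordination rule can be sketched as a small replay loop on the replica. This is a minimal illustration, not PolarDB's code: `PendingEdit`, `ReplicaApplier`, and the plain integer LSNs are hypothetical names introduced for the example, and the sketch assumes manifest edits arrive in LSN order.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for a manifest VEdit, tagged with the LSN at which
// the primary logged it.
struct PendingEdit {
  uint64_t lsn;
  int id;  // placeholder for the edit payload (a flush or compaction result)
};

class ReplicaApplier {
 public:
  // Queue an edit read from the manifest (assumed to arrive in LSN order).
  void EnqueueEdit(PendingEdit e) { pending_.push_back(e); }

  // Record WAL replay progress, then apply every queued edit whose LSN is
  // strictly below the applied LSN (the "applied LSN > VEdit LSN" rule).
  void AdvanceAppliedLsn(uint64_t lsn) {
    if (lsn > applied_lsn_) applied_lsn_ = lsn;
    while (!pending_.empty() && applied_lsn_ > pending_.front().lsn) {
      applied_.push_back(pending_.front().id);
      pending_.pop_front();
    }
  }

  const std::vector<int>& applied() const { return applied_; }
  std::size_t pending() const { return pending_.size(); }

 private:
  uint64_t applied_lsn_ = 0;
  std::deque<PendingEdit> pending_;
  std::vector<int> applied_;
};
```

Holding an edit back until WAL replay has passed its LSN keeps the replica from switching to an SST layout whose contents it has not yet seen in its own memtables.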
RocksDB WAL/Manifest replication
Control Primary WAL and SST file deletion
WAL deletion: the original WAL deletion logic would cause a replica to lose WAL files it still needs
• Lm: min_log_number on the primary
• Ln: min_log_number across all replicas
• new_min_log_number = min(Lm, Ln)
• A WAL can be deleted when its number < new_min_log_number
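The WAL deletion rule is a small piece of arithmetic, sketched below. The function names and the use of a vector of per-replica values are assumptions made for illustration; the rule itself (the minimum of Lm and Ln, then a strict less-than test) is the one on the slide.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// lm is the primary's min_log_number; replica_min_logs holds each replica's
// min_log_number. With no replicas attached, the primary's value is used
// (an assumption; the slide does not cover that case).
uint64_t NewMinLogNumber(uint64_t lm,
                         const std::vector<uint64_t>& replica_min_logs) {
  uint64_t result = lm;
  for (uint64_t ln : replica_min_logs) result = std::min(result, ln);
  return result;
}

// Return the WAL numbers that are safe to delete: those strictly below the
// threshold computed above.
std::vector<uint64_t> DeletableWals(const std::vector<uint64_t>& wal_numbers,
                                    uint64_t new_min_log_number) {
  std::vector<uint64_t> out;
  for (uint64_t n : wal_numbers) {
    if (n < new_min_log_number) out.push_back(n);
  }
  return out;
}
```

For example, with Lm = 12 and replicas at 9 and 11, the threshold is 9, so of the WALs 7, 8, 9, and 10 only 7 and 8 may be removed.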
SST deletion: the original SST deletion logic would leave a replica unable to find an SST, causing it to crash
• min_version_number: the minimum version number that a replica is using
• An SST can be deleted only when it will no longer be used by the primary or any replica
DDL replication
Remove .frm and .par files
• Store their contents in RocksDB
• Replicas can read multiple versions of a table schema
• DDL replication is asynchronous
DDL replication
Primary
• Log MDL lock start and end.
Replica
• Replay MDL lock start
A. lock MDL
• Replay MDL lock end
A. update table cache in MyRocks
B. unlock MDL
The MDL lock protects DDL operations on the primary. The replica needs the same lock while it replays DDL.
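The replica-side replay steps can be sketched as follows. `DdlReplayer` is a hypothetical class for illustration only: a set of table names stands in for held MDL locks, and a counter stands in for refreshing the MyRocks table cache.

```cpp
#include <set>
#include <string>

// Hypothetical replica-side state: the tables whose MDL lock is held on
// behalf of a DDL being replayed, plus a counter standing in for table
// cache refreshes.
class DdlReplayer {
 public:
  // Replay "MDL lock start": take the lock before the DDL changes arrive.
  void ReplayStart(const std::string& table) { locked_.insert(table); }

  // Replay "MDL lock end": refresh the cached table definition, then unlock.
  // Returns false if no lock was held for the table.
  bool ReplayEnd(const std::string& table) {
    if (locked_.count(table) == 0) return false;
    ++cache_refreshes_;    // A. update table cache
    locked_.erase(table);  // B. unlock MDL
    return true;
  }

  bool IsLocked(const std::string& table) const {
    return locked_.count(table) > 0;
  }
  int cache_refreshes() const { return cache_refreshes_; }

 private:
  std::set<std::string> locked_;
  int cache_refreshes_ = 0;
};
```

Updating the cache before releasing the lock means a reader that acquires the MDL afterwards sees the new table definition.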
Cache replication
ACL, Procedure, and Query Cache Replication
Primary
• Log cache changes (ACL, procedure, query cache) in the RocksDB WAL
Replica
• Replay the change from the WAL and invalidate the affected cache
Index Statistics Replication
Persistent
• Partial index statistics are persisted in each SST
• Total index statistics are stored in INDEX_STATISTICS
Memory
• Rdb_key_def::m_stats
Update
• Analyze table
• Flush memtable
• Compaction
Log these update operations and replay them on the replica
New Log Format
Log changes for replication
Log Types
• DDL(START, END)
• Cache change, ACL/Proc
Log format
• PUT/DELETE
Log store location
• __system__ column family
New Log Format
New type in data dictionary
// Data dictionary types
enum DATA_DICT_TYPE {
  DDL_ENTRY_INDEX_START_NUMBER = 1,
  INDEX_INFO = 2,
  CF_DEFINITION = 3,
  BINLOG_INFO_INDEX_NUMBER = 4,
  DDL_DROP_INDEX_ONGOING = 5,
  INDEX_STATISTICS = 6,
  MAX_INDEX_ID = 7,
  DDL_CREATE_INDEX_ONGOING = 8,
  POLAR_LOG = 100,  // for polar replication
  END_DICT_INDEX_ID = 255
};

enum POLAR_LOG_TYPE {
  TABLE_DDL = 1,
  CACHE_CHANGE = 2,
  ……
  END_POLAR_ROCK_TYPE = 255
};
New Log Format
New type in data dictionary
DDL_START
• type: PUT
• key: POLAR_LOG+TABLE_DDL+dbname.tablename
• value: NULL
DDL_END
• type: DELETE
• key: POLAR_LOG+TABLE_DDL+dbname.tablename
• value: NULL
CACHE_CHANGE
• type: PUT
• key: POLAR_LOG+CACHE_CHANGE+ACL/Proc
• value: NULL
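A key of the form POLAR_LOG+TABLE_DDL+dbname.tablename could be built as sketched below. The slides do not specify the byte layout, so the 4-byte big-endian encoding of each type number is an assumption, and `AppendU32` and `MakePolarLogKey` are hypothetical helper names.

```cpp
#include <cstdint>
#include <string>

// Type numbers taken from the enums on the previous slide.
enum : uint32_t { POLAR_LOG = 100 };
enum : uint32_t { TABLE_DDL = 1, CACHE_CHANGE = 2 };

// Append a 32-bit value in big-endian order (assumed width and byte order).
void AppendU32(std::string* out, uint32_t v) {
  for (int shift = 24; shift >= 0; shift -= 8) {
    out->push_back(static_cast<char>((v >> shift) & 0xFF));
  }
}

// Build POLAR_LOG + log_type + suffix, e.g. suffix "db1.t1" for TABLE_DDL.
std::string MakePolarLogKey(uint32_t log_type, const std::string& suffix) {
  std::string key;
  AppendU32(&key, POLAR_LOG);
  AppendU32(&key, log_type);
  key += suffix;
  return key;
}
```

Because DDL_START is a PUT and DDL_END is a DELETE on the same key, a TABLE_DDL key is present in the store exactly while a DDL on that table is in flight.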
New Log Format
Problems
DDL_START and DDL_END must come as a pair.
Problem 1: Primary Crash
• If the primary crashes after logging DDL_START, it re-sends DDL_START on restart, and the DDL_END for the earlier DDL_START is lost.
• The replica replays the first DDL_START and holds the MDL lock, but no DDL_END arrives to release it.
Problem 2: Replica Crash
• If a replica crashes after replaying DDL_START, it continues by replaying DDL_END on restart.
• The lock taken at DDL_START no longer exists after the restart, so the replica replays DDL_END to release an MDL lock that does not exist.
New Log Format
Solutions
Primary Crash
• On restart, the primary scans RocksDB for TABLE_DDL records; if one is found, the primary re-sends DDL_END, and the replica releases the old lock
Replica Crash
• On restart, the replica scans RocksDB for TABLE_DDL records; if one is found, the replica replays DDL_START to re-acquire the lock
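Both recovery paths start with the same scan for leftover TABLE_DDL records. A minimal sketch follows, using an ordered `std::map` as a stand-in for the RocksDB column family that holds the log records; `FindPendingDdl` and the caller-supplied `prefix` (the encoded POLAR_LOG+TABLE_DDL bytes) are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Return the table names of all unfinished DDLs: keys that start with
// `prefix` and were never removed by a DDL_END.
std::vector<std::string> FindPendingDdl(
    const std::map<std::string, std::string>& store,
    const std::string& prefix) {
  std::vector<std::string> tables;
  // Keys are sorted, so all matches form one contiguous range that begins at
  // lower_bound(prefix).
  for (auto it = store.lower_bound(prefix);
       it != store.end() &&
       it->first.compare(0, prefix.size(), prefix) == 0;
       ++it) {
    tables.push_back(it->first.substr(prefix.size()));
  }
  return tables;
}
```

On the primary, each returned table would get a fresh DDL_END; on a replica, each would get a replayed DDL_START so that the later DDL_END finds a lock to release.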
MVCC
MVCC based on RocksDB snapshot
Control compaction on the primary
• Compaction on the primary must take the replica’s snapshots into account
• Only delete a record when its sequence >= Sn, where Sn is the minimum snapshot sequence on the replica
Control flush on the replica
• After a memtable flush, data that a replica snapshot needs may already have been removed from the SSTs by compaction on the primary
• Only flush when the memtable’s minimum sequence >= Sn, where Sn is the minimum snapshot sequence on the replica
Keep a consistent snapshot on the replica
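The flush rule on this slide reduces to a single comparison against Sn. A minimal sketch, assuming the replica tracks the sequence numbers of its live snapshots in a vector; `CanFlushMemtable` is a hypothetical name, and treating "no live snapshot" as "flush allowed" is an assumption the slide does not state.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Decide whether the replica may flush a memtable, given the smallest
// sequence number in that memtable and the sequences of live snapshots.
bool CanFlushMemtable(uint64_t memtable_min_seq,
                      const std::vector<uint64_t>& snapshot_seqs) {
  // Assumption: with no live snapshot there is nothing to protect.
  if (snapshot_seqs.empty()) return true;
  // Sn: the minimum snapshot sequence on the replica.
  uint64_t sn = *std::min_element(snapshot_seqs.begin(), snapshot_seqs.end());
  return memtable_min_seq >= sn;
}
```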
Future
Feature
• Online DDL
• HA
Performance
• Multi-write WAL
• Asynchronous commit