1. Database in the Cloud Era
PolarDB
A database architecture for the Cloud
Date: April 9th, 2019
Time: 15:50-17:50
Venue: 7003 @ Parisian Grand Ballroom, The Parisian Macao
2. Manyi Lu
Senior Director @ Alibaba Cloud
Bio:
Manyi Lu has 20 years of experience in the database field. She currently works at Alibaba Cloud, leading MySQL RDS and PolarDB development. Previously, she was an engineering director at Oracle, heading the MySQL optimizer team. She has also held various product manager and engineering manager positions at Sun Microsystems.
3. POLARDB: a Cloud Native Database
Emerging Hardware
• NVM
• RDMA
• FPGA
Auto scaling
• Scaling up/down
• Pay by Usage
• Zero Downtime
Security
• Encryption
• Audit
• Access Control
Intelligence
• Self-configuration
• Self-optimization
• Self-diagnosis
• Self-healing
CLOUD NATIVE
User Oriented
4. Cloud Native Architecture
• Scale compute and storage independently
• Shared storage
• Cross-AZ failover
• Optimize division of functionality between storage and compute
• Tight integration with other cloud components like metering, monitoring, and the control plane
• Optimize for hardware in the data centers
• Compatible with MySQL/PostgreSQL, etc.
• Security
[Diagram: PolarProxy (intelligent proxy, 100% compatible) in front of PolarDB, which sits on PolarFS and PolarStore (storage optimized for database)]
5. PolarStore: Architecture Overview
- Designed for Emerging Hardware
- Low Latency Oriented
- Active R/W - Active RO
- High Availability
Key Components: 1. libpfs 2. PolarSwitch 3. ChunkServer 4. PolarCtrl
[Diagram: on each host, POLARDB accesses its volume through libpfs and PolarSwitch; volumes are divided into chunks stored on ChunkServers and replicated with ParallelRaft; PolarCtrl holds the metadata, with separate data and control routes]
6. PolarStore: Design for Emerging Hardware
- No context switch
- OS bypass & zero-copy, network over RDMA
- Parallel random I/O absorbed by Optane
- Excellent performance with fewer long-tail latency issues
- No need for over-provisioning
- WAL log in 3D XPoint Optane
[Diagram: POLARDB writes via libpfs to shared memory and over the RDMA network to three Chunkservers, each equipped with an RDMA NIC, memory, Optane, and NVMe SSDs]
7. Dynamic Scaling
Fast scaling:
- Upgrade from 2 vCPU to 32 vCPU in only 5 minutes
- Add more replicas in only 5 minutes
Lower cost: 30%-50% off
[Diagram: MySQL master and replicas each on local storage vs. POLARDB master and replicas on shared storage]
[Chart: total cost of 4 vCPU / 32 GB memory / 500 GB storage by replica count; RDS MySQL grows from 6,940 (1 replica) to 39,844 (10 replicas), while POLARDB grows from 4,949 to 20,949]
8. Shared-Nothing Logical Replication vs. Shared-Storage Physical Replication
[Diagram: with shared-nothing logical replication, master and slave each keep their own data, redo log, and binlog on local storage and ship the binlog; with shared-storage physical replication, master and slaves share one copy of the data and redo log on POLARSTORE]
Physical Replication is much more reliable than Logical Replication
9. Shared-Nothing Logical Replication vs. Shared-Storage Physical Replication
10. HTAP - Parallel Query
Reduce latency of complex queries: one query is executed by multiple workers on the server, with a parallel scan in the storage engine.
[Chart: DBT-3 Query 6 linear scalability; latency (log scale, 2-1024) vs. 1-32 workers for tpch5/10/20/40, each curve tracking its ideal linear-scalability line]
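The one-query/multiple-workers idea above can be sketched as a partitioned scan with partial aggregation. This is a toy illustration, not PolarDB's actual executor; the table, filter, and worker count are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy row store: (row_id, revenue) pairs standing in for a table like lineitem.
ROWS = [(i, i % 97) for i in range(100_000)]

def scan_partition(part):
    # Each worker scans one slice of the table and applies the
    # filter and aggregate locally (a partial SUM, as in DBT-3 Q6).
    return sum(rev for _, rev in part if rev > 50)

def parallel_query(rows, workers=4):
    # Split the table into contiguous ranges, one per worker,
    # then merge the partial sums on the coordinator.
    step = (len(rows) + workers - 1) // workers
    parts = [rows[i:i + step] for i in range(0, len(rows), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_partition, parts))

# The parallel plan must return the same answer as a serial scan.
assert parallel_query(ROWS) == scan_partition(ROWS)
```

Partitioning at the storage layer is what makes the scan itself parallel; the merge step here is trivial because SUM is decomposable.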
11. Single-Master Architecture
- Single endpoint
- Transparent failover
- Attack protection
- Causal consistent read
- Read/write split
- High availability
- Load balance
- Security
[Diagram: application connects to a proxy cluster in front of the master and replicas on shared storage]
12. Read/Write Split - Session Consistency
I am Manyi Lu. I previously worked in the MySQL optimizer team at Oracle.
We at Alibaba Cloud have been offering relational databases as a service for many years. Based on our extensive experience with RDS, we have developed a cloud native database, which we believe can better serve our cloud customers.
What makes PolarDB unique:
It is built on emerging hardware. A cloud native database allows us to tightly integrate software and hardware in a way that is hard to achieve on premises. PolarDB's storage layer leverages a range of modern hardware: NVMe, RDMA, and FPGA.
One of the top benefits of moving to the cloud, compared to on-premises, is that users can seamlessly scale up and down based on their business needs.
Security is always a concern when users move to the cloud. Rest assured, we offer transparent data encryption, auditing, and access control.
Intelligence: since we offer a managed service, we manage hundreds of thousands of database instances, and we need automation both to reduce our maintenance cost and to give our users a better experience. We have an extensive monitoring system in place, and on top of it automatic failure detection, self-diagnosis, and self-healing. Cloud is all about economy of scale; we must reduce human involvement to achieve that.
PolarStore is distributed shared storage consisting of a cluster of ChunkServers. Data is divided into chunks; each chunk has three identical copies, and the ParallelRaft protocol guarantees consistency between the replicas. PolarFS is a file system developed specifically for PolarDB; it allows the database engine to access PolarStore as if it were local storage.
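The chunk layout described above can be sketched as follows. This is a toy model: the chunk size, the server names, and the rendezvous-hashing placement are all assumptions for illustration; the talk does not describe PolarCtrl's actual placement policy.

```python
import hashlib

CHUNK_SIZE = 10 * 2**30  # assume 10 GB chunks (illustrative)
REPLICAS = 3             # each chunk is stored on three ChunkServers
CHUNKSERVERS = ["cs-%d" % i for i in range(12)]  # hypothetical cluster

def chunk_index(offset):
    """Which chunk of a volume a byte offset falls into."""
    return offset // CHUNK_SIZE

def placement(volume, chunk):
    """Pick three distinct ChunkServers for a chunk via toy rendezvous
    hashing: score every server against the chunk id and take the top 3."""
    def score(cs):
        h = hashlib.sha256(("%s/%d/%s" % (volume, chunk, cs)).encode())
        return int(h.hexdigest(), 16)
    return sorted(CHUNKSERVERS, key=score)[:REPLICAS]

# A given chunk always maps to the same three servers.
replicas = placement("volume-1", chunk_index(0))
```

In the real system, the three copies are kept consistent with ParallelRaft rather than being written independently; the point here is only the chunking and 3-way placement.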
PolarDB supports RoCE Ethernet networks, where an application passes through the RDMA network to write the contents of its memory directly into the memory of the target machine, or to read data directly from the target machine's memory into its own. In between, the communication protocol codec and relay mechanism are both handled by the RDMA NIC, without any participation from the CPU.
PolarFS uses a full user-space I/O stack, including RDMA and SPDK, to avoid the overhead of the kernel network stack and storage stack.
PolarDB uses leading hardware technology. It uses an Optane storage card as a cache in front of NVMe SSDs, which ensures stable, low write latency and high throughput while keeping the whole system cost-effective.
PolarDB uses an RDMA network between chunk servers, and between the DB nodes and the storage layer. RDMA has more or less removed networking as the performance bottleneck: over RDMA, an application can write the contents of its memory directly into the memory of the target machine, or read data directly from the target machine's memory into its own, without any participation from the CPU.
When your database load increases, you will want to add more replicas to support it. In POLARDB, adding more replicas does not require extra disk space: all server instances use the same shared data files, whereas with traditional MySQL replication each server keeps its own copy of the data. So as the number of replicas grows, the cost savings of POLARDB grow with it. Scaling out is also much faster, since there is no need to copy data when adding replicas.
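The arithmetic behind "savings grow with the replica count" can be made concrete with a toy cost model. The prices below are invented illustrative units, not real Alibaba Cloud pricing; only the structure (storage paid once vs. once per node) reflects the argument above.

```python
COMPUTE = 1600  # assumed per-node compute cost (4 vCPU / 32 GB), per month
STORAGE = 1700  # assumed cost of one 500 GB copy of the data, per month

def rds_cost(replicas):
    # Shared-nothing: master plus each replica stores a full copy.
    nodes = 1 + replicas
    return nodes * (COMPUTE + STORAGE)

def polardb_cost(replicas):
    # Shared storage: every node pays for compute, the data is stored once.
    nodes = 1 + replicas
    return nodes * COMPUTE + STORAGE

for n in (1, 5, 10):
    saving = 1 - polardb_cost(n) / rds_cost(n)
    print("%2d replicas: saving %.0f%%" % (n, 100 * saving))
```

With these assumed prices the savings land roughly in the 30%-50% range the slide claims, and they approach `STORAGE / (COMPUTE + STORAGE)` as the replica count grows.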
POLARDB uses physical replication instead of the logical replication used in traditional MySQL replication. This means the redo log, which InnoDB writes in order to recover from failures, is also used to replicate changes to other servers. So while logical replication writes both a logical log (the binlog) and a physical log (the redo log) to local storage on every server, all POLARDB servers share the same redo log files.
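One reason physical replication is considered more reliable is that logical replay can diverge on non-deterministic statements, while applying the physical result cannot. The sketch below is a toy model of that difference, not MySQL's actual binlog or redo formats.

```python
import time

def execute(db, stmt):
    """Master-side execution: run the statement, and return a 'redo record'
    (the resulting value) that physical replication would ship."""
    key, func = stmt
    db[key] = func()
    return (key, db[key])

master, logical_replica, physical_replica = {}, {}, {}
stmt = ("t", time.time)  # non-deterministic: "store the current time"

redo = execute(master, stmt)

# Logical replication re-executes the statement on the replica,
# so it may observe a different clock value than the master did.
execute(logical_replica, stmt)

# Physical replication applies the resulting value, byte for byte,
# so the replica is always identical to the master.
physical_replica[redo[0]] = redo[1]
assert physical_replica == master
```

Real logical replication mitigates this with row-based logging, but the redo log sidesteps the problem entirely by describing page changes rather than statements.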
Left: binlog replication is statement based, so a DDL statement starts on the replica only AFTER it has completed on the master. The master can execute other transactions and generate log against the new schema as soon as the DDL operation completes; on the slave, the DDL operation starts late, and the slave must accumulate all the log from the master until the DDL has completed.
Right: PolarDB physical replication. No data changes are needed on the replica thanks to the shared disk, but MDL locking is still required, along with invalidation of the dictionary cache on the replica.
Currently, we only support a single master with multiple replicas.
Single endpoint: one endpoint; the proxy handles load balancing and read/write splitting.
Transparent failover: when the master fails, fail over to an RO node. On an AZ failure, fail over to another AZ.
Causal consistent read: read your own writes. MySQL asynchronous/semi-synchronous replication does not provide this, due to replication lag.
On commit of an update (e.g. UPDATE user SET ...), the master includes the LSN in the same packet as the commit ack and sends it to the proxy.
As an example, take an application that updates one row, commits the update, and then tries to read the same row. We need to make sure that the recently committed version of the row is visible at the read replica where it is to be read. To achieve this, the master returns the log sequence number of the update to the proxy, and the proxy tells the read replica to make sure the corresponding redo log record has been applied before the read is performed.
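That read-your-own-writes flow can be sketched as follows. This is a toy model with invented class and method names; in the real system the redo application is asynchronous and the proxy would wait (or hedge to the master) rather than poll.

```python
class Master:
    def __init__(self):
        self.lsn = 0
        self.data = {}
    def commit(self, key, value):
        self.lsn += 1
        self.data[key] = value
        return self.lsn  # LSN piggybacked on the commit ack

class Replica:
    def __init__(self):
        self.applied_lsn = 0
        self.data = {}
    def apply(self, lsn, key, value):
        # Redo log application (asynchronous in the real system).
        self.data[key] = value
        self.applied_lsn = lsn
    def read(self, key, min_lsn):
        # Refuse stale reads: the caller asked for at least min_lsn.
        if self.applied_lsn < min_lsn:
            raise RuntimeError("replica lagging behind session LSN")
        return self.data[key]

class Proxy:
    def __init__(self, master, replica):
        self.master, self.replica = master, replica
        self.session_lsn = 0  # highest LSN this session has written
    def write(self, key, value):
        self.session_lsn = self.master.commit(key, value)
    def read(self, key):
        try:
            return self.replica.read(key, self.session_lsn)
        except RuntimeError:
            return self.master.data[key]  # fall back to the master

m, r = Master(), Replica()
proxy = Proxy(m, r)
proxy.write("x", 1)
v1 = proxy.read("x")  # replica lags, so this is served by the master
r.apply(1, "x", 1)
v2 = proxy.read("x")  # now the replica is caught up and serves the read
```

The key design point is that causality is tracked per session with a single integer, so the proxy never needs to understand the data itself, only compare LSNs.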