The Architecture Overview of OceanBase Database

Charlie Yang
rizhao.ych@oceanbase.com

About OceanBase
• Distributed SQL database, in development since 2010
• Has served all Alipay payment requests since 2017; peaked at 61 million QPS on November 11, 2019
• Adopted by 400+ customers in mission-critical scenarios
• TPC-C: 707 million tpmC (No. 1); TPC-H: 15 million QphH@30,000GB (No. 2)
• Scalable OLTP: linear scalability with strong consistency and high availability
• HTAP: real-time operational analytics in one unified system
• Compatible with MySQL, with high performance and much lower cost

Design Goals
Monolithic Database
• Full SQL functionality
• High single-node performance
Distributed Storage System
• High scalability, high availability
• Key-value store or limited SQL functionality
OceanBase = a distributed SQL database with full SQL support and high single-node performance

Scalable OLTP
Unlimited Storage in One Cluster
• Up to 1,000+ servers
• 6 PB+ of stored data
• 320 billion+ records (in a single table)
Linear Scalability
• Fault tolerance using Paxos
• Distributed transactions using 2PC
• Data shuffle at partition granularity
[Figure: three zones (ZONE_1, ZONE_2, ZONE_3), each with two OBServers hosting replicas of partitions P1 to P8. The replicas of a partition across the zones form a Paxos group with leader and follower roles.]

Real-time Operational Analytics
HTAP Integration
Provides services for real-time operational analytics scenarios
• Heavy OLAP workloads: a dedicated replica serves OLAP
• Light OLAP workloads: OLTP and OLAP run on the same replica (mixed row-columnar storage)
[Figure: six servers (Server1 to Server6) spread across IDC1, IDC2, and IDC3 host leader and follower replicas of partitions P1 to P8 in Paxos groups. Three proxies sit in front of the servers; the traffic shown is OLTP business and OLAP real-time data analytics.]

TPC Benchmark
• 2019: TPC-C, 60.88 million tpmC
• 2020: TPC-C, 707 million tpmC
• 2021: TPC-H, 15.26 million QphH (30,000 GB dataset)

Architecture
• Each cluster consists of several zones in one or multiple regions.
• OBProxy routes requests to OBServers.
• Each OBServer is similar to a classical RDBMS: it compiles SQL statements into SQL execution plans.
• One OBServer is elected to host the root service.
• Redo logs are replicated among the zones using Paxos.
• Transactions that touch only one partition are executed locally.
• Transactions that span multiple partitions are executed using 2PC.
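
The split between single-partition and multi-partition transactions can be sketched as follows. This is a minimal illustration, not OceanBase code: the `Partition` class, the `execute` function, and the return labels are hypothetical, and the sketch omits logging, timeouts, and failure recovery.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical stand-in for a partition leader that stages and applies writes."""
    name: str
    data: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)

    def prepare(self, writes: dict) -> bool:
        # Stage the writes and vote yes; a real participant would also persist this step.
        self.staged = dict(writes)
        return True

    def commit(self) -> None:
        self.data.update(self.staged)
        self.staged = {}

    def abort(self) -> None:
        self.staged = {}


def execute(txn: list) -> str:
    """txn is a list of (partition, writes) pairs, one pair per partition touched."""
    if len(txn) == 1:
        # Single-partition transaction: apply locally, no coordination round.
        partition, writes = txn[0]
        partition.data.update(writes)
        return "local"
    # Phase 1 (prepare): every participant must vote yes.
    votes = [partition.prepare(writes) for partition, writes in txn]
    if not all(votes):
        for partition, _ in txn:
            partition.abort()
        return "aborted"
    # Phase 2 (commit): all participants voted yes, so commit everywhere.
    for partition, _ in txn:
        partition.commit()
    return "2pc"
```

For example, `execute([(p1, {"a": 1})])` takes the local path, while `execute([(p1, {"a": 2}), (p2, {"b": 3})])` runs both phases.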

Basic Concept
[Figure: concept hierarchy. Labels: Cluster, Zone, OBServer, Admin, APP, Tenant, Database, Table, Partition, Replica, Resource Pool. Each cluster has multiple zones; each zone has multiple OBServers; each resource pool has multiple resource units.]
• Zone: availability zone, an IDC in most cases
• Multi-tenant architecture: divides each cluster into multiple resource pools owned by tenants; resource isolation is handled internally by the database

Transaction Engine
[Figure: three zones (ZONE_1, ZONE_2, ZONE_3), each with two OBServers hosting replicas of partitions P1 to P8. The replicas of a partition across the zones form a Paxos group with leader and follower roles.]
• Paxos: quorum-based consensus (2 out of 3, or 3 out of 5 replicas)
• High availability: RPO = 0, RTO < 30 s
  • RPO: recovery point objective
  • RTO: recovery time objective
• Distributed transactions
  • Two-phase commit (2PC)
  • MVCC and snapshot isolation
  • Linearizability: uses GTS (Global Transaction Service) to retrieve the globally unique ID for each transaction
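
Two of the ideas above can be illustrated in a few lines: the majority quorum size behind "2 out of 3, or 3 out of 5", and a snapshot read over multiple versions. This is a rough sketch under stated assumptions: `majority` and `SnapshotStore` are hypothetical names, and the local counter only stands in for a global service that hands out increasing IDs.

```python
import itertools


def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return replicas // 2 + 1


class SnapshotStore:
    """Hypothetical multi-version store; the counter stands in for a global ID service."""

    def __init__(self):
        self._clock = itertools.count(1)
        self._versions = {}  # key -> list of (commit_ts, value) in commit order

    def now(self) -> int:
        # Hand out the next strictly increasing timestamp.
        return next(self._clock)

    def write(self, key, value) -> int:
        commit_ts = self.now()
        self._versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None
```

A reader that took its snapshot before a later write keeps seeing the older version, which is the behavior snapshot isolation relies on.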

Storage Engine
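[Figure: storage engine layout. In memory: MemTable (WOS) holding row-level in-memory redo/MVCC data, indexed by an in-memory hash and in-memory B+-trees, plus the Row Cache and Block Cache. On disk: Minor SSTable and Major SSTable (ROS). Updates are logged and sent to replicas; the read paths shown are Get (small query) and Scan (big query).]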
• LSM tree: MemTable and SSTable
• Compaction: merges several SSTables and MemTables into a single SSTable
• MemTable: B-tree and hash index
• SSTable: divided into data blocks, ordered by primary key
• Macro block: mostly 2 MB; the write unit
• Micro block: mostly 8 KB to 512 KB; the unit for reads, encoding, and compression
• Cache: Row Cache (for single-row gets) and Block Cache (for scans)
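
The MemTable/SSTable flow and compaction described above can be sketched with a toy in-memory structure. This is an illustration only: `TinyLSM` and its `memtable_limit` parameter are hypothetical, plain Python lists stand in for on-disk SSTables, and blocks, caches, and deletes are left out.

```python
class TinyLSM:
    """Hypothetical in-memory illustration of the MemTable/SSTable flow."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}   # newest writes, mutable
        self.sstables = []   # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Freeze the memtable into a run sorted by primary key.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self) -> None:
        # Merge all runs and the memtable into one sorted run.
        merged = {}
        for run in self.sstables:   # oldest first, so newer values overwrite
            merged.update(run)
        merged.update(self.memtable)
        self.sstables = [sorted(merged.items())] if merged else []
        self.memtable = {}
```

After compaction a lookup touches a single sorted run instead of several, which is the usual reason for paying the merge cost.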

SQL Engine
• The fast parser attempts to match an existing plan in the plan cache.
• The resolver translates the SQL request into a statement tree.
• The transformer analyzes and rewrites the user SQL.
• The optimizer (a System-R-like cost-based optimizer) performs query transformation and optimization.
• The code generator generates code for the chosen plan.
• Vectorized execution and parallel execution are used for large OLAP queries.
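
The fast-parser step can be pictured as a plan cache keyed by a parameterized statement. The sketch below is an assumption about how such matching might work, not OceanBase's implementation: `parameterize`, `PlanCache`, and the `build_plan` callback are hypothetical, and the regular expression is a simplification rather than a SQL lexer.

```python
import re

# Matches quoted strings or standalone numeric literals (simplified, not a full SQL lexer).
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def parameterize(sql: str):
    """Replace literals with '?' and return (template, parameters)."""
    params = []

    def swap(match):
        params.append(match.group(0))
        return "?"

    return _LITERAL.sub(swap, sql.strip()), params


class PlanCache:
    """Hypothetical cache mapping a parameterized statement to a compiled plan."""

    def __init__(self):
        self._plans = {}
        self.hits = 0
        self.misses = 0

    def get_plan(self, sql, build_plan):
        template, params = parameterize(sql)
        plan = self._plans.get(template)
        if plan is None:
            # Slow path: resolve, transform, optimize, and generate code.
            self.misses += 1
            plan = build_plan(template)
            self._plans[template] = plan
        else:
            # Fast path: reuse the cached plan with new parameters.
            self.hits += 1
        return plan, params
```

Two statements that differ only in their literals share one template, so the second one skips the slow path.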

Comparison with Other Distributed SQL Databases
Feature               OceanBase          Cloud Spanner  CockroachDB  TiDB
Multi-tenancy         YES                YES            YES          NO
SQL Compatibility     MySQL              PostgreSQL     PostgreSQL   MySQL
SQL Join              YES                YES            YES          YES
Foreign Key           YES                YES            YES          NO
XA, Stored Procedure  YES                NO             NO           NO
Multi-Model           JSON, GIS, KV API  JSON, KV API   JSON, GIS    JSON, TiKV
HTAP                  YES                NO             NO           YES
Global Database       NO                 YES            YES          NO
Replication           Paxos              Paxos          Raft         Raft
Global Time           GTS                TrueTime       HLC          GTS
Linearizability       YES                YES            NO           YES
Commit Wait           NO                 YES (7~10 ms)  NO           NO
Sysbench (1 node)     ≈ MySQL            < 1/3 MySQL    < 1/3 MySQL  < 1/3 MySQL
Language              C++                C++            Go           Go, Rust

OceanBase @ Alipay
• Peak performance: 61 million queries per second
• Cluster size: > 200 nodes in one cluster
• Data size in one instance: > 6 PB
• Single-table size: > 320 billion rows
• Disaster tolerance: RPO = 0, RTO < 30 seconds
• Systems shown: Trade, Payment, Accounting, CIF, Promotion, Real-time data
* Statistics above are from real production at Alipay.

Outside Validation on Mission-Critical Systems Across Industries
Used by 400+ customers on mission-critical systems
• Mission-critical BOSS and CRM systems
• Debit and credit card transaction, billing, and accounting systems
• High-concurrency scenarios: payment, accounting, and customer information systems
Industries: FIs, high tech, telco

Thank you!
https://github.com/oceanbase/oceanbase
https://www.oceanbase.com/en/