Analysis and Comparison of Data Partitioning Methods
潘凌涛

Outline
● Data partitioning in Bigtable
● The DHT method in Dynamo
● Ceph
● Two other simple but commonly used methods
● Summary

Bigtable
● Data Model
– (row:string, column:string, time:int64) → string

Bigtable
● The row keys are arbitrary strings (10-100 bytes is typical).
● Bigtable maintains data in lexicographic order by row key.
● The row range for a table is dynamically partitioned. Each row range is called a tablet.
● The tablet is the unit of distribution and load balancing.

Bigtable
● Column keys are grouped into sets called column families.
● Column families form the basic unit of access control.
● All data stored in a column family is usually of the same type and compressed together.
● A column key is family:qualifier

Bigtable
Bigtable's data model matches its physical storage layout.
Cassandra's data model imitates Bigtable's, but its physical storage layout is quite different.

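Bigtable Data Partition
● A Bigtable cluster stores multiple tables. Each table is divided into multiple tablets, and each tablet stores one row range.
● Initially a table consists of a single tablet; as the data grows, tablets are split. Each tablet is roughly 100-200 MB in size.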

● Bigtable system architecture: Master + Tablet Servers. Data is stored on GFS, and clients access it through a Client Library.
● The partition algorithm has to resolve the following mapping:
<table-name, row-key> → tablet position

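● A separate Metadata Table stores this mapping. The Metadata Table is itself stored as multiple tablets, organized as a two-level tree. The location of its root tablet is stored in Chubby.
● The row key of each METADATA table entry is the tablet's table name + end row.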
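Keying each entry by table name + end row means a tablet can be found with a single ordered search: the tablet that holds a row is the first entry whose end row is not smaller than the row key. The sketch below illustrates this under stated assumptions: the table contents, the server names, the `locate_tablet` helper, the inclusive end row, and the use of "\uffff" as an end-of-table marker are all hypothetical and are not Bigtable's actual encoding.

```python
import bisect

# Hypothetical METADATA rows, sorted by (table name, end row).
# "\uffff" stands in for "end of table" in this sketch only.
METADATA = [
    (("users", "g"), "tabletserver-1"),
    (("users", "p"), "tabletserver-2"),
    (("users", "\uffff"), "tabletserver-3"),
    (("webtable", "\uffff"), "tabletserver-1"),
]
KEYS = [key for key, _ in METADATA]


def locate_tablet(table, row_key):
    """Return the server of the first tablet whose end row >= row_key."""
    i = bisect.bisect_left(KEYS, (table, row_key))
    if i == len(KEYS) or KEYS[i][0] != table:
        raise KeyError(f"no tablet for {table!r}/{row_key!r}")
    return METADATA[i][1]


print(locate_tablet("users", "harry"))  # falls in the tablet ending at "p"
```

Because the entries are kept in sorted order, splitting a tablet only adds one new entry with the new end row; existing entries keep their keys.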

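1. The client first obtains the location of the root tablet.
2. The client queries the server hosting the root tablet to get the location of the next-level metadata tablet.
3. The client queries the server hosting that metadata tablet to get the location of the tablet holding the data.
4. One more query returns the data.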

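● At most 4 queries are needed. Because the client caches tablet locations, usually only step 4 is required. So when access locality is good, this method scales well in terms of performance.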
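The effect of the client cache on the number of queries can be sketched with a toy client that only counts round trips. Everything here is an assumption made for illustration: the `Client` class, the `_rpc` stand-in, and the per-row cache key (a real client would cache by tablet row range, so nearby rows would also hit the cache).

```python
class Client:
    """Toy client that counts network round trips per read."""

    def __init__(self):
        self.cache = {}       # (table, row_key) -> tablet server (hypothetical)
        self.round_trips = 0

    def _rpc(self, what):
        # Stand-in for a real network call; it only counts the round trip.
        self.round_trips += 1
        return what

    def read(self, table, row_key):
        server = self.cache.get((table, row_key))
        if server is None:
            self._rpc("root tablet location")            # step 1
            self._rpc("metadata tablet location")        # step 2
            server = self._rpc("tabletserver-1")         # step 3
            self.cache[(table, row_key)] = server
        return self._rpc(f"data for {row_key} from {server}")  # step 4


client = Client()
client.read("users", "alice")   # cold cache: steps 1-4
client.read("users", "alice")   # warm cache: step 4 only
print(client.round_trips)
```

With a cold cache the read costs four round trips; the repeated read costs one, which is the common case the slide describes.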

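Handling Replication and Related Issues
● Replication: provided by GFS
● Node join and departure: the Master reassigns the affected tablets to other servers; data served by the original node is briefly unavailable
● Node failure: the Master reassigns the data
● Data synchronization and conflict resolution: none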

Dynamo
● Data Model: key → value
● Dynamo uses Consistent Hashing for data partitioning. It also uses virtual nodes: each node corresponds to multiple virtual nodes.

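Dynamo
● Benefits of virtual nodes
– If a node fails, its load is spread evenly across the remaining nodes
– When a node joins, it can fetch data from multiple other nodes
– A node can set its number of virtual nodes according to its capacity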
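A minimal sketch of consistent hashing with virtual nodes may make these benefits concrete. It is an illustration under assumptions, not Dynamo's implementation: the `Ring` class, the `token` helper, the `node#i` naming of virtual nodes, and the node names are all hypothetical. MD5 is used only as a convenient hash for placing names on the ring.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, vnodes_per_node):
        # vnodes_per_node: {node name: number of virtual nodes}.
        # A more capable node can be given more virtual nodes.
        self.entries = sorted(
            (token(f"{node}#{i}"), node)
            for node, count in vnodes_per_node.items()
            for i in range(count)
        )
        self.tokens = [t for t, _ in self.entries]

    def owner(self, key):
        """First virtual node clockwise from the key's position."""
        i = bisect.bisect_right(self.tokens, token(key)) % len(self.tokens)
        return self.entries[i][1]


full = Ring({"A": 64, "B": 64, "C": 64})
without_c = Ring({"A": 64, "B": 64})   # node C has failed or left
keys = [f"key-{i}" for i in range(1000)]
moved = [k for k in keys if full.owner(k) != without_c.owner(k)]
# Only keys that C used to own change owner; all others stay in place.
print(all(full.owner(k) == "C" for k in moved))
```

Because C's virtual nodes are scattered around the ring, the keys it gives up are taken over by whichever node follows each of its virtual nodes, rather than by a single neighbor.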

Dynamo
● Partition algorithm:
(key, node-map) → node list
● Maintaining the node-map:
– Nodes join and leave explicitly
– Every node maintains the node-map table, whose entries are
● (node, token)
– A Gossip protocol keeps this table globally consistent
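The mapping (key, node-map) → node list can be sketched as a clockwise walk over the (node, token) entries that collects distinct physical nodes. This is a simplified illustration: the `preference_list` function, the `token` helper, and the sample `NODE_MAP` are assumptions, and the real system layers failure handling on top of this walk.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def preference_list(key, node_map, n):
    """(key, node-map) -> list of up to n distinct physical nodes.

    node_map is a list of (node, token) entries.
    """
    ring = sorted((t, node) for node, t in node_map)
    tokens = [t for t, _ in ring]
    start = bisect.bisect_right(tokens, token(key))
    result = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in result:   # skip further virtual nodes of the same host
            result.append(node)
        if len(result) == n:
            break
    return result


# Hypothetical node-map: three hosts with four virtual nodes each.
NODE_MAP = [(node, token(f"{node}#{i}")) for node in ("A", "B", "C") for i in range(4)]
print(preference_list("user:42", NODE_MAP, 2))
```

Since every node holds the same node-map, any node can compute this list locally, which is what makes the lookup a constant number of hops.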

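Dynamo
● Dynamo in fact borrows from the Distributed Hash Table (DHT) proposed earlier for P2P systems. The differences:
– Dynamo maintains a globally consistent node-map, which suits cluster-scale deployments (around 1,000 machines or fewer)
– O(1) location lookup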

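Handling Replication and Related Issues
● Replication: key → n nodes (preference list)
● Node join and departure: an operator explicitly issues a command to add or remove a node
● Node failure: each node detects failures of its neighbors on its own; there is no single global view. Hinted handoff and sloppy quorum mechanisms
● Data synchronization and conflict resolution: Vector Clocks are used to detect and resolve data conflicts; Read Repair and a periodic Merkle Tree algorithm synchronize data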
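How vector clocks tell an outdated version from a genuine conflict can be sketched in a few lines. The representation below (a dict from node name to counter) and the helper names `descends`, `compare`, and `merge` are assumptions chosen for illustration; the node names Sx, Sy, Sz are placeholders.

```python
def descends(a, b):
    """True if clock a has seen everything clock b has (a >= b per node)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict"   # concurrent versions; the application must reconcile


def merge(a, b):
    """Clock for a reconciled version: per-node maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


v2 = {"Sx": 2}
v3 = {"Sx": 2, "Sy": 1}   # written via node Sy on top of v2
v4 = {"Sx": 2, "Sz": 1}   # written via node Sz on top of v2
print(compare(v3, v2))    # v3 is a later version of v2
print(compare(v3, v4))    # neither descends from the other
```

Versions that one clock supersedes can be discarded automatically; only the "conflict" case needs reconciliation, after which the merged clock supersedes both inputs.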

Ceph
● Ceph: A Scalable, High-Performance Distributed File System

Ceph
● Ceph has to handle partitioning of both metadata (directories) and data (files).
● As in Bigtable, the metadata forms a tree structure, while the data is flat. Unlike Bigtable, operations on the metadata are very frequent.
● Metadata is partitioned with an algorithm called Dynamic Subtree Partitioning.

Ceph
● Data is partitioned with a method called CRUSH.
● In essence, CRUSH performs the same function as a DHT
– CRUSH: (key, node-map) → a list of n distinct storage targets.
● However, the CRUSH algorithm is more complex, and so is maintaining its node-map.
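To show only the functional shape (key, node-map) → n distinct storage targets, the sketch below uses plain highest-random-weight (rendezvous) hashing. This is explicitly not the CRUSH algorithm: it has no device hierarchy, no weights, and no placement rules. The `score` and `place` functions and the `osd*` names are hypothetical.

```python
import hashlib


def score(key, target):
    """Deterministic pseudo-random score for a (key, target) pair."""
    digest = hashlib.sha256(f"{key}/{target}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def place(key, targets, n):
    """(key, node-map) -> n distinct storage targets, highest score first."""
    return sorted(targets, key=lambda t: score(key, t), reverse=True)[:n]


targets = [f"osd{i}" for i in range(6)]   # flat stand-in for the node-map
print(place("object-1", targets, 3))
```

Like a DHT, the placement is computed from the key and the node-map alone, so any client holding the same map arrives at the same targets without asking a central directory.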

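Proxy Method 1
● Partition algorithm: key → Bucket → Storage Server
– 1. Replication: each bucket is assigned multiple storage servers
– 2. Node join and departure: modify the Partition Map
– 3. Node failure: bring the node back online promptly
– 4. Data synchronization and conflict resolution: periodic Merkle tree comparison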
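The two-step mapping key → Bucket → Storage Server can be sketched as a fixed hash into buckets plus a mutable partition map. The bucket count, the CRC32 hash, the `store*` server names, and the helper functions below are all assumptions for illustration, not details given in the slides.

```python
import zlib

NUM_BUCKETS = 8   # fixed when the system is set up (hypothetical value)

# Hypothetical partition map: bucket -> replica storage servers.
partition_map = {
    b: [f"store{b % 3}", f"store{(b + 1) % 3}"] for b in range(NUM_BUCKETS)
}


def bucket_of(key):
    """key -> bucket; this step never changes when servers come and go."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def servers_for(key):
    """bucket -> storage servers, looked up in the partition map."""
    return partition_map[bucket_of(key)]


def move_bucket(bucket, old_server, new_server):
    """Node join or departure: only the partition map is modified."""
    partition_map[bucket] = [
        new_server if s == old_server else s for s in partition_map[bucket]
    ]


print(servers_for("user:42"))
```

Keeping the key → bucket step fixed means that adding or removing a server moves whole buckets, and the proxy only has to update the partition map entries for those buckets.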

Proxy Method 2
● Replication: each bucket is assigned one storage server, and each storage server has one or more Slave Servers that back up its data

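Summary
● There are two basic choices: organize the data hierarchically or as a flat structure
– The benefit of a hierarchical organization is discoverability
● A hierarchical structure requires access patterns with good locality and also places higher consistency demands on the storage system
● Flat structures are all, in essence,
– (Key, node-map) → node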
