Storage cassandra

Roc.Yang 2011.04

Contents
1. Overview
2. Data Model
3. Storage Model
4. System Architecture
5. Read & Write
6. Other

Cassandra – From Dynamo and Bigtable

Cassandra - Overview

Cassandra - Highlights

Cassandra – Trade Offs

Cassandra From Dynamo and BigTable

Dynamo-like Features

BigTable-like Features

Brewer's CAP Theorem
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

ACID & BASE
ACID: http://en.wikipedia.org/wiki/ACID
ACID and BASE: MySQL and NoSQL: http://www.schoonerinfotech.com/solutions/general/what_is_nosql

NoSQL
NoSQL: http://en.wikipedia.org/wiki/NoSQL
http://nosql-database.org/

Dynamo & Bigtable

Dynamo & Bigtable

Dynamo Architecture & Lookup

Dynamo

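Dynamo Techniques
Problem: Technique
Balanced data distribution: improved consistent hashing, data replication
Data conflict handling: vector clocks
Temporary failure handling: hinted handoff (data is handed back once the node recovers), sloppy quorum with tunable parameters (W, R, N)
Recovery after permanent failure: Merkle hash trees
Membership and failure detection: gossip-based membership protocol and failure detection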
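The vector clocks named above for conflict handling can be illustrated with a minimal sketch. The representation (a dict from node name to counter) and the function names `increment`, `descends`, and `compare` are assumptions made for this example; they are not taken from the slides or from Dynamo's code.

```python
from typing import Dict

# Hypothetical representation: node name -> number of updates seen from that node.
VectorClock = Dict[str, int]


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if a has seen every update that b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify two versions as equal, ordered, or concurrent (a conflict)."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    return "concurrent"


v1 = increment({}, "A")   # first write, handled by node A
v2 = increment(v1, "B")   # updated via node B
v3 = increment(v1, "C")   # updated independently via node C
print(compare(v2, v1), compare(v2, v3))
```

Two versions that compare as "concurrent" are the case where the store cannot pick a winner by itself and has to keep both for later reconciliation.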

Dynamo Techniques Advantages

Dynamo: Balanced Data Distribution

Dynamo: Data Conflict Handling

Dynamo: Temporary Failure Handling

Dynamo: Permanent Failure Recovery

Dynamo: Membership and Failure Detection

Consistent Hashing - Dynamo

Bigtable

Bigtable: Data Model

Bigtable: Tablet location hierarchy

Bigtable: METADATA

Bigtable: Tablet Representation

Cassandra – Data Model

Cassandra – Data Model
KEY
ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name)
  Name: tid1, Value: <Binary>, TimeStamp: t1
  Name: tid2, Value: <Binary>, TimeStamp: t2
  Name: tid3, Value: <Binary>, TimeStamp: t3
  Name: tid4, Value: <Binary>, TimeStamp: t4
ColumnFamily2 (Name: WordList, Type: Super, Sort: Time)
  Name: aloha: (C1, V1, T1), (C2, V2, T2), (C3, V3, T3), (C4, V4, T4)
  Name: dude: (C2, V2, T2), (C6, V6, T6)
ColumnFamily3 (Name: System, Type: Super, Sort: Name)
  Name: hint1, <Column List>
  Name: hint2, <Column List>
  Name: hint3, <Column List>
  Name: hint4, <Column List>
Column Families are declared upfront.
SuperColumns are added and modified dynamically.
Columns are added and modified dynamically.

Cassandra – Data Model

Cassandra – Data Model (an example)

Cassandra – Data Model http://www.divconq.com/2010/cassandra-columns-and-supercolumns-and-rows/

Cassandra – Data Model - Cluster
Cluster

Cassandra – Data Model - Cluster
Cluster > Keyspace
Partitioners: OrderPreservingPartitioner, RandomPartitioner
Like an RDBMS schema: one keyspace per application

Cassandra – Data Model
Cluster > Keyspace > Column Family
Like an RDBMS table: separates types in an app

Cassandra – Data Model
Cluster > Keyspace > Column Family > Row
SortedMap<Name,Value> ...

Cassandra – Data Model
Cluster > Keyspace > Column Family > Row > "Column"
Name -> Value (byte[] -> byte[]), plus a version timestamp
Not like an RDBMS column: a column is an attribute of the row, and each row can contain millions of different columns
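The hierarchy above (column family > row > column, with each column a name/value pair plus a timestamp) can be mimicked with plain Python dictionaries. This is only a sketch: `Column`, `insert`, and `get_slice` are hypothetical names, and resolving two writes to the same column by keeping the higher timestamp is stated here as an assumption of the example.

```python
from typing import Dict, List, NamedTuple


class Column(NamedTuple):
    name: bytes
    value: bytes
    timestamp: int  # version timestamp supplied by the writer


# A column family maps row key -> {column name -> Column}.
ColumnFamily = Dict[bytes, Dict[bytes, Column]]


def insert(cf: ColumnFamily, row_key: bytes, name: bytes,
           value: bytes, timestamp: int) -> None:
    """Add or overwrite a column; columns are created on the fly, per row."""
    row = cf.setdefault(row_key, {})
    current = row.get(name)
    # Assumption for this sketch: the write with the newer timestamp wins.
    if current is None or timestamp >= current.timestamp:
        row[name] = Column(name, value, timestamp)


def get_slice(cf: ColumnFamily, row_key: bytes) -> List[Column]:
    """Return the row's columns sorted by name, like SortedMap<Name,Value>."""
    row = cf.get(row_key, {})
    return [row[name] for name in sorted(row)]


mail_list: ColumnFamily = {}
insert(mail_list, b"user1", b"tid2", b"second", 20)
insert(mail_list, b"user1", b"tid1", b"first", 10)
insert(mail_list, b"user1", b"tid2", b"stale", 5)  # older timestamp, ignored
print([c.name for c in get_slice(mail_list, b"user1")])
```

Note that nothing about the columns is declared in advance; only the column family (`mail_list`) exists before the first write.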

Storage Model
Write: Key (CF1, CF2, CF3)
Commit Log (on a dedicated disk): binary serialized Key (CF1, CF2, CF3)
Memtable (CF1), Memtable (CF2), Memtable (CF3)
FLUSH
Data file on disk, one entry per key: <Key name><Size of key data><Index of columns/supercolumns><Serialized column family>
Block index: <Key Name> Offset, <Key Name> Offset, for example K128 Offset, K256 Offset, K384 Offset
Bloom filter (index in memory)

Storage Model - Compactions
Input data file 1 (sorted): K1, K2, K3, ... <Serialized data>
Input data file 2 (sorted): K2, K10, K30, ... <Serialized data>
Input data file 3 (sorted): K4, K5, K10, ... <Serialized data>
MERGE SORT
Output data file (sorted): K1, K2, K3, K4, K5, K10, K30 <Serialized data>
Index file: K1 Offset, K5 Offset, K30 Offset
Bloom filter: loaded in memory
Input files: DELETED after the merge
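The merge sort in the compaction diagram can be sketched with the standard library's `heapq.merge`. The tuple layout `(key, timestamp, value)`, the integer keys, and the rule that the entry with the highest timestamp survives are assumptions of this example, not the on-disk format.

```python
import heapq
from typing import List, Tuple

Entry = Tuple[int, int, str]  # (key, timestamp, value); hypothetical layout


def compact(sstables: List[List[Entry]]) -> List[Entry]:
    """Merge several sorted tables into one sorted table with one entry per key."""
    merged: List[Entry] = []
    # heapq.merge streams the inputs in (key, timestamp) order, so the last
    # entry seen for a key is the one with the newest timestamp.
    for key, timestamp, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, timestamp, value)
        else:
            merged.append((key, timestamp, value))
    return merged


table1 = [(1, 1, "a"), (2, 1, "b"), (3, 1, "c")]
table2 = [(2, 2, "B"), (10, 2, "j"), (30, 2, "z")]
table3 = [(4, 3, "d"), (5, 3, "e"), (10, 3, "J")]
result = compact([table1, table2, table3])
print([key for key, _, _ in result])
```

Because every input is already sorted, the merge reads each file once, front to back, which is why compaction needs only sequential I/O.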

Storage Model - Write

Storage Model - Write

Storage Model - Read

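Cassandra – Storage
Cassandra's storage mechanism borrows from Bigtable's design and is built on Memtables and SSTables. Like a relational database, Cassandra records a log entry before it writes data; this log is called the commitlog. Only then is the data written to the Memtable of the corresponding Column Family, and the contents of a Memtable are kept sorted by key. A Memtable is an in-memory structure that is flushed to disk in bulk once certain conditions are met, where it is stored as an SSTable. This mechanism is comparable to a write-back cache. Its advantage is that it turns random I/O writes into sequential I/O writes, which reduces the pressure that heavy write loads put on the storage system. Once an SSTable has been written it is immutable and can only be read; the next Memtable flush goes to a new SSTable file. For Cassandra, therefore, writes can be regarded as purely sequential, with no random writes.
SSTable: http://wiki.apache.org/cassandra/ArchitectureSSTable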
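The write path described above (commit log first, then the Memtable, then a bulk flush to an immutable SSTable) can be sketched as follows. `StorageNode` and `flush_threshold` are hypothetical names, the flush condition is simplified to an entry count, and in-memory lists stand in for files on disk.

```python
from typing import Dict, List, Optional, Tuple


class StorageNode:
    """Toy model of one column family's write path; not Cassandra's actual classes."""

    def __init__(self, flush_threshold: int = 3) -> None:
        self.flush_threshold = flush_threshold  # assumed flush condition
        self.commit_log: List[Tuple[str, int]] = []  # append-only
        self.memtable: Dict[str, int] = {}
        self.sstables: List[Tuple[Tuple[str, int], ...]] = []  # immutable tuples

    def write(self, key: str, value: int) -> None:
        self.commit_log.append((key, value))  # 1. sequential append to the log
        self.memtable[key] = value            # 2. update the in-memory table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # The sketch sorts at flush time; the result is a new, read-only table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key: str) -> Optional[int]:
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest table first
            for k, v in sstable:
                if k == key:
                    return v
        return None


node = StorageNode(flush_threshold=2)
node.write("b", 1)
node.write("a", 2)   # reaches the threshold and triggers a flush
node.write("a", 3)   # the newer value lives only in the memtable
print(node.read("a"), node.read("b"))
```

An update never touches an existing SSTable: the newer value for "a" simply shadows the flushed one until a later compaction merges them.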

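Cassandra – Storage
Because SSTable data cannot be updated, the data of a single Column Family may end up spread across several SSTables. A query then has to merge what it reads from all of that Column Family's SSTables and its Memtable, so when a Column Family has accumulated a large number of SSTables, query performance can drop severely. A mechanism is therefore needed to determine quickly which SSTables may hold the queried key, without reading and merging all of them. Cassandra uses a Bloom filter for this: several hash functions map each key into a bitmap, which makes it possible to tell quickly which SSTables might contain the key.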

Cassandra – Storage

System Architecture Content

System Architecture
Top Layer: Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools
Middle Layer: Commit log, Memtable, SSTable, Indexes, Compaction
Core Layer: Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication

System Architecture

System Architecture

System Architecture

System Architecture
Components: Messaging Layer, Cluster Membership, Failure Detector, Storage Layer, Partitioner, Replicator, Cassandra API, Tools

System Architecture

System Architecture - Partitioning

System Architecture – Partitioning (Ring Topology)
Nodes on the ring: a, j, g, d; RF=3
Conceptual ring; one token per node; multiple ranges per node

System Architecture – Partitioning (Ring Topology)
Nodes on the ring: a, j, g, d; RF=2
Conceptual ring; one token per node; multiple ranges per node

System Architecture – Partitioning (New Node)
Nodes on the ring: a, j, g, d, plus new node m; RF=3
Token assignment; range adjustment; bootstrap; arrival only affects immediate neighbors

System Architecture – Partitioning (Ring Partition)
Nodes on the ring: a, j, g, d; RF=3
Node dies. Available? Hinted handoff. Achtung! Plan for this.

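System Architecture – Partitioning
In a real Cassandra deployment, a key issue that must be considered is the choice of tokens. The token determines the range of data each node stores: a node holds the keys in the half-open interval (previous node's token, this node's token]. All nodes together form a ring joined end to end, so the first node holds the keys that are greater than the largest token or less than or equal to the smallest token. The type of the token and the rules for choosing it depend on the partitioning strategy in use. Cassandra (version 0.6) supports three partitioning strategies:
RandomPartitioner
OrderPreservingPartitioner
CollatingOrderPreservingPartitioner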
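The token rule above, where a node owns (previous token, own token] and the ring wraps around, comes down to a binary search over the sorted tokens. A minimal sketch, assuming integer tokens and a hypothetical `owner` function; the node names are borrowed from the ring diagrams and the token values are made up:

```python
import bisect
from typing import List, Tuple


def owner(ring: List[Tuple[int, str]], key_token: int) -> str:
    """Return the node that owns key_token.

    ring is a list of (token, node) pairs sorted by token.
    """
    tokens = [token for token, _ in ring]
    # First node whose token is >= key_token: the interval is closed on the right.
    index = bisect.bisect_left(tokens, key_token)
    if index == len(ring):
        index = 0  # past the largest token: wrap around to the first node
    return ring[index][1]


ring = [(10, "a"), (20, "d"), (30, "g"), (40, "j")]
print(owner(ring, 11), owner(ring, 20), owner(ring, 41))
```

With a RandomPartitioner the token of a key would be derived from a hash of the key, while an order-preserving partitioner would use the key itself; the lookup stays the same.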

System Architecture – Partitioning

System Architecture – Partitioning

System Architecture – Partitioning

System Architecture – Partitioning - Token

System Architecture – Partitioning

System Architecture – Snitching

System Architecture - Replication

System Architecture – Placement

System Architecture - Replication

System Architecture - Replication

System Architecture – Partitioning

System Architecture - Replication
1) Every node is aware of every other node in the system, and hence of the range each one is responsible for. This knowledge is spread through gossip, not through the leader.
2) Each key is assigned to a node, which becomes the key's coordinator. The coordinator is responsible for replicating the item associated with the key on N-1 replicas in addition to itself.
3) Cassandra offers several replication policies and leaves it up to the application to choose one. These policies differ in where the selected replicas are located. Rack Aware, Rack Unaware, and Datacenter Aware are some of these policies.
4) Whenever a new node joins the system, it contacts the Cassandra leader, which tells the node the range for which it is responsible for replicating the associated keys.
5) Cassandra uses Zookeeper to maintain the leader.
6) The nodes that are responsible for the same range are called the "preference list" for that range. This terminology is borrowed from Dynamo.
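Points 2 and 6 above can be illustrated with a small sketch that builds a preference list: the coordinator for a key plus the next N-1 nodes clockwise on the ring. This corresponds to a rack-unaware style of placement and is an assumption of the example; the function name `preference_list` and the token values are hypothetical.

```python
import bisect
from typing import List, Tuple


def preference_list(ring: List[Tuple[int, str]], key_token: int, n: int) -> List[str]:
    """Return the coordinator followed by the next n-1 nodes on the ring.

    ring is a list of (token, node) pairs sorted by token.
    """
    tokens = [token for token, _ in ring]
    # The coordinator is the first node whose token is >= key_token, with wrap-around.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    replicas = min(n, len(ring))  # cannot place more replicas than there are nodes
    return [ring[(start + step) % len(ring)][1] for step in range(replicas)]


ring = [(10, "a"), (20, "d"), (30, "g"), (40, "j")]
print(preference_list(ring, 25, 3))   # N=3
print(preference_list(ring, 45, 2))   # N=2, wraps around the ring
```

A rack-aware or datacenter-aware policy would keep the same coordinator but choose the remaining replicas by location rather than by ring order.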

System Architecture – Replication

System Architecture - Replication

System Architecture – Replication(Leader)

System Architecture - Membership

System Architecture - Failure handling

System Architecture - Bootstrapping

System Architecture - Scaling

System Architecture - Scaling

System Architecture - Local Persistence

System Architecture - Communication

Cassandra – Read/Write

Cassandra – Read/Write

Cassandra - Read Repair

Cassandra - Reads

Cassandra - Read

Read
Client -> Query -> Cassandra Cluster
Cassandra Cluster -> Query -> closest replica (Replica A) -> Result
Cassandra Cluster -> Digest Query -> Replica B, Replica C -> Digest Response
Cassandra Cluster -> Result -> Client
Read repair if digests differ
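The read flow above (full data from the closest replica, digests from the others, repair on mismatch) can be sketched like this. Replicas are modeled as dictionaries mapping a key to `(value, timestamp)`; the function names, the use of MD5 for the digest, and repairing by copying the newest version to every stale replica are assumptions of the example.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[Optional[str], int]  # (value, timestamp)
Replica = Dict[str, Versioned]

MISSING: Versioned = (None, -1)  # placeholder for a replica that lacks the key


def digest(version: Versioned) -> str:
    """Short fingerprint of a versioned value, sent instead of the full data."""
    value, timestamp = version
    return hashlib.md5(f"{timestamp}:{value}".encode()).hexdigest()


def read(replicas: List[Replica], key: str) -> Optional[str]:
    """replicas[0] plays the closest replica; the others answer with digests only."""
    closest, others = replicas[0], replicas[1:]
    result = closest.get(key, MISSING)
    if all(digest(r.get(key, MISSING)) == digest(result) for r in others):
        return result[0]
    # Digests differ: fetch full versions, pick the newest, repair stale replicas.
    newest = max((r.get(key, MISSING) for r in replicas), key=lambda v: v[1])
    for replica in replicas:
        if replica.get(key, MISSING) != newest:
            replica[key] = newest
    return newest[0]


replica_a: Replica = {"k": ("old", 1)}
replica_b: Replica = {"k": ("new", 2)}
replica_c: Replica = {"k": ("old", 1)}
print(read([replica_a, replica_b, replica_c], "k"))
```

After the call all three replicas hold the newest version, so a repeated read finds matching digests and returns without any repair work.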

Cassandra - Write

Cassandra – Write(Properties)

Cassandra - Writes

Cassandra - Write

Cassandra – Write(Fast)

Cassandra – Write(Compactions)

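Cassandra – Gossip
Cassandra is a cluster made up of individual nodes, with no "master" node and no single point of failure, so every node must actively keep track of the state of the other nodes in the cluster. The nodes do this with a mechanism called gossip. Every second, each node "gossips" the state of every node in the cluster to 1-3 other nodes. Gossip data is versioned, so any change to a node spreads quickly through the whole cluster. In this way every node knows the current state of every other node: whether it is bootstrapping, running normally, and so on.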
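The versioned exchange described above can be sketched as follows. Each node keeps a map from node name to `(version, status)` and, in one round, pushes that map to 1-3 randomly chosen peers, which keep whichever entry has the higher version. `GossipNode` and its methods are hypothetical names, and the once-per-second timer is left out.

```python
import random
from typing import Dict, List, Tuple

State = Dict[str, Tuple[int, str]]  # node name -> (version, status)


class GossipNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state: State = {name: (1, "bootstrapping")}

    def set_status(self, status: str) -> None:
        """Change our own status and bump its version so the change spreads."""
        version, _ = self.state[self.name]
        self.state[self.name] = (version + 1, status)

    def merge(self, remote: State) -> None:
        """Keep, for every node, the entry with the highest version."""
        for node, (version, status) in remote.items():
            if node not in self.state or version > self.state[node][0]:
                self.state[node] = (version, status)

    def gossip_round(self, cluster: List["GossipNode"], rng: random.Random) -> None:
        """Push everything we know to between one and three other nodes."""
        others = [peer for peer in cluster if peer is not self]
        fanout = min(len(others), rng.randint(1, 3))
        for peer in rng.sample(others, k=fanout):
            peer.merge(self.state)


node_a, node_b = GossipNode("a"), GossipNode("b")
node_a.set_status("normal")
node_a.gossip_round([node_a, node_b], random.Random(0))
print(node_b.state)
```

The version check in `merge` is what makes late or repeated gossip harmless: an older entry never overwrites a newer one.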

Cassandra – Hinted Handoff

Cassandra – Anti-entropy

Other - DHT

Other - Cassandra - Domain Models

Other - Bloom filter An example of a Bloom filter, representing the set { x , y , z }. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3. http://en.wikipedia.org/wiki/Bloom_filter

Other - Bloom filter
Bloom filter used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times. Bloom filter decisions are much faster. However some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without the Bloom filter. Use of a Bloom filter for this purpose, however, does increase memory usage.
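A Bloom filter of the kind described above can be sketched in a few lines. The class name, the default sizes (m=18 bits and k=3 hashes, as in the figure), and deriving the k positions from SHA-256 with a per-hash prefix are choices made for this example, not Cassandra's implementation.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, m: int = 18, k: int = 3) -> None:
        self.m = m            # number of bits in the array
        self.k = k            # number of hash functions
        self.bits = [0] * m

    def _positions(self, key: str) -> Iterator[int]:
        # Derive k bit positions by prefixing the key with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[position] for position in self._positions(key))


bloom = BloomFilter(m=1024, k=3)
for key in ("x", "y", "z"):
    bloom.add(key)
print(bloom.might_contain("x"), bloom.might_contain("w"))
```

A store would keep one such filter per data file in memory and go to disk only for files whose filter answers True, accepting the occasional false positive.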

Other - Timestamps and Vector Clocks

Other - Vector Clocks

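Other - CommitLog
Like relational database systems, Cassandra writes the log first and the data afterwards; its log is called the Commitlog. Unlike Memtables and SSTables, the Commitlog is per server, not per Column Family. Each Commitlog file has a fixed size and is called a Commitlog Segment. In the current version (0.5.1) this size is 128 MB, hard-coded in the source (src\java\org\apache\cassandra\db\commitlog.java). When a Commitlog file is full, a new file is created. When an old Commitlog file is no longer needed, it is removed automatically.

Other - CommitLog
Each Commitlog file (segment) has a CommitlogHeader structure of fixed size (the size depends on the number of Column Families). The header contains two important arrays, and every Column Family has a corresponding element in each of them. The first is a bitmap ( BitSet dirty ): if the Memtable of a Column Family holds dirty data the bit is set to 1, otherwise it is 0. During recovery this shows which Column Families need to be recovered from the Commitlog. The second is an integer array ( int[] lastFlushedAt ) that stores the log offset at the time of the Column Family's last flush; recovery can read Commitlog records starting from this position. With these two arrays, Cassandra can rebuild the in-memory Memtable contents from the persisted SSTables and the Commitlog when the service restarts after an abnormal shutdown, which is similar to instance recovery in relational databases such as Oracle.

Other - CommitLog
When a Memtable is flushed to an SSTable on disk, the corresponding bit in the dirty array of every Commitlog file is cleared. When the Commitlog reaches its size limit and a new file is created, the dirty array is inherited from the previous file. If the dirty array of a Commitlog file has been cleared completely, that file is no longer needed for recovery and can be removed. Consequently, every Commitlog file that still exists on disk at recovery time is needed.
http://wiki.apache.org/cassandra/ArchitectureCommitLog
http://www.ningoo.net/html/2010/cassandra_commitlog.html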
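The header bookkeeping described above can be modeled with two small arrays. The sketch below borrows the field names `dirty` and `lastFlushedAt` from the text; the class shape, the method names, and the choice to start a new segment's offsets at 0 are assumptions made for illustration.

```python
from typing import List, Optional


class CommitLogHeader:
    """Per-segment header: one dirty flag and one flush offset per column family."""

    def __init__(self, num_column_families: int,
                 inherit_from: Optional["CommitLogHeader"] = None) -> None:
        if inherit_from is None:
            self.dirty: List[bool] = [False] * num_column_families
        else:
            # A new segment inherits the dirty flags of the previous one.
            self.dirty = list(inherit_from.dirty)
        # Assumption: offsets are relative to this segment, so they start at 0.
        self.last_flushed_at: List[int] = [0] * num_column_families

    def mark_dirty(self, cf_id: int) -> None:
        """A write for this column family is in the log but not yet in an SSTable."""
        self.dirty[cf_id] = True

    def on_flush(self, cf_id: int, log_offset: int) -> None:
        """The column family's Memtable was flushed; entries before log_offset are safe."""
        self.dirty[cf_id] = False
        self.last_flushed_at[cf_id] = log_offset

    def is_obsolete(self) -> bool:
        """A segment with no dirty column families is not needed for recovery."""
        return not any(self.dirty)


old_segment = CommitLogHeader(3)
old_segment.mark_dirty(1)
new_segment = CommitLogHeader(3, inherit_from=old_segment)
for segment in (old_segment, new_segment):   # a flush clears the bit in every segment
    segment.on_flush(1, 4096)
print(old_segment.is_obsolete(), new_segment.is_obsolete())
```

During recovery, only column families whose flag is still set would need to be replayed, starting from the recorded offset.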
