Shunsuke Nakamura / @sunsuk7tp Tokyo Institute of Technology Master Course Tokyo, Japan
[Figure: update latency in a write-heavy workload and read latency in a read-heavy workload, comparing write-optimized and read-optimized data stores]
Data store        Performance      Storage engine   Distribution
Apache HBase      write-optimized  Bigtable-like    centralized
Apache Cassandra  write-optimized  Bigtable-like    decentralized
Sharded MySQL     read-optimized   MySQL            centralized
Yahoo! Sherpa     read-optimized   MySQL            centralized
The storage engine determines which workload a data store handles efficiently. The distribution architecture of a data store is independent of its read and write performance characteristics. For example, if the storage part is exchanged with MySQL, how do the read and write characteristics change?
Cassandra = Dynamo (distribution: P2P/decentralized) + Bigtable (storage engine)
MyCassandra = Dynamo (distribution: P2P/decentralized) + (storage engine)
MyCassandra = Dynamo + { Bigtable, MySQL, Redis, … } : storage engine
MyCassandra is a modular distributed data store. You can select a storage engine per keyspace. Points to consider when choosing: • Index algorithm • Read-optimized vs. write-optimized • Sequential or random access • Volatile or persistent • Your experience with the storage engine
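Per-keyspace engine selection can be pictured as a lookup table. The sketch below is only an illustration: the class name `EngineRegistry`, its methods, and the default engine are assumptions, not part of the MyCassandra code base. The engine traits are the ones listed on the cluster slides (write-optimized, read-optimized, memory-based).

```python
# Hypothetical sketch: names and the default engine are assumptions,
# not the MyCassandra API.
ENGINE_TRAITS = {
    "bigtable": "write-optimized",
    "mysql": "read-optimized",
    "redis": "memory-based",
}


class EngineRegistry:
    """Maps each keyspace to the storage engine chosen for it."""

    def __init__(self, default="bigtable"):
        self._default = default
        self._by_keyspace = {}

    def assign(self, keyspace, engine):
        # Reject engines this sketch does not know about.
        if engine not in ENGINE_TRAITS:
            raise ValueError(f"unknown storage engine: {engine}")
        self._by_keyspace[keyspace] = engine

    def engine_for(self, keyspace):
        # Keyspaces without an explicit choice fall back to the default.
        return self._by_keyspace.get(keyspace, self._default)


registry = EngineRegistry()
registry.assign("sessions", "redis")   # volatile, memory-based
registry.assign("profiles", "mysql")   # read-heavy
print(registry.engine_for("sessions"), registry.engine_for("logs"))
```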
You can adapt any data store to MyCassandra to make it scalable. • RDB (MySQL/PostgreSQL) You can apply it to applications whose I/O characteristics change by phase. • MapReduce: Map - Shuffle - Reduce • Full-text search: crawl - indexing - search You can apply it to any IaaS environment. • EC2 + RDS (MyCassandra with MySQL)
[Figure: max. QPS for 40 clients with the Bigtable, MySQL, and Redis storage engines in Write Only, Write Heavy, Read Heavy, and Read Only workloads; y-axis 0 to 40000 qps, higher is better]
Now supporting • put (key, cf): Insert/Update/Delete • get (key) At minimum, you implement these two methods (put and get). • getRangeSlice (startWith, endWith, maxResults) • truncate/dropTable/dropDB Next supporting • secondaryIndex • expire • counter (Cassandra-0.8 ~)
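A minimal sketch of what such an engine interface could look like, written in Python for brevity. The class names, the convention that `cf=None` means delete, and the inclusive range bounds are assumptions made for illustration; the slide names only the operations themselves.

```python
from abc import ABC, abstractmethod


class StorageEngine(ABC):
    """Hypothetical engine interface: put and get are the required methods."""

    @abstractmethod
    def put(self, key, cf):
        """Insert, update, or delete the column family data stored under key."""

    @abstractmethod
    def get(self, key):
        """Return the data stored under key, or None if absent."""

    def get_range_slice(self, start_with, end_with, max_results):
        # Optional operation: engines without ordered keys may not support it.
        raise NotImplementedError


class DictEngine(StorageEngine):
    """Toy in-memory engine backed by a dict."""

    def __init__(self):
        self._rows = {}

    def put(self, key, cf):
        # Assumption: cf=None is used here to express a delete.
        if cf is None:
            self._rows.pop(key, None)
        else:
            self._rows[key] = cf

    def get(self, key):
        return self._rows.get(key)

    def get_range_slice(self, start_with, end_with, max_results):
        # Assumption: both bounds are inclusive and keys compare as strings.
        keys = sorted(k for k in self._rows if start_with <= k <= end_with)
        return [(k, self._rows[k]) for k in keys[:max_results]]


engine = DictEngine()
engine.put("sato", {"gender": "male", "age": 17})
engine.put("suzuki", {"gender": "female", "age": 21})
print(engine.get_range_slice("a", "z", 1))
```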
The data model is the same as Cassandra's. • Super columns are not supported yet. Data is stored in the same key/value format as an SSTable. • This supports NoSQL stores of any data model. For a NoSQL store whose data model has fewer dimensions than Cassandra's: • Add a prefix to the primary key. • The prefix is the keyspace/column family name.
Cassandra      MySQL     Redis
keyspace       database  db
column family  table     record
column         field
Bigtable (Cassandra): one keyspace
  columnfamily A
    key     gender  age  region
    sato    male    17   [null]
    suzuki  female  21   Tokyo
  columnfamily B
    key     visits  plan
    sato    18      Gold
    suzuki  214     Bronze
RDB (MySQL): one database
  table A
    key     values
    sato    gender;male;age;17
    suzuki  gender;female;age;21;region;Tokyo
  table B
    key     values
    sato    visits;18;plan;Gold
    suzuki  visits;214;plan;Bronze
KVS (Redis): one db
  key       values
  A:sato    …
  B:ito     …
  A:suzuki  …
  B:tanaka  …
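The mapping in the example above (a column family prefix on the key, and name/value pairs joined with ";") can be sketched as follows. The helper names are hypothetical, and the ";" layout is taken from the example rows only; it assumes that names and values never contain ";".

```python
def prefixed_key(column_family, row_key):
    """Flatten (column family, row key) into one key, e.g. 'A:sato'."""
    return f"{column_family}:{row_key}"


def serialize_columns(columns):
    """Join name/value pairs with ';', skipping columns that are null."""
    parts = []
    for name, value in columns.items():
        if value is None:
            continue  # a null column is simply not stored
        parts.extend([name, str(value)])
    return ";".join(parts)


def deserialize_columns(blob):
    """Inverse of serialize_columns; every value comes back as a string."""
    fields = blob.split(";") if blob else []
    return dict(zip(fields[0::2], fields[1::2]))


row = {"gender": "male", "age": 17, "region": None}
print(prefixed_key("A", "sato"), serialize_columns(row))
```

For the row shown, this produces the key `A:sato` and the value `gender;male;age;17`, matching the example.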
Currently, a row is a key plus a value that is a serialized object. # easy to change
A column can instead be mapped to a MySQL field. • Lower overhead, but a schema is needed.
Add specialized columns • For secondary search • For range queries
Column           Role
rowKey           Primary key (the key)
CF               Serialized object (the value)
counter          Specialized column
secondary index  For secondary search
token            For range search
A heterogeneous cluster • It combines multiple types of nodes that run different storage engines. • The replicas of each record are placed on nodes with different storage engines. • A proxy routes each query to the nodes that process it efficiently.
[Figure: a write query goes synchronously to the W node (Bigtable) and asynchronously to the R node (MySQL); a read query goes synchronously to the R node (MySQL) and asynchronously to the W node (Bigtable)]
• W: write-optimized (e.g. Bigtable) • R: read-optimized (e.g. MySQL) • RW: memory-based (e.g. Redis)
A MyCassandra Cluster keeps the same consistency strength as Cassandra.
Quorum protocol: (write agreements) + (read agreements) > (replicas) • This protocol guarantees that a read obtains one of the most recent values.
The system needs one node that synchronously processes both read and write queries: a memory-based node (Redis).
[Figure: W (Bigtable), RW (Redis), and R (MySQL) nodes; the memory-based RW node takes part in both the synchronous write and the synchronous read]
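The quorum condition can be checked with a few lines of arithmetic. This is a small sketch with hypothetical function names. If writes wait for w acks and reads for r replies out of n replicas, any write set and any read set share at least w + r - n nodes.

```python
def quorum_overlaps(write_acks, read_acks, replicas):
    """True when every read set must intersect every write set."""
    return write_acks + read_acks > replicas


def min_overlap(write_acks, read_acks, replicas):
    """Smallest possible number of nodes that saw both the write and the read."""
    return max(0, write_acks + read_acks - replicas)


# Three replicas, two agreements for writes and two for reads.
print(quorum_overlaps(2, 2, 3), min_overlap(2, 2, 3))
# One agreement each is not enough: a read may miss the latest write.
print(quorum_overlaps(1, 1, 3), min_overlap(1, 1, 3))
```

With three replicas and two agreements on each side, the overlap is exactly one node. That is why a cluster built from one W, one RW, and one R node needs the RW node to serve both kinds of query synchronously.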
• W: write-optimized (e.g. Bigtable) • R: read-optimized (e.g. MySQL) • RW: memory-based (e.g. Redis)
Replicas = 3, quorum = 2, W:RW:R = 1:1:1
Write path (Client, Proxy, and the nodes responsible for a record: W, RW, R):
1) A proxy broadcasts the query to the nodes.
2) The proxy waits for two acks for the write.
3a) Write success: the proxy returns a success message to the client.
3b) Write failure: the proxy waits for acks from total …
4) The proxy asynchronously waits for acks from the remaining nodes.
Write latency: max (W, RW)
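The write latency figure follows from waiting for the second ack out of three. The sketch below models that; the function name and the latency numbers are made up for illustration and are not measurements. When the R node is the slowest to apply a write, the client-visible latency equals max(W, RW).

```python
def quorum_latency(ack_latencies_ms, required_acks):
    """Latency seen by the client: the time at which the k-th ack arrives."""
    if not 1 <= required_acks <= len(ack_latencies_ms):
        raise ValueError("required_acks must be between 1 and the node count")
    ordered = sorted(ack_latencies_ms.values())
    return ordered[required_acks - 1]


# Illustrative write latencies per node type (assumed values, in ms).
write_ms = {"W": 1.0, "RW": 0.5, "R": 8.0}
latency = quorum_latency(write_ms, required_acks=2)
print(latency, latency == max(write_ms["W"], write_ms["RW"]))
```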
• W: write-optimized (e.g. Bigtable) • R: read-optimized (e.g. MySQL) • RW: memory-based (e.g. Redis)
Replicas = 3, quorum = 2, W:RW:R = 1:1:1
Read path (Client, Proxy, and the nodes responsible for a record: W, RW, R):
1) A proxy sends a data request to an R or RW node and a digest request to the other replicas.
2) The proxy waits for replies, including the one that contains the requested record.
3a) Success: if the record and the digests are consistent, the proxy returns the record to the client.
3b) Failure or inconsistency: the proxy keeps reading and collecting digests until they satisfy the quorum.
4) After replying to the client, the proxy waits for the remaining nodes and checks consistency asynchronously. If there is an inconsistency, it is resolved using read repair.
Read latency: max (R, RW)
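The digest comparison on the read path can be sketched as below. This is a simplified illustration, not the proxy's actual code: the function names, the use of SHA-256, and the return convention are all assumptions.

```python
import hashlib


def digest(value):
    """Digest of a record value (the hash function is an assumed choice)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def quorum_read(data_reply, digest_replies, required):
    """Return (value, True) once `required` replicas agree, else (None, False).

    data_reply: full record from the R or RW node.
    digest_replies: digests from the other replicas, in arrival order.
    """
    expected = digest(data_reply)
    agreeing = 1  # the node that sent the full record counts as one
    if agreeing >= required:
        return data_reply, True
    for reply in digest_replies:
        if reply == expected:
            agreeing += 1
        if agreeing >= required:
            return data_reply, True
    # Not enough matching digests: the caller must read again or repair.
    return None, False


print(quorum_read("plan=Gold", [digest("plan=Gold")], required=2))
print(quorum_read("plan=Gold", [digest("plan=Bronze")], required=2))
```

In the second call the digests disagree, so the sketch reports failure. At that point the steps above call for collecting further replies and, eventually, read repair.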
[Figure: max. qps for 40 clients, Cassandra vs. MyCassandra Cluster, for [write:read] = [100:0] Write-Only, [50:50] Write-Heavy, [5:95] Read-Heavy, and [0:100] Read-Only; y-axis 0 to 20000 query/sec, higher is better; ratio labels ×0.90, ×6.53, ×1.54, ×0.93]
• YCSB / Zipfian • Throughput was up to 6.53 times that of Cassandra. • In Write-Heavy, multiple read repairs occur.
MyCassandra-0.2.2 • secondaryIndex Apply to MySQL and MongoDB MyCassandra-0.3.0 • Based on Cassandra-0.8 • Atomic counter • Brisk (Hadoop + Cassandra)…
1. Asynchronous deletion
2. Engine failure detection
3. Support for ad hoc query
Cassandra’s delete/expire operation • Logical deletion using a tombstone • Actual deletion during SSTable compaction This approach depends on the Bigtable-style engine. MyCassandra (MySQL, Redis, …) • Synchronous deletion (now) • The expire function works well, but the data continues to exist. • Asynchronous deletion is a heavy I/O operation because it runs against one big table, unlike an SSTable, which holds only a subset of the data.
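The tombstone approach can be illustrated with a toy log-structured table. This is a conceptual sketch only: the class and its methods are hypothetical, segments stand in for SSTables, and timestamps and tombstone grace periods are ignored.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


class LogStructuredTable:
    """Toy model: a list of immutable-once-flushed segments, newest last."""

    def __init__(self):
        self.segments = [{}]

    def put(self, key, value):
        self.segments[-1][key] = value

    def delete(self, key):
        # Logical deletion: write a marker instead of touching old segments.
        self.segments[-1][key] = TOMBSTONE

    def flush(self):
        # Start a new segment; earlier ones are no longer modified.
        self.segments.append({})

    def get(self, key):
        for segment in reversed(self.segments):
            if key in segment:
                value = segment[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Actual deletion: merge segments and drop tombstoned keys.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

    def stored_entries(self):
        return sum(len(segment) for segment in self.segments)


table = LogStructuredTable()
table.put("sato", "Gold")
table.flush()
table.delete("sato")
print(table.get("sato"), table.stored_entries())  # hidden, but still on disk
table.compact()
print(table.get("sato"), table.stored_entries())  # physically removed
```

An engine that keeps all rows in one large table has no equivalent of the per-segment merge, which is why the slide treats asynchronous deletion there as an open problem.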
When only the storage engine fails: failure detection and the behavior of the instance. With several storage engines and a partial failure: the behavior of the instance.
[Figure: each instance detects the failure of its engine by periodic polling. Open questions for the instance: What should I do? Treat it as an overall failure (node down)? Take over the other node?]
Ad hoc query and data model • If a feature does not depend on the distributed architecture, it can be added easily. Data model of Redis (List, Set, ..) Document data model and ad hoc queries of MongoDB • But if it does depend on it, it cannot be supported. Atomic queries across multiple keys. Joins. It is important to determine whether a query depends on the distributed mechanism.
github • https://github.com/sunsuk7tp/MyCassandra/ Twitter • @MyCassandraJP • @_MyCassandra # @MyCassandra had already been taken!! • @sunsuk7tp # my private account Google Groups • https://groups.google.com/group/my-cassandra