MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

These slides were presented at SYSTOR 2012 in Haifa, Israel.

http://www.research.ibm.com/haifa/conferences/systor2012/index.shtml



  1. MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads. Shunsuke Nakamura (Tokyo Institute of Technology, NHN Japan), Kazuyuki Shudo (Tokyo Institute of Technology). Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6).
  2. Cloud storage: distributed data stores that process large amounts of data. NoSQL, key-value storage (KVS), document-oriented DBs, graph DBs. Examples: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Giraph, Oracle Coherence, and others (more than 100 products). Characteristics: "limited functions, massive volume, high performance" - data access only by primary key; no luxury features such as joins or global transactions; scalable to much larger data volumes and node counts.
  3. Design policies of cloud storages: there are many trade-offs. Data model: key/value, multi-dimensional map, document, or graph. Performance: write vs. read. Latency vs. persistence: latency depends on memory and disk utilization; persistence can be synchronous or asynchronous (snapshot). Replication: synchronous vs. asynchronous. Consistency between replicas: strong vs. weak. Data partitioning: row vs. column. Distribution: master/slave vs. decentralized.
  4. MyCassandra focuses on the performance trade-off: write vs. read, i.e. latency vs. persistence, among the design trade-offs listed on the previous slide.
  5. Performance trade-off: write-optimized vs. read-optimized. A cloud storage with persistence is designed to optimize either the write or the read workload; the storage engine determines which workload the store handles efficiently. Write-optimized (Bigtable, Cassandra, HBase): indexing by Log-Structured Merge Tree [P. O'Neil '96]; a write is an append to disk, a read needs random reads plus a merge; storage engine: Bigtable clone. Read-optimized (MySQL, Yahoo! Sherpa): indexing by B-Trees [R. Bayer '70]; a write incurs random reads and writes, a read is a single random read; storage engine: MySQL.
  6. [Figure: write latency under a write-heavy workload (Yahoo! Cloud Serving Benchmark, SOCC '10); lower is better, and the write-optimized store outperforms the read-optimized one.]
  7. [Figure: read latency under a read-heavy workload (Yahoo! Cloud Serving Benchmark, SOCC '10); lower is better, and the read-optimized store outperforms the write-optimized one.]
  8. Research overview. Contribution: a technique for building a cloud storage that performs well under both read and write workloads. Steps: 1) MyCassandra: Apache Cassandra with selectable storage engines (read-optimized or write-optimized); 2) MyCassandra Cluster: a heterogeneous cluster that combines different storage engines to be both read- and write-optimized.
  9. Apache Cassandra: open-sourced by Facebook in 2008, now a top-level Apache project. Features: scalability up to hundreds of servers across multiple racks and datacenters; high availability without a single point of failure (SPOF) thanks to a decentralized architecture; write-optimized. Clustering spans multiple racks/DCs (dc1, dc2, dc3) with a region-aware replication strategy.
  10. Apache Cassandra: a decentralized cloud storage without a SPOF. Consistent hashing (a decentralized algorithm) assigns identifiers to both nodes and data on a circular ID space (hash values A-Z in the example; number of replicas = 3). Every node can serve any role: proxy serving clients, or primary/secondary data node. A key hashes to a position on the ring (e.g. hash(key) = Q), and the value is stored on the primary node plus two secondaries (a minimal sketch follows below).
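A minimal consistent-hashing sketch in Java under stated assumptions: a toy hash function, one token per node, and simple clockwise replica selection (Cassandra 0.7 actually uses MD5-based tokens and pluggable replication strategies):

```java
import java.util.*;

// Sketch: nodes and keys share one circular ID space; a key's primary is the
// first node clockwise from hash(key), and the next distinct nodes hold replicas.
class ConsistentHashRing {
    private final NavigableMap<Long, String> ring = new TreeMap<>();

    void addNode(String node) { ring.put(hash(node), node); }

    List<String> replicasFor(String key, int n) {
        List<String> replicas = new ArrayList<>();
        Iterator<String> it = ring.tailMap(hash(key), true).values().iterator();
        while (replicas.size() < Math.min(n, ring.size())) {
            if (!it.hasNext()) it = ring.values().iterator();   // wrap around the ring
            String node = it.next();
            if (!replicas.contains(node)) replicas.add(node);   // primary, then secondaries
        }
        return replicas;
    }

    private long hash(String s) { return s.hashCode() & 0xffffffffL; }  // toy hash, not MD5
}
```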
  11. Apache Cassandra: write-optimized storage engine, a Bigtable clone. O(1) fast writes: each update is written to disk sequentially, so it is fast (no random disk I/O) and always writable (no write lock). Write path: 1) append the update to the CommitLog for persistence (sequential writes only); 2) update the Memtable, an in-memory map, for quick reading; 3) acknowledge the client; 4) asynchronously flush the Memtable to an SSTable on disk; 5) delete the flushed data from the CommitLog and Memtable. (A sketch of this path follows below.)
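A simplified sketch of this write path, using hypothetical class and method names (the real Cassandra code differs): append to a commit log, update an in-memory memtable, acknowledge, and flush to an SSTable off the critical path.

```java
import java.io.*;
import java.util.concurrent.*;

// Sketch of the Bigtable-style write path: sequential log append + memtable update.
class WritePath {
    private final DataOutputStream commitLog;   // append-only, sequential disk I/O
    private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();
    private static final int FLUSH_THRESHOLD = 100_000;   // illustrative size trigger

    WritePath(File logFile) throws IOException {
        commitLog = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(logFile, true)));
    }

    synchronized void write(String key, String value) throws IOException {
        commitLog.writeUTF(key);     // 1) persist via sequential append
        commitLog.writeUTF(value);
        commitLog.flush();
        memtable.put(key, value);    // 2) update the in-memory map
        // 3) the caller acknowledges the client here
        if (memtable.size() >= FLUSH_THRESHOLD)
            flusher.submit(this::flushToSSTable);   // 4) and 5) happen off the write path
    }

    private void flushToSSTable() {
        // Write the sorted memtable contents to a new immutable SSTable file,
        // then drop the flushed entries from the memtable and truncate the log.
    }
}
```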
  12. Apache Cassandra: write-optimized storage engine, a Bigtable clone. Slow reads: a read must look up the key in the Memtable and in multiple SSTables and merge the results, which is slow because it incurs multiple random disk I/Os (sketch below).
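A sketch of the read-side merge, with hypothetical in-memory stand-ins for SSTables (real SSTables are on-disk files with indexes and bloom filters): the newest version of the key across the memtable and all SSTables wins.

```java
import java.util.*;

// Sketch: a read consults the memtable and every SSTable, then merges by timestamp.
class ReadPath {
    record Versioned(String value, long timestamp) {}

    static Versioned read(String key, Map<String, Versioned> memtable,
                          List<Map<String, Versioned>> sstables) {
        Versioned newest = memtable.get(key);
        for (Map<String, Versioned> sstable : sstables) {   // one random disk I/O each, in reality
            Versioned v = sstable.get(key);
            if (v != null && (newest == null || v.timestamp() > newest.timestamp()))
                newest = v;
        }
        return newest;   // merged, most recent version
    }
}
```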
  13. Performance of the original Cassandra: write performance is much higher. YCSB results: on average, a write (0.69 ms) is 9x as fast as a read (6.16 ms); at the 99.9th percentile, a write (2.0 ms) is 43.5x as fast as a read (86.9 ms).
  14. Part 1: Storage engine support. MyCassandra: select a read-optimized or a write-optimized storage engine.
  15. MyCassandra: a modular cloud storage with selectable storage engines. The storage engine feature is inspired by MySQL: an independent, pluggable component that performs the disk I/O. A cloud storage becomes either write-optimized or read-optimized by selecting its storage engine (Bigtable, MySQL, Redis, ...), just as MySQL selects among InnoDB, MyISAM, Memory, and others, while keeping Cassandra's original distribution architecture (decentralized, consistent hashing, gossip protocol) and data model.
  16. MyCassandra implementation: Cassandra's original distribution architecture is retained, and a Storage Engine Interface is introduced; each engine implements this interface (a sketch follows below).
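A hypothetical sketch of what such a storage engine interface might look like (the paper's actual interface is not reproduced here); the distribution layer calls these methods, and each engine (Bigtable-style LSM store, MySQL, Redis) supplies its own implementation:

```java
import java.io.IOException;

// Hypothetical pluggable engine contract: the distribution layer stays the same,
// and only the component that performs disk I/O is swapped.
public interface StorageEngine {
    void put(String columnFamily, String key, byte[] row) throws IOException;
    byte[] get(String columnFamily, String key) throws IOException;
    void delete(String columnFamily, String key) throws IOException;
    boolean isWriteOptimized();   // lets the cluster route queries to the right replica
}
```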
  17. Performance of each storage engine. Engines: Bigtable, write-optimized (original Cassandra 0.7.5); MySQL, read-optimized (MySQL 6.0 with InnoDB, JDBC API, stored procedures); Redis, in-memory KVS (Redis 2.2.8). Setup: 6 nodes, Crucial SSDs, 6 GB of each node's 8 GB of memory allocated; data set of 36 million 1 KB records. [Figure: read/write latency of each engine under the workload; the chart shows speedup factors of x9.87 and x11.79.]
  18. Part 2: A heterogeneous cluster of different storage engines. MyCassandra Cluster: both read- and write-optimized.
  19. Basic idea (W: write-optimized, R: read-optimized, RW: in-memory). Replicate data on nodes with different storage engines and route each query to the nodes that process it efficiently: synchronously to nodes that process it quickly, asynchronously to nodes that process it slowly, exploiting each node's advantage. Consistency between replicas is maintained as in the original Cassandra by the quorum protocol: (write agreements) + (read agreements) > (number of replicas) guarantees retrieval of the latest data. Consequence: at least one node must process both read and write queries synchronously and quickly, and the in-memory (RW) nodes play this role. (The quorum condition is sketched below.)
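The slide's quorum condition as a checkable predicate, a minimal sketch: with N = 3 replicas and read and write quorums of 2, every read set overlaps every write set, so some replica in the read set holds the latest write.

```java
// Quorum arithmetic: a read is guaranteed to see the latest write
// exactly when the write and read agreement sets must intersect.
class Quorum {
    static boolean guaranteesLatestRead(int replicas, int writeAcks, int readAcks) {
        return writeAcks + readAcks > replicas;
    }

    public static void main(String[] args) {
        System.out.println(guaranteesLatestRead(3, 2, 2));   // true: sets overlap in >= 1 node
        System.out.println(guaranteesLatestRead(3, 1, 1));   // false: sets may be disjoint
    }
}
```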
  20. Cluster design (W: write-optimized, R: read-optimized, RW: in-memory). Combine nodes with different storage engines: W, R, and RW. Each node's storage engine type is disseminated by attaching it to gossip messages. Replicas are placed on nodes with different storage engines: the proxy (any node that receives the request) selects the storing nodes, taking 1) the primary node determined by the queried key and 2) N-1 secondary nodes with storage engines different from those already chosen (see the sketch below). Multiple nodes can share a single server for load balancing. Example configuration with N = 3: each record is replicated on one W, one RW, and one R node.
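A sketch of the engine-aware replica selection described above, under assumed types (the paper's actual placement code is not shown): starting from the primary, walk the ring and take the next nodes whose engine types have not been used yet.

```java
import java.util.*;

// Sketch: pick N replicas with pairwise-distinct storage engine types.
class EngineAwarePlacement {
    record Node(String name, String engineType) {}   // engineType: "W", "R", or "RW"

    static List<Node> selectReplicas(List<Node> ringOrderFromPrimary, int n) {
        List<Node> replicas = new ArrayList<>();
        Set<String> usedEngines = new HashSet<>();
        for (Node node : ringOrderFromPrimary) {     // primary first, then ring order
            if (usedEngines.add(node.engineType())) replicas.add(node);
            if (replicas.size() == n) break;
        }
        return replicas;   // e.g. one W, one RW, and one R node for n = 3
    }
}
```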
  21. Process for a write access (W: write-optimized, R: read-optimized, RW: in-memory; N = 3 replicas with W:RW:R = 1:1:1; read and write quorums = 2). 1) A proxy receives a write query for a single record from a client and routes it to the nodes storing the record. 2) The proxy waits for two ACKs; the W and RW nodes usually reply quickly, while the R node is written asynchronously. 3-a) If the writes succeed and the proxy receives the two ACKs, it returns a success message. 3-b) If a data node fails to write, the proxy waits for ACKs including the R node and then returns a success message. 4) After returning, the proxy asynchronously collects ACKs from the remaining nodes. Write latency: max(W, RW). (A routing sketch follows below.)
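A sketch of the synchronous/asynchronous write routing, with hypothetical helper types: the proxy blocks only until the write quorum is reached (normally the fast W and RW nodes), and the slow read-optimized replica catches up off the request path.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: send the mutation to all replicas, block for `quorum` successful ACKs.
class WriteRouter {
    static void write(List<Callable<Boolean>> replicaWrites, int quorum,
                      ExecutorService pool) throws Exception {
        CompletionService<Boolean> acks = new ExecutorCompletionService<>(pool);
        for (Callable<Boolean> w : replicaWrites) acks.submit(w);

        int received = 0;
        while (received < quorum) {
            if (acks.take().get()) received++;   // fast W/RW nodes typically ack first;
        }                                        // a failed node means waiting on R too
        // Success is returned to the client here; the remaining ACKs
        // (usually the slow R node) are collected asynchronously.
    }
}
```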
  22. Process for a read access (same parameters: N = 3, quorums = 2, W:RW:R = 1:1:1). 1) A proxy receives a read query for a single record and routes it to the storing nodes. 2) The proxy waits for replies; the R and RW nodes reply quickly. 3-a) If the returned values are consistent, the proxy returns the value. 3-b) If the values mismatch, the proxy waits for consistent values including the W node. 4) After returning, the proxy checks the replies from the remaining nodes; if it notices inconsistent values, it asynchronously updates them to the consistent one (Cassandra's ReadRepair feature does this). Read latency: max(R, RW). (A resolution sketch follows below.)
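A sketch of the read-side consistency check and read repair, under assumed types (real Cassandra resolves per column by timestamp; ReadRepair is the feature the slide names): return the newest value from the quorum of replies and asynchronously push it to stale replicas.

```java
import java.util.*;

// Sketch: resolve a quorum of replies to the newest value and repair stragglers.
class ReadRouter {
    record Versioned(String value, long timestamp) {}

    static Versioned resolve(Map<String, Versioned> repliesByNode) {
        Versioned newest = Collections.max(repliesByNode.values(),
                Comparator.comparingLong(Versioned::timestamp));
        for (Map.Entry<String, Versioned> e : repliesByNode.entrySet()) {
            if (e.getValue().timestamp() < newest.timestamp())
                repairAsync(e.getKey(), newest);   // the read-repair step
        }
        return newest;
    }

    static void repairAsync(String node, Versioned newest) {
        // Asynchronously send the newest value to the stale replica.
    }
}
```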
  23. Performance evaluation: demonstrate that a heterogeneous cluster performs well with both read-heavy and write-heavy workloads. Targets: MyCassandra Cluster (3 different nodes per server x 6 servers) vs. Cassandra (1 node per server x 6 servers). Quorum parameters: N = 3, read and write quorums = 2. Storage engines: Bigtable (W), MySQL/InnoDB (R), Redis (RW). Benchmark: Yahoo! Cloud Serving Benchmark (YCSB) [SOCC '10]: 1) load data (1 KB records: 10 columns x 100 bytes) from a YCSB client; 2) warm up; 3) run the benchmark and measure response times at the client.
  24. YCSB workloads (record selection: Zipfian). Write-Only (example application: logging): 100% write, 0% read. Write-Heavy (session store): 50% write, 50% read. Read-Heavy (photo tagging): 5% write, 95% read. Read-Only (cache): 0% write, 100% read. Zipfian distribution: the access frequency of each datum is determined by its popularity, not by its freshness. (A hedged configuration sketch follows below.)
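For illustration, a plausible YCSB core-workload file for the Read-Heavy case (95% read / 5% write, 1 KB records as 10 x 100-byte fields, Zipfian selection); the paper's exact configuration is not shown, so these values are assumptions:

```properties
# Hypothetical YCSB workload file approximating the Read-Heavy workload:
# 36 million records of 1 KB each (10 fields x 100 bytes), Zipfian selection.
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=36000000
fieldcount=10
fieldlength=100
readproportion=0.95
updateproportion=0.05
scanproportion=0
insertproportion=0
requestdistribution=zipfian
```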
  25. Write/read latency (response time). [Figure: average write latency per workload: MyCassandra Cluster is slightly higher than Cassandra, at most +0.57 ms (+42.5%, +59.5%, +69.5% on the workloads containing writes). Average read latency: MyCassandra Cluster is far lower, at most -26.5 ms (-83.3%, -88.8%, -90.4%); the maximum 90.4% reduction occurs in the Read-Only workload. Workloads: Write-Only, Write-Heavy, Read-Heavy, Read-Only.]
  26. Throughput (queries/sec for 40 clients). [Figure: Cassandra vs. MyCassandra Cluster at write:read ratios 100:0, 50:50, 5:95, and 0:100; MyCassandra Cluster reaches x0.87, x2.16, x4.07, and x11.00 of Cassandra's throughput, respectively.] Throughput is 11.0 times that of Cassandra in the Read-Only workload, while write performance is comparable to Cassandra.
  27. Conclusion. A cloud storage supporting both write-heavy and read-heavy workloads by combining nodes with different storage engines. MyCassandra Cluster achieved better performance than the original Cassandra on read-heavy workloads: read latency at most 90.4% lower, and throughput at most 11.0 times higher.
  28. Related work. Indexing algorithms that aim at both write and read performance: FD-Tree (Tree Indexing on Flash Disks, VLDB '10); bLSM (A General Purpose Log Structured Merge Tree, SIGMOD '12); Fractal Tree (implemented in TokuDB, a MySQL storage engine). Modular data stores: MySQL, Anvil (SOSP '09), Cloudy (VLDB '10), Dynamo (SOSP '07). Fractured Mirrors (row- vs. column-oriented replicas) is the analogue of MyCassandra (SYSTOR '12), which mirrors read- vs. write-optimized replicas.
  29. Discussion 1: the slightly higher write latency. The cause is load balancing. In Cassandra, a synchronous write may go to any of the N replica nodes, so the synchronous load is distributed equally; in MyCassandra Cluster, synchronous writes always go to the designated W and RW nodes, so the synchronous operation is fixed to particular nodes. However, this cost is well worth the improvement in read performance.
  30. Discussion 2: in-memory nodes. Q: Memory overflow? A: An in-memory node acts as an LRU-like cache; swapped-out data is recovered from the other, persistent nodes by read repair. Q: Fault tolerance? A: 1) Write to an alternative node, and when the failed node recovers, resolve the inconsistency using values from that node; 2) asynchronous snapshots (a Redis feature). Q: A cluster of only in-memory nodes? A: In that case the cluster's capacity is limited by the total memory capacity.
  31. Open-source release (オープンソース化).
