Cloud Storage That Balances Read Performance and Write Performance (OS-117-24)


Published on 2011/04/14. Presentation slides from the April OS/ARC research meeting.

Published in: Technology


  1. 11.4.14 - mycassandra
  2. NoSQL, Key-Value Store (KVS), Document-oriented DB, GraphDB: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Cabinet/Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Oracle Coherence, Velocity, … (join, transaction)
  3. data model: key/value vs. multi-dimensional map vs. document vs. graph; consistency: strong vs. weak; data partition: row vs. column; node organization: master/slave vs. decentralized
  4. data model: key/value vs. multi-dimensional map vs. document vs. graph; consistency: strong vs. weak; data partition: row vs. column; node organization: master/slave vs. decentralized
  5. write-optimized vs. read-optimized. Write-optimized: Bigtable, Cassandra, HBase, built on the Log-Structured Merge Tree [P. O’Neil ’96] (example: Bigtable). Read-optimized: MySQL, Sherpa, built on the B+-Tree [R. Bayer ’72] (example: MySQL).
  6. [Figure: write-optimized vs. read-optimized stores under Write-Heavy and Read-Heavy workloads. Source: Yahoo! Cloud Serving Benchmark, SOCC ’10]
  7. (1) MyCassandra: read-optimized or write-optimized. (2) MyCassandra Cluster: read/write-optimized.
  8. Apache Cassandra: Consistent Hashing over node IDs, N = 3. A request goes to a proxy, which routes it to the primary node (hash(key) = Q) and to the secondary nodes (secondary 1, secondary 2). [Figure: hash ring with nodes A, F, N, Q, V, Z; key and values stored at Q]
  9. Google Bigtable-style storage engine (write). Bigtable: sequential write I/O; always writable, with no write-lock. Cassandra: a write goes to the Commit Log on Disk and to the Memtable in Memory (map: <key,ColumnFamily>), and the Memtable is written to an SSTable on Disk (async). [Figure: <k1, cf1> and <k1, cf2> combined as <k1, cf1+cf2>]
  10. Google Bigtable-style storage engine (read). A read by key takes the value from the Memtable (Memory, Map of <key,ColumnFamily>) and the values from the SSTables (Disk I/O). [Figure: Memtable holds <k1, CF4>; SSTables hold <key, CF1>, <key, CF2>, <key, CF3>; Commit Log on Disk]
  11. Part 1: MyCassandra (read-optimized, write-optimized)
  12. MyCassandra: Cassandra with a selectable storage engine. Kept from Cassandra: Consistent Hashing, Gossip Protocol. Storage engines: Bigtable, MySQL (InnoDB, MyISAM, Memory, …), Redis, …
  13. MyCassandra, based on Cassandra: JDBC API / stored procedure; key-value store. [Figure: MyCassandra node × 6]
  14. Part 2: MyCassandra Cluster (read/write-optimized)
  15. Replication: sync vs. async. Quorum Protocol: (replicas written) + (replicas read) > (total replicas) => mem
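  16. Node types: W (write-optimized), R (read-optimized), RW (read/write-optimized). Each MyCassandra node is (W) / (R) / (RW); gossip protocol. Replica placement, N=3, Consistent Hashing over node IDs: 1. primary node by key; 2. N-1 further nodes. [Figure: ring of nodes labeled R, RW, W, connected by gossip]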
  17. host / node / storage mapping and Fault Tolerance (FT) space. Options: (1), (2) virtual node [Amazon Dynamo, SOSP ’07], (3) 1 storage / 1 node / 1 host. [Figure axes: 1 node / host vs. k nodes / host; k storages / node vs. 1 storage / node]
  18. Write (replicas = 3, quorum = 2, W:RW:R = 1:1:1). 1) Client to Proxy; 2) W and RW nodes return ACK; 3a), 3b) R node returns ACK. Latency: max (W, RW)
  19. Read (replicas = 3, quorum = 2, W:RW:R = 1:1:1). 1) Client to Proxy; 2) read from the R and RW nodes; 3a), 3b) or the W node; 4) Proxy (Cassandra read repair). Latency: max (R, RW)
  20. MyCassandra Cluster: 6×3 = 18 nodes / 6 hosts (W:R:RW = 6 : 6 : 6). Cassandra: 6 nodes / 6 hosts. Replicas = 3, write quorum = read quorum = 2. Storage engines: Bigtable (W), MySQL / InnoDB (R), Redis (RW). Benchmark: YCSB (Yahoo! Cloud Serving Benchmark) [SOCC ’10]. Steps: 1. MyCassandra/Cassandra × 6, YCSB Client × 1; 2. 1KB values (100 [Bytes] × 10 [columns]) + key, 1,000; 3.; 4. YCSB; 5. YCSB Stat
  21. YCSB: 4 workloads (Workload: Application Example; Operation Ratio), Record Selection: Zipfian. Write Heavy group: Write-Only (Log; Read: 0%, Write: 100%), Write-Heavy (Session Store; Read: 50%, Write: 50%). Read Heavy group: Read-Heavy (Photo tagging; Read: 95%, Write: 5%), Read-Only (Cache; Read: 100%, Write: 0%).
  22. [Figure: avg. write-latency (ms) and avg. read-latency (ms), Cassandra vs. MyCassandra Cluster (MySQL + Redis), for Write-Only, Write-Heavy, Read-Heavy, Read-Only; lower is better. Annotations on write latency: 11.5~23.5%. Annotations on read latency: 88.5%, 85.2%, 88.5%, 49.7%]
  23. [Figure: max. qps for 40 clients (query/sec), Cassandra vs. MyCassandra Cluster, by [write:read] ratio: [100:0] Write-Only, [50:50] Write-Heavy, [5:95] Read-Heavy, [0:100] Read-Only; higher is better. Annotations: 0.99, 0.62, 1.49, 6.53]
  24. (1) HDD vs. SSD. [Figure: throughput (qps) of Cassandra and MyCassandra Cluster on HDD and SSD; higher is better] HDD/SSD IOZone benchmark (HDD: Western Digital, SSD: Crucial), HDD / SSD: sequential write 86,277 qps / 96,401 qps; sequential read 108,914 qps / 216,099 qps; random write 2,485 qps / 29,045 qps; random read 926 qps / 21,751 qps
  25. Read-Heavy: 88.5% (read latency), 6.53 (max. qps). Write-Heavy: Cassandra
  26. (1/2) Write-Heavy: MySQL; write-optimized, write-heavy. [Figure: write latency, read latency, and throughput, Cassandra vs. MyCassandra cluster]
  27. (2/2) Amazon EC2: 1 /N
  28. FD-Tree: Tree Indexing on Flash Disks, VLDB ’10: B+tree + LSM-tree; SSD. MySQL: RDBMS. Anvil, SOSP ’09. Cloudy, VLDB ’10. Dynamo, SOSP ’07. MyCassandra.
  29. MyCassandra/MyCassandra Cluster, listed as Cassandra / 1. MyCassandra / 2. MyCassandra Cluster. data model: multi-dimensional map (Column Family); throughput: write / write or read / write and read; latency: low / lower in some cases / lower; persistence: yes / yes or no (memory) / yes; consistency: weak (eventual, quorum); replication: sync / async; data partition: row; node organization: decentralized. Differences: throughput, latency
  30. 1), 2) MySQL + memcached; MyCassandra Cluster. Table (movie-id, name, thumb-name, tag, count): 704122313, movieA, EY37lHk5bgU, "sport, soccer, FIFA, …", 169,374; 704122314, movieB, Zk3BSYMWjzQ, "music, jazz, …", 472,803. Read-Heavy, Write-Heavy
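
Slides 15, 18, and 19 describe a quorum condition and give the request latency as max (W, RW) for writes and max (R, RW) for reads. The following is a minimal Python sketch of that reasoning, not the MyCassandra implementation. The `Replica` class, the function names, and all latency figures are hypothetical and chosen only for illustration. It assumes one replica of each node type and a quorum of 2 out of 3, as on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    node_type: str   # "W", "R", or "RW", as on slide 16
    write_ms: float  # hypothetical write latency of this replica
    read_ms: float   # hypothetical read latency of this replica


def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """Return True when any read quorum shares a replica with any write quorum."""
    return w + r > n


def write_latency(replicas, quorum: int) -> float:
    """Time until `quorum` replicas have acknowledged a write.

    The proxy contacts every replica at once, so the request completes
    when the quorum-th fastest acknowledgement arrives.
    """
    return sorted(rep.write_ms for rep in replicas)[quorum - 1]


def read_latency(replicas, quorum: int) -> float:
    """Time until `quorum` replicas have answered a read."""
    return sorted(rep.read_ms for rep in replicas)[quorum - 1]


# One replica per node type (W:RW:R = 1:1:1). The numbers are made up.
replicas = [
    Replica("W", write_ms=1.0, read_ms=8.0),   # fast writes, slow reads
    Replica("RW", write_ms=0.5, read_ms=0.5),  # fast at both
    Replica("R", write_ms=6.0, read_ms=2.0),   # slow writes, fast reads
]

N, QUORUM = 3, 2
print(quorum_overlaps(N, QUORUM, QUORUM))  # quorum condition for 2 + 2 > 3
print(write_latency(replicas, QUORUM))     # bounded by the W and RW replicas
print(read_latency(replicas, QUORUM))      # bounded by the R and RW replicas
```

With these assumed figures, the slow replica of each operation (R for writes, W for reads) never sits on the critical path, which is the effect the latency expressions on slides 18 and 19 capture.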
