17th Cassandra Study Meetup: MyCassandra

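These are the MyCassandra slides I presented at the 17th Cassandra Study Meetup (http://bit.ly/lIHd5v). Source: https://github.com/sunsuk7tp/MyCassandra/. If you would like to try MyCassandra, have suggestions about its design, or have any other questions, please contact @sunsuk7tp, @MyCassandraJP, or @_MyCassandra.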

Published in: Technology

  1. (24) • @sunsuk7tp • P.A. WORKS • CS M2 • HPC: TSUBAME; MPI, Cell B.E., GPU CUDA, Hadoop on • P2P • NoSQL Afternoon in Japan (10.11.1); SACSIS 2011 • Web: PHP, Perl, JavaScript; Apache Solr, MySQL; NoSQL • Jazz, trumpet • Cassandra 0.6.0; @railute, @yutuki_r, @techmemo; Itmedia; http://lab.jibun.atmarkit.co.jp/entries/1058
  2. NoSQL, Key-Value Store (KVS), Document-Oriented DB, GraphDB: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Oracle Coherence; 100 • join, transaction
  3. DC • decentralized • master/slave • data/meta/proxy • Map Reduce
  4. [Figure] SPOF • DC: dc1, dc2, dc3 • rack/dc • region
  5. correlated failure • SPOF • Daniel Ford et al. (Google), “Availability in Globally Distributed Storage Systems”, OSDI 2010
  6. ⇒ SPOF ~ SPOF
  7. decentralized • proxy/master/slave
  8. Consistent Hashing • IDs (A~Z), N := 3 • request → proxy • hash(key) = Q • primary node: Q • secondary node: V • [Figure labels] ID: A, F, Z; primary, secondary 1, secondary 2; key, values
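
The replica lookup on the consistent hashing slide can be sketched roughly as follows. This is a minimal illustration, not MyCassandra's or Cassandra's partitioner: the class name `ConsistentHashRing`, the use of MD5, and the one-position-per-node layout are all assumptions made for the example. A key is hashed onto the ring, the first node at or after that position acts as the primary, and the next nodes clockwise serve as secondaries until N replicas are chosen.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hypothetical ring in which each node owns one position derived from its name."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas  # N on the slide
        # Sort the nodes by their position on the ring.
        self._ring = sorted((self._position(name), name) for name in nodes)
        self._positions = [position for position, _ in self._ring]

    @staticmethod
    def _position(value):
        # MD5 is an assumption here; any uniformly distributed hash would do.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def preference_list(self, key):
        """Return [primary, secondary 1, secondary 2, ...] for a key."""
        count = min(self.replicas, len(self._ring))
        # First node whose position is at or after the key's position.
        start = bisect.bisect_left(self._positions, self._position(key))
        # Walk clockwise, wrapping around the end of the ring.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = ConsistentHashRing(["A", "F", "Q", "V", "Z"], replicas=3)
replica_nodes = ring.preference_list("some-row-key")
```

Because the node positions depend only on the node names, every proxy that knows the same membership computes the same preference list without consulting a master.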
  9. MyCassandra
  10. [Figure] SQL • map • Megastore • library • relational data model • table (multi-dimensional sorted map) • (sorted) records + indices • (sorted) map + indices • (sorted) map • RDB • Bigtable • KVS • NoSQL
  11. PNUTS (VLDB ’08): MySQL, NoSQL • YCSB (SOCC ’10)
  12. [Figure] Write-Heavy • Read-Heavy • write-optimized • read-optimized • Better
  13. Existing systems:
     • Apache HBase: write optimized, Bigtable like, centralized
     • Apache Cassandra: write optimized, Bigtable like, decentralized
     • Sharded MySQL: read optimized, MySQL, centralized
     • Yahoo! Sherpa: read optimized, MySQL, centralized
     ⇒ Cassandra, MySQL
  14. MyCassandra
  15. = Dynamo + Bigtable
  16. = Dynamo + Bigtable (P2P/decentralized)
  17. = Dynamo + (P2P/decentralized)
  18. RDBMS • Table • NoSQL • query
  19. MyCassandra
  20. = Dynamo + (P2P/decentralized)
  21. = Dynamo + MySQL / Bigtable / Redis
  22. 1 (master/worker, sharding, consistent hashing) • cache / persistence • index • write/read-optimized
  23. + MyCassandra
  24. InnoDB (MySQL 5.1~) • MyISAM • Memory • Merge • Archive • Federated • NDB • CSV • Blackhole • FALCON • MariaDB • Drizzle • InnoDB/MyISAM • solidDB • MySQL Cluster
  25. MySQL: • Bigtable: Cassandra • Redis: snapshot • MongoDB: DB
  26. decentralized • RDB (MySQL / PostgreSQL): master/slave • decentralized: MongoDB / Redis • MapReduce • MySQL, Bigtable • MySQL (InnoDB): INSERT • Bigtable: INSERT/GET • EC2+RDS • MyCassandra
  27. I/O • Bigtable (LSM-tree) • MySQL (B-trees) • Redis (Hash) • MongoDB (B-tree) • KyotoCabinet (B+ tree/hash)
  28. Index structures: hash, B-Trees, LSM-Tree
     • write: 1 random I/O; append (LSM-Tree)
     • read: 1 random I/O; N random I/O + merge (LSM-Tree)
     • cache
     • hash: Memcached, Redis, KyotoCabinet
     • B-Trees: MySQL, MongoDB, KyotoCabinet
     • LSM-Tree: Cassandra, HBase, LevelDB
  29. Write path, LSM-Tree [P. O’Neil ’96]: O(1) • sequential write I/O • Always writable • write-lock
     • sync: Commit Log on disk (sequential write of <k1, v1>, <k1, v2>) and Memtable in memory (<k1, obj (v1+v2)>)
     • async flush: Memtable → SSTable on disk
     • SSTable 1: <k1,obj1> • SSTable 2: <k1,obj2> • SSTable 3: <k1,obj3>
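
The write path on this slide can be sketched as below. It is a toy model under stated assumptions, not Cassandra's implementation: `TinyLSMStore` and `memtable_limit` are hypothetical names, the commit log is a Python list standing in for a sequential on-disk file, and the flush runs inline instead of asynchronously. The point it illustrates is that a write only appends to the log and updates memory, and never reads or rewrites existing on-disk data.

```python
class TinyLSMStore:
    """Toy LSM-tree write path: commit log + memtable + immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the sequential on-disk log
        self.memtable = {}     # in-memory map: row key -> merged columns
        self.sstables = []     # flushed, immutable tables (oldest first)
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (rows)

    def put(self, key, columns):
        # 1) Append to the commit log: a sequential write, no read needed.
        self.commit_log.append((key, dict(columns)))
        # 2) Merge the new columns into the in-memory row (<k1, obj (v1+v2)>).
        self.memtable.setdefault(key, {}).update(columns)
        # 3) Flush once the memtable is full (done inline here; async in practice).
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new key-sorted SSTable and start a fresh one."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}


store = TinyLSMStore(memtable_limit=2)
store.put("k1", {"v1": "a"})
store.put("k1", {"v2": "b"})   # merged into the same in-memory row
store.put("k2", {"v1": "c"})   # second row reaches the limit and triggers a flush
```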
  30. Read path: Key • value from the Memtable (memory) • values from the SSTables (disk I/O) • merge: <k1,obj> (Memtable) + <k1,obj1> (SSTable 1) + <k1,obj2> (SSTable 2) + <k1,obj3> (SSTable 3) → <k1,obj+obj1~3> returned to the client • Commit Log (disk)
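
The read-side merge can be sketched in the same spirit. This is an illustrative simplification: the function name `read_row` is hypothetical, and recency is approximated by table order (oldest SSTable first, memtable last), whereas a real Bigtable-like store reconciles columns by timestamp. It shows why a read may touch every SSTable that holds a fragment of the row.

```python
def read_row(key, memtable, sstables):
    """Merge all fragments of a row; newer fragments override older ones.

    `sstables` is assumed to be ordered oldest first, and the memtable is
    assumed to hold the most recent data.
    """
    merged = {}
    found = False
    for table in sstables:           # potentially one disk read per SSTable
        if key in table:
            merged.update(table[key])
            found = True
    if key in memtable:              # in-memory lookup, no disk I/O
        merged.update(memtable[key])
        found = True
    return merged if found else None


sstables = [{"k1": {"a": 1}}, {"k1": {"a": 2, "b": 3}}, {"k2": {"x": 0}}]
memtable = {"k1": {"c": 4}}
row = read_row("k1", memtable, sstables)
```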
  31. [Figure] Latency (average / 99.9 percentile) • write: avg. 0.69 ms, 99.9 percentile 2.0 ms • read: avg. 6.16 ms, 99.9 percentile 86.9 ms • 1/9 • Better • axes: Number of queries, Latency (ms)
  32. [Figure] Max. QPS for 40 Clients • series: Bigtable, MySQL, Redis • y-axis: 0 to 40000 (qps), Better • x-axis: Write Only, Write Heavy, Read Heavy, Read Only
  33. 99%/Max • (KB~MB) • HDD/SSD • (zipfian, uniform, latest) • Embedded InnoDB, KyotoCabinet
  34. select
  35. Code structure:
     • client: o.a.c.cli; o.a.c.avro/thrift
     • server (proxy): o.a.c.service.StorageProxy
     • server (engine): o.a.c.service.StorageService; o.a.c.db.ReadVerbHandler/RowMutationVerbHandler
     • engine: o.a.c.db.Table (keyspace); o.a.c.db.commitlog; o.a.c.db.ColumnFamilyStore (columnfamily); o.a.c.db.engine.StorageEngineInterface; o.a.c.db.engine.MySQLInstance, RedisInstance, MongoDBInstance, …
  36. Storage engine operations:
     • put (key, cf) OK
     • get (key)
     • getRangeSlice (startWith, endWith, maxResults)
     • truncate/dropTable/dropDB
     • secondaryIndex
     • expire
     • counter (Cassandra-0.8)
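
To make the idea of a pluggable engine interface concrete, here is a small Python sketch. It is not the Java `StorageEngineInterface` from the MyCassandra source: the class names `StorageEngine` and `InMemoryEngine`, the method signatures, and the inclusive range bounds are assumptions chosen for illustration. The shape follows the operations listed on the slide: put, get, a range slice, and truncate.

```python
from abc import ABC, abstractmethod


class StorageEngine(ABC):
    """Hypothetical Python rendering of a pluggable storage engine."""

    @abstractmethod
    def put(self, key, cf):
        """Store the serialized column family `cf` under `key`."""

    @abstractmethod
    def get(self, key):
        """Return the stored value for `key`, or None if absent."""

    @abstractmethod
    def get_range_slice(self, start_with, end_with, max_results):
        """Return up to `max_results` (key, value) pairs in key order."""

    @abstractmethod
    def truncate(self):
        """Remove every row held by this engine."""


class InMemoryEngine(StorageEngine):
    """Dictionary-backed engine, standing in for MySQL, Redis, and so on."""

    def __init__(self):
        self._rows = {}

    def put(self, key, cf):
        self._rows[key] = cf

    def get(self, key):
        return self._rows.get(key)

    def get_range_slice(self, start_with, end_with, max_results):
        # Inclusive bounds are an assumption of this sketch.
        keys = sorted(k for k in self._rows if start_with <= k <= end_with)
        return [(k, self._rows[k]) for k in keys[:max_results]]

    def truncate(self):
        self._rows.clear()


engine = InMemoryEngine()
engine.put("sato", "gender;male;age;17")
engine.put("suzuki", "gender;female;age;21;region;Tokyo")
rows = engine.get_range_slice("a", "z", max_results=10)
```

Keeping the interface this narrow is what lets the upper layers (proxy, replication, failure detection) stay unchanged while the engine underneath is swapped.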
  37. Cassandra data model: keyspace – columnfamily – column • key/value • ColumnFamily → SSTable <key, value>; value: columnFamily
     • Keyspace, Bigtable (Cassandra):
     • ColumnFamily A (key: gender, age, region): sato: male, 17, [null] • suzuki: female, 21, Tokyo
     • ColumnFamily B (key: visits, plan): sato: 18, Gold • suzuki: 214, Bronze
  38. Cassandra • Super Column • SSTable • key-value • KVS: key prefix
  39. Terminology mapping (Cassandra / MySQL / Redis):
     • keyspace / database / db
     • column family / table
     • record
     • column / field
  40. The same data in three models:
     • RDB (MySQL): database; table A (key, values): sato: gender;male;age;17 • suzuki: gender;female;age;21;region;Tokyo; table B (key, values): sato: visits;18;plan;Gold • suzuki: visits;214;plan;Bronze
     • KVS (Redis): db (key, values): A:sato … • B:ito … • A:suzuki … • B:tanaka …
     • Bigtable (Cassandra): keyspace; columnfamily A (key: gender, age, region): sato: male, 17, [null] • suzuki: female, 21, Tokyo; columnfamily B (key: visits, plan): sato: 18, Gold • suzuki: 214, Bronze
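
The mapping in the figure, where a column family row becomes a single serialized value and the column family name becomes a key prefix in a flat KVS, might look like the following sketch. The helper names are hypothetical, and the `name;value;name;value` layout is taken only from the example strings on the slide; the actual serialization used by MyCassandra may differ. The sketch also assumes names and values never contain `;`.

```python
def kvs_key(column_family, row_key):
    """Build a flat KVS key such as 'A:sato' from a column family and row key."""
    return f"{column_family}:{row_key}"


def serialize_columns(columns):
    """Flatten {'gender': 'male', 'age': 17} into 'gender;male;age;17'."""
    return ";".join(f"{name};{value}" for name, value in columns.items())


def deserialize_columns(blob):
    """Inverse of serialize_columns; all values come back as strings."""
    if not blob:
        return {}
    parts = blob.split(";")
    return dict(zip(parts[0::2], parts[1::2]))


key = kvs_key("A", "sato")
value = serialize_columns({"gender": "male", "age": 17})
columns = deserialize_columns(value)
```

The trade-off of this layout is that reading or updating one column requires fetching and rewriting the whole serialized row.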
  41. Mapping to MySQL:
     • MySQL database = keyspace :=> MyCassandra (MySQL)
     • MySQL table = keyspace :=> Cassandra
     • Bigtable (Cassandra): keyspace; columnfamily A (key: gender, age, region): sato: male, 17, [null] • suzuki: female, 21, Tokyo; columnfamily B (key: visits, plan): sato: 18, Gold • suzuki: 214, Bronze
     • MySQL Table (key: gender, age, region, visits, plan): sato: male, 17, [null], 18, Gold • suzuki: female, 21, Tokyo, 214, Bronze
  42. [Figure] 1 • secondary index • rowKey • CF • counter • secondary index • token • Serialized Object • Key, Value • Key-Value • KVS • …
  43. write query: sync to W (Bigtable), async to R (MySQL) • read query: async to W (Bigtable), sync to R (MySQL)
  44. Node types: W • R • RW • write query: sync to W, async to R • Quorum Protocol: (write replicas) + (read replicas) > (total replicas) • write, read: W, RW, R
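
The quorum condition can be checked with a few lines of Python. This is a generic illustration of the standard rule rather than code from MyCassandra; the function names are hypothetical. The first function tests the inequality, and the second confirms by brute force that, when it holds, every possible write set shares at least one replica with every possible read set.

```python
from itertools import combinations


def quorums_overlap(write_replicas, read_replicas, total_replicas):
    """Quorum rule: a read is guaranteed to see the latest write when this holds."""
    return write_replicas + read_replicas > total_replicas


def always_intersect(write_replicas, read_replicas, total_replicas):
    """Brute-force check that any write set shares a node with any read set."""
    nodes = range(total_replicas)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, write_replicas)
        for read_set in combinations(nodes, read_replicas)
    )


# Three replicas with quorum writes and quorum reads (2 + 2 > 3).
strong = quorums_overlap(2, 2, 3)
# Writing to one replica and reading from two leaves a gap (1 + 2 = 3).
weak = quorums_overlap(1, 2, 3)
```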
  45. Write in a heterogeneous cluster • node types: R, RW • =3, =2 • W:RW:R = 1:1:1 • Client → Proxy • 1) • 2) W, RW → ACK • 3a) W, RW, R • 3b) R → ACK • latency: max (W, RW)
  46. Read in a heterogeneous cluster • node types: R, RW • =3, =2 • W:RW:R = 1:1:1 • Client → Proxy • 1) • 2) R, RW • 3a) • 3b) or W • 4) (Cassandra read repair) • latency: max (R, RW)
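
The latency expressions on these two slides, max (W, RW) for writes and max (R, RW) for reads, can be read as "the client waits only for the node types that are good at the operation". A minimal sketch of that reading follows. The table `SYNC_TARGETS`, the function name, and all latency figures are hypothetical and chosen only to show the effect; they are not measurements from the talk.

```python
# Node types the proxy waits for synchronously, per query type
# (assumed from the slides: writes wait for W and RW, reads for R and RW).
SYNC_TARGETS = {"write": ("W", "RW"), "read": ("R", "RW")}


def client_latency(query, node_latency_ms):
    """Latency seen by the client: the slowest of the synchronous replicas."""
    return max(node_latency_ms[node_type] for node_type in SYNC_TARGETS[query])


# Hypothetical per-node-type latencies in milliseconds for each operation.
write_latency = {"W": 0.7, "RW": 2.0, "R": 6.0}
read_latency = {"W": 6.0, "RW": 2.0, "R": 0.7}

write_ms = client_latency("write", write_latency)  # the slow R node is off the critical path
read_ms = client_latency("read", read_latency)     # the slow W node is off the critical path
```

With three replicas and a quorum of two, waiting on two node types still satisfies the quorum rule while the remaining replica is updated or repaired in the background.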
  47. [Figure] max. qps for 40 clients • series: Cassandra, MyCassandra Cluster • y-axis: 0 to 20000 (query/sec), Better • x-axis [write:read]: [100:0] Write-Only, [50:50] Write-Heavy, [5:95] Read-Heavy, [0:100] Read-Only • bar labels: 0.90, 6.49, 1.54, 0.93 • YCSB / Zipfian • 6.49
  48. https://github.com/sunsuk7tp/MyCassandra • MyCassandra-0.2.0 • based on Cassandra-0.7.5 • Basic CRUD on a simple record • RangeSlice • keyspace
  49. Setup:
     1. cassandra.yaml • engine host, port, … • default engine
     2. ( )
     3. MyCassandra (Cassandra)
     4. keyspace, columnfamily • engine (keyspace) • (column family)
  50. Embedded InnoDB • HailDB: … • Handler Socket: … • ExtraDB • API • DBM (KyotoCabinet) • KyotoCassandra/Kyossandra/ ssandra • NoSQL • QDBM, TC • Hash or B+Tree db
  51. KyotoCabinet • hash/B+tree • database classes (class: persistence, algorithm, lock unit):
     • ProtoHashDB: volatile, hash, whole (rwlock)
     • ProtoTreeDB: volatile, red black tree, whole (rwlock)
     • StashDB: volatile, hash, record (rwlock)
     • CacheDB: volatile, hash, record (mutex)
     • GrassDB: volatile, B+ tree, page (rwlock)
     • HashDB: persistent, hash, record (rwlock)
     • TreeDB: persistent, B+ tree, page (rwlock)
     • DirDB: persistent, undefined, record (rwlock)
     • ForestDB: persistent, B+ tree, page (rwlock)
  52. MyCassandra-0.2.2 • secondaryIndex: MySQL, MongoDB • MyCassandra-0.3.0 • Based on Cassandra-0.8 • Atomic counter • Brisk (Hadoop + Cassandra) …
  53. 1. 2. 3.
  54. Cassandra: expire • tombstone • SSTable • Bigtable like • MyCassandra: Bigtable • expire • 1 Table
  55. [Figure] instance, instance, instance • ping • detect • engine, engine, engine • instance? • node down?
  56. Redis • MongoDB • key • Join
  57. Cassandra-0.6: GC • Cassandra-0.7, 0.8: …
  58. Issue • https://github.com/sunsuk7tp/MyCassandra/issues • Twitter • @MyCassandraJP • @_MyCassandra # @MyCassandra orz • @sunsuk7tp • Google Groups • https://groups.google.com/group/my-cassandra
  59. @railute: Cassandra • Gemini Mobile Technologies / @geminimobile: Hibari • @yutuki_r: Cassandra twitter • dann / @techmemo: Cassandra • @tatsuya6502: YCSB, Hibari • @mikio1978 / @fallabs: KyotoCabinet • @muga_nishizawa • @Nakata_itpro • @shudo • Cassandra • UST
