17th Cassandra Study Group: MyCassandra

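These are the MyCassandra slides I presented at the 17th Cassandra Study Group (http://bit.ly/lIHd5v). Source: https://github.com/sunsuk7tp/MyCassandra/. If you would like to try MyCassandra, have suggestions about its design, or have any other questions, please contact @sunsuk7tp, @MyCassandraJP, or @_MyCassandra.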

Published in: Technology

  1. 1. (24)•  @sunsuk7tp•  /P.A. WORKS /•  CS M2•  :   : HPC   TSUBAME   MPI, Cell B.E., GPU CUDA, Hadoop on   :     , P2P   NoSQL Afternoon in Japan (10.11.1, )   SACSIS 2011•  Web 6   PHP, Perl, JavaScript     Apache Solr, MySQL   NoSQL   NoSQL•  Jazz, trumpet•  Cassandra 0.6.0   @railute @yutuki_r @techmemo Itmedia 3 http://lab.jibun.atmarkit.co.jp/entries/1058
  2. 2. + NoSQL, Key-Value Store (KVS), Document-Oriented DB, GraphDB: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Oracle Coherence, 100 : ↔ • • join, transaction • /
  3. 3. /DC •  decentralized •  •  master/slave •  data/meta/proxy • •  •  Map Reduce• 
  4. 4.   SPOF  DC dc1 dc2 rack/dc region dc3
  5. 5. • • ( ) << & • , correlated failure SPOF = “ ” • : 1 • : /( ) Daniel Ford et al. (Google), “Availability in Globally Distributed Storage Systems”, OSDI 2010
  6. 6.   ⇒  !!  SPOF  ~ SPOF
  7. 7.   decentralized •  proxy/master/slave 
  8. 8. Consistent Hashing ( )  (A~Z ) N := 3 ID A F Z •  request proxy secondary 1 •  primary node Q •  secondary node V N primary secondary 2 hash(key) = Q key values
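The replica placement on this slide (nodes A to Z on a ring, N = 3, one primary and two secondaries chosen from hash(key)) can be sketched as follows. This is a minimal illustration, not MyCassandra's or Cassandra's partitioner: the `ConsistentHashRing` class, the `ring_position` helper, and the use of MD5 are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string onto the ring. MD5 is an assumption; any stable hash works."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Hypothetical ring: each key is stored on `replicas` consecutive nodes."""

    def __init__(self, node_ids, replicas=3):
        self.replicas = replicas
        # Nodes sorted by their position on the ring.
        self._ring = sorted((ring_position(n), n) for n in node_ids)
        self._positions = [pos for pos, _ in self._ring]

    def nodes_for(self, key: str):
        """Return [primary, secondary 1, secondary 2, ...] for a key."""
        # First node at or after the key's position; wrap around at the end.
        start = bisect.bisect_left(self._positions, ring_position(key))
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


# Nodes named A..Z as on the slide, with N = 3 replicas per key.
ring = ConsistentHashRing([chr(c) for c in range(ord("A"), ord("Z") + 1)], replicas=3)
replica_nodes = ring.nodes_for("some-key")
primary, secondaries = replica_nodes[0], replica_nodes[1:]
```

Any node can act as the proxy for a request, because every node can compute the same replica list from the key alone.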
  9. 9. MyCassandra
  10. 10. SQL map Megastore library relational data model table (multi-dimensional sorted map) (sorted) records (sorted) map (sorted) map + indices + indices RDB Bigtable KVS NoSQL
  11. 11. PNUTS (VLDB ‘08): MySQL NoSQL YCSB (SOCC ’10):
  12. 12. [Figure: write-optimized vs. read-optimized stores under Write-Heavy and Read-Heavy workloads; "Better" marks the preferred direction on the axis.]
  13. 13. Existing systems:
       Apache HBase: write optimized, Bigtable like, centralized
       Apache Cassandra: write optimized, Bigtable like, decentralized
       Sharded MySQL: read optimized, MySQL, centralized
       Yahoo! Sherpa: read optimized, MySQL, centralized
       ⇒ Cassandra MySQL
  14. 14. MyCassandra
  15. 15. = Dynamo + Bigtable
  16. 16. = Dynamo + Bigtable (P2P/decentralized)
  17. 17. = Dynamo + (P2P/decentralized)
  18. 18.   RDBMS  Table •  / •  •  NoSQL !! query
  19. 19. MyCassandra
  20. 20. = Dynamo + (P2P/decentralized)
  21. 21. MySQL= Dynamo + Bigtable Redis :
  22. 22. 1 (master/worker, sharding, consistent hashing) •  cache / persistence •  index •  write/read-optimized • 
  23. 23. + MyCassandra
  24. 24.   InnoDB (MySQL 5.1~ )  MyISAM  Memory  Merge  Archive  Federated  NDB  CSV  Blackhole ( )  FALCON  MariaDB  Drizzle InnoDB/MyISAM  solidDB MySQL Cluster  :
  25. 25.   MySQL:  Bigtable: Cassandra  Redis: / snapshot  MongoDB: DB    
  26. 26.   decentralized •    RDB (MySQL / PostgreSQL) •  master/slave decentralized   MongoDB / Redis  •  MapReduce   MySQL Bigtable   MySQL (InnoDB) INSERT   Bigtable INSERT/GET •    / /  EC2+RDS MyCassandra
  27. 27.   / I/O •  Bigtable (LSM-tree) •  MySQL (B-trees/ ) •  Redis (Hash) •  MongoDB (B-tree) •  KyotoCabinet (B+ tree/hash)
  28. 28. Index structures compared: hash, B-Trees, LSM-Tree
       write: 1 random I/O; append
       read: 1 random I/O; N random I/O + merge; cache
       hash: Memcached, Redis, KyotoCabinet
       B-Trees: MySQL, MongoDB, KyotoCabinet
       LSM-Tree: Cassandra, HBase, LevelDB
  29. 29. LSM-Tree [P. O’Neil ‘96] write path. +: O(1) sequential write I/O; always writable (write-lock). Writes <k1, v1>, <k1, v2> are written synchronously to the Commit Log (sequential disk write) and to the in-memory Memtable, which holds <k1, obj (v1+v2)>. The Memtable is flushed asynchronously to SSTables on disk: SSTable 1 <k1,obj1>, SSTable 2 <k1,obj2>, SSTable 3 <k1,obj3>.
  30. 30. LSM-Tree read path. For a key: • the value in the Memtable (memory, <k1,obj>) • the values in the SSTables (disk I/O: SSTable 1 <k1,obj1>, SSTable 2 <k1,obj2>, SSTable 3 <k1,obj3>) are merged, and the client receives <k1,obj+obj1~3>.
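The write path and read path on the two slides above can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: `TinyLSM` and `memtable_limit` are hypothetical names, the commit log and SSTables are plain Python lists rather than files, and the flush runs inline instead of asynchronously.

```python
class TinyLSM:
    """Toy LSM-tree: commit log + memtable + immutable SSTables (all in memory)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # flush once this many keys are buffered
        self.commit_log = []   # stands in for the sequential on-disk log
        self.memtable = {}     # key -> {column: value}
        self.sstables = []     # flushed tables, oldest first

    def write(self, key, columns):
        # 1) Append to the commit log (sequential write, no read needed).
        self.commit_log.append((key, dict(columns)))
        # 2) Merge into the memtable.
        self.memtable.setdefault(key, {}).update(columns)
        # 3) Flush when full. Real systems do this asynchronously.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.sstables.append(self.memtable)  # becomes an immutable SSTable
            self.memtable = {}
            self.commit_log.clear()              # flushed data no longer needs the log

    def read(self, key):
        # Merge every SSTable (oldest first) and then the memtable,
        # so that newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table.get(key, {}))
        merged.update(self.memtable.get(key, {}))
        return merged or None


db = TinyLSM(memtable_limit=2)
db.write("k1", {"a": 1})
db.write("k2", {"x": 9})      # memtable reaches 2 keys and is flushed
db.write("k1", {"b": 2})      # newer column for k1 stays in the memtable
row = db.read("k1")           # merges the SSTable and the memtable
```

The read has to touch every SSTable that may hold the key, which is why the slides describe reads as costing several I/Os plus a merge while writes stay sequential.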
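  31. 31. [Figure: latency distributions (number of queries vs. latency in ms) for read and write, showing average and 99.9 percentile; lower is better. write: avg. 0.69 ms, 99.9 percentile 2.0 ms. read: avg. 6.16 ms, 99.9 percentile 86.9 ms. Ratio shown on the slide: 1/9.]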
  32. 32. [Figure: Max. QPS for 40 Clients, comparing Bigtable, MySQL, and Redis across Write Only, Write Heavy, Read Heavy, and Read Only workloads; y-axis in qps from 0 to 40000, higher is better.]
  33. 33.   / /  /99%/Max/    ( KB~ MB)  HDD/SSD  (zipfian, uniform, latest)  •  Embedded InnoDB, KyotoCabinet# ( )
  34. 34. select
  35. 35. proxy  client client •  o.a.c.cli •  o.a.c.avro/thrift server  proxy •  o.a.c.service.StorageProxy  server engine •  o.a.c.service.StorageService •  o.a.c.db.ReadVerbHandler/RowMutationVerbHandler  engine •  o.a.c.db.Table (keyspace )   o.a.c.db.commitlog   o.a.c.db.ColumnFamilyStore (columnfamily )   o.a.c.db.engine.StorageEngineInterface   o.a.c.db.engine.MySQLInstance, RedisInstance, MongoDBInstance, …
  36. 36.   •  put (key, cf)   OK •  get (key) •  getRangeSlice (startWith, engWith, maxResults) •  truncate/dropTable/dropDB  •  secondaryIndex •  expire •  counter (Cassandra-0.8 )
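The operations listed on this slide suggest a small pluggable-engine contract. The sketch below mirrors those operations in Python purely for illustration: MyCassandra itself is Java, and the class names, method names, and inclusive range semantics here are assumptions, not the actual `StorageEngineInterface` signatures.

```python
from abc import ABC, abstractmethod


class StorageEngine(ABC):
    """Hypothetical per-node storage engine contract."""

    @abstractmethod
    def put(self, key, column_family):
        """Store the serialized column family for a row key."""

    @abstractmethod
    def get(self, key):
        """Return the stored value for a row key, or None."""

    @abstractmethod
    def get_range_slice(self, start_with, end_with, max_results):
        """Return up to max_results (key, value) pairs in key order."""

    @abstractmethod
    def truncate(self):
        """Remove all rows."""


class InMemoryEngine(StorageEngine):
    """Dictionary-backed engine, standing in for a MySQL or Redis backend."""

    def __init__(self):
        self._rows = {}

    def put(self, key, column_family):
        self._rows[key] = column_family

    def get(self, key):
        return self._rows.get(key)

    def get_range_slice(self, start_with, end_with, max_results):
        # Assumption: both bounds are inclusive.
        keys = sorted(k for k in self._rows if start_with <= k <= end_with)
        return [(k, self._rows[k]) for k in keys[:max_results]]

    def truncate(self):
        self._rows.clear()


engine = InMemoryEngine()
for name in ["tanaka", "sato", "ito", "suzuki"]:
    engine.put(name, b"serialized-columns-for-" + name.encode("utf-8"))
slice_rows = engine.get_range_slice("sato", "suzuki", max_results=10)
```

Keeping the contract this narrow is what lets the rest of the node (proxy, replication, failure detection) stay unchanged when the backend is swapped.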
  37. 37. Cassandra • : keyspace – columnfamily – column • key/value( ) • ColumnFamily SSTable <key, value> value: columnFamily
       Bigtable (Cassandra), Keyspace:
       ColumnFamily A (key: gender, age, region)
         sato: male, 17, [null]
         suzuki: female, 21, Tokyo
       ColumnFamily B (key: visits, plan)
         sato: 18, Gold
         suzuki: 214, Bronze
  38. 38.   Cassandra •  Super Column SSTable key-value •   KVS key prefix • 
  39. 39. Terminology mapping (Cassandra / MySQL / Redis):
       keyspace / database / db
       column family / table / record
       column / field
  40. 40. The same data in three models:
       RDB (MySQL), database:
         table A (key: values)
           sato: gender;male;age;17
           suzuki: gender;female;age;21;region;Tokyo
         table B (key: values)
           sato: visits;18;plan;Gold
           suzuki: visits;214;plan;Bronze
       KVS (Redis), db (key: values)
         A:sato: …
         B:ito: …
         A:suzuki: …
         B:tanaka: …
       Bigtable (Cassandra), keyspace:
         columnfamily A (key: gender, age, region)
           sato: male, 17, [null]
           suzuki: female, 21, Tokyo
         columnfamily B (key: visits, plan)
           sato: 18, Gold
           suzuki: 214, Bronze
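The mapping on this slide packs a row's columns into a single value string such as gender;male;age;17, and prefixes Redis keys with the column family name (A:sato). A minimal sketch of that encoding follows. The function names are hypothetical, the semicolon format is taken from the slide's illustration rather than from MyCassandra's actual serialization, and no escaping of ";" inside names or values is attempted.

```python
def encode_columns(columns):
    """Flatten {name: value} into 'name;value;name;value' as drawn on the slide."""
    parts = []
    for name, value in columns.items():
        parts.extend([str(name), str(value)])
    return ";".join(parts)


def decode_columns(blob):
    """Inverse of encode_columns. All values come back as strings."""
    if not blob:
        return {}
    fields = blob.split(";")
    return dict(zip(fields[0::2], fields[1::2]))


def kvs_key(column_family, row_key):
    """Key for a flat key-value store: column family name used as a prefix."""
    return f"{column_family}:{row_key}"


# Row "sato" of column family A; the absent 'region' column is simply omitted.
value_a = encode_columns({"gender": "male", "age": 17})
redis_key = kvs_key("A", "sato")
```

Because absent columns are left out of the string, sparse rows such as sato's (no region) cost nothing extra, which matches the [null] cell in the Bigtable view.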
  41. 41. • MySQL database = keyspace :=> MyCassandra (MySQL) • MySQL table = keyspace :=> Cassandra
       Bigtable (Cassandra), keyspace:
         columnfamily A (key: gender, age, region)
           sato: male, 17, [null]
           suzuki: female, 21, Tokyo
         columnfamily B (key: visits, plan)
           sato: 18, Gold
           suzuki: 214, Bronze
       MySQL, Table (key: gender, age, region, visits, plan)
         sato: male, 17, [null], 18, Gold
         suzuki: female, 21, Tokyo, 214, Bronze
  42. 42.   1  secondary index rowKey CF counter secondary token index Serialized Object Key Value Key-Value KVS …
  43. 43.   •  •  •  write query read query sync async async sync W R W R Bigtable MySQL Bigtable MySQL
  44. 44. • W: • R: • RW: write query sync async W R Quorum Protocol: (write replicas) + (read replicas) > (total replicas) • write read W RW R
  45. 45. •  : •  R: •  RW: =3, =2 ClientW:RW:R = 1:1:1 Proxy 1)  2)  W, RW ACK ACK 3a) W RW R 3b) R ACK : max (W, RW)
  46. 46. •  : •  R: •  RW: =3, =2W:RW:R = 1:1:1 Client Proxy 1)  2)  R, RW 3a) 3b) or W W RW R 4)  . (Cassandra read repair ) : max (R, RW)
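The two slides above give the latency of a write as max (W, RW) and of a read as max (R, RW) for three replicas in a W:RW:R = 1:1:1 mix, together with the quorum condition. The sketch below restates both rules. The function names are hypothetical and the per-node latencies are invented numbers used only to exercise the formulas, not measurements from the talk.

```python
def quorum_overlaps(write_acks, read_acks, replicas):
    """Quorum condition: a read set and a write set must share at least one replica."""
    return write_acks + read_acks > replicas


def sync_latency(latency_ms, sync_nodes):
    """The proxy answers once every synchronously contacted node has responded."""
    return max(latency_ms[node] for node in sync_nodes)


# Three replicas, one of each type as on the slides (W:RW:R = 1:1:1).
# Latencies below are made-up illustration values in milliseconds.
write_latency_ms = {"W": 1.0, "RW": 1.5, "R": 8.0}
read_latency_ms = {"W": 9.0, "RW": 1.2, "R": 2.0}

# Writes wait for W and RW; the R node is updated asynchronously.
write_latency = sync_latency(write_latency_ms, ["W", "RW"])
# Reads wait for R and RW; the W node is not on the critical path.
read_latency = sync_latency(read_latency_ms, ["R", "RW"])

# Two acknowledgements per operation out of three replicas satisfies the quorum rule.
consistent = quorum_overlaps(write_acks=2, read_acks=2, replicas=3)
```

With these sample numbers the slow side of each node type (8.0 ms writes on R, 9.0 ms reads on W) never appears in the client-visible latency, which is the effect the slides attribute to mixing engine types within one replica set.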
  47. 47. [Figure: max. qps for 40 clients, Cassandra vs. MyCassandra Cluster, for write:read ratios [100:0] (Write-Only), [50:50] (Write-Heavy), [5:95] (Read-Heavy), and [0:100] (Read-Only); y-axis in query/sec from 0 to 20000, higher is better. Values annotated on the chart: 0.90, 6.49, 1.54, 0.93.] • YCSB / Zipfian • 6.49 •
  48. 48. https://github.com/sunsuk7tp/MyCassandra MyCassandra-0.2.0 ( ) • based on Cassandra-0.7.5 • Basic CRUD on a simple record • RangeSlice • keyspace
  49. 49. 1. cassandra.yaml • engine host, port, … • default engine 2. ( ) 3. MyCassandra (Cassandra ) 4. or keyspace, columnfamily • engine (keyspace ) • (column family )
  50. 50.   Embedded InnoDB •  HailDB: … •  Handler Socket: … •  ExtraDB •  API  DBM (KyotoCabinet) •  KyotoCassandra/Kyossandra/ ssandra ( ) •  •  NoSQL •  QDBM, TC Hash or B+Tree db
  51. 51. • / • hash/B+tree • KyotoCabinet database classes (class: persistence, algorithm, lock unit):
       ProtoHashDB: volatile, hash, whole (rwlock)
       ProtoTreeDB: volatile, red black tree, whole (rwlock)
       StashDB: volatile, hash, record (rwlock)
       CacheDB: volatile, hash, record (mutex)
       GrassDB: volatile, B+ tree, page (rwlock)
       HashDB: persistent, hash, record (rwlock)
       TreeDB: persistent, B+ tree, page (rwlock)
       DirDB: persistent, undefined, record (rwlock)
       ForestDB: persistent, B+ tree, page (rwlock)
  52. 52.  MyCassandra-0.2.2 •  secondaryIndex   MySQL MongoDB MyCassandra-0.3.0 •  Based on Cassandra-0.8 •  Atomic counter •  Brisk (Hadoop + Cassandra)…
  53. 53. 1. 2. 3. 
  54. 54.   Cassandra /expire •  tombstone •  SSTable •  Bigtable like MyCassandra Bigtable •  •  expire •    1 Table
  55. 55.    instance instance instance ping detectengine engine engine instance ? ? node down ?
  56. 56.   •    Redis   MongoDB   •    key   Join 
  57. 57.   •   •  Cassandra-0.6 :   GC   •  Cassandra-0.7, 0.8:          …
  58. 58.  Issue •  https://github.com/sunsuk7tp/MyCassandra/issues Twitter •  @MyCassandraJP •  @_MyCassandra # @MyCassandra orz •  @sunsuk7tp # Google Groups •  https://groups.google.com/group/my-cassandra
  59. 59.   / @railute •  Cassandra  Gemini Mobile Technologies / @geminimobile •  Hibari  / @yutuki_r •  Cassandra twitter  dann / @techmemo •  Cassandra  / @tatsuya6502 •  YCSB , Hibari  / @mikio1978 / @fallabs •  KyotoCabinet  / @muga_nishizawa  / @Nakata_itpro  / @shudo  Cassandra   UST ( )