NoSQL: Cassandra vs. HBase


Published in: Technology

  1. YCSB: Yahoo! Cloud Serving Benchmark
     Scalable Distributed Systems
     Antonio L. Severien (antonio.severien@gmail.com)
     João
  2. Overview
     • Distributed Databases
     • Cassandra
     • HBase
     • YCSB General View
     • YCSB Details
     • Amazon EC2
     • YCSB Results
     • YCSB Future
     • Conclusions
     • References
  3. Distributed Databases
     Traditional RDBMS
     • ACID transactions
     • Query language (SQL)
     • Data tied to the data model (hard to analyze)
     • Scalable only up to a limit
     Distributed Databases
     • Not ACID
     • Not relational
     • Column oriented (key-value)
     • CAP (Consistency, Availability, Partition tolerance)
     • Big Data (massively scalable)
  4. Distributed Databases
     • Sherpa/PNUTS
     • BigTable
     • HBase, Hypertable, HTable
     • Megastore
     • Azure
     • Cassandra
     • Amazon Web Services
     • S3, SimpleDB, EBS
     • CouchDB
     • Voldemort
     • Dynomite
     • Tokyo
     • Redis
     • MongoDB
  5. Distributed Databases
     • NoSQL databases have different designs and architectures
     Cassandra: Thrift, Gossip, Token ring, …
     HBase: HDFS, ZooKeeper, Hadoop (MapReduce)
     BigTable: GFS, Chubby (Lock Service), MapReduce
  6. Cassandra
     • Highlights
       • High availability
       • Incremental scalability
       • Eventually consistent
       • Tradeoffs between consistency and latency
       • Minimal administration
       • No SPOF (Single Point of Failure)
  7. Cassandra
     • CAP-aware
       • Cassandra favors Availability and Partition tolerance (AP); it is eventually consistent
       • Providing strong consistency in Cassandra increases latency
     • Partitioning
       • Token oriented
     • Explicit replication
       • Replication factor ≤ total nodes
     • High-level clients
       • Python, Java, C#, .NET, Scala, Ruby, PHP, Erlang, Haskell, etc.
     • Thrift driver-level interface
  8. Cassandra
     • Data Model
     • Cluster: the machines (nodes) in a logical Cassandra instance; can contain multiple keyspaces
     • Keyspace: a namespace for ColumnFamilies
     • ColumnFamilies: contain multiple columns, each with a name, value and timestamp, referenced by row keys; analogous to a table in an RDBMS
     • SuperColumns: columns with subcolumns
     Rows and columns:
       keyA: Column1, Column2, Column3
       keyB: Column5, Column6, Column10
     Column structure:
       byte[] name
       byte[] value
       i64 timestamp
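The data model on this slide can be sketched in a few lines of Python. This is an illustrative model only, not Cassandra's API: the class names `Column` and `ColumnFamily` and the methods `insert` and `get` are hypothetical, and the rule that the newest timestamp wins on conflicting writes is stated here as an assumption about how timestamps are used.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Column:
    """A column as on the slide: name, value and timestamp."""
    name: bytes
    value: bytes
    timestamp: int


class ColumnFamily:
    """Hypothetical in-memory column family: row key -> column name -> Column."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: Dict[bytes, Dict[bytes, Column]] = {}

    def insert(self, row_key: bytes, column: Column) -> None:
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        # Assumption: on conflicting writes, the newest timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def get(self, row_key: bytes, column_name: bytes) -> bytes:
        return self.rows[row_key][column_name].value


users = ColumnFamily("Users")
users.insert(b"keyA", Column(b"Column1", b"first", timestamp=1))
users.insert(b"keyA", Column(b"Column1", b"second", timestamp=2))
users.insert(b"keyA", Column(b"Column1", b"stale", timestamp=1))  # ignored, older
```

Rows in the same column family do not need to share the same set of columns, which matches the keyA/keyB example on the slide.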
  9. Cassandra: Partitioning and Replication
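Token-oriented partitioning with an explicit replication factor can be sketched as follows. This is a simplified illustration, not Cassandra's implementation: the `TokenRing` class is hypothetical, each node is assumed to own a single token derived from an MD5 hash of its name, and replicas are assumed to be the next nodes clockwise on the ring.

```python
import bisect
import hashlib
from typing import List


class TokenRing:
    """Hypothetical token ring: a key is stored on the node owning the next token."""

    def __init__(self, nodes: List[str], replication_factor: int) -> None:
        # The slides state: replication factor <= total nodes.
        if replication_factor > len(nodes):
            raise ValueError("replication factor must not exceed the number of nodes")
        self.replication_factor = replication_factor
        # Assumption: one token per node, sorted to form the ring.
        self.ring = sorted((self._token(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _token(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, row_key: str) -> List[str]:
        # First node whose token is >= the key's token, wrapping around the ring.
        start = bisect.bisect_left(self.tokens, self._token(row_key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


ring = TokenRing(["node1", "node2", "node3"], replication_factor=2)
owners = ring.replicas("user42")  # two distinct nodes, same result on every call
```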
  10. HBase
      “HBase is more a datastore than a database”
      • It lacks many of the features of an RDBMS
      • Distributed and scalable big data store
      • Regions model
      • Strong consistency
  11. HBase
      Built on top of the Hadoop Distributed Filesystem (HDFS)
  12. HBase
      • The NameNode is responsible for maintaining the filesystem metadata.
      • The DataNodes are responsible for storing HDFS blocks.
  13. HBase
      • The NameNode is responsible for maintaining the filesystem metadata.
      • The DataNodes are responsible for storing HDFS blocks.
      Note: in our case study, we were only interested in the HDFS layer.
  14. HBase
  15. HBase (diagram labels: DataNodes, NameNode)
  16. HBase
      • Data is stored in HBase tables.
      • Tables are made of rows and columns.
      • All columns belong to a particular column family.
      Important note: all column family members are stored together.
      • A query on a column family performs better because its members are stored together.
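The effect of storing column family members together can be sketched with one store per family. This is a toy model, not the HBase client API: the `Table` class and its methods are hypothetical, and columns are addressed as `family:qualifier`.

```python
from collections import defaultdict
from typing import Dict, List


class Table:
    """Hypothetical table that keeps a separate store per column family."""

    def __init__(self, families: List[str]) -> None:
        # One store per family: row key -> qualifier -> value.
        self.stores = {family: defaultdict(dict) for family in families}

    def put(self, row: str, column: str, value: str) -> None:
        family, qualifier = column.split(":", 1)
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family}")
        self.stores[family][row][qualifier] = value

    def get_family(self, row: str, family: str) -> Dict[str, str]:
        # Only the store of the requested family is touched.
        return dict(self.stores[family].get(row, {}))


table = Table(["info", "stats"])
table.put("row1", "info:name", "alice")
table.put("row1", "info:email", "alice@example.com")
table.put("row1", "stats:hits", "3")
info = table.get_family("row1", "info")
```

Reading the `info` family never scans the `stats` store, which is the intuition behind the performance note on the slide.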
  17. YCSB General View
      • Which is the best NoSQL DB?
      • How to compare?
      • Yahoo! Cloud Serving Benchmark (YCSB)
        • Benchmarking tool
        • Evaluates the performance of key-value and cloud DBs on a common set of workloads
        • Client: an extensible workload generator
      • Yahoo! Research
        • Brian F. Cooper
        • Joint work with Adam Silberstein, Erwin Tam, Raghu Ramakrishnan and Russell Sears
  18. YCSB Details
      • How does it work?
      YCSB Client components: Workload Executor, Client Threads, Statistics, DB Interface Layer
      Target: Cloud Serving Store
      Workload file
      • Read/write mix
      • Record size
      • Popularity distribution
      • …
      Command line
      • DB to use
      • Workload to use
      • Target throughput
      • Number of threads
      • …
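The client architecture above can be illustrated with a small Python analogy. This is not the YCSB code or its API (YCSB itself is written in Java): the `DB` interface, the `InMemoryDB` stand-in for a cloud serving store, and `run_workload` are hypothetical names, and only read and update operations are modeled.

```python
import threading
import time
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class DB(ABC):
    """Hypothetical DB interface layer: one subclass per store under test."""

    @abstractmethod
    def read(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def update(self, key: str, value: str) -> None: ...


class InMemoryDB(DB):
    """Stand-in for a cloud serving store, used only for this sketch."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def read(self, key: str) -> Optional[str]:
        with self._lock:
            return self._data.get(key)

    def update(self, key: str, value: str) -> None:
        with self._lock:
            self._data[key] = value


def run_workload(db: DB, operations: List[Tuple[str, str]], thread_count: int) -> Dict[str, List[float]]:
    """Split operations across client threads and record per-operation latency in ms."""
    latencies: Dict[str, List[float]] = {"read": [], "update": []}
    stats_lock = threading.Lock()

    def client(ops: List[Tuple[str, str]]) -> None:
        for op, key in ops:
            start = time.perf_counter()
            if op == "read":
                db.read(key)
            else:
                db.update(key, "value")
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            with stats_lock:
                latencies[op].append(elapsed_ms)

    threads = [
        threading.Thread(target=client, args=(operations[i::thread_count],))
        for i in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return latencies


ops = [("update", "k1"), ("read", "k1")] * 50
stats = run_workload(InMemoryDB(), ops, thread_count=4)
```

Supporting another store in this sketch means writing one more `DB` subclass, which mirrors the "extensible" claim made for the YCSB client.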
  19. YCSB Details
      Benchmark Tiers
      • Performance
        • Measure the latency/throughput curve
        • Increase throughput until saturation
      • Scalability
        • Scale up: increase hardware, data size and throughput proportionally
        • Elastic speedup: add servers while running a workload
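The performance tier amounts to repeating a run at increasing target throughputs and noting where the achieved throughput stops keeping up. The sketch below only builds the command lines and analyzes results; it does not execute YCSB. The flags `-P`, `-s`, `-threads` and `-target` and the `./bin/ycsb` path are the ones used in this deck's run examples. The helper names, the 95% tolerance, and every measurement except the first pair (target 100, about 98.9 ops/sec achieved, from the deck's sample output) are hypothetical.

```python
from typing import List, Optional, Tuple


def build_run_command(binding: str, workload: str, threads: int, target: int,
                      ycsb: str = "./bin/ycsb") -> List[str]:
    """Build the argument list for one run at a given target throughput."""
    return [ycsb, "run", binding, "-P", workload, "-s",
            "-threads", str(threads), "-target", str(target)]


def sweep_commands(binding: str, workload: str, threads: int,
                   targets: List[int]) -> List[List[str]]:
    """One command per target throughput, in increasing order."""
    return [build_run_command(binding, workload, threads, t) for t in sorted(targets)]


def saturation_point(measurements: List[Tuple[float, float]],
                     tolerance: float = 0.95) -> Optional[float]:
    """Return the first target whose achieved throughput falls below tolerance * target."""
    for target, achieved in sorted(measurements):
        if achieved < tolerance * target:
            return target
    return None


commands = sweep_commands("cassandra-10", "workloads/workloada", 10, [100, 1000, 10000])
# Hypothetical (target, achieved) pairs; only the first is taken from the deck's output.
saturated_at = saturation_point([(100, 98.9), (1000, 990.0), (10000, 6000.0)])
```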
  20. YCSB Details
      Load phase: loads the database
        $ ycsb load cassandra-10 -p hosts= -P workloadX
      Transactions phase: executes the workload
        $ ycsb run cassandra-10 -p hosts= -P workloadX
      Random Load Distribution
  21. YCSB Details
      # Yahoo! Cloud System Benchmark
      # Workload A: Update heavy workload
      # Application example: Session store recording recent actions
      #
      # Read/update ratio: 50/50
      # Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
      # Request distribution: zipfian
      recordcount=1000
      operationcount=1000
      readallfields=true
      readproportion=0.5
      updateproportion=0.5
      scanproportion=0
      insertproportion=0
      requestdistribution=zipfian
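To make the workload file concrete, the sketch below parses the properties shown above and generates an operation stream with the same read/update mix and a skewed key popularity. It is an approximation for illustration, not YCSB's generator: the function names are hypothetical, and the zipfian distribution is modeled simply as weights proportional to 1/rank^0.99, with the exponent being an assumption.

```python
import random
from typing import Dict, List, Tuple

WORKLOAD_A = """\
recordcount=1000
operationcount=1000
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blanks and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def zipfian_weights(record_count: int, exponent: float = 0.99) -> List[float]:
    """Assumed popularity model: weight of the record at a given rank is 1 / rank**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, record_count + 1)]


def generate_operations(props: Dict[str, str], seed: int = 0) -> List[Tuple[str, int]]:
    """Return (operation, record index) pairs following the configured mix."""
    rng = random.Random(seed)
    names = ["read", "update", "scan", "insert"]
    mix = [float(props.get(name + "proportion", 0)) for name in names]
    record_count = int(props["recordcount"])
    op_count = int(props["operationcount"])
    weights = None  # uniform unless the workload asks for zipfian
    if props.get("requestdistribution") == "zipfian":
        weights = zipfian_weights(record_count)
    ops = rng.choices(names, weights=mix, k=op_count)
    keys = rng.choices(range(record_count), weights=weights, k=op_count)
    return list(zip(ops, keys))


props = parse_properties(WORKLOAD_A)
operations = generate_operations(props)
```

With workload A, only reads and updates appear in the stream, in roughly equal numbers, and low-ranked records are requested far more often than high-ranked ones.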
  22. YCSB Details
      • Execution parameters
      $ ./bin/ycsb run cassandra-10 -P workloads/workloada -s -threads 10 -target 100 > transactions.dat
      [OVERALL], RunTime(ms), 10110
      [OVERALL], Throughput(ops/sec), 98.91196834817013
      [UPDATE], Operations, 491
      [UPDATE], AverageLatency(ms), 0.054989816700611
      [UPDATE], MinLatency(ms), 0
      [UPDATE], MaxLatency(ms), 1
      [UPDATE], 95thPercentileLatency(ms), 1
      [UPDATE], 99thPercentileLatency(ms), 1
      [UPDATE], Return=0, 491
      [UPDATE], 0, 464
      [UPDATE], 1, 27
      [UPDATE], 2, 0
      [UPDATE], 3, 0
      [UPDATE], 4, 0
      ...
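The output above is a list of `[SECTION], metric, value` lines, so it is easy to load for plotting or comparison. The parser below is a minimal sketch written for this deck's sample output; the function name `parse_ycsb_output` is hypothetical, and lines that do not match the three-field shape are skipped.

```python
from typing import Dict

SAMPLE_OUTPUT = """\
[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], MaxLatency(ms), 1
[UPDATE], 0, 464
[UPDATE], 1, 27
"""


def parse_ycsb_output(text: str) -> Dict[str, Dict[str, float]]:
    """Map each section (e.g. OVERALL, UPDATE) to its metric -> value pairs."""
    results: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # skip the command line, status lines and blanks
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue
        section = parts[0].strip("[]")
        try:
            value = float(parts[2])
        except ValueError:
            continue
        # Histogram buckets such as "[UPDATE], 0, 464" are stored under the key "0".
        results.setdefault(section, {})[parts[1]] = value
    return results


metrics = parse_ycsb_output(SAMPLE_OUTPUT)
```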
  23. YCSB Details
      $ ./bin/ycsb run basic -P workloads/workloada -P large.dat -s -threads 10 -target 100 -p measurementtype=timeseries -p timeseries.granularity=2000 > transactions.dat
      [OVERALL], RunTime(ms), 10077
      [OVERALL], Throughput(ops/sec), 9923.58836955443
      [UPDATE], Operations, 50396
      [UPDATE], AverageLatency(ms), 0.04339630129375347
      [UPDATE], MinLatency(ms), 0
      [UPDATE], MaxLatency(ms), 338
      [UPDATE], Return=0, 50396
      [UPDATE], 0, 0.10264765784114054
      [UPDATE], 2000, 0.026989343690867442
      [UPDATE], 4000, 0.0352882703777336
      [UPDATE], 6000, 0.004238958990536277
      [UPDATE], 8000, 0.052813085033008175
      [UPDATE], 10000, 0.0
      [READ], Operations, 49604
      [READ], AverageLatency(ms), 0.038242883638416256
      [READ], MinLatency(ms), 0
      [READ], MaxLatency(ms), 230
      [READ], Return=0, 49604
      [READ], 0, 0.08997245741099663
      [READ], 2000, 0.02207505518763797
      [READ], 4000, 0.03188493260913297
      [READ], 6000, 0.004869141813755326
      [READ], 8000, 0.04355329949238579
      [READ], 10000, 0.005405405405405406
  24. YCSB Details: Status Output
  25. Amazon EC2 Configuration
      Large Instance
      • 7.5 GB memory
      • 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
      • 850 GB instance storage
      • 64-bit platform
      • I/O performance: High
      • API name: m1.large
      Experiment set-up
      • Cassandra cluster: 3 nodes + 1 node (elasticity)
      • HBase cluster: 3 nodes
  26. Amazon EC2 Usage: Cassandra
      Load phase: 60,000,000 records of 1 KB
  27. Amazon EC2 Usage: HBase
      Load phase: 60,000,000 records of 1 KB
  28. Amazon EC2 Usage: Cassandra and HBase
      Load phase: 60,000,000 records of 1 KB
  29. Amazon EC2 Usage: Cassandra and HBase
      Load phase: 60,000,000 records of 1 KB
  30. Amazon EC2 Usage: Cassandra
      Transaction phase:
      - 10,000 records
      - 1,000,000 operations
      - 250 threads
  31. YCSB Cassandra Results: Update Heavy Workload (50/50)
      [Figure: two panels, Update and Read; average latency (ms) vs. throughput (ops/sec)]
  32. YCSB HBase Results
      [Figure: two panels, Update HBase and Read HBase; average latency (ms) vs. throughput (ops/sec); x-axis values: 485, 492.38, 507.17, 562.33, 620.04, 634.82, 734.32, 845.15]
  33. YCSB Cassandra Results: Elasticity, Cassandra 1.0
      [Figure: latency (ms) vs. time (milliseconds)]
  34. YCSB Cassandra Results: Elasticity, Cassandra 1.0
      [Figure: latency (ms) vs. time (milliseconds)]
  35. YCSB Future
      Provide statistics for:
      - Availability
      - Replication
      Additional distributed databases
      Currently supported: Cassandra, MapKeeper, MongoDB, Redis, Voldemort, VMware vFabric GemFire, HBase
  36. Conclusions
      • YCSB provides a common ground for benchmarking cloud DB services
      • Good for learning and experimenting with different distributed databases
      • Open source, extensible for new databases
      • The laboratory with Amazon EC2 provided good insight into setting up cloud services
      • Challenges
        • Installation problems
        • Hard-to-follow documentation
        • Working in a distributed environment requires a lot of configuration
  37. References
      • YCSB (Yahoo! Cloud Serving Benchmark)
      • Yahoo! Research
      • BigTable
      • Cassandra
      • HBase
  38. Questions