Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB

Published in: Data & Analytics

  1. WP-Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB. Presented by Athiq Ahamed Supriya
  2. Introduction
      • Enormous amounts of data (Big Data)
      • Scalability issues in RDBMS
      • Rise of NoSQL databases
      • Amazon Dynamo
      • Bigtable
      • CAP theorem
      • BASE systems
  3. CAP Theorem
      • Consistency
      • Availability
      • Partition tolerance
      The CAP theorem states that a distributed system can guarantee at most two of these three properties at the same time.
  4. RDBMS vs NoSQL
      • Query language: RDBMS supports a powerful query language; NoSQL supports a very simple query language
      • Schema: RDBMS has a fixed schema; NoSQL has no fixed schema
      • Consistency: RDBMS follows ACID (Atomicity, Consistency, Isolation and Durability); NoSQL is only eventually consistent
      • Transactions: RDBMS supports transactions; NoSQL does not support transactions
      Content: tutorialspoint.com
  5. BASE
      • Basically available: the system guarantees availability, in terms of the CAP theorem
      • Soft state: the state of the system may change over time, because of the eventual consistency model
      • Eventual consistency: the system will become consistent over time
      Content: www.edureka.in
  6. Selection of NoSQL: making the right choice!
      • Fast performance is the key.
      • Proof-of-concept (POC) processes should include the right benchmarks:
          • Configurations
          • Parameters
          • Workloads
  7. Benchmark configuration
      • Yahoo Cloud Serving Benchmark (YCSB)
      • Top three NoSQL databases: Apache Cassandra, Apache HBase and MongoDB
      • Amazon Web Services EC2 instances hosted the tests
      • Tests were performed three times, on three different days
  8. Benchmark configuration
      • The tests ran on large instances (15 GB RAM and 4 CPU cores)
      • Instances used a customized Ubuntu with Oracle Java 1.6 installed as a base
      • A custom script was written to drive the benchmark processes
  9. Understanding NoSQL Databases
      • Each NoSQL system performs differently.
      • Their components and internal workings differ.
      • Apache Cassandra: columnar database model
      • Apache HBase: columnar database model
      • MongoDB: document storage database model
  10. Apache Cassandra
      • Cassandra is scalable, fault-tolerant, and consistent. All nodes are equal.
      • Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
      • Key components: node, cluster, commit log, mem-table, SSTable and Bloom filter
      Content: http://www.tutorialspoint.com/cassandra/cassandra_architecture.htm
  11. Apache Cassandra
      • Ring structure, peer-to-peer architecture
      • All nodes are equal
      • This improves general database availability
      • Scaling up and scaling down is easier
      • Cassandra is a key-value, column-oriented database
  12. Apache Cassandra (data model figure)
      Content: http://demoiselle.sourceforge.net/component/demoiselle-cassandra/1.0.0/images/datamodel1.png
  13. Apache Cassandra
      • Cassandra has an internal keyspace called system, which stores metadata about the cluster.
      • Metadata:
          • The node's token
          • The cluster name
          • Keyspace and schema definitions (dynamic loading)
          • Whether or not the node is bootstrapped
      Content: https://www.edureka.co/blog/category/apache-cassandra/
  14. Apache Cassandra
      • Commit log: crash recovery mechanism. Every write operation is written to the commit log.
      • Mem-table: a memory-resident data structure.
      • SSTable: a disk file to which the data is flushed from the mem-table.
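
The write path on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the class name `ToyWritePath`, the `flush_threshold` parameter, and the in-memory lists standing in for on-disk files are all hypothetical and chosen only for illustration.

```python
class ToyWritePath:
    """Toy model of a commit log, a mem-table and SSTables (illustrative only)."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # each flushed mem-table becomes one sorted table
        self.flush_threshold = flush_threshold  # hypothetical flush trigger

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write for crash recovery
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. persist the mem-table as a sorted table; in this toy model the
        #    log entries it covered are simply discarded afterwards
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []

    def read(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest table to oldest
            for k, v in table:
                if k == key:
                    return v
        return None

    def recover(self):
        # After a crash the mem-table is lost; replay the commit log to rebuild it.
        self.memtable = dict(self.commit_log)
```

The point of the sketch is the ordering: the log is appended before the in-memory update, so any write that was acknowledged but not yet flushed can be rebuilt by `recover()`.
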
  15. Apache Cassandra
      • Bloom filters are used as a performance booster.
      • A Bloom filter is a fast algorithm for testing whether an element is a member of a set.
      • Bloom filters serve as a special kind of cache: they reside in memory, so lookups are quick.
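
A minimal Bloom filter sketch may make the membership test concrete. The class name, the default sizes, and the use of salted SHA-256 digests as hash functions are assumptions made for illustration; they are not taken from the slides or from Cassandra's code.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))
```

A read path can call `might_contain(key)` before touching a disk file and skip the file whenever the answer is `False`, which is where the performance gain comes from.
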
  16. Apache Cassandra
      • Gossip protocol: communication between nodes, coordination and failure checks
      • Anti-entropy protocol: replica synchronization mechanism ensuring that data on different nodes is kept up to date (Merkle trees)
      • Snitches determine host proximity
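
The role of Merkle trees in replica synchronization can be illustrated with a small sketch. It assumes each replica is a plain Python dict of key/value strings and compares only the root hashes; a real anti-entropy repair would also walk the subtrees to locate the differing ranges. The function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(replica: dict) -> bytes:
    """Return the Merkle root over a replica's rows, hashed in key order."""
    level = [_h(f"{k}={v}".encode("utf-8")) for k, v in sorted(replica.items())]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash so every node has a pair
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def replicas_in_sync(replica_a: dict, replica_b: dict) -> bool:
    # Equal roots imply equal contents (barring hash collisions),
    # so only one small hash has to be exchanged between nodes.
    return merkle_root(replica_a) == merkle_root(replica_b)
```
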
  17. Apache Cassandra: Read/Write operation
  18. Apache HBase
      • A sparse, distributed, consistent, multidimensional sorted map
      • HBase is a key/value store
      • A key consists of row key, column family, column and timestamp
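
The "multidimensional sorted map" idea can be sketched as a nested mapping from (row key, column family, column) to timestamped values. This is an illustrative model only; the class `ToyHTable` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class ToyHTable:
    """Toy model of a sparse, sorted, versioned key/value table."""

    def __init__(self):
        # (row key, "family:column") -> {timestamp: value}; absent cells cost nothing
        self.cells = defaultdict(dict)

    def put(self, row, family, column, value, timestamp):
        self.cells[(row, f"{family}:{column}")][timestamp] = value

    def get(self, row, family, column):
        versions = self.cells.get((row, f"{family}:{column}"))
        if not versions:
            return None
        return versions[max(versions)]  # the newest timestamp wins

    def scan(self, start_row, stop_row):
        """Yield (row, column, latest value) in sorted row-key order.

        start_row is inclusive and stop_row is exclusive.
        """
        for row, column in sorted(self.cells):
            if start_row <= row < stop_row:
                versions = self.cells[(row, column)]
                yield row, column, versions[max(versions)]
```

Because keys are kept in sorted order, a range scan over contiguous row keys is cheap, which is also why contiguous rows can be grouped into regions.
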
  19. Apache HBase (row key figure)
      Content: http://zhangjunhd.github.io/assets/2013-02-25-apache-hbase/rowkey-
  20. Apache HBase Components
      • Region: contiguous rows form a region
      • Region server (RS): serves one or more regions
      • Master server: daemon responsible for managing the HBase cluster
      • HDFS: distributed, open-source file system containing HBase's data
      • ZooKeeper: distributed, open-source coordination service for the master and region servers
      Content: https://www.mapr.com/blog/in-depth-look-hbase-architecture
  21. Apache HBase Architecture
  22. First Read/Write to HBase
      • The client obtains from ZooKeeper the region server (RS) that hosts the meta table
      • The client queries the meta table for the RS that holds the corresponding row key
      • The client receives the row from the respective region server
      • The client caches this information along with the location of the meta table server
  23. HBase RS Components
      • WAL: the Write Ahead Log is a file on the distributed file system. It is used to store new data.
      • Block cache: the read cache. It stores frequently read data in memory.
      • MemStore: the write cache. It stores new data that has not yet been written to disk.
      • HFiles store the rows as sorted key-values on disk.
  24. HBase Write steps (1)
      • The client writes the data to the WAL file stored on disk
      • The WAL is used to recover data that has not yet been persisted if a server crashes
      • Once data is written to the WAL, it is placed in the MemStore
  25. HDFS Write steps (2)
      • All writes and reads go to and from the primary node.
      • HDFS replicates the WAL and HFile blocks. Replication happens automatically.
      • When data is written to HDFS, one copy is written locally, then it is replicated to a secondary node and later to a tertiary node.
  26. Cassandra and HBase
      • Cassandra use case: availability and partition tolerance requirements. Consistency is tunable and can be set high through the consistency-level option.
      • HBase use case: consistency and scalability. However, with a small number of nodes/threads, high availability is also achieved.
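
Tunable consistency is often explained with the Dynamo-style rule that reads see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The sketch below encodes that rule of thumb; the function names are hypothetical and the slides do not state which consistency levels were used in the benchmark.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas, e.g. 2 of 3 or 3 of 5."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    # R + W > N guarantees that every read set overlaps every write set
    # in at least one replica, so a read always reaches the latest write.
    return read_replicas + write_replicas > replication_factor
```

For example, with a replication factor of 3, quorum reads and quorum writes (2 + 2 > 3) overlap, while reading and writing a single replica (1 + 1) trades that guarantee for lower latency and higher availability.
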
  27. MongoDB
      • Document-oriented database
      • High performance and automatic scaling
      • High consistency and partition tolerance
      • Replication and failover for high availability
      • Low latency
      • Flexible indexing
  28. MongoDB Concepts
      • A document is the basic unit of data in MongoDB (comparable to a row)
      • A collection is similar to a table
      • A single instance can host multiple independent databases
      • Every document has a special key, "_id"
      • Powerful JavaScript shell for administration
      • The config database contains the metadata of clusters
  29. MongoDB Simple Architecture
  30. Read/Write MongoDB
      • A mongos receives queries from applications
      • It uses metadata from the config server to locate the data
      • mongos directs write operations to a particular shard
      • mongos uses the cluster metadata from the config database
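
The routing role of mongos can be illustrated with a toy range-based router. This is a sketch under assumptions: the class `ToyRouter`, the numeric shard key, and the list of chunk boundaries standing in for the config metadata are all hypothetical, and no MongoDB driver or API is involved.

```python
import bisect


class ToyRouter:
    """Toy query router that maps a shard key to a shard by key range."""

    def __init__(self, chunk_upper_bounds, shard_names):
        # Shard i holds keys below chunk_upper_bounds[i]; the last shard takes the rest.
        if len(shard_names) != len(chunk_upper_bounds) + 1:
            raise ValueError("need exactly one more shard than boundaries")
        self.bounds = list(chunk_upper_bounds)  # stands in for config metadata
        self.shards = list(shard_names)
        self.data = {name: {} for name in shard_names}

    def shard_for(self, key):
        # Binary search over the sorted boundaries picks the owning shard.
        return self.shards[bisect.bisect_right(self.bounds, key)]

    def insert(self, key, document):
        self.data[self.shard_for(key)][key] = document

    def find(self, key):
        return self.data[self.shard_for(key)].get(key)
```

The application only talks to the router; which shard holds a document is decided entirely from the metadata, mirroring the division of labor the slide describes.
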
  31. Recap: Importance of Benchmark and Factors
      • Scalability
      • Availability
      • Partition tolerance
      • Consistency
      • Most important: performance
      • Yahoo Cloud Serving Benchmark (YCSB)
  32. Results: Load Process
  33. Results: Read/Write Mix Workload
  34. Results: Read/Scan Mix Workload
  35. Results: Read Latency across all workloads
  36. Results: Insert Latency across all workloads
  37. Live Demo: Let's migrate from a traditional database!
  38. Discussion
      • Identify the data model for the application
      • The corresponding data sets have to be known
      • Determine whether the application requires replication
      • Identify the performance requirements
      • Prototype the application
      • Test the performance of the prototype
  39. Conclusion
      • NoSQL is replacing traditional relational databases
      • Performance is the key feature
      • Benchmarks are important
      • The performance of the top three NoSQL databases was tested
      • Cassandra outperformed the other NoSQL databases tested
      • Decide based on the application