Comparison of different NoSQL databases, namely HBase and MongoDB, at different workloads using the Yahoo! Cloud Serving Benchmark (YCSB)
Tools used
> HBase, MongoDB, Shell Scripting, YCSB, Hadoop Environment
> Tableau for Visualization
> LaTeX for documentation
This document analyzes the performance of MongoDB and HBase databases. It describes the architectures and key characteristics of each database, including MongoDB's document model, auto-sharding, and replication features. It also covers HBase's use of HDFS for storage and Zookeeper for coordination. The document examines the security features of each database, such as authentication, authorization, and encryption. Finally, it discusses findings from literature that NoSQL databases sacrifice ACID properties for scalability and performance.
Data Storage and Management Project
on
Performance Analysis of HBASE and MONGODB
Kaushik Rajan
17165849
MSc Data Analytics – 2018/9
Submitted to: Vikas Tomer
Abstract
The objective of this project is to compare two NoSQL databases, namely HBase and MongoDB. The performance comparison is made between the two for the opcounts 12500, 25000, 50000, 75000 and 100000, and for three different workloads (Workload A, Workload B and Workload D), using the Yahoo! Cloud Serving Benchmark (YCSB). The tests were run three times and the average of the runs was used to evaluate the performance of both NoSQL databases at the different workloads. The databases were compared on the average latency of the read operation versus total throughput, the average latency of the update operation versus total throughput, and the overall throughput across all three workloads. The output logs were used to plot the graphs in Tableau.
1 Introduction
Over 2.5 quintillion bytes of data are generated every single day across different platforms. In the last two years alone, more than 90% of the world's data has been generated, and the volume is expected to increase multi-fold within the next few years. RDBMSs were long used to store this generated data; they have been the workhorse of data processing for decades and are the basis of SQL. Traditional databases, however, cannot handle the vast amount and variety of data being generated in the current age. The major factors to consider while picking a database are the 3Vs, namely Volume, Velocity and Variety. Traditional transactional databases cannot handle unstructured data and also cannot be used for analytics purposes. Therefore, NoSQL systems such as Hadoop, HBase, MongoDB, Apache Spark and Cassandra are nowadays preferred over traditional databases. The advantages of NoSQL databases are that they are faster, more efficient, offer better performance and can handle unstructured data. Each NoSQL database has certain advantages and disadvantages relative to the others, so choosing the right database for a particular task is essential. For this project, the NoSQL databases HBase and MongoDB are compared. Different aspects, such as the time taken for the READ operation, the time taken for the UPDATE operation, the time taken for the INSERT operation, and the throughput, are compared by subjecting the databases to different workloads at different opcounts.
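As an illustrative sketch of how such a benchmark is driven (YCSB binding names and connection properties vary with the YCSB release and cluster setup, so the values below are assumptions rather than the project's exact commands), each database is first loaded and then run against a workload for a given operation count:

#!/bin/bash
# Illustrative YCSB invocations for the MongoDB binding; the binding name
# ('mongodb') and connection URL must match the installed YCSB release.
for ops in 12500 25000 50000 75000 100000; do
  for wl in workloada workloadb workloadd; do
    for run in 1 2 3; do   # each test was run three times and averaged
      # Load phase: insert the initial records
      ./bin/ycsb load mongodb -s -P "workloads/$wl" \
          -p recordcount="$ops" \
          -p mongodb.url="mongodb://localhost:27017/ycsb" \
          > "logs/mongodb_${wl}_${ops}_load${run}.txt"
      # Run phase: execute the workload's read/update/insert mix
      ./bin/ycsb run mongodb -s -P "workloads/$wl" \
          -p operationcount="$ops" \
          -p mongodb.url="mongodb://localhost:27017/ycsb" \
          > "logs/mongodb_${wl}_${ops}_run${run}.txt"
    done
  done
done

The HBase runs follow the same pattern with the matching HBase binding; there, YCSB expects the target table (by default "usertable", with a column family passed as "-p columnfamily=...") to be created in the HBase shell before the load phase.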
2 Key Characteristics of Chosen Data Storage Management Systems
2.1 HBase
HBase is an open-source, distributed, column-oriented database built on top of HDFS.
It is written in Java and modeled after Google's Bigtable. Hadoop by itself only supports
batch processing, whereas HBase can access data randomly and provides fast lookups in
large tables. HBase also offers low latency when accessing a single row among billions of
records. Since HBase is a column-oriented database, records are kept sorted, which makes
it suitable for Online Analytical Processing (OLAP). HBase also supports recovery, as it
provides replication.
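As a concrete illustration of this random read/write access, the following is a minimal HBase shell session; the table and column family names are hypothetical and not part of the project setup.

# Minimal sketch: create a table, write a cell, read it back, scan a few rows
hbase shell <<'EOF'
create 'usertable', 'cf'
put 'usertable', 'user1', 'cf:field0', 'value0'
get 'usertable', 'user1'
scan 'usertable', {LIMIT => 5}
EOF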
Other key features are:
• HBase is linearly scalable.
• Failover support is automatic.
• Provides consistent read and write operations.
• Integrates with Hadoop, both as a source and destination.
• Provides data replication within clusters.
2.2 MongoDB
MongoDB is a cross-platform, document-oriented NoSQL database program written in
C++. It is free and open-source. MongoDB stores data as JSON-like documents. With
MongoDB, users can precisely control where their data is placed. MongoDB offers
complete flexibility in deployment and can be migrated easily.
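As a small illustration of the document model and ad hoc querying, a minimal mongo shell sketch is shown below; the database, collection and field names are hypothetical.

# Insert a JSON-like document, index a field, and run an ad hoc query
mongo benchdb --eval '
  db.users.insertOne({name: "alice", age: 30});
  db.users.createIndex({name: 1});
  printjson(db.users.find({age: {$gt: 25}}).toArray());
'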
Other key features are:
• Supports ad hoc queries.
• Fields can be indexed with primary and secondary indices.
• Easy recovery of data, as MongoDB stores multiple replicas of the data.
• Scales horizontally using sharding.
• Splits large files into multiple parts and stores each part as a separate document
(GridFS).
• Supports map-reduce and aggregation tools.
• High performance and efficiency.
3 Database Architectures
3.1 HBase
Figure 1: Architecture of HBase
The HBase architecture mainly consists of four components:
• HMaster
• HRegionserver
• HRegions
• ZooKeeper
3.1.1 HMaster
The HBase architecture consists of a single master node, known as the HMaster, and
several slaves called region servers. A region server can serve multiple regions, but a
region can only be served by a single region server. When a write request is sent by a
client, the request goes to the HMaster, which forwards it to the corresponding region
server. The HMaster is the master server in the HBase architecture and acts as the
main monitoring agent, watching over all the region server instances. It maintains the
nodes present in the cluster and is a vital part of why HBase performs well. It takes
care of all the region servers, assigning and reassigning regions to them, and it controls
load balancing. It also provides the interface through which the user can create, delete
or update a table. [https://www.edureka.co/blog/hbase-architecture/ (2018)]
3.1.2 HRegionserver
These are the nodes that handle the requests received from clients. The HRegionserver
routes the read/write requests it receives to the respective regions where the actual data
resides. The region server consists of multiple components, namely:
Block Cache: frequently read data is stored in this cache for faster access.
MemStore: the write cache, which temporarily stores data that is yet to be flushed
to disk. There is one MemStore per column family.
WAL (Write-Ahead Log): stores new data that has not yet been persisted to
permanent storage, so it can be recovered after a failure.
HFile: stores the actual rows as sorted key-value pairs on disk.
3.1.3 HRegions
These are the basic elements of an HBase cluster, and each consists of two components,
namely the MemStore and the HFile.
3.1.4 ZooKeeper
ZooKeeper is also called the coordinator, as it coordinates all the processes happening
inside the HBase distributed system. It maintains the server state inside the cluster.
There is continuous communication between the region servers, the HMaster and
ZooKeeper: the region servers and the HMaster continuously send heartbeat signals so
that ZooKeeper can check whether a server is alive. When a server fails to send its
heartbeat to ZooKeeper, its session is deleted. ZooKeeper also takes part in the data
recovery process and provides server failure notifications, so that a backup server can
take over if a particular server fails. [https://www.tutorialspoint.com/hbase/hbase_architecture.htm
(2018)]
Meta Table
The META table is a catalog table in HBase that keeps track of the regions in the
system and the region servers hosting them. As observed from the figure, the table is
maintained in the form of keys and values.
HBase Write Mechanism
Whenever a write request arrives, it is first written to the WAL. The data is then
copied to the MemStore and an acknowledgement is sent to the client. When the
MemStore reaches its threshold, the data is flushed into an HFile.
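This flush step can also be triggered explicitly from the HBase shell, which makes the mechanism visible; a minimal sketch with a hypothetical table name:

hbase shell <<'EOF'
put 'usertable', 'user42', 'cf:field0', 'new-value'   # recorded in the WAL, then the MemStore
flush 'usertable'                                     # force the MemStore contents into a new HFile
EOF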
3.2 MongoDB
Figure 2: Architecture of MongoDB
The Nexus architecture: the main idea behind the creation of MongoDB is to combine
the best of relational databases and NoSQL.
Expressive query language: clients can access and manipulate the data flexibly, and
MongoDB also supports analytical applications.
Secondary indexes: MongoDB provides rich indexing for read and write operations,
which improves its efficiency.
3.3 MongoDB Flexible Storage Architecture
The flexible storage architecture automatically manages how data flows to storage,
which reduces complexity. MongoDB ships with four storage engines, namely:
WiredTiger storage engine: provides the best all-round performance and storage
efficiency.
Encrypted storage engine: ensures that highly confidential data is protected by
encrypting it.
In-Memory storage engine: keeps working data in memory, which is helpful for
real-time analytics.
MMAPv1 engine: an improved version of the original storage engine.
A minimal sketch of selecting an engine at startup follows.
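The data path below is hypothetical, and the In-Memory engine additionally requires MongoDB Enterprise.

# Start mongod with the WiredTiger engine (the default since MongoDB 3.2)
mongod --storageEngine wiredTiger --dbpath /data/db
# Start mongod with the In-Memory engine (MongoDB Enterprise only)
mongod --storageEngine inMemory --dbpath /data/db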
4 Security
4.1 HBase
HBase provides secure client access and runs on top of HDFS. To protect itself from
unauthenticated/unauthorized users and network-based breaches, it uses the Kerberos
network authentication protocol. Kerberos is based on secret keys, and each client has
to prove its identity. Each service is authenticated using a keytab file, and the procedure
to create a keytab file is similar to the one used in Hadoop.
[https://data-flair.training/blogs/hbase-security/ (2018)]
4.1.1 Kerberos Authentication
The HBase client authenticates itself to the Kerberos server, after which it receives a
Ticket Granting Ticket (TGT). Kerberos authentication checks whether a received
request is valid and whether it comes from the respective HBase region server.
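On the client side this typically looks like obtaining a ticket before opening the HBase shell; the principal name and keytab path below are hypothetical.

# Obtain a TGT from the KDC using the service keytab, then connect
kinit -kt /etc/security/keytabs/hbase.keytab hbase/host1.example.com@EXAMPLE.COM
klist          # verify the ticket was issued
hbase shell    # the client now authenticates to HBase via Kerberos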
4.1.2 Kerberos Authorisation
A limit can be set in HBase on how many clients can access it at any given time, and
an access level can be set for each client. The different levels of authorisation are:
Superuser: A client who has unlimited access
Read(R): Read access only.
Write(W): Write access only.
Execute(X): Execute access only.
Create(C): Create access only.
Admin(A): Right to perform cluster admin operations.
A superuser is free to perform any of the available operations in HBase and is configured
at the HMaster level in hbase-site.xml. A global user is granted global access and has
administrative rights over all tables. A namespace-authorised user has the right to access
all tables inside the namespace allocated to them. A table-access user can access only the
particular tables for which the user holds rights. Authorisations can also be combined, so
a user can hold multiple access levels. For example, a global admin can access all tables
and can also perform cluster-wide operations, while a table admin can create or drop a
table. A sketch of granting these permissions from the HBase shell follows.
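The usernames and table name below are hypothetical.

hbase shell <<'EOF'
grant 'analyst1', 'R', 'usertable'    # read-only on one table
grant 'etl1', 'RW', 'usertable'       # read and write on one table
grant 'admin1', 'RWXCA'               # global admin rights
user_permission 'usertable'           # list the current grants
EOF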
4.2 MongoDB
MongoDB has 5 core security areas:
Authentication: LDAP centralizes user information in a directory.
Authorization: defines role-based access control for each user.
Encryption: protects all the data stored in MongoDB.
Auditing: the ability to track which user performed which kind of access to the
database.
Governance: refers to validating documents and preventing sensitive data from
entering the main system.
LDAP authentication stores users' information, without their roles, in a directory.
MongoDB's built-in access control is turned off by default, and it lacks features such as
password complexity rules, age-based rotation and centralized identification; LDAP fills
in these gaps.
The 5 major built-in roles are listed below; a sketch of assigning one follows the list.
Read: users are given read-only access.
ReadWrite: allows users to read and edit data.
dbOwner: has the right to grant access to users and to add or remove users.
dbAdminAnyDatabase: grants database-administration rights on all databases.
Root: the superuser role, similar to the superuser in HBase.
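The user, password and database names below are hypothetical.

# Create a user with the built-in read role on one database
mongo admin --eval '
  db.createUser({
    user: "reporter",
    pwd: "changeMe",
    roles: [ { role: "read", db: "benchdb" } ]
  });
'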
5 Literature Survey
In the paper presented by Ganesh Chandra (2015), the author tested the performance
of NoSQL tools such as MongoDB, HBase and Cassandra. The author states that
NoSQL databases are slowly replacing traditional databases, highlights the performance
of the three databases used in the test, and concludes that Cassandra performs better
than the other two. The databases were tested using different workloads and compared
based on the number of read, write, update and insert operations and on throughput.
Pallas et al. (2016) present information about the security aspects of HBase, how
sensitive data is analyzed, and how securing sensitive data decreases the performance of
an HBase system. The authors argue that HBase security comes at a price: either the
performance of the database or the addition of more resources.
Waage & Wiese (2015) explain that NoSQL databases are not well secured by default,
and benchmark database performance with YCSB while operating over encrypted data.
The paper presented by Yoon & Lee (2018) contains information on recovering deleted
data in a MongoDB database. MongoDB usually creates multiple replicas of the data
as backup, so if the original data gets deleted or corrupted, it can be recovered from the
replicas. The paper analyzes in depth the internal structures of WiredTiger and
MMAPv1, which are the storage engines of the database.
Colombo & Ferrari (2017) discuss how privacy has become a major requirement for
DBMSs and propose an effective solution for integrating purpose-based access control
into MongoDB, providing granular access to each user as per their requirements.
Fotache & Cogean (2013) discuss the differences between MySQL and NoSQL databases
and which is better suited for a mobile application. The authors cover in depth the
performance of both MySQL and a NoSQL database when used for a mobile application,
and also discuss migrating a database from MySQL to NoSQL. They conclude that a
NoSQL database is better than a MySQL database for a mobile application, as it
performs better and is more efficient.
6 Performance Test Plan
The Yahoo! Cloud Serving Benchmark (YCSB) is used for benchmarking the selected
databases, HBase and MongoDB. YCSB is the most widely used open-source tool for
carrying out such benchmarking tests. After configuration, the required services are
started using the commands start-dfs.sh, start-yarn.sh and start-hbase.sh, and the
testharness script is then used to process the operations in the different test scenarios.
The databases were subjected to different workloads at different opcounts. Their
performance was then compared based on the average latency, the throughput, and the
average time taken for the read, update and insert operations. A sketch of this sequence
for one scenario follows.
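This assumes a standard YCSB installation; the binding name (hbase10), table properties and output path are indicative and vary with the YCSB version and setup.

# Start the Hadoop and HBase daemons
start-dfs.sh
start-yarn.sh
start-hbase.sh
# Load the data set, then run the workload; -s streams status, and the log is kept for plotting
bin/ycsb load hbase10 -P workloads/workloada -p table=usertable -p columnfamily=cf
bin/ycsb run  hbase10 -P workloads/workloada -p operationcount=100000 -s > output/hbase_a_100000.txt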
Different operations done by the workloads are:
• Read
• Update
• Insert
• Scan
The performance of the databases also depends on the configuration of the machine used.
Device specification:
Machine name: Dell Predator
Processor: Intel Core i7
Machine RAM: 12 GB
Virtual machine (OpenStack):
OS: Ubuntu 16.04 LTS (64-bit)
RAM: 4.0 GB
HDD space: 64 GB
Details of the different types of available workloads (the properties-file form of
Workload A is sketched after this list):
Workload A: read/update 50/50
Workload B: read/update 95/5
Workload C: read only
Workload D: read latest
Workload E: short ranges
Workload F: read-modify-write
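Each YCSB workload is a plain properties file. The counts shown here are placeholders, since the harness overrides them per run.

# Write the standard Workload A definition (50/50 read/update, zipfian request distribution)
cat > workloads/workloada <<'EOF'
recordcount=1000
operationcount=12500
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
EOF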
Workloads used for this project:
Figure 3: workloadlist.txt
Operational counts used:
Figure 4: opcounts.txt
The opcounts used are 12500, 25000, 50000, 75000 and 100000.
The opcounts and the workloads are set inside the testharness file, the test is run with
the command ./runtest.sh, and the output is saved inside the YCSB directory as output
files. The exact harness script is not reproduced here, but a loop of the following shape
would produce the same set of runs.
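Directory and file names in this sketch are hypothetical, and the workload list follows the abstract.

#!/bin/bash
# Run every chosen workload at every opcount, keeping one log per combination
for wl in workloada workloadb workloadd; do
  for ops in 12500 25000 50000 75000 100000; do
    bin/ycsb run mongodb -P workloads/$wl -p operationcount=$ops -s \
      > output/mongodb_${wl}_${ops}.log 2>&1
  done
done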
7 Evaluation and Results
7.1 Workload A
In Workload A, the databases HBase and MongoDB are compared on the basis of the
average latency of read and update operations and the total throughput.
7.1.1 Average Latency for Read Operation Vs Total Throughput
On comparing the average latency for the read operation against the total throughput,
it can be inferred that MongoDB performs better, as on average it delivers more
throughput at a lower latency than HBase.
Figure 7: Average Latency for Read Operation Vs Total Throughput
7.1.2 Average Latency for Update Operation Vs Total Throughput
HBase has the highest average latency and the lowest total throughput at the lowest
opcounts, while MongoDB has the highest total throughput with the lowest average
latency at the highest opcount; therefore MongoDB outperforms HBase.
Figure 8: Average Latency for Update Operation Vs Total Throughput
7.1.3 Read Operation and Average Latency
This is a comparison between the number of read operations and the average latency for
both databases while using workload A with opcounts 12500, 25000, 50000, 75000 and
100000.
Figure 9: Output Table
Figure 10: Read Operation and Average Latency for Workload A
From the graph it can be inferred that the number of read operations increases as the
operational count increases for both databases. The read operation counts are similar
for both databases across the different opcounts. The average latency is generally lower
for MongoDB than for HBase, and hence it can be inferred that MongoDB is better
suited for Workload A.
7.1.4 Update Operation and Average Latency
The number of update operations and the average latency of the two databases are
compared while using workload A with opcounts 12500, 25000, 50000, 75000 and 100000.
Figure 11: Output Table
Figure 12: Update Operation and Average Latency for Workload A
As with the read operation, MongoDB also shows very low latency when performing
the update operation, and hence MongoDB would be the preferred choice over HBase
for Workload A.
7.1.5 Throughput
Overall throughput is the total number of operations performed per second, and it is
one of the major factors in determining the better database: the higher the throughput,
the better the database.
Figure 13: Output Table
Figure 14: Throughput for Workload A
From the graph, it can be inferred that MongoDB is the clear winner when choosing a
database for workload A: MongoDB has higher throughput than HBase at all opcounts.
7.2 Workload B
The databases HBase and MongoDB are compared on the basis of the average latency
for read and update operations and the total throughput using Workload B.
7.2.1 Average Latency for Read Operation Vs Total Throughput
MongoDB outperforms HBase in Workload B as well, showing higher throughput at a
lower average latency than HBase.
Figure 15: Average Latency for Read Operation Vs Total Throughput
7.2.2 Average Latency for Update Operation Vs Total Throughput
A similar pattern is observed when comparing the average latency for the update
operation against the total throughput: MongoDB is the better of the two.
Figure 16: Average Latency for Update Operation Vs Total Throughput
7.2.3 Read Operation and Average Latency
This is a comparison between the number of read operations and the average latency for
both databases while using workload B with opcounts 12500, 25000, 50000, 75000 and
100000.
Figure 17: Output Table
Figure 18: Read Operation and Average Latency for Workload B
Once again MongoDB performs better than HBase: the graph shows that the latency
is lower for MongoDB than for HBase across the different opcounts.
7.2.4 Update Operation and Average Latency
The update operation counts and the average latency are compared across the different
opcounts for both databases while testing with workload B.
Figure 19: Output Table
Figure 20: Update Operation and Average Latency for Workload B
HBase completes fewer update operations at a higher average latency than MongoDB,
which completes more updates at a lower average latency; hence MongoDB is preferred
over HBase.
7.2.5 Throughput
Overall throughput is the most important factor in determining which database to use
under different conditions: the higher the throughput, the better the database.
Figure 21: Output Table
Figure 22: Throughput for Workload B
7.3 Workload C
The databases HBase and MongoDB are compared on the basis of the average latency
of read and insert operations and the total throughput using Workload C.
7.3.1 Average Latency for Insert Operation Vs Total Throughput
The average latency for the insert operation is compared with the total throughput for
workload C for both databases; MongoDB outperforms HBase and hence is preferred
over HBase.
Figure 23: Average Latency for Insert Operation Vs Total Throughput
7.3.2 Average Latency for Read Operation Vs Total Throughput
Figure 24: Average Latency for Read Operation Vs Total Throughput
7.3.3 Read Operation and Average Latency
This is a comparison between the number of read operations and the average latency for
both databases while using workload C with opcounts 12500, 25000, 50000, 75000 and
100000.
Figure 25: Output Table
Figure 26: Read Operation and Average Latency for Workload C
As observed from Figure 26, MongoDB performs more read operations at a lower latency
than HBase and hence is preferred over HBase.
7.3.4 Insert Operation and Average Latency
In this section, the number of insert operations is compared with the average latency
for Workload C for the databases HBase and MongoDB over multiple opcounts.
Figure 27: Output Table
Figure 28: Insert Operation and Average Latency for Workload C
MongoDB performs better again and hence is preferred over HBase.
7.3.5 Throughput
Figure 30: Throughput for Workload C
MongoDB has a better throughput rate than HBase and hence is the better choice for
Workload C.
7.4 Throughput vs Average Latency
The two databases are compared on the basis of throughput versus average latency, and
it can be inferred from the graphs that MongoDB performs better at all three workloads
used for the test. MongoDB has higher overall throughput than HBase, and at a lower
latency, which confirms that MongoDB should be the preferred database over HBase.
Workload A
Figure 31: Throughput vs Average Latency for Workload A
Workload B
Figure 32: Throughput vs Average Latency for Workload B
Workload C
Figure 33: Throughput vs Average Latency for Workload C
Result: After running the tests and analyzing the outputs of the two databases for
different workloads at different opcounts, it can be concluded that MongoDB is the
better of the two for the tests performed. MongoDB outperforms HBase in every aspect
and for all the different workloads.
8 Conclusion and Discussion
In practice, a database is picked based on the requirements and the type of data in use.
Before choosing a database to handle all its data, a company runs various tests and
checks parameters such as performance, efficiency, speed, the ability to handle
unstructured data, and security. The performance of a database also depends on the
hardware used to run it. Different NoSQL databases have different advantages and
disadvantages, and hence benchmarking tests are run to determine which database is
best suited for a particular company or project. The motive of this project was to install
and compare the two NoSQL databases, HBase and MongoDB, and to test the
performance of both by subjecting them to multiple workloads. The outputs were then
compared based on the read and update operations, the average latency and the
throughput. For all the workloads, the databases were compared using the average
latency for the read operation against the total throughput, the average latency for the
update operation against the total throughput, the average latency for the insert
operation against the total throughput, and finally the total throughput against the
average overall latency. MongoDB turned out to be the better performer in all the tests
performed: it was well ahead of HBase in terms of performance and is the clear choice
of database for the three workloads used in this project.
References
Colombo, P. & Ferrari, E. (2017), ‘Enhancing MongoDB with purpose-based access
control’, IEEE Transactions on Dependable and Secure Computing 14(6), 591–604.
Fotache, M. & Cogean, D. (2013), ‘NoSQL and SQL databases for mobile applications.
Case study: MongoDB versus PostgreSQL’, Informatica Economica 17(2), 41–58.
Ganesh Chandra, D. (2015), ‘BASE analysis of NoSQL database’, Future Generation
Computer Systems 52 (Special Section: Cloud Computing: Security, Privacy and
Practice), 13–21.
https://data-flair.training/blogs/hbase-security/ (2018), ‘HBase Security’.
URL: https://data-flair.training/blogs/hbase-security/
https://www.edureka.co/blog/hbase-architecture/ (2018), ‘HBase architecture: HBase
data model and HBase read/write mechanism’.
URL: https://www.edureka.co/blog/hbase-architecture/
https://www.tutorialspoint.com/hbase/hbase_architecture.htm (2018), ‘HBase Architecture’.
URL: https://www.tutorialspoint.com/hbase/hbase_architecture.htm
Pallas, F., Günther, J. & Bermbach, D. (2016), ‘Pick your choice in HBase: Security or
performance’, in Proceedings of the 2016 IEEE International Conference on Big Data.
Waage, T. & Wiese, L. (2015), ‘Benchmarking encrypted data storage in HBase and
Cassandra with YCSB’, in F. Cuppens, J. Garcia-Alfaro, N. Zincir Heywood & P. W. L.
Fong, eds, ‘Foundations and Practice of Security’, Springer International Publishing,
Cham, pp. 311–325.
Yoon, J. & Lee, S. (2018), ‘A method and tool to recover data deleted from a MongoDB’,
Digital Investigation 24, 106–120.
URL: http://www.sciencedirect.com/science/article/pii/S1742287617302347