Brig Lamoreaux of Apollo Group worked with his colleagues to put together this white paper detailing their evaluation of MongoDB. He also presented at Oracle OpenWorld 2012 on their use case with MongoDB.
This document compares the total cost of ownership (TCO) of MongoDB and Oracle for two example projects - a smaller enterprise project and a larger enterprise project. It analyzes the upfront and ongoing costs associated with initial developer effort, initial administrative effort, software licenses, server hardware, storage hardware, ongoing developer effort, ongoing administrative effort, and software maintenance and support. For both example projects, the analysis finds that the costs of initial developer effort, initial administrative effort, server hardware, and ongoing administrative and developer effort are lower for MongoDB compared to Oracle, while software license costs are significantly higher for Oracle.
The document provides an overview of running MongoDB on Amazon EC2, including basic tips, architecture considerations, operations procedures, and security practices. It discusses building MongoDB clusters on EC2 using basic building blocks and production designs to achieve scalability, high availability, and fault tolerance. Key operations like backup, restore, and monitoring are also covered.
The document describes Megastore, a storage system developed by Google to meet the requirements of interactive online services. Megastore blends the scalability of NoSQL databases with the features of relational databases. It uses partitioning and synchronous replication across datacenters using Paxos to provide strong consistency and high availability. Megastore has been widely deployed at Google to handle billions of transactions daily storing nearly a petabyte of data across global datacenters.
This paper discusses implementing NoSQL databases for robotics applications. NoSQL databases are well-suited for robotics because they can store massive amounts of data, retrieve information quickly, and easily scale. The paper proposes using a NoSQL graph database to store robot instructions and relate them according to tasks. MapReduce processing is also suggested to break large robot data problems into parallel pieces. Implementing a NoSQL system would allow building more intelligent humanoid robots that can process billions of objects and learn quickly from massive sensory inputs.
Characterization of Hadoop jobs using unsupervised learning (João Gabriel Lima)
This document summarizes research characterizing Hadoop jobs using unsupervised learning techniques. The researchers clustered over 11,000 Hadoop jobs from Yahoo production clusters into 8 groups based on job metrics. The centroids of each cluster represent characteristic jobs and show differences in map/reduce tasks and data processed. Identifying common job profiles can help benchmark and optimize Hadoop performance.
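As a rough sketch of the clustering step described above (not the researchers' actual code or feature set), job-level metrics can be grouped with k-means into eight clusters and the centroids inspected as characteristic job profiles; the feature columns and synthetic data below are assumptions.

# Hypothetical sketch: cluster Hadoop job metrics with k-means (8 groups).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for ~11,000 real jobs: columns = map tasks, reduce tasks, bytes read, bytes written.
jobs = rng.lognormal(mean=4.0, sigma=1.5, size=(11000, 4))

scaler = StandardScaler().fit(jobs)          # scale so large byte counts don't dominate task counts
model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(scaler.transform(jobs))

# Centroids, mapped back to original units, describe the characteristic job in each cluster.
print(scaler.inverse_transform(model.cluster_centers_))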
This document discusses cache and consistency in NoSQL databases. It introduces distributed caching using Memcached to improve performance and reduce load on database servers. It discusses using consistent hashing to partition and replicate data across servers while maintaining consistency. Paxos is presented as an efficient algorithm for maintaining consistency during updates in a distributed system in a more flexible way than traditional 2PC and 3PC approaches.
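To make the consistent-hashing idea concrete, here is a minimal, illustrative hash ring in Python; it is not Memcached's client implementation, and the server names are placeholders.

# Minimal consistent-hashing sketch: keys and servers are placed on a hash ring,
# and a key maps to the first server clockwise from its hash position.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, replicas=100):
        # Virtual nodes (replicas) smooth out the key distribution across servers.
        self._ring = sorted((_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas))
        self._points = [p for p, _ in self._ring]

    def server_for(self, key: str) -> str:
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache1:11211", "cache2:11211", "cache3:11211"])
print(ring.server_for("user:42"))   # adding or removing a server only remaps nearby keys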
Integrating DBMSs as a read-only execution layer into Hadoop (João Gabriel Lima)
The document discusses integrating database management systems (DBMSs) as a read-only execution layer into Hadoop. It proposes a new system architecture that incorporates modified DBMS engines augmented with a customized storage engine capable of directly accessing data from the Hadoop Distributed File System (HDFS) and using global indexes. This allows DBMS to efficiently provide read-only operators while not being responsible for data management. The system is designed to address limitations of HadoopDB by improving performance, fault tolerance, and data loading speed.
Altoros: Using NoSQL databases for interactive applications (Jeff Harris)
This document compares the performance of Cassandra, MongoDB, and Couchbase for interactive applications. Benchmarking showed Couchbase had the lowest latencies and highest throughput. Cassandra demonstrated better performance than MongoDB. While MongoDB had the lowest throughput, Cassandra and Couchbase provided better scalability and flexibility in resizing clusters. The analysis concludes Couchbase is well-suited for interactive applications due to its in-memory caching and fine-grained locking, which enable high performance for reads and writes.
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA (csandit)
Real-time image data is increasing not only in resolution but also in volume, and this large-scale imagery originates from a large number of camera channels. GPUs can be used for high-speed image processing, but a single GPU cannot handle large-scale image processing efficiently. In this paper, we present a new method for constructing a distributed environment using the open source Apache Kafka for real-time processing of large-scale images. This method makes it possible to gather related data onto a single node for high-speed processing using GPGPU or Xeon Phi processors.
The document describes the SeqWare Query Engine, which uses HBase and Hadoop to store and query sequencing data at scale in the cloud. It allows users to ask questions about variants, genes, and annotations. The backend uses HBase to store billions of rows and columns and support annotation, querying, and comparison of samples. It also uses MapReduce for parallel processing. This enables scalable analysis and mining of petabyte-scale genomic datasets as sequencing output rapidly increases.
This document provides an overview of Oracle Database. It discusses database concepts like the entity-relationship model, normalization, and SQL. It describes Oracle architecture including memory structures like the shared pool and database buffer cache. It outlines Oracle processes such as the database writer and log writer. It also covers storage structures, both physical (data files, segments, extents, blocks) and logical (tablespaces, tables, indexes).
This document provides an overview of the new features and capabilities of IBM's Big SQL 3.0, an SQL-on-Hadoop solution. Big SQL 3.0 replaces the previous MapReduce-based architecture with a massively parallel processing SQL engine that pushes processing down to HDFS data nodes for low-latency queries. It features a shared-nothing parallel database architecture, rich SQL support including stored procedures and functions, automatic memory management, workload management tools, and fault tolerance. The document discusses the new architecture, performance improvements, and how Big SQL 3.0 represents an important advancement for SQL-on-Hadoop solutions.
This document discusses geo-distributed parallelization of MapReduce jobs across multiple datacenters. It introduces GEO-PACT, a Hadoop-based framework that can efficiently process sequences of parallelization contracts jobs on geo-distributed input data. GEO-PACT uses a group manager to determine optimal execution paths and job managers at each datacenter to execute tasks locally using Hadoop. It employs copy managers to transfer data between datacenters and aggregation managers to combine results. The goal is to optimize execution time by leveraging data locality across geographically distributed data sources.
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication (IRJET Journal)
This document summarizes a research paper that proposes a methodology for optimizing storage on the cloud using authorized de-duplication. It discusses how de-duplication works to eliminate duplicate data and optimize storage. The key steps are chunking files into blocks, applying secure hash algorithms like SHA-512 to generate unique hashes for each block, and comparing hashes to reference duplicate blocks instead of storing multiple copies. It also discusses using cryptographic techniques like ciphertext-policy attribute-based encryption for authentication and security on public clouds. The proposed approach aims to optimize storage while providing authorized de-duplication functionality.
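As a hedged illustration of the chunk-hash-reference scheme described above (not the paper's actual system), the following Python sketch stores each unique SHA-512-hashed block once and keeps a per-file recipe of block hashes; the fixed block size and in-memory store are simplifications.

# Illustrative block-level de-duplication with SHA-512; block size and storage are assumptions.
import hashlib

BLOCK_SIZE = 4096          # fixed-size chunking for simplicity
block_store = {}           # hash -> block bytes, stored only once

def store_file(path):
    """Chunk a file, store each unique block once, and return the list of block hashes."""
    recipe = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha512(block).hexdigest()
            # Keep the bytes only if this block has not been seen before.
            block_store.setdefault(digest, block)
            recipe.append(digest)
    return recipe

def restore_file(recipe):
    """Rebuild the original content by concatenating the referenced blocks."""
    return b"".join(block_store[d] for d in recipe)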
This document provides an overview of the Hadoop/MapReduce/HBase framework and its applications in bioinformatics. It discusses Hadoop and its components, how MapReduce programs work, HBase which enables random access to Hadoop data, related projects like Pig and Hive, and examples of applications in bioinformatics and benchmarking of these systems.
Hadoop is an open source implementation of the MapReduce framework for distributed processing. A Hadoop cluster is a type of computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle data at massive scale, Hadoop uses the Hadoop Distributed File System (HDFS). Like most distributed file systems, HDFS faces a familiar problem of data sharing and availability among compute nodes, which often leads to decreased performance. This paper is an experimental evaluation of Hadoop's computing performance, carried out by designing a rack-aware cluster that uses Hadoop's default block placement policy to improve data availability. Additionally, an adaptive data replication scheme that relies on access count prediction using Lagrange's interpolation is adapted to fit the scenario. To validate this, experiments were conducted on a rack-aware cluster setup, which significantly reduced task completion time; however, as the volume of data being processed increases, there is a considerable cutback in computational speed due to update cost. Finally, the threshold level that balances update cost against replication factor is identified and presented graphically.
Non-shared disk clusters provide a fault tolerant and cost-effective approach to data-intensive computing. This document describes a prototype non-shared disk cluster and plans for a full implementation. The prototype uses local disks on nodes that are not shared over the network, requiring processing to occur on nodes containing the needed data. A file catalog tracks data placement. The full implementation will include modifications to analysis software to dispatch jobs to nodes based on file location. Fault tolerance is provided by restarting jobs if nodes fail and restoring failed nodes.
This document discusses implementing a subset of the SQLite command processor directly on a GPU to accelerate SQL database operations. The author focuses on accelerating SELECT queries by having each CUDA thread execute SQLite opcodes on a single database row. Results show 20-70x speedups compared to CPU execution. This allows SQL queries to leverage the GPU's parallelism with minimal code changes, providing a simpler interface than existing GPU data processing approaches.
Facebook's TAO & Unicorn data storage and search platforms (Nitish Upreti)
Unicorn is Facebook's in-memory, distributed graph search system that allows users to perform complex queries over the social graph. It supports operators like Apply and Extract that enable multi-step graph traversals to find socially relevant results. Unicorn stores adjacency lists in a sharded architecture and uses techniques like weak AND to balance social proximity and result diversity. It also attaches lineage metadata to results to allow privacy-aware rendering of results by Facebook's frontend services.
White Paper: Hadoop in Life Sciences — An Introduction (EMC)
This White Paper reviews the Apache Hadoop technology, its components — MapReduce and Hadoop Distributed File System — and its adoption in the life sciences with an example in Genomics data analysis.
1) Watson is an IBM supercomputer system designed to answer natural language questions. It competed on the game show Jeopardy! in 2011 against human champions.
2) Watson uses IBM's DeepQA architecture which analyzes natural language using over 100 techniques to find and evaluate evidence from massive amounts of data to answer questions accurately.
3) Watson harnesses the massive parallel processing power of 2,880 POWER7 cores across 90 servers to rapidly analyze questions and return answers, often in just 1-6 seconds.
This document discusses Facebook's deployment of Hadoop and HBase to support real-time applications at massive scale. It describes how Facebook Messages, Insights, and other applications require high throughput writes, large datasets, and low-latency reads. The document outlines why Hadoop and HBase were chosen over other systems to meet these needs, including elasticity, consistency, availability, and fault tolerance. It also describes enhancements made to HDFS and HBase to optimize for Facebook's workloads.
The document describes the architecture and design of the Hadoop Distributed File System (HDFS). It discusses key aspects of HDFS including its master/slave architecture with a single NameNode and multiple DataNodes. The NameNode manages the file system namespace and regulates client access, while DataNodes store and retrieve blocks of data. HDFS is designed to reliably store very large files across machines by replicating blocks of data and detecting/recovering from failures.
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE (ijdpsjournal)
Big Data has introduced the challenge of storing and processing large volumes of data (text, images, and videos). Centralised exploitation of massive data on a single node is no longer viable, leading to the emergence of distributed storage, parallel processing, and hybrid distributed storage and parallel processing frameworks.
The main objective of this paper is to evaluate the load balancing and task allocation strategy of our hybrid distributed storage and parallel processing framework, CLOAK-Reduce. To achieve this goal, we first took a theoretical approach to the architecture and operation of several DHT-MapReduce systems. Then, we compared the data collected on their load balancing and task allocation strategies by simulation. Finally, the simulation results show that CLOAK-Reduce C5R5 replication provides better load balancing efficiency and MapReduce job submission, with 10% churn or no churn.
MongoDB on Windows Azure provides two options for deploying the MongoDB database on Microsoft's cloud platform:
1) Windows Azure Virtual Machines allow more control over infrastructure but require more operational effort. Users can choose Windows or Linux and install software themselves.
2) Windows Azure Cloud Services decrease operational effort through automated management but provide less infrastructure control. Only Windows is supported and configurations are pre-defined.
Both options provide scalability and high availability through features like replication and sharding. Developers should evaluate the level of control and effort needed to determine the best deployment model for their application on the Windows Azure cloud.
Comparison between MongoDB and Cassandra using YCSB (sonalighai)
Performed YCSB benchmarking tests to compare the performance of MongoDB and Cassandra across different workloads with one million operation counts, and generated a report discussing the resulting insights.
The document discusses file management systems and database management systems (DBMS). It describes the different types of file organization including sequential, indexed sequential, and direct access. It also discusses fundamental characteristics of file management systems like creation, updating, retrieval, and maintenance of files. Additionally, it covers topics like data models, DBMS languages, database users, advantages and disadvantages of DBMS, and challenges of data redundancy.
International Journal of Computational Engineering Research (IJCER) is an international, English-language, monthly online journal. It publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Benchmarking Couchbase Server for Interactive Applications (Altoros)
This document provides a benchmark comparison of Couchbase, Cassandra, and MongoDB for interactive applications with high in-memory data loads. It describes the key criteria for databases used in these applications, including scalability and low latency performance. The infrastructure and settings for benchmarking the databases are outlined. Results are then shown for read, insert, and update latencies as well as 95th percentile times. Finally, the document analyzes the results and concludes that Couchbase is well-suited for interactive applications requiring low latency access to large, in-memory datasets.
This document provides an overview of MongoDB, including what NoSQL databases are, MongoDB features like querying, indexing, replication, load balancing and aggregation. It discusses how MongoDB stores data in documents and collections, can be used for file storage, and is used by many large companies. The document also covers installing and running MongoDB on a local system.
Using In-Memory Encrypted Databases on the Cloud (Francesco Pagano)
The document proposes using in-memory encrypted databases on the cloud to address privacy issues. It presents an agent-based approach where each user has a local database encrypted with a key only available to a trusted synchronizer. This allows data to be shared securely while minimizing encryption/decryption overhead. A prototype was implemented using HyperSQL. Benchmark results showed the overhead of the proposed solution is low, especially for read operations, making it suitable for privacy-focused applications like professional networks that require frequent data sharing.
International Journal of Engineering Research and Applications (IJERA) is an open-access, online, peer-reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for data storage, which partitions data into blocks and replicates them across nodes for fault tolerance. The master node tracks where data blocks are stored and worker nodes execute tasks like mapping and reducing data. Hadoop provides scalability and fault tolerance but is slower for iterative jobs compared to Spark, which keeps data in memory. The Lambda architecture also informs Hadoop's ability to handle batch and speed layers separately for scalability.
This document compares approaches to large-scale data analysis using MapReduce and parallel database management systems (DBMSs). It presents results from running a benchmark of tasks on an open-source MapReduce system (Hadoop) and two parallel DBMSs using a cluster of 100 nodes. The parallel DBMSs showed significantly better performance than MapReduce for the tasks, but took much longer to load data and tune executions. The document discusses architectural differences between the approaches and their performance implications.
Apache Spark is written in the Scala programming language, which compiles program code into bytecode for the JVM for Spark big data processing.
The open source community has developed a useful utility for Python-based Spark big data processing known as PySpark.
This document discusses predictive maintenance using sensor data in utility industries. It describes how sensors can monitor infrastructure and predict failures by analyzing patterns in sensor data using machine learning models. An architecture is proposed that uses big data frameworks like Spark, Kafka and HBase to collect, analyze and store large volumes of real-time sensor data at scale. Predictive analytics on this data with techniques like clustering and regression can detect anomalies and predict failures to enable condition-based maintenance in utilities. Modeling uncertain sensor readings with probabilistic and autoregressive approaches is also discussed.
This article discusses opportunities and challenges for efficient parallel data processing in cloud computing environments. It introduces Nephele, a new data processing framework designed specifically for clouds. Nephele is the first framework to leverage dynamic resource allocation in clouds for task scheduling and execution. The article analyzes how existing frameworks assume static resource environments unlike clouds, and how Nephele addresses this by dynamically allocating different compute resources during job execution. It then provides initial performance results for Nephele and compares it to Hadoop for MapReduce-style jobs on cloud infrastructure.
Hadoop and Pig are tools for analyzing large datasets. Hadoop uses MapReduce and HDFS for distributed processing and storage. Pig provides a high-level language for expressing data analysis jobs that are compiled into MapReduce programs. Common tasks like joins, filters, and grouping are built into Pig for easier programming compared to lower-level MapReduce.
MapReduce: Simplified Data Processing on Large Clusters (Abhishek Singh)
The document describes MapReduce, a programming model and implementation for processing large datasets across clusters of computers. It allows users to write map and reduce functions to parallelize tasks. The MapReduce library automatically parallelizes jobs, distributes data and tasks, handles failures and coordinates communication between machines. It is scalable, processing terabytes of data on thousands of machines, and easy for programmers without parallel experience to use.
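A toy, single-process word count written in the map/reduce style illustrates the programming model described above; the real framework distributes the map, shuffle, and reduce phases across a cluster, which this sketch only simulates in memory.

# Toy word count in the MapReduce style (illustrative only, single process).
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word.lower(), 1          # emit (key, value) pairs

def reduce_fn(word, counts):
    return word, sum(counts)           # fold all values for one key

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# The real framework shuffles intermediate pairs across machines; here we group them in memory.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        groups[key].append(value)

results = dict(reduce_fn(k, v) for k, v in groups.items())
print(results)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}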
Spark is a new framework that supports applications that reuse a working set of data across multiple parallel operations. This includes iterative machine learning algorithms and interactive data analysis tools. Spark supports these applications while retaining scalability and fault tolerance through resilient distributed datasets (RDDs) which allow data to be cached in memory across operations. Spark provides RDDs and restricted shared variables like broadcast variables and accumulators to program clusters simply. Experiments show Spark can run iterative jobs faster and interactively query large datasets with low latency. Future work aims to enhance RDD properties and define new transforming operations.
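A brief PySpark sketch of these ideas (assuming a local Spark installation; the data and names are illustrative) shows an RDD cached for reuse across operations, a broadcast variable, and an accumulator.

# Hedged PySpark sketch: cache a working set, share a read-only table, count events with an accumulator.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

data = sc.parallelize(range(1, 1001)).cache()     # RDD kept in memory across operations

lookup = sc.broadcast({1: "one", 2: "two"})       # read-only shared variable
errors = sc.accumulator(0)                        # counter updated by workers

def classify(x):
    if x % 97 == 0:
        errors.add(1)
    return lookup.value.get(x, "other")

# Reusing the cached RDD in several actions avoids recomputing it each time.
print(data.map(classify).countByValue())
print("multiples of 97:", errors.value)

sc.stop()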
This document compares the total cost of ownership of MongoDB and Oracle databases. It outlines the various cost categories to consider, including upfront costs like software, hardware, development efforts, and ongoing costs like maintenance and support. The document then provides two example scenarios - a smaller and larger enterprise project - comparing the expected costs of building each using MongoDB versus Oracle. It finds that for these examples, using MongoDB is over 70% less expensive than using Oracle. Finally, it discusses how MongoDB's advantages in flexibility, ease of use and support for modern development can help reduce costs and speed development.
International Journal of Computational Engineering Research (IJCER) is an international, English-language, monthly online journal. It publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
This document discusses different approaches to modeling messaging inboxes in MongoDB. It describes three main approaches: fan out on read, fan out on write, and bucketed fan out on write. Fan out on read involves storing a single document per message with all recipients, requiring a scatter-gather query to read an inbox. Fan out on write stores one document per recipient but still requires random reads. Bucketed fan out on write stores messages in bucketed inbox documents for each recipient, allowing an entire inbox to be read with one or two documents. In general, bucketed fan out on write is considered the best approach as it provides good performance for both sending messages and reading inboxes. Factors like data size and write versus read loads should guide which approach to choose.
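A hypothetical PyMongo sketch of the bucketed fan-out-on-write pattern follows; the collection name, bucket size, and field names are illustrative assumptions rather than the talk's exact schema.

# Illustrative "bucketed fan out on write" inbox: messages are appended to per-recipient buckets.
from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

db = MongoClient("mongodb://localhost:27017")["mail"]

def send_message(sender, recipients, body, bucket_size=50):
    msg = {"from": sender, "body": body, "sent": datetime.now(timezone.utc)}
    for user in recipients:
        # Append to the newest bucket that still has room; upsert a new bucket otherwise.
        db.inboxes.update_one(
            {"owner": user, "count": {"$lt": bucket_size}},
            {"$push": {"messages": msg}, "$inc": {"count": 1}},
            upsert=True,
        )

def read_inbox(user, buckets=2):
    # One or two bucket documents usually cover the most recent messages.
    return list(db.inboxes.find({"owner": user}).sort("_id", DESCENDING).limit(buckets))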
This talk introduces the features of MongoDB by demonstrating how one can build a simple library application. The talk will cover the basics of MongoDB's document model, query language, and API.
Strategies for Backing Up MongoDB, 10.2012 (Jeremy Taylor)
This document discusses various strategies for backing up MongoDB deployments, including single node, replica sets, and sharded clusters. It covers using file system snapshots, mongodump, journaling, and stopping the balancer to get consistent backups. The key strategies are:
1. Using fsyncLock and mongodump to backup a single node deployment.
2. Backing up a hidden secondary in a replica set after killing mongod and using fsyncLock.
3. Using mongodump with the --oplog option and mongorestore with --oplogreplay to get a point-in-time backup of a replica set.
4. Stopping the balancer and backing up each shard and a config server to get a consistent backup of a sharded cluster.
Commands are special MongoDB operations that can be run from client libraries or the shell. This document provides a list and brief description of some of the most commonly used commands, including commands to get database and server information and statistics, manage collections and indexes, and configure replication and profiling.
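For example, a few of these commonly used commands can be issued from PyMongo as shown below; the host, database, and collection names are placeholders.

# Running some common MongoDB commands from PyMongo (names are placeholders).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["test"]

print(db.command("dbStats"))                            # database size and object counts
print(db.command("collStats", "mycollection"))          # per-collection statistics
print(client.admin.command("serverStatus")["uptime"])   # server-wide status
print(client.admin.command("listDatabases"))            # databases on the server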
The document provides an overview of running MongoDB on Amazon EC2, including basic tips, a basic installation process, architecture considerations, operations like backup and monitoring, and security recommendations. It discusses building MongoDB on EC2 instances for performance, durability, elasticity, and scalability. The installation instructions outline downloading and installing MongoDB, creating an EBS volume, and mounting it to store data on.
- Modern data workloads like big data, agile development, and cloud computing are driving new requirements for database management systems that relational databases can't meet.
- NoSQL databases like MongoDB were created to address these new requirements by providing horizontal scalability, flexible schemas, and compatibility with cloud environments.
- MongoDB scales across multiple servers, allows dynamic schema changes, and runs well on commodity hardware and virtual infrastructures, making it well-suited for modern applications.
MongoDB on Windows Azure

MongoDB on Windows Azure brings the power of the leading NoSQL database to Microsoft's flexible, open, and scalable cloud. MongoDB is an open source, document-oriented database designed with scalability and developer agility in mind. Windows Azure is the cloud services operating system that provides the development, service hosting, and service management environment for the Azure Services Platform. Together, MongoDB and Windows Azure provide customers the tools to build limitlessly scalable applications in the cloud.

This paper begins with an overview of MongoDB. Next, we describe the two primary deployment options available on Microsoft's cloud platform, Windows Azure Virtual Machines and Windows Azure Cloud Services. Finally, to help those evaluating deploying MongoDB on Windows Azure, we outline the pros and cons of the two deployment options available.
Figure 1: Sample JSON Document
{
"_id": ObjectId("504e4dd43796b3da50183991"),
"text": "Study Implicates Immune System in Parkinson’s Disease Pathogenesis http://bit.ly/duhe4P",
"source": "<a href="http://twitterfeed.com" rel="nofollow">twitterfeed</a>,
"coordinates": null,
"truncated": false,
"entities": {
"urls": [{
"indices": [
67,
87],
"url": "http://bit.ly/duhe4P",
"expanded_url": null
}],
"hashtags": []
},
"retweeted": false,
"place": null,
"user": {
"friends_count": 780,
"created_at": "Fri Jan 08 17:40:11 +0000 2010",
"description": "Latest medical news, articles, and features from Medscape Pathology.",
"time_zone": "Eastern Time (US & Canada)",
"url": "http://www.medscape.com/pathology",
"screen_name": "MedscapePath",
"utc_offset": -18000
},
"favorited": false,
"in_reply_to_user_id": null,
"id": NumberLong("22819397000")
}
About MongoDB

MongoDB is an open source, document-oriented database. MongoDB bridges the gap between key-value stores – which are fast and scalable – and relational databases – which have rich functionality. Instead of storing data in tables and rows as one would with a relational database, MongoDB stores a binary form of JSON (BSON, or 'binary JSON') documents. An example of a document is shown in Figure 1.

The document serves as the fundamental unit within MongoDB (like a row in an RDBMS); one can add fields (like a column in an RDBMS), as well as nested fields and embedded documents. Rather than imposing a flat, rigid schema across an entire table, the schema is implicit in what fields are used in the documents. Thus, MongoDB allows developers to have variable schemas across documents and to adapt schemas as their applications evolve.

Unlike relational databases, MongoDB does not use SQL syntax. Rather, MongoDB has a query language based on JSON. It also has drivers for most modern languages, such as C#, Java, Ruby, Python, and many others. MongoDB's flexible data model and support for modern programming languages simplify development and administration significantly.
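As a brief, hedged illustration of this document model and JSON-based querying (connection details and names are placeholders, not part of the paper), a PyMongo session might look like this:

# Minimal PyMongo sketch: insert a document with nested fields, query it with a JSON-like filter.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["demo"]

db.tweets.insert_one({
    "text": "Study Implicates Immune System in Parkinson's Disease Pathogenesis",
    "user": {"screen_name": "MedscapePath", "friends_count": 780},   # embedded document
    "entities": {"hashtags": [], "urls": [{"url": "http://bit.ly/duhe4P"}]},
})

# Queries are expressed as documents; dot notation reaches into nested fields.
doc = db.tweets.find_one({"user.screen_name": "MedscapePath"})
print(doc["text"])

# Documents in the same collection need not share a schema: adding a new field is just an update.
db.tweets.update_one({"_id": doc["_id"]}, {"$set": {"retweet_count": 3}})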
Figure 2: Replica Sets with MongoDB [diagram: an application sends reads and writes to the primary; two secondaries replicate asynchronously, with automatic leader election]

MongoDB Architecture

MongoDB's core capabilities deliver reliability, high availability, high performance, and scalability.

Replication through replica sets provides for high availability and data safety. A replica set is comprised of one primary node and some number of secondary nodes (determined by the user). Figure 2 shows an example replica set, with one primary and two secondaries (a common deployment model). By default, the primary node takes all reads and writes from the application; the secondaries replicate asynchronously in the background. If the primary node goes down for any reason, one of the secondaries is automatically promoted to primary status and begins to take all reads and writes. Replica sets help protect applications from hardware and data center-related downtime. Moreover, they make it easy for DBAs to conduct operational tasks, including software upgrades and hardware changes.
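To make the application-side behavior concrete, the sketch below (a minimal example with placeholder host names, not taken from the paper) connects a PyMongo client to a replica set; the driver tracks the current primary and fails over automatically after an election.

# Hedged sketch: connecting an application to a replica set (host names and set name are placeholders).
from pymongo import MongoClient

client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)

db = client["app"]
db.events.insert_one({"type": "login"})    # writes always go to the current primary
print(db.events.count_documents({}))       # reads may be served by a secondary here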
Figure 3: Sharding with MongoDB [diagram: Shard A (0...30), Shard B (31...60), Shard C (61...90), ... Shard N (n...n+30) – horizontally scalable]

Sharding enables users to scale horizontally as their data volumes grow and/or as demands on their data stores grow. A shard is a subset of the database, kind of like a partition of the data. In Figure 3, Shard A contains documents 1-30; Shard B contains documents 31-60; and so on. One can choose any key on which to shard the collection (e.g., user name), and MongoDB will automatically shard the data store based on this key. One can scale a database infinitely using sharding by adding new nodes to a cluster. When a new node is added, MongoDB recognizes it and redistributes the data across the cluster. Because sharding distributes both the actual data and therefore the load (i.e., traffic), it enables horizontal scalability as well as high performance.
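As an illustrative sketch of choosing a shard key (the paper itself shows no code, and the addresses, database, and key below are placeholders), the sharding commands can be issued through a mongos router, for example with PyMongo:

# Illustrative sketch: shard a collection on a user-name key via a mongos router.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")    # connect to the mongos router

client.admin.command("enableSharding", "app")          # allow the "app" database to be sharded
client.admin.command("shardCollection", "app.users",   # distribute app.users across shards
                     key={"user_name": 1})             # chosen shard key

# Reads and writes go through mongos as usual; the cluster routes them to the right shard.
client["app"]["users"].insert_one({"user_name": "alice", "plan": "free"})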
Figure 4: MongoDB Architecture [diagram: an application talks to mongos, which routes operations to Replica Set A (0...30), Replica Set B (31...60), Replica Set C (61...90), ... Replica Set N (n...n+30), each with a primary and secondaries]

An overview of the MongoDB architecture is shown in Figure 4. In a multi-shard environment, the application communicates with mongos, an intermediary router that directs reads and writes to the appropriate shard. Each shard is a replica set, providing scalability, availability, and performance to developers.

MongoDB was built for the cloud. Cloud services like Windows Azure are therefore a natural fit for MongoDB. By coupling MongoDB's easy-to-scale architecture and Azure's elastic cloud capacity, users can quickly and easily build, scale, and manage their applications.

About Windows Azure Services for MongoDB

Windows Azure is Microsoft's suite of cloud services, providing developers on-demand compute and storage to create, host and manage scalable and available web applications through Microsoft data centers. When deploying MongoDB to Windows Azure, users can choose from two deployment options:

»» Windows Azure Virtual Machines. Windows Azure Virtual Machines (VMs) is Microsoft's Infrastructure-as-a-Service (IaaS) offering. Similar to Amazon Web Services EC2, Azure VMs give users access to elastic, on-demand virtual servers. Users can install Windows or Linux on a VM and configure it based on their own preferences or their apps' specific needs. Users manage the VMs themselves, including scaling, installing security patches, and ongoing performance monitoring and management. Azure VMs give users a relatively significant degree of control over their environments, but by the same token require users to take on the VM management. Note: This service is currently in preview (beta).

»» Windows Azure Cloud Services. Windows Azure Cloud Services (Worker Roles and Web Roles) is Microsoft's Platform-as-a-Service (PaaS) offering. Similar to Heroku, Worker Roles provide users with prebuilt, preconfigured instances of compute power. In contrast with Azure VMs, users do not have to configure or manage Azure Worker Roles. Windows Azure handles the deployment details – from provisioning and load balancing to health monitoring for continuous availability. This can be helpful to some users who prefer not to manage their applications at the infrastructure level, though it restricts the level of control users have over their environments.
Understanding the Deployment Options

Given that MongoDB can be deployed on either Windows Azure Virtual Machines (IaaS) or Windows Azure Cloud Services (PaaS), it is important for users to consider the different capabilities and implementation details of each service to determine which deployment model makes the most sense for their applications.

Azure Virtual Machines

BASIC SETUP
After being granted access to the preview functionality for Azure Virtual Machines, users can launch an instance and install and configure MongoDB on it manually. Alternatively, users can use the recently released Windows Azure installer for MongoDB to set up a MongoDB replica set quickly and easily on Windows Azure VMs.

The installer is built on top of Windows PowerShell and the Windows Azure command line tool. It contains a number of deployment scripts. The tool is designed to help users get single or multi-node MongoDB configurations up and running quickly. There are only two steps to installing and configuring a MongoDB replica set on Azure VMs. Note: the installer is designed to run on a user's local machine (i.e., not directly on an Azure VM), and then to deploy output to Windows Azure VMs. To start, download the publish settings file. Next, run the installer from the command prompt.

With Azure Virtual Machines, users can create their own VMs or they can create a VM instance from one of several pre-installed operating system configurations. Both Windows and Linux are supported on Azure Virtual Machines. To deploy MongoDB on Linux, visit the MongoDB wiki (wiki.mongodb.org) for step-by-step instructions.

PROS AND CONS OF AZURE VIRTUAL MACHINES
The pros and cons of deploying MongoDB on Azure Virtual Machines are generally consistent with the considerations around using IaaS more broadly. Overall, Azure Virtual Machines allow users to fine-tune their deployments but by the same token require increased operational effort.

The advantages of using MongoDB on Azure Virtual Machines are as follows:

»» Increased Control. Users have more control over their infrastructural configuration relative to Azure Cloud Services. For instance, they can install and configure services on the OS, define policies, etc. This consideration may be important for enterprises that have regimented policies and processes for IT security and compliance.

»» OS Choice. Users can use Windows or Linux.

Azure Virtual Machines may not always be the right fit for the following reasons:

Increased Operational Effort. The increased control that Azure Virtual Machines provide comes with increased effort, as well. Users must define and implement their own security measures, apply patches, and locate instances for fault tolerance. This consideration may be important for developers that lack experience managing their own infrastructure or for companies that don't have the operational bandwidth to devote to managing this component of the stack.

Beta. The Azure Virtual Machines service is still in preview (beta).

Azure Cloud Services

BASIC SETUP
Users can also deploy MongoDB on Azure Cloud Services. To do so, download the MongoDB Azure Worker Role package, which is a preconfigured Worker Role with MongoDB. When deployed, each replica set member runs as a separate Worker Role instance; MongoDB data files are stored in Azure Cloud Drives. For detailed instructions, visit the MongoDB wiki (wiki.mongodb.org).

PROS AND CONS OF AZURE CLOUD SERVICES
The pros and cons of running MongoDB on Azure Cloud Services are generally consistent with those of using PaaS in general, though there are some Azure-specific considerations. Overall, Windows Azure Cloud Services decreases the operational burden on users but affords them less control from an infrastructure configuration standpoint. The advantages of using Azure Cloud Services are as follows:

»» Lower Operational Effort. Microsoft manages OS updates and security, decreasing the operational burden on the users.

»» Built-in Fault Tolerance. When deploying multiple MongoDB worker role instances, Windows Azure automatically deploys the instances across multiple fault and update domains to guarantee better uptime.

»» Secure by Default. Microsoft takes measures to ensure that worker and web roles are secure. Endpoints on instances can be enabled for instance-to-instance communication without making them public. Thus, one can configure MongoDB to be secure by enabling it only for other roles in the same deployment.
By the same token, there are some aspects of Azure Cloud Services that may be considered drawbacks:

»» Windows Only. Worker Roles can only be deployed with Windows; Linux is not an option.

»» Fixed OS Configuration. Users cannot configure the OS, and must therefore develop applications that run on the pre-defined machine configurations available.

Table 1 summarizes the pros and cons of using Windows Azure Worker Roles and Windows Azure Virtual Machines.

Table 1: Pros and Cons Summary – Windows Azure Virtual Machines and Windows Azure Cloud Services

IaaS – Windows Azure Virtual Machines
  Pros: Increased control; OS choice
  Cons: Increased operational effort

PaaS – Windows Azure Cloud Services
  Pros: Lower operational effort; Built-in fault tolerance; Secure by default
  Cons: Windows only; Fixed OS configuration
Summary

MongoDB was built for ease of use, scalability, availability, and performance, and it's quickly becoming an attractive alternative to relational databases. Windows Azure provides a flexible cloud platform for hosting MongoDB, with two deployment models to choose from. Developers and enterprises looking at deploying MongoDB on Windows Azure should consider the pros and cons discussed here when evaluating which option is most appropriate for them. We hope that this paper helps customers better understand these solutions, how they work, and how to assess them.

To learn more about MongoDB and how to deploy it in the cloud, or to speak to a sales representative, please email info@10gen.com.
New York: 578 Broadway, New York, NY 10012 • London: 5-25 Scrutton St., London EC2A 4HJ
info@10gen.com • US (866) 237-8815 • INTL +1 (650) 440-4474