NoSQL, as many of you may already know, refers to a class of databases used to manage huge sets of unstructured data, wherein the data is not stored in tabular relations as in relational databases. Most existing relational databases have struggled to solve some of the complex modern problems, such as:
• Continuously changing nature of data - structured, semi-structured, unstructured and polymorphic data.
• Applications now serve millions of users across geo-locations and timezones, and have to be up and running all the time, with data integrity maintained.
• Applications are becoming more distributed with many moving towards cloud computing.
NoSQL plays a vital role in enterprise applications that need to access and analyze massive data sets spread across multiple remote virtual servers in cloud infrastructure, especially when the data is unstructured. NoSQL databases are therefore designed to overcome the performance, scalability, data-modelling and distribution limitations of relational databases.
Hive is a data warehouse infrastructure tool that allows users to query and analyze large datasets stored in Hadoop. It uses a SQL-like language called HiveQL to process structured data stored in HDFS. Hive keeps metadata about table schemas in a metastore database and processes the data itself in HDFS. It provides a familiar interface for querying large datasets using SQL-like queries and scales easily to large datasets.
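To make the HiveQL interface concrete, here is a minimal sketch using the third-party PyHive client; the connection details and the page_views table are assumptions for illustration, not something described above.

```python
# Minimal sketch: querying Hive from Python with the third-party
# PyHive client. Assumes a HiveServer2 instance on localhost:10000
# and a hypothetical `page_views` table; adjust names for your cluster.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL but is compiled into jobs over data in HDFS.
cursor.execute("""
    SELECT country, COUNT(*) AS views
    FROM page_views
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
""")
for country, views in cursor.fetchall():
    print(country, views)
```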
This document provides an introduction to NoSQL databases. It discusses the history and limitations of relational databases that led to the development of NoSQL databases. The key motivations for NoSQL databases are that they can handle big data and provide better scalability and flexibility than relational databases. The document describes some core NoSQL concepts like the CAP theorem and different types of NoSQL databases like key-value, columnar, document and graph databases. It also outlines some remaining research challenges in the area of NoSQL databases.
The presentation provides an overview of NoSQL databases, including a brief history of databases, the characteristics of NoSQL databases, and different data models like key-value, document, column family and graph databases. It discusses why NoSQL databases were developed, as relational databases do not scale well for distributed applications. The CAP theorem is also explained, which states that only two out of consistency, availability and partition tolerance can be achieved in a distributed system.
Introduction to SQL Analytics on Lakehouse Architecture (Databricks)
This document provides an introduction and overview of SQL Analytics on Lakehouse Architecture. It discusses the instructor Doug Bateman's background and experience. The course goals are outlined as describing key features of a data Lakehouse, explaining how Delta Lake enables a Lakehouse architecture, and defining features of the Databricks SQL Analytics user interface. The course agenda is then presented, covering topics on Lakehouse Architecture, Delta Lake, and a Databricks SQL Analytics demo. Background is also provided on Lakehouse architecture, how it combines the benefits of data warehouses and data lakes, and its key features.
A very basic introduction to Big Data. Touches on what it is, its characteristics, and some examples of Big Data frameworks, including a Hadoop 2.0 example with YARN, HDFS and MapReduce alongside ZooKeeper.
This document provides an overview of NoSQL databases and compares them to relational databases. It discusses the different types of NoSQL databases including key-value stores, document databases, wide column stores, and graph databases. It also covers some common concepts like eventual consistency, CAP theorem, and MapReduce. While NoSQL databases provide better scalability for massive datasets, relational databases offer more mature tools and strong consistency models.
This presentation is about NoSQL, which stands for "Not Only SQL". It covers aspects of using NoSQL for Big Data and the differences from RDBMS.
All about Big Data components and the best tools to ingest, process, store and visualize the data.
This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
This presentation explains what data engineering is and briefly describes the data lifecycle phases. I used this presentation during my work as an on-demand instructor at Nooreed.com
- Polyglot persistence involves using multiple data storage technologies to handle different data storage needs within a single application. This allows using the right technology for the job rather than trying to solve all problems with a single database.
- For example, a key-value store may be better for transient session or shopping cart data before an order is placed, while relational databases are better for structured transactional data after an order is placed.
- Using services that abstract the direct usage of different data stores allows sharing of data between applications in an enterprise. This improves reuse of data across systems, as the sketch below illustrates.
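A minimal sketch of this polyglot idea, with a plain dict standing in for a key-value store and SQLite standing in for the relational store; all names and the schema are hypothetical.

```python
# Hedged sketch of polyglot persistence: transient shopping-cart data
# in a key-value store, completed orders in a relational store.
# The dict stands in for a real KV store such as Redis or Riak.
import sqlite3

kv_store = {}  # transient session / shopping-cart data

def save_cart(session_id, items):
    kv_store[f"cart:{session_id}"] = items  # cheap, schema-free write

def place_order(session_id, customer_id):
    items = kv_store.pop(f"cart:{session_id}", [])
    with sqlite3.connect("orders.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS orders "
                   "(customer_id INTEGER, item TEXT)")
        db.executemany("INSERT INTO orders VALUES (?, ?)",
                       [(customer_id, item) for item in items])
    # From here on the order lives in the transactional store.

save_cart("A08154711", ["book", "lamp"])
place_order("A08154711", 42)
```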
Data Engineering is the process of collecting, transforming, and loading data into a database or data warehouse for analysis and reporting. It involves designing, building, and maintaining the infrastructure necessary to store, process, and analyze large and complex datasets. This can involve tasks such as data extraction, data cleansing, data transformation, data loading, data management, and data security. The goal of data engineering is to create a reliable and efficient data pipeline that can be used by data scientists, business intelligence teams, and other stakeholders to make informed decisions.
Learn more at https://www.datacademy.ai/what-is-data-engineering-data-engineering-data-e/
The document provides an overview of big data analytics using Hadoop. It discusses how Hadoop allows for distributed processing of large datasets across computer clusters. The key components of Hadoop discussed are HDFS for storage, and MapReduce for parallel processing. HDFS provides a distributed, fault-tolerant file system where data is replicated across multiple nodes. MapReduce allows users to write parallel jobs that process large amounts of data in parallel on a Hadoop cluster. Examples of how companies use Hadoop for applications like customer analytics and log file analysis are also provided.
There are three common data warehouse architectures: basic, with a staging area, and with a staging area and data marts. The basic architecture extracts data directly from source systems into the data warehouse for users. The staging area architecture uses a staging area to clean and process data before loading it into the warehouse. The third architecture adds data marts, which are subsets of the warehouse organized for specific business units like sales or purchasing.
1. The document provides an overview of key concepts in data science and machine learning including the data science process, types of data, machine learning techniques, and Python tools used for machine learning.
2. It describes the typical six-step data science process: setting goals, data retrieval, data preparation, exploration, modeling, and presentation.
3. Different types of data are discussed including structured, unstructured, machine-generated, graph-based, and audio/video data.
4. Machine learning techniques can be supervised, unsupervised, or semi-supervised depending on whether labeled data is used.
Hive is a data warehouse infrastructure tool used to process large datasets in Hadoop. It allows users to query data using SQL-like queries. Hive resides on HDFS and uses MapReduce to process queries in parallel. It includes a metastore to store metadata about tables and partitions. When a query is executed, Hive's execution engine compiles it into a MapReduce job which is run on a Hadoop cluster. Hive is better suited for large datasets and queries than traditional RDBMSs, which are optimized for transactions.
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners... (Simplilearn)
This presentation about Big Data will help you understand how Big Data evolved over the years, what is Big Data, applications of Big Data, a case study on Big Data, 3 important challenges of Big Data and how Hadoop solved those challenges. The case study talks about Google File System (GFS), where you’ll learn how Google solved its problem of storing increasing user data in the early 2000s. We’ll also look at the history of Hadoop, its ecosystem and a brief introduction to HDFS, which is a distributed file system designed to store large volumes of data, and MapReduce, which allows parallel processing of data. In the end, we’ll run through some basic HDFS commands and see how to perform word count using MapReduce. Now, let us get started and understand Big Data in detail.
Below topics are explained in this Big Data presentation for beginners:
1. Evolution of Big Data
2. Why Big Data?
3. What is Big Data?
4. Challenges of Big Data
5. Hadoop as a solution
6. MapReduce algorithm
7. Demo on HDFS and MapReduce
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats and Avro Schema, use Avro with Hive and Sqoop, and understand schema evolution
7. Understand Flume, Flume architecture, sources, Flume sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade-Offs with Various Formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Big data is high-volume, high-velocity, and high-variety data that is difficult to process using traditional data management tools. It is characterized by the 3Vs: volume, as data grows exponentially; velocity, as data arrives as real-time streams; and variety, as data comes from many different sources and in many formats. The document discusses big data analytics techniques to gain insights from large and complex datasets and provides examples of big data sources and applications.
Visual data mining combines traditional data mining methods with information visualization techniques to explore large datasets. There are three levels of integration between visualization and automated mining methods - no/limited integration, loose integration where methods are applied sequentially, and full integration where methods are applied in parallel. Different visualization methods exist for univariate, bivariate and multivariate data based on the type and dimensions of the data. The document describes frameworks and algorithms for visual data mining, including developing new algorithms interactively through a visual interface. It also summarizes a document on using data mining and visualization techniques for selective visualization of large spatial datasets.
Relational RDBMS: MySQL, PostgreSQL and SQL Server (Dalila Chouaya)
This document compares and contrasts three popular relational database management systems (RDBMSs): MySQL, PostgreSQL, and Microsoft SQL Server. It discusses each RDBMS's supported data types, advantages, disadvantages, and best use cases. MySQL is noted as the most popular, with strengths in speed and ease of use, while PostgreSQL focuses on compliance and extensibility. SQL Server is suited for large enterprise systems but has higher costs. The document provides an overview of key factors to consider when selecting an RDBMS.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
This document provides an overview of SQL and NoSQL databases. It defines SQL as a language used to communicate with relational databases, allowing users to query, manipulate, and retrieve data. NoSQL databases are defined as non-relational and allow for flexible schemas. The document compares key aspects of SQL and NoSQL such as data structure, querying, scalability and provides examples of popular SQL and NoSQL database systems. It concludes that both SQL and NoSQL databases will continue to be important with polyglot persistence, using the best database for each storage need.
The document discusses the key aspects of the ETL (extract, transform, load) process. It describes what ETL is, the main steps including extraction of data from source systems, transforming the data to conform to the data warehouse schema, and loading the data into the data warehouse. It covers techniques for data extraction such as log-based extraction, triggers, and application-assisted extraction. It also discusses various types of data transformations including formatting, decoding, calculations, and integration of data from multiple sources.
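As a hedged illustration of the extract, transform and load steps just described, here is a minimal Python sketch; the file name, schema and transformation rules are invented for the example.

```python
# Hedged sketch of a minimal ETL step matching the stages above.
import csv
import sqlite3

def extract(path):
    # Extraction: read raw rows from a source system (a CSV here).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(row):
    # Transformation: formatting, decoding and simple calculations
    # to conform to the warehouse schema.
    return (row["customer_id"].strip(),
            row["country"].upper(),
            float(row["amount"]))

def load(rows, db):
    # Loading: write the conformed rows into the warehouse table.
    db.execute("CREATE TABLE IF NOT EXISTS sales "
               "(customer_id TEXT, country TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

with sqlite3.connect("warehouse.db") as db:
    load((transform(r) for r in extract("sales.csv")), db)
```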
Data Warehouse Concepts | Data Warehouse Tutorial | Data Warehousing (Edureka!)
This tutorial on data warehouse concepts will tell you everything you need to know in performing data warehousing and business intelligence. The various data warehouse concepts explained in this video are:
1. What Is Data Warehousing?
2. Data Warehousing Concepts:
i. OLAP (On-Line Analytical Processing)
ii. Types Of OLAP Cubes
iii. Dimensions, Facts & Measures
iv. Data Warehouse Schema
This document provides an overview of NoSQL databases. It discusses that NoSQL databases are non-relational and do not follow the RDBMS principles. It describes some of the main types of NoSQL databases including document stores, key-value stores, column-oriented stores, and graph databases. It also discusses how NoSQL databases are designed for massive scalability and do not guarantee ACID properties, instead following a BASE model of Basically Available, Soft state, and Eventually consistent.
The document provides an introduction to NoSQL databases, including key definitions and characteristics. It discusses that NoSQL databases are non-relational and do not follow RDBMS principles. It also summarizes different types of NoSQL databases like document stores, key-value stores, and column-oriented stores. Examples of popular databases for each type are also provided.
The document provides an agenda for a two-day training on NoSQL and MongoDB. Day 1 covers an introduction to NoSQL concepts like distributed and decentralized databases, CAP theorem, and different types of NoSQL databases including key-value, column-oriented, and document-oriented databases. It also covers functions and indexing in MongoDB. Day 2 focuses on specific MongoDB topics like aggregation framework, sharding, queries, schema-less design, and indexing.
This document provides an overview of a NoSQL Night event presented by Clarence J M Tauro from Couchbase. The presentation introduces NoSQL databases and discusses some of their advantages over relational databases, including scalability, availability, and partition tolerance. It covers key concepts like the CAP theorem and BASE properties. The document also provides details about Couchbase, a popular document-oriented NoSQL database, including its architecture, data model using JSON documents, and basic operations. Finally, it advertises Couchbase training courses for getting started and administration.
The document discusses NoSQL technologies including Cassandra, MongoDB, and ElasticSearch. It provides an overview of each technology, describing their data models, key features, and comparing them. Example documents and queries are shown for MongoDB and ElasticSearch. Popular use cases for each are also listed.
This document introduces NoSQL and document databases, and demonstrates using RavenDB with ASP.NET MVC. It defines NoSQL as non-relational databases that avoid joins and scale horizontally. Popular NoSQL databases like MongoDB and CouchDB are discussed. The document then focuses on document databases and RavenDB specifically, highlighting its .NET support, scalability, transactions, and RESTful API. Finally, a demo app using RavenDB with ASP.NET MVC is proposed along with links to source code.
MongoDB is a horizontally scalable, schema-free, document-oriented NoSQL database. It stores data in flexible, JSON-like documents, allowing for easy storage and retrieval of data without rigid schemas. MongoDB provides high performance, high availability, and easy scalability. Some key features include embedded documents and arrays to reduce joins, dynamic schemas, replication and failover for availability, and auto-sharding for horizontal scalability.
This document discusses several NoSQL databases including key-value, column-family, graph, and document databases. It provides information on Cassandra, DynamoDB, Riak, Redis, CouchDB, Azure Table Storage, BerkeleyDB, HBase, BigTable, HyperTable, Neo4j, and MongoDB, summarizing their architectures, features, uses cases, and advantages.
The document summarizes a meetup about NoSQL databases hosted by AWS in Sydney in 2012. It includes an agenda with presentations on Introduction to NoSQL and using EMR and DynamoDB. NoSQL is introduced as a class of databases that don't use SQL as the primary query language and are focused on scalability, availability and handling large volumes of data in real-time. Common NoSQL databases mentioned include DynamoDB, BigTable and document databases.
An overview of various database technologies and their underlying mechanisms over time.
Presentation delivered internally at Alliander to inspire the use of and foster interest in new (NoSQL) technologies. 18 September 2012
This document provides an overview and summary of key concepts related to advanced databases. It discusses relational databases including MySQL, SQL, transactions, and ODBC. It also covers database topics like triggers, indexes, and NoSQL databases. Alternative database systems like graph databases, triplestores, and linked data are introduced. Web services, XML, and data journalism are also briefly summarized. The document provides definitions and examples of these technical database terms and concepts.
NoSQL databases provide an alternative to traditional relational databases that is well-suited for large datasets, high scalability needs, and flexible, changing schemas. NoSQL databases sacrifice strict consistency for greater scalability and availability. The document model is well-suited for semi-structured data and allows for embedding related data within documents. Key-value stores provide simple lookup of data by key but do not support complex queries. Graph databases effectively represent network-like connections between data elements.
SolrCloud: the "search first" NoSQL database extended deep dive (Lucene Revolution)
Presented by Mark Miller, Software Engineer, Cloudera
As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling, two sides of the same coin?
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, and search ecosystem evolve? If you are interested in Big Data, NoSQL, distributed systems, the CAP theorem and other hype-filled terms, then this talk may be for you.
This document provides an overview of NoSQL data architecture patterns, including key-value stores, graph stores, and column family stores. It describes key aspects of each pattern such as how keys and values are structured. Key-value stores use a simple key-value approach with no query language, while graph stores are optimized for relationships between objects. Column family stores use row and column identifiers as keys and scale well for large volumes of data.
NoSQL databases were developed to address the limitations of relational databases in handling massive, unstructured datasets. NoSQL databases sacrifice ACID properties like consistency in favor of scalability and availability. The CAP theorem states that only two of consistency, availability, and partition tolerance can be achieved at once. Common NoSQL database types include document stores, key-value stores, column-oriented stores, and graph databases. NoSQL is best suited for large datasets that don't require strict consistency or relational structures.
NoSQL Meets Relational: The MySQL Ecosystem Gains More Flexibility (Ivan Zoratti)
Colin Charles gave a presentation comparing SQL and NoSQL databases. He discussed why organizations adopt NoSQL databases like MongoDB for large, unstructured datasets and rapid development. However, he argued that MySQL can also handle these workloads through features like dynamic columns, memcached integration, and JSON support. MySQL addresses limitations around high availability, scalability, and schema flexibility through tools and plugins that provide sharding, replication, load balancing, and online schema changes. In the end, MySQL with the right tools is capable of fulfilling both transactional and NoSQL-style workloads.
1. VISVESVARAYA TECHNOLOGICAL UNIVERSITY
Belgaum, Karnataka
IV SEMESTER SEMINAR ON
“NoSQL Data Management: Concepts and Systems”
Md. Mushahid Faizan
(1RX16MCA34)
RNS INSTITUTE OF TECHNOLOGY
Estd: 2001
Department of MCA
Channasandra, Dr. Vishnuvardhan Road, Bengaluru-560098.
3. NoSQL Data Management
History of NoSQL
● MultiValue databases at TRW in 1965.
● DBM is released by AT&T in 1979.
● Lotus Domino released in 1989.
● Carlo Strozzi used the term NoSQL in 1998 to name his lightweight, open-source relational database that did not expose the standard SQL interface.
● Graph database Neo4j is started in 2000.
● Google BigTable is started in 2004. Paper published in 2006.
● CouchDB is started in 2005.
● The research paper on Amazon Dynamo is released in 2007.
● The document database MongoDB is started in 2007 as part of an open-source cloud computing stack, with its first standalone release in 2009.
● Facebook open-sources the Cassandra project in 2008.
● Project Voldemort started in 2008.
● The term NoSQL was reintroduced in early 2009.
● Some NoSQL conferences: NoSQL Matters, NoSQL Now!, INOSA
4. NoSQL Data Management
History of NoSQL
• SQL Databases were dominant for decades
Persistent storage
Standards based
Concurrency Control
Application Integration
ACID
• Designed to run on a single "big" machine
• Cloud computing changes that dramatically
Cluster of machines
Large amount of unreliable machines
Distributed System
Schema-free unstructured Big Data
5. NoSQL Data Management
Methods to Run a Database
• Virtual Machine Image
Users purchase virtual machine instances and run a database on them
Upload and set up your own image with a database, or use ready-made images with optimized database installations
E.g. Oracle Database 11g Enterprise Edition images for Amazon EC2 and for Microsoft Azure.
• Database as a service (DBaaS)
Using a database without physically launching a virtual machine instance
No configuration or management needed by application owners
E.g. Amazon Web Services provides SimpleDB, Amazon Relational Database Service (RDS), and DynamoDB
• Managed database hosting
Not offered as a service, but hosted and managed by the cloud database vendor
E.g. Rackspace offers managed hosting for MySQL
6. NoSQL Data Management
Which Data Model?
• Relational Databases
Standard SQL databases are available for cloud environments as virtual machine images or as a service, depending on the vendor
Not cloud-ready: Difficult to scale
• NoSQL databases
Database which is designed for the cloud
Built to serve heavy read/write loads
Good ability to scale up and down
Applications built on the SQL data model require a complete rewrite
E.g. Apache Cassandra, CouchDB and MongoDB
7. NoSQL Data Management
How to scale the data management?
• Vertical scaling – Scale up
• Horizontal scaling – Scale out
8. NoSQL Data Management
Definition and Goals of NoSQL databases
• No formal NoSQL definition available!
• Store very large-scale data, called "Big Data"
• Typically scale horizontally
• Simple query mechanisms
• Often designed and set up for a concrete application
• Typical characteristics of NoSQL databases are:
Non-relational
Schema-free
Open Source
Simple API
Distributed
Eventual consistency
Source: https://clt.vtc.edu.hk/what-happens-online-in-60-seconds/
9. NoSQL Data Management
Non-relational
• NoSQL databases generally do not follow the relational model
• Do not provide tables with flat fixed-column records
• Work with self-contained (hierarchical) aggregates or BLOBs
• No need for object-relational mapping and data normalization
• No complex and costly features like query languages, query planners, referential integrity, joins, ACID
Schema-free
• Most NoSQL databases are schema-free or have relaxed schemas
• No need for definition of any sort of schema of the data
• Allows heterogeneous structures of data in the same domain
10. NoSQL Data Management
Simple API
• Often simple interfaces for storage and querying data provided
• APIs often allow low-level data manipulation and selection methods
• Often no standard based query language is used
• Text-based protocols often using HTTP REST with JSON
• Web-enabled databases running as internet-facing services
Distributed
• Several NoSQL databases can be executed in a distributed fashion
• Providing auto-scaling and fail-over capabilities
• Often ACID is sacrificed for scalability and throughput
• Often no synchronous replication between distributed nodes is possible, e.g. asynchronous Multi-Master Replication, peer-to-peer, HDFS Replication
• Only providing eventual consistency
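A small sketch of the text-based HTTP REST + JSON access pattern mentioned on this slide, using the requests library; the endpoint shape below is an assumption for illustration, not a specific product's documented API.

```python
# Hedged sketch: a web-enabled key-value database exposing one
# REST resource per key, spoken to with plain HTTP and JSON.
import requests

base = "http://localhost:8098/buckets/sessions/keys"  # assumed endpoint shape

# Store a value under a key (PUT a JSON body to the resource).
requests.put(f"{base}/A08154711",
             json={"userlogin": "xyz", "date_of_expiry": "2015/12/31"})

# Retrieve the value (GET resource) and decode the JSON response.
doc = requests.get(f"{base}/A08154711").json()
print(doc["userlogin"])
```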
11. NoSQL Data Management
Core Categories of NoSQL Systems
• Key-Value Stores
Manage associative arrays
Big hash table
• Wide Column Stores
Each storage block contains only data from one column
Read and write is done using columns (rather than rows, as in SQL)
• Document Stores
Store documents consisting of tagged values
Data is a collection of key value pairs
Provides structure and encoding of the managed data
Encoded using XML, JSON, BSON
Schema-free
• Graph DB
Network database using graphs with nodes and edges for storage
Nodes represent entities, edges represent their relationships
• Other NoSQL systems
ObjectDB
XML DB
Special Grid DB
Triple Store
Example document value:
{
  "id": 123,
  "name": "Matthias",
  "location": {
    "city": "Hersonissos",
    "region": "Crete"
  }
}
[Figure: graph of nodes A, B, C connected by 'relation' edges]
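To contrast the first two categories, here is a hedged Python sketch showing the same record as an opaque key-value pair and as a queryable document; the stores are plain in-memory stand-ins, not real products.

```python
# Hedged sketch: the same record under two NoSQL data models.
# A key-value store sees only an opaque value; a document store
# keeps the structure and can query inside it.
import json

record = {
    "id": 123,
    "name": "Matthias",
    "location": {"city": "Hersonissos", "region": "Crete"},
}

# Key-value view: the store only understands get/put on the key.
kv_store = {"user:123": json.dumps(record)}

# Document view: nested fields stay addressable, so a query like
# "location.city == 'Hersonissos'" can be evaluated by the store.
doc_store = {"users": [record]}
matches = [d for d in doc_store["users"]
           if d["location"]["city"] == "Hersonissos"]
print(matches)
```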
12. NoSQL Data Management
Usage of NoSQL in Practice
• Google
Big Table
Google Apps, Google Search
• Facebook
Social network
• Twitter
• Amazon
DynamoDB and SimpleDB
• CERN
• GitHub
13. NoSQL Data Management
Overview
• Introduction to NoSQL
• Basic Concepts for NoSQL
CAP-Theorem
Eventual Consistency
Consistent Hashing
MVCC-Protocol
Query Mechanisms for NoSQL
• Overview of NoSQL Systems
14. NoSQL Data Management
CAP Theorem – Brewer's theorem
• States that it is impossible for a distributed system to provide all three of the following guarantees:
Consistency: all nodes see the same data at the same time
Availability: every request receives a response about whether it succeeded or failed
Partition tolerance: the system continues to operate despite physical network partitions
Choose two!
[Figure: CAP triangle with example systems labeled AP: NoSQL, Domain Name System (DNS), Cloud Computing]
15. NoSQL Data Management
Eventual consistency and BASE
• The term "eventual consistency"
Copies of data are kept on multiple machines to achieve high availability and scalability
A change to a data item on one machine has to be propagated to the other replicas
Propagation is not instantaneous, so the copies will be temporarily mutually inconsistent
The change will eventually be propagated to all the copies
The fast-access requirement dominates
Different replicas can return different values for the queried attribute of the object
A system that has achieved eventual consistency is said to have converged, or achieved replica convergence
• Eventual consistency guarantees: if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value
• Eventually consistent services provide weak consistency using BASE: Basically Available, Soft state, Eventual consistency
Basically available indicates that the system guarantees availability (CAP theorem)
Soft state indicates that the state of the system may change over time, even without input
Eventual consistency indicates that the system will become consistent over time
16. NoSQL Data Management
Consistent Hashing
• Technique for efficiently distributing replicas to nodes
• Consistent hashing is a special kind of hashing
When a hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys and n is the number of slots; in traditional hash tables nearly all keys have to be remapped
• Insert servers on ring
Hash based e.g. on IP, name, …
Take over objects between own and predecessor hash
• Insert objects on ring
Hash based on key
Walk around the circle until falling into the first bucket
• Delete Servers
Copy objects to next server
• Virtual Servers
More than one hash per server
• Replication
Place objects multiple times
Improves reliability
[Figure: hash ring with positions 0, 2 and 12 marked, showing Server 1, Server 3 and virtual servers Server 2/1 and Server 2/2]
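A minimal Python sketch of the consistent-hashing ring described on this slide, including virtual servers (several hash positions per server); the hash function and names are illustrative assumptions.

```python
# Hedged sketch of consistent hashing: servers and keys are hashed
# onto a ring, and a key is owned by the first server clockwise.
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Stable hash mapped onto the ring (MD5 purely for illustration).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes: int = 3):
        self.vnodes = vnodes   # "virtual servers": several positions per server
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # ring position -> server name

    def add_server(self, name: str) -> None:
        for i in range(self.vnodes):
            pos = ring_hash(f"{name}#{i}")
            bisect.insort(self._points, pos)
            self._owner[pos] = name  # takes over keys up to its position

    def remove_server(self, name: str) -> None:
        for i in range(self.vnodes):
            pos = ring_hash(f"{name}#{i}")
            self._points.remove(pos)  # its objects fall to the next server
            del self._owner[pos]

    def server_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the first server,
        # wrapping around at the end of the ring.
        idx = bisect.bisect(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]

ring = ConsistentHashRing()
for server in ("Server1", "Server2", "Server3"):
    ring.add_server(server)
print(ring.server_for("sessionid=A08154711"))
```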
17. NoSQL Data Management
Multiversion Concurrency Control (MVCC)
• Concurrency control method to provide concurrent access to the database
• LOCKING
All readers wait until the writer is done
This can be very slow!
• MVCC
A write adds a new version
Read is always possible
Any changes made by a writer will not be seen by other users of the database until they have been committed
Conflicts (e.g. V1a, V1b) can occur and have to be handled
[Figure: locking vs. MVCC timelines for two users: with locking, User 2's read of V is delayed until User 1's write transaction T commits; with MVCC, User 2 reads the old version V0 while User 1 writes V1, and reads V1 after the commit]
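A hedged, single-process sketch of the MVCC behaviour sketched above: writers append new versions instead of locking readers out, and readers always see the last committed version.

```python
# Minimal MVCC sketch: versions accumulate, reads never block.
import threading

class MVCCValue:
    def __init__(self, initial):
        self._versions = [initial]      # committed versions, oldest first
        self._lock = threading.Lock()   # guards only the commit step

    def read(self):
        # Readers never wait on writers: latest committed version.
        return self._versions[-1]

    def write(self, transform):
        # A write creates a new version from a snapshot. Two writers
        # starting from the same snapshot produce conflicting versions
        # (the slide's V1a/V1b case); real systems must detect/resolve that.
        snapshot = self.read()
        new_version = transform(snapshot)
        with self._lock:
            self._versions.append(new_version)  # commit the new version

v = MVCCValue({"counter": 0})
print(v.read())                                   # V0
v.write(lambda old: {"counter": old["counter"] + 1})
print(v.read())                                   # V1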
18. NoSQL Data Management
Query Mechanisms for NoSQL
• REST-based retrieval of a value based on its key/ID with GET resource
• Document stores allow more complex queries
E.g. CouchDB allows defining views with MapReduce
• MapReduce
Available in many NoSQL databases
Can run fully distributed
It is Functional Programming, not writing queries!
Map phase - performs filtering and sorting
Reduce phase - performs a summary operation
~ SELECT and GROUP BY of a relational database
More details later!
• Apache Spark is an open-source big data processing framework providing more operations than MapReduce
• Example use cases for MapReduce
Distributed Search
Counting – URL, Words
Building linkage graphs for web sites
Sorting distributed data
Source: @tgrall
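A small pure-Python rendering of the map and reduce phases described above (the classic word count); in Hadoop or CouchDB the framework performs the shuffle and the distribution across nodes.

```python
# Hedged sketch of the MapReduce pattern: map emits (key, value)
# pairs, shuffle groups them by key, reduce summarizes each group.
from collections import defaultdict

documents = ["nosql stores scale out", "sql stores scale up"]

# Map phase: emit (word, 1) for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: summary operation per key, ~ SELECT ... GROUP BY.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)
```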
19. NoSQL Data Management
Overview
• Introduction to NoSQL
• Basic Concepts for NoSQL
• Overview of NoSQL Systems
Key-Value Stores
Document Stores
Wide-column stores
Graph Stores
Hadoop Map/Reduce and more …
20. NoSQL Data Management
Key Value Stores
Memcached
Project Voldemort
Riak
• Developer: Basho Technologies (http://basho.com/)
• Current version: 2.1.1
• Available since: 2009
• Licence: Apache licence 2.0
• Supported operating systems: Linux, BSD, Mac OS, Solaris
• Client libraries for: Java, Ruby, Python, C#, Erlang (the official ones); C, Node.js, PHP (the unofficial ones); even more from the Riak community
21. NoSQL Data Management
Typical Use Cases
• Session data
• User profiles
• Sensor data (IoT)
Example sensor readings (keyed by id):
timestamp    x    y    z    temperature
01.01.2014   350  120  78   -10°
01.01.2014   350  120  95   -9°
01.01.2014   350  100  78   -10°
02.01.2014   350  120  78   -5°
02.01.2014   350  120  95   -8°
…

Example session entry:
key:   sessionid=A08154711
value: userlogin="xyz", date_of_expiry=2015/12/31
22. NoSQL Data Management
Key Functionality
Buckets contain key-value pairs. Key operations:
store <k,v>
get <k>
delete <k>
get bucket properties
set bucket properties
Bucket types:
create bucket type
activate bucket type
update bucket type
get status of bucket type
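A dependency-free sketch mirroring the bucket operations listed above; this is a stand-in for a real Riak client, not Riak's actual API.

```python
# Hedged sketch of bucket/key-value operations; names are illustrative.
class Bucket:
    def __init__(self, name, **props):
        self.name, self.props, self._data = name, props, {}

    def store(self, key, value):          # store <k,v>
        self._data[key] = value

    def get(self, key):                   # get <k>
        return self._data.get(key)

    def delete(self, key):                # delete <k>
        self._data.pop(key, None)

    def get_properties(self):             # get bucket properties
        return dict(self.props)

    def set_properties(self, **props):    # set bucket properties
        self.props.update(props)

sessions = Bucket("sessions", n_val=3)    # n_val: replication factor
sessions.store("sessionid=A08154711",
               {"userlogin": "xyz", "date_of_expiry": "2015/12/31"})
print(sessions.get("sessionid=A08154711"))
sessions.delete("sessionid=A08154711")
```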
23. NoSQL Data Management
Instances and Vnodes
• Riak runs on potentially large clusters
• Each host in the cluster runs a single instance of Riak (Riak node)
• Each Riak node manages a set of virtual nodes (vnodes)
• Mapping of <bucket,key> pairs:
compute 160-bit hash
map result to a ring position
ring is divided into partitions
each vnode is responsible for one partition
Source: docs.basho.com
24. NoSQL Data Management
Configure Replication
• Some bucket parameters
N: replication factor
R: number of servers that must respond to a read request
W: number of servers that must respond to a write request
DW: number of servers that must report that a write has successfully been written to disk
…
Parameters allow trading consistency vs. availability vs. performance
Source: docs.basho.com
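A toy sketch of the N/R/W quorum trade-off from this slide: with R + W > N the read set always overlaps the write set, so a read sees the latest acknowledged write. The in-memory replicas are illustrative stand-ins.

```python
# Hedged quorum sketch: N replicas, W write acks, R read answers.
import random

N, R, W = 3, 2, 2                      # R + W > N: read/write sets overlap
replicas = [{} for _ in range(N)]      # N copies of the data

def write(key, value, version):
    acked = 0
    for node in replicas:
        node[key] = (version, value)   # replicate and count the ack
        acked += 1
        if acked >= W:                 # answer the client after W acks;
            break                      # the rest would catch up later (not modeled)

def read(key):
    answers = [n[key] for n in random.sample(replicas, R) if key in n]
    return max(answers)[1] if answers else None   # newest version wins

write("cart:42", ["book"], version=1)
print(read("cart:42"))                 # always found, since R + W > N
```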
25. NoSQL Data Management
Transactions and Consistency
• No multi-operation transactions are supported
• Eventual consistency is default (and was the only option before Riak 2.0)
Vector clocks and Dotted Version Vectors (DVV) are used to resolve object conflicts
• Strong consistency as an alternative option
A quorum of nodes is needed for any successful operation
26. NoSQL Data Management
Riak Search
• For Riak KV, a value is just a value, possibly associated with a type
• Riak Search 2.0
Based on Solr, the search platform built on Apache Lucene
Define extractors, i.e., modules responsible for pulling out a list of fields and values from a Riak object
Define Solr schema to instruct Riak/Solr how to index a value
Queries: exact match, globs, inclusive/exclusive range queries, AND/OR/NOT, prefix matching, proximity searches, term boosting, sorting, …
27. NoSQL Data Management
Document Stores
MongoDB
• Developer: MongoDB, Inc. (https://www.mongodb.com/)
• Available since: 2009
• Licence: GNU AGPL v3.0
• Supported operating systems: all major platforms
• Drivers for: C, C++, C#, Java, Node.js, Perl, PHP, Python, Motor, Ruby, Scala, …
29. NoSQL Data Management
Typical Use Cases
• Simple content management, e.g. blogging platforms
• Logging events
copes with heterogeneous event types
the data associated with events changes over time
• E-commerce applications
flexible schema for product and order data
30. NoSQL Data Management
Data Organization
• Collections of documents share indexes (but not a schema)
• Collections are organized into databases
• Documents are stored in BSON
31. NoSQL Data Management
Key Functionality
• Database operations:
create collection
drop collection
• Collection operations:
find <field criteria> (query data inside the documents)
insert document
update document(s)
delete document(s)
create index
drop index
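A minimal sketch of these operations with the official PyMongo driver; a local MongoDB instance is assumed, and database/collection names are illustrative:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['blog']            # database
posts = db['posts']            # collection (created lazily)

# insert document
posts.insert_one({'title': 'NoSQL intro', 'tags': ['nosql', 'mongodb']})

# find <field criteria>: query data inside the documents
for doc in posts.find({'tags': 'nosql'}):
    print(doc['title'])

# update document(s)
posts.update_many({'tags': 'nosql'}, {'$set': {'reviewed': True}})

# create index
posts.create_index('tags')

# delete document(s)
posts.delete_many({'title': 'NoSQL intro'})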
33. NoSQL Data Management
Availability
• Each replica set is a group of MongoDB instances that hold the same dataset
• One primary instance takes all write operations
• Multiple secondary instances
• Changes are replicated to the secondaries, which apply the primary's oplog
• If the primary is unavailable, the replica set elects a secondary to be primary
• Specific secondaries:
priority 0 member
hidden member
delayed member
arbiter
34. NoSQL Data Management
Scalability
• Replica sets for read scalability
reading from secondaries may provide stale data
• Sharding for write scalability
at collection level
using an indexed field that is available in all documents of the collection
range-based or hash-based partitioning
each shard is a MongoDB instance
background processes maintain an even distribution: splitting + balancer
shards may also hold replica sets
• Components: the shards (1, 2, 3, …), a router that handles all reads and writes, and config servers that hold the metadata
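For illustration, sharding is switched on per database and per collection via admin commands; a hedged PyMongo sketch, where the database shop, the collection orders and the shard key customer_id are hypothetical, and the commands must be sent to a mongos router:

from pymongo import MongoClient

# connect to the mongos router, not to an individual shard
client = MongoClient('mongos-host', 27017)

# enable sharding for the database
client.admin.command('enableSharding', 'shop')

# shard the collection on an indexed field present in all documents,
# here hash-based on customer_id
client.admin.command('shardCollection', 'shop.orders',
                     key={'customer_id': 'hashed'})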
35. NoSQL Data Management
Transactions and Consistency
• Write concern (w option)
default: confirms write operations only on the primary
number: guarantees that write operations have propagated successfully to the specified number of replica set members, including the primary
majority: confirms that write operations have propagated to the majority of voting nodes
• Read preference
describes how MongoDB clients route read operations to the members of a replica set, i.e., to one of the secondaries or to the primary
eventual consistency: reading stale data, even from the primary, is possible!
• No multi-operation transactions are supported
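A hedged PyMongo sketch of both options; the collection name is illustrative:

from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient('localhost', 27017)
db = client['blog']

# write concern majority: acknowledged only after the write has
# propagated to a majority of the voting replica set members
safe_posts = db.get_collection(
    'posts', write_concern=WriteConcern(w='majority'))
safe_posts.insert_one({'title': 'durable post'})

# read preference: route reads to a secondary if one is available
# (such reads may return stale data)
fast_posts = db.get_collection(
    'posts', read_preference=ReadPreference.SECONDARY_PREFERRED)
print(fast_posts.find_one({'title': 'durable post'}))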
36. NoSQL Data Management
Wide-Column Stores / Column Family Stores
HBase
• 2006: originally a project of the company Powerset
• 2008: HBase becomes a Hadoop sub-project
• 2010: HBase becomes an Apache top-level project
• Runs on top of HDFS (Hadoop Distributed File System)
• Provides BigTable-like capabilities for Hadoop
• APIs: Java, REST, Thrift, C/C++
37. NoSQL Data Management
Logical Data Model
• Table rows contain:
a row key
versions, typically identified by a timestamp
multiple column families per key
- column families are defined at design time
- columns can be added to a column family at runtime
• Metadata
there is no catalog that provides the set of all columns for a certain table; this is left to the user/application
• The result is a sparse table
38. NoSQL Data Management
Physical Data Model
• Each column family is stored separately
• Sorted by timestamp (descending)
• Empty cells from the logical view are not stored
• Key/Value class: each cell is serialized as
keylength | valuelength | key | value
where the key part itself consists of
rowlength | row key | columnfamily length | column family | column qualifier | timestamp | keytype
• Example key: row key com.cnn.www, column family anchor, column qualifier cnnsi.com, timestamp t9, key type put
39. NoSQL Data Management
Key Functionality
• Data organization: namespaces contain tables; a table holds rows of the form (key, column family, columns c, …, c) and is split into regions
• Table operations:
create
alter
drop
• Data operations:
put <k, …>
get <k, t>
scan
delete <k>
• Region operations:
split
merge
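A hedged sketch of these operations from Python via the HappyBase client, which talks to HBase through its Thrift gateway; the table name, families and keys are illustrative:

import happybase

# assumes a running HBase Thrift server on the default port
connection = happybase.Connection('localhost', 9090)

# create a table with two column families, a and b
connection.create_table('T', {'a': dict(), 'b': dict()})
table = connection.table('T')

# put <k, ...>: write columns of both families for one row key
table.put(b'com.cnn.www', {b'a:anchor': b'cnnsi.com',
                           b'b:mime': b'text/html'})

# get <k>: fetch a single row
print(table.row(b'com.cnn.www'))

# scan: iterate over a key range
for key, data in table.scan(row_start=b'com.'):
    print(key, data)

# delete <k>
table.delete(b'com.cnn.www')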
40. NoSQL Data Management
MasterServer and RegionServer
• HMaster
monitors the RegionServers
handles operations related to metadata (tables, column families, regions)
backed by one or more backup HMasters
• HRegionServer
manages regions
handles operations related to data (put, get, …)
handles operations related to regions (split, merge, …)
• The catalog table META lists the regions of each table; it exists as an HBase table, i.e., it is not resident in the MasterServer
• Failover of the MasterServer
HBase clients talk directly to the RegionServers, hence they may continue without a MasterServer (at least for some time)
• Failover of a RegionServer
a region immediately becomes unavailable when its RegionServer is down
the Master detects that the RegionServer has failed
its region assignments are considered invalid
the regions are assigned to a new RegionServer
41. NoSQL Data Management
Storage Structure
• Example: table T with column families a and b
• Each HRegionServer hosts several regions; per region there is one Store per column family (Store a, Store b)
• Each Store consists of one MemStore (in main memory) and several StoreFiles (in the file system)
• StoreFiles are made up of blocks holding <k,v> pairs
• Each HRegionServer additionally keeps a log
42. NoSQL Data Management
Write Data
1. Write the change to the log (WAL)
2. Write the change to the MemStore
3. Regularly flush the MemStore to disk (into StoreFiles) and empty the MemStore
Example: a write to table_T.family_a.field_f is first logged by the HRegionServer and then applied to the MemStore of Store a
43. NoSQL Data Management
Read Data
1. Read from the block cache
2. Read from the MemStore
3. Read from all relevant StoreFiles
4. Merge the results
Example: a read of table_T.family_a.field_f consults the block cache, the MemStore and the StoreFiles of Store a
44. NoSQL Data Management
StoreFile Reorganisation
• Minor compaction
merges several StoreFiles, without considering tombstones etc.
• Major compaction
reorganizes the store files, e.g. by removing deleted rows
significantly reduces file size
runs at a configurable time interval, or manually
• Merge regions
consolidates several regions of one table into a single region
offline reorganization, initiated manually!
• Split regions
distributes data from one region to several regions
configurable via the parameter max.filesize, or manual
(diagram: a Store with several StoreFiles is compacted into a Store with a single StoreFile)
45. NoSQL Data Management
Transactions and Consistency
• No explicit transaction boundaries
• Atomicity
atomic row-level operations (either "success" or "failure")
operations spanning several rows (batch put) are not atomic
• Consistency
default: strong consistency, by routing all requests for a row through a single region server
optional: region replication for high availability
- writes go only through the primary
- reads may also be processed by the secondaries
46. NoSQL Data Management
CQL in Cassandra

CREATE KEYSPACE demodb WITH REPLICATION =
  {'class' : 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE users (
  user_name varchar,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint,
  PRIMARY KEY (user_name));

INSERT INTO excelsior.clicks (userid, url, date, name)
VALUES ( 3715e600-2eb0-11e2-81c1-0800200c9a66,
  'http://apache.org', '2013-10-09', 'Mary')
USING TTL 86400;

SELECT *
FROM emp
WHERE empID IN (130,104)
ORDER BY deptID DESC;

SELECT WRITETIME (name)
FROM excelsior.clicks
WHERE url = 'http://apache.org';

(cassandra.apache.org)
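For completeness, a hedged sketch of issuing such CQL statements from Python with the DataStax cassandra-driver; the node address is a placeholder, and the table is a trimmed variant of the users example above:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])     # contact point is an assumption
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demodb WITH REPLICATION =
    {'class': 'SimpleStrategy', 'replication_factor': 3}""")
session.set_keyspace('demodb')

session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_name varchar,
        birth_year bigint,
        PRIMARY KEY (user_name))""")

session.execute(
    "INSERT INTO users (user_name, birth_year) VALUES (%s, %s)",
    ('mary', 1988))

for row in session.execute('SELECT * FROM users'):
    print(row.user_name, row.birth_year)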
47. NoSQL Data Management
Graph Databases
Examples: FlockDB, Neo4j
Neo4j
• Developer: Neo Technology (http://www.neotechnology.com)
• Available since: 2007
• Licence: GPLv3 and AGPLv3, plus commercial licences
• Supported operating systems: all major platforms
• Drivers for: Java, .NET, JavaScript, Python, Ruby, PHP, and R, Go, Clojure, Perl, Haskell
48. NoSQL Data Management
Typical Use Cases
• Highly connected data, e.g., social networks, employees and their knowledge
• Location-based services, e.g., planning deliveries on a graph with distances as edge weights
• Recommendation systems, e.g., bought products, often-visited attractions ("people who visited … also visited …")
(smartblogs.com)
49. NoSQL Data Management
Graph Data Model
• Graphs consist of nodes and relationships; both can carry properties
• No need to define a schema
(neo4j.com)
50. NoSQL Data Management
Basic Functionality
• Graph operations:
create node/relationship
delete node/relationship
set property
remove property
• Queries:
match node patterns
graph traversal
• Indexes:
create index
query index
51. NoSQL Data Management
Cypher Example
• Many query languages: Cypher, Gremlin, G, GraphLog, GRAM, GraphDB, …
• No standard

CREATE (me:PERSON {name:"Holger"})
CREATE (mat:PERSON {name:"Matthias"})
CREATE (fra:PERSON {name:"Frank"})
CREATE (me) -[:KNOWS]-> (mat)
CREATE (me) -[:KNOWS]-> (fra)
CREATE (mat) -[:KNOWS]-> (me)

MATCH (n {name:"Holger"})-[:KNOWS]->(m)
RETURN m
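A hedged sketch of running this Cypher from Python with the official neo4j driver; the bolt URI and the credentials are placeholders:

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', 'secret'))

with driver.session() as session:
    # create two PERSON nodes and KNOWS relationships between them
    session.run(
        'CREATE (me:PERSON {name:"Holger"}) '
        'CREATE (mat:PERSON {name:"Matthias"}) '
        'CREATE (me)-[:KNOWS]->(mat) '
        'CREATE (mat)-[:KNOWS]->(me)')

    # match node patterns: whom does Holger know?
    result = session.run(
        'MATCH (n {name:"Holger"})-[:KNOWS]->(m) RETURN m.name AS name')
    for record in result:
        print(record['name'])

driver.close()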
52. NoSQL Data Management
S2
S3 S4
S1
Scalability
• Strategies for read scaling
a) large enough memory
for the working set of nodes
b) adding read-only slaves
graph
data
node memory
graph
data
graph
data
graph
data
graph
data
slave memory
c) application-level sharding
graph
data
north
graph
data
south
appl query
north/south?
automatic
sharding?
52
53. NoSQL Data Management
High Availability (write on master)
• The HA feature of neo4j:
a cluster of 1 master and n slave nodes
the cluster continues to operate from any number of nodes down to a single machine
nodes may leave and re-join the cluster
in case of a master failure, a new master is elected
read operations are possible on any node
write operations are possible on any node and are propagated to the others
• Write on master: the master commits the write and propagates it to the slaves (M -> S1, S2, S3)
54. NoSQL Data Management
High Availability (write on slave)
• A write issued on a slave is synchronized with the master: the slave and the master commit the write together
• The master then propagates the change to the other slaves
• The remaining slaves may also pull updates asynchronously (M -> S1, S2, S3)
55. NoSQL Data Management
Transactions and Consistency
• Transaction boundaries are set explicitly
• Transactions are possible on any node in the cluster
• Transactions are atomic and durable
• Writes are eventually consistent
they are optimistically pushed to the slaves
slaves can also be configured to pull updates asynchronously
57. NoSQL Data Management
Principles of Map Reduce
• The user provides data in files
• Data model: key/value pairs (k, v)
• Based on the higher-order functions MAP and REDUCE
• Tasks of the programmer:
user-defined functions m and r serve as input to MAP and REDUCE
m and r define what the job actually does
• MAP m: (Km, Vm) ↦ (Kr, Vr)*
• REDUCE r: (Kr, Vr*) ↦ (Kr, Vr)
• Example, aggregate salary per department (see the Python sketch below):
MAP: (employee, <name, department, salary, …>) ↦ (department, salary)
REDUCE: (department, <salary, salary, …>) ↦ (department, total salary)
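The salary example can be written out as a small, self-contained Python sketch of the programming model; the in-memory grouping stands in for the framework's shuffle phase, so this illustrates the contract of m and r rather than Hadoop itself:

from collections import defaultdict

def m(key, value):
    # user-defined map: (employee, <name, department, salary, ...>) -> (department, salary)
    yield (value['department'], value['salary'])

def r(key, values):
    # user-defined reduce: (department, <salary, salary, ...>) -> (department, total)
    return (key, sum(values))

def map_reduce(records, m, r):
    intermediate = defaultdict(list)
    for key, value in records:          # MAP phase
        for k, v in m(key, value):
            intermediate[k].append(v)   # shuffle: group values by key
    return [r(k, vs) for k, vs in intermediate.items()]   # REDUCE phase

employees = [
    ('e1', {'name': 'Ann', 'department': 'R&D', 'salary': 5000}),
    ('e2', {'name': 'Bob', 'department': 'Sales', 'salary': 4000}),
    ('e3', {'name': 'Eve', 'department': 'R&D', 'salary': 5500}),
]
print(map_reduce(employees, m, r))      # [('R&D', 10500), ('Sales', 4000)]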
58. NoSQL Data Management
…
Execution of Map/Reduce Jobs
57
file
task tracker m2
map()
combine()
partition()
task tracker m3
task tracker m1
split 1
split 2
split 3
split 4
split 5
k/v 1
k/v 2
k/v 1
k/v 2
k/v 1
k/v 2
task tracker r2
task tracker r1
shuffle()
sort()
reduce()
output()
file
phases defined by user
job trackerprogram
client start &
control
59. NoSQL Data Management
Fault Tolerance
• Map node fails
the job tracker receives no report for a certain time -> the node is marked as failed
the map job is restarted on a different node
the new job reads another copy of the necessary input split
• Reduce node fails
the job tracker receives no report for a certain time -> the node is marked as failed
the reduce job is restarted on a different node
the necessary intermediate input data is re-read from the map nodes
• To make this work, all relevant data has to be stored in a distributed file system, in particular
the input splits
the intermediate data produced by map jobs
-> Hadoop Distributed File System (HDFS)
61. NoSQL Data Management
Is that all?
• Other systems build on or extend this basic functionality
• Some build an SQL-like layer on top of Hadoop MapReduce
Hive
Pig
• Others focus on data stream and in-memory processing
Spark
Flink
62. NoSQL Data Management
What We Also Skipped Today
• Further classes of NoSQL systems
triple stores, …
• NewSQL
• Cloud offerings for the various types of NoSQL data stores
e.g., Riak CS (Cloud Storage)
• More cloud platforms
IBM Bluemix
Google App Engine
63. NoSQL Data Management
Conclusion
• Relational databases provide
data spread over many tables
a schema that needs to be defined up front
a structured query language (SQL)
transactions
strong consistency
general-purpose applicability
• NoSQL systems provide
aggregated data in one object (identified by a key)
no predefined schema
no declarative query language
limited transactional capabilities
eventual consistency rather than ACID properties
a focus on scalability and availability
often selected and customized for a concrete application scenario
• To make a proper decision, carefully examine your application:
the data model that is most appropriate
the query complexity
the consistency needs
the transactional requirements