NoSQL databases take different approaches to storing and querying data compared to relational databases. Key-value databases store data as unstructured blobs associated with keys, documents databases store hierarchical data as documents, columnar databases store data by column rather than by row for improved analytics performance, and graph databases natively represent relationships between nodes. Aggregate-oriented NoSQL databases group and store related data together for faster access compared to retrieving scattered relational data.
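The storage models described above can be sketched with plain Python structures (no database required); the shapes below are illustrative only, showing how the same user record might look under three of the models.

```python
# Key-value: the value is an opaque blob looked up only by its key.
kv_store = {"user:42": b'{"name": "Ada", "city": "London"}'}

# Document: the record is a hierarchical document the database can inspect.
doc_store = {"user:42": {"name": "Ada", "address": {"city": "London"}}}

# Column-oriented: values for one attribute are stored together across rows,
# which is what makes scans over a single column fast.
column_store = {
    "name": {"user:42": "Ada", "user:43": "Grace"},
    "city": {"user:42": "London", "user:43": "New York"},
}

print(column_store["city"]["user:42"])  # a single-column lookup: London
```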
Cassandra from the trenches: migrating Netflix (update), by Jason Brown
An updated talk on Cassandra at Netflix, presented at the Silicon Valley NoSQL meetup on 9 Feb 2012. It includes an introduction to Astyanax, an open-source Cassandra client written in Java.
Information technology has led us into an era in which producing, sharing, and using information are part of everyday life, often without our being fully aware of it: it is now almost impossible not to leave a digital trail of many of the actions we perform every day, for example through digital content such as photos, videos, and blog posts, and through everything that revolves around social networks (Facebook and Twitter in particular). Added to this, with the "Internet of Things" we see a growing number of devices such as watches, bracelets, thermostats, and many other items able to connect to the network and therefore generate large data streams. This explosion of data explains the emergence of the term Big Data: data produced in large quantities, at remarkable speed, and in different formats, which requires processing technologies and resources that go far beyond conventional data management and storage systems. It is immediately clear that, in these contexts, 1) storage models based on the relational model and 2) processing systems based on stored procedures and grid computation are no longer applicable. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when dealing with big data, variability, that is, the lack of a fixed structure, also represents a significant problem. This has given a boost to the development of NoSQL databases. The website NoSQL Databases defines NoSQL databases as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable."
These databases are distributed, open source, horizontally scalable, free of a predetermined schema (key-value, column-oriented, document-based, and graph-based), easily replicable, not bound to ACID guarantees, and able to handle large amounts of data. They are typically integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open-source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model, as taught in basic database design courses, has many limitations compared to the demands posed by new applications, which use Big Data and NoSQL databases to store data and MapReduce to process it.
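The MapReduce paradigm mentioned above can be sketched in a few lines: a map function emits (key, value) pairs, a shuffle step groups values by key, and a reduce function aggregates each group. This is a single-process toy version of the classic word count; real frameworks such as Hadoop run the same three phases distributed across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) for every word in the input document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Aggregate all values emitted for one key."""
    return (word, sum(counts))

def map_reduce(documents):
    groups = defaultdict(list)          # the "shuffle" step: group by key
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = map_reduce(["big data", "big ideas"])
print(counts)  # {'big': 2, 'data': 1, 'ideas': 1}
```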
Course Website http://pbdmng.datatoknowledge.it/
Contact me for further information and to download the slides
MongoDB is a horizontally scalable, schema-free, document-oriented NoSQL database. It stores data in flexible, JSON-like documents, allowing for easy storage and retrieval of data without rigid schemas. MongoDB provides high performance, high availability, and easy scalability. Some key features include embedded documents and arrays to reduce joins, dynamic schemas, replication and failover for availability, and auto-sharding for horizontal scalability.
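The "embedded documents and arrays to reduce joins" feature can be illustrated with a plain Python dict standing in for MongoDB's BSON document (a sketch only; with the real pymongo driver the same structure would be passed to a collection's insert method): instead of a separate orders table joined by foreign key, the orders live inside the user document itself.

```python
user = {
    "_id": "u1",
    "name": "Ada",
    "orders": [                      # embedded array: no join needed
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# One read retrieves the user together with all of their orders.
total_items = sum(o["qty"] for o in user["orders"])
print(total_items)  # 3
```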
NoSQL stands for "not only SQL." NoSQL databases are databases that store data in a format other than relational tables. Most NoSQL (non-relational) databases, with the notable exception of graph databases, do not model relationships between records as efficiently as relational joins do, so highly relational data can be a poor fit for them.
This document provides an overview of SQL Server database development concepts including SQL Server objects, tables, data types, relationships, constraints, indexes, views, queries, joins, stored procedures and more. It begins with introductory content on SQL Server and databases and then covers these topics through detailed explanations and examples in a structured outline.
The document discusses NoSQL databases and MapReduce. It provides historical context on how databases were not adequate for the large amounts of data being accumulated from the web. It describes Brewer's Conjecture and CAP Theorem, which contributed to the rise of NoSQL databases. It then defines what NoSQL databases are, provides examples of different types, and discusses some large-scale implementations like Amazon SimpleDB, Google Datastore, and Hadoop MapReduce.
Graph databases store data in graph structures with nodes, edges, and properties. Neo4j is a popular open-source graph database that uses a property graph model. It has a core API for programmatic access, indexes for fast lookups, and Cypher for graph querying. Neo4j provides high availability through master-slave replication and scales horizontally by sharding graphs across instances through techniques like cache sharding and domain-specific sharding.
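The property graph model described above can be sketched as a toy in-memory structure (a hypothetical illustration, not Neo4j's actual storage format): nodes and edges each carry properties, and a traversal function plays the role that a declarative Cypher query such as `MATCH (a:Person)-[:KNOWS]->(b) RETURN b.name` would in Neo4j.

```python
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Person", "name": "Carol"},
}
edges = [  # (source, relationship type, target, edge properties)
    (1, "KNOWS", 2, {"since": 2010}),
    (1, "KNOWS", 3, {"since": 2015}),
]

def known_by(node_id):
    """Names of nodes reachable from node_id over a KNOWS edge."""
    return [nodes[dst]["name"] for src, rel, dst, _ in edges
            if src == node_id and rel == "KNOWS"]

print(known_by(1))  # ['Bob', 'Carol']
```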
This document discusses different types of distributed databases. It covers data models like relational, aggregate-oriented, key-value, and document models. It also discusses different distribution models like sharding and replication. Consistency models for distributed databases are explained including eventual consistency and the CAP theorem. Key-value stores are described in more detail as a simple but widely used data model with features like consistency, scaling, and suitable use cases. Specific key-value databases like Redis, Riak, and DynamoDB are mentioned.
Object relational database management system, by Saibee Alam
This presentation provides a full explanation of object-relational database management systems. It is part of an advanced database management course and an important topic in computer science, whether you are a UG/PG student or preparing for a competitive exam.
This document provides an overview and introduction to NoSQL databases. It discusses key-value stores like Dynamo and BigTable, which are distributed, scalable databases that sacrifice complex queries for availability and performance. It also explains column-oriented databases like Cassandra that scale to massive workloads. The document compares the CAP theorem and consistency models of these databases and provides examples of their architectures, data models, and operations.
PostgreSQL is an open-source object-relational database management system descended from POSTGRES. It supports many SQL standards and features extensions like user-defined data types, functions, operators and index methods. Transactions in PostgreSQL provide ACID properties including atomicity, consistency, isolation and durability to maintain data integrity during concurrent operations.
This presentation about HBase will help you understand what HBase is, what the applications of HBase are, how HBase differs from an RDBMS, what HBase storage is, and what the architectural components of HBase are; at the end, we will also look at some HBase commands in a demo. HBase is an essential part of the Hadoop ecosystem. It is a column-oriented database management system derived from Google's NoSQL database Bigtable that runs on top of HDFS. After watching this video, you will know how to store and process large datasets using HBase. Now, let us get started and understand HBase and what it is used for.
Below topics are explained in this HBase presentation:
1. What is HBase?
2. HBase Use Case
3. Applications of HBase
4. HBase vs RDBMS
5. HBase Storage
6. HBase Architectural Components
What is this Big Data Hadoop training course about?
Simplilearn’s Big Data Hadoop training course lets you master the concepts of the Hadoop framework and prepares you for Cloudera’s CCA175 Big Data certification. The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
This presentation covers NoSQL databases and the different types of NoSQL databases. It also discusses how MongoDB works, MongoDB security, and MongoDB sharding.
This document provides an overview of CouchDB, a NoSQL document database. It discusses key concepts like the CAP theorem and different categories of NoSQL databases. It then describes CouchDB in more detail, covering how to interact with data via REST APIs and CURL, use design documents to define views and validation, and handle data replication and conflicts. Map/reduce functions are used to query the data and build indexes.
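The map/reduce views mentioned above can be sketched as follows: a map function emits (key, value) rows for each document, and a reduce folds the rows per key into an index. (In CouchDB these functions live in a design document and are written in JavaScript; Python is used here purely for illustration, and the document fields are invented.)

```python
from collections import defaultdict

docs = [
    {"_id": "1", "type": "order", "customer": "ada",   "total": 10},
    {"_id": "2", "type": "order", "customer": "grace", "total": 25},
    {"_id": "3", "type": "order", "customer": "ada",   "total": 5},
]

def map_fn(doc):
    """Emit (customer, total) for each order document."""
    if doc.get("type") == "order":
        yield (doc["customer"], doc["total"])

def build_view(documents, reduce_fn=sum):
    index = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            index[key].append(value)
    return {k: reduce_fn(v) for k, v in index.items()}

print(build_view(docs))  # {'ada': 15, 'grace': 25}
```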
DynamoDB is a key-value database that achieves high availability and scalability through several techniques:
1. It uses consistent hashing to partition and replicate data across multiple storage nodes, allowing incremental scalability.
2. It employs vector clocks to maintain consistency among replicas during writes, decoupling version size from update rates.
3. For handling temporary failures, it uses sloppy quorum and hinted handoff to provide high availability and durability guarantees when some replicas are unavailable.
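The consistent hashing of point 1 can be sketched with a minimal hash ring (illustrative only; Dynamo's production scheme adds virtual nodes and preference lists). Each node owns the arc of the ring up to its position, so adding or removing one node moves only the keys on its arc, which is what enables incremental scalability.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._positions = [h for h, _ in self._ring]

    def node_for(self, key):
        """First node clockwise from the key's position on the ring."""
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # deterministic owner for this key
```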
In this lecture we analyze document-oriented databases. In particular, we consider why they were the first approach to NoSQL and what their main features are. Then we analyze MongoDB as an example, covering the data model, CRUD operations, write concerns, and scaling (replication and sharding).
Finally, we present other document-oriented databases and discuss when to use, and when not to use, a document-oriented database.
Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007..., by Beat Signer
The document discusses Structured Query Language (SQL) and its history and components. It notes that SQL is a declarative query language used to define database schemas, manipulate data through queries, and control transactions. The document outlines SQL's data definition language for defining schemas and data manipulation language for querying and modifying data. It also provides examples of SQL statements for creating tables and defining constraints.
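The DDL/DML split described above can be demonstrated with a small runnable example using Python's built-in sqlite3 module (standard SQL, though SQLite's dialect differs from other engines in places; the table and columns are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition language: create a table with constraints.
conn.execute("""
    CREATE TABLE employee (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        salary  REAL CHECK (salary >= 0)
    )
""")

# Data manipulation language: insert rows and query them back.
conn.execute("INSERT INTO employee (name, salary) VALUES (?, ?)",
             ("Ada", 5000.0))
row = conn.execute("SELECT name, salary FROM employee").fetchone()
print(row)  # ('Ada', 5000.0)
```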
The document provides an overview of column databases. It begins with a quick recap of different database types and then defines and discusses column databases and column-oriented databases. It explains that column databases store data by column rather than by row, allowing for faster access to specific columns of data. Examples of column databases discussed include Cassandra, HBase, and Vertica. The document then focuses on Cassandra, describing its data model using concepts like keyspaces and column families. It also explains Cassandra's database engine architecture featuring memtables, SSTables, and compaction. The document concludes by mentioning some large companies that use Cassandra in production systems.
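The memtable/SSTable write path described above can be sketched in a simplified form (real SSTables are immutable on-disk files with indexes and bloom filters, and compaction later merges them; here they are just sorted in-memory snapshots): writes go to a memtable, a full memtable is flushed as a sorted table, and reads consult the memtable first and then the flushed tables newest-first.

```python
MEMTABLE_LIMIT = 2

memtable = {}
sstables = []  # sorted key-value snapshots, oldest first

def write(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:          # flush threshold reached
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def read(key):
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):             # newest flush wins
        if key in table:
            return table[key]
    return None

write("a", 1)
write("b", 2)   # triggers a flush to an SSTable
write("a", 9)   # newer value shadows the flushed one
print(read("a"), read("b"))  # 9 2
```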
Concepts of Apache Hive in Big Data.
Contains:
What is Hive?
Why Hive?
How Hive works
Hive architecture
Data models in Hive
Pros and cons of Hive
HiveQL
Pig vs Hive
Cassandra is an open-source, distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability and performance, as well as schema flexibility. Cassandra is used at large companies like Facebook, Netflix, and eBay for its ability to scale and perform well under heavy loads. However, it may not be suited for applications requiring many joins, transactions, or strong consistency guarantees.
The presentation provides an overview of NoSQL databases, including a brief history of databases, the characteristics of NoSQL databases, different data models like key-value, document, column family and graph databases. It discusses why NoSQL databases were developed as relational databases do not scale well for distributed applications. The CAP theorem is also explained, which states that only two out of consistency, availability and partition tolerance can be achieved in a distributed system.
This document revisits SQL basics and advanced topics. It covers objectives, assumptions, and topics including staying clean with conventions, data types, revisiting basics, joins, subqueries, joins versus subqueries, GROUP BY, set operations, and CASE statements. The topics section provides details on each topic, with examples to deepen SQL knowledge and help write better queries.
In this introduction to Apache Hive the following topics are covered:
1. Hive Introduction
2. Hive origin
3. Where does Hive fall in Big Data stack
4. Hive architecture
5. Its job execution mechanisms
6. HiveQL and Hive Shell
7. Types of tables
8. Querying data
9. Partitioning
10. Bucketing
11. Pros
12. Limitations of Hive
This document provides an overview of different database types including relational, NoSQL, document, key-value, graph, and column family databases. It discusses the history and drivers behind the development of NoSQL databases, as well as concepts like horizontal scaling, the CAP theorem, and eventual consistency. Specific databases are also summarized, including MongoDB, Redis, Neo4j, and HBase.
The document discusses various techniques for optimizing SQL Server performance, including handling index fragmentation, optimizing files and partitioning tables, effective use of SQL Profiler and Performance Monitor, a methodology for performance troubleshooting, and a 10-step process for performance optimization. Some key points covered are determining and resolving index fragmentation, partitioning tables across multiple file groups, capturing traces with SQL Profiler and Performance Monitor counters to diagnose issues, and ensuring proper indexing through query execution plans and the SQL Server Database Engine Tuning Advisor.
The document discusses NoSQL databases and big data frameworks. It defines NoSQL databases as next generation databases that are non-relational, distributed, open-source and horizontally scalable. It describes four main categories of NoSQL databases - document databases, key-value stores, column-oriented databases and graph databases. It also discusses properties of NoSQL databases and provides examples of popular NoSQL databases. The document then discusses big data frameworks like Hadoop and its ecosystem including HDFS, MapReduce, YARN and Hadoop Common. It provides details on how these components work together to process large datasets in a distributed manner.
The document provides an overview of high performance scalable data stores, also known as NoSQL systems, that have been introduced to provide faster indexed data storage than relational databases. It discusses key-value stores, document stores, extensible record stores, and relational databases that provide horizontal scaling. The document contrasts several popular NoSQL systems, including Redis, Scalaris, Tokyo Tyrant, Voldemort, Riak, and SimpleDB, focusing on their data models, features, performance, and tradeoffs between consistency and scalability.
This document provides an overview of NoSQL databases and MongoDB. It states that NoSQL databases are more scalable and flexible than relational databases. MongoDB is described as a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB uses collections and documents to store data in a flexible, JSON-like format.
1) Organizations now deal with huge amounts of data both internally and externally generated to better understand their business and customers.
2) Relational databases cannot effectively handle this big data due to challenges in data structure, scaling, and speed.
3) NoSQL databases provide alternatives to store structured, semi-structured, and unstructured data across different data models like columnar, key-value, document, and graph. Each type has different properties suited for various use cases.
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICSRUHULAMINHAZARIKA
Apache Hive is a data warehousing tool built on top of Hadoop that allows users to query and manage large datasets using SQL. It is targeted towards users familiar with SQL and allows them to write queries in a language called HiveQL, which is similar to SQL. Hive allows SQL queries to be parallelized into map/reduce jobs that run on Hadoop clusters. Hive also supports partitioning of tables to improve query performance on large datasets.
The document discusses different NoSQL database models and when each may be appropriate to use. It notes that relational databases can scale with enough effort, but using the proper NoSQL model for one's data avoids unnecessary layers of abstraction. Key-value stores are best for simple dictionaries or session data, while document stores allow for querying inner document values and are well-suited for documents like blogs. Column-family databases are optimized for high write volumes with small chance of collisions. Graph databases are best for data inherently involving nodes and relationships like social networks. The best approach is "polyglot persistence", using the database model that best represents each slice of data rather than forcing all data into a single model.
The document provides an introduction to NOSQL databases. It begins with basic concepts of databases and DBMS. It then discusses SQL and relational databases. The main part of the document defines NOSQL and explains why NOSQL databases were developed as an alternative to relational databases for handling large datasets. It provides examples of popular NOSQL databases like MongoDB, Cassandra, HBase, and CouchDB and describes their key features and use cases.
This document provides an overview of NoSQL databases and the HBase framework. It discusses key aspects of NoSQL including advantages like high scalability and schema flexibility. It then describes the different categories of NoSQL databases including key-value, column-oriented, graph and document oriented. The document proceeds to explain aggregate data models and how key-value and document databases are aggregate-oriented. It provides details on HBase, describing it as a column-oriented database, and its architecture, data model involving tables, rows, column families and cells.
Big Data Frameworks: Introduction to NoSQL – Aggregate Data Models – Hbase: Data Model and Implementations – Hbase Clients – Examples – .Cassandra: Data Model – Examples – Cassandra Clients – Hadoop Integration. Pig – Grunt – Pig Data Model – Pig Latin – developing and testing Pig Latin scripts. Hive – Data Types and File Formats – HiveQL Data Definition – HiveQL Data Manipulation – HiveQL Queries
This document provides an introduction to NoSQL and MongoDB. It discusses that NoSQL is a non-relational database management system that avoids joins and is easy to scale. It then summarizes the different flavors of NoSQL including key-value stores, graphs, BigTable, and document stores. The remainder of the document focuses on MongoDB, describing its structure, how to perform inserts and searches, features like map-reduce and replication. It concludes by encouraging the reader to try MongoDB themselves.
This document provides an introduction to MongoDB, a non-relational NoSQL database. It discusses what NoSQL databases are and their benefits compared to SQL databases, such as being more scalable and able to handle large, changing datasets. It then describes key features of MongoDB like high performance, rich querying, and horizontal scalability. The document outlines concepts like document structure, collections, and CRUD operations in MongoDB. It also covers topics such as replication, sharding, and installing MongoDB.
NoSQL is a non-relational database designed for large-scale data storage needs. It has several key features: it is non-relational, schema-free, uses simple APIs, and is distributed. The four main types of NoSQL databases are key-value, column-oriented, document-oriented, and graph-based. Key advantages of NoSQL include scalability, flexibility in data structures, and ease of development. However, NoSQL sacrifices some consistency and lacks standardization compared to SQL databases.
This document provides an introduction and overview of MongoDB. It begins with definitions of NoSQL databases and describes the main types: key-value stores, wide column stores, document stores, and graph stores. It then discusses MongoDB specifically, describing it as a free, open-source, document-oriented database that uses JSON-like documents with dynamic schemas. The document outlines how to quickly install MongoDB using Docker, and how to perform basic CRUD operations like creating databases and collections, inserting, reading, updating, and deleting documents. It also discusses some key MongoDB concepts like its support for the CAP theorem prioritizing availability and partition tolerance over strong consistency.
The document discusses database hardware requirements like RAM, disk space, processors and networks and how they impact database performance. It also covers topics like transaction logging, how databases and their related files are structured, and the different SQL data types and statements used to work with databases. Various SQL objects like tables, views, indexes and their creation are explained along with examples.
This document provides an overview of NoSQL databases. It discusses that NoSQL databases offer more flexibility, higher performance, scalability, and choices compared to relational databases. The four main types of NoSQL databases are column family stores, key-value stores, document stores, and graph stores. Each has their own advantages and disadvantages for storing and querying data.
Domain Driven Design is a software development process that focuses on finding a common language for the involved parties. This language and the resulting models are taken from the domain rather than the technical details of the implementation. The goal is to improve the communication between customers, developers and all other involved groups. Even if Eric Evan's book about this topic was written almost ten years ago, this topic remains important because a lot of projects fail for communication reasons.
Relational databases have their own language and influence the design of software into a direction further away from the Domain: Entities have to be created for the sole purpose of adhering to best practices of relational database. Two kinds of NoSQL databases are changing that: Document stores and graph databases. In a document store you can model a "contains" relation in a more natural way and thereby express if this entity can exist outside of its surrounding entity. A graph database allows you to model relationships between entities in a straight forward way that can be expressed in the language of the domain.
In this talk I want to look at the way a multi model database that combines a document store and a graph database can help you to model your problems in a way that is understandable for all parties involved, and explain the benefits of this approach for the software development process.
MongoDB is a document-oriented NoSQL database that uses JSON-like documents with optional schemas. It provides high performance, high availability, and easy scalability. MongoDB is also called "humongous" because it is designed to store and handle large volumes of data. Some key advantages of MongoDB include its ability to handle large, unstructured data sets and provide agile development with quick code iterations.
This document provides an introduction and overview of NOSQL databases. It defines NOSQL as "not only SQL" databases that are an alternative to traditional relational databases. The key advantages of NOSQL databases are that they can handle huge datasets, scale easily, and provide fast and flexible querying. The main types of NOSQL databases are described as key-value stores, document databases, graph databases, and column-oriented databases. Examples of popular NOSQL databases and real-world uses by companies are also provided.
Columnar databases store data by columns rather than rows. This column-oriented approach keeps all attribute information together, improving query performance for analytics workloads that retrieve subsets of columns. However, it increases overhead for write operations like inserts due to needing to modify all columns for each row. Columnar databases are well-suited for analytical workloads with many reads and few writes, like data warehousing.
OAuth and OpenID Connect are authorization frameworks that enable third party applications (API clients) to obtain limited access to RESTful APIs on behalf of resource owners. OAuth allows API clients to obtain authorization grants, which can be exchanged for access tokens to make requests to the API. OpenID Connect is used by API clients to obtain information about the authentication of the resource owner performed by the authorization server in an ID token.
The document provides an overview of SAML (Security Assertion Markup Language), including its main components and use cases. It discusses SAML assertions, which contain statements to describe authentication, attributes, and authorization information. SAML defines request/response protocols, bindings to transport messages over protocols like HTTP, and profiles that combine assertions, protocols and bindings to provide interoperability for specific use cases. A key use case is web single sign-on, where the SAML web browser SSO profile defines how assertions, messages and bindings are used to enable SSO between an identity provider and service provider.
Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It allows applications to publish and subscribe to streams of records, and processes large amounts of continuous data easily and reliably. Producers write data to topics which are divided into partitions. Consumers can join a consumer group to read from topics and process the data in parallel. Records are stored on disk for a configurable period to allow consumption from past records.
The document provides an overview of containers and Kubernetes. It discusses the need for containers due to microservices and infrastructure as code. It then covers technical details of containers like Dockerfiles, images, and registries. It also discusses Kubernetes and its components like kube-apiserver, etcd, and kubelet. Finally, it covers Kubernetes concepts like pods, services, deployments, and how they are configured.
ZooKeeper is a distributed coordination service that allows distributed applications to synchronize data and configuration. It provides a simple API for applications to read, write, and watch a shared hierarchical data structure called a znode tree that is replicated across servers. ZooKeeper addresses the need for distributed applications like Hadoop and Kafka to coordinate tasks and share configuration through a common data store that remains available even if individual servers fail.
- Leo's notes summarize Oracle Database components including metadata, control files, user data, database, Oracle instance, background processes, online redo logs, archive logs, and data files.
- The notes also cover Oracle Database configuration including Oracle homes, Oracle base, data file locations, redo log groups, and archive log destinations.
- Key processes like the log writer process and database writer process are described as well as their roles in writing redo logs and data to disk.
Application Continuity with Oracle DB 12c Léopold Gault
Application Continuity is a feature of Oracle database 12c, when used through the JDBC replay driver (by java applications). You can benefit from this features when using a RAC or Data Guard.Those are my personal notes on the subject. Views expressed here are my own, and do not necessarily reflect the views of Oracle.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
How to Get CNIC Information System with Paksim Ga.pptx
NoSQL - Leo's notes
1. NoSQL
Leo’s notes
These slides are Leopold Gault's notes, taken while reading:
• https://www.thoughtworks.com/insights/blog/nosql-databases-overview
• https://www.slideshare.net/arangodb/query-mechanisms-for-nosql-databases
• https://www.slideshare.net/arangodb/introduction-to-column-oriented-databases
• https://neo4j.com/developer/guide-data-modeling/
I am not a NoSQL expert; these notes are just my understanding of the sources above.
4. Relational data models (OLTP and OLAP) vs NoSQL data models
[Diagram comparing the relational data models (transactional/OLTP and analytical/OLAP) with the NoSQL data models.]
Note that they represent a document as a hierarchical tree of data (it makes sense).
I think they meant to represent a star schema.
5. Which data models do I think are meant to be normalized?
[Diagram: on the relational side, the transactional (OLTP) models are normalized and the analytical models are deliberately de-normalized; on the NoSQL side, the data models are not normalized (one is marked "Normalized?").]
6. Which do I think natively support ACID transactions?
[Diagram: the relational data models (transactional/OLTP and analytical) always support ACID transactions; among the NoSQL data models, graph databases do most of the time (e.g. Neo4j), while the others maybe sometimes do.]
8. Why aggregates
Let’s say that my application always uses a set of data like this one. In an RDBMS, such a set of data would have to be fetched from many tables (requiring plenty of JOINs).
9. Why aggregates
We can see that there is a big mismatch between the way the data is aggregated by this application and the way the data was scattered across the tables of the RDBMS.
10. Aggregate-oriented DBMS
NoSQL DBMS (except graph DBMS) are aggregate-oriented.
An aggregate is a set of data that forms the boundaries for ACID operations.
Hence, the “acidity scope” is not at the transaction level, but at the aggregate level. Note however that some aggregate-oriented DBMS also support ACID transactions.
An aggregate’s data have been grouped together only because it makes sense to do so from the application’s point of view. This grouping is masterminded by a human:
• the developer: when coding an app, the developer will try to identify which sets of data will be accessed together by the app, and will hence decide to write/read each such set of data as an aggregate;
• or the creator of materialized views, i.e. new aggregates emitted from disparate data.
11. Why aggregate-oriented DBMS
Working with aggregates is more performant. Indeed, an aggregate is stored together, instead of being scattered among many tables. The same applies when reading: it is quicker to retrieve a set of data that has been stored together than one that has been scattered throughout many tables.
In a cluster of an aggregate-oriented DBMS, an aggregate can live on a single node (or be replicated on the same few nodes). Thus the cluster can scale out without increasing the response time, as sets of data frequently accessed together (i.e. aggregates) are not cut into pieces scattered across many nodes. The same logic applies to sharding (an aggregate belongs to a single shard, instead of many) and replication.
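To make the mismatch concrete, here is a small Python sketch (the order/customer schema and field names are invented for illustration): the same set of data stored as one aggregate, next to a simplified relational layout that has to be re-joined at read time.

```python
# A hypothetical "order" aggregate, stored as one unit (document-store style).
order_aggregate = {
    "id": 99,
    "customer": {"id": 1, "name": "Ada"},
    "items": [
        {"product": "widget", "qty": 2, "price": 9.50},
        {"product": "gadget", "qty": 1, "price": 30.00},
    ],
    "shipping_address": {"city": "Paris", "zip": "75001"},
}

# The same data in (simplified) relational form: scattered across tables,
# each row keyed back to the order or customer it belongs to.
customers = {1: {"name": "Ada"}}
orders = {99: {"customer_id": 1, "city": "Paris", "zip": "75001"}}
order_items = [
    {"order_id": 99, "product": "widget", "qty": 2, "price": 9.50},
    {"order_id": 99, "product": "gadget", "qty": 1, "price": 30.00},
]

def fetch_order_relational(order_id):
    """Reassemble the aggregate from the relational tables (the 'JOINs')."""
    row = orders[order_id]
    cust_id = row["customer_id"]
    return {
        "id": order_id,
        "customer": {"id": cust_id, **customers[cust_id]},
        "items": [i for i in order_items if i["order_id"] == order_id],
        "shipping_address": {"city": row["city"], "zip": row["zip"]},
    }
```

The aggregate version is retrieved with a single lookup by key; the relational version needs one lookup per table plus a scan/join over the items.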
12. About aggregates
Here are 2 formal definitions:
• An aggregate is a collection of data that we interact with as a unit. These units of data (aggregates) form the boundaries for ACID operations (at the aggregate level) with the database. [source1]
• An aggregate defines a collection of related objects that we treat as a unit. This unit is taken as a whole for the context of data manipulation and management of consistency. We update aggregates via atomic operations and communicate with our data storage in terms of aggregates. NoSQL databases, apart from graph databases, have aggregate data models.
However, relational databases have no concept of aggregates within their data model. They are considered aggregate-ignorant.
An aggregate-ignorant model allows you to look at data in different ways, so it’s good when you don’t have a primary structure for manipulating data. Aggregate-ignorant databases, like relational and graph databases, in general support ACID transactions. [source2]
13. Who do I think is aggregate oriented?
Transactional (OLTP)
Yes (1 aggregate = 1 column /
segment of column)
Yes (1 aggregate can be a whole document
(identified by its key),
or a materialized view generated using map-
reduce)
Yes
(1 aggregate = 1 value,
i.e. a BLOB that bundles together
a bunch of data, this bunch is
meaningful only for the app)
No
No
No
Aggregate ignorant
Aggregate oriented data models
Maybe also a column family ?
But I don’t think so
I think the reason why Graph DBs are not “aggregate oriented” is that, despite storing
data as interconnected nodes, a node is probably not considered an aggregate; probably
because the boundaries of an ACID operation extend beyond one node.
15. Key-value DBMS
(diagram: a table of key -> value pairs)
Values are just BLOBs; they have no meaning for the DBMS. The K-V DBMS doesn’t care
what’s inside this BLOB value; it’s up to the app to figure that out.
16. Key-value DBMS
(diagram: key -> value pairs, accessed through an API)
How to query: with a very simple API:
• get the value for a key,
• put a value for a key,
• delete a key-value pair.
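A minimal in-memory sketch of that three-operation API (my own illustration; real stores such as Redis or Riak add persistence and distribution, but the query surface is essentially this small):

```javascript
// Minimal sketch of the key-value API: get / put / delete.
// The value is an opaque BLOB as far as the store is concerned.
class KVStore {
  constructor() { this.data = new Map(); }
  put(key, value) { this.data.set(key, value); }  // put a value for a key
  get(key) { return this.data.get(key); }         // get the value for a key
  del(key) { this.data.delete(key); }             // delete a key-value pair
}

const kv = new KVStore();
kv.put("user:1", "opaque blob the app will decode");
```

Note that there is no query on the value itself: the app must fetch the BLOB by key and decode it.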
18. Documents DBMS
(diagram: key = DocumentID -> value = Document)
“Key-value stores where the value is examinable”; indeed, this value is a document.
Depending on the DBMS, the document may be in JSON, XML, BSON, etc.
19. Documents DBMS
(diagram: key -> document pairs)
Example with a JSON document.
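A hypothetical key/JSON-document pair of my own (the field names are invented, but the "topics" attribute is chosen to match the CouchDB example a few slides later):

```javascript
// One key -> document entry of a document store. Unlike a key-value
// BLOB, the document's hierarchical structure is visible to the DBMS.
const key = "doc-1";
const doc = {
  name: "Alice",
  topics: ["music", "skating"],            // nested array
  address: { city: "Paris", zip: "75000" } // nested object
};
```

It is this visible structure that lets some document DBMS query by attributes rather than only by key.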
20. Documents DBMS
(diagram: key -> document pairs, queried through an API; MongoDB)
How to query: with the document key, or (for some DBMS, like MongoDB) with attributes within documents.
Actually, with MongoDB, it wouldn’t be a JSON doc, but a
BSON one. So it’d look like this:
\x31\x00\x00\x00
\x04BSON\x00
\x26\x00\x00\x00
\x02\x30\x00\x08\x00\x00\x00awesome\x00
\x01\x31\x00\x33\x33\x33\x33\x33\x33\x14\x40
\x10\x32\x00\xc2\x07\x00\x00
\x00
\x00
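To make "query by attribute" concrete, here is a sketch of my own that filters an in-memory array of documents (in MongoDB itself the equivalent is roughly `db.docs.find({ topics: "music" })`; the documents below are invented):

```javascript
// Querying documents by an attribute inside the document,
// not by the document's key.
const docs = [
  { _id: "d1", topics: ["music", "skating"] },
  { _id: "d2", topics: ["sleeping"] },
  { _id: "d3", topics: ["music"] }
];

function findByTopic(topic) {
  return docs.filter(d => d.topics.includes(topic)).map(d => d._id);
}
```

This is only possible because the store can examine the value; a pure key-value store would see the documents as opaque BLOBs.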
21. Documents DBMS
(diagram: key -> document pairs, queried through an API; CouchDB)
How to query: for some other DBMS (e.g. CouchDB), querying docs by anything other than their ID requires
creating a materialized view, populated with JavaScript map-reduce code (for instance).
This function will parse all the documents in the store, and emit the
docID of docs where there is a match (where one of the topics is “music”).
The load of running a map function can be distributed between nodes.
I think that this map function should be followed by a reduce
function that simply returns what it has been fed as parameters, e.g.:
nonReduce = function (keys, values, rereduce) {
  if (rereduce) {
    // never run for this pass-through reduce
  } else {
    // return the emitted data unchanged
    return values;
  }
};
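A sketch of the map function described above, assuming CouchDB's usual map-function shape (`emit()` is normally supplied by CouchDB's view engine, so it is stubbed here to run the function standalone; the sample documents are invented):

```javascript
// Stub of CouchDB's emit(): collects (key, value) pairs locally.
const emitted = [];
function emit(key, value) { emitted.push([key, value]); }

// Map function: emit the docID of every document whose topics
// contain "music".
function mapByMusicTopic(doc) {
  if (doc.topics && doc.topics.indexOf("music") !== -1) {
    emit(doc._id, 1);
  }
}

// The view engine would feed every document in the store through it:
[
  { _id: "d1", topics: ["music", "skating"] },
  { _id: "d2", topics: ["sleeping"] }
].forEach(mapByMusicTopic);
```

Each node can run the map function over its own documents, which is why the load distributes naturally.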
22. Documents DBMS
(diagram: key -> document pairs, queried through an API; CouchDB)
Example with map and reduce.
I think the grouped input is a mapping from each emitted key to an array of ‘1’s, e.g.:
values = { 'skating': [1, 1], 'music': [1], 'sleeping': [1, 1, 1, 1] };
Here ‘skating’, ‘music’ and ‘sleeping’ are keys, and each ‘1’ is a value.
The reduce function then returns the length() of each nested array.
A boolean parameter says whether or not a re-reduce is needed.
24. Columnar DBMS vs RDBMS
How you use them
Columnar DBMS
• Data is stored in columns.
• You specify column families (kinds of entities), composed of rows
that feature only some of the columns (among all the columns
declared in the column family).
RDBMS
• Data is stored in tables; each row contains data for all columns (although
a value can be NULL).
(diagram: “Column family A” with rows 1-4 holding only some of Col 1-3,
vs “Table A” where rows 1-4 each have all of Col 1-3)
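The sparseness of a column family can be sketched like this (my own illustration; the row and column names are invented):

```javascript
// A column family: each row carries only the columns it needs,
// unlike an RDBMS table where every row has a slot for every column.
const columnFamilyA = {
  row1: { col1: "a", col3: "c" },               // col2 absent, not NULL
  row2: { col2: "b" },
  row3: { col1: "x", col2: "y", col3: "z" }
};

function columnsOf(row) {
  return Object.keys(columnFamilyA[row]);
}
```

Absent columns simply do not exist for that row, which is cheaper than storing explicit NULLs.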
25. Why columnar DBMS?
The benefits of column-oriented DBMS reside only in the way they
store data on disk: they store data by column instead of by row.
This makes such DBMS more performant when you query only a few
columns, but read/write many values within those few columns.
It also makes it possible to store the columns in a compressed state;
only the columns being queried will be decompressed (on the fly).
Such DBMS are meant for analytics or batch-processing use cases (and
are not performant at all for OLTP).
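The trade-off can be sketched with the same records laid out both ways (my own illustration with invented data): an analytics query like "sum the price column" touches a single column datafile in the columnar layout, but must read every whole row in the row layout.

```javascript
// Row-oriented layout: one record per row, all columns together.
const rows = [
  { id: 1, price: 10, label: "a" },
  { id: 2, price: 20, label: "b" },
  { id: 3, price: 30, label: "c" }
];

// Column-oriented layout: one array ("datafile") per column.
const columns = {
  id:    [1, 2, 3],
  price: [10, 20, 30],
  label: ["a", "b", "c"]
};

// SUM(price):
// columnar: read only the price datafile
const sumColumnar = columns.price.reduce((s, v) => s + v, 0);
// row store: read each whole row just to extract price
const sumRowStore = rows.reduce((s, r) => s + r.price, 0);
```

Both sums are of course equal; the difference is in how much data had to be read to compute them.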
26. Column-oriented storage vs row-oriented storage
Column-oriented storage (columnar DBMS’ strategy)
• Each column is stored in its own datafile
source
(diagram: datafile0, datafile1)
a. Adding/deleting a column is relatively cheap in I/O: it only requires
working on a single small datafile.
b. Columns are stored compressed on the disk. Only the columns you
query will be decompressed (on the fly).
Row-oriented storage (RDBMS’ strategy)
a. Adding/deleting a column might require rewriting the whole table...
b. You can’t compress rows, because a whole row has to be decompressed
in order to be understandable (just like in column-oriented storage the whole column has to be
decompressed, or at least a whole subset of a column, i.e. a “segment”?). This means the whole
table would have to be decompressed in order to be queried (I don’t think you
could decompress only a subset of the table, because it is hard to think of a meaningful way the table could have
been chunked. Maybe you could compress all the values except the ID, and chunk the table based on the ID;
but that would only be useful for JOINs based on a foreign key.). A decompressed table is often too
big to fit in memory, so you’d have to swap part of it to disk (which is
slow) just to be able to query it.
27. Column-oriented storage vs row-oriented storage
When not to use them
Column-oriented storage (columnar DBMS’ strategy)
source
• If you only want to work on a few rows (as is often the case in
OLTP), it won’t be performant at all: you’ll have to read and
decompress all the columns (or at least their relevant subsets), and
then recompress and rewrite them.
Row-oriented storage (RDBMS’ strategy)
• If you only need to work on a few columns, but the table has many
columns, and you want to read/write many things in those few
columns, you’ll have to read each whole row just to get the few column
values that interest you.
(diagram: Col 1, Col 2, Col 3; “You just want to modify a row”)
FYI: memory page: the smallest unit of data for virtual-memory management. The OS moves blocks
of this size between the HD and the RAM through I/O channels, and vice-versa. As it is the smallest unit, a page is
read from disk as a whole, including unused space.
29. How to deal with many relationships
RDBMS
• You would use JOINs to compute relationships at query
time. On top of being less intuitive, the performance of
the JOINs degrades sharply as the size of the tables
being joined grows.
Graph DBMS
• The relationships are natively stored, so no relationship
has to be computed at query time.
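A sketch of what "natively stored relationships" means (my own illustration; the people and book are invented, the "HAS_READ" type is taken from the slide that follows): each node holds direct references to its relationships, so traversal is pointer-chasing rather than a join computed at query time.

```javascript
// Nodes hold their relationships directly.
const alice = { name: "Alice", hasRead: [] };
const bob   = { name: "Bob",   hasRead: [] };
const book  = { title: "NoSQL Distilled" };

// A relationship has a direction (from the owning node to `to`),
// a type, and its own properties (here, a rating).
alice.hasRead.push({ type: "HAS_READ", to: book, rating: 5 });
bob.hasRead.push({ type: "HAS_READ", to: book, rating: 4 });

// Traversal: follow the stored references, no join needed.
function booksReadBy(person) {
  return person.hasRead.map(rel => rel.to.title);
}
```

In an RDBMS the same query would join a people table, a books table and a people_books junction table.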
30. Labelled Property Graph Model (e.g. implemented by Neo4J)
A graph in such a model is composed of:
• Nodes
• Relationships (between two nodes)
31. Labelled Property Graph Model (e.g. implemented by Neo4J)
About Nodes
A node can contain:
• Properties: multiple key-value pairs
• Labels: tags representing the roles of the node in the data domain. They are used to group
nodes into sets. Labels may also serve to attach metadata (index or constraint information)
to certain nodes.
(diagram: nodes + a label = labelled nodes; “Person” and “Book” are labels,
and the names shown on the nodes are properties)
32. Labelled Property Graph Model (e.g. implemented by Neo4J)
About Relationships
A relationship always has:
• a direction: a start node, and an end node
• a type (i.e. a name)
• Properties: multiple key-value pairs
(diagram: a relationship of type “HAS_READ”, carrying its own properties,
between two nodes that each carry properties)
Editor's Notes
I think an aggregate is stored as:
• a BLOB value (associated with a key), in a key-value DBMS
• a document, in a documents DBMS
• a column, in a columnar DBMS