NoSQL databases are now used in many application scenarios where relational databases once dominated. Several types of NoSQL databases exist. In this presentation we compare key-value, column-oriented, document-oriented, and graph databases, and we use a simple case study to evaluate the pros and cons of each.
This presentation introduces NoSQL databases: their types with examples, how they differ from the 40-year-old relational database management system, where they are used, and why we should use them.
This presentation is about NoSQL, which stands for "Not Only SQL". It covers the use of NoSQL for Big Data and the differences from RDBMS.
In this lecture we analyze key-value databases. First we introduce key-value characteristics, advantages, and disadvantages.
Then we survey the major key-value data stores and finally discuss DynamoDB.
In particular, we consider how DynamoDB is implemented:
1. Motivation Background
2. Partitioning: Consistent Hashing
3. High Availability for writes: Vector Clocks
4. Handling temporary failures: Sloppy Quorum
5. Recovering from failures: Merkle Trees
6. Membership and failure detection: Gossip Protocol
This presentation is about the differences between SQL and NoSQL databases, since the question of which parameters distinguish the two comes to almost everyone's mind.
After viewing this presentation, your doubts and confusion about SQL versus NoSQL should be cleared up.
This presentation explains the major differences between SQL and NoSQL databases in terms of scalability, flexibility, and performance. It also talks about MongoDB, a document-based NoSQL database, and explains the database structure for my mouse-human research classifier project.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... (Simplilearn)
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn's Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
MongoDB is a popular NoSQL database. This presentation was delivered during a workshop.
First it talks about NoSQL databases and the shift in their design paradigm, focuses a little more on document-based NoSQL databases, and draws some parallels with SQL databases.
The second part is a hands-on session with MongoDB using the mongo shell, though the slides are of limited help on their own.
Finally, it touches on advanced topics such as data replication for disaster recovery, handling big data with map-reduce, and sharding.
NoSQL stands for "not only SQL."
NoSQL databases are databases that store data in a format other than relational tables.
NoSQL databases or non-relational databases don’t store relationship data well.
Here is my seminar presentation on NoSQL databases. It covers the types of NoSQL databases, their merits and demerits, examples of NoSQL databases, etc.
For seminar report of NoSQL Databases please contact me: ndc@live.in
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Oracle GoldenGate is the leading real-time data integration software provider in the industry - customers include 3 of the top 5 commercial banks, 3 of the top 3 busiest ATM networks, and 4 of the top 5 telecommunications providers.
Oracle GoldenGate moves transactional data in real time across heterogeneous databases, hardware, and operating systems with minimal impact. The software platform captures, routes, and delivers data in real time, enabling organizations to maintain continuous uptime for critical applications during planned and unplanned outages.
Additionally, it moves data from transaction processing environments to read-only reporting databases and analytical applications for accurate, timely reporting and improved business intelligence for the enterprise.
NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification for the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, the strengths and the drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database (Edureka!)
NoSQL covers a wide range of database technologies that were developed in response to the surging volume of stored data. Relational databases cannot cope with this huge volume and face agility challenges. This is where NoSQL databases come into play, and they are popular because of their features. The session covers the following topics to help you choose the right NoSQL database:
Traditional databases
Challenges with traditional databases
CAP Theorem
NoSQL to the rescue
A BASE system
Choose the right NoSQL database
MySQL 8.0 is the latest Generally Available version of MySQL. This session will give a brief introduction to MySQL 8.0 and help you upgrade from older versions, understand what utilities are available to make the process smoother and also understand what you need to bear in mind with the new version and considerations for possible behaviour changes and solutions. It really is a simple process.
Making Hadoop Realtime by Dr. William Bain of Scaleout Software (Data Con LA)
Hadoop has been widely embraced for its ability to economically store and analyze large data sets. Using parallel computing techniques like MapReduce, Hadoop can reduce long computation times to hours or minutes. This works well for mining large volumes of historical data stored on disk, but it is not suitable for gaining real-time insights from live operational data. Still, the idea of using Hadoop for real-time data analytics on live data is appealing because it leverages existing programming skills and infrastructure – and the parallel architecture of Hadoop itself. This presentation will describe how real-time analytics using Hadoop can be performed by combining an in-memory data grid (IMDG) with an integrated, stand-alone Hadoop MapReduce execution engine. This new technology delivers fast results for live data and also accelerates the analysis of large, static data sets.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon... (StampedeCon)
From the StampedeCon 2015 Big Data Conference: There is an adage, "If you fail to plan, you plan to fail." When developing systems, the adage can be taken a step further: "If you fail to plan FOR FAILURE, you plan to fail." At Huffington Post, data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some of the material covers understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures; other parts focus on how systems and software can be designed to make re-processing batch data simple, and on determining which failure-mode semantics matter for a real-time event processing system.
Build an Open Source Data Lake For Data Scientists (Shawn Zhu)
This is a talk I presented at the 2019 ICSA (International Chinese Statistical Association) Applied Statistics Symposium, in the session "How Data Science Drives Success in Enterprises".
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018) - Kyle Davis
Let's explore how Redis (and Redis Enterprise) can be used to store data in not only deterministic structures but also probabilistic structures like Bloom filters, HyperLogLog, Count Min Sketch and Cuckoo filters. We examine both usage and briefly summarize the algorithms that back these structures. Also we review the use-cases and applications for probabilistic structures.
Video in French at https://www.youtube.com/watch?v=9LNnNh63rBI
Sizing an Elasticsearch cluster has to consider many dimensions. In this presentation we go through the different elements and features you should consider to handle big and varying loads of log data.
MySQL Day Paris 2018 - MySQL JSON Document Store (Olivier DASINI)
NoSQL + SQL = MySQL
MySQL Document Store allows developers to work with SQL relational tables and schema-less JSON collections. To make that possible, MySQL has created the X DevAPI, which puts a strong focus on CRUD by providing a fluent API that lets you work with JSON documents in a natural way. The X Protocol is highly extensible and is optimized for CRUD as well as SQL API operations.
MySQL Document store gives users maximum flexibility developing traditional SQL relational applications and NoSQL schema-free document database applications. This eliminates the need for a separate NoSQL document database. Developers can mix and match relational data and JSON documents in the same database as well as the same application. For example, both data models can be queried in the same application and results can be in table, tabular or JSON formats.
The MySQL Document Store architecture consists of the following components:
Native JSON Document Storage - MySQL provides a native JSON datatype that is efficiently stored in binary format, with the ability to create virtual columns that can be indexed. JSON documents are automatically validated.
X Plugin - The X Plugin enables MySQL to use the X Protocol; the Connectors and the Shell act as clients to the server.
X Protocol - The X Protocol is a new client protocol based on top of the Protobuf library, and works for both, CRUD and SQL operations.
X DevAPI - The X DevAPI is a new, modern, async developer API for CRUD and SQL operations on top of X Protocol. It introduces Collections as new Schema objects. Documents are stored in Collections and have their dedicated CRUD operation set.
MySQL Shell - The MySQL Shell is an interactive Javascript, Python, or SQL interface supporting development and administration for the MySQL Server. You can use the MySQL Shell to perform data queries and updates as well as various administration operations.
MySQL Connectors - The following MySQL Connectors support the X Protocol and enable you to use X DevAPI in your chosen language.
MySQL Connector/Node.js
MySQL Connector/PHP
MySQL Connector/Python
MySQL Connector/J
MySQL Connector/NET
MySQL Connector/C++
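To give a rough feel for how these pieces fit together from a client's perspective, here is a sketch in Python using the mysqlx module from MySQL Connector/Python. The connection parameters, schema, and collection names are illustrative, and the exact API surface varies somewhat by connector version.

# Sketch of X DevAPI CRUD on a JSON collection via MySQL Connector/Python.
# Host, credentials, and all names below are illustrative assumptions.
import mysqlx

session = mysqlx.get_session(
    {"host": "localhost", "port": 33060, "user": "root", "password": "secret"})
schema = session.get_schema("test")

# Collections are schema objects that hold JSON documents.
people = schema.create_collection("people")
people.add({"Name": "Jaroslav", "City": "Praha"}).execute()

result = people.find("Name = :n").bind("n", "Jaroslav").execute()
for doc in result.fetch_all():
    print(doc["City"])

# The same session can also run plain SQL against relational tables,
# which is the "mix and match" point made above.
session.sql("SELECT 1").execute()
session.close()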
Agile Data Engineering: Introduction to Data Vault 2.0 (2018) - Kent Graziano
(updated slides used for North Texas DAMA meetup Oct 2018) As we move more and more towards the need for everyone to do Agile Data Warehousing, we need a data modeling method that can be agile with us. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for over 15 years and is now growing in popularity. The purpose of this presentation is to provide attendees with an introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics:
• What the basic components of a DV model are
• How to build, and design structures incrementally, without constant refactoring
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019 (Dave Stokes)
How well are you taking care of your database? Well, if your paycheck depends on your database, you will want to make sure that you are not making these mistakes.
What is Big Data
Sources of Big Data
What can be done with Big Data?
Handling Big Data
MapReduce
What is Hadoop?
Why Hadoop is Useful?
Other big data use cases
These slides give an overview of the different parts of Apache Spark.
We analyze the Spark shell in both Scala and Python. Then we consider Spark SQL, with an introduction to the DataFrame API. Finally we describe Spark Streaming and give some code examples.
Topics: spark-shell, pyspark, HDFS, how to copy files to HDFS, Spark transformations, Spark actions, Spark SQL (Shark), Spark Streaming, stateless vs. stateful streaming transformations, sliding windows, examples.
In these slides we analyze why aggregate data models change the way data is stored and manipulated. We introduce MapReduce and its open source implementation, Hadoop, and consider how MapReduce jobs are written and executed by Hadoop.
Finally we introduce Spark using a Docker image and show how to use anonymous functions in Spark (a short sketch follows the topic list below).
The topics of the next slides will be
- Spark Shell (Scala, Python)
- Shark Shell
- Data Frames
- Spark Streaming
- Code Examples: Data Processing and Machine Learning
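As a minimal sketch of anonymous functions in Spark, here is a PySpark snippet assuming a local installation; the data and names (twits, users) are illustrative.

# Sketch: anonymous functions (lambdas) with Spark's RDD API on a local cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lambda-demo").getOrCreate()
sc = spark.sparkContext

twits = sc.parallelize([("pippo", "hello"), ("martin", "hi"), ("pippo", "bye")])

# map and reduceByKey take anonymous functions; here we count twits per user.
counts = (twits
          .map(lambda pair: (pair[0], 1))       # (user, 1)
          .reduceByKey(lambda a, b: a + b))     # sum per user
print(counts.collect())                         # e.g. [('pippo', 2), ('martin', 1)]

spark.stop()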
In this lecture we analyze graph-oriented databases. In particular, we consider TitanDB as the graph database, analyzing how to query it with Gremlin and how to create edges and vertices.
Finally, we present how to use Rexster to visualize the stored graph.
In this lecture we analyze document-oriented databases. In particular, we consider why they were among the first approaches to NoSQL and what their main features are. Then we analyze MongoDB as an example: the data model, CRUD operations, write concerns, and scaling (replication and sharding).
Finally we present other document-oriented databases and discuss when to use, and when not to use, a document-oriented database.
In these slides we introduce column-oriented stores and deeply analyze Google BigTable: its features, data model, architecture, components, and implementation. In the second part we discuss the major open source implementations of column-oriented databases.
Information technology has led us into an era where producing, sharing, and using information is part of everyday life, often without our being aware of it: it is now almost impossible not to leave a digital trail of many of our daily actions, for example through digital content such as photos, videos, and blog posts, and through everything that revolves around social networks (Facebook and Twitter in particular). On top of this, with the "Internet of Things" we see a growing number of devices, such as watches, bracelets, and thermostats, that connect to the network and generate large data streams. This explosion of data justifies the term Big Data: data produced in large volumes, at remarkable speed, and in varied formats, which requires processing technologies and resources that go far beyond conventional data management and storage systems. It quickly becomes clear that (1) storage models based on the relational model and (2) processing systems based on stored procedures and grid computing are not applicable in these contexts. Regarding point (1), RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantage: very often, when facing big data, variability, that is, the lack of a fixed structure, is also a significant problem. This has given a boost to the development of NoSQL databases. The website NoSQL Databases defines them as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are distributed, open source, horizontally scalable, free of a predetermined schema (key-value, column-oriented, document-based, and graph-based), easily replicable, relax the ACID guarantees, and can handle large amounts of data. They are typically integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model taught in basic database design courses has many limitations compared to the demands posed by new applications, which use NoSQL databases to store Big Data and MapReduce to process it.
Course Website http://pbdmng.datatoknowledge.it/
Contact me for other information and to download the slides.
1. Introduction to the Course "Designing Data Bases with Advanced Data Models... (Fabio Fumarola)
2. Scaling Up Databases
"A question I'm often asked about Heroku is: 'How do you scale the SQL database?' There's a lot of things I can say about using caching, sharding, and other techniques to take load off the database. But the actual answer is: we don't. SQL databases are fundamentally non-scalable, and there is no magical pixie dust that we, or anyone, can sprinkle on them to suddenly make them scale."
– Adam Wiggins, Heroku
Quoted in Patterson, David; Fox, Armando (2012). Engineering Long-Lasting Software: An Agile Approach Using SaaS and Cloud Computing, Alpha Edition (Kindle Locations 1285-1288). Strawberry Canyon LLC.
3. Data Management Systems: History
• In the last decades, RDBMSs have been successful in solving problems related to storing, serving, and processing data.
• RDBMSs are adopted for:
– Online transaction processing (OLTP),
– Online analytical processing (OLAP).
• Vendors such as Oracle, Vertica, Teradata, Microsoft, and IBM proposed solutions based on relational math and SQL.
But….
4. Something Changed!
• Traditionally there were transaction recording (OLTP) and analytics (OLAP) of the recorded data.
• Not much was done to understand:
– the reasons behind transactions,
– what factors contributed to business, and
– what factors could drive customer behavior.
• Pursuing such initiatives requires working with large amounts of varied data.
5. Something Changed!
• This approach was pioneered by Google, Amazon, Yahoo, Facebook, and LinkedIn.
• They work with different types of data, often semi-structured or unstructured.
• And they have to store, serve, and process huge amounts of data.
6. Something Changed!
• RDBMSs can somehow deal with these aspects, but they have issues related to:
– expensive licensing,
– requiring complex application logic,
– dealing with evolving data models.
• There was a need for systems that could:
– work with different kinds of data formats,
– not require a strict schema, and
– scale easily.
7. Evolutions in Data Management
• As part of the innovation in data management systems, several new technologies were built:
– 2003: Google File System,
– 2004: MapReduce,
– 2006: BigTable,
– 2007: Amazon Dynamo,
– 2012: Google Compute Engine.
• Each solved different use cases and had a different set of assumptions.
• All of these mark the beginning of a different way of thinking about data management.
9. Big Data: Try { Definition }
Big Data means the data is large enough that you have to think about it in order to gain insights from it.
Or: Big Data is when the data stops fitting on a single machine.
"Big Data is a fundamentally different way of thinking about data and how it's used to drive business value."
11. NoSQL
• In 2006 Google published the BigTable paper.
• In 2007 Amazon presented Dynamo.
• It didn't take long for these ideas to be used in:
– several open source projects (HBase, Cassandra), and
– other companies (Facebook, Twitter, …).
• And now? nosql-database.org lists more than 150 NoSQL databases.
12. NoSQL related facts
• Explosion of social media sites (Facebook, Twitter) with large data needs.
• Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service).
• The move to dynamically-typed languages (Ruby/Groovy) brought a shift to dynamically-typed data with frequent schema changes.
• Functional programming (Scala, Clojure, Erlang).
13. NoSQL Categorization
• Key Value Store / Tuple Store
• Column-Oriented Store
• Document Store
• Graph Databases
• Multimodel Databases
• Object Databases
• Unresolved and Uncategorized
15. TwitBase: Data Model
Entities:
• User(user: String, name: String, email: String, password: String, twitsCount: Int)
• Twit(user: String, datetime: DateTime, text: String)
• Relation(from: String, relation: String, to: String)
Design steps:
1. Primary key definition
2. Data shape and access-pattern definition
3. Logical model definition (physical model)
16. TwitBase: Actions
• Users:
– add a new user,
– retrieve a specific user,
– list all the users.
• Twits:
– post a new twit on a user's behalf,
– list all the twits for a specified user.
17. TwitBase: Actions
• Relations:
– add a new relationship where from follows to,
– list everyone a user follows,
– list everyone who follows a user,
– count a user's followers.
• The relations considered are follows and followedBy.
19. Key Value Store
• Extremely simple interface:
– Data model: (key, value) pairs
– Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key)
• Values are stored as a "blob":
– the store neither cares nor knows what is inside;
– the application layer has to understand the data.
• Advantages: efficiency, scalability, fault-tolerance.
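As a minimal illustration of this interface, here is a sketch in Python using the redis-py client; a local Redis server is assumed and the key and values are illustrative.

# Sketch of the generic key-value interface (Insert/Fetch/Update/Delete)
# using redis-py against a local Redis server.
import redis

kv = redis.Redis(host="localhost", port=6379)

kv.set("user:1", b"pippo")         # Insert(key, value)
value = kv.get("user:1")           # Fetch(key) -> b"pippo"
kv.set("user:1", b"pippo basile")  # Update(key) is just another set
kv.delete("user:1")                # Delete(key)

# The store treats values as opaque blobs; interpreting the bytes
# is entirely up to the application layer.
print(value)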
20. Key Value Store: Examples
• Memcached – key-value store,
• Membase – Memcached with persistence and improved consistent hashing,
• Aerospike – fast key-value store for SSD disks,
• Redis – data structure server,
• Riak – based on Amazon's Dynamo (Erlang),
• LevelDB – a fast and lightweight key/value database library by Google,
• DynamoDB – Amazon's key-value database.
21. Memcached & MemBase
• Atomic set/get/delete operations.
• O(1) set/get/delete.
• Consistent hashing.
• In-memory caching; no persistence (in Memcached).
• LRU eviction policy.
• No iterators.
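A quick sketch of the set/get/delete interface using the pymemcache Python client; a Memcached server on the default port is assumed, and the key and value are illustrative.

# Sketch of Memcached's atomic set/get/delete via the pymemcache client.
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

cache.set("session:42", b"logged-in", expire=300)  # O(1) set, with a TTL
print(cache.get("session:42"))                     # O(1) get -> b"logged-in"
cache.delete("session:42")                         # O(1) delete

# There are no iterators: you cannot enumerate keys, only address them directly.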
22. Aerospike
• Key-value database optimized for a hybrid (DRAM + flash) approach.
• First published in the Proceedings of VLDB (Very Large Data Bases) in 2011 as "Citrusleaf: A Real-Time NoSQL DB which Preserves ACID".
23. Redis
• Written in C, BSD-licensed.
• It is an advanced key-value store.
• Values can be strings, hashes, lists, sets, and sorted sets.
• It works with an in-memory dataset.
• Data can be persisted either by dumping the dataset to disk every once in a while, or by appending each command to a log.
• Created by Salvatore Sanfilippo (Pivotal).
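A short sketch of Redis's richer value types using redis-py; a local server is assumed and the keys are illustrative.

# Sketch of Redis value types beyond plain strings, via redis-py.
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("greeting", "hello")                                              # string
r.hset("user:1", mapping={"name": "pippo", "email": "prova@mail.com"})  # hash
r.lpush("timeline:1", "first twit", "second twit")                      # list
r.sadd("followers:1", "martin", "barbara")                              # set
r.zadd("leaderboard", {"pippo": 42, "martin": 17})                      # sorted set

print(r.hgetall("user:1"))             # {b'name': b'pippo', b'email': ...}
print(r.zrange("leaderboard", 0, -1))  # members ordered by score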
24. Riak
• Distributed database written in Erlang & C, with some JavaScript.
• Operations:
– GET /buckets/BUCKET/keys/KEY
– PUT|POST /buckets/BUCKET/keys/KEY
– DELETE /buckets/BUCKET/keys/KEY
• Integrated with Solr and MapReduce.
• Data types: basic, plus Sets and Maps.

curl -XPUT 'http://localhost:8098/riak/food/favorite' \
  -H 'Content-Type: text/plain' \
  -d 'pizza'
25. LevelDB
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
– Keys and values are arbitrary byte arrays.
– Data is stored sorted by key.
– The basic operations are Put(key, value), Get(key), Delete(key).
– Multiple changes can be made in one atomic batch.
Limitation:
– There is no client-server support built into the library.
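To make the embedded-library (rather than client-server) model concrete, here is a sketch using the plyvel Python binding for LevelDB; the database path and keys are illustrative.

# Sketch of LevelDB's embedded, ordered key-value model via the plyvel binding.
import plyvel

db = plyvel.DB("/tmp/twitbase-leveldb", create_if_missing=True)

db.put(b"user:1", b"pippo")  # Put(key, value); keys/values are byte arrays
print(db.get(b"user:1"))     # Get(key) -> b"pippo"

# Multiple changes applied as one atomic batch.
with db.write_batch() as batch:
    batch.put(b"user:2", b"martin")
    batch.delete(b"user:1")

# Data is stored sorted by key, so ordered range iteration is natural.
for key, value in db.iterator(prefix=b"user:"):
    print(key, value)

db.close()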
26. DynamoDB
• Fully managed NoSQL cloud database service.
• Characteristics:
– Low latency (< 5 ms reads, < 10 ms writes), with an SSD backend.
– Massive scale (no table size limit, unlimited storage).
• It runs on SSD disks.
• Cons: 64 KB limit on row size; limited data types (it doesn't accept binary data); 1 MB limit on querying and scanning.
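A minimal sketch of basic DynamoDB operations with the boto3 Python SDK; it assumes AWS credentials are configured and that a "users" table with a "user" partition key already exists (all names are illustrative).

# Sketch of basic DynamoDB item operations via boto3.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("users")  # assumed pre-existing table

table.put_item(Item={"user": "pippo", "name": "pippo basile",
                     "email": "prova@mail.com", "twitsCount": 0})

resp = table.get_item(Key={"user": "pippo"})
print(resp.get("Item"))

table.delete_item(Key={"user": "pippo"})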
28. Redis TwitBase: User
User(user: String, name: String, email: String, password: String)
• We can use the SET, HSET or HMSET operators:

SET users:1 '{user: "pippo", name: "pippo basile", email: "prova@mail.com", password: "xxx", count: 0}'
HSET users:1 user 'pippo'
HSET users:1 name 'pippo basile'
HMSET users:1 user 'pippo' name 'pippo basile'

• Primary key -> users:userId
• Operations:
– add a new user -> SET, HSET or HMSET
– retrieve a specific user -> HGET or HKEYS/HGETALL
– list all the users -> KEYS users:* (what is the cost?)
http://redis.io/commands
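Here is how these user operations might look from Python with redis-py, as a sketch. On the cost question: KEYS is O(N) over the entire keyspace and blocks the server, so the sketch uses the SCAN-based iterator instead.

# Sketch of the TwitBase user operations on Redis hashes, via redis-py.
import redis

r = redis.Redis()

def add_user(user_id, user, name, email, password):
    # One hash per user, keyed as users:<userId>.
    r.hset(f"users:{user_id}",
           mapping={"user": user, "name": name,
                    "email": email, "password": password, "count": 0})

def get_user(user_id):
    return r.hgetall(f"users:{user_id}")

def list_users():
    # KEYS users:* would be O(N) over the whole keyspace;
    # SCAN-based iteration is the non-blocking equivalent.
    return [r.hgetall(key) for key in r.scan_iter(match="users:*")]

add_user(1, "pippo", "pippo basile", "prova@mail.com", "xxx")
print(get_user(1))
print(list_users())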
29. Redis TwitBase: Twit
Twit(user: String, datetime: DateTime, text: String)
• We can use the SET or HSET operators:

SET twit:pippo:1405547914879 '{user: "pippo", datetime: 1405547914879, text: "hello"}'
HSET twit:pippo:1405547914879 user 'pippo'
HSET twit:pippo:1405547914879 datetime 1405547914879
HMSET twit:pippo:1405547914879 user 'pippo' datetime 1405547914879 …

• Primary key -> twit:userId:timestamp (???)
• Operations:
– post a new twit on a user's behalf -> SET, HSET or HMSET
– list all the twits for a specified user -> KEYS
http://redis.io/commands
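A sketch of the twit operations in Python; note that listing by key pattern returns keys in no particular order, so we sort on the timestamp embedded in the key.

# Sketch of TwitBase twit operations using twit:<userId>:<timestamp> keys.
import time
import redis

r = redis.Redis()

def post_twit(user_id, text):
    ts = int(time.time() * 1000)
    r.hset(f"twit:{user_id}:{ts}",
           mapping={"user": user_id, "datetime": ts, "text": text})

def list_twits(user_id):
    # SCAN over the user's key prefix; keys come back unordered,
    # so sort on the timestamp component of each key, newest first.
    keys = sorted(r.scan_iter(match=f"twit:{user_id}:*"),
                  key=lambda k: int(k.rsplit(b":", 1)[1]), reverse=True)
    return [r.hgetall(k) for k in keys]

post_twit("pippo", "hello")
post_twit("pippo", "bye")
print(list_twits("pippo"))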
30. Redis TwitBase: Relation
Relation(from: String, relation: String, to: String)
• We can use the SET, HSET, HMSET operators:

SET follows:1:2 '{from: "pippo", relation: "follows", to: "martin"}'
HMSET followed:2:1 from 'martin' relation 'followedBy' to 'pippo'

• Primary key:
• Operations:
– add a new relation -> SET/HSET/HMSET
– list everyone a user follows -> KEYS
– list everyone who follows a user -> KEYS
– count a user's followers -> any suggestion? What happens if we use a LIST?

LPUSH follows:1 '{from: "pippo", relation: "follows", to: "martin"}' …
http://redis.io/commands
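One natural answer to the follower-count question is Redis sets rather than lists: SADD stores each relation at most once and SCARD counts members in O(1), whereas a LIST admits duplicates and cannot cheaply test membership. A sketch:

# Sketch: modelling follows/followedBy with Redis sets,
# which makes counting followers an O(1) SCARD.
import redis

r = redis.Redis()

def follow(from_id, to_id):
    r.sadd(f"follows:{from_id}", to_id)     # everyone from_id follows
    r.sadd(f"followedBy:{to_id}", from_id)  # everyone who follows to_id

def following(user_id):
    return r.smembers(f"follows:{user_id}")

def followers(user_id):
    return r.smembers(f"followedBy:{user_id}")

def follower_count(user_id):
    return r.scard(f"followedBy:{user_id}")  # O(1); duplicates impossible

follow("pippo", "martin")
follow("barbara", "martin")
print(followers("martin"), follower_count("martin"))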
31. Key Value Store
• Pros:
– very fast,
– very scalable,
– simple model,
– able to distribute horizontally.
• Cons:
– many data structures (objects) can't be easily modeled as key-value pairs.
33. Column-oriented
• Store data in columnar format.
• Each storage block contains data from only one column.
• Allow key-value pairs to be stored (and retrieved by key) in a massively parallel system:
– data model: families of attributes defined in a schema; new attributes can be added online;
– storage principle: big hashed distributed tables;
– properties: partitioning (horizontal and/or vertical), high availability, etc., completely transparent to the application.
36. BigTable
• Project started at Google in 2005.
• Written in C and C++.
• Used by Gmail and all the other services at Google.
• It can be used as a service (Google Cloud Platform) and can be integrated with Google BigQuery.
37. HBase
Apache HBase™ is the Hadoop database to use when you need random, realtime read/write access to your data.
• Automatic and configurable sharding of tables.
• Automatic failover support between RegionServers.
• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
• Easy-to-use Java API for client access.
• To be distributed, it has to run on top of HDFS.
• Integrated with MapReduce.
40. Hypertable
• Hypertable is an open source database system inspired by publications on the design of Google's BigTable.
• Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++.
41. BigTable – Hbase - Hypertable
• Operators supported:
– put(key, columnFamily, columnQualifier, value)
– get(key)
– scan(startKey, endKey)
– delete(key)
• get and delete support an optional column family and
qualifier
41
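A minimal sketch of the four operators through the happybase Python client (the client library and its Thrift gateway are assumptions; the deck names only the abstract operators):

import happybase

conn = happybase.Connection('localhost')   # talks to the HBase Thrift gateway
table = conn.table('users')

# put(key, columnFamily, columnQualifier, value)
table.put(b'user1', {b'info:name': b'pippo basile'})

# get(key); a family/qualifier filter is optional
row = table.row(b'user1')

# scan(startKey, endKey)
for key, data in table.scan(row_start=b'user0', row_stop=b'user9'):
    print(key, data)

# delete(key)
table.delete(b'user1')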
42. Cassandra
• BigTable extension:
– All nodes are similar.
– Can be used as a distributed hash table, with an "SQL-like"
language, CQL (but no JOIN!)
• Data can have an expiration (set on INSERT)
• Map/Reduce possible with Apache Hadoop
• Rich data model (columns, composites, counters,
secondary indexes, map, set, list)
42
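A minimal sketch with the DataStax Python driver (driver, keyspace and schema are assumptions), showing CQL and the per-row expiration set at INSERT time:

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS twitbase WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE IF NOT EXISTS twitbase.twits ("
                "user text, datetime timestamp, body text, "
                "PRIMARY KEY (user, datetime))")

# USING TTL: this row silently disappears after one day
session.execute("INSERT INTO twitbase.twits (user, datetime, body) "
                "VALUES (%s, %s, %s) USING TTL 86400",
                ('pippo', datetime.utcnow(), 'hello'))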
43. Proven Scalability and High
Performance
43
http://planetcassandra.org/nosql-performance-benchmarks/
48. HBase TwitBase: User
User(user: String, name: String, email: String, password: String)
• We can define the table users
create 'users', 'info'
• Primary Key -> we need a unique identifier
• Operations
–add a new user -> put(key, columnFamily, columnQualifier, value)
–retrieve a specific user -> get(key)
–list all the users -> scan on the table users setting the family
48
49. HBase TwitBase: Twit
Twit(user: String, datetime: DateTime, text: String)
• We can define the table Twits
create 'twits', 'twits'
• Primary Key -> [md5(userId), Bytes.toBytes(-1 * timestamp)]
• Operations
–post a new twit on user's behalf -> put
–list all the twits for the specified user -> scan on a partial key: start row =
md5(userId), stop row = md5(userId) with the last byte incremented by one (see the sketch below)
49
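The partial-key scan, sketched with happybase (client choice, as before, is an assumption); the negated timestamp makes the newest twits sort first:

import hashlib
import struct
import happybase

conn = happybase.Connection('localhost')
twits = conn.table('twits')

user_hash = hashlib.md5(b'pippo').digest()        # 16-byte rowkey prefix

# post a new twit: rowkey = md5(userId) + toBytes(-1 * timestamp)
ts = 1405547914879
rowkey = user_hash + struct.pack('>q', -1 * ts)   # big-endian signed long
twits.put(rowkey, {b'twits:text': b'hello'})

# list all twits for the user: scan only the rows sharing the prefix
for key, data in twits.scan(row_prefix=user_hash):
    print(data)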
50. HBase TwitBase: Relation
Relation(from: String, relation: String, to: String)
• We can define the table Relation
create 'follows', 'f'
create 'followedBy', 'f'
• Primary Key: [md5(fromId), md5(toId)]
• Operations
–add a new relation-> put
–list everyone userId follows -> scan on 'follows' using md5(userId)
–list everyone who follows userId -> scan on 'followedBy' using md5(userId)
–count users' followers -> any suggestion? (a Scan, or a server-side Coprocessor)
50
52. Column Oriented Considerations
More efficient than a row (or document) store if:
•Multiple rows/records/documents are inserted at the
same time, so updates of column blocks can be
aggregated
•Retrievals access only some of the columns in a
row/record/document
52
53. Wait… online vs. offline operations
• We focused on online operations.
• Get and Put return result in milliseconds.
• The twits table’s rowkey is designed to maximize
physical data locality and minimize the time spent
scanning records.
• But not all the operations can be done online.
• What about offline operations (e.g. a site-traffic
summary report)?
• These operations have performance concerns as well.
53
54. Scan vs Thread Scan vs MapReduce
• Scan is a serial operation
• We can use a thread pool to speed up the
computation
• We can use MapReduce to split the work into map
and reduce phases.
54
1. map(key: LongWritable, value: Text, context: Context)
2. reduce(key: Text, vals: Iterable[LongWritable], context: Context)
57. Document Store
• Schema Free.
• Usually a JSON-like (BSON) interchange model, which
supports lists, maps, dates and Booleans, with nesting
• Query Model: JavaScript or custom.
• Aggregations: Map/Reduce.
• Indexes are done via B-Trees.
• Example: Mongo
– {Name:"Jaroslav",
Address:"Malostranske nám. 25, 118 00 Praha 1“
Grandchildren: [Claire: "7", Barbara: "6", "Magda: "3", "Kirsten:
"1", "Otis: "3", Richard: "1"]
}
57
58. Document Store: Advantages
• Documents are independent units
• Application logic is easier to write (documents map naturally to JSON).
• Schema Free:
– Unstructured data can be stored easily, since a document
contains whatever keys and values the application logic
requires.
– In addition, costly migrations are avoided since the
database does not need to know its information schema in
advance.
58
60. MongoDB
• Consistency and Partition Tolerance
• MongoDB's documents are encoded in a JSON-like
format (BSON) which
– makes storage easy, is a natural fit for modern object-
oriented programming methodologies,
– and is also lightweight, fast and traversable.
• It supports rich queries and full indexes.
– Queries are JavaScript expressions.
– Each object is stored with an ObjectId.
• Has geospatial indexing.
• Supports Map-Reduce queries.
60
61. MongoDB: Features
• Replication Methods: replica set and master slave
• Read Performance – Mongo employs a custom binary
protocol (and format) providing reads at least an order
of magnitude faster than CouchDB at the moment.
• Provides speed-oriented operations like upserts and
update-in-place mechanics in the database.
61
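A minimal sketch of these operations with the pymongo client (client choice and database name are assumptions):

from pymongo import MongoClient

db = MongoClient('mongodb://localhost:27017')['twitbase']

# insert a document; MongoDB assigns an ObjectId automatically
db.users.insert_one({'user': 'pippo', 'name': 'pippo basile',
                     'email': 'prova@mail.com', 'password': 'xxx'})

# rich query: filter on any field, no schema declared up front
pippo = db.users.find_one({'user': 'pippo'})

# upsert + update-in-place: modify the document where it lives,
# creating it if it does not exist yet
db.users.update_one({'user': 'pippo'}, {'$inc': {'count': 1}}, upsert=True)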
62. CouchDB
• Written in Erlang.
• Documents are stored using JSON.
• The query language is in Javascript and supports Map-Reduce
integration.
• One of its distinguishing features is multi-master replication.
• ACID: It implements a form of Multi-Version Concurrency
Control (MVCC) in order to avoid the need to lock the
database file during writes.
• CouchDB guarantees eventual consistency to be able to
provide both availability and partition tolerance.
62
63. CouchDB: Features
• Master-Master Replication – possible because of the
append-only style of commits.
• Reliability of the actual data store backing the DB
(Log Files)
• Mobile platform support. CouchDB actually has
installs for iOS and Android.
• HTTP REST JSON interaction only. No binary protocol
63
64. CouchDB JSON Example
{
"_id": "guid goes here",
"_rev": "314159",
"type": "abstract",
"author": "Keith W. Hare"
"title": "SQL Standard and NoSQL Databases",
"body": "NoSQL databases (either no-SQL or Not Only SQL)
are currently a hot topic in some parts of
computing.",
"creation_timestamp": "2011/05/10 13:30:00 +0004"
}
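Because the interaction is HTTP/REST/JSON only, storing and fetching a document like the one above is plain HTTP; a minimal sketch with the Python requests library (library, host and credentials are assumptions):

import requests

base = 'http://admin:password@localhost:5984'

requests.put(f'{base}/abstracts')              # create the database

doc = {'type': 'abstract', 'author': 'Keith W. Hare',
       'title': 'SQL Standard and NoSQL Databases'}
resp = requests.put(f'{base}/abstracts/hare-2011', json=doc)
print(resp.json()['rev'])                      # CouchDB returns the _rev

print(requests.get(f'{base}/abstracts/hare-2011').json())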
67. RethinkDB
• Document Store based on JSON
• Compared with MongoDB it adds:
– Aggregations using grouped map reduce
– Joins and full subqueries
– Full JavaScript (V8) functions
• RethinkDB supports primary key, compound,
secondary, and arbitrarily computed indexes stored
as B-trees.
67
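A minimal sketch of such a join with the RethinkDB Python driver (the driver import style and table layout are assumptions):

from rethinkdb import RethinkDB

r = RethinkDB()
conn = r.connect('localhost', 28015)

# a real join, which MongoDB's query language lacks: match each twit's
# 'user' field against the primary key of the users table
results = (r.table('twits')
            .eq_join('user', r.table('users'))
            .zip()
            .run(conn))
for doc in results:
    print(doc)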
69. MongoDB TwitBase: User
User(user: String, name: String, email: String, password: String)
• We can define the collection users
db.createCollection("users")
• Primary Key -> can we use the default ObjectId?
• Operations
–add a new user -> db.users.insert({…})
–retrieve a specific user -> db.users.find({…})
–list all the users -> db.users.find()
69
70. MongoDB TwitBase: Twit
Twit(user: String, datetime: DateTime, text: String)
• We can define the collection Twits
db.createCollection("twits")
• Primary Key-> ???
• Operations
–post a new twit on user’s behalf -> db.twits.insert({})
–list all the twits for the specified user -> db.twits.find({user: "pippo"})
70
71. Data Model Considerations
• Embed as much as possible (subdocuments, avoid joins)
• Separate out data that is referred to from multiple sources
• Document size consideration (16 MB limit)
• Complex data structures (search issues, no subdocument
return)
• Data consistency: there are no multi-document locks
Twits Data Model
• Embed twits into the users collection and assign an id to each
twit.
• The ObjectId has a timestamp embedded, so we can use that
(see the sketch below)
71
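A minimal pymongo sketch of that data model (the collection layout is an assumption); note how the ObjectId doubles as the twit's datetime:

from bson import ObjectId
from pymongo import MongoClient

db = MongoClient()['twitbase']

# embed the twit as a subdocument with its own ObjectId
twit_id = ObjectId()
db.users.update_one({'user': 'pippo'},
                    {'$push': {'twits': {'_id': twit_id, 'text': 'hello'}}})

# the ObjectId's leading four bytes are a Unix timestamp
print(twit_id.generation_time)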
72. MongoDB TwitBase: Relation
Relation(from: String, relation: String, to: String)
• We can embed the collection relation in the users collection
• Operations
–add a new relation
–list everyone user-Id follows
–list everyone who follows user-Id
–count users' followers -> any suggestion? keep a counter (see the sketch below)
72
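A minimal sketch of the embedded relation plus the counter suggestion (pymongo, layout assumed). Note these are two separate writes: as the previous slide warns, there are no multi-document locks:

from pymongo import MongoClient

db = MongoClient()['twitbase']

# embed the relation in the follower's document
db.users.update_one({'user': 'pippo'},
                    {'$push': {'follows': {'relation': 'follows',
                                           'to': 'martin'}}})

# keep a follower counter on the followee, so counting is O(1)
db.users.update_one({'user': 'martin'}, {'$inc': {'followers': 1}})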
74. Graph Databases
• They are significantly different from the other three classes of
NoSQL databases.
• Graph Databases are based on the mathematical concept of
graph theory.
• They fit well in several real world applications (twits,
permission models)
• Are based on the concepts of Vertex and Edges
• A Graph DB can be labeled, directed, attributed multi-graph
• Relational DBs can model graphs, but every traversal then
needs an expensive join; in a graph DB following an edge
does not require a join.
74
78. Neo4j
• A highly scalable open source graph database that
supports ACID,
• has high-availability clustering for enterprise
deployments,
• Neo4j provides a fully equipped, well-designed and
documented REST interface
• Neo4j is the most popular graph database in use
today.
• License AGPLv3/Commercial
78
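A minimal sketch against that REST interface with the Python requests library (the transactional-endpoint path is from the Neo4j 2.x era, and host, credentials and the Cypher statement are all assumptions):

import requests

url = 'http://localhost:7474/db/data/transaction/commit'

payload = {'statements': [{
    'statement': 'CREATE (u:User {props}) RETURN u',   # Cypher over HTTP
    'parameters': {'props': {'name': 'pippo'}},
}]}
print(requests.post(url, json=payload).json())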
79. Neo4j: Features
• It includes extensive and extensible libraries for
powerful graph operations such as traversals,
shortest path determination, conversions,
transformations.
• It includes triggers, which are referred to as
transaction event handlers.
• Neo4j can be configured as a multi-node cluster.
• An embedded version of Neo4j is available, which
runs directly in the application context.
• GIS Indexes (QuadTree, HierarchyTree, …)
79
80. Titan
• Support for ACID and eventual consistency.
• Support for various storage backends: Apache
Cassandra, Apache HBase, Oracle BerkeleyDB, Akiban
Persistit and Hazelcast.
• Support for geo, numeric range, and full-text search
via: ElasticSearch, Apache Lucene
• Native integration with the TinkerPop graph stack
• Open source with the liberal Apache 2 license.
• Edge compression and vertex-centric indices
80
81. OrientDB
• Key-Value DB, Graph DB and Document DB
• SQL and Transactions
• Distributed: OrientDB supports Multi-Master
Replication
• It supports different types of relations:
– 1-1 and N-1 referenced relationships
– 1-N and N-M referenced relationships
• LINKLIST, as an ordered list of links
• LINKSET, as an unordered set of links. It doesn't accept duplicates
• LINKMAP, as an ordered map of links with a String key. It doesn't
accept duplicate keys
81
84. TitanDB TwitBase: User
User(user: String, name: String, email: String, password: String)
• We can define a vertex for each user
g.createKeyIndex("name",Vertex.class);
Vertex pippo = g.addVertex(null);
juno.setProperty(”type", ”user");
juno.setProperty(”user", ”pippo");
• Primary key -> unique index on name property
• Operations
–add a new user -> g.addVertex()
–retrieve a specific user -> g.getVertices("name", "pippo")
–list all the users -> g.getVertices("type", "user")
84
85. TitanDB TwitBase: Twit
Twit(user: String, datetime: DateTime, text: String)
85
[Diagram: User vertex (name: String) linked to Twit vertices by twits(time: Long) edges]
• Operations
– post a new twit on user’s behalf
• val twit = g.addVertex("twit"); val edge = g.addEdge(pippo, twit, "twit")
– list all the twits for the specified user
• val results = pippo.query().labels("twit").vertices()
86. TitanDB TwitBase: Relation
Relation(from: String, relation: String, to: String)
86
[Diagram: User vertices (name: String) connected by follows / followedBy edges]
• Operations
– add a new relation -> val edge = g.addEdge(pippo, martin, "follows")
– list everyone user-Id follows -> pippo.query().labels("follows").vertices()
– list everyone who follows user-Id ->
pippo.query().labels("followedBy").vertices()
– count users' followers -> pippo.query().labels("followedBy").count()
88. Storage Model
• Adjacency list in one column family
• Row key = vertex id
• Each property and edge in one column
– Denormalized, i.e. stored twice
• Direction and label/key as column prefix
• Indexes are maintained in a separate column family
88
89. TwitBase: Stream and
Recommendations
• Add Stream
– pippo.in("follows").each{ g.addEdge(it, tweet, "stream", ['time': 4]) }
• Read Stream
– martin.outE("stream")[0..9].inV.map
• Followship Recommendation:
def follows = g.V('name', 'Hercules').out('follows').toList()
def follows20 = follows[(0..19).collect{ random.nextInt(follows.size) }]
def m = [:]
follows20.each
{ it.outE('follows')[0..29].inV.except(follows).groupCount(m).iterate() }
m.sort{ a, b -> b.value <=> a.value }[0..4]
89
93. NewSQL
• Just like NoSQL, it is more of a movement than a specific
product or even product family
• The "New" refers to the vendors, not the SQL
• Goal(s):
– Bring the benefits of relational model to distributed
architectures, or,
• VoltDB, ScaleDB, etc.
– Improve Relational DB performance to no longer require
horizontal scaling
• Tokutek, ScaleBase, etc.
• “SQL-as-a-service”: Amazon RDS, Microsoft SQL Azure, Google Cloud SQL
95. Spark
• Apache Spark is an open source cluster computing
system that aims to make data analytics fast — both
fast to run and fast to write.
• To run programs faster, Spark offers a general
execution model that can optimize arbitrary operator
graphs, and supports in-memory computing, which
lets it query data faster than disk-based engines like
Hadoop.
• Written in Scala, using Akka (akka.io)
95
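A minimal PySpark sketch of the in-memory model described above (the Python API and the input path are assumptions; the deck itself is language-agnostic here):

from pyspark import SparkContext

sc = SparkContext('local[*]', 'twitbase-stats')

# load once, cache in memory: subsequent actions skip the disk entirely
logs = sc.textFile('hdfs:///logs/last_month').cache()

# two queries over the same in-memory dataset
errors = logs.filter(lambda line: 'ERROR' in line).count()
total = logs.count()
print(errors, total)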
97. Spark
• Machine Learning Library (MLlib)
– binary classification, regression, clustering and
collaborative filtering, as well as an underlying gradient
descent optimization primitive
• Bagel is a Spark implementation of Google’s Pregel
graph processing framework
– jobs run as a sequence of iterations called supersteps. In
each superstep, each vertex in the graph runs a user-
specified function that can update state associated with
the vertex and send messages to other vertices for use in
the next iteration.
97
98. Shark
• Shark is a large-scale data warehouse system for
Spark designed to be compatible with Apache Hive. It
can execute Hive QL queries up to 100 times faster
than Hive without any modification to the existing
data or queries.
– CREATE TABLE logs_last_month_cached AS SELECT * FROM
logs WHERE time > date(...);
– SELECT page, count(*) c FROM logs_last_month_cached
GROUP BY page ORDER BY c DESC LIMIT 10;
98
102. Brewer’s CAP Theorem
A distributed system can support only two of the
following characteristics:
•Consistency (all copies have the same value)
•Availability (the system can run even if parts have failed)
•Partition Tolerance (the network can break into two or
more parts, each with active systems that cannot
communicate with the other parts)
102
103. Brewer’s CAP Theorem
Very large systems will partition at some point:
•it is necessary to decide between Consistency and
Availability,
•traditional DBMSs prefer Consistency over Availability
and Partition tolerance,
•most Web applications choose Availability (except for
specific applications such as order processing)
103
You can run atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with the highest ranking in a sorted set.
Cassandra internals: membership and failure detection via a gossip protocol.
Commit log for durability – sequential writes.
Memtable – no disk access (no reads or seeks).
SSTables written sequentially to the disk.
The operational design integrates nicely with the operating system page cache. Because Cassandra does not modify data in place, dirty pages that would have to be flushed are never even generated.
Replica sets are the preferred replication mechanism in MongoDB. However, if your deployment requires more than 12 nodes, you must use master/slave replication.
Master-Master: CouchDB treats every modification to the DB as a revision, making conflicts during replication much less likely and allowing for some awesome master-master replication, or what Cassandra calls a "ring" of servers all bi-directionally replicating to each other. It can even look more like a fully connected graph of replication rules.
Reliability: because CouchDB records any change as a "revision" appended to the DB file on disk, the file can be copied or snapshotted at any time, even while the DB is running, and you don't have to worry about corruption. It is a really resilient method of storage.
Consider the difference with column-oriented databases.