1. Coratti Stefano
coratti.1624508@studenti.uniroma1.it
github.com/CorattiS86
it.linkedin.com/in/Stefano-coratti-83005a85
www.slideshare.net/StefanoCoratti
Project for the course of
“Big Data Management 2016”
DIAG - SAPIENZA
>> Analysis of a Database for IOT
2. Big Data Management 2016
SAPIENZA - DIAG
Overview
IOT world
Databases requirements for IOT
Relational Database
NOSQL Databases
Document-Based & MongoDB
MongoDB - Overview
MongoDB - Logical Data Model
MongoDB - Architecture
MongoDB - Storage Data Structure
MongoDB - Query Language
MongoDB - Other Features
MongoDB - In action
3. 01
IOT world
Sensor data is only useful if you can do something with it
A world where all your physical assets and devices are connected to each other and share information,
making life easier and more convenient.
Areas of application today:
Financial Services
Remotely monitor vehicle performance and driver behavior using telematics sensor data, to tie those metrics to insurance premiums.
Government
Use biometric sensor data from patients to alert doctors early, so that they can prevent medical emergencies.
High Tech
Quantify people's lifestyles with wearable tech, analyzing diet, sleep, exercise and the rest of their activity.
Retail
Present enticing offers to shoppers as they walk through the store, using in-store beacons and purchase history data.
4. 02
Database requirements for IOT
Scalability
Continuous machine-scale ingestion, indexing and storage. Even a modest data source may generate millions of complex records per second on a continuous basis.
Operational “real-time” queries and analytics
Extracting value from IOT data is all about minimizing the latency between data ingestion and online queries and actionable analytics.
Spatio-temporal data
Many data objects in the real world have attributes related to both space and time.
IOT data is all about spatio-temporal relationships and join operations. It requires at least a true time-series database for very simple uses and a true spatial database for the more general case.
Schema flexibility
An IOT database must be as flexible as the application requires, because the schema changes over time.
High variety of data
Data in IOT, as in Big Data in general, may be structured or unstructured, dense or sparse, connected or disconnected.
5. 03
Why Relational DBs fail in IOT?
Scalability
RDBMS are not designed to scale out, and reaching scalability is a huge challenge.
They are designed to run on a single server, in order to maintain the integrity of the table mappings and avoid the problems of distributed computing.
Operational “real-time” queries and analytics
Query execution time increases with the size of the tables. Operations like JOIN are compute- and memory-intensive, and their cost grows steeply as queries grow.
Spatio-temporal data
Managing spatio-temporal data in an RDBMS is complex and inefficient because, in general, it does not efficiently support objects that are multi-dimensional in nature.
Schema flexibility
Relational DBs have a STATIC schema: it must be defined at the beginning and respected afterwards to preserve the consistency guarantees of the database.
High variety of data
In a relational DB data must be structured, and not every kind of data can be stored efficiently in this form.
6. 04
NOSQL Databases
Graph DBs
Databases that use graph structures for semantic queries, with nodes, edges and properties to represent and store data.
They offer a specific query language, designed for graph DBs, that allows traversing a graph.
Relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation.
Key–Value DBs
A data storage paradigm designed for storing, retrieving and managing associative arrays.
There is no pre-defined structure for the data: the store is a dictionary containing a collection of objects, or records, which in turn have many different fields, each containing data.
With little or no need to maintain indexes, they are designed to be horizontally scalable.
They support only simple queries very efficiently, and are particularly suited to problems where the data are not highly related.
The application needs a lot more code to reassemble collections of key-value pairs into objects.
7. 05
NOSQL Databases
Data Warehouse
Data warehouses are large, so they can accommodate a lot of sensor data, and scalable architectures exist.
They are a “near real-time” solution: the bottleneck is the ETL process, which makes it difficult to support updates in “real time”.
Like a relational database, a data warehouse also requires maintaining a schema.
Its nature is to be multi-dimensional and to integrate heterogeneous information.
Column family DBs
They consist of rows and columns, with data stored as key-value pairs, where the key is mapped to a value that is a set of columns.
Unlike relational DBs, where records are stored row by row, here data is organized by column; storing data in this fashion allows queries that aggregate over very large sets of data to run very efficiently.
Data can be easily partitioned across separate database servers (sharding).
8. 06
NOSQL Databases
Document-based DBs
All information for a given object is stored in a single instance in the database, and every stored object can differ from every other (fields, format, etc.), while remaining well structured through an encoding such as XML or JSON.
SCHEMALESS: documents are not required to adhere to a standard schema; the model is flexible and adapts to applications that each use different objects.
The database offers an API or query language that allows the user to retrieve documents based on their content, for example all documents with a certain field.
They typically relax ACID properties to allow high scalability.
Most popular Document-based DBs
9. 07
NOSQL DBs Challenge
Maturity
RDBMS systems have been around for a long time; they are stable and richly functional.
In comparison, most NOSQL alternatives are in pre-production versions, with many key features yet to be implemented.
Support
Enterprises want reassurance that, if a key system fails, they will be able to get timely and competent support.
RDBMS vendors provide high levels of support. In contrast, most NOSQL DBs are open source, and the companies behind them are often small start-ups without the support resources or credibility of an Oracle, Microsoft or IBM.
Analytics and business intelligence
NOSQL databases are oriented towards the demands of Web 2.0 applications. However, data in an application has value to the business that goes beyond the Insert-Read-Update-Delete cycle of a typical Web application.
Businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence is a key IT issue for all medium to large companies.
Expertise
There are millions of developers familiar with RDBMS, so it is easier to find an RDBMS expert than a NOSQL expert.
This situation will resolve itself naturally over time.
Administration
The design goal for NOSQL may be to provide a zero-admin solution, but the current reality falls well short of that goal.
NOSQL today requires a lot of skill to install and a lot of effort to maintain.
10. 08
MongoDB Overview
IOT is HARD, MongoDB makes it EASY
Each new generation of “thing” comes with new sensors. New sensors create new data and new functionality requirements. A database must succeed in the hard task of incorporating new data and iterating on the data model.
The document model makes it possible to store and process data of any structure: events, text, binary data, geospatial coordinates and anything else. The structure of a document can change as rapidly as the data generated by IOT.
Scaling problem. Billions of sensors generate huge volumes of data. That is a lot more than a single server can handle.
Scale Big. MongoDB is built to scale out on commodity hardware, in your own data center or in the cloud, serving a lot of users and sensor data without extra software.
There is a need to analyze rapidly changing, multi-structured data in real time. There is no luxury of lengthy ETL processes to cleanse data for downstream reporting.
Signal vs Noise. MongoDB can analyze data of any structure. It can do so directly within the database, giving results in real time, without expensive data warehouse loads.
12. 09
MongoDB Logical Data Model
Data as Documents
A document is a set of key-value pairs.
Documents that share a similar structure are typically organized into a collection.
Fields can vary from document to document; there is no need to declare the structure of documents to the system.
If a new field needs to be added to a document, it can be created without affecting any other document and without updating a central system catalog.
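A minimal JavaScript sketch of this flexibility, using plain objects in an array as a stand-in for a collection. The field names and values are made up for illustration, and no MongoDB server is involved:

```javascript
// Stand-in for a collection: documents with a similar, but not identical, structure.
const sensorReadings = [
  { sensorId: "t-01", temperature: 21.5 },
  { sensorId: "t-02", temperature: 19.8 },
];

// A newer sensor also reports humidity: the new field appears in one document only.
sensorReadings.push({ sensorId: "t-03", temperature: 20.1, humidity: 48 });

// The older documents are untouched and nothing had to be declared beforehand.
const fieldsPerDocument = sensorReadings.map((doc) => Object.keys(doc));
console.log(fieldsPerDocument);
```

Only the third document carries the extra field; the first two keep their original two fields.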
RDBMS comparison
Table: collections can be thought of as analogous to tables.
Rows: documents are similar to rows.
Columns: fields are similar to columns.
14. 10
MongoDB Logical Data Model
Data as Documents
MongoDB stores documents in a binary representation called BSON (Binary JSON). Unlike plain JSON, BSON carries explicit data types for the values it stores.
BSON data types shown: string, number, date-time, Object_ID, struct
Object_ID layout (12 bytes):
4-byte value representing the seconds since the Unix epoch
3-byte machine identifier
2-byte process id
3-byte counter, starting with a random value
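The 12-byte layout can be illustrated by splitting the 24-character hexadecimal form of an Object_ID into the four fields listed on the slide. This is only a sketch: `parseObjectId` is a hypothetical helper rather than a MongoDB API, and it assumes the machine/process layout described here (later MongoDB versions replaced those two fields with a single 5-byte random value).

```javascript
// Split a 24-character hex Object_ID into the four fields of the layout above.
// parseObjectId is a hypothetical helper written for this example.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters (12 bytes)");
  }
  return {
    // bytes 0-3: seconds since the Unix epoch, converted to a Date
    timestamp: new Date(parseInt(hex.slice(0, 8), 16) * 1000),
    // bytes 4-6: machine identifier, kept as hex
    machineId: hex.slice(8, 14),
    // bytes 7-8: process id
    processId: parseInt(hex.slice(14, 18), 16),
    // bytes 9-11: counter
    counter: parseInt(hex.slice(18, 24), 16),
  };
}

const parts = parseObjectId("507f1f77bcf86cd799439011");
console.log(parts.timestamp.toISOString(), parts.machineId, parts.processId, parts.counter);
```

Because the creation time is embedded in the first four bytes, Object_ID values sort roughly by creation time.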
16. 11
MongoDB Logical Data Model
Data as Documents
MongoDB documents tend to hold all data for a given record in a single document, whereas in a relational database the information for a given record is usually spread across many tables.
EXAMPLE
Suppose we design a database for a blog/website with the following requirements:
Every post has a unique title, description and url
Every post can have one or more tags
Every post has the name of its publisher and a total number of likes
Every post has comments by users with their name, message, date-time and likes
On each post, there can be zero or more comments
An RDBMS requires at least 3 tables VS a single MongoDB document.
A single document dramatically reduces the need to JOIN separate tables.
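One possible shape for such a post document, written as a plain JavaScript object. All field names and values here are illustrative assumptions, not taken from the original slides:

```javascript
// One hypothetical document holding a whole blog post; every value is made up.
const post = {
  title: "MongoDB for IOT",
  description: "Notes on storing sensor data as documents",
  url: "http://example.com/posts/mongodb-for-iot",
  tags: ["mongodb", "iot"],          // one or more tags
  by: "publisher name",              // name of the publisher
  likes: 100,                        // total number of likes
  comments: [                        // zero or more embedded comments
    {
      user: "user1",
      message: "Nice overview",
      dateCreated: new Date("2016-05-01T10:00:00Z"),
      likes: 2,
    },
  ],
};

// Reading the post together with its tags and comments needs no JOIN.
const commentAuthors = post.comments.map((c) => c.user);
console.log(post.title, post.tags, commentAuthors);
```

In a relational design the tags and comments would typically live in their own tables, linked to the post by foreign keys.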
18. 12
MongoDB Storage Data Structure
With the MMAPv1 storage engine, every MongoDB instance consists of a namespace file, journal files and data files.
Data files
Data files store BSON documents, indexes and MongoDB-generated metadata in structures called extents.
Extents
Extents are containers within data files used to store documents and indexes.
Data and indexes are each contained in their own sets of extents.
No extent will ever contain content for more than one collection.
Data and indexes are never contained within the same extent.
Data and indexes for a collection will usually span multiple extents.
When a new extent is needed, MongoDB attempts to use available space within the current data file; if none can be found, a new data file is created.
21. 13
MongoDB Storage metrics
dataSize
The dataSize metric is the sum of the sizes (in bytes) of all documents and padding stored in the database.
record = document + padding
In MongoDB every document is stored in a record, which contains the document itself and extra space called padding.
padding
Each document is allocated with additional space; this is a strategy to make updates more efficient.
While dataSize decreases when documents are deleted, it does not decrease when documents shrink, because the space used by the original document has already been allocated and cannot be used by other documents.
If a document is updated with more data, dataSize remains the same as long as the new document fits within its originally padded, pre-allocated space.
[Figure: dataSize within data files My-db.1 and My-db.2]
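The relation record = document + padding can be made concrete with a toy model. This is a sketch only: the byte counts are invented, and `growInPlace` is a hypothetical helper that mimics the behaviour described above, not MongoDB's actual allocator.

```javascript
// Toy model: each record is document bytes plus padding bytes (sizes are made up).
const records = [
  { docBytes: 200, paddingBytes: 56 },
  { docBytes: 120, paddingBytes: 8 },
];

// dataSize as described above: the sum of all documents and their padding.
const dataSize = (recs) =>
  recs.reduce((sum, r) => sum + r.docBytes + r.paddingBytes, 0);

// Grow a document in place only if the extra bytes fit inside its padding.
function growInPlace(record, extraBytes) {
  if (extraBytes > record.paddingBytes) return false; // does not fit in the padding
  record.docBytes += extraBytes;
  record.paddingBytes -= extraBytes;
  return true;
}

const before = dataSize(records);
const fitted = growInPlace(records[0], 40);       // 40 <= 56, fits in the padding
const after = dataSize(records);
const tooLarge = growInPlace(records[1], 100);    // 100 > 8, does not fit
console.log(before, after, fitted, tooLarge);
```

The update that fits inside the padding leaves dataSize unchanged, which is the behaviour the slide describes.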
22. 14
MongoDB Storage metrics
storageSize
The storageSize metric is equal to the sum of the sizes (in bytes) of all data extents in the database.
The storageSize is larger than dataSize because it also includes yet-unused space (in data extents) and space vacated by deleted or moved documents within extents.
The storageSize does not decrease as you remove or shrink documents.
[Figure: dataSize and storageSize within data files My-db.1 and My-db.2]
23. 15
MongoDB Storage metrics
fileSize
The fileSize metric is equal to the sum of the sizes (in bytes) of all data extents, index extents and yet-unused space (in data files) in the database.
This metric represents the storage footprint of the database on disk.
It does not decrease when collections, documents or indexes are removed; it decreases when a database is deleted.
[Figure: dataSize, storageSize and fileSize within data files My-db.1 and My-db.2]
24. 16
MongoDB Distribution
MongoDB provides horizontal scale-out for databases on low-cost, commodity hardware using a technique called “Sharding”.
SHARDING technique
Sharding distributes data across multiple physical partitions called “shards”.
Data are automatically balanced across multiple servers as they grow, addressing the hardware limitations of a single server.
Sharding is transparent to applications: whether there is one shard or one hundred, the application code for querying MongoDB is the same.
MongoDB can optionally also provide redundancy, keeping multiple copies of the data on different database servers.
Recovery from hardware failure and service interruptions.
Protection from the loss of a single server.
No separate backup needed: one of the copies can act as a backup.
26. 17
MongoDB Distribution
A MongoDB sharded cluster consists of the following components:
shard: each shard contains a subset of the sharded data; each shard can be deployed as a replica set.
mongos: the mongos acts as a query router, providing an interface between client applications and the sharded cluster.
config servers: store metadata and configuration settings for the cluster.
The query router uses this metadata to target
operations to specific shards.
Shard keys: to distribute the documents of a collection, MongoDB partitions the collection using the shard key, which consists of an immutable field or fields that exist in every document of the target collection.
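How a query router might use a shard key to pick a shard can be sketched with a range-based lookup. Everything below is an illustrative assumption: the chunk table, the shard names, the `sensorId` field and the `routeByShardKey` helper are hypothetical and do not reproduce the real mongos logic.

```javascript
// Hypothetical metadata such as config servers might hold: each chunk covers a
// half-open range [min, max) of shard key values and lives on one shard.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 5000, shard: "shardB" },
  { min: 5000, max: Infinity, shard: "shardC" },
];

// Pick the shard whose chunk range contains the document's shard key value.
function routeByShardKey(doc, keyField) {
  const key = doc[keyField];
  if (key === undefined) {
    throw new Error(`document has no shard key field "${keyField}"`);
  }
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  return chunk.shard;
}

console.log(routeByShardKey({ sensorId: 42, temperature: 21.5 }, "sensorId"));
```

The requirement that the key exists in every document shows up here as the explicit check: a document without the field cannot be routed.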
28. 18
MongoDB Query Language
To query data from a collection, MongoDB provides the find() method.
>> db.collection_name.find()
>> db.collection_name.find( conditions )
The conditions for the query must be specified as arguments to the find() method.
>> db.collection_name.find({ key : value })
The conditions are expressed as key : value pairs.
34. 19
MongoDB Query Language
Depending on the condition, it is possible to obtain the equivalent of an SQL query.
OPERATION | MONGODB SYNTAX | SQL EQUIVALENT
Selection from one collection | db.collection.find() | SELECT * FROM collection
Selection for equality | db.collection.find( { name : "John" } ) | SELECT * FROM collection WHERE name = 'John'
Selection for a value less than | db.collection.find( { num : { $lt : 50 } } ) | SELECT * FROM collection WHERE num < 50
Selection for a value not equal to | db.collection.find( { val : { $ne : 0 } } ) | SELECT * FROM collection WHERE val != 0
Projection | db.collection.find( {}, { name : 1, job : 1 } ) | SELECT name, job FROM collection
Projection and Selection | db.collection.find( { age : { $gte : 18 } }, { name : 1 } ) | SELECT name FROM collection WHERE age >= 18
35. 19
MongoDB Query Language
Depending on the condition, it is possible to obtain the equivalent of an SQL query.
OPERATION | MONGODB SYNTAX | SQL EQUIVALENT
AND | db.collection.find( { $and : [ { job : "employee" }, { age : { $gte : 65 } } ] } ) | SELECT * FROM collection WHERE job = 'employee' AND age >= 65
OR | db.collection.find( { $or : [ { job : "employee" }, { job : "freelancer" } ] } ) | SELECT * FROM collection WHERE job = 'employee' OR job = 'freelancer'
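To make the semantics of these conditions concrete without a running server, the sketch below evaluates a few MongoDB-style queries against an in-memory array. It is a simplified illustration, not MongoDB's implementation: `matches` and `find` are hypothetical helpers, only `$lt`, `$gte`, `$ne`, `$and` and `$or` are handled, and the sample data is invented.

```javascript
// Comparison operators supported by this sketch.
const comparators = {
  $lt: (a, b) => a < b,
  $gte: (a, b) => a >= b,
  $ne: (a, b) => a !== b,
};

// Return true when the document satisfies every condition in the query object.
function matches(doc, query) {
  return Object.entries(query).every(([key, cond]) => {
    if (key === "$and") return cond.every((q) => matches(doc, q));
    if (key === "$or") return cond.some((q) => matches(doc, q));
    if (cond !== null && typeof cond === "object") {
      // { field: { $op: value } } form
      return Object.entries(cond).every(([op, arg]) => {
        const cmp = comparators[op];
        if (!cmp) throw new Error(`unsupported operator: ${op}`);
        return cmp(doc[key], arg);
      });
    }
    return doc[key] === cond; // plain { key: value } equality
  });
}

// find() with no query selects everything, like SELECT * FROM collection.
const find = (collection, query = {}) => collection.filter((d) => matches(d, query));

const people = [
  { name: "John", job: "employee", age: 67 },
  { name: "Ann", job: "freelancer", age: 30 },
  { name: "Bob", job: "student", age: 17 },
];

const retiring = find(people, { $and: [{ job: "employee" }, { age: { $gte: 65 } }] });
const working = find(people, { $or: [{ job: "employee" }, { job: "freelancer" }] });
console.log(retiring.map((p) => p.name), working.map((p) => p.name));
```

Each query object plays the role of the WHERE clause in the SQL column of the table.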
36. 20
MongoDB CRUD Concepts
C
Create or insert operation: it adds new documents to a collection.
It can be seen as the equivalent of INSERT in SQL.
R
Read operation: a way to query a collection for documents.
As seen before, there are ways to perform selection, projection, etc.
U
Update operation: it modifies existing documents in a collection.
A filter can be set to identify the documents to update.
D
Delete operation: it removes documents from a collection.
A filter can be set to delete only the specified documents; otherwise all documents are deleted.
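The four operations can be mimicked with a small in-memory class. This is a sketch under stated assumptions: `ToyCollection` is hypothetical, filters are limited to plain key : value equality, and its update changes every matching document, which is a simplification rather than a description of MongoDB's defaults.

```javascript
// In-memory stand-in for a collection; filters are { key: value } equality only.
class ToyCollection {
  constructor() {
    this.docs = [];
  }
  matches(doc, filter) {
    return Object.entries(filter).every(([k, v]) => doc[k] === v);
  }
  // C: add a new document (stored as a copy).
  insert(doc) {
    this.docs.push({ ...doc });
  }
  // R: return the documents that satisfy the filter.
  find(filter = {}) {
    return this.docs.filter((d) => this.matches(d, filter));
  }
  // U: set the given fields on every matching document; returns how many changed.
  update(filter, fields) {
    const targets = this.find(filter);
    targets.forEach((d) => Object.assign(d, fields));
    return targets.length;
  }
  // D: remove matching documents; an empty filter removes them all.
  remove(filter = {}) {
    const before = this.docs.length;
    this.docs = this.docs.filter((d) => !this.matches(d, filter));
    return before - this.docs.length;
  }
}

const sensors = new ToyCollection();
sensors.insert({ id: 1, status: "on" });
sensors.insert({ id: 2, status: "on" });
const updated = sensors.update({ id: 2 }, { status: "off" });
const removed = sensors.remove({ status: "off" });
const remaining = sensors.find().length;
const removedAll = sensors.remove();
console.log(updated, removed, remaining, removedAll);
```

The final call shows why a delete filter matters: with no filter, every remaining document is removed.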
37. 21
How ACID is MongoDB?
Atomicity & Isolation
In MongoDB a write operation is atomic at the level of a single document. When a single write operation modifies multiple documents, the modification of each document is atomic, but the operation as a whole is not, so other operations can interleave.
However, it is possible to isolate a single write operation with the $isolated operator.
The $isolated operator prevents other processes from interleaving once the write operation modifies the first document; this ensures that no one sees the changes until the write operation completes or errors out.
The $isolated operator causes write operations to acquire an exclusive lock on the collection.
An isolated write operation still does not provide “all-or-nothing” atomicity: an error during the write operation does not roll back the changes that preceded the error.
38. 22
How ACID is MongoDB?
Consistency
Even in replica set configurations, all writes are directed to the primary Mongo server, so single-server consistency is easy to guarantee.
The secondary nodes may be out of date with respect to the primary: eventual consistency only guarantees that, after a long enough period with no writes, they will catch up with the primary.
By default the secondary servers do not answer reads. Read traffic can be distributed to them, at the price of possible inconsistency; it is a configuration choice.
Durability
Durability of writes is the biggest issue with MongoDB.
SQL DBs commit after every write operation. In MongoDB this does not happen, by choice of the developers: they argue that in many scenarios the data is not actually written to disk even after syncing (hardware buffering), and that the time spent waiting for a crashed server to recover would impact availability anyway.
So if the server crashes, writes accepted after the last commit will be lost.
44. 23
MongoDB summing up
Is MongoDB a good choice for IOT?
Scalability
Horizontal scaling makes it easy to scale: servers can be added when needed, and the sharding technique balances data across them.
Operational “real-time” queries and analytics
Relaxing the ACID properties allows faster operations and, because much of the related data sits inside the same document, expensive JOIN operations are not required.
Spatio-temporal data
MongoDB offers a number of indexes and query mechanisms to handle geospatial information.
Location data are stored as GeoJSON objects.
Schema flexibility
The schema is free: it can change during write operations, and changes can affect one or more documents.
High variety of data
This is the strong point for IOT: sensor data can be represented as a field with its respective value, so they are stored in their natural form.
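As an illustration of the spatio-temporal point, a sensor reading with a GeoJSON location could look like the object below. The field names, the coordinates and the timestamp are assumptions made for this sketch; the only fixed convention is that a GeoJSON Point lists longitude before latitude.

```javascript
// Hypothetical sensor reading: a timestamp for the "temporal" part and a
// GeoJSON Point for the "spatial" part. All values are illustrative.
const reading = {
  sensorId: "t-01",
  takenAt: new Date("2016-05-01T10:00:00Z"),
  temperature: 21.5,
  location: {
    type: "Point",
    coordinates: [12.5, 41.9], // [longitude, latitude]
  },
};

const [longitude, latitude] = reading.location.coordinates;
console.log(`sensor ${reading.sensorId} at lon ${longitude}, lat ${latitude}`);
```

With a server available, a field shaped like `location` is the kind of value MongoDB's geospatial indexes are built to handle.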
45. 24
MongoDB simulation
>> use myDB
>> show dbs
>> show collections
>> db.collection.find()
>> db.collection.insert()
>> db.collection.find().explain("executionStats")
>> db.collection.find( { field: "value" } )
>> db.collection.save()
>> db.collection.update({}, { $set: {} })
>> db.collection.mapReduce()