MongoDB

 Coratti Stefano
 coratti.1624508@studenti.uniroma1.it
 github.com/CorattiS86
 it.linkedin.com/in/Stefano-coratti-83005a85
 www.slideshare.net/StefanoCoratti
Project for the course of
“Big Data Management 2016”
DIAG - SAPIENZA
>> Analysis of a Database for IOT

Overview
 IOT world
 Databases requirements for IOT
 Relational Database
 NOSQL Databases
 Document-Based & MongoDB
 MongoDB - Overview
 MongoDB - Logical Data Model
 MongoDB - Architecture
 MongoDB - Storage Data Structure
 MongoDB - Query Language
 MongoDB - Other Features
 MongoDB - In action

01
IOT world
 Sensor data is only useful if you can do something with it
 A world where all your physical assets and devices are connected to each other and share information,
making life easier and more convenient.
 Areas of application today :
Financial Services: remotely monitor vehicle performance and driver behavior using telematics sensor data, and tie those metrics to insurance premiums.
Government: use biometric sensor data from patients to alert doctors early so that they can prevent medical emergencies.
High Tech: quantify people's lifestyles with wearable tech, analyzing diet, sleep, exercise and the rest of their activity.
Retail: present enticing offers to shoppers, using in-store beacons and purchase history data, as they walk through the store.

02
Databases requirement for IOT
 Scalability
Continuous machine-scale ingestion, indexing and storage. Even a modest data source may generate millions of complex
records per second on a continuous basis.
 Operational “real-time” queries and analytics
Extracting value from IOT data is all about minimizing the latency from data ingestion to online queries
and actionable analytics.
 Spatio-temporal data
Many data objects in the real world have attributes related to both space and time.
IOT data is all about spatio-temporal relationships and join operations. It requires at least a true time-series database for
very simple uses and a true spatial database for the more general case.
 Schema flexibility
An IOT database must be as flexible as the application requires, because the schema changes over time.
 High variety of data
Data in IOT, as in Big Data in general, may be structured or not, dense or sparse, connected or disconnected.

03
Why Relational DBs fail in IOT ?
 Scalability
RDBMSs are not designed to scale out, so reaching scalability is a huge challenge.
They are designed to run on a single server in order to maintain the integrity of the table mappings and avoid
the problems of distributed computing.
Query execution time increases with the size of the tables. Operations like JOIN are compute- and memory-intensive, and
their cost grows steeply as queries involve more and larger tables.
Managing spatio-temporal data in an RDBMS is complex and inefficient, because in general it does not efficiently support
objects that are multi-dimensional in nature.
Relational DBs have a STATIC schema: it must be defined at the beginning and respected afterwards to meet the ACID properties.
In a relational DB data must be structured, and not every kind of data fits this model efficiently.

04
NOSQL Databases
Graph DBs
 Databases that use graph structures for semantic queries,
with nodes, edges and properties to represent and store data.
 A specific query language, designed for graph DBs, allows the graph to be traversed.
 Relationships allow data in the store to be linked together directly,
and in many cases retrieved with one operation.
Key–Value DBs
 A data storage paradigm designed for storing, retrieving and managing associative arrays.
 No pre-defined structure for the data: an associative array, also known as a dictionary, contains a collection of objects
or records, which in turn have many different fields, each containing data.
 With little or no need to maintain indexes, they are designed to be horizontally scalable.
 They support only simple queries very efficiently, and are particularly suited to problems where the data
are not highly related.
 You have to write a lot more application code to reassemble collections of key-value pairs into objects.
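The last point about key-value stores can be illustrated with a minimal sketch in plain JavaScript. An in-memory Map stands in for the key-value store; the key naming scheme (sensor:42:type) and the helper loadSensor are hypothetical and chosen only for illustration.

```javascript
// In-memory stand-in for a key-value store: every field lives under its own key.
const store = new Map([
  ["sensor:42:type", "thermometer"],
  ["sensor:42:unit", "celsius"],
  ["sensor:42:value", "21.5"],
  ["sensor:7:type", "hygrometer"],
]);

// Application code has to scan the keys and rebuild the object itself.
function loadSensor(kv, id) {
  const prefix = `sensor:${id}:`;
  const sensor = { id };
  for (const [key, value] of kv) {
    if (key.startsWith(prefix)) {
      sensor[key.slice(prefix.length)] = value;
    }
  }
  return sensor;
}

console.log(loadSensor(store, 42));
```

A document store would return the whole sensor object in one read; here the reassembly logic belongs to the application.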

05
NOSQL Databases
Data Warehouse
 Data warehouses are large, so they can accommodate a lot of sensor data,
and scalable architecture solutions exist.
 A “near real-time” solution: the bottleneck is the ETL process,
which makes it difficult to support updates in “real time”.
 Like a relational database, a data warehouse also requires a schema to be maintained.
 Its nature is to be multi-dimensional and to integrate heterogeneous information.
Column family DBs
 They consist of rows and columns, with data stored in key-value pairs,
where the key is mapped to a value that is a set of columns.
 Unlike relational DBs, where a record is stored as a row,
here data is organized by column. Storing data in this fashion allows queries
that aggregate over very large sets of data to run very efficiently.
 Data can easily be partitioned across separate database servers (sharding).

06
NOSQL Databases
Document-based DBs
 All information for a given object is stored in a single instance in the database,
and every stored object can differ from every other (fields, format, etc.),
while remaining well structured thanks to an encoding such as XML or JSON.
 SCHEMALESS: documents are not required to adhere to a standard schema. The model is flexible
and adapts to the applications, each of which uses different objects.
 The database offers an API or query language that allows the user to retrieve documents based
on their content, for example all documents that contain a certain field.
 ACID properties are relaxed to allow high scalability.
Most popular Document-based DBs

07
NOSQL DBs Challenge
 Maturity
RDBMSs have been around for a long time; they are stable and richly functional.
In comparison, most NOSQL alternatives are in pre-production versions, with many key features yet to be implemented.
 Support
Enterprises want reassurance that if a key system fails, they will be able to get timely and competent support.
RDBMS vendors provide high levels of support. In contrast, most NOSQL DBs are open source, and the companies behind them are
often small start-ups without the support resources or credibility of an Oracle, Microsoft or IBM.
 Analytics and business intelligence
NOSQL databases are oriented towards the demands of Web 2.0 applications. However, data in an application has value
to the business that goes beyond the Insert - Read - Update - Delete cycle of a typical Web application.
Businesses mine information in corporate databases to improve their efficiency and competitiveness,
and business intelligence is a key IT issue for all medium to large companies.
 Expertise
There are millions of developers familiar with RDBMSs, so it is easier to find an RDBMS expert than a NOSQL expert.
This situation will resolve itself naturally over time.
 Administration
The design goal for NOSQL may be to provide a zero-admin solution, but the current reality falls well short of that goal.
NOSQL today requires a lot of skill to install and a lot of effort to maintain.

08
MongoDB Overview
IOT is HARD, MongoDB makes it EASY
 Each new generation of “thing” comes with
new sensors. New sensors create new data and
new functionality requirements.
A database has to succeed in the hard task of
incorporating new data and iterating on a data model.
 The document model makes it possible to store and process
data of any structure: events, text, binary data,
geospatial coordinates and anything else.
The structure of a document’s schema can change as rapidly
as the data generated by IOT.
 Scaling problem. Billions of sensors generate huge volumes
of data. That’s a lot more than a single server
can handle.
 Scale Big. MongoDB is built to scale out on commodity
hardware, in your own data center or in the cloud, serving
a lot of users and sensor data without extra software.
 The need to analyze rapidly changing,
multi-structured data in real time.
There is no room for the luxury of lengthy
ETL processes to cleanse data for downstream
reporting.
 Signal vs Noise. MongoDB can analyze data of any structure.
It can do so directly within the database, giving results in
real time, and without expensive data warehouse loads.

09
MongoDB Logical Data Model
Data as Documents
 A document is a set of key-value pairs.
 Documents that share a similar structure are typically organized as a collection.
 Fields can vary from document to document; there is no need to declare
the structure of documents to the system.
 If a new field needs to be added to a document, it can be created without
affecting any other document and without updating a central system catalog.

RDBMS comparison
 Table:
Collections can be thought of
as being analogous to tables.
 Rows:
Documents are similar to rows.
 Columns:
Fields are similar to columns.
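As a minimal sketch of documents whose fields vary within one collection, the plain JavaScript below models a collection as an array of objects. The collection name readings and all field names are made up for illustration; no MongoDB server is involved.

```javascript
// A collection modelled as a plain array of documents (no declared schema).
const readings = [
  { _id: 1, sensor: "t-01", temperature: 21.5 },
  { _id: 2, sensor: "h-07", humidity: 40, battery: 0.83 },
];

// Adding a new field touches only the document that needs it.
readings[0].location = "lab";

// Each document simply reports its own set of fields.
const fieldsPerDocument = readings.map((doc) => Object.keys(doc));
console.log(fieldsPerDocument);
```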

10
Data as Documents
 MongoDB stores data in documents using a binary representation called BSON (Binary JSON).
Unlike plain JSON, BSON carries explicit type information and adds types such as date-time and ObjectId.

ObjectId
 Data types shown in the example document: string, number, date-time, ObjectId, struct.
 An ObjectId is made up of:
a 4-byte value representing the seconds since the Unix epoch
a 3-byte machine identifier
a 2-byte process id
a 3-byte counter, starting with a random value
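The ObjectId layout described on this slide (4 + 3 + 2 + 3 bytes) can be sketched as a small parser. This is an illustration under that stated layout only: the function name parseObjectId and the sample value are assumptions, not part of any MongoDB API.

```javascript
// Split a 24-character hex ObjectId into the four parts of the slide's layout.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("an ObjectId is 12 bytes, i.e. 24 hex characters");
  }
  return {
    seconds: parseInt(hex.slice(0, 8), 16),     // 4 bytes: seconds since the Unix epoch
    machine: hex.slice(8, 14),                  // 3 bytes: machine identifier
    processId: parseInt(hex.slice(14, 18), 16), // 2 bytes: process id
    counter: parseInt(hex.slice(18, 24), 16),   // 3 bytes: counter
  };
}

// Arbitrary sample value, used only to exercise the parser.
const parts = parseObjectId("507f1f77bcf86cd799439011");
console.log(parts, new Date(parts.seconds * 1000).toISOString());
```

Because the first four bytes are a timestamp, ObjectIds created later compare greater on their leading bytes, which is why they sort roughly by creation time.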

11
Data as Documents
 MongoDB documents tend to have all data for a given record in a single document,
whereas in a relational database information for a given record is usually spread across many tables.
EXAMPLE
Suppose we design a database for a blog/website with the following requirements:
 Every post has a unique title, description and url
 Every post can have one or more tags
 Every post has the name of its publisher and the total number of likes
 Every post has comments by users, with their name, message, date-time and likes
 On each post, there can be zero or more comments

RDBMS VS MongoDB
 An RDBMS requires a minimum of 3 tables for this design.
 MongoDB needs a single document.
 Embedding dramatically reduces the need to JOIN separate tables.
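One possible shape for such a single document is sketched below as a plain JavaScript object. All field names and values are illustrative assumptions that follow the blog requirements; they are not taken from the original slides.

```javascript
// One document holds the post together with its tags and comments.
const post = {
  title: "MongoDB Overview",
  description: "An introduction to MongoDB",
  url: "http://example.com/mongodb-overview",
  tags: ["mongodb", "database", "nosql"],
  by: "publisher name",
  likes: 100,
  comments: [
    { user: "user1", message: "First comment", dateTime: new Date("2016-05-20T10:30:00Z"), likes: 0 },
    { user: "user2", message: "Second comment", dateTime: new Date("2016-05-21T08:15:00Z"), likes: 5 },
  ],
};

// No JOIN is needed: related data is reached by navigating the document.
const commentAuthors = post.comments.map((c) => c.user);
const totalLikes = post.likes + post.comments.reduce((sum, c) => sum + c.likes, 0);
console.log(commentAuthors, totalLikes);
```

In a relational design the tags and comments would typically live in their own tables, linked to the post by foreign keys.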

12
MongoDB Storage Data Structure
 Every MongoDB instance (with the MMAPv1 storage engine) consists of a namespace file, journal files and data files.
Data files
 The data files store BSON documents, indexes and MongoDB-generated metadata in structures called extents.

Extents
 Extents are containers within data files used to store documents and indexes.
 Data and indexes are each contained in their own sets of extents.
 No extent will ever contain content for more than one collection.
 Data and indexes are never contained within the same extent.
 Data and indexes for a collection will usually span multiple extents.
 When a new extent is needed, MongoDB attempts to use available
space within the current data files; if none can be found, a new data file is created.

13
MongoDB Storage metrics
dataSize
 The dataSize metric is the sum of the sizes (in bytes) of all documents and padding stored in the database.
 record = document + padding: in MongoDB every document is stored
in a record, which contains the document itself and extra space called padding.
 Padding: a document is allocated with additional space, a strategy that makes
updates more efficient.
 While dataSize decreases when documents are deleted, it does not when documents shrink,
because the space used by the original document has already been allocated and cannot be used by others.
 If a document is updated with more data, dataSize remains the same as long as the
new document fits within its originally padded, pre-allocated space.

14
storageSize
 The storageSize metric is equal to the sum of the sizes (in bytes) of all data extents in the database.
 The storageSize is larger than dataSize because it includes yet-unused space (in data extents)
and space vacated by deleted or moved documents within extents.
 The storageSize does not decrease as you remove or shrink documents.

15
fileSize
 The fileSize metric is equal to the sum of the sizes (in bytes) of all data extents, index extents
and yet-unused space (in data files) in the database.
 This metric represents the storage footprint of the database on disk.
 It does not decrease when collections, documents or indexes are removed;
it decreases when a database is deleted.
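The relation record = document + padding, and its effect on dataSize, can be sketched with a toy model. The padding factor of 1.5 and the byte sizes are invented for illustration and do not reflect MongoDB's actual allocation strategy.

```javascript
// Toy model: every record reserves the document size plus some padding.
function allocateRecord(docBytes, paddingFactor) {
  return { docBytes, recordBytes: Math.ceil(docBytes * paddingFactor) };
}

// dataSize in this model: the sum of all record sizes (documents + padding).
function dataSize(records) {
  return records.reduce((sum, r) => sum + r.recordBytes, 0);
}

// An update fits in place while the new document is no larger than its record.
function fitsInPlace(record, newDocBytes) {
  return newDocBytes <= record.recordBytes;
}

const records = [allocateRecord(100, 1.5), allocateRecord(200, 1.5)];
const before = dataSize(records);

// The first document grows from 100 to 140 bytes but still fits in its record.
records[0].docBytes = 140;
const after = dataSize(records);
console.log(before, after, fitsInPlace(records[0], 151));
```

In this model a document that outgrows its record would have to be moved to a new, larger record, which is the case where storageSize keeps the vacated space.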

16
MongoDB Distribution
 MongoDB provides horizontal scale-out for databases on low-cost, commodity hardware
using a technique called “sharding”.
SHARDING technique
 Sharding distributes data across multiple physical partitions called “shards”.
Data is automatically balanced across multiple servers as it grows,
addressing the hardware limitations of a single server.
 Sharding is transparent to applications: whether there is one shard or one hundred,
the application code for querying MongoDB is the same.

 Multiple copies of the data can also be kept on different database servers, which provides redundancy:
Recovery from hardware failure
and service interruptions.
Protection from
the loss of a single server.
Less need for separate backups,
since one of the copies can act as a backup.

17
 A MongoDB sharded cluster consists of the following components:
shard: each shard contains a subset of the sharded data;
each shard can be deployed as a replica set.
mongos: the mongos acts as a query router, providing
an interface between client applications and
the sharded cluster.
config servers: store metadata and
configuration settings for the cluster.
The query router uses this metadata to target
operations to specific shards.
 Shard keys: to distribute the documents in a collection, MongoDB partitions the collection
using the shard key, which consists of an immutable field or fields that exist in
every document in the target collection.
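How a query router might use range metadata to target a shard can be sketched with a toy model. The chunk ranges, shard names and the function routeByShardKey are hypothetical; this is a simplified illustration of range-based routing, not the actual mongos implementation.

```javascript
// Hypothetical chunk metadata, of the kind config servers might hold:
// each chunk covers the shard-key range [min, max) and lives on one shard.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 5000, shard: "shardB" },
  { min: 5000, max: Infinity, shard: "shardC" },
];

// Toy query router: pick the shard whose chunk range contains the shard key.
function routeByShardKey(metadata, shardKeyValue) {
  const chunk = metadata.find((c) => shardKeyValue >= c.min && shardKeyValue < c.max);
  return chunk.shard;
}

// The application only supplies the document; it never names a shard.
const doc = { sensorId: 1234, temperature: 21.5 };
console.log(routeByShardKey(chunks, doc.sensorId));
```

Since routing depends only on the shard key value, the key has to exist in every document, which matches the requirement stated above.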

18
MongoDB Query Language
 To query data from a collection, MongoDB provides the find() method.
>> db.collection_name.find()

 The conditions for the query must be specified as arguments of the find() method.
>> db.collection_name.find( conditions )

 The conditions are expressed as key : value pairs.
>> db.collection_name.find({ key : value })

19
 Depending on the condition it is possible to obtain the equivalent of an SQL query.
OPERATION: Selection from one collection
MONGODB SYNTAX: db.collection.find()
SQL EQUIVALENT: SELECT * FROM collection

OPERATION: Selection for equality
MONGODB SYNTAX: db.collection.find( { name : "John" } )
SQL EQUIVALENT: SELECT * FROM collection WHERE name = 'John'

OPERATION: Selection for a value less than
MONGODB SYNTAX: db.collection.find( { num : { $lt : 50 } } )
SQL EQUIVALENT: SELECT * FROM collection WHERE num < 50

OPERATION: Selection for a value not equal to
MONGODB SYNTAX: db.collection.find( { val : { $ne : 0 } } )
SQL EQUIVALENT: SELECT * FROM collection WHERE val != 0

OPERATION: Projection
MONGODB SYNTAX: db.collection.find( {} , { name: 1, job: 1 } )
SQL EQUIVALENT: SELECT name, job FROM collection

OPERATION: Projection and Selection
MONGODB SYNTAX: db.collection.find( { age : { $gte : 18 } } , { name: 1 } )
SQL EQUIVALENT: SELECT name FROM collection WHERE age >= 18

OPERATION: AND
MONGODB SYNTAX: db.collection.find( { $and: [ { job: "employee" } , { age: { $gte : 65 } } ] } )
SQL EQUIVALENT: SELECT * FROM collection WHERE job = 'employee' AND age >= 65

OPERATION: OR
MONGODB SYNTAX: db.collection.find( { $or: [ { job: "employee" } , { job: "freelancer" } ] } )
SQL EQUIVALENT: SELECT * FROM collection WHERE job = 'employee' OR job = 'freelancer'
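To make the semantics of these query documents concrete, here is a toy in-memory evaluator in plain JavaScript. It is a sketch of how a small subset of the operators could be interpreted ($lt, $gte, $ne, $and, $or and inclusion projection), not MongoDB's implementation; the function names and sample data are assumptions.

```javascript
// Toy evaluator for a small subset of the query operators shown in the table.
const operators = {
  $lt: (fieldValue, arg) => fieldValue < arg,
  $gte: (fieldValue, arg) => fieldValue >= arg,
  $ne: (fieldValue, arg) => fieldValue !== arg,
};

function matches(doc, query) {
  return Object.entries(query).every(([key, condition]) => {
    if (key === "$and") return condition.every((q) => matches(doc, q));
    if (key === "$or") return condition.some((q) => matches(doc, q));
    if (condition !== null && typeof condition === "object") {
      // { field: { $op: arg, ... } }: every listed operator must hold
      return Object.entries(condition).every(([op, arg]) => operators[op](doc[key], arg));
    }
    return doc[key] === condition; // { field: value } means equality
  });
}

// Projection: keep only the fields flagged with 1.
// (Unlike MongoDB, this toy version has no implicit _id field.)
function project(doc, fields) {
  const out = {};
  for (const name of Object.keys(fields)) {
    if (fields[name] === 1 && name in doc) out[name] = doc[name];
  }
  return out;
}

function find(collection, query = {}, fields = null) {
  const selected = collection.filter((doc) => matches(doc, query));
  return fields ? selected.map((doc) => project(doc, fields)) : selected;
}

const people = [
  { name: "John", job: "employee", age: 66 },
  { name: "Mary", job: "freelancer", age: 30 },
  { name: "Luke", job: "student", age: 17 },
];

console.log(find(people, { age: { $gte: 18 } }, { name: 1 }));
console.log(find(people, { $or: [{ job: "employee" }, { job: "freelancer" }] }));
```

The empty query {} matches every document because there is no condition to fail, which mirrors SELECT * without a WHERE clause.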

20
MongoDB CRUD Concepts
 C
Create or insert operation:
it adds new documents to a collection.
It can be seen as the equivalent of INSERT in SQL.
 R
Read operation: a way to query a collection for documents.
As seen before, there are ways to do selection, projection, etc.
 U
Update operation: it modifies existing documents in a collection.
A filter can be set to identify the documents to update.
 D
Delete operation: it removes documents from a collection.
A filter can be set to delete only the specified documents;
otherwise all of them will be deleted.
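The four operations can be sketched with a toy in-memory collection. The class ToyCollection and its method names are hypothetical, and filters are plain predicate functions rather than MongoDB query documents; the point is only the role of the filter in update and delete.

```javascript
// Toy in-memory collection; filters are plain predicate functions here.
class ToyCollection {
  constructor() {
    this.docs = [];
  }
  insert(doc) {                      // Create
    this.docs.push({ ...doc });
  }
  find(filter = () => true) {        // Read
    return this.docs.filter(filter);
  }
  update(filter, changes) {          // Update: only the matching documents
    this.find(filter).forEach((doc) => Object.assign(doc, changes));
  }
  remove(filter = () => true) {      // Delete: without a filter, everything goes
    this.docs = this.docs.filter((doc) => !filter(doc));
  }
}

const sensors = new ToyCollection();
sensors.insert({ name: "t-01", value: 20 });
sensors.insert({ name: "t-02", value: 35 });
sensors.update((d) => d.name === "t-01", { value: 22 });
sensors.remove((d) => d.value > 30);
console.log(sensors.find());
```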

21
How ACID is MongoDB ?
Atomicity & Isolation
 In MongoDB a write operation is atomic at the level of a single document. When a single write
operation modifies multiple documents, the modification of each document is atomic.
 The operation as a whole is not atomic, so other operations can interleave.
However, it is possible to isolate a single write operation with the $isolated operator.
 The $isolated operator prevents other processes from interleaving once the write operation has modified the first document.
This ensures that no one sees the changes until the write operation completes or errors out.
 However, an isolated write operation does not provide “all-or-nothing” atomicity:
an error during the write operation does not roll back the changes that preceded the error.
 The $isolated operator causes write operations to acquire an exclusive lock on the collection.
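The difference between per-document atomicity and “all-or-nothing” atomicity can be sketched with a toy multi-document update. The function updateMany and the sample data are assumptions made for illustration; this models the behaviour described above and is not MongoDB code.

```javascript
// Toy multi-document write: each document is changed atomically,
// but an error part-way through does not undo the earlier changes.
function updateMany(docs, change) {
  let modified = 0;
  try {
    for (const doc of docs) {
      const updated = change({ ...doc }); // work on a copy: all-or-nothing per document
      Object.assign(doc, updated);
      modified += 1;
    }
  } catch (err) {
    return { modified, error: err.message };
  }
  return { modified, error: null };
}

const accounts = [
  { id: 1, balance: 10 },
  { id: 2, balance: 0 },
  { id: 3, balance: 5 },
];

// The second document makes the operation fail after the first was already changed.
const result = updateMany(accounts, (doc) => {
  if (doc.balance === 0) throw new Error("cannot update empty account");
  doc.balance += 1;
  return doc;
});
console.log(result, accounts);
```

After the failure the first document keeps its new value and the third is never reached: no rollback takes place.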

22
How ACID is MongoDB ?
Consistency
 Even in replica set configurations, all writes are directed to the primary Mongo server,
so single-server consistency is easy to guarantee.
 The secondary nodes may be out of date with respect to the primary, as eventual consistency only
guarantees that, after a long enough period with no writes, they will catch up with the primary.
 By default the secondary servers do not answer reads. Read traffic can be distributed to them at the price
of possible inconsistency; it is a configuration choice.
Durability
 Durability of writes is the biggest issue with MongoDB.
 SQL DBs typically commit after every write operation. In MongoDB this does not happen, by choice of the developers:
they argue that in many scenarios the OS does not write the file to disk even after syncing (hardware buffering), and
that the time spent waiting for recovery would impact availability.
 So if the server crashes, writes accepted after the last commit will be lost.

23
MongoDB summing up
Is MongoDB a good choice for IOT ?

 Scalability
Horizontal scaling makes it easy to scale: multiple servers can be added when needed,
and the sharding technique balances data across the servers.

 Operational “real-time” queries and analytics
Sacrificing the ACID properties allows faster operations, and because much of the related data
is inside the same document, expensive JOIN operations are not required.

 Spatio-temporal data
MongoDB offers a number of indexes and query mechanisms to handle geospatial information.
Location data is stored as GeoJSON objects.
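A sensor reading with a GeoJSON location might look like the sketch below. The field names, the coordinates and the helper isGeoJsonPoint are illustrative assumptions; the check covers only two-dimensional points.

```javascript
// A sensor reading whose position is a GeoJSON Point.
// GeoJSON lists coordinates as [longitude, latitude].
const reading = {
  sensor: "t-01",
  temperature: 21.5,
  location: { type: "Point", coordinates: [12.4964, 41.9028] }, // illustrative coordinates (Rome)
};

// Minimal validity check for a two-dimensional GeoJSON Point.
function isGeoJsonPoint(value) {
  if (!value || value.type !== "Point" || !Array.isArray(value.coordinates)) return false;
  const [lon, lat] = value.coordinates;
  return value.coordinates.length === 2 && Math.abs(lon) <= 180 && Math.abs(lat) <= 90;
}

console.log(isGeoJsonPoint(reading.location));
```

Swapping longitude and latitude is a common mistake; a latitude outside the range -90 to 90 is rejected by the check above.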

 Schema flexibility and high variety of data
The schema is free: it can change during write operations, and changes can affect one or more documents.
This is the strong point for IOT: sensor data can be represented as a field with its respective value,
so it is stored in its natural form.

24
MongoDB simulation
>> use myDB
>> show dbs
>> show collections
>> db.collection.find()
>> db.collection.insert({ field: "value" })
>> db.collection.find().explain("executionStats")
>> db.collection.find( { field: "value" } )
>> db.collection.save({ field: "value" })
>> db.collection.update({}, { $set: { field: "new value" } })
>> db.collection.mapReduce(mapFunction, reduceFunction, { out: { inline: 1 } })

END
Any questions ???