Concepts of MongoDB
Juan Antonio Roy Couto 
Twitter: @juanroycouto 
Website: www.juanroy.es 
September 2014
Contents
Why?
Characteristics
Who?
DB Ranking
Shell
Drivers
Utilities
Community
Terms
Failover
Replication
Schema design
Replica Set
Indexes
Sharding
Pre-splitting
Questions?
Why?
Apps, Internet of Things, wearables, smart cities, cloud computing
● Horizontal scalability
● Real-time analytics
● Better strategic decisions
● Unstructured data
● Reduced costs and time to market
● Faster development
Who?
Who provides MongoDB in the cloud?
http://www.mongodb.com/partners/list
Who is using MongoDB?
http://www.mongodb.com/who-uses-mongodb
DB Ranking 
http://db-engines.com/en/ranking
Community
● 8 million+ downloads
● 200k+ education registrations
● 30k+ MongoDB User Group members
Drivers
http://docs.mongodb.org/ecosystem/drivers/
(Diagram: App, Driver, MongoDB)
● C
● C++
● C#
● Java
● Node.js
● Perl
● PHP
● Python
● Ruby
● Scala
Characteristics
http://www.mongodbspain.com/en/2014/08/17/mongodb-characteristics-future/
● General purpose NoSQL database
● Document oriented (stores data as documents in BSON, Binary JSON)
● Schemaless (dynamic schema)
● Open source
● High availability (replica sets)
● Horizontal scalability (commodity servers)
● Aggregation framework
● Map Reduce
● Hadoop connector (for processing large volumes of data in batch)
● Native replication
● Auto sharding & load balancing
● Security
● Automatic failover
● JSON objects
● MMS (continuous monitoring in the cloud)
● Geospatial queries
● In-memory performance
● ACID compliant at the document level
Advanced characteristics
● GridFS (diagram: Chunk 1, Chunk 2, Chunk 3)
● TTL (special indexes that MongoDB can use to automatically remove documents from a collection after a certain amount of time)
● Capped collections
● Index intersection
● ...
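The effect of a TTL index can be illustrated with a small simulation. This is only a sketch of the observable behaviour (documents disappear once a date field is older than a configured number of seconds); it does not use MongoDB, and the function name `expire_documents`, the field name `created_at` and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expire_documents(docs, field, expire_after_seconds, now):
    """Keep only the documents whose `field` timestamp is still inside the TTL window.

    This mimics what a TTL index achieves; it is not how MongoDB implements it.
    """
    cutoff = now - timedelta(seconds=expire_after_seconds)
    return [doc for doc in docs if doc[field] > cutoff]


# Hypothetical session documents, timestamped relative to a fixed "now".
now = datetime(2014, 9, 1, 12, 0, tzinfo=timezone.utc)
sessions = [
    {"user": "ana", "created_at": now - timedelta(minutes=5)},
    {"user": "luis", "created_at": now - timedelta(hours=2)},
]

# Keep documents for one hour (3600 seconds) after `created_at`.
alive = expire_documents(sessions, "created_at", 3600, now)
print([doc["user"] for doc in alive])  # prints ['ana']
```

In a real deployment the removal is done by the server in the background, so the application does not need a cleanup job like this one.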
Shell
● Administrative tasks
● Full featured
● JavaScript interpreter
● Standalone MongoDB client
● Allows interaction with a MongoDB instance from the command line
Utilities
MongoDB tools for backup:
● mongoexport: utility that generates a JSON or CSV file of data from a MongoDB instance
● mongoimport: imports content from a JSON, CSV or TSV export
● mongodump: utility for creating a binary export
● mongorestore: writes data to a MongoDB instance from a binary file
MongoDB tools for tracking instances:
● mongostat: provides a quick overview of the status of a running mongod or mongos instance
● mongotop: tracks the amount of time a MongoDB instance spends reading and writing data, with statistics at a per-collection level. By default, mongotop returns values every second
Basic terms to know 
MongoDB SQL 
database database 
collection table 
document row 
field column 
embedding join
Geospatial indexes
MongoDB has two types of indexes for supporting geographical queries:
● 2d indexes: for calculations on a flat surface
● 2dsphere indexes: for calculations on an Earth-like sphere
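The difference between the two kinds of calculation can be sketched in plain Python. This is not MongoDB code: `flat_distance` and `sphere_distance_km` are hypothetical helper names, the Earth radius is an approximation, and the coordinates are rough values for Madrid and Barcelona used only as an example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def flat_distance(p, q):
    """Euclidean distance between two (x, y) points on a plane (the flat-surface case)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def sphere_distance_km(p, q):
    """Great-circle (haversine) distance between two (longitude, latitude) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Flat surface: plain coordinates, plain geometry.
print(flat_distance((0, 0), (3, 4)))  # prints 5.0

# Earth-like sphere: (longitude, latitude) pairs, distance in kilometres.
madrid = (-3.7038, 40.4168)
barcelona = (2.1734, 41.3851)
print(round(sphere_distance_km(madrid, barcelona)))
```

Treating longitude and latitude as flat coordinates would give a distance in "degrees" that is wrong by a factor depending on the latitude, which is why a spherical index exists for real-world locations.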
SQL Schema Design
Tables:
● Customers: Customer key, First name, Last name, Phone number
● Addresses: Address key, Customer key, Street, Number, Location, Postal Code
● Pets: Pet key, Customer key, Type, Breed, Name, Age
MongoDB Schema Design
Customers collection: customer info, address and pets in a single document
> db.customers.findOne() 
{
    "_id" : ObjectId("54131863041cd2e6181156ba"),
    "first_name" : "Peter",
    "last_name" : "Keil",
    "phone_number" : 619123456,
    "address" : {
        "street" : "C/Alcalá",
        "number" : 123,
        "location" : "Madrid",
        "postal_code" : 12345
    },
    "pets" : [
        {
            "type" : "Dog",
            "breed" : "Airedale Terrier",
            "name" : "Linda",
            "age" : 2
        },
        {
            "type" : "Dog",
            "breed" : "Akita",
            "name" : "Bruto",
            "age" : 10
        }
    ]
}
> 
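The relationship between the SQL design and the document design can be sketched as a small transformation: rows from three tables, linked by a customer key, are folded into one embedded document. The sketch below uses plain Python lists and dicts as stand-ins for tables; the helper names `without_keys` and `embed_customer` and the key names are hypothetical.

```python
# Stand-ins for the three relational tables, linked by customer_key.
customers = [{"customer_key": 1, "first_name": "Peter", "last_name": "Keil"}]
addresses = [{"address_key": 10, "customer_key": 1, "street": "C/Alcalá", "location": "Madrid"}]
pets = [
    {"pet_key": 100, "customer_key": 1, "type": "Dog", "name": "Linda"},
    {"pet_key": 101, "customer_key": 1, "type": "Dog", "name": "Bruto"},
]


def without_keys(row, *keys):
    """Copy a row, dropping the surrogate and foreign keys that embedding makes unnecessary."""
    return {k: v for k, v in row.items() if k not in keys}


def embed_customer(customer):
    """Build one document holding the customer, its address and its pets."""
    key = customer["customer_key"]
    doc = without_keys(customer, "customer_key")
    address = next((a for a in addresses if a["customer_key"] == key), None)
    if address is not None:
        doc["address"] = without_keys(address, "address_key", "customer_key")
    doc["pets"] = [
        without_keys(p, "pet_key", "customer_key")
        for p in pets
        if p["customer_key"] == key
    ]
    return doc


document = embed_customer(customers[0])
print(document["address"]["location"])        # prints Madrid
print([p["name"] for p in document["pets"]])  # prints ['Linda', 'Bruto']
```

Once the data are stored in this shape, reading a customer with its address and pets is a single lookup rather than a join across three tables.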
Replication
(Diagram: a Replica Set with a Primary, Secondary 1 and Secondary 2)
● High availability
● Data safety
● Read preference
● Asynchronous
● Single primary
● Statement based
● Master-slave
● Automatic failover
● Automatic node recovery
Failover scenario
(Diagram: a Replica Set with a Primary, Secondary 1 and Secondary 2, shown before and after the failover)
1) Primary goes down
2) New election (majority of the set)
3) Primary comes back (now as secondary)
4) The new primary assumes replication tasks
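The majority rule in step 2 can be sketched as follows. This is a deliberately simplified model, not MongoDB's election protocol: it ignores priorities and how up to date each member is, and it picks the new primary with an arbitrary alphabetical tie-break. The names `majority`, `elect` and the node names are hypothetical.

```python
def majority(total_members):
    """Smallest number of members that is more than half of the set."""
    return total_members // 2 + 1


def elect(members, down):
    """Return the name of a new primary, or None if no majority of the set is reachable."""
    reachable = [m for m in members if m not in down]
    if len(reachable) < majority(len(members)):
        return None  # not enough voters: the set stays without a primary
    return sorted(reachable)[0]  # arbitrary tie-break, an assumption of this sketch


members = ["node-a", "node-b", "node-c"]  # node-a is the current primary

print(elect(members, down={"node-a"}))            # prints node-b
print(elect(members, down={"node-a", "node-b"}))  # prints None
```

The second call shows why a three-member set tolerates one failure but not two: a single surviving node cannot form a majority on its own.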
Failover scenario with rollback
(Diagram: a Replica Set with a Primary, Secondary 1 and Secondary 2, shown before and after the failover; labels: Rollback, Hard Disk, mongorestore)
Replica Set principles
● A write is truly committed once it has been applied on a majority of the set
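This principle can be made concrete with a small bookkeeping sketch: given which writes each member has applied, work out which writes have reached a majority. It is an illustration only; the function name `majority_committed`, the member names and the write ids are hypothetical.

```python
def majority_committed(write_ids, applied_by_member):
    """Return the writes that have been applied by more than half of the members."""
    needed = len(applied_by_member) // 2 + 1
    return [
        w for w in write_ids
        if sum(w in applied for applied in applied_by_member.values()) >= needed
    ]


# Which write ids each member has applied so far (replication is asynchronous).
applied = {
    "primary": {1, 2, 3},
    "secondary-1": {1, 2},
    "secondary-2": {1},
}

committed = majority_committed([1, 2, 3], applied)
print(committed)  # prints [1, 2]
```

Write 3 exists only on the primary in this example, so if the primary failed at this point it is the kind of write that would be rolled back when the node rejoins the set.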
Replica Set: read preference
Reasons:
● Geographically dispersed nodes
● Separating a workload
● Availability
Types:
● Primary
● Primary preferred
● Secondary
● Secondary preferred
● Nearest
● Tags
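How the read preference types differ can be sketched with a small selection function. This is a simplified model and not a driver implementation: it always picks the lowest-latency candidate, and it leaves tags out entirely. The function name `choose_node`, the node names and the latency figures are hypothetical.

```python
def choose_node(nodes, mode):
    """Pick the node a read would go to under the given read preference mode."""
    up = [n for n in nodes if n["up"]]
    primaries = [n for n in up if n["role"] == "primary"]
    secondaries = [n for n in up if n["role"] == "secondary"]
    candidates_by_mode = {
        "primary": primaries,
        "primaryPreferred": primaries or secondaries,
        "secondary": secondaries,
        "secondaryPreferred": secondaries or primaries,
        "nearest": up,
    }
    candidates = candidates_by_mode[mode]
    if not candidates:
        return None  # no eligible member: the read cannot be served
    # Simplification: take the member with the lowest measured latency.
    return min(candidates, key=lambda n: n["ping_ms"])["name"]


nodes = [
    {"name": "node-a", "role": "primary", "up": True, "ping_ms": 20},
    {"name": "node-b", "role": "secondary", "up": True, "ping_ms": 5},
    {"name": "node-c", "role": "secondary", "up": True, "ping_ms": 50},
]

print(choose_node(nodes, "primary"))  # prints node-a
print(choose_node(nodes, "nearest"))  # prints node-b

# With the primary down, "primary" fails but "primaryPreferred" falls back.
primary_down = [dict(n, up=(n["role"] != "primary")) for n in nodes]
print(choose_node(primary_down, "primary"))           # prints None
print(choose_node(primary_down, "primaryPreferred"))  # prints node-b
```

The fallback behaviour in the last two calls is the availability reason listed on the slide: a preferred mode keeps reads working while the set has no primary.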
Sharding
(Diagram: a CLUSTER in which Clients connect to Query routers, which route to Shard 0, Shard 1, Shard 2, ..., Shard N-1; each shard is a replica set with one Primary and two Secondaries; there are three Config servers)
Sharding: concepts
Data are distributed uniformly across the shards using the shard key
Each shard holds the documents that belong to its own range
Sharding improves efficiency, and therefore performance, because queries are routed only to the shards where the data reside
Sharding: metadata
The config servers hold the config database, which contains the cluster metadata
Metadata describe what is in the cluster and what is contained in each shard
They are a map of the data itself
Range-based partitioning
Shard key: lastname
Range    Low     High       Shard
Range 0  Martín  Pérez      0
Range 1  Pérez   Rodriguez  1
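The range table can be read as a lookup: given a value of the shard key, find the range that contains it and return its shard. The sketch below assumes each range includes its lower bound and excludes its upper bound, and it relies on Python's default string comparison, which may order accented names differently from the database. `RANGES` and `shard_for` are hypothetical names.

```python
# (low inclusive, high exclusive, shard), copied from the table on the slide.
RANGES = [
    ("Martín", "Pérez", 0),
    ("Pérez", "Rodriguez", 1),
]


def shard_for(lastname):
    """Return the shard whose range contains `lastname`, or None if no range covers it."""
    for low, high, shard in RANGES:
        if low <= lastname < high:
            return shard
    return None


print(shard_for("Navarro"))   # prints 0
print(shard_for("Pérez"))     # prints 1
print(shard_for("Quintana"))  # prints 1
print(shard_for("Zapata"))    # prints None
```

A router holding this map only needs to contact the one shard returned for a query on `lastname`, which is the efficiency gain described in the sharding concepts.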
Sharding: chunks, split and migrate
Chunk: a subset of the data covering one range. Approximately 1 chunk per 64 MB (the default chunk size)
Split: runs in the background. When a chunk grows beyond 64 MB it is split into two equal chunks
Migrate: runs in the background. It moves chunks across the shards in order to achieve balance
MongoDB's goal is a uniform data distribution across all the shards
MongoDB balances the number of chunks per shard (not documents or bytes)
By default all collections belong to shard 0
An empty collection has only one chunk (on shard 0)
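The split rule can be sketched with a chunk modelled as a sorted list of (shard key, size) pairs. This is an illustration under stated assumptions, not the server's algorithm: the size limit is a parameter (64 MB was MongoDB's default chunk size at the time), the split point is simply the middle document, and the names `chunk_size_mb` and `split_chunk` are hypothetical.

```python
def chunk_size_mb(chunk):
    """Total size of a chunk given as a list of (shard_key, size_mb) pairs."""
    return sum(size for _key, size in chunk)


def split_chunk(chunk, max_mb=64):
    """Split an oversized chunk in two at its middle document; otherwise leave it alone."""
    if chunk_size_mb(chunk) <= max_mb or len(chunk) < 2:
        return [chunk]
    middle = len(chunk) // 2
    return [chunk[:middle], chunk[middle:]]


# Eight documents of 10 MB each: 80 MB in total, over the 64 MB limit.
big = [(key, 10) for key in "abcdefgh"]
halves = split_chunk(big)
print([chunk_size_mb(c) for c in halves])  # prints [40, 40]

# A 30 MB chunk stays as it is.
small = [(key, 10) for key in "xyz"]
print(len(split_chunk(small)))  # prints 1
```

Splitting only changes how ranges are described; moving one of the resulting chunks to another shard is a separate step.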
Sharding: chunks, split and migrate (2)
(Diagram: mongos in front of Shard 0 and Shard 1; chunk 0 before the split, chunk 0 and chunk 1 after it)
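Balancing by chunk count can be sketched as a loop that moves one chunk at a time from the shard with the most chunks to the shard with the fewest. This is a toy model: the real balancer uses its own thresholds and scheduling, and the function name `balance`, the `threshold` parameter and the shard names are hypothetical.

```python
def balance(chunks_per_shard, threshold=1):
    """Migrate chunks until the fullest and emptiest shards differ by at most `threshold`.

    Returns the final chunk counts and the list of (from_shard, to_shard) migrations.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1, or the loop could never settle")
    counts = dict(chunks_per_shard)
    migrations = []
    while True:
        fullest = max(counts, key=counts.get)
        emptiest = min(counts, key=counts.get)
        if counts[fullest] - counts[emptiest] <= threshold:
            return counts, migrations
        counts[fullest] -= 1
        counts[emptiest] += 1
        migrations.append((fullest, emptiest))


# All six chunks start on shard0, as happens with a newly sharded collection.
counts, migrations = balance({"shard0": 6, "shard1": 0, "shard2": 0})
print(counts)           # prints {'shard0': 2, 'shard1': 2, 'shard2': 2}
print(len(migrations))  # prints 4
```

Note that the loop counts chunks only; it never looks at how many documents or bytes each chunk holds.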
Pre-splitting
● Used in batch/bulk loads
● No splits or migrations take place during the load
● Metadata are not altered
● Data are stored directly in their shard
(Diagram: mongos sends data to Shard 0, Shard 1 and Shard 2)
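The idea of pre-splitting can be sketched as planning the chunk ranges before any data arrive. In this sketch the split points are chosen by hand, chunks are assigned to shards round-robin, and `None` stands for the open lower and upper ends of the key space; the function name `presplit` and the shard names are hypothetical.

```python
def presplit(split_points, shards):
    """Turn split points into chunk ranges and assign them round-robin to shards."""
    bounds = [None] + sorted(split_points) + [None]  # None marks an open end
    chunks = []
    for i in range(len(bounds) - 1):
        chunks.append({
            "min": bounds[i],
            "max": bounds[i + 1],
            "shard": shards[i % len(shards)],
        })
    return chunks


plan = presplit(["H", "P"], ["shard0", "shard1", "shard2"])
for chunk in plan:
    print(chunk)
```

With the ranges already in place on every shard, a bulk load can write each document straight to its destination instead of filling one shard and waiting for splits and migrations to spread the data.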
Summary
Designed to be:
● Fast (no joins, in-memory performance),
● Flexible (schemaless),
● Scalable (horizontal vs vertical),
● Easy to learn
Designed to:
● Reduce administrative tasks (replica set, sharding, disaster recovery)
With powerful:
● Analysis tools (aggregation framework, map reduce, Hadoop connector),
● Characteristics such as geospatial indexes, GridFS, etc.
Questions? 
Any questions? 
Thank you for your attention!
Juan Antonio Roy Couto
Email: juanroycouto@gmail.com
September 2014

Editor's Notes

  • #4 NoSQL emerged because of globalization: applications need very high read and write rates, support for large volumes of data, maximum availability, many requests, ... Performance. Reliability. Scalability. Replica sets. Sharding. Clusters. Automatic load balancing. Fewer of the typical database administration tasks (list which ones and why). Faster time to production, because product development time is shorter. NoSQL means "Not only SQL". It arises at the point where the relational model can no longer meet today's needs for storing and processing the huge amount of data now being generated (IoT, social networks, ...). The data generated today are multidisciplinary and do not follow a fixed schema.
  • #5 MongoDB does not expect anyone to change their database if it already gives them performance and reliability they are satisfied with. Instead, it focuses its efforts on small companies or startups taking on new projects, and on companies of any size that want or need to improve the performance of an application already in production. BBVA, Telefónica, Santander, ...
  • #6 Because it is the market-leading non-relational database.
  • #9 Open-source db used by companies of all sizes, across all industries and for a wide variety of applications. It is an agile database that allows schemas to change quickly as applications evolve, while still providing the functionality developers expect from traditional databases, such as secondary indexes, a full query language and strict consistency. MongoDB is built for scalability, performance and high availability, scaling from single server deployments to large, complex multi-site architectures. By leveraging in-memory computing, MongoDB provides high performance for both reads and writes. MongoDB’s native replication and automated failover enable enterprise-grade reliability and operational flexibility. Horizontal Scalability. As the data volume and throughput grow, developers can take advantage of commodity hardware and cloud infrastructure to increase the capacity of the MongoDB system. High Availability. Multiple copies of data are maintained with native replication. Automatic failover to secondary nodes, racks and data centers makes it possible to achieve enterprise- grade uptime without custom code and complicated tuning In-Memory Performance. Data is read and written to RAM while also persisted to disk for durability, providing fast performance and eliminating the need for a separate caching layer. Aggregation - Batch processing of data and aggregate calculations JavaScript execution - Ability to store JavaScript functions on the server
  • #10 It is a general-purpose database; it does not focus on doing one thing well, as key-value stores do, which offer the fastest response times on the market. Its goal is to cover as much as possible, so it offers all, or almost all, of the features of relational databases plus the advantages of non-relational ones, such as a schemaless model, performance, ... All mapReduce functions in MongoDB are JavaScript and run on the database nodes.
  • #12 In addition to these tools there are other backup techniques, such as simply copying the data files.
  • #13 MongoDB has been designed to be fast (no joins but embedded documents).
  • #14 Geospatial queries return results based on proximity criteria, intersection and inclusion as specified by a point, line, circle or polygon. For supporting geospatial queries (2d and 2dsphere)
  • #17 Failover: the process from the moment the primary goes down until another node takes over its role. Node recovery: rollback of all the primary's writes that were never replicated (if there were any); reception of all the operations performed while the node was down; the node then starts working as a secondary. Slave delay: the lag before a secondary is updated. It is used when a mistake has been made (fat fingers) and you need to go back quickly without waiting for a restore from a backup.
  • #21 Tags: used to choose the servers we want to talk to.
  • #22 The routers (mongos) route client requests to the shard or shards involved. The client does not know whether the collection is sharded, or in which shard the data it needs reside, so our application code does not need to change. MongoDB leverages horizontal scalability effortlessly by using commodity computers.
  • #23 Replica: high availability, data safety, disaster recovery. Sharding: scale out. Sharding allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application.
  • #25 1 chunk is about 64 MB of data. Chunks > 64 MB → split. Uniform data distribution across shards (chunks / shard). The balancer decides when to migrate chunks and to which shard.
  • #28 Performance, horizontal scalability with commodity hardware, replica set, sharding, clusters, auto load balancing, high availability, in-memory performance, schemaless, failover, data safety, disaster recovery.
  • #29 MongoDB has been designed to be fast (no joins but embedded documents), flexible (schemaless), scalable (horizontally, not vertically), to minimize administration tasks (replica set, failover, sharding), to be fun and quick for programmers to learn, and to come with powerful data analysis tools (aggregation framework), geospatial indexes, GridFS, and so on. At the time of this presentation MongoDB does not support multi-document transactions (they were added later, in version 4.0). However, MongoDB does provide atomic operations on a single document. Often these document-level atomic operations are sufficient to solve problems that would require ACID transactions in a relational database. Relational databases might represent the same kind of data with multiple tables and rows, which would require transaction support to update the data atomically.