NoSQL
Akbar Shaikh | Monocept
[Chart: growth of data volume, 2002–2012]
Data
 Facebook had 60k servers in 2010
 Google had 450k servers in 2006 (speculated)
 Microsoft: between 100k and 500k servers (since Azure)
 Amazon: likely has similar numbers, too (S3)
 Atomicity: Either everything in a transaction succeeds, or the whole transaction is rolled back.
 Consistency: A transaction cannot leave the database in an inconsistent state.
 Isolation: One transaction cannot interfere with another.
 Durability: A completed transaction persists, even after applications restart.
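Atomicity in particular can be sketched in a few lines of Python. The toy `TinyDB` class below is invented for illustration, not a real database API: a batch of writes either all succeed or the previous state is restored.

```python
# Toy illustration of atomicity: either every change in the
# transaction is applied, or none are. TinyDB is an invented
# name for this sketch, not a real library.

class TinyDB:
    def __init__(self):
        self.data = {}

    def transaction(self, operations):
        """Apply all (key, value) operations, or roll back to the prior state."""
        snapshot = dict(self.data)          # remember the pre-transaction state
        try:
            for key, value in operations:
                if value is None:
                    raise ValueError(f"invalid value for {key}")
                self.data[key] = value
        except Exception:
            self.data = snapshot            # rollback: atomicity
            return False
        return True

db = TinyDB()
db.transaction([("balance:alice", 70), ("balance:bob", 130)])   # commits
db.transaction([("balance:alice", 40), ("balance:bob", None)])  # rolls back
print(db.data["balance:alice"])  # 70 -- the failed transaction left no trace
```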
 Basic availability: Each request is guaranteed a response—successful or failed
execution.
 Soft state: The state of the system may change over time, at times without any
input (for eventual consistency).
 Eventual consistency: The database may be momentarily inconsistent but will be
consistent eventually.
The point I am trying to make here is that we may have to look beyond ACID to
something called BASE, a term coined by Eric Brewer. To see why, start with the three
properties a distributed database must juggle:
 Consistency : Data access in a distributed database is considered to be consistent when an
update written on one node is immediately available on another node.
 Availability : The system guarantees availability for requests even though one or more
nodes are down.
 Partition Tolerance : Nodes can be physically separated from each other at any given
point and for any length of time. The period during which they cannot reach each other, due to
routing problems, network interface troubles, or firewall issues, is called a network
partition. During the partition, all nodes should still be able to serve both read and write
requests. Ideally, the system automatically reconciles updates as soon as every node can
reach every other node again.
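The reconciliation step after a partition heals can be sketched with a last-write-wins merge. Everything here (the `reconcile` helper, the timestamped value tuples) is invented for the sketch; real systems often use vector clocks or CRDTs to detect genuine conflicts rather than simply picking the newest write.

```python
# Sketch of post-partition reconciliation via last-write-wins:
# each node keeps (value, timestamp) per key; after the partition
# heals, every key converges to the most recent write.

def reconcile(node_a, node_b):
    merged = {}
    for key in node_a.keys() | node_b.keys():
        candidates = [n[key] for n in (node_a, node_b) if key in n]
        merged[key] = max(candidates, key=lambda vt: vt[1])  # newest timestamp wins
    return merged

# Divergent writes were accepted on both sides of a partition:
a = {"cart": ("book", 5), "user": ("martin", 1)}
b = {"cart": ("book,dvd", 7)}
converged = reconcile(a, b)
print(converged["cart"])  # ('book,dvd', 7)
```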
Eric Brewer also noted that it is impossible for a distributed computer system to simultaneously
provide all three of consistency, availability, and partition tolerance. This is commonly referred
to as the CAP theorem.
ACID
 Strong consistency for transactions is the highest priority
 Availability is less important
 Pessimistic
 Complex mechanisms

BASE
 Availability and scaling are the highest priorities
 Weak consistency
 Optimistic
 Simple and fast
{ "customer" : { "billingAddress" : [ { "city" : "Chicago" } ],
                 "id" : 1,
                 "name" : "Martin",
                 "orders" : [ { "customerId" : 1,
                                "id" : 99,
                                "orderItems" : [ { "price" : 32.45,
                                                   "productId" : 27,
                                                   "productName" : "NoSQL Distilled"
                                } ],
                                "orderPayment" : [ { "billingAddress" : { "city" : "Chicago" },
                                                     "ccinfo" : "1000-1000-1000-1000",
                                                     "txnId" : "abelif879rft"
                                } ],
                                "shippingAddress" : [ { "city" : "Chicago" } ]
                 } ]
} }
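As a sketch of why this aggregate shape helps application code, here is the same order held as a nested Python structure: one lookup reaches any part of it, with no joins and no object-relational mapping layer.

```python
# The order aggregate above as one nested structure: a single read
# fetches the customer together with orders, items, and payment.
customer = {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [{
        "id": 99,
        "orderItems": [{"productId": 27, "price": 32.45,
                        "productName": "NoSQL Distilled"}],
        "orderPayment": [{"ccinfo": "1000-1000-1000-1000",
                          "txnId": "abelif879rft",
                          "billingAddress": {"city": "Chicago"}}],
        "shippingAddress": [{"city": "Chicago"}],
    }],
}

# Everything needed to render the order page is one traversal away:
first_item = customer["orders"][0]["orderItems"][0]
print(first_item["productName"])   # NoSQL Distilled
```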
We see two primary reasons why people consider using a NoSQL database.
 Application development productivity.
A lot of application development effort is spent on mapping data between in-memory
data structures and a relational database. A NoSQL database may provide a data model
that better fits the application’s needs, thus simplifying that interaction and resulting in
less code to write, debug, and evolve.
 Large-scale data.
Organizations are finding it valuable to capture more data and process it more quickly.
They are finding it expensive, if possible at all, to do so with relational databases. The
primary reason is that a relational database is designed to run on a single machine, but
it is usually more economical to run large data and computing loads on clusters of many
smaller and cheaper machines. Many NoSQL databases are designed explicitly to run
on clusters, so they make a better fit for big data scenarios.
 For almost as long as we’ve been in the software profession, relational databases
have been the default choice for serious data storage, especially in the world of
enterprise applications.
 If you’re an architect starting a new project, your only choice is likely to be which
relational database to use.
 After such a long period of dominance, the current excitement about NoSQL
databases comes as a surprise.
 Schemaless data representation : Almost all NoSQL implementations offer schemaless data representation. This
means that you don’t have to define the full structure up front, and you can continue to evolve it over time—
including adding new fields or even nesting the data, for example in a JSON representation.
 Development time : I have heard stories about reduced development time because one doesn’t have to deal with
complex SQL queries. Do you remember the JOIN query that you wrote to collate the data across multiple tables to
create your final view?
 Speed : Even with a small amount of data, if you can deliver in milliseconds rather than hundreds of
milliseconds—especially over mobile and other intermittently connected devices—you have much higher probability
of winning users over.
 Plan ahead for scalability : You read it right. Why fall into the ditch and then try to climb out of it? Why not plan
ahead so that you never fall in? In other words, your application can be quite elastic—it can handle sudden
spikes of load. Of course, you win users over straightaway.
NoSQL databases have more to offer than just solving problems of scale, as the
following use cases show:
Some NoSQL use cases
1. Massive data volumes
 Massively distributed architecture required to store the data
 Google, Amazon, Yahoo, Facebook…
2. Extreme query workload
 Impossible to efficiently do joins at that scale with an RDBMS
3. Schema evolution
 Schema flexibility (migration) is not trivial at large scale
 Schema changes can be gradually introduced with NoSQL
Key-value stores
The main idea is a hash table where a unique key points to a particular
item of data. The key/value model is the simplest and easiest to implement,
but it is inefficient when you are only interested in querying or updating
part of a value, among other disadvantages.
 One key → one value, very fast
 Key: hash (no duplicates)
 Value: binary object ("BLOB"); the database does not understand your content
[Diagram: key "customer_22" pointing to an opaque binary value]
 A key-value store is a simple hash table
 Primarily used when all access to the database is via primary key
 Simplest NoSQL data stores to use (from an API perspective): PUT, GET, DELETE (matches REST)
 Value is a blob with the data store not caring or knowing what is inside
 Aggregate-Oriented
Suitable Use Cases
 Storing Session Information
 User Profiles, Preferences
 Shopping Cart Data
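A minimal sketch of the idea, using nothing beyond the Python standard library (the `KVStore` class name is illustrative, not any real product's API): opaque values behind put/get/delete, with access only by key.

```python
# Minimal key-value store sketch: opaque blobs behind PUT/GET/DELETE.

class KVStore:
    def __init__(self):
        self._data = {}            # key -> opaque bytes (a "blob")

    def put(self, key, value: bytes):
        self._data[key] = value    # the store never inspects the value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("session:42", b'{"user": "martin", "cart": ["NoSQL Distilled"]}')
print(store.get("session:42"))
store.delete("session:42")
print(store.get("session:42"))  # None -- only key-based access is possible
```

Note that there is no way to ask "which sessions contain this product?": querying inside the value is exactly what this model does not support.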
These were inspired by Lotus Notes and are similar to
key-value stores. The model is basically versioned
documents that are collections of other key-value
collections.
The semi-structured documents are stored in formats
like JSON.
Document databases are essentially the next level of
Key/value, allowing nested values associated with each
key. Document databases support querying more
efficiently.
Document databases
 Documents are the main concept
 Stores and retrieves documents, which can be XML, JSON, BSON, …
 Documents are self-describing, hierarchical tree data structures which can
consist of maps, collections and scalar values
 Documents stored are similar to each other but do not have to be exactly the same
 Aggregate-Oriented
Suitable Use Cases
 Event Logging
 Content Management Systems
 Web Analytics or Real-Time Analytics
 Product Catalog
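The querying advantage over a pure key-value store can be sketched in a few lines. This is a toy in-memory version with an invented `find` helper; real document stores index these fields rather than scanning.

```python
# Document-store sketch: unlike a pure key-value store, the database
# understands the documents and can query on their fields.

docs = [
    {"_id": 1, "type": "article", "author": "martin", "views": 120},
    {"_id": 2, "type": "article", "author": "pramod", "views": 300},
    {"_id": 3, "type": "video",   "author": "martin"},  # no "views" field: schemaless
]

def find(collection, **criteria):
    """Return documents whose fields match all criteria."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

print(find(docs, author="martin"))             # both of martin's documents
print(find(docs, type="article")[1]["views"])  # 300
```

Notice that document 3 simply lacks a `views` field: documents in one collection are similar but need not be identical, as described above.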
Wide-column stores
Often referred to as "BigTable clones": "a sparse, distributed,
multi-dimensional sorted map".
These were created to store and process very large amounts of data
distributed over many machines.
There are still keys, but they point to multiple columns, and the columns
are arranged by column family.
Column stores can greatly improve the performance of queries that touch only a small number of columns
 This is because they access only those columns' data
 Simple math: table t has 10 GB of data in total, with
 column a: 4 GB
 column b: 2 GB
 column c: 3 GB
 column d: 1 GB
If a query only uses column d, at most 1 GB of data will be processed by a column store.
In a row store, the full 10 GB will be processed.
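The arithmetic above can be made concrete in a short sketch (the `gb_scanned` helper and layout names are invented for illustration): a columnar layout reads only the columns a query touches, while a row layout reads every column of every row.

```python
# Row-store vs column-store scan cost, with sizes (in GB) from the text.

column_sizes_gb = {"a": 4, "b": 2, "c": 3, "d": 1}

def gb_scanned(layout, columns_used):
    """GB of data a full scan must read under each storage layout."""
    if layout == "column":
        return sum(column_sizes_gb[c] for c in columns_used)
    # a row store reads whole rows, i.e. every column
    return sum(column_sizes_gb.values())

print(gb_scanned("column", ["d"]))  # 1
print(gb_scanned("row", ["d"]))     # 10
```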
 Aggregate-Oriented
Suitable Use Cases
• Event Logging
• Content Management Systems
Graph stores
 Are used to store information about networks, such as social connections.
 Allow storing entities and the relationships between them
 Entities are known as nodes, which have properties
 Relations are known as edges, which also have properties
 A query on the graph is also known as traversing the graph
 Traversing the relationships is very fast
Suitable Use Cases
 Connected Data
 Routing, Dispatch and Location-Based Services
 Recommendation Engines
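A traversal over such a graph can be sketched with a plain adjacency list and a breadth-first walk (the `within_hops` helper and the sample network are invented for illustration). The point is that relationship hops read stored edges directly instead of being reconstructed through joins.

```python
# Graph-store sketch: nodes, edges, and a breadth-first traversal.

from collections import deque

friends = {                     # adjacency list: node -> connected nodes
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice"],
    "dave":  ["bob"],
}

def within_hops(graph, start, max_hops):
    """All nodes reachable from start in at most max_hops edges."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen - {start}

print(sorted(within_hops(friends, "alice", 2)))  # ['bob', 'carol', 'dave']
```

A "friends of friends" recommendation query is exactly a two-hop traversal like this one.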
POLYGLOT PERSISTENCE
 In 2006, Neal Ford coined the term Polyglot Programming: applications should be written
in a mix of languages, to take advantage of the fact that different languages are suitable
for tackling different problems.
 Polyglot Persistence applies the same idea to storage, as a hybrid approach to persistence:
 Using multiple data storage technologies
 Selected based on the way data is being used by individual applications
 Why store binary images in relational databases, when there are better storage
systems?
 Can occur both over the enterprise as well as within a single application
POLYGLOT PERSISTENCE
"Traditional": today we use the same database for every kind of data
(shopping cart data, user sessions, completed orders, product catalog,
recommendations).
The RDBMS handles business transactions, session management, reporting,
logging information, content information, and more, all forced into the same
availability, consistency, and backup requirements.
Polyglot data storage, by contrast, allows you to mix and match relational
and NoSQL data stores.
POLYGLOT PERSISTENCE – CHALLENGES
 Decisions
• You have to decide which data storage technology to use
• Today it is easier to go with relational
 New data access APIs
• Each data store has its own mechanisms for accessing the data
• Different APIs
 Solution: wrap the data access code into services
(data/entity services) exposed to applications
 This also enforces a contract/schema on a schemaless database
Replica Sets: High Availability
Replication is the process of synchronizing data across multiple servers.
Purpose of Replication
Replication provides redundancy and increases data availability.
With multiple copies of data on different database servers, replication protects a database from the loss of
a single server.
Replication also allows you to recover from hardware failure and service interruptions.
With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.
In some cases, you can use replication to increase read capacity.
Clients have the ability to send read and write operations to different servers.
You can also maintain copies in different data centers to increase the locality and availability of data for
distributed applications.
The primary accepts all write operations from clients; a replica set can
have only one primary. Because only one member can accept write operations,
replica sets provide strict consistency.
The secondaries replicate the primary’s oplog and apply the operations to
their own data sets, so the secondaries’ data sets reflect the primary’s.
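The oplog mechanism can be sketched in a few lines (all names here, `primary_write`, `secondary_sync`, are invented for the sketch, not MongoDB's API): the primary appends each write to an operation log, and a secondary replays entries in order until it mirrors the primary.

```python
# Toy sketch of oplog-based replication.

primary_data, oplog = {}, []

def primary_write(key, value):
    primary_data[key] = value
    oplog.append(("set", key, value))   # every write is logged

def secondary_sync(secondary_data, applied):
    """Apply any oplog entries this secondary has not yet seen."""
    for op, key, value in oplog[applied:]:
        if op == "set":
            secondary_data[key] = value
    return len(oplog)                   # new high-water mark for this secondary

secondary, applied = {}, 0
primary_write("user:1", "martin")
primary_write("user:2", "pramod")
applied = secondary_sync(secondary, applied)
print(secondary == primary_data)  # True -- the secondary reflects the primary
```

Between writes and the next sync, the secondary lags the primary: that window is where eventual (rather than immediate) consistency shows up for reads served by secondaries.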
Automatic Failover
When a primary does not communicate with the other members of the set for more than 10 seconds, the
replica set will attempt to select another member to become the new primary. The first secondary that
receives a majority of the votes becomes primary.
Sharding: High Scalability and Throughput
Sharding is a method for storing data across multiple
machines.
Purpose of Sharding
Database systems with large data sets and high throughput applications can challenge the capacity of a
single server.
High query rates can exhaust the CPU capacity of the server. Larger data sets exceed the storage
capacity of a single machine.
Finally, working set sizes larger than the system’s RAM stress the I/O capacity of disk drives.
Sharding, or horizontal scaling, by contrast, divides the data set and distributes the data over multiple
servers, or shards. Each shard is an independent database, and collectively, the shards make up a single
logical database.
Map-Reduce
The map-reduce pattern is a way to organize processing in such a way as to take advantage of multiple
machines on a cluster while keeping as much processing and the data it needs together on the same
machine.
It first gained prominence with Google’s MapReduce framework.
"Map" step: The master node takes the input,
divides it into smaller sub-problems, and distributes
them to worker nodes. A worker node may do this again
in turn, leading to a multi-level tree structure.
The worker node processes the smaller problem,
and passes the answer back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them
in some way to form the output – the answer to the problem it was originally trying to solve.
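The two steps above can be sketched with the classic word-count example. Here `map_step` and `reduce_step` are illustrative names, and everything runs sequentially; on a cluster, the mappers and reducers run in parallel on many machines, with the grouping ("shuffle") happening between them.

```python
# Word count as map-reduce: map emits (word, 1) pairs per input chunk,
# pairs are grouped by key, and reduce sums each group.

from collections import defaultdict

def map_step(chunk):
    return [(word, 1) for word in chunk.split()]

def reduce_step(word, counts):
    return word, sum(counts)

chunks = ["nosql stores scale", "relational stores join", "nosql nosql"]

grouped = defaultdict(list)                 # shuffle: group pairs by key
for chunk in chunks:
    for word, count in map_step(chunk):
        grouped[word].append(count)

result = dict(reduce_step(w, c) for w, c in grouped.items())
print(result["nosql"])   # 3
print(result["stores"])  # 2
```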
Advantages of MongoDB over RDBMS
Schemaless: MongoDB is a document database in which one collection holds different
documents. The number of fields, the content, and the size of a document can differ
from one document to another.
The structure of a single object is clear
No complex joins
Deep query-ability: MongoDB supports dynamic queries on documents using a document-
based query language that's nearly as powerful as SQL
Ease of scale-out: MongoDB is easy to scale
 Why use MongoDB?
  Document-Oriented Storage: data is stored in the form of JSON-style documents
  Index on any attribute
  Replication & High Availability
  Auto-Sharding
  Rich Queries
  Fast In-Place Updates
  Professional Support by MongoDB
 Where to use MongoDB?
  Big Data
  Content Management and Delivery
  Mobile and Social Infrastructure
  User Data Management
  Data Hub
Storage Type: Document
 http://www.mongodb.com/scale
 http://www.mongodb.com/partners/cloud/microsoft
 http://azure.microsoft.com/en-us/gallery/store/mongodb/mongodb-inc/
 http://www.mongodb.com/leading-nosql-database
 http://nosql.findthebest.com/
 http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
 http://stackoverflow.com/questions/5252577/how-much-faster-is-redis-than-mongodb
MongoDB offered as a service on Azure:
 https://mongolab.com/welcome/
MongoDB offered as a service:
 http://www.objectrocket.com/
 https://www.mongohq.com/
Thank You

Editor's Notes
 Slide 4 (server counts): http://downloadsquad.switched.com/2010/06/29/facebook-doubles-its-server-count-from-30-000-to-60-000-in-just-6-months/ (Sebastian Anthony, June 29, 2010)