3. Data
Facebook had 60k servers in 2010
Google had 450k servers in 2006 (speculated)
Microsoft: between 100k and 500k servers (since Azure)
Amazon: likely has a similar number, too (S3)
4. Atomicity: Everything in a transaction succeeds, or the entire transaction is rolled back.
Consistency: A transaction cannot leave the database in an inconsistent state.
Isolation: One transaction cannot interfere with another.
Durability: A completed transaction persists, even after applications restart.
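To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and balances are invented for illustration. Either both updates commit, or the rollback leaves the data exactly as it was.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        # Guard: abort the whole transaction if alice would go negative
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
except ValueError:
    pass  # the partial debit was rolled back along with everything else

(balance,) = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
print(balance)  # 100: nothing from the failed transaction persisted
```

The `with conn:` block is what provides the "all or nothing" behavior: sqlite3 commits on a clean exit and rolls back when an exception escapes the block.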
5. Basic availability: Each request is guaranteed a response, whether the execution
succeeded or failed.
Soft state: The state of the system may change over time, at times without any
input (for eventual consistency).
Eventual consistency: The database may be momentarily inconsistent but will be
consistent eventually.
The point here is that we may have to look beyond ACID to
something called BASE, a term coined by Eric Brewer:
6. Consistency: Data access in a distributed database is considered consistent when an
update written on one node is immediately available on another node.
Availability: The system guarantees availability for requests even when one or more
nodes are down.
Partition Tolerance: Nodes can be physically separated from each other at any given
point and for any length of time. The time they are unable to reach each other, due to
routing problems, network interface troubles, or firewall issues, is called a network
partition. During the partition, all nodes should still be able to serve both read and write
requests. Ideally, the system automatically reconciles updates as soon as every node can
reach every other node again.
Eric Brewer also noted that it is impossible for a distributed computer system to provide all
three of consistency, availability, and partition tolerance simultaneously; at most two can be
guaranteed at once. This is more commonly referred to as the CAP theorem.
7. ACID vs. BASE
ACID:
Strong consistency for transactions is the highest priority
Availability is less important
Pessimistic
Complex mechanisms
BASE:
Availability and scaling are the highest priorities
Weak consistency
Optimistic
Simple and fast
12. We see two primary reasons why people consider using a NoSQL database.
Application development productivity.
A lot of application development effort is spent on mapping data between in-memory
data structures and a relational database. A NoSQL database may provide a data model
that better fits the application’s needs, thus simplifying that interaction and resulting in
less code to write, debug, and evolve.
Large-scale data.
Organizations are finding it valuable to capture more data and process it more quickly.
They are finding it expensive, if possible at all, to do so with relational databases. The
primary reason is that a relational database is designed to run on a single machine, but
it is usually more economical to run large data and computing loads on clusters of many
smaller and cheaper machines. Many NoSQL databases are designed explicitly to run
on clusters, so they make a better fit for big data scenarios.
13. For almost as long as we’ve been in the software profession, relational databases
have been the default choice for serious data storage, especially in the world of
enterprise applications.
If you’re an architect starting a new project, your only choice is likely to be which
relational database to use.
After such a long period of dominance, the current excitement about NoSQL
databases comes as a surprise.
14. Schemaless data representation: Almost all NoSQL implementations offer schemaless data representation. This
means that you don't have to think too far ahead to define a structure, and you can continue to evolve it over time,
including adding new fields or even nesting the data, for example, in the case of a JSON representation.
Development time: I have heard stories about reduced development time because one doesn't have to deal with
complex SQL queries. Do you remember the JOIN query that you wrote to collate the data across multiple tables to
create your final view?
Speed: Even with the small amount of data that you have, if you can deliver in milliseconds rather than hundreds of
milliseconds, especially over mobile and other intermittently connected devices, you have a much higher probability
of winning users over.
Plan ahead for scalability: You read it right. Why fall into the ditch and then try to get out of it? Why not just plan
ahead so that you never fall into one? In other words, your application can be quite elastic; it can handle sudden
spikes of load. Of course, you win users over straightaway.
NoSQL databases have a lot more to offer than just solving the problems of scale,
as the points above show.
15. Some NoSQL use cases
1. Massive data volumes
Massively distributed architecture required to store the data
Google, Amazon, Yahoo, Facebook…
2. Extreme query workload
Impossible to efficiently do joins at that scale with an RDBMS
3. Schema evolution
Schema flexibility (migration) is not trivial at large scale
Schema changes can be gradually introduced with NoSQL
18. Key-value stores
The main idea here is using a hash table where
there is a unique key and a pointer to a particular
item of data. The key/value model is the simplest
and easiest to implement.
But it is inefficient when you are only
interested in querying or updating part of
a value, among other disadvantages.
One key, one value, very fast
Key: hash (no duplicates)
Value: binary object ("BLOB"); the DB does not understand your content
[Figure: a key such as customer_22 maps to an opaque binary value that is meaningless to the database]
19. A key-value store is a simple hash table
Primarily used when all access to the database is via primary key
Simplest NoSQL data stores to use (from an API perspective): PUT, GET, DELETE (matches REST)
Value is a blob with the data store not caring or knowing what is inside
Aggregate-Oriented
Suitable Use Cases
Storing Session Information
User Profiles, Preferences
Shopping Cart Data
Key Value Databases
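The PUT/GET/DELETE API above can be sketched as a tiny in-memory store; the class and key names are invented for illustration. The value is an opaque blob of bytes that the store never inspects, which is exactly why partial-value queries are inefficient.

```python
class KeyValueStore:
    """Minimal sketch of a key-value store backed by a hash table."""

    def __init__(self):
        self._table = {}          # a Python dict is itself a hash table

    def put(self, key: str, value: bytes) -> None:
        self._table[key] = value  # one key, one value; overwrites silently

    def get(self, key: str) -> bytes:
        return self._table[key]   # O(1) lookup by primary key

    def delete(self, key: str) -> None:
        self._table.pop(key, None)

store = KeyValueStore()
# e.g. shopping cart data keyed by session id (a typical suitable use case)
store.put("session:42", b'{"cart": ["book", "pen"]}')
print(store.get("session:42"))
store.delete("session:42")
```

Note that to change one item in the cart, the application must GET the whole blob, modify it, and PUT it back; the store cannot update part of a value.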
20. These were inspired by Lotus Notes and are similar to
key-value stores. The model is basically versioned
documents that are collections of other key-value
collections.
The semi-structured documents are stored in formats
like JSON.
Document databases are essentially the next level of
Key/value, allowing nested values associated with each
key. Document databases support querying more
efficiently.
Document databases
21. Documents are the main concept
Stores and retrieves documents, which can be XML, JSON, BSON, …
Documents are self-describing, hierarchical tree data structures which can
consist of maps, collections and scalar values
Documents stored are similar to each other but do not have to be exactly the same
Aggregate-Oriented
Suitable Use Cases
Event Logging
Content Management Systems
Web Analytics or Real-Time Analytics
Product Catalog
Document Databases
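A short sketch of the point above that documents in one collection are similar but need not be identical; the product documents and field names are invented. Because the store understands the document structure (unlike a key-value blob), it can query into nested fields.

```python
# Two documents in the same hypothetical "products" collection:
# similar shape, but different fields, and one has a nested sub-document.
products = [
    {"_id": 1, "name": "laptop", "price": 999,
     "specs": {"ram_gb": 16, "cpu": "8-core"}},           # nested document
    {"_id": 2, "name": "pen", "price": 2,
     "colors": ["blue", "black"]},                         # no "specs" at all
]

# A document store can filter on a nested path such as specs.ram_gb;
# here we emulate that query over plain Python dicts.
hits = [d for d in products if d.get("specs", {}).get("ram_gb", 0) >= 8]
print([d["name"] for d in hits])  # ['laptop']
```

Adding a new field later (say, a "discount" on some products) requires no schema migration: new documents simply carry it, old documents simply lack it.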
22. Often referred to as "BigTable clones": "a sparse,
distributed, multi-dimensional sorted map"
These were created to store and process very large
amounts of data distributed over many machines.
There are still keys but they point to multiple columns.
The columns are arranged by column family.
Wide-column stores
23. Column stores can greatly improve the performance of queries that touch only a small number of columns
This is because only those columns' data is accessed
Simple math: table t has a total of 10 GB data, with
column a: 4 GB
column b: 2 GB
column c: 3 GB
column d: 1 GB
If a query only uses column d, at most 1 GB of data will be processed by a column store
In a row store, the full 10 GB will be processed
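The arithmetic above can be written out as a short sketch; the column sizes are the ones from the example, and the helper names are invented.

```python
# Table t, 10 GB total, split per column as in the example above.
column_sizes_gb = {"a": 4, "b": 2, "c": 3, "d": 1}

def gb_scanned(query_columns, layout):
    """How much data a query over `query_columns` reads in each layout."""
    if layout == "column":
        # Column store: each column lives in its own file, so only the
        # requested columns are read from disk.
        return sum(column_sizes_gb[c] for c in query_columns)
    # Row store: every row carries all columns, so the whole table is read.
    return sum(column_sizes_gb.values())

print(gb_scanned({"d"}, "column"))  # 1 GB
print(gb_scanned({"d"}, "row"))     # 10 GB
```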
Aggregate-Oriented
Suitable Use Cases
• Event Logging
• Content Management Systems
Wide-column Databases
24. Are used to store information about networks, such
as social connections.
Graph stores
25. Allow you to store entities and the relationships between them
Entities are known as nodes, which have properties
Relations are known as edges, which also have properties
A query on the graph is also known as traversing the graph
Traversing the relationships is very fast
Suitable Use Cases
Connected Data
Routing, Dispatch and Location-Based Services
Recommendation Engines
Graph Databases
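The node/edge model and the idea of traversing the graph can be sketched with adjacency lists; the toy social network below is invented for illustration.

```python
from collections import deque

# node -> list of (neighbor, relationship) edges; edges carry a property here
edges = {
    "alice": [("bob", "friend"), ("carol", "friend")],
    "bob":   [("dave", "friend")],
    "carol": [],
    "dave":  [],
}

def reachable(start):
    """Traverse the graph breadth-first from `start`, following each edge once."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, _rel in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(sorted(reachable("alice")))  # ['alice', 'bob', 'carol', 'dave']
```

Traversal is fast because each hop is a direct edge lookup; there is no join to compute, which is why connected data and recommendation engines fit this model well.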
26. POLYGLOT PERSISTENCE
In 2006, Neal Ford coined the term Polyglot Programming:
applications should be written in a mix of languages, to take advantage of the fact that
different languages are suitable for tackling different problems.
Polyglot Persistence defines a hybrid approach to persistence:
Using multiple data storage technologies
Selected based on the way data is being used by individual applications
Why store binary images in relational databases, when there are better storage
systems?
Can occur both over the enterprise as well as within a single application
27. POLYGLOT PERSISTENCE
"Traditional": today we use the same database for all
kinds of data: shopping cart data, user sessions,
completed orders, product catalog, recommendations
RDBMS: business transactions, session management data,
reporting, logging information, content information, ...
All need the same properties of availability, consistency,
and backup requirements
Polyglot data storage usage allows you to mix and
match relational and NoSQL data stores
28. POLYGLOT PERSISTENCE – CHALLENGES
Decisions
• Have to decide what data storage technology to use
• Today it is easier to go with relational
New Data Access APIs
• Each data store has its own mechanisms for
accessing the data
• Different APIs
Solution: Wrap the data access code into services
(Data/Entity Service) exposed to applications
This will enforce a contract/schema on a schemaless database
29. Replica Sets: High Availability
Replication is the process of synchronizing data across multiple servers.
Purpose of Replication
Replication provides redundancy and increases data availability.
With multiple copies of data on different database servers, replication protects a database from the loss of
a single server.
Replication also allows you to recover from hardware failure and service interruptions.
With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.
In some cases, you can use replication to increase read capacity.
Clients have the ability to send read and write operations to different servers.
You can also maintain copies in different data centers to increase the locality and availability of data for
distributed applications.
30. Replica Sets: High Availability
The primary accepts all write
operations from clients. A replica
set can have only one primary.
Because only one member can
accept write operations, replica
sets provide strict consistency.
The secondaries replicate the primary’s
oplog and apply the operations to their
data sets.
Secondaries’ data sets reflect the
primary’s data set.
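The oplog mechanism above can be sketched as a toy simulation; the Node class and operations below are invented for illustration and are not MongoDB's actual implementation. Writes go only to the primary, which records each one in an operation log; a secondary replays that log to converge on the same data.

```python
class Node:
    """Stand-in for one database server in the replica set."""
    def __init__(self):
        self.data = {}

primary, secondary = Node(), Node()
oplog = []                        # the primary's ordered log of operations

def write(key, value):
    """All client writes go to the primary, which appends to the oplog."""
    primary.data[key] = value
    oplog.append(("set", key, value))

def replicate(node, log):
    """A secondary applies the primary's oplog to its own data set."""
    for op, key, value in log:
        if op == "set":
            node.data[key] = value

write("order:1", "completed")
write("order:2", "pending")
replicate(secondary, oplog)
print(secondary.data == primary.data)  # True: the secondary reflects the primary
```

Because the log is ordered, replaying it always produces the primary's state, even if the secondary applies it later or after a restart.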
31. Replica Sets: High Availability
Automatic Failover
When a primary does not communicate with the other members of the set for more than 10 seconds, the
replica set will attempt to select another member to become the new primary. The first secondary that
receives a majority of the votes becomes primary.
32. Sharding: High Scalability and Throughput
Sharding is a method for storing data across multiple
machines.
Purpose of Sharding
Database systems with large data sets and high throughput applications can challenge the capacity of a
single server.
High query rates can exhaust the CPU capacity of the server. Larger data sets exceed the storage
capacity of a single machine.
Finally, working set sizes larger than the system’s RAM stress the I/O capacity of disk drives.
33. Sharding: High Scalability and Throughput
Sharding, or horizontal scaling, by contrast, divides the data set and distributes the data over multiple
servers, or shards. Each shard is an independent database, and collectively, the shards make up a single
logical database.
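One common way to divide the data set is hash-based routing: the hash of a shard key decides which of the N independent shards stores a record. The sketch below is invented for illustration (each dict stands in for one shard server) and is not any particular database's router.

```python
import hashlib

N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]   # each dict = one shard server

def shard_for(key: str) -> int:
    # A stable hash (not Python's per-process hash()) keeps routing
    # deterministic across runs, so reads find what writes stored.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value      # route the write to its shard

def get(key):
    return shards[shard_for(key)].get(key)   # same routing for reads

for i in range(100):                         # data spreads across the shards
    put(f"user:{i}", {"id": i})
print([len(s) for s in shards])              # roughly balanced across shards
print(get("user:7"))
```

Together the three shards hold all 100 records while each server stores and serves only its own share, which is how sharding relieves the CPU, storage, and RAM limits of a single machine.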
34. Map-Reduce
The map-reduce pattern is a way to organize processing in such a way as to take advantage of multiple
machines on a cluster while keeping as much processing and the data it needs together on the same
machine.
It first gained prominence with Google's MapReduce
framework.
"Map" step: The master node takes the input,
divides it into smaller sub-problems, and distributes
them to worker nodes. A worker node may do this again
in turn, leading to a multi-level tree structure.
The worker node processes the smaller problem,
and passes the answer back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them
in some way to form the output – the answer to the problem it was originally trying to solve.
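The map and reduce steps above can be sketched with the classic word-count example; it runs sequentially here, whereas on a cluster each map and reduce task would run on a worker node near its data.

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map step: each worker turns its chunk of input into (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group pairs by key so each reducer sees all values for one key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce step: combine each key's values into the final answer.
counts = {word: sum(values) for word, values in groups.items()}
print(counts["the"])  # 3
```

Because each map call touches only one document and each reduce call only one word's counts, both steps parallelize naturally across machines.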
37. Advantages of MongoDB over RDBMS
Schemaless: MongoDB is a document database in which one collection holds different
documents.
The number of fields, content, and size of a document can differ from one document to
another.
Structure of a single object is clear
No complex joins
Deep query-ability. MongoDB supports dynamic queries on documents using a document-
based query language that's nearly as powerful as SQL
Ease of scale-out: MongoDB is easy to scale
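To illustrate the document-based query language mentioned above, the sketch below evaluates a MongoDB-style filter such as {"price": {"$lt": 50}} against plain Python dicts. The matcher is a hand-rolled toy supporting only equality, $lt, and $gt, not MongoDB's server-side engine, and the sample documents are invented.

```python
def matches(doc, query):
    """Tiny subset of a MongoDB-style filter over a plain dict."""
    for field, cond in query.items():
        if isinstance(cond, dict):            # operator form, e.g. {"$lt": 50}
            value = doc.get(field)
            for op, operand in cond.items():
                if op == "$lt" and not (value is not None and value < operand):
                    return False
                if op == "$gt" and not (value is not None and value > operand):
                    return False
        elif doc.get(field) != cond:          # equality form, e.g. {"name": "pen"}
            return False
    return True

collection = [
    {"name": "pen", "price": 2},
    {"name": "laptop", "price": 999},
]
cheap = [d["name"] for d in collection if matches(d, {"price": {"$lt": 50}})]
print(cheap)  # ['pen']
```

The query itself is just data (a dict), which is what makes such queries "dynamic": an application can build them at runtime without composing SQL strings or joins.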
38. Why should you use MongoDB?
Document Oriented Storage : Data is stored in the form of JSON style documents
Index on any attribute
Replication & High Availability
Auto-Sharding
Rich Queries
Fast In-Place Updates
Professional Support By MongoDB
Where should you use MongoDB?
Big Data
Content Management and Delivery
Mobile and Social Infrastructure
User Data Management
Data Hub
Source: http://downloadsquad.switched.com/2010/06/29/facebook-doubles-its-server-count-from-30-000-to-60-000-in-just-6-months/
(Sebastian Anthony, June 29, 2010)