Talk at the "2nd International ScaDS Summer School on Big Data", Universität Leipzig, 07/2016:
Modern web-scale and big data applications require efficient solutions to process huge amounts of possibly unstructured data. Many NoSQL stores offer schema-free data storage, easy replication mechanisms and horizontal scalability. According to the CAP theorem, distributed systems can guarantee either consistency (C) or availability (A) while preserving partition tolerance (P) in case of network failures. The talk gives an introduction to the data models and technical concepts of NoSQL data stores, focusing on two prominent systems: the AP system and key-value store Dynamo, and the CP system and document store MongoDB.
1. NoSQL – Data Stores for Big Data
NoSQL - Data Stores
for Big Data
2nd International ScaDS Summer School on Big Data
Anika Groß, Database Group, Universität Leipzig
Leipzig, 12.07.2016
2. NoSQL – Data Stores for Big Data
"NoSQL for Big Data"
• Massive data growth
• Big data, cloud, real-time applications, …
• Requirements
• High read and write scalability
• Management of unstructured and semi-structured data
• Continuous availability
• Decentralized applications
• …
• Modern NoSQL data stores pioneered by
leading internet companies as in-house solutions
Figure: https://www.dezyre.com/article/nosql-vssql-4-reasons-why-nosql-is-better-for-big-data-applications/86
3. NoSQL – Data Stores for Big Data
“Not only SQL”
• No standardized definition!
• Non-relational approaches
• Different applications require different types of databases
• Database system with one or more of these criteria:
• No relational data model
• Schema free, only weak restrictions
• No joins, no normalization
• Distributed, horizontally scalable system
• Use of commodity hardware
• “No SQL”
• Simple API instead of SQL
• “No transactions”
• BASE consistency model instead of ACID
4. NoSQL – Data Stores for Big Data
NoSQL Data Stores
• Key-Value Stores
  • Collection of key-value pairs
  • Data access via key: get(key), put(key, value)
• Wide Column Stores
  • Tables of records with (many) dynamic columns
  • Access via key, SQL-like query language, …
• Document Stores
  • Semi-structured data in documents (e.g. JSON)
  • Access via key or simple API/query language
• Graph Databases
  • Data as nodes and edges with properties
  • Database queries incl. graph algorithms
(some systems are multi-model, marked * in the original figure)
5. NoSQL – Data Stores for Big Data
DB Engines Ranking
http://db-engines.com/en/ranking
6. NoSQL – Data Stores for Big Data
Agenda
• CAP, ACID, BASE, Consistency Models
• Key-value store: Dynamo
• Consistent hashing
• Object versioning
• Quorum-like consistency model
• …
• Document store: MongoDB
• Query language
• Indexing
• Replication
• Sharding
7. NoSQL – Data Stores for Big Data
Distributed Data Management
• Distributed system needs to deal with
• Network failures
• Network latency, limited throughput
• Change of network topology
• …
• Communication between nodes
• a.o. synchronization and replication
• Robust against node failure, loss of messages, …
• Trade-off: performance vs. data consistency
• Wait for synchronization between nodes
• Avoid conflicts / inconsistencies
8. NoSQL – Data Stores for Big Data
CAP Theorem
• Consistency
  • All nodes see the same data at the same time
• Availability
  • Every read or write request receives a response (succeeded or failed)
• Partitioning tolerance
  • System continues to operate despite arbitrary partitioning due to network failures (loss of messages)
• Theorem: A distributed computer system can provide at most two of these three properties.
[Figure: Venn diagram of Consistency, Availability, Partition Tolerance]
Brewer: Towards robust distributed systems. Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, 2000
9. NoSQL – Data Stores for Big Data
CAP Theorem (2)
[Figure: Venn diagram. CP systems: MongoDB, BigTable, HBase. AP systems: Dynamo/S3, Cassandra. (CA)]
Source: Misconceptions about the CAP Theorem
• CP: consistent but not available under network partitions
  • Lock transactions, avoid conflicts, …
• AP: available but not consistent under network partitions
  • Writes always possible even if no communication/synchronization is possible
  • Inconsistent data, conflict resolution necessary
• Controversy!
  • "2 of 3" was misleading, there is no CA:
    CAP Twelve Years Later: How the "Rules" Have Changed
  • Classification of systems is difficult:
    Please stop calling databases CP or AP
10. NoSQL – Data Stores for Big Data
ACID
• RDBMS ensure ACID properties for transactions:
• Atomicity
• "all or nothing” – property
• if part of the transaction fails, the entire transaction fails, and the database state
is left unchanged
• Consistency
• A successful transaction preserves the database consistency
• Guarantee defined integrity constraints
• Isolation
• Concurrent execution of transactions results in a system state as if
transactions were executed serially
• Transactions cannot rely on intermediate or unfinished state
• Durability
• Successfully committed transactions will remain, even in the event of system
failure, power loss, other breakdowns (persistency)
11. NoSQL – Data Stores for Big Data
BASE
• BA - Basically Available
• Partial network failure → response to any request (response could be ‘failure’)
• Replication factor = 3, 1 node fails: query response still possible
• S - Soft State
• The state of the system may change over time
• Even during times without input → changes due to "eventual consistency"
→ the state of the system is always "soft"
• E - Eventually Consistent
• Consistency is not checked for every transaction before it moves
onto the next one → Replica can be inconsistent
• The system will eventually become consistent (once it stops receiving input)
• "Sooner or later" the data will be propagated everywhere it should be
12. NoSQL – Data Stores for Big Data
Consistency Models
Strong Consistency
• After update(x, v2) completes, every subsequent read r(x) returns v2; no client ever reads the old value v1 again.
Eventual Consistency
• After update(x, v2), reads during the inconsistency window may still return v1.
• Eventually (after the inconsistency window closes) all accesses will return the last updated value.
[Figure: timelines of reads r(x)=v1 / r(x)=v2 around update(x, v2) for both models]
13. NoSQL – Data Stores for Big Data
Consistency Models
Read-your-writes Consistency
• A client that updates an item will always access the updated value afterwards and never see an older value.
Monotonic Read Consistency
• Once a client has read the updated value r(x)=v2, it will never read any previous value again.
[Figure: timelines of reads r(x) around update(x, v2) for both models]
14. NoSQL – Data Stores for Big Data
Agenda
• CAP, ACID, BASE, Consistency Models
• Key-value store: Dynamo
• Consistent hashing
• Object versioning
• Quorum-like consistency model
• …
• Document store: MongoDB
• Query language
• Indexing
• Replication
• Sharding
15. NoSQL – Data Stores for Big Data
Key-Value Stores
• Data structure: collection of
key-value pairs = associative
array / dictionary / map
• Key
• Unique within a namespace
Namespace = collection of keys, 'bucket'
• Values
• Uninterpreted string of bytes of arbitrary length (BLOB)
• No integrity constraints (check on application side)
• Different types of key-value stores
• Different consistency models, ordered/unordered keys,
RAM vs. disk/SSD
[Figure: example namespaces "Sales", "Inventory", "Product descriptions", each a collection of key-value pairs key 1 … key n]
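The key-value interface above can be sketched as a dictionary of dictionaries. This is an illustrative toy (the class and method names are invented for this sketch), not the API of any particular store:

```python
# Minimal sketch of a key-value store with namespaces ("buckets").
# Values are uninterpreted bytes; integrity checks are left to the application.
class KeyValueStore:
    def __init__(self):
        self._buckets = {}  # namespace -> {key: value}

    def put(self, namespace, key, value: bytes):
        self._buckets.setdefault(namespace, {})[key] = value

    def get(self, namespace, key):
        return self._buckets.get(namespace, {}).get(key)

store = KeyValueStore()
store.put("Sales", "order:42", b'{"item": "book", "qty": 2}')
print(store.get("Sales", "order:42"))      # the value is just bytes (BLOB)
print(store.get("Inventory", "order:42"))  # None: keys are unique per namespace
```

Note that the store never interprets the value; the application decides whether it is JSON, an image, or anything else.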
16. NoSQL – Data Stores for Big Data
Amazon Dynamo
• Scalable distributed data store built
for Amazon’s platform
• Dynamo principles (or part of them)
implemented in several NoSQL solutions
• “Not only” Dynamo:
e.g. Cassandra = Dynamo + BigTable
• Motivation
• Scale to extreme peak loads efficiently without any downtime
• e.g. busy holiday shopping season
DeCandia et al.: Dynamo: Amazon’s Highly Available Key-value Store.
ACM SIGOPS Operating Systems Review, 41(6), 2007.
[Logo: Project Voldemort]
17. NoSQL – Data Stores for Big Data
Amazon Dynamo
• Aims: high availability and performance
• Address tradeoffs between availability, consistency,
cost-effectiveness and performance
• Eventually-consistent storage system
• “always writeable” data store
• Favor availability over consistency (if necessary)
• Performance SLA (Service Level Agreement)
• “response within 300ms for 99.9% of requests for peak client load
of 500 requests per second”
• Decentralized system: P2P-like distribution
• No master nodes
• All nodes have the same functionality
18. NoSQL – Data Stores for Big Data
Techniques
• Consistent hashing
• Object versioning / vector clocks
• Quorum-like consistency model
• Decentralized replica synchronization protocol
• Gossip-based membership protocol and failure detection
19. NoSQL – Data Stores for Big Data
Partitioning and Replication of Keys
• Logical ring of nodes
• Output range of a hash function
→ fixed circular space
• Node position is a random value
in the range of the hash function
• Assignment of data to nodes
• Determine hash value of keys → position on ring
• Assign to N successor nodes (clockwise)
• Hash value between A and B, N=3 → B, C, D
Consistent hashing
• Minimize the number of re-assignments when nodes are added or removed
• Need sophisticated hash function for good load balancing and data locality
• Preference list: list of nodes that is responsible for storing a
particular key (every node knows the preference list)
[Figure: logical ring of nodes A, B, C, D, E, F, G; Hash(key) determines the position of a key on the ring]
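A minimal sketch of consistent hashing and preference lists (the hash function and node names are illustrative; Dynamo additionally uses virtual nodes for better load balancing):

```python
import hashlib
from bisect import bisect_right

def ring_pos(s: str) -> int:
    # Map node names and keys into a fixed circular space (output range of a hash)
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, nodes, n=3):
        self.n = n  # replication factor N
        self.ring = sorted((ring_pos(node), node) for node in nodes)

    def preference_list(self, key):
        # Walk clockwise from hash(key) and take the next N successor nodes
        idx = bisect_right(self.ring, (ring_pos(key),))
        doubled = self.ring + self.ring  # wrap around the ring
        return [node for _, node in doubled[idx:idx + self.n]]

ring = Ring(["A", "B", "C", "D", "E", "F", "G"])
print(ring.preference_list("user:17"))  # the N=3 successor nodes of hash("user:17")
```

When a node joins or leaves, only keys in the affected ring segment move; all other preference lists stay unchanged, which is the "minimize re-assignments" property.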
20. NoSQL – Data Stores for Big Data
Data Access
• Key-Value Store interface
• Access via Primary-Key; no complex queries
• Every node in the ring can route each query
• Routing to the (usually first) node in the preference list of the specific key
• Put (Key, Context, Object)
• Coordinator creates vector clock (versioning) based on context
• Local write of object incl. vector clock
• Asynchronous replication
• Write request to the N-1 remaining nodes in the preference list
• Write is successful if (at least) W-1 of them respond
• Asynchronous update of replicas (W<N) → consistency problems
• Get (Key)
• Read request to N nodes in preference list
• Response from R nodes → possibly different versions of the same object:
List of (Object, Context) pairs
21. NoSQL – Data Stores for Big Data
Replication
• Read/Write quorum
• R/W = minimal number of N replica nodes that must
participate in a successful read/write operation
• Flexible adaptation of (N,R,W) according to application
requirements w.r.t. performance, availability, durability
• Ensure read of current version: R + W > N
• No loss of information
• Conflict resolution
• Data store side: e.g. “last write wins”
• Application side: e.g. merge conflicting shopping cart versions
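Why R + W > N guarantees reading the current version: every read quorum then overlaps every write quorum in at least one node. A small brute-force check of this pigeonhole argument (illustrative only):

```python
from itertools import combinations

N = 3  # replica nodes

def overlap_guaranteed(R, W):
    """True if every possible read quorum intersects every possible write quorum."""
    nodes = range(N)
    return all(set(r) & set(w)
               for r in combinations(nodes, R)
               for w in combinations(nodes, W))

print(overlap_guaranteed(2, 2))  # R+W=4 > N=3: a read always covers the last write
print(overlap_guaranteed(1, 2))  # R+W=3 <= N: read and write sets can be disjoint
```

With R + W ≤ N a read may miss all nodes that accepted the latest write, which is exactly the eventual-consistency case on the next slide.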
22. NoSQL – Data Stores for Big Data
Quorum variants
• Optimizing reads: R=1, W=N (e.g. R=1, W=N=3)
  • Consistency due to "write to all" = wait for all write acks
• Optimizing writes: R=N, W=1 (e.g. R=N=3, W=1)
  • Consistency due to "read from all" = the last version will be included
• R+W>N: e.g. R=3, W=3, N=5 or R=4, W=2, N=5
• Eventual consistency: R+W≤N, e.g. R=2, W=2, N=4
  • Read might not cover the current write
23. NoSQL – Data Stores for Big Data
Versioning
• Aim: Capture causality between different versions of an object
• Which object versions are known?
• Parallel branches or causal ordering?
• Vector clock: List of (node, counter) pairs
• Version counter per replica node
e.g. D([Sx, 1]) for object D, node Sx, version 1
• Example: evolving object versions
[Figure: evolving versions of an object across replica nodes Sx, Sy, Sz (messages m1, m2)]
24. NoSQL – Data Stores for Big Data
Versioning (2)
• Does one object version descend
from another one?
• If every counter in the 1st vector clock is
≤ the corresponding counter in the 2nd
→ the 1st version is an ancestor of the 2nd
(and can be forgotten)
• Otherwise: conflicting versions
• Client identifies conflict
during read
• Gets all known versions
• Subsequent update
consolidates versions
Figure: DeCandia et al.: Dynamo: Amazon’s Highly Available Key-value Store.
ACM SIGOPS Operating Systems Review, 41(6), 2007.
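The ancestry test above can be sketched with vector clocks as dicts mapping node name to counter (an illustrative sketch, not Dynamo's implementation):

```python
def descends(vc_a, vc_b):
    """True if b descends from a: every counter in a is <= its counter in b."""
    return all(vc_b.get(node, 0) >= count for node, count in vc_a.items())

def relate(vc_a, vc_b):
    if descends(vc_a, vc_b):
        return "a is ancestor of b (forget a)"
    if descends(vc_b, vc_a):
        return "b is ancestor of a (forget b)"
    return "conflict: client must reconcile"

d1 = {"Sx": 2}            # D([Sx, 2])
d2 = {"Sx": 2, "Sy": 1}   # later write handled by node Sy
d3 = {"Sx": 2, "Sz": 1}   # concurrent write handled by node Sz
print(relate(d1, d2))  # a is ancestor of b (forget a)
print(relate(d2, d3))  # conflict: client must reconcile
```

The conflict case is what a client sees on read: it receives both versions and must write back a reconciled object.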
25. NoSQL – Data Stores for Big Data
Handling Temporary Failures
• "Sloppy" quorum (N, R, W)
• Perform all operations on the first N healthy nodes from
the preference list (not necessarily the first N nodes in the ring)
• Still “writable” in case of node failure
• Hinted Handoff
• Unreachable node → write request sent to another node ("hinted replica")
• Availability!
• Node recovers → sync of hinted replica and original node
• Example
• B is not available
• Replica to E (handoff) with hint
to intended recipient B
• B recovers
• Hinted replica E → B
• E can delete hinted replica
26. NoSQL – Data Stores for Big Data
Replica Synchronization
• Hash tree (Merkle tree) for key range
• Leaves are hashes of the values for individual keys
• Parent nodes are hash values of
child node hash values
• Advantages:
• Efficient check: equal root hashes
→ replicas are in sync
• Efficient identification of “out of sync”
keys: subtree traversal to find differences
• Disadvantages
• Recalculation of hash trees in case of
repartitioning (added or removed node)
[Figure: Merkle tree over key-value pairs (k1,v1) … (k4,v4): leaves H(k1), H(k2), H(k3), H(k4); inner nodes H(H(k1), H(k2)) and H(H(k3), H(k4)); root H(H(H(k1), H(k2)), H(H(k3), H(k4)))]
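A minimal Merkle-tree sketch (assumes a power-of-two number of keys; real implementations handle arbitrary key ranges and descend into subtrees to locate the out-of-sync keys):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(values):
    """Leaves are hashes of the values; parents hash the concatenated child hashes."""
    level = [H(v) for v in values]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica1 = [b"v1", b"v2", b"v3", b"v4"]
replica2 = [b"v1", b"v2", b"vX", b"v4"]   # one key out of sync
print(merkle_root(replica1) == merkle_root([b"v1", b"v2", b"v3", b"v4"]))  # True
print(merkle_root(replica1) == merkle_root(replica2))                      # False
```

Comparing only the two root hashes decides whether any of the underlying keys differ, so replicas exchange O(1) data in the common in-sync case.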
27. NoSQL – Data Stores for Big Data
Overview: Amazon Dynamo Techniques
Problem                            | Technique                                              | Advantage
Partitioning                       | Consistent hashing                                     | Incremental scalability
High availability for writes       | Vector clocks with reconciliation during reads         | Version size is decoupled from update rates
Handling temporary failures        | Sloppy quorum and hinted handoff                       | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees                        | Efficient synchronization of divergent replicas in the background
Membership and failure detection   | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
Source: DeCandia et al.: Dynamo: Amazon’s Highly Available Key-value Store.
ACM SIGOPS Operating Systems Review, 41(6), 2007.
28. NoSQL – Data Stores for Big Data
Agenda
• CAP, ACID, BASE, Consistency Models
• Key-value store: Dynamo
• Consistent hashing
• Object versioning
• Quorum-like consistency model
• …
• Document store: MongoDB
• Query language
• Indexing
• Replication
• Sharding
29. NoSQL – Data Stores for Big Data
Document Stores
• Collection of documents
  • Semi-structured data (e.g. JSON format)
  • Flexible, extensible schema
  • Embedded (denormalized) data model
• Data access via key, (simple) query language, map/reduce queries
• Use cases: web applications, mobile applications, e-commerce solutions, …
• Examples (logos in the original slide)
[Figure: a database contains collections; each collection contains documents {doc1}, {doc2}, …]
30. NoSQL – Data Stores for Big Data
Example – Collection "images"
{
  _id: 1,
  name: "fish.jpg",                                  // field: value
  time: "17:46",
  user: "bob",
  camera: "nikon",
  info: { width: 100, height: 200, size: 12345 },    // embedded document
  tags: ["tuna", "shark"]                            // array of strings
}
{
  _id: 2,
  name: "trees.jpg",
  time: "17:57",
  user: "john",
  camera: "canon",
  info: { width: 30, height: 250, size: 32091 },
  tags: ["oak"]
}
…
The same data as a (flattened) relational table:
id | name       | time  | user  | camera | info.width | info.height | info.size | tags
 1 | fish.jpg   | 17:46 | bob   | nikon  |        100 |         200 |     12345 | [tuna, shark]
 2 | trees.jpg  | 17:57 | john  | canon  |         30 |         250 |     32091 | [oak]
 3 | hawaii.png | 17:59 | john  | nikon  |        128 |          64 |     92834 | [maui, tuna]
 4 | island.gif | 17:43 | zztop | nikon  |        640 |         480 |     50398 | [maui]
31. NoSQL – Data Stores for Big Data
MongoDB
• Open source document database
• Current release 3.2
• Embedded data model
• JSON-like documents (BSON = Binary JSON)
• Features
• Query language
• Indexing
• Replication
• Sharding
32. NoSQL – Data Stores for Big Data
Query Language
• Selection, projection:
  db.images.find({camera: "nikon"}, {name: 1, camera: 1, _id: 0})
• Querying multi-valued attributes:
  • Pictures with tag "shark": db.images.find({tags: "shark"})
  • Pictures with tags "a", "b" and "c":
    db.images.find({tags: {$all: ["a", "b", "c"]}})
• Querying nested objects (note: dotted field paths must be quoted):
  • Pictures with width < 100px: db.images.find({"info.width": {$lt: 100}})
{
  _id: 1,
  name: "fish.jpg",
  …
  camera: "nikon",
  info: { width: 100, height: 200, size: 12345 },
  tags: ["tuna", "shark"]
}
33. NoSQL – Data Stores for Big Data
Aggregation Framework
• Pipeline of operators
• $match: filter documents
• $project: include or suppress attributes, add new fields, reset values
• $group: grouping and aggregation
• $unwind: unnest arrays (one document per array element)
• $sort, $limit, $skip, …
http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
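What such a pipeline computes can be sketched over plain Python dicts. This mimics only the semantics of $match followed by $group (with $sum: 1), not MongoDB's execution (the collection content is the "images" example from above, shortened):

```python
# Semantics of:
#   db.images.aggregate([{$match: {user: "john"}},
#                        {$group: {_id: "$camera", n: {$sum: 1}}}])
images = [
    {"_id": 1, "user": "bob",  "camera": "nikon"},
    {"_id": 2, "user": "john", "camera": "canon"},
    {"_id": 3, "user": "john", "camera": "nikon"},
]

matched = [d for d in images if d["user"] == "john"]   # $match: filter documents
counts = {}
for d in matched:                                      # $group with {$sum: 1}
    counts[d["camera"]] = counts.get(d["camera"], 0) + 1
result = [{"_id": camera, "n": n} for camera, n in counts.items()]
print(result)  # one output document per group: camera -> number of pictures
```

Each pipeline stage consumes the document stream of the previous one, which is why stage order matters ($match early keeps the stream small).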
34. NoSQL – Data Stores for Big Data
Indexing
• Aim: Efficient query execution
• Avoid collection scan
• Similar to other database
systems (B-tree data structure)
Index Types
• Default _id index (unique)
• Single field index
• Compound index: e.g. { userid: 1, score: -1 } → 1 asc, -1 desc
• Multikey index: index content of arrays
(separate index entries for every array element)
• Geospatial index: index coordinate pairs
https://docs.mongodb.com/manual/indexes/
35. NoSQL – Data Stores for Big Data
Replication
• Aim: redundancy, data availability
• Asynchronous master-slave replication
• Replica Sets
• Group of servers (mongod instances) with
multiple copies of the same data set
• Writes
• Primary receives all write operations
• Writes recorded to operation log (oplog), ‘write acknowledgement’
• Replication of oplog to “secondaries”
• Secondaries apply operations asynchronously
• 'majority' writeConcern: acknowledge write only after a majority of members (not only the primary) have applied it
• Reads
• Default “primary”: all reads directed to primary
• Read preference modes
• “primary preferred”, “secondary preferred”, “nearest” … → eventual consistency
https://docs.mongodb.com/manual/replication
36. NoSQL – Data Stores for Big Data
Atomicity, Isolation, Durability
• Atomic single document writes
• write can update multiple fields in a document
→ reader cannot see partially updated documents
• Non-atomic (!) multiple document writes
• Alternative: $isolated operator
• Isolate multi-document update operation (no interleaving operations)
• BUT: not “all-or-nothing” atomicity (no rollback after error during write)
• $isolated does not work for sharded clusters
• Durability
• In replica set: update written to a majority of voting nodes’ journal files
• readConcern
• ‘local’: concurrent readers may see the updated document
before changes are durable (read uncommitted)
• ‘majority’: client can read only durable writes
37. NoSQL – Data Stores for Big Data
https://docs.mongodb.com/manual/
core/replica-set-elections/
Automatic Failover
• Primary election
• Primary inaccessible for 10 seconds
• During election
→ no primary → read-only
• New primary: the first secondary that
  • calls for an election,
  • receives a majority of the members' votes, and
  • has the most current optime (timestamp of the last write)
• Network partition
• Minority partition: primary downgraded to secondary
• Rollback: revert writes on former primary when
it rejoins its replica set after failover
• Majority partition: if necessary, election of new primary
38. NoSQL – Data Stores for Big Data
Sharding
• Aim: scalability
• Horizontal partitioning into shards
• Shard
• Contains subset of the data
• Every shard can be a replica set
• Shard key
• Immutable field (or fields) that exists in every document of the collection
• The collection must be indexed on the shard key
https://docs.mongodb.com/manual/sharding/
39. NoSQL – Data Stores for Big Data
Sharding (2)
• mongos
• query router
• interface between client
applications and sharded
cluster
• config servers
• store metadata and
configuration settings for
the cluster (data location)
• can be deployed as
replica set
• mongod
• Primary daemon process
for MongoDB
• Request handling,
manages data access, …
[Figure: sharded cluster architecture. Client apps on web servers connect to mongos query routers; three config servers (mongod, configsrv) hold the cluster metadata; the data is split across Shard01, Shard02 and Shard03, each a replica set (rs01, rs02, rs03) of three mongod instances.]
Source: Tilmann Beittner, Jeremias Brödel: Erste Gehversuche mit
MongoDB - Schritt für Schritt. iXDeveloper, Big Data, 02/2015.
40. NoSQL – Data Stores for Big Data
Sharding - Chunks
• Sharded data is partitioned into chunks
• Lower and upper range based on shard key
• Shard split:
• chunk size > max chunk size
(default 64MB)
• #documents > max # documents
per chunk
• Migration of chunks across
shards (even balance)
https://docs.mongodb.com/manual/core/sharding-data-partitioning
41. NoSQL – Data Stores for Big Data
Hashed and Ranged Sharding
• Hash-based: hash of the shard key field's value
  • A range of hashed shard key values is assigned to each chunk
  + Even data distribution
  - A "close range" of shard key values is unlikely to end up in the same chunk
• Range-based: a specific key range maps to the same chunk
  + Efficient range queries: routing only to shards that contain the required data
  - Possibly uneven data distribution
• Careful selection of the shard key!
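The trade-off can be illustrated with two toy chunk-assignment functions (the chunk size, chunk count and hash are invented for this sketch, not MongoDB's actual partitioning):

```python
import hashlib

def chunk_ranged(key: int, chunk_size: int = 100) -> int:
    """Range-based: consecutive shard key values land in the same chunk."""
    return key // chunk_size

def chunk_hashed(key: int, n_chunks: int = 4) -> int:
    """Hash-based: consecutive shard key values scatter across chunks."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % n_chunks

keys = [101, 102, 103]                    # a "close range" of shard key values
print({chunk_ranged(k) for k in keys})    # one chunk: good for range queries
print({chunk_hashed(k) for k in keys})    # typically several chunks: even load
```

A range query over [101, 103] hits one chunk under range-based sharding, but must be broadcast to several shards under hash-based sharding, which is exactly the trade-off listed above.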
42. NoSQL – Data Stores for Big Data
CP system? - take care …
• “Jepsen: MongoDB stale reads”
• “… we’ll see that Mongo’s consistency model is broken by design: not only
can “strictly consistent” reads see stale versions of documents, but they
can also return garbage data from writes that never should have
occurred. …”
• Source: https://aphyr.com/posts/322-jepsen-mongodb-stale-reads
43. NoSQL – Data Stores for Big Data
Summary
• CAP, ACID, BASE, Consistency Models
• Key-value store: Dynamo
• Consistent hashing, vector clocks, sloppy quorum, …
• Document data store: MongoDB
• Query language, indexing, replication, sharding, …
• The “best” database? Application dependent!
• Relational vs. non-relational
(document, graph… ) data model?
• ACID transactions?
• Large data sets? High query load?
Need for distributed data storage?
• Availability, consistency requirements?
• …
44. NoSQL – Data Stores for Big Data
References
• Eric A. Brewer: Towards robust distributed systems. Proceedings of the Annual ACM
Symposium on Principles of Distributed Computing (PODS), 2000
• Eric A. Brewer: CAP Twelve Years Later: How the "Rules" Have Changed, Computer,
45(2), 2012. https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
• Martin Kleppmann: Please stop calling databases CP or AP, 2015.
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
• Giuseppe DeCandia et al.: Dynamo: Amazon’s Highly Available Key-value Store. ACM
SIGOPS Operating Systems Review, 41(6), 2007.
• MongoDB documentation: https://docs.mongodb.com/
• Tilmann Beittner, Jeremias Brödel: Erste Gehversuche mit MongoDB - Schritt für
Schritt. iXDeveloper, Big Data, 02/2015.
• Kyle Kingsbury: Jepsen: MongoDB stale reads, 2015.
https://aphyr.com/posts/322-jepsen-mongodb-stale-reads
• Lecture “NoSQL-Datenbanken”, Database Group, Universität Leipzig
• Contributors: Anika Groß, Martin Junghanns, Lars Kolb, Andreas Thor