Talk at the "2nd International ScaDS Summer School on Big Data", Universität Leipzig, 07/2016:
Modern web-scale and big data applications require efficient solutions to process huge amounts of possibly unstructured data. Many NoSQL stores offer schema-free data storage, easy replication mechanisms and horizontal scalability. According to the CAP theorem, distributed systems can guarantee either consistency (C) or availability (A) while preserving partition tolerance (P) in case of network failures. The talk gives an introduction to the data models and technical concepts of NoSQL data stores, focusing on two prominent systems: the AP system and key-value store Dynamo, and the CP system and document store MongoDB.
1. NoSQL – Data Stores for Big Data
NoSQL - Data Stores
for Big Data
2nd International ScaDS Summer School on Big Data
Anika Groß, Database Group, Universität Leipzig
Leipzig, 12.07.2016
2. NoSQL – Data Stores for Big Data
"NoSQL for Big Data"
• Massive data growth
• Big data, cloud, real-time applications, …
• Requirements
• High read and write scalability
• Management of unstructured and semi-structured data
• Continuous availability
• Decentralized applications
• …
• Modern NoSQL data stores pioneered by
leading internet companies as in-house solutions
Figure: https://www.dezyre.com/article/nosql-vssql-4-reasons-why-nosql-is-better-for-big-data-applications/86
3. NoSQL – Data Stores for Big Data
“Not only SQL”
• No standardized definition!
• Non-relational approaches
• Different applications require different types of databases
• Database system with one or more of these criteria:
• No relational data model
• Schema free, only weak restrictions
• No joins, no normalization
• Distributed, horizontally scalable system
• Use of commodity hardware
• “No SQL”
• Simple API instead of SQL
• “No transactions”
• BASE consistency model instead of ACID
4. NoSQL – Data Stores for Big Data
NoSQL Data Stores
• Key-Value Stores
  • Collection of key-value pairs
  • Data access via key: get(key), put(key, value)
• Wide Column Stores
  • Tables of records with (many) dynamic columns
  • Access via key, SQL-like query language, …
• Document Stores
  • Semi-structured data in documents (e.g. JSON)
  • Access via key or simple API/query language
• Graph Databases
  • Data as nodes and edges with properties
  • Database queries incl. graph algorithms
(some systems are multi-model, marked * in the original figure)
5. NoSQL – Data Stores for Big Data
DB Engines Ranking
http://db-engines.com/en/ranking
6. NoSQL – Data Stores for Big Data
Agenda
• CAP, ACID, BASE, Consistency Models
• Key-value store: Dynamo
• Consistent hashing
• Object versioning
• Quorum-like consistency model
• …
• Document store: MongoDB
• Query language
• Indexing
• Replication
• Sharding
7. NoSQL – Data Stores for Big Data
Distributed Data Management
• Distributed system needs to deal with
• Network failures
• Network latency, limited throughput
• Change of network topology
• …
• Communication between nodes
• a.o. synchronization and replication
• Robust against node failure, loss of messages, …
• Trade-off: performance vs. data consistency
• Wait for synchronization between nodes
• Avoid conflicts / inconsistencies
8. NoSQL – Data Stores for Big Data
CAP Theorem
• Consistency
  • All nodes see the same data at the same time
• Availability
  • Every read or write request receives a response (succeeded or failed)
• Partitioning tolerance
  • System continues to operate despite arbitrary partitioning due to network failures (loss of messages)
• Theorem: A distributed computer system can provide at most two of these three properties.
[Figure: Venn diagram of Consistency, Availability, Partition Tolerance]
Brewer: Towards robust distributed systems. Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, 2000
9. NoSQL – Data Stores for Big Data
CAP Theorem (2)
[Figure: Venn diagram. CP systems: MongoDB, BigTable, HBase. AP systems: Dynamo/S3, Cassandra. (CA)]
Source: Misconceptions about the CAP Theorem
• CP: consistent but not available under network partitions
  • Lock transactions, avoid conflicts, …
• AP: available but not consistent under network partitions
  • Writes always possible even if no communication/synchronization is possible
  • Inconsistent data, conflict resolution necessary
• Controversy!
  • "2 of 3" was misleading, there is no CA:
    CAP Twelve Years Later: How the "Rules" Have Changed
  • Classification of systems is difficult:
    Please stop calling databases CP or AP
10. NoSQL – Data Stores for Big Data
ACID
• RDBMS ensure ACID properties for transactions:
• Atomicity
• "all or nothing” – property
• if part of the transaction fails, the entire transaction fails, and the database state
is left unchanged
• Consistency
• A successful transaction preserves the database consistency
• Guarantee defined integrity constraints
• Isolation
• Concurrent execution of transactions results in a system state as if
transactions were executed serially
• Transactions cannot rely on intermediate or unfinished state
• Durability
• Successfully committed transactions will remain, even in the event of system
failure, power loss, other breakdowns (persistency)
11. NoSQL – Data Stores for Big Data
BASE
• BA - Basically Available
• Partial network failure → response to any request (response could be ‘failure’)
• Replication factor = 3, 1 node fails: query response still possible
• S - Soft State
• The state of the system may change over time
• Even during times without input → changes due to "eventual consistency"
→ the state of the system is always "soft"
• E - Eventually Consistent
• Consistency is not checked for every transaction before it moves
onto the next one → Replica can be inconsistent
• The system will eventually become consistent (once it stops receiving input)
• "Sooner or later" the data will be propagated everywhere it should be
12. NoSQL – Data Stores for Big Data
Consistency Models
Strong Consistency
• After update(x, v2) completes, every subsequent read r(x) returns v2; no client ever reads the old value v1 again.
Eventual Consistency
• After update(x, v2), reads during the inconsistency window may still return v1.
• Eventually (after the inconsistency window closes) all accesses will return the last updated value.
[Figure: timelines of reads r(x)=v1 / r(x)=v2 around update(x, v2) for both models]
13. NoSQL – Data Stores for Big Data
Consistency Models
Read-your-writes Consistency
• A client that updates an item will always access the updated value afterwards and never see an older value.
Monotonic Read Consistency
• Once a client has read the updated value r(x)=v2, it will never read any previous value again.
[Figure: timelines of reads r(x) around update(x, v2) for both models]
14. NoSQL – Data Stores for Big Data
Agenda
• CAP, ACID, BASE, Consistency Models
• Key-value store: Dynamo
• Consistent hashing
• Object versioning
• Quorum-like consistency model
• …
• Document store: MongoDB
• Query language
• Indexing
• Replication
• Sharding
15. NoSQL – Data Stores for Big Data
Key-Value Stores
• Data structure: collection of
key-value pairs = associative
array / dictionary / map
• Key
• Unique within a namespace
Namespace = collection of keys, 'bucket'
• Values
• Uninterpreted string of bytes of arbitrary length (BLOB)
• No integrity constraints (check on application side)
• Different types of key-value stores
• Different consistency models, ordered/unordered keys,
RAM vs. disk/SSD
[Figure: example namespaces "Sales", "Inventory", "Product descriptions", each a collection of key-value pairs key 1 … key n]
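The key-value interface above can be sketched as a dictionary of dictionaries. This is an illustrative toy (the class and method names are invented for this sketch), not the API of any particular store:

```python
# Minimal sketch of a key-value store with namespaces ("buckets").
# Values are uninterpreted bytes; integrity checks are left to the application.
class KeyValueStore:
    def __init__(self):
        self._buckets = {}  # namespace -> {key: value}

    def put(self, namespace, key, value: bytes):
        self._buckets.setdefault(namespace, {})[key] = value

    def get(self, namespace, key):
        return self._buckets.get(namespace, {}).get(key)

store = KeyValueStore()
store.put("Sales", "order:42", b'{"item": "book", "qty": 2}')
print(store.get("Sales", "order:42"))      # the value is just bytes (BLOB)
print(store.get("Inventory", "order:42"))  # None: keys are unique per namespace
```

Note that the store never interprets the value; the application decides whether it is JSON, an image, or anything else.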
16. NoSQL – Data Stores for Big Data
Amazon Dynamo
• Scalable distributed data store built
for Amazon’s platform
• Dynamo principles (or part of them)
implemented in several NoSQL solutions
• “Not only” Dynamo:
e.g. Cassandra = Dynamo + BigTable
• Motivation
• Scale to extreme peak loads efficiently without any downtime
• e.g. busy holiday shopping season
DeCandia et al.: Dynamo: Amazon’s Highly Available Key-value Store.
ACM SIGOPS Operating Systems Review, 41(6), 2007.
[Logo: Project Voldemort]
17. NoSQL – Data Stores for Big Data
Amazon Dynamo
• Aims: high availability and performance
• Address tradeoffs between availability, consistency,
cost-effectiveness and performance
• Eventually-consistent storage system
• “always writeable” data store
• Favor availability over consistency (if necessary)
• Performance SLA (Service Level Agreement)
• “response within 300ms for 99.9% of requests for peak client load
of 500 requests per second”
• Decentralized system: P2P-like distribution
• No master nodes
• All nodes have the same functionality
18. NoSQL – Data Stores for Big Data
Techniques
• Consistent hashing
• Object versioning / vector clocks
• Quorum-like consistency model
• Decentralized replica synchronization protocol
• Gossip-based membership protocol and failure detection
19. NoSQL – Data Stores for Big Data
Partitioning and Replication of Keys
• Logical ring of nodes
• Output range of a hash function
→ fixed circular space
• Node position is a random value
in the range of the hash function
• Assignment of data to nodes
• Determine hash value of keys → position on ring
• Assign to N successor nodes (clockwise)
• Hash value between A and B, N=3 → B, C, D
Consistent hashing
• Minimize the number of re-assignments when nodes are added or removed
• Need sophisticated hash function for good load balancing and data locality
• Preference list: list of nodes that is responsible for storing a
particular key (every node knows the preference list)
[Figure: logical ring of nodes A, B, C, D, E, F, G; Hash(key) determines the position of a key on the ring]
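A minimal sketch of consistent hashing and preference lists (the hash function and node names are illustrative; Dynamo additionally uses virtual nodes for better load balancing):

```python
import hashlib
from bisect import bisect_right

def ring_pos(s: str) -> int:
    # Map node names and keys into a fixed circular space (output range of a hash)
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, nodes, n=3):
        self.n = n  # replication factor N
        self.ring = sorted((ring_pos(node), node) for node in nodes)

    def preference_list(self, key):
        # Walk clockwise from hash(key) and take the next N successor nodes
        idx = bisect_right(self.ring, (ring_pos(key),))
        doubled = self.ring + self.ring  # wrap around the ring
        return [node for _, node in doubled[idx:idx + self.n]]

ring = Ring(["A", "B", "C", "D", "E", "F", "G"])
print(ring.preference_list("user:17"))  # the N=3 successor nodes of hash("user:17")
```

When a node joins or leaves, only keys in the affected ring segment move; all other preference lists stay unchanged, which is the "minimize re-assignments" property.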
20. NoSQL – Data Stores for Big Data
Data Access
• Key-Value Store interface
• Access via Primary-Key; no complex queries
• Every node in the ring can route each query
• Routing to the (usually first) node in the preference list of the specific key
• Put (Key, Context, Object)
• Coordinator creates vector clock (versioning) based on context
• Local write of object incl. vector clock
• Asynchronous replication
• Write request to the N-1 remaining nodes in the preference list
• Write is successful if (at least) W-1 of them respond
• Asynchronous update of replicas (W<N) → consistency problems
• Get (Key)
• Read request to N nodes in preference list
• Response from R nodes → possibly different versions of the same object:
List of (Object, Context) pairs
21. NoSQL – Data Stores for Big Data
Replication
• Read/Write quorum
• R/W = minimal number of N replica nodes that must
participate in a successful read/write operation
• Flexible adaptation of (N,R,W) according to application
requirements w.r.t. performance, availability, durability
• Ensure read of current version: R + W > N
• No loss of information
• Conflict resolution
• Data store side: e.g. “last write wins”
• Application side: e.g. merge conflicting shopping cart versions
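Why R + W > N guarantees reading the current version: every read quorum then overlaps every write quorum in at least one node. A small brute-force check of this pigeonhole argument (illustrative only):

```python
from itertools import combinations

N = 3  # replica nodes

def overlap_guaranteed(R, W):
    """True if every possible read quorum intersects every possible write quorum."""
    nodes = range(N)
    return all(set(r) & set(w)
               for r in combinations(nodes, R)
               for w in combinations(nodes, W))

print(overlap_guaranteed(2, 2))  # R+W=4 > N=3: a read always covers the last write
print(overlap_guaranteed(1, 2))  # R+W=3 <= N: read and write sets can be disjoint
```

With R + W ≤ N a read may miss all nodes that accepted the latest write, which is exactly the eventual-consistency case on the next slide.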
22. NoSQL – Data Stores for Big Data
Quorum variants
• Optimizing reads: R=1, W=N (e.g. R=1, W=N=3)
  • Consistency due to "write to all" = wait for all write acks
• Optimizing writes: R=N, W=1 (e.g. R=N=3, W=1)
  • Consistency due to "read from all" = the last version will be included
• R+W>N: e.g. R=3, W=3, N=5 or R=4, W=2, N=5
• Eventual consistency: R+W≤N, e.g. R=2, W=2, N=4
  • Read might not cover the current write
23. NoSQL – Data Stores for Big Data
Versioning
• Aim: Capture causality between different versions of an object
• Which object versions are known?
• Parallel branches or causal ordering?
• Vector clock: List of (node, counter) pairs
• Version counter per replica node
e.g. D([Sx, 1]) for object D, node Sx, version 1
• Example: evolving object versions
[Figure: evolving versions of an object across replica nodes Sx, Sy, Sz (messages m1, m2)]
24. NoSQL – Data Stores for Big Data
Versioning (2)
• Does one object version descend
from another one?
• If every counter in the 1st vector clock is
≤ the corresponding counter in the 2nd
→ the 1st version is an ancestor of the 2nd
(and can be forgotten)
• Otherwise: conflicting versions
• Client identifies conflict
during read
• Gets all known versions
• Subsequent update
consolidates versions
Figure: DeCandia et al.: Dynamo: Amazon’s Highly Available Key-value Store.
ACM SIGOPS Operating Systems Review, 41(6), 2007.
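The ancestry test above can be sketched with vector clocks as dicts mapping node name to counter (an illustrative sketch, not Dynamo's implementation):

```python
def descends(vc_a, vc_b):
    """True if b descends from a: every counter in a is <= its counter in b."""
    return all(vc_b.get(node, 0) >= count for node, count in vc_a.items())

def relate(vc_a, vc_b):
    if descends(vc_a, vc_b):
        return "a is ancestor of b (forget a)"
    if descends(vc_b, vc_a):
        return "b is ancestor of a (forget b)"
    return "conflict: client must reconcile"

d1 = {"Sx": 2}            # D([Sx, 2])
d2 = {"Sx": 2, "Sy": 1}   # later write handled by node Sy
d3 = {"Sx": 2, "Sz": 1}   # concurrent write handled by node Sz
print(relate(d1, d2))  # a is ancestor of b (forget a)
print(relate(d2, d3))  # conflict: client must reconcile
```

The conflict case is what a client sees on read: it receives both versions and must write back a reconciled object.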
25. NoSQL – Data Stores for Big Data
Handling Temporary Failures
• "Sloppy" quorum (N, R, W)
• Perform all operations on the first N healthy nodes from
the preference list (not necessarily the first N nodes in the ring)
• Still “writable” in case of node failure
• Hinted Handoff
• Unreachable node → write request sent to another node ("hinted replica")
• Availability!
• Node recovers → sync of hinted replica and original node
• Example
• B is not available
• Replica to E (handoff) with hint
to intended recipient B
• B recovers
• Hinted replica E → B
• E can delete hinted replica
26. NoSQL – Data Stores for Big Data
Replica Synchronization
• Hash tree (Merkle tree) for key range
• Leaves are hashes of the values for individual keys
• Parent nodes are hash values of
child node hash values
• Advantages:
• Efficient check: equal root hashes
→ replicas are in sync
• Efficient identification of “out of sync”
keys: subtree traversal to find differences
• Disadvantages
• Recalculation of hash trees in case of
repartitioning (added or removed node)
[Figure: Merkle tree over key-value pairs (k1,v1) … (k4,v4): leaves H(k1), H(k2), H(k3), H(k4); inner nodes H(H(k1), H(k2)) and H(H(k3), H(k4)); root H(H(H(k1), H(k2)), H(H(k3), H(k4)))]
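A minimal Merkle-tree sketch (assumes a power-of-two number of keys; real implementations handle arbitrary key ranges and descend into subtrees to locate the out-of-sync keys):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(values):
    """Leaves are hashes of the values; parents hash the concatenated child hashes."""
    level = [H(v) for v in values]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica1 = [b"v1", b"v2", b"v3", b"v4"]
replica2 = [b"v1", b"v2", b"vX", b"v4"]   # one key out of sync
print(merkle_root(replica1) == merkle_root([b"v1", b"v2", b"v3", b"v4"]))  # True
print(merkle_root(replica1) == merkle_root(replica2))                      # False
```

Comparing only the two root hashes decides whether any of the underlying keys differ, so replicas exchange O(1) data in the common in-sync case.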
27. NoSQL – Data Stores for Big Data
Overview: Amazon Dynamo Techniques
Problem                            | Technique                                              | Advantage
Partitioning                       | Consistent hashing                                     | Incremental scalability
High availability for writes       | Vector clocks with reconciliation during reads         | Version size is decoupled from update rates
Handling temporary failures        | Sloppy quorum and hinted handoff                       | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees                        | Efficient synchronization of divergent replicas in the background
Membership and failure detection   | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
Source: DeCandia et al.: Dynamo: Amazon’s Highly Available Key-value Store.
ACM SIGOPS Operating Systems Review, 41(6), 2007.
28. NoSQL – Data Stores for Big Data
Agenda
• CAP, ACID, BASE, Consistency Models
• Key-value store: Dynamo
• Consistent hashing
• Object versioning
• Quorum-like consistency model
• …
• Document store: MongoDB
• Query language
• Indexing
• Replication
• Sharding
29. NoSQL – Data Stores for Big Data
Document Stores
• Collection of documents
  • Semi-structured data (e.g. JSON format)
  • Flexible, extensible schema
  • Embedded (denormalized) data model
• Data access via key, (simple) query language, map/reduce queries
• Use cases: web applications, mobile applications, e-commerce solutions, …
• Examples (logos in the original slide)
[Figure: a database contains collections; each collection contains documents {doc1}, {doc2}, …]
30. NoSQL – Data Stores for Big Data
Example – Collection "images"
{
  _id: 1,
  name: "fish.jpg",                                  // field: value
  time: "17:46",
  user: "bob",
  camera: "nikon",
  info: { width: 100, height: 200, size: 12345 },    // embedded document
  tags: ["tuna", "shark"]                            // array of strings
}
{
  _id: 2,
  name: "trees.jpg",
  time: "17:57",
  user: "john",
  camera: "canon",
  info: { width: 30, height: 250, size: 32091 },
  tags: ["oak"]
}
…
The same data as a (flattened) relational table:
id | name       | time  | user  | camera | info.width | info.height | info.size | tags
 1 | fish.jpg   | 17:46 | bob   | nikon  |        100 |         200 |     12345 | [tuna, shark]
 2 | trees.jpg  | 17:57 | john  | canon  |         30 |         250 |     32091 | [oak]
 3 | hawaii.png | 17:59 | john  | nikon  |        128 |          64 |     92834 | [maui, tuna]
 4 | island.gif | 17:43 | zztop | nikon  |        640 |         480 |     50398 | [maui]
31. NoSQL – Data Stores for Big Data
MongoDB
• Open source document database
• Current release 3.2
• Embedded data model
• JSON-like documents (BSON = Binary JSON)
• Features
• Query language
• Indexing
• Replication
• Sharding
32. NoSQL – Data Stores for Big Data
Query Language
• Selection, projection:
  db.images.find({camera: "nikon"}, {name: 1, camera: 1, _id: 0})
• Querying multi-valued attributes:
  • Pictures with tag "shark": db.images.find({tags: "shark"})
  • Pictures with tags "a", "b" and "c":
    db.images.find({tags: {$all: ["a", "b", "c"]}})
• Querying nested objects (note: dotted field paths must be quoted):
  • Pictures with width < 100px: db.images.find({"info.width": {$lt: 100}})
{
  _id: 1,
  name: "fish.jpg",
  …
  camera: "nikon",
  info: { width: 100, height: 200, size: 12345 },
  tags: ["tuna", "shark"]
}
33. NoSQL – Data Stores for Big Data
Aggregation Framework
• Pipeline of operators
• $match: filter documents
• $project: include or suppress attributes, add new fields, reset values
• $group: grouping and aggregation
• $unwind: unnest arrays (one document per array element)
• $sort, $limit, $skip, …
http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
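What such a pipeline computes can be sketched over plain Python dicts. This mimics only the semantics of $match followed by $group (with $sum: 1), not MongoDB's execution (the collection content is the "images" example from above, shortened):

```python
# Semantics of:
#   db.images.aggregate([{$match: {user: "john"}},
#                        {$group: {_id: "$camera", n: {$sum: 1}}}])
images = [
    {"_id": 1, "user": "bob",  "camera": "nikon"},
    {"_id": 2, "user": "john", "camera": "canon"},
    {"_id": 3, "user": "john", "camera": "nikon"},
]

matched = [d for d in images if d["user"] == "john"]   # $match: filter documents
counts = {}
for d in matched:                                      # $group with {$sum: 1}
    counts[d["camera"]] = counts.get(d["camera"], 0) + 1
result = [{"_id": camera, "n": n} for camera, n in counts.items()]
print(result)  # one output document per group: camera -> number of pictures
```

Each pipeline stage consumes the document stream of the previous one, which is why stage order matters ($match early keeps the stream small).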
34. NoSQL – Data Stores for Big Data
Indexing
• Aim: Efficient query execution
• Avoid collection scan
• Similar to other database
systems (B-tree data structure)
Index Types
• Default _id index (unique)
• Single field index
• Compound index: e.g. { userid: 1, score: -1 } → 1 asc, -1 desc
• Multikey index: index content of arrays
(separate index entries for every array element)
• Geospatial index: index coordinate pairs
https://docs.mongodb.com/manual/indexes/
35. NoSQL – Data Stores for Big Data
Replication
• Aim: redundancy, data availability
• Asynchronous master-slave replication
• Replica Sets
• Group of servers (mongod instances) with
multiple copies of the same data set
• Writes
• Primary receives all write operations
• Writes recorded to operation log (oplog), ‘write acknowledgement’
• Replication of oplog to “secondaries”
• Secondaries apply operations asynchronously
• 'majority' writeConcern: acknowledge write only after a majority of members (not only the primary) have applied it
• Reads
• Default “primary”: all reads directed to primary
• Read preference modes
• “primary preferred”, “secondary preferred”, “nearest” … → eventual consistency
https://docs.mongodb.com/manual/replication
36. NoSQL – Data Stores for Big Data
Atomicity, Isolation, Durability
• Atomic single document writes
• write can update multiple fields in a document
→ reader cannot see partially updated documents
• Non-atomic (!) multiple document writes
• Alternative: $isolated operator
• Isolate multi-document update operation (no interleaving operations)
• BUT: not “all-or-nothing” atomicity (no rollback after error during write)
• $isolated does not work for sharded clusters
• Durability
• In replica set: update written to a majority of voting nodes’ journal files
• readConcern
• ‘local’: concurrent readers may see the updated document
before changes are durable (read uncommitted)
• ‘majority’: client can read only durable writes
37. NoSQL – Data Stores for Big Data
https://docs.mongodb.com/manual/
core/replica-set-elections/
Automatic Failover
• Primary election
• Primary inaccessible for 10 seconds
• During election
→ no primary → read-only
• New primary: the first secondary that
  • calls for an election,
  • receives a majority of the members' votes, and
  • has the most current optime (timestamp of the last write)
• Network partition
• Minority partition: primary downgraded to secondary
• Rollback: revert writes on former primary when
it rejoins its replica set after failover
• Majority partition: if necessary, election of new primary
38. NoSQL – Data Stores for Big Data
Sharding
• Aim: scalability
• Horizontal partitioning into shards
• Shard
• Contains subset of the data
• Every shard can be a replica set
• Shard key
• Immutable field (or fields) that exists in every document of the collection
• The collection must be indexed on the shard key
https://docs.mongodb.com/manual/sharding/
39. NoSQL – Data Stores for Big Data
Sharding (2)
• mongos
• query router
• interface between client
applications and sharded
cluster
• config servers
• store metadata and
configuration settings for
the cluster (data location)
• can be deployed as
replica set
• mongod
• Primary daemon process
for MongoDB
• Request handling,
manages data access, …
[Figure: sharded cluster architecture. Client apps on web servers connect to mongos query routers; three config servers (mongod, configsrv) hold the cluster metadata; the data is split across Shard01, Shard02 and Shard03, each a replica set (rs01, rs02, rs03) of three mongod instances.]
Source: Tilmann Beittner, Jeremias Brödel: Erste Gehversuche mit
MongoDB - Schritt für Schritt. iXDeveloper, Big Data, 02/2015.
40. NoSQL – Data Stores for Big Data
Sharding - Chunks
• Sharded data is partitioned into chunks
• Lower and upper range based on shard key
• Shard split:
• chunk size > max chunk size
(default 64MB)
• #documents > max # documents
per chunk
• Migration of chunks across
shards (even balance)
https://docs.mongodb.com/manual/core/sharding-data-partitioning
41. NoSQL – Data Stores for Big Data
Hashed and Ranged Sharding
• Hash-based: hash of the shard key field's value
  • A range of hashed shard key values is assigned to each chunk
  + Even data distribution
  - A "close range" of shard key values is unlikely to end up in the same chunk
• Range-based: a specific key range maps to the same chunk
  + Efficient range queries: routing only to shards that contain the required data
  - Possibly uneven data distribution
• Careful selection of the shard key!
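The trade-off can be illustrated with two toy chunk-assignment functions (the chunk size, chunk count and hash are invented for this sketch, not MongoDB's actual partitioning):

```python
import hashlib

def chunk_ranged(key: int, chunk_size: int = 100) -> int:
    """Range-based: consecutive shard key values land in the same chunk."""
    return key // chunk_size

def chunk_hashed(key: int, n_chunks: int = 4) -> int:
    """Hash-based: consecutive shard key values scatter across chunks."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % n_chunks

keys = [101, 102, 103]                    # a "close range" of shard key values
print({chunk_ranged(k) for k in keys})    # one chunk: good for range queries
print({chunk_hashed(k) for k in keys})    # typically several chunks: even load
```

A range query over [101, 103] hits one chunk under range-based sharding, but must be broadcast to several shards under hash-based sharding, which is exactly the trade-off listed above.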
42. NoSQL – Data Stores for Big Data
CP system? - take care …
• “Jepsen: MongoDB stale reads”
• “… we’ll see that Mongo’s consistency model is broken by design: not only
can “strictly consistent” reads see stale versions of documents, but they
can also return garbage data from writes that never should have
occurred. …”
• Source: https://aphyr.com/posts/322-jepsen-mongodb-stale-reads
43. NoSQL – Data Stores for Big Data
Summary
• CAP, ACID, BASE, Consistency Models
• Key-value store: Dynamo
• Consistent hashing, vector clocks, sloppy quorum, …
• Document data store: MongoDB
• Query language, indexing, replication, sharding, …
• The “best” database? Application dependent!
• Relational vs. non-relational
(document, graph… ) data model?
• ACID transactions?
• Large data sets? High query load?
Need for distributed data storage?
• Availability, consistency requirements?
• …
44. NoSQL – Data Stores for Big Data
References
• Eric A. Brewer: Towards robust distributed systems. Proceedings of the Annual ACM
Symposium on Principles of Distributed Computing (PODS), 2000
• Eric A. Brewer: CAP Twelve Years Later: How the "Rules" Have Changed, Computer,
45(2), 2012. https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
• Martin Kleppmann: Please stop calling databases CP or AP, 2015.
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
• Giuseppe DeCandia et al.: Dynamo: Amazon’s Highly Available Key-value Store. ACM
SIGOPS Operating Systems Review, 41(6), 2007.
• MongoDB documentation: https://docs.mongodb.com/
• Tilmann Beittner, Jeremias Brödel: Erste Gehversuche mit MongoDB - Schritt für
Schritt. iXDeveloper, Big Data, 02/2015.
• Kyle Kingsbury: Jepsen: MongoDB stale reads, 2015.
https://aphyr.com/posts/322-jepsen-mongodb-stale-reads
• Lecture “NoSQL-Datenbanken”, Database Group, Universität Leipzig
• Contributors: Anika Groß, Martin Junghanns, Lars Kolb, Andreas Thor