Key-Value Databases
In Depth
Dr. Fabio Fumarola
Outline
• Key-values introduction
• Major Key-Value Databases
• Dynamo: how it is implemented
– Background
– Partitioning: Consistent Hashing
– High Availability for writes: Vector Clocks
– Handling temporary failures: Sloppy Quorum
– Recovering from failures: Merkle Trees
– Membership and failure detection: Gossip Protocol
2
Key-Value Databases
• A key-value store is a simple hash table where all
accesses to the database are via primary keys.
• A client can either:
– Get the value for a key
– Put a value for a key
– Delete a key from the data store.
3
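The three operations above can be sketched as a tiny in-memory store (an illustrative sketch only; the class and method names are ours, not from any particular product):

```java
// Minimal sketch of the key-value contract: get, put, delete by primary key.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class KVStore {
    private final Map<String, byte[]> table = new HashMap<>();

    Optional<byte[]> get(String key) { return Optional.ofNullable(table.get(key)); }
    void put(String key, byte[] value) { table.put(key, value); }
    void delete(String key) { table.remove(key); }
}

public class KVDemo {
    public static void main(String[] args) {
        KVStore store = new KVStore();
        store.put("user:42", "Alice".getBytes());
        System.out.println(new String(store.get("user:42").orElseThrow())); // prints Alice
        store.delete("user:42");
        System.out.println(store.get("user:42").isPresent());               // prints false
    }
}
```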
Key-value store: characteristics
• Key-value data access enables high performance and
availability.
• Both keys and values can be complex compound
objects and sometimes lists, maps or other data
structures.
• Consistency applies only to operations on a single
key; across keys the store is typically eventually consistent.
4
Key-Values: Cons
• No complex query filters
• All joins must be done in code
• No foreign key constraints
• No triggers
5
Key-Values: Pros
• Efficient queries (very predictable performance).
• Easy to distribute across a cluster.
• Service-orientation already disallows foreign key constraints
and forces joins to be done in code anyway.
• Using a relational DB plus a cache pushes you into key-value
style access anyway.
• No object-relational mismatch
6
Popular Key-Value Stores
• Riak – Basho
• Redis – Data Structure server
• Memcached
• Berkeley DB – Oracle
• Aerospike – fast key-value for SSD disks
• LevelDB – Google key-value store
• DynamoDB – Amazon key-value store
• Voldemort – open-source Dynamo replica (LinkedIn)
7
Memcached
• Atomic operations set/get/delete.
• O(1) to set/get/delete.
• Consistent hashing.
• In memory caching, no persistence.
• LRU eviction policy.
• No iterators.
8
Aerospike
• Key-Value database optimized for hybrid (DRAM + Flash)
approach
• First published in the Proceedings of VLDB (Very Large
Databases) in 2011, “Citrusleaf: A Real-Time NoSQL DB which
Preserves ACID”
9
Redis
• Written in C, with a BSD license
• It is an advanced key-value store.
• Keys can contain strings, hashes, lists, sets, sorted sets,
bitmaps and hyperloglogs.
• It works on an in-memory dataset.
• Data can be persisted either by dumping the dataset to disk
every once in a while, or by appending each command to a
log.
• Created by Salvatore Sanfilippo (Pivotal)
10
Riak
• Distributed Database written in: Erlang & C, some JavaScript
• Operations
– GET /buckets/BUCKET/keys/KEY
– PUT|POST /buckets/BUCKET/keys/KEY
– DELETE /buckets/BUCKET/keys/KEY
• Integrated with Solr and MapReduce
• Data Types: basic, Sets and Maps
11
curl -XPUT 'http://localhost:8098/riak/food/favorite' \
  -H 'Content-Type: text/plain' \
  -d 'pizza'
LevelDB
LevelDB is a fast key-value storage library written at Google that
provides an ordered mapping from string keys to string values.
•Keys and values are arbitrary byte arrays.
•Data is stored sorted by key.
•The basic operations are Put(key, value), Get(key), Delete(key).
•Multiple changes can be made in one atomic batch.
Limitation
•There is no client-server support built in to the library.
12
Dynamo
• Peer-to-peer key-value database.
• Service Level Agreements measured at the 99.9th percentile.
• Highly available, sacrificing strong consistency.
• Can handle online node additions and node failures.
• It supports object versioning and application-assisted
conflict resolution (eventually-consistent data
structures)
13
Dynamo
Amazon’s Highly Available Key-value Store
14
Amazon Dynamo
• We analyze the design and the implementation of
Dynamo.
• Amazon runs a world-wide e-commerce platform
• It serves over 10 million customers.
• At peak times it uses over 10,000 servers located in many
data centers around the world.
• They have performance, reliability and efficiency
requirements that demand a fully scalable platform.
15
Motivation of Dynamo
• There are many Amazon services that only need
primary-key access to a data store
– To provide best-seller lists
– Shopping carts
– Customer preferences
– Session management
– Sales rank and product catalogs
• Using a relational database would lead to inefficiencies
and limit scale and availability.
16
Background
17
Scalability is application dependent
• Lesson 1: the reliability and scalability of a system
depend on how its application state is managed.
• Amazon uses a highly decentralized, loosely coupled,
service-oriented architecture composed of hundreds
of services.
• These services need storage that is always available.
18
Shopping carts, always available
• Customers should be able to view and add items to
their shopping carts even if:
– Disks are failing, or
– A data center is being destroyed by a tornado or a
kraken.
19
Failures Happen
• When you deal with an infrastructure composed of
millions of servers and network components,
crashes happen all the time.
20
http://letitcrash.com/
High Availability by contract
• Service Level Agreement (SLA) is the guarantee that
an application can deliver its functionality in a
bounded time.
• An example SLA is the guarantee that the Acme API
provides a response within 300ms for 99.9% of its
requests at a peak of 500 concurrent users (CCU).
• Normally an SLA is described using averages, medians
and expected variance.
21
Dynamo
It uses a synthesis of well known techniques to achieve
scalability and availability.
1. Data is partitioned and replicated using consistent hashing
[Karger et al. 1997].
2. Consistency is facilitated by vector clocks and object versioning
[Lamport 1978].
3. Consistency among replicas is maintained by a decentralized
replica synchronization protocol (E-CRDT).
4. A gossip protocol is used for membership and failure detection.
22
System Interface
• Dynamo stores objects associated with a key through
two operations: get() and put()
– The get(key) locates the object replicas associated with
the key in the storage and returns a single object or a list
of objects with conflicting versions along with a context.
– The put(key, context, object) operation determines where
the replicas of the object should be placed based on the
associated key, and writes the replicas to disk.
– The context encodes system metadata about the object
23
Key and Value encoding
• Dynamo treats both the key and the object supplied
by the caller as an opaque array of bytes.
• It applies a MD5 hash on the key to generate a 128-
bit identifier, which is used to determine the storage
nodes that are responsible for serving the key.
24
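A minimal sketch of this hashing step (the method name `ringPosition` is ours):

```java
// MD5 maps an opaque key to a 128-bit identifier that fixes its ring position.
import java.math.BigInteger;
import java.security.MessageDigest;

public class RingHash {
    static BigInteger ringPosition(byte[] key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key); // 16 bytes
            return new BigInteger(1, digest);                             // non-negative 128-bit value
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        BigInteger pos = ringPosition("shopping-cart:12345".getBytes());
        System.out.println(pos.bitLength() <= 128); // prints true
    }
}
```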
Dynamo Architectural Choice 1/2
We focus on the core of distributed systems techniques used
25
Problem: Partitioning
Technique: Consistent hashing
Advantage: Incremental scalability

Problem: High availability for writes
Technique: Vector clocks with reconciliation during reads
Advantage: Version size is decoupled from update rates

Problem: Handling temporary failures
Technique: Sloppy quorum and hinted handoff
Advantage: Provides high availability and durability guarantees when some of the replicas are not available
Dynamo Architectural Choice 2/2
We focus on the core of distributed systems techniques used
26
Problem: Recovering from permanent failures
Technique: Anti-entropy using Merkle trees
Advantage: Synchronizes divergent replicas in the background

Problem: Membership and failure detection
Technique: Gossip-based membership protocol and failure detection
Advantage: Preserves symmetry and avoids a centralized registry for storing membership and node liveness information
Partitioning: Consistent Hashing
• Dynamo must scale incrementally.
• This requires a mechanism to dynamically partition
the data over the set of nodes (i.e., storage hosts) in
the system.
• Dynamo’s partitioning scheme relies on consistent
hashing to distribute the load across multiple storage
hosts.
• The output range of the hash function is treated as a
fixed circular space, or ring.
27
Partitioning: Consistent Hashing
• Each node in the system is assigned a random value
within this space which represents its “position” on
the ring.
• Each data item is assigned to a node by:
1. hashing the data item’s key to yield its position on the
ring,
2. and then walking the ring clockwise to find the first node
with a position larger than the item’s position.
28
Partitioning: Consistent Hashing
• Each node becomes
responsible for the region of
the ring between it and its
predecessor node.
• The principal advantage of
consistent hashing is that the
departure or arrival of a node
only affects its immediate
neighbors; other nodes
remain unaffected.
29
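The ring walk described above can be sketched with a sorted map standing in for the ring (an illustration, not Dynamo's actual code; names are ours):

```java
// Consistent hashing: hash servers and keys into the same space, and
// assign each key to the first server at or after its position (clockwise),
// wrapping around to the smallest position at the end of the ring.
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.*;

public class Ring {
    static BigInteger h(String s) {
        try {
            return new BigInteger(1, MessageDigest.getInstance("MD5").digest(s.getBytes()));
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static TreeMap<BigInteger, String> buildRing(List<String> servers) {
        TreeMap<BigInteger, String> ring = new TreeMap<>();
        for (String s : servers) ring.put(h(s), s);
        return ring;
    }

    static String lookup(TreeMap<BigInteger, String> ring, String key) {
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(h(key)); // first node clockwise
        return e != null ? e.getValue() : ring.firstEntry().getValue(); // wrap around
    }

    public static void main(String[] args) {
        TreeMap<BigInteger, String> ring = buildRing(List.of("A", "B", "C"));
        System.out.println(lookup(ring, "resource-2")); // one of A, B, C
    }
}
```

Removing a server reassigns only the keys that pointed at it; every other key keeps its old owner, which is the property the slides emphasize.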
Consistent Hashing: Idea
• Consistent hashing is a technique that lets you
smoothly handle these problems:
1. Given a resource key and a list of servers, how do you
find a primary, second, tertiary (and on down the line)
server for the resource?
2. If you have different size servers, how do you assign each
of them an amount of work that corresponds to their
capacity?
30
Consistent Hashing: Idea
• Consistent hashing is a technique that lets you
smoothly handle these problems:
3. How do you smoothly add capacity to the system without
downtime?
4. Specifically, this means solving two problems:
• How do you avoid dumping 1/N of the total load on a new server
as soon as you turn it on?
• How do you avoid rehashing more existing keys than necessary?
31
Consistent Hashing: How To
• Imagine a 128-bit space.
• visualize it as a ring, or a
clock face
• Now imagine hashing
resources into points on
the circle
32
Consistent Hashing: How To
• They could be URLs, GUIDs,
integer IDs, or any arbitrary
sequence of bytes.
• Just run them through a good
hash function (eg, MD5) and
shave off everything but 16
bytes.
• We have four key-values: 1, 2,
3, 4.
33
Consistent Hashing: How To
• Finally, imagine our servers.
– A,
– B, and
– C
• We put our servers in the same
ring.
• We can now answer the question of
which server should serve
Resource 2.
34
Consistent Hashing: How To
• We start where Resource 2 is
and head clockwise on the
ring until we hit a server.
• If that server is down, we go
to the next one, and so on
and so forth.
35
Consistent Hashing: How To
• Key-values 4 and 1 belong to
server A
• Key-value 2 to server B
• Key-value 3 to server C
36
Consistent Hashing: Del Server
• If server C is removed,
• key-value 3 now belongs to
server A.
• All the other key-value
mappings are unchanged.
37
Consistent Hashing: Add Server
• If server D is added at the
position marked,
• which objects will now
belong to D?
38
Consistent Hashing: Cons
• This works well, except that the size of the interval
assigned to each server is pretty hit-and-miss.
• Since placement is essentially random, it is possible to
have a very non-uniform distribution of objects
between servers.
• To address this issue, the idea of "virtual nodes" is
introduced.
39
Consistent Hashing: Virtual Nodes
• Instead of mapping a server to a single point in the
circle, each server gets assigned to multiple points in
the ring.
• A virtual node looks like a single node in the system,
but each node can be responsible for more than one
virtual node.
• Effectively, when a new node is added to the system,
it is assigned multiple positions in the ring.
40
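A sketch of virtual-node assignment, under the assumption that each server is hashed at several derived points such as "A#0", "A#1", ... (the suffix scheme is ours):

```java
// Virtual nodes: each physical server is hashed at many ring positions,
// so its load is spread around the ring instead of one contiguous arc.
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.*;

public class VirtualNodes {
    static BigInteger h(String s) {
        try {
            return new BigInteger(1, MessageDigest.getInstance("MD5").digest(s.getBytes()));
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static TreeMap<BigInteger, String> buildRing(List<String> servers, int vnodes) {
        TreeMap<BigInteger, String> ring = new TreeMap<>();
        for (String s : servers)
            for (int i = 0; i < vnodes; i++)
                ring.put(h(s + "#" + i), s); // many positions, one physical server
        return ring;
    }

    public static void main(String[] args) {
        TreeMap<BigInteger, String> ring = buildRing(List.of("A", "B", "C"), 100);
        System.out.println(ring.size()); // prints 300 (barring hash collisions)
    }
}
```

Giving a bigger machine more virtual nodes is how capacity-proportional load assignment falls out of the same mechanism.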
Virtual Nodes: Advantages
• If a node becomes unavailable (due to failures or routine
maintenance), the load handled by this node is evenly
dispersed across the remaining available nodes.
• When a node becomes available again, or a new node is
added to the system, the newly available node accepts a
roughly equivalent amount of load from each of the other
available nodes.
• The number of virtual nodes that a node is responsible
for can be decided based on its capacity, accounting for
heterogeneity in the physical infrastructure.
41
Data Replication
• To achieve high availability and durability, Dynamo
replicates its data on multiple hosts.
• Each data item is replicated at N hosts, where N is a
parameter configured “per-instance”.
• Each key k is assigned to a coordinator node
(described above).
• The coordinator is in charge of the replication of the
data items that fall within its range (ring).
42
Data Replication
• The coordinator locally
stores each key within its
range,
• And in addition, it replicates
these keys at the N-1
clockwise successor nodes
in the ring.
43
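The replica placement above can be sketched as a clockwise walk that keeps the first N distinct physical nodes (illustrative only; names are ours):

```java
// Preference list: walk clockwise from the key's position (wrapping) and
// keep the first N *distinct* physical nodes; the distinct check is what
// lets the list skip duplicate virtual nodes of the same physical host.
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.*;

public class PreferenceList {
    static BigInteger h(String s) {
        try {
            return new BigInteger(1, MessageDigest.getInstance("MD5").digest(s.getBytes()));
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static List<String> preferenceList(TreeMap<BigInteger, String> ring, String key, int n) {
        List<String> result = new ArrayList<>();
        // ring positions clockwise from the key, wrapping around at the end
        List<String> clockwise = new ArrayList<>(ring.tailMap(h(key)).values());
        clockwise.addAll(ring.headMap(h(key)).values());
        for (String node : clockwise) {
            if (!result.contains(node)) result.add(node);
            if (result.size() == n) break;
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<BigInteger, String> ring = new TreeMap<>();
        for (String s : List.of("A", "B", "C", "D")) ring.put(h(s), s);
        System.out.println(preferenceList(ring, "cart:42", 3)); // three distinct nodes
    }
}
```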
Data Replication
• The list of nodes that is responsible for storing a particular key
is called the preference list
• The system is designed so that every node in the system can
determine which nodes should be in this list for any particular
key.
• To account for node failures, the preference list contains more
than N nodes.
• Because consecutive ring positions may belong to the same
physical node when virtual nodes are used, the preference list
skips positions so that a key k is owned by N distinct physical nodes.
44
High Availability for writes
• With eventual consistency writes are propagated
asynchronously.
• A put() may return to its caller before the update has
been applied at all the replicas.
• In this scenario, a subsequent get() operation may
return an object that does not have the latest
updates.
45
High Availability for writes: Example
• We can see this with shopping carts.
• The “Add to Cart” operation can never be forgotten
or rejected.
• When a customer wants to add an item to (or
remove from) a shopping cart and the latest version
is not available, the item is added to (or removed
from) the older version and the divergent versions
are reconciled later.
46
High Availability for writes
• Dynamo treats the result of each modification as a
new and immutable version of the data.
• It allows for multiple versions of an object to be
present in the system at the same time.
• Most of the time, new versions subsume the
previous version(s), and the system itself can
determine the authoritative version (syntactic
reconciliation).
47
Singly-Linked List
START
48
Singly-Linked List
49
(diagram: the list 3 → 5 → 7 as Cons cells ending in Nil)

abstract sealed class List {
  def head: Int
  def tail: List
  def isEmpty: Boolean
}

case object Nil extends List {
  def head: Int = fail("Empty list.")
  def tail: List = fail("Empty list.")
  def isEmpty: Boolean = true
}

case class Cons(head: Int, tail: List = Nil) extends List {
  def isEmpty: Boolean = false
}
List: analysis
50
A = 3 → 5 → 7
B = Cons(9, A)          = 9 → A
C = Cons(1, Cons(8, B)) = 1 → 8 → B
(structural sharing: B and C reuse A's cells)

List: append & prepend
51

/**
 * Time - O(1)
 * Space - O(1)
 */
def prepend(x: Int): List = Cons(x, this)

/**
 * Time - O(n)
 * Space - O(n)
 */
def append(x: Int): List =
  if (isEmpty) Cons(x)
  else Cons(head, tail.append(x))

(diagram: append(9) copies the spine 3 → 5 → 7 and links a new cell 9)
List: apply
52
(diagram: apply(n) walks n - 1 cells along the list)

/**
 * Time - O(n)
 * Space - O(n)
 */
def apply(n: Int): Int =
  if (isEmpty) fail("Index out of bounds.")
  else if (n == 0) head
  else tail(n - 1) // or tail.apply(n - 1)
List: concat
53
(path copying)

A = 4 → 2 → 6
B = 3 → 5 → 7
C = A.concat(B)   (A's cells are copied; B's cells are shared)

/**
 * Time - O(n)
 * Space - O(n)
 */
def concat(xs: List): List =
  if (isEmpty) xs
  else tail.concat(xs).prepend(head)
List: reverse (two approaches)
54
(diagram: reverse(4 → 2 → 6) = 6 → 2 → 4)

The straightforward solution in O(n²):

def reverse: List =
  if (isEmpty) Nil
  else tail.reverse.append(head)

or with tail recursion in O(n):

def reverse: List = {
  @tailrec
  def loop(s: List, d: List): List =
    if (s.isEmpty) d
    else loop(s.tail, d.prepend(s.head))
  loop(this, Nil)
}
List performance
55
Singly-Linked List
END
56
High Availability for writes
• Node failures can potentially result in the system
having not just two but several versions of the same
data.
• Updates in the presence of network partitions and
node failures can potentially result in an object
having distinct version sub-histories.
57
High Availability for writes
• Dynamo uses vector clocks in order to capture
causality between different versions of the same
object.
• One vector clock is associated with every version of
every object.
• We can determine whether two versions of an object
are on parallel branches or have a causal ordering by
examining their vector clocks.
58
High Availability for writes
• When dealing with different copies of the same object:
– If the counters on the first object's clock are less than or
equal to all of the corresponding counters in the second clock,
then the first is an ancestor of the second and can be forgotten.
– Otherwise, the two changes are considered to be in
conflict and require reconciliation.
59
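The ancestor test above can be sketched directly (a minimal illustration; `descends` and `conflict` are our names):

```java
// Vector clock comparison: v1 can be forgotten iff every counter in v1
// is <= the matching counter in v2; otherwise the versions conflict.
import java.util.Map;

public class VectorClocks {
    // does 'later' descend from (i.e., dominate) 'earlier'?
    static boolean descends(Map<String, Integer> later, Map<String, Integer> earlier) {
        return earlier.entrySet().stream()
            .allMatch(e -> later.getOrDefault(e.getKey(), 0) >= e.getValue());
    }

    static boolean conflict(Map<String, Integer> a, Map<String, Integer> b) {
        return !descends(a, b) && !descends(b, a);
    }

    public static void main(String[] args) {
        Map<String, Integer> v1 = Map.of("sx", 1);
        Map<String, Integer> v2 = Map.of("sx", 2, "sy", 1); // grew from v1
        Map<String, Integer> v3 = Map.of("sx", 2, "sz", 1); // parallel branch
        System.out.println(descends(v2, v1)); // prints true: v1 can be forgotten
        System.out.println(conflict(v2, v3)); // prints true: needs reconciliation
    }
}
```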
HA with Vector Clocks
• A vector clock is an algorithm for generating a partial
ordering of events in a distributed system and
detecting causality violations.
• Vector clocks are built on logical timestamps, otherwise
known as Lamport clocks.
• A Lamport clock is a single integer value that is
passed around the cluster with every message sent
between nodes.
60
HA with Vector Clocks
• Events in the blue region are the causes leading to event B4,
whereas those in the red region are the effects of event B4
61
HA with Vector Clocks
• Each node keeps a record of what it thinks the latest (i.e.
highest) Lamport Clock value is, and if it hears a larger value
from some other node, it updates its own value.
• Every time a database record is produced, the producing
node can attach the current Lamport Clock value + 1 to it as a
timestamp.
• This sets up a total ordering on all records, with the valuable
property that if record A causally precedes record B, then
A's timestamp < B's timestamp.
62
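The Lamport-clock rule just described can be sketched as (names are ours):

```java
// Lamport clock: keep the highest value heard so far, and stamp each
// new record with current + 1, so causally later records get larger stamps.
public class LamportClock {
    private long value = 0;

    void observe(long remote) { value = Math.max(value, remote); } // hear a larger value
    long stamp() { return ++value; }                               // attach current + 1

    public static void main(String[] args) {
        LamportClock a = new LamportClock();
        LamportClock b = new LamportClock();
        long t1 = a.stamp();  // A produces a record: timestamp 1
        b.observe(t1);        // B hears A's clock along with the record
        long t2 = b.stamp();  // B's causally later record: timestamp 2
        System.out.println(t1 < t2); // prints true
    }
}
```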
Example Vector Clock: Dynamo
63
Execution of get() and put()
• Each read and write is handled by a coordinator.
• Typically, this is the first among the top N nodes in
the preference list.
• Read and write operations involve the first N healthy
nodes in the preference list, skipping over those that
are down or inaccessible.
64
Handling temporary failures
• To handle this kind of failure, Dynamo uses a “sloppy
quorum”.
• When a node (say A) is unreachable, a write intended for it
is persisted on the next available node in the preference list (say D).
• The replica sent to D will have a hint in its metadata
that records which node was the intended recipient
of the replica (in this case A).
65
Handling temporary failures
• Nodes that receive hinted replicas will keep them in
a separate local database that is scanned
periodically.
• Upon detecting that A has recovered, D will attempt
to deliver the replica to A.
• Once the transfer succeeds, D may delete the object
from its local store without decreasing the total
number of replicas in the system.
66
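The hinted-handoff flow can be sketched as follows (a simplified illustration, not Dynamo's implementation; class and field names are ours):

```java
// Hinted handoff: replicas accepted on behalf of a down node are kept in
// a separate hinted table, and delivered (then dropped locally) once the
// intended node recovers.
import java.util.*;

public class HintedStore {
    final String id;
    final Map<String, String> local = new HashMap<>();               // key -> value
    final Map<String, Map<String, String>> hinted = new HashMap<>(); // intended node -> (key -> value)

    HintedStore(String id) { this.id = id; }

    void write(String key, String value, String intendedOrNull) {
        if (intendedOrNull == null) local.put(key, value);
        else hinted.computeIfAbsent(intendedOrNull, n -> new HashMap<>()).put(key, value);
    }

    // periodic scan: hand hinted replicas back to a recovered node
    void handoff(HintedStore recovered) {
        Map<String, String> mine = hinted.remove(recovered.id);
        if (mine != null) recovered.local.putAll(mine);
    }

    public static void main(String[] args) {
        HintedStore a = new HintedStore("A");
        HintedStore d = new HintedStore("D");
        d.write("cart:42", "pizza", "A");  // A was down; D keeps a hinted replica
        d.handoff(a);                      // A recovered: D delivers and forgets
        System.out.println(a.local.get("cart:42")); // prints pizza
        System.out.println(d.hinted.isEmpty());     // prints true
    }
}
```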
Recovering from permanent failures
• This is the scenario in which hinted replicas become
unavailable before they can be returned to the
original replica node.
• To handle this and other threats to durability,
Dynamo implements an anti-entropy protocol to
keep the replicas synchronized.
67
Recovering from permanent failures
• To detect the inconsistencies between replicas faster
and to minimize the amount of transferred data,
Dynamo uses Merkle trees [Merkle 1988]
• A Merkle tree is a hash tree where:
– leaves are hashes of the values of individual keys.
– Parent nodes higher in the tree are hashes of their respective
children.
• The principal advantage of Merkle tree is that each
branch of the tree can be checked independently
without requiring nodes to download the entire tree.
68
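The Merkle-tree comparison can be sketched with a deliberately simplified left-deep tree (real implementations balance the tree over key ranges; names are ours):

```java
// Merkle tree sketch: leaves hash individual values, parents hash their
// children; two replicas compare hashes top-down and descend only where
// the hashes differ, so synchronized subtrees transfer nothing.
import java.security.MessageDigest;
import java.util.*;

public class Merkle {
    static byte[] md5(byte[] b) {
        try { return MessageDigest.getInstance("MD5").digest(b); }
        catch (Exception e) { throw new RuntimeException(e); }
    }

    static byte[] concat(byte[] x, byte[] y) {
        byte[] out = Arrays.copyOf(x, x.length + y.length);
        System.arraycopy(y, 0, out, x.length, y.length);
        return out;
    }

    // root hash of a (degenerate, left-deep) Merkle tree over the values
    static byte[] root(List<String> values) {
        byte[] acc = md5(values.get(0).getBytes());
        for (int i = 1; i < values.size(); i++)
            acc = md5(concat(acc, md5(values.get(i).getBytes()))); // parent = hash(children)
        return acc;
    }

    public static void main(String[] args) {
        byte[] r1 = root(List.of("k1=a", "k2=b", "k3=c"));
        byte[] r2 = root(List.of("k1=a", "k2=b", "k3=c"));
        byte[] r3 = root(List.of("k1=a", "k2=X", "k3=c"));
        System.out.println(Arrays.equals(r1, r2)); // prints true: replicas in sync
        System.out.println(Arrays.equals(r1, r3)); // prints false: descend to find the diff
    }
}
```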
Membership and failure detection
• Membership changes happen on permanent node
failures or by operator decision.
• In such cases, an administrator uses a command-line
tool or a browser
– to connect to a Dynamo node and issue a membership
change,
– to join a node to the ring, or
– to remove a node from the ring.
69
Implementation
• In Dynamo, each storage node has three main
software components:
1. request coordination,
2. membership and failure detection,
3. and a local persistence engine.
• All these components are implemented in Java.
70
Backend Storage
• Dynamo’s local persistence component allows for
different storage engines to be plugged in.
• Engines that are in use:
1. Berkeley Database (BDB) Transactional Data Store,
2. Berkeley Database Java Edition,
3. MySQL,
4. and an in-memory buffer with a persistent backing store.
71
Conclusions
72
Dynamo Main Contributions
1. It demonstrates how different techniques can be
combined to provide a single highly-available
system.
2. It demonstrates that an eventually consistent
storage system can be used in production with
demanding applications.
3. It provides insight into the tuning of these
techniques.
73
References
1. http://diyhpl.us/~bryan/papers2/distributed/distributed-systems/consistent-hashing.1996.pdf
2. http://www.ist-selfman.org/wiki/images/9/9f/2006-schuett-gp2pc.pdf
3. http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
4. http://www.tom-e-white.com/2007/11/consistent-hashing.html
5. http://michaelnielsen.org/blog/consistent-hashing/
6. http://research.microsoft.com/pubs/66979/tr-2003-60.pdf
7. http://www.quora.com/Why-use-Vector-Clocks-in-a-distributed-database
74
References
8. http://basho.com/why-vector-clocks-are-easy/
9. http://en.wikipedia.org/wiki/Vector_clock
10. http://basho.com/why-vector-clocks-are-hard/
11. http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks
12. https://github.com/patriknw/akka-data-replication
75

 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
Introduction
IntroductionIntroduction
Introduction
 
Cassandra Tutorial
Cassandra Tutorial Cassandra Tutorial
Cassandra Tutorial
 
Apache cassandra
Apache cassandraApache cassandra
Apache cassandra
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
 
Data Engineering for Data Scientists
Data Engineering for Data Scientists Data Engineering for Data Scientists
Data Engineering for Data Scientists
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 

More from Fabio Fumarola

11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with DockerFabio Fumarola
 
2 Linux Container and Docker
2 Linux Container and Docker2 Linux Container and Docker
2 Linux Container and DockerFabio Fumarola
 
An introduction to maven gradle and sbt
An introduction to maven gradle and sbtAn introduction to maven gradle and sbt
An introduction to maven gradle and sbtFabio Fumarola
 
Develop with linux containers and docker
Develop with linux containers and dockerDevelop with linux containers and docker
Develop with linux containers and dockerFabio Fumarola
 
Linux containers and docker
Linux containers and dockerLinux containers and docker
Linux containers and dockerFabio Fumarola
 
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce Fabio Fumarola
 
NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and consFabio Fumarola
 

More from Fabio Fumarola (12)

11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
10. Graph Databases
10. Graph Databases10. Graph Databases
10. Graph Databases
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker
 
2 Linux Container and Docker
2 Linux Container and Docker2 Linux Container and Docker
2 Linux Container and Docker
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
An introduction to maven gradle and sbt
An introduction to maven gradle and sbtAn introduction to maven gradle and sbt
An introduction to maven gradle and sbt
 
Develop with linux containers and docker
Develop with linux containers and dockerDevelop with linux containers and docker
Develop with linux containers and docker
 
Linux containers and docker
Linux containers and dockerLinux containers and docker
Linux containers and docker
 
08 datasets
08 datasets08 datasets
08 datasets
 
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
 
NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and cons
 

Recently uploaded

1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证ppy8zfkfm
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理cyebo
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理pyhepag
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationmuqadasqasim10
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Jon Hansen
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfEmmanuel Dauda
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp onlinebalibahu1313
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfscitechtalktv
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonPayment Village
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshareraiaryan448
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunksgmuir1066
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"John Sobanski
 

Recently uploaded (20)

1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
 
123.docx. .
123.docx.                                 .123.docx.                                 .
123.docx. .
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic information
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 

7. Key-Value Databases: In Depth

  • 1. Key-Value Databases In Depth, Dr. Fabio Fumarola
  • 2. Outline • Key-values introduction • Major Key-Value Databases • Dynamo DB: How it is implemented – Background – Partitioning: Consistent Hashing – High Availability for writes: Vector Clocks – Handling temporary failures: Sloppy Quorum – Recovering from failures: Merkle Trees – Membership and failure detection: Gossip Protocol 2
  • 3. Key-Value Databases • A key-value store is a simple hash table • Where all the accesses to the database are via primary keys. • A client can either: – Get the value for a key – Put a value for a key – Delete a key from the data store. 3
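The three operations above can be sketched with a minimal in-memory store; `KVStore` is a hypothetical wrapper around a hash table, not any particular product's API:

```python
class KVStore:
    """Minimal in-memory key-value store: all access is by primary key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None for a missing key instead of raising
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KVStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))   # {'name': 'Ada'}
store.delete("user:42")
print(store.get("user:42"))   # None
```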
  • 4. Key-value store: characteristics • Key-value data access enables high performance and availability. • Both keys and values can be complex compound objects, and sometimes lists, maps or other data structures. • Consistency is applicable only to operations on a single key (eventual consistency across replicas). 4
  • 5. Key-Values: Cons • No complex query filters • All joins must be done in code • No foreign key constraints • No triggers 5
  • 6. Key-Values: Pros • Efficient queries (very predictable performance). • Easy to distribute across a cluster. • Service-orientation disallows foreign key constraints and forces joins to be done in code anyway. • Using a relational DB + cache forces you into key-value storage anyway • No object-relational mismatch 6
  • 7. Popular Key-Value Stores • Riak – Basho • Redis – Data Structure server • Memcached DB • Berkeley DB – Oracle • Aerospike – fast key-value for SSD disks • LevelDB – Google key-value store • DynamoDB – Amazon key-value store • Voldemort – open-source Amazon Dynamo replica 7
  • 8. Memcached DB • Atomic operations set/get/delete. • O(1) to set/get/delete. • Consistent hashing. • In memory caching, no persistence. • LRU eviction policy. • No iterators. 8
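The O(1) set/get/delete plus the LRU eviction policy mentioned above can be sketched with an ordered hash table; this is a toy model of memcached-style caching, not its actual implementation:

```python
from collections import OrderedDict


class LRUCache:
    """Toy memcached-style cache: O(1) get/set/delete, LRU eviction, no persistence."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")          # touch "a" so "b" becomes least recently used
cache.set("c", 3)       # capacity exceeded: evicts "b"
print(cache.get("b"))   # None
print(cache.get("a"))   # 1
```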
  • 9. Aerospike • Key-Value database optimized for hybrid (DRAM + Flash) approach • First published in the Proceedings of VLDB (Very Large Databases) in 2011, “Citrusleaf: A Real-Time NoSQL DB which Preserves ACID” 9
  • 10. Redis • Written in C with a BSD license • It is an advanced key-value store. • Values can be strings, hashes, lists, sets, sorted sets, bitmaps and HyperLogLogs. • It works with an in-memory dataset. • Data can be persisted either by dumping the dataset to disk every once in a while, or by appending each command to a log. • Created by Salvatore Sanfilippo (Pivotal) 10
  • 11. Riak • Distributed Database written in: Erlang & C, some JavaScript • Operations – GET /buckets/BUCKET/keys/KEY – PUT|POST /buckets/BUCKET/keys/KEY – DELETE /buckets/BUCKET/keys/KEY • Integrated with Solr and MapReduce • Data Types: basic, Sets and Maps 11 curl -XPUT 'http://localhost:8098/riak/food/favorite' -H 'Content-Type:text/plain' -d 'pizza'
  • 12. LevelDB LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. •Keys and values are arbitrary byte arrays. •Data is stored sorted by key. •The basic operations are Put(key, value), Get(key), Delete(key). •Multiple changes can be made in one atomic batch. Limitation •There is no client-server support built into the library. 12
  • 13. DynamoDB • Peer-to-peer key-value database. • Service Level Agreement at the 99.9th percentile. • Highly available, sacrificing consistency • Can handle online node additions and node failures • It supports object versioning and application-assisted conflict resolution (eventually-consistent data structures) 13
  • 15. Amazon Dynamo DB • We analyze the design and the implementation of Dynamo. • Amazon runs a world-wide e-commerce platform • It serves 10 million customers • At peak times it uses 10,000 servers located in many data centers around the world. • They have requirements of performance, reliability and efficiency that need a fully scalable platform. 15
  • 16. Motivation of Dynamo • There are many Amazon services that only need primary-key access to a data store – To provide best-seller lists – Shopping carts – Customer preferences – Session management – Sales rank and product catalogs • Using a relational database would lead to inefficiencies and limit scale and availability 16
  • 18. Scalability is application dependent • Lesson 1: the reliability and scalability of a system depend on how its application state is managed. • Amazon uses a highly decentralized, loosely coupled, service-oriented architecture composed of hundreds of services. • They need storage that is always available. 18
  • 19. Shopping carts always • Customers should be able to view and add items to their shopping carts even if: – Disks are failing, or – A data center is being destroyed by a tornado or a kraken. 19
  • 20. Failures Happen • When you deal with an infrastructure composed of millions of servers and network components, crashes happen. 20 http://letitcrash.com/
  • 21. High Availability by contract • A Service Level Agreement (SLA) is the guarantee that an application can deliver its functionality in a bounded time. • An example of an SLA is to guarantee that the Acme API provides a response within 300ms for 99.9% of its requests at a peak of 500 concurrent users (CCU). • Normally an SLA is described using average, median and expected variance. 21
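A 99.9th-percentile latency target like the one above can be checked against measured samples. A small sketch using the nearest-rank method; the 300 ms threshold is the hypothetical Acme figure from the slide:

```python
def percentile(samples, p):
    """Return the value below which p% of the samples fall (nearest-rank method)."""
    s = sorted(samples)
    rank = max(1, round(p / 100 * len(s)))  # nearest rank, 1-based
    return s[rank - 1]


# Ten response times in milliseconds (illustrative data)
latencies_ms = [12, 25, 18, 300, 22, 19, 31, 280, 17, 21]

p999 = percentile(latencies_ms, 99.9)
print(p999)         # 300: the worst sample dominates at this percentile
print(p999 <= 300)  # True: the SLA is (just) met on this data
```

With so few samples the 99.9th percentile is simply the maximum; in practice you would aggregate far more measurements per window.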
  • 22. Dynamo DB It uses a synthesis of well-known techniques to achieve scalability and availability. 1. Data is partitioned and replicated using consistent hashing [Karger et al. 1997]. 2. Consistency is facilitated by object versioning with vector clocks [Lamport 1978] 3. Consistency among replicas is maintained by a decentralized replica synchronization protocol (E-CRDT). 4. A gossip protocol is used for membership and failure detection. 22
  • 23. System Interface • Dynamo stores objects associated with a key through two operations: get() and put() – The get(key) locates the object replicas associated with the key in the storage and returns a single object or a list of objects with conflicting versions along with a context. – The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. – The context encodes system metadata about the object 23
  • 24. Key and Value encoding • Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. • It applies an MD5 hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key. 24
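The key-to-identifier step can be shown directly with the standard library: the MD5 digest of the key bytes is the 128-bit position on the ring.

```python
import hashlib


def ring_position(key: bytes) -> int:
    """128-bit ring position of a key (MD5 of the opaque key bytes, as in Dynamo)."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


pos = ring_position(b"user:42")
print(pos.bit_length() <= 128)                      # True: fits in 128 bits
print(ring_position(b"user:42") == pos)             # True: deterministic
```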
  • 25. Dynamo Architectural Choices 1/2 We focus on the core of the distributed systems techniques used 25
    – Problem: Partitioning. Technique: Consistent hashing. Advantage: Incremental scalability.
    – Problem: High availability for writes. Technique: Vector clocks with reconciliation during reads. Advantage: Version size is decoupled from update rates.
    – Problem: Handling temporary failures. Technique: Sloppy quorum and hinted handoff. Advantage: Provides high availability and a durability guarantee when some of the replicas are not available.
  • 26. Dynamo Architectural Choices 2/2 We focus on the core of the distributed systems techniques used 26
    – Problem: Recovering from permanent failures. Technique: Anti-entropy using Merkle trees. Advantage: Synchronizes divergent replicas in the background.
    – Problem: Membership and failure detection. Technique: Gossip-based membership protocol and failure detection. Advantage: Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.
  • 27. Partitioning: Consistent Hashing • Dynamo must scale incrementally. • This requires a mechanism to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system. • Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. • The output range of a hash function is treated as a fixed circular space or ring 27
  • 28. Partitioning: Consistent Hashing • Each node in the system is assigned a random value within this space which represents its “position” on the ring. • Each data item is assigned to a node by: 1. hashing the data item’s key to yield its position on the ring, 2. and then walking the ring clockwise to find the first node with a position larger than the item’s position. 28
  • 29. Partitioning: Consistent Hashing • Each node becomes responsible for the region of the ring between it and its predecessor node • The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors, while other nodes remain unaffected. 29
  • 30. Consistent Hashing: Idea • Consistent hashing is a technique that lets you smoothly handle these problems: 1. Given a resource key and a list of servers, how do you find a primary, secondary, tertiary (and on down the line) server for the resource? 2. If you have different size servers, how do you assign each of them an amount of work that corresponds to their capacity? 30
  • 31. Consistent Hashing: Idea • Consistent hashing is a technique that lets you smoothly handle these problems: 3. How do you smoothly add capacity to the system without downtime? 4. Specifically, this means solving two problems: • How do you avoid dumping 1/N of the total load on a new server as soon as you turn it on? • How do you avoid rehashing more existing keys than necessary? 31
  • 32. Consistent Hashing: How To • Imagine a 128-bit space. • visualize it as a ring, or a clock face • Now imagine hashing resources into points on the circle 32
  • 33. Consistent Hashing: How To • They could be URLs, GUIDs, integer IDs, or any arbitrary sequence of bytes. • Just run them through a good hash function (e.g., MD5) and shave off everything but 16 bytes. • We have four key-values: 1, 2, 3, 4. 33
  • 34. Consistent Hashing: How To • Finally, imagine our servers. – A, – B, and – C • We put our servers in the same ring. • This solves the problem of which server should serve Resource 2 34
  • 35. Consistent Hashing: How To • We start where Resource 2 is and head clockwise on the ring until we hit a server. • If that server is down, we go to the next one, and so on and so forth 35
  • 36. Consistent Hashing: How To • Key-value 4 and 1 belong to the server A • Key-value 2 to the server B • Key-value 3 to the server C 36
  • 37. Consistent Hashing: Del Server • If the server C is removed • Key-value 3 now belongs to the server A • All the other key-values mapping are unchanged 37
  • 38. Consistent Hashing: Add Server • If server D is added in the position marked • What are the object that will belongs to D? 38
  • 39. Consistent Hashing: Cons • This works well, except that the size of the intervals assigned to each cache is pretty hit and miss. • Since it is essentially random, it is possible to have a very non-uniform distribution of objects between caches. • To address this issue the idea of "virtual nodes" is introduced 39
  • 40. Consistent Hashing: Virtual Nodes • Instead of mapping a server to a single point in the circle, each server gets assigned to multiple points in the ring. • A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. • Effectively, when a new node is added to the system, it is assigned multiple positions in the ring. 40
  • 41. Virtual Nodes: Advantages • If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes. • When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes. • The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure. 41
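The ring walk and virtual nodes described above can be sketched as follows. This is a toy implementation under simple assumptions: node names A/B/C and the 100-vnode count are illustrative, and a sorted list plus binary search stands in for a real ring structure.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted (position, physical node) pairs

    def _pos(self, key: str) -> int:
        # 128-bit position on the ring (MD5, as in the slides)
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

    def add_node(self, node: str):
        # Each physical node is hashed to many points on the ring
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._pos(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        # Walk clockwise from the key's position to the first vnode
        i = bisect.bisect(self._ring, (self._pos(key),)) % len(self._ring)
        return self._ring[i][1]


ring = ConsistentHashRing()
for n in ("A", "B", "C"):
    ring.add_node(n)

owner = ring.get_node("resource-2")
ring.remove_node(owner)
# Only keys owned by the removed node move; this one moves to the next node
print(ring.get_node("resource-2") != owner)  # True
```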
  • 42. Data Replication • To achieve high availability and durability, Dynamo replicates its data on multiple hosts. • Each data item is replicated at N hosts, where N is a parameter configured “per-instance”. • Each key k is assigned to a coordinator node (described above). • The coordinator is in charge of the replication of the data items that fall within its range (ring). 42
  • 43. Data Replication • The coordinator locally stores each key within its range, • and in addition it replicates these keys at the N-1 clockwise successor nodes in the ring. 43
  • 44. Data Replication • The list of nodes responsible for storing a particular key is called the preference list • The system is designed so that every node in the system can determine which nodes should be in this list for any particular key. • To account for node failures, the preference list contains more than N nodes. • To avoid a key k being owned by fewer than N physical nodes because of "virtual nodes", the preference list skips ring positions so that it contains only distinct physical nodes. 44
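Building on a consistent-hash ring, the preference list can be sketched as the first N distinct physical nodes found walking clockwise, skipping extra vnodes of nodes already chosen. Toy code; node names and the 50-vnode count are illustrative:

```python
import bisect
import hashlib


def pos(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")


# Ring of sorted (position, physical node) pairs, built with virtual nodes
ring = sorted((pos(f"{n}#{i}"), n) for n in ("A", "B", "C", "D") for i in range(50))


def preference_list(key: str, n: int) -> list:
    """First n DISTINCT physical nodes clockwise from the key's position."""
    start = bisect.bisect(ring, (pos(key),))
    nodes = []
    for j in range(len(ring)):
        node = ring[(start + j) % len(ring)][1]
        if node not in nodes:
            nodes.append(node)  # skip vnodes of already-chosen physical nodes
        if len(nodes) == n:
            break
    return nodes


# Three distinct physical replicas; the first one is the key's coordinator
print(preference_list("cart:42", 3))
```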
  • 45. High Availability for writes • With eventual consistency writes are propagated asynchronously. • A put() may return to its caller before the update has been applied at all the replicas. • In this scenario a subsequent get() operation may return an object that does not have the latest updates. 45
  • 46. High Availability for writes: Example • We can see this with shopping carts. • The “Add to Cart” operation can never be forgotten or rejected. • When a customer wants to add an item to (or remove from) a shopping cart and the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later. • Question! 46
  • 47. High Availability for writes • Dynamo treats the result of each modification as a new and immutable version of the data. • It allows for multiple versions of an object to be present in the system at the same time. • Most of the time, new versions subsume the previous version(s), and the system itself can determine the authoritative version (syntactic reconciliation). 47
  • 49. Singly-Linked List 49 (diagram: Cons(3) → Cons(5) → Cons(7) → Nil)

```scala
abstract sealed class List {
  def head: Int
  def tail: List
  def isEmpty: Boolean
}

case object Nil extends List {
  def head: Int = fail("Empty list.")
  def tail: List = fail("Empty list.")
  def isEmpty: Boolean = true
}

case class Cons(head: Int, tail: List = Nil) extends List {
  def isEmpty: Boolean = false
}
```
  • 50. List: analysis 50 (structural sharing: A = 3 → 5 → 7; B = Cons(9, A) = 9 → 3 → 5 → 7; C = Cons(1, Cons(8, B)) = 1 → 8 → 9 → 3 → 5 → 7; B and C share A's cells)
  • 51. List: append & prepend 51 (diagram: prepend(9) gives 9 → 3 → 5 → 7; append(9) copies the spine: 3 → 5 → 7 → 9)

```scala
/** Time - O(1), Space - O(1) */
def prepend(x: Int): List = Cons(x, this)

/** Time - O(n), Space - O(n) */
def append(x: Int): List =
  if (isEmpty) Cons(x) else Cons(head, tail.append(x))
```
  • 52. List: apply 52 (diagram: walking n - 1 tails to reach the element)

```scala
/** Time - O(n), Space - O(n) */
def apply(n: Int): Int =
  if (isEmpty) fail("Index out of bounds.")
  else if (n == 0) head
  else tail(n - 1) // or tail.apply(n - 1)
```
  • 53. List: concat 53 (path copying: A = 4 → 2 → 6, B = 3 → 5 → 7, C = A.concat(B) = 4 → 2 → 6 → 3 → 5 → 7, copying only A's cells)

```scala
/** Time - O(n), Space - O(n) */
def concat(xs: List): List =
  if (isEmpty) xs else tail.concat(xs).prepend(head)
```
  • 54. List: reverse (two approaches) 54 (diagram: reverse(4 → 2 → 6) = 6 → 2 → 4)

The straightforward solution in O(n²):

```scala
def reverse: List =
  if (isEmpty) Nil else tail.reverse.append(head)
```

or tail recursion in O(n):

```scala
def reverse: List = {
  @tailrec
  def loop(s: List, d: List): List =
    if (s.isEmpty) d else loop(s.tail, d.prepend(s.head))
  loop(this, Nil)
}
```
  • 57. High Availability for writes • Node failures can potentially result in the system having not just two but several versions of the same data. • Updates in the presence of network partitions and node failures can potentially result in an object having distinct version sub-histories. 57
  • 58. High Availability for writes • Dynamo uses vector clocks in order to capture causality between different versions of the same object. • One vector clock is associated with every version of every object • We can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks 58
  • 59. High Availability for writes • When dealing with different copies of the same object: – If the counters on the first object’s clock are less than or equal to those of all of the nodes in the second clock, then the first is an ancestor of the second and can be forgotten. – Otherwise, the two changes are considered to be in conflict and require reconciliation. 59
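The ancestor test above is a component-wise comparison of clocks. A minimal sketch, with clocks as dicts of node → counter; the node names Sx/Sy/Sz follow the Dynamo paper's example:

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock b dominates clock a, i.e. version a is an ancestor of b."""
    return all(counter <= b.get(node, 0) for node, counter in a.items())


v1 = {"Sx": 1}              # written at node Sx
v2 = {"Sx": 2}              # v1 updated again at Sx
v3 = {"Sx": 2, "Sy": 1}     # v2 updated at Sy
v4 = {"Sx": 2, "Sz": 1}     # v2 updated at Sz, concurrently with v3

print(descends(v1, v2))                          # True: v1 can be forgotten
print(descends(v3, v4) or descends(v4, v3))      # False: conflict, reconcile
```

v3 and v4 are on parallel branches: neither clock dominates the other, so a read returns both versions and the application (e.g. the shopping cart) reconciles them.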
  • 60. HA with Vector Clocks • A vector clock is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations. • Vector clocks are based on logical timestamps, otherwise known as Lamport clocks. • A Lamport clock is a single integer value that is passed around the cluster with every message sent between nodes. 60
  • 61. HA with Vector Clocks • Events in the blue region are the causes leading to event B4, whereas those in the red region are the effects of event B4 61
  • 62. HA with Vector Clocks • Each node keeps a record of what it thinks the latest (i.e. highest) Lamport clock value is, and if it hears a larger value from some other node, it updates its own value. • Every time a database record is produced, the producing node can attach the current Lamport clock value + 1 to it as a timestamp. • This sets up a total ordering on all records with the valuable property that if record A may causally precede record B, then A's timestamp < B's timestamp. 62
  • 64. Execution of get() and put() • Each read and write is handled by a coordinator. • Typically, this is the first among the top N nodes in the preference list • Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible. 64
  • 65. Handling temporary failures • To handle this kind of failure Dynamo uses a “sloppy quorum”. • When there is a failure, a write is persisted on the next available nodes in the preference list. • The replica sent to a stand-in node D will have a hint in its metadata that records which node was the intended recipient of the replica (in this case A). 65
  • 66. Handling temporary failures • Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically. • Upon detecting that A has recovered, D will attempt to deliver the replica to A. • Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system. 66
  • 67. Recovering from permanent failures • This is the scenario where hinted replicas become unavailable before they can be returned to the original replica node. • To handle this and other threats to durability, Dynamo implements an anti-entropy protocol to keep the replicas synchronized. 67
  • 68. Recovering from permanent failures • To detect the inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees [Merkle 1988] • A Merkle tree is a hash tree where: – leaves are hashes of the values of individual keys. – parent nodes higher in the tree are hashes of their respective children. • The principal advantage of a Merkle tree is that each branch can be checked independently, without requiring nodes to download the entire tree. 68
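The structure described on slide 68 can be sketched with a bottom-up build. This is a toy illustration (SHA-256 is an assumption; the Dynamo paper does not specify the hash function, and real anti-entropy maintains one tree per key range):

```python
import hashlib

def h(data):
    """Hash a bytes value."""
    return hashlib.sha256(data).digest()

def build_merkle(values):
    """Build a Merkle tree bottom-up. Returns the list of levels:
    levels[0] holds the leaf hashes, levels[-1] holds only the root."""
    level = [h(v) for v in values]        # leaves: hashes of key values
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                # duplicate the last hash if odd
            level = level + [level[-1]]
        # each parent is the hash of its two children concatenated
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels
```

Two replicas first compare roots; if they match, the key range is in sync and no further data is exchanged. If they differ, the nodes recurse only into the child branches whose hashes disagree, which is what keeps the transferred data small.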
  • 69. Membership and failure detection • Explicit membership changes are needed after permanent node failures or operator mistakes. • In such cases, an administrator uses a command line tool or a browser – to connect to a Dynamo node and issue a membership change – to join a node to a ring or – remove a node from a ring. 69
  • 70. Implementation • In Dynamo, each storage node has three main software components: 1. request coordination, 2. membership and failure detection, and 3. a local persistence engine. • All these components are implemented in Java. 70
  • 71. Backend Storage • Dynamo’s local persistence component allows for different storage engines to be plugged in. • Engines that are in use 1. are Berkeley Database (BDB) Transactional Data Store, 2. Berkeley Database Java Edition, 3. MySQL, 4. and an in-memory buffer with persistent backing store. 71
  • 73. Dynamo Main Contributions 1. It demonstrates how different techniques can be combined to provide a single highly-available system. 2. It demonstrates that an eventually consistent storage system can be used in production with demanding applications. 3. It provides insight into the tuning of these techniques. 73
  • 74. References 1. http://diyhpl.us/~bryan/papers2/distributed/distributed-systems/consistent-hashing.1996.pdf 2. http://www.ist-selfman.org/wiki/images/9/9f/2006-schuett-gp2pc.pdf 3. http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/ 4. http://www.tom-e-white.com/2007/11/consistent-hashing.html 5. http://michaelnielsen.org/blog/consistent-hashing/ 6. http://research.microsoft.com/pubs/66979/tr-2003-60.pdf 7. http://www.quora.com/Why-use-Vector-Clocks-in-a-distributed-database 74
  • 75. References 8. http://basho.com/why-vector-clocks-are-easy/ 9. http://en.wikipedia.org/wiki/Vector_clock 10. http://basho.com/why-vector-clocks-are-hard/ 11. http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks 12. https://github.com/patriknw/akka-data-replication 75

Editor's Notes

  1. CRDT = Conflict-free Replicated Data Type
  2. What kind of operations are used for add and del to cart?