
NoSQL - how it works (@pavlobaron)

Slides of my OOP'12 talk



  1. NoSQL. How it works. Pavlo Baron
  2. Geek’s Guide To The Working Life. Pavlo Baron, pavlo.baron@codecentric.de, @pavlobaron
  3. NoSQL is not about … <140’000 things NoSQL is not about> … NoSQL is about choice. (Jan Lehnardt on NoSQL)
  4. (John Muellerleile on NoSQL)
  5. NoSQL addresses the issue of poorly structured data
  6. NoSQL addresses the issue of data management simplicity
  7. NoSQL addresses the issue of data flood
  8. NoSQL addresses the issue of extremely frequent reads/writes
  9. NoSQL addresses the issue of big data streams
  10. NoSQL addresses the issue of real-time data processing and analysis
  11. NoSQL addresses the issue of huge data storage
  12. NoSQL addresses the issue of fast data filtering
  13. NoSQL addresses the issue of complex, deep relations
  14. NoSQL addresses the issue of pure web existences
  15. NoSQL addresses the issue of picking the right tool for the job
  16. How?
  17. Chop in smaller pieces
  18. Chop in bite-size, manageable pieces
  19. Separate reading from writing
  20. Caching. Variations: eager write, append only, lazy write, eventual consistency
  21. Write through (diagram: reads and writes for products/users go through the cache to the data store; on a miss, the cache reads through from the store)
  22. Write back / write snapshotting (diagram: reads and writes hit only the cache; on a miss or a snapshot, the cache writes back to the data store)
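The two caching variations above can be sketched in a few lines. A minimal, hypothetical in-memory sketch (a plain dict stands in for the data store; a real cache would sit in front of a database):

```python
class WriteThroughCache:
    """Eager write: every write goes to the cache AND the store."""
    def __init__(self, store):
        self.store = store   # backing data store (a plain dict here)
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value      # synchronous write-through

    def read(self, key):
        if key not in self.cache:    # miss: read through to the store
            self.cache[key] = self.store[key]
        return self.cache[key]


class WriteBackCache:
    """Lazy write: writes stay in the cache until a later flush."""
    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.dirty = set()           # keys not yet persisted

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)          # store is only eventually updated

    def flush(self):
        """Write back, e.g. periodically or as a snapshot."""
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

The write-back variant trades durability for write latency: until `flush` runs, the store lags behind the cache, which is exactly the eventual consistency the slide names.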
  23. Design for a theoretically unlimited amount of data
  24. Append, update, mark, recycle; don’t delete and restructure
  25. Minimize hard relations
  26. Parallelize and distribute
  27. Avoid single bottlenecks
  28. Decentralize with “equal” nodes
  29. Build upon consensus, agreement, voting, quorum
  30. Gossip – RM (diagram: replica managers RM1 and RM2 exchange updates; each keeps a replica clock, a value, an update log, an executed-operation table and a stable clock)
  31. Gossip – node down/up (diagram: four nodes gossip updates and reads; when Node 4 goes down the others record “4 down”, and when it comes back up it catches up via gossip)
  32. Don’t trust time and timestamps
  33. Clocks. V(i), V(j): competing. Conflict resolution: 1) siblings, client; 2) merge, system; 3) voting, system
  34. Timestamps (diagram: three nodes with drifting wall clocks; identical timestamps on different nodes make ordering ambiguous)
  35. Logical clocks (diagram: Lamport counters per node; equal counters on different nodes still leave the ordering ambiguous)
  36. Vector clocks (diagram: three nodes; each event carries a per-node counter vector, e.g. 1,0,0 → 2,2,0 → 3,2,0 → 4,3,3)
  37. Vector clocks (diagram: the same scheme across four nodes)
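The vector-clock bookkeeping sketched on slides 36 and 37 boils down to three operations: bump your own counter on a local event, take the element-wise maximum when replicas sync, and compare clocks to detect conflicts. A minimal sketch with hypothetical node names:

```python
def increment(clock, node):
    """A local event on `node` bumps that node's counter."""
    clock = dict(clock)                      # clocks are plain dicts
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """Element-wise maximum: taken when replicas synchronize."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen every event that clock b has."""
    return all(a.get(n, 0) >= count for n, count in b.items())

def concurrent(a, b):
    """Neither descends from the other: conflicting siblings that
    must be resolved by the client, a merge, or voting (slide 33)."""
    return not descends(a, b) and not descends(b, a)
```

Two clocks that each only saw their own local event are concurrent; after a merge, the result descends from both, so the conflict is resolved.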
  38. Strive for O(1) for data lookups (# = hash)
  39. Merkle Trees. N, M: nodes; HT(N), HT(M): hash trees. M needs update: obtain HT(N), calc delta(HT(M), HT(N)), pull keys(delta)
  40. Merkle Trees (diagram: the hash trees of nodes a.1 and a.2; differing subtrees such as ab/ad point to the keys that diverge)
  41. Merkle Trees (diagram: comparing the two trees narrows the difference down to a few leaves)
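The sync procedure from slide 39 can be sketched with flat leaf hashes (a real implementation hashes ranges upward into a tree so whole matching subtrees can be skipped); `sync` and its helpers are illustrative names, not any particular product's API:

```python
import hashlib

def leaf_hashes(data):
    """Hash each key's value; a real tree hashes ranges upward."""
    return {k: hashlib.sha256(repr(v).encode()).hexdigest()
            for k, v in data.items()}

def root_hash(leaves):
    """Root of the (flattened) tree: hash of all leaf hashes."""
    return hashlib.sha256(
        "".join(h for _, h in sorted(leaves.items())).encode()
    ).hexdigest()

def delta(mine, theirs):
    """Keys whose hashes differ, or that only one side has."""
    return {k for k in set(mine) | set(theirs)
            if mine.get(k) != theirs.get(k)}

def sync(m_data, n_data):
    """M needs update: obtain HT(N), compute the delta, pull keys."""
    ht_m, ht_n = leaf_hashes(m_data), leaf_hashes(n_data)
    if root_hash(ht_m) == root_hash(ht_n):
        return set()                       # identical: nothing to do
    keys = delta(ht_m, ht_n)
    for k in keys:
        if k in n_data:
            m_data[k] = n_data[k]          # pull only the delta
    return keys
```

The payoff is the early exit: if the roots match, no keys are transferred at all, which is what makes anti-entropy between large replicas cheap.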
  42. Vertical sharding (diagram: users, addresses, contracts and orders live on Node 1; invoices, products and items on Node 2; “read contract” for user=foo is routed to Node 1)
  43. Range based sharding (diagram: Node 1 holds users id(1-N), addresses zip(1234-2345) and products; Node 2 holds users id(1-M) and addresses zip(2346-9999); reads and writes are routed by range)
  44. Hash based sharding. Start with 3 nodes: node N = # mod 3. Add 2 nodes: N = # mod 5. Kill 2 nodes: N = # mod 3
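Slide 44's warning can be made concrete: with naive `# mod N` placement, changing the node count remaps most keys. A small sketch, using integer keys as stand-ins for hash values:

```python
def node_for(key, n_nodes):
    return key % n_nodes        # integer keys stand in for hash(key)

keys = range(1000)
before = {k: node_for(k, 3) for k in keys}    # start with 3 nodes
after = {k: node_for(k, 5) for k in keys}     # add 2 nodes
moved = sum(1 for k in keys if before[k] != after[k])
print(moved)    # 799: almost 80% of all keys must be rehashed and moved
```

A key stays put only when its residue mod 3 equals its residue mod 5, i.e. for 3 residues out of every 15. This mass migration on every topology change is the problem the ring on slide 49 solves.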
  45. Insert key (diagram: Key = “foo” is hashed to # = N and stored on the owning node)
  46. Add 2 nodes (diagram: all keys are rehashed; most leave their node and move)
  47. Lookup (diagram: Key = “foo” is hashed to # = N, routed to its node, and Value = “bar” is returned)
  48. Remove node (diagram: again all keys are rehashed and most move)
  49. The ring. X-bit integer space: 0 <= N <= 2^X. Or angles: 0 <= A <= 2 x Pi, x(N) = cos(A), y(N) = sin(A)
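The ring on slide 49 is what consistent hashing builds on: nodes and keys hash into the same integer space, and a key belongs to the first node clockwise from its hash. Adding or removing a node then only moves the keys in one arc, not most of them. A minimal sketch (no vnodes, node names hypothetical):

```python
import bisect
import hashlib

def h(s, bits=32):
    """Map a string into the 0 .. 2^bits - 1 ring."""
    return int(hashlib.sha256(s.encode()).hexdigest(), 16) % (2 ** bits)

class Ring:
    def __init__(self, nodes):
        # each node occupies one point on the ring, sorted by hash
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        # first node clockwise from the key's hash, wrapping at 2^X
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]
```

Production systems place each physical node at many points (the vnodes of slide 50) so that load spreads evenly and a joining node takes small slices from everyone.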
  50. Clustering. 12 partitions (constant): 3 nodes with 4 vnodes each; after adding a node: 4 nodes with 3 vnodes each. Alternatives: 3 nodes with 2 x 5 + 1 x 2 vnodes; container based
  51. Quorum. V: vnodes holding a key; W: write quorum; R: read quorum; DW: durable write quorum. W > 0.5 * V; R + W > V
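The quorum rules on slide 51 are easy to check mechanically; a tiny sketch:

```python
def quorum_ok(v, r, w):
    """V vnodes hold the key; R/W are the read/write quorum sizes."""
    # W > V/2 forbids two disjoint write majorities;
    # R + W > V forces every read set to overlap every write set.
    return w > 0.5 * v and r + w > v

print(quorum_ok(3, 2, 2))   # True:  the common V=3, R=W=2 setup
print(quorum_ok(3, 1, 1))   # False: a read may miss the latest write
```

Tuning R and W within these bounds is one of the latency/availability knobs slides 82-83 refer to: lower quorums answer faster but tolerate less.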
  52. Insert key (sloppy quorum) (diagram: Key = “foo”, # = N, W = 2; the write is replicated and acknowledged once 2 vnodes confirm)
  53. Add node (diagram: the affected partitions are copied to the new node, then the old holders leave)
  54. Lookup key (sloppy quorum) (diagram: Key = “foo”, # = N, R = 2; Value = “bar” is returned once 2 vnodes answer)
  55. Remove node (diagram: its partitions are copied to the remaining nodes, then it leaves)
  56. Minimize the distance between the data and its processors
  57. Utilize commodity hardware
  58. MapReduce. Model: functional map/fold. Out-of-database MR: irrelevant here. In-database MR: data locality, no splitting needed, distributed querying, distributed processing
  59. In-database MapReduce (diagram: the query “Alice” is mapped on Nodes A, B and C, where the data lives; Node X reduces the partial results into a hit list)
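Slide 59's in-database MapReduce can be sketched as: each node maps over its local data (data locality, no splitting), and a coordinator reduces the per-node partial results into one hit list. Node contents here are hypothetical:

```python
nodes = {                       # data already distributed per node
    "A": ["Alice", "Bob"],
    "B": ["Alice", "Carol"],
    "C": ["Dave", "Alice"],
}

def map_phase(records, query):
    """Runs where the data lives; emits the local matches."""
    return [r for r in records if r == query]

def reduce_phase(partials):
    """The coordinator merges the per-node hit lists."""
    return [hit for part in partials for hit in part]

partials = [map_phase(data, "Alice") for data in nodes.values()]
hits = reduce_phase(partials)
print(len(hits))   # 3: one match per node for "Alice"
```

The point is that only the small partial results cross the network, never the raw data, which is why in-database MR scales where shipping data to a query engine does not.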
  60. Design with eventual actuality/consistency in mind
  61. BASE: Basically Available, Soft-state, Eventually consistent. The opposite of ACID
  62. Read-your-writes consistency (diagram: after FE1 writes v2, its own subsequent reads return v2, never the older v1)
  63. Session consistency (diagram: read-your-writes guaranteed within one session against the same FE)
  64. Monotonic read consistency (diagram: once FE2 has read v3, no later read returns an older version)
  65. Monotonic write consistency (diagram: writes from one FE are applied in the order they were issued)
  66. Eventual consistency (diagram: reads may return stale versions for a while, but all replicas converge on v3)
  67. Implement redundancy and replication
  68. Replication – state transfer (diagram: the target node takes a snapshot of the source node’s state: addresses, products, users)
  69. Replication – operational transfer (diagram: the target node takes the source node’s operations (deletes, inserts, updates) and runs them)
  70. Eager replication – 3PC (diagram: the coordinator asks the cohorts “can commit?”, collects “yes”, sends pre-commit, collects ACKs, then commits)
  71. Eager replication – 3PC (failure) (diagram: the same exchange, but a missing ACK makes the coordinator abort)
  72. Eager replication – Paxos Commit. 2F + 1 acceptors overall, F + 1 correct ones to achieve consensus. Stability, Consistency, Non-Triviality, Non-Blocking
  73. Eager replication – Paxos Commit (diagram: RM1 begins the commit; the initial leader sends phase-2a “prepare” to the other RMs, and the acceptors record phase-2b “prepared”)
  74. Eager replication – Paxos Commit (failure) (diagram: the acceptors time out without a decision, so the leader aborts the commit)
  75. Lazy replication – master/slave (diagram: writes go to the master node holding addresses, products and users; the slave nodes serve reads)
  76. Lazy replication – master/master (diagram: each master accepts writes for its key ranges (users id(1-N)/id(1-M), items id(1-K)/id(1-L)) and serves reads)
  77. Hinted handoff. N: node, G: group including N. While node(N) is unavailable: replicate to G or store data(N) locally, with a hint for later handoff. Once node(N) is alive: hand the data off to node(N)
  78. Direct handoff (diagram: Key = “foo”, # = N; one replica fails, so the write is replicated with handoff hint = true)
  79. Replica recovers (diagram: the hinted data is handed off to it)
  80. All replicas fail (diagram: Key = “foo”, # = N is stored locally with handoff hint = true)
  81. All replicas recover (diagram: the hinted data is handed off and replicated)
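The hinted-handoff protocol from slide 77 in miniature (all names hypothetical): a write destined for a dead replica is parked on a fallback node with a hint, and replayed when the target recovers.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}       # locally stored key/values
        self.hints = []      # (intended target, key, value) triples

def write(key, value, target, fallback):
    """Write to the target replica, or park the write with a hint."""
    if target.alive:
        target.data[key] = value
    else:
        fallback.data[key] = value
        fallback.hints.append((target, key, value))  # handoff hint

def handoff(node):
    """Called when peers recover: replay the hinted writes."""
    remaining = []
    for target, key, value in node.hints:
        if target.alive:
            target.data[key] = value     # hand the data off
        else:
            remaining.append((target, key, value))
    node.hints = remaining
```

This is what makes the AP side of slide 89 work: the write is accepted even during the failure, at the price of the recovering replica serving stale data until the handoff completes.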
  82. Consider latency a tuning knob
  83. Consider availability a tuning knob
  84. CAP – the variations. CA: irrelevant. CP: eventually unavailable, offering maximum consistency. AP: eventually inconsistent, offering maximum availability
  85. CAP – the tradeoff (diagram: consistency vs. availability)
  86. CP (diagram: a write of v2 on Replica 1 is propagated synchronously; reads on both replicas return v2)
  87. CP (partition) (diagram: during a partition the write of v2 cannot reach Replica 2; to stay consistent, the system gives up availability)
  88. AP (diagram: a write of v2 on Replica 1 is replicated lazily; reads return v2 once replication completes)
  89. AP (partition) (diagram: during a partition Replica 1 accepts the write of v2 with a handoff hint, while Replica 2 keeps serving the stale v1)
  90. Build upon an appropriate storage strategy, not upon a general one
  91. Design for frequent structure changes
  92. Most queries are known up front. Ad-hoc queries are seldom necessary. Prepared queries can speed up data retrieval enormously. An index can help ad-hoc querying, and can be externalized. Indexes should be incremental
  93. Store as: Document (semi-structured), Key/Value (unstructured), Graph (special case), ... Externalize relations and properties
  94. The graph case. Saving a graph in a table leads to: limited depth, fixed relation types, expensive nested subselects, a tendency toward full table scans. Graph data stores store graph data optimally
  95. Thank you
  96. Many graphics I’ve created myself. Some images originate from istockphoto.com, except a few taken from Wikipedia and product pages
