These are the slides of my half-day workshop at EFS'11 in Stuttgart, where I covered some theoretical aspects of NoSQL data stores relevant to dealing with large amounts of data.
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data - Anne Nicolas
GNU poke is a new interactive editor for binary data. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them. Once a user has defined a structure for binary data (usually matching some file format), she can search, inspect, create, shuffle and modify abstract entities such as ELF relocations, MP3 tags, DWARF expressions, partition table entries, and so on, with primitives resembling simple editing of bits and bytes. The program comes with a library of already written descriptions (or “pickles” in poke parlance) for many binary formats.
GNU poke is useful in many domains. It is very well suited to aid in the development of programs that operate on binary files, such as assemblers and linkers. This was in fact the primary inspiration that brought me to write it: easily injecting flaws into ELF files in order to reproduce toolchain bugs. Due to its flexibility, poke is also very useful for reverse engineering, where the real structure of the data being edited is discovered by experiment, interactively. It is also good for the fast development of prototypes for programs like linkers, compressors or filters, and it provides a convenient foundation to write other utilities such as diff and patch tools for binary files.
This talk (unlike Gaul) is divided into four parts. First I will introduce the program and show what it does: from simple bit/byte editing to user-defined structures. Then I will show some of the internals and how poke is implemented. The third block will cover using Poke to describe user data, which is to say the art of writing “pickles”. The presentation ends with the status of the project, a call for hackers, and a hint at future work.
Jose E. Marchesi
The reasons why 64-bit programs require more stack memory - PVS-Studio
In forums, people often say that 64-bit versions of programs consume more memory and stack space, usually arguing that data sizes have doubled. But this claim is unfounded: the size of most types in C/C++ (char, short, int, float) remains the same on 64-bit systems. The size of a pointer has certainly increased, but far from all the data in a program consists of pointers. The reasons why programs consume more memory are more complex, so I decided to investigate the issue in detail.
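To see the point about type sizes for yourself, a minimal C check prints them directly (the comments assume a typical LP64 Unix system; 64-bit Windows uses LLP64, where long stays at 4 bytes):

    #include <stdio.h>

    int main(void) {
        printf("char:  %zu\n", sizeof(char));   /* 1 on 32- and 64-bit alike */
        printf("short: %zu\n", sizeof(short));  /* 2 */
        printf("int:   %zu\n", sizeof(int));    /* 4 */
        printf("float: %zu\n", sizeof(float));  /* 4 */
        printf("long:  %zu\n", sizeof(long));   /* 4 on 32-bit, 8 on LP64 */
        printf("void*: %zu\n", sizeof(void *)); /* 4 -> 8: pointers double */
        return 0;
    }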
Kernel Recipes 2019 - Faster IO through io_uring - Anne Nicolas
io_uring provides a new asynchronous I/O interface in Linux that aims to address limitations of existing interfaces like aio and libaio. It uses ring-based submission and completion queues to support asynchronous I/O operations with low latency and high throughput. Though initially skeptical, Linus Torvalds ultimately merged io_uring into the Linux kernel, citing its improvements in features, ease of use, and efficiency over the alternatives.
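For a flavor of the interface, here is a minimal sketch using the liburing helper library: it submits a single asynchronous read and waits for its completion (error handling omitted for brevity; link with -luring):

    #include <fcntl.h>
    #include <stdio.h>
    #include <liburing.h>

    int main(void) {
        struct io_uring ring;
        char buf[4096];

        int fd = open("/etc/hostname", O_RDONLY);
        io_uring_queue_init(8, &ring, 0);            /* 8-entry rings */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);                      /* one syscall to submit */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);              /* block on completion */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }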
1. The document discusses the Advanced Encryption Standard (AES) cipher, which was selected in 2000, based on the Rijndael algorithm, to replace the Data Encryption Standard (DES).
2. AES has a block size of 128 bits, with key sizes of 128, 192, or 256 bits. It operates on a 4x4 column-major matrix of bytes (the state) and consists of 10, 12, or 14 rounds depending on the key size.
3. Each round performs byte substitution, shifting rows of the state, mixing columns using matrix multiplication, and adding the round key using XOR. The key is expanded using XOR and S-boxes to generate round keys.
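As a small illustration of one of those per-round steps, here is a sketch of ShiftRows in C on the 4x4 column-major state; this is illustrative only, not a hardened AES implementation:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* AES state: 16 bytes in column-major order, state[col * 4 + row]. */
    static void shift_rows(uint8_t s[16]) {
        uint8_t t[16];
        for (int row = 0; row < 4; row++)
            for (int col = 0; col < 4; col++)
                /* Row r is rotated left by r byte positions. */
                t[col * 4 + row] = s[((col + row) % 4) * 4 + row];
        memcpy(s, t, 16);
    }

    int main(void) {
        uint8_t s[16];
        for (int i = 0; i < 16; i++) s[i] = (uint8_t)i;
        shift_rows(s);
        for (int i = 0; i < 16; i++) printf("%02x ", s[i]);
        printf("\n");
        return 0;
    }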
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008.
- Key features of Cassandra including linear scalability, continuous availability, ability to span multiple data centers, and operational simplicity.
- A high-level overview of Cassandra's architecture including its use of Dynamo and BigTable papers for the cluster and data storage layers.
- Concepts related to Cassandra's data model including data distribution, token ranges, replication, write path, and "last write wins" consistency.
Cassandra is a structured storage system designed for large amounts of data across commodity servers. It provides high availability with eventual consistency and scales incrementally without centralized administration. Data is partitioned across nodes and replicated for fault tolerance. Writes are applied locally and propagated asynchronously, prioritizing availability over consistency. It uses a gossip protocol for membership and failure detection.
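To make token-based placement concrete, here is a toy C sketch of a hash ring: each node owns the range up to its token, and a key lands on the first node whose token is greater than or equal to the key's hash. The FNV-1a hash, the four evenly spaced tokens and the key names are all made up for illustration; Cassandra itself uses Murmur3 over a much larger token space:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash_key(const char *key) {    /* FNV-1a, a stand-in */
        uint32_t h = 2166136261u;
        while (*key) { h ^= (uint8_t)*key++; h *= 16777619u; }
        return h;
    }

    int main(void) {
        /* Four nodes with evenly spaced tokens on a 32-bit ring. */
        uint32_t tokens[4] = { 0x40000000, 0x80000000, 0xC0000000, 0xFFFFFFFF };
        const char *keys[] = { "alice", "bob", "carol" };

        for (int i = 0; i < 3; i++) {
            uint32_t t = hash_key(keys[i]);
            int owner = 0;
            while (owner < 3 && t > tokens[owner]) owner++; /* walk the ring */
            printf("key %-5s token %08x -> node %d\n", keys[i], t, owner);
        }
        return 0;
    }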
This document discusses Python and web frameworks. It begins with an introduction to Python and its advantages for web development. It then discusses several popular Python web frameworks including web.py, Flask, and Django. It also covers related topics like WSGI, templating with Jinja2, asynchronous programming, and deployment with virtualenv.
NoSQL addresses issues related to large volumes of data, including poorly structured data, simplicity of data management, frequent reads and writes, big data streams, huge data storage needs, fast data filtering, complex relationships, and real-time processing and analysis. It works by chopping data into smaller, manageable pieces, separating reads from writes, using techniques like caching, and designing for unlimited data growth. Key aspects include minimizing relations, parallelizing and distributing operations, and avoiding single points of failure.
The document discusses optimizing code and data for CPU caches through various techniques like improving data locality, reducing unnecessary memory accesses, and reusing cached data. It covers optimizing code layout, data structures, prefetching, and addressing issues like aliasing.
The document discusses various techniques for optimizing memory usage and cache performance in code. It begins by justifying the need for memory optimization given trends in CPU and memory speeds. It then provides an overview of memory hierarchies and caches. The rest of the document discusses specific techniques for optimizing data structures, prefetching, layout of data in memory, reducing aliasing, and other strategies to improve cache utilization and performance.
Elliptics is a distributed, fault-tolerant data storage system built on distributed hash table (DHT) principles. It is designed to store medium to large data records from 1KB to terabytes in size. Key features include no single point of failure, high availability even during network or hardware failures, fast read and write speeds through techniques like asynchronous I/O, caching and direct peer-to-peer data streaming for large files. Elliptics ensures data consistency through techniques like replication across data centers and automatic data repartitioning when nodes are added or removed. It provides a simple interface for use in C/C++, Go and via HTTP/REST.
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2 - DataStax
Title: Introduction to Apache Cassandra 1.2
Details: Join Aaron Morton, DataStax MVP for Apache Cassandra, and learn the basics of the massively scalable NoSQL database. This webinar will examine C*’s architecture and its strengths for powering mission-critical applications. Aaron will introduce you to core concepts such as Cassandra’s data model, multi-datacenter replication, and tunable consistency. He’ll also cover new features in Cassandra version 1.2, including virtual nodes, the CQL 3 language, and query tracing.
Speaker: Aaron Morton, Apache Cassandra Committer
Aaron Morton is a Freelance Developer based in New Zealand, and a Committer on the Apache Cassandra project. In 2010, he gave up the RDBMS world for the scale and reliability of Cassandra. He now spends his time advancing the Cassandra project and helping others get the best out of it.
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2 - aaronmorton
This document provides an introduction to Apache Cassandra, including an overview of key concepts like the cluster, nodes, data model, and data modeling best practices. It discusses Cassandra's origins and popularity. The presentation covers the cluster architecture with consistent hashing and token ranges, replication strategies, consistency levels, and more. It also summarizes the Cassandra data model including tables, columns, SSTables, caching, compaction and discusses building a Twitter-like data model in CQL.
Handling Data in Mega Scale Web Systems - Vineet Gupta
The document discusses several challenges faced by large-scale web companies in managing enormous and rapidly growing amounts of data. It provides examples of architectures developed by companies like Google, Amazon, Facebook and others to distribute data and queries across thousands of servers. Key approaches discussed include distributed databases, data partitioning, replication, and eventual consistency.
Cassandra introduction apache con 2014 budapest - Duyhai Doan
This document provides an introduction and summary of Cassandra presented by Duy Hai Doan. It discusses Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008. The key architecture of Cassandra including its data distribution across nodes, replication for failure tolerance, and consistency models for reads and writes is summarized.
The document discusses moving away from traditional backend architectures towards more real-time and distributed systems. It advocates for storing data in a distributed database and processing it asynchronously in batches to improve user experience. Concrete examples are given of using protocols like ProtoBufs over REST, writing data from mobile clients to partitions, and performing analytics on batches of data later.
Language-agnostic data analysis workflows and reproducible research - Andrew Lowe
This was a talk that I gave at CERN at the Inter-experimental Machine Learning (IML) Working Group Meeting in April 2017 about language-agnostic (or polyglot) analysis workflows. I show how it is possible to work in multiple languages and switch between them without leaving the workflow you started. Additionally, I demonstrate how an entire workflow can be encapsulated in a markdown file that is rendered to a publishable paper with cross-references and a bibliography (with a raw LaTeX file produced as a by-product) in a simple process, making the whole analysis workflow reproducible. For experimental particle physics, ROOT is the ubiquitous data analysis tool, and has been for the last 20 years, so I also talk about how to exchange data to and from ROOT.
These days fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks and SSDs. A unifying theme is treating memory access patterns in a uniform and predictable way that is sympathetic to the underlying hardware. For example writing to and reading from RAM and Hard Disks can be significantly sped up by operating sequentially on the device, rather than randomly accessing the data. In this talk we’ll cover why access patterns are important, what kind of speed gain you can get and how you can write simple high level code which works well with these kind of patterns.
Performance and Predictability - Richard Warburton - JAXLondon2014
This document discusses various low-level performance optimizations related to branch prediction, memory access, storage, and conclusions. It explains that branches can cause stalls, caches help mitigate slow memory access, and sequential access patterns outperform random access. The key themes are optimizing for predictability over randomness and prioritizing principles over specific tools.
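A rough way to observe the sequential-versus-random effect from plain C: touch the same bytes once sequentially and once with a large stride. The buffer size and the 4096-byte stride are arbitrary choices for illustration, and absolute timings will vary by machine; only the ratio matters:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    enum { N = 64 * 1024 * 1024 };   /* 64 MB buffer */

    int main(void) {
        char *a = calloc(N, 1);
        if (!a) return 1;
        long sum = 0;

        clock_t t0 = clock();
        for (long i = 0; i < N; i++) sum += a[i];             /* sequential */
        clock_t t1 = clock();
        printf("sequential: %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (long s = 0; s < 4096; s++)                       /* same work, */
            for (long i = s; i < N; i += 4096) sum += a[i];   /* poor locality */
        t1 = clock();
        printf("strided:    %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        free(a);
        return (int)(sum & 1);       /* keep sum live so loops aren't elided */
    }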
Sql on hadoop the secret presentation.3pptx - Paulo Alonso
This document discusses using SQL on Hadoop to enable faster analytics. It notes that while Hadoop is good for batch processing large datasets, SQL on Hadoop can provide faster access to data for interactive queries. The document discusses using in-memory technologies to improve SQL query performance on Hadoop and enable lower latency queries. It also discusses building an analytical platform that can query data stored in Hadoop, data warehouses, and other sources to provide business users with faster, self-service access to data.
The document discusses data partitioning and distribution across multiple machines in a cluster. It explains that data replication does not scale well, but data partitioning, where each record exists on only one machine, allows write latency to scale with the number of machines in the cluster. Coherence provides a distributed cache that partitions data and offers functions for server-side processing near the data through tools like entry processors.
This document summarizes a presentation about using Redis for duplicate document detection in a real-time data stream. The key points covered include:
- Redis is used to map external document IDs to internal IDs and cache these mappings to detect duplicates efficiently
- Lua scripting is used to generate IDs and check for duplicates in an atomic way
- Redis data structures like hashes and counters help count documents and store metadata efficiently
- A production deployment involved a single Redis server handling 70M keys and 10GB of RAM, with replication for high availability
Redis - for duplicate detection on real time stream - Codemotion
Roberto "frank" Franchini presenta a Codemotion Techmeetup Torino Redis, un data structure server che può utilizzare come chiavi stringhe, hashes, lists, sets, sorted sets, bitmaps e hyperloglogs
.
The document provides an overview of the Arduino programming language and hardware. It describes the basic structure of an Arduino program with setup() and loop() functions. It lists the main data types and functions for digital and analog input/output, time, math, random numbers, serial communication and more. It also provides information on libraries, the Arduino board pins and components, and compares Arduino to the Processing language.
Spark streaming can be used for near-real-time data analysis of data streams. It processes data in micro-batches and provides windowing operations. Stateful operations like updateStateByKey allow tracking state across batches. Data can be obtained from sources like Kafka, Flume, HDFS and processed using transformations before being saved to destinations like Cassandra. Fault tolerance is provided by replicating batches, but some data may be lost depending on how receivers collect data.
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011) - Matthew Lease
Here are a few reasons why using the reducer as the combiner doesn't work for computing the mean:
1. The reducer expects an iterator of values for a given key. But for the mean, we need to track the sum and count across all values for a key.
2. The reducer is called once per unique key. But to compute the mean, we need to track partial sums and counts across multiple invocations for the same key. There is no way to preserve state between calls to the reducer.
3. The reducer output type needs to match the mapper output type. But for the mean, the mapper emits (key, value) pairs while the reducer would need to emit (key, sum, count) triples.
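The standard fix is a combiner that emits partial (sum, count) pairs, which combine associatively, rather than partial means. A small C sketch of why averaging the averages goes wrong (the values and the two-partition split are made up):

    #include <stdio.h>

    typedef struct { double sum; long count; } Partial;

    /* Associative and commutative, so it can be applied in any order,
       any number of times -- unlike taking a mean of means. */
    static Partial combine(Partial a, Partial b) {
        return (Partial){ a.sum + b.sum, a.count + b.count };
    }

    int main(void) {
        /* Two map-side partitions for the same key: {1, 2, 3} and {10}. */
        Partial p1 = { 1 + 2 + 3, 3 };
        Partial p2 = { 10, 1 };

        double mean_of_means =
            ((p1.sum / p1.count) + (p2.sum / p2.count)) / 2;
        Partial total = combine(p1, p2);

        printf("mean of partial means: %.2f (wrong)\n", mean_of_means);
        printf("true mean:             %.2f\n", total.sum / total.count);
        return 0;
    }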
@pavlobaron Why monitoring sucks and how to improve it - Pavlo Baron
The document discusses why monitoring systems currently have issues and how they could be improved. Some key problems identified are that monitoring tools are not consistent or adaptive to changing IT systems, they rely on outdated technologies, use simple mathematical models, lack timely and high-resolution data, have binary alerting, and provide limited opportunities for feedback. The document argues monitoring tools should be more intelligent, adaptive, and aim to reduce the need for expensive human experts.
Why we do tech the way we do tech now (@pavlobaron) - Pavlo Baron
The document discusses why technology is developed and implemented the way it currently is. It states that money prioritizes speed and cost over quality, and that patchwork solutions, copying existing code, and using free tools are cheaper than standards, rewrites, training, and planned development. It concludes that businesses adapted to prioritize technology to increase productivity and revenue, and that technology continues to disrupt and transform businesses and development practices.
More Related Content
Similar to Big Data & NoSQL - EFS'11 (Pavlo Baron) (20)
Current databases and database access methods are not well suited for reactive, streaming applications. A "living database" is proposed that treats all data as a continuous stream ordered by time, published on channels. Queries would be continuous and results materialized and published continuously as well. The database would be an active participant in the overall reactive data flow.
Becoming reactive without overreacting (@pavlobaron) - Pavlo Baron
This document discusses the reactive approach to building enterprise systems. It argues that becoming reactive means making reactiveness explicit rather than implicit. All systems are inherently reactive under the hood due to things like threading, event loops, and interrupts. The benefits of reactive systems include responsiveness to change and resource efficiency. Some challenges to going reactive include overcoming the temptation to go fully or partially reactive without understanding reactivity, and writing one's own reactive framework instead of using existing ones.
The hidden costs of the parallel world (@pavlobaron) - Pavlo Baron
The document discusses the challenges of parallel and distributed computing. It notes that while parallelism can provide performance benefits, it also introduces significant complexity. The document argues that parallelism should be implemented transparently at the runtime level, rather than requiring new programming models, in order to make the technology easier for developers to understand and use. It concludes that current parallel tools and approaches remain limited until they can be effectively adopted by all developers, not just experts.
The document discusses the results of a study on the impact of COVID-19 lockdowns on air pollution. Researchers found that lockdowns led to significant short-term reductions in nitrogen dioxide and fine particulate matter pollution globally as economic activities slowed. However, the impacts on greenhouse gases and long-term air quality improvements remain uncertain without permanent behavior and economic changes.
Data on its way to history, interrupted by analytics and silicon (@pavlobaron) - Pavlo Baron
This document discusses the challenges and strategies for building a system that continuously analyzes event streams in real-time to detect patterns and anomalies. Some key challenges include handling high event volumes, performing one-pass algorithms efficiently, and dealing with non-deterministic event ordering. The system is being built on the JVM to take advantage of its concurrency features but may require optimizations like off-heap storage to handle memory pressure from garbage collection. The goal is to process over a million events per second on a single machine through techniques like parallelization and sampling.
This document discusses functional reactive programming (FRP) as a way to handle ever-changing data. It explains that in FRP, values can change over time and logic does not explicitly account for time, but instead recomputes based on value changes. This allows modeling of signals, events, flows, transports, and logic separately. It provides examples using Erlang and Elm frameworks that partially or fully implement FRP concepts. Finally, it suggests FRP could be useful for problems involving interaction, streaming data, robotics, continuous analytics, and machine-to-machine communication where logic needs to reapply on constant value changes.
Near realtime analytics - technology choice (@pavlobaron) - Pavlo Baron
The document compares and contrasts the characteristics of Wile E. Coyote and Road Runner from the Looney Tunes cartoons. It describes Coyote as slow, offline, having a wide field of vision, long memory, being proactive, thorough, and always losing to Road Runner. Road Runner is described as fast, always running, having a narrow field of vision, short memory, being reactive, spontaneous, and always winning against Coyote. The document then discusses the differences between batch and near real-time analytics and technologies.
Set this Big Data technology zoo in order (@pavlobaron) - Pavlo Baron
The document discusses strategies for processing big data in real-time or near real-time. It defines different levels of processing speed from real-time to near real-time to fast to batch processing. It emphasizes that to gain useful information from live data as close to real-time as possible provides the greatest business advantage. It provides recommendations for optimizing various parts of the data pipeline for speed, such as using optimized data formats, parallel processing, in-memory storage, and avoiding unnecessary movement or abstraction of data.
a Tech guy’s take on Big Data business cases (@pavlobaron) - Pavlo Baron
The document discusses big data, describing what it is and is not. It argues that big data is about gaining useful information from various data sources to increase a company's value through faster and better decisions and predictions. It asserts that big data is a necessity and that any company can and should leverage it by knowing their customers, business, competitors and offers to make and save money.
Diving into Erlang is a one-way ticket (@pavlobaron) - Pavlo Baron
The author describes their journey through various programming languages from assembly to C/C++ to Java. They were asked to parallelize some C code using Erlang and spent 3 months researching it, finding that Erlang was a good solution. This sparked further interest in Erlang, leading the author to write a book on it and incorporate Erlang concepts into other languages. The author believes Erlang is a precise and useful tool for distributed systems and messaging problems. They encourage others to learn Erlang as well.
The document summarizes key concepts of Dynamo including:
- Dynamo focuses on immediate, reliable writes and operation relaxation rather than speed.
- It provides distribution, fault tolerance, and almost linear scalability.
- Vector clocks are used to track causality and determine operation order across nodes (see the sketch after this list).
- Merkle trees enable efficient delta tracking during data replication.
- Gossip protocols are used for node discovery and failure detection.
- Dynamo supports eventual consistency through techniques like hinted handoff and quorum replication.
- The CAP theorem establishes that a distributed system can only optimally support two of consistency, availability, and partition tolerance at once.
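As an aside on the vector-clock point above: the causality check reduces to a componentwise comparison of the two clocks. A minimal C sketch, where the node count and clock values are made up for illustration:

    #include <stdio.h>

    #define NODES 3

    typedef struct { int c[NODES]; } VClock;
    typedef enum { BEFORE, AFTER, CONCURRENT, EQUAL } Order;

    /* a happened-before b iff a <= b componentwise and a != b. */
    static Order compare(const VClock *a, const VClock *b) {
        int less = 0, greater = 0;
        for (int i = 0; i < NODES; i++) {
            if (a->c[i] < b->c[i]) less = 1;
            if (a->c[i] > b->c[i]) greater = 1;
        }
        if (less && greater) return CONCURRENT;  /* conflicting siblings */
        if (less)            return BEFORE;
        if (greater)         return AFTER;
        return EQUAL;
    }

    int main(void) {
        VClock a = {{2, 0, 0}};  /* node 0 wrote twice                  */
        VClock b = {{2, 1, 0}};  /* node 1 wrote after seeing a         */
        VClock c = {{1, 0, 1}};  /* node 2 wrote without seeing all of a */
        printf("a vs b: %s\n", compare(&a, &b) == BEFORE ? "a -> b" : "?");
        printf("a vs c: %s\n", compare(&a, &c) == CONCURRENT ? "concurrent" : "?");
        return 0;
    }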
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron) - Pavlo Baron
Pavlo Baron discusses how Chef can be used to configure and manage infrastructure as code. Chef allows you to pick a hosting strategy, manage environments and configurations, and define roles and run lists to deploy applications. Knife and Ohai tools can be used to manage nodes and retrieve information. Infrastructure as code with Chef provides abstractions and allows reusing configurations across environments and applications.
What can be done with Java, but should better be done with Erlang (@pavlobaron) - Pavlo Baron
Erlang excels at building distributed, fault-tolerant, concurrent applications due to its lightweight process model and built-in support for distribution. However, Java is more full-featured and is generally a better choice for applications that require more traditional object-oriented capabilities or need to interface with existing Java libraries and frameworks. Both languages have their appropriate uses depending on the requirements of the specific application being developed.
20 reasons why we don't need architects (@pavlobaron) - Pavlo Baron
This document discusses the changing role of software architects and argues that they are no longer needed. It notes that agility has become mainstream and that conflicts between architects and developers are too large. It suggests that the team as a whole can serve as the architect and that what is needed are tools to help with architecture management, workflow, and reproducing architectures across teams. The document questions what an architect should be in this new landscape, listing roles like visionary, chief motivator, and worker.
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron) - Pavlo Baron
This document discusses key concepts in distributed systems including:
1. Computer clocks are inconsistent and time is relative in distributed systems, requiring logical clocks to track changes.
2. Consistent hashing allows data to be localized on changing infrastructure with minimal reorganization through ring hashing with fixed rules.
3. Nodes gossip to asynchronously share data and infrastructure changes in unpredictable order and repetition.
The document is a presentation about agile methodologies and excuses for not being truly agile. It contains stories and examples to illustrate key agile principles like simplicity, self-organization, motivated individuals, continuous improvement, changing requirements, face-to-face conversation, technical excellence, collaboration, valuable software, and working software. The overall message is that no methodology alone ensures agility and that true agility requires embracing certain mindsets and behaviors.
Harry Potter and Enormous Data (Pavlo Baron)
The document discusses challenges related to handling enormous amounts of data and provides recommendations for technologies and approaches to address those challenges. It notes that as data volumes increase, traditional databases and tools may no longer suffice. It recommends distributed, parallelized, and real-time approaches like Hadoop, stream processing engines, graph databases, GPUs, and cloud-based storage and CDNs to optimize for large volumes of data from various sources like sensors, logs, videos and more. Specific use cases and questions around customization, monitoring, risk analysis, visualization and content delivery are also addressed.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
GraphRAG for Life Science to increase LLM accuracy - Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Dandelion Hashtable: beyond billion requests per second on a commodity server - Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Generating privacy-protected synthetic data using Secludy and Milvus - Zilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Building Production Ready Search Pipelines with Spark and Milvus - Zilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data into a serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to the Milvus vector database for search serving.
Digital Marketing Trends in 2024 | Guide for Staying Ahead - Wask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
A Comprehensive Guide to DeFi Development Services in 2024 - Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Main news related to the CCS TSI 2023 (2023/1695) - Jakub Marek
An English 🇬🇧 translation of the presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, which was held in the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
FREE A4 Cyber Security Awareness Posters - Social Engineering part 3 - Data Hops
Free A4 downloadable and printable cyber security and social engineering safety training posters. Promote security awareness in the home or workplace. Lock them out. From training providers at datahops.com.
leewayhertz.com - AI in predictive maintenance: Use cases, technologies, benefits ... - alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
66. Why can we never be sure till we die. Or have killed for an answer.
67. CAP – Consistency, Availability, Partition tolerance
68. CAP – the variations: CA – irrelevant; CP – eventually unavailable, offering maximum consistency; AP – eventually inconsistent, offering maximum availability
126. Hinted handoff (N: node, G: group including N). While node(N) is unavailable: replicate to G or store data(N) locally, and record a handoff hint for later. When node(N) is alive again: hand the data off to node(N).
127. (Diagram) The direct replica for Key = “foo” fails, so the write is replicated to another of the N replicas with handoff hint = true for later delivery.
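A toy C sketch of the hinted-handoff bookkeeping these slides describe; the fixed-size hint table, the stubbed failure detector and all names are mine, purely for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_HINTS 64

    typedef struct {
        const char *key, *value;
        int intended_node;            /* replica that was down at write time */
    } Hint;

    static Hint hints[MAX_HINTS];     /* local hint store */
    static int  n_hints = 0;

    /* Stub failure detector; real systems learn liveness via gossip. */
    static bool node_alive(int node) { return node != 2; }

    static void send_to(int node, const char *key, const char *value) {
        printf("send %s=%s to node %d\n", key, value, node);
    }

    /* Write path: if the direct replica is down, keep a hint locally. */
    static void replicate(int node, const char *key, const char *value) {
        if (node_alive(node))
            send_to(node, key, value);
        else if (n_hints < MAX_HINTS)
            hints[n_hints++] = (Hint){ key, value, node };
    }

    /* Run periodically: hand stored hints off to nodes that came back. */
    static void handoff(void) {
        for (int i = 0; i < n_hints; ) {
            if (node_alive(hints[i].intended_node)) {
                send_to(hints[i].intended_node, hints[i].key, hints[i].value);
                hints[i] = hints[--n_hints];   /* swap-remove delivered hint */
            } else {
                i++;
            }
        }
    }

    int main(void) {
        replicate(1, "foo", "bar");   /* node 1 is up: sent directly   */
        replicate(2, "foo", "bar");   /* node 2 is down: hint recorded */
        handoff();                    /* node 2 still down: hint kept  */
        printf("pending hints: %d\n", n_hints);
        return 0;
    }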
133. MapReduce. Model: functional map/fold. Out-database MR: irrelevant. In-database MR: data locality, no splitting needed, distributed querying, distributed processing.
134. (Diagram) In-database MapReduce: a query for N = "Alice" is mapped on nodes A, B and C, where the matching data lives; node X reduces the per-node hit lists into the final result.
145. Many graphics I’ve created myself, though I’d better have asked @mononcqc for help ‘cause his drawings are awesome. Some images originate from istockphoto.com, except a few taken from Wikipedia and product pages.