The Art of Big Data

The road lies plain before me;--'tis a theme
Single and of determined bounds; …
- Wordsworth, The Prelude

Krishna Sankar, http://doubleclix.wordpress.com
EC4000–PhD Guest Seminar, Naval Postgraduate School
Nov 29, 2011

Agenda
o  To cover the broad picture
o  Understand the waypoints &
o  Drill down into one area (NOSQL)
o  Can do others later …
o  Of the Big Data domain …

Sections: What is Big Data? · Big Data to smart data · Big Data Pipeline

Pipeline areas: Analytics/Modeling (Analytic Algorithms, R) · Storage - NOSQL · Processing - Hadoop · Visualization

Thanks to …
The giants whose shoulders I am standing on

Special Thanks to:
Peter Ateshian, NPS
Prof Murali Tummala, NPS
Shirley Bailes, O’Reilly
Ed Dumbill, O’Reilly
Jeff Barr, AWS
Jenny Kohr Chynoweth, AWS

When I think of my own native land,
In a moment I seem to be there;

But, alas! recollection at hand

Soon hurries me back to despair.
- Cowper, The Solitude of Alexander Selkirk

What is Big Data ?
“Big data” is data that becomes large enough that it cannot be processed using conventional methods. @twitter

“Big data” is less about size, more about flow & velocity - persisting petabytes per year is easier than processing terabytes per hour. @twitter

Ref: http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

What is Big Data ?

Vinod Khosla’s Cool Dozen!
  Consumers : “Widespread innovation in
technologies that reduce data overload for
users” ~ Data Reduction

  Businesses : “Simple solutions to handle
the deluge of data generated from various
sources …” ~ Big Data Analytics

TV 2.0, Education, Social NEXT, Tools for sharing interest, Publishing, …

Ref:
http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/

EBC322

•  Volume
o  Scale
•  Velocity
o  Data change rate vs. decision window
•  Variety
o  Different sources & formats
o  Structured vs. Unstructured
•  Variability
o  Breadth of interpretation &
o  Depth of analytics
•  Contextual
o  Dynamic variability
o  Recommendation
•  Connectedness

http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf

I.  Two Main Types – based on collection
i.  Big Data Streams
o  Data in “motion”
o  Twitter fire hose, Facebook, G+
ii.  Big Data Logs
o  Data “at rest”
o  Logs, DW, external market data, POS, …
II.  Typically, Big Data has a non-deterministic angle as well …
o  Creative Discovery
o  Iterative, Model based Analytics
o  Explore questions to ask
III.  Smart Data = Big Data + context + embedded/interactive (inference, reasoning) models
o  Model Driven
o  Declaratively Interactive

http://www.slideshare.net/leonsp/hadoop-slides-11-what-is-big-data
http://www.slideshare.net/Dataversity/wed-1550-bacvanskivladimircolor

AWS – 600 Billion objects!

Twitter
§  200 million tweets/day
§  Peak 10,000/second
§  How would you handle the fire hose for social network analytics?

Zynga
§  “Analytics company, not a gaming company!”
§  Harvests data : 15 TB/day
§  Test new features
§  Target advertising
§  230 million players/month

Storage
§  4 U box = 40 TB, 1 PB = 25 boxes !

http://goo.gl/dcBsQ
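
The sizing figures on this slide can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1000 TB) and a 30-day month, neither of which the slide states, and the helper names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def boxes_needed(capacity_tb, tb_per_box=40):
    """Fractional number of 40 TB (4U) boxes for a given capacity in TB."""
    return capacity_tb / tb_per_box


def average_rate_per_second(events_per_day):
    """Average event rate if the daily volume were spread evenly."""
    return events_per_day / SECONDS_PER_DAY


# Assumption: 1 PB = 1000 TB (decimal units).
boxes_per_pb = boxes_needed(1000)

# 200 million tweets/day averaged over a day, versus the 10,000/second peak.
avg_tweets_per_second = average_rate_per_second(200_000_000)
peak_to_average = 10_000 / avg_tweets_per_second

# Assumption: 30-day month of Zynga's 15 TB/day harvest, before replication.
zynga_boxes_per_month = boxes_needed(15 * 30)

print(boxes_per_pb, round(avg_tweets_per_second), round(peak_to_average, 1))
```

The gap between the average rate and the peak rate is the reason the fire hose is a velocity problem rather than a volume problem.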

•  6 Billion Messages per day
•  2 PB (w/compression) online
•  6 PB w/ replication
•  250 TB/Month growth
•  HBase Infrastructure

•  50 TB/Day
•  240 nodes, 84 PB
•  Very systematic
•  Diagram speaks volumes!
•  Path Analysis
•  A/B Testing
•  Teradata Installation

Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf

•  “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
•  15 TB memory, across 90 IBM 760 servers, in 10 racks
•  1 TB of dataset
•  200 Million pages processed by Hadoop
•  This is a good example of Connected data
–  Contextual w/ variability
–  Breadth of interpretation
–  Analytics depth

http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/

[Figure: Big Data application landscape. Labels: Warehouse-style Applications; Distributed Big Data Applications; Storage (Block Store, Object Store, NOSQL); Parallelism (Map/Reduce, HPC); Cloud Architecture; Web Analytics; Social Media Analytics; Log Analytics; Inference; Social Graph; Recommendation/Inference Engines; Machine Learning (Mahout); Classification, Clustering; Knowledge Graph; Search, Indexing]

“A towel is about the most massively useful thing an
interstellar hitchhiker can have … any man who can
hitch the length and breadth of the Galaxy, rough it …
win through, and still know where his towel is, is clearly
a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Published by Harmony Books in 1979

Big Data to Smart Data

Big data to smart data
•  summary
1.  Don’t throw away any data !
2.  Be ready for different ways of organizing the data

http://goo.gl/fGw7r

Big Data Pipeline

If a problem has no solution, it is not a problem,
but a fact, not to be solved but to be coped with,
over time …
- Peres’s Law

Big Data Pipeline
•  Stages
o  Collect
o  Store
o  Transform & Analyze
o  Model & Reason
o  Predict, Recommend & Visualize
•  Different systems have different characteristics
o  Infrastructure optimization based on application/hardware
attributes correlation (short term)
•  Hadoop, Splunk, internal Dashboard
o  Application performance trends (medium term)
•  Analytics, Modeling,…
o  Product Metrics
•  Feature set vs. usage, what is important to users, stratification
•  Modeling using R, Visualization layers like Tableau

Big Data Pipeline
Ref: http://goo.gl/Mm83k

[Figure: pipeline layers, bottom to top: Volume, Velocity, Variability, Variety, Connectedness, Context, Model, Infer-ability. Tools shown alongside: Logs, XML, various other files, …; Scribe, Flume; HDFS, NOSQL, SQL; Hadoop, Pig, Hive, .NET Dryad, other Hadoop tools …; SQL, BI Tools; hand coded programs, R, Mahout, …; internal dashboards, Tableau]

Decomplexify! Contextualize! Network! Reason! Infer!

Build to Fail - “It is working” is not binary

The NOSQL !

I AM monarch of all I survey;
My right there is none to dispute;

From the centre all round to the sea
I am lord of the fowl and the brute

Agenda
•  Opening Gambit
–  NOSQL : Toil, Tears & Sweat !
•  The Pragmas
–  ABCs of NOSQL [ACID, BASE & CAP]
•  The Mechanics
–  Algorithmics & Mechanisms (For reference)

Referenced Links @ http://doubleclix.wordpress.com/2010/06/20/nosql-talk-references/

What is NOSQL Anyway ?
•  NOSQL != NoSQL or NOSQL != (!SQL)
•  NOSQL = Not Only SQL
•  Can be traced back to Eric Evans[2]!
–  You can ask him during the afternoon session!
•  Unfortunate Name, but is stuck now
•  Non Relational could have been better
•  Usually Operational, Definitely Distributed
•  NOSQL has certain semantics – need not stay that way

NOSQL
•  Key Value
–  In-memory: Memcached, Redis
–  Disk Based: Tokyo Cabinet, Dynamo, Voldemort
•  Column
–  SimpleDB, Google BigTable, HBase, Cassandra, HyperTable, Azure TS
•  Document
–  CouchDB, MongoDB, Lotus Domino, Riak
•  Graph
–  Neo4j, FlockDB, InfiniteGraph
Ref: [22,51,52]

NOSQL Tales from the field
WHAT WORKS

•  Designer Augmenting RDBMS with a Distributed key
Value Store[40 : A good talk by Geir]
•  Invitation only designer brand sales
•  Limited inventory sales – start at 12:00, members have
10 min to grab them. 500K mails every day
•  Keeps brand value, hidden from search
•  Interesting load properties
•  Each item a row in DB-BUY NOW reserves it
–  Can't order more
•  Started out as a Rails app
–  shared nothing
•  Narrow peaks – half of revenue

Christian Louboutin
Effect

•  ½ amz for Louboutin
•  Use Voldemort
•  Inventory, Shopping Cart,
Checkout
•  Partition by prod ID
•  Shared infrastructure – “fog” not “cloud” - Joyent!
•  In-memory inventory
•  Not afraid of sale anymore!
And SQL DBs are
still relevant !

Typical NOSQL Example Bit.ly
•  Bit.ly URL shortening service, uses MongoDB
•  User, title, URL, hash, labels[I-5], sort by time
•  Scale – ~50M users, ~10K concurrent, ~1.25B shortens
per month
•  Criteria:
–  Simple, Zippy FAST, Very Flexible, Reasonable Durability, Low
cost of ownership
•  Sharded by userid

•  New kind of “dictionary” a word repository, GPS for
English – context, pronunciations, twitter … developer
API
•  Characteristics[I-6,Tony Tam’s presentation]
–  RO-centric, 10,000 reads for every write
–  Hit a wall with MySQL (4B rows)
–  MongoDB read was so good that memcached layer was not
required
–  MongoDB used 4 times MySQL storage
•  Another example :
–  Voldemort – Unified Communications, IP-Phone data stored
keyed off of phone number. Data relatively stable

Large Hadron Collider@CERN
•  DAS is part of giant data management
enterprise (cms)
–  Polyglot Persistence (SQL + NOSQL, Mongo, Couch,
memcache, HDFS, Lustre, Oracle, MySQL, …)
•  Data Aggregation System [I-1,I-2,I-3,I-4]
–  Uses MongoDB
–  Distributed Model, 2-6 PB data
–  Combine info. from different metadata sources, query
without knowing their existence, user has domain
knowledge – but shouldn’t deal with various formats,
interfaces and query semantics
–  DAS aggregates, caches and presents data as JSON
documents – preserving security & integrity

And SQL DBs are
still relevant !

•  Digg
–  RDBMS places more burden on reads than writes[I-8]
–  Looked at NOSQL, selected Cassandra
•  Column oriented, so more structure than key-value
•  Heard from noSQL Boston[http://twitter.com/
#search?q=%23nosqllive]
–  Baidu: 120 node HyperTable cluster managing
600TB of data
–  StumbleUpon uses HBase for Analytics
–  Twitter’s Current Cassandra cluster: 45 nodes

•  Adobe is a HBase shop [I-10,I-11,2]
•  Adobe SaaS Infrastructure – tagging, content aggregation, search, storage and so forth
•  Dynamic schema & huge number of records[I-5]
•  40 million records in 2008 to 1 billion with 50 ms response
•  NOSQL not mature in 2008, now good enough
•  Prod Analytics: 40 nodes, largest has 100 nodes

•  BBC is a CouchDB shop [I-13]
•  Sweet spot:
•  Multi-master, multi datacenter replication
•  Interactive Mediums
•  Old data to CouchDB
•  Thus free up DB to do work!

•  Cloudkick is a Cassandra shop[I-12]
•  Cloudkick offers cloud management services
•  Store metrics data
•  Linear scalability for write load
•  Massive write performance
•  Memory table & serial commit log
•  Low operational costs
•  Data Structure
–  Metrics, Rolled-up data, Statuses at time slice : all indexed by
timestamp

•  Guardian/UK
–  Runs on Redis[I-14] !
–  “Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. … the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.”
–  NOSQL can increase performance of relational data by offloading specific data and tasks

And SQL DBs are still relevant !
"The evil that SQL DBs do lives after them; the good is oft interred with their bones..."

NOSQL at Netflix
•  Netflix is fully in the cloud
•  Uses NOSQL across the globe
•  Customer Profiles, watchlog, usage logging (see next
slide)
–  No multi-record locking
•  No DBA !
•  Easier Schema Changes
•  Less complex, Highly Available data store
•  Joins happen in the applications

http://www.hpts.ws/sessions/nosql-ecosystem.pdf
http://www.hpts.ws/sessions/GlobalNetflixHPTS.pdf

21 NOSQL Themes
•  Web Scale
•  Scale Incrementally/continuous growth
•  Oddly shaped & exponentially connected
•  Structure data as it will be used – i.e. read, query
•  Know your queries/updates in advance[96], but you can change them later
•  Compute attributes at run time
•  Create a few large entities with optional parts
–  Normalization creates many small entities
•  Define Schemas in models (not in databases)
•  Avoid impedance mismatch
•  Narrow down & solve your core problem
•  Solve the right problem with the right tool
Ref: [I-8]

21 NOSQL Themes
•  Existing solutions are clunky[1] (in certain situations)
•  Scale automatically, “becoming prohibitively costly (in terms of manpower) to operate” Twitter[I-9]
•  Distribution & partitioning are built into NOSQL
•  RDBMS distribution & sharding is not fun and is expensive
–  Lose most functionality along the way
•  Data at the center, Flexible schema, Fewer joins
•  The value of NOSQL is in flexibility as much as it is in “Big Data”

21 NOSQL Themes
•  Requirements[3]
–  Data will not fit in one node
•  And so need data partition/distribution by the system
–  Nodes will fail, but data needs to be safe – replication!
–  Low latency for real-time use
•  Data Locality
–  Row based structures will need to read whole row, even for a column
–  Column based structures need to scan for each row
•  Solution : Column storage with Locality
–  Keep data that is read together, don’t read what you don’t care about
•  For example friends – other data
Ref: 3

ABCs of NOSQL - ACID, BASE & CAP
The woods are lovely, dark, and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
-Frost

CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
[Figure: triangle with corners Consistency, Availability, Partition]
Which feature to discard depends on the nature of your system[41]

CAP Principle
•  C-A, No P → Single DB server, no network partition

CAP Principle
•  C-P, No A → Block transaction in case of partition failure

CAP Principle
•  A-P, No C → Expiration based caching, voting majority
Interesting (& controversial) from NOSQL perspective


ABCs of NOSQL
•  ACID
o  Atomicity, Consistency, Isolation & Durability – fundamental properties of SQL DBMS
•  BASE[35,39]
o  Basically Available, Soft state (Scalable), Eventually Consistent
•  CAP[36,39]
o  Consistency, Availability & Partitioning
o  This C is ~A+C
•  i.e. Atomic Consistency[36]

ACID
•  Atomicity
o  All or nothing
•  Consistent
o  From one consistent state to another
•  e.g. Referential Integrity
o  But it is also application dependent
•  e.g. min account balance
•  Predicates, invariants, …
•  Isolation
•  Durability

CAP Pragmas
•  Preconditions
o  The domain is scalable web apps
o  Low Latency for real time use
o  A small sub-set of SQL Functionality
o  Horizontal Scaling
•  Pritchett[35] talks about relaxing consistency across functional groups rather than within functional groups
•  Idempotency to consider
o  Updates inc/dec are rarely idempotent
o  Order preserving trx are not idempotent either
o  MVCC is an answer for this (CouchDB)
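
The idempotency point can be illustrated with a small sketch. It is not taken from any of the cited systems: the store is a plain dict, and the request-id scheme in `DedupingCounter` is just one hypothetical way to make a retried increment safe.

```python
def apply_set(store, key, value):
    """Idempotent: replaying the same message leaves the same state."""
    store[key] = value


def apply_increment(store, key, delta):
    """Not idempotent: every replay changes the state again."""
    store[key] = store.get(key, 0) + delta


class DedupingCounter:
    """Hypothetical fix: remember request ids so retries become no-ops."""

    def __init__(self):
        self.value = 0
        self.seen = set()

    def increment(self, request_id, delta):
        if request_id not in self.seen:
            self.seen.add(request_id)
            self.value += delta
        return self.value


store = {}
apply_set(store, "balance", 100)
apply_set(store, "balance", 100)      # duplicate delivery, state unchanged
apply_increment(store, "visits", 10)
apply_increment(store, "visits", 10)  # duplicate delivery, counted twice

counter = DedupingCounter()
counter.increment("req-1", 10)
counter.increment("req-1", 10)        # retry of the same request is ignored
```

In an eventually consistent system messages may be retried or delivered twice, which is why the slide singles out inc/dec style updates.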

Consistency
•  Strict Consistency
o  Any read on Data X will return the most recent write on X[42]
•  Sequential Consistency
o  Maintains sequential order from multiple processes (No mention of time)
•  Linearizability
o  Add timestamp from loosely synchronized processes

Consistency
•  Write availability, not read availability[44]
•  Even load distribution is easier in eventually consistent systems
•  Multi-data center support is easier in eventually consistent systems
•  Some problems are not solvable with eventually consistent systems
•  Code is sometimes simpler to write in strongly consistent systems

CAP Essentials – 1 of 3
•  “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
o  C-A, No P → Single DB server, no network partition
o  C-P, No A → Block transaction in case of partition failure
o  A-P, No C → Expiration based caching, voting majority
•  Which feature to discard depends on the nature of your system[41]

CAP Essentials – 2 of 3
•  Yield vs. Harvest[37]
o  Yield → Probability of completing a request
o  Harvest → Fraction of data reflected in the response
•  Some systems tolerate < 100% harvest (e.g. search, i.e. approximate answers OK); others need 100% harvest (e.g. Trx, i.e. correct behavior = single well defined response)
•  For sub-systems that tolerate harvest degradation, CAP makes sense
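
A toy search over partitioned data shows what trading harvest for yield means in practice. This is a sketch under assumed names: the shards are plain Python lists, and `None` stands in for a shard that is unreachable because of a partition.

```python
def search(shards, term):
    """Answer from whichever shards are reachable and report the harvest."""
    hits, answered = [], 0
    for shard in shards:
        if shard is None:          # unreachable shard: skip it, do not fail
            continue
        answered += 1
        hits.extend(doc for doc in shard if term in doc)
    harvest = answered / len(shards)   # fraction of data reflected in the answer
    return hits, harvest


shards = [["big data", "small data"], None, ["data flow"], ["nosql"]]
hits, harvest = search(shards, "data")
print(hits, harvest)
```

The request still completes (yield is preserved) but only three of four shards contribute, so the harvest is 75%. A transaction could not accept that answer; a search page usually can.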

CAP Essentials – 3 of 3
•  Trading Harvest for yield – AP
•  Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance
o  Hence NotOnly SQL not No SQL
o  Intelligent homing to tolerate partition failures[44]
o  Multi zones in a region (150 miles - 5 ms)
o  Twitter tweets in Cassandra & MySQL
o  BBC using CouchDB for offloading DBMS
o  Polyglot persistence at LHC@CERN

(Most important point in the whole presentation)

Eventual Consistency & AMZ
•  Distribution Transparency[38]
•  In larger distributed systems, network partitions are a given
•  Consistency Models
o  Strong
o  Weak
•  Has an inconsistency window between the update and the guaranteed view
o  Eventual
•  If no new updates, all will see the value, eventually

Eventual Consistency & AMZ
•  Guarantee variations[38]
o  Read-Your-writes
o  Session consistency
o  Monotonic Read consistency
•  Access will not return previous value
o  Monotonic Write consistency
•  Serialize write by the same process
•  Guarantee order (vector clocks, mvcc)
o  Example : Amz Cart merger (let cart add even with partial failure)
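
Vector clocks and the cart merge can be sketched in a few lines. The node names, the cart contents and the union-based merge below are illustrative assumptions, not Amazon's implementation; the point is only that two writes made on different replicas are detected as concurrent and then reconciled rather than one silently overwriting the other.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if a == b:
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"


def merge(a, b):
    """Element-wise maximum: the clock of a reconciled version."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


base = increment({}, "x")                       # cart written via node x
v1, cart1 = increment(base, "y"), {"book"}      # add accepted by node y
v2, cart2 = increment(base, "z"), {"towel"}     # concurrent add via node z

relation = compare(v1, v2)                      # neither descends the other
merged_clock = merge(v1, v2)
merged_cart = cart1 | cart2                     # semantic reconciliation: union
```

A union never loses an added item, which matches the "let cart add even with partial failure" behaviour; the price is that a deleted item can reappear after a merge.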

Eventual Consistency & AMZ - SimpleDB
•  SimpleDB strong consistency semantics [49,50]
o  Until Feb 2010, SimpleDB only supported eventual consistency i.e. GetAttributes after PutAttributes might not be the same for some time (1 second)
o  On Feb 24, AWS Added ConsistentRead=True attribute for read
o  Read will reflect all writes that got 200OK till that time!

Eventual Consistency & AMZ - SimpleDB
•  SimpleDB strong consistency semantics [49,50]
o  Also added conditional put/delete
o  Put attribute has a specified value (Expected.1.Value=) or (Expected.1.Exists = true/false)
o  Same conditional check capability for delete also
o  Only on one attribute !
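
The idea behind a conditional put is compare-and-set on a single attribute. The sketch below is an in-memory stand-in, not the SimpleDB API: `TinyDomain`, `ConditionFailed` and the keyword names are hypothetical, chosen to mirror the Expected value / Expected exists checks on the slide.

```python
class ConditionFailed(Exception):
    """Raised when the expected attribute state does not hold."""


class TinyDomain:
    """Hypothetical in-memory store with a one-attribute conditional put."""

    def __init__(self):
        self.items = {}

    def put(self, item, name, value,
            expect_name=None, expect_value=None, expect_exists=None):
        attrs = self.items.get(item, {})
        if expect_name is not None:
            if expect_exists is not None and (expect_name in attrs) != expect_exists:
                raise ConditionFailed(f"{expect_name} existence check failed")
            if expect_value is not None and attrs.get(expect_name) != expect_value:
                raise ConditionFailed(f"{expect_name} value check failed")
        self.items.setdefault(item, {})[name] = value


db = TinyDomain()
# Create only if no version exists yet.
db.put("cart-1", "version", "1", expect_name="version", expect_exists=False)
# Optimistic update: succeeds because the version is still "1".
db.put("cart-1", "version", "2", expect_name="version", expect_value="1")

# A second writer that still believes the version is "1" is rejected.
try:
    db.put("cart-1", "version", "2", expect_name="version", expect_value="1")
    stale_rejected = False
except ConditionFailed:
    stale_rejected = True
```

This is the usual optimistic-concurrency pattern: read a version, write back conditionally on that version, and retry on failure.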

Eventual Consistency & AMZ – S3
•  S3 is an eventual consistency system
o  Versioning
o  “S3 PUT & COPY synchronously store data across multiple facilities before returning SUCCESS”
o  Repair Lost redundancy, repair bit-rot
o  Reduced Redundancy option for data that can be reproduced (99.999999999% vs. 99.99%)
•  Approx 1/3rd less
o  CloudFront for caching

!SQL ?
•  “We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines.”[43]
•  “Current systems were built in an era where resources were incredibly expensive, and every computing system was watched over by a collection of wizards in white lab coats, responsible for the care, feeding, tuning and optimization of the system. In that era, computers were expensive and people were cheap”
•  “The 1970 - 1985 period was a time of intense debate, a myriad of ideas, & considerable upheaval. We predict the next fifteen years will have the same feel”

Further deliberation
•  Daniel Abadi[45], Mike Stonebraker[46], James Hamilton[47], Pat Helland[48] are all good reads for further deliberations

NOSQL Internals & Algorithmics

Caveats
•  A representative subset of the mechanics and mechanisms used in the NOSQL world
•  Being refined & newer ones are being tried
•  At a system level – to show how the techniques play a part to deliver a capability
•  The NOSQL Papers and other references for further deliberation
•  Even if we don’t cover fully, it is OK. I want to introduce some of the concepts so that you get an appreciation …

NOSQL Mechanics
•  Horizontal Scalability
–  Gossip (Cluster membership)
–  Failure Detection
–  Consistent Hashing
–  Replication Techniques
•  Hinted Handoff
•  Merkle Trees
–  Sharding MongoDB
–  Regions in HBase
•  Performance
–  SStables/memtables
–  LSM w/Bloom Filter
•  Integrity/Version reconciliation
–  Timestamps
–  Vector Clocks
–  MVCC
–  Semantic vs. syntactic reconciliation

Consistent Hashing
•  Origin: web caching “To decrease ‘hot spots’”
•  Three goals[87]
–  Smooth evolution
•  When a new machine joins, minimum rebalance work and impact
–  Spread
•  Objects assigned to a min number of nodes
–  Load
•  # of distinct objects assigned to a node is small

Consistent Hashing
•  Hash Keyspace/Token is divided into partitions/ranges
•  Cassandra – choice
–  OrderPreserving partitioner – key = token (for range queries)
–  Also saw a CollatingOrderPreservingPartitioner
•  Partitions assigned to nodes that are logically arranged in a circle topology
•  Amz (dynamo) – assign sets of (random) multiple points to different machines depending on load
•  Cassandra – monitor load & distribute
•  Specific join & leave protocols
•  Replication – next 3 consecutive
•  Cassandra – Rack-aware, Datacenter-aware
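
A minimal ring makes the "smooth evolution" goal concrete. This is a sketch under stated assumptions: MD5 as the token function, 8 points per node and the node names are all arbitrary choices for illustration, and real systems add rack and datacenter awareness on top.

```python
import bisect
import hashlib


def token(key):
    """Map a string onto the ring (assumption: MD5 used as the hash)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes=(), points_per_node=8, replicas=3):
        self.points_per_node = points_per_node
        self.replicas = replicas
        self.tokens = []     # sorted ring positions
        self.owner = {}      # ring position -> node
        for node in nodes:
            self.add(node)

    def add(self, node):
        """A joining node takes several (pseudo-random) points on the ring."""
        for i in range(self.points_per_node):
            t = token(f"{node}#{i}")
            bisect.insort(self.tokens, t)
            self.owner[t] = node

    def preference_list(self, key):
        """Walk clockwise from the key and collect distinct nodes."""
        start = bisect.bisect_right(self.tokens, token(key))
        nodes = []
        for i in range(len(self.tokens)):
            node = self.owner[self.tokens[(start + i) % len(self.tokens)]]
            if node not in nodes:
                nodes.append(node)
                if len(nodes) == self.replicas:
                    break
        return nodes


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.preference_list(k)[0] for k in keys}
ring.add("node-d")
after = {k: ring.preference_list(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that now belong to `node-d` change their primary; nothing is reshuffled between the three existing nodes, which is the property a plain `hash(key) % n` scheme lacks.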

Consistent Hashing - Hinted-handoff
•  What happens when a node is not available ?
–  May be under load
–  May be network partition
•  Sloppy Quorum & Hinted-handoff
•  R/W performed on the 1st n healthy nodes
•  Replica sent to a host node with hint in metadata & then transferred when the actual node is up
•  Burdens neighboring nodes
•  Cassandra 0.6.2 default is disabled (I think)

Consistent Hashing - Replication
•  What happens when a new node joins ?
–  It gets one or more partitions
–  Dynamo : Copy the whole partition
–  Cassandra : Replicate keyset
–  Cassandra : working on a bit torrent type protocol to copy from replicas

Anti-entropy
•  Merge and reconciliation operations
–  Operate on two states and return a new state[86]
•  Merkle Trees
–  Dynamo uses Merkle trees to detect inconsistencies between replicas
–  AntiEntropy in Cassandra exchanges Merkle trees and if they disagree, range repair via compaction [91,92]
–  Cassandra uses the Scuttlebutt Reconciliation[86]
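
How a Merkle tree narrows down a replica mismatch can be sketched as follows. The assumptions are mine: SHA-256 as the hash, one leaf per key range, both replicas covering the same number of ranges, and an odd level padded by repeating its last hash. Neither Dynamo's nor Cassandra's actual tree layout is implied.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build(leaves):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(leaf.encode()) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # pad odd levels
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b):
    """Descend only into subtrees whose hashes disagree."""
    out = []

    def walk(depth, i):
        if a[depth][i] == b[depth][i]:
            return                              # whole subtree is in sync
        if depth == 0:
            out.append(i)
            return
        for child in (2 * i, 2 * i + 1):
            if child < len(a[depth - 1]):
                walk(depth - 1, child)

    walk(len(a) - 1, 0)
    return out


replica_1 = ["k0=a", "k1=b", "k2=c", "k3=d", "k4=e"]
replica_2 = ["k0=a", "k1=b", "k2=c", "k3=STALE", "k4=e"]
out_of_sync = differing_leaves(build(replica_1), build(replica_2))
```

If the roots match, the replicas agree and nothing else is exchanged; otherwise only the mismatching ranges need repair.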

Gossip
•  Membership & Failure detection
•  Based on emergence without rigidity – pulse coupled oscillators, biological systems like fireflies ![90]
•  Also used for state propagation
–  Used in Dynamo/Cassandra

Gossip
•  Cassandra exchanges heartbeat state, application state and so forth
•  Every second, a node picks a random live node and a random unreachable node and exchanges key-value structures
•  Some nodes play the part of seeds
•  Seed/initial contact points in static conf file (storage.conf)
•  Could also come from a configuration service like zookeeper
•  To guard against node flap, explicit membership join and leave – now you know why hinted handoff was added

Membership & Failure detection
•  Consensus & Atomic Broadcast - impossible to solve in a distributed system[88,89]
–  Cannot differentiate between a slow system and a crashed system
•  Completeness
–  Every system that crashed will be eventually detected
•  Correctness
–  A correct process is never suspected
•  In short, if you are dead somebody will notice it and if you are alive, nobody will mistake you for dead !

φ Accrual Failure Detector
•  Not a Boolean value but a probabilistic number that “accrues” over an exponential scale
•  Captures the degree of confidence that a corresponding monitored process has crashed[94]
–  Suspicion Level
–  φ = 1 -> prob(error) 10%
–  φ = 2 -> prob(error) 1%
–  φ = 3 -> prob(error) 0.1%
•  If process is dead,
–  φ is monotonically increasing & φ → ∞ as t → ∞
•  If process is alive and kicking, φ = 0
•  Account for lost messages, network latency and actual crash of system/process
•  Well known heartbeat period Δi, then network latency Δtr can be tracked by inter-arrival time modeling
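
The suspicion level can be computed from observed heartbeat inter-arrival times. The sketch below assumes an exponential model of inter-arrival times, which is a simplification chosen here for brevity (the original accrual detector paper fits a different distribution), and the class and method names are hypothetical.

```python
import math
from collections import deque


class AccrualFailureDetector:
    """Suspicion level: -log10 of P(heartbeat still to come after this gap)."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)   # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Exponential model: P(gap > elapsed) = exp(-elapsed / mean),
        # so -log10 of it is elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):        # heartbeats arriving once per second
    detector.heartbeat(t)

phi_now = detector.phi(3.0)                        # just heard from it
phi_later = detector.phi(3.0 + math.log(10))       # silent for about 2.3 s
phi_much_later = detector.phi(3.0 + 3 * math.log(10))
```

The caller picks a threshold: declaring the process dead at a level of 1 risks being wrong about 10% of the time under this model, at 3 about 0.1%, matching the scale on the slide.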

Write/Read Mechanisms
•  Read & Write to a random node (StorageProxy)
•  Proxy coordinates the read and write strategy (R/W = any, quorum et al)
•  Memtables/SSTables from big table
•  Bloom Filter/Index
•  LSM Trees

[Figure: write/read path on a node. Labels: Node, Write, Read; Memory: Commit Logs, MemTable; Flushing; Disk: SSTables, each with an Index and a Bloom Filter (BF). Hbase – WAL, Memstore, HDFS File system]
SSTable
• Immutable
• Compaction
• Maintain Index & Bloom Filter
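
The commit log, memtable, flush and compaction steps can be sketched as a toy in-memory store. Everything here is a simplification for illustration: `TinyLSM` is a hypothetical name, the commit log is a list rather than a file, and lookups scan each SSTable linearly where a real store would consult its index and Bloom filter.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.commit_log = []       # append-only stand-in for the on-disk log
        self.memtable = {}         # mutable, in memory
        self.sstables = []         # immutable sorted tuples, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a new immutable, sorted SSTable."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:            # oldest first, newer overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)          # memtable full: flushed to the first SSTable
db.put("a", 3)          # newer value lives in the memtable for now
read_before_flush = db.get("a")
db.put("c", 4)          # second flush
tables_before_compaction = len(db.sstables)
db.compact()
```

Writes never modify an SSTable in place, which is why they are fast and sequential; the cost is paid on reads and during compaction.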

How… does HBase work again?

http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/

Bloom Filter
•  The BloomFilter answers the question
•  “Might there be data for this key in this SSTable?” [Ref: Cassandra/Hbase mailer]
–  “Maybe” or
–  “Definitely not”
–  When the BloomFilter says "maybe" we have to go to disk to check out the content of the SSTable
•  Depends on implementation
–  Redone in Cassandra
–  Hbase 0.20.x removed, will be back in 0.90 with a “jazzy” implementation
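
The "maybe" versus "definitely not" behaviour falls out of a very small data structure. This is a generic sketch, not the Cassandra or HBase implementation; the bit-array size, the number of hashes and the use of salted SHA-256 digests are arbitrary choices made here.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        """False means definitely absent; True only means maybe present."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(row_key)

# Keys that were added always answer "maybe"; the disk read then confirms.
maybe = sstable_filter.might_contain("user:2")
```

A key that was added can never be reported absent, so a "definitely not" safely skips the disk read for that SSTable. False positives are possible and only cost an unnecessary read.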

Was it a vision, or a waking dream?
Fled is that music:—do I wake or sleep?
-Keats, Ode to a Nightingale

•  http://www.readwriteweb.com/enterprise/2011/11/infographic-data-deluge---8-ze.php
•  http://www.crn.com/news/data-center/232200061/efficiency-or-bust-data-centers-drive-for-low-power-solutions-prompts-channel-growth.htm
•  http://www.quantumforest.com/2011/11/do-we-need-to-deal-with-big-data-in-r/
•  http://www.forbes.com/special-report/2011/migration.html
•  http://www.mercurynews.com/bay-area-news/ci_19368103
•  http://www.businessinsider.com/apple-new-data-center-north-carolina-created-50-jobs-2011-11
