An Introduction to Data Intensive Computing

Chapter 2: Data Management

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
  
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
  
	
  
What Are the Choices?

- Applications (R, SAS, Excel, etc.)
- Databases (SQL Server, Oracle, DB2)
- NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, ...)
- File Systems
- Clustered File Systems (GlusterFS, ...)
- Distributed File Systems (Hadoop, Sector)
  
What is the Fundamental Trade-Off?

Scale up vs. scale out ...
  
Section 2.1
Databases
  
Advice From Jim Gray

1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."
  
Pattern 1: Put the metadata in a database and point to files in a file system.
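Below is a minimal sketch of this pattern, assuming a SQLite catalog; the table, column names, and file path are illustrative and not from the tutorial. The small, queryable metadata lives in the database, while each row only points at a bulky file kept on a (possibly distributed) file system.

```python
# Minimal sketch of Pattern 1: keep queryable metadata in a database
# and leave the bulky files on a file system.
# Table, column, and path names here are illustrative.
import sqlite3

conn = sqlite3.connect("catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        image_id   INTEGER PRIMARY KEY,
        ra         REAL,   -- right ascension of the field
        dec        REAL,   -- declination of the field
        band       TEXT,   -- photometric band (u, g, r, i, z)
        file_path  TEXT    -- pointer to the raw file, not the pixels
    )
""")
conn.execute(
    "INSERT INTO images (ra, dec, band, file_path) VALUES (?, ?, ?, ?)",
    (180.0, 45.0, "r", "/data/survey/run1234/frame-r-001234.fits"),
)
conn.commit()

# Queries run against the small metadata table; only matching files are opened.
for (path,) in conn.execute(
    "SELECT file_path FROM images WHERE band = 'r' AND ra BETWEEN 179 AND 181"
):
    print(path)
```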
  	
  
Example: Sloan Digital Sky Survey

- Two surveys in one
  - Photometric survey in 5 bands
  - Spectroscopic redshift survey
- Data is public
  - 40 TB of raw data
  - 5 TB of processed catalogs
  - 2.5 Terapixels of images
- Catalog uses Microsoft SQL Server
- Started in 1992, finished in 2008
- JHU SkyServer serves millions of queries
  	
  
Example: Bionimbus Genomics Cloud

www.bionimbus.org
  
Bionimbus architecture (two diagrams), with its main components:

- GWT-based front end
- Database services (PostgreSQL)
- ID service
- Data ingestion services
- Analysis pipeline & re-analysis services
- Utility / elastic cloud services (Eucalyptus, OpenStack)
- Large data cloud services (Hadoop, Sector/Sphere)
- Intercloud services (UDT, replication)
  
Section 2.2
Distributed File Systems

Sector/Sphere
  
Hadoop's Large Data Cloud (Hadoop's Stack)

- Applications
- Compute Services: Hadoop's MapReduce
- Data Services: NoSQL Databases
- Storage Services: Hadoop Distributed File System (HDFS)
Pattern 2: Put the data into a distributed file system.
  
Hadoop Design

- Designed to run over commodity components that fail.
- Data is replicated, typically three times.
- Block-based storage.
- Optimized for efficient scans with high throughput, not low-latency access.
- Designed for write once, read many.
- Append operation planned for the future.
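A minimal sketch of the write-once, read-many workflow, assuming a running HDFS cluster with the `hadoop` command-line client on the PATH; the /user/demo paths are illustrative.

```python
# Minimal sketch: load a file into HDFS and read it back through the
# standard command-line client. Assumes `hadoop` is on PATH and HDFS is up;
# the /user/demo paths are illustrative.
import subprocess

def hdfs(*args):
    """Run a `hadoop fs` subcommand and return its stdout."""
    out = subprocess.run(["hadoop", "fs", *args],
                         check=True, capture_output=True, text=True)
    return out.stdout

hdfs("-mkdir", "-p", "/user/demo")
hdfs("-put", "-f", "local_data.txt", "/user/demo/data.txt")   # write once
print(hdfs("-ls", "/user/demo"))   # the listing shows the replication factor
print(hdfs("-cat", "/user/demo/data.txt"))                    # read many
```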
  
Hadoop Distributed File System (HDFS) Architecture

- HDFS is block-based.
- Written in Java.

Diagram: a Client exchanges control messages with the Name Node and data with the Data Nodes, which are spread across racks.
  
Sector Distributed File System (SDFS) Architecture

- Broadly similar to the Google File System and the Hadoop Distributed File System.
- Uses the native file system; it is not block-based.
- Has a security server that provides authorizations.
- Has multiple master name servers, so there is no single point of failure.
- Uses UDT to support wide-area operations.
  
Sector Distributed File System (SDFS) Architecture

- SDFS is file-based.
- Written in C++.
- Security server.
- Multiple masters.

Diagram: a Client exchanges control messages with the Master Nodes and the Security Server, and data with the Slave Nodes, which are spread across racks.
  
GlusterFS Architecture

- No metadata server.
- No single point of failure.
- Uses algorithms to determine the location of data.
- Can scale out by adding more bricks.
  
GlusterFS Architecture

- File-based.

Diagram: a Client exchanges data with a GlusterFS Server, which stores data on bricks spread across racks.
  
Section 2.3
NoSQL Databases
  
Evolution

- Standard architecture for simple web applications:
  - Presentation: front-end, load-balanced web servers
  - Business logic layer
  - Backend database
- The database layer does not scale with large numbers of users or large amounts of data.
- Alternatives arose:
  - Sharded (partitioned) databases or master-slave databases
  - memcache
  
Scaling RDBMS

- Master-slave database systems
  - Writes go to the master.
  - Reads go to the slaves.
  - Writing to the slaves can be a bottleneck, and reads can be inconsistent.
- Sharded databases
  - Applications and queries must understand the sharding schema (see the sketch below).
  - Both reads and writes scale.
  - No native, direct support for joins across shards.
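An illustrative sketch of application-level sharding, assuming hash-based routing; the shard names and the user_id key are hypothetical and not from the tutorial.

```python
# Illustrative sketch of application-level sharding: the application must
# know the sharding schema to route each read or write to the right database.
# Shard names and the user_id key are hypothetical.
import hashlib

SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]

def shard_for(user_id: str) -> str:
    """Pick a shard by hashing the key; all queries for this key go there."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice"))  # reads and writes for "alice" always hit one shard
print(shard_for("bob"))

# A join across users on different shards has no native support: the
# application would have to query each shard and merge the results itself.
```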
  
NoSQL Systems

- The name suggests "no SQL support," but is also read as "Not Only SQL."
- One or more of the ACID properties is not supported.
- Joins are generally not supported.
- Usually flexible schemas.
- Some well-known examples: Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra.
- Quite a few recent open source systems.
  
Pattern 3: Put the data into a NoSQL application.
  
  
CAP - Choose Two Per Operation

Consistency (C), Availability (A), Partition-resiliency (P):

- CA: available and consistent, unless there is a partition.
- CP: always consistent, even in a partition, but a reachable replica may deny service without quorum. (BigTable, HBase)
- AP: a reachable replica provides service even in a partition, but may be inconsistent. (Dynamo, Cassandra)
  
CAP Theorem

- Proposed by Eric Brewer, 2000.
- Three properties of a system: consistency, availability, and partitions.
- You can have at most two of these three properties for any shared-data system.
- Scale-out requires partitions.
- Most large web-based systems choose availability over consistency.

Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002
  
Eventual Consistency

- If no updates occur for a while, all updates eventually propagate through the system and all the nodes will be consistent.
- Eventually, a node is either updated or removed from service.
- Can be implemented with a gossip protocol (see the sketch below).
- Amazon's Dynamo popularized this approach.
- Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
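A toy sketch of gossip-style anti-entropy to illustrate eventual consistency; it is not Dynamo's actual protocol, and the last-writer-wins merge rule is an assumption made for brevity.

```python
# Toy sketch of gossip-style anti-entropy: each round, two random replicas
# exchange their latest (version, value); once updates stop, all replicas
# converge. Illustrates eventual consistency, not Dynamo's real protocol.
import random

class Node:
    def __init__(self):
        self.version, self.value = 0, None

    def merge(self, other):
        # Last-writer-wins: keep whichever replica has the newer version.
        if other.version > self.version:
            self.version, self.value = other.version, other.value

nodes = [Node() for _ in range(5)]
nodes[0].version, nodes[0].value = 1, "v1"   # one replica accepts a write

rounds = 0
while any(n.value != "v1" for n in nodes):
    a, b = random.sample(nodes, 2)           # gossip with a random peer
    a.merge(b)
    b.merge(a)
    rounds += 1

print(f"all {len(nodes)} replicas consistent after {rounds} gossip exchanges")
```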
  
Different Types of NoSQL Systems

- Distributed key-value systems
  - Amazon's S3 Key-Value Store (Dynamo)
  - Voldemort
  - Cassandra
- Column-based systems
  - BigTable
  - HBase
  - Cassandra
- Document-based systems
  - CouchDB
  
HBase Architecture

Diagram: clients (including a Java client and a REST API) connect to the HBaseMaster, which coordinates a set of HRegionServers, each backed by disk.

Source: Raghu Ramakrishnan
  
HRegion Server

- Records are partitioned by column family into HStores.
  - Each HStore contains many MapFiles.
- All writes to an HStore are applied to a single memcache.
- Reads consult the MapFiles and the memcache.
- Memcaches are flushed as MapFiles (HDFS files) when full.
- Compactions limit the number of MapFiles.

Diagram: within an HRegionServer, writes go to the memcache, which is flushed to disk as MapFiles in an HStore; reads consult the MapFiles.

Source: Raghu Ramakrishnan
  
Facebook's Cassandra

- Modeled after BigTable's data model.
- Modeled after Dynamo's eventual consistency.
- Peer-to-peer storage architecture using consistent hashing (Chord hashing).
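A minimal sketch of consistent hashing on a ring, in the spirit of Chord/Dynamo; the hash function choice and node names are illustrative, not Cassandra's implementation.

```python
# Minimal sketch of consistent hashing on a ring: keys and nodes are hashed
# onto the same circle, and a key is stored on the first node clockwise from
# its position. Node names and the SHA-1 choice are illustrative.
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = ring_hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
for key in ("row:1", "row:2", "row:3"):
    print(key, "->", ring.node_for(key))

# Adding or removing a node only remaps the keys between it and its
# predecessor on the ring, which is why peer-to-peer stores scale this way.
```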
  
                   Databases                                NoSQL Systems
Scalability        100's of TB                              100's of PB
Functionality      Full SQL-based queries,                  Optimized access to sorted tables
                   including joins                          (tables with single keys)
Optimized          Databases optimized for safe writes      Clouds optimized for efficient reads
Consistency model  ACID (Atomicity, Consistency,            Eventual consistency - updates
                   Isolation & Durability) - the            eventually propagate through the
                   database is always consistent            system
Parallelism        Difficult because of the ACID model;     Basic design incorporates parallelism
                   shared nothing is possible               over commodity components
Scale              Racks                                    Data center
  
Section 2.3
Case Study: Project Matsu
  
Zoom Levels / Bounds

- Zoom Level 1: 4 images
- Zoom Level 2: 16 images
- Zoom Level 3: 64 images
- Zoom Level 4: 256 images

Source: Andrew Levine
  
Build Tile Cache in the Cloud - Mapper

Step 1: Input to Mapper
  Mapper Input Key: Bounding Box (minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5)
  Mapper Input Value: the original image

Step 2: Processing in Mapper
  The mapper resizes and/or cuts up the original image into pieces to output bounding boxes.

Step 3: Mapper Output
  Many (Mapper Output Key: Bounding Box, Mapper Output Value: image tile) pairs, one per tile.

Source: Andrew Levine
  
Build Tile Cache in the Cloud - Reducer

Step 1: Input to Reducer
  Reducer Key Input: Bounding Box (minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375)
  Reducer Value Input: the image tiles for that bounding box

Step 2: Reducer Output
  Assemble images based on the bounding box.
  - Output to HBase
  - Builds up layers for WMS for various datasets

Source: Andrew Levine
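A schematic skeleton of the mapper/reducer flow described above, written in the Hadoop Streaming style; tile payloads are stood in by file paths, and all names are illustrative rather than Project Matsu's actual code.

```python
# Schematic skeleton of the tile-cache MapReduce pattern above, in the Hadoop
# Streaming style (tab-separated key/value lines on stdin/stdout). Tile
# payloads are stood in by file paths; this illustrates the key/value flow
# only, not Project Matsu's actual code.
import sys
from itertools import groupby

def cut_into_tiles(bbox, image_path):
    # Placeholder for the real resize/cut step; yields (tile_bbox, tile) pairs.
    yield bbox, image_path

def mapper(lines):
    # Input: "<bounding box>\t<path to source image>"
    # Output: one line per tile, keyed by the tile's bounding box.
    for line in lines:
        bbox, image_path = line.rstrip("\n").split("\t")
        for tile_bbox, tile_path in cut_into_tiles(bbox, image_path):
            print(f"{tile_bbox}\t{tile_path}")

def reducer(lines):
    # Hadoop delivers lines sorted by key, so all tiles for one bounding box
    # arrive together; assemble them and write the result out (e.g. to HBase).
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for bbox, group in groupby(parsed, key=lambda kv: kv[0]):
        tiles = [path for _, path in group]
        print(f"{bbox}\tassembled {len(tiles)} tiles")

if __name__ == "__main__":
    if sys.argv[1:] == ["reduce"]:
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)
```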
  
HBase Tables

- An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query translates to the HBase schema:
  - Layers, Styles, Projection, Size
- Table name: WMS Layer
  - Row ID: Bounding Box of the image
  - Column Family: Style Name and Projection
  - Column Qualifier: Width x Height
  - Value: Buffered Image
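A hedged sketch of this schema using the happybase HBase client. It assumes a running HBase Thrift server on localhost; the table name, column-family name, and tile file are illustrative and follow the mapping above rather than the project's actual configuration.

```python
# Hedged sketch of the WMS-layer schema above using the happybase client
# (assumes a running HBase Thrift server on localhost; names are illustrative).
import happybase

connection = happybase.Connection("localhost")

# One table per WMS layer; one column family per style + projection.
# HBase column-family names cannot contain ':', so the projection is written
# here as EPSG4326.
table_name = "wms_bluemarble"
if table_name.encode() not in connection.tables():
    connection.create_table(table_name, {"default_EPSG4326": dict()})

table = connection.table(table_name)

row_id = b"-135.0,45.0,-112.5,67.5"      # row key: bounding box of the tile
column = b"default_EPSG4326:256x256"     # family:qualifier = style+projection : WxH
with open("tile.png", "rb") as f:
    table.put(row_id, {column: f.read()})   # value: the buffered image bytes

# A WMS GetMap request then maps directly onto a single-row read:
tile_bytes = table.row(row_id)[column]
print(len(tile_bytes), "bytes served for bounding box", row_id.decode())
```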
  
Section 2.4
Distributed Key-Value Stores

S3
  
Pattern 4: Put the data into a distributed key-value store.
  
S3 Buckets

- S3 bucket names must be unique across AWS.
- A good practice is to use a pattern like
      tutorial.osdc.org/dataset1.txt
  for a domain you own.
- The file is then referenced as
      tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
- If you own osdc.org, you can create a DNS CNAME entry to access the file as
      tutorial.osdc.org/dataset1.txt
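A minimal sketch of this naming pattern using the boto3 client (not part of the original tutorial). It assumes AWS credentials are already configured, that the domain-style bucket name is available in your account, and that the default region is us-east-1 (other regions need a CreateBucketConfiguration).

```python
# Minimal sketch of the domain-style bucket pattern above using boto3.
# Assumes configured AWS credentials, an available bucket name, and the
# us-east-1 region (other regions need a CreateBucketConfiguration).
import boto3

s3 = boto3.client("s3")
bucket = "tutorial.osdc.org"            # bucket named after a domain you own

s3.create_bucket(Bucket=bucket)
s3.upload_file("dataset1.txt", bucket, "dataset1.txt")

# The object is then reachable at
#   tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
# and, with a CNAME from tutorial.osdc.org to that endpoint, at
#   tutorial.osdc.org/dataset1.txt
head = s3.head_object(Bucket=bucket, Key="dataset1.txt")
print(head["ContentLength"], "bytes stored")
```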
  
S3 Keys

- Keys must be unique within a bucket.
- Values can be as large as 5 TB (formerly 5 GB).
  
S3 Security

- AWS access key (user name)
  - This functions as your S3 username. It is an alphanumeric text string that uniquely identifies users.
- AWS secret key (functions as a password)
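A hedged sketch of where the key pair plugs in when using boto3; the key strings below are placeholders, and in practice boto3 would normally read them from the environment or ~/.aws/credentials rather than from source code.

```python
# Hedged sketch: the access key / secret key pair maps onto an explicit boto3
# session. The key strings are placeholders; prefer environment variables or
# ~/.aws/credentials over hard-coding them.
import boto3

session = boto3.session.Session(
    aws_access_key_id="AKIA...",       # AWS access key: functions as the username
    aws_secret_access_key="...",       # AWS secret key: functions as the password
)
s3 = session.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```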
  
AWS Account Information

Screenshot: the AWS account page listing Access Keys, which play the role of a user name and password.
  
Other Amazon Data Services

- Amazon SimpleDB (SDB)
- Amazon Elastic Block Store (EBS)
  
Section 2.5
Moving Large Data Sets
  
The Basic Problem

- TCP was never designed to move large data sets over wide-area, high-performance networks.
- As a general rule, reading data off disks is slower than transporting it over the network.
  network.	
  	
  	
  
Figure: TCP Throughput vs RTT and Packet Loss. Throughput (Mb/s, 0-1000) falls sharply as round-trip time (1-400 ms) and packet loss (0.01% and above) increase; markers indicate LAN, US, US-EU, and US-ASIA distances.

Source: Yunhong Gu, 2007, experiments over wide area 1G.
  
The Solution

- Use parallel TCP streams (see the sketch below)
  - GridFTP
- Use specialized network protocols
  - UDT, FAST, etc.
- Use RAID to stripe data across disks to improve throughput when reading.
- These techniques are well understood in HEP and astronomy, but not yet in biology.
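A conceptual sketch of the "parallel TCP streams" idea, which GridFTP provides natively: split a large object into byte ranges and fetch the ranges over several TCP connections at once. The URL is a placeholder and the server is assumed to report Content-Length and honor HTTP Range requests.

```python
# Conceptual sketch of parallel TCP streams: fetch byte ranges of one large
# object over several connections at once. URL is a placeholder; the server
# must report Content-Length and support HTTP Range requests.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://example.org/large-dataset.bin"   # placeholder
STREAMS = 4

def fetch_range(byte_range):
    start, end = byte_range
    r = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    return start, r.content

size = int(requests.head(URL, timeout=60).headers["Content-Length"])
step = size // STREAMS + 1
ranges = [(i, min(i + step - 1, size - 1)) for i in range(0, size, step)]

with ThreadPoolExecutor(max_workers=STREAMS) as pool:
    parts = dict(pool.map(fetch_range, ranges))       # one TCP stream per range

with open("large-dataset.bin", "wb") as out:
    for start in sorted(parts):                        # reassemble in order
        out.write(parts[start])
```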
  
Case Study: Bio-mirror

"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport."

-- Don Gilbert, August 2010, bio-mirror.net
  
Moving 113 GB of Bio-mirror Data

Site      RTT (ms)   TCP (min)   UDT (min)   TCP/UDT   Km
NCSA      10         139         139         1         200
Purdue    17         125         125         1         500
ORNL      25         361         120         3         1,200
TACC      37         616         120         5.1       2,000
SDSC      65         750         475         1.6       3,300
CSTNET    274        3722        304         12        12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times in minutes. Source: http://gridip.bio-mirror.net/biomirror/
  
Case Study: CGI 60 Genomes

- Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
- Approximately 18 TB at about 0.5 Gbps on a 1G link.

Source: Complete Genomics.
  	
  	
  
Resource Use

Protocol        CPU Usage*     Memory*
GridFTP (UDT)   1.0% - 3.0%    40 MB
GridFTP (TCP)   0.1% - 0.6%    6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/.
  
Sector/Sphere

- Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.
  
Questions?

For the most current version of these notes, see rgrossman.com
  
