Managing Big Data: An Introduction to Data Intensive Computing
1. An Introduction to Data Intensive Computing
Chapter 2: Data Management
Robert Grossman, University of Chicago & Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
2. 1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
3. What Are the Choices?
• Databases (SQL Server, Oracle, DB2)
• File systems
• Distributed file systems (Hadoop, Sector)
• Clustered file systems (GlusterFS, …)
• NoSQL databases (HBase, Accumulo, Cassandra, SimpleDB, …)
• Applications (R, SAS, Excel, etc.)
4. What Is the Fundamental Trade-Off?
Scale up vs. scale out …
6. Advice From Jim Gray
1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."
7. Pattern 1: Put the metadata in a database and point to files in a file system.
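A minimal sketch of Pattern 1 in Python, using the standard-library sqlite3 module as the metadata database. The table, columns, and file paths are illustrative assumptions, not part of the original slides.

```python
# Pattern 1 (illustrative sketch): metadata rows live in a relational
# database, while the bulky payloads stay as ordinary files in a file
# system. Table and column names here are hypothetical.
import sqlite3

conn = sqlite3.connect("catalog.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS observations (
           id        INTEGER PRIMARY KEY,
           target    TEXT,
           band      TEXT,
           taken_at  TEXT,
           file_path TEXT   -- pointer to the large file, not the data itself
       )"""
)

# Register a large image file by storing only its metadata and path.
conn.execute(
    "INSERT INTO observations (target, band, taken_at, file_path) VALUES (?, ?, ?, ?)",
    ("M31", "r", "2011-11-14", "/data/images/m31_r_20111114.fits"),
)
conn.commit()

# Queries run against the small metadata table; the application then
# opens the referenced files directly from the file system.
for target, path in conn.execute(
    "SELECT target, file_path FROM observations WHERE band = 'r'"
):
    print(target, "->", path)
```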
8. Example: Sloan Digital Sky Survey
• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 40 TB of raw data
  – 5 TB of processed catalogs
  – 2.5 terapixels of images
• Catalog uses Microsoft SQL Server
• Started in 1992, finished in 2008
• JHU SkyServer serves millions of queries
10. [Architecture diagram: a GWT-based front end, database services, analysis pipelines & re-analysis services, data cloud services, data ingestion services, utility cloud services, and intercloud services.]
11. [The same architecture annotated with technologies: a GWT-based front end, database services (PostgreSQL), analysis pipelines & re-analysis services, large data cloud services (Hadoop, Sector/Sphere), data ingestion services, elastic cloud services (Eucalyptus, OpenStack), an ID service, and intercloud services (UDT, replication).]
13. Hadoop's Large Data Cloud
Hadoop's stack, from bottom to top:
• Storage services: Hadoop Distributed File System (HDFS)
• Compute services: Hadoop's MapReduce
• Data services: NoSQL databases
• Applications
14. Pattern 2: Put the data into a distributed file system.
15. Hadoop Design
• Designed to run over commodity components that fail.
• Data replicated, typically three times.
• Block-based storage (a toy placement sketch follows this list).
• Optimized for efficient scans with high throughput, not low-latency access.
• Designed for write once, read many.
• Append operation planned for the future.
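A toy sketch of the two design points above, block-based storage and three-way replication: a file is split into fixed-size blocks and each block is placed on three distinct nodes. This is not Hadoop code; the block size and node names are invented for illustration.

```python
# Toy illustration of block-based storage with 3x replication.
import random

BLOCK_SIZE = 64 * 1024 * 1024          # HDFS traditionally used 64 MB blocks
DATA_NODES = [f"datanode{i}" for i in range(1, 7)]
REPLICATION = 3

def place_blocks(file_size: int):
    """Return, for each block of the file, the nodes holding a replica."""
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    placements = []
    for _ in range(num_blocks):
        # Pick three distinct nodes for this block's replicas.
        placements.append(random.sample(DATA_NODES, REPLICATION))
    return placements

# A 200 MB file becomes 4 blocks, each stored on 3 of the 6 nodes, so the
# loss of any single commodity node never loses data.
for i, nodes in enumerate(place_blocks(200 * 1024 * 1024)):
    print(f"block {i}: {nodes}")
```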
16. Hadoop Distributed File System (HDFS) Architecture
[Diagram: a Name Node handles the control traffic from the client, while Data Nodes spread across racks carry the data traffic; a client-side usage sketch follows this slide.]
• HDFS is block-based.
• Written in Java.
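As a usage sketch of the control/data split above, the snippet below uses the third-party `hdfs` Python package (a WebHDFS client). That package, the namenode URL and port, the user, and the paths are all assumptions made for illustration; the slides do not mention them.

```python
# Hedged sketch: talking to HDFS from Python via WebHDFS.
# Assumes the third-party `hdfs` package (pip install hdfs); the URL,
# port, user, and paths below are placeholders.
from hdfs import InsecureClient

# The client sends control requests (create, list, locate blocks) to the
# Name Node; the block data itself flows to and from the Data Nodes.
client = InsecureClient("http://namenode.example.org:50070", user="hadoop")

client.write("/user/hadoop/example.txt", data=b"hello, hdfs\n", overwrite=True)

with client.read("/user/hadoop/example.txt") as reader:
    print(reader.read())

print(client.list("/user/hadoop"))
```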
17. Sector Distributed File System (SDFS) Architecture
• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block-based.
• Has a security server that provides authorizations.
• Has multiple master name servers so that there is no single point of failure.
• Uses UDT to support wide-area operations.
18. Sector Distributed File System (SDFS) Architecture
[Diagram: Master Nodes handle the control traffic from the client, with a Security Server alongside; Slave Nodes spread across racks carry the data traffic.]
• Sector is file-based.
• Written in C++.
• Security server.
• Multiple masters.
19. GlusterFS Architecture
• No metadata server.
• No single point of failure.
• Uses algorithms to determine the location of data (see the sketch after this list).
• Can scale out by adding more bricks.
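A toy sketch of the "algorithms, not a metadata server" idea: every client can compute which brick holds a file by hashing its path, so no central lookup is needed. This illustrates hash-based placement in general, not GlusterFS's actual elastic hashing; the brick names are invented.

```python
# Toy illustration of algorithmic (hash-based) file placement, in the
# spirit of GlusterFS's no-metadata-server design. Not GlusterFS code.
import hashlib

BRICKS = ["brick1", "brick2", "brick3", "brick4", "brick5", "brick6"]

def brick_for(path: str) -> str:
    """Any client can compute the brick for a path without asking a server."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return BRICKS[int(digest, 16) % len(BRICKS)]

for p in ["/genomes/sample1.fa", "/genomes/sample2.fa", "/logs/2011-11-14.log"]:
    print(p, "->", brick_for(p))

# Note: a plain modulo remaps many files when a brick is added; real
# systems use smarter hashing so that scaling out moves little data.
```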
20. GlusterFS Architecture
[Diagram: the client talks directly to GlusterFS Server bricks spread across racks; the data path is file-based.]
22. Evolution
• Standard architecture for simple web applications:
  – Presentation: front-end, load-balanced web servers
  – Business logic layer
  – Backend database
• The database layer does not scale with large numbers of users or large amounts of data.
• Alternatives arose:
  – Sharded (partitioned) databases or master-slave databases
  – memcache
23. Scaling RDBMSs
• Master-slave database systems
  – Writes go to the master.
  – Reads go to the slaves.
  – Writing to the slaves can be a bottleneck; slaves can be inconsistent.
• Sharded databases (see the routing sketch after this list)
  – Applications and queries must understand the sharding schema.
  – Both reads and writes scale.
  – No native, direct support for joins across shards.
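A minimal sketch of why "applications and queries must understand the sharding schema": the application itself hashes the shard key to pick a database, and a join across shards would have to be stitched together in application code. The shard count, table, and in-memory stand-in databases are hypothetical.

```python
# Illustrative sketch of application-side sharding.
import hashlib
import sqlite3

NUM_SHARDS = 4
# Stand-ins for four separate database servers.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, name TEXT)")

def shard_for(user_id: str) -> sqlite3.Connection:
    """The application, not the database, maps a key to its shard."""
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
    return shards[h % NUM_SHARDS]

def insert_user(user_id: str, name: str) -> None:
    shard_for(user_id).execute("INSERT INTO users VALUES (?, ?)", (user_id, name))

def get_user(user_id: str):
    cur = shard_for(user_id).execute(
        "SELECT name FROM users WHERE user_id = ?", (user_id,)
    )
    return cur.fetchone()

insert_user("u42", "Ada")
print(get_user("u42"))   # served by exactly one shard; cross-shard joins
                         # would have to be assembled in the application
```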
24. NoSQL Systems
• The name suggests no SQL support, but is also read as "Not Only SQL."
• One or more of the ACID properties is not supported.
• Joins are generally not supported.
• Usually flexible schemas.
• Some well-known examples: Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra.
• Quite a few recent open source systems.
27. CAP – Choose Two Per Operation
C = Consistency, A = Availability, P = Partition-resiliency.
• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent (e.g. Dynamo, Cassandra).
• CP: always consistent, even in a partition, but a reachable replica may deny service without a quorum (e.g. BigTable, HBase).
28. CAP Theorem
• Proposed by Eric Brewer, 2000.
• Three properties of a system: consistency, availability, and partitions.
• You can have at most two of these three properties for any shared-data system.
• Scale-out requires partitions.
• Most large web-based systems choose availability over consistency.
Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002.
29. Eventual Consistency
• If no updates occur for a while, all updates eventually propagate through the system and all the nodes will be consistent.
• Eventually, a node is either updated or removed from service.
• Can be implemented with a gossip protocol (see the toy simulation after this list).
• Amazon's Dynamo popularized this approach.
• Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
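A toy simulation of the gossip idea: each round, every node exchanges its latest known (version, value) with a few random peers, and after enough quiet rounds all replicas converge. This is a last-write-wins sketch for illustration only, not Dynamo's actual protocol; the node count and fan-out are arbitrary.

```python
# Toy gossip / eventual-consistency simulation (last-write-wins).
import random

NUM_NODES = 8
FANOUT = 2

# Each node stores (version, value) for a single key.
nodes = [(0, "old") for _ in range(NUM_NODES)]

# A write lands on one replica first.
nodes[3] = (1, "new")

def gossip_round(nodes):
    updated = list(nodes)
    for i in range(len(nodes)):
        for j in random.sample(range(len(nodes)), FANOUT):
            # Exchange states; the higher version wins on both sides.
            winner = max(updated[i], updated[j])
            updated[i] = updated[j] = winner
    return updated

rounds = 0
while any(state != (1, "new") for state in nodes):
    nodes = gossip_round(nodes)
    rounds += 1

print(f"all {NUM_NODES} replicas converged after {rounds} gossip rounds")
```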
30. Different Types of NoSQL Systems
• Distributed key-value systems
  – Amazon's S3 key-value store (Dynamo)
  – Voldemort
  – Cassandra
• Column-based systems
  – BigTable
  – HBase
  – Cassandra
• Document-based systems
  – CouchDB
31. HBase Architecture
[Diagram: many clients reach the system through a REST API or a Java client; an HBaseMaster coordinates several HRegionServers, each backed by its own disk. Source: Raghu Ramakrishnan.]
32. HRegion Server
• Records are partitioned by column family into HStores.
  – Each HStore contains many MapFiles.
• All writes to an HStore are applied to a single memcache.
• Reads consult the MapFiles and the memcache.
• Memcaches are flushed to disk as MapFiles (HDFS files) when full (a simplified sketch follows this slide).
• Compactions limit the number of MapFiles.
[Diagram: within an HRegionServer, writes go to the memcache, reads consult the memcache and the HStore's MapFiles, and the memcache is flushed to disk. Source: Raghu Ramakrishnan.]
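A simplified sketch of the write path described above, written as generic Python rather than HBase code: writes go to an in-memory buffer (the "memcache"), which is flushed to an immutable sorted file when it fills, and a compaction merges the files to keep their number bounded. The thresholds and file handling are invented for illustration.

```python
# Simplified, generic sketch of the HStore-style write path:
# writes -> in-memory buffer -> flushed as sorted immutable "MapFiles"
# -> periodic compaction merges them. Not HBase's actual classes.
FLUSH_THRESHOLD = 3      # flush after this many buffered writes (arbitrary)
MAX_FILES = 2            # compact when more MapFiles than this exist

memcache = {}            # in-memory writes for one column family
mapfiles = []            # each "MapFile" is an immutable sorted list of (key, value)

def put(key, value):
    memcache[key] = value
    if len(memcache) >= FLUSH_THRESHOLD:
        flush()

def flush():
    mapfiles.append(sorted(memcache.items()))   # write out a sorted file
    memcache.clear()
    if len(mapfiles) > MAX_FILES:
        compact()

def compact():
    merged = {}
    for f in mapfiles:                          # older files first,
        merged.update(dict(f))                  # newer values overwrite
    mapfiles[:] = [sorted(merged.items())]

def get(key):
    if key in memcache:                         # reads consult the memcache...
        return memcache[key]
    for f in reversed(mapfiles):                # ...then the MapFiles, newest first
        d = dict(f)
        if key in d:
            return d[key]
    return None

for i in range(10):
    put(f"row{i}", f"value{i}")
print(get("row5"), len(mapfiles), "MapFile(s) after compaction")
```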
33. Facebook's Cassandra
• Modeled after BigTable's data model.
• Modeled after Dynamo's eventual consistency.
• Peer-to-peer storage architecture using consistent hashing (Chord-style hashing); a minimal ring sketch follows.
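A minimal sketch of consistent hashing, the Chord-style placement named above: nodes and keys hash onto the same ring, and a key belongs to the first node clockwise from it, so adding or removing a node only moves the keys adjacent to it. The node names and the choice of hash are illustrative.

```python
# Minimal consistent-hashing ring (Chord-style placement sketch).
import bisect
import hashlib

def ring_position(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        """A key belongs to the first node clockwise from its hash."""
        pos = ring_position(key)
        i = bisect.bisect(self.points, (pos, ""))
        return self.points[i % len(self.points)][1]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
for k in ["user:1", "user:2", "user:3"]:
    print(k, "->", ring.node_for(k))

# Adding a node only remaps the keys that now fall just before it on the
# ring; everything else keeps its old placement.
```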
34. Databases vs. NoSQL Systems
• Scalability: databases – 100's of TB; NoSQL systems – 100's of PB.
• Functionality: databases – full SQL-based queries, including joins; NoSQL systems – optimized access to sorted tables (tables with single keys).
• Optimization: databases are optimized for safe writes; NoSQL clouds are optimized for efficient reads.
• Consistency model: databases – ACID (Atomicity, Consistency, Isolation & Durability), the database is always consistent; NoSQL systems – eventual consistency, updates eventually propagate through the system.
• Parallelism: databases – difficult because of the ACID model, though shared-nothing is possible; NoSQL systems – the basic design incorporates parallelism over commodity components.
• Scale: databases – racks; NoSQL systems – a data center.
41. Pattern 4: Put the data into a distributed key-value store.
42. S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt.
• If you own osdc.org, you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt.
43. S3 Keys
• Keys must be unique within a bucket.
• Values can be as large as 5 TB (formerly 5 GB).
44. S3 Security
• AWS access key (user name)
  – This functions as your S3 username. It is an alphanumeric text string that uniquely identifies users.
• AWS secret key (functions as a password)
(A minimal access sketch follows.)
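A minimal sketch tying the three S3 slides together, using the boto3 Python SDK, which is an assumption here (it post-dates these slides). The bucket name and key are placeholders, and the access/secret keys are read from the standard environment variables rather than hard-coded.

```python
# Hedged sketch of the S3 concepts above using boto3 (pip install boto3).
# The AWS access key and secret key are picked up from AWS_ACCESS_KEY_ID
# and AWS_SECRET_ACCESS_KEY; bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

# Bucket names are global across AWS, so a domain-style name avoids clashes.
bucket = "tutorial.osdc.org"          # placeholder; must be globally unique
s3.create_bucket(Bucket=bucket)

# Keys only need to be unique within the bucket; values (objects) can be
# up to 5 TB each.
s3.put_object(Bucket=bucket, Key="dataset1.txt", Body=b"example contents\n")

obj = s3.get_object(Bucket=bucket, Key="dataset1.txt")
print(obj["Body"].read())
```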
49. The Basic Problem
• TCP was never designed to move large data sets over wide-area, high-performance networks.
• As a general rule, reading data off disks is slower than transporting it over the network.
50. TCP Throughput vs. RTT and Packet Loss
[Plot: TCP throughput (Mb/s) falls as the round-trip time (ms) grows from LAN distances to US-EU and US-Asia distances, and falls further as packet loss rises from 0.01% to 0.5%. Source: Yunhong Gu, 2007, experiments over a wide-area 1G network.]
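The shape of that plot is captured by the well-known Mathis et al. approximation for steady-state TCP throughput, reproduced here for context (it is not on the original slide): throughput falls inversely with round-trip time and with the square root of the loss rate.

```latex
% Mathis et al. (1997) approximation for steady-state TCP throughput:
% MSS is the segment size, RTT the round-trip time, p the packet loss rate.
\[
  \text{throughput} \;\lesssim\; \frac{\text{MSS}}{\text{RTT}} \cdot \frac{C}{\sqrt{p}},
  \qquad C \approx 1.22 .
\]
% For example, with MSS = 1460 bytes, RTT = 200 ms, and p = 0.1%,
% this gives roughly 2.3 Mb/s -- far below a 1 Gb/s link.
```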
51. The Solution
• Use parallel TCP streams
  – GridFTP
• Use specialized network protocols
  – UDT, FAST, etc.
• Use RAID to stripe data across disks to improve throughput when reading.
• These techniques are well understood in HEP and astronomy, but not yet in biology.
52. Case Study: Bio-mirror
"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport."
– Don Gilbert, August 2010, bio-mirror.net
53. Moving 113 GB of Bio-mirror Data

Site      RTT (ms)   TCP    UDT   TCP/UDT      Km
NCSA          10      139    139      1        200
Purdue        17      125    125      1        500
ORNL          25      361    120      3      1,200
TACC          37      616    120      5      2,000
SDSC          65      750    475      1.6    3,300
CSTNET       274     3722    304     12     12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times are in minutes. Source: http://gridip.bio-mirror.net/biomirror/
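As a rough sanity check on the table (my arithmetic, not the source's, taking 1 GB = 8 Gb): converting 113 GB and the listed minutes into an effective rate shows why the UDT column matters at long round-trip times.

```latex
% Effective throughput for 113 GB (~904 Gb):
\[
  \text{NCSA (TCP or UDT):}\quad
  \frac{904\ \text{Gb}}{139 \times 60\ \text{s}} \approx 108\ \text{Mb/s}
\]
\[
  \text{CSTNET:}\quad
  \frac{904\ \text{Gb}}{3722 \times 60\ \text{s}} \approx 4\ \text{Mb/s (TCP)}
  \qquad
  \frac{904\ \text{Gb}}{304 \times 60\ \text{s}} \approx 50\ \text{Mb/s (UDT)}
\]
```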
54. Case Study: CGI 60 Genomes
• Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
• Approximately 18 TB at about 0.5 Gb/s on a 1G link.
Source: Complete Genomics.
55. Resource Use

Protocol         CPU Usage*      Memory*
GridFTP (UDT)    1.0% - 3.0%     40 MB
GridFTP (TCP)    0.1% - 0.6%      6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/.
56. Sector/Sphere
• Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.
57. Questions?
For the most current version of these notes, see rgrossman.com.