The Open Cloud Consortium is a not-for-profit that operates the Open Science Data Cloud, a cloud computing infrastructure supporting scientific research. It manages cloud computing testbeds and resources donated by universities, companies, government agencies, and international partners, with the goal of democratizing access to data and computing power for scientific discovery.
Using the Open Science Data Cloud for Data Science Research, by Robert Grossman
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
The Matsu Project - Open Source Software for Processing Satellite Imagery Data, by Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
Architectures for Data Commons (XLDB 15 Lightning Talk), by Robert Grossman
These are the slides from a 5 minute Lightning Talk that I gave at XLDB 2015 on May 19, 2015 at Stanford. It is based in part on our experiences developing the NCI Genomic Data Commons (GDC).
These are the slides from a plenary panel that I participated in at IEEE Cloud 2011 on July 5, 2011 in Washington, D.C. I discussed the Open Science Data Cloud and concluded the talk with three research questions.
Large Scale On-Demand Image Processing For Disaster Relief, by Robert Grossman
This is a status update (as of Feb 22, 2010) of a new Open Cloud Consortium project that will provide on-demand, large scale image processing to assist with disaster relief efforts.
This is a talk titled "Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack" that I gave at CAMDA 2009 on October 6, 2009.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes..., by Geoffrey Fox
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing the current capabilities of Apache Hadoop, Spark, Flink, and Heron, as well as MPI and Asynchronous Many-Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function-as-a-Service architecture. Note that this "new grid" is focused on data and IoT, not computing. It uses interoperable common abstractions but multiple polymorphic implementations.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Materials Data Facility: Streamlined and automated data sharing, discovery, ..., by Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for its support.
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views: Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing (runtime) View.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
We present a software model built on the Apache software stack (ABDS), widely used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Presentation at a public event at C asean, hosted by the National Innovation Agency of Thailand. This talk provides an overview of the Open and Collaborative Science in Development Network, its history, goals, research objectives and the network partners. In particular, it highlights the rationale behind the drafting of a set of principles underlying a vision of open science that has at its core a commitment to equitable participation in the production and circulation of scientific knowledge.
Data quality is very important for downstream analyses such as sequence assembly and single-nucleotide polymorphism identification. These slides cover parameters for NGS data quality checks and the data formats of the major sequencing machines.
Next-generation sequencing format and visualization with ngs.plot, by Li Shen
Lecture given at the Department of Neuroscience, Icahn School of Medicine at Mount Sinai. ngs.plot has been published in BMC Genomics. Link: http://www.biomedcentral.com/1471-2164/15/284
Presentation covering the data and file formats commonly used in next-generation sequencing (high-throughput sequencing) analyses: from nucleotide ambiguity codes, FASTA and FASTQ, and quality scores to SAM and BAM, CIGAR strings, and the variant call format. This was given as part of the EPIZONE Workshop on Next Generation Sequencing applications and Bioinformatics in Brussels, Belgium, in April 2016.
Practical Methods for Identifying Anomalies That Matter in Large Datasets, by Robert Grossman
Robert L. Grossman, Practical Methods for Identifying Anomalies That Matter in Large Datasets, O’Reilly, Strata + Hadoop World, San Jose, California, February 20, 2015.
Adversarial Analytics - 2013 Strata & Hadoop World Talk, by Robert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
Positioning University of California Information Technology for the Future: State, National, and International IT Infrastructure Trends and Directions, by Larry Smarr
05.02.15
Invited Talk
The Vice Chancellor of Research and Chief Information Officer Summit
“Information Technology Enabling Research at the University of California”
Title: Positioning University of California Information Technology for the Future: State, National, and International IT Infrastructure Trends and Directions
Oakland, CA
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biomedical Sciences, by Larry Smarr
11.04.06
Joint Presentation
UCSD School of Medicine Research Council
Larry Smarr, Calit2 & Phil Papadopoulos, SDSC/Calit2
Title: High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biomedical Sciences
Grid optical network service architecture for data intensive applications, by Tal Lavian, Ph.D.
An integrated software system provides the "glue."
A dynamic optical network serves as a fundamental Grid service in data-intensive Grid applications: scheduled, managed, and coordinated to support collaborative operations.
From supercomputer to super-network: in the past, computer processors were the fastest part, with peripheral bottlenecks; in the future, optical networks will be the fastest part, and computers, processors, storage, visualization, and instrumentation will be the slower "peripherals."
eScience cyberinfrastructure focuses on computation, storage, data, analysis, and workflow. The network is vital for better eScience.
Impact of Grid Computing on Network Operators and HW Vendors, by Tal Lavian, Ph.D.
The network is a prime resource for large-scale distributed systems.
An integrated software system provides the "glue."
A dynamic optical network serves as a fundamental Grid service in data-intensive Grid applications: scheduled, managed, and coordinated to support collaborative operations.
The next-generation sequencing data deluge requires storage and compute services to be provisioned at an ever-increasing rate. Can cloud computing (and last decade's buzzword, grid) help us?
Talk given at the NHGRI Cloud computing workshop, 2010.
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Intensive Research, by Larry Smarr
11.12.12
Seminar Presentation
Princeton Institute for Computational Science and Engineering (PICSciE)
Princeton University
Title: A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Intensive Research
Princeton, NJ
Some Frameworks for Improving Analytic Operations at Your Company, by Robert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
This is a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed, by Robert Grossman
There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
This is an overview of the Data Biosphere Project, its goals, its architecture, and the three core projects that form its foundation. We also discuss data commons.
What is a Data Commons and How Can Your Organization Build One? by Robert Grossman
This is a talk that I gave at the Molecular Medicine Tri Conference on data commons and data sharing to accelerate research discoveries and improve patient outcomes. It also covers how your organization can build a data commons using the Open Commons Consortium's Data Commons Framework and the University of Chicago's Gen3 data commons platform.
This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
The Open Science Data Cloud: Empowering the Long Tail of Science
1. A 501(c)(3) not-for-profit operating clouds for science. The Open Science Data Cloud: Empowering the Long Tail of Science. October 12, 2012. Robert L. Grossman, University of Chicago and Open Cloud Consortium.
2. Question 1. What is the cyberinfrastructure required to manage, analyze, archive and share big data? Call this analytic infrastructure.
3. Question 2. What is the analogy of the GLIF* for analytic infrastructure?
*GLIF (www.glif.is), the Global Lambda Integrated Facility, is an international virtual organization that promotes the paradigm of lambda networking. GLIF provides lambdas internationally as an integrated facility to support data-intensive scientific research, and supports middleware development for lambda networking.
4. Number of projects vs. data size and infrastructure:
• 1000's of individual scientists & small projects: small data, public infrastructure.
• 100's of community-based science projects, via Science as a Service: medium to large data, shared community infrastructure.
• 10's of very large projects: very large data, dedicated infrastructure.
5. The long tail of data science: a few large data science projects; many smaller data science projects.
6. Part 1. What Instrument Do We Use to Make Big Data Discoveries? How do we build a "datascope?"
8. Another way: opencompute.org. Think of data as big if you measure it in MW, as in Facebook's Prineville Data Center, which is 30 MW.
9. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
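The weak-scaling property described above can be sketched in a few lines. This is an illustrative model, not OSDC code: the per-rack throughput figure is an assumption, and the 950 TB/rack figure is taken from the 2012 rack design later in the deck.

```python
# Illustrative sketch of "big-data scalable": adding racks of data together
# with matching processors keeps the running time flat while total data grows.
# Model: each rack processes its own local share of the data in parallel.

TB_PER_RACK = 950          # data that arrives with each rack (2012 rack design)
TB_PER_HOUR_PER_RACK = 50  # assumed per-rack processing throughput

def completion_time_hours(racks: int) -> float:
    """Time to process all data when every rack scans its own share."""
    total_tb = racks * TB_PER_RACK
    aggregate_throughput = racks * TB_PER_HOUR_PER_RACK
    return total_tb / aggregate_throughput

# Doubling the racks doubles the data processed, but not the time.
assert completion_time_hours(4) == completion_time_hours(8)
```

If the computation were not big-data scalable (say, with a serial merge step whose cost grows with total data), the assertion above would fail, which is exactly the property the slide's definition rules out.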
10. Commercial Cloud Service Provider (CSP): a 15 MW data center.
• Monitoring, accounting and billing
• Network security and forensics
• Customer-facing portal
• Automatic provisioning and infrastructure management
• 100,000 servers, 1 PB DRAM, 100's of PB of disk
• ~1 Tbps egress bandwidth
• 25 operators for 15 MW
• Commercial cloud data center network
11. My vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure. What would a global integrated facility for datascopes look like?
12. Some Examples of Big Data Science (discipline, duration, size, # devices):
• HEP - LHC: 10 years, 15 PB/year*, one device
• Astronomy - LSST: 10 years, 12 PB/year**, one device
• Genomics - NGS: 2-4 years, 0.5 TB/genome, 1000's of devices
*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
14. The datascope as a Science Cloud Service Provider (Sci CSP): a data scientist working with Sci CSP services.
15. What are some of the important differences between commercial and research-focused Sci CSPs?
16. Science CSP vs. Commercial CSP:
• POV: democratize access to data, integrate data to make discoveries, long-term archive (Science CSP) vs. as long as you pay the bill, and as long as the business model holds (Commercial CSP).
• Data & storage: data-intensive science clouds with computing & high-performance storage vs. Internet-style scale-out and object-based storage.
• Flows: large data flows in and out vs. lots of small web flows.
• Streams: streaming processing required vs. not applicable.
• Accounting: essential for both.
• Lock-in: moving environments between CSPs is essential vs. lock-in is good.
17. Part 2. The Open Cloud Consortium's Open Science Data Cloud
18. The Open Cloud Consortium:
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
www.opencloudconsortium.org
19. OCC Members & Partners
• Companies: Cisco, Yahoo!, Citrix, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA, LLNL, ORNL
• International partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
• Partners: National Lambda Rail
20. OCC 2011 Resources (resource, type, comments):
• OSDC Adler & Sullivan: utility cloud, 1248 cores and 0.4 PB disk
• OCC-Y: data cloud, 928 cores and 1.0 PB disk
• OCC-Matsu: mixed, 1 rack
• OSDC Root: storage, 0.8 PB
OCC-Adler, Sullivan & Root will more than double in size in 2012.
22. One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
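The arithmetic behind these figures can be checked directly. Decimal units are assumed (1 PB = 1000 TB, 1 EB = 1000 PB), and the roughly 10x compression ratio is an assumption chosen to match the 100 PB figure on the slide.

```python
# Back-of-the-envelope check of the One Million Genomes figures.
TB_PER_GENOME = 1            # ~1 TB per patient (tumor + normal tissue)
GENOMES = 1_000_000
COMPRESSION_RATIO = 10       # assumed ~10x, matching the slide's 100 PB
COST_PER_GENOME = 1_000      # dollars, at $1000/genome

total_tb = GENOMES * TB_PER_GENOME       # 1,000,000 TB
total_pb = total_tb / 1_000              # 1000 PB
total_eb = total_pb / 1_000              # 1 EB
compressed_pb = total_pb / COMPRESSION_RATIO
total_cost = GENOMES * COST_PER_GENOME

assert total_pb == 1_000 and total_eb == 1   # 1000 PB = 1 EB
assert compressed_pb == 100                  # ~100 PB compressed
assert total_cost == 1_000_000_000           # about $1B
```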
23. Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment, and preventive health care.
25. UDR
• UDT is a high performance network transport protocol.
• UDR = rsync + UDT.
• It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
26. OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers, and they can significantly slow down a MapReduce computation.
• Stragglers are common (a dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide area version of this project.
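A toy model shows why stragglers matter: a wave of parallel tasks finishes only when its slowest task does, so one slow task sets the completion time for the whole job, and speeding up just that task (as extra bandwidth from the OpenFlow approach aims to do) pulls the makespan back toward the normal task time. The task times and the 4x speedup below are illustrative assumptions, not measurements from the working group.

```python
# Toy straggler model (illustrative; not the working group's code).
# A wave of parallel MapReduce tasks completes when the slowest task completes.

def makespan(task_times):
    """Completion time of a wave of parallel tasks."""
    return max(task_times)

normal = [10.0] * 99            # 99 tasks take 10 time units each
straggler_time = 50.0           # one straggler takes 5x longer

baseline = makespan(normal + [straggler_time])        # straggler dominates: 50.0

# Suppose added network bandwidth speeds the straggler up 4x.
boosted = makespan(normal + [straggler_time / 4])     # back near normal: 12.5

assert baseline == 50.0
assert boosted == 12.5
```

The point of the model: although the straggler is 1 task out of 100, it alone quintuples the job's completion time, which is why targeting extra bandwidth at stragglers rather than at all tasks is attractive.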
27. OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow: www.opensciencedatacloud.org (look for PIRE).
29. Open Science Data Cloud (OSDC)
• Accounting and billing
• Monitoring, compliance, & security
• Customer-facing portal (Tukey)
• Science cloud SW & services
• Automatic provisioning and infrastructure management
• 3 PB in 2011, 10 PB in 2012; able to scale to 100 PB?
• ~100 Gbps bandwidth
• 5-12 operators to operate 1-5 MW
• Science cloud data center network
• OSDC data stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
30. Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting science clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of cyberinfrastructure for big data science.
31. OSDC Racks
• How quickly can we set up a rack?
• How efficiently can we operate a rack? (racks/admin)
2012 OSDC rack design (draft):
• 950 TB / rack
• 600 cores / rack
32. Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
34. Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.
35. For more information
• You can find some more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.
Center for Research Informatics