The Open Science Data Cloud is a petabyte-scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Architectures for Data Commons (XLDB 15 Lightning Talk), by Robert Grossman
These are the slides from a 5-minute Lightning Talk that I gave at XLDB 2015 on May 19, 2015 at Stanford. It is based in part on our experiences developing the NCI Genomic Data Commons (GDC).
The Matsu Project - Open Source Software for Processing Satellite Imagery Data, by Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
Adversarial Analytics - 2013 Strata & Hadoop World Talk, by Robert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...Geoffrey Fox
Advances in high-performance/parallel computing in the 1980's and 90's was spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we motivate that such benchmarks should be motivated by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" bigdata stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Comparing Big Data and Simulation Applications and Implications for Software ..., by Geoffrey Fox
At eScience in the Cloud 2014, Redmond, WA, April 30, 2014.
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large-scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers, and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote far more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data-intensive applications and to deduce the needed runtimes and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC clusters are presented.
Time to Science/Time to Results: Transforming Research in the Cloud, by Amazon Web Services
This session demonstrates how the cloud can accelerate breakthroughs in scientific research by providing on-demand access to powerful computing. You will gain insight into how scientific researchers are using the cloud to solve complex science, engineering, and business problems that require high-bandwidth, low-latency networking and very high compute capabilities. You will hear how leveraging the cloud reduces the costs and time to conduct large-scale, worldwide collaborative research. Researchers can access computational power, data storage, supercomputing resources, and data-sharing capabilities in a cost-efficient manner without implementation delays. Disease research can be accomplished in a fraction of the time, and innovative researchers in small schools or distant corners of the world have access to the same computing power as those at major research institutions by leveraging Amazon EC2, Amazon S3, compute-optimized C3 instances, and more to increase collaboration. This session will provide best practices and insight from the UC Berkeley AMP Lab on the services used to connect disparate sets of data to drive meaningful new insight and impact.
Materials Data Facility: Streamlined and automated data sharing, discovery, ..., by Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views: Problem Architecture (macro pattern); Execution Features (micro patterns); Data Source and Style; and finally the Processing (runtime) view.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
Multiple regression, COVID mobility, and COVID-19 policy recommendation, by Kan Yuenyong
Multiple regression analysis of COVID-19 policy is a contemporary agenda. This work demonstrates how to use Python for data wrangling and R for statistical analysis, in a form suitable for publication in a standard academic journal. The model examines whether lockdown policy is relevant to controlling the COVID-19 outbreak.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Big data visualization frameworks and applications at Kitware
Marcus Hanwell, Technical Leader at Kitware, Inc.
March 27th, 2014
Kitware develops permissively licensed open source frameworks and applications for scientific data applications, and related areas. Some of the frameworks developed by our High Performance Computing and Visualization group address current challenges in big data visualization and analysis in a number of application domains including geospatial visualization, social media, finance, chemistry, biological (phylogenetics), and climate. The frameworks used to develop solutions in these areas will be described, along with the applications and the nature of the underlying data. These solutions focus on shared frameworks providing data storage, indexing, retrieval, client-server delivery models, server-side serial and parallel data reduction, analysis, and diagnostics. Additionally, they provide mechanisms that enable server-side or client-side rendering based on the capabilities and configuration of the system.
Big Data Visualization Meetup - South Bay
http://www.meetup.com/Big-Data-Visualisation-South-Bay/
Practical Methods for Identifying Anomalies That Matter in Large Datasets, by Robert Grossman
Robert L. Grossman, Practical Methods for Identifying Anomalies That Matter in Large Datasets, O’Reilly, Strata + Hadoop World, San Jose, California, February 20, 2015.
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ..., by Otávio Carvalho
Work presented in partial fulfillment of the requirements for the degree of Bachelor in Computer Science, Federal University of Rio Grande do Sul, Brazil.
Hadoop, streaming, terabytes, machine learning, batch, etc. These issues underpin the deployment of a Big Data architecture in production. But fundamental as they are, what about actually using the data? From the user's point of view, the questions become: what is in my data? What numerical model is relevant to my business questions? Does my model deliver the expected value? How do I share that value?
These questions share a common root: how do you go from Hadoop, or from the data lake, to a useful and usable working environment without drowning?
That is what we propose to explore in this session.
We will walk through different use cases (exploration, interpretation, and communication of results) and discover the architectures and tools at our disposal that open up almost unlimited horizons: Superset, Tableau, and Power BI to navigate the data; notebooks (Jupyter, Zeppelin, R) to analyze it; and D3.js to create custom visualizations in the browser.
For this, there is no need to dive into an ocean of data. A modest subset, a sample that fits in a laptop's memory, is enough in most cases. This is the territory of data science, the data lab, and above all data visualization.
Working from concrete use cases, we will illustrate these different steps and see together how to get the information out of your data lake and onto your screen.
Modernizing upstream workflows with AWS Storage - John Mallory, Amazon Web Services
Modernizing and transforming exploration and production workflows with AWS Storage services:
Accelerating seismic data retrieval, getting better data protection and reliability, and providing a common AWS data platform for compute- and graphics-intensive processing, simulation, and visualization workloads.
Capturing and processing streaming sensor data from remote oil rigs with Snowball Edge
Providing a Data Lake foundation for a next generation Digital Oilfield IoT analytics platform with Amazon S3
Speaker: John Mallory - AWS Storage Business Development Manager
Large Infrastructure Monitoring at CERN, by Matthias Braeger at Big Data Spain 2015
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmadigital.com
Abstract: http://www.bigdataspain.org/program/thu/slot-7.html
The title of this talk is a crass attempt to be catchy and topical, by referring to the recent victory of Watson in Jeopardy.
My point (perhaps confusingly) is not that new computer capabilities are a bad thing. On the contrary, these capabilities represent a tremendous opportunity for science. The challenge that I speak to is how we leverage these capabilities without computers and computation overwhelming the research community in terms of both human and financial resources. The solution, I suggest, is to get computation out of the lab—to outsource it to third party providers.
Abstract follows:
We have made much progress over the past decade toward effective distributed cyberinfrastructure. In big-science fields such as high energy physics, astronomy, and climate, thousands benefit daily from tools that enable the distributed management and analysis of vast quantities of data. But we now face a far greater challenge. Exploding data volumes and new research methodologies mean that many more--ultimately most?--researchers will soon require similar capabilities. How can we possibly supply information technology (IT) at this scale, given constrained budgets? Must every lab become filled with computers, and every researcher an IT specialist?
I propose that the answer is to take a leaf from industry, which is slashing both the costs and complexity of consumer and business IT by moving it out of homes and offices to so-called cloud providers. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity, empowering investigators with new capabilities and freeing them to focus on their research.
I describe work we are doing to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date, and suggest a path towards large-scale delivery of these capabilities. I also suggest that these developments are part of a larger "revolution in scientific affairs," as profound in its implications as the much-discussed "revolution in military affairs" resulting from more capable, low-cost IT. I conclude with some thoughts on how researchers, educators, and institutions may want to prepare for this revolution.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio, by Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data kept in separate storage, such as object storage or a remote data center. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What Analytics Zoo is and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem..., by Ian Foster
Ever more data- and compute-intensive science makes computing increasingly important for research. But for advanced computing infrastructure to benefit more than the scientific 1%, we need new delivery methods that slash access costs, new sustainability models beyond direct research funding, and new platform capabilities to accelerate the development of new, interoperable tools and services.
The Globus team has been working towards these goals since 2010. We have developed software-as-a-service methods that move complex and time-consuming research IT tasks out of the lab and into the cloud, thus greatly reducing the expertise and resources required to use them. We have demonstrated a subscription-based funding model that engages research institutions in supporting service operations. And we are now also showing how the platform services that underpin Globus applications can accelerate the development and use of an integrated ecosystem of advanced science applications, such as NCAR’s Research Data Archive and OSG Connect, thus enabling access to powerful data and compute resources by many more people than is possible today.
In this talk, I introduce Globus services and the underlying Globus platform. I present representative applications and discuss opportunities that this platform presents for both small science and large facilities.
A talk at the RPI-NSF Workshop on Multiscale Modeling of Complex Data, September 12, 2011, Troy NY, USA.
We have made much progress over the past decade toward effectively harnessing the collective power of IT resources distributed across the globe. In fields such as high-energy physics, astronomy, and climate, thousands benefit daily from tools that manage and analyze large quantities of data produced and consumed by large collaborative teams.
But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that far more--ultimately most?--researchers will soon require capabilities not so different from those used by these big-science teams. How is the general population of researchers and institutions to meet these needs? Must every lab be filled with computers loaded with sophisticated software, and every researcher become an information technology (IT) specialist? Can we possibly afford to equip our labs in this way, and where would we find the experts to operate them?
Consumers and businesses face similar challenges, and industry has responded by moving IT out of homes and offices to so-called cloud providers (e.g., GMail, Google Docs, Salesforce), slashing costs and complexity. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity. More importantly, we can free researchers from the burden of managing IT, giving them back their time to focus on research and empowering them to go beyond the scope of what was previously possible.
I describe work we are doing at the Computation Institute to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date and suggest a path towards large-scale delivery of these capabilities.
This talk was given at a workshop entitled "Cybersecurity Engagement in a Research Environment" at Rady School of Management at UCSD. The workshop was organized by Michael Corn, the UCSD CISO. It tries to provoke discussion around the cybersecurity features and requirements of international science collaborations, as well as more generally, federated cyberinfrastructure systems.
AWS Summit Berlin 2013 - Big Data Analytics, by AWS Germany
Learn more about the tools, techniques, and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute, and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types that enable high-performance analytics.
Similar to Using the Open Science Data Cloud for Data Science Research
Some Frameworks for Improving Analytic Operations at Your Company, by Robert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
This is a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed, by Robert Grossman
There are two cultures in data science and analytics: those that develop analytic models and those that deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
This is an overview of the Data Biosphere Project, its goals, its architecture, and the three core projects that form its foundation. We also discuss data commons.
What is Data Commons and How Can Your Organization Build One?, by Robert Grossman
This is a talk that I gave at the Molecular Medicine Tri Conference on data commons and data sharing to accelerate research discoveries and improve patient outcomes. It also covers how your organization can build a data commons using the Open Commons Consortium's Data Commons Framework and the University of Chicago's Gen3 data commons platform.
These are the slides from a plenary panel that I participated in at IEEE Cloud 2011 on July 5, 2011 in Washington, D.C. I discussed the Open Science Data Cloud and concluded the talk with three research questions.
This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
Using the Open Science Data Cloud for Data Science Research
1. Using the Open Science Data Cloud for Data Science Research
Robert Grossman, University of Chicago, Open Cloud Consortium
June 17, 2013
2. Data (1 PB of OSDC data across several disciplines) + Instrument (3,000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms = Discoveries
3. Part 1: What Instrument Do We Use to Make Big Data Discoveries? How do we build a "datascope"?
5. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
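In other words, the criterion is weak scaling: grow the data and the hardware together and the wall-clock time should stay roughly constant. A minimal sketch of that check in Python (the 50 TB/hour per-rack throughput is a hypothetical figure, not from the slides):

```python
# Hypothetical illustration of the "big-data scalable" criterion (weak scaling):
# doubling the data *and* the racks should leave wall-clock time roughly unchanged.

def wall_clock_hours(data_tb: float, racks: int, tb_per_rack_hour: float = 50.0) -> float:
    """Idealized runtime: each rack scans its own share of the data in parallel."""
    return (data_tb / racks) / tb_per_rack_hour

base = wall_clock_hours(data_tb=1000, racks=10)    # 1 PB spread over 10 racks
scaled = wall_clock_hours(data_tb=2000, racks=20)  # 2 PB spread over 20 racks

assert abs(base - scaled) < 1e-9                   # same time, more data
print(f"{base:.1f} h on 10 racks vs {scaled:.1f} h on 20 racks")
```

If doubling both the data and the racks noticeably increases the runtime, the algorithm or the infrastructure is not big-data scalable in this sense.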
6. Commercial Cloud Service Provider (CSP)
A 15 MW data center: 100,000 servers, 1 PB of DRAM, 100's of PB of disk, and ~1 Tbps of egress bandwidth on the data center network.
Automatic provisioning and infrastructure management; monitoring, network security and forensics; accounting and billing; a customer-facing portal.
About 25 operators for a 15 MW commercial cloud.
7. OSDC's vote for a datascope: a (boutique) data center-scale facility with a big-data scalable analytic infrastructure.
8. Data (1 PB of OSDC data across several disciplines) + Instrument (3,000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms = Discoveries
9. Some Examples of Big Data Science
Discipline | Duration | Size | # Devices
HEP - LHC | 10 years | 15 PB/year* | one
Astronomy - LSST | 10 years | 12 PB/year** | one
Genomics - NGS | 2-4 years | 0.5 TB/genome | 1000's
*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. ... This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
12. There Are Two Essential Characteristics of a Cloud
1. Self service
2. Scale
• Clouds enable you to compute over large amounts of data without the necessity of first downloading the data.
• Clouds can be designed to be secure and compliant.
15. Types of Clouds
• Public clouds - Amazon
• Private clouds - run internally by universities or companies
• Community clouds - run by organizations (either formally or informally), such as the Open Cloud Consortium
16. Amazon Web Services (AWS)? vs. community clouds, science clouds, etc.
AWS: scale; the simplicity of a credit card; a wide variety of offerings.
Community and science clouds: lower cost (at medium scale); data too important for a commercial cloud; computing over scientific data is a core competency; can support any required governance / security.
The OCC supports AWS interop and bursting when permissible.
17. Science Clouds
 | NFP Science Clouds | Commercial Clouds
POV | Democratize access to data; integrate data to make discoveries; long-term archive. | As long as you pay the bill; as long as the business model holds.
Data & storage | Data-intensive computing & HP storage | Internet-style scale-out and object-based storage
Flows | Large & small data flows | Lots of small web flows
Streams | Streaming processing required | N/A
Accounting | Essential | Essential
Lock-in | Moving environments between CSPs is essential | Lock-in is good
Interop | Critical, but difficult | Customers will drive it to some degree
18. Essential Services for a Science CSP
• Support for data-intensive computing
• Support for big data flows
• Account management, authentication, and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, and event reporting
• Access to large amounts of public data
• High-performance storage
• Simple data export and import services
19. Datascope - Science Cloud Service Provider (Sci CSP): a data scientist working with Sci CSP services.
20. Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting science clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of the cyberinfrastructure for big data science.
21. Datascope - Science Cloud Service Provider (Sci CSP): a data scientist working with Sci CSP services, backed by a Cloud Service Operations Center (CSOC).
23. Foundations of data science: data; analytic infrastructure; models and algorithms; general and discipline-specific software applications and tools; and established best practices and strategies for data science in general and for discipline-specific data science in particular.
25. Theory to Big Data Spectrum
No data: mathematical theorems.
Small data (GB): traditional statistical modeling.
Medium data (TB): (semi-)automating statistical modeling.
Big data (PB): simple counts and statistics over big data.
The OSDC datascope: a 0.5-2.0 MW facility.
26. Part 4: The Open Science Data Cloud - www.opensciencedatacloud.org
27. Data (1 PB of OSDC data across several disciplines) + Instrument (3,000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms = Discoveries
29. Tukey
• Tukey (based in part on Horizon).
• We have factored out the digital ID service, file sharing, and transport from Bionimbus and Matsu.
30. Yates
• Automated installation of the OSDC software stack on a rack of computers.
• Based upon Chef.
• Version 0.1.
31. UDR
• UDT is a high-performance network transport protocol.
• UDR = rsync + UDT.
• It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
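As a rough illustration of how such a transfer might be scripted (not taken from the slides): the sketch below assumes a udr binary is installed and wraps rsync in the documented `udr rsync ...` form; the host and paths are placeholders, not OSDC endpoints.

```python
# Hypothetical wrapper for a UDR-style transfer (rsync semantics over the UDT protocol).
# Assumes a udr binary on PATH that accepts the "udr rsync ..." form; the host and
# paths below are illustrative placeholders.
import subprocess

def sync_dataset(remote: str, local: str) -> None:
    """Mirror a remote dataset locally; rsync only moves files that changed."""
    subprocess.run(["udr", "rsync", "-av", remote, local], check=True)

if __name__ == "__main__":
    sync_dataset("user@data.example.org:/public/earth_science/", "/data/mirror/earth_science/")
```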
32. Open Science Data Cloud Services
• Digital ID services
• Data sharing services
• Data transport services (UDR)
• What other core services are essential?
• Of course, working groups and applications always add their own services.
• These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.
33. www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing infrastructure to support medical and health care research: the Biomedical Commons Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
34. OCC Members & Partners
• Companies: Cisco, Yahoo!, Intel, ...
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, ...
• Federal agencies and labs: NASA
• International partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam, ...
• Partners: National Lambda Rail
35. Third-party open source software + open source software developed by the OCC and open standards (Tukey, Yates) + data center + data with permissions + authorization of users' access to data + policies, procedures, controls, etc. + governance and legal agreements + sustainability model
37. Data (1 PB of OSDC data across several disciplines) + Instrument (3,000 cores / 5 PB OSDC science cloud) + Team (you and your colleagues) + correlation algorithms = Discoveries
39. OSDC Public Data Sets
• Over 800 TB of open access data in the OSDC
• Earth sciences data
• Biological sciences data
• Social sciences data
• Digital humanities data
40. Part 6: OSDC Working Groups. Just look around you.
42. Matsu Architecture
• Hadoop HDFS holds the Level 0, Level 1, and Level 2 images.
• The Matsu MapReduce-based tiling service uses MapReduce to process Level n data into Level n+1 data and to partition images for the different zoom levels.
• A NoSQL database stores images at different zoom layers, suitable for an OGC Web Mapping Server, along with WMS tiles and derived data products, and is served through the Matsu Web Map Tile Service (WMTS).
• Analytic services: NoSQL-based, streaming, and MapReduce-based analytic services.
• Presentation services, a Web Coverage Processing Service (WCPS), and workflow services.
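To make the tiling step concrete, here is a small, self-contained sketch of the partition-images-into-tiles idea in plain Python. It is not the Matsu implementation; the record format and 256-pixel tile size are assumptions for illustration, and in Matsu the map and reduce functions would run under Hadoop MapReduce rather than in a single process.

```python
# Hypothetical MapReduce-style tiling step (plain Python, not the Matsu code):
# the mapper assigns each pixel to the tile that will contain it, and the
# reducer assembles one output tile per (tile_x, tile_y) key.
from collections import defaultdict

TILE = 256  # pixels per tile edge (assumed)

def map_pixels(record):
    """record: (x, y, value) in source pixel coordinates."""
    x, y, value = record
    key = (x // TILE, y // TILE)              # which tile this pixel falls into
    yield key, (x % TILE, y % TILE, value)

def reduce_tile(key, pixels):
    """Assemble a dense TILE x TILE array for one tile key."""
    tile = [[0] * TILE for _ in range(TILE)]
    for px, py, value in pixels:
        tile[py][px] = value
    return key, tile

def run(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_pixels(record):
            groups[key].append(value)
    return dict(reduce_tile(k, v) for k, v in groups.items())

tiles = run([(10, 20, 1), (300, 20, 2)])   # two pixels, two different tiles
print(sorted(tiles.keys()))                # [(0, 0), (1, 0)]
```

A second, similar job could then produce the next zoom level by resampling each tile, which is one way the slide's "Level n to Level n+1" step can be expressed in the same map/reduce form.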
46. Analyzing Data from The Cancer Genome Atlas (TCGA)
Current practice:
1. Apply to dbGaP for access to data.
2. Hire staff, set up, and operate a secure, compliant computing environment to manage 10-100+ TB of data.
3. Get the environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CG-Hub (takes days to weeks).
6. Begin analysis.
With the Protected Data Cloud (PDC):
1. Apply to dbGaP for access to data.
2. Use your eRA Commons credentials to log in to the PDC, select the data that you want to analyze and the pipelines that you want to use.
3. Begin analysis.
47. One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1,000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
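The arithmetic behind these bullets is easy to check; a small sketch (the 10:1 compression ratio is implied by the slide's 100 PB figure rather than stated explicitly):

```python
# Back-of-the-envelope check of the one-million-genomes numbers on this slide.
genomes = 1_000_000
tb_per_patient = 1                          # tumor + normal samples, per the slide

raw_pb = genomes * tb_per_patient / 1000    # 1,000,000 TB = 1,000 PB = 1 EB
compressed_pb = raw_pb / 10                 # ~10:1 compression implied by the slide
sequencing_cost = genomes * 1000            # at $1000 per genome

print(f"raw: {raw_pb:,.0f} PB (~1 EB)")
print(f"compressed: {compressed_pb:,.0f} PB")
print(f"sequencing cost: ${sequencing_cost / 1e9:,.1f}B")
```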
48. Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment and preventive health care.
49. Biomedical Commons Cloud (BCC) Working Group
Example: the Open Cloud Consortium's Biomedical Commons Cloud (BCC) connects Medical Research Centers A, B, and C and Hospital D across a cloud for public data, a cloud for controlled genomic data, and a cloud for EMR and PHI data.
50. Resource | Who uses it | Who operates it
Open Science Data Cloud (OSDC) | Pan-science data for researchers | Open Cloud Consortium (OCC), supported by university OCC members
Biomedical Commons Cloud (BCC) | (International) biomedical researchers | OCC Biomedical Commons Cloud Working Group, supported by OCC university members
Bionimbus Protected Data Cloud | Genomics researchers | University of Chicago, supported by the OCC
51. OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers, and they can significantly slow down a MapReduce computation.
• Stragglers are common (a dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide-area version of this project.
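A minimal sketch of the underlying idea, with everything hypothetical except the goal stated on the slide: detect tasks that are progressing much more slowly than their peers and ask the network to give their flows more bandwidth. The progress-rate record format, the 0.5 threshold, and the controller hook below are illustrative assumptions, not the working group's design.

```python
# Hypothetical straggler detection for a MapReduce job; the OpenFlow action is a stub.
from statistics import median

def find_stragglers(tasks, threshold=0.5):
    """tasks: dict of task_id -> progress rate (fraction of work done per second).
    A task is flagged if its rate falls below `threshold` times the median rate."""
    cutoff = threshold * median(tasks.values())
    return [task_id for task_id, rate in tasks.items() if rate < cutoff]

def boost_bandwidth(task_id):
    """Placeholder: here an OpenFlow controller would install a higher-priority
    flow rule for the straggler's shuffle traffic."""
    print(f"requesting extra bandwidth for {task_id}")

rates = {"map-01": 0.010, "map-02": 0.011, "map-03": 0.002, "map-04": 0.009}
for task in find_stragglers(rates):
    boost_bandwidth(task)        # flags map-03 in this toy example
```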
52. OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow: www.opensciencedatacloud.org (look for PIRE).
54. • Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?
• Question 2. What data can we add to the OSDC to facilitate data-intensive, cross-disciplinary discoveries?
• Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?
• Question 4. What tools and applications can we add to the OSDC to facilitate data-intensive, cross-disciplinary discoveries?
• Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?
• Question 6. What are 3-5 grand challenge questions that leverage the OSDC?
56. Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called The Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.