Big Data: tools and techniques for working with large data sets
Ian Stokes-Rees, PhD        Harvard Medica...
Slides and Contact   ijstokes@hkl.hms.harvard.edu   http://linkedin.com/in/ijstokes   http://slidesha.re/ijstokes-thailand...
Meta-Data: Semantic Media Wiki
  • You know Wikipedia
  • It is built using MediaWiki
  • Semantic Media Wiki adds Semanti...
Access Control
  • Need a strong Identity Management environment
    • individuals: identity tokens and identifiers
    •...
Case Study: SBGrid
  • www.sbgrid.org
  • computing expertise for protein structure and function re...
SBGrid Science Portal (diagram labels: GlobusOnline, UC San Diego, @Argonne, ...)
Data Model
  • Data Tiers
    • VO-wide: all sites, admin managed, very stable
    • User project: all sites, use...
Data Management: quota, du scan, tmpwatch, conventions, workflow integration
Data Movement: scp (users), rsync (VO-wide), grid-ftp...
[Figure legend: red - push files; green - pull files]
  1. user file upload
  2. replicate gold standard ...
  3. Auto-replicate ...
  4. pull files from UCSD to WNs ...
Copy, Move, Backup
  • Large data sets are difficult to copy, move, replicate, and backup
  • Tools and p...
Globus Online: High Performance Reliable 3rd Party File Transfer   http://www.globusonline.or...
Summary
  • Data can empower rather than overwhelm you
    • but this requires thought and planning
  • Understan...
Acknowledgements & Questions
  • Piotr Sliz
    • Principal Investigator, head of SBGrid
  • SBGrid System Administrators ...
Working with thousands, millions, or billions of data records in high dimensions is increasingly becoming the reality for scientific research. What are some techniques to make this kind of data volume tractable? How can parallel computing help? In this talk I'll review data management tools and infrastructures, languages, and paradigms that help in this regard. In particular, I'll discuss Hadoop, MapReduce, Python, NumPy, and Globus Online to provide a survey of ways in which researchers can manage their data and process it in parallel.

Published in: Technology, Education
  • Transcript of "Big Data: tools and techniques for working with large data sets"

    1. Big Data: tools and techniques for working with large data sets
        Ian Stokes-Rees, PhD
        Harvard Medical School, Boston, USA
        Workshop on Tools, Technologies and Collaborative Opportunities for HPC in Life Sciences and Healthcare
        http://portal.sbgrid.org ijstokes@hkl.hms.harvard.edu
    3. Slides and Contact
        ijstokes@hkl.hms.harvard.edu
        http://linkedin.com/in/ijstokes
        http://slidesha.re/ijstokes-thailand2011
        http://www.sbgrid.org
        http://portal.sbgrid.org
        http://www.opensciencegrid.org
        Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
    4. About Me
    17. [Figure labels: rotational, translation, 2D, simple crystal, Patterson map, search, search, score, model: aggregate, best peak, R factor, alternatives, composites and cluster, electron density]
    18. Protein Structure Studies
    33. Data, Data Everywhere ...
        • We are being overwhelmed with data
          • high temporal resolution due to fast electronics
          • high spatial resolution due to advanced imaging techniques
          • high dimensional data
          • large data sets
          • simulation
          • modeling
        • It is easy to drown in the flood of data
          • storage issues - capacity
          • ownership issues - security and collaboration
          • provenance - origin, access, changes
        Today, we’ll think about software, hardware, and models for coping with large quantities of data
    34. Next Generation Sequencing
    35. High Energy Physics
    40. • 40 MHz bunch crossing rate
        • 10 million data channels
        • 1 kHz level 1 event recording rate
        • 1-10 MB per event
        • 14 hours per day, 7+ months / year
        • 4 detectors
        • 6 PB of data / year
        • globally distribute data for analysis (x2)
    42. Molecular Dynamics Simulations
        • 1 fs time step
        • 1 ns snapshot
        • 1 µs simulation
        • 1e6 steps
        • 1000 frames
        • 10 MB / frame
        • 10 GB / sim
        • 20 CPU-years
        • 3 months (wall-clock)
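The molecular dynamics figures above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation. It assumes that "1e6 steps" means the number of time steps between two snapshots, which the slide does not state, and it uses decimal units (1 GB = 1000 MB).

```python
# Back-of-the-envelope check of the molecular dynamics numbers.
# Assumption (not stated on the slide): "1e6 steps" is the number of
# time steps between two consecutive snapshots.

# All durations expressed in femtoseconds to keep the arithmetic exact.
TIME_STEP_FS = 1                      # 1 fs time step
SNAPSHOT_INTERVAL_FS = 1_000_000      # 1 ns between snapshots
SIMULATION_LENGTH_FS = 1_000_000_000  # 1 microsecond simulation
FRAME_SIZE_MB = 10                    # 10 MB per stored frame

steps_per_snapshot = SNAPSHOT_INTERVAL_FS // TIME_STEP_FS
frames = SIMULATION_LENGTH_FS // SNAPSHOT_INTERVAL_FS
total_steps = SIMULATION_LENGTH_FS // TIME_STEP_FS
total_size_gb = frames * FRAME_SIZE_MB / 1000  # decimal GB

print(f"steps per snapshot: {steps_per_snapshot:,}")
print(f"frames stored:      {frames:,}")
print(f"total time steps:   {total_steps:,}")
print(f"output size:        {total_size_gb:g} GB")
```

Under that assumption the stored output (1000 frames of 10 MB) matches the 10 GB per simulation quoted on the slide.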
    43. Electronic Patient Records
    44. Electronic Patient Records: 77 page PDF (bespoke report)
    45. Electronic Patient Records: Clinical Document Architecture XML representation
    46. Electronic Patient Records: HTML rendering of XML via XSLT transform
    47. Clinical Imaging Data
        • DICOM - Digital Imaging and Communications in Medicine
        • 2D, 3D, 4D
    52. It is clear there is no shortage of data.
        Potential for great new insights ...
        ... if we can organize, access, share, and analyze this data efficiently
    58. Jumping to the end ...
        • Data can empower rather than overwhelm you
          • but this requires thought and planning
        • Understand your data sources
        • Understand your data consumers
        • Educate yourself on available tools and technology
        • Design your data management system suitably
    67. Problems arising from “Big Data”
        • Where to store
        • How to store
        • How to process
        • Organization, searching, and meta-data
        • How to manage access
        • How to copy, move, and backup
        • Provenance
        • Lifecycle
    71. Where to store (I)
        • RAM
          • fast
          • expensive
          • volatile
        • local disk
          • get a good controller (SATA/SAS2)
          • lots of fast spinning disk (7200+ rpm)
          • high bandwidth possible
          • good first stop for data
          • hard to share, persist, backup
          • SSD good for random reads: lots of small files, unpredictable I/O patterns
          • large files, sequential I/O, spinning disk comparable to SSDs
        • Parallel Filesystem
          • gluster, lustre, gpfs
          • HDFS (Hadoop)
          • auto-replication for parallel decentralized I/O
    75. Where to store (II)
        • SAN with high performance interconnect
          • Storage Area Network
          • fully managed data storage
          • fiber channel (2 Gb/s) or Infiniband (10, 20, 40 Gb/s) interconnect
          • parallel, non-blocking, dedicated routes
        • NAS over ethernet
          • Network Attached Storage
          • Think NFS, CIFS, Samba network interface to storage
          • ethernet 1 Gb/s with contention (effective limit of ~500 Mb/s)
          • SATA (10k rpm, 2 TB, 3 Gb/s)
          • SAS2 (15k rpm, 750 GB, 6 Gb/s)
        • Hybrid
          • Create in-house tiered storage
        • Cloud storage
          • Amazon S3
          • Box.net, Dropbox
          • BackBlaze: bit.ly/backblaze-20
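To put the interconnect speeds quoted above in perspective, here is a rough sketch of ideal transfer times. The 1 TB data set size is a hypothetical value chosen for illustration, and the calculation ignores protocol overhead, disk throughput, and contention beyond the ~500 Mb/s effective ethernet figure from the slide.

```python
# Ideal time to move a data set over the links mentioned on the slide.
# DATASET_BYTES is a hypothetical example size, not a figure from the talk.
DATASET_BYTES = 10**12  # 1 TB (decimal)

# Link rates in bits per second, taken from the slide text.
LINKS_BPS = {
    "ethernet 1 Gb/s (effective ~500 Mb/s)": 500e6,
    "fiber channel 2 Gb/s": 2e9,
    "Infiniband 10 Gb/s": 10e9,
    "Infiniband 40 Gb/s": 40e9,
}


def transfer_seconds(num_bytes, bits_per_second):
    """Return the ideal transfer time in seconds (no overhead modeled)."""
    return num_bytes * 8 / bits_per_second


for name, rate in LINKS_BPS.items():
    seconds = transfer_seconds(DATASET_BYTES, rate)
    print(f"{name}: {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

Real transfers will usually be slower than these figures, since the storage devices themselves and competing traffic limit throughput.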
    76. How to store (data formats)
        • ASCII
          • tab delimited
          • comma separated
        • XML
          • DTD definition?
          • Schema definition?
          • Namespaces?
        • JSON
        • NetCDF
        • HDF5
        • DICOM
        • Matlab .MAT format
        • NumPy .NPZ format
        • Bespoke binary
        • SQL DB
          • MySQL
          • sqlite
          • Oracle
          • Access
          • SQL Server
        • Hierarchical DB
          • Berkeley XML DB
          • LDAP
        • Object-Relational Mapper
          • SQLAlchemy (Python)
          • Hibernate (Java, .NET)
          • Django ORM (Python)
        • No-SQL DB
          • MongoDB
          • CouchDB
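As a small illustration of how the choice of format changes what you can do with the data, the sketch below stores the same records as tab-delimited ASCII, as JSON, and in an in-memory sqlite database, using only the Python standard library. The records and the column names `sample` and `resolution` are made up for the example.

```python
import csv
import io
import json
import sqlite3

# Hypothetical records, invented for illustration only.
records = [
    {"sample": "A1", "resolution": 1.8},
    {"sample": "B2", "resolution": 2.4},
]

# 1. Tab-delimited ASCII: easy to read and exchange, but untyped.
tsv_buffer = io.StringIO()
writer = csv.DictWriter(
    tsv_buffer,
    fieldnames=["sample", "resolution"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(records)
tsv_text = tsv_buffer.getvalue()

# 2. JSON: keeps basic types and nesting, still plain text.
json_text = json.dumps(records)

# 3. sqlite: a typed schema plus SQL queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sample TEXT, resolution REAL)")
conn.executemany("INSERT INTO samples VALUES (:sample, :resolution)", records)
best = conn.execute(
    "SELECT sample FROM samples ORDER BY resolution LIMIT 1"
).fetchone()[0]

print(tsv_text)
print(json_text)
print("lowest resolution value belongs to:", best)
```

The text formats are the simplest to share, while the database form lets the query run without loading and parsing every record by hand.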
    77. How to process
        • Analytical software
          • custom programs
          • Matlab
          • Perl
          • R
          • Python
          • SAS, SPSS
          • Tableau
          • web-based services
          • Map/Reduce models
        • Analytical environments
          • multi-core machine - 48+ core systems for under $5000 (USD)
          • GPU
          • compute cluster
          • supercomputers
          • grid computing
          • cloud computing
          • network of workstations (NOW)
          • “screen-saver” computing (BOINC)
    79. 48 cores, single system image
    81. GPU Computing
        • 200-800 stream processing cores per card
        • For $500 to $2000 (USD), up to an order of magnitude processing speedups may be possible
    82. Open Science Grid
        www.opensciencegrid.org
    83. 83. Map/Reduce
       • Unix users:
          • cat | grep | sort | uniq > file
       • Map/Reduce equivalent:
          • input | map | shuffle | reduce > output
       • HadoopFS (HDFS)
          • large data set is automatically spread and replicated across local storage resources (disks) of each node in a cluster
       • Map
          • creates a job for each data block in the input
          • maps the computational kernel to each job
          • schedules jobs to nodes with required data block
          • each job produces a set of key/value pairs as its result
       • Reduce
          • collects results from Map stage based on keys (Combine)
          • aggregates values to produce task (final) result
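The input | map | shuffle | reduce pipeline can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop: the word-count kernel, the function names, and the two sample "data blocks" are assumptions chosen for illustration, and on a real cluster each block would be processed as a separate job on the node that holds it.

```python
from collections import defaultdict

def map_block(block):
    """Map: emit a (key, value) pair for every word in one data block."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_values(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)

def map_reduce(blocks):
    """Run map over every block, shuffle by key, then reduce each key."""
    mapped = []
    for block in blocks:  # each block would be its own job on a cluster
        mapped.extend(map_block(block))
    grouped = shuffle(mapped)
    return dict(reduce_values(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    blocks = ["big data big tools", "data tools data"]
    print(map_reduce(blocks))
```

Because the map step only looks at one block and the reduce step only looks at one key, both stages can be spread across many machines without changing the kernel.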
    84. 84. Extensions
       • Pig and Hive
          • pig.apache.org, hive.apache.org
          • simplify writing Map/Reduce programs for Hadoop
          • SQL-like query language for datasets available on HDFS
       • Cloudera
          • www.cloudera.com
          • packaged distribution of Hadoop + extensions
          • education + training material
       • Amazon Elastic Map Reduce
          • aws.amazon.com/elasticmapreduce
          • Amazon “cloud-based” hosting of Hadoop for Map/Reduce using EC2 for compute and S3 for storage
    85. 85. Organization, Searching, and Meta-Data
       • Few “software” solutions for this problem
          • iRODS provides some of this
          • Unix “locate” database
          • SAN solutions may index software and provide tools for searching
       • Establish protocols, document, communicate
          • directory hierarchy
          • file naming
          • persisted working space
          • scratch/temporary space
       • Filesystem functionality
          • many file systems have per-file meta-data controls to add arbitrary key/value pairs
       • Augmented web-based view
          • cern_meta Apache module provides key/value pairs in HTTP HEAD
          • ability to assert arbitrary web organization on top of filesystem organization, with searching and graphical views
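The Unix “locate” idea, walking the tree once and then answering searches from a stored index, can be sketched as follows. This is a minimal stand-in written for illustration, not the actual locate implementation; the directory layout and file names in the demo are hypothetical.

```python
import os
import tempfile

def build_index(root):
    """Walk the tree once and record every file path relative to root."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            index.append(os.path.relpath(full_path, root))
    return sorted(index)

def locate(index, pattern):
    """Return indexed paths that contain the pattern as a substring."""
    return [path for path in index if pattern in path]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout following a "project / scratch" convention.
        os.makedirs(os.path.join(root, "projectA", "scratch"))
        for parts in (("projectA", "run01.mtz"), ("projectA", "scratch", "tmp.log")):
            with open(os.path.join(root, *parts), "w") as fh:
                fh.write("placeholder")
        index = build_index(root)
        print(locate(index, "run01"))
```

Searching the index is fast because the filesystem is not touched again, but the index goes stale until it is rebuilt, which is why agreed naming and directory conventions still matter.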
    86. 86. iRODS
       • www.irods.org
       • File-like paradigm for data-management
       • addition of meta-data
       • can integrate database resources
       • provides rich access policy management
       • automated workflows based on data actions
          • add, remove, modify
       • automated replication
       • built-in provenance
       • information life-cycle management
    87. 87. Search: Apache Lucene
       • lucene.apache.org
       • Java-based
       • full text querying and searching
       • indexing
       • Solr provides web interface
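Full-text search engines are typically built around an inverted index, which maps each term to the documents that contain it. The sketch below shows that idea in plain Python; it is not Lucene's API or scoring model, and the tokenizer, the AND-only query semantics, and the sample documents are simplifying assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

if __name__ == "__main__":
    documents = {
        "doc1": "Protein structure determination is compute intensive",
        "doc2": "Imaging techniques are data intensive",
        "doc3": "Protein imaging data",
    }
    index = build_inverted_index(documents)
    print(search(index, "protein imaging"))
```

Indexing costs time up front, but each query then touches only the entries for its own terms rather than scanning every document.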
    91. 91. Meta-Data: Semantic MediaWiki
       • You know Wikipedia
       • It is built using MediaWiki
       • Semantic MediaWiki adds Semantic Web features
          • Flexible key/value schemas
          • User defined and changeable object classes
          • Built-in knowledge of dates → timelines
          • Built-in knowledge of locations → maps
          • Built-in handling of images → picture galleries
    95. 95. Access Control
       • Need a strong Identity Management environment
          • individuals: identity tokens and identifiers
          • groups: membership lists
          • Active Directory/CIFS (Windows), Open Directory (Apple), FreeIPA (Unix): all LDAP-based
       • Need to manage and communicate Access Control policies
          • institutionally driven
          • user driven
       • Need Authorization System
          • Policy Enforcement Point (shell login, data access, web access, start application)
          • Policy Decision Point (store policies and understand relationship of identity token and policy)
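The split between a Policy Decision Point and a Policy Enforcement Point can be sketched as two small classes: the first stores group memberships and policies and answers yes or no, the second guards an operation and asks the first before letting it proceed. The class names, the group-based policy shape, and the users and paths below are all hypothetical; a real deployment would back the membership lists with an LDAP-based directory.

```python
from dataclasses import dataclass, field

@dataclass
class PolicyDecisionPoint:
    """Stores policies and relates identities (via groups) to permitted actions."""
    group_members: dict = field(default_factory=dict)  # group -> set of user ids
    policies: dict = field(default_factory=dict)       # (group, resource) -> set of actions

    def is_permitted(self, user, action, resource):
        """Return True if any group containing the user grants the action."""
        for (group, res), actions in self.policies.items():
            members = self.group_members.get(group, set())
            if res == resource and action in actions and user in members:
                return True
        return False

class PolicyEnforcementPoint:
    """Guards an operation and defers every decision to the PDP."""

    def __init__(self, pdp):
        self.pdp = pdp

    def access(self, user, action, resource):
        if not self.pdp.is_permitted(user, action, resource):
            raise PermissionError(f"{user} may not {action} {resource}")
        return f"{user} granted {action} on {resource}"

# Hypothetical identities, groups, and policies for illustration.
pdp = PolicyDecisionPoint(
    group_members={"lab-a": {"alice"}, "admins": {"bob"}},
    policies={
        ("lab-a", "/data/projectA"): {"read"},
        ("admins", "/data/projectA"): {"read", "write"},
    },
)
pep = PolicyEnforcementPoint(pdp)

if __name__ == "__main__":
    print(pep.access("alice", "read", "/data/projectA"))
```

Keeping the decision logic in one place means shell login, data access, and web access can each have their own enforcement point while sharing the same policies.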
    96. 96. Case Study: SBGrid
       • www.sbgrid.org
       • computing expertise for protein structure and function research
          • software
          • training
          • technical support
          • storage
          • cluster and grid computing
       • 150 member labs in consortium
          • about 1000 total researchers
       • structure imaging and model building:
          • imaging techniques are data intensive
          • model determination techniques are compute intensive
    97. 97. SBGrid Science Portal
       [Diagram: SBGrid Science Portal @ Harvard Medical School. Sites labelled: UC San Diego, Argonne, NCSA/UIUC, Lawrence Berkeley Labs, FermiLab, Indiana. Categories labelled: interfaces, data, computation, ID mgmt, monitoring. Components named: GlobusOnline, GUMS, GridFTP, Hadoop, glideinWMS factory, Open Science Grid, MyProxy, Ganglia, scp, Condor, FreeIPA, Apache, DOEGrids CA, Nagios, Cycle Server, GridSite, LDAP, RSV, SRM, VDT, Django, VOMS, Globus, pacct, WebDAV, Sage Math, Gratia Accounting, R-Studio, GACL, file server, SQL DB, shell CLI, cluster.]
    98. 98. Data Model
       • Data Tiers
          • VO-wide: all sites, admin managed, very stable
          • User project: all sites, user managed, 1-10 weeks, 1-3 GB
          • User static: all sites, user managed, indefinite, 10 MB
          • Job set: all sites, infrastructure managed, 1-10 days, 0.1-1 GB
          • Job: direct to worker node, infrastructure managed, 1 day, <10 MB
          • Job indirect: to worker node via UCSD, infrastructure managed, 1 day, <10 GB
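The data tiers above can be written down as a small lookup table so that tooling can reason about them. The sketch below encodes the tiers as listed on the slide; the `Tier` class, the use of the upper bound of each size range as `max_bytes`, and the size-based rule in `choose_job_tier` are assumptions made for illustration, not a description of how SBGrid actually routes job data.

```python
from dataclasses import dataclass
from typing import Optional

MB = 10**6
GB = 10**9

@dataclass(frozen=True)
class Tier:
    name: str
    reach: str
    managed_by: str
    lifetime: str
    max_bytes: Optional[int]  # None where the slide gives no size

# Values transcribed from the slide; size ranges are stored as their upper bound.
TIERS = {
    tier.name: tier
    for tier in (
        Tier("VO-wide", "all sites", "admin", "very stable", None),
        Tier("User project", "all sites", "user", "1-10 weeks", 3 * GB),
        Tier("User static", "all sites", "user", "indefinite", 10 * MB),
        Tier("Job set", "all sites", "infrastructure", "1-10 days", 1 * GB),
        Tier("Job", "direct to worker node", "infrastructure", "1 day", 10 * MB),
        Tier("Job indirect", "to worker node via UCSD", "infrastructure", "1 day", 10 * GB),
    )
}

def choose_job_tier(size_bytes):
    """Assumed rule: small job data goes direct, larger data goes via UCSD."""
    if size_bytes < TIERS["Job"].max_bytes:
        return TIERS["Job"]
    if size_bytes < TIERS["Job indirect"].max_bytes:
        return TIERS["Job indirect"]
    raise ValueError("data set too large for a per-job tier")

if __name__ == "__main__":
    print(choose_job_tier(5 * MB).name)
    print(choose_job_tier(2 * GB).name)
```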
    99. 99. Data Management
       • quota
       • du scan
       • tmpwatch
       • conventions
       • workflow integration
       Data Movement
       • scp (users)
       • rsync (VO-wide)
       • grid-ftp (UCSD)
       • curl (WNs)
       • cp (NFS)
       • htcp (secure web)
    108. 108. [Data flow diagram] red - push files; green - pull files
       1. user file upload
       2. replicate gold standard
       3. Auto-replicate
       4. pull files from UCSD to WNs
       5. pull files from local NFS to WNs
       6. pull files from SBGrid to WNs
       7. job results copied back to SBGrid
       8a. large job results copied to UCSD
       8b. later pulled to SBGrid
    122. 122. Copy, Move, Backup
       • Large data sets are difficult to copy, move, replicate, and back up
       • Tools and protocols required, with management
          • sys admin (technical knowledge)
          • archivist/curator (domain knowledge)
       • Common structure:
          • Tier 1 - single master copy of data (live), possible offline tape backup
          • Tier 2 - multiple reliable T-1 replicas serving a specific community
          • Tier 3 - temporary “working set” T-2 replicas of required data
       • GridFTP
       • Storage Resource Broker (SRB)
       • GlobusOnline
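Whatever tool moves the bytes, a replica is only useful if it can be shown to match the master copy. The sketch below copies a file from a "Tier 1" directory into a "Tier 2" directory and verifies it with a SHA-256 checksum. It uses a local copy as a stand-in for GridFTP or Globus Online transfers, and the function names and directory names are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicate(master_path, replica_dir):
    """Copy a file into replica_dir and verify it against the master's checksum."""
    os.makedirs(replica_dir, exist_ok=True)
    replica_path = os.path.join(replica_dir, os.path.basename(master_path))
    shutil.copy2(master_path, replica_path)
    if sha256_of(master_path) != sha256_of(replica_path):
        os.remove(replica_path)  # never leave a corrupt replica behind
        raise IOError(f"checksum mismatch replicating {master_path}")
    return replica_path

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        master = os.path.join(root, "tier1", "dataset.bin")
        os.makedirs(os.path.dirname(master))
        with open(master, "wb") as fh:
            fh.write(os.urandom(4096))
        replica = replicate(master, os.path.join(root, "tier2"))
        print(replica, sha256_of(replica))
```

Reading in fixed-size chunks keeps memory use flat, which matters once individual files reach gigabytes.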
    123. 123. Globus Online: High Performance Reliable 3rd Party File Transfer
       • http://www.globusonline.org
       • [Diagram; endpoints labelled: portal, cluster, data collection facility, lab file server, desktop, laptop]
    125. 125. Summary
       • Data can empower rather than overwhelm you
          • but this requires thought and planning
       • Understand your data sources
       • Understand your data consumers
       • Educate yourself on available tools and technology
       • Design your data management system suitably
    127. 127. Acknowledgements & Questions
       • Piotr Sliz
          • Principal Investigator, head of SBGrid
       • SBGrid System Administrators
          • Ian Levesque, Peter Doherty
       • Globus Online Team
          • Steve Tuecke, Ian Foster, Rachana Ananthakrishnan, Raj Kettimuthu
       • Terrence Martin
          • System administrator at UCSD, for assistance and encouragement using 1 PB Hadoop storage array
       • Brian Bockelman
          • Physics faculty at University of Nebraska
       • Steve Timm
          • System administrator at FermiLab
       • Ruth Pordes
          • Director of OSG, for championing SBGrid
       Please contact me with any questions:
       • Ian Stokes-Rees
       • ijstokes@hkl.hms.harvard.edu
       • ijstokes@spmetric.com
       Look at our work:
       • portal.sbgrid.org
       • www.sbgrid.org
       • www.opensciencegrid.org