Big Data: tools and techniques for working with large data sets
Working with thousands, millions, or billions of data records in high dimensions is increasingly becoming the reality for scientific research. What are some techniques to make this kind of data volume tractable? How can parallel computing help? In this talk I'll review data management tools and infrastructures, languages, and paradigms that help in this regard. In particular, I'll discuss Hadoop, MapReduce, Python, NumPy, and Globus Online to provide a survey of ways in which researchers can manage their data and process it in parallel.


Transcript

  • 1. Big Data: tools and techniques for working with large data sets. Ian Stokes-Rees, PhD, Harvard Medical School, Boston, USA. Workshop on Tools, Technologies and Collaborative Opportunities for HPC in Life Sciences and Healthcare. http://portal.sbgrid.org ijstokes@hkl.hms.harvard.edu
  • 2–3. Slides and Contact: ijstokes@hkl.hms.harvard.edu http://linkedin.com/in/ijstokes http://slidesha.re/ijstokes-thailand2011 http://www.sbgrid.org http://portal.sbgrid.org http://www.opensciencegrid.org
  • 4–16. About Me
  • 17. Figure: rotational and translational searches of a 2D simple crystal Patterson map; score model: aggregate best peak, R factor, alternatives, composites, and cluster electron density
  • 18. Protein Structure Studies
  • 21–33. Data, Data Everywhere ...
    • We are being overwhelmed with data:
      • high temporal resolution due to fast electronics
      • high spatial resolution due to advanced imaging techniques
      • high dimensional data
      • large data sets
      • simulation
      • modeling
    • It is easy to drown in the flood of data:
      • storage issues: capacity
      • ownership issues: security and collaboration
      • provenance: origin, access, changes
    • Today, we’ll think about software, hardware, and models for coping with large quantities of data
  • 34. Next Generation Sequencing
  • 35. High Energy Physics
  • 40. 40 MHz bunch crossing rate; 10 million data channels; 1 kHz level-1 event recording rate; 1–10 MB per event; 14 hours per day, 7+ months per year; 4 detectors; 6 PB of data per year; data distributed globally for analysis (x2)
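A rough sense of scale for the figures on this slide (a back-of-the-envelope sketch, not from the original deck; the 1 MB/event value is simply the low end of the quoted range):

    # Sustained data rate at one detector, using the slide's level-1 figures.
    event_rate_hz   = 1_000          # 1 kHz level-1 event recording rate
    event_size_mb   = 1.0            # low end of the 1-10 MB per event range
    seconds_per_day = 14 * 3600      # 14 hours of running per day

    gb_per_second = event_rate_hz * event_size_mb / 1024
    tb_per_day    = gb_per_second * seconds_per_day / 1024
    print(f"~{gb_per_second:.2f} GB/s sustained, ~{tb_per_day:.0f} TB per running day")

Rates on this order, sustained across several detectors and months of running, are why the retained data has to be distributed globally for analysis rather than processed at a single site.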
  • 41–42. Molecular Dynamics Simulations: 1 fs time step; 1 ns snapshot; 1 µs simulation; 1e6 steps; 1000 frames; 10 MB per frame; 10 GB per simulation; 20 CPU-years; 3 months (wall-clock)
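The bookkeeping on this slide can be reproduced directly (a minimal sketch; only the numbers quoted above are used):

    # 1 us trajectory, 1 fs integration step, snapshots every 1 ns, ~10 MB per frame.
    fs, ns, us = 1e-15, 1e-9, 1e-6

    steps_per_frame = ns / fs                 # 1e6 integration steps per snapshot
    n_frames        = us / ns                 # 1000 stored frames per simulation
    data_per_sim_gb = n_frames * 10 / 1024    # at 10 MB per frame

    print(f"{steps_per_frame:.0e} steps/frame, {n_frames:.0f} frames, "
          f"~{data_per_sim_gb:.1f} GB per simulation")   # roughly the 10 GB quoted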
  • 43. Electronic Patient Records
  • 44. Electronic Patient Records: 77-page PDF (bespoke report)
  • 45. Electronic Patient Records: Clinical Document Architecture XML representation
  • 46. Electronic Patient Records: HTML rendering of the XML via XSLT transform
  • 47. Clinical Imaging Data: DICOM (Digital Imaging and Communications in Medicine); 2D, 3D, 4D
  • 48. Clinical Imaging Data
  • 50–52. It is clear there is no shortage of data. Potential for great new insights ... if we can organize, access, share, and analyze this data efficiently
  • 53–58. Jumping to the end ...
    • Data can empower rather than overwhelm you, but this requires thought and planning
    • Understand your data sources
    • Understand your data consumers
    • Educate yourself on available tools and technology
    • Design your data management system suitably
  • 59–67. Problems arising from “Big Data”
    • Where to store
    • How to store
    • How to process
    • Organization, searching, and meta-data
    • How to manage access
    • How to copy, move, and backup
    • Provenance
    • Lifecycle
  • 68–71. Where to store (I)
    • RAM: fast, expensive, volatile
    • Local disk: get a good controller (SATA/SAS2); lots of fast spinning disk (7200+ rpm); high bandwidth possible; a good first stop for data; hard to share, persist, and backup. SSD is good for random reads (lots of small files, unpredictable I/O patterns); for large files and sequential I/O, spinning disk is comparable to SSDs
    • Parallel filesystem: Gluster, Lustre, GPFS; HDFS (Hadoop); auto-replication for parallel, decentralized I/O
  • 72–75. Where to store (II)
    • SAN with high performance interconnect: Storage Area Network; fully managed data storage; fiber channel (2 Gb/s) or InfiniBand (10, 20, 40 Gb/s) interconnect; parallel, non-blocking, dedicated routes
    • NAS over ethernet: Network Attached Storage; think NFS, CIFS, Samba network interface to storage; ethernet 1 Gb/s with contention (effective limit of ~500 Mb/s); SATA (10k rpm, 2 TB, 3 Gb/s); SAS2 (15k rpm, 750 GB, 6 Gb/s)
    • Cloud storage: Amazon S3; Box.net, Dropbox; BackBlaze: bit.ly/backblaze-20
    • Hybrid: create in-house tiered storage
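To make the interconnect numbers above concrete, here is a hedged sketch of how long a bulk transfer takes on each link; the 1 TB dataset size is an assumption for illustration:

    # Transfer-time estimates for the link speeds quoted on this slide.
    dataset_tb = 1.0                                  # illustrative dataset size
    links_gbps = {
        "1 GbE with contention (~0.5 Gb/s)": 0.5,
        "fiber channel (2 Gb/s)":            2.0,
        "InfiniBand (10 Gb/s)":              10.0,
    }
    for name, gbps in links_gbps.items():
        hours = dataset_tb * 8 * 1024 / gbps / 3600   # TB -> Tb -> seconds -> hours
        print(f"{name}: ~{hours:.1f} h to move {dataset_tb:.0f} TB")

In practice the gap between nominal link speed and effective throughput (contention, protocol overhead, disk speed at either end) usually dominates, which is why the slide quotes an effective limit well below the nominal 1 Gb/s.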
  • 76. How to store (data formats)
    • ASCII: tab delimited; comma separated; XML (DTD definition? Schema definition? Namespaces?); JSON
    • Binary formats: NetCDF; HDF5; DICOM; Matlab .MAT format; NumPy .NPZ format; bespoke binary
    • SQL DB: MySQL; sqlite; Oracle; Access; SQL Server
    • Hierarchical DB: Berkeley XML DB; LDAP
    • Object-Relational Mapper: SQLAlchemy (Python); Hibernate (Java, .NET); Django ORM (Python)
    • No-SQL DB: MongoDB; CouchDB
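As a small illustration of two of the formats listed on this slide, the sketch below writes array data to a NumPy .npz container with a JSON sidecar for meta-data; the file names and field names are illustrative only:

    import json
    import numpy as np

    frames = np.random.rand(100, 256, 256)                     # e.g. 100 image frames
    meta = {"instrument": "detector-01", "pixel_size_um": 50}   # assumed meta-data

    np.savez_compressed("frames.npz", frames=frames)   # compressed binary container
    with open("frames.json", "w") as fh:
        json.dump(meta, fh, indent=2)                   # human-readable sidecar

    loaded = np.load("frames.npz")["frames"]            # reload for analysis
    assert loaded.shape == frames.shape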
  • 77. How to process
    • Analytical software: custom programs; Matlab; Perl; R; Python; SAS, SPSS; Tableau; web-based services; Map/Reduce models
    • Analytical environments: multi-core machine (48+ core systems for under $5000 USD); GPU; compute cluster; supercomputers; grid computing; cloud computing; network of workstations (NOW); “screen-saver” computing (BOINC)
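For the simplest of these environments, a single multi-core machine, the Python standard library already covers the "spread work over cores" pattern. A minimal sketch, where process_record() stands in for whatever per-record kernel the analysis needs:

    from multiprocessing import Pool

    def process_record(record):
        """Placeholder analysis kernel: sum of squares of one record."""
        return sum(x * x for x in record)

    if __name__ == "__main__":
        records = [range(i, i + 1000) for i in range(10_000)]   # synthetic input
        with Pool() as pool:                 # defaults to one worker per core
            results = pool.map(process_record, records, chunksize=100)
        print(len(results), max(results))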
  • 79. 48 cores, single system image
  • 80–81. GPU Computing: 200–800 stream processing cores per card. For $500 to $2000 (USD), up to an order of magnitude processing speedup may be possible
  • 82. Open Science Grid: www.opensciencegrid.org
  • 83. Map/Reduce
    • Unix users: cat | grep | sort | unique > file
    • Map/Reduce equivalent: input | map | shuffle | reduce > output
    • HadoopFS (HDFS): a large data set is automatically spread and replicated across the local storage resources (disks) of each node in a cluster
    • Map: creates a job for each data block in the input; maps the computational kernel to each job; schedules jobs to nodes with the required data block; each job produces a set of key/value pairs as its result
    • Reduce: collects results from the Map stage based on keys (Combine); aggregates values to produce the task (final) result
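To make the map, shuffle, reduce pattern concrete, here is a toy in-process word count in Python; on a real cluster the same mapper/reducer logic would run over HDFS blocks (for example via Hadoop Streaming), but everything here runs locally for clarity:

    from collections import defaultdict

    def mapper(block):
        # emit (key, value) pairs for one input block
        for word in block.split():
            yield word.lower(), 1

    def shuffle(pairs):
        # group values by key, as the framework does between map and reduce
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reducer(key, values):
        return key, sum(values)

    blocks = ["Big Data tools", "big data techniques", "large data sets"]
    mapped = [pair for block in blocks for pair in mapper(block)]
    counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
    print(counts)   # e.g. {'big': 2, 'data': 3, 'tools': 1, ...}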
  • 84. Extensions
    • Pig and Hive: pig.apache.org, hive.apache.org; simplify writing Map/Reduce programs for Hadoop; SQL-like query language for data sets available on HDFS
    • Cloudera: www.cloudera.com; packaged distribution of Hadoop + extensions; education + training material
    • Amazon Elastic Map Reduce: aws.amazon.com/elasticmapreduce; Amazon “cloud-based” hosting of Hadoop for Map/Reduce, using EC2 for compute and S3 for storage
  • 85. Organization, Searching, and Meta-Data
    • Few “software” solutions for this problem: iRODS provides some of this; the Unix “locate” database; SAN solutions may include indexing software and provide tools for searching
    • Establish protocols, document, communicate: directory hierarchy; file naming; persisted working space; scratch/temporary space
    • Filesystem functionality: many file systems have per-file meta-data controls to add arbitrary key/value pairs
    • Augmented web-based view: the cern_meta Apache module provides key/value pairs in the HTTP HEAD; ability to assert an arbitrary web organization on top of the filesystem organization, with searching and graphical views
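As an example of the per-file key/value meta-data mentioned above, Linux filesystems that support extended attributes expose them through os.setxattr/os.getxattr (Python 3.3+); the attribute names and values below are illustrative only:

    import os, pathlib

    path = "dataset_0001.dat"                        # illustrative data file
    pathlib.Path(path).touch()

    os.setxattr(path, "user.instrument", b"beamline-24ID")   # assumed key/value pairs
    os.setxattr(path, "user.collected_on", b"2011-09-15")

    for name in os.listxattr(path):                  # enumerate attached meta-data
        print(name, os.getxattr(path, name).decode())

Note that extended attributes travel with the file only if the copy tool preserves them (for example rsync -X), which is one reason sidecar files and external catalogues remain popular.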
  • 86. iRODS: www.irods.org
    • file-like paradigm for data management
    • addition of meta-data; can integrate database resources
    • provides rich access policy management
    • automated workflows based on data actions (add, remove, modify)
    • automated replication
    • built-in provenance
    • information life-cycle management
  • 87. Search: Apache Lucene (lucene.apache.org): Java-based; full text querying and searching; indexing; Solr provides a web interface
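A minimal sketch of querying a Lucene index through Solr's HTTP interface; it assumes a Solr instance on localhost:8983 with an already-populated core named "docs", and the field names are illustrative:

    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({"q": "title:crystallography", "wt": "json", "rows": 5})
    url = f"http://localhost:8983/solr/docs/select?{params}"

    with urllib.request.urlopen(url) as resp:   # standard Solr select handler
        results = json.load(resp)

    for doc in results["response"]["docs"]:
        print(doc.get("id"), doc.get("title"))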
  • 88–91. Meta-Data: Semantic MediaWiki
    • You know Wikipedia; it is built using MediaWiki
    • Semantic MediaWiki adds Semantic Web features:
      • flexible key/value schemas
      • user-defined and changeable object classes
      • built-in knowledge of dates → timelines
      • built-in knowledge of locations → maps
      • built-in handling of images → picture galleries
  • 92–95. Access Control
    • Need a strong Identity Management environment: individuals (identity tokens and identifiers); groups (membership lists); Active Directory/CIFS (Windows), Open Directory (Apple), and FreeIPA (Unix) are all LDAP-based
    • Need to manage and communicate Access Control policies: institutionally driven; user driven
    • Need an Authorization System: Policy Enforcement Point (shell login, data access, web access, start application); Policy Decision Point (stores policies and understands the relationship between identity token and policy)
  • 96. Case Study: SBGrid (www.sbgrid.org)
    • computing expertise for protein structure and function research: software; training; technical support; storage; cluster and grid computing
    • 150 member labs in the consortium; about 1000 total researchers
    • structure imaging and model building: imaging techniques are data intensive; model determination techniques are compute intensive
  • 97. SBGrid Science Portal architecture diagram: the portal at Harvard Medical School (Apache, GridSite, Django, WebDAV, Sage Math, R-Studio, GACL, FreeIPA, LDAP, Condor, Globus, VDT, SRM, GridFTP, glideinWMS, VOMS, GUMS, Ganglia, Nagios, RSV, Gratia accounting, file server, SQL DB, shell CLI, cluster monitoring) connected to GlobusOnline, the Open Science Grid, MyProxy @NCSA/UIUC, DOEGrids CA @Lawrence Berkeley Labs, GridFTP + glideinWMS factory and 1 PB Hadoop data store @UC San Diego, and resources @Argonne, @FermiLab, and @Indiana
  • 98. Data Model: data tiers
    • VO-wide: all sites, admin managed, very stable
    • User project: all sites, user managed, 1–10 weeks, 1–3 GB
    • User static: all sites, user managed, indefinite, 10 MB
    • Job set: all sites, infrastructure managed, 1–10 days, 0.1–1 GB
    • Job: direct to worker node, infrastructure managed, 1 day, <10 MB
    • Job indirect: to worker node via UCSD, infrastructure managed, 1 day, <10 GB
  • 99. Data Management: quota; du scan; tmpwatch; conventions; workflow integration. Data Movement: scp (users); rsync (VO-wide); grid-ftp (UCSD); curl (WNs); cp (NFS); htcp (secure web)
  • 100–108. Data movement workflow (red: push files; green: pull files)
    1. user file upload
    2. replicate gold standard
    3. auto-replicate
    4. pull files from UCSD to WNs
    5. pull files from local NFS to WNs
    6. pull files from SBGrid to WNs
    7. job results copied back to SBGrid
    8a. large job results copied to UCSD; 8b. later pulled to SBGrid
  • 116–122. Copy, Move, Backup
    • Large data sets are difficult to copy, move, replicate, and backup
    • Tools and protocols are required, with management: sys admin (technical knowledge); archivist/curator (domain knowledge)
    • Common structure: Tier 1, a single master copy of the data (live), possibly with offline tape backup; Tier 2, multiple reliable T-1 replicas serving a specific community; Tier 3, temporary “working set” T-2 replicas of required data
    • Tools: GridFTP; Storage Resource Broker (SRB); GlobusOnline
  • 123. Globus Online: high performance, reliable, 3rd-party file transfer. http://www.globusonline.org. Endpoints shown: portal, cluster, data collection facility, lab file server, desktop, laptop
  • 125. Summary
    • Data can empower rather than overwhelm you, but this requires thought and planning
    • Understand your data sources
    • Understand your data consumers
    • Educate yourself on available tools and technology
    • Design your data management system suitably
  • 126–127. Acknowledgements & Questions
    • Piotr Sliz: Principal Investigator, head of SBGrid
    • SBGrid System Administrators: Ian Levesque, Peter Doherty
    • Globus Online Team: Steve Tuecke, Ian Foster, Rachana Ananthakrishnan, Raj Kettimuthu
    • Terrence Martin: system administrator at UCSD, for assistance and encouragement using the 1 PB Hadoop storage array
    • Brian Bockelman: physics faculty at the University of Nebraska
    • Steve Timm: system administrator at FermiLab
    • Ruth Pordes: Director of OSG, for championing SBGrid
    • Please contact me with any questions: Ian Stokes-Rees, ijstokes@hkl.hms.harvard.edu, ijstokes@spmetric.com
    • Look at our work: portal.sbgrid.org, www.sbgrid.org, www.opensciencegrid.org