Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Adapting
federated

cyberinfrastructure
for

shared
data
collection

facilities
in
structural
biology

GlobusWorld
2012

Ian
Stokes-‐Rees
ijstokes@seas.harvard.edu
@ijstokes
Harvard
Medical
School

J.
Synchrotron
Rad.
(2012)
19
doi:10.1107/S0909049512009776

Online
6
April
2012

Structural
Biology:
Study
of
Protein
Structure
and
Function

400m
1mm

10nm
• Shared
scientiRic
data
collection
facility
• Data
intensive
(10-‐100
GB/day)
Grid Overview - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu

Cryo
Electron
Microscopy

• Previously,
1-10,000
images,
managed
by
hand
• Now,
robotic
systems
collect
millions
of
hi-res
images
• estimate
250,000
CPU-hours
to
reconstruct
model
• 20-50
TB
of
data
in
most
extreme
case

Molecular
Dynamics
Simulations
1
fs
time
step
1ns
snapshot
1
us
simulation
1e6
steps
1000
frames
10
MB
/
frame
10
GB
/
sim
20
CPU-years
3
months
(wall-
clock)


SBGrid
Consortium Cornell U.
Washington U. School of Med.
R. Cerione NE-CAT
T. Ellenberger
B. Crane R. Oswald
D. Fremont
S. Ealick C. Parrish
Rosalind Franklin NIH M. Jin H. Sondermann
D. Harrison M. Mayer
A. Ke UMass Medical
U. Washington
T. Gonen
U. Maryland W. Royer
E. Toth
Brandeis U.
UC Davis N. Grigorieff
H. Stahlberg Tufts U.
K. Heldwein
UCSF Columbia U.
JJ Miranda Q. Fan
Y. Cheng
Rockefeller U.
Stanford R. MacKinnon
A. Brunger Yale U.
K. Garcia T. Boggon K. Reinisch
T. Jardetzky D. Braddock J. Schlessinger
Y. Ha F. Sigworth
CalTech E. Lolis F. Zhou
P. Bjorkman Harvard and Affiliates
W. Clemons N. Beglova A. Leschziner
G. Jensen Rice University S. Blacklow K. Miller
D. Rees E. Nikonowicz B. Chen A. Rao
Y. Shamoo Vanderbilt J. Chou T. Rapoport
Y.J. Tao Center for Structural Biology J. Clardy M. Samso
WesternU
W. Chazin C. Sanders M. Eck P. Sliz
M. Swairjo
B. Eichman B. Spiller B. Furie T. Springer
M. Egli M. Stone R. Gaudet G. Verdine
UCSD B. Lacy M. Waterman M. Grant G. Wagner
T. Nakagawa M. Ohi S.C. Harrison L. Walensky
H. Viadiu Thomas Jefferson J. Hogle S.Walker
J. Williams D. Jeruzalmi T.Walz
D. Kahne J. Wang
Not Pictured:
University of Toronto: L. Howell, E. Pai, F. Sicheri; NHRI (Taiwan): G. Liou; Trinity College, Dublin: Amir Khan T. Kirchhausen S. Wong

Boston
Life
Sciences
Hub

• Biomedical
researchers
• Government
agencies
• Life
sciences Tufts

• Universities
Universit
y
School
of
Medicin
e

• Hospitals


Globus
Online:
High
Performance

Reliable
3rd
Party
File
Transfer
GUMS
DN
to
user
mapping CertiRicate
Authority
VOMS root
of
trust
VO
membership

portal

cluster

Globus
Online
Rile
transfer
service

data collection
lab file
facility
server

desktop laptop

User
can
directly
access
lab

or
facility
data
from
laptop
"Sco?"+from
Harvard general+public
Local
accounts
within the+Sliz+lab Public
access
available
to

lab
infrastructure ~sue
archived
data
through
web

~andy interface
~sco?
Sliz+lab

NEBCAT+beamline+at+APS
WWW
data
collec(on

Shared
(lab
level)

accounts
at
facility
/data/sliz /public/2009/
Embargo
policy
to

/stage/sliz /data/murphy /embarg/2010/ hold
deposited
data

/stage/murphy
for
agreed
time
Tiered
storage /data/deacon /embarg/2011/

6+month 10+TB+per+group+ 10+PB
staging+storage permanent+archive public+archive
Tier'1 Tier'2 Tier'3

VO
management
VOMS
Globus'Online SBGrid'Science'Portal

Architecture
✦ SBGrid
• manages
all
user
account
creation
and
credential
mgmt
• hosts
MyProxy,
VOMS,
GridFTP,
and
user
interfaces

✦ Facility
• knows
about
lab
groups
• e.g.
“Harrison”,
“Sliz”
• delegates
knowledge
of
group
membership
to
SBGrid
VOMS
• facility
can
poll
VOMS
for
list
of
current
members
• uses
X.509
for
user
identiRication
• deploys
GridFTP
server

✦ Lab
group
• designates
group
manager
that
adds/removes
individuals
• deploys
GridFTP
server
or
Globus
Connect
client

✦ Individual
• username/password
to
access
facility
and
lab
storage
• Globus
Connect
for
personal
GridFTP
server
to
laptop
• Globus
Online
web
interface
to
“drive”
transfers

Objective

✦ Easy
to
use
high
performance
data
mgmt

environment
✦ Fast
and
reliable
Rile
transfer
• facility-‐to-‐lab,
facility-‐to-‐individual,
lab-‐to-‐individual
✦ Reduced
administrative
overhead
✦ Better
data
curation
✦ Release
data
to
public
after
embargo
period

Challenges
✦ Access
control
• visibility
• policies
✦ Provenance
• data
origin
• history
✦ Meta-‐data
• attributes
• searching

Capabilities

✦ Server-‐to-‐server
interaction
✦ Federated
aggregation
of
CPU
power
✦ Predictable
data
patterns
✦ Command-‐line
access
from
specialized
servers
✦ Web-‐based
access
through
“Science
Gateways”
• pre-‐deRined
computing
workRlows
• e.g.
SBGrid
Science
Portal

Weaknesses

✦ Federated
user
identity
system
• X.509
digital
certiRicates
and
associated
apparatus
✦ Ease
of
use
for
end
users
✦ Data
management
tools
✦ Support
for
collaborations

Opportunities

✦ “Last
mile”
challenge
• to
the
desktop
• to
the
laptop
✦ UniRied
identity
management
• centralized
set
of
credentials
for
each
person
✦ Empower
collaborations
to
self-‐manage
✦ Shift
of
focus
from
“compute”
to
“data”
• for
users
• for
facilities
where
data
is
the
main
challenge

X.509
Digital
CertiRicates
✦ Analogy
to
a
passport:
• Application
form
• Sponsor’s
attestation
• Consular
services
• veriRication
of
application,
sponsor,
and
accompanying

identiRication
and
eligibility
documents
• Passport
issuing
ofRice
✦ Portable,
digital
passport
• Rixed
and
secure
user
identiRiers
• name,
email,
home
institution
• signed
by
widely
trusted
issuer
• time
limited
• ISO
standard

U1 U1 U1

Addressing
CertiRicate
Problems
/.' -.'.)"*&' 012*%2!' 3%"!'
)"*"!4&"'
,"!&'5":'14(!'
!"#$"%&'%()*"+',"!&'
U1
!"&$!*'&!4,5(*)'*$67"!''

*289:'4)"*&%'
!";("<'!"#$"%&'
R1
time

;"!(9:'$%"!'"=()(7(=(&:'

,2*>!6'"=()(7(=(&:'
S1
411!2;"',"!&'
R2
%()*',"!&'
*289:'4;4(=47(=(&:'

!"&!(";"',"!&'
U2a
"?12!&'%()*"+'
,"!&'5":'14(!'

VO
(Group)
Membership

Registration
!")*# !"#$%&'(# *+,(-,.# /-0.#
+.0-0(:#50.:#:,#.0?>0-:#&0&90.-<'+#93#@A#

.0?>0-:#!"#8.,>+-#4(%#.,70-# U2b
(,123#4%&'(#
V1
;0.'23#>-0.#07'8'9'7':3#

5,(6.&#07'8'9'7':3#
S2
time

4++.,;0#&0&90.-<'+=#
8.,>+-=#4(%#.,70-#
4%%#@A# V2
:,#!")*#
(,123#

.0?>0-:#!")*#$B#
.0:>.(#!")*#$B# 4%%#$B#:,#
+.,C3#50.:#

()' AB)!' !"#$%&' *+",-"#' .-/#'

;<' #/>:/-$'+"#$%&'%66":,$'
=3!#"I3' ;<=*' )@81,' 0/#176%?",'/8%1&'-/,$'
/8%1&'0/#17/@' U1
4/,/#%$/' !"#$%&'(%)*+%
#/>:/-$'-14,/@'6/#$' 6/#$'9/3'+%1#' ',,#""%+#0%
1*$/'2%
#/$:#,'$#%691,4',:85/#'
,"?23'%4/,$-'
0/#123'/&14151&1$3'
A1a
6#/%$/' 6",7#8'/&14151&1$3'
&"6%&'%66$'
S1*
%++#"0/'6/#$'
time

-14,'6/#$'
,"?23'%0%1&%51&1$3' A1b
-/$'#/$#1/0%&'-/#1%&',:85/#'
-/$';<'#14C$-'
%66":,$'#/%@3',"?76%?",'
+"#$%&'&"41,'
U2*
#/>:/-$'-14,/@'6/#?76%$/'
#/$:#,'-14,/@'6/#?76%$/' D'E'+%1#'-14,/@'6/#$'
1,$"'!F(*GDH'7&/'
#/41-$/#'+#"I3'6/#$'
H'E'6#/%$/'&"6%&'
J1$C'=3!#"I3' !"#$%&'(%)*+%
+#"I3'6/#$'
',,#""%-#.#$'/#.%
$#"*!$,#"%

Process
and
Design
Improvements
✦ Single
web-‐form
application
• includes
e-‐mail
veriRicationn
✦ Centralized
and
connected
credential
management
• FreeIPA
LDAP
-‐
user
directory
and
credential
store
• VOMS
-‐
lab,
institution,
and
collaboration
afRiliations
• MyProxy
-‐
X.509
credential
store
✦ Overlap
administrative
roles
• system
admin
• registration
agent
for
certiRicate
authority
(approve
X.509

request)
• VO
administrator
to
register
group
afRiliations
✦ Automation

“Last
Mile”
and
Ease
of
Use

Grid
computing
at
your
desk
✦ Platforms
• Windows
• OS
X
✦ Web
interfaces
• SBGrid
Science
Portal
✦ One-‐stop-‐shop
for
account
management
• Centralized
services
• Automated
processes
• Administrator
intervention
• Support

Ryan,
a
postdoc
in
the

Frank
Lab
at
Columbia

Access
NRAMM
facilities

securely
and
transfer
data

back
to
home
institute
automated
X.509
check
SBGrid
for

application group

Ryan’s

membership /data/columbia/frank
facility file
server transfer
data
to
lab
veriRication
in
Frank
Lab,
so

of

grant
access
to
Riles
lab
membership

SBGrid Ryan
initiate
tfransfer
at

applies
or
an
request
access
Science account
at
the
SBGrid

NRAMM to
NRAMM
Portal Science
Portal facility
using
credential

notify
user
of

held
by
SBGrid lab file
desktop
completion server
automated
/nfs/data/rsmith
Globus
Online

application
use
Globus
Online

to
manage
transfer
from
/Users/Ryan
laptop
NRAMM
back
to
lab

Service
Architecture
GlobusOnline UC San Diego
@Argonne GUMS
User GUMS
GridFTP + glideinWMS
data Hadoop factory Open Science Grid

computations
MyProxy
@NCSA, UIUC
monitoring interfaces data computation ID mgmt
Ganglia scp Condor FreeIPA
Apache DOEGrids CA
Nagios GridFTP Cycle Server @Lawrence
GridSite LDAP
RSV SRM VDT Berkley Labs
Django VOMS
Globus
pacct WebDAV
Sage Math GUMS
glideinWMS Gratia Acct'ing
R-Studio GACL @FermiLab
file SQL
shell CLI server DB cluster
Monitoring
SBGrid Science Portal @ Harvard Medical School @Indiana

Acknowledgements
✦ Piotr
Sliz
✦ SBGrid
Science
Portal
• Daniel
O’Donovan,
Meghan
Porter-‐Mahoney
✦ SBGrid
System
Administrators
• Ian
Levesque,
Peter
Doherty,
Steve
Jahl
✦ Facility
Collaborators
• Frank
Murphy
(NE-‐CAT/APS)
• Ashley
Deacon
(JCSG/SLAC)
✦ Globus
Online
Team
• Steve
Tueke,
Ian
Foster,
Rachana

Ananthakrishnan,
Raj
Kettimuthu

✦ Ruth
Pordes
• Director
of
OSG,
for
championing
SBGrid

Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Recommended

Recommended

More Related Content

More from Boston Consulting Group

More from Boston Consulting Group (8)

Recently uploaded

Recently uploaded (20)

Adapting federated cyberinfrastructure for shared data collection facilities in structural biology