Accelerating data-driven discovery
by outsourcing the mundane


Ian Foster


                                 www.ci.anl.gov
                                 www.ci.uchicago.edu
The data deluge




The data deluge in biology




                                x10 in 6 years


                         x10^5 in 6 years



Number of sequencing machines




                            http://omicsmaps.com/
Moore’s Law for X-ray sources




12 orders of magnitude in 6 decades
18 orders of magnitude in 5 decades!

Credit: Linda Young
Exploding data volumes in astronomy


      MACHO et al.: 1 TB
      Palomar: 3 TB
      2MASS: 10 TB
      GALEX: 30 TB
      Sloan: 40 TB
      Pan-STARRS: 40,000 TB
      100,000 TB
Exploding data volumes in climate science
                     2004: 36 TB
                     2012: 2,300 TB




Climate model intercomparison project (CMIP) of the IPCC
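As a rough worked example of the growth implied by the two CMIP figures above (36 TB in 2004, 2,300 TB in 2012), the sketch below computes an average annual growth factor and a doubling time. It assumes smooth exponential growth between the two data points, which the slide itself does not claim, and the function names are illustrative only.

```python
import math


def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Average multiplicative growth per year, assuming exponential growth."""
    return (end / start) ** (1.0 / years)


def doubling_time_years(start: float, end: float, years: float) -> float:
    """Years needed to double, under the same exponential-growth assumption."""
    return years * math.log(2) / math.log(end / start)


# CMIP archive sizes quoted on the slide: 36 TB (2004) and 2,300 TB (2012).
factor = annual_growth_factor(36, 2300, 2012 - 2004)
doubling = doubling_time_years(36, 2300, 2012 - 2004)
print(f"average growth per year: x{factor:.2f}")
print(f"doubling time: {doubling * 12:.0f} months")
```

Under that assumption the archive grows by roughly 1.7x per year, doubling about every 16 months.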
Big science has been successful


OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010
LIGO: 1 PB data in last science run, distributed worldwide
ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs

• Robust production solutions
• Substantial teams and expense
• Sustained, multi-year effort
• Application-specific solutions, built on common technology

All build on NSF OCI (& DOE)-supported Globus Toolkit software
Small science is struggling




More data, more complex data
Ad-hoc solutions
Inadequate software, hardware
Data plan mandates
Dark data in the long tail of science
[Chart: awarded amount in 2007, $0 to $7,000,000 (y-axis), across awards 1 to 8776 (x-axis)]
     NSF grant awards, 2007 (Bryan Heidorn)
The challenge of staying competitive
"Well, in our country," said Alice …
 "you'd generally get to somewhere
 else — if you run very fast for a
 long time, as we've been doing."

"A slow sort of country!" said the
 Queen. "Now, here, you see, it
 takes all the running you can do, to
 keep in the same place. If you want
 to get somewhere else, you must run
 at least twice as fast as that!"
A crisis that demands new approaches
•    We have exceptional infrastructure for the 1%
     (e.g., supercomputers, Large Hadron Collider, …)
•    But not for the 99% (e.g., the vast majority of
     the 1.8M publicly funded researchers in the EU)

     We need new approaches to providing
     research cyberinfrastructure that:
     — Reduce barriers to entry
     — Are cheaper
     — Are sustainable
You can run a company from a coffee shop




Because businesses outsource their IT
Software as a Service (SaaS):
     Web presence
     Email (hosted Exchange)
     Calendar
     Telephony (hosted VOIP)
     Human resources and payroll
     Accounting
     Customer relationship mgmt
And often their large-scale computing too
Software as a Service (SaaS):
     Web presence
     Email (hosted Exchange)
     Calendar
     Telephony (hosted VOIP)
     Human resources and payroll
     Accounting
     Customer relationship mgmt
Infrastructure as a Service (IaaS):
     Data analytics
     Content distribution
Let’s rethink how we provide research IT

Accelerate discovery and innovation worldwide
by providing research IT as a service
Leverage the cloud to
• provide millions of researchers with
   unprecedented access to powerful tools;
• enable a massive shortening of cycle times in
   time-consuming research processes; and
• reduce research IT costs dramatically via
   economies of scale
grail.cs.washington.edu
Cloud layers


          Software as a Service: SaaS


          Platform as a Service: PaaS



        Infrastructure as a Service: IaaS



Common research data management steps
     •   Dark Energy Survey   •   SBGrid structural biology consortium
     •   Galaxy genomics      •   NCAR climate data applications
     •   LIGO observatory     •   Land use change; economics




Scientific data delivery, 2012 1980
•    “[A] majority of users at BES facilities … physically transport data
     to a home institution using portable media … data volumes are
     going to increase significantly in the next few years (to 70 TB/day
     or more) – data must be transferred over the network”
•    “the effectiveness of data transfer middleware [is] not just on the
     transfer speed, but also the time and interruption to other work
     required to supervise and check on the success of large data
     transfers”
•    “It took two weeks and email traffic between network specialists
     at NERSC and ORNL, sys-admins at NERSC, … and combustion staff
     at ORNL and SNL to move 10 TB from NERSC to ORNL”
     Major usability, productivity, performance problems
                                [ESNet Network Requirements Workshops, 2007-2010]
The challenge: Moving big data easily
What should be trivial …
     “I need my data over there – at my _____”
     (supercomputing center, campus server, etc.)
     Data Source → Data Destination

… can be painfully tedious and time-consuming
     “GAAAH!%&@#&”
     Data Source → Data Destination
     ! Config issues
     ! Firewall issues
     ! Unexpected failure = manual retry
Globus Online: Data transfer as SaaS
• Reliable file transfer.
      –   Easy “fire-and-forget” transfers
      –   Automatic fault recovery
      –   High performance
      –   Across multiple security domains
• No IT required.
      – Software as a Service (SaaS)
            • No client software installation
            • New features automatically available
      – Consolidated support & troubleshooting
      – Works with existing GridFTP servers
      – Globus Connect solves “last mile problem”
• >4000 registered users, >3 Petabytes moved
Recommended by XSEDE, NERSC, Blue Waters, and many campuses
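The "fire-and-forget" and automatic fault recovery behaviour listed above can be illustrated with a generic retry loop. This is a minimal sketch of the general pattern only, not Globus Online's implementation: `attempt_transfer` is a hypothetical callable standing in for one transfer attempt, and the retry limit and exponential backoff schedule are assumptions made for illustration.

```python
import time
from typing import Callable


def transfer_with_retries(
    attempt_transfer: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run attempt_transfer until it succeeds; return the attempts used.

    Network-style failures (OSError and subclasses) trigger a retry after
    an exponentially growing delay. The last failure is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_transfer()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Demo with a simulated transfer that fails twice, then succeeds.
# Delays are recorded instead of slept so the demo runs instantly.
_calls = {"n": 0}


def _flaky_transfer() -> None:
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise ConnectionError("simulated dropped connection")


_delays: list = []
_used = transfer_with_retries(_flaky_transfer, sleep=_delays.append)
print(f"succeeded after {_used} attempts; backoff delays: {_delays}")
```

The point of the pattern is that the person who submitted the transfer never sees the intermediate failures; only an exhausted retry budget needs human attention.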
Dark Energy Survey use of Globus Online
•        Dark Energy Survey
         receives 100,000 files
         each night in Illinois
•        They transmit files to
         Texas for analysis …
         then move results back
         to Illinois
•        Process must be reliable,
         routine, and efficient
•        They outsource this task
         to Globus Online

Blanco 4m on Cerro Tololo
Image credit: Roger Smith/NOAO/AURA/NSF
Integration with Earth System Grid




High-speed transfers
Automated retries
Works behind firewalls
Credential management
Transfer monitoring
Globus Online under the covers

• User Hub manages user identities and profiles
• Group Hub manages groups and policies
• Resource Hub for resource definitions
Globus Online under the covers

Monitoring and control
• Auto-tuning of transfer parameters
• Detection & attempted correction of errors
• Manual intervention when required

• User Hub manages user identities and profiles
• Group Hub manages groups and policies
• Resource Hub for resource definitions
Globus Online under the covers

Monitoring and control
• Auto-tuning of transfer parameters
• Detection & attempted correction of errors
• Manual intervention when required

• User Hub manages user identities and profiles
• Group Hub manages groups and policies
• Resource Hub for resource definitions

Reliable cloud-based infrastructure
• EC2 for transfer management
• S3 for system state
• SimpleDB for lock management
• Replication across availability zones
Towards “research IT as a service”
     •   Dark Energy Survey   •   SBGrid structural biology consortium
     •   Galaxy genomics      •   NCAR climate data applications
     •   LIGO observatory     •   Land use change; economics




Towards “research IT as a service”
      Research data management as a service
      SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, ...
      PaaS: Globus Integrate platform
Globus Storage: For when you want to …
•        Place your data where
         you want
•        Access it from anywhere
         via different protocols
•        Update it, version it,
         and take snapshots
•        Share versions with who
         you want
•        Synchronize among
         locations

[Diagram labels: GridFTP, HTTP, WebDAV; Globus Storage volume; commercial storage service provider; national research center; campus computing center]
Globus Collaborate: For when you want to
Join with a few or
many people to:
• Share documents
• Track tasks
• Send email
• Share data
• Do whatever
With:
• Common groups
• Delegated mgmt
Globus Integrate: For when you want to
Write programs that access/manage user
identities, profiles, groups, resources, and data …
… via REST APIs and command line programs

     Globus Transfer
     • In production use
     • Service and Web UI enhancements continue

     Globus Storage
     • Early release available in March
     • Generally available in Q3

     Globus Collaborate
     • Initial projects starting in March
     • Early release sometime in Q3

     Globus Integrate
     • Transfer API available
     • User profile, group APIs in alpha
     • APIs for Storage, Collaborate planned after app release

     Globus Connect Multi User
     Globus Connect
Other innovative science SaaS projects




Realizing the benefits of cloud services
•    Understand what services researchers really
     need
•    Acquire and sustain the expertise required to
     create and operate useful services
•    Incentivize those who produce services that are
     widely adopted
•    Provide excellent network connectivity



On the importance of networks


     “80 percent of
      success is
      showing up”




Time required to move 10 Terabytes
[Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06). Annotations: 1 month, Cinvestav Langebio; 2 hours, US R1 Universities; 10 mins, Upgrade.]
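The relationship plotted above is simple arithmetic, sketched below. The sketch assumes decimal terabytes (10^12 bytes), a link that is fully used for the whole transfer, and no protocol overhead; the three example speeds are chosen for illustration and are not read off the chart.

```python
def transfer_hours(terabytes: float, megabits_per_sec: float) -> float:
    """Hours to move `terabytes` at a sustained rate of `megabits_per_sec`."""
    bits = terabytes * 1e12 * 8                  # decimal TB -> bytes -> bits
    seconds = bits / (megabits_per_sec * 1e6)    # Mb/s -> bits per second
    return seconds / 3600.0


# Illustrative link speeds (assumptions, not values taken from the chart).
for label, mbps in [("30 Mb/s", 30), ("10 Gb/s", 10_000), ("100 Gb/s", 100_000)]:
    hours = transfer_hours(10, mbps)
    print(f"{label:>9}: {hours:9.2f} hours ({hours / 24:6.2f} days)")
```

Under these idealized assumptions, 10 TB takes about 31 days at 30 Mb/s, about 2.2 hours at 10 Gb/s, and about 13 minutes at 100 Gb/s, which appears consistent with the "1 month", "2 hours", and "10 mins" annotations. Real transfers are usually slower than this bound.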
A 21st C research cyberinfrastructure
•    To provide
     more capability for
     more people at less cost …
•    Create cloud-based services
      – Robust and universal
      – Economies of scale
      – Positive returns to scale
•    Via the creative use of
      – Aggregation (“cloud”)
      – Federation (“grid”)
•    Powered by networks

[Diagram labels: Small and medium laboratories and projects (L, P); Research data management; Collaboration, computation; Research administration]
Questions for you
•    How much “dark data” exists in your field? How
     important is that data?
•    Can you quantify the scale, in your field, of
     – Wasted resources due to duplicated effort
     – Delays in research progress due to inadequate
       infrastructure?
•    If you could do one thing to accelerate adoption
     of advanced computing within your field, what
     would it be?

Acknowledgments
Colleagues at UChicago and Argonne
     Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu
     Malik, Rachana Ananthakrishnan, Raj Kettimuthu,
     and others listed at
     www.globusonline.org/about/goteam/

NSF Office of Cyberinfrastructure
DOE Office of Advanced Scientific Computing Res.
National Institutes of Health

For more information
Attend GlobusWorld in Chicago, April 10-12, 2012
• www.globusonline.org
• Twitter: @globusonline, Globus Online on Facebook
• Foster, I. Globus Online: Accelerating and
  democratizing science through cloud-based services.
  IEEE Internet Computing (May/June):70-73, 2011.
• Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G.,
  Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and
  Tuecke, S. Software as a Service for Data
  Scientists. Communications of the ACM, Feb, 2012.
Thank you!
foster@uchicago.edu
foster@anl.gov

www.globusonline.org
Twitter: @globusonline, @ianfoster

Mexico talk foster march 2012

  • 1.
    Accelerating data-driven discovery byoutsourcing the mundane Ian Foster www.ci.anl.gov www.ci.uchicago.edu
  • 2.
    The data deluge www.ci.anl.gov www.ci.uchicago.edu
  • 3.
    The data delugein biology x10 in 6 years x105 in 6 years www.ci.anl.gov 3 www.ci.uchicago.edu
  • 4.
    Number of sequencingmachines http://omicsmaps.com/ www.ci.anl.gov 4 www.ci.uchicago.edu
  • 5.
    Moore’s Law forX-ray sources 18 orders of magnitude 12 orders of in 5 decades! magnitude in 6 decades www.ci.anl.gov 5 Credit: Linda Young www.ci.uchicago.edu
  • 6.
    Exploding data volumesin astronomy MACHO et al.: 1 TB Palomar: 3 TB 2MASS: 10 TB GALEX: 30 TB 100,000 TB Sloan: 40 TB Pan-STARRS: 40,000 TB www.ci.anl.gov 6 www.ci.uchicago.edu
  • 7.
    Exploding data volumesin climate science 2004: 36 TB 2012: 2,300 TB Climate model intercomparison project (CMIP) of the IPCC www.ci.anl.gov 7 www.ci.uchicago.edu
  • 8.
    Big science hasbeen successful OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010 LIGO: 1 PB data in last science run, distributed worldwide Robust production solutions Substantial teams and expense Sustained, multi-year effort Application-specific solutions, built on common technology ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs 8 All build on NSF OCI (& DOE)-supported Globus Toolkit software www.ci.anl.gov www.ci.uchicago.edu
  • 9.
    Small science isstruggling More data, more complex data Ad-hoc solutions Inadequate software, hardware Data plan mandates www.ci.anl.gov 9 www.ci.uchicago.edu
  • 10.
    Dark data inthe long tail of science Awarded Amount 2007 $7,000,000 $6,000,000 $5,000,000 $4,000,000 $3,000,000 $2,000,000 $1,000,000 $0 1 586 1171 1756 2341 2926 3511 4096 4681 5266 5851 6436 7021 7606 8191 8776 NSF grant awards, 2007 (Bryan Heidorn) www.ci.anl.gov 10 www.ci.uchicago.edu
  • 11.
    The challenge ofstaying competitive "Well, in our country," said Alice … "you'd generally get to somewhere else — if you run very fast for a long time, as we've been doing.” "A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!" www.ci.anl.gov 11 www.ci.uchicago.edu
  • 12.
    A crisis thatdemands new approaches • We have exceptional infrastructure for the 1% (e.g., supercomputers, Large Hadron Collider, …) • But not for the 99% (e.g., the vast majority of the 1.8M publicly funded researchers in the EU) We need new approaches to providing research cyberinfrastructure, that: — Reduce barriers to entry — Are cheaper — Are sustainable www.ci.anl.gov 12 www.ci.uchicago.edu
  • 13.
    You can runa company from a coffee shop www.ci.anl.gov 13 www.ci.uchicago.edu
  • 14.
    Because businesses outsourcetheir IT Web presence Email (hosted Exchange) Calendar Software Telephony (hosted VOIP) as a Service Human resources and payroll (SaaS) Accounting Customer relationship mgmt www.ci.anl.gov 14 www.ci.uchicago.edu
  • 15.
    And often theirlarge-scale computing too Web presence Email (hosted Exchange) Calendar Software Telephony (hosted VOIP) as a Service Human resources and payroll (SaaS) Accounting Customer relationship mgmt Infrastructure Data analytics as a Service Content distribution (IaaS) www.ci.anl.gov 15 www.ci.uchicago.edu
  • 16.
    Let’s rethink howwe provide research IT Accelerate discovery and innovation worldwide by providing research IT as a service Leverage the cloud to • provide millions of researchers with unprecedented access to powerful tools; • enable a massive shortening of cycle times in time-consuming research processes; and • reduce research IT costs dramatically via economies of scale www.ci.anl.gov 16 www.ci.uchicago.edu
  • 17.
    grail.cs.washington.edu 17 www.ci.anl.gov www.ci.uchicago.edu
  • 18.
    Cloud layers Software as a Service: SaaS Platform as a Service: PaaS Infrastructure as a Service: IaaS www.ci.anl.gov 18 18 www.ci.uchicago.edu
  • 19.
    Common research datamanagement steps • Dark Energy Survey • SBGrid structural biology consortium • Galaxy genomics • NCAR climate data applications • LIGO observatory • Land use change; economics www.ci.anl.gov 19 www.ci.uchicago.edu
  • 20.
    Common research datamanagement steps • Dark Energy Survey • SBGrid structural biology consortium • Galaxy genomics • NCAR climate data applications • LIGO observatory • Land use change; economics www.ci.anl.gov 20 www.ci.uchicago.edu
  • 21.
    Scientific data delivery,2012 1980 • “*A+ majority of users at BES facilities … physically transport data to a home institution using portable media … data volumes are going to increase significantly in the next few years (to 70 TB/day or more) – data must be transferred over the network” • “the effectiveness of data transfer middleware [is] not just on the transfer speed, but also the time and interruption to other work required to supervise and check on the success of large data transfers” • “It took two weeks and email traffic between network specialists at NERSC and ORNL, sys-admins at NERSC, … and combustion staff at ORNL and SNL to move 10 TB from NERSC to ORNL” Major usability, productivity, performance problems [ESNet Network Requirements Workshops, 2007-2010] www.ci.anl.gov 21 www.ci.uchicago.edu
  • 22.
    The challenge: Movingbig data easily What should be trivial … “I need my data over there Data Data – at my _____” ( Source Destination supercomputing center, campus server, etc.) … can be painfully tedious and time-consuming “GAAAH !%&@#& ” ! Config issues Data Data ! Firewall issues Source Destination ! Unexpected failure = manual retry www.ci.anl.gov 22 www.ci.uchicago.edu
  • 23.
  • 24.
    Globus Online: Datatransfer as SaaS • Reliable file transfer. – Easy “fire-and-forget” transfers – Automatic fault recovery – High performance – Across multiple security domains • No IT required. – Software as a Service (SaaS) • No client software installation • New features automatically available – Consolidated support & troubleshooting – Works with existing GridFTP servers – Globus Connect solves “last mile problem” • >4000 registered users, >3 Petabytes moved Recommended by XSEDE, NERSC, Blue Waters, and many campuses www.ci.anl.gov 24 www.ci.uchicago.edu
  • 25.
    Dark Energy Survey use of Globus Online
    • Dark Energy Survey receives 100,000 files each night in Illinois
    • They transmit files to Texas for analysis … then move results back to Illinois
    • Process must be reliable, routine, and efficient
    • They outsource this task to Globus Online
    [Image: Blanco 4m on Cerro Tololo. Image credit: Roger Smith/NOAO/AURA/NSF]
  • 28.
    Integration with Earth System Grid
    • High-speed transfers
    • Automated retries
    • Works behind firewalls
    • Credential management
    • Transfer monitoring
  • 29.
    Globus Online under the covers
    • User Hub manages user identities and profiles
    • Group Hub manages groups and policies
    • Resource Hub for resource definitions
  • 30.
    Globus Online under the covers
    • User Hub manages user identities and profiles
    • Group Hub manages groups and policies
    • Resource Hub for resource definitions
    Monitoring and control
    • Auto-tuning of transfer parameters
    • Detection & attempted correction of errors
    • Manual intervention when required
  • 31.
    Globus Online under the covers
    • User Hub manages user identities and profiles
    • Group Hub manages groups and policies
    • Resource Hub for resource definitions
    Monitoring and control
    • Auto-tuning of transfer parameters
    • Detection & attempted correction of errors
    • Manual intervention when required
    Reliable cloud-based infrastructure
    • EC2 for transfer management
    • S3 for system state
    • SimpleDB for lock management
    • Replication across availability zones
  • 33.
    Towards “research IT as a service”
    • Dark Energy Survey
    • SBGrid structural biology consortium
    • Galaxy genomics
    • NCAR climate data applications
    • LIGO observatory
    • Land use change; economics
  • 34.
    Towards “research IT as a service”
    Research data management as a service
    • SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog, ...
    • PaaS: Globus Integrate platform
  • 35.
    Globus Storage: For when you want to …
    • Place your data where you want
    • Access it from anywhere via different protocols (GridFTP, HTTP, WebDAV)
    • Update it, version it, and take snapshots
    • Share versions with who you want
    • Synchronize among locations
    [Diagram: a Globus Storage volume backed by a commercial storage service provider and by campus and national research/computing centers]
  • 36.
    Globus Collaborate: For when you want to
    Join with a few or many people to:
    • Share documents
    • Track tasks
    • Send email
    • Share data
    • Do whatever
    With:
    • Common groups
    • Delegated mgmt
  • 37.
    Globus Integrate: For when you want to
    Write programs that access/manage user identities, profiles, groups, resources, and data … via REST APIs and command line programs
    Globus Transfer
    • In production use
    • Service and Web UI enhancements continue
    Globus Storage
    • Early release available in March
    • Generally available in Q3
    Globus Collaborate
    • Initial projects starting in March
    • Early release sometime in Q3
    Globus Integrate
    • Transfer API available
    • User profile, group APIs in alpha
    • APIs for Storage, Collaborate planned after app release
    Globus Connect Multi User; Globus Connect
  • 38.
    Other innovative science SaaS projects
  • 42.
    Realizing the benefits of cloud services
    • Understand what services researchers really need
    • Acquire and sustain the expertise required to create and operate useful services
    • Incentivize those who produce services that are widely adopted
    • Provide excellent network connectivity
  • 43.
    On the importance of networks
    “80 percent of success is showing up”
  • 44.
    Time required to move 10 Terabytes
    [Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06)]
  • 45.
    Time required to move 10 Terabytes
    [Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06)]
    • 2 hours: US R1 Universities
  • 46.
    Time required to move 10 Terabytes
    [Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06)]
    • 2 hours: US R1 Universities
    • 10 mins: Upgrade
  • 47.
    Time required to move 10 Terabytes
    [Chart: hours to transfer 10 Terabytes (log scale, 0.01 to 10,000) versus network speed in Megabits/sec (log scale, 1.E+01 to 1.E+06)]
    • 1 month: Cinvestav Langebio
    • 2 hours: US R1 Universities
    • 10 mins: Upgrade
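The curve on this chart follows from simple arithmetic, sketched below. The sketch assumes decimal terabytes (10^12 bytes) and an ideal link with no protocol overhead or failures; the function name `hours_to_transfer` is illustrative and not from the slides.

```python
TEN_TB_BITS = 10 * 10**12 * 8  # 10 TB expressed in bits (decimal terabytes assumed)


def hours_to_transfer(bits, megabits_per_sec):
    """Ideal transfer time in hours at a constant rate, with no overhead."""
    seconds = bits / (megabits_per_sec * 10**6)
    return seconds / 3600


# Walk the decade ticks of the chart's x-axis, 1.E+01 to 1.E+06 Megabits/sec.
for mbps in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
    hours = hours_to_transfer(TEN_TB_BITS, mbps)
    print(f"{mbps:>9,} Mbit/s -> {hours:10.2f} hours")
```

Read against this calculation, the chart's annotations are consistent with roughly 10 Gbit/s for the "2 hours" point, roughly 100 Gbit/s for the "10 mins" point, and roughly 30 Mbit/s for the "1 month" point. These speeds are inferred from the arithmetic, not stated on the slides.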
  • 48.
    A 21st-century research cyberinfrastructure
    • To provide more capability for more people at less cost …
    • Create cloud-based services
      – Robust and universal
      – Economies of scale
      – Positive returns to scale
    • Via the creative use of
      – Aggregation (“cloud”)
      – Federation (“grid”)
    • Powered by networks
    [Diagram: small and medium laboratories and projects (L, P) drawing on services for research data management; collaboration, computation; research administration]
  • 49.
    Questions for you
    • How much “dark data” exists in your field? How important is that data?
    • Can you quantify the scale, in your field, of
      – Wasted resources due to duplicated effort
      – Delays in research progress due to inadequate infrastructure?
    • If you could do one thing to accelerate adoption of advanced computing within your field, what would it be?
  • 50.
    Acknowledgments
    • Colleagues at UChicago and Argonne: Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu Malik, Rachana Ananthakrishnan, Raj Kettimuthu, and others listed at www.globusonline.org/about/goteam/
    • NSF Office of Cyberinfrastructure
    • DOE Office of Advanced Scientific Computing Research
    • National Institutes of Health
  • 51.
    For more information
    Attend GlobusWorld in Chicago, April 10-12, 2012
    • www.globusonline.org
    • Twitter: @globusonline, Globus Online on Facebook
    • Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
    • Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Software as a Service for Data Scientists. Communications of the ACM, Feb, 2012.

Editor's Notes

  • #2 Cyberinfrastructure: The distributed computer, information, and communication technologies [that] empower the modern scientific research endeavor [Atkins report]
  • #4 Gap of >1000 – AND many more systems as people jump on bandwagon. Meanwhile, other resources [money, people] stay flat. Crisis. 10^5 in 6 years; 10 in 6 years.
  • #5 http://omicsmaps.com/
  • #10 PI and a handful of students and staff
  • #11 80% of awards and 50% of grant $$ are < $350K
  • #12 Lewis Carroll. End-to-end crisis.
  • #13 The answer cannot simply be more money. We lack both $$ and the people to spend $$ on.
  • #17 Not (particularly) computing as a service, but the IT functions that researchers need to function. Include collaboration as a service.
  • #19 Infrastructure will be provided by many – competitive – race to the bottom. Interesting questions are: What is the platform? And what is the software?
  • #20 Sequencing: at center X, move data to Y, analyze, load into Short Read Archive (?), share, …
  • #21 Sequencing: at center X, move data to Y, analyze, load into Short Read Archive (?), share, …
  • #22 But when we get to work, we go back in time 20 years
  • #30 User Hub: Profiles, Identities. Group Hub: Definitions, Policies. Resource Hub: Definitions, History.
  • #31 User Hub: Profiles, Identities. Group Hub: Definitions, Policies. Resource Hub: Definitions, History.
  • #32 User Hub: Profiles, Identities. Group Hub: Definitions, Policies. Resource Hub: Definitions, History.
  • #33 User Hub: Profiles, Identities. Group Hub: Definitions, Policies. Resource Hub: Definitions, History.
  • #44 With a high-speed network, one can show up. Not just in person, but also computationally.