SlideShare a Scribd company logo
1 of 39
Download to read offline
Data-Intensive Science
with High Performance
Computing leveraging
Presented to
Fifth Annual
University of Massachusetts
and New England Area
Librarian
e-Science Symposium
Afternoon Panel
John W. Cobb, Ph.D.
Physicist
Computer Science & Mathematics Division
Shrewsbury, Massachusetts
April 3, 2013
Acknowledgements


•  DataONE project (PI Michener, U. New Mexico)
•  Oak Ridge National Laboratory and the Oak Ridge
   Leadership Computing Facility
•  Cornell Lab of Ornithology eBrid project and S. Kelling,
   D. Fink, K. Webb, T. Damalou, (Cornell)
•  Collaborators: M. Jones (UCSB) C. Tenopir (UTK), S.
   Allard (UTK), B. Wilson (ORNL/UTK), D. Vieglais
   (Kansas)



2   Managed by UT-Battelle
    for the U.S. Department of Energy
DataONE Community




3   Managed by UT-Battelle
    for the U.S. Department of Energy
Outline
•  Data Begets Science                   challenges
•  The data lifecycle – the             •  DataONE project
   workflow of data driven              •  Dryad
   science
                                        •  Role of libraries as data
•  Data at Scale                           repositories
•  HPC at Scale                         •  DMPTool
•  Pathfinder exemplar: eBird •  Open data movement
   occurrence maps
•  Data management



4   Managed by UT-Battelle
    for the U.S. Department of Energy
Data Gives Birth to Scientific Revolutions


•  Kepler’s laws were divined by
   careful examination of Brahe’s
   recorded observations
•  Leeuwenhoek’s founding of
   microbiology was triggered by
   observations with newly
   developed microscope.




5   Managed by UT-Battelle
    for the U.S. Department of Energy
The data lifecycle: the workflow of
science

The conduct of science is                                       Refined DataONE
collaborative and multi-                                        internal view
disciplinary                                       Collect


                                        Analyze               Assure



                              Integrate                          Describe




                                        Discover              Deposit

                                                   Preserve
6   Managed by UT-Battelle
    for the U.S. Department of Energy
User Matrix (DataONE)
    Different team
    members care




                                                                 Management
    about different


                                                  Investigator




                                                                                           Database
                                                                              Practices




                                                                                                                 Curricula
                                                                 Planning
    things




                                                                                                      Training
                                        Service



                                                  ToolKit




                                                                                          Tools
                                        Data




                                                                 Data




                                                                              Best
    Scientist


    Data Librarians


    Ecological
    Modeler

    Resource
    Manager
7   Managed by UT-Battelle
    for the U.S. Department of Energy
Can we share data along the data
lifecyle?
                                                                         Tenopir, C, Allard S, Douglass K,
                                                    Baseline             Aydinoglu AU, Wu L, Read E, Manoff M,

•  Demographics                                   assessment:            Frame M. 2011. Data Sharing by
                                                                         Scientists: Practices and Perceptions.
                                                                         PLoS ONE. 6(6)
                                                scientists (2010)
     Discipline	
                                              Work	
  Sector	
  
          medicine              other 7%             social sciences                    non-profit 3%    other
            2%                                            16%              commercial                     2%
                                                                              2%
                                                              computer
                                                               science/     government
     ecology                                                 engineering       13%
      18%                                                         9%


                                                                physical
                                                                                            academic
                                                                sciences
     biology                                                                                  80%
                                                                  12%
      14%

       atmospheric                                   environmental
       science 4%                                    sciences 18%
8   Managed by UT-Battelle
    for the U.S. Department of Energy
                                        n=1317	
                                            n=1315	
  
Many are interested in sharing data




9   Managed by UT-Battelle
    for the U.S. Department of Energy   Percent agree
What standard do you currently use?


                                                                                 676




                                                                         266


                                              95     95     96    97
              12                 21      26

             DIF              DwC        DC   EML   FGDC   Open   ISO   My Lab   none
                                                           GIS
                                              Metadata language
10   Managed by UT-Battelle
     for the U.S. Department of Energy
Answer: Yes!


But: There is a gap between desire and practice.


This indicates an opportunity to improve practice and
improve science outcomes



“The spirit is willing but the flesh is weak”



11   Managed by UT-Battelle
     for the U.S. Department of Energy
How big is big data?


•  Possible answers:
      –  the largest of all datasets ever created (>10 PB)
      –  The largest of all datasets ever created in each discipline
      –  larger than we are comfortable managing
      –  larger than what we dealt with last week/year/decade




12   Managed by UT-Battelle
     for the U.S. Department of Energy
How big is big data?


•  Possible answers:
      –  the largest of all datasets ever created (>10 PB)
      –  The largest of all datasets ever created in each discipline
      –  larger than we are comfortable managing
      –  larger than what we dealt with last week/year/decade
•  But larger question: what is the measure of data size?




13   Managed by UT-Battelle
     for the U.S. Department of Energy
Data Ecosystem


                                                                                             	
  
                                                                                             	
  
                                                                                          Science	
  
                                                                                         Leverage	
  

                                                                                       Seman2cs	
  

                                                                                      Workflow	
  
                                                                                     Provenance	
  
                                                                                                                                                                                   L&IS
                                                                Metadata	
  Management	
  


HPC and Data                                                                          Replica2on	
  
Center
                                         Physical	
  Storage	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  I/O	
  Rate	
  
14   Managed by UT-Battelle
     for the U.S. Department of Energy
                                                                                                                                                                              14
Where are the opportunities?
•  Integrating storage management
   and information management           “Building	
  the	
  Knowledge	
  Pyramid”	
  
•  Integrating data from different data          90:10 à 10:90	
  
   activities
                                    Increasing Process Knowledge
      Decreasing Spatial Coverage




                                                                                            Intensive science sites
                                                                                            and experiments


                                                                                                Extensive science sites


                                                                                                     Volunteer &
                                                                                                     education
                                                                                                     networks
                                                                                                          Remote
                                                                                                          sensing

15   Managed by UT-Battelle
                                                                   Adapted from CENR-OSTP
     for the U.S. Department of Energy
HPC at scale – example Titan at OLCF


•  Physical plant challenges:
      –  Size: 40,000 sq-ft (2 floors)
      –  Power: 10’s of MW
      –  Cooling: dual loops chilled water
      –  Raised floor high-load capacity
         (36” , 250 lbs/sq-ft)




16   Managed by UT-Battelle
     for the U.S. Department of Energy
HPC at scale – example Titan at OLCF


•  Named Titan
•  27 Petaflops, 710 TB memory
•  Spider storage > 10 PB, 250 GB/s
•  8972 GPU-enabled nodes (Kepler) in
   200 cabinets
•  Each node contains: One AMD 16-
   core intelagos CPU, one Nvidia K20x
   Kepler, 32 GB memory
•  Note: NVIDIA offers K20x for
   desktop
17   Managed by UT-Battelle
     for the U.S. Department of Energy
Data and the Long Tail of Science

•  As data gets larger, the data
   tail is now quantifiable: flocks
   of black swans
•  Extraordinary events are
   often the most interesting
      –  “500 year storms”
      –  Best candidate materials
         (second place is first loser)
      –  Very non-uniform utility
         functions.
•  Conclusion: applying large
   data analysis can create new
 18breakthroughs
      Managed by UT-Battelle
      for the U.S. Department of Energy
eBird pilot project
exploration and visualization

                                         Diverse	
  bird	
  observa2ons	
  and	
                   Model	
  results	
  
                                         environmental	
  data	
  from	
  
                                         300,00	
  loca2ons	
  in	
  the	
  US	
      Occurrence	
  of	
  Indigo	
  Bun=ng	
  (2008)	
  
                                         integrated	
  and	
  analyzed	
  using	
  
                                         High	
  Performance	
  Compu2ng	
  
                                         Resources	
  


     Land	
  Cover	
  


                                                                                         Jan	
     Apr	
     Jun	
     Sep	
     Dec	
  

     Meteorology	
  
                                                                                        •  Examine	
  paLerns	
  of	
  
                                                                                           migra2on	
  	
  
     MODIS	
  –	
                        Spa2o-­‐Temporal	
  Exploratory	
              •  Infer	
  how	
  climate	
  
     Remote	
                            Model	
  iden2fies	
  factors	
                    change	
  may	
  affect	
  
                                         affec2ng	
  paLerns	
  of	
  migra2on	
  
     sensing	
  data	
                                                                     bird	
  migra2on	
  

19   Managed by UT-Battelle
     for the U.S. Department of Energy
Secretary Salazar on Birds (May 3, 2011):
      “The State of the Birds report is a measurable indicator of how well we are
     fulfilling our shared role as stewards of our nation’s public lands and waters.”

                                                       Acadian Flycatcher Distribution – eBird.org




20   Managed by UT-Battelle
     for the U.S. Department of Energy
HPC centers and data management


•  Often HPC focused – cycles ( and storage)
•  Data and information management may be a foreign
   culture
•  HPC can enable extreme scalability: “What would you
   do if you had unlimited computing/storage/badnwidth?”
•  Bottlenecks:
      –  Data management issues
      –  Metadata creation and harmonization
      –  Data preservation
      –  Items not scaling with Moore’s law: metadata, human effort
21   Managed by UT-Battelle
     for the U.S. Department of Energy
Data deluge and interoperability
 “the flood of increasingly heterogeneous data”

      •  Data are heterogeneous
              –  Syntax
                    •  (format)
              –  Schema
                    •  (model)
              –  Semantics
                    •  (meaning)




 By hand is time-
 consuming and brittle
                                                  Jones et al. 2007


22   Managed by UT-Battelle
     for the U.S. Department of Energy
Myriad Metadata Standards


      For
      instance:
      Metadata
      Crosswalks




23   Managed by UT-Battelle
     Credit: Department of Energy
     for the U.S.
                  Jenn Riley Indiana University Digital Library Program 2012   23
Poor data practice
                                         “data entropy”
                                          Time of publication               In what sense is
                                                                            modern science
                                               Specific details             reproducible?

                                                    General details


                                                                      Retirement or
              Information Content




                                                                      career change

                                    Accident

                                                                                   Death



                                                           Time             (Michener et al. 1997)



24   Managed by UT-Battelle
     for the U.S. Department of Energy
DataONE project (movie with sound)




                                         http://vimeo.com/36383735
25   Managed by UT-Battelle
     for the U.S. Department of Energy
DataONE Component Interdependency

      Scientists:                                     Member Nodes:
       Receive: Access to more data sources            Receive: Additional
       and tools                                       users, replication,
       Provide: Scientific progress and                communities of best
       acknowledgment                                  practice, appreciation
                                                       Provide: Access to
      DataONE:                                         data collections,
       Receives: MN and                                service interfaces
       scientist appreciation,
       access to MN data
       Provides: “Glue”
       services                          Funders:
       to enable interoperability,        Receive: More efficient science
       communities of best                output, chances for breakthrough
       practice, standard                 advances
       interfaces                         Provide: Resources to facilitate
26   Managed by UT-Battelle
     for the U.S. Department of Energy    science
                                                                    26
Current Operational Member Nodes

       •  Released production CI 10 months ago
       •  Today: 13 production Member Nodes
       •  300,000 Data objects represented




27     •  Near-term 15 more candidates
     Managed by UT-Battelle
     for the U.S. Department of Energy
                                                 27
The Investigator Toolkit


                                                               Inves=gator	
  Toolkit	
  
                                  Web	
  Interface	
       Analysis,	
  Visualiza2on	
              Data	
  Management	
  
                                                                  Client	
  Libraries	
  
                                          Java	
                       Python	
                      Command	
  Line	
  

                                         Member	
  Nodes	
                                  Coordina2ng	
  Nodes	
  


      •  Developer, end-user tools
      •  Creation, search, retrieval,
         management
      •  Plugins, extensions for
         analysis tools                                                                        Kepler


28   Managed by UT-Battelle
     for the U.S. Department of Energy
Identify objects


Goal: Uniquely identify data or metadata objects
•  Support the several identifier types widely used
•  Identifiers assigned by Member Nodes
•  Uniqueness ensured by Coordinating Nodes
•  Resolution through Coordinating Nodes




     GUID! LSID                          PURL
29   {3F2504E0-4…
     Managed by UT-Battelle
     for the U.S. Department of Energy
Provide Credit for Data Publication




        •     Data citation standards and courtesy customs
        •     Needs to metrics – how often cited
        •     Socio-cultural change: include data citations in promotion and tenure
        •     DataONE needs to nurture Member Node needs not work against them
30   Managed by UT-Battelle
     for the U.S. Department of Energy
Identify people: federated identity


      •  Identity provider selected by
         the user
      •  Member nodes define access
         rules
      •  Rules propagated by
         Coordinating Nodes
      •  Identity and access control
         consistent across entire
         infrastructure
      •  (note similarity with Globus
         Online approach)

31   Managed by UT-Battelle
     for the U.S. Department of Energy
Support for Entire Data Lifecycle


                                                          Plan	
  

                                         Analyze	
                      Collect	
  




                               Integrate	
                                       Assure	
  
                   Kepler




                                         Discover	
                    Describe	
  

                                                        Preserve	
  
32   Managed by UT-Battelle
     for the U.S. Department of Energy
Open Science Movement




33   Managed by UT-Battelle
     for the U.S. Department of Energy
Building global communities of practice:
        … creating long-lived CI enterprises,

 •           Broad, active community engagement
            –        Involvement of library and science educators
                     engaging new generations of students in best practices
            –        Existing outreach and education programs
 •           Transparent, participatory governance
 •           Adoption/creation of innovative and sustainable business and
             organizational models




34   Managed by UT-Battelle
     for the U.S. Department of Energy
Libraries and museums: value

      •  As Member Nodes:
                  –  Facilitate the teaching and research mission of institution
                  –  Build data collections for the 21st century
      •  In support of Data Librarians:
                  –  Provide access to data management plans
                  –  Provide best practices for faculty and students
                  –  Cyberinfrastructure supporting the data lifecycle




35   Managed by UT-Battelle
     for the U.S. Department of Energy
                                                                       35
Data Management Planning Tool

                                                          https://dmp.cdlib.org/




     •  Create ready-to-use data management plans for specific funding
        agencies
     •  Meet funder requirements for data management plans
     •  Get step-by-step instructions and guidance for your data management
        plan as you build it
     •  Learn about resources and services available at your institution to help
        fulfill the data management requirements of your grant
     •  Released: Oct. 2011
     •  Support for NIH requirements added 2/22/2012
     •  Other similar efforts now also underway at institutional levels or with
        other entities.




36   Managed by UT-Battelle
     for the U.S. Department of Energy
Plug: DMPTool next rev upcoming




37   Managed by UT-Battelle
     for the U.S. Department of Energy
                                         37	
  
DataONE DUG July 7-8 Chapel Hill NC




     Co-located with ESIP Federation Meeting.

38   Managed by UT-Battelle
     for the U.S. Department of Energy
                                                38	
  
Question & Discussion
                                                         John W. Cobb
                                                          865.576.5439
                                                        cobbjw@ornl.gov




39   Managed by UT-Battelle
     for the U.S. Department of Energy

More Related Content

Similar to Data-Intensive Science Leveraging High Performance Computing

Research operations at ornl
Research operations at ornlResearch operations at ornl
Research operations at ornlDIv CHAS
 
A Quick Guide For Using Microsoft OneNote As An Electronic Laboratory Notebook
A Quick Guide For Using Microsoft OneNote As An Electronic Laboratory NotebookA Quick Guide For Using Microsoft OneNote As An Electronic Laboratory Notebook
A Quick Guide For Using Microsoft OneNote As An Electronic Laboratory NotebookCourtney Esco
 
Tim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTERN Australia
 
Scientific data management (v2)
Scientific data management (v2)Scientific data management (v2)
Scientific data management (v2)Jian Qin
 
Supporting researchers in the molecular life sciences Jeff Christiansen
Supporting researchers in the molecular life sciences Jeff Christiansen Supporting researchers in the molecular life sciences Jeff Christiansen
Supporting researchers in the molecular life sciences Jeff Christiansen ARDC
 
UpSkills: Research Data Management for the Sciences
UpSkills: Research Data Management for the SciencesUpSkills: Research Data Management for the Sciences
UpSkills: Research Data Management for the Sciencesstevage
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013Anita de Waard
 
Minimal viable data reuse
Minimal viable data reuseMinimal viable data reuse
Minimal viable data reusevoginip
 
Information resources in high energy physics
Information resources in high energy physicsInformation resources in high energy physics
Information resources in high energy physicsProyecto CeVALE2
 
Stuart Phinn and Andy Lowe_TERN's national ecosystem data infrastructure is d...
Stuart Phinn and Andy Lowe_TERN's national ecosystem data infrastructure is d...Stuart Phinn and Andy Lowe_TERN's national ecosystem data infrastructure is d...
Stuart Phinn and Andy Lowe_TERN's national ecosystem data infrastructure is d...TERN Australia
 
Big Data Standards - Workshop, ExpBio, Boston, 2015
Big Data Standards - Workshop, ExpBio, Boston, 2015Big Data Standards - Workshop, ExpBio, Boston, 2015
Big Data Standards - Workshop, ExpBio, Boston, 2015Susanna-Assunta Sansone
 
SCAR Data Management and Policy
SCAR Data Management and PolicySCAR Data Management and Policy
SCAR Data Management and PolicyAnton Van de Putte
 
Understanding the Big Picture of e-Science
Understanding the Big Picture of e-ScienceUnderstanding the Big Picture of e-Science
Understanding the Big Picture of e-ScienceAndrew Sallans
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...Carole Goble
 
Public Data Archiving in Ecology and Evolution: How well are we doing?
Public Data Archiving in Ecology and Evolution: How well are we doing?Public Data Archiving in Ecology and Evolution: How well are we doing?
Public Data Archiving in Ecology and Evolution: How well are we doing?Sandra Binning
 

Similar to Data-Intensive Science Leveraging High Performance Computing (20)

Research operations at ornl
Research operations at ornlResearch operations at ornl
Research operations at ornl
 
Michener Plenary PPSR2012
Michener Plenary PPSR2012Michener Plenary PPSR2012
Michener Plenary PPSR2012
 
A Quick Guide For Using Microsoft OneNote As An Electronic Laboratory Notebook
A Quick Guide For Using Microsoft OneNote As An Electronic Laboratory NotebookA Quick Guide For Using Microsoft OneNote As An Electronic Laboratory Notebook
A Quick Guide For Using Microsoft OneNote As An Electronic Laboratory Notebook
 
Tim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasets
 
Scientific data management (v2)
Scientific data management (v2)Scientific data management (v2)
Scientific data management (v2)
 
Supporting researchers in the molecular life sciences Jeff Christiansen
Supporting researchers in the molecular life sciences Jeff Christiansen Supporting researchers in the molecular life sciences Jeff Christiansen
Supporting researchers in the molecular life sciences Jeff Christiansen
 
UpSkills: Research Data Management for the Sciences
UpSkills: Research Data Management for the SciencesUpSkills: Research Data Management for the Sciences
UpSkills: Research Data Management for the Sciences
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013
 
eScience-School-Oct2012-Campinas-Brazil
eScience-School-Oct2012-Campinas-BrazileScience-School-Oct2012-Campinas-Brazil
eScience-School-Oct2012-Campinas-Brazil
 
Data users, data producers
Data users, data producersData users, data producers
Data users, data producers
 
Minimal viable data reuse
Minimal viable data reuseMinimal viable data reuse
Minimal viable data reuse
 
Information resources in high energy physics
Information resources in high energy physicsInformation resources in high energy physics
Information resources in high energy physics
 
Stuart Phinn and Andy Lowe_TERN's national ecosystem data infrastructure is d...
Stuart Phinn and Andy Lowe_TERN's national ecosystem data infrastructure is d...Stuart Phinn and Andy Lowe_TERN's national ecosystem data infrastructure is d...
Stuart Phinn and Andy Lowe_TERN's national ecosystem data infrastructure is d...
 
Big Data Standards - Workshop, ExpBio, Boston, 2015
Big Data Standards - Workshop, ExpBio, Boston, 2015Big Data Standards - Workshop, ExpBio, Boston, 2015
Big Data Standards - Workshop, ExpBio, Boston, 2015
 
SCAR Data Management and Policy
SCAR Data Management and PolicySCAR Data Management and Policy
SCAR Data Management and Policy
 
B4OS-2012
B4OS-2012B4OS-2012
B4OS-2012
 
Understanding the Big Picture of e-Science
Understanding the Big Picture of e-ScienceUnderstanding the Big Picture of e-Science
Understanding the Big Picture of e-Science
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
 
Public Data Archiving in Ecology and Evolution: How well are we doing?
Public Data Archiving in Ecology and Evolution: How well are we doing?Public Data Archiving in Ecology and Evolution: How well are we doing?
Public Data Archiving in Ecology and Evolution: How well are we doing?
 

Data-Intensive Science Leveraging High Performance Computing

  • 1. Data-Intensive Science with High Performance Computing leveraging Presented to Fifth Annual University of Massachusetts and New England Area Librarian e-Science Symposium Afternoon Panel John W. Cobb, Ph.D. Physicist Computer Science & Mathematics Division Shrewsbury, Massachusetts April 3, 2013
  • 2. Acknowledgements •  DataONE project (PI Michener, U. New Mexico) •  Oak Ridge National Laboratory and the Oak Ridge Leadership Computing Facility •  Cornell Lab of Ornithology eBrid project and S. Kelling, D. Fink, K. Webb, T. Damalou, (Cornell) •  Collaborators: M. Jones (UCSB) C. Tenopir (UTK), S. Allard (UTK), B. Wilson (ORNL/UTK), D. Vieglais (Kansas) 2 Managed by UT-Battelle for the U.S. Department of Energy
  • 3. DataONE Community 3 Managed by UT-Battelle for the U.S. Department of Energy
  • 4. Outline •  Data Begets Science challenges •  The data lifecycle – the •  DataONE project workflow of data driven •  Dryad science •  Role of libraries as data •  Data at Scale repositories •  HPC at Scale •  DMPTool •  Pathfinder exemplar: eBird •  Open data movement occurrence maps •  Data management 4 Managed by UT-Battelle for the U.S. Department of Energy
  • 5. Data Gives Birth to Scientific Revolutions •  Kepler’s laws were divined by careful examination of Brahe’s recorded observations •  Leeuwenhoek’s founding of microbiology was triggered by observations with newly developed microscope. 5 Managed by UT-Battelle for the U.S. Department of Energy
  • 6. The data lifecycle: the workflow of science The conduct of science is Refined DataONE collaborative and multi- internal view disciplinary Collect Analyze Assure Integrate Describe Discover Deposit Preserve 6 Managed by UT-Battelle for the U.S. Department of Energy
  • 7. User Matrix (DataONE) Different team members care Management about different Investigator Database Practices Curricula Planning things Training Service ToolKit Tools Data Data Best Scientist Data Librarians Ecological Modeler Resource Manager 7 Managed by UT-Battelle for the U.S. Department of Energy
  • 8. Can we share data along the data lifecyle? Tenopir, C, Allard S, Douglass K, Baseline Aydinoglu AU, Wu L, Read E, Manoff M, •  Demographics assessment: Frame M. 2011. Data Sharing by Scientists: Practices and Perceptions. PLoS ONE. 6(6) scientists (2010) Discipline   Work  Sector   medicine other 7% social sciences non-profit 3% other 2% 16% commercial 2% 2% computer science/ government ecology engineering 13% 18% 9% physical academic sciences biology 80% 12% 14% atmospheric environmental science 4% sciences 18% 8 Managed by UT-Battelle for the U.S. Department of Energy n=1317   n=1315  
  • 9. Many are interested in sharing data 9 Managed by UT-Battelle for the U.S. Department of Energy Percent agree
  • 10. What standard do you currently use? 676 266 95 95 96 97 12 21 26 DIF DwC DC EML FGDC Open ISO My Lab none GIS Metadata language 10 Managed by UT-Battelle for the U.S. Department of Energy
  • 11. Answer: Yes! But: There is a gap between desire and practice. This indicates an opportunity to improve practice and improve science outcomes “The spirit is willing but the flesh is weak” 11 Managed by UT-Battelle for the U.S. Department of Energy
  • 12. How big is big data? •  Possible answers: –  the largest of all datasets ever created (>10 PB) –  The largest of all datasets ever created in each discipline –  larger than we are comfortable managing –  larger than what we dealt with last week/year/decade 12 Managed by UT-Battelle for the U.S. Department of Energy
  • 13. How big is big data? •  Possible answers: –  the largest of all datasets ever created (>10 PB) –  The largest of all datasets ever created in each discipline –  larger than we are comfortable managing –  larger than what we dealt with last week/year/decade •  But larger question: what is the measure of data size? 13 Managed by UT-Battelle for the U.S. Department of Energy
  • 14. Data Ecosystem     Science   Leverage   Seman2cs   Workflow   Provenance   L&IS Metadata  Management   HPC and Data Replica2on   Center Physical  Storage                                                I/O  Rate   14 Managed by UT-Battelle for the U.S. Department of Energy 14
  • 15. Where are the opportunities? •  Integrating storage management and information management “Building  the  Knowledge  Pyramid”   •  Integrating data from different data 90:10 à 10:90   activities Increasing Process Knowledge Decreasing Spatial Coverage Intensive science sites and experiments Extensive science sites Volunteer & education networks Remote sensing 15 Managed by UT-Battelle Adapted from CENR-OSTP for the U.S. Department of Energy
  • 16. HPC at scale – example Titan at OLCF •  Physical plant challenges: –  Size: 40,000 sq-ft (2 floors) –  Power: 10’s of MW –  Cooling: dual loops chilled water –  Raised floor high-load capacity (36” , 250 lbs/sq-ft) 16 Managed by UT-Battelle for the U.S. Department of Energy
  • 17. HPC at scale – example Titan at OLCF •  Named Titan •  27 Petaflops, 710 TB memory •  Spider storage > 10 PB, 250 GB/s •  8972 GPU-enabled nodes (Kepler) in 200 cabinets •  Each node contains: One AMD 16- core intelagos CPU, one Nvidia K20x Kepler, 32 GB memory •  Note: NVIDIA offers K20x for desktop 17 Managed by UT-Battelle for the U.S. Department of Energy
  • 18. Data and the Long Tail of Science •  As data gets larger, the data tail is now quantifiable: flocks of black swans •  Extraordinary events are often the most interesting –  “500 year storms” –  Best candidate materials (second place is first loser) –  Very non-uniform utility functions. •  Conclusion: applying large data analysis can create new 18breakthroughs Managed by UT-Battelle for the U.S. Department of Energy
  • 19. eBird pilot project exploration and visualization Diverse  bird  observa2ons  and   Model  results   environmental  data  from   300,00  loca2ons  in  the  US   Occurrence  of  Indigo  Bun=ng  (2008)   integrated  and  analyzed  using   High  Performance  Compu2ng   Resources   Land  Cover   Jan   Apr   Jun   Sep   Dec   Meteorology   •  Examine  paLerns  of   migra2on     MODIS  –   Spa2o-­‐Temporal  Exploratory   •  Infer  how  climate   Remote   Model  iden2fies  factors   change  may  affect   affec2ng  paLerns  of  migra2on   sensing  data   bird  migra2on   19 Managed by UT-Battelle for the U.S. Department of Energy
  • 20. Secretary Salazar on Birds (May 3, 2011): “The State of the Birds report is a measurable indicator of how well we are fulfilling our shared role as stewards of our nation’s public lands and waters.” Acadian Flycatcher Distribution – eBird.org 20 Managed by UT-Battelle for the U.S. Department of Energy
  • 21. HPC centers and data management •  Often HPC focused – cycles ( and storage) •  Data and information management may be a foreign culture •  HPC can enable extreme scalability: “What would you do if you had unlimited computing/storage/badnwidth?” •  Bottlenecks: –  Data management issues –  Metadata creation and harmonization –  Data preservation –  Items not scaling with Moore’s law: metadata, human effort 21 Managed by UT-Battelle for the U.S. Department of Energy
  • 22. Data deluge and interoperability “the flood of increasingly heterogeneous data” •  Data are heterogeneous –  Syntax •  (format) –  Schema •  (model) –  Semantics •  (meaning) By hand is time- consuming and brittle Jones et al. 2007 22 Managed by UT-Battelle for the U.S. Department of Energy
  • 23. Myriad Metadata Standards For instance: Metadata Crosswalks 23 Managed by UT-Battelle Credit: Department of Energy for the U.S. Jenn Riley Indiana University Digital Library Program 2012 23
  • 24. Poor data practice “data entropy” Time of publication In what sense is modern science Specific details reproducible? General details Retirement or Information Content career change Accident Death Time (Michener et al. 1997) 24 Managed by UT-Battelle for the U.S. Department of Energy
  • 25. DataONE project (movie with sound) http://vimeo.com/36383735 25 Managed by UT-Battelle for the U.S. Department of Energy
  • 26. DataONE Component Interdependency Scientists: Member Nodes: Receive: Access to more data sources Receive: Additional and tools users, replication, Provide: Scientific progress and communities of best acknowledgment practice, appreciation Provide: Access to DataONE: data collections, Receives: MN and service interfaces scientist appreciation, access to MN data Provides: “Glue” services Funders: to enable interoperability, Receive: More efficient science communities of best output, chances for breakthrough practice, standard advances interfaces Provide: Resources to facilitate 26 Managed by UT-Battelle for the U.S. Department of Energy science 26
  • 27. Current Operational Member Nodes •  Released production CI 10 months ago •  Today: 13 production Member Nodes •  300,000 Data objects represented 27 •  Near-term 15 more candidates Managed by UT-Battelle for the U.S. Department of Energy 27
  • 28. The Investigator Toolkit Inves=gator  Toolkit   Web  Interface   Analysis,  Visualiza2on   Data  Management   Client  Libraries   Java   Python   Command  Line   Member  Nodes   Coordina2ng  Nodes   •  Developer, end-user tools •  Creation, search, retrieval, management •  Plugins, extensions for analysis tools Kepler 28 Managed by UT-Battelle for the U.S. Department of Energy
  • 29. Identify objects Goal: Uniquely identify data or metadata objects •  Support the several identifier types widely used •  Identifiers assigned by Member Nodes •  Uniqueness ensured by Coordinating Nodes •  Resolution through Coordinating Nodes GUID! LSID PURL 29 {3F2504E0-4… Managed by UT-Battelle for the U.S. Department of Energy
  • 30. Provide Credit for Data Publication •  Data citation standards and courtesy customs •  Needs to metrics – how often cited •  Socio-cultural change: include data citations in promotion and tenure •  DataONE needs to nurture Member Node needs not work against them 30 Managed by UT-Battelle for the U.S. Department of Energy
  • 31. Identify people: federated identity •  Identity provider selected by the user •  Member nodes define access rules •  Rules propagated by Coordinating Nodes •  Identity and access control consistent across entire infrastructure •  (note similarity with Globus Online approach) 31 Managed by UT-Battelle for the U.S. Department of Energy
  • 32. Support for Entire Data Lifecycle Plan   Analyze   Collect   Integrate   Assure   Kepler Discover   Describe   Preserve   32 Managed by UT-Battelle for the U.S. Department of Energy
  • 33. Open Science Movement 33 Managed by UT-Battelle for the U.S. Department of Energy
  • 34. Building global communities of practice: … creating long-lived CI enterprises, •  Broad, active community engagement –  Involvement of library and science educators engaging new generations of students in best practices –  Existing outreach and education programs •  Transparent, participatory governance •  Adoption/creation of innovative and sustainable business and organizational models 34 Managed by UT-Battelle for the U.S. Department of Energy
  • 35. Libraries and museums: value •  As Member Nodes: –  Facilitate the teaching and research mission of institution –  Build data collections for the 21st century •  In support of Data Librarians: –  Provide access to data management plans –  Provide best practices for faculty and students –  Cyberinfrastructure supporting the data lifecycle 35 Managed by UT-Battelle for the U.S. Department of Energy 35
  • 36. Data Management Planning Tool https://dmp.cdlib.org/ •  Create ready-to-use data management plans for specific funding agencies •  Meet funder requirements for data management plans •  Get step-by-step instructions and guidance for your data management plan as you build it •  Learn about resources and services available at your institution to help fulfill the data management requirements of your grant •  Released: Oct. 2011 •  Support for NIH requirements added 2/22/2012 •  Other similar efforts now also underway at institutional levels or with other entities. 36 Managed by UT-Battelle for the U.S. Department of Energy
  • 37. Plug: DMPTool next rev upcoming 37 Managed by UT-Battelle for the U.S. Department of Energy 37  
  • 38. DataONE DUG July 7-8 Chapel Hill NC Co-located with ESIP Federation Meeting. 38 Managed by UT-Battelle for the U.S. Department of Energy 38  
  • 39. Question & Discussion John W. Cobb 865.576.5439 cobbjw@ornl.gov 39 Managed by UT-Battelle for the U.S. Department of Energy