Cobb u mass_neal_e_science_v06

202 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
202
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Cobb u mass_neal_e_science_v06

  1. 1. Data-Intensive Sciencewith High PerformanceComputing leveragingPresented toFifth AnnualUniversity of Massachusettsand New England AreaLibrariane-Science SymposiumAfternoon PanelJohn W. Cobb, Ph.D.PhysicistComputer Science & Mathematics DivisionShrewsbury, MassachusettsApril 3, 2013
  2. 2. Acknowledgements•  DataONE project (PI Michener, U. New Mexico)•  Oak Ridge National Laboratory and the Oak Ridge Leadership Computing Facility•  Cornell Lab of Ornithology eBrid project and S. Kelling, D. Fink, K. Webb, T. Damalou, (Cornell)•  Collaborators: M. Jones (UCSB) C. Tenopir (UTK), S. Allard (UTK), B. Wilson (ORNL/UTK), D. Vieglais (Kansas)2 Managed by UT-Battelle for the U.S. Department of Energy
  3. 3. DataONE Community3 Managed by UT-Battelle for the U.S. Department of Energy
  4. 4. Outline•  Data Begets Science challenges•  The data lifecycle – the •  DataONE project workflow of data driven •  Dryad science •  Role of libraries as data•  Data at Scale repositories•  HPC at Scale •  DMPTool•  Pathfinder exemplar: eBird •  Open data movement occurrence maps•  Data management4 Managed by UT-Battelle for the U.S. Department of Energy
  5. 5. Data Gives Birth to Scientific Revolutions•  Kepler’s laws were divined by careful examination of Brahe’s recorded observations•  Leeuwenhoek’s founding of microbiology was triggered by observations with newly developed microscope.5 Managed by UT-Battelle for the U.S. Department of Energy
  6. 6. The data lifecycle: the workflow ofscienceThe conduct of science is Refined DataONEcollaborative and multi- internal viewdisciplinary Collect Analyze Assure Integrate Describe Discover Deposit Preserve6 Managed by UT-Battelle for the U.S. Department of Energy
  7. 7. User Matrix (DataONE) Different team members care Management about different Investigator Database Practices Curricula Planning things Training Service ToolKit Tools Data Data Best Scientist Data Librarians Ecological Modeler Resource Manager7 Managed by UT-Battelle for the U.S. Department of Energy
  8. 8. Can we share data along the datalifecyle? Tenopir, C, Allard S, Douglass K, Baseline Aydinoglu AU, Wu L, Read E, Manoff M,•  Demographics assessment: Frame M. 2011. Data Sharing by Scientists: Practices and Perceptions. PLoS ONE. 6(6) scientists (2010) Discipline   Work  Sector   medicine other 7% social sciences non-profit 3% other 2% 16% commercial 2% 2% computer science/ government ecology engineering 13% 18% 9% physical academic sciences biology 80% 12% 14% atmospheric environmental science 4% sciences 18%8 Managed by UT-Battelle for the U.S. Department of Energy n=1317   n=1315  
  9. 9. Many are interested in sharing data9 Managed by UT-Battelle for the U.S. Department of Energy Percent agree
  10. 10. What standard do you currently use? 676 266 95 95 96 97 12 21 26 DIF DwC DC EML FGDC Open ISO My Lab none GIS Metadata language10 Managed by UT-Battelle for the U.S. Department of Energy
  11. 11. Answer: Yes!But: There is a gap between desire and practice.This indicates an opportunity to improve practice andimprove science outcomes“The spirit is willing but the flesh is weak”11 Managed by UT-Battelle for the U.S. Department of Energy
  12. 12. How big is big data?•  Possible answers: –  the largest of all datasets ever created (>10 PB) –  The largest of all datasets ever created in each discipline –  larger than we are comfortable managing –  larger than what we dealt with last week/year/decade12 Managed by UT-Battelle for the U.S. Department of Energy
  13. 13. How big is big data?•  Possible answers: –  the largest of all datasets ever created (>10 PB) –  The largest of all datasets ever created in each discipline –  larger than we are comfortable managing –  larger than what we dealt with last week/year/decade•  But larger question: what is the measure of data size?13 Managed by UT-Battelle for the U.S. Department of Energy
  14. 14. Data Ecosystem     Science   Leverage   Seman2cs   Workflow   Provenance   L&IS Metadata  Management  HPC and Data Replica2on  Center Physical  Storage                                                I/O  Rate  14 Managed by UT-Battelle for the U.S. Department of Energy 14
  15. 15. Where are the opportunities?•  Integrating storage management and information management “Building  the  Knowledge  Pyramid”  •  Integrating data from different data 90:10 à 10:90   activities Increasing Process Knowledge Decreasing Spatial Coverage Intensive science sites and experiments Extensive science sites Volunteer & education networks Remote sensing15 Managed by UT-Battelle Adapted from CENR-OSTP for the U.S. Department of Energy
  16. 16. HPC at scale – example Titan at OLCF•  Physical plant challenges: –  Size: 40,000 sq-ft (2 floors) –  Power: 10’s of MW –  Cooling: dual loops chilled water –  Raised floor high-load capacity (36” , 250 lbs/sq-ft)16 Managed by UT-Battelle for the U.S. Department of Energy
  17. 17. HPC at scale – example Titan at OLCF•  Named Titan•  27 Petaflops, 710 TB memory•  Spider storage > 10 PB, 250 GB/s•  8972 GPU-enabled nodes (Kepler) in 200 cabinets•  Each node contains: One AMD 16- core intelagos CPU, one Nvidia K20x Kepler, 32 GB memory•  Note: NVIDIA offers K20x for desktop17 Managed by UT-Battelle for the U.S. Department of Energy
  18. 18. Data and the Long Tail of Science•  As data gets larger, the data tail is now quantifiable: flocks of black swans•  Extraordinary events are often the most interesting –  “500 year storms” –  Best candidate materials (second place is first loser) –  Very non-uniform utility functions.•  Conclusion: applying large data analysis can create new 18breakthroughs Managed by UT-Battelle for the U.S. Department of Energy
  19. 19. eBird pilot projectexploration and visualization Diverse  bird  observa2ons  and   Model  results   environmental  data  from   300,00  loca2ons  in  the  US   Occurrence  of  Indigo  Bun=ng  (2008)   integrated  and  analyzed  using   High  Performance  Compu2ng   Resources   Land  Cover   Jan   Apr   Jun   Sep   Dec   Meteorology   •  Examine  paLerns  of   migra2on     MODIS  –   Spa2o-­‐Temporal  Exploratory   •  Infer  how  climate   Remote   Model  iden2fies  factors   change  may  affect   affec2ng  paLerns  of  migra2on   sensing  data   bird  migra2on  19 Managed by UT-Battelle for the U.S. Department of Energy
  20. 20. Secretary Salazar on Birds (May 3, 2011): “The State of the Birds report is a measurable indicator of how well we are fulfilling our shared role as stewards of our nation’s public lands and waters.” Acadian Flycatcher Distribution – eBird.org20 Managed by UT-Battelle for the U.S. Department of Energy
  21. 21. HPC centers and data management•  Often HPC focused – cycles ( and storage)•  Data and information management may be a foreign culture•  HPC can enable extreme scalability: “What would you do if you had unlimited computing/storage/badnwidth?”•  Bottlenecks: –  Data management issues –  Metadata creation and harmonization –  Data preservation –  Items not scaling with Moore’s law: metadata, human effort21 Managed by UT-Battelle for the U.S. Department of Energy
  22. 22. Data deluge and interoperability “the flood of increasingly heterogeneous data” •  Data are heterogeneous –  Syntax •  (format) –  Schema •  (model) –  Semantics •  (meaning) By hand is time- consuming and brittle Jones et al. 200722 Managed by UT-Battelle for the U.S. Department of Energy
  23. 23. Myriad Metadata Standards For instance: Metadata Crosswalks23 Managed by UT-Battelle Credit: Department of Energy for the U.S. Jenn Riley Indiana University Digital Library Program 2012 23
  24. 24. Poor data practice “data entropy” Time of publication In what sense is modern science Specific details reproducible? General details Retirement or Information Content career change Accident Death Time (Michener et al. 1997)24 Managed by UT-Battelle for the U.S. Department of Energy
  25. 25. DataONE project (movie with sound) http://vimeo.com/3638373525 Managed by UT-Battelle for the U.S. Department of Energy
  26. 26. DataONE Component Interdependency Scientists: Member Nodes: Receive: Access to more data sources Receive: Additional and tools users, replication, Provide: Scientific progress and communities of best acknowledgment practice, appreciation Provide: Access to DataONE: data collections, Receives: MN and service interfaces scientist appreciation, access to MN data Provides: “Glue” services Funders: to enable interoperability, Receive: More efficient science communities of best output, chances for breakthrough practice, standard advances interfaces Provide: Resources to facilitate26 Managed by UT-Battelle for the U.S. Department of Energy science 26
  27. 27. Current Operational Member Nodes •  Released production CI 10 months ago •  Today: 13 production Member Nodes •  300,000 Data objects represented27 •  Near-term 15 more candidates Managed by UT-Battelle for the U.S. Department of Energy 27
  28. 28. The Investigator Toolkit Inves=gator  Toolkit   Web  Interface   Analysis,  Visualiza2on   Data  Management   Client  Libraries   Java   Python   Command  Line   Member  Nodes   Coordina2ng  Nodes   •  Developer, end-user tools •  Creation, search, retrieval, management •  Plugins, extensions for analysis tools Kepler28 Managed by UT-Battelle for the U.S. Department of Energy
  29. 29. Identify objectsGoal: Uniquely identify data or metadata objects•  Support the several identifier types widely used•  Identifiers assigned by Member Nodes•  Uniqueness ensured by Coordinating Nodes•  Resolution through Coordinating Nodes GUID! LSID PURL29 {3F2504E0-4… Managed by UT-Battelle for the U.S. Department of Energy
  30. 30. Provide Credit for Data Publication •  Data citation standards and courtesy customs •  Needs to metrics – how often cited •  Socio-cultural change: include data citations in promotion and tenure •  DataONE needs to nurture Member Node needs not work against them30 Managed by UT-Battelle for the U.S. Department of Energy
  31. 31. Identify people: federated identity •  Identity provider selected by the user •  Member nodes define access rules •  Rules propagated by Coordinating Nodes •  Identity and access control consistent across entire infrastructure •  (note similarity with Globus Online approach)31 Managed by UT-Battelle for the U.S. Department of Energy
  32. 32. Support for Entire Data Lifecycle Plan   Analyze   Collect   Integrate   Assure   Kepler Discover   Describe   Preserve  32 Managed by UT-Battelle for the U.S. Department of Energy
  33. 33. Open Science Movement33 Managed by UT-Battelle for the U.S. Department of Energy
  34. 34. Building global communities of practice: … creating long-lived CI enterprises, •  Broad, active community engagement –  Involvement of library and science educators engaging new generations of students in best practices –  Existing outreach and education programs •  Transparent, participatory governance •  Adoption/creation of innovative and sustainable business and organizational models34 Managed by UT-Battelle for the U.S. Department of Energy
  35. 35. Libraries and museums: value •  As Member Nodes: –  Facilitate the teaching and research mission of institution –  Build data collections for the 21st century •  In support of Data Librarians: –  Provide access to data management plans –  Provide best practices for faculty and students –  Cyberinfrastructure supporting the data lifecycle35 Managed by UT-Battelle for the U.S. Department of Energy 35
  36. 36. Data Management Planning Tool https://dmp.cdlib.org/ •  Create ready-to-use data management plans for specific funding agencies •  Meet funder requirements for data management plans •  Get step-by-step instructions and guidance for your data management plan as you build it •  Learn about resources and services available at your institution to help fulfill the data management requirements of your grant •  Released: Oct. 2011 •  Support for NIH requirements added 2/22/2012 •  Other similar efforts now also underway at institutional levels or with other entities.36 Managed by UT-Battelle for the U.S. Department of Energy
  37. 37. Plug: DMPTool next rev upcoming37 Managed by UT-Battelle for the U.S. Department of Energy 37  
  38. 38. DataONE DUG July 7-8 Chapel Hill NC Co-located with ESIP Federation Meeting.38 Managed by UT-Battelle for the U.S. Department of Energy 38  
  39. 39. Question & Discussion John W. Cobb 865.576.5439 cobbjw@ornl.gov39 Managed by UT-Battelle for the U.S. Department of Energy

×