NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an era of data-intensive science


Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator

The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data-intensive science, presents current solutions, and lays out a roadmap to a future in which new information technologies significantly increase the pace of scientific discovery and innovation.

  • Networking and the interconnectedness of information. Defining the relationships between components increases the value and utility of those items. The internet provides connectivity between systems, and a good deal of infrastructure has been built on this rapidly evolving, now pervasive fabric. The design of most internet-based infrastructure, though, is very ephemeral, and thus is not suitable for preserving information or, more importantly, the relationships between elements. URLs are often used as identifiers, but they have a significant problem: their resolution (that is, finding the location where the content identified by the URL may be retrieved) depends entirely on the persistent availability of the service endpoint referenced by the URL. A change in any component of the resolution chain results in failure, and thus negates the utility of the URL. [Diagram of URL resolution process] The semantic web, with its goal of interconnectedness between information, is entirely dependent on effective identifier resolution. Preservation of content. Access to content. Creating communities of agents able to access and manipulate information. Generating new content and relationships between content, and discovering new associations. Being completely open about activity – the generation of new content, mining of existing information, and access to processing resources may, however, be best done with some privacy. There are always some activities best not performed in full public view. The DataONE project is building infrastructure that addresses these concerns.
  • In fact, many researchers find the new requirement to be quite confusing. Here are just a few examples of the questions that they are asking.
  • There is widely used infrastructure for certain well-defined “easy” biological datatypes like DNA sequences and protein structures. But these repositories are not adequate to capture the many datasets that require more context to be reusable. Our civilization is not wealthy enough to ever support the variety of specialized repositories that would be needed, or the curation that would be needed to standardize these data.
  • DataONE is a federated data network built to improve access to Earth science data, and to support science by: engaging the relevant science, data, and policy communities; facilitating easy, secure, and persistent storage of data; and disseminating integrated and user-friendly tools for data discovery, analysis, visualization, and decision-making. There are three principal components: Member Nodes, which include a diverse array of data centers and repositories that are associated with national and international agencies and research networks, universities, libraries, etc.; Coordinating Nodes, which support data replication across Member Nodes (i.e., data centers) as well as network-wide services like 24/7 access to metadata at the CNs, indexing, and rapid search and discovery; and an Investigator Toolkit that includes tools that are widely used by scientists. The tools are coupled with the DataONE resources so that it is, for example, possible to seamlessly and transparently access data at Member Nodes through the tool of your choice.
  • Content: Data supporting peer-reviewed articles in basic and applied bioscience; currently, 2.4 Gb data from ~400 articles and 50 journals. Platform: Customized DSpace repository. Metadata and data standards: Dublin Core Application Profile; data file format determined by depositor and journal policy; some curation and migration of file formats. Availability: Open Data (Creative Commons Zero), with time-limited embargoes. Identifier scheme: DataCite DOI. Usage: ~3000 annual downloads. Governance and sustainability: Jointly managed by a consortium of partner journals; project funding from NSF (since 2008) and JISC (starting 2010). Institutional home: National Evolutionary Synthesis Center, British Library (pending).
  • As one example, DataONE is part of a consortium that is developing a Data Management Planning Online Tool. The tool “walks” scientists through the process of developing a concise, but comprehensive data management plan that could enable good stewardship of data and meet requirements of sponsors and home institutions.
  • First, one logs in and selects the research sponsor and solicitation number.
  • The five steps are located on the left sidebar and include information about the data, metadata (or documentation about the data), policies for access and re-use, and plans for archiving and preserving the data. In this example, the Univ. of Virginia offers suggested text for archiving and preserving the data that can be pasted into the plan.
  • There are many opportunities for collaboration with DataONE and there are many benefits to doing so; the next few slides highlight the benefits and opportunities for research scientists, Member Nodes, and funding agencies. This map highlights many of the international partners that have expressed interest in establishing Member Nodes, many of which are active members of the DataONE Users Group.
  • NASA Collectors: Field investigators who collect data from NASA-funded projects and deposit those data in the ORNL DAAC. DAAC Users: Those who search and download data from the ORNL DAAC. Member Node Crescent: The software stack that enables the MN functionality for the ORNL DAAC. This crescent software is developed and installed by D1 staff, making use of the characteristics of the DAAC system and metadata. DAAC users can obtain data directly from the ORNL DAAC, as they did before. D1 users will access metadata from the CN and will acquire ORNL DAAC data from the DAAC indirectly via the Member Node. The data and documentation downloads are recorded by the DAAC; the D1 user sees the DAAC’s citation to the downloaded data set.
  • Other development activities during years 2-5 will focus on expanding the suite of tools that are available through the Investigator Toolkit. New tool additions will be identified and prioritized by the DataONE Users Group.
  • How else do we know what the community needs? The Scientific Exploration, Visualization and Analysis working group is another example that you heard about earlier. In summary, by running through a comprehensive case study, this working group was able to provide specific guidance on the challenges faced when conducting data-intensive science, challenges that were communicated to, and met by, the DataONE core CI team and developers. Another mechanism to understand community needs is to conduct extensive surveys of stakeholders….
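
The first note above argues that a URL fails as an identifier because it is tied to one service endpoint, and a later note describes Coordinating Nodes replicating data across Member Nodes. A minimal sketch of that idea follows, assuming a resolver that maps a persistent identifier to several replica locations. The `Resolver` class, the `retrieve` function, and the injected `fetch` callable are all hypothetical names for illustration; this is not the DataONE API.

```python
class Resolver:
    """Hypothetical in-memory map from a persistent identifier to replica URLs."""

    def __init__(self):
        self._locations = {}  # identifier -> list of URLs, in registration order

    def register(self, pid, url):
        """Record one more location where the content named by pid is held."""
        self._locations.setdefault(pid, []).append(url)

    def resolve(self, pid):
        """Return a copy of the known locations for pid (empty if unknown)."""
        return list(self._locations.get(pid, []))


def retrieve(pid, resolver, fetch):
    """Try each known location in turn and return the first content fetched.

    fetch is any callable that takes a URL and returns content, raising
    OSError (or a subclass such as ConnectionError) when the endpoint is down.
    """
    tried = 0
    for url in resolver.resolve(pid):
        tried += 1
        try:
            return fetch(url)
        except OSError:
            continue  # this endpoint is unavailable; fall through to the next
    raise LookupError(f"no reachable copy of {pid!r} after {tried} location(s)")
```

Because the identifier is resolved at request time, the loss of any single endpoint does not break retrieval as long as one registered replica remains reachable, which is the property a bare URL lacks.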

    1. 1. Data Observation Network for Earth (DataONE): Supporting Scientific Data Preservation, Discovery, and Innovation. Bill Michener, Professor and DataONE Project Director, University of New Mexico. 24 September 2012. National Information Standards Organization
    2. 2. 2
    3. 3. Research and Data Life Cycle Integration [Diagram: research cycle (Ideas, Proposal writing, Research, Publication) alongside the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze)] 3
    4. 4. Three Key Challenges [Diagram: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze), labeled "Innovation"] 4
    5. 5. 1. Data Preservation and Planning✔ ? 5
    6. 6. The Long Tail of Orphan Data. “Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray. [Chart: Volume vs. rank frequency of datatype; specialized repositories (e.g. GenBank, PDB) at the high end, orphan data (B. Heidorn) in the tail] 6
    7. 7. Planning ? Metadata standard? Data repository? 7
    8. 8. DataONE and the DMPTool Support Data Preservation. Three major components for a flexible, scalable, sustainable network. Member Nodes: • diverse institutions • serve local community • provide resources for managing their data • retain copies of data. Coordinating Nodes: • retain complete metadata catalog • indexing for search • network-wide services • ensure content availability (preservation) • replication services. Investigator Toolkit 8
    9. 9. Dryad (>3,000 data products). Coordinated submission of articles and underlying data. Handshaking with specialized repositories. Promotion of reuse and incentives for deposit 9
    10. 10. Knowledge Network for Biocomplexity (20,000+ data packages). Data Types: • Ecological • Environmental • Demographic • Social/Legal/Economic. Contributors: • Individual investigators • Field stations and networks • Government agencies • Non-profit partnerships • Synthesis centers. [Chart: Data Sizes, % (axis 0 to 60) by size in MB: <1, 1-10, 10-200, >200] 10
    11. 11. ✔ Check for best practices ✔ Create metadata ✔ Connect to ONEShare. Data & Metadata (EML) 11
    12. 12. Data Management Planning Tool 12
    13. 13. 13
    14. 14. 14
    15. 15. 2. Data Discovery 15
    16. 16. Data Silos 16
    17. 17. The DataONE Federation 17
    18. 18. Member Node Functional Tiers. Tier 1: Read only, public content: ping(), getLogRecords(), getCapabilities(), get(), getSystemMetadata(), getChecksum(), listObjects(), synchronizationFailed(). Tier 2: Read only, with access control: isAuthorized(), setAccessPolicy(). Tier 3: Read/Write using client tools: create(), update(), delete(). Tier 4: Able to operate as a replication target: replicate(), getReplica() 18
    19. 19. ORNL DAAC as a DataONE Member Node [Diagram labels: NASA collectors; DAAC Users (UWG); Investigator Toolkit; DataONE Users] 19
    20. 20. 20
    21. 21. 21
    22. 22. 22
    23. 23. 23
    24. 24. 24
    25. 25. 1. Ontology-based discovery search results. Concepts acquire context: biomass as Material or biomass as Energy. Additional search terms. Super-classes may have different instance properties. 1. NCBO ontology repository 2. Populated with ontologies (e.g., the NASA-JPL Semantic Web for Earth and Environmental Terminology) 3. Queried ontologies and returned results using REST services 25
    26. 26. Approach 2: Enrich MN Metadata. Number of Documents: DAAC 978, DRYAD 1,729, KNB 24,249. Total Number of Keywords: DAAC 7,294, DRYAD 8,266, KNB 254,525. Average Keywords/Document: DAAC 7.46, DRYAD 4.78, KNB 10.49. [Bar chart of average keywords per document for DAAC, DRYAD, and KNB; axis 0 to 12] Actual Keywords: [1] field investigation, [2] analysis, [3] land cover, [4] computational model, [5] reflectance, [6] vegetative cover, [7] biomass, [8] primary production, [9] steel measuring tape, [10] weigh balance, [11] precipitation amount, [12] canopy characteristics, [13] leaf characteristics, [14] water vapor, [15] quadrat sample frame, [16] rain gauge, [17] surface air temperature, [18] air temperature, [19] meteorological station, [20] human observer, [21] vegetation index, [22] soil core device, [23] plant characteristics, [24] surface wind, [25] albedo. Suggested Keywords: 1. canopy characteristics, 2. field investigation, 3. vegetation index, 4. leaf characteristics, 5. Satellite, 6. land cover, 7. leaf area meter, 8. Reflectance, 9. steel measuring tape, 10. vegetative cover, 11. plant characteristics, 12. albedo 26
    27. 27. 3. Innovation. The Fourth Paradigm: 1. Observational and experimental 2. Theoretical research 3. Computer simulations of natural phenomena 4. Data-intensive research • new tools, techniques, and ways of working 27
    28. 28. “Data Intensive Science” and the “80:20 Rule” [Diagram, adapted from CENR-OSTP; axes: Increasing Process Knowledge, Decreasing Spatial Coverage; levels: Intensive science sites and experiments, Extensive science sites, Volunteer & education networks, Remote sensing] 28
    29. 29. Public Participation in Scientific Research Conference: 4-5 August 2012 in Portland, Oregon, USA, prior to the Ecological Society of America meeting (6-10 Aug.) 29
    30. 30. Investigator Toolkit Support [Diagram: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) annotated with tools, including DMP-Tool and Kepler] 30
    31. 31. Exploration, Visualization, and Analysis. Diverse bird observations and environmental data from 300,00 locations in the US integrated and analyzed using High Performance Computing Resources. Inputs: Land Cover; Meteorology; MODIS – Remote sensing data. Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration. Model results: Occurrence of Indigo Bunting (2008) [maps for Jan, Apr, Jun, Sep, Dec]. • Examine patterns of migration • Infer how climate change may affect bird migration 31
    32. 32. Taverna, MyExperiment 32
    33. 33. Provenance Browser 33 33
    34. 34. DataONE: Supporting Scientific Data Preservation, Discovery, and Innovation. Current Member Nodes: Coming Soon: Current Tools: Tools Coming Soon: Queensland University of Technology 34
    35. 35. Deployment Targets – Y5 [Timeline: 2009 to 2014, Y1 to Y5]. Metadata Objects: 100k (130k), 400k, 1M. Datasets: 90k (120k), 180k, 360k. Uptime: 99.0 (100), 99.9, 99.9. Metadata Schemas: 8 (4), 8, 8. Member Nodes: 10 (8), 20, 40. MN Countries: 3 (2), 5, 10. Coordinating Nodes: 3 (3), 4, 5. CN Countries: 1 (1), 1, 2. ITK Tools: 8 (4), 10, 12 35
    36. 36. Community Engagement 36
    37. 37. User Assessments [Timeline, Year 1 to Year 5: Scientists: BL; Scientists: FU; Library Policies: BL; Library Policies: FU; Librarians: BL; Librarians: FU; Policy Makers: BL; Policy Makers: FU; Educators: BL; Educators: FU] 37
    38. 38. Community Engagement 38
    39. 39. Best Practices and Software Tools 39
    40. 40. June 3-21, 2013, University of New Mexico 40
    41. 41. Internships 2009 – 4 interns, 2010 – 4 interns 2011 – 8 interns, 2012 – 6 interns 41
    42. 42. DataONE: Supporting Scientific Data Preservation, Discovery, and Innovation 42
    43. 43. 43
    44. 44. DataONE Team and Sponsors • Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla • Dave Vieglais • Suzie Allard, Nick Dexter, Kimberly Douglass, Carol Tenopir, Robert Waltz, Bruce Wilson • John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanismy, Line Pouchard • Patricia Cruse, John Kunze • Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly • Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin • Paul Allen, Rick Bonney, Steve Kelling • Ryan Scherle, Todd Vision • Randy Butler • Ewa Deelman • Deborah McGuinness • Jeff Horsburgh • Robert Sandusky • Bertram Ludaescher • Peter Honeyman • Cliff Duke • Carole Goble • Donald Hobern • David DeRoure. LEON LEVY FOUNDATION 44
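
The Member Node functional tiers on slide 18 can be read as a capability ladder. The sketch below copies the method names from that slide; the rule that a node qualifies for a tier only when it also implements every lower tier is an assumption made for illustration, and `highest_tier` is a hypothetical helper, not part of any DataONE library.

```python
# Method names per tier, as listed on slide 18.
TIER_METHODS = {
    1: {"ping", "getLogRecords", "getCapabilities", "get",
        "getSystemMetadata", "getChecksum", "listObjects",
        "synchronizationFailed"},            # read only, public content
    2: {"isAuthorized", "setAccessPolicy"},  # read only, with access control
    3: {"create", "update", "delete"},       # read/write using client tools
    4: {"replicate", "getReplica"},          # able to act as a replication target
}


def highest_tier(implemented):
    """Return the highest tier a node reaches, or 0 if Tier 1 is incomplete.

    Assumption (not stated on the slide): tiers are cumulative, so the
    count stops at the first tier with a missing method.
    """
    implemented = set(implemented)
    tier = 0
    for level in sorted(TIER_METHODS):
        if TIER_METHODS[level] <= implemented:  # all methods of this tier present
            tier = level
        else:
            break
    return tier
```

For example, a repository exposing only the Tier 1 and Tier 2 methods would be reported as Tier 2 under this reading, while one that implements `replicate()` but lacks `create()` would not be counted as a replication target.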