NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an era of data-intensive science


William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator

The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.

Published in Education
  • Networking and the interconnectedness of information: defining the relationships between components increases the value and utility of those items. The internet provides connectivity between systems, and a good deal of infrastructure has been built on this rapidly evolving, now pervasive fabric. Most internet-based infrastructure, however, is ephemeral by design and thus is not suitable for preserving information or, more importantly, the relationships between elements. URLs are often used as identifiers, but they have a significant problem: their resolution (that is, finding the location from which the content identified by the URL may be retrieved) depends entirely on the persistent availability of the service endpoint referenced by the URL. A change in any component of the resolution chain results in failure and thus negates the utility of the URL. [Diagram of URL resolution process] The semantic web, with its goal of interconnectedness between information, is entirely dependent on effective identifier resolution. Related concerns: preservation of content; access to content; creating communities of agents able to access and manipulate information; generating new content and relationships between content; and discovering new associations. Complete openness about activity is not always desirable: generating new content, mining existing information, and accessing processing resources may be best done with some privacy. There are always some activities best not performed in full public view. The DataONE project is building infrastructure that addresses these concerns.
  • In fact, many researchers find the new requirement to be quite confusing. Here are just a few examples of the questions that they are asking.
  • There is widely used infrastructure for certain well-defined “easy” biological datatypes like DNA sequences and protein structures. But these repositories are not adequate to capture the many datasets that require more context to be reusable. Our civilization is not wealthy enough to ever support the variety of specialized repositories that would be needed, or the curation that would be needed to standardize these data.
  • DataONE is a federated data network built to improve access to Earth science data, and to support science by: engaging the relevant science, data, and policy communities; facilitating easy, secure, and persistent storage of data; and disseminating integrated and user-friendly tools for data discovery, analysis, visualization, and decision-making. There are three principal components:
    Member Nodes, which include a diverse array of data centers and repositories associated with national and international agencies and research networks, universities, libraries, etc.
    Coordinating Nodes, which support data replication across Member Nodes (i.e., data centers) as well as network-wide services such as 24/7 access to metadata at the CNs, indexing, and rapid search and discovery.
    An Investigator Toolkit that includes tools widely used by scientists. The tools are coupled with the DataONE resources so that it is possible, for example, to seamlessly and transparently access data at Member Nodes through the tool of your choice.
  • Content: Data supporting peer-reviewed articles in basic and applied bioscience. Currently, 2.4 Gb data from ~400 articles and 50 journals.
    Platform: Customized DSpace repository.
    Metadata and data standards: Dublin Core Application Profile. Data file format determined by depositor and journal policy. Some curation and migration of file formats.
    Availability: Open Data (Creative Commons Zero), with time-limited embargoes.
    Identifier scheme: DataCite DOI.
    Usage: ~3000 annual downloads.
    Governance and sustainability: Jointly managed by a consortium of partner journals. Project funding from NSF (since 2008) and JISC (starting 2010).
    Institutional home: National Evolutionary Synthesis Center, British Library (pending).
  • As one example, DataONE is part of a consortium that is developing a Data Management Planning Online Tool. The tool “walks” scientists through the process of developing a concise, but comprehensive data management plan that could enable good stewardship of data and meet requirements of sponsors and home institutions.
  • First, one logs in and selects the research sponsor and solicitation number.
  • The five steps are located on the left sidebar and include information about the data, metadata (or documentation about the data), policies for access and re-use, and plans for archiving and preserving the data. In this example, the Univ. of Virginia offers suggested text for archiving and preserving the data that can be pasted into the plan.
  • There are many opportunities for collaboration with DataONE and there are many benefits to doing so; the next few slides highlight the benefits and opportunities for research scientists, Member Nodes, and funding agencies. This map highlights many of the international partners that have expressed interest in establishing Member Nodes, many of which are active members of the DataONE Users Group.
  • NASA Collectors: field investigators who collect data from NASA-funded projects and deposit those data in the ORNL DAAC.
    DAAC Users: those who search and download data from the ORNL DAAC.
    Member Node Crescent: the software stack that enables the MN functionality for the ORNL DAAC. This crescent software is developed and installed by D1 staff, making use of the characteristics of the DAAC system and metadata.
    DAAC users can obtain data directly from the ORNL DAAC, as they did before. D1 users access metadata from the CN and acquire ORNL DAAC data indirectly via the Member Node. The data and documentation downloads are recorded by the DAAC; the D1 user sees the DAAC’s citation to the downloaded data set.
  • Other development activities during years 2-5 will focus on expanding the suite of tools that are available through the Investigator Toolkit. New tool additions will be identified and prioritized by the DataONE Users Group.
  • How else do we know what the community needs? The Scientific Exploration, Visualization and Analysis working group is another example that you heard about earlier. In summary, by running through a comprehensive case study, this working group was able to provide specific guidance on the challenges faced when conducting data-intensive science, challenges that were communicated to, and met by, the DataONE core CI team and developers. Another mechanism for understanding community needs is to conduct extensive surveys of stakeholders…
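
The notes above on URL resolution argue that an identifier tied to a single service endpoint fails as soon as that endpoint changes. A minimal Python sketch of the alternative, a resolver that maps a persistent identifier to whatever locations currently hold the content, might look like the following. The `Resolver` class, the `doi:10.0000/example` identifier, and the `example.org` URLs are hypothetical illustrations and are not part of DataONE or of any real resolution service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resolver:
    """Toy resolver: maps a persistent identifier to its current locations."""
    locations: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, pid: str, url: str) -> None:
        """Record one more location where the identified content can be fetched."""
        self.locations.setdefault(pid, []).append(url)

    def retire(self, url: str) -> None:
        """Forget a service endpoint that no longer exists."""
        for urls in self.locations.values():
            if url in urls:
                urls.remove(url)

    def resolve(self, pid: str) -> str:
        """Return the first known location, or fail if none is left."""
        urls = self.locations.get(pid, [])
        if not urls:
            raise LookupError(f"no known location for {pid}")
        return urls[0]


# Hypothetical identifier and endpoints, for illustration only.
resolver = Resolver()
resolver.register("doi:10.0000/example", "https://repo-a.example.org/objects/42")
resolver.register("doi:10.0000/example", "https://repo-b.example.org/copies/42")

first = resolver.resolve("doi:10.0000/example")
resolver.retire("https://repo-a.example.org/objects/42")  # the original host goes away
second = resolver.resolve("doi:10.0000/example")          # the identifier still resolves
```

The identifier handed to readers never changes; only the resolver's table does. A bare URL has no such indirection, which is the weakness the notes describe.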

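The notes above describe Member Nodes that hold data and Coordinating Nodes that keep a metadata catalog, index it for search, and arrange replication. The toy Python model below sketches that division of labor under heavy simplification. The class names, method names, and node names (`mn-a`, `mn-b`) are assumptions chosen for readability; this is not the DataONE service API.

```python
class MemberNode:
    """Toy repository that stores data objects by identifier."""

    def __init__(self, name):
        self.name = name
        self.objects = {}  # identifier -> raw bytes

    def store(self, pid, data):
        self.objects[pid] = data

    def fetch(self, pid):
        return self.objects[pid]


class CoordinatingNode:
    """Toy coordinator: keeps metadata and tracks which nodes hold each object."""

    def __init__(self, member_nodes):
        self.member_nodes = {mn.name: mn for mn in member_nodes}
        self.catalog = {}  # identifier -> {"metadata": str, "holders": [node names]}

    def register(self, node_name, pid, metadata):
        """Record that a node holds an object described by the given metadata."""
        entry = self.catalog.setdefault(pid, {"metadata": metadata, "holders": []})
        if node_name not in entry["holders"]:
            entry["holders"].append(node_name)

    def replicate(self, pid, target_name):
        """Copy an object from its first holder to another member node."""
        source = self.member_nodes[self.catalog[pid]["holders"][0]]
        self.member_nodes[target_name].store(pid, source.fetch(pid))
        self.register(target_name, pid, self.catalog[pid]["metadata"])

    def search(self, term):
        """Return identifiers whose metadata mentions the term."""
        term = term.lower()
        return sorted(pid for pid, entry in self.catalog.items()
                      if term in entry["metadata"].lower())

    def holders(self, pid):
        return list(self.catalog[pid]["holders"])


mn_a, mn_b = MemberNode("mn-a"), MemberNode("mn-b")
cn = CoordinatingNode([mn_a, mn_b])

mn_a.store("pid:1", b"site,biomass\nA,12.5\n")
cn.register("mn-a", "pid:1", "Grassland biomass survey")
cn.replicate("pid:1", "mn-b")

hits = cn.search("biomass")
copies = cn.holders("pid:1")
```

Because the catalog lives at the coordinator, a search does not have to contact every repository, and a second copy keeps the content available if one node goes offline.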

  • 1. Data Observation Network for Earth (DataONE): Supporting Scientific Data Preservation, Discovery, and Innovation. Bill Michener, Professor and DataONE Project Director, University of New Mexico. 24 September 2012. National Information Standards Organization
  • 3. Research and Data Life Cycle Integration [Diagram: research life cycle (Ideas, Proposal writing, Research, Publication) alongside the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze)]
  • 4. Three Key Challenges [Diagram: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) with the label "Innovation"]
  • 5. 1. Data Preservation and Planning
  • 6. The Long Tail of Orphan Data
    “Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray
    [Figure: volume versus rank frequency of datatype, with specialized repositories (e.g. GenBank, PDB) at the high end and orphan data (B. Heidorn) in the long tail]
  • 7. Planning: Metadata standard? Data repository?
  • 8. DataONE and the DMPTool Support Data Preservation
    Three major components for a flexible, scalable, sustainable network:
    Member Nodes: diverse institutions; serve local community; provide resources for managing their data; retain copies of data
    Coordinating Nodes: retain complete metadata catalog; indexing for search; network-wide services; ensure content availability (preservation); replication services
    Investigator Toolkit
  • 9. Dryad (>3,000 data products)
    Coordinated submission of articles and underlying data
    Handshaking with specialized repositories
    Promotion of reuse and incentives for deposit
  • 10. Knowledge Network for Biocomplexity (20,000+ data packages)
    Data types: Ecological; Environmental; Demographic; Social/Legal/Economic
    Contributors: Individual investigators; Field stations and networks; Government agencies; Non-profit partnerships; Synthesis centers
    [Bar chart: data sizes, percent (0 to 60) by size class: <1, 1-10, 10-200, >200 MB]
  • 11. ✔ Check for best practices ✔ Create metadata ✔ Connect to ONEShare. Data & Metadata (EML)
  • 12. Data Management Planning Tool
  • 15. 2. Data Discovery
  • 16. Data Silos
  • 17. The DataONE Federation
  • 18. Member Node Functional Tiers
    Tier 1: Read only, public content: ping(), getLogRecords(), getCapabilities(), get(), getSystemMetadata(), getChecksum(), listObjects(), synchronizationFailed()
    Tier 2: Read only, with access control: isAuthorized(), setAccessPolicy()
    Tier 3: Read/Write using client tools: create(), update(), delete()
    Tier 4: Able to operate as a replication target: replicate(), getReplica()
  • 19. ORNL DAAC as a DataONE Member Node [Diagram labels: NASA collectors; DAAC Users (UWG); Investigator Toolkit; DataONE Users]
  • 25. 1. Ontology-based discovery
    Concepts acquire context: biomass as Material or biomass as Energy
    Additional search terms; search results
    Super-classes may have different instance properties
    1. NCBO ontology repository
    2. Populated with ontologies (e.g., the NASA-JPL Semantic Web for Earth and Environmental Terminology)
    3. Queried ontologies and returned results using REST services
  • 26. Approach 2: Enrich MN Metadata
                                  DAAC     DRYAD    KNB
    Number of Documents           978      1,729    24,249
    Total Number of Keywords      7,294    8,266    254,525
    Average Keywords/Document     7.46     4.78     10.49
    [Bar chart comparing DAAC, DRYAD, and KNB on a scale of 0 to 12]
    Actual Keywords and Suggested Keywords (two lists, interleaved on the slide):
    List numbered [1] to [25]: field investigation; analysis; land cover; computational model; reflectance; vegetative cover; biomass; primary production; steel measuring tape; weigh balance; precipitation amount; canopy characteristics; leaf characteristics; water vapor; quadrat sample frame; rain gauge; surface air temperature; air temperature; meteorological station; human observer; vegetation index; soil core device; plant characteristics; surface wind; albedo
    List numbered 1 to 12: canopy characteristics; field investigation; vegetation index; leaf characteristics; Satellite; land cover; leaf area meter; Reflectance; steel measuring tape; vegetative cover; plant characteristics; albedo
  • 27. 3. Innovation
    The Fourth Paradigm:
    1. Observational and experimental
    2. Theoretical research
    3. Computer simulations of natural phenomena
    4. Data-intensive research: new tools, techniques, and ways of working
  • 28. “Data Intensive Science” and the “80:20 Rule” [Diagram: increasing process knowledge against decreasing spatial coverage, across four levels: Intensive science sites and experiments; Extensive science sites; Volunteer & education networks; Remote sensing. Adapted from CENR-OSTP]
  • 29. Public Participation in Scientific Research Conference: 4-5 August 2012 in Portland, Oregon USA prior to Ecological Society of America meeting (6-10 Aug.):
  • 30. Investigator Toolkit Support [Diagram: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) annotated with the tools DMP-Tool and Kepler]
  • 31. Exploration, Visualization, and Analysis
    Diverse bird observations and environmental data from 300,00 locations in the US, integrated and analyzed using High Performance Computing Resources
    Inputs: Land Cover; Meteorology; MODIS – Remote sensing data
    Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
    Model results: Occurrence of Indigo Bunting (2008), shown for Jan, Apr, Jun, Sep, Dec
    • Examine patterns of migration
    • Infer how climate change may affect bird migration
  • 32. Taverna, MyExperiment
  • 33. Provenance Browser
  • 34. DataONE: Supporting Scientific Data Preservation, Discovery, and Innovation [Logo panels: Current Member Nodes; Coming Soon; Current Tools; Tools Coming Soon. Legible label: Queensland University of Technology]
  • 35. Deployment Targets – Y5
    Timeline: 2009, 2010, 2011, 2012, 2013, 2014; project years Y1, Y2, Y3, Y4, Y5
    Metadata Objects: 100k (130k); 400k; 1M
    Datasets: 90k (120k); 180k; 360k
    Uptime: 99.0 (100); 99.9; 99.9
    Metadata Schemas: 8 (4); 8; 8
    Member Nodes: 10 (8); 20; 40
    MN Countries: 3 (2); 5; 10
    Coordinating Nodes: 3 (3); 4; 5
    CN Countries: 1 (1); 1; 2
    ITK Tools: 8 (4); 10; 12
  • 36. Community Engagement
  • 37. User Assessments [Timeline, Year 1 to Year 5: Scientists: BL; Scientists: FU; Library Policies: BL; Library Policies: FU; Librarians: BL; Librarians: FU; Policy Makers: BL; Policy Makers: FU; Educators: BL; Educators: FU]
  • 38. Community Engagement
  • 39. Best Practices and Software Tools
  • 40. June 3-21, 2013, University of New Mexico
  • 41. Internships: 2009 – 4 interns; 2010 – 4 interns; 2011 – 8 interns; 2012 – 6 interns
  • 42. DataONE: Supporting Scientific Data Preservation, Discovery, and Innovation
  • 44. DataONE Team and Sponsors
    • Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla
    • Dave Vieglais
    • Suzie Allard, Nick Dexter, Kimberly Douglass, Carol Tenopir, Robert Waltz, Bruce Wilson
    • John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanismy, Line Pouchard
    • Patricia Cruse, John Kunze
    • Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
    • Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin
    • Paul Allen, Rick Bonney, Steve Kelling
    • Ryan Scherle, Todd Vision
    • Randy Butler
    • Ewa Deelman
    • Deborah McGuinness
    • Jeff Horsburgh
    • Robert Sandusky
    • Bertram Ludaescher
    • Peter Honeyman
    • Cliff Duke
    • Carole Goble
    • Donald Hobern
    • David DeRoure
    LEON LEVY FOUNDATION
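
The Member Node functional tiers listed on slide 18 can be read as a capability ladder. The Python sketch below encodes the method names exactly as they appear on that slide and works out the highest tier a node satisfies. It assumes that tiers are cumulative (a node at tier N also implements every lower tier), which the slide suggests but does not state. The `highest_tier` helper is a hypothetical illustration, not part of any DataONE library.

```python
# Method names copied from slide 18; the grouping into a dict is illustrative.
TIER_METHODS = {
    1: {"ping", "getLogRecords", "getCapabilities", "get",
        "getSystemMetadata", "getChecksum", "listObjects",
        "synchronizationFailed"},
    2: {"isAuthorized", "setAccessPolicy"},
    3: {"create", "update", "delete"},
    4: {"replicate", "getReplica"},
}


def highest_tier(supported):
    """Return the highest tier whose methods, and all lower tiers' methods,
    are present in `supported`. Returns 0 if tier 1 is incomplete.
    Assumes tiers are cumulative."""
    supported = set(supported)
    tier = 0
    for level in sorted(TIER_METHODS):
        if TIER_METHODS[level] <= supported:  # every method of this tier is available
            tier = level
        else:
            break
    return tier


# A node exposing the read-only methods plus access control.
read_only_with_acl = TIER_METHODS[1] | TIER_METHODS[2]
tier_for_acl_node = highest_tier(read_only_with_acl)

# A node with write methods but no access control stops at tier 1 under this assumption.
tier_for_gap_node = highest_tier(TIER_METHODS[1] | TIER_METHODS[3])
```

Under the cumulative reading, a repository can join the federation with only the read-only calls and add access control, write support, and replication later.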