Grand Challenges and Big Data:
Implications for Public Participation in
Scientific Research

Bill Michener

Professor and DataONE Project Director
University of New Mexico

4 August 2012

PPSR Meeting in Portland, OR
CHALLENGES




“Data Intensive Science” and the “80:20 Rule”

[Figure, adapted from CENR-OSTP. Axes: increasing process knowledge; decreasing spatial coverage. Tiers shown: intensive science sites and experiments; extensive science sites; volunteer & education networks; remote sensing.]

Broadening Public Participation in Research




Where are the Data?




The Long Tail of Orphan Data

“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray

[Figure (B. Heidorn): volume plotted against rank frequency of datatype. Specialized repositories (e.g. GenBank, PDB) sit at the high-volume end; orphan data make up the long tail.]

Research and Data Life Cycle Integration

[Figure: the research cycle (Ideas, Proposal writing, Research, Publication) linked to the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze).]




SOLUTIONS




Three Key Challenges

[Figure: the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze).]

1. Data Preservation




DataONE Supports Data Preservation

Three major components for a flexible, scalable, sustainable network

Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data

Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services

Investigator Toolkit




ORNL DAAC as a DataONE Member Node

[Figure labels: NASA collectors; DAAC Users (UWG); Investigator Toolkit; DataONE Users.]

The DataONE Federation




2. Data Discovery




3. Tools for Innovation and Discovery


The Fourth Paradigm:
1. Observational and experimental
2. Theoretical research
3. Computer simulations of natural phenomena
4. Data-intensive research
    • new tools, techniques, and ways of working



Investigator Toolkit Support

[Figure: the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze), annotated with the tools DMP-Tool and Kepler.]

Data Management Planning Tool




✔ Check for best practices
✔ Create metadata
✔ Connect to ONEShare

Data & Metadata (EML)




Exploration, Visualization, and Analysis Tools

Diverse bird observations and environmental data from 300,000 locations in the US integrated and analyzed using High Performance Computing Resources

Land Cover; Meteorology; MODIS – Remote sensing data

Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration

Model results: Occurrence of Indigo Bunting (2008) [panels: Jan, Apr, Jun, Sep, Dec]
• Examine patterns of migration
• Infer how climate change may affect bird migration


Scientific workflows




Collaboration environments




DataONE.org




Thanks!




          LEON LEVY
          FOUNDATION






Editor's Notes

  • #2 Networking, interconnectedness of information. Defining the relationships between components increases the value and utility of those items. The internet provides connectivity between systems, and a good deal of infrastructure has been built on this rapidly evolving, now pervasive fabric. The design of most internet-based infrastructure, though, is very ephemeral, and thus is not suitable for preservation of information or, more importantly, the relationships between elements. URLs are often used as identifiers, but they have a significant problem: their resolution, that is, finding the location where the content identified by the URL may be retrieved, is entirely dependent on the persistent availability of the service endpoint referenced by the URL. A change in any component in the resolution chain results in failure, and thus negates the utility of the URL. [Diagram of URL resolution process] The semantic web, the goal of interconnectedness between information, is entirely dependent on effective identifier resolution. Preservation of content. Access to content. Creating communities of agents able to access and manipulate information. Generating new content, relationships between content, discovering new associations. Being completely open about activity: the generation of new content, mining existing information, and access to processing resources may, however, be best done with some privacy. There are always some activities best not to perform in full public view. The DataONE project is building infrastructure that addresses these concerns.
  • #8 There is widely used infrastructure for certain well-defined “easy” biological datatypes like DNA sequences and protein structures. But these repositories are not adequate to capture the many datasets that require more context to be reusable. Our civilization is not wealthy enough to ever support the variety of specialized repositories that would be needed, and the curation that would be needed to standardize these data.
  • #11 In fact, many researchers find the new requirement to be quite confusing. Here are just a few examples of the questions that they are asking.
  • #13 DataONE is a federated data network built to improve access to Earth science data, and to support science by: engaging the relevant science, data, and policy communities; facilitating easy, secure, and persistent storage of data; and disseminating integrated and user-friendly tools for data discovery, analysis, visualization, and decision-making. There are three principal components: Member Nodes, which include a diverse array of data centers and repositories that are associated with national and international agencies and research networks, universities, libraries, etc. Coordinating Nodes, which support data replication across Member Nodes (i.e., data centers) as well as network-wide services like 24/7 access to metadata at the CNs, indexing and rapid search and discovery, etc. An Investigator Toolkit that includes tools that are widely used by scientists. The tools are coupled with the DataONE resources so that it is, for example, possible to seamlessly and transparently access data at Member Nodes through the tool of your choice.
  • #14 NASA Collectors: field investigators who collect data from NASA-funded projects and deposit those data in the ORNL DAAC. DAAC Users: those who search and download data from the ORNL DAAC. Member Node Crescent: the software stack that enables the MN functionality for the ORNL DAAC. This crescent software is developed and installed by D1 staff, making use of the characteristics of the DAAC system and metadata. DAAC users can obtain data directly from the ORNL DAAC, as they did before. D1 users will access metadata from the CN and will acquire ORNL DAAC data from the DAAC indirectly via the Member Node. The data and documentation downloads are recorded by the DAAC; the D1 user sees the DAAC’s citation to the downloaded data set.
  • #15 There are many opportunities for collaboration with DataONE and there are many benefits to doing so; the next few slides highlight the benefits and opportunities for research scientists, Member Nodes, and funding agencies. This map highlights many of the international partners that have expressed interest in establishing Member Nodes, many of which are active members of the DataONE Users Group.
  • #23 Other development activities during years 2-5 will focus on expanding the suite of tools that are available through the Investigator Toolkit. New tool additions will be identified and prioritized by the DataONE Users Group.
  • #24 As one example, DataONE is part of a consortium that is developing a Data Management Planning Online Tool. The tool “walks” scientists through the process of developing a concise, but comprehensive data management plan that could enable good stewardship of data and meet requirements of sponsors and home institutions.
  • #25 The five steps are located on the left side bar and include information about the data, metadata (or documentation about the data), policies for access and re-use, and plans for archiving and preserving the data. In this example, the Univ. of Virginia offers suggested text for archiving and preserving the data that can be pasted into the plan.
  • #27 How else do we know what the community needs? The Scientific Exploration, Visualization and Analysis working group is another example that you heard about earlier. In summary, by running through a comprehensive case study, this working group was able to provide specific guidance on the challenges faced when conducting data-intensive science, challenges that were communicated to, and met by, the DataONE core CI team and developers. Another mechanism to understand community needs is to conduct extensive surveys of stakeholders….