Scientific Data Management

Published in: Education, Technology

  1. School of Information Studies, Syracuse University
     Scientific Data Management
     2011 Professional Development Day for New England Librarians (3/16/2011)
     Jian Qin, School of Information Studies, Syracuse University
     http://eslib.ischool.syr.edu/
     The day ahead
     • An environmental scan
       – E-Science, cyberinfrastructure, and data
       – What do all these have to do with me?
     • Case study: The gravitational wave research data management
     • Group work: Role play in developing data management initiatives
  2. An environmental scan
     • E-Science, cyberinfrastructure, and data
     • What do all these have to do with me?
     Topics
     • Overview of E-Science
     • Characteristics of e-science
     • Data sets, data collections, and data repositories
     • Why does it matter to libraries?
     E-Science
     “In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.”
     National e-Science Center. (2008). Defining e-Science. http://www.nesc.ac.uk/nesc/define.html
  3. E-Infrastructure for the research lifecycle
     http://epubs.cclrc.ac.uk/bitstream/3857/science_lifecycle_STFC_poster1.PDF
     Characteristics of e-science
     • Digital data driven
     • Distributed
     • Collaborative
     • Trans-disciplinary
     • Fuses pillars of science
       – Experiment
       – Theory
       – Model/simulation
       – Observation/correlation
     Greer, Chris. (2008). E-Science: Trends, Transformations & Responses. In: Reinventing Science Librarianship: Models for the Future, October 2008. http://www.arl.org/bm~doc/ff08greer.pps
  4. Shift in Science Paradigms
     • Thousand years ago: Science was empirical, describing natural phenomena
     • A few hundred years ago: Theoretical branch using models, generalizations
     • A few decades ago: A computational approach simulating complex phenomena
     • Today: Data exploration (eScience) unify theory, experiment, and simulation
       – Data captured by instruments or generated by simulator
       – Processed by software
       – Information/Knowledge stored in computer
       – Scientist analyzes database/files using data management and statistics
     Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
  5. X-Info
     • The evolution of X-Info and Comp-X for each discipline X
     • How to codify and represent our knowledge
     [Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations supply facts; questions go in and answers come out]
     The Generic Problems
     • Data ingest
     • Managing a petabyte
     • Common schema
     • How to organize it
     • How to reorganize it
     • How to share with others
     • Query and Vis tools
     • Building and executing models
     • Integrating data and Literature
     • Documenting experiments
     • Curation and long-term preservation
     Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
     Useful resources
     • What is eScience?
     • eScience Initiatives
     • Science Research and Data
     • Science Data Management
     • Literature Reviews
     • Data Policy Issues
     • eScience Research Centers
     http://eslib.ischool.syr.edu/index.php?option=com_content&view=section&id=9&Itemid=83
     http://research.microsoft.com/en-us/collaboration/fourthparadigm/
  6. Data
     Any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
     [Figure caption: An artist’s conception (above) depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems.]
     http://www.nsf.gov/pubs/2007/nsf0728/nsf0728_4.pdf
     Scientific data formats
     • Common data format
     • Image formats
     • Matrix formats
     • Microarray file formats
     • Communication protocols
  7. Scientific datasets
     • The scientific data set, or SDS, is a group of data structures used to store and describe multidimensional arrays of scientific data.
     NCSA HDF Development Group. (1998). HDF 4.1r2 Users Guide. http://www.hdfgroup.org/training/HDFtraining/UsersGuide/SDS_SD.fm1.html#48894
     Example: Ecological dataset
     • Floristic diversity data
       – Related links
       – Data attributes
       – Download link
  8. Example: Biodiversity dataset
     • Actions for Porcupine Marine Natural History Society - Marine flora and fauna records from the North-east Atlantic
       – Metadata record output in different standard formats
       – URL for dataset download
     Example: The Significant Earthquake Database
     • The Significant Earthquake Database
       – A database containing data about significant earthquake events and the damages caused
       – An interface for extracting a subset of data
       – A link to download the whole dataset
       – Documentation
  9. Dataset example: Cayuga Lake (New York) Water Quality Monitoring Data Related to the Cornell Lake Source Cooling Facility, 1998-2009
     Research data collections
     • Data output size: larger, discipline-based | Metadata standards: multiple, comprehensive | Management: organized, institutionalized
     • Data output size: smaller, team-based | Metadata standards: none or random | Management: heroic individual inside the team
  10. Research collections
     • Limited processing or long-term management
     • Do not conform to any data standards
     • Varying sizes and formats of data files
     • Low level of processing, lack of plan for data products
     • Low awareness of metadata standards and data management issues
     Resource collections
     • Authored by a community of investigators, within a domain of science or engineering
     • Developed with community-level standards
     • Lifetime is between mid- and long-term
     • Example: Hubbard Brook Ecosystem Study (http://www.hubbardbrook.org)
       – One of the regional sites in the Long Term Ecological Research Network (LTER)
       – Community of the ecological domain
       – Community of investigators from around the country on ecosystem study
       – Ecological Metadata Language (EML), a community-level standard
       – Cataloged, searchable dataset collections
  11. Reference collection
     • Example: Global Biodiversity Information Facility
       – Created by large segments of science community
       – Conform to robust, well-established and comprehensive standards, e.g.
         • ABCD (Access to Biological Collection Data)
         • Darwin Core
         • DiGIR (Distributed Generic Information Retrieval)
         • Dublin Core Metadata standard
         • GGF (Global Grid Forum)
         • Invasive Alien Species Profile
         • LSID (Life Sciences Identifier)
         • OGC (Open Geospatial Consortium)
     Global Biodiversity Information Facility
     http://www.tdwg.org/standards/
     http://www.gbif.org/informatics/discoverymetadata/a-metadata-infrastructure/
  12. Datasets, data collections, and data repositories
     • Data collections are built for larger segments of science and engineering
     • Datasets
       – typically centered around an event or a study
       – contain a single file or multiple files in various formats
       – coupled with documentation about the background of data collection and processing
     • Data repository: System for storing, managing, preserving, and providing access to datasets
       – A repository may contain one or more data collections
       – A data collection may contain one or more datasets
       – A dataset may contain one or more data files
     An emerging trend in academic libraries
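     The containment rules on the "Datasets, data collections, and data repositories" slide can be sketched as a small data model. This is only an illustration: the class names, the field names, and the file names in the example are assumptions made for this sketch, not part of the original slides or of any repository software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    name: str
    file_format: str  # e.g. "csv", "txt"; hypothetical values


@dataclass
class Dataset:
    title: str
    documentation: str = ""  # background on data collection and processing
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class DataRepository:
    name: str
    collections: List[DataCollection] = field(default_factory=list)

    def all_files(self) -> List[DataFile]:
        """Walk repository -> collections -> datasets -> data files."""
        return [
            data_file
            for collection in self.collections
            for dataset in collection.datasets
            for data_file in dataset.files
        ]


def build_example_repository() -> DataRepository:
    # All names below are made up for illustration.
    dataset = Dataset(
        title="Floristic diversity data",
        documentation="How the survey was carried out and processed.",
        files=[
            DataFile("floristic_diversity.csv", "csv"),
            DataFile("readme.txt", "txt"),
        ],
    )
    collection = DataCollection("Ecological datasets", [dataset])
    return DataRepository("Example institutional data repository", [collection])


repository = build_example_repository()
file_names = [f.name for f in repository.all_files()]
```

     Modeling each level as its own type keeps the "may contain one or more" relationships explicit, so documentation stays attached to the dataset rather than to individual files.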
  13. Initiatives in research libraries
     • Data support and services in institutions: 45%
     • Libraries involved in supporting eScience: 73%
     • Pressure points:
       – Lack of resources
       – Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
       – Lack of a unifying direction on campus
     Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institution. http://www.arl.org/bm~doc/escience_report2010.pdf
     Data preservation challenges
     • Data formats
       – Vary in data types, e.g. vector and raster data types
       – Format conversions, e.g. from an old version to a newer one
     • Data relations
       – e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
     • Semantic issues
       – Naming datasets and attributes
  14. Data access challenges
     • Reliability
     • Authenticity
     • Leverage technology to make data access easier and more effective
       – Cross-database search
       – Integration applications
     Supporting digital research data
     • Lifecycle of research data
       – Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
       – Edit: organize, annotate, clean, filter…
       – Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
       – Publish: disseminate, create portals/data databases, associate with literature
       – Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
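     The lifecycle stages listed above can be sketched as an ordered enumeration. The strictly linear order and the `next_stage` helper are assumptions made for this sketch; in practice the use/reuse stage often loops back into creation of derived data.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Research data lifecycle stages, in the order given on the slide."""
    CREATE = "create"
    EDIT = "edit"
    USE_REUSE = "use/reuse"
    PUBLISH = "publish"
    PRESERVE_DESTROY = "preserve/destroy"


# Enum members iterate in definition order, which gives the stage sequence.
STAGE_ORDER = list(Stage)


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    position = STAGE_ORDER.index(stage)
    if position + 1 < len(STAGE_ORDER):
        return STAGE_ORDER[position + 1]
    return None


# Walk one dataset through the whole lifecycle.
history = []
current: Optional[Stage] = Stage.CREATE
while current is not None:
    history.append(current.value)
    current = next_stage(current)
```

     Recording the stage a dataset is in gives a librarian a simple hook for deciding which services (cleaning, publishing, preservation) apply at a given moment.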
  15. Supporting data management
     • The data deluge
       – Numerical, image, video
       – Models, simulations, bit streams
       – XML, CSV, DB, HTML
     • Researchers need:
       – Specialized search engines to discover the data they need
       – Powerful data mining tools to use and analyze the data
     Research data management
     [Diagram labels: Community; Institution; eScience librarian; Financial and policy support; Science domain idiosyncrasies; Data content; User requirements]
     • Evolving and interconnecting
       – Institutional repository
       – Community repository
       – National repository
       – International repository
  16. Implications to scholarly communication process
     • Publishing: Data publishing; new scholarly publishing models: open access, institutional and community repositories, self-publishing, library publishing, ....
     • Curation: Maintaining, preserving and adding value to digital research data throughout its lifecycle.
     • Archiving: The long-term storage, retrieval, and use of scientific data and methods.
     SDM, research impact, and value
  17. Cyberinfrastructure (CI) vision for the 21st century discovery
     Atkins, D. (2007). Global interdisciplinary research. http://www.nsf.gov/pubs/2007/nsf0728/index.jsp
     Educating the new type of workforce
     • Scientific data literacy (SDL) project (http://sdl.syr.edu), 2007-2009
     • E-Science Librarianship Curriculum project (eSLib) (http://eslib.ischool.syr.edu), 2009-2012, in partnership with Cornell University Library
  18. eSLib curriculum
     • Scientific Data Management (core)
     • Cyberinfrastructure and Scientific Collaboration (core)
     • Data services (capstone)
     • Database systems (required elective)
     • Metadata, workflows, XML
     • Other activities:
       – eScience Labs
       – Mentorship program
       – Student projects
       – Internships
     Defining data literacy
     • IL: ACRL. (2010).
     • DL: Finn, Charles, W.P. (Tech & Learning, 2004)
     • SDL: Qin, J. & J. D’Ignazio (Journal of Library Metadata, 2010)
  19. Summary
     • E-Science development has raised expectations of research libraries
       – Working knowledge and skills in e-Science
       – Focus on process (data and team science) rather than product (reference services)
       – Proactive, collaborative, integrative, and interdisciplinary
     Case Study: Learning Data Management Needs of the Gravitational Wave Researchers
  20. Gravitational Wave (GW) Research
     Nature of GW research
     • Computationally intensive
     • Large amounts of data
     • Highly distributed and collaborative
     • Open while “secretive” (embargo time)
  21. Data management goal
     • To enable:
       – dataset search by various options and
       – once a dataset of interest is identified, enable the tracking of this dataset to its original (raw) data.
     What is needed to accomplish the goal?
     • Metadata, of course! But
       – What metadata?
       – Metadata about what data, the raw data, processed data, code data, or output data?
     • What is there to be learned?
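     The two parts of the data management goal, searching for datasets and tracking a dataset back to its raw data, can be sketched with a metadata record that points at the dataset it was derived from. Everything here is hypothetical: the field names, the identifiers, and the assumption that each dataset has a single parent are illustrations, not the schema used in the gravitational wave project.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DatasetRecord:
    dataset_id: str
    kind: str                           # "raw", "processed", or "output"
    keywords: List[str]
    derived_from: Optional[str] = None  # id of the parent dataset, if any


def search(catalog: Dict[str, DatasetRecord], keyword: str) -> List[str]:
    """Return ids of all datasets tagged with `keyword`."""
    return [r.dataset_id for r in catalog.values() if keyword in r.keywords]


def trace_to_raw(catalog: Dict[str, DatasetRecord], dataset_id: str) -> List[str]:
    """Follow derived_from links until a dataset with no parent is reached."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = dataset_id
    while current is not None:
        if current in seen:
            raise ValueError(f"circular provenance at {current}")
        seen.add(current)
        record = catalog[current]
        chain.append(record.dataset_id)
        current = record.derived_from
    return chain


# Hypothetical three-step chain: raw -> processed -> output.
catalog = {
    r.dataset_id: r
    for r in [
        DatasetRecord("raw-001", "raw", ["detector", "strain"]),
        DatasetRecord("proc-001", "processed", ["strain", "filtered"], "raw-001"),
        DatasetRecord("out-001", "output", ["candidate-events"], "proc-001"),
    ]
}

lineage = trace_to_raw(catalog, "out-001")
hits = search(catalog, "strain")
```

     The sketch makes the slide's question concrete: tracking only works if metadata is captured for every kind of data (raw, processed, output), because a missing record breaks the chain.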
  22. Understand the research workflow
     • Interview the scientist
       – Listening (good listening skills)
       – Asking questions (don’t be afraid of asking questions)
       – Use your librarian brain to ingest the conversation:
         • How does the research flow from one point to the next?
         • What constitutes the research input and output at each stage of research in terms of data?
     Mapping out the knowledge v0.1
  23. Mapping out the knowledge v0.2
     Mapping out the knowledge v0.3
  24. Mapping out the knowledge v1.0
     Lessons learned
     • Science is learnable even if you don’t have a subject background
       – Learn enough to understand the research process and workflow
     • Scientists are eager to get help
     • Librarians need to be technical-minded
       – Data, metadata, database
       – Structures, models, workflows
     • Librarians need to be good listeners while staying good conversation leaders
       – Know when and how to lead the conversation to get what you need for data management planning and implementation
       – Do your homework on the subject so that you can be an intelligent listener
  25. Case Discussions for Working Lunch
     Case Study #1: To build or not to build a data repository?
     A university library has developed an institutional repository for preserving and providing access to the scholarly output of the researchers in this institution. A new challenge now arises from e-science research: funding agencies demand data management plans, and authors and users want publications linked to data. You already know that some faculty use their disciplinary data repository for submitting their datasets (e.g., GenBank for microbiology research data). The problem you face now is whether an institutional data repository should be built for those who do “small science” and have neither the funding nor the expertise to manage their data.
     Questions to be addressed:
     • What are the strategies you will use to approach the problem?
     • What are the possible solutions for the problem?
     • What are some of the tradeoffs for the solutions you will adopt?
  26. Case study #2: Developing a data taxonomy
     The concept of research data management is a stranger to many faculty as well as your library staff. What is data? What is a data set? These seemingly simple terms can be very confusing and have different interpretations in different contexts and disciplines. As part of the data management strategies, you decide to develop an authoritative data taxonomy for the campus research community. This data taxonomy will benefit the creation and use of institutional data policies, data repository or repositories, and data management plans required by funding agencies.
     Questions to be addressed:
     • What should the data taxonomy include?
     • What form should it take, a database-driven website or a static HTML page?
     • Who should be the constituencies in this process?
     • Who will be the maintainer once the taxonomy is released?
     Case study #3: Developing a data policy
     Data policies play an important role in governing how the data will be managed, shared, and accessed. They are also an instrument for fending off potential legal problems. Data policies have several types: data access and use, data publishing, and data management. Your university’s Office of Sponsored Research has some existing policy on data, but it is neither systematic nor complete. Many of the terms were defined years ago and do not cover new areas such as the embargo period of data. As the university has decided to build a data repository for managing and preserving datasets, a data policy has become one of the top priorities for both the institution and the data repository.
     Questions to be addressed:
     • What should the data policy include?
     • Who should be the constituencies in this process?
     • Who will be the interpretation authority for the data policy?
  27. Case study #4: Cataloging datasets
     Describing datasets is the process of creating metadata for datasets. In scientific disciplines, several metadata standards have been developed, e.g., the Content Standard for Digital Geospatial Metadata (CSDGM), Darwin Core, and Ecological Metadata Language (EML). Each of these metadata standards contains hundreds of elements and requires both metadata and subject knowledge training in order to use it. Besides, creating one record using any of these standards requires a tremendous time investment. But your library has neither such specialized personnel nor the funds to hire new staff for the job. The existing staff have some general metadata skills such as Dublin Core. In deciding the metadata schema for your data repository, you need to address these questions:
     • Should I adopt a scientific metadata standard or develop one tailored to our need?
     • How can I learn what metadata elements are critical to dataset submitters and searchers?
     • What are some of the benefits and disadvantages of adopting a standard or developing a local schema?
     Case study #5: Evaluating data repository tools
     Research data as a driving force for e-science is inherently a tool-intensive field. Tools related to data management can be divided into two broad categories: those for creating metadata records and those for data repository management. An academic institution decided to build its own data repository as part of the supporting service for researchers to meet the data management plan requirement of funding agencies. This data repository development task was handed down to the library. You, the library director, have to decide whether to develop an in-house system or use an off-the-shelf software system. As usual, you put together a taskforce to find a solution to this challenge.
     The questions to be addressed by the taskforce include:
     • What are the options available to us?
     • What evaluation criteria are the most important to our goal?
     • What are the limitations for us to adopt one option or the other?
     • How will this option interoperate with the existing institutional repository system? Or, can the existing repository system be used for data repository purposes?
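     For Case study #4, the following sketch shows what a minimal dataset description could look like if staff start from the Dublin Core skills they already have. It uses the Dublin Core element namespace with Python's standard library. The `metadata` wrapper element, the choice of elements, and the `subject` and `format` values are assumptions for illustration; only the title is taken from the Cayuga Lake dataset example earlier in the slides.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> str:
    """Serialize a flat {element: value} mapping as simple Dublin Core XML."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for element_name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element_name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Title from the slides; the other values are hypothetical placeholders.
example_fields = {
    "title": "Cayuga Lake (New York) Water Quality Monitoring Data Related "
             "to the Cornell Lake Source Cooling Facility, 1998-2009",
    "subject": "water quality",
    "format": "text/csv",
}

record_xml = dublin_core_record(example_fields)
parsed = ET.fromstring(record_xml)
```

     A record this small is quick to create but cannot express the domain detail that EML or CSDGM capture, which is exactly the tradeoff the case study asks the group to weigh.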
