
Why Data Science Matters - 2014 WDS Data Stewardship Award Lecture


A presentation reviewing technical trends in data management, publication, and citation, and methodologies for data interoperability, research provenance, and semantic eScience.

Published in: Science

  1. Why Data Science Matters • Xiaogang (Marshall) Ma, Tetherless World Constellation (TWC), Rensselaer Polytechnic Institute • Twitter: @MarshallXMa • ICSU-WDS Data Stewardship Award Lecture, SciDataCon 2014, New Delhi, India, Nov. 02-05
  2. Acknowledgements • Dr. Mustapha Mokrane and Dr. Simon Hodson • Colleagues at TWC/RPI, CODATA-ECDP, ESIP, CGI-IUGS, AGU/ESSI, ICSU-WDS, RDA, ITC, and more • My mentor Prof. Peter Fox • My family • All of you
  3. Outline • Technical trends – Data management, publication & citation • Methodology – Interoperability & provenance • Data management is just a start – Data analysis – Semantic eScience
  4. Data Management – data work (image courtesy Randy Glasbergen)
  5. Data Management Plan • Data Management Plan (DMP) – A formal document that outlines what you will do with your data during and after you complete your research • Resources and tools that help create DMPs – NSF Data Management Plan Requirements – DCC Data Management Plans – DMPTool – DCC DMPOnline
  6. Data Publication • Data as first-class products of research – e.g., NSF bio-sketches can include data publications
  7. “All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.” “…authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications.” “…authors must make materials, data, and associated protocols available to readers.” “…it is a condition of publication that authors make available the data and research materials supporting the results in the article.” “…require authors to make all data underlying the findings described in their manuscript fully available without restriction…” “Earth and space science data should be widely accessible in multiple formats and long‐term preservation of data is an integral responsibility of scientists and sponsoring institutions.” “…support the principle that research data should be made freely available to all researchers…” “…recommends depositing data that correspond to journal articles in reliable data repositories…”
  8. Ways of data publication – Data as supplemental material of a paper – Standalone data – Data paper: data in a repository + descriptive ‘data paper’ • Examples: – Standalone data journals: Nature Scientific Data, Geoscience Data Journal, Ecological Archives, Data in Brief … – Journals that publish data papers: Earth and Space Science, GigaScience, F1000 Research, Internet Archaeology … Strasser, GeoData 2014 Workshop Presentation (2014)
  9. An isolated data island?!
  10. Data Citation • Data Citation Index – Indexes the world's leading data repositories – Connects datasets to related refereed literature indexed in the Web of Science™ – Efficient access to data across subjects and regions
  11. Data Interoperability • Interoperability: “Data should be discoverable, accessible, decodable, understandable and usable, and data sharing should be legal and ethical for all participants.” Ma et al., Nature Geoscience (2011)
  12. Provenance of Research • Provenance documentation: “Linking a range of observations and model outputs, research activities, people and organizations involved in the production of scientific findings with the supporting data sets and methods used to generate them” Image from Ma et al., Nature Climate Change (2014)
  13. IPython Notebook: a web-based interactive computational environment – Codes, APIs, datasets, text… PDF document • We extended the IPython Notebook environment to enable automatic provenance capture during a scientific workflow. Di Stefano et al., ESIP 2014 Summer Meeting Presentation (2014)
  14. [Image-only slide]
  15. Semantic eScience • Artificial Intelligence accelerates scientific discovery – Data search, synthesis and hypothesis representation – Data analysis: reasoning with models of the data. Gil et al., Science (2014) • A state-of-the-art example: Hanalyzer (high-throughput analyzer) – Uses natural language processing to automatically extract a semantic network from all PubMed papers relevant to a scientist – Uses Semantic Web technology to integrate assertions from other biomedical sources – Reasons about the network to find new correlations that suggest new genes to investigate. Leach et al., PLoS Comput Biol (2009)
  16. Deep Carbon Virtual Observatory • A cyber-enabled platform for linked science. Fox, RDA Fourth Plenary Meeting Presentation (2014)
  17. Summary • Data as first-class products of research • eScience: the digital or electronic facilitation of science • Semantic eScience – A virtuous circle between science and semantic technologies – Data driven + Knowledge driven? Image courtesy @WileyExchanges
  18. More information: Marshall X Ma • Thank you!
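
Slides 12 and 13 describe provenance documentation and its automatic capture during a scientific workflow. The sketch below is only an illustration of that general idea in plain Python; it is not the IPython Notebook extension presented by Di Stefano et al., whose design is not shown in these slides. The names `PROVENANCE_LOG`, `capture_provenance`, `to_anomalies`, and `warmest` are hypothetical, the record keys loosely borrow W3C PROV terms (activity, agent, used, generated), and the temperature values are made-up sample numbers.

```python
import hashlib
import json
from datetime import datetime, timezone
from functools import wraps

# Hypothetical in-memory store; a real system would persist these records.
PROVENANCE_LOG = []


def _digest(value):
    """Short SHA-256 fingerprint of a value's repr, used to identify data."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()[:12]


def capture_provenance(agent):
    """Decorator that records who ran which step on which inputs and outputs."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            PROVENANCE_LOG.append({
                "activity": func.__name__,
                "agent": agent,
                "used": [_digest(a) for a in args]
                        + [_digest(v) for _, v in sorted(kwargs.items())],
                "generated": _digest(result),
                "started": started,
                "ended": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@capture_provenance(agent="analyst")
def to_anomalies(temperatures):
    """Subtract the mean from each value (illustrative analysis step)."""
    mean = sum(temperatures) / len(temperatures)
    return [round(t - mean, 2) for t in temperatures]


@capture_provenance(agent="analyst")
def warmest(anomalies):
    """Pick the largest anomaly (illustrative second step)."""
    return max(anomalies)


# Made-up sample values, for illustration only.
anomalies = to_anomalies([14.1, 14.3, 14.8])
peak = warmest(anomalies)
print(json.dumps(PROVENANCE_LOG, indent=2))
```

Because the fingerprint that the first step records under `generated` reappears under `used` in the second step, the two records can be chained to trace a finding back to the data it came from, which is the kind of linking the provenance slide describes.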