The document discusses emerging practices around citing data in scholarly publications. It outlines principles for data citation, including treating data as a first-class scholarly object and facilitating attribution, discovery, access, and provenance of data. Current gaps and infrastructure such as DataCite, FigShare, and the Dataverse Network are described, as well as emerging developments such as integrated data publication workflows.
BROWN BAG TALK WITH MICAH ALTMAN: INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS (Micah Altman)
This talk is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu).
This talk discusses findings from an analysis of data sharing and citation policies in Open Access journals and describes a set of novel tools for open data publication in open access journal workflows. Bring your lunch and enjoy a discussion fit for scholars, Open Access fans, and students alike.
Dr. Micah Altman is Director of Research and Head/Scientist, Program on Information Science, for the MIT Libraries at the Massachusetts Institute of Technology.
BROWN BAG TALK WITH MICAH ALTMAN: SOURCES OF BIG DATA FOR SOCIAL SCIENCES (Micah Altman)
This talk is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu).
This talk reviews emerging big data sources for social scientific analysis and explores the challenges these present. Many of these sources pose distinct challenges for acquisition, processing, analysis, inference, sharing, and preservation.
Dr. Micah Altman is Director of Research and Head/Scientist, Program on Information Science, for the MIT Libraries at the Massachusetts Institute of Technology. Dr. Altman is also a Non-Resident Senior Fellow at The Brookings Institution. Prior to arriving at MIT, Dr. Altman served at Harvard University for fifteen years as Associate Director of the Harvard-MIT Data Center, Archival Director of the Henry A. Murray Archive, and Senior Research Scientist in the Institute for Quantitative Social Science.
Dr. Altman conducts research in social science, information science and research methods -- focusing on the intersections of information, technology, privacy, and politics; and on the dissemination, preservation, reliability and governance of scientific knowledge.
"Reproducibility from the Informatics Perspective" (Micah Altman)
Dr. Altman will provide expert comment on the need for informatics modeling as part of the National Academies workshop: Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results
This workshop addresses statistical challenges in assessing and fostering the reproducibility of scientific results by examining three issues from a statistical perspective: the extent of reproducibility, the causes of reproducibility failures, and potential remedies.
Lesson 8 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Lesson 7 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Managing Confidential Information – Trends and Approaches (Micah Altman)
Personal information is ubiquitous and it is becoming increasingly easy to link information to individuals. Laws, regulations and policies governing information privacy are complex, but most intervene through either access or anonymization at the time of data publication.
Trends in information collection and management -- cloud storage, "big" data, and debates about the right to limit access to published but personal information -- complicate data management and make traditional approaches to managing confidential data less and less effective.
This session, presented as part of the Program on Information Science seminar series, examines trends in information privacy and discusses emerging approaches and research around managing confidential research information throughout its lifecycle.
Reproducibility from an Informatics Perspective (Micah Altman)
Scientific reproducibility is most often viewed through a methodological or statistical lens and, increasingly, through a computational lens. Over the last several years, I've taken part in collaborations that approach reproducibility from the perspective of informatics: as a flow of information across a lifecycle that spans collection, analysis, publication, and reuse.
These slides sketch this approach; they were presented at a recent workshop on reproducibility at the National Academy of Sciences and at one of our Program on Information Science brown bag talks. See: informatics.mit.edu
February 18, 2015 NISO Virtual Conference, Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Using data management plans as a research tool: an introduction to the DART Project
Amanda L. Whitmire, Ph.D., Assistant Professor, Data Management Specialist, Oregon State University Libraries & Press
Slides describing Force11 work and the background of several of the speakers, used for talks at the University of Lethbridge, at Carnegie Mellon, and internally at Elsevier.
DataONE Education Module 03: Data Management Planning (DataONE)
Lesson 3 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Doing for Data what PubMed did for literature: DATS, a model for dataset description, dataset indexing, and data discovery.
Google Slides [https://goo.gl/cd5KKa] or SlideShare [https://goo.gl/c8DH5N]
DataONE Education Module 01: Why Data Management? (DataONE)
Lesson 1 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Data Publishing at Harvard's Research Data Access Symposium (Mercè Crosas)
Data Publishing: The research community needs reliable, standard ways to make the data produced by scientific research available to the community, while giving credit to data authors. As a result, a new form of scholarly publication is emerging: data publishing. Data publishing - or making data reusable, citable, and accessible for long periods - is more than simply providing a link to a data file or posting the data to the researcher’s web site. We will discuss best practices, including the use of persistent identifiers and full data citations, the importance of metadata, the choice between public data and restricted data with terms of use, the workflows for collaboration and review before data release, and the role of trusted archival repositories. The Harvard Dataverse repository (and the Dataverse open-source software) provides a solution for data publishing, making it easy for researchers to follow these best practices, while satisfying data management requirements and incentivizing the sharing of research data.
Data Citation Implementation Guidelines, by Tim Clark (datascienceiqss)
This talk presents a set of detailed technical recommendations for operationalizing the Joint Declaration of Data Citation Principles (JDDCP) - the most widely agreed set of principle-based recommendations for direct scholarly data citation.
We will provide initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data.
We hope that these recommendations along with the new NISO JATS document schema revision, developed in parallel, will help accelerate the wide adoption of data citation in scholarly literature. We believe their adoption will enable open data transparency for validation, reuse and extension of scientific results; and will significantly counteract the problem of false positives in the literature.
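As a concrete illustration of the machine actionability the recommendations aim at: a DOI can be resolved through content negotiation at doi.org, where an Accept header asks the resolver for machine-readable metadata or a formatted citation rather than the dataset's landing page. A minimal sketch (not drawn from the talk itself; the DOI below is a hypothetical placeholder):

```python
from urllib.request import Request

def metadata_request(doi: str,
                     media_type: str = "application/vnd.datacite.datacite+json") -> Request:
    """Build a content-negotiation request for machine-readable DOI metadata.

    Resolving https://doi.org/<doi> with an Accept header such as
    'application/vnd.datacite.datacite+json' or 'text/x-bibliography'
    requests metadata or a formatted citation instead of the landing page.
    """
    return Request(f"https://doi.org/{doi}", headers={"Accept": media_type})

# Hypothetical DOI, used purely for illustration.
req = metadata_request("10.1234/example", media_type="text/x-bibliography")
print(req.full_url)              # https://doi.org/10.1234/example
print(req.get_header("Accept"))  # text/x-bibliography
```

Opening the request with `urllib.request.urlopen` would then return the negotiated representation, assuming the registered DOI supports that media type.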
Lesson 2 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Going Full Circle: Research Data Management @ University of Pretoria (Johann van Wyk)
Presentation delivered at the eResearch Africa Conference, held 23-27 November 2014 at the University of Cape Town, Cape Town, South Africa.

Various approaches to Research Data Management at Higher Education Institutions focus on only an aspect or two of the research data cycle. At the University of Pretoria the approach has been to support researchers throughout the research process, covering the whole research data cycle. The idea is to facilitate and capture the research data throughout the research cycle, which gives context to the data and adds provenance. The University of Pretoria uses the UK Data Archive's research data cycle model to align its Research Data Management project development. This model identifies the stages of a research data cycle as: creating data, processing data, analysing data, preserving data, giving access to data, and reusing data.

This paper gives a short overview of the chronological development of research data management at the University of Pretoria, highlighting findings of two surveys done at the University, one in 2009 and one in 2013. This is followed by a discussion of a number of pilot projects at the University, of how the needs of researchers involved in these projects are being addressed in a number of the stages of the research data cycle, and of how the University plans to support the stages not currently being addressed.

The second part of the presentation focuses on the projects and technology (software and hardware) used. The University of Pretoria has adopted an Enterprise Content Management (ECM) approach to manage its research data. ECM is not a singular platform or system but rather a set of strategies, tools, and methodologies that interoperate with each other to create a comprehensive management tool: an all-encompassing process addressing document, web, records, and digital asset management.
At the University of Pretoria we address all these processes with different software suites and tools to create a complete management system. Each process presented its own technical challenges, which had to be addressed while keeping in mind the end objective of supporting researchers throughout the whole research process and data life cycle. Various platforms and standards have been adopted to meet the University of Pretoria's criteria. To date, three processes have been addressed: the capturing of data during the research process, the dissemination of data, and the preservation of data.
This presentation was provided by Tim McGeary of Duke University during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, Astrophysics and Solar Physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume-related challenges, including the use of high-performance computing. In this talk, we will mainly focus on the other challenges, from the perspective of collaborative sharing and reuse of a broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration, and discovery capabilities. We will borrow examples of tools and capabilities from state-of-the-art work in supporting physicists (including astrophysicists) [1], life sciences [2], and material sciences [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied and practice-oriented talk will complement more vision-oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
DataONE Education Module 10: Legal and Policy Issues (DataONE)
Lesson 10 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Our regular Introduction to Data Management (DM) workshop (90 minutes). Covers very basic DM topics and concepts. The audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.
DataTags: Sharing Privacy-Sensitive Data, by Michael Bar-Sinai (datascienceiqss)
The DataTags framework makes it easy for data producers to deposit, data publishers to store and distribute, and data users to access and use datasets containing confidential information, in a standardized and responsible way. The talk will first introduce the concepts and tools behind DataTags, and then focus on the user-facing component of the system, the Tagging Server (available today at datatags.org). We will conclude by describing how future versions of Dataverse will use DataTags to automatically handle sensitive datasets that can only be shared under some restrictions.
Functional and Architectural Requirements for Metadata: Supporting Discovery... (Jian Qin)
The tremendous growth in digital data has led to an increase in metadata initiatives for different types of scientific data, as evident in Ball’s survey (2009). Although individual communities have specific needs, there are shared goals that need to be recognized if systems are to effectively support data sharing within and across all domains. This paper considers this need and explores systems requirements that are essential for metadata supporting the discovery and management of scientific data. The paper begins with an introduction and a review of selected research specific to metadata modeling in the sciences. Next, the paper’s goals are stated, followed by the presentation of valuable systems requirements. The results include a base model with three chief principles: the principle of least effort, infrastructure service, and portability. The principles are intended to support “data user” tasks. Results also include a set of defined user tasks and functions, and application scenarios.
Deliver Perfect Images At Any Size
with Anne Thomas
Out of the Sandbox
Overview
One of the most difficult aspects of developing for different screen sizes is the need to serve high-quality images without slowing down the browsing experience. Websites are becoming more image-heavy every year, and with the popularity of content management systems growing, we don’t always have the luxury of complete control over the image sizes that are uploaded. Anne will share some of the tricks she has learned over the years to achieve the ideal combination for responsive images: fast, good, and cheap.
Objective
Help you build sites that deliver high-quality images regardless of screen size with modern techniques (and even support Internet Explorer!)
Five Things Audience Members Will Learn
Alternatives to JPGs and the pros and cons
How to speed up load times for images
Modern methods to display images beyond the usual img element
How to generate correct image size for every device
Handy comparison of JS libraries to support older browsers
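The "correct image size for every device" point above is usually implemented with the img element's srcset and sizes attributes, which let the browser pick the smallest pre-rendered image that still fills the layout slot. A hedged sketch of generating such an attribute value, assuming a hypothetical naming convention where each rendition lives at <base>-<width>w.jpg:

```python
def build_srcset(base_url: str, widths: list[int]) -> str:
    """Build an HTML srcset attribute value from pre-rendered image widths.

    Assumes a (hypothetical) URL convention of <base>-<width>w.jpg per
    rendition; each candidate is annotated with its intrinsic width so
    the browser can choose based on slot size and device pixel density.
    """
    return ", ".join(f"{base_url}-{w}w.jpg {w}w" for w in sorted(widths))

srcset = build_srcset("/images/hero", [480, 960, 1920])
# A full-bleed image: 'sizes="100vw"' tells the browser the slot spans
# the viewport, so it can match a srcset candidate to the actual device.
print(f'<img src="/images/hero-960w.jpg" srcset="{srcset}" sizes="100vw" alt="...">')
```

The same helper could feed a CMS template, sidestepping the lack of control over uploaded image sizes by always emitting the renditions the pipeline actually produced.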
Metadata and data citation. Session 2.5 of the RDMRose v3 materials.
The JISC-funded RDMRose project (June 2012-May 2013) was a collaboration between the libraries of the Universities of Leeds, Sheffield and York, with the Information School at Sheffield, to provide an Open Educational Resource for information professionals on Research Data Management. The materials were revised between November 2014 and February 2015 for the consortium of North West Academic Libraries (NoWAL).
http://www.sheffield.ac.uk/is/research/projects/rdmrose
[4.1] Data Citation and DOIs - Research Data Management - part of PhD course...3TU.Datacentrum
Training about Data Archive
You will learn:
What data citation is, and what the benefits are.
How to use DOIs for data citation.
How to cite a dataset
How to find publications with DOIs
How to link your publications to your dataset (and vice versa) using DOIs
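The elements of a dataset citation can be sketched as follows (an illustrative example only, not part of the 3TU.Datacentrum materials; the field layout follows the common creator/year/title/version/publisher/DOI pattern, and the example dataset and DOI are hypothetical):

```python
# Assemble a data citation from its recommended elements.
# The dataset, creator, and DOI below are invented for illustration.
def format_data_citation(creator, year, title, publisher, version, doi):
    """Return a citation string: Creator (Year): Title. Version. Publisher. DOI."""
    return f"{creator} ({year}): {title}. Version {version}. {publisher}. https://doi.org/{doi}"

citation = format_data_citation(
    creator="Dijkstra, A.M.",
    year=2013,
    title="North Sea wave-height measurements",
    publisher="3TU.Datacentrum",
    version="1.0",
    doi="10.4121/uuid:0000-example",
)
print(citation)
```

Because the DOI is rendered as a resolvable `https://doi.org/` URL, the citation links the publication back to the dataset's landing page.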
PLoS ONE Piwowar: Sharing Detailed Research Data Is Associated with Increa...Heather Piwowar
Heather A Piwowar, Roger S Day, Douglas B Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate PLoS ONE 2: 3. e308
Abstract: Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available. We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression. This correlation between publicly available data and increased literature impact may further motivate investigators to share their detailed research data.
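The kind of regression setup the abstract describes (citation count modeled on data availability plus controls) can be sketched as follows. This is illustrative only: the data below are synthetic with a built-in effect, not the study's data, and the study's actual model included further covariates.

```python
# Synthetic illustration of regressing citations on data availability
# while controlling for journal impact factor.
import numpy as np

rng = np.random.default_rng(0)
n = 200
shared = rng.integers(0, 2, n)      # 1 if the trial's data were public
impact = rng.uniform(1, 10, n)      # journal impact factor (control)

# Outcome constructed with a known sharing effect of +5 citations:
citations = 10 + 5 * shared + 2 * impact

X = np.column_stack([np.ones(n), shared, impact])  # intercept + predictors
coef, *_ = np.linalg.lstsq(X, citations, rcond=None)
print(coef)  # recovers [10, 5, 2]; the middle term is the sharing effect
```

The coefficient on `shared` isolates the association between public data and citations net of the controls, which is the quantity the study reports.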
DataONE Education Module 09: Analysis and WorkflowsDataONE
Lesson 9 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
Findable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
FAIR Data Management and FAIR Data SharingMerce Crosas
Presentation at the Critical Perspectives on the Practice of Digital Archaeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
State of the Art Informatics for Research Reproducibility, Reliability, and...Micah Altman
In March, I had the pleasure of being the inaugural speaker in a new lecture series (http://library.wustl.edu/research-data-testing/dss_speaker/dss_altman.html) initiated by the Libraries at Washington University in St. Louis -- dedicated to the topics of data reproducibility, citation, sharing, privacy, and management.
In the presentation embedded below, I provide an overview of the major categories of new initiatives to promote research reproducibility, reliability, and reuse and related state of the art in informatics methods for managing data.
Metadata and Metrics to Support Open AccessMicah Altman
This presentation, invited for a workshop on Open Access and Scholarly Books (sponsored by the Berkman Center and Knowledge Unlatched), provides a very brief overview of metadata design principles, approaches to evaluation metrics, and some relevant standards and exemplars in scholarly publishing. It is intended to provoke discussion on approaches to evaluation of the use, characteristics, and value of OA publications.
Crediting informatics and data folks in life science teamsCarole Goble
Science Europe LEGS Committee: Career Pathways in Multidisciplinary Research: How to Assess the Contributions of Single Authors in Large Teams, 1-2 Dec 2015, Brussels
The People Behind Research Software crediting from the informatics, technical point of view
Data Communities - reusable data in and outside your organization.Paul Groth
Description
Data is critical both to the functioning of an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There are a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data-reuse proxies at scale, understanding data sensemaking, and looking at how researchers search for data), I talk about which practices are a good place to start for helping others reuse your data. I put this in the context of the notion of data communities, which organizations can use to foster the use of data both internally and externally.
On November 21st 2014 at the Tufts University Medford campus and November 25th 2014 at the campus of the University of Massachusetts Medical School in Worcester, the BLC and Digital Science hosted a workshop focused on better understanding the research information management landscape.
Jonathan Breeze, CEO of Symplectic, reflected on the emergence of research information management systems and the resulting benefits they can provide.
This was part of a webinar from the Materials Research Society on Machine Learning, AI, and Data-Driven Materials Development and Design. The spoken content (including Q&A) is available through MRS.
Privacy in Research Data Management - Use CasesMicah Altman
From Integrating Approaches to Privacy across the Research Lifecycle http://privacytools.seas.harvard.edu/fall-2013-workshop
This workshop will consider how emerging tools and perspectives from a variety of disciplines, such as computer science, social science, law, and the health sciences, should be integrated in the management of confidential research data. Multidisciplinary discussion groups will grapple with these issues in the context of exemplar research use cases.
This presentation was provided by Lisa Johnston, University of Minnesota, for a NISO Virtual Conference on data curation held on Wednesday, August 31, 2016
Talk at a JISC Repositories conference, intended for repository managers or research managers, on some of the issues involved. The talk originally had to be given unaided because of a technology problem!
Doing research better: The role of meta‐dataGarethKnight
Presentation given by David Leon, Professor of Epidemiology at the London School of Hygiene and Tropical Medicine in January 2012. Subsequently reused at various internal events
Selecting efficient and reliable preservation strategiesMicah Altman
This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, hierarchical modeling, and then use empirically calibrated sensitivity analysis to identify effective strategies.
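A back-of-envelope version of the kind of quantitative question this framework addresses might look as follows. This is an illustrative sketch only, far simpler than the paper's combination of formal modeling, discrete-event simulation, hierarchical modeling, and calibrated sensitivity analysis, and it assumes independent replica failures:

```python
# Toy replication model: if each independent replica of a digital object is
# lost in a given period with probability p, the object is lost only when
# all replicas fail. How many replicas push loss probability below a target?
def replicas_needed(p_loss_per_copy: float, target: float) -> int:
    n, p_all_lost = 1, p_loss_per_copy
    while p_all_lost > target:
        n += 1
        p_all_lost *= p_loss_per_copy  # independence assumption
    return n

# e.g. 5% per-copy loss per period, one-in-a-million acceptable loss:
print(replicas_needed(0.05, 1e-6))  # -> 5
```

Real-world threats are correlated (shared software, organizations, legal regimes), which is precisely why the paper replaces this independence assumption with empirically calibrated models.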
This discussion, convened by the Dubai Future Foundation, focuses on identifying the significance of the concept of well-being for social science and policy, and the opportunities to measure it at scale.
Matching Uses and Protections for Government Data Releases: Presentation at t...Micah Altman
In the work included below, and presented at the Simons Institute, we describe work in progress that aims to align emerging methods of data protection with research uses.
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Micah Altman
Libraries enable patrons to access a wide range of information, but much of the access to this information is now directly managed by publishers. This has led to a significant gap across library values, patrons' perceptions of privacy, and effective privacy protection for access to digital resources.
In the work included below, and presented at NERCOMP 2019, we review privacy principles based on ALA, IFLA, and NISO policies. We then organize and compare the high-level privacy protections required by the ALA checklist, NISO, and the GDPR. This framework of principles and controls is then used to score the privacy policies and practices of major vendors of research library content. We evaluate each element of the vendors' privacy policies, and use instrumented browsers to identify the types of tracking mechanisms used by different vendors. We use this set of privacy scores to support analyses of change over time, and of potential gaps between patron expectations and privacy policies and practices.
Presentation by Philip Cohen on collaborative work with Micah Altman as part of the MIT CREOS research talk series. Presented in fall 2018, in Cambridge, MA.
Contemporary journal peer review is beset by a range of problems. These include (a) long delay times to publication, during which time research is inaccessible; (b) weak incentives to conduct reviews, resulting in high refusal rates as the pace of journal publication increases; (c) quality control problems that produce both errors of commission (accepting erroneous work) and omission (passing over important work, especially null findings); (d) unknown levels of bias, affecting both who is asked to perform peer review and how reviewers treat authors, and; (e) opacity in the process that impedes error correction and more systematic learning, and enables conflicts of interest to pass undetected. Proposed alternative practices attempt to address these concerns -- especially open peer review, and post-publication peer review. However, systemic solutions will require revisiting the functions of peer review in its institutional context.
Presentation by Philip Cohen and Micah Altman on developing an exchange system for peer review in support for open science. Prepared for presentation at the ACRL-SSRC meeting on Open scholarship in the social sciences. Washington DC, Dec 2018
Redistricting in the US -- An OverviewMicah Altman
This presentation was prepared for the International Seminar on Electoral Districting, National Electoral Institute El Colegio de México. http://www.ine.mx/seminario-internacional-distritacion-electoral/
This presentation was prepared for the International Seminar on Electoral Districting, National Electoral Institute El Colegio de México. http://www.ine.mx/seminario-internacional-distritacion-electoral/
A History of the Internet: Scott Bradner’s Program on Information Science Talk Micah Altman
Scott Bradner is a Berkman Center affiliate who worked for 50 years at Harvard in the areas of computer programming, system management, networking, IT security, and identity management. He was involved in the design, operation, and use of data networks at Harvard University from the early days of the ARPANET and served in many leadership roles in the IETF. He presented the talk recorded below, entitled A History of the Internet, as part of the Program on Information Science Brown Bag Series:
Bradner abstracted his talk as follows:
In a way the Russians caused the Internet. This talk will describe how that happened (hint it was not actually the Bomb) and follow the path that has led to the current Internet of (unpatchable) Things (the IoT) and the Surveillance Economy.
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...Micah Altman
The web is now firmly established as the primary communication and publication platform for sharing and accessing social and cultural materials. This networked world has created both opportunities and pitfalls for libraries and archives in their mission to preserve and provide ongoing access to knowledge. How can the affordances of the web be leveraged to drastically extend the plurality of representation in the archive? What challenges are imposed by the intrinsic ephemerality and mutability of online information? What methodological reorientations are demanded by the scale and dynamism of machine-generated cultural artifacts? This talk will explore the interplay of the web, contemporary historical records, and the programs, technologies, and approaches by which libraries and archives are working to extend their mission to preserve and provide access to the evidence of human activity in a world distinguished by the ubiquity of born-digital materials.
Information Science Brown Bag talks, hosted by the Program on Information Science, consist of regular discussions and brainstorming sessions on all aspects of information science and uses of information science and technology to assess and solve institutional, social, and research problems. These are informal talks. Discussions are often inspired by real-world problems being faced by the lead discussant.
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...Micah Altman
Cassidy Sugimoto is Associate Professor in the School of Informatics and Computing, Indiana University Bloomington, who researches within the domain of scholarly communication and scientometrics, examining the formal and informal ways in which knowledge producers consume and disseminate scholarship. She presented this talk, entitled Labor And Reward In Science: Do Women Have An Equal Voice In Scholarly Communication? A Brown Bag With Cassidy Sugimoto, as part of the Program on Information Science Brown Bag Series.
Despite progress, gender disparities in science persist. Women remain underrepresented in the scientific workforce and under-rewarded for their contributions. This talk will examine multiple layers of gender disparities in science, triangulating data from scientometrics, surveys, and social media to provide a broader perspective on the gendered nature of scientific communication. The extent of gender disparities and the ways in which new media are changing these patterns will be discussed. The talk will end with a discussion of interventions, with a particular focus on the roles of libraries, publishers, and other actors in the scholarly ecosystem.
Utilizing VR and AR in the Library Space: Micah Altman
Matt Bernhardt is a web developer in the MIT libraries and a collaborator in our program. He presented this talk, entitled Reality Bytes - Utilizing VR and AR in The Library Space, as part of Program on Information Science Brown Bag Series.
Terms like "virtual reality" and "augmented reality" have existed for a long time. In recent years, thanks to products like Google Cardboard and games like Pokemon Go, an increasing number of people have gained first-hand experience with these once-exotic technologies. The MIT Libraries are no exception to this trend. The Program on Information Science has conducted enough experimentation that we would like to share what we have learned, and solicit ideas for further investigation.
For slides and comments see: http://informatics.mit.edu/blog
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-NotsMicah Altman
Catherine D'Ignazio is an Assistant Professor of Civic Media and Data Visualization at Emerson College, a principal investigator at the Engagement Lab, and a research affiliate at the MIT Media Lab/Center for Civic Media. She presented this talk, entitled, Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots as part of Program on Information Science Brown Bag Series.
Communities, governments, libraries and organizations are swimming in data—demographic data, participation data, government data, social media data—but very few understand what to do with it. Though governments and foundations are creating open data portals and corporations are creating APIs, these rarely focus on use, usability, building community or creating impact. So although there is an explosion of data, there is a significant lag in data literacy at the scale of communities and citizens. This creates a situation of data-haves and have-nots which is troubling for an open data movement that seeks to empower people with data. But there are emerging technocultural practices that combine participation, creativity, and context to connect data to everyday life. These include data journalism, citizen science, emerging forms for documenting and publishing metadata, novel public engagement in government processes, and participatory data art. This talk surveys these practices both lovingly and critically, including their aspirations and the challenges they face in creating citizens that are truly empowered with data.
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...Micah Altman
Access to high-quality, relevant information is absolutely foundational for a quality education. Yet, so many schools across the developing world lack fundamental resources, like textbooks, libraries, electricity and Internet connectivity. The SolarSPELL (Solar Powered Educational Learning Library) is designed specifically to address these infrastructural challenges, by bringing relevant, digital educational content to offline, off-grid locations. SolarSPELL is a portable, ruggedized, solar-powered digital library that broadcasts a webpage with open-access educational content over an offline WiFi hotspot, content that is curated for a particular audience in a specified locality—in this case, for schoolchildren and teachers in remote locations. It is a hands-on, iteratively developed project that has involved undergraduate students in all facets and at every stage of development. This talk will examine the design, development, and deployment of a for-the-field technology that looks simple but has a quite complex background.
Laura Hosman is Assistant Professor at Arizona State University, holding a joint appointment in the School for the Future of Innovation in Society and in The Polytechnic School. Her work is action-oriented and focuses on the role for information and communications technology (ICT) in developing countries. Presently, she focuses on ICT-in-education projects, and brings her passion for experiential learning to the classroom by leading real-world-focused, project-based courses that have seen student-built technology deployed in schools in Haiti, Vanuatu, Micronesia, Samoa, and Tonga.
Making Decisions in a World Awash in Data: We’re going to need a different bo...Micah Altman
In his abstract, Scriffignano summarizes as follows:
I explore some of the ways in which the massive availability of data is changing and the types of questions we must ask in the context of making business decisions. Truth be told, nearly all organizations struggle to make sense out of the mounting data already within the enterprise. At the same time, businesses, individuals, and governments continue to try to outpace one another, often in ways that are informed by newly-available data and technology, but just as often using that data and technology in alarmingly inappropriate or incomplete ways. Multiple “solutions” exist to take data that is poorly understood, promising to derive meaning that is often transient at best. A tremendous amount of “dark” innovation continues in the space of fraud and other bad behavior (e.g. cyber crime, cyber terrorism), highlighting that there are very real risks to taking a fast-follower strategy in making sense out of the ever-increasing amount of data available. Tools and technologies can be very helpful or, as Scriffignano puts it, “they can accelerate the speed with which we hit the wall.” Drawing on unstructured, highly dynamic sources of data, fascinating inference can be derived if we ask the right questions (and maybe use a bit of different math!). This session will cover three main themes: the new normal (how the data around us continues to change), how we are reacting (bringing data science into the room), and the path ahead (creating a mindset in the organization that evolves). Ultimately, what we learn is governed as much by the data available as by the questions we ask. This talk, both relevant and occasionally irreverent, will explore some of the new ways data is being used to expose risk and opportunity and the skills we need to take advantage of a world awash in data.
The Open Access Network: Rebecca Kennison’s Talk for the MIT Program on Infor...Micah Altman
Rebecca Kennison, who is the Principal of K|N Consultants, the co-founder of the Open Access Network, and was the founding director of the Center for Digital Research and Scholarship, gave this talk on Come Together Right Now: An Introduction To The Open Access Network as part of the Program on Information Science Brown Bag Series.
Gary Price, MIT Program on Information ScienceMicah Altman
Gary Price, who is chief editor of InfoDocket, contributing editor of Search Engine Land, co-founder of Full Text Reports and who has worked with internet search firms and library systems developers alike, gave this talk on Issues in Curating the Open Web at Scale as part of the Program on Information Science Brown Bag Series.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools -- libxml2's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security-analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
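The core idea of pruning uninteresting seed bytes can be sketched as follows. This is a hypothetical illustration, not DIAR's actual algorithm: the toy target below stands in for an instrumented program that reports a coverage signature, and a byte position is flagged as uninteresting when no trial mutation at that position ever changes coverage.

```python
# Toy "instrumented target": only the first two bytes influence control flow,
# so every other byte position is dead weight for a coverage-guided fuzzer.
def toy_target(data: bytes) -> frozenset:
    paths = {"entry"}
    if data[0:1] == b"<":
        paths.add("markup")
    if data[1:2].isdigit():
        paths.add("number")
    return frozenset(paths)

def uninteresting_bytes(seed: bytes, trials=(0x00, 0xFF, 0x41)) -> list:
    """Positions where every trial mutation leaves coverage unchanged."""
    baseline = toy_target(seed)
    boring = []
    for i in range(len(seed)):
        mutated_covs = {
            toy_target(seed[:i] + bytes([t]) + seed[i + 1:]) for t in trials
        }
        if mutated_covs == {baseline}:
            boring.append(i)
    return boring

seed = b"<7 padding padding"
print(uninteresting_bytes(seed))  # every position past the first two bytes
```

Dropping (or masking) the flagged positions shrinks the mutation space the fuzzer explores, which is the intuition behind starting campaigns with lean seeds.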
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally testing in DevOps. We also ran a lovely workshop with the participants, exploring different ways to think about quality and testing in different parts of the DevOps infinity loop.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed of release to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for technology and making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution-engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
1. Prepared for
MIT Libraries Informatics Program Brown Bag Talk
August 2013
Emerging Data Citation Infrastructure
Dr. Micah Altman
<escience@mit.edu>
Director of Research, MIT Libraries
2. DISCLAIMER
These opinions are my own; they are not the opinions
of MIT, Brookings, any of the project funders, nor (with
the exception of co-authored, previously published
work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi,
Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle,
George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White,
etc.
Emerging Data Citation Practices
3. Collaborators & Co-Conspirators
• Merce Crosas, IQSS, Harvard U.
• Data-PASS Steering Committee
<data-pass.org>
• CODATA-ICSTI Task Group on Data Citation
Standards and Practices
<www.codata.org/taskgroups/TGdatacitation/>
• Research Support
– Thanks to the National Academies BRDI
Sponsors: Department of Energy (DOE), Institute
of Museum and Library Services (IMLS), the
Library of Congress (LOC), Microsoft Research,
National Institute of Standards and Technology
(NIST), National Institutes of Health
(NIH), National Oceanic and Atmospheric
Administration (NOAA), National Science
Foundation (NSF), U.S. Geological Survey
(USGS) & the Massachusetts Institute of
Technology.
4. Related Work
• CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013,
“Out of Cite, Out of Mind: The Current State of Practice, Policy, and
Technology for the Citation of Data”, Data Science Journal, forthcoming.
• P. F. Uhlir (Ed.), Developing Data Attribution and Citation Practices and
Standards: Report from an International Workshop (forthcoming),
National Academies Press.
• M. Altman, 2008, "A Fingerprint Method for Verification of Scientific
Data", in Advances in Systems, Computing Sciences and Software
Engineering (Proceedings of the International Conference on Systems,
Computing Sciences and Software Engineering 2007), Springer Verlag.
• Altman, M., & King, G. 2007. A Proposed Standard for the Scholarly
Citation of Quantitative Data. D-Lib Magazine, 13(3/4).
Most reprints available from:
informatics.mit.edu
5. This Talk
• What is data citation? Why Cite?
• Emerging Principles
• On the horizon
6. What’s Wrong with this Picture?
“To test Benet’s (1998) theory of “politically-induced
intelligence” (Benet 1999, pg 8), use a hierarchical
corrected contingency model (see Altman & Smith 2010;
Edgeworth 1863). We apply this model to a snowball
sample (Glass 1973) of eligible voters13, to which the
standard Stanford-Binet (Stanford & Binet 1766) has
been applied. Our results show that adoption of
Pastafarrianism can be expected to increase mean
intelligence by 10.3 points.”
13 We thank Jon Sample, Director of the Pastaffarian
Institute, for supplying this dataset, which is available upon request.
7. “How much slower would scientific progress be if
the near universal standards for scholarly citation of
articles and books had never been
developed? Suppose shortly after publication only
some printed works could be reliably found by
other scholars; or if researchers were only
permitted to read an article if they first committed
not to criticize it, or were required to coauthor with
the original author any work that built on the
original … [If] printed works existed in different
libraries under different titles; if researchers
routinely redistributed modified versions of other
authors' works without changing the title or author
listed; or if publishing new editions of books meant
that earlier editions were destroyed?...” – Altman &
King 2007
8. “Citations to unpublished data and personal
communications cannot be used to support
claims in a published paper”
“All data necessary to understand, assess, and
extend the conclusions of the manuscript must
be available to any reader of Science.”
Ideal
Helping Journals Manage Data
9. Reality
Compliance is low even in
the best examples of journals.
Checking compliance
manually is tedious and hard
to scale.
10. Attribution
• Cite data as first class work
• Identify contributors to data
Discovery
• Associate a persistent id with a
work
• Locate data via identifier
• Locate data integral to article
• Locate works related to data –
articles, derivatives, sources
Persistence
• Reference exists as long as referring object
• Evidence persists as long as assertions
based on evidence?
• Durability of data transparent?
Access
• Citation provides for mediated
access
• Access to surrogate
• On-line access to object
• Machine understandability
• Long-term human
understandability
Provenance
• Associate work with version of
evidence used
• Verify fixity of information
Principles for Data Citation
Theory: Use Cases
Operational Constraints?
- Syntax
- Interoperability
- Technical contexts of use
11. Reference
• Formal syntax used within the text of a publication to denote a relationship
to an external object. May contain additional information about the
portion/subset of external object implicated. Also known as “in-text
reference”, “pin-cite”.
“We applied contingency analysis to the greatest data ever. [Altman 2005]”
Citation
•Formal description of external object, used for location and attribution.
Micah Altman; Karin MacDonald; Michael P. McDonald, 2005, "Computer
Use in Redistricting", hdl:1902.1/AMXGCNKCLU
UNF:3:J0PkMygLPfIyT1E/8xO/EA==
http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU
Citation Metadata
•Metadata that is systematically associated with citation through well-
known public service, catalog, or protocol.
<component_list>
  <component parent_relation="isPartOf">
    <description><b>Figure 1:</b> This is the caption of the first figure...</description>
    <format mime_type="image/jpeg">Web resolution image</format>
  </component>
</component_list>
External Service
•Applications and services that consume, enhance, and aggregate citation
information.
Practice
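The example citation above pairs a persistent identifier with a resolver URL. A minimal sketch of that mapping, assuming a handle-style identifier and the id.thedata.org resolver shown in the example:

```python
from urllib.parse import quote

def resolver_url(persistent_id: str, resolver: str = "http://id.thedata.org/") -> str:
    """Percent-encode a persistent identifier and append it to a
    resolver base URL, yielding a machine-actionable citation link."""
    # safe="" forces ':' and '/' in the identifier to be escaped,
    # so the whole identifier travels as a single path segment.
    return resolver + quote(persistent_id, safe="")

# The identifier from the slide's example citation:
print(resolver_url("hdl:1902.1/AMXGCNKCLU"))
# → http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU
```

This is why the slide's citation and its resolvable URL carry the same identifier: the URL is a deterministic encoding of the citation's identifier, so software can go from one to the other without a lookup.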
12. Analysis Method
2 Workshops
(70+ participants)
+ 1 Literature Review
(400+ resources)
+ 2 Task Groups
NAS & CODATA
(25+ members)
+ 60 Interviews
+ 7 authors
Out of Cite, Out of Mind: The
Current State of Practice, Policy,
and Technology for the Citation of
Data
13. Principles for Data Citation
- Separate
- scientific principles
- use cases
- requirements
- Distinguish
- syntax
- semantics
- presentation
- Design for
- Ecosystem
- Lifecycle
- Stakeholders
- Implement
- Incremental value for incremental effort
- Think globally, act locally
Analysis Approach
14. Principles for Data Citation
1. Status of Data: Data citations should be accorded the same importance in
the scholarly record as the citation of other objects.
2. Attribution: Citations should facilitate giving scholarly credit and legal
attribution to all parties responsible for those data.
3. Persistence: Citations should be as durable as the cited objects.
4. Access: Citations should facilitate access to data by humans and by machines.
5. Discovery: Citations should support the discovery of data and their
documentation.
6. Provenance: Citations should facilitate the establishment of provenance of
data.
7. Granularity: Citations should support the finest grained description
necessary to identify the data.
8. Verifiability: Citations should contain information sufficient to identify the
data unambiguously.
9. Metadata Standards: Citations should employ widely accepted metadata
standards.
10. Flexibility: Citation methods should be sufficiently flexible to accommodate
the variant practices among communities.
Data Citation Principles
15. Principles for Data Citation
• Author.
– The creator of the data set.
• Title.
– As well as the name of the cited resource itself, this may also include the name of a facility and the titles of the top collection and main
parent subcollection (if any) of which the data set is a part.
• Publisher.
– The organization (or repository) either hosting the data or performing quality assurance.
• Publication date.
– Whichever is later: the date the data set was made available, the date all quality assurance procedures were completed, or the date
the embargo period (if applicable) expired. In other standards an “Access Date” field is used to document the date the data set was
successfully accessed.
• Resource type.
– Examples: “database” or “data set.”
• Edition.
– The level or stage of processing of the data, indicating how raw or refined the data set is.
• Version.
– A number increased when the data changes, as the result of adding more data points or rerunning a derivation process, for example.
• Feature name and URI.
– The name of an ISO 19101:2002 “feature” (e.g., GridSeries, ProfileSeries) and the URI identifying its standard definition, used to pick
out a subset of the data.
• Verifier.
– Information, such as a checksum or UNF, used to verify the identity of the content.
• Identifier.
– A resolvable web identifier for the data, according to a persistent scheme. There are several types of persistent identifiers, but the
scheme that is gaining the most traction is the Digital Object Identifier (DOI).
• Location.
– A persistent URL or UNF from which the data set is available. Some identifier schemes provide these via an identifier resolver service.
Citation Metadata Elements
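As a rough illustration of how these elements combine, the sketch below assembles a citation string with a fixity verifier. The publisher value and the data bytes are made-up placeholders, and a plain SHA-256 checksum stands in for a UNF (the real UNF algorithm canonicalizes the data, e.g. by rounding and normalizing encodings, before hashing):

```python
import hashlib

def data_citation(author, title, year, publisher, identifier, data: bytes) -> str:
    """Combine core citation metadata elements into a single string.
    A truncated SHA-256 digest of the data bytes serves as the
    verifier element (a simplified stand-in for a UNF)."""
    verifier = hashlib.sha256(data).hexdigest()[:16]
    return (f'{author}, {year}, "{title}", {publisher}, '
            f'{identifier}, checksum:{verifier}')

citation = data_citation(
    author="Altman, M.; MacDonald, K.; McDonald, M.P.",
    title="Computer Use in Redistricting",
    year=2005,
    publisher="IQSS Dataverse Network",          # hypothetical publisher value
    identifier="hdl:1902.1/AMXGCNKCLU",
    data=b"precinct,computer_use\n1,1\n2,0\n",   # stand-in data file
)
print(citation)
```

The point of the verifier element: anyone who later retrieves the dataset via the identifier can recompute the checksum and confirm they hold the same bytes the citing author used.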
16. Gaps
• Metadata/Structural
– Granularity
– Version Control
– Microattribution
– Contributor ID
– Facilitation of reuse
• Practice
– Author: use of citations to data
– Journals: ad-hoc syntax and location
– Infrastructure: failure to index citations and references to
data, even when associated with DOIs
– Tools: support for datasets in reference managers, etc.
17. Harmonizing Principles & Requirements
DataCite
• DOI
• Creator
• Title
• Publisher
• Publication
Year
Digital Curation Center
1. The citation itself must be able to identify
uniquely the object cited, though
different citations might use
different methods or schemes to do
so.
2. It must be able to identify subsets of
the data as well as the whole
dataset.
3.
a. It must provide the reader with
enough information to access the
dataset;
b. indeed, when expressed digitally
it should provide a mechanism for
accessing the dataset through the
Web infrastructure.
4.
a. It must be usable not only by
humans but also by software tools,
so that additional services may be
built using these citations.
b. In particular, there need to be
services that use the citations in
metrics to support the academic
reward system, and services that can
generate complete citations.
Force 11
• Data should be considered citable
products of research.
• Such data should be held in persistent
public repositories.
• If a publication is based on data not
included with the article, those data
should be cited in the publication.
• A data citation in a publication should
resemble a bibliographic citation and be
located in the publication’s reference list.
• Such a data citation should include a
unique persistent identifier (a DataCite
DOI is recommended, or other persistent
identifiers already in use within the
community).
• The identifier should resolve to a page
that either provides direct access to the
data or information concerning its
accessibility. Ideally, that landing page
should be machine-actionable to
promote interoperability of the data.
• If the data are available in different
versions, the identifier should provide a
method to access the previous or related
versions.
• Data citation should facilitate attribution
of credit to all contributors
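The DataCite fields listed above (DOI, creator, title, publisher, publication year) map onto a minimal metadata record. A sketch of a simplified DataCite-style record built from them; the real DataCite kernel schema adds XML namespaces and richer nesting, and the DOI below uses a test-style prefix, so treat this as illustrative only:

```python
import xml.etree.ElementTree as ET

def datacite_stub(doi, creator, title, publisher, year) -> str:
    """Build a simplified DataCite-style record covering the five
    mandatory fields: Identifier (DOI), Creator, Title, Publisher,
    PublicationYear. Real kernel XML adds namespaces and schema
    locations omitted here."""
    root = ET.Element("resource")
    ET.SubElement(root, "identifier", identifierType="DOI").text = doi
    creators = ET.SubElement(root, "creators")
    ET.SubElement(ET.SubElement(creators, "creator"), "creatorName").text = creator
    titles = ET.SubElement(root, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(root, "publisher").text = publisher
    ET.SubElement(root, "publicationYear").text = str(year)
    return ET.tostring(root, encoding="unicode")

# Hypothetical record for the dataset cited earlier in the deck:
xml_record = datacite_stub("10.5072/FK2EXAMPLE", "Altman, Micah",
                           "Computer Use in Redistricting",
                           "Harvard Dataverse Network", 2005)
print(xml_record)
```

Because the same five elements appear in the DCC and Force 11 recommendations, a record like this is roughly the common denominator the three harmonization efforts converge on.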
18. Current Infrastructure
FigShare
• Closed source
• No charge
• Archives data
• Supports DOIs, ORCIDs
• Preserved in CLOCKSS
Data Citation Index
• Commercial Service
(Thomson Reuters)
• Indexes many large
repositories
(e.g. Data-PASS)
• Beginning to extract
citations from TR
publications
Dataverse Network
• Open Source System
• Hubs run at Harvard and
other universities
• Archives data
• Generates persistent
identifiers (handles; DOIs
forthcoming)
• Generates resolvable
citations
• Versioned
• Harvard Library Dataverse
now part of DataCite and the
Data-PASS preservation
network
DataCite
• DOI registry service
(DOI provider)
• Data DOI metadata
indexing service
(parallel to CrossRef)
• Not-for-profit
membership
organization
• Collaborating with
ORCID-EU to embed
ORCIDs
19. Emerging Developments
Open Journal Data
Publication
• Open source integration
of PKP-OJS and Dataverse
Network
• Uses SWORD
• Integrated data
submission/citation/publication
workflow for OJS
open journals
Journal Developments
• NISO Recommendations on
Supplementary Materials
• Sloan/ICPSR Data Citation Project
• Data-PASS Journal Outreach
• New journal types:
– Registered Replication journals
– Null results journals
– Data journals/data papers
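The OJS-Dataverse integration above deposits data using SWORD, an Atom-based deposit protocol over HTTP. A minimal sketch of constructing (but not sending) such a deposit request; the collection URL, file contents, and account names are hypothetical placeholders:

```python
import urllib.request

def sword_deposit_request(collection_url: str, package: bytes,
                          filename: str, on_behalf_of=None):
    """Build an HTTP POST carrying a packaged dataset to a SWORD
    collection. Headers follow SWORD deposit conventions; the
    request object is returned unsent so it can be inspected."""
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"filename={filename}",
        # The packaging URI tells the server how to unpack the deposit
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
    }
    if on_behalf_of:
        # Mediated deposit: a journal system deposits for an author
        headers["On-Behalf-Of"] = on_behalf_of
    return urllib.request.Request(collection_url, data=package,
                                  headers=headers, method="POST")

req = sword_deposit_request(
    "https://dataverse.example.edu/sword/collection/demo",  # hypothetical endpoint
    package=b"PK-placeholder-zip-bytes",                    # stand-in for a real zip
    filename="study-data.zip",
    on_behalf_of="author@example.edu")
```

The On-Behalf-Of header is what makes an integrated journal workflow possible: the journal platform performs the deposit, while the dataset is credited to the submitting author.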
22. Brightening the “Dark Matter” of Scholarly
Communications
Researcher Identifiers: Developments, Opportunities &
Challenges
Research & Node Layout: Kevin Boyack and Dick
Klavans (mapofscience.com); Data: Thomson ISI;
Graphics & Typography: W. Bradford Paley
(didi.com/brad); Commissioned by Katy Börner
(scimaps.org)
Seed Magazine, Mar 7, 2007
http://seedmagazine.com/content/article/scientific_m
ethod_relationships_among_scientific_paradigms/
• Bibliometric and network analysis are
the “telescopes” for exploring the
structure of science
• Researcher ID’s allow us to see more
connections, more reliably
• Identifiers for datasets, etc. reveal the
“dark matter” of science
Some potential questions:
• Are fields linked through evidence that are
not linked through publications?
• How is the practice of science changing – are
data scientists, statisticians, etc. making
bigger contributions?
• What would be the results of:
– Catalyzing new research collaborations among individuals,
organizations?
– Strengthening support for specific areas of
interdisciplinary research?
– Growing the evidence base in particular areas?
Questions about how the network of contributors and outputs
evolves over time
23. Additional Bibliography (Selected)
• Starr, J., & Gastl, A. (2011). isCitedBy: A metadata scheme for DataCite. D-Lib
Magazine, 17(1/2). doi:10.1045/january2011-starr
• Piwowar, H., Vision, T.J. (2013). Data reuse and the open data citation
advantage. PeerJ PrePrints. 1:e1v1. doi: 10.7287/peerj.preprints.1
• Cronin, B. (1984). The citation process: The role and significance of citations
in scientific publication. London, United Kingdom: Taylor Graham.
• Van Leunen, M. (1992). A handbook for scholars. New York, NY: Oxford
University Press.
This work by Micah Altman (http://micahaltman.com) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Data citation supports attribution, provenance, discovery, access, and persistence. It is not (and should not be) sufficient for all of these things, but it is an important component. In the last two years, there have been several major efforts to standardize data citation practices, build citation infrastructure, and analyze data citation practices. This session, presented as part of the Program on Information Science seminar series, examines data citation from an information lifecycle approach: what are the use cases, requirements, and research opportunities? The session also discusses emerging infrastructure and standardization efforts around data citation. A number of principles have emerged for citation; the most central is that data citations should be treated consistently with citations to other objects. Data citations should at least provide the minimal core elements expected in other modern citations; they should be included in the references section along with citations to other elements; and they should be indexed in the same way. Adoption of data citation by journals can provide positive and sustainable incentives for more reproducible science and more complete attribution. This would act to brighten the dark matter of science, revealing connections among evidence bases that are not now visible through citations of articles.