An Identifier Scheme for the Digitising Scotland Project (Alasdair Gray)
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification scheme. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol to generate, for any individual on a certificate, a unique identifier, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of the content being misread and resulting in an incorrect identifier. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme, and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Center - Scotland).
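The abstract describes the scheme's ingredients (certificate type, registration district, person role) but not its exact layout, so the following is only a hedged sketch: the component order, delimiters, and code tables below are assumptions for illustration, not the format published in the paper.

```python
# Hypothetical sketch of a Digitising Scotland-style identifier.
# The component choice (certificate type, registration district number,
# year, entry number, person role) follows the abstract; the exact
# layout and the code tables are illustrative assumptions.

CERT_TYPES = {"birth": "B", "death": "D", "marriage": "M"}
ROLES = {"child": "C", "mother": "M", "father": "F",
         "deceased": "D", "bride": "B", "groom": "G"}

def person_identifier(cert_type, district, year, entry, role):
    """Build a human-constructable identifier: it can be written down
    without a computer and does not depend on reading the handwritten
    content of the certificate."""
    return "{}/{}/{}/{}/{}".format(
        CERT_TYPES[cert_type], district, year, entry, ROLES[role])

# e.g. the mother on entry 27 of the 1891 birth register
# for (hypothetical) registration district 644:
print(person_identifier("birth", "644", 1891, 27, "mother"))  # B/644/1891/27/M
```

Because every component is taken from the printed register structure rather than the handwriting, two people transcribing the same certificate should derive the same identifier.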
Supporting Dataset Descriptions in the Life Sciences (Alasdair Gray)
Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.
In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile, and the tooling that I've developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
Validata: A tool for testing profile conformance (Alasdair Gray)
Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally it provides an API for programmatic access to the validator. Validata is capable of being used for multiple community agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments to support each of these. Validata can be easily repurposed for different deployments by providing it with a new ShEx schema. The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
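The kind of check Validata performs can be sketched in plain Python. This is not ShEx (a real ShEx shape also constrains value types and cardinalities); it is a deliberately simplified stand-in that only tests which properties of a profile are present. The property names loosely follow DCAT/Dublin Core terms and are illustrative.

```python
# Simplified stand-in for shape-based profile validation: check that a
# dataset description carries the properties a profile requires.
# Property names are illustrative (DCAT / Dublin Core style).

PROFILE = {
    "required": {"dct:title", "dct:description", "dct:license"},
    "optional": {"dcat:distribution", "pav:version"},
}

def validate(description, profile):
    """Return (conforms, missing_required, unknown_properties)."""
    props = set(description)
    known = profile["required"] | profile["optional"]
    missing = profile["required"] - props
    unknown = props - known
    return (not missing, sorted(missing), sorted(unknown))

desc = {"dct:title": "Example dataset", "dct:license": "CC0"}
ok, missing, unknown = validate(desc, PROFILE)
print(ok, missing, unknown)  # False ['dct:description'] []
```

Swapping in a different profile dictionary corresponds to Validata's ability to be repurposed by supplying a new ShEx schema.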
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
The HCLS Community Profile: Describing Datasets, Versions, and Distributions (Alasdair Gray)
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
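A flavour of what such a description looks like can be given in Turtle. This fragment is only an abbreviated illustration of the profile's style (title, description, attribution, versioning, distribution); the `example.org` URIs and values are hypothetical, and the full profile specifies many more elements and requirement levels.

```turtle
# Minimal, illustrative dataset description in the spirit of the
# HCLS community profile. All example.org URIs are hypothetical.
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix pav:  <http://purl.org/pav/> .

<http://example.org/dataset/example> a dcat:Dataset ;
    dct:title "Example dataset"@en ;
    dct:description "An illustrative dataset description."@en ;
    dct:publisher <http://example.org/organisation> ;
    dct:license <http://example.org/license> ;
    pav:version "1.0" ;
    dcat:distribution <http://example.org/dataset/example/rdf> .
```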
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
Publishing of Scientific Data - Science Foundation Ireland Summit 2010 (jodischneider)
Slides prepared for the Publishing of Scientific Data workshop at the Science Foundation Ireland Summit 2010. I was one of three panelists. We had a lively discussion!
Data Citation Implementation Guidelines, by Tim Clark (datascienceiqss)
This talk presents a set of detailed technical recommendations for operationalizing the Joint Declaration of Data Citation Principles (JDDCP) - the most widely agreed set of principle-based recommendations for direct scholarly data citation.
We will provide initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data.
We hope that these recommendations along with the new NISO JATS document schema revision, developed in parallel, will help accelerate the wide adoption of data citation in scholarly literature. We believe their adoption will enable open data transparency for validation, reuse and extension of scientific results; and will significantly counteract the problem of false positives in the literature.
The DataTags System: Sharing Sensitive Data with Confidence (Merce Crosas)
This talk was part of a session at the Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016:
https://rd-alliance.org/rda-8th-plenary-joint-meeting-ig-domain-repositories-wg-rdaniso-privacy-implications-research-data
Slides on Merce Crosas's site:
http://scholar.harvard.edu/mercecrosas/presentations/datatags-system-sharing-sensitive-data-confidence
Using Neo4j for exploring the research graph connections made by RD-Switchboard (amiraryani)
In this talk, Jingbo Wang (NCI) and Amir Aryani (ANDS) presented the Neo4j queries that can help data managers explore the connections between datasets, researchers, grants, and publications using the graph model and the Research Data Switchboard. In addition, they discussed a paper on "Graph connections made by RD-Switchboard using NCI's metadata", presented at the Reproducible Open Science workshop in Hannover, September 2016.
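A query of the kind described would look something like the following Cypher. The node labels and relationship types below are assumptions for illustration; the actual schema of the Research Data Switchboard graph may differ.

```cypher
// Illustrative only: labels and relationship types are hypothetical,
// not necessarily those used by RD-Switchboard.
// Find datasets reachable from a grant via the publications it funded.
MATCH (g:Grant)<-[:FUNDED_BY]-(p:Publication)-[:RELATED_TO]->(d:Dataset)
RETURN g.title AS grant, p.title AS publication, d.title AS dataset
LIMIT 10
```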
Dataverse, Cloud Dataverse, and DataTags (Merce Crosas)
Talk given at Two Sigma:
The Dataverse project, developed at Harvard's Institute for Quantitative Social Science since 2006, is a widely used software platform to share and archive data for research. There are currently more than 20 Dataverse repository installations worldwide, with the Harvard Dataverse repository alone hosting more than 60,000 datasets. Dataverse provides incentives to researchers to share their data, giving them credit through data citation and control over terms of use and access. In this talk, I'll discuss the Dataverse project, as well as related projects such as DataTags to share sensitive data and Cloud Dataverse to share Big Data.
Open Source Tools Facilitating Sharing/Protecting Privacy: Dataverse and Data... (Merce Crosas)
Presentation for the NFAIS Webinar series: Open Data Fostering Open Science: Meeting Researchers' Needs
http://www.nfais.org/index.php?option=com_mc&view=mc&mcid=72&eventId=508850&orgId=nfais
OpenAIRE-COAR conference 2014: Next generation metrics of scholarly performa... (OpenAIRE)
Presentation at the OpenAIRE-COAR Conference: "Open Access Movement to Reality: Putting the Pieces Together", Athens - May 21-22, 2014.
Session 4: The impact of openness and how to evaluate research.
Next generation metrics of scholarly performance, by William Gunn - Head of Academic Outreach for Mendeley
Metadata as Linked Data for Research Data Repositories (andrea huang)
"Every man has his own cosmology and who can say that his own is right," as Einstein said. The same holds for data semantics: one dataset may be interpreted differently by different data creators, curators, and re-users. How, then, do we build a better research data repository?
We start from the point made by Willis, Greenberg, and White (2012) that metadata for research data increases access to and reuse of that data, and from the view at Stanford, Harvard, and Cornell that linked data technologies are a promising way to gather contextual information about research resources.
Looking for tools that meet the urgent need for innovative solutions offering feature-rich services for data publishing, such as visualization, validation, and reuse across different applications by research repositories (Assante et al., 2016), our first choice is CKAN (the Comprehensive Knowledge Archive Network), a major solution for making linked metadata available, citable, and validated.
Original file: http://m.odw.tw/u/odw/m/metadata-as-linked-data-for-research-data-repositories/
Linking Scientific Metadata (presented at DC2010) (Jian Qin)
Linked entity data in metadata records builds a foundation for the Semantic Web. Even though metadata records contain rich entity data, there is no linking between associated entities such as persons, datasets, projects, publications, or organizations. We conducted a small experiment using the dataset collection from the Hubbard Brook Ecosystem Study (HBES), in which we converted the entities and their relationships into RDF triples and linked the URIs contained in those triples to the corresponding entities in the Ecological Metadata Language (EML) records. Through a transformation program written in XML Stylesheet Language (XSL), we turned a plain EML record display into an interlinked semantic web of ecological datasets. The experiment suggests that incorporating linked entity data into metadata records is methodologically feasible. The paper also argues for the need to change the scientific, as well as the general, metadata paradigm.
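The experiment used XSL for the transformation; the same idea, extracting entities from an EML-style record and emitting RDF-style triples, can be sketched in Python with the standard library. The element names, identifiers, and URIs below are illustrative, not real EML or HBES data.

```python
# Sketch of an EML-like record -> triples transformation.
# Element names and URIs are illustrative assumptions, not real EML.
import xml.etree.ElementTree as ET

EML_LIKE = """
<dataset id="hbes-42">
  <title>Stream chemistry</title>
  <creator id="person-7"><name>J. Smith</name></creator>
</dataset>
"""

def to_triples(xml_text, base="http://example.org/"):
    """Emit (subject, predicate, object) triples linking the dataset
    to its title and to its creators as first-class entities."""
    root = ET.fromstring(xml_text)
    subject = base + root.get("id")
    triples = [(subject, "dct:title", root.findtext("title"))]
    for creator in root.findall("creator"):
        person = base + creator.get("id")
        triples.append((subject, "dct:creator", person))
        triples.append((person, "foaf:name", creator.findtext("name")))
    return triples

for triple in to_triples(EML_LIKE):
    print(triple)
```

The key move, as in the paper, is that the creator becomes a URI-identified entity that other records can link to, rather than an opaque string inside one record.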
Using Dataverse Virtual Archive Technology for Research Data Management (Gary Wilhelm)
One of the most important components of research is access to quality data. Digital data archives must work to increase submission rates to ensure that quality data exist for future researchers. This is a challenge given that recent studies show that vast amounts of data collected during publicly funded projects are not being archived. Even the best-planned methodology will not succeed when researchers use tainted data or fail to find adequate data. Social science data archivists play a key role in the effort to maintain quality sources of data for social science investigators to repurpose and reuse. The dynamic, circular movement of data between producers and archives is critical to the future of social science research. Data archives have historically provided for this data interchange using considerable human capital. Dedicated archivists and investigators have worked together to ensure that data were processed and placed into an archive best designed for their preservation, a manual process that has become increasingly expensive and unwieldy due to the volume of data being produced and the advanced metadata required to give future researchers enough detail to reuse a study. Typical methods have researchers working with the archives to deposit the data long after the project has been completed and the papers published. The manual creation of metadata at this point takes far longer than if it were collected earlier in the research life cycle. Recent advances in archival repository software may be the key to streamlining this increasingly inefficient archival process by allowing archivists and researchers to create detailed metadata earlier in the research life cycle, at a point where it takes far less time. Software allows researchers greater personal control over archival ingest processes, bridging the gap between researchers and archives and possibly increasing submission rates of valuable data to archives.
Archival technology provides tools that manage automated ingest, data cataloging, advanced search and indexing, and rights and access issues. Archival tools also provide proper citation, creation of persistent identifiers, automatic creation of preservation formats, format migration, and statistical analysis of data. Customized branding and citation management can provide investigators collecting these data with a tool that will ensure that they get the credit they deserve. The Dataverse Network Technology has the potential to aid many research groups at UNC in the data management processes and has the potential for use in many disciplines. This presentation will explain the technology and its applicability for managing research data.
Data Citation Implementation at Dataverse (Merce Crosas)
Presentation at the Data Citation Implementation Pilot Workshop in Boston, February 3rd, 2016.
https://www.force11.org/group/data-citation-implementation-pilot-dcip/pilot-project-kick-workshop
Wf4Ever: Scientific Workflows and Research Objects as tools for scientific in... (Joint ALMA Observatory)
Astronomers are being drowned in data: facilities like ALMA currently provide datasets in the gigabyte range, and growing, while facilities like the LSST and the SKA will generate datasets so large that data download, even of the reduced datasets, will not be feasible. In this talk we introduce the concept of Scientific Workflows, software tools that allow for easy exploration of both local and remote datasets and processing services, and of Research Objects, which encapsulate all relevant aspects of a scientific experiment, allow for its quantitative and qualitative assessment, enable reuse with proper attribution, and provide linkage to publications, among others. The AstroTaverna plugin, which adds astronomy-specific features for workflow creation, was also presented in this ALMA Weekly Seminar.
This set of slides introduces the TANGO control system for the SKA telescopes, using analogies between tango dancing and the paradigms of the TANGO control system.
The talk was given in the context of the Engineering Q&A talks of the SKA Organisation.
Talk given as part of the Bluedot Festival. It emphasises current trends in Citizen Science, how they are powered by the same ICT innovation that powers other industries, and how curation and metadata are key for both professional and citizen scientists, along with the facilities that will be needed to support them.
Data Publishing at Harvard's Research Data Access Symposium (Merce Crosas)
Data Publishing: The research community needs reliable, standard ways to make the data produced by scientific research available to the community, while giving credit to data authors. As a result, a new form of scholarly publication is emerging: data publishing. Data publishing - or making data reusable, citable, and accessible for long periods - is more than simply providing a link to a data file or posting the data to the researcher’s web site. We will discuss best practices, including the use of persistent identifiers and full data citations, the importance of metadata, the choice between public data and restricted data with terms of use, the workflows for collaboration and review before data release, and the role of trusted archival repositories. The Harvard Dataverse repository (and the Dataverse open-source software) provides a solution for data publishing, making it easy for researchers to follow these best practices, while satisfying data management requirements and incentivizing the sharing of research data.
Research Integrity Advisor and Data Management (ARDC)
Dr Paul Wong from the Australian Research Data Commons presented at the University of Technology Sydney's RIA Data Management Workshop on 21 June 2018. In partnership with the Australian Research Council, the National Health and Medical Research Council, the Australian Research Data Commons, and RMIT University, this is part of a national workshop series in data management for research integrity advisors.
This presentation was part of an IEDA workshop (http://www.iedadata.org) at the AGU (American Geophysical Union) Fall Meeting 2013 in San Francisco, intended as an introduction to the topic of data publication.
Scott Edmunds's talk in the "Policies and Standards for Reproducible Research" session, on Revolutionizing Data Dissemination: GigaScience, at the Genomic Standards Consortium meeting in Shenzhen, 6 March 2012.
Big Data Repository for Structural Biology: Challenges and Opportunities by P... (datascienceiqss)
SBGrid (Morin et al., 2013, eLIFE and www.sbgrid.org) is a Harvard based structural biology global computing consortium with a primary focus on the curation of research software. Dr. Sliz will discuss a recent SBGrid project that aims to establish a repository for experimental datasets from SBGrid laboratories. Issues of handling large data volumes, data validation and repository sustainability will be addressed in this talk.
Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility, the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used, or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed references required. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
RDAP13 Jared Lyle: Domain Repositories and Institutional Repositories Partnering to CurateASIS&T
Jared Lyle, ICPSR
Domain Repositories and Institutional Repositories Partnering to Curate: Opportunities and Examples
Panel: Partnerships between institutional repositories, domain repositories, and publishers
Research Data Access & Preservation Summit 2013
Baltimore, MD April 4, 2013 #rdap13
Structured data:
- What is it?
- Data modelling standards
- Semantic Web
Knowledge Organisation Systems
- Natural language → controlled language
- Controlled vocabulary → ontology
- Design considerations & user impact
VALA Tech Camp 2017: Intro to Wikidata & SPARQLJane Frazier
A hands-on introduction to interrogation of Wikidata content using SPARQL, the query language used to query data represented in RDF, SKOS, OWL, and other Semantic Web standards.
Presented by myself and Peter Neish, Research Data Specialist @ University of Melbourne.
Controlled vocabularies for health & medicineJane Frazier
During this workshop, given at the Health Libraries Australia Professional Development Day 9 July in Brisbane, we discussed maximising value in the use of controlled vocabularies for health & medicine.
Simon Cox (Researcher @ CSIRO) and I presented on outcomes of the CSIRO Summer of Vocabularies.
This project focused on:
- Examining the state of the management of various controlled vocabularies and developing reusable processes to clean & standardise those vocabularies.
- Developing standards & technologies for vocabulary management, curation & visualisation.
We presented on these Summer of Vocabularies activities and how they are being used now to further the work of the Vocram project and the improvement and development of ANDS vocabulary services.
Flying solo: data librarians working outside (traditional) librariesJane Frazier
I used these slides for my portion of the "Flying Solo" ANDS webinar:
Did you know there are data librarians who work outside of (traditional) libraries? For some, being a data librarian means leaving the relative comfort of the library behind and ‘flying solo’ into uncharted territory. These are new and demanding roles that require a steep learning curve with minimal support. In this webinar, three data librarians working outside of libraries will share their experience of going it alone, reflecting on these challenging yet rewarding roles that push the boundaries of librarianship and open new opportunities for the profession.
Siobhann McCafferty is based at QUT’s Institute for Future Environments in Brisbane and is the Research Data Coordinator for the National Agricultural Nitrous Oxide Research Program (NANORP). She is embedded in the Healthy Ecosystems and Environmental Management group at IFE and works with researchers from across Australia to store program data and make it discoverable and reusable.
Jane Frazier is a Data Librarian at ANDS. She has previously worked in the University of North Carolina Music Library cataloging 20th century American vocal sheet music, as curatorial assistant at the Dryad digital repository, and at the UNC Metadata Research Center exploring automatic subject indexing processes for Dryad. From 2013 to 2014 Jane led the research and development of a new web-based cataloging system for collectible items with Stanley Gibbons, one of the world’s oldest stamp collecting firms.
Michelle Teis is a senior consultant with more than 25 years' industry experience. She is an enterprise information management expert specialising in content, data and knowledge management, and information privacy. http://www.glentworth.com/about-us/our-key-people/michelle-teis/
Subject tagging: Recommendations for Dryad curators and scientistsJane Frazier
How do scientists tag their data? Do they come up with their own topical terms, do they consult controlled vocabularies, or do they use some other source for these terms? The goal of this project is to write a memo or guide which will help Dryad librarians guide scientists to describe their data in the best way possible by submitting meaningful subject headings/topical headings with their data.
A Bibliographic Overview of English-Language Opera TranslationJane Frazier
This paper aims to give an overview of the subject and guidance for librarians wishing to meet the needs of users searching for opera translated into the English Language.
Music Information Retrieval: A Literature ReviewJane Frazier
Music information retrieval is a small subfield within traditional information retrieval. For the purposes of this piece, I selected publications which are overviews of MIR as a field and publications having to do with user discovery of music.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) avoids duplicate computations and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
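As a minimal sketch of the first optimization mentioned above (skipping converged vertices), the power-iteration loop below marks a vertex as converged once its rank stops changing and skips it thereafter. All names are illustrative, not taken from the STICD paper; a production implementation would also re-activate a skipped vertex when its in-neighbors change, and handle dangling nodes, both omitted here for brevity.

```cpp
#include <cmath>
#include <vector>

// Power-iteration PageRank with a simple "skip converged vertices" sketch.
// inEdges[v] lists the in-neighbors of v; outDegree[u] is u's out-degree
// (assumed nonzero, i.e., no dangling nodes).
std::vector<double> pagerankSkip(const std::vector<std::vector<int>>& inEdges,
                                 const std::vector<int>& outDegree,
                                 double damping = 0.85, double tol = 1e-10,
                                 int maxIter = 100) {
  int n = (int)inEdges.size();
  std::vector<double> rank(n, 1.0 / n), next(n);
  std::vector<bool> converged(n, false);
  for (int it = 0; it < maxIter; ++it) {
    bool allDone = true;
    for (int v = 0; v < n; ++v) {
      if (converged[v]) { next[v] = rank[v]; continue; }  // skip converged vertex
      double sum = 0;
      for (int u : inEdges[v]) sum += rank[u] / outDegree[u];
      next[v] = (1 - damping) / n + damping * sum;
      if (std::fabs(next[v] - rank[v]) < tol) converged[v] = true;
      else allDone = false;
    }
    rank.swap(next);
    if (allDone) break;  // every vertex converged: stop iterating
  }
  return rank;
}
```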
Adjusting primitives for graph: SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, are commonly implemented over Compressed Sparse Row (CSR), an adjacency-list based graph representation that is compact and cache-efficient.
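To make the representation concrete, here is a minimal CSR sketch (names are illustrative, not the report's own code): all adjacency lists are concatenated into one edge array, and a per-vertex offsets array marks where each list begins.

```cpp
#include <vector>

// Minimal CSR (Compressed Sparse Row) sketch: the neighbors of vertex v are
// edges[offsets[v] .. offsets[v+1]-1].
struct CsrGraph {
  std::vector<int> offsets;  // size = numVertices + 1
  std::vector<int> edges;    // size = numEdges, concatenated adjacency lists

  int degree(int v) const { return offsets[v + 1] - offsets[v]; }
};

// Build a CSR graph from plain adjacency lists.
CsrGraph buildCsr(const std::vector<std::vector<int>>& adj) {
  CsrGraph g;
  g.offsets.push_back(0);
  for (const auto& nbrs : adj) {
    g.edges.insert(g.edges.end(), nbrs.begin(), nbrs.end());
    g.offsets.push_back((int)g.edges.size());
  }
  return g;
}
```

Because the edge array is contiguous, iterating a vertex's neighbors is a sequential scan, which is what makes CSR attractive for PageRank-style traversals.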
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
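The sequential vs OpenMP comparison in the list above can be sketched as two variants of the same element sum; this is an illustrative example, not the notes' own benchmark code. If compiled without OpenMP support, the pragma is ignored and the second variant simply runs sequentially, so both return the same result.

```cpp
#include <vector>

// Sequential vector element sum.
double sumSequential(const std::vector<double>& x) {
  double total = 0;
  for (double v : x) total += v;
  return total;
}

// OpenMP-based vector element sum: each thread accumulates a partial total,
// and the reduction clause combines them at the end of the parallel region.
double sumOpenMP(const std::vector<double>& x) {
  double total = 0;
  #pragma omp parallel for reduction(+ : total)
  for (long long i = 0; i < (long long)x.size(); ++i) total += x[i];
  return total;
}
```

A benchmark like the one described would time both functions over large vectors; the parallel variant only pays off once the vector is big enough to amortize the threading overhead.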
10. Submitting data to Dryad
Dryad accepts content associated with published articles:
•Data files
–Spreadsheets & CSVs
–DNA alignments
–Gene sequencing
–Phylogenetic trees
–Images & video
–GIS
•Software scripts
11. Sidlauskas B (2007) Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.20
12. Riju A, Chandrasekar A, Arunachalam V (2007) Data from: Mining for single nucleotide polymorphisms and insertions / deletions in expressed sequence tag libraries of oil palm. Dryad Digital Repository.
http://dx.doi.org/10.5061/dryad.157
13. Drew JA, Philipp C, Westneat MW (2013) Data from: Shark tooth weapons from the 19th century reflect shifting baselines in Central Pacific predator assemblies. Dryad Digital Repository.
http://dx.doi.org/10.5061/dryad.6b2c9
14. Data curation
•Manage multiple submission workflows
•Accept/reject data submissions
•Manage data embargoes
•Oversee DOI assignment
•Manage data citation
•Ensure link with publication
•Author name & Journal name authority control
•Metadata consistency & quality control
18. Rejection of submissions
•Duplicated submission/duplicated files
•Data not associated with a publication
•Corrupt data files
•Non-Creative Commons licensing
•Manuscript submitted
•Human subject data insufficiently anonymised