This is module 6 in the EDI Data Publishing training course. In this module, you will learn how to create quality metadata and be introduced to the landscape of data repositories and their functions.
2. Background
Data are not inherently self-describing. An understanding of what the data are and how they can be used requires quality metadata (data about data). The level of metadata quality varies considerably and is a distinguishing feature among data repositories.
3. Objectives
Define metadata and discuss why they are important
Offer tips for writing quality metadata
Describe the functions of a data repository
4. What are metadata?
[Image: a bare data table] Table 1: Average temperature of observation for each species. Courtesy: Viv Hutchison
5. What are metadata?
[Image: the same table, annotated with a reuser's questions: What do the temps represent? How? Where? Units?] Table 1: Average temperature of observation for each species. Courtesy: Viv Hutchison
6. What are metadata?
Metadata are data about data:
WHO created the data?
WHAT is the content of the data?
WHEN were the data created?
WHERE were they collected?
WHY were the data collected?
7. Value of Metadata
Essential for making data FAIR:
● Findable: Keywords, a good title, a DOI
● Accessible: Tell the user how to access the data, or provide a direct link to it
● Interoperable: Accurate and well-described methods and attributes
● Reusable: Understandable without contacting the data creator
8. Metadata for EDI (1)
Title and Abstract
Investigators: Synonymous with the "authors" of a paper; the investigators are the persons (or in some cases institutions) that have made an intellectual contribution to the design of the data collection or creation effort.
License: Tells future data users how they can reuse the data
9. Metadata for EDI (2)
Keywords:
● Important for data discovery.
● Select from an existing controlled vocabulary or thesaurus.
Funding:
● Include the award number.
Timeframe & Location
Taxonomic species
Methods
10. Metadata for EDI (3)
Describe each data table:
Column Name
Description
Unit / Code Explanation / Date format
Empty Value Code
● Standard units: EML metadata has a set of predefined variable units (the EML unit dictionary), written in camelCase, e.g. kg/m2 = kilogramPerMeterSquared.
● Custom units: Any unit not defined in the dictionary can be included as a custom unit.
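As an illustration, here is a minimal Python sketch of the kind of column-by-column description the template asks for. The table, column names, units, and codes are hypothetical examples, not EDI requirements; in EML this information is expressed in attribute elements.

# Minimal sketch of a column-by-column data table description.
# The column names, units, and codes below are hypothetical examples.
attributes = [
    {"name": "site_id", "description": "Sampling site code", "unit": None,
     "codes": {"NE1": "Northeast Shark River Slough site 1"}},
    {"name": "sample_date", "description": "Date of collection", "unit": None,
     "date_format": "YYYY-MM-DD"},
    {"name": "biomass", "description": "Periphyton biomass",
     "unit": "kilogramPerMeterSquared"},  # camelCase standard unit name
]
EMPTY_VALUE_CODE = "NA"  # code used where a value is missing

for att in attributes:
    print(att["name"], "|", att["description"], "| unit:", att["unit"] or "n/a")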
12. Metadata for EDI (4)
Scripts/code (software): Data processing and analysis scripts can be included in a data package.
Data provenance: A record trail that accounts for the origin of a dataset.
13. Titles, titles, titles
Titles are critical in helping readers find your data.
○ While individuals are searching for the most appropriate datasets, they are most likely going to use the title as the first criterion to determine if a dataset meets their needs.
A complete title includes: What, Where, and When (and Who, if relevant)
14. Titles, titles, titles
Which title is better?
● Periphyton
● Periphyton Abundance data collected by FCE LTER from Northeast Shark River Slough, Florida Everglades National Park, from September 2006 to September 2008
18. Ecological Metadata Language (EML)
Metadata standard used widely in the US ecological community
Implemented in the Extensible Markup Language (XML)

<title>Water Quality Data from Shark River Slough, Everglades National Park</title>
<originator>
<firstName>Evelyn</firstName>
<lastName>Gaiser</lastName>
</originator>
<method>Grab samples of water were collected monthly</method>
<date>
<begin>2000-06-01</begin>
<end>2017-03-30</end>
</date>
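Because the tags give the metadata structure, a short script can pull fields out of an EML document automatically. Here is a minimal sketch using only the Python standard library; the dataset wrapper element is an assumption added solely so the fragment above is well-formed XML.

# Minimal sketch: extract fields from the EML fragment shown above.
# The <dataset> wrapper is added only to make the fragment well-formed.
import xml.etree.ElementTree as ET

fragment = """
<dataset>
  <title>Water Quality Data from Shark River Slough, Everglades National Park</title>
  <originator><firstName>Evelyn</firstName><lastName>Gaiser</lastName></originator>
  <method>Grab samples of water were collected monthly</method>
  <date><begin>2000-06-01</begin><end>2017-03-30</end></date>
</dataset>
"""

root = ET.fromstring(fragment)
print("Title:   ", root.findtext("title"))
print("Creator: ", root.findtext("originator/firstName"), root.findtext("originator/lastName"))
print("Coverage:", root.findtext("date/begin"), "to", root.findtext("date/end"))

The same approach scales: hand a computer a thousand EML documents and it can return all of their titles.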
19. What does one do with an EML document?
Deposit metadata and data in a data repository!
A data repository is a service operated by research organizations, where research materials are stored, managed, and made accessible.
20. Data Repositories ensure
● Long-term security of the data
● Long-term accessibility of the data
● Data integrity
● Data discovery
● Datasets are citable
● Most repositories provide a DOI
21. Where to deposit ecological data?
Domain-specific repositories
● Environmental Data Initiative Repository
● Knowledge Network for Biocomplexity
● Arctic Data Center
Generalist repositories
● Dryad
● Figshare
● Zenodo
Institutional repositories
22. Lots of repositories to choose from…
Repositories differ in:
● Amount of metadata required
● Support of provenance
● Immutability
● Domains supported
26. Summary
A metadata record captures critical information about the content of a dataset.
Metadata allow data to be discovered, accessed, integrated, and reused.
Data repositories support the Findability, Accessibility, Interoperability, and Reusability (FAIR) of research data.
Editor's Notes
Describe the functions of a data repository, which is the final destination of the metadata.
What are metadata? Let’s take a look at this question from the perspective of a researcher. Suppose you are a scientist who wants to study the effects of temperature on frogs. You reach out to all your frog scientist friends and ask for datasets on this topic because you want to do a meta-analysis, an analysis across multiple studies. You are sent this data file by one colleague, with no supporting info. What additional information would you need in order to use these data?
Units?
What do these temperatures represent? Temperature of the skin of the frog or water it was found in?
How were the data collected? Where? In the wild, or in a zoo?
When were the data collected? Was it 30 years ago before amphibians were in decline?
Furthermore, was the minimum temperature for one of these poor Wood Frogs really zero?
Metadata are just data about data. They help the original creator of the data remember what they did, and they help a secondary data user to understand the data well enough to reuse them. So metadata include information about who created the dataset. A secondary data user may want to contact this creator for more information. What is the content of the dataset? The abstract in the metadata should briefly describe this. When were the data collected? Are the data from a long-term study, or just a short experiment? Where were the measurements collected? How were they collected? Why were the data collected? This Why question may indicate that there was some bias in how measurements were made that makes the data unsuitable for a new purpose. So metadata are the who, what, when, where and why of a dataset.
Relative to the value of metadata, you will recall the FAIR data principles that Susanne described on Tuesday. The FAIR principles are guidelines for making data findable, accessible, interoperable and reusable. Metadata are essential to all four of the FAIR principles.
With respect to data findability, metadata contain keywords, a good title, and a persistent identifier or DOI. All of these facilitate data discovery. Metadata tell a user how to access the data or provide a direct link to it. They indicate how the data are licensed and what a reuser may do with them. Very detailed metadata include accurate and well-described methods and attributes, which are essential for interoperability and integration of datasets. Finally, complete metadata should make the data understandable to a secondary user, without that user needing to contact the data creator.
Speaking of complete and detailed metadata, let’s talk a bit about what metadata EDI requires. This is the Word template for EDI metadata that you may already have seen. I will step you through what is needed to complete this document. Remember, if you are filling out this template, you need to provide answers to the questions that a typical data reuser would need answered in order to interpret these data correctly.
The License you choose will tell future data users how they can reuse the data.
Creative Commons is an American non-profit organization devoted to expanding the range of open-access creative works available for others to build upon legally and to share. The organization has released several copyright licenses, known as Creative Commons licenses, free of charge to the public. CC0 = no rights reserved. CC-BY is a license that requires that the data authors get attribution, but the data can otherwise be used however someone likes.
If you don’t choose either one of these licenses, then by default your dataset will be given the CC0 license.
On the next page is the section to provide keywords. We suggest that metadata creators select several keywords that are highly relevant to the data being documented. Keywords help a would-be secondary user of the data find the data. Keywords should be precise. Sometimes people get carried away and include 40 keywords. That’s too many. My rule of thumb is 10 or fewer from the LTER Controlled Vocabulary, and a couple of additional ones that describe the project.
Link to a tool into which you can input the abstract of a dataset, for example, and the tool will suggest keywords from the LTER Controlled Vocabulary.
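To make that idea concrete, here is a toy Python sketch of keyword suggestion. The vocabulary below is a tiny illustrative sample, not the real LTER Controlled Vocabulary, and the matching is deliberately simplistic.

# Toy sketch: suggest controlled-vocabulary keywords that appear in an abstract.
# The vocabulary here is a tiny illustrative sample, not the real LTER CV.
controlled_vocabulary = {"carbon dioxide", "periphyton", "water quality", "nutrients"}

abstract = ("Monthly grab samples of water and periphyton were used to track "
            "water quality in Shark River Slough.")

suggested = sorted(term for term in controlled_vocabulary if term in abstract.lower())
print("Suggested keywords:", suggested)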
Providing a reference to the funding source for the study is important. Funders like to be able to search a data repository and see what their funding dollars bought. If you provide a grant number and funder ID, then NSF, for example, can quickly find datasets related to projects they funded.
Timeframe, Geographic Location
In Methods, you should describe what you did so that someone else could reproduce your study. You should describe the experimental design, instruments used, and how samples were processed. You can point to published protocols, too, if they are relevant. Methods are really important when a data reuser is trying to determine if the data are suitable for their analysis or not.
Here is where you describe all the attributes in a data table. In the first column, you would put the variable or attribute names from the header of your dataset.
Units have to be written in a particular way: they are written out in camelCase so that they are unambiguous.
Example of data from a long-term stream chemistry study.
Data packages don’t always contain just data and metadata. They may contain scripts that were used to process the data in some way. If you generated code while manipulating the dataset and quality controlling it, you can include the code in the data package.
Finally, data provenance can be described. Data provenance refers to a record trail that accounts for the origin of the dataset. If the frog researcher integrated 15 frog datasets from other researchers into a single dataset for her study, then this is where the identity of those original datasets can be recorded. Important for supporting reproducible science.
I will now offer you a few tips on how to create quality metadata, starting with what a good title should contain.
Select keywords wisely. Keywords aren’t something you should just pull out of the air. It’s better to choose terms from a thesaurus or controlled vocabulary. A controlled vocabulary is a standardized list of words that provides a consistent way to describe and index data. In the case of the LTER Controlled Vocabulary, the list consists of about 700 terms that ecologists use frequently to keyword data. So, how would you use the Controlled Vocabulary? If you are considering using CO2 as your keyword, for instance, you would look into this controlled vocabulary and see whether CO2 is there, and it is, but it is not the preferred term. The words carbon dioxide should be written out, rather than entering CO2 as the keyword. By using these standard terms, it’s possible to index data holdings based on these terms. This improves the potential for data discovery considerably.
Also, it can be helpful to have a reference for standardized place names. Sometimes you may get data that contain specific place names that are likely to be expressed in a variety of different ways. For instance, in the Everglades there are these “Conservation Areas” that have received different treatments. Metadata for these areas may say the research site is “Conservation area 3” or WMACA 3 or other permutations. To get the standardized name, I consult this gazetteer. It’s a lot easier to find data for these locations if all datasets use the same version of the place name.
So you’ve written some brilliant metadata. Then what happens? Well, The Word template isn’t machine readable. Computers like more structure than a Word document can offer. You will learn later today how to generate structured metadata from the EDI template. The structured metadata standard we use at EDI is called Ecological Metadata Language. EML was developed for documenting ecological and environmental datasets, and is implemented in XML. This blue box shows a fragment of EML. You can see that elements of the metadata are enclosed in tags that describe their content. These tags are the XML, in the simplest possible sense. Having the data in EML makes it machine-readable. You can throw 1000 EML documents at a computer and request all the titles be output, and the computer can do that easily.
Once you have your clean dataset and your EML, what do you do with it? You are ready to share data through the EDI Repository. A data repository is a service operated by research organizations where research materials are stored, managed, and made accessible.
What is special about a data repository, as opposed to sharing your data and metadata on a lab web page or a field station’s website? Data repositories have some important functions that a lab website does not.
For instance, repositories provide for the long-term security of the data, meaning that a dataset will not ever be lost from a repository. It will be available 20 or more years after it is deposited.
Repositories ensure long-term accessibility of data: a dataset will always be retrievable from the repository.
Data integrity is preserved in a repository, meaning the dataset will never be changed while in the repository. The data are said to be immutable.
For data discovery, the repository will offer a mechanism by which to find data.
Datasets in a repository are citable: datasets in most repositories receive a DOI, a digital object identifier, which provides a persistent link to a dataset’s location on the Internet.
You won’t get a DOI by posting your data on your lab website, and DOIs are what make it possible for researchers to get credit from citations of their data.
Is EDI the only place to store ecological data? No, there are many repositories that will accept ecological data. There are three kinds of repositories: domain-specific, generalist, and institutional. Domain-specific repositories each store data from a particular domain, for example ecological data, physics data, or sociological data. Repositories specifically for ecological data in the US include KNB and the Arctic Data Center; many other ecological repositories exist in other countries. Generalist repositories are designed to accept any kind of data. Institutional repositories are found at large institutions, which now run their own repositories to store data, reports, articles, photos, and all kinds of products from researchers at the institution. Some researchers prefer to store their data in their institutional repository.
RE3data.org: 2,540 repositories are indexed by this service. Examples include Neotoma (paleoecological data), the Gulf Coast Repository, VertNet, the Fish Database of Taiwan, and Australian Waterbird Surveys.
Let’s take a look at a data record in the EDI Repository so you can see how the structured metadata is turned into a nice HTML display.
Data are cited alongside journal citations in the references section of a paper.
These columns represent the columns in the dataset. Look at the detail here! Because the data are described so carefully, it’s possible to write on-the-fly R or Python code that will directly extract this data table from the repository and import it into R or Python.
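As a closing illustration, here is a hypothetical Python sketch of that on-the-fly extraction. The URL is a placeholder, not a real data entity address, and pandas is assumed to be installed; the point is simply that a well-described table can be pulled straight from a repository over HTTPS.

# Hypothetical sketch: because every column is described in the metadata,
# a script can pull a data table straight from the repository.
# The URL below is a placeholder, not a real entity address.
import pandas as pd

entity_url = "https://example.org/placeholder/data_table.csv"
df = pd.read_csv(entity_url)  # downloads and parses the CSV data entity
print(df.head())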