This is module 6 in the EDI Data Publishing training course. In this module, you will learn how to create quality metadata and be introduced to the landscape of data repositories and their functions.
2. Background
Data are not inherently self-describing. An understanding of what the data are and
how they can be used requires quality metadata (data about data). The level of
metadata quality varies considerably and is a distinguishing feature among data
repositories.
3. Objectives
Define metadata and discuss why they are important
Offer tips for writing quality metadata
Describe the functions of a data repository
4. What are metadata?
Table 1: Average temperature of observation for each species
Courtesy: Viv Hutchison
5. What are metadata?
Table 1: Average temperature of observation for each species
Courtesy: Viv Hutchison
What do the temps represent?
How?
Where?
Units?
6. What are metadata?
Metadata are data about data
WHO created the data?
WHAT is the content of the data?
WHEN were the data created?
WHERE were they collected?
WHY were the data collected?
7. Value of Metadata
Essential for making data FAIR
● Findable: Keywords, good title, DOI
● Accessible: Tell user how to access the data or provide direct link to it
● Interoperable: Accurate and well-described methods and attributes
● Reusable: Understandable
8. Metadata for EDI (1)
Title and Abstract
Investigators: Synonymous with the
"authors" of a paper; an investigator
is a person (or in some cases an
institution) that has made an
intellectual contribution to the design
of the data collection/creation effort.
License: Tells future data users how
they can re-use the data
9. Metadata for EDI (2)
Keywords:
● Important for data discovery.
● Select from an existing
controlled vocabulary or
thesaurus.
Funding:
● Include award number
Timeframe & Location
Taxonomic species
Methods
10. Metadata for EDI (3)
Describe each data table:
Column Name
Description
● Standard units: EML metadata has
a set of predefined variable units
(EML unit dictionary).
○ kg/m2 = kilogramPerMeterSquared
● Custom units: Any unit not defined
in the dictionary can be included as
a custom unit.
Unit/Code Explanation/Date format
Empty Value Code
12. Metadata for EDI (4)
Scripts/code (software): Data
processing and analysis scripts can
be included in a data package.
Data provenance: A record trail
that accounts for the origin of a
dataset.
13. Titles, titles, titles
Titles are critical in helping readers find your data
○ Researchers searching for the most appropriate datasets will most likely use
the title as the first criterion to determine whether a dataset meets their
needs.
A complete title includes: What, Where, and When (and Who, if relevant)
14. Titles, titles, titles
Which title is better?
● Periphyton
● Periphyton Abundance data collected by FCE LTER from Northeast Shark
River Slough, Florida Everglades National Park, from September 2006 to
September 2008
18. Ecological Metadata Language (EML)
Metadata standard used widely in US ecological community
Implemented in the Extensible Markup Language (XML)
<title>Water Quality Data from Shark River
Slough, Everglades National Park</title>
<originator>
<firstName>Evelyn</firstName>
<lastName>Gaiser</lastName>
</originator>
<method>Grab samples of water were
collected monthly </method>
<date>
<begin>2000-06-01</begin>
<end>2017-03-30</end>
</date>
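Because EML is plain XML, its fields can be pulled out with any XML parser. Below is a minimal sketch using Python's standard library against a copy of the fragment above, wrapped in a hypothetical <dataset> root element so it parses as one document (real EML nests these elements more deeply):

```python
import xml.etree.ElementTree as ET

# Simplified EML-like fragment; element nesting is flattened for illustration.
fragment = """
<dataset>
  <title>Water Quality Data from Shark River Slough, Everglades National Park</title>
  <originator>
    <firstName>Evelyn</firstName>
    <lastName>Gaiser</lastName>
  </originator>
  <method>Grab samples of water were collected monthly</method>
  <date>
    <begin>2000-06-01</begin>
    <end>2017-03-30</end>
  </date>
</dataset>
"""

root = ET.fromstring(fragment)
title = root.findtext("title")
author = "{} {}".format(root.findtext("originator/firstName"),
                        root.findtext("originator/lastName"))
begin, end = root.findtext("date/begin"), root.findtext("date/end")

print(title)
print(author)       # Evelyn Gaiser
print(begin, end)   # 2000-06-01 2017-03-30
```

This machine-readability is the point of the standard: the same three lines of lookup code work on any document that follows the schema.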
19. What does one do with an EML document?
Deposit metadata and data in a data repository!
A data repository is a service operated by research organizations, where research
materials are stored, managed and made accessible
20. Data Repositories ensure
● Long-term security of the data
● Long-term accessibility of the data
● Data integrity
● Data discovery
● Datasets are citable
○ Most repositories provide a DOI
21. Where to deposit ecological data?
Domain specific repositories
● Environmental Data Initiative Repository
● Knowledge Network for Biocomplexity
● Arctic Data Center
Generalist repositories
● Dryad
● Figshare
● Zenodo
Institutional repositories
22. Lots of repositories to choose from….
Repositories differ:
● Amount of metadata required
● Support of provenance
● Immutability
● Domains supported
26. Summary
A metadata record captures critical information about the content of a dataset
Metadata allow data to be discovered, accessed, integrated and re-used
Data repositories support Findability, Accessibility, Interoperability, and Reusability
(FAIR) of research data
Editor's Notes
Describe the functions of a data repository, which is the final destination of the metadata.
What are metadata? Let’s take a look at this question from the perspective of a researcher. Suppose you are a scientist who wants to study the effects of temperature on frogs. You reach out to all your frog scientist friends and ask for datasets on this topic because you want to do a meta-analysis, an analysis across multiple studies. You are sent this data file by one colleague, with no supporting info. What additional information would you need in order to use these data?
Units?
What do these temperatures represent? Temperature of the skin of the frog or water it was found in?
How were the data collected? Where? In the wild, or in a zoo?
When were the data collected? Was it 30 years ago before amphibians were in decline?
Furthermore, Was the minimum temperature for one of these poor Wood Frogs really zero?
Metadata are just data about data. They help the original creator of the data remember what they did, and they help a secondary data user to understand the data well enough to reuse them. So metadata include information about who created the dataset. A secondary data user may want to contact this creator for more information. What is the content of the dataset? The abstract in the metadata should briefly describe this. When were the data collected? Are the data from a long-term study, or just a short experiment? Where were the measurements collected? How were they collected? Why were the data collected? This Why question may indicate that there was some bias in how measurements were made that make the data unsuitable for a new purpose. So metadata are the who, what, when, where and why of a dataset.
Relative to the value of metadata, You will recall the FAIR data principles that Susanne described on Tuesday. The FAIR principles are guidelines for making data findable, accessible, interoperable and reusable. Metadata are essential to all four of the FAIR principles.
With respect to data findability, Metadata contain keywords, a good title, and a persistent identifier or DOI. All of these facilitate data discovery. Metadata tell a user how to access the data or provide a direct link to it. They indicate how the data are licensed and what a reuser may do with them. Very detailed metadata include accurate and well-described methods and attributes, which are essential for interoperability and integration of datasets. Finally, complete metadata should make the data understandable to a secondary user, without that user needing to contact the data creator.
Speaking of complete and detailed metadata, let’s talk a bit about what metadata EDI requires. This is the Word template for EDI metadata that you may already have seen. I will step you through what is needed to complete this document. Remember, if you are filling out this template, you need to provide answers to the questions that a typical data reuser would need answered in order to interpret the data correctly.
The License you choose will tell future data users how they can reuse the data.
Creative Commons is an American non-profit organization devoted to expanding the range of open-access creative works available for others to legally build upon and share. The organization has released several copyright licenses, known as Creative Commons licenses, free of charge to the public. CC0 = no rights reserved. CC-BY is a license that requires that the data authors get attribution, but the data can otherwise be used however someone likes.
If you don’t choose either one of these licenses, then by default your data set will be given the CC0 license.
On the next page is the section to provide keywords. We suggest that metadata creators select several keywords that are highly relevant to the data being documented. Keywords help a would-be secondary user of the data find the data. Keywords should be precise. Sometimes people get carried away and include 40 keywords. That’s too many. My rule of thumb is 10 or fewer from the LTER CV, and a couple additional ones that describe the project.
Link to a tool into which you can input the abstract of a dataset, for example; the tool will suggest keywords from the LTER Controlled Vocabulary.
Providing a reference to the funding source for the study is important. Funders like to be able to search a data repository and see what their funding dollars bought. If you provide a grant number and funder id, then NSF, for example, can quickly find datasets related to projects they funded.
Timeframe, Geographic Location
In Methods, you should describe what you did so that someone else could reproduce your study. You should describe the experimental design, the instruments used, and how samples were processed. You can point to published protocols, too, if they are relevant. Methods are really important when a data reuser is trying to determine whether the data are suitable for their analysis or not.
Here is where you describe all the attributes in a data table. In the first column, you would put the variable or attribute names from the header of your dataset.
Units have to be written in a particular way. Units get written out in camelCase so that they are unambiguous.
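As an illustration of the camelCase convention, the mapping below pairs a few conventional unit abbreviations with EML-style unit names. This is a tiny hand-picked subset for demonstration, not the authoritative EML unit dictionary:

```python
# Illustrative subset of conventional-abbreviation -> EML-style unit names.
# Consult the EML unit dictionary for the authoritative list.
EML_UNITS = {
    "m": "meter",
    "m/s": "metersPerSecond",
    "g/m2": "gramsPerSquareMeter",
    "mg/L": "milligramsPerLiter",
    "degC": "celsius",
}

def to_eml_unit(unit: str) -> str:
    """Look up the camelCase EML name for a conventional unit abbreviation."""
    try:
        return EML_UNITS[unit]
    except KeyError:
        # Anything missing from the dictionary would be declared as a custom unit.
        raise ValueError(f"{unit!r} not in this subset; define a custom unit")

print(to_eml_unit("m/s"))  # metersPerSecond
```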
Example of data from long-term stream chemistry study.
Data packages don’t always contain just data and metadata. They may contain scripts that were used to process the data in some way. If you generated code while manipulating the dataset and quality controlling it, you can include the code in the data package.
Finally, data provenance can be described. Data provenance refers to a record trail that accounts for the origin of the dataset. If the frog researcher integrated 15 frog datasets from other researchers into a single dataset for her study, then this is where the identity of those original datasets can be recorded. Important for supporting reproducible science.
I will now offer you a few tips on how to create quality metadata, starting with what a good title should contain.
Select keywords wisely. Keywords aren’t something you should just pull out of the air. It’s better to choose terms from a thesaurus or controlled vocabulary. A controlled vocabulary is a standardized list of words that provides a consistent way to describe and index data. In the case of the LTER Controlled Vocabulary, the list consists of about 700 terms that ecologists use frequently to keyword data. So, how would you use the Controlled Vocabulary? If you are considering using CO2 as your keyword, for instance, you would look in this controlled vocabulary to see if CO2 is there, and it is, but it is not the preferred term. The words carbon dioxide should be written out, rather than entering CO2 as the keyword. By using these standard terms, it’s possible to index data holdings based on them. This improves the potential for data discovery considerably.
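That preferred-term lookup can be mimicked in a few lines. The vocabulary and the substitution table below are invented miniature excerpts for illustration, not the real LTER Controlled Vocabulary:

```python
# Hypothetical excerpts of a controlled vocabulary: a set of preferred
# terms, and a map from non-preferred terms to their preferred forms.
PREFERRED = {"carbon dioxide", "periphyton", "water quality", "nutrients"}
USE_INSTEAD = {"CO2": "carbon dioxide"}

def vet_keyword(term: str) -> str:
    """Return the preferred form of a candidate keyword, if known."""
    if term in PREFERRED:
        return term
    if term in USE_INSTEAD:
        return USE_INSTEAD[term]
    raise KeyError(f"{term!r} not found; check the controlled vocabulary")

print(vet_keyword("CO2"))  # carbon dioxide
```

Indexing every dataset with the same preferred terms is what makes keyword search across a repository's holdings reliable.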
Also, it can be helpful to have a reference for standardized place names. Sometimes you may get data that contain specific place names that are likely to be expressed in a variety of different ways. For instance, in the Everglades there are these “Conservation Areas” that have received different treatments. Metadata for these areas may say the research site is “Conservation area 3” or WMACA 3 or other permutations. To get the standardized name, I consult this gazetteer. It’s a lot easier to find data for these locations if all datasets use the same version of the place name.
So you’ve written some brilliant metadata. Then what happens? Well, The Word template isn’t machine readable. Computers like more structure than a Word document can offer. You will learn later today how to generate structured metadata from the EDI template. The structured metadata standard we use at EDI is called Ecological Metadata Language. EML was developed for documenting ecological and environmental datasets, and is implemented in XML. This blue box shows a fragment of EML. You can see that elements of the metadata are enclosed in tags that describe their content. These tags are the XML, in the simplest possible sense. Having the data in EML makes it machine-readable. You can throw 1000 EML documents at a computer and request all the titles be output, and the computer can do that easily.
Once you have your clean dataset and your EML, what do you do with it? You are ready to share data through the EDI Repository. A data repository is a service operated by research organizations where research materials are stored, managed, and made accessible.
What is special about a data repository, as opposed to sharing your data and metadata on a lab web page or a field station’s website? Data repositories have some important functions that a lab website does not.
For instance, Repositories provide for the Long-term security of the data, meaning that a dataset will not ever be lost from a repository. It will be available 20 or more years after it is deposited.
Repositories ensure long-term accessibility of data: a dataset will always be retrievable from the repository.
Data integrity is preserved in a repository, meaning the dataset will never be changed while in the repository. The data are said to be immutable.
For data discovery: the repository will offer a mechanism by which to find data.
Datasets in a repository are citable: datasets in most repositories receive a DOI (digital object identifier), which provides a persistent link to a dataset’s location on the Internet.
You won’t get a DOI by posting your data on your lab website, and DOIs are what makes it possible for researchers to get credit from citations of their data.
Is EDI the only place to store ecological data? No, there are many repositories that will accept ecological data. There are three kinds of repositories: domain-specific, generalist, and institutional. Domain-specific repositories each serve a particular domain, for example ecological data, physics data, or sociological data. Repositories specifically for ecological data in the US include KNB and the Arctic Data Center; there are many other ecological repositories in other countries. Generalist repositories are designed to accept any kind of data. Institutional repositories are found at large institutions, which now run their own repositories to store data, reports, articles, photos, and all kinds of products from researchers at the institution. Some researchers prefer to store their data in their institutional repository.
RE3data.org: 2,540 repositories indexed by this service. Examples: Neotoma (paleoecological data), Gulf Coast Repository, VertNet, Fish Database of Taiwan, Australian Waterbird Surveys.
Let’s take a look at a data record in the EDI Repository so you can see how the structured metadata is turned into a nice html display.
Data are cited alongside journal citations in the references section of a paper.
These columns represent the columns in the dataset. Look at the detail here! Because the data are described so carefully, it’s possible to write on-the-fly R code or Python code that will directly extract this data table from the repository and import it into R.
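As a sketch of that on-the-fly access, the Python below downloads a CSV data entity and parses it into rows. The URL follows EDI's PASTA download pattern (package/data/eml/scope/identifier/revision/entity), but the specific identifier here is made up for illustration:

```python
import csv
import io
import urllib.request

# Hypothetical data-entity URL; real EDI download URLs follow this pattern,
# but this package identifier and entity id are invented for illustration.
URL = "https://pasta.lternet.edu/package/data/eml/knb-lter-xyz/9999/1/abc123"

def rows_from_csv_text(text: str) -> list:
    """Parse CSV text into a list of column-name -> value dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def read_edi_table(url: str) -> list:
    """Download a CSV data entity from a repository and parse it."""
    with urllib.request.urlopen(url) as resp:
        return rows_from_csv_text(resp.read().decode("utf-8"))

# rows = read_edi_table(URL)  # requires network access; URL is hypothetical

# Offline demonstration of the parsing step on a made-up two-row table:
sample = "site,temperature\nSRS1,24.5\nSRS2,26.1"
print(rows_from_csv_text(sample)[0]["site"])  # SRS1
```

Because the repository metadata describes each column's name, type, and units, code like this can be generated automatically for any well-documented data package.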