An Introduction to Metadata and Data Repositories (Phase 3)
Background
Data are not inherently self-describing. An understanding of what the data are and
how they can be used requires quality metadata (data about data). The level of
metadata quality varies considerably and is a distinguishing feature among data
repositories.
Objectives
Define metadata and discuss why they are important
Tips for writing quality metadata
Describe the functions of a data repository
What are metadata?
Table 1: Average temperature of observation for each species
Courtesy: Viv Hutchison
What do temps represent?
How?
Where?
Units?
What are metadata?
Metadata are data about data
WHO created the data?
WHAT is the content of the data?
WHEN were the data created?
WHERE were they collected?
WHY were the data collected?
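The five questions can be treated as a checklist. The following is a minimal sketch, assuming a hypothetical `MetadataRecord` class whose field names are illustrative only and are not part of any metadata standard; the sample values are borrowed from the EML example later in this deck.

```python
from dataclasses import dataclass, fields


@dataclass
class MetadataRecord:
    """Hypothetical record with one free-text answer per question."""
    who: str = ""
    what: str = ""
    when: str = ""
    where: str = ""
    why: str = ""


def missing_answers(record):
    """Return the names of the questions that are still unanswered."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


record = MetadataRecord(
    who="Evelyn Gaiser",
    what="Water quality data from monthly grab samples",
    when="2000-06-01 to 2017-03-30",
    where="Shark River Slough, Everglades National Park",
)
print(missing_answers(record))  # ['why']
```

A check like this only shows that something was written for each question; whether the answer is good enough for a secondary user still needs human review.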
Value of Metadata
Essential for making data FAIR
● Findable: Keywords, good title, DOI
● Accessible: Tell user how to access the data or provide direct link to it
● Interoperable: Accurate and well-described methods and attributes
● Reusable: Understandable
Metadata for EDI (1)
Title and Abstract
Investigators: Synonymous with
"authors" of a paper. Investigators
are the persons (or in some cases
institutions) who have made an
intellectual contribution to the
design of the data
collection/creation effort.
License: Tells future data users how
they can re-use the data
Metadata for EDI (2)
Keywords:
● Important for data discovery.
● Select from an existing
controlled vocabulary or
thesaurus.
Funding:
● Include award number
Timeframe & Location
Taxonomic species
Methods
Metadata for EDI (3)
Describe each data table:
Column Name
Description
Unit/Code Explanation/Date format
Empty Value Code
● Standard units: EML metadata has
a set of predefined variable units
(EML unit dictionary).
○ kg/m² =
kilogramPerMeterSquared
● Custom units: Any unit not defined
in the dictionary can be included as
a custom unit.
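The four entries asked for per column (column name, description, unit/code explanation/date format, empty value code) can be checked mechanically before metadata are written. This is a minimal sketch only: the dictionary keys and the two sample columns are hypothetical and do not come from an EDI template.

```python
# Hypothetical key names mirroring the four entries listed on the slide.
REQUIRED = ("name", "description", "unit_or_format", "empty_value_code")


def incomplete_columns(columns):
    """Map each column name to the required entries that are missing or blank."""
    problems = {}
    for col in columns:
        missing = [key for key in REQUIRED if not str(col.get(key, "")).strip()]
        if missing:
            problems[col.get("name") or "<unnamed>"] = missing
    return problems


columns = [
    {"name": "species", "description": "Common name of the observed species",
     "unit_or_format": "text", "empty_value_code": "NA"},
    {"name": "avg_temp", "description": "Average temperature of observation",
     "unit_or_format": "", "empty_value_code": "NA"},
]
print(incomplete_columns(columns))  # {'avg_temp': ['unit_or_format']}
```

The second column reproduces the problem from the temperature table earlier in the deck: a value without a stated unit.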
Example table description
The Data Table
Metadata for EDI (4)
Scripts/code (software): Data
processing and analysis scripts can
be included in a data package.
Data provenance: A record trail
that accounts for the origin of a
dataset.
Titles, titles, titles
Titles are critical in helping readers find your data
○ While individuals are searching for the most appropriate datasets, they are
most likely going to use the title as the first criterion to determine if a
dataset meets their needs.
A complete title includes: What, Where, and When (and Who, if relevant)
Titles, titles, titles
Which title is better?
● Periphyton
● Periphyton Abundance data collected by FCE LTER from Northeast Shark
River Slough, Florida Everglades National Park, from September 2006 to
September 2008
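The What, Where, When (and Who) rule can be sketched as a small helper. The function name `compose_title` and its phrasing template are assumptions made for illustration; no repository prescribes this exact wording.

```python
def compose_title(what, where, when, who=""):
    """Join the title elements; raise ValueError if a required one is blank."""
    for label, value in (("what", what), ("where", where), ("when", when)):
        if not value.strip():
            raise ValueError(f"title element missing: {label}")
    title = what.strip()
    if who.strip():
        # "Who" is optional, as on the slide.
        title += f" collected by {who.strip()}"
    return f"{title} from {where.strip()}, from {when.strip()}"


print(compose_title(
    what="Periphyton Abundance data",
    where="Northeast Shark River Slough, Florida Everglades National Park",
    when="September 2006 to September 2008",
    who="FCE LTER",
))
```

With these inputs the helper rebuilds the longer of the two titles above, while a bare "Periphyton" with no place or time is rejected.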
Repercussions of bad titles ...
Select keywords wisely
Use a thesaurus or a controlled vocabulary for keywords whenever possible
● LTER Controlled Vocabulary
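Checking keywords against a controlled vocabulary can be sketched as a simple lookup. The three vocabulary terms below are placeholders chosen for illustration; they are not an extract of the LTER Controlled Vocabulary, and the function name is hypothetical.

```python
def split_keywords(keywords, vocabulary):
    """Separate keywords into those found in the vocabulary and those not found.

    Matching ignores case and surrounding whitespace.
    """
    known = {term.strip().casefold() for term in vocabulary}
    accepted = [k for k in keywords if k.strip().casefold() in known]
    rejected = [k for k in keywords if k.strip().casefold() not in known]
    return accepted, rejected


# Placeholder terms only; a real check would load the published vocabulary.
vocabulary = ["periphyton", "water quality", "temperature"]
accepted, rejected = split_keywords(["Periphyton", "frog temps"], vocabulary)
print(accepted)  # ['Periphyton']
print(rejected)  # ['frog temps']
```

Rejected keywords are not necessarily wrong, but each one is a prompt to look for the closest controlled term before adding a free-text keyword.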
Gazetteers: Standardized place names
Ecological Metadata Language (EML)
Metadata standard used widely in the US ecological community
Implemented in the Extensible Markup Language (XML)
<title>Water Quality Data from Shark River
Slough, Everglades National Park</title>
<originator>
<firstName>Evelyn</firstName>
<lastName>Gaiser</lastName>
</originator>
<method>Grab samples of water were
collected monthly </method>
<date>
<begin>2000-06-01</begin>
<end>2017-03-30</end>
</date>
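Because EML is XML, a metadata record can be read by any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a simplified fragment like the one above. Two assumptions apply: the `<dataset>` wrapper element is added here only so the fragment has a single root, and the element names follow the slide's simplified illustration rather than the full EML schema.

```python
import xml.etree.ElementTree as ET

# Simplified, slide-style fragment; not a schema-valid EML document.
FRAGMENT = """
<dataset>
  <title>Water Quality Data from Shark River Slough, Everglades National Park</title>
  <originator>
    <firstName>Evelyn</firstName>
    <lastName>Gaiser</lastName>
  </originator>
  <method>Grab samples of water were collected monthly</method>
  <date>
    <begin>2000-06-01</begin>
    <end>2017-03-30</end>
  </date>
</dataset>
"""

root = ET.fromstring(FRAGMENT)
title = root.findtext("title")
originator = f'{root.findtext("originator/firstName")} {root.findtext("originator/lastName")}'
begin = root.findtext("date/begin")
end = root.findtext("date/end")

print(title)
print(originator, begin, end)
```

A parser also catches structural mistakes: every opening tag must be closed by a tag of the same name, otherwise `ET.fromstring` raises `ET.ParseError`.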
What does one do with an EML document?
Deposit metadata and data in a data repository!
A data repository is a service operated by research organizations, where research
materials are stored, managed and made accessible
Data Repositories ensure
● Long-term security of the data
● Long-term accessibility of the data
● Data integrity
● Data discovery
● Datasets are citable
● Most repositories provide a DOI
Where to deposit ecological data?
Domain specific repositories
● Environmental Data Initiative Repository
● Knowledge Network for Biocomplexity
● Arctic Data Center
Generalist repositories
● Dryad
● Figshare
● Zenodo
Institutional repositories
Lots of repositories to choose from….
Repositories differ:
● Amount of metadata required
● Support of provenance
● Immutability
● Domains supported
EDI Data Repository
Data Citation
More EDI Metadata
Attributes of a data table
Summary
A metadata record captures critical information about the content of a dataset
Metadata allow data to be discovered, accessed, integrated and re-used
Data repositories support Findability, Accessibility, Interoperability, and Reusability
(FAIR) of research data

EDI Training Module 12: An Introduction to Metadata and Data Repositories


Editor's Notes

  1. Describe functions of a data repository which is the final destination of the metadata.
  2. What are metadata? Let’s take a look at this question from the perspective of a researcher. Suppose you are a scientist who wants to study the effects of temperature on frogs. You reach out to all your frog scientist friends and ask for datasets on this topic because you want to do a meta-analysis, an analysis across multiple studies. You are sent this data file by one colleague, with no supporting info. What additional information would you need in order to use these data? Units? What do these temperatures represent? Temperature of the skin of the frog or of the water it was found in? How were the data collected? Where? In the wild, or in a zoo? When were the data collected? Was it 30 years ago, before amphibians were in decline? Furthermore, was the minimum temperature for one of these poor Wood Frogs really zero?
  3. Metadata are just data about data. They help the original creator of the data remember what they did, and they help a secondary data user to understand the data well enough to reuse them. So metadata include information about who created the dataset. A secondary data user may want to contact this creator for more information. What is the content of the dataset? The abstract in the metadata should briefly describe this. When were the data collected? Are the data from a long-term study, or just a short experiment? Where were the measurements collected? How were they collected? Why were the data collected? This Why question may indicate that there was some bias in how measurements were made that makes the data unsuitable for a new purpose. So metadata are the who, what, when, where and why of a dataset.
  4. On the value of metadata: you will recall the FAIR data principles that Susanne described on Tuesday. The FAIR principles are guidelines for making data findable, accessible, interoperable and reusable. Metadata are essential to all four of the FAIR principles. With respect to data findability, metadata contain keywords, a good title, and a persistent identifier or DOI. All of these facilitate data discovery. Metadata tell a user how to access the data or provide a direct link to them. They indicate how the data are licensed and what a reuser may do with them. Very detailed metadata include accurate and well-described methods and attributes, which are essential for interoperability and integration of datasets. Finally, complete metadata should make the data understandable to a secondary user, without that user needing to contact the data creator.
  5. Speaking of complete and detailed metadata, let’s talk a bit about what metadata EDI requires. This is the Word template for EDI metadata that you may already have seen. I will step you through what is needed to complete this document. Remember, if you are filling out this template, you need to provide answers to the questions that a typical data reuser would need answered in order to interpret these data correctly. The license you choose will tell future data users how they can reuse the data. Creative Commons is an American non-profit organization devoted to expanding the range of open access creative works available for others to build upon legally and to share. The organization has released several copyright licenses, known as Creative Commons licenses, free of charge to the public. CC0 means no rights reserved. CC-BY is a license that requires that the data authors get attribution, but otherwise the data can be used however someone likes. If you don’t choose either one of these licenses, then by default your dataset will be given the CC0 license.
  6. On the next page is the section to provide keywords. We suggest that metadata creators select several keywords that are highly relevant to the data being documented. Keywords help a would-be secondary user of the data find the data. Keywords should be precise. Sometimes people get carried away and include 40 keywords. That’s too many. My rule of thumb is 10 or fewer from the LTER CV, and a couple of additional ones that describe the project. There is a link to a tool into which you can input the abstract of a dataset, for example, and the tool will suggest keywords from the LTER Controlled Vocabulary. Providing a reference to the funding source for the study is important. Funders like to be able to search a data repository and see what their funding dollars bought. If you provide a grant number and funder ID, then NSF, for example, can quickly find datasets related to projects they funded. Timeframe and geographic location come next. In Methods, you should describe what you did so that someone else could reproduce your study. You should describe the experimental design, the instruments used, and how samples were processed. You can point to published protocols, too, if they are relevant. Methods are really important when a data reuser is trying to determine whether the data are suitable for their analysis or not.
  7. Here is where you describe all the attributes in a data table. In the first column, you would put the variable or attribute names from the header of your dataset. Units have to be written in a particular way: they are written out in camelCase so that they are unambiguous.
  8. Example of data from long-term stream chemistry study.
  9. Data packages don’t always contain just data and metadata. They may contain scripts that were used to process the data in some way. If you generated code while manipulating the dataset and quality controlling it, you can include the code in the data package. Finally, data provenance can be described. Data provenance refers to a record trail that accounts for the origin of the dataset. If the frog researcher integrated 15 frog datasets from other researchers into a single dataset for her study, then this is where the identity of those original datasets can be recorded. This is important for supporting reproducible science.
  10. I will now offer you a few tips on how to create quality metadata, starting with what a good title should contain.
  11. Metadata aggregator. Obscure titles aren’t helpful.
  12. Select keywords wisely. Keywords aren’t something you should just pull out of the air. It’s better to choose terms from a thesaurus or controlled vocabulary. A controlled vocabulary is a standardized list of words that provides a consistent way to describe and index data. In the case of the LTER Controlled Vocabulary, the list consists of about 700 terms that ecologists use frequently to keyword data. So, how would you use the Controlled Vocabulary? If you are considering using CO2 as your keyword, for instance, you would look into this controlled vocabulary and see if CO2 is there, and it is, but it is not the preferred term. The words carbon dioxide should be written out, rather than entering CO2 as the keyword. By using these standard terms, it’s possible to index data holdings based on these terms. This improves the potential for data discovery considerably.
  13. Also, it can be helpful to have a reference for standardized place names. Sometimes you may get data that contain specific place names that are likely to be expressed in a variety of different ways. For instance, in the Everglades there are these “Conservation Areas” that have received different treatments. Metadata for these areas may say the research site is “Conservation area 3” or WMACA 3 or other permutations. To get the standardized name, I consult this gazetteer. It’s a lot easier to find data for these locations if all datasets use the same version of the place name.
  14. So you’ve written some brilliant metadata. Then what happens? Well, the Word template isn’t machine-readable. Computers like more structure than a Word document can offer. You will learn later today how to generate structured metadata from the EDI template. The structured metadata standard we use at EDI is called Ecological Metadata Language. EML was developed for documenting ecological and environmental datasets, and is implemented in XML. This blue box shows a fragment of EML. You can see that elements of the metadata are enclosed in tags that describe their content. These tags are the XML, in the simplest possible sense. Having the metadata in EML makes them machine-readable. You can throw 1000 EML documents at a computer and request that all the titles be output, and the computer can do that easily.
  15. Once you have your clean dataset and your EML, what do you do with it? You are ready to share data through the EDI Repository. A data repository is a service operated by research organizations where research materials are stored, managed, and made accessible.
  16. What is special about a data repository as opposed to sharing your data and metadata on a lab web page or a field station’s website? Data repositories have some important functions that a lab website does not. For instance, repositories provide for the long-term security of the data, meaning that a dataset will not ever be lost from a repository. It will be available 20 or more years after it is deposited. Repositories ensure long-term accessibility of data: a dataset will always be retrievable from the repository. Data integrity is preserved in a repository, meaning the dataset will never be changed while in the repository. The data are said to be immutable. For data discovery, the repository will offer a mechanism by which to find data. Datasets in a repository are citable: datasets in most repositories receive a DOI (Digital Object Identifier), which provides a persistent link to a dataset’s location on the Internet. You won’t get a DOI by posting your data on your lab website, and DOIs are what make it possible for researchers to get credit from citations of their data.
  17. Is EDI the only place to store ecological data? No, there are many repositories that will accept ecological data. There are three kinds of repositories: domain-specific, generalist, and institutional. Domain-specific repositories each store data from a particular domain, for example ecological data, physics data, or sociological data. Repositories specifically for ecological data in the US include KNB and the Arctic Data Center. There are many other ecological repositories in other countries. Generalist repositories are designed to accept any kind of data. Institutional repositories are found at large institutions, which now run their own repositories to store data, reports, articles, photos, and all kinds of products from researchers at the institution. Some researchers prefer to store their data in their institutional repository.
  18. RE3data.org: 2,540 repositories indexed by this service. Examples: Neotoma (paleoecological data), Gulf Coast Repository, VertNet, Fish Database of Taiwan, Australian Waterbird Surveys.
  19. Let’s take a look at a data record in the EDI Repository so you can see how the structured metadata is turned into a nice html display. Data are cited alongside journal citations in the references section of a paper.
  20. These columns represent the columns in the dataset. Look at the detail here! Because the data are described so carefully, it’s possible to write on-the-fly R code or Python code that will directly extract this data table from the repository and import it into R.
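
The point in note 14 about asking a computer for all the titles can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a full EML reader: the records use the simplified tag names from the slide fragment, wrapped in a hypothetical `<dataset>` root element, whereas real EML documents follow a richer, namespaced schema. The second record and the `extract_titles` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, EML-like records modelled on the slide fragment. The <dataset>
# wrapper and the second record are assumptions for illustration only; real
# EML uses a richer schema with XML namespaces.
DOCS = [
    """<dataset>
         <title>Water Quality Data from Shark River Slough, Everglades National Park</title>
         <originator>
           <firstName>Evelyn</firstName>
           <lastName>Gaiser</lastName>
         </originator>
         <date><begin>2000-06-01</begin><end>2017-03-30</end></date>
       </dataset>""",
    """<dataset>
         <title>Periphyton Abundance Data from Northeast Shark River Slough</title>
       </dataset>""",
]


def extract_titles(xml_documents):
    """Return the <title> text of each document that has one."""
    titles = []
    for text in xml_documents:
        root = ET.fromstring(text)      # parse one XML document from a string
        title = root.findtext("title")  # None if there is no <title> child
        if title is not None:
            titles.append(title.strip())
    return titles


if __name__ == "__main__":
    for title in extract_titles(DOCS):
        print(title)
```

The same loop would work unchanged for 1000 documents, which is the practical benefit of structured metadata over a Word template.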