SlideShare a Scribd company logo
1 of 28
Turning three thesauri into a
Global Agricultural Concept Scheme
March 9, 2015
Research Data Alliance
Session II: Good Practices towards opening data in agriculture
Cynthia Parr, National Agricultural Library
@cydparr, cynthia.parr@ars.usda.gov
Outline
1. Background
2. Starting point: three thesauri
3. Creating GACS
4. Challenges
5. Next steps and future of GACS
Background
● Food and Agriculture Organization of the UN
● CABI (UK)
● National Agricultural Library (US)
Each organization maintains a thesaurus of terms and concepts related to
agriculture -- concepts like rice, ricefield aquaculture, and plant pests.
Separate thesauri, separate databases
Create GACS as a glue linking them together
Global Agricultural Concept Scheme (GACS)
agreement October 2013 to conduct feasibility study
1. To improve the semantic interoperability of thesauri
maintained by FAO, CABI, and NAL.
2. To identify and provide core concepts broadly
supported across the three thesauri.
3. To achieve efficiencies of scale by maintaining the core
concepts in cooperation.
Consultants
Osma Suominen (Finland)
osma.suominen@helsinki.fi
Tom Baker (Germany)
tom@tombaker.org
Creating GACS
Phase One: Analysis of Thesauri
AGROVOC CAB Thesaurus NAL Thesaurus
140,000
concepts,
>1.4M terms
32,000
concepts,
>1.2M terms
53,000
concepts,
>200k terms
English, Spanish,
Portuguese, German,
Czech, Persian, Polish,
Hindi, French, Italian,
Russian, Japanese,
Hungarian, Chinese,
Slovak, Thai, Lao, Turkish,
Korean, Arabic, Telugu ...
English, Spanish,
Portuguese, Dutch
+ many languages with
lower coverage
English, Spanish
All thesauri represented using SKOS
Overlap estimate
Obtained via automatic
mappings created using
AgreementMakerLight
Long tail distribution (in AGRIS)
10,000 concepts cover nearly 99% of occurrences in metadata
Requirements and Wishes
1. An integrated view and bridge of existing thesauri
2. Reuses thesaurus development work, incl. translations
3. Compatible with existing databases
4. Based on RDF technologies: URIs, SKOS etc.
5. Available as Linked Open Data
Currently building GACS Beta, a proof-of-concept
implementation attempting to fulfill most requirements
Creating GACS
Phase Two: Proof of Concept
Selection of top 10,000 concepts
Each partner organization provided
the 10,000 concepts most frequently
used in their respective databases.
These lists of concepts were
modified as follows:
● added all countries (from
AGROVOC)
● added organisms hierarchy all
the way to the top
Automated mappings
Created using AgreementMakerLight software
between the full thesauri, for completeness
AgreementMakerLight was top performer at
OAEI 2014 ontology mapping competition!
Human evaluation of mappings
Created Google Docs spreadsheets using the lists of selected concepts and
the auto-generated mappings. Three sheets with circa 10,700 rows each.
Mappings manually evaluated by
staff of partner organizations.
Evaluated 60 to 150 rows/hour,
total evaluation time over 300
hours so far.
Currently projected to take
500-600 hours for GACS Beta.
Forming GACS concepts
by merging the source concepts and aggregating their information
rice
UF paddy
UF paddy rice
cereals
UF feed cereals
UF small grain cereals (grain)
Oryza sativa
UF Oryza glutinosa
UF Oryza indica
UF Oryza japonica
UF Oryza sativa … (subsp, var etc.)
Oryza
UF Padia
UF rice (plant)
agrovoc:c_5435
cabt:82917
nalt:56271
exactMatch
agrovoc:c_5438
cabt:82935
nalt:56277
exactMatch
agrovoc:c_1474
cabt:26247
exactMatch
agrovoc:c_6599
cabt:101613
nalt:56293
exactMatch
(actually we use SKOS, not traditional thesaurus tags)
Size of GACS
GACS
GACS Beta
will have around
14,000 of the
most used
concepts
Quality evaluation
Using the qSKOS and Skosify tools that can find and correct problems in SKOS
vocabularies [1], we can detect
● missing, invalid or overlapping concept labels
● anomalies in concept hierarchy, e.g. cycles
● ...and many other kinds of problems.
Many problems are expected due to merging of concepts within GACS, but
most should be automatically corrected.
[1] Osma Suominen and Christian Mader: Assessing and Improving the
Quality of SKOS Vocabularies. JoDS, 3(1) 2014.
Demo of GACS Alpha in Skosmos
http://bit.ly/1Gjf5jl
Additional mapping rounds
Need to perform 2-3 more
smaller mapping rounds
in order to ensure that
all necessary concepts
have been fully mapped
between all source thesauri
Lessons already learned
● It is hard to sustain focus on mapping beyond circa five hours per day.
● Mapping reveals issues with both the source and target thesauri -- areas
for improvement, or errors, fixable in collaboration.
● Starting with the 10,000 most-used concepts shines a light on parts of
thesauri that may long have lacked attention.
● Starting small, with a core, avoids the potential stress of over-committing
resources.
● Mapping provides an incentive to adopt open-data technologies that have
proven beneficial in other areas.
Challenges
Differences in modeling
Q: Are taxonomic organism names (e.g. ‘Bos taurus’)
different concepts than the common names (‘cattle’)?
● sometimes there is no 1:1 match
and/or context of use is different
● the source thesauri all have different policies
No final answer yet...
Lumps
clusters of concepts mapped one-to-several, several-to-one, or in spirals
Next steps
and future of GACS
GACS system infrastructure
Beyond GACS Beta?
Q: Can GACS replace existing agricultural thesauri?
● definitely not with GACS Beta due to smaller scope/size
● a future GACS may be an alternative for some
scenarios, but not all uses of existing thesauri because
o they cover areas beyond agriculture
o existing systems and processes (publication,
automatic indexing…) depend on current thesauri
In future, more partners are expected and the scope of GACS can be adjusted.
Thank you
Reports available on the FAO AIMS site:
http://aims.fao.org/community/agrovoc/blogs/phase-one-gacs-approved-read-reports
GACS Alpha: http://tester-os-kktest.lib.helsinki.fi/gacsdemo/en/
Slides prepared by Osma Suominen and Tom Baker
osma.suominen@helsinki.fi
tom@tombaker.org
@cydparr

More Related Content

Viewers also liked

Viewers also liked (8)

BEYOND ballet why and how
BEYOND ballet why and howBEYOND ballet why and how
BEYOND ballet why and how
 
How to prepare for sat intelligently
How to prepare for sat intelligentlyHow to prepare for sat intelligently
How to prepare for sat intelligently
 
газета на сайт
газета на сайтгазета на сайт
газета на сайт
 
ESTEFANIA QUIRINO APARICIO
ESTEFANIA QUIRINO APARICIOESTEFANIA QUIRINO APARICIO
ESTEFANIA QUIRINO APARICIO
 
The Job of Content
The Job of ContentThe Job of Content
The Job of Content
 
ownCloud8リリース
ownCloud8リリース ownCloud8リリース
ownCloud8リリース
 
Touring sanfeliu
Touring sanfeliuTouring sanfeliu
Touring sanfeliu
 
Yorkshire Tea – King of the Mornings
Yorkshire Tea – King of the MorningsYorkshire Tea – King of the Mornings
Yorkshire Tea – King of the Mornings
 

Similar to Turning three thesauri into a Global Agricultural Concept Scheme

A Review on Novel Approach for Load Balancing in Cloud Computing
A Review on Novel Approach for Load Balancing in Cloud ComputingA Review on Novel Approach for Load Balancing in Cloud Computing
A Review on Novel Approach for Load Balancing in Cloud Computing
ijtsrd
 

Similar to Turning three thesauri into a Global Agricultural Concept Scheme (20)

Global Agricultural Concept Scheme The collaborative integration of three the...
Global Agricultural Concept SchemeThe collaborative integration of three the...Global Agricultural Concept SchemeThe collaborative integration of three the...
Global Agricultural Concept Scheme The collaborative integration of three the...
 
The GACS Project by Caterina Caracciolo
The GACS Project by Caterina CaraccioloThe GACS Project by Caterina Caracciolo
The GACS Project by Caterina Caracciolo
 
IGAD_CODATA
IGAD_CODATAIGAD_CODATA
IGAD_CODATA
 
From GACS to Agrisemantics - Steps Forward Towards Interoperability of Data f...
From GACS to Agrisemantics - Steps Forward Towards Interoperability of Data f...From GACS to Agrisemantics - Steps Forward Towards Interoperability of Data f...
From GACS to Agrisemantics - Steps Forward Towards Interoperability of Data f...
 
DSWS PARSEC 200925
DSWS PARSEC 200925DSWS PARSEC 200925
DSWS PARSEC 200925
 
The Generation Challenge Programme: Lessons learnt relevant to CRPs, and the ...
The Generation Challenge Programme: Lessons learnt relevant to CRPs, and the ...The Generation Challenge Programme: Lessons learnt relevant to CRPs, and the ...
The Generation Challenge Programme: Lessons learnt relevant to CRPs, and the ...
 
2015 01 godan_wageningen_gacs
2015 01 godan_wageningen_gacs2015 01 godan_wageningen_gacs
2015 01 godan_wageningen_gacs
 
Amman Workshop #2 - M MacKay
Amman Workshop #2 - M MacKayAmman Workshop #2 - M MacKay
Amman Workshop #2 - M MacKay
 
Looking forward to the next 5 years of cassava modelling
Looking forward to the next 5 years of cassava modellingLooking forward to the next 5 years of cassava modelling
Looking forward to the next 5 years of cassava modelling
 
2005 09 Dc Keynote
2005 09 Dc Keynote2005 09 Dc Keynote
2005 09 Dc Keynote
 
A Review on Novel Approach for Load Balancing in Cloud Computing
A Review on Novel Approach for Load Balancing in Cloud ComputingA Review on Novel Approach for Load Balancing in Cloud Computing
A Review on Novel Approach for Load Balancing in Cloud Computing
 
The evolution of the Sentinel Landscape initiative
The evolution of the Sentinel Landscape initiativeThe evolution of the Sentinel Landscape initiative
The evolution of the Sentinel Landscape initiative
 
Frankenberger - Do Edge of Field Monitoring Results Inform, Support, and Improve
Frankenberger - Do Edge of Field Monitoring Results Inform, Support, and ImproveFrankenberger - Do Edge of Field Monitoring Results Inform, Support, and Improve
Frankenberger - Do Edge of Field Monitoring Results Inform, Support, and Improve
 
AgroSupportAnalytics: big data recommender system for agricultural farmer co...
AgroSupportAnalytics: big data recommender system for  agricultural farmer co...AgroSupportAnalytics: big data recommender system for  agricultural farmer co...
AgroSupportAnalytics: big data recommender system for agricultural farmer co...
 
Day 1_Session3_TRIPS_WASDS_ICRISAT
Day 1_Session3_TRIPS_WASDS_ICRISATDay 1_Session3_TRIPS_WASDS_ICRISAT
Day 1_Session3_TRIPS_WASDS_ICRISAT
 
GO FAIR Food Systems Implementation Network
GO FAIR Food Systems Implementation NetworkGO FAIR Food Systems Implementation Network
GO FAIR Food Systems Implementation Network
 
Australia's Environmental Predictive Capability
Australia's Environmental Predictive CapabilityAustralia's Environmental Predictive Capability
Australia's Environmental Predictive Capability
 
From Problem Models to Solutions: the Global Livestock CRSP
From Problem Models to Solutions: the Global Livestock CRSPFrom Problem Models to Solutions: the Global Livestock CRSP
From Problem Models to Solutions: the Global Livestock CRSP
 
Farm level case studies Tanzania
Farm level case studies TanzaniaFarm level case studies Tanzania
Farm level case studies Tanzania
 
Analysing the Impact of Web-based, 'Discussion-Support' in the Australian Sug...
Analysing the Impact of Web-based, 'Discussion-Support' in the Australian Sug...Analysing the Impact of Web-based, 'Discussion-Support' in the Australian Sug...
Analysing the Impact of Web-based, 'Discussion-Support' in the Australian Sug...
 

More from CIARD Movement

More from CIARD Movement (20)

Efficient & effective data management for research projects : ILRI's Data Ma...
Efficient & effective  data management for research projects : ILRI's Data Ma...Efficient & effective  data management for research projects : ILRI's Data Ma...
Efficient & effective data management for research projects : ILRI's Data Ma...
 
Social Media in: Disseminating and Sharing Agriculture Data/Information
Social Media in: Disseminating and Sharing Agriculture Data/InformationSocial Media in: Disseminating and Sharing Agriculture Data/Information
Social Media in: Disseminating and Sharing Agriculture Data/Information
 
DSpace at ILRI : A semi-technical overview of “CGSpace”
DSpace at ILRI : A semi-technical overview of “CGSpace”DSpace at ILRI : A semi-technical overview of “CGSpace”
DSpace at ILRI : A semi-technical overview of “CGSpace”
 
University of Nairobi, Open Access Initiatives
University of Nairobi, Open Access InitiativesUniversity of Nairobi, Open Access Initiatives
University of Nairobi, Open Access Initiatives
 
Knowledge Management at KEFRI
Knowledge Management at KEFRIKnowledge Management at KEFRI
Knowledge Management at KEFRI
 
Open Research Data – the KALRO experience
Open Research Data – the KALRO experienceOpen Research Data – the KALRO experience
Open Research Data – the KALRO experience
 
JKUAT Case on Open Access
JKUAT Case on Open AccessJKUAT Case on Open Access
JKUAT Case on Open Access
 
JKUAT Case on Open Access
JKUAT Case on Open AccessJKUAT Case on Open Access
JKUAT Case on Open Access
 
Open Data and Open Science in Agriculture: Management
Open Data and Open Science in Agriculture: ManagementOpen Data and Open Science in Agriculture: Management
Open Data and Open Science in Agriculture: Management
 
Open Access Initiatives and Challenges in Kenya: Universities
Open Access Initiatives and Challenges in Kenya: UniversitiesOpen Access Initiatives and Challenges in Kenya: Universities
Open Access Initiatives and Challenges in Kenya: Universities
 
ICT Centre of Excellence and Open Data –iCEOD
ICT Centre of Excellence and Open Data –iCEODICT Centre of Excellence and Open Data –iCEOD
ICT Centre of Excellence and Open Data –iCEOD
 
Open Data and Big Data Capacity Building Initiative
Open Data and Big Data Capacity Building InitiativeOpen Data and Big Data Capacity Building Initiative
Open Data and Big Data Capacity Building Initiative
 
Forum on Open Data and Open Science in Agriculture in Kenya: African Journal ...
Forum on Open Data and Open Science in Agriculture in Kenya: African Journal ...Forum on Open Data and Open Science in Agriculture in Kenya: African Journal ...
Forum on Open Data and Open Science in Agriculture in Kenya: African Journal ...
 
Open Data and Open Science in Agriculture : Experiences and Opinions
Open Data and Open Science in Agriculture : Experiences and Opinions Open Data and Open Science in Agriculture : Experiences and Opinions
Open Data and Open Science in Agriculture : Experiences and Opinions
 
Open Access, Open Data and Open Science in the context of agricultural research
Open Access, Open Data and Open Science in the context of agricultural researchOpen Access, Open Data and Open Science in the context of agricultural research
Open Access, Open Data and Open Science in the context of agricultural research
 
Introducing the GODAN Secretariat
Introducing the GODAN SecretariatIntroducing the GODAN Secretariat
Introducing the GODAN Secretariat
 
Research Data Management at International Food Policy Research Institute-IFPRI
Research Data Management at International Food Policy Research Institute-IFPRIResearch Data Management at International Food Policy Research Institute-IFPRI
Research Data Management at International Food Policy Research Institute-IFPRI
 
Enabling Global Solutions for Agricultural and Nutrition Challenges through L...
Enabling Global Solutions for Agricultural and Nutrition Challenges through L...Enabling Global Solutions for Agricultural and Nutrition Challenges through L...
Enabling Global Solutions for Agricultural and Nutrition Challenges through L...
 
The CIARD RINGValeri
The CIARD RINGValeriThe CIARD RINGValeri
The CIARD RINGValeri
 
RDA Wheat Data Interoperability Cookbook and last developments
RDA Wheat Data Interoperability Cookbook and last developmentsRDA Wheat Data Interoperability Cookbook and last developments
RDA Wheat Data Interoperability Cookbook and last developments
 

Recently uploaded

LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
Silpa
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
ANSARKHAN96
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
Silpa
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
Silpa
 

Recently uploaded (20)

Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptx
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 

Turning three thesauri into a Global Agricultural Concept Scheme

  • 1. Turning three thesauri into a Global Agricultural Concept Scheme March 9, 2015 Research Data Alliance Session II: Good Practices towards opening data in agriculture Cynthia Parr, National Agricultural Library @cydparr, cynthia.parr@ars.usda.gov
  • 2. Outline 1. Background 2. Starting point: three thesauri 3. Creating GACS 4. Challenges 5. Next steps and future of GACS
  • 3. Background ● Food and Agriculture Organization of the UN ● CABI (UK) ● National Agricultural Library (US) Each organization maintains a thesaurus of terms and concepts related to agriculture -- concepts like rice, ricefield aquaculture, and plant pests.
  • 4. Separate thesauri, separate databases Create GACS as a glue linking them together
  • 5. Global Agricultural Concept Scheme (GACS) agreement October 2013 to conduct feasibility study 1. To improve the semantic interoperability of thesauri maintained by FAO, CABI, and NAL. 2. To identify and provide core concepts broadly supported across the three thesauri. 3. To achieve efficiencies of scale by maintaining the core concepts in cooperation.
  • 7. Creating GACS Phase One: Analysis of Thesauri
  • 8. AGROVOC CAB Thesaurus NAL Thesaurus 140,000 concepts, >1.4M terms 32,000 concepts, >1.2M terms 53,000 concepts, >200k terms English, Spanish, Portuguese, German, Czech, Persian, Polish, Hindi, French, Italian, Russian, Japanese, Hungarian, Chinese, Slovak, Thai, Lao, Turkish, Korean, Arabic, Telugu ... English, Spanish, Portuguese, Dutch + many languages with lower coverage English, Spanish All thesauri represented using SKOS
  • 9. Overlap estimate Obtained via automatic mappings created using AgreementMakerLight
  • 10. Long tail distribution (in AGRIS) 10,000 concepts cover nearly 99% of occurrences in metadata
  • 11. Requirements and Wishes 1. An integrated view and bridge of existing thesauri 2. Reuses thesaurus development work, incl. translations 3. Compatible with existing databases 4. Based on RDF technologies: URIs, SKOS etc. 5. Available as Linked Open Data Currently building GACS Beta, a proof-of-concept implementation attempting to fulfill most requirements
  • 12. Creating GACS Phase Two: Proof of Concept
  • 13. Selection of top 10,000 concepts Each partner organization provided the 10,000 concepts most frequently used in their respective databases. These lists of concepts were modified as follows: ● added all countries (from AGROVOC) ● added organisms hierarchy all the way to the top
  • 14. Automated mappings Created using AgreementMakerLight software between the full thesauri, for completeness AgreementMakerLight was top performer at OAEI 2014 ontology mapping competition!
  • 15. Human evaluation of mappings Created Google Docs spreadsheets using the lists of selected concepts and the auto-generated mappings. Three sheets with circa 10,700 rows each. Mappings manually evaluated by staff of partner organizations. Evaluated 60 to 150 rows/hour, total evaluation time over 300 hours so far. Currently projected to take 500-600 hours for GACS Beta.
  • 16. Forming GACS concepts by merging the source concepts and aggregating their information rice UF paddy UF paddy rice cereals UF feed cereals UF small grain cereals (grain) Oryza sativa UF Oryza glutinosa UF Oryza indica UF Oryza japonica UF Oryza sativa … (subsp, var etc.) Oryza UF Padia UF rice (plant) agrovoc:c_5435 cabt:82917 nalt:56271 exactMatch agrovoc:c_5438 cabt:82935 nalt:56277 exactMatch agrovoc:c_1474 cabt:26247 exactMatch agrovoc:c_6599 cabt:101613 nalt:56293 exactMatch (actually we use SKOS, not traditional thesaurus tags)
  • 17. Size of GACS GACS GACS Beta will have around 14,000 of the most used concepts
  • 18. Quality evaluation Using the qSKOS and Skosify tools that can find and correct problems in SKOS vocabularies [1], we can detect ● missing, invalid or overlapping concept labels ● anomalies in concept hierarchy, e.g. cycles ● ...and many other kinds of problems. Many problems are expected due to merging of concepts within GACS, but most should be automatically corrected. [1] Osma Suominen and Christian Mader: Assessing and Improving the Quality of SKOS Vocabularies. JoDS, 3(1) 2014.
  • 19. Demo of GACS Alpha in Skosmos http://bit.ly/1Gjf5jl
  • 20. Additional mapping rounds Need to perform 2-3 more smaller mapping rounds in order to ensure that all necessary concepts have been fully mapped between all source thesauri
  • 21. Lessons already learned ● It is hard to sustain focus on mapping beyond circa five hours per day. ● Mapping reveals issues with both the source and target thesauri -- areas for improvement, or errors, fixable in collaboration. ● Starting with the 10,000 most-used concepts shines a light on parts of thesauri that may long have lacked attention. ● Starting small, with a core, avoids the potential stress of over-committing resources. ● Mapping provides an incentive to adopt open-data technologies that have proven beneficial in other areas.
  • 23. Differences in modeling Q: Are taxonomic organism names (e.g. ‘Bos taurus’) different concepts than the common names (‘cattle’)? ● sometimes there is no 1:1 match and/or context of use is different ● the source thesauri all have different policies No final answer yet...
  • 24. Lumps clusters of concepts mapped one-to-several, several-to-one, or in spirals
  • 27. Beyond GACS Beta? Q: Can GACS replace existing agricultural thesauri? ● definitely not with GACS Beta due to smaller scope/size ● a future GACS may be an alternative for some scenarios, but not all uses of existing thesauri because o they cover areas beyond agriculture o existing systems and processes (publication, automatic indexing…) depend on current thesauri In future, more partners are expected and the scope of GACS can be adjusted.
  • 28. Thank you Reports available on the FAO AIMS site: http://aims.fao.org/community/agrovoc/blogs/phase-one-gacs-approved-read-reports GACS Alpha: http://tester-os-kktest.lib.helsinki.fi/gacsdemo/en/ Slides prepared by Osma Suominen and Tom Baker osma.suominen@helsinki.fi tom@tombaker.org @cydparr

Editor's Notes

  1. I am presenting this talk on behalf of the GACS partnership – I myself am not directly involved but none of them could be here. In particular, I should credit up front Osma Suominen and Tom Baker who prepared the presentation, and Lori Finch from the NAL who helped me understand and prepare. I will do my best to try to answer your questions or can help direct you to one of those who is directly involved.
  2. Each organization has its own thesaurus which they maintain, and one or more databases that make use of the concepts within that thesaurus. Currently no semantic compatibility - cannot do cross-database queries.
  3. The three partner organizations hired consultants to do most of the work that is being presented here one from Finland, one from Germany currently teaching in Korea but from Germany
  4. The three thesauri vary a lot. The CAB thesuaurs has a huge number of unique concepts compared to the other two, though because Agrovoc expresses those concepts Ina large number of languages, it has almost as many term as CAB Thesaurus. NAL thesaurus, relatively new, has a smaller number of terms but more concepts than Agrovoc. Note however, that all three share two languages: English & Spanish
  5. To determine what else was shared. OsmaSuominen used an automated open source tool for ontology matching: AgreementMakerLight. The “common core” shared by all three thesauri is about 13,500 concepts. Here we show some of the concepts that are not shared (you can see in the CAB thesaurus many of these are scientific names). Some examples of the terms that ARE shared are shown here.
  6. Interestingly, if you look at metadata records from AGRIS, the FAO database of publications and other materials, and what AGROVOC concepts are used in those records, you can see it is a very long tail distribution – of the 32K concepts in Agrivoc, 10,000 concepts cover nearly 99% of the occurences. Similar results were found for CAB abstracts and Agricola.
  7. So the conclusion of Phase 1 is that phase two should focus on integrating and bridging these existing thesauri in order to re-use the existing work (including the translations into other languages). There was a desire to maintain compatibility with the existing databases – good pipelines and tools already exist, so there is no desire to replace the thesauri currently being used by the three partners. The partners decided the work should be based on RDF and made available as linked open data. So, Phase 2 would be to create a proof of concept GACS beta impelmentation.
  8. The partnership is now nearing the end of phase 2. So I will now describe how that proof of concept is being developed.
  9. First, each organism took their top 10Kf concepts In case they were not in the top 10K, the partner also added all Country names from Agrovoc and, if the core concept was an organism additional names were added to support the taxonomic hierarchy all the way to the top.
  10. OAEI ontology alignment evaluation initiative Pairwise in this direction: NALT-AGROVOC, AGROVOC-CABT, and CABT-NALT
  11. For example, Oryza (the genus for rice) has exact match in each of three, so that term will receive its own GACS URIC Cereal is a broader term than rice, Oryza is a broader term than O. sativa, rice and Oryza sativa have an associative relation – not as specific as an exactMatch. If you want to know about Oryza sativa you may also be interested in the concept rice. UF is "used for" in SKOS it is "alt label"
  12. Might have errors in mapping that Vascular tissues -- is that of animal or plant systems? does it truly map to these (the code on the arrow is the row in themapping spreadsheet) should overweight and obesity map be an exact match AC1501 (agrivoc to cabi spreadsheet line 1501)(1 = exact match) Partners will be resolving some of these specific cases as well the larger policy questions by the end of march when there’s a meeting in Wallingford UK. Half of the problems are taxonomy problems, not surprising given the difficulty of staying up to date and the different authorities and sources.
  13. anticipate phase 3 to address infrastructure for auto processes editorial environment publishing platform (using SKOSMOS right now)
  14. CABI includes CAB health (beyond agricutlure) CABI uses theres for publishing NAL uses theirs for automated indexing ... More partners expected, eg. CGIAR...
  15. I will tweet these