Cynthia Parr, National Agricultural Library, at RDA 5th Plenary Meeting, IG Agriculture Data Interoperability Session in San Diego (CA, US) on the 9th of March 2015
Turning three thesauri into a Global Agricultural Concept Scheme
1. Turning three thesauri into a
Global Agricultural Concept Scheme
March 9, 2015
Research Data Alliance
Session II: Good Practices towards opening data in agriculture
Cynthia Parr, National Agricultural Library
@cydparr, cynthia.parr@ars.usda.gov
3. Background
● Food and Agriculture Organization of the UN
● CABI (UK)
● National Agricultural Library (US)
Each organization maintains a thesaurus of terms and concepts related to
agriculture -- concepts like rice, ricefield aquaculture, and plant pests.
5. Global Agricultural Concept Scheme (GACS)
agreement October 2013 to conduct feasibility study
1. To improve the semantic interoperability of thesauri
maintained by FAO, CABI, and NAL.
2. To identify and provide core concepts broadly
supported across the three thesauri.
3. To achieve efficiencies of scale by maintaining the core
concepts in cooperation.
10. Long tail distribution (in AGRIS)
10,000 concepts cover nearly 99% of occurrences in metadata
11. Requirements and Wishes
1. An integrated view and bridge of existing thesauri
2. Reuses thesaurus development work, incl. translations
3. Compatible with existing databases
4. Based on RDF technologies: URIs, SKOS etc.
5. Available as Linked Open Data
Currently building GACS Beta, a proof-of-concept
implementation attempting to fulfill most requirements
13. Selection of top 10,000 concepts
Each partner organization provided
the 10,000 concepts most frequently
used in their respective databases.
These lists of concepts were
modified as follows:
● added all countries (from
AGROVOC)
● added organisms hierarchy all
the way to the top
14. Automated mappings
Created using AgreementMakerLight software
between the full thesauri, for completeness
AgreementMakerLight was top performer at
OAEI 2014 ontology mapping competition!
15. Human evaluation of mappings
Created Google Docs spreadsheets using the lists of selected concepts and
the auto-generated mappings. Three sheets with circa 10,700 rows each.
Mappings manually evaluated by
staff of partner organizations.
Evaluated 60 to 150 rows/hour,
total evaluation time over 300
hours so far.
Currently projected to take
500-600 hours for GACS Beta.
16. Forming GACS concepts
by merging the source concepts and aggregating their information
rice
UF paddy
UF paddy rice
cereals
UF feed cereals
UF small grain cereals (grain)
Oryza sativa
UF Oryza glutinosa
UF Oryza indica
UF Oryza japonica
UF Oryza sativa … (subsp, var etc.)
Oryza
UF Padia
UF rice (plant)
agrovoc:c_5435
cabt:82917
nalt:56271
exactMatch
agrovoc:c_5438
cabt:82935
nalt:56277
exactMatch
agrovoc:c_1474
cabt:26247
exactMatch
agrovoc:c_6599
cabt:101613
nalt:56293
exactMatch
(actually we use SKOS, not traditional thesaurus tags)
18. Quality evaluation
Using the qSKOS and Skosify tools that can find and correct problems in SKOS
vocabularies [1], we can detect
● missing, invalid or overlapping concept labels
● anomalies in concept hierarchy, e.g. cycles
● ...and many other kinds of problems.
Many problems are expected due to merging of concepts within GACS, but
most should be automatically corrected.
[1] Osma Suominen and Christian Mader: Assessing and Improving the
Quality of SKOS Vocabularies. JoDS, 3(1) 2014.
19. Demo of GACS Alpha in Skosmos
http://bit.ly/1Gjf5jl
20. Additional mapping rounds
Need to perform 2-3 more
smaller mapping rounds
in order to ensure that
all necessary concepts
have been fully mapped
between all source thesauri
21. Lessons already learned
● It is hard to sustain focus on mapping beyond circa five hours per day.
● Mapping reveals issues with both the source and target thesauri -- areas
for improvement, or errors, fixable in collaboration.
● Starting with the 10,000 most-used concepts shines a light on parts of
thesauri that may long have lacked attention.
● Starting small, with a core, avoids the potential stress of over-committing
resources.
● Mapping provides an incentive to adopt open-data technologies that have
proven beneficial in other areas.
23. Differences in modeling
Q: Are taxonomic organism names (e.g. ‘Bos taurus’)
different concepts than the common names (‘cattle’)?
● sometimes there is no 1:1 match
and/or context of use is different
● the source thesauri all have different policies
No final answer yet...
27. Beyond GACS Beta?
Q: Can GACS replace existing agricultural thesauri?
● definitely not with GACS Beta due to smaller scope/size
● a future GACS may be an alternative for some
scenarios, but not all uses of existing thesauri because
o they cover areas beyond agriculture
o existing systems and processes (publication,
automatic indexing…) depend on current thesauri
In future, more partners are expected and the scope of GACS can be adjusted.
28. Thank you
Reports available on the FAO AIMS site:
http://aims.fao.org/community/agrovoc/blogs/phase-one-gacs-approved-read-reports
GACS Alpha: http://tester-os-kktest.lib.helsinki.fi/gacsdemo/en/
Slides prepared by Osma Suominen and Tom Baker
osma.suominen@helsinki.fi
tom@tombaker.org
@cydparr
Editor's Notes
I am presenting this talk on behalf of the GACS partnership – I myself am not directly involved but none of them could be here. In particular, I should credit up front Osma Suominen and Tom Baker who prepared the presentation, and Lori Finch from the NAL who helped me understand and prepare. I will do my best to try to answer your questions or can help direct you to one of those who is directly involved.
Each organization has its own thesaurus which they maintain, and one or more databases that make use of the concepts within that thesaurus. Currently no semantic compatibility - cannot do cross-database queries.
The three partner organizations hired consultants to do most of the work that is being presented here
one from Finland, one from Germany
currently teaching in Korea but from Germany
The three thesauri vary a lot. The CAB thesuaurs has a huge number of unique concepts compared to the other two, though because Agrovoc expresses those concepts Ina large number of languages, it has almost as many term as CAB Thesaurus. NAL thesaurus, relatively new, has a smaller number of terms but more concepts than Agrovoc.
Note however, that all three share two languages: English & Spanish
To determine what else was shared. OsmaSuominen used an automated open source tool for ontology matching: AgreementMakerLight. The “common core” shared by all three thesauri is about 13,500 concepts. Here we show some of the concepts that are not shared (you can see in the CAB thesaurus many of these are scientific names). Some examples of the terms that ARE shared are shown here.
Interestingly, if you look at metadata records from AGRIS, the FAO database of publications and other materials, and what AGROVOC concepts are used in those records, you can see it is a very long tail distribution – of the 32K concepts in Agrivoc, 10,000 concepts cover nearly 99% of the occurences. Similar results were found for CAB abstracts and Agricola.
So the conclusion of Phase 1 is that phase two should focus on integrating and bridging these existing thesauri in order to re-use the existing work (including the translations into other languages).
There was a desire to maintain compatibility with the existing databases – good pipelines and tools already exist, so there is no desire to replace the thesauri currently being used by the three partners.
The partners decided the work should be based on RDF and made available as linked open data.
So, Phase 2 would be to create a proof of concept GACS beta impelmentation.
The partnership is now nearing the end of phase 2. So I will now describe how that proof of concept is being developed.
First, each organism took their top 10Kf concepts
In case they were not in the top 10K, the partner also added all Country names from Agrovoc and, if the core concept was an organism additional names were added to support the taxonomic hierarchy all the way to the top.
OAEI ontology alignment evaluation initiative
Pairwise in this direction: NALT-AGROVOC, AGROVOC-CABT, and CABT-NALT
For example, Oryza (the genus for rice) has exact match in each of three, so that term will receive its own GACS URIC
Cereal is a broader term than rice, Oryza is a broader term than O. sativa, rice and Oryza sativa have an associative relation – not as specific as an exactMatch. If you want to know about Oryza sativa you may also be interested in the concept rice.
UF is "used for" in SKOS it is "alt label"
Might have errors in mapping that Vascular tissues -- is that of animal or plant systems? does it truly map to these
(the code on the arrow is the row in themapping spreadsheet)
should overweight and obesity map be an exact match
AC1501 (agrivoc to cabi spreadsheet line 1501)(1 = exact match)
Partners will be resolving some of these specific cases as well the larger policy questions by the end of march when there’s a meeting in Wallingford UK.
Half of the problems are taxonomy problems, not surprising given the difficulty of staying up to date and the different authorities and sources.
anticipate phase 3 to address infrastructure for
auto processes
editorial environment
publishing platform (using SKOSMOS right now)
CABI includes CAB health (beyond agricutlure)
CABI uses theres for publishing
NAL uses theirs for automated indexing ...
More partners expected, eg. CGIAR...