Automatic Indexing of Bibliographic
Metadata: The AgroTagger use case
Fabrizio Celli – Food and Agriculture
Organization of the UN - 27th March 2014
Before Starting…
• AGROVOC is FAO's 30-year-old multilingual vocabulary
containing more than 32,000 concepts in 22 languages
(http://aims.fao.org/standards/agrovoc/about)
• AGRIS (http://agris.fao.org/) is a database of more than 7
million bibliographic references in agriculture
– A collaborative network of more than 150 institutions from 65
countries
– AGRIS bibliographic metadata are enhanced with AGROVOC
descriptors, which is very important for adopting Linked Open Data
(LOD) technologies (http://agris.fao.org/content/about)
• Both are exposed as RDF
Automatic Indexing of Bibliographic
Metadata: The AgroTagger use case -
Fabrizio Celli - 27/03/2014
Outline
• Disambiguation
• How does it work?
• Use Case 1: indexing AGRIS resources
• Use Case 2: crawling the Web
Disambiguation
• At a high level of abstraction, AgroTagger is a
keyword extractor that uses the AGROVOC
thesaurus to enhance bibliographic resources
• The name AgroTagger may refer to different tools:
– MIMOS-hosted IIT Kanpur Agrotagger: a tool developed in
collaboration with the Indian Institute of Technology Kanpur
(IITK) in 2010, built on top of the popular Keyphrase
Extraction Algorithm (KEA, http://www.nzdl.org/Kea/)
Disambiguation (2)
– A Web Application developed by MIMOS in
collaboration with IITK and FAO
(http://kt.mimos.my/AgroTagger/)
• It is built on top of the IITK tagging service
• It generates keywords as RDF triples
• It builds a tag cloud showing the most commonly
extracted keywords
• More information on AIMS:
http://aims.fao.org/agrotagger
Disambiguation (3)
• «AgroTagger» also refers to a command-line
application based on MAUI
(https://code.google.com/p/maui-indexer/)
• There is neither a graphical interface nor a Web service
on top of the application
• It is a Java API
• This is the AgroTagger presented in this talk!
MAUI
• Maui is named after the Polynesian mythological hero
and demi-god, who would transform himself into
different kinds of birds to perform many of his exploits
• Similarly, the Maui algorithm assimilates two software
tools named after New Zealand native birds: Kea
(the keyphrase extraction algorithm) and Weka (the
machine learning toolkit, used to create the topic indexing
model from documents with topics assigned by people
and to apply it to new documents)
• Maui automatically identifies main topics in text
documents
How does it work?
• The purpose of the application is to index some Web
resources (i.e. URLs) with the AGROVOC thesaurus
• The application can accept two different inputs:
– A text file with a list of URLs
– The output file of an Apache Nutch Web crawler (which
contains a list of discovered URLs, but in a specific format)
• The output is a set of connections between input URLs
and some extracted AGROVOC URIs
– It can be a simple text file or a set of triples (N-Triples
serialization)
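A minimal sketch of what the N-Triples output could look like, written in Python for brevity (the AgroTagger itself is a Java API). The slides do not say which predicate links a URL to an AGROVOC URI, so `dcterms:subject` is an assumption here, and the function name `to_ntriples` is hypothetical.

```python
from typing import Dict, Iterable, List

# Assumption: the slides do not name the predicate; dcterms:subject is
# used purely for illustration.
SUBJECT_PREDICATE = "http://purl.org/dc/terms/subject"


def to_ntriples(
    links: Dict[str, Iterable[str]],
    predicate: str = SUBJECT_PREDICATE,
) -> List[str]:
    """Return one N-Triples line per (source URL, AGROVOC URI) pair.

    This sketch does not escape special characters in the IRIs.
    """
    lines = []
    for url, concept_uris in links.items():
        for concept in concept_uris:
            lines.append(f"<{url}> <{predicate}> <{concept}> .")
    return lines
```

Each resulting line is a complete triple, so the output can be appended to a file one resource at a time.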
[Diagram: a text file with a list of URLs of Web resources (input) → AgroTagger → output]
How does it work?
• For each URL in the input file
– Download the resource
– Run the MAUI indexer trained with AGROVOC (the
application was trained with 780 bibliographic
resources manually indexed by FAO cataloguers)
– Update the output file with discovered
connections (source URL -> set of AGROVOC URIs)
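The per-URL loop above can be sketched as follows. This is a Python illustration, not the actual Java API: the downloader and the MAUI-based indexer are passed in as functions, and `fake_download` and `fake_extract` are stand-ins invented for the demonstration (including their example.org URIs). Skipping unreachable resources is also an assumption; the slides do not say how failures are handled.

```python
from typing import Callable, Dict, Iterable, List


def index_urls(
    urls: Iterable[str],
    download: Callable[[str], str],
    extract_concepts: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each source URL to the AGROVOC URIs found in its content."""
    connections: Dict[str, List[str]] = {}
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the input file
        try:
            text = download(url)
        except Exception:
            continue  # assumption: unreachable resources are skipped
        connections[url] = extract_concepts(text)
    return connections


# Stand-ins for demonstration only; not the real downloader or MAUI model.
def fake_download(url: str) -> str:
    if "broken" in url:
        raise IOError("unreachable")
    return "Rice irrigation in Asia"


def fake_extract(text: str) -> List[str]:
    vocabulary = {
        "rice": "http://example.org/agrovoc/rice",
        "irrigation": "http://example.org/agrovoc/irrigation",
    }
    lowered = text.lower()
    return [uri for term, uri in vocabulary.items() if term in lowered]
```

Calling `index_urls(open("urls.txt"), fake_download, fake_extract)` would process a file with one URL per line; the file name is hypothetical.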
Use Case 1:
indexing AGRIS
resources
AGRIS
• A collection of more than 7 million
bibliographic references in agriculture
• AGRIS records come with AGROVOC
descriptors
• An RDF-aware system
– the AGRIS database is exposed as RDF
– AGROVOC is the backbone to interlink to external
sources of information (statistics, distribution
maps, country profiles, germplasm data…)
The problem
• Some AGRIS records have not been
indexed with AGROVOC keywords
• When AGROVOC keywords are not available, an
AGRIS record cannot be interlinked to external
sources of information
The solution
Not yet implemented!
An example
• In 2012 AGRIS received 28,582 bibliographic records
from the World Bank
• All records came with a full-text link, but with no
associated keywords
• By running the AgroTagger we were able to
assign from 4 to 10 AGROVOC keywords to
each World Bank resource
• We manually evaluated a random sample of the
output, with good results!
[Figure: AgroTagger output]
Use Case 2:
crawling the Web
The setting
• Objective: discovering Web resources in
agriculture and interlinking them to AGRIS
records
• Tools:
– Apache Nutch crawler
– AgroTagger Java API
• Final Goal: when the system displays an AGRIS
record, a list of related Web resources should
be available to the user
The algorithm
• After tuning, the Apache Nutch Web crawler crawls the Web
starting from a list of preselected URLs
– The output of the crawler (a list of discovered URLs) is
given to the AgroTagger
• The AgroTagger assigns some AGROVOC URIs to
each URL discovered by the crawler
• AGRIS records are interlinked to these URLs if
they share at least 5 AGROVOC URIs (this
threshold still has to be tuned)
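The interlinking rule in the last bullet can be sketched as a set intersection with a tunable threshold. This is a naive pairwise Python illustration under assumed data shapes (dictionaries mapping an identifier to a set of AGROVOC URIs); the function name `interlink` is hypothetical and the actual matching algorithm is not described in the slides.

```python
from typing import Dict, List, Set, Tuple


def interlink(
    agris: Dict[str, Set[str]],
    web: Dict[str, Set[str]],
    min_shared: int = 5,
) -> List[Tuple[str, str, int]]:
    """Pair AGRIS records with Web resources sharing enough AGROVOC URIs.

    agris: AGRIS record URI -> set of AGROVOC URIs
    web:   Web resource URL -> set of AGROVOC URIs
    Returns (record URI, resource URL, number of shared URIs) tuples.
    """
    links = []
    for record_uri, record_concepts in agris.items():
        for url, url_concepts in web.items():
            shared = len(record_concepts & url_concepts)
            if shared >= min_shared:
                links.append((record_uri, url, shared))
    return links
```

Comparing every record with every resource is quadratic, so at the scale of millions of records a real implementation would need some form of indexing; the sketch only shows the rule itself.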
First test: some numbers
• A first test started from the URL:
http://ageconsearch.umn.edu/
• The Web crawler discovered 101,000 distinct Web resources, which
the AgroTagger associated with AGROVOC URIs
• An algorithm then tried to match AGRIS records to these
resources
– E.g. the resource
«http://www.waeaonline.org/WEForum/WEF-Vol.9-No.2-Fall2010.pdf»
was associated with the AGRIS record
«http://agris.fao.org/aos/records/US7938594»
First test: some numbers (2)
Number of AGRIS records | Common AGROVOC URIs between AGRIS and the output of the crawler | Number of associations
900 K                   | 3                                                               | 17 million
530 K                   | 4                                                               | 1.9 million
2.3 million             | 5                                                               | 1.27 million
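To see how the number of associations moves with the threshold, one could tally record/resource pairs for several thresholds in a single pass. The Python sketch below assumes that an association is a pair sharing at least that many AGROVOC URIs, which the slide does not state explicitly. The inverted index from concept to resources is a design choice of this illustration, not something the talk describes, and `count_associations` is a hypothetical name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_associations(
    agris: Dict[str, Set[str]],
    web: Dict[str, Set[str]],
    thresholds: Iterable[int] = (3, 4, 5),
) -> Dict[int, int]:
    """Count (record, resource) pairs sharing at least k URIs, for each k."""
    # Inverted index: AGROVOC URI -> Web resources tagged with it.
    by_concept: Dict[str, Set[str]] = defaultdict(set)
    for url, concepts in web.items():
        for concept in concepts:
            by_concept[concept].add(url)

    # Count shared URIs only for pairs that have at least one in common.
    shared: Counter = Counter()
    for record, concepts in agris.items():
        for concept in concepts:
            for url in by_concept.get(concept, ()):
                shared[(record, url)] += 1

    return {k: sum(1 for n in shared.values() if n >= k) for k in thresholds}
```

Because only pairs with a common URI are ever touched, this avoids comparing every record with every resource.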
Future
• Other qualitative/quantitative tests
• Optimization of the algorithm to run faster
• Tuning of the physical infrastructure
• Complete automation of procedures (e.g. the
output goes directly to a triplestore)
• Reach the final goal: when the system displays
an AGRIS record, a list of related Web
resources is available to the user
Thank you!
  • 1. Automatic Indexing of Bibliographic Metadata: The AgroTagger use case Fabrizio Celli – Food and Agriculture Organization of the UN - 27th March 2014
  • 2. Before Starting… • AGROVOC is the FAO 30 years old multilingual vocabulary containing more than 32 000 concepts in 22 languages (http://aims.fao.org/standards/agrovoc/about ) • AGRIS (http://agris.fao.org/ ) is a database of more than 7 million bibliographic references in Agriculture – A collaborative network of more than 150 institutions from 65 countries – AGRIS bibliographic metadata are enhanced by AGROVOC descriptors, which is very important in the context of adopting LOD technologies (http://agris.fao.org/content/about ) • Both are exposed as RDF Automatic Indexing of Bibliographic Metadata: The AgroTagger use case - Fabrizio Celli - 27/03/2014
  • 3. Outline • Disambiguation • How does it work? • Use Case 1: indexing AGRIS resources • Use Case 2: crawling the Web
  • 4. Disambiguation • At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to enhance bibliographic resources • The name AgroTagger may refer to different tools: – MIMOS-hosted IIT Kanpur Agrotagger: a tool developed in 2010 in collaboration with the Indian Institute of Technology Kanpur (IITK), built on top of the popular Keyphrase Extraction Algorithm (KEA, http://www.nzdl.org/Kea/)
  • 5. Disambiguation (2) – A Web Application developed by MIMOS in collaboration with IITK and FAO (http://kt.mimos.my/AgroTagger/) • built on top of the IITK tagging service • It generates keywords as RDF triples • It builds a tag cloud showing the most commonly extracted keywords • More information on AIMS: http://aims.fao.org/agrotagger
  • 6. Disambiguation (3) • «AgroTagger» also refers to a command-line application based on MAUI (https://code.google.com/p/maui-indexer/) • It has neither a graphical interface nor a Web service on top of it • It is a Java API • This is the AgroTagger presented here!
  • 7. MAUI • Maui is named after the Polynesian mythological hero and demi-god who would transform himself into different kinds of birds to perform many of his exploits • Similarly, the Maui algorithm combines two software tools named after native New Zealand birds: Kea (the keyphrase extraction algorithm) and Weka (the machine learning toolkit used to build the topic indexing model from documents with topics assigned by people and to apply it to new documents) • Maui automatically identifies the main topics in text documents
  • 8. How does it work? • The purpose of the application is to index Web resources (i.e. URLs) with the AGROVOC thesaurus • The application accepts two different inputs: – A text file with a list of URLs – The output file of an Apache Nutch Web crawler (which contains a list of discovered URLs, but in a specific format) • The output is a set of connections between the input URLs and the extracted AGROVOC URIs – It can be a simple text file or a set of triples (N-Triples serialization)
  • 9. Diagram: a text file with a list of URLs of Web resources (input) → AgroTagger → output
  • 10. How does it work? • For each URL in the input file: – Download the resource – Run the MAUI indexer trained with AGROVOC (the application was trained on 780 bibliographic resources manually indexed by FAO cataloguers) – Update the output file with the discovered connections (source URL -> set of AGROVOC URIs)
  • 11. Use Case 1: indexing AGRIS resources
  • 12. AGRIS • A collection of more than 7 million bibliographic references in agriculture • AGRIS records come with AGROVOC descriptors • An RDF-aware system – The AGRIS database is exposed as RDF – AGROVOC is the backbone for interlinking with external sources of information (statistics, distribution maps, country profiles, germplasm data…)
  • 14. The problem • Some AGRIS records have not been indexed with AGROVOC keywords • When AGROVOC keywords are not available, an AGRIS record cannot be interlinked with external sources of information
  • 15. The solution • Not yet implemented!
  • 16. An example • In 2012 AGRIS received 28,582 bibliographic records from the World Bank • All records came with a full-text link, but with no associated keywords • By running the AgroTagger we were able to assign from 4 to 10 AGROVOC keywords to each World Bank resource • We manually evaluated a random sample of the output, with good results!
  • 17. AgroTagger output
  • 18. Use Case 2: crawling the Web
  • 19. The setting • Objective: discovering Web resources in agriculture and interlinking them with AGRIS records • Tools: – Apache Nutch crawler – AgroTagger Java API • Final goal: when the system displays an AGRIS record, a list of related Web resources should be available to the user
  • 20. The algorithm • The Apache Nutch Web crawler, after tuning, crawls the Web starting from a list of preselected URLs – The output of the crawler (a list of discovered URLs) is given to the AgroTagger • The AgroTagger assigns some AGROVOC URIs to each URL discovered by the crawler • An AGRIS record is interlinked with one of these URLs if the two share at least 5 AGROVOC URIs (the number has to be tuned)
  • 21. First test: some numbers • A first test started from the URL http://ageconsearch.umn.edu/ • 101,000 distinct Web resources were discovered by the Web crawler and associated with AGROVOC URIs by the AgroTagger • An algorithm tried to match AGRIS data to these resources – E.g. the resource «http://www.waeaonline.org/WEForum/WEF-Vol.9-No.2-Fall2010.pdf» was associated with the AGRIS record «http://agris.fao.org/aos/records/US7938594»
  • 22. First test: some numbers (2) • Table columns: number of AGRIS records; common AGROVOC URIs between AGRIS and the output of the crawler; number of associations • 900 K records; 3 common URIs; 17 million associations • 530 K records; 4 common URIs; 1.9 million associations • 2.3 million records; 5 common URIs; 1.27 million associations
  • 23. Future • Further qualitative and quantitative tests • Optimization of the algorithm to run faster • Tuning of the physical infrastructure • Complete automation of the procedures (e.g. the output goes directly to a triplestore) • Reach the final goal: when the system displays an AGRIS record, a list of related Web resources is available to the user
  • 24. Thank you!
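
The per-URL loop described on slides 8 and 10 (read a list of URLs, download each resource, run the indexer, record "source URL -> set of AGROVOC URIs", optionally as N-Triples) can be sketched roughly as follows. This is a minimal illustration in Python, not the AgroTagger Java API: the function names, the one-URL-per-line input layout, and the use of dcterms:subject as the linking predicate are all assumptions, and the downloader and the MAUI-based indexer are passed in as hypothetical callables.

```python
from typing import Callable, Dict, Iterable, List

# Assumption: the slides do not name the predicate that links a resource to an
# AGROVOC concept; dcterms:subject is used here purely for illustration.
SUBJECT_PREDICATE = "http://purl.org/dc/terms/subject"


def read_url_list(lines: Iterable[str]) -> List[str]:
    """Parse the plain-text input (assumed: one URL per line, blanks ignored)."""
    return [line.strip() for line in lines if line.strip()]


def tag_urls(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    index: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """For each URL: download the resource, index it, record the connections.

    `fetch` and `index` are hypothetical stand-ins for the downloader and for
    the AGROVOC-trained MAUI indexer; neither is the real AgroTagger API.
    """
    connections: Dict[str, List[str]] = {}
    for url in urls:
        try:
            text = fetch(url)
        except OSError:
            continue  # skip resources that cannot be downloaded
        connections[url] = index(text)
    return connections


def to_ntriples(url: str, concept_uris: Iterable[str]) -> List[str]:
    """Serialise one 'source URL -> AGROVOC URIs' connection as N-Triples lines.

    The sketch assumes both the URL and the concept URIs are already valid
    IRIs, so no escaping is applied.
    """
    return [f"<{url}> <{SUBJECT_PREDICATE}> <{uri}> ." for uri in concept_uris]
```

Passing the downloader and the indexer as parameters keeps the loop testable without network access or a trained model; in a real run they would be replaced by an HTTP client and by a call into the indexer.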

Editor's Notes

  1. Tuning parameters, both for the crawler and for the matching algorithm; parallelization; cloud
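
The matching step from slide 20, where an AGRIS record is linked to a crawled URL when the two share at least a tunable number of AGROVOC URIs, can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the algorithm actually used in the test: the function name, the data layout (dictionaries of URI sets) and the inverted index are hypothetical choices made here; only the "at least 5 common URIs, to be tuned" rule comes from the slides.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def match_records(
    records: Dict[str, Set[str]],
    resources: Dict[str, Set[str]],
    min_common: int = 5,
) -> List[Tuple[str, str, int]]:
    """Pair AGRIS records with Web resources sharing >= min_common AGROVOC URIs.

    `records` maps an AGRIS record ID to its AGROVOC URIs; `resources` maps a
    crawled URL to the URIs assigned by the tagger. Both layouts are assumed.
    """
    # Inverted index: AGROVOC URI -> Web resources tagged with that URI.
    by_concept: Dict[str, Set[str]] = defaultdict(set)
    for url, concepts in resources.items():
        for concept in concepts:
            by_concept[concept].add(url)

    matches: List[Tuple[str, str, int]] = []
    for record_id, concepts in records.items():
        # Count, per Web resource, how many URIs it shares with this record.
        shared: Dict[str, int] = defaultdict(int)
        for concept in concepts:
            for url in by_concept.get(concept, ()):
                shared[url] += 1
        for url, count in shared.items():
            if count >= min_common:
                matches.append((record_id, url, count))
    return sorted(matches)
```

The inverted index avoids comparing every record with every resource, which matters at the scale reported on slides 21 and 22. Lowering `min_common` produces more associations, in line with the trend in that table, at the likely cost of weaker matches.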