Crawling the Web 
Fabrizio Celli 
Rome, 25th September 2014
Outline 
• Purpose of this Webinar 
• The Web Crawler 
• The AgroTagger 
• The AGRIS use case 
– What’s next? 
Purpose of this Webinar 
• SemaGrow is a project funded by the Seventh 
Framework Programme (FP7) of the European 
Commission 
• Algorithms, infrastructures and methodologies to 
cope with large data volumes and real time 
performance 
• http://www.semagrow.eu 
• One of the SemaGrow demonstrators is the 
“Web Crawler + AgroTagger” component, 
the subject of this Webinar 
The demonstrator 
• It is based on two command line applications 
(no user interface): 
– Web Crawler 
– AgroTagger 
• Goal: 
– discover resources on the Web 
– tag resources with AGROVOC URIs 
– filter only resources about agriculture and 
interlink to AGRIS 
What we expect from the Webinar 
• Comments, suggestions, opinions 
• Other real case scenarios for the 
demonstrator 
• You can send your feedback to agris@fao.org 
THE WEB-CRAWLER 
Apache Nutch 
• http://nutch.apache.org/ 
• Highly extensible and scalable open source 
Web crawler 
• Configurable 
• Input: a list of pre-selected URLs 
• Output: a list of discovered URLs 
How it works 
• The user defines a list of Web sites (URLs) 
• Each URL is a ROOT 
• The user defines the “depth”: the maximum number of 
"hops" a discovered link may be away from the 
ROOT 
– Links very "far away" from the ROOT are unlikely 
to hold much information 
• Start to crawl the Web! 
Example: depth = 3 
ROOT (URL) 
depth = 1: URL_1_1, URL_1_2, …, URL_1_n 
depth = 2: URL_2_2_1, …, URL_2_2_m (links found on URL_1_2) 
depth = 3: URL_3_2_1_1, …, URL_3_2_1_p (links found on URL_2_2_1) 
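The depth limit can be illustrated with a minimal breadth-first sketch. It is not how Apache Nutch is implemented; it only shows the idea of stopping a fixed number of hops from each ROOT. The function name `crawl`, the `get_outlinks` callback and the toy link graph are assumptions made for illustration.

```python
from collections import deque

def crawl(roots, get_outlinks, max_depth):
    """Breadth-first crawl: return {url: depth} for URLs within max_depth hops of a root."""
    depth_of = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_outlinks(url):
            if link not in depth_of:  # in BFS the first visit is the shortest path
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of

# Toy link graph standing in for the Web (hypothetical data).
LINKS = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

found = crawl(["ROOT"], lambda u: LINKS.get(u, []), max_depth=3)
```

With `max_depth=3`, the link reachable only through a fourth hop is never recorded.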
The application 
• https://github.com/agrisfao/agrotagger/tree/master/crawler/application 
• Command line application 
• Provided with bash scripts to run in Linux 
environments 
• Example of usage: 
– depth = 5 
– output directory = work/output 
– directory with source URLs = work/urls 
crawler_exec.sh 5 work/output work/urls 
The output 
URL:: http:/ 
URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php 
URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/ 
URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf 
URL:: http://2014.northernspark.org/ 
URL:: http://2014.northernspark.org/project/chimera 
outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor: 
outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor: 
outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor: 
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates 
outlink: toUrl: http://www.aaea.org/ anchor: AAEA 
outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In 
outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections 
outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees 
outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors 
... 
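A downstream tool could read this listing roughly as follows. The sketch assumes only what the sample shows: each page starts with a `URL:: ` line, followed by zero or more `outlink: toUrl: … anchor: …` lines, with wrapped lines already rejoined. The function name `parse_crawler_output` is hypothetical and not part of the application.

```python
def parse_crawler_output(lines):
    """Map each 'URL::' entry to the list of (target, anchor) outlinks that follow it."""
    pages = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("URL:: "):
            current = line[len("URL:: "):]
            pages.setdefault(current, [])
        elif line.startswith("outlink: toUrl: ") and current is not None:
            rest = line[len("outlink: toUrl: "):]
            # The anchor text may be empty, as in several lines of the sample.
            target, _, anchor = rest.partition(" anchor:")
            pages[current].append((target.strip(), anchor.strip()))
    return pages

sample = [
    "URL:: http://aaea.execinc.com/edibo/JobMarketCandidates",
    "outlink: toUrl: http://www.aaea.org/ anchor: AAEA",
    "outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In",
    "URL:: http://2014.northernspark.org/",
]
pages = parse_crawler_output(sample)
```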
THE AGROTAGGER 
AGROVOC 
• FAO multilingual vocabulary 
• Over 32 000 concepts in up to 21 languages 
• Part of the LOD cloud 
• Extensively used by cataloguers for indexing 
data in agricultural information systems 
• http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc 
The AgroTagger 
• At a high level of abstraction, AgroTagger is a 
keyword extractor that uses the AGROVOC 
thesaurus to extract keywords from the 
resources found at a set of URLs 
• More precisely, it extracts AGROVOC URIs rather than plain keywords 
• It is based on MAUI 
MAUI 
• Maui is named after the Polynesian 
mythological hero and demigod, who would 
transform himself into different kinds of birds 
to perform many of his exploits 
• Maui automatically identifies the main topics in 
text documents 
• It builds on Kea (keyphrase extraction) and Weka 
(machine learning), both named after birds native to New Zealand 
• https://code.google.com/p/maui-indexer 
How it works 
• Input, either of: 
– A text file with a list of URLs 
– The output file of an Apache Nutch crawler 
• Output: 
– A set of triples 
<URL> dcterms:subject <AGROVOC_URI> 
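As an illustration, the triple pattern above could be serialised as N-Triples lines like this. The sketch assumes `dcterms` is the usual Dublin Core Terms namespace (`http://purl.org/dc/terms/`); the function name `to_ntriples` and the concept URIs are placeholders, and no escaping of unusual characters in URLs is attempted.

```python
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def to_ntriples(url, agrovoc_uris):
    """Return one N-Triples line per <url> dcterms:subject <concept URI> statement."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in agrovoc_uris]

# The concept URIs below are placeholders, not real AGROVOC identifiers.
lines = to_ntriples(
    "http://example.org/page",
    ["http://example.org/agrovoc/c_0001", "http://example.org/agrovoc/c_0002"],
)
print("\n".join(lines))
```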
The algorithm 
• For each URL in the input file 
– Download the resource 
– Run the MAUI indexer trained with AGROVOC 
– Create a set of triples 
• Multi-threaded 
• Currently, MAUI is trained only for English 
– It can be trained for other languages written in the Latin 
alphabet 
– Other solutions are needed for Chinese, Arabic, 
Russian, etc. 
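The loop above can be sketched with a thread pool. The real application is written in Java and calls the trained MAUI indexer; here `fetch` and `extract_uris` are injected stand-ins, and all names, URLs and vocabulary entries are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def tag_urls(urls, fetch, extract_uris, workers=4):
    """Download and tag each URL in a thread pool; return {url: [concept URIs]}."""
    def process(url):
        try:
            return url, extract_uris(fetch(url))
        except Exception:
            return url, []  # a resource that cannot be fetched or tagged gets no URIs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process, urls))

# Stand-ins for the download step and the trained indexer (both hypothetical).
FAKE_WEB = {
    "http://example.org/a": "maize yield in dry soil",
    "http://example.org/b": "rice",
}
FAKE_VOCAB = {
    "maize": "http://example.org/agrovoc/c_0001",
    "rice": "http://example.org/agrovoc/c_0002",
}

def fake_fetch(url):
    return FAKE_WEB[url]  # raises KeyError for unknown URLs, like a failed download

def fake_extract(text):
    words = text.split()
    return [uri for term, uri in FAKE_VOCAB.items() if term in words]

result = tag_urls(list(FAKE_WEB) + ["http://example.org/missing"], fake_fetch, fake_extract)
```

Passing the download and tagging steps in as functions keeps the concurrency logic testable without network access.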
The application 
• https://github.com/agrisfao/agrotagger 
• Command line application 
• Written entirely in Java 
• Provided with bash scripts 
• Example of usage: 
– directory with source files = work/source 
– output directory = work/output 
– type of source files = nutchOutput 
– output format = rdfnt 
taggerDir.sh work/source work/output nutchOutput rdfnt 
The output 
[Figure: Input → AgroTagger → Output]
THE AGRIS USE CASE 
AGRIS 
• http://agris.fao.org 
• A collection of more than 7.8 million 
bibliographic references in agriculture 
• AGRIS records come with AGROVOC descriptors 
• An RDF-aware system 
– the AGRIS database is publicly exposed as RDF 
– AGROVOC is the backbone to interlink to external 
sources of information (statistics, distribution maps, 
country profiles, germplasm data…) 
SemaGrow demonstrator 
• The core idea is to harvest the Web 
– Input: pre-selected sources of information about 
agriculture 
• Crawl and assign AGROVOC URIs 
– Store triples in the “crawler” database 
• Definition of combinations between the 
“crawler” database and the AGRIS database 
• New widget in AGRIS mashup pages! 
Related resources 
available on the Web 
• http://... 
• https://...
Current status 
• The Web Crawler gathers data from the Web 
• The AgroTagger computes triples that assign 
AGROVOC URIs to discovered URLs 
• A “crawler” triplestore is ready for computations 
What’s next 
• Processing phase 
• Discover meaningful combinations between 
the AGRIS core database and “crawler” 
database 
• A triplestore of combinations will be set up 
and used by AGRIS to generate a widget in the 
mashup page 
• Evaluation of the quality of the widget 
• What does “meaningful combinations” mean? 
Naïve Algorithm 
• Just for testing purposes 
• Meaningful combination = an AGRIS record and a crawled 
resource that share at least N AGROVOC URIs 
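A minimal sketch of this threshold rule, assuming both databases have been loaded into dictionaries that map an identifier to its set of AGROVOC URIs. The function name `find_associations`, the record identifiers and the one-letter URIs are placeholders for illustration.

```python
from collections import Counter, defaultdict

def find_associations(agris, crawled, n):
    """Return (agris_id, url, shared) for pairs sharing at least n subject URIs."""
    # Inverted index: concept URI -> crawled URLs tagged with it.
    by_uri = defaultdict(set)
    for url, uris in crawled.items():
        for uri in set(uris):
            by_uri[uri].add(url)
    result = []
    for record, uris in agris.items():
        shared = Counter()
        for uri in set(uris):
            for url in by_uri.get(uri, ()):
                shared[url] += 1
        result.extend(
            (record, url, count) for url, count in sorted(shared.items()) if count >= n
        )
    return result

# Placeholder data: letters stand in for AGROVOC URIs.
agris = {"rec1": {"A", "B", "C", "D"}, "rec2": {"A", "E"}}
crawled = {"http://example.org/x": {"A", "B", "C"}, "http://example.org/y": {"A", "D"}}

strict = find_associations(agris, crawled, 3)
loose = find_associations(agris, crawled, 2)
```

The inverted index avoids comparing every record with every crawled URL; only pairs that share at least one URI are ever counted.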
Example 
• http://ageconsearch.umn.edu/ 
• 101,000 distinct Web resources discovered by the 
Web Crawler (depth = 5) 
• ~1 million triples generated by the AgroTagger 
(“crawler” database) 
Number of AGRIS records | N (common AGROVOC URIs between AGRIS and the output of the Crawler) | Number of associations 
900 K | 3 | 17 MLN 
900 K | 4 | 3.2 MLN 
1 MLN | 5 | 0.6 MLN
Your feedback 
• Comments, suggestions, other real case 
scenarios 
• Ideas about the meaning of “meaningful 
combinations” 
• If you test the application, any comments 
that would help improve it 
• Can the demonstrator help to overcome 
data problems? 
• You can send your feedback to agris@fao.org 
谢谢 
Gracias 
σας ευχαριστώ

More Related Content

What's hot

WebCrawler
WebCrawlerWebCrawler
WebCrawler
mynameismrslide
 
“Web crawler”
“Web crawler”“Web crawler”
“Web crawler”
ranjit banshpal
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
Suhasini S Kulkarni
 
Web crawler and applications
Web crawler and applicationsWeb crawler and applications
Web crawler and applications
Partnered Health
 
Design and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerDesign and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerGeorge Ang
 
Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler
Sanchit Saini
 
Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014
Sunny Gupta
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawler
Rishikesh Pathak
 
Colloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web Crawler
Akshay Pratap Singh
 
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
CloudTechnologies
 
Web Crawlers - Exploring the WWW
Web Crawlers - Exploring the WWWWeb Crawlers - Exploring the WWW
Web Crawlers - Exploring the WWW
Siddhartha Anand
 
Smart Crawler
Smart CrawlerSmart Crawler
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Rana Jayant
 
Web crawler with seo analysis
Web crawler with seo analysis Web crawler with seo analysis
Web crawler with seo analysis Vikram Parmar
 
Smart Crawler -A Two Stage Crawler For Efficiently Harvesting Deep Web
Smart Crawler -A Two Stage Crawler For Efficiently Harvesting Deep WebSmart Crawler -A Two Stage Crawler For Efficiently Harvesting Deep Web
Smart Crawler -A Two Stage Crawler For Efficiently Harvesting Deep Web
S Sai Karthik
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawler
Pvrtechnologies Nellore
 
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
Rana Jayant
 
A Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET TechnologyA Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET Technology
IOSR Journals
 

What's hot (20)

WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
“Web crawler”
“Web crawler”“Web crawler”
“Web crawler”
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
Web crawler and applications
Web crawler and applicationsWeb crawler and applications
Web crawler and applications
 
Design and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerDesign and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web Crawler
 
Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler
 
Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawler
 
Colloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web Crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
 
Web Crawlers - Exploring the WWW
Web Crawlers - Exploring the WWWWeb Crawlers - Exploring the WWW
Web Crawlers - Exploring the WWW
 
Smart Crawler
Smart CrawlerSmart Crawler
Smart Crawler
 
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
 
Web crawler with seo analysis
Web crawler with seo analysis Web crawler with seo analysis
Web crawler with seo analysis
 
Smart Crawler -A Two Stage Crawler For Efficiently Harvesting Deep Web
Smart Crawler -A Two Stage Crawler For Efficiently Harvesting Deep WebSmart Crawler -A Two Stage Crawler For Efficiently Harvesting Deep Web
Smart Crawler -A Two Stage Crawler For Efficiently Harvesting Deep Web
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawler
 
Seminar on crawler
Seminar on crawlerSeminar on crawler
Seminar on crawler
 
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...Smart crawlet A two stage crawler  for efficiently harvesting deep web interf...
Smart crawlet A two stage crawler for efficiently harvesting deep web interf...
 
A Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET TechnologyA Novel Interface to a Web Crawler using VB.NET Technology
A Novel Interface to a Web Crawler using VB.NET Technology
 

Similar to SemaGrow demonstrator: “Web Crawler + AgroTagger”

Large Scale Drupal - Behind the Scenes
Large Scale Drupal - Behind the ScenesLarge Scale Drupal - Behind the Scenes
Large Scale Drupal - Behind the Scenes
Boyan Borisov
 
Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase
Automatic Indexing of Bibliographic Metadata: The AgroTagger UsecaseAutomatic Indexing of Bibliographic Metadata: The AgroTagger Usecase
Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase
AIMS (Agricultural Information Management Standards)
 
Automatic Query-Centric API for Routine Access to Linked Data
Automatic Query-Centric API for Routine Access to Linked DataAutomatic Query-Centric API for Routine Access to Linked Data
Automatic Query-Centric API for Routine Access to Linked Data
Albert Meroño-Peñuela
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Databricks
 
AGROVOC GACS Working Group
AGROVOC GACS Working GroupAGROVOC GACS Working Group
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
Databricks
 
2nd Content Providers Community Call
2nd Content Providers Community Call2nd Content Providers Community Call
2nd Content Providers Community Call
OpenAIRE
 
App_Engine_PPT..........................
App_Engine_PPT..........................App_Engine_PPT..........................
App_Engine_PPT..........................
HassamShahid2
 
App_Engine_PPT.ppt
App_Engine_PPT.pptApp_Engine_PPT.ppt
App_Engine_PPT.ppt
Pikachu925105
 
App_Engine_PPT.ppt
App_Engine_PPT.pptApp_Engine_PPT.ppt
App_Engine_PPT.ppt
ArunPrakash330
 
JRuby, Ruby, Rails and You on the Cloud
JRuby, Ruby, Rails and You on the CloudJRuby, Ruby, Rails and You on the Cloud
JRuby, Ruby, Rails and You on the Cloud
Hiro Asari
 
Making Chrome Extension with AngularJS
Making Chrome Extension with AngularJSMaking Chrome Extension with AngularJS
Making Chrome Extension with AngularJS
Ben Lau
 
ElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der CloudsElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der Clouds
inovex GmbH
 
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveLessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Kritika Garg
 
Module development
Module development Module development
Module development
Araport
 
DevOps-Roadmap
DevOps-RoadmapDevOps-Roadmap
DevOps-Roadmap
BnhNguynHuy1
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
C4Media
 
Web security: Securing untrusted web content at browsers
Web security: Securing untrusted web content at browsersWeb security: Securing untrusted web content at browsers
Web security: Securing untrusted web content at browsers
Phú Phùng
 
Introduction to Laravel Framework (5.2)
Introduction to Laravel Framework (5.2)Introduction to Laravel Framework (5.2)
Introduction to Laravel Framework (5.2)
Viral Solani
 

Similar to SemaGrow demonstrator: “Web Crawler + AgroTagger” (20)

Large Scale Drupal - Behind the Scenes
Large Scale Drupal - Behind the ScenesLarge Scale Drupal - Behind the Scenes
Large Scale Drupal - Behind the Scenes
 
Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase
Automatic Indexing of Bibliographic Metadata: The AgroTagger UsecaseAutomatic Indexing of Bibliographic Metadata: The AgroTagger Usecase
Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase
 
Automatic Query-Centric API for Routine Access to Linked Data
Automatic Query-Centric API for Routine Access to Linked DataAutomatic Query-Centric API for Routine Access to Linked Data
Automatic Query-Centric API for Routine Access to Linked Data
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 
AGROVOC GACS Working Group
AGROVOC GACS Working GroupAGROVOC GACS Working Group
AGROVOC GACS Working Group
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
2nd Content Providers Community Call
2nd Content Providers Community Call2nd Content Providers Community Call
2nd Content Providers Community Call
 
App_Engine_PPT..........................
App_Engine_PPT..........................App_Engine_PPT..........................
App_Engine_PPT..........................
 
App_Engine_PPT.ppt
App_Engine_PPT.pptApp_Engine_PPT.ppt
App_Engine_PPT.ppt
 
App_Engine_PPT.ppt
App_Engine_PPT.pptApp_Engine_PPT.ppt
App_Engine_PPT.ppt
 
App_Engine_PPT.ppt
App_Engine_PPT.pptApp_Engine_PPT.ppt
App_Engine_PPT.ppt
 
JRuby, Ruby, Rails and You on the Cloud
JRuby, Ruby, Rails and You on the CloudJRuby, Ruby, Rails and You on the Cloud
JRuby, Ruby, Rails and You on the Cloud
 
Making Chrome Extension with AngularJS
Making Chrome Extension with AngularJSMaking Chrome Extension with AngularJS
Making Chrome Extension with AngularJS
 
ElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der CloudsElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der Clouds
 
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveLessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
 
Module development
Module development Module development
Module development
 
DevOps-Roadmap
DevOps-RoadmapDevOps-Roadmap
DevOps-Roadmap
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Web security: Securing untrusted web content at browsers
Web security: Securing untrusted web content at browsersWeb security: Securing untrusted web content at browsers
Web security: Securing untrusted web content at browsers
 
Introduction to Laravel Framework (5.2)
Introduction to Laravel Framework (5.2)Introduction to Laravel Framework (5.2)
Introduction to Laravel Framework (5.2)
 

More from AIMS (Agricultural Information Management Standards)

Linked Data Competency Index : Mapping the field for teachers and learners
 Linked Data Competency Index : Mapping the field for teachers and learners Linked Data Competency Index : Mapping the field for teachers and learners
Linked Data Competency Index : Mapping the field for teachers and learners
AIMS (Agricultural Information Management Standards)
 
Metadata as Standard: improving Interoperability through the Research Data Al...
Metadata as Standard: improving Interoperability through the Research Data Al...Metadata as Standard: improving Interoperability through the Research Data Al...
Metadata as Standard: improving Interoperability through the Research Data Al...
AIMS (Agricultural Information Management Standards)
 
Assigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
Assigning Digital Object Identifiers (DOIs) to Plant Genetic ResourcesAssigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
Assigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
AIMS (Agricultural Information Management Standards)
 
VocBench 3: some insights on the forthcoming release
VocBench 3: some insights on the forthcoming release VocBench 3: some insights on the forthcoming release
VocBench 3: some insights on the forthcoming release
AIMS (Agricultural Information Management Standards)
 
The case for Digital Objects Identifiers (DOIs) in support of research activi...
The case for Digital Objects Identifiers (DOIs) in support of research activi...The case for Digital Objects Identifiers (DOIs) in support of research activi...
The case for Digital Objects Identifiers (DOIs) in support of research activi...
AIMS (Agricultural Information Management Standards)
 
Webinar@AIMS_FAIR Principles and Data Management Planning
Webinar@AIMS_FAIR Principles and Data Management PlanningWebinar@AIMS_FAIR Principles and Data Management Planning
Webinar@AIMS_FAIR Principles and Data Management Planning
AIMS (Agricultural Information Management Standards)
 
Webinar@ASIRA: How to foster openness from an academic library
Webinar@ASIRA: How to foster openness from an academic library Webinar@ASIRA: How to foster openness from an academic library
Webinar@ASIRA: How to foster openness from an academic library
AIMS (Agricultural Information Management Standards)
 
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
AIMS (Agricultural Information Management Standards)
 
Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...
Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...
Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...
AIMS (Agricultural Information Management Standards)
 
Webinar@ASIRA: Introduction to Using TEEAL to Access Agricultural Journals
Webinar@ASIRA: Introduction to Using TEEAL to Access Agricultural Journals Webinar@ASIRA: Introduction to Using TEEAL to Access Agricultural Journals
Webinar@ASIRA: Introduction to Using TEEAL to Access Agricultural Journals
AIMS (Agricultural Information Management Standards)
 
Webinar@ASIRA: Access to Global Online Research in Agriculture (AGORA)
Webinar@ASIRA: Access to Global Online Research in Agriculture (AGORA) Webinar@ASIRA: Access to Global Online Research in Agriculture (AGORA)
Webinar@ASIRA: Access to Global Online Research in Agriculture (AGORA)
AIMS (Agricultural Information Management Standards)
 
Webinar@ASIRA: AGRIS: Providing Access to Agricultural Research and Technolog...
Webinar@ASIRA: AGRIS: Providing Access to Agricultural Research and Technolog...Webinar@ASIRA: AGRIS: Providing Access to Agricultural Research and Technolog...
Webinar@ASIRA: AGRIS: Providing Access to Agricultural Research and Technolog...
AIMS (Agricultural Information Management Standards)
 
Webinar@ASIRA: New Roles for Changing Times UNAM Subject Librarians in Context
Webinar@ASIRA: New Roles for Changing Times UNAM Subject Librarians in Context Webinar@ASIRA: New Roles for Changing Times UNAM Subject Librarians in Context
Webinar@ASIRA: New Roles for Changing Times UNAM Subject Librarians in Context
AIMS (Agricultural Information Management Standards)
 
Webinar@ASIRA: Emerging Themes in Agricultural Research Publishing
Webinar@ASIRA: Emerging Themes in Agricultural Research PublishingWebinar@ASIRA: Emerging Themes in Agricultural Research Publishing
Webinar@ASIRA: Emerging Themes in Agricultural Research Publishing
AIMS (Agricultural Information Management Standards)
 
Webinar@AIMS: OKAD & F1000Research: a very different approach to publishing a...
Webinar@AIMS: OKAD & F1000Research: a very different approach to publishing a...Webinar@AIMS: OKAD & F1000Research: a very different approach to publishing a...
Webinar@AIMS: OKAD & F1000Research: a very different approach to publishing a...
AIMS (Agricultural Information Management Standards)
 
Using AGRIS as a portal of choice to access agricultural research and technol...
Using AGRIS as a portal of choice to access agricultural research and technol...Using AGRIS as a portal of choice to access agricultural research and technol...
Using AGRIS as a portal of choice to access agricultural research and technol...
AIMS (Agricultural Information Management Standards)
 
Research4Life: La bibliothèque qui ouvre ses portes
Research4Life: La bibliothèque qui ouvre ses portesResearch4Life: La bibliothèque qui ouvre ses portes
Research4Life: La bibliothèque qui ouvre ses portes
AIMS (Agricultural Information Management Standards)
 
Publishing skos concept schemes with skosmos
Publishing skos concept schemes with skosmosPublishing skos concept schemes with skosmos
Publishing skos concept schemes with skosmos
AIMS (Agricultural Information Management Standards)
 
Research4Life: La biblioteca que abre puertas
Research4Life: La biblioteca que abre puertasResearch4Life: La biblioteca que abre puertas
Research4Life: La biblioteca que abre puertas
AIMS (Agricultural Information Management Standards)
 
Research4Life: The library that opens doors
Research4Life: The library that opens doorsResearch4Life: The library that opens doors
Research4Life: The library that opens doors
AIMS (Agricultural Information Management Standards)
 

More from AIMS (Agricultural Information Management Standards) (20)

Linked Data Competency Index : Mapping the field for teachers and learners
 Linked Data Competency Index : Mapping the field for teachers and learners Linked Data Competency Index : Mapping the field for teachers and learners
Linked Data Competency Index : Mapping the field for teachers and learners
 
Metadata as Standard: improving Interoperability through the Research Data Al...
Metadata as Standard: improving Interoperability through the Research Data Al...Metadata as Standard: improving Interoperability through the Research Data Al...
Metadata as Standard: improving Interoperability through the Research Data Al...
 
Assigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
Assigning Digital Object Identifiers (DOIs) to Plant Genetic ResourcesAssigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
Assigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
 
VocBench 3: some insights on the forthcoming release
VocBench 3: some insights on the forthcoming release VocBench 3: some insights on the forthcoming release
VocBench 3: some insights on the forthcoming release
 
The case for Digital Objects Identifiers (DOIs) in support of research activi...
The case for Digital Objects Identifiers (DOIs) in support of research activi...The case for Digital Objects Identifiers (DOIs) in support of research activi...
The case for Digital Objects Identifiers (DOIs) in support of research activi...
 
Webinar@AIMS_FAIR Principles and Data Management Planning
Webinar@AIMS_FAIR Principles and Data Management PlanningWebinar@AIMS_FAIR Principles and Data Management Planning
Webinar@AIMS_FAIR Principles and Data Management Planning
 
Webinar@ASIRA: How to foster openness from an academic library
Webinar@ASIRA: How to foster openness from an academic library Webinar@ASIRA: How to foster openness from an academic library
Webinar@ASIRA: How to foster openness from an academic library
 
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
 
Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...

SemaGrow demonstrator: “Web Crawler + AgroTagger”

  • 1. Crawling the Web Fabrizio Celli Rome, 25th September 2014
  • 2. Outline • Purpose of this Webinar • The Web Crawler • The AgroTagger • The AGRIS use case – What’s next?
  • 3. Purpose of this Webinar • SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission • Algorithms, infrastructures and methodologies to cope with large data volumes and real-time performance • http://www.semagrow.eu • One of the SemaGrow demonstrators is the “Web Crawler + AgroTagger” component, the subject of this Webinar
  • 4. The demonstrator • It is based on two command line applications (no user interface): – Web Crawler – AgroTagger • Goal: – discover resources on the Web – tag resources with AGROVOC URIs – keep only resources about agriculture and interlink them to AGRIS
  • 5. What we expect from the Webinar • Comments, suggestions, opinions • Other real-world scenarios for the demonstrator • You can send your feedback to agris@fao.org
  • 7. Apache Nutch • http://nutch.apache.org/ • Highly extensible and scalable open-source Web crawler • Configurable • Input: a list of pre-selected URLs • Output: a list of discovered URLs
  • 8. How it works • The user defines a list of Web sites (URLs) • Each URL is a ROOT • The user defines the “depth”: the maximum number of "hops" a discovered link may be away from the ROOT – Links very "far away" from the ROOT are unlikely to hold much information • Start to crawl the Web!
  • 9. Example: depth = 3
      ROOT (URL)
      depth = 1: URL_1_1, URL_1_2, …, URL_1_n
      depth = 2: URL_2_2_1, …, URL_2_2_m
      depth = 3: URL_3_2_1_1, …, URL_3_2_1_p
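The depth rule described in slides 8 and 9 can be sketched as a breadth-first traversal that stops following links once a page is `max_depth` hops from its ROOT. This is a minimal illustration only, not how Apache Nutch is implemented: the function names, the `get_links` callback, and the toy link graph (including the `URL_4_too_far` entry) are assumptions made for the example.

```python
from collections import deque

def crawl(roots, get_links, max_depth):
    """Return {url: depth} for every URL within max_depth hops of a root."""
    depth_of = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            if link not in depth_of:  # first visit gives the shortest hop count
                depth_of[link] = depth + 1
                queue.append(link)
    return depth_of

# Toy link graph standing in for the Web (hypothetical data).
TOY_WEB = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

def toy_links(url):
    """Hypothetical link extractor: look the page up in the toy graph."""
    return TOY_WEB.get(url, [])

discovered = crawl(["ROOT"], toy_links, max_depth=3)
```

With `max_depth=3`, `URL_3_2_1_1` is discovered but its own outlinks are not followed, which matches the three levels shown in the example slide.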
  • 10. The application • https://github.com/agrisfao/agrotagger/tree/master/crawler/application • Command line application • Provided with bash scripts to run in Linux environments • Example of usage: – depth = 5 – output directory = work/output – directory with source URLs = work/urls – command: crawler_exec.sh 5 work/output work/urls
  • 11. The output
      URL:: http:/
      URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
      URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
      URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
      URL:: http://2014.northernspark.org/
      URL:: http://2014.northernspark.org/project/chimera
        outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
        outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
        outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
      URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
        outlink: toUrl: http://www.aaea.org/ anchor: AAEA
        outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
        outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
        outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
        outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
      ...
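A listing like the crawler output in slide 11 can be read back with a few lines of code. The sketch below assumes one record per line, with each `URL::` line starting a page and each `outlink: toUrl: … anchor: …` line belonging to the most recent page. That layout is inferred from the slide rather than from a format specification, and the function name is hypothetical.

```python
def parse_crawler_output(lines):
    """Group (target, anchor text) outlinks under the page URL that precedes them."""
    pages = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("URL:: "):
            current = line[len("URL:: "):]
            pages.setdefault(current, [])
        elif line.startswith("outlink: toUrl: ") and current is not None:
            rest = line[len("outlink: toUrl: "):]
            # The anchor text may be empty, as in several lines of the slide.
            target, _, anchor = rest.partition(" anchor:")
            pages[current].append((target.strip(), anchor.strip()))
    return pages

sample = [
    "URL:: http://aaea.execinc.com/edibo/JobMarketCandidates",
    "outlink: toUrl: http://www.aaea.org/ anchor: AAEA",
    "outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In",
    "URL:: http://2014.northernspark.org/project/chimera",
    "outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:",
]
pages = parse_crawler_output(sample)
```

Keeping the anchor text alongside each target is a design choice for the example; a consumer that needs only the discovered URLs could discard it.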
  • 13. AGROVOC • FAO multilingual vocabulary • Over 32 000 concepts in up to 21 languages • Part of the LOD cloud • Extensively used by cataloguers for indexing data in agricultural information systems • http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
  • 14. The AgroTagger • At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from the resources behind a set of URLs • More precisely, it extracts AGROVOC URIs • It is based on MAUI
  • 15. MAUI • Maui is named after the Polynesian mythological hero and demi-god, who would transform himself into different kinds of birds to perform many of his exploits • Maui automatically identifies main topics in text documents • It builds on the Kea keyphrase extraction algorithm and the Weka machine learning toolkit (both named after New Zealand native birds) • https://code.google.com/p/maui-indexer
  • 16. How it works • Input: – A text file with a list of URLs – The output file of an Apache Nutch crawler • Output: – A set of triples <URL> dcterms:subject <AGROVOC_URI>
  • 17. The algorithm • For each URL in the input file – Download the resource – Run the MAUI indexer trained with AGROVOC – Create a set of triples • Multi-threaded • Currently, MAUI is trained only for English – It can be trained for other languages that use Latin characters – Other solutions are needed for Chinese, Arabic, Russian, etc.
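The loop in slide 17 (tag each URL, then emit `<URL> dcterms:subject <AGROVOC_URI>` triples) can be sketched as follows. This is not the AgroTagger's Java code: the download and MAUI steps are replaced by a hypothetical `tag_url` callback, the AGROVOC concept identifiers are placeholders, and a thread pool stands in for the multi-threading the slide mentions. The only fixed name is the Dublin Core Terms property URI, `http://purl.org/dc/terms/subject`.

```python
from concurrent.futures import ThreadPoolExecutor

DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def to_ntriples(url, agrovoc_uris):
    """Serialize <URL> dcterms:subject <AGROVOC_URI> statements as N-Triples lines."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in agrovoc_uris]

def tag_all(urls, tag_url, workers=4):
    """Tag every URL concurrently and collect the resulting triples in input order."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(tag_url, urls))  # map preserves input order
    triples = []
    for url, uris in zip(urls, results):
        triples.extend(to_ntriples(url, uris))
    return triples

# Hypothetical tagger output; the concept IDs are placeholders, not real AGROVOC concepts.
FAKE_TAGS = {
    "http://example.org/a": ["http://aims.fao.org/aos/agrovoc/c_0001"],
    "http://example.org/b": [],
}

def fake_tagger(url):
    """Stand-in for 'download the resource and run the MAUI indexer'."""
    return FAKE_TAGS[url]

triples = tag_all(list(FAKE_TAGS), fake_tagger)
```

A resource with no extracted concepts contributes no triples, so a later filtering step would drop it.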
  • 18. The application • https://github.com/agrisfao/agrotagger • Command line application • Written entirely in Java • Provided with bash scripts • Example of usage: – directory with source files = work/source – output directory = work/output – type of source files = nutchOutput – output format = rdfnt – command: taggerDir.sh /work/source /work/output nutchOutput rdfnt
  • 19. The output (figure: Input → AgroTagger → Output)
  • 20. THE AGRIS USE CASE
  • 21. AGRIS • http://agris.fao.org • A collection of more than 7.8 million bibliographic references in agriculture • AGRIS records come with AGROVOC descriptors • An RDF-aware system – the AGRIS database is publicly exposed as RDF – AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)
  • 23. SemaGrow demonstrator • The core idea is to harvest the Web – Input: pre-selected sources of information about agriculture • Crawl and assign AGROVOC URIs – Store triples in the “crawler” database • Definition of combinations between the “crawler” database and the AGRIS database • New widget in AGRIS mashup pages!
  • 24. Related resources available on the Web • http://... • https://...
  • 25. Current status • The Web Crawler gathers data from the Web • The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs • A “crawler” triplestore is ready for computations
  • 26. What’s next • Processing phase • Discover meaningful combinations between the AGRIS core database and the “crawler” database • A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page • Evaluation of the quality of the widget • What does “meaningful combinations” mean?
  • 27. Naïve Algorithm • Just for testing purposes • Meaningful combinations = at least N common AGROVOC URIs
  • 28. Example • http://ageconsearch.umn.edu/ • 101,000 distinct Web resources discovered by the Web Crawler (depth = 5) • ~1 million triples generated by the AgroTagger (“crawler” database)
      Number of AGRIS records | N (common AGROVOC URIs between AGRIS and the output of the Crawler) | Number of associations
      900 K | 3 | 17 MLN
      900 K | 4 | 3.2 MLN
      1 MLN | 5 | 0.6 MLN
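The naïve rule from slide 27 (a combination is meaningful when an AGRIS record and a crawled resource share at least N AGROVOC URIs) can be sketched with set intersections. The record identifiers, URLs, and short concept labels below are hypothetical toy data, and the pairwise loop is only one possible way to compute the associations, not necessarily the one used to produce the figures in the example slide.

```python
def associations(agris_records, crawled_resources, n):
    """Return (record_id, url) pairs that share at least n AGROVOC URIs."""
    pairs = []
    for record_id, record_uris in agris_records.items():
        for url, url_uris in crawled_resources.items():
            if len(record_uris & url_uris) >= n:  # size of the set intersection
                pairs.append((record_id, url))
    return pairs

# Toy data: each value is the set of AGROVOC URIs attached to the item
# (abbreviated to single letters for readability).
agris = {"rec1": {"a", "b", "c", "d"}, "rec2": {"a", "x"}}
crawled = {
    "http://example.org/p": {"a", "b", "c"},
    "http://example.org/q": {"a", "x", "y"},
}

loose = associations(agris, crawled, n=2)
strict = associations(agris, crawled, n=3)
```

Raising N makes the filter stricter, so the number of associations shrinks. Comparing every record with every resource is quadratic, so at the scale in the slide an index from each AGROVOC URI to the items carrying it would likely be needed.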
  • 29. Your feedback • Comments, suggestions, other real-world scenarios • Ideas about the meaning of “meaningful combinations” • If you test the application, any comments on how to improve it • Can the demonstrator help overcome data problems? • You can send your feedback to agris@fao.org
  • 30. 谢谢 Gracias σας ευχαριστώ

Editor's Notes

  1. Kea is a keyphrase extraction algorithm; Weka is a machine learning toolkit.