SemaGrow demonstrator: “Web Crawler + AgroTagger”

The webinar will present the SemaGrow demonstrator “Web Crawler + AgroTagger”, in order to collect feedback, ideas and comments about the status of the development and how the demonstrator helps to overcome data problems.

SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission, aiming at developing algorithms, infrastructures and methodologies to cope with large data volumes and real time performance.

In this context, FAO is providing a component that can be used to crawl the Web and give meaning to the discovered resources: the AgroTagger assigns AGROVOC URIs to the resources gathered by a Web crawler.

The demonstrator is publicly available at https://github.com/agrisfao/agrotagger.

Published in: Technology

  1. Crawling the Web
     Fabrizio Celli
     Rome, 25th September 2014
  2. Outline
     • Purpose of this Webinar
     • The Web Crawler
     • The AgroTagger
     • The AGRIS use case
       – What’s next?
  3. Purpose of this Webinar
     • SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission
     • It develops algorithms, infrastructures and methodologies to cope with large data volumes and real-time performance
     • http://www.semagrow.eu
     • One of the SemaGrow demonstrators is the component “Web Crawler + AgroTagger”, the subject of this Webinar
  4. The demonstrator
     • It is based on two command-line applications (no user interface):
       – Web Crawler
       – AgroTagger
     • Goal:
       – discover resources on the Web
       – tag resources with AGROVOC URIs
       – keep only resources about agriculture and interlink them to AGRIS
  5. What we expect from the Webinar
     • Comments, suggestions, opinions
     • Other real case scenarios for the demonstrator
     • You can send your feedback to agris@fao.org
  6. THE WEB-CRAWLER
  7. Apache Nutch
     • http://nutch.apache.org/
     • Highly extensible and scalable open source Web crawler
     • Configurable
     • Input: a list of pre-selected URLs
     • Output: a list of discovered URLs
  8. How it works
     • The user defines a list of Web sites (URLs)
     • Each URL is a ROOT
     • The user defines the “depth”: the maximum number of "hops" a discovered link may be away from the ROOT
       – Links very "far away" from the ROOT are unlikely to hold much relevant information
     • Start to crawl the Web!
  9. Example: depth = 3
     [Diagram: a tree of crawled links. The ROOT (URL) links to URL_1_1, URL_1_2, …, URL_1_n at depth = 1; URL_1_2 links to URL_2_2_1, …, URL_2_2_m at depth = 2; URL_2_2_1 links to URL_3_2_1_1, …, URL_3_2_1_p at depth = 3.]
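The depth rule can be illustrated with a small breadth-first traversal. This is only a sketch of the idea, not Apache Nutch's implementation: the `crawl` function, the `get_links` callback and the in-memory `LINKS` graph are hypothetical stand-ins for real page fetching, and depth is counted in hops from the ROOT as defined on the previous slide.

```python
from collections import deque


def crawl(roots, get_links, max_depth):
    """Breadth-first crawl: return {url: depth} for every URL within max_depth hops of a root."""
    depth_of = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            if link not in depth_of:  # in BFS the first visit is the shortest path
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# Hypothetical link graph standing in for real HTTP fetches.
LINKS = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

found = crawl(["ROOT"], lambda url: LINKS.get(url, []), max_depth=3)
```

With `max_depth=3` the traversal records URL_3_2_1_1 but never follows its links, so the page four hops from the ROOT is not discovered.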
  10. The application
      • https://github.com/agrisfao/agrotagger/tree/master/crawler/application
      • Command line application
      • Provided with bash scripts to run in Linux environments
      • Example of usage:
        – depth = 5
        – output directory = work/output
        – directory with source URLs = work/urls

        crawler_exec.sh 5 work/output work/urls
  11. The output
      URL:: http:/
      URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
      URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
      URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
      URL:: http://2014.northernspark.org/
      URL:: http://2014.northernspark.org/project/chimera
        outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
        outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
        outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
      URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
        outlink: toUrl: http://www.aaea.org/ anchor: AAEA
        outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
        outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
        outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
        outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
      ...
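A downstream tool could read this output roughly as follows. The sketch assumes the layout suggested by the slide, with one `URL::` entry or one `outlink: toUrl: … anchor: …` entry per line; the real dump may differ, and `parse_crawler_output` is a hypothetical helper, not part of the demonstrator.

```python
def parse_crawler_output(text):
    """Map each 'URL::' entry to its list of (target_url, anchor_text) outlinks."""
    pages = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("URL::"):
            current = line[len("URL::"):].strip()
            pages.setdefault(current, [])
        elif line.startswith("outlink: toUrl:") and current is not None:
            rest = line[len("outlink: toUrl:"):].strip()
            # The anchor text may be empty, so split on the ' anchor:' label itself.
            target, _, anchor = rest.partition(" anchor:")
            pages[current].append((target.strip(), anchor.strip()))
    return pages


# Sample lines copied from the slide above.
SAMPLE = """\
URL:: http://2014.northernspark.org/
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
outlink: toUrl: http://www.aaea.org/ anchor: AAEA
outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
"""

pages = parse_crawler_output(SAMPLE)
```

The keys of `pages` are the discovered URLs, which is the list the AgroTagger needs as input.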
  12. THE AGROTAGGER
  13. AGROVOC
      • FAO multilingual vocabulary
      • Over 32 000 concepts in up to 21 languages
      • Part of the LOD cloud
      • Extensively used by cataloguers for indexing data in agricultural information systems
      • http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
  14. The AgroTagger
      • At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from the resources behind a set of URLs
      • Or better… to extract URIs
      • It is based on MAUI
  15. MAUI
      • Maui is named after the Polynesian mythological hero and demi-god, who would transform himself into different kinds of birds to perform many of his exploits
      • Maui automatically identifies main topics in text documents
      • It uses different kinds of algorithms (Kea and Weka, named after New Zealand native birds)
      • https://code.google.com/p/maui-indexer
  16. How it works
      • Input:
        – A text file with a list of URLs
        – The output file of an Apache Nutch crawler
      • Output:
        – A set of triples <URL> dcterms:subject <AGROVOC_URI>
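One way to write such triples is as N-Triples lines, with `dcterms:subject` expanded to its full IRI `http://purl.org/dc/terms/subject`. This is a minimal sketch: `to_ntriples` is a hypothetical helper, the AGROVOC concept identifiers below are placeholders rather than real concepts, and the URLs are assumed to be valid IRIs already (no escaping is performed).

```python
# Full IRI behind the dcterms:subject prefixed name.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(url, agrovoc_uris):
    """Return one N-Triples line per <url> dcterms:subject <AGROVOC URI> statement."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in agrovoc_uris]


# Placeholder concept URIs, for illustration only.
lines = to_ntriples(
    "http://example.org/doc",
    [
        "http://aims.fao.org/aos/agrovoc/c_0001",
        "http://aims.fao.org/aos/agrovoc/c_0002",
    ],
)
```

Each element of `lines` is a complete statement terminated by ` .`, so the list can be joined with newlines and written to a file.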
  17. The algorithm
      • For each URL in the input file
        – Download the resource
        – Run the MAUI indexer trained with AGROVOC
        – Create a set of triples
      • Multi-threaded
      • Currently, MAUI is trained only for English
        – It can be trained in other languages that use Latin characters
        – Other solutions are needed for Chinese, Arabic, Russian, etc.
  18. The application
      • https://github.com/agrisfao/agrotagger
      • Command line application
      • Entirely based on Java
      • Provided with bash scripts
      • Example of usage:
        – directory with source files = work/source
        – output directory = work/output
        – type of source files = nutchOutput
        – output format = rdfnt

        taggerDir.sh /work/source /work/output nutchOutput rdfnt
  19. The output
      [Figure: Input → AgroTagger → Output]
  20. THE AGRIS USE CASE
  21. AGRIS
      • http://agris.fao.org
      • A collection of more than 7.8 million bibliographic references in agriculture
      • AGRIS records come with AGROVOC descriptors
      • An RDF-aware system
        – the AGRIS database is publicly exposed as RDF
        – AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)
  23. SemaGrow demonstrator
      • The core idea is to harvest the Web
        – Input: pre-selected sources of information about agriculture
      • Crawl and assign AGROVOC URIs
        – Store triples in the “crawler” database
      • Define combinations between the “crawler” database and the AGRIS database
      • New widget in AGRIS mashup pages!
  24. Related resources available on the Web
      • http://...
      • https://...
  25. Current status
      • The Web Crawler gathers data from the Web
      • The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs
      • A “crawler” triplestore is ready for computations
  26. What’s next
      • Processing phase
      • Discover meaningful combinations between the AGRIS core database and the “crawler” database
      • A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page
      • Evaluation of the quality of the widget
      • What does “meaningful combinations” mean?
  27. Naïve Algorithm
      • Just for testing purposes
      • Meaningful combinations = at least N common AGROVOC URIs
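The naïve rule can be sketched as a set-overlap count. This is an illustration under assumptions, not the demonstrator's code: `meaningful_combinations` is hypothetical, both databases are represented as plain dictionaries mapping an identifier to its set of subject URIs, and the short `c1`…`c9` strings stand in for AGROVOC URIs. An inverted index avoids comparing every AGRIS record with every crawled URL.

```python
from collections import defaultdict


def meaningful_combinations(agris, crawled, n):
    """Return (record_id, url, shared_count) for every pair sharing at least n subject URIs."""
    # Inverted index: subject URI -> crawled URLs tagged with it.
    urls_by_subject = defaultdict(set)
    for url, subjects in crawled.items():
        for subject in subjects:
            urls_by_subject[subject].add(url)

    combos = []
    for record_id, subjects in agris.items():
        shared = defaultdict(int)
        for subject in set(subjects):
            for url in urls_by_subject.get(subject, ()):
                shared[url] += 1  # one more subject in common with this URL
        combos.extend((record_id, url, count) for url, count in shared.items() if count >= n)
    return sorted(combos)


# Toy data; the short identifiers stand in for AGROVOC URIs.
agris = {"rec1": {"c1", "c2", "c3"}, "rec2": {"c1", "c4"}}
crawled = {
    "http://example.org/x": {"c1", "c2", "c3", "c9"},
    "http://example.org/y": {"c1", "c4"},
}

at_least_two = meaningful_combinations(agris, crawled, n=2)
at_least_three = meaningful_combinations(agris, crawled, n=3)
```

Raising N from 2 to 3 drops the rec2 pairing, the same effect as the falling association counts in the example that follows.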
  28. Example
      • http://ageconsearch.umn.edu/
      • 101,000 distinct Web resources discovered by the Web Crawler (depth = 5)
      • ~1 million triples generated by the AgroTagger (“crawler” database)

      N = number of common AGROVOC URIs between AGRIS and the output of the Crawler

      Number of AGRIS records    N    Number of associations
      900 K                      3    17 MLN
      900 K                      4    3.2 MLN
      1 MLN                      5    0.6 MLN
  29. Your feedback
      • Comments, suggestions, other real case scenarios
      • Ideas about the meaning of “meaningful combinations”
      • If you test the application, any comments that could improve it
      • Can the demonstrator help to overcome data problems?
      • You can send your feedback to agris@fao.org
  30. Thank you (谢谢 · Gracias · σας ευχαριστώ)
