Arcomem training heritrix_advanced

This presentation on using the Heritrix crawler is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.

Published in: Technology, Education

  1. Adaptive Heritrix
     ATHENA – Research and Innovation Center in Information, Communication and Knowledge Technologies
  2. ARCOMEM requirements for crawling
       • ARCOMEM aims to guide crawling based on
           – Advanced semantic link extraction
           – Use of social media
           – Analysis of crawled content in a large-scale distributed environment
       • These aims require a crawler to
           – Update priorities adaptively
           – Operate as a service
  3. Adaptive prioritization
       • New Heritrix frontier class
           – Plug & play with open-source Heritrix
           – Minimal configuration
       • Adds a forward index for URLs
           – Locates a link already scheduled for crawling
       • Moves the scheduled link to the position corresponding to its updated priority
  4. Heritrix as a crawling service
       • Decoupled fetching and link prioritization
       • Crawled data is written to modified WARC files
           – WARCs are loaded into HBase by a separate process
       • Efficient URL injection endpoint
           – Receives scored links from the online analysis and the API crawler
           – ARCOMEM-specific JSON format for outlinks
           – External-memory queue to handle large volumes of links
  5. Assessing the impact of adaptive prioritization
       • Simulations to evaluate how adaptive prioritization affects the performance of a focused crawler
           – Simulation on 3 DMOZ topics: Genetics, Recycling, Oceanography
       • Running a simulated crawl
           – Start from a set of 20 randomly selected seeds (repeated 3 times)
           – Topic vector is the sum of the seed vectors
           – Crawl 10,000 web pages
       • Compare the effectiveness of a best-first crawler to
           – Adaptive prioritization: priorities are updated using the MAX, MIN, AVG, SUM, FIRST, LAST functions
  6. Adaptive prioritization results

       Update function   Harvest Ratio   Average Similarity   DMOZ topics
       FIRST             0.3317          0.2945               0.4979
       AVG               0.3609          0.3024               0.5779
       MAX               0.3388          0.2967               0.5270
       SUM               0.2679          0.2759               0.4650
       LAST              0.3404          0.2961               0.5985

       • AVG and LAST have the highest harvest ratios and find the most pages from the DMOZ topics
       • Adaptive prioritization with AVG, MAX, or LAST is more effective than FIRST, i.e. the best-first crawler; SUM scores lower than FIRST on all three measures
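
The slides report a harvest ratio and an average similarity without defining them. The following is a minimal sketch of how such measures are commonly computed for a focused crawler, not the ARCOMEM evaluation code. It assumes raw term-count vectors, cosine similarity against the topic vector (the sum of the seed vectors, as on slide 5), and a hypothetical relevance threshold of 0.5 for the harvest ratio; the function names are illustrative only.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(text.lower().split())


def topic_vector(seed_texts):
    """Topic vector built as the sum of the seed page vectors."""
    topic = Counter()
    for text in seed_texts:
        topic.update(term_vector(text))
    return topic


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def evaluate(crawled_texts, topic, threshold=0.5):
    """Return (harvest_ratio, average_similarity) for a finished crawl.

    harvest_ratio: share of crawled pages whose similarity to the topic
    reaches `threshold` (the threshold value is an assumption).
    """
    sims = [cosine(term_vector(text), topic) for text in crawled_texts]
    if not sims:
        return 0.0, 0.0
    harvest_ratio = sum(s >= threshold for s in sims) / len(sims)
    average_similarity = sum(sims) / len(sims)
    return harvest_ratio, average_similarity


topic = topic_vector(["gene dna", "dna genome"])
pages = ["dna gene genome dna", "football score"]
print(evaluate(pages, topic))
```

Here the first page matches the topic vector and the second shares no terms with it, so half of the crawled pages count as relevant under the assumed threshold.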
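
The forward index and the update functions described in the slides can be illustrated with a small in-memory frontier. This is a sketch under stated assumptions, not the Heritrix frontier class: `AdaptiveFrontier` and its methods are hypothetical names, the queue is a plain binary heap, and a rescored link is "moved" by invalidating its old heap entry and pushing a new one (lazy deletion) rather than by relocating it in place.

```python
import heapq
import itertools

# Update functions named on the slides. Each one combines every score
# reported so far for a URL into a single priority (higher = crawl sooner).
UPDATE_FUNCTIONS = {
    "MAX": max,
    "MIN": min,
    "AVG": lambda scores: sum(scores) / len(scores),
    "SUM": sum,
    "FIRST": lambda scores: scores[0],  # never changes: plain best-first
    "LAST": lambda scores: scores[-1],
}


class AdaptiveFrontier:
    """Toy frontier: max-priority queue plus a forward index keyed by URL."""

    def __init__(self, update="AVG"):
        self._update = UPDATE_FUNCTIONS[update]
        self._heap = []      # entries: [-priority, tie_breaker, url, valid]
        self._index = {}     # forward index: url -> its live heap entry
        self._scores = {}    # url -> all scores seen while it is scheduled
        self._done = set()   # URLs already handed out for fetching
        self._counter = itertools.count()

    def schedule(self, url, score):
        """Add a scored link, or update the priority of a scheduled one."""
        if url in self._done:
            return
        scores = self._scores.setdefault(url, [])
        scores.append(score)
        stale = self._index.get(url)
        if stale is not None:
            stale[3] = False  # the forward index finds the old entry
        entry = [-self._update(scores), next(self._counter), url, True]
        self._index[url] = entry
        heapq.heappush(self._heap, entry)

    def next_url(self):
        """Pop the highest-priority live link, skipping invalidated entries."""
        while self._heap:
            neg_priority, _, url, valid = heapq.heappop(self._heap)
            if valid:
                del self._index[url]
                del self._scores[url]
                self._done.add(url)
                return url, -neg_priority
        return None


frontier = AdaptiveFrontier("AVG")
frontier.schedule("http://example.org/a", 0.25)
frontier.schedule("http://example.org/b", 0.4)
frontier.schedule("http://example.org/a", 0.75)  # rescored before fetching
print(frontier.next_url())  # ('http://example.org/a', 0.5)
```

With `"FIRST"` instead of `"AVG"`, the second score for `/a` is ignored and `/b` is returned first, which corresponds to the best-first baseline in the comparison.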
