Focused Crawling for Vertical Search

A tutorial for focused crawling presented at the JCC, Curicó, Chile, 2011.

Focused Crawling for Vertical Search Presentation Transcript

  • 1. Focused Crawling for Vertical Search. Marcelo Mendoza. JCC 2011, Curicó, Chile, 11.11.11.
  • 2. Overview: 1. Vertical Search; 2. Crawling; 3. State-of-the-art; 4. Conclusion.
  • 3. Vertical Search: Why does Web vertical search matter? Web size: more than 20 billion pages. Millions of users, millions of queries, millions of needs. Advantages: (1) greater precision due to limited scope; (2) leverage of domain knowledge (ontologies). Domains: business, medicine, science, education, ...
  • 4. Vertical Search: Science vertical search (scienceresearch.com).
  • 5. Vertical Search: Business vertical search (biznar.com).
  • 6. Vertical Search: Education vertical search (contentcompass.cl). Footnote: Fondef D08I1155.
  • 7. Crawling: Hyperlinks among web pages.
  • 8. Crawling: The Web as a graph (nodes: web pages; edges: hyperlinks).
  • 9. Crawling: The Web, some facts. Size of the Web: 11.5 billion pages (indexable, 2005). The deep Web: content available only by querying databases. Static / dynamic pages. Graph model: scale-free network, degree distribution ≈ power law. Web structure: bow-tie model (IN/SCC/OUT/ISLANDS).
  • 10. Crawling: Crawler architecture. Online resource: C. Castillo, Effective Web Crawling (PhD thesis).
  • 11. Crawling: Crawling strategies. Breadth-first crawlers: URL frontier implemented as a FIFO queue. Preferential crawlers: URL frontier implemented as a priority queue. Priority scores: (1) topological properties (e.g. indegree of the target page); (2) content properties (e.g. similarity between a query and the source page); (3) hybrid measures.
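
The two frontier disciplines on slide 11 can be compared on a toy graph. The following is a minimal sketch, not the tutorial's own code: the graph `GRAPH`, the scores in `SCORE`, and both function names are hypothetical, and the scores stand in for any of the topological, content, or hybrid measures listed above.

```python
import heapq
from collections import deque

# Hypothetical toy web graph: page -> outgoing links.
GRAPH = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": ["target"],
    "target": [],
}

# Hypothetical priority scores (higher means more promising).
SCORE = {"seed": 1.0, "a": 0.1, "b": 0.9, "c": 0.1,
         "d": 0.1, "e": 0.8, "target": 1.0}


def breadth_first_crawl(graph, seed, target):
    """Crawl with a FIFO frontier; return pages in fetch order."""
    frontier = deque([seed])
    seen = {seed}
    fetched = []
    while frontier:
        page = frontier.popleft()
        fetched.append(page)
        if page == target:
            break
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


def preferential_crawl(graph, score, seed, target):
    """Crawl with a priority-queue frontier; return pages in fetch order."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score[seed], seed)]
    seen = {seed}
    fetched = []
    while frontier:
        _, page = heapq.heappop(frontier)
        fetched.append(page)
        if page == target:
            break
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score[link], link))
    return fetched
```

On this toy graph the breadth-first crawler fetches every page before reaching the target, while the preferential crawler fetches only the pages on the high-scoring path. How well that holds in practice depends entirely on the quality of the scores.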
  • 12. Crawling: Universal / focused crawling. Universal crawlers: general purpose. Challenges: (1) scalability; (2) coverage / freshness. Focused crawlers: crawl only pages on certain topics. Challenges: (1) coverage / accuracy.
  • 13. Crawling: Focused crawling. Breadth-first, depth 1 (figure: Seed, Target).
  • 14. Crawling: Focused crawling. Breadth-first, depth 2 (figure: Seed, Target).
  • 15. Crawling: Focused crawling. Breadth-first, depth 3 (figure: Seed, Target).
  • 16. Crawling: Focused crawling. Breadth-first: unreachable pages, excessive computational costs! (figure: Seed, Target).
  • 17. State-of-the-art: Early algorithms, Fish search. De Bra, P., and Post, R. (1994): query (keywords), source page terms, term-based distance, best-first.
  • 18. State-of-the-art: Early algorithms, Shark search. Hersovici et al. (1998): query (keywords), anchor text, term-based distance, best-first.
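
The slides only name the ingredients of these early algorithms (a keyword query, source page terms, anchor text, a term-based distance). The sketch below is a simplified illustration in that spirit, not the published Fish search or Shark search procedure: a link inherits a decayed share of its parent's relevance and is additionally scored by its anchor text. The function names and the `decay` and `gamma` weights are assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def link_score(query, parent_text, anchor_text,
               parent_inherited=0.0, decay=0.5, gamma=0.5):
    """Score an unvisited link from its parent page and its anchor text.

    decay and gamma are hypothetical weights chosen for illustration.
    """
    parent_sim = cosine(query, parent_text)
    # A relevant parent passes on its own similarity; an irrelevant one
    # passes on whatever it inherited itself, decayed once more.
    inherited = decay * (parent_sim if parent_sim > 0 else parent_inherited)
    return gamma * inherited + (1 - gamma) * cosine(query, anchor_text)
```

Scores computed this way would serve as priorities in a best-first frontier, so links with on-topic anchors or on-topic ancestors are fetched first.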
  • 19. State-of-the-art: Early algorithms, ARACHNID. Menczer, F. (1997): multi-agent, evolution-inspired: mutation (new seeds), fitness (score acc.), term-based scores.
  • 20. State-of-the-art: Context, link analysis. The Web graph as an information source (beyond the text). Kleinberg, J. (1998): HITS: authoritative pages (many in-links from good hubs), hub pages (many out-links to good authorities). Brin, S. & Page, L. (1998): PageRank: random walk over the Web graph, stationary probability vector.
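
The stationary probability vector of the PageRank random walk can be approximated by power iteration. This is a minimal sketch under stated assumptions: the damping factor 0.85 and the uniform handling of dangling pages are common conventions rather than details given on the slide, and every link target is assumed to appear as a key of `graph`.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    graph maps each page to the list of pages it links to.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page receives the same base probability.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if links:
                # Follow a link: split the page's rank among its out-links.
                share = damping * rank[page] / len(links)
                for dest in links:
                    new_rank[dest] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for dest in pages:
                    new_rank[dest] += damping * rank[page] / n
        rank = new_rank
    return rank
```

The ranks sum to one after every iteration, so the result can be read directly as the probability of finding the random surfer on each page.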
  • 21. State-of-the-art: Link-based algorithms. Cho, J., Garcia-Molina, H., Page, L. (1998): link-based scores: backlink count, PageRank. Chakrabarti, S., Van den Berg, M., and Dom, B. (1999): topic distillation: text-based classifier over web page examples per category (off-line dataset construction, human labeling, content text, positive and negative examples). On-line phase: anchor-based score (ML) + HITS-based score for distillation.
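
The HITS-based score used for distillation rests on the mutual reinforcement between hubs and authorities. The sketch below shows only the basic HITS iteration on a small graph; it is not the distillation procedure of the cited crawler, and the function names and the iteration count are assumptions.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(graph, iterations=20):
    """Return (hub, authority) scores for a graph of page -> out-links."""
    hub = {page: 1.0 for page in graph}
    auth = {page: 1.0 for page in graph}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        auth = {page: 0.0 for page in graph}
        for page, links in graph.items():
            for dest in links:
                auth[dest] += hub[page]
        auth = _normalize(auth)
        # Hub update: sum the authority scores of pages linked to.
        hub = _normalize({page: sum(auth[d] for d in links)
                          for page, links in graph.items()})
    return hub, auth
```

A page's backlink count is the degenerate case in which every linking page contributes the same weight; HITS instead weights each backlink by the hub quality of its source.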
  • 22. State-of-the-art: Link-based algorithms, basic assumptions. Davison, B. (2000): topical locality: locality based on anchor text and links (figure: Seed, Target).
  • 23. State-of-the-art: Link-based algorithms, basic assumptions. Menczer, F. (2004): link-cluster conjecture: related pages tend to be linked.
  • 24. State-of-the-art: Link-based algorithms, backlink graph. Considering how far away the target is: layered backlink graph! Diligenti et al. (2000): using the backlink graph for multiclass learning. Greedy approach. Babaria et al. (2007): using the backlink graph for ordinal regression. Greedy approach.
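
A layered backlink graph assigns each page a layer equal to its link distance from a target page. One way to build it, assuming the link graph is already known, is a breadth-first search over reversed links. The function name and the `max_depth` parameter are hypothetical; the cited systems go on to train a classifier or ranker on these layers, which is not shown here.

```python
from collections import deque


def backlink_layers(graph, targets, max_depth):
    """Map each page to the number of links separating it from a target.

    graph maps page -> out-links. Pages farther than max_depth links
    from every target are left out of the result.
    """
    # Reverse the graph so that backlinks can be followed.
    backlinks = {}
    for page, links in graph.items():
        for dest in links:
            backlinks.setdefault(dest, []).append(page)

    layer = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        page = queue.popleft()
        if layer[page] == max_depth:
            continue  # do not expand beyond the outermost layer
        for src in backlinks.get(page, []):
            if src not in layer:
                layer[src] = layer[page] + 1
                queue.append(src)
    return layer
```

The layer number is what makes the learning problem multiclass (one class per layer) or ordinal (layers are ordered by distance to the target).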
  • 25. State-of-the-art: Off-line learning-based algorithms. Kinds of features: the content of the web pages that are known to link to the candidate URL; URL tokens from the candidate URL.
  • 26. State-of-the-art: Off-line learning-based algorithms. Rennie & McCallum (1999): 1st stage (off-line): text-based features (anchor + header + title of the target). 2nd stage (on-line): candidate URL scoring based on the text classifier (candidate URL: anchor + URL text). Li et al. (2005): 1st stage (off-line): ID3 learning strategy, anchor text-based features. 2nd stage (on-line): candidate URL scoring based on the text classifier (candidate URL: anchor).
  • 27. State-of-the-art: Off-line learning-based algorithms. Pant & Srinivasan (2006): 1st stage (off-line): SVM learning strategy, content text-based features. 2nd stage (on-line): candidate URL scoring based on the text classifier (candidate URL: surrounding text). Feng et al. (2010): 1st stage (off-line): term-based weights, weighted graph construction. 2nd stage (off-line): PageRank over the weighted graph. 3rd stage (off-line): labeling based on PageRank, term-based learning. 4th stage (on-line): candidate URL scoring based on the text classifier (candidate URL: anchor).
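
The off-line approaches above share a two-stage shape: train a text classifier on labeled examples, then score each candidate URL from its anchor text and URL tokens. The sketch below illustrates that shape with a small multinomial naive Bayes model. It is a generic stand-in rather than any of the cited systems, and the class name, tokenizer, smoothing scheme, and example URLs are all assumptions.

```python
import math
import re
from collections import Counter


def link_tokens(url, anchor):
    """Lowercased alphanumeric tokens from a URL and its anchor text."""
    text = (url + " " + anchor).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


class NaiveBayesLinkScorer:
    """Multinomial naive Bayes over link tokens, with add-one smoothing."""

    def __init__(self):
        self.term_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, examples):
        """Off-line stage: examples are (url, anchor, relevant) triples."""
        for url, anchor, relevant in examples:
            self.term_counts[relevant].update(link_tokens(url, anchor))
            self.doc_counts[relevant] += 1

    def score(self, url, anchor):
        """On-line stage: estimated probability that the link is relevant."""
        vocab = set(self.term_counts[True]) | set(self.term_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_post = {}
        for label in (True, False):
            total_terms = sum(self.term_counts[label].values())
            lp = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for tok in link_tokens(url, anchor):
                # The extra +1 reserves probability mass for unseen tokens.
                lp += math.log((self.term_counts[label][tok] + 1)
                               / (total_terms + len(vocab) + 1))
            log_post[label] = lp
        top = max(log_post.values())
        odds = {k: math.exp(v - top) for k, v in log_post.items()}
        return odds[True] / (odds[True] + odds[False])
```

A possible usage, with made-up training links: after `scorer.train([...])`, the value of `scorer.score(url, anchor)` would be used as the priority of that URL in the crawl frontier.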
  • 28. State-of-the-art: Machine learning-based adaptive algorithms. Learning on-the-fly from the context.
  • 29. State-of-the-art: Machine learning-based adaptive algorithms. Learning on-the-fly from the context (figure labels: "Bach", candidate URL). Aggarwal et al. (2000): 1st stage (off-line): crawling for dataset construction, human labeling (positive examples), Bayes learning strategy, content text-based features. 2nd stage (on-line): candidate URL scoring based on the text classifier + feature selection based on interest ratio (candidate URL: anchor).
  • 30. State-of-the-art: Machine learning-based adaptive algorithms. Learning on-the-fly from the context. Chakrabarti et al. (2002): 1st stage (off-line): crawling for dataset construction, human labeling (positive examples), content text-based features. 2nd stage (on-line): training from positive examples using fetched pages (more sophisticated features such as the DOM tree). 3rd stage (on-line): URL scoring based on the apprentice learner.
  • 31. State-of-the-art: Machine learning-based adaptive algorithms. Learning to skip off-topic pages (figure: Seed, Target).
  • 32. State-of-the-art: Machine learning-based adaptive algorithms. Learning to skip off-topic pages (figure: graph with Seed, Dud, and Target pages; nodes annotated with numeric scores between 0.1 and 0.8).
  • 33. State-of-the-art: Machine learning-based adaptive algorithms. Learning to skip off-topic pages: tunneling! Bergmark et al. (2002): 1st stage (off-line): crawling for dataset construction, human labeling (positive examples), content text-based features. 2nd stage (off-line): tunneling module construction, cutoff threshold learning based on nugget-dud paths. 3rd stage (on-line): apprentice tunneling learner, adaptive cutoff based on paths evaluated by using fetched pages.
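
The core idea of tunneling is to keep following a path through off-topic pages (duds) for a bounded number of steps, in the hope of reaching an on-topic page (nugget) behind them. The sketch below uses a fixed cutoff, whereas the slide describes a learned, adaptive one. The function name, the relevance threshold, and the precomputed `relevance` table are assumptions made for illustration.

```python
from collections import deque


def crawl_with_tunneling(graph, relevance, seed, threshold=0.5, cutoff=2):
    """Fetch pages, abandoning a path after more than `cutoff` duds in a row.

    graph maps page -> out-links; relevance maps page -> score in [0, 1].
    """
    # Each frontier entry carries the number of consecutive duds on the
    # path leading to it. A page keeps the count of the first path that
    # reaches it, which is a simplification.
    frontier = deque([(seed, 0)])
    seen = {seed}
    fetched = []
    while frontier:
        page, duds_before = frontier.popleft()
        fetched.append(page)
        # A nugget resets the counter; a dud extends the tunnel by one.
        duds = 0 if relevance[page] >= threshold else duds_before + 1
        if duds > cutoff:
            continue  # tunnel too long: do not expand this page
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                frontier.append((link, duds))
    return fetched
```

With `cutoff=0` this degenerates into a crawler that never expands an off-topic page, which is exactly the behavior that leaves targets behind duds unreachable.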
  • 34. State-of-the-art: Machine learning-based adaptive algorithms. Agents for path detection: ants. Gasparetti & Micarelli (2004): close in aim to ARACHNID (multi-agent, multi-seed). Back-and-forth trips to relevant resources generate pheromone trails. Shortest paths attract more ants.
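
The pheromone mechanism can be illustrated with a generic ant-colony update: trails evaporate over time, and each successful trip reinforces the links it used, more strongly when the path is short. This is a sketch of the general idea only. The update rule, the `evaporation` and `baseline` parameters, and both function names are assumptions and are not taken from the cited paper.

```python
def update_pheromone(pheromone, successful_paths, evaporation=0.1):
    """Evaporate existing trails, then reinforce links on successful paths.

    pheromone maps (source, destination) links to trail levels; each path
    is a list of pages ending at a relevant resource.
    """
    updated = {edge: level * (1.0 - evaporation)
               for edge, level in pheromone.items()}
    for path in successful_paths:
        hops = len(path) - 1
        if hops < 1:
            continue
        deposit = 1.0 / hops  # shorter paths leave a stronger trail
        for edge in zip(path, path[1:]):
            updated[edge] = updated.get(edge, 0.0) + deposit
    return updated


def link_probabilities(pheromone, page, links, baseline=0.1):
    """Probability that an ant on `page` follows each of its out-links."""
    # The baseline keeps unexplored links reachable.
    weights = [pheromone.get((page, link), 0.0) + baseline for link in links]
    total = sum(weights)
    return {link: w / total for link, w in zip(links, weights)}
```

Because the deposit is inversely proportional to path length, links on the shorter route accumulate more pheromone and are chosen more often by later ants.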
  • 35. State-of-the-art: Ontology-driven crawling strategies. Knowledge representation: ontologies. Relation types: sc: SubClassOf; dom: Domain; range: Range; i: InstanceOf; eq: Equivalent; sp: SubPropertyOf (figure: example ontology with the concepts Camp Nou, city, Barcelona, sports, stadiums, country, coastal_city, football, soccer, plays_in, Spain, national teams, Barcelona F.C.).
  • 36. State-of-the-art: Ontology-driven crawling strategies. Ontology-based match expansion. Ehrig & Maedche (2003): relevance scoring. 1st stage: concept matching (ontology + lexicon). 2nd stage: ontology-based expansion. 3rd stage: summarization.
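
Match expansion can be sketched as follows: concepts of interest receive full weight, concepts related to them in the ontology receive a decayed weight, and a page is scored by the weighted concept mentions it contains. This is a loose illustration rather than the cited method. The relations in `ONTOLOGY` are hypothetical (they reuse concept names from the example ontology, but the figure's actual edges are not recoverable here), and `decay`, `max_hops`, and the substring-based matching are assumptions.

```python
from collections import deque

# Hypothetical relations among concepts, treated as undirected.
ONTOLOGY = {
    "football": ["soccer", "national teams"],
    "soccer": ["football"],
    "national teams": ["football"],
    "stadiums": ["camp nou"],
    "camp nou": ["stadiums", "barcelona f.c."],
    "barcelona f.c.": ["camp nou"],
}


def concept_weights(ontology, focus, decay=0.5, max_hops=2):
    """Weight focus concepts 1.0 and related concepts decay**distance."""
    weights = {c: 1.0 for c in focus}
    queue = deque((c, 0) for c in focus)
    while queue:
        concept, hops = queue.popleft()
        if hops == max_hops:
            continue  # expansion stops at max_hops relations
        for neighbor in ontology.get(concept, []):
            if neighbor not in weights:
                weights[neighbor] = decay ** (hops + 1)
                queue.append((neighbor, hops + 1))
    return weights


def relevance(text, weights):
    """Sum of concept weights times the number of mentions in the text."""
    text = text.lower()
    return sum(w * text.count(concept) for concept, w in weights.items())
```

With `"barcelona f.c."` as the only focus concept, a page that mentions Camp Nou and stadiums still receives a nonzero score, although it never names the club itself.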
  • 37. State-of-the-art: Ontology-driven crawling strategies. Ontology-based learning strategy. Zheng et al. (2008): relevance scoring for fetched pages. 1st stage: concept matching (ontology + lexicon), concept distances, document scoring. 2nd stage: ANN training. 3rd stage (on-line): term-based URL scoring (ANN, anchor as input).
  • 38. State-of-the-art: More features for unvisited URL scoring. Feng et al. (2010): on-line PageRank + term scoring (anchor, surrounding text). Patel & Schmidt (2011): term scoring based on matching and document structure (structure of the current page).
  • 39. Conclusion: Challenges. Precision / recall trade-off. Benchmarking. Ontology IE for effective crawling. Unbiased seed identification. Efficiency issues (scalability, ...).