Web Crawling and Reinforcement Learning
Description of a "new" way to implement a crawler that avoids the bias of paid Google indexing services.

Web Crawling and Reinforcement Learning Presentation Transcript

  • 1. Web crawling and reinforcement learning In-depth study for the Soft Computing course Francesco Gadaleta
  • 2. What’s happening in the world? • "Search is the first thing people use on the Web now" - Doug Cutting, a founder and core project manager of Nutch • "For certain types of searches, search engines are very good. But I still see major failures, where they aren't delivering useful results. I think, at a deeper, almost political level, it's important that we as a global society have some transparency in search." • What are the algorithms involved? • What are the reasons why one site comes up over another one?
  • 3. • "If you consider one of the basic tasks of a search engine, it is to make a decision: this page is good or this page sucks" - Jimmy Wales, founder of Wikipedia • Computers are notoriously bad at making such judgments • "Dear Jimbo, you do not know the power of machine learning"
  • 4. • Google™ is the most powerful agency crawling the web • Billions and billions of pages crawled • PageRank-based search system • Wanna pay for some ranking points?
  • 5. Features • As soon as you compensate someone for a link (with cash, or a returned link exchanged for barter reasons only), you break the model. • It doesn't mean that all these links are bad, or evil; • It means that we can't evaluate their real merit. • We are slaves to this fact, and can't change it.
  • 6. What’s a spider? • Is that a movie? Or an animal? • It explores the web using a target-based search • It uses a bag-of-words model (or an ontology) for searching
  • 7. Google Page Ranking (1/2) • How does it work? • Rank(A) = (1 - d) + d · (Rank(T1)/C(T1) + Rank(T2)/C(T2) + ... + Rank(Ti)/C(Ti)) • C(Ti) is the number of outbound links from Ti • Rank(j) depends on the Rank(•) of the other pages
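The recurrence above can be sketched as a simple fixed-point iteration; the toy three-page graph and the damping factor d = 0.85 below are illustrative assumptions, not values from the slides.

```python
# Illustrative sketch of the PageRank recurrence Rank(A) = (1-d) + d * sum(Rank(T)/C(T)).
# The graph and d = 0.85 are hypothetical; every page is assumed to have
# at least one outbound link.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum Rank(T)/C(T) over every page T that links to p.
            inbound = sum(rank[t] / len(links[t]) for t in pages if p in links[t])
            new_rank[p] = (1 - d) + d * inbound
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)  # C has the most inbound links, so it ranks highest
```

With this (non-normalized) form of the recurrence, the ranks sum to the number of pages at the fixed point, which is a quick sanity check on convergence.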
  • 8. Google Page Ranking (2/2) Some real fuzzy rules made by Google™ • if Rank(A) is high: Rank(B) = Rank(B) + k • if Rank(A) is high: Weight(li) = Weight(li) + w • if Rank(A) is low: Weight(li) = Weight(li) (unchanged)
  • 9. Reinforced spidering: a classical problem • The “mouse and maze” scenario • States, actions and a reward function • state: position in the maze AND the positions of the pieces of cheese to be caught • action: move right, left, up, down • reward: ƒ = 1/d · ß (inversely proportional to the distance d to the cheese)
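The mouse-and-maze scenario can be sketched with tabular Q-learning; the one-dimensional corridor, the reward of 1 at the cheese cell, and the learning parameters below are all illustrative assumptions, not taken from the slides.

```python
import random

# Tabular Q-learning sketch of the mouse-and-maze scenario: a 5-cell corridor
# with cheese at the last cell. All parameters here are hypothetical.

random.seed(0)
N = 5                       # corridor cells 0..4; cheese at cell 4
ACTIONS = [+1, -1]          # move right, move left
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):        # training episodes
    s = 0
    while s != N - 1:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2 = min(max(s + a, 0), N - 1)          # clamp to the corridor
        r = 1.0 if s2 == N - 1 else 0.0         # reward only at the cheese
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# After training, the greedy policy moves right from every interior cell.
```

The same state/action/reward template carries over to the crawler in the next slides, with maze positions replaced by pages and moves replaced by followed links.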
  • 10. Reinforced spidering: a not-so-classic problem • State: current crawler position • Action: follow links from the current position • Reward: ƒ(q,d) calculated independently on every page • Probability: P(s,a) from a query-page similarity calculation (naive Bayes) OR/AND a posteriori from end-user selections
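As a simple stand-in for the naive-Bayes query-page similarity the slide mentions, a bag-of-words cosine similarity yields a reward in [0, 1]; the function name and the example strings below are hypothetical.

```python
import math
from collections import Counter

# Hedged sketch of a reward function f(q, d): bag-of-words cosine similarity
# between query q and page text d. This is a stand-in, not the naive-Bayes
# model the slides refer to.

def similarity(query, page):
    q, d = Counter(query.lower().split()), Counter(page.lower().split())
    dot = sum(q[w] * d[w] for w in set(q) & set(d))
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

reward = similarity("reinforcement learning crawler",
                    "a focused crawler driven by reinforcement learning")
```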
  • 11. Reinforced spidering: a not-so-classic problem Features • a web page is a formatted document (<h1>,<h2>,<h3>,<p>,<a>) • a web page belongs to a graph: whenever the agent finds relevant information, it receives a reward. Reinforcement learning is used to let the agent learn how to maximize rewards while surfing the web and searching for relevant information. • the reward is defined by a Relevance function measuring the relevance of page d with respect to query q
  • 12. • Given a query q, calculate the retrieval status value rsv_0(q,d) independently for each page d. These are the immediate rewards of the pages. • Then we have to propagate the rewards along hyperlinks (with value iteration, for example) through the graph: rsv_{t+1}(q,d) = rsv_0(q,d) + (∆ / |links(d)|) · ∑_{d' ∈ links(d)} rsv_t(q,d') where • ∆ is an inflation coefficient (how much neighboring pages influence the current document) • links(d) is the set of hyperlinks from d.
  • 13. rsv_{t+1}(q,d) = rsv_0(q,d) + (∆ / |links(d)|) · ∑_{d' ∈ links(d)} rsv_t(q,d') 1. The formula is applied repeatedly to each document in a subset of the collection 2. The subset contains the documents with a significant rsv_0 3. After convergence, pages that are n links away from page d make a contribution (reward) to d proportional to ∆^n times their rsv
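The value-iteration propagation described on the last two slides can be sketched as follows; the toy link graph, the initial scores rsv_0, and ∆ = 0.5 are illustrative assumptions.

```python
# Sketch of the propagation rsv_{t+1}(q,d) = rsv_0(q,d)
#   + (Delta / |links(d)|) * sum over d' in links(d) of rsv_t(q,d').
# Graph, initial scores, and Delta = 0.5 are hypothetical.

def propagate(rsv0, links, delta=0.5, iterations=50):
    """rsv0: immediate reward per page; links: page -> list of linked pages."""
    rsv = dict(rsv0)
    for _ in range(iterations):
        nxt = {}
        for d in rsv0:
            out = links.get(d, [])
            # A page inherits a discounted share of its neighbors' scores.
            spread = delta * sum(rsv[d2] for d2 in out) / len(out) if out else 0.0
            nxt[d] = rsv0[d] + spread
        rsv = nxt
    return rsv

# Page "c" links to the high-scoring page "a", so "c" picks up Delta * rsv(a).
scores = propagate({"a": 1.0, "b": 0.0, "c": 0.0},
                   {"a": ["b"], "b": [], "c": ["a"]})
```

After convergence, a page one link away from a page with score 1.0 ends up with a contribution of ∆ · 1.0, matching point 3 on the slide.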