Web Crawling and Reinforcement Learning


Description of a "new" way to implement a crawler that avoids the bias of paid Google indexing services

Published in: Technology, Design

  1. Web crawling and reinforcement learning. A seminar ("approfondimento") for the Soft Computing course. Francesco Gadaleta
  2. What’s happening in the world?
     • "Search is the first thing people use on the Web now" - Doug Cutting, a founder and core project manager of Nutch
     • "For certain types of searches, search engines are very good. But I still see major failures, where they aren't delivering useful results. At a deeper, almost political level, I think it's important that we as a global society have some transparency in search."
     • What are the algorithms involved?
     • What are the reasons why one site comes up over another one?
  3. • "If you consider one of the basic tasks of a search engine, it is to make a decision: this page is good, or this page sucks." - Jimmy Wales, founder of Wikipedia
     • Computers are notoriously bad at making such judgments
     • "Dear Jimbo, you do not know the power of machine learning"
  4. • Google™ is the most powerful agency crawling the web
     • Billions and billions of pages crawled
     • Search system based on PageRank
     • Wanna pay for some ranking points?
  5. Features
     • As soon as you compensate someone for a link (with cash, or a returned link exchanged for barter reasons only), you break the model.
     • It doesn't mean that all these links are bad, or evil;
     • it means that we can't evaluate their real merit.
     • We are slaves to this fact, and can't change it.
  6. What’s a spider?
     • Is that a movie? Or an animal?
     • It explores the web using a target-based search
     • Bag-of-words (or ontology) used for searching
  7. Google PageRank (1/2)
     • How does it work?
     • Rank(A) = (1 - d) + d · (Rank(T1)/C(T1) + Rank(T2)/C(T2) + ... + Rank(Ti)/C(Ti))
     • C(Ti) is the number of outbound links from Ti
     • Rank(A) depends on Rank(·) of the pages T1...Ti that link to A
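The formula above can be sketched as a small iterative computation. This is a toy illustration, not Google's implementation: the link graph, the damping factor d = 0.85, and the iteration count are all assumptions made for the example.

```python
# Hypothetical sketch of the PageRank formula from the slide:
#   Rank(A) = (1 - d) + d * sum(Rank(Ti) / C(Ti))  over pages Ti linking to A
# Graph and parameters are illustrative assumptions.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum Rank(T)/C(T) over all pages T that link to p
            incoming = sum(rank[t] / len(links[t]) for t in pages if p in links[t])
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Tiny hand-made graph: C receives two inbound links, B only one
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

Running this on the toy graph, page C (two inbound links) ends up ranked above page B (one inbound link), which is exactly the "rank depends on the rank of linking pages" behavior the slide describes.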
  8. Google PageRank (2/2)
     Some real fuzzy rules made by Google™
     • if Rank(A) is high: Rank(B) += k
     • if Rank(A) is high: Weight(li) += w
     • if Rank(A) is low: Weight(li) stays unchanged
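The rules above can be expressed as a few lines of code. The threshold for "high" and the boost amounts k and w are invented for illustration; the slide gives no concrete values.

```python
# Hedged sketch of the slide's fuzzy-style rules: a page with high rank
# boosts the rank of pages it links to and the weight of those links.
# HIGH, k and w are illustrative assumptions, not Google's real values.

HIGH = 1.0        # rank threshold for "high" (assumed)
k, w = 0.1, 0.05  # boost amounts (assumed)

def apply_rules(rank_a, rank_b, link_weight):
    """Apply the slide's rules for a link li from page A to page B."""
    if rank_a >= HIGH:
        rank_b += k          # if Rank(A) is high, boost Rank(B)
        link_weight += w     # ...and the weight of link li
    # if Rank(A) is low, Weight(li) stays unchanged
    return rank_b, link_weight
```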
  9. Reinforced spidering: a classical problem
     • The "mouse and maze" scenario
     • States, actions and reward function
     • state: position in the maze AND positions of the pieces of cheese to be caught
     • action: move right, left, up, down
     • reward: ƒ = ß/d (inversely proportional to the distance d from the cheese)
  10. Reinforced spidering: a not so classical problem
     • State: current crawler position
     • Action: follow links from the current position
     • Reward: ƒ(q,d), calculated independently for every page
     • Probability P(s,a): query-page similarity (naive Bayes) and/or a-posteriori from end-user selections
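One way to read the "query-page similarity (naive Bayes)" bullet is as a query likelihood score: P(q|d) computed as a product of per-word probabilities under an independence assumption, with Laplace smoothing. The pages, query and vocabulary size below are toy assumptions, not taken from the slides.

```python
# Naive-Bayes-style query likelihood: log P(q|d) = sum over query words
# of log P(w|d), with Laplace smoothing over the page's bag of words.
# Example texts and vocab_size are illustrative assumptions.
from collections import Counter
from math import log

def query_likelihood(query, page_text, vocab_size=1000):
    words = page_text.lower().split()
    counts = Counter(words)
    total = len(words)
    score = 0.0
    for w in query.lower().split():
        # Laplace-smoothed P(w|d); words assumed independent (naive Bayes)
        score += log((counts[w] + 1) / (total + vocab_size))
    return score

page_a = "reinforcement learning agents maximize reward by exploring"
page_b = "recipes for pasta and tomato sauce with basil"
q = "reinforcement learning reward"
```

For the query q, page_a scores higher than page_b, so a crawler using this score as P(s,a) would prefer following the link to page_a.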
  11. Reinforced spidering: a not so classical problem. Features
     • a web page is a formatted document (<h1>, <h2>, <h3>, <p>, <a>)
     • a web page belongs to a graph: whenever the agent finds relevant information, it receives a reward. Reinforcement learning lets the agent learn how to maximize rewards while surfing the web in search of relevant information.
     • the reward is defined by a relevance function measuring the relevance of page d w.r.t. query q
  12. • Given a query q, calculate the retrieval status value rsv0(q,d) independently for each page d. These are the immediate rewards of the pages.
     • Then we have to propagate the rewards along hyperlinks (with value iteration, for example) through the graph:
       rsv_{t+1}(q,d) = rsv_0(q,d) + (∆ / |links(d)|) · ∑_{d' ∈ links(d)} rsv_t(q,d')
       where
     • ∆ is an inflation coefficient (how much neighboring pages influence the current document)
     • links(d) is the set of hyperlinks from d.
  13. rsv_{t+1}(q,d) = rsv_0(q,d) + (∆ / |links(d)|) · ∑_{d' ∈ links(d)} rsv_t(q,d')
     1. The formula is applied repeatedly to each document in a subset of the collection
     2. The subset contains the documents with a significant rsv_0
     3. After convergence, pages that are n links away from page d make a contribution (reward) proportional to ∆^n times their rsv
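The propagation rule above can be sketched as a short fixed-point iteration. The link graph, the initial scores rsv_0, and ∆ = 0.5 below are toy assumptions chosen to make the ∆^n decay visible:

```python
# Sketch of the slides' reward-propagation rule:
#   rsv_{t+1}(q,d) = rsv_0(q,d) + (delta / |links(d)|) * sum of rsv_t(q,d')
#                    over d' in links(d)
# applied repeatedly until convergence. Graph and rsv_0 are toy assumptions.

def propagate_rsv(rsv0, links, delta=0.5, iterations=30):
    """rsv0: page -> immediate reward; links: page -> outgoing link targets."""
    rsv = dict(rsv0)
    for _ in range(iterations):
        new = {}
        for d in rsv0:
            out = links.get(d, [])
            spread = sum(rsv[d2] for d2 in out) / len(out) if out else 0.0
            new[d] = rsv0[d] + delta * spread
        rsv = new
    return rsv

# Chain c -> b -> a: only page a has an immediate reward
rsv0 = {"a": 1.0, "b": 0.0, "c": 0.0}
links = {"b": ["a"], "c": ["b"]}
scores = propagate_rsv(rsv0, links)
```

At convergence page b (one link away from a) scores ∆ · 1.0 = 0.5 and page c (two links away) scores ∆² · 1.0 = 0.25, matching point 3 above: a page n links away contributes proportionally to ∆^n times its rsv.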