  1. Do Not Crawl In The DUST: Different URLs Similar Text. Uri Schonfeld, Department of Electrical Engineering, Technion. Joint work with Dr. Ziv Bar-Yossef and Dr. Idit Keidar
  2. <ul><li>Problem statement and motivation </li></ul><ul><li>Related work </li></ul><ul><li>Our contribution </li></ul><ul><li>The DustBuster algorithm </li></ul><ul><li>Experimental results </li></ul><ul><li>Concluding remarks </li></ul>Talk Outline
  3. <ul><li>DUST – Different URLs Similar Text </li></ul><ul><li>Examples: </li></ul><ul><ul><li>Standard Canonization: </li></ul></ul><ul><ul><ul><li>“http://domain.name/index.html” → “http://domain.name” </li></ul></ul></ul><ul><ul><li>Domain names and virtual hosts </li></ul></ul><ul><ul><ul><li>“http://news.google.com” → “http://google.com/news” </li></ul></ul></ul><ul><ul><li>Aliases and symbolic links: </li></ul></ul><ul><ul><ul><li>“http://domain.name/~shuri” → “http://domain.name/people/shuri” </li></ul></ul></ul><ul><ul><li>Parameters with little effect on content </li></ul></ul><ul><ul><ul><li>Print=1 </li></ul></ul></ul><ul><ul><li>URL transformations: </li></ul></ul><ul><ul><ul><li>“http://domain.name/story_” → “http://domain.name/story?id=” </li></ul></ul></ul>Even the WWW Gets Dusty
  4. <ul><li>DUST rule: Transforms one URL to another </li></ul><ul><ul><li>Example: “ index.html ” → “” </li></ul></ul><ul><li>Valid DUST rule: </li></ul><ul><li>r is a valid DUST rule w.r.t. site S if for every URL u ∈ S, </li></ul><ul><ul><ul><li>r(u) is a valid URL </li></ul></ul></ul><ul><ul><ul><li>r(u) and u have “similar” contents </li></ul></ul></ul><ul><li>Why similar and not identical? </li></ul><ul><ul><li>Comments, news, text ads, counters </li></ul></ul>DUST Rules!
  5. <ul><li>Expensive to crawl </li></ul><ul><ul><li>Access the same document via multiple URLs </li></ul></ul><ul><li>Forces us to shingle </li></ul><ul><ul><li>An expensive technique used to discover similar documents </li></ul></ul><ul><li>Ranking algorithms suffer </li></ul><ul><ul><li>References to a document split among its aliases </li></ul></ul><ul><li>Multiple identical results </li></ul><ul><ul><li>The same document is returned several times in the search results </li></ul></ul><ul><li>Any algorithm based on URLs suffers </li></ul>DUST is Bad
  6. <ul><li>Given: a list of URLs from a site S </li></ul><ul><ul><li>Crawl log </li></ul></ul><ul><ul><li>Web server log </li></ul></ul><ul><li>Want: to find valid DUST rules w.r.t. S </li></ul><ul><ul><li>As many as possible </li></ul></ul><ul><ul><li>Including site-specific ones </li></ul></ul><ul><ul><li>Minimize number of fetches </li></ul></ul><ul><li>Applications: </li></ul><ul><ul><li>Site-specific canonization </li></ul></ul><ul><ul><li>More efficient crawling </li></ul></ul>We Want To
  7. <ul><li>Domain name aliases </li></ul><ul><li>Standard extensions </li></ul><ul><li>Default file names: index.html , default.htm </li></ul><ul><li>File path canonizations: “ dirname/../ ” → “”, “ // ” → “ / ” </li></ul><ul><li>Escape sequences: “ %7E ” → “ ~ ” </li></ul>How do we Fight DUST Today? (1) Standard Canonization
  8. <ul><li>Site-specific DUST: </li></ul><ul><ul><li>“ story_ ” → “ story?id= ” </li></ul></ul><ul><ul><li>“ news.google.com ” → “ google.com/news ” </li></ul></ul><ul><ul><li>“ labs ” → “ laboratories ” </li></ul></ul><ul><li>This DUST is harder to find </li></ul>Standard Canonization is not Enough
  9. <ul><li>Shingles are document sketches [Broder, Glassman, Manasse 97] </li></ul><ul><li>Used to compare documents for similarity </li></ul><ul><li>Pr(Shingles are equal) = Document similarity </li></ul><ul><li>Compare documents by comparing shingles </li></ul><ul><li>Calculate Shingle: </li></ul><ul><ul><li>Take all m-word sequences </li></ul></ul><ul><ul><li>Hash them with h i </li></ul></ul><ul><ul><li>Choose the min </li></ul></ul><ul><ul><li>That's your shingle </li></ul></ul>How do we Fight DUST Today? (2) Shingles
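The per-hash minimum construction on this slide can be sketched roughly as follows. The salted-MD5 hash family and the function names are illustrative only, not the implementation used in the paper:

```python
import hashlib

def shingle_sketch(text, m=4, num_hashes=8):
    """Min-hash sketch: for each hash function (here, a salted MD5),
    keep the minimum hash value over all m-word sequences of the text."""
    words = text.split()
    grams = [" ".join(words[i:i + m])
             for i in range(max(len(words) - m + 1, 1))]
    sketch = []
    for salt in range(num_hashes):
        hashes = (int(hashlib.md5((str(salt) + g).encode()).hexdigest(), 16)
                  for g in grams)
        sketch.append(min(hashes))
    return sketch

def estimated_similarity(a, b):
    """Fraction of agreeing sketch positions, estimating document similarity."""
    sa, sb = shingle_sketch(a), shingle_sketch(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Identical documents always produce identical sketches, while the fraction of agreeing positions estimates the similarity of differing ones.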
  10. <ul><li>Shingles are expensive: </li></ul><ul><ul><li>Require fetch </li></ul></ul><ul><ul><li>Parsing </li></ul></ul><ul><ul><li>Hashing </li></ul></ul><ul><li>Shingles do not find rules </li></ul><ul><li>Therefore, not applicable to new pages </li></ul>Shingles are Not Perfect
  11. <ul><li>Mirror detection </li></ul><ul><li>[Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01] </li></ul><ul><li>Identifying plagiarized documents [Hoad, Zobel 03] </li></ul><ul><li>Finding near-replicas </li></ul><ul><li>[Shivakumar, Garcia-Molina 98], </li></ul><ul><li>[Di Iorio, Diligenti, Gori, Maggini, Pucci 03] </li></ul><ul><li>Copy detection </li></ul><ul><li>[Brin, Davis, Garcia-Molina 95], [Garcia-Molina, Gravano, Shivakumar 96], [Shivakumar, Garcia-Molina 96] </li></ul>More Related Work
  12. <ul><li>An algorithm that </li></ul><ul><ul><li>finds site-specific valid DUST rules </li></ul></ul><ul><ul><li>requires a minimum number of fetches </li></ul></ul><ul><li>Convincing results in experiments </li></ul><ul><li>Benefits to crawling </li></ul>Our Contributions
  13. <ul><li>Alias DUST: simple substring substitutions </li></ul><ul><ul><li>“ story_1259 ” → “ story?id=1259 ” </li></ul></ul><ul><ul><li>“ news.google.com ” → “ google.com/news ” </li></ul></ul><ul><ul><li>“ /index.html ” → “” </li></ul></ul><ul><li>Parameter DUST: </li></ul><ul><ul><li>Standard URL structure: protocol://domain.name/path/name?para=val&pa=va </li></ul></ul><ul><ul><li>Some parameters do not affect content: </li></ul></ul><ul><ul><ul><li>Can be removed </li></ul></ul></ul><ul><ul><ul><li>Can be changed to a default value </li></ul></ul></ul>Types of DUST
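Both kinds of rule are simple URL transformations. A minimal sketch of applying them might look like this (`apply_alias_rule` and `remove_parameter` are illustrative names, not DustBuster's API):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def apply_alias_rule(url, alpha, beta):
    """Alias DUST rule alpha -> beta: substitute the first occurrence
    of substring alpha; None when the rule is not applicable."""
    return url.replace(alpha, beta, 1) if alpha in url else None

def remove_parameter(url, param):
    """Parameter DUST rule: drop a query parameter that does not
    affect the page's content."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `apply_alias_rule("http://domain.name/story_1259", "story_", "story?id=")` yields the canonical `story?id=1259` form, and `remove_parameter(..., "Print")` strips a content-irrelevant parameter.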
  14. <ul><li>Input: URL list </li></ul><ul><li>Detect likely DUST rules </li></ul><ul><li>Eliminate redundant rules </li></ul><ul><li>Validate DUST rules using samples: </li></ul><ul><ul><li>Eliminate DUST rules that are “wrong” </li></ul></ul><ul><ul><li>Further eliminate duplicate DUST rules </li></ul></ul>Our Basic Framework No Fetch Required
  15. <ul><li>Large support principle: </li></ul><ul><li>Likely DUST rules have lots of “evidence” supporting them </li></ul><ul><li>Small buckets principle: </li></ul><ul><li>Ignore evidence that supports many different rules </li></ul>How to detect likely DUST rules?
  16. Large Support Principle <ul><li>A pair of URLs (u,v) is an instance of rule r if: </li></ul><ul><ul><li>r(u) = v </li></ul></ul><ul><li>Support(r) = all instances (u,v) of r </li></ul>Large Support Principle The support of a valid DUST rule is large
  17. Rule Support: An Equivalent View <ul><li>α : a string </li></ul><ul><ul><li>Ex: α = “ story_ ” </li></ul></ul><ul><li>u : URL that contains α as a substring </li></ul><ul><ul><li>Ex: u = “ http://www.sitename.com/story_2659 ” </li></ul></ul><ul><li>Envelope of α in u : </li></ul><ul><ul><li>A pair of strings (p,s) </li></ul></ul><ul><ul><li>p : prefix of u preceding α </li></ul></ul><ul><ul><li>s : suffix of u succeeding α </li></ul></ul><ul><ul><li>Example: p = “ http://www.sitename.com/ ”, s = “ 2659 ” </li></ul></ul><ul><li>E( α ): all envelopes of α in URLs that appear in the input URL list </li></ul>
  18. Envelopes Example
  19. Rule Support: An Equivalent View <ul><li>α → β : an alias DUST rule </li></ul><ul><ul><li>Ex: α = “ story_ ”, β = “ story?id= ” </li></ul></ul><ul><li>Lemma : |Support( α → β )| = |E( α ) ∩ E( β )| </li></ul><ul><li>Proof : </li></ul><ul><ul><li>bucket(p,s) = { α | (p,s) ∈ E( α ) } </li></ul></ul><ul><ul><li>Observation: (u,v) is an instance of α → β if and only if u = p◦α◦s and v = p◦β◦s for some (p,s) </li></ul></ul><ul><ul><li>Hence, (u,v) is an instance of α → β iff (p,s) ∈ E( α ) ∩ E( β ) </li></ul></ul>
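The lemma gives a brute-force way to compute a rule's support without fetching a single page: collect every substring's envelope set and intersect. A rough sketch (the names `envelopes` and `support_size` are illustrative; the real algorithm tokenizes URLs and bounds substring length):

```python
from collections import defaultdict

def envelopes(urls, max_len=20):
    """Map each substring alpha (up to max_len characters) to its set
    of envelopes (p, s) over the input URL list."""
    env = defaultdict(set)
    for u in urls:
        for i in range(len(u)):
            for j in range(i + 1, min(i + max_len, len(u)) + 1):
                # u[i:j] is alpha; u[:i] and u[j:] are its envelope
                env[u[i:j]].add((u[:i], u[j:]))
    return env

def support_size(env, alpha, beta):
    """|Support(alpha -> beta)| = |E(alpha) ∩ E(beta)|, by the lemma."""
    return len(env[alpha] & env[beta])
```

On a toy log with two `story_N` / `story?id=N` pairs, the intersection of envelope sets recovers a support of 2 for the rule “story_” → “story?id=”.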
  20. Large Buckets <ul><li>Often there is a large set of substrings that are interchangeable within a given URL while not being DUST : </li></ul><ul><ul><li>page=1, page=2, … </li></ul></ul><ul><ul><li>lecture-1.pdf, lecture-2.pdf </li></ul></ul><ul><li>This gives rise to large buckets: </li></ul>
  21. <ul><li>Big Buckets: </li></ul><ul><ul><li>popular prefix/suffix pair </li></ul></ul><ul><ul><li>Often do not contain similar content </li></ul></ul><ul><ul><li>Big buckets are expensive to process </li></ul></ul>Small Buckets Principle I am a DUCK not a DUST Small Buckets Principle Most of the support of valid Alias DUST rules is likely to belong to small buckets
  22. <ul><li>Scan the log and form buckets </li></ul><ul><li>Ignore big buckets </li></ul><ul><li>For each small bucket: </li></ul><ul><ul><li>For every two substrings α , β in the bucket </li></ul></ul><ul><ul><ul><li>print ( α , β ) </li></ul></ul></ul><ul><li>Sort by ( α , β ) </li></ul><ul><li>For every pair ( α , β ): </li></ul><ul><ul><li>Count </li></ul></ul><ul><ul><li>If (Count > threshold) print α → β </li></ul></ul>Algorithm – Detecting Likely DUST Rules No Fetch here!
  23. Size and Comments <ul><li>Consider only instances of rules whose recorded sizes “match” </li></ul><ul><li>Use ranges of sizes </li></ul><ul><li>Running time: O(L log L), where L is the size of the URL list </li></ul><ul><li>Process only short substrings </li></ul><ul><li>Tokenize URLs </li></ul>
  24. <ul><li>Input: URL list </li></ul><ul><li>Detect likely DUST rules </li></ul><ul><li>Eliminate redundant rules </li></ul><ul><li>Validate DUST rules using samples: </li></ul><ul><ul><li>Eliminate DUST rules that are “wrong” </li></ul></ul><ul><ul><li>Further eliminate duplicate DUST rules </li></ul></ul>Our Basic Framework No Fetch Required
  25. Eliminating Redundant Rules <ul><ul><li>“ /vlsi/ ” → “ /labs/vlsi/ ” </li></ul></ul><ul><ul><li>“ /vlsi ” → “ /labs/vlsi ” </li></ul></ul><ul><ul><li>Lemma: </li></ul></ul><ul><ul><li>A substitution rule α’ → β’ refines rule α → β if and only if there exists an envelope ( γ , δ ) such that α’ = γ◦α◦δ and β’ = γ◦β◦δ </li></ul></ul><ul><li>The lemma helps us identify refinements easily </li></ul><ul><li>If rule φ refines rule ψ then SUPPORT( φ ) ⊆ SUPPORT( ψ ) </li></ul><ul><li>φ refines ψ ? Remove ψ if their supports match </li></ul>No Fetch here!
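The lemma's syntactic test can be coded directly; the sketch below (a hypothetical helper, not DustBuster's code) checks every occurrence of the shorter rule's strings inside the longer one's:

```python
def refines(rule1, rule2):
    """rule1 = (a1, b1) refines rule2 = (a2, b2) iff some envelope (g, d)
    satisfies a1 == g + a2 + d and b1 == g + b2 + d (the slide's lemma)."""
    (a1, b1), (a2, b2) = rule1, rule2
    start = a1.find(a2)
    while start != -1:
        g, d = a1[:start], a1[start + len(a2):]
        if b1 == g + b2 + d:
            return True
        start = a1.find(a2, start + 1)  # try the next occurrence of a2
    return False
```

On the slide's example, “/vlsi/” → “/labs/vlsi/” refines “/vlsi” → “/labs/vlsi” via the envelope (γ, δ) = (“”, “/”), but not vice versa.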
  26. Validating Likely Rules <ul><li>For each likely rule r, for both directions </li></ul><ul><ul><li>Find sample URLs from the list to which r is applicable </li></ul></ul><ul><ul><li>For each URL u in the sample: </li></ul></ul><ul><ul><ul><li>v = r(u) </li></ul></ul></ul><ul><ul><ul><li>Fetch u and v </li></ul></ul></ul><ul><ul><ul><li>Check if content(u) is similar to content(v) </li></ul></ul></ul><ul><ul><li>If the fraction of similar pairs > threshold: </li></ul></ul><ul><ul><ul><li>Declare rule r valid </li></ul></ul></ul>
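This validation loop might be sketched as below, with hypothetical `fetch` and `similar` callables standing in for an HTTP fetch and a shingle comparison:

```python
def validate_rule(rule, sample_urls, fetch, similar, threshold=0.9):
    """Declare the rule valid if the fraction of similar (u, r(u))
    pairs among the fetchable samples reaches the threshold."""
    alpha, beta = rule
    positive = negative = 0
    for u in sample_urls:
        v = u.replace(alpha, beta, 1)   # apply the alias rule
        doc_u = fetch(u)
        if doc_u is None:               # u itself unreachable: skip sample
            continue
        doc_v = fetch(v)
        if doc_v is not None and similar(doc_u, doc_v):
            positive += 1
        else:                           # missing or dissimilar r(u) fails
            negative += 1
    total = positive + negative
    return total > 0 and positive / total >= threshold
```

A dictionary of page contents and an equality check are enough to exercise it: a correct rule maps every sampled URL to an existing, similar page, while a wrong one produces fetch failures.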
  27. <ul><li>Assumption: </li></ul><ul><ul><li>If a rule passes the validation threshold on a sample of 100 URLs, it will pass on any larger sample </li></ul></ul><ul><li>Why isn’t the threshold 100%? </li></ul><ul><ul><li>A 95% valid rule may still be worth it </li></ul></ul><ul><ul><li>Dynamic pages change often </li></ul></ul>Comments About Validation
  28. <ul><li>We experiment on logs of two web sites: </li></ul><ul><ul><li>Dynamic Forum </li></ul></ul><ul><ul><li>Academic Site </li></ul></ul><ul><li>Rules detected from a log of about 20,000 unique URLs </li></ul><ul><li>On each site we used four logs from different time periods </li></ul>Experimental Setup
  29. Precision at k
  30. Precision vs. Validation
  31. <ul><li>How much of the DUST do we find? </li></ul><ul><li>What other duplicates are there? </li></ul><ul><ul><li>Soft errors </li></ul></ul><ul><ul><li>True copies: </li></ul></ul><ul><ul><ul><li>Last semester’s course </li></ul></ul></ul><ul><ul><ul><li>All authors of a paper </li></ul></ul></ul><ul><ul><li>Frames </li></ul></ul><ul><ul><li>Image galleries </li></ul></ul>Recall
  32. <ul><li>In the crawl we examined, the crawl was reduced by 18%. </li></ul><ul><li>DUST Distribution: </li></ul><ul><ul><li>DUST: 47.1% </li></ul></ul><ul><ul><li>Images: 25.7% </li></ul></ul><ul><ul><li>Soft Errors: 7.6% </li></ul></ul><ul><ul><li>Exact Copies: 17.9% </li></ul></ul><ul><ul><li>Misc: 1.8% </li></ul></ul>
  33. <ul><li>DustBuster is an efficient algorithm </li></ul><ul><li>Finds DUST rules </li></ul><ul><li>Can reduce a crawl </li></ul><ul><li>Can benefit ranking algorithms </li></ul>Conclusions
  34. THE END
  35. <ul><li>= => --> </li></ul><ul><li>all rules with “” </li></ul><ul><li>Fix drawing urls crossing alpha not all p and all s </li></ul>Things to fix
  36. <ul><li>So far, rules are non-directional </li></ul><ul><li>Prefer shrinking rules </li></ul><ul><li>Prefer lexicographically lowering rules </li></ul><ul><li>Check those directions first </li></ul>
  37. <ul><li>Parameter name and possible values </li></ul><ul><li>What rules: </li></ul><ul><ul><li>Remove the parameter </li></ul></ul><ul><ul><li>Substitute one value with another </li></ul></ul><ul><ul><li>Substitute all values with a single value </li></ul></ul><ul><li>Rules are validated the same way the alias rules are </li></ul><ul><li>Will not discuss further </li></ul>Parametric DUST
  38. <ul><li>Unfortunately we see a lot of “wrong” rules </li></ul><ul><li>Substitute 1 with 2 </li></ul><ul><li>Just wrong: </li></ul><ul><ul><li>One domain name substituted with another running similar software </li></ul></ul><ul><li>Examples of false rules: </li></ul><ul><ul><li>/YoninaEldar/ != /DavidMalah/ </li></ul></ul><ul><ul><li>/labs/vlsi/oldsite != /labs/vlsi </li></ul></ul><ul><ul><li>-2. != -3. </li></ul></ul>False Rules
  39. Filtering out False Rules <ul><li>Getting rid of the big buckets </li></ul><ul><li>Using the size field: </li></ul><ul><ul><li>False DUST rules: </li></ul></ul><ul><ul><ul><li>May give valid URLs </li></ul></ul></ul><ul><ul><ul><li>But content is not similar </li></ul></ul></ul><ul><ul><ul><li>Size is probably different </li></ul></ul></ul><ul><ul><ul><li>Size ranges are used </li></ul></ul></ul><ul><li>Tokenization helps </li></ul>
  40. DustBuster – Cleaning up the Rules <ul><li>Go over the list with a window </li></ul><ul><li>If </li></ul><ul><ul><li>Rule a refines rule b </li></ul></ul><ul><ul><li>Their support sizes are close </li></ul></ul><ul><li>Leave only rule a </li></ul>
  41. DustBuster – Validation <ul><li>Validation per rule </li></ul><ul><ul><li>Get sample URLs </li></ul></ul><ul><ul><li>URLs to which the rule can be applied </li></ul></ul><ul><ul><li>Apply rule: URL => transformed URL </li></ul></ul><ul><ul><li>Get content </li></ul></ul><ul><ul><li>Compare using shingles </li></ul></ul>
  42. DustBuster – Validation <ul><li>Stop fetching when: </li></ul><ul><ul><li>#failures > 100 * (1 - threshold) </li></ul></ul><ul><li>A page that doesn’t exist is not similar to anything else </li></ul><ul><li>Why use a threshold < 100%? </li></ul><ul><ul><li>Shingles are not perfect </li></ul></ul><ul><ul><li>Dynamic pages may change quickly </li></ul></ul>
  43. Detect Alias DUST – Take 2 <ul><li>Tokenize, of course </li></ul><ul><li>Form buckets </li></ul><ul><li>Ignore big buckets </li></ul><ul><li>Count support only if sizes match </li></ul><ul><li>Don’t count long substrings </li></ul><ul><li>Results are cleaner </li></ul>
  44. Eliminate Redundancies <ul><li>1: EliminateRedundancies ( pairs_list R ) </li></ul><ul><li>2: for i = 1 to |R| do </li></ul><ul><li>3: if ( already eliminated R [ i ] ) continue </li></ul><ul><li>4 : to_eliminate_current := false </li></ul><ul><li>/* Go over a window */ </li></ul><ul><li>5 : for j = 1 to min(MW, |R| - i ) do </li></ul><ul><li>/* Support not close? Stop checking */ </li></ul><ul><li>6 : if ( R [ i ]. size - R [ i + j ]. size > max ( MRD*R [ i ]. size, MAD )) break </li></ul><ul><li>/* a refines b? remove b */ </li></ul><ul><li>7 : if ( R [ i ] refines R [ i + j ]) </li></ul><ul><li>8 : eliminate R [ i + j ] </li></ul><ul><li>9 : else if ( R [ i + j ] refines R [ i ]) then </li></ul><ul><li>10 : to_eliminate_current := true </li></ul><ul><li>11 : break </li></ul><ul><li>12 : if ( to_eliminate_current ) </li></ul><ul><li>13 : eliminate R [ i ] </li></ul><ul><li>14 : return R </li></ul>No Fetch here!
  45. Validate a Single Rule <ul><li>1 : ValidateRule ( R, L ) </li></ul><ul><li>2 : positive := 0 </li></ul><ul><li>3 : negative := 0 </li></ul><ul><li>/* Stop when you are sure you either succeeded or failed */ </li></ul><ul><li>4 : while ( positive < ( 1 - ε ) N AND negative < ε N ) do </li></ul><ul><li>5 : u := a random URL from L to which R is applicable </li></ul><ul><li>6 : v := outcome of application of R to u </li></ul><ul><li>7 : fetch u and v </li></ul><ul><li>8 : if ( fetch u failed ) continue </li></ul><ul><li>/* Something went wrong, negative sample */ </li></ul><ul><li>9 : if ( fetch v failed ) OR ( shingling ( u ) ≠ shingling ( v )) </li></ul><ul><li>10 : negative := negative + 1 </li></ul><ul><li>/* Another positive sample */ </li></ul><ul><li>11 : else </li></ul><ul><li>12 : positive := positive + 1 </li></ul><ul><li>13 : if ( negative ≥ ε N ) </li></ul><ul><li>14 : return FALSE </li></ul><ul><li>15 : return TRUE </li></ul>
  46. Validate Rules <ul><li>1 : Validate ( rules_list R, test_log L ) </li></ul><ul><li>2 : create list of rules LR </li></ul><ul><li>3 : for i = 1 to |R| do </li></ul><ul><li>/* Go over rules that survived = valid rules */ </li></ul><ul><li>4 : for j = 1 to i - 1 do </li></ul><ul><li>5 : if ( R [ j ] was not eliminated AND R [ i ] refines R [ j ]) </li></ul><ul><li>6 : eliminate R [ i ] from the list </li></ul><ul><li>7 : break </li></ul><ul><li>8 : if ( R [ i ] was eliminated ) </li></ul><ul><li>9 : continue </li></ul><ul><li>/* Test one direction */ </li></ul><ul><li>10 : if ( ValidateRule ( R [ i ]. alpha → R [ i ]. beta, L )) </li></ul><ul><li>11 : add R [ i ]. alpha → R [ i ]. beta to LR </li></ul><ul><li>/* Test the other direction only if the first direction failed */ </li></ul><ul><li>12 : else if ( ValidateRule ( R [ i ]. beta → R [ i ]. alpha, L )) </li></ul><ul><li>13 : add R [ i ]. beta → R [ i ]. alpha to LR </li></ul><ul><li>14 : else </li></ul><ul><li>15 : eliminate R [ i ] from the list </li></ul><ul><li>16 : return LR </li></ul>