Information organization
  • 1. SIMS 202: Information Organization and Retrieval
     Prof. Marti Hearst and Prof. Ray Larson, UC Berkeley SIMS
     Tues/Thurs 9:30–11:00am, Fall 2000
  • 2. Last Time: Web Search
     – Directories vs. search engines
     – How web search differs from other search
       » Type of data searched over
       » Type of searches done
       » Type of searchers doing the search
     – Web queries are short
       » This probably means people often use search engines to find starting points
       » Once at a useful site, they must follow links or use site search
     – Web search ranking combines many features
  • 3. What about Ranking?
     Lots of variation here
     – Pretty messy in many cases
     – Details usually proprietary and fluctuating
     Combining subsets of:
     – Term frequencies
     – Term proximities
     – Term position (title, top of page, etc.)
     – Term characteristics (boldface, capitalized, etc.)
     – Link analysis information
     – Category information
     – Popularity information
     Most use a variant of vector space ranking to combine these.
     Here’s how it might work:
     – Make a vector of weights for each feature
     – Multiply this by the counts for each feature
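The weight-vector idea on this slide can be sketched as a dot product of a weight vector with per-document feature counts. The feature names and weight values below are purely illustrative, not from any real engine:

```python
# Hypothetical feature weights; real engines keep these proprietary.
FEATURE_WEIGHTS = {
    "term_frequency": 1.0,   # raw term counts
    "term_in_title": 3.0,    # term position feature
    "term_in_bold": 1.5,     # term characteristic feature
    "inlink_count": 2.0,     # link analysis feature
}

def score(doc_features: dict) -> float:
    """Dot product of the weight vector with the document's feature counts."""
    return sum(FEATURE_WEIGHTS[f] * doc_features.get(f, 0)
               for f in FEATURE_WEIGHTS)

doc = {"term_frequency": 4, "term_in_title": 1, "inlink_count": 2}
print(score(doc))  # 4*1.0 + 1*3.0 + 2*2.0 = 11.0
```

Ranking then means sorting documents by this score; tuning the weights is where the "messy, proprietary, fluctuating" part comes in.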
  • 4. From a description of the Northern Light search engine, by Mark Krellenstein
  • 5. High-Precision Ranking
     Proximity search can help get high-precision results if the query has more than one term
     – Hearst ’96 paper:
       » Combines Boolean and passage-level proximity constraints
       » Shows significant improvements when retrieving the top 5, 10, 20, or 30 documents
       » Results reproduced by Mitra et al. ’98
       » Google uses something similar
  • 6. Boolean Formulations, Hearst ’96: Results
  • 7. Spam
     Email spam:
     – Undesired content
     Web spam:
     – Content disguised as something it is not, in order to
       » Be retrieved more often than it otherwise would be
       » Be retrieved in contexts where it otherwise would not be retrieved
  • 8. Web Spam
     What are the types of web spam?
     – Add extra terms to get a higher ranking
       » Repeat “cars” thousands of times
     – Add irrelevant terms to get more hits
       » Put a dictionary in the comments field
       » Put extra terms in the same color as the page background
     – Add irrelevant terms to get different types of hits
       » Put “sex” in the title field of sites that are selling cars
     – Add irrelevant links to boost your link analysis ranking
     There is a constant “arms race” between web search companies and spammers.
  • 9. Commercial Issues
     General internet search is often commercially driven
     – The commercial sector sometimes hides things – harder to track than research
     – On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
     – Commercial search engine information changes monthly
     – Sometimes motivations are commercial rather than technical
       » uses payments to determine ranking order
       » gives out prizes
  • 10. Web Search Architecture
  • 11. Web Search Architecture
     Preprocessing
     – Collection gathering phase
       » Web crawling
     – Collection indexing phase
     Online
     – Query servers
     – This part is not talked about in the readings
  • 12. From a description of the FAST search engine, by Knut Risvik
  • 13. Standard Web Search Engine Architecture
     [Diagram: crawl the web → check for duplicates, store the documents → DocIds → create an inverted index → inverted index on index servers; a user query goes to the search engine servers, which show results to the user]
  • 14. More detailed architecture, from Brin & Page ’98. Only covers the preprocessing in detail, not the query serving.
  • 15. Inverted Indexes for Web Search Engines
     Inverted indexes are still used, even though the web is so huge
     Some systems partition the indexes across different machines; each machine handles a different part of the data
     Other systems duplicate the data across many machines; queries are distributed among the machines
     Most do a combination of these
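A minimal sketch of the two ideas on this slide: an inverted index mapping each term to the documents containing it, with the document collection partitioned across machines (here, plain dictionaries standing in for machines). The sharding-by-id scheme is an illustrative assumption, not how any particular engine does it:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def partition(docs, n_machines):
    """Document partitioning: shard docs by id; each shard indexes only its docs."""
    shards = [dict() for _ in range(n_machines)]
    for doc_id, text in docs.items():
        shards[doc_id % n_machines][doc_id] = text
    return [build_index(s) for s in shards]

docs = {0: "web search engines", 1: "inverted index", 2: "web crawling"}
shards = partition(docs, 2)
# A query for "web" is sent to every shard and the postings are merged:
hits = sorted(d for shard in shards for d in shard.get("web", []))
print(hits)  # [0, 2]
```

Duplicating (rather than partitioning) would instead copy the whole index to every machine and spread the incoming queries across the copies.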
  • 16. In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
     Each row can handle 120 queries per second
     Each column can handle 7M pages
     To handle more queries, add another row.
     From a description of the FAST search engine, by Knut Risvik
  • 17. Cascading Allocation of CPUs
     A variation on this that produces a cost savings:
     – Put high-quality/common pages on many machines
     – Put lower-quality/less common pages on fewer machines
     – A query goes to the high-quality machines first
     – If no hits are found there, go to the other machines
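The cascade above can be sketched as tiers tried in order of quality, stopping at the first tier that returns hits. The tier contents and term-to-ids dictionaries are invented for illustration:

```python
def cascaded_search(query, tiers):
    """tiers: list of indexes (term -> doc ids), best-quality tier first."""
    for tier in tiers:
        hits = tier.get(query, [])
        if hits:            # stop as soon as some tier answers
            return hits
    return []               # no tier had the term

# Hypothetical tiers: common pages on the big cluster, rare pages on a small one.
tier_hi = {"cars": ["hq-1", "hq-2"]}
tier_lo = {"cars": ["lq-9"], "rare-term": ["lq-3"]}

print(cascaded_search("cars", [tier_hi, tier_lo]))       # ['hq-1', 'hq-2']
print(cascaded_search("rare-term", [tier_hi, tier_lo]))  # ['lq-3']
```

The savings come from most queries being answered by the first tier, so the lower tiers need far fewer machines.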
  • 18. Web Crawlers
     How do the web search engines get all of the items they index?
     Main idea:
     – Start with known sites
     – Record information for these sites
     – Follow the links from each site
     – Record information found at new sites
     – Repeat
  • 19. Web Crawlers
     How do the web search engines get all of the items they index?
     More precisely:
     – Put a set of known sites on a queue
     – Repeat the following until the queue is empty:
       » Take the first page off of the queue
       » If this page has not yet been processed:
         – Record the information found on this page (positions of words, links going out, etc.)
         – Add each link on the current page to the queue
         – Record that this page has been processed
     In what order should the links be followed?
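The queue-based procedure above can be sketched directly; using a FIFO queue makes the traversal breadth-first. `fetch_links` is a hypothetical stand-in for fetching a page and extracting its outgoing links, and the toy link graph replaces the live web:

```python
from collections import deque

def crawl(seed_urls, fetch_links):
    """Breadth-first crawl from the seeds; returns the set of processed pages."""
    queue = deque(seed_urls)       # put a set of known sites on a queue
    processed = set()
    while queue:
        url = queue.popleft()      # take the first page off of the queue
        if url in processed:       # skip pages already processed
            continue
        processed.add(url)         # record that this page has been processed
        queue.extend(fetch_links(url))  # add each link on the page to the queue
    return processed

# Toy link graph standing in for the web; "d" is unreachable from "a".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
print(sorted(crawl(["a"], lambda u: graph.get(u, []))))  # ['a', 'b', 'c']
```

Swapping the `deque` for a stack (pop from the same end you push) would turn the same loop into a depth-first crawl, which is the question the slide ends on.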
  • 20. Page Visit Order
     Animated examples of breadth-first vs. depth-first search on trees: structure to be traversed
  • 21. Page Visit Order
     Animated examples of breadth-first vs. depth-first search on trees: breadth-first search (must be in presentation mode to see this animation)
  • 22. Page Visit Order
     Animated examples of breadth-first vs. depth-first search on trees: depth-first search (must be in presentation mode to see this animation)
  • 23. Page Visit Order
     Animated examples of breadth-first vs. depth-first search on trees
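Since the animations are lost outside presentation mode, the two visit orders can be shown on a small example tree instead. The tree and node names are invented for illustration:

```python
from collections import deque

# A small tree: A has children B and C; B has D and E; C has F.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
        "D": [], "E": [], "F": []}

def bfs(root):
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()   # FIFO: visit level by level
        order.append(node)
        queue.extend(tree[node])
    return order

def dfs(root):
    order, stack = [], [root]
    while stack:
        node = stack.pop()       # LIFO: dive down one branch first
        order.append(node)
        stack.extend(reversed(tree[node]))  # keep left-to-right child order
    return order

print(bfs("A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(dfs("A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
```

Breadth-first visits every page one link from the seed before any page two links away; depth-first follows one chain of links as far as it goes before backtracking.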
  • 24. Depth-First Crawling (more complex – graphs & sites)
     [Diagram: depth-first visit order over a link graph spanning several sites; pages are numbered in the order visited]
  • 25. Breadth-First Crawling (more complex – graphs & sites)
     [Diagram: breadth-first visit order over the same multi-site link graph]
  • 26. Web Crawling Issues
     Keep-out signs
     – A file called robots.txt tells the crawler which directories are off limits
     Freshness
     – Figure out which pages change often
     – Recrawl these often
     Duplicates, virtual hosts, etc.
     – Convert page contents with a hash function
     – Compare new pages to the hash table
     Lots of problems
     – Server unavailable
     – Incorrect HTML
     – Missing links
     – Infinite loops
     Web crawling is difficult to do robustly!
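The hash-based duplicate detection described above can be sketched as follows; the choice of SHA-1 and the example URLs are illustrative assumptions, and real crawlers also use fuzzier fingerprints to catch near-duplicates:

```python
import hashlib

def page_hash(content: str) -> str:
    """Hash the page contents so exact duplicates collapse to one key."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()

seen = {}  # hash -> first URL that had this content

def is_duplicate(url: str, content: str) -> bool:
    """Compare the new page's hash against the table of pages already seen."""
    h = page_hash(content)
    if h in seen:
        return True        # same bytes as an earlier page (e.g. a mirror)
    seen[h] = url
    return False

print(is_duplicate("http://a.example/", "same page body"))  # False
print(is_duplicate("http://b.example/", "same page body"))  # True
```

The same table also guards against one class of infinite loop: a page reached again under a different virtual-host name hashes to the same key and is skipped.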
  • 27. Cha-Cha
     Cha-Cha searches an intranet
     – Sites associated with an organization
     Instead of hand-edited categories, it
     – Computes the shortest path from the root for each hit
     – Organizes search results according to which subdomain the pages are found in
  • 28. Cha-Cha Web Crawling Algorithm
     Start with a list of servers to crawl
     – for UCB, simply start with www.berkeley.edu
     Restrict the crawl to certain domain(s)
     – *.berkeley.edu
     Obey the No Robots standard
     Follow hyperlinks only
     – do not read local filesystems
       » links are placed on a queue
       » traversal is breadth-first
     See the first lecture or the technical papers for more information
  • 29. Summary
     Web search differs from traditional IR systems
     – Different kind of collection
     – Different kinds of users/queries
     – Different economic motivations
     Ranking combines many features in a difficult-to-specify manner
     – Link analysis and proximity of terms seem especially important
     – This is in contrast to the term-frequency orientation of standard search
       » Why?
  • 30. Summary (cont.)
     Web search engine architecture
     – Similar in many ways to standard IR
     – Indexes usually duplicated across machines to handle many queries quickly
     Web crawling
     – Used to create the collection
     – Can be guided by quality metrics
     – Is very difficult to do robustly
  • 31. Web Search Statistics
  • 32. Searches per Day
     Info missing for Excite, Northern Light, etc.
     Information from
  • 33. Web Search Engine Visits
     Information from
  • 34. Percentage of web users who visit the site shown
     Information from
  • 35. Search Engine Size (July 2000)
     Information from
  • 36. Does size matter? You can’t access many hits anyhow.
     Information from
  • 37. Increasing numbers of indexed pages, self-reported
     Information from
  • 38. Increasing numbers of indexed pages (more recent), self-reported
     Information from
  • 39. Web Coverage
     Information from
  • 40. From a description of the FAST search engine, by Knut Risvik
  • 41. Directory Sizes
     Information from