Information organization



  1. SIMS 202: Information Organization and Retrieval
     Prof. Marti Hearst and Prof. Ray Larson
     UC Berkeley SIMS, Tues/Thurs 9:30-11:00am, Fall 2000
  2. Last Time: Web Search
     – Directories vs. search engines
     – How web search differs from other search
       » Type of data searched over
       » Type of searches done
       » Type of searchers doing the search
     – Web queries are short
       » This probably means people are often using search engines to find starting points
       » Once at a useful site, they must follow links or use site search
     – Web search ranking combines many features
  3. What about Ranking?
     Lots of variation here
     – Pretty messy in many cases
     – Details usually proprietary and fluctuating
     Combining subsets of:
     – Term frequencies
     – Term proximities
     – Term position (title, top of page, etc.)
     – Term characteristics (boldface, capitalized, etc.)
     – Link analysis information
     – Category information
     – Popularity information
     Most use a variant of vector space ranking to combine these
     Here's how it might work:
     – Make a vector of weights for each feature
     – Multiply this by the counts for each feature
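The weight-times-counts idea on this slide can be sketched as a simple dot product. The feature names and weight values below are made up for illustration; real engines' formulas are proprietary and fluctuating, as the slide says.

```python
# A minimal sketch of combining ranking features with a weight vector.
# Features and weights are hypothetical, chosen only to illustrate the idea.

WEIGHTS = {
    "term_frequency": 1.0,
    "term_in_title": 5.0,
    "term_in_bold": 2.0,
    "inlink_count": 3.0,
}

def score(page_features: dict) -> float:
    """Multiply each feature count by its weight and sum the products."""
    return sum(WEIGHTS[f] * page_features.get(f, 0) for f in WEIGHTS)

page_a = {"term_frequency": 12, "term_in_title": 1, "inlink_count": 4}
page_b = {"term_frequency": 30, "term_in_bold": 2}

# Rank pages by combined score, highest first.
ranked = sorted([("a", score(page_a)), ("b", score(page_b))],
                key=lambda t: t[1], reverse=True)
print(ranked)
```

Note how the title match and inlinks let page A stay competitive with page B's much higher raw term frequency; tuning the weight vector shifts that balance.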
  4. From description of the Northern Light search engine, by Mark Krellenstein
  5. High-Precision Ranking
     Proximity search can help get high-precision results if > 1 term
     – Hearst '96 paper:
       » Combine Boolean and passage-level proximity
       » Shows significant improvements when retrieving top 5, 10, 20, 30 documents
       » Results reproduced by Mitra et al. '98
       » Google uses something similar
  6. Boolean Formulations, Hearst '96: Results
  7. Spam
     Email spam:
     – Undesired content
     Web spam:
     – Content is disguised as something it is not, in order to
       » Be retrieved more often than it otherwise would
       » Be retrieved in contexts in which it otherwise would not be retrieved
  8. Web Spam
     What are the types of Web spam?
     – Add extra terms to get a higher ranking
       » Repeat "cars" thousands of times
     – Add irrelevant terms to get more hits
       » Put a dictionary in the comments field
       » Put extra terms in the same color as the background of the web page
     – Add irrelevant terms to get different types of hits
       » Put "sex" in the title field of sites that are selling cars
     – Add irrelevant links to boost your link analysis ranking
     There is a constant "arms race" between web search companies and spammers
  9. Commercial Issues
     General internet search is often commercially driven
     – Commercial sector sometimes hides things; harder to track than research
     – On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
     – Commercial search engine information changes monthly
     – Sometimes motivations are commercial rather than technical
       » uses payments to determine ranking order
       » gives out prizes
  10. Web Search Architecture
  11. Web Search Architecture
     Preprocessing
     – Collection gathering phase
       » Web crawling
     – Collection indexing phase
     Online
     – Query servers
     – This part is not discussed in the readings
  12. From description of the FAST search engine, by Knut Risvik
  13. Standard Web Search Engine Architecture
      [diagram: crawl the web and check for duplicates; store the documents and their DocIds; create an inverted index; a user query goes to the search engine servers, which consult the inverted index and show results to the user]
  14. More detailed architecture, from Brin & Page '98.
      Only covers the preprocessing in detail, not the query serving.
  15. Inverted Indexes for Web Search Engines
      Inverted indexes are still used, even though the web is so huge
      Some systems partition the indexes across different machines; each machine handles different parts of the data
      Other systems duplicate the data across many machines; queries are distributed among the machines
      Most do a combination of these
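The core structure named on this slide can be sketched in a few lines: for each term, the index stores the set of document IDs containing it, and a Boolean AND query intersects those postings. The three documents here are hypothetical.

```python
# A minimal sketch of an inverted index over a toy collection.
from collections import defaultdict

docs = {
    1: "web search engines use inverted indexes",
    2: "crawlers gather pages for the search engine",
    3: "inverted indexes map terms to documents",
}

# term -> set of doc IDs containing that term (the postings list)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def query_and(*terms):
    """Intersect postings lists, as a Boolean AND query would."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(query_and("inverted", "indexes"))  # docs containing both terms
```

Partitioning, in this picture, means different machines hold the postings for different subsets of documents; duplication means every machine holds a full copy and queries are spread across them.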
  16. In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
      Each row can handle 120 queries per second
      Each column can handle 7M pages
      To handle more queries, add another row.
      From description of the FAST search engine, by Knut Risvik
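The row/column figures on this slide imply a simple scaling calculation; a sketch using the slide's stated numbers (the cluster shape chosen below is arbitrary):

```python
# Capacity sketch using the slide's figures: each row of machines adds
# 120 queries/second of throughput; each column adds room for 7M pages.
QPS_PER_ROW = 120
PAGES_PER_COLUMN = 7_000_000

def cluster_capacity(rows: int, columns: int):
    """Return (total queries/sec, total pages held) for a rows x columns grid."""
    return rows * QPS_PER_ROW, columns * PAGES_PER_COLUMN

qps, pages = cluster_capacity(rows=5, columns=4)
print(qps, pages)
```

The point of the grid design is that the two capacities scale independently: more traffic means more rows, a bigger collection means more columns.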
  17. Cascading Allocation of CPUs
      A variation on this that produces a cost savings:
      – Put high-quality/common pages on many machines
      – Put lower-quality/less common pages on fewer machines
      – Query goes to high-quality machines first
      – If no hits found there, go to other machines
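The cascade above can be sketched as a two-tier lookup with fallback. The tiers and their contents are hypothetical.

```python
# Sketch of cascading allocation: query the well-provisioned high-quality
# tier first, and fall back to the smaller tier only on zero hits.

HIGH_TIER = {"cars": ["popular-car-site"], "news": ["major-news-site"]}
LOW_TIER = {"cars": ["obscure-car-page"], "stamps": ["stamp-collector-page"]}

def search(term: str):
    hits = HIGH_TIER.get(term, [])
    if hits:                       # found on the high-quality machines
        return hits
    return LOW_TIER.get(term, [])  # otherwise try the lower tier

print(search("cars"))    # served entirely from the high-quality tier
print(search("stamps"))  # falls through to the lower tier
```

The cost savings comes from most queries terminating at the first tier, so the rarely-hit pages need far fewer machines.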
  18. Web Crawlers
      How do the web search engines get all of the items they index?
      Main idea:
      – Start with known sites
      – Record information for these sites
      – Follow the links from each site
      – Record information found at new sites
      – Repeat
  19. Web Crawlers
      How do the web search engines get all of the items they index?
      More precisely:
      – Put a set of known sites on a queue
      – Repeat the following until the queue is empty:
        » Take the first page off of the queue
        » If this page has not yet been processed:
          Record the information found on this page
          – Positions of words, links going out, etc.
          Add each link on the current page to the queue
          Record that this page has been processed
      In what order should the links be followed?
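The queue-based loop above can be sketched directly, run here over a hypothetical in-memory link graph instead of the real web (no network access).

```python
# Sketch of the crawling loop: a FIFO queue of pages to visit, plus a
# record of which pages have already been processed.
from collections import deque

LINKS = {  # page -> pages it links to (hypothetical graph)
    "siteA/": ["siteA/p1", "siteB/"],
    "siteA/p1": ["siteA/"],
    "siteB/": ["siteB/p1", "siteB/p2"],
    "siteB/p1": [],
    "siteB/p2": ["siteA/"],
}

def crawl(seeds):
    queue = deque(seeds)        # put the known sites on a queue
    seen = set(seeds)
    processed = []
    while queue:
        page = queue.popleft()  # take the first page off the queue
        processed.append(page)  # "record the information" for this page
        for link in LINKS.get(page, []):
            if link not in seen:    # only enqueue pages not yet handled
                seen.add(link)
                queue.append(link)
    return processed

print(crawl(["siteA/"]))
```

Because links are appended to the back of the queue and taken from the front, this follows links breadth-first, which is the ordering question the slide ends on.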
  20. Page Visit Order
      Animated examples of breadth-first vs. depth-first search on trees: structure to be traversed
  21. Page Visit Order
      Animated example: breadth-first search
  22. Page Visit Order
      Animated example: depth-first search
  23. Page Visit Order
      Animated examples of breadth-first vs. depth-first search on trees
  24. Depth-First Crawling (more complex: graphs and sites)
      [diagram: depth-first visit order over pages on Sites 1-6]
  25. Breadth-First Crawling (more complex: graphs and sites)
      [diagram: breadth-first visit order over the same pages and sites]
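The two visit orders contrasted on slides 20-25 differ only in the frontier discipline: breadth-first takes pages from the front of a FIFO queue, depth-first from the top of a LIFO stack. A sketch on a small hypothetical tree:

```python
# Contrast BFS and DFS visit orders; the only change is which end of the
# frontier the next node is taken from.
from collections import deque

TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
        "a1": [], "a2": [], "b1": []}

def traverse(start, depth_first=False):
    frontier = deque([start])
    order = []
    while frontier:
        # LIFO pop for depth-first, FIFO popleft for breadth-first
        node = frontier.pop() if depth_first else frontier.popleft()
        order.append(node)
        children = TREE[node]
        # reverse for DFS so the leftmost child is popped first
        frontier.extend(reversed(children) if depth_first else children)
    return order

print(traverse("root"))                    # breadth-first: level by level
print(traverse("root", depth_first=True))  # depth-first: down one branch first
```

For crawling, breadth-first tends to reach many sites' top-level pages early, while depth-first burrows deep into one site before moving on.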
  26. Web Crawling Issues
      Keep-out signs
      – A file called robots.txt tells the crawler which directories are off limits
      Freshness
      – Figure out which pages change often
      – Recrawl these often
      Duplicates, virtual hosts, etc.
      – Convert page contents with a hash function
      – Compare new pages to the hash table
      Lots of problems
      – Server unavailable
      – Incorrect HTML
      – Missing links
      – Infinite loops
      Web crawling is difficult to do robustly!
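The hash-based duplicate check above can be sketched as follows. The choice of SHA-256 is mine for illustration; the slide does not name a specific hash function, and the URLs are hypothetical.

```python
# Sketch of duplicate detection: hash each page's contents and compare
# new pages against the table of hashes already seen.
import hashlib

seen_hashes = {}  # content hash -> first URL seen with that content

def is_duplicate(url: str, contents: str) -> bool:
    digest = hashlib.sha256(contents.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True            # same contents already seen at another URL
    seen_hashes[digest] = url
    return False

print(is_duplicate("http://a.example/page", "hello web"))       # first copy
print(is_duplicate("http://mirror.example/page", "hello web"))  # duplicate
```

Comparing fixed-size hashes instead of full page texts keeps the table small and the lookup constant-time, which matters at web scale; it also catches mirrors and virtual hosts serving identical content under different URLs.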
  27. Cha-Cha
      Cha-Cha searches an intranet
      – Sites associated with an organization
      Instead of hand-edited categories
      – Computes shortest path from the root for each hit
      – Organizes search results according to which subdomain the pages are found in
  28. Cha-Cha Web Crawling Algorithm
      Start with a list of servers to crawl
      – For UCB, simply start with www.berkeley.edu
      Restrict crawl to certain domain(s)
      – *.berkeley.edu
      Obey the No Robots standard
      Follow hyperlinks only
      – Do not read local filesystems
        » Links are placed on a queue
        » Traversal is breadth-first
      See the first lecture or the technical papers for more information
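The domain restriction above amounts to a hostname check applied to each link before it is enqueued. A sketch of that check, using the slide's *.berkeley.edu example (this is my illustration, not Cha-Cha's actual code):

```python
# Sketch of restricting a crawl to one domain: only enqueue links whose
# hostname is berkeley.edu or a subdomain of it.
from urllib.parse import urlparse

ALLOWED_SUFFIX = "berkeley.edu"

def in_scope(url: str) -> bool:
    host = urlparse(url).hostname or ""
    # Exact match or a true subdomain; the leading dot prevents
    # "notberkeley.edu" from slipping through a bare endswith check.
    return host == ALLOWED_SUFFIX or host.endswith("." + ALLOWED_SUFFIX)

print(in_scope("http://www.berkeley.edu/index.html"))  # inside the intranet
print(in_scope("http://www.stanford.edu/"))            # out of scope
```

In the crawl loop, this test would sit next to the already-seen check: a link is queued only if it is both unvisited and in scope.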
  29. Summary
      Web search differs from traditional IR systems
      – Different kind of collection
      – Different kinds of users/queries
      – Different economic motivations
      Ranking combines many features in a difficult-to-specify manner
      – Link analysis and proximity of terms seem especially important
      – This is in contrast to the term-frequency orientation of standard search
        » Why?
  30. Summary (cont.)
      Web search engine architecture
      – Similar in many ways to standard IR
      – Indexes usually duplicated across machines to handle many queries quickly
      Web crawling
      – Used to create the collection
      – Can be guided by quality metrics
      – Is very difficult to do robustly
  31. Web Search Statistics
  32. Searches per Day
      [chart; information missing for Excite, Northern Light, etc.]
  33. Web Search Engine Visits
      [chart]
  34. Percentage of web users who visit the site shown
      [chart]
  35. Search Engine Size (July 2000)
      [chart]
  36. Does size matter? You can't access many hits anyhow.
      [chart]
  37. Increasing numbers of indexed pages, self-reported
      [chart]
  38. Increasing numbers of indexed pages (more recent), self-reported
      [chart]
  39. Web Coverage
      [chart]
  40. From description of the FAST search engine, by Knut Risvik
  41. Directory sizes
      [chart]