Information organization
Information Organization: Presentation Transcript

  • 1. SIMS 202: Information Organization and Retrieval
    Prof. Marti Hearst and Prof. Ray Larson
    UC Berkeley SIMS, Tues/Thurs 9:30-11:00am, Fall 2000
  • 2. Last Time: Web Search
    – Directories vs. search engines
    – How web search differs from other search
      » Type of data searched over
      » Type of searches done
      » Type of searchers doing search
    – Web queries are short
      » This probably means people are often using search engines to find starting points
      » Once at a useful site, they must follow links or use site search
    – Web search ranking combines many features
  • 3. What about Ranking?
    Lots of variation here
    – Pretty messy in many cases
    – Details usually proprietary and fluctuating
    Combining subsets of:
    – Term frequencies
    – Term proximities
    – Term position (title, top of page, etc.)
    – Term characteristics (boldface, capitalized, etc.)
    – Link analysis information
    – Category information
    – Popularity information
    Most use a variant of vector-space ranking to combine these.
    Here's how it might work:
    – Make a vector of weights for each feature
    – Multiply this by the counts for each feature
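The weight-vector idea on this slide can be sketched in a few lines. This is a minimal illustration, not any engine's actual formula: the feature names, weights, and documents below are all made up, and real engines keep their weights proprietary.

```python
# A sketch of feature-combination ranking: each document gets a vector of
# feature counts, and a weight vector turns those counts into one score.
# All names and numbers here are hypothetical, for illustration only.

FEATURE_WEIGHTS = {
    "term_frequency": 1.0,
    "term_in_title": 3.0,     # term position: a hit in the title counts extra
    "term_boldface": 1.5,     # term characteristics
    "inlink_count": 2.0,      # a link-analysis signal
}

def score(feature_counts):
    """Dot product of the weight vector with a document's feature counts."""
    return sum(FEATURE_WEIGHTS[f] * feature_counts.get(f, 0)
               for f in FEATURE_WEIGHTS)

doc_a = {"term_frequency": 4, "term_in_title": 1, "inlink_count": 2}
doc_b = {"term_frequency": 9, "term_boldface": 2}

# Rank documents by descending combined score.
ranked = sorted([("a", doc_a), ("b", doc_b)],
                key=lambda d: score(d[1]), reverse=True)
```

Note that heavy raw term frequency (doc_b) can outscore a title hit plus inlinks (doc_a) under one weight vector and not another; choosing the weights is exactly the messy, proprietary part the slide mentions.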
  • 4. From the description of the Northern Light search engine, by Mark Krellenstein: http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm
  • 5. High-Precision Ranking
    Proximity search can help get high-precision results if the query has more than one term
    – Hearst '96 paper:
      » Combines Boolean and passage-level proximity
      » Shows significant improvements when retrieving the top 5, 10, 20, 30 documents
      » Results reproduced by Mitra et al. '98
      » Google uses something similar
  • 6. Boolean Formulations, Hearst '96: Results
  • 7. Spam
    Email spam:
    – Undesired content
    Web spam:
    – Content is disguised as something it is not, in order to
      » Be retrieved more often than it otherwise would
      » Be retrieved in contexts in which it otherwise would not be retrieved
  • 8. Web Spam
    What are the types of web spam?
    – Add extra terms to get a higher ranking
      » Repeat "cars" thousands of times
    – Add irrelevant terms to get more hits
      » Put a dictionary in the comments field
      » Put extra terms in the same color as the background of the web page
    – Add irrelevant terms to get different types of hits
      » Put "sex" in the title field of sites that are selling cars
    – Add irrelevant links to boost your link-analysis ranking
    There is a constant "arms race" between web search companies and spammers.
  • 9. Commercial Issues
    General internet search is often commercially driven
    – The commercial sector sometimes hides things, making it harder to track than research
    – On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
    – Commercial search engine information changes monthly
    – Sometimes motivations are commercial rather than technical
      » Goto.com uses payments to determine ranking order
      » iwon.com gives out prizes
  • 10. Web Search Architecture
  • 11. Web Search Architecture
    Preprocessing
    – Collection gathering phase
      » Web crawling
    – Collection indexing phase
    Online
    – Query servers
    – This part is not covered in the readings
  • 12. From the description of the FAST search engine, by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
  • 13. Standard Web Search Engine Architecture
    (diagram: crawl the web; check for duplicates and store the documents; create an inverted index over DocIds; at query time, a user query goes to the search engine's inverted index servers, which show results to the user)
  • 14. More detailed architecture, from Brin & Page '98. Only covers the preprocessing in detail, not the query serving.
  • 15. Inverted Indexes for Web Search Engines
    Inverted indexes are still used, even though the web is so huge
    Some systems partition the indexes across different machines; each machine handles different parts of the data
    Other systems duplicate the data across many machines; queries are distributed among the machines
    Most do a combination of these
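The partitioning scheme above can be sketched with a toy inverted index. This is a hedged illustration on made-up documents: each "partition" below stands in for a machine indexing its own subset of pages, and a query fans out to every partition before the hit sets are merged. (Replication would instead copy one full index to many machines and route each query to a single copy.)

```python
from collections import defaultdict

def build_index(pages):
    """Build a tiny inverted index: term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Two document partitions, each holding different pages (hypothetical data).
partition_1 = build_index({"d1": "web search engines", "d2": "web crawling"})
partition_2 = build_index({"d3": "search ranking features"})

def query(term, partitions):
    """Fan the query out to every partition, then merge the hits."""
    hits = set()
    for index in partitions:
        hits |= index.get(term, set())
    return hits
```

With document partitioning, every partition must be consulted for every query; with replication, each machine can answer alone but must hold the whole index, which is why the slide says most engines combine the two.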
  • 16. In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
    – Each row can handle 120 queries per second
    – Each column can handle 7M pages
    – To handle more queries, add another row
    From the description of the FAST search engine, by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
  • 17. Cascading Allocation of CPUs
    A variation on this that produces a cost savings:
    – Put high-quality/common pages on many machines
    – Put lower-quality/less common pages on fewer machines
    – A query goes to the high-quality machines first
    – If no hits are found there, go to the other machines
  • 18. Web Crawlers
    How do web search engines get all of the items they index?
    Main idea:
    – Start with known sites
    – Record information for these sites
    – Follow the links from each site
    – Record information found at new sites
    – Repeat
  • 19. Web Crawlers
    How do web search engines get all of the items they index?
    More precisely:
    – Put a set of known sites on a queue
    – Repeat the following until the queue is empty:
      » Take the first page off of the queue
      » If this page has not yet been processed:
        Record the information found on this page (positions of words, links going out, etc.)
        Add each link on the current page to the queue
        Record that this page has been processed
    In what order should the links be followed?
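The queue-based loop on this slide translates almost directly into code. This is a sketch only: the LINKS table is a stand-in for actually fetching a page and extracting its outgoing links, and the page names are invented.

```python
from collections import deque

# Hypothetical link structure: page -> list of pages it links to.
LINKS = {
    "siteA/p1": ["siteA/p2", "siteB/p1"],
    "siteA/p2": ["siteA/p1"],
    "siteB/p1": ["siteB/p2"],
    "siteB/p2": [],
}

def crawl(seeds):
    queue = deque(seeds)        # put a set of known sites on a queue
    seen = set(seeds)
    processed = []
    while queue:                # repeat until the queue is empty
        page = queue.popleft()  # take the first page off of the queue
        processed.append(page)  # "record the information" found on it
        for link in LINKS.get(page, []):
            if link not in seen:        # skip pages already queued/processed
                seen.add(link)
                queue.append(link)
    return processed

order = crawl(["siteA/p1"])
```

Because new links go on the back of the queue, this particular loop visits pages breadth-first, which is one answer to the slide's closing question about link-following order.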
  • 20. Page Visit Order
    Animated examples of breadth-first vs. depth-first search on trees: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
    (diagram: the structure to be traversed)
  • 21. Page Visit Order
    Breadth-first search (must be in presentation mode to see this animation)
  • 22. Page Visit Order
    Depth-first search (must be in presentation mode to see this animation)
  • 23. Page Visit Order
    (animation continued)
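Since the animations are not recoverable here, the contrast they illustrate can be shown in code instead. The only difference between the two traversals is the frontier data structure: a FIFO queue gives breadth-first order, a LIFO stack gives depth-first. The small tree below is made up for the example.

```python
from collections import deque

# A hypothetical tree: node -> list of children, left to right.
TREE = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"],
        "c": [], "d": [], "e": []}

def bfs(start):
    frontier, order = deque([start]), []
    while frontier:
        node = frontier.popleft()   # FIFO: visit the tree level by level
        order.append(node)
        frontier.extend(TREE[node])
    return order

def dfs(start):
    frontier, order = [start], []
    while frontier:
        node = frontier.pop()       # LIFO: follow one branch to the bottom
        order.append(node)
        # push children reversed so the leftmost child is popped first
        frontier.extend(reversed(TREE[node]))
    return order
```

Breadth-first visits root, then both children, then all grandchildren; depth-first dives through `a` and its subtree before ever touching `b`.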
  • 24. Depth-First Crawling (more complex: graphs and sites)
    (diagram: a multi-site link graph, Sites 1-6, with pages numbered in depth-first visit order)
  • 25. Breadth-First Crawling (more complex: graphs and sites)
    (diagram: the same multi-site link graph with pages numbered in breadth-first visit order)
  • 26. Web Crawling Issues
    Keep-out signs
    – A file called robots.txt tells the crawler which directories are off limits
    Freshness
    – Figure out which pages change often
    – Recrawl these often
    Duplicates, virtual hosts, etc.
    – Convert page contents with a hash function
    – Compare new pages to the hash table
    Lots of problems
    – Server unavailable
    – Incorrect HTML
    – Missing links
    – Infinite loops
    Web crawling is difficult to do robustly!
  • 27. Cha-Cha
    Cha-Cha searches an intranet
    – Sites associated with an organization
    Instead of hand-edited categories:
    – Computes the shortest path from the root for each hit
    – Organizes search results according to which subdomain the pages are found in
  • 28. Cha-Cha Web Crawling Algorithm
    Start with a list of servers to crawl
    – For UCB, simply start with www.berkeley.edu
    Restrict the crawl to certain domain(s)
    – *.berkeley.edu
    Obey the No Robots standard
    Follow hyperlinks only
    – Do not read local filesystems
      » Links are placed on a queue
      » Traversal is breadth-first
    See the first lecture or the technical papers for more information
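A breadth-first crawl gives the shortest-path-from-root information that slide 27 mentions essentially for free: the first time BFS reaches a page is along a shortest link path. The sketch below records those depths on a made-up link graph; the hostnames are illustrative, not Cha-Cha's actual data.

```python
from collections import deque

# Hypothetical intranet link graph rooted at the organization's home server.
LINKS = {
    "www.berkeley.edu": ["sims.berkeley.edu", "cs.berkeley.edu"],
    "sims.berkeley.edu": ["sims.berkeley.edu/courses"],
    "cs.berkeley.edu": ["sims.berkeley.edu/courses"],
    "sims.berkeley.edu/courses": [],
}

def shortest_paths(root):
    """BFS from the root, recording each page's link distance from it."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for link in LINKS.get(page, []):
            if link not in depth:   # first visit in BFS = shortest path
                depth[link] = depth[page] + 1
                queue.append(link)
    return depth

depths = shortest_paths("www.berkeley.edu")
```

Even though two different servers link to the courses page, it is recorded once, at its minimum depth of 2; results can then be grouped by depth and subdomain as the slides describe.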
  • 29. Summary
    Web search differs from traditional IR systems
    – Different kind of collection
    – Different kinds of users/queries
    – Different economic motivations
    Ranking combines many features in a difficult-to-specify manner
    – Link analysis and proximity of terms seem especially important
    – This is in contrast to the term-frequency orientation of standard search
      » Why?
  • 30. Summary (cont.)
    Web search engine architecture
    – Similar in many ways to standard IR
    – Indexes usually duplicated across machines to handle many queries quickly
    Web crawling
    – Used to create the collection
    – Can be guided by quality metrics
    – Is very difficult to do robustly
  • 31. Web Search Statistics
  • 32. Searches per Day (information missing for fast.com, Excite, Northern Light, etc.) Information from searchenginewatch.com
  • 33. Web Search Engine Visits. Information from searchenginewatch.com
  • 34. Percentage of web users who visit the site shown. Information from searchenginewatch.com
  • 35. Search Engine Size (July 2000). Information from searchenginewatch.com
  • 36. Does size matter? You can't access many hits anyhow. Information from searchenginewatch.com
  • 37. Increasing numbers of indexed pages, self-reported. Information from searchenginewatch.com
  • 38. Increasing numbers of indexed pages (more recent), self-reported. Information from searchenginewatch.com
  • 39. Web Coverage. Information from searchenginewatch.com
  • 40. From the description of the FAST search engine, by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
  • 41. Directory sizes. Information from searchenginewatch.com