SIMS 202: Information Organization and Retrieval
Prof. Marti Hearst and Prof. Ray Larson
UC Berkeley SIMS
Tues/Thurs 9:30-11:00am
Fall 2000
Last Time
Web Search
– Directories vs. search engines
– How web search differs from other search
  » Type of data searched over
  » Type of searches done
  » Type of searchers doing search
– Web queries are short
  » This probably means people are often using search engines to find starting points
  » Once at a useful site, they must follow links or use site search
– Web search ranking combines many features
What about Ranking?
Lots of variation here
– Pretty messy in many cases
– Details usually proprietary and fluctuating
Combining subsets of:
– Term frequencies
– Term proximities
– Term position (title, top of page, etc.)
– Term characteristics (boldface, capitalized, etc.)
– Link analysis information
– Category information
– Popularity information
Most use a variant of vector space ranking to combine these
Here's how it might work:
– Make a vector of weights for each feature
– Multiply this by the counts for each feature
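The weight-vector idea above can be sketched as a simple dot product. The feature names and weight values here are invented for illustration; real engines' weights are proprietary.

```python
# Hypothetical sketch of combining ranking features as a weighted sum.
# Feature names and weights are illustrative, not any engine's real values.

FEATURE_WEIGHTS = {
    "term_frequency": 1.0,
    "term_in_title": 5.0,
    "term_in_bold": 2.0,
    "link_score": 3.0,
}

def score(page_features):
    """Dot product of the weight vector with the page's feature counts."""
    return sum(FEATURE_WEIGHTS[f] * page_features.get(f, 0)
               for f in FEATURE_WEIGHTS)

page = {"term_frequency": 4, "term_in_title": 1, "link_score": 2}
print(score(page))  # 4*1.0 + 1*5.0 + 2*3.0 = 15.0
```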
From a description of the Northern Light search engine, by Mark Krellenstein
http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm
High-Precision Ranking
Proximity search can help get high-precision results if > 1 term
– Hearst '96 paper:
  » Combine Boolean and passage-level proximity
  » Shows significant improvements when retrieving top 5, 10, 20, 30 documents
  » Results reproduced by Mitra et al. '98
  » Google uses something similar
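One way to picture passage-level proximity: require all query terms to occur within a fixed-size word window. This is only an illustrative sketch, not the actual algorithm from the Hearst '96 paper.

```python
# Illustrative proximity check: a document passes only if every query term
# appears, and all terms fall within a fixed-size word window of each other.

def within_window(tokens, terms, window=20):
    positions = {t: [i for i, tok in enumerate(tokens) if tok == t]
                 for t in terms}
    if any(not p for p in positions.values()):
        return False  # some term is missing entirely
    # Try each occurrence of the first term as a window anchor
    for i in positions[terms[0]]:
        if all(any(abs(i - j) < window for j in positions[t])
               for t in terms[1:]):
            return True
    return False

doc = "cheap used cars for sale in berkeley california".split()
print(within_window(doc, ["cars", "berkeley"], window=10))  # True
```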
Spam
Email spam:
– Undesired content
Web spam:
– Content is disguised as something it is not, in order to
  » Be retrieved more often than it otherwise would
  » Be retrieved in contexts in which it otherwise would not be retrieved
Web Spam
What are the types of Web spam?
– Add extra terms to get a higher ranking
  » Repeat "cars" thousands of times
– Add irrelevant terms to get more hits
  » Put a dictionary in the comments field
  » Put extra terms in the same color as the background of the web page
– Add irrelevant terms to get different types of hits
  » Put "sex" in the title field of sites that are selling cars
– Add irrelevant links to boost your link analysis ranking
There is a constant "arms race" between web search companies and spammers
Commercial Issues
General internet search is often commercially driven
– The commercial sector sometimes hides things, making them harder to track than research
– On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
– Commercial search engine information changes monthly
– Sometimes motivations are commercial rather than technical
  » Goto.com uses payments to determine ranking order
  » iwon.com gives out prizes
Web Search Architecture
Preprocessing
– Collection gathering phase
  » Web crawling
– Collection indexing phase
Online
– Query servers
– This part is not covered in the readings
From a description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Standard Web Search Engine Architecture
[Diagram: crawl the web → check for duplicates, store the documents → create an inverted index (DocIds) → inverted index → search engine servers; user query goes to the search engine servers, which show results to the user]
More detailed architecture, from Brin & Page '98.
Only covers the preprocessing in detail, not the query serving.
Inverted Indexes for Web Search Engines
Inverted indexes are still used, even though the web is so huge
Some systems partition the indexes across different machines; each machine handles different parts of the data
Other systems duplicate the data across many machines; queries are distributed among the machines
Most do a combination of these
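The partitioning scheme can be sketched in a few lines: each machine holds the postings for its own subset of pages, and a query is broadcast to every partition and the results merged. The partition count, documents, and assignment rule below are all invented for illustration.

```python
# Sketch of document-partitioned indexing: each "machine" holds an inverted
# index for its own subset of pages; queries go to all partitions.

NUM_PARTITIONS = 3

def partition_of(doc_id):
    return doc_id % NUM_PARTITIONS  # simple hash-style assignment

# One small inverted index per partition: term -> list of doc ids
indexes = [dict() for _ in range(NUM_PARTITIONS)]
docs = {0: "used cars berkeley", 1: "new cars oakland", 2: "berkeley housing"}
for doc_id, text in docs.items():
    idx = indexes[partition_of(doc_id)]
    for term in text.split():
        idx.setdefault(term, []).append(doc_id)

def search(term):
    # Broadcast the query to every partition and merge the hits
    hits = []
    for idx in indexes:
        hits.extend(idx.get(term, []))
    return sorted(hits)

print(search("cars"))      # [0, 1]
print(search("berkeley"))  # [0, 2]
```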
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
– Each row can handle 120 queries per second
– Each column can handle 7M pages
– To handle more queries, add another row
From a description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
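The capacity arithmetic on the slide can be made explicit: rows add query throughput, columns add collection size. The row/column counts below are made up; the per-row and per-column figures are from the slide.

```python
# Worked example of the row/column capacity arithmetic from the slide.

QPS_PER_ROW = 120           # queries per second each row can serve
PAGES_PER_COLUMN = 7_000_000

def capacity(rows, columns):
    """Total query throughput and total pages covered by a rows x columns grid."""
    return rows * QPS_PER_ROW, columns * PAGES_PER_COLUMN

qps, pages = capacity(rows=4, columns=10)
print(qps, pages)  # 480 queries/sec over 70,000,000 pages
```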
Cascading Allocation of CPUs
A variation on this that produces a cost savings:
– Put high-quality/common pages on many machines
– Put lower-quality/less common pages on fewer machines
– A query goes to the high-quality machines first
– If no hits are found there, go to the other machines
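The cascade amounts to a tiered lookup with fallback. A minimal sketch, with invented tier contents:

```python
# Sketch of cascading allocation: query the heavily replicated "high quality"
# tier first; only on a miss does the query fall through to the cheaper tier.

high_tier = {"cars": ["popular-cars-page"]}
low_tier = {"cars": ["obscure-cars-page"], "zither": ["zither-fanpage"]}

def cascaded_search(term):
    hits = high_tier.get(term)
    if hits:                        # found on the well-provisioned machines
        return hits
    return low_tier.get(term, [])   # fall back to the smaller tier

print(cascaded_search("cars"))    # ['popular-cars-page']
print(cascaded_search("zither"))  # ['zither-fanpage']
```

The cost savings come from spending replication (and hence hardware) only on the pages most queries actually hit.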
Web Crawlers
How do the web search engines get all of the items they index?
Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
Web Crawlers
How do the web search engines get all of the items they index?
More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
  » Take the first page off of the queue
  » If this page has not yet been processed:
    – Record the information found on this page (positions of words, links going out, etc.)
    – Add each link on the current page to the queue
    – Record that this page has been processed
In what order should the links be followed?
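The queue-based loop above can be sketched directly. The link graph here stands in for actually fetching pages and extracting their links, which a real crawler would do over HTTP.

```python
from collections import deque

# Sketch of the crawl loop: known sites seed a queue; each page's outgoing
# links are appended; a 'processed' set keeps cycles from looping forever.

LINKS = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],   # cycle back to the start
}

def crawl(seeds):
    queue = deque(seeds)              # put the known sites on a queue
    processed = set()
    order = []
    while queue:
        page = queue.popleft()        # take the first page off the queue
        if page in processed:
            continue                  # already handled this page
        order.append(page)            # "record the information on this page"
        queue.extend(LINKS.get(page, []))  # add each link to the queue
        processed.add(page)           # record that this page was processed
    return order

print(crawl(["a.com"]))  # ['a.com', 'b.com', 'c.com'] — breadth-first order
```

Using a FIFO queue makes this breadth-first; swapping in a stack would make it depth-first, which is exactly the question the slide ends on.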
Page Visit Order
Animated examples of breadth-first vs. depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
– Structure to be traversed
– Breadth-first search
– Depth-first search
(animations visible in presentation mode only)
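In place of the animations, the two visit orders can be compared on a small made-up tree: breadth-first uses a FIFO queue and visits level by level, depth-first uses a stack and follows one branch to the bottom first.

```python
from collections import deque

# Breadth-first vs. depth-first visit order on a small invented tree.

TREE = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}

def bfs(start):
    queue, order = deque([start]), []
    while queue:
        node = queue.popleft()       # FIFO: explore level by level
        order.append(node)
        queue.extend(TREE.get(node, []))
    return order

def dfs(start):
    stack, order = [start], []
    while stack:
        node = stack.pop()           # LIFO: dive down one branch first
        order.append(node)
        stack.extend(reversed(TREE.get(node, [])))  # keep left-to-right order
    return order

print(bfs("root"))  # ['root', 'a', 'b', 'a1', 'a2', 'b1']
print(dfs("root"))  # ['root', 'a', 'a1', 'a2', 'b', 'b1']
```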
Web Crawling Issues
Keep-out signs
– A file called robots.txt tells the crawler which directories are off limits
Freshness
– Figure out which pages change often
– Recrawl these often
Duplicates, virtual hosts, etc.
– Convert page contents with a hash function
– Compare new pages to the hash table
Lots of problems
– Server unavailable
– Incorrect HTML
– Missing links
– Infinite loops
Web crawling is difficult to do robustly!
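The hash-based duplicate check can be sketched as follows. This catches only exact duplicates (e.g. mirrors); the page text and the choice of hash are illustrative.

```python
import hashlib

# Sketch of duplicate detection by content hashing: hash each page's text
# and skip any page whose digest has already been seen.

seen_hashes = set()

def is_duplicate(page_text):
    digest = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True               # identical content already crawled
    seen_hashes.add(digest)
    return False

print(is_duplicate("Welcome to our site"))  # False — first time seen
print(is_duplicate("Welcome to our site"))  # True  — same content, e.g. a mirror
```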
Cha-Cha
Cha-Cha searches an intranet
– Sites associated with an organization
Instead of hand-edited categories:
– Computes the shortest path from the root for each hit
– Organizes search results according to which subdomain the pages are found in
Cha-Cha Web Crawling Algorithm
Start with a list of servers to crawl
– for UCB, simply start with www.berkeley.edu
Restrict the crawl to certain domain(s)
– *.berkeley.edu
Obey the No Robots standard
Follow hyperlinks only
– do not read local filesystems
  » links are placed on a queue
  » traversal is breadth-first
See the first lecture or the technical papers for more information
Summary
Web search differs from traditional IR systems
– Different kind of collection
– Different kinds of users/queries
– Different economic motivations
Ranking combines many features in a difficult-to-specify manner
– Link analysis and proximity of terms seem especially important
– This is in contrast to the term-frequency orientation of standard search
  » Why?
Summary (cont.)
Web search engine architecture
– Similar in many ways to standard IR
– Indexes are usually duplicated across machines to handle many queries quickly
Web crawling
– Used to create the collection
– Can be guided by quality metrics
– Is very difficult to do robustly