Alta vista indexing and search engine


An overview of how a web search engine is organized is provided. A key component of the AltaVista search engine, its indexing library, is described in more depth. The library manages a set of inverted files and provides mechanisms to construct and optimize complex queries on those inverted files. The design goals were to enable efficient queries on bodies of text up to a few hundred gigabytes in size (e.g. AltaVista) without sacrificing too much generality, and without giving up on small applications (e.g. mail directories).

  • Conditional branch instructions for 1-, 2-, and 3-byte delta values. The common case is 2 bytes, but it accounts for only about 2/3 of all cases. Seeking is a frequent operation.
  • To speed up the algorithm, choose the constraint that will move your ISR farthest. Make the correctness of the query logic trivial to verify.

    1. 1. AltaVista Indexing and Search Engine - By Mike Burrows - Recreated by Changshu Liu
    2. 2. Goals <ul><li>General purpose </li></ul><ul><li>Good query performance </li></ul><ul><li>Scale to hundreds of gigabytes </li></ul><ul><li>Compact index/query representation </li></ul><ul><li>Queries possible during updates </li></ul><ul><li>Reasonable update performance </li></ul>
    3. 3. Non-Goals <ul><li>Scale beyond a terabyte </li></ul><ul><li>Document parsing </li></ul><ul><li>Query parsing </li></ul><ul><li>Ranking for query results </li></ul>
    4. 4. Structure of Inverted Files <ul><li>Chose flat inverted files that map words to lists of locations where those words occur </li></ul><ul><li>Words are null-terminated byte strings </li></ul><ul><li>Locations are 64-bit unsigned integers </li></ul><ul><li>Client picks what locations mean. No predefined notion of document, page or word number </li></ul>
    5. 5. Documents <ul><li>A document is contiguous in location space </li></ul><ul><li>Documents do not overlap </li></ul><ul><li>Location space is allocated densely. The first document is at location 1 </li></ul><ul><li>Word endDoc at last location of document </li></ul><ul><li>All document structure encoded with words </li></ul><ul><ul><li>For example: begintitle , endtitle </li></ul></ul>
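The location-space layout described on slides 4 and 5 can be sketched in a few lines. This is a minimal illustration, not the library's actual code: the function names `build_index` and `doc_of` are hypothetical, words are plain Python strings rather than null-terminated byte strings, and the marker word is spelled `enddoc` as on slide 12.

```python
import bisect
from collections import defaultdict


def build_index(documents):
    """Map each word to the ascending list of locations where it occurs.

    Locations are allocated densely starting at 1, and the marker word
    "enddoc" is placed at the last location of every document.
    """
    index = defaultdict(list)
    loc = 0
    for words in documents:
        for word in words:
            loc += 1
            index[word].append(loc)
        loc += 1
        index["enddoc"].append(loc)
    return dict(index)


def doc_of(index, loc):
    """Return the 0-based number of the document that contains loc."""
    # The first enddoc at or after loc closes the document holding loc.
    return bisect.bisect_left(index["enddoc"], loc)


docs = [["alta", "vista"], ["search", "engine", "alta"]]
index = build_index(docs)
print(index["alta"], index["enddoc"], doc_of(index, 6))
```

Because document boundaries are just another word, the same machinery that finds ordinary words also finds document ends, which is the point of encoding all structure with words.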
    6. 6. Inverted File Format <ul><li>Words ordered lexicographically </li></ul><ul><li>Each word followed by list of locations </li></ul><ul><li>Common word prefixes are compressed </li></ul><ul><li>Locations are encoded as deltas </li></ul><ul><li>Deltas stored in as few bytes as possible </li></ul><ul><ul><li>2 bytes is common </li></ul></ul><ul><li>Full-text index occupies about 30% text size. Word-in-document (non-positional) index is about 10% </li></ul>
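To make the delta encoding concrete, here is a small sketch of storing an ascending location list as variable-length deltas. The byte layout is an assumption chosen for illustration (7 payload bits per byte, high bit set when more bytes follow); the slides do not specify the exact on-disk layout, and the function names are hypothetical.

```python
def encode_locations(locations):
    """Encode an ascending list of locations as variable-length deltas."""
    out = bytearray()
    prev = 0
    for loc in locations:
        delta = loc - prev
        prev = loc
        # Emit 7 bits at a time, low bits first.
        # A set high bit means "more bytes of this delta follow".
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_locations(data):
    """Inverse of encode_locations: rebuild the absolute locations."""
    locations = []
    cur = 0
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # delta continues in the next byte
        else:
            cur += delta        # delta complete: add it to the running location
            locations.append(cur)
            delta = 0
            shift = 0
    return locations


locs = [1, 5, 130, 20000, 20001, 5000000]
encoded = encode_locations(locs)
print(len(encoded), decode_locations(encoded) == locs)
```

Small gaps between consecutive occurrences cost a single byte, which is why frequent words compress well under this kind of scheme.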
    7. 7. <ul><li>Obvious format for deltas: </li></ul><ul><li>Continuation Bits Indicate Delta Boundaries </li></ul><ul><li>Key operation: Find first location at least X </li></ul><ul><li>Better format for efficient scanning: </li></ul><ul><ul><li>Deltas packed into aligned 64-bit word </li></ul></ul><ul><ul><li>First byte contains continuation bits </li></ul></ul>
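The packed format and the key operation (find the first location at least X) might look roughly like the sketch below. The details are assumptions made for illustration only: each 8-byte word holds one header byte plus seven payload bytes, header bit i is set when payload byte i continues into byte i+1, deltas never span words, and unused payload bytes are zero (a zero delta leaves the running location unchanged). The names `pack_words` and `seek` are hypothetical.

```python
def _finish(header, payload):
    """Build one 8-byte word: header byte followed by 7 zero-padded payload bytes."""
    return bytes([header]) + bytes(payload).ljust(7, b"\0")


def pack_words(deltas):
    """Pack deltas into 8-byte words (1 header byte + 7 payload bytes)."""
    words = []
    payload = bytearray()
    header = 0
    for delta in deltas:
        nbytes = max(1, (delta.bit_length() + 7) // 8)
        if nbytes > 7:
            raise ValueError("delta too large for this sketch")
        if len(payload) + nbytes > 7:
            words.append(_finish(header, payload))
            payload, header = bytearray(), 0
        start = len(payload)
        payload += delta.to_bytes(nbytes, "little")
        # Mark every byte of the delta except its last as "continues".
        for i in range(start, start + nbytes - 1):
            header |= 1 << i
    if payload:
        words.append(_finish(header, payload))
    return words


def seek(words, target):
    """Return the first location >= target, or None if the stream runs out."""
    cur = 0
    for word in words:
        header, payload = word[0], word[1:]
        delta, shift = 0, 0
        for i in range(7):
            delta |= payload[i] << shift
            if header & (1 << i):
                shift += 8      # delta continues in the next payload byte
                continue
            cur += delta        # delta complete
            delta, shift = 0, 0
            if cur >= target:
                return cur
    return None


locs = [1, 3, 300, 301, 70000, 70002, 16_800_000]
deltas = [b - a for a, b in zip([0] + locs, locs)]
words = pack_words(deltas)
print(len(words), seek(words, 4))
```

Keeping all boundary information in one fixed place per word is what lets a real implementation replace per-byte branching with table lookups; this Python loop only models the data layout, not that optimization.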
    8. 8. Parsing a Delta <ul><li>Observation: </li></ul><ul><ul><li>Choose instructions to dual-issue well. </li></ul></ul><ul><ul><li>Fixed word structure allows prefetch. </li></ul></ul><ul><ul><li>Avoid branch mispredictions. </li></ul></ul><ul><li>6 instr. to extract+sum+compare a delta </li></ul><ul><ul><li>extql b, tp, x ; get next delta </li></ul></ul><ul><ul><li>addq tp, l, tp ; point to next delta </li></ul></ul><ul><ul><li>mskql x, l, x ; cut delta to length </li></ul></ul><ul><ul><li>srl l, 3, l ; get next delta length </li></ul></ul><ul><ul><li>addq cur, x, cur ; add delta to location </li></ul></ul><ul><ul><li>bge cur, done ; bail if done </li></ul></ul><ul><li>With loop overhead, 35 instr/64-bit word. </li></ul><ul><li>10 cycles/64-bit word. </li></ul>
    9. 9. Index Stream Reader (ISRs) <ul><li>An interface for reading the result of a query as an ascending sequence of locations </li></ul><ul><li>Lazily evaluated </li></ul><ul><li>ISRs are objects with methods: </li></ul><ul><ul><li>Loc() – Return Current Location </li></ul></ul><ul><ul><li>Next() – Advance to next location </li></ul></ul><ul><ul><li>Seek(X) – Advance to first location at least X. </li></ul></ul><ul><li>Subtype ISRP adds: </li></ul><ul><ul><li>Prev() – Return previous location </li></ul></ul><ul><ul><ul><ul><li>Used for fielded queries (e.g. in title) </li></ul></ul></ul></ul><ul><li>No methods move backwards </li></ul>
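The ISR interface can be illustrated with an in-memory stand-in. This is a sketch under assumptions, not the library's API: the class name `ListISR` is hypothetical, the end of the stream is represented by an infinite sentinel `END`, and `prev()` returns 0 when nothing precedes the current location (the slides say the first document starts at location 1).

```python
import bisect

END = float("inf")  # hypothetical sentinel: the stream is exhausted


class ListISR:
    """An ISRP-like reader over a sorted in-memory list of locations."""

    def __init__(self, locations):
        self._locs = sorted(locations)
        self._i = 0

    def loc(self):
        """Return the current location, or END when exhausted."""
        return self._locs[self._i] if self._i < len(self._locs) else END

    def next(self):
        """Advance to the next location and return it."""
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        """Advance to the first location >= x; never moves backwards."""
        self._i = bisect.bisect_left(self._locs, x, lo=self._i)
        return self.loc()

    def prev(self):
        """Return the location just before the current one (0 if none)."""
        return self._locs[self._i - 1] if self._i > 0 else 0


isr = ListISR([3, 10, 42])
print(isr.loc(), isr.seek(5), isr.prev())
```

Note that `prev()` only reports a location; it does not move the reader, which is consistent with the rule that no method moves backwards.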
    10. 10. ISR Implementations <ul><li>file — reads inverted files; </li></ul><ul><li>seek () method is the delta parsing loop </li></ul><ul><li>or — merges two or more ISRs </li></ul><ul><li>not — returns locations not in argument ISR </li></ul><ul><li>and — constraint solver ( AND , NEAR etc) </li></ul><ul><li>and other, specialized ISRs </li></ul><ul><li>and & not cannot support prev () </li></ul>
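As an illustration of how ISRs compose, the following sketch merges several readers the way an "or" ISR might. It is an assumption-laden toy: `ListISR` and `OrISR` are hypothetical names, the leaf reader is an in-memory list rather than an inverted file, and `END` is an invented end-of-stream sentinel.

```python
import bisect

END = float("inf")  # hypothetical end-of-stream sentinel


class ListISR:
    """Minimal leaf reader over a sorted list of locations."""

    def __init__(self, locations):
        self._locs = sorted(locations)
        self._i = 0

    def loc(self):
        return self._locs[self._i] if self._i < len(self._locs) else END

    def next(self):
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        self._i = bisect.bisect_left(self._locs, x, lo=self._i)
        return self.loc()


class OrISR:
    """Merge two or more ISRs into one ascending stream of locations."""

    def __init__(self, children):
        self._children = list(children)

    def loc(self):
        # The merged stream sits at the smallest child location.
        return min(c.loc() for c in self._children)

    def next(self):
        cur = self.loc()
        # Advance every child positioned at the current location,
        # so a location present in several children is reported once.
        for c in self._children:
            if c.loc() == cur:
                c.next()
        return self.loc()

    def seek(self, x):
        for c in self._children:
            c.seek(x)
        return self.loc()


merged = OrISR([ListISR([1, 5, 9]), ListISR([2, 5, 10])])
out = []
while merged.loc() != END:
    out.append(merged.loc())
    merged.next()
print(out)
```

Because the merged reader exposes the same `loc`/`next`/`seek` methods as its children, it can itself be passed to another combinator, which is what makes the abstraction composable.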
    11. 11. ISR And—constraint solver <ul><li>Arguments: list of ISR s , list of Constraint s </li></ul><ul><li>Constraint types: (A and B are ISRs) </li></ul><ul><ul><li>1. loc ( A ) ≤ loc ( B ) + K </li></ul></ul><ul><ul><li>2. prev ( A ) ≤ loc ( B ) + K </li></ul></ul><ul><ul><li>3. loc ( A ) ≤ prev ( B ) + K </li></ul></ul><ul><ul><li>4. prev ( A ) ≤ prev ( B ) + K </li></ul></ul><ul><li>If each word takes a location, constraints for two-word phrase “ a b ” are: </li></ul><ul><ul><li>loc ( A ) < loc ( B ) </li></ul></ul><ul><ul><li>loc ( B ) ≤ loc ( A ) + 1 </li></ul></ul>
    12. 12. <ul><li>Let E , BT , ET be ISRPs of words: </li></ul><ul><li>enddoc , begintitle , endtitle </li></ul><ul><li>Constraints for conjunction: a and b </li></ul><ul><ul><li>prev ( E ) < loc ( A ) loc ( A ) ≤ loc ( E ) </li></ul></ul><ul><ul><li>prev ( E ) < loc ( B ) loc ( B ) ≤ loc ( E ) </li></ul></ul><ul><li>Constraints for field query: title: A </li></ul><ul><ul><li>prev ( BT ) < loc ( A ) loc ( A ) ≤ loc ( ET ) </li></ul></ul><ul><ul><li>prev ( BT ) < loc ( ET ) loc ( ET ) ≤ loc ( BT ) </li></ul></ul>
    13. 13. Solver Algorithm <ul><li>While (Unsatisfied Constraints) </li></ul><ul><ul><li>Pick Unsatisfied Constraint() </li></ul></ul><ul><ul><li>Satisfy Constraint() </li></ul></ul><ul><li>To Satisfy </li></ul><ul><li>loc ( A ) ≤ loc ( B ) + K: </li></ul><ul><li>seek ( B, loc ( A ) − K ) </li></ul><ul><li>prev ( A ) ≤ loc ( B ) + K: </li></ul><ul><li>seek ( B, prev ( A ) − K ) </li></ul><ul><li>loc ( A ) ≤ prev ( B ) + K: </li></ul><ul><li>seek ( B, loc ( A ) − K ) </li></ul><ul><li>next ( B ) </li></ul><ul><li>prev ( A ) ≤ prev ( B ) + K: </li></ul><ul><li>seek ( B, prev ( A ) − K ) </li></ul><ul><li>next ( B ) </li></ul>
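The solver loop and the four "to satisfy" rules can be sketched as follows, using the two-word phrase constraints from slide 11 as the worked case. Everything beyond those rules is an assumption: `ListISR`, `Constraint`, `solve`, `phrase_matches`, and the `END` sentinel are hypothetical, the sketch always picks the first unsatisfied constraint, and it moves to the next match by advancing the first word's reader.

```python
import bisect
from dataclasses import dataclass

END = float("inf")  # hypothetical end-of-stream sentinel


class ListISR:
    """Minimal ISRP-like reader over a sorted list of locations."""

    def __init__(self, locations):
        self._locs = sorted(locations)
        self._i = 0

    def loc(self):
        return self._locs[self._i] if self._i < len(self._locs) else END

    def prev(self):
        return self._locs[self._i - 1] if self._i > 0 else 0

    def next(self):
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        self._i = bisect.bisect_left(self._locs, x, lo=self._i)
        return self.loc()


@dataclass
class Constraint:
    """Encodes (loc|prev)(a) <= (loc|prev)(b) + k."""
    a: ListISR
    b: ListISR
    k: int
    a_prev: bool = False
    b_prev: bool = False

    def _lhs(self):
        return self.a.prev() if self.a_prev else self.a.loc()

    def satisfied(self):
        rhs = self.b.prev() if self.b_prev else self.b.loc()
        return self._lhs() <= rhs + self.k

    def satisfy(self):
        # Move b forward far enough; a prev(b) constraint needs one extra step.
        self.b.seek(self._lhs() - self.k)
        if self.b_prev:
            self.b.next()


def solve(isrs, constraints):
    """Advance ISRs until all constraints hold; False if any ISR runs out."""
    while True:
        if any(isr.loc() == END for isr in isrs):
            return False
        unsatisfied = [c for c in constraints if not c.satisfied()]
        if not unsatisfied:
            return True
        unsatisfied[0].satisfy()


def phrase_matches(a_locs, b_locs):
    """Locations of word a that are immediately followed by word b."""
    a, b = ListISR(a_locs), ListISR(b_locs)
    # loc(A) < loc(B) is loc(A) <= loc(B) - 1; loc(B) <= loc(A) + 1 as given.
    constraints = [Constraint(a, b, -1), Constraint(b, a, 1)]
    matches = []
    while solve([a, b], constraints):
        matches.append(a.loc())
        a.next()
    return matches


print(phrase_matches([1, 7, 20], [3, 8, 21, 30]))
```

Each `satisfy` call strictly advances one reader, so the loop terminates on finite streams. The slide notes suggest picking the constraint that moves an ISR farthest; this sketch takes the first one for simplicity.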
    14. 14. Some Metrics <ul><li>(performance based on AltaVista Web index) </li></ul><ul><li>20K lines of code </li></ul><ul><li>Indexes around 1.5GByte/Hr/600MHz CPU </li></ul><ul><li>Queries take about 100 cycles/query/MByte </li></ul><ul><li>Queries are CPU bound: </li></ul><ul><li>95% in user space, 5% in kernel </li></ul><ul><li>Memory bus is currently under-utilized </li></ul>
    15. 15. Breakdown of user CPU time <ul><li>30% inner loop </li></ul><ul><li>15% constraint solver </li></ul><ul><li>15% higher level seek code </li></ul><ul><li>7% ranking code </li></ul><ul><li>0.2% merging results </li></ul><ul><li>Miss Ratios: </li></ul><ul><ul><li>2% I-cache </li></ul></ul><ul><ul><li>8% D-cache </li></ul></ul><ul><ul><li>8% level-2 cache </li></ul></ul><ul><ul><li>40% level-3 cache </li></ul></ul>
    16. 16. Postmortem <ul><li>Successes </li></ul><ul><ul><li>ISRs are a good abstraction </li></ul></ul><ul><ul><li>Flat location space </li></ul></ul><ul><ul><li>Representing structure as words </li></ul></ul><ul><li>Regrets </li></ul><ul><ul><li>No ability to run ISRs backwards </li></ul></ul><ul><ul><li>Wish ISR constraint solver were less complex </li></ul></ul>
    17. 17. AltaVista Site Architecture - By Mike Burrows - Recreated by Changshu Liu
    18. 18. Structure of the Site <ul><li>Front-Ends: Alpha Workstations </li></ul><ul><li>Back-Ends: </li></ul><ul><ul><li>4-10 CPU Alpha Servers </li></ul></ul><ul><ul><li>8GBytes RAM / 150 GBytes Disc. </li></ul></ul><ul><ul><li>Organized in Groups of 4-10 Machines </li></ul></ul><ul><ul><li>Each machine has 1/Nth of the whole index </li></ul></ul>[Diagram labels: Broad Routers 0, Broad Routers 1, FDDI Router, Front End 0, Front End 1, Front End N-1, FDDI Router, Front End N]
    19. 19. Handling Failures <ul><li>Disc: RAID controllers with spare discs </li></ul><ul><li>Back-ends: front-ends use other groups </li></ul><ul><li>Front-ends: hot-spare grabs IP address </li></ul><ul><li>FDDI: manual replacement of cold spare </li></ul><ul><li>Site: failover via manual DNS change </li></ul>
    20. 20. RAID <ul><li>Reconstructing a disc takes 30 minutes. </li></ul><ul><ul><li>Disc performance is crippled </li></ul></ul><ul><li>Expect a few discs to fail each month </li></ul><ul><ul><li>Need daily schedule for checking discs. </li></ul></ul><ul><li>GUI annoying when checking 60 controllers </li></ul><ul><li>Once a disc failed with no error reported </li></ul><ul><ul><li>Corrupted index file </li></ul></ul><ul><li>On first day, the only non-RAID device (root disc) failed during demo for press </li></ul>
    21. 21. File System <ul><li>Need a Journaling File System </li></ul><ul><ul><li>Write Ahead Log </li></ul></ul><ul><ul><li>FSCK (consistency checker) takes hours </li></ul></ul><ul><li>Software/Memory errors destroy file systems </li></ul><ul><ul><li>Restoring 300GB from tape doesn’t work </li></ul></ul><ul><ul><ul><li>Tape may be in error </li></ul></ul></ul><ul><ul><ul><li>Too slow </li></ul></ul></ul><ul><ul><li>Important to replicate data on spinning disk </li></ul></ul>
    22. 22. Back-Ends <ul><li>Back-Ends were Digital 8400s (TurboLaser) </li></ul><ul><li>Huge cards with large connectors </li></ul><ul><li>Pins are on backplane, not card </li></ul><ul><li>RAID setup took hours on separate machine </li></ul><ul><li>Console interrupt is a boon </li></ul>
    23. 23. Front-Ends <ul><li>Biggest Problems: </li></ul><ul><ul><li>Poorly tested software </li></ul></ul><ul><ul><li>Operator error </li></ul></ul><ul><li>Automatic restart dealt with the former </li></ul><ul><li>A trivial IP failover scheme dealt with the latter </li></ul>
    24. 24. HTTP Server <ul><li>Original NCSA httpd was abysmal </li></ul><ul><ul><li>Forked too often </li></ul></ul><ul><ul><li>Synchronous name resolution </li></ul></ul><ul><ul><li>Logs writes to full disc </li></ul></ul><ul><ul><li>Prone to denial of service attacks </li></ul></ul><ul><li>Fixed with new first http server </li></ul><ul><ul><li>Never Forks: aggravates software test issues </li></ul></ul><ul><ul><li>Submit limits: sockets/threads/requests rate </li></ul></ul>
    25. 25. Load Balance <ul><li>Front-End </li></ul><ul><ul><li>DNS round robin </li></ul></ul><ul><li>Back-End </li></ul><ul><ul><li>Front-ends send similar queries to the same back-end so that its cache is reused </li></ul></ul>
    26. 26. Overload Handling <ul><li>Back-ends take short-cuts when overloaded </li></ul><ul><ul><li>Ultimately, they can refuse service </li></ul></ul><ul><li>Front-ends have spare capacity to avoid site appearing completely dead </li></ul>
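One way a front-end could send similar queries to the same back-end is to hash a normalized form of the query. This is only a sketch of the idea: the slides do not say how AltaVista chose a back-end, and the normalization rule (lowercase, sorted terms), the function name `pick_backend`, and the back-end names are all assumptions.

```python
import hashlib


def pick_backend(query, backends):
    """Choose a back-end deterministically from the normalized query text."""
    # Assumed normalization: case-insensitive, term order ignored.
    key = " ".join(sorted(query.lower().split()))
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash.
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


backends = ["backend-0", "backend-1", "backend-2"]  # hypothetical names
print(pick_backend("Alta Vista", backends) == pick_backend("vista  alta", backends))
```

A stable hash (rather than Python's per-process `hash`) keeps the mapping identical across front-end machines and restarts, so repeated queries keep landing where their results may already be cached.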
    27. 27. Reference <ul><li>The AltaVista Indexing and Search Engine </li></ul><ul><ul><li>Mike Burrows, Compaq SRC </li></ul></ul><ul><ul><li>Production Date: 01/18/2000 </li></ul></ul><ul><ul><li>Link: </li></ul></ul>