3 Understanding Search

An overview of how the Google Search engine works. Based on the excellent book by Langville & Meyer.

Published in: Education, Technology

  1. 1. Understanding Search Engines
  2. 2. Basic Definitions: Search Engine <ul><li>Search engines are information retrieval (IR) systems designed to help find specific information stored in digital server and database systems. </li></ul><ul><li>Search engines are meant to minimize both the time required to find information and the amount of information that must be searched. </li></ul>
  3. 3. Our focus is on Web Information Retrieval, not traditional IR <ul><li>· Web IR means “search within the world’s largest and linked document collection.” </li></ul><ul><li>· This document collection is growing at a rate that is almost impossible to know. </li></ul><ul><li>· Links arise and disappear at an unknown rate. </li></ul>
  4. 4. Methods of IR and Search <ul><li>· Boolean Search </li></ul><ul><li>· Vector Space Model Search </li></ul><ul><li>· Probabilistic Model Search </li></ul><ul><li>· Meta Search </li></ul>
  5. 5. Boolean Search <ul><li>· One of the earliest and simplest computerized IR methods. </li></ul><ul><li>· Applies Boolean algebraic operations (AND, OR, NOT) to user keywords. </li></ul><ul><li>· AND = both x and y must be satisfied (both conditions; intersection, ∩) </li></ul><ul><li>· OR = either x or y is satisfied (either condition; union, ∪) </li></ul><ul><li>· NOT = only x, not y (a specific subset; set difference) </li></ul>
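The three operators above map directly onto set operations. A minimal sketch, assuming a tiny hand-made index (the terms and document ids are hypothetical, not from the slides):

```python
# Toy index: term -> set of document ids (hypothetical data for illustration).
index = {
    "car": {1, 2, 4},
    "maintenance": {2, 3},
    "auto": {3, 4},
}

def docs(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())

both = docs("car") & docs("maintenance")   # AND: set intersection
either = docs("car") | docs("auto")        # OR: set union
only_car = docs("car") - docs("auto")      # NOT: set difference (car but not auto)

print(sorted(both), sorted(either), sorted(only_car))
```

Note that every result is an unranked set: a document either matches or it does not, which is one reason Boolean search is fast but simplistic.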
  6. 6. Boolean Search 2 <ul><li>+’s </li></ul><ul><li>· Simple. Fast. Manageable. </li></ul><ul><li>-’s </li></ul><ul><li>· Simplistic; car + maintenance ≠ auto care (cannot handle polysemy and synonymy) </li></ul><ul><li>· Assumes the user has strong familiarity with the topic domain. </li></ul><ul><li>· Limited; best used for specific topics with a small vocabulary. </li></ul>
  7. 7. Vector Space Model Search <ul><li>· Developed in the early 1960s by Gerard Salton. </li></ul><ul><li>· Transforms text into numeric vectors and matrices, then uses matrix analysis techniques to discern features and semantic relationships.(!) </li></ul>
  8. 8. Vector Space Model 2 <ul><li>+’s </li></ul><ul><li>Incredibly powerful tool for keeping track of evolving meanings and shifting vocabularies. </li></ul><ul><li>Automatically includes relevance scores, thereby returning ranked search results.(!) </li></ul><ul><li>-’s </li></ul><ul><li>Computationally intense; requires massive computing power and does not scale to web-sized document sets. </li></ul>
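To make the "text into numeric vectors" and "relevance scores" ideas concrete, here is a minimal sketch. It assumes raw term counts as vector entries and cosine similarity as the score, and it leaves out the matrix-analysis step the slides mention; the documents and function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn text into a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical three-document collection.
documents = {
    "d1": "car maintenance and car repair",
    "d2": "auto care tips",
    "d3": "garden maintenance",
}

def rank(query):
    """Score every document against the query and sort best-first."""
    q = vectorize(query)
    scores = {doc_id: cosine(q, vectorize(text)) for doc_id, text in documents.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank("car maintenance")
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Unlike the Boolean model, every document gets a score, so results come back ranked. With plain term counts, "auto care tips" still scores zero for "car maintenance"; handling that kind of synonymy is what the additional matrix analysis is for.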
  9. 9. Probabilistic Model Search <ul><li>Uses a probability model to guess which documents a user will find relevant. The key to this model’s effectiveness is the set of initial conditions. </li></ul><ul><li>One of the most powerful initial conditions is an index of a user’s search history/search tendency. </li></ul><ul><li>Another initial condition is the search term. Some powerful search algorithms begin by broadening the search terms to include conceptually related documents. </li></ul><ul><li>Most appropriate for enterprises where complete understanding of an evolving topic domain or wordspace is mission critical. </li></ul><ul><li>Grapeshot </li></ul>
  10. 10. Probabilistic Model 2 <ul><li>+’s </li></ul><ul><li>Very powerful tool. Uses evolving meanings and shifting vocabularies to expand the search vectors. </li></ul><ul><li>Cutting edge. This is the area of greatest research interest, and greatest value generation. In other words, this is where the money is. </li></ul><ul><li>-’s </li></ul><ul><li>When there is no history, you have to start with assumptions; that can be devastating to relevance. </li></ul><ul><li>Very hard to build and therefore very expensive. Like, unbelievably expensive. Megabucks. </li></ul>
  11. 11. Meta Search <ul><ul><ul><li>If one search engine is good (but has drawbacks), why not combine several?! </li></ul></ul></ul><ul><ul><ul><li>That’s a MetaSearch engine. </li></ul></ul></ul><ul><ul><ul><li>Queries are sent to multiple engines, or multiple processors. </li></ul></ul></ul><ul><ul><ul><li>As you would expect, this can be very accurate, but very slow. </li></ul></ul></ul><ul><ul><ul><li>When they’re wrong, they’re monumentally wrong. </li></ul></ul></ul>
  12. 12. To make the perfect Web Search Engine, you must deal with the web’s externalities: <ul><ul><ul><li>You will have to search through the largest document set in the known universe. </li></ul></ul></ul><ul><ul><ul><li>That document set is changing. </li></ul></ul></ul><ul><ul><ul><li>The set is self-organizing; or more accurately, the set is completely disorganized. </li></ul></ul></ul><ul><ul><ul><li>It is hyperlinked. </li></ul></ul></ul>
  13. 13. The perfect web search engine: A Huge Document Set <ul><ul><ul><li>The web is, in fact, too big to accurately measure. </li></ul></ul></ul><ul><ul><ul><li>JAN 2004: 10,000,000,000+ pages </li></ul></ul></ul><ul><ul><ul><li>FEB 2007: 25,000,000,000+ pages </li></ul></ul></ul><ul><ul><ul><li>Surface web counts, not Deep Web. </li></ul></ul></ul>
  14. 14. The perfect web search engine: A Changing Document Set <ul><ul><ul><li>Cho and Garcia-Molina, 2000. The evolution of the Web and implications for an incremental crawler. Proceedings of the 26th International Conference on Very Large Data Bases </li></ul></ul></ul><ul><ul><ul><li>40% of pages in sample changed within 7 days </li></ul></ul></ul><ul><ul><ul><li>23% changed within 24 hours </li></ul></ul></ul><ul><ul><ul><li>* Growth rate is unknown, but significant </li></ul></ul></ul>
  15. 15. The perfect web search engine: A Self-Organizing Set <ul><ul><ul><li>There are no standards for content, minimal control over structure, no rules for formats. The data are volatile and subject to error, dishonesty, link-rot, and file disappearance. </li></ul></ul></ul><ul><ul><ul><li>Data exist in multiple formats; in duplicate; or they don’t exist until a specific request. </li></ul></ul></ul><ul><ul><ul><li>Data are re-created for many different uses and conditions (shopping, research, entertainment, way-finding). </li></ul></ul></ul>
  16. 16. The perfect web search engine: A Hyperlinked Set <ul><ul><ul><li>Thank God. </li></ul></ul></ul><ul><ul><ul><li>The availability of hyperlinks creates an additional layer of meaning. This also places the web document set into a relational framework that can be very accurately described using a branch of mathematics called topology. </li></ul></ul></ul><ul><ul><ul><li>Hyperlinks (the only new form of punctuation created in the last 500 years) allow us to do ranked searches. </li></ul></ul></ul>
  17. 17. Designing a precise* search mechanism. <ul><ul><ul><li>Crawler Module </li></ul></ul></ul><ul><ul><ul><li>Page Repository </li></ul></ul></ul><ul><ul><ul><li>Indexing Module </li></ul></ul></ul><ul><ul><ul><li>Indexes </li></ul></ul></ul><ul><ul><ul><li>Query Module </li></ul></ul></ul><ul><ul><ul><li>Ranking Module </li></ul></ul></ul>
  18. 18. The Pieces (Google style)
  19. 19. The Crawler Module (CM) <ul><ul><ul><li>A distributed system of software robots (bots, spiders) designed to examine and record the content and structure of pages within a site within a defined domain. </li></ul></ul></ul><ul><ul><ul><li>CM gives bots root URLs </li></ul></ul></ul><ul><ul><ul><li>Spiders consume resources! (bandwidth, quotas) </li></ul></ul></ul><ul><ul><ul><li>Should conform to ethical crawling (robots.txt) </li></ul></ul></ul>
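The "ethical crawling" point can be illustrated with Python's standard `urllib.robotparser`. This is a minimal sketch, not a crawler: the robots.txt content, the bot name `ExampleBot`, and the root URLs are all hypothetical, and the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url, agent="ExampleBot"):
    """Return True if the parsed robots.txt rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

# Hypothetical root URLs handed to the bots by the crawler module.
root_urls = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]
to_crawl = [url for url in root_urls if allowed(url)]
print(to_crawl)
```

Checking the rules before each fetch is the minimum; a polite crawler would also limit its request rate, since spiders consume the site's bandwidth.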
  20. 20. The Page Repository <ul><ul><ul><li>Temporary storage for full page contents and link structure. </li></ul></ul></ul><ul><ul><ul><li>Valuable and popular pages can be stored for longer term. </li></ul></ul></ul>
  21. 21. Indexing Module · A software processor that applies a compression algorithm. · For content, the algorithm generates an inverted file index. · Also yields Structure Indexes, and Special-purpose Indexes (for PDFs and video) · Example inverted index (term: word positions in the text above): Software: 2 · Processor: 3 · Compression: 7 · Algorithm: 8, 12 · Index: 17 · Indexes: 21, 25
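The term and number pairs on the slide above appear to be an inverted index of the slide's own text: each term is listed with the positions of the words where it occurs. A minimal sketch that rebuilds it, assuming words are runs of letters (so "Special-purpose" counts as two words) and positions start at 1; the function name is hypothetical:

```python
import re
from collections import defaultdict

# The slide's own bullet text, used as the document to index.
slide_text = (
    "A software processor that applies a compression algorithm. "
    "For content, the algorithm generates an inverted file index. "
    "Also yields Structure Indexes, and Special-purpose Indexes"
)

def build_inverted_index(text):
    """Map each lowercase word to the list of 1-based positions where it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[A-Za-z]+", text.lower())
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    return dict(index)

index = build_inverted_index(slide_text)
for term in ("software", "processor", "compression", "algorithm", "index", "indexes"):
    print(term, index[term])
```

The index is "inverted" because it goes from term to location rather than from document to its words, which is what lets the query module look up a keyword without scanning every page.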
  22. 22. Indexes Storage area for inverted files and other processed page results. These are the valuable assets of an Internet Search company.
  23. 23. The Query Module · The software that handles user queries. · Interacts with the ranking module, the indexes, and the page repository. · Must be fast! · In Feb 2003, Google reported serving 250,000,000 searches per day (about 2,894 queries per second). (Langville & Meyer, 2006)
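The per-second figure on the slide above follows from dividing the daily total by the number of seconds in a day (86,400), assuming queries are spread evenly over the day:

```latex
\frac{250{,}000{,}000\ \text{queries/day}}{86{,}400\ \text{s/day}} \approx 2{,}894\ \text{queries/s}
```

This is an average; peak load would be higher than that.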
  24. 24. The Ranking Module The software that examines the hyperlink structure and calculates a page’s value.
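One way to turn hyperlink structure into a page value is a PageRank-style iteration. The sketch below is a simplified illustration, not Google's implementation: the damping factor of 0.85, the fixed iteration count, the handling of pages without outlinks, and the four-page link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along outlinks.

    links maps each page to the list of pages it links to; every target
    must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share, independent of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked from A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most inbound links (C) ends up with the highest value, and the page nobody links to (D) ends up with the lowest.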
  25. 25. 3 guys and 2 theses <ul><li>Sergey Brin, Larry Page and Jon Kleinberg </li></ul><ul><li>HITS and PageRank™ </li></ul><ul><li>More on this next class. </li></ul>
  26. 26. An Excellent History (the key reference text) Amy Langville, Carl Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006
  27. 27. Questions & Discussion <ul><li>Ask questions now. </li></ul>
