2. Index
✔ Prelude
✔ How search engine works
✔ Search Engine Architecture
✔ Working
✔ Features(Web Crawling,Page Rank,etc.)
✔ Problems
✔ Future
✔ Conclusion
3. Prelude
✔ A Search Engine is an information retrieval system
designed to help find information stored on a computer
system, such as on the
● World Wide Web,
● inside a corporate or proprietary network,
● or in a personal computer.
✔ The search engine allows one to ask for content meeting
specific criteria (typically those containing a given word or
phrase) and retrieves a list of items that match those criteria.
4. How search engine works
✔ SEARCH STARTS WITH THE WEB. IT'S MADE UP OF OVER
60 TRILLION INDIVIDUAL PAGES AND IT'S CONSTANTLY
GROWING.
✔ Search engine navigates the web by:
● Web Crawling
● Indexing
● Searching
6. Working...
✔ There is an URL server that sends lists of URL to be
fetched to the crawlers.
✔ The web pages that are fetched are then sent to the store
server.
✔ The store server then compresses and stores the web
pages into a repository.
✔ Every web page has an associated ID number called a
doc ID which is assigned whenever a new URL is parsed
out of a web page.
7. Working...(cont.)
✔ The indexing function is performed by the indexer and the
sorter.
✔ The indexer performs a number of functions. It reads the
repository, uncompresses the documents, and parses them.
✔ The indexer distributes hits into a set of "barrels", creating a
partially sorted forward index.
✔ It parses out all the links in every web page and stores
important information about them in anchor files.
✔ Anchor file contains enough information to determine where
each link points from and to, and the text of the link.
8. Working...(cont.)
✔ The URL resolver reads the anchors file and converts
relative URL into absolute URL and in turn into doc ID.
✔ It puts the anchor text into the forward index, associated
with the doc ID that the anchor points to. It also
generates a database of links which are pairs of doc IDs.
The links database is used to compute Page Ranks for
all the documents.
✔ The sorter takes the barrels, which are sorted by doc ID
and resorts them by word ID to generate the inverted
index.
9. Working...(cont.)
✔ A program called Dump Lexicon takes this list together
with the lexicon produced by the indexer and generates a
new lexicon to be used by the searcher.
✔ The searcher is run by a web server and uses the lexicon
built by Dump Lexicon together with the inverted index
and the Page Ranks to answer queries.
10. Crawling the Web (Web Crawler)
✔ A web crawler or spider is used to create an index of web pages to
be used by a web search engine.
✔ Web search is based on this index.
✔ Search engines has a fast distributed crawling system.
✔ A single URL server serves lists of URL to a number of crawlers.
✔ Both the URL server and the crawlers are implemented in Python.
✔ Each crawler keeps roughly 300 connections open at once. At peak
speeds, the system can crawl over 100 web pages per second using
four crawlers. This amounts to roughly 600K per second of data.
✔ Each crawler maintains its own DNS cache so it does not need to do
a DNS lookup before crawling each document.
12. Indexing the Web
✔ Parsing
◦ Any parser which is designed to run on the entire Web must handle a huge
array of possible errors.
✔
Indexing Documents into Barrels
◦ After each document is parsed, it is encoded into a number of barrels. Every
word is converted into a word ID by using an in-memory hash table -- the
lexicon.
◦ Once the words are converted into word ID, their occurrences in the current
document are translated into hit lists and are written into the forward barrels.
✔ Sorting
◦ the sorter takes each of the forward barrels and sorts it by word ID to produce
an inverted barrel for title and anchor hits and a full text inverted barrel.
14. Web Searching
✔ Match the query terms with words in the index
✔ Sort documents by relevance
✔ Find files or records
✔ Open each one and read it
✔ Store each word in a search able index
✔ Kinds of information
✔ Dates
✔ Languages
✔ Update processes
16. Other System Features
✔ It makes use of the link structure of the Web to calculate
a quality ranking for each web page, called Page Rank
◦ Page Rank is a trademark of Google. The Page Rank
process has been patented.
✔ Search Engine utilizes link to improve search results.
17. Page Rank
✔ Page Rank is a link analysis algorithm which assigns a
numerical weighting to each Web page, with the purpose
of "measuring" relative importance.
Based on the hyper links map
An excellent way to prioritize the results of web keyword
searches
18. Intuitive Justification
✔ A "random surfer" who is given a web page at random and keeps
clicking on links, never hitting "back“, but eventually gets bored
and starts on another random page.
◦ The probability that the random surfer visits a page is its Page
Rank.
◦ The d damping factor is the probability at each page the "random
surfer" will get bored and request another random page.
✔ A page can have a high Page Rank
◦ If there are many pages that point to it
◦ Or if there are some pages that point to it, and have a high Page
Rank.
19. Anchor Text
● <A href =" http://www.yahoo.com/">Yahoo!</A>
Besides the text of a hyper link (anchor text) is
associated with the page that the link is on, it is also
associated with the page the link points to.
◦ anchors often provide more accurate descriptions of
web pages than the pages themselves.
◦ anchors may exist for documents which cannot be
indexed by a text-based search engine, such as images,
programs, and databases.
20. Major Data Structures
✔
Big Files
virtual files spanning multiple file systems and are addressable by 64 bit integers.
✔ Repository
contains the full HTML of every web page.
✔ Document Index
keeps information about each document.
✔
Lexicon
two parts – a list of the words and a hash table of pointers.
✔
Hit Lists
a list of occurrences of a particular word in a particular document including position, font, and
capitalization information.
✔
Forward Index
stored in a number of barrels
✔
Inverted Index
consists of the same barrels as the forward index, except that they have been processed by the sorter.
21. Search is NOT a Panacea
✔ Search can’t find what’s not there
◦ The content is hugely important
✔ Information Architecture is vital
✔ Usable sites have good navigation and structure
22. Problems
✔ High Quality Search
The biggest problem facing users of web search engines
today is the quality of the results they get back.
✔ Scalable Architecture
Google is designed to scale. It must be efficient in both
space and time
23. The Future
“The ultimate search engine would
understand exactly what you mean
and give back exactly what you
want.”
- Larry Page
24. Conclusion
✔ The primary goal is to provide high quality search results
over a rapidly growing World Wide Web.
✔ Search engines employs a number of techniques to
improve search quality including page rank, anchor text,
and proximity information.
✔ Search engines is a complete architecture for gathering
web pages, indexing them, and performing search
queries over them.
25. And thats how search works...
● THANK YOU
Presented by:Presented by:
SAURABH GOELSAURABH GOEL
111438111438