2. Contents
Introduction: what do web search engines
mean for us today?
History of web search engines
How web search engines work
Most popular search engines
Conclusion: past, present and future of web
search
4. The Web as a huge store of
information
A huge amount of information is contained in
the World Wide Web
And this amount is still growing
day by day
We need to orient ourselves in this enormous
information space
Web search engines let us quickly find
the information that we are
interested in
5. Web search engines in our life
We use web search engines every day for:
Searching for texts, articles, books, news, etc.
Searching for different media: music, videos, films,
pictures, etc.
Searching for goods
Searching for web sites and web portals
Preparing lectures and presentations ☺
…
The verb “to google” has entered dictionaries
Web search engines have become an integral
part of our life
7. The very first search tools
1989–1991 – the invention of the World Wide
Web by Sir Tim Berners-Lee at CERN
Archie (1990)
The first Internet search tool
Fetching and indexing files on FTP servers
Providing search for indexed files
Veronica and Jughead – search tools similar to Archie,
built for the Gopher protocol (introduced in 1991)
8. The first web search engines
W3Catalog (1993)
The first primitive search engine
Mirroring and integration of manually maintained
catalogues
Still available: http://www.w3catalog.com/
World Wide Web Wanderer (1993)
The first web crawler
The first web index called Wandex
Aimed to measure the size of the Web, not to serve as a search
tool
9. The first web search engines
JumpStation (1993)
The first web search engine combining crawling,
indexing and searching
A web form for search queries
No ranking, just listing search results
Excite (1994)
The first ranking system
WebCrawler (1994)
Indexing full text
The first widely known web search engine
10. Web search evolution
1994–1997 – a number of similar web search
engines:
Infoseek
OpenText
Magellan
Inktomi
Northern Light
AskJeeves
AltaVista
11. Web search evolution
Yahoo! (1994)
Search in a human-edited hierarchical web directory
Relevance judged manually by editors
Search by keywords as well as browsing the full
directory
Gained large popularity
Later, in 2004, developed its own web search engine
One of the main stars of the business world in the 1990s
12. Web search evolution
Google (1998)
The invention of PageRank
Simple and clear interface instead of turning into a
web portal
Yandex (1997)
Full-text search with Russian morphology support
Quickly gained large popularity in Russia
13. Web search engines today
Powerful web search technologies
Maximal freshness of results
Variety of types of searchable documents
Intelligent ranking algorithms
Media search:
Images
Music
Videos
…
14. Web search engines today
Personalized search
Based on user's search history
Based on personal information from virtual
social spaces
Location-based search
Vertical search
Image-based search
Audio-based search
16. Basic principles of web search
Create and sort a pool of data
Find the most appropriate information
Deliver this information
17. Basic parts of a web search engine
A web spider/crawler/robot – a computer
program which:
Continuously traverses web pages
Finds new or changed content
Stores visited pages in a corpus
Index – a database containing crawling results
Search engine – a computer program which:
Identifies pages relevant to the search query
Retrieves these pages
Ranks them
User interface
18. Web crawling
Web crawling traverses web pages
and stores their copies for further indexing
General web crawler algorithm:
Starts with a list of initial URLs, called
the seeds
Visits these URLs
Retrieves the required information from each page
Identifies all the hyperlinks on the page
Adds these links to the queue of URLs, called the
crawl frontier
Recursively visits URLs from the crawl frontier
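The crawler algorithm above can be sketched in Python as follows. This is a minimal illustration, not a production crawler: the `crawl` function, the `fetch` callback and the in-memory `FAKE_WEB` are hypothetical names introduced here so the example runs without network access, and the "recursive" visiting is written as a loop over a queue.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    `fetch` is a caller-supplied function mapping a URL to its HTML text,
    or None if the page cannot be retrieved (hypothetical interface).
    Returns a dict {url: html} of stored page copies.
    """
    frontier = deque(seeds)      # the crawl frontier
    seen = set(seeds)            # URLs already queued, to avoid revisits
    corpus = {}

    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        corpus[url] = html       # store a copy for later indexing

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return corpus


# A tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

pages = crawl(["http://example.com/"], FAKE_WEB.get)
```

The `seen` set keeps each URL from being queued twice, which is what stops the crawl from looping forever on pages that link back to each other.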
20. Crawling policies
A selection policy
Focused crawling
Restricting followed links
URL normalization
Path-ascending crawling
A re-visit policy
Uniform policy
Proportional policy
A politeness policy
A parallelization policy
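As an illustration of one item from the selection policy, URL normalization, here is a minimal sketch using Python's standard `urllib.parse`. The function name `normalize_url` and the chosen rules (lowercase scheme and host, drop default ports and fragments, use "/" for an empty path) are assumptions made for this example. Real crawlers usually apply more rules, and this sketch ignores user info and IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of `url` so equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With these rules, `HTTP://Example.COM:80/index.html#top` and `http://example.com/index.html` map to the same string, so the crawler stores the page only once.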
21. Indexing
Indexing makes it fast to find the documents in the
corpus that are relevant to a search query.
For example, with 10,000 documents:
A query takes milliseconds with the help of an index
A sequential scan could take hours
Meta search engines reuse the indices of other
services and do not store a local index
E.g. vertical search can use the indices of vertical
services
22. Inverted index
Stores, for each word, a list of the documents
containing this word
Provides direct access to the documents
associated with each word in the search query
Commonly used by web search engines
Inconvenient to update
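A minimal sketch of an inverted index in Python, assuming a naive whitespace tokenizer and a query that must match every word. The names `tokenize`, `build_inverted_index`, `search` and the sample `DOCS` are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


DOCS = {
    1: "web search engines crawl the web",
    2: "an inverted index speeds up search",
    3: "crawlers store pages in a corpus",
}
index = build_inverted_index(DOCS)
```

Here `search(index, "web search")` only intersects the two posting sets for "web" and "search"; no document text is scanned at query time.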
23. Forward index
Stores a list of words for each document
It is easier to store the words of each document
immediately while it is parsed
Enables asynchronous processing – much easier
to update than an inverted index
Is stored so that it can be transformed into an inverted index
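The transformation from a forward index to an inverted index can be sketched as below. The function names and the two sample documents are hypothetical, and tokenization is again a plain whitespace split.

```python
def build_forward_index(documents):
    """Map each document id to the list of words it contains, in order."""
    return {doc_id: text.lower().split() for doc_id, text in documents.items()}


def invert(forward_index):
    """Transform a forward index into an inverted index (word -> sorted doc ids)."""
    inverted = {}
    for doc_id, words in forward_index.items():
        for word in set(words):           # count each document once per word
            inverted.setdefault(word, []).append(doc_id)
    for postings in inverted.values():
        postings.sort()
    return inverted


forward = build_forward_index({1: "fast web search", 2: "web index"})
inverted = invert(forward)
```

Adding a document to the forward index touches a single entry, while the inverted index has to be updated for every word of that document. This is one way to read the claim that the forward index is easier to update.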
24. Ranking
Ranking is an arrangement of web search
results in order of relevance
Usually based on statistical methods
Frequency of keywords in a particular document
Rating page popularity and authority
Advanced search engines also use intelligent
ranking algorithms
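As a sketch of the simplest statistical approach, keyword frequency, the following Python code scores each document by the share of its words that match the query. The scoring function is an assumption chosen for illustration; real engines combine many more signals.

```python
from collections import Counter


def term_frequency_score(query, text):
    """Sum, over the query words, of each word's relative frequency in the text."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[w] for w in query.lower().split()) / len(words)


def rank(query, documents):
    """Return document ids ordered from most to least relevant to the query."""
    scores = {doc_id: term_frequency_score(query, text)
              for doc_id, text in documents.items()}
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


DOCS = {
    "a": "search engines rank search results",
    "b": "a crawler visits pages",
    "c": "web search is everywhere today",
}
ranking = rank("search", DOCS)
```

Document "a" mentions "search" twice in five words, "c" once in five, and "b" not at all, so the ranking is a, c, b.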
25. Google PageRank
PageRank was invented in 1998 by Larry Page
and Sergey Brin at Stanford University
It rates the authority of a web page relative
to other web pages
Basic principles:
A hyperlink to a page counts as a vote of support
A page with a large number of incoming links has high
authority
A hyperlink coming from an authoritative web page
gives more points
PR(p) is the probability that a person randomly
clicking on links will arrive at page p
26. Google PageRank
[Figure: pages A, B, C and D with three sets of PageRank values]
Page:    A     B     C     D
Values:  0.25  0.25  0.25  0.25
Values:  1/2   1/6   1/6   1/6
Values:  6/17  2/17  3/17  6/17
27. Google PageRank
So, PageRank of page A:
In the general case, the PageRank value for
any page u:
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$
where B_u is the set of all pages linking to
page u, and L(v) is the number of links from page v.
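One update step of the general formula, PR(u) as the sum of PR(v) / L(v) over all pages v in B_u, can be sketched in Python as follows. The link graph `LINKS` is a hypothetical example and is not the graph from the slides. Exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def pagerank_step(links, ranks):
    """Apply PR(u) = sum of PR(v) / L(v) over all pages v linking to u, once."""
    new_ranks = {page: Fraction(0) for page in links}
    for v, targets in links.items():
        for u in targets:
            new_ranks[u] += ranks[v] / len(targets)   # v's vote is split over its L(v) links
    return new_ranks


# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
ranks = {page: Fraction(1, 4) for page in LINKS}
ranks = pagerank_step(LINKS, ranks)
```

Starting from 1/4 for every page, page A receives 1/4 from C and 1/12 from D, giving 1/3. Page D has no incoming links and drops to 0.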
28. Google PageRank
Spider traps:
[Figure: link graph of pages A, B and C]
Damping factor
d – probability that the random surfer continues the traversal
(1-d) – probability of going to a random page
The resulting formula:
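A commonly cited form of the damped formula, assuming N denotes the total number of pages (N is not defined on these slides), is $$PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$. The sketch below iterates this form in Python. The graph `TRAP`, the default d = 0.85 and the handling of pages without outgoing links are assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(u) = (1 - d) / N + d * sum(PR(v) / L(v)) from a uniform start."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - d) / n for page in pages}
        for v, targets in links.items():
            if targets:
                share = d * ranks[v] / len(targets)
                for u in targets:
                    new_ranks[u] += share
            else:
                # Dangling page (no outgoing links): spread its rank evenly.
                for u in pages:
                    new_ranks[u] += d * ranks[v] / n
        ranks = new_ranks
    return ranks


# Hypothetical graph with a spider trap: B and C link only to each other.
TRAP = {"A": ["B"], "B": ["C"], "C": ["B"]}

undamped = pagerank(TRAP, d=1.0)
damped = pagerank(TRAP, d=0.85)
```

Without damping (d = 1), all rank ends up inside the B–C trap and page A falls to zero. With d = 0.85, every page keeps at least (1 - d) / N, so A stays at about 0.05.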
31. Google
Started in 1996 as a research project of
Larry Page and Sergey Brin at Stanford
University
Was launched in 1998
By the end of 1998 it already
had an index of about 60
million pages
Quickly gained popularity thanks
to the PageRank algorithm
32. Google
Today Google is the most popular web search
engine in the world: 85% of the web search market
Provides many other services:
Gmail
Google Maps
Google+
…
Has its own OS – Android
Provides a web browser – Google Chrome
…
33. Yandex
Was founded in 1997 by
Arkady Volozh and Ilya Segalovich
The first web search engine providing
morphological search
The prototype of Yandex search engine was a
system for automated searching of the Bible
The name stands for “Yet Another iNDEXer”
34. Yandex
In 1998 Yandex launched
contextual advertising
In 2001 Yandex.Direct was launched: an
automated, auction-based system for
placing text-based advertising
2005 – Ukrainian portal, www.yandex.ua
2008 – Yandex Labs in the San Francisco Bay Area
2010 – English version of the web search engine
2011 – a search engine and a range of other
services in Turkey, at yandex.com.tr
36. Yandex today
63% of Russian web search market
More than 3500 employees
24 offices in 8 countries
38. Conclusion
Web search engines are an integral part of our
life today
They have come a long way to reach
today's performance and power
Their development is far from finished
The main development trends are:
Web search personalization
Location-based search
Vertical search