This document summarizes a seminar on the history of search and web search engines given by Prof. Beat Signer. It discusses early methods of information storage and retrieval, from papyrus and parchment to the Dewey Decimal Classification. It also covers the development of hypertext from Vannevar Bush's Memex to Ted Nelson's Project Xanadu and the creation of the World Wide Web by Tim Berners-Lee and Robert Cailliau. The document then gives an overview of the history of search engines, from Archie to early web search engines such as AltaVista and Yahoo!, and introduces basic information retrieval concepts such as precision, recall, the boolean model and the vector space model.
History of Search and Web Search Engines - Seminar on Web Search
1. 2 December 2005
Seminar on Web Search
History of Search and Web Search Engines
Prof. Beat Signer
Department of Computer Science
Vrije Universiteit Brussel
http://vub.academia.edu/BeatSigner
2. Beat Signer - Department of Computer Science - bsigner@vub.ac.be - September 5, 2011
Seminar Organisation
Prof. Beat Signer
WISE Lab, Vrije Universiteit Brussel
bsigner@vub.ac.be
cross-media information spaces
and architectures
interactive paper and augmented reality
multimodal and multi-touch interaction
Content of the Seminar
history of search and web search engines
search engine optimisation (SEO) and
search engine marketing (SEM)
current and future trends in web search
3. Early "Documents"
4. Papyrus
Greeks and Romans
stored information on
papyrus scrolls
Tags with a summary of
the content facilitated the
retrieval of information
Tables of contents were
introduced around 100 BC
Parchment (vellum) came
up as an alternative
bound in book form
5. Paper
Invented in China (105 AD)
Brought to Europe only in
the twelfth century
Took another 300 years
before paper became the
major writing material
How long will we still use
paper?
electronic paper vs.
augmented paper
6. Printing Press
Johann Gutenberg
invented the printing press
in 1450
Gutenberg Bible published
in 1455
Growing libraries and
need to search for
information
7. Reading Wheel (Bookwheel)
Described by Agostino
Ramelli in 1588
Keep several books open
to read from them at the
same time
comparable to modern
tabbed browsing
The reading wheel has
never really been built
Could be seen as a
predecessor of hypertext
8. Dewey Decimal Classification (DDC)
Library classification
system
developed by Melvil Dewey
in 1876
Hierarchical classification
10 main classes with
10 divisions each and
10 sections per division
total of 1000 sections
often separate fiction section
Documents can appear in
more than one class
9. Dewey Decimal Classification (DDC) ...
After the three numbers,
decimals can be used for
further subclassification
Alternatives
Library of Congress
classification
Universal Decimal
Classification (UDC)
10. Dewey Decimal Classification (DDC) ...
000-099 Computer Science, Information and General Works
  000 Computer Science, Knowledge and Systems
    000 Computer Science, Knowledge and General Works
    ...
    005 Computer Programming, Programs and Data
    ...
    009 [Unassigned]
  010 Bibliographies
  ...
100-199 Philosophy and Psychology
200-299 Religion
300-399 Social Sciences
  340 Law
    341 International Law
400-499 Language
500-599 Science
600-699 Technology
700-799 Arts
800-899 Literature
900-999 History, Geography and Biography
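The hierarchy of 10 main classes, 10 divisions per class and 10 sections per division can be illustrated with a small Python sketch. The function name `ddc_levels` is hypothetical, and the sketch only handles the three-digit part of a DDC number (decimal subclassification is ignored).

```python
def ddc_levels(section: int) -> tuple[int, int, int]:
    """Return (main class, division, section) for a three-digit DDC number."""
    if not 0 <= section <= 999:
        raise ValueError("a DDC section number must be between 000 and 999")
    main_class = section // 100 * 100  # first digit: one of 10 main classes
    division = section // 10 * 10      # first two digits: one of 100 divisions
    return main_class, division, section


# 341 (International Law) lies in division 340 (Law)
# of main class 300 (Social Sciences)
print(ddc_levels(341))
```

Because each level only fixes one more digit, walking up the hierarchy amounts to zeroing trailing digits.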
11. "As We May Think" (1945)
... When data of any sort are placed in
storage, they are filed alphabetically
or numerically, and information is
found (when it is) by tracing it down
from subclass to subclass. It can be in
only one place, unless duplicates are
used; one has to have rules as to which
path will locate it, and the rules are
cumbersome. Having found one
item, moreover, one has to emerge from
the system and re-enter on a
new path. The human mind does not work
that way. It operates by association.
...
Vannevar Bush
12. "As We May Think" (1945) …
... It affords an immediate step,
however, to associative indexing, the
basic idea of which is a
provision whereby any item may be
caused at will to select immediately
and automatically another. This is the
essential feature of the memex. The
process of tying two items together is
the important thing. ...
Vannevar Bush, As We May Think,
Atlantic Monthly, July 1945
Vannevar Bush
13. "As We May Think" (1945) …
Bush's article "As We May Think"
(1945) is often seen as
the "origin" of hypertext
Article introduces the Memex
prototypical hypertext machine
store and access information
follow cross-references in the form
of associative trails between pieces
of information (microfilms)
trail blazers are those who find
delight in the task of establishing
useful trails
Memex
14. Memex Movie
15. Hypertext (1965)
Ted Nelson coined the term hypertext
Nelson started Project Xanadu in 1960
first hypertext project
nonsequential writing
referencing/embedding parts of a document
in another document (transclusion)
transpointing windows
bidirectional (bivisible) links
version and rights management
XanaduSpace 1.0 was released as part of Project
Xanadu in 2007
Ted Nelson
16. World Wide Web (WWW)
Networked hypertext system
(over ARPANET) to share
information at CERN
first draft in March 1989
The Information Mine,
Information Mesh, …?
Components by end of 1990
HyperText Transfer Protocol (HTTP)
HyperText Markup Language (HTML)
HTTP server software
Web browser (WorldWideWeb)
First public "release" in August 1991
Tim Berners-Lee, Robert Cailliau
17. Search Engine History
Early "search engines" include various systems
starting with Bush's Memex
Archie (1990)
first Internet search engine
indexing of files on FTP servers
W3Catalog (September 1993)
first "web search engine"
mirroring and integration of manually maintained catalogues
JumpStation (December 1993)
first web search engine combining crawling, indexing and
searching
18. Search Engine History ...
In the following two years (1994/1995) many
new search engines appeared
AltaVista, Infoseek, Excite, Inktomi, Yahoo!, ...
Two categories of early Web search solutions
full text search
- based on an index that is automatically created by a web crawler in
combination with an indexer
- e.g. AltaVista or Infoseek
manually maintained classification (hierarchy) of webpages
- significant human editing effort
- e.g. Yahoo!
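The indexer side of a full text search solution can be sketched as a tiny inverted index in Python. This is only an illustration under simplifying assumptions: the page URLs and texts are made up, the crawler is replaced by a hard-coded dictionary, and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every term to the set of pages (URLs) that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the pages containing all query terms."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Stand-in for crawled pages (hypothetical content)
pages = {
    "a.example": "train to Ghent",
    "b.example": "shopping in Ghent",
    "c.example": "metro and train",
}
index = build_index(pages)
print(search(index, "ghent shopping"))
```

A manually maintained directory, in contrast, needs no such index but relies on editors assigning each page to a category.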
19. Information Retrieval
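Precision and recall can be used to measure the
performance of different information retrieval algorithms
$$\text{precision} = \frac{|\text{relevant documents} \cap \text{retrieved documents}|}{|\text{retrieved documents}|}$$
$$\text{recall} = \frac{|\text{relevant documents} \cap \text{retrieved documents}|}{|\text{relevant documents}|}$$
[Figure: collection of documents D1 to D10 with the relevant documents {D3, D5, D8, D9} and the query result {D1, D3, D8, D9, D10}]
$$\text{precision} = \frac{3}{5} = 0.6 \qquad \text{recall} = \frac{3}{4} = 0.75$$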
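The two measures can be checked with a few lines of Python. This is only a minimal sketch: the function name `precision_recall` is hypothetical, and the relevant set {D3, D5, D8, D9} is an assumption read off the example figure because it reproduces the values 0.6 and 0.75.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document identifiers."""
    hits = relevant & retrieved  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed example data (read off the slide figure)
relevant = {"D3", "D5", "D8", "D9"}
retrieved = {"D1", "D3", "D8", "D9", "D10"}

precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)  # 0.6 0.75
```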
Information Retrieval ...
Often a combination of precision and recall, the so-called
F-score (harmonic mean) is used as a single measure
$$F\text{-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
[Figure: two query results over the documents D1 to D10 with the relevant documents {D3, D5, D8, D9}]
query result {D1, D2, D3, D5, D8, D9, D10}: precision = 0.57, recall = 1, F-score = 0.73
query result {D1, D3, D8, D9, D10}: precision = 0.6, recall = 0.75, F-score = 0.67
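The F-score values on this slide can be reproduced as follows. This is a sketch under assumptions: the helper name `f_score` is hypothetical, and the precision of 0.57 is taken to be 4/7 (four relevant documents in a result of seven documents).

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


first = round(f_score(4 / 7, 1.0), 2)    # larger query result, recall 1
second = round(f_score(0.6, 0.75), 2)    # smaller query result
print(first, second)  # 0.73 0.67
```

The larger result gets the better F-score here because the gain in recall outweighs the small loss in precision.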
Boolean Model
Based on set theory and boolean logic
Exact matching of documents to a user query
Uses the boolean AND, OR and NOT operators
query: Shopping AND Ghent AND NOT Delhaize
computation: 101110 AND 100111 AND 000111 = 000110
result: document set {D4,D5}
[Figure: term-document incidence matrix with the terms Bank, Delhaize, Ghent, Metro, Shopping and Train as rows and the documents D1 to D6 as columns; an entry of 1 means that the term occurs in the document]
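The query evaluation above maps directly onto bitwise operations. The following sketch assumes that the three bit vectors in the computation belong to Shopping, Ghent and NOT Delhaize in query order, and that the leftmost bit stands for D1; the names `TERM_VECTORS` and `docs_from_bits` are hypothetical.

```python
NUM_DOCS = 6
ALL_DOCS = (1 << NUM_DOCS) - 1  # 111111, used to keep NOT within six bits

# Assumed incidence vectors (leftmost bit = D1, rightmost bit = D6)
TERM_VECTORS = {
    "Shopping": 0b101110,
    "Ghent": 0b100111,
    "Delhaize": 0b111000,  # complement of 000111 from the computation
}


def docs_from_bits(bits):
    """Translate a bit vector into the list of matching document names."""
    return [
        f"D{i + 1}"
        for i in range(NUM_DOCS)
        if bits & (1 << (NUM_DOCS - 1 - i))
    ]


not_delhaize = ~TERM_VECTORS["Delhaize"] & ALL_DOCS
result = TERM_VECTORS["Shopping"] & TERM_VECTORS["Ghent"] & not_delhaize
print(format(result, "06b"), docs_from_bits(result))  # 000110 ['D4', 'D5']
```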
Boolean Model ...
Advantages
relatively easy to implement and scalable
fast query processing based on parallel scanning of indexes
Disadvantages
does not pay attention to synonymy
does not pay attention to polysemy
no ranking of output
often the user has to learn a special syntax such as the use of
double quotes to search for phrases
Variants of the boolean model form the basis for many
search engines
Vector Space Model
Algebraic model representing text documents and
queries as vectors based on the index terms
one dimension for each term
Compute the similarity (angle) between the query vector
and the document vectors
Advantages
simple model based on linear algebra
partial matching with relevance scoring for results
potential query reevaluation based on user relevance feedback
Disadvantages
computationally expensive (similarity measures for each query)
limited scalability
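A minimal sketch of the vector space model, assuming raw term frequencies as vector components and the cosine of the angle as similarity measure (real systems typically use weighted terms such as tf-idf). The function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (document name, score) pairs sorted by decreasing similarity."""
    q = Counter(query.lower().split())
    scores = {
        name: cosine_similarity(q, Counter(text.lower().split()))
        for name, text in documents.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample documents
documents = {
    "D1": "shopping in ghent",
    "D2": "train to ghent",
    "D3": "bank holiday",
}
ranking = rank("shopping ghent", documents)
print(ranking)
```

Unlike the boolean model, D2 still receives a non-zero score although it matches only one of the two query terms (partial matching).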
Web Search Engines
Most web search engines are based on traditional
information retrieval techniques but they have to be
adapted to deal with the characteristics of the Web
immense amount of web resources (>50 billion webpages)
hyperlinked resources
dynamic content with frequent updates
self-organised web resources
Evaluation of performance
no standard collections
often based on user studies (satisfaction)
Of course not only the precision and recall but also the
query answer time is an important issue
What About Old Content?
The Internet Archive
Web Crawler
A web crawler or spider is used to create an
index of webpages to be used by a web search engine
any web search is then based on this index
A web crawler has to deal with the following issues
freshness
- the index should be updated regularly (based on webpage update frequency)
quality
- since not all webpages can be indexed, the crawler should give priority to
"high quality" pages
scalability
- it should be possible to increase the crawl rate by just adding additional
servers (modular architecture)
- e.g. the estimated number of Google servers in 2007 was 1'000'000 (including
not only the crawler but the entire Google platform)
Web Crawler ...
distribution
- the crawler should be able to run in a distributed manner (computer centers all
over the world)
robustness
- the Web contains a lot of pages with errors and a crawler has to deal with
these problems
- e.g. deal with a web server that creates an unlimited number of "virtual web
pages" (crawler trap)
efficiency
- resources (e.g. network bandwidth) should be used in the most efficient way
crawl rates
- the crawler should pay attention to existing web server policies
(e.g. revisit-after HTML meta tag or robots.txt file)
Example robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
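A crawler can check such a policy with the `urllib.robotparser` module from the Python standard library. This is a minimal sketch: the rules are parsed from a string instead of being downloaded, and the crawler name `ExampleCrawler` as well as the `www.example.org` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the slide
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="ExampleCrawler"):
    """Return True if the parsed policy permits fetching the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://www.example.org/index.html"))      # True
print(allowed("http://www.example.org/tmp/cache.html"))  # False
```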
Web Search Engine Architecture
[Figure: web search engine architecture with the components WWW, Crawler, URL Pool, URL Handler (filter, normalisation and duplicate elimination), URL Repository, Storage Manager (content already added?), Page Repository, Indexers, Document Index, Special Indexes, inverted index, Query Handler, Ranking and Client]
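The inverted index in this architecture maps each term to the pages that contain it, so that the query handler does not have to scan the page repository. The sketch below is a simplified illustration, assuming whitespace tokenisation and AND semantics for multi-term queries; the function names and sample pages are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map every term to the set of page URLs in which it occurs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def lookup(index, query):
    """Return the pages that contain all query terms."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical page repository
pages = {
    "http://www.example.org/a": "web crawler and indexer",
    "http://www.example.org/b": "web search engine history",
}
index = build_inverted_index(pages)
print(lookup(index, "web crawler"))
```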
Pre-1998 Web Search
Find all documents for a given query term
use information retrieval (IR) solutions
- boolean model
- vector space model
- ...
ranking based on "on-page factors"
problem: poor quality of search results (order)
Larry Page and Sergey Brin proposed to compute the
absolute quality of a page called PageRank
based on the number and quality of pages linking
to a page (votes)
query-independent
Origins of PageRank
Developed as part of an
academic project at Stanford
University
research platform to aid understanding
of large-scale web data
and enable researchers to easily
experiment with new search
technologies
Larry Page and Sergey Brin worked on the project about a new
kind of search engine (1995-1998) which finally led to a functional
prototype called Google
Larry Page Sergey Brin
PageRank
A page Pi has a high PageRank Ri if
there are many pages linking to it
or, if there are some pages with a high PageRank linking to it
Total score = IR score × PageRank
[Figure: graph of pages P1 to P8 with their PageRank values R1 to R8]
Basic PageRank Algorithm
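$$PR(P_i) = \sum_{P_j \in B_i} \frac{PR(P_j)}{L_j}$$
where
B_i is the set of pages that link to page P_i
L_j is the number of outgoing links for page P_j
[Figure: example graph with pages P1, P2 and P3; all PageRank values are initialised to 1; after one iteration P1 = 1.5, P2 = 1.5 and P3 = 0.75, and the values stay the same in the following iteration]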
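The iteration in the figure can be replayed in code. The sketch below rests on two assumptions that are not stated on the slide: the link structure (P1 links to P2, P2 links to P1 and P3, P3 links to P1) is inferred from the example values, and the scores are updated in place in the order P1, P2, P3, because that reproduces 1.5, 1.5 and 0.75. The function name `pagerank_sweep` is hypothetical.

```python
# Assumed link structure of the example: page -> pages it links to
links = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P1"]}


def pagerank_sweep(ranks, links):
    """One in-place sweep of PR(Pi) = sum over Pj in Bi of PR(Pj) / Lj."""
    for page in ranks:
        ranks[page] = sum(
            ranks[source] / len(targets)
            for source, targets in links.items()
            if page in targets
        )
    return ranks


ranks = {"P1": 1.0, "P2": 1.0, "P3": 1.0}
first_sweep = dict(pagerank_sweep(ranks, links))
second_sweep = dict(pagerank_sweep(ranks, links))
print(first_sweep)   # {'P1': 1.5, 'P2': 1.5, 'P3': 0.75}
print(second_sweep)  # {'P1': 1.5, 'P2': 1.5, 'P3': 0.75}
```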
Matrix Representation
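Let us define a hyperlink matrix H
$$H_{ij} = \begin{cases} 1/L_j & \text{if } P_j \in B_i \\ 0 & \text{otherwise} \end{cases}$$
For the example graph with pages P1, P2 and P3
$$H = \begin{pmatrix} 0 & 1/2 & 1 \\ 1 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix}$$
$$R = HR \quad \text{and} \quad R = [PR(P_i)]$$
R is an eigenvector of H with eigenvalue 1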
Matrix Representation ...
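We can use the power method to find R
$$R_{t+1} = H R_t$$
sparse matrix H with 40 billion columns and rows but only an
average of 10 non-zero entries in each column
For our example
$$H = \begin{pmatrix} 0 & 1/2 & 1 \\ 1 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix}$$
this results in
$$R = \begin{pmatrix} 2 & 2 & 1 \end{pmatrix}^T \quad \text{or} \quad R = \begin{pmatrix} 0.4 & 0.4 & 0.2 \end{pmatrix}^T$$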
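A minimal sketch of the power method in plain Python. It assumes the 3x3 example matrix with columns (0, 1, 0), (1/2, 0, 1/2) and (1, 0, 0), a uniform start vector and a fixed number of iterations; a real implementation would exploit the sparsity of H and stop on convergence. The name `power_method` is hypothetical.

```python
# Assumed example matrix: H[i][j] = 1/Lj if page j links to page i
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]


def power_method(matrix, iterations=100):
    """Repeatedly apply R <- H R, starting from a uniform vector."""
    n = len(matrix)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(matrix[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


R = power_method(H)
print([round(value, 3) for value in R])  # [0.4, 0.4, 0.2]
```

Because every column of this H sums to 1, the entries of R keep summing to 1 during the iteration, which yields the normalised form of the result.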
Dangling Pages (Rank Sink)
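Problem with pages that
have no outbound links (e.g. P2)
$$H = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad R = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Stochastic adjustment
if page Pj has no outgoing links then replace all entries of column j with 1/n (n is the total number of pages)
$$C = \begin{pmatrix} 0 & 1/2 \\ 0 & 1/2 \end{pmatrix} \quad \text{and} \quad S = H + C = \begin{pmatrix} 0 & 1/2 \\ 1 & 1/2 \end{pmatrix}$$
New stochastic matrix S always has a stationary vector R
can also be interpreted as a Markov chain
[Figure: example graph with pages P1 and P2, where P1 links to P2 and P2 has no outgoing links]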
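The stochastic adjustment can be sketched as follows, assuming a dense list-of-lists matrix and the two-page example in which P1 links to P2 and P2 is dangling. Each all-zero column is filled with 1/n, where n is the number of pages; the function name is hypothetical.

```python
def stochastic_adjustment(H):
    """Return S, a copy of H in which every all-zero column is set to 1/n."""
    n = len(H)
    S = [row[:] for row in H]
    for j in range(n):
        if all(H[i][j] == 0 for i in range(n)):  # page j is dangling
            for i in range(n):
                S[i][j] = 1.0 / n
    return S


# Assumed two-page example: P1 links to P2, P2 has no outgoing links
H = [
    [0.0, 0.0],
    [1.0, 0.0],
]
S = stochastic_adjustment(H)
print(S)  # [[0.0, 0.5], [1.0, 0.5]]
```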
Strongly Connected Pages (Graph)
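Add new transition probabilities between all pages
with probability d we follow
the hyperlink structure S
with probability 1-d we
choose a random page
matrix G becomes irreducible
$$G = dS + (1-d)\frac{1}{n}\mathbf{1}$$
where \mathbf{1} is the n x n matrix of ones
$$R = GR$$
Google matrix G reflects
a random surfer
no modelling of back button
[Figure: graph with pages P1 to P5 and additional transitions with probability 1-d between the pages]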
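Putting the pieces together, the Google matrix and its stationary vector might be computed as in the sketch below. It assumes a dense matrix, the two-page stochastic matrix S with columns (0, 1) and (1/2, 1/2), and d = 0.85, which is a commonly quoted value and not taken from this slide. The function names are hypothetical.

```python
def google_matrix(S, d=0.85):
    """G = d * S + (1 - d) * (1/n) * matrix of ones."""
    n = len(S)
    return [[d * S[i][j] + (1.0 - d) / n for j in range(n)] for i in range(n)]


def stationary_vector(G, iterations=100):
    """Power method: repeatedly apply R <- G R from a uniform start vector."""
    n = len(G)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(G[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


# Assumed stochastic matrix of a two-page example
S = [
    [0.0, 0.5],
    [1.0, 0.5],
]
G = google_matrix(S)
R = stationary_vector(G)
print(G)
print(R)
```

Every entry of G is positive, so the random surfer can reach any page from any other page, which is what makes the matrix irreducible.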
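Examples
$$G = dS + (1-d)\frac{1}{n}\mathbf{1}$$
[Figure: example graph with pages A1, A2 and A3 and the PageRank values A1 = 0.26, A2 = 0.37 and A3 = 0.37]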
Implications for Website Development
First make sure that your page gets indexed
on-page factors
Think about your site's internal link structure
create many internal links for important pages
be "careful" about where to put outgoing links
Increase the number of pages
Ensure that webpages are addressed consistently
e.g. http://www.vub.ac.be and http://www.vub.ac.be/index.php are treated as two different URLs
Make sure that you get incoming links from good
websites
Tools
Google toolbar
shows logarithmic PageRank value (from 0 to 10)
information not frequently updated (google dance)
Google webmaster tools
accepts a sitemap (XML document) with the structure of a website
variety of reports that help to improve the quality of a website
- meta description issues
- title tag issues
- non-indexable content issues
- number and URLs of indexed pages
- number and URLs of inbound/outbound links
- ...
Questions
Is PageRank fair?
What about Google's power and influence?
What about Web 2.0 or Web 3.0 and web search?
"non-existent" webpages such as offered by Rich Internet
Applications (e.g. Ajax) may bring problems for traditional search
engines (hidden web)
new forms of social search
- Wikia Search
- Delicious
- ...
social marketing
HITS Algorithm
Hypertext Induced Topic Search
Jon Kleinberg
developed around the same time that
Page and Brin invented PageRank
Uses the link structure like PageRank to
compute a popularity score
Differences from PageRank
two popularity values for each page (hub and authority score)
note that the values are not query-independent
user gets a ranked hub and authority list
HITS Algorithm ...
Good authorities are linked by good hubs and good hubs
link to good authorities
Compute impact of authorities and hubs similar to
PageRank (but only on limited set of result pages!)
[Figure: two pages P1 and P2, labelled Authority and Hub]
initialise each page with an authority and hub score of 1
repeat {
  compute new authority scores
  compute new hub scores
  normalise authority and hub scores
}
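The loop above can be turned into a small runnable sketch. Several details are assumptions, since the slide does not fix them: the authority score of a page is the sum of the hub scores of the pages linking to it, the hub score is the sum of the authority scores of the pages it links to, scores are normalised with the Euclidean norm, and the loop runs for a fixed number of iterations. The graph with pages H1, H2, A1 and A2 is made up.

```python
import math


def hits(links, iterations=50):
    """Return (authority, hub) score dictionaries for a small link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # new authority score: hub scores of all pages that link to the page
        authority = {
            p: sum(hub[q] for q in pages if p in links.get(q, ()))
            for p in pages
        }
        # new hub score: authority scores of all pages the page links to
        hub = {p: sum(authority[t] for t in links.get(p, ())) for p in pages}
        # normalise both score vectors to unit length
        a_norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        authority = {p: v / a_norm for p, v in authority.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return authority, hub


# Hypothetical result set: H1 and H2 act as hubs, A1 and A2 as authorities
links = {"H1": ["A1", "A2"], "H2": ["A1"]}
authority, hub = hits(links)
print(authority)
print(hub)
```

In this graph A1 ends up with the highest authority score because both hubs link to it, and H1 is the better hub because it links to both authorities.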
Meta Search Engines
Search tool that sends a query to multiple search
engines
Aggregates the individual results on a single result page
MetaCrawler is an example of a meta search engine that
uses different search engines (Google, Bing, Yahoo!, ...)
Search Engine Market Share
Conclusions
Web information retrieval techniques have to deal with
the specific characteristics of the Web
PageRank algorithm
absolute quality of a page based on incoming links
based on random surfer model
computed as eigenvector of Google matrix G
PageRank is just one (important) factor
Implications for website development and SEO
References
Vannevar Bush, As We May Think, Atlantic Monthly,
July 1945
http://www.theatlantic.com/doc/194507/bush/
http://sloan.stanford.edu/MouseSite/Secondary.html
L. Page, S. Brin, R. Motwani and T. Winograd,
The PageRank Citation Ranking: Bringing Order
to the Web, January 1998
S. Brin and L. Page, The Anatomy of a Large-Scale
Hypertextual Web Search Engine, Computer Networks
and ISDN Systems, 30(1-7), April 1998
References …
Amy N. Langville and Carl D. Meyer, Google's
PageRank and Beyond – The Science of Search Engine
Rankings, Princeton University Press, July 2006
PageRank Calculator
http://www.webworkshop.net/pagerank_calculator.php
Google Webmaster Tools
http://www.google.com/webmasters/
Next Lecture
Search Engine Optimisation (SEO) and Search
Engine Marketing (SEM)