Why we need an independent index of the Web
Dirk Lewandowski
dirk.lewandowski@haw-hamburg.de
http://www.bui.haw-hamburg.de/lewandowski.html
@Dirk_Lew
Society of the Query Conference, Amsterdam, 7 November 2013
The “local copy” of the Web
• Web Indexing
– New, changed, and deleted documents
– “Holy grail” of keeping the index complete and current
Risvik, K. M., & Michelsen, R. (2002). Search engines and web dynamics. Computer Networks, 39(3), 289–302.
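The three cases above (new, changed, deleted) can be illustrated with a minimal sketch. It assumes the index stores one content fingerprint per URL; the names `fingerprint` and `diff_crawl`, the use of SHA-256, and the example.org URLs are illustrative assumptions, not taken from Risvik & Michelsen (2002).

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a content hash used to detect changes (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_crawl(index: dict[str, str], crawl: dict[str, str]) -> dict[str, set[str]]:
    """Compare stored fingerprints (url -> hash) with a fresh crawl (url -> content)."""
    fresh = {url: fingerprint(body) for url, body in crawl.items()}
    new = set(fresh) - set(index)          # seen in the crawl, not yet indexed
    deleted = set(index) - set(fresh)      # indexed, but no longer on the Web
    changed = {url for url in set(fresh) & set(index) if fresh[url] != index[url]}
    return {"new": new, "changed": changed, "deleted": deleted}


# Hypothetical local copy and a later crawl of the same site.
index = {
    "http://example.org/a": fingerprint("old text"),
    "http://example.org/b": fingerprint("same text"),
    "http://example.org/d": fingerprint("page that will disappear"),
}
crawl = {
    "http://example.org/a": "new text",
    "http://example.org/b": "same text",
    "http://example.org/c": "brand new page",
}
result = diff_crawl(index, crawl)
```

The hard part is not this comparison but recrawling often enough that the local copy stays both complete and current.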
Representation of documents in a search engine
Referring documents → Document → Metadata (examples)
Heading 1
Heading 2
Anchor text (one per referring document)
From the source code
- Title
- Description
- Keywords
- Author
From the document
(document info)
- Length
- Date
- Decay
- Name of the author
From the Web
- PageRank
- Number of citations
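One way to picture this representation is as a record with one group of fields per source. This is only a sketch under assumptions: the class and field names are hypothetical, the types are guesses, and `searchable_text` is an illustrative helper rather than a description of how any real engine combines these fields.

```python
import datetime
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceCodeMetadata:
    """Fields taken from the source code of the page."""
    title: str = ""
    description: str = ""
    keywords: list[str] = field(default_factory=list)
    author: str = ""


@dataclass
class DocumentInfo:
    """Fields derived from the document itself."""
    length: int = 0
    date: Optional[datetime.date] = None
    decay: Optional[float] = None  # the slide lists "Decay" without defining it
    author_name: str = ""


@dataclass
class WebSignals:
    """Fields derived from the Web as a whole."""
    pagerank: float = 0.0
    citations: int = 0


@dataclass
class DocumentRepresentation:
    url: str
    headings: list[str]
    anchor_texts: list[str]  # contributed by referring documents
    source: SourceCodeMetadata
    info: DocumentInfo
    web: WebSignals

    def searchable_text(self) -> str:
        """Join title, headings, and anchor texts into one searchable string."""
        parts = [self.source.title, *self.headings, *self.anchor_texts]
        return " ".join(p for p in parts if p)


doc = DocumentRepresentation(
    url="http://example.org/",
    headings=["Heading 1", "Heading 2"],
    anchor_texts=["click here"],
    source=SourceCodeMetadata(title="Example page"),
    info=DocumentInfo(length=1200, date=datetime.date(2013, 11, 7)),
    web=WebSignals(pagerank=0.15, citations=3),
)
```

The point of the sketch is that a search engine's view of a document includes much more than the document's own text, and none of this is exposed to outsiders.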
The User’s Perspective
• Everyone uses search engines (Purcell, Brenner & Rainie, 2012; van Eimeren &
Frees, 2012)
• Market is dominated by Google (ComScore data)
• Users rely on
– Google’s method of ordering results
– Google’s method of collecting data
→ If Google has not seen and indexed a document, or has not kept its copy up
to date, the document cannot be found with a search query.
Freshness of Web search engines
(see Lewandowski, Wahlig & Meyer-Bautor, 2006; Lewandowski, 2008)
[Figure: the original page (as of yesterday) next to Google's copy (as of yesterday)]
What about the alternatives to Google?
• Many sites only seem to be search engines
– Accessing the data of another search engine
– Representing nothing more than an alternative user interface to one of the more
well-known engines
– In many cases, that turns out to be Google
– E.g., in Germany, we can see that the major internet portals T-Online, GMX,
AOL, and web.de all display results obtained from Google
Why is one search engine not enough?
• We need more than one search engine to ensure that a broad range of
opinions is represented in the search market.
• Users should be able to choose between the different worldviews that
algorithm-based search result generation produces
• Ideology-free search algorithms are simply not possible
Alternative Search Engine Indexes
• There are only a handful of search engines that operate their own indexes,
due to costs and technical complexity
• Search engine start-ups
– Use an existing external index
– Focus on a specialised topic (which requires only a small index)
– Aggregate data from different search engines (meta search engine)
• Genuine search engine start-ups like Blekko and DuckDuckGo are more the
exception than the rule
Partner model
• “Real” search engine providers such as Google and Bing operate their own
search engines but also provide their search results to partners
• All the major web portals have now embraced this model.
• Income through ads; revenue-sharing
• Attractiveness of the model
– The search engine provider encounters only minimal costs
– The operator of the portal no longer needs to go to the great expense of running
its own search engine.
– The partner index model has served to thin out the competition in the search
industry.
Access to Search Engine Indexes
• Application programming interfaces (APIs)
– No direct access to the search engine index
– Limited number of top results which have already been ranked by the search
engine provider
– Access via APIs resembles what meta-search engines do
– The representation of the document in the source search engine is also not
included
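To make the limitation concrete, here is a minimal sketch of what an application restricted to API access can do: merge short, already-ranked URL lists. Reciprocal rank fusion is used here only as one well-known merging technique; the function name, the constant `k=60`, and the placeholder result lists are assumptions for illustration, not a description of any provider's API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked URL lists; each URL scores the sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical top results returned by two different search APIs.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
merged = reciprocal_rank_fusion([engine_a, engine_b])
```

Note what the inputs lack: no document text, no anchor texts, no link data, and nothing beyond the top positions. The merging step can only reorder what the providers have already selected and ranked.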
Alternative Search Engines
• What constitutes an “alternative search engine”?
– All search engines that are not Google? (“Google Killers”, e.g., Cuil)
– Some alternatives are not perceived as such because they are considered to be
simply the same as Google (e.g., Bing)
– Search engines which explicitly position themselves as an alternative to Google
through a regional approach (e.g., Seekport)
– New approaches to search / “Real alternatives”: Alternative approaches to
gathering and representing web content
Public Support for Search Engine Technology?
• Quaero/Theseus: Funding a “Google Killer”?
– Quaero: Technologies for multimedia searching.
– Theseus: Semantic technologies for business-to-business applications (without
focusing exclusively on search).
• The proposal to provide government funding for search engine technology
has been subject to intense criticism in the past
• Establish a single alternative?
• A number of factors could cause it to fail
– Poor marketing
– Graphic design of the user interface
– ...
• Regardless of the reason, a failure of the new search engine would result in
the entire publicly funded initiative failing.
Economic perspective
• Only the largest internet companies are able to afford large indexes.
• Microsoft is the only company besides Google to possess a comprehensive
search engine index.
• Yahoo gave up on its own index several years ago
• Operating a dedicated index appears to be attractive to practically no one,
and hardly any candidates have the necessary financial resources in any
case
The Solution
• Create the conditions that will make establishing alternative search engines
possible
• We can expect that the possibilities an openly accessible index presents
would benefit a number of different companies, individuals, and institutions.
• The result will be fair competition to develop the best concepts for using the
data provided by the index.
Vision
• “An index of the web that can be accessed at fair conditions for
everyone”
– “Everyone” means that anyone who is interested can access the index.
– “Fair conditions” does not mean that access to the index must be free of
charge for everyone. A certain number of document requests per day
should be available at no cost in order to promote non-profit projects.
– “Access” to the index can be defined as the ability to automatically
query the index with ease.
– The concept “index of the web” is intended to cover as much of the web
as possible
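The “fair conditions” idea, a free daily allowance with charges only beyond it, can be sketched as simple bookkeeping. Everything concrete here is an assumption for illustration: the class name `DailyQuota`, the allowance of 1000 documents, and the client name are hypothetical, and the talk does not propose specific numbers or a pricing scheme.

```python
import datetime
from collections import defaultdict


class DailyQuota:
    """Track document requests per client and day against a free allowance."""

    def __init__(self, free_per_day: int) -> None:
        self.free_per_day = free_per_day
        self._used: dict[tuple[str, datetime.date], int] = defaultdict(int)

    def charge(self, client: str, n_documents: int, day: datetime.date) -> int:
        """Record a request and return how many documents are billable."""
        key = (client, day)
        remaining_free = max(self.free_per_day - self._used[key], 0)
        self._used[key] += n_documents
        return max(n_documents - remaining_free, 0)


quota = DailyQuota(free_per_day=1000)  # hypothetical allowance
day = datetime.date(2013, 11, 7)

first = quota.charge("non-profit-project", 600, day)   # fully covered
second = quota.charge("non-profit-project", 600, day)  # 400 free, 200 billable
next_day = quota.charge("non-profit-project", 600, day + datetime.timedelta(days=1))
```

A small non-profit project staying under the allowance would never be billed, while heavy commercial use would contribute to the cost of running the index.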
Funding and operation
• Funding
– This type of project cannot be supported by any one country alone. The only
feasible option is a pan-European initiative.
• Who would operate the index?
– Existing research institution or newly-founded institution
– The operator of the index should not obtain the exclusive right to determine the
way in which the documents are used or made available (→ board of trustees)
Conclusion: Advantages of an independent index of the web
• Motivate companies, institutions, and developers pursuing personal projects
to create their own search applications.
• The data available on the web is so boundless that it lends itself to
countless applications in a broad range of fields.
• Enable applications we are not yet capable of even imagining.
• An open structure, transparency with respect to access, and the assurance
of permanent availability thanks to state sponsorship would lay the
groundwork for innovation.
Thank you
Prof. Dr. Dirk Lewandowski
Hochschule für Angewandte Wissenschaften
Hamburg
dirk.lewandowski@haw-hamburg.de
Twitter: Dirk_Lew
http://www.bui.haw-hamburg.de/lewandowski.html
http://www.searchstudies.org