Web Archives and the dream of the
Personal Search Engine
Prof.dr.ir. Arjen P. de Vries
arjen@acm.org
Hannover, October 19th, 2017
Library of the “Muntmuseum” in Utrecht (Erik van Hannen)
Why Search Remains Difficult to Get Right
 Heterogeneous data sources
- WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
 Varying result types
- “Documents”, tweets, courses, people, experts, gene
expressions, temperatures, …
 Multiple dimensions of relevance
- Topicality, recency, reading level, …
Actual information needs often require a mix within
and across dimensions. E.g., “recent news and
patents from our top competitors”
 System’s internal information representation
- Linguistic annotations
- Named entities, sentiment, dependencies, …
- Knowledge resources
- Wikipedia, Freebase, ICD-9, IPTC, …
- Links to related documents
- Citations, urls
 Anchors that describe the URI
- Anchor text
 Queries that lead to clicks on the URI
- Session, user, dwell-time, …
 Tweets that mention the URI
- Time, location, user, …
 Other social media that describe the URI
- User, rating
- Tag, organisation of 'folksonomy'
+ UNCERTAINTY ALL OVER!
Learning to Rank (LTOR)
 IR as a machine learning problem
 Learn the matching function from observations
- E.g., pairwise: a clicked document ranked below a non-clicked one should trigger a swap of their positions (sketched below)
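To make the pairwise idea concrete, here is a minimal sketch (my own illustration, not code from any production ranker): preference pairs derived from clicks train a linear scoring function, which is nudged whenever it orders a pair the wrong way.

```python
# Minimal pairwise learning-to-rank sketch (illustrative only).
# A click on a document ranked below a skipped one yields a preference
# pair (preferred, other); the linear model is updated on violations.
import numpy as np

def train_pairwise(pairs, n_features, epochs=10, lr=0.1):
    """pairs: list of (x_preferred, x_other) feature-vector tuples."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            # If the preferred document does not score higher, apply a
            # swap-inducing update toward the preferred document's features.
            if w.dot(x_pos) <= w.dot(x_neg):
                w += lr * (x_pos - x_neg)
    return w

# Toy example with 2 features (e.g., BM25 score, recency): the clicked
# document had the better content score but was ranked below the other.
pairs = [(np.array([0.9, 0.2]), np.array([0.4, 0.8]))]
w = train_pairwise(pairs, n_features=2)
print(w)  # learned feature weights
```

Real learning-to-rank systems use richer features and losses (RankNet, LambdaMART, and the like), but the swap-inducing update above is the core pairwise intuition.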
Detect and classify NEs
Rank search results
Predict query intent
Rank Verticals
Search suggestions
Spelling correction
Robert Johnson (1911-1938): “Early this morning when you knocked upon my door / And I said, ‘Hello, Satan, I believe it’s time to go.’”
https://youtu.be/3MCHI23FTP8
WWW
 The Web has become ever more centralized
+ Cloud services – good value-for-money/value-for-effort
Mobile only makes things worse
“There is an app for that”
Decentralize Web Search?
See also yacy.net/
Without the log data, web search isn’t as good
 This also hinders retrieval experiments in academia!
- Reproducibility vs. Representativeness of research results?
Samar, T., Bellogín, A. & de Vries, A.P. Inf Retrieval J (2016) 19: 230.
doi: 10.1007/s10791-015-9276-9
Disclaimer:
Personal search engine!
http://www.mkomo.com/cost-per-gigabyte-update
WESTERN DIGITAL DEMONSTRATES PROTOTYPE OF THE WORLD’S FIRST
1 TERABYTE SDXC CARD AT PHOTOKINA 2016
SEP 20, 2016
Realistic?
 Clueweb 2012: 80TB
Recent CommonCrawl (August 2017): 3.28B pages, 280TB
 Average web page takes up 320 KB
- Large sample collected with Googlebot, May 26th, 2010
- Reported 4.2B pages (would require ~1.3 Petabyte)
 De Kunder & Van den Bosch estimate an upper bound of ~50B pages
- http://www.worldwidewebsize.com/
 Also considering continuing growth (claimed in unpublished work)
- Andrew Trotman, Jinglan Zhang, Future Web Growth and its Consequences for Web
Search Architectures. https://arxiv.org/abs/1307.1179
https://web-beta.archive.org/web/20100628055041/http://code.google.com/speed/articles/web-metrics.html
Realistic?
 Who actually needs all of the Web if their search engine is
truly personal?
 E.g., I do not read more than 4 or 5 languages…
 And I do not want to see or read anything related to World Cup qualifiers
Two Problems
 How to get the web data onto the personal search engine?
 How to compensate for the lack of usage data from many users?
Getting the Data
 Idea:
- Organize the web crawl in topically related bundles
- Apply BitTorrent-like decentralization to share & update bundles (a toy sketch follows this slide)
 Use techniques inspired by query obfuscation to hide the
real user’s interests when downloading bundles
See also WebRTC based in-browser implementations:
 Webtorrent: https://webtorrent.io/
 CacheP2P: http://www.cachep2p.com/
academictorrents.com shares 16TB research data, including
Clueweb 2009 and 2012 anchor text
And, IPFS: https://ipfs.io/
“A peer-to-peer hypermedia protocol to make the web faster, safer,
and more open.”
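A hypothetical sketch of the bundle idea, with a stub topic classifier standing in for a real one: pages are grouped by topic and each bundle gets a content hash, so peers can verify and share exactly the bundles that match their interests.

```python
# Hypothetical sketch: package a crawl into topical bundles, each
# addressed by the hash of its contents so peers can share and verify it
# (BitTorrent/IPFS-style). Topic assignment is assumed to come from any
# off-the-shelf classifier; here it is a stub.
import hashlib, json
from collections import defaultdict

def topic_of(page_text):
    # stand-in for a real topic classifier
    return "sports" if "football" in page_text.lower() else "general"

def build_bundles(pages):
    """pages: list of (url, html_text). Returns {topic: manifest dict}."""
    bundles = defaultdict(list)
    for url, text in pages:
        bundles[topic_of(text)].append({"url": url, "content": text})
    manifests = {}
    for topic, docs in bundles.items():
        payload = json.dumps(docs, sort_keys=True).encode()
        manifests[topic] = {
            "topic": topic,
            "n_docs": len(docs),
            "content_hash": hashlib.sha256(payload).hexdigest(),
        }
    return manifests

print(build_bundles([("http://example.org/a", "Football news ...")]))
```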
Web Archives to the Rescue?
 Web Archives already store the data that the personal
search engine would need
- Just not (yet) organized in topical bundles
Rescue the Web archives?!
 Q: a business model for archiving?
 Q: enrich the (rarely used) web archives with usage data?
 Q: crowd-sourced seed-lists for crawling?
See also a different direction to rescue the Web Archive:
bit.ly/VisualNavigationProject by Hugo Huurdeman
IPFS
 “The Permanent Web”
- Smart mix of BitTorrent for peer-to-peer file sharing and Git for versioning
- Each file and all of the blocks within it are given a unique fingerprint called a cryptographic hash
- This hash is used to look up files (illustrated below)
 IPFS = the InterPlanetary File System
 Decentralized file sharing, but no decentralized search
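The content-addressing principle in miniature (the real IPFS uses multihash-based CIDs and splits files into blocks; this only illustrates the idea that the address is derived from, and verifies, the content):

```python
# Content addressing in miniature: files are keyed by the hash of their
# bytes, so any peer can verify what it receives.
import hashlib

store = {}  # hash -> bytes, stands in for the distributed block store

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key          # the "address" is derived from the content

def get(key: str) -> bytes:
    data = store[key]
    assert hashlib.sha256(data).hexdigest() == key  # self-verifying
    return data

cid = put(b"archived web page")
print(cid, get(cid))
```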
“… communication and media
limitations, due to the distance
between Earth and Mars,
resulting in time delays: they will
have to request the movies or
news broadcasts they want to
see in advance.
[…]
Easy Internet access will be
limited to their preferred sites
that are constantly updated on
the local Mars web server.
Other websites will take
between 6 and 45 minutes to
appear on their screen - first 3-22 minutes for your click to
reach Earth, and then another
3-22 minutes for the website
data to reach Mars.”
http://www.mars-one.com/faq/mission-to-mars/what-will-the-astronauts-do-on-mars
“Searching from Mars”
 Tradeoff between “effort” (waiting for responses from Earth) and “data
transfer” (pre-fetching or caching data on Mars).
 Related work:
- Jimmy Lin, Charles L. A. Clarke, and Gaurav Baruah. Searching from Mars. Internet
Computing, 20(1):77-82, 2016. http://dx.doi.org/10.1109/MIC.2016.2
- Charles L.A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest.
Total Recall: Blue Sky on Mars. ICTIR '16. http://dx.doi.org/10.1145/2970398.2970430
- Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, Adam Roegiest.
Ten Blue Links on Mars. https://arxiv.org/abs/1610.06468
Pre-fetching & Caching
 Hide the latency of getting data from the live web (sketched below)
- Pre-fetch pages linked from the initial query results page
- Pre-fetch additional related pages
- Expand pre-fetches with pages from query suggestions
 Cache web data to avoid accessing the live web
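An illustrative sketch of the pre-fetch-and-cache loop, assuming the third-party `requests` library and placeholder URLs; a real implementation would fetch in the background rather than sequentially.

```python
# Illustrative sketch of the pre-fetch idea: after showing a results page,
# fetch the linked pages (and pages for query suggestions) into a local
# cache so a later click never has to wait for the live-web round trip.
import requests

cache = {}  # url -> html

def fetch(url):
    if url not in cache:                 # serve from cache when possible
        cache[url] = requests.get(url, timeout=10).text
    return cache[url]

def prefetch(result_urls, suggestion_urls=()):
    for url in list(result_urls) + list(suggestion_urls):
        try:
            fetch(url)                   # warm the cache
        except requests.RequestException:
            pass                         # best effort: skip unreachable pages

prefetch(["https://example.org/result1", "https://example.org/result2"])
```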
Analogy
 Web Archive ~ Earth
 Personal search engine (@ people’s homes) ~ Mars
Two Problems
 How to get the web data onto the personal search engine?
 How to compensate for the lack of usage data from many users?
Alternatives for Log Data?
 Social annotations
- E.g., bit.ly shortened URLs
- Still requires access to an API conveying the query representation
- E.g., anchor text
- E.g., “twanchor text” – tweets providing context to a URL
“SearsiaSuggest”
 Searsia (federated search engine created by Djoerd
Hiemstra) uses anchor text instead of query logs for its
autocompletions
- “… for queries of 2 words or more (the average query length in
the test data is 2.6), anchor text autocompletions perform better
than query log autocompletions”
- No more tracking of users!
 See also:
- searsia.org/blog/2017-03-18-query-suggestions-without-
tracking-users/
- github.com/searsia/searsiasuggest
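A toy version of the idea (not the actual Searsia code): treat every anchor-text string as a candidate suggestion, weight it by how often it occurs, and complete a typed prefix from the most frequent candidates.

```python
# Query suggestions from anchor text instead of query logs: a sketch.
from collections import Counter

def build_index(anchor_texts):
    # each anchor-text string is a candidate suggestion, weighted by frequency
    return Counter(a.lower().strip() for a in anchor_texts)

def suggest(index, prefix, k=5):
    prefix = prefix.lower()
    hits = [(n, a) for a, n in index.items() if a.startswith(prefix)]
    return [a for n, a in sorted(hits, reverse=True)[:k]]

index = build_index(["web archive", "web archives", "web search", "web archive"])
print(suggest(index, "web a"))   # -> ['web archive', 'web archives']
```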
Anchor Text & Timestamps
 Anchor text exhibits characteristics similar to user query
and document title [Eiron & McCurley, Jin et al.]
 Anchor text with timestamps can be used to capture &
trace entity evolution [Kanhabua and Nejdl]
 Anchor text with timestamps lets us reconstruct (past) topic
popularity [Samar et al.]
Again, the Web Archive to the rescue!
Recover Past Trends
 “Ground-truth” from WikiStats, Google Trends and the KB
online newspaper archive’s query log
 Anchor Text combined with timestamps can be used to find
past popular topics
- The % of coverage varies across the sources of past trends
- Anchor Text popularity correlates with the % of coverage
 Crawl strategy: KB vs. CommonCrawl
- Breadth-first (CommonCrawl) covers more topics globally and
from the NL domain
Investigate Bias
 Our study, on the Dutch Web Archive:
- Anchor Text from external links
- Create query sets with a timestamp per query (2009 – 2012)
- De-duplicated for year of crawl
(Most sites crawled once a year, but a subset more frequently.)
 Retrievability study
- Number of sites crawled in a year does not influence the
retrievability of documents from that year
- Difficulty to retrieve a document from a certain timeframe does
depend on the subset size
L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th
ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.
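For reference, the (cumulative) retrievability measure defined by Azzopardi and Vinay, as I read it from the cited paper: $o_q$ is the weight of query $q$ in the query set $Q$, $k_{dq}$ the rank at which document $d$ is returned for $q$, and $c$ the rank cutoff.

```latex
r(d) \;=\; \sum_{q \in Q} o_q \cdot f(k_{dq}, c),
\qquad
f(k_{dq}, c) \;=\;
\begin{cases}
1 & \text{if } k_{dq} \le c \\
0 & \text{otherwise}
\end{cases}
```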
http://dx.doi.org/10.1007/s00799-017-0215-9
PhD defense: October 30th, 2017
Trade log data!
Feild, H., Allan, J. and Glatt, J. (2011). "CrowdLogging: Distributed, private, and anonymous search logging." In Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 375-384.
We describe an approach for distributed search log collection, storage, and mining,
with the dual goals of preserving privacy and making the mined information broadly
available. [..] The approach works with any search behavior artifact that can be
extracted from a search log, including queries, query reformulations, and query-click pairs.
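A deliberately over-simplified sketch of that idea (the real CrowdLogging protocol relies on encryption so that artifacts only become readable once enough users contribute them; here a plain frequency threshold stands in for that mechanism):

```python
# NOT the actual CrowdLogging protocol: a frequency-threshold stand-in.
# Each user contributes search artifacts (queries, query-click pairs, ...);
# an artifact is only released once at least k distinct users reported it.
from collections import defaultdict

contributions = defaultdict(set)   # artifact -> set of user ids

def contribute(user_id, artifact):
    contributions[artifact].add(user_id)

def release(k=5):
    """Artifacts supported by at least k users are safer to publish."""
    return {a: len(u) for a, u in contributions.items() if len(u) >= k}

for uid in range(6):
    contribute(uid, ("query", "web archive"))
contribute(99, ("query", "my rare disease"))   # never reaches the threshold
print(release(k=5))
```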
Open challenges
 How to select the part of your log data you are willing to
share?
 How to estimate the value of this log data?
Share Log Segments by Topic?
 Represent searchers’ previous search history in the form of
concise human-readable topical profiles
- Classifier trained on ODP applied to clicked pages
Carsten Eickhoff, Kevyn Collins-Thompson, Paul Bennett, and
Susan Dumais. Designing human-readable user profiles for
search evaluation (ECIR’13)
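A small sketch of how such a profile could be assembled, with a stub classifier standing in for one trained on ODP categories:

```python
# Human-readable topical profile from clicked pages (sketch only).
from collections import Counter

def classify(page_text):
    # stand-in for a classifier trained on ODP/Curlie categories
    return "Sports" if "football" in page_text.lower() else "Science"

def topical_profile(clicked_pages, top_n=10):
    counts = Counter(classify(p) for p in clicked_pages)
    total = sum(counts.values())
    # the profile is the normalized distribution over the top categories
    return {topic: n / total for topic, n in counts.most_common(top_n)}

print(topical_profile(["Football transfer news", "New exoplanet found"]))
```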
Share Log Segments by Topic?
 Linking a year of query logs to Wikipedia categories helped
distill the segments corresponding to events like marriage,
a first-born, and expat life (taxes)
- Jiyin He, Marc Bron: Measuring Demonstrated Potential
Domain Knowledge with Knowledge Graphs.
KG4IR@SIGIR 2017: 13-18
But wait…
… do we REALLY need all that query log info?
Personal search engine
 Safely gain access to rich personal data including email,
browsing history, documents read and contents of the
user’s home directory
 Can high quality evidence about an individual’s recurring
long-term interests replace the shallow information of
many?
Better Search – “Deep Personalization”
 “Even more broadly than trying to get people the right
content based on their context, we as a community need to
be thinking about how to support people through the entire
search experience.”
Jaime Teevan on “Slow Search”
 Search as a dialogue
My first journal paper:
De Vries, Van der Veer and Blanken: Let’s talk about it: dialogues with multimedia databases (1998)
“Deep Personalization”
 How could the indexer know about the wide variety of
sources and their schema information...
 Or, How to build 1000+ search engines?!
Create LOD representation
Engineer the Search Engine
Model the Search Engine
“Search by Strategy”
• “No idealized one-shot search engine”
• Hand over control to the user (or, most
likely, the search intermediary)
• Search (and link) strategies can be
shared!
Note: Enhances Reproducibility of IR Research!
Web Archives to Lead the Revolution!
 Two main opportunities:
- Free us from the mass surveillance that is now the default
business model of the internet
- Improve Web Archive and Web Archive search
 Long run: realize truly personal search engines?
Blueprint of the Personal Search Engine
 Decentralize search
 Web archives to the rescue
- Super-peers in a P2P network of personal search engines
 “Deep personalization”
- Exploit the rich source data that can be processed safely locally
 A sharing economy:
- Data markets to trade log data and mutually improve your search results
