
The personal search engine

Huygens colloquium at Radboud University Science Faculty.

Effective web search engines (and the commercial success of a few internet giants) depend upon the data collected from the online seeking behaviour of huge numbers of users. Put differently, the high-quality search results we take for granted every day come at the price of reduced privacy.

A personal search engine would not only search the web, but also rich personal data, including email, browsing history, documents read and the contents of the user’s home directory. Research on so-called "slow search" indicates that the user experience can be improved significantly when the search engine gains access to additional data. However, will we be prepared to give up even more of our privacy, and eventually surrender control over all that personal information?

My proposal is to mitigate these concerns by developing a new architecture for web search, in which users control the trade-off between search result quality and the privacy risk inherent in sharing usage logs. Under this design, all data of the “personal search engine” (PSE), both web and usage data, resides in its owner’s personal digital infrastructure.

Two challenges need to be overcome to turn this into a viable alternative. Can we compensate for the loss of information about searches of large numbers of users? And, can we maintain an up-to-date index in a cost-effective manner? As a solution, I propose to organise personal search engines in a decentralised social network. This serves two goals: the index can be kept up-to-date collaboratively, and usage data may be traded with peers.
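
As a rough illustration of this proposal, the sketch below models a PSE node that keeps its index and usage log locally and only exchanges index updates and explicitly released log fragments with peers in the network. All class, method and field names are hypothetical; the sketch only illustrates the division of responsibilities, not an existing implementation.

    # Minimal sketch (hypothetical names): a PSE node that keeps all data locally
    # and only shares what its owner explicitly releases to peers.
    from dataclasses import dataclass, field

    @dataclass
    class PersonalSearchEngine:
        owner: str
        index: dict = field(default_factory=dict)      # term -> set of local/web doc ids
        usage_log: list = field(default_factory=list)  # (query, clicked_doc) pairs; stays local by default
        peers: list = field(default_factory=list)      # other PSE nodes in the social network

        def search(self, query: str) -> list:
            # Answer purely from the locally held index; no query leaves the node.
            return sorted(self.index.get(query, set()))

        def record_click(self, query: str, doc_id: str) -> None:
            self.usage_log.append((query, doc_id))

        def share_index_update(self, update: dict) -> None:
            # Collaborative freshness: push crawled/updated postings to peers.
            for peer in self.peers:
                peer.index.update(update)

        def release_log_fragment(self, predicate) -> list:
            # Only log entries the owner explicitly approves may be traded with peers.
            return [entry for entry in self.usage_log if predicate(entry)]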

The personal search engine

  1. The personal search engine. Prof.dr.ir. Arjen P. de Vries, arjen@acm.org. Nijmegen, November 7th, 2016.
  2. Disclaimer:
  3. “Computational Relevance”. “Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model.” (Van Rijsbergen, 1976). Retrieval Model.
  4. Probabilistic Ranking Principle. • “Provides a theoretical justification for why documents should be ranked by the probability of relevance.” Stephen Robertson, 1977. (A compact formulation of the principle is given after the transcript.)
  5. IR Solved? • “Provides a theoretical justification for why documents should be ranked by the probability of relevance.” Stephen Robertson, 1977. • The PRP assumes (unreasonably?) independence between results and 1/0 loss (or Boolean relevance assessments). • The PRP does not state how the probability of relevance should be estimated.
  6. Why Search Remains Difficult to Get Right. • Heterogeneous data sources - WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, … • Varying result types - “Documents”, tweets, courses, people, experts, gene expressions, temperatures, … • Multiple dimensions of relevance - Topicality, recency, reading level, … Actual information needs often require a mix within and across dimensions, e.g., “recent news and patents from our top competitors”.
  7. • System’s internal information representation - Linguistic annotations - Named entities, sentiment, dependencies, … - Knowledge resources - Wikipedia, Freebase, ICD-9, IPTC, … - Links to related documents - Citations, URLs. • Anchors that describe the URI - Anchor text. • Queries that lead to clicks on the URI - Session, user, dwell-time, … • Tweets that mention the URI - Time, location, user, … • Other social media that describe the URI - User, rating - Tag, organisation of ‘folksonomy’. + UNCERTAINTY ALL OVER!
  8. Learning to Rank (LTOR). • IR as a machine learning problem. • Learn the matching function from observations - E.g., pairwise: a clicked document ranked below a retrieved document should trigger a swap of their positions. (A minimal pairwise update is sketched after the transcript.)
  9. Detect and classify NEs. Rank search results. Predict query intent. Search suggestions.
  10. Spelling correction. Predict query intent. Rank verticals. Search suggestions.
  11. Robert Johnson (1911-1938): “Early this morning when you knocked upon my door / And I said, ‘Hello, Satan, I believe it’s time to go.’” https://youtu.be/3MCHI23FTP8
  12. WWW. • The Web has become ever more centralized + Cloud services: good value-for-money/value-for-effort. • Mobile makes things only worse: “There is an app for that”.
  13. Without the log data, web search isn’t as good. • This also hinders retrieval experiments in academia! - Reproducibility vs. representativeness of research results? Samar, T., Bellogín, A. & de Vries, A.P. Inf Retrieval J (2016) 19: 230. doi: 10.1007/s10791-015-9276-9
  14. Decentralize Web Search?
  15. Personal search engine!
  16. http://www.mkomo.com/cost-per-gigabyte-update
  17. Western Digital demonstrates prototype of the world’s first 1-terabyte SDXC card at Photokina 2016. Sep 20, 2016.
  18. Realistic? • ClueWeb 2012: 80 TB; recent CommonCrawl: 150 TB. • The average web page takes up 320 KB - Large sample collected with Googlebot, May 26th, 2010 - Reported 4.2B pages (would require ~1.3 petabyte). • De Kunder & Van den Bosch estimate an upper bound of ~50B pages - http://www.worldwidewebsize.com/ • Also consider continuing growth (claimed in unpublished work by colleagues) - Andrew Trotman, Jinglan Zhang, Future Web Growth and its Consequences for Web Search Architectures. https://arxiv.org/abs/1307.1179 https://web-beta.archive.org/web/20100628055041/http://code.google.com/speed/articles/web-metrics.html (The arithmetic is reproduced after the transcript.)
  19. Two Problems. • How to get the web data onto the personal search engine? • How to compensate for the lack of usage data from many users?
  20. Getting the Data. • Idea: - Organize the web crawl in topically related bundles - Apply BitTorrent-like decentralization to share & update bundles. • Use techniques inspired by query obfuscation to hide the real user’s interests when downloading bundles (a small sketch follows the transcript). • Web archives to the rescue? - The Web Archive to play a role as “super-peer”. See also WebRTC-based in-browser implementations: • WebTorrent: https://webtorrent.io/ • CacheP2P: http://www.cachep2p.com/ And http://academictorrents.com/ shares 16 TB of research data, including ClueWeb 2009 and 2012.
  21. “… communication and media limitations, due to the distance between Earth and Mars, resulting in time delays: they will have to request the movies or news broadcasts they want to see in advance. […] Easy Internet access will be limited to their preferred sites that are constantly updated on the local Mars web server. Other websites will take between 6 and 45 minutes to appear on their screen - first 3-22 minutes for your click to reach Earth, and then another 3-22 minutes for the website data to reach Mars.” http://www.mars-one.com/faq/mission-to-mars/what-will-the-astronauts-do-on-mars
  22. Analogy. • Web Archive ~ Earth. • Personal search engine (@ people’s homes) ~ Mars.
  23. “Searching from Mars”. • Tradeoff between “effort” (waiting for responses from Earth) and “data transfer” (pre-fetching or caching data on Mars). • Related work: - Jimmy Lin, Charles L. A. Clarke, and Gaurav Baruah. Searching from Mars. Internet Computing, 20(1):77-82, 2016. http://dx.doi.org/10.1109/MIC.2016.2 - Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, and Adam Roegiest. Total Recall: Blue Sky on Mars. ICTIR '16. http://dx.doi.org/10.1145/2970398.2970430 - Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, Adam Roegiest. Ten Blue Links on Mars. https://arxiv.org/abs/1610.06468
  24. Pre-fetching & Caching. • Hide latencies of getting the data from the live web - Pre-fetch pages linked from the initial query results page - Pre-fetch additional related pages - Expand pre-fetches with those from query suggestions. • Cache web data to avoid accessing the live web. (A toy prefetching cache is sketched after the transcript.)
  25. Two Problems. • How to get the web data onto the personal search engine? • How to compensate for the lack of usage data from many users?
  26. Truly personal search? • Safely gain access to rich personal data, including email, browsing history, documents read and the contents of the user’s home directory. • Can high-quality evidence about an individual’s recurring long-term interests replace the shallow information of many?
  27. Better Search: “Deep Personalization”. • “Even more broadly than trying to get people the right content based on their context, we as a community need to be thinking about how to support people through the entire search experience.” Jaime Teevan on “Slow Search”. • Search as a dialogue. My first journal paper: De Vries, Van der Veer and Blanken: Let’s talk about it: dialogues with multimedia databases (1998).
  28. Alternatives for Log Data? • Social annotations - E.g., bit.ly shortened URLs - Still requires access to an API conveying the query representation - E.g., anchor text - E.g., “twanchor text”: tweets providing context to a URL.
  29. Anchor Text & Timestamps. • Exhibits characteristics similar to user queries and document titles [Eiron & McCurley, Jin et al.]. • Anchor text with timestamps can be used to capture & trace entity evolution [Kanhabua and Nejdl]. • Anchor text with timestamps lets us reconstruct (past) topic popularity [Samar et al.]. (A small popularity-reconstruction sketch follows the transcript.)
  30. Trade log data!
  31. Open challenges. • How to select the part of your log data you are willing to share? • How to estimate the value of this log data? (A toy illustration follows the transcript.)
  32. Blueprint of the Personal Search Engine. • Decentralize search. • Web archives to the rescue - Super-peers in a P2P network of personal search engines. • “Deep personalization” - Exploit the rich source data that can be processed safely, locally. • A sharing economy: - Data markets to trade log data and mutually improve your search results.
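
Slides 4 and 5 refer to the Probabilistic Ranking Principle. A compact statement of the principle, in standard notation rather than taken from the slides:

    % Probabilistic Ranking Principle (standard formulation, not copied from the slides):
    % score each document d for query q by its probability of relevance and rank by it,
    \[
      \operatorname{score}(d) = P(R = 1 \mid d, q), \qquad
      P(R = 1 \mid d_{(1)}, q) \ \ge\ P(R = 1 \mid d_{(2)}, q) \ \ge\ \dots
    \]
    % Under binary (1/0) relevance, 1/0 loss and independence between results, this ordering
    % minimises the expected loss at every rank cut-off. The principle itself says nothing
    % about how P(R = 1 | d, q) should be estimated, which is exactly the gap slide 5 notes.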
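
Slide 8 describes pairwise learning to rank, where a click on a document ranked below a skipped one is treated as a preference to be corrected. A minimal sketch with a hinge-style update on a linear scoring function; the feature vectors, learning rate and margin are invented for illustration.

    # Minimal pairwise learning-to-rank sketch for slide 8 (hypothetical feature vectors).
    # A click on a document ranked below a skipped one yields a preference pair; the
    # weight vector is nudged so the clicked document would score higher next time.

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def pairwise_update(w, clicked_doc, skipped_doc, lr=0.1, margin=1.0):
        """One hinge-loss style update: prefer clicked_doc over skipped_doc."""
        if dot(w, clicked_doc) - dot(w, skipped_doc) < margin:   # pair is violated
            w = [wi + lr * (ci - si) for wi, ci, si in zip(w, clicked_doc, skipped_doc)]
        return w

    # Toy usage: 3 features per document (e.g. text match score, recency, click count).
    w = [0.0, 0.0, 0.0]
    clicked = [0.2, 0.9, 0.1]   # was ranked low, but the user clicked it
    skipped = [0.8, 0.1, 0.0]   # was ranked above it, but skipped
    w = pairwise_update(w, clicked, skipped)
    print(w)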
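
The back-of-the-envelope numbers on slide 18 can be checked directly. The 320 KB average page size, the 4.2 billion pages of the 2010 Googlebot sample and the ~50 billion page upper bound are taken from the slide; the rest is arithmetic.

    # Back-of-the-envelope check of slide 18 (numbers taken from the slide itself).
    KB, TB, PB = 1e3, 1e12, 1e15   # decimal units, close enough for an estimate

    avg_page = 320 * KB            # average page size from the 2010 Googlebot sample
    googlebot_pages = 4.2e9        # pages reported in that sample
    upper_bound_pages = 50e9       # De Kunder & Van den Bosch upper bound

    print(googlebot_pages   * avg_page / PB)   # ~1.3 PB, as on the slide
    print(upper_bound_pages * avg_page / PB)   # ~16 PB for the 50B-page upper bound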
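
Slide 20 proposes downloading topically related crawl bundles while hiding the user's real interests, in the spirit of query obfuscation. A toy sketch: request the bundles you need plus a few random decoys. The bundle names and sampling strategy are invented, and a real system would run this on top of a BitTorrent-like protocol such as WebTorrent.

    # Sketch of slide 20: fetch topical crawl bundles, hiding real interests behind decoys.
    import random

    AVAILABLE_BUNDLES = ["cycling", "databases", "gardening", "ir-research",
                         "cooking", "astronomy", "privacy", "football"]

    def bundles_to_request(interests, k_decoys=2, rng=random):
        """Real bundles plus a few random decoy bundles (query-obfuscation style)."""
        decoy_pool = [b for b in AVAILABLE_BUNDLES if b not in interests]
        decoys = rng.sample(decoy_pool, min(k_decoys, len(decoy_pool)))
        request = list(interests) + decoys
        rng.shuffle(request)        # peers cannot tell real interests from decoys
        return request

    print(bundles_to_request(["ir-research", "privacy"]))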
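
Slide 24's pre-fetching and caching can be pictured as a small LRU cache that, after answering a query, speculatively fetches the pages linked from the result pages. The fetch_page and extract_links helpers below are stand-ins for real crawler components.

    # Sketch of slide 24: cache pages locally and pre-fetch linked pages to hide latency.
    from collections import OrderedDict

    def fetch_page(url):          # stand-in for a real (slow) fetch from the live web
        return f"<html>content of {url}</html>"

    def extract_links(html):      # stand-in for real link extraction
        return []

    class PrefetchingCache:
        def __init__(self, capacity=10_000):
            self.pages = OrderedDict()
            self.capacity = capacity

        def get(self, url):
            if url not in self.pages:                 # cache miss: go to the live web
                self.put(url, fetch_page(url))
            self.pages.move_to_end(url)               # LRU bookkeeping
            return self.pages[url]

        def put(self, url, html):
            self.pages[url] = html
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)        # evict least recently used

        def prefetch_from_results(self, result_urls, depth=1):
            # Speculatively pull in result pages and the pages they link to,
            # so later clicks can be served from the local cache.
            for url in result_urls:
                html = self.get(url)
                if depth > 0:
                    self.prefetch_from_results(extract_links(html), depth - 1)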
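
Slide 29 notes that timestamped anchor text allows reconstructing past topic popularity. The sketch below simply buckets anchor-text mentions of a phrase by month; the record format is invented for illustration.

    # Sketch of slide 29: reconstruct topic popularity from timestamped anchor text.
    # Each record is (timestamp, anchor_text, target_url); the format is invented here.
    from collections import Counter

    anchors = [
        ("2012-03-14", "personal search engine", "http://example.org/pse"),
        ("2012-03-20", "decentralised web search", "http://example.org/pse"),
        ("2012-07-02", "personal search engine", "http://example.org/pse"),
    ]

    def monthly_popularity(records, phrase):
        """Count anchor-text mentions of `phrase` per month (YYYY-MM)."""
        counts = Counter()
        for timestamp, text, _url in records:
            if phrase in text.lower():
                counts[timestamp[:7]] += 1
        return dict(sorted(counts.items()))

    print(monthly_popularity(anchors, "personal search engine"))
    # {'2012-03': 1, '2012-07': 1}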
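
Slides 30 and 31 raise the questions of which log entries to share and what they are worth. The sketch below filters log entries with a user-set sensitivity rule and attaches a naive value estimate based on query rarity; both the rule and the value function are placeholders for the open research questions, not proposed answers.

    # Sketch of slides 30-31: select shareable log entries and put a naive value on them.
    # The sensitivity rule and the value estimate are placeholders, not proposed solutions.
    SENSITIVE_TERMS = {"salary", "diagnosis", "password"}

    log = [
        {"query": "open source column store", "clicks": 3},
        {"query": "thyroid diagnosis results", "clicks": 1},
        {"query": "radboud huygens colloquium", "clicks": 2},
    ]

    def shareable(entry):
        """User-controlled rule: never share queries containing sensitive terms."""
        return not (SENSITIVE_TERMS & set(entry["query"].split()))

    def estimated_value(entry, background_freq):
        """Naive value: rarer queries with more clicks are worth more to peers."""
        rarity = 1.0 / (1.0 + background_freq.get(entry["query"], 0))
        return rarity * entry["clicks"]

    tradable = [e for e in log if shareable(e)]
    for e in tradable:
        print(e["query"], estimated_value(e, background_freq={}))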
