Searching over the past, present and future

Roi Blanco (roi@yahoo-inc.com)
http://labs.yahoo.com/Yahoo_Labs_Barcelona

Yahoo! Research Barcelona
Established January, 2006
Led by Ricardo Baeza-Yates
Research areas
• Web Mining
• Social Media
• Distributed Web retrieval
• Geo information retrieval
• NLP and Semantics

Agenda
• Natural Language retrieval
• Time and search engines
• Searching over web archives
• Searching on real time information
• Caching!
• Time-based exploratory search
• Searching over future events
• Future directions

Natural Language Retrieval
• How to exploit the structure and meaning of
natural language text to improve search
• Current search engines perform only limited NLP
(tokenization, stemming)
• Automated tools exist for deeper analysis
• Applications to diversity-aware search
• Source, Location, Time, Language, Opinion, Ranking…
• Search over semi-structured data, semantic search
• Roll out user experiences that use higher layers of
the NLP stack
• In this talk, focus on the time dimension

High-level Architecture of WSEs
[Diagram: a runtime system (cache, parser/tokenizer, engine, index) receives queries and returns results; an indexing pipeline feeds terms from the WWW into the index]

Web Search and time
• Information freshness adds constraints/tensions in
every layer of a WSE
• Architecture
• Crawling
• Indexing
• Caching
• Serving system
• Modeling
• Time-dependent user intent
• UI (how to let the user take control)

Adding the time dimension
• Some solutions don’t scale up anymore
Review your architecture
Review your algorithms
Add more machines (~$$$)
• Some solutions don’t apply anymore
Caching

Evolution
• 1999
• Index updated ~once per month
• Disk-based updates/indexing
• 2001
• In-memory indexes
• Changes the whole game!
• 2007
• Indexing time < 1 minute
• Accept updates while serving
• Now
• Focused crawling, delayed transactions, etc.
• Batch Updates -> Incremental processing

Some landmarks
• Reliable distributed storage
• Some models/processes require millions of accesses
• Massive parallelization
• Map/Reduce – Hadoop
• Semi-structured storage systems
• Asynchronous item updates

What’s going on “right now”?

Query temporal profiles
• Modeling
• Time-dependent user intent
• Implicitly time-qualified search queries
• SIGIR
• Dream theater barcelona
• Barcelona vs Madrid
• ….
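One way to make "query temporal profile" concrete is a normalised histogram of when a query was issued. The sketch below is an illustration only: the function names `temporal_profile` and `peak_share` and the idea of using the busiest day's share as a burstiness signal are assumptions, not something the talk specifies.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def temporal_profile(days: Iterable[date]) -> Dict[date, float]:
    """Normalised histogram: fraction of a query's occurrences on each day."""
    counts = Counter(days)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: n / total for d, n in sorted(counts.items())}


def peak_share(profile: Dict[date, float]) -> float:
    """Mass on the busiest day (hypothetical signal): values near 1.0
    suggest a bursty, implicitly time-qualified query."""
    return max(profile.values(), default=0.0)
```

A query such as "Barcelona vs Madrid" would be expected to concentrate its mass around match days, whereas a navigational query would have a flat profile.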

Caching for Real-Time Indexes
• Queries are redundant (heavy-tail) and bursty
• Caching search results avoids executing ~30-60% of the queries
• Tens of machines do the work of 1000s
• Dilemma: Freshness versus Computation
• Extreme #1: do not cache at all – evaluate all queries
• 100% fresh results, lots of redundant evaluations
• Extreme #2: never invalidate the cache
• A majority of stale results – results refreshed only due to
cache replacement, no redundant work
• Middle ground: invalidate periodically (TTL)
• A time-to-live parameter is applied to each cached entry
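The TTL middle ground can be sketched as a small result cache. This is a minimal illustration, not the production design: the class name `TTLResultCache`, the injectable `clock`, and the `evaluate` callback standing in for the search back end are all assumptions.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLResultCache:
    """Query-result cache where every entry expires `ttl` seconds after it is stored."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be simulated
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, query: str, evaluate: Callable[[str], List[str]]) -> List[str]:
        now = self.clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1          # still fresh: serve the cached results
            return entry[1]
        self.misses += 1            # absent or expired: re-evaluate the query
        results = evaluate(query)
        self._entries[query] = (now, results)
        return results


if __name__ == "__main__":
    fake_now = [0.0]
    cache = TTLResultCache(ttl=60.0, clock=lambda: fake_now[0])
    backend = lambda q: [f"doc-for-{q}"]   # hypothetical search back end
    cache.get("barcelona vs madrid", backend)   # miss
    cache.get("barcelona vs madrid", backend)   # hit
    fake_now[0] = 61.0
    cache.get("barcelona vs madrid", backend)   # expired, so a miss
    print(cache.hits, cache.misses)             # prints: 1 2
```

The two extremes fall out of the same code: `ttl=0` evaluates every query, and `ttl=float("inf")` never invalidates.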

Caching for Incremental Indexes
• Problem:
• With fast crawling, the cache is not always up to date (stale entries)
• Solution:
• Cache Invalidation Predictor: looks into new documents and
invalidates queries accordingly
• Using synopses reduces the number of refreshes by up to 30%
compared to a time-to-live baseline
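The idea of inspecting new documents and invalidating matching cached queries can be sketched as follows. The synopsis construction (first distinct tokens) and the conjunctive matching rule (invalidate a query when all of its terms occur in the synopsis) are simplifying assumptions for illustration; the names `synopsis` and `InvalidatingCache` are hypothetical and do not reproduce the published system.

```python
from typing import Dict, List, Optional, Set


def synopsis(text: str, max_terms: int = 50) -> Set[str]:
    """Hypothetical synopsis: the first `max_terms` distinct lowercase tokens."""
    terms: List[str] = []
    for tok in text.lower().split():
        if tok not in terms:
            terms.append(tok)
        if len(terms) == max_terms:
            break
    return set(terms)


class InvalidatingCache:
    """Result cache that drops entries a newly indexed document may affect."""

    def __init__(self) -> None:
        self._results: Dict[str, List[str]] = {}

    def put(self, query: str, results: List[str]) -> None:
        self._results[query] = results

    def get(self, query: str) -> Optional[List[str]]:
        return self._results.get(query)

    def on_new_document(self, text: str) -> List[str]:
        """Invalidate cached queries whose terms all occur in the synopsis."""
        syn = synopsis(text)
        stale = [q for q in self._results if set(q.lower().split()) <= syn]
        for q in stale:
            del self._results[q]
        return stale
```

Compared with a pure TTL policy, entries are dropped only when a new document could plausibly change their results, which is where the saving in refreshes would come from.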

Time(ly) opportunities
Can we create new user experiences based on a deeper
analysis and exploration of the time dimension?
Goals:
Build an application that helps users to explore, interact
and ultimately understand existing information about
the past and the future.
Help the user cope with the information overload and
eventually find/learn about what she’s looking for

Original Idea
R. Baeza-Yates, Searching the Future, MF/IR 2005
On December 1st 2003, on Google News, there were more than 100K
references to 2004 and beyond.
E.g. 2034:
The ownership of Dolphin Square in London must revert to an
insurance company.
Voyager 2 should run out of fuel.
Long-term care facilities may have to house 2.1 million people
in the USA.
A human base on the moon would be in operation.
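Counting references to years beyond the publication date can be approximated with a regular expression. This is a rough sketch under stated assumptions: the function name `future_years` is hypothetical, only four-digit years starting with 19 or 20 are considered, and any such number is treated as a year.

```python
import re
from typing import List

# Four-digit years from 1900 to 2099 (an assumption made for this sketch).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def future_years(text: str, pub_year: int) -> List[int]:
    """Distinct years mentioned in `text` that lie after the publication year."""
    years = {int(m.group(0)) for m in YEAR_RE.finditer(text)}
    return sorted(y for y in years if y > pub_year)
```

For an article published in 2003, a sentence mentioning both 1977 and 2034 would contribute only 2034.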

Time Explorer
• Public demo since August 2010
• For exploring news through time and into the
future
• Using 1.8M news articles from the New York Times
Annotated Corpus
• Try it at
http://fbmya01.barcelonamedia.org:8080/future/

Time Explorer - Motivation
• Time is important to search
– Recency, particularly in news, is highly related to relevance
• But what about evolution over time?
– How has a topic evolved over time?
– How did the entities (people, places, etc.) evolve with respect to the topic over time?
– How will this topic continue to evolve in the future?
– How do bias and sentiment in blogs and news change over time?
• Google Trends, Yahoo! Clues, RecordedFuture …
• Great research playground

Collections
• New York Times (1.8 million documents)
– Well structured
– Manual annotations
– Publicly available
– But not diverse
• Web Crawl Collection (100 news sources and 500 blog sites)
– Great for diversity
– Challenging because of format, languages, structure, etc.
• Custom Collections
– Yahoo! News

Analysis Pipeline
• Tokenization, sentence splitting, part-of-speech tagging, chunking with OpenNLP
• Entity extraction with SuperSense tagger
• Time expressions extracted with TimeML
– Explicit dates (August 23rd, 2008)
– Relative dates ("next year", resolved with the publication date)
• Sentiment analysis with LivingKnowledge
• Ontology matching with Yago
• Image analysis – sentiment and face detection
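Resolving a relative time expression against the publication date can be illustrated with a toy resolver. It handles only three year-level phrases; the function name `resolve_relative` and the phrase table are assumptions for illustration and do not reflect how the TimeML tooling in the pipeline works.

```python
from datetime import date
from typing import Optional

# Hypothetical phrase table: offset in years relative to the publication date.
YEAR_OFFSETS = {"last year": -1, "this year": 0, "next year": 1}


def resolve_relative(expr: str, pub_date: date) -> Optional[int]:
    """Return the calendar year a relative expression refers to, or None if unknown."""
    offset = YEAR_OFFSETS.get(expr.strip().lower())
    if offset is None:
        return None
    return pub_date.year + offset
```

For an article published on August 23rd, 2008, "Next year" resolves to 2009, which is what lets a sentence be indexed under a date later than its publication.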

Indexing/Search
• Lucene/Solr search platform to index and search
– Sentence level
– Document level
• Facets for entity types
• Index publication date and content date (extracted dates if
they exist, otherwise the publication date)
• Solr faceting supports entity ranking for a query and
aggregation of counts over time
• Content date allows searching into the future
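The content-date fallback and the per-period counts can be mimicked in plain Python. This sketch does not use Solr's API; the `Sentence` record, the substring match standing in for retrieval, and the `counts_by_year` helper are assumptions that only illustrate the data flow.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class Sentence:
    text: str
    pub_date: date
    extracted_dates: List[date] = field(default_factory=list)

    def content_dates(self) -> List[date]:
        # Extracted dates if they exist, otherwise the publication date.
        return self.extracted_dates or [self.pub_date]


def counts_by_year(sentences: List[Sentence], query: str) -> Dict[int, int]:
    """Facet-style aggregation: matching sentences counted per content year."""
    counts: Counter = Counter()
    needle = query.lower()
    for s in sentences:
        if needle in s.text.lower():   # naive substring match stands in for search
            for d in s.content_dates():
                counts[d.year] += 1
    return dict(sorted(counts.items()))
```

Because a sentence published in 2010 can carry a content date in 2011, the resulting timeline extends past the newest publication date, which is how the index supports searching into the future.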

Oil Spill – Predictions 2011

UI - Snippets
Snippet – With Source Summary
Snippet – With image support – Negative Image

Ongoing Work
• Better Sentiment Detection
– How has sentiment towards a particular topic changed
over time?
• Better Bias Detection
– How does Fox News differ from NYT in presenting global
warming?
• Future Mentions to Future Prediction
– Which opinions to trust?
– How to aggregate?
• Move to web dataset
– Domain shift – news to blogs
– Noisy data – boilerplate, more date formats, etc.
• Integrating multimedia data

Any Questions?
Thanks for your attention
Joint work with Mike Matthews, Peter Mika, Jordi Atserias, Hugo Zaragoza and many others

References
• Roi Blanco, Edward Bortnikov, Flavio Junqueira, Ronny Lempel, Luca Telloli, Hugo Zaragoza. Caching Search Engine Results over Incremental Indices. SIGIR 2010.
• Michael Matthews, Pancho Tolchinsky, Roi Blanco, Jordi Atserias, Peter Mika, Hugo Zaragoza. Searching through time in the New York Times. HCIR 2010.
• Nattiya Kanhabua, Roi Blanco, Michael Matthews. Ranking Related News Predictions. SIGIR 2011.
• Ricardo Baeza-Yates. Searching the Future. MF/IR workshop, 2005.
