Everything You Wish You Knew About Search
About
• Enterprise software company that develops products for software developers, project managers, and content management
• Our products:
About Me
Head of Search & Smarts Engineering at Atlassian
• In charge of all customer-facing ML/AI initiatives, including Search
• Our main initiative is Cross-Product Search in ‘Home’
Before Atlassian:
• Particle Physicist by training
• Initiated Data Science efforts at several companies
• Previously member of the Search team at @WalmartLabs
About this Talk
What to expect
• A general introduction to Search
• An overview of both the Engineering and ML aspects of Search
• Insights into the current and future challenges of Search
What not to expect
• An extensive tutorial covering the entire Learning-to-Rank landscape
• To become a Search expert in 40 min
Outline
• Part I: The Concepts of Search
• Part II: The Technical Aspects of Search
• Part III: Learning Algorithms
• Part IV: Measuring Search Relevance
• Part V: The Challenges and the Future of Search
Part I: The Concepts of Search
The (Pre)History of Search
• 1990: Archie — the first search engine: an index of downloadable directory listings
• 1991: Veronica, Jughead — search file names and titles stored in Gopher index systems
• 1992: VLib — Tim Berners-Lee set up a Virtual Library
• 1993: Excite; WWW Wanderer — primitive web search
• 1994: WebCrawler — 1st crawler to index entire pages; Lycos — ranked relevance retrieval; Yahoo! Directory
• 1995: AltaVista — first to allow natural-language queries; LookSmart
• 1996: Inktomi: HotBot
• 1997: Google; Ask.com
The History of Search
• 1998: MSN Search; Open Directory Project
• 1999: AllTheWeb
• 2000: Overture Services; Snap
• 2006: Live Search
• 2008: Cuil
• 2009: Bing
• 2010: Inline search suggestions
What is Search?
Convert an intent into an action that helps people retrieve something, i.e. a piece of content.
Search is the answer to content overload.
• Retrieving, organizing & classifying information
• Includes:
• Web Search
• Faceted Search (e-Commerce)
• Enterprise Search
• But also
• Different types of documents: Image Search, etc.
• In a wider sense of the term:
• Recommendation (Search with no explicit intent from the user)
• Structured Query Language
What is Search (Really) About?
Users, with an intent, on one side; Content, as documents, on the other:
• Users send a Request (the Search Query), which the engine must interpret (INTERPRETATION)
• The engine retrieves matching Documents from the Content (RETRIEVAL)
• The engine Returns the Search Results to the user (DISPLAY)
Multi-tenancy Search
Each user (User 1, 2, 3) has their own intent and their own content (Documents 1, 2, 3), but all share the same interpretation → retrieval → display pipeline.
• Query space not controlled
• Content dependent on customer
Data Zoo For Search
Query data
• What are you searching for? (query terms)
Content data
• What are the documents about? (topics)
Contextual data
• Who are you? (user data – both static and learned)
• In which circumstances are you searching?
Engagement data
• As a group (what web pages are ‘hot’ these days?)
• As an individual (your personal viewing history)
The Processes of Search
• CRAWLER: an automated browser that views your web pages and strips out the HTML text content
• INDEXER: stores records of all pages viewed by the spider/crawler; this is the database being searched when the ‘search’ button is hit
• SEARCHER: the algorithm used to sort through the database of pages and find the most relevant content
Part II: The Technical Aspects of Search
Search Engine Architecture
• Indexing pipeline: Crawler → Document Analyzer (document representation) → Indexer → Index (the indexed corpus)
• Query pipeline: User → Query (query representation) → Ranker (ranking procedure, over the index) → Results
• Evaluation and user feedback close the loop
Indexing
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
• Without an index, the search engine would scan every document in the corpus
• Benefits: computation and time saving at query time
• 10,000 documents can be queried within milliseconds with an index
• a sequential scan could take hours
• Disadvantages:
• additional computer storage required to store the index
• increase in the time required for an update to take place
• Design factors:
• Storage techniques
• Index size, lookup speed
• Maintenance, fault tolerance
What Happens at Indexing Time?
• Text Acquisition: identifies and stores documents for indexing (e-mail, web pages, news articles, memos, letters) in a document data store holding text + metadata (doc type, structure, features)
• Text Transformation: transforms documents into index terms or features
• Index Creation: takes index terms and creates data structures (inverted indexes) to support fast searching
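The inverted index at the heart of index creation can be sketched in a few lines of Python. This is a toy illustration, not a production indexer: the doc ids and texts are made up, tokenization is plain whitespace splitting, and postings are simple sorted lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted doc ids containing it (the inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """Conjunctive (AND) lookup: doc ids whose postings contain every term."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "cats with sunglasses", 2: "sunglasses for summer", 3: "cats and dogs"}
index = build_inverted_index(docs)
# search(index, "cats") -> [1, 3]; search(index, "cats", "sunglasses") -> [1]
```

Lookups touch only the postings of the query terms, which is why the indexed query runs in milliseconds where a sequential scan of the corpus would not.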
Parsing
1. Identify What To Search For
Find out what words get searched and interpret the query terms
2. Parse The Query Language Itself
Recognize and interpret operators (AND, OR, NOT, etc.) and field restrictors
3. Extend Search to Other Query Terms
This includes:
• Fuzzy Matching (spelling mistakes)
• Entity and Thematic Modeling (related words)
4. Relevance Ranking Improvements
… such as:
• boosting documents containing all of the terms close together (proximity weighting)
• boosting documents from trustworthy sources, demoting documents from unreliable sites
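As a rough illustration of step 2, a flat boolean query (no parentheses) can be evaluated left to right over an inverted index of term → doc-id sets. The grammar, index, and doc ids here are hypothetical simplifications; a real parser would build an expression tree and handle precedence and fields.

```python
def evaluate(query, index, all_docs):
    """Evaluate a flat boolean query such as 'cats AND NOT dogs' left to
    right (no parentheses) over an inverted index of term -> doc-id set."""
    tokens = query.split()

    def operand(i):
        # NOT applies to the following term: complement against the corpus
        if tokens[i].upper() == "NOT":
            return all_docs - index.get(tokens[i + 1].lower(), set()), i + 2
        return index.get(tokens[i].lower(), set()), i + 1

    result, i = operand(0)
    while i < len(tokens):
        op = tokens[i].upper()
        right, i = operand(i + 1)
        result = result & right if op == "AND" else result | right
    return sorted(result)

index = {"cats": {1, 3}, "sunglasses": {1, 2}, "dogs": {3}}
all_docs = {1, 2, 3}
# evaluate("cats AND NOT dogs", index, all_docs) -> [1]
```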
Ranking
Query: “Cats with sunglasses” — each result is assigned a relevance score f(query, document) ∈ [0, 1]:
• 0.9 — “Just hanging out with my sunglasses on. Am I cool or what?”
• 0.7 — “Me with glasses just because… it makes me smart.”
• 0.3 — “What I see right here is Jim Belushi as a cat. Along with the Blues Brothers behind.”
• 0.1 — “You will never be as capable of rocking shades… quite as well as this feline friend.”
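One classical choice for f(query, document) is a TF-IDF match score. The sketch below makes simplifying assumptions (whitespace tokenization, no stemming, made-up captions as the corpus), and unlike the slide's scores its values are not normalized to [0, 1]; only their relative order matters.

```python
import math
from collections import Counter

def tfidf_score(query, doc, corpus):
    """A toy f(query, document): sum over query terms of tf * idf."""
    n = len(corpus)
    tokens = doc.lower().split()
    tf = Counter(tokens)
    score = 0.0
    for term in query.lower().split():
        # document frequency: how many docs in the corpus contain the term
        df = sum(1 for d in corpus if term in d.lower().split())
        if df:
            score += (tf[term] / len(tokens)) * math.log(1 + n / df)
    return score

corpus = [
    "just hanging out with my sunglasses on",
    "me with glasses just because it makes me smart",
    "jim belushi as a cat",
]
# only the first caption mentions "sunglasses", so only it gets a non-zero score
```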
Query Re-Ranking
Re-ranking runs a simple query (A) to match documents, then re-orders the top N documents using the scores from a more complex query (B).
Example: the original rank has scores 0.9, 0.7, 0.3, 0.1; re-scoring the top N documents with query B (1.0, 0.9, 0.5) re-orders them, while the tail keeps its original order.
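The two-pass scheme can be sketched as below. The doc ids and the two scorers are hypothetical stand-ins for queries A and B; the scores mirror the slide's example numbers.

```python
def rerank(doc_ids, cheap_score, expensive_score, n=50):
    """Rank everything with the simple query (A), then re-order only the
    top-n documents with the expensive query (B); the tail keeps its order."""
    ranked = sorted(doc_ids, key=cheap_score, reverse=True)
    head = sorted(ranked[:n], key=expensive_score, reverse=True)
    return head + ranked[n:]

# hypothetical scores standing in for queries A and B
cheap = {"d1": 0.9, "d2": 0.7, "d3": 0.3, "d4": 0.1}.get
expensive = {"d1": 0.9, "d2": 1.0, "d3": 0.5, "d4": 0.0}.get
# rerank(["d1", "d2", "d3", "d4"], cheap, expensive, n=3)
#   -> ["d2", "d1", "d3", "d4"]
```

The expensive scorer runs only n times per query, which is what makes re-ranking affordable at serving time.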
Boosting and Personalization
Boosting runs a simple query (A) and modifies the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.):
new relevance = original relevance + α · (page clicks) + β · (total dwell time in minutes)
Example, with α = 0.03 and β = 0.01:
• 0.9 + α · 2,000 + β · 500 = 65.9
• 0.7 + α · 5,000 + β · 400 = 154.7
• 0.3 + α · 6,000 + β · 100 = 181.3
The boosted scores reverse the original rank.
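The boosting arithmetic can be checked directly. Assuming, as the slide's numbers imply, a linear combination of relevance, clicks, and dwell time:

```python
def boosted_relevance(rel, clicks, dwell_minutes, alpha=0.03, beta=0.01):
    """New relevance = original relevance + alpha * clicks + beta * dwell time."""
    return rel + alpha * clicks + beta * dwell_minutes

# the three documents from the example: (original relevance, clicks, dwell minutes)
rows = [(0.9, 2000, 500), (0.7, 5000, 400), (0.3, 6000, 100)]
scores = [round(boosted_relevance(*row), 1) for row in rows]
# scores == [65.9, 154.7, 181.3]: boosting reverses the original rank
```

Note how the engagement terms dwarf the original relevance at these α, β values; in practice the weights must be tuned so popularity signals re-order rather than overwhelm the text match.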
Part III: Learning Algorithms
Learning-to-Rank (1)
• Offline: Documents → Indexer → Index; Training data → Learning algorithm → Ranking model
• Online: User Query → Top-k retrieval against the Index → Ranking model orders the Results page
Learning-to-Rank (2)
• Learning System: the training data consists of queries q1 … qn, each with document feature vectors x1^(i), x2^(i), …, xm^(i) and relevance labels y^(i); the learning algorithm fits a model h
• Ranking System: for a new (test) query q with document feature vectors x1, x2, …, xm, the model h(x) predicts each document’s relevance
Learning-to-Rank Algorithms
Pointwise
• Predict relevance on a document-by-document basis
• 3 types of supervised machine learning algorithms can be used: regression-based algorithms, classification-based algorithms, and ordinal regression
Pairwise
• Tell which document is better in a given pair of documents: it is a classification problem
• The goal is to minimize the average number of inversions in ranking
Listwise
• Directly optimize one of the ranking evaluation measures
Pointwise Approach
• Predict the exact relevance degree of each document
• Assumes that each {query, document} pair has a numerical or ordinal score
• Input space contains the feature vector of every single document
• Can be approximated by a regression problem
• Ordinal regression: the {query, document} relevance score can only take a small, finite number of values

Summary:
                   Regression | Classification | Ordinal Regression
Input Space:       single documents x_j
Output Space:      real values | non-ordered categories | ordinal categories
Hypothesis Space:  scoring function f(x)
Loss Function:     regression loss | classification loss | ordinal regression loss — L(f; x_j, y_j)
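A minimal pointwise ranker, under toy assumptions (one feature per document, e.g. a text-match score; graded labels 0–2; the data is made up): fit a least-squares linear scorer to the labels, then sort documents by predicted score.

```python
def fit_pointwise(features, labels):
    """Least-squares fit of a one-feature linear scorer f(x) = w*x + b,
    treating each {query, document} relevance label as a regression target."""
    n = len(features)
    mx = sum(features) / n
    my = sum(labels) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(features, labels))
         / sum((x - mx) ** 2 for x in features))
    b = my - w * mx
    return lambda x: w * x + b

xs = [0.1, 0.4, 0.5, 0.9]  # one feature per document
ys = [0, 1, 1, 2]          # graded relevance labels
f = fit_pointwise(xs, ys)
ranking = sorted(range(len(xs)), key=lambda i: f(xs[i]), reverse=True)
# ranking == [3, 2, 1, 0]: documents ordered by predicted relevance
```

Real systems use many features and richer models; the ranking step (sort by f) is what all pointwise methods share.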
Pairwise Algorithms
• Focus on the relative order between 2 documents instead of predicting relevance
• Learn a binary classifier to tell which document is better in a pair of documents
• Goal: minimize the average number of inversions in ranking
• Pairwise preference is used as the ground truth
• Limitations: does not differentiate inversions at top vs. bottom positions
• Examples: RankNet

Summary:
Input Space:       document pairs (x_u, x_v)
Output Space:      preference y_{u,v} ∈ {+1, −1}
Hypothesis Space:  preference function h(x_u, x_v) = 2 · I{f(x_u) > f(x_v)} − 1
Loss Function:     pairwise classification loss L(h; x_u, x_v, y_{u,v})
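The preference function and the inversion count being minimized can be made concrete (a sketch with made-up scores and labels, not a trainable RankNet):

```python
def preference(f, xu, xv):
    """h(xu, xv) = 2 * I{f(xu) > f(xv)} - 1: +1 if xu is preferred, else -1."""
    return 2 * int(f(xu) > f(xv)) - 1

def inversions(scores, labels):
    """Count pairs the model orders against the ground-truth preference."""
    n = len(scores)
    return sum(1 for u in range(n) for v in range(n)
               if labels[u] > labels[v] and scores[u] <= scores[v])

labels = [2, 1, 0]        # ground-truth relevance of three documents
scores = [0.3, 0.9, 0.1]  # model scores: the top two documents are swapped
# inversions(scores, labels) -> 1
```

Note the count weighs all pairs equally, which is exactly the stated limitation: an inversion at rank 1 costs the same as one at rank 100.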
Listwise Algorithms
• Pick an evaluation measure & optimize its value, averaged over all queries
• Challenge: most measures are not continuous functions, so continuous approximations (surrogates) are used
• 2 types of approaches: minimization of listwise ranking losses, and direct optimization of IR evaluation measures

Summary:
                   Listwise Loss Minimization | Direct Optimization of IR Measure
Input Space:       document set x = {x_j}, j = 1 … m
Output Space:      permutation π_y | ordered categories y = {y_j}, j = 1 … m
Hypothesis Space:  h(x) = sort ∘ f(x) | h(x) = f(x)
Loss Function:     listwise loss L(h; x, π_y) | 1 − surrogate measure, L(h; x, y)
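One concrete listwise ranking loss (not named on the slide) is ListNet's top-one loss: the cross-entropy between the top-one probability distributions induced by the labels and by the model scores. A self-contained sketch with toy labels:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_loss(scores, labels):
    """ListNet-style top-one loss: cross-entropy between the top-one
    distributions induced by the labels and by the model scores."""
    p_true = softmax(labels)
    p_model = softmax(scores)
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model))

labels = [2.0, 1.0, 0.0]
good = listnet_loss([2.0, 1.0, 0.0], labels)  # scores in the correct order
bad = listnet_loss([0.0, 1.0, 2.0], labels)   # scores in reversed order
# good < bad: the loss rewards score lists that reproduce the label ordering
```

Because the loss is smooth in the scores, it can be minimized by gradient descent, sidestepping the non-continuity of NDCG-like measures.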
Summary
3 input documents: A, B, C

Different Methods:
• Pointwise: score each document independently — Score(A), Score(B), Score(C)
• Pairwise: classify each pair — f(A) > f(B), f(B) > f(C), f(A) > f(C)
• Listwise: score permutations of the whole list — P(A,B,C), P(B,A,C), P(B,C,A), …
• Output: Ranking = A > B > C
Graph-Based Algorithms
Example: the PageRank Algorithm
• Link analysis algorithm invented by Larry Page (Google co-founder)
• Score goes from 0 to 10
• Other alternatives: Page Authority, HostRank, voting algorithms, …
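PageRank itself is commonly computed by power iteration, as in this sketch (the three-page link graph is made up; the returned scores form a probability distribution over pages, not the 0–10 display scale):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank. `links` maps a page to the pages it
    links to; d is the damping factor. Returned scores sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages  # dangling page: spread evenly
            for q in targets:
                new[q] += d * rank[p] / len(targets)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
# "C" ends up with the highest score: it is linked from both "A" and "B"
```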
Features
Example features (TREC) — IR/NLP features (left) and linkage/engagement features (right):
1   TF of body             …    …
2   TF of anchor           51   PageRank
3   TF of title            52   HostRank
4   TF of URL              53   Topical PageRank
5   TF of whole document   54   Topical HITS authority
6   IDF of body            55   Topical HITS hub
7   IDF of anchor          56   Inlink number
8   IDF of title           57   Outlink number
9   IDF of URL             58   Number of slashes in URL
10  IDF of whole document  59   Length of URL
(TF: term frequency; IDF: inverse document frequency)
Conventional Ranking Models
Query-dependent
• Boolean model, extended Boolean model, etc.
• Vector space model, latent semantic indexing (LSI), etc.
• BM25 model, statistical language model, etc.
Query-independent
• PageRank, TrustRank, BrowseRank, etc.
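Of the query-dependent models above, BM25 is the workhorse; its standard Okapi form can be sketched directly (toy corpus, whitespace tokenization):

```python
import math

def bm25(query, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a query."""
    n = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n
    score = 0.0
    for term in query.lower().split():
        df = sum(1 for d in corpus_tokens if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = doc_tokens.count(term)
        # saturating tf, normalized by document length relative to the average
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [doc.split() for doc in [
    "cats with sunglasses",
    "sunglasses for summer days",
    "a story about dogs",
]]
scores = [bm25("cats sunglasses", doc, corpus) for doc in corpus]
# the document matching both terms ranks first; the last one scores 0
```

The free parameters k1 and b are exactly the kind of hand-tuned knobs the next slide complains about.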
Problems with Conventional Models
• Manual parameter tuning difficult
• Too many parameters
• Evaluation measures not smooth
• Sometimes leads to overfitting
• Ensemble approach (combining models into a more effective one) not trivial
Part IV: Measuring Search Relevance
Search Engine Evaluation: Index
Corpus size
• Number of pages indexed
Search engine overlap
• Fraction of pages indexed by engine A also indexed by engine B
Freshness
• Age of the pages in the index
Spam resilience
• Fraction of pages in the index that are spam
Duplicates
• Number of unique pages in the index
Search Engine Evaluation: Relevance Judgment
Types of judgments, classified similarly to the ranking algorithms:
1. Degree of Relevance
• Binary: relevant vs. irrelevant
• Multiple ordered categories:
Perfect > Excellent > Good > Fair > Bad
2. Pairwise Preference
• Document A is more relevant than document B
3. Total Order
• Documents are ranked as {A,B,C,..} according to their relevance
Evaluation Measure – MAP & NDCG

Precision at position k for query q:
P@k = #{relevant docs in top k results} / k

Average precision for query q:
AP = ( Σ_k P@k · l_k ) / #{relevant documents},
where l_k = 1 if the document at position k is relevant, 0 otherwise

NDCG at position k for query q:
NDCG@k = Z_k · Σ_{j=1}^{k} G(π⁻¹(j)) · η(j),
where G is the Gain, η(j) the position Discount, the sum makes it Cumulative, and Z_k Normalizes by the ideal ranking

MAP & NDCG (Mean Average Precision, Normalized Discounted Cumulative Gain) are averaged over all queries.
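These measures are short enough to compute by hand; the sketch below uses one common choice of gain (the raw graded label) and discount (1/log2 of the position + 1), with made-up judgments.

```python
import math

def precision_at_k(relevant, k):
    """relevant: 0/1 flags of the ranked results, best first."""
    return sum(relevant[:k]) / k

def average_precision(relevant):
    """Mean of P@k over the positions k that hold a relevant document."""
    total = sum(relevant)
    return sum(precision_at_k(relevant, k)
               for k, r in enumerate(relevant, 1) if r) / total

def ndcg_at_k(gains, k):
    """DCG with discount 1/log2(position + 1), normalized by the ideal DCG."""
    def dcg(gs):
        return sum(g / math.log2(j + 2) for j, g in enumerate(gs[:k]))
    return dcg(gains) / dcg(sorted(gains, reverse=True))

rel = [1, 0, 1, 0]       # binary judgments in ranked order
gains = [3, 2, 3, 0, 1]  # graded gains in ranked order
# P@2 = 0.5; AP = (P@1 + P@3) / 2 = 5/6; NDCG@5 < 1 since the order is not ideal
```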
Evaluation Measure - Summary
Query-level: every query contributes equally to the measure
• Computed on documents associated with the same query
• Bounded for each query
• Averaged over all test queries
Position-based: rank position is explicitly used (weighting)
• Top-ranked objects more important
• Relative order vs. relevance score of each document
• Rank is a non-continuous, non-differentiable function of the scores
Part V: The Challenges
and the Future of Search
The Challenges of Enterprise Search
• Near duplicates and versioning
• More recently, “quoting” between websites
• Metadata and file formats
• Search across multiple sources
• How to merge several indexes?
• Challenges with latency
• Security, Privacy, Regulations
Future Research
• User Logs as Ground Truth
• A gold mine that has not been leveraged so far
• Implicit feedback: click-through rates, etc.
• Feature Engineering
• New Directions of Research
• Semi-supervised Ranking
• Transfer Ranking
Conclusions
• While 20+ years old, Search is still hard
• But there are off-the-shelf solutions…
• A problem where ML can help (the learning-to-rank space)
• Most promising algorithms use a listwise approach
• Very dynamic area of research
• But doing Search well requires more than Learning-to-Rank: query parsing, topic modeling, etc.
• It is getting harder with ever more types of documents
Thank You for Your Attention!

References
• Learning-to-Rank for Information Retrieval, by Tie-Yan Liu
• Learning-to-Rank Tutorial, by Tie-Yan Liu
• The PageRank Model, by Ian Rogers
• Search is Hard, by Priyendra Deshwal
• Why Is Enterprise Search so Hard?, by Miles Kehoe

Everything You Wish You Knew About Search

  • 1.
    Everything You Wish YouKnew About Search
  • 2.
    About • Enterprise softwarecompany that develops products for software developers, project managers, and content management
  • 3.
    • Enterprise softwarecompany that develops products for software developers, project managers, and content management • Our products: About
  • 4.
    About Me Head ofSearch & Smarts Engineering at Atlassian • In charge of all customer-facing ML/AI initiatives, including Search • Our main initiative is Cross-Product Search in ‘Home’ Before Atlassian: • Particle Physicist by training • Initiated Data Science efforts at several companies • Previously member of the Search team at @WalmartLabs
  • 5.
    About this Talk Whatto expect • A general introduction to Search • A overview of both the Engineering and ML aspects of Search • Insights into the current and future challenges of Search What not to expect • An extensive tutorial covering the entire Learning-to-Rank landscape • To become a Search expert in 40 min
  • 6.
    Outline • Part I:The Concepts of Search • Part II: The Technical Aspects of Search • Part III: Learning Algorithms • Part IV: Measuring Search Relevance • Part V: The Challenges and the Future of Search
  • 7.
    Part I: TheConcepts of Search
  • 8.
    Altavista First to allowNL queries Web Crawler 1st crawler to index entire pages The (Pre)History of Search 1990 Archie First search engine: an index of downloadable directory listings 1991 Veronika, Jughead Search file names and titles stored in Gopher index systems 1992 Vlib Time Berners-Lee set up a Virtual Library 1993 Excite WWW Wanderer Primitive Web Search1994 1995 LookSmart 1996 Inktomi: HotBot Google 1997 Ask.com Lycos Ranked relevance retrieval Yahoo! Directory
  • 9.
    The History ofSearch 1998 MSN Open Directory Project 1999AllTheWeb Overture Services 2000 Snap 2003 2004 2001 2002 2005 2006 LiveSearch 2007 2008 2009 Cuil Bing Inline search suggestions 2010
  • 10.
    What is Search? Convertan intent into an action that helps people retrieve something, i.e. a piece of content CONTENT OVERLOAD Search
  • 11.
    What is Search? Convertan intent into an action that helps people retrieve something, i.e. a piece of content CONTENT OVERLOAD Search • Retrieving, organizing & classifying information • Includes: • Web Search • Faceted Search (e-Commerce) • Enterprise Search • But also • Different types of documents: Image Search, etc. • In a wider sense of the term: • Recommendation (Search with no explicit intent from the user) • Structured Query Language
  • 12.
    User Intent What isSearch (Really) About? Users
  • 13.
    User Intent What isSearch (Really) About? Users Content Documents
  • 14.
    User Intent What isSearch (Really) About? Users Content Request Search Query Return Search Results Documents INTERPRETATION DISPLAY RETRIEVAL
  • 15.
    User 1 -Intent What is Search (Really) About? Users Content Request Search Query Return Search Results Documents 1 INTERPRETATION DISPLAY RETRIEVAL • Query space not controlled • Content dependent on customer Multi-tenancy Search User 2 - Intent User 3 - Intent Documents 2 Documents 3 Request Search Query Return Search Results Request Search Query Return Search Results DISPLAY INTERPRETATION DISPLAY INTERPRETATION
  • 16.
    Query data • Whatare you searching for? (query terms) Content data • What are the documents about? (topics) Contextual data • Who are you? (user data – both static and learned) • In which circumstances are you searching? Engagement data • As a group (what web pages are ‘hot’ these days?) • As an individual (your personal viewing history) Data Zoo For Search
  • 17.
    CRAWLER strips out thehtml text content The Processes of Search Automated browser that views your web pages
  • 18.
    CRAWLER INDEXER strips out thehtml text content Stores records of all pages viewed by the spider/crawler The Processes of Search Automated browser that views your web pages Database being searched when ‘search’ button is hit
  • 19.
    CRAWLER INDEXER SEARCHER strips out thehtml text content Stores records of all pages viewed by the spider/crawler Algorithm used to sort through the database of pages The Processes of Search Automated browser that views your web pages Database being searched when ‘search’ button is hit finds the most relevant content
  • 20.
    Part II: TheTechnical Aspects of Search
  • 21.
    Search Engine Architecture Crawler Document Analyzer Indexer Indexedcorpus Document Representation Index Ranking procedure Ranker Feedback Results Query representation Query Evaluation User
  • 22.
    Indexing The purpose ofstoring an index is to optimize speed and performance in finding relevant documents for a search query. Indexing
  • 23.
    • Without anindex, the search engine would scan every document in the corpus • Benefits: computation and time saving at query time • 10,000 documents can be queried within milliseconds with an index • a sequential scan could take hours • Disadvantages: • additional computer storage required to store the index • increase in the time required for an update to take place • Design factors: • Storage techniques • Index size, lookup speed • Maintenance, fault tolerance Indexing The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Indexing
  • 24.
    What Happens atIndexing Time? Text + Metadata (Doc type, structure, features) Text Acquisition Index Takes index terms & creates data structures (inverted indexes) to support fast searching Transforms documents into index terms or features Document data store E-mail, Web pages, News articles, Memos, Letters Identifies and stores documents for indexing Indexing Process Index Creation Text Transformation
  • 25.
    1. Identify WhatTo Search For Find out what words get searched and interpret the query term 2. Parse The Query Language Itself Recognizing and interpreting operators (AND, OR, NOT, etc.) and field restrictors 3. Extend Search to Other Query Terms This includes: • Fuzzy Matching (spelling mistakes) • Entity and Thematic Modeling (related words) 4. Relevance Ranking Improvements … such as: • boosting documents containing all of the terms close together (proximity weighting) • boosting documents from trustworthy sources, reducing documents from unreliable sites Parsing
  • 26.
  • 27.
  • 28.
    Ranking Cats with sunglasses Justhanging out with my sunglasses on Am I cool or what? Me with glasses just because… it makes me smart. What I see right here is Jim Belushi as a cat. Along with the Blues Brothers behind. You will never be as capable of rocking shades… quite as well as this feline friend.
  • 29.
    Ranking Relevance score ∈0,1 0.9 0.7 0.3 0.1 Cats with sunglasses Just hanging out with my sunglasses on Am I cool or what? Me with glasses just because… it makes me smart. What I see right here is Jim Belushi as a cat. Along with the Blues Brothers behind. You will never be as capable of rocking shades… quite as well as this feline friend. 𝑓 𝑞𝑢𝑒𝑟𝑦, 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡
  • 30.
    Reranking Allows to runa simple query (A) for matching documents and re-order the top N documents using the scores from a more complex query (B) Query Re-Ranking
  • 31.
    Reranking Allows to runa simple query (A) for matching documents and re-order the top N documents using the scores from a more complex query (B) Query Re-Ranking 0.9 0.7 0.3 0.1 Original rank
  • 32.
    Reranking Allows to runa simple query (A) for matching documents and re-order the top N documents using the scores from a more complex query (B) Query Re-Ranking 0.9 0.7 0.3 0.1 TopNdocuments Original rank
  • 33.
    Reranking Allows to runa simple query (A) for matching documents and re-order the top N documents using the scores from a more complex query (B) Query Re-Ranking 0.9 0.7 0.3 0.1 TopNdocuments Original rank 1.0 0.9 0.5 Re-ranking
  • 34.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.)
  • 35.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.) 0.9 0.7 0.3 Original relevance Original rank
  • 36.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.) 0.9 0.7 0.3 Original relevance Original rank 2,000 5,000 6,000 Page clicks + 𝛼  . + 𝛼  . + 𝛼  .
  • 37.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.) 0.9 0.7 0.3 Original relevance Original rank 2,000 5,000 6,000 Page clicks + 𝛼  . + 𝛼  . + 𝛼  . Total dwell time (minutes) 500 400 100 + 𝛽. + 𝛽. + 𝛽.
  • 38.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.) 0.9 0.7 0.3 Original relevance Original rank 2,000 5,000 6,000 Page clicks + 𝛼  . + 𝛼  . + 𝛼  . Total dwell time (minutes) 500 400 100 + 𝛽. + 𝛽. + 𝛽. New relevance = 65.9 = 154.7 = 181.3 𝛼 = 0.03, 𝛽 = 0.01 New rank
  • 39.
  • 40.
    Learning-to-Rank (1) User Query Top-kretrieval Results page Ranking model Learning algorithm Training data Documents Indexer Index
  • 41.
    Learning-to-Rank (2) Learning System Ranking System Modelh q x1 x2 xm h(x) … q x1 x2 xm ? … q1 x1 (1) x2 (1) xm(1) (1) y (1) … q2 x1 (2) x2 (2) xm(2) (2) y (2) … qn x1 (n) x2 (n) xm(n) (n) y (n) … … Training Data Test Data Prediction
  • 42.
    Pointwise • Predict relevanceon a document-by-document basis • 3 types of supervised machine learning algorithms can be used: • Regression-based algorithms • Classification-based algorithms • Ordinal regression Learning-to-Rank Algorithms
  • 43.
    Pointwise • Predict relevanceon a document-by-document basis • 3 types of supervised machine learning algorithms can be used: • Regression-based algorithms • Classification-based algorithms • Ordinal regression Pairwise • Tell which document is better in a given pair of documents: it is a classification problem • The goal is to minimize average number of inversions in ranking Learning-to-Rank Algorithms
  • 44.
    Pointwise • Predict relevanceon a document-by-document basis • 3 types of supervised machine learning algorithms can be used: • Regression-based algorithms • Classification-based algorithms • Ordinal regression Pairwise • Tell which document is better in a given pair of documents: it is a classification problem • The goal is to minimize average number of inversions in ranking Listwise • Directly optimize one of the ranking evaluation measures Learning-to-Rank Algorithms
  • 45.
    Pointwise Approach • Predictthe exact relevance degree of each document • Assumes that each {query, document} pair has a numerical or ordinal score • Input space contains the feature vector of every single document • Can be approximated by a regression problem • Ordinal regression: • {query, document} relevance score can only take small, finite number of values
  • 46.
    Pointwise Approach Regression ClassificationOrdinal Regression Input Space Single Documents yj Output Space Real Values Non-ordered Categories Ordinal Categories Hypothesis Space Scoring Function f(x) Loss Function Regression Loss Classification Loss Ordinal Regression Loss L(f; xj, yj) • Predict the exact relevance degree of each document • Assumes that each {query, document} pair has a numerical or ordinal score • Input space contains the feature vector of every single document • Can be approximated by a regression problem • Ordinal regression: • {query, document} relevance score can only take small, finite number of values Summary
  • 47.
    • Focus onrelative order between 2 documents instead of predicting relevance • Learn a binary classifier to tell which document is better in a pair of documents • Goal: minimize average number of inversions in ranking • Pairwise preference is used as the ground truth • Limitations: • Does not differentiate inversions at top vs. bottom positions • Examples: • RankNet Pairwise Algorithms
  • 48.
    • Focus onrelative order between 2 documents instead of predicting relevance • Learn a binary classifier to tell which document is better in a pair of documents • Goal: minimize average number of inversions in ranking • Pairwise preference is used as the ground truth • Limitations: • Does not differentiate inversions at top vs. bottom positions • Examples: • RankNet Pairwise Algorithms Input Space Document pairs (xu, xv) Output Space Preference 𝑦5,6 ∈ {+1, −1} Hypothesis Space Preference function ℎ 𝑥5, 𝑥6 = 2. 𝐼{@ AB C@ AD } − 1 Loss Function Pairwise classification loss 𝐿(ℎ; 𝑥5, 𝑥6, 𝑦5,6) Summary
  • 49.
    • Pick anevaluation measure & optimize its value, averaged over all queries • Challenges: • Continuous approximations on measures used b/c most are not continuous functions • 2 Types of approaches: • Direct Optimization of IR Evaluation Measures • Minimization of Listwise Ranking Losses Listwise Algorithms
  • 50.
    • Pick anevaluation measure & optimize its value, averaged over all queries • Challenges: • Continuous approximations on measures used b/c most are not continuous functions • 2 Types of approaches: • Direct Optimization of IR Evaluation Measures • Minimization of Listwise Ranking Losses Listwise Algorithms Listwise Loss Minimization Direct Optimization of IR Measure Input Space Document set 𝒙 =  {𝑥J}JKL M Output Space Permutation 𝜋O Ordered Categories 𝒚 =  {𝑦J}JKL M Hypothesis Space ℎ 𝑥 = 𝑠𝑜𝑟𝑡 ∘ 𝑓(𝑥) ℎ 𝑥 = 𝑓(𝑥) Loss Function Listwise Loss 𝐿(ℎ; 𝒙, 𝜋O) 1-surrogate Measure 𝐿(ℎ; 𝒙, 𝒚) Summary
  • 51.
    3 input ligands:C Summary B A DifferentMethods Pointwise Pairwise Listwise C Score(C) B Score(B) A Score(A) BA f(A)>f(B) CB f(B)>f(C) CA f(A)>f(C) CBA PA,B,C CB A PB,A,C CB A PB,C,A Output Ranking = CBA
  • 52.
    • Link analysisalgorithm Example: the PageRank Algorithm • Algorithm invented by Larry Page (Google founder) • score goes from 0 to 10 • Other Alternatives: • Page Authority • HostRank • Voting Algorithms • … Graph-Based Algorithms A A C B B B B B C
Example features (TREC)

IR/NLP features:
 1  TF of body              6  IDF of body
 2  TF of anchor            7  IDF of anchor
 3  TF of title             8  IDF of title
 4  TF of URL               9  IDF of URL
 5  TF of whole document   10  IDF of whole document

Linkage / engagement features:
51  PageRank               56  Inlink number
52  HostRank               57  Outlink number
53  Topical PageRank       58  Number of slashes in URL
54  Topical HITS authority 59  Length of URL
55  Topical HITS hub

TF: term frequency; IDF: inverse document frequency
Conventional Ranking Models
Query-dependent:
• Boolean model, extended Boolean model, etc.
• Vector space model, latent semantic indexing (LSI), etc.
• BM25 model, statistical language model, etc.
Query-independent:
• PageRank, TrustRank, BrowseRank, etc.
Problems with Conventional Models:
• Manual parameter tuning is difficult
• Too many parameters
• Evaluation measures are not smooth
• Sometimes leads to overfitting
• Ensemble approaches (combining models into a more effective one) are not trivial
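The BM25 model mentioned above is compact enough to sketch directly; the classic formula combines a smoothed IDF with a saturating term frequency and a document-length normalization. The tiny corpus and the default parameters k1 = 1.2, b = 0.75 are illustrative choices, not values from the talk:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """BM25 score of one document for a bag-of-words query.
    corpus is a list of token lists; doc is one of them."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)  # smoothed IDF
        tf = doc.count(term)
        # saturating TF with length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["search", "engine", "ranking"],
          ["cooking", "recipes"],
          ["search", "search", "index"]]
q = ["search", "ranking"]
scores = [bm25_score(q, d, corpus) for d in corpus]
print(scores.index(max(scores)))  # doc 0, the only one matching both terms
```

Note how the two free parameters (k1 and b) already illustrate the slide's point: conventional models require manual tuning, and the tuning interacts with non-smooth evaluation measures.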
Part IV: Measuring Search Relevance
Search Engine Evaluation: Index
• Corpus size: number of pages indexed
• Search engine overlap: fraction of pages indexed by engine A that are also indexed by engine B
• Freshness: age of the pages in the index
• Spam resilience: fraction of pages in the index that are spam
• Duplicates: number of unique pages in the index
Search Engine Evaluation: Relevance Judgment
Types of judgments, classified similarly to ranking algorithms:
1. Degree of relevance
  • Binary: relevant vs. irrelevant
  • Multiple ordered categories: Perfect > Excellent > Good > Fair > Bad
2. Pairwise preference
  • Document A is more relevant than document B
3. Total order
  • Documents are ranked as {A, B, C, …} according to their relevance
Evaluation Measures: MAP & NDCG

Precision at position k for query q:
P@k = #{relevant docs in top k results} / k

Average precision for query q (l_k = 1 if the document at position k is relevant, else 0):
AP = ( Σ_k P@k · l_k ) / #{relevant documents}

NDCG at position k for query q (G: gain of the document ranked at position j, η(j): position discount, Z_k: normalization so the ideal ranking scores 1):
NDCG@k = Z_k · Σ_{j=1..k} G(π⁻¹(j)) · η(j)

MAP & NDCG: averaged over all queries
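The two measures can be sketched directly from their definitions; the example query below (relevant set, ranking, and gain values) is made up for illustration:

```python
import math

def average_precision(relevant, ranking):
    """AP for one query: mean of P@k over the positions of relevant results."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k          # P@k at each relevant position
    return total / len(relevant)

def ndcg_at_k(gains, k):
    """NDCG@k from graded gains in ranked order, log2 position discount."""
    dcg = sum(g / math.log2(j + 2) for j, g in enumerate(gains[:k]))
    ideal = sum(g / math.log2(j + 2)
                for j, g in enumerate(sorted(gains, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

# Relevant docs {a, c} retrieved at ranks 1 and 3:
ap = average_precision({"a", "c"}, ["a", "b", "c", "d"])
print(round(ap, 3))        # (1/1 + 2/3) / 2 = 0.833

# Graded gains of the ranked results (3 = Perfect ... 0 = Bad):
print(ndcg_at_k([3, 2, 0, 1], k=4))
```

Both functions are computed per query and then averaged across the test queries, exactly as the summary slide describes; the normalization inside NDCG is what keeps every query's contribution bounded by 1.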
Evaluation Measures: Summary
Query-level: every query contributes equally to the measure
• Computed on the documents associated with the same query
• Bounded for each query
• Averaged over all test queries
Position-based: rank position is explicitly used (weighting)
• Top-ranked objects are more important
• Based on the relative order of documents rather than the relevance score of each document
• Rank is a non-continuous, non-differentiable function of the scores
Part V: The Challenges and the Future of Search
The Challenges of Enterprise Search
• Near duplicates and versioning
• More recently, “quoting” between websites
• Metadata and file formats
• Search across multiple sources:
  • How to merge several indexes?
  • Challenges with latency
• Security, privacy, and regulations
Future Research
• User logs as ground truth:
  • A gold mine that has not been fully leveraged so far
  • Implicit feedback: click-through rates, etc.
• Feature engineering
• New directions of research:
  • Semi-supervised ranking
  • Transfer ranking
Conclusions
• While 20+ years old, Search is still hard
• But there are off-the-shelf solutions…
• A problem where ML can help (the learning-to-rank space)
• The most promising algorithms use a listwise approach
• A very dynamic area of research
• But doing Search well requires more than learning-to-rank:
  • Query parsing, topic modeling, etc.
• It is getting harder with ever more types of documents
Thank You for Your Attention!
References
• Learning to Rank for Information Retrieval, by Tie-Yan Liu
• Learning to Rank Tutorial, by Tie-Yan Liu
• The PageRank Model, by Ian Rogers
• Search is Hard, by Priyendra Deshwal
• Why Is Enterprise Search so Hard?, by Miles Kehoe