Everything You Wish You Knew About Search
About
• Enterprise software company that develops products for software developers, project managers, and content management
• Our products:
About Me
Head of Search & Smarts Engineering at Atlassian
• In charge of all customer-facing ML/AI initiatives, including Search
• Our main initiative is Cross-Product Search in ‘Home’
Before Atlassian:
• Particle Physicist by training
• Initiated Data Science efforts at several companies
• Previously member of the Search team at @WalmartLabs
About this Talk
What to expect
• A general introduction to Search
• An overview of both the Engineering and ML aspects of Search
• Insights into the current and future challenges of Search
What not to expect
• An extensive tutorial covering the entire Learning-to-Rank landscape
• To become a Search expert in 40 min
Outline
• Part I: The Concepts of Search
• Part II: The Technical Aspects of Search
• Part III: Learning Algorithms
• Part IV: Measuring Search Relevance
• Part V: The Challenges and the Future of Search
Part I: The Concepts of Search
The (Pre)History of Search
• 1990: Archie — the first search engine: an index of downloadable directory listings
• 1991: Veronica, Jughead — search file names and titles stored in Gopher index systems
• 1992: VLib — Tim Berners-Lee set up a Virtual Library
• 1993: Excite; WWW Wanderer — primitive web search
• 1994: WebCrawler — 1st crawler to index entire pages; Lycos — ranked relevance retrieval; Yahoo! Directory
• 1995: AltaVista — first to allow natural-language queries; LookSmart
• 1996: Inktomi: HotBot
• 1997: Google; Ask.com
The History of Search
• 1998: MSN Search; Open Directory Project
• 1999: AllTheWeb
• 2000: Overture Services; Snap
• 2006: Live Search
• 2008: Cuil
• 2009: Bing
• 2010: Inline search suggestions
What is Search?
Convert an intent into an action that helps people retrieve something, i.e. a piece of content.
Search is the answer to content overload.
• Retrieving, organizing & classifying information
• Includes:
• Web Search
• Faceted Search (e-Commerce)
• Enterprise Search
• But also
• Different types of documents: Image Search, etc.
• In a wider sense of the term:
• Recommendation (Search with no explicit intent from the user)
• Structured Query Language
What is Search (Really) About?
Users, with an intent, on one side; Content, as documents, on the other:
• Users send a Request (the Search Query), which the engine must interpret (INTERPRETATION)
• The engine retrieves matching Documents from the Content (RETRIEVAL)
• The engine Returns the Search Results to the user (DISPLAY)
Multi-tenancy Search
Each user (User 1, 2, 3) has their own intent and their own content (Documents 1, 2, 3), but all share the same interpretation → retrieval → display pipeline.
• Query space not controlled
• Content dependent on customer
Data Zoo For Search
Query data
• What are you searching for? (query terms)
Content data
• What are the documents about? (topics)
Contextual data
• Who are you? (user data – both static and learned)
• In which circumstances are you searching?
Engagement data
• As a group (what web pages are ‘hot’ these days?)
• As an individual (your personal viewing history)
The Processes of Search
• CRAWLER: an automated browser that views your web pages and strips out the HTML text content
• INDEXER: stores records of all pages viewed by the spider/crawler; this is the database being searched when the ‘search’ button is hit
• SEARCHER: the algorithm used to sort through the database of pages and find the most relevant content
Part II: The Technical Aspects of Search
Search Engine Architecture
• Indexing pipeline: Crawler → Document Analyzer (document representation) → Indexer → Index (the indexed corpus)
• Query pipeline: User → Query (query representation) → Ranker (ranking procedure, over the index) → Results
• Evaluation and user feedback close the loop
Indexing
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
• Without an index, the search engine would scan every document in the corpus
• Benefits: computation and time saving at query time
• 10,000 documents can be queried within milliseconds with an index
• a sequential scan could take hours
• Disadvantages:
• additional computer storage required to store the index
• increase in the time required for an update to take place
• Design factors:
• Storage techniques
• Index size, lookup speed
• Maintenance, fault tolerance
What Happens at Indexing Time?
• Text Acquisition: identifies and stores documents for indexing (e-mail, web pages, news articles, memos, letters) in a document data store holding text + metadata (doc type, structure, features)
• Text Transformation: transforms documents into index terms or features
• Index Creation: takes index terms and creates data structures (inverted indexes) to support fast searching
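The inverted index at the heart of index creation can be sketched in a few lines of Python. This is a toy illustration, not a production indexer: the doc ids and texts are made up, tokenization is plain whitespace splitting, and postings are simple sorted lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted doc ids containing it (the inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """Conjunctive (AND) lookup: doc ids whose postings contain every term."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "cats with sunglasses", 2: "sunglasses for summer", 3: "cats and dogs"}
index = build_inverted_index(docs)
# search(index, "cats") -> [1, 3]; search(index, "cats", "sunglasses") -> [1]
```

Lookups touch only the postings of the query terms, which is why the indexed query runs in milliseconds where a sequential scan of the corpus would not.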
Parsing
1. Identify What To Search For
Find out what words get searched and interpret the query terms
2. Parse The Query Language Itself
Recognize and interpret operators (AND, OR, NOT, etc.) and field restrictors
3. Extend Search to Other Query Terms
This includes:
• Fuzzy Matching (spelling mistakes)
• Entity and Thematic Modeling (related words)
4. Relevance Ranking Improvements
… such as:
• boosting documents containing all of the terms close together (proximity weighting)
• boosting documents from trustworthy sources, demoting documents from unreliable sites
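As a rough illustration of step 2, a flat boolean query (no parentheses) can be evaluated left to right over an inverted index of term → doc-id sets. The grammar, index, and doc ids here are hypothetical simplifications; a real parser would build an expression tree and handle precedence and fields.

```python
def evaluate(query, index, all_docs):
    """Evaluate a flat boolean query such as 'cats AND NOT dogs' left to
    right (no parentheses) over an inverted index of term -> doc-id set."""
    tokens = query.split()

    def operand(i):
        # NOT applies to the following term: complement against the corpus
        if tokens[i].upper() == "NOT":
            return all_docs - index.get(tokens[i + 1].lower(), set()), i + 2
        return index.get(tokens[i].lower(), set()), i + 1

    result, i = operand(0)
    while i < len(tokens):
        op = tokens[i].upper()
        right, i = operand(i + 1)
        result = result & right if op == "AND" else result | right
    return sorted(result)

index = {"cats": {1, 3}, "sunglasses": {1, 2}, "dogs": {3}}
all_docs = {1, 2, 3}
# evaluate("cats AND NOT dogs", index, all_docs) -> [1]
```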
Ranking
Query: “Cats with sunglasses” — each result is assigned a relevance score f(query, document) ∈ [0, 1]:
• 0.9 — “Just hanging out with my sunglasses on. Am I cool or what?”
• 0.7 — “Me with glasses just because… it makes me smart.”
• 0.3 — “What I see right here is Jim Belushi as a cat. Along with the Blues Brothers behind.”
• 0.1 — “You will never be as capable of rocking shades… quite as well as this feline friend.”
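One classical choice for f(query, document) is a TF-IDF match score. The sketch below makes simplifying assumptions (whitespace tokenization, no stemming, made-up captions as the corpus), and unlike the slide's scores its values are not normalized to [0, 1]; only their relative order matters.

```python
import math
from collections import Counter

def tfidf_score(query, doc, corpus):
    """A toy f(query, document): sum over query terms of tf * idf."""
    n = len(corpus)
    tokens = doc.lower().split()
    tf = Counter(tokens)
    score = 0.0
    for term in query.lower().split():
        # document frequency: how many docs in the corpus contain the term
        df = sum(1 for d in corpus if term in d.lower().split())
        if df:
            score += (tf[term] / len(tokens)) * math.log(1 + n / df)
    return score

corpus = [
    "just hanging out with my sunglasses on",
    "me with glasses just because it makes me smart",
    "jim belushi as a cat",
]
# only the first caption mentions "sunglasses", so only it gets a non-zero score
```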
Query Re-Ranking
Re-ranking runs a simple query (A) to match documents, then re-orders the top N documents using the scores from a more complex query (B).
Example: the original rank has scores 0.9, 0.7, 0.3, 0.1; re-scoring the top N documents with query B (1.0, 0.9, 0.5) re-orders them, while the tail keeps its original order.
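The two-pass scheme can be sketched as below. The doc ids and the two scorers are hypothetical stand-ins for queries A and B; the scores mirror the slide's example numbers.

```python
def rerank(doc_ids, cheap_score, expensive_score, n=50):
    """Rank everything with the simple query (A), then re-order only the
    top-n documents with the expensive query (B); the tail keeps its order."""
    ranked = sorted(doc_ids, key=cheap_score, reverse=True)
    head = sorted(ranked[:n], key=expensive_score, reverse=True)
    return head + ranked[n:]

# hypothetical scores standing in for queries A and B
cheap = {"d1": 0.9, "d2": 0.7, "d3": 0.3, "d4": 0.1}.get
expensive = {"d1": 0.9, "d2": 1.0, "d3": 0.5, "d4": 0.0}.get
# rerank(["d1", "d2", "d3", "d4"], cheap, expensive, n=3)
#   -> ["d2", "d1", "d3", "d4"]
```

The expensive scorer runs only n times per query, which is what makes re-ranking affordable at serving time.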
Boosting and Personalization
Boosting runs a simple query (A) and modifies the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.):
new relevance = original relevance + α · (page clicks) + β · (total dwell time in minutes)
Example, with α = 0.03 and β = 0.01:
• 0.9 + α · 2,000 + β · 500 = 65.9
• 0.7 + α · 5,000 + β · 400 = 154.7
• 0.3 + α · 6,000 + β · 100 = 181.3
The boosted scores reverse the original rank.
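The boosting arithmetic can be checked directly. Assuming, as the slide's numbers imply, a linear combination of relevance, clicks, and dwell time:

```python
def boosted_relevance(rel, clicks, dwell_minutes, alpha=0.03, beta=0.01):
    """New relevance = original relevance + alpha * clicks + beta * dwell time."""
    return rel + alpha * clicks + beta * dwell_minutes

# the three documents from the example: (original relevance, clicks, dwell minutes)
rows = [(0.9, 2000, 500), (0.7, 5000, 400), (0.3, 6000, 100)]
scores = [round(boosted_relevance(*row), 1) for row in rows]
# scores == [65.9, 154.7, 181.3]: boosting reverses the original rank
```

Note how the engagement terms dwarf the original relevance at these α, β values; in practice the weights must be tuned so popularity signals re-order rather than overwhelm the text match.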
Part III: Learning Algorithms
Learning-to-Rank (1)
• Offline: Documents → Indexer → Index; Training data → Learning algorithm → Ranking model
• Online: User Query → Top-k retrieval against the Index → Ranking model orders the Results page
Learning-to-Rank (2)
• Learning System: the training data consists of queries q1 … qn, each with document feature vectors x1^(i), x2^(i), …, xm^(i) and relevance labels y^(i); the learning algorithm fits a model h
• Ranking System: for a new (test) query q with document feature vectors x1, x2, …, xm, the model h(x) predicts each document’s relevance
Learning-to-Rank Algorithms
Pointwise
• Predict relevance on a document-by-document basis
• 3 types of supervised machine learning algorithms can be used: regression-based algorithms, classification-based algorithms, and ordinal regression
Pairwise
• Tell which document is better in a given pair of documents: it is a classification problem
• The goal is to minimize the average number of inversions in ranking
Listwise
• Directly optimize one of the ranking evaluation measures
Pointwise Approach
• Predict the exact relevance degree of each document
• Assumes that each {query, document} pair has a numerical or ordinal score
• Input space contains the feature vector of every single document
• Can be approximated by a regression problem
• Ordinal regression: the {query, document} relevance score can only take a small, finite number of values

Summary:
                   Regression | Classification | Ordinal Regression
Input Space:       single documents x_j
Output Space:      real values | non-ordered categories | ordinal categories
Hypothesis Space:  scoring function f(x)
Loss Function:     regression loss | classification loss | ordinal regression loss — L(f; x_j, y_j)
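A minimal pointwise ranker, under toy assumptions (one feature per document, e.g. a text-match score; graded labels 0–2; the data is made up): fit a least-squares linear scorer to the labels, then sort documents by predicted score.

```python
def fit_pointwise(features, labels):
    """Least-squares fit of a one-feature linear scorer f(x) = w*x + b,
    treating each {query, document} relevance label as a regression target."""
    n = len(features)
    mx = sum(features) / n
    my = sum(labels) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(features, labels))
         / sum((x - mx) ** 2 for x in features))
    b = my - w * mx
    return lambda x: w * x + b

xs = [0.1, 0.4, 0.5, 0.9]  # one feature per document
ys = [0, 1, 1, 2]          # graded relevance labels
f = fit_pointwise(xs, ys)
ranking = sorted(range(len(xs)), key=lambda i: f(xs[i]), reverse=True)
# ranking == [3, 2, 1, 0]: documents ordered by predicted relevance
```

Real systems use many features and richer models; the ranking step (sort by f) is what all pointwise methods share.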
Pairwise Algorithms
• Focus on the relative order between 2 documents instead of predicting relevance
• Learn a binary classifier to tell which document is better in a pair of documents
• Goal: minimize the average number of inversions in ranking
• Pairwise preference is used as the ground truth
• Limitations: does not differentiate inversions at top vs. bottom positions
• Examples: RankNet

Summary:
Input Space:       document pairs (x_u, x_v)
Output Space:      preference y_{u,v} ∈ {+1, −1}
Hypothesis Space:  preference function h(x_u, x_v) = 2 · I{f(x_u) > f(x_v)} − 1
Loss Function:     pairwise classification loss L(h; x_u, x_v, y_{u,v})
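The preference function and the inversion count being minimized can be made concrete (a sketch with made-up scores and labels, not a trainable RankNet):

```python
def preference(f, xu, xv):
    """h(xu, xv) = 2 * I{f(xu) > f(xv)} - 1: +1 if xu is preferred, else -1."""
    return 2 * int(f(xu) > f(xv)) - 1

def inversions(scores, labels):
    """Count pairs the model orders against the ground-truth preference."""
    n = len(scores)
    return sum(1 for u in range(n) for v in range(n)
               if labels[u] > labels[v] and scores[u] <= scores[v])

labels = [2, 1, 0]        # ground-truth relevance of three documents
scores = [0.3, 0.9, 0.1]  # model scores: the top two documents are swapped
# inversions(scores, labels) -> 1
```

Note the count weighs all pairs equally, which is exactly the stated limitation: an inversion at rank 1 costs the same as one at rank 100.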
Listwise Algorithms
• Pick an evaluation measure & optimize its value, averaged over all queries
• Challenge: most measures are not continuous functions, so continuous approximations (surrogates) are used
• 2 types of approaches: minimization of listwise ranking losses, and direct optimization of IR evaluation measures

Summary:
                   Listwise Loss Minimization | Direct Optimization of IR Measure
Input Space:       document set x = {x_j}, j = 1 … m
Output Space:      permutation π_y | ordered categories y = {y_j}, j = 1 … m
Hypothesis Space:  h(x) = sort ∘ f(x) | h(x) = f(x)
Loss Function:     listwise loss L(h; x, π_y) | 1 − surrogate measure, L(h; x, y)
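One concrete listwise ranking loss (not named on the slide) is ListNet's top-one loss: the cross-entropy between the top-one probability distributions induced by the labels and by the model scores. A self-contained sketch with toy labels:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_loss(scores, labels):
    """ListNet-style top-one loss: cross-entropy between the top-one
    distributions induced by the labels and by the model scores."""
    p_true = softmax(labels)
    p_model = softmax(scores)
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model))

labels = [2.0, 1.0, 0.0]
good = listnet_loss([2.0, 1.0, 0.0], labels)  # scores in the correct order
bad = listnet_loss([0.0, 1.0, 2.0], labels)   # scores in reversed order
# good < bad: the loss rewards score lists that reproduce the label ordering
```

Because the loss is smooth in the scores, it can be minimized by gradient descent, sidestepping the non-continuity of NDCG-like measures.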
Summary
3 input documents: A, B, C

Different Methods:
• Pointwise: score each document independently — Score(A), Score(B), Score(C)
• Pairwise: classify each pair — f(A) > f(B), f(B) > f(C), f(A) > f(C)
• Listwise: score permutations of the whole list — P(A,B,C), P(B,A,C), P(B,C,A), …
• Output: Ranking = A > B > C
Graph-Based Algorithms
Example: the PageRank Algorithm
• Link analysis algorithm invented by Larry Page (Google co-founder)
• Score goes from 0 to 10
• Other alternatives: Page Authority, HostRank, voting algorithms, …
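PageRank itself is commonly computed by power iteration, as in this sketch (the three-page link graph is made up; the returned scores form a probability distribution over pages, not the 0–10 display scale):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank. `links` maps a page to the pages it
    links to; d is the damping factor. Returned scores sum to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages  # dangling page: spread evenly
            for q in targets:
                new[q] += d * rank[p] / len(targets)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
# "C" ends up with the highest score: it is linked from both "A" and "B"
```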
Features
Example features (TREC) — IR/NLP features (left) and linkage/engagement features (right):
1   TF of body             …    …
2   TF of anchor           51   PageRank
3   TF of title            52   HostRank
4   TF of URL              53   Topical PageRank
5   TF of whole document   54   Topical HITS authority
6   IDF of body            55   Topical HITS hub
7   IDF of anchor          56   Inlink number
8   IDF of title           57   Outlink number
9   IDF of URL             58   Number of slashes in URL
10  IDF of whole document  59   Length of URL
(TF: term frequency; IDF: inverse document frequency)
Conventional Ranking Models
Query-dependent
• Boolean model, extended Boolean model, etc.
• Vector space model, latent semantic indexing (LSI), etc.
• BM25 model, statistical language model, etc.
Query-independent
• PageRank, TrustRank, BrowseRank, etc.
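Of the query-dependent models above, BM25 is the workhorse; its standard Okapi form can be sketched directly (toy corpus, whitespace tokenization):

```python
import math

def bm25(query, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a query."""
    n = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n
    score = 0.0
    for term in query.lower().split():
        df = sum(1 for d in corpus_tokens if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = doc_tokens.count(term)
        # saturating tf, normalized by document length relative to the average
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [doc.split() for doc in [
    "cats with sunglasses",
    "sunglasses for summer days",
    "a story about dogs",
]]
scores = [bm25("cats sunglasses", doc, corpus) for doc in corpus]
# the document matching both terms ranks first; the last one scores 0
```

The free parameters k1 and b are exactly the kind of hand-tuned knobs the next slide complains about.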
Problems with Conventional Models
• Manual parameter tuning difficult
• Too many parameters
• Evaluation measures not smooth
• Sometimes leads to overfitting
• Ensemble approach (combining models into a more effective one) not trivial
Part IV: Measuring Search Relevance
Search Engine Evaluation: Index
Corpus size
• Number of pages indexed
Search engine overlap
• Fraction of pages indexed by engine A also indexed by engine B
Freshness
• Age of the pages in the index
Spam resilience
• Fraction of pages in the index that are spam
Duplicates
• Number of unique pages in the index
Search Engine Evaluation: Relevance Judgment
Types of judgments, classified similarly to the ranking algorithms:
1. Degree of Relevance
• Binary: relevant vs. irrelevant
• Multiple ordered categories:
Perfect > Excellent > Good > Fair > Bad
2. Pairwise Preference
• Document A is more relevant than document B
3. Total Order
• Documents are ranked as {A,B,C,..} according to their relevance
Evaluation Measure – MAP & NDCG

Precision at position k for query q:
P@k = #{relevant docs in top k results} / k

Average precision for query q:
AP = ( Σ_k P@k · l_k ) / #{relevant documents},
where l_k = 1 if the document at position k is relevant, 0 otherwise

NDCG at position k for query q:
NDCG@k = Z_k · Σ_{j=1}^{k} G(π⁻¹(j)) · η(j),
where G is the Gain, η(j) the position Discount, the sum makes it Cumulative, and Z_k Normalizes by the ideal ranking

MAP & NDCG (Mean Average Precision, Normalized Discounted Cumulative Gain) are averaged over all queries.
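These measures are short enough to compute by hand; the sketch below uses one common choice of gain (the raw graded label) and discount (1/log2 of the position + 1), with made-up judgments.

```python
import math

def precision_at_k(relevant, k):
    """relevant: 0/1 flags of the ranked results, best first."""
    return sum(relevant[:k]) / k

def average_precision(relevant):
    """Mean of P@k over the positions k that hold a relevant document."""
    total = sum(relevant)
    return sum(precision_at_k(relevant, k)
               for k, r in enumerate(relevant, 1) if r) / total

def ndcg_at_k(gains, k):
    """DCG with discount 1/log2(position + 1), normalized by the ideal DCG."""
    def dcg(gs):
        return sum(g / math.log2(j + 2) for j, g in enumerate(gs[:k]))
    return dcg(gains) / dcg(sorted(gains, reverse=True))

rel = [1, 0, 1, 0]       # binary judgments in ranked order
gains = [3, 2, 3, 0, 1]  # graded gains in ranked order
# P@2 = 0.5; AP = (P@1 + P@3) / 2 = 5/6; NDCG@5 < 1 since the order is not ideal
```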
Evaluation Measure - Summary
Query-level: every query contributes equally to the measure
• Computed on documents associated with the same query
• Bounded for each query
• Averaged over all test queries
Position-based: rank position is explicitly used (weighting)
• Top-ranked objects more important
• Relative order vs. relevance score of each document
• Rank is a non-continuous, non-differentiable function of the scores
Part V: The Challenges
and the Future of Search
The Challenges of Enterprise Search
• Near duplicates and versioning
• More recently, “quoting” between websites
• Metadata and file formats
• Search across multiple sources
• How to merge several indexes?
• Challenges with latency
• Security, Privacy, Regulations
Future Research
• User Logs as Ground Truth
• A gold mine that has not been leveraged so far
• Implicit feedback: click-through rates, etc.
• Feature Engineering
• New Directions of Research
• Semi-supervised Ranking
• Transfer Ranking
Conclusions
• While 20+ years old, Search is still hard
• But there are off-the-shelf solutions…
• A problem where ML can help (the learning-to-rank space)
• Most promising algorithms use a listwise approach
• Very dynamic area of research
• But doing Search well requires more than Learning-to-Rank: query parsing, topic modeling, etc.
• It is getting harder with ever more types of documents
Thank You for Your Attention!

References
• Learning-to-Rank for Information Retrieval, by Tie-Yan Liu
• Learning-to-Rank Tutorial, by Tie-Yan Liu
• The PageRank Model, by Ian Rogers
• Search is Hard, by Priyendra Deshwal
• Why Is Enterprise Search so Hard?, by Miles Kehoe

Everything You Wish You Knew About Search

  • 1.
    Everything You Wish YouKnew About Search
  • 2.
    About • Enterprise softwarecompany that develops products for software developers, project managers, and content management
  • 3.
    • Enterprise softwarecompany that develops products for software developers, project managers, and content management • Our products: About
  • 4.
    About Me Head ofSearch & Smarts Engineering at Atlassian • In charge of all customer-facing ML/AI initiatives, including Search • Our main initiative is Cross-Product Search in ‘Home’ Before Atlassian: • Particle Physicist by training • Initiated Data Science efforts at several companies • Previously member of the Search team at @WalmartLabs
  • 5.
    About this Talk Whatto expect • A general introduction to Search • A overview of both the Engineering and ML aspects of Search • Insights into the current and future challenges of Search What not to expect • An extensive tutorial covering the entire Learning-to-Rank landscape • To become a Search expert in 40 min
  • 6.
    Outline • Part I:The Concepts of Search • Part II: The Technical Aspects of Search • Part III: Learning Algorithms • Part IV: Measuring Search Relevance • Part V: The Challenges and the Future of Search
  • 7.
    Part I: TheConcepts of Search
  • 8.
    Altavista First to allowNL queries Web Crawler 1st crawler to index entire pages The (Pre)History of Search 1990 Archie First search engine: an index of downloadable directory listings 1991 Veronika, Jughead Search file names and titles stored in Gopher index systems 1992 Vlib Time Berners-Lee set up a Virtual Library 1993 Excite WWW Wanderer Primitive Web Search1994 1995 LookSmart 1996 Inktomi: HotBot Google 1997 Ask.com Lycos Ranked relevance retrieval Yahoo! Directory
  • 9.
    The History ofSearch 1998 MSN Open Directory Project 1999AllTheWeb Overture Services 2000 Snap 2003 2004 2001 2002 2005 2006 LiveSearch 2007 2008 2009 Cuil Bing Inline search suggestions 2010
  • 10.
    What is Search? Convertan intent into an action that helps people retrieve something, i.e. a piece of content CONTENT OVERLOAD Search
  • 11.
    What is Search? Convertan intent into an action that helps people retrieve something, i.e. a piece of content CONTENT OVERLOAD Search • Retrieving, organizing & classifying information • Includes: • Web Search • Faceted Search (e-Commerce) • Enterprise Search • But also • Different types of documents: Image Search, etc. • In a wider sense of the term: • Recommendation (Search with no explicit intent from the user) • Structured Query Language
  • 12.
    User Intent What isSearch (Really) About? Users
  • 13.
    User Intent What isSearch (Really) About? Users Content Documents
  • 14.
    User Intent What isSearch (Really) About? Users Content Request Search Query Return Search Results Documents INTERPRETATION DISPLAY RETRIEVAL
  • 15.
    User 1 -Intent What is Search (Really) About? Users Content Request Search Query Return Search Results Documents 1 INTERPRETATION DISPLAY RETRIEVAL • Query space not controlled • Content dependent on customer Multi-tenancy Search User 2 - Intent User 3 - Intent Documents 2 Documents 3 Request Search Query Return Search Results Request Search Query Return Search Results DISPLAY INTERPRETATION DISPLAY INTERPRETATION
  • 16.
    Query data • Whatare you searching for? (query terms) Content data • What are the documents about? (topics) Contextual data • Who are you? (user data – both static and learned) • In which circumstances are you searching? Engagement data • As a group (what web pages are ‘hot’ these days?) • As an individual (your personal viewing history) Data Zoo For Search
  • 17.
    CRAWLER strips out thehtml text content The Processes of Search Automated browser that views your web pages
  • 18.
    CRAWLER INDEXER strips out thehtml text content Stores records of all pages viewed by the spider/crawler The Processes of Search Automated browser that views your web pages Database being searched when ‘search’ button is hit
  • 19.
    CRAWLER INDEXER SEARCHER strips out thehtml text content Stores records of all pages viewed by the spider/crawler Algorithm used to sort through the database of pages The Processes of Search Automated browser that views your web pages Database being searched when ‘search’ button is hit finds the most relevant content
  • 20.
    Part II: TheTechnical Aspects of Search
  • 21.
    Search Engine Architecture Crawler Document Analyzer Indexer Indexedcorpus Document Representation Index Ranking procedure Ranker Feedback Results Query representation Query Evaluation User
  • 22.
    Indexing The purpose ofstoring an index is to optimize speed and performance in finding relevant documents for a search query. Indexing
  • 23.
    • Without anindex, the search engine would scan every document in the corpus • Benefits: computation and time saving at query time • 10,000 documents can be queried within milliseconds with an index • a sequential scan could take hours • Disadvantages: • additional computer storage required to store the index • increase in the time required for an update to take place • Design factors: • Storage techniques • Index size, lookup speed • Maintenance, fault tolerance Indexing The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Indexing
  • 24.
    What Happens atIndexing Time? Text + Metadata (Doc type, structure, features) Text Acquisition Index Takes index terms & creates data structures (inverted indexes) to support fast searching Transforms documents into index terms or features Document data store E-mail, Web pages, News articles, Memos, Letters Identifies and stores documents for indexing Indexing Process Index Creation Text Transformation
  • 25.
    1. Identify WhatTo Search For Find out what words get searched and interpret the query term 2. Parse The Query Language Itself Recognizing and interpreting operators (AND, OR, NOT, etc.) and field restrictors 3. Extend Search to Other Query Terms This includes: • Fuzzy Matching (spelling mistakes) • Entity and Thematic Modeling (related words) 4. Relevance Ranking Improvements … such as: • boosting documents containing all of the terms close together (proximity weighting) • boosting documents from trustworthy sources, reducing documents from unreliable sites Parsing
  • 26.
  • 27.
  • 28.
    Ranking Cats with sunglasses Justhanging out with my sunglasses on Am I cool or what? Me with glasses just because… it makes me smart. What I see right here is Jim Belushi as a cat. Along with the Blues Brothers behind. You will never be as capable of rocking shades… quite as well as this feline friend.
  • 29.
    Ranking Relevance score ∈0,1 0.9 0.7 0.3 0.1 Cats with sunglasses Just hanging out with my sunglasses on Am I cool or what? Me with glasses just because… it makes me smart. What I see right here is Jim Belushi as a cat. Along with the Blues Brothers behind. You will never be as capable of rocking shades… quite as well as this feline friend. 𝑓 𝑞𝑢𝑒𝑟𝑦, 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡
  • 30.
    Reranking Allows to runa simple query (A) for matching documents and re-order the top N documents using the scores from a more complex query (B) Query Re-Ranking
  • 31.
    Reranking Allows to runa simple query (A) for matching documents and re-order the top N documents using the scores from a more complex query (B) Query Re-Ranking 0.9 0.7 0.3 0.1 Original rank
  • 32.
    Reranking Allows to runa simple query (A) for matching documents and re-order the top N documents using the scores from a more complex query (B) Query Re-Ranking 0.9 0.7 0.3 0.1 TopNdocuments Original rank
  • 33.
    Reranking Allows to runa simple query (A) for matching documents and re-order the top N documents using the scores from a more complex query (B) Query Re-Ranking 0.9 0.7 0.3 0.1 TopNdocuments Original rank 1.0 0.9 0.5 Re-ranking
  • 34.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.)
  • 35.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.) 0.9 0.7 0.3 Original relevance Original rank
  • 36.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.) 0.9 0.7 0.3 Original relevance Original rank 2,000 5,000 6,000 Page clicks + 𝛼  . + 𝛼  . + 𝛼  .
  • 37.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.) 0.9 0.7 0.3 Original relevance Original rank 2,000 5,000 6,000 Page clicks + 𝛼  . + 𝛼  . + 𝛼  . Total dwell time (minutes) 500 400 100 + 𝛽. + 𝛽. + 𝛽.
  • 38.
    Boosting and Personalization Boosting Runninga simple query (A) and modify the {query, document} relevance scores to boost some content (for example, based on popularity, engagement, etc.) 0.9 0.7 0.3 Original relevance Original rank 2,000 5,000 6,000 Page clicks + 𝛼  . + 𝛼  . + 𝛼  . Total dwell time (minutes) 500 400 100 + 𝛽. + 𝛽. + 𝛽. New relevance = 65.9 = 154.7 = 181.3 𝛼 = 0.03, 𝛽 = 0.01 New rank
  • 39.
  • 40.
    Learning-to-Rank (1) User Query Top-kretrieval Results page Ranking model Learning algorithm Training data Documents Indexer Index
  • 41.
    Learning-to-Rank (2) Learning System Ranking System Modelh q x1 x2 xm h(x) … q x1 x2 xm ? … q1 x1 (1) x2 (1) xm(1) (1) y (1) … q2 x1 (2) x2 (2) xm(2) (2) y (2) … qn x1 (n) x2 (n) xm(n) (n) y (n) … … Training Data Test Data Prediction
  • 42.
    Pointwise • Predict relevanceon a document-by-document basis • 3 types of supervised machine learning algorithms can be used: • Regression-based algorithms • Classification-based algorithms • Ordinal regression Learning-to-Rank Algorithms
  • 43.
    Pointwise • Predict relevanceon a document-by-document basis • 3 types of supervised machine learning algorithms can be used: • Regression-based algorithms • Classification-based algorithms • Ordinal regression Pairwise • Tell which document is better in a given pair of documents: it is a classification problem • The goal is to minimize average number of inversions in ranking Learning-to-Rank Algorithms
  • 44.
    Pointwise • Predict relevanceon a document-by-document basis • 3 types of supervised machine learning algorithms can be used: • Regression-based algorithms • Classification-based algorithms • Ordinal regression Pairwise • Tell which document is better in a given pair of documents: it is a classification problem • The goal is to minimize average number of inversions in ranking Listwise • Directly optimize one of the ranking evaluation measures Learning-to-Rank Algorithms
  • 45.
    Pointwise Approach • Predictthe exact relevance degree of each document • Assumes that each {query, document} pair has a numerical or ordinal score • Input space contains the feature vector of every single document • Can be approximated by a regression problem • Ordinal regression: • {query, document} relevance score can only take small, finite number of values
  • 46.
    Pointwise Approach Regression ClassificationOrdinal Regression Input Space Single Documents yj Output Space Real Values Non-ordered Categories Ordinal Categories Hypothesis Space Scoring Function f(x) Loss Function Regression Loss Classification Loss Ordinal Regression Loss L(f; xj, yj) • Predict the exact relevance degree of each document • Assumes that each {query, document} pair has a numerical or ordinal score • Input space contains the feature vector of every single document • Can be approximated by a regression problem • Ordinal regression: • {query, document} relevance score can only take small, finite number of values Summary
  • 47.
    • Focus onrelative order between 2 documents instead of predicting relevance • Learn a binary classifier to tell which document is better in a pair of documents • Goal: minimize average number of inversions in ranking • Pairwise preference is used as the ground truth • Limitations: • Does not differentiate inversions at top vs. bottom positions • Examples: • RankNet Pairwise Algorithms
  • 48.
    • Focus onrelative order between 2 documents instead of predicting relevance • Learn a binary classifier to tell which document is better in a pair of documents • Goal: minimize average number of inversions in ranking • Pairwise preference is used as the ground truth • Limitations: • Does not differentiate inversions at top vs. bottom positions • Examples: • RankNet Pairwise Algorithms Input Space Document pairs (xu, xv) Output Space Preference 𝑦5,6 ∈ {+1, −1} Hypothesis Space Preference function ℎ 𝑥5, 𝑥6 = 2. 𝐼{@ AB C@ AD } − 1 Loss Function Pairwise classification loss 𝐿(ℎ; 𝑥5, 𝑥6, 𝑦5,6) Summary
  • 49.
    • Pick anevaluation measure & optimize its value, averaged over all queries • Challenges: • Continuous approximations on measures used b/c most are not continuous functions • 2 Types of approaches: • Direct Optimization of IR Evaluation Measures • Minimization of Listwise Ranking Losses Listwise Algorithms
  • 50.
    • Pick anevaluation measure & optimize its value, averaged over all queries • Challenges: • Continuous approximations on measures used b/c most are not continuous functions • 2 Types of approaches: • Direct Optimization of IR Evaluation Measures • Minimization of Listwise Ranking Losses Listwise Algorithms Listwise Loss Minimization Direct Optimization of IR Measure Input Space Document set 𝒙 =  {𝑥J}JKL M Output Space Permutation 𝜋O Ordered Categories 𝒚 =  {𝑦J}JKL M Hypothesis Space ℎ 𝑥 = 𝑠𝑜𝑟𝑡 ∘ 𝑓(𝑥) ℎ 𝑥 = 𝑓(𝑥) Loss Function Listwise Loss 𝐿(ℎ; 𝒙, 𝜋O) 1-surrogate Measure 𝐿(ℎ; 𝒙, 𝒚) Summary
  • 51.
    3 input ligands:C Summary B A DifferentMethods Pointwise Pairwise Listwise C Score(C) B Score(B) A Score(A) BA f(A)>f(B) CB f(B)>f(C) CA f(A)>f(C) CBA PA,B,C CB A PB,A,C CB A PB,C,A Output Ranking = CBA
  • 52.
    • Link analysisalgorithm Example: the PageRank Algorithm • Algorithm invented by Larry Page (Google founder) • score goes from 0 to 10 • Other Alternatives: • Page Authority • HostRank • Voting Algorithms • … Graph-Based Algorithms A A C B B B B B C
Example features (TREC)

IR/NLP features:
 1  TF of body              6  IDF of body
 2  TF of anchor            7  IDF of anchor
 3  TF of title             8  IDF of title
 4  TF of URL               9  IDF of URL
 5  TF of whole document   10  IDF of whole document

Linkage / engagement features:
51  PageRank               56  Inlink number
52  HostRank               57  Outlink number
53  Topical PageRank       58  Number of slashes in URL
54  Topical HITS authority 59  Length of URL
55  Topical HITS hub

TF: term frequency; IDF: inverse document frequency
Conventional Ranking Models
Query-dependent:
• Boolean model, extended Boolean model, etc.
• Vector space model, latent semantic indexing (LSI), etc.
• BM25 model, statistical language model, etc.
Query-independent:
• PageRank, TrustRank, BrowseRank, etc.
Problems with Conventional Models:
• Manual parameter tuning is difficult
• Too many parameters
• Evaluation measures are not smooth
• Sometimes leads to overfitting
• Ensemble approaches (combining models into a more effective one) are not trivial
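The BM25 model mentioned above is compact enough to sketch directly; the classic formula combines a smoothed IDF with a saturating term frequency and a document-length normalization. The tiny corpus and the default parameters k1 = 1.2, b = 0.75 are illustrative choices, not values from the talk:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """BM25 score of one document for a bag-of-words query.
    corpus is a list of token lists; doc is one of them."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)  # smoothed IDF
        tf = doc.count(term)
        # saturating TF with length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["search", "engine", "ranking"],
          ["cooking", "recipes"],
          ["search", "search", "index"]]
q = ["search", "ranking"]
scores = [bm25_score(q, d, corpus) for d in corpus]
print(scores.index(max(scores)))  # doc 0, the only one matching both terms
```

Note how the two free parameters (k1 and b) already illustrate the slide's point: conventional models require manual tuning, and the tuning interacts with non-smooth evaluation measures.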
Part IV: Measuring Search Relevance
Search Engine Evaluation: Index
• Corpus size: number of pages indexed
• Search engine overlap: fraction of pages indexed by engine A that are also indexed by engine B
• Freshness: age of the pages in the index
• Spam resilience: fraction of pages in the index that are spam
• Duplicates: number of unique pages in the index
Search Engine Evaluation: Relevance Judgment
Types of judgments, classified similarly to ranking algorithms:
1. Degree of relevance
  • Binary: relevant vs. irrelevant
  • Multiple ordered categories: Perfect > Excellent > Good > Fair > Bad
2. Pairwise preference
  • Document A is more relevant than document B
3. Total order
  • Documents are ranked as {A, B, C, …} according to their relevance
Evaluation Measures: MAP & NDCG

Precision at position k for query q:
P@k = #{relevant docs in top k results} / k

Average precision for query q (l_k = 1 if the document at position k is relevant, else 0):
AP = ( Σ_k P@k · l_k ) / #{relevant documents}

NDCG at position k for query q (G: gain of the document ranked at position j, η(j): position discount, Z_k: normalization so the ideal ranking scores 1):
NDCG@k = Z_k · Σ_{j=1..k} G(π⁻¹(j)) · η(j)

MAP & NDCG: averaged over all queries
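The two measures can be sketched directly from their definitions; the example query below (relevant set, ranking, and gain values) is made up for illustration:

```python
import math

def average_precision(relevant, ranking):
    """AP for one query: mean of P@k over the positions of relevant results."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k          # P@k at each relevant position
    return total / len(relevant)

def ndcg_at_k(gains, k):
    """NDCG@k from graded gains in ranked order, log2 position discount."""
    dcg = sum(g / math.log2(j + 2) for j, g in enumerate(gains[:k]))
    ideal = sum(g / math.log2(j + 2)
                for j, g in enumerate(sorted(gains, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

# Relevant docs {a, c} retrieved at ranks 1 and 3:
ap = average_precision({"a", "c"}, ["a", "b", "c", "d"])
print(round(ap, 3))        # (1/1 + 2/3) / 2 = 0.833

# Graded gains of the ranked results (3 = Perfect ... 0 = Bad):
print(ndcg_at_k([3, 2, 0, 1], k=4))
```

Both functions are computed per query and then averaged across the test queries, exactly as the summary slide describes; the normalization inside NDCG is what keeps every query's contribution bounded by 1.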
Evaluation Measures: Summary
Query-level: every query contributes equally to the measure
• Computed on the documents associated with the same query
• Bounded for each query
• Averaged over all test queries
Position-based: rank position is explicitly used (weighting)
• Top-ranked objects are more important
• Based on the relative order of documents rather than the relevance score of each document
• Rank is a non-continuous, non-differentiable function of the scores
Part V: The Challenges and the Future of Search
The Challenges of Enterprise Search
• Near duplicates and versioning
• More recently, “quoting” between websites
• Metadata and file formats
• Search across multiple sources:
  • How to merge several indexes?
  • Challenges with latency
• Security, privacy, and regulations
Future Research
• User logs as ground truth:
  • A gold mine that has not been fully leveraged so far
  • Implicit feedback: click-through rates, etc.
• Feature engineering
• New directions of research:
  • Semi-supervised ranking
  • Transfer ranking
Conclusions
• While 20+ years old, Search is still hard
• But there are off-the-shelf solutions…
• A problem where ML can help (the learning-to-rank space)
• The most promising algorithms use a listwise approach
• A very dynamic area of research
• But doing Search well requires more than learning-to-rank:
  • Query parsing, topic modeling, etc.
• It is getting harder with ever more types of documents
Thank You for Your Attention!
References
• Learning to Rank for Information Retrieval, by Tie-Yan Liu
• Learning to Rank Tutorial, by Tie-Yan Liu
• The PageRank Model, by Ian Rogers
• Search is Hard, by Priyendra Deshwal
• Why Is Enterprise Search so Hard?, by Miles Kehoe