Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bloomberg LP

Learning To Rank For Solr
Michael Nilsson – Software Engineer
Diego Ceccarelli – Software Engineer
Joshua Pantony – Software Engineer
Bloomberg LP

OUTLINE
●  Search at Bloomberg
●  Why do we need machine learning for search?
●  Learning to Rank
●  Solr Learning to Rank Plugin

8 million searches PER DAY
1 million PER DAY
400 million stories in the index

SOLR IN BLOOMBERG
●  Search engine of choice at Bloomberg
─  Large community / Well distributed committers
─  Open source Apache Project
─  Used within many commercial products
─  Large feature set and rapid growth
●  Committed to open-source
─  Ability to contribute to core engine
─  Ability to fix bugs ourselves
─  Contributions in almost every Solr release since 4.5.0

PROBLEM SETUP
score: 30
score: 1.0

PROBLEM SETUP
Score = 100 * scoreOnTitle + 10 * scoreOnDescription
score: 52.2
score: 30.8

PROBLEM SETUP
Score = 150 * scoreOnTitle + 3.14 * scoreOnDescription + 42 * clicks

PROBLEM SETUP
Score = 99.9 * scoreOnTitle + 3.1114 * scoreOnDescription + 42.42 * clicks + 5 * timeElapsedFromLastUpdate

PROBLEM SETUP
Score = 99.9 * scoreOnTitle + 3.1114 * scoreOnDescription + 42.42 * clicks + 5 * timeElapsedFromLastUpdate
query = solr, query = lucene, query = austin, query = bloomberg, query = …
●  It’s hard to manually tweak the ranking
─  You must be an expert in the domain
─  … or a magician
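The hand-tuned formula on this slide is a weighted sum of per-document signals. A minimal Python sketch, using the weights from the slide; the two documents and their feature values are invented for illustration and are not from the talk:

```python
# Weights copied from the slide; the feature values below are made up.
WEIGHTS = {
    "scoreOnTitle": 99.9,
    "scoreOnDescription": 3.1114,
    "clicks": 42.42,
    "timeElapsedFromLastUpdate": 5.0,
}

def linear_score(features, weights=WEIGHTS):
    """Weighted sum of feature values; a missing feature counts as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

# Hypothetical documents with hypothetical feature values.
docs = {
    "doc_a": {"scoreOnTitle": 0.5, "scoreOnDescription": 0.2, "clicks": 1.0},
    "doc_b": {"scoreOnTitle": 0.1, "scoreOnDescription": 0.9, "clicks": 3.0},
}

# Rank documents by descending score.
ranked = sorted(docs, key=lambda d: linear_score(docs[d]), reverse=True)
print(ranked)
```

Every new signal adds another weight to tune by hand, and the best weights may differ per query, which is the difficulty the slide points at.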

PROBLEM SETUP
It’s easier with Machine Learning
●  2,000+ parameters (non-linear, factorially larger than linear form)
●  8,000+ queries that are regularly tuned
●  Early on we spent many days hand tuning…

SEARCH PIPELINE (ONLINE)
[Diagram: User Query → Index (People, Commodities, News, Other Sources) → Top-x retrieval → ReRanking Model → Top-k reranked, where x >> k]
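The two-pass idea on this slide (retrieve a larger candidate set with a cheap score, then apply the reranking model only to those candidates) can be sketched roughly as follows. The toy index, the two scoring functions, and the helper names are assumptions made for illustration, not part of the plugin:

```python
import heapq

def retrieve_top_x(index, cheap_score, x):
    """First pass: keep the x best documents according to a cheap score."""
    return heapq.nlargest(x, index, key=cheap_score)

def rerank_top_k(candidates, model_score, k):
    """Second pass: apply the expensive model only to the candidates."""
    return heapq.nlargest(k, candidates, key=model_score)

# Toy index of (doc_id, text_match, popularity); all values are invented.
index = [("d1", 0.9, 0.1), ("d2", 0.8, 0.9), ("d3", 0.2, 1.0), ("d4", 0.7, 0.5)]

def cheap(doc):
    return doc[1]                       # text match only

def model(doc):
    return 0.5 * doc[1] + 0.5 * doc[2]  # stand-in for a learned model

candidates = retrieve_top_x(index, cheap, x=3)
top_k = rerank_top_k(candidates, model, k=2)
print([doc[0] for doc in top_k])
```

A document that never makes it into the first-pass candidate set cannot be recovered by the reranker, which is why x is chosen much larger than k.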

TRAINING PIPELINE (OFFLINE)
[Diagram: Training Query-Document Pairs + Index (People, Commodities, News, Other Sources) → Feature Extraction → Learning Algorithm → Ranking Model → Metrics]

TRAINING DATA: IMPLICIT VS EXPLICIT
What is explicit data?
●  A set of judges will assess the search results manually given a query
─  Experts
─  Crowd
Pros:
─  Data is very clean
Cons:
─  Can be very expensive!
What is implicit data?
●  Infer user preferences based on user behavior
─  Aggregated result clicks
─  Query reformulation
─  Dwell time
Pros:
─  A lot of data!
Cons:
─  Extremely noisy
─  Privacy concerns

FEATURES
●  A feature is an individual measurable property
●  Given a query and a collection, we can produce many features for each document in the collection
─  Whether the query matches the title
─  Length of the document
─  Number of views
─  How old is it?
─  Can it be displayed on a mobile device?
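As a rough illustration of the list above, a feature extractor maps one query-document pair to a vector of numbers. The document fields (`title`, `body`, `views`, `published`, `mobile_friendly`) and the feature names in this Python sketch are hypothetical:

```python
from datetime import date

def extract_features(query, doc, today):
    """Return a feature vector (as a dict) for one query-document pair."""
    return {
        "queryMatchesTitle": 1.0 if query.lower() in doc["title"].lower() else 0.0,
        "documentLength": float(len(doc["body"].split())),   # length in words
        "numberOfViews": float(doc["views"]),
        "ageInDays": float((today - doc["published"]).days),
        "mobileFriendly": 1.0 if doc["mobile_friendly"] else 0.0,
    }

# Hypothetical document; none of these values come from the talk.
doc = {
    "title": "Solr relevance tuning",
    "body": "Learning to rank in Solr",
    "views": 120,
    "published": date(2015, 10, 1),
    "mobile_friendly": True,
}

features = extract_features("solr", doc, today=date(2015, 10, 15))
print(features)
```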

FEATURES
Extract “features”
Was the result a cofounder? 0
Features are signals that give an indication of a result’s importance

FEATURES
Query: APPL US
Was the result a cofounder? 0
Does the document have an exec. position? 1

FEATURES
Was the result a cofounder? 0
Does the query match the document title? 0
Does the document have an exec. position? 1

FEATURES
Was the result a cofounder? 0
Does the query match the document title? 0
Does the document have an exec. position? 1
Popularity (%) 0.9

FEATURES
Was the result a cofounder? 0
Does the query match the document title? 1
Does the document have an exec. position? 0
Popularity (%) 0.6

METRICS
How do we know if our model is doing better?
●  Offline metrics
─  Precision/Recall/F1 score
─  nDCG (Normalized Discounted Cumulative Gain)
─  Other metrics (e.g., ERR, MAP, …)
●  Online Metrics
─  Click-through rates → higher
─  Time to first click → lower
─  Interleaving [1]
[1] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1), 2012.
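For reference, nDCG can be computed from the relevance labels of a ranked result list. This Python sketch assumes graded labels used directly as gain and a log2 position discount; other common variants use 2^label - 1 as the gain:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k=None):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Relevance labels of results in the order the system returned them.
print(ndcg([3, 2, 1]))   # already in ideal order
print(ndcg([1, 2, 3]))   # best document ranked last
```

A ranking that places the most relevant documents first scores 1.0, and any other ordering of the same documents scores lower.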

LEARNING TO RANK
●  Learn how to combine the features for optimizing one or more metrics
●  Many learning algorithms
─  RankSVM [1]
─  LambdaMART [2]
─  …
[1] T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
[2] C.J.C. Burges, "From RankNet to LambdaRank to LambdaMART: An Overview", Microsoft Research Technical Report MSR-TR-2010-82, 2010.
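To give a flavor of how such algorithms consume training data, pairwise methods such as RankSVM reduce ranking to classifying which of two documents for the same query should rank higher. The Python sketch below only shows that data transformation, not the learner itself; the function name and the toy data are assumptions:

```python
from itertools import combinations

def pairwise_examples(docs):
    """docs: list of (feature_vector, relevance_label) for ONE query.

    Yields (feature_difference, +1 or -1) for every pair of documents
    whose labels differ; pairs with equal labels carry no preference.
    """
    for (fa, la), (fb, lb) in combinations(docs, 2):
        if la == lb:
            continue
        diff = [a - b for a, b in zip(fa, fb)]
        yield diff, (1 if la > lb else -1)

# Toy judgments for a single query: (features, label); values are invented.
judged = [([1.0, 0.0], 2), ([0.0, 1.0], 0), ([0.5, 0.5], 2)]
pairs = list(pairwise_examples(judged))
print(pairs)
```

A binary classifier trained on these difference vectors yields a weight vector that can score individual documents at query time.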

SEARCH PIPELINE: STANDARD
[Diagram: User Query → Solr Index (People, Commodities, News, Other Sources) → Top-k retrieval]

[Diagram: User Query → Solr Index (People, Commodities, News, Other Sources) → Top-k retrieval; offline: Training Data → Learning Algorithm → Ranking Model]

SEARCH PIPELINE: SOLR INTEGRATION
[Diagram: User Query → Solr Index (People, Commodities, News, Other Sources) → Top-k retrieval → Ranking Model (online) → Top-x reranked]

SOLR RELEVANCY
●  Pros
─  Simple and quick scoring computation
─  Phrase matching
─  Function query boosting on time, distance, popularity, etc
─  Customized fields for stemming, synonyms, etc
●  Cons
─  Lots of manual time for creating a well tuned query
─  Weights are brittle and may stop working well as more documents or fields are added

LTR PLUGIN: GOALS
●  Don’t tune the relevancy manually!
─  Uses machine learning to power automatic relevancy tuning
●  Significant relevancy improvements
●  Allow comparable scores across collections
─  Collections of different sizes
●  Maintaining low latency
─  Re-use the vast Solr search functionality that is already built-in
─  Less data transport
●  Makes it simple to use domain knowledge to rapidly create features
─  Features are no longer coded but rather scripted

STANDARD SOLR SEARCH REQUEST
[Diagram: User Query → Solr Query: Index [10 Million] (People, Commodities, News, Other Sources) → Matches [10k] → Score [10k] → Top-10 retrieval]

LTR SOLR SEARCH REQUEST
[Diagram: User Query → Solr Query: Index [10 Million] (People, Commodities, News, Other Sources) → Matches [10k] → Score [10k] → Top-1000 retrieval → LTR Query: Ranking Model → Top-10 reranked]

<queryParser name="ltr" class="org.apache.solr.ltr.ranking.LTRQParserPlugin" />

LTR PLUGIN: RERANKING
●  LTRQuery extends Solr’s RankQuery
─  Wraps main query to fetch initial results
─  Returns custom TopDocsCollector for reranked ordered results
●  Solr rerank request parameter
rq={!ltr model=myModel1 reRankDocs=100 efi.user_query='james' efi.my_var=123}
─  !ltr – name used in the solrconfig.xml for the LTRQParserPlugin
─  model – name of deployed model to use for reranking
─  reRankDocs – total number of documents to rerank
─  efi.* – custom parameters used to pass external feature information for your features to use
•  Query intent
•  Personalization

SEARCH PIPELINE (ONLINE)
[Diagram: User Query → Index [10 Million] (People, Commodities, News, Other Sources) → Matches [10k] → Score [10k] → Top-1000 retrieval → Feature Extraction → Ranking Model → Top-10 reranked]

{
  "name": "Tim Cook",
  "primary_position": "ceo",
  "category": "person",
  …
}

FEATURES
Was the result a cofounder? 0
Does the query match the document title? 0
Does the document have an exec. position? 1
Popularity (%) 0.9

[
  {
    "name": "isPersonAndExecutive",
    "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
    "params": {
      "fq": [
        "{!terms f=category}person",
        "{!terms f=primary_position}ceo, cto, cfo, president"
      ]
    }
  },
  …
]

LTR PLUGIN: FEATURES AFTER

LTR PLUGIN: FUNCTION QUERIES
[
  {
    "name": "documentRecency",
    "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
    "params": {
      "q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"
    }
  },
  …
]

1 for docs dated now, 1/2 for docs dated 1 year ago, 1/3 for docs dated 2 years ago, etc.
See http://wiki.apache.org/solr/FunctionQuery#Date_Boosting
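The arithmetic behind the documentRecency feature can be checked outside Solr. Solr's recip(x, m, a, b) evaluates a / (m*x + b); the constant 3.16e-11 is approximately 1 divided by the number of milliseconds in a year. The Python sketch below re-implements only that formula, and the helper names are illustrative:

```python
# Milliseconds in a (Julian) year, roughly 3.156e10.
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000

def recip(x, m, a, b):
    """Same formula as Solr's recip function query: a / (m*x + b)."""
    return a / (m * x + b)

def document_recency(age_ms):
    """Mirrors recip(ms(NOW,publish_date), 3.16e-11, 1, 1) for a given document age."""
    return recip(age_ms, 3.16e-11, 1, 1)

for years in (0, 1, 2):
    print(years, "year(s) old:", round(document_recency(years * MS_PER_YEAR), 3))
```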

LTR PLUGIN: FEATURE STORE
●  FeatureStore is a Solr Managed Resource
─  REST API endpoint for performing CRUD operations on Solr objects
─  Stored and maintained in ZooKeeper
●  Deploy
─  curl -XPUT 'http://yoursolrserver/solr/collection/config/fstore'
--data-binary @./features.json -H 'Content-type:application/json'
●  View
─  http://yoursolrserver/solr/collection/config/fstore

LTR PLUGIN: FEATURES
●  Simplifies feature engineering through configuration file
●  Utilizes rich search functionality built into Solr
─  Phrase matching
─  Synonyms, Stemming, etc
●  Inherit the Feature class for specialized features

TRAINING PIPELINE (OFFLINE)
[Diagram: Training Queries → Index [10 Million] (People, Commodities, News, Other Sources) → Matches [10k] → Score [10k] → Top-1000 retrieval → Feature Extraction → Learning Algorithm → Ranking Model]

<transformer name="fv" class="org.apache.solr.ltr.ranking.LTRFeatureTransformer" />

LTR PLUGIN: FEATURE EXTRACTION
●  Feature extraction uses Solr’s TransformerFactory
─  Returns a custom field with each document
●  fl = *, [fv]
{
  "name": "Tim Cook",
  "primary_position": "ceo",
  "category": "person",
  …
  "[fv]": "isCofounder:0.0, isPersonAndExecutive:1.0, matchTitle:0.0, popularity:0.9"
}
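When building a training set, the [fv] string returned with each document has to be turned back into numbers. A small Python sketch, assuming the comma-separated name:value format shown above; the function name is hypothetical:

```python
def parse_feature_vector(fv):
    """Parse 'name:value, name:value, ...' into a {name: float} dict."""
    features = {}
    for item in fv.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input or a trailing comma
        name, _, value = item.rpartition(":")
        features[name] = float(value)
    return features

fv = "isCofounder:0.0, isPersonAndExecutive:1.0, matchTitle:0.0, popularity:0.9"
print(parse_feature_vector(fv))
```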

LTR PLUGIN: MODEL
{
  "type": "org.apache.solr.ltr.ranking.LambdaMARTModel",
  "name": "mymodel1",
  "features": [
    { "name": "matchedTitle" },
    { "name": "isPersonAndExecutive" }
  ],
  "params": {
    "trees": [
      {
        "weight": 1,
        "tree": {
          "feature": "matchedTitle",
          "threshold": 0.5,
          "left": { "value": -100 },
          "right": {
            "feature": "isPersonAndExecutive",
            "threshold": 0.5,
            "left": { "value": 50 },
            "right": { "value": 75 }
          }
        }
      }
    ]
  }
}
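To make the model format concrete, here is how such a tree ensemble could be evaluated for one document. This Python sketch is not the plugin's code; it assumes that a feature value less than or equal to the threshold follows the left branch, and that the model score is the weighted sum of the leaf values reached in each tree:

```python
def score_tree(node, features):
    """Walk one regression tree from the root to a leaf and return the leaf value."""
    while "value" not in node:
        # Assumption: value <= threshold goes left, otherwise right.
        go_left = features.get(node["feature"], 0.0) <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]

def score_model(model, features):
    """Weighted sum of the leaf values reached in every tree."""
    return sum(t["weight"] * score_tree(t["tree"], features)
               for t in model["params"]["trees"])

# The single-tree model from the slide.
model = {
    "params": {
        "trees": [{
            "weight": 1,
            "tree": {
                "feature": "matchedTitle", "threshold": 0.5,
                "left": {"value": -100},
                "right": {
                    "feature": "isPersonAndExecutive", "threshold": 0.5,
                    "left": {"value": 50},
                    "right": {"value": 75},
                },
            },
        }]
    }
}

print(score_model(model, {"matchedTitle": 1.0, "isPersonAndExecutive": 1.0}))
```

Under that assumption, a document whose title does not match scores -100 regardless of its other features, while a title match scores 50 or 75 depending on the executive flag.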

LTR PLUGIN: MODEL
●  ModelStore is also a Solr Managed Resource
●  Deploy
─  curl -XPUT 'http://yoursolrserver/solr/collection/config/mstore'
--data-binary @./model.json -H 'Content-type:application/json'
●  View
─  http://yoursolrserver/solr/collection/config/mstore
●  Inherit from the model class for new scoring algorithms
─  score()
─  explain()

LTR PLUGIN: EVALUATION
●  Offline Metrics
─  nDCG increased approximately 10% after reranking
─  Clicks @ 1 up by approximately 10%

BEFORE AND AFTER
Query: “unemployment”
[Side-by-side screenshots: Solr Ranking vs. Machine Learned Reranking]

LTR PLUGIN: EVALUATION
●  Offline Metrics
─  nDCG increased approximately 10% after reranking
─  Clicks @ 1 up by approximately 10%
●  Performance
─  About 30% faster than previous external ranking system
•  10 million documents in collection
•  100k queries
•  1k features
•  1k documents/query reranked

LTR PLUGIN: BENEFITS
●  Simpler feature engineering, without compiling
●  Access to rich internal Solr search functionality for feature building
●  Search result relevancy improvements vs regular Solr relevance
●  Automatic relevancy tuning
●  Compatible scores across collections
●  Performance benefits vs external ranking system

FUTURE WORK
●  Continue work to open source the plugin
●  Support pipelining multiple reranking models
●  Allow a simple ranking model to be used in the first pass
