#Codemotion Rome 2018 - Traditional search engines provide relevant results for a query by matching the query terms against the words in the documents, mainly with TF-IDF and BM25. However, these methods are hard to tune for best results and provide no personalisation. This talk introduces Learning to Rank, a machine learning approach that brings personalisation to search, and its key concepts, before diving into a real-life demo based on Elasticsearch and real data. At the end you will take home a basic understanding of LTR, its applications, and enough to start using it.
2. About me
Pere Urbon - Bayes (Berliner since 2011)
Software Architect and Data Engineer
All about systems, data and teams
Open Source Advocate and Contributor
3. All material will be available from
● github.com/purbon/learning_to_rank_101
● speakerdeck.com/purbon
5. Building Search
A search engine is an information retrieval
system designed to help find information stored
on a computer system.
wikipedia.org/wiki/Search_engine_(computing)
6. Building Search
When search works, it can feel almost
magical: you simply type in what you’re looking
for and it’s served up in mere milliseconds. It’s
fast, convenient, and super efficient – no
wonder so many users prefer search over
clicking around the site’s categories!
7. Search, how does this work?
[Diagram] A collection of documents D = {d1, d2, ..., dN} is indexed by the IR system.
Given a query q, the system returns a ranked list of documents dq,1, dq,2, dq,3, ..., dq,n.
Ranking is based on relevance: TF-IDF, BM25.
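To make the classic scoring side concrete, here is a minimal TF-IDF sketch over tokenized documents. This is a simplification for illustration, not the exact Elasticsearch formula (which defaults to BM25 with document-length normalisation); the example corpus and query are made up.

```python
import math
from collections import Counter

def tf_idf(query_terms, doc_tokens, corpus):
    """Score one tokenized document against a query with plain TF-IDF:
    tf = term frequency in the document, idf = log(N / df)."""
    N = len(corpus)
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df:
            score += tf[term] * math.log(N / df)
    return score

corpus = [["cheap", "flights", "rome"],
          ["rome", "city", "guide"],
          ["cheap", "hotels"]]
query = ["cheap", "rome"]
scores = [tf_idf(query, doc, corpus) for doc in corpus]
# Rank document indices by descending score: the document matching
# both query terms comes first.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

Note how the two documents matching only one query term each get identical scores: pure term matching has no way to break such ties per user, which is exactly the gap personalised ranking addresses.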
8. Building search
The phases of building a search engine:
● Tokenization
○ synonyms (filter)
○ stop words (filter)
○ whitespace
○ ngram
● Analyzer
○ languages
○ keywords
○ standard
● Normalization
The same analysis chain is applied both at indexing time and at query time.
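As a toy illustration of such an analysis chain: a whitespace tokenizer, lowercase normalisation, and synonym and stop-word token filters. The word lists and the `analyze` helper are made up for the example; Elasticsearch configures the equivalent chain declaratively in the index settings.

```python
# Toy analysis chain: whitespace tokenizer + lowercase normalisation,
# followed by synonym and stop-word token filters.
STOP_WORDS = {"the", "a", "an", "of"}
SYNONYMS = {"nyc": "new_york"}

def analyze(text):
    tokens = text.lower().split()                       # whitespace tokenizer + lowercase
    tokens = [SYNONYMS.get(t, t) for t in tokens]       # synonym filter
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word filter

# The same chain runs at indexing time and at query time, so the document
# "The hotels of NYC" and the query "new_york hotels" meet on the same terms.
tokens = analyze("The hotels of NYC")
```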
11. The second line of defence
● Tags and Ontologies.
● Natural Language Processing.
● Result click tracking.
● Genetic and evolutionary methods to optimize boosting and weights.
● Build your own scorer.
● ...
Scary and Complex!!!
14. Learning to Rank
The use of machine learning (supervised, semi-supervised, …) to improve
the creation of ranking models for information retrieval.
Common applications are in search engines, collaborative filtering,
machine translation, biological computation, etc.
The idea was introduced in 1992 by Norbert Fuhr, describing learning in
information retrieval as a parameter estimation problem.
15. Learning to Rank, how does this work?
[Diagram] Training data: queries q1, ..., qm, each with associated judged documents
di,1, di,2, di,3, ..., di,n. A learning system fits a scoring function f(q, d)
from these examples. At query time, a new query qm+1 runs against the document
collection D = {d1, d2, ..., dN}, and the IR system returns the list
dq,1, dq,2, ..., dq,n ranked by f(qm+1, di).
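To make f(q, d) concrete, here is a minimal pointwise sketch: a linear model fitted by stochastic gradient descent on (feature vector, relevance label) pairs. The feature names, values and labels are invented for illustration; production systems use richer features and stronger models such as LambdaMART.

```python
# Pointwise learning to rank: fit f(q, d) = w . x, where x = features(q, d),
# by stochastic gradient descent on a squared-error loss.
def train(samples, lr=0.1, epochs=200):
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in samples:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical feature vectors [bm25_score, click_rate] with graded
# relevance labels (3 = highly relevant, 0 = irrelevant).
train_set = [([2.0, 0.9], 3.0), ([1.0, 0.2], 1.0), ([0.1, 0.1], 0.0)]
w = train(train_set)

# At query time: rank candidate documents by f(q, d), descending.
candidates = [[0.1, 0.1], [2.0, 0.9], [1.0, 0.2]]
ranked = sorted(candidates, key=lambda x: score(w, x), reverse=True)
```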
16. Learning to Rank
Algorithms can be divided into three groups:
● Pointwise: assuming each (query, document) pair gets a numeric relevance
score, the problem can be approximated by regression on that score.
● Pairwise: the problem is treated as binary classification, learning which
of two documents should be ranked higher for a given query.
● Listwise: tries to directly optimize a ranking quality measure over whole
result lists, averaged over all queries.
Order of quality: Listwise > Pairwise > Pointwise.
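A small sketch of the pairwise reduction: graded judgments for one query are turned into preference pairs (feature difference, ±1 label) that any binary classifier can then learn from. The features and labels below are made up for illustration.

```python
from itertools import combinations

def to_pairs(judged):
    """Reduce graded judgments [(features, relevance), ...] for one query
    to pairwise classification samples (x_i - x_j, +1/-1)."""
    pairs = []
    for (xi, yi), (xj, yj) in combinations(judged, 2):
        if yi == yj:
            continue  # ties carry no preference signal
        diff = [a - b for a, b in zip(xi, xj)]
        pairs.append((diff, 1 if yi > yj else -1))
    return pairs

# Hypothetical [bm25_score, click_rate] features with graded labels.
judged = [([2.0, 0.9], 3), ([1.0, 0.2], 1), ([0.1, 0.1], 0)]
pairs = to_pairs(judged)  # 3 preference pairs for this query
```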
17. Learning to Rank
Most popular algorithms are:
● RankNet, LambdaRank and LambdaMART, by Christopher J.C. Burges et al.
www.microsoft.com/en-us/research/publication/ranking-boosting-and-model-adaptation/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F69536%2Ftr-2008-109.pdf
● RankSVM and gradient descent variants.
19. References
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
Million Song Dataset, official website by Thierry Bertin-Mahieux, available at: http://labrosa.ee.columbia.edu/millionsong/
Tie-Yan Liu (2009), "Learning to Rank for Information Retrieval", Foundations and Trends in Information Retrieval, 3 (3): 225–331, doi:10.1561/1500000016, ISBN 978-1-60198-244-5.