#Codemotion Rome 2018 - Traditional search engines provide relevant results for a query by matching the query terms against the words in the documents, mainly with TF-IDF and BM25. However, these methods are hard to tune for best results and provide no personalisation. This talk introduces Learning to Rank, a machine learning approach that brings personalisation to search, and its key concepts, before diving into a real-life demo based on Elasticsearch and real data. At the end you will take home a basic understanding of LTR, its applications, and enough to start using it.
2. About me
Pere Urbon - Bayes (Berliner since 2011)
Software Architect and Data Engineer
All about systems, data and teams
Open Source Advocate and Contributor
3. All material will be available from
● github.com/purbon/learning_to_rank_101
● speakerdeck.com/purbon
5. Building Search
A search engine is an information retrieval
system designed to help find information stored
on a computer system.
wikipedia.org/wiki/Search_engine_(computing)
6. Building Search
When search works, it can feel almost
magical: you simply type in what you’re looking
for and it’s served up in mere milliseconds. It’s
fast, convenient, and super efficient – no
wonder so many users prefer search over
clicking around the site’s categories!
7. Search, how does this work?
[Diagram] A collection of documents D = {d1, d2, ..., dN} is indexed by the IR system.
Given a query q, the system returns a ranked list of documents dq,1, dq,2, dq,3, ..., dq,n.
Ranking is based on relevance: TF-IDF, BM25.
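To make the classic scoring side concrete, here is a minimal TF-IDF sketch over tokenized documents. This is a simplification for illustration, not the exact Elasticsearch formula (which defaults to BM25 with document-length normalisation); the example corpus and query are made up.

```python
import math
from collections import Counter

def tf_idf(query_terms, doc_tokens, corpus):
    """Score one tokenized document against a query with plain TF-IDF:
    tf = term frequency in the document, idf = log(N / df)."""
    N = len(corpus)
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df:
            score += tf[term] * math.log(N / df)
    return score

corpus = [["cheap", "flights", "rome"],
          ["rome", "city", "guide"],
          ["cheap", "hotels"]]
query = ["cheap", "rome"]
scores = [tf_idf(query, doc, corpus) for doc in corpus]
# Rank document indices by descending score: the document matching
# both query terms comes first.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

Note how the two documents matching only one query term each get identical scores: pure term matching has no way to break such ties per user, which is exactly the gap personalised ranking addresses.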
8. Building search
The phases of building a search engine:
● Tokenization
○ synonyms (filter)
○ stop words (filter)
○ whitespace
○ ngram
● Analyzer
○ languages
○ keywords
○ standard
● Normalization
The same analysis chain is applied both at indexing time and at query time.
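As a toy illustration of such an analysis chain: a whitespace tokenizer, lowercase normalisation, and synonym and stop-word token filters. The word lists and the `analyze` helper are made up for the example; Elasticsearch configures the equivalent chain declaratively in the index settings.

```python
# Toy analysis chain: whitespace tokenizer + lowercase normalisation,
# followed by synonym and stop-word token filters.
STOP_WORDS = {"the", "a", "an", "of"}
SYNONYMS = {"nyc": "new_york"}

def analyze(text):
    tokens = text.lower().split()                       # whitespace tokenizer + lowercase
    tokens = [SYNONYMS.get(t, t) for t in tokens]       # synonym filter
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word filter

# The same chain runs at indexing time and at query time, so the document
# "The hotels of NYC" and the query "new_york hotels" meet on the same terms.
tokens = analyze("The hotels of NYC")
```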
11. The second line of defence
● Tags and Ontologies.
● Natural Language Processing.
● Result click tracking.
● Genetic and evolutionary methods to optimize boosting and weights.
● Build your own scorer.
● ...
Scary and Complex!!!
14. Learning to Rank
The use of machine learning (supervised, semi-supervised, …) to improve
the creation of ranking models for information retrieval.
Common applications are in search engines, collaborative filtering,
machine translation, biological computation, etc.
The idea was introduced in 1992 by Norbert Fuhr, describing learning in
information retrieval as a parameter estimation problem.
15. Learning to Rank, how does this work?
[Diagram] Training data: queries q1, ..., qm, each with associated judged documents
di,1, di,2, di,3, ..., di,n. A learning system fits a scoring function f(q, d)
from these examples. At query time, a new query qm+1 runs against the document
collection D = {d1, d2, ..., dN}, and the IR system returns the list
dq,1, dq,2, ..., dq,n ranked by f(qm+1, di).
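To make f(q, d) concrete, here is a minimal pointwise sketch: a linear model fitted by stochastic gradient descent on (feature vector, relevance label) pairs. The feature names, values and labels are invented for illustration; production systems use richer features and stronger models such as LambdaMART.

```python
# Pointwise learning to rank: fit f(q, d) = w . x, where x = features(q, d),
# by stochastic gradient descent on a squared-error loss.
def train(samples, lr=0.1, epochs=200):
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in samples:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical feature vectors [bm25_score, click_rate] with graded
# relevance labels (3 = highly relevant, 0 = irrelevant).
train_set = [([2.0, 0.9], 3.0), ([1.0, 0.2], 1.0), ([0.1, 0.1], 0.0)]
w = train(train_set)

# At query time: rank candidate documents by f(q, d), descending.
candidates = [[0.1, 0.1], [2.0, 0.9], [1.0, 0.2]]
ranked = sorted(candidates, key=lambda x: score(w, x), reverse=True)
```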
16. Learning to Rank
Algorithms can be divided into three groups:
● Pointwise: assuming each (query, document) pair gets a numeric relevance
score, the problem can be approximated by regression on that score.
● Pairwise: the problem is treated as binary classification, learning which
of two documents should be ranked higher for a given query.
● Listwise: tries to directly optimize a ranking quality measure over whole
result lists, averaged over all queries.
Order of quality: Listwise > Pairwise > Pointwise.
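A small sketch of the pairwise reduction: graded judgments for one query are turned into preference pairs (feature difference, ±1 label) that any binary classifier can then learn from. The features and labels below are made up for illustration.

```python
from itertools import combinations

def to_pairs(judged):
    """Reduce graded judgments [(features, relevance), ...] for one query
    to pairwise classification samples (x_i - x_j, +1/-1)."""
    pairs = []
    for (xi, yi), (xj, yj) in combinations(judged, 2):
        if yi == yj:
            continue  # ties carry no preference signal
        diff = [a - b for a, b in zip(xi, xj)]
        pairs.append((diff, 1 if yi > yj else -1))
    return pairs

# Hypothetical [bm25_score, click_rate] features with graded labels.
judged = [([2.0, 0.9], 3), ([1.0, 0.2], 1), ([0.1, 0.1], 0)]
pairs = to_pairs(judged)  # 3 preference pairs for this query
```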
17. Learning to Rank
Most popular algorithms are:
● RankNet, LambdaRank and LambdaMART, by Christopher J.C. Burges et al.
www.microsoft.com/en-us/research/publication/ranking-boosting-and-model-adaptation/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F69536%2Ftr-2008-109.pdf
● RankSVM and gradient descent variants.
19. References
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
Million Song Dataset, official website by Thierry Bertin-Mahieux, available at: http://labrosa.ee.columbia.edu/millionsong/
Tie-Yan Liu (2009), "Learning to Rank for Information Retrieval", Foundations and Trends in Information Retrieval, 3 (3): 225–331, doi:10.1561/1500000016, ISBN 978-1-60198-244-5.