We empower search teams!
Becky Billingsley
Matt Clark
Elizabeth Haubert
Charlie Hull
Max Irwin
Eric Pugh
Bertrand Rigaldies
Jennifer Schmidt
Scott Stults
Doug Turnbull
John Woodell
Dan Worley
René Kriegler
Aaron Pastor
Christine Boyd
Tiffany Brown
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Alessandro Benedetti (Sease)
Making the case for human judgement relevance testing
Tara Diedrichsen and Tito Sierra (LexisNexis)
Towards a Learning To Rank Ecosystem @ Snag - We've got LTR to work! Now what?
Xun Wang (Snag)
Evolution of Yelp search to a generalized ranking platform
Umesh Dangat (Yelp)
Addressing variance in AB tests: Interleaved evaluation of rankers
Erik Bernhardson (Wikimedia)
Solving for Satisfaction: Introduction to Click Models
Elizabeth Haubert (OpenSource Connections)
Custom Solr Query Parser Design Option, and Pros & Cons
Bertrand Rigaldies (OpenSource Connections)
Improving Search Relevance with Numeric Features in Elasticsearch
Mayya Sharipova
'Relevant' Machine Translation with Learning to Rank
Suneel Marthi (Amazon Web Services)
Search with Vectors
Simon Hughes (Dice Holdings Inc.)
Ontology and Oncology: NLP for Precision Medicine
Sean Mullane (University of Virginia)
Autocomplete as Relevancy
Rimple Shah
Query relaxation - a rewriting technique between search and recommendations
René Kriegler
Beyond The Search Engine: Improving Relevancy through Query Expansion
Taylor Rose and David Mitchell (Ibotta)
How The New York Times Tackles Relevance
Jeremiah Via (The New York Times)
Establishing a relevance focused culture in a large organization
Tom Burgmans (Wolters Kluwer)
Architectural considerations on search relevancy in the context of e-commerce
Johannes Peter (Media Markt Saturn)
Search Logs + Machine Learning = Auto-Tagging Inventory
John Berryman (Eventbrite)
Natural Language Search with Knowledge Graphs
Trey Grainger (Lucidworks)
Search-based recommendations at Politico
Ryan Kohl (Politico)
Editor's Notes
<This slide will be shown while people are taking their seats.>
<This slide will be shown when making announcements>
This is a photo from last year’s Haystack. Raise your hand if you are in this photo!
clap by Berkah Icon from the Noun Project
This year, we're going to talk about relevance: how it relates to people, and how it relates to machines.
So what does relevance mean to each of these? And how do we unify them to bridge the gap?
I like to talk about search quality that goes beyond relevance and considers experience and performance.
And there's an interesting parallel we can make if we try looking at relevance on its own. But the truth is that these three are inseparable.
https://commons.wikimedia.org/wiki/File:Flower_jtca001.jpg
Because we make big promises to customers, and we put our reputation on the line with every catchphrase we espouse. So we need to take a closer look and understand what all this relevance stuff really means, because none of those promises can be delivered unless we really know what's relevant.
Because for machines, you need an absolute and clearly defined mathematical rule to measure success. For search, that measure of success is relevance.
https://commons.wikimedia.org/wiki/File:Gradient_method.svg
But people and businesses have a very difficult time defining success when it comes to relevance – and even more so in a way that machines can understand. Because we have a complex and intimate understanding of the world around us, simplifying it is not so easy. Let's see why.
While we pause to examine the cranial anatomy of the Relevance Engineer, we first ask them: “what is relevance?”
http://clipart-library.com/clipart/8iEbGk88T.htm
And they might show you this. Some of you may recognize this instantly but I’m sure it’s a mystery to many of you. This is the formula for “normalized Discounted Cumulative Gain”, better known as nDCG.
And when we evaluate search with judgement data, we get a number. Here's a possible result. But what does it mean? How did we get that? Even if I showed you the query and the documents that produced this number, is there any real-world meaning you can ascribe to it?
Let's see how it works… and we'll look at just the numerator, which is Discounted Cumulative Gain, or DCG. This will get you the DCG score for one query with p relevance-graded results.
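A minimal sketch of that calculation in Python, using the common rel / log2(rank + 1) formulation; the exact gain and discount shown on the slide aren't reproduced here, so treat the grades and function names as illustrative.

```python
import math

def dcg(grades):
    """Discounted Cumulative Gain for one query's ranked, graded results."""
    return sum(
        rel / math.log2(rank + 1)  # the discount grows as rank increases
        for rank, rel in enumerate(grades, start=1)
    )

def ndcg(grades):
    """Normalize by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Example: graded top-4 results (3 = highly relevant, 0 = irrelevant)
print(round(ndcg([3, 2, 0, 1]), 3))
```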
<animate and read the steps>
Recap
Let’s look at all the possible combinations we’d ever see in the top 4 results, and what the nDCG score would be.
Here’s what it looks like when you view the spectrum of all possible relevance combinations for the top five results of a query. The different colors represent how strict, or lenient, the graded punishment is. 1.0 nDCG is considered perfect success with relevance. 0.0 nDCG is complete irrelevance.
We can also look at how strict we want the score to be. These are just variations on the 'punishment for lower rank' part of nDCG. But you can see we have a good deal of control in tailoring nDCG to how we want to represent relevance.
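One way to write that tunable strictness as a formula, purely as an illustration (the specific variants plotted on the slide aren't given here):

```latex
% Standard DCG, and a 'strictness' exponent s on the rank discount (illustrative)
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}
\qquad
\mathrm{DCG}_p^{(s)} = \sum_{i=1}^{p} \frac{rel_i}{\bigl(\log_2(i+1)\bigr)^{s}}
```

With s > 1 lower-ranked results are punished more harshly, s < 1 is more lenient, and s = 1 recovers the standard definition.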
But we’re still missing something. We have yet to define relevance at the atomic level, so that a machine can understand it.
RELEVANCE DENOTES HOW WELL A RETRIEVED DOCUMENT MEETS A USER’S INFORMATION NEED.
OK this is great. But what’s an information need?
AN INFORMATION NEED IS A DESIRE TO LOCATE AND OBTAIN INFORMATION TO SATISFY A CONSCIOUS OR UNCONSCIOUS NEED.
So that's the really tricky part! How are we, the humble product and engineering folks, able to understand the needs of our customers when even they are not conscious of those needs?
Well, we gather judgements and we look at usage data, to dig into this problem and come up with a model of understanding that we can work towards.
Let’s ask the humans for judgements first. While we would trust our experts to describe relevance, we need to be careful and make sure that it is done properly.
So we look to something called inter-rater reliability. The field started in psychology, and we draw from psychology because it has the tools we need to dive into the conscious and unconscious needs of our customers and raters! Inter-rater reliability grew out of the need to measure consensus on patient diagnoses.
Let’s look to the work of Klaus Krippendorff.
https://50.asc.upenn.edu/drupal/klaus-krippendorff
He developed a coefficient now known as Krippendorff’s Alpha.
https://en.wikipedia.org/wiki/Krippendorff%27s_alpha
It measures how far agreement deviates from chance. If you were to pick your relevance judgements at random, it would give you zero. More agreement between raters gets you closer to 1. Interestingly, it's possible to have a negative alpha if the disagreement is worse than random!
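In its general form the coefficient is simply:

```latex
\alpha = 1 - \frac{D_o}{D_e}
```

where D_o is the observed disagreement among raters and D_e is the disagreement expected by chance, which is what gives zero for random judgements, 1 for perfect agreement, and negative values for systematic disagreement.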
So you may get a number like this. And you’ll see that perhaps your raters don’t really agree. With a large group, what does that mean? I can’t trust anyone?
Well, there's another researcher, Arpad Elo, who developed the Elo rating system for chess players.
It gives you a way to rate competitors and estimate how likely they are to win against others. It was initially used for chess, but it can be applied to almost any competitive system. Here's an example of tennis champions and their Elo ratings.
https://www.betfair.com.au/hub/an-introduction-to-tennis-modelling/
Let's play a game. We will give our raters a starting Elo rating. We will turn rating agreement into a contest, and reward those who agree on relevance.
If you agree with someone else, you win. If you disagree with everyone else, you lose.
https://emojipedia.org/apple/
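A minimal sketch of one such agreement 'game', assuming a conventional Elo update (logistic expected score, K-factor 32, scale 400); the talk's actual constants and starting ratings aren't given here, so treat them as placeholders.

```python
def elo_update(r_a, r_b, a_won, k=32.0, scale=400.0):
    """One Elo 'game' between rater A and rater B on a single judgement.

    a_won is 1.0 if A agreed with the consensus and B did not,
    0.0 for the reverse, and 0.5 for a draw.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))
    new_a = r_a + k * (a_won - expected_a)
    new_b = r_b + k * ((1.0 - a_won) - (1.0 - expected_a))
    return new_a, new_b

# Two raters judge the same (query, document) pair; A matches the
# consensus grade, B does not, so A "wins" this round.
r_a, r_b = elo_update(1000.0, 1000.0, a_won=1.0)  # illustrative starting ratings
```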
Here's an example of 106 crowdsourced raters being compared for agreement 50,000 times, each receiving an Elo rating. Everyone starts with a rating of 1. After enough relevance judgement games, the rating represents how likely a rater is to agree or disagree with the others.
There is vast diversity here. Some are very likely to agree, and some are very likely to disagree. But does that make those who disagree wrong?
It might not. Because everyone has their own mind, and everyone has their own needs.
OK, taking a step back, let’s look at the logs.
A huge problem that many teams face is that relevance is really a data annotation and training problem. And it is difficult to connect automatic data annotation to relevance success when you can't interpret the numbers produced by something like nDCG.
So you'll be walking through the forest gathering data berries…
…and then you get chased out by the reality werewolf of misunderstanding how to interpret data. So you need a plan and a methodology for taking the right path.
https://upload.wikimedia.org/wikipedia/commons/9/91/Werewolf.svg
https://www.flickr.com/photos/99873033@N08/17780417711
Now we'll turn to the logs. If you have mature enough analytics capture, you might have data like this: what we call conversations, or sessions, in search.
If we remember that we need a goal to
https://commons.wikimedia.org/wiki/File:Time_study_stopwatch.JPG
MAKE THIS LESS CONFUSING BY ONLY USING THE BLUE LINE
And we see the obvious connection, one that you probably already knew when we started: it's time, it's effort. Those are the factors that weigh heavily on whether our customers will be happy with search. Getting that needle down as low as possible is the great frontier of search.
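To make 'time and effort' concrete, here is a minimal sketch of scoring a single search conversation from the logs; the event fields and the time-to-first-click proxy are assumptions for illustration, not the exact metric behind the blue line.

```python
from datetime import datetime

def session_effort(events):
    """Rough effort proxy for one search session (conversation):
    seconds from the first query to the first click, plus the number
    of query reformulations along the way. Field names are illustrative."""
    queries = [e for e in events if e["type"] == "query"]
    clicks = [e for e in events if e["type"] == "click"]
    if not queries or not clicks:
        return None  # abandoned session: no click to measure against
    seconds = (clicks[0]["ts"] - queries[0]["ts"]).total_seconds()
    return {"seconds_to_first_click": seconds,
            "reformulations": len(queries) - 1}

events = [
    {"type": "query", "ts": datetime(2019, 4, 24, 9, 0, 0), "q": "haystack"},
    {"type": "query", "ts": datetime(2019, 4, 24, 9, 0, 20), "q": "haystack conference"},
    {"type": "click", "ts": datetime(2019, 4, 24, 9, 0, 25), "doc": "agenda"},
]
print(session_effort(events))
```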
But this is really just one methodology! You have your own, because your product and your users are unique!
We are the community that pushes advances in open search methodology. We've got some great talks. We've got some great people. We're here to learn, connect, and grow.
https://commons.wikimedia.org/wiki/File:Flower_jtca001.jpg