Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of rankers - Erik Bernhardson


Evaluation of search quality is essential for developing effective rankers. Interleaved comparison methods reach statistical significance with less data than traditional A/B testing, so tests can run in shorter timeframes and subtler changes to the ranker can be evaluated. In interleaved ranking, two result lists are combined in a "fair" manner, such that clicks can be interpreted as unbiased judgments about the relative quality of the two rankers. In this talk we will dive into why interleaving can be a superior online evaluation method, along with how it could be added to your own evaluation toolset.


Addressing variance
in AB tests
Interleaved evaluation of rankers

Wikimedia
Search
Platform
● 300 languages
● 900 wikis
● 80% of search: 20 wikis
● 700M wiki pages indexed
● 1000+ autocomplete qps
● 700+ full text qps
● 6 clusters in 2 DCs
● Team of 5 engineers

Chapelle, Joachims, Radlinski, Yue 2012
http://olivier.chapelle.cc/pub/interleaving.pdf
Large Scale Validation
and Analysis of
Interleaved Search
Evaluation

Evaluation
of
Search
Quality
Public Domain, US Government

Offline
Evaluation
● Metrics on scale of result
set changes
● Golden Set / Expert
Judgements
● Simulated AB tests

Goals
● Blind to the user
● Robust to biases
unrelated to ranker
quality.
● Not substantially alter
the search experience
● Lead to clicks that reflect
user preference
Joachims 2003

Absolute
Metrics
● Clicks@N
● Max Clicked Position
● Clicks per query
● Time to First Click
● Session Abandonment
● Reformulations
● Zero Result Rate
● Interactions on clicked
pages

So, Variance?
How far a set of numbers are spread out from
their average value.
https://en.wikipedia.org/wiki/Variance

CC BY-SA 4.0, Zachary McCune
Public Domain, JRBrown

Maximum
Clicked
Position
850k per bucket

        INPUT RANKING   BALANCED
Rank    A    B          A FIRST   B FIRST
1       z    y          z (A)     y (B)
2       y    v          y (B)     z (A)
3       x    z          v (B)     v (B)
4       w    u          x (A)     x (A)
                        w (A)     u (B)
                        u (B)     w (A)

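import random

# Balanced interleaving of two rankings A and B into one list I.
A = ['a', 'b', 'c', 'd']
B = ['b', 'e', 'a', 'f']
k_a, k_b, I = 0, 0, []
# A coin flip decides which ranker is favored when the pointers are tied.
A_first = random.choice((True, False))
while k_a < len(A) and k_b < len(B):
    if k_a < k_b or (k_a == k_b and A_first):
        # A is behind (or tied and favored): take A's next result.
        if A[k_a] not in I:
            I.append(A[k_a])
        k_a += 1
    else:
        # Otherwise take B's next result; duplicates are skipped.
        if B[k_b] not in I:
            I.append(B[k_b])
        k_b += 1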

        INPUT RANKING   BALANCED
Rank    A    B          A FIRST   B FIRST
1       z    y          z (A)     y (B)
2       y    x          y (B)     z (A)
3       x    w          x (B)     x (B)
4       w    z          w (B)     w (B)
Not Always So Balanced
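
The imbalance on this slide can be reproduced with a small sketch. It wraps the balanced interleaving loop in a function that also records which ranker contributed each result, following the (A)/(B) labels on the slide. The function name `balanced` and the pair-based return format are assumptions made for illustration, not part of the talk.

```python
from collections import Counter


def balanced(a, b, a_first):
    """Balanced interleaving; returns (result, owner) pairs.

    The owner label is the ranker whose pointer contributed the result,
    mirroring the (A)/(B) labels on the slide.
    """
    k_a, k_b, out, seen = 0, 0, [], set()
    while k_a < len(a) and k_b < len(b):
        if k_a < k_b or (k_a == k_b and a_first):
            if a[k_a] not in seen:
                seen.add(a[k_a])
                out.append((a[k_a], "A"))
            k_a += 1
        else:
            if b[k_b] not in seen:
                seen.add(b[k_b])
                out.append((b[k_b], "B"))
            k_b += 1
    return out


A = list("zyxw")
B = list("yxwz")  # same results as A, with z moved from first to last

# Count how many interleaved results each ranker contributes.
counts = {
    a_first: Counter(owner for _, owner in balanced(A, B, a_first))
    for a_first in (True, False)
}
for a_first, count in counts.items():
    print("A first" if a_first else "B first", dict(count))
```

Whichever ranker wins the coin flip, B is labeled as the source of three of the four results, so even a user clicking at random would appear to prefer B.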

        INPUT RANKING   TEAM DRAFT
Rank    A    B          BBA     ABA     AAA
1       z    y          y (B)   z (A)   z (A)
2       y    v          z (A)   y (B)   y (B)
3       x    z          v (B)   v (B)   x (A)
4       w    u          x (A)   x (A)   v (B)
                        w (A)   w (A)   w (A)
                        u (B)   u (B)   u (B)
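
A minimal sketch of team-draft interleaving follows, assuming the column headers (BBA, ABA, AAA) name which ranker picks first in each round. In every round a coin flip decides who picks first, and each ranker then contributes its highest-ranked result not already in the list. The function name `team_draft` and the `flips` parameter are hypothetical, added only so the slide's columns can be reproduced deterministically.

```python
import random
from itertools import repeat


def team_draft(a, b, flips=None):
    """Interleave rankings a and b; return (result, owner) pairs.

    flips: optional iterable of "A"/"B" naming who picks first in each
    round (a hypothetical hook for reproducing the slide). When omitted,
    each round uses a fresh random coin flip.
    """
    if flips is None:
        flips = (random.choice("AB") for _ in repeat(None))
    ranked = {"A": a, "B": b}
    seen, out = set(), []
    for first in flips:
        second = "B" if first == "A" else "A"
        progressed = False
        for team in (first, second):
            # Highest-ranked result from this team not yet in the list.
            doc = next((d for d in ranked[team] if d not in seen), None)
            if doc is not None:
                seen.add(doc)
                out.append((doc, team))
                progressed = True
        if not progressed:
            break  # both rankings are exhausted
    return out


A = list("zyxw")
B = list("yvzu")
for flips in ("BBA", "ABA", "AAA"):
    print(flips, " ".join(f"{doc} ({team})" for doc, team in team_draft(A, B, flips)))
```

Returning the owner alongside each result is one way to carry the attribution that click logging needs later.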

How much
less data?
● Yahoo: 20x - 400x
● Netflix: > 100x
● Wikipedia: 10x - 100x

Is it Accurate?
Fair Use, Netflix
https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55

Running Shorter
Tests
But not too short.

Backend
● _msearch
● interleave
● Include “owner” in
response

Frontend
● Log clicks with owner of
clicked link

Analysis
● Choose win/tie on
per-search basis
● Choose win/tie on
per-session basis
● Bootstrap confidence
intervals
● Pretty graphs
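
The analysis steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used in the talk: each session is assumed to be logged as a list of owner labels ("A" or "B") for its clicked results, the preference statistic (wins for A plus half the ties, over all sessions, minus 0.5) is one common choice for interleaving experiments, and the interval is a plain percentile bootstrap. All function names and the sample data are hypothetical.

```python
import random


def winner(clicked_owners):
    """Decide one session: the ranker with more clicked results wins."""
    a = clicked_owners.count("A")
    b = clicked_owners.count("B")
    return "A" if a > b else "B" if b > a else "tie"


def preference(outcomes):
    """(wins_A + ties / 2) / n - 0.5; positive values favor ranker A."""
    wins_a = outcomes.count("A")
    ties = outcomes.count("tie")
    return (wins_a + 0.5 * ties) / len(outcomes) - 0.5


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the preference statistic."""
    rng = random.Random(seed)
    stats = sorted(
        preference(rng.choices(outcomes, k=len(outcomes)))
        for _ in range(n_resamples)
    )
    lo = stats[int(n_resamples * alpha / 2)]
    hi = stats[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi


# Hypothetical click logs: one list of clicked-result owners per session.
sessions = [["A"], ["A", "B", "A"], ["B"], ["A", "B"], ["A"], []]
outcomes = [winner(s) for s in sessions]
print(outcomes, preference(outcomes), bootstrap_ci(outcomes))
```

If the whole interval lies above zero, the data favors A; if it straddles zero, the test has not yet separated the two rankers. The same functions apply per search instead of per session by changing how clicks are grouped.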
