This document discusses the multi-armed bandit problem and potential algorithms to improve the DARTS system at Intuit. It begins with an overview of multi-armed bandit problems and terminology. It then covers several non-contextual bandit algorithms like epsilon-greedy, Thompson sampling, and UCB. Next, it discusses exploiting context, including user context through clustering and arm context through linear models. It compares the exploration-exploitation tradeoff, total regret, and robustness to batch updates of different algorithms. The document aims to explore using multi-armed bandit algorithms to optimize content delivery in DARTS.
1. Multi-armed Bandit Problem: Potential Improvement for DARTS
CTG Data Science Lab
Aniruddha Bhargava, Yika Yujia Luo
August 17, 2016
2. Agenda
1. Problem Overview
2. Algorithms
   • Non-contextual cases
   • Contextual cases
3. Industry Review
4. Advanced Topics
4. When do we run into the Multi-armed Bandit Problem (MAB)?
Examples: gambling, research funding, clinical trials, content management.
5. What is the Multi-armed Bandit Problem (MAB)?
Goal: pick the best restaurant efficiently.
Logistics: select a restaurant for each person, who leaves you a tip afterwards.
How?
(Figure: three restaurants with observed tips such as $1, $8, and $10, and running average tips of $2, $7, and $6.)
6. MAB Terminology
Arm: a restaurant.
User: a person sent to a restaurant.
Reward: the tip.
Expected reward: the average tip in the long run.
Exploration: a learning process over people's preferences; it always involves a certain degree of randomness.
Exploitation: using the current, reliable knowledge of the arms to select a restaurant.
Policy: the strategy you use to select restaurants.
Regret: the expected tip loss from sending a person to a restaurant that is not the best.
Total cumulative regret: the total tips you lose -- the standard performance measure for bandit algorithms.
(Example: with expected tips of $1, $8, and $10, sending someone to the $1 restaurant incurs $9 of regret, the $8 restaurant $2, and the best restaurant $0; two $9 misses plus one $2 miss plus one perfect choice give a total regret of $20.)
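A tiny Python sketch of this bookkeeping (the restaurant names and the tip-to-name assignment are made up for illustration; the dollar amounts follow the slide's example):

```python
# Expected tips per restaurant (illustrative assignment of the slide's $1/$8/$10).
expected_tips = {"McDonald's": 1.0, "Chili's": 8.0, "Subway": 10.0}
best = max(expected_tips.values())

def total_regret(visits):
    """Total cumulative regret: best expected tip minus the chosen one, summed."""
    return sum(best - expected_tips[r] for r in visits)

print(total_regret(["McDonald's", "McDonald's", "Chili's", "Subway"]))  # 9+9+2+0 = 20.0
```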
7. MAB Big Picture
MAB sits at the intersection of decision making and optimization:
Decision making: choose the best product -- here, find the best restaurant to go to.
Optimization: minimize total regret -- avoid sending people to bad restaurants as much as possible.
8. Algorithms (Non-contextual Cases)
"Anytime you are faced with the problem of both exploring and exploiting a search space, you have a bandit problem. Any method of solving that problem is a bandit algorithm." -- Chris Stucchio
9. Non-contextual vs. Contextual
(Diagram: users and products shown without any features -- the non-contextual setting.)
Key point: although everyone has different tastes, we pick one best restaurant for everyone.
10. MAB Policies
A/B Testing
Adaptive policies:
  • ε-greedy
  • Thompson Sampling
  • Upper Confidence Bound (UCB)
There are more bandit algorithms...
11. A/B Testing
During the test, each person is assigned to a restaurant at random (100% exploration).
After the test, every person is sent to the measured winner (100% exploitation).
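Viewed as a bandit policy, A/B testing is "explore uniformly, then commit". A minimal Python sketch, where `pull`, the arm names, and the round budgets are illustrative assumptions:

```python
import random

def ab_test(arms, pull, test_rounds, total_rounds):
    """Explore uniformly at random for test_rounds, then commit to the winner."""
    sums = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    winner = None
    for t in range(total_rounds):
        if t < test_rounds:
            arm = random.choice(arms)                 # 100% exploration
        else:
            if winner is None:                        # pick the winner once
                winner = max(arms, key=lambda a: sums[a] / max(counts[a], 1))
            arm = winner                              # 100% exploitation
        r = pull(arm)                                 # observe the user's tip
        sums[arm] += r
        counts[arm] += 1
    return winner, counts
```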
12. ε-greedy
Select (e.g. ε = 0.2): with probability 1 - ε, send person i to the restaurant with the highest average tips; with probability ε, pick a restaurant at random.
Update: record person i's feedback and update that restaurant's average tip value.
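A minimal ε-greedy sketch of that select/update loop (the `pull` reward callback is an assumed stand-in for a user leaving a tip; ε = 0.2 follows the slide):

```python
import random

def epsilon_greedy(arms, pull, rounds, eps=0.2):
    """With probability eps explore a random arm, otherwise exploit the best average."""
    sums = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for _ in range(rounds):
        if random.random() < eps:
            arm = random.choice(arms)                                   # explore
        else:
            arm = max(arms, key=lambda a: sums[a] / max(counts[a], 1))  # exploit
        r = pull(arm)
        sums[arm] += r       # update that restaurant's running average
        counts[arm] += 1
    return {a: sums[a] / max(counts[a], 1) for a in arms}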
13. Upper Confidence Bound (UCB)
Select: send person i to the restaurant with the highest upper confidence bound. With x̄_j the average tips from restaurant j, n_j the number of people sent to restaurant j, and n the total number of people, the UCB1 index is x̄_j + sqrt(2 ln n / n_j).
Update: record person i's feedback and update the upper confidence bound of that restaurant's average tips.
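A minimal UCB1 sketch (again, `pull` is an assumed reward callback):

```python
import math

def ucb1(arms, pull, rounds):
    """UCB1: try each arm once, then pick the arm with the largest
    empirical mean plus exploration bonus sqrt(2 ln n / n_j)."""
    sums = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for a in arms:                       # initialization: one pull per arm
        sums[a] += pull(a)
        counts[a] += 1
    for n in range(len(arms) + 1, rounds + 1):
        ucb = {a: sums[a] / counts[a] + math.sqrt(2 * math.log(n) / counts[a])
               for a in arms}
        arm = max(ucb, key=ucb.get)      # highest upper confidence bound
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts                        # visits concentrate on the best arm
```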
14. Thompson Sampling (Bayesian)
Sample: simulate the three restaurants' average-tip distributions (McDonald's, Subway, Chili's) and randomly draw one value from each distribution.
Select: send person i to the restaurant with the highest sampled tip.
Update: record person i's feedback and update that restaurant's average-tip distribution.
(Figure: three posterior curves over average tips in $.)
15. Thompson Sampling (Bayesian)
(Figure: two posterior plots, one with Pr(r < b) = 10% and one with Pr(r < b) = 0.01% -- as the posteriors sharpen with data, the chance of drawing a value below the benchmark b becomes negligible.)
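A minimal Thompson-sampling sketch, assuming each restaurant's average tip is modeled with a Normal posterior under a known observation variance (the prior parameters `mu0` and `tau2`, the variance `sigma2`, and the `pull` callback are all illustrative assumptions):

```python
import random

def thompson_gaussian(arms, pull, rounds, sigma2=1.0, mu0=0.0, tau2=100.0):
    """Thompson sampling with a Normal prior on each arm's mean tip
    (known observation variance sigma2 -- a simplifying assumption)."""
    stats = {a: {"n": 0, "sum": 0.0} for a in arms}
    for _ in range(rounds):
        samples = {}
        for a in arms:
            n, s = stats[a]["n"], stats[a]["sum"]
            prec = 1.0 / tau2 + n / sigma2           # posterior precision
            mean = (mu0 / tau2 + s / sigma2) / prec  # posterior mean
            samples[a] = random.gauss(mean, (1.0 / prec) ** 0.5)
        arm = max(samples, key=samples.get)          # highest sampled tip wins
        r = pull(arm)
        stats[arm]["n"] += 1                         # update that arm's posterior
        stats[arm]["sum"] += r
    return stats
```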
16. Algorithm Comparison
1. Exploration vs. Exploitation
2. Total Regret
3. Batch Update
17. Algorithm Comparison: Exploration vs. Exploitation
Key point: exploration costs money!
(Chart: exploration % over time. A/B testing explores 100% of traffic during the test and 0% afterwards; ε-greedy explores at a constant rate ε.)
18. Algorithm Comparison: Total Regret
(Charts: cumulative traffic split across restaurants M, S, and C under A/B testing, roughly 44% / 28% / 28%, versus an adaptive policy, roughly 70% / 18% / 12%. The adaptive policy concentrates traffic on the winner over time, so it accrues less total regret.)
19. Algorithm Comparison: Batch Update
(Diagram: the system asks questions of many users and stores their answers, updating the model only periodically rather than after every interaction.)
Robustness to batch updates:
  • A/B Testing: very robust
  • ε-greedy: depends
  • UCB: not robust
  • Thompson Sampling: robust
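The intuition behind this table: a deterministic index (UCB-style) gives every user in a batch the same arm until feedback arrives, while a randomized index (Thompson-style) naturally spreads a batch across plausible arms. A toy sketch with hard-coded illustrative scores:

```python
import random

def serve_batch(sample_index, arms, batch_size):
    """Choose arms for a whole batch before any feedback arrives.

    sample_index(arm) -> score; deterministic for UCB-style rules,
    randomized for Thompson-style rules.
    """
    return [max(arms, key=sample_index) for _ in range(batch_size)]

# Deterministic index: every user in the batch gets the same arm.
ucb_like = lambda a: {"M": 1.9, "S": 2.1, "C": 2.0}[a]
print(serve_batch(ucb_like, ["M", "S", "C"], 5))   # ['S', 'S', 'S', 'S', 'S']

# Randomized index: the batch is spread across plausible arms.
ts_like = lambda a: random.gauss({"M": 1.9, "S": 2.1, "C": 2.0}[a], 0.5)
print(serve_batch(ts_like, ["M", "S", "C"], 5))    # mixed, e.g. ['S', 'C', 'S', 'M', 'S']
```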
20. Algorithm Comparison: Summary
Pros
  • A/B Testing: easy to implement; good for a small number of arms; robust to batch updates
  • ε-greedy: easy to implement; if a good ε is found, lower total regret and faster best-arm identification than ε-first
  • UCB: good for a large number of arms; finds the best arm fast; low total regret
  • Thompson Sampling: good for a large number of arms; finds the best arm fast; low total regret; robust to batch updates
Cons
  • A/B Testing: high total regret
  • ε-greedy: high total regret; need to figure out a good ε
  • UCB: not robust to batch updates
  • Thompson Sampling: sensitive to statistical assumptions
21. Non-contextual vs. Contextual
Contextual: users have features (e.g. female, vegetarian, married, Latino) and products have features (e.g. burger, non-vegetarian, cheap, good service).
Key point: everyone has different tastes, so we pick the best restaurant for each person.
22. Agenda
1. Problem Overview
2. Algorithms
   • Non-contextual cases
   • Contextual cases
3. Industry Review
4. Advanced Topics
24. What do we mean by context?
User side:
  • Likes spicy food, refined tastes, plays violin, male, ...
  • From Wisconsin, likes German food, likes football, male, ...
  • Student, doesn't like seafood, allergic to cats, female, ...
  • Chief of AFC, watches shows on competitive eating, female, ...
Arm side:
  • Tex-Mex style, sit-down dining, founded in 1975, ...
  • Serves sandwiches, has veggie options, founded in 1965, ...
  • Breakfast, lunch, and dinner, cheap, founded in 1940, ...
User Context
[Figure: average reward over time. The non-contextual policy plateaus at the best possible reward without context; the contextual (user) policy climbs to the higher best possible reward with context.]
Arm Context
[Figure: average reward over time. The contextual (arm) policy converges to the same ceiling as the non-contextual one (the best possible without user context), while the policy using both arm and user context reaches the best possible reward with user context the fastest.]
Takeaway Message
User context can increase the optimal rewards; arm context can get you there faster!
Exploiting Context
User side:
• Population segmentation, e.g. DARTS
• Clustering users
• Learning embeddings
Arm side:
• Linear models: LinUCB, Linear TS, OFUL
• Maintain an estimate of the best arm
• More data → shrink uncertainty
Exploiting User Context
Assumptions:
• Users can be represented as points in space
• Users cluster together, so points that are close are similar
• Stationarity
Exploiting User Context
[Figure series: the users (Joe, Yao, Nichola, Peter, Aniruddha, Rachel, Sophie, Yika, Vineeta, Jason, Andre, Chris, Madeline, John) plotted on a meat↔vegetarian by mild↔spicy plane.]
• Scatter: users form groups in the feature space.
• Linear: a straight boundary separates the groups.
• Quadratic: a curved boundary fits the groups better.
• Hierarchical: instead of hard boundaries, each user gets soft cluster-membership probabilities (e.g. 40% / 35% / 25%), which sharpen toward assignments like 80% / 15% / 5% or 5% / 5% / 90% as more data pins each user down.
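A brief sketch of the clustering step; KMeans, the feature axes, and the cluster count are illustrative assumptions (the deck doesn't prescribe a method):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical user features: column 0 = meat (0) .. vegetarian (1),
# column 1 = mild (0) .. spicy (1).
rng = np.random.default_rng(0)
users = rng.random((100, 2))

# Hard-assign users to 3 clusters; each cluster then runs its own
# non-contextual bandit (e.g. the Thompson sampler sketched earlier).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(users)
print(np.bincount(labels))  # users per cluster
```

A hierarchical or soft variant would replace the hard labels with per-cluster membership probabilities, as in the figure above.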
Exploiting Arm Context
Linear models: look only at arm context, no user context.
Assumptions:
• We can represent arms as vectors.
• Rewards are a noisy version of the inner product.
• Stationarity.
Methods include:
• Linear UCB
• Linear Thompson Sampling
• OFUL (Optimism in the Face of Uncertainty – Linear)
• … and many more.
The Math Slide
Notation: θ* is the optimal arm (the unknown parameter vector); x_t is the arm pulled at time t; r_t is the reward at time t; η_t is the noise at time t; C_t is the confidence set; λ is the ridge term; X_t is the matrix of all arms pulled up to time t.
Standard noisy linear model:
  r_t = x_tᵀ θ* + η_t
Collect all the data and write:
  r = X θ* + η
Least-squares solution:
  θ_LS = (XᵀX)⁻¹ Xᵀ r
Ridge regression:
  θ_ridge = (XᵀX + λI)⁻¹ Xᵀ r
Typical linear bandit algorithm:
  θ_0 = 0
  for t = 0, 1, 2, …
    x_t = argmax_{x ∈ C_t} xᵀ θ_t
    θ_t = (X_tᵀ X_t + λI)⁻¹ X_tᵀ r_{1:t}   (r_{1:t}: the vector of rewards up to time t)
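A compact sketch of the ridge estimate and a greedy arm choice. Note the deck's algorithm maximizes over the confidence set C_t (which is what gives LinUCB/OFUL their exploration); this simplified version just maximizes over the known arm set:

```python
import numpy as np

def ridge_estimate(X, r, lam=1.0):
    # theta = (X^T X + lam * I)^{-1} X^T r, solved without an explicit inverse.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)

def greedy_arm(arms, theta):
    # Pull the arm with the largest predicted reward x^T theta.
    return int(np.argmax(arms @ theta))
```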
Exploiting Arm Context
[Figure: the set of arms x_1, x_2, … as vectors on the meat↔vegetarian by mild↔spicy plane: Mince pie, Buffalo wings, Tofu scramble, Grilled vegetables, Ratatouille, Tandoori chicken, Jalapeño scramble, Pad Thai, Penne arrabiata. θ* marks the optimal arm.]
Exploiting Arm Context
[Figure: Buffalo wings is the next arm chosen; θ is the angle between it and the optimal arm θ*.]
The reward (= cos(θ)) is small, but we can still infer information about other arms!
Exploiting Arm Context
[Figure: after the first pull we form an estimate θ_1 of the optimal arm, surrounded by a confidence set C_1, the region of uncertainty.]
Exploiting Arm Context
We've already homed in on a pretty good choice.
[Figure: the next arm chosen, x_2, lies close to the current estimate of the optimal arm; the region of uncertainty has shrunk.]
Exploiting Arm Context
And the process continues …
[Figure: a new estimate θ_2 with a smaller confidence set C_2 around the optimal arm.]
Some Caveats
• Big assumption that we know good features.
• Finding features takes a lot of work.
• Few arms, many people → learn an embedding of the arms.
• Few people, many arms → featurize, then use linear bandits.
• Linear models are a naive assumption; see kernel methods.
Agenda
1. Problem Overview
2. Algorithms
Non-contextual cases
Contextual cases
3. Industry Review
4. Advanced Topics
Washington Post
Used Upper Confidence Bound (UCB) to pick headlines and photos.
Google Experiments
Used Thompson Sampling (TS).
Updated models twice a day.
Two metrics used to gauge the end of an experiment:
• 95% confidence that an alternative is better, or …
• the "potential value remaining in the experiment"
Takeaway Message
The more arms, the higher the gain over A/B testing.
Advanced Topics
• Biasing
• Data Joining and Latency
• Non-stationarity
Bias
Each website shows two pies with different probabilities and records how many it sells:
                Website 1     Website 2
  Probability   50% / 50%     90% / 10%
  Number sold   100 / 20      100 / 20
Who did better?
Bias
• Be careful when using past data!
• Inverse Propensity Score Matching: reweight each observation by (target probability / logging probability).
• New sales estimates, on a common 50/50 showing policy:
  Website 1: 100·0.5 + 20·0.5 = 60
  Website 2: 100·0.5·(0.5/0.9) + 20·0.5·(0.5/0.1) ≈ 77.8
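The reweighting fits in a one-line helper; the function and argument names are illustrative:

```python
def ips_estimate(sales, logged_probs, target_probs):
    # Reweight each observed count by (target prob / logging prob).
    return sum(s * (pt / pl)
               for s, pl, pt in zip(sales, logged_probs, target_probs))

# Website 2's numbers from the slide, reweighted to a 50/50 policy:
print(ips_estimate([100 * 0.5, 20 * 0.5], [0.9, 0.1], [0.5, 0.5]))  # ~77.8
```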
Data Joining and Latency
[Diagram (courtesy: Microsoft MWT white paper): the system logs the context and the decision at serve time; rewards arrive later and must be joined back to them, despite latency.]
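A toy sketch of the join step, keyed by a hypothetical event id; real systems (e.g. the MWT design referenced above) do this at scale with streaming joins:

```python
def join_rewards(decisions, rewards):
    # decisions: {event_id: (context, arm_shown)}; rewards: {event_id: reward}.
    # Rewards arrive after a delay, so only events seen on both sides
    # can be used to update the model (hence batch updates).
    return [(ctx, arm, rewards[eid])
            for eid, (ctx, arm) in decisions.items() if eid in rewards]
```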
Non-Stationarity – Beer example
My yearly beer taste: stouts and porters in January; pale ales and IPAs in April; wits and lagers in July; Oktoberfests and reds in October; Christmas ales in December.
Non-Stationarity
Preferences change over time. There may be periodicity in the data; tax season is a great example.
Some solutions (see the sketch below):
• Slow changes → a system with finite memory
• Abrupt changes → subspace tracking / anomaly detection
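A minimal sketch of the finite-memory idea (the constant step size alpha is an assumption, not from the deck): a fixed step size weights recent rewards exponentially more, so the estimate tracks slow drift.

```python
def update_discounted(avg_tip, tip, alpha=0.1):
    # Constant step size: old observations decay geometrically, so the
    # running estimate follows slowly changing preferences.
    return avg_tip + alpha * (tip - avg_tip)
```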
Takeaway Message
Preferences change over time, biases are added, and data needs to be joined from different sources.
Speaker Notes
• Classic bandit problem: which machines to play, how many times to play each machine, and in which order to play them.
• Budgeted version: given a fixed budget, the problem is to allocate resources among the competing projects.
• Clinical trials: investigating the effects of different experimental treatments while minimizing patient losses.
• Compared to recommendation problems (Netflix), only one (user, arm) pair is known per round: we see that Peter went to McDonald's, whereas Netflix sees which movies Peter has watched.
On average, what a population might pay for the best single experience is lower than what each individual might pay for their own optimal experience. Think of vegetarians and meat eaters: suppose a population with 2/3 meat eaters and 1/3 vegetarians. On average, we cater to non-vegetarians, so if people are willing to spend, on average, $15, the best possible population-wide reward is $10. If we can identify the two populations and select appropriate restaurants for each, we can get $15 on average. A $5 increase!
We can also learn faster: knowing something about one arm tells us information about other arms. Think of vegetarians and meat eaters again: if we know the population says no to meat-only restaurants, then they will likely say the same about other meat-only restaurants. Learning that reduces the number of contenders for the optimal restaurant.
We maintain a continuous estimate of the best arm, but select from a discrete set of arms.
Note on the industry examples: these companies aren't adding new content, and they aren't using either kind of context.
Meta-point: in practice, teams fall on a spectrum between completely deploy-and-forget and continuous monitoring.
The "value remaining" in an experiment is the amount of increased conversion rate you could get by switching away from the champion. The whole point of experimenting is to search for this value. If you’re 100% sure that the champion is the best arm, then there is no value remaining in the experiment, and thus no point in experimenting. But if you’re only 70% sure that an arm is optimal, then there is a 30% chance that another arm is better, and we can use Bayes’ rule to work out the distribution of how much better it is.
Figure 2: how many days earlier TS ends compared to A/B/n testing (frequency out of 500 experiments).
Both websites sold 120 pies, and the same number of each pie; only the probabilities of showing the pies differ.
Data joining: we need to bring together the context (user and arm), what variation was shown (the decision), and the reward.
Latency: the delay between the user's response and when the system sees it; hence batch updates.
In reality, for a short enough period, preferences remain mostly the same. Finite memory: only infer insights from a fixed time window (good for slowly changing signals). Anomaly detection: check whether something weird happened (good for sudden changes).