Copyright © 2018 Samsung SDS, Inc. All rights reserved
Reinforcement Learning
in the Wild
&
Lessons Learned
Mohamad Charafeddine
@mohamadtweets
Director of Tech Planning, AI Team
Samsung SDS Research America
April 12th, 2018
“Theory is the first term in the Taylor series of practice” – Tom Cover, Stanford
Professor of Information Theory, in his 1990 Shannon Lecture
Practice = Theory + higher order terms
“the wild”
[Chart: complexity vs. time, with “Practice” rising above “Theory”.]
Optimize “engagement” through AI personalization
• What does “engagement” mean?
• From whose point of view? User? Company? Society?
• For what time horizon? Days? Weeks? Years?
• ..
• Unintended 2nd order effect: amplification of
echo-chambers
• Unintended 3rd order effect: ?
• Should the AI objective function be open-source and
auditable?
• Should the AI objective function imitate/learn from
humans?
Takeaways from this talk
Introduce Reinforcement Learning and its breadth of potential applications
Showcase some RL examples
Provide a framework to better evaluate RL application areas in terms of risk
and design challenges from a PM point of view
I. Reinforcement Learning Intro
II. 3 RL Use Cases
III. Lessons Learned
Reinforcement Learning
[Diagram: RL Agent ↔ Environment. The agent takes actions; its inputs are rewards and the perceived state of the environment.]
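The agent-environment loop on this slide can be sketched in a few lines. Here `ToyEnv`, its dynamics, and the random action choice are hypothetical stand-ins for any concrete environment and agent; the point is the shape of the loop, not the policy.

```python
import random

class ToyEnv:
    """Hypothetical stand-in environment: reach state 3 to earn a reward."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Action 1 moves right, anything else moves left, clipped to [0, 3].
        self.state = max(0, min(3, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

env = ToyEnv()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = random.choice([0, 1])          # agent picks an action
    state, reward, done = env.step(action)  # environment returns next state + reward
    total_reward += reward                  # the agent optimizes cumulative reward
print(total_reward)  # prints 1.0
```

In a real RL system the `random.choice` line is replaced by a learned policy, and `total_reward` is what the learner tries to maximize in expectation.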
In the context of an Atari game, the long-term objective is the score
[Diagram: RL Agent ↔ Environment. Actions; inputs: the state of the environment, i.e., the frames on the screen; rewards: the score.]
https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning/
.. But just focusing on the score in the reward function can sometimes backfire!
[Diagram: RL Agent ↔ Environment. Actions; inputs: the state of the environment; rewards: the score.]
https://blog.openai.com/faulty-reward-functions/
The boat can spin in loops collecting goodies but never finishes the race! – an unintended higher-order effect
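The boat-race failure mode can be reproduced with a toy calculation. The numbers below are made up for illustration, not the actual CoastRunners scores: under a score-only reward, a policy that loops over respawning pickups out-earns one that actually finishes the race.

```python
# Hypothetical per-episode accounting for a score-only reward function.
EPISODE_STEPS = 1000

def loop_policy_score():
    # Circle respawning targets worth 10 points, hitting one every 20 steps.
    return (EPISODE_STEPS // 20) * 10   # never finishes the race

def finish_policy_score():
    # Drive the track once: a finish bonus plus a few pickups along the way.
    return 200 + 5 * 10

print(loop_policy_score(), finish_policy_score())  # looping wins under this reward
```

The fix is not a smarter learner but a reward that encodes what we actually want (finishing the race), which is exactly the higher-order-effects point of this slide.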
Many potential use cases of RL, where the environment is: an object (physics, etc.), a human (preferences, ..), biology (& chemistry, ..), a market (multiple agents), code, ..
Environment
Robotics
Industrial
Manufacturing
Social
Content
Wellness
Healthcare
Pharmaceutical
Agriculture
Advertising
Marketing
Enterprise
Finance
Games
Security
E-commerce
Networking
Another AI system
..
Deep Learning has succeeded in breaking into practice:
Marketing, Sales, Customer Support, Security, Recruiting, Education, Investment, Legal, Logistics, Healthcare, Wellness, Automotive, Manufacturing, Agriculture, Personal Assistants, Speech/Image/Video recognition, Advertising, ..
Reinforcement Learning has succeeded in fewer areas:
Games, Robotics, Chatbots, Manufacturing, Wellness, Automotive*, Marketing*, Customer Support*, Agriculture*, Advertising*, ..
[Diagram: the AI value chain Data → Features → Perception Models → Insights → Recommendation → Decision → Value.
Most AI applications today: Deep Learning handles perception, while human-designed rules make the decision.
Future AI applications: Deep Learning handles perception and Reinforcement Learning makes the decision (tech + ethical challenges).]
Slowly, human decision rules will be replaced with AI decisions. Such a move, inherent to RL applications, opens the door to questions of ethics and decision governance.
To apply RL, we need to understand the characteristics (metadata) of the
problems that are most appropriate to apply it to.
Where to start to bring RL to practice? Risk profile?
And how to build a framework to qualify application areas?
Traffic Lights IoT Optimization
• 2 sensors on each lane that
measure # of cars that pass
& speed of cars
• 2 intersections with 4 lights
each
• Goal is to optimize for flow
rate (# cars/sec)
Controlled Environment: Simulator
Traffic Lights IoT Optimization
[Diagram: DRL Agent (LSTM, online learning) ↔ Environment (flow of cars in the intersections). Actions: 8 discrete actions. Inputs: a raw stream of data from 16 sensor outputs (car counter + speed). Direct reward; state of the environment.]
End-to-end: no feature engineering, just feeding the raw stream of data.
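A minimal sketch of the interface this slide describes. The environment stub, its random dynamics, and the flow-rate range are assumptions for illustration only; the real system feeds a simulator. What is faithful to the slide is the shape: a flat 16-value sensor vector in, one of 8 discrete actions out, flow rate (cars/sec) as the direct reward.

```python
import random

NUM_SENSORS = 16   # 2 sensors per lane across the lanes: car counter + speed
NUM_ACTIONS = 8    # discrete light-phase configurations for the 2 intersections

class TrafficEnvStub:
    """Hypothetical stand-in for the traffic simulator: random dynamics, correct shapes."""
    def observe(self):
        # Raw sensor stream, no feature engineering (end-to-end).
        return [random.random() for _ in range(NUM_SENSORS)]

    def step(self, action):
        assert 0 <= action < NUM_ACTIONS
        flow_rate = random.uniform(0.0, 2.0)   # cars/sec through the intersections
        return self.observe(), flow_rate       # reward is the direct flow rate

env = TrafficEnvStub()
state = env.observe()
state, reward = env.step(action=3)
print(len(state))  # 16
```

An LSTM policy, as on the slide, would consume this stream step by step and keep the recent sensor history in its hidden state rather than in hand-built features.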
Results: ~30% reduction in total travel time
[Bar charts: travel-time reductions of roughly 30-39% (e.g., 32% faster) across uneven-flow and high-flow scenarios.]
Theory to Practice journey:
• What happens if I want to apply this to a city?
• How to handle 10s of thousands of actions?
• How to characterize convergence, robustness?
• How fast can it adapt to changes?
• …
• Cost of sensors? Use of Traffic on Google Maps? Etc.
• Can Autonomous Vehicles play a role as extra control knobs?
RL for traffic management: using Autonomous Vehicles, on the road with human drivers (2 lanes)
Simulation: Wu, et al., IEEE T-RO, 2018
Ion Stoica, “RL Systems @ RISELab at UC Berkeley”, ScaledML Conference by Matroid, March 2018, Stanford Univ.
https://www.youtube.com/watch?v=-KC3tO4BDuQ
Applying DRL to Storage Servers to optimize operational efficiency
[Diagram: RL Agent (Advantage Actor-Critic, A2C) ↔ Environment (24 SSD drives). State: a function of workload (reads or writes) and temperature. Actions: fan speeds. Reward: a function of temperatures & fan speeds. A desired operation region is marked.]
S. Srinivasa, G. Kathalagiri, J. Varanasi, L. Quintela, M. Charafeddine, C. Lee, “On Optimizing Operational Efficiency in Storage Systems Via Deep Reinforcement Learning”, submitted to ECML PKDD
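One way to write a reward "as a function of temperatures & fan speeds". The slide does not give the actual function, so the shape below (penalize drive temperatures outside a desired band, plus a small fan-power penalty so the agent does not simply max out cooling) and every constant in it are assumptions for illustration.

```python
def reward(temps_c, fan_rpms, t_low=30.0, t_high=50.0, fan_weight=1e-4):
    """Hypothetical reward: 0 is best; grows more negative outside the desired region."""
    temp_penalty = sum(
        max(0.0, t_low - t) + max(0.0, t - t_high)  # distance outside [t_low, t_high]
        for t in temps_c
    )
    fan_penalty = fan_weight * sum(fan_rpms)        # discourage running fans flat out
    return -(temp_penalty + fan_penalty)

# In-region temperatures with moderate fans beat overheating with fans maxed out.
print(reward([40.0] * 24, [3000] * 24) > reward([60.0] * 24, [9000] * 24))
```

The relative weighting between the two penalties is itself a design decision: it encodes the trade-off between thermal safety and energy/acoustics that the agent will learn to navigate.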
Learning… over a few days
[Diagram: RL Agent receives the state (different workloads, temperature) and a reward, and outputs fan speeds.]
• Learning directly on the real environment
(no simulator)
• Model-free: does not require any knowledge
of the SSD server behavior dynamics
• Exposed to different stochastic workloads
Performance for idle vs. heavy I/O workloads on the operational contours
[Plots: status-quo controller vs. Deep Reinforcement Learning, with the desired operational region marked.]
Performance for different workloads
At the beginning of training, the algorithm is exploring and learning. Once it has learned the right policy, operational behavior stays within the desired region.
[Plot: resulting distribution from different workloads, with the desired operational region marked.]
RL in Digital Marketing
Ad Spend Optimization for leads demand generation
Demand gen challenges
• Financial Services: daily leads-qualification volume constraints
• Hospitality: changing inventory (hotels, car rentals)
• Food Apps: limited supply & time-sensitivity challenges
• Retail: limited # of inventories and discounts; marketing bidding vs. competitors
[Curve: gross profit vs. demand spend. Under-producing demand (demand < supply) means opportunity loss; over-producing demand (demand > supply) means over-spending; maximum gross profit lies in between.]
Setup: Marketing Demand Gen Optimization
Decide the hourly Cost Per Click for 8 Search Engine Marketing accounts to optimize Gross Profit:
sum over 24 hrs of (Hourly Lead Gen Referral Revenue - Hourly Cust. Acquisition Cost from SEM)
[Diagram: RL Agent (TRPO, importance sampling) interacts with the marketplace webpage every hour.]
State: previous CPC per SEM account, hour of day, day of week, ..
Reward: Gross Profit
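The state listed on this slide (previous CPC per SEM account, hour of day, day of week) can be encoded as a flat feature vector, and the gross-profit objective is a straightforward sum. The one-hot encoding below is a common choice for cyclical categorical features, not necessarily what the production system used; all dimensions follow from the slide's counts (8 accounts, 24 hours, 7 days).

```python
def encode_state(prev_cpc, hour, weekday):
    """prev_cpc: list of 8 previous CPCs, one per SEM account;
    hour: 0-23; weekday: 0-6. Returns an 8 + 24 + 7 = 39-dim feature vector."""
    assert len(prev_cpc) == 8
    hour_onehot = [1.0 if h == hour else 0.0 for h in range(24)]
    day_onehot = [1.0 if d == weekday else 0.0 for d in range(7)]
    return list(prev_cpc) + hour_onehot + day_onehot

def gross_profit(hourly_revenue, hourly_sem_cost):
    # Sum over 24 hrs of (hourly referral revenue - hourly cust. acquisition cost from SEM).
    return sum(r - c for r, c in zip(hourly_revenue, hourly_sem_cost))

state = encode_state([0.5] * 8, hour=14, weekday=2)
print(len(state))  # 39
```

Note the reward is indirect and lagged: the agent acts every hour, but the objective is the 24-hour sum, which is part of why this use case sits on the "harder" side of the reward taxonomy later in the talk.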
Results for SEM Demand Gen: 12-20% gross profit uplift
A. Beloi, M. Charafeddine, G. Kathalagiri, A. Mishra, L. Quintela, S. Srinivasa, patent filed: “Spending Allocation in Multi-Channel Digital Marketing” (U.S. Application No. 20180047039)
[Plots: gross profit, cumulative demand spend, joint decisions, gross margin.]
Environment (EASIER → HARDER)
• Fully observable → Partially observable
• Low dimensionality to represent → High dimensionality to represent
• Time-invariant (if I conduct the experiment now or next week, it’s the same) → Time-variant. We borrow the concept of “environment coherence time tc” from digital communications to characterize how the “channel” or “environment” is changing.
• Well-behaved → Stochastic w/ fat tails
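One way to operationalize the borrowed coherence-time idea. The definition below, via the autocorrelation of a scalar environment statistic, is our assumption, mirroring how channel coherence time is estimated in communications: the largest lag at which the signal still correlates above a threshold.

```python
import math

def autocorr(x, lag):
    """Normalized autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def coherence_time(x, threshold=0.5):
    """Largest lag tc such that autocorrelation stays above threshold up to tc."""
    tc = 0
    for lag in range(1, len(x) // 2):
        if autocorr(x, lag) < threshold:
            break
        tc = lag
    return tc

# A slowly varying environment statistic has a longer coherence time than a fast one.
slow = [math.sin(2 * math.pi * t / 200) for t in range(1000)]
fast = [math.sin(2 * math.pi * t / 20) for t in range(1000)]
print(coherence_time(slow) > coherence_time(fast))
```

The practical use, following the slide: an agent's convergence time must be much shorter than the environment's coherence time, otherwise it is always chasing a channel that has already changed.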
Reward (EASIER → HARDER)
• Objective → Subjective (mostly dealing w/ humans): prone to PM/Data Scientist bias; has an ethical dimension
• Direct → Indirect w/ a lag (e.g., Marketing)
• Monolithic → Composite (e.g., Robotics: get closer → move arm → orient → pick → move → stack)
• Simple to describe → Complex, need AI (Inverse RL) to learn it (how to prune bad actors?)
Most challenging for Product Management
Actions (EASIER → HARDER)
• Discrete: small number (~< 20) → Discrete: large # (100s, 1K, ..), need hierarchical actions
• Continuous: small number (Self-Driving Cars: gas, brake, steering) → Continuous: large number (Ad Spend CPC per keyword)
• Static → Dynamic with time (new ads added, removed, ..)
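The action-space taxonomy above maps directly onto how an agent's action space is declared in code. The sketch below uses plain Python classes as hypothetical stand-ins for a Gym-style `Discrete`/`Box` spec; the example dimensions (8 light phases, 3 driving controls, 5000 keywords) echo the cases named in the talk.

```python
import random

class Discrete:
    """n mutually exclusive actions, sampled by index."""
    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

class Box:
    """dim continuous actions, each bounded by [low, high]."""
    def __init__(self, low, high, dim):
        self.low, self.high, self.dim = low, high, dim

    def sample(self):
        return [random.uniform(self.low, self.high) for _ in range(self.dim)]

traffic_lights = Discrete(8)                  # small discrete: easier
driving = Box(low=-1.0, high=1.0, dim=3)      # gas, brake, steering: small continuous
cpc_bids = Box(low=0.0, high=10.0, dim=5000)  # CPC per keyword: large continuous, harder
print(len(cpc_bids.sample()))  # 5000
```

Large or dynamic spaces like `cpc_bids` are what push a problem toward hierarchical actions or per-keyword factorization, as the slide notes.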
Exploration Cost of Learning (EASIER → HARDER)
• A high-fidelity simulator exists (in a game, Simulator = Environment) → Low-fidelity simulator or none
• Can run many parallel experiments → Only 1 experiment at a time (a marketplace that’s hard to simulate)
• $0, no impact → $$$ or humans involved (Ad Spend, Healthcare, Social media, ..)
• Fast learning episodes → Long-cycle learning episodes (wellness, marketing re-targeting)
RL in a Lab → RL in the Wild
• RL for a game: Simulator & Environment are 100% the same → There is a Simulator-Environment gap
• ~0 exploration cost → $-$$$ exploration cost
• Environment is time-invariant → Environment can be time-variant
• Direct, instant feedback → More complex: can be indirect or w/ a lag
• Unconstrained convergence time → Convergence Time << Environment Coherence Time
• Big Data → Big Data & Small Data
[Quadrant chart: Controlled Environment ↔ Wild Environment (Simulator ≠ Reality) vs. Low ↔ High Exploration Risk. Plotted domains: Healthcare, Wellness, AgTech, Trading, Manufacturing, Marketing.]
Advice to AI entrepreneurs planning their journey into the wild:
Pick your vertical wisely: it determines the macro terms that you will face.
Editor's Notes: Reinforcement Learning is an area of machine learning in which an agent learns, from interaction with an environment, what actions to take in order to optimize a long-term objective (the expectation of cumulative rewards).