Reinforcement Learning
in the Wild
&
Lessons Learned
Mohamad Charafeddine
@mohamadtweets
Director of Tech Planning, AI Team
Samsung SDS Research America
April 12th, 2018
“Theory is the first term in the Taylor series of practice” – Tom Cover, Stanford
Professor of Information Theory, in his 1990 Shannon Lecture
Practice = Theory + higher-order terms (“the wild”)
[Diagram: Theory vs. Practice curves plotted over axes of Time and Complexity]
Example: optimize “engagement” through AI personalization
• What does “engagement” mean?
• From whose point of view? User? Company? Society?
• For what time horizon? Days? Weeks? Years?
• …
• Unintended 2nd-order effect: amplification of echo chambers
• Unintended 3rd order effect: ?
• Should the AI objective function be open-source and
auditable?
• Should the AI objective function imitate/learn from
humans?
Takeaways from this talk
• Introduce Reinforcement Learning and its breadth of potential applications
• Showcase some RL examples
• Provide a framework to better evaluate RL application areas in terms of risk and design challenges from a PM point of view
I. Reinforcement Learning Intro
II. 3 RL Use Cases
III. Lessons Learned
Reinforcement Learning
[Diagram: the RL Agent interacts with the Environment. The agent takes Actions; as Inputs, it receives Rewards and the perceived State of the environment.]
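This interaction loop maps directly to code. Below is a minimal sketch (my addition, not from the talk) using the classic OpenAI Gym API, with a random policy standing in for a real agent:

```python
# Minimal sketch of the agent-environment loop from the diagram above.
# A real RL agent would choose actions from the perceived state so as to
# maximize cumulative reward; here a random policy is a placeholder.
import gym

env = gym.make("CartPole-v1")          # the Environment
state = env.reset()                    # perceived state of the environment
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()             # Agent: pick an action
    state, reward, done, info = env.step(action)   # Inputs: reward + next state
    total_reward += reward                         # cumulative reward
print("episode return:", total_reward)
```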
In the context of an Atari game, the long-term objective is the score
[Diagram: the same RL loop. Actions out; Inputs back: Rewards = the score, State of the environment = the frames on the screen.]
https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning/
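In standard RL notation (my addition; the slide only names the score), the long-term objective is the expected cumulative discounted reward that the agent maximizes:

```latex
J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right], \qquad 0 \le \gamma < 1
```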
… But just focusing on the score in the reward function can sometimes backfire!
[Diagram: the same RL loop, with Rewards = the score.]
https://blog.openai.com/faulty-reward-functions/
The boat can spin in loops collecting goodies but never finishes the race! An unintended higher-order effect.
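A toy sketch of this failure mode (illustrative names and numbers, not from the OpenAI post): the proxy reward the agent optimizes, the score, diverges from the intended objective, finishing the race.

```python
# Proxy reward vs. intended objective: a sketch of reward mis-specification.

def proxy_reward(event):
    # What the agent is actually optimized on: the in-game score.
    return {"goodie": 10, "checkpoint": 1, "finish": 50}.get(event, 0)

def intended_objective(episode_events):
    # What the designers actually wanted: finishing the race.
    return 1.0 if "finish" in episode_events else 0.0

# An agent looping through respawning goodies earns unbounded proxy reward
# while the intended objective stays at zero -- the unintended
# higher-order effect described above.
looping_episode = ["goodie"] * 1000
print(sum(proxy_reward(e) for e in looping_episode))  # 10000
print(intended_objective(looping_episode))            # 0.0
```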
Many potential use cases of RL, where the environment is: an Object (physics, etc.), a Human (preferences, …), Biology (& chemistry, …), a Market (multiple agents), Code, …
Environment: Robotics, Industrial Manufacturing, Social Content, Wellness, Healthcare, Pharmaceutical, Agriculture, Advertising, Marketing, Enterprise, Finance, Games, Security, E-commerce, Networking, another AI system, …
Deep Learning has succeeded in breaking into practice: Marketing, Sales, Customer Support, Security, Recruiting, Education, Investment, Legal, Logistics, Healthcare, Wellness, Automotive, Manufacturing, Agriculture, Personal Assistants, Speech/Image/Video Recognition, Advertising, …
Reinforcement Learning has succeeded in fewer areas: Games, Robotics, Chatbots, Manufacturing, Wellness, Automotive*, Marketing*, Customer Support*, Agriculture*, Advertising*, …
Most AI applications today:
Data → Features → Perception Models (Deep Learning) → Insights → Recommendation → Decision (human-designed rules) → Value

Future AI applications (tech + ethical challenges):
Data → Features → Perception Models (Deep Learning) → Insights → Recommendation → Decision (Reinforcement Learning) → Value
Human decision rules will slowly be replaced with AI decisions. This shift, inherent to RL applications, opens the door to questions of ethics and decision governance.
To apply RL, we need to understand the characteristics (metadata) of the problems it is best suited for.
Where do we start bringing RL into practice? What is the risk profile? And how do we build a framework to qualify application areas?
II. 3 RL Use Cases
Traffic Lights IoT Optimization
• 2 sensors on each lane, measuring the number and speed of passing cars
• 2 intersections with 4 lights each
• Goal: optimize the flow rate (# cars/sec)
Controlled Environment: Simulator
Traffic Lights IoT Optimization
[Diagram: the DRL Agent (LSTM, online learning) interacts with the Environment (the flow of cars in the intersections). Actions: 8 discrete actions. Inputs: a direct reward plus the state, a raw stream of data from 16 sensor outputs (car counter + speed).]
End-to-end: no feature engineering, just feeding the raw stream of data.
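A sketch of the kind of recurrent policy this slide implies (PyTorch is my assumption; layer sizes and names are illustrative only): the raw 16-sensor stream feeds an LSTM, and a linear head scores the 8 discrete light configurations.

```python
import torch
import torch.nn as nn

class TrafficLightPolicy(nn.Module):
    """Illustrative end-to-end policy: raw sensor stream in, action logits out."""
    def __init__(self, n_sensors=16, hidden=64, n_actions=8):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_sensors, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, sensor_stream):
        # sensor_stream: (batch, time, 16) raw readings -- no feature engineering
        out, _ = self.lstm(sensor_stream)
        return self.head(out[:, -1])   # logits over the 8 discrete actions

policy = TrafficLightPolicy()
logits = policy(torch.randn(1, 30, 16))  # 30 timesteps of raw sensor data
action = logits.argmax(dim=-1)           # greedy light configuration
```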
Results: ~ 30% reduction in total Travel Time
[Charts: results for the Uneven Flow and High Flow scenarios, showing 30%, 32%, and 39% improvements]
Theory to Practice journey:
• What happens if I want to apply this to a city?
• How to handle tens of thousands of actions?
• How to characterize convergence, robustness?
• How fast can it adapt to changes?
• …
• Cost of sensors? Use of Google Maps traffic data? Etc.
• Can Autonomous Vehicles play a role as extra control knobs?
RL for traffic management: using Autonomous Vehicles on the road with human drivers (2 lanes)
Simulation: Wu et al., IEEE T-RO, 2018
Ion Stoica, RL Systems @ RISELab, UC Berkeley; ScaledML Conference by Matroid, March 2018, Stanford Univ.
https://www.youtube.com/watch?v=-KC3tO4BDuQ
II. 3 RL Use Cases (cont.)
Applying DRL to Storage Servers to optimize operational efficiency
State: function of workload (Reads or Writes) and temperature
[Diagram: the RL Agent (Advantage Actor-Critic, A2C) interacts with the Environment (24 SSD drives). Actions: fan speeds. Reward: a function of temperatures & fan speeds. The desired operating region is marked.]
S. Srinivasa, G. Kathalagiri, J. Varanasi, L. Quintela, M. Charafeddine, C. Lee, “On Optimizing Operational Efficiency in Storage Systems via Deep Reinforcement Learning,” submitted to ECML PKDD.
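The slide only says the reward is “a function of temperatures & fan speeds”. One hedged guess at its shape (the thresholds and weights below are made up for illustration): penalize temperatures that leave the desired region, and penalize fan effort so the agent does not simply overcool.

```python
import numpy as np

T_MAX = 50.0      # hypothetical upper edge of the desired region (deg C)
FAN_WEIGHT = 0.1  # hypothetical penalty weight on fan power

def reward(temps, fan_speeds):
    # temps: temperatures of the 24 SSD drives; fan_speeds: normalized to [0, 1]
    overheat = np.maximum(temps - T_MAX, 0.0).sum()      # leaving the region
    fan_cost = FAN_WEIGHT * np.square(fan_speeds).sum()  # overcooling effort
    return -(overheat + fan_cost)

print(reward(np.full(24, 45.0), np.full(4, 0.3)))  # inside the region: small cost
```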
Learning… over a few days
[Diagram: the RL Agent sets Fan Speeds and receives a Reward and the State (different workloads, temperature).]
• Learning directly on the real environment
(no simulator)
• Model-free: does not require any knowledge
of the SSD server behavior dynamics
• Exposed to different stochastic workloads
Performance for Idle vs. Heavy I/O workloads on the operational contours
[Charts: operational contours under the status-quo controller vs. using Deep Reinforcement Learning, with the desired operational region marked.]
Performance for different workloads
[Chart: the resulting distribution from different workloads, shown against the desired operational region.]
At the beginning of training, the algorithm is exploring and learning; once it has learned the right policy, operational behavior stays within the desired region.
II. 3 RL Use Cases (cont.)
Ad Spend Optimization for lead demand generation
RL in Digital Marketing
Demand gen challenges
• Financial Services: daily lead-qualification volume constraints
• Hospitality: changing inventory (hotels, car rentals)
• Food Apps: limited supply & time-sensitivity challenges
• Retail: limited inventories and discounts; marketing bidding vs. competitors

[Diagram: gross profit vs. demand spend. Under-producing demand (Demand < Supply) means opportunity loss; over-producing demand (Demand > Supply) means overspending; maximum gross profit lies in between.]
Setup: Marketing Demand Gen Optimization
Decide the hourly Cost Per Click (CPC) for 8 Search Engine Marketing (SEM) accounts to optimize Gross Profit: the sum over 24 hours of (hourly lead-gen referral revenue − hourly customer-acquisition cost from SEM).
Reward: Gross Profit
State: previous CPC per SEM account, hour of day, day of week, …
[Diagram: the RL Agent (TRPO, importance sampling) decides every hour; feedback comes from the marketplace webpage.]
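The objective on this slide writes down directly as code. In the sketch below, get_hourly_revenue and get_hourly_acquisition_cost are hypothetical stand-ins for the marketplace feedback:

```python
def gross_profit(cpc_schedule, get_hourly_revenue, get_hourly_acquisition_cost):
    """Gross profit over one day: 24 hourly decisions, each a list of 8 CPC bids."""
    total = 0.0
    for hour, cpc_bids in enumerate(cpc_schedule):  # len(cpc_schedule) == 24
        revenue = get_hourly_revenue(hour, cpc_bids)        # lead-gen referral revenue
        cost = get_hourly_acquisition_cost(hour, cpc_bids)  # SEM acquisition cost
        total += revenue - cost
    return total
```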
Results for SEM Demand Gen: 12-20% gross profit uplift
A. Beloi, M. Charafeddine, G. Kathalagiri, A. Mishra, L. Quintela, S. Srinivasa, patent filed: “Spending Allocation in Multi-Channel Digital Marketing” (U.S. Application No. 20180047039).
[Charts: Gross Profit, Gross Margin, and Cumulative Demand Spend under joint decisions.]
III. Lessons Learned
Environment: EASIER → HARDER
• Fully observable → Partially observable
• Low dimensionality to represent → High dimensionality to represent
• Time-invariant (conducting the experiment now or next week gives the same result) → Time-variant
• Well-behaved → Stochastic with fat tails

We borrow the concept of “environment coherence time tc” from digital communications to characterize how fast the “channel”, i.e., the environment, is changing.
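One way to make the coherence-time idea concrete (an illustrative estimator, not from the talk): measure how many steps it takes for an environment statistic, e.g., mean traffic flow, to decorrelate below a threshold. The agent should converge well within this horizon.

```python
import numpy as np

def coherence_time(series, threshold=0.5):
    """Smallest lag at which the autocorrelation of `series` drops below threshold."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:                 # a constant series never decorrelates
        return len(x)
    for lag in range(1, len(x)):
        if np.dot(x[:-lag], x[lag:]) / denom < threshold:
            return lag
    return len(x)
```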
Reward: EASIER → HARDER (the most challenging dimension for Product Management)
• Objective → Subjective (mostly when dealing with humans; prone to PM/data-scientist bias; has an ethical dimension)
• Monolithic → Composite (e.g., robotics: get closer → move arm → orient → pick → move → stack)
• Direct → Indirect with a lag (e.g., marketing)
• Simple to describe → Complex; needs AI (Inverse RL) to learn it (how to prune bad actors?)
Actions: EASIER → HARDER
• Discrete, small number (~< 20) → Discrete, large number (100s, 1,000s, …); needs hierarchical actions
• Continuous, small number (self-driving cars: gas, brake, steering) → Continuous, large number (ad-spend CPC per keyword)
• Static → Dynamic over time (new ads added, removed, …)
Exploration Cost of Learning: EASIER → HARDER
• A high-fidelity simulator exists (in a game, simulator = environment) → Low-fidelity simulator or none
• Can run many parallel experiments → Only one experiment at a time (e.g., a marketplace that is hard to simulate)
• $0, no impact → $$$ or humans involved (ad spend, healthcare, social media, …)
• Fast learning episodes → Long-cycle learning episodes (wellness, marketing re-targeting)
RL in a Lab → RL in the Wild
• Simulator & environment are 100% the same (RL for a game) → There is a simulator-environment gap
• ~0 exploration cost → $-$$$ exploration cost
• Environment is time-invariant → Environment can be time-variant
• Direct, instant feedback → More complex: feedback can be indirect or lagged
• Unconstrained convergence time → Convergence time must be << environment coherence time
• Big Data → Big Data & Small Data
[Diagram: application areas plotted on two axes: Controlled Environment vs. Wild Environment (where Simulator ≠ Reality), and Low vs. High Exploration Risk. Verticals shown: Manufacturing, Marketing, Trading, AgTech, Wellness, Healthcare.]
Advice to AI entrepreneurs planning their journey into the wild:
Pick your vertical wisely. It decides the macro terms that you will face.

Editor's Notes

  • #7: RL is an area of machine learning in which an agent learns, from interaction with an environment, what actions to take in order to optimize a long-term objective (the expectation of cumulative rewards).