Copyright © 2018 Samsung SDS, Inc. All rights reserved
Reinforcement Learning
in the Wild
&
Lessons Learned
Mohamad Charafeddine
@mohamadtweets
Director of Tech Planning, AI Team
Samsung SDS Research America
April 12th, 2018
“Theory is the first term in the Taylor series of practice” – Tom Cover, Stanford
Professor of Information Theory, in his 1990 Shannon Lecture
Practice = Theory + higher order terms
“the wild”
[Chart: complexity vs. time, with “Practice” rising above “Theory”.]
Optimize “engagement” through AI personalization
• What does “engagement” mean?
• From whose point of view? User? Company? Society?
• For what time horizon? Days? Weeks? Years?
• ..
• Unintended 2nd order effect: amplification of
echo-chambers
• Unintended 3rd order effect: ?
• Should the AI objective function be open-source and
auditable?
• Should the AI objective function imitate/learn from
humans?
Takeaways from this talk
Introduce Reinforcement Learning and its breadth of potential applications
Showcase some RL examples
Provide a framework to better evaluate RL application areas in terms of risk
and design challenges from a PM point of view
I. Reinforcement Learning Intro
II. 3 RL Use Cases
III. Lessons Learned
Reinforcement Learning
[Diagram: RL Agent ↔ Environment. The agent takes actions; its inputs are rewards and the perceived state of the environment.]
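The agent-environment loop on this slide can be sketched in a few lines. Here `ToyEnv`, its dynamics, and the random action choice are hypothetical stand-ins for any concrete environment and agent; the point is the shape of the loop, not the policy.

```python
import random

class ToyEnv:
    """Hypothetical stand-in environment: reach state 3 to earn a reward."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Action 1 moves right, anything else moves left, clipped to [0, 3].
        self.state = max(0, min(3, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

env = ToyEnv()
state, total_reward, done = env.state, 0.0, False
while not done:
    action = random.choice([0, 1])          # agent picks an action
    state, reward, done = env.step(action)  # environment returns next state + reward
    total_reward += reward                  # the agent optimizes cumulative reward
print(total_reward)  # prints 1.0
```

In a real RL system the `random.choice` line is replaced by a learned policy, and `total_reward` is what the learner tries to maximize in expectation.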
In the context of an Atari game, the long-term objective is the score
[Diagram: RL Agent ↔ Environment. Actions; inputs: the state of the environment, i.e., the frames on the screen; rewards: the score.]
https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning/
.. But just focusing on the score in the reward function can sometimes backfire!
[Diagram: RL Agent ↔ Environment. Actions; inputs: the state of the environment; rewards: the score.]
https://blog.openai.com/faulty-reward-functions/
The boat can spin in loops collecting goodies but never finishes the race! – an unintended higher-order effect
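The boat-race failure mode can be reproduced with a toy calculation. The numbers below are made up for illustration, not the actual CoastRunners scores: under a score-only reward, a policy that loops over respawning pickups out-earns one that actually finishes the race.

```python
# Hypothetical per-episode accounting for a score-only reward function.
EPISODE_STEPS = 1000

def loop_policy_score():
    # Circle respawning targets worth 10 points, hitting one every 20 steps.
    return (EPISODE_STEPS // 20) * 10   # never finishes the race

def finish_policy_score():
    # Drive the track once: a finish bonus plus a few pickups along the way.
    return 200 + 5 * 10

print(loop_policy_score(), finish_policy_score())  # looping wins under this reward
```

The fix is not a smarter learner but a reward that encodes what we actually want (finishing the race), which is exactly the higher-order-effects point of this slide.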
Many potential use cases of RL, where the environment is: an object (physics, etc.), a human (preferences, ..), biology (& chemistry, ..), a market (multiple agents), code, ..
Environment
Robotics
Industrial
Manufacturing
Social
Content
Wellness
Healthcare
Pharmaceutical
Agriculture
Advertising
Marketing
Enterprise
Finance
Games
Security
E-commerce
Networking
Another AI system
..
Deep Learning has succeeded in breaking into practice:
Marketing, Sales, Customer Support, Security, Recruiting, Education, Investment, Legal, Logistics, Healthcare, Wellness, Automotive, Manufacturing, Agriculture, Personal Assistants, Speech/Image/Video recognition, Advertising, ..
Reinforcement Learning has succeeded in fewer areas:
Games, Robotics, Chatbots, Manufacturing, Wellness, Automotive*, Marketing*, Customer Support*, Agriculture*, Advertising*, ..
[Diagram: the AI value chain Data → Features → Perception Models → Insights → Recommendation → Decision → Value.
Most AI applications today: Deep Learning handles perception, while human-designed rules make the decision.
Future AI applications: Deep Learning handles perception and Reinforcement Learning makes the decision (tech + ethical challenges).]
Slowly, human decision rules will be replaced with AI decisions. Such a move, inherent to RL applications, opens the door to questions of ethics and decision governance.
To apply RL, we need to understand the characteristics (metadata) of the
problems that are most appropriate to apply it to.
Where to start to bring RL to practice? Risk profile?
And how to build a framework to qualify application areas?
Traffic Lights IoT Optimization
• 2 sensors on each lane that
measure # of cars that pass
& speed of cars
• 2 intersections with 4 lights
each
• Goal is to optimize for flow
rate (# cars/sec)
Controlled Environment: Simulator
Traffic Lights IoT Optimization
[Diagram: DRL Agent (LSTM, online learning) ↔ Environment (flow of cars in the intersections). Actions: 8 discrete actions. Inputs: a raw stream of data from 16 sensor outputs (car counter + speed). Direct reward; state of the environment.]
End-to-end: no feature engineering, just feeding the raw stream of data.
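A minimal sketch of the interface this slide describes. The environment stub, its random dynamics, and the flow-rate range are assumptions for illustration only; the real system feeds a simulator. What is faithful to the slide is the shape: a flat 16-value sensor vector in, one of 8 discrete actions out, flow rate (cars/sec) as the direct reward.

```python
import random

NUM_SENSORS = 16   # 2 sensors per lane across the lanes: car counter + speed
NUM_ACTIONS = 8    # discrete light-phase configurations for the 2 intersections

class TrafficEnvStub:
    """Hypothetical stand-in for the traffic simulator: random dynamics, correct shapes."""
    def observe(self):
        # Raw sensor stream, no feature engineering (end-to-end).
        return [random.random() for _ in range(NUM_SENSORS)]

    def step(self, action):
        assert 0 <= action < NUM_ACTIONS
        flow_rate = random.uniform(0.0, 2.0)   # cars/sec through the intersections
        return self.observe(), flow_rate       # reward is the direct flow rate

env = TrafficEnvStub()
state = env.observe()
state, reward = env.step(action=3)
print(len(state))  # 16
```

An LSTM policy, as on the slide, would consume this stream step by step and keep the recent sensor history in its hidden state rather than in hand-built features.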
Results: ~30% reduction in total travel time
[Bar charts: travel-time reductions of roughly 30-39% (e.g., 32% faster) across uneven-flow and high-flow scenarios.]
Theory to Practice journey:
• What happens if I want to apply this to a city?
• How to handle 10s of thousands of actions?
• How to characterize convergence, robustness?
• How fast can it adapt to changes?
• …
• Cost of sensors? Use of Traffic on Google Maps? Etc.
• Can Autonomous Vehicles play a role as extra control knobs?
RL for traffic management: using Autonomous Vehicles, on the road with human drivers (2 lanes)
Simulation: Wu, et al., IEEE T-RO, 2018
Ion Stoica, “RL Systems @ RISELab at UC Berkeley”, ScaledML Conference by Matroid, March 2018, Stanford Univ.
https://www.youtube.com/watch?v=-KC3tO4BDuQ
Applying DRL to Storage Servers to optimize operational efficiency
[Diagram: RL Agent (Advantage Actor-Critic, A2C) ↔ Environment (24 SSD drives). State: a function of workload (reads or writes) and temperature. Actions: fan speeds. Reward: a function of temperatures & fan speeds. A desired operation region is marked.]
S. Srinivasa, G. Kathalagiri, J. Varanasi, L. Quintela, M. Charafeddine, C. Lee, “On Optimizing Operational Efficiency in Storage Systems Via Deep Reinforcement Learning”, submitted to ECML PKDD
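One way to write a reward "as a function of temperatures & fan speeds". The slide does not give the actual function, so the shape below (penalize drive temperatures outside a desired band, plus a small fan-power penalty so the agent does not simply max out cooling) and every constant in it are assumptions for illustration.

```python
def reward(temps_c, fan_rpms, t_low=30.0, t_high=50.0, fan_weight=1e-4):
    """Hypothetical reward: 0 is best; grows more negative outside the desired region."""
    temp_penalty = sum(
        max(0.0, t_low - t) + max(0.0, t - t_high)  # distance outside [t_low, t_high]
        for t in temps_c
    )
    fan_penalty = fan_weight * sum(fan_rpms)        # discourage running fans flat out
    return -(temp_penalty + fan_penalty)

# In-region temperatures with moderate fans beat overheating with fans maxed out.
print(reward([40.0] * 24, [3000] * 24) > reward([60.0] * 24, [9000] * 24))
```

The relative weighting between the two penalties is itself a design decision: it encodes the trade-off between thermal safety and energy/acoustics that the agent will learn to navigate.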
Learning… over a few days
[Diagram: RL Agent receives the state (different workloads, temperature) and a reward, and outputs fan speeds.]
• Learning directly on the real environment
(no simulator)
• Model-free: does not require any knowledge
of the SSD server behavior dynamics
• Exposed to different stochastic workloads
Performance for idle vs. heavy I/O workloads on the operational contours
[Plots: status-quo controller vs. Deep Reinforcement Learning, with the desired operational region marked.]
Performance for different workloads
At the beginning of training, the algorithm is exploring and learning. Once it has learned the right policy, operational behavior stays within the desired region.
[Plot: resulting distribution from different workloads, with the desired operational region marked.]
RL in Digital Marketing
Ad Spend Optimization for leads demand generation
Demand gen challenges
• Financial Services: daily leads-qualification volume constraints
• Hospitality: changing inventory (hotels, car rentals)
• Food Apps: limited supply & time-sensitivity challenges
• Retail: limited # of inventories and discounts; marketing bidding vs. competitors
[Curve: gross profit vs. demand spend. Under-producing demand (demand < supply) means opportunity loss; over-producing demand (demand > supply) means over-spending; maximum gross profit lies in between.]
Setup: Marketing Demand Gen Optimization
Decide the hourly Cost Per Click for 8 Search Engine Marketing accounts to optimize Gross Profit:
sum over 24 hrs of (Hourly Lead Gen Referral Revenue - Hourly Cust. Acquisition Cost from SEM)
[Diagram: RL Agent (TRPO, importance sampling) interacts with the marketplace webpage every hour.]
State: previous CPC per SEM account, hour of day, day of week, ..
Reward: Gross Profit
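The state listed on this slide (previous CPC per SEM account, hour of day, day of week) can be encoded as a flat feature vector, and the gross-profit objective is a straightforward sum. The one-hot encoding below is a common choice for cyclical categorical features, not necessarily what the production system used; all dimensions follow from the slide's counts (8 accounts, 24 hours, 7 days).

```python
def encode_state(prev_cpc, hour, weekday):
    """prev_cpc: list of 8 previous CPCs, one per SEM account;
    hour: 0-23; weekday: 0-6. Returns an 8 + 24 + 7 = 39-dim feature vector."""
    assert len(prev_cpc) == 8
    hour_onehot = [1.0 if h == hour else 0.0 for h in range(24)]
    day_onehot = [1.0 if d == weekday else 0.0 for d in range(7)]
    return list(prev_cpc) + hour_onehot + day_onehot

def gross_profit(hourly_revenue, hourly_sem_cost):
    # Sum over 24 hrs of (hourly referral revenue - hourly cust. acquisition cost from SEM).
    return sum(r - c for r, c in zip(hourly_revenue, hourly_sem_cost))

state = encode_state([0.5] * 8, hour=14, weekday=2)
print(len(state))  # 39
```

Note the reward is indirect and lagged: the agent acts every hour, but the objective is the 24-hour sum, which is part of why this use case sits on the "harder" side of the reward taxonomy later in the talk.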
Results for SEM Demand Gen: 12-20% gross profit uplift
A. Beloi, M. Charafeddine, G. Kathalagiri, A. Mishra, L. Quintela, S. Srinivasa, patent filed: “Spending Allocation in Multi-Channel Digital Marketing” (U.S. Application No. 20180047039)
[Plots: gross profit, cumulative demand spend, joint decisions, gross margin.]
Environment (EASIER → HARDER)
• Fully observable → Partially observable
• Low dimensionality to represent → High dimensionality to represent
• Time-invariant (if I conduct the experiment now or next week, it’s the same) → Time-variant. We borrow the concept of “environment coherence time tc” from digital communications to characterize how the “channel” or “environment” is changing.
• Well-behaved → Stochastic w/ fat tails
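One way to operationalize the borrowed coherence-time idea. The definition below, via the autocorrelation of a scalar environment statistic, is our assumption, mirroring how channel coherence time is estimated in communications: the largest lag at which the signal still correlates above a threshold.

```python
import math

def autocorr(x, lag):
    """Normalized autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def coherence_time(x, threshold=0.5):
    """Largest lag tc such that autocorrelation stays above threshold up to tc."""
    tc = 0
    for lag in range(1, len(x) // 2):
        if autocorr(x, lag) < threshold:
            break
        tc = lag
    return tc

# A slowly varying environment statistic has a longer coherence time than a fast one.
slow = [math.sin(2 * math.pi * t / 200) for t in range(1000)]
fast = [math.sin(2 * math.pi * t / 20) for t in range(1000)]
print(coherence_time(slow) > coherence_time(fast))
```

The practical use, following the slide: an agent's convergence time must be much shorter than the environment's coherence time, otherwise it is always chasing a channel that has already changed.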
Reward (EASIER → HARDER)
• Objective → Subjective (mostly dealing w/ humans): prone to PM/Data Scientist bias; has an ethical dimension
• Direct → Indirect w/ a lag (e.g., Marketing)
• Monolithic → Composite (e.g., Robotics: get closer → move arm → orient → pick → move → stack)
• Simple to describe → Complex, need AI (Inverse RL) to learn it (how to prune bad actors?)
Most challenging for Product Management
Actions (EASIER → HARDER)
• Discrete: small number (~< 20) → Discrete: large # (100s, 1K, ..), need hierarchical actions
• Continuous: small number (Self-Driving Cars: gas, brake, steering) → Continuous: large number (Ad Spend CPC per keyword)
• Static → Dynamic with time (new ads added, removed, ..)
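The action-space taxonomy above maps directly onto how an agent's action space is declared in code. The sketch below uses plain Python classes as hypothetical stand-ins for a Gym-style `Discrete`/`Box` spec; the example dimensions (8 light phases, 3 driving controls, 5000 keywords) echo the cases named in the talk.

```python
import random

class Discrete:
    """n mutually exclusive actions, sampled by index."""
    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

class Box:
    """dim continuous actions, each bounded by [low, high]."""
    def __init__(self, low, high, dim):
        self.low, self.high, self.dim = low, high, dim

    def sample(self):
        return [random.uniform(self.low, self.high) for _ in range(self.dim)]

traffic_lights = Discrete(8)                  # small discrete: easier
driving = Box(low=-1.0, high=1.0, dim=3)      # gas, brake, steering: small continuous
cpc_bids = Box(low=0.0, high=10.0, dim=5000)  # CPC per keyword: large continuous, harder
print(len(cpc_bids.sample()))  # 5000
```

Large or dynamic spaces like `cpc_bids` are what push a problem toward hierarchical actions or per-keyword factorization, as the slide notes.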
Exploration Cost of Learning (EASIER → HARDER)
• A high-fidelity simulator exists (in a game, Simulator = Environment) → Low-fidelity simulator or none
• Can run many parallel experiments → Only 1 experiment at a time (a marketplace that’s hard to simulate)
• $0, no impact → $$$ or humans involved (Ad Spend, Healthcare, Social media, ..)
• Fast learning episodes → Long-cycle learning episodes (wellness, marketing re-targeting)
RL in a Lab → RL in the Wild
• RL for a game: Simulator & Environment are 100% the same → There is a Simulator-Environment gap
• ~0 exploration cost → $-$$$ exploration cost
• Environment is time-invariant → Environment can be time-variant
• Direct, instant feedback → More complex: can be indirect or w/ a lag
• Unconstrained convergence time → Convergence Time << Environment Coherence Time
• Big Data → Big Data & Small Data
[Quadrant chart: Controlled Environment ↔ Wild Environment (Simulator ≠ Reality) vs. Low ↔ High Exploration Risk. Plotted domains: Healthcare, Wellness, AgTech, Trading, Manufacturing, Marketing.]
Advice to AI entrepreneurs planning their journey into the wild:
Pick your vertical wisely: it determines the macro terms that you will face.
Editor's Notes: Reinforcement Learning is an area of machine learning in which an agent learns, from interaction with an environment, what actions to take in order to optimize a long-term objective (the expectation of cumulative rewards).