A Journey to Reinforcement Learning
fangkuoyu@gmail.com
12/30/2022
2
The Treasure Map
MuZero
Alpha Zero
Gym
Gym
3
Atari Games
Pong, Breakout, Phoenix
https://www.gymlibrary.dev/
https://gymnasium.farama.org/
4
Reinforcement Learning Framework
ENVIRONMENT
AGENT
State Action Reward
(s1 → a1 → r1)→ (s2 → a2 → r2)→ (s3 → a3 → r3)→ …
Learning to Make Decisions for Maximizing Long-Term Rewards
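"Long-term rewards" is commonly formalized as a discounted return over the (s, a, r) sequence. A minimal sketch, assuming a discount factor gamma and reward values that are not given on the slide:

```python
def discounted_return(rewards, gamma=0.99):
    """Return r1 + gamma*r2 + gamma^2*r3 + ... for one trajectory."""
    total = 0.0
    # Walk backwards so each step computes G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards r1, r2, r3 from three (s, a, r) steps.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

A gamma close to 1 weights distant rewards almost as much as immediate ones; a small gamma makes the agent short-sighted.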
5
Atari Breakout in OpenAI Gym
import gym
env = gym.make("ALE/Breakout-v5", render_mode="human")
state, info = env.reset()
for index in range(1000):
    action = env.action_space.sample() # action by random or policy
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, info = env.reset()
env.close()
https://www.gymlibrary.dev/
https://gymnasium.farama.org/
6
State/Action/Reward in Atari Breakout
State:
● (210, 160, 3) - image
Action:
● 0 - NO OP
● 1 - FIRE
● 2 - RIGHT
● 3 - LEFT
Reward:
● Red - 7 points
● Orange - 7 points
● Yellow - 4 points
● Green - 4 points
● Aqua - 1 point
● Blue - 1 point
https://www.gymlibrary.dev/
https://gymnasium.farama.org/
7
How Well Can Reinforcement Learning Do?
Artificial Intelligence and the Future - Demis Hassabis/DeepMind
https://youtu.be/zYII3AOSgo8?t=2236
8
From One Game to All The Games in Atari
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
9
One Journey to General Artificial Intelligence
https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
DQN/2015
R2D2/2019
NGU/2019
Agent57/2020
10
OpenAI Gym Taxi-v3 : State/Action/Reward
State:
● Number of variables : 1
● Range of variable : [0, 499] (500 discrete states)
● 25 taxi positions x 5 passenger positions x 4 destination locations
Action:
● 0 : move south
● 1 : move north
● 2 : move east
● 3 : move west
● 4 : pick up passenger
● 5 : drop off passenger
Reward:
● -1 : per step unless another reward is triggered
● +20 : delivering the passenger
● -10 : illegal pickup or drop-off
https://www.gymlibrary.dev/environments/toy_text/taxi/
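The single state variable packs the taxi position, passenger position, and destination into one integer. A sketch of such a mixed-radix encoding follows; the function names are illustrative, and the component order is an assumption that should be checked against the Taxi environment source:

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into one integer in [0, 499]."""
    # 5 rows x 5 columns x 5 passenger positions x 4 destinations = 500 states
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger_loc
    index = index * 4 + destination
    return index


def decode(index):
    """Invert encode(): recover (taxi_row, taxi_col, passenger_loc, destination)."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination


state = encode(3, 1, 2, 0)
print(state, decode(state))  # 328 (3, 1, 2, 0)
```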
11
OpenAI Gym Taxi-v3 : Q Table
(500 x 6)
https://www.gocoder.one/blog/rl-tutorial-with-openai-gym
12
Q Learning (with epsilon greedy policy)
1. initialize Q table
2. state
3. exploitation
4. exploration
5. action
6. next state
7. reward
8. update Q table
https://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/
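The eight steps can be sketched as tabular Q-learning on a toy problem. Everything below is an assumption made for illustration: a hypothetical 5-state chain stands in for Taxi-v3, and the learning rate, discount, and epsilon values are arbitrary.

```python
import random

N_STATES, N_ACTIONS = 5, 2            # hypothetical chain: states 0..4, actions 0=left, 1=right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # arbitrary learning rate, discount, exploration rate


def step(state, action):
    """Toy environment: reaching the right end (state 4) pays +1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]        # 1. initialize Q table
    for _ in range(episodes):
        state, done = 0, False                              # 2. state
        while not done:
            if rng.random() < EPSILON:
                action = rng.randrange(N_ACTIONS)           # 4. exploration
            else:
                best = max(q[state])                        # 3. exploitation, ties broken at random
                action = rng.choice([a for a in range(N_ACTIONS) if q[state][a] == best])
            next_state, reward, done = step(state, action)  # 5-7. action, next state, reward
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])  # 8. update Q table
            state = next_state
    return q


q_table = train()
# Greedy action for each non-terminal state.
print([row.index(max(row)) for row in q_table[:-1]])
```

For Taxi-v3 the same loop would use a 500 x 6 table, with the environment's step function in place of the toy one.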
13
Deep Q Network (DQN) Architecture (1/2)
Ref : Human-level control through deep reinforcement learning
14
Deep Q Network (DQN) Architecture (2/2)
Ref : Massively Parallel Methods for Deep Reinforcement Learning
15
Deep Q Learning (with experience replay and dual networks)
1. initialize replay memory
2. initialize main network
3. initialize target network
4. epsilon greedy policy from main network
5. store transition in replay memory
6. get batch from replay memory
7. calculate error between two networks
8. synchronize two networks
Ref : Human-level control through deep reinforcement learning
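The bookkeeping around the replay memory and the two networks can be sketched without a deep learning library. In this sketch, small lookup tables stand in for the main and target networks, and the transitions are hypothetical; it shows only steps 1-3 and 5-8, not the gradient update or the epsilon greedy policy.

```python
import copy
import random
from collections import deque

GAMMA = 0.99  # assumed discount factor


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)


def td_errors(main_q, target_q, batch):
    """Step 7: gap between the main estimate and the target-network bootstrap."""
    errors = []
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else GAMMA * max(target_q[next_state])
        errors.append(reward + bootstrap - main_q[state][action])
    return errors


# Stand-ins for the two networks: tables indexed as q[state][action].
main_q = [[0.0, 0.0], [0.0, 1.0]]    # step 2
target_q = copy.deepcopy(main_q)     # step 3: target starts as a copy of main

memory = ReplayMemory(capacity=100)  # step 1
memory.store((0, 1, 0.5, 1, False))  # step 5 (hypothetical transitions)
memory.store((1, 1, 1.0, 0, True))
batch = memory.sample(2, random.Random(0))  # step 6
errors = td_errors(main_q, target_q, batch)
print(sorted(errors))
target_q = copy.deepcopy(main_q)     # step 8: synchronize after some updates
```

Keeping the target table frozen between synchronizations is what stabilizes the bootstrap term; in a real DQN the errors would drive a gradient step on the main network.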
16
Four Tough Games in Atari
Pitfall, Solaris, Skiing, Montezuma’s Revenge
Problems : long-term credit assignment and exploitation/exploration tradeoff
Solutions : intrinsic motivation, meta-controller, short-term/episodic memory, distributed agents, etc.
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
17
Distributed Reinforcement Learning
Agent57
Gorila
https://arxiv.org/abs/2003.13350
https://arxiv.org/abs/1507.04296
18
How Well Can Agent57 Do?
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
19
Reinforcement Learning at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/
20
Mastering Go at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/
21
Another Journey to General Artificial Intelligence
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
https://www.youtube.com/watch?v=lVMgxtm5L-U
22
AlphaGo Fan/Lee/Master
● European Go Champion Fan Hui : 5:0
● South Korean professional Go player Lee Sedol : 4:1
● 60 players from China, Korea, Japan : 60:0
● Chinese professional Go player Ke Jie : 3:0
https://www.youtube.com/watch?v=HT-UZkiOLv8
23
AlphaGo Zero Training Process
Self-Play
Train Value Network
Train Policy Network
https://www.youtube.com/watch?v=mWHK27pXjqo
24
AlphaGo Zero Performance Benchmark
https://thirdeyedata.ai/how-to-build-your-own-alphazero-ai-using-python-and-keras/
25
MuZero Training Process
h: representation
f: prediction
g: dynamics
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model
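How h, g, and f fit together can be sketched with toy stand-ins. The function bodies below are arbitrary placeholders, not the learned networks from the paper, and the tree search that MuZero runs on top of them is omitted.

```python
def h(observation):
    """Representation: observation -> hidden state (here just a number)."""
    return float(sum(observation))


def g(hidden_state, action):
    """Dynamics: (hidden state, action) -> (predicted reward, next hidden state)."""
    next_state = 0.5 * hidden_state + action
    reward = 1.0 if action == 1 else 0.0
    return reward, next_state


def f(hidden_state):
    """Prediction: hidden state -> (policy over 2 actions, value)."""
    p_right = 0.75 if hidden_state > 1.0 else 0.25
    return [1.0 - p_right, p_right], hidden_state


def unroll(observation, actions):
    """Apply h once, then alternate f and g along a hypothetical action sequence."""
    state = h(observation)
    trace = []
    for action in actions:
        policy, value = f(state)          # predict from the current hidden state
        reward, state = g(state, action)  # imagine the next hidden state
        trace.append((policy, value, reward))
    return trace


for policy, value, reward in unroll([1, 1], actions=[1, 0]):
    print(policy, value, reward)
```

The point of the structure is that after the single call to h, planning happens entirely in hidden-state space: g never needs the real environment or its rules.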
26
MuZero Performance Benchmark
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model
27
Exploring The Treasure Map ...
MuZero
Alpha Zero
Gym
Gym
28
Beyond the Treasure Map ...
MuZero
Alpha Zero
Gym
Gym
AlphaStar
AlphaFold
AlphaTensor
Other Domains, e.g., Mobile/Wireless Communication
A Journey to Reinforcement Learning
Q & A

