OpenAI Retro Contest
My solutions for fast learner of Sonic
Kiyonari Harigae
Agenda
• Introduction
• Problem Overview
• Domain Adaptation
• Reinforcement learning
• Evaluation
• Results
• Discussion
• Implementation Detail
• Hyper-Parameters
• Appendix
• Reference
Introduction
• OpenAI Retro Contest
This contest focuses on the transfer performance of reinforcement
learning. I aimed to build the "fast learner" described in the contest
details.
My approach is few-shot learning: an agent is trained on one level of the
training set and is then expected to reach high performance on the test levels.
Problem Overview
• Problem Formulation and Assumptions
We formalize the transfer problem in a general way by considering a
source domain and a target domain, denoted D_S and D_T, each of which
corresponds to a Markov decision process (MDP).
The state spaces (raw pixels) of the source and target domains are
completely different, but the two domains share an action space. The state
transition and reward functions are structurally similar.
D_S = (S_S, A_S, T_S, R_S) and D_T = (S_T, A_T, T_T, R_T)
S_S ≠ S_T, A_S = A_T, T_S ≈ T_T, R_S ≈ R_T
D_S ∈ M
where M is the set of all natural world MDPs.
Domain Adaptation
• Representation learning with Stacked AE
The behavior of an RL agent is defined by a policy π : S -> A, which
specifies the action to apply in each state s ∈ S. The agent takes the state
(raw pixels) as input and decides what to do, so it needs to learn a policy π
on the source domain that generalizes.
I judged an approach like DARLA [1] to be appropriate for this problem.
It encodes the observations received from the environment into a general
representation and then uses that representation to learn a robust
policy capable of domain adaptation.
As in the original, I implemented two steps: the 1st step is a DAE
(Denoising AutoEncoder) and the 2nd step is a VAE (Variational
AutoEncoder) [2]. The differences from the original are that fine-tuning is
allowed in the 2nd step and that β = 1 (so it is no longer a β-VAE [3]).
Implementation details are given below.
Reinforcement learning
• PPO (Proximal Policy Optimization)
The PPO [4] architecture is the same as the baseline, except that the
convolutional part of the network, which is shared between the policy net
and the value net, is replaced with the pre-trained encoder from the 2nd
step (VAE) of the Stacked AE.
Observations
Observations are resized to 78x78x3 (HxWxC).
Actions
Seven button combinations:
{{LEFT}, {RIGHT}, {LEFT, DOWN}, {RIGHT, DOWN}, {DOWN}, {DOWN, B}, {B}}
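The button combinations above can be turned into the emulator's button array roughly as follows. This is a minimal sketch, not the contest code: the 12-button order in `BUTTONS` is an assumption taken from the retro-baselines Sonic wrapper, and `to_button_array` is a hypothetical helper name.

```python
# Assumed Genesis button order (as used by the retro-baselines Sonic wrapper).
BUTTONS = ["B", "A", "MODE", "START", "UP", "DOWN",
           "LEFT", "RIGHT", "C", "Y", "X", "Z"]

# The button combinations listed on the slide, one per discrete action.
ACTIONS = [["LEFT"], ["RIGHT"], ["LEFT", "DOWN"], ["RIGHT", "DOWN"],
           ["DOWN"], ["DOWN", "B"], ["B"]]


def to_button_array(action_index):
    """Convert a discrete action index into a list of pressed flags."""
    pressed = [False] * len(BUTTONS)
    for name in ACTIONS[action_index]:
        pressed[BUTTONS.index(name)] = True
    return pressed
```

With this mapping the policy only has to choose among `len(ACTIONS)` discrete actions, which matches the size of the final policy layer.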
Rewards
The reward is the horizontal offset from the player’s initial position, but
backtracking is allowed (moving backwards gives no negative reward).
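One common way to allow backtracking without negative rewards is to pay out only new rightward progress. The sketch below assumes the per-step change in horizontal position is available as `dx`; the class name `BacktrackingReward` is hypothetical and the slides do not say this exact scheme was used.

```python
class BacktrackingReward:
    """Reward only new rightward progress; moving back costs nothing."""

    def __init__(self):
        self.cur_x = 0  # current horizontal offset from the start
        self.max_x = 0  # furthest offset reached so far

    def step(self, dx):
        """Take the per-step change in x and return the shaped reward."""
        self.cur_x += dx
        reward = max(0, self.cur_x - self.max_x)
        self.max_x = max(self.max_x, self.cur_x)
        return reward


tracker = BacktrackingReward()
rewards = [tracker.step(dx) for dx in [5, -3, 2, 4]]
print(rewards)  # [5, 0, 0, 3]
```

The rewards sum to the furthest offset reached (8 here), so the agent is never penalized for stepping back to get around an obstacle.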
Evaluation
• Training and Evaluation procedure
Gathering images
Images (frames) were collected from the final runs of agents trained on
each level of the training set and were used to train the Stacked VAE.
Pre-Training
The 1st step and 2nd step of the Stacked VAE were trained on the images
collected above.
Reinforcement learning on one level of the training set
The agent was trained for 1e6 time steps on GreenHillZone.Act1, a simple
level from which the essence of the game can be learned. The convolutional
part of the policy (the pre-trained encoder of the Stacked VAE) was allowed
to be fine-tuned.
Evaluating transfer performance
At evaluation time, the agent trained above played each test level for
1e6 time steps.
For comparison, the (Sonic) Baseline PPO [5] was also evaluated.
Results
State                    Baseline PPO (Scratch)   Baseline PPO (Trained)   Fast Learner PPO (Stacked VAE)
AngelIslandZone Act2     1246.0                   1377.8                   1993.2
CasinoNightZone Act2     3302.9                   3193.6                   3684.5
FlyingBatteryZone Act2   996.0                    1047.4                   871.1
GreenHillZone Act2       2434.1                   4846.3                   4780.0
HillTopZone Act2         2227.7                   2876.7                   2664.4
HydrocityZone Act1       1966.7                   713.8                    1538.3
LavaReefZone Act1        1666.7                   739.5                    2891.0
MetropolisZone Act3      1395.6                   1438.8                   1566.2
ScrapBrainZone Act1      1080.8                   1272.6                   1281.3
SpringYardZone Act1      1595.7                   1261.7                   1272.2
StarLightZone Act3       2604.4                   2627.0                   2684.8
Average                  1865.1                   1945.0                   2293.3
Baseline PPO (Scratch) is the zero-shot case, Baseline PPO (Trained) was trained on one
level of the training set (GreenHillZone.Act1) for 1e6 time steps, and Fast Learner PPO
is the agent under evaluation.
Baseline PPO (Trained) seems to be slightly overfit, while Fast Learner PPO
(Stacked VAE) seems to generalize.
Results
[Figure: learning curve. Score (0 to 3500) versus time steps for Baseline PPO (Scratch), Baseline PPO (Trained), and Fast Learner PPO (Stacked VAE).]
Submission Results
Task      Baseline PPO (Scratch)   Baseline PPO (Trained)   Fast Learner PPO (Stacked VAE)
#1        907.16                   6492.72                  8071.51
#2        3417.66                  2477.81                  2629.08
#3        2690.13                  2266.38                  3166.67
#4        1642.06                  1915.59                  2012.49
#5        1786.79                  2048.23                  2979.00
Average   2088.76                  3040.15                  3771.75
For the submission, the training procedure was the same as for the evaluation,
except that images collected from the test set were also added to the Stacked
VAE training data. The submitted agent was trained for 1e6 time steps on the
same level as in the evaluation.
On average, this is an improvement of approximately 80% over Scratch PPO and
approximately 24% over Trained PPO.
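The two percentages can be checked directly from the average row of the submission table. The snippet below is only a worked calculation; the dictionary keys and the function name are illustrative.

```python
# Average submission scores copied from the table above.
averages = {
    "scratch": 2088.76,
    "trained": 3040.15,
    "fast_learner": 3771.75,
}


def relative_improvement(new, old):
    """Return the improvement of `new` over `old` as a percentage."""
    return (new - old) / old * 100.0


vs_scratch = relative_improvement(averages["fast_learner"], averages["scratch"])
vs_trained = relative_improvement(averages["fast_learner"], averages["trained"])
print(f"vs Scratch PPO: {vs_scratch:.1f}%")  # 80.6%
print(f"vs Trained PPO: {vs_trained:.1f}%")  # 24.1%
```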
Discussion
By learning a good representation from the training set, the agent quickly
learned the essence of the game, and I confirmed that transfer to the test
levels is feasible.
However, I would like to investigate the following issues as future work.
・Overfitting to the source domain
As described in the DARLA paper, allowing the convolutional layers to be fine-tuned
while learning the source policy can speed up learning on the source domain, but it
introduces a risk of overfitting.
A validation strategy for obtaining a robust policy needs to be considered.
・Learning good feature representations
This time, collecting images proved difficult, and the learned representation seems
to have been insufficient (for example, for obstacles or character motion).
It is necessary to collect images from more varied situations and to improve
generalization through image augmentation and similar techniques.
Implementation Detail
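[Figure: architecture diagram. 1st step: DAE encoder and DAE decoder (J). 2nd step: VAE encoder (outputs μ, Σ and latent z) and VAE decoder, reconstructing x as x̂. RL: policy π_S(a | s^z; θ) on the encoded state s^z of observation x.]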
$$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)} \left\| J(\hat{x}) - J(x) \right\|_2^2 - D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right)$$
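The two terms of the objective can be sketched for a single sample as below. This is an illustration under stated assumptions, not the contest implementation: the posterior is taken to be a diagonal Gaussian and the prior p(z) a standard normal, the DAE features J(x̂) and J(x) are passed in as flat lists, and the result is written in the usual minimization form (reconstruction error plus β times KL). All function names are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared L2 distance between two equal-length feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def vae_loss(j_x_hat, j_x, mu, log_var, beta=1.0):
    """Loss to minimize: perceptual reconstruction error plus beta * KL."""
    return squared_l2(j_x_hat, j_x) + beta * kl_to_standard_normal(mu, log_var)
```

With `beta=1.0`, as used here, the second term is the plain VAE regularizer; larger values would give the β-VAE weighting of the original DARLA setup.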
Hyper-parameters
Algorithm                 Parameter       Value
Denoising AutoEncoder     Architecture    Kernel size 4, stride 2; encoder layers {32r-32r-64r-64r}; latent dim 100l; decoder layers {64r-64r-32r-32r}; Adam optimizer
                          Noise factor    1.5
                          Learning rate   1e-3
Variational AutoEncoder   Architecture    Kernel size 4, stride 2; encoder layers {32r-32r-64r-64r}; latent dim 200 (100 independent Gaussian distributions); decoder layers {64r-64r-32r-32r}; Adam optimizer
                          Learning rate   1e-4
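To see what these encoder settings imply for the 78x78 observations, the spatial size after each convolution can be computed as follows. This is a sketch assuming no padding ("valid" convolutions), which the slides do not state; the function names are illustrative.

```python
def conv_output_size(size, kernel=4, stride=2, padding=0):
    """Spatial output size of one convolution along a single dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_feature_sizes(input_size=78, num_layers=4):
    """Spatial size after each of the stacked encoder convolutions."""
    sizes = []
    size = input_size
    for _ in range(num_layers):
        size = conv_output_size(size)
        sizes.append(size)
    return sizes


print(encoder_feature_sizes())  # [38, 18, 8, 3]
```

Under this assumption a 78x78 input divides evenly at every layer, and the last layer with 64 channels would flatten to 3 * 3 * 64 = 576 features before the latent layer.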
Hyper-parameters
Algorithm   Parameter                Value
PPO         Architecture             The convolutional part of the policy net is the VAE encoder; its output s^z is fed to the policy layers {200l-512l-7l}
            Epochs                   4
            Minibatch size           8
            Discount (γ)             0.99
            GAE parameter (λ)        0.95
            Clipping parameter (ε)   0.2
            Entropy coefficient      0.001
            Reward scale             0.005
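As a reminder of what the clipping parameter controls, the PPO clipped surrogate objective [4] can be sketched for a small batch as below. This is a plain-Python illustration, not the baseline's implementation; `ratios` stands for the new-to-old policy probability ratios and `advantages` for the advantage estimates, and the function name is hypothetical.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean PPO clipped objective over a batch (to be maximized)."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Keep the probability ratio inside [1 - epsilon, 1 + epsilon].
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)
```

For example, with ε = 0.2 a ratio of 1.5 on a positive advantage is credited only as 1.2, so a single update cannot move the policy far from the one that collected the data.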
Appendix 1
• Latent space linear interpolation
A representation shared between levels appears to have been acquired.
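Linear interpolation in latent space can be sketched as follows. The snippet assumes `z_start` and `z_end` are latent vectors obtained by encoding one frame from each of two levels; decoding each intermediate vector with the VAE decoder (not shown) would give the interpolated images. The function name is illustrative.

```python
def interpolate_latents(z_start, z_end, num_steps=8):
    """Return num_steps latent vectors evenly spaced from z_start to z_end.

    num_steps must be at least 2 so that both endpoints are included.
    """
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```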
References
[1] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick,
C. Blundell, and A. Lerchner, “DARLA: Improving zero-shot transfer in reinforcement
learning,” 2017. arXiv:1707.08475.
[2] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in ICLR, 2014.
[3] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed,
and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained
variational framework,” in ICLR, 2017.
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy
optimization algorithms,” 2017. arXiv:1707.06347.
[5] https://github.com/openai/retro-baselines

More Related Content

What's hot

Jpeg image compression using discrete cosine transform a survey
Jpeg image compression using discrete cosine transform   a surveyJpeg image compression using discrete cosine transform   a survey
Jpeg image compression using discrete cosine transform a surveyIJCSES Journal
 
Multimedia basic video compression techniques
Multimedia basic video compression techniquesMultimedia basic video compression techniques
Multimedia basic video compression techniquesMazin Alwaaly
 
//STEIM Workshop: A Vernacular of File Formats
//STEIM Workshop: A Vernacular of File Formats//STEIM Workshop: A Vernacular of File Formats
//STEIM Workshop: A Vernacular of File FormatsRosa ɯǝukɯɐn
 
Multimedia lossy compression algorithms
Multimedia lossy compression algorithmsMultimedia lossy compression algorithms
Multimedia lossy compression algorithmsMazin Alwaaly
 
Multimedia communication jpeg
Multimedia communication jpegMultimedia communication jpeg
Multimedia communication jpegDr. Kapil Gupta
 
Compression: Video Compression (MPEG and others)
Compression: Video Compression (MPEG and others)Compression: Video Compression (MPEG and others)
Compression: Video Compression (MPEG and others)danishrafiq
 
Deep learning and its application
Deep learning and its applicationDeep learning and its application
Deep learning and its applicationSrishty Saha
 
What Makes Training Multi-modal Classification Networks Hard? ppt
What Makes Training Multi-modal Classification Networks Hard? pptWhat Makes Training Multi-modal Classification Networks Hard? ppt
What Makes Training Multi-modal Classification Networks Hard? ppttaeseon ryu
 
Next generation image compression standards: JPEG XR and AIC
Next generation image compression standards: JPEG XR and AICNext generation image compression standards: JPEG XR and AIC
Next generation image compression standards: JPEG XR and AICTouradj Ebrahimi
 
Survey of Super Resolution Task (SISR Only)
Survey of Super Resolution Task (SISR Only)Survey of Super Resolution Task (SISR Only)
Survey of Super Resolution Task (SISR Only)MYEONGGYU LEE
 
[Paper] learning video representations from correspondence proposals
[Paper]  learning video representations from correspondence proposals[Paper]  learning video representations from correspondence proposals
[Paper] learning video representations from correspondence proposalsSusang Kim
 
Design of Image Compression Algorithm using MATLAB
Design of Image Compression Algorithm using MATLABDesign of Image Compression Algorithm using MATLAB
Design of Image Compression Algorithm using MATLABIJEEE
 
Mining Regional Knowledge in Spatial Dataset
Mining Regional Knowledge in Spatial DatasetMining Regional Knowledge in Spatial Dataset
Mining Regional Knowledge in Spatial Datasetbutest
 
Dct,gibbs phen,oversampled adc,polyphase decomposition
Dct,gibbs phen,oversampled adc,polyphase decompositionDct,gibbs phen,oversampled adc,polyphase decomposition
Dct,gibbs phen,oversampled adc,polyphase decompositionMuhammad Younas
 

What's hot (20)

Jpeg image compression using discrete cosine transform a survey
Jpeg image compression using discrete cosine transform   a surveyJpeg image compression using discrete cosine transform   a survey
Jpeg image compression using discrete cosine transform a survey
 
Multimedia basic video compression techniques
Multimedia basic video compression techniquesMultimedia basic video compression techniques
Multimedia basic video compression techniques
 
Lect5 v2
Lect5 v2Lect5 v2
Lect5 v2
 
//STEIM Workshop: A Vernacular of File Formats
//STEIM Workshop: A Vernacular of File Formats//STEIM Workshop: A Vernacular of File Formats
//STEIM Workshop: A Vernacular of File Formats
 
Multimedia lossy compression algorithms
Multimedia lossy compression algorithmsMultimedia lossy compression algorithms
Multimedia lossy compression algorithms
 
Multimedia communication jpeg
Multimedia communication jpegMultimedia communication jpeg
Multimedia communication jpeg
 
Style gan hdw
Style gan hdwStyle gan hdw
Style gan hdw
 
Compression: Video Compression (MPEG and others)
Compression: Video Compression (MPEG and others)Compression: Video Compression (MPEG and others)
Compression: Video Compression (MPEG and others)
 
Deep learning and its application
Deep learning and its applicationDeep learning and its application
Deep learning and its application
 
What Makes Training Multi-modal Classification Networks Hard? ppt
What Makes Training Multi-modal Classification Networks Hard? pptWhat Makes Training Multi-modal Classification Networks Hard? ppt
What Makes Training Multi-modal Classification Networks Hard? ppt
 
Next generation image compression standards: JPEG XR and AIC
Next generation image compression standards: JPEG XR and AICNext generation image compression standards: JPEG XR and AIC
Next generation image compression standards: JPEG XR and AIC
 
Unit ii
Unit iiUnit ii
Unit ii
 
Survey of Super Resolution Task (SISR Only)
Survey of Super Resolution Task (SISR Only)Survey of Super Resolution Task (SISR Only)
Survey of Super Resolution Task (SISR Only)
 
[Paper] learning video representations from correspondence proposals
[Paper]  learning video representations from correspondence proposals[Paper]  learning video representations from correspondence proposals
[Paper] learning video representations from correspondence proposals
 
Oc2423022305
Oc2423022305Oc2423022305
Oc2423022305
 
JPEG
JPEGJPEG
JPEG
 
Design of Image Compression Algorithm using MATLAB
Design of Image Compression Algorithm using MATLABDesign of Image Compression Algorithm using MATLAB
Design of Image Compression Algorithm using MATLAB
 
Image transforms
Image transformsImage transforms
Image transforms
 
Mining Regional Knowledge in Spatial Dataset
Mining Regional Knowledge in Spatial DatasetMining Regional Knowledge in Spatial Dataset
Mining Regional Knowledge in Spatial Dataset
 
Dct,gibbs phen,oversampled adc,polyphase decomposition
Dct,gibbs phen,oversampled adc,polyphase decompositionDct,gibbs phen,oversampled adc,polyphase decomposition
Dct,gibbs phen,oversampled adc,polyphase decomposition
 

Similar to OpenAI Retro Contest

Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersVasileiosMezaris
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksBen Ball
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcscpconf
 
Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructioncsandit
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcsandit
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer VisionSungjoon Choi
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsYoung-Geun Choi
 
CyberSec_JPEGcompressionForensics.pdf
CyberSec_JPEGcompressionForensics.pdfCyberSec_JPEGcompressionForensics.pdf
CyberSec_JPEGcompressionForensics.pdfMohammadAzreeYahaya
 
Trackster Pruning at the CMS High-Granularity Calorimeter
Trackster Pruning at the CMS High-Granularity CalorimeterTrackster Pruning at the CMS High-Granularity Calorimeter
Trackster Pruning at the CMS High-Granularity CalorimeterYousef Fadila
 
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon TransformHuman Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon TransformFadwa Fouad
 
Salt Identification Challenge
Salt Identification ChallengeSalt Identification Challenge
Salt Identification Challengekenluck2001
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 
Image Steganography Using Wavelet Transform And Genetic Algorithm
Image Steganography Using Wavelet Transform And Genetic AlgorithmImage Steganography Using Wavelet Transform And Genetic Algorithm
Image Steganography Using Wavelet Transform And Genetic AlgorithmAM Publications
 

Similar to OpenAI Retro Contest (20)

Explaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiersExplaining the decisions of image/video classifiers
Explaining the decisions of image/video classifiers
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 
Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstruction
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
Steganography Part 2
Steganography Part 2Steganography Part 2
Steganography Part 2
 
RLTopics_2021_Lect1.pdf
RLTopics_2021_Lect1.pdfRLTopics_2021_Lect1.pdf
RLTopics_2021_Lect1.pdf
 
Lesson 39
Lesson 39Lesson 39
Lesson 39
 
AI Lesson 39
AI Lesson 39AI Lesson 39
AI Lesson 39
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
 
LSTM Structured Pruning
LSTM Structured PruningLSTM Structured Pruning
LSTM Structured Pruning
 
CyberSec_JPEGcompressionForensics.pdf
CyberSec_JPEGcompressionForensics.pdfCyberSec_JPEGcompressionForensics.pdf
CyberSec_JPEGcompressionForensics.pdf
 
Trackster Pruning at the CMS High-Granularity Calorimeter
Trackster Pruning at the CMS High-Granularity CalorimeterTrackster Pruning at the CMS High-Granularity Calorimeter
Trackster Pruning at the CMS High-Granularity Calorimeter
 
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon TransformHuman Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
 
Salt Identification Challenge
Salt Identification ChallengeSalt Identification Challenge
Salt Identification Challenge
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
2021 05-04-u2-net
2021 05-04-u2-net2021 05-04-u2-net
2021 05-04-u2-net
 
[ppt]
[ppt][ppt]
[ppt]
 
Image Steganography Using Wavelet Transform And Genetic Algorithm
Image Steganography Using Wavelet Transform And Genetic AlgorithmImage Steganography Using Wavelet Transform And Genetic Algorithm
Image Steganography Using Wavelet Transform And Genetic Algorithm
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

OpenAI Retro Contest

  • 1. OpenAI Retro Contest My solutions for fast learner of Sonic Kiyonari Harigae
  • 2. Agenda  Introduction  Problem Overview  Domain Adaptation  Reinforcement learning  Evaluation  Results  Discussion  Implementation Detail  Hyper-Parameters  Appendix  Reference
  • 3. Introduction • Open AI Retro Contest This contest focuses on the transfer performance of reinforcement learning. I aimed to make "Fast Learner" described in the details of the contest. My approach is Few-Shot Learning. An agent learned at one Level of the training data set and aimed at achieving high performance at test Levels.
  • 4. Problem Overview • Problem Formulation and Assumptions We formalize our transfer problem in a general way by considering a source domain and a target domain, denoted DS and DT , which each correspond to Markov decision processes (MDPs) The state spaces (raw pixels) of the source and the target domains completely different, however, share action spaces. Finally, state transition and reward functions share structural similarity. DS = (SS,AS,TS,RS) and DT = (ST ,AT ,TT ,RT ) SS ≠ ST, AS=AT, TS≈TT, RS≈RT 𝐷𝑠 ∈ M M is a set of all natural world MDPs
  • 5. Domain Adaptation • Representation learning with Stacked AE The behavior of RL agent is defined by a policy π : S -> A, specifying an action to apply in each state(S). the agent will take the state space (raw pixel) and decide what we should do. so, The Agent needs to learn generalized Policy π of the source domain. I thought that approach is appropriate for this problem is like DARLA[1]. It is encodes the observations that receives from environment as general representation, and then uses these representations to learn a robust policy that is capable of domain adaptation. As with the original, I implemented the two steps in which the 1st step is a DAE(Denoising Auto Encoder) and the 2nd step is VAE(Variational AutoEncoder) [2], But difference from the original is that allowing fine tuning in 2nd step, and β = 1(no longer Beta-VAE) [3]. More detail of implementation are below.
  • 6. Reinforcement learning • PPO(Proximal Policy Optimization) In PPO[4], The architecture is same as baseline, the convolution part of the network that is shared between the policy net and value net was replaced with the pre-trained encoder of 2nd step(VAE) of Stacked AE. Observations The observation resize to 78x78x3(HxWxC) Actions The eight button combinations: {{LEFT}, {RIGHT}, {LEFT, DOWN},{RIGHT, DOWN}, {DOWN}, {DOWN, B}, {B}} Rewards The horizontal offset from the player’s initial position, But allowing backwards(no negative rewards).
  • 7. Evaluation • Training and Evaluation procedure Gathering images: images (frames) were collected from the final play results of agents trained on each level of the training set and were used to train the Stacked VAE. Pre-training: the 1st step and 2nd step of the Stacked VAE were trained on the images collected above. Reinforcement learning on one level of the training set: the agent was trained for 1e6 time steps on GreenHillZone.Act1, a simple level from which the essence of the game can be learned. The convolutional part of the policy, initialized from the pre-trained Stacked VAE encoder, was allowed to be fine-tuned. Evaluating transfer performance: at evaluation time, the agent trained above played each test level for 1e6 time steps. For comparison, the (Sonic) Baseline PPO [5] was added to the evaluation.
  • 8. Results

| State | Baseline PPO (Scratch) | Baseline PPO (Trained) | Fast Learner PPO (Stacked VAE) |
|---|---|---|---|
| AngelIslandZone Act2 | 1246.0 | 1377.8 | 1993.2 |
| CasinoNightZone Act2 | 3302.9 | 3193.6 | 3684.5 |
| FlyingBatteryZone Act2 | 996.0 | 1047.4 | 871.1 |
| GreenHillZone Act2 | 2434.1 | 4846.3 | 4780.0 |
| HillTopZone Act2 | 2227.7 | 2876.7 | 2664.4 |
| HydrocityZone Act1 | 1966.7 | 713.8 | 1538.3 |
| LavaReefZone Act1 | 1666.7 | 739.5 | 2891.0 |
| MetropolisZone Act3 | 1395.6 | 1438.8 | 1566.2 |
| ScrapBrainZone Act1 | 1080.8 | 1272.6 | 1281.3 |
| SpringYardZone Act1 | 1595.7 | 1261.7 | 1272.2 |
| StarLightZone Act3 | 2604.4 | 2627.0 | 2684.8 |
| Average | 1865.1 | 1945.0 | 2293.3 |

Baseline PPO (Scratch) starts each test level with no prior training (zero-shot). Baseline PPO (Trained) was first trained for 1e6 time steps on one level of the training set (GreenHillZone.Act1). Fast Learner PPO is the evaluation target. Baseline PPO (Trained) appears to overfit somewhat, while Fast Learner PPO (Stacked VAE) appears to generalize better.
  • 9. Results [Figure: learning curve, score versus time steps, for Baseline PPO (Scratch), Baseline PPO (Trained), and Fast Learner PPO (Stacked VAE)]
  • 10. Submission Results

| Task | Baseline PPO (Scratch) | Baseline PPO (Trained) | Fast Learner PPO (Stacked VAE) |
|---|---|---|---|
| #1 | 907.16 | 6492.72 | 8071.51 |
| #2 | 3417.66 | 2477.81 | 2629.08 |
| #3 | 2690.13 | 2266.38 | 3166.67 |
| #4 | 1642.06 | 1915.59 | 2012.49 |
| #5 | 1786.79 | 2048.23 | 2979.00 |
| Average | 2088.76 | 3040.15 | 3771.75 |

For the submission, the training procedure was the same as for the evaluation, except that images collected from the test set were also added to the Stacked VAE training data. The submitted agent was trained for 1e6 time steps on the same level as in the evaluation. Compared with Scratch PPO, the average score improved by approximately 80%; compared with Trained PPO, it improved by approximately 24%.
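The two improvement figures follow directly from the average row of the submission table. A minimal check, where `relative_improvement` is a helper name introduced here only for illustration:

```python
def relative_improvement(new, old):
    """Percentage gain of `new` over `old`."""
    return (new - old) / old * 100.0


# Average scores from the submission results table.
scratch_avg = 2088.76
trained_avg = 3040.15
fast_avg = 3771.75

vs_scratch = relative_improvement(fast_avg, scratch_avg)
vs_trained = relative_improvement(fast_avg, trained_avg)
print(f"vs Scratch: {vs_scratch:.1f}%, vs Trained: {vs_trained:.1f}%")
```

Both values round to the approximate 80% and 24% quoted on the slide.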
  • 11. Discussion By learning a good representation from the training set, the agent quickly learned the essence of the game, and we confirmed that transfer to the test levels is feasible. However, I would like to investigate the following issues as future work. ・Overfitting to the source domain: as described in the DARLA paper, allowing the convolutional layers to be fine-tuned while learning the source policy can speed up learning on the source domain, but it introduces a risk of overfitting. A validation strategy for obtaining a robust policy needs to be considered. ・Learning a good feature representation: this time it was difficult to collect images, and the learned representation seems insufficient (for example, for obstacles or character motion). It is necessary to collect images of more varied situations and to improve generalization with image augmentation and similar techniques.
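  • 12. Implementation Detail [Figure: architecture diagram showing the DAE encoder and decoder (1st step, latent Z), the VAE encoder and decoder (2nd step, with μ, Σ and latent S_z), the function J applied to x and its reconstruction, and the RL policy π_S(a | S_z; θ)] Loss of the 2nd step, where $\hat{x}$ is the VAE reconstruction of $x$: $\mathcal{L}(\theta, \Phi; x, z, \beta) = \mathbb{E}_{q_\Phi(z|x)} \left\| J(\hat{x}) - J(x) \right\|_2^2 - D_{KL}\left( q_\Phi(z|x) \,\|\, p(z) \right)$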
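The two terms of this loss can be sketched numerically for a single sample. This is an illustration under stated assumptions rather than the author's implementation: the feature vectors `j_of_x` and `j_of_x_hat` stand in for the outputs of J, the posterior is a diagonal Gaussian parameterized by `mu` and `log_var`, the prior p(z) is a standard normal, and both terms are treated as penalties to be minimized (the usual VAE training form, which is a sign choice made here and not taken from the slide). All function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vae_step_loss(j_of_x, j_of_x_hat, mu, log_var, beta=1.0):
    """Feature-space reconstruction error plus beta times the KL term.

    beta defaults to 1.0, matching the beta = 1 setting stated in the deck.
    """
    recon = squared_l2(j_of_x_hat, j_of_x)
    return recon + beta * kl_to_standard_normal(mu, log_var)
```

The reconstruction error is measured between J(x̂) and J(x) rather than between raw pixels, so small but important objects are less likely to be ignored by the pixel-wise average.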
  • 13. Hyper-parameters

| Algorithm | Parameter | Value |
|---|---|---|
| Denoising AutoEncoder | Architecture | Kernel size 4, stride 2, encoder layers {32r-32r-64r-64r}, latent dim 100l, decoder layers {64r-64r-32r-32r}, Adam optimizer |
| Denoising AutoEncoder | Noise factor | 1.5 |
| Denoising AutoEncoder | Learning rate | 1e-3 |
| Variational AutoEncoder | Architecture | Kernel size 4, stride 2, encoder layers {32r-32r-64r-64r}, latent dim 200 (100 independent Gaussian distributions), decoder layers {64r-64r-32r-32r}, Adam optimizer |
| Variational AutoEncoder | Learning rate | 1e-4 |
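To get a feel for these encoder settings, the spatial size after each convolution can be computed from the kernel size and stride. The sketch below assumes a square 78x78 input (the observation size stated earlier) and no padding. The padding mode is not given in the deck, so the resulting shapes are illustrative only, and `conv_out` and `encoder_shapes` are hypothetical helper names.

```python
def conv_out(size, kernel=4, stride=2, padding=0):
    """Output width/height of one convolution along a single dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(size=78, channels=(32, 32, 64, 64), padding=0):
    """Return the (H, W, C) shape after each encoder layer."""
    shapes = []
    for c in channels:
        size = conv_out(size, padding=padding)
        shapes.append((size, size, c))
    return shapes


for shape in encoder_shapes():
    print(shape)
```

Under the no-padding assumption the feature map shrinks 78 → 38 → 18 → 8 → 3, so the last layer yields 3 x 3 x 64 = 576 values before the latent layer.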
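  • 14. Hyper-parameters

| Algorithm | Parameter | Value |
|---|---|---|
| PPO | Architecture | The convolutional part of the policy net is the VAE encoder; its output S_z is then fed to the policy layers {200l-512l-7l} |
| PPO | Epochs | 4 |
| PPO | Minibatch size | 8 |
| PPO | Discount (γ) | 0.99 |
| PPO | GAE parameter (λ) | 0.95 |
| PPO | Clipping parameter (ε) | 0.2 |
| PPO | Entropy coefficient | 0.001 |
| PPO | Reward scale | 0.005 |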
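The roles of γ, λ, and ε can be made concrete with a minimal sketch of generalized advantage estimation and the PPO clipped surrogate from [4], using the values in the table as defaults. It is a simplified illustration and not the baseline code: it handles a single trajectory segment with no episode termination inside it, and the function names are chosen here for clarity.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one segment without episode ends."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted sum of later errors.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With ε = 0.2, a probability ratio of 1.5 on a positive advantage is treated as if it were 1.2, which limits how far a single update can move the policy.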
  • 15. Appendix 1 • Latent space linear interpolation Linear interpolation in the latent space suggests that a representation shared between levels may have been acquired.
  • 16. References
[1] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, "DARLA: Improving zero-shot transfer in reinforcement learning," 2017. eprint: arXiv:1707.08475.
[2] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in ICLR, 2014.
[3] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "Beta-VAE: Learning basic visual concepts with a constrained variational framework," in ICLR, 2017.
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017. eprint: arXiv:1707.06347.
[5] https://github.com/openai/retro-baselines