Trust Region Policy Optimization, Schulman et al, 2015
1. Trust Region Policy Optimization, Schulman et al, 2015.
옥찬호
utilForever@gmail.com
2. Introduction
• Most algorithms for policy optimization can be classified into three broad categories:
• (1) Policy iteration methods (alternate between estimating the value function under the current policy and improving the policy)
• (2) Policy gradient methods (use an estimator of the gradient of the expected return)
• (3) Derivative-free optimization methods (treat the return as a black box function to be optimized in terms of the policy parameters), such as
• The Cross-Entropy Method (CEM)
• The Covariance Matrix Adaptation (CMA)
3. Introduction
• General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems, because they achieve good results while being simple to understand and implement.
• For example, while Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task.
Trust Region Policy Optimization, Schulman et al, 2015.
4. Introduction
• For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations.
• The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods.
Trust Region Policy Optimization, Schulman et al, 2015.
5. Introduction
• Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.
Trust Region Policy Optimization, Schulman et al, 2015.
6. Preliminaries
• Markov Decision Process (MDP): $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$
• $\mathcal{S}$ is a finite set of states
• $\mathcal{A}$ is a finite set of actions
• $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution
• $r : \mathcal{S} \to \mathbb{R}$ is the reward function
• $\rho_0 : \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$
• $\gamma \in (0, 1)$ is the discount factor
• Policy: $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$
Trust Region Policy Optimization, Schulman et al, 2015.
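For reference, the later slides rely on the paper's standard definitions of the expected return, the advantage, and the discounted state visitation frequencies:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big], \qquad s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)$$

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s), \qquad \rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$$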
17. Preliminaries
• This equation implies that any policy update $\pi \to \tilde{\pi}$ that has a nonnegative expected advantage at every state $s$, i.e., $\sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \ge 0$, is guaranteed to increase the policy performance $\eta$, or leave it constant in the case that the expected advantage is zero everywhere.
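The equation referred to here is the policy performance identity, which in the paper reads:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)$$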
Trust Region Policy Optimization, Schulman et al, 2015.
19. Preliminaries
• However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states $s$ for which the expected advantage is negative, that is, $\sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) < 0$.
• The complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes the equation difficult to optimize directly.
Trust Region Policy Optimization, Schulman et al, 2015.
20. Preliminaries
• Instead, we introduce the following local approximation to $\eta$:
• Note that $L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde{\pi}}$, ignoring changes in state visitation density due to changes in the policy.
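For reference, the local approximation introduced above is, as defined in the paper:

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a)$$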
Trust Region Policy Optimization, Schulman et al, 2015.
22. Preliminaries
• However, if we have a parameterized policy $\pi_\theta$, where $\pi_\theta(a \mid s)$ is a differentiable function of the parameter vector $\theta$, then $L_\pi$ matches $\eta$ to first order.
• This equation implies that a sufficiently small step $\pi_{\theta_0} \to \tilde{\pi}$ that improves $L_{\pi_{\theta_{old}}}$ will also improve $\eta$, but does not give us any guidance on how big a step to take.
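The first-order matching condition referenced here is, as stated in the paper:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}$$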
Trust Region Policy Optimization, Schulman et al, 2015.
24. Preliminaries
• Conservative Policy Iteration
• Provides explicit lower bounds on the improvement of $\eta$.
• Let $\pi_{old}$ denote the current policy, and let $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$. The new policy $\pi_{new}$ was defined to be the following mixture:
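The mixture in question is the conservative policy iteration update (Kakade & Langford, 2002), as given in the paper:

$$\pi_{new}(a \mid s) = (1 - \alpha)\,\pi_{old}(a \mid s) + \alpha\,\pi'(a \mid s)$$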
Trust Region Policy Optimization, Schulman et al, 2015.
25. Preliminaries
• Conservative Policy Iteration
• The following lower bound holds:
• However, this policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.
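The lower bound in question should be the one stated in the paper for the mixture policies above (reproduced here for reference):

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \qquad \epsilon = \max_s \big|\mathbb{E}_{a \sim \pi'(\cdot \mid s)}[A_{\pi_{old}}(s, a)]\big|$$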
Trust Region Policy Optimization, Schulman et al, 2015.
26. Monotonic Improvement Guarantee
• The policy improvement bound above can be extended to general stochastic policies by
• (1) Replacing $\alpha$ with a distance measure between $\pi$ and $\tilde{\pi}$
• (2) Changing the constant $\epsilon$ appropriately
27. Monotonic Improvement Guarantee
• The particular distance measure we use is the total variation divergence, defined for discrete probability distributions $p, q$ by
  $$D_{TV}(p \parallel q) = \frac{1}{2} \sum_i |p_i - q_i|$$
• Define $D_{TV}^{\max}(\pi, \tilde{\pi})$ as
  $$D_{TV}^{\max}(\pi, \tilde{\pi}) = \max_s D_{TV}\big(\pi(\cdot \mid s) \parallel \tilde{\pi}(\cdot \mid s)\big)$$
29. Monotonic Improvement Guarantee
• We note the following relationship between the total variation divergence and the KL divergence:
  $$D_{TV}(p \parallel q)^2 \le D_{KL}(p \parallel q)$$
• Let $D_{KL}^{\max}(\pi, \tilde{\pi}) = \max_s D_{KL}\big(\pi(\cdot \mid s) \parallel \tilde{\pi}(\cdot \mid s)\big)$. The following bound then follows directly from Theorem 1:
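The bound that follows, as stated in the paper (it reappears in parameterized form on slide 35), is:

$$\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C\,D_{KL}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$$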
33. Monotonic Improvement Guarantee
• Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies $\eta(\pi_0) \le \eta(\pi_1) \le \eta(\pi_2) \le \cdots$.
• Let $M_i(\pi) = L_{\pi_i}(\pi) - C\,D_{KL}^{\max}(\pi_i, \pi)$. Then (see the inequalities reproduced below):
• Thus, by maximizing $M_i$ at each iteration, we guarantee that the true objective $\eta$ is non-decreasing.
• This algorithm is a type of minorization-maximization (MM) algorithm.
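The minorization argument behind that "Then" is, following the paper: Theorem 1 gives $\eta(\pi_{i+1}) \ge M_i(\pi_{i+1})$, and $\eta(\pi_i) = M_i(\pi_i)$ because the KL term vanishes at $\pi = \pi_i$, so

$$\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)$$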
34. Optimization of Parameterized Policies
• Overload the previous notation to use functions of $\theta$:
• $\eta(\theta) := \eta(\pi_\theta)$
• $L_\theta(\tilde{\theta}) := L_{\pi_\theta}(\pi_{\tilde{\theta}})$
• $D_{KL}(\theta \parallel \tilde{\theta}) := D_{KL}(\pi_\theta \parallel \pi_{\tilde{\theta}})$
• Use $\theta_{old}$ to denote the previous policy parameters.
35. Optimization of Parameterized Policies
• The preceding section showed that
  $$\eta(\theta) \ge L_{\theta_{old}}(\theta) - C\,D_{KL}^{\max}(\theta_{old}, \theta)$$
• Thus, by performing the following maximization, we are guaranteed to improve the true objective $\eta$:
  $$\underset{\theta}{\text{maximize}}\;\Big[L_{\theta_{old}}(\theta) - C\,D_{KL}^{\max}(\theta_{old}, \theta)\Big]$$
36. Optimization of Parameterized Policies
• In practice, if we used the penalty coefficient $C$ recommended by the theory above, the step sizes would be very small.
• One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint (shown below):
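The trust region constraint referred to above is, as formulated in the paper:

$$\underset{\theta}{\text{maximize}}\; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad D_{KL}^{\max}(\theta_{old}, \theta) \le \delta$$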
37. Optimization of Parameterized Policies
• This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints.
• Instead, we can use a heuristic approximation which considers the average KL divergence:
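The average KL divergence is defined in the paper as:

$$\bar{D}_{KL}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\Big[D_{KL}\big(\pi_{\theta_1}(\cdot \mid s) \parallel \pi_{\theta_2}(\cdot \mid s)\big)\Big]$$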
38. Optimization of Parameterized Policies
• We therefore propose solving the following optimization problem to generate a policy update:
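That problem, as given in the paper, is:

$$\underset{\theta}{\text{maximize}}\; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta$$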
40. Sample-Based Estimation
• How can the objective and constraint functions be approximated using Monte Carlo simulation?
• We seek to solve the following optimization problem, obtained by expanding $L_{\theta_{old}}$ in the previous equation:
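Expanding $L_{\theta_{old}}$ gives, as in the paper:

$$\underset{\theta}{\text{maximize}}\; \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s)\,A_{\theta_{old}}(s, a) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta$$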
41. Sample-Based Estimation
• We replace:
• $\sum_s \rho_{\theta_{old}}(s)\,[\ldots] \;\longrightarrow\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_{\theta_{old}}}[\ldots]$
• $A_{\theta_{old}} \;\longrightarrow\; Q_{\theta_{old}}$
• $\sum_a \pi_\theta(a \mid s)\,A_{\theta_{old}}(s, a) \;\longrightarrow\; \mathbb{E}_{a \sim q}\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\,A_{\theta_{old}}(s, a)\right]$
  (We replace the sum over the actions by an importance sampling estimator, with $q$ denoting the sampling distribution.)
44. Sample-Based Estimation
• Now, our optimization problem in the previous equation is exactly equivalent to the following one, written in terms of expectations:
• Constrained Optimization → Sample-Based Estimation
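Written in terms of expectations, the problem (as it appears in the paper) is:

$$\underset{\theta}{\text{maximize}}\; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\,Q_{\theta_{old}}(s, a)\right] \quad \text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{old}}}\Big[D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s) \parallel \pi_\theta(\cdot \mid s)\big)\Big] \le \delta$$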
45. Sample-Based Estimation
• All that remains is to replace the expectations by sample averages and replace the $Q$-value by an empirical estimate. The following sections describe two different schemes for performing this estimation.
• Single Path: typically used for policy gradient estimation (Bartlett & Baxter, 2011), and based on sampling individual trajectories.
• Vine: constructing a rollout set and then performing multiple actions from each state in the rollout set.
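As a concrete illustration of the single-path scheme, here is a minimal sketch that rolls out one trajectory and uses the discounted return-to-go from each visited state-action pair as its Monte Carlo Q-estimate. It assumes a Gymnasium-style environment and a `policy(obs)` callable returning an action; these names are illustrative, not code from the paper.

```python
import numpy as np

def single_path_rollout(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one trajectory and return (states, actions, q_estimates),
    where q_estimates[t] is the discounted return-to-go from step t."""
    obs, _ = env.reset()
    states, actions, rewards = [], [], []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        states.append(obs)
        actions.append(action)
        rewards.append(reward)
        obs = next_obs
        if terminated or truncated:
            break

    # Backward pass: Q_hat[t] = r_t + gamma * Q_hat[t + 1]
    q_estimates = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_estimates[t] = running
    return states, actions, q_estimates
```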
49. Practical Algorithm
1. Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their $Q$-values.
2. By averaging over samples, construct the estimated objective and constraint in the preceding equation.
3. Approximately solve this constrained optimization problem to update the policy's parameter vector $\theta$. We use the conjugate gradient algorithm followed by a line search, which is altogether only slightly more expensive than computing the gradient itself. See Appendix C for details.
Trust Region Policy Optimization, Schulman et al, 2015.
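A compact sketch of step 3, under the assumption that the surrogate gradient `g`, a Fisher-vector product routine `fvp`, and callables for the surrogate objective and the average KL are already available (all names here are illustrative, not the paper's code). The search direction comes from conjugate gradient, the initial step length is chosen so that the quadratic KL model equals δ, and a backtracking line search accepts the first candidate that improves the surrogate while keeping the KL within δ.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve H x = g for x, where H is accessed only through
    Fisher-vector products fvp(v) = H @ v (H is never formed explicitly)."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x, with x = 0 initially
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = fvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, g, fvp, surrogate, kl_to, delta=0.01, backtracks=10):
    """One trust-region update.

    theta     -- current parameter vector (np.ndarray)
    g         -- gradient of the surrogate objective at theta
    fvp(v)    -- Fisher-vector product H @ v (Hessian of the average KL)
    surrogate -- callable: parameters -> surrogate objective value
    kl_to     -- callable: parameters -> average KL from the old policy
    """
    direction = conjugate_gradient(fvp, g)                      # ~ H^{-1} g
    step_size = np.sqrt(2.0 * delta / (direction @ fvp(direction)))
    full_step = step_size * direction                           # so that 1/2 s^T H s = delta
    l_old = surrogate(theta)
    for frac in 0.5 ** np.arange(backtracks):                   # 1, 1/2, 1/4, ...
        theta_new = theta + frac * full_step
        if surrogate(theta_new) > l_old and kl_to(theta_new) <= delta:
            return theta_new
    return theta  # reject the update if no candidate passes both checks
```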
50. Practical Algorithm
• With regard to (3), we construct the Fisher information matrix (FIM) by analytically computing the Hessian of the KL divergence, rather than using the covariance matrix of the gradients.
• That is, we estimate $A_{ij}$ as
  $$\frac{1}{N}\sum_n \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s_n) \parallel \pi_\theta(\cdot \mid s_n)\big),$$
  rather than as the covariance of the gradients,
  $$\frac{1}{N}\sum_n \frac{\partial}{\partial\theta_i}\log\pi_\theta(a_n \mid s_n)\,\frac{\partial}{\partial\theta_j}\log\pi_\theta(a_n \mid s_n).$$
• This analytic estimator has computational benefits in the large-scale setting, since it removes the need to store a dense Hessian or all policy gradients from a batch of trajectories.
Trust Region Policy Optimization, Schulman et al, 2015.
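A small self-contained check of this idea on a toy example (a single-state categorical policy parameterized by softmax logits; purely illustrative, not the paper's setup): the Hessian of $D_{KL}(\pi_{\theta_{old}} \parallel \pi_\theta)$ at $\theta = \theta_{old}$ equals the Fisher matrix $\mathrm{diag}(p) - pp^\top$, so Fisher-vector products can be obtained by differentiating the KL twice, here with finite differences standing in for the automatic differentiation used in practice.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(theta_old, theta):
    """KL(pi_theta_old || pi_theta) for a categorical policy with logits theta."""
    p, q = softmax(theta_old), softmax(theta)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def kl_grad(theta_old, theta, eps=1e-4):
    """Gradient of the KL with respect to theta, by central differences."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (kl(theta_old, theta + d) - kl(theta_old, theta - d)) / (2 * eps)
    return g

def fisher_vector_product(theta_old, v, eps=1e-4):
    """Hessian-of-KL times vector: differentiate the KL gradient along v."""
    g_plus = kl_grad(theta_old, theta_old + eps * v)
    g_minus = kl_grad(theta_old, theta_old - eps * v)
    return (g_plus - g_minus) / (2 * eps)

theta_old = np.array([0.2, -1.0, 0.5])
p = softmax(theta_old)
fim = np.diag(p) - np.outer(p, p)   # analytic Fisher matrix for softmax logits
v = np.array([1.0, 2.0, -1.0])
print(np.allclose(fim @ v, fisher_vector_product(theta_old, v), atol=1e-3))  # True
```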
51. Experiments
We designed experiments to investigate the following questions:
1. What are the performance characteristics of the single path and vine sampling procedures?
2. TRPO is related to prior methods but makes several changes, most notably by using a fixed KL divergence rather than a fixed penalty coefficient. How does this affect the performance of the algorithm?
3. Can TRPO be used to solve challenging large-scale problems? How does TRPO compare with other methods when applied to large-scale problems, with regard to final performance, computation time, and sample complexity?
Trust Region Policy Optimization, Schulman et al, 2015.
56. Experiments
• Playing Games from Images
• To evaluate TRPO on a partially observed task with complex observations, we trained policies for playing Atari games, using raw images as input.
• Challenging points
• In addition to the high dimensionality, challenging elements of these games include delayed rewards (no immediate penalty is incurred when a life is lost in Breakout or Space Invaders)
• Complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms)
• Non-stationary image statistics (Enduro involves a changing and flickering background)
Trust Region Policy Optimization, Schulman et al, 2015.
60. Discussion
• We proposed and analyzed trust region methods for optimizing stochastic control policies.
• We proved monotonic improvement for an algorithm that repeatedly optimizes a local approximation to the expected return of the policy with a KL divergence penalty.
Trust Region Policy Optimization, Schulman et al, 2015.
61. Discussion
• In the domain of robotic locomotion, we successfully learned controllers for swimming, walking and hopping in a physics simulator, using general-purpose neural networks and minimally informative rewards.
• At the intersection of the two experimental domains we explored, there is the possibility of learning robotic control policies that use vision and raw sensory data as input, providing a unified scheme for training robotic controllers that perform both perception and control.
Trust Region Policy Optimization, Schulman et al, 2015.
62. Discussion
• By combining our method with model learning, it would also be possible to substantially reduce its sample complexity, making it applicable to real-world settings where samples are expensive.
Trust Region Policy Optimization, Schulman et al, 2015.