July 2, 2017
Create a Bot to Play FlappyBird
Introduction to Reinforcement Learning
Phu Nguyen
• What is Reinforcement Learning?
• Markov Decision Process
• Introduction to OpenAI Gym
• Demo: Bot to play FlappyBird
Agenda
What is RL?
RL examples
• No supervisor, only a reward signal.
• Feedback is delayed, not instantaneous.
• Time really matters (sequential, non-i.i.d. data).
• The agent’s actions affect the subsequent data it receives.
Difficulties of RL
Agent and Environment
[Diagram: the agent-environment loop. At each step t, the agent receives observation Ot and reward Rt, and takes action At.]
• History: Ht = O1, R1, A1, O2, R2, A2, …, At-1, Ot, Rt
• State is the information used to determine what happens next:
St = f(Ht)
• Agent state vs environment state (Sᵃt vs Sᵉt)
• Fully observable and partially observable environments.
State
• Policy
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]
• Value function
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
• Model
Pᵃss’ = P[St+1 = s’ | St = s, At = a]
Rᵃs = E[Rt+1 | St = s, At = a]
Major components of an agent
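To make these three components concrete, here is a minimal Python sketch (not from the talk; the two-state problem and all names are invented for illustration) representing a stochastic policy, a value function, and a model as plain tables:

import numpy as np

# Hypothetical two-state, two-action problem, for illustration only.
states  = ["s0", "s1"]
actions = ["stay", "move"]

# Policy pi(a|s): a table of action probabilities (a stochastic policy).
policy = {"s0": {"stay": 0.5, "move": 0.5},
          "s1": {"stay": 0.1, "move": 0.9}}

# Value function v_pi(s): expected discounted return from each state.
value = {"s0": 0.0, "s1": 0.0}

# Model: transition probabilities P[s'|s,a] and expected rewards R[s,a].
P = {("s0", "move"): {"s1": 1.0}, ("s0", "stay"): {"s0": 1.0},
     ("s1", "move"): {"s0": 1.0}, ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "move"): 1.0, ("s0", "stay"): 0.0,
     ("s1", "move"): 0.0, ("s1", "stay"): 0.0}

def sample_action(s):
    # Draw an action from the stochastic policy pi(.|s).
    probs = policy[s]
    return np.random.choice(list(probs), p=list(probs.values()))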
• Value based: value function, no policy (implicit)
• Policy based: policy, no value function
• Actor critic: value function and policy
Categorizing RL agents
• Model free: value function and/or policy, no model
• Model based: value function and/or policy, and a model
Categorizing RL agents
• Exploration finds more information about the environment.
• Exploitation exploits known information to maximize reward.
Exploration vs Exploitation
if np.random.uniform() < eps:
    action = random_action()    # explore with probability eps
else:
    action = get_best_action()  # otherwise exploit (greedy action)
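The speaker notes suggest reducing epsilon over the course of training and acting greedily at test time. A common schedule (an assumption here, not something the slides specify) is linear annealing:

# Linear annealing of epsilon from 1.0 down to 0.1 over the first
# 100,000 steps; all constants are illustrative, not from the talk.
EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.1, 100000

def epsilon(step):
    frac = min(step / ANNEAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)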
• A Markov state contains all useful information from the history:
P[St+1 | St] = P[St+1 | S1, …, St]
• Some examples:
The environment state Sᵉt is Markov.
The history Ht is Markov.
Markov state (Information state)
• A Markov Decision Process is a tuple (S, A, P, R, γ).
• S: a finite set of states.
• A: a finite set of actions.
• P: a state transition probability matrix,
Pᵃss’ = P[St+1 = s’ | St = s, At = a]
• R: a reward function,
Rᵃs = E[Rt+1 | St = s, At = a]
• γ: a discount factor, γ ∈ [0, 1].
Markov Decision Process (MDP)
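As a rough sketch (sizes and numbers invented for illustration), the tuple (S, A, P, R, γ) can be stored directly as arrays; this is the form that tabular solution methods later in the deck operate on:

import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount factor in [0, 1]

# P[a, s, s2] = P[St+1 = s2 | St = s, At = a]; each row sums to 1.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
P[1] = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

# R[a, s] = E[Rt+1 | St = s, At = a].
R = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])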
Example: Student MDP
• The state-value function vπ(s) is the expected return starting from state s and then following policy π.
• The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.
• vπ(s) = Eπ[Gt | St = s]
• qπ(s, a) = Eπ[Gt | St = s, At = a]
• Gt = Rt+1 + γRt+2 + γ²Rt+3 + …
Value functions of an MDP
Bellman Expectation Equation for vπ
Bellman Expectation Equation for qπ
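These two slides show the equations as images in the original deck. For reference, the standard forms from David Silver's course (which this talk follows), in the notation used above, are:

vπ(s) = Σa π(a|s) (Rᵃs + γ Σs’ Pᵃss’ vπ(s’))
qπ(s, a) = Rᵃs + γ Σs’ Pᵃss’ Σa’ π(a’|s’) qπ(s’, a’)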
State-Value Function for Student MDP
7.4 = 0.5 * (1 + 0.4*7.4 + 0.4*2.7 + 0.2*(-1.3)) + 0.5 * 10
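A quick check of the slide's arithmetic (the backup shown implies γ = 1 in the Student MDP):

v = 0.5 * (1 + 0.4 * 7.4 + 0.4 * 2.7 + 0.2 * (-1.3)) + 0.5 * 10
print(round(v, 2))  # 7.39, which matches the 7.4 on the slide up to rounding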
• State-value function: v∗(s) = maxπ vπ(s)
• Action-value function: q∗(s, a) = maxπ qπ(s, a)
• Optimal policy:
π∗(a|s) = 1 if a = argmaxₐ q∗(s, a), 0 otherwise
Optimal value function and policy
Bellman equation for optimal value function
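This slide's equations are images in the original deck; in the notation used above, the Bellman optimality equations are:

v∗(s) = maxa (Rᵃs + γ Σs’ Pᵃss’ v∗(s’))
q∗(s, a) = Rᵃs + γ Σs’ Pᵃss’ maxa’ q∗(s’, a’)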
Optimal policy for Student MDP
• Value Iteration
• Policy Iteration
• Q-learning
• Sarsa
• …
Solving the Bellman Optimality Equation
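As one concrete example of these methods, here is a minimal sketch of tabular Q-learning. The env object is assumed to follow the classic OpenAI Gym API mentioned in the agenda (reset() returns a state index, step(a) returns (state, reward, done, info)); alpha, gamma, and the episode count are illustrative choices, not values from the talk.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (exploration vs exploitation).
            if np.random.uniform() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # Move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a').
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q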
Deep Q-Learning
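The Deep Q-Learning slides are images in the original deck; the speaker notes describe the network as taking the state as input and outputting a vector of Q-values of size nb_actions. A minimal Keras sketch of that shape (layer sizes and the state dimension are assumptions, not the talk's actual implementation):

from tensorflow import keras

STATE_DIM, NB_ACTIONS = 8, 2   # hypothetical dimensions

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NB_ACTIONS, activation="linear"),  # one Q-value per action
])
model.compile(optimizer="adam", loss="mse")  # regress toward TD targets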
Demo FlappyBird & Discussion
• https://www.coursera.org/learn/machine-learning
• https://www.coursera.org/learn/neural-networks
• NLP: https://web.stanford.edu/class/cs224n/
• CNN: http://cs231n.stanford.edu/
• RL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
• http://www.deeplearningbook.org/
• Reinforcement Learning: An Introduction (Richard S. Sutton and Andrew G. Barto)
Courses and books
Editor's Notes

1. Real-world reinforcement learning: learn from experience to maximize rewards. A dog watches the trainer's actions, hears her command, and reacts based on that information. If the reaction is good, the dog receives a reward (a treat, praise…). If not, it receives no reward. The dog learns from experience to find the behavior that earns as many rewards as possible.
2. AlphaGo: defeated Ke Jie (Kha Khiết); other game playing: Atari, chess… Waymo: self-driving cars (Google). DeepMind AI reduced Google's data centre cooling bill by 40% (https://goo.gl/JbcH5n). Robotics. SpaceX reuses rockets. Finance (investment).
3. Compare with supervised and unsupervised learning. We usually don't receive the reward immediately: in chess, we win or lose because of moves made in the past; in self-driving, the driver often hits the brake right before the accident. Observation → action → reward → new observation → new action → new reward. The agent's actions can change the environment and affect future observations.
4. At step t: take action At, see new observation Ot, and receive reward Rt.
5. The history is the series of observations, rewards, and actions from the beginning to the current time. State is a function of the history. The environment state is the environment's private representation, usually not visible to the agent; even if visible, it may contain irrelevant information. In a fully observable environment, the agent directly observes the environment state (Sᵃ = Sᵉ). In a partially observable environment, the agent observes it only indirectly (Sᵃ ≠ Sᵉ).
6. The policy is the agent's behavior; it maps from state to action. The value function is a prediction of future reward, used to evaluate the goodness/badness of states and so to choose actions. A model predicts what the environment will do next: P predicts the next state, R predicts the next immediate reward (the expected value of Rt+1, not the realized value). If γ = 0, the agent cares only about the immediate reward; if γ = 1, future rewards are not discounted.
7. Categorizing: value based, policy based, actor critic.
8. Categorizing: model free, model based.
9. Reinforcement learning is like trial-and-error learning: the agent must discover a good policy from its experience of the environment without losing too much reward along the way. Reduce epsilon during training; at test time, just choose the best action. Epsilon is a small number (e.g., annealed from 1 to 0.1).
10. When the state is known, the history can be thrown away. A Markov state can often be created by adding more information. More examples: a chess board (plus knowing which player moves next) is Markov; when driving a car, you just need the current conditions (position, speed, …), not the history.
11. Why do we need the discount factor γ? The discount expresses the present value of future rewards; it avoids infinite returns in cyclic Markov processes and reflects uncertainty about the future. As with money in the bank, a reward today is better than one tomorrow. Animal/human behavior also shows a preference for immediate reward.
12. The example is from David Silver's course. Circles and squares are states (square: terminal state). Some actions: Facebook, Quit, Study… From the third state, choosing the action Pub may end in different states.
13. From state s we can take many actions, each with probability π(a|s). After that we receive a reward and may move to another state s’ with probability Pᵃss’.
14. From state s, we choose action a and receive reward Rᵃs; we may then move to many new states, from which we can take further actions according to π(a’|s’).
15. The optimal state-value function v∗(s) is the maximum value function over all policies; the optimal action-value function q∗(s, a) is the maximum action-value function over all policies. An MDP is “solved” when we know the optimal value function: it specifies the best possible performance in the MDP, and if we know q∗(s, a) we immediately have the optimal policy.
16. By acting greedily with respect to q∗, we find the optimal policy.
17. Input: state. Output: a vector of Q-values (size: nb_actions). Dueling DQN: the first stream is the value function V(s), which says simply how good it is to be in a given state; the second is the advantage function A(s, a), which tells how much better taking a certain action is compared to the others. Q can then be thought of as the combination of V and A.
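The standard dueling aggregation (from Wang et al., 2016, which this note appears to describe) subtracts the mean advantage so the V/A decomposition is identifiable; a one-line numpy sketch:

import numpy as np

def dueling_q(v, a):
    # v: scalar V(s); a: vector of advantages A(s, a) over actions.
    return v + (a - np.mean(a))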