Machine Learning
By
R.Sanghavi
Asst Professor
CSE(DS)
MALLA REDDY ENGINEERING COLLEGE (Autonomous)
Module 5:
Reinforcement Learning
Syllabus
Reinforcement Learning–(Q-Learning, Deep Q-
Networks) – Transfer Learning and Pretrained Models –
Markov Chain Monte Carlo Methods – Sampling –
Proposal Distribution – Markov Chain Monte Carlo –
Graphical Models – Bayesian Networks – Markov
Random Fields – Case Studies: Real-World Machine
Learning Applications – Future Trends in Machine
Learning
Reinforcement Learning
• Reinforcement Learning (RL) is a branch of machine learning that teaches agents how to make
decisions by interacting with an environment to achieve a goal. In RL, an agent learns to
perform tasks by trying different strategies to maximize cumulative rewards based on feedback
received through its actions.
• Agent: The decision-maker that performs actions.
• Environment: The world or system in which the agent operates.
• State: The situation or condition the agent is currently in.
• Action: The possible moves or decisions the agent can make.
• Reward: The feedback or result from the environment based on the agent’s action.
How RL Works?
• The RL process involves an agent performing actions in an environment, receiving
rewards or penalties based on those actions, and adjusting its behavior accordingly.
This loop helps the agent improve its decision-making over time to maximize the
cumulative reward.
RL components:
• Policy: A strategy that the agent uses to determine the next action based on the
current state.
• Reward Function: A function that provides feedback on the actions taken, guiding
the agent towards its goal.
• Value Function: Estimates the future cumulative rewards the agent will receive from
a given state.
• Model of the Environment: A representation of the environment that predicts future
states and rewards, aiding in planning.
RL Example: Navigating a Maze
Imagine a robot navigating a maze to reach a diamond while
avoiding fire hazards. The goal is to find the optimal path with
the least number of hazards while maximizing the reward:
• Each time the robot moves correctly, it receives a reward.
• If the robot takes the wrong path, it loses points.
The robot learns by exploring different paths in the maze. By
trying various moves, it evaluates the rewards and penalties
for each path. Over time, the robot determines the best route
by selecting the actions that lead to the highest cumulative
reward.
The robot’s learning process is as follows:
• Exploration: The robot starts by exploring all possible paths in the maze, taking
different actions at each step (e.g., move left, right, up, or down).
• Feedback: After each move, the robot receives feedback from the environment:
• A positive reward for moving closer to the diamond.
• A penalty for moving into a fire hazard.
• Adjusting Behavior: Based on this feedback, the robot adjusts its behavior to
maximize the cumulative reward, favoring paths that avoid hazards and bring it
closer to the diamond.
• Optimal Path: Eventually, the robot discovers the optimal path with the least number
of hazards and the highest reward by selecting the right actions based on past
experiences.
Types of RL Algorithms
1. Model-Based RL
• In a model-based reinforcement learning algorithm, the agent builds a model of the
environment's dynamics. This model predicts the next state and the reward given
the current state and action. The agent uses this model to plan actions by simulating
possible future scenarios before deciding on the best action. This type of RL is
appropriate for environments where building an accurate model is feasible, allowing
for efficient exploration and planning.
2. Model-Free RL
• A model-free reinforcement learning algorithm does not require a model of the
environment. Instead, the agent learns directly from interactions with the
environment by trial and error. The agent learns to associate actions with rewards
and uses this experience to improve decision-making over time. This type of
reinforcement learning is suitable for complex environments where modeling the
environment's dynamics is difficult or impossible.
RL Models
Traditional reinforcement learning models
• Traditional reinforcement learning models are based on the foundational
principles of RL, where an agent learns to make decisions through trial and error
by interacting with an environment. These models often rely on tabular methods,
like Q-learning and SARSA, which use a table or matrix to store and update the
values of different actions in various states.
• Q-Learning is a value-based method in which the agent learns the value of taking a
particular action in a specific state, aiming to maximize the cumulative reward over
time.
• SARSA is similar to Q-learning, but the agent updates its value estimates using the
action taken rather than the best possible action.
Q-learning
• Q-learning is a model-free reinforcement learning algorithm used to train agents
(computer programs) to make optimal decisions by interacting with an environment.
It helps the agent explore different actions and learn which ones lead to better
outcomes. The agent uses trial and error to determine which actions result in
rewards (good outcomes) or penalties (bad outcomes).
• Over time, it improves its decision-making by updating a Q-table, which stores Q-
values representing the expected rewards for taking particular actions in given
states.
• While Q-Learning works well for small state-action spaces, it struggles with scalability
when dealing with high-dimensional environments like images or continuous states.
Working of Q-Learning
Q-learning models follow an iterative process, where different
components work together to train the agent:
• Agent: The entity that makes decisions and takes actions within the
environment.
• States: The variables that define the agent’s current position in the
environment.
• Actions: The operations the agent performs when in a specific state.
• Rewards: The feedback the agent receives after taking an action.
• Episodes: A sequence of actions that ends when the agent reaches a
terminal state.
• Q-values: The estimated rewards for each state-action pair.
Algorithm of Q-learning
• Initialization: The agent starts with an initial Q-table, where Q-values
are typically initialized to zero.
• Exploration: The agent chooses an action based on the ϵ-greedy policy
(either exploring or exploiting).
• Action and Update: The agent takes the action, observes the next
state, and receives a reward. The Q-value for the state-action pair is
updated using the TD update rule.
• Iteration: The process repeats for multiple episodes until the agent
learns the optimal policy.
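The TD update used in step 3 is Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]. Below is a minimal sketch of tabular Q-learning with an ε-greedy policy; it assumes a small environment object exposing reset() and step(action) and known state/action counts (these names are illustrative, not a specific library API).

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy policy (illustrative sketch)."""
    Q = np.zeros((n_states, n_actions))            # Q-table initialized to zero
    for _ in range(episodes):
        state = env.reset()                        # start a new episode
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # TD update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q
```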
Deep Reinforcement Learning Models
• Deep reinforcement learning (Deep RL) models combine the principles of traditional
RL with deep learning, allowing the agent to handle complex environments with
high-dimensional inputs, such as images or continuous action spaces.
• Deep RL models are powerful in handling complex and large-scale problems, such
as playing video games, robotics, and autonomous driving. They can process high-
dimensional data and learn features automatically without manual feature
engineering. Instead of using tables to store values, Deep RL models utilize neural
networks to approximate the value functions or policies, explained in detail below:
Deep Q-Networks (DQN)
• DQN is a powerful algorithm in the field of RL. It
combines the principles of deep neural networks
with Q-learning, enabling agents to learn optimal
policies in complex environments.
• Traditional Q-Learning uses a table to store Q-values
for each state-action pair, which becomes
impractical for large state spaces.
• Instead of maintaining a table of Q-values for each
state-action pair, DQNs approximate the Q-value
function using a neural network parameterized by
weights θ. The network takes a state as input and
outputs Q-values for all possible actions.
Architecture of Deep Q-Networks
1. Neural Network
The network approximates the Q-value function Q(s, a; θ), where θ represents the trainable parameters.
For example, in Atari games, the input might be raw pixels from the game screen,
and the output is a vector of Q-values corresponding to each possible action.
2. Experience Replay
To stabilize training, DQNs store past experiences (s, a, r, s′) in a replay buffer.
During training, mini-batches of experiences are sampled randomly from the buffer,
breaking the correlation between consecutive experiences and improving
generalization.
3. Target Network
A separate target network with parameters θ⁻ is used to compute the target
Q-values during updates. The target network is periodically updated with the
weights of the main network to ensure stability.
Loss Function
• The loss function measures the difference between the predicted Q-values and the target Q-values:
L(θ) = E[(r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²]
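A minimal sketch of this loss computation in PyTorch, assuming a small fully connected Q-network and a mini-batch of transitions already sampled from the replay buffer (the network shape and variable names are illustrative, not a prescribed architecture):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network: state in, one Q-value per action out."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Compute the DQN loss L(θ) for one mini-batch of replayed transitions."""
    states, actions, rewards, next_states, dones = batch   # actions: int64, dones: float
    # Q(s, a; θ) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target: r + γ max_a' Q(s', a'; θ⁻), computed with the frozen target network.
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * max_next_q * (1 - dones)
    return nn.functional.mse_loss(q_values, targets)
```

In a full training loop, the target network's parameters θ⁻ would be copied from the main network every fixed number of steps, as described above.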
Advantages
• High-Dimensional State Spaces : Traditional Q-Learning requires
storing a Q-table, which becomes infeasible for large state spaces.
Neural networks can generalize across states, making them suitable
for complex environments.
• Continuous Input Data : Many real-world problems involve continuous
inputs, such as pixel data from video frames. Neural networks excel at
processing such data.
• Scalability : By leveraging the representational power of deep
learning, DQNs can scale to solve tasks that were previously
unsolvable with tabular methods.
Applications of RL
• Gaming: Training AI to outperform humans in complex games like chess, Go, and multiplayer online games.
• Autonomous Vehicles: Developing decision-making systems for self-driving cars, drones, and other
autonomous systems to navigate and operate safely.
• Robotics: Teaching robots to perform tasks such as assembly, walking, and complex manipulation through
adaptive learning.
• Finance: Enhancing strategies in trading, portfolio management, and risk assessment.
• Healthcare: Personalizing medical treatments, managing patient care, and assisting in surgeries with robotic
systems.
• Supply Chain Management: Optimizing logistics, inventory management, and distribution networks.
• Energy Management: Managing and distributing renewable energy in smart grids to enhance efficiency and
sustainability.
• Advertising: Optimizing ad placements and bidding strategies in real-time to maximize engagement and
revenue.
• Manufacturing: Automating and optimizing production lines and processes.
Transfer learning
• Transfer learning is a machine learning technique where knowledge gained from
solving one problem is applied to a different but related problem. Instead of
training a model from scratch for a new task, a pre-trained model on a similar task
is used, saving time and resources.
• It involves reusing a model pre-trained on one task and fine-tuning it for a new,
related task.
• It's particularly beneficial when you have limited data for the new task, as the pre-
trained model already has learned a lot of useful features.
• Typically, training a model takes a large amount of compute resources, data and time. Using a pretrained model as a starting point helps cut down on all three, since developers don't have to start from scratch and train a large model on an even bigger data set.
How to use Transfer Learning
• Transfer learning can be accomplished in several ways. One way is to find a related
learned task -- labeled as Task B -- that has plenty of transferable labeled data. The
new model is then trained on Task B. After this training, the model has a starting
point for solving its initial task, Task A.
• Another way to accomplish transfer learning
is to use a pretrained model. This process is
easier, as it involves the use of an already
trained model. The pretrained model should
have been trained using a large data set to
solve a task similar to Task A. Models can be
imported from other developers who have
published them online.
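As a hedged sketch of the pretrained-model route, the code below reuses a torchvision ResNet-18 trained on ImageNet, freezes its backbone, and fine-tunes only a new final layer for the target task (the weights argument follows recent torchvision versions; older versions use pretrained=True instead):

```python
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes):
    """Reuse a ResNet-18 pretrained on ImageNet and train only a new head."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone
    for param in model.parameters():
        param.requires_grad = False                 # freeze the learned features
    # Replace the final fully connected layer for the new task.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_transfer_model(num_classes=5)
# Only model.fc.parameters() need to be passed to the optimizer.
```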
Types of transfer learning
• Transfer learning methods fall into one of the following three
categories:
• Transductive transfer. Target tasks are the same but use different data
sets.
• Inductive transfer. Source and target tasks are different, regardless of
the data set. Source and target data are typically labeled.
• Unsupervised transfer. Source and target tasks are different, but the
process uses unlabeled source and target data. Unsupervised learning
is useful in settings where manually labeling data is impractical.
Transfer learning examples
• In machine learning, knowledge or data gained while solving one problem is
stored, labeled and then applied to a different but related problem. In NLP, for
example, a data set from an old model that understands the vocabulary used in
one area can be used to train a new model whose goal is to understand dialects
in multiple areas. An organization could then apply this for sentiment analysis.
• Transfer learning is also useful during the deployment of upgraded technology,
such as a chatbot. If the new domain is similar enough to previous deployments,
transfer learning can assess which knowledge should be transplanted. Using
transfer learning, developers can decide what knowledge and data is reusable
from the previous deployments and transfer that information for use when
developing the upgraded version.
Pre-trained model
• A pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use a model trained on another problem as a starting point.
• For example, if you want to build a self-driving car, you could spend years building a decent image recognition algorithm from scratch, or you could take the Inception model (a pre-trained model) from Google, which was trained on ImageNet data, to identify objects in images.
• A pre-trained model may not be 100% accurate in your application,
but it saves huge efforts required to re-invent the wheel. Let me show
this to you with a recent example.
Some real-world examples
Examples where pre-trained NLP models are used
• Sentiment Analysis is an NLP task where a model tries to identify if the given text
has positive, negative, or neutral sentiment. Sentiment analysis can be used in many
real-world scenarios like customer support chatbots and spam detection. Pre-
trained NLP models for sentiment analysis are provided by open-source models and
libraries such as BERT, NLTK, spaCy, and Stanford NLP.
• Text Summarization is an NLP task where a model tries to summarize the input text
into a shorter version in an efficient way that preserves all important information
from the input text. NER, NMT, and Sentiment Analysis models are often used as
part of the pipeline for pre-processing input text before sending it over to a
summarization model.
• Automated Question Answering Systems
• Speech Recognition
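As one concrete, hedged illustration, the Hugging Face transformers library (not named in the list above, but a common host for such pre-trained models) exposes a ready-made sentiment-analysis pipeline backed by a fine-tuned BERT-family model:

```python
from transformers import pipeline

# Downloads a default pre-trained sentiment model on first use.
classifier = pipeline("sentiment-analysis")

print(classifier("The support team resolved my issue quickly."))
# Expected shape of the output: [{'label': 'POSITIVE', 'score': 0.99...}]
```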
Markov Chain Monte Carlo Methods –
Sampling
• Markov Chain Monte Carlo (MCMC) is a family of algorithms used to sample from a
probability distribution, especially when direct sampling is difficult or impossible.
• It works by constructing a Markov chain whose stationary distribution is the target
distribution. The Markov chain property ensures that the next sample depends only on
the current sample, not the entire history.
• MCMC methods are a family of algorithms that use Markov chains to perform Monte Carlo estimation.
• The name gives us a hint that it is composed of two components – Monte Carlo and Markov Chain. Let us understand them separately and in their combined form.
• Instead of sampling independently, MCMC constructs a Markov Chain whose
stationary distribution is the target distribution. By simulating the chain for a long
time, you obtain samples approximately distributed according to the target.
Monte Carlo Sampling
• A Markov chain is a stochastic process where the next state depends only on the current state.
• Monte Carlo sampling is a technique for sampling from a probability distribution and using those samples to approximate a desired quantity. In other words, it uses randomness to estimate some deterministic quantity of interest.
• Say we have an expectation to estimate; this could be a highly complex integral, or even intractable to compute. Using the Monte Carlo method, we approximate such quantities by averaging over samples:
E[f(x)] = ∫ f(x) p(x) dx ≈ (1/N) Σᵢ f(xᵢ), where the xᵢ are N samples drawn from p(x)
(the original expectation to be calculated on the left, and the approximation generated by simulating a large number of samples of f(x) on the right).
• Computing the average over a large number of samples reduces the standard error and provides a fairly accurate approximation.
• "This method has a limitation, for it assumes to easily sample from a probability
distribution, however doing so is not always possible. Sometimes, we can’t even
sample from the distribution. In such cases, we make use of Markov chains to
efficiently sample from an intractable probability distribution."
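A minimal sketch of such a plain Monte Carlo estimate, for the (assumed, illustrative) case of E[x²] with x drawn from a standard normal distribution, whose true value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_expectation(f, sampler, n_samples=100_000):
    """Approximate E[f(x)] by averaging f over samples drawn from p(x)."""
    samples = sampler(n_samples)
    return np.mean(f(samples))

# Example: E[x^2] under a standard normal distribution (true value = 1).
estimate = monte_carlo_expectation(lambda x: x**2,
                                   lambda n: rng.standard_normal(n))
print(estimate)   # close to 1.0 for large n_samples
```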
Sampling
• Sampling is a way to approximately estimate certain characteristics of a whole population by studying a subset of that population.
Sampling has various use cases :
• It could be used to approximate an intractable sum or integral.
• It could provide a significant speedup when estimating a tractable but costly sum or integral.
• In some cases, such as density estimation, it can simply be used to approximate a probability distribution and then impute missing data.
• Few Sampling techniques – Ancestral Sampling, Inverse Transform Sampling,
Rejection Sampling, Importance Sampling, Monte Carlo Sampling, MCMC Sampling.
• The goal is to sample from the posterior distribution (in Bayesian
methods) or from a joint distribution when exact computation (e.g.,
normalization) is intractable.
Common Sampling Strategies:
• Direct Sampling (if the distribution is simple)
• Rejection Sampling
• Importance Sampling
• MCMC Sampling (when above methods fail due to complexity)
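As a hedged sketch of one of the strategies above, the code below implements basic rejection sampling: candidates are drawn from an easy proposal q(x) and accepted with probability p(x) / (M·q(x)), where M bounds p/q. The target density here is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def rejection_sample(target_pdf, proposal_sampler, proposal_pdf, M, n_samples):
    """Draw samples from target_pdf using the envelope M * proposal_pdf."""
    samples = []
    while len(samples) < n_samples:
        x = proposal_sampler()                          # candidate from the proposal q(x)
        u = rng.uniform()                               # uniform draw for the accept/reject test
        if u < target_pdf(x) / (M * proposal_pdf(x)):
            samples.append(x)                           # accept with prob. p(x) / (M q(x))
    return np.array(samples)

# Example: target p(x) = 2x on [0, 1], sampled with a uniform proposal.
target = lambda x: 2 * x if 0 <= x <= 1 else 0.0
samples = rejection_sample(target,
                           proposal_sampler=lambda: rng.uniform(0, 1),
                           proposal_pdf=lambda x: 1.0,
                           M=2.0,
                           n_samples=5000)
print(samples.mean())   # should be close to E[x] = 2/3
```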
Algorithms:
• Metropolis-Hastings: Uses an acceptance rule to decide
whether to accept a proposed sample.
• Gibbs Sampling: A special case where you sample each
variable conditional on all others, used when conditionals
are easier to sample from.
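A minimal random-walk Metropolis-Hastings sketch, sampling from a standard normal target given only its unnormalized log-density (the Gaussian proposal and step size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_hastings(log_target, x0, n_steps, proposal_std=1.0):
    """Random-walk Metropolis-Hastings: accept a move with prob. min(1, p(x')/p(x))."""
    x = x0
    chain = []
    for _ in range(n_steps):
        x_new = x + rng.normal(scale=proposal_std)      # propose from the proposal distribution
        # Acceptance test in log space for numerical stability (symmetric proposal).
        if np.log(rng.uniform()) < log_target(x_new) - log_target(x):
            x = x_new                                   # accept the proposed state
        chain.append(x)                                 # otherwise keep the current state
    return np.array(chain)

# Example: sample from a standard normal (log density up to a constant).
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_steps=20_000)
print(chain[5000:].mean(), chain[5000:].std())   # roughly 0 and 1 after burn-in
```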
Graphical Models
• Graphical models use graphs to represent the conditional dependence
structure between random variables.
Types:
• Directed: Bayesian Networks (DAGs)
• Undirected: Markov Random Fields (MRFs)
They provide a compact representation of joint distributions, allow
efficient inference, and help visualize relationships.
Bayesian Networks (BNs)
• A Bayesian Network is a directed acyclic graph (DAG) where nodes
represent variables and edges indicate direct probabilistic
dependencies.
Key Features:
• Each node has a conditional probability distribution (CPD) given its
parents.
• Joint distribution is factorized as:
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi ∣ Parents(Xi))
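A tiny sketch of this factorization in Python for a hypothetical network A → B, A → C (the same structure as the medical-diagnosis example later in this module); all probability values are made up purely for illustration:

```python
# Hypothetical CPTs for a network A -> B, A -> C (all numbers are illustrative).
P_A = {True: 0.01, False: 0.99}                       # P(A)
P_B_given_A = {True: {True: 0.95, False: 0.05},       # P(B | A)
               False: {True: 0.10, False: 0.90}}
P_C_given_A = {True: {True: 0.80, False: 0.20},       # P(C | A)
               False: {True: 0.15, False: 0.85}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the factorization prod_i P(X_i | Parents(X_i))."""
    return P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c]

print(joint(True, True, True))    # P(A=T, B=T, C=T) = 0.01 * 0.95 * 0.80
```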
Markov Random Fields (MRFs)
• An undirected graphical model where nodes represent variables and
edges encode conditional independence assumptions.
Key Features
• Joint distribution is represented using potential functions over cliques
(fully connected subgraphs):
• Widely used in computer vision, image processing, and spatial
statistics.
P(X) = (1/Z) ∏_{C ∈ Cliques} ϕ_C(X_C)
where Z is the normalization constant (partition function).
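A minimal sketch of this for a two-node binary MRF with a single pairwise clique potential (the potential values are made up for illustration):

```python
from itertools import product

# One clique {X1, X2} with a pairwise potential favoring agreement (illustrative values).
def phi(x1, x2):
    return 2.0 if x1 == x2 else 1.0

states = [0, 1]
# Partition function Z: sum over all configurations of the product of clique potentials.
Z = sum(phi(x1, x2) for x1, x2 in product(states, repeat=2))

def p(x1, x2):
    """P(X1=x1, X2=x2) = (1/Z) * product over cliques of phi_C."""
    return phi(x1, x2) / Z

print(Z)          # 2 + 1 + 1 + 2 = 6
print(p(0, 0))    # 2/6 ≈ 0.333
```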
The following example ties together:
• Graphical models (Bayesian Networks)
• Sampling and inference
• MCMC (specifically, Gibbs Sampling)
Consider, for example, Medical Diagnosis.
Suppose you have a Bayesian Network that models the relationships
among the following binary variables:
A: Has a disease (True/False)
B: Test result (Positive/Negative)
C: Has symptoms (Present/Absent)
The dependencies:
• A → B
• A → C
Medical Diagnosis Example
The Bayesian Network can be formed as follows:
    A
   / \
  B   C
Each variable has a conditional probability table (CPT):
• P(A)
• P(B ∣ A)
• P(C ∣ A)
Goal: compute the posterior P(A ∣ B=Positive, C=Present).
Direct computation is hard if the network is large or the CPTs are complex.
So, we use MCMC (specifically, Gibbs Sampling) to approximate the posterior.
Cont...
1. Initialize all variables randomly (e.g., A=False, B=Positive, C=Present).
   Note: since B and C are observed, we fix them during sampling.
2. Loop: at each step, sample only the unobserved variables, which is just A here.
   To sample A, use the conditional distribution
   P(A ∣ B, C) ∝ P(A) · P(B ∣ A) · P(C ∣ A)
   Since B and C are fixed, we compute this for both A=True and A=False, normalize, and sample a new value of A.
3. Repeat many times (e.g., 10,000 iterations), discarding early samples (burn-in).
4. Estimate the posterior: after sampling, compute
   P̂(A=True ∣ B=Positive, C=Present) = (number of samples where A=True) / (total samples after burn-in)
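A minimal sketch of this Gibbs-sampling loop in Python, reusing the hypothetical CPT values from the Bayesian-network sketch above (the numbers are illustrative, not clinical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical CPTs (illustrative values only).
P_A = {True: 0.01, False: 0.99}
P_B_given_A = {True: 0.95, False: 0.10}   # P(B=Positive | A)
P_C_given_A = {True: 0.80, False: 0.15}   # P(C=Present  | A)

def gibbs_posterior(n_iters=10_000, burn_in=1_000):
    """Estimate P(A=True | B=Positive, C=Present) by Gibbs sampling over A."""
    a = False                                   # initial value; B and C stay fixed (observed)
    count_true, kept = 0, 0
    for t in range(n_iters):
        # Unnormalized P(A | B=Positive, C=Present) ∝ P(A) * P(B|A) * P(C|A)
        w_true = P_A[True] * P_B_given_A[True] * P_C_given_A[True]
        w_false = P_A[False] * P_B_given_A[False] * P_C_given_A[False]
        p_true = w_true / (w_true + w_false)    # normalize over A=True / A=False
        a = rng.uniform() < p_true              # sample a new value of A
        if t >= burn_in:                        # discard burn-in samples
            kept += 1
            count_true += int(a)
    return count_true / kept

print(gibbs_posterior())   # Monte Carlo estimate of P(A=True | evidence)
```

With only one unobserved variable, each step samples A directly from its exact conditional; with more hidden variables, the loop would cycle through them, resampling each given the current values of the others.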
Concept | Type | Key Feature
MCMC | Sampling method | Uses Markov chains to sample from complex distributions
Sampling | General concept | Drawing samples from a distribution
Proposal Distribution | MCMC component | Suggests new candidate states
Metropolis-Hastings / Gibbs | MCMC algorithms | Methods to simulate Markov chains
Graphical Models | Modeling tool | Represent probabilistic dependencies with graphs
Bayesian Networks | Directed graph | Encode causal/conditional dependencies
Markov Random Fields | Undirected graph | Encode symmetric relationships using potentials
Case Studies: Real-World ML Applications
Case 1: Google DeepMind – Diabetic Retinopathy Detection
• Diabetic retinopathy is a leading cause of blindness, and early
detection is crucial.
• DeepMind developed an AI model that analyzes eye images to detect
signs of the disease.
• The system achieved expert-level accuracy, enabling faster and more
scalable diagnoses, especially in underserved areas.
Case 2: PayPal – Fraud Detection
• Problem: Online transactions are vulnerable to fraud, leading to
financial losses.
• Solution: PayPal implemented a machine learning system that
analyzes millions of transactions in real time to detect anomalies.
• Results: The AI-driven approach significantly improved fraud
detection, reducing unauthorized transactions.
Future Trends in Machine Learning
• Quantum Computing & ML – Quantum algorithms will enable faster
and more complex computations, revolutionizing areas like
cryptography and optimization.
• Edge Computing for Real-Time Analytics – Processing data closer to its
source will enhance efficiency in applications like autonomous
vehicles and industrial automation.
• AI-Driven Predictive Analytics – Improved forecasting models will
provide deeper insights in finance, healthcare, and climate science.
• Self-Learning Robots – Robotics will advance with AI-driven
adaptability, making machines more efficient in manufacturing and
logistics.
• Autonomous Transportation – Machine learning will continue to refine
navigation and safety in self-driving vehicles.
• Autonomous Agents – AI-powered agents will become more prevalent,
handling complex tasks independently.