1) Markov models and hidden Markov models describe systems that transition between states based on probabilities, where the next state depends only on the current state.
2) Markov models assume each state corresponds to a directly observable event, while hidden Markov models allow states to be hidden and observations to depend probabilistically on the current state.
3) In a Markov model, the transition probabilities and initial state probabilities can be collected into a transition matrix and an initial distribution, from which the probability of any state sequence can be calculated.
2. Time-based Models
• Simple parametric distributions are typically based on what is called the “independence assumption”: each data point is independent of the others, and there is no time-sequencing or ordering.
• What if the data has correlations based on its order, like a time-series?
3. States
• An atomic event is an assignment to every random variable in the domain.
• States are atomic events that can transition from one to another.
• Suppose a model has n states.
• A state-transition diagram describes how the model behaves.
4. State-transition
Following assumptions:
- Transition probabilities are stationary
- The event space does not change over time
- Probability distribution over next states depends only on the current state
5. State-transition
The last assumption, that the probability distribution over next states depends only on the current state, is the Markov Assumption.
6. Markov random processes
• A random sequence has the Markov property if its distribution is determined solely by its current state.
• Any random process having this property is called a Markov random process.
• A system with states that obey the Markov assumption is called a Markov Model.
• A sequence of states resulting from such a model is called a Markov Chain.
7. Chain Rule & Markov Property
Chain rule (repeated application of the product rule):
P(q_t, q_{t-1}, …, q_1) = P(q_t | q_{t-1}, …, q_1) P(q_{t-1}, …, q_1)
                        = P(q_t | q_{t-1}, …, q_1) P(q_{t-1} | q_{t-2}, …, q_1) P(q_{t-2}, …, q_1)
                        = P(q_1) ∏_{i=2}^{t} P(q_i | q_{i-1}, …, q_1)
Markov property:
P(q_i | q_{i-1}, …, q_1) = P(q_i | q_{i-1}) for i > 1
so that
P(q_t, q_{t-1}, …, q_1) = P(q_1) ∏_{i=2}^{t} P(q_i | q_{i-1}) = P(q_1) P(q_2 | q_1) … P(q_t | q_{t-1})
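The factored form above lends itself to direct computation. A minimal numeric sketch; the two-state chain and its probabilities are invented for illustration:

```python
# Markov factorization: P(q_t, ..., q_1) = P(q_1) * prod_i P(q_i | q_{i-1}).
# States are 0 and 1; the numbers below are illustrative assumptions.
P_init = [0.6, 0.4]            # P(q_1 = s)
P_trans = [[0.7, 0.3],         # P_trans[prev][cur] = P(q_i = cur | q_{i-1} = prev)
           [0.2, 0.8]]

def sequence_probability(states):
    """Probability of a state sequence under the Markov factorization."""
    p = P_init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= P_trans[prev][cur]
    return p

# P(0, 0, 1) = 0.6 * 0.7 * 0.3 = 0.126
print(sequence_probability([0, 0, 1]))
```

As a sanity check, the probabilities of all sequences of a fixed length sum to 1, exactly as the chain rule requires.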
8. Markov Assumption
• The Markov assumption states that the probability of the occurrence of word w_i at time t depends only on the occurrence of word w_{i-1} at time t-1
– Chain rule:
P(w_1, …, w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_1, …, w_{i-1})
– Markov assumption:
P(w_1, …, w_n) ≈ P(w_1) ∏_{i=2}^{n} P(w_i | w_{i-1})
9. Andrei Andreyevich Markov
Born: 14 June 1856 in Ryazan, Russia
Died: 20 July 1922 in Petrograd (now St Petersburg), Russia
Markov is particularly remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.
10. A Markov System
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, …
[Diagram: three states s1, s2, s3; N=3, t=0]
11. A Markov System
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, …
On the t’th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}
[Diagram: current state highlighted; N=3, t=0, qt = q0 = s3]
12. A Markov System
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, …
On the t’th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}
Between each timestep, the next state is chosen randomly.
[Diagram: current state highlighted; N=3, t=1, qt = q1 = s2]
13. A Markov System (transition probabilities)
Has N states, called s1, s2 .. sN
There are discrete timesteps, t=0, t=1, …
On the t’th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}
Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state.
P(qt+1=s1|qt=s1) = 0     P(qt+1=s1|qt=s2) = 1/2   P(qt+1=s1|qt=s3) = 1/3
P(qt+1=s2|qt=s1) = 0     P(qt+1=s2|qt=s2) = 1/2   P(qt+1=s2|qt=s3) = 2/3
P(qt+1=s3|qt=s1) = 1     P(qt+1=s3|qt=s2) = 0     P(qt+1=s3|qt=s3) = 0
[Diagram: N=3, t=1, qt = q1 = s2]
14. A Markov System (transitions as arcs)
The same transition probabilities are often notated with arcs between states:
[Diagram: arcs labeled s1→s3: 1, s2→s1: 1/2, s2→s2: 1/2, s3→s1: 1/3, s3→s2: 2/3; N=3, t=1, qt = q1 = s2]
15. Markov Property
qt+1 is conditionally independent of {qt-1, qt-2, … q1, q0} given qt. In other words:
P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)
Notation:
aij = P(qt+1 = sj | qt = si)
πi = P(q1 = si)
[Diagram as on the previous slide: transition probabilities on the arcs; N=3, t=1, qt = q1 = s2]
16. Markov Property
The same slide, with the notation labeled:
aij = P(qt+1 = sj | qt = si)   (the transition probability)
πi = P(q1 = si)   (the initial probability)
17. Example: A Simple Markov Model For Weather Prediction
• Any given day, the weather can be described as being in one of three states:
– State 1: precipitation (rain, snow, hail, etc.)
– State 2: cloudy
– State 3: sunny
Transitions between states are described by the transition matrix. This model can then be described by the following directed graph.
[Transition matrix and directed graph shown as a figure]
18. Basic Calculations
• Example: What is the probability that the weather for eight consecutive days is “sun-sun-sun-rain-rain-sun-cloudy-sun”?
• Solution:
O = sun sun sun rain rain sun cloudy sun
  =  3   3   3   1    1    3   2      3
so P(O | model) = π3 · a33 · a33 · a31 · a11 · a13 · a32 · a23
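With concrete numbers the product is easy to evaluate. The slide's transition matrix was in the lost figure; the values below are the ones traditionally used for this weather example in Rabiner's tutorial, so treat them as an assumption here:

```python
# Assumed transition matrix (Rabiner's classic weather example);
# states: 1 = rain, 2 = cloudy, 3 = sunny. A[i-1][j-1] = P(j | i).
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

# O = sun sun sun rain rain sun cloudy sun
O = [3, 3, 3, 1, 1, 3, 2, 3]

p = 1.0  # pi_3 taken as 1: day 1 is observed to be sunny
for prev, cur in zip(O, O[1:]):
    p *= A[prev - 1][cur - 1]

print(p)  # 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.536e-4
```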
19. From Markov To Hidden Markov
• The previous model assumes that each state can be uniquely associated with an observable event
– Once an observation is made, the state of the system is then trivially retrieved
– This model, however, is too restrictive to be of practical use for most realistic problems
• To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state
– Each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state
– These are known as Hidden Markov Models (HMM), because the state sequence is not directly observable; it can only be approximated from the sequence of observations produced by the system
20. The coin-toss problem
• To illustrate the concept of an HMM, consider the following scenario:
– Assume that you are placed in a room with a curtain
– Behind the curtain there is a person performing a coin-toss experiment
– This person selects one of several coins, and tosses it: heads (H) or tails (T)
– The person tells you the outcome (H, T), but not which coin was used each time
• Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}
– The coins represent the states; these are hidden because you do not know which coin was tossed each time
– The outcome of each toss represents an observation
– A “likely” sequence of coins may be inferred from the observations, but this state sequence will not be unique
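The curtained experiment can be sketched as a small generative simulation. The two coins, their biases, and the switching probability below are all made-up illustrative numbers, not part of the slide:

```python
import random

# Hypothetical setup: two biased coins (hidden states); the spectator
# sees only the H/T outcomes, never which coin was tossed.
coins = {'fair': 0.5, 'biased': 0.8}   # P(heads) per coin (assumed)
switch = 0.3                           # P(changing coins between tosses) (assumed)

def toss_sequence(n, rng):
    """Generate n tosses: hidden coin choices and observed outcomes."""
    states, obs = [], []
    coin = rng.choice(list(coins))
    for _ in range(n):
        states.append(coin)
        obs.append('H' if rng.random() < coins[coin] else 'T')
        if rng.random() < switch:
            coin = 'biased' if coin == 'fair' else 'fair'
    return states, obs

states, obs = toss_sequence(10, random.Random(42))
print(obs)     # the spectator sees only this
print(states)  # hidden behind the curtain
```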
21. The Coin Toss Example – 1 coin
• With a single coin, the Markov model is observable
• In fact, we may describe the system with a deterministic model where the states are the actual observations (see figure)
• The model parameter P(H) may be found from the ratio of heads and tails
• O = H H H T T H…
• S = 1 1 1 2 2 1…
24. From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins
25. Hidden model
• As spectators, we cannot tell which coin is being used; all we can observe is the output (head/tail)
• We assume the outputs are based on coin tendencies (output probabilities)
26. Coin Toss Example
[Diagram: hidden state variables (coins) C1, C2, …, Ci, …, CL-1, CL, each emitting observed data (“output” = heads/tails) P1, P2, …, Pi, …, PL-1, PL]
27. Hidden Markov Models
• Used when states cannot be observed directly; good for noisy data
• Requirements:
– A finite number of states, each with an output probability distribution
– State transition probabilities
– An observed phenomenon, which can be randomly generated given state-associated probabilities
28. HMM Notation (from Rabiner’s Survey*)
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
The states are labeled S1 S2 .. SN
For a particular trial…
Let T be the number of observations; T is also the number of states passed through
O = O1 O2 .. OT is the sequence of observations
Q = q1 q2 .. qT is the notation for a path of states
λ = 〈N, M, {πi}, {aij}, {bi(j)}〉 is the specification of an HMM
29. HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of:
• N, the number of states
• M, the number of possible observations
• {π1, π2, .. πN}, the starting state probabilities: P(q0 = Si) = πi
• The state transition probabilities, arranged as the N×N matrix
  a11 a12 … a1N
  a21 a22 … a2N
  :   :      :
  aN1 aN2 … aNN
  where P(qt+1 = Sj | qt = Si) = aij
• The observation probabilities, arranged as the N×M matrix
  b1(1) b1(2) … b1(M)
  b2(1) b2(2) … b2(M)
  :     :        :
  bN(1) bN(2) … bN(M)
  where P(Ot = k | qt = Si) = bi(k)
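The 5-tuple can be transcribed directly into code. A sketch; the concrete numbers are illustrative assumptions, not from the slide:

```python
from dataclasses import dataclass

# A direct transcription of the 5-tuple lambda = (N, M, pi, A, B).
@dataclass
class HMM:
    N: int      # number of states
    M: int      # number of possible observations
    pi: list    # pi[i] = P(q_1 = S_i)
    A: list     # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
    B: list     # B[i][k] = P(O_t = k | q_t = S_i)

# Illustrative two-state, two-symbol model.
model = HMM(
    N=2, M=2,
    pi=[0.6, 0.4],
    A=[[0.7, 0.3], [0.4, 0.6]],
    B=[[0.9, 0.1], [0.2, 0.8]],
)

# Sanity checks: pi and each row of A and B must sum to 1.
assert abs(sum(model.pi) - 1) < 1e-9
assert all(abs(sum(row) - 1) < 1e-9 for row in model.A + model.B)
```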
30. Assumptions
• Markov assumption
– States depend only on the previous state
• Stationary assumption
– Transition probabilities are independent of time (“memoryless”)
• Output independence
– Observations are independent of previous observations
31. The three main questions on HMMs
• Evaluation
– What is the probability that the observations were generated by a given model?
• Decoding
– Given a model and a sequence of observations, what is the most likely state sequence?
• Learning
– Given a model and a sequence of observations, how should we modify the model parameters to maximize P(observations | model)?
32. The three main questions on HMMs
1. Evaluation
GIVEN an HMM M and a sequence x,
FIND Prob[ x | M ]
2. Decoding
GIVEN an HMM M and a sequence x,
FIND the sequence π of states that maximizes P[ x, π | M ]
3. Learning
GIVEN an HMM M with unspecified transition/emission probs. and a sequence x,
FIND parameters θ = (bi(.), aij) that maximize P[ x | θ ]
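The Evaluation problem is classically solved with the forward algorithm, which computes Prob[ x | M ] by dynamic programming instead of enumerating all state paths. A compact sketch; the two-state model numbers are invented for illustration:

```python
def forward(pi, A, B, obs):
    """Forward algorithm: alpha[i] = P(o_1..o_t, q_t = S_i).

    pi[i] = P(q_1 = S_i), A[i][j] = P(S_j | S_i),
    B[i][k] = P(observation k | S_i). Returns P(O | lambda).
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Illustrative two-state, two-symbol model (numbers are assumptions).
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

print(forward(pi, A, B, [0, 1, 0]))  # 0.10893 for this model
```

The cost is O(N²T) rather than the O(Nᵀ) of naive path enumeration, which is why evaluation is tractable even with long observation sequences.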
33. Let’s not be confused by notation
P[ x | M ]: the probability that sequence x was generated by the model
The model is: architecture (#states, etc.) + parameters θ = aij, bi(.)
So P[ x | θ ] and P[ x ] are the same, when the architecture and the entire model, respectively, are implied
Similarly, P[ x, π | M ] and P[ x, π ] are the same
In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ that maximizes P[ x | θ ]
36. Description
Specification of an HMM:
• N - the number of states
– Q = {q1, q2, …, qN} - the set of states
• M - the number of symbols (observables)
– O = {o1, o2, …, oM} - the set of symbols
37. Description
Specification of an HMM (continued):
• A - the state transition probability matrix
– aij = P(qt+1 = j | qt = i)
• B - the observation probability distribution
– bj(k) = P(ot = k | qt = j), 1 ≤ k ≤ M
• π - the initial state distribution
38. HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of:
• N, the number of states
• M, the number of possible observations
• {π1, π2, …, πN}, the starting state probabilities:
P(q0 = Si) = πi
• A = [aij], the N×N matrix of state transition probabilities:
P(qt+1 = Sj | qt = Si) = aij
• B = [bi(k)], the N×M matrix of observation probabilities:
P(Ot = k | qt = Si) = bi(k)
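The 5-tuple above maps directly onto plain data structures. A minimal sketch in Python; the two-state model below is invented for illustration, not taken from the slides:

```python
# A hidden Markov model λ = (N, M, π, A, B) as plain data structures.
# The two-state, two-symbol model below is a made-up toy example.
N = 2                      # number of hidden states S1, S2
M = 2                      # number of observation symbols k ∈ {0, 1}
pi = [0.6, 0.4]            # pi[i] = P(q0 = Si)
A = [[0.7, 0.3],           # A[i][j] = P(q_{t+1} = Sj | q_t = Si)
     [0.4, 0.6]]
B = [[0.9, 0.1],           # B[i][k] = b_i(k) = P(O_t = k | q_t = Si)
     [0.2, 0.8]]

# Each row of A and B, and pi itself, is a probability distribution.
assert abs(sum(pi) - 1.0) < 1e-9
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A + B)
```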
41. Central problems in HMM modelling
• Problem 1 - Evaluation:
– Probability of occurrence of a particular
observation sequence, O = {o1,…,ok}, given the
model: P(O|λ)
– Complicated, because the states are hidden
– Useful in sequence classification
42. Central problems in HMM modelling
• Problem 2 - Decoding:
– Find the optimal state sequence to produce the given
observations, O = {o1,…,ok}, given the model
– Requires an optimality criterion
– Useful in recognition problems
43. Central problems in HMM modelling
• Problem 3 - Learning:
– Determine the optimum model, given a training set of
observations
– Find λ such that P(O|λ) is maximal
44. Task: Part-Of-Speech Tagging
• Goal: Assign the correct part-of-speech to
each word (and punctuation) in a text.
• Example:
Two old men bet on the game .
CRD ADJ NN VBD Prep Det NN SYM
• Learn a local model of POS dependencies,
usually from pretagged data
45. Hidden Markov Models
• Assume: POS tags are generated as a random process,
and each POS randomly generates a word
[Figure: a small HMM over the POS states Det, ADJ, NN, NNS, with
transition probabilities between the states and emission probabilities
for the words “a”, “the”, “cats”, “men”, “cat”, “bet”]
46. HMMs For Tagging
• First-order (bigram) Markov assumptions:
– Limited Horizon: Tag depends only on previous tag
P(ti+1 = tk | t1=tj1,…,ti=tji) = P(ti+1 = tk | ti = tj)
– Time invariance: No change over time
P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj tk)
• Output probabilities:
– Probability of getting word wk for tag tj: P(wk | tj)
– Assumption:
Not dependent on other tags or words!
47. Combining Probabilities
• Probability of a tag sequence:
P(t1t2…tn) = P(t1)P(t1t2)P(t2t3)…P(tn-1tn)
Assume t0 – starting tag:
= P(t0t1)P(t1t2)P(t2t3)…P(tn-1tn)
• Prob. of word sequence and tag sequence:
P(W,T) = Πi P(ti-1ti) P(wi | ti)
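The product P(W,T) = Πi P(ti-1ti) P(wi | ti) takes only a few lines to compute. A sketch; the tiny transition table `a` and emission table `b` are invented for illustration:

```python
# Joint probability of a word sequence W and tag sequence T:
#   P(W, T) = Π_i a(t_{i-1} t_i) * b(w_i | t_i),  with t_0 = START.
# These toy tables are hypothetical, not taken from the slides.
a = {("START", "Det"): 0.5, ("Det", "NN"): 0.9}   # transition probs
b = {("the", "Det"): 0.4, ("cat", "NN"): 0.1}     # emission probs

def joint_prob(words, tags):
    p = 1.0
    prev = "START"
    for w, t in zip(words, tags):
        p *= a[(prev, t)] * b[(w, t)]
        prev = t
    return p

print(joint_prob(["the", "cat"], ["Det", "NN"]))  # 0.5*0.4*0.9*0.1 ≈ 0.018
```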
48. Training from labeled training data
• Labeled training data = each word has a POS tag
• Thus:
π(tj) = PMLE(tj) = C(tj) / N
a(tjtk) = PMLE(tk | tj) = C(tj, tk) / C(tj)
b(wk | tj) = PMLE(wk | tj) = C(tj, wk) / C(tj)
• Smoothing applies as usual
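These MLE estimates are collected in one pass over the corpus. A sketch with an invented two-sentence toy corpus (no smoothing):

```python
from collections import Counter

# MLE training from a tagged corpus (no smoothing).
# `corpus` is a list of sentences of (word, tag) pairs; this toy
# corpus is invented for illustration.
corpus = [[("the", "Det"), ("cat", "NN")],
          [("the", "Det"), ("men", "NN"), ("bet", "VBD")]]

tag_count, bigram_count, emit_count = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "START"
    for word, tag in sent:
        tag_count[tag] += 1
        bigram_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        prev = tag
tag_count["START"] = len(corpus)   # START "occurs" once per sentence

def a(tj, tk):   # transition MLE: C(tj, tk) / C(tj)
    return bigram_count[(tj, tk)] / tag_count[tj]

def b(wk, tj):   # emission MLE: C(tj, wk) / C(tj)
    return emit_count[(tj, wk)] / tag_count[tj]

print(a("Det", "NN"))   # Det→NN in 2 of 2 Det occurrences → 1.0
print(b("the", "Det"))  # Det emits "the" 2 of 2 times → 1.0
```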
49. Three Basic Problems
• Compute the probability of a text:
Pλ(W1,N)
• Compute maximum probability tag
sequence:
arg maxT1,N Pλ(T1,N | W1,N)
• Compute maximum likelihood model
arg max λ Pλ(W1,N)
50. Problem 1: Naïve solution
• State sequence Q = (q1, …, qT)
• Assume independent observations:
P(O | q, λ) = ∏t=1..T P(ot | qt, λ) = bq1(o1) bq2(o2) … bqT(oT)
NB: Observations are mutually independent, given the
hidden states. (The joint distribution of independent
variables factorises into the marginal distributions of the
independent variables.)
51. Problem 1: Naïve solution (continued)
• Observe that:
P(q | λ) = πq1 aq1q2 aq2q3 … aqT−1qT
• And that:
P(O | λ) = Σq P(O | q, λ) P(q | λ)
52. Problem 1: Naïve solution (continued)
• Finally:
P(O | λ) = Σq P(O | q, λ) P(q | λ)
NB:
– The above sum is over all state paths
– There are N^T state paths, each ‘costing’ O(T)
calculations, leading to O(T·N^T) time complexity.
53. Problem 1: Efficient solution
Forward algorithm:
• Define the auxiliary forward variable α:
αt(i) = P(o1, …, ot, qt = i | λ)
αt(i) is the probability of observing the partial sequence of
observables o1,…,ot AND being in state i at time t
54. Problem 1: Efficient solution (continued)
• Recursive algorithm:
– Initialise:
α1(i) = πi bi(o1)
(partial obs seq to t = 1 AND state i at t = 1)
– Calculate:
αt+1(j) = [ Σi=1..N αt(i) aij ] bj(ot+1)
(α incorporates the partial obs seq to t) × (transition to j at t+1)
× (sensor model); the sum is needed because state j can be
reached from any preceding state
– Obtain:
P(O | λ) = Σi=1..N αT(i)
(sum over the different ways of getting the obs seq)
• Complexity is O(N²T)
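The initialise/calculate/obtain recursion above can be written in a few lines. A sketch of the forward algorithm; the toy parameters are invented for illustration:

```python
# Forward algorithm: O(N^2 T) computation of P(O | λ).
def forward(pi, A, B, obs):
    N = len(pi)
    # Initialise: α_1(i) = π_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: α_{t+1}(j) = [Σ_i α_t(i) a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Terminate: P(O|λ) = Σ_i α_T(i)
    return sum(alpha)

# Invented two-state toy model; symbols are 0 and 1.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][k] = b_i(k)
print(forward(pi, A, B, [0, 1, 0]))  # ≈ 0.10893
```

Brute-force summation over all N^T state paths gives the same number, just exponentially slower.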
55. Forward Algorithm
Define αk(i) = P(w1,k, tk = ti)
1. For i = 1 To N:
α1(i) = a(t0ti) b(w1 | ti)
2. For k = 2 To T; For j = 1 To N:
αk(j) = [ Σi αk−1(i) a(titj) ] b(wk | tj)
3. Then:
Pλ(W1,T) = Σi αT(i)
Complexity = O(N²T)
56. Forward Algorithm
[Figure: trellis for computing Pλ(W1,3) over the words w1 w2 w3 -
columns of forward values αk(i) for tags t1…t5, connected by the
transition probabilities a(titj), starting from a(t0ti)]
57. Problem 1: Alternative solution
Backward algorithm:
• Define the auxiliary backward variable β:
βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)
βt(i) is the probability of observing the sequence of
observables ot+1,…,oT given state qt = i at time t, and λ
58. Problem 1: Alternative solution (continued)
• Recursive algorithm:
– Initialise:
βT(i) = 1
– Calculate:
βt(i) = Σj=1..N aij bj(ot+1) βt+1(j),   for t = T−1, …, 1
– Terminate:
P(O | λ) = Σi=1..N πi bi(o1) β1(i)
• Complexity is O(N²T)
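The backward recursion can be sketched the same way; with identical parameters it must return the same P(O | λ) as the forward algorithm, which is a useful sanity check. The toy parameters are invented for illustration:

```python
# Backward algorithm: same O(N^2 T) cost, run right-to-left.
def backward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    beta = [1.0] * N                        # Initialise: β_T(i) = 1
    for t in range(T - 2, -1, -1):          # t = T-1, ..., 1
        # β_t(i) = Σ_j a_ij * b_j(o_{t+1}) * β_{t+1}(j)
        beta = [sum(A[i][j] * B[j][obs[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    # Terminate: P(O|λ) = Σ_i π_i * b_i(o_1) * β_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(N))

# Same invented two-state toy model as above.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(backward(pi, A, B, [0, 1, 0]))  # ≈ 0.10893, same as the forward pass
```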
59. Backward Algorithm
Define βk(i) = P(wk+1,N | tk = ti) -- note the difference!
1. For i = 1 To N:
βT(i) = 1
2. For k = T−1 To 1; For j = 1 To N:
βk(j) = [ Σi a(tjti) b(wk+1 | ti) βk+1(i) ]
3. Then:
Pλ(W1,T) = Σi a(t0ti) b(w1 | ti) β1(i)
Complexity = O(N²T)
61. Viterbi Algorithm (Decoding)
• Most probable tag sequence given text:
T* = arg maxT Pλ(T | W)
= arg maxT Pλ(W | T) Pλ(T) / Pλ(W)
(Bayes’ Theorem)
= arg maxT Pλ(W | T) Pλ(T)
(W is constant for all T)
= arg maxT Πi[a(ti-1ti) b(wi | ti) ]
= arg maxT Σi log[a(ti-1ti) b(wi | ti) ]
62. [Figure: Viterbi trellis over the words w1 w2 w3, with tags t1, t2, t3
at each position and starting tag t0]
Transition probabilities A(·,·):
      t1     t2      t3
t0    0.005  0.02    0.1
t1    0.02   0.1     0.005
t2    0.5    0.0005  0.0005
t3    0.05   0.05    0.005
Emission probabilities B(·,·):
      w1     w2     w3
t1    0.2    0.005  0.005
t2    0.02   0.2    0.0005
t3    0.02   0.02   0.05
63. The same trellis with −log10 probabilities; each node shows the best
path score reaching that tag after that word:
      w1    w2    w3
t1    -3.0  -6.0  -7.3
t2    -3.4  -4.7  -10.3
t3    -2.7  -6.7  -9.3
-log A:
      t1    t2    t3
t0    2.3   1.7   1
t1    1.7   1     2.3
t2    0.3   3.3   3.3
t3    1.3   1.3   2.3
-log B:
      w1    w2    w3
t1    0.7   2.3   2.3
t2    1.7   0.7   3.3
t3    1.7   1.7   1.3
64. Viterbi Algorithm
• D(0, START) = 0
• for each tag t ≠ START do: D(0, t) = −∞
• for i = 1 to N do:
for each tag tj do:
D(i, tj) = maxk [ D(i−1, tk) + log b(wi | tj) + log a(tktj) ]
• then log P(W, T) = maxj D(N, tj)
D(i, t) is the maximum log-probability of any tag sequence for
w1 … wi that ends in tag t; working in log space turns products
into sums and avoids numerical underflow
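The log-space recursion above can be sketched as follows. The transition table `a` and emission table `b` are invented for illustration; missing entries are floored at a tiny probability rather than −∞:

```python
import math

# Viterbi decoding in log space. D maps each tag to the best log
# probability of any tag path (for the words seen so far) ending
# in that tag; `back` stores back-pointers for path recovery.
def viterbi(words, tags, a, b):
    def la(tk, tj): return math.log(a.get((tk, tj), 1e-12))  # tiny floor
    def lb(w, tj):  return math.log(b.get((w, tj), 1e-12))
    D = {"START": 0.0}                    # D(0, START) = 0
    back = []
    for w in words:
        D_new, bp = {}, {}
        for tj in tags:
            tk = max(D, key=lambda t: D[t] + la(t, tj))
            D_new[tj] = D[tk] + la(tk, tj) + lb(w, tj)
            bp[tj] = tk
        back.append(bp)
        D = D_new
    path = [max(D, key=D.get)]            # best final tag
    for bp in reversed(back[1:]):         # follow back-pointers
        path.append(bp[path[-1]])
    return list(reversed(path))

# Invented toy tables, not the ones from slide 62.
a = {("START", "Det"): 0.5, ("Det", "NN"): 0.9, ("NN", "VBD"): 0.4}
b = {("the", "Det"): 0.4, ("cat", "NN"): 0.1, ("bet", "VBD"): 0.2}
print(viterbi(["the", "cat", "bet"], ["Det", "NN", "VBD"], a, b))
# ['Det', 'NN', 'VBD']
```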
66. Question: Suppose the sequence of our game is:
HHHTHHHTTHHTH?
[Figure: a two-state HMM with hidden states “fair” and “loaded”, each
emitting Heads or Tails; each state starts with probability 0.5, stays
in place with probability 0.9, and switches with probability 0.1]
What is the probability of the sequence given the model?
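The question is answered by the forward algorithm. The start and transition probabilities below follow the slide (0.5/0.5 start, stay 0.9, switch 0.1), but the emission probabilities are not given there, so the loaded coin's P(Heads) = 0.9 is an assumption made purely for illustration:

```python
# P(HHHTHHHTTHHTH | λ) via the forward algorithm for the fair/loaded
# coin model. ASSUMPTION: fair emits H/T with 0.5/0.5 and the loaded
# coin emits Heads with 0.9 - these emission values are not on the slide.
pi = [0.5, 0.5]                 # start in [fair, loaded]
A = [[0.9, 0.1], [0.1, 0.9]]    # stay with 0.9, switch with 0.1
B = [[0.5, 0.5], [0.9, 0.1]]    # rows: state; columns: [Heads, Tails]

obs = [0 if c == "H" else 1 for c in "HHHTHHHTTHHTH"]
alpha = [pi[i] * B[i][obs[0]] for i in range(2)]      # α_1
for o in obs[1:]:                                     # α_{t+1}
    alpha = [sum(alpha[i] * A[i][j] for i in range(2)) * B[j][o]
             for j in range(2)]
print(sum(alpha))               # P(sequence | model)
```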
67. Decoding
• Suppose we have a text written by Shakespeare
and a monkey. Can we tell who wrote what?
• Text: Shakespeare or Monkey?
• Case 1:
– Fehwufhweuromeojulietpoisonjigjreijge
• Case 2:
– mmmmbananammmmmmmbananammm