Copyright © 2005 by DataPath, Inc.
Probabilistic Modeling
Clay Stanek,
Steven Bottone, DataPath, Inc.
11 April 2008
Probabilistic Models
Most generally, we would like to make decisions based on
– Data we have observed
– Any previous knowledge we may have
Best framed in terms of a probabilistic model

P(Y|X)

where X is the data that has been observed (inputs) and Y is what you would like to infer or predict (outputs): the probability of each possible value of Y given the values of all X that have been observed
Example: what is the probability of a system outage over the next week given the state of the system today?
Decision Theory
A cost, or utility, is assigned to each possible outcome of the inference, Y
There may be a cost for each action you can take
Decision: choose the action which maximizes the expected utility, or minimizes the cost

Value(P(Y|X)) = max_{a∈A} Σ_y U(a,y) P(y|X)

where A is the set of all possible actions, with elements a.
Example: there is a probability distribution for a stock to rise, fall, or stay the same. Possible actions: buy the stock or leave the money in the bank.
Decision Theory
Action 1: buy stock
P(Y=Down) = .25, U = $500
P(Y=Stay) = .50, U = $1000
P(Y=Up) = .25, U = $2000
E[U] = (.25)($500) + (.50)($1000) + (.25)($2000) = $1125
Action 2: leave in bank
U = $1005
E[U] = (.25)($1005) + (.50)($1005) + (.25)($1005) = $1005
Decision: buy stock
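As a minimal sketch of the decision rule above, the snippet below (an illustration, not part of the original briefing) evaluates the expected utility of each action in the stock example and picks the better one.

```python
# Expected-utility decision rule: choose argmax_a sum_y U(a, y) P(y | X).
# Probabilities and utilities are taken from the stock example above.
p_y = {"Down": 0.25, "Stay": 0.50, "Up": 0.25}

utilities = {
    "buy stock":     {"Down": 500.0, "Stay": 1000.0, "Up": 2000.0},
    "leave in bank": {"Down": 1005.0, "Stay": 1005.0, "Up": 1005.0},
}

def expected_utility(action):
    return sum(utilities[action][y] * p for y, p in p_y.items())

best = max(utilities, key=expected_utility)
for action in utilities:
    print(f"{action}: E[U] = ${expected_utility(action):.2f}")
print("Decision:", best)   # buy stock ($1125 vs $1005)
```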
Types of Models Used in Decision Making
There are many types of models used in decision making
– Linear models for regression, Y is continuous
– Linear models for classification, Y is discrete
– Neural networks
– Kernel machines
– Support vector machines (SVM)
– Relevance vector machines (RVM)
– Bayesian Networks
Bayesian networks that have been augmented with decision and utility
nodes are called influence diagrams
The Importance of Data
This is not physics
– With so many variables it is generally not possible to know what
will happen from first principles
Use past data to estimate parameters in the model
– Supervised learning
Hope that past data reflects what will happen with future data
– Must be careful about over-fitting the model
What Is Important to Model Building
Data and expertise are most important in probabilistic model building
– The expert can use knowledge to choose relevant inference
variables, Y
– The expert can help determine what data is important and which
variables might depend on one another
[Diagram: Data and Expertise feeding into the model variables X and Y]
Outline
Probabilistic Graphical Models
– Bayesian Networks
Simple Bayesian Networks
Three Main Problems for Bayesian Networks
Sample Bayesian Networks
Monitoring and Control of Satellite Earth Terminals
Detailed Bayesian Network for Satellite Monitoring and Control System
Probabilistic Graphical Models
Marriage between probability theory and graph theory
– Deals with uncertainty and complexity
Nodes (vertices) of graph are random variables
Edges of graph are (conditional) probability distributions
Graphical structure related to (conditional) independence of nodes
(random variables)
– Appealing interface for humans
– Graph theory provides methods for efficient general-purpose
computation algorithms
[Diagram: two-node network A → B; the edge carries the conditional probability table P(B=b_i | A=a_j), a |B|×|A| table]
Bayesian Networks
Graphical model where all edges (arcs) of graph are directed and there
are no cycles
– DAG (directed acyclic graph)
– Direction hints of causal connection between nodes
Joint probability distribution determined from graph:

P(V) = ∏_{i=1}^{N} P(V_i | pa(V_i))

where the nodes are V = (V_1, ..., V_N) and pa(V_i) are the parents of V_i.
All marginal and conditional distributions can be determined, in principle, from the joint distribution
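A minimal sketch of this factorization for a toy three-node chain A → B → C; node names, states, and probabilities are illustrative assumptions, not from the briefing.

```python
# Joint probability P(V) = prod_i P(V_i | pa(V_i)) for a toy chain A -> B -> C.
# Each CPT maps (value, parent values...) -> probability.
cpts = {
    "A": {("T",): 0.3, ("F",): 0.7},                # P(A)
    "B": {("T", "T"): 0.9, ("F", "T"): 0.1,         # P(B | A)
          ("T", "F"): 0.2, ("F", "F"): 0.8},
    "C": {("T", "T"): 0.7, ("F", "T"): 0.3,         # P(C | B)
          ("T", "F"): 0.1, ("F", "F"): 0.9},
}
parents = {"A": [], "B": ["A"], "C": ["B"]}

def joint_probability(assignment):
    """P(V) as the product of the local conditional probabilities."""
    p = 1.0
    for node, cpt in cpts.items():
        key = (assignment[node],) + tuple(assignment[pa] for pa in parents[node])
        p *= cpt[key]
    return p

print(joint_probability({"A": "T", "B": "T", "C": "F"}))   # 0.3 * 0.9 * 0.3 = 0.081
```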
Simple Bayesian Networks
Serial Connection: A → B → C
Joint PDF: P(A,B,C) = P(A) P(B|A) P(C|B)
Cond Indep: P(C|B,A) = P(C|B)
C is independent of A given that B is known

Diverging Connection: B ← A → C
Joint PDF: P(A,B,C) = P(A) P(B|A) P(C|A)
Cond Indep: P(B|A,C) = P(B|A), P(C|A,B) = P(C|A)
B and C are independent given that A is known

Converging Connection: A → C ← B
Joint PDF: P(A,B,C) = P(A) P(B) P(C|A,B)
Cond Indep: P(A|B) = P(A)
A and B are independent unless C (or any of its descendants) is known
Explaining Away Effect (Berkson's Paradox)
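A minimal sketch of the explaining-away effect in a converging connection, using the classic sprinkler/rain/wet-grass example with illustrative numbers (assumptions, not the briefing's values): observing wet grass raises belief that the sprinkler was on, but learning that it also rained lowers that belief again.

```python
# Explaining away in A -> C <- B:
#   A = sprinkler was on, B = it rained, C = grass is wet.
from itertools import product

p_a = {True: 0.3, False: 0.7}           # P(A)
p_b = {True: 0.4, False: 0.6}           # P(B)
p_c = {                                  # P(C = True | A, B)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(a, b, c):
    pc = p_c[(a, b)] if c else 1.0 - p_c[(a, b)]
    return p_a[a] * p_b[b] * pc

def prob_a_given(evidence):
    """P(A = True | evidence) by brute-force enumeration over the joint."""
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        values = {"a": a, "b": b, "c": c}
        if any(values[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c)
        den += p
        if a:
            num += p
    return num / den

print(prob_a_given({"c": True}))             # ~0.53: wet grass suggests sprinkler
print(prob_a_given({"c": True, "b": True}))  # ~0.35: rain "explains away" the sprinkler
```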
Three Main Problems for Bayesian Networks
1. Given a graphical model, compute marginal and conditional
probability distributions, given evidence (inference)
2. Given a graphical structure and some data (with possibly missing
values), estimate unknown parameters for conditional probabilities
(learning probabilities)
3. Given some data (with possibly missing values) and some wisdom,
construct graphical structure and estimate unknown parameters for
conditional probabilities (learning structure)
Problem 1 for Bayesian Networks
Inference – Compute marginal probability distribution on unobserved
nodes given evidence on observed nodes
For discrete Bayesian networks, exact inference methods exist (Hugin)
– Elimination algorithm
– Sum-product algorithm
– Join (Junction) tree algorithm
Approximate inference using sampling methods
– Importance sampling
– Markov chain Monte Carlo (Gibbs sampling - BUGS)
Variational Methods
Bayesian inference is consistent
– All probabilities are ≥ 0 and sum to one
Problem 2 for Bayesian Networks
Learning Probabilities – Estimate parameters for local conditional
probability distributions given graphical structure and data (with
possibly missing values)
– Hidden nodes have no observed data
For discrete Bayesian networks with no missing data, probabilities
can be learned using unrestricted multinomial distributions
When data contains missing values
– Gibbs sampling (BUGS)
– Gaussian Approximations
– EM (Expectation-Maximization) Algorithm
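As a minimal sketch of the complete-data case above, maximum-likelihood estimates for an unrestricted multinomial CPT are just normalized counts; the records and variable names below are hypothetical.

```python
# Learn P(B | A) from complete data by counting and normalizing.
from collections import Counter

# Hypothetical complete records of (A, B) observations.
data = [("hot", "fail"), ("hot", "ok"), ("hot", "fail"),
        ("cold", "ok"), ("cold", "ok"), ("cold", "fail")]

counts = Counter(data)                       # N(A=a, B=b)
parent_counts = Counter(a for a, _ in data)  # N(A=a)

# Maximum-likelihood estimate: P(B=b | A=a) = N(a, b) / N(a)
cpt = {(a, b): n / parent_counts[a] for (a, b), n in counts.items()}

for (a, b), p in sorted(cpt.items()):
    print(f"P(B={b} | A={a}) = {p:.2f}")
```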
Problem 3 for Bayesian Networks
Learning Structure – Determine graphical structure and estimate
local conditional probability parameters given (possibly missing)
data
Most difficult problem
Complexity can be greatly reduced using expert experience and
physics
Bayesian approach – compare posterior distributions of various
candidate structures given data
– Model selection
– Selective model averaging
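A minimal sketch of score-based structure comparison, using BIC (a common approximation to the log posterior model score) to compare two candidate structures over two binary variables; the data and candidate structures are illustrative assumptions.

```python
# Compare two structures for binary A, B with BIC = log-likelihood - (k/2) log N.
import math
from collections import Counter

data = [(0, 0)] * 5 + [(1, 1)] * 5 + [(0, 1), (1, 0)]
N = len(data)

def bic_independent(data):
    """Structure 1: A and B independent. Parameters: P(A), P(B) -> k = 2."""
    pa = sum(a for a, _ in data) / N
    pb = sum(b for _, b in data) / N
    ll = sum(math.log((pa if a else 1 - pa) * (pb if b else 1 - pb))
             for a, b in data)
    return ll - (2 / 2) * math.log(N)

def bic_dependent(data):
    """Structure 2: A -> B. Parameters: P(A), P(B|A=0), P(B|A=1) -> k = 3."""
    pa = sum(a for a, _ in data) / N
    n_a = Counter(a for a, _ in data)
    n_ab = Counter(data)
    pb_given_a = {a: n_ab[(a, 1)] / n_a[a] for a in (0, 1)}
    ll = 0.0
    for a, b in data:
        p_b = pb_given_a[a] if b else 1 - pb_given_a[a]
        ll += math.log((pa if a else 1 - pa) * p_b)
    return ll - (3 / 2) * math.log(N)

print("BIC(A, B independent):", round(bic_independent(data), 3))
print("BIC(A -> B):          ", round(bic_dependent(data), 3))  # higher score wins
```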
Graphical Model for Monty Hall Problem
[Diagram: Bayesian network with nodes Prize and Door Chosen, both parents of Monty Opens; evidence (e) entered on Door Chosen and Monty Opens]

Prior probability for Prize, P(Prize):
– Door 1: 1/3
– Door 2: 1/3
– Door 3: 1/3

Add evidence: Door Chosen = Door 1
Add evidence: Monty Opens = Door 2

Probability for Prize given Door Chosen = Door 1 and Monty Opens = Door 2, P(Prize|DC=1,MO=2):
– Door 1: 1/3
– Door 2: 0
– Door 3: 2/3
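A minimal sketch that reproduces the slide's posterior by enumerating the joint distribution; the host-behavior model (Monty never opens the chosen or prize door, otherwise picks uniformly at random) is the standard assumption.

```python
# Monty Hall posterior P(Prize | DoorChosen = 1, MontyOpens = 2) by enumeration.
doors = [1, 2, 3]

def p_monty(monty, prize, chosen):
    """P(MontyOpens = monty | Prize = prize, DoorChosen = chosen)."""
    if monty == chosen or monty == prize:
        return 0.0                        # Monty never opens the chosen or prize door
    options = [d for d in doors if d not in (chosen, prize)]
    return 1.0 / len(options)             # otherwise he picks uniformly at random

# Sum the joint over the evidence DoorChosen = 1, MontyOpens = 2.
posterior = {prize: (1 / 3) * (1 / 3) * p_monty(2, prize, 1) for prize in doors}
total = sum(posterior.values())
posterior = {prize: p / total for prize, p in posterior.items()}

print(posterior)   # Door 1: 1/3, Door 2: 0, Door 3: 2/3, matching the slide
```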
Dynamic Bayesian Network: Hidden Markov Model or State Space Model

[Diagram: hidden-state chain Q1 → Q2 → Q3 → ... → QT with transition edges P(Q_t|Q_{t-1}); each hidden state Q_t emits an observation X_t with P(X_t|Q_t)]

Joint PDF:

P(X_{1:T}, Q_{1:T}) = P(Q_1) P(X_1|Q_1) ∏_{t=2}^{T} P(Q_t|Q_{t-1}) P(X_t|Q_t)
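As a minimal sketch of how this factorization is used for inference (illustrative two-state transition and emission numbers, not from the briefing), the forward recursion below computes the filtered state probabilities P(Q_t | X_{1:t}) and the likelihood P(X_{1:T}).

```python
# HMM forward recursion with per-step normalization for numerical stability.
import numpy as np

A = np.array([[0.9, 0.1],      # transition matrix P(Q_t = j | Q_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],      # emission matrix P(X_t = k | Q_t = i)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])      # initial distribution P(Q_1)

def forward(obs):
    """Return filtered P(Q_t | X_{1:t}) for each t and log P(X_{1:T})."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    filtered = [alpha]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
        filtered.append(alpha)
    return np.array(filtered), loglik

filtered, loglik = forward([0, 0, 1, 1, 1])
print(filtered)      # P(Q_t | observations up to t)
print(loglik)        # log P(X_{1:T})
```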
Graphical Model for Kalman Filter
[Diagram: state chain x1 → x2 → x3 → ... → xn with transition edges P(x_k|x_{k-1}); each state x_k emits a measurement z_k with P(z_k|x_k)]

x_k = A x_{k-1} + B u_{k-1} + w_{k-1}
z_k = H x_k + v_k
Dynamic Bayesian Networks (DBNs) are directed graphical models of stochastic
processes. They generalize hidden Markov models (HMMs) and
linear dynamical systems (LDSs) by representing the hidden (and observed) state
in terms of state variables, which can have complex interdependencies. The
graphical structure provides an easy way to specify these conditional
independencies, and hence to provide a compact parameterization of the model.
A Linear Dynamical System (LDS) has the same topology as an HMM, but all the nodes are assumed to have linear-Gaussian distributions.
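A minimal sketch of the corresponding Kalman filter predict/update cycle for this linear-Gaussian model, with illustrative 1-D position/velocity tracking values and the control term B u omitted for brevity (all numbers are assumptions, not from the slides).

```python
# Kalman filter for x_k = A x_{k-1} + w_{k-1},  z_k = H x_k + v_k
# (control input B u_{k-1} omitted in this sketch).
import numpy as np

A = np.array([[1.0, 1.0],      # constant-velocity transition (position, velocity)
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])     # only position is measured
Q = 0.01 * np.eye(2)           # process-noise covariance, cov(w)
R = np.array([[0.5]])          # measurement-noise covariance, cov(v)

x = np.zeros((2, 1))           # state estimate
P = np.eye(2)                  # estimate covariance

def kalman_step(x, P, z):
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [1.1, 2.0, 2.9, 4.2, 5.0]:
    x, P = kalman_step(x, P, np.array([[z]]))
print(x.ravel())               # estimated position and velocity
```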
Satellite Earth Terminal

[Diagram: satellite earth terminal with labeled components: LNA or LNB; TWTA, Klystron, or SSPA; MODEM]
Digital Satellite Uplink Chain
1. Digital data sent to modulator and converted to intermediate
frequency (L band, 70 – 140 MHz)
2. Intermediate frequency signal sent to upconverter and converted to
higher frequency (S, C, X, Ku, or Ka band, ≥ ~1000 MHz)
3. Noise removed and sent to high-power amplifier (HPA)
4. Amplified signal sent down waveguide to satellite dish
5. Dish emits high-frequency signal to satellite
Digital Satellite Downlink Chain
1. Satellite transmits signal that contains encoded data
2. Signal is received at satellite antenna dish
3. Signal is amplified through a low-noise amplifier (LNA) and fed to the downconverter
4. Downconverter converts high-frequency signal to intermediate
frequency
5. Intermediate frequency fed into demodulator and converted to digital
data
6. Data is sent to network through a router
Monitoring and Control of Satellite Earth
Terminals
Monitoring and control software systematically monitors state of each
piece of equipment
Wealth of information potentially available for predictive maintenance and
diagnosis
– Time history of each variable can be recorded
– All pieces of equipment
– All fielded systems
Data potentially much more extensive than available in lab tests
HPA often fails – are there precursors?
– Cathode current tends to drop before the filament burns out
– Helix current typically begins to rise before tube failure
– These rates may depend on environment, such as temperature
MaxView Monitoring and Control Software for Satellite
Uplink Chain
Bayesian Network for Satellite Earth Terminals
All nodes are discrete with finite number of states
Failure nodes for system and components
– Two states: fail and no fail
– Gives probability of failure over a fixed period of time
Nodes for measurable components are usually instantiated (have
evidence)
– e.g. helix and cathode currents in HPA
– Component failure may depend on trends in variables
– Time history of measurements needed for trend nodes
Most component nodes depend on environmental nodes
– e.g. temperatures, power, air conditioning
Conditional probabilities determined by data and by the manufacturer's prior lab tests of the equipment
M&C Bayesian Network with No Evidence
M&C Bayesian Network with Evidence – Temp = hot
Diagnosis
If the system fails, which component is most likely to have failed?
– Reduce time to find component responsible for failure
– Help determine order of replacing or checking equipment
– May take into account cost of replacement or time to replace
System failure node has fail state instantiated
Attempt to find most probable configuration of Bayesian network given
evidence
– Hugin max-propagation algorithm
– Determines most likely state of each unobserved node
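A minimal sketch of this kind of diagnosis as a brute-force search for the most probable configuration given evidence (Hugin's max-propagation does the same thing efficiently on a junction tree); the toy network and probabilities are assumptions, not the M&C model.

```python
# Most-probable-explanation (MPE) style diagnosis by enumeration over a toy
# network OutdoorTemp -> HPAFail -> SystemFail.
from itertools import product

p_temp = {"hot": 0.3, "normal": 0.7}
p_hpa_fail = {"hot": 0.2, "normal": 0.05}   # P(HPAFail = yes | OutdoorTemp)
p_sys_fail = {True: 0.9, False: 0.02}       # P(SystemFail = yes | HPAFail)

def joint(temp, hpa_fail, sys_fail):
    p = p_temp[temp]
    p *= p_hpa_fail[temp] if hpa_fail else 1 - p_hpa_fail[temp]
    p *= p_sys_fail[hpa_fail] if sys_fail else 1 - p_sys_fail[hpa_fail]
    return p

evidence = {"temp": "hot", "sys_fail": True}
best = max(
    (cfg for cfg in product(p_temp, [True, False], [True, False])
     if cfg[0] == evidence["temp"] and cfg[2] == evidence["sys_fail"]),
    key=lambda cfg: joint(*cfg),
)
print(best)   # most likely full configuration given the evidence: HPA failure
```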
Diagnosis of Component Failure – OutdoorTemp = hot,
SystemFailure = fail
Value of Information (VOI) Analysis
Which nodes, if measured, provide the most information in helping to
determine the most likely hypothesis?
A useful measure of the information in a discrete random variable, X, is the entropy, H(X)
– Most informative: one x_i has probability one, H = 0 (minimum H)
– Least informative: X uniformly distributed, H is maximum
H(X) = − ∑_{i=1}^{N} P(x_i) log P(x_i)
Entropy of the ID variable given various evidence:
– No evidence: H(ID) = 1.61
– BrgMeas=0-45: H(ID) = 1.58
– BrgMeas=0-45, Location=xyz: H(ID) = 1.52
– BrgMeas=0-45, Location=xyz, HotSpot=hot: H(ID) = 1.51
– BrgMeas=0-45, Location=xyz, HotSpot=hot, Freq=1000-1160: H(ID) = 0.98
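A minimal sketch of the entropy computation itself; the distributions are illustrative (as an aside, the slide's no-evidence value 1.61 ≈ ln 5, which is what a natural-log entropy of five equally likely ID states would give, though the slide does not say this).

```python
# Entropy H(X) = -sum_i P(x_i) log P(x_i) of a discrete distribution.
import math

def entropy(probs, base=2):
    """Entropy of a discrete distribution, skipping zero-probability states."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0]))               # 0.0  -> no uncertainty (minimum H)
print(entropy([0.2] * 5))                     # ~2.32 bits -> uniform over 5 states (maximum H)
print(entropy([0.2] * 5, base=math.e))        # ~1.61 nats
```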
Value of Information (VOI) Analysis
The mutual information between two variables, I(X,Y), is the amount by which the entropy of X is reduced given Y: H(X|Y) = H(X) − I(X,Y)

I(X,Y) = ∑_y ∑_x P(y) P(x|y) log [ P(x,y) / ( P(x) P(y) ) ]
Mutual information between ID and other measurable nodes, I(ID,Y)
Measurable Node, Y No Evidence
Mutual Information, I(ID,Y)
Frequency Known
Mutual Information, I(ID,Y|F)
Freq Measured 0.2522 0.0000
PRF Measured 0.1575 0.0090
PW Measured 0.1183 0.0140
Elevation Measured 0.0695 0.0816
Bearing Measured 0.0179 0.0056
PA Measured 0.0004 0.0059
ELINT Location 0.0000 0.0000
Hot Spot 0.0000 0.0000
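A minimal sketch of computing I(X,Y) directly from a joint probability table; the two-variable numbers are illustrative, not the radar example tabulated above.

```python
# Mutual information I(X,Y) = sum_{x,y} P(x,y) log [ P(x,y) / (P(x) P(y)) ].
import math

# Joint probabilities P(x, y) for two binary variables (illustrative).
joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def mutual_information(joint, base=2):
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]), base)
               for (x, y), p in joint.items() if p > 0)

print(mutual_information(joint))   # > 0 because X and Y are dependent here
```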
Software Tools for Graphical Methods
BUGS (Bayesian Inference Using Gibbs Sampling)
– Bayesian analysis using Markov chain Monte Carlo
– Powerful, can sample from a variety of continuous and discrete probability
distributions
Hugin Expert
– Easy to use, discrete distributions (some Gaussian)
– Exact inference, EM algorithm
WEKA (Waikato Environment for Knowledge Analysis)
– Machine learning algorithms for data mining
PNL – Intel's open source probabilistic networks library
MSBNx – Microsoft Bayesian Networks Editor/Toolkit
Learning Graphs
One needs to specify two things to describe a BN: the graph topology (structure) and
the parameters of each CPD.
It is possible to learn both of these from data. However, learning structure is much
harder than learning parameters. Also, learning when some of the nodes are
hidden, or we have missing data, is much harder than when everything is
observed.
This gives rise to 4 cases:

Structure   Observability   Method
Known       Full            Maximum likelihood estimation
Known       Partial         EM (or gradient ascent)
Unknown     Full            Search through model space
Unknown     Partial         EM + search through model space
Editor's Notes

  • #3–#8, #11, #31, #32 (the same note appears on each of these slides): Naïve Bayes is introduced to overcome the dimensionality problem and to exploit training data more efficiently. However, it goes from one extreme (fully dependent features) to another (many features independent). Common sense drives us to seek approximations that lie between these two extremes. The parents of an attribute (node) Vi are those features with directed links toward Vi and are members of the set Ai. In other words, Vi is conditionally independent of any combination of its nondescendants given its parents. Note we are estimating a joint pdf as the product of simpler terms; each of them involves, in general, a much smaller number of features than the original number. The complete specification of a BN requires: marginal probabilities of the root nodes (those without parents), and conditional probabilities of the nonroot nodes given their parents, for all possible combinations of their values. The joint pdf of the variables can then be obtained by multiplying all conditional probabilities with the prior probabilities of the root nodes. Need to order the variables such that every variable comes before its descendants in the related graph.
  • #12 Serial: A and C are d-separated given that B is known. We conclude that evidence may be transmitted through a serial connection unless the state of the variable in the connection is known. Diverging: imagine A = sex, B = hair length, and C = stature. If we do not know the sex of a person, seeing the length of his or her hair will tell us more about the sex, and this in turn will focus our belief on their stature. On the other hand, if we know the person is a man, then the length of hair doesn't help us with their stature. Converging connection: imagine A = sprinkler on?, B = rained last night, C = grass is wet (state). If we know C, and that it rained last night, it will reduce our belief that the sprinkler was on to make it wet.
  • #34 EM is ideally suited for the case in which the available data set is incomplete. The EM algorithm maximizes the expectation of the log-likelihood function, conditioned on the observed samples and the current iteration's estimate of theta.