Copyright © 2005 by DataPath, Inc.
Probabilistic Modeling
Clay Stanek,
Steven Bottone, DataPath, Inc.
11 April 2008
Probabilistic Models
Most generally, we would like to make decisions based on
– Data we have observed
– Any previous knowledge we may have
Best framed in terms of a probabilistic model

P(Y|X)

where X is the data that has been observed (inputs) and Y is what you would like to infer or predict (outputs): the probability of each possible value of Y given the values of all X that have been observed
Example: what is the probability of a system outage over the next week given the state of the system today?
Decision Theory
A cost, or utility, is assigned to each possible outcome of the inference, Y
There may be a cost for each action you can take
Decision: choose the action which maximizes the expected utility, or minimizes the cost

Value(P(Y|X)) = max_{a∈A} Σ_y U(a,y) P(y|X)

where A is the set of all possible actions, with elements a.
Example: there is a probability distribution for a stock to rise, fall, or stay the same. Possible actions: buy the stock or leave the money in the bank.
Decision Theory
Action 1: buy stock
P(Y=Down) = .25, U = $500
P(Y=Stay) = .50, U = $1000
P(Y=Up) = .25, U = $2000
E[U] = (.25)($500) + (.50)($1000) + (.25)($2000) = $1125
Action 2: leave in bank
U = $1005
E[U] = (.25)($1005) + (.50)($1005) + (.25)($1005) = $1005
Decision: buy stock
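As a minimal sketch of the decision rule above, the snippet below (an illustration, not part of the original briefing) evaluates the expected utility of each action in the stock example and picks the better one.

```python
# Expected-utility decision rule: choose argmax_a sum_y U(a, y) P(y | X).
# Probabilities and utilities are taken from the stock example above.
p_y = {"Down": 0.25, "Stay": 0.50, "Up": 0.25}

utilities = {
    "buy stock":     {"Down": 500.0, "Stay": 1000.0, "Up": 2000.0},
    "leave in bank": {"Down": 1005.0, "Stay": 1005.0, "Up": 1005.0},
}

def expected_utility(action):
    return sum(utilities[action][y] * p for y, p in p_y.items())

best = max(utilities, key=expected_utility)
for action in utilities:
    print(f"{action}: E[U] = ${expected_utility(action):.2f}")
print("Decision:", best)   # buy stock ($1125 vs $1005)
```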
Types of Models Used in Decision Making
There are many types of models used in decision making
– Linear models for regression, Y is continuous
– Linear models for classification, Y is discrete
– Neural networks
– Kernel machines
– Support vector machines (SVM)
– Relevance vector machines (RVM)
– Bayesian Networks
Bayesian networks that have been augmented with decision and utility
nodes are called influence diagrams
The Importance of Data
This is not physics
– With so many variables it is generally not possible to know what
will happen from first principles
Use past data to estimate parameters in the model
– Supervised learning
Hope that past data reflects what will happen with future data
– Must be careful about over-fitting the model
What Is Important to Model Building
Data and expertise are most important in probabilistic model building
– The expert can use knowledge to choose relevant inference
variables, Y
– The expert can help determine what data is important and which
variables might depend on one another
[Diagram: Data and Expertise feeding into the model variables X and Y]
Outline
Probabilistic Graphical Models
– Bayesian Networks
Simple Bayesian Networks
Three Main Problems for Bayesian Networks
Sample Bayesian Networks
Monitoring and Control of Satellite Earth Terminals
Detailed Bayesian Network for Satellite Monitoring and Control System
Probabilistic Graphical Models
Marriage between probability theory and graph theory
– Deals with uncertainty and complexity
Nodes (vertices) of graph are random variables
Edges of graph are (conditional) probability distributions
Graphical structure related to (conditional) independence of nodes
(random variables)
– Appealing interface for humans
– Graph theory provides methods for efficient general-purpose
computation algorithms
[Diagram: two-node network A → B; the edge carries the conditional probability table P(B=b_i | A=a_j), a |B|×|A| table]
Bayesian Networks
Graphical model where all edges (arcs) of graph are directed and there
are no cycles
– DAG (directed acyclic graph)
– Direction hints of causal connection between nodes
Joint probability distribution determined from graph:

P(V) = ∏_{i=1}^{N} P(V_i | pa(V_i))

where the nodes are V = (V_1, ..., V_N) and pa(V_i) are the parents of V_i.
All marginal and conditional distributions can be determined, in principle, from the joint distribution
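A minimal sketch of this factorization for a toy three-node chain A → B → C; node names, states, and probabilities are illustrative assumptions, not from the briefing.

```python
# Joint probability P(V) = prod_i P(V_i | pa(V_i)) for a toy chain A -> B -> C.
# Each CPT maps (value, parent values...) -> probability.
cpts = {
    "A": {("T",): 0.3, ("F",): 0.7},                # P(A)
    "B": {("T", "T"): 0.9, ("F", "T"): 0.1,         # P(B | A)
          ("T", "F"): 0.2, ("F", "F"): 0.8},
    "C": {("T", "T"): 0.7, ("F", "T"): 0.3,         # P(C | B)
          ("T", "F"): 0.1, ("F", "F"): 0.9},
}
parents = {"A": [], "B": ["A"], "C": ["B"]}

def joint_probability(assignment):
    """P(V) as the product of the local conditional probabilities."""
    p = 1.0
    for node, cpt in cpts.items():
        key = (assignment[node],) + tuple(assignment[pa] for pa in parents[node])
        p *= cpt[key]
    return p

print(joint_probability({"A": "T", "B": "T", "C": "F"}))   # 0.3 * 0.9 * 0.3 = 0.081
```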
Simple Bayesian Networks
Serial Connection: A → B → C
Joint PDF: P(A,B,C) = P(A) P(B|A) P(C|B)
Cond Indep: P(C|B,A) = P(C|B)
C is independent of A given that B is known

Diverging Connection: B ← A → C
Joint PDF: P(A,B,C) = P(A) P(B|A) P(C|A)
Cond Indep: P(B|A,C) = P(B|A), P(C|A,B) = P(C|A)
B and C are independent given that A is known

Converging Connection: A → C ← B
Joint PDF: P(A,B,C) = P(A) P(B) P(C|A,B)
Cond Indep: P(A|B) = P(A)
A and B are independent unless C (or any of its descendants) is known
Explaining Away Effect (Berkson's Paradox)
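A minimal sketch of the explaining-away effect in a converging connection, using the classic sprinkler/rain/wet-grass example with illustrative numbers (assumptions, not the briefing's values): observing wet grass raises belief that the sprinkler was on, but learning that it also rained lowers that belief again.

```python
# Explaining away in A -> C <- B:
#   A = sprinkler was on, B = it rained, C = grass is wet.
from itertools import product

p_a = {True: 0.3, False: 0.7}           # P(A)
p_b = {True: 0.4, False: 0.6}           # P(B)
p_c = {                                  # P(C = True | A, B)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(a, b, c):
    pc = p_c[(a, b)] if c else 1.0 - p_c[(a, b)]
    return p_a[a] * p_b[b] * pc

def prob_a_given(evidence):
    """P(A = True | evidence) by brute-force enumeration over the joint."""
    num = den = 0.0
    for a, b, c in product([True, False], repeat=3):
        values = {"a": a, "b": b, "c": c}
        if any(values[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c)
        den += p
        if a:
            num += p
    return num / den

print(prob_a_given({"c": True}))             # ~0.53: wet grass suggests sprinkler
print(prob_a_given({"c": True, "b": True}))  # ~0.35: rain "explains away" the sprinkler
```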
Three Main Problems for Bayesian Networks
1. Given a graphical model, compute marginal and conditional
probability distributions, given evidence (inference)
2. Given a graphical structure and some data (with possibly missing
values), estimate unknown parameters for conditional probabilities
(learning probabilities)
3. Given some data (with possibly missing values) and some wisdom,
construct graphical structure and estimate unknown parameters for
conditional probabilities (learning structure)
Problem 1 for Bayesian Networks
Inference – Compute marginal probability distribution on unobserved
nodes given evidence on observed nodes
For discrete Bayesian networks, exact inference methods exist (Hugin)
– Elimination algorithm
– Sum-product algorithm
– Join (Junction) tree algorithm
Approximate inference using sampling methods
– Importance sampling
– Markov chain Monte Carlo (Gibbs sampling - BUGS)
Variational Methods
Bayesian inference is consistent
– All probabilities are ≥ 0 and sum to one
Problem 2 for Bayesian Networks
Learning Probabilities – Estimate parameters for local conditional
probability distributions given graphical structure and data (with
possibly missing values)
– Hidden nodes have no observed data
For discrete Bayesian networks with no missing data, probabilities
can be learned using unrestricted multinomial distributions
When data contains missing values
– Gibbs sampling (BUGS)
– Gaussian Approximations
– EM (Expectation-Maximization) Algorithm
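As a minimal sketch of the complete-data case above, maximum-likelihood estimates for an unrestricted multinomial CPT are just normalized counts; the records and variable names below are hypothetical.

```python
# Learn P(B | A) from complete data by counting and normalizing.
from collections import Counter

# Hypothetical complete records of (A, B) observations.
data = [("hot", "fail"), ("hot", "ok"), ("hot", "fail"),
        ("cold", "ok"), ("cold", "ok"), ("cold", "fail")]

counts = Counter(data)                       # N(A=a, B=b)
parent_counts = Counter(a for a, _ in data)  # N(A=a)

# Maximum-likelihood estimate: P(B=b | A=a) = N(a, b) / N(a)
cpt = {(a, b): n / parent_counts[a] for (a, b), n in counts.items()}

for (a, b), p in sorted(cpt.items()):
    print(f"P(B={b} | A={a}) = {p:.2f}")
```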
Problem 3 for Bayesian Networks
Learning Structure – Determine graphical structure and estimate
local conditional probability parameters given (possibly missing)
data
Most difficult problem
Complexity can be greatly reduced using expert experience and
physics
Bayesian approach – compare posterior distributions of various
candidate structures given data
– Model selection
– Selective model averaging
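A minimal sketch of score-based structure comparison, using BIC (a common approximation to the log posterior model score) to compare two candidate structures over two binary variables; the data and candidate structures are illustrative assumptions.

```python
# Compare two structures for binary A, B with BIC = log-likelihood - (k/2) log N.
import math
from collections import Counter

data = [(0, 0)] * 5 + [(1, 1)] * 5 + [(0, 1), (1, 0)]
N = len(data)

def bic_independent(data):
    """Structure 1: A and B independent. Parameters: P(A), P(B) -> k = 2."""
    pa = sum(a for a, _ in data) / N
    pb = sum(b for _, b in data) / N
    ll = sum(math.log((pa if a else 1 - pa) * (pb if b else 1 - pb))
             for a, b in data)
    return ll - (2 / 2) * math.log(N)

def bic_dependent(data):
    """Structure 2: A -> B. Parameters: P(A), P(B|A=0), P(B|A=1) -> k = 3."""
    pa = sum(a for a, _ in data) / N
    n_a = Counter(a for a, _ in data)
    n_ab = Counter(data)
    pb_given_a = {a: n_ab[(a, 1)] / n_a[a] for a in (0, 1)}
    ll = 0.0
    for a, b in data:
        p_b = pb_given_a[a] if b else 1 - pb_given_a[a]
        ll += math.log((pa if a else 1 - pa) * p_b)
    return ll - (3 / 2) * math.log(N)

print("BIC(A, B independent):", round(bic_independent(data), 3))
print("BIC(A -> B):          ", round(bic_dependent(data), 3))  # higher score wins
```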
Graphical Model for Monty Hall Problem
[Diagram: Bayesian network with nodes Prize and Door Chosen, both parents of Monty Opens; evidence (e) entered on Door Chosen and Monty Opens]

Prior probability for Prize, P(Prize):
– Door 1: 1/3
– Door 2: 1/3
– Door 3: 1/3

Add evidence: Door Chosen = Door 1
Add evidence: Monty Opens = Door 2

Probability for Prize given Door Chosen = Door 1 and Monty Opens = Door 2, P(Prize|DC=1,MO=2):
– Door 1: 1/3
– Door 2: 0
– Door 3: 2/3
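A minimal sketch that reproduces the slide's posterior by enumerating the joint distribution; the host-behavior model (Monty never opens the chosen or prize door, otherwise picks uniformly at random) is the standard assumption.

```python
# Monty Hall posterior P(Prize | DoorChosen = 1, MontyOpens = 2) by enumeration.
doors = [1, 2, 3]

def p_monty(monty, prize, chosen):
    """P(MontyOpens = monty | Prize = prize, DoorChosen = chosen)."""
    if monty == chosen or monty == prize:
        return 0.0                        # Monty never opens the chosen or prize door
    options = [d for d in doors if d not in (chosen, prize)]
    return 1.0 / len(options)             # otherwise he picks uniformly at random

# Sum the joint over the evidence DoorChosen = 1, MontyOpens = 2.
posterior = {prize: (1 / 3) * (1 / 3) * p_monty(2, prize, 1) for prize in doors}
total = sum(posterior.values())
posterior = {prize: p / total for prize, p in posterior.items()}

print(posterior)   # Door 1: 1/3, Door 2: 0, Door 3: 2/3, matching the slide
```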
Dynamic Bayesian Network: Hidden Markov Model or State Space Model

[Diagram: hidden-state chain Q1 → Q2 → Q3 → ... → QT with transition edges P(Q_t|Q_{t-1}); each hidden state Q_t emits an observation X_t with P(X_t|Q_t)]

Joint PDF:

P(X_{1:T}, Q_{1:T}) = P(Q_1) P(X_1|Q_1) ∏_{t=2}^{T} P(Q_t|Q_{t-1}) P(X_t|Q_t)
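As a minimal sketch of how this factorization is used for inference (illustrative two-state transition and emission numbers, not from the briefing), the forward recursion below computes the filtered state probabilities P(Q_t | X_{1:t}) and the likelihood P(X_{1:T}).

```python
# HMM forward recursion with per-step normalization for numerical stability.
import numpy as np

A = np.array([[0.9, 0.1],      # transition matrix P(Q_t = j | Q_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],      # emission matrix P(X_t = k | Q_t = i)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])      # initial distribution P(Q_1)

def forward(obs):
    """Return filtered P(Q_t | X_{1:t}) for each t and log P(X_{1:T})."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    filtered = [alpha]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
        filtered.append(alpha)
    return np.array(filtered), loglik

filtered, loglik = forward([0, 0, 1, 1, 1])
print(filtered)      # P(Q_t | observations up to t)
print(loglik)        # log P(X_{1:T})
```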
Graphical Model for Kalman Filter
[Diagram: state chain x1 → x2 → x3 → ... → xn with transition edges P(x_k|x_{k-1}); each state x_k emits a measurement z_k with P(z_k|x_k)]

x_k = A x_{k-1} + B u_{k-1} + w_{k-1}
z_k = H x_k + v_k
Dynamic Bayesian Networks (DBNs) are directed graphical models of stochastic
processes. They generalize hidden Markov models (HMMs) and
linear dynamical systems (LDSs) by representing the hidden (and observed) state
in terms of state variables, which can have complex interdependencies. The
graphical structure provides an easy way to specify these conditional
independencies, and hence to provide a compact parameterization of the model.
A Linear Dynamical System (LDS) has the same topology as an HMM, but all the nodes are assumed to have linear-Gaussian distributions.
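A minimal sketch of the corresponding Kalman filter predict/update cycle for this linear-Gaussian model, with illustrative 1-D position/velocity tracking values and the control term B u omitted for brevity (all numbers are assumptions, not from the slides).

```python
# Kalman filter for x_k = A x_{k-1} + w_{k-1},  z_k = H x_k + v_k
# (control input B u_{k-1} omitted in this sketch).
import numpy as np

A = np.array([[1.0, 1.0],      # constant-velocity transition (position, velocity)
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])     # only position is measured
Q = 0.01 * np.eye(2)           # process-noise covariance, cov(w)
R = np.array([[0.5]])          # measurement-noise covariance, cov(v)

x = np.zeros((2, 1))           # state estimate
P = np.eye(2)                  # estimate covariance

def kalman_step(x, P, z):
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [1.1, 2.0, 2.9, 4.2, 5.0]:
    x, P = kalman_step(x, P, np.array([[z]]))
print(x.ravel())               # estimated position and velocity
```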
Satellite Earth Terminal

[Diagram: satellite earth terminal with labeled components: LNA or LNB; TWTA, Klystron, or SSPA; MODEM]
Digital Satellite Uplink Chain
1. Digital data sent to modulator and converted to intermediate
frequency (L band, 70 – 140 MHz)
2. Intermediate frequency signal sent to upconverter and converted to
higher frequency (S, C, X, Ku, or Ka band, ≥ ~1000 MHz)
3. Noise removed and sent to high-power amplifier (HPA)
4. Amplified signal sent down waveguide to satellite dish
5. Dish emits high-frequency signal to satellite
Digital Satellite Downlink Chain
1. Satellite transmits signal that contains encoded data
2. Signal is received at satellite antenna dish
3. Signal is amplified through a low-noise amplifier (LNA) and fed to the downconverter
4. Downconverter converts high-frequency signal to intermediate
frequency
5. Intermediate frequency fed into demodulator and converted to digital
data
6. Data is sent to network through a router
Monitoring and Control of Satellite Earth
Terminals
Monitoring and control software systematically monitors state of each
piece of equipment
Wealth of information potentially available for predictive maintenance and
diagnosis
– Time history of each variable can be recorded
– All pieces of equipment
– All fielded systems
Data potentially much more extensive than available in lab tests
HPA often fails – are there precursors?
– Cathode current tends to drop before the filament burns out
– Helix current typically begins to rise before tube failure
– These rates may depend on environment, such as temperature
MaxView Monitoring and Control Software for Satellite
Uplink Chain
Bayesian Network for Satellite Earth Terminals
All nodes are discrete with finite number of states
Failure nodes for system and components
– Two states: fail and no fail
– Gives probability of failure over a fixed period of time
Nodes for measurable components are usually instantiated (have
evidence)
– e.g. helix and cathode currents in HPA
– Component failure may depend on trends in variables
– Time history of measurements needed for trend nodes
Most component nodes depend on environmental nodes
– e.g. temperatures, power, air conditioning
Conditional probabilities determined by data and by the manufacturer's prior lab tests of the equipment
M&C Bayesian Network with No Evidence
M&C Bayesian Network with Evidence – Temp = hot
Diagnosis
If the system fails, which component is most likely to have failed?
– Reduce time to find component responsible for failure
– Help determine order of replacing or checking equipment
– May take into account cost of replacement or time to replace
System failure node has fail state instantiated
Attempt to find most probable configuration of Bayesian network given
evidence
– Hugin max-propagation algorithm
– Determines most likely state of each unobserved node
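A minimal sketch of this kind of diagnosis as a brute-force search for the most probable configuration given evidence (Hugin's max-propagation does the same thing efficiently on a junction tree); the toy network and probabilities are assumptions, not the M&C model.

```python
# Most-probable-explanation (MPE) style diagnosis by enumeration over a toy
# network OutdoorTemp -> HPAFail -> SystemFail.
from itertools import product

p_temp = {"hot": 0.3, "normal": 0.7}
p_hpa_fail = {"hot": 0.2, "normal": 0.05}   # P(HPAFail = yes | OutdoorTemp)
p_sys_fail = {True: 0.9, False: 0.02}       # P(SystemFail = yes | HPAFail)

def joint(temp, hpa_fail, sys_fail):
    p = p_temp[temp]
    p *= p_hpa_fail[temp] if hpa_fail else 1 - p_hpa_fail[temp]
    p *= p_sys_fail[hpa_fail] if sys_fail else 1 - p_sys_fail[hpa_fail]
    return p

evidence = {"temp": "hot", "sys_fail": True}
best = max(
    (cfg for cfg in product(p_temp, [True, False], [True, False])
     if cfg[0] == evidence["temp"] and cfg[2] == evidence["sys_fail"]),
    key=lambda cfg: joint(*cfg),
)
print(best)   # most likely full configuration given the evidence: HPA failure
```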
Diagnosis of Component Failure – OutdoorTemp = hot,
SystemFailure = fail
Value of Information (VOI) Analysis
Which nodes, if measured, provide the most information in helping to
determine the most likely hypothesis?
A useful measure of the information in a discrete random variable, X, is the entropy, H(X)
– Most informative: one x_i has probability one, H = 0 (minimum H)
– Least informative: X uniformly distributed, H is maximum
H(X) = − ∑_{i=1}^{N} P(x_i) log P(x_i)
Entropy of the ID variable given various evidence:
– No evidence: H(ID) = 1.61
– BrgMeas=0-45: H(ID) = 1.58
– BrgMeas=0-45, Location=xyz: H(ID) = 1.52
– BrgMeas=0-45, Location=xyz, HotSpot=hot: H(ID) = 1.51
– BrgMeas=0-45, Location=xyz, HotSpot=hot, Freq=1000-1160: H(ID) = 0.98
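A minimal sketch of the entropy computation itself; the distributions are illustrative (as an aside, the slide's no-evidence value 1.61 ≈ ln 5, which is what a natural-log entropy of five equally likely ID states would give, though the slide does not say this).

```python
# Entropy H(X) = -sum_i P(x_i) log P(x_i) of a discrete distribution.
import math

def entropy(probs, base=2):
    """Entropy of a discrete distribution, skipping zero-probability states."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0]))               # 0.0  -> no uncertainty (minimum H)
print(entropy([0.2] * 5))                     # ~2.32 bits -> uniform over 5 states (maximum H)
print(entropy([0.2] * 5, base=math.e))        # ~1.61 nats
```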
Value of Information (VOI) Analysis
The mutual information between two variables, I(X,Y), is the amount by which the entropy of X is reduced given Y: H(X|Y) = H(X) − I(X,Y)

I(X,Y) = ∑_y ∑_x P(y) P(x|y) log [ P(x,y) / ( P(x) P(y) ) ]
Mutual information between ID and other measurable nodes, I(ID,Y)
Measurable Node, Y No Evidence
Mutual Information, I(ID,Y)
Frequency Known
Mutual Information, I(ID,Y|F)
Freq Measured 0.2522 0.0000
PRF Measured 0.1575 0.0090
PW Measured 0.1183 0.0140
Elevation Measured 0.0695 0.0816
Bearing Measured 0.0179 0.0056
PA Measured 0.0004 0.0059
ELINT Location 0.0000 0.0000
Hot Spot 0.0000 0.0000
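A minimal sketch of computing I(X,Y) directly from a joint probability table; the two-variable numbers are illustrative, not the radar example tabulated above.

```python
# Mutual information I(X,Y) = sum_{x,y} P(x,y) log [ P(x,y) / (P(x) P(y)) ].
import math

# Joint probabilities P(x, y) for two binary variables (illustrative).
joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def mutual_information(joint, base=2):
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]), base)
               for (x, y), p in joint.items() if p > 0)

print(mutual_information(joint))   # > 0 because X and Y are dependent here
```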
Software Tools for Graphical Methods
BUGS (Bayesian Inference Using Gibbs Sampling)
– Bayesian analysis using Markov chain Monte Carlo
– Powerful, can sample from a variety of continuous and discrete probability
distributions
Hugin Expert
– Easy to use, discrete distributions (some Gaussian)
– Exact inference, EM algorithm
WEKA (Waikato Environment for Knowledge Analysis)
– Machine learning algorithms for data mining
PNL – Intel's open source probabilistic networks library
MSBNx – Microsoft Bayesian Networks Editor/Toolkit
Learning Graphs
One needs to specify two things to describe a BN: the graph topology (structure) and
the parameters of each CPD.
It is possible to learn both of these from data. However, learning structure is much
harder than learning parameters. Also, learning when some of the nodes are
hidden, or we have missing data, is much harder than when everything is
observed.
This gives rise to 4 cases:

Structure   Observability   Method
Known       Full            Maximum likelihood estimation
Known       Partial         EM (or gradient ascent)
Unknown     Full            Search through model space
Unknown     Partial         EM + search through model space
Editor's Notes

  • #3–#8, #11, #31, #32 (the same note appears on each of these slides): Naïve Bayes is introduced to overcome the dimensionality problem and to exploit training data more efficiently. However, it goes from one extreme (fully dependent features) to another (many features independent). Common sense drives us to seek approximations that lie between these two extremes. The parents of an attribute (node) Vi are those features with directed links toward Vi and are members of the set Ai. In other words, Vi is conditionally independent of any combination of its nondescendants given its parents. Note we are estimating a joint pdf as the product of simpler terms; each of them involves, in general, a much smaller number of features than the original number. The complete specification of a BN requires: marginal probabilities of the root nodes (those without parents), and conditional probabilities of the nonroot nodes given their parents, for all possible combinations of their values. The joint pdf of the variables can then be obtained by multiplying all conditional probabilities with the prior probabilities of the root nodes. Need to order the variables such that every variable comes before its descendants in the related graph.
  • #12 Serial: A and C are d-separated given that B is known. We conclude that evidence may be transmitted through a serial connection unless the state of the variable in the connection is known. Diverging: imagine A = sex, B = hair length, and C = stature. If we do not know the sex of a person, seeing the length of his or her hair will tell us more about the sex, and this in turn will focus our belief on their stature. On the other hand, if we know the person is a man, then the length of hair doesn't help us with their stature. Converging connection: imagine A = sprinkler on?, B = rained last night, C = grass is wet (state). If we know C, and that it rained last night, it will reduce our belief that the sprinkler was on to make it wet.
  • #34 EM is ideally suited for the case in which the available data set is incomplete. The EM algorithm maximizes the expectation of the log-likelihood function, conditioned on the observed samples and the current iteration's estimate of theta.