Machine Learning
Learning Graphical Models
Eric Xing
Lecture 12, August 14, 2010
Reading:
Inference and Learning
• A BN M describes a unique probability distribution P
• Typical tasks:
  • Task 1: How do we answer queries about P?
    • We use inference as a name for the process of computing answers to such queries
    • So far we have learned several algorithms for exact and approx. inference
  • Task 2: How do we estimate a plausible model M from data D?
    i. We use learning as a name for the process of obtaining a point estimate of M.
    ii. But Bayesians seek p(M | D), which is actually an inference problem.
    iii. When not all variables are observable, even computing a point estimate of M
         requires inference to impute the missing data.
Learning Graphical Models
• The goal:
  Given a set of independent samples (assignments of random variables),
  find the best (the most likely?) graphical model (both the graph and the CPDs)
• Example data over (B, E, A, C, R):
  (B,E,A,C,R) = (T,F,F,T,F)
  (B,E,A,C,R) = (T,F,T,T,F)
  ........
  (B,E,A,C,R) = (F,T,T,T,F)
  (Figure: candidate networks over the nodes E, B, R, A, C; e.g., E and B as parents of A,
  E as parent of R, A as parent of C.)
• Together with the graph we learn CPDs, e.g. P(A | E, B):
  b,  e :  0.9   0.1
  b, ¬e :  0.2   0.8
  ¬b, e :  0.9   0.1
  ¬b,¬e :  0.01  0.99
Learning Graphical Models
• Scenarios:
  • completely observed GMs
    • directed
    • undirected
  • partially observed GMs
    • directed
    • undirected (an open research topic)
• Estimation principles:
  • Maximum likelihood estimation (MLE)
  • Bayesian estimation
  • Maximal conditional likelihood
  • Maximal "margin"
• We use learning as a name for the process of estimating the parameters, and in
  some cases, the topology of the network, from data.
ML Parameter Estimation for
completely observed GMs of
given structure
• The model (e.g., a two-node GM Z → X)
• The data:
  {(z(1),x(1)), (z(2),x(2)), (z(3),x(3)), ... , (z(N),x(N))}
The basic idea underlying MLE
• Likelihood (for now let's assume that the structure is given; e.g., X1 → X3 ← X2):
  $L(\theta \mid X) = p(X \mid \theta) = p(X_1 \mid \theta_1)\, p(X_2 \mid \theta_2)\, p(X_3 \mid X_1, X_2, \theta_3)$
• Log-likelihood:
  $\ell(\theta \mid X) = \log p(X_1 \mid \theta_1) + \log p(X_2 \mid \theta_2) + \log p(X_3 \mid X_1, X_2, \theta_3)$
• Data log-likelihood:
  $\ell(\theta \mid D) = \sum_n \log p(X_n \mid \theta) = \sum_n \log p(x_{n,1} \mid \theta_1) + \sum_n \log p(x_{n,2} \mid \theta_2) + \sum_n \log p(x_{n,3} \mid x_{n,1}, x_{n,2}, \theta_3)$
• MLE:
  $\{\theta_1, \theta_2, \theta_3\}_{MLE} = \arg\max \ell(\theta \mid D)$
  $\theta_1^* = \arg\max \sum_n \log p(x_{n,1} \mid \theta_1), \quad
   \theta_2^* = \arg\max \sum_n \log p(x_{n,2} \mid \theta_2), \quad
   \theta_3^* = \arg\max \sum_n \log p(x_{n,3} \mid x_{n,1}, x_{n,2}, \theta_3)$
Example 1: conditional Gaussian
• The completely observed model: Z → X
• Z is a class indicator vector:
  $Z = [Z^1, \ldots, Z^M]^T$, where $Z^m \in \{0, 1\}$ and $\sum_m Z^m = 1$
  $p(z \mid \pi) = \prod_m (\pi_m)^{z^m}$  (a multinomial; all except one of these factors will be one)
  and a datum is in class m w.p. $\pi_m$
• X is a conditional Gaussian variable with a class-specific mean:
  $p(x \mid z^m = 1, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu_m)^2}{2\sigma^2} \right)$
  $p(x \mid z, \mu, \sigma) = \prod_m N(x; \mu_m, \sigma)^{z^m}$
Example 1: conditional Gaussian
• Data log-likelihood:
  $\ell(\theta \mid D) = \sum_n \log p(z_n, x_n) = \sum_n \log p(z_n \mid \pi) + \sum_n \log p(x_n \mid z_n, \mu, \sigma)$
  $\quad = \sum_n \sum_m z_n^m \log \pi_m \;-\; \sum_n \sum_m z_n^m \frac{(x_n - \mu_m)^2}{2\sigma^2} \;+\; C$
• MLE:
  $\hat{\pi}_{m,MLE} = \arg\max_\pi \ell(\theta \mid D) \;\Rightarrow\; \frac{\partial}{\partial \pi_m}\ell(\theta \mid D) = 0 \;\;\forall m, \;\text{s.t.}\; \sum_m \pi_m = 1$
  $\quad \Rightarrow\; \hat{\pi}_m = \frac{\sum_n z_n^m}{N} = \frac{n_m}{N}$   (the fraction of samples of class m)
  $\hat{\mu}_{m,MLE} = \arg\max_\mu \ell(\theta \mid D) \;\Rightarrow\; \hat{\mu}_m = \frac{\sum_n z_n^m x_n}{\sum_n z_n^m}$   (the average of samples of class m)
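A minimal numerical sketch of the two closed-form estimates above (not from the slides; the sampling setup and variable names are illustrative):

```python
import numpy as np

# Fully observed data: class labels z_n (integer ids) and scalar observations x_n.
rng = np.random.default_rng(0)
M, N = 3, 1000
true_pi = np.array([0.5, 0.3, 0.2])
true_mu = np.array([-2.0, 0.0, 3.0])
z = rng.choice(M, size=N, p=true_pi)          # class indicators
x = rng.normal(loc=true_mu[z], scale=1.0)     # class-conditional Gaussian draws

# MLE: pi_hat_m = fraction of samples in class m; mu_hat_m = mean of samples in class m.
pi_hat = np.bincount(z, minlength=M) / N
mu_hat = np.array([x[z == m].mean() for m in range(M)])
print(pi_hat, mu_hat)
```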
Example 2: HMM: two scenarios
• Supervised learning: estimation when the "right answer" is known
  • Examples:
    GIVEN: a genomic region x = x1...x1,000,000 where we have good
    (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening,
    as he changes dice and produces 10,000 rolls
• Unsupervised learning: estimation when the "right answer" is unknown
  • Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there,
    nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
• QUESTION: Update the parameters θ of the model to maximize
  P(x|θ) --- maximum likelihood (ML) estimation
Recall definition of HMM
(Figure: hidden states y1, y2, y3, ..., yT form a Markov chain; each yt emits an observation xt.)
• Transition probabilities between any two states:
  $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$,
  or $p(y_t \mid y_{t-1}^i = 1) \sim \text{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \;\forall i \in I$
• Start probabilities:
  $p(y_1) \sim \text{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$
• Emission probabilities associated with each state:
  $p(x_t \mid y_t^i = 1) \sim \text{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \;\forall i \in I$
  or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \;\forall i \in I$
Supervised ML estimation
• Given x = x1...xN for which the true state path y = y1...yN is known,
• Define:
  $\ell(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \sum_n \Big( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \Big)$
  A_ij = # times state transition i→j occurs in y
  B_ik = # times state i in y emits k in x
• We can show that the maximum likelihood parameters θ are:
  $a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$
  $b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$
• If x is continuous, we can treat $\{(x_{n,t}, y_{n,t}) : t = 1, \ldots, T,\; n = 1, \ldots, N\}$ as N×T
  observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian ...
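A small counting sketch of these supervised ML estimates (illustrative code, not the lecture's; the helper name and the pseudocount argument, used on the following slides, are my own):

```python
import numpy as np

def supervised_hmm_mle(seqs, M, K, pseudocount=0.0):
    """Count-based MLE for HMM parameters from fully labeled sequences.

    seqs: list of (x, y) pairs, where x and y are equal-length integer arrays
          (observations in 0..K-1, states in 0..M-1).
    Returns (pi, A, B): start, transition, and emission probability estimates.
    """
    pi = np.full(M, pseudocount)
    A = np.full((M, M), pseudocount)   # A[i, j] ~ # transitions i -> j
    B = np.full((M, K), pseudocount)   # B[i, k] ~ # emissions of symbol k from state i
    for x, y in seqs:
        pi[y[0]] += 1
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1

    def _norm(C):
        # Normalize counts into conditional distributions, leaving all-zero rows at zero.
        s = C.sum(axis=-1, keepdims=True)
        return np.divide(C, s, out=np.zeros_like(C), where=s > 0)

    return _norm(pi), _norm(A), _norm(B)

# Example: the 10 casino rolls discussed on the next slide, all labeled "Fair" (state 0).
x = np.array([2, 1, 5, 6, 1, 2, 3, 6, 2, 3]) - 1   # map faces 1..6 to 0..5
y = np.zeros(10, dtype=int)
pi, A, B = supervised_hmm_mle([(x, y)], M=2, K=6)
print(B[0])   # [0.2 0.3 0.2 0.  0.1 0.2] -- note the zero for face 4 (overfitting)
```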
Supervised ML estimation, ctd.
• Intuition:
  • When we know the underlying states, the best estimate of θ is the average
    frequency of transitions & emissions that occur in the training data
• Drawback:
  • Given little data, there may be overfitting:
    • P(x|θ) is maximized, but θ is unreasonable
    • 0 probabilities -- VERY BAD
• Example:
  • Given 10 casino rolls, we observe
    x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
    y = F, F, F, F, F, F, F, F, F, F
  • Then: aFF = 1; aFL = 0
    bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1
Pseudocounts
• Solution for small training sets:
  • Add pseudocounts:
    A_ij = # times state transition i→j occurs in y, + R_ij
    B_ik = # times state i in y emits k in x, + S_ik
  • R_ij, S_ik are pseudocounts representing our prior belief
  • Total pseudocounts: R_i = Σ_j R_ij, S_i = Σ_k S_ik
    --- the "strength" of the prior belief,
    --- the total number of imaginary instances in the prior
• Larger total pseudocounts ⇒ stronger prior belief
• Small total pseudocounts: just to avoid 0 probabilities --- smoothing
• This is equivalent to Bayesian estimation under a uniform prior with
  "parameter strength" equal to the pseudocounts
MLE for general BN parameters
• If we assume the parameters for each CPD are globally independent, and all nodes
  are fully observed, then the log-likelihood function decomposes into a sum of local
  terms, one per node:
  $\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) = \sum_i \Big( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \Big)$
Example: decomposable likelihood of a directed model
• Consider the distribution defined by the directed acyclic GM:
  $p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$
• This is exactly like learning four separate small BNs, each of which consists of a node
  and its parents.
  (Figure: the network X1 → {X2, X3} → X4 broken into four such parent-child families.)
E.g.: MLE for BNs with tabular CPDs
• Assume each CPD is represented as a table (multinomial) where
  $\theta_{ijk} \stackrel{\text{def}}{=} p(X_i = j \mid X_{\pi_i} = k)$
• Note that in case of multiple parents, $X_{\pi_i}$ will have a composite state, and the
  CPD will be a high-dimensional table
• The sufficient statistics are counts of family configurations
  $n_{ijk} \stackrel{\text{def}}{=} \sum_n x_{n,i}^j\, x_{n,\pi_i}^k$
• The log-likelihood is
  $\ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$
• Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:
  $\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$
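A short sketch of this counting estimator for one tabular CPD (hypothetical helper; assumes discrete data encoded as integers):

```python
import numpy as np

def tabular_cpd_mle(data, i, parents, card):
    """MLE of a tabular CPD p(X_i | X_parents) from fully observed data.

    data: (N, num_vars) integer array of joint assignments.
    card: list of cardinalities, one per variable.
    Returns theta with theta[k, j] = p(X_i = j | parents in composite state k).
    """
    n_parent_cfgs = int(np.prod([card[p] for p in parents])) if parents else 1
    counts = np.zeros((n_parent_cfgs, card[i]))
    for row in data:
        k = 0
        for p in parents:               # composite parent state (mixed-radix index)
            k = k * card[p] + row[p]
        counts[k, row[i]] += 1          # n_ijk
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy usage: 3 binary variables, estimate p(X2 | X0, X1).
data = np.array([[0, 0, 0], [0, 1, 1], [1, 1, 1], [1, 1, 0], [0, 0, 1]])
print(tabular_cpd_mle(data, i=2, parents=[0, 1], card=[2, 2, 2]))
```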
Learning partially observed GMs
• The model: Z → X, where Z is unobserved
• The data:
  {(x(1)), (x(2)), (x(3)), ... , (x(N))}
What if some nodes are not observed?
• Consider the distribution defined by the directed acyclic GM:
  $p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$
  (Figure: the network X1 → {X2, X3} → X4 with some nodes unobserved.)
• Need to compute $p(x_H \mid x_V)$ ⇒ inference
Recall: EM Algorithm
• A way of maximizing the likelihood function for latent variable models.
  Finds the MLE of parameters when the original (hard) problem can be broken up
  into two (easy) pieces:
  1. Estimate some "missing" or "unobserved" data from observed data and current parameters.
  2. Using this "complete" data, find the maximum likelihood parameter estimates.
• Alternate between filling in the latent variables using the best guess (posterior)
  and updating the parameters based on this guess:
  • E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
  • M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
• In the M-step we optimize a lower bound on the likelihood. In the E-step we close
  the gap, making bound = likelihood.
EM for general BNs
  while not converged
    % E-step
    for each node i
      ESS_i = 0                        % reset expected sufficient statistics
    for each data sample n
      do inference with x_{n,H}        % compute p(x_{n,H} | x_{n,-H})
      for each node i
        ESS_i += < SS_i(x_{n,i}, x_{n,pi_i}) >_{p(x_{n,H} | x_{n,-H})}
    % M-step
    for each node i
      theta_i := MLE(ESS_i)
Example: HMM
• Supervised learning: estimation when the "right answer" is known
  • Examples:
    GIVEN: a genomic region x = x1...x1,000,000 where we have good
    (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening,
    as he changes dice and produces 10,000 rolls
• Unsupervised learning: estimation when the "right answer" is unknown
  • Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there,
    nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
• QUESTION: Update the parameters θ of the model to maximize P(x|θ)
  --- maximum likelihood (ML) estimation
The Baum-Welch algorithm
• The complete log likelihood:
  $\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \sum_n \Big( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \Big)$
• The expected complete log likelihood:
  $\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i
   + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j}
   + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k}$
• EM
  • The E step:
    $\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$
    $\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i\, y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$
  • The M step ("symbolically" identical to MLE):
    $a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$,
    $\quad b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$,
    $\quad \pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}$
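A compact single-sequence sketch of Baum-Welch (scaled forward-backward for the E-step, then the M-step ratios from this slide); this is illustrative code, not the lecture's implementation:

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """Scaled forward-backward for one observation sequence x (ints in 0..K-1).

    Returns gamma[t, i] = p(y_t = i | x) and xi[t, i, j] = p(y_t = i, y_{t+1} = j | x).
    """
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M)); beta = np.zeros((T, M)); c = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    xi = np.zeros((T - 1, M, M))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]
    return gamma, xi

def baum_welch_step(x, pi, A, B, K):
    """One EM iteration: E-step posteriors, then the M-step updates from the slide."""
    gamma, xi = forward_backward(x, pi, A, B)
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(K):
        B_new[:, k] = gamma[x == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

# Tiny usage example with random data and random initial parameters (illustrative only):
rng = np.random.default_rng(0)
x = rng.integers(0, 6, size=200)
M, K = 2, 6
pi = np.full(M, 1.0 / M)
A = rng.dirichlet(np.ones(M), size=M)
B = rng.dirichlet(np.ones(K), size=M)
for _ in range(20):
    pi, A, B = baum_welch_step(x, pi, A, B, K)
```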
Unsupervised ML estimation
• Given x = x1...xN for which the true state path y = y1...yN is unknown,
• EXPECTATION MAXIMIZATION
  0. Starting with our best guess of a model M, parameters θ:
  1. Estimate A_ij, B_ik in the training data
     How?  $A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle$,  $B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle\, x_{n,t}^k$
  2. Update θ according to A_ij, B_ik
     Now a "supervised learning" problem
  3. Repeat 1 & 2, until convergence
  This is called the Baum-Welch Algorithm
• We can get to a provably more (or equally) likely parameter set θ each iteration
ML Structural Learning for
completely observed GMs
• Data:
  (x1(1), ..., xn(1))
  (x1(2), ..., xn(2))
  ...
  (x1(M), ..., xn(M))
• Goal: recover the graph (e.g., a network over E, B, R, A, C) from the data.
Information Theoretic Interpretation of ML
$\ell(\theta_G, G; D) = \log p(D \mid \theta_G, G) = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i|\pi_i(G)})$
$\quad = \sum_i \Big( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i|\pi_i(G)}) \Big)$
$\quad = M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \frac{\mathrm{count}(x_i, \mathbf{x}_{\pi_i(G)})}{M} \log p(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i|\pi_i(G)}) \Big)$
$\quad = M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log p(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i|\pi_i(G)}) \Big)$
From a sum over data points to a sum over counts of variable states
Information Theoretic Interpretation of ML (con'd)
$\ell(\hat{\theta}_G, G; D) = \log p(D \mid \hat{\theta}_G, G) = M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log \hat{p}(x_i \mid \mathbf{x}_{\pi_i(G)}) \Big)$
$\quad = M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log \frac{\hat{p}(x_i, \mathbf{x}_{\pi_i(G)})}{\hat{p}(\mathbf{x}_{\pi_i(G)})\, \hat{p}(x_i)} \Big) + M \sum_i \Big( \sum_{x_i} \hat{p}(x_i) \log \hat{p}(x_i) \Big)$
$\quad = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)$
Decomposable score and a function of the graph structure
Structural Search
• How many graphs over n nodes?  $O(2^{n^2})$
• How many trees over n nodes?  $O(n!)$
• But it turns out that we can find the exact solution of an optimal tree (under MLE)!
  • Trick: in a tree each node has only one parent!
  • Chow-Liu algorithm
Chow-Liu tree learning algorithm
• Objective function:
  $\ell(\hat{\theta}_G, G; D) = \log p(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)$
  $\Rightarrow\; C(G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)})$
• Chow-Liu:
  • For each pair of variables xi and xj:
    • Compute the empirical distribution: $\hat{p}(X_i, X_j) = \frac{\mathrm{count}(x_i, x_j)}{M}$
    • Compute the mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)}$
  • Define a graph with nodes x1, ..., xn
    • Edge (i, j) gets weight $\hat{I}(X_i, X_j)$
Chow-Liu algorithm (con'd)
• Objective function:
  $\ell(\hat{\theta}_G, G; D) = \log p(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)$
  $\Rightarrow\; C(G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)})$
• Chow-Liu: optimal tree BN
  • Compute the maximum weight spanning tree
  • Direction in BN: pick any node as root, do breadth-first search to define directions
• I-equivalence:
  (Figure: differently rooted and directed trees over A, B, C, D, E sharing the skeleton
  A-B, A-C, C-D, C-E; all have the same score.)
  $C(G) = I(A,B) + I(A,C) + I(C,D) + I(C,E)$
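A short sketch of the Chow-Liu procedure described on the last two slides (plug-in mutual information plus a maximum-weight spanning tree); the function names and toy data are illustrative:

```python
import numpy as np
from collections import Counter

def empirical_mi(data, i, j):
    """Plug-in mutual information between discrete columns i and j of data."""
    M = len(data)
    joint = Counter(zip(data[:, i], data[:, j]))
    pi, pj = Counter(data[:, i]), Counter(data[:, j])
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / M
        mi += p_ab * np.log(p_ab * M * M / (pi[a] * pj[b]))   # p_ab * log(p_ab / (p_a p_b))
    return mi

def chow_liu_tree(data):
    """Return the edge list of a maximum-weight spanning tree under empirical MI."""
    n = data.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = empirical_mi(data, i, j)
    # Prim's algorithm, picking the heaviest edge leaving the current tree each step.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: W[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges   # orienting edges away from any chosen root gives the tree BN

# Toy usage on binary data: x1 mostly copies x0, x2 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=500)
data = np.column_stack([x0, x0 ^ (rng.random(500) < 0.1), rng.integers(0, 2, 500)])
print(chow_liu_tree(data))
```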
Structure Learning for general graphs
• Theorem:
  • The problem of learning a BN structure with at most d parents is
    NP-hard for any (fixed) d ≥ 2
• Most structure learning approaches use heuristics
  • Exploit score decomposition
  • Two heuristics that exploit decomposition in different ways:
    • Greedy search through the space of node orderings
    • Local search of graph structures
Inferring gene regulatory networks
• Network of cis-regulatory pathways
• Success stories in sea urchin, fruit fly, etc., from decades of experimental research
• Statistical modeling and automated learning have just started
Gene Expression Profiling by Microarrays
(Figure: a signaling pathway over Receptor A (X1), Receptor B (X2), Kinase C (X3),
Kinase D (X4), Kinase E (X5), Trans. Factor F (X6), Gene G (X7), and Gene H (X8),
shown in several panels.)
Structure Learning Algorithms
• Structural EM (Friedman 1998)
  • The original algorithm
• Sparse Candidate Algorithm (Friedman et al.)
  • Discretizing array signals
  • Hill-climbing search using local operators: add/delete/swap of a single edge
  • Feature extraction: Markov relations, order relations
  • Re-assemble high-confidence sub-networks from features
• Module network learning (Segal et al.)
  • Heuristic search of structure in a "module graph"
  • Module assignment
  • Parameter sharing
  • Prior knowledge: possible regulators (TF genes)
(Figure: expression data fed into the learning algorithm, which outputs a network
such as one over E, B, R, A, C.)
Learning GM structure
• Learning the best CPDs given a DAG is easy
  • collect statistics of the values of each node given a specific assignment to its parents
• Learning the graph topology (structure) is NP-hard
  • heuristic search must be applied; it generally leads to a locally optimal network
• Overfitting
  • It turns out that richer structures give higher likelihood P(D|G) to the data
    (adding an edge is always preferable):
    $P(C \mid A) \le P(C \mid A, B)$
  • more parameters to fit => more freedom => there always exists a more "optimal" CPD(C)
• We prefer simpler (more explanatory) networks
  • Practical scores regularize the likelihood improvement of complex networks.
Learning Graphical Model Structure via Neighborhood Selection
Undirected Graphical Models
• Why? Sometimes an UNDIRECTED association graph makes more sense and/or is
  more informative
  • gene expressions may be influenced by unobserved factors that are
    post-transcriptionally regulated
  • the unavailability of the state of B results in a constraint over A and C
    (Figure: small three-node examples with B connected to both A and C.)
Gaussian Graphical Models
• Multivariate Gaussian density:
  $p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right\}$
• WOLG: let $Q = \Sigma^{-1}$ and $\mu = 0$:
  $p(x_1, x_2, \ldots, x_p \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{p/2}} \exp\!\left\{ -\tfrac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \right\}$
• We can view this as a continuous Markov Random Field with potentials defined on
  every node and edge.
The covariance and the precision matrices
• Covariance matrix $\Sigma$
  • Graphical model interpretation?
• Precision matrix $Q = \Sigma^{-1}$
  • Graphical model interpretation?
Sparse precision vs. sparse covariance in GGM
(Figure: a chain GGM over X1 - X2 - X3 - X4 - X5; its precision matrix is sparse, with zeros
for all non-adjacent pairs, while its covariance matrix is dense.)
• $q_{15} = 0 \;\Rightarrow\; X_1 \perp X_5 \mid X_{\mathrm{nbrs}(1)}$ (or $X_{\mathrm{nbrs}(5)}$)
• $X_1 \perp X_5 \;\Leftrightarrow\; \Sigma_{15} = 0$
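A small numerical illustration of this contrast (the chain precision matrix below is my own example, not from the slides):

```python
import numpy as np

# Precision matrix of a 5-node chain GGM: only adjacent pairs have nonzero entries.
Q = np.array([[ 2., -1.,  0.,  0.,  0.],
              [-1.,  2., -1.,  0.,  0.],
              [ 0., -1.,  2., -1.,  0.],
              [ 0.,  0., -1.,  2., -1.],
              [ 0.,  0.,  0., -1.,  2.]])
Sigma = np.linalg.inv(Q)
print(np.round(Sigma, 2))     # dense: X1 and X5 are marginally dependent
print(Q[0, 4])                # 0.0: X1 is independent of X5 given the remaining nodes
# Partial correlation between X1 and X5: -q_15 / sqrt(q_11 * q_55) = 0.
print(-Q[0, 4] / np.sqrt(Q[0, 0] * Q[4, 4]))
```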
Another example
• How to estimate this MRF?
• What if p >> n?
  • MLE does not exist in general!
  • What about only learning a "sparse" graphical model?
    • This is possible when s = o(n)
    • Very often it is the structure of the GM that is more interesting ...
Single-node Conditional
• The conditional dist. of a single node i given the rest of the nodes can be written as:
• WOLG: let
• We can write the following conditional auto-regression function for each node:
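For reference, the standard form of this conditional for a zero-mean Gaussian with precision matrix Q (a textbook identity, not recovered from the slide):

```latex
% Conditional of a single node in a zero-mean GGM with precision matrix Q:
% a conditional auto-regression X_i = sum_{j != i} beta_{ij} X_j + eps_i,
% eps_i ~ N(0, 1/q_{ii}), with weights determined by the precision matrix.
\[
  X_i \mid \mathbf{x}_{-i} \;\sim\; \mathcal{N}\!\Big( \sum_{j \neq i} \beta_{ij}\, x_j,\; q_{ii}^{-1} \Big),
  \qquad \beta_{ij} = -\frac{q_{ij}}{q_{ii}} .
\]
```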
Conditional independence
• From
• Let:
• Given an estimate of the neighborhood s_i, we have:
• Thus the neighborhood s_i defines the Markov blanket of node i
Recall lasso
Graph Regression
• Lasso: neighborhood selection
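A minimal sketch of lasso-based neighborhood selection (graph regression) using scikit-learn; the penalty value, the zero threshold, and the OR-symmetrization rule are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, alpha=0.1):
    """Neighborhood selection: regress each variable on all the others with an L1 penalty.

    The nonzero coefficients give an estimate of that node's neighborhood (Markov blanket).
    X: (n_samples, p) data matrix. Returns a symmetrized p x p boolean adjacency matrix
    (an edge is kept if either of the two regressions selects it).
    """
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        coef = Lasso(alpha=alpha).fit(X[:, others], X[:, i]).coef_
        for j, c in zip(others, coef):
            if abs(c) > 1e-10:
                adj[i, j] = True
    return adj | adj.T

# Toy usage: sample from a 5-node chain GGM and try to recover its edges.
rng = np.random.default_rng(0)
Q = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)   # chain precision matrix
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q), size=2000)
print(neighborhood_selection(X, alpha=0.05).astype(int))
```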
Graph Regression
Graph Regression
• It can be shown that, given iid samples and under several technical conditions
  (e.g., the "irrepresentable" condition), the recovered structure is "sparsistent"
  even when p >> n.
Consistency
• Theorem: for the graphical regression algorithm, under certain verifiable conditions
  (omitted here for simplicity):
• Note that from this theorem one should see that the regularizer is not actually used
  to introduce an "artificial" sparsity bias, but is a device to ensure consistency under
  finite-data and high-dimension conditions.
Learning (sparse) GGM
• Multivariate Gaussian over all continuous expressions:
  $p([x_1, \ldots, x_n]) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right\}$
• The precision matrix $Q = \Sigma^{-1}$ reveals the topology of the (undirected) network
• Learning Algorithm: covariance selection
  • Want a sparse matrix Q
  • As shown in the previous slides, we can use L_1 regularized linear regression to
    obtain a sparse estimate of the neighborhood of each variable
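Alternatively, the sparse precision matrix can be estimated directly with an L1-penalized Gaussian likelihood (the graphical-lasso line of work surveyed on the next slide); a brief sketch with scikit-learn on simulated chain data:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Simulate from a 5-node chain GGM, then estimate a sparse precision matrix directly.
rng = np.random.default_rng(0)
Q_true = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q_true), size=2000)

model = GraphicalLassoCV().fit(X)        # cross-validated choice of the L1 penalty
Q_hat = model.precision_
print(np.round(Q_hat, 2))
print((np.abs(Q_hat) > 1e-2).astype(int))   # estimated edge pattern (nonzero off-diagonals)
```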
Recent trends in GGM:
• Covariance selection (classical method)
  • Dempster [1972]:
    • Sequentially pruning the smallest elements of the precision matrix
  • Drton and Perlman [2008]:
    • Improved statistical tests for pruning
  • Serious limitations in practice: breaks down when the covariance matrix is not invertible
• L1-regularization based method (hot!)
  • Meinshausen and Bühlmann [Ann. Stat. 06]:
    • Used LASSO regression for neighborhood selection
  • Banerjee [JMLR 08]:
    • Block sub-gradient algorithm for finding the precision matrix
  • Friedman et al. [Biostatistics 08]:
    • Efficient fixed-point equations based on a sub-gradient algorithm
  • ...
  • Structure learning is possible even when # variables > # samples
Learning Ising Model (i.e. pairwise MRF)
• Assuming the nodes are discrete, and edges are weighted, then for a sample xd we have
• It can be shown, following the same logic, that we can use L_1 regularized logistic
  regression to obtain a sparse estimate of the neighborhood of each variable in the
  discrete case.
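A minimal sketch of this L1-regularized logistic-regression estimator for a binary pairwise MRF (assuming the standard form p(x) ∝ exp(Σ_i θ_i x_i + Σ_(i,j) θ_ij x_i x_j)); the penalty and the toy data are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhood_selection(X, C=0.5):
    """Sparse neighborhood estimation for a binary pairwise MRF (Ising model).

    For each node i, fit an L1-penalized logistic regression of x_i on the other nodes;
    nonzero weights are the estimated neighbors. X: (n_samples, p) array with entries
    in {0, 1}. Returns a symmetrized boolean adjacency matrix (OR rule).
    """
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], X[:, i])
        for j, w in zip(others, clf.coef_.ravel()):
            if abs(w) > 1e-6:
                adj[i, j] = True
    return adj | adj.T

# Tiny illustration: x1 strongly coupled to x0, x2 independent.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=2000)
x1 = np.where(rng.random(2000) < 0.9, x0, 1 - x0)
x2 = rng.integers(0, 2, size=2000)
X = np.column_stack([x0, x1, x2])
print(ising_neighborhood_selection(X).astype(int))
```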