Machine Learning
Learning Graphical Models
Eric Xing
Lecture 12, August 14, 2010
Reading:
Inference and Learning
• A BN M describes a unique probability distribution P
• Typical tasks:
  • Task 1: How do we answer queries about P?
    • We use inference as a name for the process of computing answers to such queries
    • So far we have learned several algorithms for exact and approx. inference
  • Task 2: How do we estimate a plausible model M from data D?
    i. We use learning as a name for the process of obtaining a point estimate of M.
    ii. But Bayesians seek p(M | D), which is actually an inference problem.
    iii. When not all variables are observable, even computing a point estimate of M
         requires inference to impute the missing data.
Learning Graphical Models
• The goal:
  Given a set of independent samples (assignments of random variables),
  find the best (the most likely?) graphical model (both the graph and the CPDs)
• Example data over (B, E, A, C, R):
  (B,E,A,C,R) = (T,F,F,T,F)
  (B,E,A,C,R) = (T,F,T,T,F)
  ........
  (B,E,A,C,R) = (F,T,T,T,F)
  (Figure: candidate networks over the nodes E, B, R, A, C; e.g., E and B as parents of A,
  E as parent of R, A as parent of C.)
• Together with the graph we learn CPDs, e.g. P(A | E, B):
  b,  e :  0.9   0.1
  b, ¬e :  0.2   0.8
  ¬b, e :  0.9   0.1
  ¬b,¬e :  0.01  0.99
Learning Graphical Models
• Scenarios:
  • completely observed GMs
    • directed
    • undirected
  • partially observed GMs
    • directed
    • undirected (an open research topic)
• Estimation principles:
  • Maximum likelihood estimation (MLE)
  • Bayesian estimation
  • Maximal conditional likelihood
  • Maximal "margin"
• We use learning as a name for the process of estimating the parameters, and in
  some cases, the topology of the network, from data.
ML Parameter Estimation for
completely observed GMs of
given structure
• The model (e.g., a two-node GM Z → X)
• The data:
  {(z(1),x(1)), (z(2),x(2)), (z(3),x(3)), ... , (z(N),x(N))}
The basic idea underlying MLE
• Likelihood (for now let's assume that the structure is given; e.g., X1 → X3 ← X2):
  $L(\theta \mid X) = p(X \mid \theta) = p(X_1 \mid \theta_1)\, p(X_2 \mid \theta_2)\, p(X_3 \mid X_1, X_2, \theta_3)$
• Log-likelihood:
  $\ell(\theta \mid X) = \log p(X_1 \mid \theta_1) + \log p(X_2 \mid \theta_2) + \log p(X_3 \mid X_1, X_2, \theta_3)$
• Data log-likelihood:
  $\ell(\theta \mid D) = \sum_n \log p(X_n \mid \theta) = \sum_n \log p(x_{n,1} \mid \theta_1) + \sum_n \log p(x_{n,2} \mid \theta_2) + \sum_n \log p(x_{n,3} \mid x_{n,1}, x_{n,2}, \theta_3)$
• MLE:
  $\{\theta_1, \theta_2, \theta_3\}_{MLE} = \arg\max \ell(\theta \mid D)$
  $\theta_1^* = \arg\max \sum_n \log p(x_{n,1} \mid \theta_1), \quad
   \theta_2^* = \arg\max \sum_n \log p(x_{n,2} \mid \theta_2), \quad
   \theta_3^* = \arg\max \sum_n \log p(x_{n,3} \mid x_{n,1}, x_{n,2}, \theta_3)$
Example 1: conditional Gaussian
• The completely observed model: Z → X
• Z is a class indicator vector:
  $Z = [Z^1, \ldots, Z^M]^T$, where $Z^m \in \{0, 1\}$ and $\sum_m Z^m = 1$
  $p(z \mid \pi) = \prod_m (\pi_m)^{z^m}$  (a multinomial; all except one of these factors will be one)
  and a datum is in class m w.p. $\pi_m$
• X is a conditional Gaussian variable with a class-specific mean:
  $p(x \mid z^m = 1, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu_m)^2}{2\sigma^2} \right)$
  $p(x \mid z, \mu, \sigma) = \prod_m N(x; \mu_m, \sigma)^{z^m}$
Example 1: conditional Gaussian
• Data log-likelihood:
  $\ell(\theta \mid D) = \sum_n \log p(z_n, x_n) = \sum_n \log p(z_n \mid \pi) + \sum_n \log p(x_n \mid z_n, \mu, \sigma)$
  $\quad = \sum_n \sum_m z_n^m \log \pi_m \;-\; \sum_n \sum_m z_n^m \frac{(x_n - \mu_m)^2}{2\sigma^2} \;+\; C$
• MLE:
  $\hat{\pi}_{m,MLE} = \arg\max_\pi \ell(\theta \mid D) \;\Rightarrow\; \frac{\partial}{\partial \pi_m}\ell(\theta \mid D) = 0 \;\;\forall m, \;\text{s.t.}\; \sum_m \pi_m = 1$
  $\quad \Rightarrow\; \hat{\pi}_m = \frac{\sum_n z_n^m}{N} = \frac{n_m}{N}$   (the fraction of samples of class m)
  $\hat{\mu}_{m,MLE} = \arg\max_\mu \ell(\theta \mid D) \;\Rightarrow\; \hat{\mu}_m = \frac{\sum_n z_n^m x_n}{\sum_n z_n^m}$   (the average of samples of class m)
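A minimal numerical sketch of the two closed-form estimates above (not from the slides; the sampling setup and variable names are illustrative):

```python
import numpy as np

# Fully observed data: class labels z_n (integer ids) and scalar observations x_n.
rng = np.random.default_rng(0)
M, N = 3, 1000
true_pi = np.array([0.5, 0.3, 0.2])
true_mu = np.array([-2.0, 0.0, 3.0])
z = rng.choice(M, size=N, p=true_pi)          # class indicators
x = rng.normal(loc=true_mu[z], scale=1.0)     # class-conditional Gaussian draws

# MLE: pi_hat_m = fraction of samples in class m; mu_hat_m = mean of samples in class m.
pi_hat = np.bincount(z, minlength=M) / N
mu_hat = np.array([x[z == m].mean() for m in range(M)])
print(pi_hat, mu_hat)
```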
Example 2: HMM: two scenarios
• Supervised learning: estimation when the "right answer" is known
  • Examples:
    GIVEN: a genomic region x = x1...x1,000,000 where we have good
    (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening,
    as he changes dice and produces 10,000 rolls
• Unsupervised learning: estimation when the "right answer" is unknown
  • Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there,
    nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
• QUESTION: Update the parameters θ of the model to maximize
  P(x|θ) --- maximum likelihood (ML) estimation
Recall definition of HMM
(Figure: hidden states y1, y2, y3, ..., yT form a Markov chain; each yt emits an observation xt.)
• Transition probabilities between any two states:
  $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$,
  or $p(y_t \mid y_{t-1}^i = 1) \sim \text{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \;\forall i \in I$
• Start probabilities:
  $p(y_1) \sim \text{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$
• Emission probabilities associated with each state:
  $p(x_t \mid y_t^i = 1) \sim \text{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \;\forall i \in I$
  or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \;\forall i \in I$
Supervised ML estimation
• Given x = x1...xN for which the true state path y = y1...yN is known,
• Define:
  $\ell(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \sum_n \Big( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \Big)$
  A_ij = # times state transition i→j occurs in y
  B_ik = # times state i in y emits k in x
• We can show that the maximum likelihood parameters θ are:
  $a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$
  $b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$
• If x is continuous, we can treat $\{(x_{n,t}, y_{n,t}) : t = 1, \ldots, T,\; n = 1, \ldots, N\}$ as N×T
  observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian ...
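A small counting sketch of these supervised ML estimates (illustrative code, not the lecture's; the helper name and the pseudocount argument, used on the following slides, are my own):

```python
import numpy as np

def supervised_hmm_mle(seqs, M, K, pseudocount=0.0):
    """Count-based MLE for HMM parameters from fully labeled sequences.

    seqs: list of (x, y) pairs, where x and y are equal-length integer arrays
          (observations in 0..K-1, states in 0..M-1).
    Returns (pi, A, B): start, transition, and emission probability estimates.
    """
    pi = np.full(M, pseudocount)
    A = np.full((M, M), pseudocount)   # A[i, j] ~ # transitions i -> j
    B = np.full((M, K), pseudocount)   # B[i, k] ~ # emissions of symbol k from state i
    for x, y in seqs:
        pi[y[0]] += 1
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1

    def _norm(C):
        # Normalize counts into conditional distributions, leaving all-zero rows at zero.
        s = C.sum(axis=-1, keepdims=True)
        return np.divide(C, s, out=np.zeros_like(C), where=s > 0)

    return _norm(pi), _norm(A), _norm(B)

# Example: the 10 casino rolls discussed on the next slide, all labeled "Fair" (state 0).
x = np.array([2, 1, 5, 6, 1, 2, 3, 6, 2, 3]) - 1   # map faces 1..6 to 0..5
y = np.zeros(10, dtype=int)
pi, A, B = supervised_hmm_mle([(x, y)], M=2, K=6)
print(B[0])   # [0.2 0.3 0.2 0.  0.1 0.2] -- note the zero for face 4 (overfitting)
```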
Supervised ML estimation, ctd.
• Intuition:
  • When we know the underlying states, the best estimate of θ is the average
    frequency of transitions & emissions that occur in the training data
• Drawback:
  • Given little data, there may be overfitting:
    • P(x|θ) is maximized, but θ is unreasonable
    • 0 probabilities -- VERY BAD
• Example:
  • Given 10 casino rolls, we observe
    x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
    y = F, F, F, F, F, F, F, F, F, F
  • Then: aFF = 1; aFL = 0
    bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1
Pseudocounts
• Solution for small training sets:
  • Add pseudocounts:
    A_ij = # times state transition i→j occurs in y, + R_ij
    B_ik = # times state i in y emits k in x, + S_ik
  • R_ij, S_ik are pseudocounts representing our prior belief
  • Total pseudocounts: R_i = Σ_j R_ij, S_i = Σ_k S_ik
    --- the "strength" of the prior belief,
    --- the total number of imaginary instances in the prior
• Larger total pseudocounts ⇒ stronger prior belief
• Small total pseudocounts: just to avoid 0 probabilities --- smoothing
• This is equivalent to Bayesian estimation under a uniform prior with
  "parameter strength" equal to the pseudocounts
MLE for general BN parameters
• If we assume the parameters for each CPD are globally independent, and all nodes
  are fully observed, then the log-likelihood function decomposes into a sum of local
  terms, one per node:
  $\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) = \sum_i \Big( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \Big)$
Example: decomposable likelihood of a directed model
• Consider the distribution defined by the directed acyclic GM:
  $p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$
• This is exactly like learning four separate small BNs, each of which consists of a node
  and its parents.
  (Figure: the network X1 → {X2, X3} → X4 broken into four such parent-child families.)
E.g.: MLE for BNs with tabular CPDs
• Assume each CPD is represented as a table (multinomial) where
  $\theta_{ijk} \stackrel{\text{def}}{=} p(X_i = j \mid X_{\pi_i} = k)$
• Note that in case of multiple parents, $X_{\pi_i}$ will have a composite state, and the
  CPD will be a high-dimensional table
• The sufficient statistics are counts of family configurations
  $n_{ijk} \stackrel{\text{def}}{=} \sum_n x_{n,i}^j\, x_{n,\pi_i}^k$
• The log-likelihood is
  $\ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$
• Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:
  $\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$
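A short sketch of this counting estimator for one tabular CPD (hypothetical helper; assumes discrete data encoded as integers):

```python
import numpy as np

def tabular_cpd_mle(data, i, parents, card):
    """MLE of a tabular CPD p(X_i | X_parents) from fully observed data.

    data: (N, num_vars) integer array of joint assignments.
    card: list of cardinalities, one per variable.
    Returns theta with theta[k, j] = p(X_i = j | parents in composite state k).
    """
    n_parent_cfgs = int(np.prod([card[p] for p in parents])) if parents else 1
    counts = np.zeros((n_parent_cfgs, card[i]))
    for row in data:
        k = 0
        for p in parents:               # composite parent state (mixed-radix index)
            k = k * card[p] + row[p]
        counts[k, row[i]] += 1          # n_ijk
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Toy usage: 3 binary variables, estimate p(X2 | X0, X1).
data = np.array([[0, 0, 0], [0, 1, 1], [1, 1, 1], [1, 1, 0], [0, 0, 1]])
print(tabular_cpd_mle(data, i=2, parents=[0, 1], card=[2, 2, 2]))
```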
Learning partially observed GMs
• The model: Z → X, where Z is unobserved
• The data:
  {(x(1)), (x(2)), (x(3)), ... , (x(N))}
What if some nodes are not observed?
• Consider the distribution defined by the directed acyclic GM:
  $p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$
  (Figure: the network X1 → {X2, X3} → X4 with some nodes unobserved.)
• Need to compute $p(x_H \mid x_V)$ ⇒ inference
Recall: EM Algorithm
• A way of maximizing the likelihood function for latent variable models.
  Finds the MLE of parameters when the original (hard) problem can be broken up
  into two (easy) pieces:
  1. Estimate some "missing" or "unobserved" data from observed data and current parameters.
  2. Using this "complete" data, find the maximum likelihood parameter estimates.
• Alternate between filling in the latent variables using the best guess (posterior)
  and updating the parameters based on this guess:
  • E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
  • M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$
• In the M-step we optimize a lower bound on the likelihood. In the E-step we close
  the gap, making bound = likelihood.
EM for general BNs
  while not converged
    % E-step
    for each node i
      ESS_i = 0                        % reset expected sufficient statistics
    for each data sample n
      do inference with x_{n,H}        % compute p(x_{n,H} | x_{n,-H})
      for each node i
        ESS_i += < SS_i(x_{n,i}, x_{n,pi_i}) >_{p(x_{n,H} | x_{n,-H})}
    % M-step
    for each node i
      theta_i := MLE(ESS_i)
Example: HMM
• Supervised learning: estimation when the "right answer" is known
  • Examples:
    GIVEN: a genomic region x = x1...x1,000,000 where we have good
    (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening,
    as he changes dice and produces 10,000 rolls
• Unsupervised learning: estimation when the "right answer" is unknown
  • Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there,
    nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
• QUESTION: Update the parameters θ of the model to maximize P(x|θ)
  --- maximum likelihood (ML) estimation
The Baum-Welch algorithm
• The complete log likelihood:
  $\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \sum_n \Big( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \Big)$
• The expected complete log likelihood:
  $\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i
   + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j}
   + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k}$
• EM
  • The E step:
    $\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$
    $\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i\, y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$
  • The M step ("symbolically" identical to MLE):
    $a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$,
    $\quad b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$,
    $\quad \pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}$
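A compact single-sequence sketch of Baum-Welch (scaled forward-backward for the E-step, then the M-step ratios from this slide); this is illustrative code, not the lecture's implementation:

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """Scaled forward-backward for one observation sequence x (ints in 0..K-1).

    Returns gamma[t, i] = p(y_t = i | x) and xi[t, i, j] = p(y_t = i, y_{t+1} = j | x).
    """
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M)); beta = np.zeros((T, M)); c = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    xi = np.zeros((T - 1, M, M))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]
    return gamma, xi

def baum_welch_step(x, pi, A, B, K):
    """One EM iteration: E-step posteriors, then the M-step updates from the slide."""
    gamma, xi = forward_backward(x, pi, A, B)
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(K):
        B_new[:, k] = gamma[x == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

# Tiny usage example with random data and random initial parameters (illustrative only):
rng = np.random.default_rng(0)
x = rng.integers(0, 6, size=200)
M, K = 2, 6
pi = np.full(M, 1.0 / M)
A = rng.dirichlet(np.ones(M), size=M)
B = rng.dirichlet(np.ones(K), size=M)
for _ in range(20):
    pi, A, B = baum_welch_step(x, pi, A, B, K)
```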
Unsupervised ML estimation
• Given x = x1...xN for which the true state path y = y1...yN is unknown,
• EXPECTATION MAXIMIZATION
  0. Starting with our best guess of a model M, parameters θ:
  1. Estimate A_ij, B_ik in the training data
     How?  $A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle$,  $B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle\, x_{n,t}^k$
  2. Update θ according to A_ij, B_ik
     Now a "supervised learning" problem
  3. Repeat 1 & 2, until convergence
  This is called the Baum-Welch Algorithm
• We can get to a provably more (or equally) likely parameter set θ each iteration
ML Structural Learning for
completely observed GMs
• Data:
  (x1(1), ..., xn(1))
  (x1(2), ..., xn(2))
  ...
  (x1(M), ..., xn(M))
• Goal: recover the graph (e.g., a network over E, B, R, A, C) from the data.
Information Theoretic Interpretation of ML
$\ell(\theta_G, G; D) = \log p(D \mid \theta_G, G) = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i|\pi_i(G)})$
$\quad = \sum_i \Big( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i|\pi_i(G)}) \Big)$
$\quad = M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \frac{\mathrm{count}(x_i, \mathbf{x}_{\pi_i(G)})}{M} \log p(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i|\pi_i(G)}) \Big)$
$\quad = M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log p(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i|\pi_i(G)}) \Big)$
From a sum over data points to a sum over counts of variable states
Information Theoretic Interpretation of ML (con'd)
$\ell(\hat{\theta}_G, G; D) = \log p(D \mid \hat{\theta}_G, G) = M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log \hat{p}(x_i \mid \mathbf{x}_{\pi_i(G)}) \Big)$
$\quad = M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}(x_i, \mathbf{x}_{\pi_i(G)}) \log \frac{\hat{p}(x_i, \mathbf{x}_{\pi_i(G)})}{\hat{p}(\mathbf{x}_{\pi_i(G)})\, \hat{p}(x_i)} \Big) + M \sum_i \Big( \sum_{x_i} \hat{p}(x_i) \log \hat{p}(x_i) \Big)$
$\quad = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)$
Decomposable score and a function of the graph structure
Structural Search
• How many graphs over n nodes?  $O(2^{n^2})$
• How many trees over n nodes?  $O(n!)$
• But it turns out that we can find the exact solution of an optimal tree (under MLE)!
  • Trick: in a tree each node has only one parent!
  • Chow-Liu algorithm
Chow-Liu tree learning algorithm
• Objective function:
  $\ell(\hat{\theta}_G, G; D) = \log p(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)$
  $\Rightarrow\; C(G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)})$
• Chow-Liu:
  • For each pair of variables xi and xj:
    • Compute the empirical distribution: $\hat{p}(X_i, X_j) = \frac{\mathrm{count}(x_i, x_j)}{M}$
    • Compute the mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)}$
  • Define a graph with nodes x1, ..., xn
    • Edge (i, j) gets weight $\hat{I}(X_i, X_j)$
Chow-Liu algorithm (con'd)
• Objective function:
  $\ell(\hat{\theta}_G, G; D) = \log p(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)}) - M \sum_i \hat{H}(x_i)$
  $\Rightarrow\; C(G) = M \sum_i \hat{I}(x_i, \mathbf{x}_{\pi_i(G)})$
• Chow-Liu: optimal tree BN
  • Compute the maximum weight spanning tree
  • Direction in BN: pick any node as root, do breadth-first search to define directions
• I-equivalence:
  (Figure: differently rooted and directed trees over A, B, C, D, E sharing the skeleton
  A-B, A-C, C-D, C-E; all have the same score.)
  $C(G) = I(A,B) + I(A,C) + I(C,D) + I(C,E)$
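A short sketch of the Chow-Liu procedure described on the last two slides (plug-in mutual information plus a maximum-weight spanning tree); the function names and toy data are illustrative:

```python
import numpy as np
from collections import Counter

def empirical_mi(data, i, j):
    """Plug-in mutual information between discrete columns i and j of data."""
    M = len(data)
    joint = Counter(zip(data[:, i], data[:, j]))
    pi, pj = Counter(data[:, i]), Counter(data[:, j])
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / M
        mi += p_ab * np.log(p_ab * M * M / (pi[a] * pj[b]))   # p_ab * log(p_ab / (p_a p_b))
    return mi

def chow_liu_tree(data):
    """Return the edge list of a maximum-weight spanning tree under empirical MI."""
    n = data.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = empirical_mi(data, i, j)
    # Prim's algorithm, picking the heaviest edge leaving the current tree each step.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: W[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges   # orienting edges away from any chosen root gives the tree BN

# Toy usage on binary data: x1 mostly copies x0, x2 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=500)
data = np.column_stack([x0, x0 ^ (rng.random(500) < 0.1), rng.integers(0, 2, 500)])
print(chow_liu_tree(data))
```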
Structure Learning for general graphs
• Theorem:
  • The problem of learning a BN structure with at most d parents is
    NP-hard for any (fixed) d ≥ 2
• Most structure learning approaches use heuristics
  • Exploit score decomposition
  • Two heuristics that exploit decomposition in different ways:
    • Greedy search through the space of node orderings
    • Local search of graph structures
Inferring gene regulatory networks
• Network of cis-regulatory pathways
• Success stories in sea urchin, fruit fly, etc., from decades of experimental research
• Statistical modeling and automated learning have just started
Gene Expression Profiling by Microarrays
(Figure: a signaling pathway over Receptor A (X1), Receptor B (X2), Kinase C (X3),
Kinase D (X4), Kinase E (X5), Trans. Factor F (X6), Gene G (X7), and Gene H (X8),
shown in several panels.)
Structure Learning Algorithms
• Structural EM (Friedman 1998)
  • The original algorithm
• Sparse Candidate Algorithm (Friedman et al.)
  • Discretizing array signals
  • Hill-climbing search using local operators: add/delete/swap of a single edge
  • Feature extraction: Markov relations, order relations
  • Re-assemble high-confidence sub-networks from features
• Module network learning (Segal et al.)
  • Heuristic search of structure in a "module graph"
  • Module assignment
  • Parameter sharing
  • Prior knowledge: possible regulators (TF genes)
(Figure: expression data fed into the learning algorithm, which outputs a network
such as one over E, B, R, A, C.)
Learning GM structure
• Learning the best CPDs given a DAG is easy
  • collect statistics of the values of each node given a specific assignment to its parents
• Learning the graph topology (structure) is NP-hard
  • heuristic search must be applied; it generally leads to a locally optimal network
• Overfitting
  • It turns out that richer structures give higher likelihood P(D|G) to the data
    (adding an edge is always preferable):
    $P(C \mid A) \le P(C \mid A, B)$
  • more parameters to fit => more freedom => there always exists a more "optimal" CPD(C)
• We prefer simpler (more explanatory) networks
  • Practical scores regularize the likelihood improvement of complex networks.
Learning Graphical Model Structure via Neighborhood Selection
Undirected Graphical Models
• Why? Sometimes an UNDIRECTED association graph makes more sense and/or is
  more informative
  • gene expressions may be influenced by unobserved factors that are
    post-transcriptionally regulated
  • the unavailability of the state of B results in a constraint over A and C
    (Figure: small three-node examples with B connected to both A and C.)
Gaussian Graphical Models
• Multivariate Gaussian density:
  $p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right\}$
• WOLG: let $Q = \Sigma^{-1}$ and $\mu = 0$:
  $p(x_1, x_2, \ldots, x_p \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{p/2}} \exp\!\left\{ -\tfrac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \right\}$
• We can view this as a continuous Markov Random Field with potentials defined on
  every node and edge.
The covariance and the precision matrices
• Covariance matrix $\Sigma$
  • Graphical model interpretation?
• Precision matrix $Q = \Sigma^{-1}$
  • Graphical model interpretation?
Sparse precision vs. sparse covariance in GGM
(Figure: a chain GGM over X1 - X2 - X3 - X4 - X5; its precision matrix is sparse, with zeros
for all non-adjacent pairs, while its covariance matrix is dense.)
• $q_{15} = 0 \;\Rightarrow\; X_1 \perp X_5 \mid X_{\mathrm{nbrs}(1)}$ (or $X_{\mathrm{nbrs}(5)}$)
• $X_1 \perp X_5 \;\Leftrightarrow\; \Sigma_{15} = 0$
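A small numerical illustration of this contrast (the chain precision matrix below is my own example, not from the slides):

```python
import numpy as np

# Precision matrix of a 5-node chain GGM: only adjacent pairs have nonzero entries.
Q = np.array([[ 2., -1.,  0.,  0.,  0.],
              [-1.,  2., -1.,  0.,  0.],
              [ 0., -1.,  2., -1.,  0.],
              [ 0.,  0., -1.,  2., -1.],
              [ 0.,  0.,  0., -1.,  2.]])
Sigma = np.linalg.inv(Q)
print(np.round(Sigma, 2))     # dense: X1 and X5 are marginally dependent
print(Q[0, 4])                # 0.0: X1 is independent of X5 given the remaining nodes
# Partial correlation between X1 and X5: -q_15 / sqrt(q_11 * q_55) = 0.
print(-Q[0, 4] / np.sqrt(Q[0, 0] * Q[4, 4]))
```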
Another example
• How to estimate this MRF?
• What if p >> n?
  • MLE does not exist in general!
  • What about only learning a "sparse" graphical model?
    • This is possible when s = o(n)
    • Very often it is the structure of the GM that is more interesting ...
Single-node Conditional
• The conditional dist. of a single node i given the rest of the nodes can be written as:
• WOLG: let
• We can write the following conditional auto-regression function for each node:
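For reference, the standard form of this conditional for a zero-mean Gaussian with precision matrix Q (a textbook identity, not recovered from the slide):

```latex
% Conditional of a single node in a zero-mean GGM with precision matrix Q:
% a conditional auto-regression X_i = sum_{j != i} beta_{ij} X_j + eps_i,
% eps_i ~ N(0, 1/q_{ii}), with weights determined by the precision matrix.
\[
  X_i \mid \mathbf{x}_{-i} \;\sim\; \mathcal{N}\!\Big( \sum_{j \neq i} \beta_{ij}\, x_j,\; q_{ii}^{-1} \Big),
  \qquad \beta_{ij} = -\frac{q_{ij}}{q_{ii}} .
\]
```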
Conditional independence
• From
• Let:
• Given an estimate of the neighborhood s_i, we have:
• Thus the neighborhood s_i defines the Markov blanket of node i
Recall lasso
Graph Regression
• Lasso: neighborhood selection
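A minimal sketch of lasso-based neighborhood selection (graph regression) using scikit-learn; the penalty value, the zero threshold, and the OR-symmetrization rule are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, alpha=0.1):
    """Neighborhood selection: regress each variable on all the others with an L1 penalty.

    The nonzero coefficients give an estimate of that node's neighborhood (Markov blanket).
    X: (n_samples, p) data matrix. Returns a symmetrized p x p boolean adjacency matrix
    (an edge is kept if either of the two regressions selects it).
    """
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        coef = Lasso(alpha=alpha).fit(X[:, others], X[:, i]).coef_
        for j, c in zip(others, coef):
            if abs(c) > 1e-10:
                adj[i, j] = True
    return adj | adj.T

# Toy usage: sample from a 5-node chain GGM and try to recover its edges.
rng = np.random.default_rng(0)
Q = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)   # chain precision matrix
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q), size=2000)
print(neighborhood_selection(X, alpha=0.05).astype(int))
```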
Graph Regression
Graph Regression
• It can be shown that, given iid samples and under several technical conditions
  (e.g., the "irrepresentable" condition), the recovered structure is "sparsistent"
  even when p >> n.
Consistency
• Theorem: for the graphical regression algorithm, under certain verifiable conditions
  (omitted here for simplicity):
• Note that from this theorem one should see that the regularizer is not actually used
  to introduce an "artificial" sparsity bias, but is a device to ensure consistency under
  finite-data and high-dimension conditions.
Learning (sparse) GGM
• Multivariate Gaussian over all continuous expressions:
  $p([x_1, \ldots, x_n]) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right\}$
• The precision matrix $Q = \Sigma^{-1}$ reveals the topology of the (undirected) network
• Learning Algorithm: covariance selection
  • Want a sparse matrix Q
  • As shown in the previous slides, we can use L_1 regularized linear regression to
    obtain a sparse estimate of the neighborhood of each variable
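Alternatively, the sparse precision matrix can be estimated directly with an L1-penalized Gaussian likelihood (the graphical-lasso line of work surveyed on the next slide); a brief sketch with scikit-learn on simulated chain data:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# Simulate from a 5-node chain GGM, then estimate a sparse precision matrix directly.
rng = np.random.default_rng(0)
Q_true = 2 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q_true), size=2000)

model = GraphicalLassoCV().fit(X)        # cross-validated choice of the L1 penalty
Q_hat = model.precision_
print(np.round(Q_hat, 2))
print((np.abs(Q_hat) > 1e-2).astype(int))   # estimated edge pattern (nonzero off-diagonals)
```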
Recent trends in GGM:
• Covariance selection (classical method)
  • Dempster [1972]:
    • Sequentially pruning the smallest elements of the precision matrix
  • Drton and Perlman [2008]:
    • Improved statistical tests for pruning
  • Serious limitations in practice: breaks down when the covariance matrix is not invertible
• L1-regularization based method (hot!)
  • Meinshausen and Bühlmann [Ann. Stat. 06]:
    • Used LASSO regression for neighborhood selection
  • Banerjee [JMLR 08]:
    • Block sub-gradient algorithm for finding the precision matrix
  • Friedman et al. [Biostatistics 08]:
    • Efficient fixed-point equations based on a sub-gradient algorithm
  • ...
  • Structure learning is possible even when # variables > # samples
Learning Ising Model (i.e. pairwise MRF)
• Assuming the nodes are discrete, and edges are weighted, then for a sample xd we have
• It can be shown, following the same logic, that we can use L_1 regularized logistic
  regression to obtain a sparse estimate of the neighborhood of each variable in the
  discrete case.
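A minimal sketch of this L1-regularized logistic-regression estimator for a binary pairwise MRF (assuming the standard form p(x) ∝ exp(Σ_i θ_i x_i + Σ_(i,j) θ_ij x_i x_j)); the penalty and the toy data are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhood_selection(X, C=0.5):
    """Sparse neighborhood estimation for a binary pairwise MRF (Ising model).

    For each node i, fit an L1-penalized logistic regression of x_i on the other nodes;
    nonzero weights are the estimated neighbors. X: (n_samples, p) array with entries
    in {0, 1}. Returns a symmetrized boolean adjacency matrix (OR rule).
    """
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], X[:, i])
        for j, w in zip(others, clf.coef_.ravel()):
            if abs(w) > 1e-6:
                adj[i, j] = True
    return adj | adj.T

# Tiny illustration: x1 strongly coupled to x0, x2 independent.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=2000)
x1 = np.where(rng.random(2000) < 0.9, x0, 1 - x0)
x2 = rng.integers(0, 2, size=2000)
X = np.column_stack([x0, x1, x2])
print(ising_neighborhood_selection(X).astype(int))
```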