1. Sparse Inverse Covariance Estimation using skggm
Manjari Narayan, Postdoctoral Scholar, Stanford University (School of Medicine)
(PI: Amit Etkin, M.D., Ph.D.)
Tutorial presented at the Junior Scientist Workshop at HHMI, Janelia Farms
skggm: collaboration with Dr. Jason Laska, ML R&D at Clara Labs
2. Explosion of Functional Imaging Tools
fMRI, fNIRS, EEG, MEG; intracranial EEG; micro-ECoG; molecular fMRI; calcium imaging; light sheet microscopy; voltage-sensitive dye imaging; light field microscopy.
Image credits: Marie Suver, Ph.D. and Ainul Huda, University of Washington, and Michael H. Dickinson, Ph.D., California Institute of Technology; Misha Ahrens, Ph.D., Janelia Farms (https://www.simonsfoundation.org/features/foundation-news/how-do-different-brain-regions-interact-to-enhance-function/); Tang, 2015, Scientific Reports; Raju Tomer, Ph.D. & Deisseroth Lab, Stanford University (http://techfinder.stanford.edu/technology_detail.php?ID=36402); http://newsroom.cumc.columbia.edu/blog/2014/11/11/researchers-receive-nih-brain-initiative-funding/
3. Application: Functional Connectomics
The network as the unit of interest: unobserved stochastic dependence/interaction between neurons, circuits, regions, …
A shared goal across modalities & resolutions (macroscale, mesoscale).
Ahrens et al., Nature (2012)
[Figure: data matrix with T (or n) observations and p variables]
4. Probabilistic Graphical Models
Many probabilistic models available, both directed and undirected.
• Graph $G = (V, E)$; vertices $V = \{1, \ldots, p\}$, edges $E \subset V \times V$
• $X = (X_1, \ldots, X_p) \sim P_X$
• A probabilistic graphical model relates $P_X$ to $G$:
$(j, k) \notin E \iff$ independence or conditional independence between $X_j$ and $X_k$
[Figure: example graphs with observed and unobserved variables]
Examples:
• Directed acyclic graphs (DAGs/Bayes nets)
• State-space models, including linear/nonlinear VAR
• Undirected graphical models or Markov networks
• Bivariate associations (correlation, Granger causality, transfer entropy)
5. Models for Connectivity: Conditional Dependence & Markov Networks
More informative than correlations: a measure of “direct” interactions that eliminates “indirect” interactions due to observed common causes; i.e., conditional dependence (“partial correlations”) rather than marginal dependence (“marginal correlations”). A numerical sketch of this distinction follows below.
Benefits:
• Studying cognitive mechanisms
• Designing interventional targets
• Science-wide efficient use of data
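To make the distinction concrete, here is a minimal numpy sketch (the chain structure and coefficients are illustrative assumptions, not from the slides): two variables linked only through an intermediate variable are marginally correlated, yet their partial correlation, read off the inverse covariance, is approximately zero.

```python
# Chain X1 -> X2 -> X3: X1 and X3 are conditionally independent given X2,
# but still marginally correlated.
import numpy as np

rng = np.random.default_rng(0)
T = 50_000
x1 = rng.standard_normal(T)
x2 = 0.8 * x1 + rng.standard_normal(T)
x3 = 0.8 * x2 + rng.standard_normal(T)
X = np.column_stack([x1, x2, x3])

# Marginal correlations: X1 and X3 look strongly associated.
R = np.corrcoef(X, rowvar=False)
print("marginal corr(X1, X3):", round(R[0, 2], 3))

# Partial correlation from the inverse covariance (precision) matrix:
# rho_{jk.rest} = -Theta_{jk} / sqrt(Theta_{jj} * Theta_{kk})
Theta = np.linalg.inv(np.cov(X, rowvar=False))
partial = -Theta[0, 2] / np.sqrt(Theta[0, 0] * Theta[2, 2])
print("partial corr(X1, X3 | X2):", round(partial, 3))  # approximately 0
```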
7. Markov Properties
• Graph $G = (V, E)$
• Vertices $V = \{1, 2, \ldots, p\}$ and edges $E \subset V \times V$
• Multivariate normal $x_1, \ldots, x_T \overset{\text{i.i.d.}}{\sim} N_p(0, \Sigma)$
• Inverse covariance $\Sigma^{-1} = \Theta$
Pairwise Markov property (P): two variables are conditionally independent given all other nodes, e.g.
$X_5 \perp X_1 \mid X_{V \setminus \{1, 5\}}$
[Figure: example undirected graph on nodes 1-5]
Lauritzen (1996)
8. Markov Properties
• Graph $G = (V, E)$
• Vertices $V = \{1, 2, \ldots, p\}$ and edges $E \subset V \times V$
• Multivariate normal $x_1, \ldots, x_T \overset{\text{i.i.d.}}{\sim} N_p(0, \Sigma)$
• Inverse covariance $\Sigma^{-1} = \Theta$
Local Markov property (L): a variable is conditionally independent of all others given its neighbors, e.g.
$X_5 \perp X_{V \setminus \mathrm{ne}(5)} \mid X_{\mathrm{ne}(5)}$, where $\mathrm{ne}(5)$ denotes the neighbors of node 5 in the figure
[Figure: example undirected graph on nodes 1-5]
Lauritzen (1996)
9. Markov Properties
• Graph $G = (V, E)$
• Vertices $V = \{1, 2, \ldots, p\}$ and edges $E \subset V \times V$
• Multivariate normal $x_1, \ldots, x_T \overset{\text{i.i.d.}}{\sim} N_p(0, \Sigma)$
• Inverse covariance $\Sigma^{-1} = \Theta$
Global Markov property (G): given three disjoint sets $A$, $B$, and $C$ such that all paths from $A$ to $B$ go through $C$, then $A$ is conditionally independent of $B$ given $C$:
$X_A \perp X_B \mid X_C$, where $X_A = \{X_a\}_{a \in A}$
[Figure: example graph on nodes 1-5, partitioned into sets A, B, C]
Lauritzen (1996)
10. Benefits of Global Markov Properties
• Intersection property: holds for positive densities, e.g. Gaussian (and has been extended to some non-positive densities!):
if $A \perp B \mid (C, D)$ and $A \perp C \mid (B, D)$, then $A \perp (B \cup C) \mid D$
• Factorizes the probability distribution:
$P(X) = P(X_A \mid X_C)\, P(X_B \mid X_C)\, P(X_C)$
• Computational tractability & statistical power to identify all conditional independences
[Figure: example graph on nodes 1-5, partitioned into sets A, B, C]
Lauritzen (1996)
11. Generality of Markov Networks
For many types of pairwise associations, there are Markov networks that satisfy the global Markov property:
• Correlation → zero partial correlation = conditional independence
• Coherence or coherency → zero partial coherence = conditional independence
• Directed information (including transfer entropy, Sims/Granger prediction, …) → dynamic extensions of the standard Markov properties; local independence (Didelez 2008)
• Pairwise ordering between variables → DAGs, CPDAGs, MAGs, PAGs, …
This is not an exhaustive list!
12. Generality of Markov Networks
For many probability distributions, there are Markov networks that satisfy at least the local, if not the global, Markov property:
• Exponential families (binary, Poisson, circular, …) → exponential-family MRFs, including binary Ising models and Poisson graphical models (P. Ravikumar, G.I. Allen, and others)
• Nonparametric distributions → nonparanormal (copula) graphical models, kernel graphical models (H. Liu, E. Xing, B. Scholkopf, and others)
• Separable (spatio-temporal) covariance structure → separable Markov networks (G.I. Allen, S. Zhou, A. Hero, P. Hoff, and many others)
13. From now on: Gaussian Graphical Model
• Graph $G = (V, E)$
• Vertices $V = \{1, 2, \ldots, p\}$ and edges $E \subset V \times V$
• Multivariate normal $x_1, \ldots, x_T \overset{\text{i.i.d.}}{\sim} N_p(0, \Sigma)$
• Inverse covariance $\Sigma^{-1} = \Theta$
Zero in the inverse covariance = conditional independence:
$X_k \perp X_l \mid X_{V \setminus \{k, l\}} \iff (\Sigma^{-1})_{kl} = 0 \iff (k, l) \notin E$
[Figure: example graph on nodes 1-5 and the sparsity pattern of its 5 x 5 inverse covariance]
Lauritzen (1996)
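A tiny numpy sketch (the chain graph and its coefficients are illustrative, not from the slides) shows the equivalence in action: zeros in the precision matrix mark missing edges, while the implied covariance is fully dense.

```python
import numpy as np

# Chain graph 1 - 2 - 3 - 4 - 5: off-diagonal nonzeros only between neighbors.
Theta = np.eye(5)
for j in range(4):
    Theta[j, j + 1] = Theta[j + 1, j] = -0.4

assert np.all(np.linalg.eigvalsh(Theta) > 0)   # a valid (positive definite) precision

Sigma = np.linalg.inv(Theta)
print("Edge pattern (nonzero off-diagonals of Theta):")
print((np.abs(Theta) > 1e-10).astype(int))
print("Covariance is dense (every pair is marginally correlated):")
print(np.round(Sigma, 2))
```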
14. From now on: Gaussian Graphical Model (continued)
Same setup and zero-pattern equivalence as the previous slide: zero in the inverse covariance = conditional independence.
Important for nonparametric distributions + exponential family.
Lauritzen (1996)
16. Gaussian Log-Likelihood
Likelihood for the inverse covariance. The input to the log-likelihood is effectively the sample covariance:
$\hat{\Sigma} = \frac{1}{T} X^{\top} X$, where the data matrix $X_{T \times p}$ is centered
$\mathcal{L}(\hat{\Sigma}; \Theta) \equiv \log \det \Theta - \langle \hat{\Sigma}, \Theta \rangle$
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006)
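As a quick check of the formula, here is a minimal numpy sketch (illustrative, not part of skggm) that evaluates $\log \det \Theta - \langle \hat{\Sigma}, \Theta \rangle$ for candidate precision matrices; when the sample covariance is invertible, the unpenalized likelihood is maximized by $\Theta = \hat{\Sigma}^{-1}$.

```python
import numpy as np

def gaussian_loglik(sigma_hat, theta):
    """log det(Theta) - <Sigma_hat, Theta>; requires Theta positive definite."""
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "Theta must be positive definite"
    return logdet - np.trace(sigma_hat @ theta)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X -= X.mean(axis=0)                      # centered data matrix X_{T x p}
sigma_hat = (X.T @ X) / X.shape[0]       # sample covariance, (1/T) X'X

# The unpenalized likelihood is maximized at Theta = inv(Sigma_hat).
print(gaussian_loglik(sigma_hat, np.linalg.inv(sigma_hat)))
print(gaussian_loglik(sigma_hat, np.eye(5)))  # any other candidate scores lower
```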
18. Gaussian Log-Likelihood: Variance-Correlation Decomposition
Likelihood for the inverse covariance. Put all variables on the same scale:
$\hat{\Sigma} = \frac{1}{T} X^{\top} X$, $\quad R(\hat{\Sigma}) = D^{-1/2}\, \hat{\Sigma}\, D^{-1/2}$, $\quad D = \mathrm{diag}(\hat{\Sigma})$
$\mathcal{L}(\hat{\Sigma}; \Theta) \equiv \log \det \Theta - \langle R(\hat{\Sigma}), \Theta \rangle$
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006)
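The variance-correlation rescaling is a one-liner in numpy; this illustrative sketch verifies that $R(\hat{\Sigma})$ has unit diagonal regardless of the original variable scales.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) * np.array([1.0, 5.0, 0.1, 2.0, 1.0])  # unequal scales
X -= X.mean(axis=0)
sigma_hat = (X.T @ X) / X.shape[0]

d_inv_sqrt = 1.0 / np.sqrt(np.diag(sigma_hat))
R = sigma_hat * np.outer(d_inv_sqrt, d_inv_sqrt)   # D^{-1/2} Sigma_hat D^{-1/2}
print(np.round(np.diag(R), 6))                     # all ones: common scale
```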
19. Degeneracy of Likelihood in High Dimensions
Given $X_{T \times p}$ with $T \approx p$, the likelihood has low curvature and nearby precision matrices are hard to distinguish.
[Figure: likelihood surfaces with high curvature (easy) vs. low curvature (hard)]
Credit: Negahban, Ravikumar, Wainwright & Yu, Statistical Science, 2012; “A Unified Framework for High-Dimensional Analysis of M-estimators with Decomposable Regularizers”
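A small numpy sketch (illustrative) shows the degeneracy numerically: as $T$ approaches $p$ the condition number of the sample covariance blows up, and for $T < p$ it is singular.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 100
for T in (1000, 110, 80):
    X = rng.standard_normal((T, p))
    sigma_hat = (X.T @ X) / T
    print(f"T = {T:4d}, p = {p}: condition number of Sigma_hat = "
          f"{np.linalg.cond(sigma_hat):.2e}")
```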
20. Sparse Inverse Covariance
Sparse penalized maximum likelihood: encourage sparsity with a Lasso penalty,
$\hat{\Theta}(\lambda) = \underset{\Theta \succ 0}{\text{maximize}}\; \mathcal{L}(\hat{\Sigma}; \Theta) - \lambda \|\Theta\|_{1,\mathrm{off}}, \qquad \|\Theta\|_{1,\mathrm{off}} = \sum_{j \neq k} |\theta_{j,k}|$
Convex problem: many optimization solutions available.
Popular alternative if (L) $\Rightarrow$ (G): neighborhood selection.
“Covariance Selection”, Dempster (1972); Banerjee et al. (2006); Yuan (2006); Friedman et al. (2008); “QUIC”, Hsieh et al. (2011 & 2013); Buhlmann & Van De Geer (2011); Meinshausen & Buhlmann (2006)
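A minimal fitting sketch, using scikit-learn's GraphicalLasso as a stand-in solver for the same penalized MLE (the slides later use skggm's QUIC-based estimators); the chain-graph ground truth below is an illustrative assumption.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Illustrative 5-node chain precision as ground truth.
theta_true = np.eye(5)
for j in range(4):
    theta_true[j, j + 1] = theta_true[j + 1, j] = -0.4
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(theta_true), size=500)

model = GraphicalLasso(alpha=0.05).fit(X)   # alpha is the l1 penalty (lambda above)
print("Estimated edge pattern (nonzero off-diagonals of Theta_hat):")
print((np.abs(model.precision_) > 1e-4).astype(int))
```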
21. Model Identifiability of Sparse MLE
When is perfect edge recovery possible?
• The Fisher information of the inverse covariance needs to be well conditioned (the incoherence/irrepresentability condition).
• The signal strength of edges needs to be sufficiently larger than the noise.
• Caveat: these conditions might always hold at infinite sample size, but only probabilistically in finite samples.
Meinshausen & Buhlmann (2006); Ravikumar et al. (2010, 2011); Van De Geer & Buhlmann (2013); and others
22. Model Identifiability: Network Structure Matters
Theoretical assumptions are often violated for many networks at finite samples (Narayan et al. 2015a).
• Do two unconnected nodes share “mutual friends”? This difficulty increases with degree and depends on the network structure (Meinshausen & Buhlmann 2006; Ravikumar et al. 2010, 2011; Cai & Zhou 2015).
• The more correlated the nodes, the more errors in distinguishing edges from non-edges.
23. Model Identifiability of Sparse MLE
When is perfect edge recovery possible?
• We will only look at the Lasso and its improved variants.
• Different estimators have slightly different limitations: pseudolikelihood, least squares, Dantzig-type, ….
• Other regularizers behave differently as well.
See the review of graphical models by Drton & Maathuis (2016).
24. skggm: Inverse covariance estimation
By @jasonlaska and @mnarayan
Features:
• scikit-learn interface
• Comprehensive range of estimators, model selection procedures, metrics, Monte Carlo benchmarks of statistical error control, …
• For the researcher: benchmark a new estimator/algorithm against others
• For the data analyst: best practices for estimation & structure learning
GitHub repo: http://github.com/jasonlaska/skggm
Tutorial notebooks: http://neurostats.org/jf2016-skggm/
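A sketch of the scikit-learn-style interface; the module, class, and parameter names (inverse_covariance, QuicGraphLasso, lam) are taken from the skggm repository around the time of this tutorial, so check the repo and tutorial notebooks for the exact current API.

```python
import numpy as np
from inverse_covariance import QuicGraphLasso   # pip install skggm

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10))    # rows = samples/time points, columns = variables

model = QuicGraphLasso(lam=0.2)       # lam plays the role of lambda above
model.fit(X)                          # the familiar scikit-learn fit idiom

print(model.precision_.shape)                          # estimated inverse covariance
print(int((np.abs(model.precision_) > 1e-8).sum()))    # number of nonzero entries
```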
27. Saturated Precision Matrices
Saturation: estimate all entries of the inverse covariance (precision) matrix.
Recall: with high curvature of the likelihood, it is easy to distinguish different graphs.
28. Saturated Precision Matrices
Degeneracy at low sample sizes (using the pseudo-inverse for a degenerate sample covariance).
Recall: with low curvature of the likelihood, it is hard to distinguish different graphs.
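A short numpy sketch (illustrative) of the saturated estimate: when $T < p$ the sample covariance is rank deficient, the pseudo-inverse is used, and every off-diagonal entry of the resulting precision estimate is nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 40, 60                                   # fewer samples than variables
X = rng.standard_normal((T, p))
X -= X.mean(axis=0)
sigma_hat = (X.T @ X) / T

print("rank of Sigma_hat:", np.linalg.matrix_rank(sigma_hat), "out of", p)
theta_sat = np.linalg.pinv(sigma_hat)           # saturated (unregularized) estimate
off_diag = theta_sat[~np.eye(p, dtype=bool)]
print("nonzero off-diagonal entries:", int((np.abs(off_diag) > 1e-10).sum()),
      "out of", p * (p - 1))
```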
29. Standard Graphical Lasso
Sparse penalized maximum likelihood:
$\hat{\Theta}(\lambda) = \underset{\Theta \succ 0}{\arg\min}\; -\mathcal{L}(\hat{\Sigma}; \Theta) + \lambda\, \mathrm{Pen}(\Theta)$
Model selection: how do we choose the regularization/sparsity/non-zero support?
Friedman et al. (2007); Meinshausen and Buhlmann (2006); Banerjee et al. (2006); Rothman (2008); Hsieh et al.; Cai et al. (2011); and many more.
30. Cross Validation: Minimizes Type II Errors
Split the data into training and hold-out sets, $X \to (X^{*,\text{train}}, X^{*,\text{test}})$; fit $\{\hat{\Theta}^{*}(\lambda)\}^{\text{train}}$ on the training split; evaluate $\mathrm{Loss}\big(\{\hat{\Theta}^{*}(\lambda)\}^{\text{train}};\ \{\hat{\Sigma}^{*}\}^{\text{test}}\big)$ on the hold-out split, e.g. Kullback-Leibler divergence or the log-likelihood.
Yuan and Lin (2007); Bickel and Levina (2008)
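A minimal sketch of cross-validated selection of $\lambda$, using scikit-learn's GraphicalLassoCV (log-likelihood loss) rather than skggm's own CV estimator; the simulated chain graph is an illustrative assumption.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
theta_true = np.eye(5)
for j in range(4):
    theta_true[j, j + 1] = theta_true[j + 1, j] = -0.4
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(theta_true), size=300)

model = GraphicalLassoCV(cv=5).fit(X)       # grid of penalties chosen internally
print("selected alpha (lambda):", model.alpha_)
print("estimated edges:\n", (np.abs(model.precision_) > 1e-4).astype(int))
```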
31. Extended BIC: Minimizes Type I Errors
Privileges sparser models than BIC:
$\min_{\lambda} \mathrm{BIC}_{\gamma}(\hat{E}(\lambda)) = \min_{\lambda}\; -2\, \mathcal{L}_n(\hat{\Sigma}; \hat{\Theta}(\lambda)) + |\hat{E}|\, \log(n) + 4\, \gamma\, |\hat{E}|\, \log(p)$,
where $|\hat{E}|$ = number of non-zeros in $\hat{\Theta}(\lambda)$.
Foygel & Drton (2010). Alternatives (StARS, Liu et al.): coming soon.
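A small helper (illustrative; the function name and the $n/2$ likelihood scaling follow the standard Foygel & Drton form rather than any particular skggm routine) that scores a fitted precision matrix with the extended BIC.

```python
import numpy as np

def ebic(sigma_hat, theta_hat, n, gamma=0.5, tol=1e-8):
    """Extended BIC: -2*loglik + |E|*log(n) + 4*gamma*|E|*log(p)."""
    p = sigma_hat.shape[0]
    sign, logdet = np.linalg.slogdet(theta_hat)
    assert sign > 0, "theta_hat must be positive definite"
    loglik = (n / 2.0) * (logdet - np.trace(sigma_hat @ theta_hat))
    n_edges = int((np.abs(theta_hat[np.triu_indices(p, k=1)]) > tol).sum())
    return -2.0 * loglik + n_edges * np.log(n) + 4.0 * gamma * n_edges * np.log(p)

# Usage: evaluate candidates fitted along a lambda path and keep the smallest EBIC.
```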
32. One-Stage vs. Two-Stage Estimators
Use initial estimates to reduce bias in estimation.
Standard graphical lasso (one stage):
$\hat{\Theta}(\lambda) = \underset{\Theta \succ 0}{\text{maximize}}\; \mathcal{L}(\hat{\Sigma}; \Theta) - \lambda \|\Theta\|_{1,\mathrm{off}}, \qquad \|\Theta\|_{1,\mathrm{off}} = \sum_{j \neq k} |\theta_{j,k}|$
Weighted graphical lasso (two stage):
Stage I: obtain an initial estimate $\hat{\Theta}^{\mathrm{init}}$ and form weights, e.g. adaptive weights $w_{jk} = 1 / |\hat{\theta}^{\mathrm{init}}_{jk}|$
Stage II: $\hat{\Theta}(\lambda) = \underset{\Theta \succ 0}{\text{maximize}}\; \mathcal{L}(\hat{\Sigma}; \Theta) - \lambda \|W \circ \Theta\|_{1,\mathrm{off}}, \qquad \|W \circ \Theta\|_{1,\mathrm{off}} = \sum_{j \neq k} |w_{j,k}\, \theta_{j,k}|$
Zou (2006); Zhou et al. (2011); Buhlmann & Van De Geer (2011); Cai & Zhou (2015)
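A numpy-only sketch of Stage I (illustrative): compute adaptive weights from an initial estimate. The Stage II refit needs a solver that accepts an entry-wise (matrix-valued) penalty, such as skggm's QUIC-based estimators, whose exact API is not shown here.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
theta_true = np.eye(5)
for j in range(4):
    theta_true[j, j + 1] = theta_true[j + 1, j] = -0.4
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(theta_true), size=300)

# Stage I: any reasonable initial estimator (here, an l1-penalized MLE).
theta_init = GraphicalLasso(alpha=0.05).fit(X).precision_

# Adaptive weights: strong initial edges get small weights and shrink less,
# weak/absent edges get large weights and are penalized harder.
eps = 1e-8                                   # avoid division by zero for exact zeros
W = 1.0 / (np.abs(theta_init) + eps)
np.fill_diagonal(W, 0.0)                     # penalty applies to off-diagonals only
print(np.round(W, 1))
```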
33. Shrinkage of Edges: Lasso vs. Adaptive
With the lasso, all edges shrink by the same amount; with adaptive weights, strong edges shrink less.
Performance is very dependent on the weights, i.e. we need good separation between strong and weak edges.
[Figure: coefficient paths (entries of the inverse covariance) vs. the regularization parameter (lambda)]
Zou (2006)
34. Variety of Two-Stage Estimators
Weights can be specified in many ways:
• Weights can be data-dependent/adaptive: Stage I can be any estimator, not just the MLE; Stage II is an adaptive MLE.
• Use weights to create randomized model averaging.
• Locally linear approximations to non-convex penalties (coming soon to skggm).
Adaptive estimation: Zhou, Van De Geer, Buhlmann (2009); Breheny and Huang (2011); Cai et al. (2011); and others
36. High Sample Size, High Sparsity
Adaptive estimator improves on the initial estimator (n/p = 75, degree = 0.15p):
• Difference in sparsity: 69 vs. 77; support error: 4.0 (false pos.: 4.0, false neg.: 0.0)
• Difference in sparsity: 69 vs. 141; support error: 36.0 (false pos.: 36.0, false neg.: 0.0)
37. Low Sample Size, High Sparsity
Adaptivity is less useful without a good initial estimate (n/p = 15, degree = 0.15p):
• Difference in sparsity: 69 vs. 85; support error: 8.0 (false pos.: 8.0, false neg.: 0.0)
• Difference in sparsity: 69 vs. 149; support error: 40.0 (false pos.: 40.0, false neg.: 0.0)
39. High Sample Size, Moderate Sparsity
Nodes are more correlated with each other, but adaptivity still does well (n/p = 75, degree = 0.4p):
• Difference in sparsity: 115 vs. 129; support error: 7.0 (false pos.: 7.0, false neg.: 0.0)
• Difference in sparsity: 115 vs. 169; support error: 27.0 (false pos.: 27.0, false neg.: 0.0)
40. Low Sample Size, Moderate Sparsity
Nodes are more correlated with each other; more false negatives (n/p = 15, degree = 0.4p):
• Difference in sparsity: 115 vs. 135; support error: 22.0 (false pos.: 16.0, false neg.: 6.0)
• Difference in sparsity: 115 vs. 111; support error: 18.0 (false pos.: 8.0, false neg.: 10.0)
41. Model Averaging & Stability Selection
For any initial estimator, build an ensemble of estimators and aggregate:
$\hat{\Theta}^{*b}(\lambda) = \underset{\Theta \succ 0}{\text{maximize}}\; \mathcal{L}(\hat{\Sigma}^{*b}; \Theta) - \mathrm{Pen}(W^{*b}(\lambda) \circ \Theta), \qquad w^{*b}_{jk} = w^{*b}_{kj} \in \{\lambda/a,\ a\lambda\}$, chosen with a Bernoulli($\rho$) draw, for $j \neq k$
Aggregate the indicators $\mathbb{I}\big(\hat{\Theta}^{*b}(\lambda) \neq 0\big)$ across resamples $b$.
Thresholding the stability scores gives familywise error control over edges.
[Figure: example stability scores, n/p = 15]
Meinshausen & Buhlmann (2010)
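A self-contained sketch of the idea (illustrative; this is not skggm's built-in model-averaging estimator, and the penalty randomization here is a scalar simplification of the per-entry weights above): bootstrap the rows, randomize the penalty, refit, and aggregate edge indicators into stability scores that can then be thresholded.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
theta_true = np.eye(5)
for j in range(4):
    theta_true[j, j + 1] = theta_true[j + 1, j] = -0.4
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(theta_true), size=300)

n, p = X.shape
B, base_lam = 50, 0.05
scores = np.zeros((p, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)                    # bootstrap resample of rows
    lam_b = base_lam * rng.choice([0.5, 2.0])           # randomized (scalar) penalty
    theta_b = GraphicalLasso(alpha=lam_b).fit(X[idx]).precision_
    scores += (np.abs(theta_b) > 1e-4)                  # indicator of selected edges

scores /= B
print("stability scores:\n", np.round(scores, 2))
print("stable edges (score > 0.8):\n", (scores > 0.8).astype(int))
```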
42. skggm: Inverse covariance estimation (Version 0.1)
Future plans include:
• Computational scalability (BIG-QUIC, support for Apache Spark)
• Monte Carlo “unit-testing” of statistical error control
• Novel case studies and more examples
• Other estimator classes (pseudo-likelihood, non-convex, …)
• Regularizers beyond sparsity: mixtures of regularizers, …
• Other Markov network models for time series
• Directed graphical models