Kshitij Khare & Syed Rahman, University of Florida, present at the 2015 HPCC Systems Engineering Summit Community Day. In this presentation, we will discuss the motivation/theory behind CONCORD and its advantages over previous methods. In particular, we will discuss how the CONCORD estimate is superior to the empirical covariance matrix. We will end with an example detailing the implementation and use of the CONCORD algorithm in ECL. An exposure to multivariate statistics is helpful, but not necessary. Attendees should expect to come out with an understanding of sparse covariance estimation, its applications and how to use the CONCORD algorithm in ECL.
1. Methods for Robust High Dimensional Graphical Model Selection
Kshitij Khare and Syed Rahman
Department of Statistics
University of Florida
2. Motivation
• Availability of high-dimensional data or “big data” from various applications
• Number of variables (p) much larger than (or sometimes comparable to) the
sample size (n)
• Examples:
Biology: gene expression data
Environmental science: climate data on spatial grid
Finance: returns on thousands of stocks
4. Goal: Understanding relationships between variables
• Common goal in many applications: Understand complex network of relationships
between variables
• Covariance matrix: a fundamental quantity to help understand multivariate
relationships
• Even if estimating the covariance matrix is not the end goal, it is a crucial first
step before further analysis
5. Quick recap: What is a covariance matrix?
• The covariance of two variables/features (say two stock prices) is a measure of
linear dependence between these variables
• Positive covariance indicates similar behavior, negative covariance indicates
opposite behavior, and zero covariance indicates a lack of linear dependence
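The sign conventions above are easy to check on a toy example. The talk's code is in ECL, but the idea is language-neutral; here is a small NumPy sketch (the series `x`, `y_up`, `y_down` are fabricated for illustration):

```python
import numpy as np

# Two made-up "price" series: one moves with x, one moves against it.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1       # same direction as x  -> positive covariance
y_down = 10 - x        # opposite direction   -> negative covariance

cov_up = np.cov(x, y_up)[0, 1]      # off-diagonal entry of the 2 x 2 covariance
cov_down = np.cov(x, y_down)[0, 1]
```

`np.cov` returns the 2 × 2 covariance matrix of its two arguments; the off-diagonal entry is the covariance of interest.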
6. Let's say we have five stock prices S1, S2, S3, S4, S5. The covariance matrix of these
five stocks is the 5 × 5 symmetric matrix whose rows and columns are labeled
S1, ..., S5, with entry (i, j) equal to the covariance between Si and Sj.
7. Challenges in high-dimensional estimation
• Covariance matrix (often denoted by Σ) has O(p2) unknown parameters
• If p = 1000, we need to estimate roughly 1 million parameters
• If the sample size n is much smaller than (or even of the same order as) p, this is not viable
• The sample covariance matrix (classical estimator) can perform very poorly in
high-dimensional situations (not even invertible when n < p)
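The last bullet's rank-deficiency problem is easy to demonstrate. A minimal NumPy sketch (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                        # far fewer samples than variables
Y = rng.standard_normal((n, p))      # n x p data matrix

S = np.cov(Y, rowvar=False)          # p x p sample covariance matrix
rank = np.linalg.matrix_rank(S)      # at most n - 1, so S is singular
```

Since rank(S) ≤ n − 1 < p, the sample covariance matrix is singular and S⁻¹ does not exist.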
8. Is there a way out?
• Reliably estimate small number of parameters in Σ or Ω = Σ−1
• Set insignificant parameters to zero
• Gives rise to sparse estimates of Σ or Ω
• Sparsity pattern can be represented by graphs/networks
9. Concentration Graphical Models: Sparsity in Ω
• Assume Ω (inverse covariance matrix) is sparse: corresponds to assuming
conditional independences
• Sparsity pattern in Ω can be represented by an undirected graph G = (V , E)
• Build a graph from sparse Ω
Ω =
        A     B     C
  A  [ 1    0.2   0.3 ]
  B  [ 0.2   2     0  ]
  C  [ 0.3   0    1.2 ]
The corresponding graph G has vertices {A, B, C} with edges A–B and A–C (since
ωAB ≠ 0 and ωAC ≠ 0); there is no edge between B and C since ωBC = 0.
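Reading the graph off a sparse Ω is mechanical: connect i and j exactly when ωij ≠ 0. A small Python sketch using the 3 × 3 matrix from this slide:

```python
import numpy as np

Omega = np.array([[1.0, 0.2, 0.3],
                  [0.2, 2.0, 0.0],
                  [0.3, 0.0, 1.2]])
labels = ["A", "B", "C"]
p = len(labels)

# An undirected edge (i, j) is present exactly when omega_ij is nonzero.
edges = [(labels[i], labels[j])
         for i in range(p) for j in range(i + 1, p)
         if Omega[i, j] != 0.0]
```

Running this gives edges A–B and A–C but not B–C, matching the graph on the slide.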
10. Are these models useful? Appropriate?
• Many physical networks are assumed to be sparse
• Complex networks (internet, citation networks, social networks) tend to be sparse
[Newman, 2003]
• Genetic networks are sparse [Gardner et al, 2003, Jeong et al, 2001]
Model selection problem: How do we infer the underlying network/graph from data?
11. CONvex CORrelation selection methoD (CONCORD)
Obtain estimate of Ω by minimizing the objective function:
Qcon(Ω) := − n Σ_{i=1}^p log ωii + (1/2) Σ_{i=1}^p ‖ ωii Yi + Σ_{j≠i} ωij Yj ‖_2^2 + λ Σ_{1≤i<j≤p} |ωij|
where Yi denotes the n × 1 vector of observations on the i-th variable
• The penalty term λ Σ_{1≤i<j≤p} |ωij| ensures that the minimizer is sparse
• λ (chosen by the user) controls the level of sparsity in the estimator
• Larger the λ, sparser the estimator
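The objective translates directly into code. A minimal NumPy sketch (the function name `concord_objective` is my own; `Y` is the n × p data matrix and `Omega` a symmetric matrix with positive diagonal):

```python
import numpy as np

def concord_objective(Omega, Y, lam):
    """Q_con(Omega) = -n * sum_i log(omega_ii)
                      + 0.5 * sum_i || omega_ii Y_i + sum_{j!=i} omega_ij Y_j ||_2^2
                      + lam * sum_{i<j} |omega_ij|"""
    n, p = Y.shape
    log_term = -n * np.sum(np.log(np.diag(Omega)))
    # Column i of Y @ Omega equals omega_ii * Y_i + sum_{j != i} omega_ij * Y_j,
    # so the middle term is half the squared Frobenius norm of Y @ Omega.
    quad_term = 0.5 * np.sum((Y @ Omega) ** 2)
    penalty = lam * np.sum(np.abs(Omega[np.triu_indices(p, k=1)]))
    return log_term + quad_term + penalty
```

Note that Qcon is jointly convex in Ω (the "CONvex" in the method's name), which is what makes the coordinate-wise scheme on the next slide well behaved.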
12. Minimization algorithm
• Direct minimization of Qcon not feasible
• Cyclic coordinate-wise minimization algorithm
1. Minimize over ωij (i ≠ j), holding the other entries fixed:
   ωij ← S_{λ/n}( −( Σ_{j'≠j} ωij' Sj'j + Σ_{i'≠i} ωi'j Si'i ) ) / ( Sii + Sjj )
2. Minimize over ωii, holding the other entries fixed:
   ωii ← ( −Σ_{j≠i} ωij Sij + sqrt( (Σ_{j≠i} ωij Sij)^2 + 4 Sii ) ) / ( 2 Sii )
Repeat until convergence
Here S is the sample covariance matrix and S_λ is the soft-thresholding operator: S_λ(x) = sign(x)(|x| − λ)_+
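The two updates can be written out in full. This is a plain Python/NumPy translation of the cyclic scheme, not the ECL implementation from the talk; `concord` and `soft_threshold` are names I chose for the sketch:

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * (|x| - lam)_+"""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def concord(S, n, lam, max_iter=100, tol=1e-5):
    """Cyclic coordinate-wise minimization of Q_con.
    S: p x p sample covariance, n: sample size, lam: penalty parameter."""
    p = S.shape[0]
    Omega = np.eye(p)
    for _ in range(max_iter):
        Omega_old = Omega.copy()
        # Off-diagonal updates (partial covariances)
        for i in range(p):
            for j in range(i + 1, p):
                # sum_{j'!=j} omega_{ij'} s_{j'j} and sum_{i'!=i} omega_{i'j} s_{i'i}
                a = Omega[i, :] @ S[:, j] - Omega[i, j] * S[j, j]
                b = Omega[:, j] @ S[:, i] - Omega[i, j] * S[i, i]
                Omega[i, j] = Omega[j, i] = (
                    soft_threshold(-(a + b), lam / n) / (S[i, i] + S[j, j])
                )
        # Diagonal updates (partial variances)
        for i in range(p):
            c = S[i, :] @ Omega[:, i] - S[i, i] * Omega[i, i]
            Omega[i, i] = (-c + np.sqrt(c ** 2 + 4 * S[i, i])) / (2 * S[i, i])
        if np.max(np.abs(Omega - Omega_old)) < tol:
            break
    return Omega
```

As a sanity check: with a diagonal S the off-diagonal updates stay at zero and the diagonal update reduces to ωii = 1/sqrt(Sii).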
14. Comparison with Sample Covariance Matrix
• When the sample size (n) is smaller than the number of variables (p), the sample
covariance matrix (S) is not even positive definite (and hence not invertible).
• In such a case, we HAVE to use CONCORD (or a comparable method) to get an
estimate.
• If n > p, we can consider S−1 as an estimate for Ω. However, S−1 will generally
not be sparse and is usually a poor estimate, especially if Ω is sparse.
15. Comparison with Sample Covariance Matrix continued ...
• For our numerical experiments, we generated a 50 × 50 positive definite matrix Ω;
the true covariance matrix is then Ω−1.
• Using this covariance matrix, we generated data for a sample of n = 60
observations (slightly larger than p = 50).
• We compared the accuracy of the CONCORD estimate and the inverted sample
covariance matrix using the Frobenius norm. The experiment was repeated 100 times.
• The average Frobenius error for CONCORD is 0.4125151, while for the inverted
sample covariance matrix it is 46.9759999.
Message: CONCORD is far superior to simply inverting S.
16. Leveraging strength of ECL
• One of the biggest advantages of ECL is distributed computing.
• However, CONCORD as it stands does not lend itself easily to parallel computing.
• Even if we run it on several nodes, the nodes need to communicate amongst
themselves due to the dependence structure of covariance matrices.
• How can we adapt CONCORD to leverage parallel computation?
17. Improvisation: Divide and Conquer
• Run CONCORD for just a few iterations (around 10), until the corresponding graph
breaks into five to ten disjoint components.
• Run CONCORD afresh for each of these components on separate nodes.
• These subproblems are completely independent, so there is no need for any data
movement between the nodes.
• The overall running time is greatly reduced.
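The "breaks into disjoint components" step amounts to finding the connected components of the estimated graph. A self-contained Python sketch (the helper `components` is illustrative, not part of the ECL library):

```python
import numpy as np

def components(Omega, tol=1e-8):
    """Connected components of the graph whose edges are the
    (numerically) nonzero off-diagonal entries of Omega."""
    p = Omega.shape[0]
    seen, comps = set(), []
    for start in range(p):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                      # depth-first traversal
            v = stack.pop()
            comp.append(v)
            for w in range(p):
                if w not in seen and w != v and abs(Omega[v, w]) > tol:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps
```

Each component can then be shipped to its own node and CONCORD re-run on just its variables, with no cross-node communication.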
18. Illustrative example
• In our example with p = 50, we run CONCORD for 10 iterations.
• Depending on the value of the penalization parameter λ, the graph breaks up
into 5 or 6 components.
• Running CONCORD on the full dataset until convergence takes 15 minutes and 30
seconds.
• Improvised method: CONCORD for 10 iterations takes 2 minutes 32 seconds;
running a fresh CONCORD on each component takes less than 3 minutes.
The improvised method reduces the overall running time by about 66%.
19. Illustrative example
[Bar chart: "Comparison of Concord Implementations" — time in seconds (axis 0–800)
for the full CONCORD run ("Full") versus each of the five component runs ("1"–"5")]
20. Algorithm 1 CONCORD pseudocode
Input: Compute the sample covariance matrix S
Input: Fix maximum number of iterations: r_max
Input: Fix initial estimate: Ω̂(0)
Input: Fix convergence threshold: ε
Set r ← 1
Set converged = FALSE
repeat
    Ω̂_old = Ω̂_current
    Updates to partial covariances ωij:
    for i = 1, ..., p − 1 do
        for j = i + 1, ..., p do
            ω̂ij_current = S_{λ/n}( −( Σ_{j'≠j} ωij'_current sj'j + Σ_{i'≠i} ωi'j_current si'i ) ) / ( sii + sjj )    (1)
            where S_λ(x) := sign(x)(|x| − λ)_+
        end for
    end for
    Updates to partial variances ωii:
    for i = 1, ..., p do
        ω̂ii_current = ( −Σ_{k≠i} sik ωki_current + sqrt( (Σ_{k≠i} sik ωki_current)^2 + 4 sii ) ) / ( 2 sii )    (2)
    end for
    Convergence check:
    if ‖Ω̂_old − Ω̂_current‖_max < ε then
        converged = TRUE
    else
        r ← r + 1
    end if
until converged = TRUE or r > r_max
return Ω̂(r)
21. ECL implementation of CONCORD
• CONCORD has been implemented in ECL as part of the machine learning library.
• If n > p, use ML.PopulationEstimate.ConcordV1. If n < p, use
ML.PopulationEstimate.ConcordV2. Or simply use
ML.PopulationEstimate.InverseCovariance.
• ML.PopulationEstimate.ConcordV2(Y:=data,lambda:=10,
maxiter:=100,tol:=0.00001)
• Help/documentation is available at
https://concordinecl.wordpress.com/guide-to-using-concord-in-ecl/