Independent Component Analysis
P9120 Group 3
Ruoyu Ji Haokun Yuan Apoorva Srinivasan Tianchen Xu
Biostatistics Department
Columbia University
October 24, 2019
Notation - Factor Analysis Model
Let's introduce the notation first. Suppose the observed X ∈ R^{N×p} (N > p) can be decomposed into two matrices S ∈ R^{N×p} and A ∈ R^{p×p}:
X = S A^T, i.e.

$$
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots &        &        & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{Np}
\end{pmatrix}
=
\begin{pmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots &        &        & \vdots \\
s_{N1} & s_{N2} & \cdots & s_{Np}
\end{pmatrix}
\begin{pmatrix}
a_{11} & \cdots & a_{p1} \\
a_{12} & \cdots & a_{p2} \\
\vdots &        & \vdots \\
a_{1p} & \cdots & a_{pp}
\end{pmatrix}
$$
Here X contains the original p variables, S the transformed p variables, and A the transformation:
• FA: S is the score matrix, A is the loading matrix.
• ICA: S is the source matrix, A is the mixing matrix.
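As a quick, purely illustrative check of these shapes (a hypothetical toy example, not from the slides), one can build S and A with NumPy and form X = S A^T:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3                    # N observations, p variables

S = rng.laplace(size=(N, p))     # score / source matrix, N x p
A = rng.normal(size=(p, p))      # loading / mixing matrix, p x p

X = S @ A.T                      # observed data matrix, N x p
print(X.shape)                   # (100, 3)
```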
Notation - ICA Model
X = S A^T, with the same matrix decomposition as in the factor analysis notation above. Equivalently, in vector form, X = AS:

$$
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}
=
\begin{pmatrix}
a_{11} & \cdots & a_{1p} \\
a_{21} & \cdots & a_{2p} \\
\vdots &        & \vdots \\
a_{p1} & \cdots & a_{pp}
\end{pmatrix}
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_p \end{pmatrix}
$$
where X and S are p-dimensional random vectors.
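As a minimal sketch of the X = AS model (a hypothetical example using scikit-learn's FastICA, assuming scikit-learn is available), two non-Gaussian sources are mixed and then recovered; as discussed on the following slides, the recovered components come back only up to order and scale:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# two hypothetical non-Gaussian sources (columns of S)
S = np.c_[np.sign(np.sin(3 * t)),   # square wave
          np.sin(5 * t)]            # sinusoid
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])          # mixing matrix

X = S @ A.T                         # observed mixtures, rows = samples

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)        # estimated sources (up to order and scale)
A_hat = ica.mixing_                 # estimated mixing matrix
```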
Rotation Problem
The model is unidentifiable even if we constrain S to be orthonormal:

$$
X = AS = (AR^T)(RS) = A^{*}S^{*}
$$

where R is any orthogonal matrix.
Quartimax: maximize the sum of all loadings in S raised to the fourth power. This minimizes the number of factors needed to explain each variable.

Varimax: maximize the variance of the squared loadings within each factor of S. As a result, each factor has only a few variables with large loadings.

… (other rotation criteria exist)
Rotation - ICA Assumption
The starting point for ICA is the very simple assumption that the components S_i are statistically independent.

It will be shown that we must also assume that the independent components have non-Gaussian distributions. However, in the basic model we do not assume these distributions are known (if they are known, the problem is considerably simplified).

Then, after estimating the matrix A, we can compute its inverse, say W, and obtain the independent components simply by

$$
S = A^{-1}X = WX.
$$
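In the noiseless model this inversion is exact; a tiny hypothetical NumPy check (samples stored as columns, matching the vector form X = AS):

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 3, 500
S = rng.laplace(size=(p, N))     # p sources, N samples as columns
A = rng.normal(size=(p, p))      # mixing matrix (invertible with probability 1)

X = A @ S                        # observed mixtures
W = np.linalg.inv(A)             # unmixing matrix W = A^{-1}
print(np.allclose(W @ X, S))     # True: sources recovered exactly
```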
Ambiguities of ICA
1. The variances of the independent components S_i cannot be determined.
Since both S and A are unknown, any scalar multiplier of a source S_i can be cancelled by dividing the corresponding column of A by the same scalar (illustrated in the sketch below). The most natural convention is to assume that each source has unit variance, E(S_i^2) = 1.

2. The order of the independent components cannot be determined.
Again, since S and A are unknown, the order of the terms in the model can be changed freely, and we can call any of the independent components the first one.
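The scaling ambiguity can be verified directly: rescaling a source and dividing the corresponding column of A by the same constant leaves X unchanged (hypothetical NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 2, 100
S = rng.laplace(size=(p, N))
A = rng.normal(size=(p, p))

c = 5.0
S2, A2 = S.copy(), A.copy()
S2[0, :] *= c                         # rescale the first source ...
A2[:, 0] /= c                         # ... and compensate in the first column of A

print(np.allclose(A @ S, A2 @ S2))    # True: the observed X is unchanged
```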
Model Estimation
The distribution of a sum of independent random variables tends toward a Gaussian distribution (the central limit theorem). Thus, a sum of two independent random variables usually has a distribution that is closer to Gaussian than either of the original random variables.

Problem Formulation
Find a linear mapping W such that the unmixed components u = w^T X, where w^T is a row of W, are maximally statistically independent.
Necessary Non-Gaussianity
The trick now is how to
measure “non-Gaussianity”?
Measurement of Non-Gaussianity
Kurtosis
Defined by kurt(y) = E(y^4) − 3(E(y^2))^2. The kurtosis of a Gaussian is zero. Kurtosis is very sensitive to outliers when its value has to be estimated from a measured sample (illustrated in the sketch after this list).

Differential entropy
Defined by H(y) = −∫ g(y) ln g(y) dy, where g(y) is the density of y. The Gaussian random variable has the largest entropy among all random variables of equal variance, which means that entropy can be used to measure non-Gaussianity.

Negentropy
Defined by J(y) = H(z) − H(y), where z is a Gaussian variable with the same variance as y.
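A quick hypothetical illustration of the kurtosis measure: it is near zero for Gaussian samples, positive for a heavy-tailed (Laplace) source, and negative for a uniform source:

```python
import numpy as np

def kurt(y):
    """Sample kurtosis: E(y^4) - 3 (E(y^2))^2, zero for a Gaussian."""
    y = y - y.mean()
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

rng = np.random.default_rng(0)
n = 100_000
print(kurt(rng.normal(size=n)))           # ~ 0
print(kurt(rng.laplace(size=n)))          # > 0 (super-Gaussian)
print(kurt(rng.uniform(-1, 1, size=n)))   # < 0 (sub-Gaussian)
```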
Approximation of Negentropy
Classical approximation:

$$
J(y) \approx \tfrac{1}{12}\,E(y^3)^2 + \tfrac{1}{48}\,\mathrm{kurt}(y)^2
$$

New approximation based on the maximum-entropy principle:

$$
J(y) \approx \sum_{i=1}^{p} k_i \big[E\{G_i(y)\} - E\{G_i(z)\}\big]^2
$$

where the k_i are some positive constants and z is a Gaussian variable of zero mean and unit variance. The functions G_i are some nonquadratic functions, such as

$$
G(u) = \tfrac{1}{a}\ln\cosh(au) \quad (1 \le a \le 2), \qquad G(u) = -e^{-u^2/2}, \;\dots
$$
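A hedged sketch of the maximum-entropy approximation with a single contrast function G(u) = (1/a) ln cosh(au); the constant k_i is omitted, so only relative comparisons across inputs are meaningful:

```python
import numpy as np

def negentropy_approx(y, a=1.0, n_gauss=1_000_000, seed=0):
    """J(y) ~ [E{G(y)} - E{G(z)}]^2 with G(u) = (1/a) log cosh(a u);
    y is standardized first, z is standard Gaussian, the constant k is dropped."""
    y = (y - y.mean()) / y.std()
    z = np.random.default_rng(seed).normal(size=n_gauss)
    G = lambda u: np.log(np.cosh(a * u)) / a
    return (G(y).mean() - G(z).mean()) ** 2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.normal(size=50_000)))    # ~ 0 for Gaussian data
print(negentropy_approx(rng.laplace(size=50_000)))   # clearly larger for non-Gaussian data
```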
Mutual Information
Defined by:

$$
I(y_1, y_2, \dots, y_p) = \sum_{i=1}^{p} H(y_i) - H(y)
$$

where the y_i are the components of the random vector y.

It is the natural measure of the dependence between random variables. Its value is always nonnegative, and zero if and only if the variables are statistically independent.

Let u = WX. Then

$$
I(u_1, u_2, \dots, u_p) = \sum_{i=1}^{p} H(u_i) - H(X) - \ln|\det(W)|.
$$

Since H(X) is fixed and ln|det(W)| is constant when W is constrained to be orthogonal (as after whitening), minimizing the mutual information is equivalent to minimizing the sum of the entropies of the separate components of u.
Use of Negentropy
Recall negentropy, defined by J(y) = H(z) − H(y), where z is a normally distributed variable with the same variance as y. It has many approximation forms.

Entropy and negentropy differ only by a constant and a sign. Therefore, finding an invertible transformation W that minimizes the mutual information is roughly equivalent to finding directions in which negentropy is maximized.
Preprocessing for ICA - Centering
The most basic and necessary preprocessing is to center the data matrix X, that is, subtract the mean vector so that the data have zero mean. With this, S can be considered to be zero mean as well.

After estimating the mixing matrix A, the mean vector of S, given by A^{-1}µ where µ is the mean vector of the data matrix X, can be added back to the centered estimates of S to complete the estimation.
Preprocessing for ICA - Whitening
By eigenvalue decomposition (EVD), E(XX^T) = VDV^T, where V is the orthogonal matrix of eigenvectors and D is the diagonal matrix of eigenvalues.

Whitening transformation of X: X̃ = VD^{-1/2}V^T X, so that Cov(X̃) = I.

Since S has covariance I, writing X̃ = ÃS, the new mixing matrix Ã is orthogonal, which has only p(p − 1)/2 degrees of freedom instead of p².

Benefit of preprocessing:
• It reduces the number of parameters: an orthogonal p × p mixing matrix has only p(p − 1)/2 free parameters (a NumPy sketch of whitening follows).
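A minimal NumPy sketch of centering plus EVD whitening (hypothetical data; rows are samples, so the covariance is computed as X^T X / N after centering):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5000, 3
X = rng.laplace(size=(N, p)) @ rng.normal(size=(p, p)).T   # mixed data, rows = samples

Xc = X - X.mean(axis=0)                  # centering
cov = Xc.T @ Xc / N                      # sample covariance E(X X^T)
d, V = np.linalg.eigh(cov)               # EVD: cov = V diag(d) V^T
K = V @ np.diag(d ** -0.5) @ V.T         # whitening matrix V D^{-1/2} V^T
X_tilde = Xc @ K                         # whitened data (K is symmetric, so K = K^T)

print(np.round(X_tilde.T @ X_tilde / N, 3))   # approximately the identity matrix
```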
Preprocessing for ICA - Sphering
Centering + whitening = sphering.

Sphering removes the first- and second-order structure of the data: the mean is set to zero and the covariance matrix is set to the identity, i.e. correlations are removed and the variances are equalized.

Because it is a very simple and standard procedure, much simpler than any ICA algorithm, it is a good idea to reduce the complexity of the problem this way.

We can also look at the eigenvalues of E(XX^T) and discard those that are too small. This often has the effect of reducing noise. Moreover, the dimension reduction prevents overlearning, which can sometimes be observed in ICA (see the sketch below).
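A sketch of sphering with the dimension-reduction variant: eigen-directions whose eigenvalues fall below a (hypothetical) threshold are discarded before whitening, reducing noise and the dimension at the same time:

```python
import numpy as np

def sphere(X, eig_tol=1e-8):
    """Center X (rows = samples) and whiten it, discarding eigen-directions
    whose eigenvalues are below eig_tol (PCA-style whitening)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    d, V = np.linalg.eigh(cov)
    keep = d > eig_tol                      # drop near-zero-variance directions
    K = V[:, keep] @ np.diag(d[keep] ** -0.5)
    return Xc @ K                           # N x (number of retained components)
```

The returned data have identity covariance in the retained subspace, which is all the subsequent ICA steps require.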
A Direct Approach to ICA
Recall that many approaches to ICA are based on minimizing approximations of the entropy of the components (equivalently, maximizing negentropy).
Product Density ICA
By definition, independent components have a joint product density:

$$
f_S(s) = \prod_{j=1}^{p} f_j(s_j)
$$

Each f_j can be represented as a tilted Gaussian density:

$$
f_j(s_j) = \phi(s_j)\, e^{g_j(s_j)}
$$

The log-likelihood for the observed data X = AS is:

$$
\ell(A, \{g_j\}_{1}^{p}; X) = \sum_{i=1}^{N}\sum_{j=1}^{p}\big[\log\phi(a_j^T x_i) + g_j(a_j^T x_i)\big]
$$
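To make the tilted Gaussian concrete, here is a hypothetical example (using SciPy for the Gaussian density) where the tilt g(s) = θs − θ²/2 turns φ(s)e^{g(s)} into exactly the N(θ, 1) density, so it still integrates to one:

```python
import numpy as np
from scipy.stats import norm

theta = 1.5
g = lambda s: theta * s - theta**2 / 2        # a particular valid tilt function
f = lambda s: norm.pdf(s) * np.exp(g(s))      # tilted Gaussian phi(s) e^{g(s)}

s = np.linspace(-6, 9, 7)
print(np.allclose(f(s), norm.pdf(s, loc=theta)))   # True: this tilt gives N(theta, 1)
```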
Penalized Density
We maximize a penalized log-likelihood:

$$
\sum_{j=1}^{p}\Big[\frac{1}{N}\sum_{i=1}^{N}\big(\log\phi(a_j^T x_i) + g_j(a_j^T x_i)\big)
- \int \phi(t)\, e^{g_j(t)}\, dt - \lambda_j \int \{g_j'''(t)\}^{2}\, dt\Big]
$$

The first penalty term enforces the density constraint ∫ φ(t)e^{g_j(t)} dt = 1 on any solution ĝ_j.
The second is a roughness penalty, which guarantees that the solution ĝ_j is a quartic spline with knots at the observed values s_{ij} = a_j^T x_i.
Product Density ICA Algorithm
(The algorithm-overview figure is omitted here. In outline: after sphering X, alternate between step 2a, fitting the tilted densities ĝ_j given A, and step 2b, a fixed-point update of A given the ĝ_j, until convergence; the two steps are detailed on the following slides.)
Product Density ICA Algorithm 2a)
Given A, optimize the log-likelihood w.r.t. the g_j:

$$
\sum_{j=1}^{p}\Big[\frac{1}{N}\sum_{i=1}^{N}\big(\log\phi(a_j^T x_i) + g_j(a_j^T x_i)\big)
- \int \phi(t)\, e^{g_j(t)}\, dt - \lambda_j \int \{g_j'''(t)\}^{2}\, dt\Big]
$$

For simplicity, we focus on a single coordinate:

$$
\frac{1}{N}\sum_{i=1}^{N}\big(\log\phi(s_i) + g(s_i)\big) - \int \phi(t)\, e^{g(t)}\, dt - \lambda \int \{g'''(t)\}^{2}\, dt
$$

Since the objective involves integrals, an approximation is needed.
Product Density ICA Algorithm 2a)
Construct a fine grid of L values s*_l in increments of Δ covering the observed values s_i, and let

$$
y_l^{*} = \frac{\#\{\, s_i \in (s_l^{*} - \Delta/2,\; s_l^{*} + \Delta/2) \,\}}{N}
$$

The penalized likelihood can then be approximated by

$$
\sum_{l=1}^{L}\Big( y_l^{*}\big[\log\phi(s_l^{*}) + g(s_l^{*})\big] - \Delta\,\phi(s_l^{*})\, e^{g(s_l^{*})} \Big) - \lambda \int \{g'''(t)\}^{2}\, dt
$$
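A small NumPy sketch of the grid construction (the values s_i, the increment Δ, and the grid range are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=1000)        # stands in for the fitted values s_i = a_j^T x_i

delta = 0.1                       # grid increment Delta
grid = np.arange(s.min(), s.max() + delta, delta)        # fine grid of values s*_l
edges = np.append(grid - delta / 2, grid[-1] + delta / 2)
counts, _ = np.histogram(s, bins=edges)
y_star = counts / len(s)          # y*_l = fraction of s_i within Delta/2 of s*_l

print(len(grid), y_star.sum())    # the y*_l sum to 1
```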
Product Density ICA Algorithm 2b)
Given the ĝ_j, perform one step of a fixed-point algorithm towards finding the optimal A.

Optimizing the log-likelihood w.r.t. A is equivalent to maximizing

$$
C(A) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{p} \hat g_j(a_j^T x_i)
$$

C(A) is the log-likelihood ratio between the fitted density and the Gaussian, and can be seen as an estimate of negentropy.
2b) Fixed Point Algorithm
For each j, update

$$
a_j \leftarrow E\{X\, \hat g_j'(a_j^T X)\} - E\{\hat g_j''(a_j^T X)\}\, a_j,
$$

with the expectations replaced by sample averages over the observations x_i.

Then orthogonalize A using A ← (AA^T)^{-1/2} A; in practice, compute the SVD A = UDV^T and replace A with A ← UV^T.
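A compact sketch of a fixed-point iteration in this spirit, a FastICA-style simplification that uses the fixed contrast G(u) = log cosh(u) (so G' = tanh) in place of the fitted ĝ_j, with symmetric orthogonalization via the SVD; the data are assumed to be sphered already, with rows as samples:

```python
import numpy as np

def fixed_point_ica(X, n_iter=200, seed=0):
    """X: sphered data of shape (N, p). Returns an orthogonal matrix A whose
    columns a_j give estimated components a_j^T x_i (i.e. S_hat = X @ A)."""
    N, p = X.shape
    rng = np.random.default_rng(seed)
    A = np.linalg.svd(rng.normal(size=(p, p)))[0]   # random orthogonal start

    g  = np.tanh                                    # G'(u) for G(u) = log cosh(u)
    dg = lambda u: 1.0 - np.tanh(u) ** 2            # G''(u)

    for _ in range(n_iter):
        U = X @ A                                   # current components a_j^T x_i
        # fixed-point update: a_j <- E[X g(a_j^T X)] - E[g'(a_j^T X)] a_j
        A_new = X.T @ g(U) / N - A * dg(U).mean(axis=0)
        # symmetric orthogonalization, equivalent to (A A^T)^{-1/2} A
        u_svd, _, vt = np.linalg.svd(A_new)
        A = u_svd @ vt
    return A
```

S_hat = X @ fixed_point_ica(X) then gives the estimated sources, again only up to order and sign.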
Example: Handwritten Digits
Digitized 16 × 16 grayscale images
Points in 256-dimensional space
High-dimensional data
Comparison: standardized first five ICA components vs
standardized first five PCA components
Nature of each ICA component
extreme digits vs central digits
Example: EEG (electroencephalographic) Time Courses
Domain: Brain Dynamics
Experiment
Subjects wear a cap embedded with a lattice of 100 EEG electrodes,
which record brain activity at different locations on the scalp.
Data are from a single subject performing a standard “two-back”
learning task over a 30-minute period, i.e., the subject is presented
with a letter (B, H, J, C, F, or K) at roughly 1500-ms intervals, and
responds by pressing one of two buttons to indicate whether the letter
presented is the same or different from that presented two steps back.
Depending on the answer, the subject earns or loses points, and
occasionally earns bonus or loses penalty points for infrequent correct
or incorrect trials, respectively.
Goal: untangle the components of signals in multi-channel EEG data.
Key assumption: signals recorded at each scalp electrode are a
mixture of independent potentials arising from different cortical
activities, as well as non-cortical artifact domains.
ICA method
Thank you! :)