Robustly separating the wheat from the chaff in
multiway data
Eric C. Chi with Tamara G. Kolda
October 11, 2010
Eric Chi Robustly separating wheat from chaff 1
Wheat, Chaff, and fitting models to data
World View
“Wheat” “Chaff”
Data = Systematic Variation + non-Systematic Variation
This talk
Systematic Variation (Wheat): multilinear
The “best” multilinear fit to the data
Which multilinear model “best” fits the data?
Measure of lack of fit: loss
Many choices:
2-norm
1-norm
Different losses = different assumptions about the statistical behavior
of the non-Systematic variation (Chaff).
Why is this important?
Poor separation of wheat and chaff if chaff does not behave as
expected.
Many ways to extract wheat from the data
$\min_u \sum_i (x_i - u)^2$
Extract wheat with least squares
$\min_u \sum_i (x_i - u)^2$
(Figure: the data with Least Squares Solution 1.)
What happens if we spike the data with outliers?
$\min_u \sum_i (x_i - u)^2$
(Figure: the spiked data with Least Squares Solution 1.)
Least squares is sensitive to outliers.
$\min_u \sum_i (x_i - u)^2$
(Figure: Least Squares Solution 1 versus Least Squares Solution 2; after spiking, the fit shifts toward the outliers.)
Extract wheat with least 1-norm
$\min_u \sum_i |x_i - u|$
(Figure: Least Squares Solution 1 and the Least 1-norm Solution; the 1-norm fit stays with the bulk of the data.)
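In one dimension, both extraction rules have closed forms: the least squares minimizer is the sample mean, and the least 1-norm minimizer is the sample median, so spiking the data moves one but hardly the other. A small numpy illustration (with hypothetical data, not the talk's example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 3.0 + rng.normal(size=100)                      # wheat u = 3 plus Gaussian chaff
x_spiked = np.concatenate([x, [50.0, 60.0, 70.0]])  # spike with three large outliers

# argmin_u sum_i (x_i - u)^2 is the mean; argmin_u sum_i |x_i - u| is the median.
print(np.mean(x), np.median(x))                 # both near 3
print(np.mean(x_spiked), np.median(x_spiked))   # mean is dragged up; median barely moves
```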
When is a least squares solution the “best” solution?
Least Squares (LS) Solution = most “likely” u when
non-Systematic Variation is Gaussian.
The least squares solution is the maximum likelihood estimate (MLE) of $u$ in the model
$x_i = u + e_i$, where $e_i \overset{\text{i.i.d.}}{\sim} N(0, 1)$.
When is a least 1-norm solution the “best” solution?
Least 1-norm Solution = mle of u when
non-Systematic Variation is Laplacian.
$x_i = u + e_i$, where $e_i \overset{\text{i.i.d.}}{\sim} f(x) = \exp(-|x|)/2$.
Gaussian and Laplacian densities
(Figure: the Gaussian and Laplacian densities on $[-6, 6]$; the Laplacian is more sharply peaked at zero and has heavier tails.)
Justifying the Gaussian assumption
STAT 101
non-Systematic variation = accumulation of many small random effects
(Central Limit Theorem)
Practice
Least squares is easy and fast to compute (Convenience).
Assumptions: multilinear wheat
Example: Iris data
150 samples of 3 species of iris flowers
4 measurements on sepal and petal dimensions
Data is a two-way array: $X \in \mathbb{R}^{150 \times 4}$
Bilinear wheat (rank 2)
$X \approx u_1 \circ v_1 + u_2 \circ v_2$, i.e., $x_{ij} \approx u_{i1} v_{j1} + u_{i2} v_{j2}$
$u_1, u_2 \in \mathbb{R}^{150}$, $v_1, v_2 \in \mathbb{R}^{4}$
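For two-way arrays, the least squares rank-2 fit has a classical solution: by the Eckart–Young theorem it is the truncated SVD. A numpy sketch on a synthetic $150 \times 4$ matrix standing in for the iris array (hypothetical numbers, not the real measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-2 wheat plus Gaussian chaff, same shape as the iris array.
U = rng.normal(size=(150, 2))
V = rng.normal(size=(4, 2))
X = U @ V.T + 0.1 * rng.normal(size=(150, 4))

# Best rank-2 least squares fit = truncated SVD (Eckart-Young theorem).
P, s, Qt = np.linalg.svd(X, full_matrices=False)
X2 = (P[:, :2] * s[:2]) @ Qt[:2, :]  # u1∘v1 + u2∘v2 in matrix form

print(np.linalg.norm(X - X2) / np.linalg.norm(X))  # only the small chaff is left behind
```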
Extracting bilinear wheat with least squares
$\min \sum_{i=1}^{150} \sum_{j=1}^{4} \Bigl( x_{ij} - \sum_{r=1}^{2} u_{ir} v_{jr} \Bigr)^2$
(Figure: scatter plot of the scores $u_{i1}$ versus $u_{i2}$ for the 150 samples.)
CANDECOMP/PARAFAC (CP) Decomposition
Example: Video surveillance
256 × 256 pixel images at 200 time points.
Data is a three-way array: $X \in \mathbb{R}^{256 \times 256 \times 200}$.
Trilinear wheat (rank 5)
$X \approx u_1 \circ v_1 \circ w_1 + \cdots + u_5 \circ v_5 \circ w_5$, i.e., $x_{ijk} \approx u_{i1} v_{j1} w_{k1} + \cdots + u_{i5} v_{j5} w_{k5}$
$u_1, \ldots, u_5 \in \mathbb{R}^{256}$; $v_1, \ldots, v_5 \in \mathbb{R}^{256}$; $w_1, \ldots, w_5 \in \mathbb{R}^{200}$
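The least squares CP fit is typically computed by alternating least squares (CP-ALS): fix two factor matrices and solve a linear least squares problem for the third. A minimal numpy sketch (not the authors' implementation; a small synthetic tensor stands in for the video data):

```python
import numpy as np

def khatri_rao(A, B):
    # Columnwise Kronecker product: (I x R), (J x R) -> (I*J x R).
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(X, R, iters=200, seed=0):
    # Rank-R least squares CP fit of a 3-way tensor by alternating least squares.
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    U, V, W = (rng.normal(size=(n, R)) for n in (I, J, K))
    X1 = X.reshape(I, -1)                     # mode-1 unfolding, columns (j, k)
    X2 = np.moveaxis(X, 1, 0).reshape(J, -1)  # mode-2 unfolding, columns (i, k)
    X3 = np.moveaxis(X, 2, 0).reshape(K, -1)  # mode-3 unfolding, columns (i, j)
    for _ in range(iters):
        U = X1 @ np.linalg.pinv(khatri_rao(V, W)).T  # solve for U with V, W fixed
        V = X2 @ np.linalg.pinv(khatri_rao(U, W)).T
        W = X3 @ np.linalg.pinv(khatri_rao(U, V)).T
    return U, V, W

# Recover a noiseless rank-2 tensor.
rng = np.random.default_rng(1)
A, B, C = (rng.normal(size=(n, 2)) for n in (5, 6, 7))
X = np.einsum('ir,jr,kr->ijk', A, B, C)
U, V, W = cp_als(X, R=2)
Xhat = np.einsum('ir,jr,kr->ijk', U, V, W)
print(np.linalg.norm(X - Xhat) / np.linalg.norm(X))  # tiny relative error
```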
Extracting multilinear wheat
Least Squares
$\min \sum_{i=1}^{256} \sum_{j=1}^{256} \sum_{k=1}^{200} \Bigl( x_{ijk} - \sum_{r=1}^{5} u_{ir} v_{jr} w_{kr} \Bigr)^2$
Least 1-norm
$\min \sum_{i=1}^{256} \sum_{j=1}^{256} \sum_{k=1}^{200} \Bigl| x_{ijk} - \sum_{r=1}^{5} u_{ir} v_{jr} w_{kr} \Bigr|$
Violating Gaussian assumptions: Who cares?
What if the non-Systematic variation is really non-Gaussian?
Neuroimaging: Movement artifacts.
Video Surveillance: Foreground/Background separation.
non-Gaussian how?
Sparse large perturbations.
Perturbations are large but not too large.
Minimizing the 1-norm loss
Least 1-norm is harder to solve than Least Squares!
The loss is non-differentiable.
Smooth approximation to 1-norm
$\sum_{ijk} \Bigl| x_{ijk} - \sum_{r=1}^{R} u_{ir} v_{jr} w_{kr} \Bigr| \approx \sum_{ijk} \sqrt{\Bigl( x_{ijk} - \sum_{r=1}^{R} u_{ir} v_{jr} w_{kr} \Bigr)^2 + \epsilon}$
for small $\epsilon > 0$.
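For a single residual, the smoothing replaces $|x|$ with $\sqrt{x^2 + \epsilon}$, which is differentiable everywhere, including at zero, and stays within $\sqrt{\epsilon}$ of the true absolute value:

```python
import numpy as np

eps = 1e-4
x = np.linspace(-2.0, 2.0, 401)

smooth_abs = np.sqrt(x**2 + eps)  # differentiable, even at x = 0

# The gap to |x| is largest at x = 0, where it equals sqrt(eps) = 0.01.
print(np.max(smooth_abs - np.abs(x)))
```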
Majorization-Minimization
Solve a hard optimization problem
$\sum_{ijk} \sqrt{\Bigl( x_{ijk} - \sum_{r=1}^{R} u_{ir} v_{jr} w_{kr} \Bigr)^2 + \epsilon}$
as a series of easy ones (weighted least squares).
Minimize surrogate loss functions (quadratic functions).
MM Algorithm
Loss $= \sum_i \sqrt{(x_i - u)^2 + \epsilon}$
(Figure: the loss as a function of $u$, shown over successive MM steps; each step minimizes a quadratic surrogate in closed form, moving the iterate from a very bad starting point toward the optimum.)
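The MM iteration sketched above can be written down concretely for the scalar location problem: majorizing each term $\sqrt{(x_i - u)^2 + \epsilon}$ at the current iterate yields a weighted least squares surrogate with a closed-form minimizer (iteratively reweighted least squares). A minimal sketch, not the authors' code:

```python
import numpy as np

def mm_location(x, eps=1e-6, iters=100):
    # Minimize sum_i sqrt((x_i - u)^2 + eps) by majorization-minimization.
    u = np.mean(x)  # start at the (outlier-sensitive) least squares solution
    for _ in range(iters):
        w = 1.0 / np.sqrt((x - u) ** 2 + eps)  # weights from the quadratic majorizer
        u = np.sum(w * x) / np.sum(w)          # weighted least squares update
    return u

rng = np.random.default_rng(0)
x = np.concatenate([3.0 + rng.normal(size=100), [50.0, 60.0]])  # spiked data
print(mm_location(x), np.median(x))  # the MM iterate lands near the median
```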
Toy example
$X \in \mathbb{R}^{25 \times 25 \times 25}$
Slice = mix of A and B.
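The toy setup can be reproduced in outline: a low-rank trilinear wheat tensor plus either dense Gaussian chaff or sparse large spikes. All parameters below are hypothetical stand-ins for the talk's A/B patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
n, R = 25, 2

# Trilinear wheat: a rank-2 tensor in R^{25 x 25 x 25}.
U, V, W = (rng.normal(size=(n, R)) for _ in range(3))
wheat = np.einsum('ir,jr,kr->ijk', U, V, W)

# Gaussian chaff: small, dense perturbations everywhere.
X_gauss = wheat + 0.1 * rng.normal(size=(n, n, n))

# non-Gaussian chaff: sparse but large perturbations (about 5% of entries).
mask = rng.random((n, n, n)) < 0.05
X_spiked = wheat + mask * rng.uniform(-5.0, 5.0, size=(n, n, n))

print(X_gauss.shape, mask.mean())  # chaff hits roughly 5% of entries
```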
Toy example
(Figure: the 25 frontal slices of the toy tensor, numbered 1 through 25.)
Toy example: Trilinear Wheat + Gaussian Chaff
(Figure: the 25 slices of the toy tensor after adding Gaussian chaff.)
Toy example: A slice up close
Observed | True Wheat | Gaussian Chaff
(Figure: one $25 \times 25$ slice shown three ways: the observed slice, the true wheat, and the Gaussian chaff.)
Extracting from Gaussian Chaff by Least Squares
Observed | Extracted Wheat | Difference
(Figure: a slice of the observed data, the least squares extracted wheat, and their difference.)
Extracting from Gaussian Chaff by Least 1-norm
Observed | Extracted Wheat | Difference
(Figure: a slice of the observed data, the least 1-norm extracted wheat, and their difference.)
Trilinear wheat + non-Gaussian Chaff
(Figure: the 25 slices of the toy tensor after adding sparse, large non-Gaussian chaff.)
Extracting from non-Gaussian Chaff by Least Squares
Observed | Extracted Wheat | Difference
(Figure: a slice of the observed data, the least squares extracted wheat, and their difference.)
Extracting from non-Gaussian Chaff by Least 1-norm
Observed | Extracted Wheat | Difference
(Figure: a slice of the observed data, the least 1-norm extracted wheat, and their difference.)
Caveats
Costs
Computational: 1-norm minimization is more work than least squares.
Statistical: Robustness versus efficiency tradeoff
Limitations
Trouble with sparse tensor data.
Best suited for dense tensor data?
Summary
Least Squares (i.i.d. Gaussian errors)
Wide applicability, easy to compute solution, statistical efficiency.
But not always appropriate.
Observed | Extracted Wheat | Difference
(Figure: least squares extraction on the Gaussian-chaff and non-Gaussian-chaff toy examples, shown as Observed, Extracted Wheat, and Difference slices.)
Summary
Least 1-norm (i.i.d. Laplacian errors)
Can accommodate both Gaussian and non-Gaussian errors.
But not as easy to compute, less statistical efficiency.
Observed | Extracted Wheat | Difference
(Figure: least 1-norm extraction on the Gaussian-chaff and non-Gaussian-chaff toy examples, shown as Observed, Extracted Wheat, and Difference slices.)
Future Work
Real data
More sophisticated robust loss functions: β divergences
Robust factorizations for data on different scales:
Binary
Non-negative data.
Thank you!
Images from wikipedia.org
Eric C. Chi
echi@rice.edu