My talk from ICML 2016 describing the "information sieve", a principle for decomposing information that enables a new approach for unsupervised representation learning.
1. The Information Sieve
Greg Ver Steeg and Aram Galstyan
[Figure: a sieve filtering soup. Soup = data; the “main ingredient” is extracted at each layer.]
2. Factorial code
• Carry recipe instead of soup
• Missing ingredients? Make more soup
• Compression
• Prediction
• Generative model
[Figure: a recipe card listing Ingredient 1, Ingredient 2, …]
An invertible transform that makes the components independent.
Finding such a transform is, in general, an intractable problem.
Instead, we use a sequence of transformations that incrementally removes dependence.
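For reference, the goal can be stated in one line; a sketch in my notation (the slide itself shows only the cartoon):

```latex
% A factorial code: an invertible transform f whose output
% components are statistically independent.
Y = f(X), \qquad p(y_1, \ldots, y_m) \;=\; \prod_{j=1}^{m} p(y_j)
```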
3. Two Steps
1. Find the most informative function Y_k of the input data
2. Transform the data to remove the information in Y_k, then repeat
[Figure: input poured through successive sieves; each layer splits off the main ingredient, leaving the remainder.]
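A toy, runnable sketch of this two-step loop on binary data. The choice of Y here (the single column most informative about the rest) and the XOR remainder are simplifications for illustration, not the paper's actual optimization:

```python
import numpy as np

def mi(a, b):
    """Empirical mutual information (bits) between two binary arrays."""
    h = lambda p: -sum(q * np.log2(q) for q in p if q > 0)
    joint = [np.mean((a == i) & (b == j)) for i in (0, 1) for j in (0, 1)]
    return (h([np.mean(a == 0), np.mean(a == 1)])
            + h([np.mean(b == 0), np.mean(b == 1)])
            - h(joint))

def sieve_layer(X):
    """One toy pass of the sieve on binary X (n_samples, n_vars).
    Step 1: take Y to be the column most informative about the others
    (a crude stand-in for maximizing the total correlation explained).
    Step 2: XOR Y into every column; given Y this is invertible
    (x_j = xbar_j ^ y), so no information is lost."""
    scores = [sum(mi(X[:, i], X[:, j]) for j in range(X.shape[1]) if j != i)
              for i in range(X.shape[1])]
    y = X[:, int(np.argmax(scores))].copy()
    return y, X ^ y[:, None]  # (main ingredient, remainder)

# Toy data: five noisy copies of one hidden bit. Y recovers the hidden
# bit and the remainder columns are (nearly) independent of it.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=5000)
X = z[:, None] ^ (rng.random((5000, 5)) < 0.1).astype(z.dtype)
y, Xbar = sieve_layer(X)
print([round(mi(y, Xbar[:, j]), 3) for j in range(5)])  # all near zero
```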
4. The main ingredient: multivariate information
• Multivariate mutual information, or Total Correlation (Watanabe, 1960)
• TC(X|Y) = 0 if and only if Y “explains” all the dependence in X
• So we search for Y that minimizes TC(X|Y)
• Equivalently, we define the total correlation explained by Y as:
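The formulas on this slide did not survive the export; a reconstruction from the standard definitions:

```latex
% Total correlation (Watanabe, 1960) and its conditional version:
TC(X) = \sum_{i=1}^{n} H(X_i) - H(X), \qquad
TC(X \mid Y) = \sum_{i=1}^{n} H(X_i \mid Y) - H(X \mid Y)
% The total correlation explained by Y; minimizing TC(X|Y)
% is equivalent to maximizing this:
TC(X ; Y) = TC(X) - TC(X \mid Y)
```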
5. The main ingredient: Total Correlation Explanation (CorEx)
• Optimize over all probabilistic functions
• Solution has special form that makes it tractable
• Computational complexity is linear in the number of variables
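To make the objective concrete, here is a minimal plug-in estimator of TC for discrete data. This brute-force version enumerates the joint distribution, unlike CorEx's linear-complexity optimization:

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (bits) of a sequence of hashable samples."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def total_correlation(X):
    """Plug-in estimate of TC(X) = sum_i H(X_i) - H(X) for an
    (n_samples, n_vars) array of discrete values."""
    marginals = sum(entropy(X[:, i]) for i in range(X.shape[1]))
    joint = entropy(map(tuple, X))
    return marginals - joint

# Two noisy copies of the same bit: TC equals their mutual information,
# about 0.53 bits for 10% flip noise.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=100_000)
x2 = np.where(rng.random(100_000) < 0.1, 1 - x1, x1)
print(total_correlation(np.column_stack([x1, x2])))
```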
6. Sift out the main ingredient: remainder info
The remainder is a transformation of the inputs with two properties:
1. The remainder contains no info about Y
2. The transformation is invertible
[Figure: soup (input) poured through the sieve, leaving the remainder.]
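In symbols, the two properties can be written per variable (a hedged reconstruction, with \bar{X}_i denoting the remainder for variable X_i):

```latex
% Property 1: the remainder carries no information about Y.
I(\bar{X}_i \,;\, Y) = 0
% Property 2: invertibility -- X_i is exactly recoverable from
% the remainder together with Y.
H(X_i \mid \bar{X}_i, Y) = 0
```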
8. Iterative sifting
The dependence at each layer of the sieve decreases until it reaches zero, i.e. complete independence.
[Plot: dependence at layer r, decreasing as the layers extract dependence.]
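The plot's message can also be written as an equation; a hedged reconstruction, where X^{(k)} denotes the remainder after layer k, with X^{(0)} = X:

```latex
% Under the two remainder properties, the dependence in the data
% decomposes across layers, plus whatever remains after r layers:
TC(X) \;=\; \sum_{k=1}^{r} TC\left(X^{(k-1)} ; Y_k\right) \;+\; TC\left(X^{(r)}\right)
% Each explained term is nonnegative, so the remaining dependence
% TC(X^{(r)}) shrinks toward zero as layers are added.
```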
9. Recover spatial clusters from fMRI data
[Figure: recovered clusters compared across three panels: Ground truth, ICA, Sieve.]
An example of recovering spatial clusters in brain data from temporal activation patterns.
10. Lossy compression and in-painting
• Sieve representation with 12 layers/bits/binary latent factors on MNIST digits
We can use the sieve for standard prediction and generative modeling tasks.
11. Lossless compression (on MNIST)
• Same-size codebooks for the Random and Sieve-based codes
• (gzip is sequence-based, shown for reference)
A proof of principle for lossless compression, though specialized compression techniques do better on MNIST.
Method           Naive   gzip   Random codebook   Sieve codebook
Bits per digit   784     328    267               243
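Bits per digit for a baseline like gzip can be measured directly; a minimal sketch with placeholder data (real binarized MNIST would be needed to reproduce the ~328 bits/digit above):

```python
import zlib
import numpy as np

# Placeholder stand-in for binarized MNIST: an (n_digits, 784) array of
# 0/1 pixels. The naive code spends exactly 784 bits per digit; a
# sequence-based compressor does better on real digits.
rng = np.random.default_rng(0)
digits = (rng.random((1000, 784)) < 0.15).astype(np.uint8)

raw = np.packbits(digits, axis=1).tobytes()   # naive: 784 bits per digit
compressed = zlib.compress(raw, 9)            # DEFLATE, same family as gzip
print("naive bits/digit:", 8 * len(raw) / len(digits))
print("zlib  bits/digit:", 8 * len(compressed) / len(digits))
```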
12. Conclusion
• Incrementally decomposing multivariate information is useful, practical, and delicious
• Could improve with joint optimization and better transformations for remainder info
Link to all papers and code: http://bit.ly/corex_info
Contact: gregv@isi.edu, galstyan@isi.edu
• The extension to continuous random variables is nontrivial but more practical, and demonstrates connections to “common information”: “Sifting Common Information from Many Variables”, arXiv:1606.02307.
Editor's Notes
I have a cartoon version of the talk…[describe]...that’s like 90% of it.
I’m going to stick with the soup metaphor:
All that remains is to say what we mean by “main ingredient”, and what it means to “remove” it.
Before that, though, why would you want to do this?
Filtering out all the ingredients in soup is really a way to reverse engineer the recipe.
The technical equivalent of this is called a factorial code: decomposing the data into independent components.
There are many advantages…
Unfortunately, this isn’t very easy.
Our sieve gives us an easy way to do this incrementally, so that our representation is more independent at each step.
Let’s abstract a bit...
At every layer of this sieve, we have discrete random variables with iid samples drawn from an unknown distribution.
Step 1 finds the “main ingredient” by solving the optimization on the slide (choosing Y to maximize the total correlation explained).
Step 2 filters it out
Why the need for a qualification? It seems to me that information by itself is somewhat useless for learning. A bit of noise and a bit of signal are not really distinguishable.
High-dimensional data is only difficult if there are nontrivial relationships, so that’s what we need to characterize.
(CAREFUL not to ramble here…)
In soup terms, we have two criteria:
The ingredient is completely extracted. If not, we might end up sifting out some carrots at layer 1 and more at layer 3.
We can invert the transformation. We just throw the carrots back in and we are right where we started.
WHY do we define remainder in this way exactly? The next two slides will show why that’s a powerful way to go.
Defining the main ingredient as multivariate information and correctly defining the remainder information leads finally to some very nice expressions.
Ok, so now we have a way to progressively extract the most important ingredients in our soup. We mentioned the benefits at the beginning, and we still get almost all of those benefits from doing it progressively. In fact, in a way we are better off because our list of ingredients is ranked by importance.
PUT IN PLOT?
Synthetic data, so we know the ground truth.
Plotting the weights; note that this is a linear version, described in a different paper.