The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend computer science and statistics. That classical perspectives from these fields are not adequate to address emerging challenges with massive datasets is apparent from their sharply divergent nature at an elementary level: in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. In classical statistics, one usually considers the increase in inferential accuracy as the number of data points grows (with little formal consideration of computational complexity), while in classical numerical computation, one typically analyzes the improvement in accuracy as more computational resources such as space or time are employed (with the size of a dataset not formally viewed as a resource). In this talk we describe some of our research efforts towards addressing the question of trading off the amount of data and the amount of computation required to achieve a desired inferential accuracy.
CLIM Program: Remote Sensing Workshop, Computational and Statistical Trade-offs in Data Analysis - Venkat Chandrasekaran, Feb 12, 2018
1. Computational and Statistical Tradeoffs
Venkat Chandrasekaran
Caltech
Joint with Michael Jordan (Berkeley), Yong Sheng Soh (Caltech), Quentin Berthet (Cambridge)
3. Classical Analysis in Statistics
o Previously, key bottleneck: amount of data
o Consider "best" estimator without much regard for computational considerations
o Minimax analysis
o More recently, time/memory/other infrastructure is the bottleneck
o Data is plentiful in several domains
o Need to incorporate these additional constraints formally
5. A Thought Experiment
o The standard inferential tradeoff
o Process 5MB of data in 1 hour, error = 0.1
o Process 5TB in 1 month, error = 0.0001
o Given 5TB of data, happy with error = 0.1
o Don’t care about small improvements in error
o Statistical models are only approximations of reality
o More data useful for less computation?
6. Computer Science vs. Statistics
[Figure: schematic with three axes (Error; Amount of data; Runtime / Space / …): error vs. amount of data is the classical statistics tradeoff, error vs. runtime/space is the classical computer science tradeoff, and the joint picture gives computation/statistics tradeoffs]
7. Time-Data Tradeoffs
o Consider an inference problem with fixed (desired) error
o Inference procedures viewed as points in plot
[Figure: runtime vs. number of samples, with inference procedures A–F shown as points; the minimax risk lower bound on the number of samples is well understood, while the computational lower bound on runtime is poorly understood and depends on the computational model]
8. Time-Data Tradeoffs
o Consider an inference problem with fixed (desired) error
o Need "weaker" algorithms for larger datasets to reduce runtime
o At some stage, throw away data
[Figure: same runtime vs. number of samples plot, with procedures A–F]
9. An Estimation Problem
o Signal from known (bounded) set
o Noise
o Observation model
o Observe i.i.d. samples
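(The formulas on this slide were not captured. One standard reading of the bullets, stated here as an assumption rather than as the slide's exact notation, is the Gaussian denoising model:)

    \[
    x^\star \in S \subset \mathbb{R}^p \ \ (S \text{ known and bounded}), \qquad
    z_i \sim \mathcal{N}(0, \sigma^2 I_{p \times p}), \qquad
    y_i = x^\star + z_i, \quad i = 1, \dots, n \ \ \text{(i.i.d.)}.
    \]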
10. Convex Programming Estimator
o Sample mean is sufficient statistic
o Natural M-estimator
o Convex programming M-estimator
o Constraint is a convex set such that
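(Since the constraint set and the optimization problem on this slide were not captured, here is a minimal numerical sketch, in Python, of the general recipe: average the samples, then project the sample mean onto a convex set C containing the signal set. The Euclidean ball below is a stand-in chosen only because its projection has a closed form; it is not the relaxation used in the talk, and all sizes are illustrative.)

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, sigma = 100, 50, 1.0                     # illustrative sizes, not values from the talk
    x_star = np.sign(rng.standard_normal(p))       # a signal from a bounded set (here {-1, +1}^p)
    y = x_star + sigma * rng.standard_normal((n, p))

    y_bar = y.mean(axis=0)                         # sample mean: the sufficient statistic

    def project_l2_ball(v, radius):
        """Euclidean projection onto {x : ||x||_2 <= radius}, a stand-in for the convex set C."""
        norm = np.linalg.norm(v)
        return v if norm <= radius else (radius / norm) * v

    x_hat = project_l2_ball(y_bar, radius=np.sqrt(p))  # C must contain the signal set
    print("squared estimation error:", np.linalg.norm(x_hat - x_star) ** 2)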
11. Convex Programming Estimator
o Long history of shrinkage estimation in statistics
o James, Stein (1961)
o Donoho, Johnstone (early 1990s)
o Shrinkage onto convex sets for tractability
o Many surprises in high dimensions, i.e., large p
o More recently
o L1 norm, trace norm, max norm, …
13. Statistical Performance of Estimator
o Defn 1: The cone of feasible directions into a convex set C is defined as
o Defn 2: The Gaussian (squared) complexity of a cone T is defined as
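(The displayed formulas for both definitions were lost in extraction. Reconstructions consistent with standard usage in this line of work, stated as assumptions since the slide's exact notation is unavailable:)

    \[
    T_C(x^\star) \;=\; \{\, d \;:\; x^\star + \varepsilon d \in C \ \text{for some } \varepsilon > 0 \,\},
    \qquad
    g(T) \;=\; \mathbb{E}_{w \sim \mathcal{N}(0, I)}\Big[\, \sup_{d \in T,\ \|d\|_2 \le 1} \langle w, d \rangle^2 \,\Big].
    \]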
14. Statistical Performance of Estimator
o Thm: The risk of the estimator is
o Proof: Apply optimality conditions
o Intuition: Only consider error in feasible cone
o Many improvements possible to this theorem
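(The bound itself did not survive extraction. Assuming the model and definitions sketched above, and suppressing universal constants, the standard form of such a risk bound is:)

    \[
    \mathbb{E}\,\big\| \hat{x}_n(C) - x^\star \big\|_2^2 \;\lesssim\; \frac{\sigma^2}{n}\; g\big(T_C(x^\star)\big),
    \]

where \hat{x}_n(C) denotes the convex programming estimator over the constraint set C computed from n samples. The "feasible cone" intuition is that the error \hat{x}_n(C) - x^\star always lies in T_C(x^\star), so only the noise energy along directions in that cone matters.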
16. Weakening via Convex Relaxation
o Corr: To obtain risk of at most 1,
Monotonic in C
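(Reading the truncated corollary together with the "Monotonic in C" annotation, and again suppressing constants, the sample-size requirement would be:)

    \[
    n \;\gtrsim\; \sigma^2\, g\big(T_C(x^\star)\big) \quad\Longrightarrow\quad
    \mathbb{E}\,\big\|\hat{x}_n(C) - x^\star\big\|_2^2 \;\le\; 1 .
    \]

Monotonicity here means: if C \subseteq C' (both containing x^\star), then T_C(x^\star) \subseteq T_{C'}(x^\star) and hence g(T_C(x^\star)) \le g(T_{C'}(x^\star)), so a looser relaxation needs at least as many samples; the payoff is that projecting onto the looser set is typically cheaper.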
17. Weakening via Convex Relaxation
o If we have access to larger n, can use larger C
o Obtain "weaker" estimation algorithm
18. Hierarchy of Convex Relaxations
o If the signal set is “algebraic”, then one can obtain a family of outer convex approximations
o Polyhedral, semidefinite, hyperbolic, relative entropy relaxations
(Sherali-Adams, Boyd, Parrilo, Lasserre, Renegar, C., Shah)
o Sets ordered by computational complexity
o Central role played by lift-and-project
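(A concrete instance of such an outer approximation, offered as an illustration rather than as the example on the slide: the convex hull of the cut matrices appearing in Example 1 below is the cut polytope, which is intractable to optimize over, but it sits inside the elliptope, a tractable semidefinite relaxation:)

    \[
    \mathrm{conv}\{\, z z^{\mathsf{T}} \;:\; z \in \{-1,+1\}^m \,\} \;\subseteq\;
    \{\, X \succeq 0 \;:\; X_{ii} = 1 \ \text{for all } i \,\},
    \]

and a suitably scaled nuclear-norm ball gives a still weaker, still cheaper outer approximation, so the sets are naturally ordered by computational complexity.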
19. Before we get to examples …
o How do we calculate runtime?
o Total runtime = np + # ops for projection
‒ The np term (computing the sample mean) increases with more data
‒ The projection term (subsequent processing) decreases with more data
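(A toy bookkeeping sketch, in Python, of this accounting. Only the structure total = n·p + projection cost comes from the slide; the procedure names and all costs below are made-up placeholders.)

    # Toy accounting of total runtime = n*p (sample mean) + cost of the projection step.
    # All numbers are hypothetical, purely to illustrate the bookkeeping.
    p = 10_000                                   # ambient dimension (assumed)
    procedures = {
        # name: (samples needed to reach the target risk, cost of one projection)
        "tight relaxation":  (2_000,   5e9),     # few samples, expensive projection
        "looser relaxation": (20_000,  1e7),     # more samples, cheap projection
        "sample mean only":  (500_000, 0),       # no projection at all
    }
    for name, (n, proj_cost) in procedures.items():
        total = n * p + proj_cost                # averaging cost + subsequent processing
        print(f"{name:18s} n = {n:>7d}   total ops ~ {total:.2e}")

With these placeholder numbers the looser relaxation has the smallest total, which is the qualitative point of the examples that follow.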
20. Example 1
o Signal set consists of cut matrices
o E.g., collaborative filtering, clustering
21. Example 2
o Signal set consists of all perfect matchings in complete graph
o E.g., network inference
22. Example 3
o Banding estimators for covariance matrices
o Bickel-Levina [2007]; Cai, Zhang, Zhou [2010]
o Assume known variable ordering
o Stylized problem: let M be a known tridiagonal matrix
o Signal set
23. Example 4
o Signal set consists of all adjacency matrices of graphs with only a clique on a square root of the number of nodes
o E.g., sparse PCA, gene expression patterns
o Kolar et al. (2010)
25. Example 4
o What if we use an even weaker relaxation?
o E.g., just take the sample mean as the estimate
o In this case, it makes sense to throw away data …
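(For this extreme case the calculation is standard and exact, independent of the formula lost from the slide: with no constraint at all, i.e. C = \mathbb{R}^p, the estimator is just the sample mean, and)

    \[
    \mathbb{E}\,\big\| \bar{y} - x^\star \big\|_2^2 \;=\; \frac{\sigma^2 p}{n},
    \]

so risk at most 1 requires n \ge \sigma^2 p samples. Beyond that point, extra samples only add to the n·p averaging cost without being needed to meet the target accuracy, which is why discarding data is sensible here.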
26. Recall Plot …
o At some stage, throw away data
o Several caveats
‒ Need computational lower bounds
‒ Depends on computational architecture
‒ Parallel vs. serial, …
[Figure: runtime vs. number of samples plot with procedures A–F, repeated]
27. Beyond Denoising Problems …
o Consider a high-dimensional time series
o Goal: reliably estimate change-points
28. Change-Point Estimation
o Suppose we sample more frequently, but the number of changes remains the same
o Can again use convex relaxation to obtain time-data tradeoffs
o Joint work with Y.S. Soh (Caltech)
Soh, C., “High-Dimensional Changepoint Estimation: Combining Filtering and Convex Optimization,” ACHA, 2016
29. Heterogeneous Data
o Preceding setups assume unrealistically that data are “stationary” (same type, quality, etc.)
o How do we optimally deploy computational and other resources to process data from disparate sources?
o Preliminary efforts (with Q. Berthet), but a lot remains to be done
[Figure: geo-monitoring stations around California + cellphone seismometers in Pasadena = ?]
Berthet, C., “Resource Allocation for Statistical Estimation,” Proc. of the IEEE, 2016
30. Summary
o Many action items…
o Large-scale practical demonstrations
o Other types of inference problems, e.g., covariance computation
o Mathematically codify network / communication / computing constraints?
o Ultimately inform data-gathering exercises?
users.cms.caltech.edu/~venkatc