Modern, large-scale data analysis typically involves massive data sets stored on different computers that do not share the same file system. Computing complex statistical quantities, such as those that characterize spatial or temporal statistical dependence, requires information that crosses the boundaries imposed by this partitioning of the data. To leverage the information in these distributed data sets, analysts face a trade-off between various costs (e.g., computational, transmission, and even the cost of building an appropriate data system infrastructure) and inferential uncertainties (bias, variance, etc.) in the estimates produced by the analysis. In this talk we introduce a framework for quantifying this trade-off by optimizing over both statistical and data system design aspects of the problem. We illustrate it with a simple example and discuss how it may be extended to more complex settings.
CLIM Program: Remote Sensing Workshop, A Notional Framework for a Theory of Data Systems - Maggie Johnson, Feb 12, 2018
1. A Notional Framework for a Theory of Data Systems
Maggie Johnson
Joint with members of the ToDS subgroup
of the SAMSI CLIM Remote Sensing Working Group
Workshop on Remote Sensing, Uncertainty Quantification,
and a Theory of Data Systems
February 12, 2018
M. Johnson Remote Sensing Workshop February 12, 2018 1 / 22
2. Motivation
Motivation for this workshop:
. . .data must be brought together in some way . . . but moving data to a
central location for analysis is tedious at best and impossible at worst.
Some (remote) data reduction is almost certainly necessary, but how
much? What are the consequences for inference?
. . .how to navigate the trade-space between computational, transmission
and infrastructure costs versus uncertainty (a.k.a. “statistical costs”) in
the estimates or inferences that are ultimately produced.
In other words, can we integrate the design of data systems with the design of
statistical methodology to balance the various tradeoffs in these costs?
Can this be formulated as a well-specified optimization problem?
3. What are the costs?
1 Computational
Number of operations, memory, time, etc.
2 Statistical
variance, prediction error, etc.
3 Transmission/Data Movement
bandwidth, latency, money, privacy, etc.
4 System Infrastructure/Design
data storage, types of connections, compute resources, etc.
5 . . .
4. From the Software/System Architect’s Perspective
In designing a data system, architects consider the infrastructural costs and how
the design of the data system affects how data can be manipulated and moved
throughout the system:
how to stage data across servers?
where to build connections, and how fast do they need to be?
how to deploy compute resources?
which services on which machines?
privacy?
5. From the Statistician’s/Data Scientist’s Perspective
In designing a statistical analysis, statisticians/data scientists are familiar with the
ideas of balancing the tradeoffs between the quality of a statistical analysis and
the computational costs of that analysis.
how much data, which data, where to move data?
which methodology?
what are the tradeoffs in efficiency of estimators/quality of inference
(uncertainty)?
Statistical analyses of distributed data depend on how data can be accessed,
on the available computational resources, etc. (i.e., the design of the data system).
6. A Theory of Data Systems
The simultaneous optimization of the data system architecture and the statistical
methodology balancing the tradeoffs in costs, for a given data analysis objective.
In theory, in order to do this we need to:
1 be able to quantify all of the various costs of performing data analysis in a
distributed setting
Many of the costs are very difficult to quantify
2 solve a highly complex, constrained, multi-objective optimization problem
competing objectives
3 choose a solution with costs we are willing to accept from a set of Pareto
optimal solutions
i.e., “choose your battles”
7. Illustration with a Toy Example
8. Data System Setup
$J$ servers, each with $N_j$ observations ($j = 1, \ldots, J$)
Assume only the user has computational resources
Cost to access the $j$th server is $a_j$, and to move a data value from server $j$ to
the user is $b_j$
$n_j$ is the number of observations downloaded from server $j$ to the user
9. Data Analysis Objective
The statistical objective is to perform inference on the population mean from data
distributed across J servers, with the following statistical properties
Let $Y_{ij}$ be the $i$th observation on server $j$; assume $E(Y_{ij}) = \mu$, $\mathrm{Var}(Y_{ij}) = 1$
Correlation between two observations on the same server is $\phi$
Correlation between an observation on server $j$ and one on server $k$ is $\rho^{|j-k|}$
$\phi$ and $\rho$ are assumed known
Goal is to perform inference on $\mu$ using the sample mean
$$\bar{Y}_n = \Big( \sum_{j=1}^{J} n_j \Big)^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} Y_{ij}$$
computed from $n = \{n_1, \ldots, n_J\}$ observations as the estimator.
10. The Costs
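1 Statistical Cost (squared error loss → minimize variance):
$$C_{st}(n) = \mathrm{Var}(\bar{Y}_n) = N_n^{-2} \sum_{j=1}^{J} \Big[ n_j + \phi (n_j^2 - n_j) + \sum_{k \neq j} n_j n_k \rho^{|j-k|} \Big]$$
where $N_n = \sum_{j=1}^{J} n_j$ is the total number of downloaded observations.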
Given (assumed known) φ and ρ, the statistical cost depends only on the
amount of data downloaded from each server.
2 Infrastructure/Design Cost:
$$C_{ds}(a, b) = \sum_{j=1}^{J} \left( a_j^{-1} + b_j^{-0.5} \right)$$
Meant to penalize small aj and bj (i.e. it is expensive to build a faster
connection)
Idea is that more resources should be allocated to servers where we need to
download more data.
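The statistical cost can be checked numerically. The following is a minimal Python sketch, not code from the talk: the function names are hypothetical, and it reads the normalizer as the total number of downloaded observations, $\sum_j n_j$. It evaluates the closed-form variance of the sample mean and compares it with a brute-force sum over the full covariance matrix of the downloaded values.

```python
import numpy as np

def statistical_cost(n, phi, rho):
    """Closed-form Var of the sample mean for per-server download counts n."""
    n = np.asarray(n, dtype=float)
    total = n.sum()
    if total == 0:
        raise ValueError("at least one observation must be downloaded")
    idx = np.arange(len(n))
    # Between-server correlation rho^{|j-k|}; the diagonal is zeroed so that
    # only k != j contributes to the cross term.
    R = rho ** np.abs(idx[:, None] - idx[None, :])
    np.fill_diagonal(R, 0.0)
    within = n + phi * (n**2 - n)   # variances plus same-server covariances
    between = n * (R @ n)           # covariances across different servers
    return (within + between).sum() / total**2

def statistical_cost_bruteforce(n, phi, rho):
    """Same quantity, summing the full covariance matrix of all downloads."""
    server = np.repeat(np.arange(len(n)), n)      # server label per observation
    dist = np.abs(server[:, None] - server[None, :])
    cov = np.where(dist == 0, phi, rho ** dist.astype(float))
    np.fill_diagonal(cov, 1.0)                    # Var(Y_ij) = 1
    return cov.sum() / len(server) ** 2

closed = statistical_cost([3, 0, 2], phi=0.5, rho=0.1)
brute = statistical_cost_bruteforce([3, 0, 2], phi=0.5, rho=0.1)
```

The two values agree, which is a useful check that the within-server and between-server terms of the formula are counted once each.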
11. The Costs
3 Data Movement & Computation Cost:
Define data movement costs for $n = \{n_1, \ldots, n_J\}$ observations as
$$\sum_{j=1}^{J} \left( a_j I(n_j > 0) + b_j n_j \right)$$
Computational complexity is $O\big( \sum_{j=1}^{J} n_j \big)$
Combine both into a cost function for data movement and computation.
$$C_c(a, b, n) = \sum_{j=1}^{J} \left( a_j I(n_j > 0) + b_j n_j \right) + \sum_{j=1}^{J} n_j$$
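As a rough illustration, the design cost and the combined data movement and computation cost can be written as two short Python functions. This is a sketch under the definitions on these slides; the function names are hypothetical, and the design cost is read as summing both $a_j^{-1}$ and $b_j^{-0.5}$ over the servers.

```python
def design_cost(a, b):
    """C_ds: penalizes small access costs a_j and small per-value costs b_j."""
    return sum(1.0 / aj + bj ** -0.5 for aj, bj in zip(a, b))

def movement_computation_cost(a, b, n):
    """C_c: access cost for each server used, per-value transfer cost,
    plus a computation term that is linear in the number of downloads."""
    movement = sum(aj * (nj > 0) + bj * nj for aj, bj, nj in zip(a, b, n))
    computation = sum(n)
    return movement + computation

# Hypothetical two-server design: only server 1 is queried.
c_ds = design_cost([2.0, 4.0], [4.0, 16.0])
c_c = movement_computation_cost([2.0, 4.0], [4.0, 16.0], [10, 0])
```

Note that a server with $n_j = 0$ contributes nothing to $C_c$ but still contributes to $C_{ds}$, since its connection is part of the infrastructure either way.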
12. Multiobjective Optimization
The optimal distributed analysis for the toy example is a solution that jointly
minimizes the costs associated with the statistical analysis and the data system
infrastructure.
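$$\begin{aligned}
\underset{n, a, b}{\text{minimize}} \quad & C_{ds}(a, b),\; C_{st}(n),\; C_c(a, b, n) \\
\text{subject to} \quad & a_j \in (c, d) \\
& b_j \in (e, f) \\
& n_j \in \mathbb{N} \\
& n_j \le N_j
\end{aligned}$$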
For the toy example, this optimization is feasible.
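To make the idea of a set of non-dominated solutions concrete, here is a small Python sketch that enumerates a coarse grid of designs and keeps the Pareto-optimal ones. It is a brute-force stand-in for illustration only, not the optimizer used in the talk; the grid values, the choice of $J = 2$, and the simplification that all servers share the same $a$ and $b$ are assumptions made here.

```python
import itertools
import numpy as np

def costs(a, b, n, phi=0.5, rho=0.1):
    """Return (C_ds, C_st, C_c) for one candidate design."""
    a, b, n = (np.asarray(v, dtype=float) for v in (a, b, n))
    idx = np.arange(len(n))
    R = rho ** np.abs(idx[:, None] - idx[None, :])
    np.fill_diagonal(R, 0.0)                       # keep only k != j
    c_st = (n + phi * (n**2 - n) + n * (R @ n)).sum() / n.sum() ** 2
    c_ds = (1.0 / a + b ** -0.5).sum()
    c_c = (a * (n > 0) + b * n).sum() + n.sum()
    return c_ds, c_st, c_c

def pareto_mask(points):
    """Boolean mask of non-dominated rows (every objective is minimized)."""
    pts = np.asarray(points, dtype=float)
    keep = np.ones(len(pts), dtype=bool)
    for i, p in enumerate(pts):
        # A row dominates p if it is no worse everywhere and better somewhere.
        dominates_p = np.all(pts <= p, axis=1) & np.any(pts < p, axis=1)
        keep[i] = not dominates_p.any()
    return keep

# Hypothetical coarse grid inside a_j in (1, 50), b_j in (1, 20), n_j <= 100.
a_grid, b_grid, n_grid = [2.0, 10.0, 40.0], [2.0, 10.0], [0, 5, 50]

candidates = []
for a, b, n1, n2 in itertools.product(a_grid, b_grid, n_grid, n_grid):
    if n1 + n2 == 0:
        continue                                   # need at least one value
    candidates.append(((a, b, n1, n2), costs([a, a], [b, b], [n1, n2])))

objectives = np.array([c for _, c in candidates])
front = [cand for cand, keep in zip(candidates, pareto_mask(objectives)) if keep]
```

Each entry of `front` is a design that cannot be improved in one cost without worsening another; choosing among them is the step that requires a decision about acceptable costs.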
13. The Pareto Front
Let $\phi = 0.5$, $\rho = 0.1$, $N_j = 100$, $J = 5$, $a_j \in (1, 50)$, $b_j \in (1, 20)$. Using the R
package nloptr:
[Figure: the Pareto front]
14. “Choosing your Battles”
Suppose we wish to keep computational/data movement costs low (e.g. < 2000).
High statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.13$)
Trades off against an expensive data system design ($C_{ds} = 5$)
15. “Choosing your Battles”
Suppose we wish to keep computational/data movement costs low (e.g. < 2000).
Cheap data system design (e.g. $C_{ds} < 2$)
Trades off against reduced statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.14$)
16. Effect of the Statistical Properties of the Data
Let $\phi = 0.5$, $\rho = 0.4$; recall that the correlation between two servers is $\rho^{|j-k|}$.
It is more efficient to sample from servers far away from each other
More resources are then focused on these servers
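A quick numerical check of this effect, as a Python sketch with hypothetical names: with $\phi = 0.5$, $\rho = 0.4$ and $J = 5$, it compares the variance of the sample mean when 100 values come from two adjacent servers against 100 values from the two servers at opposite ends. The allocations are chosen here purely for illustration.

```python
import numpy as np

def variance_of_mean(n, phi, rho):
    """Var of the sample mean under the toy correlation model."""
    n = np.asarray(n, dtype=float)
    idx = np.arange(len(n))
    R = rho ** np.abs(idx[:, None] - idx[None, :])
    np.fill_diagonal(R, 0.0)                  # cross term uses k != j only
    return (n + phi * (n**2 - n) + n * (R @ n)).sum() / n.sum() ** 2

# Same total download (100 values), different choice of servers.
adjacent = variance_of_mean([50, 50, 0, 0, 0], phi=0.5, rho=0.4)
far_apart = variance_of_mean([50, 0, 0, 0, 50], phi=0.5, rho=0.4)
```

Under these assumptions the variance is about 0.455 for the adjacent servers and about 0.268 for the distant pair, because the between-server correlation drops from $\rho$ to $\rho^4$.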
17. Alternative Formulations of the Optimization Problem
Knowledge/decisions about the acceptable tradeoffs in costs can reduce the
optimization problem.
If all costs are equally important, combine the costs into a single objective
function
There are multiple subproblems given prior decisions on acceptable costs.
For example, in the toy example, set $C_c < 2000$ as a constraint rather than
including that cost in the objective function
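Both reformulations can be sketched in a few lines of Python. This is an illustrative brute-force search over a hypothetical coarse grid (with $J = 3$ and $a$, $b$ shared across servers, both assumptions made here), not the procedure used in the talk. The first function combines all three costs into one weighted sum; the second treats $C_c$ as a budget constraint and minimizes a weighted sum of the remaining two costs.

```python
import itertools
import numpy as np

def costs(a, b, n, phi=0.5, rho=0.1):
    """Return (C_ds, C_st, C_c) for one candidate design."""
    a, b, n = (np.asarray(v, dtype=float) for v in (a, b, n))
    idx = np.arange(len(n))
    R = rho ** np.abs(idx[:, None] - idx[None, :])
    np.fill_diagonal(R, 0.0)
    c_st = (n + phi * (n**2 - n) + n * (R @ n)).sum() / n.sum() ** 2
    c_ds = (1.0 / a + b ** -0.5).sum()
    c_c = (a * (n > 0) + b * n).sum() + n.sum()
    return c_ds, c_st, c_c

def candidates(J=3):
    """Yield (design, costs) pairs over a hypothetical coarse grid."""
    for a, b in itertools.product([2.0, 10.0, 40.0], [2.0, 10.0]):
        for n in itertools.product([0, 10, 100], repeat=J):
            if sum(n) > 0:
                yield (a, b, n), costs([a] * J, [b] * J, n)

def single_objective(weights):
    """Weighted-sum scalarization of all three costs."""
    return min(candidates(), key=lambda item: float(np.dot(weights, item[1])))

def budget_constrained(budget, weights=(1.0, 1.0)):
    """Keep designs with C_c < budget, then minimize the other two costs."""
    feasible = [item for item in candidates() if item[1][2] < budget]
    return min(feasible,
               key=lambda item: weights[0] * item[1][0] + weights[1] * item[1][1])

design, c = budget_constrained(2000.0)
```

The weights are a modelling choice: the three costs live on very different scales, so equal weights on the raw values let $C_c$ dominate the sum, which is one reason to move it into a constraint.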
18. The Difference between Theory and Practice
“Good Enough” Solution
In practice, it may only be realistic to obtain a solution which achieves
“acceptable” costs.
The statistical design may depend on unknown properties of the data (e.g.
unknown φ and ρ)
Not feasible to build the data system at every iteration of an optimization
algorithm
Quantifying statistical performance may in itself be computationally
demanding, and/or require data movement
Multiple analysis objectives
19. The Difference between Theory and Practice
“Good Enough” Solution
In practice, one might iterate between the data system design and the statistical
design, to find a “good enough” solution.
1 Start with a distributed data system design, learn about the data
2 Given preliminary knowledge of statistical properties of the data, update the
data system architecture
3 Given the new data system architecture, update the statistical design
4 . . .
DAWN provides a potential framework to simulate this procedure.
20. A Few Next Steps
For the toy example:
Allow connections between servers and distributed computation
Analysis objectives beyond inference about the mean
Incorporation of more realistic costs/constraints
Maximum likelihood estimation of the parameters of a covariance function
Multilayer networks as a framework for organizing the optimization problem
21. Closing Thoughts
To continue scaling data analyses to ever-growing data volumes, we
need to be able to exploit distributed data system architectures.
Requires understanding and accounting for the tradeoffs in the costs
associated with distributed data analysis and inferential quality for both data
system design and the design of the data analysis.
A Theory of Data Systems requires collaboration between statisticians, computer
scientists, data system architects, software engineers, and more.
Understanding realistic costs for all aspects of distributed data analysis
requires expert knowledge in each area.