CLIM Program: Remote Sensing Workshop, A Notional Framework for a Theory of Data Systems - Maggie Johnson, Feb 12, 2018

A Notional Framework for a Theory of Data Systems
Maggie Johnson
Joint with members of the ToDS subgroup
of the SAMSI CLIM Remote Sensing Working Group
Workshop on Remote Sensing, Uncertainty Quantification,
and a Theory of Data Systems
February 12, 2018

Motivation
Motivation for this workshop:
... data must be brought together in some way ... but moving data to a
central location for analysis is tedious at best and impossible at worst.
Some (remote) data reduction is almost certainly necessary, but how
much? What are the consequences for inference?
... how to navigate the trade-space between computational, transmission
and infrastructure costs versus uncertainty (a.k.a. “statistical costs”) in
the estimates or inferences that are ultimately produced.
In other words, can we integrate the design of data systems with the design of
statistical methodology to balance the various tradeoffs in these costs?
Can this be formulated as a well-specified optimization problem?

What are the costs?
1 Computational
Number of operations, memory, time, etc.
2 Statistical
variance, prediction error, etc.
3 Transmission/Data Movement
bandwidth, latency, money, privacy, etc.
4 System Infrastructure/Design
data storage, types of connections, compute resources, etc.
5 . . .

From the Software/System Architect’s Perspective
In designing a data system, architects consider the infrastructural costs and how
the design of the data system affects how data can be manipulated and moved
throughout the system:
how to stage data across
servers?
where to build connections,
and how fast do they need
to be?
how to deploy compute
resources?
which services on which
machines?
privacy?

From the Statistician’s/Data Scientist’s Perspective
In designing a statistical analysis, statisticians/data scientists are familiar with the
ideas of balancing the tradeoffs between the quality of a statistical analysis and
the computational costs of that analysis.
how much data, which data, where to move data?
which methodology?
what are the tradeoffs in efficiency of estimators/quality of inference
(uncertainty)?
Statistical analyses of distributed data depend on how data can be accessed,
computational resources, etc. (i.e., the design of the data system).

A Theory of Data Systems
The simultaneous optimization of the data system architecture and the statistical
methodology, balancing the tradeoffs in costs, for a given data analysis objective.
In theory, in order to do this we need to:
1 be able to quantify all of the various costs of performing data analysis in a
distributed setting
Many of the costs are very difficult to quantify
2 solve a highly complex, constrained, multi-objective optimization problem
competing objectives
3 choose a solution with costs we are willing to accept from a set of Pareto
optimal solutions
i.e., “choose your battles”

Illustration with a Toy Example

Data System Setup
J servers, each with $N_j$ observations ($j = 1, \dots, J$)
Assume only the user has computational resources
Cost to access the $j$th server is $a_j$, and the cost to move a data value from server $j$ to
the user is $b_j$
$n_j$ is the number of observations downloaded from server $j$ to the user

Data Analysis Objective
The statistical objective is to perform inference on the population mean from data
distributed across J servers, with the following statistical properties
Let $Y_{ij}$ be the $i$th observation on server $j$; assume $E(Y_{ij}) = \mu$, $\mathrm{Var}(Y_{ij}) = 1$
Correlation between two observations on the same server is φ
Correlation between an observation on server $j$ and one on server $k$ is $\rho^{|j-k|}$
φ and ρ are assumed known
Goal is to perform inference on µ using the sample mean
\[
\bar{Y}_n = \Big( \sum_{j=1}^{J} n_j \Big)^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} Y_{ij}
\]
computed from $n = \{n_1, \dots, n_J\}$ observations as the estimator.

The Costs
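1 Statistical Cost (squared error loss -> minimize variance):
\[
C_{st}(n) = \mathrm{Var}(\bar{Y}_n) = N_n^{-2} \sum_{j=1}^{J} \Big[ n_j + \phi (n_j^2 - n_j) + \sum_{k \neq j} n_j n_k \rho^{|j-k|} \Big]
\]
where $N_n = \sum_{j=1}^{J} n_j$ is the total number of downloaded observations.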
Given (assumed known) φ and ρ, the statistical cost depends only on the
amount of data downloaded from each server.
2 Infrastructure/Design Cost:
\[
C_{ds}(a, b) = \sum_{j=1}^{J} \big( a_j^{-1} + b_j^{-0.5} \big)
\]
Meant to penalize small $a_j$ and $b_j$ (i.e., it is expensive to build a faster
connection)
Idea is that more resources should be allocated to servers where we need to
download more data.
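
The statistical and design costs above can be evaluated directly. The following is a minimal Python sketch, not the authors' implementation (the talk reports using R). The function names `statistical_cost` and `design_cost` are hypothetical, and the sketch assumes that the design-cost sum runs over both terms and that the variance is normalized by the squared total number of downloaded observations.

```python
def statistical_cost(n, phi, rho):
    """Variance of the pooled sample mean for download counts n = [n_1, ..., n_J].

    Assumes unit variances, within-server correlation phi and
    between-server correlation rho ** |j - k|.
    """
    total = sum(n)
    if total == 0:
        raise ValueError("at least one observation must be downloaded")
    acc = 0.0
    for j, nj in enumerate(n):
        # n_j variances plus n_j * (n_j - 1) within-server covariances
        acc += nj + phi * (nj * nj - nj)
        # covariances with observations on every other server
        for k, nk in enumerate(n):
            if k != j:
                acc += nj * nk * rho ** abs(j - k)
    return acc / total ** 2


def design_cost(a, b):
    """Infrastructure cost: grows as access costs a_j or per-value costs b_j shrink."""
    return sum(1.0 / aj + bj ** -0.5 for aj, bj in zip(a, b))


if __name__ == "__main__":
    print(statistical_cost([100] * 5, phi=0.5, rho=0.1))
    print(design_cost([50.0] * 5, [20.0] * 5))
```

With φ = 0.5, ρ = 0.1 and all 100 observations downloaded from each of five servers, `statistical_cost` evaluates to roughly 0.136.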

The Costs
3 Data Movement & Computation Cost:
Define data movement costs for $n = \{n_1, \dots, n_J\}$ observations as
\[
\sum_{j=1}^{J} \big( a_j I(n_j > 0) + b_j n_j \big)
\]
Computational complexity is $O\big( \sum_{j=1}^{J} n_j \big)$
Combine both into a cost function for data movement and computation.
\[
C_c(a, b, n) = \sum_{j=1}^{J} \big( a_j I(n_j > 0) + b_j n_j \big) + \sum_{j=1}^{J} n_j
\]
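
As a rough illustration of the combined cost, here is a short Python sketch. The function name `movement_and_compute_cost` is hypothetical, and the computation term simply counts downloaded observations, which is one way to read the O(·) statement above.

```python
def movement_and_compute_cost(a, b, n):
    """Data movement plus computation cost for download counts n.

    a[j] is charged once if server j is accessed at all (the indicator
    I(n_j > 0)); b[j] is charged per downloaded value; the computation
    term is the total number of downloaded observations.
    """
    movement = sum((aj if nj > 0 else 0.0) + bj * nj
                   for aj, bj, nj in zip(a, b, n))
    compute = sum(n)
    return movement + compute


if __name__ == "__main__":
    # Five servers, unit access and per-value costs, everything downloaded.
    print(movement_and_compute_cost([1.0] * 5, [1.0] * 5, [100] * 5))
```

Skipping a server entirely (n_j = 0) avoids its access cost a_j, which is what makes the choice of which servers to touch part of the design.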

Multiobjective Optimization
The optimal distributed analysis for the toy example is a solution that jointly
minimizes the costs associated with the statistical analysis and the data system
infrastructure.
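\[
\begin{aligned}
\underset{n,\,a,\,b}{\text{minimize}} \quad & \big( C_{ds}(a, b),\; C_{st}(n),\; C_c(a, b, n) \big) \\
\text{subject to} \quad & a_j \in (c, d), \\
& b_j \in (e, f), \\
& n_j \in \mathbb{N}, \\
& n_j \le N_j
\end{aligned}
\]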
For the toy example, this optimization is feasible.
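
To make the idea of a set of non-dominated solutions concrete, the sketch below evaluates the three toy costs on randomly drawn designs and keeps those that no other design dominates. This is a plain random search with a brute-force dominance filter, a stand-in chosen for brevity and not the optimizer used in the talk. All function names, the default parameter values and the sampling scheme are assumptions for illustration.

```python
import random


def toy_costs(a, b, n, phi=0.5, rho=0.1):
    """Return (C_ds, C_st, C_c) for one candidate design."""
    total = sum(n)
    c_ds = sum(1.0 / aj + bj ** -0.5 for aj, bj in zip(a, b))
    var_sum = 0.0
    for j, nj in enumerate(n):
        var_sum += nj + phi * (nj * nj - nj)
        var_sum += sum(nj * nk * rho ** abs(j - k)
                       for k, nk in enumerate(n) if k != j)
    c_st = var_sum / total ** 2
    c_c = sum((aj if nj > 0 else 0.0) + bj * nj
              for aj, bj, nj in zip(a, b, n)) + total
    return (c_ds, c_st, c_c)


def dominates(u, v):
    """True if u is no worse than v in every cost and strictly better in one."""
    return (all(x <= y for x, y in zip(u, v))
            and any(x < y for x, y in zip(u, v)))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def random_designs(num, J=5, N=100, seed=0):
    """Draw random (a, b, n) candidates within assumed bounds."""
    rng = random.Random(seed)
    designs = []
    while len(designs) < num:
        n = [rng.randint(0, N) for _ in range(J)]
        if sum(n) == 0:
            continue  # the sample mean is undefined with no data
        a = [rng.uniform(1, 50) for _ in range(J)]
        b = [rng.uniform(1, 20) for _ in range(J)]
        designs.append((a, b, n))
    return designs


if __name__ == "__main__":
    evaluated = [toy_costs(a, b, n) for a, b, n in random_designs(500)]
    front = pareto_front(evaluated)
    print(len(front), "non-dominated designs out of", len(evaluated))
```

Random search only approximates the true front; it is used here because the dominance filter is the part worth seeing, and it works unchanged on the output of any optimizer.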

The Pareto Front
Let φ = 0.5, ρ = 0.1, $N_j = 100$, $J = 5$, $a_j \in (1, 50)$, $b_j \in (1, 20)$. Using the R
package nloptr:

“Choosing your Battles”
Suppose we wish to keep computational/data movement costs low (e.g. < 2000).
High statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.13$)
trades off against an expensive data system design ($C_{ds} = 5$)

“Choosing your Battles”
Suppose we wish to keep computational/data movement costs low (e.g. < 2000).
Cheap data system design (e.g. $C_{ds} < 2$)
trades off against reduced statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.14$)

Effect of the Statistical Properties of the Data
Let φ = 0.5, ρ = 0.4; recall that the correlation between observations on servers $j$ and $k$ is $\rho^{|j-k|}$.
It is more efficient to sample from servers far away from each other
More resources are then
focused on these servers
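
A small numerical check of this effect, as a sketch: the two allocations below (50 observations from each of two servers, either adjacent or at opposite ends of five) are hypothetical choices made only to isolate the role of server distance, and `variance_of_mean` is an illustrative name.

```python
def variance_of_mean(n, phi, rho):
    """Var of the pooled sample mean under the toy correlation model."""
    total = sum(n)
    within = sum(nj + phi * (nj * nj - nj) for nj in n)
    between = sum(n[j] * n[k] * rho ** abs(j - k)
                  for j in range(len(n))
                  for k in range(len(n)) if j != k)
    return (within + between) / total ** 2


# Same total sample size, different choice of servers.
adjacent = variance_of_mean([50, 50, 0, 0, 0], phi=0.5, rho=0.4)
far_apart = variance_of_mean([50, 0, 0, 0, 50], phi=0.5, rho=0.4)

if __name__ == "__main__":
    print("adjacent servers: ", adjacent)
    print("far-apart servers:", far_apart)
```

The far-apart allocation gives the smaller variance (about 0.27 versus about 0.46), because the between-server correlation decays as $\rho^{|j-k|}$.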

Alternative Formulations of the Optimization Problem
Knowledge/decisions about the acceptable tradeoffs in costs can reduce the
optimization problem.
If all costs are equally important, combine the costs into a single objective
function
There are multiple subproblems given prior decisions on acceptable costs.
For example, in the toy example set $C_c < 2000$ as a constraint rather than
including the cost in the objective function
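
The two reductions described above can be sketched on a list of already-evaluated designs. In this minimal Python illustration the cost triples are made-up values, not results from the talk, and the function names are hypothetical. Each triple is ordered as (C_ds, C_st, C_c).

```python
def weighted_sum(costs, weights=(1.0, 1.0, 1.0)):
    """Collapse a cost triple into one number; equal weights by default."""
    return sum(w * c for w, c in zip(weights, costs))


def best_by_weighted_sum(candidates, weights=(1.0, 1.0, 1.0)):
    """Single-objective version: minimize the weighted sum of all costs."""
    return min(candidates, key=lambda c: weighted_sum(c, weights))


def best_under_budget(candidates, budget):
    """Constrained version: require C_c < budget, then minimize C_ds + C_st."""
    feasible = [c for c in candidates if c[2] < budget]
    if not feasible:
        return None  # no design meets the budget
    return min(feasible, key=lambda c: c[0] + c[1])


# Made-up (C_ds, C_st, C_c) triples, for illustration only.
candidates = [(4.0, 0.10, 1500.0), (2.0, 0.20, 1800.0), (1.0, 0.30, 2500.0)]

if __name__ == "__main__":
    print(best_by_weighted_sum(candidates))
    print(best_under_budget(candidates, budget=2000.0))
```

With equal weights the sum is driven almost entirely by the largest-scale cost (here C_c), so in practice the weights or a rescaling would need to reflect what "equally important" means for costs measured in different units.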

The Difference between Theory and Practice
“Good Enough” Solution
In practice, it may only be realistic to obtain a solution which achieves
“acceptable” costs.
The statistical design may depend on unknown properties of the data (e.g.
unknown φ and ρ)
Not feasible to build the data system at every iteration of an optimization
algorithm
Quantifying statistical performance may in itself be computationally
demanding, and/or require data movement
Multiple analysis objectives

The Difference between Theory and Practice
“Good Enough” Solution
In practice, one might iterate between the data system design and the statistical
design to find a “good enough” solution.
1 Start with a distributed data system design, learn about the data
2 Given preliminary knowledge of statistical properties of the data, update the
data system architecture
3 Given the new data system architecture, update the statistical design
4 . . .
DAWN provides a potential framework to simulate this procedure.

A Few Next Steps
For the toy example:
Allow connections between servers and distributed computation
Analysis objectives beyond inference about the mean
Incorporation of more realistic costs/constraints
Maximum likelihood estimation of the parameters of a covariance function
Multilayer networks as a framework for organizing the optimization problem

Closing Thoughts
To continue scaling data analyses to the ever-growing size of data, we
need to be able to exploit distributed data system architectures.
This requires understanding and accounting for the tradeoffs between the costs
associated with distributed data analysis and inferential quality, for both data
system design and the design of the data analysis.
A Theory of Data Systems requires collaboration between statisticians, computer
scientists, data system architects, software engineers, and more.
Understanding realistic costs for all aspects of distributed data analysis
requires expert knowledge in each area.

Thank You!
