A Notional Framework for a Theory of Data Systems
Maggie Johnson
Joint with members of the ToDS subgroup
of the SAMSI CLIM Remote Sensing Working Group
Workshop on Remote Sensing, Uncertainty Quantification,
and a Theory of Data Systems
February 12, 2018
M. Johnson Remote Sensing Workshop February 12, 2018 1 / 22
Motivation
Motivation for this workshop:
... data must be brought together in some way ... but moving data to a
central location for analysis is tedious at best and impossible at worst.
Some (remote) data reduction is almost certainly necessary, but how
much? What are the consequences for inference?
... how to navigate the trade-space between computational, transmission
and infrastructure costs versus uncertainty (a.k.a. “statistical costs”) in
the estimates or inferences that are ultimately produced.
In other words, can we integrate the design of data systems with the design of
statistical methodology to balance the various tradeoffs in these costs?
Can this be formulated as a well-specified optimization problem?
What are the costs?
1 Computational
Number of operations, memory, time, etc.
2 Statistical
variance, prediction error, etc.
3 Transmission/Data Movement
bandwidth, latency, money, privacy, etc.
4 System Infrastructure/Design
data storage, types of connections, compute resources, etc.
5 . . .
From the Software/System Architect’s Perspective
In designing a data system, architects consider the infrastructural costs and how
the design of the data system affects how data can be manipulated and moved
throughout the system:
how to stage data across servers?
where to build connections, and how fast do they need to be?
how to deploy compute resources?
which services on which machines?
privacy?
From the Statistician’s/Data Scientist’s Perspective
In designing a statistical analysis, statisticians/data scientists are familiar with the
ideas of balancing the tradeoffs between the quality of a statistical analysis and
the computational costs of that analysis.
how much data, which data, where to move data?
which methodology?
what are the tradeoffs in efficiency of estimators/quality of inference
(uncertainty)?
Statistical analyses of distributed data depend on how data can be accessed,
computational resources, etc. (i.e., the design of the data system).
A Theory of Data Systems
The simultaneous optimization of the data system architecture and the statistical
methodology balancing the tradeoffs in costs, for a given data analysis objective.
In theory, in order to do this we need to:
1 be able to quantify all of the various costs of performing data analysis in a
distributed setting
Many of the costs are very difficult to quantify
2 solve a highly complex, constrained, multi-objective optimization problem
competing objectives
3 choose a solution with costs we are willing to accept from a set of Pareto
optimal solutions
i.e., “choose your battles”
Illustration with a Toy Example
Data System Setup
$J$ servers, each with $N_j$ observations ($j = 1, \ldots, J$)
Assume only the user has computational resources
Cost to access the $j$th server is $a_j$, and the cost to move a data value from server $j$ to the user is $b_j$
$n_j$ is the number of observations downloaded from server $j$ to the user
Data Analysis Objective
The statistical objective is to perform inference on the population mean from data
distributed across J servers, with the following statistical properties
Let $Y_{ij}$ be the $i$th observation on server $j$; assume $E(Y_{ij}) = \mu$ and $\mathrm{Var}(Y_{ij}) = 1$
Correlation between two observations on the same server is $\phi$
Correlation between an observation on server $j$ and an observation on server $k$ is $\rho^{|j-k|}$
$\phi$ and $\rho$ are assumed known
Goal is to perform inference on $\mu$ using the sample mean
\[
\bar{Y}_n = \Big( \sum_{j=1}^{J} n_j \Big)^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} Y_{ij}
\]
computed from $n = \{n_1, \ldots, n_J\}$ observations as the estimator.
The Costs
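1 Statistical Cost (squared error loss → minimize variance):
\[
C_{st}(n) = \mathrm{Var}(\bar{Y}_n) = N_n^{-2} \sum_{j=1}^{J} \Big[ n_j + \phi (n_j^2 - n_j) + \sum_{k \neq j} n_j n_k \rho^{|j-k|} \Big]
\]
where $N_n = \sum_{j=1}^{J} n_j$ is the total number of downloaded observations.
Given (assumed known) $\phi$ and $\rho$, the statistical cost depends only on the
amount of data downloaded from each server.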
2 Infrastructure/Design Cost:
\[
C_{ds}(a, b) = \sum_{j=1}^{J} \big( a_j^{-1} + b_j^{-0.5} \big)
\]
Meant to penalize small $a_j$ and $b_j$ (i.e., it is expensive to build a faster
connection)
Idea is that more resources should be allocated to servers where we need to
download more data.
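The two cost functions on this slide can be written out directly. The following is a minimal Python sketch, assuming NumPy is available; the function names `statistical_cost` and `design_cost` are illustrative and do not come from the slides.

```python
import numpy as np

def statistical_cost(n, phi, rho):
    """Cst(n): variance of the pooled sample mean for unit-variance observations."""
    n = np.asarray(n, dtype=float)
    total = n.sum()
    if total == 0:
        raise ValueError("at least one observation must be downloaded")
    idx = np.arange(len(n))
    # Between-server correlations rho^{|j-k|}; the diagonal is zeroed because
    # within-server correlation is handled separately through phi.
    between = rho ** np.abs(idx[:, None] - idx[None, :])
    np.fill_diagonal(between, 0.0)
    within = n + phi * (n**2 - n)
    return float((within.sum() + n @ between @ n) / total**2)

def design_cost(a, b):
    """Cds(a, b) = sum_j (1 / a_j + 1 / sqrt(b_j))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(1.0 / a + 1.0 / np.sqrt(b)))
```

With phi = 0 and rho = 0 the statistical cost reduces to one over the total sample size, which is a convenient sanity check on the formula.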
The Costs
3 Data Movement & Computation Cost:
Define data movement costs for $n = \{n_1, \ldots, n_J\}$ observations as
\[
\sum_{j=1}^{J} \big( a_j I(n_j > 0) + b_j n_j \big)
\]
Computational complexity is $O\big( \sum_{j=1}^{J} n_j \big)$
Combine both into a cost function for data movement and computation.
\[
C_c(a, b, n) = \sum_{j=1}^{J} \big( a_j I(n_j > 0) + b_j n_j \big) + \sum_{j=1}^{J} n_j
\]
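The combined movement and computation cost is a short sum over servers. Here is a small Python sketch, assuming NumPy; the name `movement_cost` is hypothetical.

```python
import numpy as np

def movement_cost(a, b, n):
    """Cc(a, b, n): access cost + per-value transfer cost + computation cost."""
    a, b, n = (np.asarray(x, dtype=float) for x in (a, b, n))
    access = np.sum(a * (n > 0))   # a_j is paid only if server j is used
    transfer = np.sum(b * n)       # b_j is paid per downloaded value
    compute = np.sum(n)            # linear-time computation of the mean
    return float(access + transfer + compute)
```

For example, `movement_cost([2, 3], [1, 4], [10, 0])` pays the access cost only for the first server, since nothing is downloaded from the second.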
Multiobjective Optimization
The optimal distributed analysis for the toy example is a solution that jointly
minimizes the costs associated with the statistical analysis and the data system
infrastructure.
\[
\begin{aligned}
\min_{n, a, b} \quad & \big( C_{ds}(a, b),\; C_{st}(n),\; C_c(a, b, n) \big) \\
\text{subject to} \quad & a_j \in (c, d), \\
& b_j \in (e, f), \\
& n_j \in \mathbb{N}, \\
& n_j \le N_j
\end{aligned}
\]
For the toy example, this optimization is feasible.
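One way to get a feel for this problem is to approximate the set of non-dominated solutions by brute force. The sketch below is only an illustration under stated assumptions: it uses plain random search in Python with NumPy rather than the R package nloptr used in the talk, it treats n_j = 0 as allowed, and the function names and default settings are hypothetical.

```python
import numpy as np

def costs(a, b, n, phi=0.5, rho=0.1):
    """Return the objective vector (Cds, Cst, Cc) for one candidate design."""
    idx = np.arange(len(n))
    between = rho ** np.abs(idx[:, None] - idx[None, :])
    np.fill_diagonal(between, 0.0)
    c_st = (np.sum(n + phi * (n**2 - n)) + n @ between @ n) / n.sum() ** 2
    c_ds = np.sum(1.0 / a + 1.0 / np.sqrt(b))
    c_c = np.sum(a * (n > 0) + b * n) + n.sum()
    return np.array([c_ds, c_st, c_c])

def pareto_mask(points):
    """Boolean mask of non-dominated rows when every objective is minimized."""
    keep = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        # p is dominated if some row is no worse everywhere and better somewhere.
        dominated_by = np.all(points <= p, axis=1) & np.any(points < p, axis=1)
        keep[i] = not dominated_by.any()
    return keep

def random_pareto_front(J=5, N_j=100, a_rng=(1, 50), b_rng=(1, 20),
                        draws=2000, seed=0):
    """Sample feasible designs at random and keep the non-dominated ones."""
    rng = np.random.default_rng(seed)
    evaluated = []
    for _ in range(draws):
        a = rng.uniform(*a_rng, size=J)
        b = rng.uniform(*b_rng, size=J)
        n = rng.integers(0, N_j + 1, size=J).astype(float)
        if n.sum() == 0:
            continue  # the sample mean is undefined with no data
        evaluated.append(costs(a, b, n))
    evaluated = np.array(evaluated)
    return evaluated[pareto_mask(evaluated)]
```

Random search scales poorly with J and is shown only because it makes the definition of Pareto optimality concrete; a dedicated multi-objective solver would be the practical choice.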
The Pareto Front
Let $\phi = 0.5$, $\rho = 0.1$, $N_j = 100$, $J = 5$, $a_j \in (1, 50)$, $b_j \in (1, 20)$. Using the R
package nloptr:
“Choosing your Battles”
Suppose we wish to keep computational/data movement costs low (e.g. < 2000).
High statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.13$)
Trades off against an expensive data system design ($C_{ds} = 5$)
“Choosing your Battles”
Suppose we wish to keep computational/data movement costs low (e.g. < 2000).
Cheap data system design (e.g. $C_{ds} < 2$)
Trades off against reduced statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.14$)
Effect of the Statistical Properties of the Data
Let $\phi = 0.5$, $\rho = 0.4$; recall that the correlation between two servers is $\rho^{|j-k|}$.
It is more efficient to sample from servers far away from each other
More resources are then focused on these servers
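As a quick numerical check of this claim, the sketch below evaluates the toy example's variance formula for two hypothetical download plans of 100 observations each: one split across two adjacent servers, the other split across the two servers farthest apart. The plans and the function name `mean_variance` are illustrative choices, not taken from the slides.

```python
import numpy as np

def mean_variance(n, phi, rho):
    """Variance of the pooled sample mean under the toy correlation model."""
    n = np.asarray(n, dtype=float)
    idx = np.arange(len(n))
    between = rho ** np.abs(idx[:, None] - idx[None, :])
    np.fill_diagonal(between, 0.0)  # within-server terms use phi instead
    within = np.sum(n + phi * (n**2 - n))
    return float((within + n @ between @ n) / n.sum() ** 2)

# Same total sample size, different choice of servers.
adjacent = mean_variance([50, 50, 0, 0, 0], phi=0.5, rho=0.4)
far_apart = mean_variance([50, 0, 0, 0, 50], phi=0.5, rho=0.4)
print(adjacent, far_apart)  # roughly 0.455 versus 0.268
```

The only difference between the two plans is the cross-server term, which shrinks from a factor of 0.4 to 0.4^4 when the two servers are four steps apart.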
Alternative Formulations of the Optimization Problem
Knowledge/decisions about the acceptable tradeoffs in costs can reduce the
optimization problem.
If all costs are equally important, combine the costs into a single objective
function
There are multiple subproblems given prior decisions on acceptable costs.
For example, in the toy example, set $C_c < 2000$ as a constraint rather than
including the cost in the objective function
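For the toy example, these two reductions might be written as follows. This is a sketch only: the weights $w_1, w_2, w_3$ are an assumption introduced here to put the three costs on a common scale (equal weights correspond to "equally important"), and both problems keep the original constraints on $a_j$, $b_j$, and $n_j$.

```latex
% Single-objective (scalarized) form; the weights w_1, w_2, w_3 > 0 are illustrative
\min_{n, a, b} \; w_1 C_{ds}(a, b) + w_2 C_{st}(n) + w_3 C_c(a, b, n)

% Constrained form: C_c moves from the objective into the constraints
\min_{n, a, b} \; \big( C_{ds}(a, b),\; C_{st}(n) \big)
\quad \text{subject to} \quad C_c(a, b, n) < 2000
```

The constrained form still has two competing objectives, so it yields a lower-dimensional Pareto front rather than a single solution.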
The Difference between Theory and Practice
“Good Enough” Solution
In practice, it may only be realistic to obtain a solution which achieves
“acceptable” costs.
The statistical design may depend on unknown properties of the data (e.g.
unknown φ and ρ)
Not feasible to build the data system at every iteration of an optimization
algorithm
Quantifying statistical performance may in itself be computationally
demanding, and/or require data movement
Multiple analysis objectives
The Difference between Theory and Practice
“Good Enough” Solution
In practice, one might iterate between the data system design and the statistical
design, to find a “good enough” solution.
1 Start with a distributed data system design, learn about the data
2 Given preliminary knowledge of statistical properties of the data, update the
data system architecture
3 Given the new data system architecture, update the statistical design
4 . . .
DAWN provides a potential framework to simulate this procedure.
A Few Next Steps
For the toy example:
Allow connections between servers and distributed computation
Analysis objectives beyond inference about the mean
Incorporation of more realistic costs/constraints
Maximum likelihood estimation of the parameters of a covariance function
Multilayer networks as a framework for organizing the optimization problem
Closing Thoughts
To continue to scale data analyses to the ever-growing size of data, we
need to be able to exploit distributed data system architectures.
Requires understanding and accounting for the tradeoffs in the costs
associated with distributed data analysis and inferential quality for both data
system design and the design of the data analysis.
A Theory of Data Systems requires collaboration between statisticians, computer
scientists, data system architects, software engineers, and more.
Understanding realistic costs for all aspects of distributed data analysis
requires expert knowledge in each area.
Thank You!

Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 

CLIM Program: Remote Sensing Workshop, A Notional Framework for a Theory of Data Systems - Maggie Johnson, Feb 12, 2018

  • 1. A Notional Framework for a Theory of Data Systems Maggie Johnson Joint with members of the ToDS subgroup of the SAMSI CLIM Remote Sensing Working Group Workshop on Remote Sensing, Uncertainty Quantification, and a Theory of Data Systems February 12, 2018 M. Johnson Remote Sensing Workshop February 12, 2018 1 / 22
  • 2. Motivation Motivation for this workshop: ... data must be brought together in some way ... but moving data to a central location for analysis is tedious at best and impossible at worst. Some (remote) data reduction is almost certainly necessary, but how much? What are the consequences for inference? ... how to navigate the trade-space between computational, transmission and infrastructure costs versus uncertainty (a.k.a. “statistical costs”) in the estimates or inferences that are ultimately produced. In other words, can we integrate the design of data systems with the design of statistical methodology to balance the various tradeoffs in these costs? Can this be formulated as a well-specified optimization problem?
  • 3. What are the costs? (1) Computational: number of operations, memory, time, etc. (2) Statistical: variance, prediction error, etc. (3) Transmission/Data Movement: bandwidth, latency, money, privacy, etc. (4) System Infrastructure/Design: data storage, types of connections, compute resources, etc. (5) ...
  • 4. From the Software/System Architect's Perspective In designing a data system, architects consider the infrastructural costs and how the design of the data system affects how data can be manipulated and moved throughout the system: How to stage data across servers? Where to build connections, and how fast do they need to be? How to deploy compute resources? Which services on which machines? Privacy?
  • 5. From the Statistician’s/Data Scientist’s Perspective In designing a statistical analysis, statisticians/data scientists are familiar with balancing the tradeoffs between the quality of a statistical analysis and the computational costs of that analysis: How much data, which data, and where to move data? Which methodology? What are the tradeoffs in efficiency of estimators/quality of inference (uncertainty)? Statistical analyses of distributed data depend on how data can be accessed, computational resources, etc. (i.e. the design of the data system).
  • 6. A Theory of Data Systems The simultaneous optimization of the data system architecture and the statistical methodology, balancing the tradeoffs in costs, for a given data analysis objective. In theory, in order to do this we need to: (1) be able to quantify all of the various costs of performing data analysis in a distributed setting (many of the costs are very difficult to quantify); (2) solve a highly complex, constrained, multi-objective optimization problem with competing objectives; (3) choose a solution with costs we are willing to accept from a set of Pareto optimal solutions, i.e., “choose your battles”.
  • 7. Illustration with a Toy Example
  • 8. Data System Setup $J$ servers, each with $N_j$ observations ($j = 1, \dots, J$). Assume only the user has computational resources. The cost to access the $j$th server is $a_j$, and the cost to move a data value from server $j$ to the user is $b_j$. $n_j$ is the number of observations downloaded from server $j$ to the user.
  • 9. Data Analysis Objective The statistical objective is to perform inference on the population mean from data distributed across $J$ servers, with the following statistical properties. Let $Y_{ij}$ be the $i$th observation on server $j$, and assume $E(Y_{ij}) = \mu$ and $\mathrm{Var}(Y_{ij}) = 1$. The correlation between two observations on the same server is $\phi$. The correlation between an observation on server $j$ and an observation on server $k$ is $\rho^{|j-k|}$. $\phi$ and $\rho$ are assumed known. The goal is to perform inference on $\mu$ using the sample mean $\bar{Y}_n = \left(\sum_{j=1}^{J} n_j\right)^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} Y_{ij}$, computed from $n = \{n_1, \dots, n_J\}$ observations, as the estimator.
  • 10. The Costs (1) Statistical Cost (squared error loss, so minimize variance): $C_{st}(n) = \mathrm{Var}(\bar{Y}_n) = N_n^{-2} \sum_{j=1}^{J} \left[ n_j + \phi (n_j^2 - n_j) + \sum_{k \neq j} n_j n_k \rho^{|j-k|} \right]$, where $N_n = \sum_{j=1}^{J} n_j$. Given (assumed known) $\phi$ and $\rho$, the statistical cost depends only on the amount of data downloaded from each server. (2) Infrastructure/Design Cost: $C_{ds}(a, b) = \sum_{j=1}^{J} \left( a_j^{-1} + b_j^{-0.5} \right)$. This is meant to penalize small $a_j$ and $b_j$ (i.e. it is expensive to build a faster connection). The idea is that more resources should be allocated to servers from which we need to download more data.
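The two cost functions on this slide can be evaluated directly. The following is a minimal Python sketch, not code from the talk; the function names `statistical_cost` and `design_cost` are illustrative, and the design cost assumes the sum runs over both terms $a_j^{-1}$ and $b_j^{-0.5}$.

```python
def statistical_cost(n, phi, rho):
    """Variance of the sample mean given n[j] downloads from server j.

    Assumes unit variances, within-server correlation phi, and
    between-server correlation rho ** |j - k|, as in the toy example.
    """
    total = sum(n)
    if total <= 0:
        raise ValueError("at least one observation must be downloaded")
    acc = 0.0
    for j in range(len(n)):
        # n_j variance terms plus n_j (n_j - 1) within-server covariances
        acc += n[j] + phi * (n[j] ** 2 - n[j])
        # covariances with observations on every other server
        for k in range(len(n)):
            if k != j:
                acc += n[j] * n[k] * rho ** abs(j - k)
    return acc / total ** 2


def design_cost(a, b):
    """Infrastructure cost: small a_j or b_j (fast connections) is expensive."""
    return sum(1.0 / aj + bj ** -0.5 for aj, bj in zip(a, b))


if __name__ == "__main__":
    # Values from the toy example: phi = 0.5, rho = 0.1, J = 5
    print(statistical_cost([20, 20, 20, 20, 20], phi=0.5, rho=0.1))
    print(design_cost([50.0] * 5, [20.0] * 5))
```

A single observation has variance 1, and two observations on the same server with $\phi = 0.5$ give $(2 + 0.5 \cdot 2)/4 = 0.75$, which is a quick sanity check on the formula.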
  • 11. The Costs (3) Data Movement & Computation Cost: Define the data movement cost for $n = \{n_1, \dots, n_J\}$ observations as $\sum_{j=1}^{J} \left( a_j I(n_j > 0) + b_j n_j \right)$. The computational complexity is $O\left(\sum_{j=1}^{J} n_j\right)$. Combine both into a cost function for data movement and computation: $C_c(a, b, n) = \sum_{j=1}^{J} \left( a_j I(n_j > 0) + b_j n_j \right) + \sum_{j=1}^{J} n_j$.
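As a small illustration of the combined cost $C_c$, here is a Python sketch (the function name `movement_and_compute_cost` is hypothetical, not from the talk). The access cost $a_j$ is paid once for each server that is actually used, and $b_j$ is paid per downloaded value.

```python
def movement_and_compute_cost(a, b, n):
    """C_c(a, b, n): data movement cost plus a linear computation term."""
    # a_j is charged only when server j is accessed (n_j > 0)
    movement = sum(aj * (nj > 0) + bj * nj for aj, bj, nj in zip(a, b, n))
    # computing a sample mean is linear in the number of observations
    compute = sum(n)
    return movement + compute


if __name__ == "__main__":
    # Two servers; only the first is accessed, with 10 downloaded values.
    print(movement_and_compute_cost(a=[2.0, 3.0], b=[1.0, 1.0], n=[10, 0]))
```

In this usage example the unused second server contributes nothing, so the total is the access cost 2, the transfer cost 10, and the computation term 10.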
  • 12. Multiobjective Optimization The optimal distributed analysis for the toy example is a solution which jointly minimizes the costs associated with the statistical analysis and the data system infrastructure: $\min_{n, a, b} \; \left( C_{ds}(a, b),\, C_{st}(n),\, C_c(a, b, n) \right)$ subject to $a_j \in (c, d)$, $b_j \in (e, f)$, $n_j \in \mathbb{N}$, $n_j \le N_j$. For the toy example, this optimization is feasible.
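To make the notion of a set of non-dominated solutions concrete, the sketch below enumerates a deliberately tiny version of the problem and filters out dominated designs. It is an illustration under stated assumptions, not the method used in the talk (which reports using the R package nloptr): the problem is shrunk to $J = 2$ servers with $N_j = 3$, the continuous ranges for $a_j$ and $b_j$ are replaced by two hypothetical grid values each, and all function names are illustrative.

```python
from itertools import product


def statistical_cost(n, phi, rho):
    """Variance of the sample mean (toy-example covariance structure)."""
    acc = 0.0
    for j in range(len(n)):
        acc += n[j] + phi * (n[j] ** 2 - n[j])
        acc += sum(n[j] * n[k] * rho ** abs(j - k)
                   for k in range(len(n)) if k != j)
    return acc / sum(n) ** 2


def design_cost(a, b):
    return sum(1.0 / aj + bj ** -0.5 for aj, bj in zip(a, b))


def movement_and_compute_cost(a, b, n):
    return sum(aj * (nj > 0) + bj * nj for aj, bj, nj in zip(a, b, n)) + sum(n)


def dominates(u, v):
    """True if cost vector u is no worse than v everywhere and better somewhere."""
    return all(x <= y for x, y in zip(u, v)) and any(x < y for x, y in zip(u, v))


def enumerate_designs(J, N, a_grid, b_grid, phi, rho):
    """All (n, a, b) combinations on the grid, each with its three costs."""
    designs = []
    for n in product(range(N + 1), repeat=J):
        if sum(n) == 0:
            continue  # the sample mean needs at least one observation
        for a in product(a_grid, repeat=J):
            for b in product(b_grid, repeat=J):
                costs = (design_cost(a, b),
                         statistical_cost(n, phi, rho),
                         movement_and_compute_cost(a, b, n))
                designs.append(((n, a, b), costs))
    return designs


def pareto_front(designs):
    """Keep the designs whose cost vector is not dominated by any other."""
    return [p for p in designs
            if not any(dominates(q[1], p[1]) for q in designs)]


designs = enumerate_designs(J=2, N=3, a_grid=(5.0, 45.0), b_grid=(2.0, 18.0),
                            phi=0.5, rho=0.1)
front = pareto_front(designs)

if __name__ == "__main__":
    print(len(designs), "candidate designs,", len(front), "non-dominated")
```

Exhaustive enumeration only works because the grid is tiny; for realistic sizes a dedicated multi-objective or constrained solver would be needed.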
  • 13. The Pareto Front Let $\phi = 0.5$, $\rho = 0.1$, $N_j = 100$, $J = 5$, $a_j \in (1, 50)$, $b_j \in (1, 20)$. Using the R package nloptr: [figure: the Pareto front]
  • 14. “Choosing your Battles” Suppose we wish to keep computational/data movement costs low (e.g. $C_c < 2000$). High statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.13$) trades off with an expensive data system design ($C_{ds} = 5$).
  • 15. “Choosing your Battles” Suppose we wish to keep computational/data movement costs low (e.g. $C_c < 2000$). A cheap data system design (e.g. $C_{ds} < 2$) trades off with reduced statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.14$).
  • 16. Effect of the Statistical Properties of the Data Let $\phi = 0.5$, $\rho = 0.4$, and recall that the correlation between two servers is $\rho^{|j-k|}$. It is more efficient to sample from servers far away from each other. More resources are then focused on these servers.
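The claim that distant servers are more efficient to sample can be checked numerically with the variance formula from the toy example. The sketch below is an illustration, not code from the talk; it compares 10 downloads from each of two adjacent servers against 10 downloads from each of the two most distant servers, with $J = 5$, $\phi = 0.5$, $\rho = 0.4$. The allocation of 10 observations per server is an arbitrary choice for the example.

```python
def statistical_cost(n, phi, rho):
    """Variance of the sample mean (toy-example covariance structure)."""
    acc = 0.0
    for j in range(len(n)):
        acc += n[j] + phi * (n[j] ** 2 - n[j])
        acc += sum(n[j] * n[k] * rho ** abs(j - k)
                   for k in range(len(n)) if k != j)
    return acc / sum(n) ** 2


# Same number of downloads, different choice of servers.
adjacent = statistical_cost([10, 10, 0, 0, 0], phi=0.5, rho=0.4)
far_apart = statistical_cost([10, 0, 0, 0, 10], phi=0.5, rho=0.4)

if __name__ == "__main__":
    print("adjacent servers :", adjacent)   # between-server correlation 0.4
    print("distant servers  :", far_apart)  # between-server correlation 0.4 ** 4
```

Both allocations download the same 20 values, so they differ only in the between-server covariance term, which shrinks from $\rho$ to $\rho^4$ when the servers are four steps apart.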
  • 17. Alternative Formulations of the Optimization Problem Knowledge of, or decisions about, the acceptable tradeoffs in costs can reduce the optimization problem. If all costs are equally important, combine the costs into a single objective function. There are multiple subproblems given prior decisions on acceptable costs. For example, in the toy example, set $C_c < 2000$ as a constraint rather than including the cost in the objective function.
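Both reformulations can be sketched on a small enumerated problem. The Python below is illustrative only: it fixes hypothetical connection costs $a_j = 5$ and $b_j = 2$ for $J = 3$ servers with $N_j = 10$, and the budget of 60 is an arbitrary stand-in for the talk's $C_c < 2000$. One function minimizes the statistical cost subject to a budget on $C_c$, the other minimizes the unweighted sum of the three costs.

```python
from itertools import product


def statistical_cost(n, phi, rho):
    acc = 0.0
    for j in range(len(n)):
        acc += n[j] + phi * (n[j] ** 2 - n[j])
        acc += sum(n[j] * n[k] * rho ** abs(j - k)
                   for k in range(len(n)) if k != j)
    return acc / sum(n) ** 2


def design_cost(a, b):
    return sum(1.0 / aj + bj ** -0.5 for aj, bj in zip(a, b))


def movement_and_compute_cost(a, b, n):
    return sum(aj * (nj > 0) + bj * nj for aj, bj, nj in zip(a, b, n)) + sum(n)


def candidates(a, b, N, phi, rho):
    """Every download allocation n with its three costs, for fixed a and b."""
    out = []
    for n in product(range(N + 1), repeat=len(a)):
        if sum(n) == 0:
            continue
        out.append({"n": n,
                    "c_ds": design_cost(a, b),
                    "c_st": statistical_cost(n, phi, rho),
                    "c_c": movement_and_compute_cost(a, b, n)})
    return out


def best_under_budget(cands, budget):
    """Constraint form: minimize C_st subject to C_c < budget."""
    feasible = [c for c in cands if c["c_c"] < budget]
    return min(feasible, key=lambda c: c["c_st"]) if feasible else None


def best_equal_weights(cands):
    """Single-objective form: minimize the unweighted sum of all costs."""
    return min(cands, key=lambda c: c["c_ds"] + c["c_st"] + c["c_c"])


cands = candidates(a=(5.0, 5.0, 5.0), b=(2.0, 2.0, 2.0), N=10, phi=0.5, rho=0.1)

if __name__ == "__main__":
    print(best_under_budget(cands, budget=60))
    print(best_equal_weights(cands))
```

With these made-up numbers the unweighted sum is dominated by $C_c$, so the equal-weights solution downloads a single observation. This shows that combining costs into one objective implicitly depends on the scale of each cost.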
  • 18. The Difference between Theory and Practice: “Good Enough” Solution In practice, it may only be realistic to obtain a solution which achieves “acceptable” costs. The statistical design may depend on unknown properties of the data (e.g. unknown $\phi$ and $\rho$). It is not feasible to build the data system at every iteration of an optimization algorithm. Quantifying statistical performance may in itself be computationally demanding, and/or require data movement. Multiple analysis objectives.
  • 19. The Difference between Theory and Practice: “Good Enough” Solution In practice, one might iterate between the data system design and the statistical design to find a “good enough” solution. (1) Start with a distributed data system design, and learn about the data. (2) Given preliminary knowledge of the statistical properties of the data, update the data system architecture. (3) Given the new data system architecture, update the statistical design. (4) ... DAWN provides a potential framework to simulate this procedure.
  • 20. A Few Next Steps For the toy example: Allow connections between servers and distributed computation Analysis objectives beyond inference about the mean Incorporation of more realistic costs/constraints Maximum likelihood estimation of the parameters of a covariance function Multilayer networks as a framework for organizing the optimization problem
  • 21. Closing Thoughts To continue to scale data analyses to the ever growing massive size of data, we need to be able to exploit distributed data system architecture. Requires understanding and accounting for the tradeoffs in the costs associated with distributed data analysis and inferential quality for both data system design and the design of the data analysis. A Theory of Data Systems requires collaboration between statisticians, computer scientists, data system architects, software engineers, and more. Understanding realistic costs for all aspects of distributed data analysis requires expert knowledge in each area.
  • 22. Thank You!