Program on Mathematical and Statistical Methods for Climate and the Earth System Opening Workshop, Software Architecture Considerations in the Analysis of Highly Distributed Data and Computational Analysis - Daniel Crichton, Aug 22, 2017
Climate Science presents several data intensive challenges that are the intersection of software architecture and data science. This includes developing approaches for scaling the analysis of highly distributed data across institutional and system boundaries. JPL has been developing approaches for quantitatively evaluating software architectures to consider different topologies in the deployment of computing capabilities and methodologies in order to support the analysis of distributed climate data. This talk will cover those approaches and also needed research in new methodologies as remote sensing and climate model output data continue to increase in their size and distribution.
1. Software Architecture Considerations in the Analysis of Highly Distributed Data and Computational Analysis
Dan Crichton
Center for Data Science and Technology
Data Science Program
Earth Science Data Systems and Technology Program
Jet Propulsion Laboratory, Caltech
August 22, 2017
2. Jet Propulsion Laboratory
California Institute of Technology
Introduction
What is Big Data? Why Should We Care?
The CHALLENGE: Big Data
• When needs for data collection, processing,
management and analysis go beyond the capacity and
capability of available methods and software systems
The SOLUTION: Data Science
• Scalable architectural approaches, techniques,
software and algorithms which alter the paradigm by
which data is collected, managed and analyzed
The RELEVANCE:
• Addressing the challenge of Big Data is on the critical
path to accomplishing our NASA science objectives, as
– the size and distribution of science data sets and predictive models continue to burgeon, and
– core science community objectives such as reproducibility
of results are becoming compromised
3. NASA Mission-Science Data Lifecycle for Remote Sensing
[Lifecycle diagram, summarized as three challenge/solution pairs:]
• Agile Science (onboard analysis). Challenge: data collection capacity at the instrument outstrips data transport and data storage capacity. Future solution: onboard computation and data science.
• Extreme Data Volumes (data triage). Challenge: too much data, too fast; data cannot be transported efficiently enough. Future solution: dynamic architectures to scale data processing and triage exascale data streams.
• Distributed Data Analytics. Challenge: data distributed in massive archives; many different types of measurements. Future solution: distributed data analytics; uncertainty quantification.
Preparing for exascale computing: SMAP (today): 485 GB/day; NI-SAR (2020): 86 TB/day.
5. NASA Science and Big Data Today
[Diagram: EOSDIS DAACs connected by the comm network focus on generating, capturing, and managing big data; a big data infrastructure (data, algorithms, machines) focuses on using and analyzing big data. How do these connect?]
7. Considerations
• The storage, computing, and analysis of scientific remote sensing
data at NASA (and science in general) is highly driven by the
distribution and organization of the data
– Data is generally highly distributed and stored in different archives, and there are few “analytic” services for bringing data together
– This imposes an architectural constraint on the analysis
• Unless scientific remote sensing data and computing are fully centralized at NASA, new approaches for data processing and analysis are required, as data from NASA observational instruments and climate models continue to rapidly increase in size
8. The Problem
• Typical data analysis approaches assume that data is “shipped” to user
for analysis
– Analysis is then highly dependent on the time to move the data across the
network
– Algorithms assume centralization (e.g., data is relocated and computed)
• The volume of data required to be shipped to the user is increasing at a rapid rate, making systems difficult to use
– For example, downloading model output from the Earth System Grid
– Analysis is limited by the movement of data
• Analysis that requires data from multiple systems compounds the problem by requiring n separate downloads
• The NASA canonical architectures aren’t positioned to address
this challenge
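The data-shipping cost described above can be made concrete with a back-of-the-envelope sketch. The archive sizes, link speed, and protocol-overhead factor below are illustrative assumptions, not measurements:

```python
def transfer_hours(volume_tb, bandwidth_gbps, overhead=1.2):
    """Estimate wall-clock hours to move volume_tb terabytes over a
    link of bandwidth_gbps gigabits/s, with a protocol-overhead factor."""
    bits = volume_tb * 8e12
    return bits * overhead / (bandwidth_gbps * 1e9) / 3600

# An analysis joining data from 3 archives requires 3 separate downloads:
archives_tb = [5.0, 2.0, 1.5]  # hypothetical dataset sizes
total = sum(transfer_hours(v, bandwidth_gbps=1.0) for v in archives_tb)
print(f"{total:.1f} hours just to move the data")  # ~22.7 hours
```

Even before any computation starts, the analyst has spent nearly a day moving bytes, and that time grows linearly with both data volume and the number of archives involved.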
9. Computational and Data Science Future Capability Needs
Derived from NASA Office of the Chief Technologist TA-11 Roadmap (2015)
NASA AIST Big Data Study, 2015-2016
10. Emerging Challenges as Data Increases
• Reproducibility
• Uncertainty management
• Data fusion (including distributed data)
• Data reduction
• Data movement
• Data visualization
• Cost
• Performance
Considering the architecture and data lifecycle is key to large-scale
data intensive systems!
11. Future of Data Science at NASA: Enabling a Big Data Research Environment
[Diagram: data capture systems (instrument data systems, airborne data, NASA data archives, and other data systems such as NOAA, in-situ, and other agencies) feed, via the comm network, a big data infrastructure (data, algorithms, machines) that supports data analysis (water, ocean, CO2, extreme events, Mars, etc.)]
Reducing Data Wrangling: “There is a major need for the development of software components…
that link high-level data analysis-specifications with low-level distributed systems architectures.”
Frontiers in the Analysis of Massive Data, National Research Council, 2013.
12. Compute vs. Move
• For distributed, federated environments, the future dilemma is
whether to “move and compute” or “compute and move”
• Answering this question is fundamentally important for determining
how systems should be implemented in the future
– Science data systems (SDSs) which need to deliver data (or services!) to users
– Future archive systems
– Approaches for analysis
• Answering this question requires a quantitative approach to evaluate
the tradeoffs
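The "move and compute" vs. "compute and move" dilemma can be framed as a toy cost model. The data volume, reduction factor, link speed, and compute times below are hypothetical, chosen only to show the shape of the tradeoff:

```python
def move_then_compute(raw_tb, bandwidth_gbps, compute_hours):
    """Ship the raw data to the analyst, then analyze locally (hours)."""
    transfer = raw_tb * 8e12 / (bandwidth_gbps * 1e9) / 3600
    return transfer + compute_hours

def compute_then_move(raw_tb, reduction, bandwidth_gbps, compute_hours):
    """Analyze next to the archive, ship only the reduced result (hours)."""
    transfer = raw_tb / reduction * 8e12 / (bandwidth_gbps * 1e9) / 3600
    return compute_hours + transfer

# Hypothetical: 10 TB of model output, a 1 Gbps link, 2 h of computation,
# and an analysis that reduces the data 1000x (e.g., regional statistics).
a = move_then_compute(10, 1.0, 2.0)
b = compute_then_move(10, 1000, 1.0, 2.0)
print(f"move-and-compute: {a:.1f} h, compute-and-move: {b:.1f} h")
```

Under these assumptions the compute-and-move strategy wins by an order of magnitude; the crossover point depends on the reduction factor of the analysis versus the network capacity, which is exactly the kind of tradeoff a quantitative evaluation must capture.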
13. The Quantitative Approach
• Need to measure the efficiency of NASA canonical architectures vs.
future proposed approaches
• Measure the time to perform analysis and its dependencies on specific use-case parameters
– Measure comparing NASA canonical architectures vs. proposed
approaches.
• Understand the implication on uncertainty quantification to
determine reliability of scientific inferences in these types of
architectures
14. Example: Carbon Cycle Model-Observation Comparison
Gunson, Braverman, Bowman, Cressie
Science Goal:
- Understand processes that control CO2 flux
Strategy:
- Experiment with models to increase agreement between observations and model-based inferences
Analysis Challenges:
- Models and data reside at different locations
- Data are heterogeneous and must be reconciled as to
format, scope, fidelity, resolution, etc.
- Meaningful comparison requires uncertainty estimation
on both observational data and model output
Architecture Evaluation:
- Address the analysis challenges to minimize
both uncertainty and data movement.
A dynamic optimization problem.
15. Western States Water Mission (WSWM): A Science/Data Science Collaboration
[Diagram: inputs (input-forcing, e.g., GPM; data-assimilation products, e.g., MODSCAG) feed a scalable data processing system for hydrological science, built on a data science infrastructure (tools, services, and methods for massive data analysis). Outputs, via a web-based interface: standard reports plus ad hoc queries and custom reports covering surface water, ground water, and snow-water equivalent, as single-month estimates and short- and long-term trends, for research, applications, and decision support.]
16. WaterTrek: Interactive Data Analytics for Hydrology
Fusing in-situ, airborne, space-borne, and model-generated data using visualization and a big data analytics engine.
[Screenshot panels: SAR-derived subsidence; in-situ stream gage sensors; model output: soil moisture; river network; GPS; user-defined polygon]
17. Research Challenges
The principal objective is to research the relationship between architectural topology and scientific data analysis efficiency, and to explore new architectural techniques for scaling science-driven data analytics across distributed environments.
1) for a fixed system architecture, how can one optimize the movement of data
and algorithms and estimate the costs?
2) which system architectures yield the greatest efficiencies for the types of
scientific analyses we wish to support?
3) how can existing and new computational methods be designed to better
exploit the distributed architecture and increase scientific return?
18. Data Science Architecture Evaluation
• Formalize the capture of science questions as data science
computations
• Develop a model for evaluating data science computations based on
data system topologies and constraints
• Capture and evaluate high priority use cases that are constrained by
data science challenges
• Evaluate the use cases based on the model to assess cost, performance, and uncertainty constraints. Identify:
– Big data computing stack
– Topology
– Methodologies
– Data
• Establish testbed based on the above output (computing stack,
topology/deployment, algorithms, data)
• Evaluate, tune and deploy
19. Data Science Architectural Considerations
[Diagram: use cases and scenarios drive data science architectural tradeoffs across four decision areas, producing an output of uncertainty, performance, cost, and computing stack based on a set of capacities:]
• Topology decisions: distribution (data, computation); data accessibility; network capacity; computational capacity; analysis choreography/workflow
• Methodology decisions (methods): data reduction; feature extraction; classification; detection; fusion
• Software and hardware decisions (data science analytics framework): data management capabilities; storage (e.g., cloud); visualization; algorithms/data processing; server resources (HPC, etc.); data movement technologies
• Data (big data analytics capability): data collections; data products/objects/files; metadata products; data formats; data size
20. DAWN: Architecture Evaluation
DAWN (“Distributed Analytics, Workflows and Numerics”) is a model for
simulation, analysis and optimization of data processing system architectures
• Example: the Science Data System of a generic NASA
Earth Observing mission
• Example: a climate science analysis to be performed
over datasets held at distributed locations
• Source code is a Python package available from JPL
GitHub
• Developed with funding from JPL Data Science
initiative (2014-2016)
Luca Cinquini, Kyo Lee, Amy Braverman, Mike Turmon, Dan Crichton
21. DAWN Usage
• Inputs:
‣ Formal representation of system topology (nodes, edges) and data processing
workflow (tasks, their sequence and dependencies)
‣ Numerical estimates for algorithm execution times, data volumes, network speed
‣ May be provided as an XML document or built through the DAWN Python API
• Outputs: a quantitative evaluation of the system architecture based on several metrics:
‣ Overall execution time, separate computation and data movement times
‣ Volumes of data transferred
‣ System load (aka CPU utilization)
‣ Monetary cost (preliminary, subject to market fluctuations)
‣ Future: Uncertainty
22. Applications of DAWN
• Can be run multiple times by changing the system parameters (number of nodes,
cores, network speed, …) to identify the resources needed to achieve a given
processing goal
• Can find bottlenecks in workflow execution to identify computations that need to
be optimized or parallelized
• Can analyze how efficiently CPUs are utilized, to minimize monetary cost
• Can compare different possible architectures (centralized, distributed, parallel, …)
to maximize efficiency
23. Use Case #1: Climate Rainfall Prediction
• Description: can climate models accurately predict rainfall characteristics (peak intensity and duration) over a given geographic region?
‣ Use statistical techniques to compare observations @
4km resolution from JPL to model output @ 4,12,24
km resolution from GSFC
• DAWN application: DAWN was used to identify the most
efficient architecture to execute the analysis (centralized
vs distributed)
• DAWN results:
‣ Distributed architecture is 3 orders of magnitude faster
than the centralized architecture (hours vs days)
‣ In the distributed architecture, the total time is the
same for processing all model resolutions (because the
workflow is dominated by processing of 4km
observations), so there is no advantage in using the
lowest possible model resolution
24. Use Case #2: ECOSTRESS
• Description: ECOSTRESS is an upcoming NASA mission
that will use a thermal radiometer aboard the ISS to
study plant-water dynamics and variation in ecosystems
due to climate change and water availability
• DAWN application: DAWN was used to analyze and
optimize the L0-L2 data processing pipeline to determine
which computing resources are needed
• DAWN results:
‣ Identified PGE#4 (“geo-location”) as the critical bottleneck
that must be parallelized over scenes
‣ 2 servers with 8 cores each are sufficient to keep up
with incoming data stream
‣ When reprocessing all the data, the most efficient
architecture is distributed (cluster of nodes allocated
to execute specific tasks)
‣ Optimal allocation of 36 nodes is 1 - 3 - 24 - 8
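A node-allocation result of this kind can be reproduced in spirit with a brute-force search over pipeline-stage splits. This is not DAWN's code, and the per-stage workloads below are invented so that the search lands on the 1 - 3 - 24 - 8 split reported above; only the search pattern itself is the point:

```python
def best_allocation(stage_node_hours, total_nodes):
    """Brute-force the 4-way node split that minimizes the bottleneck
    stage time, assuming each stage's work parallelizes perfectly.
    stage_node_hours: total work per stage in node-hours (assumed values)."""
    w1, w2, w3, w4 = stage_node_hours
    best, best_t = None, float("inf")
    for a in range(1, total_nodes - 2):
        for b in range(1, total_nodes - a - 1):
            for c in range(1, total_nodes - a - b):
                d = total_nodes - a - b - c  # remaining nodes, always >= 1
                t = max(w1 / a, w2 / b, w3 / c, w4 / d)
                if t < best_t:
                    best, best_t = (a, b, c, d), t
    return best, best_t

# Hypothetical per-stage workloads (node-hours) for a 4-stage pipeline:
alloc, t = best_allocation([2, 6, 48, 16], total_nodes=36)
print(alloc, round(t, 2))  # (1, 3, 24, 8) 2.0
```

With these workloads the optimum balances every stage at exactly 2 hours; in a real evaluation the workloads would come from benchmarked PGE execution times, as in the DAWN inputs described earlier.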
25. Conclusions
• There is a need to quantitatively model software architectures to
understand tradeoffs that affect both scalability and scientific analysis
– Initial efforts at JPL have led to exploration of use cases and the initial DAWN
model
– JPL is working with CMU/David Garlan to explore formalized modeling
approaches
• Amy Braverman and Dan Crichton are calling this “theory of data systems”
– Significant interest in understanding how tradeoffs in analysis approaches
affect uncertainty
• Understanding such tradeoffs will lead to new approaches for developing climate data analysis capabilities and will help NASA and other organizations optimize their data infrastructures.
• We believe there is ample opportunity to form a multi-disciplinary team to further the study of these challenges.
27. DAWN Results for Batch Re-Processing
DAWN simulated the processing of 10 days of data (150 orbits)
• Distributed Architecture can be more efficient than Parallel or Centralized Architecture
• Degree of distribution is critical
• Parallel architecture: t = 46,140 sec = 12.8 hours
• Centralized architecture: t = 64,920 sec = 18.0 hours
• Best distributed architecture: t = 31,260 sec = 8.7 hours
28. DAWN Results for Real-Time Processing
DAWN simulated the processing of 1 day of data (15 orbits, 142 scenes)
• PGE #4 (“geo-location”) is critical bottleneck and must be parallelized over scenes
• 2 servers w/ 8 cores each are sufficient to keep up with incoming data stream
• Because of time delay between orbits, additional servers do not increase efficiency
[Figure panels: PGE#4 sequential (no orbit offset); PGE#4 parallel (no orbit offset); PGE#4 parallel (with orbit offset). Plots show execution time vs. number of nodes and CPU load vs. time.]
29. Data Science Architecture Planning
[Diagram: massive data science question/challenge → architectural assessment/tradeoff → evaluate question in data science testbed → computational capability]
• Stage 1: Capture big data characteristics and constraints.
• Stage 2: Assess hardware, software, topology, and methodology configurations based on a formalization of the uncertainty, cost, performance, and capacity tolerances.
• Stage 3: Establish a testbed for the data science question, based on the architectural assessment, to support software, computation, data, and algorithm integration.
• Stage 4: Deploy.
30. ExArch Meeting, October 2012
“Topology” Model
• Nodes: servers providing computation and storage
‣ Each server can have multiple cores, which process data independently
‣ A server queue distributes jobs to available cores
• Edges: network links connecting nodes
‣ Tunable parametric speed
• Algorithms: computations to be executed on data
‣ Defined by benchmarked execution time (on a given server type)
• Datasets: inputs/outputs of processing algorithms
‣ Each algorithm can have multiple inputs and outputs
• Tasks: operations to be performed on the data (computation or movement)
‣ Support for concurrent data processing and data movement
‣ Support for task dependencies
• Workflows: structured combinations of tasks
‣ Can run tasks in sequence or in parallel
‣ Support for nested sub-workflows
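The topology model above (nodes with cores, tasks with dependencies, movement timed by link speed) can be sketched as a tiny simulator. The class names, the list-scheduling policy, and the example numbers are all illustrative assumptions, not DAWN's actual API:

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cores: int

@dataclass
class Task:
    name: str
    duration: float  # benchmarked compute time, or data volume / link speed
    node: Node
    deps: list = field(default_factory=list)

def topo_order(tasks):
    """Order tasks so every task follows all of its dependencies."""
    seen, order = set(), []
    def visit(t):
        if t.name not in seen:
            seen.add(t.name)
            for d in t.deps:
                visit(d)
            order.append(t)
    for t in tasks:
        visit(t)
    return order

def simulate(tasks):
    """List-schedule tasks onto each node's cores; return makespan in seconds."""
    finish = {}
    # Min-heap per node: the times at which each core becomes free.
    core_free = {t.node.name: [0.0] * t.node.cores for t in tasks}
    for task in topo_order(tasks):
        ready = max((finish[d.name] for d in task.deps), default=0.0)
        cores = core_free[task.node.name]
        start = max(ready, cores[0])  # earliest free core on that node
        heapq.heapreplace(cores, start + task.duration)
        finish[task.name] = start + task.duration
    return max(finish.values())

# Hypothetical workflow: subset at the archive, move 100 GB over a
# 1 Gbps link, then analyze at the user's site.
archive = Node("archive", cores=4)
user = Node("user", cores=2)
subset = Task("subset", 600.0, archive)                  # 10 min subsetting
move = Task("move", 100e9 * 8 / 1e9, archive, [subset])  # 800 s transfer
analyze = Task("analyze", 1800.0, user, [move])
print(simulate([subset, move, analyze]) / 3600)  # ~0.89 hours
```

Re-running the simulator while varying node counts, core counts, or link speeds gives exactly the kind of what-if comparison (centralized vs. distributed, sequential vs. parallel) that the DAWN applications slides describe.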