Software Architecture Considerations in the
Analysis of Highly Distributed Data and
Computational Analysis
Dan Crichton
Center for Data Science and Technology
Data Science Program
Earth Science Data Systems and Technology Program
Jet Propulsion Laboratory, Caltech
August 22, 2017
Jet Propulsion Laboratory
California Institute of Technology
Introduction
What is Big Data? Why Should We Care?
The CHALLENGE: Big Data
• When needs for data collection, processing,
management and analysis go beyond the capacity and
capability of available methods and software systems
The SOLUTION: Data Science
• Scalable architectural approaches, techniques,
software and algorithms which alter the paradigm by
which data is collected, managed and analyzed
The RELEVANCE:
• Addressing the challenge of Big Data is on the critical
path to accomplishing our NASA science objectives, as
– the size and distribution of science data sets and predictive
models continue to burgeon, and
– core science community objectives such as reproducibility
of results are becoming compromised
NASA Mission-Science Data
Lifecycle for Remote Sensing
Agile Science – Onboard Analysis
• Challenge: Data collection capacity at the instrument outstrips data transport and data storage capacity
• Future Solutions: Onboard computation and data science
Extreme Data Volumes – Data Triage
• Challenge: Too much data, too fast; cannot transport data efficiently enough
• Future Solutions: Dynamic architectures to scale data processing and triage exascale data streams
Distributed Data Analytics
• Challenge: Data distributed in massive archives; many different types of measurements
• Future Solutions: Distributed data analytics; uncertainty quantification
Preparing for exascale computing…
SMAP (Today): 485 GB/day
NI-SAR (2020): 86 TB/day
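To put the two daily volumes in perspective, the arithmetic below converts them into the average network rate needed just to keep up. This is a rough sketch: it assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and a perfectly steady transfer, neither of which is stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def sustained_gbps(bytes_per_day: float) -> float:
    """Average link rate (gigabits per second) needed to move a daily volume."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e9

# Daily volumes from the slide; decimal units are an assumption.
smap_bytes = 485e9    # 485 GB/day
nisar_bytes = 86e12   # 86 TB/day

smap_gbps = sustained_gbps(smap_bytes)
nisar_gbps = sustained_gbps(nisar_bytes)
growth = nisar_bytes / smap_bytes

print(f"SMAP:   {smap_gbps:.3f} Gbit/s sustained")
print(f"NI-SAR: {nisar_gbps:.2f} Gbit/s sustained")
print(f"Growth: about {growth:.0f}x")
```

Under these assumptions the daily volume grows by roughly two orders of magnitude, and the later mission needs several gigabits per second of sustained throughput before any reprocessing or redistribution is considered.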
Surface Water Ocean Topography
• 2020 Launch
NASA Science and Big Data Today
How do these connect?
EOSDIS DAAC
Comm Network
Focus on generating, capturing, managing big data
Big Data Infrastructure (Data, Algorithms, Machines)
Focus on using/analyzing big data
NASA Data Archives
~20 PB of data
Considerations
• The storage, computing, and analysis of scientific remote sensing
data at NASA (and in science in general) are highly driven by the
distribution and organization of the data
– Data is generally highly distributed, stored in different archives, and
there are few “analytic” services for bringing data together
– This imposes an architectural constraint on the analysis
• Unless scientific remote sensing data and computing are fully
centralized at NASA, new approaches for data processing and
analysis are required as NASA observational instruments and
climate models continue to rapidly increase in size
The Problem
• Typical data analysis approaches assume that data is “shipped” to the user
for analysis
– Analysis is then highly dependent on the time to move the data across the
network
– Algorithms assume centralization (e.g., data is relocated, then computed on)
• The volume of data that must be shipped to the user is increasing at a
rapid rate, making systems difficult to use
– For example, downloading model output from the Earth System Grid
– Analysis is limited by the movement of data
• Analysis that requires data from multiple systems compounds the
problem by requiring n downloads
• The NASA canonical architectures aren’t positioned to address
this challenge
Computational and Data Science
Future Capability Needs
Derived from NASA Office of the Chief Technologist TA-11 Roadmap (2015)
NASA AIST Big Data Study, 2015-2016
Emerging Challenges as Data
Increases
• Reproducibility
• Uncertainty management
• Data fusion (including distributed data)
• Data reduction
• Data movement
• Data visualization
• Cost
• Performance
Considering the architecture and data lifecycle is key to large-scale,
data-intensive systems!
Future of Data Science at NASA
Enabling a Big Data Research Environment
Comm Network
Big Data Infrastructure (Data, Algorithms, Machines)
Other Data Systems (e.g., NOAA)
Other Data Systems (In-Situ, Other Agency, etc.)
Instrument Data Systems
Airborne Data
NASA Data Archives
Data Capture
Data Analysis (Water, Ocean, CO2, Extreme Events, Mars, etc.)
Reducing Data Wrangling: “There is a major need for the development of software components…
that link high-level data analysis-specifications with low-level distributed systems architectures.”
Frontiers in the Analysis of Massive Data, National Research Council, 2013.
Compute vs. Move
• For distributed, federated environments, the future dilemma is
whether to “move and compute” or “compute and move”
• Answering this question is fundamentally important for determining
how systems should be implemented in the future
– SDSs which need to deliver data (or services!) to users
– Future archive systems
– Approaches for analysis
• Answering this question requires a quantitative approach to evaluate
the tradeoffs
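One way to make the “move and compute” versus “compute and move” question quantitative is a back-of-the-envelope time model. The sketch below is illustrative only: the `Archive` class, the throughput figures, and the `reduction` factor (the fraction of raw volume that remains after local analysis) are assumptions, not measurements from any NASA system.

```python
from dataclasses import dataclass

@dataclass
class Archive:
    name: str
    volume_tb: float          # raw data held at the archive
    compute_tb_per_hr: float  # local processing throughput
    link_tb_per_hr: float     # network throughput from archive to user

def move_then_compute(archives, user_compute_tb_per_hr):
    """Ship all raw data to the user (downloads run in parallel), then analyze."""
    transfer = max(a.volume_tb / a.link_tb_per_hr for a in archives)
    compute = sum(a.volume_tb for a in archives) / user_compute_tb_per_hr
    return transfer + compute

def compute_then_move(archives, reduction):
    """Analyze at each archive in parallel, then ship only the reduced results."""
    return max(
        a.volume_tb / a.compute_tb_per_hr
        + a.volume_tb * reduction / a.link_tb_per_hr
        for a in archives
    )

# Hypothetical archives and rates, chosen only to illustrate the comparison.
archives = [
    Archive("archive_a", volume_tb=100, compute_tb_per_hr=10, link_tb_per_hr=1.0),
    Archive("archive_b", volume_tb=50, compute_tb_per_hr=5, link_tb_per_hr=0.5),
]

hours_move_first = move_then_compute(archives, user_compute_tb_per_hr=10)
hours_compute_first = compute_then_move(archives, reduction=0.01)
print(f"move and compute: {hours_move_first:.1f} h")
print(f"compute and move: {hours_compute_first:.1f} h")
```

With these made-up numbers the network dominates the first strategy, so computing near the data wins. The balance shifts if the analysis does not reduce the data much or if archive-side compute is scarce, which is why the tradeoff needs to be evaluated per use case.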
The Quantitative Approach
• Need to measure the efficiency of NASA canonical architectures vs.
proposed future approaches
• Measure the time to perform analysis and its dependencies on
specific use case parameters
– Compare the measurements for NASA canonical architectures vs. proposed
approaches
• Understand the implications for uncertainty quantification to
determine the reliability of scientific inferences in these types of
architectures
Example: Carbon Cycle Model-Observation Comparison
Gunson, Braverman, Bowman, Cressie
Science Goal:
- Understand processes that control CO2 flux
Strategy:
- Experiment with models to increase
agreement with observations / inferences
Analysis Challenges:
- Models and data reside at different locations
- Data are heterogeneous and must be reconciled as to
format, scope, fidelity, resolution, etc.
- Meaningful comparison requires uncertainty estimation
on both observational data and model output
Architecture Evaluation:
- Address the analysis challenges to minimize
both uncertainty and data movement.
A dynamic optimization problem.
Western States Water Mission (WSWM):
A Science/Data Science Collaboration
Input-Forcing (e.g., GPM)
For Data Assimilation (e.g., MODSCAG)
Standard Reports
Ad Hoc Queries and Custom Reports
Snow-Water Equivalent
Surface Water
Ground Water
Single-Month Estimates
Short and Long-Term Trends
Research
Applications
Decision Support
Data Science Infrastructure (Tools, Services, Methods for Massive Data Analysis)
A Scalable Data Processing System for Hydrological Science
(Web-Based Interface)
WaterTrek: Interactive Data Analytics for Hydrology
Fusing in-situ, airborne, spaceborne, and model-generated
data using visualization and a big data analytics engine
SAR-Derived Subsidence
In-Situ: Stream Gage Sensors
Model Output: Soil Moisture
River Network
GPS
User Defined Polygon
Research Challenges
The principal objective is to research the relationship between architectural
topology and scientific data analysis efficiency, in order to explore new architectural
techniques for scaling science-driven data analytics across distributed
environments.
1) For a fixed system architecture, how can one optimize the movement of data
and algorithms and estimate the costs?
2) Which system architectures yield the greatest efficiencies for the types of
scientific analyses we wish to support?
3) How can existing and new computational methods be designed to better
exploit the distributed architecture and increase scientific return?
Data Science Architecture
Evaluation
• Formalize the capture of science questions as data science
computations
• Develop a model for evaluating data science computations based on
data system topologies and constraints
• Capture and evaluate high priority use cases that are constrained by
data science challenges
• Evaluate the use cases based on the model to assess cost,
performance, and uncertainty constraints. Identify:
– Big data computing stack
– Topology
– Methodologies
– Data
• Establish testbed based on the above output (computing stack,
topology/deployment, algorithms, data)
• Evaluate, tune and deploy
Topology Decisions
Distribution (data, computation)
Data Accessibility
Network Capacity
Computational Capacity
Analysis Choreography/Workflow
Data Science Architectural Considerations
Data Science
Architectural
Tradeoffs
Methodology
Decisions
Software and
Hardware
Decisions
Use Cases,
Scenarios
Data Science Analytics Framework
Data Management Capabilities
Storage (e.g., Cloud)
Visualization
Algorithms/Data Processing
Server Resources (HPC, etc)
Data Movement Technologies
Output
Uncertainty, performance,
cost, and computing stack
based on a set of capacities
Data Collections
Data Products/Objects/Files
Metadata Products
Data Formats, Data Size
Methods
Data Reduction
Feature Extraction
Classification
Detection
Fusion
Data Big Data
Analytics
Capability
Topology
Decisions
DAWN: Architecture Evaluation
DAWN (“Distributed Analytics, Workflows and Numerics”) is a model for
simulation, analysis and optimization of data processing system architectures
• Example: the Science Data System of a generic NASA
Earth Observing mission
• Example: a climate science analysis to be performed
over datasets held at distributed locations
• Source code is a Python package available from JPL
GitHub
• Developed with funding from JPL Data Science
initiative (2014-2016)
Luca Cinquini, Kyo Lee, Amy Braverman, Mike Turmon, Dan Crichton
DAWN Usage
• Inputs:
‣ Formal representation of system topology (nodes, edges) and data processing
workflow (tasks, their sequence and dependencies)
‣ Numerical estimates for algorithm execution times, data volumes, network speed
‣ May be provided as XML document or built through the DAWN Python API
• Outputs: quantitative estimation of system architecture based on several metrics:
‣ Overall execution time, separate computation and data movement times
‣ Volumes of data transferred
‣ System load (aka CPU utilization)
‣ Monetary cost (preliminary, subject to market fluctuations)
‣ Future: Uncertainty
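The kind of input and output listed above can be illustrated with a toy evaluator. This is not the DAWN API, which is not shown in these slides; the `Task` class, the `evaluate` function, and every number below are hypothetical. The sketch takes tasks with estimated durations and dependencies, and reports overall execution time, separate computation and data movement times, and the volume transferred. It assumes independent tasks can always run concurrently, with no limit on cores or links.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    kind: str                # "compute" or "move"
    seconds: float           # benchmarked or estimated duration
    gigabytes: float = 0.0   # data transferred (used for "move" tasks)
    depends_on: list = field(default_factory=list)

def evaluate(tasks):
    """Summarize a workflow: critical-path time plus per-kind totals."""
    by_name = {t.name: t for t in tasks}
    finish = {}

    def finish_time(name):
        # A task starts once all of its dependencies have finished.
        if name not in finish:
            task = by_name[name]
            start = max((finish_time(d) for d in task.depends_on), default=0.0)
            finish[name] = start + task.seconds
        return finish[name]

    return {
        "total_seconds": max(finish_time(t.name) for t in tasks),
        "compute_seconds": sum(t.seconds for t in tasks if t.kind == "compute"),
        "move_seconds": sum(t.seconds for t in tasks if t.kind == "move"),
        "gigabytes_moved": sum(t.gigabytes for t in tasks if t.kind == "move"),
    }

# Hypothetical workflow: fetch two datasets, regrid one, then compare them.
workflow = [
    Task("fetch_obs", "move", 600, gigabytes=50),
    Task("fetch_model", "move", 900, gigabytes=80),
    Task("regrid", "compute", 300, depends_on=["fetch_model"]),
    Task("compare", "compute", 1200, depends_on=["fetch_obs", "regrid"]),
]

report = evaluate(workflow)
for metric, value in report.items():
    print(f"{metric}: {value}")
```

Because the two fetches overlap, the total time is shorter than the sum of the compute and move times. Separating the two is what lets a model of this kind show whether a workflow is bound by computation or by data movement.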
Applications of DAWN
• Can be run multiple times by changing the system parameters (number of nodes,
cores, network speed, …) to identify the resources needed to achieve a given
processing goal
• Can find bottlenecks in workflow execution to identify computations that need to
be optimized or parallelized
• Can analyze how efficiently CPUs are utilized, to minimize monetary cost
• Can compare different possible architectures (centralized, distributed, parallel, …)
to maximize efficiency
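The first and third applications amount to a parameter sweep: rerun the model with more resources until a processing goal is met, then check how well the CPUs are used. The sketch below shows the idea with a deliberately crude cost model (one serial stage followed by equal-length jobs run in waves across all cores). The function names, the cost model, and all numbers are assumptions for illustration and are not taken from DAWN.

```python
import math

def execution_time(n_nodes, cores_per_node, n_jobs, job_seconds, serial_seconds):
    """Serial stage, then n_jobs equal jobs executed in waves over all cores."""
    cores = n_nodes * cores_per_node
    waves = math.ceil(n_jobs / cores)
    return serial_seconds + waves * job_seconds

def utilization(n_nodes, cores_per_node, n_jobs, job_seconds, serial_seconds):
    """Fraction of available core-seconds that were spent doing work."""
    busy = serial_seconds + n_jobs * job_seconds
    elapsed = execution_time(n_nodes, cores_per_node, n_jobs,
                             job_seconds, serial_seconds)
    return busy / (elapsed * n_nodes * cores_per_node)

def smallest_cluster(goal_seconds, max_nodes, **params):
    """Sweep the node count upward and return the first that meets the goal."""
    for n_nodes in range(1, max_nodes + 1):
        if execution_time(n_nodes, **params) <= goal_seconds:
            return n_nodes
    return None  # goal not reachable within max_nodes

# Hypothetical workload parameters.
params = dict(cores_per_node=8, n_jobs=100, job_seconds=60, serial_seconds=300)

nodes_needed = smallest_cluster(goal_seconds=900, max_nodes=10, **params)
print("nodes needed:", nodes_needed)
print("utilization:", round(utilization(nodes_needed, **params), 3))
```

Adding nodes beyond the point where the goal is met shortens the run only slightly while utilization drops, which is the cost-versus-speed tradeoff the bullets describe.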
Use Case #1: Climate Rainfall
Prediction
• Description: Can climate models accurately predict
rainfall characteristics (peak intensity and duration) over
a given geographic region?
‣ Use statistical techniques to compare observations at
4 km resolution from JPL to model output at 4, 12, and 24
km resolution from GSFC
• DAWN application: DAWN was used to identify the most
efficient architecture to execute the analysis (centralized
vs distributed)
• DAWN results:
‣ Distributed architecture is 3 orders of magnitude faster
than the centralized architecture (hours vs days)
‣ In the distributed architecture, the total time is the
same for processing all model resolutions (because the
workflow is dominated by processing of 4km
observations), so there is no advantage in using the
lowest possible model resolution
Centralized Distributed
Use Case #2: ECOSTRESS
• Description: ECOSTRESS is an upcoming NASA mission
that will use a thermal radiometer aboard the ISS to
study plant-water dynamics and variation in eco-systems
due to climate change and water availability
• DAWN application: DAWN was used to analyze and
optimize the L0-L2 data processing pipeline to determine
which computing resources are needed
• DAWN results:
‣ Identified PGE#4 (“geo-location”) as a critical bottleneck
that must be parallelized over scenes
‣ 2 servers with 8 cores each are sufficient to keep up
with incoming data stream
‣ When reprocessing all the data, the most efficient
architecture is distributed (cluster of nodes allocated
to execute specific tasks)
‣ Optimal allocation of 36 nodes is 1 - 3 - 24 - 8
Conclusions
• There is a need to quantitatively model software architectures to
understand tradeoffs that affect both scalability and scientific analysis
– Initial efforts at JPL have led to exploration of use cases and the initial DAWN
model
– JPL is working with CMU/David Garlan to explore formalized modeling
approaches
• Amy Braverman and Dan Crichton are calling this a “theory of data systems”
– Significant interest in understanding how tradeoffs in analysis approaches
affect uncertainty
• Understanding such tradeoffs will lead to new approaches to developing
climate data analysis capabilities and will help NASA and other
organizations understand how to optimize their data infrastructures.
• We believe there is ample opportunity to form a multidisciplinary team
to further the study of these challenges.
Backup
DAWN Results for Batch Re-Processing
DAWN simulated the processing of 10 days of data (150 orbits)
• Distributed Architecture can be more efficient than Parallel or Centralized Architecture
• Degree of distribution is critical
Parallel Architecture
t = 46,140 sec = 12.8 hours
Centralized Architecture
t = 64,920 sec = 18.0 hours
Best Distributed Arch.
t = 31,260 sec = 8.7 hours
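The three reported times can be turned into relative speedups with a few lines of arithmetic. The times are the ones on the slide; choosing the centralized architecture as the baseline is a presentation choice made here.

```python
# Simulated execution times from the slide, in seconds.
times_sec = {
    "centralized": 64_920,
    "parallel": 46_140,
    "best distributed": 31_260,
}

baseline = times_sec["centralized"]
hours = {name: t / 3600 for name, t in times_sec.items()}
speedups = {name: baseline / t for name, t in times_sec.items()}

for name in times_sec:
    print(f"{name:17s} {hours[name]:5.1f} h   "
          f"speedup vs centralized: {speedups[name]:.2f}x")
```

On these figures the best distributed configuration is roughly twice as fast as the centralized one, and the parallel configuration falls in between.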
DAWN Results for Real-Time
Processing
DAWN simulated the processing of 1 day of data (15 orbits, 142 scenes)
• PGE #4 (“geo-location”) is a critical bottleneck and must be parallelized over scenes
• 2 servers w/ 8 cores each are sufficient to keep up with incoming data stream
• Because of time delay between orbits, additional servers do not increase efficiency
Figure panels: PGE#4 sequential (no orbit offset); PGE#4 parallel (no orbit offset); PGE#4 parallel (with orbit offset)
Plots: Exec Time vs Number of Nodes; CPUs load vs Time
Data Science Architecture Planning
Massive Data Science Question/Challenge?
Architectural Assessment/Tradeoff
Evaluate Question in Data Science Testbed
Computational Capability
Stage 1: Capture big data characteristics and constraints
Stage 2: Assess hardware, software, topology, methodology configurations based on formalization of the uncertainty, cost, performance, capacity tolerances
Stage 3: Establish a testbed for the data science question based on the architectural assessment to support software, computation, data, algorithm integration
Stage 4: Deploy
ExArch Meeting, October 2012
“Topology” Model
• Nodes: servers, providing computation and storage
‣ Each server can have multiple cores, which process data independently
‣ Server queue distributes jobs to available cores
• Edges: network connecting Nodes
‣ Tunable parametric speed
• Algorithms: computations to be executed on data
‣ Defined by benchmarked execution time (on a given server type)
• Datasets: inputs/outputs to processing algorithms
‣ Each algorithm can have multiple inputs and outputs
• Tasks: operations to be performed on the data (computation or movement)
‣ Support for concurrent data processing and data movement
‣ Support for task dependency
• Workflows: structured combination of tasks
‣ Can run tasks in sequence or in parallel
‣ Support for nested sub-workflows
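The node behavior described above (a server queue handing jobs to whichever core is free) can be sketched with a small simulation. This is an illustrative stand-in, not DAWN's implementation: `simulate_server` is a hypothetical function, jobs are served first come first served, and the job durations are invented.

```python
import heapq

def simulate_server(job_seconds, n_cores):
    """Assign each queued job to the core that frees up first.

    Returns (makespan, utilization). Assumes at least one job
    with a positive duration.
    """
    free_at = [0.0] * n_cores   # time at which each core becomes idle
    heapq.heapify(free_at)
    for duration in job_seconds:
        start = heapq.heappop(free_at)          # earliest available core
        heapq.heappush(free_at, start + duration)
    makespan = max(free_at)
    utilization = sum(job_seconds) / (makespan * n_cores)
    return makespan, utilization

# Hypothetical workload: ten equal 30-second jobs.
jobs = [30.0] * 10

sequential = simulate_server(jobs, n_cores=1)
parallel = simulate_server(jobs, n_cores=8)
print("1 core :", sequential)
print("8 cores:", parallel)
```

With eight cores the ten jobs finish in two waves, so the elapsed time falls sharply while a good share of the cores sit idle in the second wave. Effects like this are why execution time and CPU load need to be looked at together.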

More Related Content

What's hot

Is robustness really robust? how different definitions of robustness impact d...
Is robustness really robust? how different definitions of robustness impact d...Is robustness really robust? how different definitions of robustness impact d...
Is robustness really robust? how different definitions of robustness impact d...Environmental Intelligence Lab
 
How Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather EventsHow Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather Eventsinside-BigData.com
 
The Climateprediction.net programme, big data climate modelling
The Climateprediction.net programme, big data climate modellingThe Climateprediction.net programme, big data climate modelling
The Climateprediction.net programme, big data climate modellingDavid Wallom
 
WP Technical Paper - Inter-annual variability of wind speed in South Africa
WP Technical Paper - Inter-annual variability of wind speed in South AfricaWP Technical Paper - Inter-annual variability of wind speed in South Africa
WP Technical Paper - Inter-annual variability of wind speed in South AfricaMatthew Behrens
 
A comparison of classification techniques for seismic facies recognition
A comparison of classification techniques for seismic facies recognitionA comparison of classification techniques for seismic facies recognition
A comparison of classification techniques for seismic facies recognitionPioneer Natural Resources
 
2006-03-21 Work Group Meeting on IT Techniques, Tools and Philosophies for Mo...
2006-03-21 Work Group Meeting on IT Techniques, Tools and Philosophies for Mo...2006-03-21 Work Group Meeting on IT Techniques, Tools and Philosophies for Mo...
2006-03-21 Work Group Meeting on IT Techniques, Tools and Philosophies for Mo...Rudolf Husar
 
Multi sensor data fusion system for enhanced analysis of deterioration in con...
Multi sensor data fusion system for enhanced analysis of deterioration in con...Multi sensor data fusion system for enhanced analysis of deterioration in con...
Multi sensor data fusion system for enhanced analysis of deterioration in con...Sayed Abulhasan Quadri
 
K venkata reddy
K venkata reddyK venkata reddy
K venkata reddyClimDev15
 
Advanced Remote Sensing Project Report
Advanced Remote Sensing Project ReportAdvanced Remote Sensing Project Report
Advanced Remote Sensing Project ReportJeffrey Schorsch
 
Application of the extreme learning machine algorithm for the
Application of the extreme learning machine algorithm for theApplication of the extreme learning machine algorithm for the
Application of the extreme learning machine algorithm for themehmet şahin
 
Referal-Kevin-Grimes
Referal-Kevin-GrimesReferal-Kevin-Grimes
Referal-Kevin-GrimesKevin Grimes
 
INTRODUCING NREL’S BEST PRACTICES HANDBOOK FOR COLLECTION AND USE OF SOLAR RE...
INTRODUCING NREL’S BEST PRACTICES HANDBOOK FOR COLLECTION AND USE OF SOLAR RE...INTRODUCING NREL’S BEST PRACTICES HANDBOOK FOR COLLECTION AND USE OF SOLAR RE...
INTRODUCING NREL’S BEST PRACTICES HANDBOOK FOR COLLECTION AND USE OF SOLAR RE...Roberto Valer
 
Multi sensor-fusion
Multi sensor-fusionMulti sensor-fusion
Multi sensor-fusion万言 李
 
12 SuperAI on Supercomputers
12 SuperAI on Supercomputers12 SuperAI on Supercomputers
12 SuperAI on SupercomputersRCCSRENKEI
 
Reanalysis Datasets for Solar Resource Assessment Presented at ASES 2014
Reanalysis Datasets for Solar Resource Assessment Presented at ASES 2014Reanalysis Datasets for Solar Resource Assessment Presented at ASES 2014
Reanalysis Datasets for Solar Resource Assessment Presented at ASES 2014Gwendalyn Bender
 

What's hot (20)

Is robustness really robust? how different definitions of robustness impact d...
Is robustness really robust? how different definitions of robustness impact d...Is robustness really robust? how different definitions of robustness impact d...
Is robustness really robust? how different definitions of robustness impact d...
 
test
testtest
test
 
How Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather EventsHow Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather Events
 
The Climateprediction.net programme, big data climate modelling
The Climateprediction.net programme, big data climate modellingThe Climateprediction.net programme, big data climate modelling
The Climateprediction.net programme, big data climate modelling
 
WP Technical Paper - Inter-annual variability of wind speed in South Africa
WP Technical Paper - Inter-annual variability of wind speed in South AfricaWP Technical Paper - Inter-annual variability of wind speed in South Africa
WP Technical Paper - Inter-annual variability of wind speed in South Africa
 
A comparison of classification techniques for seismic facies recognition
A comparison of classification techniques for seismic facies recognitionA comparison of classification techniques for seismic facies recognition
A comparison of classification techniques for seismic facies recognition
 
2006-03-21 Work Group Meeting on IT Techniques, Tools and Philosophies for Mo...
2006-03-21 Work Group Meeting on IT Techniques, Tools and Philosophies for Mo...2006-03-21 Work Group Meeting on IT Techniques, Tools and Philosophies for Mo...
2006-03-21 Work Group Meeting on IT Techniques, Tools and Philosophies for Mo...
 
Ijcet 06 07_003
Ijcet 06 07_003Ijcet 06 07_003
Ijcet 06 07_003
 
Multi sensor data fusion system for enhanced analysis of deterioration in con...
Multi sensor data fusion system for enhanced analysis of deterioration in con...Multi sensor data fusion system for enhanced analysis of deterioration in con...
Multi sensor data fusion system for enhanced analysis of deterioration in con...
 
KREAM@ICCS2013
KREAM@ICCS2013KREAM@ICCS2013
KREAM@ICCS2013
 
K venkata reddy
K venkata reddyK venkata reddy
K venkata reddy
 
Thesis report
Thesis reportThesis report
Thesis report
 
Advanced Remote Sensing Project Report
Advanced Remote Sensing Project ReportAdvanced Remote Sensing Project Report
Advanced Remote Sensing Project Report
 
Application of the extreme learning machine algorithm for the
Application of the extreme learning machine algorithm for theApplication of the extreme learning machine algorithm for the
Application of the extreme learning machine algorithm for the
 
Referal-Kevin-Grimes
Referal-Kevin-GrimesReferal-Kevin-Grimes
Referal-Kevin-Grimes
 
Measuring '3dM' Qualities
Measuring '3dM' QualitiesMeasuring '3dM' Qualities
Measuring '3dM' Qualities
 
INTRODUCING NREL’S BEST PRACTICES HANDBOOK FOR COLLECTION AND USE OF SOLAR RE...
INTRODUCING NREL’S BEST PRACTICES HANDBOOK FOR COLLECTION AND USE OF SOLAR RE...INTRODUCING NREL’S BEST PRACTICES HANDBOOK FOR COLLECTION AND USE OF SOLAR RE...
INTRODUCING NREL’S BEST PRACTICES HANDBOOK FOR COLLECTION AND USE OF SOLAR RE...
 
Multi sensor-fusion
Multi sensor-fusionMulti sensor-fusion
Multi sensor-fusion
 
12 SuperAI on Supercomputers
12 SuperAI on Supercomputers12 SuperAI on Supercomputers
12 SuperAI on Supercomputers
 
Reanalysis Datasets for Solar Resource Assessment Presented at ASES 2014
Reanalysis Datasets for Solar Resource Assessment Presented at ASES 2014Reanalysis Datasets for Solar Resource Assessment Presented at ASES 2014
Reanalysis Datasets for Solar Resource Assessment Presented at ASES 2014
 

Viewers also liked

Viewers also liked (20)

CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
 
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
 
CLIM Fall 2017 Course: Statistics for Climate Research, Geostats for Large Da...
CLIM Fall 2017 Course: Statistics for Climate Research, Geostats for Large Da...CLIM Fall 2017 Course: Statistics for Climate Research, Geostats for Large Da...
CLIM Fall 2017 Course: Statistics for Climate Research, Geostats for Large Da...
 
CLIM Fall 2017 Course: Statistics for Climate Research, Nonstationary Covaria...
CLIM Fall 2017 Course: Statistics for Climate Research, Nonstationary Covaria...CLIM Fall 2017 Course: Statistics for Climate Research, Nonstationary Covaria...
CLIM Fall 2017 Course: Statistics for Climate Research, Nonstationary Covaria...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
CLIM Fall 2017 Course: Statistics for Climate Research, Analysis for Climate ...
CLIM Fall 2017 Course: Statistics for Climate Research, Analysis for Climate ...CLIM Fall 2017 Course: Statistics for Climate Research, Analysis for Climate ...
CLIM Fall 2017 Course: Statistics for Climate Research, Analysis for Climate ...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
CLIM Fall 2017 Course: Statistics for Climate Research, Detection & Attributi...
 
CLIM Fall 2017 Course: Statistics for Climate Research, Estimating Curves and...
CLIM Fall 2017 Course: Statistics for Climate Research, Estimating Curves and...CLIM Fall 2017 Course: Statistics for Climate Research, Estimating Curves and...
CLIM Fall 2017 Course: Statistics for Climate Research, Estimating Curves and...
 
CLIM Fall 2017 Course: Statistics for Climate Research, Climate Informatics -...
CLIM Fall 2017 Course: Statistics for Climate Research, Climate Informatics -...CLIM Fall 2017 Course: Statistics for Climate Research, Climate Informatics -...
CLIM Fall 2017 Course: Statistics for Climate Research, Climate Informatics -...
 
CLIM Undergraduate Workshop: Applications in Climate Context - Michael Wehner...
CLIM Undergraduate Workshop: Applications in Climate Context - Michael Wehner...CLIM Undergraduate Workshop: Applications in Climate Context - Michael Wehner...
CLIM Undergraduate Workshop: Applications in Climate Context - Michael Wehner...
 
CLIM Undergraduate Workshop: Tutorial on R Software - Huang Huang, Oct 23, 2017
CLIM Undergraduate Workshop: Tutorial on R Software - Huang Huang, Oct 23, 2017CLIM Undergraduate Workshop: Tutorial on R Software - Huang Huang, Oct 23, 2017
CLIM Undergraduate Workshop: Tutorial on R Software - Huang Huang, Oct 23, 2017
 
CLIM Undergraduate Workshop: How was this Made?: Making Dirty Data into Somet...
CLIM Undergraduate Workshop: How was this Made?: Making Dirty Data into Somet...CLIM Undergraduate Workshop: How was this Made?: Making Dirty Data into Somet...
CLIM Undergraduate Workshop: How was this Made?: Making Dirty Data into Somet...
 
CLIM Undergraduate Workshop: Introduction to Spatial Data Analysis with R - M...
CLIM Undergraduate Workshop: Introduction to Spatial Data Analysis with R - M...CLIM Undergraduate Workshop: Introduction to Spatial Data Analysis with R - M...
CLIM Undergraduate Workshop: Introduction to Spatial Data Analysis with R - M...
 
CLIM Undergraduate Workshop: Extreme Value Analysis for Climate Research - Wh...
CLIM Undergraduate Workshop: Extreme Value Analysis for Climate Research - Wh...CLIM Undergraduate Workshop: Extreme Value Analysis for Climate Research - Wh...
CLIM Undergraduate Workshop: Extreme Value Analysis for Climate Research - Wh...
 

Program on Mathematical and Statistical Methods for Climate and the Earth System Opening Workshop, Software Architecture Considerations in the Analysis of Highly Distributed Data and Computational Analysis - Daniel Crichton, Aug 22, 2017

  • 1. Software Architecture Considerations in the Analysis of Highly Distributed Data and Computational Analysis
      Dan Crichton
      Center for Data Science and Technology
      Data Science Program
      Earth Science Data Systems and Technology Program
      Jet Propulsion Laboratory, Caltech
      August 22, 2017
  • 2. Jet Propulsion Laboratory, California Institute of Technology
      Introduction
      What is Big Data? Why Should We Care?
      The CHALLENGE: Big Data
      • When needs for data collection, processing, management and analysis go beyond the capacity and capability of available methods and software systems
      The SOLUTION: Data Science
      • Scalable architectural approaches, techniques, software and algorithms which alter the paradigm by which data is collected, managed and analyzed
      The RELEVANCE:
      • Addressing the challenge of Big Data is on the critical path to accomplishing our NASA science objectives, as
          – the size and distribution of science data sets and predictive models continue to burgeon, and
          – core science community objectives such as reproducibility of results are becoming compromised
  • 3. NASA Mission-Science Data Lifecycle for Remote Sensing
      Themes: Agile Science – Onboard Analysis; Extreme Data Volumes – Data Triage; Distributed Data Analytics
      • Challenge: Data collection capacity at the instrument outstrips data transport and data storage capacity
      • Challenge: Too much data, too fast; cannot transport data efficiently enough
      • Challenge: Data distributed in massive archives; many different types of measurements
      • Future Solutions: Onboard computation and data science
      • Future Solutions: Dynamic architectures to scale data processing and triage exascale data streams
      • Future Solutions: Distributed data analytics; uncertainty quantification
      Preparing for exascale computing…
      SMAP (Today): 485 GB/day; NI-SAR (2020): 86 TB/day
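As a rough check on the scale implied by the SMAP and NI-SAR figures, the daily volumes can be converted into sustained data rates. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and a perfectly even flow over 24 hours. Both are simplifying assumptions, not statements from the slides, and the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mbit_s(bytes_per_day: float) -> float:
    """Average rate in megabits per second needed to move a daily volume."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


smap_bytes = 485e9   # SMAP: 485 GB/day (decimal GB assumed)
nisar_bytes = 86e12  # NI-SAR: 86 TB/day (decimal TB assumed)

smap_rate = sustained_rate_mbit_s(smap_bytes)
nisar_rate = sustained_rate_mbit_s(nisar_bytes)
growth = nisar_bytes / smap_bytes

print(f"SMAP:   {smap_rate:.1f} Mbit/s")
print(f"NI-SAR: {nisar_rate:.1f} Mbit/s")
print(f"Growth: {growth:.0f}x")
```

Under these assumptions the average rate rises from roughly 45 Mbit/s to roughly 8 Gbit/s, a factor of about 177, before any allowance for bursts, retransmission, or downstream product generation.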
  • 4. Surface Water Ocean Topography
      • 2020 Launch
  • 5. NASA Science and Big Data Today
      How do these connect?
      • EOSDIS DAACs, Comm Network: Focus on generating, capturing, managing big data
      • Big Data Infrastructure (Data, Algorithms, Machines): Focus on using/analyzing big data
  • 7. Considerations
      • The storage, computing, and analysis of scientific remote sensing data at NASA (and science in general) are highly driven by the distribution and organization of the data
          – Data is generally highly distributed and stored in different archives, and there are few “analytic” services for bringing data together
          – This imposes an architectural constraint on the analysis
      • Unless scientific remote sensing data and computing are fully centralized at NASA, new approaches for data processing and analysis are required as NASA observational instruments and climate models continue to rapidly increase in size
  • 8. The Problem
      • Typical data analysis approaches assume that data is “shipped” to the user for analysis
          – Analysis is then highly dependent on the time to move the data across the network
          – Algorithms assume centralization (e.g., data is relocated and computed)
      • The volume of data required to be shipped to the user is increasing at a rapid rate, making systems difficult to use
          – For example, downloading model output from the Earth System Grid
          – Analysis is limited by the movement of data
      • Analysis which requires data from multiple systems compounds the problem by requiring n downloads
      • The NASA canonical architectures aren’t positioned to address this challenge
  • 9. Computational and Data Science Future Capability Needs
      Derived from NASA Office of the Chief Technologist TA-11 Roadmap (2015)
      NASA AIST Big Data Study, 2015-2016
  • 10. Emerging Challenges as Data Increases
      • Reproducibility
      • Uncertainty management
      • Data fusion (including distributed data)
      • Data reduction
      • Data movement
      • Data visualization
      • Cost
      • Performance
      Considering the architecture and data lifecycle is key to large-scale data-intensive systems!
  • 11. Future of Data Science at NASA: Enabling a Big Data Research Environment
      Diagram elements: Data Capture; Instrument Data Systems; Airborne Data; NASA Data Archives; Other Data Systems (e.g. NOAA); Other Data Systems (In-Situ, Other Agency, etc.); Comm Network; Big Data Infrastructure (Data, Algorithms, Machines); Data Analysis (Water, Ocean, CO2, Extreme Events, Mars, etc.)
      Reducing Data Wrangling: “There is a major need for the development of software components… that link high-level data analysis-specifications with low-level distributed systems architectures.” Frontiers in the Analysis of Massive Data, National Research Council, 2013.
  • 12. Compute vs. Move
      • For distributed, federated environments, the future dilemma is whether to “move and compute” or “compute and move”
      • Answering this question is fundamentally important for determining how systems should be implemented in the future
          – SDSs which need to deliver data (or services!) to users
          – Future archive systems
          – Approaches for analysis
      • Answering this question requires a quantitative approach to evaluate the tradeoffs
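One way to make the “move and compute” vs. “compute and move” tradeoff concrete is a back-of-the-envelope time model. The sketch below is an illustration only: the class, the function names, the single archive-to-user link, and every number are hypothetical, and it ignores latency, queuing, and contention.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical parameters for one analysis over a remote archive."""
    input_bytes: float        # size of the data held at the archive
    result_bytes: float       # size of the reduced analysis product
    network_bytes_s: float    # usable archive-to-user bandwidth
    user_compute_s: float     # run time on the user's machine
    archive_compute_s: float  # run time on compute co-located with the archive


def move_and_compute(s: Scenario) -> float:
    """Ship the raw data to the user, then analyze it locally."""
    return s.input_bytes / s.network_bytes_s + s.user_compute_s


def compute_and_move(s: Scenario) -> float:
    """Analyze at the archive, then ship only the result."""
    return s.archive_compute_s + s.result_bytes / s.network_bytes_s


# Illustrative values: 10 TB input, 1 GB result, 1 Gbit/s link (125 MB/s).
example = Scenario(
    input_bytes=10e12, result_bytes=1e9, network_bytes_s=125e6,
    user_compute_s=3600, archive_compute_s=7200,
)
t_move_first = move_and_compute(example)
t_compute_first = compute_and_move(example)
best = "compute and move" if t_compute_first < t_move_first else "move and compute"

print(f"move and compute: {t_move_first / 3600:.1f} h")
print(f"compute and move: {t_compute_first / 3600:.1f} h")
print(f"faster here: {best}")
```

With these made-up numbers the transfer of raw data dominates, so computing at the archive wins even though the archive-side run is assumed to be twice as slow. Changing the reduction ratio or the bandwidth shifts the break-even point, which is why the slide calls for a quantitative evaluation.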
  • 13. The Quantitative Approach
      • Need to measure the efficiency of NASA canonical architectures vs. future proposed approaches
      • Measure the time to perform analysis and its dependence on specific use case parameters
          – Compare the measurements for NASA canonical architectures vs. proposed approaches
      • Understand the implications for uncertainty quantification to determine the reliability of scientific inferences in these types of architectures
  • 14. Example: Carbon Cycle Model-Observation Comparison
      Gunson, Braverman, Bowman, Cressie
      Science Goal:
          – Understand processes that control CO2 flux
      Strategy:
          – Experiment with models to increase agreement with observations / inferences
      Analysis Challenges:
          – Models and data reside at different locations
          – Data are heterogeneous and must be reconciled as to format, scope, fidelity, resolution, etc.
          – Meaningful comparison requires uncertainty estimation on both observational data and model output
      Architecture Evaluation:
          – Address the analysis challenges to minimize both uncertainty and data movement. A dynamic optimization problem.
  • 15. Western States Water Mission (WSWM): A Science/Data Science Collaboration
      A Scalable Data Processing System for Hydrological Science (Web-Based Interface)
      Diagram elements: Input-Forcing (e.g., GPM); For Data Assimilation (e.g., MODSCAG); Snow-Water Equivalent; Surface Water; Ground Water; Single-Month Estimates; Short and Long-Term Trends; Standard Reports; Ad Hoc Queries and Custom Reports; Research; Applications; Decision Support; Data Science Infrastructure (Tools, Services, Methods for Massive Data Analysis)
  • 16. WaterTrek: Interactive Data Analytics for Hydrology
      Fusing in-situ, airborne, spaceborne and model-generated data using visualization and a big data analytics engine
      Figure labels: SAR-derived Subsidence; In-Situ: Stream Gage Sensors; Model Output: Soil Moisture; River Network; GPS; User Defined Polygon
  • 17. Research Challenges
      Principal objective is to research the relationship between architectural topology and scientific data analysis efficiency to explore new architectural techniques for scaling science-driven data analytics across distributed environments.
      1) For a fixed system architecture, how can one optimize the movement of data and algorithms and estimate the costs?
      2) Which system architectures yield the greatest efficiencies for the types of scientific analyses we wish to support?
      3) How can existing and new computational methods be designed to better exploit the distributed architecture and increase scientific return?
  • 18. Data Science Architecture Evaluation
      • Formalize the capture of science questions as data science computations
      • Develop a model for evaluating data science computations based on data system topologies and constraints
      • Capture and evaluate high priority use cases that are constrained by data science challenges
      • Evaluate the use cases based on the model to assess cost, performance, uncertainty constraints. Identify:
          – Big data computing stack
          – Topology
          – Methodologies
          – Data
      • Establish testbed based on the above output (computing stack, topology/deployment, algorithms, data)
      • Evaluate, tune and deploy
  • 19. Data Science Architectural Considerations: Data Science Architectural Tradeoffs
      [Diagram labels]
      • Use Cases, Scenarios
      • Topology Decisions: Distribution (data, computation); Data Accessibility; Network Capacity; Computational Capacity; Analysis Choreography/Workflow
      • Methodology Decisions
      • Methods: Data Reduction; Feature Extraction; Classification; Detection; Fusion
      • Software and Hardware Decisions: Data Science Analytics Framework; Data Management Capabilities; Storage (e.g., Cloud); Visualization; Algorithms/Data Processing; Server Resources (HPC, etc.); Data Movement Technologies
      • Data: Data Collections; Data Products/Objects/Files; Metadata Products; Data Formats, Data Size
      • Output: Uncertainty, performance, cost, and computing stack based on a set of capacities
      • Big Data Analytics Capability
  • 20. DAWN: Architecture Evaluation
      DAWN (“Distributed Analytics, Workflows and Numerics”) is a model for the simulation, analysis and optimization of data processing system architectures
      • Example: the Science Data System of a generic NASA Earth Observing mission
      • Example: a climate science analysis to be performed over datasets held at distributed locations
      • Source code is a Python package available from JPL GitHub
      • Developed with funding from the JPL Data Science initiative (2014-2016)
      Luca Cinquini, Kyo Lee, Amy Braverman, Mike Turmon, Dan Crichton
  • 21. DAWN Usage
      • Inputs:
          ‣ Formal representation of the system topology (nodes, edges) and the data processing workflow (tasks, their sequence and dependencies)
          ‣ Numerical estimates for algorithm execution times, data volumes and network speed
          ‣ May be provided as an XML document or built through the DAWN Python API
      • Outputs: quantitative evaluation of the system architecture based on several metrics:
          ‣ Overall execution time, with separate computation and data movement times
          ‣ Volumes of data transferred
          ‣ System load (i.e., CPU utilization)
          ‣ Monetary cost (preliminary, subject to market fluctuations)
          ‣ Future: uncertainty
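To show how inputs of this kind map onto the listed output metrics, here is a toy evaluator for a strictly sequential workflow. It does not use or reproduce the DAWN Python API, which is not shown in these slides. The dictionary layout, the `evaluate` function and all numbers are assumptions made for illustration only.

```python
def evaluate(tasks, link_gbps, total_cores):
    """Return simple metrics for a list of tasks executed one after another.

    Each task is a dict: {"kind": "compute", "seconds": s, "cores_used": c}
    or {"kind": "move", "gb": g}. This layout is hypothetical.
    """
    compute_s = 0.0
    move_s = 0.0
    moved_gb = 0.0
    busy_core_seconds = 0.0
    for task in tasks:
        if task["kind"] == "compute":
            compute_s += task["seconds"]
            busy_core_seconds += task["seconds"] * task["cores_used"]
        elif task["kind"] == "move":
            move_s += task["gb"] * 8.0 / link_gbps  # gigabytes -> gigabits
            moved_gb += task["gb"]
        else:
            raise ValueError(f"unknown task kind: {task['kind']!r}")
    total_s = compute_s + move_s
    # System load: fraction of available core-time actually spent computing.
    load = busy_core_seconds / (total_s * total_cores) if total_s > 0 else 0.0
    return {
        "total_s": total_s,
        "compute_s": compute_s,
        "move_s": move_s,
        "moved_gb": moved_gb,
        "load": load,
    }


# Hypothetical workflow: fetch 100 GB, process for 120 s on 4 of 8 cores,
# then return 5 GB of results over the same 10 Gbit/s link.
metrics = evaluate(
    [
        {"kind": "move", "gb": 100.0},
        {"kind": "compute", "seconds": 120.0, "cores_used": 4},
        {"kind": "move", "gb": 5.0},
    ],
    link_gbps=10.0,
    total_cores=8,
)
print(metrics)
```

The low load value in this example reflects cores sitting idle while data moves, which is the kind of inefficiency the separate computation and data movement times are meant to expose.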
  • 22. Applications of DAWN
      • Can be run multiple times while changing the system parameters (number of nodes, cores, network speed, …) to identify the resources needed to achieve a given processing goal
      • Can find bottlenecks in workflow execution to identify computations that need to be optimized or parallelized
      • Can analyze how efficiently CPUs are utilized, to minimize monetary cost
      • Can compare different possible architectures (centralized, distributed, parallel, …) to maximize efficiency
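The first application, rerunning the model over a range of system parameters, amounts to a parameter sweep. The sketch below illustrates the idea with a deliberately crude stand-in for the simulator: execution time is assumed to be a fixed serial part plus a perfectly parallel part divided by the core count. The function names and all numbers are hypothetical and are not taken from DAWN.

```python
from typing import Optional


def estimated_seconds(serial_s: float, parallel_s: float, cores: int) -> float:
    """Crude stand-in for a simulation run: serial part plus evenly split parallel part."""
    return serial_s + parallel_s / cores


def min_cores_for_deadline(serial_s: float, parallel_s: float,
                           deadline_s: float, max_cores: int) -> Optional[int]:
    """Sweep the core count upward and return the first value that meets the deadline."""
    for cores in range(1, max_cores + 1):
        if estimated_seconds(serial_s, parallel_s, cores) <= deadline_s:
            return cores
    return None  # the goal cannot be met within max_cores


# Hypothetical workload: 600 s serial, 36,000 core-seconds parallel.
needed = min_cores_for_deadline(600.0, 36000.0, deadline_s=3600.0, max_cores=64)
impossible = min_cores_for_deadline(600.0, 36000.0, deadline_s=500.0, max_cores=64)
print(needed, impossible)
```

The second call returns `None` because the serial part alone exceeds the deadline. This is the bottleneck case: adding resources does not help until that computation is optimized or parallelized.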
  • 23. Use Case #1: Climate Rainfall Prediction
      • Description: can climate models accurately predict rainfall characteristics (peak intensity and duration) over a given geographic region?
          ‣ Use statistical techniques to compare observations at 4 km resolution from JPL to model output at 4, 12 and 24 km resolution from GSFC
      • DAWN application: DAWN was used to identify the most efficient architecture to execute the analysis (centralized vs distributed)
      • DAWN results:
          ‣ Distributed architecture is 3 orders of magnitude faster than the centralized architecture (hours vs days)
          ‣ In the distributed architecture, the total time is the same for all model resolutions (because the workflow is dominated by processing of the 4 km observations), so there is no advantage in using the lowest possible model resolution
      [Figure panels: Centralized; Distributed]
  • 24. Use Case #2: ECOSTRESS
      • Description: ECOSTRESS is an upcoming NASA mission that will use a thermal radiometer aboard the ISS to study plant-water dynamics and variation in ecosystems due to climate change and water availability
      • DAWN application: DAWN was used to analyze and optimize the L0-L2 data processing pipeline to determine which computing resources are needed
      • DAWN results:
          ‣ Identified PGE #4 (“geo-location”) as a critical bottleneck that must be parallelized over scenes
          ‣ 2 servers with 8 cores each are sufficient to keep up with the incoming data stream
          ‣ When reprocessing all the data, the most efficient architecture is distributed (a cluster of nodes allocated to execute specific tasks)
          ‣ Optimal allocation of 36 nodes is 1 - 3 - 24 - 8
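An allocation such as 1 - 3 - 24 - 8 can be thought of as the answer to a small search problem: split a fixed number of nodes across pipeline stages so that the slowest stage is as fast as possible. The brute-force sketch below illustrates that idea only. It is not the method DAWN uses, and the per-stage workloads are invented, deliberately chosen in the ratio 1:3:24:8 so that the search lands on the same shape as the slide. They are not ECOSTRESS measurements.

```python
from itertools import combinations


def compositions(total: int, parts: int):
    """Yield every way to split `total` nodes into `parts` positive integers."""
    for cuts in combinations(range(1, total), parts - 1):
        bounds = (0,) + cuts + (total,)
        yield tuple(bounds[i + 1] - bounds[i] for i in range(parts))


def best_allocation(stage_work, total_nodes):
    """Return (allocation, bottleneck) minimizing the slowest stage's time.

    Assumes each stage's work divides evenly over the nodes assigned to it.
    """
    best = None
    for alloc in compositions(total_nodes, len(stage_work)):
        bottleneck = max(work / nodes for work, nodes in zip(stage_work, alloc))
        if best is None or bottleneck < best[1]:
            best = (alloc, bottleneck)
    return best


# Hypothetical node-seconds of work per stage for one unit of data.
hypothetical_work = [10.0, 30.0, 240.0, 80.0]
allocation, bottleneck_s = best_allocation(hypothetical_work, total_nodes=36)
print(allocation, bottleneck_s)
```

Exhaustive enumeration is affordable here (6,545 candidate allocations for 36 nodes and 4 stages). For larger clusters a greedy or analytical approach would be more practical.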
  • 25. Conclusions
      • There is a need to quantitatively model software architectures to understand tradeoffs that affect both scalability and scientific analysis
          – Initial efforts at JPL have led to the exploration of use cases and the initial DAWN model
          – JPL is working with CMU/David Garlan to explore formalized modeling approaches
              • Amy Braverman and Dan Crichton are calling this a “theory of data systems”
          – There is significant interest in understanding how tradeoffs in analysis approaches affect uncertainty
      • Understanding such tradeoffs will lead to new approaches to developing climate data analysis capabilities and will help NASA and other organizations understand how to optimize their data infrastructures.
      • We believe there is ample opportunity to form a multi-disciplinary team to further the study of these challenges.
  • 26. Backup
  • 27. DAWN Results for Batch Re-Processing
      DAWN simulated the processing of 10 days of data (150 orbits)
      • Distributed architecture can be more efficient than parallel or centralized architecture
      • Degree of distribution is critical
      [Figure panels]
      Parallel architecture: t = 46,140 sec = 12.8 hours
      Centralized architecture: t = 64,920 sec = 18.0 hours
      Best distributed architecture: t = 31,260 sec = 8.7 hours
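The three timings above can be checked and compared directly. The short script below converts the quoted seconds to hours and derives the speedup of the best distributed architecture relative to the other two. The speedup figures are computed here from the slide's numbers and do not appear in the original deck.

```python
# Simulated batch re-processing times quoted on the slide, in seconds.
timings_s = {
    "Centralized": 64920,
    "Parallel": 46140,
    "Best distributed": 31260,
}

# Convert to hours, rounded to one decimal place as on the slide.
hours = {name: round(seconds / 3600, 1) for name, seconds in timings_s.items()}

# Speedup of the best distributed architecture over each alternative.
best_s = timings_s["Best distributed"]
speedup = {name: round(seconds / best_s, 2) for name, seconds in timings_s.items()}

for name in timings_s:
    print(f"{name}: {hours[name]} h, {speedup[name]}x the best distributed time")
```

In this simulation the best distributed configuration is roughly twice as fast as the centralized one and about one and a half times as fast as the parallel one.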
  • 28. DAWN Results for Real-Time Processing
      DAWN simulated the processing of 1 day of data (15 orbits, 142 scenes)
      • PGE #4 (“geo-location”) is a critical bottleneck and must be parallelized over scenes
      • 2 servers with 8 cores each are sufficient to keep up with the incoming data stream
      • Because of the time delay between orbits, additional servers do not increase efficiency
      [Figure panels: PGE #4 sequential (no orbit offset); PGE #4 parallel (no orbit offset); PGE #4 parallel (with orbit offset); Execution time vs number of nodes; CPU load vs time]
  • 29. Data Science Architecture Planning
      [Diagram labels] Massive Data Science Question/Challenge?; Architectural Assessment/Tradeoff; Evaluate Question in Data Science Testbed; Computational Capability
      Stage 1: Capture big data characteristics and constraints
      Stage 2: Assess hardware, software, topology and methodology configurations based on formalization of the uncertainty, cost, performance and capacity tolerances
      Stage 3: Establish a testbed for the data science question based on the architectural assessment, to support software, computation, data and algorithm integration
      Stage 4: Deploy
  • 30. “Topology” Model (ExArch Meeting, October 2012)
      • Nodes: servers, providing computation and storage
          ‣ Each server can have multiple cores, which process data independently
          ‣ Server queue distributes jobs to available cores
      • Edges: network connecting nodes
          ‣ Tunable parametric speed
      • Algorithms: computations to be executed on data
          ‣ Defined by benchmarked execution time (on a given server type)
      • Datasets: inputs/outputs to processing algorithms
          ‣ Each algorithm can have multiple inputs and outputs
      • Tasks: operations to be performed on the data (computation or movement)
          ‣ Support for concurrent data processing and data movement
          ‣ Support for task dependency
      • Workflows: structured combination of tasks
          ‣ Can run tasks in sequence or in parallel
          ‣ Support for nested sub-workflows
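The entities in this model translate naturally into a few small data structures. The sketch below is one possible rendering, not DAWN's actual classes: all class and function names are hypothetical, the timings are invented, and the model is simplified by ignoring the server queue (it assumes parallel branches never compete for the same cores).

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    name: str
    cores: int


@dataclass
class Edge:
    source: str
    target: str
    gbps: float  # tunable network speed, in gigabits per second


@dataclass
class ComputeTask:
    name: str
    node: Node
    seconds: float  # benchmarked execution time on this server type


@dataclass
class MoveTask:
    name: str
    edge: Edge
    gigabytes: float


@dataclass
class Workflow:
    name: str
    mode: str  # "sequence" or "parallel"
    steps: List[Union[ComputeTask, MoveTask, "Workflow"]] = field(default_factory=list)


def duration(step) -> float:
    """Estimated seconds for a task or a (possibly nested) workflow."""
    if isinstance(step, ComputeTask):
        return step.seconds
    if isinstance(step, MoveTask):
        return step.gigabytes * 8.0 / step.edge.gbps
    if step.mode not in ("sequence", "parallel"):
        raise ValueError(f"unknown workflow mode: {step.mode!r}")
    times = [duration(child) for child in step.steps]
    if not times:
        return 0.0
    # Sequential steps add up; parallel steps finish when the slowest one does.
    return sum(times) if step.mode == "sequence" else max(times)


# Hypothetical two-site analysis: reduce locally at both sites in parallel,
# ship 10 GB of reduced output from site B to site A, then merge at site A.
site_a, site_b = Node("site_a", cores=8), Node("site_b", cores=8)
link = Edge("site_b", "site_a", gbps=1.0)
local = Workflow("local_reduction", "parallel", [
    ComputeTask("reduce_a", site_a, seconds=100.0),
    ComputeTask("reduce_b", site_b, seconds=150.0),
])
analysis = Workflow("analysis", "sequence", [
    local,
    MoveTask("ship_b_to_a", link, gigabytes=10.0),
    ComputeTask("merge", site_a, seconds=20.0),
])
print(duration(analysis))
```

Nesting `local` inside `analysis` mirrors the "nested sub-workflows" bullet. A fuller model would also track which cores each task occupies, so that concurrent tasks on the same node queue up rather than overlap for free.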