A general framework for predicting the optimal computing configuration for climate-driven ecological forecasting models: Scott Farley's Masters Thesis Defense
This document summarizes Scott Farley's master's thesis presentation on developing a framework to predict optimal computing configurations for ecological forecasting models under climate change. The presentation discusses species distribution modeling and biodiversity informatics, describes challenges posed by big biodiversity data, and proposes using computational performance models to identify the hardware configuration that maximizes model accuracy while minimizing time and costs. The goal is to efficiently run ecological forecasting models on flexible cloud computing resources.
1. A general framework for predicting the optimal computing configuration for climate-driven ecological forecasting models
Scott Farley
Department of Geography
University of Wisconsin – Madison
Master of Science
Cartography & GIS
Public Talk
April 17, 2017
6. Given flexible computing resources and
massive data stores, what is the most efficient
computing hardware on which to run
ecological forecasting models?
Cloud
Computing
Species
Distribution
Modeling
Biodiversity
Informatics
7. Motivation
Figure adapted from: Urban, M. C. (2015). Accelerating extinction risk from climate change. Science, 348(6234), 571-573.
1 in 6
species is likely to go extinct due to
climate change.
8. Motivation
Figure adapted from: Dickson et al. (2014). Towards a global map of natural capital: key ecosystem assets. United Nations Environment Programme. 1-33.
Global ecosystem services are valued at $125 trillion/year.
12. Biodiversity Informatics
Community efforts to manage, archive, analyze, distribute, and interpret primary data regarding life
Neotoma Paleoecology
Database
13. Paleobiodiversity Data
Figure adapted from: Plumer, B. (2014). There have been five mass extinctions in Earth's history. Now we're facing a sixth. The Washington Post.
https://www.washingtonpost.com/news/wonk/wp/2014/02/11/there-have-been-five-mass-extinctions-in-earths-history-now-were-facing-a-sixth
14. Paleobiodiversity Data
Figure adapted from: Booth, R. K., Brewer, S., Blaauw, M., Minckley, T. A., & Jackson, S. T. (2012). Decomposing the mid‐Holocene Tsuga decline in eastern North America. Ecology, 93(8), 1841-1852.
15. Recent Growth in Biodiversity Databases
Added 65.8 million records in 2015.
[Chart: millions of records by date of record, 1500–1900]
16. Recent Growth in Biodiversity Databases
Added 65.8 million records in 2015; the Neotoma Paleoecology Database has added 1.5 million fossil records since 2010.
[Charts: millions of records by date of record (1500–1900) and by date of accession (2010–2015)]
17–22. The Four V's of Big Data
Volume: size of the data ✓
Veracity: uncertainty of the data ✓
Variety: heterogeneity of the data and complexity of interrelationships ✓
Velocity: sensitivity of data analysis to time ✗ (biodiversity data does not typically require real-time analysis)
24–25. Inductive Learning
Predict future response values (Y) given a set of potential covariates (X).
Training Examples: X and Y both known.
Build Model: estimate the functional relationship from X and Y by minimizing a loss criterion (y − ŷ).
Future Cases: only X is known; use the approximated functional relationship to estimate new Y's (ŷ) from X.
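The train-then-predict loop described above can be sketched in a few lines. This is a generic illustration with made-up numbers (ordinary least squares on a single covariate), not the thesis implementation:

```python
# Minimal sketch of inductive learning: estimate f from training pairs
# (X, Y) by minimizing squared loss (y - y_hat), then predict for future
# cases where only X is known. Illustrative toy data only.

def fit(xs, ys):
    """Estimate slope and intercept by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict(model, xs_new):
    """Apply the approximated functional relationship to new X's."""
    slope, intercept = model
    return [slope * x + intercept for x in xs_new]

# Training examples: X and Y both known.
model = fit([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
# Future cases: only X is known; estimate new Y-hats.
y_hat = predict(model, [5, 6])
```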
26. Species Distribution Modeling
Training Examples + Environmental Covariates → SDM Algorithm
SDM Algorithm + Predicted Future Environment → Predicted Future Distribution
Predict future distribution of species from observations of current (or fossil)
distribution and environmental/climatic covariates.
Figure source: https://www.unil.ch/idyst/en/home/menuinst/research-topics/geoinformatics-and-spatial-m/predictive-biogeography/advancing-the-science-of-eco.html
27–29. Types of SDM
Model Driven: fit a parametric statistical model to a dataset; make assumptions about the form of the functional relationship between inputs and output. Examples: linear regression, generalized linear models, logistic regression.
Data Driven: estimate the relationship between inputs and outputs from the data; high sensitivity to small changes in input data. Examples: regression trees, artificial neural networks, MaxEnt.
Bayesian: estimate the relationship between inputs and outputs as a probability distribution using prior knowledge and new data; formally account for model uncertainty. Examples: Gaussian random fields, community full joint distribution modeling.
30. Algorithms in Contemporary SDM Literature
In 100 randomly sampled SDM papers…
[Bar chart: frequency of Bayesian, data-driven, model-driven, and other algorithms]
31. Cloud Computing
Enables convenient, on-demand access to configurable computing resources.
Rapid provisioning and release with minimal management effort.
Recent growth supported by federal agencies and public cloud providers.
32–33. Hypothesis
For each SDM, there exists an optimal data-hardware configuration that:
1. Maximizes SDM accuracy
2. Balances the tradeoff between performance and expense by jointly minimizing the time and cost of modeling
Data: training examples, number of covariates
Hardware: CPU cores, memory
SDMs: random forests, boosted regression trees, generalized additive models, adaptive regression splines
34–37. Methods
1. Build a large empirical dataset of SDM accuracy and runtime
2. Build a model of computing cost
3. Build two computational performance models (CPMs):
– Accuracy (AUC): expected accuracy using given data
– Runtime (seconds): expected execution time on a given data-hardware configuration
4. Use the CPMs to identify the optimal data-hardware configuration
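As a rough illustration, the four steps can be laid out as a toy pipeline. Every function name, pricing rate, and coefficient below is an invented stand-in, not the thesis's actual cost or performance model:

```python
# Hypothetical sketch of the four-step method. The cost model mimics a
# per-core/per-GB hourly pricing scheme; the CPM surrogates are toy
# formulas standing in for the fitted BART models.

def cost_model(runtime_s, cores, memory_gb, core_rate=0.02, gb_rate=0.003):
    """Step 2: deterministic link from runtime to dollar cost (toy rates)."""
    hours = runtime_s / 3600.0
    return hours * (cores * core_rate + memory_gb * gb_rate)

def runtime_cpm(n_examples, cores):
    """Step 3: expected runtime for a data-hardware configuration (toy)."""
    return 0.01 * n_examples / max(cores, 1) + 5.0

def choose_optimal(configs):
    """Step 4: pick the configuration minimizing a joint time+cost score."""
    def score(cfg):
        t = runtime_cpm(cfg["n"], cfg["cores"])
        return t + 100 * cost_model(t, cfg["cores"], cfg["mem"])
    return min(configs, key=score)

best = choose_optimal([
    {"n": 20000, "cores": 1, "mem": 4},
    {"n": 20000, "cores": 8, "mem": 16},
])
```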
38. Performance and Accuracy Modeling Framework
Bayesian Additive Regression Trees: an additive tree inductive learning model in a Bayesian framework.
Yields a probability density of SDM execution time or accuracy under given input conditions.
40–41. Runtime CPM Drivers
For each predictor, build a CPM without that predictor and compare its skill to the skill of the full model.
Models compared: generalized additive models, boosted regression trees, adaptive regression splines, random forests.
Random forests can execute in parallel.
44. Choosing the Optimal Hardware for an SDM
Predictors (hardware, data) → CPMs (accuracy CPM, performance CPM) → Optimization (time, cost, uncertainty) → Result (optimal configuration)
45. Choosing the Optimal Hardware for an SDM
1. Identify the data configuration of training examples and covariates that will maximize accuracy
46. Choosing the Optimal Hardware for an SDM
2. Predict the execution time of that configuration on different hardware configurations
[Chart: memory (GB) vs. number of CPU cores for each unique hardware configuration]
47. Choosing the Optimal Hardware for an SDM
3. Hierarchical clustering on time, cost, and posterior SD (spread)
[Dendrogram: dissimilarity across unique hardware configurations]
48. Choosing the Optimal Hardware for an SDM
4. Calculate each cluster's mean distance from the origin; choose the cluster closest to the origin.
Optimal point: no time, no cost, no uncertainty.
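The selection rule can be sketched directly: average each cluster's points in (time, cost, posterior SD) space and keep the cluster nearest the origin. The cluster coordinates below are invented for illustration:

```python
# Sketch of step 4: pick the cluster whose mean point in scaled
# (time, cost, posterior-SD) space is closest to the origin, i.e. the
# ideal of no time, no cost, no uncertainty. Toy cluster data.
import math

def cluster_mean(points):
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def distance_from_origin(point):
    return math.sqrt(sum(v * v for v in point))

clusters = {
    "A": [(0.2, 0.1, 0.3), (0.3, 0.2, 0.2)],  # fast, cheap, certain
    "B": [(0.9, 0.8, 0.7), (0.8, 0.9, 0.6)],  # slow, costly, uncertain
}
optimal = min(clusters,
              key=lambda c: distance_from_origin(cluster_mean(clusters[c])))
```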
49. Optimal Configuration for Each SDM
[Chart: CPU cores vs. memory (GB) for generalized additive models, boosted regression trees, adaptive regression splines, and random forests]
50. MARS: An Unresolved Quandary
Multivariate Adaptive Regression Splines show no incremental preference for higher memory.
[Chart: CPU cores vs. memory (GB)]
52. Recommendations
Redevelop models at the code-infrastructure interface to leverage high-performance computing technologies.
Prioritize model efficiency along with ecological realism in future development.
Cloud computing offers the ability to run models on the right resources, not just the convenient ones.
Promote extensions of this framework.
54. Bayesian Additive Tree Model Structure and Priors
1. Node depth prior: the probability that a node at depth d is nonterminal is α(1 + d)^−β, where α ∈ (0, 1) and β ∈ [0, ∞).
2. Leaf-value prior: P(M_t | T_t): μ_l ~ N(μ_μ / m, σ_μ²).
μ_μ is chosen as the range center, (y_min + y_max)/2.
σ_μ² is chosen empirically so that the range center plus or minus k = 2 standard deviations covers 95% of the response values in the training set.
3. Error variance prior: σ² ~ InvGamma(ν/2, νλ/2).
λ is determined from the data so that there is a q = 90% a priori chance (by default) that the BART model will improve on the RMSE of an ordinary least squares regression.
55. Bayesian Additive Tree Model Structure and Priors
5. Response likelihood: the response in a leaf in a given MCMC iteration is drawn around the leaf mean: y_l ~ N(μ_l, σ²).
6. Hyperparameters: α = 0.95, β = 2, k = 2, ν = 3, q = 90%.
R package: bartMachine
Citation: Kapelner, A., & Bleich, J. (2016). bartMachine: Machine Learning with Bayesian Additive Regression Trees. Journal of Statistical Software, 70(4), 1-40. doi:10.18637/jss.v070.i04
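To see how the node-depth prior enforces shallow trees, one can evaluate α(1 + d)^−β at increasing depths with the default hyperparameters; the splitting probability falls off steeply:

```python
# Sketch of the BART node-depth prior: the probability that a node at
# depth d is nonterminal (splits) is alpha * (1 + d) ** (-beta).
# With the defaults alpha = 0.95, beta = 2 this decays quickly,
# which is what favors shallow trees.

def split_prob(d, alpha=0.95, beta=2.0):
    return alpha * (1 + d) ** (-beta)

probs = [split_prob(d) for d in range(4)]
# Depth 0 splits with probability 0.95; by depth 3 the probability is
# already below 0.06.
```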
56. MARS Remedial Measures
• Resample so each configuration has n=1 observations
– Completely covering parameter space
– Need to reduce the influence of the imbalanced design
– No qualitative change in results
• Future steps
– Recollect dataset using a balanced design with multiple replicates
57. Distributed Computing for SDM: A Semi-Automated Workflow
1. Configure and build virtual instances (Master Compute Node ↔ Central Database)
A. Which configurations have experiments that are not marked as COMPLETED? [GET /nextconfig]
B. Return the next configuration with experiments not marked COMPLETE. [json]
C. Create an instance group with the specified vCPU and memory. [gcloud create]
2. Run simulations and report results (Compute Node ↔ Central Database)
A. Select a random experiment within the machine's computing capabilities.
B. Return the experiment specification. [json]
C. Report time and accuracy measures to the database.
RScript TimeSDM function: load variables → fit BRT model → predict to 2100 → evaluate accuracy, reporting fit time, predict time, accuracy time, and total time.
3. Manage virtual infrastructure (Master Compute Node ↔ Central Database)
A. What percent of the experiments in this configuration have been completed? [GET configstatus/cores/memory]
B. Return percentage completion. [json]
Poll every 30 seconds; if percent == 100, destroy the instances, instance group, and instance template. [gcloud delete]
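The master node's control loop might be sketched as follows. The FakeAPI backend and all function names are hypothetical stand-ins for the real HTTP endpoints and gcloud calls, kept abstract so the loop structure is testable:

```python
# Hypothetical sketch of the master-node control loop: fetch the next
# unfinished configuration, provision an instance group, poll completion
# percentage, then tear down. FakeAPI simulates the central database.

def run_master(api, provision, teardown, poll=lambda: None):
    while True:
        cfg = api.next_config()          # [GET /nextconfig]
        if cfg is None:
            break                        # every experiment COMPLETED
        provision(cfg)                   # [gcloud create] instance group
        while api.config_status(cfg) < 100:
            poll()                       # poll every 30 seconds
        teardown(cfg)                    # [gcloud delete]

class FakeAPI:
    """Stand-in for the central database's HTTP endpoints."""
    def __init__(self, configs):
        self.pending = list(configs)
        self.progress = {}

    def next_config(self):
        return self.pending[0] if self.pending else None

    def config_status(self, cfg):
        pct = self.progress.get(cfg, 0) + 50  # pretend workers progress
        self.progress[cfg] = pct
        if pct >= 100:
            self.pending.remove(cfg)
        return pct

created, deleted = [], []
api = FakeAPI(["4cpu-16gb", "8cpu-32gb"])
run_master(api, created.append, deleted.append)
```

Separating the loop from the provisioning calls keeps the same skeleton usable whether the backend is a real REST service or a simulation.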
58. Clustering Specifications
• Axes: Run time, run cost, run time prediction standard deviation
• Distance Metric: Euclidean
• Linkage: Complete
• Splitting rule: Silhouette (Rousseeuw, 1987): Maximize between
cluster variance, minimize within cluster variance
• Initially scale and center to reduce the effect of axes with different dimensions.
• Clustering package: base R (function hclust)
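A minimal pure-Python restatement of this specification, for readers without R at hand. The thesis used base R's hclust; the configuration values below are invented, and the silhouette splitting rule is replaced by a fixed cluster count for brevity:

```python
# Sketch of the clustering specification: scale and center each axis,
# then agglomerate under Euclidean distance with complete linkage.
import math

def scale(points):
    """Center and scale each axis so no dimension dominates."""
    dims = len(points[0])
    cols = [[p[i] for p in points] for i in range(dims)]
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
           for c, m in zip(cols, means)]
    return [tuple((p[i] - means[i]) / sds[i] for i in range(dims))
            for p in points]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(ca, cb, pts):
    """Cluster distance = maximum pairwise point distance."""
    return max(euclidean(pts[i], pts[j]) for i in ca for j in cb)

def hcluster(points, k):
    """Agglomerate scaled points until k clusters remain."""
    pts = scale(points)
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > k:
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: complete_linkage(
                       clusters[ij[0]], clusters[ij[1]], pts))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# (time, cost, posterior SD) for four hypothetical configurations
configs = [(1.0, 1.0, 0.1), (1.1, 0.9, 0.2),
           (9.0, 8.0, 3.0), (8.5, 8.2, 2.9)]
groups = hcluster(configs, k=2)
```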
60. Ecological Informatics
The scope of the journal takes into account the data-intensive nature of ecology, the
precious information content of ecological data, the growing capacity of computational
technology to leverage complex data as well as the critical need for informing sustainable
management in spite of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and
informatics.
The journal invites papers on:
• novel concepts and tools for monitoring, acquisition, management, analysis and
synthesis of ecological data, including genomic and paleo-ecological data,
• understanding ecosystem functioning and evolution, and
• informing decisions on environmental issues like sustainability, climate change and
biodiversity.
Impact Factor: 1.683
61. Environmental Modeling and Software
Impact Factor: 4.207
The aim is to improve our capacity to represent, understand, predict or manage the
behaviour of environmental systems at all practical scales, and to communicate those
improvements to a wide scientific and professional audience.
• Generic and pervasive frameworks, techniques and issues - including system
identification theory and practice, model conception, model integration, model and/or
software evaluation, sensitivity and uncertainty assessment, visualization, scale and
regionalization issues.
• Artificial Intelligence (AI) techniques and systems, such as knowledge-based systems /
expert systems, case-based reasoning systems, data mining, multi-agent systems,
Bayesian networks, artificial neural networks, fuzzy logic, or knowledge elicitation and
knowledge acquisition methods.
• Decision support systems and environmental information systems- implementation and
use of environmental data and models to support all phases and aspects of decision
making, in particular supporting group and participatory decision making processes.
Intelligent Environmental Decision Support Systems can include qualitative, quantitative,
mathematical, statistical, AI models and meta-models.
62. Computers and Geosciences
Impact Factor: 2.474
Publications should apply modern computer science paradigms, whether
computational or informatics-based, to address problems in the geosciences.
• Computational/informatics elements may include: computational methods; algorithms;
data structure; database retrieval; information retrieval; data processing; artificial
intelligence; computer graphics; computer visualization; programming languages;
parallel systems; distributed systems; the World-Wide Web; social media; and software
engineering.
• Geoscientific topics of interest include: mineralogy; petrology; geochemistry;
geomorphology; paleontology; stratigraphy; structural geology; sedimentology;
hydrogeology; oceanography; atmospheric sciences; climatology; meteorology;
geophysics; geomatics; remote sensing; geodesy; hydrology; and glaciology.
63. Computers, Environment and Urban Systems
Impact Factor: 2.092
Innovative computer-based research on urban systems, systems of cities, and built and natural environments that privileges the geospatial perspective.
Applied and theoretical contributions demonstrate the scope of computer-based analysis
fostering a better understanding of urban systems, the synergistic relationships between
built and natural environments, their spatial scope and their dynamics.
Contributions emphasizing the development and enhancement of computer-based
technologies for the analysis and modeling, policy formulation, planning, and
management of environmental and urban systems that enhance sustainable futures are
especially sought. The journal also encourages research on the modalities through which
information and other computer-based technologies mold environmental and urban
systems.
64. Applied Artificial Intelligence
Impact Factor: 0.540
Addresses concerns in applied research and applications of artificial intelligence (AI).
Articles highlight advances in uses of AI systems for solving tasks in management,
industry, engineering, administration, and education; evaluations of existing AI systems
and tools, emphasizing comparative studies and user experiences; and the economic,
social, and cultural impacts of AI. Papers on key applications, highlighting methods, time
schedules, person-months needed, and other relevant material are welcome.
65–67. Species Distribution Models
“Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.”
Elith and Leathwick (2009)
[Figure: a model fit from training examples and environmental covariates, then applied to a 21st-century climate scenario to predict the probability of presence in 2100]
70. What is a regression tree?
Regression trees rely on recursive binary partitioning of predictor space into a set of hyperrectangles in order to
approximate some unknown function f.
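A minimal sketch of that idea in one dimension: greedily choose the split that most reduces squared error, recurse on each side, and predict with the leaf mean (a piecewise-constant approximation of f). Illustrative only:

```python
# Toy regression tree: recursive binary partitioning of predictor space,
# with the leaf mean as the prediction in each resulting region.

def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def build(xs, ys, min_leaf=2, depth=0, max_depth=2):
    pairs = sorted(zip(xs, ys))
    if depth == max_depth or len(pairs) < 2 * min_leaf:
        return {"leaf": sum(ys) / len(ys)}
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        left, right = pairs[:i], pairs[i:]
        score = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or score < best[0]:
            best = (score, left, right)
    _, left, right = best
    split = (left[-1][0] + right[0][0]) / 2
    return {"split": split,
            "left": build([x for x, _ in left], [y for _, y in left],
                          min_leaf, depth + 1, max_depth),
            "right": build([x for x, _ in right], [y for _, y in right],
                           min_leaf, depth + 1, max_depth)}

def predict(node, x):
    if "leaf" in node:
        return node["leaf"]
    return predict(node["left"] if x <= node["split"] else node["right"], x)

# Step function: f(x) = 0 for x < 5, 10 for x >= 5
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 10, 10, 10, 10]
tree = build(xs, ys)
```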
Editor's Notes
SDMs are most often run on laptops/lab desktops for multiuse with little attention paid to the optimal strategy for SDMs in particular.
Cloud computing offers a convenient way to provision and release configurable resources.
Gives users the opportunity to get the right tool for the job by renting space on virtual machines rather than purchasing hardware.
Led by pushes from major federal funding agencies like NSF. "Cloud first" strategy. NSF -> $20 million to cloud computing. NASA: lots of research. OMB 25-point plan to reduce barriers to entry to cloud computing. Amazon EC2 now hosts terabytes of scientific data on its public clouds.
For each SDM, there exists an optimal data/hardware configuration that:
Maximizes classification accuracy
Balances tradeoffs between performance and cost by jointly minimizing cost ($) and time (hr)
Hypotheses:
That an optimal exists
That accuracy will depend only on data
That execution time will depend on hardware and data
That Random forests will see performance gains on multicores, while other SDMs will not, since RF is able to execute in parallel.
Gathered empirical data on SDM runtime and accuracy for ~30,000 SDM simulation experiments. For each experiment, evaluated classification accuracy and measured runtime. Approximately evenly split over all four models. Ran on the Google Cloud infrastructure.
Used the pricing scheme of Google Compute Engine to build a deterministic link between the time of computation and the cost of computation.
Build empirical dataset of SDM accuracy and runtime.
Build model of computing cost
Build component predictive models
Accuracy
Runtime
Use model to predict runtime and accuracy on many algorithm-hardware configurations.
Choose optimal configuration
Yield probability density of runtime and accuracy under different configurations.
Amount of data used to fit the models is most important
Drivers vary amongst models
Number of cores is important for random forests
Hardware is not influential for GAM, GBM
Weird MARS memory affinity
Node depth prior enforces shallow trees
Leaf-Value prior provides regularization so that single trees do not dominate
Error variance prior provides additional insurance against overfitting.
Species Distribution Models (SDMs) are a class of statistical models that quantify the relationships between a species and its environmental range determinants
Supervised machine-learning/statistical models
Austin (2007) has posited that a solid foundation of ecological theory is essential to the correct prediction and interpretation of species distribution models. He notes that the ecological underpinnings of the statistics are, perhaps, more important than the statistical method itself.
Elith & Leathwick (2009a) further suggest that additional improvements in species distribution modeling will come from the incorporation of additional, ecologically relevant information in the statistical model itself and in the covariates used to fit it. Indeed, “further advances in SDM are more likely to come from better integration of theory, concepts, and practice than from improved methods per se” (Elith & Leathwick, 2009a).
People are doing many hundreds or thousands of species under many different warming scenarios.