A general framework for predicting the optimal computing configuration for climate-driven ecological forecasting models: Scott Farley's Masters Thesis Defense
This document summarizes Scott Farley's master's thesis presentation on developing a framework to predict optimal computing configurations for ecological forecasting models under climate change. The presentation discusses species distribution modeling and biodiversity informatics, describes challenges posed by big biodiversity data, and proposes using computational performance models to identify the hardware configuration that maximizes model accuracy while minimizing time and costs. The goal is to efficiently run ecological forecasting models on flexible cloud computing resources.
1. A general framework for predicting the optimal computing configuration for climate-driven ecological forecasting models
Scott Farley
Department of Geography
University of Wisconsin – Madison
Master of Science
Cartography & GIS
Public Talk
April 17, 2017
6. Given flexible computing resources and
massive data stores, what is the most efficient
computing hardware on which to run
ecological forecasting models?
Cloud
Computing
Species
Distribution
Modeling
Biodiversity
Informatics
7. Motivation
Figure adapted from: Urban, M. C. (2015). Accelerating extinction risk from climate change. Science, 348(6234), 571-573.
1 in 6
species is likely to go extinct due to
climate change.
8. Motivation
Figure adapted from: Dickson et al. (2014). Towards a global map of natural capital: key ecosystem assets. United Nations Environment Programme. 1-33.
Global ecosystem services are valued at $125 trillion/year.
12. Biodiversity Informatics
Community efforts to manage, archive, analyze, distribute, and interpret primary data regarding life
Neotoma Paleoecology
Database
13. Paleobiodiversity Data
Figure adapted from: Plumer, B. (2014). There have been five mass extinctions in Earth's history. Now we're facing a sixth. The Washington Post.
https://www.washingtonpost.com/news/wonk/wp/2014/02/11/there-have-been-five-mass-extinctions-in-earths-history-now-were-facing-a-sixth
14. Paleobiodiversity Data
Figure adapted from: Booth, R. K., Brewer, S., Blaauw, M., Minckley, T. A., & Jackson, S. T. (2012). Decomposing the mid‐Holocene Tsuga decline in eastern North America. Ecology, 93(8), 1841-1852.
15. Recent Growth in Biodiversity Databases
Added 65.8 million records in 2015.
[Chart: millions of records by date of record, 1500–1900]
16. Recent Growth in Biodiversity Databases
Added 65.8 million records in 2015; the Neotoma Paleoecology Database has added 1.5 million fossil records since 2010.
[Charts: millions of records by date of record (1500–1900) and by date of accession (2010–2015)]
17–22. The Four V's of Big Data
Volume: size of the data ✓
Veracity: uncertainty of the data ✓
Variety: heterogeneity of the data and complexity of interrelationships ✓
Velocity: sensitivity of data analysis to time ✗ (biodiversity data does not typically require real-time analysis)
24–25. Inductive Learning
Predict future response values (Y) given a set of potential covariates (X).
Training Examples: X and Y both known.
Build Model: estimate the functional relationship from X and Y by minimizing a loss criterion (y − ŷ).
Future Cases: only X is known; use the approximated functional relationship to estimate new Y's (ŷ) from X.
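The train-then-predict loop described above can be sketched in a few lines. This is a generic illustration with made-up numbers (ordinary least squares on a single covariate), not the thesis implementation:

```python
# Minimal sketch of inductive learning: estimate f from training pairs
# (X, Y) by minimizing squared loss (y - y_hat), then predict for future
# cases where only X is known. Illustrative toy data only.

def fit(xs, ys):
    """Estimate slope and intercept by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict(model, xs_new):
    """Apply the approximated functional relationship to new X's."""
    slope, intercept = model
    return [slope * x + intercept for x in xs_new]

# Training examples: X and Y both known.
model = fit([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
# Future cases: only X is known; estimate new Y-hats.
y_hat = predict(model, [5, 6])
```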
26. Species Distribution Modeling
Training Examples + Environmental Covariates → SDM Algorithm
SDM Algorithm + Predicted Future Environment → Predicted Future Distribution
Predict future distribution of species from observations of current (or fossil)
distribution and environmental/climatic covariates.
Figure source: https://www.unil.ch/idyst/en/home/menuinst/research-topics/geoinformatics-and-spatial-m/predictive-biogeography/advancing-the-science-of-eco.html
27–29. Types of SDM
Model Driven: fit a parametric statistical model to a dataset; make assumptions about the form of the functional relationship between inputs and output. Examples: linear regression, generalized linear models, logistic regression.
Data Driven: estimate the relationship between inputs and outputs from the data; high sensitivity to small changes in input data. Examples: regression trees, artificial neural networks, MaxEnt.
Bayesian: estimate the relationship between inputs and outputs as a probability distribution using prior knowledge and new data; formally account for model uncertainty. Examples: Gaussian random fields, community full joint distribution modeling.
30. Algorithms in Contemporary SDM Literature
In 100 randomly sampled SDM papers…
[Bar chart: frequency of Bayesian, data-driven, model-driven, and other algorithms]
31. Cloud Computing
Enables convenient, on-demand access to configurable computing resources.
Rapid provisioning and release with minimal management effort.
Recent growth supported by federal agencies and public cloud providers.
32–33. Hypothesis
For each SDM, there exists an optimal data-hardware configuration that:
1. Maximizes SDM accuracy
2. Balances the tradeoff between performance and expense by jointly minimizing the time and cost of modeling
Data: training examples, number of covariates
Hardware: CPU cores, memory
SDMs: random forests, boosted regression trees, generalized additive models, adaptive regression splines
34–37. Methods
1. Build a large empirical dataset of SDM accuracy and runtime
2. Build a model of computing cost
3. Build two computational performance models (CPMs):
– Accuracy (AUC): expected accuracy using given data
– Runtime (seconds): expected execution time on a given data-hardware configuration
4. Use the CPMs to identify the optimal data-hardware configuration
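As a rough illustration, the four steps can be laid out as a toy pipeline. Every function name, pricing rate, and coefficient below is an invented stand-in, not the thesis's actual cost or performance model:

```python
# Hypothetical sketch of the four-step method. The cost model mimics a
# per-core/per-GB hourly pricing scheme; the CPM surrogates are toy
# formulas standing in for the fitted BART models.

def cost_model(runtime_s, cores, memory_gb, core_rate=0.02, gb_rate=0.003):
    """Step 2: deterministic link from runtime to dollar cost (toy rates)."""
    hours = runtime_s / 3600.0
    return hours * (cores * core_rate + memory_gb * gb_rate)

def runtime_cpm(n_examples, cores):
    """Step 3: expected runtime for a data-hardware configuration (toy)."""
    return 0.01 * n_examples / max(cores, 1) + 5.0

def choose_optimal(configs):
    """Step 4: pick the configuration minimizing a joint time+cost score."""
    def score(cfg):
        t = runtime_cpm(cfg["n"], cfg["cores"])
        return t + 100 * cost_model(t, cfg["cores"], cfg["mem"])
    return min(configs, key=score)

best = choose_optimal([
    {"n": 20000, "cores": 1, "mem": 4},
    {"n": 20000, "cores": 8, "mem": 16},
])
```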
38. Performance and Accuracy Modeling Framework
Bayesian Additive Regression Trees: an additive tree inductive learning model in a Bayesian framework.
Yields a probability density of SDM execution time or accuracy under given input conditions.
40–41. Runtime CPM Drivers
For each predictor, build a CPM without that predictor and compare its skill to the skill of the full model.
Models compared: generalized additive models, boosted regression trees, adaptive regression splines, random forests.
Random forests can execute in parallel.
44. Choosing the Optimal Hardware for an SDM
Predictors (hardware, data) → CPMs (accuracy CPM, performance CPM) → Optimization (time, cost, uncertainty) → Result (optimal configuration)
45. Choosing the Optimal Hardware for an SDM
1. Identify the data configuration of training examples and covariates that will maximize accuracy
46. Choosing the Optimal Hardware for an SDM
2. Predict the execution time of that configuration on different hardware configurations
[Chart: memory (GB) vs. number of CPU cores for each unique hardware configuration]
47. Choosing the Optimal Hardware for an SDM
3. Hierarchical clustering on time, cost, and posterior SD (spread)
[Dendrogram: dissimilarity across unique hardware configurations]
48. Choosing the Optimal Hardware for an SDM
4. Calculate each cluster's mean distance from the origin; choose the cluster closest to the origin.
Optimal point: no time, no cost, no uncertainty.
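The selection rule can be sketched directly: average each cluster's points in (time, cost, posterior SD) space and keep the cluster nearest the origin. The cluster coordinates below are invented for illustration:

```python
# Sketch of step 4: pick the cluster whose mean point in scaled
# (time, cost, posterior-SD) space is closest to the origin, i.e. the
# ideal of no time, no cost, no uncertainty. Toy cluster data.
import math

def cluster_mean(points):
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def distance_from_origin(point):
    return math.sqrt(sum(v * v for v in point))

clusters = {
    "A": [(0.2, 0.1, 0.3), (0.3, 0.2, 0.2)],  # fast, cheap, certain
    "B": [(0.9, 0.8, 0.7), (0.8, 0.9, 0.6)],  # slow, costly, uncertain
}
optimal = min(clusters,
              key=lambda c: distance_from_origin(cluster_mean(clusters[c])))
```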
49. Optimal Configuration for Each SDM
[Chart: CPU cores vs. memory (GB) for generalized additive models, boosted regression trees, adaptive regression splines, and random forests]
50. MARS: An Unresolved Quandary
Multivariate Adaptive Regression Splines show no incremental preference for higher memory.
[Chart: CPU cores vs. memory (GB)]
52. Recommendations
Redevelop models at the code-infrastructure interface to leverage high-performance computing technologies.
Prioritize model efficiency along with ecological realism in future development.
Cloud computing offers the ability to run models on the right resources, not just the convenient ones.
Promote extensions of this framework.
54. Bayesian Additive Tree Model Structure and Priors
1. Node depth prior: the probability that a node at depth d is nonterminal is α(1 + d)^−β, where α ∈ (0, 1) and β ∈ [0, ∞).
2. Leaf-value prior: P(M_t | T_t): μ_l ~ N(μ_μ / m, σ_μ²).
μ_μ is chosen as the range center, (y_min + y_max)/2.
σ_μ² is chosen empirically so that the range center plus or minus k = 2 standard deviations covers 95% of the response values in the training set.
3. Error variance prior: σ² ~ InvGamma(ν/2, νλ/2).
λ is determined from the data so that there is a q = 90% a priori chance (by default) that the BART model will improve on the RMSE of an ordinary least squares regression.
55. Bayesian Additive Tree Model Structure and Priors
5. Response likelihood: the response in a leaf in a given MCMC iteration is drawn around the leaf mean: y_l ~ N(μ_l, σ²).
6. Hyperparameters: α = 0.95, β = 2, k = 2, ν = 3, q = 90%.
R package: bartMachine
Citation: Kapelner, A., & Bleich, J. (2016). bartMachine: Machine Learning with Bayesian Additive Regression Trees. Journal of Statistical Software, 70(4), 1-40. doi:10.18637/jss.v070.i04
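To see how the node-depth prior enforces shallow trees, one can evaluate α(1 + d)^−β at increasing depths with the default hyperparameters; the splitting probability falls off steeply:

```python
# Sketch of the BART node-depth prior: the probability that a node at
# depth d is nonterminal (splits) is alpha * (1 + d) ** (-beta).
# With the defaults alpha = 0.95, beta = 2 this decays quickly,
# which is what favors shallow trees.

def split_prob(d, alpha=0.95, beta=2.0):
    return alpha * (1 + d) ** (-beta)

probs = [split_prob(d) for d in range(4)]
# Depth 0 splits with probability 0.95; by depth 3 the probability is
# already below 0.06.
```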
56. MARS Remedial Measures
• Resample so each configuration has n=1 observations
– Completely covering parameter space
– Need to reduce the influence of the imbalanced design
– No qualitative change in results
• Future steps
– Recollect dataset using a balanced design with multiple replicates
57. Distributed Computing for SDM: A Semi-Automated Workflow
1. Configure and build virtual instances (Master Compute Node ↔ Central Database)
A. Which configurations have experiments that are not marked as COMPLETED? [GET /nextconfig]
B. Return the next configuration with experiments not marked COMPLETE. [json]
C. Create an instance group with the specified vCPU and memory. [gcloud create]
2. Run simulations and report results (Compute Node ↔ Central Database)
A. Select a random experiment within the machine's computing capabilities.
B. Return the experiment specification. [json]
C. Report time and accuracy measures to the database.
RScript TimeSDM function: load variables → fit BRT model → predict to 2100 → evaluate accuracy, reporting fit time, predict time, accuracy time, and total time.
3. Manage virtual infrastructure (Master Compute Node ↔ Central Database)
A. What percent of the experiments in this configuration have been completed? [GET configstatus/cores/memory]
B. Return percentage completion. [json]
Poll every 30 seconds; if percent == 100, destroy the instances, instance group, and instance template. [gcloud delete]
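The master node's control loop might be sketched as follows. The FakeAPI backend and all function names are hypothetical stand-ins for the real HTTP endpoints and gcloud calls, kept abstract so the loop structure is testable:

```python
# Hypothetical sketch of the master-node control loop: fetch the next
# unfinished configuration, provision an instance group, poll completion
# percentage, then tear down. FakeAPI simulates the central database.

def run_master(api, provision, teardown, poll=lambda: None):
    while True:
        cfg = api.next_config()          # [GET /nextconfig]
        if cfg is None:
            break                        # every experiment COMPLETED
        provision(cfg)                   # [gcloud create] instance group
        while api.config_status(cfg) < 100:
            poll()                       # poll every 30 seconds
        teardown(cfg)                    # [gcloud delete]

class FakeAPI:
    """Stand-in for the central database's HTTP endpoints."""
    def __init__(self, configs):
        self.pending = list(configs)
        self.progress = {}

    def next_config(self):
        return self.pending[0] if self.pending else None

    def config_status(self, cfg):
        pct = self.progress.get(cfg, 0) + 50  # pretend workers progress
        self.progress[cfg] = pct
        if pct >= 100:
            self.pending.remove(cfg)
        return pct

created, deleted = [], []
api = FakeAPI(["4cpu-16gb", "8cpu-32gb"])
run_master(api, created.append, deleted.append)
```

Separating the loop from the provisioning calls keeps the same skeleton usable whether the backend is a real REST service or a simulation.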
58. Clustering Specifications
• Axes: Run time, run cost, run time prediction standard deviation
• Distance Metric: Euclidean
• Linkage: Complete
• Splitting rule: Silhouette (Rousseeuw, 1987): Maximize between
cluster variance, minimize within cluster variance
• Initially scale and center to reduce the effect of axes with different dimensions.
• Clustering package: base R (function hclust)
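A minimal pure-Python restatement of this specification, for readers without R at hand. The thesis used base R's hclust; the configuration values below are invented, and the silhouette splitting rule is replaced by a fixed cluster count for brevity:

```python
# Sketch of the clustering specification: scale and center each axis,
# then agglomerate under Euclidean distance with complete linkage.
import math

def scale(points):
    """Center and scale each axis so no dimension dominates."""
    dims = len(points[0])
    cols = [[p[i] for p in points] for i in range(dims)]
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
           for c, m in zip(cols, means)]
    return [tuple((p[i] - means[i]) / sds[i] for i in range(dims))
            for p in points]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(ca, cb, pts):
    """Cluster distance = maximum pairwise point distance."""
    return max(euclidean(pts[i], pts[j]) for i in ca for j in cb)

def hcluster(points, k):
    """Agglomerate scaled points until k clusters remain."""
    pts = scale(points)
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > k:
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: complete_linkage(
                       clusters[ij[0]], clusters[ij[1]], pts))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# (time, cost, posterior SD) for four hypothetical configurations
configs = [(1.0, 1.0, 0.1), (1.1, 0.9, 0.2),
           (9.0, 8.0, 3.0), (8.5, 8.2, 2.9)]
groups = hcluster(configs, k=2)
```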
60. Ecological Informatics
The scope of the journal takes into account the data-intensive nature of ecology, the
precious information content of ecological data, the growing capacity of computational
technology to leverage complex data as well as the critical need for informing sustainable
management in spite of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and
informatics.
The journal invites papers on:
• novel concepts and tools for monitoring, acquisition, management, analysis and
synthesis of ecological data, including genomic and paleo-ecological data,
• understanding ecosystem functioning and evolution, and
• informing decisions on environmental issues like sustainability, climate change and
biodiversity.
Impact Factor: 1.683
61. Environmental Modeling and Software
Impact Factor: 4.207
The aim is to improve our capacity to represent, understand, predict or manage the
behaviour of environmental systems at all practical scales, and to communicate those
improvements to a wide scientific and professional audience.
• Generic and pervasive frameworks, techniques and issues - including system
identification theory and practice, model conception, model integration, model and/or
software evaluation, sensitivity and uncertainty assessment, visualization, scale and
regionalization issues.
• Artificial Intelligence (AI) techniques and systems, such as knowledge-based systems /
expert systems, case-based reasoning systems, data mining, multi-agent systems,
Bayesian networks, artificial neural networks, fuzzy logic, or knowledge elicitation and
knowledge acquisition methods.
• Decision support systems and environmental information systems- implementation and
use of environmental data and models to support all phases and aspects of decision
making, in particular supporting group and participatory decision making processes.
Intelligent Environmental Decision Support Systems can include qualitative, quantitative,
mathematical, statistical, AI models and meta-models.
62. Computers and Geosciences
Impact Factor: 2.474
Publications should apply modern computer science paradigms, whether
computational or informatics-based, to address problems in the geosciences.
• Computational/informatics elements may include: computational methods; algorithms;
data structure; database retrieval; information retrieval; data processing; artificial
intelligence; computer graphics; computer visualization; programming languages;
parallel systems; distributed systems; the World-Wide Web; social media; and software
engineering.
• Geoscientific topics of interest include: mineralogy; petrology; geochemistry;
geomorphology; paleontology; stratigraphy; structural geology; sedimentology;
hydrogeology; oceanography; atmospheric sciences; climatology; meteorology;
geophysics; geomatics; remote sensing; geodesy; hydrology; and glaciology.
63. Computers, Environment and Urban Systems
Impact Factor: 2.092
Innovative computer-based research on urban systems, systems of cities, and built and natural environments that privileges the geospatial perspective.
Applied and theoretical contributions demonstrate the scope of computer-based analysis
fostering a better understanding of urban systems, the synergistic relationships between
built and natural environments, their spatial scope and their dynamics.
Contributions emphasizing the development and enhancement of computer-based
technologies for the analysis and modeling, policy formulation, planning, and
management of environmental and urban systems that enhance sustainable futures are
especially sought. The journal also encourages research on the modalities through which
information and other computer-based technologies mold environmental and urban
systems.
64. Applied Artificial Intelligence
Impact Factor: 0.540
Addresses concerns in applied research and applications of artificial intelligence (AI).
Articles highlight advances in uses of AI systems for solving tasks in management,
industry, engineering, administration, and education; evaluations of existing AI systems
and tools, emphasizing comparative studies and user experiences; and the economic,
social, and cultural impacts of AI. Papers on key applications, highlighting methods, time
schedules, person-months needed, and other relevant material are welcome.
65–67. Species Distribution Models
“Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.”
Elith and Leathwick (2009)
[Figure: a model fit from training examples and environmental covariates, then applied to a 21st-century climate scenario to predict the probability of presence in 2100]
70. What is a regression tree?
Regression trees rely on recursive binary partitioning of predictor space into a set of hyperrectangles in order to
approximate some unknown function f.
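A minimal sketch of that idea in one dimension: greedily choose the split that most reduces squared error, recurse on each side, and predict with the leaf mean (a piecewise-constant approximation of f). Illustrative only:

```python
# Toy regression tree: recursive binary partitioning of predictor space,
# with the leaf mean as the prediction in each resulting region.

def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def build(xs, ys, min_leaf=2, depth=0, max_depth=2):
    pairs = sorted(zip(xs, ys))
    if depth == max_depth or len(pairs) < 2 * min_leaf:
        return {"leaf": sum(ys) / len(ys)}
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        left, right = pairs[:i], pairs[i:]
        score = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or score < best[0]:
            best = (score, left, right)
    _, left, right = best
    split = (left[-1][0] + right[0][0]) / 2
    return {"split": split,
            "left": build([x for x, _ in left], [y for _, y in left],
                          min_leaf, depth + 1, max_depth),
            "right": build([x for x, _ in right], [y for _, y in right],
                           min_leaf, depth + 1, max_depth)}

def predict(node, x):
    if "leaf" in node:
        return node["leaf"]
    return predict(node["left"] if x <= node["split"] else node["right"], x)

# Step function: f(x) = 0 for x < 5, 10 for x >= 5
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 10, 10, 10, 10]
tree = build(xs, ys)
```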
Editor's Notes
SDMs are most often run on laptops/lab desktops for multiuse with little attention paid to the optimal strategy for SDMs in particular.
Cloud computing offers a convenient way to provision and release configurable resources.
Gives users the opportunity to get the right tool for the job by renting space on virtual machines rather than purchasing hardware.
Led by pushes from major federal funding agencies like NSF. "Cloud first" strategy. NSF -> $20 million to cloud computing. NASA: lots of research. OMB 25-point plan to reduce barriers to entry to cloud computing. Amazon EC2 now hosts terabytes of scientific data on its public clouds.
For each SDM, there exists an optimal data/hardware configuration that:
Maximizes classification accuracy
Balances tradeoffs between performance and cost by jointly minimizing cost ($) and time (hr)
Hypotheses:
That an optimal exists
That accuracy will depend only on data
That execution time will depend on hardware and data
That Random forests will see performance gains on multicores, while other SDMs will not, since RF is able to execute in parallel.
Gathered empirical data on SDM runtime and accuracy for ~30,000 SDM simulation experiments. For each experiment, evaluated classification accuracy and measured runtime. Approximately evenly split over all four models. Ran on the Google Cloud infrastructure.
Used the pricing scheme of Google Compute Engine to build a deterministic link between the time of computation and the cost of computation.
Build empirical dataset of SDM accuracy and runtime.
Build model of computing cost
Build component predictive models
Accuracy
Runtime
Use model to predict runtime and accuracy on many algorithm-hardware configurations.
Choose optimal configuration
Yield probability density of runtime and accuracy under different configurations.
Amount of data used to fit the models is most important
Drivers vary amongst models
Number of cores is important for random forests
Hardware is not influential for GAM, GBM
Weird MARS memory affinity
Node depth prior enforces shallow trees
Leaf-Value prior provides regularization so that single trees do not dominate
Error variance prior provides additional insurance against overfitting.
Species Distribution Models (SDMs) are a class of statistical models that quantify the relationships between a species and its environmental range determinants
Supervised machine-learning/statistical models
Austin (2007) has posited that a solid foundation of ecological theory is essential to the correct prediction and interpretation of species distribution models. He notes that the ecological underpinnings of the statistics are, perhaps, more important than the statistical method itself.
Elith & Leathwick (2009a) further suggest that additional improvements in species distribution modeling will come from the incorporation of additional, ecologically relevant information in the statistical model itself and in the covariates used to fit it. Indeed, “further advances in SDM are more likely to come from better integration of theory, concepts, and practice than from improved methods per se” (Elith & Leathwick, 2009a).
People are doing many hundreds or thousands of species under many different warming scenarios.