A general framework for predicting the optimal computing configurations for climate-driven ecological forecasting models
Scott Farley
Department of Geography
University of Wisconsin – Madison
Master of Science
Cartography & GIS
Public Talk
April 17, 2017
Acknowledgements
You!
Given flexible computing resources and
massive data stores, what is the most efficient
computing hardware on which to run
ecological forecasting models?
Cloud Computing
Species Distribution Modeling
Biodiversity Informatics
Motivation
Figure adapted from: Urban, M. C. (2015). Accelerating extinction risk from climate change. Science, 348(6234), 571-573.
1 in 6 species is likely to go extinct due to climate change.
Motivation
Figure adapted from: Dickson et al. (2014). Towards a global map of natural capital: key ecosystem assets. United Nations Environment Programme. 1-33.
Global ecosystem services are valued at $125 trillion/year.
Biodiversity Informatics: Community efforts to manage, archive, analyze, distribute, and interpret primary data regarding life
Neotoma Paleoecology Database
Paleobiodiversity Data
Figure adapted from: Plumer, B. (2014). There have been five mass extinctions in Earth’s history. Now we’re facing a sixth. The Washington Post.
https://www.washingtonpost.com/news/wonk/wp/2014/02/11/there-have-been-five-mass-extinctions-in-earths-history-now-were-facing-a-sixth
Paleobiodiversity Data
Figure adapted from: Booth, R. K., Brewer, S., Blaauw, M., Minckley, T. A., & Jackson, S. T. (2012). Decomposing the mid‐Holocene Tsuga decline in eastern North America. Ecology, 93(8), 1841-1852.
Recent Growth in Biodiversity Databases
[Figure, left panel: Millions of Records (100 to 500) by Date of Record (1500 to 1900). Annotation: Added 65.8 million records in 2015.]
[Figure, right panel: Neotoma Paleoecology Database, Millions of Records (2 to 3.5) by Date of Accession (2010 to 2015). Annotation: Added 1.5 million fossil records since 2010.]
The Four V’s of Big Data
Volume: Size of the data ✓
Veracity: Uncertainty of the data ✓
Variety: Heterogeneity of the data and complexity of interrelationships ✓
Velocity: Sensitivity of data analysis to time ✗
Biodiversity data does not typically require real-time analysis
Inductive Learning
Predict future response values (Y) given a set of potential covariates (X).
Training Examples: X and Y both known
Build Model: Estimate functional relationship from X and Y by minimizing loss criteria (y - ŷ)
Future Cases: Only X is known. Use the approximated functional relationship in prediction to estimate new Y’s (ŷ) from X
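The three stages above (training examples, model building, prediction for future cases) can be illustrated with a minimal sketch. It assumes a single covariate and a straight-line model fit by least squares, which is only a stand-in for the SDM algorithms discussed in this talk; the numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Build model: least-squares slope and intercept for one covariate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, xs):
    """Future cases: only X is known, so estimate y-hat from the fitted line."""
    slope, intercept = model
    return [intercept + slope * x for x in xs]


# Training examples: X and Y both known (illustrative values only).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.1, 4.9, 7.2, 8.8]
model = fit_line(train_x, train_y)

# Future cases: only X is known.
future_x = [5.0, 6.0]
y_hat = predict(model, future_x)
print(model, y_hat)
```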
Species Distribution Modeling
Predict future distribution of species from observations of current (or fossil) distribution and environmental/climatic covariates.
[Figure labels: Training Examples; Environmental Covariates; SDM Algorithm; Predicted Future Environment; Predicted Future Distribution]
Figure source: https://www.unil.ch/idyst/en/home/menuinst/research-topics/geoinformatics-and-spatial-m/predictive-biogeography/advancing-the-science-of-eco.html
Types of SDM
Model Driven
Fit a parametric statistical model to a dataset
Make assumptions about form of the functional relationship between inputs and output
• Linear Regression
• Generalized linear models
• Logistic regression
Data Driven
Estimate relationship between inputs and outputs from the data
High sensitivity to small changes in input data
• Regression trees
• Artificial neural networks
• MaxEnt
Bayesian
Estimate the relationship between inputs and outputs as a probability distribution using prior knowledge and new data
Formally account for model uncertainty
• Gaussian random fields
• Community full joint distribution modeling
Algorithms in Contemporary SDM literature
In 100 randomly sampled SDM papers…
[Figure: bar chart of Frequency (20 to 120) by algorithm type: Bayesian, Data Driven, Model Driven, Other]
Cloud Computing
Enables convenient and on-demand access to configurable computing resources
Rapid provisioning and release with minimal management effort
Recent growth supported by federal agencies and public cloud providers
Hypothesis
For each SDM, there exists an optimal data-hardware configuration that:
1. Maximizes SDM accuracy
2. Balances the tradeoff between performance and expense by jointly minimizing the time and cost of modeling
Data: Training Examples, Number of Covariates
Hardware: CPU Cores, Memory
SDMs: Random Forests, Boosted Regression Trees, Generalized Additive Models, Adaptive Regression Splines
Methods
1. Build large empirical dataset of SDM accuracy and runtime
2. Build model of computing cost
3. Build two computational performance models (CPM):
– Accuracy (AUC): Expected accuracy using given data
– Runtime (seconds): Expected execution time on given data-hardware configuration
4. Use CPMs to identify optimal data-hardware configuration
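The slides do not say how the computing cost model in step 2 is specified. One plausible form, sketched here purely as an assumption, charges a separate hourly price per CPU core and per GB of memory and multiplies the resulting rate by the runtime. The function names and both prices are hypothetical placeholders, not values from the talk.

```python
def hourly_rate(cores, memory_gb, price_per_core_hour, price_per_gb_hour):
    """Hourly price of one machine under an assumed linear pricing scheme."""
    return cores * price_per_core_hour + memory_gb * price_per_gb_hour


def run_cost(runtime_seconds, cores, memory_gb,
             price_per_core_hour=0.03, price_per_gb_hour=0.004):
    """Cost of one model run; the default prices are hypothetical."""
    rate = hourly_rate(cores, memory_gb, price_per_core_hour, price_per_gb_hour)
    return runtime_seconds / 3600.0 * rate


# A 30-minute run on a 4-core, 16 GB machine at the placeholder prices.
example_cost = run_cost(1800, cores=4, memory_gb=16)
print(round(example_cost, 4))
```

Under this form, adding cores only pays off when the runtime falls faster than the hourly rate rises, which is the time and cost tradeoff stated in the hypothesis.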
Performance and Accuracy Modeling Framework
Bayesian Additive Regression Trees
Additive tree inductive learning model in a Bayesian framework
Probability density of SDM execution time or accuracy under given input conditions
Runtime Model Skill
[Figure: Runtime CPM skill; panels: Regression Trees, Regression Splines, Additive Models, Random Forests]
Runtime CPM Drivers
For each predictor, build a CPM without that predictor and compare its skill to the skill of the full model.
Random Forests can execute in parallel
[Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
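The drop-one-predictor comparison described for the CPM drivers can be sketched as follows. This is an illustration under stated assumptions: ordinary least squares stands in for the Bayesian additive regression tree CPM, in-sample R² stands in for the skill measure, and the data, predictor names and runtime formula are synthetic.

```python
from itertools import product


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(rows, y):
    """Skill of the stand-in model: in-sample R squared."""
    coef = fit_ols(rows, y)
    fitted = [coef[0] + sum(b * v for b, v in zip(coef[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    sse = sum((t - f) ** 2 for t, f in zip(y, fitted))
    sst = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - sse / sst


def drop_one_importance(names, rows, y):
    """Skill lost when each predictor is removed from the full model."""
    full = r_squared(rows, y)
    return {name: full - r_squared([r[:i] + r[i + 1:] for r in rows], y)
            for i, name in enumerate(names)}


# Synthetic design: runtime depends on two predictors and ignores memory.
names = ["training_examples", "cores", "memory_gb"]
rows = list(product((1, 2, 3, 4), (1, 2, 4), (4, 8)))
runtime = [3.0 * n - 0.5 * c + 10.0 for n, c, m in rows]
importance = drop_one_importance(names, rows, runtime)
print(importance)
```

A predictor whose removal leaves skill unchanged (memory in this toy data) is not a driver of the modeled quantity.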
Accuracy Model Skill
[Figure: Accuracy CPM skill; panels: Regression Trees, Regression Splines, Additive Models, Random Forests]
Accuracy CPM Drivers
[Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
Choosing the Optimal Hardware for an SDM
Predictors: Hardware, Data
CPMs: Accuracy CPM, Performance CPM
Optimization: Time, Cost, Uncertainty
Result: Optimal Configuration
1. Identify data configuration of training examples and covariates that will maximize accuracy
2. Predict execution time of that configuration on different hardware configurations
[Figure: unique hardware configurations by Number of CPU Cores and Memory (GB)]
3. Hierarchical clustering
- Time
- Cost
- Posterior SD (spread)
[Figure: dendrogram of Dissimilarity across unique hardware configurations]
4. Calculate cluster mean distance from origin
Choose the cluster closest to the origin
Optimal: No Time, No Cost, No Uncertainty
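Step 4 can be sketched in a few lines. The sketch assumes each hardware configuration has already been assigned a cluster label, that the three axes (time, cost, posterior SD) have been rescaled to be non-negative so that the origin means no time, no cost and no uncertainty, and that "cluster mean distance" is the mean of the members' Euclidean distances. The function name and the points are illustrative.

```python
import math
from collections import defaultdict


def closest_cluster(points, labels):
    """Return the label with the smallest mean distance to the origin."""
    distances = defaultdict(list)
    for point, label in zip(points, labels):
        distances[label].append(math.sqrt(sum(v * v for v in point)))
    means = {label: sum(d) / len(d) for label, d in distances.items()}
    best = min(means, key=means.get)
    return best, means


# Illustrative (time, cost, posterior SD) triples on a common 0-1 scale.
points = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.2), (0.9, 0.8, 0.7), (1.0, 0.9, 0.9)]
labels = [0, 0, 1, 1]
best, mean_distance = closest_cluster(points, labels)
print(best, mean_distance)
```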
Optimal Configuration for Each SDM
[Figure: CPU Cores by Memory (GB); panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
MARS: An Unresolved Quandary
Multivariate Adaptive Regression Splines
No incremental preference for higher memory.
[Figure: Mean Distance from Origin over CPU Cores by Memory (GB)]
Recommendations
• Redevelop models at the code-infrastructure interface to leverage high performance computing technologies
• Prioritize model efficiency along with ecological realism in future development
• Cloud computing offers the ability to run models on the right resources, not just the convenient ones
• Promote extensions of this framework
Thanks!
Questions?
Bayesian Additive Tree Model Structure and Priors
1. Node Depth Prior: $P(T_t) \sim \alpha (1 + d)^{-\beta}$, where $\alpha \in (0, 1)$ and $\beta \in [0, \infty)$
2. Leaf-Value Prior: $P(M_t \mid T_t)$, with $\mu_l \sim N(\mu_\mu / m, \sigma_\mu^2)$
$\mu_\mu$ is picked to be the range center, $(y_{min} + y_{max})/2$
$\sigma_\mu^2$ is empirically chosen so that the range center plus or minus k = 2 variances cover 95% of the provided response values in the training set
3. Error Variance Prior: $\sigma^2 \sim \mathrm{InvGamma}(\nu/2, \nu\lambda/2)$
$\lambda$ is determined from the data so that there is a q = 90% a priori chance (by default) that the BART model will improve upon the RMSE from an ordinary least squares regression.
Bayesian Additive Tree Model Structure and Priors
5. Response likelihood: mean of response in leaf in given MCMC iteration with variance: $y_l \sim N(\mu_l, \sigma^2)$
6. Hyperparameters: α, β, k, ν and q
α: 0.95
β: 2
k: 2
ν: 3
q: 90%
R Package: bartMachine
Citation: Adam Kapelner, Justin Bleich (2016). bartMachine: Machine Learning with Bayesian Additive Regression Trees. Journal of Statistical Software, 70(4), 1-40. doi: 10.18637/jss.v070.i04
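To give a feel for the node depth prior with the hyperparameters listed above (α = 0.95, β = 2), the short sketch below evaluates α(1 + d)^(−β) at the first few depths d. In the BART literature this quantity is read as the prior probability that a node at depth d splits further; the function name is illustrative and not part of bartMachine.

```python
def split_probability(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is nonterminal."""
    return alpha * (1 + depth) ** (-beta)


# Depth 0 is the root; deeper nodes are increasingly unlikely to split.
probs = [split_probability(d) for d in range(4)]
for depth, p in enumerate(probs):
    print(depth, round(p, 4))
```

The rapid decay with depth is what keeps each individual tree shallow, so that the sum of many small trees carries the fit.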
MARS Remedial Measures
• Resample so each configuration has n=1 observations
– Completely covering parameter space
– Need to reduce the influence of the imbalanced design
– No qualitative change in results
• Future steps
– Recollect dataset using a balanced design with multiple replicates
[Diagram: Master, Compute Node, Central Database]
A. Which configurations have experiments that are not marked as COMPLETED? [GET /nextconfig]
B. Return next configuration with experiments not marked as COMPLETE. [json]
C. Create instance group with specified vCPU and memory [gcloud create]
1. Configure and build virtual instances
2. Run simulations and report results
3. Manage virtual infrastructure
[Diagram: Central Database, Compute Node, RScript]
A. Select random experiment within machine’s computing capabilities.
B. Return experiment specification. [json]
C. Report time and accuracy measures to the database
TimeSDM Function: Load variables, Fit BRT Model, Predict to 2100, Evaluate Accuracy
Timings: Total Time, Fit Time, Predict Time, Accuracy Time
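The timing structure of the TimeSDM function can be sketched as below. The original runs as an R script on each compute node; this Python version only mirrors the four stages and the four timings named on the slide. The `load`, `fit`, `predict` and `evaluate` callables are hypothetical stand-ins supplied by the caller, and the demo uses trivial placeholders rather than a real boosted regression tree.

```python
import time


def time_sdm(load, fit, predict, evaluate):
    """Run the four stages and record how long each timed stage takes."""
    t0 = time.perf_counter()
    data = load()                      # Load variables
    t1 = time.perf_counter()
    model = fit(data)                  # Fit model
    t2 = time.perf_counter()
    predictions = predict(model, data)  # Predict
    t3 = time.perf_counter()
    accuracy = evaluate(predictions, data)  # Evaluate accuracy
    t4 = time.perf_counter()
    return {
        "fit_time": t2 - t1,
        "predict_time": t3 - t2,
        "accuracy_time": t4 - t3,
        "total_time": t4 - t0,
        "accuracy": accuracy,
    }


# Placeholder stages so the sketch runs end to end.
result = time_sdm(
    load=lambda: [1.0, 2.0, 3.0],
    fit=lambda data: sum(data) / len(data),
    predict=lambda model, data: [model for _ in data],
    evaluate=lambda predictions, data: 0.5,
)
print(result)
```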
[Diagram: Central Database, Master, Compute Node]
A. What percent of the experiments in this configuration have been completed? [GET configstatus/cores/memory]
B. Return percentage completion. [json]
Poll every 30 seconds
if percent == 100: A. Destroy instances, instance group, and instance template [gcloud delete]
Distributed Computing for SDM: A semi-automated workflow
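The master's teardown logic (poll the database every 30 seconds and destroy the instances once a configuration is 100% complete) can be sketched as a small loop. The HTTP request and the gcloud call are deliberately left out: `get_percent_complete` and `teardown` are hypothetical callables that the caller would wire to them, and the sleep function is injectable so the demo does not actually wait.

```python
import time


def wait_then_teardown(get_percent_complete, teardown,
                       poll_seconds=30, sleep=time.sleep):
    """Poll until the configuration is complete, then tear it down.

    Returns the number of polls that were needed.
    """
    polls = 0
    while True:
        polls += 1
        if get_percent_complete() >= 100:
            teardown()
            return polls
        sleep(poll_seconds)


# Demo with canned progress values and a recording sleep function.
progress = iter([40, 80, 100])
naps = []
torn_down = []
polls = wait_then_teardown(
    get_percent_complete=lambda: next(progress),
    teardown=lambda: torn_down.append(True),
    sleep=naps.append,
)
print(polls, naps, torn_down)
```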
Clustering Specifications
• Axes: Run time, run cost, run time prediction standard deviation
• Distance Metric: Euclidean
• Linkage: Complete
• Splitting rule: Silhouette (Rousseeuw, 1987): Maximize between
cluster variance, minimize within cluster variance
• Initial scale and center to reduce effect of axes with different
dimensions.
• Clustering package: base R (function hclust)
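The specification above can be illustrated with a small, dependency-free sketch. The talk used R's `hclust`; the Python below is a stand-in that scales and centers the three axes, then merges clusters by complete linkage on Euclidean distance. The number of clusters is fixed by hand here, whereas the talk selects the split with the silhouette criterion, and the five configurations are made-up values.

```python
import math
from statistics import mean, stdev


def standardize(rows):
    """Center each axis on its mean and scale by its sample SD."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]


def complete_linkage(points, k):
    """Agglomerate until k clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Illustrative (run time in s, run cost in $, prediction SD in s) rows.
rows = [(100, 0.010, 5.0), (110, 0.012, 6.0), (400, 0.050, 30.0),
        (420, 0.055, 28.0), (105, 0.011, 5.5)]
clusters = complete_linkage(standardize(rows), k=2)
print(sorted(sorted(c) for c in clusters))
```

Without the scaling step the run-time axis, measured in hundreds of seconds, would dominate the distances and the cost axis would have almost no influence.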
Publishing Venues
Ecological Informatics
The scope of the journal takes into account the data-intensive nature of ecology, the
precious information content of ecological data, the growing capacity of computational
technology to leverage complex data as well as the critical need for informing sustainable
management in spite of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and
informatics.
The journal invites papers on:
• novel concepts and tools for monitoring, acquisition, management, analysis and
synthesis of ecological data, including genomic and paleo-ecological data,
• understanding ecosystem functioning and evolution, and
• informing decisions on environmental issues like sustainability, climate change and
biodiversity.
Impact Factor: 1.683
Environmental Modeling and Software
Impact Factor: 4.207
The aim is to improve our capacity to represent, understand, predict or manage the
behaviour of environmental systems at all practical scales, and to communicate those
improvements to a wide scientific and professional audience.
• Generic and pervasive frameworks, techniques and issues - including system
identification theory and practice, model conception, model integration, model and/or
software evaluation, sensitivity and uncertainty assessment, visualization, scale and
regionalization issues.
• Artificial Intelligence (AI) techniques and systems, such as knowledge-based systems /
expert systems, case-based reasoning systems, data mining, multi-agent systems,
Bayesian networks, artificial neural networks, fuzzy logic, or knowledge elicitation and
knowledge acquisition methods.
• Decision support systems and environmental information systems- implementation and
use of environmental data and models to support all phases and aspects of decision
making, in particular supporting group and participatory decision making processes.
Intelligent Environmental Decision Support Systems can include qualitative, quantitative,
mathematical, statistical, AI models and meta-models.
Computers and Geosciences
Impact Factor: 2.474
Publications should apply modern computer science paradigms, whether
computational or informatics-based, to address problems in the geosciences.
• Computational/informatics elements may include: computational methods; algorithms;
data structure; database retrieval; information retrieval; data processing; artificial
intelligence; computer graphics; computer visualization; programming languages;
parallel systems; distributed systems; the World-Wide Web; social media; and software
engineering.
• Geoscientific topics of interest include: mineralogy; petrology; geochemistry;
geomorphology; paleontology; stratigraphy; structural geology; sedimentology;
hydrogeology; oceanography; atmospheric sciences; climatology; meteorology;
geophysics; geomatics; remote sensing; geodesy; hydrology; and glaciology.
Computers, Environment and Urban Systems
Impact Factor: 2.092
Innovative computer-based research on urban systems, systems of cities, and built and natural environments, that privileges the geospatial perspective.
Applied and theoretical contributions demonstrate the scope of computer-based analysis
fostering a better understanding of urban systems, the synergistic relationships between
built and natural environments, their spatial scope and their dynamics.
Contributions emphasizing the development and enhancement of computer-based
technologies for the analysis and modeling, policy formulation, planning, and
management of environmental and urban systems that enhance sustainable futures are
especially sought. The journal also encourages research on the modalities through which
information and other computer-based technologies mold environmental and urban
systems.
Applied Artificial Intelligence
Impact Factor: 0.540
Addresses concerns in applied research and applications of artificial intelligence (AI).
Articles highlight advances in uses of AI systems for solving tasks in management,
industry, engineering, administration, and education; evaluations of existing AI systems
and tools, emphasizing comparative studies and user experiences; and the economic,
social, and cultural impacts of AI. Papers on key applications, highlighting methods, time
schedules, person-months needed, and other relevant material are welcome.
Species Distribution Models
“Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.”
Elith and Leathwick (2009)
[Figure labels: Environmental Covariates; Training Examples; Model; Predict; 21st Century Climate Scenario; Probability of Presence in 2100 (legend: 0, 20, 40, 60)]
Parallel efficiency
Marginal returns in runtime of adding one more processor.
Bigger datasets have higher efficiency on massively parallel configurations.
[Figure: Efficiency by CPU Cores, for different Number of Training Examples]
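The slide does not give a formula for efficiency. The sketch below assumes the common definition, speedup divided by the number of cores, and also computes the runtime saved per added core as one reading of "marginal returns". The runtimes are invented for illustration.

```python
def parallel_efficiency(runtimes):
    """Efficiency T(1) / (p * T(p)) for each core count p.

    `runtimes` maps core count to seconds and must include 1 core.
    """
    t1 = runtimes[1]
    return {p: t1 / (p * t) for p, t in sorted(runtimes.items())}


def marginal_gain(runtimes):
    """Seconds saved per additional core between consecutive core counts."""
    cores = sorted(runtimes)
    return {b: (runtimes[a] - runtimes[b]) / (b - a)
            for a, b in zip(cores, cores[1:])}


# Hypothetical runtimes in seconds on 1, 2, 4 and 8 cores.
runtimes = {1: 120.0, 2: 66.0, 4: 40.0, 8: 30.0}
efficiency = parallel_efficiency(runtimes)
gain = marginal_gain(runtimes)
print(efficiency)
print(gain)
```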
What is a regression tree?
Regression trees rely on recursive binary partitioning of predictor space into a set of hyperrectangles in order to
approximate some unknown function f.
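A minimal sketch of recursive binary partitioning follows. It assumes a single predictor, so the hyperrectangles reduce to intervals, uses the sum of squared errors as the split criterion, and predicts the mean response in each leaf. The data and function names are illustrative.

```python
def sse(ys):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys):
    """Threshold t minimizing SSE of the groups x < t and x >= t."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (t, score)
    return best


def grow(xs, ys, depth):
    """Recursively partition; a leaf is the mean response of its region."""
    split = best_split(xs, ys) if depth > 0 else None
    if split is None:
        return sum(ys) / len(ys)
    t = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (t,
            grow([x for x, _ in left], [y for _, y in left], depth - 1),
            grow([x for x, _ in right], [y for _, y in right], depth - 1))


def predict(tree, x):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x < t else right
    return tree


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
tree = grow(xs, ys, depth=1)
print(tree, predict(tree, 2), predict(tree, 11))
```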

More Related Content

What's hot

Provinance in scientific workflows in e science
Provinance in scientific workflows in e scienceProvinance in scientific workflows in e science
Provinance in scientific workflows in e sciencebdemchak
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Robert Grossman
 
Barga Data Science lecture 5
Barga Data Science lecture 5Barga Data Science lecture 5
Barga Data Science lecture 5Roger Barga
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? Robert Grossman
 
P.M.Pardalos - On Big Data Application
P.M.Pardalos - On Big Data ApplicationP.M.Pardalos - On Big Data Application
P.M.Pardalos - On Big Data ApplicationEkaterina Morozova
 
Data quality and uncertainty visualization
Data quality and uncertainty visualizationData quality and uncertainty visualization
Data quality and uncertainty visualizationbdemchak
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User StudyEnrico Daga
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkRobert Grossman
 
Prediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkPrediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkdbpublications
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Researchbutest
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchjim
 
Two Layer k-means based Consensus Clustering for Rural Health Information System
Two Layer k-means based Consensus Clustering for Rural Health Information SystemTwo Layer k-means based Consensus Clustering for Rural Health Information System
Two Layer k-means based Consensus Clustering for Rural Health Information SystemIRJET Journal
 
Multipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationMultipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationKan Yuenyong
 
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...ijtsrd
 
AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series PolarSeven Pty Ltd
 
20 26 jan17 walter latex
20 26 jan17 walter latex20 26 jan17 walter latex
20 26 jan17 walter latexIAESIJEECS
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 

What's hot (20)

Provinance in scientific workflows in e science
Provinance in scientific workflows in e scienceProvinance in scientific workflows in e science
Provinance in scientific workflows in e science
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
 
Barga Data Science lecture 5
Barga Data Science lecture 5Barga Data Science lecture 5
Barga Data Science lecture 5
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
P.M.Pardalos - On Big Data Application
P.M.Pardalos - On Big Data ApplicationP.M.Pardalos - On Big Data Application
P.M.Pardalos - On Big Data Application
 
Data quality and uncertainty visualization
Data quality and uncertainty visualizationData quality and uncertainty visualization
Data quality and uncertainty visualization
 
Propagating Data Policies - A User Study
Propagating Data Policies - A User StudyPropagating Data Policies - A User Study
Propagating Data Policies - A User Study
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 
Prediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on sparkPrediction of heart disease using classification mining technique on spark
Prediction of heart disease using classification mining technique on spark
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Deep learning and Healthcare
Deep learning and HealthcareDeep learning and Healthcare
Deep learning and Healthcare
 
Two Layer k-means based Consensus Clustering for Rural Health Information System
Two Layer k-means based Consensus Clustering for Rural Health Information SystemTwo Layer k-means based Consensus Clustering for Rural Health Information System
Two Layer k-means based Consensus Clustering for Rural Health Information System
 
CLIM Program: Remote Sensing Workshop, A Notional Framework for a Theory of D...
CLIM Program: Remote Sensing Workshop, A Notional Framework for a Theory of D...CLIM Program: Remote Sensing Workshop, A Notional Framework for a Theory of D...
CLIM Program: Remote Sensing Workshop, A Notional Framework for a Theory of D...
 
Multipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendationMultipleregression covidmobility and Covid-19 policy recommendation
Multipleregression covidmobility and Covid-19 policy recommendation
 
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
 
AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series
 
20 26 jan17 walter latex
20 26 jan17 walter latex20 26 jan17 walter latex
20 26 jan17 walter latex
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
 

Similar to A general framework for predicting the optimal computing configuration for climate-driven ecological forecasting models: Scott Farley's Masters Thesis Defense

Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
expeditions praneeth_june-2021
expeditions praneeth_june-2021expeditions praneeth_june-2021
expeditions praneeth_june-2021Praneeth Vepakomma
 
Accelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesAccelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesCambridge Semantics
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsFrancesca Lazzeri, PhD
 
Conceptual framework for entity integration from multiple data sources - Draz...
Conceptual framework for entity integration from multiple data sources - Draz...Conceptual framework for entity integration from multiple data sources - Draz...
Conceptual framework for entity integration from multiple data sources - Draz...Institute of Contemporary Sciences
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachAndre Freitas
 
IRJET- E-MORES: Efficient Multiple Output Regression for Streaming Data
IRJET- E-MORES: Efficient Multiple Output Regression for Streaming DataIRJET- E-MORES: Efficient Multiple Output Regression for Streaming Data
IRJET- E-MORES: Efficient Multiple Output Regression for Streaming DataIRJET Journal
 
Seminaire bigdata23102014
Seminaire bigdata23102014Seminaire bigdata23102014
Seminaire bigdata23102014Raja Chiky
 
Probability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsProbability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsChirag Gupta
 
Peter Elleby - Big Data, Big Noise, Big Hope - No Miracles
Peter Elleby - Big Data, Big Noise, Big Hope - No MiraclesPeter Elleby - Big Data, Big Noise, Big Hope - No Miracles
Peter Elleby - Big Data, Big Noise, Big Hope - No MiraclesWeAreEsynergy
 
IEEE Big data 2016 Title and Abstract
IEEE Big data  2016 Title and AbstractIEEE Big data  2016 Title and Abstract
IEEE Big data 2016 Title and Abstracttsysglobalsolutions
 
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...IRJET-  	  Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...IRJET Journal
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstracttsysglobalsolutions
 
Principles of Software-defined Elastic Systems for Big Data Analytics
Principles of Software-defined Elastic Systems for Big Data AnalyticsPrinciples of Software-defined Elastic Systems for Big Data Analytics
Principles of Software-defined Elastic Systems for Big Data AnalyticsHong-Linh Truong
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupSri Ambati
 
Distributed Scalable Systems Short Overview
Distributed Scalable Systems Short OverviewDistributed Scalable Systems Short Overview
Distributed Scalable Systems Short OverviewRNeches
 
The Science of Data Science
The Science of Data Science The Science of Data Science
The Science of Data Science James Hendler
 


A general framework for predicting the optimal computing configuration for climate-driven ecological forecasting models: Scott Farley's Masters Thesis Defense

  • 1. A general framework for predicting the optimal computing configurations for climate driven ecological forecasting models Scott Farley Department of Geography University of Wisconsin – Madison Master of Science Cartography & GIS Public Talk April 17, 2017
  • 6. Given flexible computing resources and massive data stores, what is the most efficient computing hardware on which to run ecological forecasting models? [Diagram labels: Cloud Computing; Species Distribution Modeling; Biodiversity Informatics]
  • 7. Motivation Figure adapted from: Urban, M. C. (2015). Accelerating extinction risk from climate change. Science, 348(6234), 571-573. 1 in 6 species is likely to go extinct due to climate change.
  • 8. Motivation Figure adapted from: Dickson et al. (2014). Towards a global map of natural capital: key ecosystem assets. United Nations Environment Programme. 1-33. Global ecosystem services are valued at $125 trillion/year.
  • 9. Biodiversity Informatics Community efforts to manage, archive, analyze, distribute, and interpret primary data regarding life
  • 12. Biodiversity Informatics Community efforts to manage, archive, analyze, distribute, and interpret primary data regarding life Neotoma Paleoecology Database
  • 13. Paleobiodiversity Data Figure adapted from: Plumer, B. (2014). There have been five mass extinctions in Earth’s history. Now we’re facing a sixth. The Washington Post. https://www.washingtonpost.com/news/wonk/wp/2014/02/11/there-have-been-five-mass-extinctions-in-earths-history-now-were-facing-a-sixth
  • 14. Paleobiodiversity Data Figure adapted from: Booth, R. K., Brewer, S., Blaauw, M., Minckley, T. A., & Jackson, S. T. (2012). Decomposing the mid‐Holocene Tsuga decline in eastern North America. Ecology, 93(8), 1841-1852.
  • 15. Recent Growth in Biodiversity Databases: Added 65.8 million records in 2015. [Figure: Millions of Records vs. Date of Record]
  • 16. Recent Growth in Biodiversity Databases: Added 65.8 million records in 2015. Neotoma Paleoecology Database: Added 1.5 million fossil records since 2010. [Figures: Millions of Records vs. Date of Record; Millions of Records vs. Date of Accession]
  • 17. The Four V’s of Big Data: Volume (size of the data) ✓
  • 18. The Four V’s of Big Data: Volume (size of the data) ✓; Veracity (uncertainty of the data) ✓
  • 19. The Four V’s of Big Data: Volume (size of the data) ✓; Variety (heterogeneity of the data and complexity of interrelationships) ✓; Veracity (uncertainty of the data) ✓
  • 21. The Four V’s of Big Data: Volume (size of the data) ✓; Variety (heterogeneity of the data and complexity of interrelationships) ✓; Veracity (uncertainty of the data) ✓; Velocity (sensitivity of data analysis to time)
  • 22. The Four V’s of Big Data: Volume (size of the data) ✓; Variety (heterogeneity of the data and complexity of interrelationships) ✓; Veracity (uncertainty of the data) ✓; Velocity (sensitivity of data analysis to time) ✗ Biodiversity data does not typically require real-time analysis.
  • 23. Inductive Learning: Predict future response values (Y) given a set of potential covariates (X). Training Examples: X and Y both known.
  • 24. Inductive Learning: Predict future response values (Y) given a set of potential covariates (X). Training Examples: X and Y both known. Build Model: Estimate functional relationship from x and y by minimizing loss criteria (y-ŷ).
  • 25. Inductive Learning: Predict future response values (Y) given a set of potential covariates (X). Training Examples: X and Y both known. Build Model. Future Cases: Only X is known. Estimate new Y’s (ŷ) from X. Use approximated functional relationship in prediction.
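
The three stages on the inductive learning slides (training examples, model building, prediction for future cases) can be illustrated with a minimal sketch. It assumes a single covariate, a straight-line functional form, and squared-error loss; the function names `fit_line` and `predict` and the toy data are hypothetical and are not taken from the talk.

```python
def fit_line(xs, ys):
    """Estimate y ≈ intercept + slope * x by minimizing the sum of (y - ŷ)²."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, xs):
    """Apply the approximated functional relationship to new X values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


# Training examples: X and Y both known (toy values).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

# Build model, then estimate ŷ for future cases where only X is known.
model = fit_line(train_x, train_y)
y_hat = predict(model, [5.0, 6.0])
print(model, y_hat)
```

The SDM algorithms discussed later replace the straight line with more flexible learners, but the train, fit, and predict sequence is the same.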
  • 26. Species Distribution Modeling: Predict future distribution of species from observations of current (or fossil) distribution and environmental/climatic covariates. [Diagram labels: Training Examples; Environmental Covariates; SDM Algorithm; Predicted Future Environment; Predicted Future Distribution] Figure source: https://www.unil.ch/idyst/en/home/menuinst/research-topics/geoinformatics-and-spatial-m/predictive-biogeography/advancing-the-science-of-eco.html
  • 27. Types of SDM. Model Driven: Fit a parametric statistical model to a dataset. Make assumptions about form of the functional relationship between inputs and output. • Linear Regression • Generalized linear models • Logistic regression
  • 28. Types of SDM. Model Driven: Fit a parametric statistical model to a dataset. Make assumptions about form of the functional relationship between inputs and output. • Linear Regression • Generalized linear models • Logistic regression. Data Driven: Estimate relationship between inputs and outputs from the data. High sensitivity to small changes in input data. • Regression trees • Artificial neural networks • MaxEnt
  • 29. Types of SDM. Model Driven: Fit a parametric statistical model to a dataset. Make assumptions about form of the functional relationship between inputs and output. • Linear Regression • Generalized linear models • Logistic regression. Data Driven: Estimate relationship between inputs and outputs from the data. High sensitivity to small changes in input data. • Regression trees • Artificial neural networks • MaxEnt. Bayesian: Estimate the relationship between inputs and outputs as a probability distribution using prior knowledge and new data. Formally account for model uncertainty. • Gaussian random fields • Community full joint distribution modeling
  • 30. Algorithms in Contemporary SDM Literature: In 100 randomly sampled SDM papers… [Figure: Frequency of algorithm type: Bayesian, Data Driven, Model Driven, Other]
  • 31. Cloud Computing: Enables convenient and on-demand access to configurable computing resources. Rapid provisioning and release with minimal management effort. Recent growth supported by federal agencies and public cloud providers.
  • 32. Hypothesis: For each SDM, there exists an optimal data-hardware configuration that: 1. Maximizes SDM accuracy 2. Balances the tradeoff between performance and expense by jointly minimizing the time and cost of modeling
  • 33. Hypothesis: For each SDM, there exists an optimal data-hardware configuration that: 1. Maximizes SDM accuracy 2. Balances the tradeoff between performance and expense by jointly minimizing the time and cost of modeling. Data: Training Examples, Number of Covariates. Hardware: CPU Cores, Memory. SDMs: Random Forests, Boosted Regression Trees, Generalized Additive Models, Adaptive Regression Splines.
  • 34. Methods: 1. Build large empirical dataset of SDM accuracy and runtime
  • 35. Methods: 1. Build large empirical dataset of SDM accuracy and runtime 2. Build model of computing cost
  • 36. Methods: 1. Build large empirical dataset of SDM accuracy and runtime 2. Build model of computing cost 3. Build two computational performance models (CPM): – Accuracy (AUC): Expected accuracy using given data – Runtime (seconds): Expected execution time on given data-hardware configuration
  • 37. Methods: 1. Build large empirical dataset of SDM accuracy and runtime 2. Build model of computing cost 3. Build two computational performance models 4. Use CPMs to identify optimal data-hardware configuration
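
Step 2 of the methods, the model of computing cost, can be sketched as a deterministic function of runtime and hardware. The sketch assumes that price scales linearly with vCPU count and memory; the two rates below are hypothetical placeholders, not actual cloud provider prices, and `run_cost` is an illustrative name.

```python
# Hypothetical hourly rates (assumptions for illustration only).
VCPU_RATE_PER_HOUR = 0.03       # dollars per vCPU per hour
MEM_RATE_PER_GB_HOUR = 0.004    # dollars per GB of memory per hour


def run_cost(runtime_seconds, cores, memory_gb):
    """Dollar cost of one model run on a machine with the given hardware."""
    hourly_price = cores * VCPU_RATE_PER_HOUR + memory_gb * MEM_RATE_PER_GB_HOUR
    return hourly_price * runtime_seconds / 3600.0


# A one-hour run on a 4-core, 16 GB machine under the assumed rates.
print(round(run_cost(3600, cores=4, memory_gb=16), 4))
```

Because cost is a fixed multiple of runtime for a given machine, a predicted runtime distribution translates directly into a predicted cost distribution.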
  • 38. Performance and Accuracy Modeling Framework: Bayesian Additive Regression Trees. Additive tree inductive learning model in a Bayesian framework. Probability density of SDM execution time or accuracy under given input conditions.
  • 39. Runtime CPM Skill [Figure panels: Regression Trees, Regression Splines, Additive Models, Random Forests]
  • 40. Runtime CPM Drivers: For each predictor, build a CPM without that predictor and compare its skill to the skill of the full model. [Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
  • 41. Runtime CPM Drivers: Random Forests can execute in parallel. For each predictor, build a CPM without that predictor and compare its skill to the skill of the full model. [Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
  • 42. Accuracy CPM Skill [Figure panels: Regression Trees, Regression Splines, Additive Models, Random Forests]
  • 43. Accuracy CPM Drivers [Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
  • 44. Choosing the Optimal Hardware for an SDM [Diagram: Predictors (Hardware, Data); CPMs (Accuracy CPM, Performance CPM); Optimization (Time, Cost, Uncertainty); Result (Optimal Configuration)]
  • 45. Choosing the Optimal Hardware for an SDM 1. Identify data configuration of training examples and covariates that will maximize accuracy
  • 46. Choosing the Optimal Hardware for an SDM 2. Predict execution time of that configuration on different hardware configurations. [Figure: unique hardware configurations; axes: Number of CPU Cores, Memory (GB)]
  • 47. Choosing the Optimal Hardware for an SDM 3. Hierarchical clustering on time, cost, and posterior SD (spread). [Figure axes: Dissimilarity; Unique Hardware Configuration]
  • 48. Choosing the Optimal Hardware for an SDM 4. Calculate cluster mean distance from origin. Choose the cluster closest to the origin. Optimal: No Time, No Cost, No Uncertainty.
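
Step 4 can be sketched as follows, assuming cluster labels are already available from step 3. Each hardware configuration is a point on three axes (time, cost, posterior SD). Dividing each axis by its maximum is an assumption made here so the axes are comparable while the origin still means "no time, no cost, no uncertainty"; the talk's exact scaling may differ. The function name and data are hypothetical.

```python
import math
from collections import defaultdict


def closest_cluster(points, labels):
    """Return the label of the cluster with the smallest mean distance to the origin."""
    n_axes = len(points[0])
    # Scale each axis to [0, 1] by its maximum (assumed scaling).
    maxima = [max(p[i] for p in points) for i in range(n_axes)]
    distances = defaultdict(list)
    for point, label in zip(points, labels):
        scaled = [point[i] / maxima[i] for i in range(n_axes)]
        distances[label].append(math.sqrt(sum(v * v for v in scaled)))
    mean_distance = {lab: sum(d) / len(d) for lab, d in distances.items()}
    best = min(mean_distance, key=mean_distance.get)
    return best, mean_distance


# (runtime in seconds, cost in dollars, posterior SD in seconds), toy values.
points = [(100, 0.02, 5), (110, 0.03, 6),
          (400, 0.01, 20), (420, 0.012, 25),
          (60, 0.10, 4), (55, 0.12, 3)]
labels = ["a", "a", "b", "b", "c", "c"]

best, mean_distance = closest_cluster(points, labels)
print(best, mean_distance)
```

In this toy data, cluster "a" is neither the fastest nor the cheapest, but it is the best joint compromise across all three axes.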
  • 49. Optimal Configuration for Each SDM [Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests; axes: CPU Cores, Memory (GB)]
  • 50. MARS: An Unresolved Quandary. Multivariate Adaptive Regression Splines: No incremental preference for higher memory. [Figure axes: CPU Cores, Memory (GB)]
  • 51. Mean Distance from Origin [Figure axes: CPU Cores, Memory (GB)]
  • 52. Recommendations: Redevelop models at the code-infrastructure interface to leverage high performance computing technologies. Prioritize model efficiency along with ecological realism in future development. Cloud computing offers the ability to run models on the right resources, not just the convenient ones. Promote extensions of this framework. [Figure: Millions of Records vs. Date of Record]
  • 54. Bayesian Additive Tree Model Structure and Priors 1. Node Depth Prior: $P(T_t) \sim \alpha(1+d)^{-\beta}$, where $\alpha \in (0, 1)$ and $\beta \in [0, \infty)$. 2. Leaf-Value Prior: $P(M_t \mid T_t)$: $\mu_l \sim N(\mu_\mu / m, \sigma_\mu^2)$, where $\mu_\mu$ is picked to be the range center, $(y_{\min} + y_{\max})/2$, and $\sigma_\mu^2$ is empirically chosen so that the range center plus or minus $k = 2$ variances cover 95% of the provided response values in the training set. 3. Error Variance Prior: $\sigma^2 \sim \mathrm{InvGamma}(\nu/2, \nu\lambda/2)$, where $\lambda$ is determined from the data so that there is a $q = 90\%$ a priori chance (by default) that the BART model will improve upon the RMSE from an ordinary least squares regression.
  • 55. Bayesian Additive Tree Model Structure and Priors 5. Response likelihood: mean of response in leaf in given MCMC iteration with variance: $y_l \sim N(\mu_l, \sigma^2)$. 6. Hyperparameters $\alpha$, $\beta$, $k$, $\nu$ and $q$: $\alpha = 0.95$, $\beta = 2$, $k = 2$, $\nu = 3$, $q = 90\%$. R Package: bartMachine. Citation: Adam Kapelner, Justin Bleich (2016). bartMachine: Machine Learning with Bayesian Additive Regression Trees. Journal of Statistical Software, 70(4), 1-40. doi: 10.18637/jss.v070.i04
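
To see how the node depth prior behaves under the listed hyperparameters, the short sketch below evaluates α(1+d)^(−β) at a few depths. It assumes the usual BART reading of this expression as the prior probability that a node at depth d is split further; the function name `split_probability` is hypothetical.

```python
def split_probability(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is nonterminal."""
    return alpha * (1 + depth) ** (-beta)


# The root is depth 0; the probability of splitting falls quickly with depth.
for depth in range(4):
    print(depth, round(split_probability(depth), 4))
```

With α = 0.95 and β = 2, the prior probability of a further split drops below 0.25 one level beneath the root, which is why this prior favors shallow trees.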
  • 56. MARS Remedial Measures • Resample so each configuration has n=1 observation – Completely covering parameter space – Need to reduce the influence of imbalanced design – No qualitative change in results • Future steps – Recollect dataset using a balanced design with multiple replicates
  • 57. Distributed Computing for SDM: A semi-automated workflow. 1. Configure and build virtual instances (Master Compute Node and Central Database): A. Which configurations have experiments that are not marked as COMPLETED? [GET /nextconfig] B. Return next configuration with experiments not marked as COMPLETE. [json] C. Create instance group with specified vCPU and memory. [gcloud create] 2. Run simulations and report results (Compute Node and Central Database): A. Select random experiment within machine’s computing capabilities. B. Return experiment specification. [json] C. Report time and accuracy measures to the database. Compute Node RScript, TimeSDM function: Load variables; Fit BRT Model; Predict to 2100; Evaluate Accuracy. Timings: Total Time, Fit Time, Predict Time, Accuracy Time. 3. Manage virtual infrastructure (Master Compute Node and Central Database): A. What percent of the experiments in this configuration have been completed? [GET configstatus/cores/memory] B. Return percentage completion. [json] Poll every 30 seconds; if percent == 100: A. Destroy instances, instance group, and instance template. [gcloud delete]
  • 58. Clustering Specifications • Axes: Run time, run cost, run time prediction standard deviation • Distance Metric: Euclidean • Linkage: Complete • Splitting rule: Silhouette (Rousseeuw, 1987): Maximize between cluster variance, minimize within cluster variance • Initial scale and center to reduce effect of axes with different dimensions. • Clustering package: base R (function hclust)
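
The clustering specification can be illustrated with a small pure-Python sketch of agglomerative clustering using Euclidean distance and complete linkage on scaled and centered axes. It is a stand-in for what `hclust` does in R, not the talk's code: the silhouette-based choice of the number of clusters is omitted, the number of clusters `k` is passed in by hand, and the data are hypothetical.

```python
import math


def standardize(rows):
    """Center each axis on its mean and scale by its sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def complete_linkage(rows, k):
    """Merge the closest pair of clusters until k remain.

    Cluster-to-cluster distance is the largest Euclidean distance
    between any two members (complete linkage).
    """
    clusters = [[i] for i in range(len(rows))]

    def linkage(a, b):
        return max(math.dist(rows[i], rows[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Axes: run time, run cost, run time prediction SD (toy values).
configs = [(1.0, 1.0, 1.0), (1.1, 1.0, 0.9), (5.0, 5.0, 5.0), (5.2, 4.9, 5.1)]
clusters = complete_linkage(standardize(configs), k=2)
print(clusters)
```

Standardizing first keeps an axis with large units, such as seconds, from dominating the distance over an axis with small units, such as dollars.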
  • 60. Ecological Informatics The scope of the journal takes into account the data-intensive nature of ecology, the precious information content of ecological data, the growing capacity of computational technology to leverage complex data as well as the critical need for informing sustainable management in spite of global environmental and climate change. The nature of the journal is interdisciplinary at the crossover between ecology and informatics. The journal invites papers on: • novel concepts and tools for monitoring, acquisition, management, analysis and synthesis of ecological data, including genomic and paleo-ecological data, • understanding ecosystem functioning and evolution, and • informing decisions on environmental issues like sustainability, climate change and biodiversity. Impact Factor: 1.683
  • 61. Environmental Modeling and Software Impact Factor: 4.207 The aim is to improve our capacity to represent, understand, predict or manage the behaviour of environmental systems at all practical scales, and to communicate those improvements to a wide scientific and professional audience. • Generic and pervasive frameworks, techniques and issues - including system identification theory and practice, model conception, model integration, model and/or software evaluation, sensitivity and uncertainty assessment, visualization, scale and regionalization issues. • Artificial Intelligence (AI) techniques and systems, such as knowledge-based systems / expert systems, case-based reasoning systems, data mining, multi-agent systems, Bayesian networks, artificial neural networks, fuzzy logic, or knowledge elicitation and knowledge acquisition methods. • Decision support systems and environmental information systems- implementation and use of environmental data and models to support all phases and aspects of decision making, in particular supporting group and participatory decision making processes. Intelligent Environmental Decision Support Systems can include qualitative, quantitative, mathematical, statistical, AI models and meta-models.
  • 62. Computers and Geosciences Impact Factor: 2.474 Publications should apply modern computer science paradigms, whether computational or informatics-based, to address problems in the geosciences. • Computational/informatics elements may include: computational methods; algorithms; data structure; database retrieval; information retrieval; data processing; artificial intelligence; computer graphics; computer visualization; programming languages; parallel systems; distributed systems; the World-Wide Web; social media; and software engineering. • Geoscientific topics of interest include: mineralogy; petrology; geochemistry; geomorphology; paleontology; stratigraphy; structural geology; sedimentology; hydrogeology; oceanography; atmospheric sciences; climatology; meteorology; geophysics; geomatics; remote sensing; geodesy; hydrology; and glaciology.
  • 63. Computers, Environment and Urban Systems Impact Factor: 2.092. Innovative computer-based research on urban systems, systems of cities, and built and natural environments, that privileges the geospatial perspective. Applied and theoretical contributions demonstrate the scope of computer-based analysis fostering a better understanding of urban systems, the synergistic relationships between built and natural environments, their spatial scope and their dynamics. Contributions emphasizing the development and enhancement of computer-based technologies for the analysis and modeling, policy formulation, planning, and management of environmental and urban systems that enhance sustainable futures are especially sought. The journal also encourages research on the modalities through which information and other computer-based technologies mold environmental and urban systems.
  • 64. Applied Artificial Intelligence Impact Factor: 0.540 addresses concerns in applied research and applications of artificial intelligence (AI). Articles highlight advances in uses of AI systems for solving tasks in management, industry, engineering, administration, and education; evaluations of existing AI systems and tools, emphasizing comparative studies and user experiences; and the economic, social, and cultural impacts of AI. Papers on key applications, highlighting methods, time schedules, person-months needed, and other relevant material are welcome.
  • 65. Species Distribution Models “Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.” Elith and Leathwick (2009) [Figure labels: Environmental Covariates; Training Examples]
  • 66. Species Distribution Models “Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.” Elith and Leathwick (2009)
  • 67. Species Distribution Models “Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.” Elith and Leathwick (2009) [Figure labels: Model; Predict; 21st Century Climate Scenario; Probability of Presence in 2100]
  • 68. Parallel efficiency: Marginal returns in runtime of adding one more processor. [Figure axes: Efficiency, CPU Cores; series: Number of Training Examples]
  • 69. Parallel efficiency: Bigger datasets have higher efficiency on massively parallel configurations. [Figure axes: Efficiency, CPU Cores]
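
As a rough illustration of parallel efficiency, the sketch below uses the standard textbook definition, speedup divided by the number of cores. This is an assumption: the talk's exact efficiency measure is not given in the transcript, and the runtimes are hypothetical.

```python
def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run is than the single-core run."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, cores):
    """Speedup per core; 1.0 means every added core is fully used."""
    return speedup(serial_seconds, parallel_seconds) / cores


# Hypothetical runtimes (seconds) of one model fit on 1, 2, 4, and 8 cores.
runtimes = {1: 120.0, 2: 65.0, 4: 40.0, 8: 30.0}
for cores, seconds in runtimes.items():
    print(cores, round(efficiency(runtimes[1], seconds, cores), 3))
```

Efficiency below 1.0 that keeps falling as cores are added is the diminishing marginal return described on the slide.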
  • 70. What is a regression tree? Regression trees rely on recursive binary partitioning of predictor space into a set of hyperrectangles in order to approximate some unknown function f.
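
One step of recursive binary partitioning can be sketched as a search for the single threshold that best splits the data. The sketch assumes one predictor and squared-error loss; a full regression tree would apply the same search again within each resulting partition. The function name `best_split` and the data are hypothetical.

```python
def best_split(xs, ys):
    """Find the threshold on x that minimizes the total within-group squared error."""

    def sse(values):
        # Sum of squared errors around the group mean.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Toy data with an obvious step between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
print(best_split(xs, ys))
```

Each split divides predictor space into two regions with a constant prediction in each; repeating the process yields the hyperrectangles described on the slide.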

Editor's Notes

  1. SDMs are most often run on multiuse laptops or lab desktops, with little attention paid to the optimal strategy for SDMs in particular. Cloud computing offers a convenient way to provision and release configurable resources easily. It gives users the opportunity to get the correct tool for the job by renting space on virtual machines rather than purchasing hardware. Led by pushes from major federal funding agencies like NSF. “Cloud first” strategy. NSF: $20 million to cloud computing. NASA: lots of research. OMB: 25-point plan to reduce barriers to entry to cloud computing. Amazon EC2 now hosts terabytes of scientific data on its public clouds.
  2. For each SDM, there exists an optimal data/hardware configuration that: (1) maximizes classification accuracy; (2) balances tradeoffs between performance and cost by jointly minimizing cost ($) and time (hr). Hypotheses: that an optimum exists; that accuracy will depend only on data; that execution time will depend on hardware and data; that random forests will see performance gains on multiple cores, while other SDMs will not, since RF is able to execute in parallel.
  4. Gathered empirical data on SDM runtime and accuracy for ~30,000 SDM simulation experiments. For each experiment, evaluated classification accuracy and measured runtime. Approximately evenly split over all four models. Ran on the Google Cloud infrastructure.
  5. Used the pricing scheme of Google Compute Engine (GCE) to build a deterministic link between time of computation and cost of computation.
  6. Workflow: (1) build an empirical dataset of SDM accuracy and runtime; (2) build a model of computing cost; (3) build component predictive models for accuracy and runtime; (4) use the models to predict runtime and accuracy on many algorithm-hardware configurations; (5) choose the optimal configuration.
  8. Yields a probability density of runtime and accuracy under different configurations.
  9. The amount of data used to fit the models is most important. Drivers vary among models: the number of cores is important for random forests; hardware is not influential for GAM or GBM; MARS shows a weird memory affinity.
  11. The node-depth prior enforces shallow trees. The leaf-value prior provides regularization so that single trees do not dominate. The error-variance prior provides additional insurance against overfitting.
  13. Species distribution models (SDMs) are a class of statistical models that quantify the relationships between a species and its environmental range determinants. They are supervised machine-learning/statistical models. Austin (2007) has posited that a solid foundation of ecological theory is essential to the correct prediction and interpretation of species distribution models. He notes that the ecological underpinnings of the statistics are, perhaps, more important than the statistical method itself. Elith & Leathwick (2009a) further suggest that additional improvements in species distribution modeling will come from the incorporation of additional, ecologically relevant information in the statistical model itself and in the covariates used to fit it. Indeed, “further advances in SDM are more likely to come from better integration of theory, concepts, and practice than from improved methods per se” (Elith & Leathwick, 2009a). In practice, people are modeling many hundreds or thousands of species under many different warming scenarios.
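
The cost model and the final selection step described in the notes above (link runtime to cost through a pricing scheme, then choose a configuration that jointly minimizes cost and time) can be sketched roughly as follows. This is only an illustration under stated assumptions. The per-core and per-GB prices are made-up placeholders, not actual Google Compute Engine rates. The `Config` fields and the predicted runtimes are hypothetical. Using a Pareto front as the reading of "jointly minimizing" is my assumption, since the notes do not say how the two objectives were combined.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical prices for illustration only; NOT real GCE rates.
PRICE_PER_CORE_HOUR = 0.03   # $ per vCPU per hour (assumed)
PRICE_PER_GB_HOUR = 0.004    # $ per GB of memory per hour (assumed)


@dataclass(frozen=True)
class Config:
    cores: int
    memory_gb: float
    runtime_hr: float   # runtime predicted by some runtime model (assumed given)


def hourly_rate(c: Config) -> float:
    """Deterministic price of renting this hardware for one hour."""
    return c.cores * PRICE_PER_CORE_HOUR + c.memory_gb * PRICE_PER_GB_HOUR


def cost(c: Config) -> float:
    """Cost of one model run: hourly rate times predicted runtime."""
    return hourly_rate(c) * c.runtime_hr


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other configuration beats on both time and cost."""
    front = []
    for c in configs:
        dominated = any(
            o.runtime_hr <= c.runtime_hr and cost(o) <= cost(c)
            and (o.runtime_hr < c.runtime_hr or cost(o) < cost(c))
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


if __name__ == "__main__":
    candidates = [
        Config(cores=1, memory_gb=4, runtime_hr=10.0),
        Config(cores=4, memory_gb=16, runtime_hr=3.0),
        Config(cores=8, memory_gb=32, runtime_hr=3.5),
        Config(cores=16, memory_gb=64, runtime_hr=2.0),
    ]
    for c in pareto_front(candidates):
        print(c, round(cost(c), 3))
```

A Pareto front returns every configuration that is not beaten on both objectives. Picking a single "optimal" point from it still requires a stated preference between time and money.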