A general framework for predicting the optimal computing configuration for climate-driven ecological forecasting models: Scott Farley's Masters Thesis Defense

Scott Farley's Master of Science (Cartography/GIS) public talk. Presents a general rubric for identifying the optimal hardware configuration for running species distribution models. Presents the results of a literature meta-analysis and a new characterization of ecological biodiversity data as Big Data.

Published in: Science

  1. A general framework for predicting the optimal computing configurations for climate-driven ecological forecasting models. Scott Farley, Department of Geography, University of Wisconsin-Madison. Master of Science, Cartography & GIS. Public Talk, April 17, 2017.
  2. Acknowledgements: You!
  6. Given flexible computing resources and massive data stores, what is the most efficient computing hardware on which to run ecological forecasting models? Topics: Cloud Computing, Species Distribution Modeling, Biodiversity Informatics.
  7. Motivation: 1 in 6 species is likely to go extinct due to climate change. Figure adapted from: Urban, M. C. (2015). Accelerating extinction risk from climate change. Science, 348(6234), 571-573.
  8. Motivation: Global ecosystem services are valued at $125 trillion/year. Figure adapted from: Dickson et al. (2014). Towards a global map of natural capital: key ecosystem assets. United Nations Environment Programme. 1-33.
  9. Biodiversity Informatics: Community efforts to manage, archive, analyze, distribute, and interpret primary data regarding life.
  12. Biodiversity Informatics: Community efforts to manage, archive, analyze, distribute, and interpret primary data regarding life. Neotoma Paleoecology Database.
  13. Paleobiodiversity Data: Figure adapted from: Plumer, B. (2014). There have been five mass extinctions in Earth’s history. Now we’re facing a sixth. The Washington Post. https://www.washingtonpost.com/news/wonk/wp/2014/02/11/there-have-been-five-mass-extinctions-in-earths-history-now-were-facing-a-sixth
  14. Paleobiodiversity Data: Figure adapted from: Booth, R. K., Brewer, S., Blaauw, M., Minckley, T. A., & Jackson, S. T. (2012). Decomposing the mid-Holocene Tsuga decline in eastern North America. Ecology, 93(8), 1841-1852.
  15. Recent Growth in Biodiversity Databases: Added 65.8 million records in 2015. [Figure: Millions of Records vs. Date of Record]
  16. Recent Growth in Biodiversity Databases: Added 65.8 million records in 2015. Neotoma Paleoecology Database: Added 1.5 million fossil records since 2010. [Figures: Millions of Records vs. Date of Record; Millions of Records vs. Date of Accession]
  17. The Four V’s of Big Data: Volume (size of the data) ✓
  18. The Four V’s of Big Data: Volume (size of the data) ✓; Veracity (uncertainty of the data) ✓
  19. The Four V’s of Big Data: Volume (size of the data) ✓; Veracity (uncertainty of the data) ✓; Variety (heterogeneity of the data and complexity of interrelationships) ✓
  21. The Four V’s of Big Data: Volume (size of the data) ✓; Veracity (uncertainty of the data) ✓; Variety (heterogeneity of the data and complexity of interrelationships) ✓; Velocity (sensitivity of data analysis to time)
  22. The Four V’s of Big Data: Volume (size of the data) ✓; Veracity (uncertainty of the data) ✓; Variety (heterogeneity of the data and complexity of interrelationships) ✓; Velocity (sensitivity of data analysis to time) ✗. Biodiversity data does not typically require real-time analysis.
  23. Inductive Learning: Predict future responses (Y) given a set of potential covariates (X). Training Examples: X and Y both known.
  24. Inductive Learning: Predict future response values (Y) given a set of potential covariates (X). Training Examples: X and Y both known. Build Model: Estimate functional relationship from x and y by minimizing loss criteria (y-ŷ).
  25. Inductive Learning: Predict future response values (Y) given a set of potential covariates (X). Training Examples: X and Y both known. Build Model. Future Cases: Only X is known. Estimate new Y’s (ŷ) from X: use the approximated functional relationship in prediction.
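
The train-then-predict cycle on the Inductive Learning slides can be sketched in a few lines. This is only an illustration: it assumes a single covariate, a straight-line functional form, and a squared-error loss, none of which the slides specify, and the data values are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Minimizes the sum of squared residuals (y - y_hat)^2, one possible
    choice for the "loss criteria" named on the slide (an assumption).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, xs):
    """Apply the estimated functional relationship to new covariate values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


# Training examples: X and Y both known (hypothetical values).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
model = fit_line(train_x, train_y)

# Future cases: only X is known, so Y is estimated from the fitted model.
future_x = [5.0, 6.0]
y_hat = predict(model, future_x)
print(y_hat)
```

An SDM follows the same pattern, with occurrence records as Y, environmental covariates as X, and a far more flexible model in place of the line.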
  26. Species Distribution Modeling: Training Examples + Environmental Covariates → SDM Algorithm → Predicted Future Distribution (using Predicted Future Environment). Predict future distribution of species from observations of current (or fossil) distribution and environmental/climatic covariates. Figure source: https://www.unil.ch/idyst/en/home/menuinst/research-topics/geoinformatics-and-spatial-m/predictive-biogeography/advancing-the-science-of-eco.html
  27. Types of SDM: Model Driven: Fit a parametric statistical model to a dataset. Make assumptions about form of the functional relationship between inputs and output. • Linear Regression • Generalized linear models • Logistic regression
  28. Types of SDM: Model Driven: Fit a parametric statistical model to a dataset. Make assumptions about form of the functional relationship between inputs and output. • Linear Regression • Generalized linear models • Logistic regression. Data Driven: Estimate relationship between inputs and outputs from the data. High sensitivity to small changes in input data. • Regression trees • Artificial neural networks • MaxEnt
  29. Types of SDM: Model Driven: Fit a parametric statistical model to a dataset. Make assumptions about form of the functional relationship between inputs and output. • Linear Regression • Generalized linear models • Logistic regression. Data Driven: Estimate relationship between inputs and outputs from the data. High sensitivity to small changes in input data. • Regression trees • Artificial neural networks • MaxEnt. Bayesian: Estimate the relationship between inputs and outputs as a probability distribution using prior knowledge and new data. Formally account for model uncertainty. • Gaussian random fields • Community full joint distribution modeling
  30. Algorithms in Contemporary SDM Literature: In 100 randomly sampled SDM papers… [Figure: Frequency of Bayesian, Data Driven, Model Driven, and Other algorithms]
  31. Cloud Computing: Enables convenient and on-demand access to configurable computing resources. Rapid provisioning and release with minimal management effort. Recent growth supported by federal agencies and public cloud providers.
  32. Hypothesis: For each SDM, there exists an optimal data-hardware configuration that: 1. Maximizes SDM accuracy 2. Balances the tradeoff between performance and expense by jointly minimizing the time and cost of modeling
  33. Hypothesis: For each SDM, there exists an optimal data-hardware configuration that: 1. Maximizes SDM accuracy 2. Balances the tradeoff between performance and expense by jointly minimizing the time and cost of modeling. Data: Training Examples, Number of Covariates. Hardware: CPU Cores, Memory. SDMs: Random Forests, Boosted Regression Trees, Generalized Additive Models, Adaptive Regression Splines.
  34. Methods: 1. Build large empirical dataset of SDM accuracy and runtime
  35. Methods: 1. Build large empirical dataset of SDM accuracy and runtime 2. Build model of computing cost
  36. Methods: 1. Build large empirical dataset of SDM accuracy and runtime 2. Build model of computing cost 3. Build two computational performance models (CPM): – Accuracy (AUC): Expected accuracy using given data – Runtime (seconds): Expected execution time on given data-hardware configuration
  37. Methods: 1. Build large empirical dataset of SDM accuracy and runtime 2. Build model of computing cost 3. Build two computational performance models 4. Use CPMs to identify optimal data-hardware configuration
  38. Performance and Accuracy Modeling Framework: Bayesian Additive Regression Trees. Additive tree inductive learning model in a Bayesian framework. Probability density of SDM execution time or accuracy under given input conditions.
  39. Runtime CPM Skill. [Figure panels: Regression Trees, Regression Splines, Additive Models, Random Forests]
  40. Runtime CPM Drivers: For each predictor, build a CPM without that predictor and compare its skill to the skill of the full model. [Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
  41. Runtime CPM Drivers: For each predictor, build a CPM without that predictor and compare its skill to the skill of the full model. Random Forests can execute in parallel. [Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
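
The drop-one-predictor comparison described on the Runtime CPM Drivers slides could look roughly like the sketch below. It is not the talk's implementation: a simple group-mean predictor stands in for the BART-based CPM, skill is measured as in-sample mean squared error, and the predictor names and runtimes are invented for illustration.

```python
from collections import defaultdict
from itertools import product


def group_mean_mse(rows, targets, columns):
    """In-sample MSE of a stand-in model that predicts the mean target of
    all rows sharing the same values in `columns` (not BART)."""
    groups = defaultdict(list)
    for row, y in zip(rows, targets):
        groups[tuple(row[c] for c in columns)].append(y)
    means = {key: sum(ys) / len(ys) for key, ys in groups.items()}
    errors = [(y - means[tuple(row[c] for c in columns)]) ** 2
              for row, y in zip(rows, targets)]
    return sum(errors) / len(errors)


def drop_one_importance(rows, targets, names):
    """Loss of skill (increase in MSE) when each predictor is left out."""
    all_columns = list(range(len(names)))
    full_mse = group_mean_mse(rows, targets, all_columns)
    importance = {}
    for k, name in enumerate(names):
        reduced = [c for c in all_columns if c != k]
        importance[name] = group_mean_mse(rows, targets, reduced) - full_mse
    return importance


# Hypothetical experiment grid: runtime depends on cores and training
# examples but not on memory.
names = ["cores", "examples", "memory_gb"]
rows = list(product([1, 2, 4], [1000, 2000], [4, 8]))
targets = [examples / cores / 10 for cores, examples, _ in rows]

importance = drop_one_importance(rows, targets, names)
print(importance)
```

In this toy grid, removing `memory_gb` costs no skill while removing `cores` or `examples` does, which is the kind of contrast the driver plots are meant to reveal.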
  42. Accuracy CPM Skill. [Figure panels: Regression Trees, Regression Splines, Additive Models, Random Forests]
  43. Accuracy CPM Drivers. [Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests]
  44. Choosing the Optimal Hardware for an SDM. [Diagram: Predictors (Hardware, Data) → CPMs (Accuracy CPM, Performance CPM) → Optimization (Time, Cost, Uncertainty) → Result (Optimal Configuration)]
  45. Choosing the Optimal Hardware for an SDM: 1. Identify data configuration of training examples and covariates that will maximize accuracy
  46. Choosing the Optimal Hardware for an SDM: 2. Predict execution time of that configuration on different hardware configurations. [Figure axes: Number of CPU Cores, Memory (GB); points: Unique Hardware Configuration]
  47. Choosing the Optimal Hardware for an SDM: 3. Hierarchical clustering on Time, Cost, and Posterior SD (spread). [Figure axes: Dissimilarity, Unique Hardware Configuration]
  48. Choosing the Optimal Hardware for an SDM: 4. Calculate cluster mean distance from origin. Choose the cluster closest to the origin. Optimal: No Time, No Cost, No Uncertainty.
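
Step 4 above can be sketched as follows. The example assumes that cluster labels are already available from step 3 and that each axis (time, cost, posterior SD) has been rescaled to a non-negative range so that the origin means "no time, no cost, no uncertainty"; the rescaling choice and all numbers are assumptions for illustration.

```python
import math


def closest_cluster(points, labels):
    """Return (label, mean distance) for the cluster whose members are,
    on average, nearest to the origin."""
    totals = {}
    counts = {}
    for point, label in zip(points, labels):
        distance = math.sqrt(sum(v * v for v in point))
        totals[label] = totals.get(label, 0.0) + distance
        counts[label] = counts.get(label, 0) + 1
    mean_distance = {label: totals[label] / counts[label] for label in totals}
    best = min(mean_distance, key=mean_distance.get)
    return best, mean_distance[best]


# Hypothetical hardware configurations as (time, cost, posterior SD),
# each axis rescaled to [0, 1], with cluster labels from a prior step.
points = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.1), (0.9, 0.8, 0.7), (0.8, 0.9, 0.9)]
labels = [0, 0, 1, 1]

best_label, best_distance = closest_cluster(points, labels)
print(best_label, round(best_distance, 3))
```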
  49. Optimal Configuration for Each SDM. [Figure panels: Generalized Additive Models, Boosted Regression Trees, Adaptive Regression Splines, Random Forests; axes: CPU Cores, Memory (GB)]
  50. MARS: An Unresolved Quandary. Multivariate Adaptive Regression Splines: No incremental preference for higher memory. [Figure axes: CPU Cores, Memory (GB)]
  51. Mean Distance from Origin. [Figure axes: CPU Cores, Memory (GB)]
  52. Recommendations: Redevelop models at the code-infrastructure interface to leverage high performance computing technologies. Prioritize model efficiency along with ecological realism in future development. Cloud computing offers the ability to run models on the right resources, not just the convenient ones. Promote extensions of this framework. [Figure: Millions of Records vs. Date of Record]
  53. Thanks! Questions?
  54. Bayesian Additive Tree Model Structure and Priors: 1. Node Depth Prior: $P(T_t) \sim \alpha(1+d)^{-\beta}$, where $\alpha \in (0, 1)$ and $\beta \in [0, \infty]$. 2. Leaf-Value Prior: $P(M_t \mid T_t)$, with $\mu_l \sim N(\mu_\mu / m, \sigma_\mu^2)$; $\mu_\mu$ is picked to be the range center, $(y_{min} + y_{max})/2$, and $\sigma_\mu^2$ is empirically chosen so that the range center plus or minus $k = 2$ variances cover 95% of the provided response values in the training set. 3. Error Variance Prior: $\sigma^2 \sim \mathrm{InvGamma}(\nu/2, \nu\lambda/2)$; $\lambda$ is determined from the data so that there is a $q = 90\%$ a priori chance (by default) that the BART model will improve upon the RMSE from an ordinary least squares regression.
  55. Bayesian Additive Tree Model Structure and Priors: 5. Response likelihood: mean of response in leaf in given MCMC iteration with variance: $y_l \sim N(\mu_l, \sigma^2)$. 6. Hyperparameters: $\alpha = 0.95$, $\beta = 2$, $k = 2$, $\nu = 3$, $q = 90\%$. R Package: bartMachine. Citation: Adam Kapelner, Justin Bleich (2016). bartMachine: Machine Learning with Bayesian Additive Regression Trees. Journal of Statistical Software, 70(4), 1-40. doi: 10.18637/jss.v070.i04
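
To give a feel for the node depth prior, the short sketch below evaluates α(1+d)^(−β) at the listed hyperparameters (α = 0.95, β = 2). It follows the usual reading of this prior in the BART literature, namely the prior probability that a node at depth d is split further; the function name is made up for this example.

```python
def split_probability(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is nonterminal,
    alpha * (1 + depth) ** (-beta), using the defaults listed on the slide."""
    return alpha * (1 + depth) ** (-beta)


# Evaluate the prior for the first few depths.
probabilities = [split_probability(d) for d in range(4)]
for depth, p in enumerate(probabilities):
    print(depth, round(p, 4))
```

The rapid decay with depth shows how the prior favors shallow trees, so that each tree in the sum contributes only a small part of the fit.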
  56. MARS Remedial Measures: • Resample so each configuration has n=1 observation – Completely covering parameter space – Need to reduce the influence of imbalanced design – No qualitative change in results • Future steps – Recollect dataset using a balanced design with multiple replicates
  57. Distributed Computing for SDM: A semi-automated workflow. 1. Configure and build virtual instances (Master Compute Node and Central Database): A. Which configurations have experiments that are not marked as COMPLETED? [GET /nextconfig] B. Return next configuration with experiments not marked as COMPLETED. [json] C. Create instance group with specified vCPU and memory. [gcloud create] 2. Run simulations and report results (Compute Node and Central Database): A. Select random experiment within machine’s computing capabilities. B. Return experiment specification. [json] C. Report time and accuracy measures to the database. On the Compute Node, an RScript runs the TimeSDM function: Load variables, Fit BRT Model, Predict to 2100, Evaluate Accuracy; it records Total Time, Fit Time, Predict Time, and Accuracy Time. 3. Manage virtual infrastructure (Master Compute Node and Central Database): A. What percent of the experiments in this configuration have been completed? [GET configstatus/cores/memory] B. Return percentage completion. [json] Poll every 30 seconds; if percent == 100, destroy instances, instance group, and instance template. [gcloud delete]
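
The control flow of the master node in this workflow might be sketched as below. Everything here is a stand-in: `FakeDatabase` replaces the central database and its HTTP endpoints, the log entries replace the `gcloud` create and delete calls, and the 30-second sleep between polls is replaced by directly completing one experiment. None of these names come from the original project.

```python
class FakeDatabase:
    """In-memory stand-in for the central database (hypothetical)."""

    def __init__(self, experiments_per_config):
        # Maps (cores, memory_gb) -> number of experiments in that configuration.
        self.total = dict(experiments_per_config)
        self.remaining = dict(experiments_per_config)

    def next_config(self):
        """Return a configuration with unfinished experiments, or None."""
        for config, left in self.remaining.items():
            if left > 0:
                return config
        return None

    def percent_complete(self, config):
        done = self.total[config] - self.remaining[config]
        return 100.0 * done / self.total[config]

    def complete_one_experiment(self, config):
        """Stands in for a compute node running an experiment and reporting back."""
        if self.remaining[config] > 0:
            self.remaining[config] -= 1


def master_loop(db, log):
    """Build instances for each unfinished configuration, poll until done, tear down."""
    while True:
        config = db.next_config()
        if config is None:
            break
        log.append(("create", config))          # real workflow: create instance group
        while db.percent_complete(config) < 100:
            db.complete_one_experiment(config)  # real workflow: sleep, then poll again
        log.append(("delete", config))          # real workflow: delete instances


db = FakeDatabase({(1, 4): 2, (2, 8): 3})
log = []
master_loop(db, log)
print(log)
```

Keeping all state in the database means the master only ever asks two questions, what to run next and how far along it is, which is what lets the workflow be semi-automated.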
  58. Clustering Specifications: • Axes: Run time, run cost, run time prediction standard deviation • Distance Metric: Euclidean • Linkage: Complete • Splitting rule: Silhouette (Rousseeuw, 1987): Maximize between cluster variance, minimize within cluster variance • Initial scale and center to reduce effect of axes with different dimensions. • Clustering package: base R (function hclust)
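
The talk used base R's `hclust`; the sketch below restates the same ingredients (center and scale, Euclidean distance, complete linkage) in plain Python so the steps are visible. It is a naive illustration on made-up data, it takes the number of clusters as an input rather than choosing it by silhouette, and it is not meant to reproduce `hclust` output exactly.

```python
import math


def standardize(points):
    """Center each axis on its mean and scale by its sample standard deviation."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(dims)]
    sds = [math.sqrt(sum((p[k] - means[k]) ** 2 for p in points) / (n - 1))
           for k in range(dims)]
    return [tuple((p[k] - means[k]) / sds[k] for k in range(dims)) for p in points]


def complete_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    farthest pair of members (Euclidean distance) is closest."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical configurations: (run time in s, run cost in $, prediction SD).
raw = [(10.0, 1.0, 0.5), (12.0, 1.1, 0.6), (50.0, 4.0, 2.0), (55.0, 4.2, 2.2)]
clusters = complete_linkage(standardize(raw), n_clusters=2)
print(clusters)
```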
  59. Publishing Venues
  60. Ecological Informatics: Impact Factor: 1.683. The scope of the journal takes into account the data-intensive nature of ecology, the precious information content of ecological data, the growing capacity of computational technology to leverage complex data as well as the critical need for informing sustainable management in spite of global environmental and climate change. The nature of the journal is interdisciplinary at the crossover between ecology and informatics. The journal invites papers on: • novel concepts and tools for monitoring, acquisition, management, analysis and synthesis of ecological data, including genomic and paleo-ecological data, • understanding ecosystem functioning and evolution, and • informing decisions on environmental issues like sustainability, climate change and biodiversity.
  61. Environmental Modeling and Software: Impact Factor: 4.207. The aim is to improve our capacity to represent, understand, predict or manage the behaviour of environmental systems at all practical scales, and to communicate those improvements to a wide scientific and professional audience. • Generic and pervasive frameworks, techniques and issues, including system identification theory and practice, model conception, model integration, model and/or software evaluation, sensitivity and uncertainty assessment, visualization, scale and regionalization issues. • Artificial Intelligence (AI) techniques and systems, such as knowledge-based systems / expert systems, case-based reasoning systems, data mining, multi-agent systems, Bayesian networks, artificial neural networks, fuzzy logic, or knowledge elicitation and knowledge acquisition methods. • Decision support systems and environmental information systems: implementation and use of environmental data and models to support all phases and aspects of decision making, in particular supporting group and participatory decision making processes. Intelligent Environmental Decision Support Systems can include qualitative, quantitative, mathematical, statistical, AI models and meta-models.
  62. Computers and Geosciences: Impact Factor: 2.474. Publications should apply modern computer science paradigms, whether computational or informatics-based, to address problems in the geosciences. • Computational/informatics elements may include: computational methods; algorithms; data structure; database retrieval; information retrieval; data processing; artificial intelligence; computer graphics; computer visualization; programming languages; parallel systems; distributed systems; the World-Wide Web; social media; and software engineering. • Geoscientific topics of interest include: mineralogy; petrology; geochemistry; geomorphology; paleontology; stratigraphy; structural geology; sedimentology; hydrogeology; oceanography; atmospheric sciences; climatology; meteorology; geophysics; geomatics; remote sensing; geodesy; hydrology; and glaciology.
  63. Computers, Environment and Urban Systems: Impact Factor: 2.092. Innovative computer-based research on urban systems, systems of cities, and built and natural environments, that privileges the geospatial perspective. Applied and theoretical contributions demonstrate the scope of computer-based analysis fostering a better understanding of urban systems, the synergistic relationships between built and natural environments, their spatial scope and their dynamics. Contributions emphasizing the development and enhancement of computer-based technologies for the analysis and modeling, policy formulation, planning, and management of environmental and urban systems that enhance sustainable futures are especially sought. The journal also encourages research on the modalities through which information and other computer-based technologies mold environmental and urban systems.
  64. Applied Artificial Intelligence: Impact Factor: 0.540. Addresses concerns in applied research and applications of artificial intelligence (AI). Articles highlight advances in uses of AI systems for solving tasks in management, industry, engineering, administration, and education; evaluations of existing AI systems and tools, emphasizing comparative studies and user experiences; and the economic, social, and cultural impacts of AI. Papers on key applications, highlighting methods, time schedules, person-months needed, and other relevant material are welcome.
  65. Species Distribution Models: “Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.” Elith and Leathwick (2009). [Figure labels: Environmental Covariates, Training Examples]
  66. Species Distribution Models: “Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.” Elith and Leathwick (2009)
  67. Species Distribution Models: “Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates…to predict distributions across landscapes.” Elith and Leathwick (2009). [Figure labels: 21st Century Climate Scenario, Predict Model, Probability of Presence in 2100]
  68. Parallel efficiency: Marginal returns in runtime of adding one more processor. [Figure axes: Efficiency, CPU Cores; series: Number of Training Examples]
  69. Parallel efficiency: Bigger datasets have higher efficiency on massively parallel configurations. [Figure axes: Efficiency, CPU Cores]
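
The slides do not give the formula behind the efficiency axis. A common definition, assumed here, is speedup divided by core count, T(1) / (p · T(p)); the marginal return is then the runtime saved by adding one more core. The runtimes below are invented for illustration.

```python
def parallel_efficiency(runtimes):
    """Efficiency T(1) / (p * T(p)) for each core count p.

    `runtimes` maps core count -> runtime in seconds and must include 1 core.
    This definition is an assumption; the slides do not state one.
    """
    t1 = runtimes[1]
    return {p: t1 / (p * tp) for p, tp in runtimes.items()}


def marginal_returns(runtimes):
    """Seconds saved by moving from each core count to the next larger one."""
    cores = sorted(runtimes)
    return {p: runtimes[q] - runtimes[p] for q, p in zip(cores, cores[1:])}


# Hypothetical runtimes (seconds) for one SDM on 1 to 4 cores.
runtimes = {1: 100.0, 2: 55.0, 3: 40.0, 4: 33.0}

efficiency = parallel_efficiency(runtimes)
returns = marginal_returns(runtimes)
print(efficiency)
print(returns)
```

With numbers like these, each added core saves less time than the one before it, which is the diminishing-returns pattern the slide title refers to.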
  70. What is a regression tree? Regression trees rely on recursive binary partitioning of predictor space into a set of hyperrectangles in order to approximate some unknown function f.
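
Recursive binary partitioning can be illustrated with a very small tree grower. The sketch assumes a squared-error split criterion, midpoint thresholds, and a fixed maximum depth; these are common choices rather than details given in the talk, and the data are made up.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, ys):
    """Return (score, feature, threshold) minimizing total SSE, or None."""
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for r, y in zip(rows, ys) if r[f] <= thr]
            right = [y for r, y in zip(rows, ys) if r[f] > thr]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best


def grow(rows, ys, depth=0, max_depth=2):
    """Recursively split predictor space into axis-aligned rectangles."""
    split = None
    if depth < max_depth and len(set(ys)) > 1:
        split = best_split(rows, ys)
    if split is None:
        return {"leaf": sum(ys) / len(ys)}
    _, f, thr = split
    left = [(r, y) for r, y in zip(rows, ys) if r[f] <= thr]
    right = [(r, y) for r, y in zip(rows, ys) if r[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk down the tree to the leaf whose rectangle contains the row."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


rows = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
tree = grow(rows, ys)
print(tree["threshold"], predict(tree, (2.5,)), predict(tree, (11.5,)))
```

Each leaf predicts a constant, so the fitted function is a step function over the rectangles, which is the approximation of f that the slide describes.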
