Species relatedness following phylogenetic …
Hints for further development: a problem of species relatedness. The problem can be …
The problem of species relatedness could be solved by artificially placing some factors at the top of a tree …
The problem of species relatedness could be solved by artificially placing some factors at the top of a tree …
Acknowledgements
• Co-authors: Petr Pyšek, Jan Pergl
• Wikipedia for freely available pictures
• Grants
  – ALARM: Assessing LArge-scale environmental Risks with tested Methods (European Union 6th …
• And last but not least, …
  1. Data Mining to Reveal Biological Invaders. VOJTĚCH JAROŠÍK. 1 Department of Ecology, Faculty of Science, Charles University, Prague and 2 Institute of Botany, Academy of Sciences of the Czech Republic, Průhonice, Czech Republic
  2. Background on biological invasions
     • Economic impacts
     • Changes in habitat properties
     • Loss of species diversity
     • Biological homogenization
  3. Economic impacts
  4. Economic impacts. Skibbereen 1847 by Cork artist James Mahony; Emigrants Leave Ireland, engraving by Henry Doyle
  5. Changes in habitat properties
  6. Extinction of native species. Brown tree snake on a fence post in Guam
  7. Biological homogenization
  8. Aim
     • Data-mining tools were originally designed for analyzing vast databases of often incomplete data, with the aim of finding financial frauds, suitable candidates for loans, potential customers and other uncertain outputs
     • I will show that searching for potential invasive species and the traits responsible for their invasiveness, or identifying factors that distinguish invasible communities from those that resist invasion, are similar risk assessments
       – This is perhaps the main reason why CART® and related methods are becoming increasingly popular in the field of invasion biology
       – Identifying homogeneous groups with high or low risk and constructing rules for making predictions about individual cases is, in essence, the same for financial credit scoring as for pest risk assessment
       – In both cases, one searches for rules that can be used to predict uncertain future events
  9. Basic principles of data mining
     • Main literature sources
     • Binary recursive partitioning
     • Classification And Regression Trees (CART®)
     • Random Forests (RF™)
  10. Basic principles of data mining
     • Classification and Regression Trees (CART®):
       – Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth International Group, Belmont
       – Steinberg D, Colla P (1995) CART: Tree-structured Non-parametric Data Analysis. Salford Systems, San Diego, USA
       – Steinberg D, Colla P (1997) CART: Classification and Regression Trees. Salford Systems, San Diego, USA
       – Steinberg D, Golovnya M (2006) CART 6.0 Users Manual. Salford Systems, San Diego, USA
       – De'ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81, 3178-3192
       – Bourg NA, McShea WJ, Gill DE (2005) Putting a cart before the search: successful habitat prediction for a rare forest herb. Ecology, 86, 2793-2804
       – Jarošík V (2011) CART and related methods. In: Encyclopedia of Biological Invasions (eds Simberloff D, Rejmánek M), pp. 104-108. University of California Press, Berkeley and Los Angeles, USA
     • Random Forests™:
       – Breiman L (2001) Random Forests. Machine Learning, 45, 5-32
       – Breiman L, Cutler A (2004) Random Forests™: An Implementation of Leo Breiman's RF™ by Salford Systems. Salford Systems, San Diego, USA
       – Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, et al. (2007) Random forests for classification in ecology. Ecology, 88, 2783-2792
       – Hochachka WM, Caruana R, Fink D, Munson A, Riedewald M, et al. (2007) Data-mining discovery of pattern and process in ecological systems. Journal of Wildlife Management, 71, 2427-2437
     • TreeNet™:
       – Friedman JH (1999) Stochastic Gradient Boosting. Technical report, Dept. of Statistics, Stanford University
       – Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189-1232
       – Friedman JH (2002) Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367-378
       – Hastie TJ, Tibshirani RJ, Friedman JH (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York
       – Jarošík V, Pyšek P, Kadlec T (2011) Alien plants in urban nature reserves: from red-list species to future invaders. NeoBiota, 10, 27-46
  11. Basic principles of predictive mining: binary recursive partitioning (the data are successively split)
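The splitting step behind binary recursive partitioning can be sketched in a few lines. The toy data, the single continuous predictor, and all names below are illustrative, not from the talk; a real CART run repeats this split recursively in each child node and then prunes the tree by cross-validation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(xs, ys):
    """Exhaustively try thresholds on one continuous predictor and
    return (threshold, impurity reduction) for the best binary split."""
    best_t, best_red = None, 0.0
    parent = gini(ys)
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        w = len(left) / len(ys)
        red = parent - w * gini(left) - (1 - w) * gini(right)
        if red > best_red:
            best_t, best_red = t, red
    return best_t, best_red

# Cases with small predictor values are mostly class 0, large ones class 1
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 1, 1, 1, 1, 1]
threshold, reduction = best_split(xs, ys)
```

Here the search lands on the threshold that makes both children pure, which is exactly what makes the resulting rules so easy to read off the tree.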
  12. Basic principles of CART®
     • CART® provides a graphical, highly intuitive insight into the data
     • [Figure: classification tree for the presence/absence of alien species, rooted at a "Run off" split (N = 637) and branching on road density outside, natural areas outside and road present inside, with five terminal nodes]
     • The tree is represented (in defiance of gravity) with the root, standing for the undivided data, at the top
  13. Basic principles of CART®
  14. Basic principles of Random Forests™
     • Random forests can be seen as an extension of classification trees: many sub-trees are fitted to parts of the dataset and the predictions from all trees are then combined
     • Figure 3. Ranking of importance values (%) for all invasive species. Ranking is scaled relative to the best performing variable based on the out-of-bag method of random forests. White bars are predictors from outside Kruger National Park and grey bars from the inside.
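The bagging idea behind random forests can be illustrated with a deliberately minimal sketch: each bootstrap sample gets a one-split "stump" instead of a full tree, and predictions are combined by majority vote. The data, the misclassification-count splitting rule and all names are invented for illustration; real random forests also sample a random subset of predictors at each split.

```python
import random

def fit_stump(xs, ys):
    """One-split tree: pick the threshold with the fewest
    misclassifications (a crude stand-in for impurity splitting)."""
    best_t, best_err = None, len(ys) + 1
    for t in sorted(set(xs))[:-1]:
        err = sum((0 if x <= t else 1) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t if best_t is not None else xs[0]

def forest_predict(stumps, x):
    """Majority vote over all stumps."""
    votes = sum(0 if x <= t else 1 for t in stumps)
    return 1 if 2 * votes >= len(stumps) else 0

random.seed(0)
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = []
for _ in range(25):
    idx = [random.randrange(len(xs)) for _ in range(len(xs))]  # bootstrap sample
    stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
```

Averaging many such weak trees is what stabilises the predictions; variable importance (as in Figure 3) is then read off by asking how much the out-of-bag error worsens when a predictor is permuted.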
  15. Important properties of data mining models
     • Exploratory and flexible
     • Non-parametric
     • Surrogates
     • Penalization
     • Weighting
     • Scoring
     • Artificially placing some predictors at the top of a tree
  16. Data mining models are exploratory
     • Unlike the classical linear methods, data mining techniques enable predictions to be made from the data and identify the most important predictors by screening a large number of candidate variables, without requiring the user to make any assumptions about the form of the relationships between the predictors and the target variable, and without a priori formulated hypotheses
  17. Data mining models are flexible
     • These techniques are also more flexible than traditional statistical analyses because they can reveal structures in the dataset that are other than linear, and resolve complex interactions
     • Unlike linear models, which uncover a single dominant structure in the data, data mining models are designed to work with data that might have multiple structures:
       – The models can use the same explanatory variable in different parts of the tree, dealing effectively with nonlinear relationships and higher-order interactions
     • In fact, provided there are enough observations, the more complex the data and the more variables available, the better data mining models will perform compared with alternative methods
       – With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models
     • Data mining models are also excellent for initial data inspection
       – Models are often used to select a manageable number of core measures from databases with hundreds of variables
       – A useful subset of predictors from a large set of variables can then be used in building a formal linear model
  18. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models
  19. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models
     Table 2. Test statistics and significances of the explanatory variables and their interactions in the AIC minimal adequate models for proportion of archaeophytes. Non-significant variables and their interactions are not shown. R2 = 0.82

     Explanatory variable (archaeophytes)   df   Deviance   P        R2
     Habitat type                           31   90521.2    <0.001   0.760
     Riverside position                      1      28.5    <0.001   <0.001
     Vegetation cover                        1      88.4    <0.001   <0.001
     Surrounding urban/industrial land       1     660.7    <0.001   0.005
     Surrounding agricultural land           1     852.8    <0.001   0.007
     Altitudinal floristic region            2    1755.1    <0.001   0.015
     Altitude                                1     647.7    <0.001   0.005
     Temperature                             1      28.3    <0.001   <0.001
     Precipitation                           1     392.8    <0.001   0.003
     Habitat type x riverside position           n.s.
     Habitat type x vegetation cover        32     373.0    <0.001   0.003
     Habitat type x urban/industrial        32     480.2    <0.001   0.004
     Habitat type x agricultural            32     278.7    <0.001   0.002
     Habitat type x human density           32     172.9    <0.001   0.001
     Habitat type x floristic region        53     405.3    <0.001   0.003
     Habitat type x altitude                32     284.0    <0.001   0.002
     Habitat type x temperature             32      91.1    <0.001   <0.001
     Habitat type x precipitation           32     181.2    <0.001   0.001
     Riverside position x precipitation      1      11.4    <0.001   <0.001
     Vegetation cover x agricultural         1      34.6    <0.001   <0.001
     Vegetation cover x human density        1       4.4    0.036    <0.001
     Vegetation cover x floristic region     2       9.4    0.009    <0.001
     Vegetation cover x precipitation        1       9.2    0.002    <0.001
     Urban/industrial x human density            n.s.
     Urban/industrial x floristic region     2       9.4    0.009    <0.001
     Agricultural x altitude                     n.s.
     Agricultural x temperature              1      38.0    <0.001   <0.001
     Agricultural x precipitation            1      10.0    0.002    <0.001
     Floristic region x altitude             2      11.1    0.004    <0.001
     Floristic region x temperature          2       8.2    0.016    <0.001
     Floristic region x precipitation            n.s.
     Altitude x precipitation                1       6.2    0.013    <0.001
     Temperature x precipitation                 n.s.
  20. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models. R2 = 0.86
  21. With a complex data set, understandable and generally interpretable results often can be found only by constructing data mining models. R2 = 0.74
     ANOVA table and deletion tests for linear minimal adequate model (MAM) of ANCOVA describing population characteristics determining the impact on diversity between invaded and uninvaded pairs of plots.

     Source of variation               Df    SS        MS       F       Df (test)   P
     Species x Height                  13    25.4219   1.9555   1.93    13, 100     0.04
     Species x Differences in height   13     5.0490   0.3884   2.11    13, 100     0.02
     Species x Differences in cover    13    14.1461   1.0882   10      13, 100     <0.001
     Height x Differences in height     1     0.3246   0.3246   4.44     1, 88      0.04
     Height x Differences in cover      1     0.2572   0.2572   4.83     1, 88      0.03
     Cover x Differences in cover       1     0.6834   0.6834   6.48     1, 88      0.01
     Residuals                         87     9.1763   0.1055
     Total                            129    55.0584            10.36   42, 87      <0.001   R2 (R2adj.) = 0.83 (0.76)
  22. Models are often used to select a manageable number of core measures… A useful subset of predictors … can then be used in building a formal linear model
     [Figure: pruned classification tree (N = 637) splitting on water run-off (<= 6.0 vs > 6.0) and road density outside (<= 0.1 vs > 0.1), with terminal nodes containing 5.4%, 65.5% and 78.0% presences, shown next to the fitted response surface for the probability of an alien record against water run-off and road density within 10 km]
  23. Models are often used to select a manageable number of core measures… A useful subset of predictors … can then be used in building a formal linear model
  24. Models are often used to select a manageable number of core measures… A useful subset of predictors … can then be used in building a formal linear model
     The probability of non-native plant presence was associated with water runoff (0-27 million m3) and major road density within a 10 km radius outside the KNP boundary (0-0.15 km/km2). Overall significance of the model is G2 = 531.54, df = 3, P < 0.0001. Parameters of the model:

     Parameter                Estimate   ASE    Standardized estimate   ASE    G2       df   P
     Intercept                -6.30      0.66   -0.73                   0.18
     Run-off                   0.74      0.08    3.37                   0.31   264.82   1    0.0001
     Road density             69.70      8.59    1.54                   0.20   127.96   1    0.0001
     Run-off x Road density   -5.90      0.89   -1.63                   0.25    49.13   1    0.0001
  25. Data mining methods are non-parametric
     • Consequently, unlike with parametric linear models, a non-normal distribution does not require transformations
       – Because the trees are invariant to monotonic transformations of continuous predictor variables, no transformation is needed prior to analyses
       – Outliers among the predictors generally do not affect the models because splits usually occur at non-outlier values
       – (However, in some circumstances, transformation of the target variable may be important to alleviate variance heterogeneity)
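The invariance to monotonic transformations can be demonstrated directly: the best Gini split partitions the cases identically whether a skewed predictor is used raw or log-transformed. A minimal sketch with made-up data:

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_partition(xs, ys):
    """Indices of the cases sent left by the best single Gini split."""
    best, best_red = None, -1.0
    parent = gini(ys)
    for t in sorted(set(xs))[:-1]:
        left = [i for i, x in enumerate(xs) if x <= t]
        ly = [ys[i] for i in left]
        ry = [y for i, y in enumerate(ys) if i not in left]
        w = len(ly) / len(ys)
        red = parent - w * gini(ly) - (1 - w) * gini(ry)
        if red > best_red:
            best, best_red = frozenset(left), red
    return best

xs = [1, 3, 9, 27, 81, 243]   # strongly skewed predictor
ys = [0, 0, 0, 1, 1, 1]
raw = best_partition(xs, ys)
logged = best_partition([math.log(x) for x in xs], ys)
```

Because only the ordering of the predictor values matters, `raw` and `logged` are the same set of cases; a linear model fitted to the same skewed predictor would generally change with the transformation.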
  26. Surrogates
     • Surrogates of each split, describing splitting rules that closely mimic the action of the primary split, can be assessed and ranked according to their association values, with the highest possible value 1.0 corresponding to a surrogate producing exactly the same split as the primary split
       – Surrogates can then be used to replace an expensive primary explanatory variable with a less expensive or easier to interpret, though probably less accurate, one
       – Surrogates can also be used to build alternative trees
  27. Surrogates can be used to build alternative trees
  28. Surrogates can be used to build alternative trees
  29. Surrogates can be used to build alternative trees
     Appendix S6. Alternative model to the optimal classification tree. The alternative categorical split "Run off" (none: no rivers intersected the segment; low: 2-10; medium: 10-15; high: 15-26 million m3 / quaternary watershed / annum) replaced the continuous primary split "Water run-off" at node 1 of the optimal tree (Fig. 3) as a surrogate with association value = 0.86. Overall misclassification rate of the model is 13.5%, sensitivity 0.90 and specificity 0.80. "Natural areas outside" refers to the percentage of natural areas in a 5-km radius outside the KNP boundary; "Road present inside" refers to the presence or absence of roads in a studied segment inside KNP. Otherwise as in Fig. 3.
  30. Surrogates
     • Surrogates of each split, describing splitting rules that closely mimic the action of the primary split, can be assessed and ranked according to their association values, with the highest possible value 1.0 corresponding to a surrogate producing exactly the same split as the primary split
       – Surrogates also serve to treat missing values, because the alternative splits are used to classify a case when its primary splitting variable is missing
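A simplified sketch of ranking surrogates: here "association" is just the proportion of cases that a candidate split sends the same way as the primary split (CART's actual association measure is more elaborate, normalising against a default rule). The variables echo the KNP example, but every number and threshold below is invented.

```python
def side(xs, t):
    """Which side of the threshold each case falls on."""
    return ["L" if x <= t else "R" for x in xs]

def association(primary_sides, xs, t):
    """Proportion of cases sent the same way as the primary split."""
    cand = side(xs, t)
    return sum(a == b for a, b in zip(primary_sides, cand)) / len(cand)

# Hypothetical data: primary splitter is water run-off; candidate
# surrogates are road density (correlated) and altitude (unrelated).
runoff   = [2, 4, 5, 7, 12, 15, 18, 20]
road     = [0.02, 0.05, 0.06, 0.08, 0.3, 0.5, 0.6, 0.7]
altitude = [300, 900, 250, 800, 320, 880, 270, 860]

primary = side(runoff, 6)                    # primary split: run-off <= 6
a_road = association(primary, road, 0.1)
a_alt = association(primary, altitude, 500)
```

When a case's run-off value is missing, the highest-ranked surrogate (here road density) decides which way the case goes, which is exactly how trees keep such cases in the analysis.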
  31. Surrogates serve to treat missing values
     • Because data mining techniques can handle data gaps by calculating surrogate variables to replace missing values, they often allow a larger dataset to be used than in linear models, in which all cases with missing values usually have to be discarded
     • Consequently, data mining techniques can often reveal additional factors relevant for predictions that may have been overlooked when tested by classical statistical approaches
  32. Calculating surrogate variables to replace missing values …enables use of a larger dataset… the data mining techniques thus often reveal additional factors relevant for predictions
     [Figure: tree fitted to the reduced dataset without surrogates, N = 136]
  33. Calculating surrogate variables to replace missing values …enables use of a larger dataset… the data mining techniques thus often reveal additional factors relevant for predictions
     [Figure: optimal classification tree for the full dataset (N = 173), rooted at "Man-made habitat" and splitting further on habitat, pathway (escaped), world region, spatial extent and area infested, with seven terminal nodes]
     Fig. 1. Optimal classification tree for factors relating to success and failure of 173 eradication campaigns against invertebrate plant pests, weeds and plant pathogens (viruses/viroids, bacteria and fungi). Overall misclassification rate of the optimal tree is 15.8% compared to 50% for the null model, with 16.7% misclassified success and 14.8% failure cases. Sensitivity is 83.3 and specificity 85.2% for learning samples, and 77.1 and 69.0%, respectively, for cross-validated samples.
     Submitted to: PLoS One. Which factors affect the success or failure of eradication campaigns against alien species? Therese Pluess1, Vojtěch Jarošík2,3*, Petr Pyšek3,2, Ray Cannon4, Jan Pergl3, Annemarie Breukers5, Sven Bacher1. 1 University of Fribourg, Department of Biology, Ecology & Evolution Unit, Chemin du Musée 10, 1700 Fribourg, Switzerland; 2 Charles University in Prague, Faculty of Science, Department of Ecology, CZ-128 44 Praha 2, Czech Republic; 3 Institute of Botany, Academy of Sciences of the Czech Republic, CZ-252 43 Průhonice, Czech Republic; 4 The Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, United Kingdom; 5 LEI, part of Wageningen UR, P.O. Box 8130, 6700 EW, Wageningen, The Netherlands
  34. Penalization
     • As it is easier to be a good splitter on a small number of records (e.g., splitting a node with just two records), the splitting power of explanatory variables with missing values should be penalized in proportion to the degree to which they are missing, to prevent such variables from having an advantage as splitters
     • High-level categorical predictor variables have inherently higher splitting power than continuous predictors and can therefore also be penalized to level the playing field
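One common convention for this penalty is to scale a candidate splitter's improvement by the fraction of cases on which the variable is actually observed, so that a variable measured on only a few records loses its artificial advantage. The function and the two example variables below are a hedged sketch of that idea, not the exact formula used by any particular CART implementation.

```python
def penalised_improvement(raw_improvement, n_observed, n_total):
    """Scale a candidate splitter's improvement by the fraction of
    cases on which the variable is actually observed."""
    return raw_improvement * (n_observed / n_total)

# Variable A: modest improvement, fully observed on 100 cases.
# Variable B: better raw improvement, but observed on only 20 cases.
a = penalised_improvement(0.30, 100, 100)
b = penalised_improvement(0.45, 20, 100)
```

After penalisation the fully observed variable A wins the split even though B's raw improvement was larger; an analogous penalty, proportional to the number of categories, can demote high-level categorical splitters.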
  35. High-level categorical explanatory variables have inherently higher splitting power than continuous explanatory variables and therefore can also be penalized to level the playing field
  36. Weighting
     • Weighting enables one to give a different weight to each case in the analysis
     • A usual application is to proportional data, where proportions calculated from larger samples give more precise estimates; proportional response variables are therefore weighted by their sample sizes
     • A similar approach can be applied to stratified sampling with strata having different sampling intensities
     • Data-mining models can also accommodate situations in which some misclassifications are more serious than others
       – For instance, if invasion risks are classified as low, moderate, and high, it would be more costly to classify a high-risk species as low-risk than as moderate-risk
       – This can be treated by specifying a differential penalty, weighting the different ways of misclassifying high, moderate, and low risk
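The differential-penalty idea can be sketched as cost-sensitive node labelling: each kind of misclassification carries a weight, and a node is labelled with the class that minimises the expected cost. The two-class node and the cost values below are illustrative, not from the talk.

```python
def label_node(counts, costs):
    """counts: number of cases of each true class in a node.
    costs[(true, predicted)]: penalty for that misclassification.
    Returns the class label minimising expected misclassification cost."""
    classes = list(counts)
    def expected_cost(pred):
        return sum(counts[c] * costs.get((c, pred), 0) for c in classes)
    return min(classes, key=expected_cost)

counts = {"low": 6, "high": 4}                       # node with mostly low-risk cases
plain = {("low", "high"): 1, ("high", "low"): 1}     # all errors equally bad
costly = {("low", "high"): 1, ("high", "low"): 5}    # missing high risk is worse
```

With equal costs the majority class wins, so `label_node(counts, plain)` returns "low"; once misclassifying a high-risk case is made five times as costly, the same node is labelled "high", which is exactly the behaviour wanted for conservative risk assessment.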
  37. Weighting: a usual application is to proportional data
  38. Weighting: stratified sampling
     [Figure: the Fig. 1 eradication-campaign classification tree (N = 173) shown again, here to illustrate weighting of strata sampled with different intensities]
  39. Scoring. Ageratum houstonianum, Chromolaena odorata, Xanthium strumarium, Argemone ochroleuca, Lantana camara, Opuntia stricta
  40. Scoring
  41. Scoring
     Predictability of presence, compared to all aliens with 92.9% of correct predictions: Ageratum 77.8%, Argemone 84.8%, Chromolaena 89.5%, Opuntia 95.4%, Xanthium 95.6%, Lantana 98.1%
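Scoring a new case just means dropping it down the grown tree and reading off the class proportion of the terminal node it reaches. A sketch using the run-off and road-density splits and terminal-node proportions of the KNP tree shown earlier; the function and case names are illustrative.

```python
def score(case):
    """Probability of an alien-species record for one KNP segment,
    read off the terminal nodes of the pruned tree."""
    if case["runoff"] <= 6.0:
        if case["road_density"] <= 0.1:
            return 0.054   # terminal node 1: 5.4% presences
        return 0.655       # terminal node 2: 65.5% presences
    return 0.780           # terminal node 3: 78.0% presences

low_risk = score({"runoff": 2.0, "road_density": 0.05})
high_risk = score({"runoff": 12.0, "road_density": 0.2})
```

The same routine, applied to every segment or species in a database, produces per-case risk scores like the per-species predictabilities listed above.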
  42. Artificially placing some factors at the top of a tree using Splitter Variable Disallow Criteria
     [Figure: classification tree (N = 173) rooted at the event-specific splits "Area infested" (<= 4905 ha) and "Reaction time" (<= 11 yrs), then branching on sanitary control, man-made habitat, detectability, pathway (escaped) and world region, with eight terminal nodes]
     Fig. 3. Optimal classification tree with event-specific factors placed at the top of the tree. Overall misclassification rate of the optimal tree is 18.0%, with 15.3% misclassified success and 23.4% failure cases. Sensitivity and specificity are respectively 84.7 and 76.6% for learning, and 66.7 and 64.9% for cross-validated samples.
     Submitted to: PLoS One (Pluess et al., Which factors affect the success or failure of eradication campaigns against alien species?)
  43. Limitations
     • Data mining models are good, but not good enough to solve completely all problems with data that violate the basic assumption of independence of the errors of observations, whether due to temporal or spatial autocorrelation or to the related problem of phylogenetic relatedness
     • Data mining methods cannot distinguish fixed and random effects and thus cannot be used with mixed-effect and nested statistical designs
     • The tree-growing method is data intensive, requiring many more cases than classical regression
       – While for multiple regression it is usually recommended to keep the number of explanatory variables six to ten times smaller than the number of observations, for classification trees at least 200 cases are recommended for binary response variables, and about 100 more cases for each additional level of a categorical variable
       – The efficiency of trees decreases rapidly with decreasing sample size, and for small data sets no test samples may be available
  44. The tree-growing method is data intensive, requiring many more cases than classical regression
     [Figure: paired variable-importance rankings for plant richness (A) and butterfly richness (B)]
     Figure A1. Rank of importance of the individual explanatory variables from the six groups of explanatory variables (A. Geography, B. Past habitats, C. Present habitats, D. Substrate, E. Vegetation heterogeneity, F. Urbanization) from random forests classifications, used for predicting the above and below median value for each observation of species richness and endangered species across 48 reserves. Variable importance is rescaled to have values between 0 and 100. A. Variable importance for plant richness: final error rate 45.8%, sensitivity 50.0%, specificity 58.3%. B. Variable importance for butterfly richness: final error rate 31.6%, sensitivity 63.6%, specificity 73.1%.
  45. Hints for further development: a problem of species relatedness
  • The problem is that related species can have similar traits inherited from a common ancestor
  • Consequently, the traits of related species are not independent, and this relatedness has to be taken into account in statistical analyses
  • Considering the effects of species relatedness is important not only to treat the lack of statistical independence, but also for practical applications
    – For instance, when one needs to know how species belonging to a particular taxon (such as a family, order, or class) are predisposed to invasion
  46. Species relatedness approximated by taxonomic hierarchy
  • As a first approximation, instead of real relatedness based on traits inherited from common ancestors and described by a phylogenetic tree, we can use the taxonomic hierarchy, assuming that species relatedness increases from subkingdoms within kingdoms down to subspecies within species
  • First, the trees can be constructed with only the highest taxonomic level, and then with each lower taxonomic level included one after another
    – This successive treatment of taxonomy reveals at which taxonomic level the examined trait has the largest effect
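The successive inclusion of taxonomic levels can be sketched in a few lines of Python. The records and species names below are made up purely for illustration, and majority-class accuracy within groups stands in for a tree's split quality:

```python
from collections import defaultdict

# Toy records: (family, genus, species, invasive?) -- hypothetical data,
# only meant to illustrate the successive-taxonomy idea.
records = [
    ("Poaceae",  "Bromus",    "tectorum",     1),
    ("Poaceae",  "Bromus",    "hordeaceus",   1),
    ("Poaceae",  "Festuca",   "ovina",        0),
    ("Fabaceae", "Robinia",   "pseudoacacia", 1),
    ("Fabaceae", "Trifolium", "repens",       0),
    ("Fabaceae", "Trifolium", "pratense",     0),
]

def majority_accuracy(groups):
    """Accuracy of predicting the trait by the majority class within
    each group -- a crude stand-in for a one-level tree split."""
    correct = total = 0
    for labels in groups.values():
        ones = sum(labels)
        correct += max(ones, len(labels) - ones)
        total += len(labels)
    return correct / total

# Successively include lower taxonomic levels, as on the slide.
for depth, name in enumerate(["family", "genus", "species"], start=1):
    groups = defaultdict(list)
    for rec in records:
        groups[rec[:depth]].append(rec[3])
    print(name, round(majority_accuracy(groups), 2))
# -> family 0.67 / genus 1.0 / species 1.0
```

In this toy data set the jump in accuracy occurs when genus is added, so genus is the level at which the examined trait has the largest effect.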
  47. Species relatedness approximated by taxonomic hierarchy
  • As a first approximation, instead of real relatedness based on traits inherited from common ancestors, we can use the taxonomic hierarchy, assuming that species relatedness increases from subkingdoms within kingdoms down to subspecies within species
  • Second, the splitting power of taxonomic levels in regression and classification trees can be penalized in proportion to the number of categories at each taxonomic level, taking advantage of the fact that the number of units increases from the highest taxonomic level to the lowest
    – This penalization, weighting the taxonomic levels from the lowest to the highest in proportion to the number of categories at each level, can thus reveal the real weighted effect of a particular taxon on a particular species trait
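One simple form the penalization could take is dividing each level's raw split improvement by its number of categories. All numbers below are invented for illustration, and this is just one possible choice of penalty, not the CART® implementation:

```python
# Hypothetical raw split improvements for each taxonomic level, together
# with the number of categories at that level (values are made up).
levels = {"family": (0.30, 5), "genus": (0.45, 20), "species": (0.50, 80)}

# Penalize each level's splitting power in proportion to its number of
# categories, as described on the slide.
penalised = {lvl: imp / n_cat for lvl, (imp, n_cat) in levels.items()}
best = max(penalised, key=penalised.get)
print(best)  # -> family
```

Unpenalized, the species level would win simply because many categories make splitting easy; after weighting, the family level carries the largest real effect.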
  48. Species relatedness following a phylogenetic tree
  • The best way to treat species relatedness is to use the actual phylogenetic distances calculated from a specific phylogenetic tree
  • Technically, this can be done by calculating principal coordinate axes that describe the distances between all taxa in the phylogenetic tree
  • However, their use in CART® is hindered by the fact that these principal coordinate axes are orthogonal, while classification and regression trees show their greatest strengths with highly nonlinear structure and complex interactions
    – Their usefulness decreases with increasing linearity of the relationships, and consequently no trees are built on mutually independent principal coordinates
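The principal coordinate step itself is standard: double-centre the squared distance matrix and take the eigenvectors scaled by the square roots of the positive eigenvalues. A minimal sketch, using a made-up tree-like distance matrix among four taxa rather than real phylogenetic data:

```python
import numpy as np

def pcoa(D):
    """Classical principal coordinate analysis (metric MDS): turn a
    pairwise distance matrix into orthogonal coordinate axes."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-10                      # positive eigenvalues only
    return vecs[:, keep] * np.sqrt(vals[keep])

# Hypothetical patristic distances among four taxa (symmetric, zero diagonal).
D = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 4],
     [6, 6, 4, 0]]
coords = pcoa(D)

# The recovered orthogonal axes reproduce the original distances:
recovered = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(np.allclose(recovered, D))  # True
```

The resulting columns of `coords` are exactly the mutually independent axes the slide refers to, which is why a tree built on them finds little interaction structure to exploit.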
  49. Hints for further development: a problem of species relatedness
  The problem can be solved by using the Splitter Variable Disallow Criteria in CART®
  50. The problem of species relatedness could be solved by artificially placing some factors at the top of a tree
  • When species relatedness is approximated by taxonomic hierarchy, the Splitter Variable Disallow Criteria can be used directly to place the individual levels of the nested taxonomic hierarchy appropriately
  51. The problem of species relatedness could be solved by artificially placing some factors at the top of a tree
  • When species relatedness directly follows phylogenetic distances between taxa on a phylogenetic tree, the Splitter Variable Disallow Criteria can again be used directly to place the individual splits
  • A further development of the Splitter Variable Disallow Criteria might also enable the real distances between individual taxa to be included, by developing disallow criteria that define not only the split region but also the distances between the individual splits, e.g. using predefined importance values for the distances among the splits
    – This might also be useful for other applications in which it is desirable not only to place some predictors deliberately at a chosen position in a tree structure, but also to define the precise distance between the predictors
  52. Acknowledgements
  • Co-authors: Petr Pyšek, Jan Pergl
  53. Acknowledgements
  • Wikipedia for freely available pictures
  54. Acknowledgements
  • Grants
    – ALARM: Assessing LArge-scale environmental Risks with tested Methods (European Union 6th Framework Programme); DAISIE: Delivering Alien Invasive Species Inventories for Europe (European Union 6th Framework Programme); PRATIQUE: Enhancements of Pest Risk Analysis Techniques (European Union 7th Framework Programme, 2008–2011)
    – Czech Science Foundation grant no. 206/09/0563
    – Long-term research projects no. RVO 67985939 from the Academy of Sciences of the Czech Republic, MSM 21620828 and LC06073 from the Ministry of Education, Youth and Sports of the Czech Republic
    – Praemium Academiae award from the Academy of Sciences of the Czech Republic to Petr Pyšek
  55. Acknowledgements
  • And last but not least, …
