Intelligent analysis of environmental data: an introduction Mikhail Kanevski – Institute of Geomatics and Risk Analysis (IGAR), University of Lausanne (Switzerland)

    Intelligent analysis of environmental data: an introduction Mikhail Kanevski – Institute of Geomatics and Risk Analysis (IGAR), University of Lausanne (Switzerland) - Presentation Transcript

    1. International Workshop: Intelligent Analysis of Environmental Data Institute of Geomatics and Analysis of Risk (IGAR) University of Lausanne, Switzerland Prof. Mikhail Kanevski M. Kanevski, Palermo 2009 1
    2. Comments and questions to: • Mikhail.Kanevski@unil.ch – www.unil.ch/igar – www.geokernels.org
    3. General Introduction • Typical problems • Approaches • Solutions • Future research
    4. Geo- and Environmental Data (classes, continuous, images, networks, geomanifolds, …) • Spatio-temporal • Multi-scale • Multivariate • Highly variable at many scales • High-dimensional geo-feature spaces • Uncertainties • … • In some cases we do have science-based models: data/knowledge/models integration
    5. Spatio-temporal data in terms of patterns/structures: a. pattern recognition (pattern discovery, pattern extraction), b. pattern modelling, c. pattern prediction
    6. Main Topics: • Review and posing of typical problems. • From “numbers” to data • Collection of data: monitoring networks and data representativity? Monitoring network optimisation. • Get more information value from your data – EXPLORE! Exploratory spatio-temporal data analysis (EDA, ESDA). • Predictions/estimations or simulations? Risk analysis and mapping • Let data speak for themselves: learning from data. Data mining, machine learning.
    7. Methods: • Monitoring network descriptions • Geostatistics: predictions/simulations • Machine Learning (neural nets, SLT): – Neural networks: MLP, PNN, GRNN, RBF, SOM. ANNEX models. Hybrid models – Support Vector Machines • Recent trends in geostatistics: multiple-point geostatistics, pattern-based geostatistics. • Bayesian approach for uncertainty assessment, integration of data and science-based models (Bayesian Maximum Entropy)
    8. Spatial data analysis: typical tasks • Predict a value at a given point. • Build a map (isolines, 3D surfaces, …). • Estimate prediction error. • Take into account measurement errors. • Risk mapping: uncertainty mapping around an unknown value. Estimate the probability of exceeding a given decision level. • Joint predictions of several variables (improve predictions of the primary variable using auxiliary data and information). • Optimization of monitoring network (design/redesign) • Simulations: modelling of spatial uncertainty and variability • Data/science-based models assimilation/fusion • Image analysis. Remote sensing • Spatio-temporal events (forest fires, epidemiology, crime, …) • Predictions/simulations in high dimensional spaces • …
    9. Generic Methodology [flowchart: Data; Data Base Management System; Statistical Description; Quick Visualisation; Monitoring Network Analysis; Variography; Deterministic Interpolations; Monitoring Network Generation; Cross-validation; Machine Learning Algorithms; Geostatistical Predictions & Simulations; Decision-oriented Mapping; GIS, Remote Sensing]
    10. GEOSTATISTICAL ANALYSIS • Basic/Naïve statistical analysis. EDA • ESDA (regionalized EDA) • Structural analysis. Spatial correlation analysis (variography) • Model selection: Cross-validation, jack-knife,… • Prediction and error mapping for decision making (family of kriging models) • Probability and Risk mapping. Conditional stochastic simulations
    11. Some Geostatistics • Exploration of spatial correlations • Family of kriging models (simple, ordinary, disjunctive, indicator,…) • Conditional Stochastic Simulations
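
The "exploration of spatial correlations" step is usually done with an empirical semivariogram. The sketch below is not taken from the slides; it assumes the classical estimator gamma(h) = 1 / (2 N(h)) * sum of (z_i - z_j)^2 over the N(h) pairs whose separation falls in a lag bin, and the names `empirical_semivariogram`, `lag_width` and `n_lags` are illustrative.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Return {lag_bin: (mean_distance, semivariance, n_pairs)}.

    Assumed estimator: gamma(h) = sum((z_i - z_j)^2) / (2 * N(h)),
    where the sum runs over the N(h) pairs falling into the lag bin.
    """
    # Per bin: [sum of distances, sum of squared differences, pair count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            lag_bin = int(h // lag_width)
            if lag_bin >= n_lags:
                continue  # pair is farther apart than the largest lag
            entry = sums[lag_bin]
            entry[0] += h
            entry[1] += (values[i] - values[j]) ** 2
            entry[2] += 1
    return {b: (s[0] / s[2], s[1] / (2 * s[2]), s[2])
            for b, s in sorted(sums.items())}


# Toy transect with a linear trend: semivariance grows with distance.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
values = [0.0, 1.0, 2.0]
variogram = empirical_semivariogram(coords, values, lag_width=1.0, n_lags=3)
```

Real monitoring data would normally also need a direction tolerance and a minimum number of pairs per lag; both are left out here for brevity.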
    12. Briansk region (radioactivity, Cs137)
    13. Heavy metals, Japan
    14. Switzerland, indoor radon
    15. Measures to characterise MN • Topological • Statistical • Fractal/multifractal • Lacunarity
    16. Preferential Sampling. Declustering Problem
    17. Example: geostatistical spatial co-predictions Sr90 « expensive » information. Cs137 « cheap » exhaustive information.
    18. (Cross)Variography
    19. Use of Cs137 to improve Sr90 predictions (reduced errors and uncertainty). Decision-oriented mapping: « Thick isolines »
    20. Simulations and Interpolations
    21. Unconditional simulations
    22. SGSim of the precipitation:
    23. Results of the simulations
    24. Post-processing of simulations: mean and standard deviation
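
Post-processing a set of stochastic realizations amounts to cell-by-cell statistics. The sketch below assumes each realization is a flattened grid stored as a list of equal length; the function name `postprocess` and the exceedance `threshold` are illustrative, and the probability-of-exceedance map is included because risk mapping is listed among the typical tasks.

```python
import statistics


def postprocess(realizations, threshold):
    """Cell-wise mean, standard deviation and P(value > threshold).

    realizations: list of simulated maps, each a flat list of cell values.
    """
    n_cells = len(realizations[0])
    mean_map, std_map, prob_map = [], [], []
    for c in range(n_cells):
        cell = [r[c] for r in realizations]
        mean_map.append(statistics.fmean(cell))
        std_map.append(statistics.pstdev(cell))  # spread across realizations
        prob_map.append(sum(v > threshold for v in cell) / len(cell))
    return mean_map, std_map, prob_map


# Two toy realizations of a two-cell grid.
sims = [[1.0, 4.0], [3.0, 4.0]]
mean_map, std_map, prob_map = postprocess(sims, threshold=2.0)
```

In practice many more realizations are needed before the probability map becomes stable.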
    25. Geostatistics: some comments • Geostatistics is a powerful and well-elaborated model-dependent approach. • Geostatistics proposes a variety of models for spatial data analysis and modeling. It has a long and successful history of developments and applications. • Some problems: nonlinearity, non-stationarity, two-point statistics, data/models integration, data mining, pattern recognition. • Hybrid Models (ANN/SVM + Geostat) can help.
    26. Some useful comments, conclusions and future research • 1. Detection of patterns: try k-NN or GRNN as exploratory tools • Cross-validation: leave-one-out, leave-k-out, jackknife, etc. as a control tool • Model selection and model assessment
    27. K-Nearest Neighbours
    28. k-NN prediction: NN methods use those k observations in the training data set T closest in input space to the prediction point x to estimate Y: $\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$, where $N_k(x)$ is the neighborhood of x defined by the k closest points in the training set.
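
The k-NN estimate is short enough to write out directly. This is a minimal sketch assuming Euclidean distance in input space (the slides do not fix a metric); `knn_predict` and the toy data are illustrative names.

```python
import math


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    # Indices of the training points sorted by Euclidean distance to x.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    neighbours = order[:k]  # this is N_k(x)
    return sum(train_y[i] for i in neighbours) / k


train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]
prediction = knn_predict(train_x, train_y, (1.1,), k=3)
```

Here the three nearest inputs are 1.0, 2.0 and 0.0, so the prediction is the mean of their targets.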
    29. k-NN Classifiers: These classifiers are memory-based and do not require any model to be fit! Given a query point x, we find the k training points closest in distance to x and then classify using a MAJORITY vote among the k neighbors.
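
The majority-vote rule can be sketched in a few lines. Euclidean distance is again an assumption, and `knn_classify` is an illustrative name; ties are broken in favour of the class seen first among the nearest neighbours, which is one possible convention and not something the slides specify.

```python
import math
from collections import Counter


def knn_classify(train_x, train_labels, x, k):
    """Return the most frequent label among the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B"]
label_near_origin = knn_classify(train_x, train_labels, (0.5, 0.5), k=3)
label_far = knn_classify(train_x, train_labels, (5.2, 5.1), k=3)
```

With two classes, an odd k avoids ties altogether.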
    30. Because it uses only the training point closest to the query point, the bias of the 1-NN estimate is often low, but the variance is high. A famous result of Cover and Hart (1967) shows that asymptotically the error rate of the 1-NN classifier is never more than twice the Bayes rate. This result can provide a rough idea about the best performance that is possible in a given problem: if the 1-NN rule has a 10% error rate, then asymptotically the Bayes error rate is at least 5%.
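
The arithmetic behind the 10% / 5% example can be written out as follows, where $R^*$ denotes the Bayes error rate and $R_{1NN}$ the asymptotic 1-NN error rate (the symbols are introduced here for illustration and do not appear on the slide):

```latex
R^* \;\le\; R_{1NN} \;\le\; 2R^*
\quad\Longrightarrow\quad
R^* \;\ge\; \frac{R_{1NN}}{2} = \frac{0.10}{2} = 0.05 .
```

So an observed asymptotic 1-NN error of 10% brackets the Bayes error between 5% and 10%.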
    31. Dirichlet cells, Thiessen tessellation, Voronoï polygons
    32. • How to find k ? Possible answer: Cross-validation or leave-one-out
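
Choosing k by leave-one-out cross-validation can be sketched as below: each point is removed in turn, predicted from the remaining ones, and the mean squared error is compared across candidate values of k. The function names, the squared-error criterion and the toy data are assumptions made for illustration.

```python
import math


def knn_predict(train_x, train_y, x, k):
    """k-NN regression with Euclidean distance."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return sum(train_y[i] for i in order[:k]) / k


def loo_error(xs, ys, k):
    """Mean squared leave-one-out error of k-NN regression."""
    total = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]  # training set without point i
        rest_y = ys[:i] + ys[i + 1:]
        total += (knn_predict(rest_x, rest_y, xs[i], k) - ys[i]) ** 2
    return total / len(xs)


def select_k(xs, ys, candidates):
    """Return the candidate k with the smallest leave-one-out error."""
    errors = {k: loo_error(xs, ys, k) for k in candidates}
    return min(errors, key=errors.get), errors


# Noise-free linear toy data on a regular 1D grid.
xs = [(float(i),) for i in range(10)]
ys = [2.0 * i for i in range(10)]
best_k, errors = select_k(xs, ys, candidates=[1, 2, 3])
```

Plotting `errors` against k gives the kind of cross-validation error curve shown on the later slides.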
    33. k-NN prediction (n = 6?) [figure: prediction point surrounded by six neighbours 1 to 6 at distances r1 to r6, each with weight Wi ~ 1/n]
    34. Cross-validation [figure: same six-neighbour layout, weights Wi ~ 1/n] Calculate error = (prediction - data)
    35. Leave-next-one-out, etc. [figure: same six-neighbour layout, weights Wi ~ 1/n] Calculate error = (prediction - data)
    36. Data and k-NN cross-validation error curve
    37. Complete data set and 500 training points linearly interpolated
    38. Cross-validation curve
    39. k-NN predictions
    40. Machine Learning Algorithms • Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn". • More specifically, machine learning is a method for creating computer programs by the analysis of data sets. Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is concerned with the algorithmic complexity of computational implementations. ...
    41. Algorithms: Common algorithm types include: • supervised learning – where the algorithm generates a function that maps inputs to desired outputs. • unsupervised learning – which models a set of inputs: labeled examples are not available. • semi-supervised learning – which combines both labeled and unlabeled examples to generate an appropriate function or classifier. • reinforcement learning – where the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm. • transduction – similar to supervised learning, but does not explicitly construct a function: instead, tries to predict new outputs based on training inputs, training outputs, and new inputs. • The performance and computational analysis of machine learning algorithms is a branch of statistics known as computational learning theory.
    42. ML Topics (short lists) • Machine learning topics • Modeling conditional probability density functions, regression and classification – Artificial neural networks – Decision trees – Gene expression programming – Genetic Programming – Gaussian process regression – Linear discriminant analysis – k-Nearest Neighbor – Minimum message length – Perceptron – Quadratic classifier – Radial basis functions – Support vector machines
    43. ML Topics (continued) • Modeling probability density functions through generative models: – Expectation-maximization algorithm – Graphical models including Bayesian networks and Markov Random Fields – Generative Topographic Mapping • Approximate inference techniques: – Markov chain Monte Carlo method – Variational Bayes • Meta-Learning (Ensemble methods): – Boosting – Bootstrap Aggregating aka Bagging – Random forest – Weighted Majority Algorithm • Optimization: most of the methods listed above either use optimization or are instances of optimization algorithms. • Multi-objective Machine Learning: an approach that addresses multiple, and often conflicting, learning objectives explicitly using Pareto-based multi-objective optimization techniques.
    44. Machine Learning • Artificial Neural Networks – Multilayer perceptrons (MLP) – General Regression Neural Networks (GRNN) • Statistical Learning Theory – Support Vector Classification – Support Vector Regression – Monitoring Networks Optimization
    45. A Generic Model of Learning from Data/Examples [diagram: Generator, Supervisor, Learning Machine]
    46. The Problem of Risk Minimization: In order to choose the best available approximation to the supervisor’s response, one measures the LOSS or discrepancy L(y, f(x,α)) between the response y of the supervisor to a given input x and the response f(x,α) provided by the learning machine.
    47. Three Main Learning Problems • Regression Estimation. Let the supervisor’s answer y be a real value, and let f(x,α), α∈Λ, be a set of real functions which contains the regression function $f(x, \alpha_0) = \int y \, dF(y \mid x)$
    48. The Problem of Risk Minimization: Consider the expected value of the loss, given by the risk functional $R(\alpha) = \int L(y, f(x, \alpha)) \, dF(x, y)$. The goal is to find the function $f(x, \alpha_0)$ which minimises the risk in the situation where the joint pdf is unknown and the only available information is contained in the training set.
    49. • Classification problem [figure: scatter of points labelled A and B]
    50. Three Main Learning Problems • Pattern Recognition (classification). $y \in \{0, 1\}$, classification error: $L(y, f(x,\alpha)) = \begin{cases} 0, & \text{if } y = f(x,\alpha) \\ 1, & \text{if } y \neq f(x,\alpha) \end{cases}$
    51. • Regression problem: f(x) = ? [figure: estimate $\hat{f}(x)$ of the mapping x → y]
    52. Three Main Learning Problems • Regression Estimation. It is known that the regression function is the one which minimizes the risk functional with the following loss function: $L(y, f(x, \alpha)) = (y - f(x, \alpha))^2$
    53. • Probability density estimation [figure: density p(x) plotted against x]
    54. Three Main Learning Problems • Density Estimation. For this problem we consider the following loss function: $L(p(x,\alpha)) = -\log p(x,\alpha)$
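
The three loss functions can be written side by side. Because the joint distribution is unknown, the risk functional is in practice replaced by an average over the training set; this empirical risk is the standard substitute and is an assumption here, since the slides do not state it. All function names are illustrative.

```python
import math


def zero_one_loss(y, prediction):
    """Pattern recognition: 0 if the class is right, 1 otherwise."""
    return 0.0 if y == prediction else 1.0


def squared_loss(y, prediction):
    """Regression estimation: squared residual."""
    return (y - prediction) ** 2


def neg_log_density_loss(density_at_x):
    """Density estimation: minus the log of the modelled density at x."""
    return -math.log(density_at_x)


def empirical_risk(losses):
    """Sample average of the loss, standing in for the risk functional."""
    return sum(losses) / len(losses)


# One misclassification out of four toy labels.
classification_risk = empirical_risk(
    [zero_one_loss(y, p) for y, p in zip([0, 1, 1, 0], [0, 1, 0, 0])])
# Residuals of 0.5 and 0.0 on two toy regression targets.
regression_risk = empirical_risk(
    [squared_loss(y, p) for y, p in zip([1.0, 2.0], [1.5, 2.0])])
```

The three learning problems differ only in the loss that is plugged into the same averaging step.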
    55. Inductive, Deductive and Transductive [diagram: training samples (xi, yi) lead by induction to F(x,y), which leads by deduction to (ynew, xnew); transduction goes directly from the training samples to (ynew, xnew)]
    56. Why Machine Learning algorithms? • Universal, nonlinear, robust tools • Data adapted • Easy data and knowledge integration • Efficient in high dimensional spaces • Good generalisation (low prediction error) • Input/feature selection
    57. Our experience, some applications • Hydrogeology, pollution/contamination (soil, water, air, food chains, …), topo-climatic modelling, geophysics • Renewable resources – wind fields • Natural hazards/risks: forest fires, avalanches, indoor radon • Optimization of monitoring networks • Crime data, epidemiology • MNL for remote sensing, change detection • Socio-economic spatio-temporal multivariate data • Spatial econometrics. Financial data. Econophysics • Fractals, Chaos, EVT • Time series
    58. Model Selection & Model Evaluation
    59. Guillaume d'Occam (1285 - 1349) “Pluralitas non est ponenda sine necessitate” Occam’s razor: “The simpler explanation of a phenomenon is more likely to be correct”
    60. Model Assessment and Model Selection: Two separate goals
    61. Model Selection: estimating the performance of different models in order to choose the (approximate) best one. Model Assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
    62. If we are in a data-rich situation, the best solution is to split the data randomly (?): Raw data: Train 50%, Validation 25%, Test 25%
    63. Interpretation • The training set is used to fit the models • The validation set is used to estimate prediction error for model selection (tuning hyperparameters) • The test set is used for assessment of the generalization error of the final chosen model (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001)
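
A random 50% / 25% / 25% split can be sketched as follows. The function name `split_data`, the fixed `seed` and the rounding convention are illustrative choices; the question mark on the slide suggests that a purely random split may not always be appropriate, for example for spatially clustered data.

```python
import random


def split_data(records, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle the records and cut them into train, validation and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, validation, test


data = list(range(100))
train, validation, test = split_data(data)
```

The test set should be touched only once, after the model and its hyperparameters have been fixed using the other two sets.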
    64. Bias and Variance. Model’s complexity [figure panels: b. Overfitting; c. Underfitting]
    65. One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples. This means that the learned function fits the training data very closely but does not generalise well, that is, it cannot model unseen data from the same task sufficiently well. Solution: balance the statistical bias and statistical variance during neural network learning in order to achieve the smallest average generalization error.
    66. Bias-Variance Dilemma: Assume that $Y = f(X) + \varepsilon$, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$
    67. We can derive an expression for the expected prediction error of a regression at an input point X = x0 using squared-error loss:
    68. $\mathrm{Err}(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2 = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0)) = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$
    69. • The first term is the variance of the target around its true mean f(x0), and cannot be avoided no matter how well we estimate f(x0), unless $\sigma_\varepsilon^2 = 0$. • The second term is the squared bias, the amount by which the average of our estimate differs from the true mean. • The last term is the variance, the expected squared deviation of $\hat{f}(x_0)$ around its mean.
    70. For the k-NN regression fit: $\mathrm{Err}(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_l)]^2 + \sigma_\varepsilon^2 / k$, where $x_1, \dots, x_k$ are the k nearest neighbours of $x_0$. Here we assume for simplicity that the training inputs are fixed, and the randomness arises from the Y. The number of neighbors k is inversely related to the model complexity.
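
The k-NN decomposition can be checked numerically. The sketch below assumes a known true function (a sine, chosen only for illustration), fixed training inputs on a regular grid and Gaussian noise; it computes the squared bias and variance terms from the formula and compares their sum, plus the irreducible error, with a Monte Carlo estimate of the expected squared prediction error.

```python
import math
import random


def f(x):
    """Assumed true regression function (illustrative choice)."""
    return math.sin(x)


def knn_terms(xs, x0, k, sigma):
    """Squared bias and variance of the k-NN fit at x0 for fixed inputs xs."""
    nearest = sorted(xs, key=lambda x: abs(x - x0))[:k]
    bias2 = (f(x0) - sum(f(x) for x in nearest) / k) ** 2
    variance = sigma ** 2 / k
    return bias2, variance


def simulate_error(xs, x0, k, sigma, n_rep=20000, seed=1):
    """Monte Carlo estimate of E[(Y - f_hat(x0))^2 | X = x0]."""
    rng = random.Random(seed)
    nearest = sorted(xs, key=lambda x: abs(x - x0))[:k]
    total = 0.0
    for _ in range(n_rep):
        # New noisy training targets at the fixed neighbour inputs.
        f_hat = sum(f(x) + rng.gauss(0.0, sigma) for x in nearest) / k
        y_new = f(x0) + rng.gauss(0.0, sigma)  # new observation at x0
        total += (y_new - f_hat) ** 2
    return total / n_rep


xs = [i / 10 for i in range(64)]
sigma = 0.3
bias2, variance = knn_terms(xs, 1.5, 5, sigma)
expected = sigma ** 2 + bias2 + variance
simulated = simulate_error(xs, 1.5, 5, sigma)
```

Repeating the comparison for several values of k shows the trade-off: the variance term shrinks as k grows, while the bias term tends to grow because more distant neighbours enter the average.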
    71. Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001
    73. • A neural network is only as good as the training data! • Poor training data inevitably leads to an unreliable and unpredictable network. • Exploratory Data Analysis and data preprocessing are extremely important!!!
    74. • If possible, prior to training, add some noise or other randomness to your example (such as a random scaling factor). This helps to account for noise and natural variability in real data, and tends to produce a more reliable network.
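
One simple way to add randomness before training is to append jittered copies of each training example. The sketch below assumes additive Gaussian noise on the inputs only; the function name `augment_with_noise` and the parameters `noise_sd` and `copies` are illustrative, and the noise level has to be chosen for the data at hand.

```python
import random


def augment_with_noise(samples, noise_sd, copies=1, seed=0):
    """Return the original samples plus jittered copies of each input vector.

    samples: list of (features, target) pairs, features being a tuple.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for _ in range(copies):
        for features, target in samples:
            noisy = tuple(v + rng.gauss(0.0, noise_sd) for v in features)
            augmented.append((noisy, target))  # target is left unchanged
    return augmented


samples = [((0.0, 1.0), 2.0), ((1.0, 3.0), 4.0)]
augmented = augment_with_noise(samples, noise_sd=0.05, copies=3)
```

A random scaling factor, as mentioned on the slide, could be applied in the same loop by multiplying the features instead of adding to them.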
    75. Hybrid Models: Geostatistics + ML
    76. NNRK/CK Algorithm [flowchart: data F1, F2, ..., Fn; statistical description; structural analysis (trend analysis, variogram of the raw data); data split for training, validation and testing; ANN architecture choice; ANN training; accuracy test; ANN estimates for F1, F2, ..., Fn; ANN residuals for F1, F2, ..., Fn; multivariate structural analysis and variogram model for the residuals; cross-validation; cokriging; final estimates (ANN + Geostatistics) and error estimates]
    77. Model: Neural Network Residual Cokriging (NNRCK) [figure panels: Artificial Neural Network estimate; geostatistical estimate of the residuals; final estimate of 90Sr with NNRCK]
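
The structure of the hybrid approach (a data-driven trend plus a spatial model of its residuals) can be illustrated with deliberately simple stand-ins. In the sketch below a least-squares line replaces the neural network and inverse distance weighting replaces residual kriging/cokriging; neither stand-in is the method used in the presentation, and all names are illustrative. Only the two-stage composition is the point.

```python
def fit_linear_trend(xs, ys):
    """Least-squares line, used here as a stand-in for the ANN trend model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def idw(xs, values, x, power=2.0):
    """Inverse distance weighting, a stand-in for kriging of the residuals."""
    numerator = denominator = 0.0
    for xi, vi in zip(xs, values):
        d = abs(x - xi)
        if d == 0.0:
            return vi  # exact at data locations
        w = 1.0 / d ** power
        numerator += w * vi
        denominator += w
    return numerator / denominator


def hybrid_predict(xs, ys, x):
    """Final estimate = trend estimate + interpolated residual."""
    trend = fit_linear_trend(xs, ys)
    residuals = [y - trend(xi) for xi, y in zip(xs, ys)]
    return trend(x) + idw(xs, residuals, x)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.3, 2.9]
estimate = hybrid_predict(xs, ys, 1.5)
```

Because the residual interpolator is exact at the data locations, the hybrid estimate reproduces the measurements there, which is also the behaviour of kriging without a nugget effect.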
    78. Conclusions • Machine Learning: a universal, data-driven, recently developed approach with many successful applications. Nonlinear, robust. Integration of different types of data and information. Efficient in high dimensional spaces. • But: depends on the quality and quantity of data. Uncertainty characterization. Diagnostic tools. Hyper-parameter tuning.
    79. Topics for research • Multitask learning • Automatic feature selection/feature extraction • Uncertainty characterisation • Understanding and visualisation of high dimensional data • Modelling on geomanifolds, semi-supervised learning • Active learning • MLA and simulations? • …
    80. Thank you for your attention! www.geokernels.org 2004 2008 2009 www.unil.ch/igar

    © All Rights Reserved
