Learning from data: data mining approaches for Energy & Weather/Climate applications

2,050 views
1,929 views

Published on

My lecture at the "2nd CLIM-RUN School: Building Two-way Communication: a week of climate services", 2-6 December 2013 ICTP, Trieste, Italy

Published in: Education, Technology, Business
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,050
On SlideShare
0
From Embeds
0
Number of Embeds
543
Actions
Shares
0
Downloads
0
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Learning from data: data mining approaches for Energy & Weather/Climate applications

  1. 1. Learning from data: data mining approaches for Energy & Weather/Climate applications! Matteo De Felice, ENEA
  2. 2. http://matteodefelice.name/research
 www.utmea.enea.it   ! matteo.defelice@enea.it
  3. 3. Outline • • • • Building reliable Climate Services is really challenging Cross-disciplinary We need to use the latest and most advance research and knowledge We need to use all available data 2nd CLIM-RUN School
  4. 4. It’s just a matter of time “Can you tell me how much the climate change will affect the wind power production in Italy for the next two decades?” “Can you prepare me an early-warning forest fire model?” Yes, you can (try) But how long it will take to elaborate new theories and physical models? 2nd CLIM-RUN School
  5. 5. Energy & Climate/ Meteorology Link between Energy and Climate/Meteorology is strengthening for several reasons: 1. Diffusion of Renewable Energies 2. Widespread use of air conditioning 3. Necessity of improving efficiency/reliability of power networks (electric utilities) • 2nd CLIM-RUN School
  6. 6. Energy & Climate/ Meteorology • • • Conferences: ICEM Projects: CLIM-RUN, EUPORIAS, SPECS GFCS Climate Services 2nd CLIM-RUN School
  7. 7. EU Projects Timeline of projects 2012 2013 2014 2015 2016 2017 COMBINE EMBRACE EUCLIPSE CLIM-RUN ECLISE EUPORIAS NACLIM SPECS 1 Nov. 2012 31 Jan. 2014 30 Oct. 2015 31 Jan. 2017 from Carlo Buontempo presentation at ICEM 2013! ! 2nd CLIM-RUN School
  8. 8. Energy: main challenges 1. Deregulation and competition 2. Climate issue: emissions reduction and higher demands 3. Security and stability: diffusion of non-controllable sources and greater focus on critical infrastructures and dependancies 2nd CLIM-RUN School
  9. 9. Renewables Energy • • • We need to…
 …find the best place for new plants
 …manage existing plants (efficiently)
 …predict power output (tomorrow, in ten days, in five years) We need it as soon as possible! Communicating effectively with stakeholders (and municipalities, regional authorities, etc.) is fundamental 2nd CLIM-RUN School
  10. 10. Data Author's personal copy Climate and Energy Production – A Climate Services Perspective Table 1 119 Basic climate information needs of the energy sector Type of climatological input data Field of application Energy element needed Power Grid Gas production and distribution Climatological element needed Daily load and network structure Network structure Time scale Air temperature d, m Precipitation h, d Air temperature d, m Wind speed d, m Wind Production – high resolution time Wind h, d series Pressure d, m Solar Production data Radiation h, d Wind min, max Temperature d, m Humidity d, m Clouds h, d Aerosols h, d Hydropower and waterProduction data Temperature d, m based cooling systems Plants temperature Rainfall d, m Runoff d, m Water level d, m Snow cover d, m Snow melt d, m Soil Moisture d, m Marine energy Production data Wind d, m Wave height – direction d, m from Ruti & De Felice, Climate and Energy Production andA Climate Services Current speed d, m Climate Vulnerability: Understanding and Addressing Threats to Essential Elsevier Inc., Academic Press, 2013.! Space scale g, a g, a g, a g, a s, g s, g s, g s, g s, g s, g s, g s, g s, a s, a s, a s, a s, a s, a s, a s, g s, g Perspective, s, g Resources. ! s ¼ point values; a ¼ areal values; g ¼ grid values; y ¼ annual values; m ¼ monthly values; d ¼ daily values; h ¼ hourly values; min ¼ minimum values; max ¼ maximum values. ! 2nd CLIM-RUN School !
  11. 11. Example • Goal: “assess the forecast quality for solar and wind energy generation at s2d time scales” 2nd CLIM-RUN School
  12. 12. Available RE information Example Create software models of RE plants based on my assumptions about technology, typology, etc. Nothing Something Full Create models of RE plants using real parameters and options. 2nd CLIM-RUN School
  13. 13. Solar Power • • • Power output (photovoltaics) is proportional with solar radiation and affected by with cloud cover Rise of temperature leads to reduced efficiency Shadowing effects 2nd CLIM-RUN School
  14. 14. Wind Power • • • • Wind speed at specific height is needed (70-100 m) Strong non-linear effect (cut-off speed) Wind turbines interactions (wake effect) Strong intermittence 2nd CLIM-RUN School
  15. 15. Hydro-power • • • Hydro power is used to store water for electricity generation during the time of year when it is most valuable. Precipitation (snow and rain) information needed: when / where / how much (intensity) Necessity of hydrological models and in-situ measurements 2nd CLIM-RUN School
  16. 16. Electricity Demand • • • Electricity demand sensitive to weather conditions Currently only climatological data are used for time-scales >14 days Demand affected by “human activities” (calendar effects) and economic trends 2nd CLIM-RUN School
  17. 17. Energy & Meteorology • • • • Impact of spatial-temporal variability on energy systems Prediction of expected power in high spatial and temporal resolutions Operational aspects Critical for market operations How much energy this RE plant will produce in the next three hours? 2nd CLIM-RUN School
  18. 18. Energy & Climate • • • Longer time-scales -> higher uncertainty Climate risk management (e.g. extreme events) Use of S2D climate forecasts How much energy this RE plant will produce every year? 2nd CLIM-RUN School
  19. 19. Energy Sector Vulnerability • • Energy Sector is vulnerable to climate change (broadly speaking) E.g. During European 2003 heat-wave France reduced electricity export in August of 50% (EDF) […] a summer average decrease in capacity of power plants of 6.3–19% in Europe and 4.4–16% in the United States depending on cooling system type and climate scenario for 2031–2060. In addition, probabilities of extreme (>90%) reductions in thermoelectric power production will on average increase by a factor of three.
 (van Vliet et al., Vulnerability of US and European electricity supply to climate change, Nature Climate Change 2(9), 2012) 2nd CLIM-RUN School
  20. 20. Vulnerabilities • • • Hydro-power depends on hydrological cycle (seasonal pattern). Higher impacts on regions where snowmelt is a relevant factor. Wind power necessity wind speed data at height >50m. Terrain roughness is a key parameter and it’s affected by vegetation cover. Solar energy is affected by clouds and water vapour content 2nd CLIM-RUN School
  21. 21. Vulnerabilities Sector Variables Impacts Gas, coal and nuclear Air temperature Cooling water quantity and quality (efficiency) Oil and gas Extreme events Extraction/import distruption Hydro-power Precipitation Water availability … … … see Schaeffer et al., Energy sector vulnerability to climate change: A review, ! Energy (38), 2012! 2nd CLIM-RUN School
  22. 22. Predicting changes • • • “Prediction is very difficult, especially about the future”. (Niels Bohr) The impact of a predicted change in climate/weather variables on energy production/demand is not straightforward. Highly non-linear. How the uncertainty propagates? 2nd CLIM-RUN School
  23. 23. Interesting Questions Side question !  How to evaluate “meaningfully” forecast performances? Armour, P. G. (2013). What is a good estimate?: whether forecasting is valuable. Communications of the ACM, 56(6), 31-32. !  How the uncertainty “propagates” from climate forecasts to applications? SPECS, date 2nd CLIM-RUN School 9
  24. 24. Challenges for integrating seasonal forecasts in applications Coelho and Costa 319 Forecasting Figure 1 Challenge 5 Production of seasonal climate forecasts 1 to 6 months in advance (100 to 200 km) Challenge 1 Downscaling Modeling Space & Time Decision Making System Science Production of seasonal climate forecasts at refined space and time scale Challenge 2 Challenge 4 Climate Science Climate Prediction Model Application Model Production of information to support decision making based on climate and non-climate related knowledge Challenge 3 End user Current Opinion in Environmental Sustainability from Coelho and Costa, Challenges for integrating seasonal climate forecasts ! A simplified framework for Current Opinion in in user applications,an end-to-end forecasting system. Environmental Sustainability (2), 2010! 2nd CLIM-RUN School
  25. 25. Approaches To take decisions we need estimates and predictions To generate predictions we need models To build models we need data and information 2nd CLIM-RUN School
  26. 26. DRIP & Big Data • • DRIP (Data Rich Information Poor) era (Big Data) We need tools to deal with high-dimensional and heterogeneous data ? Data “factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation” ( Merriam-Webster) 2nd CLIM-RUN School Information “knowledge obtained from investigation, study, or instruction”
 ( Merriam-Webster)
  27. 27. Data vs Information Precise formally-defined Accurate Data Information Useful context-related 2nd CLIM-RUN School Meaningful
  28. 28. White Box & Black box • How to use data/information to build models…
 …able to generalise?
 …reliable?
 …consistent? 2nd CLIM-RUN School
  29. 29. “Essentially, all models are wrong, but some are useful.” George E. P. Box (English statistician)
  30. 30. Boxes White box Organised knowledge Black box Zero knowledge/ many observations and measures Gray box Some knowledge and some data 2nd CLIM-RUN School
  31. 31. Right tools • Wolpert “No Free Lunch” theorem: “best model” doesn’t exist considering all possible problems Exploring the trade-off between time-to-market and efficiency/ performances Genie’s lamp Performance • Worthless effort Random guess (dice?) 0 time-to-market 2nd CLIM-RUN School
  32. 32. Modeling software Low-level code MATLAB R st • Co • Spreadsheet software (e.g. Microsoft Excel) Code programming (C/C++, Python) Statistical computing environments (e.g. MATLAB, R) Performance • Excel 0 le F ity il High-level code ib x time-to-market 2nd CLIM-RUN School
  33. 33. R • • • • • open source multi-platform software for statistical computing well-established with over 5000 packages easy and quick statistic analysis availability of free packages developed by research centres worldwide quick access to databases and files of any format > gdp = getWorldBankData(id = 'NY.GDP.MKTP.PP.CD', date = ‘1990:2012')! > plot(gdp) 2nd CLIM-RUN School
  34. 34. Data Visualisation 1st rule: LOOK AT YOUR DATA 2nd rule: ALWAYS LOOK AT YOUR DATA • • • Examine your data to detect evident anomalies Nice and clear plots help you to understand your data (and to prepare high-level publications) Best tool: ggplot2 (R package) 2nd CLIM-RUN School
  35. 35. Do you want to learn R? • • Web resources Coursera.org courses 2nd CLIM-RUN School
  36. 36. Any question? • http://stats.stackexchange.com/ http://area51.stackexchange.com/proposals/36296/geoscience 2nd CLIM-RUN School
  37. 37. Dealing with Complexity 2nd CLIM-RUN School
  38. 38. Data Mining • • • We are in the DRIP era (Data Rich Information Poor) Big Data ECMWF stores From data to information 50 petabytes of data (50·1015) factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation knowledge obtained from investigation, study, or instruction 2nd CLIM-RUN School
  39. 39. Data Mining • • • “Extraction of implicit and potentially useful information from data” Automated search (software) Main difficulties: 1. Defining “interestingness” 2. Spurious and accidental coincidences (exceptions) 3. Missing data and noise
 2nd CLIM-RUN School
  40. 40. Example 4 3 x 10 GWh 2.5 2 1.5 1 1990 1992 1994 1996 1998 2000 2002 months 2004 2006 2008 2010 Italian Monthly Electricity Demand 2nd CLIM-RUN School
  41. 41. Example 4 3 x 10 GWh 2.5 2 1.5 1 1990 1992 1994 1996 1998 2000 2002 months 2nd CLIM-RUN School 2004 2006 2008 2010
  42. 42. Modelling Methods • • • • Experience Rule-of-thumbs and good practises Statistics Machine Learning methods Why? • • • • Forecasting Analysis of past events Control Anomaly Detection 2nd CLIM-RUN School
  43. 43. 10 CHAPTER 1 What’s It All About? Example Table 1.2 Weather Data Outlook Temperature Humidity Windy Play Sunny Sunny Overcast Rainy Rainy Rainy Overcast Sunny Sunny Rainy Sunny Overcast Overcast Rainy hot hot hot mild cool cool cool mild cool mild mild mild hot mild high high high high normal normal normal high normal normal normal high normal high false true false false false true true false false false true true false true no no yes yes yes no yes no yes yes yes yes yes no that are suitable for playing some unspecified game. In general, instances in a dataset from Ian Witten, Data Mining: Practical machine learning tools and techniques, 
 are characterized by the values of features, or attributes, that measure different Morgan Kaufmann, 2005. aspects of the instance. In this case there are four attributes: outlook, temperature, humidity, and windy. The outcome is whether to play or not. In its simplest form, shown in Table 1.2, all four attributes have values that are symbolic categories rather than numbers. Outlook can be sunny, overcast, or rainy; 2nd hot, mild, or School temperature can beCLIM-RUN cool; humidity can be high or normal; and windy
  44. 44. Example Fire Weather Index system X Y month Temp RH wind rain area (ha) 7 5 mar fri 86.2 26.2 5.1 8.2 51 6.7 0 0 9 9 tue 85.8 48.3 313.4 3.9 42 2.7 0 0.36 3 6 sep mon 90.9 126.5 686.5 15.6 66 3.1 0 0 6 4 aug thu 95.2 131.7 578.8 10.4 20.3 41 4 0 1.9 6 3 aug thu 91.6 138.1 621.7 6.3 18.9 41 3.1 0 10.34 7 5 aug tue 96.1 181.1 671.2 14.3 27.3 63 4.9 6.4 jul day FFMC DMC DC 94.3 ISI 7 18 10.82 Burned area of forest fires, in the northeast region of Portugal (517 records) 
 [Cortez and Morais, 2007] 2nd CLIM-RUN School
  45. 45. Example • • • • • Soybean disease database with 307 records 19 different diseases 35 attributes like month, anomalies in growth, stem, leaves, visible damages, precipitation and temperature, etc. Plant pathologist identifies correctly 72% Computer-generated rules 97.5% 2nd CLIM-RUN School
  46. 46. Methods • How to represent discovered patterns? 2nd CLIM-RUN School
  47. 47. Tables 10 • • CHAPTER 1 What’s It All About? Rudimentary and simple Look up the appropriate “inputs” Table 1.2 Weather Data Outlook • • Sunny Sunny Overcast Rainy Rainy Rainy Overcast Sunny Sunny Y month Rainy 4 aug Sunny Overcast Overcast Y month Rainy Temperature hot hot hot mild cool cool cool mild cool day FFMC mild thu mild 95.2 mild day hot FFMC mild Humidity Windy high high high high normal normal normal high normal DC normal 578.8 normal high normal DC high false true false false false true true false false Temp false 20.3 true true false Temp true Play Problem with large (real-world) datasets What about continuous variables? X 6 X 6 4 aug thu 90 DMC 131.7 DMC 135 ISI 10.4 ISI 550 10.4 22 2nd CLIM-RUN School RH 41 RH 40 no no yes yes yes no yes no yes wind yes 4 yes yes yes wind no 4 rain area (ha) 0 1.9 rain area (ha) 0 ?
  48. 48. Linear models • • Regression (statistics) Can be applied to continuous and binary variables 2nd CLIM-RUN School
  49. 49. Example y = a1 SSR + a2 CC + a3 2nd CLIM-RUN School
  50. 50. Example #2 y = 54.11 SSR • • 10.56 CC + 18.62 Correlation of 0.84 Can we trust this model? 2nd CLIM-RUN School
  51. 51. Example • Forest fires Predictand (area) is dramatically skewed (trying with log transformation) • Poor results with a linear model • 2nd CLIM-RUN School
  52. 52. Overfitting 2nd CLIM-RUN School
  53. 53. rediction on a particular case. Model trees are basically a n conventional regression trees and linear regression models. be positive integers. Defining leaves as L0 , . . . , Ln , constants assuming a k-dimensional input space (X1 , . . . , Xk ), an exe is shown in Figure 1. In this tree we have three different or each leaf), and assuming that for each internal node we branch in case the rule is satisfied, we use the model L0 for • Divide-and-conquer approach n the condition X1 ≤ C1 ∧ X2 ≤ C2 is satisfied. Similarly the Decision/regression trees to the• condition X1 ≤ C1 ∧ X2 > C2 , and the model L2 is ndition X1 > C1 . Trees X1 ≤ C1 X2 ≤ C2 L0 L2 L1 Fig. 1: An example of a model tree 2nd CLIM-RUN School
  54. 54. Example • • Again with forest fires Converting continuos variable area into factor as: - ‘no’: area = 0 - ‘small’: 0 < area < 5 - ‘medium’: 5 < area < 50 - ‘large’: area > 50
 month = dec: medium (9) month in {jan,mar,may,nov,oct}: :...wind <= 8: no (72/23) : wind > 8: small (2) month in {apr,aug,feb,jul,jun,sep}: :...month = apr: :...temp <= 11.8: small (4/2) : temp > 11.8: no (5/1) month = jun: :...temp > 23.8: medium (4/2) : temp <= 23.8: : :...wind > 4.9: small (3) : wind <= 4.9: : :...temp <= 14.8: small (3/1) : temp > 14.8: no (7) month = feb: :...temp > 12.4: no (6) : temp <= 12.4: : :...wind > 4.5: medium (5/2) : wind <= 4.5: : :...wind > 2.2: no (3/1) : wind <= 2.2: [ ... ] Error: 33.8% no small med large 214 27 6 0 small 37 75 7 0 med 60 16 51 0 large 13 5 4 2 no (Worst results in cross-validation!) 2nd CLIM-RUN School
  55. 55. Example 1 • yes Data from solar plant SSR < 0.5 no 2 3 CV >= 0.45 SSR < 0.7 4 5 6 7 SSR < 0.19 SSR < 0.32 CV >= 0.23 0.87 8 9 10 11 12 13 0.16 0.33 0.43 0.58 0.6 0.76 2nd CLIM-RUN School
  56. 56. Linear vs Tree vector rpart(formula=output~SSR+CV,data=df) lm(formula=output~SSR+CV,data=df) SSR: CV SSR: CV 2nd CLIM-RUN School R SS R SS CV CV
  57. 57. 0.6 0.4 0.2 0.0 output 0.8 1.0 Example 0 100 200 300 day 2nd CLIM-RUN School 400
  58. 58. Using model trees • Linear models as leaves instead of constant values CV > 0.45 SSR > 0.70 y = 0.84·SSR - 0.27·CV + 0.31 y = 0.88·SSR + 0.043 y = 0.17·SSR - 0.24·CV + 0.76 2nd CLIM-RUN School
  59. 59. 0.5 0.2 0.3 0.4 tree 0.6 0.7 0.8 0.9 Using model trees 0.0 0.2 0.4 0.6 0.8 1.0 target Regression tree and model tree 2nd CLIM-RUN School
  60. 60. Clustering • • • Groups objects by “similarity” Many algorithms: most common and simplest centroid-based algorithms like K-Means Similarity is often measured with Eucledian distance q d = (x1 1 x2 )2 + (x1 1 2 2nd CLIM-RUN School x2 )2 + · · · + (x1 2 k x2 ) 2 k
  61. 61. Neural Networks • Most famous machine learning tool 2nd CLIM-RUN School
  62. 62. Pattern recognition • • • Traditional machine learning task Network ‘learns’ observing input-output pairs Generalisation on original (and never observed) data (!) Appl Pea Training Test 2 Test 1 2nd CLIM-RUN School
  63. 63. Perceptron • da r y • Simplest neural network Used for binary classification of linearly separable data bo 7 un Class 1 8 6 cis ion 4 3 de x2 5 Class 2 2 1 0 0 1 2 3 4 5 6 7 8 9 10 2nd CLIM-RUN School x1 N-dimensional case: classes separable by an hyperplane
  64. 64. Perceptron • • • Binary classifier: ‘sign’ function as output x input, w weights: function y = wx How to find optimal parameters? otherwise 2nd CLIM-RUN School
  65. 65. Multilayer perceptron (MLP) • • Neurons with nonlinear differentiable functions One or more neurone layers Activa tion fu nction 2nd CLIM-RUN School s
  66. 66. Example • Neural network vs Linear Regression y = 0.811 SSR 0.156 CC + 0.215 Σ f(•) 5.176 -0.09 4.119 SSR 1.281 -0.093 CC -1.295 Σ f(•) 1.099 0.067 -0.019 Σ 2nd CLIM-RUN School f(•) Σ
  67. 67. Non-linearity 2nd CLIM-RUN School
  68. 68. Ensemble Learning • • Combining models Majority vote is a type of ensemble learning Simple Methods • • Discrete: Majority vote (or weighted majority vote) Continuous: Average (or weighted average) Pros and Cons • • • Easy to implement Easy to understand Results sometimes hard to analyse 2nd CLIM-RUN School
  69. 69. Bagging " able procedures t vements for uns "impro • Bootstrap Aggregating [Breiman, 1996] Training data (size n) Uniform sampling with replacement T1 T2 (size n’ < n) (size n’ < n) Tk ............................ | {z } … Average • Introducing randomisation 2nd CLIM-RUN School (size n’ < n)
  70. 70. Ensemble: example 100 kW 80 60 40 20 60 80 100 120 140 160 testing hours 2nd CLIM-RUN School 180 200 220
  71. 71. Error measures • • • “Goodness” of a model ➠ Number (error) Very importance choice Different types: 1. Absolute errors 2. Percentage errors 3. Relative errors 2nd CLIM-RUN School
  72. 72. Error measures Absolute Percentage Relative 2nd CLIM-RUN School
  73. 73. Example MAE: M1 22106 / M2 22931 MSE: M1 8E08 / M2 3.85E9 RMSE: M1 28291 / M2 62087 MAPE: M1 269% / M2 195% 2nd CLIM-RUN School
  74. 74. Good practices • • • • See your data! Look at average AND median (why not quartiles?) Ask yoursel “Is it significant?” Pay attention to percentage measures mea zero values! 2nd CLIM-RUN School
  75. 75. Example MdAE: M1 17846 / M2 7416 MdAPE: M1 73% / M2 29% 2nd CLIM-RUN School
  76. 76. Overfitting • Machine learning models ‘learns’ data 
 (not the problem!) • Cross-validation • Early stopping • Regularization 2nd CLIM-RUN School
  77. 77. Cross-validation • • Standard method Dataset divided in three subsets: training, validation and testing Training Used to build the model Validation Used to evaluate generalisation 2nd CLIM-RUN School Testing Used to evaluate performances

×