Learning from data: data mining approaches
for Energy & Weather/Climate applications
Matteo De Felice, ENEA
http://matteodefelice.name/research

www.utmea.enea.it

matteo.defelice@enea.it
Outline
• Building reliable Climate Services is really challenging
• Cross-disciplinary
• We need to use the latest and...
It’s just a matter of time
“Can you tell me how much the climate change will affect the
wind power production in Italy for...
Energy & Climate/Meteorology
Link between Energy and Climate/Meteorology is
strengthening for several reasons:
1. Diffusi...
Energy & Climate/Meteorology
• Conferences: ICEM
• Projects: CLIM-RUN, EUPORIAS, SPECS
• GFCS Climate Services

2nd CLIM...
EU Projects
Timeline of projects, 2012 to 2017: COMBINE, EMBRACE, EUCLIPSE, CLIM-RUN, ECLISE, EUPORIAS, NA...
Energy: main challenges
1. Deregulation and competition
2. Climate issue: emissions reduction and higher demands
3. Securi...
Renewable Energy
• We need to…
  …find the best place for new plants
  …manage existing plants (efficiently)
  …predict ...
Data

Climate and Energy Production – A Climate Services Perspective
Table 1
Basic climate i...
Example
• Goal: “assess the forecast quality for solar and wind energy
  generation at s2d time scales”

2nd CLIM-RUN Schoo...
Available RE information

Example
Create software models of RE plants
based on my assumptions about
technology, typology, ...
Solar Power
• Power output (photovoltaics) is proportional to solar
  radiation and affected by cloud cover
• Ris...
Wind Power
• Wind speed at specific height is needed (70-100 m)
• Strong non-linear effect (cut-off speed)
• Wind turbin...
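The non-linear effect mentioned above can be sketched with an idealised turbine power curve. This is only an illustration: the function name and every default value (cut-in, rated and cut-off speeds, rated power) are assumptions, not figures from the lecture or from any real turbine.

```python
def wind_power_kw(speed_ms, cut_in=3.0, rated_speed=12.0, cut_out=25.0, rated_kw=2000.0):
    """Idealised power curve; every default value is an illustrative assumption."""
    if speed_ms < cut_in or speed_ms >= cut_out:
        return 0.0  # too little wind, or turbine shut down above the cut-off speed
    if speed_ms >= rated_speed:
        return rated_kw  # output is capped at the rated power
    # Between cut-in and rated speed, power grows roughly with the cube of speed.
    ramp = (speed_ms ** 3 - cut_in ** 3) / (rated_speed ** 3 - cut_in ** 3)
    return rated_kw * ramp


for v in (2.0, 6.0, 12.0, 24.0, 26.0):
    print(v, round(wind_power_kw(v), 1))
```

A small change in wind speed near the cut-off speed moves the output from full power to zero, which is why a linear model of power against wind speed tends to fit poorly.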
Hydro-power
• Hydro power is used to store water for electricity generation
  during the time of year when it is most v...
Electricity Demand
• Electricity demand is sensitive to weather conditions
• Currently only climatological data are used f...
Energy & Meteorology
• Impact of spatial-temporal variability on energy systems
• Prediction of expected power in hig...
Energy & Climate
• Longer time-scales -> higher uncertainty
• Climate risk management (e.g. extreme events)
• Use of S2D ...
Energy Sector Vulnerability
• Energy Sector is vulnerable to climate change (broadly
  speaking)
• E.g. During European 200...
Vulnerabilities
• Hydro-power depends on hydrological cycle (seasonal
  pattern). Higher impacts on regions where sno...
Vulnerabilities
Sector | Variables | Impacts
Gas, coal and nuclear | Air temperature | Cooling water quantity and quality (ef...
Predicting changes
• “Prediction is very difficult, especially about the future”. (Niels
  Bohr)
• The impact of a predic...
Interesting Questions
Side question
• How can forecast performance be evaluated “meaningfully”?
  Armour, P. G. (2013). What ...
Forecasting
[Figure 1 from Coelho and Costa, “Challenges for integrating seasonal forecasts in applications”, p. 319]
Approaches
To take decisions we need estimates and predictions
To generate predictions we need models
To build models we n...
DRIP & Big Data
• DRIP (Data Rich Information Poor) era (Big Data)
• We need tools to deal with high-dimensional and
  hete...
Data vs Information
Data: precise, accurate, formally-defined
Information: useful, context-related, M...

2nd CLIM-RUN School
White Box & Black box
• How to use data/information to build models…
  …able to generalise?
  …reliable?
  …consistent?

2n...
“Essentially, all models are wrong, but some are
useful.”

George E. P. Box (English statistician)
Boxes
White box: organised knowledge
Black box: zero knowledge, many observations and measures
Gray box: some knowledge and...
Right tools
• Wolpert's “No Free Lunch” theorem: no “best model” exists
  across all possible problems
• Exploring the...
Modeling software
[Figure labels: Low-level code, MATLAB, R]
• Spreadsheet software (e.g. Microsoft Excel)
• Code programming (C...
R
• open source multi-platform software for statistical computing
• well-established with over 5000 packages
• easy a...
Data Visualisation
1st rule: LOOK AT YOUR DATA
2nd rule: ALWAYS LOOK AT YOUR DATA
• Examine your data to detect evide...
Do you want to learn R?
• Web resources
• Coursera.org courses
Any questions?
• http://stats.stackexchange.com/

http://area51.stackexchange.com/proposals/36296/geoscience
2nd CLIM-RUN ...
Dealing with Complexity

Data Mining
• We are in the DRIP era (Data Rich Information Poor)
• Big Data
• ECMWF stores 50 ...
From data to information
Data Mining
• “Extraction of implicit and potentially useful information from
  data”
• Automated search (software)
• Main ...
Example
[Figure: time series, y-axis GWh (×10^4), x-axis months, 1990 to 2010]
Italian Mo...
Example
[Figure: time series, y-axis GWh (×10^4), x-axis months, 1990 to 2010]
Modelling Methods
• Experience
• Rules of thumb and good practices
• Statistics
• Machine Learning methods

Why?
...
Example
Table 1.2 Weather Data
Outlook | Temperature | Humidity | Windy | Play
Sunny
Sun...
Example
Fire Weather Index system
X Y month Temp RH wind rain area (ha)
7 5 mar fri 86.2 26.2 5.1 8.2 51 6.7
...
Example
• Soybean disease database with 307 records
• 19 different diseases
• 35 attributes like month, anomalies in ...
Methods
• How to represent discovered patterns?
Tables
• Rudimentary and simple
• Look up the appropriate “inputs”
Table 1.2 Weather D...
Linear models
• Regression (statistics)
• Can be applied to continuous and binary variables
Example

y = a1 SSR + a2 CC + a3
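
The coefficients a1, a2 and a3 of a model like this are usually estimated by ordinary least squares. The sketch below solves the normal equations in plain Python; the function names and the synthetic, noise-free data are assumptions made only for illustration and are not the lecture's plant data.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_model(ssr, cc, y):
    """Least-squares estimate of (a1, a2, a3) in y = a1*SSR + a2*CC + a3."""
    rows = [[s, c, 1.0] for s, c in zip(ssr, cc)]
    # Normal equations: (X^T X) a = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated from known coefficients (2.0, -1.0, 0.5).
ssr = [0.1, 0.4, 0.5, 0.8, 0.9, 0.3]
cc = [0.9, 0.5, 0.7, 0.1, 0.2, 0.4]
y = [2.0 * s - 1.0 * c + 0.5 for s, c in zip(ssr, cc)]
a1, a2, a3 = fit_linear_model(ssr, cc, y)
print(round(a1, 3), round(a2, 3), round(a3, 3))
```

With real, noisy measurements the recovered coefficients would only approximate the underlying relation.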
Example #2
y = 54.11 SSR - 10.56 CC + 18.62
• Correlation of 0.84
• Can we trust this model?
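
One reason for caution about a correlation figure is that correlation ignores bias and scale. The sketch below uses made-up numbers (not the lecture's data) to show a prediction series that is perfectly correlated with the observations and still far from them.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


observed = [10.0, 20.0, 30.0, 40.0, 50.0]
# Hypothetical predictions: right shape, wrong scale and offset.
biased = [2.0 * v + 10.0 for v in observed]

r = pearson(observed, biased)
mean_abs_error = sum(abs(p - o) for p, o in zip(biased, observed)) / len(observed)
print(r, mean_abs_error)
```

Here the correlation is 1 while the mean absolute error is as large as the observations themselves, so correlation is best read together with an error measure.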

Example
• Forest fires
• Predictand (area) is dramatically skewed (trying with log transformation)
• Poor results with a l...
Overfitting

[Cropped paper excerpt: model trees combine conventional regression trees and linear regression models]
Example
• Again with forest fires
• Converting continuous variable area into factor as:
- ‘no’: area = 0
- ‘small’: 0 < are...
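
A minimal sketch of this discretisation, followed by the error rate implied by the confusion matrix reported on the full slide (slide 54 of the transcript). Two points are assumptions: the slide leaves the boundary values 5 and 50 unassigned, so they are placed here in ‘medium’ and ‘large’, and the matrix rows are read as the actual classes.

```python
def area_class(area_ha):
    """Map burned area (ha) to the classes used on the slide."""
    if area_ha == 0:
        return "no"
    if area_ha < 5:
        return "small"
    if area_ha < 50:  # assumption: exactly 5 ha counts as 'medium'
        return "medium"
    return "large"    # assumption: exactly 50 ha counts as 'large'


# Confusion matrix from slide 54; columns follow the same order as the keys.
confusion = {
    "no": [214, 27, 6, 0],
    "small": [37, 75, 7, 0],
    "medium": [60, 16, 51, 0],
    "large": [13, 5, 4, 2],
}
labels = list(confusion)
total = sum(sum(row) for row in confusion.values())
correct = sum(confusion[name][i] for i, name in enumerate(labels))
error_rate = 1 - correct / total
print(total, correct, round(100 * error_rate, 1))
```

The total of 517 matches the number of records in the dataset, and the error rate agrees with the 33.8% printed on the slide.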
Example
• Data from solar plant
[Figure: regression tree; root split SSR < 0.5, then CV >= 0.45 and SSR < 0.7, then SSR < 0.19, SSR < 0.32, C...]
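
A regression tree of this kind is just a set of nested threshold tests with a constant value at each leaf. The sketch below hard-codes the splits and leaf values listed for this tree in the transcript (slide 55); the arrangement of branches is an assumption, reconstructed from the node numbering of the flattened figure, and the function name is hypothetical.

```python
def predict_output(ssr, cv):
    """Normalised plant output predicted from SSR and CV by the slide's tree."""
    if ssr < 0.5:
        if cv >= 0.45:
            return 0.16 if ssr < 0.19 else 0.33
        return 0.43 if ssr < 0.32 else 0.58
    if ssr < 0.7:
        return 0.6 if cv >= 0.23 else 0.76
    return 0.87


for ssr, cv in [(0.1, 0.5), (0.4, 0.2), (0.6, 0.1), (0.8, 0.0)]:
    print(ssr, cv, predict_output(ssr, cv))
```

Only seven distinct output values are possible, which explains the step-like look of regression tree predictions compared with a linear model.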
Linear vs Tree
library(rpart)
rpart(formula=output~SSR+CV,data=df)  # regression tree
lm(formula=output~SSR+CV,data=df)     # linear model
[Figures: fitted output over SSR and CV for each model]

2nd CLIM-RU...
Example
[Figure: output (0.0 to 1.0) against day (0 to 400)]
Using model trees
• Linear models as leaves instead of constant values
[Tree with splits CV > 0.45 and SSR > 0.70; leaf model y = 0.84·SSR - 0.27·CV + 0...]
Using model trees
[Figure: tree output (0.2 to 0.9) against target (0.0 to 1.0)]
Regression tree and ...
Clustering
• Groups objects by “similarity”
• Many algorithms: most common and simplest centroid-based
  algorithms like ...
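
A minimal centroid-based clustering sketch in the spirit of K-Means, using Euclidean distance as the similarity measure. The toy points, the fixed number of iterations and the naive initialisation (the first k points) are assumptions chosen to keep the example short and deterministic.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, iterations=20):
    centroids = [list(p) for p in points[:k]]  # naive initialisation (assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (1.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

In practice the result depends on the initial centroids, so the algorithm is usually restarted several times from random starting points.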
Neural Networks
• Most famous machine learning tool
Pattern recognition
• Traditional machine learning task
• Network ‘learns’ by observing input-output pairs
• Generalisation ...
Perceptron
• Simplest neural network
• Used for binary classification of linearly separable data
[Figure: decision boundary separating Cla...]
Perceptron
• Binary classifier: ‘sign’ function as output
• x input, w weights: function y = wx
• How to find optimal param...
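
One common answer to the question of finding the parameters is the classic perceptron learning rule: whenever a sample is misclassified, move the weights towards it. The sketch below is an illustration under assumptions: the toy data, the bias handled as an extra constant input, the learning rate and the convention sign(0) = +1 are all choices made here, not taken from the slide.

```python
def sign(value):
    return 1 if value >= 0 else -1


def classify(weights, x):
    """Output of the perceptron: sign of the weighted sum (bias is the last weight)."""
    return sign(sum(w * v for w, v in zip(weights, list(x) + [1.0])))


def train_perceptron(samples, labels, epochs=1000, rate=1.0):
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(samples, labels):
            if classify(weights, x) != target:
                extended = list(x) + [1.0]
                # Perceptron rule: shift the weights towards the misclassified sample.
                weights = [w + rate * target * v for w, v in zip(weights, extended)]
                mistakes += 1
        if mistakes == 0:  # every sample is classified correctly
            break
    return weights


samples = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (6.0, 6.0), (7.0, 5.0), (6.0, 7.0)]
labels = [-1, -1, -1, 1, 1, 1]
weights = train_perceptron(samples, labels)
print(weights, [classify(weights, x) for x in samples])
```

The rule is only guaranteed to stop when the two classes are linearly separable, as they are in this toy set.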
Multilayer perceptron (MLP)
• Neurons with nonlinear differentiable functions
• One or more neuron layers
Activation f...
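Example
• Neural network vs Linear Regression
y = 0.811 SSR - 0.156 CC + 0.215
[Figure: neural network diagram with input SSR, summation (Σ) and activation f(•) units; weights shown include 5.176, -0.09, 4.119, 1.281, -...]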
Non-linearity

Ensemble Learning
• Combining models
• Majority vote is a type of ensemble learning

Simple Methods
• Discrete: Majori...
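
The two simple combination rules can be sketched in a few lines: a majority vote for class labels and a (weighted) average for continuous outputs. Function names and sample values are illustrative assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_average(values, weights=None):
    """Plain average when no weights are given, weighted average otherwise."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


print(majority_vote(["no", "small", "no"]))
print(weighted_average([10.0, 20.0, 30.0]))
print(weighted_average([10.0, 20.0], weights=[3.0, 1.0]))
```

Weights would typically reflect how much each model is trusted, for example its accuracy on validation data.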
Bagging
“improvements for unstable procedures”
• Bootstrap Aggregating [Breiman, 1996]
Training data (size n)
Uni...
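
A minimal sketch of the bagging idea: draw several bootstrap samples (uniform sampling with replacement), fit one model per sample and average the predictions. The base learner here is a deliberately trivial “predict the sample mean” model, and the data, sample size and number of models are assumptions for illustration only.

```python
import random


def bootstrap_sample(data, size, rng):
    """Uniform sampling with replacement."""
    return [rng.choice(data) for _ in range(size)]


def fit_mean_model(sample):
    """Toy base learner (an assumption): always predicts the sample mean."""
    mean = sum(sample) / len(sample)
    return lambda: mean


def bagging(data, n_models=25, sample_size=None, seed=0):
    rng = random.Random(seed)
    size = sample_size if sample_size is not None else len(data)
    models = [fit_mean_model(bootstrap_sample(data, size, rng))
              for _ in range(n_models)]
    # The ensemble prediction is the average of the individual predictions.
    return lambda: sum(m() for m in models) / len(models)


data = [20.0, 35.0, 50.0, 65.0, 80.0]
ensemble = bagging(data, n_models=50, sample_size=4)
prediction = ensemble()
print(prediction)
```

With an unstable base learner such as a decision tree, averaging over bootstrap samples tends to reduce the variance of the prediction.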
Ensemble: example
[Figure: power (kW, 20 to 100) against testing hours (60 to 220)]
Error measures
• “Goodness” of a model ➠ Number (error)
• Very important choice
• Different types:
  1. Absolute errors
  2....
Error measures
Absolute

Percentage

Relative
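
The formulas on this slide did not survive extraction. As a hedged reminder, the sketch below uses the standard textbook definitions of the absolute measures (MAE, MSE, RMSE) and of one percentage measure (MAPE); the sample values are made up, and relative measures, which compare against a benchmark model, are not covered.

```python
import math


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)


def mse(obs, pred):
    """Mean squared error."""
    return sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean squared error, in the same units as the data."""
    return math.sqrt(mse(obs, pred))


def mape(obs, pred):
    """Mean absolute percentage error; undefined when an observation is zero."""
    return 100.0 * sum(abs((p - o) / o) for o, p in zip(obs, pred)) / len(obs)


obs = [100.0, 200.0, 400.0]
pred = [110.0, 180.0, 400.0]
print(mae(obs, pred), mse(obs, pred), rmse(obs, pred), mape(obs, pred))
```

Squared measures weight large errors more heavily, while percentage measures weight errors on small observations more heavily, so two models can rank differently depending on the measure.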

Example
MAE: M1 22106 / M2 22931
MSE: M1 8E08 / M2 3.85E9
RMSE: M1 28291 / M2 62087
MAPE: M1 269% / M2 195%

2nd CLIM-RUN ...
Good practices
• See your data!
• Look at average AND median (why not quartiles?)
• Ask yourself “Is it significant?”
• Pa...
Example
MdAE: M1 17846 / M2 7416
MdAPE: M1 73% / M2 29%
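
Median-based measures such as MdAE and MdAPE replace the mean with the median, which makes them far less sensitive to a few very large errors. The sketch below uses made-up numbers, not the M1/M2 models of the slide, to show how a single outlier dominates the mean but barely moves the median.

```python
from statistics import median


def mdae(obs, pred):
    """Median absolute error."""
    return median(abs(p - o) for o, p in zip(obs, pred))


def mdape(obs, pred):
    """Median absolute percentage error; undefined when an observation is zero."""
    return 100.0 * median(abs((p - o) / o) for o, p in zip(obs, pred))


obs = [100.0, 100.0, 100.0, 100.0, 100.0]
pred = [101.0, 99.0, 102.0, 98.0, 600.0]  # one gross outlier

mean_abs_error = sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)
print(mean_abs_error, mdae(obs, pred), mdape(obs, pred))
```

A large gap between the mean-based and median-based values is itself a hint that a few cases drive the average error.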

Overfitting
• Machine learning models ‘learn’ the data
  (not the problem!)
• Cross-validation
• Early stopping
• Regularizat...
Cross-validation
• Standard method
• Dataset divided into three subsets: training, validation and
  testing

Training
Used t...
My lecture at the "2nd CLIM-RUN School: Building Two-way Communication: a week of climate services", 2-6 December 2013 ICTP, Trieste, Italy

Published in: Education, Technology, Business
Learning from data: data mining approaches for Energy & Weather/Climate applications

  1. 1. Learning from data: data mining approaches for Energy & Weather/Climate applications. Matteo De Felice, ENEA
  2. 2. http://matteodefelice.name/research, www.utmea.enea.it, matteo.defelice@enea.it
  3. 3. Outline • Building reliable Climate Services is really challenging • Cross-disciplinary • We need to use the latest and most advanced research and knowledge • We need to use all available data 2nd CLIM-RUN School
  4. 4. It’s just a matter of time “Can you tell me how much climate change will affect wind power production in Italy over the next two decades?” “Can you prepare an early-warning forest fire model for me?” Yes, you can (try). But how long will it take to elaborate new theories and physical models?
  5. 5. Energy & Climate/Meteorology The link between Energy and Climate/Meteorology is strengthening for several reasons: 1. Diffusion of Renewable Energies 2. Widespread use of air conditioning 3. Necessity of improving efficiency/reliability of power networks (electric utilities)
  6. 6. Energy & Climate/Meteorology • Conferences: ICEM • Projects: CLIM-RUN, EUPORIAS, SPECS • GFCS Climate Services
  7. 7. EU Projects Timeline of projects (2012 to 2017): COMBINE, EMBRACE, EUCLIPSE, CLIM-RUN, ECLISE, EUPORIAS, NACLIM, SPECS. Dates marked on the timeline: 1 Nov. 2012, 31 Jan. 2014, 30 Oct. 2015, 31 Jan. 2017. From Carlo Buontempo's presentation at ICEM 2013.
  8. 8. Energy: main challenges 1. Deregulation and competition 2. Climate issue: emissions reduction and higher demands 3. Security and stability: diffusion of non-controllable sources and greater focus on critical infrastructures and dependencies
  9. 9. Renewable Energy • We need to…
 …find the best place for new plants
 …manage existing plants (efficiently)
 …predict power output (tomorrow, in ten days, in five years) • We need it as soon as possible! • Communicating effectively with stakeholders (and municipalities, regional authorities, etc.) is fundamental
  10. 10. Data
 Table 1. Basic climate information needs of the energy sector
 (The last three columns are grouped under “Type of climatological input data”.)
 Field of application | Energy element needed | Climatological element needed | Time scale | Space scale
 Power grid; gas production and distribution | Daily load and network structure; network structure | Air temperature | d, m | g, a
  | | Precipitation | h, d | g, a
  | | Air temperature | d, m | g, a
  | | Wind speed | d, m | g, a
 Wind | Production – high resolution time series | Wind | h, d | s, g
  | | Pressure | d, m | s, g
 Solar | Production data | Radiation | h, d | s, g
  | | Wind | min, max | s, g
  | | Temperature | d, m | s, g
  | | Humidity | d, m | s, g
  | | Clouds | h, d | s, g
  | | Aerosols | h, d | s, g
 Hydropower and water-based cooling systems | Production data; plants temperature | Temperature | d, m | s, a
  | | Rainfall | d, m | s, a
  | | Runoff | d, m | s, a
  | | Water level | d, m | s, a
  | | Snow cover | d, m | s, a
  | | Snow melt | d, m | s, a
  | | Soil moisture | d, m | s, a
 Marine energy | Production data | Wind | d, m | s, g
  | | Wave height – direction | d, m | s, g
  | | Current speed | d, m | s, g
 s = point values; a = areal values; g = grid values; y = annual values; m = monthly values; d = daily values; h = hourly values; min = minimum values; max = maximum values.
 From Ruti & De Felice, Climate and Energy Production – A Climate Services Perspective, in Climate Vulnerability: Understanding and Addressing Threats to Essential Resources, Elsevier Inc., Academic Press, 2013.
  11. 11. Example • Goal: “assess the forecast quality for solar and wind energy generation at s2d time scales” 2nd CLIM-RUN School
  12. 12. Available RE information, on a scale from Nothing through Something to Full. Example with no information: create software models of RE plants based on my assumptions about technology, typology, etc. Example with full information: create models of RE plants using real parameters and options.
  13. 13. Solar Power • Power output (photovoltaics) is proportional to solar radiation and affected by cloud cover • Rise of temperature leads to reduced efficiency • Shadowing effects
  14. 14. Wind Power • Wind speed at specific height is needed (70-100 m) • Strong non-linear effect (cut-off speed) • Wind turbine interactions (wake effect) • Strong intermittence
  15. 15. Hydro-power • Hydro-power reservoirs store water for electricity generation during the time of year when it is most valuable • Precipitation (snow and rain) information needed: when / where / how much (intensity) • Necessity of hydrological models and in-situ measurements
  16. 16. Electricity Demand • Electricity demand is sensitive to weather conditions • Currently only climatological data are used for time-scales >14 days • Demand affected by “human activities” (calendar effects) and economic trends
  17. 17. Energy & Meteorology • Impact of spatial-temporal variability on energy systems • Prediction of expected power in high spatial and temporal resolutions • Operational aspects • Critical for market operations. How much energy will this RE plant produce in the next three hours?
  18. 18. Energy & Climate • Longer time-scales -> higher uncertainty • Climate risk management (e.g. extreme events) • Use of S2D climate forecasts. How much energy will this RE plant produce every year?
  19. 19. Energy Sector Vulnerability • Energy Sector is vulnerable to climate change (broadly speaking) • E.g. during the European 2003 heat-wave France reduced electricity exports in August by 50% (EDF). “[…] a summer average decrease in capacity of power plants of 6.3–19% in Europe and 4.4–16% in the United States depending on cooling system type and climate scenario for 2031–2060. In addition, probabilities of extreme (>90%) reductions in thermoelectric power production will on average increase by a factor of three.”
 (van Vliet et al., Vulnerability of US and European electricity supply to climate change, Nature Climate Change 2(9), 2012)
  20. 20. Vulnerabilities • Hydro-power depends on hydrological cycle (seasonal pattern). Higher impacts on regions where snowmelt is a relevant factor. • Wind power needs wind speed data at height >50m. Terrain roughness is a key parameter and it’s affected by vegetation cover. • Solar energy is affected by clouds and water vapour content
  21. 21. Vulnerabilities
 Sector | Variables | Impacts
 Gas, coal and nuclear | Air temperature | Cooling water quantity and quality (efficiency)
 Oil and gas | Extreme events | Extraction/import disruption
 Hydro-power | Precipitation | Water availability
 … | … | …
 See Schaeffer et al., Energy sector vulnerability to climate change: A review, Energy (38), 2012
  22. 22. Predicting changes • “Prediction is very difficult, especially about the future”. (Niels Bohr) • The impact of a predicted change in climate/weather variables on energy production/demand is not straightforward. • Highly non-linear. How does the uncertainty propagate?
  23. 23. Interesting Questions Side question • How can forecast performance be evaluated “meaningfully”? Armour, P. G. (2013). What is a good estimate?: whether forecasting is valuable. Communications of the ACM, 56(6), 31-32. • How does the uncertainty “propagate” from climate forecasts to applications? SPECS, date
  24. 24. Forecasting [Figure 1 from Coelho and Costa, Challenges for integrating seasonal climate forecasts in user applications, Current Opinion in Environmental Sustainability (2), 2010, p. 319: a simplified framework for an end-to-end forecasting system. Climate science: the climate prediction model produces seasonal climate forecasts 1 to 6 months in advance (100 to 200 km). Downscaling modeling in space and time produces seasonal climate forecasts at refined space and time scale. System science: the application model produces information to support decision making based on climate and non-climate related knowledge, feeding the end user's decision making. Challenges 1 to 5 are marked on the diagram.]
  25. 25. Approaches To take decisions we need estimates and predictions To generate predictions we need models To build models we need data and information 2nd CLIM-RUN School
  26. 26. DRIP & Big Data • DRIP (Data Rich Information Poor) era (Big Data) • We need tools to deal with high-dimensional and heterogeneous data. Data: “factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation” (Merriam-Webster). Information: “knowledge obtained from investigation, study, or instruction” (Merriam-Webster)
  27. 27. Data vs Information Data: precise, accurate, formally-defined. Information: useful, meaningful, context-related
  28. 28. White Box & Black box • How to use data/information to build models…
 …able to generalise?
 …reliable?
 …consistent? 2nd CLIM-RUN School
  29. 29. “Essentially, all models are wrong, but some are useful.” George E. P. Box (English statistician)
  30. 30. Boxes White box: organised knowledge. Black box: zero knowledge, many observations and measures. Gray box: some knowledge and some data
  31. 31. Right tools • Wolpert's “No Free Lunch” theorem: no “best model” exists across all possible problems • Exploring the trade-off between time-to-market and efficiency/performances [Figure: performance against time-to-market, with regions labelled “Genie’s lamp”, “Worthless effort” and “Random guess (dice?)”]
  32. 32. Modeling software • Spreadsheet software (e.g. Microsoft Excel) • Code programming (C/C++, Python) • Statistical computing environments (e.g. MATLAB, R) [Figure: performance against time-to-market for Excel, MATLAB, R, high-level code and low-level code, with cost and flexibility annotated]
  33. 33. R • open source multi-platform software for statistical computing • well-established with over 5000 packages • easy and quick statistical analysis • availability of free packages developed by research centres worldwide • quick access to databases and files of any format
 > gdp = getWorldBankData(id = 'NY.GDP.MKTP.PP.CD', date = '1990:2012')
 > plot(gdp)
  34. 34. Data Visualisation 1st rule: LOOK AT YOUR DATA 2nd rule: ALWAYS LOOK AT YOUR DATA • Examine your data to detect evident anomalies • Nice and clear plots help you to understand your data (and to prepare high-level publications) • Best tool: ggplot2 (R package)
  35. 35. Do you want to learn R? • • Web resources Coursera.org courses 2nd CLIM-RUN School
  36. 36. Any questions? • http://stats.stackexchange.com/ http://area51.stackexchange.com/proposals/36296/geoscience
  37. 37. Dealing with Complexity 2nd CLIM-RUN School
  38. 38. Data Mining • We are in the DRIP era (Data Rich Information Poor) • Big Data • ECMWF stores 50 petabytes of data (50·10^15). From data to information. Data: factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation. Information: knowledge obtained from investigation, study, or instruction
  39. 39. Data Mining • “Extraction of implicit and potentially useful information from data” • Automated search (software) • Main difficulties: 1. Defining “interestingness” 2. Spurious and accidental coincidences (exceptions) 3. Missing data and noise
  40. 40. Example [Figure: Italian Monthly Electricity Demand, GWh (×10^4) by month, 1990 to 2010]
  41. 41. Example [Figure: GWh (×10^4) by month, 1990 to 2010]
  42. 42. Modelling Methods • Experience • Rules of thumb and good practices • Statistics • Machine Learning methods. Why? • Forecasting • Analysis of past events • Control • Anomaly Detection
  43. 43. Example
 Table 1.2 Weather Data
 Outlook | Temperature | Humidity | Windy | Play
 Sunny | hot | high | false | no
 Sunny | hot | high | true | no
 Overcast | hot | high | false | yes
 Rainy | mild | high | false | yes
 Rainy | cool | normal | false | yes
 Rainy | cool | normal | true | no
 Overcast | cool | normal | true | yes
 Sunny | mild | high | false | no
 Sunny | cool | normal | false | yes
 Rainy | mild | normal | false | yes
 Sunny | mild | normal | true | yes
 Overcast | mild | high | true | yes
 Overcast | hot | normal | false | yes
 Rainy | mild | high | true | no
 Book excerpt shown with the table: “... that are suitable for playing some unspecified game. In general, instances in a dataset are characterized by the values of features, or attributes, that measure different aspects of the instance. In this case there are four attributes: outlook, temperature, humidity, and windy. The outcome is whether to play or not. In its simplest form, shown in Table 1.2, all four attributes have values that are symbolic categories rather than numbers. Outlook can be sunny, overcast, or rainy; temperature can be hot, mild, or cool; humidity can be high or normal; and windy ...”
 From Ian Witten, Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann, 2005.
  44. 44. Example
 Fire Weather Index system
 X | Y | month | day | FFMC | DMC | DC | ISI | Temp | RH | wind | rain | area (ha)
 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0 | 0
 9 | 9 | jul | tue | 85.8 | 48.3 | 313.4 | 3.9 | 18 | 42 | 2.7 | 0 | 0.36
 3 | 6 | sep | mon | 90.9 | 126.5 | 686.5 | 7 | 15.6 | 66 | 3.1 | 0 | 0
 6 | 4 | aug | thu | 95.2 | 131.7 | 578.8 | 10.4 | 20.3 | 41 | 4 | 0 | 1.9
 6 | 3 | aug | thu | 91.6 | 138.1 | 621.7 | 6.3 | 18.9 | 41 | 3.1 | 0 | 10.34
 7 | 5 | aug | tue | 96.1 | 181.1 | 671.2 | 14.3 | 27.3 | 63 | 4.9 | 6.4 | 10.82
 Burned area of forest fires, in the northeast region of Portugal (517 records) [Cortez and Morais, 2007]
  45. 45. Example • Soybean disease database with 307 records • 19 different diseases • 35 attributes like month, anomalies in growth, stem, leaves, visible damages, precipitation and temperature, etc. • A plant pathologist identifies 72% correctly • Computer-generated rules identify 97.5%
  46. 46. Methods • How to represent discovered patterns? 2nd CLIM-RUN School
  47. 47. Tables • Rudimentary and simple • Look up the appropriate “inputs” • Problem with large (real-world) datasets • What about continuous variables? [Illustration: Table 1.2 Weather Data, and a forest-fire record (X 6, Y 4, aug, thu, FFMC 95.2, DMC 131.7, DC 578.8, ISI 10.4, Temp 20.3, RH 41, wind 4, rain 0, area 1.9 ha) next to an unseen record (X 6, Y 4, aug, thu, FFMC 90, DMC 135, DC 550, ISI 10.4, Temp 22, RH 40, wind 4, rain 0) whose area is unknown (?)]
  48. 48. Linear models • Regression (statistics) • Can be applied to continuous and binary variables
  49. 49. Example y = a1 SSR + a2 CC + a3 2nd CLIM-RUN School
  50. 50. Example #2 y = 54.11 SSR - 10.56 CC + 18.62 • Correlation of 0.84 • Can we trust this model?
  51. 51. Example • Forest fires • Predictand (area) is dramatically skewed (trying with log transformation) • Poor results with a linear model
  52. 52. Overfitting 2nd CLIM-RUN School
  53. 53. Trees • Divide-and-conquer approach • Decision/regression trees [Figure, cropped from a paper, “Fig. 1: An example of a model tree”: internal nodes X1 ≤ C1 and X2 ≤ C2, leaves L0, L1, L2. Model L0 is used when X1 ≤ C1 ∧ X2 ≤ C2, L1 when X1 ≤ C1 ∧ X2 > C2, and L2 when X1 > C1. Model trees combine conventional regression trees and linear regression models.]
  54. 54. Example • Again with forest fires • Converting continuous variable area into factor as: - ‘no’: area = 0 - ‘small’: 0 < area < 5 - ‘medium’: 5 < area < 50 - ‘large’: area > 50
 month = dec: medium (9)
 month in {jan,mar,may,nov,oct}:
 :...wind <= 8: no (72/23)
 :   wind > 8: small (2)
 month in {apr,aug,feb,jul,jun,sep}:
 :...month = apr:
     :...temp <= 11.8: small (4/2)
     :   temp > 11.8: no (5/1)
     month = jun:
     :...temp > 23.8: medium (4/2)
     :   temp <= 23.8:
     :   :...wind > 4.9: small (3)
     :       wind <= 4.9:
     :       :...temp <= 14.8: small (3/1)
     :           temp > 14.8: no (7)
     month = feb:
     :...temp > 12.4: no (6)
     :   temp <= 12.4:
     :   :...wind > 4.5: medium (5/2)
     :       wind <= 4.5:
     :       :...wind > 2.2: no (3/1)
     :           wind <= 2.2: [ ... ]
 Error: 33.8%
 Confusion matrix (columns: no, small, med, large):
 no: 214 27 6 0
 small: 37 75 7 0
 med: 60 16 51 0
 large: 13 5 4 2
 (Worst results in cross-validation!)
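  55. 55. Example • Data from solar plant
 Regression tree (node numbers in brackets; the “yes” branch is listed first):
 [1] SSR < 0.5
   yes: [2] CV >= 0.45
     yes: [4] SSR < 0.19 (yes: [8] 0.16; no: [9] 0.33)
     no: [5] SSR < 0.32 (yes: [10] 0.43; no: [11] 0.58)
   no: [3] SSR < 0.7
     yes: [6] CV >= 0.23 (yes: [12] 0.6; no: [13] 0.76)
     no: [7] 0.87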
  56. 56. Linear vs Tree rpart(formula=output~SSR+CV,data=df) lm(formula=output~SSR+CV,data=df) [Figures: fitted output over SSR and CV for each model]
  57. 57. Example [Figure: output (0.0 to 1.0) against day (0 to 400)]
  58. 58. Using model trees • Linear models as leaves instead of constant values [Tree with splits CV > 0.45 and SSR > 0.70 and leaf models y = 0.84·SSR - 0.27·CV + 0.31, y = 0.88·SSR + 0.043 and y = 0.17·SSR - 0.24·CV + 0.76]
  59. 59. Using model trees [Figure: regression tree and model tree outputs (0.2 to 0.9) against target (0.0 to 1.0)]
  60. 60. Clustering • Groups objects by “similarity” • Many algorithms: the most common and simplest are centroid-based algorithms like K-Means • Similarity is often measured with Euclidean distance: $d = \sqrt{(x^1_1 - x^2_1)^2 + (x^1_2 - x^2_2)^2 + \cdots + (x^1_k - x^2_k)^2}$
  61. 61. Neural Networks • Most famous machine learning tool 2nd CLIM-RUN School
  62. 62. Pattern recognition • Traditional machine learning task • Network ‘learns’ by observing input-output pairs • Generalisation on original (and never observed) data (!) [Figure labels: Appl, Pea, Training, Test 1, Test 2]
  63. 63. Perceptron • Simplest neural network • Used for binary classification of linearly separable data [Figure: Class 1 and Class 2 points in the (x1, x2) plane separated by a decision boundary] N-dimensional case: classes separable by a hyperplane
  64. 64. Perceptron • Binary classifier: ‘sign’ function as output • x input, w weights: function y = wx • How to find optimal parameters?
  65. 65. Multilayer perceptron (MLP) • Neurons with nonlinear differentiable functions • One or more neuron layers [Figure: activation functions]
  66. 66. Example • Neural network vs Linear Regression y = 0.811 SSR - 0.156 CC + 0.215 [Figure: neural network with inputs SSR and CC, summation (Σ) and activation f(•) units; weights shown: 5.176, -0.09, 4.119, 1.281, -0.093, -1.295, 1.099, 0.067, -0.019]
  67. 67. Non-linearity 2nd CLIM-RUN School
  68. 68. Ensemble Learning • Combining models • Majority vote is a type of ensemble learning. Simple Methods • Discrete: Majority vote (or weighted majority vote) • Continuous: Average (or weighted average). Pros and Cons • Easy to implement • Easy to understand • Results sometimes hard to analyse
  69. 69. Bagging “improvements for unstable procedures” • Bootstrap Aggregating [Breiman, 1996] • Introducing randomisation [Diagram: training data (size n) → uniform sampling with replacement → T1, T2, …, Tk (each of size n’ < n) → average]
  70. 70. Ensemble: example [Figure: power (kW, 20 to 100) against testing hours (60 to 220)]
  71. 71. Error measures • “Goodness” of a model ➠ Number (error) • Very important choice • Different types: 1. Absolute errors 2. Percentage errors 3. Relative errors
  72. 72. Error measures Absolute Percentage Relative 2nd CLIM-RUN School
  73. 73. Example MAE: M1 22106 / M2 22931 MSE: M1 8E08 / M2 3.85E9 RMSE: M1 28291 / M2 62087 MAPE: M1 269% / M2 195% 2nd CLIM-RUN School
  74. 74. Good practices • See your data! • Look at average AND median (why not quartiles?) • Ask yourself “Is it significant?” • Pay attention to percentage measures near zero values!
  75. 75. Example MdAE: M1 17846 / M2 7416 MdAPE: M1 73% / M2 29% 2nd CLIM-RUN School
  76. 76. Overfitting • Machine learning models ‘learn’ the data (not the problem!) • Cross-validation • Early stopping • Regularization
  77. 77. Cross-validation • Standard method • Dataset divided into three subsets: training, validation and testing. Training: used to build the model. Validation: used to evaluate generalisation. Testing: used to evaluate performances
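
The three-way division described on the last slide can be sketched as follows. The 60/20/20 proportions, the fixed seed and the random shuffling are assumptions; for time series such as energy demand, a chronological split is often preferred to shuffling.

```python
import random


def three_way_split(records, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the records and cut them into training, validation and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    training = shuffled[:n_train]                   # used to build the model
    validation = shuffled[n_train:n_train + n_val]  # used to evaluate generalisation
    testing = shuffled[n_train + n_val:]            # used to evaluate performances
    return training, validation, testing


training, validation, testing = three_way_split(range(100))
print(len(training), len(validation), len(testing))
```

Keeping the testing subset untouched until the very end is what makes its error a fair estimate of performance on new data.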
