Deriving Knowledge from Data at Scale
Lecture 3 Outline
• Opening Discussion
• Forecasting, continued (2/2)
• Introducing Weka
• Decision Trees
• Hands On, Decision Tree in Weka (might be a stretch…)
Learning Objectives
• Understand the elements of a time series
• Use Excel as a forecasting tool in a practical application
• Become familiar with time series manipulation techniques
• Apply automatic time series procedures (homework)
• Gain familiarity with Weka
• Dive into decision trees in Weka (time permitting)
Time series components:
• Seasonality: fixed and known period
• Cycle: rise and fall, but not of a fixed period
• Trend
Smoothing methods:
• Moving average
• Exponential
Model-based methods:
• Linear
• Exponential
• Autoregressive
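The two smoothing methods above can be sketched in a few lines. A minimal illustration in Python (function names are my own, not from the lecture):

```python
def moving_average(y, window):
    """Smooth a series with a trailing moving average over `window` points."""
    return [sum(y[t - window + 1:t + 1]) / window
            for t in range(window - 1, len(y))]

def exponential_smoothing(y, alpha):
    """Simple exponential smoothing: s_t = alpha*y_t + (1 - alpha)*s_{t-1}."""
    s = [y[0]]  # initialize with the first observation
    for t in range(1, len(y)):
        s.append(alpha * y[t] + (1 - alpha) * s[-1])
    return s
```

The moving average weights the last `window` observations equally, while exponential smoothing weights recent observations more heavily, controlled by `alpha`.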
In the multiplicative model for time series:
Time Series = Trend × Seasonality × Irregular
Let's assume the cyclical component is neutral (a factor of 1 in the multiplicative model)…
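A minimal ratio-to-moving-average sketch of this multiplicative decomposition in Python (the function name and toy series are my own; assumes an even seasonal period such as quarterly data):

```python
def multiplicative_decompose(y, period):
    """Estimate Trend and Seasonal factors for y_t = Trend_t * Seasonal_t * Irregular_t."""
    n = len(y)
    half = period // 2
    # Centered moving average as the trend estimate (even-period weighting).
    trend = [None] * n
    for t in range(half, n - half):
        total = 0.5 * y[t - half] + sum(y[t - half + 1:t + half]) + 0.5 * y[t + half]
        trend[t] = total / period
    # Seasonal index: average the ratios y_t / trend_t per season, then normalize
    # so the indices average to 1 (the neutral multiplicative factor).
    ratios = [[] for _ in range(period)]
    for t in range(n):
        if trend[t]:
            ratios[t % period].append(y[t] / trend[t])
    raw = [sum(r) / len(r) for r in ratios]
    mean = sum(raw) / period
    seasonal = [s / mean for s in raw]
    return trend, seasonal
```

Dividing the series by trend and seasonal factors leaves the irregular component.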
= intercept + slope × time code for the row
(press F4 on the intercept and slope cell references to lock them as absolute references)
= 5.099 + 0.147 × 1
Copy the formula all the way down to Y4 Q4.
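The same fitted-trend calculation, sketched in Python rather than Excel (the coefficients 5.099 and 0.147 are taken from the slide; the helper name is my own):

```python
INTERCEPT = 5.099  # regression intercept from the slide
SLOPE = 0.147      # regression slope from the slide

def trend_value(time_code):
    """Fitted trend for a given time code (1 = Y1 Q1, 2 = Y1 Q2, ...)."""
    return INTERCEPT + SLOPE * time_code

# Equivalent of copying the formula down through Y4 Q4 (16 quarters):
fitted = [trend_value(t) for t in range(1, 17)]
```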
[Chart: DailyOrders, 8/17/04 through 8/17/14, y-axis 0 to 16,000,000; no lag; the series is split into a training period (fit parameters) and a test period]
Evaluating a forecast model:
1. Coefficient of Determination (R²):
   R² = 1 − SS_residual / SS_total, where SS_total = Σ_t (y_t − ȳ)²
2. Mean Absolute Error (MAE):
   MAE = (1/T) Σ_t | y_t − (a1·y_{t−1} + a2·y_{t−7} + a3·y_{t−365}) |
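Both metrics are a few lines of code. A minimal sketch in Python (function names are my own):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

def mean_absolute_error(actual, predicted):
    """MAE: average absolute deviation between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)
```

R² is unitless (1.0 for a perfect fit), while MAE is in the units of the series, which makes it easy to interpret against the raw data.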
Transformation: take differences (the diff() function in R)
Transformation: take logs or powers. The Box-Cox family of transformations flexibly covers both:
w = (y^λ − 1) / λ for λ ≠ 0, and w = log(y) for λ = 0
(the original scale is recovered via y = (λ·w + 1)^(1/λ))
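Both transformations can be sketched in a few lines of Python (the lecture uses R's diff(); these function names are my own):

```python
import math

def diff(y, lag=1):
    """Differencing, like R's diff(): y_t - y_{t-lag}."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

def box_cox(y, lam):
    """Box-Cox transform: (y^lambda - 1)/lambda, or log(y) when lambda == 0."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]
```

Differencing removes trend (seasonal differencing, e.g. lag=12, removes seasonality); Box-Cox with λ near 0 behaves like a log and stabilizes growing variance.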
For this project you can use the beer data set, or analyze a
dataset of interest to you. The objective is to give you a
hands-on opportunity to work with R's time series
functionality, in particular ARIMA. You can find time series
datasets at the Time Series Data Library, or you can fall
back on the beer data set.
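The homework itself is in R (stats::arima() is the usual entry point), but the AR part of ARIMA can be illustrated in pure Python: a least-squares estimate of the AR(1) coefficient. This is my own sketch, not the assignment's required method:

```python
import random

def fit_ar1(y):
    """Least-squares estimate of phi in the AR(1) model y_t = phi * y_{t-1} + e_t."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

# Simulate an AR(1) series with phi = 0.7 and recover the coefficient.
random.seed(0)
y = [0.0]
for _ in range(999):
    y.append(0.7 * y[-1] + random.gauss(0, 1))
phi_hat = fit_ar1(y)  # should land near 0.7
```

Full ARIMA adds differencing (the I) and moving-average error terms (the MA) on top of this autoregressive idea.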
Weka provides algorithms for:
• Classification
• Regression
• Clustering
Our focus here: classification trees.
The weather decision tree:

Outlook?
  sunny    -> Humidity? (high -> No, normal -> Yes)
  overcast -> Yes
  rain     -> Windy? (true -> No, false -> Yes)

• Each node is a test on one attribute
• The branches are the possible attribute values of the node
• Leaves are the decisions
The same decision tree, annotated with sample size: every split partitions the training examples among its branches, so the deeper you go, the smaller your data gets at each node.
Classifying a new test example:
(Outlook == rain) and (Windy == false)
Pass it down the tree -> decision is Yes.
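"Passing an example down the tree" can be written directly. A minimal sketch in Python, with the weather tree encoded as nested dicts (the encoding and names are my own, not Weka's):

```python
# Inner nodes are (attribute, branches) pairs; plain strings are leaf decisions.
TREE = ("Outlook", {
    "sunny":    ("Humidity", {"high": "No", "normal": "Yes"}),
    "overcast": "Yes",
    "rain":     ("Windy", {"true": "No", "false": "Yes"}),
})

def classify(tree, example):
    """Follow attribute tests down the tree until a leaf decision is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

# The slide's test example: Outlook == rain and Windy == false.
decision = classify(TREE, {"Outlook": "rain", "Windy": "false"})  # "Yes"
```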
Each path from the root to a "Yes" leaf reads off as a rule:
(Outlook == overcast) -> Yes
(Outlook == rain) and (Windy == false) -> Yes
(Outlook == sunny) and (Humidity == normal) -> Yes
• The goal is for the resulting decision tree to be as small as
possible (Occam's Razor)
• Finding the minimal decision tree consistent with the data is
NP-hard
• The recursive algorithm is a greedy heuristic search for a simple
tree, and cannot guarantee optimality
• Select attributes that split the examples into sets that are
relatively pure in one label; this way we are closer to a leaf
node
Which attribute should be used as the test?
Intuitively, you would prefer the one that separates
the training examples as much as possible, i.e., the
one that most reduces the entropy…
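That entropy-reduction criterion is the information gain used by ID3/C4.5 (J48 in Weka). A minimal sketch in Python, applied to the weather data's split on Outlook (function names are my own):

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a label distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts):
    """Entropy reduction from splitting parent_counts into the given children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Weather data: 9 yes / 5 no overall; splitting on Outlook gives
# sunny (2 yes, 3 no), overcast (4 yes, 0 no), rain (3 yes, 2 no).
gain_outlook = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
```

The attribute with the highest gain (here Outlook, at roughly 0.247 bits) is chosen as the test at the root, and the procedure recurses on each branch.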