Deriving Knowledge from Data at Scale
Lecture 3 Outline
• Opening Discussion
• Forecasting, continued (2/2)
• Introducing Weka
• Decision Trees
• Hands On, Decision Tree in Weka (might be a stretch…)
Learning Objectives
• Understand the elements of a time series
• Use Excel as a forecasting tool in a practical application
• Become familiar with time series manipulation techniques
• Apply automatic time series procedures (homework)
• Gain familiarity with Weka
• Dive into decision trees in Weka (time permitting)
Time series components:
• Seasonality: fixed and known period
• Cycle: rise and fall, but not of a fixed period
• Trend
Smoothing methods:
• Moving average
• Exponential
Model-based methods:
• Linear
• Exponential
• Autoregressive
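The two smoothing methods above can be sketched in a few lines. A minimal illustration in Python (function names are my own, not from the lecture):

```python
def moving_average(y, window):
    """Smooth a series with a trailing moving average over `window` points."""
    return [sum(y[t - window + 1:t + 1]) / window
            for t in range(window - 1, len(y))]

def exponential_smoothing(y, alpha):
    """Simple exponential smoothing: s_t = alpha*y_t + (1 - alpha)*s_{t-1}."""
    s = [y[0]]  # initialize with the first observation
    for t in range(1, len(y)):
        s.append(alpha * y[t] + (1 - alpha) * s[-1])
    return s
```

The moving average weights the last `window` observations equally, while exponential smoothing weights recent observations more heavily, controlled by `alpha`.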
In the multiplicative model for time series:
Time Series = Trend × Seasonality × Irregular
Let's assume the cyclical component is neutral (a factor of 1 in the multiplicative model)…
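A minimal ratio-to-moving-average sketch of this multiplicative decomposition in Python (the function name and toy series are my own; assumes an even seasonal period such as quarterly data):

```python
def multiplicative_decompose(y, period):
    """Estimate Trend and Seasonal factors for y_t = Trend_t * Seasonal_t * Irregular_t."""
    n = len(y)
    half = period // 2
    # Centered moving average as the trend estimate (even-period weighting).
    trend = [None] * n
    for t in range(half, n - half):
        total = 0.5 * y[t - half] + sum(y[t - half + 1:t + half]) + 0.5 * y[t + half]
        trend[t] = total / period
    # Seasonal index: average the ratios y_t / trend_t per season, then normalize
    # so the indices average to 1 (the neutral multiplicative factor).
    ratios = [[] for _ in range(period)]
    for t in range(n):
        if trend[t]:
            ratios[t % period].append(y[t] / trend[t])
    raw = [sum(r) / len(r) for r in ratios]
    mean = sum(raw) / period
    seasonal = [s / mean for s in raw]
    return trend, seasonal
```

Dividing the series by trend and seasonal factors leaves the irregular component.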
= intercept + slope × time code for the row
(press F4 on the intercept and slope cell references to lock them as absolute references)
= 5.099 + 0.147 × 1
Copy the formula all the way down to Y4 Q4.
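The same fitted-trend calculation, sketched in Python rather than Excel (the coefficients 5.099 and 0.147 are taken from the slide; the helper name is my own):

```python
INTERCEPT = 5.099  # regression intercept from the slide
SLOPE = 0.147      # regression slope from the slide

def trend_value(time_code):
    """Fitted trend for a given time code (1 = Y1 Q1, 2 = Y1 Q2, ...)."""
    return INTERCEPT + SLOPE * time_code

# Equivalent of copying the formula down through Y4 Q4 (16 quarters):
fitted = [trend_value(t) for t in range(1, 17)]
```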
[Chart: DailyOrders, 8/17/04 through 8/17/14, y-axis 0 to 16,000,000; no lag; the series is split into a training period (fit parameters) and a test period]
Evaluating a forecast model:
1. Coefficient of Determination (R²):
   R² = 1 − SS_residual / SS_total, where SS_total = Σ_t (y_t − ȳ)²
2. Mean Absolute Error (MAE):
   MAE = (1/T) Σ_t | y_t − (a1·y_{t−1} + a2·y_{t−7} + a3·y_{t−365}) |
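Both metrics are a few lines of code. A minimal sketch in Python (function names are my own):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

def mean_absolute_error(actual, predicted):
    """MAE: average absolute deviation between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)
```

R² is unitless (1.0 for a perfect fit), while MAE is in the units of the series, which makes it easy to interpret against the raw data.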
Transformation: take differences (the diff() function in R)
Transformation: take logs or powers. The Box-Cox family of transformations flexibly covers both:
w = (y^λ − 1) / λ for λ ≠ 0, and w = log(y) for λ = 0
(the original scale is recovered via y = (λ·w + 1)^(1/λ))
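Both transformations can be sketched in a few lines of Python (the lecture uses R's diff(); these function names are my own):

```python
import math

def diff(y, lag=1):
    """Differencing, like R's diff(): y_t - y_{t-lag}."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

def box_cox(y, lam):
    """Box-Cox transform: (y^lambda - 1)/lambda, or log(y) when lambda == 0."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]
```

Differencing removes trend (seasonal differencing, e.g. lag=12, removes seasonality); Box-Cox with λ near 0 behaves like a log and stabilizes growing variance.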
For this project you can use the beer data set, or analyze a
dataset of interest to you. The objective is to give you a
hands-on opportunity to work with R's time series
functionality, in particular ARIMA. You can find time series
datasets at the Time Series Data Library, or you can fall
back on the beer data set.
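The homework itself is in R (stats::arima() is the usual entry point), but the AR part of ARIMA can be illustrated in pure Python: a least-squares estimate of the AR(1) coefficient. This is my own sketch, not the assignment's required method:

```python
import random

def fit_ar1(y):
    """Least-squares estimate of phi in the AR(1) model y_t = phi * y_{t-1} + e_t."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

# Simulate an AR(1) series with phi = 0.7 and recover the coefficient.
random.seed(0)
y = [0.0]
for _ in range(999):
    y.append(0.7 * y[-1] + random.gauss(0, 1))
phi_hat = fit_ar1(y)  # should land near 0.7
```

Full ARIMA adds differencing (the I) and moving-average error terms (the MA) on top of this autoregressive idea.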
Weka provides algorithms for:
• Classification
• Regression
• Clustering
Our focus here: classification trees.
The weather decision tree:

Outlook?
  sunny    -> Humidity? (high -> No, normal -> Yes)
  overcast -> Yes
  rain     -> Windy? (true -> No, false -> Yes)

• Each node is a test on one attribute
• The branches are the possible attribute values of the node
• Leaves are the decisions
The same decision tree, annotated with sample size: every split partitions the training examples among its branches, so the deeper you go, the smaller your data gets at each node.
Classifying a new test example:
(Outlook == rain) and (Windy == false)
Pass it down the tree -> decision is Yes.
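"Passing an example down the tree" can be written directly. A minimal sketch in Python, with the weather tree encoded as nested dicts (the encoding and names are my own, not Weka's):

```python
# Inner nodes are (attribute, branches) pairs; plain strings are leaf decisions.
TREE = ("Outlook", {
    "sunny":    ("Humidity", {"high": "No", "normal": "Yes"}),
    "overcast": "Yes",
    "rain":     ("Windy", {"true": "No", "false": "Yes"}),
})

def classify(tree, example):
    """Follow attribute tests down the tree until a leaf decision is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

# The slide's test example: Outlook == rain and Windy == false.
decision = classify(TREE, {"Outlook": "rain", "Windy": "false"})  # "Yes"
```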
Each path from the root to a "Yes" leaf reads off as a rule:
(Outlook == overcast) -> Yes
(Outlook == rain) and (Windy == false) -> Yes
(Outlook == sunny) and (Humidity == normal) -> Yes
• The goal is for the resulting decision tree to be as small as
possible (Occam's Razor)
• Finding the minimal decision tree consistent with the data is
NP-hard
• The recursive algorithm is a greedy heuristic search for a simple
tree, and cannot guarantee optimality
• Select attributes that split the examples into sets that are
relatively pure in one label; this way we are closer to a leaf
node
Which attribute should be used as the test?
Intuitively, you would prefer the one that separates
the training examples as much as possible, i.e., the
one that most reduces the entropy…
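That entropy-reduction criterion is the information gain used by ID3/C4.5 (J48 in Weka). A minimal sketch in Python, applied to the weather data's split on Outlook (function names are my own):

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a label distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts):
    """Entropy reduction from splitting parent_counts into the given children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Weather data: 9 yes / 5 no overall; splitting on Outlook gives
# sunny (2 yes, 3 no), overcast (4 yes, 0 no), rain (3 yes, 2 no).
gain_outlook = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
```

The attribute with the highest gain (here Outlook, at roughly 0.247 bits) is chosen as the test at the root, and the procedure recurses on each branch.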