Machine learning overview (with SAS software)
MACHINE LEARNING WITH SAS WORKSHOP
GETTING THE MOST OUT OF YOUR DATA
Longhow Lam
AGENDA AND SOME READING MATERIAL
Intro & positioning of Machine learning
SAS platform for Machine learning
Overview of Specific methods
Some examples
Further reading
An experimental comparison of classification techniques for imbalanced
credit scoring data sets using SAS® Enterprise Miner
http://support.sas.com/resources/papers/proceedings12/129-2012.pdf
Benchmarking state-of-the-art classification algorithms for credit scoring: A ten-year update
http://www.business-school.ed.ac.uk/waf/crc_archive/2013/42.pdf
Highly recommended for more detail:
The Elements of Statistical Learning, Hastie, Tibshirani & Friedman
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
LONGHOW LAM SHORT BIO
MSc Mathematics (1995), Vrije Universiteit Amsterdam (drs. in mathematics)
MTD Applied Statistics (1997), Delft University of Technology (two-year postgraduate programme in applied statistics)
10+ years SAS experience (Base / Stat / Guide / Miner / VA / VS)
10+ years R experience (An introduction to R)
10+ years predictive modeling experience
ABN AMRO – risk modeler
  Basel, credit risk, ALM models
Business & Decision – quantitative consultant
  ING Belgium, Fortis
  Leaseplan, Belgium Post
Experian – data miner
  Collection Score, Delphi credit score, consulting
Follow me: @longhowlam
INTRO MACHINE LEARNING
Wikipedia:
“Machine learning is a scientific discipline that deals with the construction
and study of algorithms that can learn from data. Such algorithms operate by
building a model based on inputs and using that to make predictions or
decisions, rather than following only explicitly programmed instructions.”
MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR
[Diagram of overlapping terms: statistical modeling, supervised learning, unsupervised learning, clustering, data mining, machine learning, dimension reduction, association rules, recommenders, autoencoders, self-organizing maps]
SAS SOFTWARE
FOR MACHINE LEARNING (AND DATA MINING)
THE ANALYTICS LIFECYCLE
[Diagram: IDENTIFY / FORMULATE PROBLEM → DATA PREPARATION → DATA EXPLORATION → TRANSFORM & SELECT → BUILD MODEL → VALIDATE MODEL → DEPLOY MODEL → EVALUATE / MONITOR RESULTS]
BUSINESS MANAGER: SAS Decision Manager
IT SYSTEMS / MANAGEMENT: SAS Model Manager, SAS In-Database Scoring
BUSINESS ANALYST: SAS Enterprise Guide, SAS Visual Analytics, SAS Visual Statistics
DATA MINER / DATA SCIENTIST: SAS Enterprise Miner / Text Miner, SAS IMSTAT / Recommender
EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES
PROC hpbnet data = creditdata structure = markovblanket;
   model default = x1 LTV income age;
   selection = Y;
RUN;
MACHINE LEARNING: HIGH PERFORMANCE
Machine learning algorithms designed to run on single-blade or multi-blade distributed memory environments.
MACHINE LEARNING WITH SAS: EASILY DEPLOYABLE
Manage rules + data + models.
Deployment flexibility:
  batch
  real time
  stored process
  in-database
Drive reuse and consistency.
PREDICT SOMEONE'S INCOME: IS THIS MACHINE LEARNING?
Predict someone's income from his/her age:
Collect some data into an analytical base table.
Plot the data.
Fit a line: Income = 15.2 + 1.102 × Age
[Figure: Age–Income scatter with the fitted line]
MACHINE LEARNING: ADDRESSING SOME MODELING ISSUES
The problem may not be linear: X², X³, log(X), √X, 1/X, …?
You do not have one input variable: X1, X2, X3, …, X567
Interactions and correlations between input variables
[Figure: analytical base table with inputs such as age, income and gender, plus derived inputs]
MACHINE LEARNING: WHY IT CAN MATTER € € €
Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects:
Conversion rate is around 1%. Profit per conversion is €80; cost per mailing is €0.70.
Total profit = 100,000 × 1% × €80 − 100,000 × €0.70 = €10,000
Now suppose we have a targeted mailing with a machine learning predictive model that uses prospect input data to distinguish between high and low responders.
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile   N        Conversion   Profit    Cumulative
1        10,000   2.00%         9,000      9,000
2        10,000   1.50%         5,000     14,000
3        10,000   1.00%         1,000     15,000
4        10,000   1.00%         1,000     16,000
5        10,000   1.00%         1,000     17,000
6        10,000   1.00%         1,000     18,000
7        10,000   1.00%         1,000     19,000
8        10,000   0.80%          −600     18,400
9        10,000   0.50%        −3,000     15,400
10       10,000   0.20%        −5,400     10,000
By using the model to send letters only to the first 7 deciles, the profit is now €19,000 (instead of €10,000).
With 100 such campaigns a year, that means an increase of €0.9 mln!
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile   N        Conversion   Profit    Cumulative
1        10,000   3.00%        17,000     17,000
2        10,000   2.00%         9,000     26,000
3        10,000   1.40%         4,200     30,200
4        10,000   1.15%         2,200     32,400
5        10,000   1.00%         1,000     33,400
6        10,000   0.60%        −2,200     31,200
7        10,000   0.40%        −3,800     27,400
8        10,000   0.30%        −4,600     22,800
9        10,000   0.10%        −6,200     16,600
10       10,000   0.05%        −6,600     10,000
By using a much better model to send letters only to the first 5 deciles, the profit is now €33,400 (instead of €10,000).
With 100 such campaigns a year, that means an increase of €2.34 mln!
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile   N        Conversion   Profit    Cumulative
1        10,000   3.35%        19,800     19,800
2        10,000   2.23%        10,840     30,640
3        10,000   1.30%         3,400     34,040
4        10,000   1.10%         1,800     35,840
5        10,000   1.00%         1,000     36,840
6        10,000   0.55%        −2,600     34,240
7        10,000   0.28%        −4,760     29,480
8        10,000   0.25%        −5,000     24,480
9        10,000   0.05%        −6,600     17,880
10       10,000   0.02%        −6,840     11,040
Now let's suppose we have an even slightly better model than the last one: the profit is now €36,840.
With 100 such campaigns a year, that means an increase of €2.68 mln!
OVERVIEW OF SPECIFIC
MACHINE LEARNING METHODS
Classical regression
Decision trees
Dimension reduction
Bagging & Boosting
Support vector machines
K-Nearest Neighbour
Neural networks / deep learning
Bayesian networks
Text mining
Recommendation engine
“CLASSICAL” REGRESSION
LINEAR & LOGISTIC REGRESSION
Numeric target variable: Income = a + b × Age
Binary target variable: P(Churn) = 1 / ( 1 + exp(a + b × Age) )
[Figures: fitted line through an Age–Income scatter; logistic curve for P(Churn) between 0 and 1 against Age]
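As a minimal sketch of how these two models could be fit in SAS (assuming a data set customers with variables income, age and a 0/1 churn flag; these names are hypothetical):

proc reg data = customers;
   model income = age;               /* numeric target: linear regression */
run;

proc logistic data = customers;
   model churn(event = '1') = age;   /* binary target: logistic regression */
run;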
SPLINE REGRESSION: MODELING NON-LINEARITIES
Often there is a non-linear relation between Y (or logit(Y)) and X. Common remedies:
• transformation of inputs: X², X³, log(X), etc.
• buckets / binning of variables
• smoothing splines
SPLINE REGRESSION: MODELING NON-LINEARITIES
Smoothing splines: piecewise polynomials that are glued together at knots, with a smoothing parameter λ.
Two special cases for λ:
λ = 0: any function that interpolates the data
λ = ∞: simple least-squares line fit
Choose λ by cross-validation.
SPLINE REGRESSION: OPEL ASTRA CAR EXAMPLE
Data extracted from a car sales site. For many cars we have the kilometres driven and the car price. For the Opel Astra we have 2,360 cars.
What is the relation between km driven and car sales price?
[Figures: fits with too much smoothing and with too little smoothing]
SPLINE REGRESSION: OPEL ASTRA CAR EXAMPLE
0.2 is the optimal smoothing parameter.
Some other car makes/models with spline estimates of car depreciation versus kilometres driven.
Hmmm… my Renault Clio looks nice, but after 50,000 km only 46% of the original value is left…
SPLINE REGRESSION: MODELING NON-LINEARITIES
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines.
ADAPTIVEREG:
• supports more than one input
• supports linear, logistic, Poisson and GLM regressions
• combines regression splines with model selection methods
• supports partitioning of data into training, validation, and testing roles
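A hedged sketch of such a smoother in SAS (assuming a data set astra with variables price and km; these names are hypothetical, and 0.2 is the smoothing value found in the Opel Astra example above):

proc loess data = astra;
   model price = km / smooth = 0.2;   /* smoothing parameter, tuned by (cross-)validation */
run;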
DECISION TREES
How does it work? A simple example.
Suppose we have the following group of people: 50% response, 50% no response.
We have/know Age and Marital Status.
[Tree diagram: first split on Age ≤ 45 vs. Age > 45, giving response rates of 30%/70% and 60%/40%; one node is split further on marital status (Married vs. Divorced/Unmarried), giving response rates of 20%/80% and 60%/40%]
DECISION TREES: REGRESSION & CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 1.2 X
N 21 B 456 1.5 X
Y 32 A 545 1.3 U
Y 34 C 443 1.1 U
N 23 A 345 1.7 U
N 13 B 567 1.2 X
N 45 A 654 1.9 X
… … … … … …
… … … … … …
Y 46 A 657 2.1 X
A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply steps 1–3 again…
5. Stop somewhere…
• How to split on X1 or X2?
• When to stop?
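As a minimal sketch of fitting such a tree in SAS (assuming the example table above is a data set mydata with a binary target and character inputs X2 and X5; the PROC HPSPLIT options shown are one plausible choice, not the slide's own settings):

proc hpsplit data = mydata maxdepth = 5;
   class target x2 x5;              /* categorical target and inputs */
   model target = x1 x2 x3 x4 x5;
   prune costcomplexity;            /* grow large, then prune (see the next slides) */
run;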
DECISION TREES: REGRESSION & CLASSIFICATION
How to split?
The number of splits is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1? Compare candidate splits with a criterion:
Regression: mean squared error
Classification: misclassification rate, cross-entropy, chi-squared
[Figure: regression tree, mean squared error for candidate splits s1 and t1 on x]
DECISION TREES: REGRESSION & CLASSIFICATION
The same criteria apply to classification.
[Figure: classification tree, misclassification rate for candidate splits s1 and t1]
DECISION TREES (REGRESSION & CLASSIFICATION)
When to stop? Not too early, not too late!
Pruning: grow a large tree, then remove parts of the tree.
DECISION TREES: SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C4.5 / C5.0
CART (classification and regression trees)
The difference is mainly in the splitting options.
DECISION TREES: PROS AND CONS
Pros:
• interactions between variables are picked up
• interpretable rules
• missing values are easy to incorporate
Cons:
• unstable
• "lack of smoothness"
• obvious (non)linear relations are fit poorly
[Figures: example tree with splits on gender, income < 45K and age < 33 with response rates; step-wise tree fit of the Opel Astra prices]
PRINCIPAL COMPONENTS ANALYSIS
A linear transformation of the data to uncorrelated data.
The transformation W is such that:
the largest variance is in the first coordinate,
the second largest variance is in the second coordinate,
etc…
PRINCIPAL COMPONENTS ANALYSIS
[Figure: scatter of points in the (X1, X2) plane with the principal directions indicated]
PRINCIPAL COMPONENTS ANALYSIS
The math behind: P = X W
With two dimensions:

   [ p11  p21 ]   [ x11  x21 ]
   [  .    .  ] = [  .    .  ]  [ w11  w21 ]
   [  .    .  ]   [  .    .  ]  [ w12  w22 ]
   [ p1n  p2n ]   [ x1n  x2n ]

w11 and w12 are the loadings corresponding to the first principal component.
w21 and w22 are the loadings corresponding to the second principal component.
In general it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
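A minimal sketch of PCA in SAS (the data set abt and inputs x1–x100 are hypothetical). PROC PRINCOMP works on the correlation matrix by default, i.e. on scaled inputs, which matters here (see the next slide):

proc princomp data = abt out = scores n = 2;
   var x1-x100;    /* the OUT= data set gets the component scores Prin1, Prin2 */
run;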
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
• dimension reduction
• visualisation
• outlier / anomaly detection
• PCA regression: use the principal components instead of the original inputs
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = X W
Now take only the first L columns of W:
P_L = X W_L
For example, for visualization use only the first 2 or 3 columns, so that P_L has only 2 or 3 columns that can be plotted in scatter or contour plots.
Dimensions: P (10000 × 100) = X (10000 × 100) · W (100 × 100)
            P_L (10000 × 2) = X (10000 × 100) · W_L (100 × 2)
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ
Σ is diagonal with r singular values [r could be a large number]
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].
Take only k << r singular values: A_k = U_k Σ_k V_kᵀ
A data point d can now be represented by a k-dimensional point.
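A hedged SAS/IML sketch of such a truncated SVD (the data set mylib.photo holding the grey values of the image below is an assumption):

proc iml;
   use mylib.photo;
   read all var _num_ into A;     /* A: matrix of grey values */
   close mylib.photo;
   call svd(U, Q, V, A);          /* A = U*diag(Q)*V`; Q holds the singular values */
   k = 15;                        /* keep only the k largest singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation of the image */
quit;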
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 pixels ≈ 8 mln numbers
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 15 largest singular values: ≈ 1% of the data
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 75 largest singular values: ≈ 5% of the data
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection:
I have 500 inputs, but maybe there are only ten clusters of inputs.
Within one cluster the variables are (strongly) correlated.
Then use only one input per cluster for predictive modeling.
X1, X2, X3, …, X500
Cluster {X1, X21, X35, X430, …}   → pick X35
Cluster {X17, X29, X353, X490, …} → pick X29
Cluster {X37, X95, X251, X393, …} → pick X251
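A minimal sketch with PROC VARCLUS (the data set abt and the MAXCLUSTERS= value are assumptions); one then typically keeps the variable with the lowest 1 − R² ratio in each cluster:

proc varclus data = abt maxclusters = 10 short;
   var x1-x500;
run;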
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
[Screenshots: variable clustering output in SAS]
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough: let multiple models vote for a prediction.
Bootstrap aggregation (bagging): fit the same type of model on several random (bootstrap) samples of the data and combine the predictions into a final model.
This only makes sense if the underlying models are different enough and have some predictive power.
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees. Apply the following steps repeatedly (a hedged SAS sketch follows below):
1. Generate a bootstrap sample
2. Randomly choose m of the P inputs (m << P)
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
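A sketch with the high-performance forest procedure (the data set, target and input names are assumptions, and the option values are illustrative only):

proc hpforest data = mydata maxtrees = 100 vars_to_try = 5;
   target default / level = binary;   /* classification: majority vote over the trees */
   input x1-x25 / level = interval;   /* m = 5 of the P = 25 inputs tried per split */
run;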
FOREST VS TREE EXAMPLE ON SIMULATED DATA
A decision tree and a random forest (100 trees) fitted on the simulated data.
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear that the forest produces much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting, M iterations m = 1, 2, …, M.
At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals r_m, using inputs x, to "correct" the previous learner:
F_m = F_(m−1) + γ·h_m
After M iterations, F_M is the final model.
SUPPORT VECTOR MACHINES
Suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³, or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
Separable classification
Non-separable classification
Non-separable classification rewritten using the Lagrange dual problem
Kernels to model non-linear behaviour
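For reference, standard textbook formulations of these problems (a sketch from the literature; the slide's own formula images did not survive extraction):

\min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\ \ y_i(w^\top x_i + b) \ge 1 \qquad \text{(separable)}

\min_{w,b,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i \quad \text{s.t.}\ \ y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \qquad \text{(non-separable)}

In the Lagrange dual, the data enter only through inner products x_i^\top x_j, which is what allows a kernel K(x_i, x_j) to replace them.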
https://www.youtube.com/watch?v=3liCbRZPrZA
Not linearly separable in 2D, but in 3D space they are!
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
Example: among the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
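A minimal sketch of k-NN classification in SAS, using the non-parametric method of PROC DISCRIM (the data set and variable names are assumptions):

proc discrim data = train test = newpoints testout = scored
             method = npar k = 5;   /* 5 nearest neighbours, majority vote */
   class colour;                    /* red / green */
   var x1 x2;                       /* coordinates of the points */
run;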
K-NN METHOD
[Figures: decision boundaries with 1 nearest neighbour and with 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours and compare the test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site.
For 108K Dutch postal codes (out of 463K) there are one or more houses for sale.
How can we estimate the house value for the postal codes without a house price?
For a postal code with no price, estimate the price by taking the k closest houses for sale.
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set.
In Enterprise Miner, different values of k were tried.
k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4
[Diagram: inputs 1, X2, X3 and X4 feed a single compute node f with weights w1, w2, w3, w4]
f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION
In formulas, the prediction of a neural network is given by:
P(Y|X) = g(T_Y)
T_Y = β_0Y + β_Yᵀ Z
Z_m = σ( α_0m + α_mᵀ X )
[Diagram: inputs X1…X4 (age, income, region, gender) → hidden layer Z1, Z2, Z3 → outputs Y and N, with weights α on the input-to-hidden links and β on the hidden-to-output links]
The functions g and σ are defined as:
g(T_Y) = exp(T_Y) / ( exp(T_N) + exp(T_Y) ),   σ(x) = 1 / ( 1 + exp(−x) )
In case of a binary classifier, P(N|X) = 1 − P(Y|X).
The model weights α and β have to be estimated from the data.
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back-propagation algorithm:
Randomly choose small values for all weights wi. Then, for each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example: E = (actual − prediction)²).
3. Adjust the weights according to wi_new = wi + Δwi, with Δwi = −α · ∂E/∂wi.
4. Stop if the error E is small enough.
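A one-step numeric illustration (made-up numbers): suppose the prediction for an observation is 0.8 while the actual value is 1.0, so E = (1.0 − 0.8)² = 0.04. If ∂E/∂wi = −0.12 for some weight and the learning rate is α = 0.1, then Δwi = −0.1 × (−0.12) = 0.012: the weight increases slightly, reducing the error on this observation.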
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[Diagram: inputs X1…X4 → ENCODE → 2-dimensional middle layer → DECODE → outputs X1…X4]
A linear activation function corresponds with 2-dimensional principal components analysis; the 2-dimensional middle layer can be used for visualisation.
NEURAL NETS: AUTOENCODERS
Often there are more hidden layers with many nodes.
[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
NEURAL NET CARS EXAMPLE
[Figures: 2-dimensional PCA versus an autoencoder network with layers 25 – 15 – 2 – 15 – 25]
NEURAL NETS: AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE
proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING, WEIGHTS ARE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from the training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.
TEXT MINING BASICS
"Advanced" word counting:
Parse & filter:
  part-of-speech tagging
  entity detection
  mixed / numeric / abbreviations
  stemming
  spell checks, stop list, synonym list
  multi-term words
Then apply traditional data mining:
  clustering
  prediction / machine learning
TEXT MINING BASICS
Document 1: “Ik loop over straat in Amsterdam, 1057DK, met mijn fiets”
Document 2: “Zij liep niet maar fietste met haar blauwe fieets, //bitly.com/sdrtw”
Document 3: “Mijn tweewieler is kapot, wat een slecht stuk ijzer, @#$%$@!”
Terms Doc 1 Doc 2 Doc 3
+Fiets (znmw) 1 1 1
Fietsen (ww) 0 1 0
Blauwe (bvg) 0 1 0
Amsterdam (locatie) 1 0 0
+Lopen (ww) 1 1 0
Straat (znmw) 1 0 0
Kapot (bijw) 0 0 1
Slecht 0 0 1
Stuk Ijzer 0 0 1
1057DK (postcode) 1 0 0
//bitly.com/sdrtw (Internet) 0 1 0
TERM-DOCUMENT MATRIX A:
• Each text document is a (very) long vector of word counts (often with many zeros!)
• Apply further mining on this matrix A.
TEXT MINING: TERM-DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:
• often more terms than documents
• rows could be strongly correlated
• the matrix is often very sparse
Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands].
Take only the first k << r singular values: A_k = U_k Σ_k V_kᵀ
A document d is then no longer a long vector of m word counts, but a much shorter vector d̃ of, say, length 300.
TEXT MINING APPLICATIONS
Combine structured customer data and unstructured data to better predict behaviour (churn / fraud): apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).
[Figure: documents grouped into Topic 1, Topic 2, Topic 3]
RECOMMENDATION ENGINE
Which products should I recommend to my customers?
RECOMMENDATION ENGINE
USER–ITEM MATRIX: EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly.
The matrix is often very sparse: 1 mln users, 100K items → ~0.01% filled??
User–Item Matrix – Data
          Item 1   Item 2   Item 3   Item 4   Item 5
User 1      3        2        5        4        5
User 2      -        -        -        1        1
User 3      1        -        2        5        -
User 4      -        -        1        2        5
User 5      2        1        4        2        3
User 6      2        3        -        5        1
User 7      5        1        -        3        4
User 8      -        1        -        4        1
User 9      2        3        2        4        2
User 10     -        1        3        -        1
User 4's item ratings:  -  -  1  2  5
After some math… the predicted ratings for User 4 are:  3.21  4.82  1  2  5
Recommend item 2!
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms:
  slope one (slope1)
  k nearest neighbours (knn)
Model-based algorithms:
  matrix factorization (SVD - LBFGS)
Market basket analysis:
  association rules mining (arm)
Mixture of different methods:
  clustering (cluster)
  ensemble
RE METHODS: SLOPE ONE
Item–item based: y = x + b, a regression with slope equal to 1.
r̂_ui = ( Σ_j w_ij · r_uj ) / ( Σ_j w_ij )
Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the rating for item i derived from the user's rating of item j and the average difference between items i and j (see the worked example below).
Sample rating database:
Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4       ??
Lucy         ??        2        5
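Worked out for Lucy's missing rating of item A (this computation comes from the speaker notes at the end of the deck): the average difference between items A and B, over the users who rated both, is [2 + (−1)] / 2 = 0.5, giving the estimate 2 + 0.5 = 2.5 from item B (weight 2); the difference between A and C (only John rated both) is 3, giving 5 + 3 = 8 from item C (weight 1). The weighted prediction is r̂_LucyA = (2 × 2.5 + 1 × 8) / 3 ≈ 4.33.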
RE METHODS: K NEAREST NEIGHBOURS
The rating r_ui is determined by the ratings "in the neighbourhood":
r̂_ui = ( Σ_{j ∈ N(i;u)} sim_ij · r_uj ) / ( Σ_{j ∈ N(i;u)} sim_ij )
How to determine the neighbours N and how many (k) to use?
How to compute the similarity/distance measure sim_ij?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
RE METHODS: PEARSON CORRELATION
a, b: users
r_a,p: rating of user a for item p
P: set of items rated both by a and b
Possible similarity values between −1 and 1:
sim(a, b) = Σ_{p∈P} (r_a,p − r̄_a)(r_b,p − r̄_b) / ( √( Σ_{p∈P} (r_a,p − r̄_a)² ) · √( Σ_{p∈P} (r_b,p − r̄_b)² ) )
RE METHODS: K NEAREST NEIGHBOURS METHOD
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data?
Factorize the m × n user–item matrix R (users × items) into factor matrices U (m × k) and V (k × n), with k hidden factors.
Predict a new rating: R̂_ij = U_iᵀ V_j
Minimize the prediction error over the observed ratings:
min_{U,V} Σ_{i,j} ( R_ij − U_iᵀ V_j )² + λ( ‖U_i‖² + ‖V_j‖² )
Steps: select a loss function (squared error), select the number of hidden factors k, and solve the optimization problem with L-BFGS or ALS.
RE METHODS: CLUSTER
[Diagram: cluster users/items on their profiles and ratings first, then apply k-NN within one subgroup to produce the predictions]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining:
Identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C
IF item X THEN item Y
Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:
Support(X,Y) = (# transactions with {X} and {Y}) / (total # transactions)
Lift = Support(X,Y) / ( Support(X) × Support(Y) )
Example supports: diapers & beer 0.8%; diapers & candles 0.018%.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
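A small worked example with made-up numbers: out of 10,000 transactions, 1,000 contain diapers (support 10%), 800 contain beer (support 8%) and 200 contain both, so Support(diapers, beer) = 2% and Lift = 0.02 / (0.10 × 0.08) = 2.5; diaper buyers are 2.5 times more likely to also buy beer.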
RE METHOD: ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   * Add a recommendation system;
   ADD rs.IENS / item = item user = user rating = rating;
   * Add tables;
   ADDTABLE LHL1209.IENS_UIR / recom = rs.IENS type = rating vars = (item user rating);
   * Method SVD L-BFGS with 20 factors;
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      maxfeval = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;
   METHOD arm /
      label = "ARM";
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
/** prediction with the SVD method **/
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
RUN;
QUIT;
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)
CONS
Unfamiliar to a broader audience, (more) difficult to explain
Black-box approach (you are rejected: the computer says NO)
Often the relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS
Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
Interactions are often "automatically" taken into account
Superior for text mining, image & speech recognition
Better lift possible (a few percent "for free")
It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining
Image recognition
Sound recognition
Strange faces
So can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform them to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS
MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
MNIST TRAINING DATA
42,000 pictures of hand-written digits.
Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data with their KNOWN labels in red.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split
PCA regression on the 50 largest PCs
Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
Seven multi-layer neural nets
Three random forests: 100, 500 and 1000 trees
8, 16 and 24 nearest neighbours
8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA: APPLY THE MODEL ON THE TEST SET
28,000 digits without known labels.
Our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here.
Red numbers are the predicted labels. We see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH AN IPHONE
[Audio: waveforms of a spoken "1" and a spoken "2"]
SPEECH RECOGNITION
WAV files consist of ~30,000 points: too much redundancy.
Use spectral analysis to convert the signal to the frequency domain.
Still too much: apply principal components.
TRAIN DATA
8 spoken 'ones' in WAV files
8 spoken 'twos' in WAV files
SPEECH RECOGNITION
In Enterprise Miner: a neural network with 9 neurons in one hidden layer.
Zero errors on the training data.
Zero errors on the test data (also 8 'ones' and 8 'twos').
STRANGE FACE DETECTION: COMBO OF OPEN API / R & SAS
A little joke on my colleagues…
STRANGE FACE DETECTION: COMBO OF OPEN API / R & SAS
Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).
Create an R script to:
  retrieve the SAS faces from our site
  put them through the Face++ API
  collect the JSON results and store them in an ABT
Apply advanced analytics on the ABT:
  Which faces are look-alikes? proc cluster (hierarchical clustering)
  Sales faces? Predictive modeling / machine learning
  Who is the Brad Pitt? Nearest neighbour
  Strange faces? proc neural / autoencoder
STRANGE FACE DETECTION: LOOK-ALIKE FACES
STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES
STRANGE FACE DETECTION: STRANGE FACES
SAS faces, actors' faces
Read more on my blog
Editor's Notes
Linear and logistic regression have been used for many years; with creative construction of regression variables, quite good models can already be built.
Decision trees (20:25 – 20:35): known since the end of the 1980s through the work of Leo Breiman.
When do we stop?
Too early: we miss certain structure in the data.
Too late: the tree becomes too large and we overfit.
Possible stopping strategies:
Stop splitting when there is no real decrease in MSE or Gini.
This is too short-sighted, because the decrease may still come in a later split.
First grow a large tree; stop splitting only when a minimum number of data points remains.
Then apply pruning to the large tree:
cut pieces off the tree again, but only if that does not lead to too large an increase in MSE or Gini.
CHAID (chi-squared automatic interaction detection)
Categorical or continuous target
Multiple splits
Criteria = Chi-Square
Stops before a tree gets too large
Uses missing values as an additional category
CART (Classification and Regression)
Categorical or continuous target
Binary splits
Criteria = Gini
Large trees then prune
Uses surrogate field for missing values
C4.5 / C5.0
Only categorical target
Multiple splits
Criteria = entropy
Large trees then prune
Imputes missing values
Around the mid-1990s, articles on this appeared (Leo Breiman, "Bagging Predictors").
Bootstrap aggregation (bagging):
Take multiple (independent) random samples from the data, say K samples.
Fit a model on each sample, resulting in models M1, M2, …, MK.
The final prediction is a majority vote or an average of the K models.
Boosting:
Start with a simple model (M1); this model makes good and bad decisions.
In a second iteration: give the misclassified cases more weight and fit a new model (M2).
Continue until you have models M1, M2, …, MK, and take as the final model the (weighted) majority vote of these K models.
Apply the steps below a large number of times:
Draw N cases at random from the data (with replacement: the bootstrap sample).
If there are P inputs, randomly draw m << P inputs for the bootstrap sample.
Fit a tree on the bootstrap sample with the m inputs (without pruning).
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
Introduced around 1995 by Vladimir Vapnik.
The points on the maximum-margin lines are the support vectors; there are only a few of them.
Find a linear decision boundary between two linearly separable groups with maximum margin M: the green line is better than the blue line.
If the groups are not linearly separable, you must allow some points to lie on the wrong side. These points are penalized. SVM still maximizes the margin M, but under the restriction that the total penalty is smaller than a constant.
The world is not linear: we map to a non-linear world by transforming the inputs, for example x², x³, or spline(x).
The beauty of SVM is that these transformations need not be computed explicitly in the calculation of the decision boundary.
Binary classification: 2 output nodes, Y and N.
4 input variables: 4 input nodes, X = (X1, …, X4).
1 hidden layer with 3 hidden nodes: Z = (Z1, Z2, Z3).
With explicit ratings: first compute the difference between items A and B: [2 + (−1)] / 2 = 0.5, so r_LucyA from item B = 2 + 0.5 = 2.5.
Then compute the difference between items A and C: 3, so r_LucyA from item C = 5 + 3 = 8.
Weighted sum: r_LucyA = (2.5 × 2 + 8) / 3 = 4.33.