Machine learning overview (with SAS software)
MACHINE LEARNING WITH SAS WORKSHOP
GETTING THE MOST OUT OF YOUR DATA
Longhow Lam
AGENDA AND SOME READING MATERIAL
Intro & positioning of Machine learning
SAS platform for Machine learning
Overview of Specific methods
Some examples
Further reading
An experimental comparison of classification techniques for imbalanced
credit scoring data sets using SAS® Enterprise Miner
http://support.sas.com/resources/papers/proceedings12/129-2012.pdf
Benchmarking state-of-the-art classification algorithms for credit scoring: A ten-year update
http://www.business-school.ed.ac.uk/waf/crc_archive/2013/42.pdf
Highly recommended for more detail:
The Elements of Statistical Learning, Hastie, Tibshirani & Friedman
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
LONGHOW LAM SHORT BIO
MSc Mathematics (1995), Vrije Universiteit Amsterdam (drs. in mathematics)
MTD Applied Statistics (1997), Delft University of Technology (two-year postgraduate programme in applied statistics)
10+ years SAS experience (Base / Stat / Guide / Miner / VA / VS)
10+ years R experience (An introduction to R)
10+ years predictive modeling experience
ABN AMRO – risk modeler
  Basel, credit risk, ALM models
Business & Decision – quantitative consultant
  ING Belgium, Fortis
  Leaseplan, Belgium Post
Experian – data miner
  Collection Score, Delphi credit score, consulting
Follow me: @longhowlam
INTRO MACHINE LEARNING
Wikipedia:
“Machine learning is a scientific discipline that deals with the construction
and study of algorithms that can learn from data. Such algorithms operate by
building a model based on inputs and using that to make predictions or
decisions, rather than following only explicitly programmed instructions.”
MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR
[Diagram of overlapping terms: statistical modeling, supervised learning, unsupervised learning, clustering, data mining, machine learning, dimension reduction, association rules, recommenders, autoencoders, self-organizing maps]
SAS SOFTWARE
FOR MACHINE LEARNING (AND DATA MINING)
THE ANALYTICS LIFECYCLE
[Diagram: IDENTIFY / FORMULATE PROBLEM → DATA PREPARATION → DATA EXPLORATION → TRANSFORM & SELECT → BUILD MODEL → VALIDATE MODEL → DEPLOY MODEL → EVALUATE / MONITOR RESULTS]
BUSINESS MANAGER: SAS Decision Manager
IT SYSTEMS / MANAGEMENT: SAS Model Manager, SAS In-Database Scoring
BUSINESS ANALYST: SAS Enterprise Guide, SAS Visual Analytics, SAS Visual Statistics
DATA MINER / DATA SCIENTIST: SAS Enterprise Miner / Text Miner, SAS IMSTAT / Recommender
EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES
PROC hpbnet data = creditdata structure = markovblanket;
   model default = x1 LTV income age;
   selection = Y;
RUN;
MACHINE LEARNING: HIGH PERFORMANCE
Machine learning algorithms designed to run on single-blade or multi-blade distributed memory environments.
MACHINE LEARNING WITH SAS: EASILY DEPLOYABLE
Manage rules + data + models.
Deployment flexibility:
  batch
  real time
  stored process
  in-database
Drive reuse and consistency.
PREDICT SOMEONE'S INCOME: IS THIS MACHINE LEARNING?
Predict someone's income from his/her age:
Collect some data into an analytical base table.
Plot the data.
Fit a line: Income = 15.2 + 1.102 × Age
[Figure: Age–Income scatter with the fitted line]
MACHINE LEARNING: ADDRESSING SOME MODELING ISSUES
The problem may not be linear: X², X³, log(X), √X, 1/X, …?
You do not have one input variable: X1, X2, X3, …, X567
Interactions and correlations between input variables
[Figure: analytical base table with inputs such as age, income and gender, plus derived inputs]
MACHINE LEARNING: WHY IT CAN MATTER € € €
Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects:
Conversion rate is around 1%. Profit per conversion is €80; cost per mailing is €0.70.
Total profit = 100,000 × 1% × €80 − 100,000 × €0.70 = €10,000
Now suppose we have a targeted mailing with a machine learning predictive model that uses prospect input data to distinguish between high and low responders.
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile   N        Conversion   Profit    Cumulative
1        10,000   2.00%         9,000      9,000
2        10,000   1.50%         5,000     14,000
3        10,000   1.00%         1,000     15,000
4        10,000   1.00%         1,000     16,000
5        10,000   1.00%         1,000     17,000
6        10,000   1.00%         1,000     18,000
7        10,000   1.00%         1,000     19,000
8        10,000   0.80%          −600     18,400
9        10,000   0.50%        −3,000     15,400
10       10,000   0.20%        −5,400     10,000
By using the model to send letters only to the first 7 deciles, the profit is now €19,000 (instead of €10,000).
With 100 such campaigns a year, that means an increase of €0.9 mln!
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile   N        Conversion   Profit    Cumulative
1        10,000   3.00%        17,000     17,000
2        10,000   2.00%         9,000     26,000
3        10,000   1.40%         4,200     30,200
4        10,000   1.15%         2,200     32,400
5        10,000   1.00%         1,000     33,400
6        10,000   0.60%        −2,200     31,200
7        10,000   0.40%        −3,800     27,400
8        10,000   0.30%        −4,600     22,800
9        10,000   0.10%        −6,200     16,600
10       10,000   0.05%        −6,600     10,000
By using a much better model to send letters only to the first 5 deciles, the profit is now €33,400 (instead of €10,000).
With 100 such campaigns a year, that means an increase of €2.34 mln!
MACHINE LEARNING: WHY IT CAN MATTER € € €
Decile   N        Conversion   Profit    Cumulative
1        10,000   3.35%        19,800     19,800
2        10,000   2.23%        10,840     30,640
3        10,000   1.30%         3,400     34,040
4        10,000   1.10%         1,800     35,840
5        10,000   1.00%         1,000     36,840
6        10,000   0.55%        −2,600     34,240
7        10,000   0.28%        −4,760     29,480
8        10,000   0.25%        −5,000     24,480
9        10,000   0.05%        −6,600     17,880
10       10,000   0.02%        −6,840     11,040
Now let's suppose we have an even slightly better model than the last one: the profit is now €36,840.
With 100 such campaigns a year, that means an increase of €2.68 mln!
OVERVIEW OF SPECIFIC
MACHINE LEARNING METHODS
Classical regression
Decision trees
Dimension reduction
Bagging & Boosting
Support vector machines
K-Nearest Neighbour
Neural networks / deep learning
Bayesian networks
Text mining
Recommendation engine
“CLASSICAL” REGRESSION
LINEAR & LOGISTIC REGRESSION
Numeric target variable: Income = a + b × Age
Binary target variable: P(Churn) = 1 / ( 1 + exp(a + b × Age) )
[Figures: fitted line through an Age–Income scatter; logistic curve for P(Churn) between 0 and 1 against Age]
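As a minimal sketch of how these two models could be fit in SAS (assuming a data set customers with variables income, age and a 0/1 churn flag; these names are hypothetical):

proc reg data = customers;
   model income = age;               /* numeric target: linear regression */
run;

proc logistic data = customers;
   model churn(event = '1') = age;   /* binary target: logistic regression */
run;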
SPLINE REGRESSION: MODELING NON-LINEARITIES
Often there is a non-linear relation between Y (or logit(Y)) and X. Common remedies:
• transformation of inputs: X², X³, log(X), etc.
• buckets / binning of variables
• smoothing splines
SPLINE REGRESSION: MODELING NON-LINEARITIES
Smoothing splines: piecewise polynomials that are glued together at knots, with a smoothing parameter λ.
Two special cases for λ:
λ = 0: any function that interpolates the data
λ = ∞: simple least-squares line fit
Choose λ by cross-validation.
SPLINE REGRESSION: OPEL ASTRA CAR EXAMPLE
Data extracted from a car sales site. For many cars we have the kilometres driven and the car price. For the Opel Astra we have 2,360 cars.
What is the relation between km driven and car sales price?
[Figures: fits with too much smoothing and with too little smoothing]
SPLINE REGRESSION: OPEL ASTRA CAR EXAMPLE
0.2 is the optimal smoothing parameter.
Some other car makes/models with spline estimates of car depreciation versus kilometres driven.
Hmmm… my Renault Clio looks nice, but after 50,000 km only 46% of the original value is left…
SPLINE REGRESSION: MODELING NON-LINEARITIES
In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines.
ADAPTIVEREG:
• supports more than one input
• supports linear, logistic, Poisson and GLM regressions
• combines regression splines with model selection methods
• supports partitioning of data into training, validation, and testing roles
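A hedged sketch of such a smoother in SAS (assuming a data set astra with variables price and km; these names are hypothetical, and 0.2 is the smoothing value found in the Opel Astra example above):

proc loess data = astra;
   model price = km / smooth = 0.2;   /* smoothing parameter, tuned by (cross-)validation */
run;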
DECISION TREES
How does it work? A simple example.
Suppose we have the following group of people: 50% response, 50% no response.
We have/know Age and Marital Status.
[Tree diagram: first split on Age ≤ 45 vs. Age > 45, giving response rates of 30%/70% and 60%/40%; one node is split further on marital status (Married vs. Divorced/Unmarried), giving response rates of 20%/80% and 60%/40%]
DECISION TREES: REGRESSION & CLASSIFICATION
Target X1 X2 X3 X4 X5
Y 12 A 456 1.2 X
N 21 B 456 1.5 X
Y 32 A 545 1.3 U
Y 34 C 443 1.1 U
N 23 A 345 1.7 U
N 13 B 567 1.2 X
N 45 A 654 1.9 X
… … … … … …
… … … … … …
Y 46 A 657 2.1 X
A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply steps 1–3 again…
5. Stop somewhere…
• How to split on X1 or X2?
• When to stop?
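As a minimal sketch of fitting such a tree in SAS (assuming the example table above is a data set mydata with a binary target and character inputs X2 and X5; the PROC HPSPLIT options shown are one plausible choice, not the slide's own settings):

proc hpsplit data = mydata maxdepth = 5;
   class target x2 x5;              /* categorical target and inputs */
   model target = x1 x2 x3 x4 x5;
   prune costcomplexity;            /* grow large, then prune (see the next slides) */
run;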
DECISION TREES: REGRESSION & CLASSIFICATION
How to split?
The number of splits is usually 2 or 3; more splits will exhaust the data too fast.
Why is the split X1 < t1 better than X1 < s1? Compare candidate splits with a criterion:
Regression: mean squared error
Classification: misclassification rate, cross-entropy, chi-squared
[Figure: regression tree, mean squared error for candidate splits s1 and t1 on x]
DECISION TREES: REGRESSION & CLASSIFICATION
The same criteria apply to classification.
[Figure: classification tree, misclassification rate for candidate splits s1 and t1]
DECISION TREES (REGRESSION & CLASSIFICATION)
When to stop? Not too early, not too late!
Pruning: grow a large tree, then remove parts of the tree.
DECISION TREES: SOME COMMON TYPES
CHAID (chi-squared automatic interaction detection)
C4.5 / C5.0
CART (classification and regression trees)
The difference is mainly in the splitting options.
DECISION TREES: PROS AND CONS
Pros:
• interactions between variables are picked up
• interpretable rules
• missing values are easy to incorporate
Cons:
• unstable
• "lack of smoothness"
• obvious (non)linear relations are fit poorly
[Figures: example tree with splits on gender, income < 45K and age < 33 with response rates; step-wise tree fit of the Opel Astra prices]
PRINCIPAL COMPONENTS ANALYSIS
A linear transformation of the data to uncorrelated data.
The transformation W is such that:
the largest variance is in the first coordinate,
the second largest variance is in the second coordinate,
etc…
PRINCIPAL COMPONENTS ANALYSIS
[Figure: scatter of points in the (X1, X2) plane with the principal directions indicated]
PRINCIPAL COMPONENTS ANALYSIS
The math behind: P = X W
With two dimensions:

   [ p11  p21 ]   [ x11  x21 ]
   [  .    .  ] = [  .    .  ]  [ w11  w21 ]
   [  .    .  ]   [  .    .  ]  [ w12  w22 ]
   [ p1n  p2n ]   [ x1n  x2n ]

w11 and w12 are the loadings corresponding to the first principal component.
w21 and w22 are the loadings corresponding to the second principal component.
In general it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
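A minimal sketch of PCA in SAS (the data set abt and inputs x1–x100 are hypothetical). PROC PRINCOMP works on the correlation matrix by default, i.e. on scaled inputs, which matters here (see the next slide):

proc princomp data = abt out = scores n = 2;
   var x1-x100;    /* the OUT= data set gets the component scores Prin1, Prin2 */
run;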
PRINCIPAL COMPONENTS ANALYSIS
Scaling the inputs is important here.
Applications of PCA:
• dimension reduction
• visualisation
• outlier / anomaly detection
• PCA regression: use the principal components instead of the original inputs
PRINCIPAL COMPONENTS: DIMENSION REDUCTION
P = X W
Now take only the first L columns of W:
P_L = X W_L
For example, for visualization use only the first 2 or 3 columns, so that P_L has only 2 or 3 columns that can be plotted in scatter or contour plots.
Dimensions: P (10000 × 100) = X (10000 × 100) · W (100 × 100)
            P_L (10000 × 2) = X (10000 × 100) · W_L (100 × 2)
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ
Σ is diagonal with r singular values [r could be a large number]
SINGULAR VALUE DECOMPOSITION
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].
Take only k << r singular values: A_k = U_k Σ_k V_kᵀ
A data point d can now be represented by a k-dimensional point.
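A hedged SAS/IML sketch of such a truncated SVD (the data set mylib.photo holding the grey values of the image below is an assumption):

proc iml;
   use mylib.photo;
   read all var _num_ into A;     /* A: matrix of grey values */
   close mylib.photo;
   call svd(U, Q, V, A);          /* A = U*diag(Q)*V`; Q holds the singular values */
   k = 15;                        /* keep only the k largest singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation of the image */
quit;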
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
Original: 2448 × 3264 pixels ≈ 8 mln numbers
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 15 largest singular values: ≈ 1% of the data
SVD EXAMPLE USING MY SON AS AN EXPERIMENT
SVD with the 75 largest singular values: ≈ 5% of the data
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
Variable selection:
I have 500 inputs, but maybe there are only ten clusters of inputs.
Within one cluster the variables are (strongly) correlated.
Then use only one input per cluster for predictive modeling.
X1, X2, X3, …, X500
Cluster {X1, X21, X35, X430, …}   → pick X35
Cluster {X17, X29, X353, X490, …} → pick X29
Cluster {X37, X95, X251, X393, …} → pick X251
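A minimal sketch with PROC VARCLUS (the data set abt and the MAXCLUSTERS= value are assumptions); one then typically keeps the variable with the lowest 1 − R² ratio in each cluster:

proc varclus data = abt maxclusters = 10 short;
   var x1-x500;
run;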
VARIABLE CLUSTERING TO REDUCE THE DIMENSION
[Screenshots: variable clustering output in SAS]
COMBINE MODELS: BAGGING & BOOSTING
If one model is not good enough: let multiple models vote for a prediction.
Bootstrap aggregation (bagging): fit the same type of model on several random (bootstrap) samples of the data and combine the predictions into a final model.
This only makes sense if the underlying models are different enough and have some predictive power.
BAGGING & BOOSTING: RANDOM FORESTS
Random forests ≈ bagging with trees. Apply the following steps repeatedly (a hedged SAS sketch follows below):
1. Generate a bootstrap sample
2. Randomly choose m of the P inputs (m << P)
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
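A sketch with the high-performance forest procedure (the data set, target and input names are assumptions, and the option values are illustrative only):

proc hpforest data = mydata maxtrees = 100 vars_to_try = 5;
   target default / level = binary;   /* classification: majority vote over the trees */
   input x1-x25 / level = interval;   /* m = 5 of the P = 25 inputs tried per split */
run;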
FOREST VS TREE EXAMPLE ON SIMULATED DATA
A decision tree and a random forest (100 trees) fitted on the simulated data.
FOREST VS TREE EXAMPLE ON SIMULATED DATA
It is clear that the forest produces much smoother predictions.
GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU
GRADIENT BOOSTING: SCHEMATIC OVERVIEW
Gradient boosting, M iterations m = 1, 2, …, M.
At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals r_m, using inputs x, to "correct" the previous learner:
F_m = F_(m−1) + γ·h_m
After M iterations, F_M is the final model.
SUPPORT VECTOR MACHINES
Suppose we have a separable classification problem.
Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.
If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.
The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³, or spline(x).
The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS
Separable classification
Non-separable classification
Non-separable classification rewritten using the Lagrange dual problem
Kernels to model non-linear behaviour
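For reference, standard textbook formulations of these problems (a sketch from the literature; the slide's own formula images did not survive extraction):

\min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\ \ y_i(w^\top x_i + b) \ge 1 \qquad \text{(separable)}

\min_{w,b,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i \quad \text{s.t.}\ \ y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \qquad \text{(non-separable)}

In the Lagrange dual, the data enter only through inner products x_i^\top x_j, which is what allows a kernel K(x_i, x_j) to replace them.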
https://www.youtube.com/watch?v=3liCbRZPrZA
Not linearly separable in 2D, but in 3D space they are!
K-NN METHOD
• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.
Example: among the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
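A minimal sketch of k-NN classification in SAS, using the non-parametric method of PROC DISCRIM (the data set and variable names are assumptions):

proc discrim data = train test = newpoints testout = scored
             method = npar k = 5;   /* 5 nearest neighbours, majority vote */
   class colour;                    /* red / green */
   var x1 x2;                       /* coordinates of the points */
run;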
K-NN METHOD
[Figures: decision boundaries with 1 nearest neighbour and with 15 nearest neighbours]
K-NN METHOD
Use different numbers k of nearest neighbours and compare the test and training errors.
Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
K-NN EXAMPLE: DUTCH HOUSE PRICES
Extract house-for-sale prices from a Dutch housing site.
For 108K Dutch postal codes (out of 463K) there are one or more houses for sale.
How can we estimate the house value for the postal codes without a house price?
For a postal code with no price, estimate the price by taking the k closest houses for sale.
Comparing different nearest neighbours in SAS Enterprise Miner
K-NN EXAMPLE: DUTCH HOUSE PRICES
30% of the data was used as validation set.
In Enterprise Miner, different values of k were tried.
k = 5 nearest neighbours has the lowest average squared error.
NEURAL NETWORKS / DEEP LEARNING
NEURAL NETWORK: LINEAR REGRESSION
Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4
[Diagram: inputs 1, X2, X3 and X4 feed a single compute node f with weights w1, w2, w3, w4]
f is the so-called activation function. This could be the logit function, but other choices are possible.
There are four weights w that have to be determined.
NEURAL NETWORKS: MATHEMATICAL FORMULATION
In formulas, the prediction of a neural network is given by:
P(Y|X) = g(T_Y)
T_Y = β_0Y + β_Yᵀ Z
Z_m = σ( α_0m + α_mᵀ X )
[Diagram: inputs X1…X4 (age, income, region, gender) → hidden layer Z1, Z2, Z3 → outputs Y and N, with weights α on the input-to-hidden links and β on the hidden-to-output links]
The functions g and σ are defined as:
g(T_Y) = exp(T_Y) / ( exp(T_N) + exp(T_Y) ),   σ(x) = 1 / ( 1 + exp(−x) )
In case of a binary classifier, P(N|X) = 1 − P(Y|X).
The model weights α and β have to be estimated from the data.
NEURAL NETWORKS: ESTIMATING THE WEIGHTS
Back-propagation algorithm:
Randomly choose small values for all weights wi. Then, for each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example: E = (actual − prediction)²).
3. Adjust the weights according to wi_new = wi + Δwi, with Δwi = −α · ∂E/∂wi.
4. Stop if the error E is small enough.
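A one-step numeric illustration (made-up numbers): suppose the prediction for an observation is 0.8 while the actual value is 1.0, so E = (1.0 − 0.8)² = 0.04. If ∂E/∂wi = −0.12 for some weight and the learning rate is α = 0.1, then Δwi = −0.1 × (−0.12) = 0.012: the weight increases slightly, reducing the error on this observation.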
DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS
NEURAL NETS: AUTOENCODERS
http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf
Neural networks that use the inputs to predict the inputs.
[Diagram: inputs X1…X4 → ENCODE → 2-dimensional middle layer → DECODE → outputs X1…X4]
A linear activation function corresponds with 2-dimensional principal components analysis; the 2-dimensional middle layer can be used for visualisation.
NEURAL NETS: AUTOENCODERS
Often there are more hidden layers with many nodes.
[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]
NEURAL NET CARS EXAMPLE
[Figures: 2-dimensional PCA versus an autoencoder network with layers 25 – 15 – 2 – 15 – 25]
NEURAL NETS: AUTOENCODER EXAMPLE
• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder
NEURAL NETS: AUTOENCODER EXAMPLE
proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING, WEIGHTS ARE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;
Two-dimensional representation of the 400-dimensional 'digit' data
BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS
• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from the training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.
TEXT MINING BASICS
"Advanced" word counting:
Parse & filter:
  part-of-speech tagging
  entity detection
  mixed / numeric / abbreviations
  stemming
  spell checks, stop list, synonym list
  multi-term words
Then apply traditional data mining:
  clustering
  prediction / machine learning
TEXT MINING BASICS
Document 1: “Ik loop over straat in Amsterdam, 1057DK, met mijn fiets”
Document 2: “Zij liep niet maar fietste met haar blauwe fieets, //bitly.com/sdrtw”
Document 3: “Mijn tweewieler is kapot, wat een slecht stuk ijzer, @#$%$@!”
Terms Doc 1 Doc 2 Doc 3
+Fiets (znmw) 1 1 1
Fietsen (ww) 0 1 0
Blauwe (bvg) 0 1 0
Amsterdam (locatie) 1 0 0
+Lopen (ww) 1 1 0
Straat (znmw) 1 0 0
Kapot (bijw) 0 0 1
Slecht 0 0 1
Stuk Ijzer 0 0 1
1057DK (postcode) 1 0 0
//bitly.com/sdrtw (Internet) 0 1 0
TERM-DOCUMENT MATRIX A:
• Each text document is a (very) long vector of word counts (often with many zeros!)
• Apply further mining on this matrix A.
TEXT MINING: TERM-DOCUMENT MATRIX A
It is not useful to apply data mining techniques directly on the term-document matrix:
• often more terms than documents
• rows could be strongly correlated
• the matrix is often very sparse
Apply a singular value decomposition first.
TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A
Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands].
Take only the first k << r singular values: A_k = U_k Σ_k V_kᵀ
A document d is then no longer a long vector of m word counts, but a much shorter vector d̃ of, say, length 300.
TEXT MINING APPLICATIONS
Combine structured customer data and unstructured data to better predict behaviour (churn / fraud): apply machine learning to create a model f to predict the target.
Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).
[Figure: documents grouped into Topic 1, Topic 2, Topic 3]
RECOMMENDATION ENGINE
Which products should I recommend to my customers?
RECOMMENDATION ENGINE
USER–ITEM MATRIX: EXPLICIT RECOMMENDATIONS
Users rated items (products) explicitly.
The matrix is often very sparse: 1 mln users, 100K items → ~0.01% filled??
User–Item Matrix – Data
          Item 1   Item 2   Item 3   Item 4   Item 5
User 1      3        2        5        4        5
User 2      -        -        -        1        1
User 3      1        -        2        5        -
User 4      -        -        1        2        5
User 5      2        1        4        2        3
User 6      2        3        -        5        1
User 7      5        1        -        3        4
User 8      -        1        -        4        1
User 9      2        3        2        4        2
User 10     -        1        3        -        1
User 4's item ratings:  -  -  1  2  5
After some math… the predicted ratings for User 4 are:  3.21  4.82  1  2  5
Recommend item 2!
RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND
Memory-based algorithms:
  slope one (slope1)
  k nearest neighbours (knn)
Model-based algorithms:
  matrix factorization (SVD - LBFGS)
Market basket analysis:
  association rules mining (arm)
Mixture of different methods:
  clustering (cluster)
  ensemble
RE METHODS: SLOPE ONE
Item–item based: y = x + b, a regression with slope equal to 1.
r̂_ui = ( Σ_j w_ij · r_uj ) / ( Σ_j w_ij )
Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the rating for item i derived from the user's rating of item j and the average difference between items i and j (see the worked example below).
Sample rating database:
Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4       ??
Lucy         ??        2        5
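Worked out for Lucy's missing rating of item A (this computation comes from the speaker notes at the end of the deck): the average difference between items A and B, over the users who rated both, is [2 + (−1)] / 2 = 0.5, giving the estimate 2 + 0.5 = 2.5 from item B (weight 2); the difference between A and C (only John rated both) is 3, giving 5 + 3 = 8 from item C (weight 1). The weighted prediction is r̂_LucyA = (2 × 2.5 + 1 × 8) / 3 ≈ 4.33.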
RE METHODS: K NEAREST NEIGHBOURS
The rating r_ui is determined by the ratings "in the neighbourhood":
r̂_ui = ( Σ_{j ∈ N(i;u)} sim_ij · r_uj ) / ( Σ_{j ∈ N(i;u)} sim_ij )
How to determine the neighbours N and how many (k) to use?
How to compute the similarity/distance measure sim_ij?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments
RE METHODS: PEARSON CORRELATION
a, b: users
r_a,p: rating of user a for item p
P: set of items rated both by a and b
Possible similarity values between −1 and 1:
sim(a, b) = Σ_{p∈P} (r_a,p − r̄_a)(r_b,p − r̄_b) / ( √( Σ_{p∈P} (r_a,p − r̄_a)² ) · √( Σ_{p∈P} (r_b,p − r̄_b)² ) )
RE METHODS: K NEAREST NEIGHBOURS METHOD
RE METHODS: MATRIX FACTORIZATION
How do we fill in the missing data?
Factorize the m × n user–item matrix R (users × items) into factor matrices U (m × k) and V (k × n), with k hidden factors.
Predict a new rating: R̂_ij = U_iᵀ V_j
Minimize the prediction error over the observed ratings:
min_{U,V} Σ_{i,j} ( R_ij − U_iᵀ V_j )² + λ( ‖U_i‖² + ‖V_j‖² )
Steps: select a loss function (squared error), select the number of hidden factors k, and solve the optimization problem with L-BFGS or ALS.
RE METHODS: CLUSTER
[Diagram: cluster users/items on their profiles and ratings first, then apply k-NN within one subgroup to produce the predictions]
RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)
Basic steps for association rules mining:
Identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C
IF item X THEN item Y
Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:
Support(X,Y) = (# transactions with {X} and {Y}) / (total # transactions)
Lift = Support(X,Y) / ( Support(X) × Support(Y) )
Example supports: diapers & beer 0.8%; diapers & candles 0.018%.
For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
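A small worked example with made-up numbers: out of 10,000 transactions, 1,000 contain diapers (support 10%), 800 contain beer (support 8%) and 200 contain both, so Support(diapers, beer) = 2% and Lift = 0.02 / (0.10 × 0.08) = 2.5; diaper buyers are 2.5 times more likely to also buy beer.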
RE METHOD: ENSEMBLE
A linear combination of the previous methods, to achieve better performance.
PROC RECOMMEND recom = rs.IENS;
   * Add a recommendation system;
   ADD rs.IENS / item = item user = user rating = rating;
   * Add tables;
   ADDTABLE LHL1209.IENS_UIR / recom = rs.IENS type = rating vars = (item user rating);
   * Method SVD L-BFGS with 20 factors;
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      maxfeval = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;
   METHOD arm /
      label = "ARM";
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;
/** prediction with the SVD method **/
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
RUN;
QUIT;
PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)
CONS
Unfamiliar to a broader audience, (more) difficult to explain
Black-box approach (you are rejected: the computer says NO)
Often the relations can already be modeled with classical regression models
It allows you to not think about the business problem
PROS
Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
Interactions are often "automatically" taken into account
Superior for text mining, image & speech recognition
Better lift possible (a few percent "for free")
It allows you to not think about the business problem
WHY SAS FOR MACHINE LEARNING
• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable
SOME MACHINE LEARNING EXAMPLES
Text mining
Image recognition
Sound recognition
Strange faces
So can a machine read, see and hear?
PREDICTING SENTIMENT FROM RESTAURANT REVIEWS
IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES
Used the text miner to parse and filter the reviews, and to transform them to data points in SVD space.
USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS
Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6
IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'
MNIST DATA IN SAS
MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
MNIST TRAINING DATA
42,000 pictures of hand-written digits.
Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.
First 100 digits of the MNIST data with their KNOWN labels in red.
MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES
70/30 training/validation split
PCA regression on the 50 largest PCs
Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
Seven multi-layer neural nets
Three random forests: 100, 500 and 1000 trees
8, 16 and 24 nearest neighbours
8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
MNIST DATA: APPLY THE MODEL ON THE TEST SET
28,000 digits without known labels.
Our best model predicted the label for these digits.
The first 100 predicted digits, together with the handwritten digits, are displayed here.
Red numbers are the predicted labels. We see some obvious mistakes…
SPEECH RECOGNITION: DIGITS RECORDED WITH AN IPHONE
[Audio: waveforms of a spoken "1" and a spoken "2"]
SPEECH RECOGNITION
WAV files consist of ~30,000 points: too much redundancy.
Use spectral analysis to convert the signal to the frequency domain.
Still too much: apply principal components.
TRAIN DATA
8 spoken 'ones' in WAV files
8 spoken 'twos' in WAV files
SPEECH RECOGNITION
In Enterprise Miner: a neural network with 9 neurons in one hidden layer.
Zero errors on the training data.
Zero errors on the test data (also 8 'ones' and 8 'twos').
STRANGE FACE DETECTION: COMBO OF OPEN API / R & SAS
A little joke on my colleagues…
STRANGE FACE DETECTION: COMBO OF OPEN API / R & SAS
Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).
Create an R script to:
  retrieve the SAS faces from our site
  put them through the Face++ API
  collect the JSON results and store them in an ABT
Apply advanced analytics on the ABT:
  Which faces are look-alikes? proc cluster (hierarchical clustering)
  Sales faces? Predictive modeling / machine learning
  Who is the Brad Pitt? Nearest neighbour
  Strange faces? proc neural / autoencoder
STRANGE FACE DETECTION: LOOK-ALIKE FACES
STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES
STRANGE FACE DETECTION: STRANGE FACES
SAS faces, actors' faces
Read more on my blog
Editor's Notes
Linear and logistic regression have been used for many years; with creative construction of regression variables, quite good models can already be built.
Decision trees (20:25 – 20:35): known since the end of the 1980s through the work of Leo Breiman.
When do we stop?
Too early: we miss certain structure in the data.
Too late: the tree becomes too large and we overfit.
Possible stopping strategies:
Stop splitting when there is no real decrease in MSE or Gini.
This is too short-sighted, because the decrease may still come in a later split.
First grow a large tree; stop splitting only when a minimum number of data points remains.
Then apply pruning to the large tree:
cut pieces off the tree again, but only if that does not lead to too large an increase in MSE or Gini.
CHAID (chi-squared automatic interaction detection)
Categorical or continuous target
Multiple splits
Criteria = Chi-Square
Stops before a tree gets too large
Uses missing values as an additional category
CART (Classification and Regression)
Categorical or continuous target
Binary splits
Criteria = Gini
Large trees then prune
Uses surrogate field for missing values
C4.5 / C5.0
Only categorical target
Multiple splits
Criteria = entropy
Large trees then prune
Imputes missing values
Around the mid-1990s, articles on this appeared (Leo Breiman, "Bagging Predictors").
Bootstrap aggregation (bagging):
Take multiple (independent) random samples from the data, say K samples.
Fit a model on each sample, resulting in models M1, M2, …, MK.
The final prediction is a majority vote or an average of the K models.
Boosting:
Start with a simple model (M1); this model makes good and bad decisions.
In a second iteration: give the misclassified cases more weight and fit a new model (M2).
Continue until you have models M1, M2, …, MK, and take as the final model the (weighted) majority vote of these K models.
Apply the steps below a large number of times:
Draw N cases at random from the data (with replacement: the bootstrap sample).
If there are P inputs, randomly draw m << P inputs for the bootstrap sample.
Fit a tree on the bootstrap sample with the m inputs (without pruning).
In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
Introduced around 1995 by Vladimir Vapnik.
The points on the maximum-margin lines are the support vectors; there are only a few of them.
Find a linear decision boundary between two linearly separable groups with maximum margin M: the green line is better than the blue line.
If the groups are not linearly separable, you must allow some points to lie on the wrong side. These points are penalized. SVM still maximizes the margin M, but under the restriction that the total penalty is smaller than a constant.
The world is not linear: we map to a non-linear world by transforming the inputs, for example x², x³, or spline(x).
The beauty of SVM is that these transformations need not be computed explicitly in the calculation of the decision boundary.
Binary classification: 2 output nodes, Y and N.
4 input variables: 4 input nodes, X = (X1, …, X4).
1 hidden layer with 3 hidden nodes: Z = (Z1, Z2, Z3).
With explicit ratings: first compute the difference between items A and B: [2 + (−1)] / 2 = 0.5, so r_LucyA from item B = 2 + 0.5 = 2.5.
Then compute the difference between items A and C: 3, so r_LucyA from item C = 5 + 3 = 8.
Weighted sum: r_LucyA = (2.5 × 2 + 8) / 3 = 4.33.