Machine Learning
with R
Barbara Fusinska
@BasiaFusinska
About me
Data Science Freelancer
Machine Learning
Programmer
@BasiaFusinska
BarbaraFusinska.com
Barbara@Fusinska.com
https://github.com/BasiaFusinska/RMachineLearning
Agenda
• Machine Learning
• R platform
• Machine Learning with R
• Classification problem
• Linear Regression
• Clustering
Machine Learning?
Movies Genres
Title            # Kisses  # Kicks  Genre
Taken            3         47       Action
Love story       24        2        Romance
P.S. I love you  17        3        Romance
Rush hours       5         51       Action
Bad boys         7         42       Action
Question: What is the genre of Gone with the wind?
Data-based classification
Id  Feature 1  Feature 2  Class
1.  3          47         A
2.  24         2          B
3.  17         3          B
4.  5          51         A
5.  7          42         A
Question: What is the class of the entry with the following features: F1: 31, F2: 4?
Data Visualization
[Figure: scatter plot of the two features, with a line separating classes A and B]
Rule 1: If on the left side of the line then Class = A
Rule 2: If on the right side of the line then Class = B
Chick sexing
Aircraft recognition
Supervised learning
• Classification, regression
• Label, target value
• Training & Validation phases
Unsupervised learning
• Clustering, feature selection
• Finding structure of data
• Statistical values describing the data
Publishing the model
Training data → Model Training → Machine Learning Model → Publish model
Test stream → Published Machine Learning Model → Prediction → Scores
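The publish-then-predict flow could be sketched in R roughly as follows. This is only an illustration, not code from the talk: it assumes the built-in mtcars data, a plain lm model, and a temporary .rds file standing in for wherever a model would really be published.

```r
# Train a model on training data (built-in mtcars used purely for illustration)
training_data <- mtcars[1:24, ]
model <- lm(mpg ~ wt + hp, data = training_data)

# "Publish" the trained model by serialising it to disk
model_path <- tempfile(fileext = ".rds")   # hypothetical location
saveRDS(model, model_path)

# Later (possibly in another process): load the published model and score new rows
published_model <- readRDS(model_path)
test_stream <- mtcars[25:32, ]
scores <- predict(published_model, newdata = test_stream)
print(scores)
```

The loaded object behaves like the original model, so the scoring side only needs the file and the same input columns.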
R language
Why R?
• Ross Ihaka & Robert Gentleman
• Successor of S
• Open source
• Community driven
• #1 for statistical computing
• Exploratory Data Analysis
• Machine Learning
• Visualisation
Setup
• Install R:
https://www.r-project.org/
• Install RStudio:
https://www.rstudio.com/
• GitHub repository:
https://github.com/BasiaFusinska/RMachineLearning
https://github.com/BasiaFusinska/MachineLearningWithR
• Packages
Supervised Machine Learning workflow
Preprocess data → Clean data → Data split → Training data + Test data
Training data → Machine Learning algorithm → Trained model
Test data → Trained model → Score
Classification problem
Model training
Data & Labels
[Figure: handwritten digit images with their labels, 0 to 9]
Data preparation
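• Original bitmaps: 32 x 32 pixels, values 0-1
• Reduced to 8 x 8 features, values 0..16 (number of "on" pixels in each non-overlapping 4 x 4 block)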
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
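The reduction from a 32 x 32 bitmap to an 8 x 8 feature matrix can be sketched in base R as below. The function name and the sample bitmap are hypothetical; the block-count scheme follows the dataset description, which counts "on" pixels in non-overlapping 4 x 4 blocks.

```r
# Reduce a 32 x 32 binary bitmap to an 8 x 8 matrix of block counts.
# Each output cell is the number of "on" pixels in a 4 x 4 block, so values range 0..16.
downsample_bitmap <- function(bitmap, block = 4) {
  stopifnot(nrow(bitmap) %% block == 0, ncol(bitmap) %% block == 0)
  row_group <- (seq_len(nrow(bitmap)) - 1) %/% block   # block index of each row
  col_group <- (seq_len(ncol(bitmap)) - 1) %/% block   # block index of each column
  # Sum rows within each row block, then columns within each column block
  by_rows <- rowsum(bitmap, row_group)
  t(rowsum(t(by_rows), col_group))
}

# Hypothetical bitmap: a filled 16 x 16 square in the top-left corner
bitmap <- matrix(0L, nrow = 32, ncol = 32)
bitmap[1:16, 1:16] <- 1L
features <- downsample_bitmap(bitmap)
print(features)
```

Flattening the 8 x 8 result gives the 64 feature columns (V1 to V64) that precede the label column in the data files.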
K-Nearest Neighbours Algorithm
• Object is classified by a majority vote of its k nearest neighbours
• k – algorithm parameter
• Distance metrics: Euclidean (continuous variables), Hamming (text)
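A minimal base R sketch of the majority vote, using the five rows from the "Data-based classification" table and the query F1: 31, F2: 4. The helper knn_predict is hypothetical and written only for illustration; the talk's own code uses caret's knn3.

```r
# Toy training data from the "Data-based classification" table
train_x <- matrix(c( 3, 47,
                    24,  2,
                    17,  3,
                     5, 51,
                     7, 42), ncol = 2, byrow = TRUE)
train_y <- c("A", "B", "B", "A", "A")

knn_predict <- function(query, x, y, k = 3) {
  # Euclidean distance from the query to every training row
  distances <- sqrt(rowSums(sweep(x, 2, query)^2))
  # Labels of the k closest rows, then a majority vote
  nearest <- y[order(distances)[seq_len(k)]]
  names(which.max(table(nearest)))
}

prediction <- knn_predict(c(31, 4), train_x, train_y, k = 3)
print(prediction)
```

With k = 3 the three closest rows are two B entries and one A entry, so the vote returns "B".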
Naïve Bayes classifier
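$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}$, where $\mathbf{x} = (x_1, \dots, x_n)$
• posterior: $p(C_k \mid x_1, \dots, x_n)$
• prior: $p(C_k)$
• likelihood: $p(\mathbf{x} \mid C_k)$
• evidence: $p(\mathbf{x})$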
Naïve Bayes example
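Sex     Height  Weight  Foot size
Male    6       190     11
Male    6.2     170     10
Female  5       130     6
…       …       …       …

Sex     Height  Weight  Foot size
?       5.9     140     8

$p(\text{male} \mid \mathbf{x}) = \frac{p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male})}{\text{evidence}}$
$p(\text{female} \mid \mathbf{x}) = \frac{p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})}{\text{evidence}}$
$\text{evidence} = p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male}) + p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})$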
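The posterior calculation can be sketched in R under two assumptions that are not in the slides: each feature is modelled with a Gaussian density per class (a common choice for continuous features), and the class means, standard deviations and priors below are hypothetical placeholders, since the training table is elided.

```r
# HYPOTHETICAL per-class means and standard deviations (not from the slides)
params <- list(
  male   = list(mean = c(height = 5.9, weight = 175, foot = 11),
                sd   = c(height = 0.2, weight = 12,  foot = 1)),
  female = list(mean = c(height = 5.4, weight = 130, foot = 7),
                sd   = c(height = 0.3, weight = 15,  foot = 1))
)
prior <- c(male = 0.5, female = 0.5)          # assumed equal priors
x <- c(height = 5.9, weight = 140, foot = 8)  # the query row from the slide

# Numerator for each class: prior * product of per-feature Gaussian densities
numerator <- sapply(names(prior), function(cls) {
  prior[[cls]] * prod(dnorm(x, mean = params[[cls]]$mean, sd = params[[cls]]$sd))
})

evidence  <- sum(numerator)       # same normaliser for both classes
posterior <- numerator / evidence
print(posterior)
```

Because the evidence is shared by both classes, comparing the numerators alone is enough to pick the predicted class.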
Logistic regression
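$z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
$y = \begin{cases} 1 & \text{for } z > 0 \\ 0 & \text{for } z < 0 \end{cases}$
$y = \begin{cases} 1 & \text{for } \phi(z) > 0.5 \\ 0 & \text{for } \phi(z) < 0.5 \end{cases}$
• Logistic function: $\phi(z) = \frac{1}{1 + e^{-z}}$
• Coefficients: $\beta_0, \dots, \beta_k$
• Training finds the best fit of β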
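A short sketch of the decision rule in R. The coefficients and input rows are hypothetical values chosen only to show that thresholding the logistic function at 0.5 is the same as thresholding z at 0.

```r
# Logistic (sigmoid) function; base R also provides it as plogis()
phi <- function(z) 1 / (1 + exp(-z))

# HYPOTHETICAL coefficients beta_0, beta_1, beta_2 and three input rows
beta <- c(-1.0, 0.8, -0.5)
X <- rbind(c(2.0, 1.0),
           c(0.5, 2.0),
           c(3.0, 0.2))

z <- as.vector(cbind(1, X) %*% beta)   # z = beta_0 + beta_1 x_1 + beta_2 x_2
p <- phi(z)

# Thresholding phi(z) at 0.5 gives the same labels as thresholding z at 0
y_from_p <- as.integer(p > 0.5)
y_from_z <- as.integer(z > 0)
print(data.frame(z = z, p = p, y = y_from_p))
```

In practice the coefficients would come from fitting, for example with glm(..., family = binomial), rather than being set by hand.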
Evaluation methods for classification
Confusion Matrix

                      Reference
                      Positive  Negative
Prediction  Positive  TP        FP
            Negative  FN        TN

Receiver Operating Characteristic (ROC) curve
Area under the curve (AUC)

$\text{Accuracy} = \frac{\#\text{correct}}{\#\text{predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$
$\text{Precision} = \frac{TP}{TP + FP}$
$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$ (how good the model is at detecting positives)
$\text{Specificity} = \frac{TN}{TN + FP}$ (how good the model is at avoiding false alarms)
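The four metrics can be computed directly from the confusion matrix cells. The counts below are hypothetical and serve only as a worked example.

```r
# HYPOTHETICAL counts from a binary confusion matrix
TP <- 40; FP <- 10; FN <- 5; TN <- 45

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)   # sensitivity
specificity <- TN / (TN + FP)   # true negative rate

print(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity))
```

Here accuracy is 0.85 and precision is 0.8, while recall (40/45) and specificity (45/55) differ because they are normalised by the reference positives and negatives respectively.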
# Read data
trainingSet <- read.csv(trainingFile, header = FALSE)
testSet <- read.csv(testFile, header = FALSE)
trainingSet$V65 <- factor(trainingSet$V65)
testSet$V65 <- factor(testSet$V65)
# Classify
library(caret)
knn.fit <- knn3(V65 ~ ., data=trainingSet, k=5)
# Predict new values
pred.test <- predict(knn.fit, testSet[,1:64], type="class")
# Confusion matrix
library(caret)
confusionMatrix(pred.test, testSet[,65])
Regression problem
• Dependent value
• Predicting the real value
• Fitting the coefficients
• Analytical solutions
• Gradient descent
Ordinary linear regression
Residual sum of squares (RSS)
$S(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$
$\hat{\beta} = \arg\min_{\beta} S(\beta)$
$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
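As a sketch of the analytical solution, the minimiser of the residual sum of squares can be obtained from the normal equations and compared with lm(). The built-in mtcars data and the choice of predictors are assumptions made for illustration only.

```r
# Design matrix with an intercept column, and the response vector
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

# Closed-form least-squares solution: solve (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Residual sum of squares S(beta) at the fitted coefficients
rss <- sum((y - X %*% beta_hat)^2)

# Cross-check against R's built-in fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cbind(normal_equations = as.vector(beta_hat), lm = unname(coef(fit))))
```

lm() itself uses a QR decomposition rather than the normal equations, which is numerically safer, but both minimise the same S(β).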
Evaluation methods for regression
• Errors
$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (f_i - y_i)^2}{n}}$
$R^2 = 1 - \frac{\sum_i (f_i - y_i)^2}{\sum_i (\bar{y} - y_i)^2}$
• Statistics (t, ANOVA)
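Both error measures are easy to compute by hand for a fitted model. The sketch below assumes the built-in mtcars data and an arbitrary linear model, and checks the hand-computed R² against the value reported by summary().

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
f <- fitted(fit)      # predictions f_i
y <- mtcars$mpg       # observed values y_i

rmse      <- sqrt(mean((f - y)^2))
r_squared <- 1 - sum((f - y)^2) / sum((mean(y) - y)^2)

print(c(RMSE = rmse, R2 = r_squared))
print(summary(fit)$r.squared)   # should match r_squared
```

Note that this RMSE divides by n, whereas the "Residual standard error" printed by summary() divides by the residual degrees of freedom, so the two numbers differ slightly.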
Prestige dataset
Feature    Data type              Description
education  continuous             Average education (years)
income     integer                Average income (dollars)
women      continuous             Percentage of women
prestige   continuous             Pineo-Porter prestige score for occupation
census     integer                Canadian Census occupational code
type       multi-valued discrete  Type of occupation: bc, prof, wc
# The Prestige dataset ships with the car package
library(car)
# Pairs for the numeric data (drop the census and type columns)
pairs(Prestige[,-c(5,6)], pch=21, bg=Prestige$type)
# Linear regression, numerical data
num.model <- lm(prestige ~ education + log2(income) + women, Prestige)
summary(num.model)
--------------------------------------------------
Call:
lm(formula = prestige ~ education + log2(income) + women, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-17.364 -4.429 -0.101 4.316 19.179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -110.9658 14.8429 -7.476 3.27e-11 ***
education 3.7305 0.3544 10.527 < 2e-16 ***
log2(income) 9.3147 1.3265 7.022 2.90e-10 ***
women 0.0469 0.0299 1.568 0.12
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.093 on 98 degrees of freedom
Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
Regression Plots
# Diagnostic plots for the model fitted above, arranged in a 2 x 2 grid
par(mfrow=c(2,2))
plot(num.model)
par(mfrow=c(1,1))
Residuals vs Fitted
• Check whether the residuals show non-linear patterns
• Check whether the model captures the non-linear relationship
• Should show residuals spread equally around the horizontal line
Normal Q-Q
• Shows whether the residuals are normally distributed
• Points should lie along the straight dashed line
• Check that the residuals do not deviate severely from it
Scale-Location
• Shows whether the residuals are spread equally along the range of fitted values
• Tests the assumption of equal variance (homoscedasticity)
• Should show a horizontal line with equally (randomly) spread points
Residuals vs Leverage
• Helps to find influential cases
• Cases outside the Cook’s distance lines are influential
• With no influential cases the Cook’s distance lines should be barely visible
Categorical data for regression
• Categories: A, B, C are coded as dummy variables
• In general, a variable with k categories is encoded as k-1 dummy variables

Category  V1  V2
A         0   0
B         1   0
C         0   1

$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \beta_{j+1} v_1 + \dots + \beta_{j+k-1} v_{k-1}$
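R builds these dummy variables automatically for factors. A minimal sketch with a hypothetical three-level factor shows the k-1 columns that lm() would see:

```r
# A three-level categorical variable (levels A, B, C as on the slide)
category <- factor(c("A", "B", "C", "B", "A"))

# model.matrix() expands the factor using treatment contrasts by default:
# the first level (A) is the baseline, so only B and C get a column
design <- model.matrix(~ category)
print(design)
```

The columns categoryB and categoryC play the role of V1 and V2 in the table, and rows with category A have zeros in both.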
# Linear regression, categorical variable
cat.model <- lm(prestige ~ education + log2(income) + type, Prestige)
summary(cat.model)
--------------------------------------------------
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.2019 13.7431 -5.909 5.63e-08 ***
education 3.2845 0.6081 5.401 5.06e-07 ***
log2(income) 7.2694 1.1900 6.109 2.31e-08 ***
typeprof 6.7509 3.6185 1.866 0.0652 .
typewc -1.4394 2.3780 -0.605 0.5465
# Linear regression, categorical variable split
et.fit <- lm(prestige ~ type*education, Prestige)
summary(et.fit)
--------------------------------------------------
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.2936 8.6470 -0.497 0.621
typeprof 18.8637 16.8881 1.117 0.267
typewc -24.3833 21.7777 -1.120 0.266
education 4.7637 1.0247 4.649 1.11e-05 ***
typeprof:education -0.9808 1.4495 -0.677 0.500
typewc:education 1.6709 2.0777 0.804 0.423
# Plot the data with one fitted line per occupation type
library(ggplot2)
cf <- et.fit$coefficients
ggplot(Prestige, aes(education, prestige)) + geom_point(aes(col=type)) +
# bc (baseline level)
geom_abline(slope=cf[4], intercept = cf[1], colour='red') +
# prof: baseline plus typeprof and typeprof:education
geom_abline(slope=cf[4] + cf[5], intercept = cf[1] + cf[2], colour='green') +
# wc: baseline plus typewc and typewc:education
geom_abline(slope=cf[4] + cf[6], intercept = cf[1] + cf[3], colour='blue')
Resampling: Bootstrapping
k-fold cross validation
library(caret)
library(pROC)
# By default bootstrapping is chosen
boot.fit <- train(labels ~ ., data=myData, method="knn")
# k-fold cross validation (k = 10 by default)
tc.cv <- trainControl(method="cv")
cv.fit <- train(labels ~ ., data=myData, trControl=tc.cv, method="knn")
# Changing the metric from default Accuracy to ROC
tc.rcv <- trainControl(method="cv", classProbs=TRUE,
summaryFunction = twoClassSummary)
rcv.fit <- train(labels ~ ., data=myData, trControl=tc.rcv,
method="knn", metric="ROC")
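What k-fold cross validation does internally can be sketched in base R. This is an illustration under stated assumptions, not caret's implementation: it swaps the knn classifier for a simple lm regression on the built-in mtcars data, uses k = 5, and assigns folds deterministically instead of shuffling.

```r
k <- 5
n <- nrow(mtcars)
# Assign each row to one of k folds (deterministic here; usually shuffled first)
fold <- rep(seq_len(k), length.out = n)

fold_rmse <- sapply(seq_len(k), function(i) {
  train <- mtcars[fold != i, ]
  test  <- mtcars[fold == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)   # train on the other k-1 folds
  pred  <- predict(fit, newdata = test)      # score the held-out fold
  sqrt(mean((pred - test$mpg)^2))
})

print(fold_rmse)
print(mean(fold_rmse))   # cross-validated estimate of the error
```

Every row is used for testing exactly once, which is what distinguishes this from a single train/test split.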
Clustering problem
K-means Algorithm
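The two alternating steps of k-means (assign points to the nearest centre, then move each centre to the mean of its points) can be sketched in base R. This is the textbook Lloyd's variant written for illustration; R's own kmeans() uses the Hartigan-Wong algorithm by default. The points and the helper simple_kmeans are hypothetical.

```r
# Two well-separated HYPOTHETICAL blobs of 2-D points
points <- rbind(
  cbind(c(1.0, 1.2, 0.8, 1.1), c(1.0, 0.9, 1.1, 1.3)),
  cbind(c(5.0, 5.2, 4.8, 5.1), c(5.0, 4.9, 5.1, 5.3))
)

simple_kmeans <- function(x, centers, max_iter = 100) {
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared distance from every point to every centre
    d <- sapply(seq_len(nrow(centers)), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    })
    cluster <- max.col(-d, ties.method = "first")   # index of nearest centre
    # Update step: move each centre to the mean of its assigned points
    new_centers <- centers
    for (j in seq_len(nrow(centers))) {
      if (any(cluster == j)) {
        new_centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    if (all(new_centers == centers)) break   # converged: centres stopped moving
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}

# Initial centres: the first and last points (a simplification; kmeans() samples them)
result <- simple_kmeans(points, centers = points[c(1, 8), , drop = FALSE])
print(result$cluster)
print(result$centers)
```

On this toy input the first four points end up in cluster 1 and the last four in cluster 2, with each centre at the mean of its blob.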
Chicago crimes dataset
Data column   Data type
ID            Number
Case Number   String
Arrest        Boolean
Primary Type  Enum
District      Enum
Date
FBI Code      Enum
Longitude     Numeric
Latitude      Numeric
...
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
# Read data
crimeData <- read.csv(crimeFilePath)
# Only data with location, only Assault or Burglary types
crimeData <- crimeData[
!is.na(crimeData$Latitude) & !is.na(crimeData$Longitude),]
selectedCrimes <- subset(crimeData,
Primary.Type %in% c(crimeTypes[2], crimeTypes[4]))
# Visualise
library(ggplot2)
library(ggmap)
# Get map from Google
map_g <- get_map(location=c(lon=mean(crimeData$Longitude, na.rm=TRUE), lat=mean(
crimeData$Latitude, na.rm=TRUE)), zoom = 11, maptype = "terrain", scale = 2)
ggmap(map_g) + geom_point(data = selectedCrimes, aes(x = Longitude, y = Latitude,
fill = Primary.Type, alpha = 0.8), size = 1, shape = 21) +
guides(fill=FALSE, alpha=FALSE, size=FALSE)
[Figure: map of Assault & Burglary crimes]
# k-means clustering (k=6)
clusterResult <- kmeans(selectedCrimes[, c('Longitude', 'Latitude')], 6)
# Get the clusters information
centers <- as.data.frame(clusterResult$centers)
clusterColours <- factor(clusterResult$cluster)
# Visualise
ggmap(map_g) +
geom_point(data = selectedCrimes, aes(x = Longitude, y = Latitude,
alpha = 0.8, color = clusterColours), size = 1) +
geom_point(data = centers, aes(x = Longitude, y = Latitude,
alpha = 0.8), size = 1.5) +
guides(fill=FALSE, alpha=FALSE, size=FALSE)
[Figure: map of crime clusters]
Keep in touch
BarbaraFusinska.com
Barbara@Fusinska.com
@BasiaFusinska
https://github.com/BasiaFusinska/RMachineLearning

E-Commerce Order PredictionShraddha Kamble.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Machine Learning with R

  • 2. Machine Learning with R Barbara Fusinska @BasiaFusinska
  • 3. About me Data Science Freelancer Machine Learning Programmer @BasiaFusinska BarbaraFusinska.com Barbara@Fusinska.com https://github.com/BasiaFusinska/RMachineLearning
  • 4. Agenda • Machine Learning • R platform • Machine Learning with R • Classification problem • Linear Regression • Clustering
  • 6. Movies Genres
      Title            # Kisses  # Kicks  Genre
      Taken                   3       47  Action
      Love story             24        2  Romance
      P.S. I love you        17        3  Romance
      Rush hours              5       51  Action
      Bad boys                7       42  Action
    Question: What is the genre of "Gone with the wind"?
  • 7. Data-based classification
      Id  Feature 1  Feature 2  Class
      1.          3         47  A
      2.         24          2  B
      3.         17          3  B
      4.          5         51  A
      5.          7         42  A
    Question: What is the class of the entry with the following features: F1: 31, F2: 4?
  • 8. Data Visualization [Scatter plot of the data points, with a line separating classes A and B] Rule 1: If on the left side of the line then Class = A. Rule 2: If on the right side of the line then Class = B.
  • 10. Supervised learning • Classification, regression • Label, target value • Training & Validation phases
  • 11. Unsupervised learning • Clustering, feature selection • Finding structure of data • Statistical values describing the data
  • 12. Publishing the model Machine Learning Model Model Training Published Machine Learning Model Prediction Training data Publish model Test stream Scores
  • 14. Why R? • Ross Ihaka & Robert Gentleman • Successor of S • Open source • Community driven • #1 for statistical computing • Exploratory Data Analysis • Machine Learning • Visualisation
  • 15. Setup • Install R: https://www.r-project.org/ • Install RStudio: https://www.rstudio.com/ • GitHub repository: https://github.com/BasiaFusinska/RMachineLearning https://github.com/BasiaFusinska/MachineLearningWithR • Packages
  • 16. Supervised Machine Learning workflow Clean data Data split Machine Learning algorithm Trained model Score Preprocess data Training data Test data
  • 17. Classification problem Model training Data & Labels 0 1 2 3 4 5 6 7 8 9
  • 18. Data preparation 32 x 32 (0-1) 8 x 8 (0..16) https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
  • 19. K-Nearest Neighbours Algorithm • An object is classified by a majority vote of its k nearest neighbours • k is the algorithm parameter • Distance metrics: Euclidean (continuous variables), Hamming (text)
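The majority vote can be sketched in a few lines of base R. This is only an illustration, not the implementation used in the talk: it reuses the five rows from the "Data-based classification" slide, and `knn_predict` is a hypothetical helper name introduced here.

```r
# Training data copied from the "Data-based classification" slide
train <- data.frame(
  f1    = c(3, 24, 17, 5, 7),
  f2    = c(47, 2, 3, 51, 42),
  class = factor(c("A", "B", "B", "A", "A"))
)

# knn_predict is a hypothetical helper, not a function from any package
knn_predict <- function(train, query, k = 3) {
  # Euclidean distance from the query to every training point
  d <- sqrt((train$f1 - query[1])^2 + (train$f2 - query[2])^2)
  # Indices of the k closest training points
  nearest <- order(d)[seq_len(k)]
  # Majority vote among their class labels
  votes <- table(train$class[nearest])
  names(votes)[which.max(votes)]
}

# The question from the slide: F1 = 31, F2 = 4
pred <- knn_predict(train, query = c(31, 4), k = 3)
print(pred)  # "B"
```

With k = 3 the two closest points are both of class B, so the vote is 2 to 1 in favour of B.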
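  • 20. Naïve Bayes classifier
    $$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}, \qquad \mathbf{x} = (x_1, \dots, x_k)$$
    Here $p(C_k \mid x_1, \dots, x_k)$ is the posterior, $p(C_k)$ the prior, $p(\mathbf{x} \mid C_k)$ the likelihood and $p(\mathbf{x})$ the evidence.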
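  • 21. Naïve Bayes example
    Training data:
      Sex     Height  Weight  Foot size
      Male    6       190     11
      Male    6.2     170     10
      Female  5       130     6
      …       …       …       …
    Entry to classify:
      Sex  Height  Weight  Foot size
      ?    5.9     140     8
    $$p(\text{male} \mid \mathbf{x}) = \frac{p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male})}{\text{evidence}}$$
    $$p(\text{female} \mid \mathbf{x}) = \frac{p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})}{\text{evidence}}$$
    $$\text{evidence} = p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male}) + p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})$$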
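The posterior computation above can be sketched in base R. Two assumptions are made here that the slide does not state: each feature is modelled with a normal distribution per class, and the eight training rows are illustrative toy values (the slide shows only three rows and elides the rest).

```r
# Illustrative toy data, NOT the full table from the slide
people <- data.frame(
  sex    = factor(c(rep("male", 4), rep("female", 4))),
  height = c(6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75),
  weight = c(180, 190, 170, 165, 100, 150, 130, 150),
  foot   = c(12, 11, 12, 10, 6, 8, 7, 9)
)

# The entry to classify, as on the slide
query <- c(height = 5.9, weight = 140, foot = 8)
features <- names(query)

# prior * likelihood for each class (the numerators of the posteriors)
unnormalised <- sapply(levels(people$sex), function(cls) {
  rows <- people[people$sex == cls, features]
  prior <- nrow(rows) / nrow(people)
  # Assumption: Gaussian likelihood per feature, features independent given the class
  likelihood <- prod(dnorm(query, mean = colMeans(rows), sd = sapply(rows, sd)))
  prior * likelihood
})

# Dividing by the evidence (the sum of the numerators) gives the posteriors
posterior <- unnormalised / sum(unnormalised)
print(round(posterior, 4))
```

For this toy data the posterior strongly favours `female`, mainly because a weight of 140 and a foot size of 8 are far from the male class means.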
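  • 22. Logistic regression
    $$z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
    $$y = \begin{cases} 1 & \text{for } z > 0 \\ 0 & \text{for } z < 0 \end{cases} \qquad \Longleftrightarrow \qquad y = \begin{cases} 1 & \text{for } \phi(z) > 0.5 \\ 0 & \text{for } \phi(z) < 0.5 \end{cases}$$
    where $\phi(z) = 1 / (1 + e^{-z})$ is the logistic function and the coefficients $\beta$ are chosen as the best fit to the training data.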
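The equivalence of the two decision rules can be checked with a small sketch. It uses the built-in `mtcars` data set (an assumption for illustration, not a data set from the talk) and fits the coefficients with `glm`.

```r
# Fit a logistic regression: is the car manual (am = 1) given its weight?
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# z is the linear predictor beta_0 + beta_1 * wt
z <- predict(fit, type = "link")

# The logistic function maps z to a probability
phi <- 1 / (1 + exp(-z))

# Both decision rules from the slide
y_from_z   <- as.integer(z > 0)
y_from_phi <- as.integer(phi > 0.5)

print(coef(fit))
print(table(predicted = y_from_phi, actual = mtcars$am))
```

Thresholding z at 0 and thresholding the probability at 0.5 give the same labels, because the logistic function is increasing and equals 0.5 at z = 0.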
  • 23. Evaluation methods for classification
    Confusion matrix:
                              Reference
                              Positive  Negative
      Prediction  Positive    TP        FP
                  Negative    FN        TN
    $$\text{Accuracy} = \frac{\#\text{correct}}{\#\text{predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
    $$\text{Precision} = \frac{TP}{TP + FP}$$
    $$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$ (how good it is at detecting positives)
    $$\text{Specificity} = \frac{TN}{TN + FP}$$ (how good it is at avoiding false alarms)
    Receiver Operating Characteristic (ROC) curve; area under the curve (AUC)
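These measures follow directly from the four cells of the confusion matrix. A minimal sketch in base R, using two made-up label vectors (hypothetical values chosen only so the counts are easy to check by hand):

```r
# Hypothetical labels: 1 = positive, 0 = negative
reference  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prediction <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# The four cells of the confusion matrix
tp <- sum(prediction == 1 & reference == 1)
fp <- sum(prediction == 1 & reference == 0)
fn <- sum(prediction == 0 & reference == 1)
tn <- sum(prediction == 0 & reference == 0)

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)   # also called sensitivity
specificity <- tn / (tn + fp)

print(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity))
```

Here TP = 3, FP = 1, FN = 1 and TN = 5, so accuracy is 0.8, precision and recall are both 0.75, and specificity is 5/6.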
  • 24.
      # Read data
      trainingSet <- read.csv(trainingFile, header = FALSE)
      testSet <- read.csv(testFile, header = FALSE)
      trainingSet$V65 <- factor(trainingSet$V65)
      testSet$V65 <- factor(testSet$V65)
      # Classify
      library(caret)
      knn.fit <- knn3(V65 ~ ., data=trainingSet, k=5)
      # Predict new values
      pred.test <- predict(knn.fit, testSet[,1:64], type="class")
  • 26. Regression problem • Dependent value • Predicting the real value • Fitting the coefficients • Analytical solutions • Gradient descent
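Gradient descent, the last item above, can be sketched for a one-predictor linear model. The data set (`mtcars`), the learning rate and the iteration count are assumptions picked for illustration; the predictor is standardised so that a fixed step size converges quickly.

```r
# Standardise the predictor so one learning rate suits both coefficients
x <- as.numeric(scale(mtcars$wt))
y <- mtcars$mpg
X <- cbind(1, x)          # design matrix with an intercept column
n <- length(y)

beta <- c(0, 0)           # starting point
rate <- 0.1               # learning rate (assumed, not tuned)

for (i in 1:1000) {
  # Gradient of the mean squared error with respect to beta
  gradient <- as.numeric(-2 / n * t(X) %*% (y - X %*% beta))
  beta <- beta - rate * gradient
}

# The analytical least-squares solution for comparison
analytic <- unname(coef(lm(y ~ x)))
print(rbind(gradient_descent = beta, analytic = analytic))
```

Both approaches arrive at the same coefficients; the iterative one is mainly of interest when the analytical solution is too expensive to compute.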
  • 27. Ordinary linear regression
    $$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
    Residual sum of squares (RSS):
    $$S(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$$
    $$\hat{\beta} = \arg\min_{\beta} S(\beta)$$
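Minimising S(β) has the closed-form solution given by the normal equations, $X^T X \hat{\beta} = X^T y$. A minimal sketch on the built-in `mtcars` data (an assumption for illustration), compared with what `lm` returns:

```r
# Design matrix with an intercept column and two predictors
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Residual sum of squares at the minimum
rss <- sum((y - X %*% beta_hat)^2)

# lm() fits the same model (internally via a QR decomposition)
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cbind(normal_equations = as.numeric(beta_hat), lm = unname(coef(fit))))
```

Solving the normal equations directly is fine for a small, well-conditioned example like this one; `lm` is the safer choice in practice.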
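  • 28. Evaluation methods for regression • Errors
    $$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (f_i - y_i)^2}{n}}$$
    $$R^2 = 1 - \frac{\sum_i (f_i - y_i)^2}{\sum_i (\bar{y} - y_i)^2}$$
    where $f_i$ is the predicted value, $y_i$ the observed value and $\bar{y}$ the mean of the observed values.
    • Statistics (t, ANOVA)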
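Both error measures can be computed directly from the fitted and observed values. A short sketch using the built-in `mtcars` data (an assumption for illustration):

```r
fit <- lm(mpg ~ wt, data = mtcars)

f <- fitted(fit)      # predicted values f_i
y <- mtcars$mpg       # observed values y_i

# Root mean squared error
rmse <- sqrt(mean((f - y)^2))

# Coefficient of determination
r2 <- 1 - sum((f - y)^2) / sum((mean(y) - y)^2)

print(c(RMSE = rmse, R2 = r2))
```

The hand-computed R² matches the "Multiple R-squared" value reported by `summary(fit)`.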
  • 29. Prestige dataset
      Feature    Data type              Description
      education  continuous             Average education (years)
      income     integer                Average income (dollars)
      women      continuous             Percentage of women
      prestige   continuous             Pineo-Porter prestige score for occupation
      census     integer                Canadian Census occupational code
      type       multi-valued discrete  Type of occupation: bc, prof, wc
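  • 30.
      library(car)  # provides the Prestige data set (via the carData package)
      # Pairs for the numeric data (drops columns 5 and 6: census and type)
      pairs(Prestige[,-c(5,6)], pch=21, bg=Prestige$type)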
  • 31.
      # Linear regression, numerical data
      num.model <- lm(prestige ~ education + log2(income) + women, Prestige)
      summary(num.model)
      --------------------------------------------------
      Call:
      lm(formula = prestige ~ education + log2(income) + women, data = Prestige)

      Residuals:
          Min      1Q  Median      3Q     Max
      -17.364  -4.429  -0.101   4.316  19.179

      Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
      (Intercept)  -110.9658    14.8429  -7.476 3.27e-11 ***
      education       3.7305     0.3544  10.527  < 2e-16 ***
      log2(income)    9.3147     1.3265   7.022 2.90e-10 ***
      women           0.0469     0.0299   1.568     0.12
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 7.093 on 98 degrees of freedom
      Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
      F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
  • 33. Residuals vs Fitted • Check whether the residuals have non-linear patterns • Check whether the model captures the non-linear relationship • Should show residuals spread equally around the horizontal line
  • 34. Normal Q-Q • Shows whether the residuals are normally distributed • Values should lie along the straight dashed line • Check that the residuals do not deviate severely from it
  • 35. Scale-Location • Shows whether the residuals are spread equally along the ranges of predictors • Tests the assumption of equal variance (homoscedasticity) • Should show a horizontal line with equally (randomly) spread points
  • 36. Residuals vs Leverage • Helps to find influential cases • Cases that fall outside the Cook’s distance lines are influential • With no influential cases the Cook’s distance lines should be barely visible
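The four diagnostic plots described on slides 33 to 36 are what `plot()` draws by default for an `lm` object. A minimal sketch, fitted on the built-in `mtcars` data rather than the Prestige model from the talk (an assumption, so the example is self-contained):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Draw the four default diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous plotting layout

# Cook's distance per observation, to look up influential cases by name
cooks <- cooks.distance(fit)
most_influential <- names(which.max(cooks))
print(most_influential)
```

Inspecting `cooks` directly is a convenient complement to the Residuals vs Leverage panel when the plot labels are hard to read.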
  • 37. Categorical data for regression • Categories A, B, C are coded as dummy variables • In general, a variable with k categories is encoded as k-1 dummy variables
      Category  V1  V2
      A         0   0
      B         1   0
      C         0   1
    $$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \beta_{j+1} v_1 + \dots + \beta_{j+k-1} v_{k-1}$$
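R performs this encoding automatically when a factor appears in a model formula. A short sketch with a made-up factor (hypothetical values) shows the k-1 dummy columns that `model.matrix` produces under the default treatment contrasts:

```r
# A factor with k = 3 categories
category <- factor(c("A", "B", "C", "A", "C"))

# Design matrix: intercept plus k - 1 = 2 dummy variables
dummies <- model.matrix(~ category)
print(dummies)

# Rows for the reference category A have zeros in both dummy columns
print(dummies[category == "A", c("categoryB", "categoryC")])
```

The first level, A, acts as the reference category, which matches the 0/0 row in the table above.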
  • 38.
      # Linear regression, categorical variable
      cat.model <- lm(prestige ~ education + log2(income) + type, Prestige)
      summary(cat.model)
      --------------------------------------------------
      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
      (Intercept)  -81.2019    13.7431  -5.909 5.63e-08 ***
      education      3.2845     0.6081   5.401 5.06e-07 ***
      log2(income)   7.2694     1.1900   6.109 2.31e-08 ***
      typeprof       6.7509     3.6185   1.866   0.0652 .
      typewc        -1.4394     2.3780  -0.605   0.5465
  • 39.
      # Linear regression, categorical variable split
      et.fit <- lm(prestige ~ type*education, Prestige)
      summary(et.fit)
      --------------------------------------------------
      Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
      (Intercept)         -4.2936     8.6470  -0.497    0.621
      typeprof            18.8637    16.8881   1.117    0.267
      typewc             -24.3833    21.7777  -1.120    0.266
      education            4.7637     1.0247   4.649 1.11e-05 ***
      typeprof:education  -0.9808     1.4495  -0.677    0.500
      typewc:education     1.6709     2.0777   0.804    0.423
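  • 40.
      # Plot one fitted regression line per occupation type
      library(ggplot2)
      cf <- et.fit$coefficients
      ggplot(Prestige, aes(education, prestige)) +
        geom_point(aes(col=type)) +
        # bc (reference level): base intercept and slope
        geom_abline(slope=cf[4], intercept = cf[1], colour='red') +
        # prof: base terms plus the typeprof offsets
        geom_abline(slope=cf[4] + cf[5], intercept = cf[1] + cf[2], colour='green') +
        # wc: base terms plus the typewc offsets
        geom_abline(slope=cf[4] + cf[6], intercept = cf[1] + cf[3], colour='blue')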
  • 43.
      library(caret)
      library(pROC)
      # By default bootstrapping is chosen
      boot.fit <- train(labels ~ ., data=myData, method="knn")
      # k-fold cross validation (k = 10 by default)
      tc.cv <- trainControl(method="cv")
      cv.fit <- train(labels ~ ., data=myData, trControl=tc.cv, method="knn")
      # Changing the metric from default Accuracy to ROC
      tc.rcv <- trainControl(method="cv", classProbs=TRUE, summaryFunction = twoClassSummary)
      rcv.fit <- train(labels ~ ., data=myData, trControl=tc.rcv, method="knn", metric="ROC")
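The idea behind k-fold cross-validation can be sketched by hand in base R. This illustrates the resampling scheme only and is not how caret implements it; the data set (`mtcars`), the model and the choice of 5 folds are assumptions for the example.

```r
set.seed(42)
k <- 5

# Assign every row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: train on the other k - 1 folds, evaluate on the held-out fold
rmse <- sapply(seq_len(k), function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt, data = train)
  sqrt(mean((predict(fit, newdata = test) - test$mpg)^2))
})

# The cross-validated estimate is the average over the folds
cv_rmse <- mean(rmse)
print(rmse)
print(cv_rmse)
```

Every observation is used for evaluation exactly once, which is what makes the averaged error a less optimistic estimate than the error on the training data.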
  • 46. Chicago crimes dataset
      Data column   Data type
      ID            Number
      Case Number   String
      Arrest        Boolean
      Primary Type  Enum
      District      Enum
      Date
      FBI Code      Enum
      Longitude     Numeric
      Latitude      Numeric
      ...
    https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
  • 47.
      # Read data (crimeFilePath and crimeTypes are assumed to be defined earlier)
      crimeData <- read.csv(crimeFilePath)
      # Only data with location, only Assault or Burglary types
      crimeData <- crimeData[!is.na(crimeData$Latitude) & !is.na(crimeData$Longitude), ]
      selectedCrimes <- subset(crimeData, Primary.Type %in% c(crimeTypes[2], crimeTypes[4]))
      # Visualise
      library(ggplot2)
      library(ggmap)
      # Get map from Google
      map_g <- get_map(location = c(lon = mean(crimeData$Longitude, na.rm=TRUE),
                                    lat = mean(crimeData$Latitude, na.rm=TRUE)),
                       zoom = 11, maptype = "terrain", scale = 2)
      ggmap(map_g) +
        geom_point(data = selectedCrimes,
                   aes(x = Longitude, y = Latitude, fill = Primary.Type, alpha = 0.8),
                   size = 1, shape = 21) +
        guides(fill=FALSE, alpha=FALSE, size=FALSE)
  • 49.
      # k-means clustering (k=6)
      clusterResult <- kmeans(selectedCrimes[, c('Longitude', 'Latitude')], 6)
      # Get the clusters information
      centers <- as.data.frame(clusterResult$centers)
      clusterColours <- factor(clusterResult$cluster)
      # Visualise
      ggmap(map_g) +
        geom_point(data = selectedCrimes,
                   aes(x = Longitude, y = Latitude, alpha = 0.8, color = clusterColours),
                   size = 1) +
        geom_point(data = centers,
                   aes(x = Longitude, y = Latitude, alpha = 0.8),
                   size = 1.5) +
        guides(fill=FALSE, alpha=FALSE, size=FALSE)
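The same `kmeans` call can be tried without the crime data or a map. The sketch below uses synthetic points (two made-up, well-separated blobs) in place of longitude and latitude; `nstart = 10` is an added assumption that reruns the algorithm from several random starts and keeps the best result.

```r
set.seed(1)

# Synthetic data: two blobs of 50 points each, centred at (0, 0) and (10, 10)
pts <- data.frame(
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 10)),
  y = c(rnorm(50, mean = 0), rnorm(50, mean = 10))
)

# k-means with k = 2
result <- kmeans(pts, centers = 2, nstart = 10)

# Cluster sizes and the estimated cluster centres
print(table(result$cluster))
print(result$centers)

# Base-graphics view: points coloured by cluster, centres marked with crosses
plot(pts, col = result$cluster, pch = 19)
points(result$centers, pch = 4, cex = 2, lwd = 2)
```

Because k-means starts from random centres, the cluster numbering (which blob is 1 and which is 2) can differ between runs even when the grouping itself is the same.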