These slides are for a tutorial on how to use the R language for data analysis and machine learning tasks.
The workshop was given at OSCON (Austin, TX), 2017
Slides from the talk at Open Data Science Conference London 2017 (http://odsc.com/london)
The presentation uses the R language to show how to tackle machine learning tasks.
This document provides an overview of machine learning techniques using the R programming language. It discusses classification and regression using supervised learning algorithms like k-nearest neighbors and linear regression. It also covers unsupervised learning techniques including k-means clustering. Examples are presented on classification of movie genres, handwritten digit recognition, predicting occupational prestige, and clustering crimes in Chicago neighborhoods. Visualization methods are demonstrated for evaluating models and exploring patterns in the data.
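As a minimal illustration of the supervised-learning side of that summary (shown here in Python for brevity; the dataset and names are invented for the example), a k-nearest-neighbors classifier fits in a few lines:

```python
from collections import Counter

def knn_predict(train, labels, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    # Squared Euclidean distance from `point` to every training point
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, point)), lbl)
        for row, lbl in zip(train, labels)
    )
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny invented dataset: two well-separated clusters in 2-D
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, (0.5, 0.5)))  # → a
print(knn_predict(train, labels, (5.5, 5.5)))  # → b
```

The same idea underlies the movie-genre and digit-recognition examples the slides mention: a new point takes the majority label of its nearest neighbors.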
R is a free statistical programming language and software environment used for statistical analysis and graphics. It was originally based on S, a programming language developed at Bell Labs in the 1970s for statistical analysis. R can be used for data manipulation, calculation, and graphical displays. It includes functions for topics like linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and graphical techniques.
This document provides an overview of neural networks in R. It begins with recapping logistic regression and decision boundaries. It then discusses how neural networks allow for non-linear decision boundaries through the use of intermediate outputs and multiple logistic regression models. Code examples are provided to demonstrate building neural networks with intermediate outputs to classify data with non-linear decision boundaries.
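The idea of combining intermediate logistic outputs into a non-linear decision boundary can be sketched with hand-picked weights (a toy illustration in Python, not code from the slides): two logistic units approximate OR and AND, and a third combines them into XOR, which no single linear boundary can separate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x, y):
    """Two logistic units feed a third, yielding a non-linear (XOR) boundary."""
    h1 = sigmoid(20 * x + 20 * y - 10)      # intermediate output ~ OR(x, y)
    h2 = sigmoid(20 * x + 20 * y - 30)      # intermediate output ~ AND(x, y)
    return sigmoid(20 * h1 - 20 * h2 - 10)  # combine: OR and not AND = XOR

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), round(xor_net(x, y)))  # → 0, 1, 1, 0
```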
Training in Analytics, R and Social Media Analytics - Ajay Ohri
This document provides an overview of basics of analysis, analytics, and R. It discusses why analysis is important, key concepts like central tendency, variance, and frequency analysis. It also covers exploratory data analysis, common analytics software, using R for tasks like importing data, data manipulation, visualization and more. Examples and demos are provided for many common R functions and techniques.
This document discusses preparing data for analysis. It covers the need for data exploration including validation, sanitization, and treatment of missing values and outliers. The main steps in statistical data analysis are also presented. Specific techniques discussed include calculating frequency counts and descriptive statistics to understand the distribution and characteristics of variables in a loan data set with 250,000 observations. SAS procedures like Proc Freq, Proc Univariate, and Proc Means are demonstrated for exploring the data.
The document provides an outline of topics covered in R including introduction, data types, data analysis techniques like regression and ANOVA, resources for R, probability distributions, programming concepts like loops and functions, and data manipulation techniques. R is a programming language and software environment for statistical analysis that allows data manipulation, calculation, and graphical visualization. Key features of R include its programming language, high-level functions for statistics and graphics, and ability to extend functionality through packages.
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B... - Alexandros Karatzoglou
Slides from my talk at the RecSys Stammtisch at SoundCloud in Berlin. The presentation is split into two parts: one focusing on ranking and relevance, and one on diversity and how to achieve it using genres. We introduce a novel diversity metric called Binomial Diversity.
The document provides an introduction to the R programming language. It discusses that R is an open-source programming language for statistical analysis and graphics. It can run on Windows, Unix and MacOS. The document then covers downloading and installing R and R Studio, the R workspace, basics of R syntax like naming conventions and assignments, working with data in R including importing, exporting and creating calculated fields, using R packages and functions, and resources for R help and tutorials.
Entity Resolution is the task of disambiguating manifestations of real-world entities through linking and grouping, and is often an essential part of the data wrangling process. There are three primary tasks involved in entity resolution: deduplication, record linkage, and canonicalization; each of which serves to improve data quality by reducing irrelevant or repeated data, joining information from disparate records, and providing a single source of information to perform analytics upon. However, due to data quality issues (misspellings or incorrect data), schema variations in different sources, or simply different representations, entity resolution is not a straightforward process, and most ER techniques utilize machine learning and other stochastic approaches.
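As a toy sketch of the deduplication task described above (real ER systems use the learned, stochastic matching the summary mentions; this is only a crude key-normalization heuristic with invented records):

```python
import re
from collections import defaultdict

def normalize(name):
    """Crude canonical key: lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", name.lower())).strip()

def deduplicate(records):
    """Group records whose normalized keys collide (one simple ER heuristic)."""
    groups = defaultdict(list)
    for r in records:
        groups[normalize(r)].append(r)
    return list(groups.values())

# Three spellings of one entity, plus a distinct one
records = ["Bell Labs", "bell labs.", "BELL  LABS", "AT&T"]
print(deduplicate(records))  # → two groups
```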
WEKA is a collection of machine learning algorithms for data mining tasks written in Java. It contains tools for data preprocessing, classification, clustering, association rule mining, and attribute selection. The main interfaces are the Explorer for exploratory data analysis, the Experimenter for machine learning experiments, and the Knowledge Flow interface. Common file format is ARFF for representing instances with attributes and data values. Classification algorithms include Bayesian models, decision trees, rules-based classifiers, functions, lazy learners, and meta learners. Clustering includes k-means. Association rule mining includes the Apriori algorithm.
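For reference, an ARFF file declares a relation, its attributes, and the data rows; a minimal example (the weather-style attributes here are illustrative) looks like:

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
```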
This document outlines the agenda for a two-day workshop on learning R and analytics. Day 1 will introduce R and cover data input, quality, and exploration. Day 2 will focus on data manipulation, visualization, regression models, and advanced topics. Sessions include lectures and demos in R. The goal is to help attendees learn R in 12 hours and gain an introduction to analytics skills for career opportunities.
Abstract: This PDSG workshop introduces basic concepts of categorical variables in training data. Concepts covered are dummy variable conversion, and dummy variable trap.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
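The two concepts in this abstract can be sketched in a few lines (a Python illustration with an invented color column, not workshop material): one-hot encoding a categorical variable, and dropping one level to avoid the dummy variable trap, where all k dummy columns sum to 1 and are perfectly collinear with a regression intercept.

```python
def dummy_encode(values, drop_first=True):
    """One-hot encode a categorical column.

    Dropping the first level avoids the dummy variable trap: with all k
    dummy columns present they always sum to 1, making them perfectly
    collinear with an intercept term.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels   # drop baseline level
    header = [f"is_{lvl}" for lvl in kept]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return header, rows

header, rows = dummy_encode(["red", "green", "blue", "green"])
print(header)  # → ['is_green', 'is_red']  ('blue' is the dropped baseline)
print(rows)    # → [[0, 1], [1, 0], [0, 0], [1, 0]]
```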
This document provides an overview of the statistical programming language R. It discusses key R concepts like data types, vectors, matrices, data frames, lists, and functions. It also covers important R tools for data analysis like statistical functions, linear regression, multiple regression, and file input/output. The goal of R is to provide a large integrated collection of tools for data analysis and statistical computing.
Dataset Preparation
Abstract: This PDSG workshop introduces basic concepts on preparing a dataset for training a model. Concepts covered are data wrangling, replacing missing values, categorical variable conversion, and feature scaling.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
Tech Jobs Interviews Preparation - GeekGap Webinar #1
Part 1 - Algorithms & Data Structures
What is an algorithm?
What is a data structure (DS)?
Why study algorithms & DS?
How to assess good algorithms?
Algorithm & DS interviews structure
Case study: Binary Search
2 Binary Search variants
Part 2 - System Design
What is system design?
Why study system design?
System design interviews structure
Case study: ERD with Lucidchart
Demo Time: SQLAlchemy
GitHub Repo: http://bit.ly/gg-io-webinar-1-github
by www.geekgap.io
R is a statistical computing and graphics language used for statistical development and data analysis. It provides effective data handling, storage, and reporting with a wide range of statistical and graphical techniques. Some key features of R include its open source nature, large community support through CRAN, and ability to handle complex mathematical formulas and statistical tests easily. However, R also has a steep learning curve and some available packages can be buggy. Common R objects include vectors, lists, matrices, arrays, factors, and data frames to store and manipulate tabular data.
This document provides an introduction to boosted trees. It reviews key concepts in supervised learning like loss functions and regularization. Regression trees make predictions by assigning scores to leaf nodes. Gradient boosting is an algorithm that learns an ensemble of regression trees additively to minimize a loss function. It works by greedily adding trees to improve the model's fit on the training data based on the gradient of the loss function. The trees are learned one at a time by calculating the optimal split at each node to reduce a "structure score" measuring how well the split partitions the data.
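The additive fitting loop described above can be sketched for squared loss, where the negative gradient is simply the residual (a minimal Python illustration using depth-1 stumps and invented data; it omits the regularized "structure score" the document describes):

```python
def fit_stump(x, residuals):
    """Best threshold split of 1-D x, predicting the mean residual per side."""
    best = None
    for t in x:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, rounds=20, lr=0.5):
    """Additively fit stumps to the negative gradient (residuals, for squared loss)."""
    base = sum(y) / len(y)
    pred = [base] * len(x)
    trees = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(x, resid)
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + sum(lr * tr(xi) for tr in trees)

x = [1, 2, 3, 4, 5, 6]
y = [1, 1, 1, 9, 9, 9]
model = gradient_boost(x, y)
print([round(model(xi), 2) for xi in x])  # ensemble closely recovers y
```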
The document discusses arrays and various operations that can be performed on arrays including traversing, searching, insertion, deletion, and sorting. It defines linear arrays as lists of homogeneous data elements of a finite number and describes different ways of representing arrays using subscripts, Fortran notation, and Pascal notation. The document also provides algorithms for traversing, inserting, deleting, linear searching, binary searching, and different sorting methods like bubble sort, insertion sort, and selection sort.
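A few of the array operations the document lists can be sketched directly (a Python illustration of the textbook algorithms, not code from the document): insertion and deletion shift elements in a linear array, and bubble sort repeatedly swaps adjacent out-of-order pairs.

```python
def insert_at(arr, pos, value):
    """Insert by shifting later elements right: O(n) moves in a linear array."""
    arr = arr + [None]
    for i in range(len(arr) - 1, pos, -1):
        arr[i] = arr[i - 1]
    arr[pos] = value
    return arr

def delete_at(arr, pos):
    """Delete by shifting later elements left."""
    for i in range(pos, len(arr) - 1):
        arr[i] = arr[i + 1]
    return arr[:-1]

def bubble_sort(arr):
    """Repeatedly swap adjacent out-of-order pairs: O(n^2) comparisons."""
    arr = list(arr)
    for i in range(len(arr) - 1):
        for j in range(len(arr) - 1 - i):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

print(insert_at([1, 2, 4], 2, 3))  # → [1, 2, 3, 4]
print(delete_at([1, 2, 3, 4], 0))  # → [2, 3, 4]
print(bubble_sort([7, 2, 9, 4]))   # → [2, 4, 7, 9]
```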
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta... - Spark Summit
Netflix is the world’s largest streaming service, with 80 million members in more than 190 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the box art you see, to the decisions made about which TV shows and movies are created.
Given this scale, we use Apache Spark as the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API for ETL, feature generation, model training, and validation. With the Pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators, enabling modularity, composability and testability. Thus, we can build our own feature engineering logic as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks we can more easily experiment with new pipelines and rapidly deploy them to production.
In this talk, we will discuss how we use Apache Spark as the distributed framework on top of which we build our own algorithms to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we’ve found along the way.
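The Transformer/Estimator/Evaluator pattern described above can be sketched in plain Python (a toy illustration of the pattern only, not the Spark ML API): transformers rewrite data, the final estimator's fit produces a model, and a pipeline chains them.

```python
class Scale:
    """Toy Transformer: a feature-encoding step that rescales every value."""
    def __init__(self, factor):
        self.factor = factor
    def transform(self, data):
        return [x * self.factor for x in data]

class MeanEstimator:
    """Toy Estimator: fit() learns from data and returns a fitted model."""
    def fit(self, data):
        mean = sum(data) / len(data)
        return lambda x: mean  # the fitted "model" always predicts the mean

class Pipeline:
    """Chain transformers, then fit the final estimator on the result."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages[:-1]:
            data = stage.transform(data)
        return self.stages[-1].fit(data)

model = Pipeline([Scale(2), MeanEstimator()]).fit([1, 2, 3])
print(model(0))  # → 4.0 (mean of the scaled data [2, 4, 6])
```

Encapsulating each step behind the same small interface is what gives the modularity, composability, and testability the abstract describes.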
Visual diagnostics for more effective machine learning - Benjamin Bengfort
The model selection process is a search for the best combination of features, algorithm, and hyperparameters that maximize F1, R2, or silhouette scores after cross-validation. This view of machine learning often leads us toward automated processes such as grid searches and random walks. Although this approach allows us to try many combinations, we are often left wondering if we have actually succeeded.
By enhancing model selection with visual diagnostics, data scientists can inject human guidance to steer the search process. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance allows us a peek into the high-dimensional realm in which our models operate. As we continue to tune our models, trying to minimize both bias and variance, these glimpses allow us to be more strategic in our choices. The result is more effective modeling, speedier results, and greater understanding of underlying processes.
Visualization is an integral part of the data science workflow, but visual diagnostics are directly tied to machine learning transformers and models. The Yellowbrick library extends the scikit-learn API providing a Visualizer object, an estimator that learns from data and produces a visualization as a result. In this talk, we will explore feature visualizers, visualizers for classification, clustering, and regression, as well as model analysis visualizers. We'll work through several examples and show how visual diagnostics steer model selection, making machine learning more effective.
The document discusses data structures and algorithms. It defines data structures as organized ways of storing data to allow efficient processing. Algorithms manipulate data in data structures to perform operations like searching and sorting. Big-O notation provides an asymptotic analysis of algorithms, estimating how their running time grows with input size. Common time complexities include constant O(1), linear O(n), quadratic O(n^2), and exponential O(2^n).
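The growth rates listed above can be made concrete by tabulating operation counts at a few input sizes (a small Python illustration of the complexity classes, not code from the document):

```python
import math

def ops(n):
    """Approximate operation counts for common complexity classes at size n."""
    return {
        "O(1)": 1,
        "O(log n)": round(math.log2(n)),
        "O(n)": n,
        "O(n^2)": n * n,
        "O(2^n)": 2 ** n,
    }

# Doubling n barely moves O(log n), doubles O(n), quadruples O(n^2),
# and squares the count for O(2^n).
for n in (8, 16, 32):
    print(n, ops(n))
```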
Workflow programming and provenance query model - Rayhan Ferdous
This document defines key concepts for a workflow programming and provenance query model, including workflows, data, modules, dataflow, and properties. It proposes three fundamental queries - Decide, Sequence, and Map - that can answer provenance questions about workflows. These three queries are shown to be sufficient to address provenance queries posed in several other research works. Query results are proposed to be visualized through techniques like DAGs and tables.
The document discusses ensemble clustering methods. It begins by comparing classification and clustering, noting that clustering differs in that ground truth labels are not known beforehand. It then discusses how ensemble clustering can improve upon single clustering algorithms by generating multiple partitions and combining them. The key steps are: 1) generating an ensemble of initial partitions from clustering the data multiple times, 2) aligning the initial partitions into metaclusters, and 3) voting to determine a final clustering assignment. This approach provides benefits of scalability and robustness over single clustering algorithms.
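The combine-and-vote step can be sketched with a co-association matrix (a toy Python illustration of the idea, not the document's algorithm; real methods align partitions into metaclusters more carefully): count how often each pair of points shares a cluster across runs, then merge pairs that agree more than half the time.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of points shares a cluster."""
    n, m = len(partitions[0]), len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
            for i in range(n)]

def consensus_clusters(partitions, threshold=0.5):
    """Greedily merge points whose co-association exceeds `threshold`."""
    ca = co_association(partitions)
    n = len(ca)
    label = [-1] * n
    next_label = 0
    for i in range(n):
        if label[i] == -1:
            label[i] = next_label
            next_label += 1
        for j in range(i + 1, n):
            if ca[i][j] > threshold and label[j] == -1:
                label[j] = label[i]
    return label

# Three runs of a base clusterer over 4 points; labels are arbitrary per run,
# which is exactly why the partitions must be combined by co-occurrence.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
print(consensus_clusters(runs))  # → [0, 0, 1, 1]
```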
Machine learning for IoT - unpacking the blackbox - Ivo Andreev
This document provides an overview of machine learning and how it can be applied to IoT scenarios. It discusses different machine learning algorithms like supervised and unsupervised learning. It also compares various machine learning platforms like Azure ML, BigML, Amazon ML, Google Prediction and IBM Watson ML. It provides guidance on choosing the right algorithm based on the data and diagnosing why machine learning models may fail. It also introduces neural networks and deep learning concepts. Finally, it demonstrates Azure ML capabilities through a predictive maintenance example.
If there is one crucial step in building ML models, it is data preparation: the process of transforming raw data into a state where machine learning algorithms can be run to disclose insights and make predictions. Data preparation involves analysis and depends on the nature of the problem and the particular algorithms. Because knowledge and experience are involved, it cannot simply be automated, which makes the role of the data scientist key to success.
ML is trendy, and Microsoft already has more than 10 services to support it. We will focus on tools like Azure ML Workbench and Python for data preparation, review some common tricks for approaching data, and experiment in Azure ML Studio.
Towards a Comprehensive Machine Learning Benchmark - Turi, Inc.
This document presents a framework for developing a comprehensive machine learning benchmark. It discusses identifying the core building blocks of machine learning algorithms, such as linear algebra, data characteristics, and memory access. It proposes evaluating these building blocks using representative algorithms, datasets, and configurations. Thousands of executions are clustered into a smaller set capturing different software and hardware behaviors. The resulting benchmark suite of 50 workloads incorporates the main building blocks and bottlenecks to help evaluate machine learning performance.
This document discusses data structures and asymptotic analysis. It begins by defining key terminology related to data structures, such as abstract data types, algorithms, and implementations. It then covers asymptotic notations like Big-O, describing how they are used to analyze algorithms independently of implementation details. Examples are given of analyzing the runtime of linear search and binary search, showing that binary search has better asymptotic performance of O(log n) compared to linear search's O(n).
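The linear-vs-binary-search comparison above can be verified empirically by counting comparisons (a short Python experiment, not code from the document): on a sorted list of 1024 elements, the worst-case linear scan makes 1024 comparisons while binary search needs about log2(1024) + 1 = 11.

```python
def linear_search_count(a, target):
    """Number of comparisons a linear scan makes before finding `target`."""
    for count, v in enumerate(a, start=1):
        if v == target:
            return count
    return len(a)

def binary_search_count(a, target):
    """Number of comparisons binary search makes on sorted `a`."""
    lo, hi, count = 0, len(a) - 1, 0
    while lo <= hi:
        count += 1
        mid = (lo + hi) // 2
        if a[mid] == target:
            return count
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return count

a = list(range(1024))
print(linear_search_count(a, 1023))  # → 1024: worst case grows as O(n)
print(binary_search_count(a, 1023))  # → 11: grows as O(log n)
```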
This document describes a course on data structures and algorithms. The course covers fundamental algorithms like sorting and searching as well as data structures including arrays, linked lists, stacks, queues, trees, and graphs. Students will learn to analyze algorithms for efficiency, apply techniques like recursion and induction, and complete programming assignments implementing various data structures and algorithms. The course aims to enhance students' skills in algorithm design, implementation, and complexity analysis. It is worth 4 credits and has prerequisites in computer programming. Student work will be graded based on assignments, exams, attendance, and a final exam.
This presentation will discuss leveraging analytics and machine learning techniques like deep learning, long short term memory networks, and gradient boosted machines for security applications like threat assessment. The presenter will compare current machine learning technologies and discuss best practices for applying predictive modeling to security problems, including data acquisition, feature selection, and model validation. The talk is part of a security roundtable event and will be followed by a lab exercise on developing predictive models.
Data science combines fields like statistics, programming, and domain expertise to extract meaningful insights from data. It involves preparing, analyzing, and modeling data to discover useful information. Exploratory data analysis is the process of investigating data to understand its characteristics and check assumptions before modeling. There are four types of EDA: univariate non-graphical, univariate graphical, multivariate non-graphical, and multivariate graphical. Python and R are popular tools used for EDA due to their data analysis and visualization capabilities.
An LSTM-Based Neural Network Architecture for Model TransformationsJordi Cabot
We propose to take advantage of the advances in Artificial Intelligence and, in particular, Long Short-Term Memory Neural Networks (LSTM), to automatically infer model transformations from sets of input-output model pairs.
U-SQL - Azure Data Lake Analytics for DevelopersMichael Rys
This document introduces U-SQL, a language for big data analytics on Azure Data Lake Analytics. U-SQL unifies SQL with imperative coding, allowing users to process both structured and unstructured data at scale. It provides benefits of both declarative SQL and custom code through an expression-based programming model. U-SQL queries can span multiple data sources and users can extend its capabilities through C# user-defined functions, aggregates, and custom extractors/outputters. The document demonstrates core U-SQL concepts like queries, joins, window functions, and the metadata model, highlighting how U-SQL brings together SQL and custom code for scalable big data analytics.
Valencian Summer School 2015
Day 1
Lecture 5
Data Transformation and Feature Engineering
Charles Parker (Alston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
Machine learning and linear regression programmingSoumya Mukherjee
Overview of AI and ML
Terminology awareness
Applications in real world
Use cases within Nokia
Types of Learning
Regression
Classification
Clustering
Linear Regression Single Variable with python
The Power of Auto ML and How Does it WorkIvo Andreev
Automated ML is an approach to minimize the need of data science effort by enabling domain experts to build ML models without having deep knowledge of algorithms, mathematics or programming skills. The mechanism works by allowing end-users to simply provide data and the system automatically does the rest by determining approach to perform particular ML task. At first this may sound discouraging to those aiming to the “sexiest job of the 21st century” - the data scientists. However, Auto ML should be considered as democratization of ML, rather that automatic data science.
In this session we will talk about how Auto ML works, how is it implemented by Microsoft and how it could improve the productivity of even professional data scientists.
The Machine Learning Workflow with AzureIvo Andreev
This document provides an overview of real world machine learning using Azure. It discusses the machine learning workflow including data understanding, preprocessing, feature engineering, model selection, evaluation and tuning. It then describes various Azure machine learning tools for building, testing and deploying machine learning models including Azure ML Workbench, Studio, Experimentation Service and Model Management Service. It concludes with an upcoming demo of predictive maintenance using Azure ML Studio.
Basic machine learning background with Python scikit-learn
This document provides an overview of machine learning and the Python scikit-learn library. It introduces key machine learning concepts like classification, linear models, support vector machines, decision trees, bagging, boosting, and clustering. It also demonstrates how to perform tasks like SVM classification, decision tree modeling, random forest, principal component analysis, and k-means clustering using scikit-learn. The document concludes that scikit-learn can handle large datasets and recommends Keras for deep learning.
The document provides an overview of database management systems (DBMS). It discusses the history of DBMS beginning in the early 1960s. It also covers data models like hierarchical, network, relational, object-oriented, and deductive. The document describes the architecture and components of a DBMS. It lists advantages like data independence and security as well as disadvantages such as costs. Key concepts covered include data storage, processing, and retrieval.
This document provides an overview of a machine learning workshop. It begins with introducing the presenter and their background. It then outlines the topics that will be covered, including machine learning applications, different machine learning algorithms like decision trees and neural networks, and the necessary math foundations. It discusses the differences between supervised, unsupervised, and reinforcement learning. It also covers evaluating models and challenges like overfitting. The goal is to demystify machine learning concepts and algorithms.
This document provides an overview of machine learning with Azure. It discusses various machine learning concepts like classification, regression, clustering and more. It outlines an agenda for a workshop on the topic that includes experiments in Azure ML Studio, publishing models as web services, and using various Azure data sources. The document encourages participants to clone a GitHub repo for sample code and data and to sign up for an Azure ML Studio account.
This document provides an introduction to data structures, including definitions, types, and operations. It defines a data structure as a particular way of organizing data in computer memory for effective use and retrieval. Data structures are classified as primitive (directly manipulated by machine instructions) and non-primitive (requiring machine instructions). Non-primitive structures include linear (ordered sequences like arrays and lists) and non-linear (graphs and trees). Common operations on data structures include traversing, inserting, deleting, updating, searching, and sorting. The document also discusses abstract data structures, algorithm analysis and complexity measures like time and space complexity, and defines lists and arrays.
Machine Learning with ML.NET and Azure - Andy CrossAndrew Flatters
- The document discusses machine learning and ML.NET. It begins with an introduction of the speaker and their background in machine learning.
- Key topics that will be covered include machine learning, ML.NET, Parquet.NET, using machine learning in production, and relevant Azure tools for data and machine learning.
- Examples provided will demonstrate sentiment analysis, finding patterns in taxi fare data, image recognition, and more to illustrate machine learning algorithms and best practices.
Google jib: Building Java containers without DockerMaarten Smeets
In this quick introduction to Google Jib I'll show what it solves, how you can use it and how it compares to some other solutions. Also see https://www.youtube.com/watch?v=vl-U9m8EXT8
Applications are usually deployed inside containers. A container consists of libraries and tools which allow the application to run inside. Since there can be exploitable vulnerabilities, it is not only important to keep your application up to date but also the container it runs in. There are various tools available to scan container images for those vulnerabilities. I decided to give Anchore Engine a try and create this presentation to give an overview to colleagues. Also I created a Katacoda scenario which allows you to tryout the Anchore Engine quickstart without having to setup your own Docker environment. Anchore Engine is open source and provides several integration options which make using it easy, such as a Jenkins plugin and a Kubernetes Admission Controller.
JDBC has been the de-facto standard for accessing relational databases for a long time. Times are however changing. In cloud environments the pay-per-use model is popular. If you can use resources more efficiently, you can save money! In addition, when running applications at cloud-scale, the number of concurrent requests which hit your services can skyrocket. Can JDBC handle such concurrency efficiently? Answer: No. The time has come to look beyond JDBC!
For services, reactive frameworks are becoming more popular. These frameworks can make more efficient use of resources due to their non-blocking nature, especially at high concurrency. Now with R2DBC relational databases can also be accessed using a reactive API! This means more efficient use of CPU and memory and better response times and throughput at high concurrency.
A tempting story but there are of course many questions
- Is R2DBC mature enough to implement?
- Which R2DBC drivers are available?
- Is framework support available?
- What do you need to do in order to implement R2DBC?
- Does it improve performance enough to make the switch worthwhile?
- Do I need to have a completely non-blocking stack to benefit from using R2DBC?
To answer these questions and more, I've created several implementations using R2DBC and JDBC with Spring Web MVC and Spring WebFlux and put them to the test. I looked at how to implement R2DBC and measured resource usage, throughput, and responsetimes. Interested in the results? Hint: R2DBC is pretty cool!
Performance Issue? Machine Learning to the rescue!Maarten Smeets
t can be difficult to determine how to improve performance of microservices. There are many factors you can vary but which factor will be the one having most impact? During this presentation, a method using the random forest machine learning algorithm will be applied in order to help improve performance of a microservice running inside a JVM. Several measures are taken such as thoughput and response times. Java version, JVM supplier, heap, garbage collection algorithm and microservice framework are all varied. Which factor is most important in determining the response time and throughput of the services? The Random Forest algorithm will be introduced to solve this challenge. Not only will this presentation give some useful suggestions for improving the performance of microservices but will also introduce a novel way to take on the challenge of performance tuning which can be applied to other use-cases. This presentation is especially interesting to developers and architects.
Performance of Microservice Frameworks on different JVMsMaarten Smeets
A lot is happening in world of JVMs lately. Oracle changed its support policy roadmap for the Oracle JDK. GraalVM has been open sourced. AdoptOpenJDK provides binaries and is supported by (among others) Azul Systems, IBM and Microsoft. Large software vendors provide their own supported OpenJDK distributions such as Amazon (Coretto), RedHat and SAP. Next to OpenJDK there are also different JVM implementations such as Eclipse OpenJ9, Azul Systems Zing and GraalVM (which allows creation of native images). Other variables include different versions of the JDK used and whether you are running the JDK directly on the OS or within a container. Next to that, JVMs support different garbage collection algorithms which influence your application behavior. There are many options for running your Java application and choosing the right ones matters! Performance is often an important factor to take into consideration when choosing your JVM. How do the different JVMs compare with respect to performance when running different Microservice implementations? Does a specific framework provide best performance on a specific JVM implementation? I've performed elaborate measures of (among other things) start-up times, response times, CPU usage, memory usage, garbage collection behavior for these different JVMs with several different frameworks such as Reactive Spring Boot, regular Spring Boot, MicroProfile, Quarkus, Vert.x, Akka. During this presentation I will describe the test setup used and will show you some remarkable differences between the different JVM implementations and Microservice frameworks. Also differences between running a JAR or a native image are shown and the effects of running inside a container. This will help choosing the JVM with the right characteristics for your specific use-case!
Performance of Microservice frameworks on different JVMsMaarten Smeets
A lot is happening in world of JVMs lately. Oracle changed its support policy roadmap for the Oracle JDK. GraalVM has been open sourced. AdoptOpenJDK provides binaries and is supported by (among others) Azul Systems, IBM and Microsoft. Large software vendors provide their own supported OpenJDK distributions such as Amazon (Coretto), RedHat and SAP. Next to OpenJDK there are also different JVM implementations such as Eclipse OpenJ9, Azul Systems Zing and GraalVM (which allows creation of native images). Other variables include different versions of the JDK used and whether you are running the JDK directly on the OS or within a container. Next to that, JVMs support different garbage collection algorithms which influence your application behavior. There are many options for running your Java application and choosing the right ones matters! Performance is often an important factor to take into consideration when choosing your JVM. How do the different JVMs compare with respect to performance when running different Microservice implementations? Does a specific framework provide best performance on a specific JVM implementation? I've performed elaborate measures of (among other things) start-up times, response times, CPU usage, memory usage, garbage collection behavior for these different JVMs with several different frameworks such as Reactive Spring Boot, regular Spring Boot, MicroProfile, Quarkus, Vert.x, Akka. During this presentation I will describe the test setup used and will show you some remarkable differences between the different JVM implementations and Microservice frameworks. Also differences between running a JAR or a native image are shown and the effects of running inside a container. This will help choosing the JVM with the right characteristics for your specific use-case!
In VirtualBox it can sometimes challenging to choose the correct networking solution to fit the needs of your specific usecase. In this presentation, the different options are explained and some example cases are discussed. Access between guests, host and other members of the network is elaborated. After this presentation you will be better able to choose the right solution for different usecases and understand the different benefits and drawbacks of every option.
Microservices on Application Container Cloud ServiceMaarten Smeets
ACCS provides the perfect cloud service to develop microservices on! In this presentation I'll demonstrate some of recent the highlights for developers such as Python support, integration options with the Event Hub and I'll go into detail for using the Application Caches to increase performance. Spring Boot will be used extensively in this presentation. After this presentation you will have a better understanding of the options provided by ACCS to create microservices quick and easily.
WebLogic Stability; Detect and Analyse Stuck ThreadsMaarten Smeets
Stuck threads are a major cause for stability issues of WebLogic Server environments. Often people in operations and development who are confronted with stuck threads, are at a loss what to do. In this presentation we will talk about what stuck threads actually are and how you can detect them. We will elaborate on how you can get to the root cause of a stuck thread and which tools can help you with that. In order to reduce the impact of having stuck threads in an application, we will talk about using workmanagers. In order to prevent stuck threads we will illustrate several patterns which can be implemented in infrastructure and applications. Next time you see a stuck thread, you will know what to do!
Redis is an open source in memory database which is easy to use. In this introductory presentation, several features will be discussed including use cases. The datatypes will be elaborated, publish subscribe features, persistence will be discussed including client implementations in Node and Spring Boot. After this presentation, you will have a basic understanding of what Redis is and you will have enough knowledge to get started with your first implementation!
All you need to know about transport layer securityMaarten Smeets
Many people think that using HTTPS to offer your site or service to clients makes you secure from eavesdroppers and people trying to manipulate your network traffic. Think again! In this presentation I'll dive into transport layer security. I'll elaborate on what you can achieve with SSL such as authentication, encryption and integrity and how you can achieve it. I'll talk about the client-server handshake, identity and trust, one-way and two-way SSL, keys and keystores and cipher suite choice. By means of several examples, I'll show what it can mean if you make the wrong choices in on premises and cloud scenario's. This presentation is relevant for anyone involved in securing connections between client and server using TLS and people interested in learning more about the topic of TLS in general.
Webservice security considerations and measuresMaarten Smeets
Security is a hot topic, especially with new laws concerning how to deal with personally identifiable information (PII) and the journey to the cloud many organisations are making. When implemented correctly, security measures can protect your company from people trying to spy on you or manipulate your systems. Security can be implemented at different layers. In this presentation I'll zoom in on webservices and which choices there are to make on the application layer and transport layer. This spans area's like authentication, keys/keystores, OWSM policy choices, WebLogic SSL configuration and cipher suite choices. Security measures are even more relevant in cloud integration scenario's since services might not just be accessible from your internal network. After this presentation, architects and developers will have a good idea on how to quickly get started with taking security measures.
WebLogic Scripting Tool allows easy management of many Weblogic Server based products. Oracle has strategically implemented WLST in many products to make provisioning and configuring of environments easy and reproducible. This among other things enables tools like Chef and Puppet to do their magic. WLST is based on Jython. Jython is an implementation of Python running on the Java VM. Both Python and the Java VM provide many options for extending WLST functionality beyond what is commonly done. This will be elaborated and demonstrated with several advanced use cases and their implementations. This technical presentation will provide you with the knowledge to get most out of your investment in Oracle products!
At OOW 2015 Oracle has released SOA Suite 12.2.1. This new release provides several interesting new features for developers such as end-to-end REST support, JavaScript support and an XSLT debugger. There are also several new features useful for the operations department such as Integration Workload Statistics, Circuit breaker, In-Memory SOA and WebLogic parallel deployments. In this presentation I will explain and demonstrate these new features and provide several use-cases were customers can greatly benefit by implementing them. This presentation is especially useful for developers, people in operations and architects to help them realize the benefits of implementing SOA Suite 12.2.1.
It is not that hard to build your own Cloud Adapter! You can enable a citizen developer to do their own integrations using ICS and also use the same adapter when developing on premise SOA solutions. Oracle enables you to sell your product in the Marketplace, further increasing your return of investment. I will show you the different designtime and runtime components which need to be implemented, how JDeveloper extension development works and how you can test your adapter on ICS locally using the ICS execution agent. I will share common pitfalls when starting adapter development to help you get a headstart when you are considering creating your own. This presentation will help developers and architects understand Cloud Adapters and when you should consider creating one yourself!
Login information and group memberships (identity) often are centrally managed in Enterprises. Many systems use this information to, for example, achieve Single Sign On (SSO) functionality. Surprisingly, access to the Weblogic Server Console and applications is often not centrally managed. I will explain why centralizing management of these identities, in addition to increased security, quickly starts reducing operational cost and even increases developer productivity. During a demonstration, I will introduce several methods for debugging authentication using an external authentication provider in order to lower the bar to apply this pattern. This technically oriented presentation is especially useful for people working in operations managing Weblogic Servers.
3. MACHINE LEARNING WITH R
WHAT IS MACHINE LEARNING
USE CASES FOR MACHINE LEARNING
SUPERVISED LEARNING
UNSUPERVISED LEARNING
INTRODUCING R
COOL FEATURES OF R
R AND ORACLE
4. MACHINE LEARNING
• Machine learning is the subfield of computer science that gives
computers the ability to learn without being explicitly programmed.
5. MACHINE LEARNING
USE CASES
• E-mail categorization
Spam, News, Personal, Orders, …
• Anomaly detection
Fraud detection, behavior which does not fit known classifications well
• Optical Character recognition (OCR)
• Genetics
Will you have a high chance of relapse when you have this cancer type
and these genes?
6. MACHINE LEARNING
USE CASES
• Log file analysis
Which entries are rare?
Which are the variables in a log line?
Intruder detection
• IoT
Self-learning thermostats
• Predict weather
Based on environmental measurements such as
humidity, air pressure, and satellite images
• Detect trends
The number of cases present in the
KEI system at Spir-it and performance
• Image recognition
Self-driving cars like Tesla, BMW
• Predict stock prices
Find correlations between stocks and try to
find features which can predict future prices
7. WHAT IS MACHINE LEARNING
1. Supervised learning
2. Unsupervised learning
8. SUPERVISED LEARNING
• The computer is presented with input and desired output
• The goal is to derive a general ruleset to map input to output
• This ruleset can be used to do predictions of output based on input
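The idea can be sketched in a few lines of R (an illustrative example, not from the slides), using the built-in `cars` dataset: the model derives a rule from labeled input/output pairs, then predicts output for new input.

```r
# Supervised learning in miniature: learn a rule mapping input (speed)
# to desired output (stopping distance) from labeled examples
model <- lm(dist ~ speed, data = cars)   # cars ships with base R

# apply the learned rule to unseen input
predict(model, newdata = data.frame(speed = c(10, 20)))  # ≈ 21.7 and 61.1
```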
14. SUPERVISED LEARNING
RANDOM FOREST
• Features are used to classify data
• A set of decision trees is generated; each split considers a random subset of the features
• Every tree sees a random subset of the data
• Splits in the tree are determined by training data values:
where does a split add the most information?
• To make predictions, the features are run through all decision trees
and the resulting classifications are combined with a weight per tree
18. SUPERVISED LEARNING
RANDOM FOREST
• Why is it so useful?
• Data does not have many requirements
• Can deal with multiple dimensions
• Makes good predictions in a lot of cases
• Fast
• Variable importance can easily be determined
If many features are correlated, a single representative feature can be used
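A minimal sketch of the above in R, assuming the `randomForest` package is installed (the slides do not name a specific package), trained on the built-in iris data:

```r
# Sketch only: assumes install.packages("randomForest") has been run
library(randomForest)
set.seed(42)

# Each tree sees a bootstrap sample of the data; each split considers
# a random subset of the features
fit <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)

predict(fit, iris[c(1, 51, 101), ])  # one flower of each species
importance(fit)                      # variable importance, as noted above
```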
21. ARTIFICIAL NEURAL NETWORKS (ANN)
EXAMPLE BACKPROPAGATION
• Backpropagation
1. Nodes have connections and connections have a random assigned weight
2. Provide input and let the network generate output
3. Compare generated output with desired output
4. Go from the output nodes back to the input and adjust the weights of the node connections.
Adjusting a little bit at a time increases learning time and accuracy
5. Repeat from step 2 until desired error rate reached
• Can be done with weights or with node activation thresholds
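The five steps above can be sketched in base R for a single logistic unit (an illustrative toy, not a full multi-layer network):

```r
# Minimal backpropagation-style sketch: one logistic unit learning OR
set.seed(1)
x <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
y <- c(0, 1, 1, 1)                # desired output (OR of the two inputs)
w <- runif(2); b <- runif(1)      # step 1: random connection weights
sigmoid <- function(z) 1 / (1 + exp(-z))
lr <- 0.5                         # small steps: slower but more stable

for (i in 1:5000) {               # step 5: repeat until good enough
  out <- sigmoid(x %*% w + b)     # step 2: generate output
  err <- out - y                  # step 3: compare with desired output
  grad <- err * out * (1 - out)   # step 4: adjust weights a little
  w <- w - lr * t(x) %*% grad
  b <- b - lr * sum(grad)
}
round(sigmoid(x %*% w + b))       # 0 1 1 1
```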
22. ARTIFICIAL NEURAL NETWORKS (ANN)
SOME PERSONAL THOUGHTS (AS A NEUROBIOLOGIST)
• Most examples of artificial neural networks do not take into account several
properties of biological neural networks
• Signals take time to go from A to B
• Neurons are not arranged in layers
Biological neural networks have a 3D structure with specialized areas
• Once trained, most artificial neural networks are static and don’t learn anymore
• Biological neural networks implement a wide range of signaling mechanisms per node
(neurotransmitters)
• Learning algorithms are not only internal to the neural network.
Natural selection also plays a role
23. SUPERVISED LEARNING
CHALLENGES
• Requires learning set of inputs and desired outputs
• Training data should be balanced
• Correlated features cause biases
• Outputs should be distributed as evenly as possible
25. UNSUPERVISED LEARNING
• Unsupervised machine learning is the machine learning task of
inferring a function to describe hidden structure from "unlabeled"
data
a classification or categorization is not included in the observations
• Examples
• Clustering
• Anomaly detection
• Neural networks (Self-Organizing Maps)
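A minimal clustering sketch with base R's `kmeans`, using the iris measurements without their labels (an illustrative example, not from the slides):

```r
# Unsupervised: k-means groups observations without using any labels
set.seed(7)
features <- iris[, 1:4]     # numeric measurements only, no Species column
km <- kmeans(features, centers = 3, nstart = 20)

table(km$cluster)           # how many observations fell in each cluster
km$centers                  # the hidden structure it found: 3 centroids
```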
32. R A SHORT HISTORY
• Conceived August 1993
An implementation of the S programming language
S was conceived in 1976
• Open sourced June 1995
• Main competitors: SPSS and SAS
• A lot of (mostly statistical) libraries available
CRAN package repository features 10366 available packages.
35. R BASICS
• R is a functional programming (FP) language
• It provides many tools for the creation and manipulation of functions.
• You can do anything with functions that you can do with vectors: you
can assign them to variables, store them in lists, pass them as
arguments to other functions, create them inside functions, and even
return them as the result of a function.
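A short base-R illustration of those function manipulations (illustrative names, not from the slides):

```r
# Functions are ordinary values in R: assign, store, pass, and return them
square <- function(x) x^2                       # assign to a variable
ops <- list(sq = square, neg = function(x) -x)  # store in a list
ops$sq(4)                                       # 16

apply_twice <- function(f, x) f(f(x))           # pass as an argument
apply_twice(square, 3)                          # 81

make_adder <- function(n) function(x) x + n     # return a function (closure)
add5 <- make_adder(5)
add5(10)                                        # 15
```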
36. R BASICS
SOME FEATURES
• Git integration
• Interpreted; does not require compilation
Execute a line in your script and look at the result in the console
• Has its own markdown variant for documentation
Especially useful if you want to have graphs
• R Shiny allows you to generate and host scripts / graphs and make
them available from a browser
37. R BASICS
SOME FEATURES
• Code completion
• Allows multi-threaded execution
• Can be run remotely on an R-server
• Great at reading / writing datasets
For example web site scraping for data
• Of course great at statistics
• Great at generating plots
Especially when using the ggplot2 library
39. R DATATYPES
THE VECTOR
• Vector
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
a <- c(1,2,5.3,6,-2,4)
b <- a * 2
[1] 2.0 4.0 10.6 12.0 -4.0 8.0
40. R DATATYPES
THE MATRIX. ALL COLUMNS HAVE THE SAME TYPE AND LENGTH
# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2,
byrow=TRUE, dimnames=list(rnames, cnames))
# accessing matrix values
y[,4] # 4th column of matrix
y[3,] # 3rd row of matrix
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
41. R DATATYPES
THE DATA.FRAME. LIKE A MATRIX BUT TYPES AND LENGTHS CAN VARY
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names
mydata[1:2] # columns 1,2 of data frame
mydata[c("ID","Color")] # columns ID and Color from data frame
mydata$Passed # variable Passed in the data frame
42. R DATATYPES
THE LIST
• An ordered collection of objects (components)
# example of a list with 4 components:
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Maarten", mynumbers=a, mymatrix=y, age=36)
# example of a list containing two lists
v <- c(list1,list2)
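Components of a list can be accessed by name or position; a self-contained sketch (redefining `w` here so it does not depend on the earlier `a` and `y`):

```r
# Accessing list components by name and by position
w <- list(name = "Maarten", mynumbers = c(1, 2, 5.3), age = 36)

w$name       # by name: "Maarten"
w[["age"]]   # double brackets return the component itself: 36
w[1]         # single brackets return a sub-list of length 1
names(w)     # "name" "mynumbers" "age"
```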
43. COOL FEATURES OF R
Hosting plots: Shiny, Plot.ly
R markdown
Web site crawling