Machine Learning
Chris Sharkey
today
@shark2900
What do you think of when
we say machine learning?
big words
• Hadoop
• Terabyte
• Petabyte
• NoSQL
• Data Science
• D3
• Visualization
• Machine learning
What is machine learning?
“Predictive or descriptive
modeling which learns
from past experience or
data to build models which
can predict the future”
Past Data
(known outcome)
Machine Learning
Model
New Data
(unknown outcome)
Predicted Outcome
Will John play golf?
Date   | Weather | Temperature | Sally going? | Did John golf?
Sept 1 | Sunny   | 92 °F       | Yes          | Yes
Sept 2 | Cloudy  | 84 °F       | No           | No
Sept 3 | Raining | 84 °F       | No           | Yes
Sept 4 | Sunny   | 95 °F       | Yes          | Yes

Date   | Weather | Temperature | Sally going? | Will John golf?
Sept 5 | Cloudy  | 87 °F       | No           | ?
We want a model based on John’s past behavior to predict
what he will do in the future. Can we use ML?
Yes. This is a
classification problem
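The past-data → model → new-data loop can be sketched in a few lines of Python using the golf table above. The "model" here is deliberately the crudest one possible (always predict the most common past outcome), just to make the workflow concrete; it is an illustration, not the method used later in the deck.

```python
from collections import Counter

# Past data (known outcome): (weather, temp_f, sally_going, john_golfed)
past = [
    ("Sunny",   92, "Yes", "Yes"),
    ("Cloudy",  84, "No",  "No"),
    ("Raining", 84, "No",  "Yes"),
    ("Sunny",   95, "Yes", "Yes"),
]

# "Learn" a model from the known outcomes: always predict the majority class.
model = Counter(row[-1] for row in past).most_common(1)[0][0]

# New data (unknown outcome): Sept 5.
new_day = ("Cloudy", 87, "No")
prediction = model  # predicted outcome for Sept 5: "Yes"
```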
ZeroR
Establishes a baseline
Naïve Bayes
Probabilistic model
OneR
Single rule
J48 / C4.5
Decision tree
Upgrade our example
Attributes: age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, potassium, blood glucose, blood urea, serum creatinine, sodium, hemoglobin, packed cell volume, white blood cell count, red blood cell count, hypertension, diabetes mellitus, coronary artery disease, appetite, pedal edema, anemia, stage
Data Set
• 319 instances (people)
• 25 attributes (variables)
Machine Learning
• ZeroR
• OneR
• Naïve Bayes
• J48 / C4.5
Model
Blood test data for
new individuals with
unknown disease
status
Predict whether an
individual has CKD and,
if so, the stage of
their disease
ZeroR
Past data
(known outcome)
New instance
Classified
Classify all new data as the
most ‘popular’ class
Build a frequency table of
the classes and choose the
‘most popular’, i.e. most
frequent, class
How did ZeroR do?
• Correctly classified 28.2% of the time
• Rule: always guess a new instance (person) has stage three kidney disease
• A 28.2% correct classification rate is our baseline
• Correct classification rates above 28.2% are better than guessing
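ZeroR is small enough to write out in full. A minimal sketch, with a hypothetical class distribution chosen to echo the 28.2% baseline from the slides (the real distribution of the 319-person data set is not given here):

```python
from collections import Counter

def zeror(labels):
    """ZeroR: build a frequency table of classes, always predict the most frequent."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical distribution mirroring the deck's 28.2% baseline.
labels = ["stage 3"] * 282 + ["stage 2"] * 250 + ["healthy"] * 240 + ["stage 5"] * 228

rule = zeror(labels)                           # "stage 3"
baseline = labels.count(rule) / len(labels)    # 0.282
```

Any classifier that cannot beat this number is doing no better than guessing the majority class.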
OneR
Past data
(known outcome)
New instance
Classified
Build a frequency table for
each attribute. This
generates a rule for each
value of each attribute.
Choose the attribute whose
rule has the highest
correct classification rate
How did OneR do?
• Correctly classified 80.2% of the time
• Rule based on serum creatinine
• < 0.85 is healthy
• < 1.15 is stage 2
• < 2.25 is stage 3
• ≥ 2.25 is stage 5
• A single rule is created and is responsible for all classification
• A high classification rate indicates a single attribute has high influence in predicting the class
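The frequency-table-per-attribute procedure can be sketched as follows. The toy rows are hypothetical (attribute 1 stands in for a discretised serum-creatinine band); real OneR implementations such as Weka's also handle discretisation of numeric attributes, which is omitted here.

```python
from collections import Counter, defaultdict

def oner(rows, labels):
    """OneR: for each attribute, map each value to its majority class,
    score the resulting one-attribute rule, keep the best attribute."""
    best = None
    for a in range(len(rows[0])):
        table = defaultdict(Counter)          # value -> class frequency table
        for row, y in zip(rows, labels):
            table[row[a]][y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
        correct = sum(rule[row[a]] == y for row, y in zip(rows, labels))
        if best is None or correct > best[2]:
            best = (a, rule, correct)
    return best  # (attribute index, value -> class rule, training hits)

# Hypothetical data: attribute 1 (a creatinine-like band) predicts perfectly.
rows = [("sunny", "low"), ("rainy", "low"), ("sunny", "high"), ("rainy", "high")]
labels = ["healthy", "healthy", "stage 3", "stage 3"]
attr, rule, hits = oner(rows, labels)  # picks attribute 1, 4/4 correct
```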
Naïve Bayes
Past data
(known outcome)
New instance
Classified
Build a frequency table
for each attribute.
Determine the prior
probability of each class
and the conditional
probability of each
attribute value given
each class.
To classify, multiply each
class's prior probability
by the conditional
probabilities of the new
instance's attribute values.
Choose the most
probable class
How did Naïve Bayes do?
• Correctly classified 56.6% of the time
• Conditional and overall probabilities constitute a rule
• A high classification rate indicates the attributes have more equal influence
• No iterative process, faster on larger data sets
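The prior-times-conditionals computation can be sketched directly. This is a minimal version with no smoothing (a zero count zeroes out a class, which real implementations avoid with Laplace smoothing); the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def naive_bayes(rows, labels, new_row):
    """Score each class as P(class) * product of P(value | class); pick the max."""
    n = len(labels)
    class_counts = Counter(labels)
    # cond[attr][class][value] = frequency
    cond = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for a, v in enumerate(row):
            cond[a][y][v] += 1
    scores = {}
    for c, cc in class_counts.items():
        p = cc / n                              # prior probability of the class
        for a, v in enumerate(new_row):
            p *= cond[a][c][v] / cc             # conditional probability (unsmoothed)
        scores[c] = p
    return max(scores, key=scores.get)

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["yes", "yes", "no", "yes"]
pred = naive_bayes(rows, labels, ("rainy", "mild"))  # "no" wins: 0.25 vs ~0.083
```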
J48 / C4.5
Past data
(known outcome)
New instance
Classified
Follow decision tree to a
leaf or class
Top-down recursive
algorithm that determines
splitting points based on
information gain
How did J48 do?
• Correctly classified 88.4% of the time
• Decision tree generated
• Balances the discrimination of OneR against the fairness of Naïve Bayes
• Decision trees are popular, intuitive, easy to create and easy to interpret
• People like decision trees. They tell a nice story
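The quantity the tree builder maximises at each split can be computed in a few lines. Strictly, C4.5/J48 uses the gain *ratio* (gain normalised by split entropy) and handles numeric thresholds and pruning; this sketch shows plain information gain on hypothetical categorical data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

rows = [("sunny", "hot"), ("sunny", "hot"), ("rainy", "mild"), ("rainy", "mild")]
labels = ["yes", "yes", "no", "no"]
# Attribute 0 separates the classes perfectly, so it gains the full 1 bit.
gain0 = information_gain(rows, labels, 0)
```

The tree builder computes this gain for every candidate attribute, splits on the best one, and recurses on each branch until the leaves are (nearly) pure.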
ZeroR
• Correct classification rate – 28.2%
• Established the baseline accuracy
• Always guess stage 3 CKD
Naïve Bayes
• Correct classification rate – 56.6%
• Used overall probabilities to
pick the most probable class
OneR
• Correct classification rate – 80.2%
• Serum creatinine
• < 0.85 – Healthy
• < 1.15 – Stage 2
• < 2.25 – Stage 3
• ≥ 2.25 – Stage 5
J48 / C4.5
• Correct classification rate – 88.4%
Does this make sense?
Other important concepts
in machine learning.
Cross Validation
• Split the data into ten slices
• Hold out one slice and build the
model on the other nine slices
• Test on the ‘held out’ slice
• Repeat, holding out a different slice each
time, until every slice has been used for
testing exactly once
• Average the ten test results to estimate accuracy
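The hold-out rotation above can be sketched as a plain index generator, here driving a ZeroR-style majority predictor on a hypothetical two-class label list:

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs; each of the k slices is held out once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Hypothetical 60/40 label split, evaluated with 10-fold cross validation.
labels = ["a"] * 60 + ["b"] * 40
hits = total = 0
for train, test in k_fold_indices(len(labels)):
    majority = Counter(labels[j] for j in train).most_common(1)[0][0]
    hits += sum(labels[j] == majority for j in test)
    total += len(test)
accuracy = hits / total  # 0.6: the majority class is right 60% of the time
```

Because every instance is tested exactly once, the averaged score estimates performance on unseen data rather than on the data the model memorised.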
Overfitting
• A classification rule that is ‘over fit’ is so specific to the training data set that it does
not generalize to the broader population
• Limiting the complexity of rules can help prevent overfitting
• Large, representative data sets also help fight overfitting
• A perennial problem in machine learning
• Be a skeptical data scientist
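The extreme case of overfitting is a model that simply memorises its training set. A small illustration on hypothetical data, where each instance carries a unique id: the memoriser scores perfectly on data it has seen and has no answer at all for anything new, while a deliberately simple rule generalises.

```python
from collections import Counter

# Hypothetical data: (unique id, label) pairs.
data = [(i, "yes" if i % 3 == 0 else "no") for i in range(200)]
train, test = data[:100], data[100:]

# "Over-fit" model: memorise every training example verbatim.
memory = {x: y for x, y in train}
train_acc = sum(memory[x] == y for x, y in train) / len(train)  # 1.0, trivially

# The memoriser has never seen any test id, so it cannot classify them at all.
unseen = sum(x not in memory for x, _ in test)                  # all 100 of them

# A deliberately simple model (training majority class) still works on new data.
majority = Counter(y for _, y in train).most_common(1)[0][0]
test_acc_simple = sum(majority == y for _, y in test) / len(test)
```

Perfect training accuracy is exactly the symptom to be suspicious of: always judge a model on held-out data.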
Questions?

Introduction to Machine Learning & Classification
