Understanding Basics of Machine Learning
Pranav Ainavolu
Microsoft MVP | Senior Developer at Realpage
@a_pranav | http://pranavon.net/
Agenda
1) data science
2) prediction
3) process
4) models
5) AzureML
data science
• key word: “science”
• try stuff
• it (might not | won’t) work the first time
• this might work… → question
• wikipedia time → research
• I have an idea → hypothesis
• try it out → experiment
• did this even work? → analysis
• time for a better idea → conclusion
machine learning
• finding (and exploiting) patterns in data
• replacing “human writing code” with
“human supplying data”
• system figures out what the person wants
based on examples
• need to abstract from “training” examples
to “test” examples
• most central issue in ML: generalization
machine learning
• split into two (ish) areas
• supervised learning
  • predicting the future
  • learn from past examples to predict the future
• unsupervised learning
  • understanding the past
  • making sense of data
  • learning the structure of data
  • compressing data for consumption
neat applications
neat applications
• spam catchers
• ocr (optical character recognition)
• natural language processing
• machine translation
• biology
• medicine
• robotics (autonomous systems)
• etc…
prediction
making decisions
making decisions
• what kinds of decisions are we making?
  • binary classification
    • yes/no, 1/0, male/female
  • multi-class classification
    • {A, B, C, D, F} (Grade), {1, 2, 3, 4} (Class), {teacher, student, secretary}
  • regression
    • a number between 0 and 100, a real value (sketched below)
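As a minimal sketch, the three kinds of labels look like this in Python (the values are made up for illustration):

```python
# illustrative label vectors for each kind of decision (made-up values)
y_binary = [1, 0, 0, 1]                  # yes/no, 1/0
y_multiclass = ["A", "B", "F", "C"]      # one grade out of {A, B, C, D, F}
y_regression = [12.5, 40.0, 87.3, 99.9]  # a real value, e.g. between 0 and 100
```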
process
data → clean → transform → maths → model → predict
data
Class   | Outlook  | Temp | Windy
--------|----------|------|------
Play    | Sunny    | Low  | Yes
No Play | Sunny    | High | Yes
No Play | Sunny    | High | No
Play    | Overcast | Low  | Yes
Play    | Overcast | High | No
Play    | Overcast | Low  | No
No Play | Rainy    | Low  | Yes
Play    | Rainy    | Low  | No
?       | Sunny    | Low  | No
label (y): play / no play
features: outlook, temp, windy
values (x): [Sunny, Low, Yes]
A labeled dataset is a collection of (x, y) pairs.
Given a new x, how do we predict y?
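As a sketch, the table above as a labeled dataset in Python (the model that answers the question comes later in the process):

```python
# each x is [outlook, temp, windy]; each y is the label
X = [
    ["Sunny",    "Low",  "Yes"],
    ["Sunny",    "High", "Yes"],
    ["Sunny",    "High", "No"],
    ["Overcast", "Low",  "Yes"],
    ["Overcast", "High", "No"],
    ["Overcast", "Low",  "No"],
    ["Rainy",    "Low",  "Yes"],
    ["Rainy",    "Low",  "No"],
]
y = ["Play", "No Play", "No Play", "Play", "Play", "Play", "No Play", "Play"]

x_new = ["Sunny", "Low", "No"]  # the "?" row: given this x, predict y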
clean / transform / maths
Class   | Outlook      | Temp   | Windy
--------|--------------|--------|-------
Play    | Sunny        | Lowest | Yes
No Play | ?            | High   | Yes
No Play | Sunny        | High   | KindOf
Play    | Overcast     | ?      | Yes
Play    | Turtle Cloud | High   | No
Play    | Overcast     | ?      | No
No Play | Rainy        | Low    | 28%
Play    | Rainy        | Low    | No
?       | Sunny        | Low    | No
need to clean up data
need to convert to model-able form (linear algebra); a cleanup sketch follows below
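A sketch of the cleanup in pandas; which values count as valid, and how to impute the rest, are judgment calls:

```python
import pandas as pd

# the messy table from the slide (values like "Turtle Cloud" and "28%" are noise)
df = pd.DataFrame(
    [["Play", "Sunny", "Lowest", "Yes"],
     ["No Play", None, "High", "Yes"],
     ["No Play", "Sunny", "High", "KindOf"],
     ["Play", "Overcast", None, "Yes"],
     ["Play", "Turtle Cloud", "High", "No"],
     ["Play", "Overcast", None, "No"],
     ["No Play", "Rainy", "Low", "28%"],
     ["Play", "Rainy", "Low", "No"]],
    columns=["Class", "Outlook", "Temp", "Windy"])

# clean: keep only known category values, everything else becomes NaN
df["Outlook"] = df["Outlook"].where(df["Outlook"].isin(["Sunny", "Overcast", "Rainy"]))
df["Temp"] = df["Temp"].where(df["Temp"].isin(["Low", "High"]))
df["Windy"] = df["Windy"].where(df["Windy"].isin(["Yes", "No"]))

# impute missing values with the most common value in each column
df = df.fillna(df.mode().iloc[0])

# transform: one-hot encode so the model sees numbers (linear algebra)
X = pd.get_dummies(df[["Outlook", "Temp", "Windy"]])
y = (df["Class"] == "Play").astype(int)
```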
yak shaving
Any apparently useless activity which, by allowing you to overcome intermediate difficulties, allows you to solve a larger problem.
“I was doing a bit of yak shaving this morning, and it looks like it might have paid off.”
http://en.wiktionary.org/wiki/yak_shaving
clean / transform / maths
Class   | Outlook  | Temp | Windy
--------|----------|------|------
Play    | Sunny    | Low  | Yes
No Play | Sunny    | High | Yes
No Play | Sunny    | High | No
Play    | Overcast | Low  | Yes
Play    | Overcast | High | No
Play    | Overcast | Low  | No
No Play | Rainy    | Low  | Yes
Play    | Rainy    | Low  | No
?       | Sunny    | Low  | No
need to clean up data
need to convert to model-able form (linear algebra)
model
Class   | Outlook  | Temp | Windy
--------|----------|------|------
Play    | Sunny    | Low  | Yes
No Play | Sunny    | High | Yes
No Play | Sunny    | High | No
Play    | Overcast | Low  | Yes
Play    | Overcast | High | No
Play    | Overcast | Low  | No
No Play | Rainy    | Low  | Yes
Play    | Rainy    | Low  | No
?       | Sunny    | Low  | No
predict
Class | Outlook | Temp | Windy
?     | Sunny   | Low  | No
→ PLAY!!!
models
how do we build them?
linear classifiers
• in order to classify things properly we need:
  • a way to mathematically represent examples
  • a way to separate classes (yes/no)
• “decision boundary”
• excel example
• graph example
linear classifiers
• dot product of vectors
  • [ 3, 4 ] ● [ 1, 2 ] = (3 × 1) + (4 × 2) = 11
  • a ● b = | a | × | b | cos θ
  • when does this equal 0?
• why would this be useful?
  • the decision boundary can be represented using a single vector
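A quick NumPy check of both claims (the boundary vector w below is made up):

```python
import numpy as np

a = np.array([3, 4])
b = np.array([1, 2])
print(np.dot(a, b))  # 11 = (3 * 1) + (4 * 2)

# the dot product is 0 exactly when the vectors are perpendicular (cos θ = 0)
w = np.array([1, -1])                # a decision boundary vector
print(np.dot(w, np.array([2, 2])))   # 0: the point lies on the boundary
print(np.dot(w, np.array([3, 1])))   # > 0: one side (class 1)
print(np.dot(w, np.array([1, 3])))   # < 0: other side (class 2)
```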
perceptron
…and other linear models
linear classifiers
• Frank Rosenblatt, Cornell 1957
  • let’s make a line (by using a single vector)
  • take the dot product between the line and the new point
  • > 0 → belongs to class 1
  • < 0 → belongs to class 2
  • == 0 → we don’t know (flip a coin)
  • for each example, if we make a mistake, move the line (sketched below)
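A minimal sketch of that algorithm in NumPy, assuming labels are encoded as ±1 (the toy data is made up):

```python
import numpy as np

def perceptron(X, y, epochs=20):
    """Rosenblatt's update: if we make a mistake, move the line toward it.
    X: (n, d) array of points, y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:  # mistake (or on the line)
                w += y_i * x_i                   # move the line
                b += y_i
    return w, b

# toy linearly separable data
X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # should match y
```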
perceptron
point demo
perceptron
what if….
kernel methods
models
kernel methods
adding all squares and pairwise products of $n$ features gives $2n + \binom{n}{2} = 2n + \frac{n(n-1)}{2}$ features…. (e.g., $n = 100$ raw features already becomes 5,150)
perceptron
• minimize mistakes by moving w

$$\operatorname*{arg\,min}_{(\boldsymbol{w},\, b)} \; \frac{1}{2}\lVert\boldsymbol{w}\rVert^2 \quad \text{subject to:} \quad y_i\,(\boldsymbol{w}\cdot\boldsymbol{x}_i - b) \ge 1$$
perceptron
• eventually this becomes an optimization problem

$$L(\alpha) = \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \boldsymbol{x}_i^{T}\boldsymbol{x}_j$$

subject to:

$$\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$
perceptron
• eventually this becomes an optimization problem

$$L(\alpha) = \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(\boldsymbol{x}_i, \boldsymbol{x}_j)$$

subject to:

$$\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$

(the dot product $\boldsymbol{x}_i^{T}\boldsymbol{x}_j$ has been replaced by a kernel $k$)
perceptron (reminder)
• Frank Rosenblatt, Cornell 1957
  • let’s make a line (by using a single vector)
  • take the dot product between the line and the new point
  • > 0 → belongs to class 1
  • < 0 → belongs to class 2
  • == 0 → we don’t know (flip a coin)
  • for each example, if we make a mistake, move the line
kernel (one weird trick….)
• store the dot products in a table

$$K = \begin{bmatrix} \boldsymbol{x}_0^{T}\boldsymbol{x}_0 & \cdots & \boldsymbol{x}_0^{T}\boldsymbol{x}_j \\ \vdots & \ddots & \vdots \\ \boldsymbol{x}_i^{T}\boldsymbol{x}_0 & \cdots & \boldsymbol{x}_i^{T}\boldsymbol{x}_j \end{bmatrix}$$

• call it the “kernel matrix” (and the substitution the “kernel trick”)
• project into any space and still learn a linear model
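A sketch of that table in NumPy (points made up). Since the dual problem above only ever reads these pairwise products, any kernel function k can fill the table instead and the learner stays the same:

```python
import numpy as np

X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])

# kernel matrix: every pairwise dot product, stored once in an (n, n) table
K = X @ X.T  # K[i, j] == X[i] . X[j]
print(K.shape)
```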
support vector machines
• this method is the basis for SVMs
• returns a small set of support vectors (≪ n) used to make the decision
• essentially changes the space to make the data separable
kernels
• polynomial kernel

$$K(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^{T}\boldsymbol{y} + c)^{d}$$

• RBF kernel

$$K(\boldsymbol{x}, \boldsymbol{y}) = \exp\!\left(-\frac{\lVert\boldsymbol{x} - \boldsymbol{y}\rVert_2^2}{2\sigma^2}\right)$$
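Both kernels as short NumPy functions (the parameter defaults c, d, sigma are illustrative):

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    # K(x, y) = (x . y + c)^d
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
```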
what if….
neural networks
models
neural networks
neural networks
[diagram: a small network; hidden units h1, h2, h3 and a bias unit B1 (each hidden unit a linear method plus a nonlinearity) combine into a single “Play?” output]
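A sketch of the forward pass that diagram describes, with made-up weights (sigmoid is one common choice of nonlinearity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# three hidden units h1..h3, each a small linear model plus a nonlinearity,
# combined by one more linear model into "Play?" (weights are illustrative)
W1 = np.random.randn(3, 4)  # hidden layer: 3 units over 4 input features
b1 = np.zeros(3)            # the bias unit B1
w2 = np.random.randn(3)     # output layer weights
b2 = 0.0

x = np.array([1.0, 0.0, 1.0, 0.0])       # one encoded example
h = sigmoid(W1 @ x + b1)                 # h1, h2, h3
play_probability = sigmoid(w2 @ h + b2)  # Play?
print(play_probability)
```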
decision trees
models
decision trees
Class   | Outlook  | Temp | Windy
--------|----------|------|------
Play    | Sunny    | Low  | Yes
No Play | Sunny    | High | Yes
No Play | Sunny    | High | No
Play    | Overcast | Low  | Yes
Play    | Overcast | High | No
Play    | Overcast | Low  | No
No Play | Rainy    | Low  | Yes
Play    | Rainy    | Low  | No
?       | Sunny    | Low  | No
decision trees
• how should the computer split?
  • information gain (with entropy)
  • entropy measures how disorganized your answer is.
  • information gain says: if I separate the answer by the values in a particular column, does the answer become *more* organized?
decision trees
• calculating information gain:
  • $H(y)$ – how messy is the answer?
  • $H(y \mid a)$ – how messy is the answer if we know $a$?

$$IG(y, a) = H(y) - H(y \mid a), \qquad a \in \mathrm{Attr}(x)$$
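A sketch of both quantities in plain Python, run on the weather dataset from earlier:

```python
import math
from collections import Counter

def entropy(labels):
    """H(y): how messy the answer is."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(y, a) = H(y) - H(y | a): how much messiness knowing a removes."""
    n = len(labels)
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    h_given_a = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - h_given_a

y = ["Play", "No Play", "No Play", "Play", "Play", "Play", "No Play", "Play"]
outlook = ["Sunny", "Sunny", "Sunny", "Overcast", "Overcast", "Overcast", "Rainy", "Rainy"]
print(information_gain(y, outlook))  # higher gain => better column to split on
```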
decision trees
demo
do they work?
testing
how well is it doing?
• train: use 80% of the labeled data
• test: use the held-out 20%
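A sketch with scikit-learn, assuming X and y are the encoded features and labels from earlier:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hold out 20% of the labeled data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out 20%
```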
AzureML
putting it all together
process reminder (same on Azure)
data → clean → transform → maths → model → predict
experiments
putting it all together
confusion matrix

                | Truth: true    | Truth: false
Guess: positive | true positive  | false positive
Guess: negative | false negative | true negative

$$\mathit{precision} = \frac{tp}{tp + fp} \qquad \mathit{recall} = \frac{tp}{tp + fn} \qquad \mathit{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$
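The three formulas in plain Python, with made-up counts as a worked example:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. 40 true positives, 10 false positives, 5 false negatives, 45 true negatives
print(precision(40, 10))        # 0.8
print(recall(40, 5))            # ~0.889
print(accuracy(40, 45, 10, 5))  # 0.85
```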
Thank you!
