Nearest Neighbor, Decision Tree
Danna Gurari
University of Texas at Austin
Spring 2021
https://www.ischool.utexas.edu/~dannag/Courses/IntroToMachineLearning/CourseContent.html
Review
• Last week:
• Binary classification applications
• Evaluating classification models
• Biological neurons: inspiration
• Artificial neurons: Perceptron & Adaline
• Gradient descent
• Assignments (Canvas)
• Lab assignment 1 due tonight
• Problem set 3 due next week
• Questions?
Today’s Topics
• Multiclass classification applications and evaluating models
• Motivation for new ML era: need non-linear models
• Nearest neighbor classification
• Decision tree classification
• Parametric versus non-parametric models
Today’s Focus: Multiclass Classification
Predict 3+ classes
Multiclass Classification: Cancer Diagnosis
https://news.stanford.edu/2017/01/25/artificial-intelligence-used-identify-skin-cancer/
https://www.nature.com/articles/nature21056
Doctor-level performance in recognizing 2,032 diseases
Multiclass Classification: Face Recognition
Security
https://www.anyvision.co/
Visual assistance for people
with vision impairments
Social media
Multiclass Classification: Shopping
Camfind (https://venturebeat.com/2014/09/24/camfind-app-brings-accurate-visual-search-to-google-glass-exclusive/)
Multiclass Classification: Song Recognition
https://gbksoft.com/blog/how-to-make-a-shazam-like-app/
Goal: Design Models that Generalize Well to
New, Previously Unseen Examples
Input:
Label: Cat Dog Person
1. Split data into a “training set” and “test set”
Goal: Design Models that Generalize Well to
New, Previously Unseen Examples
Training Data Test Data
Input:
Label: Cat Dog Person
Training Data Test Data
Goal: Design Models that Generalize Well to
New, Previously Unseen Examples
Input:
Label: Cat Dog Person
2. Train model on “training set” to try to minimize prediction error on it
Prediction Model
Goal: Design Models that Generalize Well to
New, Previously Unseen Examples
3. Apply trained model on “test set” to measure generalization error
? ? ?
Input:
Label:
Prediction Model
Goal: Design Models that Generalize Well to
New, Previously Unseen Examples
3. Apply trained model on “test set” to measure generalization error
Cat ? ?
Input:
Label:
Prediction Model
Goal: Design Models that Generalize Well to
New, Previously Unseen Examples
3. Apply trained model on “test set” to measure generalization error
Cat Giraffe ?
Input:
Label:
Prediction Model
Goal: Design Models that Generalize Well to
New, Previously Unseen Examples
3. Apply trained model on “test set” to measure generalization error
Cat Giraffe Person
Input:
Label:
Prediction Model
Goal: Design Models that Generalize Well to
New, Previously Unseen Examples
3. Apply trained model on “test set” to measure generalization error
Cat Giraffe Person
Input:
Label:
Human Annotated Label: Cat Pumpkin Person
Evaluation Methods
• Accuracy: percentage correct
• Precision: of the examples predicted to be a class, the fraction that truly belong to it (per class: TP / (TP + FP))
• Recall: of the examples truly belonging to a class, the fraction predicted as that class (per class: TP / (TP + FN))
• Confusion matrix; e.g., http://gabrielelanaro.github.io/blog/2016/02/03/multiclass-evaluation-measures.html
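As a minimal sketch (not from the slides) of how these pieces fit together in code, assuming scikit-learn and its built-in 3-class iris dataset:

```python
# Minimal sketch: evaluating a multiclass classifier with accuracy,
# per-class precision/recall, and a confusion matrix (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)                    # 3 classes -> multiclass problem

# 1. Split data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Train a model on the training set (KNN is covered later in this lecture)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# 3. Apply the trained model to the test set to estimate generalization error
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))         # per-class precision and recall
print(confusion_matrix(y_test, y_pred))              # rows = true class, columns = predicted class
```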
Today’s Topics
• Multiclass classification applications and evaluating models
• Motivation for new ML era: need non-linear models
• Nearest neighbor classification
• Decision tree classification
• Parametric versus non-parametric models
Recall: Historical Context of ML Models
Timeline: 1613 human “computers”; early 1800s linear regression; 1945 first programmable machine; 1950 Turing Test; 1956 AI; 1957 Perceptron; 1959 machine learning; 1974-1980 1st AI winter; 1987-1993 2nd AI winter
Recall: Vision for Perceptron
Frank Rosenblatt
(Psychologist)
“[The perceptron is] the embryo of an
electronic computer that [the Navy] expects
will be able to walk, talk, see, write,
reproduce itself and be conscious of its
existence…. [It] is expected to be finished in
about a year at a cost of $100,000.”
1958 New York Times article: https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html
https://en.wikipedia.org/wiki/Frank_Rosenblatt
“Perceptrons” Book: Instigator for “AI Winter”
Timeline: 1613 human “computers”; early 1800s linear regression; 1945 first programmable machine; 1950 Turing Test; 1956 AI; 1957 Perceptron; 1959 machine learning; 1974-1980 1st AI winter; 1987-1993 2nd AI winter
1969: Minsky & Papert publish a book called “Perceptrons” to discuss its limitations
Perceptron Limitation: XOR Problem
XOR = “Exclusive Or”
- Input: two binary values x1 and x2
- Output:
- 1, when exactly one input equals 1
- 0, otherwise
x1  x2  x1 XOR x2
0   0   ?
0   1   ?
1   0   ?
1   1   ?
Perceptron Limitation: XOR Problem
XOR = “Exclusive Or”
- Input: two binary values x1 and x2
- Output:
- 1, when exactly one input equals 1
- 0, otherwise
x1  x2  x1 XOR x2
0   0   0
0   1   ?
1   0   ?
1   1   ?
Perceptron Limitation: XOR Problem
XOR = “Exclusive Or”
- Input: two binary values x1 and x2
- Output:
- 1, when exactly one input equals 1
- 0, otherwise
x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   ?
1   1   ?
Perceptron Limitation: XOR Problem
XOR = “Exclusive Or”
- Input: two binary values x1 and x2
- Output:
- 1, when exactly one input equals 1
- 0, otherwise
x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   ?
Perceptron Limitation: XOR Problem
XOR = “Exclusive Or”
- Input: two binary values x1 and x2
- Output:
- 1, when exactly one input equals 1
- 0, otherwise
x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0
Perceptron Limitation: XOR Problem
x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0
How to separate 1s from 0s with a perceptron (linear function)?
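A small hedged demo of the limitation, assuming scikit-learn: a linear perceptron cannot fit XOR perfectly, while a non-linear model (1-nearest-neighbor, covered next) can.

```python
# Minimal sketch (assumes scikit-learn): a linear perceptron cannot fit XOR,
# while a non-linear model such as 1-nearest-neighbor can.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                                 # x1 XOR x2

linear = Perceptron(max_iter=1000).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

print("Perceptron accuracy on XOR:", linear.score(X, y))   # < 1.0: XOR is not linearly separable
print("1-NN accuracy on XOR:", knn.score(X, y))            # 1.0
```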
Perceptron Limitation: XOR Problem
Frank Rosenblatt
(Psychologist)
“[The perceptron is] the embryo of an
electronic computer that [the Navy] expects
will be able to walk, talk, see, write,
reproduce itself and be conscious of its
existence…. [It] is expected to be finished in
about a year at a cost of $100,000.”
1958 New York Times article: https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html
How can a machine be “conscious”
when it can’t solve the XOR problem?
How to Overcome Limitation?
Non-linear models: e.g.,
• Linear regression: perform non-linear transformation of input features
• K-nearest neighbors
• Decision trees
• And many more to follow…
Today’s Topics
• Multiclass classification applications and evaluating models
• Motivation for new ML era: need non-linear models
• Nearest neighbor classification
• Decision tree classification
• Parametric versus non-parametric models
Historical Context of ML Models
Timeline: 1613 human “computers”; early 1800s linear regression; 1945 first programmable machine; 1950 Turing Test; 1951 K-Nearest Neighbors; 1956 AI; 1957 Perceptron; 1959 machine learning; 1974-1980 1st AI winter; 1987-1993 2nd AI winter
E. Fix and J.L. Hodges. Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, 1951
Recall: Instance-Based Learning
Memorizes examples and uses a similarity
measure to those examples to make predictions
Figure Source: Hands-on Machine Learning with Scikit-Learn & TensorFlow, Aurelien Geron
e.g., Predict What Scene The Image Shows
Input Output
Kitchen
Database Search
e.g., Predict What Scene The Image Shows
1. Create Large Database
Input: (images)
Label: Kitchen, Coast, Store
e.g., Predict What Scene The Image Shows
2. Organize Database so Visually Similar Examples Neighbor Each Other
Adapted from slides by Antonio Torralba
3. Predict Class Using Label of Most Similar Example(s) in the Database
Input:
Label: ?
Adapted from slides by Antonio Torralba
e.g., Predict What Scene The Image Shows
3. Predict Class Using Label of Most Similar Example(s) in the Database
e.g., Predict What Scene The Image Shows
Input:
Label: Cathedral
Adapted from slides by Antonio Torralba
3. Predict Class Using Label of Most Similar Example(s) in the Database
e.g., Predict What Scene The Image Shows
Input:
Label: ?
Adapted from slides by Antonio Torralba
3. Predict Class Using Label of Most Similar Example(s) in the Database
e.g., Predict What Scene The Image Shows
Input:
Label: Highway
Adapted from slides by Antonio Torralba
K-Nearest Neighbor Classification
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
Training Data:
• Given:
• x = {10.5, 3}
• Predict:
• When k = 1:
Novel Examples:
K-Nearest Neighbor Classification
Training Data:
• Given:
• x = {10.5, 3}
• Predict:
• When k = 1:
• Class 1
• When k = 6:
• How to avoid ties?
• Set “k” to odd value
for binary problems
• Prefer “closer”
neighbors
Novel Examples:
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
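A minimal sketch of the prediction rule described above, with made-up training points (the real ones are in the notebook's figure): majority vote over the k nearest points, preferring the closest neighbor to break ties.

```python
# Minimal sketch of k-nearest-neighbor prediction by majority vote,
# breaking ties by falling back to the single closest neighbor.
# The training points below are illustrative, not the notebook's data.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1):
    dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    votes = Counter(y_train[nearest]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:   # tie: prefer the closer neighbor
        return y_train[nearest[0]]
    return votes[0][0]

X_train = np.array([[9.0, 3.2], [10.1, 2.8], [11.8, 4.0], [12.5, 1.5]])  # made-up data
y_train = np.array([1, 1, 0, 0])
print(knn_predict(X_train, y_train, np.array([10.5, 3.0]), k=1))
```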
K-Nearest Neighbors: Measuring Distance
How to measure distance between a novel example and a training example?
• Commonly used: Minkowski distance
• When p = 2: Euclidean distance
• When p = 1: Manhattan distance
https://en.wikipedia.org/wiki/Taxicab_geometry
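The distance formulas themselves appear as figures on the slide; the standard definitions are:

```latex
d_{\text{Minkowski}}(\mathbf{x}, \mathbf{z}) = \Big( \sum_{j=1}^{d} |x_j - z_j|^{p} \Big)^{1/p}
d_{\text{Euclidean}}(\mathbf{x}, \mathbf{z}) = \sqrt{\sum_{j=1}^{d} (x_j - z_j)^2} \quad (p = 2)
d_{\text{Manhattan}}(\mathbf{x}, \mathbf{z}) = \sum_{j=1}^{d} |x_j - z_j| \quad (p = 1)
```

Euclidean distance is the usual straight-line distance; Manhattan distance sums absolute coordinate differences, hence “taxicab geometry.”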
K-Nearest Neighbors: Measuring Distance
• Given:
• x = {10.5, 3}
• k =1:
Euclidean Distance
Training Data:
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbors: Measuring Distance
• Given:
• x = {10.5, 3}
• k =1:
Euclidean Distance
Training Data:
Note: May want to
scale the data to the
same range first
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
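Because a feature with a larger numeric range can dominate the distance, a common precaution is to rescale features first; a minimal sketch assuming scikit-learn:

```python
# Minimal sketch (assumes scikit-learn): rescale features to a common range
# before nearest-neighbor search so no single feature dominates the distance.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=1))
# model.fit(X_train, y_train); model.predict([[10.5, 3.0]])  # with training data as before
```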
K-Nearest Neighbors: Measuring Distance
• Given:
• x = {10.5, 3}
• k =1:
Manhattan Distance
Training Data:
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbors: Measuring Distance
• Given:
• x = {10.5, 3}
• k =1:
Manhattan Distance
Training Data:
Note: May want to
scale the data to the
same range first
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbors: Measuring Distance
How to measure distance for categorical attributes?
• e.g., Train = blue
• e.g., Test = blue; identical values, so assign distance 0
• e.g., Test = white; different values, so assign distance 1
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
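This 0/1-per-attribute rule is a Hamming-style distance; a tiny sketch with hypothetical attribute values:

```python
# Minimal sketch: distance for categorical features, 0 if values match, 1 otherwise,
# summed over attributes (a Hamming-style distance). Example values are hypothetical.
def categorical_distance(a, b):
    return sum(0 if x == y else 1 for x, y in zip(a, b))

print(categorical_distance(["blue", "round"], ["blue", "square"]))  # 1
```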
K-Nearest Neighbor Classification: What “K”?
When K=1: When K=3:
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbor Classification: What “K”?
When K=1: When K=3:
Given that predictions can change with different
values of “k”, what value of “k” should one use?
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbor Classification: What “K”?
What happens to the decision boundary as “k” grows?
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbor Classification: What “K”?
What happens when “k” equals the training data size?
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbor Classification: What “K”?
(smaller “k”: higher model complexity) (larger “k”: lower model complexity)
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbor Classification: What “K”?
At what value
for “k” is model
overfitting the
most?
k = 1
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
K-Nearest Neighbor Classification: What “K”?
What is the best
value for “k”?
k = 6
https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb
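One common recipe (a sketch assuming scikit-learn, not the notebook's exact code) is to try several values of “k” and keep the one with the best validation accuracy:

```python
# Minimal sketch (assumes scikit-learn): pick "k" by cross-validated accuracy.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k(X_train, y_train, candidates=(1, 3, 5, 7, 9, 11)):
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train, y_train, cv=5).mean()
              for k in candidates}
    return max(scores, key=scores.get)   # k with the highest mean validation accuracy
```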
K-Nearest Neighbor: How to Use to Predict
More than Two Classes?
• Tally the number of neighbors belonging to each class and again choose the majority-vote winner
What are Strengths of KNN?
• Adapts as new data is added
• Training is relatively fast
• Easy to understand
What are Weaknesses of KNN?
• For large datasets, requires large storage space
• For large datasets, this approach can be very slow or infeasible
• Note: can improve speed with efficient data structures such as KD-trees
• Vulnerable to noisy/irrelevant examples
• Sensitive to imbalanced datasets where more frequent class will
dominate majority voting
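For the speed weakness, scikit-learn can index the training data with a KD-tree instead of brute-force search; a one-line hedged example:

```python
# Assumes scikit-learn: use a KD-tree index to speed up neighbor search on larger datasets.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
```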
Today’s Topics
• Multiclass classification applications and evaluating models
• Motivation for new ML era: need non-linear models
• Nearest neighbor classification
• Decision tree classification
• Parametric versus non-parametric models
Historical Context of ML Models
Timeline: 1613 human “computers”; early 1800s linear regression; 1945 first programmable machine; 1950 Turing Test; 1951 K-nearest neighbors; 1956 AI; 1957 Perceptron; 1959 machine learning; 1962 concept learning; 1974-1980 1st AI winter; 1979 ID3; 1987-1993 2nd AI winter
Hunt, E.B. (1962). Concept learning: An information processing problem. New York: Wiley.
Quinlan, J.R. (1979). Discovering rules by induction from large collections of examples. In D. Michie (Ed.), Expert systems in the micro electronic age. Edinburgh University Press.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Example: Decision Tree
http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
Example: Decision Tree
http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
Test Example
Salary: $44,869
Commute: 35 min
Free Coffee: Yes
Example: Decision Tree
http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
Test Example
Salary: $62,200
Commute: 45 min
Free Coffee: Yes
Example: Decision Tree
http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
Machine must
learn to create a
decision tree
Decision Tree: Generic Structure
• Goal: predict class label (aka: class, ground truth)
• Representation: Tree
• Internal (non-leaf) nodes = tests an attribute
• Branches = attribute value
• Leaf = classification label
Decision Tree: Generic Structure
• Goal: predict class label
• Representation: Tree
• Internal (non-leaf) nodes = tests an attribute
• Branches = attribute value
• Leaf = classification label
How can a machine learn a decision tree?
Decision Tree: Generic Learning Algorithm
• Greedy approach (NP complete problem)
aka – “data” aka – “feature names”
Decision Tree: Generic Learning Algorithm
• Greedy approach (NP complete problem)
Decision Tree: Generic Learning Algorithm
• Greedy approach (NP complete problem)
Key Decision
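The algorithm itself is shown as a figure on the slide; the sketch below is a hedged, ID3-style reconstruction of the standard greedy recursion (the toy movie data at the bottom is made up):

```python
# Hedged sketch (not the slide's exact pseudocode) of the greedy, recursive
# decision-tree learner: stop when a node is pure or out of attributes,
# otherwise split on the attribute with the highest information gain.
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    labels = [y for _, y in examples]
    groups = {}
    for x, y in examples:
        groups.setdefault(x[attr], []).append(y)
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attributes:                             # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a))   # key decision: "best" attribute
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        tree[best][value] = build_tree(subset, [a for a in attributes if a != best])
    return tree

# Examples are (feature_dict, label) pairs; this toy data is made up.
data = [({"Type": "Comedy", "Length": "Short"}, "Yes"),
        ({"Type": "Drama",  "Length": "Long"},  "No"),
        ({"Type": "Comedy", "Length": "Long"},  "No"),
        ({"Type": "Drama",  "Length": "Short"}, "Yes")]
print(build_tree(data, ["Type", "Length"]))
```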
Next “Best” Attribute: Use Entropy
Number of classes
Fraction of examples belonging to class i
Encodes in bits
In a binary setting,
• Entropy is 0 when fraction of examples belonging to a class is 0 or 1
• Entropy is 1 when fraction of examples belonging to each class is 0.5
https://www.logcalculator.net/log-2
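The formula shown on the slide is the standard entropy definition, where c is the number of classes and p_i is the fraction of examples belonging to class i (log base 2 gives bits):

```latex
H = -\sum_{i=1}^{c} p_i \log_2 p_i
```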
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Current entropy?
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Entropy if we split on “Type”?
• Left tree: “Comedy” = ?
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Entropy if we split on “Type”?
• Left tree: “Comedy” = 0.92
• Right tree: “Drama” = ?
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Entropy if we split on “Type”?
• Left tree: “Comedy” = 0.92
• Right tree: “Drama” = 0.97
• Information gain by split on “Type”?
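Information gain is the parent node's entropy minus the weighted average of the children's entropies. With S the examples at the node, S_Comedy and S_Drama the subsets on each branch (their sizes are in the slide's table), the “Type” split gives:

```latex
IG(S, \text{Type}) = H(S) \;-\; \frac{|S_{\text{Comedy}}|}{|S|}\,\underbrace{H(S_{\text{Comedy}})}_{0.92} \;-\; \frac{|S_{\text{Drama}}|}{|S|}\,\underbrace{H(S_{\text{Drama}})}_{0.97}
```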
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Entropy if we split on “Length”?
• Left tree: “Short” = ?
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Entropy if we split on “Length”?
• Left tree: “Short” = 0.92
• Middle tree: “Medium” = ?
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Entropy if we split on “Length”?
• Left tree: “Short” = 0.92
• Middle tree: “Medium” = 0.82
• Right tree: “Long” = ?
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Entropy if we split on “Length”?
• Left tree: “Short” = 0.92
• Middle tree: “Medium” = 0.82
• Right tree: “Long” = 0
• Information gain by split on “Length”?
Next “Best” Attribute: Use Entropy
e.g., Will you like a movie?
• Let C1 = “Yes” and C2 = “No”
• Entropy if we split on “IMDb Rating”?
• Order attribute values:
{4.5, 5.1, 6.9, 7.2, 7.5, 8.0, 8.3, 9.3}
Split at midpoint and measure entropy
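A small sketch of the candidate thresholds implied by this procedure, using the rating values listed above (note that 7.05, the threshold in the resulting tree, is the midpoint of 6.9 and 7.2):

```python
# Candidate split thresholds for a continuous attribute: midpoints between
# consecutive sorted values. Uses the IMDb ratings listed on the slide.
ratings = [4.5, 5.1, 6.9, 7.2, 7.5, 8.0, 8.3, 9.3]
thresholds = [round((a + b) / 2, 2) for a, b in zip(ratings, ratings[1:])]
print(thresholds)   # [4.8, 6.0, 7.05, 7.35, 7.75, 8.15, 8.8]
```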
Decision Tree: What is Our First Split?
• Greedy approach (NP complete problem)
IG = 0 IG = 0.19 IG = 0.95
Decision Tree: What Tree Results?
IMDb Rating
<= 7.05 > 7.05
Recurse on this tree?
Decision Tree: What Tree Results?
No
Recurse on this tree?
IMDb Rating
<= 7.05 > 7.05
Decision Tree: What Tree Results?
No Yes
IMDb Rating
<= 7.05 > 7.05
Decision Tree: Generic Learning Algorithm
• Greedy approach (NP complete problem)
Key Decision
• Entropy (maximize information gain)
• Gini Index (used in CART algorithm)
• Gain ratio (used in C4.5 algorithm)
• Mean squared error
• …
Overfitting
• At what tree size, does overfitting begin?
http://www.alanfielding.co.uk/multivar/crt/dt_example_04.htm
Evaluate on training data
Evaluate on test data
Regularization to Avoid Overfitting
• Pruning
• Pre-pruning: stop tree growth earlier
• Post-pruning: prune tree afterwards
https://www.clariba.com/blog/tech-20140811-decision-trees-with-sap-predictive-analytics-and-sap-hana-emilio-nieto
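As a hedged example of both flavors in scikit-learn (the parameter values here are arbitrary):

```python
# Assumes scikit-learn. Pre-pruning: cap tree growth up front (e.g., max_depth).
# Post-pruning: grow the full tree, then prune by cost-complexity (ccp_alpha).
from sklearn.tree import DecisionTreeClassifier

pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)   # alpha chosen arbitrarily here
# Both: model.fit(X_train, y_train), then compare train vs. test accuracy.
```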
Today’s Topics
• Multiclass classification applications and evaluating models
• Motivation for new ML era: need non-linear models
• Nearest neighbor classification
• Decision tree classification
• Parametric versus non-parametric models
Machine Learning Goal
• Learn function that maps input features (X) to an output prediction (Y)
Y = f(X)
Machine Learning Goal
• Learn function that maps input features (X) to an output prediction (Y)
• Parametric model: has a fixed number of parameters
Y = f(X)
Machine Learning Goal
• Learn function that maps input features (X) to an output prediction (Y)
• Parametric model: has a fixed number of parameters
• Non-parametric model: the number of parameters is not fixed in advance (it can grow with the amount of training data)
Y = f(X)
Class Discussion
• For each model, is it parametric or non-parametric?
• Linear regression
• K-nearest neighbors
• What are advantages and disadvantages of parametric models?
• What are advantages and disadvantages of non-parametric models?
Today’s Topics
• Multiclass classification applications and evaluating models
• Motivation for new ML era: need non-linear models
• Nearest neighbor classification
• Decision tree classification
• Parametric versus non-parametric models
References Used for Today’s Material
• Chapter 3 of Deep Learning book by Goodfellow et al.
• http://www.cs.utoronto.ca/~fidler/teaching/2015/slides/CSC411/06_trees.pdf
• http://www.cs.utoronto.ca/~fidler/teaching/2015/slides/CSC411/tutorial3_CrossVal-DTs.pdf
• http://www.cs.cmu.edu/~epxing/Class/10701/slides/classification15.pdf
