Classification: Machine Learning Basics and kNN
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2021)
Outline
 A brief overview of ML
 Key tasks in ML
 Why we need ML
 Why Python is so great for ML
 K-nearest neighbors algorithm
kNN Classification
kNN Regression
Some Issues in KNN
Python Modules to work on the ML Algorithms
Machine Learning (ML)
 With machine learning we can gain insight from a dataset.
 We’re going to ask the computer to make sense of the data.
 This is what we mean by learning.
 Machine learning is the process of turning data into information and
knowledge.
 Machine Learning lies at the intersection of computer science, engineering,
and statistics and often appears in other disciplines.
What is Machine Learning?
 It’s a tool that can be applied to many problems.
 Any field that needs to interpret and act on data can benefit
from Machine Learning techniques.
 There are many problems where the solution isn’t deterministic.
 That is, we don’t know enough about the problem or don’t have
enough computing power to properly model the problem.
Traditional Vs ML systems
 In Machine Learning, once the system is provided with the right
data and algorithms, it can "fish for itself".
Traditional Vs ML systems
 A key aspect of Machine Learning that makes it particularly appealing
in terms of business value is that it does not require as much explicit
programming in advance.
Sensor and the Data Deluge
 We have a tremendous amount of human-created data from the WWW,
but recently more non-human sources of data have been coming online.
Sensors connected to the web.
Sensors account for roughly 20% of non-video internet traffic.
 Data collected from mobile phones (three-axis accelerometers, temperature
sensors, and GPS receivers)
 The two trends of mobile computing and sensor-generated data mean that
we’ll be getting more and more data in the future.
Key Terminology
[Table: example training data with the features Weight, Wingspan, Webbed feet, and Back color, plus the target Species]
 Weight, Wingspan, Webbed feet, Back color are features or
attributes.
 An instance is made up of features. (also called an example or observation)
 Species is the target variable. (response, outcome, output etc.)
 Attributes can be numeric, binary, nominal.
Key Terminology
 To train the Machine Learning algorithm we need to feed it quality data known as a training set.
 In the above example, each training example (instance) has four features and one target variable.
 In a training set the target variable is known. (classification)
 The machine learns by finding some relationship between the features and the target variable.
 In the classification problem the target variables are called classes, and there are assumed to be a
finite number of classes.
Key Terminology Cont…
 To test machine learning algorithms a separate dataset is used which is called a test set.
 The target variable for each example from the test set isn’t given to the program.
 The program (model) decides which class each example belongs to.
 Then the predicted value is compared with the actual target variable.
Key Tasks of Machine Learning
 In classification, our job is to predict what class an instance of data should fall into.
 Regression is the prediction of a numeric value. (target variable)
 Classification and regression are examples of supervised learning.
 This set of problems is known as supervised because we’re telling the algorithm what to predict.
Key Tasks of Machine Learning
 The opposite of supervised learning is a set of tasks known as unsupervised learning.
 In unsupervised learning, there’s no label or target value given for the data. (grouping similar items is known as clustering)
 In unsupervised learning, we may also want to find statistical values that describe the data. This is known as density estimation.
 Another task of unsupervised learning is reducing the data from many features to a small number so that we can
properly visualize it.
Key Tasks of Machine Learning
 Common algorithms are used to perform classification, regression, clustering, and density estimation tasks.
 Balancing generalization and memorization (overfitting) is a common problem for many ML algorithms.
 Regularization techniques are used to reduce overfitting.
Key Tasks of Machine Learning
 There are two fundamental causes of prediction error: a model’s bias and its variance.
 A model with high variance over-fits the training data, while a model with high bias under-fits the training data.
 High bias, low variance (Under-fitting)
 Low bias, high variance (Over-fitting)
 High bias, high variance (Under/Over – fitting)
 Low bias, low variance (Good model)
 The predictive power of many Machine Learning algorithms improves as the amount of training data increases.
 Quality of data is also important.
Key Tasks of Machine Learning
 Ideally, a model will have both low bias and low variance, but efforts to reduce one frequently increase the other.
 This is known as the bias-variance trade-off.
 Common measurement of performance:
 Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
 Precision (P) = TP / (TP + FP)
 Recall (R) = TP / (TP + FN)
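As a small sketch (not from the slides), the three measures can be computed directly from the four confusion-matrix counts; the counts below are made up for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Made-up counts for illustration:
acc, p, r = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(f"ACC={acc:.2f}  P={p:.2f}  R={r:.2f}")  # ACC=0.85  P=0.89  R=0.80
```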
How to Choose the Right Algorithm
 First, you need to consider your goal.
 If you’re trying to predict or forecast a target value, then you need to look into supervised learning.
 If not, then unsupervised learning is the place you want to be.
 If you’ve chosen supervised learning, what’s your target value?
Discrete values (y/n, 1/2/3, Red/Yellow/Black) → classification
Continuous values (0.00 to 100.00, etc.) → regression
How to Choose the Right Algorithm
 Spend some time getting to know the data; the better we know it, the more successful an application we can build. (70-80% of the time goes here)
 Things to know about the data are these:
Are the features nominal or continuous?
Are there missing values in the features?
If there are missing values, why are there missing values?
Are there outliers in the data? (e.g., among 80, 81, 82, 83, 245, the value 245 is an outlier)
Etc…
 All of these features about your data can help you narrow the algorithm selection process.
How to Choose the Right Algorithm
 Finding the best algorithm is an iterative process of trial and error.
 Steps in developing a machine learning application:
Collect data: scraping a website, an RSS feed, an API, etc.
 Prepare the input data: make sure the data format is consistent and usable.
Analyze the input data: look at the data.
Understand the data.
Train the algorithm: this is where the ML takes place. (does not apply to unsupervised learning)
Test the algorithm: evaluate it. (if needed, go back to the 4th step)
Use it. (implement the ML application)
Problem Solving Framework
 Problem solving Framework for ML application:
Business issue understanding
Data understanding
Data preparation
Analysis / Modeling
Validation
Presentation / Visualization
Machine Learning Systems and Data
 In ML, instead of writing a program by hand for each specific
task, we collect lots of examples that specify the correct output
for a given input.
 The most important factor in ML is not the algorithm or the
software system: the quality of the data is the soul of an ML
system.
Machine Learning Systems and Data
 Invalid training data:
Garbage In ------ Garbage Out.
 Invalid dataset leads to invalid results.
 This is not to say that the training data needs to be perfect.
 Out of a million examples, a few inaccurate labels are
acceptable.
 The quality of the data is the soul of the ML systems.
Machine Learning Systems and Data
 “Garbage” can be several things:
Wrong labels (Dog – Cat, Cat – Dog)
Inaccurate and missing values
A biased dataset, etc.
 Handling missing data:
If only a small portion of rows and columns is affected – discard them,
Data imputation (time-series data) – carry the last valid value forward,
Substitute with mean or median,
Predicting the missing values from the available data,
A missing value can have a meaning on its own (missing).
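A minimal pandas sketch of these options; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing entries
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "salary": [8000.0, 12000.0, np.nan, 15000.0]})

dropped = df.dropna()                       # discard rows that contain missing values
mean_filled = df.fillna(df.mean())          # substitute with the column mean (or median)
forward_filled = df.ffill()                 # time-series style: carry the last valid value forward
df["salary_missing"] = df["salary"].isna()  # keep "missing" itself as a feature
```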
Machine Learning Systems and Data
 Having a clean dataset is not always enough.
 Features with large magnitudes can dominate features with small
magnitudes during the training.
 Example: Age [0-100] vs. salary [6,000 – 20,000] – addressed by scaling and
standardization.
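Both remedies are available in scikit-learn's preprocessing module; a brief sketch with illustrative age/salary values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative [age, salary] rows
X = np.array([[25, 8000], [40, 12000], [60, 20000]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)          # rescale each feature to [0, 1]
X_standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
```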
 Data imbalance:
No  Class   Count
1   Cat     5000
2   Dog     5000
3   Tiger    150
4   Cow       25

 Leave it as it is.
Undersampling (if all classes are equally important) [5000 → 25]
Oversampling (if all classes are equally important) [25 → 5000]
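Under- and oversampling can be sketched with scikit-learn's resample utility; the class lists below are illustrative stand-ins for real rows:

```python
from sklearn.utils import resample

# Illustrative stand-ins for the majority (Cat) and minority (Cow) rows
cats = [("cat", i) for i in range(5000)]
cows = [("cow", i) for i in range(25)]

cats_down = resample(cats, n_samples=25, replace=False, random_state=0)  # undersample 5000 -> 25
cows_up = resample(cows, n_samples=5000, replace=True, random_state=0)   # oversample 25 -> 5000
```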
Challenges in Machine Learning
 It requires considerable data and compute power.
 It requires knowledgeable data science specialists or teams.
 It adds complexity to the organization's data integration
strategy. (data-driven culture)
 Learning ML algorithms is challenging without an advanced
math background.
 The context of data often changes. (private data Vs public data)
 Algorithmic bias, privacy and ethical concerns may be
overlooked.
Stages of ML Process
 The first key step in preparing to explore and exploit ML is to
understand the basic stages involved.
Stages of ML Process
 Machine Learning Tasks and Subtasks:
Data Collection and Preparation
 Data collection is the process of gathering and measuring
information from countless different sources.
 Data is being generated at an unprecedented rate. These data can be:
 Numeric (temperature, loan amount, customer retention rate),
 Categorical (gender, color, highest degree earned), or
 Even free text (think doctor’s notes or opinion surveys).
 In order to use the data we collect to develop practical
solutions, it must be collected and stored in a way that makes
sense for the business problem at hand.
Data Collection and Preparation
[Figure: examples of applications and the types of data sources that feed them]
Data Collection and Preparation
 During an ML system development, we always rely on data.
 From training, tuning, and model selection to testing, we use three
different data sets: the training set, the validation set, and the
test set.
 The validation set is used to select and tune the final ML model.
 The test data set is used to evaluate how well your algorithm
was trained with the training data set.
Data Collection and Preparation
 Test sets typically represent 20% or 30% of the data. (cross-validation)
 The test set pairs input data with verified correct outputs, generally
confirmed by human verification.
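A common way to carve out the three sets with scikit-learn's train_test_split; the 60/20/20 proportions below are one reasonable choice, not a rule from the slides:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))             # placeholder features
y = [i % 2 for i in range(100)]  # placeholder labels

# Hold out 20% as the test set, then split 25% of the remainder off as validation,
# giving roughly 60% train / 20% validation / 20% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
```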
Data Collection and Preparation
 The most successful ML projects are those that integrate a data
collection strategy during the service/product life-cycle.
 It must be built into the core product itself.
 Basically, every time a user engages with the product/service,
you want to collect data from the interaction.
 The goal is to use this constant new data flow to improve your
product/service.
Data Collection and Preparation
 Solving the right problem:
 Understand the purpose for a model.
 Ask about who, what, when, where and why?
 Is the problem viable for machine learning?
Data Collection and Preparation
 Data preparation is a set of procedures that makes your dataset
more suitable for ML.
 Articulate the problem early
 Establish data collection mechanisms (data-driven culture)
 Format data to make it consistent
 Reduce data (attribute sampling)
 Complete data cleaning
 Decompose data (complex data set)
 Rescale data (data normalization)
 Discretize data (numerical → categorical values)
 Private datasets capture the specifics of your unique business
and potentially have all relevant attributes.
Data Collection, Preparation and Delivery
[Figure: the data collection, preparation, and delivery pipeline]
Python
 Python is a great language for ML.
Has clear syntax:
 High-level data types (lists, tuples, dictionaries, sets, etc.)
 Program in any style (OO, procedural, functional, and so on)
Makes text manipulation extremely easy.
There are a number of libraries:
 Libraries such as SciPy and NumPy for vector and matrix
operations.
 Matplotlib can plot 2D and 3D plots.
 Pandas: data manipulation in table form.
Classification with k-Nearest Neighbors
K-Nearest Neighbors (KNN)
 kNN is easy to grasp (understand and implement) and very
effective (a powerful tool).
 The model for kNN is the entire training dataset.
 Pros: High accuracy, insensitive to outliers, no assumptions
about data.
 Cons: computationally expensive, requires a lot of memory.
 Works with: Numeric values, nominal values. (Classification
and regression)
K-Nearest Neighbors (KNN)
 We have an existing set of example data (training set).
 We know what class each piece of the data should fall into.
 When we’re given a new piece of data without a label,
 we compare that new piece of data to every piece of existing data.
 We then take the most similar pieces of data (the nearest neighbors) and look at their
labels.
K-Nearest Neighbors (KNN)
 We have an existing set of example data (training set).
 We look at the top k most similar pieces of data from our known dataset. (usually less than 20)
 k is often set to an odd number to prevent ties.
 Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new
class we assign to the data we were asked to classify.
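A from-scratch sketch of this procedure, assuming Euclidean distance and a simple majority vote (the toy arrays are illustrative, not the movie data):

```python
from collections import Counter
import numpy as np

def knn_classify(query, X, labels, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.sqrt(((X - query) ** 2).sum(axis=1))  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]
print(knn_classify(np.array([0.1, 0.2]), X, labels, k=3))  # -> B
```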
K-Nearest Neighbors (KNN)
 kNN and other non-parametric models can be useful when training data is abundant and you have little prior
knowledge about the relationship between the response and explanatory variables.
 kNN makes only one assumption: instances that are near each other are likely to have similar values of the
response variable.
 A model that makes assumptions about the relationship can be useful if training data is scarce or if you already
know about the relationship.
KNN Classification
 Classifying movies into romance or action movies.
 The number of kisses and kicks in each movie (features)
 Now, you find a movie you haven’t seen yet and want to know if it’s a romance movie or an action movie.
 To determine this, we’ll use the kNN algorithm.
KNN Classification
 We find the movie in question and see how many kicks and kisses it has.
[Figure: classifying movies by plotting the number of kicks and kisses in each movie]
KNN Classification
[Table: movies with the number of kicks and kisses, along with their class]
KNN Classification
 We don’t know what type of movie the question mark movie is.
 First, we calculate the distance to all the other movies.
[Table: distance between each movie and the unknown movie]
KNN Classification
The Euclidean distance between two vectors (x1, y1) and (x2, y2):
d = ((x1 − x2)² + (y1 − y2)²)^(1/2)
KNN Classification
Let’s assume k=3.
 Then, the three closest movies are He’s Not Really into Dudes, Beautiful Woman, and California Man.
 Because all three movies are romances, we forecast that the
mystery movie is a romance movie. (through majority vote)
General Approach to KNN
 General approach to kNN:
 Collect: Any method
 Prepare: Numeric values are needed for a distance calculation.
 Analyze: Any method (plotting).
 Train: Does not apply to the kNN algorithm.
 Test: Calculate the error rate.
 Use: This application needs to get some input data and output structured numeric values.
K-Nearest Neighbors (KNN)
kNN is an instance-based learning algorithm.
Non-instance (model-based) supervised learning: the training pairs <x, y>1, <x, y>2, …, <x, y>n are used to fit a function such as F(x) = wx + b, and the data itself can then be set aside.
Instance-based supervised learning: the same pairs <x, y>1, …, <x, y>n are stored in a database, and prediction is a lookup: F(x) = lookup(x).
K-Nearest Neighbors (KNN)
 Advantage:
 It remembers
 Fast (no learning time)
 Simple and straightforward
 Downside:
 No generalization
 Over-fitting (noise)
 Computationally expensive for large datasets
K-Nearest Neighbors (KNN)
 Given:
 Training data D = {(xi, yi)},
 Distance metric d(q, x): domain knowledge important,
 Number of neighbors K: domain knowledge important,
 Query point q.
 kNN(q) = { i : d(q, xi) is among the k smallest }
 Return:
 Classification: Majority Vote of the yi.
 Regression: mean of the yi.
KNN - Regression Problem
 Regression query: q = (4, 2). Predict y = ?

X1, X2    y
1, 6      7
2, 4      8
3, 7      16
6, 8      44
7, 1      50
8, 4      68

 Fill in the k-NN average for each distance function d():
 Euclidean: 1-NN _______  3-NN _______
 Manhattan: 1-NN _______  3-NN _______
 The similarity measure depends on the type of the data:
 Real-valued data: Euclidean distance.
 Categorical or binary data: Hamming distance (the p-norm with p = 0)
KNN - Regression Problem
 Euclidean distance: d = ((x1i − q1)² + (x2i − q2)²)^(1/2)
 (the ED column lists squared distances; squaring preserves the distance ordering)

Query: q = (4, 2). Predict y = ?

X1, X2    y    ED
1, 6      7    25
2, 4      8    8
3, 7      16   26
6, 8      44   40
7, 1      50   10
8, 4      68   20

 Euclidean: 1-NN → 8 (the closest point, (2, 4), has y = 8)
 3-NN → 42 (the three closest points have y = 8, 50, 68; their mean is 42)
KNN - Regression Problem
 Manhattan distance: d = |x1i − q1| + |x2i − q2|

Query: q = (4, 2). Predict y = ?

X1, X2    y    MD
1, 6      7    7
2, 4      8    4
3, 7      16   6
6, 8      44   8
7, 1      50   4
8, 4      68   6

 Manhattan: 1-NN → 29 (two points tie at distance 4, with y = 8 and y = 50; their mean is 29)
 3-NN → 35.5 (distance 6 is also tied, so the four nearest candidates y = 8, 50, 16, 68 average to 35.5)
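The hand computations above can be checked with a short NumPy sketch; the tie-averaging helper mirrors how the worked example treats equidistant neighbors:

```python
import numpy as np

X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]])
y = np.array([7, 8, 16, 44, 50, 68])
q = np.array([4, 2])

def knn_mean(d, y, k):
    """Average y over every point within the k-th smallest distance
    (equidistant ties are averaged together, as in the worked example)."""
    kth = np.sort(d)[k - 1]
    return y[d <= kth].mean()

euclid = np.sqrt(((X - q) ** 2).sum(axis=1))
manhat = np.abs(X - q).sum(axis=1)

print(knn_mean(euclid, y, 1), knn_mean(euclid, y, 3))  # 8.0 42.0
print(knn_mean(manhat, y, 1), knn_mean(manhat, y, 3))  # 29.0 35.5
```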
K-Nearest Neighbors Bias
 Preference bias: our belief about what makes a good hypothesis.
 Locality: near points are similar (distance function / domain)
 Smoothness: averaging
 All features matter equally.
 Best practices for Data preparation
 Rescale data: normalizing the data to the range [0, 1] is a good idea.
 Address missing data: exclude or impute the missing values.
 Lower dimensionality: kNN is best suited to lower-dimensional data.
KNN and the Curse of Dimensionality
 As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
 Exponential means “bad”: O(2^d)
Some Other Issues
 What is needed to select a KNN model?
 How to measure closeness of neighbors.
 Correct value for K.
 d(x, q) = Euclidian, Manhattan, weighted etc…
 The choice of the distance function matters.
 K value:
 k = n (unweighted): the prediction is the average of all the data, so the query point is irrelevant.
 k = n (weighted average): closer points count for more. [locally weighted regression]
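A rough sketch of the weighted-average case: every training point contributes in proportion to the inverse of its distance, a simple stand-in for the kernels used in locally weighted regression (the function name is my own):

```python
import numpy as np

def weighted_knn_regress(q, X, y, eps=1e-9):
    """Predict y at q as a distance-weighted average over ALL training points."""
    d = np.sqrt(((X - q) ** 2).sum(axis=1))
    w = 1.0 / (d + eps)             # closer points receive larger weights
    return (w * y).sum() / w.sum()
```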
Summary
 kNN is an example of instance-based learning.
 The algorithm has to carry around the full dataset; for large datasets, this implies a large amount of storage.
 Need to calculate the distance measurement for every piece of data in the database, and this can be cumbersome.
 kNN doesn’t give you any idea of the underlying structure of the data.
 kNN is an example of lazy learning, which is the opposite of eager learning.
 kNN can handle both classification and regression.
Summary
 kNN is included in scikit-learn’s list of algorithms.
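In scikit-learn the estimator is KNeighborsClassifier (with KNeighborsRegressor as the regression counterpart); a minimal usage sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```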
Question & Answer
Thank You !!!
Python Programming
 Python: the programming language (see the Python tutorial)
 Jupyter: an advanced Python shell. (Anaconda - Jupyter)
 Numpy: to manipulate numeric data (Numerical Python)
 Scipy: high-level scientific computation (Scientific Python), optimization, regression, interpolation.
 Matplotlib: 2-D visualization, “publication-ready” plots.
 Scikit-learn: the ML algorithms in python.
Assignment One - Python Programming
 Numpy
[NumPy examples were shown as screenshots on the original slides.]
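Since the screenshots are not reproduced here, a few representative NumPy operations (my own minimal examples) give the flavor:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 array
print(a.shape)                        # (2, 3)
print(a * 2)                          # elementwise arithmetic
print(a.T @ a)                        # matrix product (3x3)
print(a.mean(axis=0))                 # column means: [2.5 3.5 4.5]
print(np.arange(6).reshape(2, 3))     # ranges and reshaping
```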
Python Programming
 Matplotlib
[A series of Matplotlib plotting examples was shown as screenshots on the original slides.]
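Likewise, a minimal Matplotlib sketch of the kind of plot the slides demonstrated:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")                         # a line plot
plt.scatter(x[::10], np.cos(x[::10]), label="cos(x) samples")  # a scatter plot
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```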
Python Programming
 SciPy
[SciPy examples were shown as screenshots on the original slides.]
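A small SciPy sketch covering the optimization and interpolation features mentioned earlier (the toy functions are my own):

```python
import numpy as np
from scipy import interpolate, optimize

# Optimization: minimize a simple quadratic
res = optimize.minimize(lambda v: (v[0] - 3) ** 2 + 1, x0=[0.0])
print(res.x)  # approximately [3.]

# Interpolation: fit a smooth curve through a few points
xs = np.array([0.0, 1.0, 2.0, 3.0])
f = interpolate.interp1d(xs, xs ** 2, kind="quadratic")
print(f(1.5))  # approximately 2.25
```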
Tool Set
Jupyter notebooks
 Interactive coding and Visualization of output
NumPy, SciPy, Pandas
 Numerical computation
Matplotlib, Seaborn
 Data visualization
Scikit-learn
 Machine learning
Jupyter Cell
%matplotlib inline: display plots inline in a Jupyter notebook.
Jupyter Cell
 %%timeit: time how long a cell takes to execute.
%run filename.ipynb: execute code from another notebook
or Python file.
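The three magics, shown as separate notebook cells in one listing (this is IPython syntax, not plain Python, and %%timeit must be the first line of its own cell; filename.ipynb is the placeholder name from the slide):

```python
# --- Cell 1: render plots inline ---
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9])

# --- Cell 2: time the cell body (%%timeit must be the first line of its cell) ---
%%timeit
sum(i * i for i in range(10_000))

# --- Cell 3: run another notebook or script ---
%run filename.ipynb
```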
Introduction to Pandas: Series
 Library for computation with tabular data.
 Mixed types of data allowed in a single table.
 Columns and rows of data can be named.
 Advanced data aggregation and statistical functions.
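A minimal sketch of a pandas Series illustrating these points (the values are made up):

```python
import pandas as pd

s = pd.Series([7, 8, 16], index=["cat", "dog", "tiger"], name="count")
print(s["dog"])   # label-based access -> 8
print(s.mean())   # built-in aggregation -> 10.33...
print(s[s > 7])   # boolean filtering keeps dog and tiger
```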
Introduction to Pandas: Dataframe
 Library for computation with tabular data.
[Series and DataFrame examples were shown as screenshots on the original slides.]
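And a minimal DataFrame sketch (reusing the class counts from the imbalance example earlier):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["cat", "dog", "tiger", "cow"],  # mixed column types in one table
    "count": [5000, 5000, 150, 25],
})
print(df.describe())                             # summary statistics for numeric columns
print(df[df["count"] > 100])                     # row filtering
print(df.sort_values("count", ascending=False))  # sorting by a named column
```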