Classification: Machine Learning Basics and kNN
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2021)
Outline
 A brief overview of ML
 Key tasks in ML
 Why we need ML
 Why Python is so great for ML
 K-nearest neighbors algorithm
kNN Classification
kNN Regression
Some Issues in KNN
Python Modules to work on the ML Algorithms
Machine Learning (ML)
 With machine learning we can gain insight from a dataset.
 We’re going to ask the computer to make sense of the data.
 This is what we mean by learning.
 Machine learning is the process of turning data into information and
knowledge.
 Machine Learning lies at the intersection of computer science, engineering,
and statistics and often appears in other disciplines.
What is Machine Learning?
 It’s a tool that can be applied to many problems.
 Any field that needs to interpret and act on data can benefit
from Machine Learning techniques.
 There are many problems where the solution isn’t deterministic.
 That is, we don’t know enough about the problem or don’t have
enough computing power to properly model the problem.
Traditional Vs ML systems
 In Machine Learning, once the system is provided with the right
data and algorithms, it can "fish for itself".
Traditional Vs ML systems
 A key aspect of Machine Learning that makes it particularly appealing
in terms of business value is that it does not require as much explicit
programming in advance.
Sensor and the Data Deluge
 We have a tremendous amount of human-created data from the WWW,
but recently more non-human sources of data have been coming online.
Sensors connected to the web.
Sensors account for roughly 20% of non-video internet traffic.
 Data collected from mobile phones (three-axis accelerometers, temperature
sensors, and GPS receivers)
 The two trends of mobile computing and sensor-generated data mean that
we’ll be getting more and more data in the future.
Key Terminology
[Table: example training data with the features Weight, Wingspan, Webbed feet, and Back color, plus the target Species]
 Weight, Wingspan, Webbed feet, Back color are features or
attributes.
 An instance is made up of features. (also called an example or observation)
 Species is the target variable. (response, outcome, output etc.)
 Attributes can be numeric, binary, nominal.
Key Terminology
 To train the Machine Learning algorithm we need to feed it quality data known as a training set.
 In the above example, each training example (instance) has four features and one target variable.
 In a training set the target variable is known. (classification)
 The machine learns by finding some relationship between the features and the target variable.
 In the classification problem the target variables are called classes, and there are assumed to be a
finite number of classes.
Key Terminology Cont…
 To test machine learning algorithms a separate dataset is used which is called a test set.
 The target variable for each example from the test set isn’t given to the program.
 The program (model) decides which class each example belongs to.
 Then the predicted value is compared with the actual target variable.
Key Tasks of Machine Learning
 In classification, our job is to predict what class an instance of data should fall into.
 Regression is the prediction of a numeric value. (target variable)
 Classification and regression are examples of supervised learning.
 This set of problems is known as supervised because we’re telling the algorithm what to predict.
Key Tasks of Machine Learning
 The opposite of supervised learning is a set of tasks known as unsupervised learning.
 In unsupervised learning, there’s no label or target value given for the data. (grouping similar items is known as clustering)
 In unsupervised learning, we may also want to find statistical values that describe the data. This is known as density estimation.
 Another task of unsupervised learning is reducing the data from many features to a small number so that we can
properly visualize it.
Key Tasks of Machine Learning
 Common algorithms are used to perform classification, regression, clustering, and density estimation tasks.
 Balancing generalization and memorization (overfitting) is a common problem for many ML algorithms.
 Regularization techniques are used to reduce overfitting.
Key Tasks of Machine Learning
 There are two fundamental causes of prediction error: a model’s bias and its variance.
 A model with high variance over-fits the training data, while a model with high bias under-fits the training data.
 High bias, low variance (Under-fitting)
 Low bias, high variance (Over-fitting)
 High bias, high variance (Under/Over – fitting)
 Low bias, low variance (Good model)
 The predictive power of many Machine Learning algorithms improves as the amount of training data increases.
 Quality of data is also important.
Key Tasks of Machine Learning
 Ideally, a model will have both low bias and low variance, but efforts to reduce one frequently increase the other.
 This is known as the bias-variance trade-off.
 Common measurement of performance:
 Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
 Precision (P) = TP / (TP + FP)
 Recall (R) = TP / (TP + FN)
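As a small sketch (not from the slides), the three measures can be computed directly from the four confusion-matrix counts; the counts below are made up for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Made-up counts for illustration:
acc, p, r = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(f"ACC={acc:.2f}  P={p:.2f}  R={r:.2f}")  # ACC=0.85  P=0.89  R=0.80
```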
How to Choose the Right Algorithm
 First, you need to consider your goal.
 If you’re trying to predict or forecast a target value, then you need to look into supervised learning.
 If not, then unsupervised learning is the place you want to be.
 If you’ve chosen supervised learning, what’s your target value?
Discrete values (y/n, 1/2/3, Red/Yellow/Black) → classification
Continuous values (0.00 to 100.00, etc.) → regression
How to Choose the Right Algorithm
 Spend some time getting to know the data; the better we know it, the more successful an application we can build. (70-80% of the time goes here)
 Things to know about the data are these:
Are the features nominal or continuous?
Are there missing values in the features?
If there are missing values, why are there missing values?
Are there outliers in the data? (e.g., among 80, 81, 82, 83, 245, the value 245 is an outlier)
Etc…
 All of these features about your data can help you narrow the algorithm selection process.
How to Choose the Right Algorithm
 Finding the best algorithm is an iterative process of trial and error.
 Steps in developing a machine learning application:
Collect data: scraping a website, an RSS feed, an API, etc.
 Prepare the input data: make sure the data format is consistent and usable.
Analyze the input data: look at the data.
Understand the data.
Train the algorithm: this is where the ML takes place. (does not apply to unsupervised learning)
Test the algorithm: evaluate it. (if needed, go back to the 4th step)
Use it. (implement the ML application)
Problem Solving Framework
 Problem solving Framework for ML application:
Business issue understanding
Data understanding
Data preparation
Analysis / Modeling
Validation
Presentation / Visualization
Machine Learning Systems and Data
 In ML, instead of writing a program by hand for each specific
task, we collect lots of examples that specify the correct output
for a given input.
 The most important factor in ML is not the algorithm or the
software system: the quality of the data is the soul of an ML
system.
Machine Learning Systems and Data
 Invalid training data:
Garbage In ------ Garbage Out.
 Invalid dataset leads to invalid results.
 This is not to say that the training data needs to be perfect.
 Out of a million examples, a few inaccurate labels are
acceptable.
 The quality of the data is the soul of the ML systems.
Machine Learning Systems and Data
 “Garbage” can be several things:
Wrong labels (Dog – Cat, Cat – Dog)
Inaccurate and missing values
A biased dataset, etc.
 Handling missing data:
If only a small portion of rows and columns is affected – discard them,
Data imputation (time-series data) – carry the last valid value forward,
Substitute with mean or median,
Predicting the missing values from the available data,
A missing value can have a meaning on its own (missing).
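A minimal pandas sketch of these options; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing entries
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "salary": [8000.0, 12000.0, np.nan, 15000.0]})

dropped = df.dropna()                       # discard rows that contain missing values
mean_filled = df.fillna(df.mean())          # substitute with the column mean (or median)
forward_filled = df.ffill()                 # time-series style: carry the last valid value forward
df["salary_missing"] = df["salary"].isna()  # keep "missing" itself as a feature
```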
Machine Learning Systems and Data
 Having a clean dataset is not always enough.
 Features with large magnitudes can dominate features with small
magnitudes during the training.
 Example: Age [0-100] vs. salary [6,000 – 20,000] – addressed by scaling and
standardization.
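Both remedies are available in scikit-learn's preprocessing module; a brief sketch with illustrative age/salary values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative [age, salary] rows
X = np.array([[25, 8000], [40, 12000], [60, 20000]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)          # rescale each feature to [0, 1]
X_standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
```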
 Data imbalance:
No  Class   Count
1   Cat     5000
2   Dog     5000
3   Tiger    150
4   Cow       25

 Leave it as it is.
Undersampling (if all classes are equally important) [5000 → 25]
Oversampling (if all classes are equally important) [25 → 5000]
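Under- and oversampling can be sketched with scikit-learn's resample utility; the class lists below are illustrative stand-ins for real rows:

```python
from sklearn.utils import resample

# Illustrative stand-ins for the majority (Cat) and minority (Cow) rows
cats = [("cat", i) for i in range(5000)]
cows = [("cow", i) for i in range(25)]

cats_down = resample(cats, n_samples=25, replace=False, random_state=0)  # undersample 5000 -> 25
cows_up = resample(cows, n_samples=5000, replace=True, random_state=0)   # oversample 25 -> 5000
```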
Challenges in Machine Learning
 It requires considerable data and compute power.
 It requires knowledgeable data science specialists or teams.
 It adds complexity to the organization's data integration
strategy. (data-driven culture)
 Learning ML algorithms is challenging without an advanced
math background.
 The context of data often changes. (private data Vs public data)
 Algorithmic bias, privacy and ethical concerns may be
overlooked.
Stages of ML Process
 The first key step in preparing to explore and exploit ML is to
understand the basic stages involved.
Stages of ML Process
 Machine Learning Tasks and Subtasks:
Data Collection and Preparation
 Data collection is the process of gathering and measuring
information from countless different sources.
 Data is being generated at an unprecedented rate. These data can be:
 Numeric (temperature, loan amount, customer retention rate),
 Categorical (gender, color, highest degree earned), or
 Even free text (think doctor’s notes or opinion surveys).
 In order to use the data we collect to develop practical
solutions, it must be collected and stored in a way that makes
sense for the business problem at hand.
Data Collection and Preparation
[Figure: examples of applications and the types of data sources that feed them]
Data Collection and Preparation
 During an ML system development, we always rely on data.
 From training, tuning, and model selection to testing, we use three
different data sets: the training set, the validation set, and the
test set.
 The validation set is used to select and tune the final ML model.
 The test data set is used to evaluate how well your algorithm
was trained with the training data set.
Data Collection and Preparation
 Test sets typically represent 20% or 30% of the data. (cross-validation)
 The test set pairs input data with verified correct outputs, generally
confirmed by human verification.
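A common way to carve out the three sets with scikit-learn's train_test_split; the 60/20/20 proportions below are one reasonable choice, not a rule from the slides:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))             # placeholder features
y = [i % 2 for i in range(100)]  # placeholder labels

# Hold out 20% as the test set, then split 25% of the remainder off as validation,
# giving roughly 60% train / 20% validation / 20% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
```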
Data Collection and Preparation
 The most successful ML projects are those that integrate a data
collection strategy during the service/product life-cycle.
 It must be built into the core product itself.
 Basically, every time a user engages with the product/service,
you want to collect data from the interaction.
 The goal is to use this constant new data flow to improve your
product/service.
Data Collection and Preparation
 Solving the right problem:
 Understand the purpose for a model.
 Ask about who, what, when, where and why?
 Is the problem viable for machine learning?
Data Collection and Preparation
 Data preparation is a set of procedures that makes your dataset
more suitable for ML.
 Articulate the problem early
 Establish data collection mechanisms (data-driven culture)
 Format data to make it consistent
 Reduce data (attribute sampling)
 Complete data cleaning
 Decompose data (complex data set)
 Rescale data (data normalization)
 Discretize data (numerical → categorical values)
 Private datasets capture the specifics of your unique business
and potentially have all relevant attributes.
Data Collection, Preparation and Delivery
[Figure: the data collection, preparation, and delivery pipeline]
Python
 Python is a great language for ML.
Has clear syntax:
 High-level data types (lists, tuples, dictionaries, sets, etc.)
 Program in any style (OO, procedural, functional, and so on)
Makes text manipulation extremely easy.
There are a number of libraries:
 Libraries such as SciPy and NumPy for vector and matrix
operations.
 Matplotlib can plot 2D and 3D plots.
 Pandas: data manipulation in table form.
Classification with k-Nearest Neighbors
K-Nearest Neighbors (KNN)
 kNN is easy to grasp (understand and implement) and very
effective (a powerful tool).
 The model for kNN is the entire training dataset.
 Pros: High accuracy, insensitive to outliers, no assumptions
about data.
 Cons: computationally expensive, requires a lot of memory.
 Works with: Numeric values, nominal values. (Classification
and regression)
K-Nearest Neighbors (KNN)
 We have an existing set of example data (training set).
 We know what class each piece of the data should fall into.
 When we’re given a new piece of data without a label,
 we compare that new piece of data to every piece of existing data.
 We then take the most similar pieces of data (the nearest neighbors) and look at their
labels.
K-Nearest Neighbors (KNN)
 We have an existing set of example data (training set).
 We look at the top k most similar pieces of data from our known dataset. (usually less than 20)
 k is often set to an odd number to prevent ties.
 Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new
class we assign to the data we were asked to classify.
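A from-scratch sketch of this procedure, assuming Euclidean distance and a simple majority vote (the toy arrays are illustrative, not the movie data):

```python
from collections import Counter
import numpy as np

def knn_classify(query, X, labels, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.sqrt(((X - query) ** 2).sum(axis=1))  # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]
print(knn_classify(np.array([0.1, 0.2]), X, labels, k=3))  # -> B
```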
K-Nearest Neighbors (KNN)
 kNN and other non-parametric models can be useful when training data is abundant and you have little prior
knowledge about the relationship between the response and explanatory variables.
 kNN makes only one assumption: instances that are near each other are likely to have similar values of the
response variable.
 A model that makes assumptions about the relationship can be useful if training data is scarce or if you already
know about the relationship.
KNN Classification
 Classifying movies into romance or action movies.
 The number of kisses and kicks in each movie (features)
 Now, you find a movie you haven’t seen yet and want to know if it’s a romance movie or an action movie.
 To determine this, we’ll use the kNN algorithm.
KNN Classification
 We find the movie in question and see how many kicks and kisses it has.
[Figure: classifying movies by plotting the number of kicks and kisses in each movie]
KNN Classification
[Table: movies with the number of kicks and kisses, along with their class]
KNN Classification
 We don’t know what type of movie the question mark movie is.
 First, we calculate the distance to all the other movies.
[Table: distance between each movie and the unknown movie]
KNN Classification
The Euclidean distance between two vectors (x1, y1) and (x2, y2):
d = ((x1 − x2)² + (y1 − y2)²)^(1/2)
KNN Classification
Let’s assume k=3.
 Then, the three closest movies are He’s Not Really into Dudes, Beautiful Woman, and California Man.
 Because all three movies are romances, we forecast that the
mystery movie is a romance movie. (through majority vote)
General Approach to KNN
 General approach to kNN:
 Collect: Any method
 Prepare: Numeric values are needed for a distance calculation.
 Analyze: Any method (plotting).
 Train: Does not apply to the kNN algorithm.
 Test: Calculate the error rate.
 Use: This application needs to get some input data and output structured numeric values.
K-Nearest Neighbors (KNN)
kNN is an instance-based learning algorithm.
Non-instance (model-based) supervised learning: the training pairs <x, y>1, <x, y>2, …, <x, y>n are used to fit a function such as F(x) = wx + b, and the data itself can then be set aside.
Instance-based supervised learning: the same pairs <x, y>1, …, <x, y>n are stored in a database, and prediction is a lookup: F(x) = lookup(x).
K-Nearest Neighbors (KNN)
 Advantage:
 It remembers
 Fast (no learning time)
 Simple and straightforward
 Downside:
 No generalization
 Over-fitting (noise)
 Computationally expensive for large datasets
K-Nearest Neighbors (KNN)
 Given:
 Training data D = {(xi, yi)},
 Distance metric d(q, x): domain knowledge important,
 Number of neighbors K: domain knowledge important,
 Query point q.
 kNN(q) = { i : d(q, xi) is among the k smallest }
 Return:
 Classification: Majority Vote of the yi.
 Regression: mean of the yi.
KNN - Regression Problem
 Regression query: q = (4, 2). Predict y = ?

X1, X2    y
1, 6      7
2, 4      8
3, 7      16
6, 8      44
7, 1      50
8, 4      68

 Fill in the k-NN average for each distance function d():
 Euclidean: 1-NN _______  3-NN _______
 Manhattan: 1-NN _______  3-NN _______
 The similarity measure depends on the type of the data:
 Real-valued data: Euclidean distance.
 Categorical or binary data: Hamming distance (the p-norm with p = 0)
KNN - Regression Problem
 Euclidean distance: d = ((x1i − q1)² + (x2i − q2)²)^(1/2)
 (the ED column lists squared distances; squaring preserves the distance ordering)

Query: q = (4, 2). Predict y = ?

X1, X2    y    ED
1, 6      7    25
2, 4      8    8
3, 7      16   26
6, 8      44   40
7, 1      50   10
8, 4      68   20

 Euclidean: 1-NN → 8 (the closest point, (2, 4), has y = 8)
 3-NN → 42 (the three closest points have y = 8, 50, 68; their mean is 42)
KNN - Regression Problem
 Manhattan distance: d = |x1i − q1| + |x2i − q2|

Query: q = (4, 2). Predict y = ?

X1, X2    y    MD
1, 6      7    7
2, 4      8    4
3, 7      16   6
6, 8      44   8
7, 1      50   4
8, 4      68   6

 Manhattan: 1-NN → 29 (two points tie at distance 4, with y = 8 and y = 50; their mean is 29)
 3-NN → 35.5 (distance 6 is also tied, so the four nearest candidates y = 8, 50, 16, 68 average to 35.5)
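The hand computations above can be checked with a short NumPy sketch; the tie-averaging helper mirrors how the worked example treats equidistant neighbors:

```python
import numpy as np

X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]])
y = np.array([7, 8, 16, 44, 50, 68])
q = np.array([4, 2])

def knn_mean(d, y, k):
    """Average y over every point within the k-th smallest distance
    (equidistant ties are averaged together, as in the worked example)."""
    kth = np.sort(d)[k - 1]
    return y[d <= kth].mean()

euclid = np.sqrt(((X - q) ** 2).sum(axis=1))
manhat = np.abs(X - q).sum(axis=1)

print(knn_mean(euclid, y, 1), knn_mean(euclid, y, 3))  # 8.0 42.0
print(knn_mean(manhat, y, 1), knn_mean(manhat, y, 3))  # 29.0 35.5
```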
K-Nearest Neighbors Bias
 Preference bias: our belief about what makes a good hypothesis.
 Locality: near points are similar (distance function / domain)
 Smoothness: averaging
 All features matter equally.
 Best practices for Data preparation
 Rescale data: normalizing the data to the range [0, 1] is a good idea.
 Address missing data: exclude or impute the missing values.
 Lower dimensionality: kNN is best suited to lower-dimensional data.
KNN and the Curse of Dimensionality
 As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
 Exponential means “bad”: O(2^d)
Some Other Issues
 What is needed to select a KNN model?
 How to measure closeness of neighbors.
 Correct value for K.
 d(x, q) = Euclidian, Manhattan, weighted etc…
 The choice of the distance function matters.
 K value:
 k = n (unweighted): the prediction is the average of all the data, so the query point is irrelevant.
 k = n (weighted average): closer points count for more. [locally weighted regression]
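A rough sketch of the weighted-average case: every training point contributes in proportion to the inverse of its distance, a simple stand-in for the kernels used in locally weighted regression (the function name is my own):

```python
import numpy as np

def weighted_knn_regress(q, X, y, eps=1e-9):
    """Predict y at q as a distance-weighted average over ALL training points."""
    d = np.sqrt(((X - q) ** 2).sum(axis=1))
    w = 1.0 / (d + eps)             # closer points receive larger weights
    return (w * y).sum() / w.sum()
```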
Summary
 kNN is an example of instance-based learning.
 The algorithm has to carry around the full dataset; for large datasets, this implies a large amount of storage.
 Need to calculate the distance measurement for every piece of data in the database, and this can be cumbersome.
 kNN doesn’t give you any idea of the underlying structure of the data.
 kNN is an example of lazy learning, which is the opposite of eager learning.
 kNN can handle both classification and regression.
Summary
 kNN is included in scikit-learn’s list of algorithms.
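In scikit-learn the estimator is KNeighborsClassifier (with KNeighborsRegressor as the regression counterpart); a minimal usage sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```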
Question & Answer
Thank You !!!
Python Programming
 Python: the programming language (see the Python tutorial)
 Jupyter: an advanced Python shell. (Anaconda - Jupyter)
 Numpy: to manipulate numeric data (Numerical Python)
 Scipy: high-level scientific computation (Scientific Python), optimization, regression, interpolation.
 Matplotlib: 2-D visualization, “publication-ready” plots.
 Scikit-learn: the ML algorithms in python.
Assignment One - Python Programming
 Numpy
[NumPy examples were shown as screenshots on the original slides.]
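Since the screenshots are not reproduced here, a few representative NumPy operations (my own minimal examples) give the flavor:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 array
print(a.shape)                        # (2, 3)
print(a * 2)                          # elementwise arithmetic
print(a.T @ a)                        # matrix product (3x3)
print(a.mean(axis=0))                 # column means: [2.5 3.5 4.5]
print(np.arange(6).reshape(2, 3))     # ranges and reshaping
```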
Python Programming
 Matplotlib
[A series of Matplotlib plotting examples was shown as screenshots on the original slides.]
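Likewise, a minimal Matplotlib sketch of the kind of plot the slides demonstrated:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")                         # a line plot
plt.scatter(x[::10], np.cos(x[::10]), label="cos(x) samples")  # a scatter plot
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```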
Python Programming
 SciPy
[SciPy examples were shown as screenshots on the original slides.]
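A small SciPy sketch covering the optimization and interpolation features mentioned earlier (the toy functions are my own):

```python
import numpy as np
from scipy import interpolate, optimize

# Optimization: minimize a simple quadratic
res = optimize.minimize(lambda v: (v[0] - 3) ** 2 + 1, x0=[0.0])
print(res.x)  # approximately [3.]

# Interpolation: fit a smooth curve through a few points
xs = np.array([0.0, 1.0, 2.0, 3.0])
f = interpolate.interp1d(xs, xs ** 2, kind="quadratic")
print(f(1.5))  # approximately 2.25
```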
Tool Set
Jupyter notebooks
 Interactive coding and Visualization of output
NumPy, SciPy, Pandas
 Numerical computation
Matplotlib, Seaborn
 Data visualization
Scikit-learn
 Machine learning
Jupyter Cell
%matplotlib inline: display plots inline in a Jupyter notebook.
Jupyter Cell
 %%timeit: time how long a cell takes to execute.
%run filename.ipynb: execute code from another notebook
or Python file.
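The three magics, shown as separate notebook cells in one listing (this is IPython syntax, not plain Python, and %%timeit must be the first line of its own cell; filename.ipynb is the placeholder name from the slide):

```python
# --- Cell 1: render plots inline ---
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9])

# --- Cell 2: time the cell body (%%timeit must be the first line of its cell) ---
%%timeit
sum(i * i for i in range(10_000))

# --- Cell 3: run another notebook or script ---
%run filename.ipynb
```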
Introduction to Pandas: Series
 Library for computation with tabular data.
 Mixed types of data allowed in a single table.
 Columns and rows of data can be named.
 Advanced data aggregation and statistical functions.
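A minimal sketch of a pandas Series illustrating these points (the values are made up):

```python
import pandas as pd

s = pd.Series([7, 8, 16], index=["cat", "dog", "tiger"], name="count")
print(s["dog"])   # label-based access -> 8
print(s.mean())   # built-in aggregation -> 10.33...
print(s[s > 7])   # boolean filtering keeps dog and tiger
```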
Introduction to Pandas: Dataframe
 Library for computation with tabular data.
[Series and DataFrame examples were shown as screenshots on the original slides.]
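And a minimal DataFrame sketch (reusing the class counts from the imbalance example earlier):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["cat", "dog", "tiger", "cow"],  # mixed column types in one table
    "count": [5000, 5000, 150, 25],
})
print(df.describe())                             # summary statistics for numeric columns
print(df[df["count"] > 100])                     # row filtering
print(df.sort_values("count", ascending=False))  # sorting by a named column
```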