WHAT IS
MACHINE LEARNING
Dr. Majid Ali Khan
Dr. Ghazanfar Latif
OUTLINE
What is Machine Learning and Why is it useful?
Applications of Machine Learning
Types of Machine Learning
Challenges of Machine Learning
Testing and Validating
OUTSIDERS' VIEW OF MACHINE LEARNING
Intelligent robots (good or bad) roaming the world!!!
REAL WORLD MACHINE LEARNING
Spam Filters
Optical Character Recognition
Image Processing
Manufacturing
Civil
Mechanical
Finance
WHAT IS MACHINE LEARNING
The science (and art) of programming computers so that they can learn from data
[Machine Learning is the] field of study that gives computers the ability to learn without being
explicitly programmed.
—Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some
performance measure P, if its performance on T, as measured by P, improves with experience E.
-Tom Mitchell, 1997
WHAT IS MACHINE LEARNING – AN EXAMPLE
Spam Filter:
 Differentiate spam emails from regular (non-spam) emails
What is the Experience (E):
 Training set (examples used to learn)
 Training instance (one particular training example)
What is the Task (T):
 Identify spam emails
What is the performance measure (P):
 How accurate is the identification (evaluated on a test set)
 Accuracy = Number of correct classifications / Total size of the test set (a minimal sketch follows below)
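To make the performance measure P concrete, here is a minimal Python sketch of the accuracy calculation; the label lists are hypothetical, purely for illustration.

```python
# Minimal sketch: accuracy = number of correct classifications / size of the test set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual classes (1 = spam, 0 = non-spam)
y_pred = [1, 0, 0, 1, 0, 0, 1, 1]   # hypothetical classes predicted by the spam filter

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"Accuracy = {correct}/{len(y_true)} = {accuracy:.2f}")  # Accuracy = 6/8 = 0.75
```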
AI VS. MACHINE LEARNING VS. DEEP LEARNING
WHY USE MACHINE LEARNING
Spam filter using traditional programming
WHY USE MACHINE LEARNING
Spam filter using a Machine Learning approach
WHY USE MACHINE LEARNING
Machine Learning can adapt to changing environments
WHY USE MACHINE LEARNING
Machine Learning can help humans better understand large datasets
IN SUMMARY: WHY USE ML
Use for problems for which existing solutions require a lot of fine-tuning or
long lists of rules: one Machine Learning algorithm can often simplify code and
perform better than the traditional approach.
Fluctuating environments: a Machine Learning system can adapt to new data.
Complex problems for which using a traditional approach yields no good
solution: the best Machine Learning techniques can perhaps find a solution.
Getting insights about complex problems and large amounts of data.
EXAMPLES OF APPLICATIONS
Analyzing images of products on a production line to automatically classify them (Image Classification using CNN)
Detecting tumors in brain scans (Semantic Segmentation)
Automatically classifying news articles (Text Classification)
Automatically flagging offensive comments on discussion forums (Text Classification)
Summarizing long documents automatically (Text Summarization)
Creating a chatbot or a personal assistant (Natural Language Processing)
Forecasting your company’s revenue next year, based on many performance metrics (Regression)
Making your app react to voice commands (Speech Recognition)
Detecting credit card fraud (Anomaly Detection)
Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment (Clustering)
Representing a complex, high-dimensional dataset in a clear and insightful diagram (Data Visualization)
Recommending a product that a client may be interested in, based on past purchases (Recommender Systems)
Building an intelligent bot for a game (Reinforcement Learning)
TYPES OF MACHINE LEARNING SYSTEMS
Whether or not they are trained with human supervision
 Supervised
 Unsupervised
 Semi-supervised
 Reinforcement Learning
Whether or not they can learn incrementally on the fly:
 Online Learning
 Batch Learning
SUPERVISED LEARNING
EXAMPLES OF SUPERVISED LEARNING ALGORITHMS
k-Nearest Neighbors
Linear Regression
Logistic Regression
Support Vector Machines (SVMs)
Decision Trees and Random Forests
Neural Networks
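As a hedged illustration of supervised learning with one of the algorithms listed above (k-Nearest Neighbors), assuming scikit-learn is available; the built-in iris dataset and the choice k = 3 are illustrative only.

```python
# Minimal supervised-learning sketch: labeled training data in, trained classifier out.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                       # predictors and their labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = KNeighborsClassifier(n_neighbors=3)               # k = 3 is an illustrative choice
clf.fit(X_train, y_train)                               # learn from the labeled instances
print("Test accuracy:", clf.score(X_test, y_test))      # evaluate on held-out data
```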
UNSUPERVISED LEARNING
EXAMPLES OF UNSUPERVISED LEARNING ALGORITHMS
Clustering
 K-Means
 DBSCAN
 Hierarchical Cluster Analysis (HCA)
EXAMPLES OF UNSUPERVISED LEARNING ALGORITHMS
Visualization and dimensionality reduction
 Principal Component Analysis (PCA)
 Kernel PCA
 Locally Linear Embedding (LLE)
 t-Distributed Stochastic Neighbor Embedding (t-SNE)
EXAMPLES OF UNSUPERVISED LEARNING ALGORITHMS
Anomaly detection and novelty detection
 One-class SVM
 Isolation Forest
EXAMPLES OF UNSUPERVISED LEARNING ALGORITHMS
Association rule learning
 Apriori
 Eclat
EXAMPLES OF UNSUPERVISED LEARNING ALGORITHMS
Clustering
 K-Means
 DBSCAN
 Hierarchical Cluster Analysis (HCA)
Anomaly detection and novelty detection
 One-class SVM
 Isolation Forest
Visualization and dimensionality reduction
 Principal Component Analysis (PCA)
 Kernel PCA
 Locally Linear Embedding (LLE)
 t-Distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning
 Apriori
 Eclat
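A hedged sketch of two of the unsupervised tasks above, assuming scikit-learn: K-Means groups unlabeled instances into clusters, and PCA reduces the data to two dimensions for visualization; the cluster count of 3 is an illustrative guess.

```python
# Minimal unsupervised-learning sketch: the algorithms never see any labels.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)                       # labels are deliberately ignored

clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)             # dimensionality reduction to 2D

print("cluster assignments:", clusters[:10])
print("first 2D point:", X_2d[0])
```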
SEMI-SUPERVISED LEARNING
REINFORCEMENT LEARNING
BATCH AND ONLINE LEARNING
Batch Learning:
 Learn in one go using the entire available training dataset
 Learning cannot be done incrementally
 Requires retraining the model from scratch whenever the dataset is updated
 Requires lots of computational resources
 But the process can be automated, so for small datasets it is not a huge concern
Online Learning:
 Learn on the fly from incoming data
 Learning can be done incrementally
 Does not require keeping all the data available at all times (see the sketch after this list)
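The contrast is sketched below under the assumption that scikit-learn is available: the batch learner is fit once on all the data, while the online learner (SGDClassifier) is updated incrementally with partial_fit; the synthetic dataset and 100-instance mini-batches are illustrative.

```python
# Minimal batch vs. online learning sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=1000, random_state=42)   # synthetic toy data

# Batch learning: train in one go on the entire dataset.
batch_model = LogisticRegression(max_iter=1000).fit(X, y)

# Online learning: feed the data incrementally, mini-batch by mini-batch.
online_model = SGDClassifier(random_state=42)
classes = np.unique(y)                                        # required on the first call
for start in range(0, len(X), 100):
    online_model.partial_fit(X[start:start + 100], y[start:start + 100], classes=classes)

print("batch accuracy: ", batch_model.score(X, y))
print("online accuracy:", online_model.score(X, y))
```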
CHALLENGES OF MACHINE LEARNING
Insufficient Quantity of Training Data
Non-representative Training Data
Poor Quality Data
Irrelevant Features
Overfitting Training Data
Underfitting Training Data
UNREASONABLE EFFECTIVENESS OF DATA
Very different algorithms performed almost identically well once given enough data
NON-REPRESENTATIVE DATA
POOR QUALITY DATA
 Errors in data gathering
 Outliers
 Noise (Inaccurate measurements)
If some instances are clearly outliers, it may help to simply discard them or try to
fix the errors manually.
If some instances are missing a few features (e.g., 5% of your customers did not
specify their age), you must decide whether you want to ignore this attribute
altogether, ignore these instances, fill in the missing values (e.g., with the median
age), or train one model with the feature and one model without it.
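A hedged pandas sketch of the two clean-up options just described; the DataFrame, column names, and outlier threshold are hypothetical.

```python
# Minimal data-cleaning sketch: discard an obvious outlier, fill missing values with the median.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":   [23, 35, np.nan, 41, 29, 250],   # 250 is clearly an error
                   "spend": [120, 80, 95, 300, 60, 110]})

df = df[df["age"].isna() | (df["age"] < 120)]        # drop the outlier row (age = 250)
df["age"] = df["age"].fillna(df["age"].median())     # fill the missing age with the median
print(df)
```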
IRRELEVANT FEATURES
Some features are not useful for building the prediction model
 Feature Selection: Select the features that matter most
 Feature Extraction: Build new features by combining existing ones (both are sketched below)
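Both ideas are sketched below, assuming scikit-learn: SelectKBest keeps the most informative existing features (feature selection), while PCA derives new features from combinations of the old ones (feature extraction); k = 2 and 2 components are illustrative choices.

```python
# Minimal feature selection / feature extraction sketch.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                     # 4 original features

X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)   # keep 2 existing features
X_extracted = PCA(n_components=2).fit_transform(X)             # build 2 new features
print(X_selected.shape, X_extracted.shape)            # (150, 2) (150, 2)
```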
OVERFITTING
Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization.
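A hedged sketch of regularization, assuming scikit-learn: Ridge regression constrains the weights of a linear model through its alpha hyperparameter; the synthetic data and the alpha values are illustrative only.

```python
# Minimal regularization sketch: a larger alpha constrains (shrinks) the model's weights.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=20, noise=10.0, random_state=42)

for alpha in (0.01, 1.0, 100.0):                      # illustrative regularization strengths
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:<6} largest weight = {abs(model.coef_).max():.1f}")
```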
UNDERFITTING
Opposite of overfitting.
The Machine Learning model is unable to learn the underlying patterns in the data
Solutions:
 Select a more powerful model, with more parameters.
 Feed better features to the learning algorithm (feature engineering).
 Reduce the constraints on the model (e.g., lower the regularization hyperparameter); a sketch of the first two fixes follows below.
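A hedged sketch of the first two fixes, assuming scikit-learn: a plain linear model underfits a quadratic relationship, while adding polynomial features (a simple form of feature engineering that also gives the model more parameters) removes the underfit; the toy data is synthetic.

```python
# Minimal underfitting sketch: a more expressive model captures the pattern much better.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)    # quadratic relationship plus noise

linear = LinearRegression().fit(X, y)                 # too simple: underfits
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))    # low score: the model underfits
print("poly   R^2:", round(poly.score(X, y), 3))      # enough capacity to fit the pattern
```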
TESTING
Split the data into a training set and a test set (commonly an 80%-20% ratio)
Build the model on the training data
Test the model on the test data
If the training error is high, the model is not learning the training data well (underfitting)
If the training error is low but the test error is high, the model is not generalizing to unseen data (overfitting); see the sketch below
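The procedure above is sketched below, assuming scikit-learn; the 80%-20% split matches the slide, while the dataset and model choice (a decision tree) are illustrative.

```python
# Minimal testing sketch: compare training error with test error.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))   # near 1.0 for a deep tree
print("test accuracy:    ", model.score(X_test, y_test))     # a large gap suggests overfitting
```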
VALIDATION
What if you need to compare different models or tune your model's hyperparameters?
Should you just keep using the test data to estimate the generalization error?
Doing so would cause the model to adapt to the test data and not generalize well to new data
Solution:
Divide the data into training, validation, and test sets (e.g., 60%, 20%, 20%)
Train candidate models on the training data and measure their error on the validation data
Select the model that minimizes the validation error
Then do one final training on training + validation data and evaluate on the test data (sketched below)
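A hedged sketch of holdout validation using the 60/20/20 split from the slide, assuming scikit-learn; the two candidate models are illustrative.

```python
# Minimal holdout-validation sketch: select on validation data, report on test data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

candidates = [make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
              DecisionTreeClassifier(random_state=42)]
val_scores = [m.fit(X_train, y_train).score(X_val, y_val) for m in candidates]
best = candidates[int(np.argmax(val_scores))]         # model with the lowest validation error

best.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))  # final training
print("test accuracy:", best.score(X_test, y_test))
```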


Editor's Notes

  • #6 Your spam filter is a Machine Learning program that, given examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called “ham”) emails, can learn to flag spam. The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample). In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy, and it is often used in classification tasks.
  • #8 Step 1. First you would consider what spam typically looks like. You might notice that some words or phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the subject line. Perhaps you would also notice a few other patterns in the sender’s name, the email’s body, and other parts of the email. Step 2. You would write a detection algorithm for each of the patterns that you noticed, and your program would flag emails as spam if a number of these patterns were detected. You would test your program and repeat steps 1 and 2 until it was good enough to launch. Since the problem is difficult, your program will likely become a long list of complex rules—pretty hard to maintain.
  • #9 In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples (Figure 1-2). The program is much shorter, easier to maintain, and most likely more accurate. What if spammers notice that all their emails containing “4U” are blocked? They might start writing “For U” instead. A spam filter using traditional programming techniques would need to be updated to flag “For U” emails. If spammers keep working around your spam filter, you will need to keep writing new rules forever.
  • #10 In contrast, a spam filter based on Machine Learning techniques automatically notices that “For U” has become unusually frequent in spam flagged by users, and it starts flagging them without your intervention (Figure 1-3). Another area where Machine Learning shines is for problems that either are too complex for traditional approaches or have no known algorithm. For example, consider speech recognition.
  • #11 Finally, Machine Learning can help humans learn (Figure 1-4). ML algorithms can be inspected to see what they have learned (although for some algorithms this can be tricky). For instance, once a spam filter has been trained on enough spam, it can easily be inspected to reveal the list of words and combinations of words that it believes are the best predictors of spam. Sometimes this will reveal unsuspected correlations or new trends, and thereby lead to a better understanding of the problem. Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent. This is called data mining.
  • #15 In supervised learning, the training set you feed to the algorithm includes the desired solutions, called labels (Figure 1-5). A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails. Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression (Figure 1-6).1 To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).
  • #17 In unsupervised learning, as you might guess, the training data is unlabeled (Figure 1-7). The system tries to learn without a teacher.
  • #18 For example, say you have a lot of data about your blog’s visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors (Figure 1-8). At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help. For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends. If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group.
  • #19 Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted (Figure 1-9). A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be strongly correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.
  • #20 Yet another important unsupervised task is anomaly detection—for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is shown mostly normal instances during training, so it learns to recognize them; then, when it sees a new instance, it can tell whether it looks like a normal one or whether it is likely an anomaly (see Figure 1-10).
  • #21 Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to one another.
  • #22 For example, say you have a lot of data about your blog’s visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors (Figure 1-8). At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help. For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends. If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group. A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be strongly correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.
  • #23 Since labeling data is usually time-consuming and costly, you will often have plenty of unlabeled instances, and few labeled instances. Some algorithms can deal with data that’s partially labeled. This is called semisupervised learning (Figure 1-11). Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just add one label per person and it is able to name everyone in every photo, which is useful for searching photos.
  • #24 The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as shown in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation. For example, many robots implement Reinforcement Learning algorithms to learn how to walk. DeepMind’s AlphaGo program is also a good example of Reinforcement Learning: it made the headlines in May 2017 when it beat the world champion Ke Jie at the game of Go. It learned its winning policy by analyzing millions of games, and then playing many games against itself. Note that learning was turned off during the games against the champion; AlphaGo was just applying the policy it had learned.
  • #25 In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.
  • #27 In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation once they were given enough data (as you can see in Figure 1-20). As the authors put it, “these results suggest that we may want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development.” The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled “The Unreasonable Effectiveness of Data”, published in 2009. It should be noted, however, that small- and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data—so don’t abandon algorithms just yet.
  • #28 For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries. If you train a linear model on this data, you get the solid line, while the old model is represented by the dotted line. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries (in fact, they seem unhappier), and conversely some poor countries seem happier than many rich countries. By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.
  • #29 Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. The following are a couple of examples of when you’d want to clean up training data: • If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually. • If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it.
  • #31 Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.
  • #33 It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error.
  • #34 A common solution to this problem is called holdout validation: you simply hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the validation set (or sometimes the development set, or dev set). More specifically, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set), and you select the model that performs best on the validation set. After this holdout validation process, you train the best model on the full training set (including the validation set), and this gives you the final model. Lastly, you evaluate this final model on the test set to get an estimate of the generalization error.