ML UNIT-I.ppt

February 21, 2023 SIT1305 Machine Learning 1
SIT1305- MACHINE LEARNING
Course In-Charges:
Dr.A.Mary Posonia
Dr.B.Ankayarkanni

SIT1305- MACHINE LEARNING
TEXT / REFERENCE BOOKS
Ethem Alpaydin, “Introduction to Machine Learning”,
MIT Press,2004
Tom Mitchell, “Machine Learning”, McGraw Hill,
1997.

UNIT 1 INTRODUCTION TO MACHINE LEARNING
• Machine learning - examples of machine
learning applications - Learning associations -
Classification - Regression - Unsupervised
learning - Supervised Learning - Learning
class from examples - PAC learning -
Noise, model selection and generalization -
Dimension of supervised machine learning
algorithm.

What is machine learning?
• A branch of artificial intelligence, concerned
with the design and development of
algorithms that allow computers to evolve
behaviors based on empirical data.
• As intelligence requires knowledge, it is
necessary for the computers to acquire
knowledge.

Artificial Intelligence

Machine Learning
• “Field of study that gives computers the ability to learn
without being explicitly programmed”
• “Learning is any process by which a system improves
performance from experience”

Traditional Programming vs Machine Learning

How a software developer creates
a solution

How a data engineer develops a
solution using machine learning

What is Machine Learning?
Aspect of AI: creates knowledge
Definition:
“changes in [a] system that ... enable [it] to do the same task
or tasks drawn from the same population more efficiently
and more effectively the next time.” (Simon 1983)
There are two ways that a system can improve:
1. By acquiring new knowledge
– acquiring new facts
– acquiring new skills
2. By adapting its behavior
– solving problems more accurately
– solving problems more efficiently

Tom Mitchell provides a more modern
definition:
“A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E.”

What is Learning?
• Herbert Simon: “Learning is any process by
which a system improves performance from
experience.”
• What is the task?
– Classification
– Categorization/clustering
– Problem solving / planning / control
– Prediction
– others

Example
• Imagine you have a set of number pairs.
• Only the first number of a pair is fed into the machine, which must predict the
other half of the pair.
(2,4),(3,6),(4,8)
The computer program has to predict the second number for
(5,?)
The program first needs to find the logic that relates the pairs
and then apply the same logic to predict the missing number.
Finding that logic from the examples is called “machine learning”.
Once the logic is found, it can be applied to
predict the second number for any new input.
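The pair example can be sketched in a few lines of Python. This is only an illustration: it assumes the hidden logic has the form y = w * x (the slide does not say which family of rules the program searches), and the helper name fit_multiplier is made up for this sketch.

```python
# Training pairs from the slide: (first number, second number)
pairs = [(2, 4), (3, 6), (4, 8)]

def fit_multiplier(data):
    """Least-squares estimate of w in the assumed model y = w * x."""
    numerator = sum(x * y for x, y in data)
    denominator = sum(x * x for x, _ in data)
    return numerator / denominator

# "Learning": estimate the rule from the examples
w = fit_multiplier(pairs)

# "Prediction": apply the learned rule to the unseen input 5
prediction = w * 5
print(f"learned w = {w}, prediction for (5, ?) = {prediction}")
```

With these three pairs the estimate is w = 2, so the predicted partner of 5 is 10. A different model family (for example y = x + c) would not fit the data, which is why choosing the family of rules is part of the learning problem.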

Types of Learning

When Should You Use Machine
Learning?
• Consider using machine learning when you have a
complex task or problem involving a large amount of
data and lots of variables, but no existing formula or
equation.
• For example, machine learning is a good option if you
need to handle situations like these:
– Hand-written rules and equations are too complex—as in
face recognition and speech recognition.
– The rules of a task are constantly changing—as in fraud
detection from transaction records.
– The nature of the data keeps changing, and the program
needs to adapt—as in automated trading, energy
demand forecasting, and predicting shopping trends.

• Machine Learning is used when:
– Human expertise does not exist (navigating on
Mars),
– Humans are unable to explain their expertise
(speech recognition)
– Solution changes in time (routing on a computer
network)
– Solution needs to be adapted to particular cases
(user biometrics)

Multidisciplinary Field
Machine learning is primarily concerned
with the accuracy and effectiveness of
the computer system.
psychological models
data
mining
cognitive science
decision theory
information theory
databases
machine
learning
neuroscience
statistics
evolutionary
models
control theory

Why now?
• Flood of available data (especially with the advent of
internet)
• Increasing computational power.
• Growing progress in available algorithms and theory
developed by researchers.
• Increasing support from industries.

Storage units

Applications of Machine learning
•Machine learning is a buzzword for today's technology, and it is growing
very rapidly day by day.
•We are using machine learning in our daily life even without knowing it
such as Google Maps, Google assistant, Alexa, etc.

Real-World Applications
• With the rise in big data, machine learning has become particularly
important for solving problems in areas like these:
– Computational finance, for credit scoring and algorithmic
trading
– Image processing and computer vision, for face recognition,
motion detection, and object detection
– Computational biology, for tumor detection, drug discovery, and
DNA sequencing
– Energy production, for price and load forecasting
– Automotive, aerospace, and manufacturing, for predictive
maintenance
– Natural language processing

Machine Learning Trends

“Telephone took 75 years to reach 50
million users, radio 38 yrs, television
13 yrs, Internet 4 yrs, Facebook 19
months, Pokemon Go 19
days. AarogyaSetu, India’s app to fight
COVID-19 has reached 50 mn users in
just 13 days-fastest ever globally for
an App,” Kant said in his tweet.
Machine Learning Trends
The app will calculate this based on their interaction with
others, using cutting edge Bluetooth technology, algorithms
and artificial intelligence.

What Machine Learning can do
 Finding which category an object belongs to -- by
Classification Algorithm
 Finding what is strange -- by Anomaly Detection Algorithm
 Finding how much and how many -- by Regression
Algorithm
 Finding how data is arranged – by Clustering Algorithm
 What should I do next -- by Reinforcement Algorithm

How Do You Decide Which Algorithm
to Use?
• Choosing the right algorithm can seem overwhelming—there
are dozens of supervised and unsupervised machine learning
algorithms, and each takes a different approach to learning.
• There is no best method or one size fits all. Finding the right
algorithm is partly just trial and error—even highly
experienced data scientists can’t tell whether an algorithm
will work without trying it out.
• But algorithm selection also depends on the size and type of
data you’re working with, the insights you want to get from
the data, and how those insights will be used.

Questions to Consider Before You
Start
• Every machine learning workflow begins with three questions:
– What kind of data are you working with?
– What insights do you want to get from it?
– How and where will those insights be applied?
• Your answers to these questions help you decide whether to
use supervised or unsupervised learning.

Machine Learning Workflow

Understanding Machine Learning
Machine Learning vs Statistical Inference vs Pattern
Recognition vs Data Mining
Perspective 1
same concepts evolving in different scientific traditions
• Statistical Inference (SI): field of Applied Mathematics
• Machine Learning (ML): field of AI
• Pattern Recognition (PR): branch of Computer Science
focused on perception problems (image processing,
speech recognition, etc.)
• Data Mining (DM): field of Database Engineering

Perspective 2: slight conceptual differences
• Statistical Inference: inference based on probabilistic
models built on data. Located at the intersection of
Mathematics and Artificial Intelligence (AI)
• Machine Learning: methods tend to be more heuristic in
nature
• Pattern Recognition: most authors hold that it is the same
thing as machine learning
• Data Mining: applied machine learning. Involves issues such
as data pre-processing, data cleaning, transformation,
integration or visualization. Involves machine learning, plus
computer science and database systems.
Understanding Machine Learning

Designing a Learning System
• Choose the training experience
• Choose exactly what is to be learned, i.e. the target
function.
• Choose how to represent the target function.
• Choose a learning algorithm to infer the target
function from the experience.
Environment/
Experience
Learner
Knowledge
Performance
Element

Types of Machine Learning

Supervised learning
– Generates a function that maps inputs to desired
outputs.
• For example, in a classification problem, the
learner approximates a function mapping a
vector into classes by looking at input-output
examples of the function
– Probably, the most common paradigm
– E.g., decision trees, support vector machines, Naïve
Bayes, k-Nearest Neighbors, …
Learning Paradigms

Machine learning structure
• Supervised learning

• Unsupervised learning
– Labels are not known during training
– E.g., clustering, association learning
• Semi-supervised learning
– Combines both labeled and unlabeled examples to
generate an appropriate function or classifier
– E.g., Transductive Support Vector Machine
Learning Paradigms

• Semi-supervised learning
The goal of a semi-supervised model is to classify some of the
unlabeled data using the labeled information set.
• Speech Analysis
• Protein Sequencing
• Web content analysis

Reinforcement Learning
• In the absence of training dataset, it is bound to learn from its
experience.
• We have an agent and a reward, with many hurdles in between.
The agent is supposed to find the best possible path to reach the
reward.
Types of Reinforcement: positive, negative, punishment, and
extinction.

• Reinforcement learning
– It is concerned with how an agent should take actions in an
environment so as to maximize some notion of cumulative
reward.
• Reward given if some evaluation metric improved
• Punishment in the reverse case
– E.g., Q-learning, Sarsa

Machine Learning Perspectives

Algorithms
• Supervised learning
– Prediction
– Classification (discrete labels), Regression (real values)
• Unsupervised learning
– Clustering
– Probability distribution estimation
– Finding association (in features)
– Dimension reduction
• Semi-supervised learning
• Reinforcement learning
– Decision making (robot, chess machine)

• Classification
– Learn a way to classify unseen examples, based on a
set of labeled examples, e.g., classify songs by
emotion categories. E.g., decision trees (such as C4.5)
• Regression
– Learn a way to predict continuous output values,
based on a set of labeled examples, e.g., predict
software development effort in person months
– Sometimes regarded as numeric classification
(outputs are continuous instead of discrete)
– E.g., Support Vector Regression
ML Algorithms

• Association
– Find any association among features, not just input-
output associations (e.g., in a supermarket, find that
clients who buy apples also buy cereals)
– E.g., Apriori
• Clustering
– Find natural grouping among data
– E.g., K-means clustering, DBSCAN, Hierarchical
clustering
ML Algorithms

Machine Learning Process

Training and testing
• Training is the process of making the system able to learn.
• No free lunch rule:
– Training set and testing set come from the same distribution
– Need to make some assumptions or bias

• There are several factors affecting the performance:
– Types of training provided
– The form and extent of any initial background
knowledge
– The type of feedback provided
– The learning algorithms used
• Two important factors:
– Modeling
– Optimization
Performance

• Different ML traditions propose different approaches
inspired by real-world analogies
– Neural networks researchers: emphasize analogies
to neurobiology
– Case-based learning: human memory
– Genetic algorithms: evolution
– Rule induction: heuristic search
– Analytic methods: reasoning in formal logic
• Again, different notation and terminology
Machine Learning Traditions

Different Varieties of Machine Learning
• Concept Learning
• Clustering Algorithms
• Connectionist Algorithms
• Genetic Algorithms
• Explanation-based Learning
• Transformation-based Learning
• Reinforcement Learning
• Case-based Learning
• Macro Learning
• Evaluation Functions
• Cognitive Learning Architectures
• Constructive Induction
• Discovery Systems
• Knowledge capture

• Black-box
– Learned model internals are practically incomprehensible
• E.g., Neural Networks, Support Vector Machines
• Transparent-box
– Learned model internals are understandable, interpretable
• E.g., explicit rules, decision-trees
• Instance-based or case-based learning
– Represents knowledge in terms of specific cases or
experiences
– Relies on flexible matching methods to retrieve these cases
and apply them to new situations
– E.g., k-Nearest Neighbors
Learning Paradigms

Machine Learning Applications

Machine Learning touching our Daily Life
Walmart uses robots in
its stores for inventory
management, packing,
pricing checks
Restaurants
have Robot
chefs and
Waiters

Michelangelo, an internal ML-as-a-service platform that
democratizes machine learning and makes scaling AI to
meet the needs of business as easy as requesting a ride.

Song Recommendations
based on mood and interest
Data Acquisition from Tamr – Enterprise
Data Unification Company
Content specific vaccines
for Children

Amazon – Game Changer of the Decade

Machine Learning in Civil Engineering
 Design of Construction Management System
 Prediction of the Severity of Earthquakes
 Better analysis of monitoring the construction
health
 Analysis of Environmental Engineering
 Design of Highway and transportation
Engineering for the prediction of Transport arrivals
and pedestrian movement analysis
 Use of Machine learning in surveying,
Geotechnical and Geospatial Engineering

Machine Learning in Mechanical
Engineering
 Cognitive Science of a Machine
 Use of IoT and Big Data Analytics
 On site performance of devices
 Non-linear root cause analysis
 Tools for analytics and operations

Autonomous Cars - ALVINN

Autonomous Driving Cars

Adaptive Highbeam
Automatically and
continuously
adapts the
headlamp range
to the distance of
vehicles ahead or
which are
incoming

Predicting mechanical failure
• By continuously monitoring data (power plant,
manufacturing unit operations) and providing
them to smart decision support systems,
manufacturers can predict the probability of
failure.
• Predictive maintenance is an emerging field in
industrial applications that helps in determining
the condition of in-service equipment to estimate
the optimum time of maintenance.
• ML-based predictive maintenance saves cost
and time on routine or preventive maintenance.

AI for automatically segmenting brain
tumors
 Artificial Intelligence has a broad scope in
healthcare devices and applications.
 Makes analysis, treatment, and monitoring of
tumors more effective.
 NVIDIA has developed a 3D MRI brain tumor
segmentation using deep-learning and 3D
magnetic resonance imaging technologies.

Personalization
Sophia
social humanoid
Google Assistant Amazon Alexa
Virtual Assistant
Personalization Platforms

Machine Learning Applications across Industries

Robotics and ML
 Areas that robots are used:
 Industrial robots
 Military, government and space robots
 Service robots for home, healthcare, laboratory
 Why are robots used?
 Dangerous tasks or in hazardous environments
 Repetitive tasks
 High precision tasks or those requiring high quality
 Labor savings
 Control technologies:
 Autonomous (self-controlled), tele-operated (remote
control)

Industrial Robots
• Uses for robots in manufacturing:
– Welding
– Painting
– Cutting
– Dispensing
– Assembly
– Polishing/Finishing
– Material Handling
• Packaging, Palletizing
• Machine loading

Industrial Robots
• Uses for robots in industry/Manufacturing
– Automotive
– Packaging

Military/Government Robots
• iRobot PackBot
 Remotec Andros

Military/Government Robots
Soldiers in Afghanistan being trained how to defuse a landmine using a PackBot.

Military Robots
• Aerial drones (UAV)  Military suit

Space Robots
• Mars Rovers – Spirit and Opportunity
– Autonomous navigation features with human
remote control and oversight

Service Robots
• Many uses…
– Cleaning & Housekeeping
– Humanitarian Demining
– Rehabilitation
– Inspection
– Agriculture & Harvesting
– Lawn Mowers
– Surveillance
– Mining Applications
– Construction
– Automatic Refilling
– Fire Fighters
– Search & Rescue
iRobot Roomba vacuum cleaner robot

Medical/Healthcare Applications
DaVinci surgical robot by Intuitive Surgical.
St. Elizabeth Hospital is one of the local hospitals using this robot. You can
see this robot in person during an open house (website).
Japanese health care assistant suit
(HAL - Hybrid Assistive Limb)
Also… Mind-controlled
wheelchair using NI LabVIEW

Laboratory Applications
Drug discovery Test tube sorting

AI vs Machine Learning vs Deep Learning
• Artificial Intelligence: programs with the ability to learn and reason like
humans
• Machine Learning: algorithms with the ability to learn without being
explicitly programmed
• Deep Learning: subset of Machine Learning in which Artificial Neural
Networks adapt and learn from vast amounts of data

Deep Learning

Machine Intelligence Landscape

Future of machine learning
• Improved unsupervised algorithms
• Enhanced personalization
• Increased adoption of quantum
computing
• Improved cognitive services
• Rise of robots

Future of Machine Learning
Gartner Predictions

Technology Trends 2020

Industry 4.0 the Future

Skills Required

Top Machine Learning Software Tools
Software | Platform | Written in | Algorithms or Features
Scikit Learn | Linux, Mac OS, Windows | Python, Cython, C, C++ | Classification, Regression, Clustering, Preprocessing, Model Selection, Dimensionality reduction
PyTorch | Linux, Mac OS, Windows | Python, C++, CUDA | Autograd Module, Optim Module, nn Module
TensorFlow | Linux, Mac OS, Windows | Python, C++, CUDA | Provides a library for dataflow programming
Weka (Waikato Environment for Knowledge Analysis) | Linux, Mac OS, Windows | Java | Data preparation, Classification, Regression, Clustering, Visualization, Association rules mining

Software | Platform | Written in | Algorithms or Features
KNIME (Konstanz Information Miner) | Linux, Mac OS, Windows | Java | Can work with large data volumes. Supports text mining & image mining through plugins
Colab | Cloud Service | - | Supports libraries of PyTorch, Keras, TensorFlow, and OpenCV
Apache Mahout | Cross-platform | Java, Scala | Preprocessors, Regression, Clustering, Recommenders, Distributed Linear Algebra
Accord.NET | Cross-platform | C# | Classification, Regression, Distribution, Clustering, Hypothesis Tests & Kernel Methods, Image, Audio & Signal & Vision

Software | Platform | Written in | Algorithms or Features
Shogun | Windows, Linux, UNIX, Mac OS | C++ | Regression, Classification, Clustering, Support vector machines, Dimensionality reduction, Online learning etc.
Keras.io | Cross-platform | Python | API for neural networks, supports CNN
RapidMiner | Cross-platform | Java | Data loading & transformation, Data preprocessing & visualization
Oryx2 | Cross-platform | Python | Collaborative filtering, classification, regression, DL, CNN

Supervised Learning
• Train the algorithm using data which is well labelled,
which means the data is already tagged with the correct
answers
Ex: Given a basket filled with different kinds of
fruits, train the algorithm on all the different
fruits

Supervised Learning
• Classification: when the output variable is a category,
such as “Red” or “Blue”, or “disease” or “No-disease”
• Regression: when the output variable is a continuous
value, such as “dollars” or “weight”. It is used for
attributes with continuous values.
Ex: Continuous marks of 70 students for a particular
subject:
Ass1 Ass2
80 70
83 72
……

Unsupervised Learning
• Train the algorithm, without any guidance, using information
that is neither classified nor labelled
• Given a set of data, the algorithm groups it based on
similarities, patterns and differences.
Ex: Given an image containing both animals and birds
Technique: Clustering

Parameters | Supervised machine learning technique | Unsupervised machine learning technique
Process | In a supervised learning model, input and output variables are given. | In an unsupervised learning model, only input data is given.
Input Data | Algorithms are trained using labeled data. | Algorithms are used against data which is not labeled.
Algorithms Used | Support vector machine, neural network, linear and logistic regression, random forest, and classification trees. | Unsupervised algorithms can be divided into different categories, such as cluster algorithms, K-means, hierarchical clustering, etc.
Computational Complexity | Supervised learning is a simpler method. | Unsupervised learning is computationally complex.
Use of Data | Supervised learning model uses training data to learn a link between the inputs and the outputs. | Unsupervised learning does not use output data.
Accuracy of Results | Highly accurate and trustworthy method. | Less accurate and trustworthy method.
Real Time Learning | Learning method takes place offline. | Learning method takes place in real time.
Number of Classes | Number of classes is known. | Number of classes is not known.
Main Drawback | Classifying big data can be a real challenge in supervised learning. | You cannot get precise information regarding data sorting and the output, as the data used in unsupervised learning is not labeled.

Questions
1. A computer program is said to learn from experience E with respect to
some task T and some performance measure P if its performance on T, as
measured by P, improves with experience E. Suppose we feed a learning
algorithm a lot of historical weather data, and have it learn to predict
weather. In this setting, what is E?
2. Suppose you are working on weather prediction, and you would like to
predict whether or not it will be raining at 5pm tomorrow. You want to
use a learning algorithm for this. Would you treat this as a classification
or a regression problem?
3. Suppose you are working on stock market prediction, and you would like
to predict the price of a particular stock tomorrow (measured in dollars).
You want to use a learning algorithm for this. Would you treat this as a
classification or a regression problem?

1. Take a collection of 1000 essays written on the US Economy, and find a
way to automatically group these essays into a small number of groups of
essays that are somehow "similar" or "related".
Answer: This is an unsupervised learning/clustering problem (similar to the
Google News example in the lectures).
2. Given a large dataset of medical records from patients suffering from
heart disease, try to learn whether there might be different clusters of
such patients for which we might tailor separate treatments.
Answer: This can be addressed using an unsupervised learning (clustering)
algorithm, in which we group patients into different clusters.
3. Given genetic (DNA) data from a person, predict the odds of him/her
developing diabetes over the next 10 years.
Answer: This can be addressed as a supervised learning (classification)
problem, where we can learn from a labeled dataset comprising different
people's genetic data, and labels telling us if they had developed
diabetes.
4. Given 50 articles written by male authors, and 50 articles written by
female authors, learn to predict the gender of a new manuscript's author
(when the identity of this author is unknown).
Answer: This can be addressed as a supervised learning (classification)
problem, where we learn from the labeled data to predict gender.

5. In farming, given data on crop yields over the last 50 years, learn to
predict next year's crop yields.
Answer: This can be addressed as a supervised learning problem, where we
learn from historical data (labeled with historical crop yields) to predict
future crop yields.
6. Examine a large collection of emails that are known to be spam email, to
discover if there are sub-types of spam mail.
Answer: This can be addressed using a clustering (unsupervised learning)
algorithm, to cluster spam mail into sub-types.
7. Examine the statistics of two football teams, and predict which team
will win tomorrow's match (given historical data of teams' wins/losses to
learn from).
Answer: This can be addressed using supervised learning, in which we learn
from historical records to make win/loss predictions.

Learning Class by Example
• Class C of a “family car”
– Prediction: Is car x a family car?
– Knowledge extraction: What do people expect from a
family car?
• Output:
Positive (+) and negative (–) examples
• Input representation:
x1: price, x2 : engine power

Training set X
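$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases}$$

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$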
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press
(V1.0)

Class C
$$(p_1 \le \text{price} \le p_2) \ \text{AND} \ (e_1 \le \text{engine power} \le e_2)$$

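Hypothesis class H
$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases}$$

Error of h on the training set X
$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} 1\big(h(\mathbf{x}^t) \ne r^t\big)$$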
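The empirical error counts the training examples on which a hypothesis disagrees with the label. The sketch below assumes an axis-aligned rectangle hypothesis over (price, engine power); the class name Rectangle, the numeric data, and the units are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rectangle:
    """Hypothesis h: positive iff p1 <= price <= p2 and e1 <= power <= e2."""
    p1: float
    p2: float
    e1: float
    e2: float

    def predict(self, x):
        price, power = x
        inside = self.p1 <= price <= self.p2 and self.e1 <= power <= self.e2
        return 1 if inside else 0

def empirical_error(h, training_set):
    """E(h | X): number of examples where h(x^t) differs from r^t."""
    return sum(1 for x, r in training_set if h.predict(x) != r)

# Hypothetical training set: ((price, engine power), label r)
training_set = [
    ((15, 100), 1), ((18, 120), 1), ((22, 140), 1),
    ((8, 60), 0), ((40, 300), 0), ((20, 250), 0),
]

h_good = Rectangle(p1=12, p2=25, e1=90, e2=150)
h_bad = Rectangle(p1=10, p2=19, e1=50, e2=130)

print(empirical_error(h_good, training_set))  # consistent hypothesis
print(empirical_error(h_bad, training_set))   # misses one positive example
```

Here h_good classifies all six examples correctly, while h_bad excludes the positive example (22, 140) and so has an error of one.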

S, G, and the Version Space
• S: the most specific hypothesis
• G: the most general hypothesis
• Any h ∈ H between S and G is consistent with the training set;
together these hypotheses make up the version space
(Mitchell, 1997)
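For rectangle hypotheses, the most specific hypothesis S is the tightest rectangle that contains every positive example. A minimal sketch, assuming hypothetical (price, engine power) data and made-up helper names:

```python
def most_specific_hypothesis(training_set):
    """Return S as (p1, p2, e1, e2): the tightest rectangle around the positives."""
    positives = [x for x, r in training_set if r == 1]
    prices = [price for price, _ in positives]
    powers = [power for _, power in positives]
    return (min(prices), max(prices), min(powers), max(powers))

def is_consistent(rect, training_set):
    """True if the rectangle labels every training example correctly."""
    p1, p2, e1, e2 = rect
    for (price, power), r in training_set:
        inside = p1 <= price <= p2 and e1 <= power <= e2
        if inside != (r == 1):
            return False
    return True

# Hypothetical training set: ((price, engine power), label r)
training_set = [
    ((15, 100), 1), ((18, 120), 1), ((22, 140), 1),
    ((8, 60), 0), ((40, 300), 0), ((20, 250), 0),
]

S = most_specific_hypothesis(training_set)
print(S, is_consistent(S, training_set))
```

Any rectangle that contains S and still excludes all negative examples is also consistent; the largest such rectangle plays the role of G.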

Examples of Machine Learning Applications
Tasks are classified into two categories:
1. Descriptive- characterize the properties of data
2. Predictive – Inference on current data to make predictions
Functionalities:
1.Class/Concept Description
2. Associations
3.Classification and Prediction
4.Clustering
5.Outliers

1.Class/Concept Description
• Class/Concept refers to the data to be associated with the
classes or concepts.
– Data Characterization − This refers to summarizing data of
class under study. This class under study is called as Target
Class.
– Data Discrimination − It refers to the mapping or
classification of a class with some predefined group or
class.

2. Associations
• Find frequent elements/items
• Find the associations between them/relationships
• Single dimensional association rule
Buys(‘X’, ‘Computer’) → Buys(‘X’, ‘Software’)
• Multidimensional association rule
Age(‘X’, ’40-60’) and income(‘X’, ’50-60 lakhs’) → Buys(‘X’,
‘Computer’)
Algorithms:
1. Apriori
2. Frequent pattern growth tree

Classification
• Example: Credit scoring
• Differentiating between
low-risk and high-risk
customers from their
income and savings
Discriminant: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
Model
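The discriminant above is a rule with two thresholds. The sketch below hard-codes the thresholds for illustration; in practice θ1 and θ2 would be learned from labeled customer data, and the numeric values and function name here are hypothetical.

```python
def credit_risk(income, savings, theta1, theta2):
    """Apply: IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

# Hypothetical thresholds (same currency unit for income and savings)
THETA1 = 50_000
THETA2 = 10_000

print(credit_risk(income=72_000, savings=15_000, theta1=THETA1, theta2=THETA2))
print(credit_risk(income=72_000, savings=4_000, theta1=THETA1, theta2=THETA2))
```

Both conditions must hold for a customer to be labeled low-risk, so the second call returns high-risk even though the income is above its threshold.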

Prediction: Regression
• Example: Price of a used car
• x : car attributes
y : price
y = g (x | θ )
g ( ) model,
θ parameters
y = wx+w0
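The parameters w and w0 of the linear model can be estimated by least squares. The sketch below uses the closed-form solution for a single input; the choice of car age as the attribute x and all numeric values are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of (w, w0) for the model y = w * x + w0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx
    w0 = mean_y - w * mean_x
    return w, w0

# Hypothetical data: car age in years and price in thousands
ages = [1, 2, 3, 4, 5]
prices = [18.0, 16.0, 14.0, 12.0, 10.0]

w, w0 = fit_line(ages, prices)
predicted_price = w * 6 + w0  # price estimate for a 6-year-old car
print(w, w0, predicted_price)
```

The toy data lie exactly on the line price = 20 - 2 * age, so the fit recovers a slope of about -2 and an intercept of about 20. Real data would scatter around the line and the fit would minimise the squared error.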

Learning Associations
• Imagine that you are a sales manager at AllElectronics, and
you are talking to a customer who recently bought a PC and a
digital camera from the store. What should you recommend
to her/him next?
• Frequent patterns and association rules are the knowledge
that you want to mine in such a scenario.
• Finding frequent patterns plays an essential role in mining
associations, correlations, and many other interesting
relationships among data.
• Moreover, it helps in data classification, clustering, and other
data mining tasks.

What is Association Mining?
• Motivation: finding regularities in data
– What products were often purchased together?
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?

• Frequent patterns are patterns that appear frequently in a
data set.
– a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent
itemset.
– A subsequence, such as buying first a PC, then a digital
camera, and then a memory card, if it occurs frequently in
a shopping history database, is a (frequent) sequential
pattern.
– A substructure can refer to different structural forms, such
as subgraphs, subtrees, or sublattices, which may be
combined with itemsets or subsequences.
• If a substructure occurs frequently, it is called a (frequent)
structured pattern.

Association Rules
• An association rule is an implication of the form X → Y where
X is the antecedent and Y is the consequent of the rule.
• To find the dependency between two items X and Y.
• P (Y | X ) probability that somebody who buys X also buys Y
where X and Y are products/services.

Example Association Rule
90% of transactions that purchase bread and butter also
purchase milk
“IF” part = antecedent
“THEN” part = consequent
“Item set” = the items (e.g., products) comprising the
antecedent or consequent
• Antecedent and consequent are disjoint (i.e., have no items in
common).
Antecedent: bread and butter
Consequent: milk
Confidence factor:90%

Three measures
• Support of the association rule X→Y:
$$\text{support}(X, Y) = P(X, Y) = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers}\}}$$
• Confidence of the association rule X→Y:
$$\text{confidence}(X \to Y) = P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers who bought } X\}}$$
• Lift or interest of the association rule X→Y:
$$\text{lift}(X \to Y) = \frac{P(X, Y)}{P(X)\,P(Y)} = \frac{P(Y \mid X)}{P(Y)}$$
• Goal: Find all rules that satisfy the user-specified minimum support (min.sup) and
minimum confidence (min.conf).
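The three measures can be computed directly from a list of transactions. The sketch below uses a small hypothetical transaction list and made-up function names; each measure is the ratio given by the definitions of support, confidence and lift.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """P(Y | X) = P(X, Y) / P(X)."""
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    """P(Y | X) / P(Y); values above 1 suggest X makes Y more likely."""
    return confidence(x, y, transactions) / support(y, transactions)

# Hypothetical market-basket data
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "milk"},
]

X = {"bread", "butter"}
Y = {"milk"}

rule_support = support(X | Y, transactions)
rule_confidence = confidence(X, Y, transactions)
rule_lift = lift(X, Y, transactions)
print(rule_support, rule_confidence, rule_lift)
```

In this toy data the rule {bread, butter} → {milk} has a lift below 1, because milk appears in 4 of 5 baskets overall but only in 2 of the 3 baskets that contain bread and butter.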



Market Basket Analysis: A Motivating
Example
• Market basket analysis, the earliest form of frequent pattern mining
for association rules.
• Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational data
sets.
• With massive amounts of data continuously being collected and
stored, many industries are becoming interested in mining such
patterns from their databases.
• Help in many business decision-making processes such as catalog
design, cross-marketing, and customer shopping behavior analysis.

Apriori Algorithm
• Proposed by Agrawal et al. in 1996.
• Initially used for Market Basket Analysis to find how items
purchased by customers are related.
• Two steps:
1. Finding frequent itemsets, that is, those which have
enough support.
2. Converting them to rules with enough confidence, by
splitting the items into two, as items in the antecedent
and items in the consequent.

Association rule mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction.

Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}...
...
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I:the set of all items sold in the store
• A transaction: items purchased in a basket; it may have TID
(transaction ID)
• A transactional dataset: A set of transactions

Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a
min.support threshold
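Support count and support can be checked with a short script. The transaction table these numbers refer to is not reproduced in the text, so the five transactions below are an assumption: a commonly used textbook table that yields the same values, σ = 2 and s = 2/5. The minimum support threshold is also hypothetical.

```python
# Assumed five-transaction table (not given in the text)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(itemset): number of transactions containing the whole itemset."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """s(itemset): fraction of transactions containing the whole itemset."""
    return support_count(itemset, db) / len(db)

itemset = {"Milk", "Bread", "Diaper"}
sigma = support_count(itemset, transactions)
s = support(itemset, transactions)

MIN_SUPPORT = 0.4  # hypothetical threshold
is_frequent = s >= MIN_SUPPORT
print(sigma, s, is_frequent)
```

With a threshold of 0.4 the 3-itemset just qualifies as frequent. Raising the threshold above 2/5 would exclude it.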

Frequent Itemset Mining Methods
Apriori Algorithm:
Finding Frequent Itemsets by Confined Candidate
Generation.

Example
• A database has nine transactions, that is, |D| = 9. Minimum
support count is 2 and minimum confidence is 60%.
a) Find all frequent itemsets using Apriori.
b) List all the strong association rules (with support s and
confidence c).

• Step-1: K=1
– Create a table containing the support count of each item
present in the dataset, called C1 (candidate set).
– Compare each candidate item's support count with the minimum
support count (given min_support = 2). This gives us
itemset L1.
[Tables: C1 and L1]

• Step-2: K=2
– Generate candidate set C2 using L1 (this is called the join
step). The condition for joining Lk-1 with Lk-1 is that the
itemsets have (K-2) elements in common.
– Check whether all subsets of each itemset are frequent; if
not, remove that itemset.
– Now find the support count of these itemsets by searching
the dataset.
[Tables: L1 and C2]

• compare candidate (C2)
support count with
minimum support count,this
gives us itemset L2.
C2
L2

• Step-3: K=3
– Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with
Lk-1 is that the two itemsets have (K-2) elements in common, so here, for L2, the
first element should match.
– Check whether all subsets of each of these itemsets are frequent; if not,
remove that itemset.
– (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are all frequent. For
{I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Check every
itemset in the same way.)
– Find the support count of the remaining itemsets by searching the dataset.
Itemsets checked in this step:
{I1, I2, I3}
{I1, I2, I5}
{I1, I2, I4}
{I1, I3, I5}
{I2, I3, I4}
{I2, I3, I5}
{I2, I4, I5}
[Tables: L2, C3, L3]

• Step-4: K=4
• Generate candidate set C4 using L3 (join step). The condition for
joining Lk-1 with Lk-1 (K=4) is that the two itemsets have (K-2)
elements in common. So here, for L3, the first 2 elements (items)
should match.
• Check whether all subsets of these itemsets are frequent (here the
itemset formed by joining L3 is {I1, I2, I3, I5}, and its subsets
include {I1, I3, I5}, which is not frequent).
• So there is no itemset in C4.
• Stop, because no further frequent itemsets are found.

Apriori Algorithm
• Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k+1)-itemsets.
• Apriori property: All nonempty subsets of a frequent itemset must also be
frequent.
• A two-step process is followed:
1. The Join step: To find Lk, a set of candidate k-itemsets is
generated by joining Lk-1 with itself. The set of candidates is
denoted Ck.
2. The Prune step: Ck is a superset of Lk, that is, its members
may or may not be frequent, but all of the frequent
k-itemsets are included in Ck. To reduce the size of Ck, the
Apriori property is used.
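The join and prune steps can be sketched as follows. This is a minimal illustration, not an optimized implementation: the join simply unions any two frequent (k-1)-itemsets whose union has k items, instead of the prefix-based join, and relies on the prune step to discard the extra candidates. The nine-transaction dataset `D` is an assumption in the style of the worked example (the slide's own table is an image), so individual counts may differ from the slide.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Scan the dataset once per level to get support counts.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(items).items() if n >= min_count}  # L1
    frequent = dict(level)
    k = 2
    while level:
        prev = list(level)
        # Join step: union two frequent (k-1)-itemsets sharing k-2 items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: n for c, n in count(candidates).items() if n >= min_count}  # Lk
        frequent.update(level)
        k += 1
    return frequent

# Assumed dataset with |D| = 9 (hypothetical, for illustration only).
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(D, min_count=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this assumed dataset the search stops at K=4 with two frequent 3-itemsets, {I1, I2, I3} and {I1, I2, I5}, in line with the walkthrough.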

• Generation of strong association rule.
• Calculate confidence of each rule.

Generating Association Rules
Rule generation for Itemset {I1, I2, I5} from L3

Rule generation for Itemset {I1, I2, I3} from L3
Rules                Confidence
{I1 ^ I2} → I3       2/4 = 0.50 = 50%
{I2 ^ I3} → I1       2/4 = 0.50 = 50%
{I1 ^ I3} → I2       2/4 = 0.50 = 50%
I3 → {I1 ^ I2}       2/5 = 0.40 = 40%
I1 → {I2 ^ I3}       2/6 = 0.33 = 33.33%
I2 → {I1 ^ I3}       2/7 = 0.2857 = 28.57%
As the given minimum confidence threshold is 60%, none of the rules generated
from this itemset qualifies as a strong association rule.
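The confidence of a rule A → B is the support count of A ∪ B divided by the support count of A. A minimal sketch of the calculation follows; the support counts are read off the numerators and denominators of the table above, and the names `rules_from` and `MIN_CONFIDENCE` are illustrative.

```python
from itertools import combinations

# Support counts taken from the fractions in the table above.
support_count = {
    frozenset({"I1", "I2", "I3"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I3"}): 4,
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I3"}): 5,
}

def rules_from(itemset, support_count):
    """Yield (antecedent, consequent, confidence) for every split of `itemset`."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            # confidence(A -> B) = count(A and B together) / count(A)
            confidence = support_count[itemset] / support_count[a]
            yield a, itemset - a, confidence

MIN_CONFIDENCE = 0.6
all_rules = list(rules_from({"I1", "I2", "I3"}, support_count))
strong = [rule for rule in all_rules if rule[2] >= MIN_CONFIDENCE]
for a, c, conf in all_rules:
    print(sorted(a), "->", sorted(c), f"{conf:.2%}")
```

With these counts the list `strong` stays empty, since the highest confidence is 50%.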

Example 2

B->C^E   C->B^E   E->B^C
B^C->E   B^E->C   C^E->B

Vapnik–Chervonenkis (VC) dimension
• Assume we have a dataset containing N points.
• These N points can be labeled in 2^N ways as positive and
negative examples.
• Therefore, 2^N different learning problems can be defined by N
data points.
• If for every one of these problems we can find a hypothesis h ∈ H
that separates the positive examples from the negative, then we
say H shatters N points.
• That is, any learning problem definable by N examples can be
learned with no error by a hypothesis drawn from H.
• The maximum number of points that can be shattered by H is
called the Vapnik-Chervonenkis(VC) dimension of H, is
denoted as VC(H), and measures the capacity of H.
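Shattering can be checked by brute force for a very small hypothesis class. The sketch below uses an assumed class H of one-dimensional threshold classifiers, h_t(x) = 1 if x > t, which is not taken from the slides; it enumerates all 2^N labelings and tests whether each is produced by some h in H.

```python
from itertools import product

def threshold_hypotheses(points):
    """Assumed class H: h_t(x) = 1 if x > t else 0, one t per distinct behaviour."""
    xs = sorted(set(points))
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return [lambda x, t=t: int(x > t) for t in cuts]

def shatters(points, hypotheses):
    """True if every one of the 2^N labelings is produced by some hypothesis."""
    achievable = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labels in achievable
               for labels in product((0, 1), repeat=len(points)))

one_point = [0.0]
two_points = [0.0, 1.0]
can_shatter_one = shatters(one_point, threshold_hypotheses(one_point))
can_shatter_two = shatters(two_points, threshold_hypotheses(two_points))
print(can_shatter_one, can_shatter_two)
```

One point can be shattered, but two cannot (the labeling "left positive, right negative" is never produced), so this assumed class has VC dimension 1.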

Vapnik–Chervonenkis (VC) dimension
• The Vapnik–Chervonenkis (VC) dimension is a measure of the
capacity (complexity, expressive power, richness, or flexibility)
of a set of functions that can be learned by a statistical binary
classification algorithm.
• It is defined as the cardinality of the largest set of points that
the algorithm can shatter.
• It was originally defined by Vladimir Vapnik and Alexey
Chervonenkis.

• When choosing a classifier for your data, an obvious question
to ask is “What kind of data can this classifier classify?”.
For example,
– if you know your points can easily be separated by a single
line, you may opt to choose a simple linear classifier,
– whereas if you know your points will be in many separate
groups, you may opt to choose a more powerful classifier
such as a random forest or multilayer perceptron.
• This fundamental question can be answered using a
classifier’s VC dimension, which is a concept from
computational learning theory that formally quantifies the
power of a classification algorithm.

Example
• The VC dimension for a linear classifier is at least 3, since it
can shatter this configuration of 3 points.
• In each of the 2³ = 8 possible assignments of positive and
negative, the classifier is able to perfectly separate the two
classes.

• Now, we show that the VC dimension of a linear classifier is lower than 4.
• In this configuration of 4 points, the classifier is unable to
segment the positive and negative classes in at least one
assignment.
• Two lines would be necessary to separate the two classes in
this situation.
• We actually need to prove that there does not exist a 4 point
configuration that can be shattered, but the same logic
applies to other configurations, so, for brevity’s sake, this
example is good enough.
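The two claims can be probed numerically. The sketch below uses the perceptron rule as a separability check: it finds a separating line whenever one exists, while failing to find one within the epoch budget is only evidence, not proof, that none exists. The point coordinates and the budget `max_epochs` are assumptions chosen for illustration.

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron check: True if a separating line w.x + b = 0 was found."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):      # y is +1 or -1
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified point
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # no separator found within the epoch budget

def shattered_by_lines(points):
    """True if every +/- labeling of the points was separated by some line."""
    return all(linearly_separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]           # assumed non-collinear configuration
four = [(0, 0), (1, 1), (1, 0), (0, 1)]    # assumed square configuration
three_ok = shattered_by_lines(three)
four_ok = shattered_by_lines(four)
xor_ok = linearly_separable(four, (1, 1, -1, -1))   # opposite corners share a label
print(three_ok, four_ok, xor_ok)
```

The labeling that defeats the square configuration is the XOR-style one, where opposite corners share a label.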

Applications of VC dimension
• In most cases, the exact VC dimension of a classifier is not so
important.
• Rather, it is used more to group different types of
algorithms by their complexity.
• For example, the class of simple classifiers could include basic
shapes like lines, circles, or rectangles, whereas a class of
complex classifiers could include classifiers such as multilayer
perceptrons, boosted trees, or other nonlinear classifiers.
• The complexity of a classification algorithm, which is directly
related to its VC dimension, is related to the trade-off
between bias and variance.

Bias and Variance
• Machine learning models are an incredibly powerful and
useful tool for data scientists.
• When building a model, it is important to remember that with
predictions come prediction errors.
• These errors are due to a combination of bias and variance
which have a trade-off relationship.
• Understanding these fundamentals is just the first step to
building an accurate model and avoiding the pitfalls of under-
fitting and over-fitting.

• In supervised machine learning, we are approximating a target
function (f) that maps input variables (X) to an output variable
(Y). The relationship is mathematically expressed as:
Y = f(X) + e
• where e represents the total error. The total error can actually
be further split into three parts:
Err(x) = Bias² + Variance + Irreducible Error

Bias
• Bias, or bias error, can be defined as the difference between
the expected prediction of our model and the correct value
which we are trying to predict.
• High bias can cause our model to miss significant relations
between our features (X) and target outputs (Y) so it cannot
learn the training data or generalize to new data.
• This is also known as under-fitting. Under-fitted models are
forced to make a lot of assumptions which can cause
inaccurate predictions.

Variance
• Variance is the variability of a model prediction for a given
data point.
• It is the error from sensitivity to small fluctuations in the
training data.
• High variance can cause the model to fit the random noise (e)
in the training data rather than the intended outputs (Y).
• High variance is also known as over-fitting data. When the
data is over-fitted, the model essentially learns the training
data too well and therefore cannot generalize to new data.
• The last error term is the irreducible error. Irreducible error is
essentially the amount of noise from factors outside of our
control and cannot be removed.
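Bias and variance can be estimated by simulation: train the same kind of model on many independently drawn training sets and look at its predictions at one fixed input. The sketch below is a toy setup under stated assumptions: the target function sin(2πx), the noise level, the test point `x0` and the two models (a constant predictor and a 1-nearest-neighbour predictor) are all chosen for illustration and do not come from the slides.

```python
import math
import random

def true_f(x):
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=20, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise_sd)) for x in xs]

def constant_model(train, x0):
    """Very rigid model: always predicts the mean training output."""
    return sum(y for _, y in train) / len(train)

def nearest_neighbour_model(train, x0):
    """Very flexible model: copies the output of the closest training input."""
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def bias2_and_variance(model, x0, runs=500, seed=0):
    rng = random.Random(seed)
    preds = [model(sample_training_set(rng), x0) for _ in range(runs)]
    mean_pred = sum(preds) / runs
    bias2 = (mean_pred - true_f(x0)) ** 2                       # squared bias
    variance = sum((p - mean_pred) ** 2 for p in preds) / runs  # spread of predictions
    return bias2, variance

x0 = 0.25
const_bias2, const_var = bias2_and_variance(constant_model, x0)
nn_bias2, nn_var = bias2_and_variance(nearest_neighbour_model, x0)
print(f"constant model : bias^2={const_bias2:.3f}, variance={const_var:.3f}")
print(f"1-nearest neigh: bias^2={nn_bias2:.3f}, variance={nn_var:.3f}")
```

In this setup the rigid model shows the larger squared bias (under-fitting) and the flexible model the larger variance (over-fitting), which is the trade-off described above.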

Example: line of best fit
• In the left graph below, we can see that the line is simple and
does not follow many of the data points, thus showing high
bias.
• The right graph below shows a line that follows almost every
data point, even ones that may be noise or outliers, showing
high variance.
• Our goal is to find a balance between these two extremes so
that the majority of data points are explained with an
appropriate amount of noise.

• The relationship between bias and variance can also be
visualized using a target example.

The Bias-Variance Trade-off

Prevent Underfitting and Overfitting
Underfitting:
• Make sure there is enough training data so that the error/cost
function (e.g. MSE or SSE) is sufficiently minimized
Overfitting:
• Limit the number of features or adjustable parameters in the
model. As the number of features increases, the complexity of
the model also increases, thus creating a higher chance of
overfitting.
• Shorten the training so the model doesn’t “over-learn” the
training data.
• Add some form of regularization term to the error/cost
function to encourage smoother network mappings (Ridge or
Lasso regression are commonly used techniques)
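The effect of a regularization term can be seen on the smallest possible model. The sketch below assumes a one-parameter model y = w·x with a Ridge penalty, for which the minimizer of Σ(y − w·x)² + λ·w² has the closed form w = Σxy / (Σx² + λ). The data values are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * w^2 for the one-parameter model y = w*x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Hypothetical toy data, roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty weight lam shrinks the fitted slope towards zero.
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
for lam, w in slopes.items():
    print(f"lambda={lam:6.1f}  w={w:.3f}")
```

With λ = 0 this is ordinary least squares; increasing λ trades a little extra bias for a less flexible (lower-variance) fit.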

Modelling supervised learning
• Given training set of labelled examples, learning algorithm
generates a hypothesis. Run hypothesis on test set to check
how good it is.
• But how good really? Maybe the training and test data consist of
bad examples, so the hypothesis doesn’t generalize well.
• Insight: Introduce probabilities to measure degree of
certainty and correctness.
• With high probability an efficient learning algorithm will find a
hypothesis that is approximately identical to the hidden target
function.
• Intuition: A hypothesis built from a large amount of training
data is unlikely to be wrong, i.e. it is probably approximately
correct (PAC).

• In PAC learning, given a class C, and examples drawn from
some unknown but fixed probability distribution, p(x), we
want to find the number of examples, N, such that with
probability at least 1 - δ, the hypothesis h has error at most ϵ,
for arbitrary δ ≤ 1/2 and ϵ > 0.
P{CΔh ≤ ϵ} ≥ 1 - δ
where CΔh is the region of difference between C and h.
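To get a feel for how N depends on ϵ and δ, the sketch below uses one standard bound that is not stated on the slide: for a finite hypothesis class H and a learner that outputs a hypothesis consistent with the training data, N ≥ (1/ϵ)(ln|H| + ln(1/δ)) examples suffice. The function name and the example values are illustrative.

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Smallest integer N with N >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Example: |H| = 16 hypotheses, error at most 0.1, with probability at least 0.95.
n = pac_sample_size(hypothesis_count=16, epsilon=0.1, delta=0.05)
print(n)

# Demanding a smaller error or a higher confidence requires more examples.
n_tighter_error = pac_sample_size(16, 0.01, 0.05)
n_higher_confidence = pac_sample_size(16, 0.1, 0.001)
```

The bound grows only logarithmically in |H| and 1/δ but linearly in 1/ϵ, so tightening the error is the expensive part.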

Noise

Occam’s Razor

Learning Multiple classes

Model Selection & Generalization
• Consider the case of learning a Boolean function from
examples.
• In a Boolean function, all inputs and the output are binary.
• There are 2^d possible ways to write d binary values and
therefore, with d inputs, the training set has at most 2^d
examples.
• As shown in the table, each of these can be labeled as 0 or 1, and
therefore, there are 2^(2^d) possible Boolean functions of d inputs.

• Each distinct training example removes half the hypotheses,
namely, those whose guesses are wrong.
• For example, let us say we have x1 = 0, x2 = 1 and the output
is 0; this removes h5, h6, h7, h8, h13, h14, h15, h16.
• This is one way to interpret learning: we start with all possible
hypotheses and as we see more training examples, we remove
those hypotheses that are not consistent with the training
data.
• In the case of a Boolean function, to end up with a single
hypothesis we need to see all 2^d training examples.
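The elimination process can be replayed for d = 2. The numbering of the hypotheses below is an assumption (the slide's table is an image): h1..h16 are taken to be the 4-bit binary numbers 0..15, read as the outputs on the inputs (0,0), (0,1), (1,0), (1,1). Under that assumption the example x1 = 0, x2 = 1 with output 0 removes the same eight hypotheses as listed above.

```python
from itertools import product

d = 2
inputs = list(product((0, 1), repeat=d))   # (0,0), (0,1), (1,0), (1,1)

# Assumed numbering: h1..h16 correspond to the output patterns 0000..1111.
hypotheses = {
    f"h{k + 1}": dict(zip(inputs, outputs))
    for k, outputs in enumerate(product((0, 1), repeat=len(inputs)))
}

def consistent(hypotheses, x, y):
    """Keep only the hypotheses that output y on input x."""
    return {name: h for name, h in hypotheses.items() if h[x] == y}

remaining = consistent(hypotheses, (0, 1), 0)
removed = sorted(set(hypotheses) - set(remaining), key=lambda name: int(name[1:]))
print(len(hypotheses), len(remaining), removed)
```

Each further distinct example halves the remaining set again, so all 2^d = 4 examples are needed to single out one hypothesis.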

• Ill-posed Problem:
– If the given training set contains only a small subset of all
possible instances, and if we know what the output should
be for only a small percentage of the cases—the solution is
not unique.
– This is an example of an ill-posed problem where the data
by itself is not sufficient to find a unique solution.
• Inductive Bias:
– Because learning is ill-posed, and data by itself is not
sufficient to find the solution, we should make some extra
assumptions to have a unique solution with the data we
have.
– The set of assumptions made to have learning possible is
called the inductive bias of the learning algorithm.

• Model Selection:
– Learning is not possible without inductive bias, and now
the question is how to choose the right bias. This is called
model selection, which is choosing between possible H.
• Generalization:
– How well a model trained on the training set predicts the
right output for new instances is called generalization.
