UNIT – V
Unsupervised Learning
SYLLABUS
 Introduction
 Unsupervised vs Supervised Learning
 Applications of Unsupervised Learning
 Clustering – Clustering as a machine learning task
 Different types of clustering techniques
 Partitioning methods – K-Medoids: a representative object-based technique
 Hierarchical clustering
 Density-based methods – DBSCAN
 Finding Pattern using Association Rule – Definition of common terms, Association rule, The Apriori algorithm for association rule learning, Build the Apriori principle rules
Unsupervised learning
 Unsupervised learning is one of the machine learning strategies.
 It is used to discover hidden knowledge or patterns.
 It works on unlabelled, unclassified data.
 This method is used to identify patterns, groupings, sorting order, and other interesting knowledge in a set of data.
Key Difference Between Supervised and Unsupervised Learning

S.No | Area | Supervised Learning | Unsupervised Learning
1 | Goals | To predict outcomes for new data; we know the type of results to expect. | To get insights from large volumes of new data.
2 | Data | Both input and output variables are given. | Only input variables are given; there are no output labels.
3 | Applications | Spam detection, sentiment analysis, weather forecasting, pricing predictions, etc. | Anomaly detection, recommendation engines, customer personas, medical imaging, etc.
4 | Complexity | Solutions can be built with standard tools such as R or Python. | Needs powerful tools for working with large amounts of unclassified data.
5 | Techniques Used | Regression and Classification | Clustering and Association
6 | Drawbacks | Models can be time-consuming to train, and labelling the input and output variables requires expertise. | Results can be wildly inaccurate without human intervention to validate the output.
APPLICATIONS OF UNSUPERVISED LEARNING
There are many domains where unsupervised learning finds application. A few examples are as follows:
 Segmentation of target consumer populations by an advertising consulting agency on the basis of a few dimensions such as demography, financial data, purchasing habits, etc., so that the advertisers can reach their target consumers efficiently.
 Anomaly or fraud detection in the banking sector by identifying the pattern of loan defaulters.
 Image processing and image segmentation, such as face recognition, expression identification, etc.
 Document clustering and identifying potential labelling options.
There are two major aspects of unsupervised learning:
Clustering, which helps in segmenting a set of objects into groups of similar objects.
Association Analysis, which is related to identifying relationships among objects in a data set.
CLUSTERING
 Clustering is a technique for finding subgroups, or clusters, in a data set on the basis of the characteristics of the objects within that data set, in such a manner that the objects within a group are similar (or related) to each other but different from (or unrelated to) the objects in other groups.
 The effectiveness of clustering depends on how similar or related the objects within a group are, and how different or unrelated the objects in different groups are from each other. What it means for two objects to be similar or dissimilar is often domain specific.
There are many different fields where cluster analysis is used effectively, such as:
 Text data mining: this includes tasks such as text categorization, text clustering, document summarization, concept extraction, sentiment analysis, and entity relation modelling.
 Customer segmentation: creating clusters of customers on the basis of parameters such as demographics, financial conditions, buying habits, etc., which retailers and advertisers can use to promote their products to the correct segment.
 Anomaly detection: detecting anomalous behaviours such as fraudulent bank transactions, unauthorized computer intrusions, suspicious movements on a radar scanner, etc.
 Data mining: simplifying the data mining task by grouping a large number of features from an extremely large data set to make the analysis manageable.
Clustering as a machine learning task
 Clustering is defined as an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items.
 The definition of similarity might vary across applications.
 Using this principle, whenever a large set of diverse and varied data is presented for analysis, clustering enables us to represent the data in a smaller number of groups.
 It helps to reduce complexity and provides insight into patterns of relationships, generating meaningful and actionable structures within the data.
 The effectiveness of clustering is measured by the homogeneity within a group as well as the difference between distinct groups.
(Figure: clustering as a machine learning task)
 Through clustering, we are trying to label the objects with class labels.
 But clustering is somewhat different from classification and numeric prediction in supervised learning.
 In those supervised cases, the goal is to create a model that relates features to an outcome or to other features, and the model identifies patterns within the data.
 In clustering, by contrast, there is no predefined outcome: the groups themselves are discovered from the data.
Different types of clustering techniques
The major clustering techniques are as follows:

Partitioning Methods
• Use a mean or medoid to represent the cluster centre
• Adopt a distance-based approach to refine clusters
• Find mutually exclusive clusters of spherical or nearly spherical shape
• Effective for data sets of small to medium size

Hierarchical Methods
• Create a hierarchical or tree-like structure through decomposition or merger
• Use the distance between the nearest or furthest points in neighbouring clusters as guidance for refinement

Density-based Methods
• Identify dense regions of objects in space separated by low-density regions
• Identify arbitrarily shaped clusters
• May filter out outliers

Partitioning Methods
Two important algorithms for partitioning-based clustering are:
 k-means
 k-medoids
In the k-means algorithm, the prototype of a cluster is its centroid, which is normally the mean of the group of points.
The k-medoids algorithm identifies the medoid, which is the most representative point of a group of points.
K-Means – A centroid-based technique
 This is one of the oldest and most popular algorithms for clustering.
 The principle of the k-means algorithm is to assign each of the ‘n’ data points to one of K clusters, where ‘K’ is a user-defined parameter specifying the number of clusters desired.
 The objective is to maximize the homogeneity within the clusters and also to maximize the differences between the clusters.
 Homogeneity and difference are measured in terms of the distance between the objects or points in the data set.
Simple Algorithm of K-means
Step 1: Select K points in the data space and mark them as initial centroids
loop
Step 2: Assign each point in the data space to the nearest centroid to form
K clusters
Step 3: Measure the distance of each point in the cluster from the centroid
Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality
of the clusters
Step 5: Identify the new centroid of each cluster on the basis of distance
between points
Step 6: Repeat Steps 2 to 5 to refine until centroids do not change
end loop
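These steps translate almost line-for-line into NumPy. Below is a minimal sketch of the loop above; the function name and parameters are illustrative, and it assumes no cluster ever becomes empty:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means on an (n, d) array X; k is the desired cluster count."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: the new centroid is the mean of each cluster
        # (assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop refining once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Steps 3-4: the SSE over all points measures the quality of the clustering.
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse
```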
(Figure: clustering concept – before and after clustering)
Student Assignment
Problem solving on K-means
K-means clustering – solved example
https://www.youtube.com/watch?v=KzJORp8bgqs&list=PPSV
By Mahesh Haddur
Elbow method
• This method measures the homogeneity (or heterogeneity) within clusters for various values of ‘K’ and helps in arriving at the optimal ‘K’.
• Homogeneity will increase (or heterogeneity will decrease) with increasing ‘K’.
• After a certain point, the gain in homogeneity is no longer worth the investment required to achieve it, as is evident from the figure. This point is known as the elbow point, and the ‘K’ value at this point produces the optimal clustering performance.
(Fig: Elbow point to determine the appropriate number of clusters)
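A quick way to locate the elbow is to compute the within-cluster SSE (called inertia in scikit-learn) for a range of K values; a minimal sketch on synthetic data, assuming scikit-learn is available:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Inertia (within-cluster SSE) drops as K grows; the "elbow" is where
# further increases in K stop paying off (here, around K=4).
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"K={k}: SSE={km.inertia_:.1f}")
```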
Student Assignment
Elbow Method - Solved example
by
Mahesh Haddur
https://www.youtube.com/watch?v=wW1tgWtkj4I&list=PPSV
Strengths and Weaknesses of K-means Algorithm

S.No | Strengths | Weaknesses
1 | Easy to use; it returns clusters that can be easily interpreted and even visualized. This simplicity makes it highly useful when you need a quick overview of the data segments. | A human must set the number of clusters for it to function optimally.
2 | High performance: it can handle large datasets conveniently. | It only creates spherical clusters.
3 | If your data has no labels (class values or targets) or even column headers, it will still successfully cluster your data. | It has no concept of outliers: it will put everything into some cluster, which is sometimes good and sometimes not.
K-Medoids: a representative object-based technique
Medoid: a medoid is a point in the cluster from which the sum of distances to the other data points is minimal. The distance can be measured using the Euclidean distance, the Manhattan distance, or any other suitable distance function.
There are three types of algorithms for K-Medoids clustering:
1. PAM (Partitioning Around Medoids)
2. CLARA (Clustering LARge Applications)
3. CLARANS (Clustering Large Applications based upon RANdomized Search)
Working of the PAM Algorithm
The steps taken by the K-medoids algorithm for clustering are as follows:
1. Randomly select k points from the data (k is the number of clusters to be formed). These k points act as our initial medoids.
2. The distances between the medoid points and the non-medoid points are calculated, and each point is assigned to the cluster of its nearest medoid.
3. Calculate the cost as the total sum of the distances (also called dissimilarities) of the data points from their assigned medoids.
4. Swap one medoid point with a non-medoid point (from the same cluster as the medoid point) and recalculate the cost.
5. If the calculated cost with the new medoid point is more than the previous cost, we undo the swap and the algorithm converges; otherwise, we repeat step 4.
Finally, we will have k medoid points with their clusters.
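A minimal NumPy sketch of these swap steps follows; it is illustrative rather than an optimized PAM, trying every medoid/non-medoid swap and keeping those that lower the total cost:

```python
import numpy as np

def pam(X, k, seed=0):
    """Sketch of PAM (swap-based K-medoids) on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))  # step 1: random medoids

    def cost(meds):
        # Steps 2-3: assign each point to its nearest medoid, sum the distances.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                 # step 4: try every possible swap
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                c = cost(trial)
                if c < best:               # step 5: keep swaps that lower the cost
                    medoids, best, improved = trial, c, True
    labels = D[:, medoids].argmin(axis=1)  # final cluster assignment
    return np.array(medoids), labels, best
```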
Student Assignment
Problem solving on K-medoids
https://www.youtube.com/watch?v=ChBxx4aR-bY&list=PPSV
By
Mahesh Haddur
Hierarchical clustering
 Hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.
 For example, in a machine learning problem of organizing the employees of a university into different departments, first the employees are grouped under the different departments, and then, within each department, the employees can be grouped according to their roles, such as professors, assistant professors, supervisors, lab assistants, etc.
 This creates a hierarchical structure of the employee data and eases visualization and analysis.
 There are two main hierarchical clustering methods:
agglomerative clustering and
divisive clustering.
 Agglomerative clustering is a bottom-up technique which starts with individual objects as clusters and then iteratively merges them to form larger clusters.
 Divisive clustering is a top-down method which starts with one large cluster containing all given objects and then splits it iteratively to form smaller clusters.
A dendrogram is a commonly used tree-structure representation of the step-by-step creation of a hierarchical clustering.
It shows how the
 clusters are merged iteratively (in the case of agglomerative clustering) or
 split iteratively (in the case of divisive clustering)
to arrive at the optimal clustering solution.
(Figure: a dendrogram)
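SciPy can build and draw such a dendrogram directly; a small sketch on random toy data, assuming scipy and matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(12, 2))  # toy data

# method='single' merges by minimum distance (single linkage);
# 'complete' would merge by maximum distance instead.
Z = linkage(X, method="single")
dendrogram(Z)
plt.title("Dendrogram (agglomerative, single linkage)")
plt.show()
```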
(Figure: agglomerative clustering vs divisive clustering)
One of the core measures of proximity between clusters is the distance between them. There are four standard methods to measure the distance between clusters:
 Let Ci and Cj be two clusters with ni and nj points respectively.
 pi and pj represent points in clusters Ci and Cj respectively.
 We denote the mean of cluster Ci as mi.
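With this notation, the four standard distance measures, as defined in standard data-mining references, are:

$$D_{\min}(C_i, C_j) = \min_{p_i \in C_i,\; p_j \in C_j} \lVert p_i - p_j \rVert$$
$$D_{\max}(C_i, C_j) = \max_{p_i \in C_i,\; p_j \in C_j} \lVert p_i - p_j \rVert$$
$$D_{\mathrm{mean}}(C_i, C_j) = \lVert m_i - m_j \rVert$$
$$D_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p_i \in C_i} \sum_{p_j \in C_j} \lVert p_i - p_j \rVert$$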
The distance measure is used to decide when to terminate the clustering algorithm.
 For example, in agglomerative clustering, the merging iterations may be stopped once the MIN distance between two neighbouring clusters becomes less than the user-defined threshold.
 So, when an algorithm uses the minimum distance Dmin to measure the distance between clusters, it is referred to as a nearest-neighbour clustering algorithm, and
 if the decision to stop the algorithm is based on a user-defined limit on Dmin, it is called a single-linkage algorithm.
On the other hand, when an algorithm uses the maximum distance Dmax to measure the distance between clusters, it is referred to as a furthest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on Dmax, it is called a complete-linkage algorithm.
As the minimum and maximum measures represent two extreme ways of measuring the distance between clusters, they are sensitive to outliers and noisy data. Using the mean or average distance instead avoids such problems and provides more consistent results.
Density-based methods – DBSCAN
o When we use the partitioning and hierarchical clustering methods, the resulting clusters are spherical or nearly spherical in nature.
o For other cluster shapes, such as S-shaped or unevenly shaped clusters, the above two types of methods do not provide accurate results.
o The density-based clustering approach provides a solution for identifying clusters of arbitrary shapes.
o The principle is to identify the dense and sparse areas within the data set and then form clusters within the dense areas.
o DBSCAN is a popular density-based algorithm which creates clusters from connected regions of high density.
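A short scikit-learn sketch on a moon-shaped dataset that partitioning methods handle poorly; the eps and min_samples values are illustrative:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: an S-shaped dataset that K-means
# clusters poorly but DBSCAN handles well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius; min_samples is the number of points
# required for a region to count as dense.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; label -1 marks noise points (outliers)
```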
Student Assignment : Density-Based Methods
DBSCAN - Problem Solving
By
Mahesh Haddur
https://www.youtube.com/watch?v=-p354tQsKrs&list=PPSV
Finding Pattern Using Association Rule
 Association rule mining is a technique for identifying interesting relationships hidden in large data sets.
 It is also known as association analysis.
 The relationships can be represented in the form of association rules comprising sets of frequent items.
 A common application of this analysis is Market Basket Analysis.
 The association rule below says that people who have bought bread and milk have often bought eggs too; so, for the retailer, it makes sense to place these items together, creating new opportunities for cross-selling.
{Bread, Milk} → {Egg}
Few common terminologies used in association analysis
Itemset: a collection of zero or more items is called an itemset.
For example, in the table below, {Bread, Milk, Egg} can be grouped together to form an itemset, as those items are frequently bought together.
A null itemset is one which does not contain any item.
In association analysis, an itemset is called a k-itemset if it contains k items.
Ex: {Bread, Milk, Egg} -> 3-itemset

Transaction Number | Purchased Items
1 | { Bread, Milk, Egg, Butter, Salt, Apple }
2 | { Bread, Milk, Egg, Apple }
3 | { Bread, Milk, Butter, Apple }
4 | { Milk, Egg, Butter, Apple }
5 | { Bread, Egg, Salt }
6 | { Bread, Milk, Egg, Apple }
Support count: denotes the number of transactions in which a particular itemset is present.
In the above table, the itemset {Bread, Milk, Egg} occurs together three times and thus has a support count of 3.
Association rule: the result of market basket analysis is expressed as a set of association rules that specify patterns of relationships among items. A typical rule might be expressed as {Bread, Milk} → {Egg}, which denotes that if Bread and Milk are purchased, then Egg is also likely to be purchased.
The above rule was identified from the itemset {Bread, Milk, Egg}.
Support and confidence are the two concepts used for measuring the strength of an association rule.
Support refers to the popularity of an item: we count the number of transactions containing that item and divide by the total number of transactions. For example, if bread, butter, and bread + butter are bought in 200, 150, and 100 transactions respectively, on a day when 1000 transactions happened, then the support for these associations will be:
Support(Bread) = 200 / 1000 = 0.2
Support(Butter) = 150 / 1000 = 0.15
Support(Bread, Butter) = 100 / 1000 = 0.1
Confidence refers to the probability that item B will be bought given that the user has purchased item A. For example, if A and B are bought together in 100 transactions and A is bought in 200 transactions, then the confidence would be:
Confidence(A → B) = Transactions containing both A and B / Transactions containing A
Confidence(Bread → Butter) = Transactions containing both bread and butter / Transactions containing bread = 100 / 200 = 0.5
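Both measures are easy to compute directly; the sketch below recomputes them for the transaction table shown earlier (the helper names are illustrative):

```python
transactions = [
    {"Bread", "Milk", "Egg", "Butter", "Salt", "Apple"},
    {"Bread", "Milk", "Egg", "Apple"},
    {"Bread", "Milk", "Butter", "Apple"},
    {"Milk", "Egg", "Butter", "Apple"},
    {"Bread", "Egg", "Salt"},
    {"Bread", "Milk", "Egg", "Apple"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transaction list."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Bread", "Milk", "Egg"}))       # 3/6 = 0.5
print(confidence({"Bread", "Milk"}, {"Egg"}))  # 3/4 = 0.75
```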
Minimum Support: the support threshold below which an association rule is discarded.
Minimum Confidence: helps us filter out rules that do not meet a minimum confidence threshold.
Apriori Algorithm – Solved Example
↠ It is called Apriori because it uses prior knowledge of frequent-itemset properties.
↠ We apply an iterative, level-wise search in which frequent k-itemsets are used to find (k+1)-itemsets.
↠ To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.
↠ Apriori Property – all non-empty subsets of a frequent itemset must also be frequent.
Consider the following dataset; we will find the frequent itemsets and generate association rules for them.
minimum support count is 2
minimum confidence is 60%
(Figure: the transaction dataset, with items labelled I1–I5)
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset – called C1 (the candidate set).
(II) Compare each candidate itemset’s support count with the minimum support count (here min_support = 2); if the support_count of a candidate itemset is less than min_support, remove it. This gives us the itemset L1.
Step-2: K=2
 Generate candidate set C2 using L1 (this is called the join step). The condition for joining two itemsets of Lk-1 is that they share (K-2) elements.
 Check whether all subsets of each itemset are frequent; if not, remove that itemset. (Example: the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.
(II) Compare each candidate’s (C2) support count with the minimum support count (here min_support = 2); if the support_count of a candidate itemset is less than min_support, remove it. This gives us the itemset L2.
Step-3:
 Generate candidate set C3 using L2 (join step). The condition for joining two itemsets of Lk-1 is that they share (K-2) elements; so here, for L2, the first element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent and, if not, remove that itemset.
(Here the 2-item subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
• Find the support count of the remaining itemsets by searching the dataset.
(II) Compare each candidate’s (C3) support count with the minimum support count (here min_support = 2); if the support_count of a candidate itemset is less than min_support, remove it. This gives us the itemset L3.
Step-4:
• Generate candidate set C4 using L3 (join step). The condition for joining two itemsets of Lk-1 (K=4) is that they share (K-2) elements; so here, for L3, the first 2 elements (items) should match.
• Check whether all subsets of these itemsets are frequent.
(Here the itemset formed by joining L3 is {I1, I2, I3, I5}, whose subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
• We stop here because no further frequent itemsets are found.
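In practice, libraries such as mlxtend (assumed available here) implement this level-wise search end to end; the transaction list below is illustrative, not the slide’s original table:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions over items I1-I5; substitute your own dataset.
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

# One-hot encode the transaction list into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# mlxtend takes min_support as a fraction, so a support *count* of 2
# becomes 2 / len(transactions).
frequent = apriori(df, min_support=2 / len(transactions), use_colnames=True)

# Keep only rules meeting the 60% minimum-confidence threshold.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```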
Advantages of the Apriori Algorithm are as follows:
1. Simplicity and ease of implementation
2. The rules are human-readable and interpretable
3. Works well on unlabelled data
4. Flexibility
5. Extensions for multiple use cases can be created easily
6. The algorithm is widely used and studied
Disadvantages of the Apriori Algorithm are as follows:
1. Time and space overhead
2. Higher memory usage
3. Bias from the minimum support threshold
4. Inability to handle numeric data
Thank You All.....