Clustering is an unsupervised learning technique used to group unlabeled data points into clusters based on similarity. There are several clustering methods, including hierarchical, partitioning, density-based, and grid-based approaches. K-means clustering is a popular partitioning method that groups data into K clusters by minimizing the distance between each data point and its cluster center. It works by randomly selecting K data points as initial cluster centers and then iteratively reassigning points to the nearest center and updating the centers until the clusters are stable.
Data Science - Part VII - Cluster Analysis (Derek Kane)
This lecture provides an overview of clustering techniques, including K-Means, Hierarchical Clustering, and Gaussian Mixture Models. We will go through some methods of calibration and diagnostics and then apply the techniques to a recognizable dataset.
Unsupervised Learning Algorithms and Assumptions (refedey275)
Topics :
Introduction to unsupervised learning
Unsupervised learning Algorithms and Assumptions
K-Means algorithm – introduction
Implementation of K-means algorithm
Hierarchical Clustering – need and importance of hierarchical clustering
Agglomerative Hierarchical Clustering
Working of dendrogram
Steps for implementation of AHC using Python
Gaussian Mixture Models – Introduction, importance and need of the model
Normal (Gaussian) distribution
Implementation of Gaussian mixture model
Understand the different distance metrics used in clustering
Euclidean, Manhattan, Cosine, Mahalanobis
Features of a Cluster – Labels, Centroids, Inertia, Eigenvectors and Eigenvalues
Principal component analysis
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
Agglomerative hierarchical clustering
Divisive hierarchical clustering
A distribution in statistics is a function that shows the possible values for a variable and how often they occur.
In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution, is the most significant continuous probability distribution. It is sometimes also called a bell curve.
K-means Clustering - an algorithm to cluster n objects (VoidVampire)
The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, where k < n.
It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.
It assumes that the object attributes form a vector space.
The method of identifying similar groups of data in a data set is called clustering. Entities in each group are more similar to one another than to entities of other groups.
k-Means is a simple but well-known algorithm for grouping objects, i.e. clustering. All objects must be represented as a set of numerical features, and the user has to specify the number of groups (referred to as k) to identify. Each object can be thought of as a feature vector in an n-dimensional space, n being the number of features used to describe the objects to cluster. The algorithm randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. All objects are then assigned to the center they are closest to; the distance measure is usually chosen by the user and determined by the learning task. Next, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until it converges, and the algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, the choice of initial centers, and the computation of new centers have been explored, as well as the estimation of the number of clusters k, yet the main principle always remains the same. In this project we will discuss the K-means clustering algorithm, its implementation, and its application to the problem of unsupervised learning.
Clustering is the process of grouping abstract objects into classes of similar objects. Clustering splits data into several subsets; each subset contains data similar to one another, and these subsets are called clusters. Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
Cluster analysis is a data analysis technique that explores the naturally occurring groups within a data set known as clusters. Cluster analysis doesn't need to group data points into any predefined groups, which means that it is an unsupervised learning method.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
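The nearest-mean assignment that produces these Voronoi cells can be sketched in a few lines of NumPy; the points and the two cluster means below are illustrative values, not taken from the text:

```python
import numpy as np

# Hypothetical 2-D observations and two fixed cluster means.
points = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
means = np.array([[0.0, 0.0], [5.0, 5.0]])

# Each observation joins the cluster of its nearest mean -- i.e. it falls
# into that mean's Voronoi cell.
dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # [0 0 1 1]
```

In full k-means the means themselves are then re-estimated from the assigned points and the assignment repeated, as described above.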
3. Definition
• Clustering can be considered the most important unsupervised learning technique; as with every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
• Clustering is “the process of organizing objects into groups whose members are similar in some way”.
• A cluster is therefore a collection of objects which are “similar” to each other and “dissimilar” to the objects belonging to other clusters.
Mu-Yu Lu, SJSU
5. Why clustering?
A few good reasons ...
• Simplifications
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process
6. Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics
7. Which method should I use?
• Type of attributes in the data
• Scalability to larger datasets
• Ability to work with irregular data
• Time cost
• Complexity
• Data order dependency
• Result presentation
9. Measuring Similarity
• Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on the application and data semantics.
• It is hard to define “similar enough” or “good enough” – the answer is typically highly subjective.
Professor Lee, Sin-Min
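The distance function d(i, j) can take several concrete forms. A minimal sketch of three common choices (Euclidean, Manhattan, and cosine distance), with illustrative points:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors
    # (undefined for zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

print(euclidean((0, 0), (3, 4)))       # 5.0
print(manhattan((0, 0), (3, 4)))       # 7
print(cosine_distance((1, 0), (0, 1))) # 1.0 (orthogonal vectors)
```

Which metric is appropriate depends on the variable types and data semantics noted above.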
10. Distance-based method
• In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering.
11. Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as its own cluster (singleton).
2. Recursively merge the most appropriate pair of clusters.
3. Stop when k clusters are achieved.
Divisive (top down)
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters are achieved.
12. General steps of hierarchical clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into K clusters.
Mu-Yu Lu, SJSU
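The steps above can be sketched directly in code. This is a minimal single-linkage version, run until K clusters remain; the small distance matrix is an illustrative assumption:

```python
def single_linkage(dist, K):
    """Agglomerative clustering over an N*N distance matrix (list of lists)."""
    n = len(dist)
    clusters = [[i] for i in range(n)]  # step 1: one item per cluster
    while len(clusters) > K:
        # Step 2: find the closest pair of clusters; under single linkage the
        # cluster distance is the minimum item-to-item distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Step 3: merge the pair; distances to other clusters follow from
        # the min rule on the next pass. Repeat (step 4) until K remain.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy 4-item distance matrix: items 0,1 are close, items 2,3 are close.
dist = [[0, 1, 8, 9],
        [1, 0, 7, 9],
        [8, 7, 0, 2],
        [9, 9, 2, 0]]
print(single_linkage(dist, 2))  # [[0, 1], [2, 3]]
```

Replacing the `min` in the pair search with `max` or an average would give complete-linkage or average-linkage variants of the same procedure.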
13. Exclusive vs. non-exclusive clustering
• In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it cannot be included in another cluster. A simple example is a set of points separated into two clusters by a straight line on a two-dimensional plane.
• On the contrary, the second type, overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.
14. Partitioning clustering
1. Divide the data into proper subsets.
2. Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of the hierarchical method).
This recursive relocation yields higher-quality clusters.
15. Probabilistic clustering
1. Data are assumed to be drawn from a mixture of probability distributions.
2. The mean and variance of each distribution serve as the parameters of a cluster.
3. Each point has a single cluster membership.
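A minimal sketch of this idea for a one-dimensional, two-component Gaussian mixture fit by expectation-maximization; the data and initial parameters are illustrative assumptions:

```python
import numpy as np

# Sample from an assumed mixture of two Gaussians (means 0 and 8).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(8, 1, 200)])

w = np.array([0.5, 0.5])    # mixing weights
mu = np.array([1.0, 7.0])   # initial means (cluster parameters)
var = np.array([1.0, 1.0])  # initial variances

for _ in range(50):
    # E-step: responsibility of each component for each point.
    pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = w * pdf
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances from responsibilities.
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

# Hard (single) cluster membership: the most responsible component.
labels = r.argmax(axis=1)
print(np.round(mu, 1))
```

The recovered means should land close to the generating means 0 and 8; taking the argmax of the responsibilities gives the single cluster membership described in point 3.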
16. Single-Linkage Clustering (hierarchical)
• The N×N proximity matrix is D = [d(i,j)]
• The clusterings are assigned sequence numbers 0, 1, ..., (n−1)
• L(k) is the level of the kth clustering
• A cluster with sequence number m is denoted (m)
• The proximity between clusters (r) and (s) is denoted d[(r),(s)]
Mu-Yu Lu, SJSU
17. The algorithm is composed of the following steps:
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
18. The algorithm is composed of the following steps (cont.):
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min{d[(k),(r)], d[(k),(s)]}.
5. If all objects are in one cluster, stop. Otherwise, go to step 2.
19. Hierarchical clustering example
• Let’s now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage.
• Input distance matrix (L = 0 for all the clusters).
20. • The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1. Then we compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.
22. • min d(i,j) = d(NA,RM) = 219 ⇒ merge NA and RM into a new cluster called NA/RM; L(NA/RM) = 219, m = 2
23. • min d(i,j) = d(BA,NA/RM) = 255 ⇒ merge BA and NA/RM into a new cluster called BA/NA/RM; L(BA/NA/RM) = 255, m = 3
24. • min d(i,j) = d(BA/NA/RM,FI) = 268 ⇒ merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM; L(BA/FI/NA/RM) = 268, m = 4
25. • Finally, we merge the last two clusters at level 295.
• The process is summarized by a hierarchical tree (dendrogram).
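The worked example can be checked in code. The slides do not show the full input distance matrix, so the pairwise distances below are an assumption, reconstructed so that single linkage reproduces exactly the merges described (138, 219, 255, 268, 295):

```python
# Assumed city-to-city distances in km (only MI-TO 138, MI-RM 564, NA-RM 219
# appear in the text; the rest are reconstructed for illustration).
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = {("BA","FI"): 662, ("BA","MI"): 877, ("BA","NA"): 255, ("BA","RM"): 412,
     ("BA","TO"): 996, ("FI","MI"): 295, ("FI","NA"): 468, ("FI","RM"): 268,
     ("FI","TO"): 400, ("MI","NA"): 754, ("MI","RM"): 564, ("MI","TO"): 138,
     ("NA","RM"): 219, ("NA","TO"): 869, ("RM","TO"): 669}

def d(a, b):
    # Look up a distance regardless of pair order.
    return D[(a, b)] if (a, b) in D else D[(b, a)]

clusters = [[c] for c in cities]
levels = []
while len(clusters) > 1:
    # Single link: cluster distance = shortest member-to-member distance.
    best = min(
        (min(d(i, j) for i in clusters[x] for j in clusters[y]), x, y)
        for x in range(len(clusters)) for y in range(x + 1, len(clusters)))
    levels.append(best[0])             # the level L(m) of this merge
    clusters[best[1]] += clusters[best[2]]
    del clusters[best[2]]

print(levels)  # [138, 219, 255, 268, 295]
```

The merge levels printed are exactly the levels L(1)..L(5) of the dendrogram in the example.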
26. K-means algorithm
1. It accepts the number of clusters to group the data into, and the dataset to cluster, as input values.
2. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows in the dataset and 3 clusters need to be formed, the first K = 3 initial clusters are created by selecting 3 records randomly from the dataset; each of the 3 initial clusters formed will have just one row of data.
27. 3. The K-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, and the arithmetic mean of a cluster with one record is simply the set of values that make up that record. For example, if the dataset is a set of Height, Weight and Age measurements for students in a university, where a record P in the dataset S is represented as P = {Age, Height, Weight}, then a record containing the measurements of a student John would be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 cm (1.70 m) and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of the cluster with only John as a member = {20, 170, 80}.
28. 4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean or Manhattan (city-block) distance.
5. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean = {Agemean, Heightmean, Weightmean}, where Agemean = (20 + 30)/2, Heightmean = (170 + 160)/2 and Weightmean = (80 + 120)/2, so the arithmetic mean of this cluster = {25, 165, 100}. This new arithmetic mean becomes the center of the cluster; following the same procedure, new cluster centers are formed for all the existing clusters.
29. 6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is complete. Stable clusters are formed when new iterations of the algorithm create no new clusters, i.e. the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters have formed and the procedure can stop.
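Steps 1-7 above can be condensed into a short sketch; the student records below are illustrative, reusing the John/Henry style measurements:

```python
import random

def kmeans(records, k, seed=42):
    """Minimal k-means over tuples of numbers, following steps 1-7 above."""
    rng = random.Random(seed)
    centers = rng.sample(records, k)  # steps 1-2: K random records as centers
    while True:
        clusters = [[] for _ in range(k)]
        for r in records:             # steps 4/6: assign to nearest center
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(r, centers[c])))
            clusters[i].append(r)
        # Step 5: new center = arithmetic mean of each cluster's records
        # (an empty cluster keeps its old center).
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:            # step 7: centers stable -> done
            return centers, clusters
        centers = new

# Illustrative {Age, Height, Weight} records.
data = [(20, 170, 80), (30, 160, 120), (21, 172, 78),
        (55, 150, 60), (60, 155, 65)]
centers, clusters = kmeans(data, 2)
print(sum(len(c) for c in clusters))  # 5 -- every record lands in one cluster
```

The stopping rule here is the simplest one mentioned in step 7: iterate until the cluster centers no longer change.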
30. Classification
• Classification problem overview
• Classification techniques:
– Regression
– Distance
– Decision trees
– Rules
– Neural networks
Goal: provide an overview of the classification problem and introduce some of the basic algorithms
31. Classification examples
• Teachers classify students’ grades as A, B, C, D, or F.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Identify individuals who are credit risks.
• Speech recognition
• Pattern recognition
32. Classification example: grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
(Decision-tree diagram: successive tests on x at the thresholds 90, 80, 70 and 60 lead to the leaves A, B, C, D and F.)
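The grading rules read as a simple decision chain in code (with the final threshold taken as x < 60 so that every score receives a grade):

```python
def grade(x):
    # Each test mirrors one internal node of the decision tree above.
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"

print([grade(x) for x in (95, 85, 75, 65, 40)])  # ['A', 'B', 'C', 'D', 'F']
```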
33. Classification techniques
• Approach:
1. Create a specific model by evaluating training data (or using domain experts’ knowledge).
2. Apply the model developed to new data.
• Classes must be predefined.
• Most common techniques use decision trees, neural networks, or are based on distances or statistical methods.
35. Classification using regression
• Division: use a regression function to divide the area into regions.
• Prediction: use a regression function to predict a class membership function. Input includes the desired class.
36. Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
39. Classification using distance
• Place items in the class to which they are “closest”.
• Must determine the distance between an item and a class.
• Classes may be represented by:
– Centroid: central value
– Medoid: representative point
– Individual points
• Algorithm: KNN
40. K Nearest Neighbor (KNN):
• The training set includes class labels.
• Examine the K items nearest to the item to be classified.
• The new item is placed in the class with the largest number of these close items.
• O(q) for each tuple to be classified (where q is the size of the training set).
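A minimal KNN sketch of this procedure; the one-dimensional height training set is illustrative, echoing the height example data above:

```python
from collections import Counter

def knn_classify(train, item, k=3):
    """train: list of (point, label) pairs; O(q) distance computations."""
    # Sort the training items by squared Euclidean distance to the query
    # and keep the K nearest.
    neighbors = sorted(train, key=lambda t: sum((a - b) ** 2
                       for a, b in zip(t[0], item)))[:k]
    # Place the new item in the majority class among those K neighbors.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1.6,), "Short"), ((1.7,), "Short"), ((1.8,), "Medium"),
         ((1.9,), "Medium"), ((2.0,), "Tall"), ((2.1,), "Tall")]
print(knn_classify(train, (1.85,), k=3))  # Medium
print(knn_classify(train, (2.05,), k=3))  # Tall
```

Note the O(q) cost per query mentioned above: every training item's distance is computed for each new item to classify.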
43. Classification using decision trees
• Partitioning based: divide the search space into rectangular regions.
• A tuple is placed into a class based on the region within which it falls.
• DT approaches differ in how the tree is built: DT induction.
• Internal nodes are associated with attributes and arcs with values of those attributes.
• Algorithms: ID3, C4.5, CART
44. Decision Tree
Given:
– D = {t1, …, tn} where ti = <ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}
A decision or classification tree is a tree associated with D such that:
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate which can be applied to the attribute at its parent
– Each leaf node is labeled with a class, Cj
47. ID3
• Creates the tree using information-theory concepts, trying to reduce the expected number of comparisons.
• ID3 chooses the split attribute with the highest information gain, using entropy as the basis for the calculation.
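Entropy and information gain, ID3's split criterion, can be sketched on a toy dataset; the weather-style attributes below are an illustrative assumption:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    # Gain = entropy before the split minus the weighted entropy after
    # splitting on the values of one attribute.
    base = entropy(labels)
    remainder = 0.0
    for v in set(r[attr_index] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[attr_index] == v]
        remainder += len(sub) / len(labels) * entropy(sub)
    return base - remainder

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "hot"), ("rain", "mild")]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, 0))  # 1.0 -- attribute 0 separates the classes
print(info_gain(rows, labels, 1))  # 0.0 -- attribute 1 gives no information
```

ID3 would therefore split on attribute 0 here, since it has the highest information gain.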
48. Conclusion
• Clustering is very useful in data mining.
• It is applicable to both text and graphical data.
• It helps simplify data complexity.
• It supports classification.
• It detects hidden patterns in data.
49. References
• Dr. M.H. Dunham – http://engr.smu.edu/~mhd/dmbook/part2.ppt
• Dr. Lee, Sin-Min – San Jose State University
• Mu-Yu Lu, SJSU
• Database System Concepts – Silberschatz, Korth, Sudarshan