The document discusses data mining and classification techniques. It defines data mining as the extraction of interesting patterns from large amounts of data. Classification involves using attributes of records in a training dataset to predict the class of new, unseen records. Decision trees are a common classification technique that use attributes to recursively split data into subgroups until each subgroup belongs to a single class. The document also discusses clustering, which organizes unlabeled data into groups without predefined classes.
Data mining
1. Data Mining
Rajendra Akerkar
July 7, 2009 Data Mining: R. Akerkar 1
2. What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Is everything “data mining”?
– (Deductive) query processing
– Expert systems or small ML/statistical programs
3. Definition
• Several Definitions
– Non-trivial extraction of implicit, previously
unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
July 7, 2009 Data Mining: R. Akerkar 3
4. From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
6. Classification: Definition
• Given a collection of records (training set )
– Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
7. Classification: Introduction
• A classification scheme which generates a tree and a set of rules from a given data set.
• The attributes of the records are categorised into two types:
– Attributes whose domain is numerical are called numerical attributes.
– Attributes whose domain is not numerical are called categorical attributes.
8. Decision Tree
• A decision tree is a tree with the following properties:
– An inner node represents an attribute.
– An edge represents a test on the attribute of the parent node.
– A leaf represents one of the classes.
• Construction of a decision tree
– Based on the training data
– Top-Down strategy
10. Decision Tree
Example
• The data set has five attributes.
• There is a special attribute: the attribute class is the class label.
• The attributes temp (temperature) and humidity are numerical attributes.
• The other attributes are categorical, that is, they cannot be ordered.
• Based on the training data set, we want to find a set of rules to know what values of outlook, temperature, humidity and wind determine whether or not to play golf.
11. Decision Tree
Example
• We have five leaf nodes.
• In a decision tree, each leaf node represents a rule.
• We have the following rules corresponding to the tree
given in Figure.
• RULE 1 If it is sunny and the humidity is not above 75%, then play.
• RULE 2 If it is sunny and the humidity is above 75%, then do not play.
• RULE 3 If it is overcast, then play.
• RULE 4 If it is rainy and not windy, then play.
• RULE 5 If it is rainy and windy, then don't play.
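The five rules above can be sketched as a small Python function; the function name, signature, and boolean return value are my own illustration, not from the slides:

```python
def play_golf(outlook, humidity, windy):
    """Apply RULES 1-5 extracted from the decision tree."""
    if outlook == "sunny":
        # RULE 1: play when humidity is not above 75%; RULE 2: otherwise do not.
        return humidity <= 75
    if outlook == "overcast":
        # RULE 3: always play.
        return True
    if outlook == "rainy":
        # RULE 4: play when not windy; RULE 5: do not play when windy.
        return not windy
    raise ValueError(f"unknown outlook: {outlook!r}")

print(play_golf("sunny", 70, False))   # True  (RULE 1)
print(play_golf("rainy", 80, True))    # False (RULE 5)
```

Each branch of the function corresponds to one root-to-leaf path in the tree, which is exactly how the rules were read off.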
13. Iterative Dichotomizer (ID3)
• Quinlan (1986)
• Each node corresponds to a splitting attribute.
• Each arc is a possible value of that attribute.
• At each node the splitting attribute is selected to be the most
informative among the attributes not yet considered in the path from
the root.
• Entropy is used to measure how informative a node is.
• The algorithm uses the criterion of information gain to determine
the goodness of a split.
– The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.
14. Training Dataset
This follows an example from Quinlan’s ID3
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
15. Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path
from the root to a leaf
• Each attribute-value pair along a
path forms a conjunction
• The leaf node holds the class
prediction
• Rules are easier for humans to understand.
What are the rules?
16. Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
S contains si tuples of class Ci for i = {1, …, m}.
The information measure (info) required to classify any arbitrary tuple:
I(s1, s2, …, sm) = − Σ_{i=1..m} (si/s) log2(si/s)    …information is encoded in bits.
The entropy of attribute A with values {a1, a2, …, av}:
E(A) = Σ_{j=1..v} ((s1j + … + smj)/s) I(s1j, …, smj)
The information gained by branching on attribute A:
Gain(A) = I(s1, s2, …, sm) − E(A)
17. Class P: buys_computer = “yes”; Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:
age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
18. Attribute Selection by Information Gain Computation
E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
(5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes's and 3 no's. Hence
Gain(age) = I(p, n) − E(age) = 0.246
Similarly, Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
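The entropy and information-gain computation above can be reproduced with a short Python sketch. The helper names info and gain are mine; the slides' 0.246 and 0.151 differ from the exact 0.247 and 0.152 only because the slides round intermediate entropies.

```python
from collections import Counter
from math import log2

def info(labels):
    """I(s1,...,sm) = -sum_i (si/s) * log2(si/s): entropy of the class labels."""
    s = len(labels)
    return -sum((si / s) * log2(si / s) for si in Counter(labels).values())

def gain(values, labels):
    """Gain(A) = I(s1,...,sm) - E(A) for one attribute's column of values."""
    s = len(labels)
    e = sum(
        (len(subset) / s) * info(subset)
        for v in set(values)
        for subset in [[c for x, c in zip(values, labels) if x == v]]
    )
    return info(labels) - e

# The 14-sample buys_computer training set from the slides.
# Columns: age, income, student, credit_rating, class.
table = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
labels = [row[-1] for row in table]
for col, name in enumerate(("age", "income", "student", "credit_rating")):
    print(name, round(gain([row[col] for row in table], labels), 3))
```

As on the slides, age has the largest gain and is chosen as the splitting attribute at the root.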
19. Exercise 1
• The following table consists of training data from an employee database.
• Let status be the class attribute. Use the ID3 algorithm to construct
a decision tree from the given data.
21. Clustering: Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
• Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
22. The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed
point
– Go back to Step 2; stop when there are no more new assignments
24. Exercise 2
• Apply the K-means algorithm to the following 1-dimensional points (for k = 2): 1; 2; 3; 4; 6; 7; 8; 9.
• Use 1 and 2 as the starting centroids.
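The four k-means steps can be sketched in Python for the 1-dimensional case and used to check Exercise 2. The function is my own sketch; it assumes no cluster ever becomes empty, which holds for this data.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """K-means on 1-D points: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until nothing changes."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:   # no more new assignments
            break
        centroids = new_centroids
    return centroids, clusters

# Exercise 2: points 1, 2, 3, 4, 6, 7, 8, 9 with starting centroids 1 and 2
centroids, clusters = kmeans_1d([1, 2, 3, 4, 6, 7, 8, 9], [1.0, 2.0])
print(centroids, clusters)   # [2.5, 7.5] [[1, 2, 3, 4], [6, 7, 8, 9]]
```

Starting from centroids 1 and 2, the clusters drift apart over three iterations until the points split into {1, 2, 3, 4} and {6, 7, 8, 9}.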
25. K-Means for a 2-dimensional database
• Let us consider {x1, x2, x3, x4, x5} with following coordinates as
two-dimensional sample for clustering:
• x1 = (0, 2), x2 = (0, 0), x3 = (1.5,0), x4 = (5,0), x5 = (5, 2)
• Suppose that required number of clusters is 2.
• Initially, clusters are formed from random distribution of
samples:
• C1 = {x1, x2, x4} and C2 = {x3, x5}.
26. Centroid Calculation
• Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, …, CK}
• Each Ck has nk samples and each sample is exactly in one cluster.
• Therefore, Σk nk = N, where k = 1, …, K.
• The mean vector Mk of cluster Ck is defined as the centroid of the cluster:
Mk = (1/nk) Σ_{i=1..nk} xik
where xik is the ith sample belonging to cluster Ck.
• In our example, the centroids for these two clusters are
• M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66}
• M2 = {( 1.5 + 5)/2, (0 +2)/2} = {3.25, 1.00}
27. The Square-error of the Cluster
• The square-error for cluster Ck is the sum of squared Euclidean
distances between each sample in Ck and its centroid.
• This error is called the within-cluster variation.
ek² = Σ_{i=1..nk} (xik − Mk)²
• Within-cluster variations, after the initial random distribution of samples, are
• e1² = [(0 − 1.66)² + (2 − 0.66)²] + [(0 − 1.66)² + (0 − 0.66)²] + [(5 − 1.66)² + (0 − 0.66)²] = 19.36
• e2² = [(1.5 − 3.25)² + (0 − 1)²] + [(5 − 3.25)² + (2 − 1)²] = 8.12
28. Total Square-error
• The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:
E² = Σ_{k=1..K} ek²
• The total square-error is
E² = e1² + e2² = 19.36 + 8.12 = 27.48
29. • When we reassign all samples, depending on the minimum distance from centroids M1 and M2, the new redistribution of samples inside clusters will be:
• d(M1, x1) = (1.66² + 1.34²)^(1/2) = 2.14 and d(M2, x1) = 3.40 ⇒ x1 ∈ C1
• d(M1, x2) = 1.79 and d(M2, x2) = 3.40 ⇒ x2 ∈ C1
• d(M1, x3) = 0.83 and d(M2, x3) = 2.01 ⇒ x3 ∈ C1
• d(M1, x4) = 3.41 and d(M2, x4) = 2.01 ⇒ x4 ∈ C2
• d(M1, x5) = 3.60 and d(M2, x5) = 2.01 ⇒ x5 ∈ C2
The above calculation is based on the Euclidean distance formula:
d(xi, xj) = (Σ_{k=1..m} (xik − xjk)²)^(1/2)
30. • New clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids
• M1 = {0.5, 0.67}
• M2 = {5.0, 1.0}
• The corresponding within-cluster variations and the total square
error are,
• e1² = 4.17
• e2² = 2.00
• E² = 6.17
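The centroid and square-error arithmetic in this example can be verified with a short Python sketch. The function names are mine; exact values differ in the second decimal from the slides, which round the centroid to (1.66, 0.66).

```python
def centroid(cluster):
    """Mean vector (component-wise average) of a cluster of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def within_cluster_variation(cluster, m):
    """ek^2: sum of squared Euclidean distances from each sample to centroid m."""
    return sum((p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 for p in cluster)

x1, x2, x3, x4, x5 = (0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)

# Initial partition from the slides
C1, C2 = [x1, x2, x4], [x3, x5]
M1, M2 = centroid(C1), centroid(C2)
e1 = within_cluster_variation(C1, M1)   # ~19.33 (slides: 19.36, rounded centroid)
e2 = within_cluster_variation(C2, M2)   # 8.125  (slides: 8.12)
print(M1, M2, round(e1 + e2, 2))

# After reassignment: C1 = {x1, x2, x3}, C2 = {x4, x5}
C1, C2 = [x1, x2, x3], [x4, x5]
M1, M2 = centroid(C1), centroid(C2)
e1 = within_cluster_variation(C1, M1)   # ~4.17
e2 = within_cluster_variation(C2, M2)   # 2.0
print(M1, M2, round(e1 + e2, 2))        # total square-error drops to ~6.17
```

The drop in total square-error from about 27.5 to about 6.2 after a single reassignment is exactly what the slides illustrate.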
31. Exercise 3
Let the set X consist of the following sample points in 2-dimensional space:
X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5,-1), (0, 1.6), (-1,1.5)}
Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of
centroids for X.
What are the revised values of c1 and c2 after 1 iteration of k-means clustering (k = 2)?
33. Associations discovery
• Associations discovery uncovers affinities amongst collections of items
• Affinities are represented by association rules
• Associations discovery is an unsupervised
approach to data mining.
34. Association discovery is one of the most common forms of data mining, and the one people most closely associate with mining for gold through a vast database. The gold in this case is a rule that tells you something about your database that you did not already know, and were probably unable to explicitly articulate.
35. Association discovery is done using rule induction, which basically tells a user how strong a pattern is and how likely it is to happen again. For instance, a database of items scanned in a consumer market basket helps find interesting patterns such as: if bagels are purchased, then cream cheese is purchased 90% of the time, and this pattern occurs in 3% of all shopping baskets.
You tell the database to find the rules; the rules pulled from the database are extracted and ordered for presentation to the user according to the percentage of times they are correct and how often they apply. One often gets a lot of rules, and the user almost needs a second pass to find his or her gold nugget.
36. Associations
• The problem of deriving associations from
data
– market-basket analysis
– The popular algorithms are thus concerned with determining the set of frequent itemsets in a given set of operational databases.
– The problem is to compute the frequency of
occurrences of each itemset in the database.
38. Association Rules
• Algorithms that obtain association rules from data usually divide the task into two parts:
– find the frequent itemsets and
– form the rules from them.
39. Association Rules
• The problem of mining association rules can be divided into two sub-problems:
– finding the frequent itemsets, and
– forming the rules from them.
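The first sub-problem is usually solved with Apriori-style candidate generation. A minimal sketch of its join and prune steps, with itemsets encoded as sorted tuples and a small illustrative L2 of my own, is:

```python
from itertools import combinations

def apriori_gen(Lk):
    """One Apriori candidate-generation pass.

    Join step: two frequent k-itemsets (sorted tuples) that agree on their
    first k-1 items are merged into a (k+1)-candidate.
    Prune step: a candidate is kept only if every one of its k-item subsets
    is itself frequent (i.e. appears in Lk).
    """
    frequent = set(Lk)
    k = len(next(iter(frequent)))
    items = sorted(frequent)
    candidates = set()
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if a[:-1] == b[:-1]:                  # join step
                candidates.add(a + (b[-1],))
    kept = {c for c in candidates
            if all(sub in frequent for sub in combinations(c, k))}  # prune step
    return candidates, kept

# Illustrative L2: {a,b}, {a,c}, {b,c}, {b,d}
cand, kept = apriori_gen({("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")})
print(sorted(cand))   # both join candidates: (a,b,c) and (b,c,d)
print(sorted(kept))   # only (a,b,c) survives the prune; (c,d) is not frequent
```

The same function applies unchanged to the L3 in the exercise on the next slide.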
41. Exercise 3
Suppose that L3 is the list
{{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}}
Which itemsets are placed in C4 by the join step of the Apriori algorithm? Which are then removed by the prune step?
42. Exercise 4
• Given a dataset with four attributes w, x, y and z, each with three values, how many rules can be generated with one term on the right-hand side?
43. References
• R. Akerkar and P. Lingras. Building an Intelligent Web: Theory &
Practice, Jones & Bartlett, 2008 (In India: Narosa Publishing House,
2009)
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in
Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001
• D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001