Data Mining
Ch.Sanjeev Kumar Dash
Definition
• Definition: Data mining is the non-trivial process of extracting interesting, useful, and novel patterns or implicit knowledge from huge amounts of data.
Definition
• Non-trivial process: the knowledge to be extracted is not obvious.
• Implicit knowledge: the knowledge is inbuilt in the data; we have to extract it from the data.
• Novelty: the knowledge has to be new, previously unknown knowledge.
• Potentially useful: the knowledge has to be useful, depending on the application.
• The knowledge often takes the form of patterns in the data: some regularity or some kind of structure, found in a huge amount of data.
Which is not Data mining
• A plain search, like we do in Google or any other search engine
• Query processing in a DBMS
• Booking a ticket on Indian Railways (example: www.irctc.com), where we want to find out how many tickets are currently available on a given train on a given day
• Data mining would be from historical data, not from the existing records.
Data Mining
• This entire study is very much
interdisciplinary.
• It borrows from database technology,
statistics, machine learning, pattern
recognition algorithms, cognitive theory etc.
Steps of KDD
KDD Process
Steps of KDD
• 1. Data cleaning (to remove noise and inconsistent data)
• 2. Data integration (where multiple data sources may be combined)
• 3. Data selection (where data relevant to the analysis task are retrieved from the database)
• 4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
• 5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
• 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
• 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
• Data mining is the process of discovering
interesting patterns and knowledge from
large amounts of data.
• The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.
What Kinds of Data Can Be Mined?
• Data mining can be applied to any kind of data
as long as the data are meaningful for a target
application.
• The most basic forms of data for mining
applications are database data , data
warehouse data ,and transactional data .
• Data mining can also be applied to other
forms of data .
• For example: data streams, ordered/sequence data, graph or networked data, spatial data, text data, multimedia data, and the WWW.
Types of Datasets
Record Data
Text Data
Graph Data
Ordered Data
Database Data
• A database system, also called a database
management system (DBMS), consists of a
collection of interrelated data, known as a
database, and a set of software programs to
manage and access the data.
• A relational database is a collection of tables,
each of which is assigned a unique name.
• Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
Data Warehouses
• A data warehouse is a repository of information
collected from multiple sources, stored under
a unified schema, and usually residing at a
single site.
• Data warehouses are constructed via a
process of data cleaning, data integration,
data transformation, data loading, and
periodic data refreshing.
• A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales amount).
• A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
Transactional Data
• In general, each record in a transactional
database captures a transaction, such as a
customer’s purchase, a flight booking, or a
user’s clicks on a web page.
• A transaction typically includes a unique transaction identity number (transID) and a list of the items making up the transaction, such as the items purchased in the transaction.
• A transactional database may have additional
tables, which contain other information
related to the transactions, such as item
description, information about the
salesperson or the branch, and so on.
Data mining functionalities
• We have observed various types of data and information repositories on which data mining can be performed.
Data mining functionalities
• These include:
• 1. Characterization and discrimination
• 2. Mining of frequent patterns, associations, and correlations
• 3. Classification and regression
• 4. Cluster analysis and outlier analysis
• Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
Characterization
• Data characterization is a summarization of
the general characteristics or features of a
target class of data.
• Data entries can be associated with classes or
concepts.
• For example, in an electronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.
Characterization
• It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms.
• Such descriptions of a class or a concept are called class/concept descriptions.
• These descriptions can be derived using data characterization, by summarizing the data of the class under study (often called the target class).
Example
• A customer relationship manager at
AllElectronics may order the following data
mining task: Summarize the characteristics of
customers who spend more than $5000 a year
at AllElectronics.
• The result is a general profile of these
customers, such as that they are 40 to 50
years old, employed, and have excellent credit
ratings.
Characterization
• The output of data characterization can be
presented in various forms.
• Examples include pie charts, bar charts,
curves, multidimensional data cubes, and
multidimensional tables, including crosstabs.
Data discrimination
• Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
• The target and contrasting classes can be specified by a user, and the corresponding data objects can be retrieved through database queries.
Example,
• A user may want to compare the general
features of software products with sales that
increased by 10% last year against those with
sales that decreased by at least 30% during
the same period.
Mining Frequent Patterns,
Associations, and Correlations
• Frequent patterns, as the name suggests, are
patterns that occur frequently in data.
• There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences (also known as sequential patterns), and frequent substructures.
• A frequent itemset typically refers to a set of
items that often appear together in a
transactional data set.
• for example, milk and bread, which are
frequently bought together in grocery stores
by many customers.
• A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
• A substructure can refer to different
structural forms (e.g., graphs, trees, or lattices)
that may be combined with itemsets or
subsequences.
• If a substructure occurs frequently, it is called
a (frequent) structured pattern.
• Mining frequent patterns leads to the
discovery of interesting associations and
correlations within data.
Classification and Regression for
Predictive Analysis
• Classification is the process of finding a
model (or function) that describes and
distinguishes data classes or concepts.
• The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known).
• The model is used to predict the class label of objects for which the class label is unknown.
• The derived model may be represented in various forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks.
Types of ML Classification Algorithms:
• Classification Algorithms can be further divided
into the following types:
• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
Example of Decision tree
• Classification predicts categorical (discrete, unordered) labels.
• Regression models predict continuous-valued functions.
Regression:
• Regression is a process of finding the
correlations between dependent and
independent variables.
• It helps in predicting the continuous variables
such as prediction of Market Trends,
prediction of House prices, etc.
Regression
• The task of the Regression algorithm is to find
the mapping function to map the input
variable(x) to the continuous output
variable(y).
Types of Regression Algorithm:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Example: to predict the amount of revenue that each item will generate during an upcoming sale at an electronics shop, based on previous sales data.
Cluster Analysis
• Clustering analyzes data objects without consulting class labels.
• In many cases, class labeled data may simply not
exist at the beginning.
• Clustering can be used to generate class labels for
a group of data.
• The objects are clustered or grouped based on
the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.
Outlier Analysis
• A data set may contain objects that do not comply with
the general behavior or model of the data.
• These data objects are outliers.
• Many data mining methods discard outliers as noise
or exceptions.
• However, in some applications (e.g., fraud detection)
the rare events can be more interesting than the more
regularly occurring ones.
• The analysis of outlier data is referred to as outlier
analysis or anomaly mining.
Getting to Know Your Data
• Data Objects-
• Data sets are made up of data objects.
• A data object represents an entity.
• For example, in a sales database, the objects may be customers, store items, and sales;
• In a medical database, the objects may be patients;
• In a university database, the objects may be students, professors, and courses.
What is Data?
Data Objects-
• Data objects are typically described by attributes.
• Data objects can also be referred to as samples,
examples, instances, data points, or objects.
• If the data objects are stored in a database, they
are data tuples.
• That is, the rows of a database correspond to
the data objects, and the columns correspond to
the attributes.
What Is an Attribute?
• An attribute is a data field, representing a
characteristic or feature of a data object.
• The nouns attribute, dimension, feature, and variable
are often used interchangeably in the literature.
• The term dimension is commonly used in data
warehousing.
• Machine learning literature tends to use the term
feature, while statisticians prefer the term variable.
• Data mining and database professionals commonly use
the term attribute.
• Each row can be viewed as a vector whose components are the individual attribute values.
• These vectors are also sometimes known as object vectors or feature vectors.
• What is the dimension of the vector?
• The dimension of the vector is determined by the number of attributes in the table.
Types of Attribute
• Nominal attribute-Nominal means “relating to
names.”
• The values of a nominal attribute are symbols
or names of things.
• Each value represents some kind of category,
code, or state, and so nominal attributes are
also referred to as categorical.
• The values do not have any meaningful order.
• For example, values of hair color include black, brown, blond, red, auburn, gray, and white.
• The attribute marital status can take on the values single, married, divorced, and widowed.
• Another example of a nominal attribute is
occupation, with the values teacher, dentist,
programmer, farmer, and so on.
Binary Attributes
• A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.
• Binary attributes are referred to as Boolean if
the two states correspond to true and false.
Example
• The attribute medical test is binary, where a
value of 1 means the result of the test for the
patient is positive, while 0 means the result is
negative.
Binary attributes.
• Given the attribute smoker describing a
patient object, 1 indicates that the patient
smokes, while 0 indicates that the patient
does not.
• A binary attribute is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1.
asymmetric
• A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a medical test for HIV.
• By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).
Ordinal Attributes
• An ordinal attribute is an attribute with
possible values that have a meaningful order
or ranking among them, but the magnitude
between successive values is not known.
• For example, the ordinal attribute drink size has three possible values: small, medium, and large.
• We cannot tell from the values how much bigger, say, a large is than a medium.
• Professional rank: professional ranks can be enumerated in a sequential order, for example, assistant, associate, and full for professors.
Numeric Attributes
• A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.
• Numeric attributes can be interval-scaled or
ratio-scaled.
Interval-Scaled Attributes
• Interval-scaled attributes are measured on a
scale of equal-size units.
• The values of interval-scaled attributes have
order and can be positive, 0, or negative.
• Example- A temperature attribute is interval-
scaled.
• Suppose that we have the outdoor temperature
value for a number of different days.
• By ordering the values, we obtain a ranking of
the objects with respect to temperature.
• Example- Calendar dates are another example.
For instance, the years 2002 and 2010 are
eight years apart.
Ratio-Scaled Attributes
• A ratio-scaled attribute is a numeric attribute
with an inherent zero-point.
• That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
• In addition, the values are ordered, and we
can also compute the difference between
values, as well as the mean, median, and
mode.
Examples of ratio-scaled attributes
• Years of experience (e.g., the objects are employees)
• Number of words (e.g., the objects are documents)
• Additional examples include attributes to
measure weight, height, latitude and
longitude.
• Interval scales hold no true zero and can
represent values below zero.
• For example, you can measure temperatures
below 0 degrees Celsius, such as -10 degrees.
Ratio variables, on the other hand, never fall
below zero. Height and weight measure from
0 and above, but never fall below it.
Discrete versus Continuous Attributes
• A discrete attribute has a finite or countably
infinite set of values, which may or may not
be represented as integers.
• The attributes hair color, smoker, medical test,
and drink size each have a finite number of
values, and so are discrete.
Example
• Note that discrete attributes may have numeric
values, such as 0 and 1 for binary attributes
• or, the values 0 to 110 for the attribute age.
• An attribute is countably infinite if the set of
possible values is infinite but the values can be
put in a one-to-one correspondence with natural
numbers.
• For example, the attribute customer ID is
countably infinite.
• Zip codes are another example.
continuous
• If an attribute is not discrete, it is continuous.
• Continuous values are real numbers, whereas
numeric values can be either integers or real
numbers.
• Continuous attributes are typically
represented as floating-point variables.
Example
• A feature F1 can take certain values: A, B, C, D, E, F, and represents the grade of students from a college. Which of the following statements is true in this case?
• a. Feature F1 is an example of a nominal variable.
• b. Feature F1 is an example of an ordinal variable.
• c. It doesn’t belong to any of the above
categories.
• d. Both of these
Measures of Central Tendency
• The central value, or the most frequently occurring value, that gives a general idea of the whole data set is called a measure of central tendency.
• Some of the most commonly used measures
of central tendency are:
• Mean
• Median
• Mode
Example
• Suppose that we have some attribute X, like
salary, which has been recorded for a set of
objects.
• Let x1, x2, …, xN be the set of N observed values or observations for X.
• If we were to plot the observations for salary,
where would most of the values fall?
Mean
• The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean.
• Let x1, x2, …, xN be the set of N observed values or observations for X. The mean of this set of values is:
Arithmetic mean
Mean
Weighted Arithmetic Mean
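The formula slides above refer to the standard definitions; in the notation used here, the arithmetic mean and the weighted arithmetic mean (with a weight wi attached to each value xi) can be written as:

```latex
% Arithmetic mean of N observed values x_1, ..., x_N
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
% Weighted arithmetic mean, with weight w_i attached to each x_i
\bar{x}_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}
```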
Example
• A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
• Even a small number of extreme values can corrupt the mean.
Trimmed mean
• Trimmed mean: the mean obtained after chopping off values at the high and low extremes.
• For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
• Let's say, as an example, a figure skating competition
produces the following scores: 6.0, 8.1, 8.3, 9.1, and
9.9.
• The mean for the scores would equal:
• ((6.0 + 8.1 + 8.3 + 9.1 + 9.9) / 5) = 8.28
• To trim the mean by a total of 40%, we remove the
lowest 20% and the highest 20% of values, eliminating
the scores of 6.0 and 9.9.
• Next, we calculate the mean based on the calculation:
• (8.1 + 8.3 + 9.1) / 3 = 8.50
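A minimal Python sketch of this procedure (the function name trimmed_mean is ours, chosen for illustration):

```python
# Trimmed mean: chop a fraction of values off each end, then average the rest.
def trimmed_mean(values, trim_fraction):
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)   # how many values to drop per end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

scores = [6.0, 8.1, 8.3, 9.1, 9.9]
print(trimmed_mean(scores, 0.20))  # drops 6.0 and 9.9 -> 8.5
```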
Median.
• Median is the middle value in a set of ordered
data values. It is the value that separates the
higher half of a data set from the lower half.
• Suppose that a given data set of N values for an
attribute X is sorted in increasing order.
• If N is odd, then the median is the middle value of
the ordered set.
• If N is even, then the median is not unique; it is the two middlemost values and any value in between.
• If X is a numeric attribute, by convention the median is taken as the average of the two middlemost values.
• Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
• So median = (52 + 56)/2 = 54.
• Suppose that we had only the first 11 values in the list.
• Given an odd number of values, the median is the
middlemost value.
• This is the sixth value in this list, which has
a value of $52,000.
• How to calculate median for an even number of
values?
• Example:
• 9, 8, 5, 6, 3, 4
• Arrange values in order
• 3, 4, 5, 6, 8, 9
• Add the 2 middle values and calculate their mean.
• Median = (5 + 6)/2
• Median = 5.5
• The median is expensive to compute when we
have a large number of observations.
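A small Python sketch of the median rule described above (sort, take the middle value, or average the two middle values when N is even):

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd N: the single middlemost value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even N: average the two

print(median([9, 8, 5, 6, 3, 4]))   # (5 + 6) / 2 = 5.5
print(median([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))  # 54.0
```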
Median for range of values
• Note: The median class is the class that
contains the value located at N/2.
Median for range of values
• L: Lower limit of median class: 11
• W: Width of median class: 9
• N: Total Frequency: 60
• C: Cumulative frequency up to median class: 8
• F: Frequency of median class: 25
• Median = L + W[(N/2 – C) / F]
• Median = 11 + 9[(60/2 – 8) / 25]
• Median = 18.92
• We estimate that the median exam score
is 18.92.
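As a sketch, the grouped-data formula above translates directly to Python (the parameter names mirror the slide's L, W, N, C, F):

```python
# Median for grouped (range) data: Median = L + W * ((N/2 - C) / F)
def grouped_median(L, W, N, C, F):
    return L + W * ((N / 2 - C) / F)

print(grouped_median(L=11, W=9, N=60, C=8, F=25))  # 18.92
```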
Mode
• The mode is another measure of central
tendency.
• The mode for a set of data is the value that
occurs most frequently in the set.
• For Example,
• In {6, 9, 3, 6, 6, 5, 2, 3}, the Mode is 6 as it
occurs most often.
Types of Mode
• The different types of Mode are Unimodal,
Bimodal, Trimodal, and Multimodal. Let us
understand each of these Modes.
• Unimodal Mode - A set of data with one Mode is known as a Unimodal data set.
• For example, the Mode of data set A = {14, 15, 16, 17, 15, 18, 15, 19} is 15, as 15 is the only value that repeats. Hence, it is a Unimodal data set.
• Bimodal Mode - A set of data with two Modes
is known as a Bimodal Mode. This means that
there are two data values that are having the
highest frequencies.
• For example, the Mode of data set A = {
8,13,13,14,15,17,17,19} is 13 and 17 because
both 13 and 17 are repeating twice in the
given set. Hence, it is a Bimodal data set.
• Trimodal Mode - A set of data with three Modes
is known as a Trimodal Mode. This means that
there are three data values that are having the
highest frequencies.
• For example, the Mode of data set A = {2, 2, 2, 3,
4, 4, 5, 6, 5,4, 7, 5, 8} is 2, 4, and 5 because all the
three values are repeating thrice in the given set.
• Hence, it is a Trimodal data set.
• Multimodal Mode - A set of data with four or more than four Modes is known as a Multimodal Mode.
• For example, the Mode of data set A = {100, 80, 80, 95, 95, 100, 90, 90} is 80, 90, 95, and 100 because all four values are repeated twice in the given set. Hence, it is a Multimodal data set.
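A short Python sketch that returns every mode, so it covers the unimodal, bimodal, trimodal, and multimodal cases described above:

```python
from collections import Counter

def modes(values):
    counts = Counter(values)            # frequency of each distinct value
    top = max(counts.values())          # the highest frequency
    return sorted(v for v, c in counts.items() if c == top)

print(modes([6, 9, 3, 6, 6, 5, 2, 3]))          # [6]      -> unimodal
print(modes([8, 13, 13, 14, 15, 17, 17, 19]))   # [13, 17] -> bimodal
```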
• Data in most real applications are not
symmetric.
• They may instead be either positively skewed,
where the mode occurs at a value that is
smaller than the median or negatively
skewed , where the mode occurs at a value
greater than the median .
What is data skewness?
• When most of the values lie to the left or right of the median, the data are called skewed.
• Data can have any of the following shapes:
• Symmetric: mean, median, and mode are at the same point.
• Positively skewed: most of the values cluster to the left of the median, with a long tail to the right.
• Negatively skewed: most of the values cluster to the right of the median, with a long tail to the left.
• For a symmetric distribution: mean = median = mode
• For a skewed distribution:
• Positively skewed: mean > median > mode
• Negatively skewed: mode > median > mean
• Empirical relation: mean − mode = 3(mean − median), i.e., mode = 3·median − 2·mean
• The midrange can also be used to assess the
central tendency of a numeric data set.
• It is the average of the largest and smallest
values in the set.
• Midrange: the midrange of the salary data above is (30,000 + 110,000)/2 = $70,000.
Measuring the Dispersion of Data:
• Range, Quartiles, Variance, Standard Deviation, and Interquartile Range.
• Range: let x1, x2, …, xN be a set of observations for some numeric attribute X. The range of the set is the difference between the largest (max) and smallest (min) values.
What is quartile?
• Suppose that the data for attribute X are
sorted in increasing numeric order.
• we can pick certain data points so as to split
the data distribution into equal-size
consecutive sets.
• These data points are called quantiles.
• Quantiles are points taken at regular intervals
of a data distribution, dividing it into
essentially equal size consecutive sets.
• The 2-quantile is the data point dividing the lower and
upper halves of the data distribution. It corresponds to
the median.
• The 4-quantiles are the three data points that
• split the data distribution into four equal parts; each
part represents one-fourth of the data distribution.
• They are more commonly referred to as quartiles.
• The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100
• equal-sized consecutive sets.
• The first quartile, denoted by Q1, is the 25th
percentile. It cuts off the lowest 25% of the
data.
• The third quartile, denoted by Q3, is the 75th
percentile—it cuts off the lowest 75% (or
• highest 25%) of the data.
• The second quartile is the 50th percentile. As
the median, it gives the center of the data
distribution.
• The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data.
• This distance is called the interquartile range
(IQR) and is defined as
• IQR= Q3-Q1
How to find quartiles of odd length
data set?
• Data = 8, 5, 2, 4, 8, 9, 5
• Step 1:
• First of all, arrange the values in order.
• Data = 2, 4, 5, 5, 8, 8, 9
• Step 2:
• For dividing this data into four equal parts, we need three quartiles.
• Q1: Lower quartile
• Q2: Median of the data set
• Q3: Upper quartile
• Step 3:
• Find the median of the data set and label it as Q2.
• Data = 2, 4, 5, 5, 8, 8, 9
• Q1: 4 – Lower quartile
• Q2: 5 – Middle quartile
• Q3: 8 – Upper quartile
• Inter Quartile Range= Q3 – Q1
• = 8 – 4
• = 4
How to find quartiles of even length
data set?
• Data = 8, 5, 2, 4, 8, 9, 5,7
• Step 1:
• First of all, arrange the values in order
• After ordering the values:
• Data = 2, 4, 5, 5, 7, 8, 8, 9
• Step 2:
• For dividing this data into four equal parts, we need three quartiles.
• Q1: Lower quartile
• Q2: Median of the data set
• Q3: Upper quartile
• Step 3:
• Find the median of the data set and label it as Q2.
• Data = 2, 4, 5, 5, 7, 8, 8, 9
• Minimum: 2
• Q1: (4 + 5)/2 = 4.5 – Lower quartile
• Q2: (5 + 7)/2 = 6 – Middle quartile
• Q3: (8 + 8)/2 = 8 – Upper quartile
• Maximum: 9
• Inter Quartile Range= Q3 – Q1
• = 8 – 4.5
• = 3.5
Five-Number Summary, Boxplots, and
Outliers
• The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations.
• It is written in the order: Minimum, Q1, Median, Q3, Maximum.
• How to Find a Five-Number Summary: Steps
• Step 1: Put your numbers in ascending order (from smallest to largest). For this particular data set, the order is: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
• Step 2: Find the minimum and maximum for your data set. Now that your numbers are in order, this should be easy to spot. In the example in step 1, the minimum (the smallest number) is 1 and the maximum (the largest number) is 27.
• Step 3: Find the median. The median is the middle number; here it is 9.
• Step 4: Place parentheses around the numbers above and below the median. (This is not technically necessary, but it makes Q1 and Q3 easier to find.) (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
• Step 5: Find Q1 and Q3. Q1 can be thought of as the median of the lower half of the data, and Q3 as the median of the upper half: (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
• Step 6: Write down your summary found in the above steps: minimum = 1, Q1 = 5, median = 9, Q3 = 18, and maximum = 27.
Box plot
• Boxplots are a popular way of visualizing a
distribution.
• A boxplot incorporates the five-number summary as
follows:
• Typically, the ends of the box are at the quartiles so
that the box length is the interquartile range.
• The median is marked by a line within the box.
• Two lines (called whiskers) outside the box extend to
the smallest (Minimum) and largest (Maximum)
observations.
• When the median is in the middle of the box, and
the whiskers are about the same on both sides of
the box, then the distribution is symmetric.
• When the median is closer to the bottom of the
box, and if the whisker is shorter on the lower
end of the box, then the distribution is positively
skewed (skewed right).
• When the median is closer to the top of the box,
and if the whisker is shorter on the upper end of
the box, then the distribution is negatively
skewed (skewed left).
• Boxplots can be computed in O(n log n) time.
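A minimal matplotlib sketch of such a boxplot; whis=(0, 100) makes the whiskers run to the minimum and maximum observations, matching the five-number-summary description above rather than matplotlib's default 1.5 × IQR whiskers:

```python
import matplotlib.pyplot as plt

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
plt.boxplot(data, whis=(0, 100))   # box = Q1..Q3, inner line = median
plt.ylabel("salary (thousands of dollars)")
plt.show()
```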
• An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population.
• Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
• (a) What is the mean of the data? What is the median?
• (b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
• (c) What is the midrange of the data?
• (d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
• (e) Give the five-number summary of the data.
• (f) Show a boxplot of the data.
Example
• 11, 22, 20, 14, 29, 8, 35, 27, 13, 49, 10, 24, 17
• After sorting:
• 8, 10, 11, 13, 14, 17, 20, 22, 24, 27, 29, 35, 49
• Q1 = (11 + 13)/2 = 12, min = 8, max = 49
• Q2 = 20
• Q3 = (27 + 29)/2 = 28
• IQR = 28 − 12 = 16
• Upper fence = Q3 + 1.5·IQR = 28 + 1.5·16 = 52
• Lower fence = Q1 − 1.5·IQR = 12 − 1.5·16 = −12
• 18, 34, 76, 29, 15, 41, 46, 25, 54, 38, 20, 32, 43, 22
• Sorted: (15, 18, 20, 22, 25, 29, 32), (34, 38, 41, 43, 46, 54, 76)
• Q2 = (32 + 34)/2 = 33
• Q1 = 22, Q3 = 43
• IQR = 43 − 22 = 21
• Lower fence: Q1 − 1.5·IQR = −9.5
• Upper fence: Q3 + 1.5·IQR = 74.5
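A sketch of the 1.5 × IQR fence rule worked through above (the quartiles are taken as given, computed with the half-splitting method these slides use):

```python
def fences(q1, q3):
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr   # (lower fence, upper fence)

data = [18, 34, 76, 29, 15, 41, 46, 25, 54, 38, 20, 32, 43, 22]
low, high = fences(q1=22, q3=43)
print(low, high)                                   # -9.5 74.5
print([x for x in data if x < low or x > high])    # [76] flagged as an outlier
```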
Variance and Standard Deviation
• Dispersion refers to how spread out the data values are.
• Variance and standard deviation are measures of data dispersion.
• They indicate how spread out a data distribution is.
• Variance measures how far each number in the data set is from the mean.
Standard deviation
• Standard deviation is the square root of the variance, expressed in the original units.
• A low standard deviation means that the data
observations tend to be very close to the
mean.
• A high standard deviation indicates that
the data are spread out over a large range of
values.
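A minimal Python sketch of (population) variance and standard deviation as described above:

```python
import math

def variance(values):
    mean = sum(values) / len(values)
    # average squared deviation from the mean
    return sum((x - mean) ** 2 for x in values) / len(values)

def std_dev(values):
    return math.sqrt(variance(values))   # back in the original units

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(round(variance(data), 2), round(std_dev(data), 2))
```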
Graphic Displays of Basic Statistical
Descriptions of Data
• Graphic displays of basic statistical
descriptions.
• These include quantile plots, quantile–
quantile plots, histograms, and scatter plots.
• Such graphs are helpful for the visual
inspection of data, which is useful for data
preprocessing.
Find median
Proximity measure
• How alike or unlike are two objects?
• Application
• Clustering
• Outlier analysis
• Nearest neighbour
Types of attribute
• Nominal attributes
• Ordinal attributes
• Binary attributes
• Numerical attributes
• Mixed attributes
Measuring Data Similarity and
Dissimilarity
• Similarity and dissimilarity measures, which are
referred to as measures of proximity.
• Similarity and dissimilarity are related.
• A similarity measure for two objects, i and j, will
typically return the value 0 if the objects are
unalike.
• The higher the similarity value, the greater the similarity between objects. (Typically, a value of 1 indicates complete similarity, that is, the objects are identical.)
• A dissimilarity measure works the opposite
way.
• It returns a value of 0 if the objects are the
same (and therefore, far from being
dissimilar).
• The higher the dissimilarity value, the more
dissimilar the two objects are.
Data Matrix versus Dissimilarity
Matrix
• Data matrix (or object-by-attribute structure): this structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes).
• Suppose that we have n objects (e.g., persons,
items, or courses) described by p attributes (also
called measurements or features, such as age,
height, weight, or gender).
• The objects are x1=(x11,x12,x13,…x1p),
• x2=(x21,x22,….x2p) and so on,
• where xij is the value for object xi of the jth
attribute.
• Each row corresponds to an object
Dissimilarity matrix (or object-by-
object structure):
This structure stores a collection of proximities
that are available for all pairs of n objects.
• It is often represented by an n-by-n table:
• where d(i, j) is the measured dissimilarity or
“difference” between objects i and j.
• In general, d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ.
• Note: d(i, i) = 0; that is, the difference between an object and itself is 0.
• Also, d(i, j) = d(j, i).
• Measures of similarity can often be expressed
as a function of measures of dissimilarity.
• For example, for nominal data:
• sim(i, j) = 1 − d(i, j)
• where sim(i, j) is the similarity between objects i and j.
Proximity Measures for Nominal
Attributes
• For example,map color is a nominal attribute
that may have, say, five states: red, yellow,
green, pink, and blue.
• Let the number of states of a nominal
attribute be M.
• The states can be denoted by letters, symbols,
or a set of integers, such as 1, 2, … , M.
• How is dissimilarity computed between objects described by nominal attributes?
• The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
• d(i, j) = (p − m)/p
• where m is the number of matches (i.e., the number of attributes for which i and j are in the same state) and p is the total number of attributes describing the objects.
• Weights can be assigned to increase the effect
of m or to assign greater weight to the
matches in attributes having a larger number
of states.
• p = 1 (there is one nominal attribute); m = number of matches.
• d(2,1) = (p − m)/p = (1 − 0)/1 = 1
• d(3,1) = (1 − 0)/1 = 1
• d(4,1) = (1 − 1)/1 = 0
• d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.
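A sketch of the mismatch ratio d(i, j) = (p − m)/p in Python; the objects here are tuples of nominal values, and the sample labels are made up for illustration:

```python
def nominal_dissimilarity(obj_i, obj_j):
    p = len(obj_i)                                 # number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matching states
    return (p - m) / p

# One nominal attribute (p = 1), as in the worked example above:
print(nominal_dissimilarity(("code-A",), ("code-B",)))  # 1.0 (mismatch)
print(nominal_dissimilarity(("code-A",), ("code-A",)))  # 0.0 (match)
```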
Dissimilarity matrix
• A dissimilarity measure works the opposite
way. It returns a value of 0 if the objects are
the same (and therefore,
• far from being dissimilar). The higher the
dissimilarity value, the more dissimilar the
• two objects are.
Similarity matrix
• A similarity measure for two objects, i and j,
will typically return the value 0 if the objects
are unalike.
• The higher the similarity value, the greater the
similarity between objects. (Typically, a value
of 1 indicates complete similarity, that is, the
objects are identical.)
Example
Ordinal attribute
• There are three states for test-2: fair, good, and excellent; that is, Mf = 3.
• For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
• Step 2 normalizes the ranking by mapping rank 1
to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
• For step 3, we can use, say, the Euclidean
distance.
• 1. Find the number of states of the ordinal attribute (Mf = 3) and rank the values.
• 2. Normalize the ranks: zif = (rif − 1)/(Mf − 1)
• 3. Find the distance between the objects.
• fair (rank 1): (1 − 1)/(3 − 1) = 0
• good (rank 2): (2 − 1)/(3 − 1) = 0.5
• excellent (rank 3): (3 − 1)/(3 − 1) = 1
• Manhattan distance = |x1 − y1| + |x2 − y2|
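A sketch of the three-step ordinal recipe in Python, using the fair/good/excellent states above:

```python
states = ["fair", "good", "excellent"]       # Mf = 3, in meaningful order
rank = {s: i + 1 for i, s in enumerate(states)}

def z(value, M=len(states)):
    # Step 2: normalize rank r to z = (r - 1) / (M - 1), mapping onto [0, 1]
    return (rank[value] - 1) / (M - 1)

print(z("fair"), z("good"), z("excellent"))  # 0.0 0.5 1.0
# Step 3: any numeric distance on the z values, e.g. Manhattan on one attribute:
print(abs(z("excellent") - z("fair")))       # 1.0
```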
Binary attribute
• It can be either symmetric or asymmetric
• A binary attribute is symmetric if both of its
states are equally valuable and carry the same
weight; that is, there is no preference on which
outcome should be coded as 0 or 1. (Example
male and female)
• A binary attribute is asymmetric if the outcomes
of the states are not equally important,
• such as the positive and negative outcomes of a
medical test for HIV
Binary attribute
• We can compute similarity and dissimilarity matrices for both symmetric and asymmetric binary attributes.
• q = m11
• r = m10
• s = m01
• t = m00
• If all binary attributes are thought of as having the same weight, we have the 2 × 2 contingency table:
• where q is the number of attributes that equal 1 for both
objects i and j,
• r is the number of attributes that equal 1 for object i but
equal 0 for object j,
• s is the number of attributes that equal 0 for object i but
equal 1 for object j,
• and t is the number of attributes that equal 0 for both
objects i and j.
• The total number of attributes is p, where p = q +r + s +t .
Similarity
• For symmetric binary attributes we use the simple matching coefficient (SMC):
• sim(i, j) = (m11 + m00)/(m11 + m10 + m01 + m00)
Similarity
• For asymmetric binary attributes we can use the Jaccard coefficient:
• sim(i, j) = q/(q + r + s) = m11/(m11 + m01 + m10)
Dissimilarity
• We can likewise build dissimilarity matrices for symmetric and asymmetric binary attributes.
symmetric binary
dissimilarity matrix for asymmetric
binary
asymmetric binary
Example
For dissimilarity symmetric binary
Dissimilarity( Symmetric binary)
• D(jack,jim)=(0+1)/(2+0+1+3)=1/6
• D(mary,jack)=(1+1)/(1+2+1+2)=2/6
• D(mary,jim)=(2+1)/(1+2+1+2)=3/6
Dissimilarity( Symmetric binary)
Binary (similarity)
• 1. Symmetric binary (simple matching coefficient):
• sim(i, j) = (m11 + m00)/(m11 + m10 + m01 + m00)
• X = 1,0,0,0,0,0,0,0,0,0
• Y = 0,0,0,0,0,0,1,0,0,1
• SMC(x, y) = (0 + 7)/(0 + 1 + 2 + 7) = 7/10 = 0.7
• The similarity between x and y is 70%.
Similarity for asymmetric binary
Jaccard coeffcient
• sim(i, j) = q/(q + r + s) = m11/(m11 + m01 + m10)
• For x and y above: Jaccard = 0/(0 + 2 + 1) = 0
• A supermarket carries 1000 products.
• C1 = {sugar, coffee, tea, rice, egg}
• C2 = {sugar, coffee, bread, biscuit}
• How much similarity is there between customer 1 and customer 2?
• m11 = 2 (items present for both customers): {sugar, coffee}
• m10 = 3 (items present for customer 1 but not customer 2): {tea, rice, egg}
• m01 = 2 (items present for customer 2 but not customer 1): {bread, biscuit}
• m00 = items present for neither customer = total items − (m11 + m10 + m01) = 1000 − 7 = 993
• Jaccard coefficient = m11/(m11 + m10 + m01) = 2/(2 + 3 + 2) = 2/7
• SMC = (m11 + m00)/(m11 + m10 + m01 + m00) = (2 + 993)/(2 + 3 + 2 + 993) = 0.995
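The same arithmetic as a Python sketch, deriving the contingency counts from the two baskets:

```python
c1 = {"sugar", "coffee", "tea", "rice", "egg"}
c2 = {"sugar", "coffee", "bread", "biscuit"}
total_products = 1000

m11 = len(c1 & c2)                        # items both customers bought: 2
m10 = len(c1 - c2)                        # only customer 1: 3
m01 = len(c2 - c1)                        # only customer 2: 2
m00 = total_products - (m11 + m10 + m01)  # neither: 993

print(m11 / (m11 + m10 + m01))                # Jaccard = 2/7 ≈ 0.286
print((m11 + m00) / (m11 + m10 + m01 + m00))  # SMC = 995/1000 = 0.995
```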
Mixed attribute
For test1
For test1
For test-2
Numerical data
• Normalize the data so that values fall between 0 and 1.
• d(i, j) = |xif − xjf| / (maxf − minf)
• d(2,1) = |22 − 45| / (64 − 22) = 23/42 = 0.55
• d(3,1) = |64 − 45| / (64 − 22) = 19/42 = 0.45
• d(3,2) = |64 − 22| / (64 − 22) = 42/42 = 1
• d(4,2) = |28 − 22| / (64 − 22) = 6/42 = 0.14
• d(4,3) = |64 − 28| / (64 − 22) = 36/42 = 0.86
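A sketch of this min-max-normalized distance for one numeric attribute; the values 45, 22, 64, 28 for objects 1–4 are inferred from the computations above:

```python
values = [45, 22, 64, 28]            # attribute values for objects 1..4
lo, hi = min(values), max(values)    # 22 and 64

def d(i, j):
    # d(i, j) = |x_i - x_j| / (max - min), so every distance lands in [0, 1]
    return abs(values[i - 1] - values[j - 1]) / (hi - lo)

print(round(d(2, 1), 2), round(d(3, 1), 2), round(d(4, 2), 2))  # 0.55 0.45 0.14
```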
Final dissimilarity matrix
How to put it all together
• d(i, j) = Σf=1..p [ δij(f) · dij(f) ] / Σf=1..p δij(f)
• where δij(f) = 0 if xif or xjf is missing, or if xif = xjf = 0 and attribute f is asymmetric binary; otherwise δij(f) = 1, and dij(f) is the contribution of attribute f to the dissimilarity between i and j.
• d(2,1) = (1·1 + 1·1 + 1·0.55)/(1 + 1 + 1) = 2.55/3 = 0.85
• d(3,1) = (1·1 + 1·0.5 + 1·0.45)/3 = 0.65
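A sketch of that weighted combination in Python; each attribute contributes an indicator delta (0 when the comparison should be skipped, else 1) and a per-attribute dissimilarity d_f:

```python
def mixed_dissimilarity(per_attribute):
    """per_attribute: list of (delta_f, d_f) pairs for one object pair."""
    num = sum(delta * d for delta, d in per_attribute)
    den = sum(delta for delta, _ in per_attribute)
    return num / den

# d(2,1): nominal mismatch (1.0), ordinal distance (1.0), numeric (0.55)
print(round(mixed_dissimilarity([(1, 1.0), (1, 1.0), (1, 0.55)]), 2))  # 0.85
```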
Final matrix
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
• Thus, each document is an object represented by what is called a term-frequency vector.
• Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values).
• Applications using such structures include information retrieval, text document clustering, biological taxonomy, and gene feature mapping.
• The traditional distance measures that we
have studied in this chapter do not work well
for such sparse numeric data.
Cosine similarity
• Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words.
• Let x and y be two vectors for comparison. Using the cosine measure as a similarity function, we have sim(x, y) = (x · y) / (||x|| · ||y||), where ||x|| is the Euclidean norm of vector x.
• The measure computes the cosine of the
angle between vectors x and y. A cosine value
of 0 means that the two vectors are at 90
degrees to each other (orthogonal) and have
no match.
• The closer the cosine value to 1, the smaller
the angle and the greater the match between
vectors.
Cosine similarity and distance
• Suppose we have two points p1 and p2.
• If the distance between p1 and p2 increases, the similarity decreases.
• If the distance between p1 and p2 decreases, the similarity increases.
• Cosine distance = 1 − cosine similarity.
• To find the cosine similarity between two objects, we find the angle between them.
• Cosine similarity = cos(theta), where theta is the angle between objects p1 and p2.
• The cosine similarity ranges between −1 and 1.
• The larger the angle between them, the less similar they are.
• The smaller the angle between them, the more similar they are.
• cos 0° = 1: maximum similarity.
• cos 90° = 0: no similarity.
• If the angle between two objects p1 and p2 is 45 degrees, then cosine similarity = cos 45° ≈ 0.707, i.e., about 70.7% similarity between p1 and p2.
The angle is 0 (more similar)
Example
• Cosine similarity between two term-frequency vectors: suppose that x and y are the first two term-frequency vectors in Table 2.5. That is, x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar are x and y?
• Computing the cosine similarity between the two vectors: x · y = 25, ||x|| = √42 ≈ 6.48, ||y|| = √17 ≈ 4.12, so sim(x, y) ≈ 25/(6.48 × 4.12) ≈ 0.94.
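The same computation as a Python sketch:

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))   # Euclidean norm ||x||
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(x, y), 2))   # 0.94
```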
Euclidean distance (L2 norm)
Euclidean distance
Manhattan distance (L1 norm)
Minkowski distance
Supremum distance
• Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
• (a) Compute the Euclidean distance between the two objects.
• (b) Compute the Manhattan distance between the two objects.
• (c) Compute the Minkowski distance between the two objects, using q = 3.
Assignment
• (d) Compute the supremum distance between
the two objects
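A hedged sketch of all four distances for the tuples above (the supremum distance is taken as the limit of Minkowski as q grows, i.e., the largest coordinate gap):

```python
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = (22, 1, 42, 10), (20, 0, 36, 8)
print(round(minkowski(x, y, 2), 2))           # (a) Euclidean      ≈ 6.71
print(minkowski(x, y, 1))                     # (b) Manhattan      = 11.0
print(round(minkowski(x, y, 3), 2))           # (c) Minkowski, q=3 ≈ 6.15
print(max(abs(a - b) for a, b in zip(x, y)))  # (d) supremum       = 6
```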
Data Preprocessing
• Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
• Low-quality data will lead to low-quality
mining results.
Data Preprocessing
• There are several data preprocessing
techniques.
• Data cleaning can be applied to remove noise
and correct inconsistencies in data.
• Data integration merges data from multiple sources into a coherent data store such as a data warehouse.
• Data reduction can reduce data size by, for
instance, aggregating, eliminating redundant
features, or clustering.
• Data transformations (e.g., normalization)
may be applied, where data are scaled to fall
within a smaller range like 0.0 to 1.0.
• This can improve the accuracy and efficiency
of mining algorithms involving distance
measurements.
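A minimal sketch of the min-max normalization described above (scaling values into the range [0.0, 1.0]):

```python
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]   # 0.0 = min, 1.0 = max

print(min_max_normalize([30, 36, 47, 50, 52, 110]))
```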
• Three of the elements defining data quality are accuracy, completeness, and consistency.
• Inaccurate, incomplete, and inconsistent data
are commonplace properties of large real-
world databases and data warehouses.
Factors affecting data quality
• Timeliness also affects data quality.
• If month-end data are not updated in a timely fashion, this has a negative impact on data quality.
• Two other factors affecting data quality are
believability and interpretability.
• Believability reflects how much the data are
trusted by users,
• Interpretability reflects how easily the data are understood.
Example
• Suppose that a database, at one point, had several errors, all of which have since been corrected.
• The past errors, however, had caused many problems for sales department users, and so they no longer trust the data.
• The data also use many accounting codes, which the sales department does not know how to interpret.
Data Cleaning
• Real-world data tend to be incomplete, noisy,
and inconsistent.
• Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out
noise while identifying outliers, and correct
inconsistencies in the data.
Missing Values
• 1. Ignore the tuple: this is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values.
• 2. Fill in the missing value manually: in general, this approach is time consuming and may not be feasible given a large data set with many missing values.
• 3. Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞.
• 4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
• 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
• 6. Use the most probable value to fill in the missing value: this may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
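A hedged pandas sketch of strategies 3 and 4 above; the DataFrame and its column names (income, category) are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"income": [50.0, None, 70.0, 65.0],
                   "category": ["A", "B", None, "B"]})

# Strategy 3: fill a nominal attribute with a global constant.
df["category"] = df["category"].fillna("Unknown")
# Strategy 4: fill a numeric attribute with a measure of central tendency.
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```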
• https://t4tutorials.com/what-are-quartiles-in-data-mining/

More Related Content

Similar to Data Mining-2023 (2).ppt

chap1.ppt
chap1.pptchap1.ppt
chap1.pptImXaib
 
Data mining basic concept and Data warehousing
Data mining basic concept and Data warehousingData mining basic concept and Data warehousing
Data mining basic concept and Data warehousingNivaTripathy1
 
omama munir 58.pptx
omama munir 58.pptxomama munir 58.pptx
omama munir 58.pptxOmamaNoor2
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining Jeremiah Fadugba
 
Chapter 2 - Introduction to Data Science.pptx
Chapter 2 - Introduction to Data Science.pptxChapter 2 - Introduction to Data Science.pptx
Chapter 2 - Introduction to Data Science.pptxWollo UNiversity
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data miningDhilsath Fathima
 
Data warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniquesData warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniquesVaibhav Khanna
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data AnalyticsUtkarsh Sharma
 
What is Data mining? Data mining Presentation
What is Data mining? Data mining Presentation What is Data mining? Data mining Presentation
What is Data mining? Data mining Presentation Pralhad Rijal
 
Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Aiswaryadevi Jaganmohan
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxshumPanwar
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemKiran kumar
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1malathieswaran29
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CSThanveen
 
dataWarehouse.pptx
dataWarehouse.pptxdataWarehouse.pptx
dataWarehouse.pptxhqlm1
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
 

Similar to Data Mining-2023 (2).ppt (20)

chap1.ppt
chap1.pptchap1.ppt
chap1.ppt
 
chap1.ppt
chap1.pptchap1.ppt
chap1.ppt
 
Data mining basic concept and Data warehousing
Data mining basic concept and Data warehousingData mining basic concept and Data warehousing
Data mining basic concept and Data warehousing
 
omama munir 58.pptx
omama munir 58.pptxomama munir 58.pptx
omama munir 58.pptx
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining
 
Chapter 2 - Introduction to Data Science.pptx
Chapter 2 - Introduction to Data Science.pptxChapter 2 - Introduction to Data Science.pptx
Chapter 2 - Introduction to Data Science.pptx
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data mining
 
Data warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniquesData warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniques
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 
What is Data mining? Data mining Presentation
What is Data mining? Data mining Presentation What is Data mining? Data mining Presentation
What is Data mining? Data mining Presentation
 
Lecture2 (1).ppt
Lecture2 (1).pptLecture2 (1).ppt
Lecture2 (1).ppt
 
Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptx
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
dataWarehouse.pptx
dataWarehouse.pptxdataWarehouse.pptx
dataWarehouse.pptx
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
 

Recently uploaded

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Introduction to Multiple Access Protocol.pptx
Data Mining-2023 (2).ppt

  • 23. • A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales_amount). • A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
  • 25. Transactional Data • In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page. • A transaction typically includes a unique transaction identity number (transID) and a list of the items making up the transaction, such as the items purchased in the transaction.
  • 26. • A transactional database may have additional tables, which contain other information related to the transactions, such as item description, information about the salesperson or the branch, and so on.
  • 27. Data mining functionalities • We have observed various types of data and information repositories on which data mining can be performed.
  • 28. Data mining functionalities • These include 1. characterization and discrimination, 2. the mining of frequent patterns, associations, and correlations, 3. classification and regression, and 4. cluster analysis and outlier analysis. Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
  • 29. Characterization • Data characterization is a summarization of the general characteristics or features of a target class of data. • Data entries can be associated with classes or concepts. • For example, in an electronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.
  • 30. Characterization • It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. • Such descriptions of a class or a concept are called class/concept descriptions. • These descriptions can be derived using data characterization, by summarizing the data of the class under study (often called the target class).
  • 31. Example • A customer relationship manager at AllElectronics may order the following data mining task: Summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. • The result is a general profile of these customers, such as that they are 40 to 50 years old, employed, and have excellent credit ratings.
  • 32. Characterization • The output of data characterization can be presented in various forms. • Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
  • 33. Data discrimination • Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes. • The target and contrasting classes can be specified by a user, and the corresponding data objects can be retrieved through database queries.
  • 34. Example, • A user may want to compare the general features of software products with sales that increased by 10% last year against those with sales that decreased by at least 30% during the same period.
  • 35. Mining Frequent Patterns, Associations, and Correlations • Frequent patterns, as the name suggests, are patterns that occur frequently in data. • There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences (also known as sequential patterns), and frequent substructures.
  • 36. • A frequent itemset typically refers to a set of items that often appear together in a transactional data set. • for example, milk and bread, which are frequently bought together in grocery stores by many customers.
  • 37. • A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
  • 38. • A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences. • If a substructure occurs frequently, it is called a (frequent) structured pattern. • Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
  • 39. Classification and Regression for Predictive Analysis • Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. • The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known). • The model is used to predict the class label of objects for which the class label is unknown.
  • 40. • The derived model may be represented in various forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks.
  • 41. Types of ML Classification Algorithms: • Classification Algorithms can be further divided into the following types: • Logistic Regression • K-Nearest Neighbours • Support Vector Machines • Kernel SVM • Naïve Bayes • Decision Tree Classification • Random Forest Classification
  • 43. • Classification predicts categorical (discrete, unordered) labels. • Regression models predict continuous-valued functions.
  • 44. Regression: • Regression is a process of finding the correlations between dependent and independent variables. • It helps in predicting the continuous variables such as prediction of Market Trends, prediction of House prices, etc.
  • 45. Regression • The task of the Regression algorithm is to find the mapping function to map the input variable(x) to the continuous output variable(y).
  • 46. Types of Regression Algorithm: • Simple Linear Regression • Multiple Linear Regression • Polynomial Regression • Support Vector Regression • Decision Tree Regression • Random Forest Regression
  • 47. • Example: predicting the amount of revenue that each item will generate during an upcoming sale at an electronics shop, based on the previous sales data.
  • 48. Cluster Analysis • clustering analyzes data objects without consulting class labels. • In many cases, class labeled data may simply not exist at the beginning. • Clustering can be used to generate class labels for a group of data. • The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
  • 50. Outlier Analysis • A data set may contain objects that do not comply with the general behavior or model of the data. • These data objects are outliers. • Many data mining methods discard outliers as noise or exceptions. • However, in some applications (e.g., fraud detection) the rare events can be more interesting than the more regularly occurring ones. • The analysis of outlier data is referred to as outlier analysis or anomaly mining.
  • 51. Getting to Know Your Data • Data Objects- • Data sets are made up of data objects. • A data object represents an entity. • For example, in a sales database, the objects may be customers, store items, and sales; • in a medical database, the objects may be patients; • in a university database, the objects may be students, professors, and courses.
  • 53. Data Objects- • Data objects are typically described by attributes. • Data objects can also be referred to as samples, examples, instances, data points, or objects. • If the data objects are stored in a database, they are data tuples. • That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes.
  • 54. What Is an Attribute? • An attribute is a data field, representing a characteristic or feature of a data object. • The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature. • The term dimension is commonly used in data warehousing. • Machine learning literature tends to use the term feature, while statisticians prefer the term variable. • Data mining and database professionals commonly use the term attribute.
  • 55. • Each row can be viewed as a vector whose components are that object's individual attribute values. • These vectors are also sometimes known as the object vector or the feature vector. • What is the dimension of the vector? • The dimension of the vector is determined by the number of attributes in the table.
  • 56. Types of Attribute • Nominal attribute-Nominal means “relating to names.” • The values of a nominal attribute are symbols or names of things. • Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. • The values do not have any meaningful order.
  • 57. • For example, values of hair color include black, brown, blond, red, auburn, gray, and white. • The attribute marital status can take on the values single, married, divorced, and widowed. • Another example of a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and so on.
  • 58. Binary Attributes • A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. • Binary attributes are referred to as Boolean if the two states correspond to true and false.
  • 59. Example • The attribute medical test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is negative.
  • 60. Binary attributes. • Given the attribute smoker describing a patient object, 1 indicates that the patient smokes, while 0 indicates that the patient does not.
  • 61. • A binary attribute is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1.
  • 62. asymmetric • A binary attribute is asymmetric if the outcomes of the states are not equally important, • such as the positive and negative outcomes of a medical test for HIV. • By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).
  • 63. Ordinal Attributes • An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known. • For example, the ordinal attribute drink size has three possible values: small, medium, and large.
  • 64. • We cannot tell from the values how much bigger, say, a large is than a medium. • Professional rank-Professional ranks can be enumerated in a sequential order: • for example, assistant, associate, and full for professors.
  • 65. Numeric Attributes • A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. • Numeric attributes can be interval-scaled or ratio-scaled.
  • 66. Interval-Scaled Attributes • Interval-scaled attributes are measured on a scale of equal-size units. • The values of interval-scaled attributes have order and can be positive, 0, or negative. • Example- A temperature attribute is interval-scaled. • Suppose that we have the outdoor temperature value for a number of different days. • By ordering the values, we obtain a ranking of the objects with respect to temperature.
  • 67. • Example- Calendar dates are another example. For instance, the years 2002 and 2010 are eight years apart.
  • 68. Ratio-Scaled Attributes • A ratio-scaled attribute is a numeric attribute with an inherent zero-point. • That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. • In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.
  • 69. Examples of ratio-scaled attributes • such as years of experience(e.g., the objects are employees) • and number of words (e.g., the objects are documents). • Additional examples include attributes to measure weight, height, latitude and longitude.
  • 70. • Interval scales hold no true zero and can represent values below zero. • For example, you can measure temperatures below 0 degrees Celsius, such as -10 degrees. Ratio variables, on the other hand, never fall below zero. Height and weight measure from 0 and above, but never fall below it.
  • 71. Discrete versus Continuous Attributes • A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers. • The attributes hair color, smoker, medical test, and drink size each have a finite number of values, and so are discrete.
  • 72. Example • Note that discrete attributes may have numeric values, such as 0 and 1 for binary attributes • or, the values 0 to 110 for the attribute age. • An attribute is countably infinite if the set of possible values is infinite but the values can be put in a one-to-one correspondence with natural numbers. • For example, the attribute customer ID is countably infinite. • Zip codes are another example.
  • 73. continuous • If an attribute is not discrete, it is continuous. • Continuous values are real numbers, whereas numeric values can be either integers or real numbers. • Continuous attributes are typically represented as floating-point variables.
  • 74. Example • A feature F1 can take certain values: A, B, C, D, E, F, and represents the grade of students from a college. Which of the following statements is true in this case? • a. Feature F1 is an example of a nominal variable. • b. Feature F1 is an example of an ordinal variable. • c. It doesn’t belong to any of the above categories. • d. Both of these
  • 77. Measures of Central Tendency • The central value or the most occurring value that gives a general idea of the whole data set is called the Measure of Central Tendency. • Some of the most commonly used measures of central tendency are: • Mean • Median • Mode
  • 78. Example • Suppose that we have some attribute X, like salary, which has been recorded for a set of objects. • Let x1, x2, ..., xN be the set of N observed values or observations for X. • If we were to plot the observations for salary, where would most of the values fall?
  • 79. Mean • The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean. Let x1, x2, ..., xN be the set of N observed values or observations for X. The mean of this set of values is
  • 81. Mean • mean = (x1 + x2 + … + xN) / N
  • 84. • A major problem with the mean is its sensitivity to extreme (e.g., outlier) values. • Even a small number of extreme values can corrupt the mean.
  • 85. Trimmed mean • Trimmed mean- the mean obtained after chopping off values at the high and low extremes. • For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
  • 86. • Let's say, as an example, a figure skating competition produces the following scores: 6.0, 8.1, 8.3, 9.1, and 9.9. • The mean for the scores would equal: • ((6.0 + 8.1 + 8.3 + 9.1 + 9.9) / 5) = 8.28 • To trim the mean by a total of 40%, we remove the lowest 20% and the highest 20% of values, eliminating the scores of 6.0 and 9.9. • Next, we calculate the mean based on the calculation: • (8.1 + 8.3 + 9.1) / 3 = 8.50
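As a quick sanity check, the skating example can be reproduced in Python (a minimal sketch; the trimmed_mean helper and the 20% fraction mirror the example above):

import statistics

scores = [6.0, 8.1, 8.3, 9.1, 9.9]
print(statistics.mean(scores))          # 8.28

def trimmed_mean(values, fraction):
    # Sort, drop `fraction` of the values at each end, average the rest.
    values = sorted(values)
    k = int(len(values) * fraction)
    return statistics.mean(values[k:len(values) - k])

print(trimmed_mean(scores, 0.20))       # (8.1 + 8.3 + 9.1) / 3 = 8.5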
  • 87. Median. • Median is the middle value in a set of ordered data values. It is the value that separates the higher half of a data set from the lower half.
  • 88. • Suppose that a given data set of N values for an attribute X is sorted in increasing order. • If N is odd, then the median is the middle value of the ordered set. • If N is even, then the median is not unique; it is the two middlemost values and any value in between. • If X is a numeric attribute, by convention, the median is taken as the average of the two middlemost values.
  • 89. • Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. • So median = (52 + 56)/2 = 54, that is, $54,000. • Suppose that we had only the first 11 values in the list. • Given an odd number of values, the median is the middlemost value. • This is the sixth value in this list, which has a value of $52,000.
  • 90. • How to calculate the median for an even number of values? • Example: • 9, 8, 5, 6, 3, 4 • Arrange the values in order: • 3, 4, 5, 6, 8, 9 • Add the 2 middle values and calculate their mean. • Median = (5 + 6)/2 • Median = 5.5
  • 91. • The median is expensive to compute when we have a large number of observations.
  • 92. Median for range of values
  • 94. • Note: The median class is the class that contains the value located at N/2.
  • 98. • L: Lower limit of median class: 11 • W: Width of median class: 9 • N: Total Frequency: 60 • C: Cumulative frequency up to median class: 8 • F: Frequency of median class: 25
  • 99. • Median = L + W[(N/2 – C) / F] • Median = 11 + 9[(60/2 – 8) / 25] • Median = 18.92 • We estimate that the median exam score is 18.92.
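The same estimate in Python (a small sketch reusing the L, W, N, C, F values listed above):

L, W, N, C, F = 11, 9, 60, 8, 25      # median class limits and frequencies from above
median = L + W * ((N / 2 - C) / F)    # 11 + 9 * (22 / 25)
print(round(median, 2))               # 18.92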
  • 100. Mode • The mode is another measure of central tendency. • The mode for a set of data is the value that occurs most frequently in the set. • For Example, • In {6, 9, 3, 6, 6, 5, 2, 3}, the Mode is 6 as it occurs most often.
  • 101. Types of Mode • The different types of Mode are Unimodal, Bimodal, Trimodal, and Multimodal. Let us understand each of these Modes. • Unimodal Mode - A set of data with one Mode is known as a Unimodal Mode. • For example, the Mode of data set A = { 14, 15, 16, 17, 15, 18, 15, 19} is 15 as there is only one value repeating itself. Hence, it is a Unimodal data set.
  • 102. • Bimodal Mode - A set of data with two Modes is known as a Bimodal Mode. This means that there are two data values that are having the highest frequencies. • For example, the Mode of data set A = { 8,13,13,14,15,17,17,19} is 13 and 17 because both 13 and 17 are repeating twice in the given set. Hence, it is a Bimodal data set.
  • 103. • Trimodal Mode - A set of data with three Modes is known as a Trimodal Mode. This means that there are three data values that are having the highest frequencies. • For example, the Mode of data set A = {2, 2, 2, 3, 4, 4, 5, 6, 5,4, 7, 5, 8} is 2, 4, and 5 because all the three values are repeating thrice in the given set. • Hence, it is a Trimodal data set. •
  • 104. • Multimodal Mode - A set of data with four or more than four Modes is known as a Multimodal Mode. • For example, the Mode of data set • A = {100, 80, 80, 95, 95, 100, 90, 90} is 80, 90, 95, and 100 because all four values are repeated twice in the given set. Hence, it is a Multimodal data set.
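A short helper that returns all modes, so it handles unimodal through trimodal data alike (a sketch; the data sets are the ones used above):

from collections import Counter

def modes(values):
    # Return every value whose frequency equals the maximum frequency.
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([6, 9, 3, 6, 6, 5, 2, 3]))                  # [6]        (unimodal)
print(modes([8, 13, 13, 14, 15, 17, 17, 19]))           # [13, 17]   (bimodal)
print(modes([2, 2, 2, 3, 4, 4, 5, 6, 5, 4, 7, 5, 8]))   # [2, 4, 5]  (trimodal)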
  • 107. • Data in most real applications are not symmetric. • They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median or negatively skewed , where the mode occurs at a value greater than the median .
  • 108. What is data skewness? • When most of the values are skewed to the left or right side from the median, then the data is called skewed. • Data can be in any of the following shapes; • Symmetric: Mean, median and mode are at the same point. • Positively skewed: When most of the values are to the left from the median. • Negatively skewed: When most of the values are to the right from the median.
  • 110. • For a symmetric distribution: mean = median = mode. • For a positively skewed distribution: mean > median > mode. • For a negatively skewed distribution: mode > median > mean. • Empirical relation: mean − mode ≈ 3(mean − median), so mode ≈ 3·median − 2·mean.
  • 111. • The midrange can also be used to assess the central tendency of a numeric data set. • It is the average of the largest and smallest values in the set. • Midrange. The midrange of the salary data is (30,000 + 110,000)/2 = $70,000.
  • 112. Measuring the Dispersion of Data: • Range, Quartiles, Variance, Standard Deviation, and Interquartile Range. • Range- let x1, x2, ..., xn be a set of observations for some numeric attribute, X. The range of the set is the difference between the largest (max) and smallest (min) values.
  • 114. What is a quantile? • Suppose that the data for attribute X are sorted in increasing numeric order. • We can pick certain data points so as to split the data distribution into equal-size consecutive sets. • These data points are called quantiles. • Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
  • 115. • The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median. • The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. • They are more commonly referred to as quartiles. • The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into 100 equal-sized consecutive sets.
  • 117. • The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. • The third quartile, denoted by Q3, is the 75th percentile: it cuts off the lowest 75% (or highest 25%) of the data. • The second quartile is the 50th percentile. As the median, it gives the center of the data distribution.
  • 118. • The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. • This distance is called the interquartile range (IQR) and is defined as • IQR = Q3 − Q1
  • 119. How to find quartiles of odd length data set? • Data = 8, 5, 2, 4, 8, 9, 5 • Step 1: • First of all, arrange the values in order. • Data = 2, 4, 5, 5, 8, 8, 9
  • 120. • Step 2: • For dividing this data into four equal parts, we needed three quartiles. • Q1: Lower quartile • Q2: Median of the data set • Q3: Upper quartile
  • 121. • Step 3: • Find the median of the data set and label it as Q2. • Data = 2, 4, 5, 5, 8, 8, 9 • Q1: 4 – Lower quartile • Q2: 5 – Middle quartile • Q3: 8 – Upper quartile • Inter Quartile Range= Q3 – Q1 • = 8 – 4 • = 4
  • 122. How to find quartiles of even length data set? • Data = 8, 5, 2, 4, 8, 9, 5,7 • Step 1: • First of all, arrange the values in order • After ordering the values: • Data = 2, 4, 5, 5, 7, 8, 8, 9
  • 123. • Step 2: • For dividing this data into four equal parts, we needed three quartiles. • Q1: Lower quartile • Q2: Median of the data set • Q3: Upper quartile • Step 3: • Find the median of the data set and label it as Q2.
  • 124. • Data = 2, 4, 5, 5, 7, 8, 8, 9 • Minimum: 2 • Q1: (4 + 5)/2 = 4.5 Lower quartile • Q2: (5 + 7)/2 = 6 Middle quartile • Q3: (8 + 8)/2 = 8 Upper quartile • Maximum: 9 • Inter Quartile Range = Q3 − Q1 • = 8 − 4.5 • = 3.5
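The median-of-halves rule used in these two walkthroughs can be written directly (a sketch; note that library routines such as numpy.percentile interpolate differently and may give slightly different quartiles):

def median(x):
    m = len(x) // 2
    return x[m] if len(x) % 2 else (x[m - 1] + x[m]) / 2

def quartiles(values):
    # Q2 is the overall median; Q1/Q3 are medians of the lower/upper halves.
    v = sorted(values)
    n = len(v)
    return median(v[:n // 2]), median(v), median(v[(n + 1) // 2:])

q1, q2, q3 = quartiles([8, 5, 2, 4, 8, 9, 5, 7])
print(q1, q2, q3, q3 - q1)     # 4.5 6.0 8.0 3.5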
  • 125. Five-Number Summary, Boxplots, and Outliers • The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations. • It is written in the order: Minimum, Q1, Median, Q3, Maximum.
  • 126. • How to Find a Five-Number Summary: Steps • Step 1: Put your numbers in ascending order (from smallest to largest). For this particular data set, the order is: Example: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27. • Step 2: Find the minimum and maximum for your data set. Now that your numbers are in order, this should be easy to spot. In the example in step 1, the minimum (the smallest number) is 1 and the maximum (the largest number) is 27. • Step 3: Find the median. The median is the middle number.
  • 127. • Step 4: Place parentheses around the numbers above and below the median. (This is not technically necessary, but it makes Q1 and Q3 easier to find.) (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). • Step 5: Find Q1 and Q3. Q1 can be thought of as a median of the lower half of the data, and Q3 as a median of the upper half. (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). • Step 6: Write down your summary found in the above steps: minimum = 1, Q1 = 5, median = 9, Q3 = 18, and maximum = 27
  • 128. Box plot • Boxplots are a popular way of visualizing a distribution. • A boxplot incorporates the five-number summary as follows: • Typically, the ends of the box are at the quartiles so that the box length is the interquartile range. • The median is marked by a line within the box. • Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
  • 132. • When the median is in the middle of the box, and the whiskers are about the same on both sides of the box, then the distribution is symmetric. • When the median is closer to the bottom of the box, and if the whisker is shorter on the lower end of the box, then the distribution is positively skewed (skewed right). • When the median is closer to the top of the box, and if the whisker is shorter on the upper end of the box, then the distribution is negatively skewed (skewed left).
  • 133. • Boxplots can be computed in O(n log n) time. • An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
  • 134. • Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. • (a) What is the mean of the data? What is the median? • (b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
  • 135. • (c) What is the midrange of the data? • (d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data? • (e) Give the five-number summary of the data. • (f) Show a boxplot of the data.
  • 136. Example • 11, 22, 20, 14, 29, 8, 35, 27, 13, 49, 10, 24, 17 • After sorting: • 8, 10, 11, 13, 14, 17, 20, 22, 24, 27, 29, 35, 49 • Q1 = (11 + 13)/2 = 12, min = 8, max = 49 • Q2 = 20 • Q3 = (27 + 29)/2 = 28 • IQR = 28 − 12 = 16 • Upper outlier fence = Q3 + 1.5·IQR = 28 + 1.5·16 = 52 • Lower outlier fence = Q1 − 1.5·IQR = 12 − 1.5·16 = −12
  • 137. • 18, 34, 76, 29, 15, 41, 46, 25, 54, 38, 20, 32, 43, 22 • (15, 18, 20, 22, 25, 29, 32), (34, 38, 41, 43, 46, 54, 76) • Q2 = (32 + 34)/2 = 33 • Q1 = 22, Q3 = 43 • IQR = 43 − 22 = 21 • Lower fence: Q1 − 1.5·IQR = −9.5 • Upper fence: Q3 + 1.5·IQR = 74.5 • So 76 lies above the upper fence and is an outlier.
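A sketch that computes the five-number summary and the 1.5·IQR outlier fences for the second data set above, flagging 76 as the only outlier:

def median(x):
    m = len(x) // 2
    return x[m] if len(x) % 2 else (x[m - 1] + x[m]) / 2

data = sorted([18, 34, 76, 29, 15, 41, 46, 25, 54, 38, 20, 32, 43, 22])
n = len(data)
q1, q2, q3 = median(data[:n // 2]), median(data), median(data[(n + 1) // 2:])
lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)      # fences: -9.5 and 74.5
print(min(data), q1, q2, q3, max(data))                  # 15 22 33.0 43 76
print([x for x in data if x < lo or x > hi])             # [76]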
  • 138. Variance and Standard Deviation • Dispersion refers to how spread out the observations are. • Variance and standard deviation are measures of data dispersion. • They indicate how spread out a data distribution is. • Variance measures how far each number in the data set is from the mean.
  • 139. • Variance: σ² = (1/N) × [(x1 − mean)² + (x2 − mean)² + … + (xN − mean)²]; the standard deviation σ is the square root of the variance.
  • 142. Standard deviation • Standard deviation is the square root of the variance. • A low standard deviation means that the data observations tend to be very close to the mean. • A high standard deviation indicates that the data are spread out over a large range of values.
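For example, the salary data used earlier gives a population variance of about 379.17 and a standard deviation of about 19.47 (a sketch using Python's statistics module):

import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # in $1000s
print(statistics.pvariance(salaries))   # ≈ 379.17 (population variance)
print(statistics.pstdev(salaries))      # ≈ 19.47  (population std. deviation)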
  • 143. Graphic Displays of Basic Statistical Descriptions of Data • Graphic displays of basic statistical descriptions. • These include quantile plots, quantile– quantile plots, histograms, and scatter plots. • Such graphs are helpful for the visual inspection of data, which is useful for data preprocessing.
  • 145. Proximity measure • How can we tell whether two objects are alike or unalike? • Applications • Clustering • Outlier analysis • Nearest neighbour
  • 146. Types of attribute • Nominal attributes • Ordinal attributes • Binary attributes • Numerical attributes • Mixed attributes
  • 147. Measuring Data Similarity and Dissimilarity • Similarity and dissimilarity measures are referred to as measures of proximity. • Similarity and dissimilarity are related. • A similarity measure for two objects, i and j, will typically return the value 0 if the objects are unalike. • The higher the similarity value, the greater the similarity between objects. (Typically, a value of 1 indicates complete similarity, that is, the objects are identical.)
  • 148. • A dissimilarity measure works the opposite way. • It returns a value of 0 if the objects are the same (and therefore, far from being dissimilar). • The higher the dissimilarity value, the more dissimilar the two objects are.
  • 149. Data Matrix versus Dissimilarity Matrix • Data matrix (or object-by-attribute structure): • This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes).
  • 150. • Suppose that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements or features, such as age, height, weight, or gender). • The objects are x1=(x11,x12,x13,…x1p), • x2=(x21,x22,….x2p) and so on, • where xij is the value for object xi of the jth attribute. • Each row corresponds to an object
  • 152. Dissimilarity matrix (or object-by-object structure): This structure stores a collection of proximities that are available for all pairs of n objects. • It is often represented by an n-by-n table, • where d(i, j) is the measured dissimilarity or “difference” between objects i and j. • In general, d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, • and becomes larger the more they differ.
  • 154. • Note • d(i, i) = 0; that is, the difference between an object and itself is 0. • d(i, j) = d(j, i), so the matrix is symmetric.
  • 155. • Measures of similarity can often be expressed as a function of measures of dissimilarity. • For example, for nominal data, • sim(i, j) = 1 − d(i, j) • where sim(i, j) is the similarity between objects i and j.
  • 157. Proximity Measures for Nominal Attributes • For example, map color is a nominal attribute that may have, say, five states: red, yellow, green, pink, and blue. • Let the number of states of a nominal attribute be M. • The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, … , M.
  • 158. • How is dissimilarity computed between objects described by nominal attributes? • The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
  • d(i, j) = (p − m) / p
  • 160. • where m is the number of matches (i.e., the number of attributes for which i and j are in the same state) • and p is the total number of attributes describing the objects.
  • 161. • Weights can be assigned to increase the effect of m or to assign greater weight to the matches in attributes having a larger number of states.
  • (Example: four objects described by a single nominal attribute, test-1, with the values A, B, C, and A, respectively.)
  • 164. • Here p = 1, the number of nominal attributes, and m is the number of matches. • d(2,1) = (p − m)/p = (1 − 0)/1 = 1 • d(3,1) = (1 − 0)/1 = 1 • d(4,1) = (1 − 1)/1 = 0 • d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ.
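The mismatch ratio is easy to express in code (a sketch; the single-attribute values A, B, C, A follow the example above):

def d_nominal(obj_i, obj_j):
    # d(i, j) = (p - m) / p : fraction of nominal attributes that mismatch.
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p

objects = [["A"], ["B"], ["C"], ["A"]]       # test-1 values for objects 1..4
print(d_nominal(objects[1], objects[0]))     # d(2,1) = 1.0
print(d_nominal(objects[3], objects[0]))     # d(4,1) = 0.0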
  • 171. Ordinal attribute • There are three states for test-2: fair, good, and excellent; that is, Mf = 3. • For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. • Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. • For step 3, we can use, say, the Euclidean distance.
  • (Example: the four objects have test-2 values excellent, fair, good, and excellent, giving ranks 3, 1, 2, and 3.)
  • 173. • 1. Find the number of states of the ordinal attribute (Mf = 3) and replace each value by its rank. • 2. Normalize the ranks. • 3. Find the distance between the objects.
  • 174. • z_if = (r_if − 1) / (Mf − 1) • fair (rank 1): (1 − 1)/(3 − 1) = 0 • good (rank 2): (2 − 1)/(3 − 1) = 0.5 • excellent (rank 3): (3 − 1)/(3 − 1) = 1 • Manhattan distance = |x1 − y1| + |x2 − y2|
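The three steps in code (a sketch; the test-2 values excellent, fair, good, excellent reproduce the ranks 3, 1, 2, 3 above):

ranks = {"fair": 1, "good": 2, "excellent": 3}   # step 1: values -> ranks
M_f = len(ranks)

def z(value):
    # Step 2: map rank r to (r - 1) / (M_f - 1), i.e., onto [0.0, 1.0].
    return (ranks[value] - 1) / (M_f - 1)

test2 = ["excellent", "fair", "good", "excellent"]
zs = [z(v) for v in test2]       # [1.0, 0.0, 0.5, 1.0]
# Step 3: with a single attribute, Euclidean and Manhattan distance coincide:
print(abs(zs[1] - zs[0]))        # d(2,1) = 1.0
print(abs(zs[2] - zs[0]))        # d(3,1) = 0.5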
  • 176. Binary attribute • It can be either symmetric or asymmetric. • A binary attribute is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1 (e.g., gender, with the states male and female). • A binary attribute is asymmetric if the outcomes of the states are not equally important, • such as the positive and negative outcomes of a medical test for HIV.
  • (Contingency table for binary attributes: object i's value down the rows (1, then 0) and object j's value across the columns (1, then 0); the cells count attributes: q (both 1), r (i = 1, j = 0), s (i = 0, j = 1), t (both 0).)
  • 178. Binary attribute • We can compute a similarity matrix or a dissimilarity matrix for both symmetric and asymmetric binary attributes.
  • 179. • q = m11 • r = m10 • s = m01 • t = m00
  • 181. • If all binary attributes are thought of as having the same weight, we have the 2 × 2 contingency table, • where q is the number of attributes that equal 1 for both objects i and j, • r is the number of attributes that equal 1 for object i but equal 0 for object j, • s is the number of attributes that equal 0 for object i but equal 1 for object j, • and t is the number of attributes that equal 0 for both objects i and j. • The total number of attributes is p, where p = q + r + s + t.
  • 183. Similarity • For symmetric binary attributes we use the simple matching coefficient. • Simple matching coefficient (SMC): • sim(i, j) = (m11 + m00)/(m11 + m10 + m01 + m00)
  • 184. Similarity • For asymmetric binary attributes we can use the Jaccard coefficient. • sim(i, j) = q/(q + r + s) = m11/(m11 + m01 + m10)
  • 185. Dissimilarity • For symmetric binary attributes: d(i, j) = (r + s)/(q + r + s + t). • For asymmetric binary attributes, the number of negative matches t is considered unimportant and is ignored: d(i, j) = (r + s)/(q + r + s).
  • 196. Dissimilarity (Symmetric binary) • d(jack, jim) = (0 + 1)/(2 + 0 + 1 + 3) = 1/6 • d(mary, jack) = (1 + 1)/(1 + 2 + 1 + 2) = 2/6 • d(mary, jim) = (2 + 1)/(1 + 2 + 1 + 2) = 3/6
  • 198. Binary (similarity) • 1. Symmetric • Symmetric binary (simple matching coefficient): • sim(i, j) = (m11 + m00)/(m11 + m10 + m01 + m00)
  • 199. • X = 1,0,0,0,0,0,0,0,0,0 • Y = 0,0,0,0,0,0,1,0,0,1 • Here m11 = 0, m10 = 1, m01 = 2, m00 = 7. • SMC(x, y) = (0 + 7)/(0 + 1 + 2 + 7) = 7/10 = 0.7 • The similarity between x and y is 70%.
  • 200. Similarity for asymmetric binary: Jaccard coefficient
  • 202. • A supermarket carries 1000 products. • C1 = {sugar, coffee, tea, rice, egg} • C2 = {sugar, coffee, bread, biscuit} • How similar are customer 1 and customer 2? • m11 = 2 (items present for both customers): {sugar, coffee} • m10 = 3 (items present for customer 1 but not customer 2): {tea, rice, egg} • m01 = 2 (items present for customer 2 but not customer 1): {bread, biscuit}
  • 203. • m00 = items bought by neither customer • = total items − (m11 + m10 + m01) • = 1000 − 7 = 993 • Jaccard coefficient = m11/(m11 + m10 + m01) = 2/(2 + 3 + 2) = 2/7 • SMC = (m11 + m00)/(m11 + m10 + m01 + m00) = (2 + 993)/(2 + 3 + 2 + 993) = 0.995
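The supermarket example in code (a sketch; Python set operations give the contingency counts directly):

c1 = {"sugar", "coffee", "tea", "rice", "egg"}
c2 = {"sugar", "coffee", "bread", "biscuit"}
total_products = 1000

m11 = len(c1 & c2)                        # bought by both: 2
m10 = len(c1 - c2)                        # only customer 1: 3
m01 = len(c2 - c1)                        # only customer 2: 2
m00 = total_products - (m11 + m10 + m01)  # bought by neither: 993

print(m11 / (m11 + m10 + m01))            # Jaccard = 2/7 ≈ 0.286
print((m11 + m00) / total_products)       # SMC = 0.995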
  • 210. Numerical data • Normalize the data so distances fall between 0 and 1: • d(i, j) = |x_i − x_j| / (max − min) • With the values 45, 22, 64, 28 for objects 1–4 (max − min = 64 − 22 = 42): • d(2,1) = |22 − 45|/42 = 23/42 = 0.55 • d(3,1) = |64 − 45|/42 = 19/42 = 0.45 • d(3,2) = |64 − 22|/42 = 42/42 = 1 • d(4,1) = |28 − 45|/42 = 17/42 = 0.40 • d(4,2) = |28 − 22|/42 = 6/42 = 0.14 • d(4,3) = |28 − 64|/42 = 36/42 = 0.86
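The same numbers in code (a sketch; the scores 45, 22, 64, 28 are the ones used above):

scores = {1: 45, 2: 22, 3: 64, 4: 28}
lo, hi = min(scores.values()), max(scores.values())    # 22 and 64

def d(i, j):
    # Range-normalized dissimilarity for one numeric attribute.
    return abs(scores[i] - scores[j]) / (hi - lo)

print(round(d(2, 1), 2), round(d(3, 2), 2), round(d(4, 3), 2))   # 0.55 1.0 0.86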
  • 213. How to put it all together • For objects described by a mixture of attribute types, the dissimilarity is a weighted average over the p attributes: • d(i, j) = [ Σ_{f=1..p} δ_ij(f) · d_ij(f) ] / [ Σ_{f=1..p} δ_ij(f) ]
  • 214. • where the indicator δ_ij(f) = 0 if x_if or x_jf is missing, or if x_if = x_jf = 0 and attribute f is asymmetric binary; otherwise δ_ij(f) = 1. • d_ij(f) is the contribution of attribute f, computed according to its type.
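One way to put the formula into code (a sketch under the definitions above; the kinds list and the example objects are assumptions for illustration):

def mixed_d(xi, xj, kinds):
    # d(i, j) = sum_f delta_f * d_f  /  sum_f delta_f
    num = den = 0.0
    for a, b, kind in zip(xi, xj, kinds):
        if a is None or b is None:            # delta_f = 0: missing value
            continue
        if kind == "asymmetric" and a == 0 and b == 0:
            continue                          # delta_f = 0: joint absence ignored
        if kind in ("nominal", "asymmetric"):
            d_f = 0.0 if a == b else 1.0
        else:                                 # kind = ("numeric", min, max)
            d_f = abs(a - b) / (kind[2] - kind[1])
        num, den = num + d_f, den + 1.0
    return num / den

kinds = ["nominal", "asymmetric", ("numeric", 22, 64)]
print(mixed_d(["A", 1, 45], ["B", 0, 22], kinds))   # (1 + 1 + 0.548) / 3 ≈ 0.85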
  • 217. Cosine Similarity • A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. • Thus, each document is an object represented by what is called a term-frequency vector.
  • 219. • Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values). • Applications using such structures include information retrieval, text document clustering, biological taxonomy, and gene feature mapping.
  • 220. • The traditional distance measures that we have studied in this chapter do not work well for such sparse numeric data.
  • 221. Cosine similarity • Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words.
  • 222. • Let x and y be two vectors for comparison. Using the cosine measure as a similarity function, we have
  • sim(x, y) = (x · y) / (||x|| × ||y||), • where x · y = Σ_i x_i·y_i is the dot product and ||x|| = sqrt(Σ_i x_i²) is the Euclidean norm of vector x.
  • 225. • The measure computes the cosine of the angle between vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. • The closer the cosine value to 1, the smaller the angle and the greater the match between vectors.
  • 226. Cosine similarity and distance • Suppose we have two points p1 and p2. • If distance between p1 and p2 increases the similarity decreases. • If distance between p1 and p2 decreases the similarity increases. • 1- cosine similarity = cosine distance.
  • 227.
  • 228. • The cosine similarity says to find the similarity between two objects. we have to find the angle between them. • Cosine similarity= Cos (theta). • The theta is the angle between object p1 and p2. • The cosine similarity ranging between -1 to 1.
  • 229. • The larger the angle between two objects, the less similar they are. • The smaller the angle, the more similar they are. • cos 0° = 1 means maximum similarity. • cos 90° = 0 means no similarity. • Suppose the angle between two objects p1 and p2 is 45 degrees.
  • 230. • Cosine similarity = cos 45° ≈ 0.707, • i.e., about 70.7% similarity between p1 and p2.
  • 231.
  • 232. • cos 90° = 0: • the vectors are orthogonal, so the similarity is minimal.
  • 233. The angle is 0 (more similar)
  • 234. Example • Cosine similarity between two term-frequency vectors. Suppose that x and y are the first two term-frequency vectors in Table 2.5. That is, x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and • y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar are x and y? To compute the cosine similarity between the two vectors:
  • x · y = 5×3 + 3×2 + 2×1 + 2×1 = 25 • ||x|| = sqrt(5² + 3² + 2² + 2²) = sqrt(42) ≈ 6.48 • ||y|| = sqrt(3² + 2² + 1² + 1² + 1² + 1²) = sqrt(17) ≈ 4.12 • sim(x, y) = 25/(6.48 × 4.12) ≈ 0.94
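The same computation in Python (a direct transcription of the example):

import math

x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(x, y))        # 25
norm_x = math.sqrt(sum(a * a for a in x))     # sqrt(42) ≈ 6.48
norm_y = math.sqrt(sum(b * b for b in y))     # sqrt(17) ≈ 4.12
print(dot / (norm_x * norm_y))                # ≈ 0.94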
  • 244. • Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8): • (a) Compute the Euclidean distance between the two objects. • (b) Compute the Manhattan distance between the two objects. • (c) Compute the Minkowski distance between the two objects, using q = 3.
  • 245. Assignment • (d) Compute the supremum distance between the two objects
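A generic helper covering all four distances (a sketch for checking answers; q = 1 gives Manhattan, q = 2 Euclidean, and the supremum distance is the limit as q grows):

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

x, y = (22, 1, 42, 10), (20, 0, 36, 8)
print(minkowski(x, y, 1))                     # Manhattan: 11
print(minkowski(x, y, 2))                     # Euclidean: ≈ 6.71
print(minkowski(x, y, 3))                     # q = 3:     ≈ 6.15
print(max(abs(a - b) for a, b in zip(x, y)))  # supremum:   6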
  • 246. Data Preprocessing • Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. • Low-quality data will lead to low-quality mining results.
  • 247. Data Preprocessing • There are several data preprocessing techniques. • Data cleaning can be applied to remove noise and correct inconsistencies in data. • Data integration merges data from multiple sources into a coherent data store such as a data warehouse.
  • 248. • Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering. • Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. • This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
  • 249. • Three elements define data quality: accuracy, completeness, and consistency. • Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses.
  • 250. Factors affecting data quality • Timeliness also affects data quality. • If month-end data are not updated in a timely fashion, the delay has a negative impact on the data quality.
  • 251. • Two other factors affecting data quality are believability and interpretability. • Believability reflects how much the data are trusted by users. • Interpretability reflects how easily the data are understood.
  • 252. Example • Suppose that a database, at one point, had several errors, all of which have since been corrected. • The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. • The data also use many accounting codes, which the sales department does not know how to interpret.
  • 253. Data Cleaning • Real-world data tend to be incomplete, noisy, and inconsistent. • Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
  • 254. Missing Values • 1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). • This method is not very effective, unless the tuple contains several attributes with missing values.
  • 255. • 2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values. • 3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞.
  • 256. • 4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value. • 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
  • 257. • 6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
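Strategies 4 and 5 in pandas (a sketch; the column names and toy values are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({"income": [42.0, None, 58.0, None, 61.0],
                   "class": ["low", "low", "high", "high", "high"]})

# Strategy 4: fill with the attribute's overall mean.
df["income_v4"] = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean of samples in the same class.
df["income_v5"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
print(df)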