Data Mining Functionalities—What Kinds of
Patterns Can Be Mined?
Data mining functionalities are used to specify the kind of patterns to be found
in data mining tasks.
In general, data mining tasks can be classified into two categories:
1. Descriptive tasks
2. Predictive tasks
1. Descriptive mining tasks characterize the general properties of the data in the database.
2. Predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination
• Data can be associated with classes or concepts.
For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.
Such descriptions of a class or a concept are called class/concept
descriptions.
These descriptions can be derived via
(1) data characterization, by summarizing the data of the class under
study in general terms, or
(2) data discrimination, by comparison of the target class with one or a
set of comparative classes (often called the contrasting classes), or
(3) both data characterization and discrimination
• Data characterization is a summarization of the general characteristics
or features of a target class of data.
• The data corresponding to the user-specified class are typically
collected by a database query.
• For example, to study the characteristics of software products whose sales increased by 10% in the last year, the data related to such products can be collected by executing an SQL query (a sketch follows this list).
• There are several methods for effective data summarization and characterization, including simple data summaries based on statistical measures and plots.
• The output of data characterization can be presented in various
forms.
• Examples include pie charts, bar charts, curves, multidimensional
data cubes, and multidimensional tables, including crosstabs.
• The resulting descriptions can also be presented as generalized
relations or in rule form (called characteristic rules).
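As a rough illustration of the SQL-based collection step mentioned above, the sketch below queries a hypothetical sales table for software products whose sales grew by at least 10%; the table and column names are assumptions, not the actual AllElectronics schema.

```python
# Minimal sketch: collecting task-relevant data with an SQL query.
# Table and columns (sales, item, item_type, sales_prev, sales_curr) are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    item TEXT, item_type TEXT, sales_prev REAL, sales_curr REAL)""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("OfficeSuite", "software", 100.0, 115.0),
     ("PhotoEditor", "software", 80.0, 82.0),
     ("LaserPrinter", "hardware", 60.0, 75.0)],
)

# Software products whose sales increased by at least 10% over the previous year
rows = conn.execute("""
    SELECT item, 100.0 * (sales_curr - sales_prev) / sales_prev AS pct_increase
    FROM sales
    WHERE item_type = 'software'
      AND sales_curr >= 1.10 * sales_prev
""").fetchall()
print(rows)   # [('OfficeSuite', 15.0)]
```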
Example Data characterization.
• A data mining system should be able to produce a description
summarizing the characteristics of customers who spend more than
$1,000 a year at AllElectronics.
• The result could be a general profile of the customers, such as they
are 40–50 years old, employed, and have excellent credit ratings.
• Data discrimination is a comparison of the general features of target
class data objects with the general features of objects from one or a
set of contrasting classes.
• For example, the user may like to compare the general features of
software products whose sales increased by 10% in the last year with
those whose sales decreased by at least 30% during the same period.
Example 1.5 Data discrimination.
• A data mining system should be able to compare two groups of
AllElectronics customers, such as those who shop for computer
products regularly versus those who rarely shop for such products
(i.e., less than three times a year).
• The resulting description provides a general comparative profile of
the customers, such as 80% of the customers who frequently
purchase computer products are between 20 and 40 years old and
have a university education, whereas 60% of the customers who
infrequently buy such products are either seniors or youths, and have
no university degree.
Mining Frequent Patterns, Associations, and Correlations
• A frequent pattern is a pattern that appears frequently in a data set. By identifying frequent patterns, we can observe strongly correlated items and easily identify the associations and shared characteristics among them.
• Frequent pattern mining also leads to further analysis such as clustering, classification, and other data mining tasks.
There are many kinds of frequent patterns, including itemsets,
subsequences, and substructures.
• A frequent itemset typically refers to a set of items that frequently
appear together in a transactional data set, such as milk and bread.
• A frequently occurring subsequence, such as the pattern that
customers tend to purchase first a PC, followed by a digital camera,
and then a memory card, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms, such as graphs,
trees, or lattices, which may be combined with itemsets or
subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.
• Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
Example 1.6 Association analysis.
• Suppose, as a marketing manager of AllElectronics, you would like to
determine which items are frequently purchased together within the same
transactions.
• An example of such a rule, mined from the AllElectronics transactional
database, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
Association rules that contain a single predicate are referred to as single-
dimensional association rules.
• Suppose, instead, that we are given the AllElectronics relational
database relating to purchases.
• A data mining system may find association rules like
• age(X, “20...29”)∧ income(X, “40K...49K”) ⇒ buys(X, “laptops”)
[support = 2%, confidence = 60%]
• Typically, association rules are discarded as uninteresting if they do
not satisfy both a minimum support threshold and a minimum
confidence threshold.
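As a minimal sketch of how the support and confidence of a rule such as buys(X, “computer”) ⇒ buys(X, “software”) could be computed, the snippet below counts co-occurrences in a handful of invented transactions; it is not mined from any real AllElectronics database.

```python
# Toy support/confidence computation for the rule computer => software.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "scanner"},
    {"software"},
    {"printer", "scanner"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / antecedent  # fraction of computer buyers who also buy software
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 40%, confidence = 67%
```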
Classification and Prediction
• Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts, for the purpose
of being able to use the model to predict the class of objects whose
class label is unknown.
• “How is the derived model presented?” The derived model may be
represented in various forms, such as classification
• (IF-THEN) rules,
• decision trees,
• mathematical formulae,
• or neural networks
• A decision tree is a flow-chart-like tree structure, where each node
denotes a test on an attribute value, each branch represents an
outcome of the test, and tree leaves represent classes or class
distributions.
• Regression analysis is a statistical methodology that is most often
used for numeric prediction, although other methods exist as well.
Prediction also encompasses the identification of distribution trends
based on the available data.
Example 1.7 Classification and prediction.
• Suppose, as sales manager of AllElectronics, you would like to classify
a large set of items in the store, based on three kinds of responses to
a sales campaign: good response, mild response, and no response.
• you would like to predict the amount of revenue that each item will
generate during an upcoming sale at AllElectronics, based on previous
sales data. This is an example of (numeric) prediction because the
model constructed will predict a continuous-valued function, or
ordered value.
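Below is a hedged sketch of the classification half of this example, assuming scikit-learn is available: a small decision tree learns IF-THEN style tests from invented item features (price, ad_spend) and campaign-response labels; none of the values come from the text.

```python
# Toy decision-tree classifier for campaign response (good / mild / no).
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [price, ad_spend]; labels are the response to the sales campaign
X = [[900, 50], [1200, 80], [300, 10], [250, 5], [700, 40], [100, 2]]
y = ["good", "good", "mild", "mild", "good", "no"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["price", "ad_spend"]))  # learned IF-THEN tests
print(tree.predict([[800, 45]]))   # predicted response class for a new item
```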
Cluster Analysis
• Unlike classification, clustering analyzes data objects without consulting known class labels; in fact, clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
• That is, clusters of objects are formed so that objects within a cluster
have high similarity in comparison to one another, but are very
dissimilar to objects in other clusters.
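The sketch below groups toy customer records (age, annual spend) into three clusters with k-means, assuming scikit-learn is available; the data and the choice of three clusters are purely illustrative.

```python
# Toy k-means clustering of customer records.
from sklearn.cluster import KMeans

# Each row: [age, annual_spend]
X = [[22, 300], [25, 350], [24, 320],
     [45, 2500], [48, 2700], [50, 2600],
     [33, 1200], [35, 1100], [31, 1150]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each customer
print(kmeans.cluster_centers_)  # the "profile" (centroid) of each cluster
```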
Outlier Analysis
• A database may contain data objects that do not comply with the
general behavior or model of the data. These data objects are
outliers.
• Example Outlier analysis. Outlier analysis may uncover fraudulent
usage of credit cards by detecting purchases of extremely large
amounts for a given account number in comparison to regular
charges incurred by the same account.
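As a simple illustration of the credit-card example, the snippet below flags charges lying far above the third quartile using the 1.5 × IQR rule; the amounts are invented, and the quartiles are estimated crudely by indexing into the sorted list.

```python
# Flag unusually large purchases for one account with the IQR rule.
charges = [42, 55, 38, 60, 47, 51, 44, 3900, 49, 58]

charges_sorted = sorted(charges)
n = len(charges_sorted)
q1 = charges_sorted[n // 4]          # rough first quartile
q3 = charges_sorted[(3 * n) // 4]    # rough third quartile
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr
outliers = [c for c in charges if c > upper_fence]
print(outliers)   # [3900] -- a candidate fraudulent charge
```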
Evolution Analysis
• Data evolution analysis describes and models regularities or trends
for objects whose behavior changes over time. Although this may
include characterization, discrimination, association and correlation
analysis, classification, prediction, or clustering of time related data,
distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data
analysis.
Example Evolution analysis.
• Suppose that you have the major stock market (time-series) data of
the last several years available from the New York Stock Exchange and
you would like to invest in shares of high-tech industrial companies.
• A data mining study of stock exchange data may identify stock
evolution regularities for overall stocks and for the stocks of particular
companies.
• Such regularities may help predict future trends in stock market
prices, contributing to your decision making regarding stock
investments.
Are All of the Patterns Interesting?
• “What makes a pattern interesting?
• Can a data mining system generate all of the interesting patterns?
• Can a data mining system generate only interesting patterns?”
To answer the first question, a pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
The second question—“Can a data mining system
generate all of the interesting patterns?”—
• refers to the completeness of a data mining algorithm. It is often
unrealistic and inefficient for data mining systems to generate all of
the possible patterns.
Finally the third question
• “Can a data mining system generate only interesting patterns?”
• is an optimization problem in data mining.
• It is highly desirable for data mining systems to generate only
interesting patterns.
Classification of Data Mining Systems
Statistics
• Statistical models are widely used to model data and data classes.
• For example, in data mining tasks like data characterization and
classification, statistical models of target classes can be built.
• For example, we can use statistics to model noise and missing data
values.
• Statistics research develops tools for prediction and forecasting using
data and statistical models. Statistical methods can be used to
summarize or describe a collection of data
• Inferential statistics (or predictive statistics) models data in a way
that accounts for randomness and uncertainty in the observations
and is used to draw inferences about the process or population under
investigation.
• A statistical hypothesis test (sometimes called confirmatory data
analysis) makes statistical decisions using experimental data. A result
is called statistically significant if it is unlikely to have occurred by
chance.
Machine Learning
• Machine learning investigates how computers can learn (or improve
their performance) based on data. A main research area is for
computer programs to automatically learn to recognize complex
patterns and make intelligent decisions based on data.
• Supervised learning
• Unsupervised learning
• Semi supervised learning
• Active learning.
Supervised learning
• Supervised learning is a type of machine learning in which machines are trained using well-“labelled” training data, and on the basis of that data, the machines predict the output. Labelled data means that the input data are already tagged with the correct output.
How Does Supervised Learning Work?
• In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on held-out test data (data not seen during training), and then it predicts the output.
Unsupervised Machine Learning
• Unsupervised learning is a type of machine learning in which models
are trained using unlabeled dataset and are allowed to act on that
data without any supervision.
• Unsupervised learning cannot be directly applied to a
regression or classification problem because unlike supervised
learning, we have the input data but no corresponding output
data. The goal of unsupervised learning is to find the
underlying structure of dataset, group that data according
to similarities, and represent that dataset in a compressed
format.
Semi supervised learning
• Semi-Supervised learning is a type of Machine Learning
algorithm that represents the intermediate ground between
Supervised and Unsupervised learning algorithms. It uses
the combination of labeled and unlabeled datasets during
the training period.
• In one approach, labeled examples are used to learn class models and
unlabeled examples are used to refine the boundaries between
classes.
Active learning
• Active learning is a machine learning approach that lets users play an
active role in the learning process. An active learning approach can
ask a user (e.g., a domain expert) to label an example, which may be
from a set of unlabeled examples
Database Systems and Data Warehouses
• Database systems research focuses on the creation, maintenance, and
use of databases for organizations and end-users. Particularly,
database systems researchers have established highly recognized
principles in data models, query languages, query processing and
optimization methods, data storage, and indexing and accessing
methods. Database systems are often well known for their high
scalability in processing very large, relatively structured data sets.
• A data warehouse integrates data originating from multiple sources
and various timeframes. It consolidates data in multidimensional
space to form partially materialized data cubes. The data cube model
not only facilitates OLAP in multidimensional databases but also
promotes multidimensional data mining.
Information Retrieval
• Information retrieval (IR) is the science of searching for documents or
information in documents. Documents can be text or multimedia, and
may reside on the Web.
• The differences between traditional information retrieval and
database systems are twofold:
• Information retrieval assumes that
• (1) the data under search are unstructured; and
• (2) the queries are formed mainly by keywords, which do not have
complex structures (unlike SQL queries in database systems).
Data Mining Task Primitives
• Each user will have a data mining task in mind, that is, some form of
data analysis that he or she would like to have performed.
• A data mining task can be specified in the form of a data mining
query, which is input to the data mining system.
• A data mining query is defined in terms of data mining task
primitives.
• The set of task relevant data to be mined.
• The kind of knowledge to be mined.
• The background knowledge to be used in the discovery process.
• The interestingness measures and thresholds for pattern evaluation.
• The expected representation for visualizing the discovered patterns.
The data mining primitives specify the following
The set of task-relevant data to be mined:
This specifies the portions of the database or the set of data in which
the user is interested. This includes the database attributes or data
warehouse dimensions of interest (referred to as the relevant
attributes or dimensions).
The kind of knowledge to be mined:
• This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution
analysis.
The background knowledge to be used in the discovery
process:
• This knowledge about the domain to be mined is useful for guiding
the knowledge discovery process and for evaluating the patterns
found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of
abstraction.
The interestingness measures and thresholds for pattern
evaluation:
• They may be used to guide the mining process or, after discovery, to
evaluate the discovered patterns.
• Different kinds of knowledge may have different interestingness
measures.
• For example, interestingness measures for association rules include
support and confidence.
• Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
The expected representation for visualizing the
discovered patterns:
• This refers to the form in which discovered patterns are to be
displayed, which may include rules, tables, charts, graphs, decision
trees, and cubes.
Integration of a Data Mining System with a
Database or Data Warehouse System
• A data mining system is integrated with a database or data warehouse system so that it can perform its tasks effectively. A data mining system operates in an environment that requires it to communicate with other data systems, such as a database system. The possible integration schemes are as follows −
• No coupling
• Loose coupling
• Semi tight coupling
• Tight coupling
No coupling
• No coupling means that a data mining system will not use any function of a database or data warehouse system.
• It may retrieve data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file.
• This scheme has drawbacks. First, a database system offers a great deal of flexibility and efficiency at storing, organizing, accessing, and processing data.
• Without using a database/data warehouse system, a data mining system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
Loose coupling
• In loose coupling, the data mining system uses some services of a database or data warehouse system.
• The data are fetched from a data repository handled by these systems.
• Data mining approaches are used to process the data, and the processed data are then saved either in a file or in a designated place in a database or data warehouse.
• Loose coupling is better than no coupling, as it can fetch any portion of the data stored in databases by using query processing, indexing, and other system facilities.
• However, many loosely coupled mining systems are main-memory based.
• It is therefore difficult for them to achieve high scalability and good performance with large data sets.
Semi-tight coupling
• In semi-tight coupling, efficient implementations of a few essential data mining primitives can be provided in the database/data warehouse system.
• These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures, such as sum, count, max, min, and standard deviation.
Tight coupling
• Tight coupling means that a data mining system is smoothly integrated into the database/data warehouse system.
• The data mining subsystem is treated as one functional component of an information system.
• Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of the database/data warehouse system.
• Tight coupling is highly desirable because it facilitates the efficient implementation of data mining functions, high system performance, and an integrated data processing environment.
Major Issues in Data Mining
• Data mining is not an easy task: the algorithms used can become very complex, and data are not always available in one place; they need to be integrated from various heterogeneous data sources. These factors also create some issues. The major issues fall into the following groups −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
• Mining Methodology and User Interaction Issues
• It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns, providing
and refining data mining requests based on the returned results.
• Incorporation of background knowledge − To guide discovery
process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at
multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered should be interesting; many discovered patterns are uninteresting because they represent common knowledge or lack novelty.
Performance Issues
• There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the entire data again from scratch.
Diverse Data Types Issues
• Handling of relational and complex types of data − A database may contain complex data objects, multimedia data objects, spatial data, temporal data, and so on. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data are available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
Data Preprocessing
• Today’s real-world databases are highly susceptible to noisy, missing,
and inconsistent data due to their typically huge size (often several
gigabytes or more) and their likely origin from multiple, heterogenous
sources.
• Low-quality data will lead to low-quality mining results.
• “How can the data be preprocessed in order to help improve the
quality of the data and, consequently, of the mining results?
• “How can the data be preprocessed so as to improve the efficiency
and ease of the mining process?”
Data preprocessing techniques.
• Data cleaning:Data cleaning can be applied to remove noise and correct
inconsistencies in the data.
• Data Integration:Data integration merges data from multiple sources
into a coherent data store, such as a data warehouse.
• Data transformation:Data transformations, such as normalization, may
be applied. For example, normalization may improve the accuracy and
efficiency of mining algorithms involving distance measurements.
• Data reduction: can reduce the data size by aggregating, eliminating
redundant features, or clustering, for instance.
Why Preprocess the Data?
• Data preprocessing is essential before the data are actually used. Data preprocessing is the process of transforming raw data into a clean data set.
• The dataset is preprocessed in order to handle missing values, noisy data, and other inconsistencies before it is fed to the mining algorithm.
Sources of noisy and inconsistent data
• The data collection instruments may be faulty.
• There may be human errors during data entry.
• There may be inconsistencies in the naming conventions or data codes used, or inconsistent formats for input fields such as dates.
• Duplicate tuples also require data cleaning.
Descriptive Data Summarization
• Descriptive data summarization techniques can be used to identify
the typical properties of your data and highlight which data values
should be treated as noise or outlier.
• For many data preprocessing tasks, users would like to learn about
data characteristics regarding both central tendency and dispersion
of the data.
• Measures of central tendency include mean, median, mode, and
midrange, while measures of data dispersion include quartiles,
interquartile range (IQR), and variance.
Measuring the Central Tendency
• There are many ways to measure the central tendency of data.
• The most common and most effective numerical measure of the
“center” of a set of data is the (arithmetic) mean.
• Let x1, x2, ..., xN be a set of N values or observations, such as for some attribute, like salary. The mean of this set of values is
x̄ = (x1 + x2 + ··· + xN) / N
• This corresponds to the built-in aggregate function average (avg() in SQL) provided in relational database systems.
• Distributive measure: A distributive measure is a measure (i.e.,
function) that can be computed for a given data set by partitioning
the data into smaller subsets, computing the measure for each
subset, and then merging the results in order to arrive at the
measure’s value for the original (entire) data set.
• Both sum() and count() are distributive measures because they can be
computed in this manner. Other examples include max() and min().
• Algebraic measure:An algebraic measure is a measure that can be
computed by applying an algebraic function to one or more
distributive measures. Hence, average (or mean()) is an algebraic
measure because it can be computed by sum()/count().
• Each value xi in a set may be associated with a weight wi, for i = 1, ..., N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute
x̄ = (w1x1 + w2x2 + ··· + wNxN) / (w1 + w2 + ··· + wN)
This is called the weighted arithmetic mean or the weighted average.
Drawbacks of mean
• A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values. Even a small number of extreme values can corrupt
the mean.
• For example, the mean salary at a company may be substantially
pushed up by that of a few highly paid managers. Similarly, the
average score of a class in an exam could be pulled down quite a bit
by a few very low scores.
Trimmed mean
• we can instead use the trimmed mean, which is the
mean obtained after chopping off values at the high
and low extremes.
• For example, we can sort the values observed for
salary and remove the top and bottom 2% before
computing the mean.
• We should avoid trimming too large a portion (such as
20%) at both ends as this can result in the loss of
valuable information.
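A small sketch of how one outlier distorts the mean and how a trimmed mean resists it; the salary figures are invented, and 10% is trimmed from each end here (rather than 2%) only because the toy list is so short.

```python
# Mean vs. trimmed mean on salary-like data (in thousands) with one outlier.
import statistics

salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 2000]  # 2000 is an extreme value

def trimmed_mean(values, proportion):
    """Drop `proportion` of the values at each end, then average the rest."""
    values = sorted(values)
    k = int(len(values) * proportion)
    trimmed = values[k:len(values) - k] if k > 0 else values
    return statistics.mean(trimmed)

print(statistics.mean(salaries))      # 234: badly distorted by the outlier
print(trimmed_mean(salaries, 0.10))   # 38.75: closer to a typical salary
```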
Median
• It is a better measure to find the center of data
• Suppose that a given data set of N distinct values is sorted in
numerical order.
• If N is odd, then the median is the middle value of the ordered set;
otherwise (i.e., if N is even), the median is the average of the middle
two values.
• Assume that data are grouped in intervals according to their xi data
values and that the frequency (i.e., number of data values) of each
interval is known.
• For example, people may be grouped according to their annual salary
in intervals such as 10–20K, 20–30K, and so on.
• Let the interval that contains the median frequency be the median interval. We can approximate the median of the entire data set (e.g., the median salary) by interpolation using the formula:
median ≈ L1 + ((N/2 − (∑ freq)l) / freqmedian) × width
• where L1 is the lower boundary of the median interval,
• N is the number of values in the entire data set,
• (∑ freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval,
• freqmedian is the frequency of the median interval,
• and width is the width of the median interval.
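The function below is a direct transcription of the interpolation formula just given; the salary intervals and frequencies are invented for illustration.

```python
# Approximate median of grouped data via interpolation.
def grouped_median(intervals):
    """intervals: list of (lower_bound, width, frequency), in ascending order."""
    N = sum(freq for _, _, freq in intervals)
    cumulative = 0
    for lower, width, freq in intervals:
        if cumulative + freq >= N / 2:          # this is the median interval
            return lower + ((N / 2 - cumulative) / freq) * width
        cumulative += freq

# Salary groups (in $1,000s): 10-20K, 20-30K, 30-40K, 40-50K
salary_groups = [(10, 10, 200), (20, 10, 450), (30, 10, 300), (40, 10, 50)]
print(grouped_median(salary_groups))   # approximately 26.7, i.e., ~$26,700
```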
Mode
• The mode for a set of data is the value that occurs most frequently in the set.
• Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal.
• For example, the mode of the data set in the given set of data:
• 2, 4, 5, 5, 6, 7 is 5 because it appears twice in the collection.
• In general, a data set with two or more modes is multimodal.
• At the other extreme, if each data value occurs only once, then there is no
mode.
Measuring the Dispersion of Data
• The degree to which numerical data tend to spread is called the
dispersion, or variance of the data.
• The most common measures of data dispersion are range, the five-
number summary (based on quartiles), the interquartile range, and
the standard deviation.
Boxplots can be plotted based on the five-number summary and are a
useful tool for identifying outliers.
Range, Quartiles, Outliers, and Boxplots
• Let x1,x2,...,xN be a set of observations for some attribute. The range
of the set is the difference between the largest (max()) and smallest
(min()) values.
• let’s assume that the data are sorted in increasing numerical order.
• The kth percentile of a set of data in numerical order is the value xi
having the property that k percent of the data entries lie at or below
xi .
• The most commonly used percentiles other than the median are
quartiles.
• The first quartile, denoted by Q1, is the 25th percentile; the third
quartile, denoted by Q3, is the 75th percentile.
IQR (Interquartile Range)
• The distance between the first and third quartiles is the interquartile range:
IQR = Q3 − Q1.
The five-number summary of a distribution consists of the median, the
quartiles Q1 and Q3, and the smallest and largest individual
observations, written in the order Minimum, Q1, Median, Q3,
Maximum.
Boxplots
• Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows: the ends of the box are at the quartiles, so the box length is the interquartile range; the median is marked by a line within the box; and two lines (whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
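A short sketch computing the five-number summary that a boxplot displays, assuming NumPy is available; the data values are arbitrary.

```python
# Five-number summary and IQR with NumPy percentiles.
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42, 7, 19, 25, 30, 11])

minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
iqr = q3 - q1
print(minimum, q1, median, q3, maximum)   # Minimum, Q1, Median, Q3, Maximum
print("IQR =", iqr)
```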
Variance and Standard Deviation
• The variance of N observations x1, x2, ..., xN is
σ² = (1/N) ∑ (xi − x̄)²
where x̄ is the mean value of the observations.
• The standard deviation, σ, of the observations is the square root of the variance.
Graphic Displays of Basic Descriptive Data Summaries
• Plotting histograms, or frequency histograms, is a graphical method for
summarizing the distribution of a given attribute.
• A histogram for an attribute A partitions the data distribution of A into
disjoint subsets, or buckets. Typically, the width of each bucket is
uniform.
The scatter plot is a useful method for providing a first look at
bivariate data to see clusters of points and outliers, or to explore the
possibility of correlation relations
Data Cleaning
•Real-world data tend to be incomplete, noisy,
and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing
values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
Missing Values
• Imagine that you need to analyze All Electronics sales and customer
data. You note that many tuples have no recorded value for several
attributes, such as customer income.
• How can you go about filling in the missing values for this attribute?
• Ignore the tuple:This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not
very effective, unless the tuple contains several attributes with
missing values.
• Fill in the missing value manually: In general, this approach is time-
consuming and may not be feasible given a large data set with many
missing values.
• Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” the mining program may mistakenly think that they form an interesting concept, since they all share the same value.
• Use the attribute mean to fill in the missing value:For example,
suppose that the average income of All Electronics customers is
$56,000. Use this value to replace the missing value for income.
• Use the attribute mean for all samples belonging to the same class as
the given tuple:
• For example, if classifying customers according to credit risk, replace
the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
• Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
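The snippet below sketches two of the strategies above with pandas (assumed available): filling missing income with the overall attribute mean, and with the mean of the tuple's credit-risk class; the column names and values are hypothetical.

```python
# Filling missing income values: global mean vs. class-conditional mean.
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000, None, 31000, None, 60000],
})

overall_mean = df["income"].fillna(df["income"].mean())        # global attribute mean
class_mean = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))                               # mean within same class
print(overall_mean.tolist())
print(class_mean.tolist())
```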
Noisy Data
• “What is noise?”
• Noise is a random error or variance in a measured variable.
• Given a numerical attribute such as, say, price, how can we “smooth”
out the data to remove the noise?
• Let’s look at the following data smoothing techniques:
1. Binning
• Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. For example, in smoothing by bin means, each value in a bin is replaced by the mean value of the bin (see the sketch after this list).
2. Regression
• Data can be smoothed by fitting the data to a
function, such as with regression. Linear regression
involves finding the “best” line to fit two attributes (or
variables), so that one attribute can be used to predict
the other.
• Multiple linear regression is an extension of linear
regression, where more than two attributes are
involved and the data are fit to a multidimensional
surface.
3. Clustering:
Outliers may be detected by clustering, where
similar values are organized into groups, or
“clusters.”
Intuitively, values that fall outside of the set of
clusters may be considered outliers
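As a minimal sketch of smoothing by bin means (the binning technique listed above), the snippet below partitions sorted toy prices into equal-frequency bins and replaces each value by its bin mean.

```python
# Smoothing by bin means with equal-frequency bins (toy prices).
prices = [4, 8, 9, 15, 21, 21, 24, 26, 27, 30, 30, 33]   # already sorted

def smooth_by_bin_means(values, n_bins):
    size = len(values) // n_bins
    smoothed = []
    for i in range(0, len(values), size):
        bin_vals = values[i:i + size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([round(mean, 1)] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 9.0, 23.0, 23.0, 23.0, 23.0, 30.0, 30.0, 30.0, 30.0]
```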
Data Integration
• Data mining often requires data integration—the merging of data
from multiple data stores.
• It is likely that your data analysis task will involve data integration,
which combines data from multiple sources into a coherent data
store, as in data warehousing.
• These sources may include multiple databases, data cubes, or flat
files.
Issues during Data Integration
1.Entity identification problem:How can equivalent real-world entities
from multiple data sources be matched up?
• For example, how can the data analyst or the computer be sure that
customer id in one database and cust number in another refer to the
same attribute?
• Examples of metadata for each attribute include the name, meaning,
data type, and range of values permitted for the attribute, and null
rules for handling blank, zero, or null values.
• Such metadata can be used to help avoid errors in schema
integration.
2.Redundancy
•Redundancy is another important issue. An
attribute (such as annual revenue, for instance)
may be redundant if it can be “derived” from
another attribute or set of attributes.
• Inconsistencies in attribute or dimension
naming can also cause redundancies in the
resulting data set.
Correlation Analysis
• Given two attributes, such analysis can measure how strongly one
attribute implies the other, based on the available data.
• For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson’s product moment coefficient, named after its inventor, Karl Pearson):
rA,B = Σ(ai − Ā)(bi − B̄) / (N σA σB) = (Σ(ai bi) − N Ā B̄) / (N σA σB)
• where N is the number of tuples,
• ai and bi are the respective values of A and B in tuple i,
• Ā and B̄ are the respective mean values of A and B,
• σA and σB are the respective standard deviations of A and B,
• and Σ(ai bi) is the sum of the AB cross-product (that is, for each tuple, the value for A is multiplied by the value for B in that tuple).
• Note that −1 ≤ rA,B ≤ +1.
• If rA,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
• If the resulting value is less than 0, then A and B are negatively correlated,
where the values of one attribute increase as the values of the other attribute
decrease.
• For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test.
• Suppose A has c distinct values, namely a1,a2,...ac. B has r distinct
values, namely b1,b2,...br .
• The data tuples described by A and B can be shown as a contingency
table, with the c values of A making up the columns and the r values
of B making up the rows.
• Let (Ai ,Bj) denote the event that attribute A takes on value ai and
attribute B takes on value bj , that is, where (A = ai ,B = bj). Each and
every possible (Ai ,Bj) joint event has its own cell (or slot) in the table.
The χ² value (also known as the Pearson χ² statistic) is computed as:
χ² = Σi Σj (oij − eij)² / eij
• where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as
eij = (count(A = ai) × count(B = bj)) / N
where N is the number of data tuples,
count(A = ai) is the number of tuples having value ai for A,
and count(B = bj) is the number of tuples having value bj for B.
• Correlation analysis of categorical attributes using χ². Suppose that a group of 1,500 people was surveyed.
• The gender of each person was noted. Each person was polled as to
whether their preferred type of reading material was fiction or
nonfiction.
• Thus, we have two attributes, gender and preferred reading.
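The sketch below runs both measures with SciPy (assumed available): Pearson's correlation coefficient on two small numeric attributes, and a χ² test on an invented 2×2 gender vs. preferred-reading contingency table; the counts are not the survey figures referred to above.

```python
# Pearson correlation for numeric attributes and chi-square for categorical ones.
from scipy.stats import pearsonr, chi2_contingency

# Numeric attributes A and B
a = [2, 4, 6, 8, 10]
b = [1, 3, 7, 9, 12]
r, p_value = pearsonr(a, b)
print(f"r_A,B = {r:.3f}")          # close to +1: strong positive correlation

# Contingency table: rows = gender, columns = {fiction, nonfiction} (toy counts)
observed = [[250, 200],
            [200, 350]]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.1f}, p = {p:.4f}")
```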
Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
• Smoothing, which works to remove noise from the data. Such techniques
include binning, regression, and clustering.
• Aggregation, where summary or aggregation operations are applied to the
data.
• For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing a
data cube for analysis of the data at multiple granularities.
• Generalization of the data, where low-level or “primitive” (raw) data are
replaced by higher-level concepts through the use of concept hierarchies.
• For example, categorical attributes, like street, can be generalized to higher-
level concepts, like city or country.
• Similarly, values for numerical attributes, like age, may be mapped to higher-
level concepts, like youth, middle-aged, and senior.
• Normalization, where the attribute data are scaled so as to fall within
a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
• Attribute construction (or feature construction), where new attributes
are constructed and added from the given set of attributes to help
the mining process.
• Min-max normalization.
• z-score normalization.
• Normalization by decimal scaling.
Min-max normalization
• Min-max normalization performs a linear transformation on the original data. Suppose minA and maxA are the minimum and maximum values of attribute A. Min-max normalization maps a value, v, of A to v′ in a new range [new_minA, new_maxA] by computing
v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Z-score normalization
• This method normalizes the value for attribute A using the mean and standard deviation. The following formula is used for z-score normalization:
v′ = (v − Ā) / σA
where Ā and σA are the mean and standard deviation of attribute A.
Normalization by decimal scaling
• Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A.
• A value, v, of A is normalized to v′ by computing
v′ = v / 10^j
where j is the smallest integer such that max(|v′|) < 1.
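Below is a compact sketch of all three formulas applied to a few arbitrary values, using only the standard library; j = 4 for decimal scaling because the largest absolute value is 1000.

```python
# Min-max, z-score, and decimal-scaling normalization on toy values.
import math

values = [200, 300, 400, 600, 1000]
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v):
    return (v - mean) / std

def decimal_scaling(v, j):
    return v / (10 ** j)     # j chosen so that max(|v'|) < 1; here j = 4

print([round(min_max(v, min(values), max(values)), 3) for v in values])
print([round(z_score(v), 3) for v in values])
print([decimal_scaling(v, 4) for v in values])
```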
Data Reduction
• Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume.
• Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the
data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce
the data set size.
• Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models or
nonparametric methods such as clustering, sampling, and the use of
histograms.
• Discretization and concept hierarchy generation where raw data values for
attributes are replaced by ranges or higher conceptual levels.
• Data discretization is a form of numerosity reduction that is very useful for the
automatic generation of concept hierarchies. Discretization and concept
hierarchy generation are powerful tools for data mining, in that they allow the
mining of data at multiple levels of abstraction.
Data aggregation
• This technique is used to aggregate data in a simpler form. For
example, imagine that information you gathered for your
analysis for the years 2012 to 2014, that data includes the
revenue of your company every three months.
• They involve you in the annual sales, rather than the quarterly
average, So we can summarize the data in such a way that the
resulting data summarizes the total sales per year instead
of per quarter. It summarizes the data.
Attribute Subset Selection
• Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the
data classes is as close as possible to the original distribution
obtained using all attributes.
• Mining on a reduced set of attributes has an additional benefit. It
reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
Basic heuristic methods of attribute subset selection include the
following techniques
• Stepwise forward selection: The procedure starts with an empty set
of attributes as the reduced set. The best of the original attributes is
determined and added to the reduced set. At each subsequent
iteration or step, the best of the remaining original attributes is added
to the set.
• Stepwise backward elimination: The procedure starts with the full set
of attributes.
• At each step, it removes the worst attribute remaining in the set.
• Combination of forward selection and backward elimination: The
stepwise forward selection and backward elimination methods can be
combined so that, at each step, the procedure selects the best
attribute and removes the worst from among the remaining
attributes.
• Decision tree induction: Decision tree algorithms, such as ID3 (Iterative Dichotomiser), C4.5, and CART (Classification and Regression Trees), were originally intended for classification.
• Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
• The set of attributes appearing in the tree form the reduced subset of
attributes.
Dimensionality Reduction
•In dimensionality reduction, data encoding or
transformations are applied so as to obtain a reduced
or “compressed” representation of the original data.
•If the original data can be reconstructed from the
compressed data without any loss of information, the
data reduction is called lossless.
• If, instead, we can reconstruct only an approximation
of the original data, then the data reduction is called
lossy.
Numerosity Reduction
• “Can we reduce the data volume by choosing alternative,
‘smaller’ forms of data representation?” These techniques
may be parametric or non parametric.
• For parametric methods, a model is used to estimate the
data, so that typically only the data parameters need to be
stored, instead of the actual data. (Outliers may also be
stored.)
• Log-linear models, which estimate discrete multidimensional
probability distributions, are an example.
• Nonparametric methods for storing reduced representations
of the data include histograms, clustering, and sampling.
Two effective methods of lossy dimensionality reduction:
1. Wavelet transforms
2. Principal components analysis
Wavelet transforms
• The discrete wavelet transform (DWT) is a signal processing
technique that transforms linear signals.
• The wavelet transform can present a signal with a good time
resolution or a good frequency resolution. There are two types
of wavelet transforms: the continuous wavelet transform
(CWT) and the discrete wavelet transform (DWT).
• The data vector X is transformed into a numerically different vector, X′, of wavelet coefficients when the DWT is applied. The two vectors X and X′ are of the same length.
• When applying this technique to data reduction, we consider each n-dimensional data tuple, that is, X = (x1, x2, …, xn), where n is the number of attributes present in the relation of the data set.
Discrete Wavelet Transform
• What’s a wavelet?
• A wavelet is a wave-like oscillation that is localized in time. Wavelets have two basic properties: scale and location.
• Scale (or dilation) defines how “stretched” or “squished” a
wavelet is. This property is related to frequency as defined for
waves.
• Location defines where the wavelet is positioned in time (or
space)
• Wavelet transforms can be applied to multidimensional data,
such as a data cube.
• The computational complexity involved is linear with respect
to the number of cells in the cube.
• Wavelet transforms give good results on sparse or skewed
data and on data with ordered attributes.
• Wavelet transforms have many real-world applications,
including the compression of fingerprint images, computer
vision, analysis of time-series data, and data cleaning.
Principal Components Analysis
• principal components analysis is one method for dimensionality
reduction. PCA is a method used to reduce number of
variables in your data by extracting important one from a
large pool.
• It reduces the dimension of your data with the aim of retaining
as much information as possible..
• Suppose that the data to be reduced consist of tuples or data vectors
described by n attributes or dimensions.
• Principal components analysis searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
• PCA works by considering the variance of each attribute: an attribute with high variance gives a good split between the classes, so projecting the data onto the directions of highest variance reduces the dimensionality while preserving that separation. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
• It is a feature extraction technique, so it contains the important
variables and drops the least important variable.
• The PCA algorithm is based on some mathematical concepts
such as:
• Variance and Covariance
• Eigenvalues and Eigenvectors
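A rough sketch of PCA for dimensionality reduction, assuming scikit-learn is available; the 4-dimensional tuples are toy data, and k = 2 components are kept.

```python
# Reduce 4-dimensional toy tuples to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.1, 0.3],
              [2.2, 2.9, 0.4, 1.1],
              [1.9, 2.2, 0.6, 0.9],
              [3.1, 3.0, 0.2, 1.3]])

pca = PCA(n_components=2)           # keep k = 2 of the n = 4 dimensions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (5, 2): the reduced representation
print(pca.explained_variance_ratio_)     # variance captured by each component
```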
Applications of Principal Component Analysis
• PCA is mainly used as the dimensionality
reduction technique in various AI applications
such as computer vision, image compression,
etc.
• It can also be used for finding hidden patterns if
data has high dimensions. Some fields where
PCA is used are Finance, data mining,
Psychology, etc.
Regression and Log-Linear Models
• Regression and log-linear models can be used to approximate the
given data. In (simple) linear regression, the data are modeled to fit a
straight line.
• For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation
y = wx + b
where y is the response variable, x is the predictor variable, and the regression coefficients w and b specify the slope of the line and the y-intercept, respectively.
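A tiny sketch of fitting the coefficients w and b by least squares with NumPy (assumed available); x and y are invented points lying near the line y = 2x.

```python
# Fit y = wx + b by least squares and predict a new value.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

w, b = np.polyfit(x, y, deg=1)    # slope and y-intercept
print(f"y = {w:.2f}x + {b:.2f}")  # roughly y = 2x + 0
print(w * 6 + b)                  # predicted y for a new x = 6
```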
Histograms
• Histograms use binning to approximate data distributions and are a
popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, or buckets. If each bucket represents only a
single attribute-value/frequency pair, the buckets are called singleton
buckets. Often, buckets instead represent continuous ranges for the
given attribute.
• Histograms. The following data are a list of prices of commonly sold
items at All Electronics (rounded to the nearest dollar). The numbers
have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
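The snippet below builds an equal-width histogram (bucket width 10) for this price list using only the standard library, so the buckets cover the ranges 1-10, 11-20, and 21-30.

```python
# Equal-width histogram (width 10) of the sorted price list above.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20,
          20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
buckets = Counter((p - 1) // width for p in prices)   # bucket 0 = 1-10, 1 = 11-20, ...
for b in sorted(buckets):
    lo, hi = b * width + 1, (b + 1) * width
    print(f"{lo:>2}-{hi:<3} | {'*' * buckets[b]} ({buckets[b]})")
```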
Sampling
• Sampling can be used as a data reduction technique because it allows
a large data set to be represented by a much smaller random sample
(or subset) of the data.
• Suppose that a large data set, D, contains N tuples. Let’s look at the
most common ways that we could sample D for data reduction.
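As a minimal sketch of the two simplest schemes, the snippet below draws a simple random sample without replacement (SRSWOR) and with replacement (SRSWR) from a pretend set of 100 tuples.

```python
# Simple random sampling for data reduction (toy tuple ids).
import random

random.seed(0)
D = list(range(1, 101))   # pretend these are N = 100 tuples
s = 10                    # desired sample size

srswor = random.sample(D, s)                   # each tuple drawn at most once
srswr = [random.choice(D) for _ in range(s)]   # tuples may be drawn more than once
print(srswor)
print(srswr)
```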
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 

unit 1.pptx

• 9. Mining Frequent Patterns, Associations, and Correlations • A frequent pattern is a pattern that appears frequently in a data set. By identifying frequent patterns we can observe strongly correlated items and easily identify similar characteristics and associations among them. • Frequent pattern mining often leads to further analysis such as clustering, classification, and other data mining tasks.
  • 10. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures. • A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread. • A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. • A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. • Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
  • 11. Example 1.6 Association analysis. • Suppose, as a marketing manager of AllElectronics, you would like to determine which items are frequently purchased together within the same transactions. • An example of such a rule, mined from the AllElectronics transactional database, is buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%,confidence = 50%] Association rules that contain a single predicate are referred to as single- dimensional association rules.
  • 12. • Suppose, instead, that we are given the AllElectronics relational database relating to purchases. • A data mining system may find association rules like • age(X, “20...29”)∧ income(X, “40K...49K”) ⇒ buys(X, “laptops”) [support = 2%, confidence = 60%] • Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
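To make support and confidence concrete, here is a minimal Python sketch that counts them over a small hypothetical transaction list; the transactions, item names, and helper functions are illustrative, not taken from the AllElectronics data.

# Minimal sketch: computing support and confidence for an association rule
# over a hypothetical list of transactions (illustrative data only).
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "memory card"},
    {"printer", "software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: buys(computer) => buys(software)
print(support({"computer", "software"}, transactions))      # 0.5 with this toy data
print(confidence({"computer"}, {"software"}, transactions))  # about 0.67 with this toy data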
  • 13. Classification and Prediction • Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. • “How is the derived model presented?” The derived model may be represented in various forms, such as classification • (IF-THEN) rules, • decision trees, • mathematical formulae, • or neural networks
  • 14. • A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
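As a rough illustration of the decision-tree form of a classifier, the sketch below fits a small tree with scikit-learn (assumed to be installed) on hand-made, numerically encoded data; the features, labels, and parameter choices are illustrative only.

# Sketch: fitting a decision tree classifier on tiny, hand-encoded data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income_level] encoded as numbers; label: buys_computer (0/1).
X = [[25, 1], [35, 3], [45, 2], [22, 1], [50, 3], [30, 2]]
y = [0, 1, 1, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income_level"]))  # IF-THEN style view of the tree
print(tree.predict([[28, 3]]))  # predicted class for a new, unlabeled object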
  • 15.
  • 16. • Regression analysis is a statistical methodology that is most often used for numeric prediction, although other methods exist as well. Prediction also encompasses the identification of distribution trends based on the available data.
  • 17. Example 1.7 Classification and prediction. • Suppose, as sales manager of AllElectronics, you would like to classify a large set of items in the store, based on three kinds of responses to a sales campaign: good response, mild response, and no response. • you would like to predict the amount of revenue that each item will generate during an upcoming sale at AllElectronics, based on previous sales data. This is an example of (numeric) prediction because the model constructed will predict a continuous-valued function, or ordered value.
  • 19. Cluster Analysis • Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label; in many cases, class labels are not present in the data to begin with. • Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. • That is, clusters of objects are formed so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.
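A brief sketch of clustering in practice, assuming scikit-learn is available: k-means is used here as one common clustering algorithm, and the 2-D points are made up for illustration.

# Sketch: grouping unlabeled points with k-means (illustrative data only).
from sklearn.cluster import KMeans

points = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],   # one natural group
          [8.0, 8.5], [8.2, 7.9], [7.7, 8.3]]   # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster label generated for each object
print(km.cluster_centers_)  # representative center of each cluster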
  • 20. Outlier Analysis • A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. • Example Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account.
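A minimal sketch of the outlier idea above, assuming a simple two-standard-deviation rule on hypothetical charge amounts; real fraud detection would use more robust methods.

# Sketch: flagging unusually large charges for one account with a simple
# mean/standard-deviation rule (illustrative amounts and threshold).
import statistics

charges = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 2400.0]  # last value looks suspicious

mean = statistics.mean(charges)
stdev = statistics.stdev(charges)

outliers = [c for c in charges if abs(c - mean) > 2 * stdev]
print(outliers)   # only the 2,400 charge exceeds the threshold here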
  • 21. Evolution Analysis • Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
  • 22. Example Evolution analysis. • Suppose that you have the major stock market (time-series) data of the last several years available from the New York Stock Exchange and you would like to invest in shares of high-tech industrial companies. • A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. • Such regularities may help predict future trends in stock market prices, contributing to your decision making regarding stock investments.
  • 23. Are All of the Patterns Interesting? • “What makes a pattern interesting? • Can a data mining system generate all of the interesting patterns? • Can a data mining system generate only interesting patterns?”
  • 24. To answer the first question, a pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel.
  • 25. The second question—“Can a data mining system generate all of the interesting patterns?”— • refers to the completeness of a data mining algorithm. It is often unrealistic and inefficient for data mining systems to generate all of the possible patterns.
  • 26. Finally the third question • “Can a data mining system generate only interesting patterns?” • is an optimization problem in data mining. • It is highly desirable for data mining systems to generate only interesting patterns.
  • 27. Classification of Data Mining Systems
  • 29. Statistics • Statistical models are widely used to model data and data classes. • For example, in data mining tasks like data characterization and classification, statistical models of target classes can be built. • For example, we can use statistics to model noise and missing data values. • Statistics research develops tools for prediction and forecasting using data and statistical models. Statistical methods can be used to summarize or describe a collection of data
  • 30. • Inferential statistics (or predictive statistics) models data in a way that accounts for randomness and uncertainty in the observations and is used to draw inferences about the process or population under investigation. • A statistical hypothesis test (sometimes called confirmatory data analysis) makes statistical decisions using experimental data. A result is called statistically significant if it is unlikely to have occurred by chance.
  • 31. Machine Learning • Machine learning investigates how computers can learn (or improve their performance) based on data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data. • Supervised learning • Unsupervised learning • Semi supervised learning • Active learning.
  • 32. Supervised learning • Supervised learning is the type of machine learning in which machines are trained using well-labelled training data, and on the basis of that data the machine predicts the output. Labelled data means that the input data are already tagged with the correct output. How supervised learning works • In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is evaluated on held-out test data and then predicts the output.
  • 34. Unsupervised Machine Learning • Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision. • Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
  • 36. Semi supervised learning • Semi-Supervised learning is a type of Machine Learning algorithm that represents the intermediate ground between Supervised and Unsupervised learning algorithms. It uses the combination of labeled and unlabeled datasets during the training period. • In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes.
  • 38. Active learning • Active learning is a machine learning approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples
  • 39. Database Systems and Data Warehouses • Database systems research focuses on the creation, maintenance, and use of databases for organizations and end-users. Particularly, database systems researchers have established highly recognized principles in data models, query languages, query processing and optimization methods, data storage, and indexing and accessing methods. Database systems are often well known for their high scalability in processing very large, relatively structured data sets. • A data warehouse integrates data originating from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.
  • 40. Information Retrieval • Information retrieval (IR) is the science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the Web. • The differences between traditional information retrieval and database systems are twofold: • Information retrieval assumes that • (1) the data under search are unstructured; and • (2) the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems).
  • 41. Data Mining Task Primitives • Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed. • A data mining task can be specified in the form of a data mining query, which is input to the data mining system. • A data mining query is defined in terms of data mining task primitives: • The set of task-relevant data to be mined. • The kind of knowledge to be mined. • The background knowledge to be used in the discovery process. • The interestingness measures and thresholds for pattern evaluation. • The expected representation for visualizing the discovered patterns.
  • 42. The data mining primitives specify the following The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or dimensions).
  • 43. The kind of knowledge to be mined: • This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
  • 44. The background knowledge to be used in the discovery process: • This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction.
  • 46. The interestingness measures and thresholds for pattern evaluation: • They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. • Different kinds of knowledge may have different interestingness measures. • For example, interestingness measures for association rules include support and confidence. • Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
  • 47. The expected representation for visualizing the discovered patterns: • This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
  • 49. Integration of a Data Mining System with a Database or Data Warehouse System • A data mining system is integrated with a database or data warehouse system so that it can perform its tasks effectively. A data mining system operates in an environment that requires it to communicate with other data systems, such as a database system. The possible integration schemes are as follows: • No coupling • Loose coupling • Semi-tight coupling • Tight coupling
  • 50. No coupling • No coupling means that the data mining system does not use any function of a database or data warehouse system. • It retrieves data from a particular source (such as a file system), processes the data using some data mining algorithms, and then stores the mining results in another file. • A database system offers a great deal of flexibility and efficiency at storing, organizing, accessing, and processing data; without one, a data mining system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
  • 51. Loose coupling • In loose coupling, the data mining system uses some services of a database or data warehouse system. • The data is fetched from a data repository managed by these systems; data mining approaches are used to process the data, and the results are then stored either in a file or in a designated place in the database or data warehouse. • Loose coupling is better than no coupling because it can fetch a portion of the data stored in databases using query processing and other system facilities. • However, loosely coupled systems are mainly memory-based, so it is difficult to achieve high scalability and good performance on large data sets.
  • 52. Semi-tight coupling • In semi-tight coupling, efficient implementations of a few essential data mining primitives are provided in the database/data warehouse system. • These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some important statistical measures, such as sum, count, max, min, and standard deviation.
  • 53. Tight coupling • Tight coupling means that the data mining system is smoothly integrated into the database/data warehouse system. • The data mining subsystem is treated as one functional component of the information system. • Data mining queries and functions are developed and optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of the database/data warehouse system. • This scheme is highly desirable because it supports efficient implementation of data mining functions, high system performance, and an integrated data processing environment.
  • 54. Major Issues in Data Mining • Data mining is not an easy task: the algorithms used can be very complex, and data is not always available in one place; it often needs to be integrated from various heterogeneous data sources. These factors create a number of issues. This section discusses the major issues regarding: • Mining methodology and user interaction • Performance issues • Diverse data types issues
  • 56. • Mining Methodology and User Interaction Issues • These include the following kinds of issues: • Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge, so data mining should cover a broad range of knowledge discovery tasks. • Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results. • Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
  • 57. • Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining. • Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations, and these representations should be easily understandable. • Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities; without them, the accuracy of the discovered patterns will be poor. • Pattern evaluation − The patterns discovered may be uninteresting because they represent common knowledge or lack novelty.
  • 58. Performance Issues • There can be performance-related issues such as the following: • Efficiency and scalability of data mining algorithms − To effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable. • Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions that are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the entire data again from scratch.
  • 59. Diverse Data Types Issues • Handling of relational and complex types of data − A database may contain complex data objects, multimedia data objects, spatial data, temporal data, and so on. It is not possible for one system to mine all these kinds of data. • Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured, so mining knowledge from them adds challenges to data mining.
  • 60. Data Preprocessing • Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogenous sources. • Low-quality data will lead to low-quality mining results. • “How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? • “How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”
  • 61. Data preprocessing techniques. • Data cleaning: Data cleaning can be applied to remove noise and correct inconsistencies in the data. • Data integration: Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. • Data transformation: Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. • Data reduction: Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance.
  • 63. Why Preprocess the Data? • Data preprocessing is essential before the data's actual use; it is the process of turning raw data into a clean data set. • The dataset is preprocessed to check for missing values, noisy data, and other inconsistencies before it is fed to the algorithm. • Common sources of noisy or inconsistent data include: • Faulty data collection instruments. • Human errors during data entry. • Inconsistencies in naming conventions or data codes, and inconsistent formats for input fields such as dates. • Duplicate tuples, which also require data cleaning.
  • 64. Descriptive Data Summarization • Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which data values should be treated as noise or outlier. • For many data preprocessing tasks, users would like to learn about data characteristics regarding both central tendency and dispersion of the data. • Measures of central tendency include mean, median, mode, and midrange, while measures of data dispersion include quartiles, interquartile range (IQR), and variance.
  • 65. Measuring the Central Tendency • There are many ways to measure the central tendency of data. • The most common and most effective numerical measure of the “center” of a set of data is the (arithmetic) mean. • Let x1, x2, ..., xN be a set of N values or observations, such as for some attribute like salary. The mean of this set of values is x̄ = (x1 + x2 + ··· + xN) / N, that is, the sum of the values divided by N.
  • 66. • This corresponds to the built-in aggregate function, average (avg() in SQL), provided in relational database systems. • Distributive measure: A distributive measure is a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure’s value for the original (entire) data set. • Both sum() and count() are distributive measures because they can be computed in this manner. Other examples include max() and min().
  • 67. • Algebraic measure: An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. Hence, average (or mean()) is an algebraic measure because it can be computed by sum()/count(). • Each value xi in a set may be associated with a weight wi, for i = 1, ..., N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute x̄ = (w1x1 + w2x2 + ··· + wNxN) / (w1 + w2 + ··· + wN). This is called the weighted arithmetic mean or the weighted average.
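The mean and weighted arithmetic mean can be computed directly, as in the sketch below; the salary list and weights are hypothetical.

# Sketch: arithmetic mean and weighted arithmetic mean (illustrative data only).
salaries = [30, 36, 47, 50, 52, 56, 60, 63, 70, 110]   # in thousands (hypothetical)
weights  = [1,  1,  2,  2,  3,  3,  2,  2,  1,  1]     # e.g. occurrence frequencies

mean = sum(salaries) / len(salaries)
weighted_mean = sum(w * x for w, x in zip(weights, salaries)) / sum(weights)

print(mean, weighted_mean)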
  • 68. Drawbacks of mean • A major problem with the mean is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values can corrupt the mean. • For example, the mean salary at a company may be substantially pushed up by that of a few highly paid managers. Similarly, the average score of a class in an exam could be pulled down quite a bit by a few very low scores.
  • 69. Trimmed mean • we can instead use the trimmed mean, which is the mean obtained after chopping off values at the high and low extremes. • For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean. • We should avoid trimming too large a portion (such as 20%) at both ends as this can result in the loss of valuable information.
  • 70. Median • It is a better measure to find the center of data • Suppose that a given data set of N distinct values is sorted in numerical order. • If N is odd, then the median is the middle value of the ordered set; otherwise (i.e., if N is even), the median is the average of the middle two values.
  • 71. • Assume that data are grouped in intervals according to their xi data values and that the frequency (i.e., number of data values) of each interval is known. • For example, people may be grouped according to their annual salary in intervals such as 10–20K, 20–30K, and so on. • Let the interval that contains the median frequency be the median interval. We can approximate the median of the entire data set (e.g., the median salary) by interpolation using the formula: median ≈ L1 + ((N/2 − (∑freq)l) / freqmedian) × width,
  • 72. • L1 is the lower boundary of the median interval, • N is the number of values in the entire data set, • (∑ freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval, • freqmedian is the frequency of the median interval, • and width is the width of the median interval.
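The interpolation formula can be applied to grouped data as in the sketch below; the salary intervals and frequencies are hypothetical.

# Sketch: approximating the median of grouped data with the interpolation formula above
# (hypothetical intervals and frequencies).
intervals = [((10, 20), 200),   # (lower, upper) bounds in $K, frequency
             ((20, 30), 450),
             ((30, 40), 300),
             ((40, 50), 50)]

N = sum(freq for _, freq in intervals)   # total number of values (1000 here)
half = N / 2

cum = 0
for (L1, upper), freq_median in intervals:
    if cum + freq_median >= half:        # this is the median interval
        width = upper - L1
        median = L1 + ((half - cum) / freq_median) * width
        break
    cum += freq_median

print(median)   # 20 + ((500 - 200) / 450) * 10, about 26.67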
  • 73. Mode • The mode for a set of data is the value that occurs most frequently in the set. • Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. • For example, the mode of the data set 2, 4, 5, 5, 6, 7 is 5, because it appears twice while every other value appears only once. • In general, a data set with two or more modes is multimodal. • At the other extreme, if each data value occurs only once, then there is no mode.
  • 75. Measuring the Dispersion of Data • The degree to which numerical data tend to spread is called the dispersion, or variance of the data. • The most common measures of data dispersion are range, the five- number summary (based on quartiles), the interquartile range, and the standard deviation. Boxplots can be plotted based on the five-number summary and are a useful tool for identifying outliers.
  • 76. Range, Quartiles, Outliers, and Boxplots • Let x1,x2,...,xN be a set of observations for some attribute. The range of the set is the difference between the largest (max()) and smallest (min()) values. • let’s assume that the data are sorted in increasing numerical order. • The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi . • The most commonly used percentiles other than the median are quartiles. • The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile.
  • 77. IQR (interquartile range) • The distance between the first and third quartiles is the interquartile range: IQR = Q3 − Q1. The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, Median, Q3, Maximum.
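A short sketch computing the five-number summary and IQR with NumPy (assumed available); the observations are illustrative, and the 1.5 × IQR fences used to flag outliers are a common boxplot convention rather than something prescribed above.

# Sketch: five-number summary, IQR, and 1.5*IQR outlier fences (illustrative data).
import numpy as np

x = np.array([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

q1, median, q3 = np.percentile(x, [25, 50, 75])
five_number = (x.min(), q1, median, q3, x.max())
iqr = q3 - q1

low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < low_fence) | (x > high_fence)]

print(five_number, iqr, outliers)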
  • 78. Boxplots • Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows: the ends of the box are at the quartiles, so the box length is the interquartile range; the median is marked by a line within the box; and two lines outside the box (the whiskers) extend to the smallest (Minimum) and largest (Maximum) observations.
  • 81. • Plotting histograms, or frequency histograms, is a graphical method for summarizing the distribution of a given attribute. • A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. Typically, the width of each bucket is uniform.
  • 84. The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points and outliers, or to explore the possibility of correlation relations
  • 86. Data Cleaning •Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
  • 87. Missing Values • Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. • How can you go about filling in the missing values for this attribute? • Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. • Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
  • 88. • Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” the mining program may mistakenly think that they form an interesting concept, since they all share the same value. • Use the attribute mean to fill in the missing value: For example, suppose that the average income of AllElectronics customers is $56,000. Use this value to replace the missing value for income. • Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income of customers in the same credit risk category as the given tuple. • Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
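Two of these strategies, sketched with pandas (assumed available) on a tiny illustrative table: filling with the overall attribute mean, and filling with the mean of samples in the same class.

# Sketch: mean imputation and per-class mean imputation (illustrative data only).
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high"],
    "income":      [56000, None, 42000, None],
})

# Use the overall attribute mean to fill missing income values.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Use the mean of samples in the same class (credit_risk) as the given tuple.
df["income_class_filled"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)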
  • 89. Noisy Data • “What is noise?” • Noise is a random error or variance in a measured variable. • Given a numerical attribute such as, say, price, how can we “smooth” out the data to remove the noise? • Let’s look at the following data smoothing techniques:
  • 90. 1. Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of buckets, or bins. In smoothing by bin means, each value in a bin is replaced by the mean of the bin; smoothing by bin medians or bin boundaries uses the bin median or the closest boundary value instead.
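A minimal sketch of smoothing by bin means over equal-frequency bins; the sorted price list and bin size are illustrative.

# Sketch: smoothing sorted values by equal-frequency (equal-depth) bins, replacing
# each value with its bin mean (illustrative data only).
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 1)] * len(bin_values))

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]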
  • 91. 2.Regression • Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other. • Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
  • 92. 3. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers
  • 93. Data Integration • Data mining often requires data integration—the merging of data from multiple data stores. • It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. • These sources may include multiple databases, data cubes, or flat files.
  • 94. Issues during Data Integration 1.Entity identification problem:How can equivalent real-world entities from multiple data sources be matched up? • For example, how can the data analyst or the computer be sure that customer id in one database and cust number in another refer to the same attribute? • Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values. • Such metadata can be used to help avoid errors in schema integration.
  • 95. 2.Redundancy •Redundancy is another important issue. An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes. • Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
  • 96. Correlation analysis • Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. • For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson’s product moment coefficient, named after its inventor, Karl Pearson). The coefficient is computed as rA,B = (Σ(aibi) − N·Ā·B̄) / (N·σA·σB),
  • 97. • where N is the number of tuples, • ai and bi are the respective values of A and B in tuple i, • Ā and B̄ are the respective mean values of A and B, • σA and σB are the respective standard deviations of A and B, • and Σ(aibi) is the sum of the AB cross-product (that is, for each tuple, the value for A is multiplied by the value for B in that tuple). • Note that −1 ≤ rA,B ≤ +1. • If rA,B is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. • If the resulting value is less than 0, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease.
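The correlation coefficient can be computed directly from this definition; the sketch below uses NumPy (assumed available) on illustrative paired values and cross-checks the result against NumPy's built-in corrcoef.

# Sketch: Pearson correlation coefficient from the definition above (illustrative data).
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.5, 3.1, 6.2, 7.8, 10.1])

N = len(a)
r_ab = ((a * b).sum() - N * a.mean() * b.mean()) / (N * a.std() * b.std())

print(r_ab)                     # close to +1 here, i.e. strongly positively correlated
print(np.corrcoef(a, b)[0, 1])  # same value via NumPy's built-in routine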
  • 98. • For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. • Suppose A has c distinct values, namely a1, a2, ..., ac, and B has r distinct values, namely b1, b2, ..., br. • The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. • Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj). Each possible (Ai, Bj) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson χ² statistic) is computed as χ² = Σi Σj (oij − eij)² / eij,
  • 99. • where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as eij = (count(A = ai) × count(B = bj)) / N, where N is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B.
  • 101. • Correlation analysis of categorical attributes using χ². Suppose that a group of 1,500 people was surveyed. • The gender of each person was noted. Each person was polled as to whether their preferred type of reading material was fiction or nonfiction. • Thus, we have two attributes, gender and preferred reading.
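A sketch of the χ² computation on a 2 × 2 contingency table; the counts below are hypothetical and are not the figures from the survey example above.

# Sketch: Pearson chi-square for a 2x2 contingency table (hypothetical counts).
observed = {
    ("male",   "fiction"): 300, ("male",   "non_fiction"): 100,
    ("female", "fiction"): 150, ("female", "non_fiction"): 450,
}

genders = ["male", "female"]
readings = ["fiction", "non_fiction"]
N = sum(observed.values())

row_totals = {g: sum(observed[(g, r)] for r in readings) for g in genders}
col_totals = {r: sum(observed[(g, r)] for g in genders) for r in readings}

chi2 = 0.0
for g in genders:
    for r in readings:
        expected = row_totals[g] * col_totals[r] / N   # e_ij = count(A=a_i)*count(B=b_j)/N
        chi2 += (observed[(g, r)] - expected) ** 2 / expected

print(chi2)   # a large value suggests gender and preferred reading are correlated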
  • 102. Data Transformation In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following: • Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering. • Aggregation, where summary or aggregation operations are applied to the data. • For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities. • Generalization of the data, where low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies. • For example, categorical attributes, like street, can be generalized to higher- level concepts, like city or country. • Similarly, values for numerical attributes, like age, may be mapped to higher- level concepts, like youth, middle-aged, and senior.
  • 103. • Normalization, where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0. • Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process. • Min-max normalization. • z-score normalization. • Normalization by decimal scaling.
  • 105. Z-score normalization: • This method normalizes a value for attribute A using the mean and standard deviation of A. A value v of A is normalized to v′ by computing v′ = (v − Ā) / σA, where Ā is the mean and σA the standard deviation of attribute A.
  • 106. Normalization by decimal scaling • Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. • A value v of A is normalized to v′ by computing v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1.
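The three normalization methods can be compared on one illustrative value list, as in the sketch below; the value range and the use of the population standard deviation are assumptions made for the example.

# Sketch: min-max, z-score, and decimal-scaling normalization (illustrative data).
import math
import statistics

values = [200, 300, 400, 600, 1000]

# Min-max normalization to the range [0.0, 1.0].
vmin, vmax = min(values), max(values)
min_max = [(v - vmin) / (vmax - vmin) for v in values]

# Z-score normalization: (v - mean) / standard deviation.
mean, stdev = statistics.mean(values), statistics.pstdev(values)
z_score = [(v - mean) / stdev for v in values]

# Decimal scaling: divide by 10^j, where j is the smallest integer with max(|v'|) < 1.
j = math.ceil(math.log10(max(abs(v) for v in values) + 1))
decimal_scaled = [v / (10 ** j) for v in values]

print(min_max)
print(z_score)
print(decimal_scaled)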
  • 107. Data Reduction • Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume. • Strategies for data reduction include the following: • Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube. • Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed. • Dimensionality reduction, where encoding mechanisms are used to reduce the data set size. • Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models or nonparametric methods such as clustering, sampling, and the use of histograms.
  • 108. • Discretization and concept hierarchy generation where raw data values for attributes are replaced by ranges or higher conceptual levels. • Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.
  • 109. Data aggregation • This technique is used to aggregate data into a simpler form. For example, imagine that the data you gathered for your analysis covers the years 2012 to 2014 and includes your company's sales for every quarter. • If you are interested in annual sales rather than quarterly totals, the data can be aggregated so that the resulting data summarize total sales per year instead of per quarter, giving a much more compact summary.
  • 112. Attribute Subset Selection • Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions). • The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. • Mining on a reduced set of attributes has an additional benefit. It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
  • 113. Basic heuristic methods of attribute subset selection include the following techniques • Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set. • Stepwise backward elimination: The procedure starts with the full set of attributes. • At each step, it removes the worst attribute remaining in the set.
  • 114. • Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes. • Decision tree induction: Decision tree algorithms, such as ID3 (Iterative Dichotomiser 3), C4.5, and CART (Classification and Regression Trees), were originally intended for classification. • Decision tree induction constructs a flowchart-like structure where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. • The set of attributes appearing in the tree form the reduced subset of attributes.
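A rough sketch of stepwise forward selection, assuming scikit-learn and NumPy are available: attributes are added greedily while a cross-validated score keeps improving. The synthetic data, model, and stopping rule are illustrative choices, not a prescribed procedure.

# Sketch: greedy stepwise forward selection driven by a cross-validated score.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 4))                     # 4 candidate attributes
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # only attributes 0 and 2 matter here

def score(columns):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    candidate = max(remaining, key=lambda c: score(selected + [c]))
    candidate_score = score(selected + [candidate])
    if candidate_score <= best_score:           # stop when no attribute improves the score
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_score = candidate_score

print(selected, best_score)   # typically picks attributes 0 and 2 first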
  • 116. Dimensionality Reduction •In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data. •If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. • If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
  • 117. Numerosity Reduction • “Can we reduce the data volume by choosing alternative, ‘smaller’ forms of data representation?” These techniques may be parametric or nonparametric. • For parametric methods, a model is used to estimate the data, so that typically only the model parameters need to be stored, instead of the actual data. (Outliers may also be stored.) • Log-linear models, which estimate discrete multidimensional probability distributions, are an example. • Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
  • 118. Two effective methods of lossy dimensionality reduction: 1.wavelet transforms and 2.principal components analysis.
  • 119. Wavelet transforms • The discrete wavelet transform (DWT) is a signal processing technique that transforms linear signals. • A wavelet transform can represent a signal with good time resolution or good frequency resolution. There are two types of wavelet transforms: the continuous wavelet transform (CWT) and the discrete wavelet transform (DWT). • When the DWT is applied, the data vector X is transformed into a numerically different vector, X′, of wavelet coefficients; the two vectors X and X′ must be of the same length. • When applying this technique to data reduction, we consider each n-dimensional data tuple, that is, X = (x1, x2, ..., xn), where n is the number of attributes in the relation of the data set.
  • 121. • What’s a Wavelet? • A Wavelet is a wave-like oscillation that is localized in time, an example is given below. Wavelets have two basic properties: scale and location. • Scale (or dilation) defines how “stretched” or “squished” a wavelet is. This property is related to frequency as defined for waves. • Location defines where the wavelet is positioned in time (or space)
  • 122. • Wavelet transforms can be applied to multidimensional data, such as a data cube. • The computational complexity involved is linear with respect to the number of cells in the cube. • Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes. • Wavelet transforms have many real-world applications, including the compression of fingerprint images, computer vision, analysis of time-series data, and data cleaning.
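To show the flavor of the DWT without relying on a signal-processing library, the sketch below implements a single level of the Haar wavelet transform by hand on an illustrative vector.

# Sketch: one level of a Haar discrete wavelet transform (illustrative input).
import math

def haar_dwt_step(x):
    """Return (approximation, detail) coefficients for an even-length sequence."""
    s = math.sqrt(2)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

data = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
approx, detail = haar_dwt_step(data)
print(approx)   # smoothed, half-length representation of the data
print(detail)   # pairwise differences; small ones can be truncated for compression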
  • 123. Principal Components Analysis • Principal components analysis (PCA) is one method for dimensionality reduction. PCA reduces the number of variables in your data by extracting the important ones from a large pool. • It reduces the dimension of your data with the aim of retaining as much information as possible. • Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. • Principal components analysis searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
  • 124. • PCA works by considering the variance along each direction, because a direction with high variance gives a good separation between the classes, and this is what allows the dimensionality to be reduced. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. • It is a feature extraction technique, so it keeps the important variables and drops the least important ones. • The PCA algorithm is based on mathematical concepts such as: • Variance and covariance • Eigenvalues and eigenvectors
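A compact sketch of PCA via the covariance matrix and its eigenvectors, using NumPy (assumed available); the 2-D points and the choice of keeping k = 1 component are illustrative.

# Sketch: PCA by eigendecomposition of the covariance matrix (illustrative data).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

X_centered = X - X.mean(axis=0)                  # center each attribute
cov = np.cov(X_centered, rowvar=False)           # covariance matrix of the attributes
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: symmetric matrix, ascending order

order = np.argsort(eigenvalues)[::-1]            # sort components by explained variance
components = eigenvectors[:, order[:1]]          # keep the top k = 1 principal component

X_reduced = X_centered @ components              # project onto the principal component
print(eigenvalues[order])                        # variance captured by each component
print(X_reduced.ravel())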
  • 126. Applications of Principal Component Analysis • PCA is mainly used as the dimensionality reduction technique in various AI applications such as computer vision, image compression, etc. • It can also be used for finding hidden patterns if data has high dimensions. Some fields where PCA is used are Finance, data mining, Psychology, etc.
  • 127. Regression and Log-Linear Models • Regression and log-linear models can be used to approximate the given data. In (simple) linear regression, the data are modeled to fit a straight line. • For example, a random variable y (called a response variable) can be modeled as a linear function of another random variable x (called a predictor variable) with the equation y = wx + b, where w and b are the regression coefficients specifying the slope of the line and the y-intercept, respectively.
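A minimal sketch of fitting y = wx + b by least squares with NumPy (assumed available); the data points are illustrative.

# Sketch: simple linear regression by least squares (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

w, b = np.polyfit(x, y, deg=1)   # slope and y-intercept of the best-fitting line
print(w, b)
print(w * 6.0 + b)               # numeric prediction for a new x value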
  • 128. Histograms • Histograms use binning to approximate data distributions and are a popular form of data reduction. • A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
  • 129. • Histograms. The following data are a list of prices of commonly sold items at All Electronics (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
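The price list above can be summarized with an equal-width histogram; the sketch below uses a bucket width of $10, which is an illustrative choice.

# Sketch: equal-width histogram (width 10) over the sorted price list above.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15,
          15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21,
          21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
buckets = Counter((p - 1) // width for p in prices)   # bucket 0 -> $1-10, 1 -> $11-20, ...

for b in sorted(buckets):
    low, high = b * width + 1, (b + 1) * width
    print(f"${low}-{high}: {buckets[b]} items")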
  • 132. • Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data. • Suppose that a large data set, D, contains N tuples. Let’s look at the most common ways that we could sample D for data reduction.
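As a sketch of the most common sampling schemes, the code below draws a simple random sample without replacement (SRSWOR) and with replacement (SRSWR) from a stand-in data set; the data and sample size are illustrative.

# Sketch: simple random sampling without and with replacement (illustrative data).
import random

random.seed(0)
D = list(range(1, 101))   # stand-in for a data set of N = 100 tuples
s = 10

srswor = random.sample(D, s)                   # without replacement: no tuple drawn twice
srswr = [random.choice(D) for _ in range(s)]   # with replacement: a tuple may repeat

print(srswor)
print(srswr)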