unit 1.pptx
1. Data Mining Functionalities—What Kinds of
Patterns Can Be Mined?
Data mining functionalities are used to specify the kind of patterns to be found
in data mining tasks.
In general, data mining tasks can be classified into two categories:
1. Descriptive tasks
2. Predictive tasks
1. Descriptive mining tasks characterize the general properties of the data in the
database.
2. Predictive mining tasks perform inference on the current data in order to
make predictions.
2. Concept/Class Description: Characterization and Discrimination
• Data can be associated with classes or concepts.
For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.
Such descriptions of a class or a concept are called class/concept
descriptions.
3. These descriptions can be derived via
(1) data characterization, by summarizing the data of the class under
study in general terms, or
(2) data discrimination, by comparison of the target class with one or a
set of comparative classes (often called the contrasting classes), or
(3) both data characterization and discrimination
4. • Data characterization is a summarization of the general characteristics
or features of a target class of data.
• The data corresponding to the user-specified class are typically
collected by a database query.
• For example, to study the characteristics of software products whose
sales increased by 10% in the last year, the data related to such
products can be collected by executing an SQL query.
5. • There are several methods for effective data summarization and
characterization, including simple data summaries based on statistical
measures and plots.
• The output of data characterization can be presented in various
forms.
• Examples include pie charts, bar charts, curves, multidimensional
data cubes, and multidimensional tables, including crosstabs.
• The resulting descriptions can also be presented as generalized
relations or in rule form (called characteristic rules).
6. Example: Data characterization.
• A data mining system should be able to produce a description
summarizing the characteristics of customers who spend more than
$1,000 a year at AllElectronics.
• The result could be a general profile of the customers, such as they
are 40–50 years old, employed, and have excellent credit ratings.
7. • Data discrimination is a comparison of the general features of target
class data objects with the general features of objects from one or a
set of contrasting classes.
• For example, the user may like to compare the general features of
software products whose sales increased by 10% in the last year with
those whose sales decreased by at least 30% during the same period.
8. Example 1.5 Data discrimination.
• A data mining system should be able to compare two groups of
AllElectronics customers, such as those who shop for computer
products regularly versus those who rarely shop for such products
(i.e., less than three times a year).
• The resulting description provides a general comparative profile of
the customers, such as 80% of the customers who frequently
purchase computer products are between 20 and 40 years old and
have a university education, whereas 60% of the customers who
infrequently buy such products are either seniors or youths, and have
no university degree.
9. Mining Frequent Patterns, Associations, and Correlations
• A frequent pattern is a pattern that appears frequently in a data
set. By identifying frequent patterns, we can observe which items are
strongly correlated and easily identify similar characteristics and
associations among them.
• Frequent pattern mining also leads to further analysis such as
clustering, classification, and other data mining tasks.
10. There are many kinds of frequent patterns, including itemsets,
subsequences, and substructures.
• A frequent itemset typically refers to a set of items that frequently
appear together in a transactional data set, such as milk and bread.
• A frequently occurring subsequence, such as the pattern that
customers tend to purchase first a PC, followed by a digital camera,
and then a memory card, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms, such as graphs,
trees, or lattices, which may be combined with itemsets or
subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.
• Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
11. Example 1.6 Association analysis.
• Suppose, as a marketing manager of AllElectronics, you would like to
determine which items are frequently purchased together within the same
transactions.
• An example of such a rule, mined from the AllElectronics transactional
database, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
Association rules that contain a single predicate are referred to as single-
dimensional association rules.
12. • Suppose, instead, that we are given the AllElectronics relational
database relating to purchases.
• A data mining system may find association rules like
• age(X, “20...29”) ∧ income(X, “40K...49K”) ⇒ buys(X, “laptops”)
[support = 2%, confidence = 60%]
• This is a multidimensional association rule, since it involves more
than one attribute or predicate.
• Typically, association rules are discarded as uninteresting if they do
not satisfy both a minimum support threshold and a minimum
confidence threshold.
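The support and confidence figures in rules like those above can be computed directly from a transaction database. The following sketch uses a made-up toy database (the item names are illustrative only):

```python
# Toy transaction database; contents are made up for illustration.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs, db) / support(lhs, db)

# Rule: buys(X, "computer") => buys(X, "software")
s = support({"computer", "software"}, transactions)       # 0.5
c = confidence({"computer"}, {"software"}, transactions)  # 2/3
```

A rule would be kept only if both `s` and `c` clear the user-specified minimum support and minimum confidence thresholds.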
13. Classification and Prediction
• Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts, for the purpose
of being able to use the model to predict the class of objects whose
class label is unknown.
• “How is the derived model presented?” The derived model may be
represented in various forms, such as classification
• (IF-THEN) rules,
• decision trees,
• mathematical formulae,
• or neural networks
14. • A decision tree is a flow-chart-like tree structure, where each node
denotes a test on an attribute value, each branch represents an
outcome of the test, and tree leaves represent classes or class
distributions.
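Because each internal node tests an attribute, each branch is a test outcome, and each leaf is a class, a small decision tree can be written directly as nested IF-THEN rules. The attributes and class labels below are hypothetical, chosen only to illustrate the structure:

```python
def classify(customer):
    """Walk a hand-built decision tree: each `if` is a node's attribute test,
    each branch an outcome of the test, each `return` a leaf (class label)."""
    if customer["age"] == "youth":
        return "buys" if customer["student"] else "does_not_buy"
    elif customer["age"] == "middle_aged":
        return "buys"
    else:  # senior
        return "buys" if customer["credit"] == "excellent" else "does_not_buy"

print(classify({"age": "youth", "student": True}))  # buys
```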
16. • Regression analysis is a statistical methodology that is most often
used for numeric prediction, although other methods exist as well.
Prediction also encompasses the identification of distribution trends
based on the available data.
17. Example 1.7 Classification and prediction.
• Suppose, as sales manager of AllElectronics, you would like to classify
a large set of items in the store, based on three kinds of responses to
a sales campaign: good response, mild response, and no response.
• You would like to predict the amount of revenue that each item will
generate during an upcoming sale at AllElectronics, based on previous
sales data. This is an example of (numeric) prediction because the
model constructed will predict a continuous-valued function, or
ordered value.
19. Cluster Analysis
• Unlike classification and prediction, clustering analyzes data objects
without consulting known class labels; clustering can be used to
generate such labels. The objects are clustered or grouped based on the
principle of maximizing the intraclass similarity and minimizing the
interclass similarity.
• That is, clusters of objects are formed so that objects within a cluster
have high similarity in comparison to one another, but are very
dissimilar to objects in other clusters.
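The similarity principle can be sketched with a single nearest-center assignment pass. The values and the two initial centers below are assumptions for illustration; this is one step of a clustering procedure, not a complete algorithm:

```python
# One assignment pass: each value joins the nearest of two assumed centers,
# so values within a cluster are similar and the clusters are well separated.
values = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centers = [1.0, 8.0]  # hypothetical initial cluster centers

clusters = {0: [], 1: []}
for v in values:
    nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
    clusters[nearest].append(v)

print(clusters)  # {0: [1.0, 1.2, 0.9], 1: [8.0, 8.3, 7.9]}
```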
20. Outlier Analysis
• A database may contain data objects that do not comply with the
general behavior or model of the data. These data objects are
outliers.
• Example Outlier analysis. Outlier analysis may uncover fraudulent
usage of credit cards by detecting purchases of extremely large
amounts for a given account number in comparison to regular
charges incurred by the same account.
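The credit-card example can be sketched as a simple deviation test: flag any charge that lies far from the account's regular charges. The charge history and the 2-standard-deviation threshold below are assumptions for illustration:

```python
# Hypothetical charge history for one account; the last purchase is extreme.
charges = [35.0, 42.5, 28.0, 39.9, 31.0, 45.0, 2500.0]

def outliers(xs, k=2.0):
    """Values more than k standard deviations from the mean."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [x for x in xs if abs(x - m) > k * sd]

print(outliers(charges))  # [2500.0]
```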
21. Evolution Analysis
• Data evolution analysis describes and models regularities or trends
for objects whose behavior changes over time. Although this may
include characterization, discrimination, association and correlation
analysis, classification, prediction, or clustering of time related data,
distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data
analysis.
22. Example Evolution analysis.
• Suppose that you have the major stock market (time-series) data of
the last several years available from the New York Stock Exchange and
you would like to invest in shares of high-tech industrial companies.
• A data mining study of stock exchange data may identify stock
evolution regularities for overall stocks and for the stocks of particular
companies.
• Such regularities may help predict future trends in stock market
prices, contributing to your decision making regarding stock
investments.
23. Are All of the Patterns Interesting?
• “What makes a pattern interesting?
• Can a data mining system generate all of the interesting patterns?
• Can a data mining system generate only interesting patterns?”
24. To answer the first question, a pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
25. The second question—“Can a data mining system
generate all of the interesting patterns?”—
• refers to the completeness of a data mining algorithm. It is often
unrealistic and inefficient for data mining systems to generate all of
the possible patterns.
26. Finally, the third question
• “Can a data mining system generate only interesting patterns?”
• is an optimization problem in data mining.
• It is highly desirable for data mining systems to generate only
interesting patterns.
29. Statistics
• Statistical models are widely used to model data and data classes.
• For example, in data mining tasks like data characterization and
classification, statistical models of target classes can be built.
• For example, we can use statistics to model noise and missing data
values.
• Statistics research develops tools for prediction and forecasting using
data and statistical models. Statistical methods can be used to
summarize or describe a collection of data
30. • Inferential statistics (or predictive statistics) models data in a way
that accounts for randomness and uncertainty in the observations
and is used to draw inferences about the process or population under
investigation.
• A statistical hypothesis test (sometimes called confirmatory data
analysis) makes statistical decisions using experimental data. A result
is called statistically significant if it is unlikely to have occurred by
chance.
31. Machine Learning
• Machine learning investigates how computers can learn (or improve
their performance) based on data. A main research area is for
computer programs to automatically learn to recognize complex
patterns and make intelligent decisions based on data.
• Supervised learning
• Unsupervised learning
• Semi supervised learning
• Active learning.
32. Supervised learning
• Supervised learning is a type of machine learning in which
machines are trained using well-labelled training data, and on the
basis of that data, machines predict the output. Labelled data means
that some input data is already tagged with the correct output.
How Does Supervised Learning Work?
• In supervised learning, models are trained using a labelled dataset,
where the model learns about each type of data. Once the training
process is completed, the model is evaluated on test data (held out
from the labelled data and not used during training), and then it
predicts the output.
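The train-then-test workflow can be sketched with a 1-nearest-neighbour rule standing in for "the model"; the feature vectors and class labels below are invented for illustration:

```python
# Labelled training data: (feature vector, class label) pairs (hypothetical).
train = [([1.0, 1.0], "budgetSpender"), ([9.0, 8.0], "bigSpender"),
         ([1.5, 0.8], "budgetSpender"), ([8.5, 9.1], "bigSpender")]

def predict(x):
    """1-nearest-neighbour: return the label of the closest training example."""
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda ex: dist(ex[0], x))[1]

# Held-out test data, never seen during "training".
test_set = [([1.2, 0.9], "budgetSpender"), ([9.2, 8.7], "bigSpender")]
accuracy = sum(predict(x) == y for x, y in test_set) / len(test_set)
print(accuracy)  # 1.0
```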
34. Unsupervised Machine Learning
• Unsupervised learning is a type of machine learning in which models
are trained using unlabeled dataset and are allowed to act on that
data without any supervision.
• Unsupervised learning cannot be directly applied to a
regression or classification problem because unlike supervised
learning, we have the input data but no corresponding output
data. The goal of unsupervised learning is to find the
underlying structure of dataset, group that data according
to similarities, and represent that dataset in a compressed
format.
36. Semi supervised learning
• Semi-Supervised learning is a type of Machine Learning
algorithm that represents the intermediate ground between
Supervised and Unsupervised learning algorithms. It uses a
combination of labeled and unlabeled datasets during the
training period.
• In one approach, labeled examples are used to learn class models and
unlabeled examples are used to refine the boundaries between
classes.
38. Active learning
• Active learning is a machine learning approach that lets users play an
active role in the learning process. An active learning approach can
ask a user (e.g., a domain expert) to label an example, which may be
from a set of unlabeled examples
39. Database Systems and Data Warehouses
• Database systems research focuses on the creation, maintenance, and
use of databases for organizations and end-users. Particularly,
database systems researchers have established highly recognized
principles in data models, query languages, query processing and
optimization methods, data storage, and indexing and accessing
methods. Database systems are often well known for their high
scalability in processing very large, relatively structured data sets.
• A data warehouse integrates data originating from multiple sources
and various timeframes. It consolidates data in multidimensional
space to form partially materialized data cubes. The data cube model
not only facilitates OLAP in multidimensional databases but also
promotes multidimensional data mining.
40. Information Retrieval
• Information retrieval (IR) is the science of searching for documents or
information in documents. Documents can be text or multimedia, and
may reside on the Web.
• The differences between traditional information retrieval and
database systems are twofold:
• Information retrieval assumes that
• (1) the data under search are unstructured; and
• (2) the queries are formed mainly by keywords, which do not have
complex structures (unlike SQL queries in database systems).
41. Data Mining Task Primitives
• Each user will have a data mining task in mind, that is, some form of
data analysis that he or she would like to have performed.
• A data mining task can be specified in the form of a data mining
query, which is input to the data mining system.
• A data mining query is defined in terms of data mining task
primitives.
• The set of task relevant data to be mined.
• The kind of knowledge to be mined.
• The background knowledge to be used in the discovery process.
• The interestingness measures and thresholds for pattern evaluation.
• The expected representation for visualizing the discovered patterns.
42. The data mining primitives specify the following
The set of task-relevant data to be mined:
This specifies the portions of the database or the set of data in which
the user is interested. This includes the database attributes or data
warehouse dimensions of interest (referred to as the relevant
attributes or dimensions).
43. The kind of knowledge to be mined:
• This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution
analysis.
44. The background knowledge to be used in the discovery
process:
• This knowledge about the domain to be mined is useful for guiding
the knowledge discovery process and for evaluating the patterns
found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of
abstraction.
46. The interestingness measures and thresholds for pattern
evaluation:
• They may be used to guide the mining process or, after discovery, to
evaluate the discovered patterns.
• Different kinds of knowledge may have different interestingness
measures.
• For example, interestingness measures for association rules include
support and confidence.
• Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
47. The expected representation for visualizing the
discovered patterns:
• This refers to the form in which discovered patterns are to be
displayed, which may include rules, tables, charts, graphs, decision
trees, and cubes.
49. Integration of a Data Mining System with a
Database or Data Warehouse System
• A data mining system is integrated with a database or data
warehouse system so that it can perform its tasks effectively. A
data mining system operates in an environment that requires it
to communicate with other information systems, such as a
database system. The possible integration schemes are as
follows −
• No coupling
• Loose coupling
• Semi tight coupling
• Tight coupling
50. No coupling
• No coupling means that the data mining system does not use any
function of a database or data warehouse system.
• It may retrieve data from a particular source (such as a file
system), process the data using some data mining algorithms, and
then store the mining results in another file.
• This scheme has drawbacks: a database system offers a great
deal of flexibility and efficiency in storing, organizing, accessing,
and processing data.
• Without a database/data warehouse system, a data mining
system may spend a substantial amount of time finding,
collecting, cleaning, and transforming data.
51. Loose coupling
• In this scheme, the data mining system uses some services of a
database or data warehouse system.
• The data is fetched from a data repository managed by these systems.
• Data mining approaches are used to process the data, and the
results are then stored either in a file or in a designated place in a
database or data warehouse.
• Loose coupling is better than no coupling because it can fetch
portions of the data stored in databases by using query processing
and other system facilities.
• However, many loosely coupled mining systems are memory-based,
so it is difficult to achieve high scalability and good performance with
large data sets.
52. SEMI TIGHT COUPLING
• In this scheme, efficient implementations of a few
essential data mining primitives are provided in the
database/data warehouse system.
• These primitives can include sorting, indexing,
aggregation, histogram analysis, multiway join,
and precomputation of some essential
statistical measures, such as sum, count, max,
min, and standard deviation.
53. Tight coupling
• Tight coupling means that the data mining system is smoothly
integrated into the database/data warehouse system.
• The data mining subsystem is treated as one functional
component of the information system.
• Data mining queries and functions are optimized based on
mining query analysis, data structures, indexing schemes, and
query processing methods of the database/data warehouse
system.
• This approach is highly desirable because it facilitates efficient
implementation of data mining functions, high system
performance, and an integrated information processing environment.
54. Major Issues in Data Mining
• Data mining is not an easy task: the algorithms used can be
very complex, and data is not always available in one place; it
may need to be integrated from various heterogeneous data
sources. These factors give rise to a number of issues, which
fall into three categories −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
56. • Mining Methodology and User Interaction Issues
• It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different
users may be interested in different kinds of knowledge. Therefore
it is necessary for data mining to cover a broad range of knowledge
discovery tasks.
• Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns, providing
and refining data mining requests based on the returned results.
• Incorporation of background knowledge − Background knowledge
can be used to guide the discovery process and to express the
discovered patterns. It allows discovered patterns to be expressed
not only in concise terms but at multiple levels of abstraction.
57. • Data mining query languages and ad hoc data mining − A data
mining query language that allows the user to describe ad hoc
mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once
patterns are discovered, they need to be expressed in high-level
languages and visual representations that are easily
understandable.
• Handling noisy or incomplete data − The data cleaning
methods are required to handle the noise and incomplete objects
while mining the data regularities. If the data cleaning methods
are not there then the accuracy of the discovered patterns will be
poor.
• Pattern evaluation − Discovered patterns may be uninteresting
because they represent common knowledge or lack novelty;
evaluating the interestingness of patterns is itself a challenge.
58. Performance Issues
• There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to
effectively extract information from huge amounts of data in
databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, the wide distribution of data, and
the complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms
divide the data into partitions that are processed in parallel, and the
results from the partitions are then merged. Incremental algorithms
incorporate database updates without mining the entire data again
from scratch.
59. Diverse Data Types Issues
• Handling of relational and complex types of data − A database
may contain complex data objects, multimedia data objects,
spatial data, temporal data, etc. It is not possible for one system
to mine all these kinds of data.
• Mining information from heterogeneous databases and
global information systems − Data is available at different
data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Mining knowledge
from them therefore adds challenges to data mining.
60. Data Preprocessing
• Today’s real-world databases are highly susceptible to noisy, missing,
and inconsistent data due to their typically huge size (often several
gigabytes or more) and their likely origin from multiple, heterogeneous
sources.
• Low-quality data will lead to low-quality mining results.
• “How can the data be preprocessed in order to help improve the
quality of the data and, consequently, of the mining results?
• “How can the data be preprocessed so as to improve the efficiency
and ease of the mining process?”
61. Data preprocessing techniques.
• Data cleaning: Data cleaning can be applied to remove noise and correct
inconsistencies in the data.
• Data integration: Data integration merges data from multiple sources
into a coherent data store, such as a data warehouse.
• Data transformation: Data transformations, such as normalization, may
be applied. For example, normalization may improve the accuracy and
efficiency of mining algorithms involving distance measurements.
• Data reduction: Data reduction can reduce the data size by aggregating,
eliminating redundant features, or clustering, for instance.
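The normalization mentioned above can be sketched as min-max scaling to [0, 1]; the income values below are hypothetical:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values into [new_min, new_max] so that
    attributes with large ranges do not dominate distance measurements."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [30000, 45000, 60000, 90000]  # hypothetical attribute values
print(min_max(incomes))  # [0.0, 0.25, 0.5, 1.0]
```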
63. Why Preprocess the Data?
• Data preprocessing is essential before actual use. Data
preprocessing is the process of transforming raw data into a
clean data set.
• The dataset is preprocessed in order to handle missing
values, noisy data, and other inconsistencies before it is
fed to the mining algorithm.
• Sources of noisy and inconsistent data:
• The data collection instruments may be faulty.
• Humans may make errors during data entry.
• There may be inconsistencies in the naming conventions or data codes
used, or inconsistent formats for input fields such as dates.
• Duplicate tuples also require data cleaning.
64. Descriptive Data Summarization
• Descriptive data summarization techniques can be used to identify
the typical properties of your data and highlight which data values
should be treated as noise or outlier.
• For many data preprocessing tasks, users would like to learn about
data characteristics regarding both central tendency and dispersion
of the data.
• Measures of central tendency include mean, median, mode, and
midrange, while measures of data dispersion include quartiles,
interquartile range (IQR), and variance.
65. Measuring the Central Tendency
• There are many ways to measure the central tendency of data.
• The most common and most effective numerical measure of the
“center” of a set of data is the (arithmetic) mean.
• Let x1, x2, ..., xN be a set of N values or observations, such as for some
attribute, like salary. The mean of this set of values is
mean = (x1 + x2 + ··· + xN) / N
66. • This corresponds to the built-in aggregate function, average (avg() in
SQL), provided in relational database systems.
• Distributive measure: A distributive measure is a measure (i.e.,
function) that can be computed for a given data set by partitioning
the data into smaller subsets, computing the measure for each
subset, and then merging the results in order to arrive at the
measure’s value for the original (entire) data set.
• Both sum() and count() are distributive measures because they can be
computed in this manner. Other examples include max() and min().
67. • Algebraic measure:An algebraic measure is a measure that can be
computed by applying an algebraic function to one or more
distributive measures. Hence, average (or mean()) is an algebraic
measure because it can be computed by sum()/count().
• Each value xi in a set may be associated with a weight wi,
for i = 1, ..., N. The weights reflect the significance, importance, or
occurrence frequency attached to their respective values. In this case,
we can compute
mean = (w1x1 + w2x2 + ··· + wNxN) / (w1 + w2 + ··· + wN)
This is called the weighted arithmetic mean or the weighted average.
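A minimal sketch of the weighted arithmetic mean, sum(wi * xi) / sum(wi); the values and weights are invented:

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# A value with weight 3 counts three times as much as one with weight 1.
print(weighted_mean([10, 20], [1, 3]))  # 17.5
```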
68. Drawbacks of mean
• A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values. Even a small number of extreme values can corrupt
the mean.
• For example, the mean salary at a company may be substantially
pushed up by that of a few highly paid managers. Similarly, the
average score of a class in an exam could be pulled down quite a bit
by a few very low scores.
69. Trimmed mean
• We can instead use the trimmed mean, which is the
mean obtained after chopping off values at the high
and low extremes.
• For example, we can sort the values observed for
salary and remove the top and bottom 2% before
computing the mean.
• We should avoid trimming too large a portion (such as
20%) at both ends as this can result in the loss of
valuable information.
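The chop-then-average procedure can be sketched as follows. The salary list and the 10% trim fraction are illustrative (the slide suggests 2% for a large data set):

```python
def trimmed_mean(values, p):
    """Mean after chopping off a fraction p of the values at each extreme."""
    xs = sorted(values)
    k = int(len(xs) * p)              # how many values to drop at each end
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 500]  # one extreme salary
print(sum(salaries) / len(salaries))  # plain mean: 84.0, pulled up by 500
print(trimmed_mean(salaries, 0.10))   # 38.75, after dropping 30 and 500
```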
70. Median
• For skewed (asymmetric) data, the median is a better measure of the center of the data.
• Suppose that a given data set of N distinct values is sorted in
numerical order.
• If N is odd, then the median is the middle value of the ordered set;
otherwise (i.e., if N is even), the median is the average of the middle
two values.
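The odd/even rule above translates directly to code:

```python
def median(values):
    """Middle value of the sorted data if N is odd; otherwise the average
    of the two middle values."""
    xs = sorted(values)
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

print(median([3, 1, 2]))     # 2    (odd N: middle value)
print(median([4, 1, 3, 2]))  # 2.5  (even N: average of 2 and 3)
```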
71. • Assume that data are grouped in intervals according to their xi data
values and that the frequency (i.e., number of data values) of each
interval is known.
• For example, people may be grouped according to their annual salary
in intervals such as 10–20K, 20–30K, and so on.
• Let the interval that contains the median frequency be the median
interval. We can approximate the median of the entire data set (e.g.,
the median salary) by interpolation using the formula:
median ≈ L1 + ((N/2 − (∑ freq)l) / freqmedian) × width
72. • L1 is the lower boundary of the median interval,
• N is the number of values in the entire data set,
• (∑ freq)l is the sum of the frequencies of all of the intervals that are
lower than the median interval,
• freqmedian is the frequency of the median interval,
• and width is the width of the median interval.
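The interpolation formula can be applied to grouped data like this. The salary intervals and frequencies below are made up for illustration; with them, the median interval is 20–30K:

```python
# Grouped data: (interval boundaries, frequency). Values are hypothetical.
groups = [((10, 20), 200), ((20, 30), 450), ((30, 40), 300), ((40, 50), 50)]

N = sum(freq for _, freq in groups)  # 1000 values in total
half, below = N / 2, 0               # `below` accumulates (sum freq)_l
for (low, high), freq in groups:
    if below + freq >= half:         # found the median interval
        approx_median = low + (half - below) / freq * (high - low)
        break
    below += freq

print(round(approx_median, 2))  # 26.67
```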
73. Mode
• The mode for a set of data is the value that occurs most frequently in the set.
• Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal.
• For example, the mode of the data set 2, 4, 5, 5, 6, 7 is 5, because it
appears twice in the collection.
• In general, a data set with two or more modes is multimodal.
• At the other extreme, if each data value occurs only once, then there is no
mode.
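A sketch that returns all modes, covering the unimodal, multimodal, and no-mode cases described above:

```python
from collections import Counter

def modes(values):
    """All most-frequent values; empty list when every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    return [] if top == 1 else sorted(v for v, c in counts.items() if c == top)

print(modes([2, 4, 5, 5, 6, 7]))  # [5] -- the slide's unimodal example
print(modes([1, 1, 2, 2, 3]))     # [1, 2] -- bimodal
print(modes([1, 2, 3]))           # [] -- no mode
```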
75. Measuring the Dispersion of Data
• The degree to which numerical data tend to spread is called the
dispersion, or variance of the data.
• The most common measures of data dispersion are range, the five-
number summary (based on quartiles), the interquartile range, and
the standard deviation.
Boxplots can be plotted based on the five-number summary and are a
useful tool for identifying outliers.
76. Range, Quartiles, Outliers, and Boxplots
• Let x1,x2,...,xN be a set of observations for some attribute. The range
of the set is the difference between the largest (max()) and smallest
(min()) values.
• let’s assume that the data are sorted in increasing numerical order.
• The kth percentile of a set of data in numerical order is the value xi
having the property that k percent of the data entries lie at or below
xi .
• The most commonly used percentiles other than the median are
quartiles.
• The first quartile, denoted by Q1, is the 25th percentile; the third
quartile, denoted by Q3, is the 75th percentile.
77. IQR (Interquartile Range)
• The distance between the first and third quartiles is the interquartile
range:
IQR = Q3 − Q1.
The five-number summary of a distribution consists of the median, the
quartiles Q1 and Q3, and the smallest and largest individual
observations, written in the order Minimum, Q1, Median, Q3,
Maximum.
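The five-number summary and IQR can be computed as below. Note that several quartile conventions exist; this sketch interpolates linearly between sorted values, so results may differ slightly under other definitions. The data values are illustrative:

```python
def quantile(values, q):
    """q-th quantile (0 <= q <= 1) by linear interpolation in the sorted data."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    i, frac = int(pos), pos - int(pos)
    return xs[i] if frac == 0 else xs[i] * (1 - frac) + xs[i + 1] * frac

def five_number_summary(values):
    """(Minimum, Q1, Median, Q3, Maximum)."""
    q1, med, q3 = (quantile(values, q) for q in (0.25, 0.5, 0.75))
    return min(values), q1, med, q3, max(values)

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # illustrative values
mn, q1, med, q3, mx = five_number_summary(data)
print(mn, q1, med, q3, mx)  # 6 25.5 40 42.5 49
iqr = q3 - q1               # 17.0
```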
78. Boxplots
• Boxplots are a popular way of visualizing a distribution. A boxplot
incorporates the five-number summary as follows: the ends of the box
are at the quartiles (so the box length is the IQR), the median is
marked by a line within the box, and two lines (whiskers) outside the
box extend to the smallest (Minimum) and largest (Maximum)
observations.
81. • Plotting histograms, or frequency histograms, is a graphical method for
summarizing the distribution of a given attribute.
• A histogram for an attribute A partitions the data distribution of A into
disjoint subsets, or buckets. Typically, the width of each bucket is
uniform.
84. The scatter plot is a useful method for providing a first look at
bivariate data, to see clusters of points and outliers, or to explore the
possibility of correlation relationships.
86. Data Cleaning
•Real-world data tend to be incomplete, noisy,
and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing
values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
87. Missing Values
• Imagine that you need to analyze All Electronics sales and customer
data. You note that many tuples have no recorded value for several
attributes, such as customer income.
• How can you go about filling in the missing values for this attribute?
• Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not
very effective, unless the tuple contains several attributes with
missing values.
• Fill in the missing value manually: In general, this approach is time-
consuming and may not be feasible given a large data set with many
missing values.
• Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown”
or −∞. If missing values are replaced by, say, “Unknown,” the mining
program may mistakenly think that they form an interesting concept,
since they all share a common value.
• Use the attribute mean to fill in the missing value: For example,
suppose that the average income of AllElectronics customers is
$56,000. Use this value to replace the missing value for income.
• Use the attribute mean for all samples belonging to the same class as
the given tuple:
• For example, if classifying customers according to credit risk, replace
the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
• Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
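The mean-based strategies above can be sketched in plain Python. The customer tuples and the `risk` class attribute below are hypothetical:

```python
# Hypothetical customer tuples; None marks a missing income value.
customers = [
    {"name": "A", "risk": "low",  "income": 60000},
    {"name": "B", "risk": "low",  "income": None},
    {"name": "C", "risk": "high", "income": 30000},
    {"name": "D", "risk": "high", "income": 34000},
    {"name": "E", "risk": "high", "income": None},
]

def fill_with_mean(tuples, attr):
    """Replace missing values of attr with the overall attribute mean."""
    known = [t[attr] for t in tuples if t[attr] is not None]
    mean = sum(known) / len(known)
    for t in tuples:
        if t[attr] is None:
            t[attr] = mean

def fill_with_class_mean(tuples, attr, class_attr):
    """Replace missing values of attr with the mean over tuples of the
    same class (here, the same credit-risk category)."""
    for t in tuples:
        if t[attr] is None:
            same = [u[attr] for u in tuples
                    if u[class_attr] == t[class_attr] and u[attr] is not None]
            t[attr] = sum(same) / len(same)

fill_with_class_mean(customers, "income", "risk")
print([t["income"] for t in customers])
```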
89. Noisy Data
• “What is noise?”
• Noise is a random error or variance in a measured variable.
• Given a numerical attribute such as, say, price, how can we “smooth”
out the data to remove the noise?
• Let’s look at the following data smoothing techniques:
90. 1. Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins.
• In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin.
• In smoothing by bin medians, each bin value is replaced by the bin
median.
• In smoothing by bin boundaries, each bin value is replaced by the
closest of the minimum and maximum values of the bin.
91. 2.Regression
• Data can be smoothed by fitting the data to a
function, such as with regression. Linear regression
involves finding the “best” line to fit two attributes (or
variables), so that one attribute can be used to predict
the other.
• Multiple linear regression is an extension of linear
regression, where more than two attributes are
involved and the data are fit to a multidimensional
surface.
92. 3. Clustering:
Outliers may be detected by clustering, where
similar values are organized into groups, or
“clusters.”
Intuitively, values that fall outside of the set of
clusters may be considered outliers
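Smoothing by bin means, the first of the techniques above, can be sketched as follows; the price values are illustrative:

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-depth (equal-frequency) binning, then replace each value
    by the mean of its bin. Assumes len(values) divides evenly."""
    ordered = sorted(values)
    depth = len(ordered) // n_bins
    smoothed = []
    for i in range(0, len(ordered), depth):
        bin_vals = ordered[i:i + depth]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

# Illustrative sorted price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```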
93. Data Integration
• Data mining often requires data integration—the merging of data
from multiple data stores.
• It is likely that your data analysis task will involve data integration,
which combines data from multiple sources into a coherent data
store, as in data warehousing.
• These sources may include multiple databases, data cubes, or flat
files.
94. Issues during Data Integration
1. Entity identification problem: How can equivalent real-world entities
from multiple data sources be matched up?
• For example, how can the data analyst or the computer be sure that
customer id in one database and cust number in another refer to the
same attribute?
• Examples of metadata for each attribute include the name, meaning,
data type, and range of values permitted for the attribute, and null
rules for handling blank, zero, or null values.
• Such metadata can be used to help avoid errors in schema
integration.
95. 2.Redundancy
•Redundancy is another important issue. An
attribute (such as annual revenue, for instance)
may be redundant if it can be “derived” from
another attribute or set of attributes.
• Inconsistencies in attribute or dimension
naming can also cause redundancies in the
resulting data set.
96. Correlation Analysis
• Given two attributes, such analysis can measure how strongly one
attribute implies the other, based on the available data.
• For numerical attributes, we can evaluate the correlation between
two attributes, A and B, by computing the correlation coefficient (also
known as Pearson’s product moment coefficient, named after its
inventor, Karl Pearson):
rA,B = (Σ(ai·bi) − N·Ā·B̄) / (N·σA·σB)
97. • where N is the number of tuples
• ai and bi are the respective values of A and B in tuple i
• Ā and B̄ are the respective mean values of A and B
• σA and σB are the respective standard deviations of A and B
• and Σ(aibi) is the sum of the AB cross-product (that is, for each tuple,
the value for A is multiplied by the value for B in that tuple).
• Note that −1 ≤ rA,B ≤ +1.
• If rA,B is greater than 0, then A and B are positively correlated, meaning that
the values of A increase as the values of B increase.
• If the resulting value is less than 0, then A and B are negatively correlated,
where the values of one attribute increase as the values of the other attribute
decrease.
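The coefficient can be computed directly from the formula above; a minimal pure-Python sketch with made-up values:

```python
import math

def pearson_r(a, b):
    """rA,B = (Σ ai·bi − N·Ā·B̄) / (N·σA·σB), using population
    standard deviations, as in the slide's definition."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / n)
    cross = sum(x * y for x, y in zip(a, b))  # Σ of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sd_a * sd_b)

# Hypothetical values: B is a noisy increasing function of A.
a = [1, 2, 3, 4, 5]
b = [2, 4, 5, 4, 7]
print(round(pearson_r(a, b), 3))  # → 0.87
```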
98. • For categorical (discrete) data, a correlation relationship between two
attributes, A and B, can be discovered by a χ² (chi-square) test.
• Suppose A has c distinct values, namely a1, a2, ..., ac, and B has r distinct
values, namely b1, b2, ..., br.
• The data tuples described by A and B can be shown as a contingency
table, with the c values of A making up the columns and the r values
of B making up the rows.
• Let (Ai, Bj) denote the joint event that attribute A takes on value ai and
attribute B takes on value bj, that is, where (A = ai, B = bj). Each and
every possible (Ai, Bj) joint event has its own cell (or slot) in the table.
The χ² value (also known as the Pearson χ² statistic) is computed as:
χ² = Σi Σj (oij − eij)² / eij
99. • where oij is the observed frequency (i.e., actual count) of the joint event
(Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as
eij = (count(A = ai) × count(B = bj)) / N
• where N is the number of data tuples,
count(A = ai) is the number of tuples having value ai for A,
and count(B = bj) is the number of tuples having value bj for B.
101. • Correlation analysis of categorical attributes using χ². Suppose that a
group of 1,500 people was surveyed.
• The gender of each person was noted. Each person was polled as to
whether their preferred type of reading material was fiction or
nonfiction.
• Thus, we have two attributes, gender and preferred reading.
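A pure-Python sketch of the χ² computation for such a contingency table. The counts in the table below are illustrative assumptions, since the slides do not give them:

```python
def chi_square(observed):
    """Pearson χ² = Σ (oij − eij)² / eij over a contingency table,
    where eij = count(row i) × count(column j) / N."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected frequency eij
            chi2 += (o - e) ** 2 / e
    return chi2

# Illustrative counts (assumed): rows = fiction / nonfiction,
# columns = male / female, total 1,500 people.
table = [[250, 200],
         [50, 1000]]
print(round(chi_square(table), 2))  # → 507.94
```

A large χ² value here would indicate that gender and preferred reading are strongly correlated.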
102. Data Transformation
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
• Smoothing, which works to remove noise from the data. Such techniques
include binning, regression, and clustering.
• Aggregation, where summary or aggregation operations are applied to the
data.
• For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts. This step is typically used in constructing a
data cube for analysis of the data at multiple granularities.
• Generalization of the data, where low-level or “primitive” (raw) data are
replaced by higher-level concepts through the use of concept hierarchies.
• For example, categorical attributes, like street, can be generalized to higher-
level concepts, like city or country.
• Similarly, values for numerical attributes, like age, may be mapped to higher-
level concepts, like youth, middle-aged, and senior.
103. • Normalization, where the attribute data are scaled so as to fall within
a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
• Attribute construction (or feature construction), where new attributes
are constructed and added from the given set of attributes to help
the mining process.
• Min-max normalization.
• z-score normalization.
• Normalization by decimal scaling.
104. Min-max normalization:
• Min-max normalization performs a linear transformation on the
original data. A value, v, of attribute A is mapped to v′ in the new
range [new_minA, new_maxA] by computing:
v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
105. Z-score normalization:
• This method normalizes the value for attribute A using the mean
and standard deviation. The following formula is used for Z-score
normalization:
v′ = (v − Ā) / σA
• where Ā and σA are the mean and standard deviation of attribute A.
106. Normalization by decimal scaling
• Normalization by decimal scaling normalizes by moving the decimal
point of values of attribute A. The number of decimal points moved
depends on the maximum absolute value of A.
• A value, v, of A is normalized to v′ by computing:
v′ = v / 10^j
• where j is the smallest integer such that max(|v′|) < 1.
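A compact sketch of the three normalization methods in Python. The numbers used are illustrative assumptions: an income attribute with minimum 12,000, maximum 98,000, mean 54,000, and standard deviation 16,000.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max: linear map of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score: v' = (v − mean) / standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """Move the decimal point so all normalized values fall in (−1, 1).
    Assumes max_abs >= 1."""
    j = len(str(int(max_abs)))  # smallest j with max_abs / 10**j < 1
    return v / 10 ** j

print(min_max(73600, 12000, 98000))   # ≈ 0.716
print(z_score(73600, 54000, 16000))   # → 1.225
print(decimal_scaling(917, 986))      # → 0.917
```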
107. Data Reduction
• Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume.
• Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the
data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce
the data set size.
• Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models or
nonparametric methods such as clustering, sampling, and the use of
histograms.
108. • Discretization and concept hierarchy generation where raw data values for
attributes are replaced by ranges or higher conceptual levels.
• Data discretization is a form of numerosity reduction that is very useful for the
automatic generation of concept hierarchies. Discretization and concept
hierarchy generation are powerful tools for data mining, in that they allow the
mining of data at multiple levels of abstraction.
109. Data aggregation
• This technique is used to aggregate data into a simpler form. For
example, imagine that the information you gathered for your
analysis covers the years 2012 to 2014, and that the data include
your company’s revenue for every three months.
• If you are interested in annual sales rather than quarterly figures,
the data can be summarized so that the resulting data set gives the
total sales per year instead of per quarter, reducing the size of
the data.
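The quarterly-to-annual aggregation described above can be sketched in a few lines of Python; the sales figures are hypothetical:

```python
from collections import defaultdict

# Hypothetical quarterly sales, keyed by (year, quarter).
quarterly = {
    (2012, 1): 224000, (2012, 2): 408000, (2012, 3): 350000, (2012, 4): 586000,
    (2013, 1): 231000, (2013, 2): 412000, (2013, 3): 353000, (2013, 4): 590000,
}

annual = defaultdict(int)
for (year, _quarter), amount in quarterly.items():
    annual[year] += amount   # aggregate four quarters into one yearly total

print(dict(annual))  # → {2012: 1568000, 2013: 1586000}
```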
112. Attribute Subset Selection
• Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the
data classes is as close as possible to the original distribution
obtained using all attributes.
• Mining on a reduced set of attributes has an additional benefit. It
reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
113. Basic heuristic methods of attribute subset selection include the
following techniques
• Stepwise forward selection: The procedure starts with an empty set
of attributes as the reduced set. The best of the original attributes is
determined and added to the reduced set. At each subsequent
iteration or step, the best of the remaining original attributes is added
to the set.
• Stepwise backward elimination: The procedure starts with the full set
of attributes.
• At each step, it removes the worst attribute remaining in the set.
114. • Combination of forward selection and backward elimination: The
stepwise forward selection and backward elimination methods can be
combined so that, at each step, the procedure selects the best
attribute and removes the worst from among the remaining
attributes.
• Decision tree induction: Decision tree algorithms, such as
ID3 (Iterative Dichotomiser), C4.5, and CART (Classification and
Regression Trees), were originally intended for classification.
• Decision tree induction constructs a flowchart-like structure where
each internal (nonleaf) node denotes a test on an attribute, each
branch corresponds to an outcome of the test, and each external
(leaf) node denotes a class prediction.
• The set of attributes appearing in the tree form the reduced subset of
attributes.
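A sketch of stepwise forward selection, assuming some evaluation function `score` over attribute subsets (in practice this could be, e.g., cross-validated classifier accuracy). The attribute names and toy weights below are made up:

```python
def forward_selection(attributes, score, k):
    """Greedy stepwise forward selection: start from the empty set and,
    at each step, add the attribute that most improves the score."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score (hypothetical): per-attribute relevance weights, ignoring
# interactions between attributes.
weights = {"income": 0.9, "age": 0.6, "street": 0.1, "id": 0.0}

def toy_score(subset):
    return sum(weights[a] for a in subset)

print(forward_selection(weights, toy_score, 2))  # → ['income', 'age']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.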
116. Dimensionality Reduction
•In dimensionality reduction, data encoding or
transformations are applied so as to obtain a reduced
or “compressed” representation of the original data.
•If the original data can be reconstructed from the
compressed data without any loss of information, the
data reduction is called lossless.
• If, instead, we can reconstruct only an approximation
of the original data, then the data reduction is called
lossy.
117. Numerosity Reduction
• “Can we reduce the data volume by choosing alternative,
‘smaller’ forms of data representation?” These techniques
may be parametric or nonparametric.
• For parametric methods, a model is used to estimate the
data, so that typically only the data parameters need to be
stored, instead of the actual data. (Outliers may also be
stored.)
• Log-linear models, which estimate discrete multidimensional
probability distributions, are an example.
• Nonparametric methods for storing reduced representations
of the data include histograms, clustering, and sampling.
118. Two effective methods of lossy dimensionality reduction:
1.wavelet transforms and
2.principal components analysis.
119. Wavelet transforms
• The discrete wavelet transform (DWT) is a signal processing
technique that transforms linear signals.
• The wavelet transform can represent a signal with a good time
resolution or a good frequency resolution. There are two types
of wavelet transforms: the continuous wavelet transform
(CWT) and the discrete wavelet transform (DWT).
• The data vector X is transformed into a numerically different
vector, X′, of wavelet coefficients when the DWT is applied.
The two vectors X and X′ must be of the same length.
• When applying this technique to data reduction, we consider each
tuple as an n-dimensional data vector, that is, X = (x1, x2, …, xn),
where n is the number of attributes in the relation of the data set.
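The DWT comes in several families; as an illustration only, one level of the simple Haar transform can be sketched as follows. The input vector is hypothetical:

```python
import math

def haar_dwt_level(x):
    """One level of the Haar DWT: pairwise (normalized) sums give a
    smoothed half-length approximation; pairwise differences give the
    detail coefficients. Input length must be even."""
    avg = [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    dif = [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    return avg + dif   # same length as the input, as the DWT requires

x = [2.0, 2.0, 0.0, 2.0]
print(haar_dwt_level(x))
```

Data reduction comes from keeping only the largest coefficients (a lossy, "compressed" representation) and setting the rest to zero.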
121. • What’s a Wavelet?
• A wavelet is a wave-like oscillation that is localized in time.
Wavelets have two basic properties: scale and location.
• Scale (or dilation) defines how “stretched” or “squished” a
wavelet is. This property is related to frequency as defined for
waves.
• Location defines where the wavelet is positioned in time (or
space)
122. • Wavelet transforms can be applied to multidimensional data,
such as a data cube.
• The computational complexity involved is linear with respect
to the number of cells in the cube.
• Wavelet transforms give good results on sparse or skewed
data and on data with ordered attributes.
• Wavelet transforms have many real-world applications,
including the compression of fingerprint images, computer
vision, analysis of time-series data, and data cleaning.
123. Principal Components Analysis
• Principal components analysis (PCA) is one method for dimensionality
reduction. PCA reduces the number of variables in your data by
extracting the important ones from a large pool.
• It reduces the dimensionality of your data with the aim of retaining
as much information as possible.
• Suppose that the data to be reduced consist of tuples or data vectors
described by n attributes or dimensions.
• Principal components analysis searches for k n-dimensional
orthogonal vectors that can best be used to represent the data,
where k ≤ n.
124. • PCA works by considering the variance of each attribute, because
attributes with high variance tend to give a good split between the
classes; the data are projected onto the directions of highest
variance, which reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation
systems, and optimizing the power allocation in various
communication channels.
• It is a feature extraction technique, so it retains the important
variables and drops the least important ones.
• The PCA algorithm is based on mathematical concepts such as:
• Variance and covariance
• Eigenvalues and eigenvectors
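A minimal 2-D sketch of the idea: the first principal component is the unit eigenvector of the covariance matrix with the largest eigenvalue. Real uses rely on a linear algebra library; the sample points below are made up, and the closed form works only for the 2×2 case:

```python
import math

def pca_first_component(points):
    """First principal component of 2-D data: the unit eigenvector of the
    2x2 covariance matrix with the largest eigenvalue (the direction of
    maximum variance). Assumes the data are not all identical."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Population covariance matrix entries.
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] (closed form for 2x2).
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    vx, vy = lam - syy, sxy          # an (unnormalized) eigenvector
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Hypothetical strongly correlated points; the component is roughly (1, 1)/sqrt(2).
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]
v = pca_first_component(pts)
print(v)
```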
126. Applications of Principal Component
Analysis
• PCA is mainly used as the dimensionality
reduction technique in various AI applications
such as computer vision, image compression,
etc.
• It can also be used for finding hidden patterns if
data has high dimensions. Some fields where
PCA is used are Finance, data mining,
Psychology, etc.
127. Regression and Log-Linear Models
• Regression and log-linear models can be used to approximate the
given data. In (simple) linear regression, the data are modeled to fit a
straight line.
• For example, a random variable, y (called a response variable), can be
modeled as a linear function of another random variable, x (called a
predictor variable), with the equation
y = wx + b
• where w and b are the regression coefficients, specifying the slope of
the line and the y-intercept, respectively.
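The coefficients w and b can be fit by least squares; a minimal sketch with hypothetical points:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b:
    w = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2),
    b = mean_y - w * mean_x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b

# Hypothetical points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
print(fit_line(xs, ys))  # → (2.0, 1.0)
```

For data reduction, only w and b need to be stored in place of the actual data points (plus, possibly, outliers).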
128. Histograms
• Histograms use binning to approximate data distributions and are a
popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, or buckets. If each bucket represents only a
single attribute-value/frequency pair, the buckets are called singleton
buckets. Often, buckets instead represent continuous ranges for the
given attribute.
129. • Histograms. The following data are a list of prices of commonly sold
items at All Electronics (rounded to the nearest dollar). The numbers
have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
132. • Sampling can be used as a data reduction technique because it allows
a large data set to be represented by a much smaller random sample
(or subset) of the data.
• Suppose that a large data set, D, contains N tuples. Let’s look at the
most common ways that we could sample D for data reduction.
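Two of the most common schemes, simple random sampling without replacement (SRSWOR) and with replacement (SRSWR), can be sketched with Python's random module; the toy data set stands in for D:

```python
import random

def srswor(data, s):
    """Simple random sample of s tuples WithOut Replacement: every tuple
    can be drawn at most once."""
    return random.sample(data, s)

def srswr(data, s):
    """Simple random sample of s tuples With Replacement: the same tuple
    may be drawn more than once."""
    return [random.choice(data) for _ in range(s)]

D = list(range(1, 101))   # a toy data set of N = 100 "tuples"
sample = srswor(D, 10)
print(len(sample), len(set(sample)))  # → 10 10 (no duplicates)
```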