20IT501 – Data Warehousing and
Data Mining
III Year / V Semester
UNIT II- DATA MINING
Introduction – Data – Types of Data – Data
Mining Functionalities – Interestingness of
Patterns – Classification of Data Mining Systems
– Data Mining Task Primitives – Integration of a
Data Mining System with a Data Warehouse –
Issues – Data Preprocessing
Data Mining
 Data mining, also known as knowledge discovery in data (KDD), is
the process of uncovering patterns and other valuable information
from large data sets.
 Data mining has improved organizational decision-making through
insightful data analyses.
 In addition, many other terms have a similar meaning to data
mining—for example, knowledge mining from data, knowledge
extraction, data/pattern analysis, data archaeology, and data
dredging.
Introduction
Data Mining
The knowledge discovery process:
Data cleaning (to remove noise and inconsistent data)
Data integration (where multiple data sources may be
combined)
Data selection (where data relevant to the analysis task
are retrieved from the database)
Data Mining
The knowledge discovery process:
Data transformation (where data are transformed
and consolidated into forms appropriate for mining
by performing summary or aggregation operations)
Data mining (an essential process where intelligent
methods are applied to extract data patterns)
Data Mining
The knowledge discovery process:
Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on
interestingness measures)
Knowledge presentation (where visualization and
knowledge representation techniques are used to
present mined knowledge to users)
Data Mining
The knowledge discovery process:
 Steps 1 through 4 are different forms of data
preprocessing, where data are prepared for mining.
Data Mining uncovers hidden patterns for evaluation.
 Data mining is the process of discovering interesting
patterns and knowledge from large amounts of data.
Data Mining
The knowledge discovery process - The data
sources can include
Databases,
Data warehouses,
The Web,
Other information repositories, or
data that are streamed into the system dynamically.
Data Mining Architecture
Data Mining Architecture
 Components - Database, data warehouse, WWW,
or other information repository:
This is one or a set of databases, data warehouses,
spreadsheets, or other kinds of information
repositories.
Data cleaning and data integration techniques may be
performed on the data.
Data Mining Architecture
 Components - Database or data warehouse
server:
The database or data warehouse server is
responsible for fetching the relevant data, based on
the user’s data mining request.
Data Mining Architecture
 Components - Knowledge base:
This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting
patterns.
 Such knowledge can include concept hierarchies, used
to organize attributes or attribute values into different
levels of abstraction.
Data Mining Architecture
 Components - Data mining engine:
This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such
as characterization, association and correlation
analysis, classification, prediction, cluster analysis,
outlier analysis, and evolution analysis.
Data Mining Architecture
 Components - Pattern evaluation module:
This component typically employs interestingness
measures and interacts with the data mining modules
so as to focus the search toward interesting patterns.
 The pattern evaluation module may be integrated with
the mining module, depending on the implementation
of the data mining method used.
Data Mining Architecture
 Components - User interface:
This module communicates between users and the data
mining system.
It allows the user to interact with the system by
specifying a data mining query or task, providing
information to help focus the search, and performing
exploratory data mining based on intermediate results.
Types of Data
 Database Data (or) Relational Databases:
 A database system, also called a database management system
(DBMS), consists of a collection of interrelated data, known as a
database, and a set of software programs to manage and access the data.
 The software programs involve mechanisms for the definition of
database structures; for data storage; for concurrent, shared, or
distributed data access; and for ensuring the consistency and security of
the information stored, despite system crashes or attempts at
unauthorized access.
Types of Data
 Database Data (or) Relational Databases:
 A relational database is a collection of tables, each of
which is assigned a unique name.
Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows).
 Each tuple in a relational table represents an object
identified by a unique key and described by a set of
attribute values.
Types of Data
 Database Data (or) Relational Databases:
 A semantic data model, such as an entity-relationship (ER)
data model, is often constructed for relational databases.
An ER data model represents the database as a set of
entities and their relationships.
 Relational data can be accessed by database queries
written in a relational query language, such as SQL, or with
the assistance of graphical user interfaces.
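As a small illustration, relational data can be queried declaratively through SQL. The sketch below uses Python's built-in sqlite3 module and a hypothetical customer table (the schema, names, and incomes are invented for the example):

```python
import sqlite3

# Build a small in-memory relational table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY,"
    " name TEXT, age INTEGER, income REAL)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [(1, "Amy", 34, 56000.0), (2, "Bob", 27, 42000.0), (3, "Cal", 45, 71000.0)],
)

# A relational query: tuples (rows) selected by a condition on an attribute.
rows = conn.execute(
    "SELECT name, income FROM customer WHERE income > 50000 ORDER BY income"
).fetchall()
print(rows)  # [('Amy', 56000.0), ('Cal', 71000.0)]
```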
Types of Data
 Database Data (or) Relational Databases:
 Example: Data mining systems can analyze customer data
to predict the credit risk of new customers based on their
income, age, and previous credit information.
Data mining systems may also detect deviations—that is,
items with sales that are far from those expected in
comparison with the previous year. Such deviations can
then be further investigated.
Types of Data
 Database Data (or) Relational Databases:
 Relational databases are one of the most
commonly available and rich information
repositories, and thus they are a major data form in
our study of data mining.
Types of Data
 Data Warehouses:
 A data warehouse is a repository of information
collected from multiple sources, stored under a unified
schema, and that usually resides at a single site.
Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data
loading, and periodic data refreshing.
Types of Data
 Data Warehouses:
 To facilitate decision making, the data in a data
warehouse are organized around major subjects, such
as customer, item, supplier, and activity.
The data are stored to provide information from a
historical perspective (such as from the past 5–10
years) and are typically summarized.
Types of Data
 Data Warehouses:
 A data warehouse is usually modeled by a multidimensional
database structure, where each dimension corresponds to an
attribute or a set of attributes in the schema, and each cell stores
the value of some aggregate measure, such as count or sales
amount.
 A data cube provides a multidimensional view of data and
allows the precomputation and fast accessing of summarized
data.
Types of Data
 Transactional Databases
 In general, a transactional database consists of a
file where each record represents a transaction.
 A transaction typically includes a unique
transaction identity number (trans ID) and a list of
the items making up the transaction
Types of Data
 Other Kinds of Data
 Time-related or sequence data (e.g., historical records, stock
exchange data, and time-series and biological sequence data),
 Data streams (e.g., video surveillance and sensor data, which are
continuously transmitted), spatial data (e.g., maps),
 Engineering design data (e.g., the design of buildings, system
components, or integrated circuits),
Types of Data
 Other Kinds of Data
Hypertext and multimedia data (including text, image,
video, and audio data),
Graph and networked data (e.g., social and information
networks), and
The Web (a huge, widely distributed information
repository made available by the Internet).
Data Mining Functionalities
 Data mining functionalities are used to specify the
kinds of patterns to be found in data mining tasks.
 Descriptive mining tasks characterize properties of the
data in a target data set.
 Predictive mining tasks perform induction on the
current data in order to make predictions.
Data Mining Functionalities
 Concept/Class Description: Characterization and
Discrimination:
 Data can be associated with classes or concepts.
 Example: Student Reg. No. and Student Name belong to the
Student class.
 It can be useful to describe individual classes and concepts
in summarized, concise, and yet precise terms.
Data Mining Functionalities
 Concept/Class Description: Characterization and
Discrimination:
 Data characterization, by summarizing the data of the class
under study (often called the target class) in general terms.
 Data discrimination, by comparison of the target class with one
or a set of comparative classes (often called the contrasting
classes), or
 Both data characterization and discrimination.
Data Mining Functionalities
 Concept/Class Description: Characterization and Discrimination:
 The data cube–based OLAP roll-up operation can be used to perform
user-controlled data summarization along a specified dimension.
 The output of data characterization can be presented in various forms.
 Examples: Pie and bar charts, curves and multidimensional data cubes
Data Mining Functionalities
 Concept/Class Description: Characterization and Discrimination:
 In data discrimination, the target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved
through database queries.
 For example, the user may like to compare the general features of
software products whose sales increased by 10% in the last year with
those whose sales decreased by at least 30% during the same period.
Data Mining Functionalities
Mining Frequent Patterns, Associations, and
Correlations:
 Frequent patterns, as the name suggests, are patterns
that occur frequently in data.
 A frequent itemset typically refers to a set of items
that frequently appear together in a transactional data
set, such as milk and bread.
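Frequent itemsets can be found by counting how often item combinations co-occur across transactions. A minimal sketch over a hypothetical transactional data set (the items and the minimum support count are invented for illustration):

```python
from collections import Counter
from itertools import combinations

# Toy transactional data set: each transaction is a set of items.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]

# Count every 2-itemset across all transactions.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# An itemset is "frequent" if it meets a minimum support count.
min_support = 3
frequent_pairs = {p for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # {('bread', 'milk')}
```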
Data Mining Functionalities
Mining Frequent Patterns, Associations, and
Correlations:
 A frequently occurring subsequence, such as the
pattern that customers tend to purchase first a PC,
followed by a digital camera, and then a memory
card, is a (frequent) sequential pattern.
Data Mining Functionalities
Classification and Prediction
 Classification is the process of finding a model (or
function) that describes and distinguishes data
classes or concepts, for the purpose of being able
to use the model to predict the class of objects
whose class label is unknown.
Data Mining Functionalities
Classification and Prediction
 The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision
trees, mathematical formulae, or neural networks.
 Classification predicts categorical (discrete,
unordered) labels, while regression models predict
continuous-valued functions.
Data Mining Functionalities
Classification and Prediction
 Regression is used to predict missing or
unavailable numerical data values rather than
(discrete) class labels. The term prediction refers to
both numeric prediction and class label prediction.
Data Mining Functionalities
 Cluster Analysis
 Clustering analyzes data objects without consulting class
labels.
Clustering can be used to generate class labels for a group
of data.
The objects are clustered or grouped based on the principle
of maximizing the intraclass similarity and minimizing the
interclass similarity.
Data Mining Functionalities
 Cluster Analysis
 Clusters of objects are formed so that objects
within a cluster have high similarity in comparison
to one another, but are rather dissimilar to objects
in other clusters.
Data Mining Functionalities
 Outlier Analysis:
 A data set may contain objects that do not comply
with the general behavior or model of the data.
These data objects are outliers.
 Many data mining methods discard outliers as
noise or exceptions.
Data Mining Functionalities
 Outlier Analysis:
 However, in some applications (e.g., fraud
detection) the rare events can be more interesting
than the more regularly occurring ones.
The analysis of outlier data is referred to as outlier
analysis or anomaly mining.
Data Mining Functionalities
 Evolution Analysis:
 Data evolution analysis describes and models
regularities or trends for objects whose behavior
changes over time.
 Example: Stock market (time-series) data
Interestingness of Patterns
 A pattern is interesting if it is
Easily understood by humans,
Valid on new or test data with some degree of certainty,
Potentially useful and
Novel
 An interesting pattern represents knowledge
Interestingness of Patterns
 Several objective measures of pattern
interestingness exist.
 An objective measure for association rules of the
form X ⇒ Y is rule support.
 Another objective measure for association rules is
confidence.
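Both measures can be computed directly from a transactional data set: support is the fraction of transactions containing X ∪ Y, and confidence estimates P(Y | X). A minimal sketch (the transactions and the rule computer ⇒ software are hypothetical):

```python
# Toy transactions; each is a set of items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"software"},
]

X, Y = {"computer"}, {"software"}
n = len(transactions)

both = sum(1 for t in transactions if (X | Y) <= t)   # transactions with X and Y
only_x = sum(1 for t in transactions if X <= t)       # transactions with X

support = both / n            # P(X ∪ Y) = 2/4
confidence = both / only_x    # P(Y | X) = 2/3
print(support, confidence)    # 0.5 0.6666666666666666
```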
Classification of Data Mining
Systems
Classification of Data Mining
Systems
 Classification according to the kinds of databases
mined:
 Database systems can be classified according to different
criteria (such as data models, or the types of data or
applications involved), each of which may require its own
data mining technique.
Data mining systems can therefore be classified
accordingly.
Classification of Data Mining
Systems
 Classification according to the kinds of databases mined:
 For instance, if classifying according to data models, we may
have a relational, transactional, object-relational, or data
warehouse mining system.
 If classifying according to the special types of data handled, we
may have a spatial, time-series, text, stream data, multimedia
data mining system, or a World Wide Web mining system
Classification of Data Mining
Systems
 Classification according to the kinds of
knowledge mined:
 It is based on data mining functionalities, such as
characterization, discrimination, association and
correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis.
Classification of Data Mining
Systems
 Classification according to the kinds of techniques utilized:
 The techniques can be described according to the degree of user
interaction involved (e.g., autonomous systems, interactive
exploratory systems, query-driven systems) or
 The methods of data analysis employed (e.g., database-oriented
or data warehouse– oriented techniques, machine learning,
statistics, visualization, pattern recognition, neural networks, and
so on).
Classification of Data Mining
Systems
Classification according to the applications
adapted:
 Data mining systems may be tailored specifically for
finance, telecommunications, DNA, stock markets, e-
mail, and so on.
 An all-purpose data mining system may not fit domain-
specific mining tasks.
Data Mining Task Primitives
 A data mining task can be specified in the form of a data
mining query, which is input to the data mining system.
 A data mining query is defined in terms of data mining task
primitives.
 These primitives allow the user to interactively
communicate with the data mining system during discovery
in order to direct the mining process, or examine the
findings from different angles or depths
Data Mining Task Primitives
The set of task-relevant data to be mined:
 This specifies the portions of the database or the
set of data in which the user is interested.
 This includes the database attributes or data
warehouse dimensions of interest.
Data Mining Task Primitives
The kind of knowledge to be mined:
 This specifies the data mining functions to be
performed, such as characterization,
discrimination, association or correlation analysis,
classification, prediction, clustering, outlier
analysis, or evolution analysis
Data Mining Task Primitives
The background knowledge to be used in the
discovery process:
 This knowledge about the domain to be mined is
useful for guiding the knowledge discovery
process and for evaluating the patterns found.
Data Mining Task Primitives
The interestingness measures and thresholds for
pattern evaluation:
They may be used to guide the mining process or, after
discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different
interestingness measures.
 Example: Support and Confidence
Data Mining Task Primitives
The expected representation for visualizing the
discovered patterns:
This refers to the form in which discovered
patterns are to be displayed, which may include
rules, tables, charts, graphs, decision trees, and
cubes.
Integration of a Data Mining System with a
Database or Data Warehouse System
 Good system architecture will facilitate the data mining
system:
 to make best use of the software environment
 accomplish data mining tasks in an efficient and timely manner
 interoperate and exchange information with other information
systems
 be adaptable to users’ diverse requirements, and evolve with
time
Integration of a Data Mining System with a
Database or Data Warehouse System
 A key design issue for a data mining (DM) system is how
to integrate or couple the DM system with a database
(DB) system and/or a data warehouse (DW) system.
 If a DM system works as a stand-alone system or is
embedded in an application program, there are no DB
or DW systems with which it has to communicate.
Integration of a Data Mining System with a
Database or Data Warehouse System
 No coupling:
 A DM system will not utilize any function of a DB or
DW system.
 It may fetch data from a particular source (such as a
file system), process data using some data mining
algorithms, and then store the mining results in another
file.
Integration of a Data Mining System with a
Database or Data Warehouse System
 No coupling – Drawbacks:
First, a DB system provides a great deal of flexibility
and efficiency at storing, organizing, accessing, and
processing data.
 Without using a DB/DW system, a DM system may
spend a substantial amount of time finding, collecting,
cleaning, and transforming data.
Integration of a Data Mining System with a
Database or Data Warehouse System
 No coupling – Drawbacks:
 Second, there are many tested, scalable algorithms and
data structures implemented in DB and DW systems.
 It is feasible to realize efficient, scalable implementations
using such systems.
 Moreover, most data have been or will be stored in
DB/DW systems
Integration of a Data Mining System with a
Database or Data Warehouse System
 Loose coupling:
 A DM system will use some facilities of a DB or DW
system, fetching data from a data repository managed
by these systems, performing data mining, and then
storing the mining results either in a file or in a
designated place in a database or data warehouse.
Integration of a Data Mining System with a
Database or Data Warehouse System
 Loose coupling:
 Loose coupling is better than no coupling because it
can fetch any portion of data stored in databases or
data warehouses by using query processing, indexing,
and other system facilities.
It thus gains some of the flexibility, efficiency,
and other features provided by such systems.
Integration of a Data Mining System with a
Database or Data Warehouse System
 Loose coupling:
 Many loosely coupled mining systems are main-memory
based, because mining does not exploit the data structures
and query optimization methods provided by DB or DW systems.
 It is difficult for loose coupling to achieve high scalability
and good performance with large data sets.
Integration of a Data Mining System with a
Database or Data Warehouse System
 Semitight coupling:
 Semitight coupling means that besides linking a DM
system to a DB/DW system, efficient implementations of a
few essential data mining primitives can be provided in the
DB/DW system.
 Some frequently used intermediate mining results can be
precomputed and stored in the DB/DW system.
 This design will enhance the performance of a DM system
Integration of a Data Mining System with a
Database or Data Warehouse System
 Tight coupling:
 A DM system is smoothly integrated into the DB/DW system.
 Data mining queries and functions are optimized based on
mining query analysis, data structures, indexing schemes, and
query processing methods of a DB or DW system.
 With further technology advances, DM, DB, and DW systems
will evolve and integrate together as one information system
with multiple functionalities.
Major Issues in Data Mining
Mining methodology and user interaction issues:
 Mining different kinds of knowledge in databases
 Interactive mining of knowledge at multiple levels of
abstraction
 Incorporation of background knowledge
 Data mining query languages and ad hoc data mining
Major Issues in Data Mining
Mining methodology and user interaction
issues:
 Presentation and visualization of data mining
results
 Handling noisy or incomplete data
 Pattern evaluation—the interestingness problem
Major Issues in Data Mining
Performance issues:
 Efficiency and scalability of data mining
algorithms
 Parallel, distributed, and incremental mining
algorithms
Major Issues in Data Mining
 Issues relating to the diversity of database
types:
 Handling of relational and complex types of data
 Mining information from heterogeneous databases
and global information systems
Data Preprocessing
 Why Preprocess the Data?
 Incomplete, noisy, and inconsistent data are
commonplace properties of large real-world
databases and data warehouses.
Data Preprocessing
 Why Preprocess the Data?
 Incomplete data can occur for a number of
reasons.
 Attributes of interest may not always be available
 Other data may not be included simply because it was
not considered important at the time of entry
Data Preprocessing
 Why Preprocess the Data?
 Incomplete data can occur for a number of reasons.
 Relevant data may not be recorded due to a
misunderstanding, or because of equipment malfunctions.
 Data that were inconsistent with other recorded data may
have been deleted.
 Missing data, particularly for tuples with missing values for
some attributes, may need to be inferred.
Data Preprocessing
 Why Preprocess the Data?
 There are many possible reasons for noisy data.
 The data collection instruments used may be faulty.
 There may have been human or computer errors occurring at data
entry.
 Errors in data transmission can also occur.
 Incorrect data may also result from inconsistencies in naming
conventions or data codes used, or inconsistent formats for input
fields, such as date.
Data Preprocessing
 Descriptive Data Summarization
 Descriptive data summarization techniques can be used to
identify the typical properties of your data and highlight
which data values should be treated as noise or outliers.
 For many data preprocessing tasks, users would like to
learn about data characteristics regarding both central
tendency and dispersion of the data.
Data Preprocessing
 Descriptive Data Summarization
 Measures of central tendency include mean,
median, mode, and midrange, while measures of
data dispersion include quartiles, interquartile
range (IQR), and variance.
Data Preprocessing
 Descriptive Data Summarization - Measures
of central tendency:
 The most common and most effective numerical
measure of the “center” of a set of data is the
(arithmetic) mean.
Data Preprocessing
 Descriptive Data Summarization - Measures of central
tendency:
 A distributive measure is a measure (i.e., function) that can
be computed for a given data set by partitioning the data
into smaller subsets, computing the measure for each
subset, and then merging the results in order to arrive at the
measure’s value for the original (entire) data set.
 Example: sum(), count(), max() and min()
Data Preprocessing
 Descriptive Data Summarization - Measures of
central tendency:
 An algebraic measure is a measure that can be
computed by applying an algebraic function to one or
more distributive measures.
 Example: average() is an algebraic measure because it
can be computed by sum()/count()
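The relationship between distributive and algebraic measures can be sketched by computing sum() and count() per partition, merging the partial results, and then deriving the mean (the data values and the three-way partition are arbitrary choices for illustration):

```python
# Partition the data set into smaller subsets.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
partitions = [data[0:3], data[3:6], data[6:9]]

# Distributive measures: compute (sum, count) per subset, then merge.
partial = [(sum(p), len(p)) for p in partitions]
total_sum = sum(s for s, _ in partial)
total_count = sum(c for _, c in partial)

# Algebraic measure: average() derived as sum()/count().
mean = total_sum / total_count
print(mean)  # 20.0
```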
Data Preprocessing
 Descriptive Data Summarization - Measures of
central tendency:
 Although the mean is the single most useful quantity for
describing a data set, it is not always the best way of
measuring the center of the data.
 A major problem with the mean is its sensitivity to
extreme (e.g., outlier) values.
Data Preprocessing
 Descriptive Data Summarization - Measures of central
tendency:
 For skewed (asymmetric) data, a better measure of the
center of data is the median.
 Suppose that a given data set of N distinct values is sorted
in numerical order. If N is odd, then the median is the
middle value of the ordered set; otherwise (i.e., if N is
even), the median is the average of the middle two values
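A minimal sketch of this odd/even rule (the input lists are arbitrary examples):

```python
def median(values):
    """Middle value if N is odd; average of the two middle values if N is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([4, 8, 15, 21, 21, 24, 25, 28, 34]))  # 21   (N = 9, odd)
print(median([4, 8, 15, 21]))                      # 11.5 (N = 4, even)
```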
Data Preprocessing
 Descriptive Data Summarization - Measures of central
tendency:
 A holistic measure is a measure that must be computed on the
entire data set as a whole.
 It cannot be computed by partitioning the given data into subsets
and merging the values obtained for the measure in each subset.
 Example: Median
Data Preprocessing
 Descriptive Data Summarization - Measures of central
tendency:
 Another measure of central tendency is the mode. The
mode for a set of data is the value that occurs most
frequently in the set.
 It is possible for the greatest frequency to correspond to
several different values, which results in more than one
mode.
Data Preprocessing
 Descriptive Data Summarization - Measures of
central tendency:
 Data sets with one, two, or three modes are
respectively called unimodal, bimodal, and trimodal.
 In general, a data set with two or more modes is
multimodal. At the other extreme, if each data value
occurs only once, then there is no mode.
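Python's statistics.multimode (available since Python 3.8) returns every value tied for the greatest frequency, which covers the unimodal, multimodal, and no-single-mode cases described above (the input lists are arbitrary examples):

```python
from statistics import multimode

print(multimode([4, 8, 15, 21, 21, 24]))  # [21]         -> unimodal
print(multimode([4, 4, 8, 21, 21, 24]))   # [4, 21]      -> bimodal
print(multimode([4, 8, 15]))              # [4, 8, 15]   -> each value occurs once
```

When every value occurs exactly once, multimode returns all of them, matching the observation that such a data set has no mode.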
Data Preprocessing
 Descriptive Data Summarization - Measuring the Dispersion of
Data:
 The degree to which numerical data tend to spread is called the
dispersion, or variance of the data.
 The most common measures of data dispersion are range, the five-
number summary (based on quartiles), the interquartile range, and the
standard deviation.
 Boxplots can be plotted based on the five-number summary and are a
useful tool for identifying outliers.
Data Preprocessing
 Descriptive Data Summarization - Measuring the
Dispersion of Data:
The range of the set is the difference between the
largest (max()) and smallest (min()) values.
 The kth percentile of a set of data in numerical order is
the value xi having the property that k percent of the
data entries lie at or below xi.
Data Preprocessing
 Descriptive Data Summarization - Measuring the
Dispersion of Data:
The most commonly used percentiles other than the
median are quartiles.
The first quartile, denoted by Q1, is the 25th
percentile; the third quartile, denoted by Q3, is the
75th percentile.
Data Preprocessing
 Descriptive Data Summarization - Measuring the
Dispersion of Data:
The distance between the first and third quartiles is a
simple measure of spread that gives the range covered
by the middle half of the data. This distance is called
the interquartile range (IQR) and is defined as
 IQR = Q3 – Q1
Data Preprocessing
 Descriptive Data Summarization - Measuring the
Dispersion of Data:
The five-number summary of a distribution consists of
the median, the quartiles Q1 and Q3, and the smallest
and largest individual observations, written in the
order
Minimum; Q1; Median; Q3; Maximum
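The quartiles, IQR, and five-number summary can be computed with the standard library. The "inclusive" method below interpolates over the sorted data, one common convention consistent with the percentile definition given earlier (the data values are an arbitrary example):

```python
from statistics import quantiles

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# Cut points at the 25th, 50th, and 75th percentiles.
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Five-number summary: Minimum; Q1; Median; Q3; Maximum.
five_number = (min(data), q1, med, q3, max(data))
print(five_number, iqr)  # (4, 15.0, 21.0, 25.0, 34) 10.0
```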
Data Preprocessing
 Descriptive Data Summarization - Measuring the
Dispersion of Data:
 Boxplots are a popular way of visualizing a distribution. A
boxplot incorporates the five-number summary as follows:
 Typically, the ends of the box are at the quartiles, so that the box
length is the interquartile range, IQR.
 The median is marked by a line within the box.
 Two lines (called whiskers) outside the box extend to the smallest
(Minimum) and largest (Maximum) observations
Data Preprocessing
 Descriptive Data Summarization - Measuring
the Dispersion of Data:
Data Preprocessing
 Graphic Displays of Basic Descriptive Data
Summaries:
 Plotting histograms, or frequency histograms, is a
graphical method for summarizing the distribution of a
given attribute.
 A histogram for an attribute A partitions the data
distribution of A into disjoint subsets, or buckets
Data Preprocessing
 Graphic Displays of Basic Descriptive Data
Summaries:
Typically, the width of each bucket is uniform.
Each bucket is represented by a rectangle whose
height is equal to the count or relative frequency of
the values at the bucket.
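A minimal text sketch of equal-width bucketing (the data and the bucket width of 10 are arbitrary choices for illustration):

```python
from collections import Counter

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width = 10  # uniform bucket width

# Map each value to the lower edge of its bucket: [0,10), [10,20), ...
counts = Counter((v // width) * width for v in data)

# Each bucket's "bar" height is its count.
for lo in sorted(counts):
    print(f"[{lo},{lo + width}): {'#' * counts[lo]}")
```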
Data Preprocessing
 Graphic Displays of Basic Descriptive Data
Summaries - Histogram
Data Preprocessing
 Graphic Displays of Basic Descriptive Data
Summaries - Quantile plot:
 A quantile plot is a simple and effective way to have a
first look at a univariate data distribution.
 First, it displays all of the data for the given attribute.
 Second, it plots quantile information.
Data Preprocessing
 Graphic Displays of Basic Descriptive Data
Summaries – Scatter plot:
 A scatter plot is among the most effective graphical methods
for determining whether there appears to be a relationship,
pattern, or trend between two numerical attributes.
 To construct a scatter plot, each pair of values is treated as
a pair of coordinates in an algebraic sense and plotted as
points in the plane.
Data Preprocessing
 Graphic Displays of Basic Descriptive Data
Summaries – Scatter plot:
Data Cleaning
 Real-world data tend to be incomplete, noisy, and
inconsistent.
 Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out noise
while identifying outliers, and correct
inconsistencies in the data.
Data Cleaning
 Missing Values:
 Ignore the tuple: This is usually done when the class label
is missing. This method is not very effective, unless the
tuple contains several attributes with missing values.
 Fill in the missing value manually: In general, this
approach is time-consuming and may not be feasible given
a large data set with many missing values.
Data Cleaning
 Missing Values:
 Use a global constant to fill in the missing value: Replace
all missing attribute values by the same constant, such as a
label like “Unknown”.
 Use the attribute mean to fill in the missing value: For
example, suppose that the average income of customers is
56,000. Use this value to replace the missing value for
income
Data Cleaning
 Missing Values:
 Use the attribute mean for all samples belonging to
the same class as the given tuple: For example, if
classifying customers according to credit risk, replace
the missing value with the average income value for
customers in the same credit risk category as that of
the given tuple.
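The global-mean and class-conditional-mean strategies can be sketched on a toy table (the credit-risk classes and income values are hypothetical):

```python
# Toy tuples: (credit_risk, income); None marks a missing value.
rows = [("low", 56000.0), ("low", 64000.0), ("high", 30000.0), ("high", None)]

# Global attribute mean over the known values.
known = [inc for _, inc in rows if inc is not None]
global_mean = sum(known) / len(known)  # 50000.0

# Per-class means for the class-conditional strategy.
by_class = {}
for risk, inc in rows:
    if inc is not None:
        by_class.setdefault(risk, []).append(inc)
class_mean = {r: sum(v) / len(v) for r, v in by_class.items()}

filled_global = [inc if inc is not None else global_mean for _, inc in rows]
filled_class = [inc if inc is not None else class_mean[r] for r, inc in rows]
print(filled_global)  # [56000.0, 64000.0, 30000.0, 50000.0]
print(filled_class)   # [56000.0, 64000.0, 30000.0, 30000.0]
```

Filling with the class-conditional mean keeps the imputed value consistent with tuples in the same credit-risk category.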
Data Cleaning
 Missing Values:
 Use the most probable value to fill in the missing
value: For example, using the other customer
attributes in your data set, you may construct a
decision tree to predict the missing values for
income.
Data Cleaning
 Noisy Data:
 Noise is a random error or variance in a measured
variable.
 Methods:
 Binning, Regression and Clustering:
Data Cleaning
 Noisy Data – Binning:
 Binning methods smooth a sorted data value by consulting
its “neighborhood,” that is, the values around it.
 The sorted values are distributed into a number of
“buckets,” or bins.
 Because binning methods consult the neighborhood of
values, they perform local smoothing
Data Cleaning
 Noisy Data – Binning:
 Example: Data: 4,8,15,21,21,24,25,28,34
 Partition into (equal-frequency) bins:
 Bin 1: 4,8,15
 Bin 2: 21,21,24
 Bin 3: 25,28,34
Data Cleaning
 Noisy Data – Binning:
 Example: Data: 4,8,15,21,21,24,25,28,34
 Smoothing by bin means: each value in a bin is
replaced by the mean value of the bin
 Bin 1: 9,9,9
 Bin 2: 22,22,22
 Bin 3: 29,29,29
Data Cleaning
 Noisy Data – Binning:
 Example: Data: 4,8,15,21,21,24,25,28,34
 Smoothing by bin boundaries: The minimum and maximum values
in a given bin are identified as the bin boundaries. Each bin value is
then replaced by the closest boundary value
 Bin 1: 4,4,15
 Bin 2: 21,21,24
 Bin 3: 25,25,34
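The three binning examples above can be reproduced with a short Python sketch (equal-frequency partitioning assumes the data are already sorted):

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins buckets of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin by that bin's mean.
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by whichever bin boundary (min or max) is closer.
    return [[min((b[0], b[-1]), key=lambda bound: abs(v - bound)) for v in b]
            for b in bins]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # the slide's example
bins = equal_frequency_bins(data, 3)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

Running `smooth_by_means(bins)` and `smooth_by_boundaries(bins)` reproduces the smoothed bins shown on the slides.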
Data Cleaning
 Noisy Data – Regression:
 Data can be smoothed by fitting the data to a
function, such as with regression.
 Linear regression involves finding the “best” line
to fit two attributes (or variables), so that one
attribute can be used to predict the other.
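A minimal least-squares sketch of this idea, assuming one attribute x is used to predict (and thereby smooth) another attribute y:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

def smooth(xs, ys):
    # Replace each noisy y by its fitted value on the regression line.
    a, b = fit_line(xs, ys)
    return [a * x + b for x in xs]

a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])  # a ≈ 1.94, b ≈ 0.15
```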
Data Cleaning
 Noisy Data – Clustering:
 Outliers may be detected by clustering, where
similar values are organized into groups, or
“clusters.”
Intuitively, values that fall outside of the set of
clusters may be considered outliers
Data Cleaning
 Data Cleaning as a Process:
 Missing values, noise, and inconsistencies
contribute to inaccurate data.
 The first step in data cleaning as a process is
discrepancy detection.
Data Cleaning
 Data Cleaning as a Process - Discrepancies can be caused by
several factors,
 poorly designed data entry forms that have many optional fields
 human error in data entry
 deliberate errors (e.g., respondents not wanting to disclose information
about themselves)
 data decay (e.g., outdated addresses)
 Errors in instrumentation devices that record data, and system errors
Data Cleaning
 Data Cleaning as a Process – Discrepancy Detection
Tools:
 Data scrubbing tools use simple domain knowledge (e.g.,
knowledge of postal addresses, and spell-checking) to
detect errors and make corrections in the data
 Data auditing tools find discrepancies by analyzing the
data to discover rules and relationships, and detecting data
that violate such conditions.
Data Cleaning
 Data Cleaning as a Process – Discrepancy Detection
Tools:
 Data migration tools allow simple transformations to be
specified, such as to replace the string “gender” by “sex”
 ETL (extraction/transformation/loading) tools allow users
to specify transforms through a graphical user interface
(GUI).
Data Integration and
Transformation
 Data Integration:
 It combines data from multiple sources into a
coherent data store, as in data warehousing. These
sources may include multiple databases, data
cubes, or flat files.
Data Integration and
Transformation
 Data Integration – Issues:
 Schema integration and object matching can be tricky.
 For example, how can the data analyst or the
computer be sure that customer_id in one database and
cust_number in another refer to the same attribute?
 Metadata can be used to help avoid errors in schema
integration
Data Integration and
Transformation
 Data Integration – Issues:
Metadata for each attribute includes the name,
meaning, data type, and range of values permitted
for the attribute, as well as null rules for handling
blank, zero, or null values.
Data Integration and
Transformation
 Data Integration – Issues:
 Redundancy is another important issue.
 Inconsistencies in attribute or dimension naming can
also cause redundancies in the resulting data set.
 Some redundancies can be detected by correlation
analysis.
Data Integration and
Transformation
 Data Integration – Issues:
 A third important issue in data integration is the
detection and resolution of data value conflicts.
 For a hotel chain, the price of rooms in different cities
may involve not only different currencies but also
different services (such as free breakfast) and taxes.
Data Integration and
Transformation
 Data Integration:
 The semantic heterogeneity and structure of data pose
great challenges in data integration.
 Careful integration of the data from multiple sources can
help reduce and avoid redundancies and inconsistencies in
the resulting data set.
 This can help improve the accuracy and speed of the
subsequent mining process.
Data Integration and
Transformation
 Data Transformation:
 The data are transformed or consolidated into
forms appropriate for mining.
Data Integration and
Transformation
 Data Transformation:
 Smoothing, which works to remove noise from the data. Such
techniques include binning, regression, and clustering.
 Aggregation, where summary or aggregation operations are
applied to the data.
 Generalization of the data, where low-level or “primitive” (raw)
data are replaced by higher-level concepts through the use of
concept hierarchies
Data Integration and
Transformation
 Data Transformation:
 Normalization, where the attribute data are scaled so
as to fall within a small specified range, such as -1.0 to
1.0, or 0.0 to 1.0.
 Attribute construction (or feature construction),where
new attributes are constructed and added from the
given set of attributes to help the mining process.
Data Integration and
Transformation
 Data Transformation - There are many
methods for data normalization.
 Min-max normalization,
 Z-score normalization,
 Normalization by decimal scaling.
Data Integration and
Transformation
 Min-max normalization:
 Min-max normalization preserves the
relationships among the original data values.
 It will encounter an “out-of-bounds” error if a
future input case for normalization falls outside of
the original data range for A.
Data Integration and
Transformation
 Min-max normalization:
 A value, v, of A is mapped to the range [new_min, new_max] by computing
v' = (v - min_A) / (max_A - min_A) × (new_max - new_min) + new_min
 Suppose that the minimum and maximum values for
the attribute income are $12,000 and $98,000,
respectively, and that income is mapped to the range [0.0, 1.0].
 By min-max normalization, a value of $73,600 for
income is transformed to (73,600 - 12,000) / (98,000 - 12,000) = 0.716
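The mapping can be written as a one-line Python function, applied here to the slide's income figures:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from the observed range [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

v = min_max(73600, 12000, 98000)  # income example: ≈ 0.716
```

Note that an input outside [min_a, max_a] lands outside the target range, which is the "out-of-bounds" problem mentioned above.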
Data Integration and
Transformation
 Z-score normalization:
 The values for an attribute, A, are normalized
based on the mean and standard deviation of A. A
value, v, of A is normalized to v' by computing
v' = (v - mean_A) / stddev_A
Data Integration and
Transformation
 Z-score normalization:
The mean and standard deviation of the values for
the attribute income are $54,000 and $16,000,
respectively. A value of $73,600 for income is
transformed to 1.225
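In Python, using the slide's income mean and standard deviation:

```python
def z_score(v, mean_a, std_a):
    """Normalize v by the attribute's mean and standard deviation."""
    return (v - mean_a) / std_a

v = z_score(73600, 54000, 16000)  # (73,600 - 54,000) / 16,000 = 1.225
```

Unlike min-max normalization, z-score normalization has no fixed output range, so it does not suffer from out-of-bounds errors on future values.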
Data Integration and
Transformation
 Normalization by decimal scaling
 Normalizes by moving the decimal point of values
of attribute A: v' = v / 10^j.
 The number of decimal places moved, j, is the smallest
integer such that the maximum absolute value of v' is less than 1.
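A short sketch of decimal scaling; the sample values -986 and 917 are illustrative:

```python
def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j

scaled, j = decimal_scale([-986, 917])  # j = 3: divide by 1000
```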
Data Integration and
Transformation
 Normalization can change the original data quite a bit.
 It is also necessary to save the normalization
parameters (such as the mean and standard deviation if
using z-score normalization) so that future data can be
normalized in a uniform manner.
Data Integration and
Transformation
 Attribute Construction:
 New attributes are constructed from the given
attributes and added in order to help improve the
accuracy and understanding of structure in high-
dimensional data.
 For example, a new attribute area can be constructed
from the existing attributes height and width.
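The area example amounts to one line per record (the records here are hypothetical):

```python
# Construct a new attribute area from the existing height and width.
records = [{"height": 2.0, "width": 3.0}, {"height": 4.0, "width": 1.5}]
for r in records:
    r["area"] = r["height"] * r["width"]
```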
Data Reduction
 To obtain a reduced representation of the data
set that is much smaller in volume, yet closely
maintains the integrity of the original data.
Data Reduction
 Strategies
 Data cube aggregation
 Attribute subset selection
 Dimensionality reduction
 Numerosity reduction
 Discretization and concept hierarchy generation
Data Reduction
 Strategies - Data cube aggregation
 For example, sales data recorded per quarter can be
aggregated so that the resulting data summarize the
total sales per year instead of per quarter.
Data Reduction
 Strategies - Data cube aggregation
 The cube created at the lowest level of abstraction is
referred to as the base cuboid.
 The base cuboid should correspond to an individual
entity of interest, such as sales or customer.
 In other words, the lowest level should be usable, or
useful for the analysis.
Data Reduction
 Strategies - Data cube aggregation
 A cube at the highest level of abstraction is the
apex cuboid.
 The apex cuboid gives a single total, summarized
over all dimensions.
Data Reduction
 Strategies - Attribute Subset Selection
 Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes (or
dimensions).
 To find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as
possible to the original distribution obtained using all
attributes.
Data Reduction
 Strategies - Attribute Subset Selection
 Stepwise forward selection: The procedure starts with
an empty set of attributes as the reduced set.
The best of the original attributes is determined and
added to the reduced set.
At each subsequent iteration or step, the best of the
remaining original attributes is added to the set.
Data Reduction
 Strategies - Attribute Subset Selection
 Stepwise backward elimination: The procedure
starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.
Data Reduction
 Strategies - Attribute Subset Selection
 Combination of forward selection and backward
elimination: The stepwise forward selection and
backward elimination methods can be combined so
that, at each step, the procedure selects the best
attribute and removes the worst from among the
remaining attributes.
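Greedy forward selection can be sketched as follows. The scoring heuristic used here (absolute Pearson correlation of each attribute with a target) is an illustrative assumption; real systems typically use tests of statistical significance or classifier accuracy instead:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def forward_select(attrs, target, k):
    """At each step, add the attribute that scores best against the target."""
    remaining, chosen = dict(attrs), []
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda a: abs(pearson(remaining[a], target)))
        chosen.append(best)
        del remaining[best]
    return chosen

# Toy attribute columns and target (hypothetical data).
attrs = {"a": [1, 2, 3, 4], "b": [4, 1, 3, 2], "c": [2, 4, 6, 8]}
target = [10, 20, 30, 40]
```

Backward elimination is the mirror image: start with all attributes and repeatedly drop the worst-scoring one.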
Data Reduction
 Strategies - Attribute Subset Selection
 Decision tree induction: Decision tree algorithms, such as ID3,
C4.5, and CART, were originally intended for classification.
 Decision tree induction constructs a flowchart like structure
where each internal (nonleaf) node denotes a test on an attribute,
each branch corresponds to an outcome of the test, and each
external (leaf) node denotes a class prediction.
Data Reduction
 Dimensionality Reduction:
 In dimensionality reduction, data encoding or
transformations are applied so as to obtain a reduced or
“compressed” representation of the original data.
 Methods: wavelet transforms and principal
components analysis.
Data Reduction
 Dimensionality Reduction - Wavelet
Transforms:
 The discrete wavelet transform (DWT) is a linear
signal processing technique that, when applied to a
data vector X, transforms it to a numerically
different vector, X′, of wavelet coefficients.
Data Reduction
 Dimensionality Reduction - Wavelet Transforms:
 A compressed approximation of the data can be retained by storing
only a small fraction of the strongest of the wavelet coefficients.
 For example, all wavelet coefficients larger than some user-specified
threshold can be retained. All other coefficients are set to 0.
 The technique also works to remove noise without smoothing out the
main features of the data, making it effective for data cleaning as well.
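One level of the Haar transform, the simplest DWT, can be written in a few lines; thresholding then keeps only the strongest coefficients:

```python
from math import sqrt

def haar_step(values):
    """One level of the Haar DWT: normalized pairwise sums and differences."""
    pairs = list(zip(values[::2], values[1::2]))
    avgs = [(a + b) / sqrt(2) for a, b in pairs]   # smooth approximation
    diffs = [(a - b) / sqrt(2) for a, b in pairs]  # detail coefficients
    return avgs + diffs

def threshold(coeffs, t):
    """Keep only coefficients at least t in magnitude; set the rest to 0."""
    return [c if abs(c) >= t else 0.0 for c in coeffs]

x = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]  # toy data vector
coeffs = haar_step(x)
```

Because the transform is orthonormal, the coefficients carry the same total energy as the original vector, which is why discarding the small coefficients loses little information.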
Data Reduction
 Dimensionality Reduction - Principal
Components Analysis:
 It searches for k n-dimensional orthogonal vectors
(where k ≤ n) that can best be used to represent the data.
 The original data are thus projected onto a much
smaller space, resulting in dimensionality reduction.
Data Reduction
 Numerosity Reduction:
 For parametric methods, a model is used to estimate the
data, so that typically only the data parameters need to be
stored, instead of the actual data.
 Nonparametric methods for storing reduced
representations of the data include histograms, clustering,
and sampling.
Data Reduction
 Numerosity Reduction – Histograms:
 A histogram for an attribute, A, partitions the data distribution of
A into disjoint subsets, or buckets.
 If each bucket represents only a single attribute-value/frequency
pair, the buckets are called singleton buckets.
 Often, buckets instead represent continuous ranges for the given
attribute.
Data Reduction
Histograms - Partitioning rules:
 Equal-width: In an equal-width histogram, the width
of each bucket range is uniform
 Equal-frequency (or equidepth): In an equal-
frequency histogram, the buckets are created so that,
roughly, the frequency of each bucket is constant
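An equal-width histogram can be built with a few lines of Python (the price data are hypothetical):

```python
def equal_width_buckets(values, k):
    """Counts for k buckets of uniform width spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bucket
        counts[i] += 1
    return counts

prices = [1, 1, 5, 5, 5, 8, 8, 14, 14, 15, 18, 20, 21, 25, 28, 30]
counts = equal_width_buckets(prices, 3)
```

Storing only the k bucket counts, rather than all N values, is what makes the histogram a reduced representation of the data.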
Data Reduction
 Histograms - Partitioning rules:
 V-Optimal: Of all possible histograms for a given number
of buckets, the V-Optimal histogram is the one with the least
variance. Histogram variance is a weighted sum of the original
values that each bucket represents, where bucket weight is
equal to the number of values in the bucket.
 MaxDiff: Consider the difference between each pair of adjacent
values. A bucket boundary is established between each pair
for the pairs having the β - 1 largest differences, where β is the
user-specified number of buckets.
Data Reduction
 Numerosity Reduction – Sampling:
 Sampling can be used as a data reduction
technique because it allows a large data set to be
represented by a much smaller random sample (or
subset) of the data.
Data Reduction
Sampling:
 Simple random sample without replacement
(SRSWOR) of size s:
 s of the N tuples in D are drawn (s < N); all tuples
are equally likely to be sampled.
 Cluster sample: If the tuples in D are grouped into M
mutually disjoint “clusters,” then an SRS of s clusters
can be obtained, where s < M.
Data Reduction
Sampling:
 Stratified sample: If D is divided into mutually
disjoint parts called strata, a stratified sample of D
is generated by obtaining an SRS at each stratum.
 This helps ensure a representative sample,
especially when the data are skewed.
Data Reduction
 Sampling:
 An advantage of sampling for data reduction is that the cost of
obtaining a sample is proportional to the size of the sample, s, as
opposed to N, the data set size.
 When applied to data reduction, sampling is most commonly
used to estimate the answer to an aggregate query. It is possible
(using the central limit theorem) to determine a sufficient sample
size for estimating a given function within a specified degree of
error.
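SRSWOR and stratified sampling can be sketched with the standard library's random module; the age_group stratum and the records are hypothetical:

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def srswor(data, s):
    """Simple random sample without replacement of size s."""
    return random.sample(data, s)

def stratified(data, key, frac):
    """Take an SRS of the given fraction from every stratum."""
    strata = {}
    for rec in data:
        strata.setdefault(key(rec), []).append(rec)
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, max(1, int(len(group) * frac))))
    return sample

people = [{"id": i, "age_group": "young" if i < 6 else "senior"}
          for i in range(10)]
sub = stratified(people, key=lambda r: r["age_group"], frac=0.5)
```

Because every stratum contributes to the sample, skewed data (here, fewer seniors than young customers) remain represented, which plain SRSWOR cannot guarantee.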

More Related Content

Similar to 20IT501_DWDM_PPT_Unit_II.ppt

Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345AkhilSinghal21
 
Introduction-to-Databases.pptx
Introduction-to-Databases.pptxIntroduction-to-Databases.pptx
Introduction-to-Databases.pptxIvanDarrylLopez
 
Data mining query language
Data mining query languageData mining query language
Data mining query languageGowriLatha1
 
11667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect411667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect4ambujm
 
UNIT - 1 Part 2: Data Warehousing and Data Mining
UNIT - 1 Part 2: Data Warehousing and Data MiningUNIT - 1 Part 2: Data Warehousing and Data Mining
UNIT - 1 Part 2: Data Warehousing and Data MiningNandakumar P
 
Unit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxUnit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxHarsha Patel
 
DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT huma sh
 
Data warehouse
Data warehouseData warehouse
Data warehouseRajThakuri
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data miningUjjawal
 
11666 Bitt I 2008 Lect3
11666 Bitt I 2008 Lect311666 Bitt I 2008 Lect3
11666 Bitt I 2008 Lect3ambujm
 
Data Mining Presentation on Science Day 2023
Data Mining Presentation on Science Day 2023Data Mining Presentation on Science Day 2023
Data Mining Presentation on Science Day 2023SakshiTiwari490123
 

Similar to 20IT501_DWDM_PPT_Unit_II.ppt (20)

data mining
data miningdata mining
data mining
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345
 
Database Concepts
Database ConceptsDatabase Concepts
Database Concepts
 
Introduction-to-Databases.pptx
Introduction-to-Databases.pptxIntroduction-to-Databases.pptx
Introduction-to-Databases.pptx
 
Data mining query language
Data mining query languageData mining query language
Data mining query language
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
11667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect411667 Bitt I 2008 Lect4
11667 Bitt I 2008 Lect4
 
Introduction to DataMining
Introduction to DataMiningIntroduction to DataMining
Introduction to DataMining
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Dwbasics
DwbasicsDwbasics
Dwbasics
 
UNIT - 1 Part 2: Data Warehousing and Data Mining
UNIT - 1 Part 2: Data Warehousing and Data MiningUNIT - 1 Part 2: Data Warehousing and Data Mining
UNIT - 1 Part 2: Data Warehousing and Data Mining
 
Unit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptxUnit-IV-Introduction to Data Warehousing .pptx
Unit-IV-Introduction to Data Warehousing .pptx
 
DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
Chapter 2 - EMTE.pptx
Chapter 2 - EMTE.pptxChapter 2 - EMTE.pptx
Chapter 2 - EMTE.pptx
 
11666 Bitt I 2008 Lect3
11666 Bitt I 2008 Lect311666 Bitt I 2008 Lect3
11666 Bitt I 2008 Lect3
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 
Data Mining Presentation on Science Day 2023
Data Mining Presentation on Science Day 2023Data Mining Presentation on Science Day 2023
Data Mining Presentation on Science Day 2023
 

Recently uploaded

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 

Recently uploaded (20)

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 

20IT501_DWDM_PPT_Unit_II.ppt

  • 1. 20IT501 – Data Warehousing and Data Mining III Year / V Semester
  • 2. UNIT II- DATA MINING Introduction – Data – Types of Data – Data Mining Functionalities – Interestingness of Patterns – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Issues –Data Preprocessing
  • 3. Data Mining  Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering patterns and other valuable information from large data sets.  Data mining has improved organizational decision-making through insightful data analyses.  In addition, many other terms have a similar meaning to data mining—for example, knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging
  • 5. Data Mining The knowledge discovery process: Data cleaning (to remove noise and inconsistent data) Data integration (where multiple data sources may be combined) Data selection (where data relevant to the analysis task are retrieved from the database)
  • 6. Data Mining The knowledge discovery process: Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations) Data mining (an essential process where intelligent methods are applied to extract data patterns)
  • 7. Data Mining The knowledge discovery process: Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures) Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
  • 8. Data Mining The knowledge discovery process:  Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining. Data Mining uncovers hidden patterns for evaluation.  Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
  • 9. Data Mining The knowledge discovery process - The data sources can include Databases, Data warehouses, The Web, Other information repositories, or data that are streamed into the system dynamically.
  • 11. Data Mining Architecture  Components - Database, data warehouse, WWW, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
  • 12. Data Mining Architecture  Components - Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
  • 13. Data Mining Architecture  Components - Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.  Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
  • 14. Data Mining Architecture  Components - Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
  • 15. Data Mining Architecture  Components - Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.  the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.
  • 16. Data Mining Architecture  Components - User interface: This module communicates between users and the data mining system. The user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results
  • 17. Types of Data  Database Data (or) Relational Databases:  A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.  The software programs involve mechanisms for the definition of database structures; for data storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and security of the information stored, despite system crashes or attempts at unauthorized access.
  • 18. Types of Data  Database Data (or) Relational Databases:  A relational database is a collection of tables, each ofwhich is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).  Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
  • 19. Types of Data  Database Data (or) Relational Databases:  A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.  Relational data can be accessed by database queries written in a relational query language, such as SQL, or with the assistance of graphical user interfaces.
  • 20. Types of Data  Database Data (or) Relational Databases:  Example: Data mining systems can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information. Data mining systems may also detect deviations—that is, items with sales that are far from those expected in comparison with the previous year. Such deviations can then be further investigated.
  • 21. Types of Data  Database Data (or) Relational Databases:  Relational databases are one of the most commonly available and rich information repositories, and thus they are a major data form in our study of data mining.
  • 22. Types of Data  Data Warehouses:  A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
  • 23. Types of Data  Data Warehouses:  To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored to provide information from a historical perspective (such as from the past 5–10 years) and are typically summarized.
  • 24. Types of Data  Data Warehouses:  A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount.  A data cube provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.
  • 25. Types of Data  Transactional Databases  In general, a transactional database consists of a file where each record represents a transaction.  A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction
  • 26. Types of Data  Other Kinds of Data  Time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence data),  Data streams (e.g., video surveillance and sensor data, which are continuously transmitted), spatial data (e.g., maps),  Engineering design data (e.g., the design of buildings, system components, or integrated circuits),
  • 27. Types of Data  Other Kinds of Data Hypertext and multimedia data (including text, image, video, and audio data), Graph and networked data (e.g., social and information networks), and The Web (a huge, widely distributed information repository made available by the Internet).
  • 28. Data Mining Functionalities  Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.  Descriptive mining tasks characterize properties of the data in a target data set.  Predictive mining tasks perform induction on the current data in order to make predictions.
  • 29. Data Mining Functionalities  Concept/Class Description: Characterization and Discrimination:  Data can be associated with classes or concepts.  Example: Student Reg. No and Student Name belongs to Student class.  It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms.
  • 30. Data Mining Functionalities  Concept/Class Description: Characterization and Discrimination:  Data characterization, by summarizing the data of the class under study (often called the target class) in general terms.  Data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or  Both data characterization and discrimination.
  • 31. Data Mining Functionalities  Concept/Class Description: Characterization and Discrimination:  The data cube–based OLAP roll-up operation can be used to perform user-controlled data summarization along a specified dimension.  The output of data characterization can be presented in various forms.  Examples: Pie and bar charts, curves and multidimensional data cubes
  • 32. Data Mining Functionalities  Concept/Class Description: Characterization and Discrimination:  In Data discrimination, The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through database queries.  For example, the user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.
  • 33. Data Mining Functionalities Mining Frequent Patterns, Associations, and Correlations:  Frequent patterns, as the name suggests, are patterns that occur frequently in data.  A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread.
  • 34. Data Mining Functionalities Mining Frequent Patterns, Associations, and Correlations:  A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
  • 35. Data Mining Functionalities Classification and Prediction  Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
  • 36. Data Mining Functionalities Classification and Prediction  The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.  Classification predicts categorical (discrete, unordered) labels, while regression models predict continuous-valued functions.
  • 37. Data Mining Functionalities Classification and Prediction  Regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels. The term prediction refers to both numeric prediction and class label prediction.
  • 38. Data Mining Functionalities  Cluster Analysis  Clustering analyzes data objects without consulting class labels. Clustering can be used to generate class labels for a group of data. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
  • 39. Data Mining Functionalities  Cluster Analysis  Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters.
  • 40. Data Mining Functionalities  Outlier Analysis:  A data set may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers.  Many data mining methods discard outliers as noise or exceptions.
  • 41. Data Mining Functionalities  Outlier Analysis:  However, in some applications (e.g., fraud detection) the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier analysis or outlier mining.
  • 42. Data Mining Functionalities  Evolution Analysis:  Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.  Example: Stock market (time-series) data
  • 43. Interestingness of Patterns  A pattern is interesting if it is Easily understood by humans, Valid on new or test data with some degree of certainty, Potentially useful and Novel  An interesting pattern represents knowledge
  • 44. Interestingness of Patterns  Several objective measures of pattern interestingness exist.  An objective measure for association rules of the form X ⇒ Y is rule support.  Another objective measure for association rules is confidence.
  • 45. Classification of Data Mining Systems
  • 46. Classification of Data Mining Systems  Classification according to the kinds of databases mined:  Database systems can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.
  • 47. Classification of Data Mining Systems  Classification according to the kinds of databases mined:  For instance, if classifying according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system.  If classifying according to the special types of data handled, we may have a spatial, time-series, text, stream data, multimedia data mining system, or a World Wide Web mining system
  • 48. Classification of Data Mining Systems  Classification according to the kinds of knowledge mined:  It is based on data mining functionalities, such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.
  • 49. Classification of Data Mining Systems  Classification according to the kinds of techniques utilized:  The techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or  The methods of data analysis employed (e.g., database-oriented or data warehouse– oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on).
  • 50. Classification of Data Mining Systems Classification according to the applications adapted:  Data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.  An all-purpose data mining system may not fit domain-specific mining tasks.
  • 51. Data Mining Task Primitives  A data mining task can be specified in the form of a data mining query, which is input to the data mining system.  A data mining query is defined in terms of data mining task primitives.  These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or depths
  • 52. Data Mining Task Primitives The set of task-relevant data to be mined:  This specifies the portions of the database or the set of data in which the user is interested.  This includes the database attributes or data warehouse dimensions of interest.
  • 53. Data Mining Task Primitives The kind of knowledge to be mined:  This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis
  • 54. Data Mining Task Primitives The background knowledge to be used in the discovery process:  This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found.
  • 55. Data Mining Task Primitives The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures.  Example: Support and Confidence
  • 56. Data Mining Task Primitives The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
  • 57. Integration of a Data Mining System with a Database or DataWarehouse System  Good system architecture will facilitate the data mining system:  to make best use of the software environment  accomplish data mining tasks in an efficient and timely manner  interoperate and exchange information with other information systems  be adaptable to users’ diverse requirements, and evolve with time
  • 58. Integration of a Data Mining System with a Database or DataWarehouse System  A critical design question for a data mining (DM) system is how to integrate or couple the DM system with a database (DB) system and/or a data warehouse (DW) system.  If a DM system works as a stand-alone system or is embedded in an application program, there are no DB or DW systems with which it has to communicate.
  • 59. Integration of a Data Mining System with a Database or DataWarehouse System  No coupling:  A DM system will not utilize any function of a DB or DW system.  It may fetch data from a particular source (such as a file system), process data using some data mining algorithms, and then store the mining results in another file.
  • 60. Integration of a Data Mining System with a Database or DataWarehouse System  No coupling – Drawbacks: First, a DB system provides a great deal of flexibility and efficiency in storing, organizing, accessing, and processing data.  Without using a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
  • 61. Integration of a Data Mining System with a Database or DataWarehouse System  No coupling – Drawbacks:  Second, there are many tested, scalable algorithms and data structures implemented in DB and DW systems.  It is feasible to realize efficient, scalable implementations using such systems.  Moreover, most data have been or will be stored in DB/DW systems
  • 62. Integration of a Data Mining System with a Database or DataWarehouse System  Loose coupling:  A DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
  • 63. Integration of a Data Mining System with a Database or DataWarehouse System  Loose coupling:  Loose coupling is better than no coupling because it can fetch any portion of data stored in databases or data warehouses by using query processing, indexing, and other system facilities. It gains the flexibility, efficiency, and other features provided by such systems.
  • 64. Integration of a Data Mining System with a Database or DataWarehouse System  Loose coupling:  Many loosely coupled mining systems are main-memory based. Because mining does not exploit the data structures and query-optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.
  • 65. Integration of a Data Mining System with a Database or DataWarehouse System  Semitight coupling:  Semitight coupling means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.  Some frequently used intermediate mining results can be precomputed and stored in the DB/DW system.  This design will enhance the performance of a DM system
  • 66. Integration of a Data Mining System with a Database or DataWarehouse System  Tight coupling:  A DM system is smoothly integrated into the DB/DW system.  Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system.  With further technology advances, DM, DB, and DW systems will evolve and integrate together as one information system with multiple functionalities.
  • 67. Major Issues in Data Mining Mining methodology and user interaction issues:  Mining different kinds of knowledge in databases  Interactive mining of knowledge at multiple levels of abstraction  Incorporation of background knowledge  Data mining query languages and ad hoc data mining
  • 68. Major Issues in Data Mining Mining methodology and user interaction issues:  Presentation and visualization of data mining results  Handling noisy or incomplete data  Pattern evaluation—the interestingness problem
  • 69. Major Issues in Data Mining Performance issues:  Efficiency and scalability of data mining algorithms  Parallel, distributed, and incremental mining algorithms
  • 70. Major Issues in Data Mining  Issues relating to the diversity of database types:  Handling of relational and complex types of data  Mining information from heterogeneous databases and global information systems
  • 71. Data Preprocessing  Why Preprocess the Data?  Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses.
  • 72. Data Preprocessing  Why Preprocess the Data?  Incomplete data can occur for a number of reasons.  Attributes of interest may not always be available  Other data may not be included simply because it was not considered important at the time of entry
  • 73. Data Preprocessing  Why Preprocess the Data?  Incomplete data can occur for a number of reasons.  Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions.  Data that were inconsistent with other recorded data may have been deleted.  Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
  • 74. Data Preprocessing  Why Preprocess the Data?  There are many possible reasons for noisy data.  The data collection instruments used may be faulty.  There may have been human or computer errors occurring at data entry.  Errors in data transmission can also occur.  Incorrect data may also result from inconsistencies in naming conventions or data codes used, or inconsistent formats for input fields, such as date.
  • 75. Data Preprocessing  Descriptive Data Summarization  Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which data values should be treated as noise or outliers.  For many data preprocessing tasks, users would like to learn about data characteristics regarding both central tendency and dispersion of the data.
  • 76. Data Preprocessing  Descriptive Data Summarization  Measures of central tendency include mean, median, mode, and midrange, while measures of data dispersion include quartiles, interquartile range (IQR), and variance.
  • 77. Data Preprocessing  Descriptive Data Summarization - Measures of central tendency:  The most common and most effective numerical measure of the “center” of a set of data is the (arithmetic) mean.
  • 78. Data Preprocessing  Descriptive Data Summarization - Measures of central tendency:  A distributive measure is a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure’s value for the original (entire) data set.  Example: sum(), count(), max() and min()
  • 79. Data Preprocessing  Descriptive Data Summarization - Measures of central tendency:  An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures.  Example: average() is an algebraic measure because it can be computed by sum()/count()
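The distinction above can be shown concretely: the distributive measures sum() and count() are computed per partition and merged, and the algebraic mean follows from their ratio. The partition values below are illustrative.

```python
# Illustrative data split into three partitions.
partitions = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Distributive: compute per partition, then merge the partial results.
total = sum(sum(p) for p in partitions)   # 27 + 66 + 87 = 180
count = sum(len(p) for p in partitions)   # 3 + 3 + 3 = 9

# Algebraic: mean = sum() / count(), computable from the merged parts.
mean = total / count
print(mean)  # 20.0
```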
  • 80. Data Preprocessing  Descriptive Data Summarization - Measures of central tendency:  Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data.  A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
  • 81. Data Preprocessing  Descriptive Data Summarization - Measures of central tendency:  For skewed (asymmetric) data, a better measure of the center of data is the median.  Suppose that a given data set of N distinct values is sorted in numerical order. If N is odd, then the median is the middle value of the ordered set; otherwise (i.e., if N is even), the median is the average of the middle two values
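The odd/even rule above can be sketched in a few lines; the data values are illustrative.

```python
def median(values):
    """Middle value of the sorted data; for an even count,
    the average of the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([30, 36, 47, 50, 52]))      # odd N -> middle value: 47
print(median([30, 36, 47, 50, 52, 56]))  # even N -> (47 + 50) / 2 = 48.5
```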
  • 82. Data Preprocessing  Descriptive Data Summarization - Measures of central tendency:  A holistic measure is a measure that must be computed on the entire data set as a whole.  It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset.  Example: Median
  • 83. Data Preprocessing  Descriptive Data Summarization - Measures of central tendency:  Another measure of central tendency is the mode. The mode for a set of data is the value that occurs most frequently in the set.  It is possible for the greatest frequency to correspond to several different values, which results in more than one mode.
  • 84. Data Preprocessing  Descriptive Data Summarization - Measures of central tendency:  Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.  In general, a data set with two or more modes is multimodal. At the other extreme, if each data value occurs only once, then there is no mode.
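Python's standard library exposes exactly this multi-mode behavior via statistics.multimode (available since Python 3.8); the data values are illustrative.

```python
from statistics import multimode

# Bimodal: both 2 and 4 occur twice.
print(multimode([1, 2, 2, 3, 4, 4]))  # [2, 4]

# Every value occurs once: all values tie at frequency 1,
# i.e., there is no meaningful mode.
print(multimode([5, 6, 7]))           # [5, 6, 7]
```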
  • 85. Data Preprocessing  Descriptive Data Summarization - Measuring the Dispersion of Data:  The degree to which numerical data tend to spread is called the dispersion, or variance of the data.  The most common measures of data dispersion are range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation.  Boxplots can be plotted based on the five-number summary and are a useful tool for identifying outliers.
  • 86. Data Preprocessing  Descriptive Data Summarization - Measuring the Dispersion of Data: The range of the set is the difference between the largest (max()) and smallest (min()) values.  The kth percentile of a set of data in numerical order is the value xi having the property that k percent of the data entries lie at or below xi.
  • 87. Data Preprocessing  Descriptive Data Summarization - Measuring the Dispersion of Data: The most commonly used percentiles other than the median are quartiles. The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile.
  • 88. Data Preprocessing  Descriptive Data Summarization - Measuring the Dispersion of Data: The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as  IQR = Q3 – Q1
  • 89. Data Preprocessing  Descriptive Data Summarization - Measuring the Dispersion of Data: The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum; Q1; Median; Q3; Maximum
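The five-number summary and IQR can be computed with Python's standard library. Note that quartile conventions vary; the "inclusive" method used here is one common choice (an assumption, pick the convention your course uses), and the data values are illustrative.

```python
from statistics import quantiles

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]

# 'inclusive' interpolates quartiles over the observed min..max.
q1, med, q3 = quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, med, q3, max(data))
iqr = q3 - q1  # IQR = Q3 - Q1, the spread of the middle half

print(five_number)  # (4, 15.0, 21.0, 25.0, 34)
print(iqr)          # 10.0
```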
  • 90. Data Preprocessing  Descriptive Data Summarization - Measuring the Dispersion of Data:  Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows:  Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range, IQR.  The median is marked by a line within the box.  Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations
  • 91. Data Preprocessing  Descriptive Data Summarization - Measuring the Dispersion of Data:
  • 92. Data Preprocessing  Graphic Displays of Basic Descriptive Data Summaries:  Plotting histograms, or frequency histograms, is a graphical method for summarizing the distribution of a given attribute.  A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets
  • 93. Data Preprocessing  Graphic Displays of Basic Descriptive Data Summaries: Typically, the width of each bucket is uniform. Each bucket is represented by a rectangle whose height is equal to the count or relative frequency of the values in the bucket.
  • 94. Data Preprocessing  Graphic Displays of Basic Descriptive Data Summaries - Histogram
  • 95. Data Preprocessing  Graphic Displays of Basic Descriptive Data Summaries - Quantile plot:  A simple and effective way to have a first look at a univariate data distribution.  First, it displays all of the data for the given attribute.  Second, it plots quantile information.
  • 96. Data Preprocessing  Graphic Displays of Basic Descriptive Data Summaries – Scatter plot:  The most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numerical attributes.  To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.
  • 97. Data Preprocessing  Graphic Displays of Basic Descriptive Data Summaries – Scatter plot:
  • 98. Data Cleaning  Real-world data tend to be incomplete, noisy, and inconsistent.  Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
  • 99. Data Cleaning  Missing Values:  Ignore the tuple: This is usually done when the class label is missing. This method is not very effective, unless the tuple contains several attributes with missing values.  Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
  • 100. Data Cleaning  Missing Values:  Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown”.  Use the attribute mean to fill in the missing value: For example, suppose that the average income of customers is $56,000. Use this value to replace the missing value for income.
  • 101. Data Cleaning  Missing Values:  Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
  • 102. Data Cleaning  Missing Values:  Use the most probable value to fill in the missing value: For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
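The attribute-mean strategy above can be sketched in a few lines. The income values are hypothetical, chosen so the mean of the known values is $56,000, matching the earlier example.

```python
# Hypothetical income attribute; None marks a missing value.
incomes = [48000, None, 61000, None, 59000]

# Mean over the known values only.
known = [v for v in incomes if v is not None]
mean_income = sum(known) / len(known)  # (48000+61000+59000)/3 = 56000.0

# Replace each missing value with the attribute mean.
filled = [v if v is not None else mean_income for v in incomes]
print(filled)  # [48000, 56000.0, 61000, 56000.0, 59000]
```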
  • 103. Data Cleaning  Noisy Data:  Noise is a random error or variance in a measured variable.  Methods:  Binning, Regression, and Clustering
  • 104. Data Cleaning  Noisy Data – Binning:  Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it.  The sorted values are distributed into a number of “buckets,” or bins.  Because binning methods consult the neighborhood of values, they perform local smoothing
  • 105. Data Cleaning  Noisy Data – Binning:  Example: Data: 4,8,15,21,21,24,25,28,34  Partition into (equal-frequency) bins:  Bin 1: 4,8,15  Bin 2: 21,21,24  Bin 3: 25,28,34
  • 106. Data Cleaning  Noisy Data – Binning:  Example: Data: 4,8,15,21,21,24,25,28,34  Smoothing by bin means: each value in a bin is replaced by the mean value of the bin  Bin 1: 9,9,9  Bin 2: 22,22,22  Bin 3: 29,29,29
  • 107. Data Cleaning  Noisy Data – Binning:  Example: Data: 4,8,15,21,21,24,25,28,34  Smoothing by bin boundaries: The minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value  Bin 1: 4,4,15  Bin 2: 21,21,24  Bin 3: 25,25,34
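The smoothing results above can be reproduced with a short sketch (equal-frequency bins of size 3; rounding is applied to match the integer bin means shown on the slide):

```python
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]  # equal-frequency bins

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]

# Smoothing by bin boundaries: each value snaps to the nearer of
# the bin's minimum or maximum.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```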
  • 108. Data Cleaning  Noisy Data – Regression:  Data can be smoothed by fitting the data to a function, such as with regression.  Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other.
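A minimal least-squares fit illustrates regression-based smoothing: the noisy y values are replaced by the values predicted from x on the fitted line y = a + b·x. The data points here are illustrative.

```python
# Illustrative noisy observations of two attributes.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Ordinary least squares: slope b and intercept a of the best-fit line.
b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
a = my - b * mx  # slope b is about 1.95, intercept a about 0.15

# Smoothed values: each y moved onto the fitted line.
smoothed = [a + b * x for x in xs]
```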
  • 109. Data Cleaning  Noisy Data – Clustering:  Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers
  • 110. Data Cleaning  Data Cleaning as a Process:  Missing values, noise, and inconsistencies contribute to inaccurate data.  The first step in data cleaning as a process is discrepancy detection.
  • 111. Data Cleaning  Data Cleaning as a Process - Discrepancies can be caused by several factors,  poorly designed data entry forms that have many optional fields  human error in data entry  deliberate errors (e.g., respondents not wanting to disclose information about themselves)  data decay (e.g., outdated addresses)  Errors in instrumentation devices that record data, and system errors
  • 113. Data Cleaning  Data Cleaning as a Process – Discrepancies Detection Tools:  Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spell-checking) to detect errors and make corrections in the data  Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions.
  • 114. Data Cleaning  Data Cleaning as a Process – Discrepancies Detection Tools:  Data migration tools allow simple transformations to be specified, such as to replace the string “gender” by “sex”  ETL (extraction/transformation/loading) tools allow users to specify transforms through a graphical user interface (GUI).
  • 115. Data Integration and Transformation  Data Integration:  It combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
  • 116. Data Integration and Transformation  Data Integration – Issues:  Schema integration and object matching can be tricky.  For example, how can the data analyst or the computer be sure that customer id in one database and cust number in another refer to the same attribute?  Metadata can be used to help avoid errors in schema integration.
  • 117. Data Integration and Transformation  Data Integration – Issues: Metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values.
  • 118. Data Integration and Transformation  Data Integration – Issues:  Redundancy is another important issue.  Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.  Some redundancies can be detected by correlation analysis.
  • 119. Data Integration and Transformation  Data Integration – Issues:  A third important issue in data integration is the detection and resolution of data value conflicts.  For a hotel chain, the price of rooms in different cities may involve not only different currencies but also different services (such as free breakfast) and taxes.
  • 120. Data Integration and Transformation  Data Integration:  The semantic heterogeneity and structure of data pose great challenges in data integration.  Careful integration of the data from multiple sources can help reduce and avoid redundancies and inconsistencies in the resulting data set.  This can help improve the accuracy and speed of the subsequent mining process.
  • 121. Data Integration and Transformation  Data Transformation:  The data are transformed or consolidated into forms appropriate for mining.
  • 122. Data Integration and Transformation  Data Transformation:  Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.  Aggregation, where summary or aggregation operations are applied to the data.  Generalization of the data, where low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies
  • 123. Data Integration and Transformation  Data Transformation:  Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.  Attribute construction (or feature construction),where new attributes are constructed and added from the given set of attributes to help the mining process.
  • 124. Data Integration and Transformation  Data Transformation - There are many methods for data normalization.  Min-max normalization,  Z-score normalization,  Normalization by decimal scaling.
  • 125. Data Integration and Transformation  Min-max normalization:  Min-max normalization preserves the relationships among the original data values.  It will encounter an “out-of-bounds” error if a future input case for normalization falls outside of the original data range for A.
  • 126. Data Integration and Transformation  Min-max normalization:  Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and that income is to be mapped to the range [0.0, 1.0].  By min-max normalization, a value of $73,600 for income is transformed to 0.716
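The min-max mapping above can be sketched as a small helper; the function name `min_max` and the default target range [0.0, 1.0] are illustrative choices, and the figures are the slide's income example.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map v from [old_min, old_max]
    to [new_min, new_max], preserving the relationships among values."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# The slide's example: income $73,600 with min $12,000 and max $98,000.
min_max(73600, 12000, 98000)  # ≈ 0.716
```

A value outside [old_min, old_max] maps outside the target range, which is the "out-of-bounds" situation mentioned above.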
  • 127. Data Integration and Transformation  Z-score normalization:  The values for an attribute, A, are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v’ by computing v’ = (v − Ā) / σA, where Ā is the mean and σA the standard deviation of A.
  • 128. Data Integration and Transformation  Z-score normalization: The mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. A value of $73,600 for income is transformed to 1.225
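As a minimal sketch of the income example above (the function name `z_score` is an illustrative choice):

```python
def z_score(v, mean, std):
    """Z-score normalization: how many standard deviations v lies from the mean."""
    return (v - mean) / std

# The slide's example: income $73,600 with mean $54,000 and std dev $16,000.
z_score(73600, 54000, 16000)  # = 1.225
```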
  • 129. Data Integration and Transformation  Normalization by decimal scaling  Normalizes by moving the decimal point of values of attribute A.  The number of decimal places moved depends on the maximum absolute value of A: v’ = v / 10^j, where j is the smallest integer such that max(|v’|) < 1.
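A minimal sketch of decimal scaling, assuming a simple list of numeric values (the function name `decimal_scale` and the sample values are illustrative):

```python
def decimal_scale(values):
    """Normalize by dividing every value by 10^j, where j is the smallest
    integer such that the maximum normalized absolute value is below 1."""
    j = 0
    m = max(abs(v) for v in values)
    while m / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

# Values ranging from -986 to 917 are divided by 1,000 (j = 3).
decimal_scale([-986, 917])  # → [-0.986, 0.917]
```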
  • 130. Data Integration and Transformation  Normalization can change the original data quite a bit.  It is also necessary to save the normalization parameters (such as the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
  • 131. Data Integration and Transformation  Attribute Construction:  New attributes are constructed from the given attributes and added in order to help improve the accuracy and understanding of structure in high-dimensional data.  For example, the attribute area can be added based on the attributes height and width.
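The area example can be sketched over hypothetical records (the dictionaries and field names here are illustrative, not from the source):

```python
# Hypothetical records carrying the given attributes height and width.
records = [{"height": 2.0, "width": 3.0}, {"height": 1.5, "width": 2.0}]

# Construct the new attribute area from the existing ones.
for r in records:
    r["area"] = r["height"] * r["width"]
```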
  • 132. Data Reduction  To obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
  • 133. Data Reduction  Strategies  Data cube aggregation  Attribute subset selection  Dimensionality reduction  Numerosity reduction  Discretization and concept hierarchy generation
  • 134. Data Reduction  Strategies - Data cube aggregation  The data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
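The quarterly-to-yearly roll-up can be sketched as follows; the sample tuples and amounts are hypothetical.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount).
sales = [(2022, 1, 224), (2022, 2, 408), (2022, 3, 350), (2022, 4, 586),
         (2023, 1, 310), (2023, 2, 402)]

# Aggregate away the quarter dimension, keeping one total per year.
yearly = defaultdict(float)
for year, _quarter, amount in sales:
    yearly[year] += amount
```

The reduced data set has one tuple per year instead of one per quarter, yet answers the same per-year queries exactly.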
  • 135. Data Reduction  Strategies - Data cube aggregation  The cube created at the lowest level of abstraction is referred to as the base cuboid.  The base cuboid should correspond to an individual entity of interest, such as sales or customer.  In other words, the lowest level should be usable, or useful for the analysis.
  • 136. Data Reduction  Strategies - Data cube aggregation  A cube at the highest level of abstraction is the apex cuboid.  The apex cuboid would give one total summarizing the data over all dimensions.
  • 137. Data Reduction  Strategies - Attribute Subset Selection  Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions).  To find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
  • 138. Data Reduction  Strategies - Attribute Subset Selection  Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
  • 139. Data Reduction  Strategies - Attribute Subset Selection  Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
  • 140. Data Reduction  Strategies - Attribute Subset Selection  Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
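Stepwise forward selection can be sketched as a greedy loop; the function name `forward_selection` and the subset-scoring callback are illustrative, standing in for whatever attribute-relevance measure is used.

```python
def forward_selection(attributes, score, k):
    """Greedy stepwise forward selection: start with an empty reduced set and,
    at each step, add the attribute that most improves the subset score."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.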
  • 141. Data Reduction  Strategies - Attribute Subset Selection  Decision tree induction: Decision tree algorithms, such as ID3, C4.5, and CART, were originally intended for classification.  Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
  • 142. Data Reduction  Dimensionality Reduction:  In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data.  Methods: wavelet transforms and principal components analysis.
  • 143. Data Reduction  Dimensionality Reduction - Wavelet Transforms:  The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X’, of wavelet coefficients.
  • 144. Data Reduction  Dimensionality Reduction - Wavelet Transforms:  A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients.  For example, all wavelet coefficients larger than some user-specified threshold can be retained. All other coefficients are set to 0.  The technique also works to remove noise without smoothing out the main features of the data, making it effective for data cleaning as well.
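A minimal sketch of the simplest DWT, the Haar transform, applied to a length-2^k vector; the function names are illustrative, and a real implementation would typically use a wavelet library rather than this hand-rolled pyramid.

```python
def haar_dwt(x):
    """Full Haar wavelet transform via repeated pairwise averaging/differencing.
    Input length must be a power of two."""
    coeffs = []
    s = list(x)
    while len(s) > 1:
        avgs = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        diffs = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        coeffs = diffs + coeffs  # coarser detail coefficients go in front
        s = avgs
    return s + coeffs  # overall average followed by detail coefficients

def threshold(coeffs, t):
    """Compressed approximation: keep only coefficients at or above t."""
    return [c if abs(c) >= t else 0 for c in coeffs]
```

Zeroing the small detail coefficients is exactly the retain-only-the-strongest-coefficients compression described above.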
  • 145. Data Reduction  Dimensionality Reduction - Principal Components Analysis:  It searches for k n-dimensional orthogonal vectors that can best be used to represent the data.  The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
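For the two-dimensional case the principal axis can be found in closed form, which makes the projection idea concrete; this sketch (function name `pca_2d` is illustrative) reduces each 2-D point to a single coordinate along the direction of greatest variance.

```python
import math
import statistics

def pca_2d(points):
    """Project 2-D points onto their first principal component."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    # Population covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((x - mx) ** 2 for x in xs) / n
    c = sum((y - my) ** 2 for y in ys) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    # Largest eigenvalue of a 2x2 symmetric matrix, in closed form.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding eigenvector, normalized to unit length.
    vx, vy = (b, lam - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # Each point reduces to one coordinate along the principal axis.
    return [(x - mx) * vx + (y - my) * vy for x, y in points]
```

For k components in n dimensions one would use a linear algebra library; the geometry (center, find orthogonal variance-maximizing axes, project) is the same.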
  • 146. Data Reduction  Numerosity Reduction:  For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data.  Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
  • 147. Data Reduction  Numerosity Reduction – Histograms:  A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets.  If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.  Often, buckets instead represent continuous ranges for the given attribute.
  • 148. Data Reduction Histograms - Partitioning rules:  Equal-width: In an equal-width histogram, the width of each bucket range is uniform  Equal-frequency (or equidepth): In an equal- frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant
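The two partitioning rules above can be sketched directly (function names are illustrative):

```python
def equal_width_buckets(values, k):
    """Equal-width histogram: k buckets of uniform range width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    buckets = [[] for _ in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bucket
        buckets[i].append(v)
    return buckets

def equal_frequency_buckets(values, k):
    """Equal-frequency (equidepth) histogram: k buckets of roughly equal count."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k:(i + 1) * n // k] for i in range(k)]
```

Note how a skewed value (10 below) leaves equal-width buckets unbalanced while equal-frequency buckets stay level.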
  • 149. Data Reduction  Histograms - Partitioning rules:  V-Optimal: The histogram with the least variance, where histogram variance is a weighted sum of the variances of the original values that each bucket represents, and bucket weight is equal to the number of values in the bucket.  MaxDiff: Considers the difference between each pair of adjacent values. A bucket boundary is established between each pair of adjacent values having the β−1 largest differences, where β is the user-specified number of buckets.
  • 150. Data Reduction  Numerosity Reduction – Sampling:  Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data.
  • 151. Data Reduction Sampling:  Simple random sample without replacement (SRSWOR) of size s:  all tuples are equally likely to be sampled.  Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained, where s < M.
  • 152. Data Reduction Sampling:  Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum.  This helps ensure a representative sample, especially when the data are skewed.
  • 153. Data Reduction  Sampling:  An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, s, as opposed to N, the data set size.  When applied to data reduction, sampling is most commonly used to estimate the answer to an aggregate query. It is possible (using the central limit theorem) to determine a sufficient sample size for estimating a given function within a specified degree of error.
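SRSWOR and stratified sampling can be sketched with the standard library (function names, the `key` callback, and the fixed per-stratum size are illustrative choices):

```python
import random

def srswor(data, s, seed=0):
    """Simple random sample without replacement of size s:
    every tuple is equally likely to be drawn, and none is drawn twice."""
    rng = random.Random(seed)
    return rng.sample(data, s)

def stratified_sample(data, key, per_stratum, seed=0):
    """Stratified sample: partition the data into strata by `key`,
    then draw an SRS within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for row in data:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for rows in strata.values():
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample
```

Because every stratum contributes, the stratified version keeps rare groups represented even when the data are skewed, as noted above.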