Data Mining and Data Preprocessing

Data Mining Introduction

We live in a world where vast amounts of data are collected daily, and analyzing such data is an important need.
One emerging data repository architecture is the data warehouse, a repository of multiple heterogeneous data sources.
Data mining turns a large collection of data into knowledge.
A search engine (e.g., Google) receives hundreds of millions of queries
every day.
Each query can be viewed as a transaction where the user describes her
or his information need.
Data Collection and Database Creation (1960s and earlier)
Primitive file processing
Database Management Systems (1970s to early 1980s)
Hierarchical and network database systems
Relational database systems
Data modeling: entity-relationship models, etc.
Indexing and accessing methods
Query languages: SQL, etc.
User interfaces, forms, and reports
Query processing and optimization
Transactions, concurrency control, and recovery
Online transaction processing (OLTP)
Advanced Database Systems (mid-1980s to present)
Advanced data models: extended-relational, object relational, deductive, etc.
Managing complex data: spatial, temporal, multimedia, sequence and structured, scientific, engineering, moving objects, etc.
Data streams and cyber-physical data systems
Web-based databases (XML, semantic web)
Managing uncertain data and data cleaning
Integration of heterogeneous sources
Text database systems and integration with information retrieval
Extremely large data management
Database system tuning and adaptive systems
Advanced queries: ranking, skyline, etc.
Cloud computing and parallel data processing
Issues of data privacy and security
Advanced Data Analysis (late 1980s to present)
Data warehouse and OLAP
Data mining and knowledge discovery: classification, clustering, outlier analysis, association and correlation, comparative summary, discrimination analysis, pattern discovery, trend and deviation analysis, etc.
Mining complex types of data: streams, sequence, text, spatial, temporal,
multimedia, Web, networks, etc.
Data mining applications: business, society, retail, banking,
telecommunications, science and engineering, blogs, daily life, etc.
Data mining and society: invisible data mining, privacy-preserving data mining, mining social and information networks, recommender systems, etc.

Knowledge Discovery from Data, or KDD

Data Mining
The process of finding a small set of precious nuggets from a great deal of raw material.
Also known as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for the process of knowledge discovery, or KDD.
The knowledge discovery process is an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
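The seven steps can be pictured as one linear pipeline. The skeleton below is a minimal, runnable sketch in Python; every helper function is a placeholder assumption standing in for a real implementation, and in practice the process is iterative rather than a single pass.

```python
# A minimal, runnable skeleton of the seven KDD steps chained into one pipeline.
# Every helper below is a placeholder assumption (an identity or a stub), not a
# library call; real systems iterate over these steps rather than run them once.

def clean(data):        return data         # 1. remove noise and inconsistent data
def integrate(data):    return data         # 2. combine multiple data sources
def select(data):       return data         # 3. keep the task-relevant data
def transform(data):    return data         # 4. summarize/aggregate for mining
def mine(data):         return ["pattern"]  # 5. apply intelligent methods -> patterns
def evaluate(patterns): return patterns     # 6. keep the truly interesting patterns
def present(knowledge): print(knowledge)    # 7. show mined knowledge to the user

def run_kdd(raw_data):
    patterns = mine(transform(select(integrate(clean(raw_data)))))
    knowledge = evaluate(patterns)
    present(knowledge)
    return knowledge

run_kdd(raw_data=[("tx1", "query"), ("tx2", "query")])
```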
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data.
The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system
dynamically.
Figure 1.1 Data mining as a step in the process of knowledge discovery.

Data Preprocessing

Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their typically huge size and because they often originate from multiple, heterogeneous sources.
Low-quality data will lead to low-quality mining results.
The data should be preprocessed to improve their quality and, consequently, the quality of the mining results.
Preprocessing also improves the efficiency and ease of the mining process.
There are several data preprocessing techniques.
Data cleaning can be applied to remove noise and correct inconsistencies in
data.
Data integration merges data from multiple sources into a coherent data store
such as a data warehouse.
Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.
Data transformations (e.g., normalization) may be applied, where data are
scaled to fall within a smaller range like 0.0 to 1.0. This can improve the
accuracy and efficiency of mining algorithms.
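As an illustration of the last point, the sketch below applies min-max normalization with pandas; the column name "price" and its values are assumptions made for the example.

```python
# A minimal sketch of min-max normalization to the range [0.0, 1.0] with pandas;
# the column name "price" and its values are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({"price": [4, 8, 15, 21, 21, 24, 25, 28, 34]})

lo, hi = df["price"].min(), df["price"].max()
df["price_scaled"] = (df["price"] - lo) / (hi - lo)   # every value now in [0.0, 1.0]
print(df)
```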
Data Quality
Data have quality if they satisfy the requirements of the intended use.
There are many factors that comprise data quality:
1. Accuracy
2. Completeness
3. Consistency
4. Timeliness
5. Believability and Interpretability.
Inaccurate Data:
The data collection instruments used may be faulty.
There may have been human or computer errors occurring at data
entry.
Users may purposely submit incorrect data values for mandatory fields
when they do not wish to submit personal information.
Incomplete Data:
Incomplete data can occur for a number of reasons.
Attributes of interest may not always be available, such as customer information for sales transaction data.
Other data may not be included simply because they were not
considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding or
because of equipment malfunctions.
Inconsistent Data:
Incorrect data may also result from inconsistencies in naming
conventions or data codes, or inconsistent formats for input fields (e.g.,
date).
Timeliness:
The fact that the month-end data are not updated in a timely fashion
has a negative impact on the data quality.
Believability:
It reflects how much the data are trusted by users.
Interpretability:
It reflects how easily the data are understood.
Suppose that a database, at one point, had several errors, all of which have since been corrected. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

Data Cleaning

Data cleaning routines fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Missing Values
Ignore the tuple: This is usually done when the class label is missing.
This method is not very effective, unless the tuple contains several attributes with missing values.
Fill in the missing value manually: Time consuming and may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown".
Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
Use the attribute mean or median for all samples belonging to the same class as the given tuple.
Use the most probable value to fill in the missing value.
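A minimal pandas sketch of three of these strategies (global constant, attribute mean, and per-class mean) is shown below; the column names and values are illustrative assumptions.

```python
# A minimal sketch of the fill-in strategies above using pandas; the column
# names ("class", "income", "occupation") and values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "class":      ["A", "A", "B", "B"],
    "income":     [50.0, None, 70.0, None],
    "occupation": ["engineer", None, "teacher", None],
})

# Global constant for a categorical attribute.
df["occupation"] = df["occupation"].fillna("Unknown")

# Measure of central tendency (mean) over all tuples ...
df["income_mean"] = df["income"].fillna(df["income"].mean())

# ... or the mean of the samples belonging to the same class as the given tuple.
df["income_by_class"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```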
Noisy Data:
Noise is random error or variance in a measured variable.
Data smoothing techniques:
Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
The sorted values are distributed into a number of "buckets," or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
Binning methods for data smoothing:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins: Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means: Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries: Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
In smoothing by bin medians, each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
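The sketch below reproduces the price example in plain Python: equal-frequency partitioning into three bins, then smoothing by bin means and by bin boundaries.

```python
# A minimal sketch that reproduces the price example above with plain Python:
# equal-frequency partitioning, then smoothing by bin means and by boundaries.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]            # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by bin medians would follow the same pattern, with the bin median in place of the mean.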
Regression:
A technique that conforms data values to a function.
Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
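A minimal NumPy sketch of the linear case is shown below: fit the "best" line between a predictor attribute x and a noisy attribute y, then replace y with the fitted values. The numbers are made up for illustration.

```python
# A minimal sketch of smoothing one attribute with a linear fit on another,
# using NumPy's least-squares polynomial fit; x and y here are made-up values.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])     # predictor attribute
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])    # noisy attribute to smooth

slope, intercept = np.polyfit(x, y, deg=1)        # the "best" fitting line
y_smoothed = slope * x + intercept                # values conformed to the line
print(round(slope, 2), round(intercept, 2))
print(y_smoothed)
```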
Outlier analysis:
Outliers may be detected by clustering, where similar values are organized into groups, or "clusters."
Values that fall outside of the set of clusters may be considered outliers.
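One way to realize this in code is with a density-based clustering method such as DBSCAN, which labels points that belong to no cluster as noise (-1). The sketch below assumes scikit-learn; the points and the eps/min_samples settings are illustrative.

```python
# A minimal sketch of clustering-based outlier detection with scikit-learn's
# DBSCAN, which labels points belonging to no cluster as -1 (noise). The
# points, eps, and min_samples values are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],     # cluster near (1, 1)
                   [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],     # cluster near (5, 5)
                   [9.0, 1.0]])                             # isolated point

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(points)
print(points[labels == -1])    # candidate outliers: points outside every cluster
```

Which points end up flagged depends heavily on the chosen parameters, so they should be tuned to the data at hand.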
Figure 1.2 A 2-D customer data plot with respect to customer locations in a city,
showing three data clusters. Outliers may be detected as values that fall outside of the
cluster sets.
Data Cleaning as a process:
Discrepancies can be caused by several factors, including poorly
designed data entry forms that have many optional fields, human
error in data entry, deliberate errors and data decay.
Discrepancies may also arise from inconsistent data
representations and inconsistent use of codes.
Other sources of discrepancies include errors in instrumentation
devices that record data and system errors.
There may also be inconsistencies due to data integration.
Discrepancy Detection Methods:
Field overloading is another error source that typically results when
developers squeeze new attribute definitions into unused (bit) portions of
already defined attributes.
A unique rule says that each value of the given attribute must be
different from all other values for that attribute.
A consecutive rule says that there can be no missing values between the
lowest and highest values for the attribute, and that all values must also be
unique (e.g., as in check numbers).
A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition.
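These rules translate directly into simple checks. The pandas sketch below is only illustrative: the column names and the set of strings treated as null markers are assumptions.

```python
# A minimal sketch of the three rules as pandas checks; the column names
# ("customer_id", "check_no", "email") and the null markers are assumptions.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "check_no":    [501, 502, 504, 505],
    "email":       ["a@x.com", "?", "", "c@x.com"],
})

# Unique rule: every value of the attribute must differ from all others.
violations_unique = df[df["customer_id"].duplicated(keep=False)]

# Consecutive rule: no gaps between the lowest and highest values.
expected = set(range(df["check_no"].min(), df["check_no"].max() + 1))
missing_values = sorted(expected - set(df["check_no"]))

# Null rule: agree on which strings ("", "?", ...) stand for the null condition.
null_flags = df["email"].isin(["", "?", "NA"])

print(violations_unique)
print(missing_values)          # e.g., [503]
print(int(null_flags.sum()))   # number of null-marker entries
```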
Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data.
Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate such conditions.
Data auditing tools are variants of data mining tools. For example, they may employ statistical analysis to find correlations, or clustering to identify outliers.
Commercial tools can assist in the data transformation step.
Data migration tools allow simple transformations to be specified, such as replacing the string "gender" by "sex."
ETL (extraction/transformation/loading) tools allow users to specify
transforms through a graphical user interface (GUI).
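As a rough illustration of the kind of transformation such tools specify, the pandas sketch below renames a column directly in code; the data are made up.

```python
# A rough illustration of a simple migration-style transformation written
# directly in pandas rather than specified through a tool; the data are made up.
import pandas as pd

df = pd.DataFrame({"gender": ["F", "M"], "amount": [120.0, 80.5]})
df = df.rename(columns={"gender": "sex"})   # replace the string "gender" by "sex"
print(df.columns.tolist())                  # ['sex', 'amount']
```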
Thank You