SlideShare a Scribd company logo
1 of 128
Download to read offline
Data Mining:
Concepts and Techniques
— Unit 1 —
— Introduction —
Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Classification of data mining systems
Major issues in data mining
Overview
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
(1000 Terabytes = 1 Petabyte )
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
Disk Storage
1 Bit = Binary Digit
· 8 Bits = 1 Byte
· 1024 Bytes = 1 Kilobyte
· 1024 Kilobytes = 1 Megabyte
· 1024 Megabytes = 1 Gigabyte
· 1024 Gigabytes = 1 Terabyte
· 1024 Terabytes = 1 Petabyte
· 1024 Petabytes = 1 Exabyte
· 1024 Exabytes = 1 Zettabyte
· 1024 Zettabytes = 1 Yottabyte
· 1024 Yottabytes = 1 Brontobyte
· 1024 Brontobytes = 1 Geopbyte
We are drowning in data, but starving for
knowledge!
“Necessity is the mother of invention”
Data mining—Automated analysis of massive
data sets
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
1950s-1990s, computational science – simulation , models
1990-now, data science
The flood of data
The ability to economically store and manage data online
The Internet and computing Grid that makes all these
archives universally accessible
Here is the concept of Data mining.
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information
systems
8
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
ACM Transactions on KDD starting in 2007
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amount of data
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction
Data/pattern analysis
Data archeology
Data dredging
Information harvesting
Business intelligence, etc
Knowledge Discovery (KDD) Process
Data mining—core of
knowledge discovery
process
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task are
retrieved fromthe database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)
April 13, 2021 Data Mining: Concepts and Techniques 12
5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
7. Knowledge presentation (where visualization and
knowledge representation techniques are used to present
the mined knowledge to the user)
April 13, 2021 Data Mining: Concepts and Techniques 13
Architecture: Typical Data Mining System
data cleaning, integration, and selection
Database or Data
Warehouse Server
Data Mining Engine
Pattern Evaluation
Graphical User Interface
Knowl
edge-
Base
Database Data
Warehouse
World-Wide
Web
Other Info
Repositories
Database, data warehouse, WorldWideWeb, or other
information repository: This is one or a set of databases,
data warehouses, spreadsheets, or other kinds of
information repositories. Data cleaning and data
integration techniques may be performed on the data.
April 13, 2021 Data Mining: Concepts and Techniques 15
Database or data warehouse server: The database or data
warehouse server is responsible for fetching the relevant
data, based on the user’s data mining request.
April 13, 2021 Data Mining: Concepts and Techniques 16
Knowledge base: This is the domain knowledge that is
used to guide the search or evaluate the interestingness of
resulting patterns.
Knowledge such as user beliefs, which can be used to
assess a pattern’s interestingness based on its
unexpectedness, may also be included.
Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata
(e.g., describing data from multiple heterogeneous
sources).
April 13, 2021 Data Mining: Concepts and Techniques 17
Data mining engine: This is essential to the data mining
system and ideally consists of a set of functional modules
for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster
analysis, outlier analysis, and evolution analysis.
April 13, 2021 Data Mining: Concepts and Techniques 18
Pattern evaluation module: This component typically
employs interestingness measures and interacts with the
data mining modules so as to focus the search toward
interesting patterns.
It may use interestingness thresholds to filter out
discovered patterns.
Alternatively, the pattern evaluation module may be
integrated with the mining module, depending on the
implementation of the data mining method used. For
efficient data mining, it is highly recommended to push
April 13, 2021 Data Mining: Concepts and Techniques 19
User interface: This module communicates between users
and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task,
providing information to help focus the search, and
performing exploratory data mining based on the
intermediate data mining results.
In addition, this component allows the user to browse
database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in
different forms.
April 13, 2021 Data Mining: Concepts and Techniques 20
Data Mining and Business Intelligence
Increasing potential
to support
business decisions End User
Business
Analyst
Data
Analyst
DBA
Decision
Making
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
Data Mining: Confluence of Multiple Disciplines
Data Mining
Database
Technology Statistics
Machine
Learning
Pattern
Recognition
Algorithm
Other
Disciplines
Visualization
Why Not Traditional Data Analysis?
Tremendous amount of data
High-dimensionality of data
High complexity of data
New and sophisticated applications
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
a. Database-oriented data sets and applications
Relational database
Data warehouse
Transactional database
Relational Databases
A database system, also called a database management
system (DBMS), consists of a collection of interrelated
data, known as a database, and a set of software programs
to manage and access the data.
A relational database is a collection of tables, each of
which is assigned a unique name.
Each table consists of a set of attributes (columns or
fields) and usually stores a large set of tuples (records or
rows
A semantic data model, such as an entity-relationship (ER)
data model, is often constructed for relational databases.
An ER data model represents the database as a set of
entities and their relationships.
April 13, 2021 Data Mining: Concepts and Techniques 26
A relational database for AllElectronics. The AllElectronics
company is described by the following relation tables:
customer, item, employee, and branch.
Relational data can be accessed by database queries
written in a relational query language, such as SQL, or
with the assistance of graphical user interfaces.
April 13, 2021 Data Mining: Concepts and Techniques 27
April 13, 2021 Data Mining: Concepts and Techniques 28
April 13, 2021 Data Mining: Concepts and Techniques 29
A query allows retrieval of specified subsets of the data.
Suppose that your job is to analyze the AllElectronics data.
Through the use of relational queries, you can ask things
like “Show me a list of all items that were sold in the last
quarter.”
Relational languages also include aggregate functions such
as sum, avg (average), count, max (maximum), and min
(minimum).
April 13, 2021 Data Mining: Concepts and Techniques 30
DataWarehouses
A data warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and
that usually resides at a single site.
Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data loading,
and periodic data refreshing.
April 13, 2021 Data Mining: Concepts and Techniques 31
April 13, 2021 Data Mining: Concepts and Techniques 32
To facilitate decision making, the data in a data
warehouse are organized around major subjects, such as
customer, item, supplier, and activity.
The data are stored to provide information from a
historical perspective (such as from the past 5–10 years) and
are typically summarized.
April 13, 2021 Data Mining: Concepts and Techniques 33
A data warehouse is usually modeled by a multidimensional
database structure.
Each dimension corresponds to an attribute or a set of
attributes in the schema.
Each cell stores the value of some aggregate measure, such
as count or sales amount.
The actual physical structure of a data warehouse may be a
relational data store or a multidimensional data cube.
A data cube provides a multidimensional view of data and
allows the pre computation and fast accessing of
summarized data.
April 13, 2021 Data Mining: Concepts and Techniques 34
A data cube for AllElectronics
A data cube for summarized sales data of AllElectronics
The cube has three dimensions: address (with city
valuesChicago, New York, Toronto, Vancouver), time (with
quarter values Q1, Q2, Q3, Q4), and
item(with itemtype values home entertainment, computer, phone,
security).
The aggregate value stored in each cell of the cube is sales
amount (in thousands).
For example, the total sales forthefirstquarter,Q1, for
items relating to security systems in Vancouver is
$400,000, as stored in cell (Vancouver, Q1, security)
April 13, 2021 Data Mining: Concepts and Techniques 35
April 13, 2021 Data Mining: Concepts and Techniques 36
Data warehouse systems are well suited for on-line
analytical processing, or OLAP.
Examples of OLAP operations include drill-down and roll-
up, which allow the user to view the data at differing
degrees of summarization,
we can drill down on sales data summarized by quarter to
see the data summarized by month. Similarly, we can roll
up on sales data summarized by city to view the data
summarized by country. 37
What is the difference between a datawarehouse and
a data mart?”
A data warehouse collects information about subjects
that span an entire organization, and thus its scope is
enterprise-wide.
A data mart, on the other hand, is a department subset of
a data warehouse. It focuses on selected subjects, and
thus its scope is department-wide.
April 13, 2021 Data Mining: Concepts and Techniques 38
Transactional Databases
A transactional database consists of a file where each
record represents a transaction.
A transaction typically includes a unique transaction
identity number (trans ID) and a list of the items making
up the transaction (such as items purchased in a store).
April 13, 2021 Data Mining: Concepts and Techniques 39
“Show me all the items purchased by Sandy Smith” or
“How many transactions include item number I3?”
Answering such queries may require a scan of the entire
transactional database.
“Which items sold well together?”
market basket data analysis would enable you to bundle
groups of items together as a strategy for maximizing
sales.
data mining systems for transactional data can do so by
identifying frequent itemsets.
April 13, 2021 Data Mining: Concepts and Techniques 40
b. Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked
data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Two dimensional
representation-table
April 13, 2021 Data Mining: Concepts and Techniques 42
Data cube
Data cube
April 13, 2021 Data Mining: Concepts and Techniques 43
Dimension Modelling.
April 13, 2021 Data Mining: Concepts and Techniques 44
Data Mining Functionalities
Data mining functionalities are used to specify the kind
of patterns to be found in data mining tasks.
In general, data mining tasks can be classified into two
categories: descriptive and predictive.
Descriptive mining tasks characterize the general
properties of the data in the database.
Predictive mining tasks perform inference on the current
data in order to make predictions.
April 13, 2021 Data Mining: Concepts and Techniques 45
Data Mining Functionalities
Concept/Class Description: Characterization and
Discrimination
Data can be associated with classes or concepts.
For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.
It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions.
April 13, 2021 Data Mining: Concepts and Techniques 46
These descriptions can be derived via
(1) data characterization, by summarizing the data of the
class under study in general terms, or (2) data
discrimination, by comparison of the target class with one
or a set of comparative classes or (3) both data
characterization and discrimination.
Data characterization is a summarization of the general
characteristics or features of a target class of data.
April 13, 2021 Data Mining: Concepts and Techniques 47
For example,
Data characterization. summarizing the characteristics
of customers who spend more than $1,000 a year at
AllElectronics.
Data discrimination . compare the general features of
software products whose sales increased by 10% in the
last year with those whose sales decreased by at least 30%
during the same period.
April 13, 2021 Data Mining: Concepts and Techniques 48
Data Mining Functionalities
ii. Frequent patterns, association, correlation
Frequent patterns, as the name suggests, are patterns that
occur frequently in data.
frequent patterns, including item sets, subsequences, and
substructures.
A frequent itemset :milk and bread
(frequent) sequential pattern: customers tend to purchase
first a PC, followed by a digital camera, and then a
memory card
(frequent)structured pattern: substructure occurs
frequently
Mining frequent patterns leads to the discovery of
interesting associations and correlations within data.
April 13, 2021 Data Mining: Concepts and Techniques 49
Association analysis
buys(X; “computer”))buys(X; “software”) [support = 1%; confidence = 50%]
A confidence, or certainty, of 50% means that if a customer buys a
computer, there is a 50% chance that she will buy software as well.
A 1% support means that 1% of all of the transactions under
analysis showed that computer and software were purchased
together.
Association rules that contain a single predicate are referred to as
single-dimensional association rules.
April 13, 2021 Data Mining: Concepts and Techniques 50
Association analysis
age(X, “20:::29”)^income(X, “20K:::29K”))buys(X, “CD player”)
[support = 2%, confidence = 60%]
multidimensional association rule
association rules are discarded as uninteresting if they do not
satisfy both a minimum support threshold and a minimum
confidence threshold.
April 13, 2021 Data Mining: Concepts and Techniques 51
Data Mining Functionalities
iii. Classification and prediction
Construct models (functions) that describe and
distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or
classify cars based on (mileage)
Predict some unknown or missing numerical values
April 13, 2021 Data Mining: Concepts and Techniques 53
iv. Cluster analysis
e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass
similarity
April 13, 2021 Data Mining: Concepts and Techniques 54
April 13, 2021 Data Mining: Concepts and Techniques 55
v. Outlier analysis
Outlier: Data object that does not comply with the general
behavior of the data
vi. Trend and evolution analysis
Trend and deviation: e.g., regression analysis (eg: height &
wait , price & demand)
Sequential pattern mining: e.g., digital camera large SD
memory
Periodicity analysis
Similarity-based analysis
vii. statistical analyses
Integration scheme includes
No coupling
Loose coupling
Semi tight coupling
Tight Coupling
April 13, 2021 Data Mining: Concepts and Techniques 57
No coupling
No coupling means that a DM system will not utilize any function of
a DB or DW system.
Advantages
Simple to use
Drawbacks
Without using a DB/DW system, a DM system may spend a
substantial amount of time finding, collecting, cleaning, and
transforming data
DM system will need to use other tools to extract data, making it
difficult to integrate such a system into an information processing
environment.
Thus, no coupling represents a poor design.
58
Loose coupling
Loose coupling means that a DM system will use some facilities of a DB
or DW system.
Fetching data from a data repository managed by these systems,
performing data mining, and then storing the mining results either in a
file or in a designated place in a database or data warehouse.
Advantages
Loose coupling is better than no coupling.
it can fetch any portion of data stored in databases or data
warehouses by using query processing.
It incurs some advantages of the flexibility, efficiency, and other
features provided by such systems
Disadvantages
many loosely coupled mining systems are main memory-based.
it is difficult for loose coupling to achieve high scalability and good
performance with large data sets
59
Semitight coupling
Semitight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data
mining primitives can be provided in the DB/DW system.
These primitives can include sorting, indexing, aggregation, histogram
analysis, multiway join, and pre computation of some essential
statistical measures, such as sum, count, max, min, standard deviation,
and so on.
this design will enhance the performance of a DM system.
April 13, 2021 Data Mining: Concepts and Techniques 60
Tight coupling
Tight coupling means that a DM system is smoothly integrated into the
DB/DW system.
The data mining subsystem is treated as one functional component of an
information system.
This approach is highly desirable because it facilitates efficient
implementations of data mining functions, high system performance,
and an integrated information processing environment.
April 13, 2021 Data Mining: Concepts and Techniques 61
Major Issues in Data Mining
Mining different kinds of knowledge in databases.
Interactive mining of knowledge at multiple levels of
abstraction.
Incorporation of background knowledge.
Data mining query languages and ad hoc data mining.
Presentation and visualization of data mining results.
Handling noisy or incomplete data
Pattern evaluation-the interestingness problem
Major Issues in Data Mining
Performance Issues
Efficiency and scalability of data mining algorithms
Parallel , distributed and incremental mining algorithms
Issues relating to the diversity of database types
Handling of relational and complex types of data.
Mining information from heterogeneous database and global
information systems
April 13, 2021 Data Mining: Concepts and Techniques 63
64
Data
warehouse
Definition
Data Warehouse
A collection of corporate information,
derived directly from operational
systems and some external data
sources.
Its specific purpose is to support
business decisions, not business
operations.
The Purpose of Data Warehousing
Realize the value of data
Data / information is an asset
Methods to realize the value, (Reporting, Analysis, etc.)
Make better decisions
Turn data into information
Create competitive advantage
Methods to support the decision making process(DSS)
A data ware house refers to database that is maintained
separately from an organization’s operational database.
It allows integration of variety of application systems.
It gives the opportunity for historical data analysis.
April 13, 2021 Data Mining: Concepts and Techniques 67
Subject oriented: A data ware house is subject oriented rather
than Transaction oriented. For eg: customer, product, sales etc.
Non volatile: data ware house is Physically separate ,So it not
perishable.
April 13, 2021 Data Mining: Concepts and Techniques 68
Integrated: A data ware house is constructed by integrating
heterogeneous sources such as rdbms, files,OLTPfiles.
data cleaning and data integration techniques are used for
ensuring consistency in naming conventions, encoding
structures, attribute measures and so on.
Time variant: Data is stored in historical perspective (5-
10years).
Implicit or explicit time variant will be there in constructing data
ware house.
April 13, 2021 Data Mining: Concepts and Techniques 69
MULTI DIMENSIONAL DATA MODEL
Two dimensional representation-table
April 13, 2021 Data Mining: Concepts and Techniques 70
Data cube
April 13, 2021 Data Mining: Concepts and Techniques 71
A multi-dimensional
structure called the data cube.
It is a data abstraction that
allows one to view
aggregated data from a
number of perspectives.
Data Ware house implementation
Data Cube
OLAP Operations.
OLAP Operations are used for retrieving data in a simplified
manner from data cube for analysis.
Slicing: Reducing the data cube by one or more dimensions.
Slice time=’Q2’ C[quarter,city,product]=C[city, product]
April 13, 2021 Data Mining: Concepts and Techniques 73
Dicing:
This operation is for selecting a smaller data cube and analyzing it
from different perspectives (selection criteria).
Eg:
dice time=’Q1 or Q2 and location =’Mumbai’ or ‘Pune’
C[Quarter,city,product]=C[quarter,city,product]
April 13, 2021 Data Mining: Concepts and Techniques 74
Drilling: Moving up and down along classification
hierarchies
Drill up(Roll Up): This means switching from
detailed to an aggregated with in same classification
hierarchy.
Eg RollUp=C[quarter,city,product]=C[quarter,
province. product]
April 13, 2021 Data Mining: Concepts and Techniques 75
April 13, 2021 Data Mining: Concepts and Techniques 76
Drill Down : This is concerned with switching from aggregated to detailed
level.
For eg: day->month->quarter->year,
April 13, 2021 Data Mining: Concepts and Techniques 77
Integration of Data Mining and Data
Warehousing:
Data warehouse provides clean, integrated data for fruitful
mining.
Data mining provides powerful tools for analysis of data stored
in data warehouses.
Data mining provides more analysis tools, e.g.,
- association,
- classification,
- clustering,
- pattern-directed, and
- trend analysis.
Data mining: the extraction of hidden predictive information
from large DB.
Data might be one of the most valuable assets of your
corporation - but only if you know how to reveal valuable
knowledge hidden in raw data.
Data mining allows you to extract diamonds of knowledge from
your historical data and predict outcomes of future situations.
The actual need of data warehouse is
- To store heterogeneous data for managerial decision purpose.
- To store data in various dimensions with in a data warehouse.
-it is easy to analyze the data and to take decisions.
- A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management and
decision-making process.
81
3- Tier Data Warehouse
Architecture
82
3-Tier Data Warehouse Architecture
Data ware house adopt a three tier architecture.
These 3 tiers are:
Bottom Tier
Middle Tier
Top Tier
83
84
Data Sources:
All the data related to any bussiness organization is stored in
operational databases, external files and flat files.
These sources are application oriented
Eg: complete data of organization such as training detail, customer
detail, sales, departments, transactions, employee detail etc.
Data present here in different formats or host format
Contain data that is not well documented
85
Bottom Tier: Data warehouse server
Data Warehouse server fetch only relevant
information based on data mining (mining a
knowledge from large amount of data) request.
Eg: customer profile information provided by
external consultants.
Data is feed into bottom tier by some backend
tools
and utilities.
86
Backend Tools & Utilities:
Functions performed by backend tools and utilities
are:
Data Extraction
Data Cleaning
Data Transformation
Load
Refresh
87
Bottom Tier Contains:
Data warehouse
Metadata Repository
Data Marts
Monitoring and Administration
Data Warehouse:
It is an optimized form of operational database contain
only relevant information and provide fast access to
data.
Subject oriented
Eg: Data related to all the departments of an organization
Integrated:
Different views Single unified
of data view
Time – variant
Nonvolatile
A
B
C
Warehous
e
Metadata repository:
It figure out that what is available in data warehouse.
It contains:
Structure of data warehouse
Data names and definitions
Source of extracted data
Algorithm used for data cleaning purpose
Sequence of transformations applied on data
Data related to system performance
Data Marts:
Subset of data warehouse contain only small slices of data
warehouse
Eg: Data pertaining to the single department
Two types of data marts:
Dependent Independent
sourced directly sourced from one or
from data warehouse more data sources
Monitoring & Administration:
Data Refreshment
Data source synchronization
Disaster recovery
Managing access control and security
Manage data growth, database performance
Controlling the number & range of queries
Limiting the size of data warehouse
Data
Warehouse
Data
Marts
Metadata
Repository
Monitoring Administration
Sourc
e A
B C
Bottom Tier:
Data Warehouse
Server
Data
Middle Tier: OLAP Server
It presents the users a multidimensional data from
data warehouse or data marts.
Typically implemented using two models:
ROLAP Model MOLAP Model
Present data in Present data in array
relational tables based structures means
map directly to data
cube array structure.
Top Tier: Front end tools
It is front end client layer.
Query and reporting tools
Reporting Tools: Production reporting tools
Report writers
Managed query tools: Point and click creation of SQL used
in customer mailing list.
Analysis tools : Prepare charts based on analysis
Data mining Tools: mining knowledge, discover hidden
piece of information, new correlations, useful pattern
Data Pre Processing
Data preprocessing is a data mining technique that involves
transforming raw data into an understandable format.
Data preprocessing is a proven method of resolving such issues.
Data preprocessing prepares raw data for further processing.
Data preprocessing is used database-driven applications such as
customer relationship management and rule-based applications.
Data Preprocessing
Preprocess Steps
Data cleaning
Data integration
Data transformation
Data reduction
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar
analytical results
Forms of data preprocessing
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
not register history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (assuming the
tasks in classification—not effective when the percentage of missing values
per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter
Use the most probable value to fill in the missing value: inference-based
such as Bayesian formula or decision tree
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which requires data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning method:
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median, smooth by
boundaries
Cluster Analysis
Clustering: detect and remove outliers
Regression
x
y
y = x + 1
X1
Y1
Y1’
Regression:
smooth by fitting
the data into
regression functions
Data cleaning as a process
Discrepancy detection
Use meta data
Field overloading
Unique rules
Consecutive rules
Null rules
April 13, 2021
15
Data Integration
Data integration:
combines data from multiple sources.
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from
multiple data sources, e.g.,A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from different sources
are different
possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundant Data in Data Integration
Redundant data occur often when integration of multiple
databases
The same attribute may have different names in different databases
One attribute may be a “derived” attribute in another table.
Redundant data may be able to be detected by correlational analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation analysis
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
min-max normalization
Min-max normalization performs a linear transformation on the original
data.
Suppose that mina and maxa are the minimum and the maximum values
for attribute A. Min-max normalization maps a value v of A to v’ in the
range [new-mina, new-maxa] by computing:
v’= ( (v-mina) / (maxa – mina) ) * (new-maxa – newmina)+ new-mina
Data Transformation: Normalization
Z-score Normalization:
In z-score normalization, attribute A are normalized based on the
mean and standard deviation of A. a value v of A is normalized to v’
by computing:
v’ = ( ( v –A ) / µA )
where  and  A are the mean and the standard deviation
respectively of attribute A.
This method of normalization is useful when the actual minimum and
maximum of attribute A are unknown.
Data Transformation: Normalization
Normalization by Decimal Scaling
Normalization by decimal scaling normalizes by moving the decimal
point of values of attribute A.
The number of decimal points moved depends on the maximum
absolute value of A.
a value v of A is normalized to v’ by computing: v’ = ( v / 10j ).Where j
is the smallest integer such that Max(|v’|)1.
Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical
results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
22
Data reduction
Data reduction strategies
Data cube aggregation
Attribute subset selection
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
Data cube aggregation
aggregation operations are applied to the data in the construction
of a data cube.
This is achieved by aggregation operations on data cube.
April 13, 2021
24
April 13, 2021
25
Attribute subset selection
Irrelevant ,weakly relevant or redundant attributes or
dimensions may be detected and removed.
Stepwise forward selection:
Stepwise backward elimination
Combination of forward selection and backward
elimination:
Decision tree induction:
Dimensionality reduction
Encoding mechanisms are used to reduce the data size.
Wavelet transforms
The discrete wavelet transform(DWT) is a linear signal processing
technique that, when applied to a data vector X, transforms it to a
numerically different vector, X0, of wavelet coefficients.
Principal components analysis, or PCA
Unlike attribute subset selection, which reduces the attribute set size by
retaining a subset of the initial set of attributes, PCA “combines” the
essence of attributes by creating an alternative, smaller set of variables.
Data compression
Numerosity reduction
The data are replaced or estimated by alternative, smaller data
representations such as parametric models(which need to store only
the model parameters instead of the actual data) or nonparametric
methods such as clustering, sampling and the use of histograms.
Regression and Log-Linear Models
Histograms
Clustering
Sampling
Data compression
Regression and Log-Linear Models
Regression and log-linear models can be used to approximate the
given data.
linear regression, the data are modeled to fit a straight line.
y (called a response variable)
X (called a predictor variable)
y = wx+b
Log-linear models approximate discrete multidimensional
probability distributions.
This allows a higher-dimensional data space to be constructed from
lower dimensional spaces.
Log-linear models are therefore also useful for dimensionality
reduction
April 13, 2021
28
Histograms
Histograms use binning to approximate data distributions and are a
popular form of
The following data are a list of prices of commonly sold items at
AllElectronics(rounded to the nearest dollar). The numbers have been
sorted: 1, 1, 5, 5, 5, 5, 5,8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,20, 20, 20, 20, 20, 20, 21,
21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
April 13, 2021
29
April 13, 2021
30
Sampling
Sampling can be used as a data reduction technique because it allows a
large data set to be represented by a much smaller random sample (or
subset) of the data.
April 13, 2021
31
Sampling
Simple random sample without replacement (SRSWOR) of size s
Simple random sample with replacement (SRSWR) of size s
Cluster sample
Stratified sample
April 13, 2021
32
Data Discretization and Concept
Hierarchy Generation
Data discretization techniques can be used to reduce the number of
values for a given continuous attribute by dividing the range of the
attribute into intervals.
Interval labels can then be used to replace actual data values.
Supervised discretization
Unsupervised discretization
Top-down discretization or splitting
Bottom-up discretization or merging
April 13, 2021
33
Data Discretization and Concept
Hierarchy Generation
A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.
Concept hierarchies can be used to reduce the data by
collecting and replacing low-level concepts (such as numerical
values for the attribute age) with higher-level concepts (such as
youth, middle-aged, or senior).
April 13, 2021
34

More Related Content

Similar to Data Mining mod1 ppt.pdf bca sixth semester notes

Unit 1 (Chapter-1) on data mining concepts.ppt
Unit 1 (Chapter-1) on data mining concepts.pptUnit 1 (Chapter-1) on data mining concepts.ppt
Unit 1 (Chapter-1) on data mining concepts.pptPadmajaLaksh
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningAbcdDcba12
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesDeepaR42
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAswathy S Nair
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfNancykumari47
 
Data Mining introduction and basic concepts
Data Mining introduction and basic conceptsData Mining introduction and basic concepts
Data Mining introduction and basic conceptsPritiRishi
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryYoung Alista
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryHarry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryJames Wong
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryFraboni Ec
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryLuis Goldster
 

Similar to Data Mining mod1 ppt.pdf bca sixth semester notes (20)

Data mining 1
Data mining 1Data mining 1
Data mining 1
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 
Unit 1 (Chapter-1) on data mining concepts.ppt
Unit 1 (Chapter-1) on data mining concepts.pptUnit 1 (Chapter-1) on data mining concepts.ppt
Unit 1 (Chapter-1) on data mining concepts.ppt
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
unit 1 DATA MINING.ppt
unit 1 DATA MINING.pptunit 1 DATA MINING.ppt
unit 1 DATA MINING.ppt
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Dm unit i r16
Dm unit i   r16Dm unit i   r16
Dm unit i r16
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and Techniques
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdf
 
Data Mining introduction and basic concepts
Data Mining introduction and basic conceptsData Mining introduction and basic concepts
Data Mining introduction and basic concepts
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 

Recently uploaded

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 

Recently uploaded (20)

Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 

Data Mining mod1 ppt.pdf bca sixth semester notes

  • 1. Data Mining: Concepts and Techniques — Unit 1 — — Introduction —
  • 2. Introduction Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Classification of data mining systems Major issues in data mining Overview
  • 3. Why Data Mining? The Explosive Growth of Data: from terabytes to petabytes (1000 Terabytes = 1 Petabyte ) Data collection and data availability Automated data collection tools, database systems, Web, computerized society Major sources of abundant data Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Society and everyone: news, digital cameras, YouTube
  • 4. Disk Storage 1 Bit = Binary Digit · 8 Bits = 1 Byte · 1024 Bytes = 1 Kilobyte · 1024 Kilobytes = 1 Megabyte · 1024 Megabytes = 1 Gigabyte · 1024 Gigabytes = 1 Terabyte · 1024 Terabytes = 1 Petabyte · 1024 Petabytes = 1 Exabyte · 1024 Exabytes = 1 Zettabyte · 1024 Zettabytes = 1 Yottabyte · 1024 Yottabytes = 1 Brontobyte · 1024 Brontobytes = 1 Geopbyte
  • 5. We are drowning in data, but starving for knowledge! “Necessity is the mother of invention” Data mining—Automated analysis of massive data sets
  • 6. Evolution of Sciences Before 1600, empirical science 1600-1950s, theoretical science 1950s-1990s, computational science – simulation , models 1990-now, data science The flood of data The ability to economically store and manage data online The Internet and computing Grid that makes all these archives universally accessible Here is the concept of Data mining.
  • 7. Evolution of Database Technology 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational data model, relational DBMS implementation 1980s: RDBMS, advanced data models (extended-relational, OO, etc.) Application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s: Data mining, data warehousing, multimedia databases, and Web databases 2000s Stream data management and mining Data mining and its applications Web technology (XML, data integration) and global information systems
  • 8. 8 A Brief History of Data Mining Society 1989 IJCAI Workshop on Knowledge Discovery in Databases 1991-1994 Workshops on Knowledge Discovery in Databases Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) Journal of Data Mining and Knowledge Discovery (1997) ACM SIGKDD conferences since 1998 and SIGKDD Explorations More conferences on data mining PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc. ACM Transactions on KDD starting in 2007
  • 9. What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data
  • 10. Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction Data/pattern analysis Data archeology Data dredging Information harvesting Business intelligence, etc
  • 11. Knowledge Discovery (KDD) Process Data mining—core of knowledge discovery process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 12. 1. Data cleaning (to remove noise and inconsistent data) 2. Data integration (where multiple data sources may be combined) 3. Data selection (where data relevant to the analysis task are retrieved fromthe database) 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance) April 13, 2021 Data Mining: Concepts and Techniques 12
  • 13. 5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns) 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures) 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user) April 13, 2021 Data Mining: Concepts and Techniques 13
  • 14. Architecture: Typical Data Mining System data cleaning, integration, and selection Database or Data Warehouse Server Data Mining Engine Pattern Evaluation Graphical User Interface Knowl edge- Base Database Data Warehouse World-Wide Web Other Info Repositories
  • 15. Database, data warehouse, WorldWideWeb, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data. April 13, 2021 Data Mining: Concepts and Techniques 15
  • 16. Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request. April 13, 2021 Data Mining: Concepts and Techniques 16
  • 17. Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources). April 13, 2021 Data Mining: Concepts and Techniques 17
  • 18. Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis. April 13, 2021 Data Mining: Concepts and Techniques 18
  • 19. Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push April 13, 2021 Data Mining: Concepts and Techniques 19
  • 20. User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms. April 13, 2021 Data Mining: Concepts and Techniques 20
  • 21. Data Mining and Business Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
  • 22. Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization
  • 23. Why Not Traditional Data Analysis? Tremendous amount of data High-dimensionality of data High complexity of data New and sophisticated applications
  • 24. Data Mining: Classification Schemes General functionality Descriptive data mining Predictive data mining Different views lead to different classifications Data view: Kinds of data to be mined Knowledge view: Kinds of knowledge to be discovered Method view: Kinds of techniques utilized Application view: Kinds of applications adapted
  • 25. Data Mining: On What Kinds of Data? a. Database-oriented data sets and applications Relational database Data warehouse Transactional database
  • 26. Relational Databases A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships. April 13, 2021 Data Mining: Concepts and Techniques 26
  • 27. A relational database for AllElectronics. The AllElectronics company is described by the following relation tables: customer, item, employee, and branch. Relational data can be accessed by database queries written in a relational query language, such as SQL, or with the assistance of graphical user interfaces. April 13, 2021 Data Mining: Concepts and Techniques 27
  • 28. April 13, 2021 Data Mining: Concepts and Techniques 28
  • 29. April 13, 2021 Data Mining: Concepts and Techniques 29
  • 30. A query allows retrieval of specified subsets of the data. Suppose that your job is to analyze the AllElectronics data. Through the use of relational queries, you can ask things like “Show me a list of all items that were sold in the last quarter.” Relational languages also include aggregate functions such as sum, avg (average), count, max (maximum), and min (minimum). April 13, 2021 Data Mining: Concepts and Techniques 30
  • 31. DataWarehouses A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. April 13, 2021 Data Mining: Concepts and Techniques 31
  • 32. April 13, 2021 Data Mining: Concepts and Techniques 32
  • 33. To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored to provide information from a historical perspective (such as from the past 5–10 years) and are typically summarized. April 13, 2021 Data Mining: Concepts and Techniques 33
  • 34. A data warehouse is usually modeled by a multidimensional database structure. Each dimension corresponds to an attribute or a set of attributes in the schema. Each cell stores the value of some aggregate measure, such as count or sales amount. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. A data cube provides a multidimensional view of data and allows the pre computation and fast accessing of summarized data. April 13, 2021 Data Mining: Concepts and Techniques 34
  • 35. A data cube for AllElectronics A data cube for summarized sales data of AllElectronics The cube has three dimensions: address (with city valuesChicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and item(with itemtype values home entertainment, computer, phone, security). The aggregate value stored in each cell of the cube is sales amount (in thousands). For example, the total sales forthefirstquarter,Q1, for items relating to security systems in Vancouver is $400,000, as stored in cell (Vancouver, Q1, security) April 13, 2021 Data Mining: Concepts and Techniques 35
  • 36. April 13, 2021 Data Mining: Concepts and Techniques 36
  • 37. Data warehouse systems are well suited for on-line analytical processing, or OLAP. Examples of OLAP operations include drill-down and roll- up, which allow the user to view the data at differing degrees of summarization, we can drill down on sales data summarized by quarter to see the data summarized by month. Similarly, we can roll up on sales data summarized by city to view the data summarized by country. 37
  • 38. What is the difference between a datawarehouse and a data mart?” A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide. A data mart, on the other hand, is a department subset of a data warehouse. It focuses on selected subjects, and thus its scope is department-wide. April 13, 2021 Data Mining: Concepts and Techniques 38
  • 39. Transactional Databases A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store). April 13, 2021 Data Mining: Concepts and Techniques 39
  • 40. “Show me all the items purchased by Sandy Smith” or “How many transactions include item number I3?” Answering such queries may require a scan of the entire transactional database. “Which items sold well together?” market basket data analysis would enable you to bundle groups of items together as a strategy for maximizing sales. data mining systems for transactional data can do so by identifying frequent itemsets. April 13, 2021 Data Mining: Concepts and Techniques 40
  • 41. b. Advanced data sets and advanced applications Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Object-relational databases Heterogeneous databases and legacy databases Spatial data and spatiotemporal data Multimedia database Text databases The World-Wide Web
  • 42. Two dimensional representation-table April 13, 2021 Data Mining: Concepts and Techniques 42
  • 43. Data cube Data cube April 13, 2021 Data Mining: Concepts and Techniques 43
  • 44. Dimension Modelling. April 13, 2021 Data Mining: Concepts and Techniques 44
  • 45. Data Mining Functionalities Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions. April 13, 2021 Data Mining: Concepts and Techniques 45
  • 46. Data Mining Functionalities Concept/Class Description: Characterization and Discrimination Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. April 13, 2021 Data Mining: Concepts and Techniques 46
  • 47. These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study in general terms, or (2) data discrimination, by comparison of the target class with one or a set of comparative classes or (3) both data characterization and discrimination. Data characterization is a summarization of the general characteristics or features of a target class of data. April 13, 2021 Data Mining: Concepts and Techniques 47
  • 48. For example, Data characterization. summarizing the characteristics of customers who spend more than $1,000 a year at AllElectronics. Data discrimination . compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period. April 13, 2021 Data Mining: Concepts and Techniques 48
  • 49. Data Mining Functionalities ii. Frequent patterns, association, correlation Frequent patterns, as the name suggests, are patterns that occur frequently in data. frequent patterns, including item sets, subsequences, and substructures. A frequent itemset :milk and bread (frequent) sequential pattern: customers tend to purchase first a PC, followed by a digital camera, and then a memory card (frequent)structured pattern: substructure occurs frequently Mining frequent patterns leads to the discovery of interesting associations and correlations within data. April 13, 2021 Data Mining: Concepts and Techniques 49
  • 50. Association analysis buys(X; “computer”))buys(X; “software”) [support = 1%; confidence = 50%] A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together. Association rules that contain a single predicate are referred to as single-dimensional association rules. April 13, 2021 Data Mining: Concepts and Techniques 50
  • 51. Association analysis age(X, “20:::29”)^income(X, “20K:::29K”))buys(X, “CD player”) [support = 2%, confidence = 60%] multidimensional association rule association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold. April 13, 2021 Data Mining: Concepts and Techniques 51
  • 52. Data Mining Functionalities iii. Classification and prediction Construct models (functions) that describe and distinguish classes or concepts for future prediction E.g., classify countries based on (climate), or classify cars based on (mileage) Predict some unknown or missing numerical values
  • 53. April 13, 2021 Data Mining: Concepts and Techniques 53
  • 54. iv. Cluster analysis e.g., cluster houses to find distribution patterns Maximizing intra-class similarity & minimizing interclass similarity April 13, 2021 Data Mining: Concepts and Techniques 54
  • 55. April 13, 2021 Data Mining: Concepts and Techniques 55
  • 56. v. Outlier analysis Outlier: Data object that does not comply with the general behavior of the data vi. Trend and evolution analysis Trend and deviation: e.g., regression analysis (eg: height & wait , price & demand) Sequential pattern mining: e.g., digital camera large SD memory Periodicity analysis Similarity-based analysis vii. statistical analyses
  • 57. Integration scheme includes No coupling Loose coupling Semi tight coupling Tight Coupling April 13, 2021 Data Mining: Concepts and Techniques 57
  • 58. No coupling No coupling means that a DM system will not utilize any function of a DB or DW system. Advantages Simple to use Drawbacks Without using a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment. Thus, no coupling represents a poor design. 58
  • 59. Loose coupling Loose coupling means that a DM system will use some facilities of a DB or DW system. Fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse. Advantages Loose coupling is better than no coupling. it can fetch any portion of data stored in databases or data warehouses by using query processing. It incurs some advantages of the flexibility, efficiency, and other features provided by such systems Disadvantages many loosely coupled mining systems are main memory-based. it is difficult for loose coupling to achieve high scalability and good performance with large data sets 59
  • 60. Semitight coupling Semitight coupling means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system. These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and pre computation of some essential statistical measures, such as sum, count, max, min, standard deviation, and so on. this design will enhance the performance of a DM system. April 13, 2021 Data Mining: Concepts and Techniques 60
  • 61. Tight coupling Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system. This approach is highly desirable because it facilitates efficient implementations of data mining functions, high system performance, and an integrated information processing environment. April 13, 2021 Data Mining: Concepts and Techniques 61
  • 62. Major Issues in Data Mining Mining different kinds of knowledge in databases. Interactive mining of knowledge at multiple levels of abstraction. Incorporation of background knowledge. Data mining query languages and ad hoc data mining. Presentation and visualization of data mining results. Handling noisy or incomplete data Pattern evaluation-the interestingness problem
  • 63. Major Issues in Data Mining Performance Issues Efficiency and scalability of data mining algorithms Parallel , distributed and incremental mining algorithms Issues relating to the diversity of database types Handling of relational and complex types of data. Mining information from heterogeneous database and global information systems April 13, 2021 Data Mining: Concepts and Techniques 63
  • 65. Definition Data Warehouse A collection of corporate information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations.
  • 66. The Purpose of Data Warehousing Realize the value of data Data / information is an asset Methods to realize the value, (Reporting, Analysis, etc.) Make better decisions Turn data into information Create competitive advantage Methods to support the decision making process(DSS)
  • 67. A data ware house refers to database that is maintained separately from an organization’s operational database. It allows integration of variety of application systems. It gives the opportunity for historical data analysis. April 13, 2021 Data Mining: Concepts and Techniques 67
  • 68. Subject oriented: A data ware house is subject oriented rather than Transaction oriented. For eg: customer, product, sales etc. Non volatile: data ware house is Physically separate ,So it not perishable. April 13, 2021 Data Mining: Concepts and Techniques 68
  • 69. Integrated: A data ware house is constructed by integrating heterogeneous sources such as rdbms, files,OLTPfiles. data cleaning and data integration techniques are used for ensuring consistency in naming conventions, encoding structures, attribute measures and so on. Time variant: Data is stored in historical perspective (5- 10years). Implicit or explicit time variant will be there in constructing data ware house. April 13, 2021 Data Mining: Concepts and Techniques 69
  • 70. MULTI DIMENSIONAL DATA MODEL Two dimensional representation-table April 13, 2021 Data Mining: Concepts and Techniques 70
  • 71. Data cube April 13, 2021 Data Mining: Concepts and Techniques 71
  • 72. A multi-dimensional structure called the data cube. It is a data abstraction that allows one to view aggregated data from a number of perspectives. Data Ware house implementation Data Cube
  • 73. OLAP Operations. OLAP Operations are used for retrieving data in a simplified manner from data cube for analysis. Slicing: Reducing the data cube by one or more dimensions. Slice time=’Q2’ C[quarter,city,product]=C[city, product] April 13, 2021 Data Mining: Concepts and Techniques 73
  • 74. Dicing: This operation is for selecting a smaller data cube and analyzing it from different perspectives (selection criteria). Eg: dice time=’Q1 or Q2 and location =’Mumbai’ or ‘Pune’ C[Quarter,city,product]=C[quarter,city,product] April 13, 2021 Data Mining: Concepts and Techniques 74
  • 75. Drilling: Moving up and down along classification hierarchies Drill up(Roll Up): This means switching from detailed to an aggregated with in same classification hierarchy. Eg RollUp=C[quarter,city,product]=C[quarter, province. product] April 13, 2021 Data Mining: Concepts and Techniques 75
  • 76. April 13, 2021 Data Mining: Concepts and Techniques 76
  • 77. Drill Down : This is concerned with switching from aggregated to detailed level. For eg: day->month->quarter->year, April 13, 2021 Data Mining: Concepts and Techniques 77
  • 78. Integration of Data Mining and Data Warehousing: Data warehouse provides clean, integrated data for fruitful mining. Data mining provides powerful tools for analysis of data stored in data warehouses. Data mining provides more analysis tools, e.g., - association, - classification, - clustering, - pattern-directed, and - trend analysis.
  • 79. Data mining: the extraction of hidden predictive information from large DB. Data might be one of the most valuable assets of your corporation - but only if you know how to reveal valuable knowledge hidden in raw data. Data mining allows you to extract diamonds of knowledge from your historical data and predict outcomes of future situations.
  • 80. The actual need of data warehouse is - To store heterogeneous data for managerial decision purpose. - To store data in various dimensions with in a data warehouse. -it is easy to analyze the data and to take decisions. - A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management and decision-making process.
  • 81. 81 3- Tier Data Warehouse Architecture
  • 82. 82 3-Tier Data Warehouse Architecture Data ware house adopt a three tier architecture. These 3 tiers are: Bottom Tier Middle Tier Top Tier
  • 83. 83
  • 84. 84 Data Sources: All the data related to any bussiness organization is stored in operational databases, external files and flat files. These sources are application oriented Eg: complete data of organization such as training detail, customer detail, sales, departments, transactions, employee detail etc. Data present here in different formats or host format Contain data that is not well documented
  • 85. 85 Bottom Tier: Data warehouse server Data Warehouse server fetch only relevant information based on data mining (mining a knowledge from large amount of data) request. Eg: customer profile information provided by external consultants. Data is feed into bottom tier by some backend tools and utilities.
  • 86. 86 Backend Tools & Utilities: Functions performed by backend tools and utilities are: Data Extraction Data Cleaning Data Transformation Load Refresh
  • 87. 87 Bottom Tier Contains: Data warehouse Metadata Repository Data Marts Monitoring and Administration
  • 88. Data Warehouse: It is an optimized form of operational database contain only relevant information and provide fast access to data. Subject oriented Eg: Data related to all the departments of an organization Integrated: Different views Single unified of data view Time – variant Nonvolatile A B C Warehous e
  • 89. Metadata repository: It figure out that what is available in data warehouse. It contains: Structure of data warehouse Data names and definitions Source of extracted data Algorithm used for data cleaning purpose Sequence of transformations applied on data Data related to system performance
  • 90. Data Marts: Subset of data warehouse contain only small slices of data warehouse Eg: Data pertaining to the single department Two types of data marts: Dependent Independent sourced directly sourced from one or from data warehouse more data sources
  • 91. Monitoring & Administration: Data Refreshment Data source synchronization Disaster recovery Managing access control and security Manage data growth, database performance Controlling the number & range of queries Limiting the size of data warehouse
  • 93. Middle Tier: OLAP Server It presents the users a multidimensional data from data warehouse or data marts. Typically implemented using two models: ROLAP Model MOLAP Model Present data in Present data in array relational tables based structures means map directly to data cube array structure.
  • 94. Top Tier: Front end tools It is front end client layer. Query and reporting tools Reporting Tools: Production reporting tools Report writers Managed query tools: Point and click creation of SQL used in customer mailing list. Analysis tools : Prepare charts based on analysis Data mining Tools: mining knowledge, discover hidden piece of information, new correlations, useful pattern
  • 95. Data Pre Processing Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing. Data preprocessing is used database-driven applications such as customer relationship management and rule-based applications.
  • 96. Data Preprocessing Preprocess Steps Data cleaning Data integration Data transformation Data reduction
  • 97. Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“ ”
  • 98. noisy: containing errors or outliers e.g., Salary=“-10” inconsistent: containing discrepancies in codes or names
  • 99. Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility
  • 100. Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results
  • 101. Forms of data preprocessing
  • 102. Data Cleaning Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data
  • 103. Missing Data Data is not always available E.g., many tuples have no recorded value for several attributes, such as customer income in sales data Missing data may be due to equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time of entry not register history or changes of the data Missing data may need to be inferred.
  • 104. How to Handle Missing Data? Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably. Fill in the missing value manually: tedious + infeasible? Use a global constant to fill in the missing value: e.g., “unknown”, a new class?! Use the attribute mean to fill in the missing value Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree
  • 105. Noisy Data Noise: random error or variance in a measured variable Incorrect attribute values may due to faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention Other data problems which requires data cleaning duplicate records incomplete data inconsistent data
  • 106. How to Handle Noisy Data? Binning method: first sort data and partition into (equi-depth) bins then one can smooth by bin means, smooth by bin median, smooth by boundaries
  • 107. Cluster Analysis Clustering: detect and remove outliers
  • 108. Regression x y y = x + 1 X1 Y1 Y1’ Regression: smooth by fitting the data into regression functions
  • 109. Data cleaning as a process Discrepancy detection Use meta data Field overloading Unique rules Consecutive rules Null rules April 13, 2021 15
  • 110. Data Integration Data integration: combines data from multiple sources. Schema integration integrate metadata from different sources Entity identification problem: identify real world entities from multiple data sources, e.g.,A.cust-id ≡ B.cust-# Detecting and resolving data value conflicts for the same real world entity, attribute values from different sources are different possible reasons: different representations, different scales, e.g., metric vs. British units
  • 111. Handling Redundant Data in Data Integration Redundant data occur often when integration of multiple databases The same attribute may have different names in different databases One attribute may be a “derived” attribute in another table. Redundant data may be able to be detected by correlational analysis Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality Correlation analysis
  • 112. Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization normalization by decimal scaling Attribute/feature construction New attributes constructed from the given ones
  • 113. Data Transformation: Normalization min-max normalization Min-max normalization performs a linear transformation on the original data. Suppose that mina and maxa are the minimum and the maximum values for attribute A. Min-max normalization maps a value v of A to v’ in the range [new-mina, new-maxa] by computing: v’= ( (v-mina) / (maxa – mina) ) * (new-maxa – newmina)+ new-mina
  • 114. Data Transformation: Normalization Z-score Normalization: In z-score normalization, attribute A are normalized based on the mean and standard deviation of A. a value v of A is normalized to v’ by computing: v’ = ( ( v –A ) / µA ) where and A are the mean and the standard deviation respectively of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown.
  • 115. Data Transformation: Normalization Normalization by Decimal Scaling Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. a value v of A is normalized to v’ by computing: v’ = ( v / 10j ).Where j is the smallest integer such that Max(|v’|)1.
  • 116. Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set. 22 Data reduction
  • 117. Data reduction strategies Data cube aggregation Attribute subset selection Dimensionality reduction Numerosity reduction Discretization and concept hierarchy generation
  • 118. Data cube aggregation aggregation operations are applied to the data in the construction of a data cube. This is achieved by aggregation operations on data cube. April 13, 2021 24
  • 119. April 13, 2021 25 Attribute subset selection Irrelevant ,weakly relevant or redundant attributes or dimensions may be detected and removed. Stepwise forward selection: Stepwise backward elimination Combination of forward selection and backward elimination: Decision tree induction:
  • 120. Dimensionality reduction Encoding mechanisms are used to reduce the data size. Wavelet transforms The discrete wavelet transform(DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X0, of wavelet coefficients. Principal components analysis, or PCA Unlike attribute subset selection, which reduces the attribute set size by retaining a subset of the initial set of attributes, PCA “combines” the essence of attributes by creating an alternative, smaller set of variables. Data compression
  • 121. Numerosity reduction The data are replaced or estimated by alternative, smaller data representations such as parametric models(which need to store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling and the use of histograms. Regression and Log-Linear Models Histograms Clustering Sampling Data compression
  • 122. Regression and Log-Linear Models Regression and log-linear models can be used to approximate the given data. linear regression, the data are modeled to fit a straight line. y (called a response variable) X (called a predictor variable) y = wx+b Log-linear models approximate discrete multidimensional probability distributions. This allows a higher-dimensional data space to be constructed from lower dimensional spaces. Log-linear models are therefore also useful for dimensionality reduction April 13, 2021 28
  • 123. Histograms Histograms use binning to approximate data distributions and are a popular form of The following data are a list of prices of commonly sold items at AllElectronics(rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5,8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30. April 13, 2021 29
  • 125. Sampling Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data. April 13, 2021 31
  • 126. Sampling Simple random sample without replacement (SRSWOR) of size s Simple random sample with replacement (SRSWR) of size s Cluster sample Stratified sample April 13, 2021 32
  • 127. Data Discretization and Concept Hierarchy Generation Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Supervised discretization Unsupervised discretization Top-down discretization or splitting Bottom-up discretization or merging April 13, 2021 33
  • 128. Data Discretization and Concept Hierarchy Generation A concept hierarchy for a given numerical attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior). April 13, 2021 34