Data Mining-PART II
By
M.Dhilsath Fathima
DATA MINING Task/Functions
• Classification
• Clustering
• Outlier analysis
• Association
• Prediction/Regression
CLASSIFICATION
• Classification is a data mining (machine
learning) technique used to predict the target
class for each case in the data.
• For example, you may wish to use classification
to predict whether the weather on a particular
day will be “sunny”, “rainy” or “cloudy”.
• For example, a classification model could be
used to identify loan applicants as low,
medium, or high credit risks.
• Popular classification techniques include
decision trees and neural networks.
CLASSIFICATION
CLASSIFICATION-Example
Clustering
• Classification is supervised learning: the supervision comes from
labeling the instances with the class.
• Clustering is unsupervised learning -- there are no predefined
class labels, no training set.
• So our clustering algorithm needs to assign a cluster to each
instance such that objects in the same cluster are more similar
to each other than to objects in other clusters.
Clustering
• Finding groups of objects such that the objects in a group will be similar
(or related) to one another and different from (or unrelated to) the
objects in other groups
• The goal is to find the most 'natural' groupings of the instances.
- Within a cluster: Maximize similarity between instances.
- Between clusters: Minimize similarity between instances.
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
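As a rough illustration of the goal above, the following Python sketch groups one-dimensional values with a basic k-means loop. k-means is not named in the slides; it is used here only as one common way to keep intra-cluster distances small. The function name, the data, and the starting centers are made up for the example.

```python
def kmeans_1d(values, centers, rounds=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest center (keeps intra-cluster distances small).
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center as the mean of its cluster; keep the old center if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers


# Hypothetical data: two visibly separate groups of values.
values = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
clusters, centers = kmeans_1d(values, centers=[0.0, 5.0])
```

With this data the low values and the high values end up in separate clusters. The result of k-means generally depends on the starting centers, so this is a sketch rather than a robust implementation.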
OUTLIER ANALYSIS
[Figure: two groups of points labeled Cluster 1 and Cluster 2, with outliers lying apart from both]
What is an Outlier?
ASSOCIATION
• An association rule has two parts, an antecedent (if) and a
consequent (then). An antecedent (preceding in time or
order) is an item found in the data. A consequent (the
second part of a conditional proposition, i.e. the result) is an item
that is found in combination with the antecedent.
• Association rules are created by analyzing data for
frequent if/then patterns and using the
criteria support and confidence to identify the most
important relationships. Support indicates how
frequently the items appear in the
database. Confidence indicates how often the
if/then statement has been found to be true, that is, the
proportion of records containing the antecedent that also
contain the consequent.
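The two criteria can be sketched in a few lines of Python. This sketch assumes the usual definitions: support is the fraction of transactions that contain all the items, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The baskets and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    # Assumes the antecedent occurs at least once; otherwise this divides by zero.
    return both / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "jam"},
    {"bread", "jam", "milk"},
    {"bread", "milk"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "jam"})          # 2 of the 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"jam"})  # 2 of the 3 bread baskets
```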
ASSOCIATION(Cont..)
• In data mining, association rules are useful for
analyzing and predicting customer behavior.
They play an important part in shopping
basket data analysis, product clustering,
catalog design and store layout.
• Form: A → B (antecedent → consequent)
• Examples of associated items: {Bread, Jam}, {Computer, Printer}
Applications of Data Mining
Data Mining Applications in Sales/Marketing – Ex. for Association
• Discover consumer groups based on their purchasing
habits, which helps in planning and launching new
marketing campaigns in a prompt and cost-effective way.
• Data mining is used for market basket analysis to provide
information on which product combinations were
purchased together, when they were bought, and in what
sequence.
Data Mining Applications in Banking – Ex. for Classification
• Data mining is used to identify customer loyalty by analyzing
data on customers' purchasing activities, such as the
frequency of purchases in a period of time, the total monetary
value of all purchases, and the date of the last purchase. After
analyzing those dimensions, a relative score is generated
for each customer. The higher the score, the more
loyal the customer is.
• Data mining is also applied to help banks retain credit card
customers. By analyzing past data, data mining can help banks
predict which customers are likely to change their credit card
affiliation, so they can plan and launch special offers to
retain those customers.
Data Mining Applications in Banking – Ex. for Clustering
• Given:
– A source of textual documents
– A similarity measure, e.g., how many words are common in these documents
• Find:
– Several clusters of documents that are relevant to each other
[Figure: documents from the source pass through a clustering system, which uses the similarity measure to group them]
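A minimal Python sketch of this setup follows. The similarity measure is the one mentioned above (the number of words two documents have in common). The greedy grouping rule, the threshold `min_common`, and the sample documents are assumptions chosen only to keep the example short; they are not taken from the slides.

```python
def common_words(doc_a, doc_b):
    """Similarity measure: number of distinct words two documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


def cluster_documents(docs, min_common=2):
    """Greedy grouping: add a document to the first cluster whose first member shares enough words."""
    clusters = []
    for doc in docs:
        for cluster in clusters:
            if common_words(doc, cluster[0]) >= min_common:
                cluster.append(doc)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([doc])
    return clusters


# Hypothetical documents.
docs = [
    "loan interest rate",
    "credit card interest rate",
    "football match score",
    "football score update",
]
doc_clusters = cluster_documents(docs)
```

Here the two finance documents share "interest" and "rate" and the two sports documents share "football" and "score", so two clusters result.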
Association Rules 
• A common application
is market basket
analysis, which helps to
(1) find items that are frequently
sold together at a
supermarket
(2) decide how to arrange items on
shelves and which items
should be promoted
together
DATA PREPROCESSING
Define-Data Preprocessing
• Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
• Data pre-processing is an important step in
the data mining process.
• The product of data pre-processing is the
final training set.
Why Data Preprocessing?
• Data in the real world is dirty:
– Noisy: containing errors or outliers
– Incomplete: missing values, lacking attribute values
– Inconsistent data
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, 
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or 
similar analytical results
Forms of data preprocessing
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
What is Missing Data?
• Data is not always available
– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data that was inconsistent with other recorded data and was therefore deleted
– data not entered due to misunderstanding
– certain data not being considered important at the time of entry
– history or changes of the data not being recorded
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing
 (may be acceptable for a large data set)
• Fill in the missing value manually: tedious and infeasible for a large
database
• Use a global constant to fill in the missing value
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
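Two of the options above, filling with a global constant and filling with the attribute mean, can be sketched as follows. Missing values are represented by `None`; that convention, the function names, and the income figures are assumptions for illustration.

```python
def fill_with_constant(values, constant):
    """Replace missing entries (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    # Assumes at least one value is known; otherwise there is no mean to use.
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical customer incomes; None marks a missing value.
incomes = [30000, None, 50000, 40000, None]
by_constant = fill_with_constant(incomes, 0)
by_mean = fill_with_mean(incomes)
```

A global constant is simple but can distort later statistics, while the mean keeps the attribute's average unchanged.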
Noisy Data/Outlier
• Noise: random error or variance in a measured
variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– inconsistency in naming convention
– duplicate records
– incomplete data
– inconsistent data
OUTLIER
• Data objects or observations that do not
comply with the general behavior or model of
the data. Such data objects, which are grossly
different from or inconsistent with the
remaining set of data, are called outliers.
• A data object that deviates significantly from
the normal objects as if it were generated by a
different mechanism.
How to Handle Noisy Data?
(Not Now)
• Binning method
• Clustering
• Combined computer and human
inspection
• Regression
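As one possible illustration of the binning method, the sketch below smooths values by bin means: the sorted values are split into bins of equal size and each value is replaced by the mean of its bin. Smoothing by bin means is only one variant of binning, and the bin size and data here are assumptions.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size`, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Hypothetical measured values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
```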
Data integration and transformation
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
Three problems involved in data integration:
– Schema integration
– Detecting and resolving data value conflicts
– Redundant data, which often occur when multiple databases are integrated
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
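The three normalization methods can be sketched in Python as follows, assuming their standard textbook definitions. The function names and the sample data are hypothetical, and the z-score version uses the population standard deviation, which is one of two common choices.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    # Assumes the values are not all identical (hi > lo).
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer that makes every |v| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


data = [200, 400, 600, 800, 1000]
scaled = min_max(data)           # values now lie in [0, 1]
standardized = z_score(data)     # mean 0, standard deviation 1
decimal = decimal_scaling(data)  # divided by 10**4, so every |v| < 1
```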
DATA REDUCTION
Data Reduction Strategies
• A warehouse may store terabytes of data: complex data
analysis/mining may take a very long time to run on the
complete data set
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the
same) analytical results
• Data reduction strategies
– Data cube aggregation (e.g., construction of a data cube)
– Numerosity reduction (e.g., generating histograms)
– Concept hierarchy generation
Data Cube Aggregation
• The lowest level of a data cube
– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve the
task
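A small Python sketch of moving between aggregation levels, using the phone-call example from the slide. The record layout, field names, and figures are hypothetical; the point is only that a higher aggregation level has fewer cells than the level below it.

```python
from collections import defaultdict


def roll_up(records, keep):
    """Sum the measure over every dimension except those listed in `keep`."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[dim] for dim in keep)
        totals[key] += record["minutes"]
    return dict(totals)


# Hypothetical call records: one row per customer per month.
calls = [
    {"customer": "A", "year": 2023, "month": 1, "minutes": 120.0},
    {"customer": "A", "year": 2023, "month": 2, "minutes": 80.0},
    {"customer": "B", "year": 2023, "month": 1, "minutes": 45.0},
    {"customer": "B", "year": 2024, "month": 1, "minutes": 60.0},
]
per_customer_year = roll_up(calls, keep=("customer", "year"))  # 3 cells
per_customer = roll_up(calls, keep=("customer",))              # 2 cells
```

If the task only needs yearly totals per customer, the three-cell summary is enough and the monthly rows need not be scanned.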
Numerosity reduction-Histograms
• A popular data reduction
technique
• Divide data into buckets
and store average (sum) for
each bucket
• Can be constructed
optimally in one dimension
using dynamic
programming
• Related to quantization
problems.
[Figure: histogram with x-axis ticks from 10000 to 90000 and y-axis ticks from 0 to 40]
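The bucket idea can be sketched with an equal-width histogram that stores only a count and a sum per bucket. Equal-width buckets are an assumption made for simplicity (the optimal construction by dynamic programming mentioned above is not shown), and the data and names are hypothetical.

```python
def equal_width_histogram(values, low, high, buckets):
    """Split [low, high] into equal-width buckets; store the count and sum of each bucket."""
    width = (high - low) / buckets
    summary = [{"count": 0, "sum": 0.0} for _ in range(buckets)]
    for v in values:
        # Assumes low <= v <= high; clamp so that v == high falls in the last bucket.
        index = min(int((v - low) / width), buckets - 1)
        summary[index]["count"] += 1
        summary[index]["sum"] += v
    return summary


# Hypothetical prices, summarized in four buckets of width 25.
prices = [5, 12, 18, 30, 33, 47, 55, 61, 90, 100]
hist = equal_width_histogram(prices, low=0, high=100, buckets=4)
```

Ten values are reduced to four (count, sum) pairs, from which a per-bucket average can be derived as sum / count.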
Numerosity reduction-Clustering
• Partition data set into clusters, and one can store cluster
representation only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms, further detailed in Chapter 8
Sampling
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Choose a representative subset of the data
– Simple random sampling may have very poor performance
in the presence of skew
• Develop adaptive sampling methods
– Stratified sampling:
• Approximate the percentage of each class (or
subpopulation of interest) in the overall database
• Used in conjunction with skewed data
Sampling
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
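The sampling schemes named above can be sketched with Python's standard `random` module. The helper names, the skewed customer data, and the rule of keeping at least one item per stratum are assumptions for illustration.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample WITH replacement: an item may be drawn more than once."""
    return rng.choices(data, k=n)


def stratified(data, key, fraction, rng):
    """Sample the same fraction from each stratum so class proportions are roughly preserved."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Take at least one item so that small strata are not lost.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(0)  # fixed seed, only to make the example repeatable
# Hypothetical skewed data: 90 "young" customers and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
without = srswor(customers, 10, rng)
with_repl = srswr(customers, 10, rng)
strat = stratified(customers, key=lambda c: c[0], fraction=0.1, rng=rng)
```

A simple random sample of 10 may contain no senior customers at all, whereas the stratified sample always contains 9 young and 1 senior customer for this data.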
Concept hierarchy
• Arrangement of concepts such as time and location.
– Reduces the data by collecting and replacing low-level
concepts (such as numeric values for the
attribute age) with higher-level concepts (such as
young, middle-aged, or senior).
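The age example can be sketched as a simple mapping from numeric values to higher-level concepts. The cut-off ages below (30 and 60) are illustrative assumptions; the slides do not specify them.

```python
def age_group(age):
    """Map a numeric age to a higher-level concept (cut-offs are illustrative assumptions)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


# Hypothetical ages: six distinct numbers collapse into three concept values.
ages = [23, 35, 47, 61, 29, 70]
groups = [age_group(a) for a in ages]
```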
Data Warehouse Usage/Applications of Data Warehouse

Unit 3 part ii Data mining

  • 2. DATA MINING Task/Functions • Classification • Clustering • Outlier analysis • Association • Prediction/Regression
  • 3. CLASSIFICATION • Classification is a data mining (machine learning) technique used to predict the target class for each case in the data. • For example, you may wish to use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”. • For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks. • Popular classification techniques include decision trees and neural networks.
  • 6. Clustering • Classification is supervised learning the supervision comes from labeling the instances with the class. • Clustering is unsupervised learning -- there are no predefined class labels, no training set. • So our clustering algorithm needs to assign a cluster to each instance such that all objects with the same cluster are more similar than others.
  • 7. Clustering • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups • The goal is to find the most 'natural' groupings of the instances. - Within a cluster: Maximize similarity between instances. - Between clusters: Minimize similarity between instances. Inter-cluster distances are maximizedIntra-cluster distances are minimized
  • 9. What is an Outlier?
  • 10. ASSOCIATION • An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent (preceding in time or order) is an item found in the data. A consequent (the second part of a conditional proposition, i.e. the result) is an item that is found in combination with the antecedent. • Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the database. Confidence indicates how often the if/then statement has been found to be true, i.e. the fraction of transactions containing the antecedent that also contain the consequent.
  • 11. ASSOCIATION (Cont.) • In data mining, association rules are useful for analyzing and predicting customer behavior. They play an important part in shopping basket data analysis, product clustering, catalog design and store layout. • Form: A → B • Ex. for association: {Bread, Jam}, {Computer, Printer}, where the first item is the antecedent and the second is the consequent
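The support and confidence criteria described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `transactions` list and the function names `support` and `confidence` are assumptions made for the example, not part of the slides.

```python
# Hypothetical toy market-basket data (not from the slides).
transactions = [
    {"bread", "jam", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule: Bread -> Jam
rule_support = support({"bread", "jam"}, transactions)
rule_confidence = confidence({"bread"}, {"jam"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data, bread and jam appear together in 2 of 4 baskets, and jam appears in 2 of the 3 baskets that contain bread.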
  • 13. Data Mining Applications in Sales/Marketing- Ex For Association • Discover consumer groups based on their purchasing habits, thus helping in planning and launching new marketing campaigns in a prompt and cost-effective way. • Data mining is used for market basket analysis to provide information on which product combinations were purchased together, when they were bought, and in what sequence.
  • 14. Data Mining Applications in Banking –Ex For Classification • Data mining is used to identify customer loyalty by analyzing data on customers' purchasing activities, such as the frequency of purchase in a period of time, the total monetary value of all purchases, and when the last purchase was made. After analyzing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is in relative terms. • To help the bank retain credit card customers, data mining is applied. By analyzing past data, data mining can help banks predict which customers are likely to change their credit card affiliation, so they can plan and launch special offers to retain those customers.
  • 15. Data Mining Applications in Banking –Ex For Clustering • Given: – A source of textual documents – Similarity measure • e.g., how many words are common in these documents • Find: • Several clusters of documents that are relevant to each other [Figure: a documents source and a similarity measure feed a clustering system, which outputs groups of documents]
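The "words in common" similarity measure above can be sketched as follows. The slides do not name a clustering algorithm, so the greedy "leader" grouping here is a stand-in chosen for brevity. The documents, the `min_common` threshold and the function names are all hypothetical.

```python
def common_words(doc_a, doc_b):
    """Similarity measure: number of distinct words the two documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

def cluster_documents(docs, min_common=2):
    """Greedy grouping (an assumption, not the slides' method): put each
    document in the first cluster whose first document shares at least
    `min_common` words with it; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        for cluster in clusters:
            leader = docs[cluster[0]]
            if common_words(doc, leader) >= min_common:
                cluster.append(i)
                break
        else:  # no existing cluster was similar enough
            clusters.append([i])
    return clusters

# Hypothetical documents, for illustration only.
docs = [
    "loan interest rate bank",
    "bank loan credit rate",
    "football match goal score",
    "goal score football league",
]
clusters = cluster_documents(docs)
print(clusters)
```

The result depends on document order and on the threshold; a real system would more likely use a weighted similarity and a standard clustering algorithm.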
  • 16. Association Rules • A common application is market basket analysis, which (1) identifies items that are frequently sold together at a supermarket and (2) helps in arranging items on shelves and deciding which items should be promoted together
  • 18. Define-Data Preprocessing • Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. • Data pre-processing is an important step in the data mining process. • The product of data pre-processing is the final training set.
  • 19. Why Data Preprocessing? • Data in the real world is dirty: – Noisy: containing errors or outliers – Incomplete: missing values, lacking attribute values – Inconsistent data • No quality data, no quality mining results! – Quality decisions must be based on quality data – Data warehouse needs consistent integration of quality data
  • 20. Major Tasks in Data Preprocessing • Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration – Integration of multiple databases, data cubes, or files • Data transformation – Normalization and aggregation • Data reduction – Obtains a reduced representation in volume but produces the same or similar analytical results
  • 21. Forms of data preprocessing
  • 22. Data Cleaning • Data cleaning tasks – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data
  • 23. What is Missing Data? • Data is not always available – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to – equipment malfunction – data that was inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data not being considered important at the time of entry – history or changes of the data not being registered
  • 24. How to Handle Missing Data? • Ignore the tuple: usually done when the class label is missing (can be applicable for a large data set) • Fill in the missing value manually: tedious and infeasible for a large database • Use a global constant to fill in the missing value • Use the attribute mean to fill in the missing value • Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
  • 25. Noisy Data/Outlier • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to – faulty data collection instruments – data entry problems – data transmission problems – inconsistency in naming convention – duplicate records – incomplete data – inconsistent data
  • 26. OUTLIER • Data objects or observations that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers. • A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
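Three of the strategies above (ignoring the tuple, filling with a global constant, filling with the attribute mean) can be sketched in plain Python. The `records` data, the use of `None` as the missing-value marker and the function names are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing income value.
records = [
    {"customer": "A", "income": 30000},
    {"customer": "B", "income": None},
    {"customer": "C", "income": 50000},
    {"customer": "D", "income": None},
]

def ignore_tuples(rows, attr):
    """Drop every tuple whose value for `attr` is missing."""
    return [r for r in rows if r[attr] is not None]

def fill_with_constant(rows, attr, constant):
    """Replace missing values of `attr` with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_with_mean(rows, attr):
    """Replace missing values of `attr` with the mean of the known values."""
    attr_mean = mean(r[attr] for r in rows if r[attr] is not None)
    return fill_with_constant(rows, attr, attr_mean)

filled = fill_with_mean(records, "income")
print(filled)
```

The inference-based option (Bayesian formula or decision tree) would need a trained model and is left out of this sketch.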
  • 27. How to Handle Noisy Data? (Not Now) • Binning method • Clustering • Combined computer and human inspection • Regression
  • 28. Data integration and transformation
  • 29. Data Integration • Data integration: – combines data from multiple sources into a coherent store • Three problems involved in data integration: – Schema integration – Detecting and resolving data value conflicts – Redundant data, which often occur when integrating multiple databases
  • 30. Data Transformation • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: scaled to fall within a small, specified range – min-max normalization – z-score normalization – normalization by decimal scaling
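The three normalization methods listed above can be sketched as follows. The function names and the sample `incomes` are hypothetical. The z-score here uses the population standard deviation, which is one of two common choices. The sketch assumes the input values are not all equal (and, for decimal scaling, not all zero).

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making every |v| < 1."""
    biggest = max(abs(v) for v in values)
    j = 0
    while biggest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical attribute values, for illustration only.
incomes = [12000, 35000, 73600, 98000]
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling(incomes))
```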
  • 32. Data Reduction Strategies • Warehouse may store terabytes of data: Complex data analysis/mining may take a very long time to run on the complete data set • Data reduction – Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results • Data reduction strategies – Data cube aggregation (e.g., construction of a data cube) – Numerosity reduction (e.g., generating histograms) – Concept hierarchy generation
  • 33. Data Cube Aggregation • The lowest level of a data cube – the aggregated data for an individual entity of interest – e.g., a customer in a phone calling data warehouse. • Multiple levels of aggregation in data cubes – Further reduce the size of data to deal with • Reference appropriate levels – Use the smallest representation which is enough to solve the task
  • 34. Numerosity reduction-Histograms • A popular data reduction technique • Divide data into buckets and store average (sum) for each bucket • Can be constructed optimally in one dimension using dynamic programming • Related to quantization problems. [Figure: histogram with axis ticks from 0 to 40 and from 10000 to 90000]
  • 35. Numerosity reduction-Clustering • Partition data set into clusters, and one can store cluster representation only • Can be very effective if data is clustered but not if data is “smeared” • Can have hierarchical clustering and be stored in multi-dimensional index tree structures • There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
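A minimal sketch of the bucket idea, assuming simple equal-width buckets rather than the optimal dynamic-programming construction mentioned above. Each bucket keeps only a count, a sum and a mean in place of the raw values. The function name and the sample `values` are hypothetical.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize `values` with equal-width buckets (count, sum, mean)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(num_buckets)
    ]
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        i = min(int((v - lo) / width), num_buckets - 1)
        buckets[i]["count"] += 1
        buckets[i]["sum"] += v
    for b in buckets:
        b["mean"] = b["sum"] / b["count"] if b["count"] else None
    return buckets

# Hypothetical data, for illustration only.
values = [10, 12, 18, 22, 25, 31, 38, 40]
hist = equal_width_histogram(values, 3)
for b in hist:
    print(b)
```

Eight raw values are reduced to three bucket summaries; the trade-off is that individual values inside a bucket can no longer be recovered.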
  • 36. Sampling • Allow a mining algorithm to run in complexity that is potentially sub-linear in the size of the data • Choose a representative subset of the data – Simple random sampling may have very poor performance in the presence of skew • Develop adaptive sampling methods – Stratified sampling: • Approximate the percentage of each class (or subpopulation of interest) in the overall database • Used in conjunction with skewed data
  • 38. Concept hierarchy • Arrangement of concepts such as time and location. – Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
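Stratified sampling as described above can be sketched as follows: group the rows by class, then draw the same fraction from each group so the sample roughly preserves the class percentages. The function name, the `key` and `fraction` parameters, the fixed seed and the skewed toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample `fraction` of each stratum, where `key(row)` names the stratum.
    Every stratum contributes at least one row."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data: 90 "low" risk rows and 10 "high" risk rows.
rows = [("low", i) for i in range(90)] + [("high", i) for i in range(10)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.1)
print(len(sample))
```

A simple random sample of 10 rows from this data could easily contain no "high" rows at all, which is the skew problem the slide mentions.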
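The age example above can be sketched as a small mapping from numeric values to higher-level concepts. The cut-off ages (35 and 60) and the sample `ages` are hypothetical; the slides do not specify them.

```python
from collections import Counter

def age_to_concept(age):
    """Map a numeric age to a higher-level concept (cut-offs are assumed)."""
    if age < 35:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

# Hypothetical ages, for illustration only.
ages = [22, 25, 31, 40, 47, 58, 63, 71]
summary = Counter(age_to_concept(a) for a in ages)
print(summary)
```

Eight distinct numeric values collapse into three concept counts, which is the data reduction the concept hierarchy provides.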