DATA PREPROCESSING
INTRODUCTION
• Data preprocessing is a data mining technique that transforms raw data into an understandable format.
• Real-world data is often
  - Inconsistent
  - Insufficient
  - Lacking in certain behaviors or trends, and likely to contain many errors.
Cleaning
• Data cleaning is the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database. It refers to identifying
  - Incomplete
  - Incorrect
  - Inaccurate
  - Irrelevant
  parts of the data and replacing or correcting them.
Data Integration and Transformation
• In computing, data transformation is the process of converting data from one format or structure into another format or structure.
• It is a fundamental aspect of most data integration and data management tasks, such as
  - Data wrangling
  - Data integration
  - Application integration
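As a minimal sketch of converting data from one structure into another, the snippet below parses CSV text into typed records and then serializes them as JSON. The column names (`id`, `amount`) and the sample rows are assumptions made up for illustration, not part of the slides.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Hypothetical schema: an integer id and a floating-point amount.
        records.append({"id": int(row["id"]), "amount": float(row["amount"])})
    return records


# Source format: flat CSV text. Target format: a JSON array of objects.
raw = "id,amount\n1,10.5\n2,7.25\n"
records = csv_to_records(raw)
as_json = json.dumps(records)
print(as_json)
```

The same pattern (parse, convert types, re-serialize) applies to other format pairs; only the reader and writer change.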
Data Reduction
• Data reduction is the transformation of numerical or alphabetical digital information, derived empirically or experimentally, into a corrected, ordered, and simplified form.
• The basic concept is to reduce a multitudinous amount of data down to its meaningful parts.
Cube
• A data cube is generally used to make data easy to interpret. It is especially useful for representing data together with dimensions as measures of a business requirement.
• Every dimension of a cube represents a certain characteristic of the database, for example daily, monthly, or yearly sales.
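The idea of viewing one measure along several dimensions can be sketched in a few lines of Python. The fact rows, the dimension order (year, month, region), and the `roll_up` helper are all hypothetical; a real cube would normally live in a data warehouse or OLAP engine.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, sales amount).
sales = [
    (2023, 1, "north", 100.0),
    (2023, 1, "south", 50.0),
    (2023, 2, "north", 70.0),
    (2024, 1, "north", 30.0),
]


def roll_up(rows, dims):
    """Sum the sales measure over the chosen dimension indexes."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in dims)
        totals[key] += row[3]
    return dict(totals)


monthly = roll_up(sales, dims=(0, 1))   # aggregate by (year, month)
yearly = roll_up(sales, dims=(0,))      # roll up from months to years
by_region = roll_up(sales, dims=(2,))   # aggregate along the region dimension
```

Each call aggregates the same measure at a different level of detail, which is what moving between monthly and yearly sales in a cube amounts to.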
Attributes
• Attribute subset selection is a technique used for data reduction in the data mining process.
• Data reduction reduces the size of the data so that it can be analyzed more efficiently.
• Need for attribute subset selection:
  - The data set may have a large number of attributes.
• Bayesian classification in data mining:
  - Bayesian classification is based on Bayes' theorem.
BAYESIAN CLASSIFIER
• Bayesian classifiers are statistical classifiers.
• A Bayesian classifier can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
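One simple instance of a Bayesian classifier is naive Bayes, which applies Bayes' theorem under the added assumption that attributes are conditionally independent given the class. The sketch below is illustrative only: the weather-style tuples, the class labels, and the use of Laplace smoothing are assumptions that do not come from the slides.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def predict_proba(row, class_counts, value_counts, n_values):
    """Return P(class | row) for every class using Bayes' theorem."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed likelihood P(value | class).
            seen = value_counts[(label, i)][value]
            score *= (seen + 1) / (count + n_values[i])
        scores[label] = score
    evidence = sum(scores.values())  # normalizing constant P(row)
    return {label: s / evidence for label, s in scores.items()}


# Hypothetical training tuples: (outlook, temperature) -> class label.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
n_values = [2, 3]  # number of distinct values per attribute

class_counts, value_counts = train(rows, labels)
probs = predict_proba(("sunny", "mild"), class_counts, value_counts, n_values)
print(probs)
```

The returned dictionary holds the class membership probabilities for the given tuple, and they sum to 1.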
Data in the real world is dirty
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
• Noisy: containing errors or outliers.
• Inconsistent: containing discrepancies in codes or names.
• Broad categories of data quality:
  - Intrinsic, contextual, representational, and accessibility.
No quality data, no quality mining
• Quality decisions must be based on quality data.
• A data warehouse needs consistent integration of quality data.
• A multi-dimensional measure of data quality
  A well-accepted multi-dimensional view:
  - Accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility.
Data Cleaning
• Inconsistent data:
  - Manual correction using external references
  - Semi-automatic correction using various tools
    - To detect violations of known functional dependencies and data constraints
    - To correct redundant data
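A semi-automatic check of this kind might look like the sketch below. It assumes a hypothetical functional dependency `zip -> city` on made-up customer records; the helper names `fd_violations` and `drop_duplicates` are illustrative and do not refer to any particular tool.

```python
from collections import defaultdict


def fd_violations(records, lhs, rhs):
    """Return lhs values that map to more than one rhs value (violating lhs -> rhs)."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[lhs]].add(rec[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


def drop_duplicates(records):
    """Remove exact duplicate records, keeping the first occurrence."""
    unique, seen = [], set()
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical customer records with one inconsistency and one redundancy.
customers = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]

violations = fd_violations(customers, "zip", "city")
deduped = drop_duplicates(customers)
print(violations)
print(len(deduped))
```

The detected violations still need a decision about which value is correct, which is where manual correction against an external reference comes in.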
Data Reduction
• Managing data reduction
• Data reduction: obtain a reduced representation of the data while still retaining critical information
1. Data cube aggregation
2. Dimensionality reduction
3. Data compression
4. Numerosity reduction
5. Discretization and concept hierarchy generation
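As one illustration of item 5, the sketch below discretizes a numeric attribute with equal-width binning and then maps each bin to a higher-level concept. The ages, the number of bins, and the concept labels are assumptions chosen for the example; equal-width binning is only one of several discretization methods.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical ages and a hypothetical three-level concept hierarchy.
ages = [18, 22, 25, 31, 40, 47, 58, 60]
concepts = ["young", "middle-aged", "senior"]

bins = equal_width_bins(ages, n_bins=3)
labels = [concepts[b] for b in bins]
print(list(zip(ages, labels)))
```

Replacing eight distinct ages with three labels is the reduced representation: less detail, but the broad pattern is retained.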
Data Cleaning
• Tasks of data cleaning
a. Fill in missing values
b. Identify outliers and smooth noisy data
c. Correct inconsistent data
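Tasks a and b can be sketched with the Python standard library. The specific choices here are assumptions, not methods named in the slides: missing values are filled with the mean, outliers are flagged using the median absolute deviation with a threshold of 3, and noise is smoothed by bin means over equal-depth bins.

```python
from statistics import mean, median


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def mad_outliers(values, threshold=3):
    """Flag values farther than threshold * MAD from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []
    return [v for v in values if abs(v - center) > threshold * mad]


def smooth_by_bin_means(values, depth):
    """Sort the values, split them into bins of `depth` items, and replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        chunk = ordered[start:start + depth]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed


filled = fill_missing([10.0, None, 12.0, 14.0])
outliers = mad_outliers([10, 11, 12, 11, 95])
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, depth=4)
```

Filling with the mean is sensitive to outliers, so in practice it may be worth flagging outliers before computing the fill value.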
