1. Introduction
2. Data Quality: Why Preprocess the Data?
3. Data Preprocessing Tasks
4. Data Cleaning
5. Data Integration
6. Data Reduction
7. Data Transformation and Data Discretization
8. Conclusion
• Data preprocessing is a process that comes before applying data mining
techniques.
• Low-quality data will lead to low-quality mining results.
• So we need to apply data preprocessing techniques such as:
- Data quality
- Data cleaning
- Data integration
- Data reduction
- Data transformation
- Data discretization
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors comprising data quality, including:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
• Data cleaning routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
• Basic methods of data cleaning:
– Missing values
– Noisy Data
– Data Cleaning as a process
• Missing values can be handled in several ways:
– Ignore the tuple
– Fill in the missing value manually
[time-consuming and often infeasible for large data sets]
– Fill it in automatically with a global constant
[e.g., “Unknown” or ∞]
– Use the most probable value to fill in the missing value [determined by
regression, inference-based tools using a Bayesian formalism, or decision
tree induction]
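The automatic strategies listed above can be sketched in a few lines of Python. This is a minimal illustration, not a method given in the slides: `None` is assumed as the missing-value marker, the function names are hypothetical, and the attribute mean is used only as a simple stand-in for a "most probable" value (the slides name regression, Bayesian inference, and decision tree induction for that step).

```python
from statistics import mean

MISSING = None  # assumed marker for a missing value


def drop_incomplete(rows):
    """Ignore the tuple: keep only rows with no missing values."""
    return [row for row in rows if all(v is not MISSING for v in row)]


def fill_with_constant(values, constant="Unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is MISSING else v for v in values]


def fill_with_mean(values):
    """Replace missing values with the mean of the known values.

    The mean is a stand-in for a 'most probable' value (an assumption).
    """
    known = [v for v in values if v is not MISSING]
    if not known:
        return list(values)  # nothing to estimate from
    estimate = mean(known)
    return [estimate if v is MISSING else v for v in values]
```

A global constant is cheap but can mislead a mining algorithm into treating “Unknown” as a meaningful value, which is why estimated values are often preferred when enough data is available.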
• Noise is the random error or variance in a measured variable.
• Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood”, that is, the values around it.
The sorted values are distributed into a number of “buckets”, or
“bins”.
• Smoothing by bin means:
Each value in a bin is replaced by the mean value of the bin [e.g., the
mean of 4, 8, 15 is 9, so each value in that bin becomes 9].
• Smoothing by bin medians:
Each value in a bin is replaced by the bin median.
• Smoothing by bin boundaries:
The minimum and maximum values in a given bin are identified as
the bin boundaries; each bin value is then replaced by the closest
boundary value.
• Binning is also used as a discretization technique.
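The three smoothing variants can be sketched as follows. This is an illustrative sketch under stated assumptions: equal-frequency bins of a fixed size are assumed (the slides do not say how the bins are formed), the function names are hypothetical, and the sample data beyond the first bin 4, 8, 15 is made up for the example.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins of bin_size items each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace each value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace each value with the median of its bin."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value with the closer of the bin's min and max.

    Ties go to the lower boundary (an arbitrary choice for this sketch).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


data = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(data, 3)
```

With these inputs the first bin is [4, 8, 15]: smoothing by means turns it into [9, 9, 9], and smoothing by boundaries turns it into [4, 4, 15].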
• Regression:
Data smoothing can also be done by regression, a technique that
conforms data values to a function.
– Linear regression involves finding the “best” line to fit two
attributes, so that one attribute can be used to predict the other.
– Multiple linear regression is an extension of linear regression in
which more than two attributes are involved.
• Outlier analysis:
Outliers may be detected by clustering, where similar values are
organized into groups, or clusters; values that fall outside the
clusters may be considered outliers.
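Smoothing by linear regression, as described above, can be sketched with an ordinary least-squares fit. The slides do not say how the “best” line is found; least squares is assumed here as the usual choice, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept.

    Assumes xs contains at least two distinct values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]
```

The fitted values replace the noisy ones, so points that lie far from the line are pulled toward it; large residuals can also be inspected as candidate outliers.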
• The first step in data cleaning as a process is discrepancy detection
[finding inconsistent data].
• The data should be examined with regard to:
– Unique rule [each value of the given attribute must be different from
all other values of that attribute]
– Consecutive rule [no missing values between the lowest and highest
values of the attribute]
– Null rule [specifies the use of blanks, question marks, special
characters, or other strings that indicate the null condition]
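The three rules can be checked mechanically. The sketch below is one possible reading, not a definitive implementation: integer-valued attributes are assumed for the consecutive rule, the set of null markers is an assumption, and all names are hypothetical.

```python
NULL_MARKERS = {"", "?", "N/A"}  # assumed strings that denote a null value


def unique_rule_violations(values):
    """Return the values that occur more than once, in sorted order."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)


def consecutive_rule_violations(values):
    """Return the integers missing between the lowest and highest value."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1)
            if v not in present]


def null_positions(values):
    """Return the indexes of entries that denote a null value."""
    return [i for i, v in enumerate(values)
            if v is None or (isinstance(v, str) and v.strip() in NULL_MARKERS)]
```

For example, a column of check numbers 101, 102, 104 satisfies the unique rule but violates the consecutive rule, because 103 is missing.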
• Use commercial tools
– Data scrubbing: uses simple domain knowledge (e.g., postal codes,
spell-checking) to detect errors and make corrections
– Data auditing: analyzes the data to discover rules and relationships
and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Data integration is the merging of data from multiple
data stores.
• Careful integration helps avoid and reduce redundancies and
inconsistencies in the resulting data set.
• Schema integration: [integrate metadata from different sources]
• Entity identification problem: [identify and match real-world entities
from multiple data sources]
• Redundancy analysis: [an attribute may be redundant if it can be derived
from another attribute; some redundancies can be detected by correlation
analysis]
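For numeric attributes, the correlation analysis mentioned above is commonly done with the Pearson correlation coefficient. The sketch below assumes that measure; the 0.95 threshold and the function names are illustrative assumptions, not values from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def looks_redundant(xs, ys, threshold=0.95):
    """Flag a pair of attributes whose correlation is very strong.

    The threshold is an assumed cut-off chosen for illustration.
    """
    return abs(pearson(xs, ys)) >= threshold
```

A price stored in dollars in one source and in cents in another would give a coefficient of 1, which suggests that one of the two attributes can be dropped after integration.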
• Data reduction is applied to obtain a reduced representation of the
data set.
• Data reduction strategies include
– Dimensionality reduction:
Remove unimportant attributes.
Methods include wavelet transforms and principal components
analysis (PCA), which transform or project the original data onto a
smaller space.
– Numerosity reduction:
Replace the original data volume by alternative, smaller forms of data
representation.
– Data compression:
Transformations are applied to obtain a reduced or
“compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data
without any information loss, then the data reduction is called “lossless”.
• If we can reconstruct only an approximation of the original data,
then the data reduction is called “lossy”.
• Dimensionality reduction and numerosity reduction
techniques can also be considered forms of “data
compression”.
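The difference between lossless and lossy reduction can be shown with a small sketch. It uses Python's standard `zlib` module for the lossless case and simple rounding to a grid for the lossy case; both are illustrative choices, not techniques named in the slides, and the function names are hypothetical.

```python
import zlib


def lossless_round_trip(text):
    """Compress and decompress text; the result equals the input exactly."""
    compressed = zlib.compress(text.encode("utf-8"))
    restored = zlib.decompress(compressed).decode("utf-8")
    return compressed, restored


def lossy_quantize(values, step):
    """Round each value to the nearest multiple of step.

    The original values cannot be recovered, only approximated.
    """
    return [round(v / step) * step for v in values]
```

The round trip through `zlib` returns the original text unchanged, whereas quantizing 12, 17, 23 with a step of 10 yields 10, 20, 20 and the exact inputs are lost.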
[Figure: Data compression. Original Data is transformed into Compressed Data; a lossless method recovers the Original Data, a lossy method recovers only the Original Data Approximated.]
• Data transformation routines convert the data into appropriate
forms for mining.
• Strategies for data transformation include:
– Smoothing: remove noise from the data
– Attribute/feature construction: new attributes are constructed
from the given ones to help the mining process
– Aggregation: summarization and data cube construction, e.g., daily
sales are aggregated to compute monthly or annual totals
– Normalization: attribute values are scaled to fall within a smaller,
specified range, e.g., min-max normalization to −1.0 to 1.0 or 0.0 to 1.0
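Min-max normalization maps a value v from the observed range [min, max] to a new range [new_min, new_max] by v' = (v − min) / (max − min) × (new_max − new_min) + new_min. A minimal sketch follows; the function name and the handling of a constant attribute are assumptions made for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Constant attribute: map everything to new_min (an arbitrary choice).
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]
```

For example, the values 10, 20, 30 become 0.0, 0.5, 1.0 in the default range. The scaling depends on the observed minimum and maximum, so a later value outside that range would fall outside the target interval.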
• Data discretization transforms numeric data by mapping values to
interval or concept labels.
• Discretization and concept hierarchy generation can also be useful,
where raw values for attributes are replaced by ranges or higher
conceptual levels.
• For example, raw values of a numeric attribute (e.g., age) are replaced
by interval labels (e.g., 0-10, 11-20, etc.) or higher-level concepts
(e.g., youth, adult, senior).
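The age example can be sketched as two small mappings. The interval labels follow the 0-10, 11-20 pattern from the text; the cut-offs between youth, adult, and senior are not given in the slides, so the values used here (24 and 64) are assumptions, as are the function names. Ages are assumed to be non-negative integers.

```python
def interval_label(age, width=10):
    """Map an integer age to a label such as '0-10' or '11-20'."""
    if age <= width:
        return f"0-{width}"
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


def concept_label(age, youth_max=24, adult_max=64):
    """Map an age to a higher-level concept (cut-offs are assumed)."""
    if age <= youth_max:
        return "youth"
    if age <= adult_max:
        return "adult"
    return "senior"
```

Replacing many distinct ages with a handful of labels reduces the number of values a mining algorithm has to consider, at the cost of detail.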
• Three types of attributes
– Nominal: values from an unordered set, e.g., color, profession
– Ordinal: values from an ordered set, e.g., military or academic rank
– Numeric: quantitative values, e.g., integers or real numbers

• Discretization:
Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
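One simple unsupervised way to divide a range into intervals is equal-width partitioning, sketched below. The slides do not name a specific method, so this choice, the function name, and the sample values are assumptions for illustration.

```python
def equal_width_bins(values, k):
    """Assign each value the index (0..k-1) of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / k
    if width == 0:
        return [0 for _ in values]  # constant attribute: a single interval
    # The maximum value would land in interval k, so clamp it to k-1.
    return [min(int((v - low) / width), k - 1) for v in values]


labels = equal_width_bins([0, 5, 10, 15, 20, 30], 3)
```

Here the range 0 to 30 is cut into three intervals of width 10, so the six values receive the interval indexes 0, 0, 1, 1, 2, 2. No class information is used, which is what makes the method unsupervised.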
Although numerous methods of data preprocessing have been
developed, data preprocessing remains an active area of research
due to the huge amount of inconsistent or dirty data and the
complexity of the problem.
Data preprocess