SlideShare a Scribd company logo
1 of 12
DATA
PREPROCESSING
AND
DATA CLEANSING
By-Ahtesham ullah
khan
1604610013
CS-3rd yr
DATA PRE-PROCESSING
◈ Data pre-processing is a data mining technique that
involves transforming raw data into an understandable
format.
◈ Real-world data is often incomplete, inconsistent, and/or
lacking in certain behaviours or trends, and is likely to
contain many errors.
◈ Data pre-processing is a proven method of resolving
such issues. Data pre-processing prepares raw data for
further processing.
Data goes through a series of steps
during pre-processing
◈ Data Cleansing
◈ Data Integration
◈ Data Transformation
◈ Data Reduction
◈ Data Discretization
DATA CLEANSING
◈ Data cleansing or data cleaning is the process of
detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table,
or database and refers to identifying incomplete, incorrect,
inaccurate or irrelevant parts of the data and then
replacing, modifying, or deleting the dirty or coarse data.
◈ Data cleansing may be performed interactively with data
wrangling tools, or as batch processing through scripting.
◈ After cleansing, a data set should be consistent with other
similar data sets in the system. The inconsistencies
detected or removed may have been originally caused by
user entry errors, by corruption in transmission or storage,
or by different data dictionary definitions of similar entities
in different stores
◈ Data cleaning differs from data validation in that validation
almost invariably means data is rejected from the system at entry
and is performed at the time of entry, rather than on batches of
data.
Process
1- Data auditing:- The data is audited with the use
of statistical and database methods to detect anomalies and
contradictions: this eventually gives an indication of the
characteristics of the anomalies and their locations.
◈ Several commercial software packages will let you specify
constraints of various kinds (using a grammar that
conforms to that of a standard programming language,
e.g., JavaScript or Visual Basic) and then generate code
that checks the data for violation of these constraints.
2-Workflow specification: The detection and removal of
anomalies is performed by a sequence of operations on the
data known as the workflow. It is specified after the process of
auditing the data and is crucial in achieving the end product of
high-quality data.
3-Workflow execution: In this stage, the workflow is executed
after its specification is complete and its correctness is verified.
 The implementation of the workflow should be efficient,
even on large sets of data, which inevitably poses a trade-
off because the execution of a data-cleansing operation
can be computationally expensive.
◈ 4-Post-processing and controlling: After executing the
cleansing workflow, the results are inspected to verify
correctness.
◈ Data that could not be corrected during execution of the
workflow is manually corrected, if possible.
◈ The result is a new cycle in the data-cleansing process
where the data is audited again to allow the specification of
an additional workflow to further cleanse the data by
automatic processing
◈ 5-Parsing: for the detection of syntax errors. A parser
decides whether a string of data is acceptable within the
allowed data specification. This is similar to the way a
parser works with grammars and languages.
◈ 6-Data transformation: Data transformation allows the
mapping of the data from its given format into the format
expected by the appropriate application. This includes
value conversions or translation functions, as well as
normalizing numeric values to conform to minimum and
maximum values.
◈ 7-Duplicate elimination: Duplicate detection requires
an algorithm for determining whether data contains
duplicate representations of the same entity. Usually, data
is sorted by a key that would bring duplicate entries closer
together for faster identification.
◈ 8-Statistical methods: By analysing the data using the
values of mean, standard deviation, range,
or clustering algorithms, it is possible for an expert to find
values that are unexpected and thus erroneous.
◈ It can be resolved by setting the values to an average or
other statistical value. Statistical methods can also be used
to handle missing values which can be replaced by one or
more plausible values, which are usually obtained by
extensive data augmentation algorithms.
TOOLS USED
◈ There are lots of data cleansing tools like
• Trifacta
• OpenRefine
• Paxata
• Alteryx, and others.
It's also common to use libraries like Pandas
(software) for Python (programming language), or Dplyr for R
(programming language).
◈ One example of a data cleansing for distributed systems
under Apache Spark is called Optimus,
an OpenSource framework for laptop or cluster allowing
pre-processing, cleansing, and exploratory data analysis. It
includes several data wrangling tools.
Data Preprocessing Techniques

More Related Content

What's hot

Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...Edureka!
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionDerek Kane
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Data Wrangling
Data WranglingData Wrangling
Data WranglingGramener
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysisPramod Toraskar
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceSrishti44
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using RUmmiya Mohammedi
 
Data Analysis and Visualization using Python
Data Analysis and Visualization using PythonData Analysis and Visualization using Python
Data Analysis and Visualization using PythonChariza Pladin
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data ScienceJason Geng
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Edureka!
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretizationKrish_ver2
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 

What's hot (20)

Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
Data Science Tutorial | What is Data Science? | Data Science For Beginners | ...
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model Selection
 
Data Science
Data ScienceData Science
Data Science
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using R
 
Data Analysis and Visualization using Python
Data Analysis and Visualization using PythonData Analysis and Visualization using Python
Data Analysis and Visualization using Python
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 

Similar to Data Preprocessing Techniques

UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningNandakumar P
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxAkash527744
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingsuganmca14
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxAbdullahAbbasi55
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedYugal Kumar
 
IRJET- A Review of Data Cleaning and its Current Approaches
IRJET- A Review of Data Cleaning and its Current ApproachesIRJET- A Review of Data Cleaning and its Current Approaches
IRJET- A Review of Data Cleaning and its Current ApproachesIRJET Journal
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfShaikSikindar1
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Machine learning topics machine learning algorithm into three main parts.
Machine learning topics  machine learning algorithm into three main parts.Machine learning topics  machine learning algorithm into three main parts.
Machine learning topics machine learning algorithm into three main parts.DurgaDeviP2
 
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREEA ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREEijcsa
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Qualitypriyanka rajput
 
Dm data pre processing
Dm data pre processingDm data pre processing
Dm data pre processingSangeethaSasi1
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhVISHALMARWADE1
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDatavalley.ai
 

Similar to Data Preprocessing Techniques (20)

UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptx
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Data warehouse testing
Data warehouse testingData warehouse testing
Data warehouse testing
 
Data processing
Data processingData processing
Data processing
 
IRJET- A Review of Data Cleaning and its Current Approaches
IRJET- A Review of Data Cleaning and its Current ApproachesIRJET- A Review of Data Cleaning and its Current Approaches
IRJET- A Review of Data Cleaning and its Current Approaches
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdf
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Machine learning topics machine learning algorithm into three main parts.
Machine learning topics  machine learning algorithm into three main parts.Machine learning topics  machine learning algorithm into three main parts.
Machine learning topics machine learning algorithm into three main parts.
 
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREEA ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
A ROBUST APPROACH FOR DATA CLEANING USED BY DECISION TREE
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
 
Dm data pre processing
Dm data pre processingDm data pre processing
Dm data pre processing
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
Pre processing
Pre processingPre processing
Pre processing
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjh
 
1234
12341234
1234
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdf
 

More from Ahtesham Ullah khan

A PRESENTATION ON INSTALLATION OF RAPID MINER & CHARTA PRESENTATION ON INSTAL...
A PRESENTATION ON INSTALLATION OF RAPID MINER &CHARTA PRESENTATION ON INSTAL...A PRESENTATION ON INSTALLATION OF RAPID MINER &CHARTA PRESENTATION ON INSTAL...
A PRESENTATION ON INSTALLATION OF RAPID MINER & CHARTA PRESENTATION ON INSTAL...Ahtesham Ullah khan
 
3D TRANSFORMATION: MATRIX REPRESENTATION
3D TRANSFORMATION: MATRIX REPRESENTATION3D TRANSFORMATION: MATRIX REPRESENTATION
3D TRANSFORMATION: MATRIX REPRESENTATIONAhtesham Ullah khan
 
HOMOGENEOUS CO-ORDINATES IN COMPUTER GRAPHICS PPT
HOMOGENEOUS CO-ORDINATES IN COMPUTER GRAPHICS PPTHOMOGENEOUS CO-ORDINATES IN COMPUTER GRAPHICS PPT
HOMOGENEOUS CO-ORDINATES IN COMPUTER GRAPHICS PPTAhtesham Ullah khan
 
Flow control in Computer Network
Flow control in Computer NetworkFlow control in Computer Network
Flow control in Computer NetworkAhtesham Ullah khan
 

More from Ahtesham Ullah khan (6)

A PRESENTATION ON INSTALLATION OF RAPID MINER & CHARTA PRESENTATION ON INSTAL...
A PRESENTATION ON INSTALLATION OF RAPID MINER &CHARTA PRESENTATION ON INSTAL...A PRESENTATION ON INSTALLATION OF RAPID MINER &CHARTA PRESENTATION ON INSTAL...
A PRESENTATION ON INSTALLATION OF RAPID MINER & CHARTA PRESENTATION ON INSTAL...
 
3D TRANSFORMATION: MATRIX REPRESENTATION
3D TRANSFORMATION: MATRIX REPRESENTATION3D TRANSFORMATION: MATRIX REPRESENTATION
3D TRANSFORMATION: MATRIX REPRESENTATION
 
HOMOGENEOUS CO-ORDINATES IN COMPUTER GRAPHICS PPT
HOMOGENEOUS CO-ORDINATES IN COMPUTER GRAPHICS PPTHOMOGENEOUS CO-ORDINATES IN COMPUTER GRAPHICS PPT
HOMOGENEOUS CO-ORDINATES IN COMPUTER GRAPHICS PPT
 
Flow control in Computer Network
Flow control in Computer NetworkFlow control in Computer Network
Flow control in Computer Network
 
Framing in data link layer
Framing in data link layerFraming in data link layer
Framing in data link layer
 
Error Control In Network Layer
Error Control In Network LayerError Control In Network Layer
Error Control In Network Layer
 

Recently uploaded

What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 

Recently uploaded (20)

What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 

Data Preprocessing Techniques

  • 2. DATA PRE-PROCESSING ◈ Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. ◈ Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviours or trends, and is likely to contain many errors. ◈ Data pre-processing is a proven method of resolving such issues. Data pre-processing prepares raw data for further processing.
  • 3. Data goes through a series of steps during pre-processing ◈ Data Cleansing ◈ Data Integration ◈ Data Transformation ◈ Data Reduction ◈ Data Discretization
  • 4. DATA CLEANSING ◈ Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. ◈ Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting. ◈ After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores
  • 5. ◈ Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data.
  • 6. Process 1- Data auditing:- The data is audited with the use of statistical and database methods to detect anomalies and contradictions: this eventually gives an indication of the characteristics of the anomalies and their locations. ◈ Several commercial software packages will let you specify constraints of various kinds (using a grammar that conforms to that of a standard programming language, e.g., JavaScript or Visual Basic) and then generate code that checks the data for violation of these constraints.
  • 7. 2-Workflow specification: The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. 3-Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified.  The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade- off because the execution of a data-cleansing operation can be computationally expensive.
  • 8. ◈ 4-Post-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness. ◈ Data that could not be corrected during execution of the workflow is manually corrected, if possible. ◈ The result is a new cycle in the data-cleansing process where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing
  • 9. ◈ 5-Parsing: for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages. ◈ 6-Data transformation: Data transformation allows the mapping of the data from its given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values. ◈ 7-Duplicate elimination: Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that would bring duplicate entries closer together for faster identification.
  • 10. ◈ 8-Statistical methods: By analysing the data using the values of mean, standard deviation, range, or clustering algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. ◈ It can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values which can be replaced by one or more plausible values, which are usually obtained by extensive data augmentation algorithms.
  • 11. TOOLS USED ◈ There are lots of data cleansing tools like • Trifacta • OpenRefine • Paxata • Alteryx, and others. It's also common to use libraries like Pandas (software) for Python (programming language), or Dplyr for R (programming language). ◈ One example of a data cleansing for distributed systems under Apache Spark is called Optimus, an OpenSource framework for laptop or cluster allowing pre-processing, cleansing, and exploratory data analysis. It includes several data wrangling tools.