SlideShare a Scribd company logo
1 of 29
Pre Processing and Data
Prepration
Big Data & IoT
Umair Shafique (03246441789)
Scholar MS Information Technology - University of Gujrat
Data Pre-Processing
Learning Objectives
•Understand what is data pre-processing?
• Understand why pre-process the data.
• Understand how to clean the data.
• Understand how to integrate and transform the data.
• Data:
• Text
• Images
• Audio
• Video
• Data pre-processing is the process to convert raw
data into meaningful data using different
techniques.
Data pre-processing
• Incomplete:
missing data, lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
• Noisy:
containing errors, such as measurement errors, or outliers
• Inconsistent:
containing discrepancies in codes or names
• Distorted:
sampling distortion (A Change for worse)
Terms used in pre-processing
Steps of Pre-processing
• Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies and errors
• Data integration
Integration of multiple databases, data cubes, or files
• Data transformation
Normalization and aggregation
• Data reduction
Obtains reduced representation in volume but produces the same or similar analytical
results
• Data discretization
Part of data reduction but with particular importance, especially for numerical data
steps of data pre-processing
Data Cleaning
• Data cleaning tasks
i. Fill in missing values
ii. Identify outliers and smooth out noisy data
iii. Correct inconsistent data
Data Cleaning
1. Missing Data
• Data is not always available
a. E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
a. equipment malfunction
b. inconsistent with other recorded data and thus
deleted
c. data not entered due to misunderstanding
d. certain data may not be considered important at the
time of entry
e. not register history or changes of the data
f. Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple; usually done when the class label is missing
(assuming the tasks in classification—not effective when the
percentage of missing values per attribute varies considerably.)
• Fill in the missing value manually
• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
2. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which requires data cleaning
– duplicate records
– inconsistent data
How to Handle Noisy Data?
• Binning method:
- first sort data and partition into (equi-depth) bins
- then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
• Clustering
- detect and remove outliers
• Regression
- smooth by fitting the data into regression functions
• Combined computer and human inspection
- detect suspicious values and check by human
Binning Methods for Data Smoothing
* Sorted data for price (in dollars):
4,8,9,15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
1. Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
2. Smoothing by bin boundaries:
- Bin 1: 4, 4, 15, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 34, 34
Cluster Analysis
Regression
Data Integration
• Data integration:
• combines data from multiple sources into a coherent store
• Schema integration
• integrate metadata from different sources
• entity identification problem: identify real world entities from multiple data
sources, e.g., A.cust-id  B.cust-#
• Detecting and resolving data value conflicts
• for the same real world entity, attribute values from different sources are different
• possible reasons: different representations, different scales, e.g., metric vs. British
units
Handling Redundant Data in Data Integration
• Redundant data occur often when integration of multiple
databases
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may be able to be detected by correlational
analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Data Transformation
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
Data Reduction
• Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume,
yet closely maintains the integrity of the original data. That is,
mining on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results.
Data reduction
Strategies
1. Data cube aggregation
2. Dimension reduction
3. Data compression
4. Numerosity reduction
5. Discretization and concept hierarchies
1- Data Cube Aggregation
Data cube aggregation, where aggregation operations are
applied to the data in the construction of a data cube.
2- dimension reduction
• Dimension Reduction, where irrelevant, weakly relevant, or
redundant attributes or dimensions may be detected and removed.
• The goal is to find a minimum set of attributes such that the
resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes.
• Mining on a reduced set of attributes has an additional benefit. It
reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
2.1- methods of dimension reduction
Basic heuristic methods of attribute dimension reduction include the
following techniques:
o Stepwise forward selection
The procedure starts with an empty set of attributes as the reduced set. The
best of the original attributes is determined and added to the reduced set. At
each subsequent iteration or step, the best of the remaining original attributes
is added to the set.
o Stepwise backward elimination
The procedure starts with the full set of attributes. At each step, it removes
the worst attribute remaining in the set.
o Combination of forward selection and backward elimination
The stepwise forward selection and backward elimination methods can be
combined so that, at each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
3- Data compression
• In data compression, where encoding mechanisms are used to reduce the
data set size.
• Types of compressions:
• Lossless Compression
Encoding techniques (Run Length Encoding) allows a simple and
minimal data size reduction. Lossless data compression uses
algorithms to restore the precise original data from the compressed
data.
• Lossy Compression
Methods such as Discrete Wavelet transform technique, PCA
(principal component analysis) are examples of this compression.
For e.g., JPEG image format is a lossy compression, but we can
find the meaning equivalent to the original the image. In lossy-data
compression, the decompressed data may differ to the original data
but are useful enough to retrieve information from them.
4- numerosity reduction
• Here, the data are replaced or estimated by alternative, smaller data
representations
• Techniques:
• Parametric methods
A model is used to estimate the data, so that typically only
the data parameters need to be stored, instead of the actual
data. (Outliers may also be stored.) Log-linear models, which
estimate discrete multidimensional probability distributions,
are an example
• Non-parametric methods
For storing reduced representations of the data include
histograms, clustering, and sampling.
5- discretization and concept heirarchy
Discretization:
• Techniques of data discretization are used to divide the attributes of the
continuous nature into data with intervals
• Types of discretization:
• Top-down discretization
If you first consider one or a couple of points (so-called
breakpoints or split points) to divide the whole set of attributes and
repeat of this method up to the end, then the process is known as
top-down discretization also known as splitting.
• Bottom-up discretization
If you first consider all the constant values as split-points, some
are discarded through a combination of the neighbourhood values
in the interval, that process is called bottom-up discretization.
5- discretization and concept hierarchy
Concept Hierarchy:
• It reduces the data size by collecting and then replacing the low-level
concepts
• Techniques:
1. Binning
Binning is the process of changing numerical variables into
categorical counterparts. The number of categorical counterparts
depends on the number of bins specified by the user.
2. Histogram Analysis
Used to partition the value for the attribute X, into disjoint ranges
called bracket
5- discretization and concept hierarchy
• Histogram partitioning rules:
•Equal frequency partitioning
Partitioning the values based on their number of occurrences
in the data set.
•Equal width partitioning
Partitioning the values in a fixed gap based on the number of
bins i.e. a set of values ranging from 0-20
•Clustering
Grouping the similar data together
Benefits of data preparation
1. Ensure the data used in analytics applications produces reliable
results;
2. Identify and fix data issues that otherwise might not be detected;
3. Enable more informed decision-making by business executives
and operational workers;
4. Reduce data management and analytics costs;
5. Avoid duplication of effort in preparing data for use in multiple
applications; and
6. Get a higher ROI from BI and analytics initiatives.
Summary
• Data preparation or preprocessing is a big issue for both data
warehousing and data mining
• Descriptive data summarization is needed for quality data
preprocessing
• Data preparation includes
Data cleaning and data integration
Data transformation and reduction
Discretization
• A lot a methods have been developed but data preprocessing
still an active area of research

More Related Content

What's hot

Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining Sulman Ahmed
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandasAkshitaKanther
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse ArchitecturesTheju Paul
 
Conceptual design & ER Model.pptx
Conceptual design & ER Model.pptxConceptual design & ER Model.pptx
Conceptual design & ER Model.pptxAvinashChoure2
 
DAS Slides: Data Quality Best Practices
DAS Slides: Data Quality Best PracticesDAS Slides: Data Quality Best Practices
DAS Slides: Data Quality Best PracticesDATAVERSITY
 
Introduction to Data Abstraction
Introduction to Data AbstractionIntroduction to Data Abstraction
Introduction to Data AbstractionDennis Gajo
 
Introduction to Data Structure and Algorithm
Introduction to Data Structure and AlgorithmIntroduction to Data Structure and Algorithm
Introduction to Data Structure and AlgorithmSagacious IT Solution
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introductionDr-Dipali Meher
 
The Growing Importance of Data Cleaning
The Growing Importance of Data CleaningThe Growing Importance of Data Cleaning
The Growing Importance of Data CleaningCarolineSmith912130
 
Data Warehouse Basic Guide
Data Warehouse Basic GuideData Warehouse Basic Guide
Data Warehouse Basic Guidethomasmary607
 

What's hot (20)

Data mining
Data mining Data mining
Data mining
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandas
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
 
Data analytics
Data analyticsData analytics
Data analytics
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Conceptual design & ER Model.pptx
Conceptual design & ER Model.pptxConceptual design & ER Model.pptx
Conceptual design & ER Model.pptx
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
DAS Slides: Data Quality Best Practices
DAS Slides: Data Quality Best PracticesDAS Slides: Data Quality Best Practices
DAS Slides: Data Quality Best Practices
 
Introduction to Data Abstraction
Introduction to Data AbstractionIntroduction to Data Abstraction
Introduction to Data Abstraction
 
Data mining notes
Data mining notesData mining notes
Data mining notes
 
Introduction to Data Structure and Algorithm
Introduction to Data Structure and AlgorithmIntroduction to Data Structure and Algorithm
Introduction to Data Structure and Algorithm
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introduction
 
The Growing Importance of Data Cleaning
The Growing Importance of Data CleaningThe Growing Importance of Data Cleaning
The Growing Importance of Data Cleaning
 
Data Warehouse Basic Guide
Data Warehouse Basic GuideData Warehouse Basic Guide
Data Warehouse Basic Guide
 

Similar to Pre-Processing and Data Preparation

Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1meenas06
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.pptchatbot9
 
Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Dhilsath Fathima
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.pptcongtran88
 
Preprocessing
PreprocessingPreprocessing
Preprocessingmmuthuraj
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processingFEG
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data miningDhilsath Fathima
 
Data pre processing
Data pre processingData pre processing
Data pre processingpommurajopt
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ngsaranya12345
 
Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2extraganesh
 

Similar to Pre-Processing and Data Preparation (20)

Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1
 
Dmblog
DmblogDmblog
Dmblog
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
Assignmentdatamining
AssignmentdataminingAssignmentdatamining
Assignmentdatamining
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
overview of_data_processing
overview of_data_processingoverview of_data_processing
overview of_data_processing
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data mining
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2
 

More from Umair Shafique

Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data AnalysisUmair Shafique
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With HadoopUmair Shafique
 
BIG DATA ANALYTICS USING R
BIG DATA ANALYTICS USING  RBIG DATA ANALYTICS USING  R
BIG DATA ANALYTICS USING RUmair Shafique
 
BIG DATA AND MACHINE LEARNING
BIG DATA AND MACHINE LEARNINGBIG DATA AND MACHINE LEARNING
BIG DATA AND MACHINE LEARNINGUmair Shafique
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataUmair Shafique
 
Handling and Processing Big Data
Handling and Processing Big DataHandling and Processing Big Data
Handling and Processing Big DataUmair Shafique
 

More from Umair Shafique (6)

Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
 
Big Data Analytics With Hadoop
Big Data Analytics With HadoopBig Data Analytics With Hadoop
Big Data Analytics With Hadoop
 
BIG DATA ANALYTICS USING R
BIG DATA ANALYTICS USING  RBIG DATA ANALYTICS USING  R
BIG DATA ANALYTICS USING R
 
BIG DATA AND MACHINE LEARNING
BIG DATA AND MACHINE LEARNINGBIG DATA AND MACHINE LEARNING
BIG DATA AND MACHINE LEARNING
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Handling and Processing Big Data
Handling and Processing Big DataHandling and Processing Big Data
Handling and Processing Big Data
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 

Pre-Processing and Data Preparation

  • 1. Pre Processing and Data Prepration Big Data & IoT Umair Shafique (03246441789) Scholar MS Information Technology - University of Gujrat
  • 2. Data Pre-Processing Learning Objectives •Understand what is data pre-processing? • Understand why pre-process the data. • Understand how to clean the data. • Understand how to integrate and transform the data.
  • 3. • Data: • Text • Images • Audio • Video • Data pre-processing is the process to convert raw data into meaningful data using different techniques. Data pre-processing
  • 4. • Incomplete: missing data, lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • Noisy: containing errors, such as measurement errors, or outliers • Inconsistent: containing discrepancies in codes or names • Distorted: sampling distortion (A Change for worse) Terms used in pre-processing
  • 5. Steps of Pre-processing • Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies and errors • Data integration Integration of multiple databases, data cubes, or files • Data transformation Normalization and aggregation • Data reduction Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization Part of data reduction but with particular importance, especially for numerical data
  • 6. steps of data pre-processing
  • 7. Data Cleaning • Data cleaning tasks i. Fill in missing values ii. Identify outliers and smooth out noisy data iii. Correct inconsistent data
  • 8. Data Cleaning 1. Missing Data • Data is not always available a. E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to a. equipment malfunction b. inconsistent with other recorded data and thus deleted c. data not entered due to misunderstanding d. certain data may not be considered important at the time of entry e. not register history or changes of the data f. Missing data may need to be inferred.
  • 9. How to Handle Missing Data? • Ignore the tuple; usually done when the class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably.) • Fill in the missing value manually • Use a global constant to fill in the missing value: e.g., “unknown”, a new class • Use the attribute mean to fill in the missing value • Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter • Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree
  • 10. 2. Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to – faulty data collection instruments – data entry problems – data transmission problems – technology limitation – inconsistency in naming convention • Other data problems which requires data cleaning – duplicate records – inconsistent data
  • 11. How to Handle Noisy Data? • Binning method: - first sort data and partition into (equi-depth) bins - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. • Clustering - detect and remove outliers • Regression - smooth by fitting the data into regression functions • Combined computer and human inspection - detect suspicious values and check by human
  • 12. Binning Methods for Data Smoothing * Sorted data for price (in dollars): 4,8,9,15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 1. Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 2. Smoothing by bin boundaries: - Bin 1: 4, 4, 15, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 34, 34
  • 15. Data Integration • Data integration: • combines data from multiple sources into a coherent store • Schema integration • integrate metadata from different sources • entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id  B.cust-# • Detecting and resolving data value conflicts • for the same real world entity, attribute values from different sources are different • possible reasons: different representations, different scales, e.g., metric vs. British units
  • 16. Handling Redundant Data in Data Integration • Redundant data occur often when integration of multiple databases – The same attribute may have different names in different databases – One attribute may be a “derived” attribute in another table, e.g., annual revenue • Redundant data may be able to be detected by correlational analysis • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
  • 17. Data Transformation • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: scaled to fall within a small, specified range • min-max normalization • z-score normalization • normalization by decimal scaling • Attribute/feature construction • New attributes constructed from the given ones
  • 18. Data Reduction • Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
  • 19. Data reduction Strategies 1. Data cube aggregation 2. Dimension reduction 3. Data compression 4. Numerosity reduction 5. Discretization and concept hierarchies
  • 20. 1- Data Cube Aggregation Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
  • 21. 2- dimension reduction • Dimension Reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed. • The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. • Mining on a reduced set of attributes has an additional benefit. It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
  • 22. 2.1- methods of dimension reduction Basic heuristic methods of attribute dimension reduction include the following techniques: o Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set. o Stepwise backward elimination The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set. o Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
  • 23. 3- Data compression • In data compression, where encoding mechanisms are used to reduce the data set size. • Types of compressions: • Lossless Compression Encoding techniques (Run Length Encoding) allows a simple and minimal data size reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed data. • Lossy Compression Methods such as Discrete Wavelet transform technique, PCA (principal component analysis) are examples of this compression. For e.g., JPEG image format is a lossy compression, but we can find the meaning equivalent to the original the image. In lossy-data compression, the decompressed data may differ to the original data but are useful enough to retrieve information from them.
  • 24. 4- numerosity reduction • Here, the data are replaced or estimated by alternative, smaller data representations • Techniques: • Parametric methods A model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Log-linear models, which estimate discrete multidimensional probability distributions, are an example • Non-parametric methods For storing reduced representations of the data include histograms, clustering, and sampling.
  • 25. 5- discretization and concept heirarchy Discretization: • Techniques of data discretization are used to divide the attributes of the continuous nature into data with intervals • Types of discretization: • Top-down discretization If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of attributes and repeat of this method up to the end, then the process is known as top-down discretization also known as splitting. • Bottom-up discretization If you first consider all the constant values as split-points, some are discarded through a combination of the neighbourhood values in the interval, that process is called bottom-up discretization.
  • 26. 5- discretization and concept hierarchy Concept Hierarchy: • It reduces the data size by collecting and then replacing the low-level concepts • Techniques: 1. Binning Binning is the process of changing numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user. 2. Histogram Analysis Used to partition the value for the attribute X, into disjoint ranges called bracket
  • 27. 5- discretization and concept hierarchy • Histogram partitioning rules: •Equal frequency partitioning Partitioning the values based on their number of occurrences in the data set. •Equal width partitioning Partitioning the values in a fixed gap based on the number of bins i.e. a set of values ranging from 0-20 •Clustering Grouping the similar data together
  • 28. Benefits of data preparation 1. Ensure the data used in analytics applications produces reliable results; 2. Identify and fix data issues that otherwise might not be detected; 3. Enable more informed decision-making by business executives and operational workers; 4. Reduce data management and analytics costs; 5. Avoid duplication of effort in preparing data for use in multiple applications; and 6. Get a higher ROI from BI and analytics initiatives.
  • 29. Summary • Data preparation or preprocessing is a big issue for both data warehousing and data mining • Descriptive data summarization is needed for quality data preprocessing • Data preparation includes Data cleaning and data integration Data transformation and reduction Discretization • A lot a methods have been developed but data preprocessing still an active area of research