Dr. Gopal Sakarkar,
IEEE-CIS Member, Ph.D. (CSE)
Department of AI and Machine Learning,
G H Raisoni College of Engineering, Nagpur
Data Pre-processing Services
using
Machine Learning Algorithms
Data Cleaning Services
Good data preparation is key
to producing valid and reliable
models.
Applications of Machine Learning
What is Machine Learning?
• According to Arthur Samuel (1959), machine learning algorithms enable
computers to learn from data, and even improve themselves, without being
explicitly programmed.
• Machine learning (ML) is a category of algorithms that allows software
applications to become more accurate in predicting outcomes without being
explicitly programmed.
• The basic premise of machine learning is to build algorithms that can receive
input data and use statistical analysis to predict an output, updating
those outputs as new data becomes available.
Types of Machine Learning
Machine Learning Algorithms: Supervised Learning, Unsupervised Learning
Where is Data Cleaning used?
Machine Learning Life Cycle
Data Pre-processing
• Data preprocessing is an important step in ML
• The phrase "garbage in, garbage out" is particularly applicable to data
mining and machine learning projects.
• It involves transforming raw data into an understandable format.
• Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors.
• Data preprocessing is a proven method of resolving such issues
Why Data Pre-processing?
• A manager at All Electronics has been charged with analyzing the company's data with
respect to the sales at a branch.
• He carefully inspects the company's database and data warehouse, identifying the dimensions to be
included, such as item, price, units sold, and season.
• He notices that several of the attributes for various tuples have no recorded value. For the analysis,
he would like to include the missing information.
• In other words, the data he wishes to analyze with machine learning techniques is incomplete,
noisy, and inconsistent.
Item            Price    Units Sold   Season
TV              7200     44           All
Fan             480      27           Summer
Tube light      54       30           All
AC              27000    38           -
Fridge          -        40           Summer
Switches        58       35           -
2 mm Wire       520      -            All
Backup Light    790      48           Winter
Fan Regulator   83       50           All
Bulb            87       37           Rainy Session
(- marks a value that was not recorded)
What do you mean by data pre-processing?
• It is cleaning and exploring data for analysis.
• Prepping data for modeling.
• Modeling in Python generally requires numerical input.
• Data preprocessing is a technique that involves transforming raw data into an understandable
format.
• Data preprocessing is a proven method of resolving issues such as incomplete, noisy, and inconsistent data.
Data Understanding: Relevance of data
• What data is available for the task?
• Is this data relevant?
• Is additional relevant data available?
• How much historical data is available?
Data Understanding: Quantity of data
• Number of instances (records, objects)
  - Rule of thumb: 5,000 or more desired
  - If fewer, results are less reliable; use special methods (boosting, …)
• Number of attributes (fields)
  - Rule of thumb: for each attribute, 10 or more instances
  - If more fields, use feature reduction and selection
• If very unbalanced, use sampling
Data Pre-processing Steps
• Data Cleaning
Data cleaning is the process of filling in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies.
• Data Integration
Integration of multiple databases, data cubes, or files.
• Data Transformation
Data transformation is the task of normalizing and aggregating data.
• Data Reduction
Obtaining a representation that is reduced in volume but produces the same or similar analytical
results.
• Data Discretization
Part of data reduction, but of particular importance, especially for numerical data.
Data Cleaning
• Importance
Data cleaning is the number one problem when working with large data sets.
Data Cleaning Tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Data Cleaning: Missing Data
• Data is not always available
E.g., a student filling in an admission form may not know the local
guardian's contact number.
• Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
  - expansion of the data schema
How to Handle Missing Data?
• Ignore the tuple (loss of information)
• Fill in missing values manually: tedious, infeasible?
• Fill them in automatically with
  - a global constant, e.g., "unknown" (a new class?!)
  - imputation: use the attribute mean to fill in the missing value
  - the most probable value
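The automatic strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not the deck's own implementation: the function names and the sample lists are hypothetical, missing values are assumed to be represented as None, and the most frequent observed value (the mode) is used as a simple stand-in for "the most probable value".

```python
from statistics import mean, mode


def fill_with_constant(values, constant="unknown"):
    """Replace every missing entry (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace missing numeric entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


def fill_with_mode(values):
    """Replace missing entries with the most frequent observed value.

    Assumption: the mode stands in for "the most probable value".
    """
    observed = [v for v in values if v is not None]
    most_common = mode(observed)
    return [most_common if v is None else v for v in values]


# Hypothetical sample columns with one missing entry each.
units_sold = [44, 27, 30, None, 40]
seasons = ["All", "Summer", "All", None, "Summer", "All"]

print(fill_with_mean(units_sold))
print(fill_with_mode(seasons))
print(fill_with_constant(seasons))
```

Mean imputation only makes sense for numeric attributes; for categorical attributes the constant or the mode is the usual fallback.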
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
• Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
How to handle noisy data?
• Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Combined computer and human inspection
  - detect suspicious values automatically and have a human check them
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
-Bin 1: 4, 8, 9, 15
-Bin 2: 21, 21, 24, 25
-Bin 3: 26, 28, 29, 34
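* Smoothing by bin means (rounded to the nearest integer):
-Bin 1: 9, 9, 9, 9 ((4+8+9+15)/4 = 9)
-Bin 2: 23, 23, 23, 23 ((21+21+24+25)/4 = 22.75 ≈ 23)
-Bin 3: 29, 29, 29, 29 ((26+28+29+34)/4 = 29.25 ≈ 29)
* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
-Bin 1: 4, 4, 4, 15
-Bin 2: 21, 21, 25, 25
-Bin 3: 26, 26, 26, 34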
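The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, and two choices are assumptions rather than part of the slides: the number of values is assumed to divide evenly into the bins, and a value exactly halfway between the two boundaries is assigned to the lower one.

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        # Assumption: this sketch only handles evenly divisible input.
        raise ValueError("number of values must be divisible by n_bins")
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```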
Data Integration
Data integration combines data from multiple sources.
• Schema integration
  - Integrate metadata from different sources
  - Entity identification problem: identify real-world entities across
    multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different
    sources may differ, e.g., different scales, metric vs. British units
• Removing duplicates and redundant data
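As a rough illustration of these three steps, the sketch below merges two hypothetical sources. Everything in it is assumed for the example: source A identifies customers by `cust_id` and stores weight in kilograms, source B uses `cust_no` and pounds, and when both sources describe the same customer the record from A is kept.

```python
LB_TO_KG = 0.45359237  # one pound in kilograms

# Hypothetical records from two sources with different schemas and units.
source_a = [
    {"cust_id": 1, "weight_kg": 70.0},
    {"cust_id": 2, "weight_kg": 82.5},
]
source_b = [
    {"cust_no": 2, "weight_lb": 181.9},
    {"cust_no": 3, "weight_lb": 150.0},
]


def integrate(a_rows, b_rows):
    """Merge both sources into one schema keyed by customer id."""
    merged = {}
    for row in a_rows:
        merged[row["cust_id"]] = {
            "cust_id": row["cust_id"],
            "weight_kg": row["weight_kg"],
        }
    for row in b_rows:
        key = row["cust_no"]  # entity identification: cust_no is treated as cust_id
        weight_kg = round(row["weight_lb"] * LB_TO_KG, 1)  # resolve the unit conflict
        # Duplicate removal: keep the record from source A if one already exists.
        merged.setdefault(key, {"cust_id": key, "weight_kg": weight_kg})
    return [merged[k] for k in sorted(merged)]


integrated = integrate(source_a, source_b)
print(integrated)
```

Which source wins on a conflict is a policy decision; real integrations usually record the provenance of each value instead of silently preferring one side.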
Data Transformation
• Smoothing: remove noise from the data
• Normalization: scale values to fall within a small, specified range
• Attribute/feature construction
  - New attributes constructed from the given ones
• Aggregation: summarization
  - Integrate data from different sources (tables)
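Normalization is the easiest of these to show in code. The sketch below gives two common variants, min-max scaling and z-score standardization; the slides name neither specifically, so both are offered as typical examples, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant attribute")
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant attribute")
    return [(v - mu) / sigma for v in values]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
scaled = min_max(prices)
standardized = z_score(prices)
print(scaled)
print(standardized)
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range; z-scores are less affected but do not guarantee a fixed range.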
Data Reduction
• Data is too big to work with
  - Too many instances
  - Too many features (attributes)
• Data reduction: obtain a reduced representation of the data set that is much smaller
in volume yet produces the same (or almost the same) analytical
results (easily said but difficult to do)
• Data reduction strategies
  - Dimensionality reduction: remove unimportant attributes
  - Aggregation and clustering: remove redundant or closely associated ones
  - Sampling
Data Reduction: Clustering
• Partition the data set into clusters and store only the cluster
representation.
• Can be very effective if the data is clustered, but not if the data is dirty.
• There are many choices of clustering algorithms.
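To make "store only the cluster representation" concrete, here is a minimal one-dimensional k-means sketch. K-means is just one possible choice and is not named in the slides; the function name, the deterministic initialization (k values spread evenly over the sorted data), and the use of a centroid plus a member count as the stored representation are all assumptions of this example.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster numbers with k-means and return (centroid, count) per cluster."""
    if k < 2 or k > len(values):
        raise ValueError("k must be between 2 and the number of values")
    ordered = sorted(values)
    # Assumed initialization: k starting centroids spread over the sorted data.
    centroids = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        updated = [
            sum(members) / len(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        converged = updated == centroids
        centroids = updated
        if converged:
            break
    return [(centroids[j], len(clusters[j])) for j in range(k)]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
summary = kmeans_1d(prices, 3)
print(summary)  # three (centroid, count) pairs replace the twelve raw values
```

The result depends on the initialization, so a different starting point can give a different, equally valid partition.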
Data Reduction: Sampling
• Choose a representative subset of the data
  - Simple random sampling may perform poorly in the presence of skew.
• Develop adaptive sampling methods
  - Stratified sampling: approximate the percentage of each class (or
    subpopulation of interest) in the overall database
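A stratified sample can be sketched with the standard library alone. The function name, the `label_of` parameter, and the rule of keeping at least one record per class are assumptions made for this illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, label_of, fraction, seed=0):
    """Sample the same fraction from every class so class proportions are kept."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[label_of(record)].append(record)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one record from every class.
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical skewed data set: 90 records of class "A", 10 of class "B".
records = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]
sample = stratified_sample(records, label_of=lambda r: r[0], fraction=0.1)
class_counts = Counter(label for label, _ in sample)
print(class_counts)
```

With simple random sampling at the same rate, the rare class "B" could easily be missing from the sample altogether, which is the skew problem the stratified approach avoids.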
Data Discretization
• Discretization is a process that transforms quantitative data into
qualitative data.
• It can significantly improve the quality of the knowledge discovered.
• It reduces the running time of various machine learning tasks such as
association rule discovery, classification, clustering, and prediction.
• It reduces the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
• Interval labels can then be used to replace the actual data values.
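One simple way to do this is equal-width discretization, sketched below. The slides do not prescribe a particular method, so the equal-width scheme, the function name, and the label format are assumptions of this example.

```python
def equal_width_labels(values, n_intervals):
    """Replace each value with the label of the equal-width interval it falls in."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot discretize a constant attribute")
    width = (hi - lo) / n_intervals
    labels = []
    for v in values:
        # The maximum value is clamped into the last interval.
        idx = min(int((v - lo) / width), n_intervals - 1)
        start = lo + idx * width
        closing = "]" if idx == n_intervals - 1 else ")"
        labels.append(f"[{start:g}, {start + width:g}{closing}")
    return labels


prices = [4, 15, 24, 34]
labels = equal_width_labels(prices, 3)
print(labels)
```

After this step a continuous price attribute has at most three distinct values, which is what shrinks the search space for later mining tasks.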
Email: gopal.sakarkar@raisoni.net
Part 2
Implementation of Data
Cleaning Services
Using
Python Programming

Data preprocessing using Machine Learning

  • 1. Dr. Gopal Sakarkar, IEEE-CIS Member, Ph.D(CSE) Department of AI and Machine Learning , G H RaisoniCollegeof Engineering , Nagpur Data Pre-processing Services using Machine Learning Algorithms
  • 2. Data Cleaning Services Good data preparation is key to producing valid and reliable models.
  • 7. What is Machine Learning? • According to Arthur Samuel(1959), Machine Learning algorithms enable the computers to learn from data, and even improve themselves, without being explicitly programmed. • Machine learning (ML) is a category of an algorithm that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. • The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output while updating outputs as new data becomes available.
  • 8. Types of Machine Learning
  • 9. Types of Machine Learning Supervised Learning Unsupervised Learning MachineLearningAlgorithms
  • 10. Where is Data Cleaning used? Machine Learning Life Cycle
  • 11. Data Pre-processing • Data preprocessing is an important step in ML • The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. • It involves transforming raw data into an understandable format. • Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. • Data preprocessing is a proven method of resolving such issues
  • 13. Why Data Pre-processing? • A manager at All Electronics and have been charged with analyzing the company's data with respect to the sales at a branch. • He carefully inspect the company's database and data warehouse, identifying dimensions to be included, such as item, price, units sold, and session . • He notice that several of the attributes for various tuples have no recorded value. For analysis, he would like to include information. • In other words, the data he wish to analyze by machine learning techniques is incomplete, noisy and inconsistent.
  • 14. Why Data Pre-processing? Item Price Unit Sold Session TV 7200 44 All Fan 480 27 Summer Tube light 54 30 All AC 27000 38 Fridge 40 Summer Switches 58 35 2 mm Wire 520 All Backup Light 790 48 Winter Fan Regulator 83 50 All Bulb 87 37 Rainy Session
  • 15. What do you mean by data Pre-processing ? • It is cleaning and explorating data for analysis • Prepping data for modeling • Modeling in Python requires numerical input • Data preprocessing is a technique that involves transforming raw data into an understandable format. • Data preprocessing is a proven method of resolving such issues.
  • 16. Data Understanding : Relevance of data • What data is available for the task? • Is this data relevant? • Is additional relevant data available? • How much historical data is available?
  • 17. Data Understanding: Quantity of data • Number of instances (records, objects) • Rule of thumb: 5,000 or more desired • if less, results are less reliable; use special methods (boosting, …) • Number of attributes (fields) • Rule of thumb: for each attribute, 10 or more instances • If more fields, use feature reduction and selection • if very unbalanced, use sampling
  • 20. Data Pre-processing Steps • Data Cleaning: the process of filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. • Data Integration: integration of multiple databases, data cubes, or files. • Data Transformation: the task of data normalization and aggregation.
  • 21. Data Pre-processing Steps • Data Reduction: obtaining a representation of the data that is reduced in volume but produces the same or similar analytical results. • Data Discretization: part of data reduction, but of particular importance, especially for numerical data.
  • 23. Data Cleaning • Importance: data cleaning is the number one problem when working with large data. • Data Cleaning Tasks • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Resolve redundancy caused by data integration
  • 24. Data Cleaning: Missing Data • Data is not always available. E.g., a student filling in an admission form may not know the local guardian's contact number. • Missing data may be due to: • equipment malfunction • data that was inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data not being considered important at the time of entry • no recorded history of changes to the data • expansion of the data schema
  • 25. How to Handle Missing Data? • Ignore the tuple (loss of information) • Fill in missing values manually: tedious, infeasible? • Fill them in automatically with a global constant, e.g., "unknown" (a new class?!) • Imputation: use the attribute mean to fill in the missing value • Use the most probable value to fill in the missing value
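The three automatic strategies on this slide can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the sample rows, the use of `None` as the missing-value marker, and the helper names `drop_incomplete`, `fill_constant`, and `fill_mean` are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value (an assumption of this sketch).
records = [
    {"item": "TV", "price": 7200, "units_sold": 44},
    {"item": "Fan", "price": 480, "units_sold": 27},
    {"item": "Fridge", "price": None, "units_sold": 40},
]

def drop_incomplete(rows):
    """Ignore the tuple: keep only rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_constant(rows, field, constant="unknown"):
    """Replace missing values of `field` with a global constant."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]

def fill_mean(rows, field):
    """Replace missing values of `field` with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    field_mean = mean(observed)
    return [{**r, field: field_mean if r[field] is None else r[field]} for r in rows]

print(len(drop_incomplete(records)))           # 2 rows survive
print(fill_constant(records, "price")[2])      # price becomes "unknown"
print(fill_mean(records, "price")[2]["price"]) # mean of 7200 and 480
```

Dropping rows is the simplest option but discards the other, valid fields of each row; mean imputation keeps the row at the cost of understating the attribute's variance.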
  • 26. Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to: • faulty data collection instruments • data entry problems • data transmission problems • technology limitations • inconsistency in naming conventions • Other data problems that require data cleaning: • duplicate records • incomplete data • inconsistent data
  • 27. How to handle noisy data? • Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc. • Combined computer and human inspection: detect suspicious values automatically and have a human check them
  • 28. Binning Methods for Data Smoothing • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 • Partition into (equi-depth) bins: Bin 1: 4, 8, 9, 15; Bin 2: 21, 21, 24, 25; Bin 3: 26, 28, 29, 34 • Smoothing by bin means: Bin 1: 9, 9, 9, 9, since (4+8+9+15)/4 = 9; Bin 2: 23, 23, 23, 23, since (21+21+24+25)/4 = 22.75 ≈ 23; Bin 3: 29, 29, 29, 29, since (26+28+29+34)/4 = 29.25 ≈ 29 • Smoothing by bin boundaries (each value is replaced by the closer of its bin's minimum and maximum): Bin 1: 4, 4, 4, 15; Bin 2: 21, 21, 25, 25; Bin 3: 26, 26, 26, 34
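The worked example on this slide can be reproduced with a short Python sketch. The function names are hypothetical, the sketch assumes the number of values divides evenly into the bins, and it breaks ties between the two boundaries toward the lower one (the slide does not say how ties are handled; none occur in this data).

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size.
    Assumes len(values) is divisible by n_bins."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum.
    Ties go to the lower boundary (an assumption of this sketch)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```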
  • 29. Data Integration • Data integration combines data from multiple sources • Schema integration: integrate metadata from different sources • Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-# • Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources are different, e.g., different scales, metric vs. British units • Removing duplicates and redundant data
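A minimal sketch of the three integration concerns on this slide, using made-up data. The field names `cust_id`, `cust_no`, `weight_kg`, and `weight_lb`, the sample rows, and the rule "keep source A's value when both sources describe the same customer" are assumptions chosen for illustration only.

```python
# Source A keys customers by "cust_id" and stores weight in kilograms.
source_a = [{"cust_id": 1, "weight_kg": 70.0}, {"cust_id": 2, "weight_kg": 82.5}]
# Source B keys the same entities by "cust_no" and stores weight in pounds.
source_b = [{"cust_no": 2, "weight_lb": 181.9}, {"cust_no": 3, "weight_lb": 150.0}]

LB_PER_KG = 2.20462  # pounds per kilogram

def integrate(a_rows, b_rows):
    """Merge both sources into one schema (cust_id, weight_kg).

    - Entity identification: B's cust_no is treated as A's cust_id.
    - Value conflict: pounds are converted to kilograms.
    - Duplicates: if a customer appears in both sources, A's row is kept.
    """
    merged = {row["cust_id"]: dict(row) for row in a_rows}
    for row in b_rows:
        key = row["cust_no"]
        if key not in merged:
            merged[key] = {
                "cust_id": key,
                "weight_kg": round(row["weight_lb"] / LB_PER_KG, 1),
            }
    return sorted(merged.values(), key=lambda r: r["cust_id"])

integrated = integrate(source_a, source_b)
print([r["cust_id"] for r in integrated])  # [1, 2, 3]
```

In practice the hard part is deciding that two keys refer to the same entity at all; here that mapping is simply assumed.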
  • 30. Data Transformation • Smoothing: remove noise from data • Normalization: scale values to fall within a small, specified range • Attribute/feature construction: new attributes constructed from the given ones • Aggregation: summarization; integrate data from different sources (tables)
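Normalization into a small, specified range is often done with min-max scaling. The slide does not name a specific method, so the following is one common choice, sketched with a hypothetical helper `min_max_normalize`; the sample prices are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min(values) maps to new_min
    and max(values) maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

prices = [10, 20, 30]
print(min_max_normalize(prices))  # [0.0, 0.5, 1.0]
```

Min-max scaling is sensitive to outliers, because a single extreme value stretches the range and compresses all other values toward one end.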
  • 31. Data Reduction • Data is too big to work with: too many instances, too many features (attributes) • Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results (easily said but difficult to do) • Data reduction strategies • Dimensionality reduction: remove unimportant attributes • Aggregation and clustering: remove redundant or closely associated ones • Sampling
  • 32. Data Reduction: Clustering • Partition the data set into clusters and store only the cluster representation. • Can be very effective if the data is clustered, but not if the data is dirty. • There are many clustering algorithms to choose from.
  • 33. Data Reduction: Sampling • Choose a representative subset of the data • Simple random sampling may perform poorly when the data is skewed. • Develop adaptive sampling methods • Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
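Stratified sampling can be sketched as follows: group the rows by class, then draw the same fraction from each group so that class proportions in the sample approximate those in the full data. The helper `stratified_sample`, its parameters, and the 90/10 example data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw roughly `fraction` of the rows from each class.

    `label_of` maps a row to its class label. Every class contributes
    at least one row, so rare classes are never dropped entirely.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    sample = []
    for members in strata.values():
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample

# Skewed toy data: 90 rows of class "a", 10 rows of class "b".
rows = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(rows, label_of=lambda r: r[0], fraction=0.1)
print(sum(1 for r in sample if r[0] == "a"))  # 9
print(sum(1 for r in sample if r[0] == "b"))  # 1
```

A simple random sample of 10 rows from the same data could easily contain no class "b" rows at all, which is the weakness stratification addresses.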
  • 35. Data Discretization • Discretization is a process that transforms quantitative data into qualitative data. • It can significantly improve the quality of discovered knowledge. • It reduces the running time of various machine learning tasks such as association rule discovery, classification, clustering, and prediction. • It reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. • Interval labels can then be used to replace actual data values.
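One simple way to divide an attribute's range into intervals is equal-width discretization. The slide does not prescribe a particular scheme, so this is only one possible sketch; the helper name `discretize_equal_width`, the labels, and the sample values are assumptions.

```python
def discretize_equal_width(values, labels):
    """Split the range [min, max] into len(labels) equal-width intervals
    and replace each value with the label of its interval."""
    lo, hi = min(values), max(values)
    n = len(labels)
    width = (hi - lo) / n
    result = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: a single interval
        else:
            # The maximum value would fall just past the last interval; clamp it.
            index = min(int((v - lo) / width), n - 1)
        result.append(labels[index])
    return result

values = [0, 10, 20, 30, 40, 50, 60]
print(discretize_equal_width(values, ["low", "medium", "high"]))
# ['low', 'low', 'medium', 'medium', 'high', 'high', 'high']
```

Here the range 0 to 60 is cut into three intervals of width 20, so seven distinct numeric values collapse into three labels.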
  • 38. Part 2 Implementation of Data Cleaning Services Using Python Programming