Module 2 - Data Pre-processing and EDA
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Significance of Exploratory Data Analysis
  • Making sense of data
Data Cleaning Cycle
Data pre-processing
• Data Cleaning/Cleansing - Real-world data tend to be incomplete, noisy, and inconsistent.
• Data Integration - Combining data from multiple sources.
• Data Transformation - Converting data into a form suited to analysis, for example with an encoder.
• Data Reduction - Reducing the representation of the data set.
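The transformation step can be sketched as follows. This is a minimal illustration, assuming a hypothetical table with a categorical `soil_type` column and a numeric `depth_cm` column (neither name comes from the slides). It shows two common encoders and a simple min-max rescaling in pandas.

```python
import pandas as pd

# Hypothetical data: one categorical and one numeric column
df = pd.DataFrame({
    "soil_type": ["clay", "sand", "clay", "loam"],
    "depth_cm": [10, 30, 20, 40],
})

# Label encoding: each category becomes an integer code
# (categories are ordered alphabetically: clay=0, loam=1, sand=2)
df["soil_code"] = df["soil_type"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["soil_type"], prefix="soil")

# Min-max scaling: rescale a numeric column to the range [0, 1]
depth = df["depth_cm"]
df["depth_scaled"] = (depth - depth.min()) / (depth.max() - depth.min())

print(df)
print(one_hot)
```

Label encoding implies an order between categories, so one-hot encoding is usually the safer choice when no such order exists.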
Data Integration
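As a minimal sketch of integration, assume two hypothetical sources that share a `customer_id` key (the tables and column names are illustrative and not from the slides). pandas `merge` can then combine them:

```python
import pandas as pd

# Hypothetical source 1: customer master data
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
})

# Hypothetical source 2: order records from another system
orders = pd.DataFrame({
    "customer_id": [1, 1, 3, 4],
    "amount": [250.0, 100.0, 75.5, 40.0],
})

# Left join: keep every customer; customers without orders get NaN in 'amount',
# and orders whose customer_id is unknown (4) are dropped
combined = customers.merge(orders, on="customer_id", how="left")
print(combined)
```

The choice of `how` ("left", "inner", "outer") decides which unmatched rows survive, so it is worth checking row counts before and after the merge.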
The content of this presentation is proprietary and confidential information of NAST. It is not intended to be distributed to any third party without the written consent of NAST.
Data Cleaning:
Data cleaning involves different techniques depending on the problem and the data type. Overall, incorrect data is either removed, corrected, or imputed.
Steps involved in Data Cleaning:
Duplicates, Type conversion, Scaling / Transformation, Normalization, Missing Values, Outlier Treatment, Irrelevant Data.
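Several of the cleaning steps listed above can be sketched on a small example. The data frame, its column names, and the 1.5 × IQR outlier rule are assumptions chosen for illustration; the slides do not prescribe a specific outlier method.

```python
import pandas as pd

# Hypothetical raw data: a duplicate row, numbers stored as text,
# and one extreme value
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "height_cm": ["160", "172", "172", "168", "165", "999"],
    "notes": ["a", "b", "b", "c", "d", "e"],
})

# 1. Duplicates: drop rows that repeat exactly
clean = raw.drop_duplicates()

# 2. Type conversion: turn the text column into numbers
clean = clean.assign(height_cm=pd.to_numeric(clean["height_cm"]))

# 3. Irrelevant data: drop a column that is not needed for the analysis
clean = clean.drop(columns=["notes"])

# 4. Outlier treatment: keep values within 1.5 * IQR of the quartiles
q1 = clean["height_cm"].quantile(0.25)
q3 = clean["height_cm"].quantile(0.75)
iqr = q3 - q1
in_range = clean["height_cm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range]

print(clean)
```

Here the outlier is removed; depending on the problem it could instead be corrected or replaced by an imputed value.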
Steps in data pre-processing
• Import libraries - NumPy, SciPy, pandas, Matplotlib, scikit-learn
• Import the data set - Kaggle, UCI, API, CSV
data_set = pd.read_csv('Dataset.csv')
import pandas as pd
# Show up to 500 columns and 500 rows when printing a DataFrame
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
# Load the file into a pandas DataFrame (df)
df = pd.read_csv("Soils.csv")
df.head()
# Returns the first 5 rows of the DataFrame
df.tail()
# Returns the last 5 rows of the DataFrame
df.shape
# Returns a tuple representing the dimensions; for example, (48, 14) means 48 rows and 14 columns
The describe function, df.describe(), returns the count, mean, standard deviation, minimum and maximum values, and the quantiles (25%, 50%, and 75% by default) of the numeric columns.
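A small sketch of this summary, using a hypothetical in-memory data frame in place of a loaded CSV file (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for a data set loaded with pd.read_csv
df = pd.DataFrame({
    "pH": [5.4, 5.7, 5.1, 6.0],
    "N": [0.19, 0.17, 0.26, 0.20],
    "group": ["a", "a", "b", "b"],
})

print(df.shape)          # (rows, columns)

# By default describe() summarises only the numeric columns
summary = df.describe()
print(summary)
```

The text column `group` is left out of the summary; passing `include="all"` to `describe` would add count, unique, top, and freq for it.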
Missing Data Handling
• Real-world data often has many irrelevant and missing parts.
Missing data can be handled in two ways:
• Ignore the tuples: suitable when the data set is quite large and multiple values are missing within a tuple.
• Fill the missing values: fill the missing values manually, by the attribute mean, or by the most probable value.
• A constant such as "Not Available" or "NA" can also be used to replace the missing values.
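The two approaches above can be compared side by side. This is a sketch on hypothetical data; using the most frequent value (the mode) as the "most probable value" is an assumption made here for simplicity.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a text column
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Count the missing values in each column
print(df.isna().sum())

# Option 1: ignore the tuples (drop every row that has a missing value)
dropped = df.dropna()

# Option 2: fill the missing values
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())        # attribute mean
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most frequent value

print(dropped)
print(filled)
```

Dropping halves this tiny data set, which is why that option is usually reserved for large data sets where the lost rows matter little.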
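Missing Values - syntax
• Mean: data = data.fillna(data.mean(numeric_only=True))
• Median: data = data.fillna(data.median(numeric_only=True))
• Standard Deviation: data = data.fillna(data.std(numeric_only=True))
• Min: data = data.fillna(data.min(numeric_only=True))
• Max: data = data.fillna(data.max(numeric_only=True))
numeric_only=True restricts each statistic to the numeric columns, so the call does not fail when the data set also contains text columns.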
Missing Values
# importing pandas module
import pandas as pd
# loading data set
data = pd.read_csv('item.csv')
# display the data
print(data)
# replacing missing values in quantity column with mean of that column
data['quantity'] = data['quantity'].fillna(data['quantity'].mean())
# replacing missing values in price column with median of that column
data['price'] = data['price'].fillna(data['price'].median())
# replacing missing values in bought column with standard deviation of that column
data['bought'] = data['bought'].fillna(data['bought'].std())
# replacing missing values in forenoon column with minimum number of that column
data['forenoon'] = data['forenoon'].fillna(data['forenoon'].min())
# replacing missing values in afternoon column with maximum number of that column
data['afternoon'] = data['afternoon'].fillna(data['afternoon'].max())
# display the data after filling the missing values
print(data)
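Because item.csv is not included with the slides, the same idea can be tried on a small in-memory table. The values below are invented for illustration, and passing a dictionary to `fillna` is one possible way to give each column its own fill value in a single call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for item.csv
data = pd.DataFrame({
    "quantity": [10.0, np.nan, 30.0, 20.0],
    "price": [5.0, 7.0, np.nan, 100.0],
    "forenoon": [3.0, np.nan, 8.0, 5.0],
    "afternoon": [np.nan, 4.0, 9.0, 6.0],
})

# One fill value per column, computed from the non-missing entries
fill_values = {
    "quantity": data["quantity"].mean(),     # 20.0
    "price": data["price"].median(),         # 7.0
    "forenoon": data["forenoon"].min(),      # 3.0
    "afternoon": data["afternoon"].max(),    # 9.0
}

# fillna accepts a dict that maps column names to replacement values
filled = data.fillna(fill_values)
print(filled)
```

In the price column the median (7.0) is much less affected by the single large value of 100.0 than the mean would be, which is a common reason to prefer the median for skewed columns.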
Data cleaning, reduction and transformation.pdf
