Data preprocessing consists of three main stages: data cleaning, data transformation, and data reduction.
Data Cleaning
This stage aims to eliminate errors, inconsistencies, and noise from the dataset. Examples of data cleaning tasks include:
Identifying and correcting missing values
Removing duplicates
Resolving inconsistencies in data formats or coding schemes
Detecting and handling outliers
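The cleaning tasks above can be sketched in pandas on a small made-up DataFrame (the `price` and `city` columns and their values are purely illustrative):

```python
import pandas as pd

# Hypothetical example data; 'price' and 'city' are assumed column names.
df = pd.DataFrame({
    "price": [10.0, None, 10.0, 250.0],
    "city":  ["NY", "ny", "NY", "LA"],
})

# Identify and correct missing values (here: fill with the column mean)
df["price"] = df["price"].fillna(df["price"].mean())

# Resolve inconsistent coding schemes (normalize city labels to upper case)
df["city"] = df["city"].str.upper()

# Remove duplicate rows
df = df.drop_duplicates()

print(df)
```

After filling, recoding, and deduplicating, the identical `(10.0, "NY")` rows collapse to one.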
Data Transformation
In this phase, the data is prepared for analysis by converting it into a suitable format. Examples of data transformation techniques include:
Normalization: Scaling data to a common range
Standardization: Transforming data to have zero mean and unit variance
Discretization: Replacing continuous data with discrete categories
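A minimal pandas sketch of all three transformations, using an illustrative `age` series (the column name and bin edges are assumptions):

```python
import pandas as pd

# Hypothetical ages; the values and bin boundaries are illustrative.
s = pd.Series([18, 25, 40, 60, 72], name="age")

# Normalization: min-max scaling to the [0, 1] range
normalized = (s - s.min()) / (s.max() - s.min())

# Standardization: zero mean, unit variance (z-score)
standardized = (s - s.mean()) / s.std()

# Discretization: replace continuous ages with labeled bins
binned = pd.cut(s, bins=[0, 30, 55, 100],
                labels=["young", "middle", "senior"])

print(normalized.round(2).tolist())
print(binned.tolist())
```

Normalization maps the smallest value to 0 and the largest to 1, while standardization centers the series at mean 0.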
Data Reduction
Reducing the size of the dataset while retaining important information helps improve the efficiency of data analysis and prevents overfitting. Methods employed during data reduction include:
Feature selection: Selecting a subset of relevant features from the dataset
Feature extraction: Transforming the data into a lower-dimensional space while maintaining important information
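As a sketch of both ideas with plain NumPy (the data is random and the variance threshold of 0.1 is an arbitrary choice): feature selection here keeps high-variance columns, and feature extraction projects onto principal components computed via SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 2] *= 0.01          # make one feature nearly constant (low variance)

# Feature selection: keep columns whose variance exceeds a threshold
variances = X.var(axis=0)
selected = X[:, variances > 0.1]   # drops the low-variance column

# Feature extraction: project onto the top-2 principal components (PCA via SVD)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T              # 100 rows, 2 extracted features

print(selected.shape, X_pca.shape)
```

In practice libraries such as scikit-learn provide ready-made versions of both steps (e.g. `VarianceThreshold`, `PCA`).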
These steps are crucial for preparing data for analysis and improving the accuracy of the resulting models. The techniques used depend on the nature of the data and the analysis goals.
1. Module 2 - Data Pre-processing and EDA
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Significance of Exploratory Data Analysis
• Making sense of Data
5. Data pre-processing
• Data Cleaning/Cleansing - Real-world data tend to be incomplete, noisy, and inconsistent.
• Data Integration - Combining data from multiple sources.
• Data Transformation - Converting data into a suitable format, e.g. encoding categorical variables.
• Data Reduction - Reducing the representation of the data set.
7. The content of this presentation is proprietary and confidential information of NAST. It is not intended to be distributed to any third party without the written consent of NAST.
Data Cleaning:
Data cleaning involves different techniques depending on the problem and the data type. Overall, incorrect data is either removed, corrected, or imputed.
Steps involved in data cleaning: handling duplicates, type conversion, scaling/transformation, normalization, missing values, outlier treatment, and irrelevant data.
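One common way to carry out the outlier-treatment step above is the interquartile-range (IQR) rule; a sketch on illustrative data (the values are made up, and the 1.5×IQR fence is the conventional choice):

```python
import pandas as pd

# Illustrative values; 999 is an obvious outlier.
s = pd.Series([12, 14, 13, 15, 14, 999, 13])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove outliers beyond the fences
cleaned = s[(s >= lower) & (s <= upper)]

# Option 2: cap (winsorize) outliers at the fence values
capped = s.clip(lower, upper)

print(cleaned.tolist())
```

Removing drops the outlying row entirely; capping keeps the row but pulls the extreme value back to the fence.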
9. Steps in data pre-processing
• Import libraries - NumPy, SciPy, pandas, Matplotlib, scikit-learn
• Import the data set - from Kaggle, the UCI repository, an API, or a CSV file
data_set = pd.read_csv('Dataset.csv')
10. import pandas as pd

# Widen the display limits so large DataFrames print fully
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

# Load the data into a pandas DataFrame (df)
df = pd.read_csv("Soils.csv")

df.head()  # Returns the first 5 rows of the DataFrame
11. df.tail()   # Returns the last 5 rows of the DataFrame
df.shape        # Returns a tuple representing the dimensions
# df.shape == (48, 14) represents 48 rows and 14 columns.
12. The describe() function returns the count, mean, standard deviation, minimum and maximum values, and the quantiles of the data.
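For instance, on a small made-up DataFrame (the column names are assumptions), describe() yields a summary table indexed by those statistics:

```python
import pandas as pd

# Small illustrative DataFrame; column names are assumed.
df = pd.DataFrame({"price": [10, 20, 30, 40], "quantity": [1, 2, 2, 3]})

summary = df.describe()
print(summary)
# 'summary' is indexed by: count, mean, std, min, 25%, 50%, 75%, max
print(summary.loc["mean", "price"])
```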
17. Missing Data Handling
Real-world data often contain many irrelevant and missing parts. Missing data can be handled in two ways:
• Ignore the tuples: suitable when the dataset is quite large and a tuple has multiple missing values.
• Fill in the missing values: fill them in manually, or with the attribute mean or the most probable value.
• A placeholder such as "Not Available" or "NA" can also be used to mark missing values.
19. Missing Values
# importing pandas module
import pandas as pd
# loading data set
data = pd.read_csv('item.csv')
# display the data
print(data)
20. # replacing missing values in quantity
# column with mean of that column
data['quantity'] = data['quantity'].fillna(data['quantity'].mean())
# replacing missing values in price column
# with median of that column
data['price'] = data['price'].fillna(data['price'].median())
# replacing missing values in bought column with
# standard deviation of that column
data['bought'] = data['bought'].fillna(data['bought'].std())
# replacing missing values in forenoon column with
# minimum number of that column
data['forenoon'] = data['forenoon'].fillna(data['forenoon'].min())
# replacing missing values in afternoon column with
# maximum number of that column
data['afternoon'] = data['afternoon'].fillna(data['afternoon'].max())
print(data)