Data Cleaning
Steps in Data Cleaning
1. Handling duplicates
2. Removing unwanted columns
3. Check for outliers
4. Handling missing values
5. Uniqueness
Handling duplicated values
Finding the duplicated values
df.duplicated()
• Returns a boolean Series with True for each row that repeats an earlier row
• Otherwise False (by default, the first occurrence of a row is marked False)
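A minimal sketch of finding duplicated rows, assuming a small made-up dataframe (the column names and values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical sample data: the third row repeats the first.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cal"],
    "score": [90, 85, 90, 70],
})

# Boolean Series: True for each row that repeats an earlier row.
dup_mask = df.duplicated()

# Count the duplicated rows and look at them.
n_duplicates = int(dup_mask.sum())
duplicate_rows = df[dup_mask]
print(dup_mask)
print(n_duplicates)
```

Summing the mask is a quick way to see how many rows would be removed before actually deleting anything.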
Handling duplicated values
Data deduplication
• To remove the duplicated rows in the dataframe.
df.drop_duplicates()
• Returns a new dataframe without the duplicated rows; the first occurrence of each row is kept by default
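A short sketch of deduplication with the pandas method `drop_duplicates()`, again on hypothetical sample data. The `subset` and `keep` arguments are optional extras, shown here as one possible usage:

```python
import pandas as pd

# Hypothetical sample data: the third row repeats the first.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cal"],
    "score": [90, 85, 90, 70],
})

# Returns a new dataframe; df itself is left unchanged.
deduped = df.drop_duplicates()

# Optionally compare only some columns, keep the last occurrence,
# and renumber the index afterwards.
deduped_by_name = (
    df.drop_duplicates(subset=["name"], keep="last")
      .reset_index(drop=True)
)
print(deduped)
print(deduped_by_name)
```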
Removing unwanted columns
• The dataset may contain columns that are not useful for the analysis.
• Remove those columns before further analysis.
df.drop(columns=["ColumnName"])
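A minimal sketch of dropping columns, assuming a hypothetical dataframe in which the `id` column is not needed. Passing the labels through `columns=` tells pandas to drop columns rather than rows:

```python
import pandas as pd

# Hypothetical sample data; "id" is assumed to be irrelevant for analysis.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cal"],
    "score": [90, 85, 70],
})

# Returns a new dataframe without the "id" column; df is unchanged.
trimmed = df.drop(columns=["id"])

# errors="ignore" skips labels that do not exist instead of raising.
trimmed_safe = df.drop(columns=["id", "notes"], errors="ignore")
print(trimmed.columns.tolist())
```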
Skewness and Kurtosis
Skewness
Skewness is a measure of symmetry, or more precisely, the lack of
symmetry.
• Three types of skew:
1. Right skew (positive skew) => typically mean > median
2. Left skew (negative skew) => typically mean < median
3. Zero skew => symmetric distribution, mean ≈ median
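The skew types above can be checked with the pandas method `Series.skew()`. This is a sketch on made-up values: one series with a long right tail and one symmetric series.

```python
import pandas as pd

# Hypothetical data with one large value forming a long right tail.
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

# Sample skewness: positive for right skew, negative for left skew.
skewness = right_skewed.skew()
mean_value = right_skewed.mean()
median_value = right_skewed.median()

# A symmetric series has skewness of (approximately) zero.
symmetric = pd.Series([1, 2, 3, 4, 5])
symmetric_skewness = symmetric.skew()

print(skewness, mean_value, median_value)
print(symmetric_skewness)
```

In the right-skewed series the large value pulls the mean above the median, which matches the rule of thumb in the list.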
Skewness and Kurtosis
Kurtosis:
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed
relative to a normal distribution.
1. Mesokurtic (excess kurtosis of 0, like the normal distribution)
2. Platykurtic (negative excess kurtosis => thin tailed)
3. Leptokurtic (positive excess kurtosis => fat tailed)
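As a rough illustration, the pandas method `Series.kurt()` reports excess kurtosis, so a normal sample lands near 0. The three simulated samples below (seed and sample size chosen arbitrarily) are assumptions made only to show one example of each category:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 10_000

# Simulated samples, one per kurtosis category.
normal_sample = pd.Series(rng.normal(size=n))    # mesokurtic
uniform_sample = pd.Series(rng.uniform(size=n))  # platykurtic (thin tails)
laplace_sample = pd.Series(rng.laplace(size=n))  # leptokurtic (fat tails)

# Series.kurt() returns excess kurtosis (normal distribution => about 0).
normal_kurt = normal_sample.kurt()
uniform_kurt = uniform_sample.kurt()
laplace_kurt = laplace_sample.kurt()

print(normal_kurt, uniform_kurt, laplace_kurt)
```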
Handling missing values
Finding NaN values
• To find the missing values, use
dataframe.isnull()
• Returns a dataframe of the same shape with True where the value is missing (NaN or None),
• Otherwise False
dataframe.notnull() => Opposite of isnull()
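A minimal sketch of locating missing values, assuming a hypothetical dataframe with one gap in each column. Chaining `.sum()` onto the mask gives a per-column count, which is often more readable than the full True/False table:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column.
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "city": ["Pune", "Delhi", None],
})

# Same shape as df: True where a value is missing.
null_mask = df.isnull()

# notnull() is the element-wise opposite of isnull().
present_mask = df.notnull()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```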
Handling missing values
Filling missing values
• To fill the NaN values we can use
df.fillna()
• Pass fillna() a constant value, or the mean, median, or mode of the column.
df["ColumnName"].fillna(df["ColumnName"].mean())
• Use .median() in place of .mean() for the median, or .mode()[0] for the mode (mode() returns a Series).
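A sketch of filling missing values column by column, on hypothetical data. The choice of mean for the numeric column and mode for the text column is only one reasonable option, not a fixed rule:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": ["Pune", None, "Pune", "Delhi"],
})

# Numeric column: fill with the mean (or the median) of the column.
age_mean_filled = df["age"].fillna(df["age"].mean())
age_median_filled = df["age"].fillna(df["age"].median())

# Text column: fill with the most frequent value.
# mode() returns a Series, so take its first entry.
city_filled = df["city"].fillna(df["city"].mode()[0])

print(age_mean_filled.tolist())
print(city_filled.tolist())
```

Note that `fillna()` returns a new object; assign the result back to the column if the dataframe itself should be updated.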
Handling missing values
Dropping NaN values
• An alternative way of handling missing values is to drop them.
df.dropna()
• Returns a new dataframe without the rows that contain at least one NaN value.
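A short sketch of `dropna()` on hypothetical data. The `how` and `subset` arguments are optional and shown here only as variations on the default behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, None],
})

# Default: drop every row that has at least one missing value.
complete_rows = df.dropna()

# Drop a row only when all of its values are missing.
not_all_missing = df.dropna(how="all")

# Consider only the "age" column when deciding which rows to drop.
age_present = df.dropna(subset=["age"])

print(len(complete_rows), len(not_all_missing), len(age_present))
```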
Uniqueness
To find the unique values present in a dataframe column:
df["ColumnName"].unique()
Returns the distinct values of that column as an array, in order of first appearance.
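A minimal sketch of inspecting unique values, assuming a hypothetical `city` column. `nunique()` and `value_counts()` are related pandas methods included here as optional companions to `unique()`:

```python
import pandas as pd

# Hypothetical sample data with repeated values.
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Goa", "Delhi"]})

# Distinct values, in order of first appearance.
unique_cities = df["city"].unique()

# Number of distinct values.
n_unique = df["city"].nunique()

# How often each value occurs.
city_counts = df["city"].value_counts()

print(list(unique_cities))
print(n_unique)
```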
Exploratory Data Analysis (EDA)
EDA
• Also called RCA
• There are two types
1. Data visualization
2. Statistical methods
Data visualization
Graphical representation of the dataset.
1. Univariate analysis
2. Bivariate analysis
3. Multivariate analysis
Statistical methods
1. Correlation analysis:
df.corr()
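• Returns the pairwise correlation between the numeric columns as an n x n matrix,
where n is the number of numeric columns (pass numeric_only=True to skip non-numeric columns)
• Values lie between -1 and 1
• Rule of thumb (thresholds vary by field):
• -0.1 to 0.1 => weak or no linear correlation
• 0.1 to 0.5 or -0.1 to -0.5 => moderate correlation
• > 0.5 or < -0.5 => strong correlation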
2. ANOVA table
• One way ANOVA table
• Two way ANOVA table
3. Chi-Square test
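The correlation analysis in item 1 can be sketched as follows. The dataframe is made up for illustration: `score` rises with `hours` and `absences` falls, so one strong positive and one strong negative correlation are expected.

```python
import pandas as pd

# Hypothetical sample data with three numeric columns.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 61, 70, 74],
    "absences": [9, 7, 6, 3, 1],
})

# n x n matrix of pairwise Pearson correlations (n = 3 columns here).
corr_matrix = df.corr()

# Read individual entries by row and column label.
hours_vs_score = corr_matrix.loc["hours", "score"]
hours_vs_absences = corr_matrix.loc["hours", "absences"]

print(corr_matrix.round(2))
```

The diagonal is always 1, since every column is perfectly correlated with itself, and the matrix is symmetric.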
