ml ppt.pptx

Process in Data Cleaning
1. Handling duplicates
2. Removing unwanted columns
3. Check for outliers
4. Handling missing values
5. Uniqueness

Handling duplicated values
Finding the duplicated values
df.duplicated()
• Returns a boolean Series: True for each row that repeats an earlier row of the dataframe
• Otherwise False (by default the first occurrence of a row is marked False)
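
As a minimal sketch of the check above, the toy dataframe below (hypothetical names and values, not from the slides) has a third row that repeats the first. Summing the boolean result gives the number of duplicated rows.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
df = pd.DataFrame({"name": ["Asha", "Ravi", "Asha"], "age": [25, 30, 25]})

# Boolean Series, True for every row that repeats an earlier row.
dup_mask = df.duplicated()

print(dup_mask.tolist())    # [False, False, True]
print(int(dup_mask.sum()))  # 1 duplicated row
```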

Handling duplicated values
Data deduplication
• To remove the duplicated rows in the dataframe.
df.drop_duplicates()
• Returns a new dataframe with the duplicated rows removed; the original dataframe is unchanged unless the result is assigned back
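
A short sketch of deduplication, assuming the same kind of hypothetical toy data. The pandas method is named `drop_duplicates`; resetting the index afterwards is an optional choice made here for readability.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
df = pd.DataFrame({"name": ["Asha", "Ravi", "Asha"], "age": [25, 30, 25]})

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(len(df), len(deduped))  # 3 2
```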

Removing unwanted columns
• The dataset may contain columns that are not useful for the analysis.
• Remove those columns before further analysis.
df.drop(columns=["ColumnName"])
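
A minimal sketch of dropping a column, assuming a hypothetical `notes` column that is not needed. Passing `columns=` makes it explicit that a column label, not a row label, is being dropped.

```python
import pandas as pd

# Hypothetical toy data with one column we do not need.
df = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7], "notes": ["a", "b"]})

# drop() returns a new dataframe; df itself still has all three columns.
trimmed = df.drop(columns=["notes"])

print(list(trimmed.columns))  # ['id', 'score']
```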

Skewness and Kurtosis
Skewness
Skewness is a measure of symmetry, or more precisely, the lack of
symmetry.
• Three types of skew:
1. Right skew (positive skew) => typically mean > median
2. Left skew (negative skew) => typically mean < median
3. Zero skew (symmetric) => mean and median roughly equal
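
The three cases can be checked with the pandas `skew()` method. This is only a sketch on hypothetical series: one with a long right tail, its mirror image, and a symmetric one.

```python
import pandas as pd

# Hypothetical series: a long right tail, its mirror image, and a symmetric one.
right = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])
left = -right
symmetric = pd.Series([1, 2, 3, 4, 5])

for label, s in [("right", right), ("left", left), ("symmetric", symmetric)]:
    # Print the sample skewness next to mean and median for comparison.
    print(label, round(s.skew(), 3), s.mean(), s.median())
```

On this data the right-skewed series has a mean above its median and the left-skewed series has a mean below it.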

Skewness and Kurtosis
Kurtosis:
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
1. Mesokurtic (excess kurtosis of 0, as for a normal distribution)
2. Platykurtic (negative excess kurtosis => thin-tailed)
3. Leptokurtic (positive excess kurtosis => fat-tailed)
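
A small sketch of the sign convention, using hypothetical data. The pandas `kurt()` method reports excess kurtosis (a normal distribution scores 0), so a flat series comes out negative and a series with extreme values comes out positive.

```python
import pandas as pd

# Hypothetical series: evenly spread values vs. mostly zeros with two extremes.
flat = pd.Series(range(1, 11))
heavy_tailed = pd.Series([-10, 0, 0, 0, 0, 0, 0, 0, 0, 10])

# kurt() returns excess kurtosis, so the sign tells the tail type.
print("flat:", flat.kurt())                  # negative => platykurtic
print("heavy-tailed:", heavy_tailed.kurt())  # positive => leptokurtic
```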

Handling missing values
Finding NaN values
• To find the missing values, use
dataframe.isnull()
• Returns True where the value is missing (NaN or None)
• Otherwise False
dataframe.notnull() => Opposite of isnull()
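
A minimal sketch on hypothetical data. Chaining `.sum()` after `isnull()` is a common way to count the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["Pune", "Delhi", None]})

mask = df.isnull()                   # True where a value is missing
missing_per_column = mask.sum()      # count of missing values per column

print(mask)
print(missing_per_column.to_dict())  # {'age': 1, 'city': 1}
```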

Filling missing values
• To fill the NaN values we can use
df.fillna()
• Inside fillna() we can pass any constant value, or the mean, median, or mode of the column.
df["ColumnName"].fillna(df["ColumnName"].mean())
• .median() works the same way; .mode() returns a Series, so pass .mode()[0]
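
A sketch of filling per column, assuming hypothetical `age` and `city` columns. Note that `mean`, `median` and `mode` are methods and must be called, and that `mode()` returns a Series, so its first entry is used here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing number and a missing label.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": ["Pune", None, "Pune", "Delhi"],
})

# Numeric column: fill with the mean (use .median() the same way).
age_filled = df["age"].fillna(df["age"].mean())

# Categorical column: fill with the most frequent value.
city_filled = df["city"].fillna(df["city"].mode()[0])

print(age_filled.tolist())   # [20.0, 40.0, 40.0, 60.0]
print(city_filled.tolist())  # ['Pune', 'Pune', 'Pune', 'Delhi']
```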

Dropping NaN values
• An alternative way of handling missing values is dropping them.
df.dropna()
• Returns a new dataframe without the rows that contain at least one NaN value.
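
A short sketch on hypothetical data. The `subset` argument, shown as an optional variation, restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each have one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

complete_rows = df.dropna()           # drop rows with any missing value
has_a = df.dropna(subset=["a"])       # drop rows only when "a" is missing

print(len(complete_rows), len(has_a))  # 1 2
```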

Uniqueness
To find the unique values present in a dataframe column:
df["ColumnName"].unique()
Returns the distinct values in that column as an array.
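
A minimal sketch with a hypothetical `city` column. The values come back in order of first appearance, and `nunique()` gives just the count.

```python
import pandas as pd

# Hypothetical toy data with repeated city names.
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"]})

unique_cities = df["city"].unique()   # array of distinct values
n_unique = df["city"].nunique()       # number of distinct values

print(list(unique_cities))  # ['Pune', 'Delhi', 'Mumbai']
print(n_unique)             # 3
```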

EDA
• Also called RCA
• There are two types
1. Data visualization
2. Statistical methods

Data visualization
Graphical representation of the dataset.
1. Univariate analysis
2. Bivariate analysis
3. Multivariate analysis

Statistical methods
1. Correlation analysis:
df.corr()
• Returns the pairwise correlation between the numeric columns as an n x n matrix, where n is the number of numeric columns
• Each value lies between -1 and 1
• As a rule of thumb:
• -0.1 to 0.1 => weak or no linear correlation
• 0.1 to 0.5 or -0.1 to -0.5 => moderate correlation
• > 0.5 or < -0.5 => strong correlation
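
A sketch of the correlation matrix on hypothetical columns chosen so the result is easy to read: `y` rises exactly with `x`, and `z` falls exactly as `x` rises.

```python
import pandas as pd

# Hypothetical toy data: y = 2x (perfect positive), z = 6 - x (perfect negative).
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
})

corr = df.corr()  # Pearson correlation by default

print(corr.shape)  # (3, 3)
print(corr.round(2))
```

The diagonal is always 1, since every column is perfectly correlated with itself.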

2. ANOVA table
• One way ANOVA table
• Two way ANOVA table
3. Chi-Square test
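
The slides name these tests without showing a call. One possible sketch uses SciPy, which is an assumption here since the deck itself only uses pandas. It runs `scipy.stats.f_oneway` for a one-way ANOVA and `scipy.stats.chi2_contingency` for a chi-square test of independence. The data and column names are hypothetical, and two-way ANOVA is not covered.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: a numeric score and a pass/fail label for three groups.
df = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "score": [5.1, 4.9, 5.0, 5.2, 6.0, 6.1, 5.9, 6.2, 7.0, 7.1, 6.9, 7.2],
    "passed": ["no", "no", "no", "yes", "no", "yes",
               "yes", "no", "yes", "yes", "yes", "yes"],
})

# One-way ANOVA: do the group means of "score" differ?
samples = [g["score"].to_numpy() for _, g in df.groupby("group")]
f_stat, p_anova = stats.f_oneway(*samples)

# Chi-square test: is "passed" independent of "group"?
table = pd.crosstab(df["group"], df["passed"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print("ANOVA p-value:", p_anova)
print("Chi-square p-value:", p_chi2, "dof:", dof)
```

A small p-value suggests the group means differ (ANOVA) or that the two categorical columns are not independent (chi-square). With so few rows the chi-square result is only illustrative.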
