2. AUTHOR
NAME :- Soham Chakraborty
COLLEGE :- Techno India University
COURSE :-B.Sc Data Science
SEMESTER :- 4th sem, 2nd year
YEAR OF PASSING :- 2021
e-mail ID :- sohamchakraborty777@gmail.com
4. INTRODUCTION
Machine learning is a subset of artificial intelligence in the field of computer science
that often uses statistical techniques to give computers the ability to "learn" from
data without being explicitly programmed. Machine learning helps us analyse large
amounts of data in less time and with great accuracy. For example, pollution levels in
Madrid vary widely and can change drastically within days; machine learning can help
estimate the level of one gas by analysing the levels of other gases.
5. PYTHON
• Python is a high-level, general-purpose, open-source, strongly typed programming
language. The language provides constructs intended to enable clear programs on
both a small and a large scale.
• Python was created by Guido van Rossum.
• The Python Software Foundation (PSF) is the organization behind Python.
• Current versions (at the time of writing):
• 3.6.3
• 2.7.14
6. PYTHON FEATURES
• Some of the features of Python include
• Dynamic
• Object oriented
• Multipurpose
• Strongly typed
• Open source
• Python is widely used in many domains
• Web Development
• Data Analysis
• Machine Learning
• Internet of Things
• GUI Development
• Image processing
• Data visualization
• Game Development
7. LIBRARIES
• Pandas
• In computer programming, pandas is a software library written for the Python
programming language for data manipulation and analysis. In particular, it offers data
structures and operations for manipulating numerical tables and time series.
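As a minimal sketch of the two kinds of data mentioned above, the snippet below builds a small numerical table (a DataFrame) and a daily time series (a Series with a date index). The column names, values and dates are invented for illustration and are not taken from the project dataset.

```python
import pandas as pd

# A small numerical table: each column is a Series, the whole is a DataFrame.
# The apps and numbers are made-up illustration values.
table = pd.DataFrame({"app": ["A", "B", "C"], "reviews": [10, 20, 60]})
average_reviews = table["reviews"].mean()

# A time series: values indexed by consecutive days (hypothetical dates).
days = pd.date_range("2021-01-01", periods=4, freq="D")
daily = pd.Series([1, 2, 3, 4], index=days)

# Aggregate the daily values into two-day totals.
two_day_totals = daily.resample("2D").sum()

print(average_reviews)
print(two_day_totals)
```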
8. INTEGRATED DEVELOPMENT ENVIRONMENT
• An integrated development environment is a software application that provides
comprehensive facilities to computer programmers for software development. An
IDE normally consists of a source code editor, build automation tools, and a
debugger.
• Spyder is the Scientific Python Development Environment:
• A powerful interactive development environment for the Python language with advanced
editing, interactive testing, debugging and introspection features.
• A numerical computing environment, thanks to the support of IPython (an enhanced
interactive Python interpreter) and popular Python libraries such as NumPy (linear
algebra), SciPy (signal and image processing) and matplotlib (interactive 2D/3D plotting).
9. PROBLEM STATEMENT
• Real-world data are rarely clean and homogeneous. Values can go missing during
data extraction or collection. Missing values need to be handled because they degrade
every performance metric we measure. They can also lead to wrong predictions or
classifications and can introduce a high bias into whichever model is used.
• Depending on the data source, missing data are marked differently. Pandas represents
missing values as NaN. However, unless the data have been pre-processed, an analyst will
not necessarily encounter missing values as NaN: they can appear as a question mark (?),
a zero (0), a minus one (-1) or a blank. As a result, it is important that a data
scientist always performs exploratory data analysis (EDA) before writing any machine
learning algorithm. EDA is simply a litmus test for understanding the behaviour of
our data.
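The point about differently marked missing values can be sketched as follows. This is an illustrative example on a tiny made-up CSV, not the project dataset: the column names and the choice of "?" and -1 as placeholders are assumptions. It uses the `na_values` parameter of `pd.read_csv` and `Series.replace` so that every placeholder ends up as NaN.

```python
import io
import pandas as pd

# Hypothetical raw data: "?" and a blank mark missing reviews, -1 marks a missing rating.
raw = io.StringIO("app,reviews,rating\nA,120,4.5\nB,?,-1\nC,,3.9\n")

# Tell pandas to read "?" as missing; blank cells become NaN by default.
df = pd.read_csv(raw, na_values=["?"])

# -1 is a valid number, so it has to be converted to NaN explicitly,
# and only in the column where it is known to be a placeholder.
df["rating"] = df["rating"].replace(-1, float("nan"))

# Count the missing values per column.
missing = df.isnull().sum()
print(missing)
```

Whether 0 or -1 really means "missing" depends on the column, which is why such placeholders should be confirmed during EDA before they are replaced.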
10. SOLUTION
• There are several options for handling missing values, each with its own pros and cons. However,
the right choice depends largely on the nature of our data and of the missing
values. Below is a summary of the options we have for handling missing values.
1) DROP MISSING VALUES
2) FILL MISSING VALUES WITH A SUMMARY STATISTIC (MEAN OR MEDIAN)
3) PREDICT MISSING VALUES WITH A MACHINE LEARNING ALGORITHM
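A minimal sketch of the first two options, using a made-up four-row table (the column names and values are hypothetical). The third option needs a fitted model and is not shown here.

```python
import pandas as pd

# Hypothetical data with one missing value in the "reviews" column.
df = pd.DataFrame({
    "app": ["A", "B", "C", "D"],
    "reviews": [10.0, None, 30.0, 1000.0],
})

# Option 1: drop every row whose "reviews" value is missing.
dropped = df.dropna(subset=["reviews"])

# Option 2: fill the gap with a statistic of the observed values.
# The median is less sensitive to the outlier (1000) than the mean.
median_filled = df["reviews"].fillna(df["reviews"].median())
mean_filled = df["reviews"].fillna(df["reviews"].mean())

print(len(dropped))
print(median_filled[1], mean_filled[1])
```

Dropping loses a whole row for the sake of one empty cell, while filling keeps the row at the cost of an invented value; with skewed data such as review counts, the median is usually the safer fill.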
• Below are a few commands for detecting missing values during EDA:
1. data_name.describe()
2. data_name.info()
3. data_name.head(x)
4. data_name.isnull().sum()
11. CODE
# import pandas for DataFrame operations
import pandas as pd
# load the dataset (adjust the path to where the file is stored)
df = pd.read_csv(r"D:\Data_Sets\Google-Playstore-32K(1).csv")
print("The First 3 Rows of the table are shown as below")
print(df.head(3))
print("Dimension of the acquired Data Frame:", df.shape)
print("Description of the only numeric Column is given below\n", df.describe())
print("Mean for Reviews Column(with missing values):", df.Reviews.mean())
print("Median for Reviews Column(with missing values):", df.Reviews.median())
print("Total No. of Reviews:", df.Reviews.count())
# check the number of missing values
print("The Number of missing values in each column\n", df.isnull().sum())
12. CODE (CONTINUED)
# user-defined function to fill missing numeric values with the column median
def fill_with_median(series):
    return series.fillna(series.median())
# fill the empty cells in the Reviews column
df["Reviews"] = fill_with_median(df["Reviews"])
# check the number of missing values again:
# there are now no empty cells in the Reviews column
print("The Number of missing values in each column after transforming\n", df.isnull().sum())
print("Mean for Reviews Column(without missing values):", df.Reviews.mean())
print("Median for Reviews Column(without missing values):", df.Reviews.median())
13. OUTPUT
The First 3 Rows of the table are shown as below
App Name | Category | Rating | Reviews | Installs | Size | Price | Content Rating | Last Updated | Minimum Version | Latest Version
DoorDash – Food Delivery | FOOD_AND_DRINK | 4.548561573 | 305034.0 | 5,000,000+ | Varies with device | 0 | Everyone | 29-Mar-19 | Varies with device | Varies with device
TripAdvisor Hotels... | TRAVEL_AND_LOCAL | 4.400671482 | 1207922.0 | 100,000,000+ | Varies with device | 0 | Everyone | 29-Mar-19 | Varies with device | Varies with device
Peapod | SHOPPING | 3.656329393 | 1967.0 | 100,000+ | 1.4M | 0 | Everyone | 20-Sep-18 | 5.0 and up | 2.2.0
Dimension of the acquired Data Frame: (32000, 11)
14. OUTPUT
• Description of the only numeric Column is given below
• Reviews
• count 3.199300e+04
• mean 9.850928e+04
• std 1.173820e+06
• min 1.000000e+00
• 25% 1.390000e+02
• 50% 1.464000e+03
• 75% 1.445100e+04
• max 8.621429e+07
• Mean for Reviews Column(with missing values): 98509.28305747504
• Median for Reviews Column(with missing values): 1464.0
• Total No. of Reviews: 31993
15. OUTPUT
• The Number of missing values in each column
• App Name 0
• Category 0
• Rating 0
• Reviews 7
• Installs 0
• Size 0
• Price 0
• Content Rating 0
• Last Updated 0
• Minimum Version 0
• Latest Version 0
• dtype: int64
16. OUTPUT
• The Number of missing values in each column after transforming
• App Name 0
• Category 0
• Rating 0
• Reviews 0
• Installs 0
• Size 0
• Price 0
• Content Rating 0
• Last Updated 0
• Minimum Version 0
• Latest Version 1
• dtype: int64
17. OUTPUT
• Mean for Reviews Column(without missing values): 98488.05440180622
• Median for Reviews Column(without missing values): 1464.0
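The drop in the mean after filling, with the median left unchanged, can be checked by hand from the printed figures alone. The sketch below is only an arithmetic check and does not touch the dataset; the variable names are chosen for illustration.

```python
# Figures copied from the printed output above.
count_present = 31993            # non-missing Reviews values
count_missing = 7                # missing Reviews values
mean_before = 98509.28305747504  # mean of the non-missing values
median_before = 1464.0           # value used to fill the gaps

# Each missing cell was replaced by the median, so the new total is the
# old total plus 7 copies of the median, spread over all 32000 rows.
total_rows = count_present + count_missing
mean_after = (count_present * mean_before + count_missing * median_before) / total_rows

print(total_rows)
print(round(mean_after, 2))
```

The result agrees with the reported mean of 98488.05. The mean falls because the seven filled values (1464.0 each) lie far below the original mean, which is pulled up by a few apps with very large review counts; the median stays at 1464.0 because every added value equals the median itself.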
18. SOURCE OF DATASET
• All the data in this dataset come from Kaggle.com, which should be acknowledged
for the data collection. The dataset aims to provide a more convenient format for
data scientists, as well as some enhanced context in a single place.
19. CONCLUSION
• As the few lines of code above show, using pandas to fill empty cells is
quite simple. This is where the library truly shines: it makes it easy to manipulate
data to get the required insights.