PANDAS APPLICATION
AUTHOR
NAME :- Soham Chakraborty
COLLEGE :- Techno India University
COURSE :-B.Sc Data Science
SEMESTER :- 4th sem,
2nd year
YEAR OF PASSING :- 2021
e-mail ID :- sohamchakraborty777@gmail.com
CONTENT
• Introduction
• Python
• Libraries
• Integrated development environment
• Problem statement
• Solution
• Code
• Output
• Source
• Conclusion
INTRODUCTION
Machine learning is a subset of artificial intelligence in the field of computer science
that often uses statistical techniques to give computers the ability to "learn" from
data without being explicitly programmed. Machine learning helps us analyse a lot
of data in less time and with great accuracy. For example, pollution levels in Madrid
tend to vary drastically within days, and machine learning can help estimate the
level of one gas by analysing the levels of other gases.
PYTHON
• Python is a high-level, general-purpose, open-source, dynamically and strongly typed
programming language. The language provides constructs intended to enable clear
programs on both a small and a large scale.
• Python was created by Guido van Rossum.
• The Python Software Foundation (PSF) is the organization behind Python.
• Current versions (at the time of writing):
• 3.6.3
• 2.7.14
• Python features
• Some of the features of Python include
• Dynamic
• Object oriented
• Multipurpose
• Strongly typed
• Open source
• Python is widely used in many domains
• Web Development
• Data Analysis
• Machine Learning
• Internet of Things
• GUI Development
• Image processing
• Data visualization
• Game Development
LIBRARIES
• Pandas
• In computer programming, pandas is a software library written for the Python
programming language for data manipulation and analysis. In particular, it offers data
structures and operations for manipulating numerical tables and time series.
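As a minimal sketch of the two data structures mentioned above, the snippet below builds a Series holding a small time series and a DataFrame holding a small table. The column names and values are made up for illustration and are not taken from the dataset used later.

```python
import pandas as pd

# A Series is a one-dimensional labelled array; here the labels are dates,
# which makes it a small time series (illustrative values only).
dates = pd.date_range("2021-01-01", periods=3, freq="D")
daily_reviews = pd.Series([10, 12, 9], index=dates)

# A DataFrame is a two-dimensional table with named columns.
apps = pd.DataFrame({"App": ["A", "B", "C"], "Reviews": [100, 250, 40]})

# Typical operations: aggregate a column, find the label of the largest value.
total_reviews = apps["Reviews"].sum()
busiest_day = daily_reviews.idxmax()

print(total_reviews)
print(busiest_day.date())
```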
INTEGRATED DEVELOPMENT ENVIRONMENT
• An integrated development environment is a software application that provides
comprehensive facilities to computer programmers for software development. An
IDE normally consists of a source code editor, build automation tools, and a
debugger.
• Spyder is the Scientific Python Development Environment:
• A powerful interactive development environment for the Python language with advanced
editing, interactive testing, debugging and introspection features.
• It is also a numerical computing environment, thanks to its support for IPython (an
enhanced interactive Python interpreter) and popular Python libraries such as NumPy (linear
algebra), SciPy (signal and image processing) and matplotlib (interactive 2D/3D plotting).
PROBLEM STATEMENT
• Data in the real world are rarely clean and homogeneous. Values can go missing during
data extraction or collection. Missing values need to be handled because they degrade
our performance metrics. They can also lead to wrong predictions or classifications
and can introduce a high bias into the model being used.
• Missing data are marked differently depending on the data source. Pandas represents
missing values as NaN, but unless the data has been pre-processed, an analyst will not
necessarily encounter them as NaN: a missing value can appear as a question mark (?),
a zero (0), minus one (-1) or a blank. It is therefore important that a data scientist
performs exploratory data analysis (EDA) before writing any machine learning
algorithm. EDA is simply a litmus test for understanding the behaviour of our data.
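The placeholder problem described above can be sketched as follows. This is only an illustration on a tiny inline CSV whose columns and values are invented; it assumes that "?" and -1 are the placeholders used for missing data, which would have to be confirmed by EDA on a real dataset.

```python
import io

import numpy as np
import pandas as pd

# Tiny made-up CSV: "?" marks a missing Size, -1 and a blank mark missing Reviews.
csv_text = """App,Reviews,Size
A,120,?
B,-1,4.2M
C,,1.4M
"""

# na_values lists extra strings that read_csv should treat as missing (NaN).
# Blank cells are already read as NaN by default.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# Numeric sentinels such as -1 must be replaced explicitly, and only in columns
# where -1 cannot be a legitimate value.
df["Reviews"] = df["Reviews"].replace(-1, np.nan)

missing = df.isnull().sum()
print(missing)
```

Once every placeholder has been converted to NaN, the usual pandas tools for missing data apply uniformly.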
SOLUTION
• There are several options for handling missing values, each with its own pros and cons. The
choice of what should be done depends largely on the nature of our data and of the missing
values. Below is a summary of the options we have for handling missing values.
1) DROP MISSING VALUES
2) FILL MISSING VALUES WITH A SUMMARY STATISTIC (E.G. MEAN OR MEDIAN)
3) PREDICT MISSING VALUES WITH A MACHINE LEARNING ALGORITHM
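The three options listed above could look roughly like this on a toy table. The data are invented, and the straight-line fit with NumPy is only a simple stand-in for "a machine learning algorithm"; it is an assumption made for illustration, not the method used in this presentation.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one Reviews value is missing.
df = pd.DataFrame({
    "Installs": [100.0, 200.0, 300.0, 400.0, 500.0],
    "Reviews": [10.0, 20.0, np.nan, 40.0, 50.0],
})

# 1) Drop rows whose Reviews value is missing.
dropped = df.dropna(subset=["Reviews"])

# 2) Fill the missing value with a summary statistic (here the median).
filled = df["Reviews"].fillna(df["Reviews"].median())

# 3) Predict the missing value from another column: fit a straight line
#    Reviews ~ Installs on the complete rows, then use it where Reviews is NaN.
known = df.dropna(subset=["Reviews"])
slope, intercept = np.polyfit(known["Installs"], known["Reviews"], deg=1)
predicted = df["Reviews"].fillna(slope * df["Installs"] + intercept)

print(len(dropped))
print(filled.tolist())
print(predicted.round(2).tolist())
```

Dropping loses rows, filling keeps every row but flattens variation, and prediction keeps more structure at the cost of extra modelling assumptions.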
• Below are a few commands for detecting missing values during EDA
1. data_name.describe()
2. data_name.info()
3. data_name.head(x)
4. data_name.isnull().sum()
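A short sketch of these four commands on a small made-up DataFrame is shown below. The name data_name follows the list above, while the columns and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table with two missing Reviews values.
data_name = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Reviews": [120.0, np.nan, 35.0, np.nan],
})

# Summary statistics of numeric columns; "count" excludes missing values.
print(data_name.describe())

# Column dtypes and non-null counts (info() prints directly).
data_name.info()

# First x rows, here x = 2.
print(data_name.head(2))

# Number of missing values per column.
missing_per_column = data_name.isnull().sum()
print(missing_per_column)
```

A "count" in describe() that is lower than the number of rows is usually the first hint that a column has missing values.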
CODE
# Import pandas for DataFrame operations
import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escapes
df = pd.read_csv(r"D:\Data_Sets\Google-Playstore-32K(1).csv")

print("The First 3 Rows of the table are shown as below")
print(df.head(3))
print("Dimension of the acquired Data Frame:", df.shape)
print("Description of the only numeric Column is given below\n", df.describe())
print("Mean for Reviews Column(with missing values):", df.Reviews.mean())
print("Median for Reviews Column(with missing values):", df.Reviews.median())
print("Total No. of Reviews:", df.Reviews.count())

# Check the number of missing values in each column
print("The Number of missing values in each column\n", df.isnull().sum())

# User-defined function to fill missing numeric values with the column median
def fill_with_median(series):
    return series.fillna(series.median())

# Fill the empty cells of the Reviews column
df["Reviews"] = fill_with_median(df["Reviews"])

# Check the missing values again: the Reviews column should have none left
print("The Number of missing values in each column after transforming\n", df.isnull().sum())
print("Mean for Reviews Column(without missing values):", df.Reviews.mean())
print("Median for Reviews Column(without missing values):", df.Reviews.median())
OUTPUT
The First 3 Rows of the table are shown as below
App Name | Category | Rating | Reviews | Installs | Size | Price | Content Rating | Last Updated | Minimum Version | Latest Version
DoorDash – Food Delivery | FOOD_AND_DRINK | 4.548561573 | 305034.0 | 5,000,000+ | Varies with device | 0 | Everyone | 29-Mar-19 | Varies with device | Varies with device
TripAdvisor Hotels... | TRAVEL_AND_LOCAL | 4.400671482 | 1207922.0 | 100,000,000+ | Varies with device | 0 | Everyone | 29-Mar-19 | Varies with device | Varies with device
Peapod | SHOPPING | 3.656329393 | 1967.0 | 100,000+ | 1.4M | 0 | Everyone | 20-Sep-18 | 5.0 and up | 2.2.0
Dimension of the acquired Data Frame: (32000, 11)
OUTPUT
• Description of the only numeric Column is given below
• Reviews
• count 3.199300e+04
• mean 9.850928e+04
• std 1.173820e+06
• min 1.000000e+00
• 25% 1.390000e+02
• 50% 1.464000e+03
• 75% 1.445100e+04
• max 8.621429e+07
• Mean for Reviews Column(with missing values): 98509.28305747504
• Median for Reviews Column(with missing values): 1464.0
• Total No. of Reviews: 31993
OUTPUT
• The Number of missing values in each column
• App Name 0
• Category 0
• Rating 0
• Reviews 7
• Installs 0
• Size 0
• Price 0
• Content Rating 0
• Last Updated 0
• Minimum Version 0
• Latest Version 0
• dtype: int64
OUTPUT
• The Number of missing values in each column after transforming
• App Name 0
• Category 0
• Rating 0
• Reviews 0
• Installs 0
• Size 0
• Price 0
• Content Rating 0
• Last Updated 0
• Minimum Version 0
• Latest Version 1
• dtype: int64
OUTPUT
• Mean for Reviews Column(without missing values): 98488.05440180622
• Median for Reviews Column(without missing values): 1464.0
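The change in the mean and the unchanged median can be checked by hand. The sketch below only reuses the figures printed in the output above (mean, count, median and the 7 missing values); it does not need the dataset itself.

```python
# Figures copied from the OUTPUT slides above.
mean_before = 98509.28305747504   # mean of the 31993 non-missing Reviews
count_before = 31993              # non-missing Reviews
median = 1464.0                   # median used to fill the gaps
n_missing = 7                     # missing Reviews that were filled

# Filling adds n_missing values equal to the median to the column total.
total_after = mean_before * count_before + n_missing * median
mean_after = total_after / (count_before + n_missing)

print(mean_after)
```

The result is approximately 98488.05, which agrees with the mean reported after filling. The mean drops slightly because the seven filled values (1464 each) are far below the old mean, while the median stays at 1464.0 because the inserted values are equal to the median itself.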
SOURCE OF DATASET
• All the data in this dataset comes from Kaggle.com, which is to be acknowledged for
the data collection. The dataset aims to provide a more convenient format for data
scientists, along with some enhanced context, in a single place.
CONCLUSION
• As the few lines of code above show, using pandas to fill empty cells is quite
simple. This is where the library truly shines: it makes it easy to manipulate
data and get the required insights.
THANK YOU
