UNIT-V INTRODUCTION TO NUMPY, PANDAS, MATPLOTLIB
Exploratory Data Analysis (EDA), Data Science life cycle, Descriptive Statistics, Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA. Data Visualization: Scatter plot, bar chart, histogram, boxplot, heat maps, etc.
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
1. 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
PREPARED BY
Mr. P. NANDAKUMAR
ASSISTANT PROFESSOR,
DEPARTMENT OF INFORMATION TECHNOLOGY,
SVCET.
3. EXPLORATORY DATA ANALYSIS (EDA)
Exploratory Data Analysis (EDA) is an approach that is used to
analyze the data and discover trends, patterns, or check assumptions in
data with the help of statistical summaries and graphical
representations.
Types of EDA
Depending on the number of columns we are analyzing, we can divide
EDA into three types.
1. Univariate Analysis
2. Bivariate Analysis
3. Multivariate Analysis
4. EXPLORATORY DATA ANALYSIS (EDA)
1. Univariate Analysis – In univariate analysis, we analyze only one
variable at a time. It is the simplest form of analysis because the data
involves only one quantity that changes. It does not deal with causes or
relationships; its main purpose is to describe the data and find patterns
that exist within it.
2. Bivariate Analysis – Bivariate analysis involves two different variables.
It deals with causes and relationships, and its aim is to find out how the
two variables are related.
3. Multivariate Analysis – When the data involves three or more variables,
the analysis is categorized as multivariate.
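The three types above can be sketched with pandas. This is only an illustration: the column names and values are hypothetical and do not come from the course material.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "hours_slept": [8, 7, 7, 6, 5],
    "exam_score": [52, 58, 65, 70, 78],
})

# Univariate: describe a single variable on its own.
univariate_summary = df["exam_score"].describe()

# Bivariate: measure the relationship between two variables.
bivariate_corr = df["hours_studied"].corr(df["exam_score"])

# Multivariate: look at all pairwise relationships at once.
multivariate_corr = df.corr()

print(univariate_summary)
print("Correlation (hours_studied, exam_score) =", round(bivariate_corr, 3))
print(multivariate_corr)
```

A correlation matrix is only one possible multivariate summary; it is used here because it extends the bivariate case directly.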
5. EXPLORATORY DATA ANALYSIS (EDA)
Depending on the type of analysis, we can also subcategorize EDA into
two parts.
1. Non-graphical Analysis – In non-graphical analysis, we analyze
data using statistical measures such as mean, median, mode, or skewness.
2. Graphical Analysis – In graphical analysis, we use charts and other
visualizations to reveal trends and patterns in the data.
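As a rough sketch of the two parts, the snippet below computes a few summary statistics (non-graphical) and then draws a histogram (graphical) for the same data. The sample values are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data with one large value on the right.
values = pd.Series([2, 3, 3, 4, 4, 4, 5, 5, 9])

# Non-graphical analysis: summary statistics.
mean_value = values.mean()
median_value = values.median()
mode_value = values.mode().iloc[0]
skewness = values.skew()  # positive when the right tail is longer
print("Mean =", mean_value, "Median =", median_value,
      "Mode =", mode_value, "Skewness =", skewness)

# Graphical analysis: a histogram of the same data.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=5)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram (hypothetical data)")
```

In an interactive session, calling plt.show() after the last line (without the "Agg" backend) would display the chart.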
6. DATA SCIENCE LIFECYCLE
The data science lifecycle revolves around the use of machine learning
and other analytical methods to produce insights and predictions
from data in order to achieve a business objective.
The complete process includes a number of steps such as data cleaning,
preparation, modelling, and model evaluation. It is a lengthy process
and may take several months to complete.
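The steps named above can be sketched on a toy scale with NumPy. Everything here is an assumption made for illustration: the data is invented, and a straight-line fit stands in for whatever model a real project would use.

```python
import numpy as np

# Hypothetical raw data: x has one missing value (NaN).
x_raw = np.array([1.0, 2.0, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0])
y_raw = np.array([2.1, 3.9, 6.2, 8.1, 9.9, 12.2, 13.8, 16.1])

# 1. Data cleaning: drop the rows where x is missing.
mask = ~np.isnan(x_raw)
x, y = x_raw[mask], y_raw[mask]

# 2. Preparation: hold out the last two points for evaluation.
x_train, y_train = x[:-2], y[:-2]
x_test, y_test = x[-2:], y[-2:]

# 3. Modelling: fit a straight line y = slope * x + intercept.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# 4. Model evaluation: mean squared error on the held-out points.
predictions = slope * x_test + intercept
mse = np.mean((y_test - predictions) ** 2)

print("slope =", slope, "intercept =", intercept, "MSE =", mse)
```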
7. DATA SCIENCE LIFECYCLE
The following are some of the main reasons for using data science
technology:
It helps to convert large amounts of raw and unstructured data
into meaningful insights.
It can assist in making predictions, for example for surveys and elections.
It also helps in automating transportation, such as developing self-driving
cars, which may be the future of transportation.
Companies are shifting towards data science and adopting this
technology. Amazon, Netflix, and others that deal with large amounts of
data use data science algorithms to provide a better customer
experience.
9. DESCRIPTIVE STATISTICS
In descriptive statistics, we describe our data with the help of various
representative methods such as charts, graphs, tables, and Excel files.
We summarize the data and present it in a meaningful way so that it can be
easily understood.
It is most often performed on small data sets, and the analysis can help us
predict future trends based on the current findings.
Types of descriptive statistics:
Measure of central tendency
Measure of variability
11. DESCRIPTIVE STATISTICS
Measure of central tendency:
It represents the whole set of data by a single value. It gives us the location
of the central point of the data. There are three main measures of central tendency:
1. Mean
2. Mode
3. Median
12. DESCRIPTIVE STATISTICS
Mean:
It is the sum of the observations divided by the total number of observations.
It is also known as the average.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where n = number of terms and x_i = the i-th observation
Python code to find the mean:
import numpy as np
# Sample Data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print("Mean = ", mean)
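As a quick check of the definition, the mean can also be computed by hand as the sum divided by the count and compared with NumPy's result. The variable names below are illustrative only.

```python
import numpy as np

arr = [5, 6, 11]

# Mean by definition: sum of the observations divided by their count.
n = len(arr)
manual_mean = sum(arr) / n

# The same value from NumPy.
numpy_mean = np.mean(arr)

print("Manual mean = ", manual_mean)
print("NumPy mean  = ", numpy_mean)
```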
13. DESCRIPTIVE STATISTICS
Mode:
It is the value that has the highest frequency in the given data set. The data set
may have no mode if the frequency of all data points is the same. Also, we
can have more than one mode if two or more data points share the same
highest frequency.
Python code to find the mode:
from scipy import stats
# Sample data
arr = [1, 2, 2, 3]
# Mode: stats.mode returns a result object holding the mode and its count
result = stats.mode(arr)
print("Mode = ", result.mode)
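The "no mode" and "more than one mode" cases described above can be sketched with the standard library. The helper find_modes is a hypothetical name introduced here, and returning an empty list when all values tie is a convention chosen to match the text, not a library behaviour.

```python
from collections import Counter
import statistics


def find_modes(data):
    """Return every value with the highest frequency.

    Returns [] when the data is empty or when all distinct values
    occur equally often (treated here as "no mode").
    """
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    modes = [value for value, count in counts.items() if count == highest]
    if len(counts) > 1 and len(modes) == len(counts):
        return []
    return modes


print(find_modes([1, 2, 2, 3]))      # one mode
print(find_modes([1, 1, 2, 2, 3]))   # two modes
print(find_modes([1, 2, 3]))         # all frequencies equal: no mode

# statistics.multimode (Python 3.8+) also returns all tied values,
# but it reports every value when all frequencies are equal.
print(statistics.multimode([1, 1, 2, 2, 3]))
```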
14. DESCRIPTIVE STATISTICS
Median:
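It is the middle value of the sorted data set. It splits the data into two halves.
If the number of elements in the data set is odd, the center element is the
median; if it is even, the median is the average of the two central
elements.
$$\text{Median} = \begin{cases} x_{(n+1)/2}, & n \text{ odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & n \text{ even} \end{cases}$$
where n = number of terms and x_1 \le x_2 \le \dots \le x_n are the terms in ascending order
Python code to find the median: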
import numpy as np
# sample Data
arr =[1, 2, 3, 4]
# Median
median = np.median(arr)
print("Median = ", median)
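The odd and even cases can be written out directly. The function name median_by_position is hypothetical; it is a small sketch of the rule rather than a replacement for np.median.

```python
def median_by_position(data):
    """Median of a non-empty sequence using the odd/even rule."""
    if not data:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(data)          # the rule applies to sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single center element.
        return ordered[mid]
    # Even count: average of the two central elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


print("Median (odd)  = ", median_by_position([3, 1, 2]))
print("Median (even) = ", median_by_position([1, 2, 3, 4]))
```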
15. DESCRIPTIVE STATISTICS
Measure of variability:
A measure of variability describes the spread of the data, that is, how widely
the data is distributed. The most common measures of variability are:
1. Range
2. Variance
3. Standard deviation
16. DESCRIPTIVE STATISTICS
Range:
The range describes the difference between the largest and smallest data point
in our data set. The bigger the range, the greater the spread of the data, and
vice versa.
Range = Largest data value – smallest data value
Python code to find the range:
# Sample data
arr = [1, 2, 3, 4, 5]
# Find the maximum
maximum = max(arr)
# Find the minimum
minimum = min(arr)
# Range is the difference between the maximum and the minimum
data_range = maximum - minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(maximum, minimum, data_range))
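NumPy also provides np.ptp ("peak to peak"), which returns the same difference in one call. The second array below is hypothetical and only shows how a single extreme value widens the range.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
data_range = np.ptp(arr)        # max - min

# Hypothetical data with one extreme value: the range grows sharply.
wide = np.array([1, 2, 3, 4, 50])
wide_range = np.ptp(wide)

print("Range = ", data_range)
print("Range with an extreme value = ", wide_range)
```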
17. DESCRIPTIVE STATISTICS
Variance:
It is defined as the average squared deviation from the mean. It is
calculated by finding the difference between every data point and the mean,
squaring the differences, adding them all up, and then
dividing by the number of data points in our data set.
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
where N = number of terms
μ = mean
Python code to find the variance:
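import statistics
# Sample data
arr = [1, 2, 3, 4, 5]
# Population variance: divides by N, as in the definition above
# (statistics.variance would divide by N - 1, the sample variance)
print("Var = ", statistics.pvariance(arr))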
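The definition above divides by the number of data points, which gives the population variance. Python also offers the sample variance, which divides by N - 1. The sketch below shows both side by side using the statistics module and NumPy's ddof argument.

```python
import statistics
import numpy as np

arr = [1, 2, 3, 4, 5]

# statistics module: two separate functions.
population_var = statistics.pvariance(arr)   # divides by N
sample_var = statistics.variance(arr)        # divides by N - 1

# NumPy: one function, the divisor is N - ddof.
np_population_var = np.var(arr)              # ddof=0 by default
np_sample_var = np.var(arr, ddof=1)

print("Population variance = ", population_var, np_population_var)
print("Sample variance     = ", sample_var, np_sample_var)
```

Which one to use depends on whether the data is the whole population or a sample drawn from it.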
18. DESCRIPTIVE STATISTICS
Standard Deviation:
It is defined as the square root of the variance. It is calculated by finding
the mean, subtracting the mean from each data point and squaring the result,
adding all the squared values, dividing by the number of terms, and finally
taking the square root.
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
where N = number of terms
μ = mean
Python code to find the standard deviation:
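import statistics
# Sample data
arr = [1, 2, 3, 4, 5]
# Population standard deviation: divides by N, as in the definition above
# (statistics.stdev would divide by N - 1, the sample standard deviation)
print("Std = ", statistics.pstdev(arr))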
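The calculation steps described above can be followed one at a time in plain Python. This sketch divides by the number of terms (population form) and cross-checks the result against statistics.pstdev.

```python
import math
import statistics

arr = [1, 2, 3, 4, 5]

# Step 1: find the mean.
mean = sum(arr) / len(arr)

# Step 2: subtract the mean from each value and square the result.
squared_deviations = [(x - mean) ** 2 for x in arr]

# Step 3: add the squared deviations and divide by the number of terms.
variance = sum(squared_deviations) / len(arr)

# Step 4: take the square root.
std_dev = math.sqrt(variance)

print("Std (step by step) = ", std_dev)
print("Std (statistics.pstdev) = ", statistics.pstdev(arr))
```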
19. BASIC TOOLS OF EDA
TYPES OF EXPLORATORY DATA ANALYSIS:
1. Univariate Non-graphical - This is the simplest form of data analysis, as
we use just one variable to study the data. The standard
goal of univariate non-graphical EDA is to understand the underlying sample
distribution of the data and make observations about the population. Outlier
detection is also part of the analysis.
2. Multivariate Non-graphical - Multivariate non-graphical EDA techniques
are generally used to show the relationship between two or more variables
in the form of either cross-tabulation or statistics.
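A cross-tabulation of two categorical variables can be sketched with pandas.crosstab. The column names and values below are hypothetical and serve only to show the shape of the result.

```python
import pandas as pd

# Hypothetical data: one row per student.
df = pd.DataFrame({
    "department": ["IT", "IT", "CSE", "CSE", "IT", "CSE"],
    "passed": ["yes", "no", "yes", "yes", "yes", "no"],
})

# Rows are departments, columns are outcomes, cells are counts.
table = pd.crosstab(df["department"], df["passed"])
print(table)
```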
20. BASIC TOOLS OF EDA
TYPES OF EXPLORATORY DATA ANALYSIS:
3. Univariate graphical - Non-graphical methods are quantitative and
objective, but they cannot give a complete picture of the data.
Graphical methods, which involve a degree of subjective analysis,
are therefore also required. Common types of univariate
graphics are:
Histogram
Stem-and-leaf plots
Boxplots
Quantile-normal plots
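The numbers behind two of these graphics can be computed without drawing anything: np.histogram gives the bar heights of a histogram, and the quartiles give the box of a boxplot. The data is hypothetical, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here, not something stated in the slides.

```python
import numpy as np

# Hypothetical data with one unusually large value.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])

# Histogram: counts per bin and the bin edges.
counts, bin_edges = np.histogram(data, bins=4)

# Boxplot ingredients: quartiles and the interquartile range (IQR).
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Histogram counts =", counts)
print("Q1 =", q1, "Median =", median, "Q3 =", q3)
print("Outliers =", outliers)
```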
21. BASIC TOOLS OF EDA
TYPES OF EXPLORATORY DATA ANALYSIS:
4. Multivariate graphical - Multivariate graphical EDA uses graphics to
display relationships between two or more sets of data. The most
commonly used one is a grouped bar plot, with each group
representing one level of one of the variables and each bar within a group
representing the amount of the other variable.
Other common types of multivariate graphics are:
Scatterplot
Run chart
Heat map
Multivariate chart
Bubble chart
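A grouped bar plot of the kind described above can be sketched with Matplotlib. The departments and counts are hypothetical, and the "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: each group is one level of the first variable.
groups = ["IT", "CSE", "ECE"]
pass_counts = [40, 55, 35]
fail_counts = [10, 5, 15]

positions = np.arange(len(groups))
bar_width = 0.35

fig, ax = plt.subplots()
# Each bar within a group shows the amount of the other variable.
pass_bars = ax.bar(positions - bar_width / 2, pass_counts, bar_width, label="Pass")
fail_bars = ax.bar(positions + bar_width / 2, fail_counts, bar_width, label="Fail")

ax.set_xticks(positions)
ax.set_xticklabels(groups)
ax.set_xlabel("Department")
ax.set_ylabel("Number of students")
ax.set_title("Grouped bar plot (hypothetical data)")
ax.legend()
```

In an interactive session, calling plt.show() (without the "Agg" backend) would display the chart.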
22. BASIC TOOLS OF EDA
TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:
R: An open-source programming language and free software environment
for statistical computing and graphics, supported by the R Foundation for
Statistical Computing.
Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined with
dynamic binding, make it very attractive for rapid application development,
as well as for use as a scripting or glue language to connect existing
components together.