Linear Regression
PREPARED BY:PROF. SHAKUNTLA RAVANI
• Linear regression is a statistical algorithm used to model the relationship between two or more variables using a linear equation.
Linear Regression
• Example: annual sales
• Dependent variable: annual sales (represented on the Y axis of the graph)
• Independent variables: the factors that affect the annual sales, like number of employees, profit, loss, etc. (represented on the X axis of the graph)
• Usage:
• Regression is useful to establish a relation between independent and dependent variables
• Determine the strength of predictors (how strongly the independent and dependent variables are related); Example: sales -> marketing, age -> income
• Trend analysis; Example: what will the price of Bitcoin be in the next 6 months?
Linear Regression
Positive Linear Regression: the line is increasing (positive slope), y = mx + c
Negative Linear Regression: the line is decreasing (negative slope), y = -mx + c
The line of linear regression shows the relation between the dependent variable and the independent variable.
Linear Regression
After adding some data points:
1. Generate the regression line / best-fit line
2. Reduce error: reduce the distance between the estimated/predicted values and the actual values
Linear Regression
Example: speed and distance (distance covered in a fixed time)
Equation of the line: y = mx + c
y is the distance travelled in a fixed duration of time
x is the speed of the vehicle
m is the positive slope
c is the y-intercept of the line
Types of Linear Regression
1. Simple Linear Regression: only 1 dependent and 1 independent variable
Equation: y = mx + c
2. Multiple Linear Regression: the dependent variable depends on more than one independent factor, like x1, x2, x3, ..., xn
y = m1x1 + m2x2 + m3x3 + m4x4 + ... + mnxn + c
Example: y = 1.2x1 + 2x2 + 4x3 + 1x4 + 0.9
Here c = 0.9, m1 = 1.2, m2 = 2, m3 = 4 and m4 = 1.
The impact of m3 and its variable x3 on the dependent variable is the highest (see the small evaluation sketch below).
Example: result (pass, fail) with x1 = enrolment number and x2 = percentage; here the impact of percentage on the result is high.
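To make the coefficient-impact point concrete, here is a tiny illustrative sketch (the feature values are hypothetical, not from the slides):
coeffs = [1.2, 2, 4, 1]                  # m1..m4 from the example equation
c = 0.9                                  # intercept
x = [1, 1, 1, 1]                         # hypothetical feature values x1..x4
y = sum(m * xi for m, xi in zip(coeffs, x)) + c
print(y)                                 # 9.1; changing x3 by one unit moves y by 4, the largest impact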
Linear Regression: Mathematical Implementation
• 1. Find the equation of the line from the x and y values.
• For that, calculate the mean of x and the mean of y:
mean(x) = (1+2+3+4+5)/5 = 3
mean(y) = (3+4+2+4+5)/5 = 18/5 = 3.6
Plot the mean point (3, 3.6). The slope m and intercept c used on the next slide are obtained as sketched below.
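The slides use m = 0.4 and c = 2.4 in the next step without showing the calculation; a minimal sketch of the standard least-squares formulas (an assumption, since the slides do not state the method) applied to these points:
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 4, 5]
x_mean = sum(x) / len(x)                 # 3.0
y_mean = sum(y) / len(y)                 # 3.6
m = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum((xi - x_mean) ** 2 for xi in x)
c = y_mean - m * x_mean
print(m, c)                              # 0.4 2.4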
Using the regression equation, calculate the value of y from the value of x
m = 0.4
c = 2.4
y = 0.4x + 2.4
For the given m = 0.4 and c = 2.4, let us predict the value of y for x = [1, 2, 3, 4, 5]:
x=1, y=0.4*1 + 2.4 = 2.8
x=2, y=0.4*2 + 2.4 = 3.2
x=3, y=0.4*3 + 2.4 = 3.6
x=4, y=0.4*4 + 2.4 = 4.0
x=5, y=0.4*5 + 2.4 = 4.4
Calculate the distance between actual values and predicted values
Reduce the error (distance) between the actual points and the predicted points. For that, we try different values of m to get closer to the actual points.
R-square method: how close the data are to the fitted regression line
The value of R-square lies between 0 and 1; a value close to 1 means the predictions are very close to the actual points.
y is the actual value, yp is the predicted value, ybar is the mean value. (The formula and a worked value for this example follow below.)
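In terms of the definitions above (y actual, yp predicted, ybar mean), the standard R-square formula is:
R^2 = 1 - Σ(y - yp)^2 / Σ(y - ybar)^2
For the running example (actual y = [3, 4, 2, 4, 5], predicted yp = [2.8, 3.2, 3.6, 4.0, 4.4], ybar = 3.6), Σ(y - yp)^2 = 3.6 and Σ(y - ybar)^2 = 5.2, so R^2 = 1 - 3.6/5.2 ≈ 0.31. This is also the value object.score(x, y) returns for this data.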
Implementation using Python
• Package: sklearn
• Module: linear_model
• Class: LinearRegression
• object= sklearn.linear_model.LinearRegression()
• Attributes of the LinearRegression class:
1. object.coef_ : returns the value of the slope m
2. object.intercept_ : returns the value of the constant c
• Methods:
1. object.fit(N-dimensional array X, array Y)
Here the x and y points are passed to perform the linear regression.
2. ypre = object.predict(X) : returns new values of y based on the linear regression equation y = mx + c, by applying the values of m, x and c
3. object.score(X, Y) : returns the R-square score, which measures how close the predicted points are to the actual points
Implementation using Python
x=[1,2,3,4,5]
y=[3,4,2,4,5]
import pandas as pd
d={'X':x,'Y':y}
df=pd.DataFrame(d)
print(df)
from sklearn.linear_model import LinearRegression
X = df['X'].values.reshape(-1,1)  # reshape to a 2-D column array; -1 lets numpy infer the number of rows
Y = df['Y'].values.reshape(-1,1)
h=LinearRegression()
h.fit(X,Y)
print(h.coef_)        # value of m
print(h.intercept_)   # value of c
ypre=h.predict(X)
print(ypre)
print(h.score(X,Y))   # R-square score of predicted vs actual values
Linear Regression on the Boston dataset
# note: load_boston and the normalize parameter were removed in scikit-learn 1.2; this code needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print(X.shape, y.shape)
from sklearn.linear_model import LinearRegression
hypothesis = LinearRegression(normalize=True)
hypothesis.fit(X,y)
print(hypothesis.coef_)
print(hypothesis.intercept_)
print(hypothesis.score(X,y))
We print the keys of the dataset object to understand what it contains: print(boston.keys()) gives
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
• data: contains the information for the various houses
• target: prices of the houses
• feature_names: names of the features
• DESCR: describes the dataset
To know more about the features, use boston.DESCR. The description of all the features is given below:
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk — 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in $1000s
The house price, indicated by the variable MEDV, is our target variable; the remaining columns are the feature variables from which we will predict the value of a house.
We will now load the data into a pandas dataframe using pd.DataFrame and then print the first 5 rows of the data using head(), as in the sketch below.
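A minimal sketch of that loading step, assuming the boston object from the code above (load_boston is only available in older scikit-learn versions):
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
# build a dataframe of the features and add the target (MEDV) as a column
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target
print(boston_df.head())   # first 5 rows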
Multiprocessing
1. sklearn.datasets.load_digits() : loads the digits dataset from sklearn
2. sklearn.model_selection.cross_val_score(regression-object, X (independent), y (dependent), cv=number, n_jobs=1 or -1)
• cross_val_score returns the R-square score for each of the folds specified by cv (cross-validation)
• cv = number of folds (groups of data), for example cv=5
• n_jobs=1 : single processor, n_jobs=-1 : all cores (multicore)
Execute code on multiple cores instead of a single core: Multiprocessing
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
digits = load_digits()
X, y = digits.data,digits.target
print(X.shape)
print(y.shape)
l=LinearRegression()
single_core_learning = cross_val_score(l, X, y, cv=20, n_jobs=1)
#print(single_core_learning)
OUTPUT
(1797, 64)
(1797,)
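For comparison, a hedged sketch of the multicore variant described above, reusing l, X and y from the code; n_jobs=-1 spreads the folds across all available cores (the actual speed-up depends on the machine):
# multicore cross-validation: the 20 folds are distributed across all cores
multi_core_learning = cross_val_score(l, X, y, cv=20, n_jobs=-1)
print(multi_core_learning.mean())   # average R-square across the folds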
Need for hashing / feature hashing
An index is used to access an element: Array[3] gives the value stored at index 3.
Hashing is useful when we do not want to worry about the number of elements in advance.
Need for hashing / feature hashing
The value returned by the hash function can be used as an index to store an element/object. A mod function limits the returned value to a fixed range of indexes, as in the sketch below.
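A tiny illustrative sketch of that idea (the bucket count and the word are hypothetical):
# hash() gives a large integer; mod (%) folds it into a fixed index range
bucket_count = 10
word = 'python'
index = abs(hash(word)) % bucket_count   # always in 0..9
print(index)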
Hash Function
• The hash(object) function returns the hash value of an object if it has one. Hash values are just
integers that are used to compare dictionary keys during a dictionary lookup quickly.
• Example:
print (hash('Python'))
print (abs(hash('Python')) % 1000)
print(hash(1))
print(hash(True))
a='hello'
print(hash(a))
OUTPUT:
567715945433905441
441
1
1
6747001990439529300
(Note: hash values for strings vary between Python runs because of hash randomization; the integers shown come from one particular run. hash(1) and hash(True) are both 1 because True equals 1.)
Need for hashing / feature hashing
The value returned by the hash function can be used as an index to store an element/object. If we want to generate indexes only up to a specific number, we apply a mod function.
Hashing
Feature hashing is a powerful technique to reduce the number of features.
Feature encoding of the sentence "python for data science"
• Assign an arbitrary number to each word, for instance python=0, for=1, data=2, science=3
• Add one more sentence: python for machine learning
Feature Encoding (word -> index):
python -> 0, for -> 1, data -> 2, science -> 3
Bag of words (binary vector) with columns [python, for, data, science]:
"python for data science" -> 1 1 1 1
Feature Encoding after adding the new sentence (word -> index):
python -> 0, for -> 1, data -> 2, science -> 3, machine -> 4, learning -> 5
Bag of words with columns [python, for, data, science, machine, learning]:
"python for data science"     -> 1 1 1 1 0 0
"python for machine learning" -> 1 1 0 0 1 1
Feature encoding using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
bowm=bow.fit_transform(['python for data science','python for machine learning']).toarray()
print(bow.vocabulary_)
OUTPUT:
{'python': 4, 'for': 1, 'data': 0, 'science': 5, 'machine': 3, 'learning': 2}
Resulting count matrix (columns in vocabulary-index order: data, for, learning, machine, python, science):
"python for data science"     -> 1 1 0 0 1 1
"python for machine learning" -> 0 1 1 1 1 0
Feature encoding using CountVectorizer: transforming new text
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
bowm=bow.fit_transform(['python for data science','python for machine learning']).toarray()
bowm1=bow.transform(['new text has arrived']).toarray()
print(bow.vocabulary_)
OUTPUT:
{'python': 4, 'for': 1, 'data': 0, 'science': 5, 'machine': 3, 'learning': 2}
"new text has arrived" -> 0 0 0 0 0 0 (none of its words are in the fitted vocabulary)
Limitations:
1. Requires more storage if the document is large (one index is occupied per feature/word).
2. It optimally encodes text in a matrix, but once the matrix is generated, new text cannot be inserted (new words get no column).
Feature hashing using a user-defined function
string_1 = 'Python for data science'
string_2 = 'Python for machine learning'

def hashing_trick(input_string, vector_size=20):
    feature_vector = [0] * vector_size
    for word in input_string.split(' '):
        index = abs(hash(word)) % vector_size   # fold the hash into the range 0..vector_size-1
        print('word is:', word, 'original hash value', hash(word), 'index is:', index)
        feature_vector[index] = 1
    return feature_vector

print(hashing_trick(string_1))
print(hashing_trick(string_2))
Feature hashing using a user-defined function (output)
OUTPUT:
word is: Python original hash value -5119337540373521090 index is: 10
word is: for original hash value -5304979803602257784 index is: 4
word is: data original hash value -250754315164222072 index is: 12
word is: science original hash value -4211303434629928500 index is: 0
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
word is: Python original hash value -5119337540373521090 index is: 10
word is: for original hash value -5304979803602257784 index is: 4
word is: machine original hash value 3675180801325909671 index is: 11
Remark: if the vector size is small, many words may collide at the same location in the list representing the feature vector. To keep collisions to a minimum, the mod value should be large enough.
HashingVectorizer
HashingVectorizer: a class that rapidly transforms any collection of text into a separate sparse data matrix using the hashing trick.
Package and module: sklearn.feature_extraction.text
bowh=HashingVectorizer(n_features=20,binary=True,norm=None)
• n_features: specifies the mod value, i.e. the maximum number of indexes generated. For example, 20 means indexes 0-19 are used to store words (features).
• binary=True : binary encoding; if a feature is present then 1, else 0
CountVectorizer vs HashingVectorizer:
1. CountVectorizer requires more storage if the document is large (one index occupied per feature/word); with HashingVectorizer we can fix the number of features via n_features, so there is no storage growth, but it can suffer from collisions.
2. CountVectorizer optimally encodes text in a matrix, but once the matrix is generated new text cannot be inserted; HashingVectorizer can encode new data after the initial encoding.
3. CountVectorizer takes more time to execute; HashingVectorizer takes less time in comparison.
HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
bowh = HashingVectorizer(n_features=20, binary=True, norm=None)
bowc = CountVectorizer()
texts = ['Python for data science','Python for machine learning']
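The slide stops after defining the vectorizers; a minimal sketch of how they might be applied to the texts above (assuming the bowh, bowc and texts objects just defined):
# hashed, binary 20-column representation of the two sentences
bowhm = bowh.fit_transform(texts).toarray()
print(bowhm.shape)        # (2, 20)
# ordinary count-based representation for comparison
bowcm = bowc.fit_transform(texts).toarray()
print(bowc.vocabulary_)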
Memory Profiler
Useful to find out the memory usage (space complexity) of a particular statement of a program.
Package: memory_profiler
• Commands to install (type in a Jupyter Notebook cell):
1. import sys
2. !{sys.executable} -m pip install memory_profiler
After installation completes, type the command below in the Jupyter Notebook:
3. %load_ext memory_profiler
To find out the memory requirement of the hashing matrix, use the magic function %memit as below:
%memit bowhm = bowh.fit_transform(texts).toarray()
Memory Profiler
1. Peak memory refers to the peak memory usage of your system (including the memory usage of other processes) during the program runtime.
2. Increment is the increase in memory usage relative to the memory usage just before the program is run (i.e. increment = peak memory - starting memory).
Exploratory Data Analysis (EDA)
• Exploratory Data Analysis is a general approach to exploring datasets by means of simple summary statistics and graphical visualizations in order to gain a deeper understanding of the data.
• Dataset: the iris (flower) dataset from sklearn
• You can load it in a Python program using:
from sklearn.datasets import load_iris
iris = load_iris()
(Figure: the three species - Iris setosa, Iris versicolor, Iris virginica)
Exploratory Data Analysis (EDA)
• The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, a linear discriminant model can distinguish the species from each other.
Python program to load the iris dataset and print its description and feature names
from sklearn.datasets import load_iris
iris = load_iris()
#print(iris)
print(iris.DESCR)           # prints the description of the iris dataset
print(iris.feature_names)   # prints the feature names used for prediction; their values are in iris.data
print(iris.data)            # prints the feature values - an n-dimensional array
print(iris.target)          # prints the target values (flower category): 0 (setosa), 1 (versicolor) or 2 (virginica)
Generate a dataframe from the iris dataset using the data and target attributes and print statistical analysis - version 1
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head())
df['group']=iris.target # adding flower type in form of 0,1,2
print(df.head())
print(df.describe())
Print the flower type as a string instead of the category code
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
print(iris.target)
for k in iris.target:
print(iris.target_names[k])
Generate a dataframe from the iris dataset using the data and target attributes and print statistical analysis - version 2 (treating the target as a flower category, i.e. data in the form of strings)
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=pd.Series([iris.target_names[k] for k in iris.target], dtype='category')
print(df.head())
df.describe()
Exploratory Data Analysis (EDA)
• Steps (for all features/columns containing numerical data), as in the sketch below:
• Measuring central tendency: find the mean() value
• Measuring variance: find the standard deviation
• Measuring range: find the difference between the maximum and minimum values
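A minimal sketch of these three steps on the iris dataframe built above (assuming df and iris as in the earlier code):
numeric = df[iris.feature_names]          # only the numerical feature columns
print(numeric.mean())                     # central tendency
print(numeric.std())                      # variance, as standard deviation
print(numeric.max() - numeric.min())      # range = max - min per column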
Plot of iris
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=pd.Series([iris.target_names[k] for k in iris.target],dtype='category')
print(df.quantile([0,0.25,0.50,0.75,1]))
import matplotlib.pyplot as plt
df.boxplot()
plt.show()
Skewness
1. Skewness defines the asymmetry of the data with respect to the mean.
2. If the skewness is negative, the left tail is too long and the mass of the observations is on the right side of the distribution.
3. If the skewness is positive, the right tail is too long and the mass of the observations is on the left side of the distribution.
Python program to find skewness
import numpy as np
x=np.concatenate((np.linspace(10,19,1),
np.linspace(20,29,3),np.linspace(30,39,9),
np.linspace(40,49,11),np.linspace(50,59,15)))
print(x.mean())
from scipy.stats import skew
s=skew(x)
print("skewness=",s)
import seaborn as sns
sns.distplot(x)
43.61538461538461
skewness= -0.8067819252577693
Negative skewness / left-skewed distribution
Skewness = 3 * (Mean - Median) / Standard Deviation
Kurtosis
1. Kurtosis shows whether the data distribution, especially the peak and the tails, is of the right shape.
2. If the kurtosis is above zero, the distribution has a marked peak. If it is below zero, the distribution is too flat instead. (See the sketch below.)
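A hedged sketch mirroring the skewness program above, using scipy's kurtosis function on the same x array:
from scipy.stats import kurtosis
# Fisher's definition (the default): a normal distribution has kurtosis 0,
# so a positive value means a marked peak and a negative value a flatter shape
k = kurtosis(x)
print("kurtosis=", k)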
Hypothesis Testing
• Hypothesis testing evaluates two mutually exclusive statements about a population using a sample of data.
• Example of mutually exclusive statements, in a court of law:
• 1. The defendant is innocent (null hypothesis, Ho)
• 2. The defendant is guilty (alternate hypothesis, Ha)
• Steps:
• 1. Make the initial assumption (null hypothesis Ho)
• 2. Collect data (like fingerprints, DNA samples etc.)
• 3. Gather evidence to reject or not reject the null hypothesis
Types of Hypothesis Testing
• 1-sample proportion test
• Chi-square test
• 1-sample t-test for 1 continuous variable
• t-test for 2 continuous variables
• ANOVA test for more than 2 continuous variables
Choice of test by number and type of variables:
1 categorical variable          -> 1-sample proportion test
2 categorical variables         -> chi-square test
1 continuous numerical variable -> 1-sample t-test
Sample data (Gender, Age group, Weight (kg), Height (m)):
M  Elderly  70  1.4
F  Adult    65  1.2
M  Adult    65  1.4
M  Child    20  1.0
Types of Hypothesis Testing: 1-sample proportion test
Sample data (Gender, Age group, Weight, Height):
M  Elderly  70  1.4
F  Adult    65  1.2
M  Adult    65  1.4
M  Child    20  1.0
F  Adult    75  1.3
M  Elderly  80  1.3
• Is there a difference in the proportion of males and females? (A code sketch follows below.)
• Ho: no difference
• H1 (Ha): there is a difference
• The test returns a p-value
• If the p-value > 0.05 then Ho is accepted, else Ha is accepted
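The slides do not show code for this test; one possible sketch using statsmodels (an assumption, the library is not named in the slides), with counts taken from the six-row sample above (4 males out of 6, tested against an expected proportion of 0.5):
from statsmodels.stats.proportion import proportions_ztest
# 4 males observed out of 6 samples; Ho: the true proportion of males is 0.5
stat, p_value = proportions_ztest(count=4, nobs=6, value=0.5)
print(p_value)   # > 0.05 here, so Ho is not rejected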
Types of Hypothesis Testing: chi-square test
(Uses the same six-row gender / age group / weight / height sample as above.)
• Is there a difference in the proportion of males and females across the age groups? (A code sketch follows below.)
• Ho: no difference
• H1 (Ha): there is a difference
• The test returns a p-value
• If the p-value > 0.05 then Ho is accepted, else Ha is accepted
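Similarly, a hedged sketch of the chi-square test using scipy.stats.chi2_contingency (not shown in the slides), building the contingency table from the same six-row sample:
import pandas as pd
from scipy.stats import chi2_contingency
sample = pd.DataFrame({
    'Gender':   ['M', 'F', 'M', 'M', 'F', 'M'],
    'AgeGroup': ['Elderly', 'Adult', 'Adult', 'Child', 'Adult', 'Elderly'],
})
table = pd.crosstab(sample['Gender'], sample['AgeGroup'])   # counts per gender and age group
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)   # > 0.05 here, so Ho is not rejected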
Types of Hypothesis Testing: 1-sample t-test, based on height
(Uses the same six-row gender / age group / weight / height sample as above.)
• Is there a mean difference between the sample height and the population height?
• Ho: no difference
• H1 (Ha): there is a difference
• The test returns a p-value
• If the p-value > 0.05 then Ho is accepted, else Ha is accepted
• Function: scipy.stats.ttest_1samp()
• Usage: t_stat, p_value = scipy.stats.ttest_1samp(sample_sequence, population_mean)
Python program for a 1-sample t-test (1 continuous variable)
ages=[10,20,35,50,28,40,55,18,16,55,30,25,43,18,30,28,14,24,16,17,32,35,26,27,65,18,43,23,21,20,19,70]
print(len(ages))
import numpy as np
ages_mean=np.mean(ages)
print(ages_mean)
## Lets take sample
sample_size=10
age_sample=np.random.choice(ages,sample_size)
print(age_sample)
from scipy.stats import ttest_1samp
ttest,p_value=ttest_1samp(age_sample,30)
print(p_value)
32
30.34375
[18 40 16 50 28 27 50 30 30 55]
0.33568006273592693
Since the p-value (0.336) is greater than 0.05, we accept the null hypothesis.
A re-run with a new random sample gave a p-value of 1.0.
t-test for 2 continuous variables: petal length of two flower groups
from scipy.stats import ttest_ind
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=pd.Series([iris.target_names[k] for k in iris.target],dtype='category')
group0 = df['group'] == 'setosa'
group1 = df['group'] == 'versicolor'
group2 = df['group'] == 'virginica'
t,pvalue=ttest_ind(df['petal length (cm)'][group0],df['petal length (cm)'][group1])
Since the p-value < 0.05, the null hypothesis is rejected: there is a mean difference in petal length between group 0 and group 1.
Function: scipy.stats.ttest_ind(var1, var2)
It returns the t statistic and the p-value.
ANOVA test: mean difference in petal length among more than two flower groups
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=pd.Series([iris.target_names[k] for k in iris.target],dtype='category')
group0 = df['group'] == 'setosa'
group1 = df['group'] == 'versicolor'
group2 = df['group'] == 'virginica'
f,pvalue=f_oneway(df['petal length (cm)'][group0],df['petal length (cm)'][group1],df['petal length (cm)'][group2])
Since the p-value < 0.05, the null hypothesis is rejected: there is a mean difference in petal length among group 0, group 1 and group 2.
Function: scipy.stats.f_oneway(group1, group2, group3, ...)
It returns the F statistic and the p-value.
Parallel Coordinates
• Parallel coordinates are a common way of visualizing and analyzing high-dimensional
datasets
• A point in n-dimensional space is represented as a polyline with vertices on the parallel
axes and the position of the vertex corresponds to the coordinate of the point.
• Function: pandas.plotting.parallel_coordinates(dataframe, label)
• Here label is one of the columns of the dataframe, used as the class label.
Parallel coordinates plot of the iris dataframe
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['label']=pd.Series([iris.target_names[k] for k in iris.target],dtype='category')
df['group']=iris.target
pl=parallel_coordinates(df,'label')
Density plot and histogram
• Function for a density plot:
• pandas.DataFrame.plot(kind='density')
• or
• pandas.DataFrame['column-name'].plot(kind='density')
• Function for a histogram of a specific column or of the whole dataframe:
• pandas.DataFrame.plot(kind='hist')
Density plot and histogram of the iris dataframe
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=pd.Series([iris.target_names[k] for k in iris.target],dtype='category')
df.plot(kind='density')
df.plot(kind='hist')
Function for scatterplot
• Function for a scatter plot:
• pandas.DataFrame.plot(kind='scatter', x='column-name', y='column-name', c=colorlist)
• It plots a scatterplot of the x and y columns; different colors can be used to represent the different groups.
Function for scatterplot
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=iris.target
pelette={0:'red',1:'yellow',2:'blue'}
color=[pelette[k] for k in df['group']]
print(color)
df.plot(kind='scatter',x='petal length (cm)',y='petal width (cm)',c=color)
Function for scatter matrix
• Useful for plotting a scatter plot of every column of the dataframe against every other column.
• At the diagonal positions we can plot either a histogram or a density plot.
• Function:
pandas.plotting.scatter_matrix(dataframe, figsize=(width, height), color=colors, diagonal='hist' or 'kde')
• Parameters:
figsize : (float, float), optional
• A tuple (width, height) in inches.
Scatter matrix of all columns vs. all columns of iris
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=iris.target
pelette={0:'red',1:'yellow',2:'blue'}
colorlist=[pelette[k] for k in df['group']]
scatter_matrix(df,figsize=(6,6),color=colorlist,diagonal='hist')
Covariance
• Covariance and correlation are terms used in statistics to measure the relationship between two random variables. Both of these terms measure the linear dependency between a pair of random variables.
Example (size of house vs. price):
1200 sqm -> 100k$
1500 sqm -> 200k$
1800 sqm -> 300k$
Quantify the relationship between size and price: as size increases, does the price increase, and as size decreases, does the price decrease?
Covariance
(Same size-of-house vs. price table as above.)
Cov(X, Y) = [ Σ from i=1 to n of (x_i - xbar)(y_i - ybar) ] / (n - 1)
If x increases and y also increases, the covariance will be positive.
If x increases and y decreases, the covariance will be negative.
Function: DataFrame.cov() (called on the dataframe object)
Covariance
Covariance is useful to quantify the relation between random variables. It gives the direction of the relationship: either positive (+ve) or negative (-ve).
Correlation
• It represents the relationship between the dependent and independent variable: if we have two variables x and y, when x increases, does y increase or not?
• Covariance helps us find the direction of the relationship.
• Correlation helps us find the strength of the relationship: if the direction is positive, how strongly positive, and if the direction is negative, how strongly negative.
• Pearson's correlation coefficient is used to find the strength of the relationship.
p(x, y) lies between -1 and +1: -1 <= p(x, y) <= 1
Function: DataFrame.corr() (called on the dataframe object)
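The slides cite Pearson's coefficient without showing the formula; the standard definition, consistent with the covariance formula above, is:
p(x, y) = Cov(X, Y) / (std(x) * std(y))
where std(x) and std(y) are the standard deviations of x and y. This normalization is what bounds the value between -1 and +1.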
Correlation strength (from the figure):
p = 1       -> strong positive relation
0 < p < 1   -> medium to weak positive relation
p = 0       -> no relation
-1 < p < 0  -> weak to medium negative relation
p = -1      -> strong negative relation
Python program to print the covariance of all columns with respect to all other columns of iris
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=iris.target
df.cov()
Python program to print Pearson's correlation coefficient of all columns with respect to all other columns of iris
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data, columns=iris.feature_names)
df['group']=iris.target
df.corr()
Practical set 7: question 36
Write a Python program to print telephone numbers placed in sentences using regular expressions
import re
data1 = 'My phone number is: 800-555-1212.'
data2 = '800-555-1234 is my phone number.'
pattern = r'\d{3}-\d{3}-\d{4}'
m1=re.search(pattern,data1)
print(m1.group())
m2=re.search(pattern,data2)
print(m2.group())
OUTPUT:
800-555-1212
800-555-1234
Practical set 7: question 37
Write a Python program to create a basic adjacency matrix from a NetworkX-supplied graph
import networkx as nx
G = nx.cycle_graph(10)
A = nx.adjacency_matrix(G)
print(A) #print sparse matrix
print(A.todense())
nx.draw_networkx(G)
Practical set 7: question 38
Write a Python program to create an applied visualization for EDA using boxplots and perform t-tests
from sklearn.datasets import load_iris
import pandas as pd
iris=load_iris()
df=pd.DataFrame(iris.data,columns=iris.feature_names)
df['group']=pd.Series([iris.target_names[k] for k in iris.target],dtype='category')
df.plot(kind='box')
group1=df['group']=='setosa'
group2=df['group']=='versicolor'
print(group1)
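The slide ends before the t-test itself; a minimal completion sketch, consistent with the earlier ttest_ind example, comparing petal length between the two selected groups:
from scipy.stats import ttest_ind
# the boolean masks group1/group2 select the rows of each species
t, pvalue = ttest_ind(df['petal length (cm)'][group1],
                      df['petal length (cm)'][group2])
print(t, pvalue)   # p < 0.05 -> reject Ho: the group means differ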