KBN Pandas in python for Btech students.pptx

Dr. C.NAGARAJU YSRCE OF YOGIVEMANAUNI
VERSITY 9949218570
1
DATA ANALYTICS WITH PANDAS
Dr C.Naga Raju
B.Tech(CSE),M.Tech(CSE),PhD(CSE),MIEEE,MCSI,MISTE
Associate Professor
Department of CSE
YSR Engineering College of YVU
Proddatur
https://archive.ics.uci.edu/ml/datasets.php

INTRODUCTION TO PANDAS
• Pandas is a high-level data manipulation tool developed by Wes McKinney.
• Pandas library provides data analytics features like R programming
and MATLAB
• Pandas is built on Numpy, Scipy and Matplotlib packages so that it uses
features of these packages
• The key data structures of pandas are 1) Series 2)Data Frames
• Series is like one dimensional array object contains data and labels(index).
• Data Frame is like two dimensional array object stores data in the form of
rows and columns.
• rows represents observations and columns represents variables.

• Pandas series: series is like one dimensional object containing data and
labels(or) indexes
• Series can be created in different ways using series method

• Single value can be selected from series by single index.
• multiple values are selected from series by multiple indexes

• Series is fixed length ordered Dictionary(dist).
• How ever unlike dictionary index items do not have to be unique

Series operations
• Filtering
• Numpy-like type operations on data

• Pandas can accommodate incomplete data

• Unlike numpy ,ndarray data is automatically alligned

•DATA FREAMES
• Data Frame is like two dimensional array object stores data in the
form of rows and columns.
• rows represents observations and columns represents variables.
• It has both row and column indexes
• It also considered as collection of series as a dictionary(dict)

• Dataframe is created by using DataFrame method of pandas
• Data Frame can be created using dictionary of equal length lists

• Data frame can be created with dictionary of dictionaries

Create file using excel with given name ex: Book1.xlsx
create file using note pad with given name ex:abc.csv

Working with the whole DataFrame
• import numpy as np
• import matplotlib.pyplot as plt
• import pandas as pd
• from pandas import DataFrame,Series
• df=pd.DataFrame ( [ [4,7,10,20,30], [5,8,11,33,34],
[6,9,12,23,12],[4,7,10,20,30], [4,7,10,20,30],
[4,7,10,20,30]],
index=[1,2,3,4,5,6],columns=['a','b','c','d','e'])
print('dataframen’,df)
print('information about dataframen’, df.info())
dataframe
a b c d e
1 4 7 10 20 30
2 5 8 11 33 34
3 6 9 12 23 12
4 4 7 10 20 30
5 4 7 10 20 30
6 4 7 10 20 30
information about dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 1 to 6
Data columns (total 5 columns):
a 6 non-null int64
b 6 non-null int64
c 6 non-null int64
d 6 non-null int64
e 6 non-null int64
dtypes: int64(5)
memory usage: 288.0 bytes
None

• n=2
• dfh=df.head(n)
print('headn',dfh)
• dft=df.tail(n)
print('tail n',dft)
• dfs=df.describe()
print('describen',dfs)
• top_left_corner_df=df.iloc[:5,:5]
print('top_left_corner_dfn',top_left_corner_df)
• dfT=df.T
print('transposen',dfT)
head
a b c d e
1 4 7 10 20 30
2 5 8 11 33 34
tail
a b c d e
5 4 7 10 20 30
6 4 7 10 20 30
describe
a b c d e
count 6.00000 6.00000 6.00000 6.000000 6.000000
mean 4.50000 7.50000 10.50000 22.666667 27.666667
std 0.83666 0.83666 0.83666 5.202563 7.840068
min 4.00000 7.00000 10.00000 20.000000 12.000000
25% 4.00000 7.00000 10.00000 20.000000 30.000000
50% 4.00000 7.00000 10.00000 20.000000 30.000000
75% 4.75000 7.75000 10.75000 22.250000 30.000000
max 6.00000 9.00000 12.00000 33.000000 34.000000
top_left_corner_df
a b c d e
1 4 7 10 20 30
2 5 8 11 33 34
3 6 9 12 23 12
4 4 7 10 20 30
5 4 7 10 20 30
transpose
1 2 3 4 5 6
a 4 5 6 4 4 4
b 7 8 9 7 7 7
c 10 11 12 10 10 10
d 20 33 23 20 20 20
e 30 34 12 30 30 30

• idx = df.columns # get col index
• print('Column indexn',idx)
• label = df.columns[0] # 1st col label
• print('Column Labeln',label)
• lst = df.columns.tolist() # get as a list
• print('Column as Listn',lst)
• s = df['a'] # select col to Series
• print('col to Seriesn',s)
• s = df[['a']] # select col to df
• print('col to dfn',s)
• s = df[['a','b']] # select 2 or more
• print('select 2 or more columnsn',s)
Column index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Column Label
a
Column as List
['a', 'b', 'c', 'd', 'e']
col to Series
1 4
2 5
3 6
4 4
5 4
6 4
Name: a, dtype: int64
col to df
a
1 4
2 5
3 6
4 4
5 4
6 4
select 2 or more columns
a b
1 4 7
2 5 8
3 6 9
4 4 7
5 4 7
6 4 7

• s = df[['c','a','b']]# change order
• print('change order of columnsn',s)
• f=df.columns[[0, 3, 4]]
• print('Column name by numbern',f)
• s = df.pop('c')
• print('Deleting a columnn',df)
• idx = df.index # get row index
• print('Row indexn',idx)
change order of columns
c a b
1 10 4 7
2 11 5 8
3 12 6 9
4 10 4 7
5 10 4 7
6 10 4 7
select by number
1 7
2 8
3 9
4 7
5 7
6 7
Name: b, dtype: int64
Column name by number
Index(['a', 'd', 'e'], dtype='object')
Deleting a column
a b d e
1 4 7 20 30
2 5 8 33 34
3 6 9 23 12
4 4 7 20 30
5 4 7 20 30
6 4 7 20 30
Row index
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

• label = df.index[0] # 1st row label
• print('Row Labeln',label)
• lst = df.index.tolist() # get as a list
• print('Index as Listn',lst)
• df.sort_index(inplace=True) # sort by row
• df = df.sort_index(ascending=False)
• print('Sorting by rown',df)
Row Label
1
Index as List
[1, 2, 3, 4, 5, 6]
Sorting by row
a b d e
6 4 7 20 30
5 4 7 20 30
4 4 7 20 30
3 6 9 23 12
2 5 8 33 34
1 4 7 20 30

• s=df.dtypes
print('serial col data typen’,s)
• b=df.empty
print(' true for empty data type: ',b)
• i=df.ndim
print(' n no of dimensions: ',i)
• (r,c)=df.shape
print(n 'no of rows and cols: ’,(r,c))
• i=df.size
print(' n size: ',i)
• a=df.values
print(' n valuesn',a)
• dfc=df.copy()
print('copyn',dfc)
• dfr=df.rank()
print('rankn',dfr)
serial col data type
a int64
b int64
c int64
d int64
e int64
dtype: object
true for empty data type: False
no of dimensions: 2
no of rows and cols: (6, 5)
size: 30
values
[[ 4 7 10 20 30]
[ 5 8 11 33 34]
[ 6 9 12 23 12]
[ 4 7 10 20 30]
[ 4 7 10 20 30]
[ 4 7 10 20 30]]
copy
a b c d e
1 4 7 10 20 30
2 5 8 11 33 34
3 6 9 12 23 12
4 4 7 10 20 30
5 4 7 10 20 30
6 4 7 10 20 30
rank
a b c d e
1 2.5 2.5 2.5 2.5 3.5
2 5.0 5.0 5.0 6.0 6.0
3 6.0 6.0 6.0 5.0 1.0
4 2.5 2.5 2.5 2.5 3.5
5 2.5 2.5 2.5 2.5 3.5
6 2.5 2.5 2.5 2.5 3.5

• dfab= df.abs()
print('Absolute n',dfab)
• dfad= df.add(1)
print('Addn',dfad)
• s = df.count()
print('countn',s)
• dfmax= df.cummax()
print('cumulative maxn',dfmax)
• dfmin = df.cummin()
print('cumulative minn',dfmin)
Absolute
a b c d e
1 4 7 10 20 30
2 5 8 11 33 34
3 6 9 12 23 12
4 4 7 10 20 30
5 4 7 10 20 30
6 4 7 10 20 30
Add
a b c d e
1 5 8 11 21 31
2 6 9 12 34 35
3 7 10 13 24 13
4 5 8 11 21 31
5 5 8 11 21 31
7 5 8 11 21 31
count
a 6
b 6
c 6
d 6
e 6
dtype: int64
cumulative max
a b c d e
1 4 7 10 20 30
2 5 8 11 33 34
3 6 9 12 33 34
4 6 9 12 33 34
5 6 9 12 33 34
6 6 9 12 33 34
cumulative min
a b c d e
1 4 7 10 20 30
2 4 7 10 20 30
3 4 7 10 20 12
4 4 7 10 20 12
5 4 7 10 20 12
6 4 7 10 20 12

cumulative sum
a b c d e
1 4 7 10 20 30
2 9 15 21 53 64
3 15 24 33 76 76
4 19 31 43 96 106
5 23 38 53 116 136
6 27 45 63 136 166
cumulative product
a b c d e
1 4 7 10 20 30
2 20 56 110 660 1020
3 120 504 1320 15180 12240
4 480 3528 13200 303600 367200
5 1920 24696 132000 6072000 11016000
6 7680 172872 1320000 121440000 330480000
list difference
a b c d e
1 NaN NaN NaN NaN NaN
2 1.0 1.0 1.0 13.0 4.0
3 1.0 1.0 1.0 -10.0 -22.0
4 -2.0 -2.0 -2.0 -3.0 18.0
5 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0
division
a b c d e
1 2.0 3.5 5.0 10.0 15.0
2 2.5 4.0 5.5 16.5 17.0
3 3.0 4.5 6.0 11.5 6.0
4 2.0 3.5 5.0 10.0 15.0
5 2.0 3.5 5.0 10.0 15.0
6 2.0 3.5 5.0 10.0 15.0
• dfcs = df.cumsum()
print('cumulative sumn',dfcs)
• dfpr= df.cumprod()
print('cumulative productn',dfpr)
• dif = df.diff()
print('list differencen',dif)
• div1= df.div(2)
print('divisionn',div1)

• s = df.max()
print('max of axis (col def)n’,s)
• s = df.mean()
print('mean (col default axis)n',s)
• s = df.median()
print('median (col default)n’,s)
• s = df.min()
print('min of axis (col def)n',s)
• mul = df.mul(1)
print('mul by df Series valn',mul)
• s = df.sum()
print('sum of axisn',s)
max of axis (col def)
a 6
b 9
c 12
d 33
e 34
dtype: int64
'mean (col default axis)
a 4.500000
b 7.500000
c 10.500000
d 22.666667
e 27.666667
dtype: float64
median (col default)
a 4.0
b 7.0
c 10.0
d 20.0
e 30.0
dtype: float64
min of axis (col def)
a 4
b 7
c 10
d 20
e 12
dtype: int64
mul by df Series val
a b c d e
1 4 7 10 20 30
2 5 8 11 33 34
3 6 9 12 23 12
4 4 7 10 20 30
5 4 7 10 20 30
6 4 7 10 20 30
sum of axis
a 27
b 45
c 63
d 136
e 166

Dataframe filters for selection of rows and col
• dffi=df.filter(items=['a','b'])
print('Filter by col n',dffi)
• dfrow=df.filter(items=[2],axis=0)
print('filter by rown',dfrow)
• dfin=df.filter(like='%a%')
print('Filter in coln',dfin)
Filter by col
a b
1 4 7
2 5 8
3 6 9
4 4 7
5 4 7
6 4 7
filter by row
a b c d e
7 5 8 11 33 34
Filter in col
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6]

Basic Statistics
• s = df['a'].describe()
print('describe col an',s)
• cor=df.corr()
print('correlation n',cor)
• cov=df.cov()
print('covariancen',cov)
• kur=df.kurt()
print('kurtosis n',kur)
describe col a
count 6.00000
mean 4.50000
std 0.83666
min 4.00000
25% 4.00000
50% 4.00000
75% 4.75000
max 6.00000
Name: a, dtype: float64
correlation
a b c d e
a 1.000000 1.000000 1.000000 0.505424 -0.762257
b 1.000000 1.000000 1.000000 0.505424 -0.762257
c 1.000000 1.000000 1.000000 0.505424 -0.762257
d 0.505424 0.505424 0.505424 1.000000 0.173252
e -0.762257 -0.762257 -0.762257 0.173252 1.000000
covariance
a b c d e
a 0.7 0.7 0.7 2.200000 -5.000000
b 0.7 0.7 0.7 2.200000 -5.000000
c 0.7 0.7 0.7 2.200000 -5.000000
d 2.2 2.2 2.2 27.066667 7.066667
e -5.0 -5.0 -5.0 7.066667 61.466667
kurtosis
a 1.428571
b 1.428571
c 1.428571
d 4.837353
e 5.231624
dtype: float64

• mdev=df.mad()
print(' mean absolute deviationn',mdev)
• serr=df.sem()
print(' standard error of meann',serr)
• vaco=df.var()
print('variance over cols n',vaco)
• s = df['a'].value_counts()
print('value count in col an',s)
mean absolute deviation
a 0.666667
b 0.666667
c 0.666667
d 3.555556
e 5.222222
dtype: float64
standard error of mean
a 0.341565
b 0.341565
c 0.341565
d 2.123938
e 3.200694
dtype: float64
variance over cols
a 0.700000
b 0.700000
c 0.700000
d 27.066667
e 61.466667
dtype: float64
value count in col a
4 4
6 1
5 1

Cross-tabulation (frequency count)
• ct = pd.crosstab(index=df['a'],columns=df['b'])
print(‘Crosstabn',ct)
Quantiles and ranking
• quants = [0.05, 0.25, 0.5, 0.75, 0.95]
• q = df.quantile(quants)
print(‘Quantilen’,q)
• r = df.rank()
print('Rankn',r)
Crosstab
b 7 8 9
a
4 4 0 0
5 0 1 0
6 0 0 1
Quantile
a b c d e
0.05 4.00 7.00 10.00 20.00 16.5
0.25 4.00 7.00 10.00 20.00 30.0
0.50 4.00 7.00 10.00 20.00 30.0
0.75 4.75 7.75 10.75 22.25 30.0
0.95 5.75 8.75 11.75 30.50 33.0
Rank
a b c d e
1 2.5 2.5 2.5 2.5 3.5
2 5.0 5.0 5.0 6.0 6.0
3 6.0 6.0 6.0 5.0 1.0
4 2.5 2.5 2.5 2.5 3.5
5 2.5 2.5 2.5 2.5 3.5
6 2.5 2.5 2.5 2.5 3.5

Working with strings
assume that df['col'] is series of strings
• df['col']=('niki’)
• s = df['col'].str.lower()
print('Lower n',s)
• s = df['col'].str.upper()
print('Uppern',s)
• s = df['col'].str.len()
print('Lengthn',s)
Lower
1 niki
2 niki
3 niki
4 niki
5 niki
6 niki
Name: col, dtype: object
Upper
1 NIKI
2 NIKI
3 NIKI
4 NIKI
5 NIKI
6 NIKI
Length
1 4
2 4
3 4
4 4
5 4
6 4
Name: col, dtype: int64

• df['col'] += 'suffix'
print('Appendn',df['col’])
• df['col']=('niki')
• df['col'] *= 2
print('duplicaten',df['col’])
• df['col']=('niki')
• s = df['col'] + df['col']
print('concatenaten',s)
Append
1 nikisuffix
2 nikisuffix
3 nikisuffix
4 nikisuffix
5 nikisuffix
6 nikisuffix
duplicate
1 nikiniki
2 nikiniki
3 nikiniki
4 nikiniki
5 nikiniki
6 nikiniki
concatenate
1 nikiniki
2 nikiniki
3 nikiniki
4 nikiniki
5 nikiniki
6 nikiniki

Working with Columns
• idx = df.columns
print('Column indexn',idx)
• label = df.columns[0]
print('Column Labeln',label)
• lst = df.columns.tolist()
print('Column as Listn',lst)
• s = df['a']
print('col to Seriesn',s)
• s = df[['a']]
print('col to dfn',s)
Column index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Column Label
a
Column as List
['a', 'b', 'c', 'd', 'e’]
col to Series
1 4
2 5
3 6
4 4
5 4
6 4
col to df
a
1 4
2 5
3 6
4 4
5 4
6 4

• s = df[['a','b']]
print('select 2 or more columnsn’,s)
• s = df[['c','a','b']]
print('change order of columnsn',s)
• s = df[df.columns[1]]
print('select by numbern',s)
• f=df.columns[[0, 3, 4]]
print('Column name by numbern',f)
• s = df.pop('c')
print('Deleting a columnn',df)
select 2 or more columns
a b
1 4 7
2 5 8
3 6 9
4 4 7
5 4 7
6 4 7
change order of columns
c a b
1 10 4 7
2 11 5 8
3 12 6 9
4 10 4 7
5 10 4 7
6 10 4 7
select by number
1 7
2 8
3 9
4 7
5 7
6 7
Name: b, dtype: int64
Column name by number
Index(['a', 'd', 'e'], dtype='object')
Deleting a column
a b d e
1 4 7 20 30
2 5 8 33 34
3 6 9 23 12
4 4 7 20 30
5 4 7 20 30
6 4 7 20 30

Working with rows
• idx = df.index
print('Row indexn',idx)
• label = df.index[0]
print('Row Labeln',label)
• lst = df.index.tolist()
print('Index as Listn',lst)
• df.sort_index(inplace=True)
• df = df.sort_index(ascending=False)
print('Sorting by rown',df)
Row index
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
Row Label
1
Index as List
[1, 2, 3, 4, 5, 6]
Sorting by row
a b c d e
6 4 7 10 20 30
5 4 7 10 20 30
4 4 7 10 20 30
3 6 9 12 23 12
2 5 8 11 33 34
1 4 7 10 20 30

import pandas as pd
df = pd.DataFrame ( { 'name':['john','mary','peter','jeff','bill','lisa','jose'],
'age':[23,78,22,19,45,33,20],
'gender':['M','F','M','M','M','F','M'],
'state':['california','dc','california','dc','california','texas','texas'],
'num_children':[2,0,0,3,2,1,4],
'num_pets':[5,1,0,5,2,2,3] } )

Plot two dataframe columns as a scatter plot
# a scatter plot comparing num_children and num_pets
import matplotlib.pyplot as plt
import pandas as pd
df.plot(kind='scatter',x='num_children',y='num_pets',color='red') plt.show()

• Plot column values as a bar plot
import pandas as pd
df.plot(kind='bar',x='name',y='age')

• Line plot with multiple columns
import pandas as pd
df.plot(kind='line',x='name',y='num_children',ax=ax)
df.plot(kind='line',x='name',y='num_pets', color='red', ax=ax)
plt.show()

• Bar plot with group by
import pandas as pd df.groupby('state')
['name'].nunique().plot(kind='bar') plt.show()

Plot histogram of column values
import pandas as pd
df[['age']].plot(kind='hist',bins=[0,20,40,60,80,100],rwidth=0.8)
plt.show()

KBN Pandas in python for Btech students.pptx

More Related Content

Similar to KBN Pandas in python for Btech students.pptx

More from gandhamcharan2006

Recently uploaded

KBN Pandas in python for Btech students.pptx