Up and Running with Python

Here is the notebook that I used for the Ann Arbor Chapter of the
American Statistical Association class 'Up and Running with Python'.
A copy is available at: https://drive.google.com/open?id=1-lgkJ9pilNsvQ1NaJTR3Xf8j9_u0xp07
This is a standard module import list:
In [2]: # For system information:
import sys
import os
# For data wrangling and statistics:
import pandas as pd
import pandasql as sqla  # for SQL work.
import numpy as np
import scipy
import statsmodels
# For graphs (note: pyplot, not the top-level matplotlib package, is what 'plt' refers to):
import matplotlib.pyplot as plt
import seaborn as sns
# For dates and times:
import datetime
import time
import math as mth
# This causes matplotlib graphs to appear inside the Jupyter notebook, rather than
# in a separate window:
%matplotlib inline
First Step - Import a Data Set from .CSV
Pandas (the name comes from 'panel data') is the major Python module for data wrangling. The major object is the Pandas DataFrame
(corresponding to the R data frame or SAS data set). Pandas has a large number of import functions. We will use
read_csv() to import the data set.
In [3]: pwt9 = pd.read_csv('Datapwt9.0.csv')
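As a minimal sketch of what read_csv() does, the cell below parses a tiny in-memory CSV instead of the real file. The column names mimic the layout of the data set used here, but the three rows and their values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature CSV; the rows and values are invented for illustration.
csv_text = """,country,isocode,year,pop
AAA-2000,Atlantis,AAA,2000,1.5
AAA-2001,Atlantis,AAA,2001,
BBB-2000,Borduria,BBB,2000,3.0
"""

# read_csv() accepts a file path or any file-like object.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)                # (rows, columns)
print(list(demo.columns))        # the unnamed first column becomes 'Unnamed: 0'
print(demo['pop'].isna().sum())  # the empty field is read as a missing value
```

The empty header field is why the first column of the real data set shows up as 'Unnamed: 0'.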
Up and Running with Python http://localhost:8888/nbconvert/html/Documents/Python/Projects/ASA U...
1 of 13 2/28/2020, 10:52 AM

Explore the object, using various functions:
In [4]: print(type(pwt9))
<class 'pandas.core.frame.DataFrame'>
In [5]: pwt9.shape
Out[5]: (11830, 48)
In [6]: pwt9.head()
Out[6]:
  Unnamed: 0 country isocode  year        currency  rgdpe  rgdpo  pop  emp  avh  ...  csh_g  csh_x  csh_m  csh_r
0   ABW-1950   Aruba     ABW  1950  Aruban Guilder    NaN    NaN  NaN  NaN  NaN  ...    NaN    NaN    NaN    NaN
(rows 1 to 4: only the currency, 'Aruban Guilder', survives in this copy)
5 rows × 48 columns
In [7]: pwt9.tail()
Out[7]:
      Unnamed: 0   country isocode  year   currency         rgdpe         rgdpo        pop       emp  avh
11825   ZWE-2010  Zimbabwe     ZWE  2010  US Dollar  20652.718750  21053.855469  13.973897  6.298438  NaN
11826   ZWE-2011  Zimbabwe     ZWE  2011  US Dollar  20720.435547  21592.298828  14.255592  6.518841  NaN
11827   ZWE-2012  Zimbabwe     ZWE  2012  US Dollar  23708.654297  24360.527344  14.565482  6.248271  NaN
11828   ZWE-2013  Zimbabwe     ZWE  2013  US Dollar  27011.988281  28157.886719  14.898092  6.287056  NaN
11829   ZWE-2014  Zimbabwe     ZWE  2014  US Dollar  28495.554688  29149.708984  15.245855  6.499974  NaN
'Index' is what Pandas DataFrames use for the names of columns and rows. If no row names are specified, the
row number is used. Note that '[...]' is a list.

In [8]: pwt9.columns
Out[8]: Index(['Unnamed: 0', 'country', 'isocode', 'year', 'currency', 'rgdpe',
       'rgdpo', 'pop', 'emp', 'avh', 'hc', 'ccon', 'cda', 'cgdpe', 'cgdpo',
       'ck', 'ctfp', 'cwtfp', 'rgdpna', 'rconna', 'rdana', 'rkna', 'rtfpna',
       'rwtfpna', 'labsh', 'delta', 'xr', 'pl_con', 'pl_da', 'pl_gdpo',
       'i_cig', 'i_xm', 'i_xr', 'i_outlier', 'cor_exp', 'statcap', 'csh_c',
       'csh_i', 'csh_g', 'csh_x', 'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g',
       'pl_x', 'pl_m', 'pl_k'],
      dtype='object')
You can assign column names to a list, for changing them or using them in formulae:
In [9]: column_names = list(pwt9.columns)
The reason I passed them through the list() function is that the column names are stored in a Pandas data type called an 'Index',
and Pandas will not allow editing its elements in place.
In [10]: type(column_names)
Out[10]: list
In [11]: column_names[0] = 'Unique_ID'
We can rename columns by assignment:
In [12]: pwt9.columns = column_names
In [13]: pwt9.columns
Out[13]: Index(['Unique_ID', 'country', 'isocode', 'year', 'currency', 'rgdpe', 'rgdpo',
       'pop', 'emp', 'avh', 'hc', 'ccon', 'cda', 'cgdpe', 'cgdpo', 'ck',
       'ctfp', 'cwtfp', 'rgdpna', 'rconna', 'rdana', 'rkna', 'rtfpna',
       'rwtfpna', 'labsh', 'delta', 'xr', 'pl_con', 'pl_da', 'pl_gdpo',
       'i_cig', 'i_xm', 'i_xr', 'i_outlier', 'cor_exp', 'statcap', 'csh_c',
       'csh_i', 'csh_g', 'csh_x', 'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g',
       'pl_x', 'pl_m', 'pl_k'],
      dtype='object')
In the printouts above, Python used the row number (starting at 0), because there was no row label variable assigned,
which Pandas calls a 'Row Index'. You can assign one - for this data set, the variable 'Unique_ID':
In [14]: pwt9.set_index('Unique_ID', inplace=True)
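The renaming and indexing steps can be tried on a small frame. This is only a sketch on a hypothetical two-column DataFrame; rename() is shown as one possible alternative to the list round-trip and is not what the notebook itself uses.

```python
import pandas as pd

# Hypothetical two-column frame, for illustration only.
demo = pd.DataFrame({'Unnamed: 0': ['A-1', 'A-2'], 'value': [10, 20]})

# Item assignment on an Index raises TypeError because an Index is immutable.
try:
    demo.columns[0] = 'Unique_ID'
except TypeError as err:
    print('TypeError:', err)

# Alternative to the list round-trip: rename() takes an old-to-new mapping.
demo = demo.rename(columns={'Unnamed: 0': 'Unique_ID'})

# Without inplace=True, set_index() returns a new frame and leaves demo as it was.
indexed = demo.set_index('Unique_ID')

print(list(demo.columns))     # 'Unique_ID' is still an ordinary column here
print(list(indexed.columns))  # here it has become the row index
print(indexed.index.name)
```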

The 'inplace=True' argument in In [14] makes Pandas alter the original data set, rather than making a copy.
In [15]: pwt9.head()
Out[15]:
           country isocode  year        currency  rgdpe  rgdpo  pop  emp  avh   hc  ...  csh_g  csh_x  csh_m
Unique_ID
ABW-1950     Aruba     ABW  1950  Aruban Guilder    NaN    NaN  NaN  NaN  NaN  NaN  ...    NaN    NaN    NaN
(rows 2 to 5: only the currency, 'Aruban Guilder', survives in this copy)
You can select items in a DataFrame by row/column number, or by row/column name.
The first uses iloc[] (integer location); the second uses loc[] (label location):
In [16]: pwt9.iloc[0,0]
Out[16]: 'Aruba'
In [17] selects the row 'ZWE-2010' with loc[]; the ',:' selects all columns.
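The difference between position-based and label-based selection can be sketched on a small hand-built frame. The frame below is hypothetical and only borrows the row-label style of the data set.

```python
import pandas as pd

# Small hand-built frame with row labels, for illustration only.
demo = pd.DataFrame(
    {'country': ['Aruba', 'Aruba', 'Zimbabwe'], 'year': [1950, 1951, 2010]},
    index=['ABW-1950', 'ABW-1951', 'ZWE-2010'],
)

by_position = demo.iloc[0, 0]                # first row, first column
by_label = demo.loc['ABW-1950', 'country']   # same cell, addressed by name
print(by_position, by_label)

# Slices differ: iloc excludes the end position, loc includes the end label.
print(len(demo.iloc[0:1]))                   # position 0 only
print(len(demo.loc['ABW-1950':'ABW-1951']))  # both labels are included
```

The inclusive end point of loc[] slices is easy to overlook when switching between the two.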

In [17]: pwt9.loc['ZWE-2010',:]
Out[17]: country Zimbabwe
isocode ZWE
year 2010
currency US Dollar
rgdpe 20652.7
rgdpo 21053.9
pop 13.9739
emp 6.29844
avh NaN
hc 2.3726
ccon 21862.8
cda 26023
cgdpe 20537.3
cgdpo 21236.8
ck 34274.3
ctfp 0.234834
cwtfp 0.277434
rgdpna 19295.2
rconna 20977.2
rdana 25035.1
rkna 78953.3
rtfpna 0.932604
rwtfpna 0.863711
labsh 0.555796
delta 0.0373084
xr 1
pl_con 0.442739
pl_da 0.458783
pl_gdpo 0.443672
i_cig interpolated
i_xm benchmark
i_xr market
i_outlier no
cor_exp NaN
statcap 45.5556
csh_c 0.902229
csh_i 0.195897
csh_g 0.127251
csh_x 0.214657
csh_m -0.454497
csh_r 0.0144622
pl_c 0.44717
pl_i 0.5431
pl_g 0.411316
pl_x 0.701797
pl_m 0.606324
pl_k 1.01514
Name: ZWE-2010, dtype: object
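The output above is a Pandas Series: selecting one row turns the column names into the Series index and the row label into its name. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame mixing text and numbers, for illustration only.
demo = pd.DataFrame(
    {'country': ['Aruba', 'Zimbabwe'], 'year': [1950, 2010]},
    index=['ABW-1950', 'ZWE-2010'],
)

row = demo.loc['ZWE-2010', :]

print(type(row).__name__)  # the row comes back as a Series
print(row.name)            # the row label
print(list(row.index))     # the column names
print(row.dtype)           # object, because the row mixes text and numbers
print(row['year'])         # individual fields are accessed by column name
```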

In [18]: pwt9.iloc[0,0:10]
Out[18]: country              Aruba
isocode                ABW
year                  1950
currency    Aruban Guilder
rgdpe                  NaN
rgdpo                  NaN
pop                    NaN
emp                    NaN
avh                    NaN
hc                     NaN
Name: ABW-1950, dtype: object
In [19]: pwt9.iloc[0:10,]
Out[19]:
           country isocode  year  currency  rgdpe  rgdpo  pop  emp  avh  hc  ...  csh_g  csh_x  csh_m
Unique_ID
(10 rows: only the currency, 'Aruban Guilder', survives in this copy)
To select a column, just use the column name(s) within quotes and brackets.

In [20]: pwt9['country']
Out[20]: Unique_ID
ABW-1950       Aruba
ABW-1951       Aruba
ABW-1952       Aruba
ABW-1953       Aruba
ABW-1954       Aruba
               ...
ZWE-2010    Zimbabwe
ZWE-2011    Zimbabwe
ZWE-2012    Zimbabwe
ZWE-2013    Zimbabwe
ZWE-2014    Zimbabwe
Name: country, Length: 11830, dtype: object
Missing Values (https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
Pandas uses 'NaN' and 'NA' as missing value symbols. You can also set '-inf' and 'inf' to be treated as missing values
with the setting "pandas.options.mode.use_inf_as_na = True" (deprecated since pandas 2.1; the suggested alternative is
to convert inf values to NaN beforehand).
Python itself uses 'None', which Pandas also treats as missing.
There are other types, such as 'NaT' for datetime64[ns].
When comparing/testing values, use the 'isna()' and 'notna()' functions:

In [21]: pwt9.loc['ABW-1950',]
Out[21]: country               Aruba
isocode                 ABW
year                   1950
currency     Aruban Guilder
rgdpe                   NaN
rgdpo                   NaN
pop                     NaN
emp                     NaN
avh                     NaN
hc                      NaN
ccon                    NaN
cda                     NaN
cgdpe                   NaN
cgdpo                   NaN
ck                      NaN
ctfp                    NaN
cwtfp                   NaN
rgdpna                  NaN
rconna                  NaN
rdana                   NaN
rkna                    NaN
rtfpna                  NaN
rwtfpna                 NaN
labsh                   NaN
delta                   NaN
xr                      NaN
pl_con                  NaN
pl_da                   NaN
pl_gdpo                 NaN
i_cig                   NaN
i_xm                    NaN
i_xr                    NaN
i_outlier               NaN
cor_exp                 NaN
statcap                 NaN
csh_c                   NaN
csh_i                   NaN
csh_g                   NaN
csh_x                   NaN
csh_m                   NaN
csh_r                   NaN
pl_c                    NaN
pl_i                    NaN
pl_g                    NaN
pl_x                    NaN
pl_m                    NaN
pl_k                    NaN
Name: ABW-1950, dtype: object
In [22]: pwt9.loc['ABW-1950','rgdpe']
Out[22]: nan
In [23]: pwt9.loc['ABW-1950','rgdpe'] == 'nan'
Out[23]: False
The comparison with the string 'nan' returns False: a missing value is not equal to any value, so an equality
test cannot detect it. Use isna() instead:
In [24]: pd.isna(pwt9.loc['ABW-1950','rgdpe'])
Out[24]: True
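The behaviour of missing values can be checked on a short hypothetical Series. The explicit replace() call is one way to treat inf as missing, offered here as a sketch rather than as the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical series, for illustration only.
s = pd.Series([1.0, np.nan, np.inf, None])

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)

# isna() flags NaN and None, but treats inf as an ordinary value.
print(s.isna().tolist())

# Converting inf to NaN explicitly makes it count as missing too.
cleaned = s.replace([np.inf, -np.inf], np.nan)
print(cleaned.isna().tolist())

# Reductions skip NaN by default; skipna=False lets it propagate.
print(cleaned.sum(), cleaned.sum(skipna=False))
```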

The usual caveats about propagation of missing values apply.
Group By: split-apply-combine
In [25]: grouped_pwt9 = pwt9.groupby(['isocode','year'])
In [26]: grouped_pwt9.count()
Out[26]:
              country  currency  rgdpe  rgdpo  pop  emp  avh   hc  ccon  cda  ...  csh_g  csh_x  csh_m  csh_r
isocode year
ABW     1950        1         1      0      0    0    0    0    0     0    0  ...      0      0      0
        1951        1         1      0      0    0    0    0    0     0    0  ...      0      0      0
        1952        1         1      0      0    0    0    0    0     0    0  ...      0      0      0
        1953        1         1      0      0    0    0    0    0     0    0  ...      0      0      0
        1954        1         1      0      0    0    0    0    0     0    0  ...      0      0      0
...               ...       ...    ...    ...  ...  ...  ...  ...   ...  ...  ...    ...    ...    ...
ZWE     2010        1         1      1      1    1    1    0    1     1    1  ...      1      1      1
        2011        1         1      1      1    1    1    0    1     1    1  ...      1      1      1
        2012        1         1      1      1    1    1    0    1     1    1  ...      1      1      1
        2013        1         1      1      1    1    1    0    1     1    1  ...      1      1      1
        2014        1         1      1      1    1    1    0    1     1    1  ...      1      1      1
You can also chain the groupby() and count() methods directly, as in In [27].
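The split-apply-combine pattern can be sketched on a small hypothetical frame. The column names follow the data set, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
demo = pd.DataFrame({
    'isocode': ['AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'year': [2000, 2001, 2000, 2001, 2002],
    'pop': [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Split by isocode, apply count() to each group, combine into one frame.
counts = demo.groupby('isocode').count()
print(counts)

# count() tallies non-missing values, while size() counts rows.
sizes = demo.groupby('isocode').size()
print(sizes)

# Other reductions follow the same pattern.
means = demo.groupby('isocode')['pop'].mean()
print(means)
```

This is why the counts in the grouped output differ by column: count() reports how many values in each group are not missing.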

In [27]: pwt9.groupby('isocode').count()
Out[27]:
         country  year  currency  rgdpe  rgdpo  pop  emp  avh   hc  ccon  ...  csh_g  csh_x  csh_m  csh_r
isocode
ABW           65    65        65     45     45   45   24    0    0    45  ...     45     45     45     45
AGO           65    65        65     45     45   45   45    0   45    45  ...     45     45     45     45
AIA           65    65        65     45     45   45   29    0    0    45  ...     45     45     45     45
ALB           65    65        65     45     45   45   45    0   45    45  ...     45     45     45     45
ARE           65    65        65     45     45   45   45    0   45    45  ...     45     45     45     45
...          ...   ...       ...    ...    ...  ...  ...  ...  ...   ...  ...    ...    ...    ...    ...
VNM           65    65        65     45     45   45   45   45   45    45  ...     45     45     45     45
YEM           65    65        65     26     26   26   26    0   26    26  ...     26     26     26     26
ZAF           65    65        65     65     65   65   65   12   65    65  ...     65     65     65     65
ZMB           65    65        65     60     60   60   60    0   60    60  ...     60     60     60     60
ZWE           65    65        65     61     61   61   35    0   61    61  ...     61     61     61     61
Summary Statistics (see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)
In [28]: pwt9.describe()
Out[28]:
               year         rgdpe         rgdpo          pop          emp          avh           hc
count  11830.000000  9.439000e+03  9.439000e+03  9439.000000  8244.000000  3319.000000  7867.000000  9.439000e+03
mean    1982.000000  2.530908e+05  2.508545e+05    30.090573    14.218857  1995.704047     2.032653  1.856605e+05
std       18.762456  9.973281e+05  9.899123e+05   111.489127    56.500008   271.514641     0.708940  7.308969e+05
min     1950.000000  1.854156e+01  1.108702e+00     0.004377     0.001180  1362.503252     1.007038  1.459605e+01
25%     1966.000000  5.847668e+03  6.044927e+03     1.593420     0.941974  1816.496296     1.408157  4.988803e+03
50%     1982.000000  2.458191e+04  2.487366e+04     5.985658     2.976475  1979.000000     1.916717  1.975266e+04
75%     1998.000000  1.304465e+05  1.324746e+05    19.533204     8.403848  2176.902193     2.608796  9.644857e+04
max     2014.000000  1.708030e+07  1.713595e+07  1369.435670   798.367798  3042.446905     3.734285  1.368595e+07
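A short sketch of how describe() treats column types and missing values, using a hypothetical four-row frame with invented values:

```python
import pandas as pd

# Hypothetical data, for illustration only.
demo = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'year': [2000, 2001, 2002, 2003],
    'pop': [1.0, 2.0, None, 4.0],
})

# By default only numeric columns are summarized; NaN is excluded from count.
summary = demo.describe()
print(summary)

# include='all' adds the non-numeric columns as well.
print(demo.describe(include='all'))

# Individual statistics can be read off by row and column label.
print(summary.loc['count', 'pop'])
print(summary.loc['50%', 'year'])
```

The differing 'count' values per column in the output above come from the same rule: each column is summarized over its non-missing values only.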

In [29]: pwt9.groupby('isocode').describe()
Out[29]:
          year                                                             rgdpe                 ...      pl_m
         count    mean       std     min     25%     50%     75%     max  count           mean  ...       75%
isocode
ABW       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   45.0    2524.893969  ...  0.648430
AGO       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   45.0   55150.395095  ...  0.572490
AIA       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   45.0     207.284460  ...  0.566618
ALB       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   45.0   14805.517654  ...  0.580224
ARE       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   45.0  255698.718403  ...  0.560910
...        ...     ...       ...     ...     ...     ...     ...     ...    ...            ...  ...       ...
VNM       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   45.0  153119.746441  ...  0.550852
YEM       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   26.0   41752.590952  ...  0.642233
ZAF       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   65.0  285953.341587  ...  0.552549
ZMB       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   60.0   16092.146663  ...  0.572661
ZWE       65.0  1982.0  18.90767  1950.0  1966.0  1982.0  1998.0  2014.0   61.0   23284.501889  ...  0.542417
Plotting, using Matplotlib
There are a number of plotting functions; the example here uses plt.hist() to draw a histogram.

In [30]: import matplotlib.pyplot as plt
import numpy as np
plt.xlabel('Log(Population)')
plt.title('Histogram of Log(Population)')
# Adding the option 'density=True' would give the density.
plt.hist(np.log(pwt9['pop']), bins=30)
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py:853: RuntimeWarning: invalid value encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
These warnings most likely stem from the missing values (NaN) in 'pop'. Dropping them first, for example with
pwt9['pop'].dropna(), avoids passing NaN to log() and hist().
The command '%matplotlib inline' in the first cell at the top of the notebook causes Matplotlib to put the plot in the
notebook, rather than in a new window.
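A minimal sketch of the same histogram with missing values removed before plotting. The population values are hypothetical, and the 'Agg' backend line is only there so the sketch runs outside a notebook; inside Jupyter it can be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical population values (in millions), for illustration only.
pop = pd.Series([0.5, 2.0, np.nan, 10.0, 150.0, np.nan, 1300.0])

# Drop missing values first so that neither log() nor hist() sees NaN.
log_pop = np.log(pop.dropna())

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(log_pop, bins=5)
ax.set_xlabel('Log(Population)')
ax.set_title('Histogram of Log(Population)')
plt.close(fig)  # in a notebook, plt.show() would display the figure instead

print(counts.sum())    # number of values that were binned
print(len(bin_edges))  # one more edge than there are bins
```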