This is a PDF version of the Jupyter notebook which I used. You can find the notebook and all other AA ASA presentation materials at: https://drive.google.com/open?id=1Aw_LmQYSvcDJQUZ3_dCFyCcMgURB5nXl
Up and running with python
Here is the notebook which I used for the Ann Arbor Chapter of the American Statistical Association class 'Up and Running with Python'.
A copy is available at: https://drive.google.com/open?id=1-lgkJ9pilNsvQ1NaJTR3Xf8j9_u0xp07
This is a standard module import list:
In [2]: # For system information:
import sys
import os
import pandas as pd
import pandasql as sqla # for SQL work.
import numpy as np
import scipy
import statsmodels
# For graphs:
import matplotlib.pyplot as plt
import seaborn as sns
# For dates and times:
import datetime
import time
import math as mth
# This causes matplotlib graphs to appear inside the Jupyter notebook, rather than in a separate window:
%matplotlib inline
First Step - Import a Data Set from .CSV
Pandas (the name comes from 'panel data') is the major Python module for data wrangling. Its central object is the Pandas DataFrame
(corresponding to the R data frame or SAS data set). Pandas has a large number of import functions. We will use
read_csv() to import the data set.
In [3]: pwt9 = pd.read_csv('Datapwt9.0.csv')
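As a minimal sketch of what read_csv() does, the example below reads a tiny in-memory CSV instead of the real file. The name `sample_csv` is made up for illustration, and the three rows are a small excerpt of values that appear in the printouts of this notebook; the real file has many more rows and columns.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file.
# The first header field is empty, as it appears to be in the real data.
sample_csv = io.StringIO(
    ",country,isocode,year,pop\n"
    "ABW-1950,Aruba,ABW,1950,\n"
    "ABW-1951,Aruba,ABW,1951,\n"
    "ZWE-2014,Zimbabwe,ZWE,2014,15.245855\n"
)

# read_csv() accepts a file path or any file-like object.
sample = pd.read_csv(sample_csv)

print(type(sample))        # the DataFrame class
print(sample.shape)        # (rows, columns)
print(sample.columns[0])   # pandas labels the unnamed first column itself
print(sample["pop"])       # empty fields are read as NaN
```

Empty fields become NaN on import, which is why the early Aruba rows in the real data set show NaN in most columns.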
Up and Running with Python http://localhost:8888/nbconvert/html/Documents/Python/Projects/ASA U...
1 of 13 2/28/2020, 10:52 AM
Explore the object, using various functions:
In [4]: print(type(pwt9))
In [5]: pwt9.shape
In [6]: pwt9.head()
In [7]: pwt9.tail()
'Index' is the type Pandas DataFrames use for the names of columns and rows. If no row names are specified, the
row number is used. Note that '[...]' denotes a list.
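A small sketch of the row and column Index objects, using a made-up two-row frame called `demo` (not part of the original notebook):

```python
import pandas as pd

# Hypothetical two-row frame built without row names.
demo = pd.DataFrame({"country": ["Aruba", "Zimbabwe"], "year": [1950, 2014]})

print(demo.index)          # default row labels are the row numbers 0, 1, ...
print(demo.columns)        # column labels are also stored in an Index
print(list(demo.columns))  # the same labels as a plain Python list
```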
<class 'pandas.core.frame.DataFrame'>
Out[5]: (11830, 48)
Out[6]:
   Unnamed: 0  country  isocode  year  currency        rgdpe  rgdpo  pop  emp  avh  ...  csh_g  csh_x  csh_m  csh_r
0  ABW-1950    Aruba    ABW      1950  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  ...  NaN    NaN    NaN    NaN
1  ABW-1951    Aruba    ABW      1951  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  ...  NaN    NaN    NaN    NaN
2  ABW-1952    Aruba    ABW      1952  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  ...  NaN    NaN    NaN    NaN
3  ABW-1953    Aruba    ABW      1953  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  ...  NaN    NaN    NaN    NaN
4  ABW-1954    Aruba    ABW      1954  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  ...  NaN    NaN    NaN    NaN
5 rows × 48 columns
Out[7]:
       Unnamed: 0  country   isocode  year  currency   rgdpe         rgdpo         pop        emp       avh
11825  ZWE-2010    Zimbabwe  ZWE      2010  US Dollar  20652.718750  21053.855469  13.973897  6.298438  NaN
11826  ZWE-2011    Zimbabwe  ZWE      2011  US Dollar  20720.435547  21592.298828  14.255592  6.518841  NaN
11827  ZWE-2012    Zimbabwe  ZWE      2012  US Dollar  23708.654297  24360.527344  14.565482  6.248271  NaN
11828  ZWE-2013    Zimbabwe  ZWE      2013  US Dollar  27011.988281  28157.886719  14.898092  6.287056  NaN
11829  ZWE-2014    Zimbabwe  ZWE      2014  US Dollar  28495.554688  29149.708984  15.245855  6.499974  NaN
5 rows × 48 columns
In [8]: pwt9.columns
You can assign column names to a list, for changing them or using them in formulae:
In [9]: column_names = list(pwt9.columns)
I passed them through the list() function because the column names are stored in a Pandas data type called an 'Index', which is
immutable: Python will not allow its elements to be edited in place.
In [10]: type(column_names)
In [11]: column_names[0] = 'Unique_ID'
We can rename columns by assignment:
In [12]: pwt9.columns = column_names
In [13]: pwt9.columns
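The rename-by-list steps above can be sketched on a small, made-up frame (`demo` is hypothetical). The sketch also shows the error raised by an in-place edit of an Index, and DataFrame.rename(), an alternative that this notebook does not use.

```python
import pandas as pd

# Hypothetical one-row frame with the same awkward first column name.
demo = pd.DataFrame({"Unnamed: 0": ["ABW-1950"], "country": ["Aruba"]})

# An Index is immutable, so editing one element in place fails.
try:
    demo.columns[0] = "Unique_ID"
except TypeError as err:
    print("TypeError:", err)

# Route used above: copy to a list, edit the list, assign it back.
column_names = list(demo.columns)
column_names[0] = "Unique_ID"
demo.columns = column_names

# Alternative: rename() with an old-name -> new-name mapping.
# It returns a new DataFrame and leaves demo unchanged.
demo2 = demo.rename(columns={"country": "Country"})
print(list(demo.columns))
print(list(demo2.columns))
```

rename() is convenient when only a few columns change, because the other names never need to be listed.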
In the printouts above, Pandas used the row number (starting at 0), because no row label variable was assigned;
Pandas calls this the 'Row Index'. You can assign one - for this data set, the variable 'Unique_ID':
In [14]: pwt9.set_index('Unique_ID', inplace=True)
Out[8]: Index(['Unnamed: 0', 'country', 'isocode', 'year', 'currency', 'rgdpe',
'rgdpo', 'pop', 'emp', 'avh', 'hc', 'ccon', 'cda', 'cgdpe', 'cgdpo',
'ck', 'ctfp', 'cwtfp', 'rgdpna', 'rconna', 'rdana', 'rkna', 'rtfpna',
'rwtfpna', 'labsh', 'delta', 'xr', 'pl_con', 'pl_da', 'pl_gdpo',
'i_cig', 'i_xm', 'i_xr', 'i_outlier', 'cor_exp', 'statcap', 'csh_c',
'csh_i', 'csh_g', 'csh_x', 'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g',
'pl_x', 'pl_m', 'pl_k'],
dtype='object')
Out[10]: list
Out[13]: Index(['Unique_ID', 'country', 'isocode', 'year', 'currency', 'rgdpe', 'rgdpo',
'pop', 'emp', 'avh', 'hc', 'ccon', 'cda', 'cgdpe', 'cgdpo', 'ck',
'ctfp', 'cwtfp', 'rgdpna', 'rconna', 'rdana', 'rkna', 'rtfpna',
'rwtfpna', 'labsh', 'delta', 'xr', 'pl_con', 'pl_da', 'pl_gdpo',
'i_cig', 'i_xm', 'i_xr', 'i_outlier', 'cor_exp', 'statcap', 'csh_c',
'csh_i', 'csh_g', 'csh_x', 'csh_m', 'csh_r', 'pl_c', 'pl_i', 'pl_g',
'pl_x', 'pl_m', 'pl_k'],
dtype='object')
In [15]: pwt9.head()
The 'inplace=True' makes Pandas alter the original data set, rather than returning a modified copy.
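A short sketch of set_index() with and without 'inplace=True', on a made-up two-row frame (`demo` and `indexed` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame with an ID column that should become the row index.
demo = pd.DataFrame({
    "Unique_ID": ["ABW-1950", "ABW-1951"],
    "country": ["Aruba", "Aruba"],
})

# Without inplace, set_index() returns a new DataFrame and leaves demo alone.
indexed = demo.set_index("Unique_ID")
print(demo.shape)     # still two columns
print(indexed.shape)  # 'Unique_ID' moved from the columns to the row index

# With inplace=True, demo itself is modified.
demo.set_index("Unique_ID", inplace=True)
print(demo.index.name)
print(list(demo.columns))
```

This is also why the printout of the indexed data set reports 47 columns instead of 48: the ID column now serves as the row labels.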
You can select items in a DataFrame by row/column number, or by row/column name:
the first uses 'iloc[]' (integer-location indexing), the second uses 'loc[]' (label-based indexing):
In [16]: pwt9.iloc[0,0]
The command pwt9.loc['ZWE-2010',:] selects the row 'ZWE-2010'; the ',:' selects all columns.
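The difference between the two selectors can be sketched on a made-up two-row frame (`demo` is hypothetical; the labels mimic the 'Unique_ID' values):

```python
import pandas as pd

demo = pd.DataFrame(
    {"country": ["Aruba", "Zimbabwe"], "year": [1950, 2010]},
    index=["ABW-1950", "ZWE-2010"],
)

print(demo.iloc[0, 0])               # first row, first column, by position
print(demo.loc["ZWE-2010", "year"])  # one cell, by row and column label
print(demo.loc["ZWE-2010", :])       # one row, all columns, returned as a Series

# Slices differ: iloc excludes the end position, loc includes the end label.
print(len(demo.iloc[0:1]))
print(len(demo.loc["ABW-1950":"ZWE-2010"]))
```

The slice behaviour is the detail most likely to surprise: a positional slice follows the usual Python convention, while a label slice keeps both endpoints.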
Out[15]:
           country  isocode  year  currency        rgdpe  rgdpo  pop  emp  avh  hc   ...  csh_g  csh_x  csh_m
Unique_ID
ABW-1950   Aruba    ABW      1950  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1951   Aruba    ABW      1951  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1952   Aruba    ABW      1952  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1953   Aruba    ABW      1953  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1954   Aruba    ABW      1954  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
5 rows × 47 columns
Out[16]: 'Aruba'
In [17]: pwt9.loc['ZWE-2010',:]
Out[17]: country Zimbabwe
isocode ZWE
year 2010
currency US Dollar
rgdpe 20652.7
rgdpo 21053.9
pop 13.9739
emp 6.29844
avh NaN
hc 2.3726
ccon 21862.8
cda 26023
cgdpe 20537.3
cgdpo 21236.8
ck 34274.3
ctfp 0.234834
cwtfp 0.277434
rgdpna 19295.2
rconna 20977.2
rdana 25035.1
rkna 78953.3
rtfpna 0.932604
rwtfpna 0.863711
labsh 0.555796
delta 0.0373084
xr 1
pl_con 0.442739
pl_da 0.458783
pl_gdpo 0.443672
i_cig interpolated
i_xm benchmark
i_xr market
i_outlier no
cor_exp NaN
statcap 45.5556
csh_c 0.902229
csh_i 0.195897
csh_g 0.127251
csh_x 0.214657
csh_m -0.454497
csh_r 0.0144622
pl_c 0.44717
pl_i 0.5431
pl_g 0.411316
pl_x 0.701797
pl_m 0.606324
pl_k 1.01514
Name: ZWE-2010, dtype: object
In [18]: pwt9.iloc[0,0:10]
In [19]: pwt9.iloc[0:10,]
To select a column, just use the column name(s) within quotes and brackets.
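A brief sketch of column selection on a made-up frame (`demo`, `one_column`, and `two_columns` are hypothetical names). A single name returns a Series; a list of names returns a DataFrame.

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["Aruba", "Zimbabwe"],
    "year": [1950, 2010],
    "pop": [float("nan"), 13.973897],
})

one_column = demo["country"]             # single name -> Series
two_columns = demo[["country", "year"]]  # list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.shape)
```

Note the double brackets in the second form: the inner pair builds the Python list of names, the outer pair does the selection.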
Out[18]: country Aruba
isocode ABW
year 1950
currency Aruban Guilder
rgdpe NaN
rgdpo NaN
pop NaN
emp NaN
avh NaN
hc NaN
Name: ABW-1950, dtype: object
Out[19]:
           country  isocode  year  currency        rgdpe  rgdpo  pop  emp  avh  hc   ...  csh_g  csh_x  csh_m
Unique_ID
ABW-1950   Aruba    ABW      1950  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1951   Aruba    ABW      1951  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1952   Aruba    ABW      1952  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1953   Aruba    ABW      1953  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1954   Aruba    ABW      1954  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1955   Aruba    ABW      1955  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1956   Aruba    ABW      1956  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1957   Aruba    ABW      1957  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1958   Aruba    ABW      1958  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
ABW-1959   Aruba    ABW      1959  Aruban Guilder  NaN    NaN    NaN  NaN  NaN  NaN  ...  NaN    NaN    NaN
10 rows × 47 columns
In [20]: pwt9['country']
Missing Values (https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
Pandas uses 'NaN' and 'NA' as missing value symbols. You can also have '-inf' and 'inf' treated as missing values
with the command "pandas.options.mode.use_inf_as_na = True" (this option is deprecated in recent Pandas releases).
Python itself uses 'None'.
There are other types, such as 'NaT' for datetime64[ns].
When comparing or testing values, use the functions 'isna()' and 'notna()' rather than '==', because NaN never compares equal to anything, including itself:
Out[20]: Unique_ID
ABW-1950 Aruba
ABW-1951 Aruba
ABW-1952 Aruba
ABW-1953 Aruba
ABW-1954 Aruba
...
ZWE-2010 Zimbabwe
ZWE-2011 Zimbabwe
ZWE-2012 Zimbabwe
ZWE-2013 Zimbabwe
ZWE-2014 Zimbabwe
Name: country, Length: 11830, dtype: object
In [21]: pwt9.loc['ABW-1950',]
In [22]: pwt9.loc['ABW-1950','rgdpe']
In [23]: pwt9.loc['ABW-1950','rgdpe'] == 'nan'
In [24]: pd.isna(pwt9.loc['ABW-1950','rgdpe'])
Out[21]: country Aruba
isocode ABW
year 1950
currency Aruban Guilder
rgdpe NaN
rgdpo NaN
pop NaN
emp NaN
avh NaN
hc NaN
ccon NaN
cda NaN
cgdpe NaN
cgdpo NaN
ck NaN
ctfp NaN
cwtfp NaN
rgdpna NaN
rconna NaN
rdana NaN
rkna NaN
rtfpna NaN
rwtfpna NaN
labsh NaN
delta NaN
xr NaN
pl_con NaN
pl_da NaN
pl_gdpo NaN
i_cig NaN
i_xm NaN
i_xr NaN
i_outlier NaN
cor_exp NaN
statcap NaN
csh_c NaN
csh_i NaN
csh_g NaN
csh_x NaN
csh_m NaN
csh_r NaN
pl_c NaN
pl_i NaN
pl_g NaN
pl_x NaN
pl_m NaN
pl_k NaN
Name: ABW-1950, dtype: object
Out[22]: nan
Out[23]: False
Out[24]: True
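The comparisons above can be reproduced without the data set. This is a minimal sketch; `value` and `pop_demo` are made-up names, and the single non-missing number is copied from the printouts above.

```python
import numpy as np
import pandas as pd

value = np.nan

print(value == "nan")    # False: NaN is not the string 'nan'
print(value == np.nan)   # False: NaN never compares equal, even to itself
print(pd.isna(value))    # True
print(pd.notna(value))   # False

# None and NaT are also recognised as missing.
print(pd.isna(None), pd.isna(pd.NaT))

# On a Series, isna() works element by element.
pop_demo = pd.Series([np.nan, 13.973897], index=["ABW-1950", "ZWE-2010"])
print(pop_demo.isna().tolist())
print(int(pop_demo.isna().sum()))   # number of missing entries
```

Summing the result of isna() is a quick way to count missing values in a column.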
In [30]: import matplotlib.pyplot as plt
import numpy as np
plt.xlabel('Log(Population)')
plt.title('Histogram of Log(Population)')
# Adding the option 'density=True' would give the density.
plt.hist(np.log(pwt9['pop']), bins=30)
plt.show()
The command '%matplotlib inline' in the first cell at the top of the notebook causes Matplotlib to put the plot in the
notebook, rather than in a new window.
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py:853: RuntimeWarning: invalid value encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
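The RuntimeWarning messages above most likely stem from missing values in 'pop', although the notebook does not confirm this. One way to avoid them, assuming it is acceptable to leave missing rows out of the plot, is to drop them before taking logs. The sketch below uses a made-up Series `pop_demo` with a few values from the printouts above, and np.histogram() in place of plt.hist() so that it runs without a display.

```python
import numpy as np
import pandas as pd

# Small stand-in for pwt9['pop'], including missing values.
pop_demo = pd.Series([np.nan, np.nan, 13.973897, 14.255592, 14.565482])

# Drop missing values before taking logs.
log_pop = np.log(pop_demo.dropna())

# np.histogram() computes bin counts and bin edges without drawing anything.
counts, edges = np.histogram(log_pop, bins=3)

print(len(log_pop))   # number of values left after dropping NaN
print(counts.sum())   # every remaining value falls into some bin
print(len(edges))     # one more edge than there are bins
```

In the notebook itself, the equivalent change would be plt.hist(np.log(pwt9['pop'].dropna()), bins=30).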