Attribution-NonCommercial-ShareAlike
4.0 International(CC BY-NC-SA 4.0)
2018
WALTER GENTILE
Ver. 1.0
FROM SQL to PANDAS
Summary
PANDAS: PYTHON DATA ANALYSIS LIBRARY
WHY USE PANDAS?
HOW TO INSTALL
OFFICIAL WEB SITE AND DOCUMENTATION
SELECT … WHERE
SELECT … TOP N
SELECT … DISTINCT
SELECT … IN OR SELECT NOT IN
ORDER BY… ASC OR DESC
GROUP BY COL
SUM, MAX, MIN, COUNT
MERGE JOIN AND CONCATENATE
AN EXAMPLE OF THE EXPRESSIVE POWER OF PANDAS
Pandas: Python Data Analysis Library
pandas is an open source, BSD-licensed library providing high-
performance, easy-to-use data structures and data analysis tools for
the Python programming language.
Why use Pandas?
• A fast and efficient DataFrame object for data manipulation with
integrated indexing;
• Tools for reading and writing data between in-memory data
structures and different formats: CSV and text files, Microsoft
Excel, SQL databases, and the fast HDF5 format;
• Intelligent data alignment and integrated handling of missing
data: gain automatic label-based alignment in computations and
easily manipulate messy data into an orderly form;
• Flexible reshaping and pivoting of data sets;
• Intelligent label-based slicing, fancy indexing, and subsetting of
large data sets;
• Columns can be inserted and deleted from data structures for size
mutability;
• Aggregating or transforming data with a powerful group by engine
allowing split-apply-combine operations on data sets;
• High performance merging and joining of data sets;
• Hierarchical axis indexing provides an intuitive way of working
with high-dimensional data in a lower-dimensional data structure;
• Time series-functionality: date range generation and frequency
conversion, moving window statistics, moving window linear
regressions, date shifting and lagging. Even create domain-specific
time offsets and join time series without losing data;
How to Install
pip install pandas
Official web site and documentation
https://pandas.pydata.org/
https://pandas.pydata.org/pandas-docs/stable/
To describe our examples, we define a general-purpose
DataFrame:
import pandas as pd
df = pd.DataFrame(columns=['col1','col2',…,'colN'])
SELECT … WHERE
SELECT * FROM TABLE WHERE col = 'val'
selected = df[df['col'] == 'val']
SELECT col1,col2,…,colN FROM TABLE WHERE colX = 'val'
selected = df[df['colX'] == 'val'][['col1','col2',…,'colN']]
SELECT * FROM TABLE WHERE col1 = 'v_1' AND col2 = 'v_2'
condition_1 = df['col1'] == 'v_1'
condition_2 = df['col2'] == 'v_2'
selected_values = df[condition_1 & condition_2]
SELECT * FROM TABLE WHERE col1 = 'v_1' OR col2 = 'v_2'
condition_1 = df['col1'] == 'v_1'
condition_2 = df['col2'] == 'v_2'
selected_values = df[condition_1 | condition_2]
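The WHERE patterns above can be tried on a small table. The following is a minimal sketch; the sample data and the column names (city, product, qty) are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", "Turin"],
    "product": ["A", "B", "B", "A"],
    "qty": [10, 5, 7, 3],
})

# SELECT * FROM TABLE WHERE city = 'Rome'
rome = df[df["city"] == "Rome"]

# SELECT product, qty FROM TABLE WHERE city = 'Rome'
# .loc selects rows and columns in a single step.
rome_cols = df.loc[df["city"] == "Rome", ["product", "qty"]]

# ... WHERE city = 'Rome' AND product = 'B'
# Inline conditions need parentheses because & binds tighter than ==.
rome_and_b = df[(df["city"] == "Rome") & (df["product"] == "B")]

# ... WHERE city = 'Rome' OR product = 'B'
rome_or_b = df[(df["city"] == "Rome") | (df["product"] == "B")]

print(rome_and_b)
```

Note that pandas uses the element-wise operators & and | here; the Python keywords and / or do not work on boolean Series.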
SELECT … TOP N
SELECT TOP N * FROM TABLE
selected_values = df.head(N)
SELECT TOP N col1, col2,…, colN FROM TABLE
selected_values = df[['col1','col2',…,'colN']].head(N)
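In SQL, TOP N is often combined with ORDER BY. A small sketch of both cases follows; the data and the column names (name, score) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 9, 1, 7]})

# SELECT TOP 2 * FROM TABLE
first_two = df.head(2)

# SELECT TOP 2 name FROM TABLE
first_two_names = df[["name"]].head(2)

# SELECT TOP 2 * FROM TABLE ORDER BY score DESC
top_two = df.sort_values("score", ascending=False).head(2)

print(top_two)
```

Without an explicit sort, head(N) simply returns the first N rows in the DataFrame's current order.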
SELECT … DISTINCT
SELECT DISTINCT(col) FROM TABLE WHERE condition
distinct_values = df[condition]['col'].unique()
or, without the WHERE clause:
distinct_values = df['col'].unique()
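As a concrete illustration, here is a short sketch with made-up data (the city and qty columns are assumptions). It also shows drop_duplicates, which keeps the result as a DataFrame and so is convenient for DISTINCT over several columns.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", "Turin", "Milan"],
    "qty": [10, 5, 7, 3, 8],
})

# SELECT DISTINCT city FROM TABLE
# unique() returns an array of values in order of first appearance.
cities = df["city"].unique()

# SELECT DISTINCT city FROM TABLE WHERE qty > 5
big_cities = df.loc[df["qty"] > 5, "city"].unique()

# DISTINCT returned as a DataFrame (also works with several columns)
distinct_rows = df[["city"]].drop_duplicates()

print(list(cities), list(big_cities))
```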
SELECT … IN or SELECT NOT IN
SELECT * FROM TABLE WHERE col IN ('v_1','v_2',…,'v_N')
in_val = df[df['col'].isin(['v_1','v_2',…,'v_N'])]
SELECT * FROM TABLE WHERE col NOT IN ('v_1','v_2',…,'v_N')
not_in_val = df[~df['col'].isin(['v_1','v_2',…,'v_N'])]
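A runnable version of the IN / NOT IN pattern, using illustrative data that is not part of the original text:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", "Turin"],
    "qty": [10, 5, 7, 3],
})

wanted = ["Rome", "Turin"]

# SELECT * FROM TABLE WHERE city IN ('Rome', 'Turin')
in_val = df[df["city"].isin(wanted)]

# SELECT * FROM TABLE WHERE city NOT IN ('Rome', 'Turin')
# The ~ operator negates the boolean mask element by element.
not_in_val = df[~df["city"].isin(wanted)]

print(not_in_val)
```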
ORDER BY… ASC or DESC
SELECT * FROM TABLE ORDER BY col ASC
asc_ordered = df.sort_values(['col'], ascending = True)
SELECT * FROM TABLE ORDER BY col DESC
desc_ordered = df.sort_values(['col'], ascending = False)
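sort_values also accepts several columns with a separate direction for each, which corresponds to ORDER BY col1 ASC, col2 DESC. A brief sketch with assumed sample data:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", "Turin"],
    "qty": [10, 5, 7, 3],
})

# SELECT * FROM TABLE ORDER BY qty ASC
asc_ordered = df.sort_values("qty")

# SELECT * FROM TABLE ORDER BY qty DESC
desc_ordered = df.sort_values("qty", ascending=False)

# SELECT * FROM TABLE ORDER BY city ASC, qty DESC
mixed = df.sort_values(["city", "qty"], ascending=[True, False])

print(mixed)
```

sort_values returns a new, sorted DataFrame; the original df is left unchanged.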
GROUP BY col
SELECT col1,… ,colX FROM TABLE GROUP BY col1,…,colX
grouped = df.groupby(['col1',…,'colX'])
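On its own, groupby only returns a grouping object; a result appears once an aggregation is applied to it. The sketch below pairs GROUP BY with SUM and COUNT, using sample data and output column names (total_qty, n_rows) chosen only for this illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", "Turin"],
    "qty": [10, 5, 7, 3],
})

# SELECT city, SUM(qty) AS total_qty, COUNT(qty) AS n_rows
# FROM TABLE GROUP BY city
summary = (
    df.groupby("city", as_index=False)
      .agg(total_qty=("qty", "sum"), n_rows=("qty", "count"))
)

print(summary)
```

With as_index=False the grouping column stays a regular column, which is closer to what a SQL result set looks like. As in SQL's COUNT(qty), "count" ignores missing values.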
SUM, MAX, MIN, COUNT
SELECT SUM(col1) FROM TABLE
total = df['col1'].sum()
SELECT MAX(col1) FROM TABLE
maximum = df['col1'].max()
SELECT MIN(col1) FROM TABLE
minimum = df['col1'].min()
SELECT COUNT(*) FROM TABLE WHERE col = 'val'
count = df[df['col'] == 'val']['col'].count()
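The aggregates can be checked on a few rows. This is a small sketch with assumed data; the variable names avoid sum, max and min so that Python's built-in functions are not shadowed.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", "Turin"],
    "qty": [10, 5, 7, 3],
})

total = df["qty"].sum()      # SELECT SUM(qty) FROM TABLE
highest = df["qty"].max()    # SELECT MAX(qty) FROM TABLE
lowest = df["qty"].min()     # SELECT MIN(qty) FROM TABLE

# SELECT COUNT(*) FROM TABLE WHERE city = 'Rome'
n_rome = len(df[df["city"] == "Rome"])

# Several aggregates in one call; the result is a Series indexed by name.
stats = df["qty"].agg(["sum", "max", "min"])

print(total, highest, lowest, n_rome)
```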
MERGE JOIN and CONCATENATE
pandas provides various facilities for easily combining Series
and DataFrame objects, with various kinds of set logic for the indexes
and relational algebra functionality in the case of join / merge-type
operations.
The official documentation contains many examples, and the best way to
learn is to read them directly:
https://pandas.pydata.org/pandas-docs/stable/merging.html
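As a quick orientation before reading the documentation, the sketch below shows rough equivalents of INNER JOIN, LEFT JOIN and UNION ALL. The two tables (customers, orders) and their columns are invented for this example.

```python
import pandas as pd

# Hypothetical tables.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 4],
    "amount": [20.0, 35.0, 15.0],
})

# SELECT * FROM customers c INNER JOIN orders o
#   ON c.customer_id = o.customer_id
inner = pd.merge(customers, orders, on="customer_id", how="inner")

# SELECT * FROM customers c LEFT JOIN orders o
#   ON c.customer_id = o.customer_id
# Customers without orders get NaN in the order columns.
left = pd.merge(customers, orders, on="customer_id", how="left")

# SELECT * FROM customers UNION ALL SELECT * FROM more_customers
more_customers = pd.DataFrame({"customer_id": [5], "name": ["Dee"]})
all_customers = pd.concat([customers, more_customers], ignore_index=True)

print(left)
```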
An example of the expressive power of Pandas
Suppose we have the following data:
Item Value Return Date
A 0,4 39 30/09/2018
A 0,01 -0,975 29/09/2018
A 0,4 0,42857143 28/09/2018
A 0,28 13 27/09/2018
A 0,02 -0,8888889 26/09/2018
A 0,18 2 25/09/2018
A 0,06 -0,8636364 24/09/2018
A 0,44 3,88888889 23/09/2018
A 0,09 -0,6538462 22/09/2018
A 0,26 1,88888889 21/09/2018
A 0,09 -0,75 20/09/2018
A 0,36 -0,2653061 19/09/2018
A 0,49 0,36111111 18/09/2018
A 0,36 3,5 17/09/2018
A 0,08 -0,8139535 16/09/2018
A 0,43 0,59259259 15/09/2018
A 0,27 0,6875 14/09/2018
A 0,16 -0,6 13/09/2018
A 0,4 0,9047619 12/09/2018
A 0,21 -0,3823529 11/09/2018
A 0,34 0,36 10/09/2018
A 0,25 -0,21875 09/09/2018
A 0,32 0,45454545 08/09/2018
A 0,22 -0,3125 07/09/2018
A 0,32 0,39130435 06/09/2018
A 0,23 -0,3235294 05/09/2018
A 0,34 0,13333333 04/09/2018
A 0,3 -0,2682927 03/09/2018
A 0,41 19,5 02/09/2018
A 0,02 NaN 01/09/2018
We want to calculate the percentage Return between two consecutive
Values, defined as:
$$\text{Return} = \frac{\text{Value}(\text{Date}_2) - \text{Value}(\text{Date}_1)}{\text{Value}(\text{Date}_1)}, \qquad \text{with } \text{Date}_2 > \text{Date}_1$$
SQL Server Solution
One possible solution to this problem, certainly not the only one, uses
T-SQL in a Microsoft SQL Server environment (assuming the data is stored
in a table named T_DATA):
WITH T2 AS (
    SELECT ID = ROW_NUMBER() OVER (ORDER BY [Date] DESC),
           [Date], [Value]
    FROM T_DATA
)
,T3 AS (
    SELECT T1.[Date], T1.[Value],
           [Return] = (ISNULL(T1.[Value], 0) - T2.[Value]) / T2.[Value]
    FROM
    (
        SELECT ID_1 = ROW_NUMBER() OVER (ORDER BY [Date] DESC), *
        FROM T_DATA
    ) T1
    LEFT JOIN T2 ON T1.ID_1 + 1 = T2.ID
)
UPDATE TD
SET TD.[Return] = T3.[Return]
FROM T_DATA AS TD
INNER JOIN T3 ON TD.[Date] = T3.[Date]
PANDAS Solution
Once the data has been loaded into a DataFrame called, for example, df
and sorted by Date in ascending order (pct_change compares each row with
the row before it), we can update the Return column with this single
line of code.
df['Return'] = df['Value'].pct_change()
Unbelievable! 😊
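To see the one-liner at work, here is a minimal sketch that uses the five oldest rows of the table above, written with decimal points. Loading the data by hand and parsing the dates with the day/month/year format are choices made for this example; they are not part of the original text.

```python
import pandas as pd

# The five oldest rows of the table, newest first as in the table.
df = pd.DataFrame({
    "Item": ["A"] * 5,
    "Value": [0.23, 0.34, 0.30, 0.41, 0.02],
    "Date": ["05/09/2018", "04/09/2018", "03/09/2018",
             "02/09/2018", "01/09/2018"],
})

# Parse the dates and sort oldest first, so that each row is
# compared with the previous day.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df = df.sort_values("Date").reset_index(drop=True)

# (Value(Date_2) - Value(Date_1)) / Value(Date_1) for consecutive rows.
df["Return"] = df["Value"].pct_change()

print(df)
```

The first row has no earlier value to compare with, so its Return is NaN, as in the table.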
