YOU MIGHT NOT NEED PANDAS
DATA DAY MEXICO — MEXICO CITY
MARCH 15, 2018
BY REUBEN CUMMINGS
@REUBANO
WHO AM I?
• Managing Director, Nerevu Development
• Founder of Arusha Coders
• Author of several popular Python packages
• @reubano on Twitter and GitHub
WHO DOESN’T LOVE THESE CUTE CUDDLY CREATURES?
WHY PANDAS?
• It’s fast
• It’s ubiquitous
• It gets the job done
Photo credit: Mathias Appel (@mathiasappel)
BECAUSE IT ISN’T THAT BAD AFTER ALL
WHEN SHOULD YOU USE PANDAS?
• Speed is paramount
• You already have pandas code available
• You don’t have time to learn anything new
WELL, THEY NEVER CLAIMED TO BE GRACEFUL…
WHY NOT PANDAS?
• It’s complex
• It’s large
• It likes lots of memory
Photo credit: CSBaltimore
BECAUSE SOMETIMES IT ISN’T THAT GREAT EITHER
WHEN SHOULDN’T YOU USE PANDAS?
• You don’t want all those dependencies
• You like functional programming
• You have tight RAM constraints
PANDAS ALTERNATIVES
• Pure Python
• meza (https://github.com/reubano/meza)
• Other libraries (csvkit, messytables, etc.)
LA MARIPOSA MONARCA
CHOOSE YOUR OWN ADVENTURE ANALYSIS
Photo credit: Pablo Leautaud (@pleautaud)
DISCLAIMER
ALL THE THINGS I AM NOT
• Pandas expert
• Statistician
• Lepidopterologist (butterfly scientist)
MONARCH BUTTERFLY
HIBERNATING COLONIES
WORLD WILDLIFE FUND MÉXICO
Photo credit: Adam Jones (@adam_jones)
THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES
HOW WILL YOU OBTAIN THE DATA?
Year    Occupied Forest (acres)
1993    15.39
1994    19.30
1995    31.16
1996    44.95
1997    14.26
PARSE THE EXCEL FILE
OR
SCRAPE THE HTML FILE
PARSE THE EXCEL FILE (PANDAS)
>>> from pandas import ExcelFile
>>>
>>> book = ExcelFile('data.xlsx')
>>> hibernation = book.parse(
...     'hibernation')
>>>
>>> hibernation.head()
Year Forest Area (acres)
0 1993 15.39
1 1994 19.30
2 1995 31.16
3 1996 44.95
4 1997 14.26
PARSE THE EXCEL FILE (NOT PANDAS)
>>> from xlrd import open_workbook
>>>
>>> book = open_workbook('data.xlsx')
>>> hibernation = book.sheet_by_name(
...     'hibernation')
>>>
>>> hibernation.row_values(0)
['Year', 'Forest Area (acres)']
>>> hibernation.row_values(1)
[1993.0, 15.39]
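To bridge the two styles, the xlrd rows above can be turned into meza-style record dicts by zipping the header with each data row. The rows below are copied from the output shown above rather than re-reading data.xlsx, so this sketch stands alone:

```python
# Header and data rows as returned by xlrd's row_values() above.
header = ['Year', 'Forest Area (acres)']
data_rows = [[1993.0, 15.39], [1994.0, 19.3]]

# One dict per row, keyed by column name.
records = [dict(zip(header, row)) for row in data_rows]
print(records[0])  # {'Year': 1993.0, 'Forest Area (acres)': 15.39}
```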
PARSE THE EXCEL FILE (MEZA)
>>> from meza.io import read_xls
>>> from meza.process import peek
>>>
>>> hibernation = read_xls(
...     'data.xlsx', sanitize=True)
>>>
>>> hibernation, head = peek(hibernation)
>>> head[:3]
[{'forest_area_acres': '15.39', 'year': '1993.0'},
 {'forest_area_acres': '19.3', 'year': '1994.0'},
 {'forest_area_acres': '31.16', 'year': '1995.0'}]
¡FELICIDADES!
YOU’VE OBTAINED ESTIMATES FOR THE FOREST AREA OCCUPIED BY BUTTERFLY COLONIES
• The declining forest area worries you
• You suspect deforestation
• You also suspect pesticides
[Chart: Acres of “butterfly occupied” forest, 1993–1997]
YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA
WHICH DATA SET WILL YOU INVESTIGATE FURTHER?
Year    Occupied Forest (acres)    Deforestation (ha)    Pesticides (millions of lbs)
2003*   23.0                       140.4                 476.5
2005*   10.0                       480.4                 488.2
2006    17.0                       462.4                 485.9
2007    11.4                       244.6                 503.8
2008    12.5                       259.0                 516.1
* two year averages
INVESTIGATE DEFORESTATION
OR
INVESTIGATE PESTICIDES
MONARCH BUTTERFLY
RESERVE DEFORESTATION
WORLD WILDLIFE FUND MÉXICO
INVESTIGATE DEFORESTATION (PANDAS)
>>> from pandas import ExcelFile
>>>
>>> book = ExcelFile('data.xlsx')
>>>
>>> deforestation = book.parse(
...     'deforestation')
>>>
>>> hibernation = book.parse(
...     'hibernation')
>>>
>>> df = deforestation.merge(
...     hibernation, on='Year')
>>>
INVESTIGATE DEFORESTATION (PANDAS)
>>> df.head()
Year Deforested... Forest Area...
0 2003 140.372293 27.48
1 2005 480.449496 14.60
2 2006 462.360283 16.98
3 2007 244.566159 11.39
4 2008 259.037529 12.50
INVESTIGATE DEFORESTATION (PANDAS)
>>> X = df['Deforested Area (ha)']
>>> X[:3]
0 140.372293
1 480.449496
2 462.360283
>>> Y = df['Forest Area (acres)']
>>> Y[:3]
0 27.48
1 14.60
2 16.98
>>> cor_coef = X.corr(Y)
>>> cor_coef
0.5801873106352113
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> from xlrd import open_workbook
>>>
>>> book = open_workbook('data.xlsx')
>>>
>>> deforestation = book.sheet_by_name(
...     'deforestation')
>>>
>>> hibernation = book.sheet_by_name(
...     'hibernation')
>>>
>>> def_years = deforestation.col_values(
...     0, start_rowx=1)
>>>
>>> hiber_years = hibernation.col_values(
...     0, start_rowx=1)
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> common = set(def_years).intersection(
...     hiber_years)
>>>
>>> common
{2003.0, 2005.0, 2006.0, 2007.0, 2008.0,
2009.0, 2010.0, 2011.0, 2012.0, 2013.0,
2014.0}
>>> drows = deforestation.get_rows()
>>> hrows = hibernation.get_rows()
>>>
>>> X = [
...     r[2].value for r in drows
...     if r[0].value in common]
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> Y = [
...     r[1].value for r in hrows
...     if r[0].value in common]
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> from statistics import mean, pstdev
>>> from itertools import starmap
>>> from operator import mul
>>>
>>>
>>> def correlation(X, Y):
...     prod = starmap(mul, zip(X, Y))
...     ave = sum(prod) / len(X)
...     covar = ave - mean(X) * mean(Y)
...     std_prod = pstdev(X) * pstdev(Y)
...     return covar / std_prod
>>>
>>> cor_coef = correlation(X, Y)
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> cor_coef
0.5801873106352112
INVESTIGATE DEFORESTATION (MEZA)
>>> from meza.io import read_xls
>>> from meza.process import tfilter
>>>
>>> deforestation = read_xls(
...     'data.xlsx', sheet=1,
...     sanitize=True)
>>>
>>> hibernation = read_xls(
...     'data.xlsx', sanitize=True)
>>>
>>> pred = lambda y: float(y) in common
>>>
>>> drecords = tfilter(
...     deforestation, 'year', pred)
INVESTIGATE DEFORESTATION (MEZA)
>>> hrecords = tfilter(
...     hibernation, 'year', pred)
>>>
>>> X = [
...     float(r['deforested_area_ha'])
...     for r in drecords]
>>>
>>> Y = [
...     float(r['forest_area_acres'])
...     for r in hrecords]
>>>
>>> cor_coef = correlation(X, Y)
>>> cor_coef
0.5801873106352112
¡FELICIDADES!
YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS
[Chart: Deforestation vs occupied forest, deforested area (ha) on the x-axis and occupied forest (acres) on the y-axis]
YOUR COLLEAGUES WOULD LIKE TO ACCESS YOUR FINDINGS AS EITHER A CSV OR JSON FILE
HOW DO YOU WANT TO SAVE YOUR RESULTS?
SAVE TO CSV
OR
SAVE TO JSON
SAVE TO A JSON FILE (PANDAS)
>>> from pandas import DataFrame
>>>
>>> metric = 'correlation coefficient'
>>> data = {
...     'metric': [metric],
...     'value': [cor_coef]}
>>>
>>> df = DataFrame(data=data)
>>> df
metric value
0 correlation coefficient 0.311468
>>> df.to_json('results.json')
SAVE TO A JSON FILE (NOT PANDAS)
>>> from json import dump
>>>
>>> row = {
...     'metric': metric,
...     'value': cor_coef}
>>>
>>> with open('results.json', 'w') as f:
...     dump([row], f)
SAVE TO A JSON FILE (MEZA)
>>> from meza.convert import records2json
>>> from meza.io import write
>>>
>>> record = {
...     'metric': metric,
...     'value': cor_coef}
>>>
>>> results = records2json([record])
>>> write('results.json', results)
¡FELICIDADES!
YOU’RE NOW ABLE TO READ, MANIPULATE, AND SAVE DATA, WITHOUT EVER TOUCHING PANDAS!
¡GRACIAS!
¿PREGUNTAS?
REUBEN CUMMINGS
@REUBANO
EXTRA SLIDES
SCRAPE THE HTML FILE (PANDAS)
>>> from pandas import read_html
>>>
>>> with open('hibernation.html') as f:
...     df = read_html(f)[0]
...
>>> hibernation = df[1:]
>>>
>>> hibernation.head()
Year Forest Area (acres)
1 1993 15.39
2 1994 19.3
3 1995 31.16
4 1996 44.95
5 1997 14.26
SCRAPE THE HTML FILE (NOT PANDAS)
>>> from bs4 import BeautifulSoup
>>>
>>> with open('hibernation.html') as f:
...     soup = BeautifulSoup(f, 'lxml')
...
>>> trs = soup.table.find_all('tr')
>>>
>>> def gen_rows(trs):
...     for tr in trs:
...         row = tr.find_all('th')
...         row = row or tr.find_all('td')
...         yield tuple(
...             el.text for el in row)
SCRAPE THE HTML FILE (NOT PANDAS)
>>> from itertools import islice
>>>
>>> rows = gen_rows(trs)
>>> list(islice(rows, 3))
[('Year', 'Forest Area (acres)'),
('1993', '15.39'),
('1994', '19.3')]
SCRAPE THE HTML FILE (MEZA)
>>> from meza.io import read_html
>>> from meza.process import peek
>>>
>>> hibernation = read_html(
...     'hibernation.html', sanitize=True)
>>>
>>> hibernation, head = peek(hibernation)
>>> head[:3]
[{'forest_area_acres': '15.39', 'year': '1993'},
 {'forest_area_acres': '19.3', 'year': '1994'},
 {'forest_area_acres': '31.16', 'year': '1995'}]
PESTICIDE USE IN U.S. AGRICULTURE
USDA
INVESTIGATE PESTICIDES (PANDAS)
>>> from pandas import ExcelFile
>>>
>>> book = ExcelFile('data.xlsx')
>>>
>>> pesticides = book.parse(
...     'pesticides')
>>>
>>> hibernation = book.parse(
...     'hibernation')
>>>
>>> df = pesticides.merge(
...     hibernation, on='Year')
INVESTIGATE PESTICIDES (PANDAS)
>>> df.head()
Year Total... Forest Area...
0 1993 549.3853 15.39
1 1994 568.4952 19.30
2 1995 541.9101 31.16
3 1996 597.3228 44.95
4 1997 600.5113 14.26
INVESTIGATE PESTICIDES (PANDAS)
>>> X = df['Total (millions of lbs)']
>>> X[:3]
0 549.3853
1 568.4952
2 541.9101
>>> Y = df['Forest Area (acres)']
>>> Y[:3]
0 15.39
1 19.30
2 31.16
>>> cor_coef = X.corr(Y)
>>> cor_coef
0.3114682506273226
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> from xlrd import open_workbook
>>>
>>> book = open_workbook('data.xlsx')
>>>
>>> pesticides = book.sheet_by_name(
...     'pesticides')
>>>
>>> hibernation = book.sheet_by_name(
...     'hibernation')
>>>
>>> pest_years = pesticides.col_values(
...     0, start_rowx=1)
>>>
>>> hiber_years = hibernation.col_values(
...     0, start_rowx=1)
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> common = set(pest_years).intersection(
...     hiber_years)
>>>
>>> common
{1993.0, 1994.0, 1995.0, 1996.0, 1997.0,
1998.0, 1999.0, 2000.0, 2001.0, 2002.0,
2003.0, 2004.0, 2005.0, …, 2008.0}
>>> prows = pesticides.get_rows()
>>> hrows = hibernation.get_rows()
>>>
>>> X = [
...     r[5].value for r in prows
...     if r[0].value in common]
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> Y = [
...     r[1].value for r in hrows
...     if r[0].value in common]
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> from statistics import mean, pstdev
>>> from itertools import starmap
>>> from operator import mul
>>>
>>>
>>> def correlation(X, Y):
...     prod = starmap(mul, zip(X, Y))
...     ave = sum(prod) / len(X)
...     covar = ave - mean(X) * mean(Y)
...     std_prod = pstdev(X) * pstdev(Y)
...     return covar / std_prod
>>>
>>> cor_coef = correlation(X, Y)
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> cor_coef
0.3114682506273143
INVESTIGATE PESTICIDES (MEZA)
>>> from meza.io import read_xls
>>> from meza.process import tfilter
>>>
>>> pesticides = read_xls(
...     'data.xlsx', sheet=2,
...     sanitize=True)
>>>
>>> hibernation = read_xls(
...     'data.xlsx', sanitize=True)
>>>
>>> pred = lambda y: float(y) in common
>>>
>>> precords = tfilter(
...     pesticides, 'year', pred)
INVESTIGATE PESTICIDES (MEZA)
>>> hrecords = tfilter(
...     hibernation, 'year', pred)
>>>
>>> X = [
...     float(r['total_millions_of_lbs'])
...     for r in precords]
>>>
>>> Y = [
...     float(r['forest_area_acres'])
...     for r in hrecords]
>>>
>>> cor_coef = correlation(X, Y)
>>> cor_coef
0.3114682506273143
¡FELICIDADES!
YOU’VE OBTAINED THE PESTICIDE CORRELATIONS
[Chart: US pesticide use vs occupied forest, pesticide use (millions of lbs) on the x-axis and occupied forest (acres) on the y-axis]
SAVE TO A CSV FILE (PANDAS)
>>> from pandas import DataFrame
>>>
>>> metric = 'correlation coefficient'
>>> data = {
...     'metric': [metric],
...     'value': [cor_coef]}
>>>
>>> df = DataFrame(data=data)
>>> df
metric value
0 correlation coefficient 0.311468
>>> df.to_csv('results.csv')
SAVE TO A CSV FILE (NOT PANDAS)
>>> from csv import DictWriter
>>>
>>> row = {
...     'metric': metric,
...     'value': cor_coef}
>>>
>>> with open('results.csv', 'w') as f:
...     fieldnames = ['metric', 'value']
...     writer = DictWriter(
...         f, fieldnames=fieldnames)
...     writer.writeheader()
...     writer.writerow(row)
SAVE TO A CSV FILE (MEZA)
>>> from meza.convert import records2csv
>>> from meza.io import write
>>>
>>> record = {
...     'metric': metric,
...     'value': cor_coef}
>>>
>>> results = records2csv([record])
>>> write('results.csv', results)
Full-RAG: A modern architecture for hyper-personalization
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 

You might not need pandas - Reuben Cummings

  • 1. YOU MIGHT NOT NEED PANDAS DATA DAY MEXICO — MEXICO CITY MARCH 15, 2018 BY REUBEN CUMMINGS @REUBANO
  • 2. WHO AM I? • Managing Director, Nerevu Development • Founder of Arusha Coders • Author of several popular Python packages • @reubano on Twitter and GitHub
  • 3. WHO DOESN’T LOVE THESE CUTE CUDDLY CREATURES? WHY PANDAS? • It’s fast • It’s ubiquitous • It gets the job done Photo credit: Mathias Appel (@mathiasappel)
  • 4. BECAUSE IT ISN’T THAT BAD AFTER ALL WHEN SHOULD YOU USE PANDAS? • Speed is paramount • You already have pandas code available • You don’t have time to learn anything new
  • 5. WELL, THEY NEVER CLAIMED TO BE GRACEFUL… WHY NOT PANDAS? • It’s complex • It’s large • It likes lots of memory Photo credit: CSBaltimore
  • 6. BECAUSE SOMETIMES IT ISN’T THAT GREAT EITHER WHEN SHOULDN’T YOU USE PANDAS? • You don’t want all those dependencies • You like functional programming • You have tight RAM constraints
  • 7. PANDAS ALTERNATIVES • Pure Python • meza (https://github.com/ reubano/meza) • Other libraries (csvkit, messytables, etc.)
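The pure-Python route needs nothing beyond the standard library. A minimal sketch of reading tabular data with `csv.DictReader` (the inline payload below is a stand-in for a real data file):

```python
import csv
import io

# Inline CSV standing in for a hypothetical 'hibernation.csv' file
data = io.StringIO("Year,Forest Area (acres)\n1993,15.39\n1994,19.30\n")

# csv.DictReader yields one dict per row, with no third-party dependencies
rows = list(csv.DictReader(data))
print(rows[0])  # {'Year': '1993', 'Forest Area (acres)': '15.39'}
```

Like meza, this streams records as dicts of strings; how far that gets you depends on how much of pandas you were actually using.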
  • 8. LA MARIPOSA MONARCA CHOOSE YOUR OWN ADVENTURE ANALYSIS Photo credit: Pablo Leautaud (@pleautaud)
  • 10. ALL THE THINGS I AM NOT DISCLAIMER •Pandas expert •Statistician •Lepidopterologist (butterfly scientist)
  • 12. MONARCH BUTTERFLY HIBERNATING COLONIES WORLD WILDLIFE FUND MÉXICO Photo credit: Adam Jones (@adam_jones)
  • 13. THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES HOW WILL YOU OBTAIN THE DATA?

    Year    Occupied Forest (acres)
    1993    15.39
    1994    19.30
    1995    31.16
    1996    44.95
    1997    14.26
  • 14. THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES HOW WILL YOU OBTAIN THE DATA? PARSE THE EXCEL FILE OR SCRAPE THE HTML FILE
  • 15. PARSE THE EXCEL FILE (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> hibernation.head()
       Year  Forest Area (acres)
    0  1993                15.39
    1  1994                19.30
    2  1995                31.16
    3  1996                44.95
    4  1997                14.26
  • 16. PARSE THE EXCEL FILE (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> hibernation.row_values(0)
    ['Year', 'Forest Area (acres)']
    >>> hibernation.row_values(1)
    [1993.0, 15.39]
  • 17. PARSE THE EXCEL FILE (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import peek
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> hibernation, head = peek(hibernation)
    >>> head[:3]
    [{'forest_area_acres': '15.39', 'year': '1993.0'},
     {'forest_area_acres': '19.3', 'year': '1994.0'},
     {'forest_area_acres': '31.16', 'year': '1995.0'}]
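meza yields records as plain dicts of strings, so converting the fields you need to numbers is an ordinary comprehension. A minimal sketch, using sample records that mirror the output above rather than re-reading the workbook:

```python
# Records shaped like meza's output: every value is a string
head = [
    {'forest_area_acres': '15.39', 'year': '1993.0'},
    {'forest_area_acres': '19.3', 'year': '1994.0'},
]

# Convert on the way out of the record stream
areas = [float(r['forest_area_acres']) for r in head]
years = [int(float(r['year'])) for r in head]
print(years, areas)  # [1993, 1994] [15.39, 19.3]
```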
  • 18. ¡FELICIDADES! YOU’VE OBTAINED ESTIMATES FOR THE FOREST AREA OCCUPIED BY BUTTERFLY COLONIES
  • 19. ¡FELICIDADES! YOU’VE OBTAINED THE FOREST AREA ESTIMATES • The declining forest area worries you • You suspect deforestation • You also suspect pesticides [Chart: ACRES OF “BUTTERFLY OCCUPIED” FOREST, 1993 to 1997]
  • 20. YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA WHICH DATA SET WILL YOU INVESTIGATE FURTHER?

    Year    Occupied Forest (acres)    Deforestation (ha)    Pesticides (millions of lbs)
    2003*   23.0                       140.4                 476.5
    2005*   10.0                       480.4                 488.2
    2006    17.0                       462.4                 485.9
    2007    11.4                       244.6                 503.8
    2008    12.5                       259.0                 516.1

    * two year averages
  • 21. INVESTIGATE DEFORESTATION OR INVESTIGATE PESTICIDES WHICH DATA SET WILL YOU INVESTIGATE FURTHER? YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA
  • 23. INVESTIGATE DEFORESTATION (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>>
    >>> deforestation = book.parse(
    ...     'deforestation')
    >>>
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> df = deforestation.merge(
    ...     hibernation, on='Year')
  • 24. INVESTIGATE DEFORESTATION (PANDAS)
    >>> df.head()
       Year  Deforested...  Forest Area...
    0  2003     140.372293           27.48
    1  2005     480.449496           14.60
    2  2006     462.360283           16.98
    3  2007     244.566159           11.39
    4  2008     259.037529           12.50
  • 25. INVESTIGATE DEFORESTATION (PANDAS)
    >>> X = df['Deforested Area (ha)']
    >>> X[:3]
    0    140.372293
    1    480.449496
    2    462.360283
    >>> Y = df['Forest Area (acres)']
    >>> Y[:3]
    0    27.48
    1    14.60
    2    16.98
    >>> cor_coef = X.corr(Y)
    >>> cor_coef
    0.5801873106352113
  • 26. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>>
    >>> deforestation = book.sheet_by_name(
    ...     'deforestation')
    >>>
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> def_years = deforestation.col_values(
    ...     0, start_rowx=1)
    >>>
    >>> hiber_years = hibernation.col_values(
    ...     0, start_rowx=1)
  • 27. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> common = set(def_years).intersection(
    ...     hiber_years)
    >>>
    >>> common
    {2003.0, 2005.0, 2006.0, 2007.0, 2008.0,
     2009.0, 2010.0, 2011.0, 2012.0, 2013.0,
     2014.0}
    >>> drows = deforestation.get_rows()
    >>> hrows = hibernation.get_rows()
    >>>
    >>> X = [
    ...     r[2].value for r in drows
    ...     if r[0].value in common]
  • 28. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> Y = [
    ...     r[1].value for r in hrows
    ...     if r[0].value in common]
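The xlrd steps above boil down to aligning two year-keyed series on their common years. The same idea in miniature, with plain dicts standing in for the worksheets (the values below are illustrative, not the full data set):

```python
# Two series keyed by year, overlapping only partially
deforested = {2003: 140.4, 2005: 480.4, 2006: 462.4}
occupied = {2005: 14.6, 2006: 16.98, 2007: 11.39}

# Intersect the key sets, then pull values in a consistent order
common = sorted(set(deforested) & set(occupied))
X = [deforested[y] for y in common]
Y = [occupied[y] for y in common]
print(common, X, Y)  # [2005, 2006] [480.4, 462.4] [14.6, 16.98]
```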
  • 29. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> from statistics import mean, pstdev
    >>> from itertools import starmap
    >>> from operator import mul
    >>>
    >>> def correlation(X, Y):
    ...     prod = starmap(mul, zip(X, Y))
    ...     ave = sum(prod) / len(X)
    ...     covar = ave - mean(X) * mean(Y)
    ...     std_prod = pstdev(X) * pstdev(Y)
    ...     return covar / std_prod
    >>>
    >>> cor_coef = correlation(X, Y)
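A quick way to sanity-check the hand-rolled correlation function is to feed it data with a known coefficient: perfectly linear data should score ±1. A self-contained sketch:

```python
from itertools import starmap
from operator import mul
from statistics import mean, pstdev


def correlation(X, Y):
    # Pearson correlation via E[XY] - E[X]E[Y], divided by the
    # product of the population standard deviations
    prod = starmap(mul, zip(X, Y))
    ave = sum(prod) / len(X)
    covar = ave - mean(X) * mean(Y)
    return covar / (pstdev(X) * pstdev(Y))


# Perfectly correlated and anti-correlated data
print(round(correlation([1, 2, 3], [2, 4, 6]), 10))   # 1.0
print(round(correlation([1, 2, 3], [3, 2, 1]), 10))   # -1.0
```

On Python 3.10+ the result can also be cross-checked against `statistics.correlation`.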
  • 30. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> cor_coef
    0.5801873106352112
  • 31. INVESTIGATE DEFORESTATION (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import tfilter
    >>>
    >>> deforestation = read_xls(
    ...     'data.xlsx', sheet=1,
    ...     sanitize=True)
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> pred = lambda y: float(y) in common
    >>>
    >>> drecords = tfilter(
    ...     deforestation, 'year', pred)
  • 32. INVESTIGATE DEFORESTATION (MEZA)
    >>> hrecords = tfilter(
    ...     hibernation, 'year', pred)
    >>>
    >>> X = [
    ...     float(r['deforested_area_ha'])
    ...     for r in drecords]
    >>>
    >>> Y = [
    ...     float(r['forest_area_acres'])
    ...     for r in hrecords]
    >>>
    >>> cor_coef = correlation(X, Y)
    >>> cor_coef
    0.5801873106352112
  • 34. ¡FELICIDADES! YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS [Chart: DEFORESTATION VS OCCUPIED FOREST, occupied forest (acres) against deforested area (ha)]
  • 35. YOUR COLLEAGUES WOULD LIKE TO ACCESS YOUR FINDINGS AS EITHER A CSV OR JSON FILE HOW DO YOU WANT TO SAVE YOUR RESULTS? SAVE TO CSV OR SAVE TO JSON
  • 36. SAVE TO A JSON FILE (PANDAS)
    >>> from pandas import DataFrame
    >>>
    >>> metric = 'correlation coefficient'
    >>> data = {
    ...     'metric': [metric],
    ...     'value': [cor_coef]}
    >>>
    >>> df = DataFrame(data=data)
    >>> df
                        metric     value
    0  correlation coefficient  0.311468
    >>> df.to_json('results.json')
  • 37. SAVE TO A JSON FILE (NOT PANDAS)
    >>> from json import dump
    >>>
    >>> row = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> with open('results.json', 'w') as f:
    ...     dump([row], f)
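The `json` route round-trips cleanly, so reading the results back is symmetric with writing them. A minimal sketch (the coefficient value mirrors the deforestation result earlier in the deck):

```python
import json

# The single-row result from the analysis
row = {'metric': 'correlation coefficient', 'value': 0.5801873106352112}

# Write, then read back: json.load reverses json.dump exactly
with open('results.json', 'w') as f:
    json.dump([row], f)

with open('results.json') as f:
    loaded = json.load(f)

print(loaded[0]['metric'])  # correlation coefficient
```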
  • 38. SAVE TO A JSON FILE (MEZA)
    >>> from meza.convert import records2json
    >>> from meza.io import write
    >>>
    >>> record = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> results = records2json([record])
    >>> write('results.json', results)
  • 39. ¡FELICIDADES! YOU’RE NOW ABLE TO READ, MANIPULATE, AND SAVE DATA. WITHOUT EVER TOUCHING PANDAS!
  • 43. SCRAPE THE HTML FILE (PANDAS)
    >>> from pandas import read_html
    >>>
    >>> with open('hibernation.html') as f:
    ...     df = read_html(f)[0]
    ...     hibernation = df[1:]
    >>>
    >>> hibernation.head()
      Year  Forest Area (acres)
    1 1993                15.39
    2 1994                 19.3
    3 1995                31.16
    4 1996                44.95
    5 1997                14.26
  • 44. SCRAPE THE HTML FILE (NOT PANDAS)
    >>> from bs4 import BeautifulSoup
    >>>
    >>> with open('hibernation.html') as f:
    ...     soup = BeautifulSoup(f, 'lxml')
    ...     trs = soup.table.find_all('tr')
    >>>
    >>> def gen_rows(trs):
    ...     for tr in trs:
    ...         row = tr.find_all('th')
    ...         row = row or tr.find_all('td')
    ...         yield tuple(
    ...             el.text for el in row)
  • 45. SCRAPE THE HTML FILE (NOT PANDAS)
    >>> from itertools import islice
    >>>
    >>> rows = gen_rows(trs)
    >>> list(islice(rows, 3))
    [('Year', 'Forest Area (acres)'),
     ('1993', '15.39'),
     ('1994', '19.3')]
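If even BeautifulSoup is a dependency too far, the standard library's `html.parser` can scrape a simple table on its own. A minimal sketch, with an inline HTML snippet standing in for hibernation.html:

```python
from html.parser import HTMLParser

# Inline stand-in for the real hibernation.html table
HTML = """<table>
<tr><th>Year</th><th>Forest Area (acres)</th></tr>
<tr><td>1993</td><td>15.39</td></tr>
<tr><td>1994</td><td>19.3</td></tr>
</table>"""


class TableParser(HTMLParser):
    """Collects each <tr> as a tuple of its cell texts."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.rows.append(tuple(self._row))
        elif tag in ('td', 'th'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data)


parser = TableParser()
parser.feed(HTML)
print(parser.rows[:2])
# [('Year', 'Forest Area (acres)'), ('1993', '15.39')]
```

This trades BeautifulSoup's robustness for zero dependencies; for messy real-world HTML the bs4 version above is the safer choice.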
  • 46. SCRAPE THE HTML FILE (MEZA)
    >>> from meza.io import read_html
    >>> from meza.process import peek
    >>>
    >>> hibernation = read_html(
    ...     'hibernation.html', sanitize=True)
    >>>
    >>> hibernation, head = peek(hibernation)
    >>> head[:3]
    [{'forest_area_acres': '15.39', 'year': '1993'},
     {'forest_area_acres': '19.3', 'year': '1994'},
     {'forest_area_acres': '31.16', 'year': '1995'}]
  • 47. PESTICIDE USE IN U.S. AGRICULTURE USDA
  • 48. INVESTIGATE PESTICIDES (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>>
    >>> pesticides = book.parse(
    ...     'pesticides')
    >>>
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> df = pesticides.merge(
    ...     hibernation, on='Year')
  • 49. INVESTIGATE PESTICIDES (PANDAS)
    >>> df.head()
       Year  Total...  Forest Area...
    0  1993  549.3853           15.39
    1  1994  568.4952           19.30
    2  1995  541.9101           31.16
    3  1996  597.3228           44.95
    4  1997  600.5113           14.26
  • 50. INVESTIGATE PESTICIDES (PANDAS)
    >>> X = df['Total (millions of lbs)']
    >>> X[:3]
    0    549.3853
    1    568.4952
    2    541.9101
    >>> Y = df['Forest Area (acres)']
    >>> Y[:3]
    0    15.39
    1    19.30
    2    31.16
    >>> cor_coef = X.corr(Y)
    >>> cor_coef
    0.3114682506273226
  • 51. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>>
    >>> pesticides = book.sheet_by_name(
    ...     'pesticides')
    >>>
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> pest_years = pesticides.col_values(
    ...     0, start_rowx=1)
    >>>
    >>> hiber_years = hibernation.col_values(
    ...     0, start_rowx=1)
  • 52. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> common = set(pest_years).intersection(
    ...     hiber_years)
    >>>
    >>> common
    {1993.0, 1994.0, 1995.0, 1996.0, 1997.0,
     1998.0, 1999.0, 2000.0, 2001.0, 2002.0,
     2003.0, 2004.0, 2005.0, …, 2008.0}
    >>> prows = pesticides.get_rows()
    >>> hrows = hibernation.get_rows()
    >>>
    >>> X = [
    ...     r[5].value for r in prows
    ...     if r[0].value in common]
  • 53. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> Y = [
    ...     r[1].value for r in hrows
    ...     if r[0].value in common]
  • 54. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> from statistics import mean, pstdev
    >>> from itertools import starmap
    >>> from operator import mul
    >>>
    >>> def correlation(X, Y):
    ...     prod = starmap(mul, zip(X, Y))
    ...     ave = sum(prod) / len(X)
    ...     covar = ave - mean(X) * mean(Y)
    ...     std_prod = pstdev(X) * pstdev(Y)
    ...     return covar / std_prod
    >>>
    >>> cor_coef = correlation(X, Y)
  • 55. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> cor_coef
    0.3114682506273143
  • 56. INVESTIGATE PESTICIDES (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import tfilter
    >>>
    >>> pesticides = read_xls(
    ...     'data.xlsx', sheet=2,
    ...     sanitize=True)
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> pred = lambda y: float(y) in common
    >>>
    >>> precords = tfilter(
    ...     pesticides, 'year', pred)
  • 57. INVESTIGATE PESTICIDES (MEZA)
    >>> hrecords = tfilter(
    ...     hibernation, 'year', pred)
    >>>
    >>> X = [
    ...     float(r['total_millions_of_lbs'])
    ...     for r in precords]
    >>>
    >>> Y = [
    ...     float(r['forest_area_acres'])
    ...     for r in hrecords]
    >>>
    >>> cor_coef = correlation(X, Y)
    >>> cor_coef
    0.3114682506273143
  • 59. ¡FELICIDADES! YOU’VE OBTAINED THE PESTICIDE CORRELATIONS [Chart: US PESTICIDE USE VS OCCUPIED FOREST, occupied forest (acres) against pesticide use (millions of lbs)]
  • 60. SAVE TO A CSV FILE (PANDAS)
    >>> from pandas import DataFrame
    >>>
    >>> metric = 'correlation coefficient'
    >>> data = {
    ...     'metric': [metric],
    ...     'value': [cor_coef]}
    >>>
    >>> df = DataFrame(data=data)
    >>> df
                        metric     value
    0  correlation coefficient  0.311468
    >>> df.to_csv('results.csv')
  • 61. SAVE TO A CSV FILE (NOT PANDAS)
    >>> from csv import DictWriter
    >>>
    >>> row = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> with open('results.csv', 'w') as f:
    ...     fieldnames = ['metric', 'value']
    ...     writer = DictWriter(
    ...         f, fieldnames=fieldnames)
    ...     writer.writeheader()
    ...     writer.writerow(row)
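Unlike the JSON round-trip, `csv.DictReader` hands everything back as strings, so reading the results file requires converting the value. A sketch mirroring the row above (the coefficient value echoes the pesticide result):

```python
import csv

row = {'metric': 'correlation coefficient', 'value': 0.3114682506273143}

# Write with DictWriter; newline='' avoids blank lines on Windows
with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['metric', 'value'])
    writer.writeheader()
    writer.writerow(row)

# Read back with DictReader -- note every field comes back as a string
with open('results.csv', newline='') as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]['metric'])  # correlation coefficient
```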
  • 62. SAVE TO A CSV FILE (MEZA)
    >>> from meza.convert import records2csv
    >>> from meza.io import write
    >>>
    >>> record = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> results = records2csv([record])
    >>> write('results.csv', results)