YOU MIGHT NOT NEED PANDAS
DATA DAY MEXICO — MEXICO CITY
MARCH 15, 2018
BY REUBEN CUMMINGS
@REUBANO
WHO AM I?
• Managing Director, Nerevu Development
• Founder of Arusha Coders
• Author of several popular Python packages
• @reubano on Twitter and GitHub
WHO DOESN’T LOVE THESE CUTE CUDDLY CREATURES?
WHY PANDAS?
• It’s fast
• It’s ubiquitous
• It gets the job done
Photo credit: Mathias Appel (@mathiasappel)
BECAUSE IT ISN’T THAT BAD AFTER ALL
WHEN SHOULD YOU USE PANDAS?
• Speed is paramount
• You already have pandas code available
• You don’t have time to learn anything new
WELL, THEY NEVER CLAIMED TO BE GRACEFUL…
WHY NOT PANDAS?
• It’s complex
• It’s large
• It likes lots of memory
Photo credit: CSBaltimore
BECAUSE SOMETIMES IT ISN’T THAT GREAT EITHER
WHEN SHOULDN’T YOU USE PANDAS?
• You don’t want all those dependencies
• You like functional programming
• You have tight RAM constraints
PANDAS ALTERNATIVES
• Pure Python
• meza (https://github.com/reubano/meza)
• Other libraries (csvkit, messytables, etc.)
LA MARIPOSA MONARCA
CHOOSE YOUR OWN ADVENTURE ANALYSIS
Photo credit: Pablo Leautaud (@pleautaud)
DISCLAIMER
ALL THE THINGS I AM NOT
• Pandas expert
• Statistician
• Lepidopterologist (butterfly scientist)
MONARCH BUTTERFLY
HIBERNATING COLONIES
WORLD WILDLIFE FUND MÉXICO
Photo credit: Adam Jones (@adam_jones)
THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES
HOW WILL YOU OBTAIN THE DATA?
Year    Occupied Forest (acres)
1993    15.39
1994    19.30
1995    31.16
1996    44.95
1997    14.26
PARSE THE EXCEL FILE
OR
SCRAPE THE HTML FILE
PARSE THE EXCEL FILE (PANDAS)
>>> from pandas import ExcelFile
>>>
>>> book = ExcelFile('data.xlsx')
>>> hibernation = book.parse(
...     'hibernation')
>>>
>>> hibernation.head()
Year Forest Area (acres)
0 1993 15.39
1 1994 19.30
2 1995 31.16
3 1996 44.95
4 1997 14.26
PARSE THE EXCEL FILE (NOT PANDAS)
>>> from xlrd import open_workbook
>>>
>>> book = open_workbook('data.xlsx')
>>> hibernation = book.sheet_by_name(
...     'hibernation')
>>>
>>> hibernation.row_values(0)
['Year', 'Forest Area (acres)']
>>> hibernation.row_values(1)
[1993.0, 15.39]
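To bridge the two styles, the xlrd rows above can be turned into meza-style record dicts by zipping the header with each data row. The rows below are copied from the output shown above rather than re-reading data.xlsx, so this sketch stands alone:

```python
# Header and data rows as returned by xlrd's row_values() above.
header = ['Year', 'Forest Area (acres)']
data_rows = [[1993.0, 15.39], [1994.0, 19.3]]

# One dict per row, keyed by column name.
records = [dict(zip(header, row)) for row in data_rows]
print(records[0])  # {'Year': 1993.0, 'Forest Area (acres)': 15.39}
```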
PARSE THE EXCEL FILE (MEZA)
>>> from meza.io import read_xls
>>> from meza.process import peek
>>>
>>> hibernation = read_xls(
...     'data.xlsx', sanitize=True)
>>>
>>> hibernation, head = peek(hibernation)
>>> head[:3]
[{'forest_area_acres': '15.39', 'year': '1993.0'},
 {'forest_area_acres': '19.3', 'year': '1994.0'},
 {'forest_area_acres': '31.16', 'year': '1995.0'}]
¡FELICIDADES!
YOU’VE OBTAINED ESTIMATES FOR THE FOREST AREA OCCUPIED BY BUTTERFLY COLONIES
• The declining forest area worries you
• You suspect deforestation
• You also suspect pesticides
[Chart: Acres of “butterfly occupied” forest, 1993–1997]
YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA
WHICH DATA SET WILL YOU INVESTIGATE FURTHER?
Year    Occupied Forest (acres)    Deforestation (ha)    Pesticides (millions of lbs)
2003*   23.0                       140.4                 476.5
2005*   10.0                       480.4                 488.2
2006    17.0                       462.4                 485.9
2007    11.4                       244.6                 503.8
2008    12.5                       259.0                 516.1
* two year averages
INVESTIGATE DEFORESTATION
OR
INVESTIGATE PESTICIDES
MONARCH BUTTERFLY
RESERVE DEFORESTATION
WORLD WILDLIFE FUND MÉXICO
INVESTIGATE DEFORESTATION (PANDAS)
>>> from pandas import ExcelFile
>>>
>>> book = ExcelFile('data.xlsx')
>>>
>>> deforestation = book.parse(
...     'deforestation')
>>>
>>> hibernation = book.parse(
...     'hibernation')
>>>
>>> df = deforestation.merge(
...     hibernation, on='Year')
>>>
INVESTIGATE DEFORESTATION (PANDAS)
>>> df.head()
Year Deforested... Forest Area...
0 2003 140.372293 27.48
1 2005 480.449496 14.60
2 2006 462.360283 16.98
3 2007 244.566159 11.39
4 2008 259.037529 12.50
INVESTIGATE DEFORESTATION (PANDAS)
>>> X = df['Deforested Area (ha)']
>>> X[:3]
0 140.372293
1 480.449496
2 462.360283
>>> Y = df['Forest Area (acres)']
>>> Y[:3]
0 27.48
1 14.60
2 16.98
>>> cor_coef = X.corr(Y)
>>> cor_coef
0.5801873106352113
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> from xlrd import open_workbook
>>>
>>> book = open_workbook('data.xlsx')
>>>
>>> deforestation = book.sheet_by_name(
...     'deforestation')
>>>
>>> hibernation = book.sheet_by_name(
...     'hibernation')
>>>
>>> def_years = deforestation.col_values(
...     0, start_rowx=1)
>>>
>>> hiber_years = hibernation.col_values(
...     0, start_rowx=1)
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> common = set(def_years).intersection(
...     hiber_years)
>>>
>>> common
{2003.0, 2005.0, 2006.0, 2007.0, 2008.0,
2009.0, 2010.0, 2011.0, 2012.0, 2013.0,
2014.0}
>>> drows = deforestation.get_rows()
>>> hrows = hibernation.get_rows()
>>>
>>> X = [
...     r[2].value for r in drows
...     if r[0].value in common]
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> Y = [
...     r[1].value for r in hrows
...     if r[0].value in common]
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> from statistics import mean, pstdev
>>> from itertools import starmap
>>> from operator import mul
>>>
>>>
>>> def correlation(X, Y):
...     prod = starmap(mul, zip(X, Y))
...     ave = sum(prod) / len(X)
...     covar = ave - mean(X) * mean(Y)
...     std_prod = pstdev(X) * pstdev(Y)
...     return covar / std_prod
>>>
>>> cor_coef = correlation(X, Y)
INVESTIGATE DEFORESTATION (NOT PANDAS)
>>> cor_coef
0.5801873106352112
INVESTIGATE DEFORESTATION (MEZA)
>>> from meza.io import read_xls
>>> from meza.process import tfilter
>>>
>>> deforestation = read_xls(
...     'data.xlsx', sheet=1,
...     sanitize=True)
>>>
>>> hibernation = read_xls(
...     'data.xlsx', sanitize=True)
>>>
>>> pred = lambda y: float(y) in common
>>>
>>> drecords = tfilter(
...     deforestation, 'year', pred)
INVESTIGATE DEFORESTATION (MEZA)
>>> hrecords = tfilter(
...     hibernation, 'year', pred)
>>>
>>> X = [
...     float(r['deforested_area_ha'])
...     for r in drecords]
>>>
>>> Y = [
...     float(r['forest_area_acres'])
...     for r in hrecords]
>>>
>>> cor_coef = correlation(X, Y)
>>> cor_coef
0.5801873106352112
¡FELICIDADES!
YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS
[Chart: Deforestation vs occupied forest, deforested area (ha) on the x-axis and occupied forest (acres) on the y-axis]
YOUR COLLEAGUES WOULD LIKE TO ACCESS YOUR FINDINGS AS EITHER A CSV OR JSON FILE
HOW DO YOU WANT TO SAVE YOUR RESULTS?
SAVE TO CSV
OR
SAVE TO JSON
SAVE TO A JSON FILE (PANDAS)
>>> from pandas import DataFrame
>>>
>>> metric = 'correlation coefficient'
>>> data = {
...     'metric': [metric],
...     'value': [cor_coef]}
>>>
>>> df = DataFrame(data=data)
>>> df
metric value
0 correlation coefficient 0.311468
>>> df.to_json('results.json')
SAVE TO A JSON FILE (NOT PANDAS)
>>> from json import dump
>>>
>>> row = {
...     'metric': metric,
...     'value': cor_coef}
>>>
>>> with open('results.json', 'w') as f:
...     dump([row], f)
SAVE TO A JSON FILE (MEZA)
>>> from meza.convert import records2json
>>> from meza.io import write
>>>
>>> record = {
...     'metric': metric,
...     'value': cor_coef}
>>>
>>> results = records2json([record])
>>> write('results.json', results)
¡FELICIDADES!
YOU’RE NOW ABLE TO READ, MANIPULATE, AND SAVE DATA, WITHOUT EVER TOUCHING PANDAS!
¡GRACIAS!
¿PREGUNTAS?
REUBEN CUMMINGS
@REUBANO
EXTRA SLIDES
SCRAPE THE HTML FILE (PANDAS)
>>> from pandas import read_html
>>>
>>> with open('hibernation.html') as f:
...     df = read_html(f)[0]
...
>>> hibernation = df[1:]
>>>
>>> hibernation.head()
Year Forest Area (acres)
1 1993 15.39
2 1994 19.3
3 1995 31.16
4 1996 44.95
5 1997 14.26
SCRAPE THE HTML FILE (NOT PANDAS)
>>> from bs4 import BeautifulSoup
>>>
>>> with open('hibernation.html') as f:
...     soup = BeautifulSoup(f, 'lxml')
...
>>> trs = soup.table.find_all('tr')
>>>
>>> def gen_rows(trs):
...     for tr in trs:
...         row = tr.find_all('th')
...         row = row or tr.find_all('td')
...         yield tuple(
...             el.text for el in row)
SCRAPE THE HTML FILE (NOT PANDAS)
>>> from itertools import islice
>>>
>>> rows = gen_rows(trs)
>>> list(islice(rows, 3))
[('Year', 'Forest Area (acres)'),
('1993', '15.39'),
('1994', '19.3')]
SCRAPE THE HTML FILE (MEZA)
>>> from meza.io import read_html
>>> from meza.process import peek
>>>
>>> hibernation = read_html(
...     'hibernation.html', sanitize=True)
>>>
>>> hibernation, head = peek(hibernation)
>>> head[:3]
[{'forest_area_acres': '15.39', 'year': '1993'},
 {'forest_area_acres': '19.3', 'year': '1994'},
 {'forest_area_acres': '31.16', 'year': '1995'}]
PESTICIDE USE IN U.S. AGRICULTURE
USDA
INVESTIGATE PESTICIDES (PANDAS)
>>> from pandas import ExcelFile
>>>
>>> book = ExcelFile('data.xlsx')
>>>
>>> pesticides = book.parse(
...     'pesticides')
>>>
>>> hibernation = book.parse(
...     'hibernation')
>>>
>>> df = pesticides.merge(
...     hibernation, on='Year')
INVESTIGATE PESTICIDES (PANDAS)
>>> df.head()
Year Total... Forest Area...
0 1993 549.3853 15.39
1 1994 568.4952 19.30
2 1995 541.9101 31.16
3 1996 597.3228 44.95
4 1997 600.5113 14.26
INVESTIGATE PESTICIDES (PANDAS)
>>> X = df['Total (millions of lbs)']
>>> X[:3]
0 549.3853
1 568.4952
2 541.9101
>>> Y = df['Forest Area (acres)']
>>> Y[:3]
0 15.39
1 19.30
2 31.16
>>> cor_coef = X.corr(Y)
>>> cor_coef
0.3114682506273226
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> from xlrd import open_workbook
>>>
>>> book = open_workbook('data.xlsx')
>>>
>>> pesticides = book.sheet_by_name(
...     'pesticides')
>>>
>>> hibernation = book.sheet_by_name(
...     'hibernation')
>>>
>>> pest_years = pesticides.col_values(
...     0, start_rowx=1)
>>>
>>> hiber_years = hibernation.col_values(
...     0, start_rowx=1)
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> common = set(pest_years).intersection(
...     hiber_years)
>>>
>>> common
{1993.0, 1994.0, 1995.0, 1996.0, 1997.0,
1998.0, 1999.0, 2000.0, 2001.0, 2002.0,
2003.0, 2004.0, 2005.0, …, 2008.0}
>>> prows = pesticides.get_rows()
>>> hrows = hibernation.get_rows()
>>>
>>> X = [
...     r[5].value for r in prows
...     if r[0].value in common]
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> Y = [
...     r[1].value for r in hrows
...     if r[0].value in common]
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> from statistics import mean, pstdev
>>> from itertools import starmap
>>> from operator import mul
>>>
>>>
>>> def correlation(X, Y):
...     prod = starmap(mul, zip(X, Y))
...     ave = sum(prod) / len(X)
...     covar = ave - mean(X) * mean(Y)
...     std_prod = pstdev(X) * pstdev(Y)
...     return covar / std_prod
>>>
>>> cor_coef = correlation(X, Y)
INVESTIGATE PESTICIDES (NOT PANDAS)
>>> cor_coef
0.3114682506273143
INVESTIGATE PESTICIDES (MEZA)
>>> from meza.io import read_xls
>>> from meza.process import tfilter
>>>
>>> pesticides = read_xls(
...     'data.xlsx', sheet=2,
...     sanitize=True)
>>>
>>> hibernation = read_xls(
...     'data.xlsx', sanitize=True)
>>>
>>> pred = lambda y: float(y) in common
>>>
>>> precords = tfilter(
...     pesticides, 'year', pred)
INVESTIGATE PESTICIDES (MEZA)
>>> hrecords = tfilter(
...     hibernation, 'year', pred)
>>>
>>> X = [
...     float(r['total_millions_of_lbs'])
...     for r in precords]
>>>
>>> Y = [
...     float(r['forest_area_acres'])
...     for r in hrecords]
>>>
>>> cor_coef = correlation(X, Y)
>>> cor_coef
0.3114682506273143
¡FELICIDADES!
YOU’VE OBTAINED THE PESTICIDE CORRELATIONS
[Chart: US pesticide use vs occupied forest, pesticide use (millions of lbs) on the x-axis and occupied forest (acres) on the y-axis]
SAVE TO A CSV FILE (PANDAS)
>>> from pandas import DataFrame
>>>
>>> metric = 'correlation coefficient'
>>> data = {
...     'metric': [metric],
...     'value': [cor_coef]}
>>>
>>> df = DataFrame(data=data)
>>> df
metric value
0 correlation coefficient 0.311468
>>> df.to_csv('results.csv')
SAVE TO A CSV FILE (NOT PANDAS)
>>> from csv import DictWriter
>>>
>>> row = {
...     'metric': metric,
...     'value': cor_coef}
>>>
>>> with open('results.csv', 'w') as f:
...     fieldnames = ['metric', 'value']
...     writer = DictWriter(
...         f, fieldnames=fieldnames)
...     writer.writeheader()
...     writer.writerow(row)
SAVE TO A CSV FILE (MEZA)
>>> from meza.convert import records2csv
>>> from meza.io import write
>>>
>>> record = {
...     'metric': metric,
...     'value': cor_coef}
>>>
>>> results = records2csv([record])
>>> write('results.csv', results)
Full-RAG: A modern architecture for hyper-personalization
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 

You might not need pandas - Reuben Cummings

  • 1. YOU MIGHT NOT NEED PANDAS DATA DAY MEXICO — MEXICO CITY MARCH 15, 2018 BY REUBEN CUMMINGS @REUBANO
  • 2. WHO AM I? • Managing Director, Nerevu Development • Founder of Arusha Coders • Author of several popular Python packages • @reubano on Twitter and GitHub
  • 3. WHO DOESN’T LOVE THESE CUTE CUDDLY CREATURES? WHY PANDAS? • It’s fast • It’s ubiquitous • It gets the job done Photo credit: Mathias Appel (@mathiasappel)
  • 4. BECAUSE IT ISN’T THAT BAD AFTER ALL WHEN SHOULD YOU USE PANDAS? • Speed is paramount • You already have pandas code available • You don’t have time to learn anything new
  • 5. WELL, THEY NEVER CLAIMED TO BE GRACEFUL… WHY NOT PANDAS? • It’s complex • It’s large • It likes lots of memory Photo credit: CSBaltimore
  • 6. BECAUSE SOMETIMES IT ISN’T THAT GREAT EITHER WHEN SHOULDN’T YOU USE PANDAS? • You don’t want all those dependencies • You like functional programming • You have tight RAM constraints
  • 7. PANDAS ALTERNATIVES • Pure Python • meza (https://github.com/ reubano/meza) • Other libraries (csvkit, messytables, etc.)
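The pure-Python route needs nothing beyond the standard library. A minimal sketch of reading tabular data with `csv.DictReader` (the inline payload below is a stand-in for a real data file):

```python
import csv
import io

# Inline CSV standing in for a hypothetical 'hibernation.csv' file
data = io.StringIO("Year,Forest Area (acres)\n1993,15.39\n1994,19.30\n")

# csv.DictReader yields one dict per row, with no third-party dependencies
rows = list(csv.DictReader(data))
print(rows[0])  # {'Year': '1993', 'Forest Area (acres)': '15.39'}
```

Like meza, this streams records as dicts of strings; how far that gets you depends on how much of pandas you were actually using.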
  • 8. LA MARIPOSA MONARCA CHOOSE YOUR OWN ADVENTURE ANALYSIS Photo credit: Pablo Leautaud (@pleautaud)
  • 10. ALL THE THINGS I AM NOT DISCLAIMER •Pandas expert •Statistician •Lepidopterologist (butterfly scientist)
  • 12. MONARCH BUTTERFLY HIBERNATING COLONIES WORLD WILDLIFE FUND MÉXICO Photo credit: Adam Jones (@adam_jones)
  • 13. THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES HOW WILL YOU OBTAIN THE DATA?

    Year    Occupied Forest (acres)
    1993    15.39
    1994    19.30
    1995    31.16
    1996    44.95
    1997    14.26
  • 14. THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES HOW WILL YOU OBTAIN THE DATA? PARSE THE EXCEL FILE OR SCRAPE THE HTML FILE
  • 15. PARSE THE EXCEL FILE (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> hibernation.head()
       Year  Forest Area (acres)
    0  1993                15.39
    1  1994                19.30
    2  1995                31.16
    3  1996                44.95
    4  1997                14.26
  • 16. PARSE THE EXCEL FILE (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> hibernation.row_values(0)
    ['Year', 'Forest Area (acres)']
    >>> hibernation.row_values(1)
    [1993.0, 15.39]
  • 17. PARSE THE EXCEL FILE (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import peek
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> hibernation, head = peek(hibernation)
    >>> head[:3]
    [{'forest_area_acres': '15.39', 'year': '1993.0'},
     {'forest_area_acres': '19.3', 'year': '1994.0'},
     {'forest_area_acres': '31.16', 'year': '1995.0'}]
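meza yields records as plain dicts of strings, so converting the fields you need to numbers is an ordinary comprehension. A minimal sketch, using sample records that mirror the output above rather than re-reading the workbook:

```python
# Records shaped like meza's output: every value is a string
head = [
    {'forest_area_acres': '15.39', 'year': '1993.0'},
    {'forest_area_acres': '19.3', 'year': '1994.0'},
]

# Convert on the way out of the record stream
areas = [float(r['forest_area_acres']) for r in head]
years = [int(float(r['year'])) for r in head]
print(years, areas)  # [1993, 1994] [15.39, 19.3]
```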
  • 18. ¡FELICIDADES! YOU’VE OBTAINED ESTIMATES FOR THE FOREST AREA OCCUPIED BY BUTTERFLY COLONIES
  • 19. ¡FELICIDADES! YOU’VE OBTAINED THE FOREST AREA ESTIMATES • The declining forest area worries you • You suspect deforestation • You also suspect pesticides [Chart: ACRES OF “BUTTERFLY OCCUPIED” FOREST, 1993 to 1997]
  • 20. YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA WHICH DATA SET WILL YOU INVESTIGATE FURTHER?

    Year    Occupied Forest (acres)    Deforestation (ha)    Pesticides (millions of lbs)
    2003*   23.0                       140.4                 476.5
    2005*   10.0                       480.4                 488.2
    2006    17.0                       462.4                 485.9
    2007    11.4                       244.6                 503.8
    2008    12.5                       259.0                 516.1

    * two year averages
  • 21. INVESTIGATE DEFORESTATION OR INVESTIGATE PESTICIDES WHICH DATA SET WILL YOU INVESTIGATE FURTHER? YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA
  • 23. INVESTIGATE DEFORESTATION (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>>
    >>> deforestation = book.parse(
    ...     'deforestation')
    >>>
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> df = deforestation.merge(
    ...     hibernation, on='Year')
  • 24. INVESTIGATE DEFORESTATION (PANDAS)
    >>> df.head()
       Year  Deforested...  Forest Area...
    0  2003     140.372293           27.48
    1  2005     480.449496           14.60
    2  2006     462.360283           16.98
    3  2007     244.566159           11.39
    4  2008     259.037529           12.50
  • 25. INVESTIGATE DEFORESTATION (PANDAS)
    >>> X = df['Deforested Area (ha)']
    >>> X[:3]
    0    140.372293
    1    480.449496
    2    462.360283
    >>> Y = df['Forest Area (acres)']
    >>> Y[:3]
    0    27.48
    1    14.60
    2    16.98
    >>> cor_coef = X.corr(Y)
    >>> cor_coef
    0.5801873106352113
  • 26. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>>
    >>> deforestation = book.sheet_by_name(
    ...     'deforestation')
    >>>
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> def_years = deforestation.col_values(
    ...     0, start_rowx=1)
    >>>
    >>> hiber_years = hibernation.col_values(
    ...     0, start_rowx=1)
  • 27. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> common = set(def_years).intersection(
    ...     hiber_years)
    >>>
    >>> common
    {2003.0, 2005.0, 2006.0, 2007.0, 2008.0,
     2009.0, 2010.0, 2011.0, 2012.0, 2013.0,
     2014.0}
    >>> drows = deforestation.get_rows()
    >>> hrows = hibernation.get_rows()
    >>>
    >>> X = [
    ...     r[2].value for r in drows
    ...     if r[0].value in common]
  • 28. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> Y = [
    ...     r[1].value for r in hrows
    ...     if r[0].value in common]
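The xlrd steps above boil down to aligning two year-keyed series on their common years. The same idea in miniature, with plain dicts standing in for the worksheets (the values below are illustrative, not the full data set):

```python
# Two series keyed by year, overlapping only partially
deforested = {2003: 140.4, 2005: 480.4, 2006: 462.4}
occupied = {2005: 14.6, 2006: 16.98, 2007: 11.39}

# Intersect the key sets, then pull values in a consistent order
common = sorted(set(deforested) & set(occupied))
X = [deforested[y] for y in common]
Y = [occupied[y] for y in common]
print(common, X, Y)  # [2005, 2006] [480.4, 462.4] [14.6, 16.98]
```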
  • 29. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> from statistics import mean, pstdev
    >>> from itertools import starmap
    >>> from operator import mul
    >>>
    >>> def correlation(X, Y):
    ...     prod = starmap(mul, zip(X, Y))
    ...     ave = sum(prod) / len(X)
    ...     covar = ave - mean(X) * mean(Y)
    ...     std_prod = pstdev(X) * pstdev(Y)
    ...     return covar / std_prod
    >>>
    >>> cor_coef = correlation(X, Y)
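A quick way to sanity-check the hand-rolled correlation function is to feed it data with a known coefficient: perfectly linear data should score ±1. A self-contained sketch:

```python
from itertools import starmap
from operator import mul
from statistics import mean, pstdev


def correlation(X, Y):
    # Pearson correlation via E[XY] - E[X]E[Y], divided by the
    # product of the population standard deviations
    prod = starmap(mul, zip(X, Y))
    ave = sum(prod) / len(X)
    covar = ave - mean(X) * mean(Y)
    return covar / (pstdev(X) * pstdev(Y))


# Perfectly correlated and anti-correlated data
print(round(correlation([1, 2, 3], [2, 4, 6]), 10))   # 1.0
print(round(correlation([1, 2, 3], [3, 2, 1]), 10))   # -1.0
```

On Python 3.10+ the result can also be cross-checked against `statistics.correlation`.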
  • 30. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> cor_coef
    0.5801873106352112
  • 31. INVESTIGATE DEFORESTATION (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import tfilter
    >>>
    >>> deforestation = read_xls(
    ...     'data.xlsx', sheet=1,
    ...     sanitize=True)
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> pred = lambda y: float(y) in common
    >>>
    >>> drecords = tfilter(
    ...     deforestation, 'year', pred)
  • 32. INVESTIGATE DEFORESTATION (MEZA)
    >>> hrecords = tfilter(
    ...     hibernation, 'year', pred)
    >>>
    >>> X = [
    ...     float(r['deforested_area_ha'])
    ...     for r in drecords]
    >>>
    >>> Y = [
    ...     float(r['forest_area_acres'])
    ...     for r in hrecords]
    >>>
    >>> cor_coef = correlation(X, Y)
    >>> cor_coef
    0.5801873106352112
  • 34. ¡FELICIDADES! YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS [Chart: DEFORESTATION VS OCCUPIED FOREST, occupied forest (acres) against deforested area (ha)]
  • 35. YOUR COLLEAGUES WOULD LIKE TO ACCESS YOUR FINDINGS AS EITHER A CSV OR JSON FILE HOW DO YOU WANT TO SAVE YOUR RESULTS? SAVE TO CSV OR SAVE TO JSON
  • 36. SAVE TO A JSON FILE (PANDAS)
    >>> from pandas import DataFrame
    >>>
    >>> metric = 'correlation coefficient'
    >>> data = {
    ...     'metric': [metric],
    ...     'value': [cor_coef]}
    >>>
    >>> df = DataFrame(data=data)
    >>> df
                        metric     value
    0  correlation coefficient  0.311468
    >>> df.to_json('results.json')
  • 37. SAVE TO A JSON FILE (NOT PANDAS)
    >>> from json import dump
    >>>
    >>> row = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> with open('results.json', 'w') as f:
    ...     dump([row], f)
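The `json` route round-trips cleanly, so reading the results back is symmetric with writing them. A minimal sketch (the coefficient value mirrors the deforestation result earlier in the deck):

```python
import json

# The single-row result from the analysis
row = {'metric': 'correlation coefficient', 'value': 0.5801873106352112}

# Write, then read back: json.load reverses json.dump exactly
with open('results.json', 'w') as f:
    json.dump([row], f)

with open('results.json') as f:
    loaded = json.load(f)

print(loaded[0]['metric'])  # correlation coefficient
```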
  • 38. SAVE TO A JSON FILE (MEZA)
    >>> from meza.convert import records2json
    >>> from meza.io import write
    >>>
    >>> record = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> results = records2json([record])
    >>> write('results.json', results)
  • 39. ¡FELICIDADES! YOU’RE NOW ABLE TO READ, MANIPULATE, AND SAVE DATA. WITHOUT EVER TOUCHING PANDAS!
  • 43. SCRAPE THE HTML FILE (PANDAS)
    >>> from pandas import read_html
    >>>
    >>> with open('hibernation.html') as f:
    ...     df = read_html(f)[0]
    ...     hibernation = df[1:]
    >>>
    >>> hibernation.head()
      Year  Forest Area (acres)
    1 1993                15.39
    2 1994                 19.3
    3 1995                31.16
    4 1996                44.95
    5 1997                14.26
  • 44. SCRAPE THE HTML FILE (NOT PANDAS)
    >>> from bs4 import BeautifulSoup
    >>>
    >>> with open('hibernation.html') as f:
    ...     soup = BeautifulSoup(f, 'lxml')
    ...     trs = soup.table.find_all('tr')
    >>>
    >>> def gen_rows(trs):
    ...     for tr in trs:
    ...         row = tr.find_all('th')
    ...         row = row or tr.find_all('td')
    ...         yield tuple(
    ...             el.text for el in row)
  • 45. SCRAPE THE HTML FILE (NOT PANDAS)
    >>> from itertools import islice
    >>>
    >>> rows = gen_rows(trs)
    >>> list(islice(rows, 3))
    [('Year', 'Forest Area (acres)'),
     ('1993', '15.39'),
     ('1994', '19.3')]
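If even BeautifulSoup is a dependency too far, the standard library's `html.parser` can scrape a simple table on its own. A minimal sketch, with an inline HTML snippet standing in for hibernation.html:

```python
from html.parser import HTMLParser

# Inline stand-in for the real hibernation.html table
HTML = """<table>
<tr><th>Year</th><th>Forest Area (acres)</th></tr>
<tr><td>1993</td><td>15.39</td></tr>
<tr><td>1994</td><td>19.3</td></tr>
</table>"""


class TableParser(HTMLParser):
    """Collects each <tr> as a tuple of its cell texts."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.rows.append(tuple(self._row))
        elif tag in ('td', 'th'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data)


parser = TableParser()
parser.feed(HTML)
print(parser.rows[:2])
# [('Year', 'Forest Area (acres)'), ('1993', '15.39')]
```

This trades BeautifulSoup's robustness for zero dependencies; for messy real-world HTML the bs4 version above is the safer choice.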
  • 46. SCRAPE THE HTML FILE (MEZA)
    >>> from meza.io import read_html
    >>> from meza.process import peek
    >>>
    >>> hibernation = read_html(
    ...     'hibernation.html', sanitize=True)
    >>>
    >>> hibernation, head = peek(hibernation)
    >>> head[:3]
    [{'forest_area_acres': '15.39', 'year': '1993'},
     {'forest_area_acres': '19.3', 'year': '1994'},
     {'forest_area_acres': '31.16', 'year': '1995'}]
  • 47. PESTICIDE USE IN U.S. AGRICULTURE USDA
  • 48. INVESTIGATE PESTICIDES (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>>
    >>> pesticides = book.parse(
    ...     'pesticides')
    >>>
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> df = pesticides.merge(
    ...     hibernation, on='Year')
  • 49. INVESTIGATE PESTICIDES (PANDAS)
    >>> df.head()
       Year  Total...  Forest Area...
    0  1993  549.3853           15.39
    1  1994  568.4952           19.30
    2  1995  541.9101           31.16
    3  1996  597.3228           44.95
    4  1997  600.5113           14.26
  • 50. INVESTIGATE PESTICIDES (PANDAS)
    >>> X = df['Total (millions of lbs)']
    >>> X[:3]
    0    549.3853
    1    568.4952
    2    541.9101
    >>> Y = df['Forest Area (acres)']
    >>> Y[:3]
    0    15.39
    1    19.30
    2    31.16
    >>> cor_coef = X.corr(Y)
    >>> cor_coef
    0.3114682506273226
  • 51. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>>
    >>> pesticides = book.sheet_by_name(
    ...     'pesticides')
    >>>
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> pest_years = pesticides.col_values(
    ...     0, start_rowx=1)
    >>>
    >>> hiber_years = hibernation.col_values(
    ...     0, start_rowx=1)
  • 52. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> common = set(pest_years).intersection(
    ...     hiber_years)
    >>>
    >>> common
    {1993.0, 1994.0, 1995.0, 1996.0, 1997.0,
     1998.0, 1999.0, 2000.0, 2001.0, 2002.0,
     2003.0, 2004.0, 2005.0, …, 2008.0}
    >>> prows = pesticides.get_rows()
    >>> hrows = hibernation.get_rows()
    >>>
    >>> X = [
    ...     r[5].value for r in prows
    ...     if r[0].value in common]
  • 53. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> Y = [
    ...     r[1].value for r in hrows
    ...     if r[0].value in common]
  • 54. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> from statistics import mean, pstdev
    >>> from itertools import starmap
    >>> from operator import mul
    >>>
    >>> def correlation(X, Y):
    ...     prod = starmap(mul, zip(X, Y))
    ...     ave = sum(prod) / len(X)
    ...     covar = ave - mean(X) * mean(Y)
    ...     std_prod = pstdev(X) * pstdev(Y)
    ...     return covar / std_prod
    >>>
    >>> cor_coef = correlation(X, Y)
  • 55. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> cor_coef
    0.3114682506273143
  • 56. INVESTIGATE PESTICIDES (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import tfilter
    >>>
    >>> pesticides = read_xls(
    ...     'data.xlsx', sheet=2,
    ...     sanitize=True)
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> pred = lambda y: float(y) in common
    >>>
    >>> precords = tfilter(
    ...     pesticides, 'year', pred)
  • 57. INVESTIGATE PESTICIDES (MEZA)
    >>> hrecords = tfilter(
    ...     hibernation, 'year', pred)
    >>>
    >>> X = [
    ...     float(r['total_millions_of_lbs'])
    ...     for r in precords]
    >>>
    >>> Y = [
    ...     float(r['forest_area_acres'])
    ...     for r in hrecords]
    >>>
    >>> cor_coef = correlation(X, Y)
    >>> cor_coef
    0.3114682506273143
  • 59. ¡FELICIDADES! YOU’VE OBTAINED THE PESTICIDE CORRELATIONS [Chart: US PESTICIDE USE VS OCCUPIED FOREST, occupied forest (acres) against pesticide use (millions of lbs)]
  • 60. SAVE TO A CSV FILE (PANDAS)
    >>> from pandas import DataFrame
    >>>
    >>> metric = 'correlation coefficient'
    >>> data = {
    ...     'metric': [metric],
    ...     'value': [cor_coef]}
    >>>
    >>> df = DataFrame(data=data)
    >>> df
                        metric     value
    0  correlation coefficient  0.311468
    >>> df.to_csv('results.csv')
  • 61. SAVE TO A CSV FILE (NOT PANDAS)
    >>> from csv import DictWriter
    >>>
    >>> row = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> with open('results.csv', 'w') as f:
    ...     fieldnames = ['metric', 'value']
    ...     writer = DictWriter(
    ...         f, fieldnames=fieldnames)
    ...     writer.writeheader()
    ...     writer.writerow(row)
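Unlike the JSON round-trip, `csv.DictReader` hands everything back as strings, so reading the results file requires converting the value. A sketch mirroring the row above (the coefficient value echoes the pesticide result):

```python
import csv

row = {'metric': 'correlation coefficient', 'value': 0.3114682506273143}

# Write with DictWriter; newline='' avoids blank lines on Windows
with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['metric', 'value'])
    writer.writeheader()
    writer.writerow(row)

# Read back with DictReader -- note every field comes back as a string
with open('results.csv', newline='') as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]['metric'])  # correlation coefficient
```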
  • 62. SAVE TO A CSV FILE (MEZA)
    >>> from meza.convert import records2csv
    >>> from meza.io import write
    >>>
    >>> record = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> results = records2csv([record])
    >>> write('results.csv', results)