You obtained forest area estimates occupied by monarch butterfly colonies from 1993-1997 from an Excel file without using Pandas. The declining forest area worried you about potential causes like deforestation and pesticide use. You then investigated further correlations between forest area and deforestation or pesticide usage data, obtaining a correlation coefficient of 0.58 for deforestation using various data processing methods like Pandas, raw Python, and Meza. Finally, you saved your results as a JSON file for colleagues to access.
This set of slides is based on the presentation I gave at ACM DataScience camp 2014. This is suitable for those who are still new to R. It has a few basic data manipulation techniques, and then goes into the basics of using of the dplyr package (Hadley Wickham) #rstats #dplyr
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPlotly
If you are struggling to make a plot, tear yourself away from stackoverflow for a moment and ... take a hard look at your data. Is it really in the most favorable form for the task at hand? Time and time again I have found that my visualization struggles are really a symptom of unfinished data wrangling. R has long had excellent facilities for data aggregation or "split-apply-combine": split an object into pieces, compute on each piece, and glue the result back together again. Recent developments, especially in the purrr package, have made "split-apply-combine" even easier and more general. But this requires a certain comfort level with lists, especially with lists that are columns inside a data frame. This is unfamiliar to most of us. I give an overview of this set of problems and match them up with solutions based on grouped, nested, and split data frames.
From talk presented at Lambda Jam 2013.
Characteristics and applications of functional linear data structures.
Bibliography and code examples:
http://jackfoxy.com/lambda_jam_fsharp_bibliography
This set of slides is based on the presentation I gave at ACM DataScience camp 2014. This is suitable for those who are still new to R. It has a few basic data manipulation techniques, and then goes into the basics of using of the dplyr package (Hadley Wickham) #rstats #dplyr
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPlotly
If you are struggling to make a plot, tear yourself away from stackoverflow for a moment and ... take a hard look at your data. Is it really in the most favorable form for the task at hand? Time and time again I have found that my visualization struggles are really a symptom of unfinished data wrangling. R has long had excellent facilities for data aggregation or "split-apply-combine": split an object into pieces, compute on each piece, and glue the result back together again. Recent developments, especially in the purrr package, have made "split-apply-combine" even easier and more general. But this requires a certain comfort level with lists, especially with lists that are columns inside a data frame. This is unfamiliar to most of us. I give an overview of this set of problems and match them up with solutions based on grouped, nested, and split data frames.
From talk presented at Lambda Jam 2013.
Characteristics and applications of functional linear data structures.
Bibliography and code examples:
http://jackfoxy.com/lambda_jam_fsharp_bibliography
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxcarliotwaycave
INFORMATIVE ESSAY
The purpose of the Informative Essay assignment is to choose a job or task that you know how to do and then write a minimum of 2 full pages, maximum of 3 full pages, Informative Essay teaching the reader how to do that job or task. You will follow the organization techniques explained in Unit 6.
Here are the details:
1. Read the Lecture Notes in Unit 6. You may also find the information in Chapter 10.5 in our text on Process Analysis helpful. The lecture notes will really be the most important to read in writing this assignment. However, here is a link to that chapter that you may look at in addition to the lecture notes:
https://open.lib.umn.edu/writingforsuccess/chapter/10-5-process-analysis/ (Links to an external site.)
2. Choose your topic, that is, the job or task you want to teach. As the notes explain, this should be a job or task that you already know how to do, and it should be something you can do well. At this point, think about your audience (reader). Will your reader need any knowledge or experience to do this job or task, or will you write these instructions for a general reader where no experience is required to perform the job?
3. Plan your outline to organize this essay. Unit 6 notes offer advice on this organization process. Be sure to include an introductory paragraph that has the four main points presented in the lecture notes.
4. Write the essay. It will need to be at least 2 FULL pages long, maximum of 3 full pages long. You will use the MLA formatting that you used in previous essays from Units 3, 4, and 5.
5. Be sure to include a title for your essay.
6. After writing the essay, be sure to take time to read it several times for revision and editing. It would be helpful to have at least one other person proofread it as well before submitting the assignment.
Quiz2
# comments start with #
# to quit q()
# two steps to install any library
#install.packages("rattle")
#library(rattle)
setwd("D:/AJITH/CUMBERLANDS/Ph.D/SEMESTER 3/Data Science & Big Data Analy (ITS-836-51)/RStudio/Week2")
getwd()
x <- 3 # x is a vector of length 1
print(x)
v1 <- c(2,4,6,8,10)
print(v1)
print(v1[3])
v <- c(1:10) #creates a vector of 10 elements numbered 1 through 10. More complicated data
print(v)
print(v[6])
# Import test data
test<-read.csv("CVEs.csv")
test1<-read.csv("CVEs.csv", sep=",")
test2<-read.table("CVEs.csv", sep=",")
write.csv(test2, file="out.csv")
# Write CSV in R
write.table(test1, file = "out1.csv",row.names=TRUE, na="",col.names=TRUE, sep=",")
head(test)
tail(test)
summary(test)
head <- head(test)
tail <- tail(test)
cor(test$X, test$index)
sd(test$index)
var(test$index)
plot(test$index)
hist(test$index)
str(test$index)
quit()
Quiz3
setwd("C:/Users/ialsmadi/Desktop/University_of_Cumberlands/Lectures/Week2/RScripts")
getwd()
# Import test data
data<-read.csv("yearly_sales.csv")
#A 5-number summary is a set of 5 descriptive statistics for summarizing a continuous univariate data set.
#It consists o ...
Beyond php - it's not (just) about the codeWim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just wrting PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Beyond php - it's not (just) about the codeWim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just wrting PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Estructuras de datos avanzadas: Casos de uso realesSoftware Guru
La utilización de estructuras de datos adecuadas para cada problema hace que se simplifiquen en gran medida los tiempos de respuestas y la cantidad de cómputo realizada.
Por Nelson González
More Related Content
Similar to You might not need pandas - Reuben Cummings
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxcarliotwaycave
INFORMATIVE ESSAY
The purpose of the Informative Essay assignment is to choose a job or task that you know how to do and then write a minimum of 2 full pages, maximum of 3 full pages, Informative Essay teaching the reader how to do that job or task. You will follow the organization techniques explained in Unit 6.
Here are the details:
1. Read the Lecture Notes in Unit 6. You may also find the information in Chapter 10.5 in our text on Process Analysis helpful. The lecture notes will really be the most important to read in writing this assignment. However, here is a link to that chapter that you may look at in addition to the lecture notes:
https://open.lib.umn.edu/writingforsuccess/chapter/10-5-process-analysis/ (Links to an external site.)
2. Choose your topic, that is, the job or task you want to teach. As the notes explain, this should be a job or task that you already know how to do, and it should be something you can do well. At this point, think about your audience (reader). Will your reader need any knowledge or experience to do this job or task, or will you write these instructions for a general reader where no experience is required to perform the job?
3. Plan your outline to organize this essay. Unit 6 notes offer advice on this organization process. Be sure to include an introductory paragraph that has the four main points presented in the lecture notes.
4. Write the essay. It will need to be at least 2 FULL pages long, maximum of 3 full pages long. You will use the MLA formatting that you used in previous essays from Units 3, 4, and 5.
5. Be sure to include a title for your essay.
6. After writing the essay, be sure to take time to read it several times for revision and editing. It would be helpful to have at least one other person proofread it as well before submitting the assignment.
Quiz2
# comments start with #
# to quit q()
# two steps to install any library
#install.packages("rattle")
#library(rattle)
setwd("D:/AJITH/CUMBERLANDS/Ph.D/SEMESTER 3/Data Science & Big Data Analy (ITS-836-51)/RStudio/Week2")
getwd()
x <- 3 # x is a vector of length 1
print(x)
v1 <- c(2,4,6,8,10)
print(v1)
print(v1[3])
v <- c(1:10) #creates a vector of 10 elements numbered 1 through 10. More complicated data
print(v)
print(v[6])
# Import test data
test<-read.csv("CVEs.csv")
test1<-read.csv("CVEs.csv", sep=",")
test2<-read.table("CVEs.csv", sep=",")
write.csv(test2, file="out.csv")
# Write CSV in R
write.table(test1, file = "out1.csv",row.names=TRUE, na="",col.names=TRUE, sep=",")
head(test)
tail(test)
summary(test)
head <- head(test)
tail <- tail(test)
cor(test$X, test$index)
sd(test$index)
var(test$index)
plot(test$index)
hist(test$index)
str(test$index)
quit()
Quiz3
setwd("C:/Users/ialsmadi/Desktop/University_of_Cumberlands/Lectures/Week2/RScripts")
getwd()
# Import test data
data<-read.csv("yearly_sales.csv")
#A 5-number summary is a set of 5 descriptive statistics for summarizing a continuous univariate data set.
#It consists o ...
Beyond php - it's not (just) about the codeWim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just wrting PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Beyond php - it's not (just) about the codeWim Godden
Most PHP developers focus on writing code. But creating Web applications is about much more than just wrting PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
Estructuras de datos avanzadas: Casos de uso realesSoftware Guru
La utilización de estructuras de datos adecuadas para cada problema hace que se simplifiquen en gran medida los tiempos de respuestas y la cantidad de cómputo realizada.
Por Nelson González
Onboarding new members into an engineering team is not easy on anyone. In a short period of time, the new team member is required to be able to bring professional
Por Victoriya Kalmanovich
El secreto para ser un desarrollador SeniorSoftware Guru
En esta charla platicaremos sobre el “secreto” y el camino para llegar a ser un desarrollador Senior, experiencia, consejos y recomendaciones que en estos 8 años
Por René Sandoval
Apache Airflow es una plataforma en la que podemos crear flujos de datos de manera programática, planificarlos y monitorear de manera centralizada.
Por Yesi Díaz
How thick data can improve big data analysis for business:Software Guru
En esta presentación hablaré sobre cómo el Análisis de Datos Gruesos, específicamente el análisis antropológico y semiótico, puede ayudar a mejorar los resultados del Big Data
Por Martin Cuitzeo
CoDi® es la nueva forma de realizar pagos digitales desarrollada por el Banco de México. Por medio de CoDi puedes realizar cobros y pagos desde tu celular, utilizando una cuenta bancaria o de alguna institución financiera, sin comisiones.
Por Cristian Jaramillo
Gestionando la felicidad de los equipos con Management 3.0Software Guru
En las metodologías agiles hablamos de equipos colaborativos, autogestionados y felices. hablamos de lideres serviciales. El management 3.0 nos ayuda a cultivar el mindset correcto, aquel que servirá como el terreno fértil para que la agilidad florezca.
Por Andrea Vélez Cárdenas
Taller: Creación de Componentes Web re-usables con StencilJSSoftware Guru
Hoy por hoy las experiences de usuario pueden ser enriquecidas mediante el uso de Web Components, que son un estándar de la W3C soportado por la mayoría de los navegadores web modernos.
Por Alex Arriaga
Así publicamos las apps de Spotify sin stressSoftware Guru
En Spotify tenemos 1600+ ingenieros, trabajando en 280+ squads. Aún a esta escala, hemos logrado adoptar prácticas que nos han permitido acelerar la forma en que desarrollamos nuestro producto. Presentado por Erick Camacho en SG Virtual Conference 2020
Achieving Your Goals: 5 Tips to successfully achieve your goalsSoftware Guru
he measure of the executive, Peter F. Drucker reminds us, is the ability to "get the right things done." This involves having clarity on what are the right things as well as avoiding what is unproductive. Intelligence, creativity, and knowledge may all be wasted if not put to work on the things that matter.
Presentado por Cristina Nistor en SG Virtual Conference 2020
Acciones de comunidades tech en tiempos del Covid19Software Guru
Acciones de Comunidades Tech en tiempo del COVID-19 es una platica para informar acerca de las acciones que están realizando algunas comunidades de tecnología en México para luchar contra la propagación del COVID-19. Desde análisis de datos, visualizaciones, simulaciones de contagio, etc.
Presentado por Juana Martínez, Adriana Vallejo y Eduardo Ramírez en SG Virtual Conference 2020
De lo operativo a lo estratégico: un modelo de management de diseñoSoftware Guru
La charla presenta un modelo claro, generado por la ponente, para atender los niveles desde lo operativo a lo estratégico.
Presentado por Gabriela Salinas en SG Virtual Conference
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
1. YOU MIGHT NOT
NEED PANDAS
DATA DAY MEXICO — MEXICO CITY
MARCH 15, 2018
BY REUBEN CUMMINGS
@REUBANO
2. WHO AM I?
• Managing Director, Nerevu
Development
• Founder of Arusha Coders
• Author of several popular Python
packages
• @reubano on Twitter and GitHub
3. WHO DOESN’T LOVE THESE CUTE CUDDLY CREATURES?
WHY PANDAS?
• It’s fast
• It’s
ubiquitous
• It gets the
job done
Photo credit: Mathias Appel (@mathiasappel)
4. BECAUSE IT ISN’T THAT BAD AFTER ALL
WHEN SHOULD YOU USE PANDAS?
• Speed is paramount
• You already have pandas code
available
• You don’t want have time to
learn anything new
5. WELL, THEY NEVER CLAIMED TO BE GRACEFUL…
WHY NOT PANDAS?
• It’s complex
• It’s large
• It likes lots
of memory
Photo credit: CSBaltimore
6. BECAUSE SOMETIMES IT ISN’T THAT GREAT EITHER
WHEN SHOULDN’T YOU USE PANDAS?
• You don’t want all those
dependencies
• You like functional programming
• You have tight RAM constraints
7. PANDAS ALTERNATIVES
• Pure Python
• meza (https://github.com/
reubano/meza)
• Other libraries (csvkit,
messytables, etc.)
13. THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML
FILES
HOW WILL YOU OBTAIN THE DATA?
Year
Occupied Forest
(acres)
1993 15.39
1994 19.30
1995 31.16
1996 44.95
1997 14.26
14. THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML
FILES
HOW WILL YOU OBTAIN THE DATA?
PARSE THE
EXCEL FILE
OR
SCRAPE THE
HTML FILE
15. PARSE THE EXCEL FILE (PANDAS)
>>> from pandas import ExcelFile
>>>
>>> book = ExcelFile('data.xlsx')
>>> hibernation = book.parse(
>>> 'hibernation')
>>>
>>> hibernation.head()
Year Forest Area (acres)
0 1993 15.39
1 1994 19.30
2 1995 31.16
3 1996 44.95
4 1997 14.26
16. PARSE THE EXCEL FILE (NOT PANDAS)
>>> from xlrd import open_workbook
>>>
>>> book = open_workbook('data.xlsx')
>>> hibernation = book.sheet_by_name(
>>> 'hibernation')
>>>
>>> hibernation.row_values(0)
['Year', 'Forest Area (acres)']
>>> hibernation.row_values(1)
[1993.0, 15.39]
19. YOU’VE OBTAINED THE FOREST AREA ESTIMATES
¡FELICIDADES!
• The declining
forest area
worries you
• You suspect de-
forestation
• You also suspect
pesticides
ACRES OF “BUTTERFLY
OCCUPIED” FOREST
0
12.5
25
37.5
50
1993 1994 1995 1996 1997
20. YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN
MEXICO AND PESTICIDE USAGE IN THE USA
WHICH DATA SET WILL YOU INVESTIGATE FURTHER?
Year
Occupied
Forest (acres)
Deforestation (ha)
Pesticides
(millions of lbs)
2003* 23.0 140.4 476.5
2005* 10.0 480.4 488.2
2006 17.0 462.4 485.9
2007 11.4 244.6 503.8
2008 12.5 259.0 516.1
* two year averages
43. SCRAPE THE HTML FILE (PANDAS)
>>> from pandas import read_html
>>>
>>> with open('hibernation.html') as f:
>>> df = read_html(f)[0]
>>> hibernation = df[1:]
>>>
>>> hibernation.head()
Year Forest Area (acres)
1 1993 15.39
2 1994 19.3
3 1995 31.16
4 1996 44.95
5 1997 14.26
44. SCRAPE THE HTML FILE (NOT PANDAS)
>>> from bs4 import BeautifulSoup
>>>
>>> with open('hibernation.html') as f:
>>> soup = BeautifulSoup(f, 'lxml')
>>> trs = soup.table.find_all('tr')
>>>
>>> def gen_rows(trs):
>>> for tr in trs:
>>> row = tr.find_all('th')
>>> row = row or tr.find_all('td')
>>> yield tuple(
>>> el.text for el in row)
45. SCRAPE THE HTML FILE (NOT PANDAS)
>>> from itertools import islice
>>>
>>> rows = gen_rows(trs)
>>> list(islice(rows, 3))
[('Year', 'Forest Area (acres)'),
('1993', '15.39'),
(‘1994’, '19.3')]
46. >>> from meza.io import read_html
>>> from meza.process import peek
>>>
>>> hibernation = read_html(
>>> 'hibernation.html', sanitize=True)
>>>
>>> hibernation, head = peek(hibernation)
>>> head[:3]
[{'forest_area_acres': '15.39', 'year':
'1993'},
{'forest_area_acres': '19.3', 'year':
'1994'},
{'forest_area_acres': '31.16', 'year':
'1995'}]
SCRAPE THE HTML FILE (MEZA)
59. ¡FELICIDADES!
YOU’VE OBTAINED THE PESTICIDE CORRELATIONS
US PESTICIDE USE VS OCCUPIED FOREST
OCCUPIEDFOREST(ACRES)
0
12.5
25
37.5
50
PESTICIDE USE (MILLIONS OF LBS)
0 175 350 525 700
60. SAVE TO A CSV FILE (PANDAS)
>>> from pandas import DataFrame
>>>
>>> metric = 'correlation coefficient'
>>> data = {
>>> 'metric': [metric],
>>> 'value': [cor_coef]}
>>>
>>> df = DataFrame(data=data)
>>> df
metric value
0 correlation coefficient 0.311468
>>> df.to_csv('results.csv')
61. SAVE TO A CSV FILE (NOT PANDAS)
>>> from csv import DictWriter
>>>
>>> row = {
>>> 'metric': metric,
>>> 'value': cor_coef}
>>>
>>> with open('results.csv', 'w') as f:
>>> fieldnames = ['metric', 'value']
>>> writer = DictWriter(
>>> f, fieldnames=fieldnames)
>>> writer.writeheader()
>>> writer.writerow(row)
62. SAVE TO A CSV FILE (MEZA)
>>> from meza.convert import records2csv
>>> from meza.io import write
>>>
>>> record = {
>>> 'metric': metric,
>>> 'value': cor_coef}
>>>
>>> results = records2csv([record])
>>> write('results.csv', results)
58