Building flexible tools to store
sums and report on CSV data
Presented by
Margery Harrison
Audience level: Novice
09:45 AM - 10:45 AM
August 17, 2014
Room 704
Python Flexibility
● Basic, Fortran, C, Pascal, Javascript,...
● Once you know several languages, there's a tendency to think
the same way in each one, and just translate
● You can write Python as if it were C
● Or you can take advantage of Python's
special data structures.
● The second option is a lot more fun.
Using Python data structures to
report on CSV data
● Lists
● Sets
● Tuples
● Dictionaries
● CSV Reader
● DictReader
● Counter
Also,
● Using tuples as dictionary keys
● Using enumerate() to count how many
times you've looped
– See “Loop like a Native”
http://nedbatchelder.com/text/iter.html
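As a small illustration of the enumerate() idiom, here is a minimal sketch in Python 3 syntax (the slides themselves use Python 2). The rows list is made-up sample data, not something from the talk.

```python
# Hypothetical sample data, standing in for lines read from a file.
rows = ["red,big,square", "blue,big,triangle", "green,small,square"]

numbered = []
# enumerate() yields (count, item) pairs, so no manual counter is needed.
for count, row in enumerate(rows, start=1):
    numbered.append("{0:d}: {1:s}".format(count, row))

print("\n".join(numbered))
```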
Code Development Method
● Start with simplest possible version
● Test and validate
● Iterative improvements
– Make it prettier
– Make it do more
– Make it more general
This is a CSV file
color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
blue,small,triangle,1
red,big,square,7
blue,small,triangle,3
https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg
http://i239.photobucket.com/albums/ff263/peacelovebones/two-pandas-rolling-1.jpg
CSV DictReader
>>> import csv
>>> import os
>>> with open("simpleCSV.txt") as f:
...     r = csv.DictReader(f)
...     for row in r:
...         print(row)
...
Running DictReader
DictReader is sequential
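One way to see what "sequential" means in practice: a DictReader can be looped over only once per open file. The sketch below uses Python 3 syntax and an in-memory string (an assumption, standing in for simpleCSV.txt) so that it runs without any file on disk.

```python
import csv
import io

# Made-up in-memory stand-in for the CSV file, with the same column names.
SAMPLE = "color,size,shape,number\nred,big,square,3\nblue,big,triangle,5\n"

reader = csv.DictReader(io.StringIO(SAMPLE))
first_pass = [row["color"] for row in reader]
# The reader has already consumed its input, so a second loop sees nothing.
second_pass = [row["color"] for row in reader]

print(first_pass)
print(second_pass)
```

To loop again, reopen the file (or rewind it and create a new reader), or collect what you need during the first pass.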
Tabulate All Possible Values
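A minimal sketch of tabulating the possible values, assuming one set per column header. The name possible_values appears later in the slides; the in-memory SAMPLE string and the use of defaultdict are assumptions made so the example is self-contained (Python 3 syntax).

```python
import csv
import io
from collections import defaultdict

# In-memory copy of the example CSV shown earlier in the talk.
SAMPLE = """color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
blue,small,triangle,1
red,big,square,7
blue,small,triangle,3
"""

# One set per column header; adding a repeated value to a set has no effect.
possible_values = defaultdict(set)
reader = csv.DictReader(io.StringIO(SAMPLE))
for row in reader:
    for head in reader.fieldnames:
        possible_values[head].add(row[head])

for head in reader.fieldnames:
    print(head, sorted(possible_values[head]))
```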
How many of each?
● It's nice to have a listing that shows the
variety of objects that can appear in each
column.
● Next, we'd like to count how many of each
● And guess what? Python has a special data
structure for that.
collections.Counter
Playing with Counters
Index into Counters
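For readers without the slide images, here is a short sketch of basic Counter behavior in Python 3 syntax. The colors list is made-up data; the Counter calls are from the standard library.

```python
from collections import Counter

colors = ["red", "blue", "green", "blue", "red", "blue"]
c = Counter(colors)        # count the items of an iterable

c["green"] += 1            # index into it like a dictionary
missing = c["purple"]      # an absent key reports 0 instead of raising KeyError

print(c.most_common(1))
```

Looking up a missing key returns 0 without adding that key to the Counter.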
Counter + DictReader
Let's use counters to tell us how many of each
value was in each column.
Print number of each value
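A minimal sketch of combining Counter with DictReader, assuming a single Counter keyed by field value plus one set of values per column (the names possible_values and count_of_values appear later in the slides). The in-memory SAMPLE string, the sorted print order, and the 8-character field width are assumptions; Python 3 syntax.

```python
import csv
import io
from collections import Counter, defaultdict

# In-memory copy of the example CSV shown earlier in the talk.
SAMPLE = """color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
blue,small,triangle,1
red,big,square,7
blue,small,triangle,3
"""

possible_values = defaultdict(set)
count_of_values = Counter()

reader = csv.DictReader(io.StringIO(SAMPLE))
for row in reader:
    for head in reader.fieldnames:
        field_value = row[head]
        possible_values[head].add(field_value)
        count_of_values[field_value] += 1   # count the whole value as one key

lines = []
for head in reader.fieldnames:
    lines.append(head)
    for value in sorted(possible_values[head]):
        lines.append("{0:8s}: {1:d}".format(value, count_of_values[value]))

print("\n".join(lines))
```

Because the Counter is keyed by value alone, the same string appearing in two different columns would share one count; keying by (head, value) would avoid that.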
Output
color
blue : 3
green : 1
red : 2
shape
square : 3
triangle: 3
number
1 : 1
3 : 2
2 : 1
5 : 1
7 : 1
size
small : 3
big : 3
You might ask, why not this?
for row in r:
    for head in r.fieldnames:
        field_value = row[head]
        possible_values[head].add(field_value)
        #count_of_values[field_value]+=1
        count_of_values.update(field_value)
print(count_of_values)
Because Counter likes to count: update() counts each character of a string
Counter({'e': 13, 'l': 12, 'a': 9, 'r': 9, 'g': 7, 'b': 6, 'i': 6, 's':
6, 'u': 6, 'n': 4, 'm': 3, 'q': 3, 't': 3, 'd': 2, '3': 2, '1': 1, '2':
1, '7': 1, '5': 1})
color
blue : 0
green : 0
red : 0
shape
square : 0
triangle: 0
number
1 : 1
3 : 2
2 : 1
5 : 1
7 : 1
size
small : 0
big : 0
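The zeros above follow from how Counter.update() treats its argument: it iterates over it, and iterating over a string yields single characters. Single-character values such as '3' still land on the right key, which is why only the number column looks correct. A small sketch in Python 3 syntax (the variable names are illustrative):

```python
from collections import Counter

by_update = Counter()
by_update.update("blue")      # a string is an iterable of characters

by_index = Counter()
by_index["blue"] += 1         # counts the whole string as one key

by_list = Counter()
by_list.update(["blue"])      # wrapping it in a list also counts the whole string

print(by_update)
print(by_index)
```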
Output
color
blue : 3
green : 1
red : 2
shape
square : 3
triangle: 3
number
1 : 1
3 : 2
2 : 1
5 : 1
7 : 1
size
small : 3
big : 3
How many red squares?
● We can use tuples as an index into the
counter
– (red,square)
– (big,red,square)
– (small,blue,triangle)
– (small,square)
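Tuples are hashable, so they can serve as Counter keys just like strings. A minimal sketch in Python 3 syntax, assuming the rows have already been read into (color, size, shape) tuples; the rows list copies the example data and the key layouts follow the bullets above.

```python
from collections import Counter

# The example rows as (color, size, shape) tuples.
rows = [
    ("red", "big", "square"),
    ("blue", "big", "triangle"),
    ("green", "small", "square"),
    ("blue", "small", "triangle"),
    ("red", "big", "square"),
    ("blue", "small", "triangle"),
]

counts = Counter()
for color, size, shape in rows:
    counts[(color, shape)] += 1          # e.g. ('red', 'square')
    counts[(size, color, shape)] += 1    # e.g. ('big', 'red', 'square')

red_squares = counts[("red", "square")]
print(red_squares)
```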
Let's use a simpler CSV
color,size,shape
red,big,square
blue,big,triangle
green,small,square
blue,small,triangle
red,big,square
blue,small,triangle
Counting Tuples
trying to use magic update()
>>> import collections
>>> c=collections.Counter([('a,b'),('c,d,e')])
>>> c
Counter({'a,b': 1, 'c,d,e': 1})
>>> c.update(('a','b'))
>>> c
Counter({'a': 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})
>>> c.update((('a','b'),))
>>> c
Counter({'a': 1, ('a', 'b'): 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})
Oh well
>>> c.update([(('a','b'),)])
>>> c
Counter({'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e': 1, 'a,b': 1,
('a', 'b'): 1})
>>> c[('a','b')]
1
>>> c[('a','b')]+=5
>>> c
Counter({('a', 'b'): 6, 'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e':
1, 'a,b': 1})
Combo Count Part 1: Initialize
Combo Count 2: Counting
Combo Count 3: Printing
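A minimal sketch of one way to count combinations, assuming the tuples are leading slices of each row in column order: (color, size) and (color, size, shape). That assumption is an inference from the output on the next slide, not the talk's actual code. The name combo_counts and the in-memory SAMPLE string are hypothetical; Python 3 syntax.

```python
import csv
import io
from collections import Counter

# In-memory copy of the simpler CSV shown above.
SAMPLE = """color,size,shape
red,big,square
blue,big,triangle
green,small,square
blue,small,triangle
red,big,square
blue,small,triangle
"""

combo_counts = Counter()
reader = csv.DictReader(io.StringIO(SAMPLE))
for row in reader:
    values = tuple(row[head] for head in reader.fieldnames)
    # Count every leading slice with two or more items.
    for length in range(2, len(values) + 1):
        combo_counts[values[:length]] += 1

for combo in sorted(combo_counts):
    print(combo, combo_counts[combo])
```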
Combo Count Output
color
blue : 3
3 blue in 1 combinations:
('blue', 'big'): 1
('blue', 'small'): 2
3 blue in 2 combinations:
('blue', 'big', 'triangle'): 1
('blue', 'small', 'triangle'): 2
green : 1
1 green in 1 combinations:
('green', 'small'): 1
1 green in 2 combinations:
('green', 'small', 'square'): 1
red : 2
2 red in 1 combinations:
('red', 'big'): 2
2 red in 2 combinations:
('red', 'big', 'square'): 2
shape
square : 3
3 square in 1 combinations:
3 square in 2 combinations:
('red', 'big', 'square'): 2
('green', 'small', 'square'): 1
triangle: 3
3 triangle in 1 combinations:
3 triangle in 2 combinations:
('blue', 'big', 'triangle'): 1
('blue', 'small', 'triangle'): 2
size
small : 3
3 small in 1 combinations:
('blue', 'small'): 2
('green', 'small'): 1
3 small in 2 combinations:
('green', 'small', 'square'): 1
('blue', 'small', 'triangle'): 2
big : 3
3 big in 1 combinations:
('blue', 'big'): 1
('red', 'big'): 2
3 big in 2 combinations:
('red', 'big', 'square'): 2
('blue', 'big', 'triangle'): 1
Well, that's ugly
● We need to make it prettier
● We need to write out to a file
● We need to break things up into Classes
Printing Combination Levels
Number of Squares
Number of Red Squares
Number of Blue Squares
Number of Triangles
Number of Red Triangles
Number of Blue Triangles
Total Red
Total Blue
Indentation per level
● If we're indexing by tuple, then the
indentation level could correspond to the
number of items in the tuple.
● Let's have general methods to format the
indentation level, given the number of
items in the tuple, or input 'level' integer
A class write_indent() method
If it's part of a class with counts and msgs dicts,
just pass in the tuple:
def write_indent(self, tup_index):
    '''
    :param tup_index: tuple index into counter
    '''
    indent = ' ' * len(tup_index)
    msg = self.msgs[tup_index]
    sum = self.counts[tup_index]
    indented_msg = '{0:s}{1:s}:{2:d}'.format(
        indent, msg, sum)
    return indented_msg
class-less indent_message()
def indent_message(level, msg, sum,
                   space_per_indent=2, space=' '):
    num_spaces = space_per_indent * level
    indent = space * num_spaces
    # We'll want to tune the formatting..
    indented_msg = '{0:s}{1:s}:{2:d}'.format(
        indent, msg, sum)
    return indented_msg
Adjustable field widths
Depending on data, we'll want different
field widths
red squares 5
Blue squares 21
Large Red Squares in the Bronx 987654321
Using format to format a format
string
>>> f='{{0:{0:d}s}}'.format(3)
>>> f
'{0:3s}'
>>> f='{{0:{0:d}s}}{{1:{1:d}d}}'.format(3,5)
>>> f
'{0:3s}{1:5d}'
>>> f='{{0:s}}{{1:{0:d}s}}{{2:{1:d}d}}'.format(3,5)
>>> f
'{0:s}{1:3s}{2:5d}'
Format 3 values
● Our formatting string will print 3 values:
– String of space chars: {0:s}
– Message: {1:[msg_width]s}
– Sum: Right justified {2:-[sum_width]d}
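The three-value format string can be built with the format-a-format trick from the previous slide. A minimal sketch in Python 3 syntax; make_line_format is a hypothetical helper name, not something from the talk. Integers are right-justified within their field width by default.

```python
def make_line_format(msg_width, sum_width):
    """Build a three-field format string with the given widths (hypothetical helper)."""
    # Doubled braces survive the first format() call as literal braces.
    return '{{0:s}}{{1:{0:d}s}}{{2:{1:d}d}}'.format(msg_width, sum_width)


line_format = make_line_format(12, 4)
line = line_format.format('  ', 'red squares', 5)

print(repr(line_format))
print(repr(line))
```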
Class For Flexible Indentation
Flexible Indent Class Variables
Flexible Indent Method
Testing IndentMessages class
SimpleCSVReporter
● Open a CSV File
● Create
– Set of possible values
– Set of possible tuples
– Counter indexed by each value & tuple
● Use IndentMessages to format output lines
SimpleCSVReporter class vars
readCSV() begins
initialize sets..
readCSV() continued:
Loop to collect & sum
Write to Report File
Using recursion for limitless
indentation
Recursive print sub-levels
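A sketch of how recursion can handle any number of levels, assuming a Counter keyed by tuples where each longer tuple extends a shorter one. The function name report_lines, the sample counts, and the output layout are all hypothetical; this is not the talk's code. Python 3 syntax.

```python
from collections import Counter

# Made-up counts keyed by tuples of increasing length.
counts = Counter({
    ("blue",): 3,
    ("blue", "big"): 1,
    ("blue", "small"): 2,
    ("blue", "big", "triangle"): 1,
    ("blue", "small", "triangle"): 2,
})


def report_lines(counts, prefix, space_per_indent=2):
    """Return indented lines for prefix and, recursively, every tuple extending it."""
    indent = " " * (space_per_indent * (len(prefix) - 1))
    lines = ["{0:s}{1:s}: {2:d}".format(indent, " ".join(prefix), counts[prefix])]
    # Children are the keys exactly one item longer that start with prefix.
    children = sorted(
        key for key in counts
        if len(key) == len(prefix) + 1 and key[:len(prefix)] == prefix
    )
    for child in children:
        lines.extend(report_lines(counts, child, space_per_indent))
    return lines


lines = report_lines(counts, ("blue",))
print("\n".join(lines))
```

The indentation depends only on the tuple length, so deeper combinations need no extra code.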
Word transform stubs
General method to test
Test with simpler CSV
Output for simpler CSV
A bigger CSV file
"CCN","REPORTDATETIME","SHIFT","OFFENSE","METHOD","BLOCKSITEADDRESS","WARD","ANC","DISTRICT","PSA","NEIGHBORHOODCLUSTER","BUSINESSIMPROVEMENTDISTRICT","VOTING_PRECINCT","START_DATE","END_DATE"
4104147,"4/16/2013 12:00:00 AM","MIDNIGHT","HOMICIDE","KNIFE","1500 - 1599 BLOCK OF 1ST STREET SW",6,"6D","FIRST",105,9,,"Precinct 127","7/27/2004 8:30:00 PM","7/27/2004 8:30:00 PM"
5047867,"6/5/2013 12:00:00 AM","MIDNIGHT","SEX ABUSE","KNIFE","6500 - 6599 BLOCK OF PINEY BRANCH ROAD NW",4,"4B","FOURTH",402,17,,"Precinct 59","4/15/2005 12:30:00 PM",
● From http://data.octo.dc.gov/
Deleted all but 4 columns
"SHIFT","OFFENSE","METHOD","DISTRICT"
"MIDNIGHT","HOMICIDE","KNIFE","FIRST"
"MIDNIGHT","SEX ABUSE","KNIFE","FOURTH"
...
"DAY","THEFT/OTHER","OTHERS","SECOND"
"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"
"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"
"EVENING","BURGLARY","OTHERS","FIFTH"
...
Method to run crime report
Output - top
Output - bottom
Improvements
● Allow user-specified order for values, e.g.
FIRST, SECOND, THIRD
● Other means of tabulating
● Keeping track of blank values
● Summing counts in columns
● ...
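The first improvement, a user-specified order, could be sketched with a sort key. Everything here is hypothetical (the district_order list, the seen set, and the choice to sort unknown values last); Python 3 syntax.

```python
# Hypothetical user-specified order for the DISTRICT column.
district_order = ["FIRST", "SECOND", "THIRD", "FOURTH", "FIFTH"]
rank = {name: position for position, name in enumerate(district_order)}

# Made-up set of values found in the data, including one not in the order.
seen = {"THIRD", "FIRST", "FIFTH", "SECOND", "FOURTH", "UNKNOWN"}

# Values missing from the order sort last, alphabetically among themselves.
ordered = sorted(seen, key=lambda value: (rank.get(value, len(rank)), value))

print(ordered)
```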
https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg
Links
This talk: http://www.slideshare.net/pargery/mnh-csv-python
● https://github.com/pargery/csv_utils2
● Also some notes in http://margerytech.blogspot.com/
Info on Data Structures
● http://rhodesmill.org/brandon/slides/2014-04-pycon/data-structures/
● http://nedbatchelder.com/text/iter.html
DC crime stats
● http://data.octo.dc.gov/
“The data made available here has been modified for use from its original source, which is the Government of the
District of Columbia. Neither the District of Columbia Government nor the Office of the Chief Technology Officer
(OCTO) makes any claims as to the completeness, accuracy or content of any data contained in this application;
makes any representation of any kind, including, but not limited to, warranty of the accuracy or fitness for a
particular use; nor are any such warranties to be implied or inferred with respect to the information or data
furnished herein. The data is subject to change as modifications and updates are complete. It is understood that
the information contained in the web feed is being used at one's own risk.”

