Mnh csv python

Building flexible tools to store
sums and report on CSV data
Presented by
Margery Harrison
Audience level: Novice
09:45 AM - 10:45 AM
August 17, 2014
Room 704

Python Flexibility
● Basic, Fortran, C, Pascal, Javascript,...
● At some point, there's a tendency to think
the same way, and just translate it
● You can write Python as if it were C
● Or you can take advantage of Python's
special data structures.
● The second option is a lot more fun.

Using Python data structures to
report on CSV data
● Lists
● Sets
● Tuples
● Dictionaries
● CSV Reader
● DictReader
● Counter

Also,
● Using tuples as dictionary keys
● Using enumerate() to count how many
times you've looped
– See “Loop like a Native”
http://nedbatchelder.com/text/iter.html

Code Development Method
● Start with simplest possible version
● Test and validate
● Iterative improvements
– Make it prettier
– Make it do more
– Make it more general

This is a CSV file
color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
blue,small,triangle,1
red,big,square,7
blue,small,triangle,3

https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg

http://i239.photobucket.com/albums/ff263/peacelovebones/two-
pandas-rolling-1.jpg

CSV DictReader
>>> import csv
>>> import os
>>> with open("simpleCSV.txt") as f:
... r=csv.DictReader(f)
... for row in r:
... print row
...

How many of each?
● It's nice to have a listing that shows the
variety of objects that can appear in each
column.
● Next, we'd like to count how many of each
● And guess what? Python has a special data
structure for that.

Counter + DictReader
Let's use counters to tell us how many of each
value was in each column.

Output
color
blue : 3
green : 1
red : 2
shape
square : 3
triangle: 3
number
1 : 1
3 : 2
2 : 1
5 : 1
7 : 1
size
small : 3
big : 3

You might ask, why not this?
for row in r:
for head in r.fieldnames:
field_value = row[head]
possible_values[head].add(field_value)
#count_of_values.update(row[head])
count_of_values.update(field_value)
print count_of_values

Because
Counter({'e': 13, 'l': 12, 'a': 9, 'r': 9, 'g': 7, 'b': 6, 'i': 6, 's':
6, 'u': 6, 'n': 4, 'm': 3, 'q': 3, 't': 3, 'd': 2, '3': 2, '1': 1, '2':
1, '7': 1, '5': 1})
color
blue : 0
green : 0
red : 0
shape
square : 0
triangle: 0
number
1 : 1
3 : 2
2 : 1
5 : 1
7 : 1
size
small : 0
big : 0

How many red squares?
● We can use tuples as an index into the
counter
– (red,square)
– (big,red,square)
– (small,blue,triangle)
– (small,square)

Let's use a simpler CSV
color,size,shape
red,big,square
blue,big,triangle
green,small,square
blue,small,triangle
red,big,square
blue,small,triangle

Counting Tuples
trying to use magic update()
>>> c=collections.Counter([('a,b'),('c,d,e')])
>>> c
Counter({'a,b': 1, 'c,d,e': 1})
>>> c.update(('a','b'))
>>> c
Counter({'a': 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})
>>> c.update((('a','b'),))
>>> c
Counter({'a': 1, ('a', 'b'): 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})

Oh well
>>> c.update([(('a','b'),)])
>>> c
Counter({'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e': 1, 'a,b': 1,
('a', 'b'): 1})
>>> c[('a','b')]
1
>>> c[('a','b')]+=5
>>> c
Counter({('a', 'b'): 6, 'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e':
1, 'a,b': 1})

Combo Count Part 1: Initialize

Combo Count Output
color
blue : 3
3 blue in 1 combinations:
('blue', 'big'): 1
('blue', 'small'): 2
3 blue in 2 combinations:
('blue', 'big', 'triangle'): 1
('blue', 'small', 'triangle'): 2
green : 1
1 green in 1 combinations:
('green', 'small'): 1
1 green in 2 combinations:
('green', 'small', 'square'): 1
red : 2
2 red in 1 combinations:
('red', 'big'): 2
2 red in 2 combinations:
('red', 'big', 'square'): 2
shape
square : 3
3 square in 1 combinations:
3 square in 2 combinations:
triangle: 3
3 triangle in 1 combinations:
3 triangle in 2 combinations:
('blue', 'big', 'triangle'): 1
size
small : 3
3 small in 1 combinations:
('blue', 'small'): 2
('green', 'small'): 1
3 small in 2 combinations:
big : 3
3 big in 1 combinations:
('blue', 'big'): 1
('red', 'big'): 2
3 big in 2 combinations:
('blue', 'big', 'triangle'):
1

Well, that's ugly
● We need to make it prettier
● We need to write out to a file
● We need to break things up into Classes

Printing Combination Levels
Number of Squares
Number of Red Squares
Number of Blue Squares
Number of Triangles
Number of Red Triangles
Number of Blue Triangles
Total Red
Total Blue

Indentation per level
● If we're indexing by tuple, then the
indentation level could correspond to the
number of items in the tuple.
● Let's have general methods to format the
indentation level, given the number of
items in the tuple, or input 'level' integer

A class write_indent() method
If part of class with counter and msgs dict,
just pass in the tuple:
def write_indent(self, tup_index):
'''
:param tup_index: tuple index into counter
'''
indent = ' ' * len(tup_index)
msg = self.msgs[tup_index]
sum = self.counts[tup_index]
indented_msg = ('{0:s}{1:s}'.format(
indent, msg, sum)

class-less indent_message()
def indent_message(level, msg, sum,
space_per_indent=2, space=' '):
num_spaces = self.space_per_indent * level
indent = space * num_spaces
# We'll want to tune the formatting..
indented_msg = ('{0:s}{1:s}:{2:d}'.format(
indent, msg, sum)
return indented_msg

Adjustable field widths
Depending on data, we'll want different
field widths
red squares 5
Blue squares 21
Large Red Squares in the Bronx 987654321

Using format to format a format
string
>>> f='{{0:{0:d}s}}'.format(3)
>>> f
'{0:3s}'
>>> f='{{0:{0:d}s}}{{1:{1:d}d}}'.format(3,5)
>>> f
'{0:3s}{1:5d}'
>>> f='{{0:s}}{{1:{0:d}s}}{{2:{1:d}d}}'.format(3,5)
>>> f
'{0:s}{1:3s}{2:5d}'

Format 3 values
● Our formatting string will print 3 values:
– String of space chars: {0:s}
– Message: {1:[msg_width]s}
– Sum: Right justified {2:-[sum_width]d}

Class For Flexible Indentation

Flexible Indent Class Variables

SimpleCSVReporter
● Open a CSV File
● Create
– Set of possible values
– Set of possible tuples
– Counter indexed by each value & tuple
● Use IndentMessages to format output lines

readCSV() begins
initialize sets..

readCSV() continued:
Loop to collect & sum

Using recursion for limitless
indentation

A bigger CSV file
"CCN","REPORTDATETIME","SHIFT","OFFENSE","METHOD","BLOCKSIT
EADDRESS","WARD","ANC","DISTRICT","PSA","NEIGHBORHOODCL
USTER","BUSINESSIMPROVEMENTDISTRICT","VOTING_PRECINCT",
"START_DATE","END_DATE"
4104147,"4/16/2013 12:00:00
AM","MIDNIGHT","HOMICIDE","KNIFE","1500 - 1599 BLOCK OF 1ST
STREET SW",6,"6D","FIRST",105,9,,"Precinct 127","7/27/2004 8:30:00
PM","7/27/2004 8:30:00 PM"
5047867,"6/5/2013 12:00:00 AM","MIDNIGHT","SEX ABUSE","KNIFE","6500
- 6599 BLOCK OF PINEY BRANCH ROAD
NW",4,"4B","FOURTH",402,17,,"Precinct 59","4/15/2005 12:30:00 PM",
● From http://data.octo.dc.gov/

Deleted all but 4 columns
"SHIFT","OFFENSE","METHOD","DISTRICT"
"MIDNIGHT","HOMICIDE","KNIFE","FIRST"
"MIDNIGHT","SEX ABUSE","KNIFE","FOURTH"
...
"DAY","THEFT/OTHER","OTHERS","SECOND"
"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"
"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"
"EVENING","BURGLARY","OTHERS","FIFTH"
...

Improvements
● Allow user-specified order for values, e.g.
FIRST, SECOND, THIRD
● Other means of tabulating
● Keeping track of blank values
● Summing counts in columns
● ...

Links
This talk: http://www.slideshare.net/pargery/mnh-csv-python
● https://github.com/pargery/csv_utils2
● Also some notes in http://margerytech.blogspot.com/
Info on Data Structures
● http://rhodesmill.org/brandon/slides/2014-04-pycon/data-structures/
● http://nedbatchelder.com/text/iter.html
DC crime stats
● http://data.octo.dc.gov/
“The data made available here has been modified for use from its original source, which is the Government of the
District of Columbia. Neither the District of Columbia Government nor the Office of the Chief Technology Officer
(OCTO) makes any claims as to the completeness, accuracy or content of any data contained in this application;
makes any representation of any kind, including, but not limited to, warranty of the accuracy or fitness for a
particular use; nor are any such warranties to be implied or inferred with respect to the information or data
furnished herein. The data is subject to change as modifications and updates are complete. It is understood that
the information contained in the web feed is being used at one's own risk."

Mnh csv python

Recommended

Recommended

More Related Content

Similar to Mnh csv python

Similar to Mnh csv python (20)

Recently uploaded

Recently uploaded (20)

Mnh csv python