Python for Open Data Lovers:                    Explore It, Analyze It, Map It                           Jackie Kazil    D...
Saturday, March 10, 2012
Saturday, March 10, 2012
Saturday, March 10, 2012
Saturday, March 10, 2012
Where are we going?                    • open data everywhere                    • a data swiss army knife                ...
•        Data.gov                  •        OpenDataPhilly                  •        DC Data Catalog                  •   ...
assembly member expenses                                                        bicycle lanes                             ...
Saturday, March 10, 2012
http://bit.ly/DCdatafailSaturday, March 10, 2012
Saturday, March 10, 2012
• What are DC agencies spending money on?                    • How much are they spending?                    • What are t...
Saturday, March 10, 2012
swiss army knife                    •      csvkit: http://csvkit.readthedocs.org/                    •      a set of Pytho...
$       csvcut -n purchase2011_cleaned.csv                  1: PO_NUMBER                  2: AGENCY_NAME                  ...
$   csvcut -c 2,6 purchase2011_cleaned.csv | csvstat               1. AGENCY_NAME           !    <type unicode>           ...
$   csvgrep -c 6 -r ^MAYA purchase2011_cleaned.csvPO_NUMBER,AGENCY_NAME,NIGP_DESCRIPTION,PO_TOTAL_AMOUNT,ORDER_DATE,SUPPLI...
$ csvcut -c 4,2,6,5 purchase2011_cleaned.csv | csvsort -r | head -n 20 | csvlook -----------------------------------------...
Social Network Analysis                       “Social network analysis is focused on                       uncovering the ...
99th House                               President: Reagan                               House majority: Democrats        ...
107th House                               President: Bush                               House majority: Republicans       ...
108th House                   President: Bush                   House majority: Republicans                   Years: 2003,...
109th House                                President: Bush                                House majority: Republicans     ...
110th House                                President: Bush                                House majority: Democrats       ...
111th House                                President: Obama                                House majority: Democrats      ...
CSV to network                import networkx as nx                G = nx.Graph()                node_edgelist = []       ...
Centrality Analysis (networkx)                Degree - nx.degree(G)                # of connections; More connections = mo...
Centrality Analysis (networkx)Saturday, March 10, 2012
Centrality Analysis (networkx)                Digi Docs Inc Document Mangers (Dallas)                “Offers software that...
Centrality Analysis (networkx)                           Not included in previous slide...                            Unit...
Visual the network                pos=nx.spring_layout(G,iterations=100)                plot.figure(1,figsize=(15,15))      ...
Visual the networkSaturday, March 10, 2012
Trimming nodes                g2 = G.copy()                d = nx.degree(g2)                for n in g2.nodes():          ...
Degree Distribution                               d=nx.degree(G)                               plot.figure(1,figsize=(15,10)...
Degree DistributionSaturday, March 10, 2012
Degree DistributionSaturday, March 10, 2012
Trimmed nodesSaturday, March 10, 2012
Adding labelsSaturday, March 10, 2012
nx.draw_networkx_labels                           (g3,pos,alpha=1)                           nx.draw_networkx_edges       ...
Maps to mapsSaturday, March 10, 2012
Spatial is special                    • spatial data = attributes, location, time                    • mappable!          ...
Spatial analysis                    • large data sets       a smaller amount of                           meaningful infor...
Techniques                    • point pattern analysis -- hot spots, k                           density, nearest neighbor...
Saturday, March 10, 2012
Saturday, March 10, 2012
Saturday, March 10, 2012
Saturday, March 10, 2012
Saturday, March 10, 2012
Saturday, March 10, 2012
Saturday, March 10, 2012
Saturday, March 10, 2012
PySAL                    •      GeoDa Center at ASU                    •      Python library for spatial analysis, with mo...
PySAL                    •      developers looking for spatial analytical methods                           to incorporate...
Saturday, March 10, 2012
Next steps                    •      quantify clusters in city, region, nation                    •      examine clusters ...
From data analysis to storiesSaturday, March 10, 2012
Which stories would we go                                     after?                    • construction contracts          ...
Want to learn more?          The SAGE Handbook of          Spatial Analysis          eds. A. Stewart Fotheringham and     ...
And even more?   NetworkX tutorial   http://networkx.lanl.gov/   networkx_tutorial.pdf   UCD Dublin summer course   http:/...
Upcoming SlideShare
Loading in...5
×

PyCon 2012: Python for data lovers: explore it, analyze it, map it

1,248

Published on

Slides from Pycon 2012. Speakers: Jacqueline Kazil , Dana Bauer
More info: https://us.pycon.org/2012/schedule/presentation/426/ | Video: http://pyvideo.org/video/676/python-for-data-lovers-explore-it-analyze-it-m

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,248
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
18
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "PyCon 2012: Python for data lovers: explore it, analyze it, map it "

  1. 1. Python for Open Data Lovers: Explore It, Analyze It, Map It Jackie Kazil Dana Bauer @jackiekazil @geography76Saturday, March 10, 2012
  2. 2. Saturday, March 10, 2012
  3. 3. Saturday, March 10, 2012
  4. 4. Saturday, March 10, 2012
  5. 5. Saturday, March 10, 2012
  6. 6. Where are we going? • open data everywhere • a data swiss army knife • finding network patterns • finding spatial patterns • which stories to pursue? moving beyond data analysisSaturday, March 10, 2012
  7. 7. • Data.gov • OpenDataPhilly • DC Data Catalog • DataSF • Chicago Data Portal • NYC Open Data • London DatastoreSaturday, March 10, 2012
  8. 8. assembly member expenses bicycle lanes city purchase orders dialysis centers elevation data filming locations Google Transit Feed Specification (GTFS) historical photos influenza rates judicial districts Key Stage 2 test results by free school meal eligibility land cover monthly calls to Human Services Agency switchboard operators neighborhood health clinics Oyster ticket stop locations political districts quality of life indicators restaurant inspections sewer lines traffic counts utility excavation and paving five-year plan violent crime incidents ward offices youth centers zoning **real-time parking availability and pricing**Saturday, March 10, 2012
  9. 9. Saturday, March 10, 2012
  10. 10. http://bit.ly/DCdatafailSaturday, March 10, 2012
  11. 11. Saturday, March 10, 2012
  12. 12. • What are DC agencies spending money on? • How much are they spending? • What are the relationships between businesses and agencies? • Where are these businesses located?Saturday, March 10, 2012
  13. 13. Saturday, March 10, 2012
  14. 14. swiss army knife • csvkit: http://csvkit.readthedocs.org/ • a set of Python utilities for working with csv • meant to replace csv module • pip install csvkit (no issues!)Saturday, March 10, 2012
  15. 15. $ csvcut -n purchase2011_cleaned.csv 1: PO_NUMBER 2: AGENCY_NAME 3: NIGP_DESCRIPTION 4: PO_TOTAL_AMOUNT 5: ORDER_DATE 6: SUPPLIER 7: SUPPLIER_FULL_ADDRESS ! ! !Saturday, March 10, 2012
  16. 16. $ csvcut -c 2,6 purchase2011_cleaned.csv | csvstat 1. AGENCY_NAME ! <type unicode> ! Nulls: False ! Unique values: 85 ! 5 most frequent values: ! ! DISTRICT OF COLUMBIA PUBLIC SCHOOLS:!2410 ! ! STATE SUPERINTENDENT OF EDUCATION (OSSE):! 1340 ! ! DEPARTMENT OF HEALTH:! 895 ! ! OFFICE OF CHIEF TECHNOLOGY OFFICER:! 786 ! ! OFF PUBLIC ED FACILITIES MODERNIZATION:!722 ! Max length: 40 2. SUPPLIER ! <type unicode> ! Nulls: False ! Unique values: 4357 ! 5 most frequent values: ! ! OST, INC.:! 841 ! ! DELL COMPUTER CORP.:! 366 ! ! AMERICAN EXPRESS COMPANY:! 282 ! ! MVS, INC.:! 176 ! ! CAPITAL SERVICES AND SUPPLIES:! 167 ! Max length: 52 Row count: 16075 ! ! !Saturday, March 10, 2012
  17. 17. $ csvgrep -c 6 -r ^MAYA purchase2011_cleaned.csvPO_NUMBER,AGENCY_NAME,NIGP_DESCRIPTION,PO_TOTAL_AMOUNT,ORDER_DATE,SUPPLIER,SUPPLIER_FULL_ADDRESSPO352244,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,408644.73,01/04/2011,MAYA ANGELOUPCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO352652,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,111679.16,01/07/2011,MAYA ANGELOUPCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO352920,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2205630.13,01/11/2011,MAYA ANGELOU PCS,"18519TH STREET NW, WASHINGTON, DC, 20001"PO355150,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,391092.49,02/07/2011,MAYA ANGELOUPCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO356426,STATE SUPERINTENDENT OF EDUCATION (OSSE),FINANCIAL SERVICES (NOT OTHERWISE CLASSIFIED)49,999891,02/23/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO356632,STATE SUPERINTENDENT OF EDUCATION (OSSE),PROFESSIONAL SERVICES (NOT OTHERWISE CLASSIFIED)58,187200,02/25/2011,MAYA ANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO359961,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,1753238,04/12/2011,MAYA ANGELOU PCS,"18519TH STREET NW, WASHINGTON, DC, 20001"PO360284,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,110729.88,04/14/2011,MAYA ANGELOUPCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO361203,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,92617.32,04/28/2011,MAYA ANGELOUPCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO351462-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATIONAL RESEARCH SERVICES 19,152229.95,05/05/2011,MAYA ANGELOUPCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO364208,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,118825.51,06/09/2011,MAYA ANGELOUPCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO366839,PUBLIC CHARTER SCHOOLS,SCHOOL OPERATION AND MANAGEMENT SERVICES 71,2767027,07/12/2011,MAYA ANGELOU PCS,"18519TH STREET NW, WASHINGTON, DC, 20001"PO365094-V2,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,98092.35,08/15/2011,MAYA ANGELOU PCS,"18519TH STREET NW, WASHINGTON, DC, 20001"PO370948,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,45736.58,08/25/2011,MAYA ANGELOU PCS,"1851 9THSTREET NW, WASHINGTON, DC, 20001"PO361027-V5,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,29424.86,09/06/2011,MAYAANGELOU PCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO374132,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,9000,09/28/2011,MAYA ANGELOU PCS,"1851 9THSTREET NW, WASHINGTON, DC, 20001"PO377919,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,491663.6,10/25/2011,MAYA ANGELOU PCS,"1851 9THSTREET NW, WASHINGTON, DC, 20001"PO381219,STATE SUPERINTENDENT OF EDUCATION (OSSE),EDUCATION AND TRAINING CONSULTING 38,120188.81,11/29/2011,MAYA ANGELOUPCS,"1851 9TH STREET NW, WASHINGTON, DC, 20001"PO383965,STATE SUPERINTENDENT OF EDUCATION (OSSE),YOUTH CARE SERVICES 95,294690.57,12/22/2011,MAYA ANGELOU PCS,"1436 USTREET, NW SUITE 203, WASHINGTON, DC, 20009"! ! !Saturday, March 10, 2012
  18. 18. $ csvcut -c 4,2,6,5 purchase2011_cleaned.csv | csvsort -r | head -n 20 | csvlook ------------------------------------------------------------------------------------------------------------ | PO_TOTAL_AMOUNT | AGENCY_NAME | SUPPLIER | ORDER_DATE | ------------------------------------------------------------------------------------------------------------ | 154133337.02 | DEPARTMENT OF TRANSPORTATION | SKANSKA-FACCHINA JV | 2011-11-10 | | 62677473.88 | DEPARTMENT OF REAL ESTATE SERVICES | EEC OF DC INC-FORRESTER CONSTR | 2011-09-22 | | 31809425.48 | DEPARTMENT OF HEALTH | DEFENSE LOGISTIC AGENCY | 2011-09-08 | | 23600580.0 | DEPARTMENT OF CORRECTIONS | UNITY HEALTH CARE, INC. | 2011-10-24 | | 23538552.0 | DEPARTMENT OF REAL ESTATE SERVICES | EEC-FORRESTER ANACOSTIA | 2011-11-08 | | 22375314.45 | DEPARTMENT OF CORRECTIONS | CORRECTIONS CORPORATION OF | 2011-05-25 | | 21450000.04 | DEPARTMENT OF HUMAN SERVICES | THE COMMUNITY PARTNERSHIPHOME | 2011-08-18 | | 20813348.99 | DEPARTMENT OF REAL ESTATE SERVICES | THE JOHN AKRIDGE CO | 2011-06-28 | | 20622000.0 | DEPARTMENT OF TRANSPORTATION | W M SCHLOSSER CO INC | 2011-08-29 | | 19824914.0 | DEPARTMENT OF CORRECTIONS | CORRECTIONS CORPORATION OF | 2011-10-24 | | 18300956.56 | DEPARTMENT OF HUMAN SERVICES | THE COMMUNITY PARTNERSHIPHOME | 2011-11-29 | | 18104339.98 | DEPARTMENT OF HUMAN SERVICES | THE COMMUNITY PARTNERSHIPHOME | 2011-05-17 | | 18000000.0 | DEPARTMENT OF HEALTH | DC PRIMARY CARE ASSOCIATION | 2011-03-10 | | 17000000.0 | DEPARTMENT OF HEALTH | CHILDRENS NATIONAL MEDICAL CTR | 2011-11-25 | | 16850000.0 | DEPUTY MAYOR FOR ECONOMIC DEVELOPMENT | 2 M STREET REDEVELOPMENT LLC | 2011-09-29 | | 16333257.33 | DEPARTMENT OF HUMAN SERVICES | THE COMMUNITY PARTNERSHIPHOME | 2011-06-02 | | 14206937.0 | PUBLIC CHARTER SCHOOLS | FRIENDSHIP PCS | 2011-07-12 | | 13862557.44 | MUNICIPAL FACILITIES: NON-CAPITAL | US SECURITY ASSOCIATES, INC. | 2011-10-07 | | 13800000.0 | DISTRICT DEPARTMENT OF THE ENVIRONMENT | VERMONT ENERGY INVESTMENT CORP | 2011-10-04 | ------------------------------------------------------------------------------------------------------------ ! ! !Saturday, March 10, 2012
  19. 19. Social Network Analysis “Social network analysis is focused on uncovering the patterning of peoples interaction.” - http://www.insna.org/sna/what.htmlSaturday, March 10, 2012
  20. 20. 99th House President: Reagan House majority: Democrats Years: 1985, 1986Saturday, March 10, 2012
  21. 21. 107th House President: Bush House majority: Republicans Years: 2001, 2002Saturday, March 10, 2012
  22. 22. 108th House President: Bush House majority: Republicans Years: 2003, 2004Saturday, March 10, 2012
  23. 23. 109th House President: Bush House majority: Republicans Years: 2005, 2006Saturday, March 10, 2012
  24. 24. 110th House President: Bush House majority: Democrats Years: 2007, 2008Saturday, March 10, 2012
  25. 25. 111th House President: Obama House majority: Democrats Years: 2009, 2010Saturday, March 10, 2012
  26. 26. CSV to network import networkx as nx G = nx.Graph() node_edgelist = [] # grab edges for row in csv_file: node_edgelist.append((n,e)) # create edges for f in node_edgelist: for t in node_edgelist: if t != f: add_edge_or_weight(G, f[0], t[0])Saturday, March 10, 2012
  27. 27. Centrality Analysis (networkx) Degree - nx.degree(G) # of connections; More connections = more important Closeness centrality nx.closeness_centrality(G) Distance to all other nodes; Closer = more important Betweenness centrality nx.betweenness_centrality(G) Based on the shortest path of info control Page rank nx.pagerank(G) Node gains importance via the importance around himSaturday, March 10, 2012
  28. 28. Centrality Analysis (networkx)Saturday, March 10, 2012
  29. 29. Centrality Analysis (networkx) Digi Docs Inc Document Mangers (Dallas) “Offers software that generates loan documents for electronic delivery.” Iron Mountain (Mountain View) “Iron Mountain provides information management services that help organizations lower the costs, risks and inefficiencies of managing their physical and digital data.” MVS, Inc. (Washington, DC) “MVS Consulting is an 8(a) STARS II, HUBZone, LSDBE, CBE, and MBE IT Solutions company that provides IT solutions to Federal, State and Local Government Agencies.” MDM OFFICE SYSTEMS INC (Washington, DC) "Standard Office Supply - Office Supplies, Furniture Dealer, Educational Products, Breakroom Supplies, Imaging Supplies, and Coffee Services" Capital Services and Supplies (Washington, DC) “CSSI is an office solutions firm located in Washington, DC since 1980. CSSI’s goods and services are available to commercial, government, and educational institutions throughout the continental United States.”Saturday, March 10, 2012
  30. 30. Centrality Analysis (networkx) Not included in previous slide... United States Postal Service & Dell Computer CorpSaturday, March 10, 2012
  31. 31. Visual the network pos=nx.spring_layout(G,iterations=100) plot.figure(1,figsize=(15,15)) plt.axis(off) nx.draw_networkx_nodes( G, pos,node_size=100, alpha=1, node_color=g ) nx.draw_networkx_edges(G,pos,alpha=0.2) plot.savefig(graph.png)Saturday, March 10, 2012
  32. 32. Visual the networkSaturday, March 10, 2012
  33. 33. Trimming nodes g2 = G.copy() d = nx.degree(g2) for n in g2.nodes(): if d[n] <= degree: g2.remove_node(n) return g2Saturday, March 10, 2012
  34. 34. Degree Distribution d=nx.degree(G) plot.figure(1,figsize=(15,10)) h=plot.hist(d.values(),100)Saturday, March 10, 2012
  35. 35. Degree DistributionSaturday, March 10, 2012
  36. 36. Degree DistributionSaturday, March 10, 2012
  37. 37. Trimmed nodesSaturday, March 10, 2012
  38. 38. Adding labelsSaturday, March 10, 2012
  39. 39. nx.draw_networkx_labels (g3,pos,alpha=1) nx.draw_networkx_edges (g3,pos,alpha=0.05)Saturday, March 10, 2012
  40. 40. Maps to mapsSaturday, March 10, 2012
  41. 41. Spatial is special • spatial data = attributes, location, time • mappable! • spatial data must be referenced in space • Tobler’s First Law of GeographySaturday, March 10, 2012
  42. 42. Spatial analysis • large data sets a smaller amount of meaningful information • exploratory (ESDA) • spatial statistics • mathematical modeling and prediction of spatial processesSaturday, March 10, 2012
  43. 43. Techniques • point pattern analysis -- hot spots, k density, nearest neighbor • spatial interpolation -- kriging • spatial regression -- ordinary least squares, geographically weighted regressionSaturday, March 10, 2012
  44. 44. Saturday, March 10, 2012
  45. 45. Saturday, March 10, 2012
  46. 46. Saturday, March 10, 2012
  47. 47. Saturday, March 10, 2012
  48. 48. Saturday, March 10, 2012
  49. 49. Saturday, March 10, 2012
  50. 50. Saturday, March 10, 2012
  51. 51. Saturday, March 10, 2012
  52. 52. PySAL • GeoDa Center at ASU • Python library for spatial analysis, with modules for exploratory spatial data analysis, spatial econometrics, and location modeling • http://code.google.com/p/pysal/ • requires NumPy, SciPySaturday, March 10, 2012
  53. 53. PySAL • developers looking for spatial analytical methods to incorporate in application development • analysts working on projects that require custom scripting • looking for a user-friendly GUI? Try STARS, GeoDA, GeoDASpace. • want to integrate into a powerful GIS? Look for plug-ins for ArcGIS & QGIS.Saturday, March 10, 2012
  54. 54. Saturday, March 10, 2012
  55. 55. Next steps • quantify clusters in city, region, nation • examine clusters along networks, business corridors • create beautiful, interactive maps and charts to allow users to explore spending patterns on their ownSaturday, March 10, 2012
  56. 56. From data analysis to storiesSaturday, March 10, 2012
  57. 57. Which stories would we go after? • construction contracts • funding to charter schools • health care costs in prisons • local vs. regional vs. national purchases • technology services -- look for overlapSaturday, March 10, 2012
  58. 58. Want to learn more? The SAGE Handbook of Spatial Analysis eds. A. Stewart Fotheringham and Peter A. Rogerson Interactive Spatial Data Analysis Trevor Bailey and Tony Gatrell Geographic Information Analysis David O’Sullivan and David Unwin PySAL Luc Anselin, GeoDA Center Arizona State University Mia, age 3, geographer in trainingSaturday, March 10, 2012
  59. 59. And even more? NetworkX tutorial http://networkx.lanl.gov/ networkx_tutorial.pdf UCD Dublin summer course http://mlg.ucd.ie/summer Social Network Analysis for Startups (OReilly Media) http://shop.oreilly.com/product/ 0636920020424.doSaturday, March 10, 2012
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×