Data Cleaning: Your Questions
• At what point in the data maintenance process do you find yourself cleaning data?
• Are there ways that you would like to improve the workflow?
Cleaning & Preparing Data
• Making sense of data starts at the point of collection
  – Define what you want to measure / track
    • Clearly define schema and fields
  – Have a shared meaning for values
  – Data validation on entry
  – Collect your data
  – Examine results
    • Are there common mistakes you could prevent?
    • Are there different interpretations of fields?
  – Close the feedback loop & iterate
Cleaning & Preparing Data
• Common data quality issues
  – Combined fields
    • Address: “340 N 12th St, Suite 402, Philadelphia, PA 19107”
  – Invalid entries
    • ZIP code: 1234 (length check, is number)
    • Age: 204 (reasonable range check, is number)
  – Format variations
    • State: PA vs. Pennsylvania (drop down or scrubbing rules)
  – Duplicates
    • CRM: John Smith with old and new addresses
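The checks named on this slide (length check, is number, reasonable range, scrubbing rules, splitting a combined field) can be sketched in a few lines of Python. This is a minimal illustration, not a complete validator: the function names, the 0-120 age range, the three-entry state lookup, and the assumed "street, [unit,] city, ST ZIP" address layout are all assumptions made for the example.

```python
import re

# Hypothetical scrubbing rules: spelled-out state names mapped to abbreviations.
# A real lookup would cover every state and territory.
STATE_ABBREVIATIONS = {"pennsylvania": "PA", "new jersey": "NJ", "delaware": "DE"}


def is_valid_zip(value):
    """Length check + is-number check: exactly five ASCII digits."""
    return re.fullmatch(r"[0-9]{5}", value.strip()) is not None


def is_valid_age(value, low=0, high=120):
    """Is-number check + reasonable range check (range is an assumption)."""
    text = str(value).strip()
    if re.fullmatch(r"[0-9]+", text) is None:
        return False
    return low <= int(text) <= high


def normalize_state(value):
    """Return a two-letter abbreviation, or None if the value is unrecognized."""
    text = value.strip()
    if len(text) == 2 and text.isalpha():
        return text.upper()
    return STATE_ABBREVIATIONS.get(text.lower())


def split_address(value):
    """Split 'street, [unit,] city, ST ZIP' into separate fields.

    Returns None when the value does not fit that assumed layout.
    """
    parts = [p.strip() for p in value.split(",")]
    if len(parts) not in (3, 4):
        return None
    state_zip = parts[-1].split()
    if len(state_zip) != 2:
        return None
    return {
        "street": parts[0],
        "unit": parts[1] if len(parts) == 4 else None,
        "city": parts[-2],
        "state": state_zip[0],
        "zip": state_zip[1],
    }


if __name__ == "__main__":
    print(is_valid_zip("1234"))             # too short
    print(is_valid_age("204"))              # outside the assumed range
    print(normalize_state("Pennsylvania"))
    print(split_address("340 N 12th St, Suite 402, Philadelphia, PA 19107"))
```

Running the same checks at the point of entry (a web form, for instance) prevents the bad values from being stored in the first place, which is cheaper than scrubbing them later.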
Cleaning & Preparing Data
Not a reasonable option
What does this have to do with trees?
• We track things - tree inventories, potential planting sites, community groups, people who requested trees, etc.
• Data comes from lots of places - web forms, collected by various staff, submitted by community groups.
• None of it matches.
• Good data makes our lives easier.
Cleaning & Preparing Data
• Tools to clean tabular data
  – Excel (or open source equivalent)
    • Pros:
      – Broad features
      – Widely utilized / common skill
      – Formulas / sorting / flexible
    • Cons:
      – Doesn’t understand record concept
      – Mass changes can be tedious
Cleaning & Preparing Data
• Tools to clean tabular data
  – DataWrangler
    • http://vis.stanford.edu/wrangler/
    • Pros:
      – Focused on transforming data into relational format
      – Live previews
    • Cons:
      – Alpha quality version
      – Data size limits / online tool
      – Can be difficult to figure out what set of transforms is needed
Cleaning & Preparing Data
• Tools to clean tabular data
  – Google Refine
    • http://code.google.com/p/google-refine/
    • Pros:
      – Understands record concept
      – Formulas / Facets
      – Undo capability
      – Windows / Mac / Linux
    • Cons:
      – There is a learning curve
      – Unusual type of app
        » Download, unzip, run exe file, access through browser
Context: Your Questions
• What challenges have you faced putting your data in context?
• Are you struggling to identify what “context” means for your organization?
• Do you know what data you’d like to use, but have trouble finding it?
Your Data in Context
• Your data is essential!
• But it is more meaningful in context…
  – Ratios & rates
    • Service level
    • Market penetration
  – Indicators & trends
    • How you compare
  – Targeting
    • Key demographics
    • Custom summaries
Juice Analytics
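As a small illustration of "ratios & rates", the sketch below turns a raw count into a rate per 1,000 residents and a penetration percentage. The function names and every figure in the usage example are made up for illustration; the slides do not specify a formula.

```python
def rate_per(count, population, per=1000):
    """Express a raw count as a rate per `per` people.

    Returns None when the population is missing or zero, since a rate
    is undefined in that case.
    """
    if not population or population <= 0:
        return None
    return count * per / population


def penetration_pct(reached, total):
    """Share of a target group that has been reached, as a percentage."""
    if not total or total <= 0:
        return None
    return reached * 100 / total


if __name__ == "__main__":
    # Hypothetical figures: 150 trees planted in an area of 12,000 residents,
    # and 300 of 4,000 households that have requested a tree.
    print(rate_per(150, 12000))        # trees per 1,000 residents
    print(penetration_pct(300, 4000))  # percent of households reached
```

The raw count alone (150 trees) says little; dividing by population makes areas of different sizes comparable.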
What does this have to do with trees?
• Trees don’t exist in a vacuum.
• Contextual data = more effective outreach.
• More info gives you new insights.
Making Sense of the Census
• American FactFinder
• http://factfinder2.census.gov
  – Decennial Census
    • Every 10 years
    • Full population survey
    • Just 10 questions
  – American Community Survey (ACS)
    • Monthly sample
    • Aggregated over different time periods (1-, 3- and 5-year)
    • Extremely detailed questions
    • Subject to sampling error
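Because ACS figures are subject to sampling error, each estimate is published with a margin of error (MOE) at the 90 percent confidence level. The sketch below shows one common way to work with those margins: converting an MOE to a standard error, computing a coefficient of variation as a reliability check, and approximating the MOE of a sum of estimates. The function names are hypothetical, and the root-sum-of-squares rule is an approximation that ignores any correlation between the estimates.

```python
import math

# z-score for the 90 percent confidence level used by published ACS margins of error
Z_90 = 1.645


def standard_error(moe):
    """Convert a 90 percent margin of error to a standard error."""
    return moe / Z_90


def coefficient_of_variation(estimate, moe):
    """Standard error relative to the estimate; smaller means more reliable.

    Returns None for a zero estimate, where the ratio is undefined.
    """
    if estimate == 0:
        return None
    return standard_error(moe) / abs(estimate)


def combined_moe(moes):
    """Approximate MOE for a sum of estimates (root sum of squares)."""
    return math.sqrt(sum(m * m for m in moes))


if __name__ == "__main__":
    # Hypothetical block-group estimates and margins, for illustration only.
    print(coefficient_of_variation(1000, 164.5))
    print(combined_moe([30, 40]))
```

Small geographies such as block groups tend to have large margins relative to their estimates, so checking the coefficient of variation before mapping a variable is a reasonable habit.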
Helpers: ACS Alchemist
• https://github.com/azavea/acs-alchemist
• Retrieval of block group-level data
• Custom variable selection
• Delivery in spatial data format ready for mapping
This tool was developed by Azavea in collaboration with Jerry Ratcliffe and Ralph Taylor of Temple University Center for Security and Crime Science. This project was supported by Award No. 2010-DE-BX-K004, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice.
Helpers: ACS Alchemist
As easy as 1-2-3
1. Create a document with your selected variables
2. Pick your geographies and geolevels
3. Retrieve your shapefiles
Other Sources
• Public data
  – Open Data Portals
    • Federal, state & local data
  – Political Data
    • Voter data
    • Legislative boundaries
• Commercial data
  – Population Projections
  – Consumer Data
Data Visualization: Your Questions
• Do you currently share data with your constituents?
• Where do you use data visualizations (e.g. annual report, embedded infographics, live data trackers)?
• Do you currently map your data?
What does this have to do with trees?
• Charts, graphs, maps, and photos help us tell a story.
• Show that trees are more than just leaves and branches.
• Explore the science without making people’s eyes glaze over.
Exploring Data
• Visualization tools
  – Tableau
    • http://www.tableausoftware.com/
    • Pros:
      – Flexible interface makes data exploration easy
      – Fast even on large data sets
    • Cons:
      – Easy to visualize something that doesn’t make sense to look at
      – Price (for desktop tool)