Final_Presentation
1. KEY FACTORS AND JURISDICTION RECOMMENDATIONS
NewMet Bootcamp – Fall 2015
Final presentation
Ankoor Bhagat
2. Overview
• Definitions
• Data Sources
• Data Cleaning
• EDA
• Feature Engineering
• Modeling
• Recommendations
Note: Data and code are available at: https://github.com/ankoorb/NPO-Project
3. Definitions
• Objective – Identify factors affecting the garbage production rate
and make market-targeting recommendations
• Annual Per Capita Disposal Rate (PPD, pounds per person per day) –
Calculated as Disposal Tons × 2,000 lb / Population / 365
• 50% Target Per Capita Disposal Rate (PPD) – Calculation
• Take the jurisdiction-specific average of the 2003-2006 Per Capita
Generation Rates
• Divide that average by 2 to get the amount a jurisdiction would
have disposed of at exactly 50% diversion
• Indicators –
• Primary – Population of Jurisdiction (Per Resident Disposal)
• Secondary – Jurisdiction Industry Employment (Per Employee Disposal)
• Judging Criteria – To meet the 50% goal, a jurisdiction must
dispose of no more than its 50% Per Capita Disposal
Target
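The definitions above can be sketched in Python. This is a minimal illustration, not the project's code: the function names and the sample numbers are hypothetical.

```python
def annual_ppd(disposal_tons, population, days=365):
    """Per capita disposal in pounds per person per day (PPD)."""
    return disposal_tons * 2000 / population / days


def target_ppd(generation_rates_2003_2006):
    """50% target: half of the average 2003-2006 per capita generation rate."""
    average = sum(generation_rates_2003_2006) / len(generation_rates_2003_2006)
    return average / 2


def meets_goal(annual, target):
    """A jurisdiction meets the goal if it disposes of no more than its target."""
    return annual <= target


# Hypothetical jurisdiction, for illustration only.
example_annual = annual_ppd(disposal_tons=36_500, population=100_000)
example_target = target_ppd([4.0, 5.0, 6.0, 5.0])
example_ok = meets_goal(example_annual, example_target)
```

The same functions apply to the secondary indicator if industry employment is passed in place of population.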
5. Data Sources
• Disposal Rates: California Disposal Progress Report Year
2007 to 2013: http://www.calrecycle.ca.gov/LGCentral/Reports/jurisdiction/diversiondisposal.aspx
• CalRecycle Program Data: Program Counts by Status, Year
and Jurisdiction Data (2007-2013): CalRecycle
• Crime Data: Criminal Justice Statistics Center Statistics –
Crimes and Clearances (2005-2014): https://oag.ca.gov/crime/cjsc/stats/crimes-clearances
• Solar Data: California Solar Initiative – Working Dataset:
https://www.californiasolarstatistics.ca.gov/data_downloads/
• California City Area – Wikipedia:
https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California
• Building Construction Permit by Jurisdiction (2007-2013):
State of the Cities Data Systems Database: http://socds.huduser.gov/permits/
• Voter Registration Data: California Report of Registration
(2007-2013): http://www.sos.ca.gov/elections/report-registration/
6. Data Cleaning
• Changed Data Structure – Row to Column/ Column to Row
• Filtered data to select data between 2007 and 2013
• String manipulation
• str to int/float
• Removing unnecessary characters: , $ - * N 2,500- 250,000+
• Jurisdiction names: converted to lowercase, removed hyphens, and
manually corrected spellings so names match during the merge step
• Renamed columns – original names were very long. Key-Id data dictionary
• Replaced NaN with median values and in some cases with 0
• Merged data
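A rough sketch of these cleaning steps in plain Python follows. All helper names are hypothetical, and two choices are assumptions rather than facts from the project: hyphens in names are replaced by a space, and values with no digits left after stripping (such as "N", "-" or "*") are treated as missing.

```python
import re
from statistics import median


def to_number(raw):
    """Strip ',', '$', '*', '+' and whitespace; return a float, or None if not numeric."""
    cleaned = re.sub(r"[,$*+\s]", "", str(raw))
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. "N", "-", "*" or an empty cell


def normalize_name(name):
    """Lowercase a jurisdiction name and replace hyphens so names match on merge."""
    return re.sub(r"\s+", " ", name.lower().replace("-", " ")).strip()


def fill_missing_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    fill = median(present) if present else 0.0
    return [fill if v is None else v for v in values]


def merge_by_jurisdiction(left, right):
    """Inner join of two {jurisdiction name: row dict} tables on the cleaned name."""
    right_clean = {normalize_name(k): v for k, v in right.items()}
    merged = {}
    for name, row in left.items():
        key = normalize_name(name)
        if key in right_clean:
            merged[key] = {**row, **right_clean[key]}
    return merged
```

Names that still fail to match after normalization would need the manual spelling fixes mentioned above.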
7. Data Cleaning
• Initial Stats – 378 Jurisdictions and 946 Features
• After removal of 2007 to 2012 data - 378 Jurisdictions and
380 Features
• Feature Engineering - 378 Jurisdictions and 45 Features
(more on this later)
10. Feature Engineering
• Ethnic Diversity Index
• Voter Registration Rate
• Republican to Democratic Ratio
• Major Crime to Minor Crime Ratio
• Percent Violent Crime
• Total Crime/1000 Inhabitants
• Crime Index
• Theil Index (Household Income)
• Mean Logarithmic Deviation (Household Income)
• Household Income Ratio
• Income Index
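The two inequality measures in this list have standard textbook definitions, sketched below. The assumption here is that a list of positive income values is available for each jurisdiction; if the source data is bracketed household income, bracket midpoints would have to stand in, and the project may have computed them differently.

```python
import math


def theil_index(incomes):
    """Theil T index: mean of (x/mu) * ln(x/mu). Zero means perfect equality."""
    mean = sum(incomes) / len(incomes)
    return sum((x / mean) * math.log(x / mean) for x in incomes) / len(incomes)


def mean_log_deviation(incomes):
    """Mean logarithmic deviation: mean of ln(mu/x). Zero means perfect equality."""
    mean = sum(incomes) / len(incomes)
    return sum(math.log(mean / x) for x in incomes) / len(incomes)
```

Both functions require strictly positive incomes, because the logarithm is undefined at zero.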
11. Feature Engineering
• Per Capita Income Index
• Travel Time Index
• Median Income Index
• Male to Female Median Earning Income Index
• Residential Solar Units/Person
• Residential Solar Units/Household
• and a lot more…
13. Modeling
• Calculated the difference between Target Residential PPD and
Annual Residential PPD
• Difference discretized based on quantiles. Intervals and labels:
• [-1.9, 1.3] – Low
• (1.3, 2.4] – Fair
• (2.4, 3.6] – Good
• (3.6, 7052.8] – Excellent
• Clustering - looking at groups of jurisdictions that have similar
social, economic, housing, demographic, and political
characteristics
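The quantile-based labeling can be sketched with the standard library alone. This assumes four equal-frequency bins (quartiles) with right-closed intervals; the function name and the sample differences are hypothetical.

```python
from bisect import bisect_left
from statistics import quantiles

LABELS = ["Low", "Fair", "Good", "Excellent"]


def quartile_labels(differences):
    """Label each (target PPD - annual PPD) difference by the quartile it falls in."""
    # Three cut points split the data into four right-closed intervals.
    cuts = quantiles(differences, n=4, method="inclusive")
    # bisect_left counts the cut points strictly below the value.
    return [LABELS[bisect_left(cuts, d)] for d in differences]


# Hypothetical differences, for illustration only.
sample_differences = [-1.0, 0.5, 1.0, 2.0, 2.2, 3.0, 3.5, 4.0]
sample_labels = quartile_labels(sample_differences)
```

A larger difference means a jurisdiction disposes of less than its target, so the top quartile gets the "Excellent" label.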
14. Modeling
• K-Means Clustering – Checked silhouette scores (scores were closer
to 0, which indicates overlapping clusters)
• Used Linear Discriminant Analysis based on labels from K
Means to visually inspect clusters – n_clusters = 4 chosen
• Decision Tree to check Feature Importance
• Random Forest to check Feature Importance and
accuracy
• Cross Validation – 10 folds
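The pipeline above could look roughly like this in scikit-learn. It is a sketch on synthetic data, not the project's code: `make_classification` stands in for the 378 x 45 feature matrix, `y` stands in for the four Low/Fair/Good/Excellent labels, and it is an assumption that the random forest was trained on those labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 378 "jurisdictions", 45 features, 4 classes.
X, y = make_classification(n_samples=378, n_features=45, n_informative=6,
                           n_redundant=0, n_classes=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Silhouette score for several candidate cluster counts.
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)

# Fit the chosen k = 4 and project onto two LDA axes for visual inspection.
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, cluster_labels)

# Random forest: 10-fold cross-validated accuracy and feature importances.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
cv_accuracy = cross_val_score(forest, X, y, cv=10, scoring="accuracy")
forest.fit(X, y)
important = np.flatnonzero(forest.feature_importances_ > 0.05)
```

Silhouette scores range from -1 to 1; values near 1 indicate well-separated clusters, and values near 0 indicate clusters that overlap.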
15. Modeling
• It was still not clear what each cluster represented
• Selected Features with Feature Importance > 0.05
• Plotted PCA Biplots for different clusters
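A biplot overlays two things from a PCA: the sample scores (points) and the feature loadings (arrows). The sketch below computes both with scikit-learn on random stand-in data; the feature count of 6 is arbitrary and the plotting step is left out.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the selected features (rows = jurisdictions).
rng = np.random.default_rng(0)
X_selected = StandardScaler().fit_transform(rng.normal(size=(378, 6)))

pca = PCA(n_components=2)
scores = pca.fit_transform(X_selected)   # one (PC1, PC2) point per jurisdiction
loadings = pca.components_.T             # one (PC1, PC2) arrow per feature
explained = pca.explained_variance_ratio_
```

Plotting `scores` as a scatter colored by cluster, with `loadings` drawn as arrows from the origin, gives one biplot; repeating this on the rows of a single cluster gives a per-cluster view.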