KEY FACTORS AND JURISDICTION RECOMMENDATIONS
NewMet Bootcamp – Fall 2015
Final presentation
Ankoor Bhagat
Overview
• Definitions
• Data Sources
• Data Cleaning
• EDA
• Feature Engineering
• Modeling
• Recommendations
Note: Data & Codes available at: https://github.com/ankoorb/NPO-Project
Definitions
• Objective – Identify factors affecting the garbage production rate
and make market-targeting recommendations
• Annual Per Capita Disposal Rate (PPD, pounds per person per day) – Calculated as
Disposal Tons x 2,000 lb / Population / 365
• 50% Target Per Capita Disposal Rate (PPD) – Calculation
• Use the jurisdiction-specific average of the 2003-2006 Per Capita Generation
Rates
• Divide the average Per Capita Generation Rate by 2 to get the amount a
jurisdiction would have disposed of at exactly 50% diversion
• Indicators –
• Primary – Population of Jurisdiction (Per Resident Disposal)
• Secondary – Jurisdiction Industry Employment (Per Employee Disposal)
• Judging Criteria – To meet the 50% goal, a jurisdiction must
dispose of no more than its 50% Per Capita Disposal
Target
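The two PPD calculations and the judging criterion can be sketched in a few lines of Python. This is only an illustration of the formulas on this slide, not the project's code; the function names and the sample numbers are hypothetical.

```python
def per_capita_disposal_ppd(disposal_tons, population, days=365):
    """Pounds disposed per person per day: tons x 2,000 lb / population / days."""
    return disposal_tons * 2000 / population / days


def target_ppd(generation_rates_2003_2006):
    """50% target: mean of the 2003-2006 per capita generation rates, halved."""
    base = sum(generation_rates_2003_2006) / len(generation_rates_2003_2006)
    return base / 2


def meets_goal(annual_ppd, target):
    """A jurisdiction meets the goal if it disposes of no more than its target."""
    return annual_ppd <= target


# Hypothetical jurisdiction: 36,500 tons disposed by 100,000 residents in a year.
annual = per_capita_disposal_ppd(36_500, 100_000)
# Hypothetical 2003-2006 per capita generation rates (PPD).
target = target_ppd([12.0, 12.4, 11.6, 12.0])
print(annual, target, meets_goal(annual, target))
```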
[Figure: Point Arena, Santa Monica]
Data Sources
• Disposal Rates: California Disposal Progress Report Year
2007 to 2013: http://www.calrecycle.ca.gov/LGCentral/Reports/jurisdiction/diversiondisposal.aspx
• CalRecycle Program Data: Program Counts by Status, Year
and Jurisdiction Data (2007-2013): CalRecycle
• Crime Data: Criminal Justice Statistics Center Statistics –
Crimes and Clearances (2005-2014): https://oag.ca.gov/crime/cjsc/stats/crimes-clearances
• Solar Data: California Solar Initiative – Working Dataset:
https://www.californiasolarstatistics.ca.gov/data_downloads/
• California City Area – Wikipedia:
https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California
• Building Construction Permit by Jurisdiction (2007-2013):
State of the Cities Data Systems Database: http://socds.huduser.gov/permits/
• Voter Registration Data: California Report of Registration
(2007-2013): http://www.sos.ca.gov/elections/report-registration/
Data Cleaning
• Changed Data Structure – Row to Column/ Column to Row
• Filtered data to select data between 2007 and 2013
• String manipulation
• str to int/float
• Removing unnecessary characters: , $ - * N 2,500- 250,000+
• Jurisdiction names: upper case to lower case, removed "-", and changed
spellings manually so that names match in the merging step
• Renamed columns – names were very long; kept a Key-Id data dictionary
• Replaced NaN with median values and in some cases with 0
• Merged data
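The cleaning steps above could look roughly like the following. It is a minimal standard-library sketch under assumptions: the helper names are hypothetical, hyphens in names are replaced by a space (the slide only says they were removed), and the sample cells are made up for illustration.

```python
import re
from statistics import median


def to_number(raw):
    """Strip separators, currency signs and footnote marks; None if no number is left."""
    cleaned = re.sub(r"[,$*\s]", "", str(raw)).rstrip("+-")
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. "N", "-" or an empty cell


def normalize_name(name):
    """Lower-case a jurisdiction name and drop hyphens so keys match across sources."""
    return re.sub(r"\s+", " ", name.replace("-", " ")).strip().lower()


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical raw cells from two sources, joined on the normalized name.
disposal = {normalize_name("SANTA MONICA"): to_number("3.6")}
permits = {normalize_name("Santa Monica"): to_number("1,250")}
merged = {name: (disposal[name], permits.get(name)) for name in disposal}
print(merged)
```

Normalizing the key before the merge is what makes the join work; names that still differ after normalization are the ones that need the manual spelling fixes.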
Data Cleaning
• Initial stats – 378 Jurisdictions and 946 Features
• After removal of 2007 to 2012 data – 378 Jurisdictions and
380 Features
• After feature engineering – 378 Jurisdictions and 45 Features
(more on this later)
EDA (2013 Data)
• Histograms
• Joint Distributions
• Pearson Correlation
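For reference, the Pearson correlation used in the EDA is the covariance of two features divided by the product of their standard deviations. A minimal sketch follows; the function name and inputs are hypothetical, and the project may well have used a library routine instead.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```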
EDA (2013 Data)
Feature Engineering
• Ethnic Diversity Index
• Voter Registration Rate
• Republican to Democratic Ratio
• Major Crime to Minor Crime Ratio
• Percent Violent Crime
• Total Crime/1000 Inhabitants
• Crime Index
• Theil Index (Household Income)
• Mean Logarithmic Deviation (Household Income)
• Household Income Ratio
• Income Index
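Two of the inequality features above have standard definitions: the Theil index is the mean of (x/μ)·ln(x/μ), and the mean logarithmic deviation is the mean of ln(μ/x), where μ is the mean income. The sketch below assumes a list of individual positive incomes; how the project derived these measures from its household income data is not stated on the slide.

```python
from math import log


def theil_index(incomes):
    """Theil T index: mean of (x / mu) * ln(x / mu); 0 means perfect equality."""
    n = len(incomes)
    mu = sum(incomes) / n
    return sum((x / mu) * log(x / mu) for x in incomes) / n


def mean_log_deviation(incomes):
    """Mean logarithmic deviation (Theil L): mean of ln(mu / x)."""
    n = len(incomes)
    mu = sum(incomes) / n
    return sum(log(mu / x) for x in incomes) / n
```

Both measures are 0 when every income is equal and grow as incomes spread out.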
Feature Engineering
• Per Capita Income Index
• Travel Time Index
• Median Income Index
• Male to Female Median Earning Income Index
• Residential Solar Units/Person
• Residential Solar Units/Household
• and a lot more…
Feature Engineering
• Some Plots
Modeling
• Calculated the difference between Target Residential PPD and Annual
Residential PPD
• Discretized the difference by quantiles. Intervals and labels:
• [-1.9, 1.3] – Low
• (1.3, 2.4] – Fair
• (2.4, 3.6] – Good
• (3.6, 7052.8] – Excellent
• Clustering - looking at groups of jurisdictions that have similar
social, economic, housing, demographic, and political
characteristics
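The quantile-based labeling can be sketched as follows. The assumptions are that the difference is Target PPD minus Annual PPD (so larger is better), that the cut points are quartiles, and that intervals are closed on the right; the sample values are hypothetical and the project's actual binning code is not shown on the slide.

```python
from bisect import bisect_left
from statistics import quantiles

LABELS = ["Low", "Fair", "Good", "Excellent"]


def discretize_by_quartile(differences):
    """Label each difference by the quartile bin (right-closed) it falls into."""
    cuts = quantiles(differences, n=4, method="inclusive")  # three cut points
    return [LABELS[bisect_left(cuts, d)] for d in differences]


# Hypothetical Target PPD minus Annual PPD values for eight jurisdictions.
sample = [-1.0, 0.5, 1.5, 2.0, 2.8, 3.0, 4.0, 9.0]
labels = discretize_by_quartile(sample)
print(labels)
```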
Modeling
• K Means Clustering – Checked Silhouette Scores (scores were close
to 0, which indicates overlapping clusters)
• Used Linear Discriminant Analysis based on labels from K
Means to visually inspect clusters – n_clusters = 4 chosen
• Decision Tree to check Feature Importance
• Random Forest to check Feature Importance and
accuracy
• Cross Validation – 10 folds
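To make the silhouette check concrete: for each point, a is the mean distance to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster; the point's score is (b − a) / max(a, b), and the overall score is the mean over all points. The sketch below is a dependency-free illustration with hypothetical names, not the project's code; it assumes at least two clusters and Euclidean distance.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # dist(p, p) is 0, so dividing by len(own) - 1 averages over the others.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Scores near 1 mean compact, well-separated clusters; scores near 0 mean the clusters overlap; negative scores suggest points assigned to the wrong cluster.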
Modeling
• It was still not clear what each cluster means
• Selected Features with Feature Importance > 0.05
• Plotted PCA Biplots for different clusters
Recommendations
• Good Performers
• Recommended Jurisdictions
Thank You!
