Find signal in noise.

6 steps to find value from messy data. CRISP-DM data analysis framework.

Published in: Data & Analytics

  1. Find signal in noise. 6 steps to find value from messy data.
     Michal Brys, Data Scientist @ Allegro
     Measure Camp | London, 10th September 2016
     Slide footer: Michał Bryś, Data Scientist @ Allegro, Complexity Garage @ Kraków, 05.02.2016 / Michał Bryś, Data Scientist @ Allegro, Measure Camp @ London, 10.09.2016
  2. Michal Brys, Data Scientist @ Allegro
     Also specializes in:
     + Google Analytics
     + Google Tag Manager
     michalbrys.com
     about.me/michal.brys
  3. Framework for data analysis
     CRISP-DM
     - Cross Industry Standard Process for Data Mining
     - Set up in 1996 (SPSS, Teradata, Daimler AG, NCR, OHRA)
     - Still works!
     Read more: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
  4. 1: Business Understanding
     - Define the goal of the analysis
     - What do you want to achieve with the analysis?
     - Check the business context
     - Don't be afraid to ask questions
  5. 1: Business Understanding
     I want to select the customer group with the highest probability of response (...) in order to target a marketing campaign at this group.
  6. 2: Data Understanding
     - Collect data
     Check:
     - What each variable in the dataset means
     - Are there missing values?
     - Exploratory data analysis (EDA)
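
The missing-value check on this slide could be sketched in Python as below. The talk does not show its dataset or tooling, so the sample records, the variable names (`client_id`, `sessions`, `responded`), and the convention that `None`, an empty string, and `"NA"` count as missing are all assumptions made for illustration.

```python
from collections import Counter


def count_missing(records, missing_markers=(None, "", "NA")):
    """Return how many values are missing for each variable."""
    counts = Counter()
    for row in records:
        for name, value in row.items():
            # Adding a bool adds 1 for a missing value and 0 otherwise,
            # so every variable appears in the result even with 0 missing.
            counts[name] += value in missing_markers
    return dict(counts)


# Hypothetical sample; the real dataset is not part of the slides.
records = [
    {"client_id": "a1", "sessions": "3", "responded": "1"},
    {"client_id": "b2", "sessions": "", "responded": "0"},
    {"client_id": "c3", "sessions": "NA", "responded": None},
]

missing = count_missing(records)
print(missing)
```

A per-variable count like this is usually enough to decide whether a variable can be repaired or has to be dropped before modeling.
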
  7. 2: Data Understanding
     Google Analytics with the client ID as a custom dimension
     - Source: cookies + JavaScript tracker
     - Processed by Google Analytics
     - No access to raw data
  8. 2: Data Understanding
     10 000 records with 11 variables
  9. 3: Data Preparation
     - Data cleaning
     - Prepare new variables, transform data
     - Remove missing values and outliers
     - Check distributions
  10. 3: Data Preparation
      Example: fix variable types.
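
As a rough illustration of the preparation step, the Python sketch below converts string fields to proper types and drops rows with missing or implausible values. The field names, the sample rows, and the `max_sessions` threshold for what counts as an outlier are hypothetical; the slides do not say which variables were fixed or how outliers were defined.

```python
def prepare(records, max_sessions=100):
    """Convert string fields to proper types and drop unusable rows."""
    cleaned = []
    for row in records:
        try:
            sessions = int(row["sessions"])
            responded = bool(int(row["responded"]))
        except (TypeError, ValueError):
            continue  # missing or malformed value: drop the row
        if not 0 <= sessions <= max_sessions:
            continue  # outside the assumed plausible range: treat as outlier
        cleaned.append(
            {"client_id": row["client_id"], "sessions": sessions, "responded": responded}
        )
    return cleaned


# Hypothetical raw export in which every value arrives as a string.
raw = [
    {"client_id": "a1", "sessions": "3", "responded": "1"},
    {"client_id": "b2", "sessions": "", "responded": "0"},
    {"client_id": "c3", "sessions": "5000", "responded": "0"},
    {"client_id": "d4", "sessions": "7", "responded": "0"},
]

prepared = prepare(raw)
print(prepared)
```

Dropping rows is the simplest policy; imputing missing values or capping extreme ones are common alternatives when the dataset is small.
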
  11. 4: Modeling
      - Classification problem
      - Build models using different methods
      - Training and test subsets
      - CART, C5.0, Logit Regression
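
The split into training and test subsets can be sketched with the Python standard library alone. The 30% test fraction and the fixed seed are assumptions, since the slides do not state the proportions used, and fitting the CART, C5.0, or logit models themselves would need a modeling library that is not shown here.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for prepared records
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Each candidate model is then fitted on the training subset only and compared on the test subset, so the comparison reflects behavior on unseen records.
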
  12. 5: Evaluation

      Model             True Negative  True Positive  False Negative  False Positive  Total Error Rate
      CART              5081           3150           1080            689             17.69%
      C5.0              4089           2701           1606            1604            32.10%
      Logit Regression  5871           2107           1307            715             20.22%
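
The total error rate in the table is the share of misclassified records, (false negatives + false positives) / all records. The Python sketch below recomputes it from the counts on the slide, assuming the column order true negative, true positive, false negative, false positive; the function and variable names are illustrative only.

```python
def error_rate(tn, tp, fn, fp):
    """Share of misclassified records among all classified records."""
    total = tn + tp + fn + fp
    return (fn + fp) / total


# Counts copied from the evaluation table: (TN, TP, FN, FP).
results = {
    "CART": (5081, 3150, 1080, 689),
    "C5.0": (4089, 2701, 1606, 1604),
    "Logit Regression": (5871, 2107, 1307, 715),
}

rates = {model: error_rate(*counts) for model, counts in results.items()}
best = min(rates, key=rates.get)

for model, rate in rates.items():
    print(f"{model}: {rate:.2%}")
print("Lowest total error rate:", best)
```

The total error rate weights both kinds of error equally. For a marketing campaign, a missed responder and a wasted contact usually have different costs, so the individual counts are worth reading alongside the single figure.
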
  13. 6: Deployment
      - Prepare a report
      - Implement in a system
      - Build a product
      - ...
  14. Summary
      CRISP-DM
      + Keeps the business goal in mind
      + The result answers the initial question
      + Reproducible and documented process
      Image: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining#/media/File:CRISP-DM_Process_Diagram.png
  15. More inspiration
      - “Data Mining Methods and Models”, Daniel T. Larose
      - “The Signal and the Noise”, Nate Silver
  16. One more thing...
      michalbrys.gitbooks.io/r-google-analytics/
  17. Q&A
      Michal Brys
      about.me/michal.brys
      github.com/michalbrys
