Michał Bryś, Data Scientist @ Allegro, Complexity Garage @ Kraków, 05.02.2016Michał Bryś, Data Scientist @ Allegro, Measure Camp @ London, 10.09.2016
Michal Brys
Data Scientist @ Allegro
Measure Camp | London, 10th September 2016
Find signal in noise.
6 steps to find value from messy data.
Michal Brys
Data Scientist @ Allegro
Also specializes in:
+ Google Analytics
+ Google Tag Manager
michalbrys.com
about.me/michal.brys
Framework for data analysis
CRISP-DM
- Cross Industry Standard Process for Data Mining
- Set up in 1996 (SPSS, Teradata, Daimler AG, NCR, OHRA)
- Still works!
Read more: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
1: Business Understanding
- Define analysis goal
- What do you want to achieve with the analysis?
- Check business context
- Don’t be afraid to ask questions
1: Business Understanding
I want to select the customer group with the
highest probability of response (...)
to target a marketing campaign at this group.
2: Data Understanding
- Collect data
Check:
- What each variable in the dataset means
- Are there missing values?
- Exploratory data analysis (EDA)
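A minimal sketch of the missing-values check, assuming the data is already loaded as a list of records. The variable names and sample rows are hypothetical and not taken from the talk's dataset.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"client_id": "a1", "sessions": 3, "device": "mobile"},
    {"client_id": "a2", "sessions": None, "device": "desktop"},
    {"client_id": "a3", "sessions": 7, "device": None},
    {"client_id": "a4", "sessions": None, "device": "mobile"},
]


def missing_counts(rows):
    """Count missing (None or empty-string) values per variable."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts.setdefault(name, 0)
            if value is None or value == "":
                counts[name] += 1
    return counts


print(missing_counts(rows))
```

A per-variable count like this is a quick first look before deciding whether to drop or impute.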
2: Data Understanding
Google Analytics with client ID as a custom dimension
- Source: Cookies + JavaScript tracker
- Processed by Google Analytics
- No access to raw data
2: Data Understanding
10 000 records with 11 variables
3: Data Preparation
- Data cleaning
- Prepare new variables, transform data
- Remove missing and outlying values
- Check distributions
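One possible way to remove missing and outlying values, sketched with the common 1.5 × IQR rule. The rule, the function name, and the sample values are assumptions for illustration; the talk does not say which criterion was used.

```python
import statistics

# Hypothetical values for one numeric variable; None marks a missing value.
values = [12, 15, None, 14, 13, 400, 16, 11]


def drop_missing_and_outliers(values, k=1.5):
    """Drop None values, then drop points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if low <= v <= high]


cleaned = drop_missing_and_outliers(values)
print(cleaned)
```

Checking the distribution before and after the filter shows how much the cleaning changed the data.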
3: Data Preparation
Example: Fix variable types.
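A sketch of what fixing variable types might look like when every value of an exported record arrives as text. The field names and the schema are hypothetical; the actual variables in the talk's dataset are not listed.

```python
# Hypothetical exported record: every value arrives as a string.
raw = {
    "client_id": "1234.5678",
    "sessions": "12",
    "revenue": "19.90",
    "responded": "true",
}

# Hypothetical schema: variable name -> converter to the intended type.
SCHEMA = {
    "client_id": str,  # an identifier, so it stays text even though it looks numeric
    "sessions": int,
    "revenue": float,
    "responded": lambda s: s.strip().lower() == "true",
}


def fix_types(record, schema):
    """Convert each value using the converter registered for its variable."""
    return {name: schema[name](value) for name, value in record.items()}


typed = fix_types(raw, SCHEMA)
print(typed)
```

Keeping identifiers such as the client ID as text avoids treating them as quantities in later steps.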
4: Modeling
- Classification problem
- Build models using different methods
- Split data into training and test subsets
- CART
- C5.0
- Logit Regression
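The training/test split can be sketched as below. The 70/30 proportion, the fixed seed, and the function name are assumptions; the talk does not state how the subsets were drawn.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for a 10 000-record dataset: record indices only.
records = list(range(10_000))
train, test = train_test_split(records)
print(len(train), len(test))
```

Each candidate model is then fitted on the training subset only and compared on the held-out test subset.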
5: Evaluation
Model             True Negative  True Positive  False Negative  False Positive  Total Error Rate
CART                       5081           3150            1080             689            17.69%
C5.0                       4089           2701            1606            1604            32.10%
Logit Regression           5871           2107            1307             715            20.22%
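The total error rate in the table above is the share of misclassified records, (FN + FP) / (TN + TP + FN + FP). A short sketch that recomputes it from the reported counts; the function and variable names are illustrative.

```python
# (TN, TP, FN, FP) counts as reported in the evaluation table.
results = {
    "CART": (5081, 3150, 1080, 689),
    "C5.0": (4089, 2701, 1606, 1604),
    "Logit Regression": (5871, 2107, 1307, 715),
}


def total_error_rate(tn, tp, fn, fp):
    """Share of records the model classified incorrectly."""
    return (fn + fp) / (tn + tp + fn + fp)


for model, counts in results.items():
    print(f"{model}: {total_error_rate(*counts):.2%}")

# Model with the lowest total error rate.
best = min(results, key=lambda m: total_error_rate(*results[m]))
print("Lowest error:", best)
```

On these counts CART has the lowest total error rate. Whether false positives or false negatives matter more depends on the business goal from step 1.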
6: Deployment
- Prepare report
- Implement in system
- Build product
- ...
Summary
CRISP-DM
+ Keeps business goal in mind
+ Result answers the initial question
+ Reproducible and documented process
Image: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining#/media/File:CRISP-DM_Process_Diagram.png
More inspiration
“Data Mining Methods and Models”
Daniel T. Larose
“The Signal and the Noise”
Nate Silver
One more thing...
michalbrys.gitbooks.io/r-google-analytics/
Q&A
Michal Brys
about.me/michal.brys
github.com/michalbrys
