SlideShare a Scribd company logo
1 of 66
Download to read offline
Vishal Patel
June 2017
Exploring the Data Science Process
 Vishal Patel
 Data Science Consultant
 Founder of DERIVE, LLC
 Fully Automated Advanced Analytics products
 Data Science services
 MS in Computer Science, and MS in Decision Sciences
 Richmond, VA
w w w. d e r i v e . i o
Data
Science
HOW TO BUILD
A MACHINE LEARNING MODEL
IN THREE QUICK AND EASY STEPS
USING [XXX]!!!
– Most tutorials
1
50% of analytic projects fail.
– Gartner, 2015
2
On September 21, 2009, the grand prize of US$1,000,000 was given to the BellKor's
Pragmatic Chaos team which bested Netflix's own algorithm for predicting ratings by 10.06%.
“[T]he additional accuracy gains that we measured did not seem to justify the
engineering effort needed to bring them into a production environment.”
Analytic projects fail because…
…they aren’t completed within budget or on schedule,
or because they fail to deliver the features and benefits
that are optimistically agreed on at their outset.
How to Avoid Failure?
1 Build with Organizational Buy-in
2 Build with End In Mind
3 Build with Structured Approach
How to Avoid Failure?
1 Build with Organizational Buy-in
2 Build with End In Mind
3 Build with Structured Approach
Data
Data
Science
Value
Data
Data
Science
Value
Business Question
Data
Science
The Blind Men and the Elephant
It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.
And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right
And all were in the wrong!
– John Godfrey Saxe
Data Science
STATISTICS
BUSINESS
COMPUTER
SCIENCE
Data Munging
Model Training
Model Deployment
Data Preparation Model Tracking
Business
Understanding
Model Evaluation
Visualize Results
Data Science Process
STATISTICSBUSINESS COMPUTER SCIENCE
1
2
3
4
5
6
7
Data Munging
Model Training
Model Deployment
Data Preparation Model Tracking
Business
Understanding
Model Evaluation
Visualize Results
Data Science Process
STATISTICSBUSINESS COMPUTER SCIENCE
Business
Understanding
Data
Preparation
Data
Munging
Model
Training
Model
Evaluation
Model
Deployment
Model
Tracking
Business
Understanding
Far better
an approximate answer to the right question
than
an exact answer to the wrong question.
– John Tukey
Business
Understanding
1
2
3
DETERMINE
UNDERSTAND
MAP
Business
Understanding
1
2
3
DETERMINE
UNDERSTAND
MAP
What does the client want to achieve?
Primary Objective
o Reduce attrition
o Customized targeting
o Plan future media spend
o Prevent fraud
o Recommend Products
Business
Understanding
1
2
3
DETERMINE
UNDERSTAND
MAP
o Identify secondary or competing objectives
o List assumptions, constraints, and important factors
o Understand success criteria
o Specific, measurable, time-bound
o Study existing solutions (if any)
Business
Understanding
1
2
3
DETERMINE
UNDERSTAND
MAP
o State the project objective(s) in technical terms
o Describe how the data science project will help solve
the business problem
o Explore successful scenarios
Business Objective  Technical Objective
Business
Understanding
OBJECTIVE TECHNIQUE EXAMPLES
Predict Values Regression
Linear regression,
Bayesian regression,
Decision Trees
Predict Categories Classification
Logistic regression,
SVM, Decision Trees
Predict Preference Recommender System
Collaborative / Content-
based filtering
Discover groups Clustering
k-means, Hierarchical
clustering
Discover unusual data
points
Anomaly Detection k-NN, One-class SVM
…
1
2
3
DETERMINE
UNDERSTAND
MAP
If all you have is a hammer
then everything looks like a nail.
Business
Understanding
o Primary Objective: Prevent attrition  Increase subscription renewals
o Completive Objective: Core customers are also targeted for up-sell
o Constraints: Avoid targeting customers too close to their contract expiration
o Success Criteria: Current renewal rate = 65%  Improve by 8%
o Existing Solution: Business-rule-based targeting
o Data Science Objective: Build a binary classification model to identify
customers who are not likely to renew their subscriptions three months in
advance of their contract expiration.
o Success Scenario: The model correctly identifies 80% of the future attritors.
And the promotional campaign targets all likely attritors, and successfully
converts 19% of them into non-attritors.
Business
Understanding
o Duration
o Inventory of resources
o Tools and techniques
o Risks and contingencies
o Costs and benefits
o Milestones
The thought that disaster is impossible
often leads to an unthinkable disaster.
– Gerald Weinberg
Project Plan
Titanic at Southampton docks, prior to departure
Data
Preparation
1 IDENTIFY
2 COLLECT
3 ASSESS
4 VECTORIZE
Data
Preparation
o Data sources, formats
o Database, Streaming API’s, Logs, Excel files, Websites, etc.
o Entity Relationship Diagram (ERD)
o Identify additional data sources
o Demographics data appends
o Geographical data
o Census data, etc.
o Identify relevant data
o Record unavailable data
o How long a history is available and one should use?
1 IDENTIFY
2 COLLECT
3 ASSESS
4 VECTORIZE
Data
Preparation
o Access or acquire all relevant data in a central location
o Quality control checks and tests
o File formats, delimiters
o Number of records, columns
o Primary keys
1 IDENTIFY
2 COLLECT
3 ASSESS
4 VECTORIZE
Data
Preparation
First look at the data
o Get familiar with the data
o Study seasonality
o Monthly/weekly/daily patterns
o Unexplained gaps or spikes in the historical data
o Detect mistakes
o Extreme or outlier values
o Unusual values
o Special missing values
o Check assumptions
o Review distributions
1 IDENTIFY
3 ASSESS
4 VECTORIZE
2 COLLECT
Tidy dataset are all alike;
Every messy dataset is messy in its own way.
Data
Preparation
GOAL: Create Analysis Dataset
1 IDENTIFY
4 VECTORIZE
2 COLLECT
3 ASSESS
𝑦1
𝑦2
𝑦3
.
.
.
𝑦𝑛
𝑥11 𝑥12 𝑥13 … 𝑥1𝑗
𝑥21 𝑥22 𝑥23 … 𝑥2𝑗
𝑥31 𝑥32 𝑥33 … 𝑥3𝑗
. . . .
. . . .
. . . .
𝑥 𝑛1 𝑥 𝑛2 𝑥 𝑛3 … 𝑥 𝑛𝑗
𝑦 = 𝑋 =
Outcome
Target
Independent Variable
Inputs
Features
Dependent Variables
Target Definition
o Churn = 90 days of consecutive inactivity
o What’s inactivity?
o Incoming and outgoing calls
o Data usage
o Incoming text
o Promotional texts
o Voicemail usage
o Call forwarding
o Etc.
o What to do with customers who change their device or phone number?
o Churn at the individual (person) level, or at the device (phone) level?
o How many customers return (become active again) after 90 days of inactivity?
o Prediction window
o Predict 90 days of consecutive inactivity?
o Would 10 days of consecutive inactivity suffice?
o How many customers return after x days of inactivity?
o How to deal with fraud?
o Etc.
Modeling Sample
o Historical trends and seasonality
o Are there certain timeframes that should be discarded?
o The model should be generalizable for future targeting purposes
o Eligible, relevant population
o Must align with the business goals
o Eligible, relevant markets (e.g., US vs. International)
o Must align with the business goals
o Outdated products
Selection Bias
Abraham Wald's Work on Aircraft Survivability
Journal of the American Statistical Association Vol. 79, No. 386 (Jun., 1984)
Information Leakage
… Sep Oct Nov Dec Jan Feb Mar Apr
…to predict whether customers
will [do something]
Use all available information
(“Leading Indicators”)
as of the end of Jan…
OBSERVATION WIDOW PREDICTION WIDOW
o The leading indicators must be calculated from the timeframe leading up to the event
– it must not overlap with the prediction window.
o Beware of proxy events, e.g., future bookings
Data Aggregation
o Attribute creation
o Derived attributes: Household income / Number of adults = Income per adult
o Brainstorm with team members
𝑥11 𝑥12 𝑥13 … 𝑥1𝑗
𝑥21 𝑥22 𝑥23 … 𝑥2𝑗
𝑥31 𝑥32 𝑥33 … 𝑥3𝑗
. . . .
. . . .
. . . .
𝑥 𝑛1 𝑥 𝑛2 𝑥 𝑛3 … 𝑥 𝑛𝑗
𝑋 =
Data Aggregation
CUSTOMER_ID PURCHASE_DATE
1001 02-12-2015:05:20:39
1001 05-13-2015:12:18:09
1001 12-20-2016:00:15:59
1002 01-19-2014:04:28:54
1003 01-12-2015:09:20:36
1003 05-31-2015:10:10:02
… …
1. Number of transactions (Frequency)
2. Days since the last transaction (Recency)
3. Days since the earliest transaction (Tenure)
4. Avg. days between transaction
5. # of transactions during weekends
6. % of transactions during weekends
7. # of transactions by day-part (breakfast, lunch, etc.)
8. % of transactions by day-part
9. Days since last transaction / Avg. days between transactions
10.…
CUSTOMER_ID 𝑥1 𝑥2 … 𝑥𝑗
1001 … … …
1002 … … …
1003 … … …
… … … … …
Data
Preparation
1 IDENTIFY
4 VECTORIZE
2 COLLECT
3 ASSESS
OUTPUT: Analysis Dataset
𝑦1
𝑦2
𝑦3
.
.
.
𝑦𝑛
𝑥11 𝑥12 𝑥13 … 𝑥1𝑗
𝑥21 𝑥22 𝑥23 … 𝑥2𝑗
𝑥31 𝑥32 𝑥33 … 𝑥3𝑗
. . . .
. . . .
. . . .
𝑥 𝑛1 𝑥 𝑛2 𝑥 𝑛3 … 𝑥 𝑛𝑗
𝑦 = 𝑋 =
Entity Relationship Diagram
(ERD)
Data
Munging
Model
Training
Model
Evaluation
Model
Deployment
Model
Tracking
Data
Preparation
Business
Understanding
80% 20%
Data
Munging
Model
Building
Time
Spent
Give me six hours to chop down a tree
and I will spend the first four sharpening the axe.
– Abraham Lincoln
Data
Munging
o Descriptive statistics
o Review with business owners
o Correlation analysis
o Review with business owners
o Watch out for data leakage
o Impute missing values
o Trim extreme values
o Process categorical attributes
o Transformations (square, log, etc.)
o Binning / variable smoothing
o Multicollinearity
o Reduce redundancy
o Create additional feature
o Interactions
o Normalization (scaling)
Univariate Multivariate
Non-Graphical
Graphical
o Cross-tabulation
o Univariate statistics by category
o Correlation matrices
o Histograms
o Box plots, stem-and-leaf plots
o Quantile-normal plots
o Categorical: Tabulated frequencies
o Quantitative:
o Central tendency: mean, median, mode
o Spread: Standard deviation, inter-
quartile range
o Skewness and kurtosis
o Univariate graphs by category (e.g.,
side-by-side box-plots)
o Scatterplots
o Correlation matrix plots
Data
Munging
Data
Munging
 Feature Reduction: The process of selecting a subset of features for use in model construction
 Useful for both supervised and unsupervised learning problems
Art is the elimination of the unnecessary.
– Pablo Picasso
Data
Munging
 True dimensionality <<< Observed dimensionality
 The abundance of redundant and irrelevant features
 Curse of dimensionality
 With a fixed number of training samples, the predictive power reduces as the dimensionality
increases. [Hughes phenomenon]
 With 𝑑 binary variables, the number of possible combinations is 𝑂(2 𝑑
).
 Value of Analytics
 Descriptive  Diagnostic  Predictive  Prescriptive
 Law of Parsimony [Occam’s Razor]
 Other things being equal, simpler explanations are generally better than complex ones.
 Overfitting
 Execution time (Algorithm and data)
Hindsight Insight Foresight
Feature Reduction: Why
Data
Munging
1. Percent missing values
2. Amount of variation
3. Pairwise correlation
4. Multicolinearity
5. Principal Component Analysis (PCA)
6. Cluster analysis
7. Correlation (with the target)
8. Forward selection
9. Backward elimination
10. Stepwise selection
11. LASSO
12. Tree-based selection
Feature
Reduction
Techniques
All you can eat Eat healthy
Feature Reduction
Model
Training
 Try more than one machine learning technique
 Fine-tune parameters
 Assess model performance
 Avoid Over-fitting
Assess Model Performance
 Area Under the ROC Curve (AUC), Confusion Matrix, Precision, Recall, Log-loss
 Model Lift, Model Gains, Kolmogorov-Smirnov (KS), etc.
When a measure becomes a target,
it ceases to be a good measure.
Goodhart's law
Model
Evaluation
 Hold-out (test) set
 Tri-fold partitioning, k-fold cross-validation
 Out-of-time test set
 Value of analytics
 Law of Parsimony (Occam’s Razor)
 Model execution time
 Deployment complexity
 Compare against existing business rules/model
 Model peer-review (Quality Control)
Model Selection and Performance Assessment
Bias-Variance Tradeoff
Model
Evaluation
 AUC
 Cumulative Gains chart
 Model usage recommendations
 Decile reports
 How many deciles should be targeted?
 Predictor Importance
 Each predictor’s relationship with the target
 Interpret results as they relate to the business application
Visualization of Modeling Results
 Primary bullet
 Secondary bullet
Visual Output
. . . . . . . . . .
Model
Deployment
 Model production cycle (daily, weekly, monthly, etc.)
 Scoring code, or publish model as a web service
 Model Documentation (Technical Specifications)
 Data preparation, transformations, imputations, parameter settings, etc.
 Reproducibility
 Docker containers, etc.
 Model Persistence vs. Model Transience
Model Persistence vs. Model Transience
Aug Sep Oct Nov Dec Jan Feb Mar Apr
Model Build Model Decay Tracking Model Rebuild Model Decay Tracking
Scoring Algorithm Scoring Algorithm
Model
Persistence
Aug Sep Oct Nov Dec
Model Build
Model
Transience
Model Build Model Build Model Build Model Build
• Traditional approach
• Provides stability
• Less resource intensive
• Modern approach
• Able to capture recent trends
• Resource intensive
Model
Tracking
 Model decay tracking (monitoring) plan
 Model maintenance plan
 Adding new data sources
 Version control
 Campaign Set-up and Execution
 Experimental Design
 Reason Coding
No Selection
(Random)
Experimental Design
Marketing Treatment No Treatment
Selection
Based on Model
A Test
C Control
B
Selection
Hold-out
D
Random
Hold-out
Measure
Improve
Test
Analytics is a river.
Data Munging
Model Building
Model Deployment
Data Preparation Model Tracking
Business
Understanding
Model Evaluation
Visualize Results
Data Science Process: Recap
STATISTICSBUSINESS COMPUTER SCIENCE
Data Munging
Model Building
Model Deployment
Model Evaluation
Visualize Results
AutoClassifier
These steps usually take
2 to 3 weeks
But with AutoClassifier,
it takes a few mouse-clicks
and less than 24 hours w w w . d e r i v e . i o
Process as Proxy
Good process serves you so you can serve customers.
But if you’re not watchful, the process can become the proxy for the result you want.
You stop looking at outcomes and just make sure you’re doing the process right.
Gulp.
It’s always worth asking, do we own the process or does the process own us?
– Jeff Bezos
THANK YOU!
vishal@derive.io
www.linkedIn.com/in/VishalJP
w w w . d e r i v e . i o

More Related Content

What's hot

Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data ScienceJason Geng
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data AnalyticsUtkarsh Sharma
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
Principles of Data Visualization
Principles of Data VisualizationPrinciples of Data Visualization
Principles of Data VisualizationEamonn Maguire
 
Ppt on data science
Ppt on data science Ppt on data science
Ppt on data science Ansh Budania
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecasesSreenatha Reddy K R
 
Big data visualization
Big data visualizationBig data visualization
Big data visualizationAnurag Gupta
 
03. Data Exploration.pptx
03. Data Exploration.pptx03. Data Exploration.pptx
03. Data Exploration.pptxSarojkumari55
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and AnalyticsSrinath Perera
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introductionkrishna singh
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data sciencebhavesh lande
 
Data Analytics PowerPoint Presentation Slides
Data Analytics PowerPoint Presentation SlidesData Analytics PowerPoint Presentation Slides
Data Analytics PowerPoint Presentation SlidesSlideTeam
 
Data Science presentation for elementary school students
Data Science presentation for elementary school studentsData Science presentation for elementary school students
Data Science presentation for elementary school studentsMelanie Manning, CFA
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...Simplilearn
 

What's hot (20)

Data analytics
Data analyticsData analytics
Data analytics
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Principles of Data Visualization
Principles of Data VisualizationPrinciples of Data Visualization
Principles of Data Visualization
 
Ppt on data science
Ppt on data science Ppt on data science
Ppt on data science
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Big data visualization
Big data visualizationBig data visualization
Big data visualization
 
03. Data Exploration.pptx
03. Data Exploration.pptx03. Data Exploration.pptx
03. Data Exploration.pptx
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 
Data analytics
Data analyticsData analytics
Data analytics
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introduction
 
Data analytics
Data analyticsData analytics
Data analytics
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
Data Analytics PowerPoint Presentation Slides
Data Analytics PowerPoint Presentation SlidesData Analytics PowerPoint Presentation Slides
Data Analytics PowerPoint Presentation Slides
 
Data Science presentation for elementary school students
Data Science presentation for elementary school studentsData Science presentation for elementary school students
Data Science presentation for elementary school students
 
Data Visualization
Data VisualizationData Visualization
Data Visualization
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 

Similar to Exploring the Data science Process

The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science ProcessVishal Patel
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data ScienceAjay Ohri
 
Warehouse components
Warehouse componentsWarehouse components
Warehouse componentsganblues
 
JU Analytics Day Presentation by Naveen Agarwal, Creative Analytics Solutions...
JU Analytics Day Presentation by Naveen Agarwal, Creative Analytics Solutions...JU Analytics Day Presentation by Naveen Agarwal, Creative Analytics Solutions...
JU Analytics Day Presentation by Naveen Agarwal, Creative Analytics Solutions...Naveen Agarwal
 
Big Data: selling the Business Case to the business
Big Data: selling the Business Case to the businessBig Data: selling the Business Case to the business
Big Data: selling the Business Case to the businessJ On The Beach
 
How to succeed at data without even trying!
How to succeed at data without even trying!How to succeed at data without even trying!
How to succeed at data without even trying!Dylan
 
Pin the tail on the metric v00 75 min version
Pin the tail on the metric v00 75 min versionPin the tail on the metric v00 75 min version
Pin the tail on the metric v00 75 min versionSteven Martin
 
Domains and data analytics
Domains and data analyticsDomains and data analytics
Domains and data analyticsPratik Shukla
 
PPT1-Buss Intel Analytics.pptx
PPT1-Buss Intel  Analytics.pptxPPT1-Buss Intel  Analytics.pptx
PPT1-Buss Intel Analytics.pptxssuser28b150
 
10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist10 Tips From A Young Data Scientist
10 Tips From A Young Data ScientistNuno Carneiro
 
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...Kai Wähner
 
Better Living Through Analytics - Strategies for Data Decisions
Better Living Through Analytics - Strategies for Data DecisionsBetter Living Through Analytics - Strategies for Data Decisions
Better Living Through Analytics - Strategies for Data DecisionsProduct School
 
L'évolution du métier du DAF induite par la transformation digitale
L'évolution du métier du DAF induite par la transformation digitale L'évolution du métier du DAF induite par la transformation digitale
L'évolution du métier du DAF induite par la transformation digitale Microsoft Ideas
 
Supply Chain Management Workshop
Supply Chain Management WorkshopSupply Chain Management Workshop
Supply Chain Management WorkshopTom Sauder, P.Eng.
 
Crossing the Digital Chasm - Applying Advanced Analytics in acquiring, nurtur...
Crossing the Digital Chasm - Applying Advanced Analytics in acquiring, nurtur...Crossing the Digital Chasm - Applying Advanced Analytics in acquiring, nurtur...
Crossing the Digital Chasm - Applying Advanced Analytics in acquiring, nurtur...Vishwa Kolla
 
Making the Most of Customer Data
Making the Most of Customer DataMaking the Most of Customer Data
Making the Most of Customer DataWSO2
 
Data Driven Disruption - Why Marketing and Advertising in WA lags - ADMA WA 2...
Data Driven Disruption - Why Marketing and Advertising in WA lags - ADMA WA 2...Data Driven Disruption - Why Marketing and Advertising in WA lags - ADMA WA 2...
Data Driven Disruption - Why Marketing and Advertising in WA lags - ADMA WA 2...Coert Du Plessis (杜康)
 
The Softer Skills Analysts need to make an impact
The Softer Skills Analysts need to make an impactThe Softer Skills Analysts need to make an impact
The Softer Skills Analysts need to make an impactPaul Laughlin
 

Similar to Exploring the Data science Process (20)

The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science Process
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Warehouse components
Warehouse componentsWarehouse components
Warehouse components
 
JU Analytics Day Presentation by Naveen Agarwal, Creative Analytics Solutions...
JU Analytics Day Presentation by Naveen Agarwal, Creative Analytics Solutions...JU Analytics Day Presentation by Naveen Agarwal, Creative Analytics Solutions...
JU Analytics Day Presentation by Naveen Agarwal, Creative Analytics Solutions...
 
Big Data: selling the Business Case to the business
Big Data: selling the Business Case to the businessBig Data: selling the Business Case to the business
Big Data: selling the Business Case to the business
 
How to succeed at data without even trying!
How to succeed at data without even trying!How to succeed at data without even trying!
How to succeed at data without even trying!
 
Pin the tail on the metric v00 75 min version
Pin the tail on the metric v00 75 min versionPin the tail on the metric v00 75 min version
Pin the tail on the metric v00 75 min version
 
Domains and data analytics
Domains and data analyticsDomains and data analytics
Domains and data analytics
 
PPT1-Buss Intel Analytics.pptx
PPT1-Buss Intel  Analytics.pptxPPT1-Buss Intel  Analytics.pptx
PPT1-Buss Intel Analytics.pptx
 
10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist10 Tips From A Young Data Scientist
10 Tips From A Young Data Scientist
 
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
How to Apply Machine Learning with R, H20, Apache Spark MLlib or PMML to Real...
 
Better Living Through Analytics - Strategies for Data Decisions
Better Living Through Analytics - Strategies for Data DecisionsBetter Living Through Analytics - Strategies for Data Decisions
Better Living Through Analytics - Strategies for Data Decisions
 
L'évolution du métier du DAF induite par la transformation digitale
L'évolution du métier du DAF induite par la transformation digitale L'évolution du métier du DAF induite par la transformation digitale
L'évolution du métier du DAF induite par la transformation digitale
 
Supply Chain Management Workshop
Supply Chain Management WorkshopSupply Chain Management Workshop
Supply Chain Management Workshop
 
Crossing the Digital Chasm - Applying Advanced Analytics in acquiring, nurtur...
Crossing the Digital Chasm - Applying Advanced Analytics in acquiring, nurtur...Crossing the Digital Chasm - Applying Advanced Analytics in acquiring, nurtur...
Crossing the Digital Chasm - Applying Advanced Analytics in acquiring, nurtur...
 
Supply Chain Workshop Demo
Supply Chain Workshop DemoSupply Chain Workshop Demo
Supply Chain Workshop Demo
 
Making the Most of Customer Data
Making the Most of Customer DataMaking the Most of Customer Data
Making the Most of Customer Data
 
Day 1 (Lecture 2): Business Analytics
Day 1 (Lecture 2): Business AnalyticsDay 1 (Lecture 2): Business Analytics
Day 1 (Lecture 2): Business Analytics
 
Data Driven Disruption - Why Marketing and Advertising in WA lags - ADMA WA 2...
Data Driven Disruption - Why Marketing and Advertising in WA lags - ADMA WA 2...Data Driven Disruption - Why Marketing and Advertising in WA lags - ADMA WA 2...
Data Driven Disruption - Why Marketing and Advertising in WA lags - ADMA WA 2...
 
The Softer Skills Analysts need to make an impact
The Softer Skills Analysts need to make an impactThe Softer Skills Analysts need to make an impact
The Softer Skills Analysts need to make an impact
 

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 

Exploring the Data science Process

  • 1. Vishal Patel June 2017 Exploring the Data Science Process
  • 2.  Vishal Patel  Data Science Consultant  Founder of DERIVE, LLC  Fully Automated Advanced Analytics products  Data Science services  MS in Computer Science, and MS in Decision Sciences  Richmond, VA w w w. d e r i v e . i o
  • 4. HOW TO BUILD A MACHINE LEARNING MODEL IN THREE QUICK AND EASY STEPS USING [XXX]!!! – Most tutorials 1
  • 5. 50% of analytic projects fail. – Gartner, 2015 2
  • 6. On September 21, 2009, the grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team which bested Netflix's own algorithm for predicting ratings by 10.06%.
  • 7. “[T]he additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
  • 8. Analytic projects fail because… …they aren’t completed within budget or on schedule, or because they fail to deliver the features and benefits that are optimistically agreed on at their outset.
  • 9. How to Avoid Failure? 1 Build with Organizational Buy-in 2 Build with End In Mind 3 Build with Structured Approach
  • 10. How to Avoid Failure? 1 Build with Organizational Buy-in 2 Build with End In Mind 3 Build with Structured Approach
  • 14. The Blind Men and the Elephant It was six men of Indostan To learning much inclined, Who went to see the Elephant (Though all of them were blind), That each by observation Might satisfy his mind. And so these men of Indostan Disputed loud and long, Each in his own opinion Exceeding stiff and strong, Though each was partly in the right And all were in the wrong! – John Godfrey Saxe
  • 16. Data Munging Model Training Model Deployment Data Preparation Model Tracking Business Understanding Model Evaluation Visualize Results Data Science Process STATISTICSBUSINESS COMPUTER SCIENCE 1 2 3 4 5 6 7
  • 17. Data Munging Model Training Model Deployment Data Preparation Model Tracking Business Understanding Model Evaluation Visualize Results Data Science Process STATISTICSBUSINESS COMPUTER SCIENCE
  • 19. Business Understanding Far better an approximate answer to the right question than an exact answer to the wrong question. – John Tukey
  • 21. Business Understanding 1 2 3 DETERMINE UNDERSTAND MAP What does the client want to achieve? Primary Objective o Reduce attrition o Customized targeting o Plan future media spend o Prevent fraud o Recommend Products
  • 22. Business Understanding 1 2 3 DETERMINE UNDERSTAND MAP o Identify secondary or competing objectives o List assumptions, constraints, and important factors o Understand success criteria o Specific, measurable, time-bound o Study existing solutions (if any)
  • 23. Business Understanding 1 2 3 DETERMINE UNDERSTAND MAP o State the project objective(s) in technical terms o Describe how the data science project will help solve the business problem o Explore successful scenarios Business Objective  Technical Objective
  • 24. Business Understanding OBJECTIVE TECHNIQUE EXAMPLES Predict Values Regression Linear regression, Bayesian regression, Decision Trees Predict Categories Classification Logistic regression, SVM, Decision Trees Predict Preference Recommender System Collaborative / Content- based filtering Discover groups Clustering k-means, Hierarchical clustering Discover unusual data points Anomaly Detection k-NN, One-class SVM … 1 2 3 DETERMINE UNDERSTAND MAP
  • 25.
  • 26. If all you have is a hammer then everything looks like a nail.
  • 27. Business Understanding o Primary Objective: Prevent attrition  Increase subscription renewals o Completive Objective: Core customers are also targeted for up-sell o Constraints: Avoid targeting customers too close to their contract expiration o Success Criteria: Current renewal rate = 65%  Improve by 8% o Existing Solution: Business-rule-based targeting o Data Science Objective: Build a binary classification model to identify customers who are not likely to renew their subscriptions three months in advance of their contract expiration. o Success Scenario: The model correctly identifies 80% of the future attritors. And the promotional campaign targets all likely attritors, and successfully converts 19% of them into non-attritors.
  • 28. Business Understanding o Duration o Inventory of resources o Tools and techniques o Risks and contingencies o Costs and benefits o Milestones The thought that disaster is impossible often leads to an unthinkable disaster. – Gerald Weinberg Project Plan Titanic at Southampton docks, prior to departure
  • 30. Data Preparation o Data sources, formats o Database, Streaming API’s, Logs, Excel files, Websites, etc. o Entity Relationship Diagram (ERD) o Identify additional data sources o Demographics data appends o Geographical data o Census data, etc. o Identify relevant data o Record unavailable data o How long a history is available and one should use? 1 IDENTIFY 2 COLLECT 3 ASSESS 4 VECTORIZE
  • 31. Data Preparation o Access or acquire all relevant data in a central location o Quality control checks and tests o File formats, delimiters o Number of records, columns o Primary keys 1 IDENTIFY 2 COLLECT 3 ASSESS 4 VECTORIZE
  • 32. Data Preparation First look at the data o Get familiar with the data o Study seasonality o Monthly/weekly/daily patterns o Unexplained gaps or spikes in the historical data o Detect mistakes o Extreme or outlier values o Unusual values o Special missing values o Check assumptions o Review distributions 1 IDENTIFY 3 ASSESS 4 VECTORIZE 2 COLLECT
  • 33. Tidy dataset are all alike; Every messy dataset is messy in its own way.
  • 34. Data Preparation GOAL: Create Analysis Dataset 1 IDENTIFY 4 VECTORIZE 2 COLLECT 3 ASSESS 𝑦1 𝑦2 𝑦3 . . . 𝑦𝑛 𝑥11 𝑥12 𝑥13 … 𝑥1𝑗 𝑥21 𝑥22 𝑥23 … 𝑥2𝑗 𝑥31 𝑥32 𝑥33 … 𝑥3𝑗 . . . . . . . . . . . . 𝑥 𝑛1 𝑥 𝑛2 𝑥 𝑛3 … 𝑥 𝑛𝑗 𝑦 = 𝑋 = Outcome Target Independent Variable Inputs Features Dependent Variables
  • 35. Target Definition o Churn = 90 days of consecutive inactivity o What’s inactivity? o Incoming and outgoing calls o Data usage o Incoming text o Promotional texts o Voicemail usage o Call forwarding o Etc. o What to do with customers who change their device or phone number? o Churn at the individual (person) level, or at the device (phone) level? o How many customers return (become active again) after 90 days of inactivity? o Prediction window o Predict 90 days of consecutive inactivity? o Would 10 days of consecutive inactivity suffice? o How many customers return after x days of inactivity? o How to deal with fraud? o Etc.
  • 36. Modeling Sample o Historical trends and seasonality o Are there certain timeframes that should be discarded? o The model should be generalizable for future targeting purposes o Eligible, relevant population o Must align with the business goals o Eligible, relevant markets (e.g., US vs. International) o Must align with the business goals o Outdated products
  • 37. Selection Bias Abraham Wald's Work on Aircraft Survivability Journal of the American Statistical Association Vol. 79, No. 386 (Jun., 1984)
  • 38. Information Leakage … Sep Oct Nov Dec Jan Feb Mar Apr …to predict whether customers will [do something] Use all available information (“Leading Indicators”) as of the end of Jan… OBSERVATION WIDOW PREDICTION WIDOW o The leading indicators must be calculated from the timeframe leading up to the event – it must not overlap with the prediction window. o Beware of proxy events, e.g., future bookings
  • 39. Data Aggregation o Attribute creation o Derived attributes: Household income / Number of adults = Income per adult o Brainstorm with team members 𝑥11 𝑥12 𝑥13 … 𝑥1𝑗 𝑥21 𝑥22 𝑥23 … 𝑥2𝑗 𝑥31 𝑥32 𝑥33 … 𝑥3𝑗 . . . . . . . . . . . . 𝑥 𝑛1 𝑥 𝑛2 𝑥 𝑛3 … 𝑥 𝑛𝑗 𝑋 =
  • 40. Data Aggregation CUSTOMER_ID PURCHASE_DATE 1001 02-12-2015:05:20:39 1001 05-13-2015:12:18:09 1001 12-20-2016:00:15:59 1002 01-19-2014:04:28:54 1003 01-12-2015:09:20:36 1003 05-31-2015:10:10:02 … … 1. Number of transactions (Frequency) 2. Days since the last transaction (Recency) 3. Days since the earliest transaction (Tenure) 4. Avg. days between transaction 5. # of transactions during weekends 6. % of transactions during weekends 7. # of transactions by day-part (breakfast, lunch, etc.) 8. % of transactions by day-part 9. Days since last transaction / Avg. days between transactions 10.… CUSTOMER_ID 𝑥1 𝑥2 … 𝑥𝑗 1001 … … … 1002 … … … 1003 … … … … … … … …
  • 41. Data Preparation 1 IDENTIFY 4 VECTORIZE 2 COLLECT 3 ASSESS OUTPUT: Analysis Dataset 𝑦1 𝑦2 𝑦3 . . . 𝑦𝑛 𝑥11 𝑥12 𝑥13 … 𝑥1𝑗 𝑥21 𝑥22 𝑥23 … 𝑥2𝑗 𝑥31 𝑥32 𝑥33 … 𝑥3𝑗 . . . . . . . . . . . . 𝑥 𝑛1 𝑥 𝑛2 𝑥 𝑛3 … 𝑥 𝑛𝑗 𝑦 = 𝑋 =
  • 44. Give me six hours to chop down a tree and I will spend the first four sharpening the axe. – Abraham Lincoln
  • 45. Data Munging o Descriptive statistics o Review with business owners o Correlation analysis o Review with business owners o Watch out for data leakage o Impute missing values o Trim extreme values o Process categorical attributes o Transformations (square, log, etc.) o Binning / variable smoothing o Multicollinearity o Reduce redundancy o Create additional feature o Interactions o Normalization (scaling)
  • 46. Univariate Multivariate Non-Graphical Graphical o Cross-tabulation o Univariate statistics by category o Correlation matrices o Histograms o Box plots, stem-and-leaf plots o Quantile-normal plots o Categorical: Tabulated frequencies o Quantitative: o Central tendency: mean, median, mode o Spread: Standard deviation, inter- quartile range o Skewness and kurtosis o Univariate graphs by category (e.g., side-by-side box-plots) o Scatterplots o Correlation matrix plots Data Munging
  • 47. Data Munging  Feature Reduction: The process of selecting a subset of features for use in model construction  Useful for both supervised and unsupervised learning problems Art is the elimination of the unnecessary. – Pablo Picasso
  • 48. Data Munging  True dimensionality <<< Observed dimensionality  The abundance of redundant and irrelevant features  Curse of dimensionality  With a fixed number of training samples, the predictive power reduces as the dimensionality increases. [Hughes phenomenon]  With 𝑑 binary variables, the number of possible combinations is 𝑂(2 𝑑 ).  Value of Analytics  Descriptive  Diagnostic  Predictive  Prescriptive  Law of Parsimony [Occam’s Razor]  Other things being equal, simpler explanations are generally better than complex ones.  Overfitting  Execution time (Algorithm and data) Hindsight Insight Foresight Feature Reduction: Why
  • 49. Data Munging 1. Percent missing values 2. Amount of variation 3. Pairwise correlation 4. Multicolinearity 5. Principal Component Analysis (PCA) 6. Cluster analysis 7. Correlation (with the target) 8. Forward selection 9. Backward elimination 10. Stepwise selection 11. LASSO 12. Tree-based selection Feature Reduction Techniques
  • 50. All you can eat Eat healthy Feature Reduction
  • 51. Model Training  Try more than one machine learning technique  Fine-tune parameters  Assess model performance  Avoid Over-fitting
  • 52. Assess Model Performance  Area Under the ROC Curve (AUC), Confusion Matrix, Precision, Recall, Log-loss  Model Lift, Model Gains, Kolmogorov-Smirnov (KS), etc.
  • 53. When a measure becomes a target, it ceases to be a good measure. Goodhart's law
  • 54. Model Evaluation  Hold-out (test) set  Tri-fold partitioning, k-fold cross-validation  Out-of-time test set  Value of analytics  Law of Parsimony (Occam’s Razor)  Model execution time  Deployment complexity  Compare against existing business rules/model  Model peer-review (Quality Control) Model Selection and Performance Assessment
  • 56. Model Evaluation  AUC  Cumulative Gains chart  Model usage recommendations  Decile reports  How many deciles should be targeted?  Predictor Importance  Each predictor’s relationship with the target  Interpret results as they relate to the business application Visualization of Modeling Results
  • 57.  Primary bullet  Secondary bullet Visual Output . . . . . . . . . .
  • 58. Model Deployment  Model production cycle (daily, weekly, monthly, etc.)  Scoring code, or publish model as a web service  Model Documentation (Technical Specifications)  Data preparation, transformations, imputations, parameter settings, etc.  Reproducibility  Docker containers, etc.  Model Persistence vs. Model Transience
  • 59. Model Persistence vs. Model Transience Aug Sep Oct Nov Dec Jan Feb Mar Apr Model Build Model Decay Tracking Model Rebuild Model Decay Tracking Scoring Algorithm Scoring Algorithm Model Persistence Aug Sep Oct Nov Dec Model Build Model Transience Model Build Model Build Model Build Model Build • Traditional approach • Provides stability • Less resource intensive • Modern approach • Able to capture recent trends • Resource intensive
  • 60. Model Tracking  Model decay tracking (monitoring) plan  Model maintenance plan  Adding new data sources  Version control  Campaign Set-up and Execution  Experimental Design  Reason Coding
  • 61. No Selection (Random) Experimental Design Marketing Treatment No Treatment Selection Based on Model A Test C Control B Selection Hold-out D Random Hold-out
  • 63. Data Munging Model Building Model Deployment Data Preparation Model Tracking Business Understanding Model Evaluation Visualize Results Data Science Process: Recap STATISTICSBUSINESS COMPUTER SCIENCE
  • 64. Data Munging Model Building Model Deployment Model Evaluation Visualize Results AutoClassifier These steps usually take 2 to 3 weeks But with AutoClassifier, it takes a few mouse-clicks and less than 24 hours w w w . d e r i v e . i o
  • 65. Process as Proxy Good process serves you so you can serve customers. But if you’re not watchful, the process can become the proxy for the result you want. You stop looking at outcomes and just make sure you’re doing the process right. Gulp. It’s always worth asking, do we own the process or does the process own us? – Jeff Bezos