Operationalizing Data Science
St. Louis Big Data IDEA
Before we begin…
• Please mute your phone and turn off your video. There are over eighty people who don’t want to see or hear you chewing.
• If you have any suggestions for future topics that you would like this group to cover, please send them to Scott Shaw using Webex’s chat feature.
• We will send out the presentation deck after the meeting. Look for an announcement in the Meetup link for this meeting.
• If you have questions during the presentation, also send them to Scott Shaw using Webex’s chat feature. We will get to as many questions as we can.
Begin with the end in mind.
Pause in the middle to make sure that you can get to where you are going.
• What is the business intention that you are trying to achieve?
• Minimize Cost
• Maximize Return
• Minimize Risk
• Realize Opportunity
• Engage Stakeholders
• POC vs Production ready and valued product
Identify your thesis.
Goal vs. intention
SMART goal
Refine to a question that can be answered with data science
Data science – predict, explain, evaluate
Decision science – combination of data science and data engineering
Acquire data.
Data sources: Third-Party Data, Internal, API, Streaming
General – Amount, Access, Quality, Labeled?
Third Party
o Assess Data Quality (Value Range, Adherence, Representative)
o Data Format (Automatic vs. hand-generated; similar data from different partners can be vastly different)
o Governed (Appropriate use – avoid reidentification, TTL, contractual terms, track access, renewals)
Internal
API (Data size limits, unreliability, costs)
Streaming (CDC, Device Data, Standardized?)
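As a rough illustration of the third-party data quality checks listed above (value range and adherence to an expected format), here is a minimal Python sketch. The field names `age` and `signup`, the allowed range, and the date format are all hypothetical and chosen only for the example; a real assessment would also need to check whether the data is representative, which this sketch does not attempt.

```python
from datetime import datetime

def assess_quality(records, field, low, high, date_field, date_format="%Y-%m-%d"):
    """Count rows that fail simple value-range and format-adherence checks."""
    report = {"rows": 0, "missing": 0, "out_of_range": 0, "bad_date": 0}
    for row in records:
        report["rows"] += 1
        value = row.get(field)
        if value is None:
            report["missing"] += 1
        elif not (low <= value <= high):
            report["out_of_range"] += 1
        try:
            # strptime rejects both the wrong layout and impossible dates.
            datetime.strptime(row.get(date_field, ""), date_format)
        except (ValueError, TypeError):
            report["bad_date"] += 1
    return report

# Hypothetical partner feed; the column names are illustrative only.
partner_rows = [
    {"age": 34, "signup": "2020-01-15"},
    {"age": 212, "signup": "2020-02-30"},   # impossible age, impossible date
    {"age": None, "signup": "15/03/2020"},  # missing age, wrong date layout
]
quality_report = assess_quality(partner_rows, "age", 0, 120, "signup")
```

A report like this is one possible input to the conversation with the data partner before any modeling starts.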
Explore the data.
Data Exploration
Statistical
Relationships and Correlations
Profiling
Textual – Word, Stop Words, Bigram, Trigram
Clustering
Check in with SME
Every block of stone has a statue inside it, and it is the task of the sculptor to discover it.
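The textual exploration step (words, stop words, bigrams, trigrams) could be sketched as follows. This is a minimal illustration using only the Python standard library; the stop-word list and the sample sentence are made up for the example, and n-grams are counted after stop words are removed, which is one possible choice among several.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sample = "The model drift and the data drift: model drift is common."
tokens = tokenize(sample)
word_counts = Counter(tokens)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```

Calling `bigrams.most_common(5)` gives a quick list of frequent phrases to review with the SME.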
Cleanse data.
Data profiling
Deduplication
Outliers
Filter
Imputation
Source Corrections
Data shaping
Sort
Project
Enrichment
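A few of the cleansing steps above (deduplication, imputation, outlier filtering, sorting) can be sketched with the Python standard library. This is an illustration under assumptions, not a prescribed method: the `id` and `amount` fields are hypothetical, median imputation is just one option, and the 1.5 × IQR fence is a common convention rather than something the deck specifies.

```python
from statistics import median, quantiles

def deduplicate(rows, key):
    """Keep the first row seen for each key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def impute_median(rows, field):
    """Fill missing (None) values with the median of the observed values."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]

def iqr_bounds(values, k=1.5):
    """Lower and upper fences at k times the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

# Hypothetical input rows.
raw = [
    {"id": 1, "amount": 10}, {"id": 2, "amount": 12}, {"id": 2, "amount": 12},
    {"id": 3, "amount": None}, {"id": 4, "amount": 11}, {"id": 5, "amount": 13},
    {"id": 6, "amount": 95},
]
unique = deduplicate(raw, "id")
imputed = impute_median(unique, "amount")
low, high = iqr_bounds([r["amount"] for r in imputed])
cleaned = sorted((r for r in imputed if low <= r["amount"] <= high),
                 key=lambda r: r["id"])
```

The order of the steps matters: imputing before deduplication, for example, would let the duplicate row influence the median.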
Create the model and features.
Type of Models (Supervised, Unsupervised, Reinforcement Learning, Neural Networks)
Feature Engineering (Transformations and Aggregations)
Encode Indicator Variables
Binning/Bucketing
Sparse Classes
Interaction Features
Extract Elements (e.g., Time)
Normalization
Feature Selection
Testing your features
Testing your model
Check in with SME
Check in with Business
Does what you’ve created address the concerns of the business?
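Several of the feature engineering items on this slide (indicator variables, binning, extracting elements from a timestamp, an interaction feature, normalization) might look like the sketch below. All field names (`channel`, `age`, `spend`, `visits`, `timestamp`), the category list, and the bin edges are hypothetical and exist only for illustration.

```python
from datetime import datetime

def one_hot(value, categories):
    """Indicator variables: one 0/1 column per known category."""
    return {f"is_{c}": int(value == c) for c in categories}

def bucket(value, edges):
    """Binning: index of the first edge the value falls below."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def min_max(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def build_features(row):
    """Turn one raw record into a flat dictionary of model features."""
    ts = datetime.fromisoformat(row["timestamp"])
    features = one_hot(row["channel"], ["web", "store", "phone"])
    features["age_bucket"] = bucket(row["age"], [18, 35, 55])
    features["hour"] = ts.hour                        # extracted time element
    features["is_weekend"] = int(ts.weekday() >= 5)   # Saturday=5, Sunday=6
    # Interaction feature: combines two raw inputs into one signal.
    features["spend_per_visit"] = row["spend"] / max(row["visits"], 1)
    return features

example = build_features({
    "timestamp": "2021-03-06T14:30:00", "channel": "web",
    "age": 41, "spend": 120.0, "visits": 4,
})
scaled = min_max([2, 4, 6])
```

Keeping each transformation in its own small function makes it easier to test features individually, which is the "Testing your features" step on the slide.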
Batch Training vs Real-time Training
Batch Evaluation vs Real-time Evaluation
Truth Matrix (confusion matrix)
Mean Squared Error
Evaluation time
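The evaluation measures named above can be computed in a few lines. The sketch below reads "truth matrix" as a binary confusion matrix, which is an assumption about the slide's intent, and uses made-up labels and predictions. The `timed` helper is a hypothetical name for a simple way to record evaluation time.

```python
import time

def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def timed(fn, *args):
    """Run fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Made-up classification labels and regression values.
matrix = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
mse, seconds = timed(mean_squared_error, [3, 5, 2, 4], [2, 5, 4, 4])
```

The confusion counts suit classification models and the mean squared error suits regression models; evaluation time matters for both when the model has to meet a latency target.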
Automation
Scaling
SLAs
Versioning
Data Pipelines
Ongoing Data Acquisition
Ongoing Data Cleaning
Ongoing Feature Encoding
Integration in application
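One way to picture the pipeline and versioning items above is a list of small steps run in order, with a version tag derived from the configuration that produced the model. This is only a sketch: `run_pipeline`, `version_tag`, the step functions, and the configuration keys are hypothetical names, and a production system would typically rely on a dedicated orchestration or model registry tool.

```python
import hashlib
import json

def run_pipeline(rows, steps):
    """Apply each step in order; every step takes and returns a list of rows."""
    for step in steps:
        rows = step(rows)
    return rows

def version_tag(config):
    """Short, repeatable identifier derived from a configuration dictionary."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def drop_missing(rows):
    """Ongoing data cleaning: discard rows without an amount."""
    return [r for r in rows if r.get("amount") is not None]

def add_flag(rows):
    """Ongoing feature encoding: mark large amounts with a 0/1 flag."""
    return [dict(r, large=int(r["amount"] > 100)) for r in rows]

incoming = [{"amount": 150}, {"amount": None}, {"amount": 20}]
prepared = run_pipeline(incoming, [drop_missing, add_flag])
tag = version_tag({"steps": ["drop_missing", "add_flag"], "threshold": 100})
```

Because the tag depends only on the configuration contents, the same settings always yield the same identifier, which helps trace a prediction back to the pipeline that produced it.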
Drift
Degrading the model
Predictions and their effects
Feature Optimization
Retraining
Remodeling
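Drift can be monitored in many ways; one widely used measure is the population stability index (PSI), which compares how a feature is distributed in training data versus live data. The deck does not name a specific technique, so PSI is swapped in here purely as an example. The bin edges, sample values, and the `needs_retraining` threshold are hypothetical.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bin index is the number of edges the value meets or exceeds.
            counts[sum(v >= e for e in edges)] += 1
        # eps keeps empty bins from causing division by zero or log(0).
        return [max(c / len(values), eps) for c in counts]
    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def needs_retraining(score, threshold):
    """Flag the model for review when drift exceeds a chosen threshold."""
    return score > threshold

training = [1, 2, 3, 4, 5, 6, 7, 8]
live_same = list(training)
live_shifted = [6, 7, 8, 9, 9, 8, 7, 2]
edges = [3, 6]

stable_score = psi(training, live_same, edges)
shifted_score = psi(training, live_shifted, edges)
```

A score near zero means the live data still looks like the training data; a rising score is a prompt to investigate and, if warranted, retrain or remodel.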
What does it mean to be done?
Explanation as a Result