How to Make Your Data Scientists Happy
A use-case backed approach for enabling data science in enterprise
April 2018
ANACONDACON 2018
HUSSAIN SULTAN
WASHINGTON DC
Leader in computational
Python development and
Data Science
Amazon and Capital One
Consulting clients: leading
Fintech lenders and
mega-regional banks
TIM HORAN
WASHINGTON DC
10 years of consumer lending
Led US Credit Card Valuations
at Capital One
Consulting clients: leading
marketplace installment loan
lender and global 100 banks
Introduction
3
Explosion of Data
Modern Analytics
Analytics and data management technology have
progressed significantly in the last 10 years
Cloud Computing
Software Development
Predictive Analytics
Open Source
Infrastructure Automation
90% of today's data was created in the last two
years [1]
$219.6 billion spent globally on public cloud
services in 2016 and predicted to be $411 billion
by 2020 [2]
The line between software development and
sustainable analysis is blurring
The hive-mind of open source clearly has a
space in modern analytics as enterprise
solutions build on top and around it
Low-cost compute and storage make
Machine Learning and Artificial Intelligence
accessible
By the end of 2018, spending on IT-as-a-Service
for data centers, software, and services will be
just under $550 billion worldwide [3]
[1] IBM 10 Key Marketing Trends for 2017 - https://ibm.co/2y0r7Ee
[2] Gartner Press Release - http://gtnr.it/2Fw5LmJ
[3] Deloitte Technology, Media, and Telecommunications Predictions 2017 - http://bit.ly/2jMYdwm
4
In 2014, Gartner Research predicted 60% of
Big Data projects through 2017 would be failures.
When 2017 rolled around ...
Despite significant investment by enterprises to embrace
Big Data and modern analytics, most efforts are failing.
6
We blame unhappy Data Scientists
7
Let’s start with a game
9
Who are your Data Scientists, and what do they do?
Roles: Biz Analyst, Data Scientist, Developer, Data Engineer, DevOps
Activities:
• Business Insight Generation
• Model Building
• Insight / Model Deployment
• Analytical Tool Creation
• Data Science Enablement
• Data Management
[1] Leveraging Base Framework from Anaconda – Journey to Open Data Science - http://bit.ly/2FyvpHD
Data Scientists play a critical bridge role between
Biz Analysts and traditional IT roles in enterprise
Deployment in enterprise requires the
most coordination across teams
12
How to make your Data Scientists happy
14
Data Scientists want to drive change in their organization
using Data
It begins with the first word, Data …
• Raw data handled and stored consistently to eliminate data silos
• Metadata readily available, in particular, lineage working backward to raw data sources
• A well understood and thoughtful data access process
Tools to get the job done …
• Open source first and foremost (Python/Anaconda, R)
• Scaled Data Science platform to enable interactive exploration and visualization
• Thoughtful and well understood open source governance process
Transparent path from insight to impact …
• Automated workflows to deploy new insights to market and monitor results
• At minimum, transparency on how to bring insights to market/production
A path from insight to implementation is consistently the largest gap to
successful “Big Data” / modern analytics projects.
16
A Common “Big Data” Project Life Cycle
[Diagram: two tracks side by side.
Enterprise BAU (Business as Usual) Solution: Production Infrastructure; Analytics and Monitoring Stack; Implementation Process; Biz Analyst / Data Scientists.
Parallel Modernization Lab or Center of Excellence: New ETL Process; On-Prem Hadoop or Cloud Database; Scaled Data Science Environment; Data Scientists; New Insights (Model, Strategy); New Implementation Process.]
Common Challenge #1
• Key performance indicator for new ETL focused on moving as
much data into the lake as possible
• Data landing with limited metadata or challenging structures
• BAU solution not built on raw schema may not have been
re-created in new ETL process
Common Challenge #2
• Translation required due to separate development environments
• New technology implemented on legacy infrastructure creates
unexpected hurdles or brick walls
• Production implementation requires buy-in that prototypes or
proofs of concept don’t require
19
Recommended Approach
Our modern analytics and Big Data engagements center around
an effective use case from which all software, infrastructure,
and organizational investments are informed.
• Identify a Use Case
• Modernize analytics infrastructure as needed
• Build the use case as an iteratively improving product
• Sustain the new product and infrastructure
20
Strategically Important
• Does the use case align with corporate imperatives?
• Will its success open the door for more use cases in
your direct team and across the broader organization?
Actionable
• Will insights or results from the use case lead to
in-market changes?
• Can insights or results drive change quickly and be
iteratively improved over time?
Material
• Can insights or results from the use case drive material
impact to the business?
Identify
a strategically-important,
actionable, material use
case to gain support and
guide your investment
21
To illustrate the approach, consider a specific use case: a
marketing response propensity model
[Diagram: Analytical Database; Biz Analyst / Data Scientists; Linear Response Model Specifications; Marketing Targeting Production Intent; 3rd Party Implementation & Data Partner (Credit Bureau); Marketed Prospects; Non-Marketed Prospects]
Modern Analytics Use Case Litmus Test
Strategically Important: Response model used regularly to target
marketing spend – driving the growth of a critical business.
Actionable: There is opportunity to leverage new machine learning
techniques to build models that typically outperform traditional linear
response models. It is unclear whether our implementation partner can
support new model types.
Material: Determined by measuring the net incremental responders
generated when the model is implemented. If the juice is not worth
the squeeze, don’t invest.
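The materiality check above comes down to simple arithmetic. The following is a minimal sketch, assuming the new model is compared with the existing one at the same marketing volume. The response rates, volume, value per responder, and investment figure are hypothetical and do not come from the talk.

```python
def net_incremental_responders(mail_volume, baseline_rate, challenger_rate):
    """Responders gained by targeting the same volume with the new model."""
    return mail_volume * (challenger_rate - baseline_rate)


def is_material(incremental_responders, value_per_responder, investment):
    """True when the incremental value exceeds the cost of the investment."""
    return incremental_responders * value_per_responder > investment


# Hypothetical inputs for illustration only.
gain = net_incremental_responders(
    mail_volume=1_000_000, baseline_rate=0.010, challenger_rate=0.012
)
worth_it = is_material(gain, value_per_responder=150.0, investment=200_000.0)
print(f"net incremental responders: {gain:.0f}, worth the squeeze: {worth_it}")
```

If `worth_it` comes back false under realistic inputs, the use case fails the "Material" test and another candidate is probably a better place to invest.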
Legacy process and tools are marred with manual touch-points
and lack modern techniques
[Diagram. Stages: Source Systems; Modeling & Analytics Environment (Enterprise Guide); Production C Environment. Outputs: Marketed Prospects; Non-Marketed Prospects. Hand-off: Raw Data Processed (Data Engineer)]
Pain Point #1: Limited modern analytics
tool chest for response model building
[Diagram hand-offs: Response Model Passed (Data Scientist); Marketing Targeting Passed (Biz Analyst); Implemented in Production and Compared with Analytics Environment (Developer & DevOps)]
Pain Point #2: Manual, bespoke testing
and go-to-production process
Rather than standing up a whole new process, we focused on the
largest pain points and improved them
[Diagram: Source Systems; Production C Environment; Marketed Prospects; Non-Marketed Prospects]
Changing either the Source Systems or the Production Environment had
the most interdependencies outside of the use case, so we left both unchanged
[Diagram adds: Open Source Sandbox alongside the Modeling & Analytics Environment (Enterprise Guide). Hand-off: Modeling Data Ported (Data Scientist)]
Replacing the overall Modeling and Analytics environment was costly and
time-consuming, so we stood up a separate Open Source Sandbox
[Diagram adds: XGBoost to C Package. Hand-off: Response Model Passed (Data Scientist)]
To enable Machine Learning Models like GBM (Gradient Boosting Machine), we
created an XGBoost model dump to C translation package
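The talk does not show the translation package itself. The following is only a minimal sketch of the idea, assuming the JSON tree layout produced by XGBoost's `Booster.get_dump(dump_format="json")` (nodes with `nodeid`, `split`, `split_condition`, `yes`, `no`, `children`, or `leaf`) and default feature names such as `f0`. The sample tree, the function names, and the omission of missing-value handling are all assumptions made for illustration. The tree is hand-written so the sketch runs without XGBoost installed.

```python
import json

# Hand-written stand-in for one entry of Booster.get_dump(dump_format="json").
SAMPLE_TREE = json.loads("""
{"nodeid": 0, "split": "f0", "split_condition": 0.5, "yes": 1, "no": 2, "missing": 1,
 "children": [
   {"nodeid": 1, "leaf": -0.2},
   {"nodeid": 2, "split": "f1", "split_condition": 3.0, "yes": 3, "no": 4, "missing": 3,
    "children": [{"nodeid": 3, "leaf": 0.1}, {"nodeid": 4, "leaf": 0.4}]}
 ]}
""")


def feature_index(name):
    """Map default feature names ("f0", "f1", ...) to array indices."""
    return int(name[1:])


def node_to_c(node, depth=1):
    """Recursively render one tree node as nested C if/else statements."""
    pad = "    " * depth
    if "leaf" in node:
        return f"{pad}return {node['leaf']!r};\n"
    children = {child["nodeid"]: child for child in node["children"]}
    idx = feature_index(node["split"])
    # A row goes to the "yes" child when feature < split_condition.
    # Missing values are NOT handled in this sketch.
    return (
        f"{pad}if (x[{idx}] < {node['split_condition']!r}) {{\n"
        + node_to_c(children[node["yes"]], depth + 1)
        + f"{pad}}} else {{\n"
        + node_to_c(children[node["no"]], depth + 1)
        + f"{pad}}}\n"
    )


def tree_to_c(tree, name):
    """Wrap the rendered tree in a C function taking a feature array."""
    return f"static double {name}(const double *x) {{\n{node_to_c(tree)}}}\n"


def predict(node, x):
    """Pure-Python reference for checking the generated C against."""
    while "leaf" not in node:
        children = {child["nodeid"]: child for child in node["children"]}
        go_yes = x[feature_index(node["split"])] < node["split_condition"]
        node = children[node["yes"] if go_yes else node["no"]]
    return node["leaf"]


c_source = tree_to_c(SAMPLE_TREE, "tree_0")
print(c_source)
```

A full model would emit one such function per tree and sum the results. That sum is a raw margin, so a logistic objective would still need the model's base score and a sigmoid applied. The `predict` reference gives the production team something to compare the compiled C output against.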
[Diagram adds: Local Intent Testing. Hand-off: Marketing Targeting Passed (Biz Analyst)]
To lessen the iterative, manual Marketing Targeting intent checks, we
deployed testing that verified Excel inputs against production outputs
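The talk does not describe the test harness. The following is a minimal sketch of such a check, assuming the analyst's Excel sheet has been exported to CSV and that production decisions are available in the same form. The column names (`prospect_id`, `expected_decision`, `decision`) and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical stand-ins: the analyst's intent (exported from Excel to CSV)
# and the decisions produced by the production implementation.
INTENT_CSV = """prospect_id,expected_decision
1001,mail
1002,no_mail
1003,mail
"""
PRODUCTION_CSV = """prospect_id,decision
1001,mail
1002,mail
1003,mail
"""


def load_decisions(text, column):
    """Read a CSV export into a {prospect_id: decision} mapping."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["prospect_id"]: row[column] for row in reader}


def intent_mismatches(intent, production):
    """Return (prospect_id, expected, actual) for every disagreement."""
    mismatches = []
    for prospect_id, expected in sorted(intent.items()):
        actual = production.get(prospect_id)  # None if production has no such row
        if actual != expected:
            mismatches.append((prospect_id, expected, actual))
    return mismatches


intent = load_decisions(INTENT_CSV, "expected_decision")
production = load_decisions(PRODUCTION_CSV, "decision")
failures = intent_mismatches(intent, production)
for prospect_id, expected, actual in failures:
    print(f"prospect {prospect_id}: expected {expected}, production returned {actual}")
```

Running a check like this locally, before anything is handed to the production team, is what replaces the repeated manual comparisons.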
The initial use case deliverable enabled modern machine learning models and
lessened the manual testing previously required
36
Remember our Common “Big Data” Project Life Cycle
37
Our use case approach often leads to hybrid solutions
that get material results as quickly as possible
[Diagram: Initial Use Case Solution. Existing: Production Infrastructure; Analytics and Monitoring Stack; Implementation Process; Biz Analyst / Data Scientists. New: Open Source Sandbox; GBM Conversion Routine and Local Intent Testing]
38
As part of the initial launch, we build the use case as a
product that can iteratively improve
Illustrative Product Structure
• Product: Machine Learning Response Model
• Product Team: works from a Backlog through Test, Build, Deploy using Software Engineering Best Practices
• Internal Customer(s): Biz Analysts, who receive Features
• The product team delivers features with a focus on continued improvement, not on getting the product “done”
• The model is iteratively improved as more or new data becomes available
Finalize potential architecture as you iterate
[Diagram: illustrative architecture]
• Biz Analyst and Data Scientist: Build New Model
• Computational Frameworks on Distributed Compute & Storage: Model Grid Search, Distributed Model Training, Model Conversion
• Continuous Integration: Automated Builds and Job Scheduling, Automated Model Validation
• Anaconda Repository: historical model versions are built and stored for future use
• Deploy Model Build Package to Databases / 3rd Party Services / Prediction APIs
• Outputs: Marketed Prospects, Non-Marketed Prospects
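The slide names an "Automated Model Validation" step without describing it. One way such a gate could work, shown here only as a sketch under assumptions, is to compare a newly built model against the current one on a holdout set. The AUC metric, the `min_lift` threshold, and the holdout data below are hypothetical choices, not the presenters' method.

```python
def auc(labels, scores):
    """Area under the ROC curve: the share of (responder, non-responder)
    pairs that the model ranks correctly, counting ties as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one responder and one non-responder")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def should_promote(labels, champion_scores, challenger_scores, min_lift=0.01):
    """Promote the new model only if it beats the current one by min_lift AUC."""
    lift = auc(labels, challenger_scores) - auc(labels, champion_scores)
    return lift >= min_lift


# Hypothetical holdout data for illustration only.
holdout_labels = [1, 0, 1, 0, 0, 1]
champion = [0.6, 0.5, 0.4, 0.3, 0.7, 0.8]
challenger = [0.7, 0.4, 0.6, 0.2, 0.5, 0.9]
print(should_promote(holdout_labels, champion, challenger))
```

In a continuous integration job, a false result would stop the new model build from being published to the repository.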
41
Takeaways
Work Backwards from a Specific Use Case
• Identify the problem you want to solve, not the technology you
want to use.
Identify Path to Implementation ASAP
• The path to implementation is historically the largest challenge
to successful Big Data and modern analytics projects. Learn
from others’ mistakes.
Think MacGyver, not Michelangelo
• The goal is to get material enhancement into production as
quickly as possible. You won’t have the perfect architecture on
your first pass.
Organize Around Products
• With a product team, clear customers, and a backlog in place, the
initial answer can be enhanced bit by bit while continuing to
drive better in-production solutions.
42
Thank you!
