The purpose of this presentation is to show what end-to-end machine learning looks like in a real-world enterprise. It is aimed at aspiring data scientists whose ML courses or education have mostly focused on algorithms rather than the end-to-end pipeline.
The architecture and components mentioned in Slide 11 will be discussed in detail in a series of LinkedIn posts over the next few months.
To get updates, follow me on LinkedIn or search/follow the hashtag #end2endDS. Posts will begin in August 2019 and continue through September 2019.
Guiding through a typical Machine Learning Pipeline (Michael Gerke)
Many people are talking about AI and machine learning. Here's a quick guideline on how to manage ML projects and what to consider when implementing machine learning use cases.
Machine Learning (ML) Overview: Algorithms, Use Cases and Applications (SlideTeam)
"You can download this product from SlideTeam.net"
Machine Learning (ML) Overview: Algorithms, Use Cases and Applications is aimed at mid-level managers and covers what machine learning is, how it works, its algorithms, and its use cases. It also contrasts machine learning with traditional programming so you can see how to apply ML for business growth. https://bit.ly/2ZaVSG9
Data scientists and machine learning practitioners seem to be churning out models by the dozen, continuously experimenting to improve their accuracy. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated collection of assets that require different runtimes, resources, and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available to real-time applications without significant latency? Different techniques are needed for batch (offline) inference and for instant, online scoring. Data must be accessed from various sources, and cleansing and transformation of that data must be possible before any predictions are made. In many cases there may be no substitute for customized, scripted data handling either.
Enterprises also require built-in auditing, authorization, and approval processes, while still supporting a "continuous delivery" paradigm that lets a data scientist deliver insights faster. Not all models are created equal, nor are the consumers of a model, so enterprises require both metering and allocation of compute resources to meet SLAs.
In this session, we will look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the private cloud, optimized for the Hortonworks Hadoop Data Platform. DSX brings typical software engineering practices to data science, organizing the dev -> test -> production flow for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor the accuracy of, and even roll back models and custom scorers, and how API-based techniques let consuming business processes and applications remain relatively stable amid all the change.
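The dev -> test -> production flow with rollback described above can be sketched in a few lines of plain Python. The registry below is purely illustrative and does not reflect DSX's actual APIs; the model names and stages are invented for the example.

```python
# Illustrative sketch of a model registry with staged promotion and rollback.
# Every name here is hypothetical; DSX exposes this via its own UI and APIs.

class ModelRegistry:
    def __init__(self):
        self._versions = {}   # model name -> list of model objects (v1 at index 0)
        self._stage = {}      # (name, stage) -> currently promoted version number

    def register(self, name, model):
        self._versions.setdefault(name, []).append(model)
        return len(self._versions[name])      # new version number

    def promote(self, name, version, stage):
        assert 1 <= version <= len(self._versions[name])
        self._stage[(name, stage)] = version

    def rollback(self, name, stage):
        # Step the stage pointer back to the previous version.
        current = self._stage[(name, stage)]
        if current > 1:
            self._stage[(name, stage)] = current - 1

    def get(self, name, stage):
        version = self._stage[(name, stage)]
        return self._versions[name][version - 1]

registry = ModelRegistry()
registry.register("churn", lambda x: 0)       # v1: old scorer
registry.register("churn", lambda x: 1)       # v2: new scorer
registry.promote("churn", 2, "production")
registry.rollback("churn", "production")      # v2 misbehaves; back to v1
print(registry.get("churn", "production")(None))  # -> 0
```

Because consumers call `get("churn", "production")` rather than a specific version, a rollback is invisible to them, which is the stability-amid-change property the session describes.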
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
A short presentation for beginners introducing machine learning: what it is, how it works, the popular machine learning techniques and learning models (supervised, unsupervised, semi-supervised, reinforcement learning), and how they work, with various industry use cases and popular examples.
An AI Maturity Roadmap for Becoming a Data-Driven Organization (David Solomon)
The initial version of a maturity roadmap to help guide businesses when adopting AI technology into their workflow. IBM Watson Studio is referenced as an example of technology that can help in accelerating the adoption process.
What Is Data Science? | Introduction to Data Science | Data Science For Beginners (Simplilearn)
This Data Science presentation will help you understand what Data Science is, why we need it, the prerequisites for learning it, what a Data Scientist does, the Data Science lifecycle with an example, and career opportunities in the Data Science domain. You will also learn the differences between Data Science and Business Intelligence. The role of a data scientist is one of the sexiest jobs of the century. The demand for data scientists is high, and the number of opportunities for certified data scientists is increasing. Every day, companies look for more and more skilled data scientists, and studies show a continued shortfall in qualified candidates to fill these roles. So let us dive deep into Data Science and understand what it is all about.
This Data Science Presentation will cover the following topics:
1. The need for Data Science
2. What is Data Science?
3. Data Science vs Business Intelligence
4. Prerequisites for learning Data Science
5. What does a Data Scientist do?
6. The Data Science life cycle, with a use case
7. Demand for Data Scientists
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. Data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist, you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
The Data Science with Python course is recommended for:
1. Analytics professionals who want to work with Python
2. Software professionals looking to get into the field of analytics
3. IT professionals interested in pursuing a career in analytics
4. Graduates looking to build a career in analytics and data science
5. Experienced professionals who would like to harness data science in their fields
MLflow is an MLOps tool that enables data scientists to quickly productionize their machine learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. It is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers, such as tracking experiments, reproducibility, deployment tooling, and model versioning. Get your hands dirty with a quick ML project using MLflow, released to production, to understand the MLOps lifecycle.
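To make the Tracking component concrete, here is a minimal pure-Python sketch of what an experiment tracker records per run. The class and method names are illustrative, not MLflow's real API (MLflow itself exposes this through calls such as `mlflow.log_param` and `mlflow.log_metric`).

```python
# Sketch of experiment tracking: each run stores its parameters and a
# history of metric values, so runs can be compared and the best selected.
import uuid

class Run:
    def __init__(self, experiment):
        self.run_id = uuid.uuid4().hex
        self.experiment = experiment
        self.params, self.metrics = {}, {}

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        # Keep a history so metric progress across steps is preserved.
        self.metrics.setdefault(key, []).append(value)

runs = []
run = Run("churn-model")
run.log_param("learning_rate", 0.01)
run.log_metric("accuracy", 0.82)
run.log_metric("accuracy", 0.87)   # later epoch improves
runs.append(run)

# Pick the run whose final accuracy is highest - the comparison MLflow's
# Tracking UI performs across many runs.
best = max(runs, key=lambda r: r.metrics["accuracy"][-1])
print(best.params["learning_rate"])  # -> 0.01
```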
Intro to Data Science for Enterprise Big Data (Paco Nathan)
If you need a different format (PDF, PPT) instead of Keynote, please email me: pnathan AT concurrentinc DOT com
An overview of Data Science for Enterprise Big Data. In other words, how to combine structured and unstructured data, leveraging the tools of automation and mathematics, for highly scalable businesses. We discuss management strategy for building Data Science teams, basic requirements of the "science" in Data Science, and typical data access patterns for working with Big Data. We review some great algorithms, tools, and truisms for building a Data Science practice, and provide some great references for further study.
Presented initially at the Enterprise Big Data meetup at Tata Consultancy Services, Santa Clara, 2012-08-20 http://www.meetup.com/Enterprise-Big-Data/events/77635202/
The right architecture is key to any IT project. This is especially true for big data projects, where there are no standard architectures that have proven their suitability over the years. This session discusses the different big data architectures that have evolved over time, including the traditional Big Data architecture, the Streaming Analytics architecture, and the Lambda and Kappa architectures, and maps components from both open source and the Oracle stack onto them.
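The core idea of the Lambda architecture mentioned above, merging a slow but complete batch view with a fast real-time view at query time, fits in a few lines. The page-count data below is made up purely for illustration.

```python
# Lambda architecture in miniature: the serving layer answers a query by
# combining the batch view (recomputed periodically over all data) with
# the speed layer (events that arrived since the last batch run).
batch_view = {"page_a": 1000, "page_b": 500}   # e.g. recomputed nightly
speed_layer = {"page_a": 7, "page_c": 3}       # events since last batch run

def query(page):
    return batch_view.get(page, 0) + speed_layer.get(page, 0)

print(query("page_a"))  # -> 1007 (batch count plus recent events)
print(query("page_c"))  # -> 3 (seen only by the speed layer so far)
```

The Kappa architecture simplifies this by dropping the batch layer entirely and treating everything, including reprocessing, as stream processing.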
Data Science Training | Data Science For Beginners | Data Science With Python (Simplilearn)
This Data Science presentation will help you understand what Data Science is, who a Data Scientist is, what a Data Scientist does, and how Python is used for Data Science. Data science is an interdisciplinary field of scientific methods, processes, algorithms, and systems used to extract knowledge or insights from data in various forms, structured or unstructured, similar to data mining. This tutorial will help you build your skills in analytical techniques using Python. With this video, you'll learn the essential concepts of Data Science with Python programming and understand how data acquisition, data preparation, data mining, model building & testing, and data visualization are done. This tutorial is ideal for beginners who aspire to become Data Scientists.
This Data Science presentation will cover the following topics:
1. What is Data Science?
2. Who is a Data Scientist?
3. What does a Data Scientist do?
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. A data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist, you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with Python certification training course. With Simplilearn's Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:
1. Gain an in-depth understanding of data science processes: data wrangling, data exploration, data visualization, hypothesis building, and testing. You will also learn the basics of statistics.
2. Install the required Python environment and other auxiliary tools and libraries.
3. Understand the essential concepts of Python programming, such as data types, tuples, lists, dicts, basic operators, and functions.
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions.
Learn more at: https://www.simplilearn.com
How to Become a Data Scientist
SF Data Science Meetup, June 30, 2014
Video of this talk is available here: https://www.youtube.com/watch?v=c52IOlnPw08
More information at: http://www.zipfianacademy.com
Zipfian Academy @ Crowdflower
Keynote address delivered on 23rd March 2011 at the Workshop on Data Mining and Computational Biology in Bioinformatics, sponsored by DBT India and organised by the Unit of Simulation and Informatics, IARI, New Delhi.
I do not claim any originality for the slides or their content, and in fact acknowledge various web sources.
Loading your Life into a Vector Database (Ben Church)
These are the slides from a talk I gave at Scale By the Bay titled "Loading your Life into a Vector Database".
It discusses how to build a system that uses Retrieval Augmented Generation, what the constraints are, and why GraphQL is a powerful choice.
In this talk we cover vectors, vector databases, token limits, context stuffing, and schema introspection.
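As a rough sketch of the retrieval-and-context-stuffing plumbing the talk describes, the toy below ranks documents by bag-of-words cosine similarity and packs the best ones under a token budget. A real system would use learned embeddings and a vector database; the documents and the crude whitespace "tokenizer" here are invented for illustration.

```python
# Toy RAG retrieval: score documents against a query with bag-of-words
# cosine similarity, then "stuff" the top hits into the prompt context
# without exceeding a token budget (the model's context window).
import math
from collections import Counter

def embed(text):
    # Stand-in for a learned embedding: raw term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "GraphQL lets clients ask for exactly the fields they need",
    "vector databases index embeddings for nearest neighbour search",
    "my grocery list: eggs milk bread",
]

def retrieve_and_stuff(query, token_limit=20):
    ranked = sorted(docs, key=lambda d: cosine(embed(query), embed(d)),
                    reverse=True)
    context, used = [], 0
    for doc in ranked:
        n = len(doc.split())            # crude token count
        if used + n > token_limit:      # respect the context window
            break
        context.append(doc)
        used += n
    return context

print(retrieve_and_stuff("how do vector databases search embeddings"))
```

The most relevant document lands first in the stuffed context, and the token budget decides how many runners-up fit alongside it.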
Machine Learning Model Deployment: Strategy to Implementation (DataWorks Summit)
This talk will introduce participants to the theory and practice of machine learning in production. It will begin with an introduction to machine learning models and data science systems, and then discuss data pipelines, containerization, real-time vs. batch processing, change management, and versioning.
As part of this talk, the audience will learn more about:
• How data scientists can have the complete self-service capability to rapidly build, train, and deploy machine learning models.
• How organizations can accelerate machine learning from research to production while preserving the flexibility and agility that data scientists and modern business use cases demand.
A small demo will show how to rapidly build, train, and deploy machine learning models in R, Python, and Spark, and continue with a discussion of API services, RESTful wrappers/Docker, PMML/PFA, ONNX, SQL Server embedded models, and lambda functions.
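The RESTful-wrapper pattern touched on above reduces to a small JSON scoring handler. The weights and payload shape below are invented for illustration, and the handler is written as a plain function so it can run (and be tested) without starting a web server; in practice it would sit behind Flask, a Docker container, or a serverless function.

```python
# Sketch of wrapping a trained model behind a JSON scoring endpoint.
import json

def model_predict(features):
    # Stand-in for a real model: a fixed weighted sum of the features.
    weights = [0.4, 0.6]
    return sum(w * x for w, x in zip(weights, features))

def scoring_handler(request_body: str) -> str:
    # Parse the request, score, and serialize the response - the whole
    # contract a REST scoring wrapper exposes to calling applications.
    payload = json.loads(request_body)
    score = model_predict(payload["features"])
    return json.dumps({"prediction": round(score, 4)})

print(scoring_handler('{"features": [1.0, 2.0]}'))
```

Because callers only depend on the JSON contract, the model behind `model_predict` can be retrained or swapped without changing any consuming application.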
Speakers
Sagar Kewalramani, Solutions Architect
Cloudera
Justin Norman, Director, Research and Data Science Services
Cloudera Fast Forward Labs
Machine learning allows us to build the predictive analytics solutions of tomorrow: solutions that let us better diagnose and treat patients, correctly recommend interesting books or movies, and even make the self-driving car a reality. Microsoft Azure Machine Learning (Azure ML) is a fully-managed Platform-as-a-Service (PaaS) for building these predictive analytics solutions. It is very easy to build solutions with it, helping to overcome the challenges most businesses face in deploying and using machine learning. In this presentation, we will look at how to create ML models with Azure ML Studio and deploy those models to production in minutes.
The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I'll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases.
This presentation explains the basics of the ETL (Extract-Transform-Load) concept in relation to data solutions such as data warehousing, data migration, and data integration. CloverETL is presented in detail as an example of an enterprise ETL tool. It also covers the typical phases of data integration projects.
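The Extract-Transform-Load phases can be sketched as three small functions. The sample rows and field names below are invented for illustration; real ETL tools (CloverETL among them) wire these phases together as configurable graph nodes rather than hand-written code.

```python
# A compressed ETL pass in plain Python: pull rows from a source, clean
# and reshape them, then load them into a target store.

def extract():
    # Source rows, e.g. as read from a CSV file or staging database.
    return [{"name": " Alice ", "amount": "100"},
            {"name": "bob", "amount": "250"}]

def transform(rows):
    # Normalize names and cast types - the "T" is where most work lives.
    return [{"name": r["name"].strip().title(), "amount": int(r["amount"])}
            for r in rows]

def load(rows, warehouse):
    # Write the cleaned rows into the target (here, a dict as a stand-in
    # for a warehouse table keyed by customer name).
    for r in rows:
        warehouse[r["name"]] = r["amount"]

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # -> {'Alice': 100, 'Bob': 250}
```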
The recent focus on Big Data in the data management community brings with it a paradigm shift: from the more traditional top-down, "design then build" approach to data warehousing and business intelligence, to the more bottom-up, "discover and analyze" approach to analytics with Big Data. Where does data modeling fit in this new world of Big Data? Does it go away, or can it evolve to meet the emerging needs of these exciting new technologies? Join this webinar to discuss:
Big Data: A Technical & Cultural Paradigm Shift
Big Data in the Larger Information Management Landscape
Modeling & Technology Considerations
Organizational Considerations
The Role of the Data Architect in the World of Big Data
MLOps Virtual Event | Building Machine Learning Platforms for the Full Lifecycle (Databricks)
Successfully building a machine learning model is hard enough. Reproducing your results at scale — enabling others to reproduce pipelines, comparing results from other versions, moving models into production, redeploying and rolling out updated models — is exponentially harder. To address these challenges and accelerate innovation, many companies are building custom “ML platforms” to automate the end-to-end ML lifecycle.
Watch a replay of this MLOps Virtual Event to hear more about the latest developments and best practices for managing the full ML lifecycle on Databricks with MLflow. We covered a checklist of capabilities you’ll need, common pitfalls, technological and organizational challenges, and how to overcome them.
https://www.youtube.com/playlist?list=PLTPXxbhUt-YUFNBwBsSIlknoNbS7GExZw
There are patterns for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, and many others.
But where are the data science and data engineering patterns?
Sometimes, data engineering reminds me of cowboy coding - many workarounds, immature technologies and lack of market best practices.
Dear students get fully solved assignments
Send your semester & Specialization name to our mail id :
“ help.mbaassignments@gmail.com ”
or
Call us at : 08263069601
(Prefer mailing. Call in emergency )
This is the Machine Learning Engineering in Production Course notes. This is the Week 3 of Machine Learning Data Life Cycle in Production (Course 2) course. This is the course 2 of MLOps specialization on coursera
How to classify documents automatically using NLPSkyl.ai
About the webinar
Documents come in different shapes and sizes - From technical documents, customer support chat, emails, reviews to news articles - all of them contain information that is valuable to the business.
Managing these large volume data documents in a traditional manual way has been a complex and time-consuming task that requires enormous human efforts.
In this webinar, we will discuss how Machine learning can be used to identify and automatically label news articles into categories like business, politics, music, etc. This can be applied in another context like categorizing emails, reviews, and processing text documents, etc.
What you will learn
- How businesses are leveraging document classification to their advantage
- Best practice to automate machine learning models in hours not months
- Demo: Classify news articles into the right category using convolution neural network
Accelerating Machine Learning as a Service with Automated Feature EngineeringCognizant
Building scalable machine learning as a service, or MLaaS, is critical to enterprise success. Key to translate machine learning project success into program success is to solve the evolving convoluted data engineering challenge, using local and global data. Enabling sharing of data features across a multitude of models within and across various line of business is pivotal to program success.
Data pipelines are the heart and soul of data science. Are you a beginner looking to understand data pipelines? A glimpse into what they are and how they work.
Salesforce Architect Group, Frederick, United States July 2023 - Generative A...NadinaLisbon1
Joined our community-led event to dive into the world of Artificial Intelligence (AI)! Whether you were just starting your AI journey or already familiar with its concepts, one thing was certain: AI was reshaping the future of work. This enablement session was your chance to level up your skills and stay ahead in that rapidly evolving landscape.
As AI news continues to dominate headlines, it's natural to have questions and concerns about its impact on our lives. Will AI take over human jobs? Will it render us obsolete? Rest assured, the outlook is far brighter than you may think. Rather than replacing humans, AI is designed to enhance our capabilities and work alongside us. It won't be replacing marketers, service representatives, or salespeople—it will be empowering them to achieve even greater results. Companies across industries recognize this potential and are embracing AI to unlock new levels of performance.
During this enablement session, you'll have the opportunity to explore how AI advancements can positively influence your professional journey and daily life. We'll debunk common misconceptions, address fears, and showcase real-world examples of how successful AI implementation leads to workforce augmentation rather than replacement. Be prepared to gain valuable insights and practical knowledge that will help you navigate the AI landscape with confidence.
Exploring Data Modeling Techniques in Modern Data Warehousespriyanka rajput
This article delves deep into data modeling techniques in modern data warehouses, shedding light on their significance and various approaches. If you are aspiring to be a data analyst or data scientist, understanding data modeling is essential, making a Data Analytics Course in Bangalore, Lucknow, Bangalore, Pune, Delhi, Mumbai, Gandhinagar, and other cities across India an attractive proposition.
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Daniel Zivkovic
Two #ModernDataStack talks and one DevOps talk: https://youtu.be/4R--iLnjCmU
1. "From Data-driven Business to Business-driven Data: Hands-on #DataModelling exercise" by Jacob Frackson of Montreal Analytics
2. "Trends in the #DataEngineering Consulting Landscape" by Nadji Bessa of Infostrux Solutions
3. "Building Secure #Serverless Delivery Pipelines on #GCP" by Ugo Udokporo of Google Cloud Canada
We ran out of time for the 4th presenter, so the event will CONTINUE in March... stay tuned! Compliments of #ServerlessTO.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
1. End to End Machine Learning for Aspiring Data Scientists
- Srivatsan Srinivasan
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
2. Before you proceed… Stop… Read… Proceed on your own terms
This presentation is not a complaint about online courses and academia, but a highlight of the gap between what these courses teach and what enterprises need.
Doing data science has its own set of challenges and multiple failure points. Some of the information I will be sharing on LinkedIn will cover those failure points in detail, and how to overcome them.
If you aspire to work in data science, this presentation and the series of posts I will be sharing over the next few months will take you through the end to end machine learning cycle in a typical organization.
-> Use this information to fill in the skills that can get you closer to industry needs.
-> Use this content to define a strategy for landing a job in the enterprise world.
You can search for posts using the hashtag #end2endDS on LinkedIn, or follow me on LinkedIn to get updates as I post.
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
Content on this topic will be posted between 29th July and 27th September, 2019. The frequency will depend purely on the bandwidth I have; on average you can expect one, at most two, posts a week.
I will also summarize key takeaways in articles, and update this presentation over time.
Every data scientist need not be an expert in the entire ML pipeline, but it is good to know the process.
- Happy Learning
6. If you see the "Data Science Hierarchy of Needs" below as hill climbing: academia puts you on top of the hill, while the real world is where you learn that the climb itself is the most difficult part.
Image Source: Hackernoon
7. Education (Courses/Academics) vs Enterprise

Education: Focus on model accuracy and usage of algorithms
Enterprise: Focus on deployment/integration; balance between accuracy and explainability

Education: Focus on increasing model complexity for better accuracy
Enterprise: Keep it simple as much as possible, for as long as possible

Education: Data mostly comes in a single file or a few files
Enterprise: Data comes from multiple enterprise systems and needs to be integrated, cross-referenced and summarized

Education: Data size is typically small to medium
Enterprise: Data size ranges from medium to very large

Education: Data is typically 80% clean
Enterprise: Data is 80% noisy

Education: Limited tools
Enterprise: More tools + DevOps + cloud + other cruft

Education: Work at a decent pace
Enterprise: Agile (not now, don't make me talk)
8. For most online courses:
Data Science = ML Code + Some Data Analysis
In reality:
Data Science = ML Code + Data Analysis + Data Collection + Data Engineering + Software Engineering + DevOps + BI Engineering + Product Management
Note: If you are coming from a premier institute that addresses all of this reality, please feel free to exit the presentation.
9. 5 Biggest Challenges for Enterprises Deploying ML Solutions
• Data collection
• Deploying and reproducing the model in production
• Model monitoring
• Keeping the model relevant by adapting to changing business scenarios
• Communicating and interpreting model output to various stakeholders
11. Components of an End to End Machine Learning Pipeline in the Real World

Problem Definition -> Business Understanding -> Data Understanding -> Model Integration and SLA Understanding

Model Training (iterative; some steps might be optional on a case-by-case basis):
Data Collection -> Data Analysis/Cleaning -> Data Organization and Transformation -> Feature Engineering -> Model Training -> Model Evaluation and Validation -> Model Deployment

Model Re-calibration (some steps might be optional on a case-by-case basis):
Model Monitoring -> Data Validation/Anomaly Detection -> Data Drift Analysis -> Model Drift Analysis -> Model Explanation (Local and Global)

Cross-cutting: Health Dashboards, Reports & Alerts | Model Management and Governance | Data Management | Model and Application Logging | Pipeline Orchestrator | Infrastructure/DevOps/Automation
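The training chain above can be sketched as a minimal orchestrated pipeline: each stage is a plain function over a shared context, and the orchestrator chains them in order. This is an illustrative sketch, not the author's implementation; the stage names mirror the slide, and the pass-through dict stands in for a real feature store.

```python
# Minimal sketch of the training pipeline: each stage takes and returns
# a context dict, and the orchestrator simply chains them in order.
def collect(ctx):
    ctx["raw"] = [1.0, 2.0, 3.0, 100.0]               # pretend source extract
    return ctx

def clean(ctx):
    ctx["clean"] = [x for x in ctx["raw"] if x < 50]  # drop gross outliers
    return ctx

def engineer_features(ctx):
    m = sum(ctx["clean"]) / len(ctx["clean"])
    ctx["features"] = [x - m for x in ctx["clean"]]   # mean-center
    return ctx

def train(ctx):
    ctx["model"] = {"bias": sum(ctx["features"]) / len(ctx["features"])}
    return ctx

PIPELINE = [collect, clean, engineer_features, train]

def run(pipeline, ctx=None):
    ctx = ctx or {}
    for stage in pipeline:
        ctx = stage(ctx)  # a real orchestrator adds retries, logging, lineage
    return ctx

result = run(PIPELINE)
```

A real orchestrator (Airflow, Kubeflow, etc.) adds scheduling, retries and lineage on top of exactly this kind of stage chaining.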
12. ML Components and Skills/Role Mapping

Component | Primary Responsibility | Secondary Responsibility
Problem Definition | Business Owner, AI Champion | Product Owner
Business Understanding | Product Owner, Business Owner, AI Champion | ML Engineer
Data Understanding | Data Engineer, ML Engineer, Product Owner | Business Owner/Analyst
Model Integration and SLA Understanding | ML Engineer, Data Engineer, Software Engineer | Business Owner, Product Owner
Data Collection | Data Engineer, Data Analyst | —
Data Analysis/Cleaning | Data Engineer, Data Analyst | —
Data Organization/Transformation | Data Engineer, ML Engineer | Data Analyst
Data Validation/Anomaly Detection | Data Analyst, Data Engineer | —
Feature Engineering | ML Engineer | Data Engineer
Model Training | ML Engineer | —
Model Evaluation/Validation | ML Engineer | Business Owner, Model Governance team
Model Monitoring | Operations Engineer, ML Engineer | BI Engineer
Model Deployment | Software Engineer, Data Engineer, ML Engineer | —
Data Drift/Model Drift | Operations Engineer, ML Engineer | BI Engineer, ML Engineer
Dashboards/Reports | BI Engineer | Business Owner, Product Owner

Note: Depending on the size of the ML project, one person might play multiple roles, or multiple people might be required for a single role. Some roles might also be part-time, and some components can be built as a capability leveraged across projects.
13. Most of the role definitions in the previous slide can be found online, so let me talk about the AI Champion, as not much is written about that role…
The AI Champion (Head of Analytics, or sometimes the CAO himself) is responsible for driving intelligent insights, backed by the data science capability, within the enterprise. He also owns the resulting ROI or impact numbers for delivered intelligent solutions. He leads the data science team, developing policies and strategies and propagating a culture of experimentation and research. He and his team are also responsible for working with business stakeholders in planning, identifying, prioritizing and implementing AI use cases.
You can find more details here: https://www.linkedin.com/pulse/identifying-prioritizing-artificial-intelligence-use-cases-srivatsan
This role might be more relevant in mid to large size organizations with multiple use cases to deliver, where the AI Champion helps the enterprise prioritize use cases that are a good fit for AI and generate substantial business value.
14. A Few Components of End to End ML Explained
(I will cover more details on each in my LinkedIn posts)
15. Data Collection
• Data is typically collected and centralized from a variety of sources into a Data Lake, a Data Warehouse, or another enterprise data ecosystem
• Data is sourced from high-volume transactional systems like ERP and sales, or from high-velocity IoT devices, POS systems, etc.
• Data takes a variety of shapes: structured, semi-structured and unstructured sources
• Data takes a variety of forms: batch, streaming, API, alternate data, etc.
• While ingesting data is one part of the puzzle, data also needs to be cataloged, secured and governed
Further Reading: https://www.linkedin.com/pulse/think-data-first-before-being-ai-srivatsan-srinivasan
"Define an efficient data strategy that is simple to implement and helps accelerate your AI strategy"
16. Data Analysis and Validation
Inspect and clean data to discover useful information that can further help in modeling an AI-driven intelligent solution.
The purpose of data analysis and validation is to understand:
• What are the characteristics of my data, and what does it look like?
• Are there any outliers or errors in the data?
• How do the independent variables respond to the target variable?
• Base statistics from the analysis phase are later used against production inference data to identify whether the data has evolved (drifted) away from the underlying assumptions the model was trained on
Further Reading: https://www.linkedin.com/pulse/tensorflow-extended-tfx-data-analysis-validation-drift-srinivasan/
"Understanding your data is a key step to insight"
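The last point, comparing training-time statistics against production data, can be illustrated with a simple z-score style check. This is only a stdlib sketch; real systems use richer tests (distribution distances, TFX data validation, etc.), and the threshold here is an assumed default.

```python
import statistics

def baseline_stats(train_col):
    """Capture per-feature statistics at training time."""
    return {"mean": statistics.mean(train_col),
            "stdev": statistics.stdev(train_col)}

def drifted(baseline, live_col, z_threshold=3.0):
    """Flag drift when the live mean strays far from the training mean."""
    live_mean = statistics.mean(live_col)
    z = abs(live_mean - baseline["mean"]) / baseline["stdev"]
    return z > z_threshold

train_ages = [23, 31, 29, 35, 27, 33, 30, 28]
base = baseline_stats(train_ages)

same_population = drifted(base, [30, 29, 32, 28])    # similar to training
shifted = drifted(base, [61, 64, 59, 66])            # population has moved
```

In production the baseline statistics would be persisted alongside the model artifact so the inference pipeline can run this check on every batch.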
17. Data Organization and Transformation
Data collected from source systems into the data ecosystem is typically at a granular level, not directly consumable by an ML model, and sources are spread across multiple domains. Take marketing as an example: data might be spread across customer, product, transaction and loyalty systems. Data organization and transformation makes data consumable for ML models, and also makes data accessible for self-service.
Raw data, typically in terabytes, is cleansed and aggregated into a form that can be fed directly into the model. This is where most of the heavy lifting happens, in close collaboration between business, data engineers, ML engineers and data analysts.
[Funnel diagram: Integrate -> Explore -> Aggregate (raw data, TB-PB; roughly 60% of the effort, by data engineers and data analysts) -> Model -> Deploy -> Monitor (model input data, MB-GB, down to insights in KB; roughly 40% of the effort, by ML engineers, data engineers and software engineers)]
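As a toy illustration of this aggregation step, granular transactions are rolled up into one model-ready feature row per customer. The schema is hypothetical; at TB scale this would be done in SQL or Spark, but the shape of the transformation is the same.

```python
from collections import defaultdict

# Granular transaction records as they might land in the data lake.
transactions = [
    {"customer": "c1", "amount": 120.0, "channel": "web"},
    {"customer": "c1", "amount": 80.0,  "channel": "store"},
    {"customer": "c2", "amount": 40.0,  "channel": "web"},
    {"customer": "c1", "amount": 60.0,  "channel": "web"},
]

def customer_features(txns):
    """Roll up raw transactions into one feature row per customer."""
    agg = defaultdict(lambda: {"txn_count": 0, "total_spend": 0.0, "web_txns": 0})
    for t in txns:
        row = agg[t["customer"]]
        row["txn_count"] += 1
        row["total_spend"] += t["amount"]
        row["web_txns"] += t["channel"] == "web"
    for row in agg.values():
        row["avg_spend"] = row["total_spend"] / row["txn_count"]
    return dict(agg)

features = customer_features(transactions)
```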
18. Model Deployment
A few key things to remember while deploying models to production or integrating models with business processes:
Further Reading: https://www.linkedin.com/pulse/ml-model-deployment-considerations-srivatsan-srinivasan/
https://www.linkedin.com/pulse/integrating-machine-learning-models-within-matured-srinivasan/
• Training/deployment skew: models developed on historical sources might have to be deployed in a streaming flow, or at the edge of the network/devices
• Not everything can be flask'ed or exposed as a service. Deployment scenarios vary based on the technology in the business process, inference SLAs, etc.
• Keep the model pipeline as simple as possible. Avoid spaghetti pipeline code
• Provision for experimentation with new models when implementing the deployment framework: champion/challenger or A/B-testing-based model deployment and analysis
• Training/deployment skew, again: features that are hard to compute at inference time, or features that were forward-computed at training time (this may sound implausible, but trust me, I have seen enterprises make this mistake)
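One practical guard against the training/deployment skew called out above is a single feature function that both the training job and the inference service import, rather than re-implementing the logic twice. A minimal sketch; the feature names and record schema are illustrative.

```python
def make_features(record):
    """Single source of truth for feature computation, shared by
    the offline training job and the online scoring service."""
    return {
        "amount_bucket": min(int(record["amount"]) // 100, 9),
        "is_weekend": record["day_of_week"] in ("sat", "sun"),
    }

# Training time: applied to a historical batch.
train_rows = [make_features(r) for r in [
    {"amount": 250, "day_of_week": "mon"},
    {"amount": 950, "day_of_week": "sun"},
]]

# Inference time: the service calls the very same function, so the
# representation cannot silently drift apart from training.
live_row = make_features({"amount": 250, "day_of_week": "mon"})
```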
19. Model Monitoring
Machine learning today is essential for running some of our critical business processes. ML is deployed in decision making, substituting for or augmenting humans, and since it is making decisions it needs to be monitored continuously.
Ongoing monitoring of ML models is essential to evaluate whether the assumptions the model was developed on have held, and whether it is performing as intended.
A model can drift due to changes in business assumptions, changes or issues with the data, or market conditions that require adjustment, among other causes. Ongoing monitoring highlights scenarios where a model might need re-calibration; for some business processes that can be yearly, for others as frequently as daily.
Plan for monitoring models continuously -> alert on drift in data, concept or model. Business today evolves rapidly, and the assumptions models are trained on quickly become invalidated. You want to know before your model starts making wrong predictions.
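A bare-bones continuous monitor might track accuracy over a rolling window of labeled outcomes and raise an alert when it falls below a threshold. A stdlib sketch with assumed window size and threshold; production monitoring would feed dashboards and paging systems instead of returning a boolean.

```python
from collections import deque

class AccuracyMonitor:
    """Rolling-window accuracy check that flags when the model degrades."""
    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes)

    def alert(self):
        return self.accuracy() < self.threshold

mon = AccuracyMonitor(window=10, threshold=0.8)
for p, a in [(1, 1)] * 9 + [(1, 0)]:   # 9 hits, 1 miss -> 90% accuracy
    mon.record(p, a)
healthy = mon.alert()                   # above threshold, no alert

for p, a in [(1, 0)] * 5:               # a run of misses pushes it down
    mon.record(p, a)
degraded = mon.alert()                  # now below threshold
```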
20. Other Key Components for Succeeding in Enterprise Machine Learning
• A structured and modularized code base
• Experiment tracking for reproducibility
• Version control of ML code, data and experiment results
• DevOps for both infrastructure and model deployment
• An orchestrator for data and model pipelines
• Logging critical deployment runtime info and making it searchable
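The experiment-tracking point can be as simple as logging every run's parameters, code version and metrics to an append-only store, which is the core of what dedicated tools like MLflow formalize. A hand-rolled sketch under that assumption; the run-ID scheme here is illustrative.

```python
import hashlib
import json
import time

RUNS = []  # stand-in for an append-only experiment store

def log_run(params, metrics, code_version):
    """Record everything needed to reproduce and compare a training run."""
    run = {
        "params": params,
        "metrics": metrics,
        "code_version": code_version,
        "timestamp": time.time(),
    }
    # Deterministic ID over the reproducibility-relevant fields, so
    # identical configurations map to the same run identity.
    key = json.dumps({"params": params, "code_version": code_version},
                     sort_keys=True)
    run["run_id"] = hashlib.sha256(key.encode()).hexdigest()[:12]
    RUNS.append(run)
    return run["run_id"]

a = log_run({"lr": 0.1, "depth": 6}, {"auc": 0.81}, "git:abc123")
b = log_run({"lr": 0.1, "depth": 6}, {"auc": 0.81}, "git:abc123")
c = log_run({"lr": 0.01, "depth": 6}, {"auc": 0.84}, "git:abc123")
```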
22. Food for thought #1: Various points of failure in the ML lifecycle
The machine learning cycle is not complete after deployment. Models need to be monitored continuously, and you should be prepared for failure at any point of the pre- and post-modeling exercise:
• Failure during experimentation. This is actually the ideal case, if you figure out the problem early
• Failure during development by not thinking about the real-world inference scenario, e.g. using features that are hard or impossible to compute during inference
• Failure post deployment, where some models do not generate the business value they were supposed to
• Failure post deployment to keep up with an ever-changing data landscape. These models need frequent re-calibration, or some form of continuous learning
• Failure to use the right performance metrics. Think about what your business needs to succeed, not what the model needs to succeed
Further Reading:
Reasons why ML projects fail: https://www.linkedin.com/pulse/top-reasons-why-artificial-intelligence-projects-fail-srinivasan/
23. Food for thought #2: Infrastructure
Further Reading: https://www.linkedin.com/pulse/accelerating-artificial-intelligence-initiatives-srivatsan-srinivasan/
Enterprises hiring artificial intelligence and machine learning experts without the right infrastructure and tools is like "hiring astronauts to drive a bullock cart".
Building data science capability within an enterprise must be thought through from the ground up, right from the selection of the silicon chip. Data engineering and ML processes are typically compute- and memory-intensive, and on large datasets the infrastructure has to be designed accordingly.
A data scientist typically performs hundreds of iterations to arrive at the right algorithm, hyperparameters and metrics. Not having the right infrastructure can derail an enterprise getting into machine learning.
Plan for infrastructure with the right kind of hardware (GPU, CPU, HPC, etc.), technologies (Hadoop, Kubernetes, etc.) and tools (Spark ML, TensorFlow, scikit-learn, etc.) that can distribute ML/DL pipelines for faster hypothesis testing and value generation.
The cloud is a very good alternative for accelerating the ML journey: you can spin up compute on demand and tear it down when not needed.
24. Food for thought #3: Cloud for AI/ML
Further Reading: https://www.linkedin.com/pulse/artificial-intelligence-google-cloud-platform-srivatsan-srinivasan/
https://www.linkedin.com/pulse/data-analytics-google-cloud-platform-srivatsan-srinivasan/
The cloud is a key component of the AI/ML journey, especially for enterprises that need agility to meet the huge compute demand of ML jobs.
Key benefits the cloud provides:
• Scale: instant access to hundreds of compute instances
• Speed: easy availability of specialized devices (GPU/TPU) that can accelerate AI development
• Cloud AI APIs: a quick jump start into complex capabilities rather than building from scratch. For cases like speech-to-text or language translation, the enterprise may also lack the data to build models as accurate as those available in the cloud
• Cloud AutoML: train high-quality models specific to business needs with citizen data scientists, or even business users
• Cloud bursting: with advances in hybrid cloud, start small in the local data center and use the cloud to scale AI compute
25. Food for thought #4: Stay simple as long as possible
When a simple model's accuracy is low, do you immediately jump to complex models? Try these two steps before moving to trendy and complex algorithms:
• Follow your model output -> listen to what your algorithm's metrics say. Drill down into misclassification scenarios and see whether you can find any interesting patterns
• Be curious and creative with your data -> try to find patterns or relationships in the data that could influence your model's outcome. A lot can be solved by proper EDA and feature engineering
If you are still not meeting performance targets, move to complex models in increments. The steps above remain relevant, and their results can be fed into your complex models to enhance the decision boundary.
In some critical business processes, 84% performance from a simple model might be better than 86% from a complex model.
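Drilling into misclassifications, as the first step suggests, can start by slicing evaluation errors by a feature and looking for segments where the model is disproportionately wrong. The data below is illustrative.

```python
from collections import Counter

# (segment, actual, predicted) tuples from an evaluation run.
results = [
    ("web",   1, 1), ("web",   0, 0), ("web",   1, 1), ("web",   0, 0),
    ("store", 1, 0), ("store", 1, 0), ("store", 0, 0), ("store", 1, 0),
]

def error_rate_by_segment(rows):
    """Surface segments where misclassifications concentrate."""
    total, wrong = Counter(), Counter()
    for segment, actual, predicted in rows:
        total[segment] += 1
        wrong[segment] += actual != predicted
    return {s: wrong[s] / total[s] for s in total}

rates = error_rate_by_segment(results)
# Here all the errors sit in the "store" segment -- a pattern worth
# investigating (missing feature? different data quality?) before
# reaching for a more complex model.
```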
26. Food for thought #5: Data Science and Agile
There is a lot of misconception about using Agile for data science. Data science outcomes depend on continuous experimentation, whereas Agile focuses on early and continuous delivery throughout the development lifecycle.
First, remember that Agile is a set of guiding principles, not a set-in-stone methodology. Agile can be tailored to your unique data science needs.
Here is one way of doing data science the Agile way, especially the machine learning part:
• Don't set strict deliverables at the end of every sprint
• Use daily/weekly meetings to surface blockers only, not daily status
• As soon as you have a working model with decent accuracy (say, every sprint or two), put it in private beta mode. Private beta (or dark) mode is where the model generates output but the output is not acted on. This lets you monitor the model against real-world data and test its reliability
• Keep updating the private beta as you build models with better performance
• Launch the private beta model to a small percentage of live traffic. Collect feedback based on the responses of end users
• Keep increasing the volume of transactions sent to the model at frequent intervals, until all traffic is diverted and the feedback/outcome targets are met
In the real world there are scenarios where an ML model might not deliver the same value seen during the training/evaluation phase. Agile delivery keeps machine learning projects value- and outcome-focused, and helps achieve project objectives in a timely manner.
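The private-beta-to-full-traffic progression above can be sketched as a deterministic traffic router: each request is hashed to a stable bucket, and the rollout percentage decides whether it is served by the challenger model or the current champion. The names and percentages are illustrative.

```python
import hashlib

def bucket(request_id, buckets=100):
    """Stable 0-99 bucket so a given request always routes the same way."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest, 16) % buckets

def route(request_id, rollout_pct):
    """At 0% the challenger sees no live traffic (shadow mode only);
    as rollout_pct grows, more requests are actioned by the challenger."""
    return "challenger" if bucket(request_id) < rollout_pct else "champion"

# Shadow mode: nothing is actioned by the challenger.
shadow = [route(f"req-{i}", 0) for i in range(1000)]

# 10% rollout: roughly a tenth of live traffic goes to the challenger,
# and the same request always gets the same decision.
partial = [route(f"req-{i}", 10) for i in range(1000)]
share = partial.count("challenger") / len(partial)
```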
27. Food for thought #6: Myth vs. Fact
Myth: If your tabular data is big, switch to deep learning; traditional ML will not work.
Fact: Traditional ML algorithms can scale to large datasets. There are distributed frameworks that can train your model on large datasets, and they are very effective at learning from them. Choose technology based on your business and data needs.
Myth: Machine learning will eventually replace the existing rules in legacy systems.
Fact: Think of ML initially as a technology for complementing your legacy rules. You can reduce the complexity of rules by introducing an ML solution. ML can eventually replace them, but it is always better to have some deterministic rules complementing your probabilistic ML models.
Myth: Machine learning is the new "magic wand" for making your business processes smart and intelligent.
Fact: Do not take a non-ML problem and try to fit ML into it. Use ML when you believe it will add value to the business process. You can also make business processes smart with advanced analytics or statistical techniques.
Myth: AutoML will replace and automate data science work.
Fact: Data science is more than what AutoML can currently do. AutoML will be an assistant to data scientists, taking care of the boring parts and letting them focus on delivering business value.
Further Reading on AutoML: https://www.linkedin.com/pulse/fear-data-scientist-called-autophobia-srivatsan-srinivasan/
28. To Summarize
• Plan to invest in the right infrastructure (GPU, CPU, cloud) to accelerate the model development process
• Only 20% or less of the actual pipeline is ML code
29. Thank You, and Stay Tuned on LinkedIn for more on the End to End Data Science Pipeline
Follow or search the hashtag #end2endDS on LinkedIn to get updates