According to Forrester Research, only 22% of companies are currently seeing a significant return from data science expenditures. Most data science implementations are high-cost IT projects, local applications that are not built to scale for production workflows, or laptop decision support projects that never impact customers. Despite this high failure rate, we keep hearing the same mantra and solutions over and over again. Everybody talks about how to create models, but not many people talk about getting them into production where they can impact customers.
Harvinder Atwal offers an entertaining and practical introduction to DataOps, a new and independent approach to delivering data science value at scale, used at companies like Facebook, Uber, LinkedIn, Twitter, and eBay. The key to adding value through DataOps is to adapt and borrow principles from Agile, Lean, and DevOps. However, DataOps is not just about shipping working machine learning models; it starts with better alignment of data science with the rest of the organization and its goals. Harvinder shares experience-based solutions for increasing your velocity of value creation, including Agile prioritization and collaboration, new operational processes for an end-to-end data lifecycle, developer principles for data scientists, cloud solution architectures to reduce data friction, self-service tools giving data scientists freedom from bottlenecks, and more. The DataOps methodology will enable you to eliminate daily barriers, putting your data scientists in control of delivering ever-faster cutting-edge innovation for your organization and customers.
Understanding DataOps and Its Impact on Application Quality - DevOps.com
Modern-day applications are data-driven and data-rich. The infrastructure your backends run on is a critical aspect of your environment and requires unique monitoring tools and techniques. In this webinar, learn what DataOps is and how critical good DataOps is to the integrity of your application. Intelligent APM for your data is critical to the success of modern applications. In this webinar you will learn:
The power of APM tailored for Data Operations
The importance of visibility into your data infrastructure
How AIOps makes data ops actionable
DataOps @ Scale: A Modern Framework for Data Management in the Public Sector - TamrMarketing
Within the last 6 months, U.S. agencies have begun defining a “Data Science Occupational Series”.
This means adding the term “(Data Scientist)” at the end of a job title to increase the odds of finding a candidate who understands data.
Watch the full presentation: https://resources.tamr.com/govdataops
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
Title
DataOps, the secret weapon for delivering AI, data science, and business intelligence value at speed.
Synopsis
● According to recent research, just 7.3% of organisations say the state of their data and analytics is excellent, and only 22% of companies are currently seeing a significant return from data science expenditure.
● Poor returns on data & analytics investment are often the result of applying 20th-century thinking to 21st-century challenges and opportunities.
● Modern data science and analytics require secure, efficient processes to turn raw data from multiple sources and in numerous formats into useful inputs to a data product.
● Developing, orchestrating and iterating modern data pipelines is an extremely complex process requiring multiple technologies and skills.
● Other domains have successfully overcome the challenge of delivering high-quality products at speed in complex environments. DataOps applies proven agile principles, lean thinking and DevOps practices to the development of data products.
● A DataOps approach aligns data producers, analytical data consumers, processes and technology with the rest of the organisation and its goals.
DataOps is a methodology and culture shift that brings the successful combination of development and operations (DevOps) to data processing environments. It breaks down silos between developers, data scientists, and operators, resulting in lean data feature development processes with quick feedback. In this presentation, we will explain the methodology, and focus on practical aspects of DataOps.
Introduction to DataOps and AIOps (or MLOps) - Adrien Blind
This presentation introduces the audience to the DataOps and AIOps practices. It deals with organizational and tech aspects, and provides hints to start your data journey.
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018 - Amazon Web Services
In modern, microservices-based applications, it’s critical to have end-to-end observability of each microservice and the communications between them in order to quickly identify and debug issues. In this session, we cover the techniques and tools to achieve consistent, full-application observability, including monitoring, tracing, logging, and service mesh.
Tech talk on what Azure Databricks is, why you should learn it and how to get started. We'll use PySpark and talk about some real-life examples from the trenches, including the pitfalls of leaving your clusters running accidentally and receiving a huge bill ;)
After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters.
This is part 1 of an 8-part Data Science for Dummies series:
Databricks for dummies
Titanic survival prediction with Databricks + Python + Spark ML
Titanic with Azure Machine Learning Studio
Titanic with Databricks + Azure Machine Learning Service
Titanic with Databricks + MLS + AutoML
Titanic with Databricks + MLFlow
Titanic with DataRobot
Deployment, DevOps/MLops and Operationalization
Differentiate Big Data vs Data Warehouse use cases for a cloud solution - James Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Data Lakehouse, Data Mesh, and Data Fabric (r2) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines - DATAVERSITY
With the aid of any number of data management and processing tools, data flows through multiple on-prem and cloud storage locations before it’s delivered to business users. As a result, IT teams — including IT Ops, DataOps, and DevOps — are often overwhelmed by the complexity of creating a reliable data pipeline that includes the automation and observability they require.
The answer to this widespread problem is a centralized data pipeline orchestration solution.
Join Stonebranch’s Scott Davis, Global Vice President and Ravi Murugesan, Sr. Solution Engineer to learn how DataOps teams orchestrate their end-to-end data pipelines with a platform approach to managing automation.
Key Learnings:
- Discover how to orchestrate data pipelines across a hybrid IT environment (on-prem and cloud)
- Find out how DataOps teams are empowered with event-based triggers for real-time data flow
- See examples of reports, dashboards, and proactive alerts designed to help you reliably keep data flowing through your business — with the observability you require
- Discover how to replace clunky legacy approaches to streaming data in a multi-cloud environment
- See what’s possible with the Stonebranch Universal Automation Center (UAC)
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
DataOps: An Agile Method for Data-Driven Organizations - Ellen Friedman
DataOps expands DevOps philosophy to include data-heavy roles (data engineering & data science). DataOps uses better cross-functional collaboration for flexibility, fast time to value and an agile workflow for data-intensive applications including machine learning pipelines. (Strata Data San Jose March 2018)
What’s New with Databricks Machine Learning - Databricks
In this session, the Databricks product team provides a deeper dive into the machine learning announcements. Join us for a detailed demo that gives you insights into the latest innovations that simplify the ML lifecycle — from preparing data, discovering features, and training and managing models in production.
For many decades now, the software industry has attempted to bridge the productivity gap, develop higher-quality code and manage the ever-growing complexity of software-intensive systems. The results have been mixed, and as a result, a great majority of today's software is still written manually by human developers. This is about to change rapidly as recent developments in the field of Artificial Intelligence show promising results. While artists and designers have been taken by surprise by OpenAI’s DALL-E 2’s capabilities in designing unique art, ChatGPT has astonished the rest of the world with its capability of understanding human interaction. AI-assisted coding solutions such as GitHub’s Copilot and Replit’s Ghostwriter, among many others, are rapidly developing in a direction where AI generates new code that runs fast with high quality. Little is known about the true capabilities of AI programmers and their impact on the software development industry, education, and research. This talk sheds light on the current state of ChatGPT, large language models including GPT-4, and AI-assisted coding; highlights the research gaps; and proposes a way forward.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
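To make the platform description concrete, here is a minimal PySpark sketch of the kind of workflow Databricks supports: read a table, assemble features, and fit an MLlib model. The file path and column names are hypothetical, and on Databricks itself a SparkSession named `spark` is already provided.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# On Databricks a SparkSession called `spark` already exists; this line matters only for local runs
spark = SparkSession.builder.appName("databricks-sketch").getOrCreate()

df = spark.read.parquet("/mnt/data/customers.parquet")  # hypothetical mounted path
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")  # hypothetical columns
model = LogisticRegression(labelCol="churned").fit(assembler.transform(df))
print(model.coefficients)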
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
Building a Logical Data Fabric using Data Virtualization (ASEAN) - Denodo
Watch full webinar here: https://bit.ly/3FF1ubd
In the recent Building the Unified Data Warehouse and Data Lake report by leading industry analysts TDWI, 64% of organizations stated that the objective of a unified data warehouse and data lake is to get more business value, and 84% of organizations polled felt that a unified approach to data warehouses and data lakes was either extremely or moderately important.
In this session, you will learn how your organization can apply a logical data fabric, and how the associated technologies of machine learning, artificial intelligence, and data virtualization can reduce time to value, increasing the overall business value of your data assets.
KEY TAKEAWAYS:
- How a Logical Data Fabric is the right approach to assist organizations to unify their data.
- The advanced features of a Logical Data Fabric that assist with the democratization of data, providing an agile and governed approach to business analytics and data science.
- How a Logical Data Fabric with Data Virtualization enhances your legacy data integration landscape to simplify data access and encourage self-service.
Architect’s Open-Source Guide for a Data Mesh Architecture - Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build.
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ram Dhakne - HostedbyConfluent
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ram Dhakne | Current 2022
A well-architected data lakehouse provides an open data platform that combines streaming with data warehousing, data engineering, data science and ML. This opens a world beyond streaming to solving business problems in real-time with analytics and AI. See how companies like Albertsons have used Databricks and Confluent together to combine Kafka streaming with Databricks for their digital transformation.
In this talk, you will learn:
- The built-in streaming capabilities of a lakehouse
- Best practices for integrating Kafka with Spark Structured Streaming (see the sketch after this list)
- How Albertsons architected their data platform for real-time data processing and real-time analytics
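As a companion to the integration bullet above, here is a minimal Spark Structured Streaming sketch that reads a Kafka topic and echoes the payload. The broker address, topic name and checkpoint path are hypothetical, and the spark-sql-kafka connector package must be on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Treat the Kafka topic as an unbounded table (broker and topic are hypothetical)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load())

# Kafka delivers keys and values as binary, so cast before downstream logic
parsed = events.select(col("value").cast("string").alias("payload"))

# The checkpoint location lets the query recover its position across restarts
query = (parsed.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/orders")
         .start())
query.awaitTermination()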
Sponsored by Data Transformed, the KNIME Meetup was a big success. Please find the slides for Dan's, Tom's, Anand's and Chhitesh's presentations.
Agenda:
Registration & Networking
Keynote – Dan Cox, CEO of Data Transformed
KNIME & Harvest Analytics – Tom Park
Office of State Revenue Case Study – Anand Antony
Using Spark with KNIME – Chhitesh Shrestha
Networking & Drinks
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra... - Big Data Week
We are all aware of the challenges enterprises are having with growing data and siloed data stores. The business is not able to make reliable decisions with untrusted data, and on top of that, it doesn't have access to all data within and outside the enterprise to stay ahead of the competition and make key business decisions.
This session will take a deep dive into the challenges businesses face today and how to build a Modern Data Architecture using emerging technologies such as Hadoop, Spark, NoSQL data stores, MPP data stores, and scalable, cost-effective cloud solutions such as AWS, Azure and Bigstep.
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C... - Caserta
Joe Caserta explores the world of analytics, tech, and AI to paint a picture of where business is headed. This presentation is from the CDAO Exchange in Miami 2018.
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016 - Caserta
Caserta Concepts Founder and President, Joe Caserta, gave this presentation at Strata + Hadoop World 2016 in New York, NY. His session covers path-to-purchase analytics using a data lake and Spark.
For more information, visit http://casertaconcepts.com/
Data Summit Connect Fall 2020 - Rise of DataOps - Ryan Gross
Data governance teams attempt to apply manual control at various points for consistency and quality of the data. By thinking of our machine learning data pipelines as compilers that convert data into executable functions and leveraging data version control, data governance and engineering teams can engineer the data together, filing bugs against data versions, applying quality control checks to the data compilers, and other activities. This talk illustrates how innovations are poised to drive process and cultural changes to data governance, leading to order-of-magnitude improvements.
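To illustrate the idea of filing bugs against data versions and gating pipelines with quality checks, here is a small library-free Python sketch; every name in it is a hypothetical stand-in, not part of any particular governance tool.

def file_bug(data_version, failed_checks):
    # Stand-in for a real bug-tracker integration
    print(f"bug filed against data version {data_version}: {failed_checks}")

def quality_gate(data_version, rows, checks):
    # Run every named check; reject the data version if any fail
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        file_bug(data_version, failures)
        raise ValueError(f"data version {data_version} rejected")
    return rows

rows = [{"id": 1, "amount": 9.99}]
quality_gate("v2024-05-01", rows, {
    "non_empty": lambda r: len(r) > 0,
    "amounts_positive": lambda r: all(x["amount"] > 0 for x in r),
})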
Why Everything You Know About Big Data Is a Lie - Sunil Ranka
As a big data technologist, you can bet that you have heard it all: every crazy claim, myth, and outright lie about what big data is and what it isn't that you can imagine, and probably a few that you can't. If your company has a big data initiative or is considering one, you should be aware of these false statements and the reasons why they are wrong.
Big Data LDN 2018: THE PATH TO ENTERPRISE AI: TALES FROM THE FIELD - Matt Stubbs
Date: 14th November 2018
Location: AI Lab Theatre
Time: 11:50 - 12:20
Speaker: Romain Fouache
Organisation: Dataiku
About: Enterprise AI is a target state where every business process is AI-augmented and every employee is an AI beneficiary. But is that really attainable? And, if so, what is the path to get there? In this talk, Kurt Muehmel, VP Sales Engineering at Dataiku, will share learnings from the field, describing how companies of different sizes and across different sectors have begun this journey. Some are farther along than others, and by making the right decisions now and avoiding stumbling blocks, you can supercharge your quest toward this AI-fuelled future.
In the age of IoT, almost everyone is talking about data lakes. For the most part, we all agree on the value data lakes deliver, but beyond this conceptual agreement, there are still many practical questions that need answers. The key to success comes down to how data lakes are implemented and managed.
Chuck Yarbrough outlines the five keys for creating a data lake, along with strategies for defining, ingesting, governing, managing, and analyzing the data lake in ways that will enable transformative benefits in IoT and other use cases. This session will show how real-world data lake implementations are changing the world. Chuck focuses on automation of the data lake, from ingesting data to managing metadata at scale and applying machine learning to drive significant results. Along the way, Chuck explores tools and procedures that help create a well-organized, well-governed, and well-managed data lake, without the risk of creating a dreaded data swamp. You'll leave armed with the five keys to successfully creating and managing a killer data lake.
Data Virtualization, a Strategic IT Investment to Build Modern Enterprise Dat... - Denodo
This content was presented during the Smart Data Summit Dubai 2015 in the UAE on May 25, 2015, by Jesus Barrasa, Senior Solutions Architect at Denodo Technologies.
In the era of Big Data, IoT, Cloud and Social Media, Information Architects are forced to rethink how to tackle data management and integration in the enterprise. Traditional approaches based on data replication and rigid information models lack the flexibility to deal with this new hybrid reality. New data sources and an increasing variety of consuming applications, like mobile apps and SaaS, add more complexity to the problem of delivering the right data, in the right format, and at the right time to the business. Data Virtualization emerges in this new scenario as the key enabler of agile, maintainable and future-proof data architectures.
Implement an efficient data governance and security strategy with ... - Denodo
Watch full webinar here: https://bit.ly/3lSwLyU
In the era of exploding information spread across different sources, data governance is a key component for guaranteeing the availability, usability, integrity, and security of information. Likewise, the set of processes, roles, and policies it defines allows organizations to reach their objectives while ensuring the efficient use of their data.
Data virtualization is one of the strategic tools for implementing and optimizing data governance. This technology allows companies to create a 360º view of their data and establish security controls and access policies across the entire infrastructure, regardless of format or location. In this way, it brings together multiple data sources, makes them accessible from a single layer, and provides traceability capabilities to monitor changes in the data.
Join this webinar to learn:
- How to accelerate the integration of data from fragmented sources across internal and external systems and obtain a comprehensive view of information.
- How to enable a single, protected data access layer across the entire company.
- How data virtualization provides the pillars for complying with current data protection regulations through data auditing, cataloging, and security.
Against the backdrop of Big Data, the Chief Data Officer, by any name, is emerging as the central player in the business of data, including cybersecurity. The MITCDOIQ Symposium explored the developing landscape, from local organizational issues to global challenges, through case studies from industry, academic, government and healthcare leaders.
Joe Caserta, president at Caserta Concepts, presented "Big Data's Impact on the Enterprise" at the MITCDOIQ Symposium.
Presentation Abstract: Organizations are challenged with managing an unprecedented volume of structured and unstructured data coming into the enterprise from a variety of verified and unverified sources. With that is the urgency to rapidly maximize value while also maintaining high data quality.
Today we start with some history and the components of data governance and information quality necessary for successful solutions. I then bring it all to life with 2 client success stories, one in healthcare and the other in banking and financial services. These case histories illustrate how accurate, complete, consistent and reliable data results in a competitive advantage and enhanced end-user and customer satisfaction.
To learn more, visit www.casertaconcepts.com
This presentation will discuss the stories of 3 companies that span different industries; what challenges they faced and how cloud analytics solved for them; what technologies were implemented to solve the challenges; and how they were able to benefit from their new cloud analytics environments.
The objectives of this session include:
• Detail and explain the key benefits and advantages of moving BI and analytics workloads to the cloud, and why companies shouldn’t wait any longer to make their move.
• Compare the different analytics cloud options companies have, and the pros and cons of each.
• Describe some of the challenges companies may face when moving their analytics to the cloud, and what they need to prepare for.
• Provide the case studies of three companies, what issues they were solving for, what technologies they implemented and why, and how they benefited from their new solutions.
• Learn what to look for when considering a partner and trusted advisor to assist with an analytics cloud migration.
ADV Slides: How to Improve Your Analytic Data Architecture Maturity - DATAVERSITY
Many organizations are immature when it comes to data use. The answer lies in delivering a greater level of insight from data, straight to the point of need. Enter: machine learning.
In this webinar, William will look at categories of organizational response to the challenge across strategy, architecture, modeling, processes, and ethics. Machine learning maturity levels tend to move in harmony across these categories. As a general principle of maturity models, you can’t skip levels in any category, nor can you advance in one category well beyond the others.
Vis-à-vis ML, attaining and retaining momentum up the model is paramount for success. You will ascend the model through concerted efforts delivering business wins utilizing progressive elements of the model, and thereby increasing your machine learning maturity. The model will evolve. No plateaus are comfortable for long.
With ML maturity markers, sequencing, and tactics, this webinar provides a plan for how to build analytic Data Architecture maturity in your organization.
This presentation will describe the analytics-to-cloud migration initiative underway at Fannie Mae. The goal of this effort is threefold: (1) build a sustainable process for data lake hydration on the cloud, (2) modernize the Fannie Mae enterprise data warehouse infrastructure, and (3) retire Netezza.
Fannie Mae partnered with Impetus for modernization of its Netezza legacy analytics platform. This involved the use of the Impetus Workload Migration solution—a sophisticated translation engine that automated the migration of their complex Netezza stored procedures, shell and scheduler scripts to Apache Spark compatible scripts. This delivered substantial savings in time, effort and cost, while reducing overall project risk.
Included in the scope of the automation project was an automated assessment capability to perform detailed profiling of the current workloads. The output from the assessment stage was a data-driven offloading blueprint and roadmap for which workloads to migrate. A hybrid cloud-based big data solution was designed based on that. In addition to fulfilling the essential requirement of historical (and incremental) data migration and automated logic translation, the solution also recommends optimal storage formats for the data in the cloud, performing SCD Type 1 and Type 2 for mission-critical parameters and reloading the transformed data back for reporting/analytical consumption.
This will include the following topics:
i. Fannie Mae analytics overview
ii. Why cloud migration for analytics?
iii. Approach, major challenges, lessons learned
Speaker
Kevin Bates, Vice President for Enterprise Data Strategy Execution, Fannie Mae
To analyse why operationalizing AI is so challenging, it’s important to understand the full lifecycle of an AI project, and identify the stakeholders involved.
Through 2023, Gartner estimates that 50% of IT leaders will struggle to move their AI projects past proof of concept (POC) to a production level of maturity.
To reduce this high failure rate, organisations need to build the right roles for AI success. In many organisations, data scientists are still wearing too many hats due to a dearth of talent across other roles.
This session will highlight how, in order to successfully operationalise and scale AI POCs, organisations must build diverse AI roles and skills within a collaborative structure.
Productionising Machine Learning to automate the enterprise. Conference research question: How can you pinpoint which core business processes to transform with increased automation and streamline daily workflows to boost in-house efficiencies?
Machine learning - What they don’t teach you on Coursera ODSC London 2016 - Harvinder Atwal
I’ll show some examples of live models at MoneySuperMarket. However, the main theme will be that there is far more to successful implementation of Machine Learning than just creating good algorithms. There needs to be just as much effort, if not more, put into selling the benefits to the business, working with developers and engineers to put the model into production, building testing into the process, and ongoing maintenance of the solution.
Case Study Interactive: How To Work With Structured And Unstructured Data To Increase Customer Acquisition And Reduce Churn With Relevant Communication
How can analytics improve your attribution model accuracy to highlight and transform your most successful marketing channels?
How can you introduce predictive analytics to increase your customer segmentation competency?
How can insights from consumer data help you to predict customer lifetime value and focus on your top customers?
How can split testing consumer data help to improve your customer offering and boost retention rates?
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to keep growing and supply to evolve, facilitated by institutional investment rotating out of offices and into work-from-home (“WFH”) plays, alongside the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
DataOps: Nine steps to transform your data science impact (Strata London, May 2018)
1. DataOps: 9 steps to transform your data science impact (21-24 May 2018)
2. // Harvinder Atwal
{"about" : "me"}
{"Current" : "Head of Data Strategy and Advanced Analytics"} // MoneySuperMarket
{"previous" : "Insight Director, Tesco Clubcard"} // dunnhumby
{"previous" : "Senior Manager, Customer Strategy and Insight"} // Lloyds Banking Group
{"previous" : "Senior Operational Research Analyst"} // British Airways
// Web: @harvindersatwal, @gmail.com
3. £2B SAVINGS: 2017 estimated total of UK savings
1993: We started life as Mortgage 2000
24.9M: Adults choose to share their data with us
24 million: Average monthly users (2017)
£323M: Revenue (2017)
989: Product providers
4. Sometimes it’s simple things that work really well: from one version to 1400+ customised variants of the newsletter, a +19% increase in Revenue Per Send.
5. Sometimes it’s more complicated solutions. Same message, but language tailored to the customer’s Financial Attitude:
“Worried about whether you can afford a personal loan? With UK interest rates at record lows, it’s worth checking to see how reasonable the cost could be. Whether you need to borrow to buy something, or you want to bring your existing debts under one roof, have a look at these competitive deals we’ve assembled. Thanks to our Smart Search tool, you can get an idea of the loans you’re likely to be accepted for before you proceed with your application.”
6. Only 22% of companies are currently seeing a significant return from data science expenditures*
*Obligatory conference presentation quote from Gartner/Forrester/McKinsey Consulting. Sorry.
12. Multiple challenges in the process of turning data into value on existing infrastructure:
Business problem → Evaluate available data → Request data access from IT → Request compute resources from IT → Negotiate with IT for requested resources → Wait for resources to be provisioned → Install languages and tools → Configure connectivity, access and security → RAM/CPU availability, scaling, monitoring → Request network config → Change request to install another package → Model building → Compose PowerPoint to share results → Edit Confluence to document work → Negotiate with business stakeholder on deployment timeline → Wait for Data Engineering to implement the model → Test newly implemented model to ensure valid results → Request modifications to model due to unexpected results → Release model to production and schedule → Document release notes and deployment steps → Prepare for change management
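The rest of the deck argues for automating this chain. As a purely illustrative, library-free sketch (every function here is a hypothetical stand-in for a real stage), the same lifecycle expressed as a pipeline that runs end to end without manual hand-offs might look like:

from datetime import datetime, timezone

def ingest(ctx):
    ctx["rows"] = [{"age": 34, "income": 52000}]  # stand-in for a real extract
    return ctx

def validate(ctx):
    # Automated data checks replace the manual "test to ensure valid results" step
    assert all(r["income"] >= 0 for r in ctx["rows"]), "bad income value"
    return ctx

def train(ctx):
    ctx["model"] = "model-" + datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
    return ctx

def deploy(ctx):
    print(f"releasing {ctx['model']} to production")  # stand-in for a real release step
    return ctx

def run_pipeline(stages):
    ctx = {}
    for stage in stages:
        ctx = stage(ctx)  # any failure halts the run and surfaces immediately
    return ctx

run_pipeline([ingest, validate, train, deploy])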
18. Eliminate waste: LEAN THINKING
The Optimist: THE GLASS IS HALF FULL
The Pessimist: THE GLASS IS HALF EMPTY
The Lean Thinker: WHY IS THE GLASS TWICE AS BIG AS IT SHOULD BE?
19. Alignment of data science with the rest of the organisation and its goals
25. Your business already has a hypothesis for what creates value: it’s the Corporate Strategy and Objectives (that everyone is aligned behind). Actively avoid work on anything else.
26. Measurement of everything gives feedback on not just individual deliverables (fast loop) but also the organisation’s hypothesis of what adds value (slow loop)
Situational Awareness → Objectives (Themes) → Strategies (Initiatives) → Tactics (Epics) → Actions (Stories)
[Tree diagram: each Objective branches into several Strategies, each Strategy into several Tactics, and each Tactic into several Actions]
Corporate strategy is broken down into many options (Epics) for Agile delivery
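Not part of the slide, but to make the cascade concrete, here is a hypothetical Python sketch of the hierarchy; every name in it is invented for illustration.

# Hypothetical sketch of the strategy cascade as a tree: each Objective (Theme)
# branches into Strategies (Initiatives), then Tactics (Epics), then Actions (Stories).
strategy_cascade = {
    "objective": "Grow customer savings",  # Theme (invented example)
    "strategies": [
        {
            "initiative": "Personalise the newsletter",
            "epics": [
                {
                    "epic": "Segment customers by financial attitude",
                    "stories": [
                        "Build attitude segmentation model",
                        "A/B test tailored copy vs control",
                    ],
                }
            ],
        }
    ],
}

def count_stories(node: dict) -> int:
    """Fast-loop feedback starts at the leaves: count deliverable Stories."""
    return sum(
        len(epic["stories"])
        for strategy in node["strategies"]
        for epic in strategy["epics"]
    )

print(count_stories(strategy_cascade))  # -> 2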
27. We reduce batch sizes of work and have options to keep flow going
28. Collaboration is key
Shared buy-in from senior management
Organisational behaviour structured around the ideal data-journey model
Shared priorities
Shared trust in data
Shared rewards based on measured outcomes, not outputs
29. Plan → Test & Collect → Model → Embed → Roll Out → Feedback
Pilot test; collect data; build model and identify segments; adjust model to fit the organisation; re-engineer business processes to support segmented execution; train the organisation; create a fast feedback loop.
31. Shortened data cycles to be Agile
[Architecture diagram: Data Scientists own the path from Epic/Story and Data Product Strategy through to Customer Feedback & Iteration. Layers shown: Data Engineering (data sources, stream processing, ETL, DQM); DevOps/Infrastructure (compute instance, container service, distributed compute framework, orchestration and scaling); DB Management (cloud file storage, distributed file system, NoSQL DB, RDBMS, distributed SQL query engine); workbench tooling (data prep/exploration tools, coding workspace & language libraries, machine learning, data visualisation, interactive dashboards/web app development, version/deployment tool); outputs (output files, BI tools, interactive dashboards/web apps, APIs); cross-cutting concerns (knowledge management, security/identity access control, revision control, configuration management, project and data governance, scheduling, resource management/monitoring/auditing).]
34. DataOps is an independent approach to data analytics
Data analytics team moves at lightning speed using highly optimised tools and processes across the whole data lifecycle
Agile collaboration to break down silos and work on “The Right Things” that add value
Lean manufacturing-like focus on eliminating waste & bottlenecks, improving quality, monitoring and control
Iterative project management; continuous delivery; automated test and deployment; monitoring; self-serve; quality; governance; organisational alignment; ease of use; predictability; reproducibility; strategic objectives
38. Trust part 1: Make the “What you do to data” people in the organisation happy
Identity and access management; custom role permissions; audit trail logs; data loss prevention; encryption of data at rest; encryption of data in motion; resource monitoring; firewall rules; resource and object isolation; penetration testing; code encryption and backup; segregation of duties; authorisation protocols; data access and privacy policy; metadata management; data lineage tracking; data stewards and owners.
39. Trust part 2: Make the “What you do with data” people in the organisation happy
Data quality testing; transformation testing; end-user testing; ETL integration testing; metadata testing; data completeness testing; ETL regression testing; incremental ETL testing; reference data testing; ETL performance testing.
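None of these tests appear as code in the deck, but as a hedged sketch, a data completeness check might look like the following Python; the column names, thresholds and data are invented for illustration.

import pandas as pd

def check_completeness(df, required_columns, max_null_fraction=0.01):
    """Data completeness test: flag required columns that are missing
    or have more nulls than the agreed threshold."""
    failures = []
    for col in required_columns:
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif df[col].isna().mean() > max_null_fraction:
            failures.append(f"too many nulls in: {col}")
    return failures

# Invented example data: a loans extract with a null-heavy column.
loans = pd.DataFrame({"customer_id": [1, 2, 3], "apr": [5.9, None, None]})
print(check_completeness(loans, ["customer_id", "apr", "term_months"]))
# -> ['too many nulls in: apr', 'missing column: term_months']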
42. Continuous Integration: commit code regularly
Machine learning pipeline: Data Cleaning (dev branch → master), Feature Extraction (dev branch → master), Model Train (dev branch → master)
Feeds product development (e.g. app, website, marketing system, operational system, dashboard, etc.)
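The deck doesn’t show code, but one way to make a branch-per-stage workflow practical is to keep each pipeline stage a small, separately testable function, so a dev-branch change to one stage can be integrated without touching the others. A minimal Python sketch, with all names and data invented:

import pandas as pd

# Each stage mirrors a branch in the slide: small, pure, and testable on its own.

def clean(raw):
    """Data cleaning stage: drop rows with no target label."""
    return raw.dropna(subset=["label"])

def extract_features(df):
    """Feature extraction stage: derive model inputs from cleaned data."""
    out = df.copy()
    out["spend_per_visit"] = out["spend"] / out["visits"].clip(lower=1)
    return out

def train(features):
    """Model training stage: the stand-in 'model' is just a learned threshold."""
    return {"threshold": features["spend_per_visit"].median()}

raw = pd.DataFrame({"spend": [100, 50, 80], "visits": [4, 0, 2],
                    "label": [1, 0, None]})
model = train(extract_features(clean(raw)))
print(model)  # -> {'threshold': 37.5}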
44. Continuous Delivery and Beyond: Accelerating Deployment
Continuous Integration: Dev → Integration test → Application test (automated); Acceptance test → Production (manual)
Continuous Delivery: Dev → Integration test → Application test → Acceptance test (automated); Production (manual)
Continuous Deployment: Dev → Integration test → Application test → Acceptance test → Production (all automated)
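Not from the deck, but here is a tiny illustrative Python sketch of where the automated/manual boundary sits in each model; the stage names mirror the slide, and the mapping is my reading of the diagram.

# Illustrative only: where automation stops in each delivery model.
STAGES = ["dev", "integration test", "application test", "acceptance test", "production"]

LAST_AUTOMATED = {
    "continuous integration": "application test",
    "continuous delivery": "acceptance test",
    "continuous deployment": "production",
}

def run_pipeline(model):
    last_auto = STAGES.index(LAST_AUTOMATED[model])
    for i, stage in enumerate(STAGES):
        mode = "automated" if i <= last_auto else "manual"
        print(f"{model}: {stage} [{mode}]")

run_pipeline("continuous delivery")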
46. Chemistry is not about tubes
DataOps is not about tools
(but the right ones help)
47. Align your spine
Needs → Values → Principles → Practices → Tools
It all starts at Needs: why does this system exist in the first place?
Needs: “We are here to SATISFY THE NEED to help customers save money and the business to execute its strategy.”
Values: “We OPTIMISE for Speed, Accuracy, Experimentation/Feedback and Security.”
Principles: “We LEVERAGE Agile and Lean PRINCIPLES to change the system and make sure resources work on the right thing.” (How do you know which Principles you want to apply?)
Practices: “We DO Self-Service and DataOps to continuously create VALUE for the customer and business.” (How do you know that the Practices actively help the system?)
Tools: “We use _____ to get our work done.” (How do you know it is the best possible tool?)
Source: Kevin Trethewey, Danie Roux, Joanne Perold
48. Avoid building your own anything or being on the bleeding edge. Cost of Delay is high.
49. Data Scientists need a way to manage their projects end-to-end with self-service data AND ARCHITECTURE
[Repeats the slide 12 flow: Business Problem → Evaluate available data → … → Release model to production and schedule → Document release notes and deployment steps → Prepare for change management]
50. Modern serverless and managed infrastructure makes it easy to create data products: just bring code and data.
A single unified platform reduces data fragmentation, overcomes business silos and helps enforce consistent governance.
51. You can make the data supply chain more efficient by unifying data and tools in one platform
[Diagram elements: main source(s) of truth (core data, other data); ETL and extract/load into data warehouse(s) and an analytics platform (off-load); presentation/service layer(s); flatten/merge columns; source cubes on dimensions; reload data; data sharing; BI tools for descriptive and diagnostic analytics; analytical tools for predictive and prescriptive analytics; microservices.]
52. Data Science Platforms add further self-serve capabilities
Data access, prep and exploration: Jupyter, RStudio, Zeppelin, etc.
Automation and machine learning: run experiments, track and compare results
Delivery and model management: publish APIs, interactive web apps, schedule reports
Collaboration and version control: discover, discuss and build on existing work
Compute environment library: customised software stack
Compute grid: orchestrate hardware for development and deployment
Source: Domino Data Labs
61. #6 KEEP CALM AND BUILD TRUST IN DATA
Put effective data governance, security and testing in place
62. #7 Invest in tools and processes to reduce bottlenecks and increase quality
Managed infrastructure and serverless cloud, automation and Data Science Platforms
64. #9 Organise around the ideal data journey instead of teams
Fewer roles, more end-to-end ownership, less friction
Data journey: Acquire → Process → Store → Manage → Share → Use
Roles spanning the journey: Data Engineering, Data Scientists, Data Analysts, Business Stakeholders
67. The DataOps Data Science Factory
[Diagram: Data Product Strategy generates Epics and Stories; Data flows from the rest of the business into Analytics and out to the Customer, with feedback closing the loop. Supporting practices: Agile collaboration, data governance, automated testing, value measurement, version control, configuration management, self-serve infrastructure, automation, continuous integration.]
var current: {
companyName : "MoneySuperMarket",
position : "Head of Data Strategy"
+ " and Advanced Analytics"
};
var previous1: {
companyName : "Dunnhumby",
position : "Insight Director,"
+ " Tesco Clubcard"
};
var previous2: {
companyName : "Lloyds Banking Group",
position : "Senior Manager"
};
var previous3: {
companyName : "British Airways",
position : "Senior Operational Research Analyst"
};
{"about" : "me"}
var username = "harvindersatwal";
var linkedIn = "/in/" + username;
var twitter = "@" + username;
var email = username + "@gmail.com";
Editor's Notes
The average colleague doesn't want to be a data person any more than I want to be an accountant. You have to hire like Google: data people who happen to make good product owners.
Star Wars is not a metaphor for good vs evil but Waterfall vs Agile.
Too much wastage in the process and hard to impact customers directly
The key to adding value is to adapt and borrow principles from Agile software development, starting with alignment of data science with the rest of the organisation and its goals.
Work only on the organisation’s biggest strategic objectives – the ones stakeholders have aligned behind. Objectives the business hypothesises will add the most value.
We don’t know upfront what is going to work.
DataOps is an automated, process-oriented methodology, used by big data teams, to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics.[1] DataOps applies to the entire data lifecycle[2] from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations.[3] From a process and methodology perspective, DataOps applies Agile software development, DevOps[3] and the statistical process control used in lean manufacturing, to data analytics.[4]
In DataOps, development of new analytics is streamlined using Agile software development, an iterative project management methodology that replaces the traditional Waterfall sequential methodology. Studies show that software development projects complete significantly faster and with far fewer defects when Agile Development is used. The Agile methodology is particularly effective in environments where requirements are quickly evolving — a situation well known to data analytics professionals.[5]
DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of analytics. This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics.[3]
Like lean manufacturing, DataOps utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data analytics team can be notified through an automated alert.[6]
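As a rough illustration of the SPC idea (not any particular product’s implementation), a Python check might derive control limits from history and alert on anomalies; the metric, data and thresholds below are invented.

import statistics

def control_limits(history, sigmas=3.0):
    """Classic SPC-style limits: mean +/- N standard deviations of past values."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return mean - sigmas * sd, mean + sigmas * sd

def check_batch(metric, history):
    lower, upper = control_limits(history)
    if not lower <= metric <= upper:
        # Stand-in for an automated alert (email, chat, paging, etc.)
        print(f"ALERT: row count {metric} outside control limits ({lower:.0f}, {upper:.0f})")

# Invented example: daily row counts of an ingested feed.
past_row_counts = [10120.0, 9980.0, 10050.0, 10010.0, 9940.0, 10100.0]
check_batch(4200.0, past_row_counts)  # triggers the alert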
DataOps seeks to provide the tools, processes, and organizational structures to cope with this significant increase in data.[7] Automation streamlines the daily demands of managing large integrated databases, freeing the data team to develop new analytics in a more efficient and effective way.[9]
DataOps embraces the need to manage many sources of data, numerous data pipelines and a wide variety of transformations.[3] DataOps seeks to increase velocity, reliability, and quality of data analytics.[10] It emphasizes communication, collaboration, integration, automation, measurement and cooperation between data scientists, analysts, data/ETL(extract, transform, load) engineers, information technology (IT), and quality assurance/governance.[11] It aims to help organizations rapidly produce insight, turn that insight into operational tools, and continuously improve analytic operations and performance.[11]
This is sometimes really hard for Data Scientists who experiment with data on laptops to accept.
Add Data and Logic Tests
Version control is the foundation upon which a lot of delivery is built.
At a minimum, reviewers of a publication and future researchers should be able to: 1) download all data and software used to generate the results; 2) run tests and review source code to verify correctness; 3) run a build process to execute the computation.
Version control makes it possible to maintain an archived version of the code used to produce a particular result. Examples include Git and Subversion.
Automated build systems document the high-level structure of a computation: which programs process which data, what outputs they produce, etc. Examples include Make and Ant.
Configuration management tools document the details of the computational environment where the result was produced, including the programming languages, libraries, and system-level software the results depend on.
Examples include package managers like Conda that document a set of packages,
containers like Docker that also document system software,
and virtual machines that actually contain the entire environment needed to run a computation.
In an enterprise setting where multiple data scientists could be working on a single project, the first step to doing data science work that scales is implementing version control, whether that’s GitHub, GitLab, Bitbucket, or another solution. Once your team has the ability to track code changes, the next step is to create a process in which they regularly commit their code to the master branch of your repository.
During development, automated tests make programs more likely to be correct; they also tend to improve code quality. During review, they provide evidence of correctness, and for future researchers they provide what is often the most useful form of documentation. Examples include unittest and nose for Python and JUnit for Java.
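For instance, a minimal unittest sketch of such a data/logic test; the function under test is invented for illustration.

import unittest

def normalise_postcode(raw):
    """Invented example of logic worth pinning down with tests."""
    return raw.strip().upper().replace(" ", "")

class TestNormalisePostcode(unittest.TestCase):
    def test_strips_whitespace_and_uppercases(self):
        self.assertEqual(normalise_postcode(" ec1a 1bb "), "EC1A1BB")

    def test_idempotent(self):
        once = normalise_postcode("EC1A1BB")
        self.assertEqual(normalise_postcode(once), once)

if __name__ == "__main__":
    unittest.main()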
You can move beyond Continuous Integration to make deployment even faster.
Traditionally, data science deployment has been a multi-step process that puts the onus on engineering: Engineers would refactor, test, and automate or schedule a data scientist’s model before slowly rolling it out, sometimes months after it was originally built.
Developers that embrace continuous delivery are pushing new application features or changes into production quickly, sometimes with the click of a button.
Increasingly, cloud and data science platforms are filling this void with features such as the ability to deploy models as APIs or schedule code runs which means that as soon as new development passes your tests it can be deployed into production with no dependencies on other teams.
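As a hedged illustration of the model-as-an-API idea (plain Flask here, not any specific platform’s deployment feature; the model and threshold are stand-ins):

from flask import Flask, jsonify, request

app = Flask(__name__)

def score(features):
    """Stand-in for a trained model; replace with a real predict call."""
    return 1.0 if features.get("spend_per_visit", 0) > 37.5 else 0.0

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json(force=True)
    return jsonify({"prediction": score(features)})

if __name__ == "__main__":
    app.run(port=8080)
    # e.g. curl -X POST localhost:8080/predict -d '{"spend_per_visit": 50}'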
For IT teams managing the systems that support models in production and data science environments, the ability to monitor and add resources as data science work expands — while maintaining system availability — is essential.
But that’s just one application. For IT teams managing the resources needed for every deployed model and data science environment across an entire company, a data science platform that offers cluster management features and the ability for IT to dictate the size of the resources made available to data science teams can go a long way toward achieving continuous operations.
Which brings me on to tools
Just as chemistry is not about the tubes but the process of experimentation. DataOps is not tied to a particular technology, architecture, tool, language or framework.
However, some tools are better at supporting DataOps collaboration, orchestration, agility, quality, security, access and ease of use.
When choosing tools, it is best to never start with the tools themselves.
I like to use the spine model by Trethewey, Roux and Perold.
So to decide on the tool you need to understand the practices you employ; to understand what practices to employ you need to define your principles; to define your principles you need to know your values; and to know your values you need to start with the needs you’re trying to fulfil.
We have a set of clear DataOps Practices we want to employ so we have a clear idea of what tools will be fit for purpose.
http://spinemodel.info/explanation/introduction
But first a bit of advice. You should avoid building your own anything or being on the bleeding edge.
Any technology or tool that is really useful will end up being refined or commoditised and turned into a service. Let someone else find the bugs, be the beta tester or end up in a cul-de-sac.
The other factor to take into account is Cost of Delay.
It’s nearly always ignored in business cases. On paper it may be cheaper to build your own solution. However, the months, or years, you take to do that are months, or years, you’re not benefiting from the solution and are handing to your competitors. And it always takes twice as long to build your own solutions, even after you’ve factored in that it’s going to take you twice as long as you think.
Because one of our principles is that we want to make data cycles shorter and shorter it’s important Data Scientists can self-serve not just the data but also the infrastructure, tools and packages
Modern Cloud architecture makes it very easy to create data products rapidly.
Specifically, the move from Infrastructure and Platform as a Service to Software as a Service and Serverless architecture.
That means you have no hardware or software to configure; you just bring your data and code, and all the scaling and optimisation is done for you.
The other advantage is you can use the same tools for dev and production.
You can also use the same data in dev and production, as in the SaaS or Serverless world there’s no need for separation of environments.
We’re so convinced of the benefits we’re actually moving our Data Analytics stack out of AWS onto Google Cloud Platform.
Here’s an example of GCP reference architecture for big data, which isn’t a million miles from our architecture. There’s absolutely no infrastructure to manage within the environment.
The other thing you can do is use the cloud as a centralised platform helping to break down organisational barriers and makes it easier to enforce governance rules.
Modern cloud takes care of the underlying tools but you can add further levels of abstraction and self-service to the compute infrastructure and data pipeline.
Data Science Platforms provide tools that enable teams to work faster and deploy DataOps methodology very easily from choosing the computer infrastructure and environments to run their code on, to automated version control, collaboration tools and one-click deployment to APIs and Dashboards.
The requirements for this type of platform haven’t gone unnoticed; these are just some of the vendors we looked at before settling on Domino Data Labs.
Each has their strengths and weaknesses, so which one is best depends on your use-cases.
There’s another positive side-effect of going down the DataOps route.
You require fewer roles due to self-service.
There’s no need for specialist DevOps, Infrastructure Engineers, Sys Admins or DBAs.
This reduces friction, hand-offs and bottlenecks.
You’re left with just four key roles: Data Scientists, Data Engineers, Data Analysts (a much under-invested group, as everyone wants to be a Data Scientist) and the Line of Business (the stakeholders, and also those who will help integrate your Data Product into other applications).
Worrying About Artificial Intelligence when you can’t even produce a Sales report is not going to get you very far.
You need to worry about being able to action data instead in alignment with the organisation’s strategy and goals.
80% of the battle is knowing what not to work on.
You should not work on projects but products.
Products are in constant use by consumers and have direct customer and business benefit. The benefit scales according to the number of customers who use them. A data product may be a machine learning model, a segmentation, a recommendation engine, a dashboard. They may be integrated into other products. They have an owner, and you get feedback that helps you improve them through iteration.
They are not one-off adhoc pieces of insight that get filed away.
Velocity is th
We need to solve all the problems with Data Science today:
Hamster Wheel Analytics – Doing busywork for the organisation that makes us feel good because we’re putting in a lot of effort and clients appreciate it, but it is never going to move the needle.
The work we do that’s not repeatable because it was never documented
The aimless crash and burn – Where we explore data to find magical insights without a clear objective or, worse, ones the rest of the business has no interest in.
The Roadblock – Work we do that has no route to the customers because it is blocked by corporate silos, IT, Security, lack of infrastructure, tools or willingness to integrate into an end product and remains on a laptop.
Work we do that does make a customer impact which we can’t measure because the feedback loop was never closed.
Instead we can move to the DataOps world – what I like to call the Data Science Factory.
It starts with alignment with the rest of the business’ strategy to create options for Agile Delivery and collaboration to deliver them.
Rapid delivery of Data Products because the governance, trust in data, self-service and automation are in place.
A path to the end-consumer and feedback to measure value for the next iteration.