Data science governance : what and how

www.kensu.io
DATA SCIENCE GOVERNANCE
What and How

ANDY PETRELLA
- CEO & Founder -
Mathematics & Computer Science MSc.
Creator of Spark Notebook
XAVIER TORDOIR
- CSO & Founder -
Physics PhD.
Genomics & Quantitative Finance
KENSU & ME
Started in 2015 as Data Fellas, focused on Data Science consulting
Team of 10 engineers and scientists
Shifted toward a product company in 2016, renamed Kensu
Focus on Data Science Governance
Accelerated by Alchemist Accelerator in San Francisco and The Faktory in Belgium

TOPICS
1. Some thoughts on “Data Science”
2. Data Science Governance: What
3. Data Science Governance: How
4. GDPR: Accountability principle and transparency

SOME THOUGHTS ON “DATA SCIENCE”

MACHINE LEARNING
Pioneers in the 1950s
AI winter in the 1970s due to pessimism
Resurgence in the 1980s
Machine learning (and related techniques) has been in use since the 1990s (esp. SVM and RNN)
Deep learning sees widespread commercial use in the 2000s
Machine learning receives great publicity (read: buzz) in the 2010s
ref: https://en.wikipedia.org/wiki/Timeline_of_machine_learning

DATA SCIENCE: +ENGINEERING
Claim: “Data Scientist” coined by DJ Patil in 2008.
Pretty much when machine learning became part of software
In a way, when we added “engineering” to the mix
Also, engineering is even more prominent with Big Data distributed computing

DATA SCIENCE: +EXPERIMENTATION
So much data available
So many tools, libraries, frameworks, …
So many things we can try
We have distributed computing now, right? => Let’s try everything
Discover new insights (and potentially new businesses)

DATA SCIENCE: RECAP
Maths: stats, machine learning and so on
Engineering: ETL, databases, computing frameworks, software, platforms, …
Creativity: “From business intelligence to intelligent business” - Michael Fergusson
Data Science is an umbrella over all activities on data

DATA SCIENCE GOVERNANCE: WHAT

DATA PIPELINE
A data pipeline connects activities on data, potentially involving several technologies.
A pipeline is generally thought of as an end-to-end processing line that solves one problem.
But parts of pipelines are reused to save computation, storage, time, …
Thus the interdependency between pipeline segments grows with the number of initiatives.

GOAL: MAKE DECISIONS
Data pipelines, connected together, aren’t created for the beauty of it.
The ultimate goal is always to make decisions.
Decisions are generally made by, or linked to, humans with responsibilities
(even for self-driving cars, when something goes wrong).
Given that pipelines are cut-and-wired, interleaved, …
how can we not be anxious when deploying the last piece used by the decision maker?

SOURCES OF ANXIETY
What if:
• one of the datasets used in the process suddenly shows different patterns?
• one of the tools, projects or similar is modified upstream?
• the insights deviate from reality?
• …
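
The first "what if" can be pictured as a check that compares each new batch of data with a reference batch. The sketch below is only an illustration: the function name `mean_shift_alert`, the z-score rule and the threshold of 3 standard errors are assumptions made for this example, not something prescribed by the talk.

```python
from math import sqrt
from statistics import mean, stdev

def mean_shift_alert(reference, batch, threshold=3.0):
    """Return True when the batch mean lies more than `threshold`
    standard errors away from the reference mean (hypothetical rule)."""
    ref_mean = mean(reference)
    ref_std = stdev(reference)                # sample standard deviation
    std_error = ref_std / sqrt(len(batch))
    if std_error == 0:
        # Constant reference: any difference in mean counts as a shift.
        return mean(batch) != ref_mean
    z = abs(mean(batch) - ref_mean) / std_error
    return z > threshold

# Illustrative values only.
reference = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]
same_pattern = [10.0, 10.2, 9.9, 10.1]
new_pattern = [14.9, 15.2, 15.1, 14.8]
```

A check like this only covers one statistic of one column; it is meant to show where monitoring of production data could start, not to be a complete drift detector.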

DEBUGGING?
To reduce the anxiety or, actually, the risks, we need ways to debug.
In pure engineering, we have unit, functional, integration tests, … but
what do we do when the problems come from the data themselves?
We can’t generate all cases of data variations, right?
How to debug?
Without the big picture, we may try to optimise a model for weeks for nothing.

Data Governance: control that data meets precise standards, which involves monitoring against production data.
Data Science Governance: control that data activity meets precise standards, which involves monitoring against production data activity.
A Data Activity is described by at least: technologies, users, systems, data, processing.
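
The description of a Data Activity above can be pictured as a small record type. This is a minimal sketch, not Kensu's actual data model: the class name `DataActivity`, its field names, the helper `who_touched` and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataActivity:
    """One observed activity on data (all field names are hypothetical)."""
    technology: str      # engine or library that ran the process
    user: str            # who launched it
    system: str          # where it ran
    inputs: List[str]    # data read
    outputs: List[str]   # data produced
    processing: str      # short description of what was done

activities = [
    DataActivity("spark", "alice", "cluster-a",
                 ["raw/customers.csv"], ["clean/customers.parquet"],
                 "deduplicate and normalise"),
    DataActivity("python", "bob", "laptop-7",
                 ["clean/customers.parquet"], ["models/churn.pkl"],
                 "train churn model"),
]

def who_touched(data_name, activities):
    """List (user, processing, system) for every activity on one data source."""
    return [(a.user, a.processing, a.system)
            for a in activities
            if data_name in a.inputs or data_name in a.outputs]
```

Keeping the record independent of any one technology is the point of the sketch: the same fields can describe a Spark job, a notebook cell or a SQL query.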

GOVERNING DATA SCIENCE
Who does what on which data, and where is it done?
What is the impact of a process on the global system?
What are the performance metrics (quality, execution, …) of the processes?

CONTINUOUS INTEGRATION FOR DATA SCIENCE
Data Scientists/Citizens have a view of all the activities applied to
the original sources used in their own processes.
They also have control over their own results in production.
They have the opportunity to analyse and debug a pipeline
involving all activities:
• independently of the technologies
• involving several people in the enterprise

DATA SCIENCE GOVERNANCE: HOW

CHALLENGES
So many tools use data!
The number of processes is growing impressively.
We have to take care of the legacy…

GET THE DATA
As usual, we have to collect the right data to make the right decision.
First, run an assessment to create a high-level map of all the tools
involved in a company.
For each tool, do whatever it takes to collect information about the
activities it creates.
This information includes metadata, lineage, statistics, accuracy measures, …
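
As an illustration of the kind of information that could be gathered for one dataset, the sketch below computes a schema, a row count and simple per-column statistics from rows held in memory. The function name `collect_dataset_info` and the shape of its result are assumptions made for this example; a real collector would hook into each tool rather than read rows directly.

```python
from statistics import mean

def collect_dataset_info(name, rows):
    """Summarise a dataset given as a list of dicts (hypothetical format)."""
    columns = sorted({key for row in rows for key in row})
    stats = {}
    for col in columns:
        # Keep only the values that are actually present.
        values = [row[col] for row in rows if row.get(col) is not None]
        numeric = [v for v in values if isinstance(v, (int, float))]
        stats[col] = {
            "nulls": len(rows) - len(values),
            "distinct": len(set(values)),
        }
        # Add numeric statistics only when every present value is a number.
        if numeric and len(numeric) == len(values):
            stats[col].update(min=min(numeric), max=max(numeric),
                              mean=mean(numeric))
    return {"name": name, "rows": len(rows),
            "schema": columns, "stats": stats}

# Illustrative rows only.
info = collect_dataset_info("clean/customers", [
    {"id": 1, "age": 34, "country": "BE"},
    {"id": 2, "age": None, "country": "US"},
    {"id": 3, "age": 52, "country": "BE"},
])
```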

CONNECT THE DATA
Data Science Governance needs the global picture.
To get it, we need to connect all the data that can be collected,
so that it is possible to create a map of all ongoing processes.
This map tracks all data and their descendants.
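
Connecting collected activities can be pictured as building a directed graph in which each data source points to the data derived from it. The sketch below is one possible way to do that with plain dictionaries; the `(inputs, outputs)` pair format, the function names and all dataset names are hypothetical.

```python
from collections import defaultdict

def build_lineage(activities):
    """Map each data source to the set of data directly derived from it."""
    children = defaultdict(set)
    for inputs, outputs in activities:
        for src in inputs:
            children[src].update(outputs)
    return children

def descendants(children, source):
    """All data derived, directly or indirectly, from `source`."""
    seen, stack = set(), [source]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Activities reported by different tools, as (inputs, outputs) pairs.
activities = [
    (["raw/orders"], ["clean/orders"]),
    (["clean/orders", "clean/customers"], ["features/basket"]),
    (["features/basket"], ["models/recommender"]),
]
lineage = build_lineage(activities)
```

Because the edges come from matching the outputs of one activity with the inputs of another, activities reported by unrelated tools end up in the same map.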

USE THE DATA
This is where the fun part starts… the map of data activities is an
amazing source of information
Here are a few things you can think of when using this kind of data:
• impact analysis
• dependency analysis
• optimisation
• recommendation
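
The first two uses in the list can be sketched as traversals of the same map in opposite directions: impact analysis walks downstream from a data source, dependency analysis walks upstream from a result. The edge list and the function names below are hypothetical, chosen only to illustrate the idea.

```python
def reachable(edges, start, downstream=True):
    """Walk lineage edges (parent, child) from `start` in one direction."""
    neighbours = {}
    for parent, child in edges:
        src, dst = (parent, child) if downstream else (child, parent)
        neighbours.setdefault(src, set()).add(dst)
    seen, stack = set(), [start]
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def impact(edges, data):
    """Everything that may change if `data` changes."""
    return reachable(edges, data, downstream=True)

def dependencies(edges, data):
    """Everything `data` was derived from."""
    return reachable(edges, data, downstream=False)

# Illustrative lineage edges only.
edges = [
    ("raw/orders", "clean/orders"),
    ("clean/orders", "features/basket"),
    ("clean/customers", "features/basket"),
    ("features/basket", "reports/weekly"),
]
```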

GDPR
General Data Protection Regulation

ACCOUNTABILITY PRINCIPLE
Implement appropriate technical and organisational measures that
ensure and demonstrate that you comply. This may include internal
data protection policies such as staff training, internal audits of
processing activities, and reviews of internal HR policies.

TRANSPARENCY
As well as your obligation to provide comprehensive, clear and
transparent privacy policies, if your organisation has more than 250
employees, you must maintain additional internal records of your
processing activities.

ACCOUNTABILITY: DATA SCIENCE GOVERNANCE
To govern data science, we have to:
• collect activities
• connect activities
With this information we can reliably and automatically create the
process registry
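
As a rough illustration of deriving a process registry from collected and connected activities, the sketch below groups activity records by process and notes who ran it, on which data, and whether personal data is involved. The record fields, the `personal_data` set and the registry layout are assumptions made for this example; they are neither a format required by the GDPR nor the one produced by any specific product.

```python
def build_registry(activities, personal_data):
    """Group activity records into one registry entry per process."""
    registry = {}
    for act in activities:
        entry = registry.setdefault(act["process"], {
            "users": set(), "inputs": set(), "outputs": set(),
            "touches_personal_data": False,
        })
        entry["users"].add(act["user"])
        entry["inputs"].update(act["inputs"])
        entry["outputs"].update(act["outputs"])
        # Flag the process as soon as one activity reads or writes personal data.
        if personal_data & (set(act["inputs"]) | set(act["outputs"])):
            entry["touches_personal_data"] = True
    return registry

# Illustrative activity records only.
activities = [
    {"process": "churn-training", "user": "bob",
     "inputs": ["clean/customers"], "outputs": ["models/churn"]},
    {"process": "churn-training", "user": "alice",
     "inputs": ["clean/usage"], "outputs": ["models/churn"]},
    {"process": "weekly-report", "user": "carol",
     "inputs": ["clean/usage"], "outputs": ["reports/weekly"]},
]
registry = build_registry(activities, personal_data={"clean/customers"})
```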

TRANSPARENCY: DATA SCIENCE GOVERNANCE
To govern data science, seen as a continuous integration solution,
we have to explain and measure activities independently of the
technologies.
With this information we can reliably create transparent reports of
activities across the whole chain of processing

GUESS WHAT?
This is what Adalog, our product at Kensu, does!

ADALOG
[Diagram labels: Adalog Collectors, Adalog Service, HTTPS port only, Recommendation System, Data Process Registry, Impact Analyzer, Dashboard, Data Citizen, Data Protection Officer]

WANT TO SEE MORE?
Request a demo on our website: http://kensu.io

Andy Petrella
CEO Co Founder
0032 495 99 11 04
@noootsab
Xavier Tordoir
CSO Co Founder
0032 495 99 11 04
+1 (628) 236-9239
@xtordoir
@kensuio
