www.kensu.io
DATA SCIENCE GOVERNANCE
Turn GDPR’s accountability principles
into an added-value for your business
Data Science Meetup - Milan - March 18
KENSU & ME
ANDY PETRELLA
- CEO & Founder -
Mathematics
Computer Science
Started with an enterprise stack for Data Scientists: Agile Data Science Toolkit
Pivot on internal component: Data Science Catalog
Main focus: Data Science Governance
Spark Notebook, O’Reilly Training
TOPICS
1. Some thoughts on “Data Science”
2. Data Science Governance: What
3. Data Science Governance: How
4. GDPR: Accountability and Transparency Principles
5. How to leverage GDPR and Data Science to improve
or disrupt the Business
SOME THOUGHTS ON “DATA SCIENCE”
MACHINE LEARNING
Pioneers in the 1950s
AI Winter in the 1970s due to pessimism
Resurgence in the 1980s
Machine Learning (and related) has been in use since the 1990s (esp. SVM and RNN)
Deep learning sees widespread commercial use in the 2000s
Machine learning receives great publicity (read: buzz) in the 2010s
ref: https://en.wikipedia.org/wiki/Timeline_of_machine_learning
DATA SCIENCE: +ENGINEERING
Claim: “Data Scientist” was coined by DJ Patil in 2008.
That is pretty much when Machine Learning became part of software.
In a way, it is when we added “engineering” to the mix.
Also, engineering is even more prominent with Big Data and Distributed Computing.
DATA SCIENCE: +EXPERIMENTATION
So much data available
So many tools, libraries, frameworks, …
So many things we can try
We have distributed computing now, right? => Let’s try everything
Discover new insights (and potentially new businesses)
DATA SCIENCE: RECAP
Maths: stats, machine learning and so on
Engineering: ETL, databases, computing frameworks, software, platforms, …
Creativity: “From business intelligence To intelligent business” - Michael Fergusson
Data Science is an umbrella over all activities on data
DON’T BELIEVE ME?
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
DON’T WANT TO READ THE PAPER?
What about this 3-minute lecture in the
Google Machine Learning Crash Course?
It talks about ML systems in production…
OR THIS ONE MAYBE?
Okay, it’s a 14-minute lecture (probably as long as reading the paper ^^)
It talks about data dependencies
DATA SCIENCE GOVERNANCE: WHAT
DATA PIPELINE
A data pipeline connects activities on data, potentially involving
several technologies.
A pipeline is generally thought of as an end-to-end processing line that
solves one problem.
But parts of pipelines are reused to save computation, storage, time, …
Thus the interdependency between pipeline segments grows with the number of initiatives.
GOAL: TAKE DECISION
Data pipelines, connected together, aren’t created for the beauty of it.
The ultimate goal is always to take decisions.
Decisions are generally taken by, or linked to, humans with responsibilities
(even for self-driving cars, when a problem occurs).
Given that pipelines are cut-and-wired, interleaved, …
how can we not be anxious when deploying the last piece used by the decision maker?
SOURCES OF ANXIETY
What if:
• one of the data sources used in the process suddenly shows different patterns?
• one of the tools, projects or similar is modified upstream?
• the insights deviate from reality?
• …
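
The first worry above, a data source that suddenly shows different patterns, can be caught with a simple statistical check. What follows is a minimal sketch in Python and not something shown in the talk: the names `summarize` and `has_drifted`, the sample values and the threshold of 3 standard deviations are all illustrative assumptions.

```python
from statistics import mean, stdev


def summarize(values):
    """Return the basic statistics kept for one numeric column."""
    return {"mean": mean(values), "stdev": stdev(values)}


def has_drifted(reference, current, threshold=3.0):
    """Flag drift when the current mean lies more than `threshold`
    reference standard deviations away from the reference mean."""
    if reference["stdev"] == 0:
        return current["mean"] != reference["mean"]
    z = abs(current["mean"] - reference["mean"]) / reference["stdev"]
    return z > threshold


# Hypothetical samples: statistics stored at training time vs. two later batches.
reference_stats = summarize([10.0, 11.0, 9.5, 10.5, 10.0])
stable_stats = summarize([10.2, 9.8, 10.4, 10.1, 9.9])
shifted_stats = summarize([25.0, 26.5, 24.0, 25.5, 26.0])

print(has_drifted(reference_stats, stable_stats))
print(has_drifted(reference_stats, shifted_stats))
```

A real deployment would probably compare full distributions rather than means, but even this check only works if the reference statistics were collected and stored in the first place.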
DEBUGGING?
To reduce the anxiety or, actually, to reduce the risks, we need ways to debug.
In pure engineering, we have unit, functional and integration tests, … but
what do we do when the problems come from the data themselves?
We can’t generate all cases of data variations, right?
How to debug?
Without the big picture, we may try to optimise a model for weeks for nothing.
DATA SCIENCE GOVERNANCE
Data governance:
• controls that data meets precise standards
• involves monitoring against production data.
Data Science Governance:
• controls that data activity meets precise standards
• involves monitoring against production data activity.
A Data Activity is a phenomenon composed of
Technologies, Users, Systems, Data and Processing
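
To make the definition above concrete, a data activity can be pictured as one record holding those five components. This is only a sketch under assumptions: the class `DataActivity`, its field names and the helper `who_touches` are hypothetical and do not come from the talk or from any Kensu API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataActivity:
    """Hypothetical record of one data activity (field names are assumptions)."""
    technology: str            # tool or framework that ran the processing
    user: str                  # who launched it
    system: str                # where it ran
    inputs: Tuple[str, ...]    # data read
    outputs: Tuple[str, ...]   # data written
    processing: str            # what was done


def who_touches(activities, dataset):
    """Return the sorted users whose activities read or write `dataset`."""
    return sorted({a.user for a in activities if dataset in a.inputs + a.outputs})


activities = [
    DataActivity("spark", "alice", "cluster-a",
                 ("customers",), ("customers_clean",), "cleaning"),
    DataActivity("python", "bob", "laptop",
                 ("customers_clean",), ("churn_model",), "training"),
]

print(who_touches(activities, "customers_clean"))
```

Keeping all five components in one record is what allows a question such as "who does what on which data, and where" to be answered with a simple filter.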
GOVERNING DATA SCIENCE
Who does what on which data and where it is done?
What is the impact of a process on the global system?
What are the performance metrics (quality, execution,…) of the
processes?
CONTINUOUS INTEGRATION FOR DATA SCIENCE
Data Scientists/Citizens have a holistic view of their data,
systems and processes.
They also have control over their own results in production.
They have the opportunity to analyse and debug any pipeline
involving all activities:
• independently of the technologies
• involving several units in the enterprise
DATA SCIENCE GOVERNANCE: HOW
CHALLENGES
So many tools are using data!
The number of processes is growing impressively.
We have to take care of the legacy…
GET THE DATA
As usual, we have to collect the right data to take the right decision.
First, run an assessment to create a high-level map of all the tools
involved in a company.
For each data tool, do whatever it takes to collect information
describing its activities.
This information includes metadata, lineage, statistics, accuracy measures, …
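
As an illustration of the kind of information meant here, the sketch below derives a schema and a few statistics from an in-memory dataset. It is an assumption-laden example, not the method used in the talk: the function `describe_dataset`, the returned layout and the sample rows are all hypothetical.

```python
def describe_dataset(name, rows):
    """Collect simple metadata and statistics for a list of dict rows."""
    columns = sorted({key for row in rows for key in row})
    stats = {}
    for col in columns:
        values = [row.get(col) for row in rows if row.get(col) is not None]
        entry = {"count": len(values), "nulls": len(rows) - len(values)}
        # Only numeric columns (excluding booleans) get min/max/mean.
        numeric = [v for v in values
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        if numeric and len(numeric) == len(values):
            entry["min"] = min(numeric)
            entry["max"] = max(numeric)
            entry["mean"] = sum(numeric) / len(numeric)
        stats[col] = entry
    return {"name": name, "schema": columns, "rows": len(rows), "stats": stats}


# Hypothetical sample data.
customers = [
    {"name": "a", "age": 30},
    {"name": "b", "age": None},
    {"name": "c", "age": 40},
]
report = describe_dataset("customers", customers)
print(report["schema"], report["stats"]["age"])
```

Running such a collector every time a tool reads or writes data is one way to obtain the metadata and statistics that the later steps connect together.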
CONNECT THE DATA
Data Science Governance needs the global picture.
To get it, we need to connect all the data that can be collected,
so that it is possible to create a cartography of all on-going processes.
This map tracks all data and their descendants.
USE THE DATA
This is where the fun part starts… the map of data activities is an
amazing source of information
Here are a few things you can think of when using this kind of data:
• impact analysis
• dependency analysis
• pipeline optimisation
• data or model recommendation
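
The first two items of the list above can be sketched as graph traversals over the map of data activities. The example is a minimal illustration under assumptions: the `LineageGraph` class, its method names and the dataset names are hypothetical, and a plain breadth-first search stands in for whatever a real product would use.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph where an edge source -> target means
    'target is derived from source'."""

    def __init__(self):
        self.children = defaultdict(set)
        self.parents = defaultdict(set)

    def add_edge(self, source, target):
        self.children[source].add(target)
        self.parents[target].add(source)

    @staticmethod
    def _reach(start, neighbours):
        # Breadth-first search collecting every node reachable from `start`.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impact(self, node):
        """Everything downstream of `node` (impact analysis)."""
        return self._reach(node, self.children)

    def dependencies(self, node):
        """Everything upstream of `node` (dependency analysis)."""
        return self._reach(node, self.parents)


graph = LineageGraph()
for src, dst in [("raw", "clean"), ("clean", "features"), ("features", "model"),
                 ("model", "report"), ("clean", "dashboard")]:
    graph.add_edge(src, dst)

print(sorted(graph.impact("clean")))
print(sorted(graph.dependencies("model")))
```

Because the lineage is transitive, a change to `clean` is reported as affecting `report` even though the two are never connected directly.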
GDPR
General Data Protection Regulation
ACCOUNTABILITY PRINCIPLE
Implement appropriate technical and organisational measures that
ensure and demonstrate that you comply. This may include internal
data protection policies such as staff training, internal audits of
processing activities, and reviews of internal HR policies.
TRANSPARENCY
As well as your obligation to provide comprehensive, clear
and transparent privacy policies, if your organisation has
more than 250 employees, you must maintain additional
internal records of your processing activities.
ACCOUNTABILITY: DATA SCIENCE GOVERNANCE
To govern data science, we have to:
• collect activities
• connect activities
Or… to build and maintain the audit trails needed to
create measures that demonstrate accountability
TRANSPARENCY: DATA SCIENCE GOVERNANCE
To govern data science, seen as a continuous integration solution,
we have to monitor activities independently of the technologies.
With this information we can reliably and automatically create the
process registry, composed of the goals pursued and all the data involved.
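
A process registry of this kind can be pictured as a grouping of the collected activities by process. The sketch below is illustrative only: the dictionary keys (`process`, `purpose`, `user`, `inputs`, `outputs`), the function `build_registry` and the sample activities are assumptions, and the output is not a legal template for GDPR records of processing.

```python
def build_registry(activities):
    """Group collected activities by process, listing the goal pursued,
    the data involved and the users involved."""
    registry = {}
    for act in activities:
        entry = registry.setdefault(
            act["process"],
            {"purpose": act["purpose"], "data": set(), "users": set()},
        )
        entry["data"].update(act["inputs"])
        entry["data"].update(act["outputs"])
        entry["users"].add(act["user"])
    return registry


# Hypothetical activities, as they might be collected from different tools.
collected = [
    {"process": "churn-prediction", "purpose": "reduce customer churn",
     "user": "alice", "inputs": ["customers"], "outputs": ["customers_clean"]},
    {"process": "churn-prediction", "purpose": "reduce customer churn",
     "user": "bob", "inputs": ["customers_clean"], "outputs": ["churn_model"]},
]

registry = build_registry(collected)
print(registry["churn-prediction"])
```

Since the registry is derived from monitored activities rather than written by hand, it stays in step with what actually runs.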
BUSINESS: IMPROVE AND DISRUPT
Connect data and business
Spoiler alert: one-liner ahead
DATA TO BUSINESS
Business KPIs are nothing but data!
BUSINESS TO DATA
Change the business to match the data
ADAPT!
KENSU
Making it real… yet, taking the idea even further
Kinda pitchy I know but meh… :-D
ARTIFICIAL INTELLIGENCE ON DATA SCIENCE
[Diagram: Solution. Roles shown: Scientist / Engineer, Manager, CTO, Business, CDO, DPO, Authority. Components shown: Activities, API, Governance, Compliance, Transformation, Machine Learning, Performance, Artificial Intelligence, Actionable Data.]
DATA FLOW, USERS, PROCESSES ALL-IN
Data sources, Schemas
Categories of data involved
Transitive lineage
Markers on privacy data involved
Users involved in the processes
Programs used to create/run the flow
INTUITIVE AND COMPREHENSIBLE REST API
Example in Python:

    # Declare the lineage dependencies from each input schema to the output schema
    dam_dependencies = [
        ProcessLineageDepsBuilder(input_schema, output_schema)
            .identity_from_output("data")
            .append('f', ['name', 'last'], "data"),
        ProcessLineageDepsBuilder(input_schema_2, output_schema)
            .identity_from_output("data"),
    ]
    # Register the process run together with its lineage
    dam_create_process_run_and_lineage(process, user,
                                       code_version, process_name,
                                       dam_dependencies)

High-level integration, e.g. for Spark:

    // Initializing library to hook up to Apache Spark
    import io.kensu.dam.lineage.spark.lineage.Implicits._
    spark.track()

Automatically tracks
- data transformations
- data stats
- machine learning models
- performance of models
MONITORING MACHINE LEARNING PERFORMANCE
DEV / Offline: Read cold data -> Pick parameters -> <train>
PROD / Online: Read hot data -> Use parameters -> <train>
Automated Monitoring
- Create data flow: data -> prepared data -> model
- Register parameters
- Compute/Gather performance metrics
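
The monitoring steps above can be sketched with a tiny run log that registers parameters and metrics for each environment and then compares them. Everything in this example is an assumption made for illustration: the `RunLog` class, the threshold "model", the `accuracy` helper and the sample data are hypothetical and are not the Kensu API.

```python
def accuracy(predict, rows):
    """Share of (feature, label) rows that `predict` classifies correctly."""
    hits = sum(1 for x, label in rows if predict(x) == label)
    return hits / len(rows)


class RunLog:
    """Keeps the parameters and metrics registered for each run."""

    def __init__(self):
        self.runs = []

    def register(self, env, params, metrics):
        self.runs.append({"env": env, "params": dict(params), "metrics": dict(metrics)})

    def latest(self, env):
        return [run for run in self.runs if run["env"] == env][-1]

    def degradation(self, metric):
        """Difference between the latest dev and prod values of `metric`."""
        return self.latest("dev")["metrics"][metric] - self.latest("prod")["metrics"][metric]


# A deliberately trivial "model": a threshold picked offline.
params = {"threshold": 0.5}
predict = lambda x: x >= params["threshold"]

cold_data = [(0.1, False), (0.4, False), (0.6, True), (0.9, True)]   # DEV / offline
hot_data = [(0.2, False), (0.45, True), (0.55, True), (0.7, False)]  # PROD / online

log = RunLog()
log.register("dev", params, {"accuracy": accuracy(predict, cold_data)})
log.register("prod", params, {"accuracy": accuracy(predict, hot_data)})

print(log.degradation("accuracy"))
```

With the same parameters registered in both environments, a drop in the metric points at the data rather than at the model configuration.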
EASY TO INTEGRATE IN DEV ENVIRONMENT
Jupyter
Spark (Python, R, Scala)
Even notebooks!
Google Colab
Python, TensorFlow
OUR PRODUCT: KENSU DATA ACTIVITY MANAGEMENT
Data Science Governance
First Governance, Compliance and Performance solution for Data science
Feature: Connect.Collect.Learn (#GDPR)
  Benefit: Automatically captures all data science relevant activities related to governance, compliance and performance within a given domain.
  Why it matters: Provides end-to-end control and insights into all relevant aspects of data science related activities.

Feature: DPO Dashboard (#GDPR)
  Benefit: One-stop control center for all potential data privacy violations.
  Why it matters: Near-realtime notifications and actionable intelligence on the current state of “compliance health”.

Feature: Compliance Reporting (#GDPR)
  Benefit: One-click generation of all relevant governance and compliance reports.
  Why it matters: Guarantee for a good relationship with the authorities in charge by respecting their templates.
BTW
Spark and Machine Learning training in Rome in June:
http://www.technologytransfer.eu/event/1779/Apache_Spark_and_Machine_Learning_Workshop.html
———————————————————————————————
Interested in our way to think about ML and DS?
We have another 3-day training on this (Spark, TensorFlow, H2O, …)
(One in Rome to be scheduled this Fall)
DATA SCIENCE GOVERNANCE
Andy Petrella
CEO and Co-Founder
@noootsab
andy.petrella@kensu.io
@kensuio
Let’s chat after the talk o/
Or, contact me for:
- DAM (demo, pilot, …)
- training!
