Stopping your Lake becoming a Swamp
Steve Jones – Global VP Big Data – Capgemini
4 May 2016
#INFA16
The biggest complaint in Data Governance…
The business won’t engage
Want to know why?
It’s because they never cared about Data Governance
Because it didn’t actually govern the data in the way they wanted
Multiple internal views – consistently compromised
[Diagram: Corporate, LOB management, Operations, Market, and Ad-hoc views over transactional systems (CRM, ERP, PLM), Web, ODS, EDW, LOB marts, and spreadsheets]
Multiple internal views – you already have a swamp
[Diagram: Corporate, LOB management, Operations, Market, and Ad-hoc views over transactional systems (CRM, ERP, PLM), Web, ODS, EDW, LOB marts, and spreadsheets, annotated with Fit, Detail, Freshness, and Fidelity]
IT’s answer – just make the EDW ‘bigger’ – one view for all
Why the single view & the golden record fails
[Diagram: Division 1 to Division 4, each with Sales, Finance, Supply chain, Marketing, and R&D, plus Personal KPIs and Divisional KPIs, alongside Corporate and Corporate KPIs. Callout: Now agree on everything]
The biggest complaint in Data Governance…
But good news, Big Data is changing the game
Business Meta-Data – recognizing how people work
[Diagram: Marketing and the Data Lake, with Debt Recovery and High Value Customers data. Steps: Explore the Lake, Prioritize]
Business Meta-Data – recognizing how people work
[Diagram: Marketing and the Data Lake, with Debt Recovery and High Value Customers data. Business Meta-Data labels: Purpose: Debt Recovery; Purpose: Mktg. Steps: Explore the Lake, Verify Purpose, Exclude]
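One way to read the purpose labels on this slide is that each data set in the lake carries business meta-data stating what it may be used for, and a request is checked against that label before the data is handed over. The Python below is a minimal sketch of that reading only. The names `DataSet`, `allowed_purposes`, and `select_for_purpose`, and the sample catalog, are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataSet:
    """A lake data set plus its business meta-data (hypothetical model)."""
    name: str
    allowed_purposes: frozenset = field(default_factory=frozenset)


def select_for_purpose(catalog, purpose):
    """Verify purpose: return (usable, excluded) data set names for a request."""
    usable = [d.name for d in catalog if purpose in d.allowed_purposes]
    excluded = [d.name for d in catalog if purpose not in d.allowed_purposes]
    return usable, excluded


# Illustrative catalog; the data set names are made up for this sketch.
catalog = [
    DataSet("high_value_customers", frozenset({"marketing"})),
    DataSet("debt_recovery_cases", frozenset({"debt_recovery"})),
    DataSet("customer_contacts", frozenset({"marketing", "debt_recovery"})),
]

usable, excluded = select_for_purpose(catalog, "marketing")
print("usable for marketing:", usable)
print("excluded for marketing:", excluded)
```

The point of the sketch is that the check runs on business meta-data (purpose), not on technical meta-data such as column types.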
So what do we need?
Govern where it matters
•  Focus on Meta-data, MDM and RDM
•  Enforce only when sharing
•  Treat Corporate as aggregation of Local
Encourage local requirements
•  Let the business decide what they need
•  Build from the bottom
•  Enable traceability to source
•  Disposable data views
Distill on demand
•  Select only what you want
•  Business friendly tooling
•  Re-usable information maps
•  Rapid change cycle
Store everything
•  Store everything ‘as is’
•  Include structured and unstructured data
•  Store it cheaply
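The "store everything as is" and "distill on demand" points can be illustrated with a small Python sketch: raw records stay untouched, and a disposable view selects only the wanted fields while keeping a pointer back to the source record. This is a sketch under stated assumptions, not the tooling the talk has in mind. `raw_store`, `distill`, and the `_lineage` field are hypothetical names, and the sample records are invented.

```python
# Raw records are kept 'as is', each tagged with where it came from.
raw_store = [
    {"source": "crm", "record_id": "c-1",
     "payload": {"customer": "Acme", "region": "EU", "spend": 1200}},
    {"source": "erp", "record_id": "e-7",
     "payload": {"customer": "Acme", "region": "EU", "invoice_total": 900}},
    {"source": "crm", "record_id": "c-2",
     "payload": {"customer": "Globex", "region": "US", "spend": 300}},
]


def distill(store, wanted_fields, predicate=lambda payload: True):
    """Build a throwaway view: chosen fields only, with lineage to the raw record."""
    view = []
    for rec in store:
        payload = rec["payload"]
        if not predicate(payload):
            continue
        # Select only what the consumer asked for; missing fields are skipped.
        row = {f: payload[f] for f in wanted_fields if f in payload}
        # Traceability to source: remember which raw record produced this row.
        row["_lineage"] = (rec["source"], rec["record_id"])
        view.append(row)
    return view


eu_view = distill(raw_store, ["customer", "spend"],
                  lambda p: p.get("region") == "EU")
for row in eu_view:
    print(row)
```

Because the view is derived from the raw store each time, it can be discarded and rebuilt with different fields when requirements change.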
Why Data Lakes work
[Diagram: Archives, Operational systems (Case Mgmt, Operations, Scheduling), Data Explosion, and Geographic Data brought into One Place for Everything, used for Supply Chain Optimization, Fraud Detection, Next Best Action, and Marketing, with R, MADlib, and SAS]
Keep governance small and focused – the MDM RADAR
[Diagram: the MDM RADAR – Core, Sync, Share, Federated]
Collaboration over quality
The cross-reference is the most important thing an MDM project can deliver
Quality is a side-effect – not a goal of the first phase of your MDM project
Why the X-Ref matters
[Diagram: BU1, BU2, and BU3 each keep a local view of Customer, Invoices, and Orders. Information governance applies corporate standards to master data and reference data. Customer MDM and the Customer x-ref link the local views into a corporate view of Customer, Invoices, and Orders]
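As a rough illustration of why the cross-reference matters, the Python sketch below rolls local BU orders up into a corporate view using nothing but an x-ref table. It assumes the x-ref maps a (business unit, local customer id) pair to a master customer id. The identifiers, amounts, and the function name `corporate_view` are all hypothetical.

```python
from collections import defaultdict

# Cross-reference: (business unit, local customer id) -> master customer id.
xref = {
    ("BU1", "C-100"): "M-1",
    ("BU2", "9981"): "M-1",
    ("BU3", "ACME-UK"): "M-1",
    ("BU2", "9990"): "M-2",
}

# Each BU keeps its own local view of orders, with its own customer ids.
orders = [
    {"bu": "BU1", "customer": "C-100", "amount": 500},
    {"bu": "BU2", "customer": "9981", "amount": 250},
    {"bu": "BU3", "customer": "ACME-UK", "amount": 125},
    {"bu": "BU2", "customer": "9990", "amount": 75},
    {"bu": "BU3", "customer": "UNKNOWN", "amount": 10},
]


def corporate_view(rows, xref):
    """Aggregate local orders by master customer; report rows that cannot be linked."""
    totals = defaultdict(int)
    unlinked = []
    for row in rows:
        master = xref.get((row["bu"], row["customer"]))
        if master is None:
            unlinked.append(row)  # no x-ref entry: cannot join the corporate view
        else:
            totals[master] += row["amount"]
    return dict(totals), unlinked


totals, unlinked = corporate_view(orders, xref)
print(totals)
print(unlinked)
```

Note that the local records are never cleansed or rewritten here. The corporate view exists purely because the identifiers can be linked, which is the sense in which quality is a side-effect of the first phase.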
The Quality mess
[Diagram: Internal systems (CRM, Back Office, Trading) and Market Data feed separate Reports and Trading outputs through multiple ETL jobs]
The current situation – add a new System
[Diagram: Internal systems (CRM, Back Office, Trading) and Market Data feed separate Reports and Trading outputs through multiple ETL jobs. The new system adds another ETL job and another set of Reports]
The current situation – Where does Data Quality go?
[Diagram: Internal systems (CRM, Back Office, Trading) and Market Data feed separate Reports and Trading outputs through multiple ETL jobs. Labels added: Fix at Source, plus repeated Patch and DQ steps across the flows]
Let’s try another way – Fix at Source is still the ideal but
[Diagram: Internal systems (CRM, Back Office, Trading) and Market Data feed a Data Lake. Data Quality & MDM are applied there (Data Lake + Quality), and Reports and Trading draw on the result]
Then let’s go further
[Diagram: Internal systems (CRM, Back Office, Trading) and Market Data feed a Data Lake. Data Quality & MDM are applied there (Data Lake + Quality), and Reports, Trading, and Data Science draw on the result]
Why the Business Data Lake succeeds
[Diagram: Division 1 to Division 4, each with Sales, Finance, Supply chain, Marketing, and R&D, plus Personal KPIs and Divisional KPIs, alongside Corporate and Corporate KPIs. Callout: Now agree where it counts]
Help me ingestion – you’re my only hope
[Diagram: Business Data Lake architecture]
•  Sources
•  Ingestion tier: Real-time ingestion, Micro batch ingestion, Batch ingestion
•  Unified operations tier: System monitoring, System management
•  Unified data management tier: Data mgmt. services, MDM, RDM, Audit and policy mgmt.
•  Processing tier: Workflow management; Real time, Micro batch, Mega batch; In-memory, MPP database
•  Distillation tier: HDFS storage (Unstructured and structured data)
•  Query interfaces: SQL, NoSQL, SparkSQL
•  Insights tier: Real-time insights, Interactive insights, Batch insights
•  Action tier
Ingestion: The first and LAST place you can have control
Distillation: Think ‘Excel’ not EDW
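If ingestion is the first and last place you can have control, one plausible use of that control is to record a small envelope of meta-data for every payload at the moment it lands, while storing the payload itself unchanged. The Python below is a sketch of that idea only. The function `ingest`, the envelope fields, and the in-memory `lake` list are assumptions for illustration and do not describe any particular product.

```python
import hashlib
import json
from datetime import datetime, timezone


def ingest(raw_bytes, source, lake):
    """Store the payload unchanged, wrapped in meta-data captured at ingestion."""
    envelope = {
        "source": source,                                        # where it came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),   # when it landed
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),         # content fingerprint
        "size_bytes": len(raw_bytes),
    }
    lake.append({"meta": envelope, "raw": raw_bytes})
    return envelope


lake = []  # stand-in for real lake storage
payload = json.dumps({"customer": "Acme", "spend": 1200}).encode("utf-8")
meta = ingest(payload, "crm", lake)
print(meta["source"], meta["size_bytes"], meta["sha256"][:12])
```

Nothing in the payload is validated or reshaped here. Shaping is left to distillation, in line with the "Think ‘Excel’ not EDW" note.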
If your business isn’t trivial – you’ll end up with multiple lakes
[Diagram: Client Data Lake, Partner Data Lake, and three Geo specific lakes alongside the Corporate Data Lake. Labels: Analytics, Results, Distillation, Summary]
Federation requires THREE things
The cross-reference
•  If you can’t link data sets it’s impossible
Rationalization
•  The lesson of Google & Amazon is less choice, not more
Business Meta-Data
•  Technical Meta-Data is no longer enough
Building a Lake and the challenge of federation
[Diagram: two Client specific lakes and three Geo specific lakes alongside the Corporate Data Lake]
To do this you need to standardize technology
Business Meta-Data and the x-ref – the two things that must be shared
[Diagram: two Client specific lakes and three Geo specific lakes alongside the Corporate Data Lake, with Business Meta-Data & x-ref shared across them]
The new philosophy
Business Data Lake
•  Store everything
•  Encourage local
•  Govern only the common
•  Assume Federation
It’s all about insight at the point of action
But remember
Your next challenge will be analytical collaboration beyond your company boundaries
