The document outlines five keys to building a successful data lake:
1. Align the data lake to corporate strategic goals and objectives and ensure executive sponsorship.
2. Establish a solid data integration strategy that manages and automates the data pipeline across sources.
3. Develop a process for onboarding big data from diverse sources at scale while maintaining governance.
4. Embrace new data management practices like early data ingestion, adaptive processing, and applying analytics to all data.
5. Operationalize machine learning models by preparing data, training and testing models, and deploying models to uncover new insights.
5. What Do You Get from Data Lake?
• More accurate intelligence
• Transform your business
• Ability to increase revenue
• Create new products
• Better understand your customers
• Streamline operations and improve efficiencies
8. Five Keys to a Killer Data Lake
1. Align to corporate strategy
2. Solid data integration strategy
3. Big Data on-boarding process
4. Embrace new data management practices
5. Operationalize machine learning models
10. Align Goals and Executive Buy In
• Understand corporate goals
• Identify executive leadership and sponsorship
• Recognize lack of alignment
• Ensure efforts are aligned with strategic goals
11. Align to Strategic Organizational Goals
Business Acceleration
• Know your customer: Customer 360, Churn, Recommendation engine
• Maximize Profit: Pricing analytics, Targeted promotions, Market basket analytics
• New Product Development: Customization of product, Next product to build
Operational Efficiency
• Modernizing Data Architecture: EDWO, Storage data optimization
• Industrial IoT: Sensor Analytics, Predictive Maintenance, Telematics, Infrastructure Analytics
Security and Risk
• Risk: Credit scoring, Fraud detection
• Security: Cyber security
• Compliance: Trade compliance, Health care compliance, Anti Money Laundering
13. Data Integration Strategy
• Ensure organizational agreement on strategy
• Manage and automate the Data Pipeline
• Modernize your architecture
• Adaptive execution strategy
• Secure your data
• Accept that Data Governance is separate from Data Management
• Rethink Metadata Management
14. Managing and Automating the Pipeline
[Diagram: Analytic Data Pipeline and Data Lake]
Pipeline management: Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline, Monitoring, Automation
Pipeline stages: Data Engineering, Data Preparation, Analytics
Activities across the stages: Ingest, Transform, Cleanse, Conform, Shape, Refine, Virtualize, Blend, Orchestrate, Prepare, Enrich, Visualize, Build, Score, Analyze, Model
17. More Data, More Problems
Modern data onboarding is more than just “connecting” or “loading” – it includes:
• Managing a changing array of data sources
• Establishing repeatable processes at scale
• Maintaining control and governance
20. Modern Data Management Strategies
• Adopt early ingest and adaptive processing
• Enable the capture of metadata on ingest
• Adopt streaming data processing where appropriate
• Model on the fly
• Modernize data integration infrastructure
• Extend data management to all data
• Apply analytics to all data
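As an illustration of "early ingest" combined with "capture of metadata on ingest", here is a minimal Python sketch. It lands the raw data unchanged and records technical metadata alongside it. The function name, the metadata fields, and the sample billing extract are all assumptions made for this example; they are not taken from the presentation or from any specific product.

```python
import csv
import hashlib
import io
import json
from datetime import datetime, timezone


def ingest_with_metadata(source_name, raw_text):
    """Land raw CSV text unchanged and return it with a metadata record.

    Hypothetical helper: the raw payload is not transformed (early ingest);
    only descriptive metadata is derived from it.
    """
    rows = list(csv.reader(io.StringIO(raw_text)))
    header = rows[0] if rows else []
    metadata = {
        "source": source_name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "columns": header,
        "row_count": max(len(rows) - 1, 0),  # exclude the header row
    }
    return raw_text, metadata


# Made-up sample extract, for illustration only.
sample = "customer_id,plan,monthly_spend\n101,basic,20.0\n102,premium,55.5\n"
landed, meta = ingest_with_metadata("billing_export", sample)
print(json.dumps(meta, indent=2))
```

Business metadata (owner, sensitivity, retention) could be attached to the same record; it is left out here to keep the sketch short.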
24. Data Lake Blueprint
[Diagram: data lake blueprint, labeled "Global Data Integration"]
Data sources: Network, Location, Web, EDW (x12), Billing, Provisioning, Customer, Social Media
Ingest, Blend and Refine: Pentaho Data Integration
Data Lake: Hadoop Cluster; Pentaho Data Integration (Visual MapReduce and some native PDI Transformations)
Deliver: Data Publisher, On-demand Data Marts, Analytical Database, Pentaho Analytics Server, Existing BI and Data Mining Tools
Other label: To be decommissioned
25. Uncover Billions of Tax Revenue
Challenge
• £34B missed tax revenue
• Managing 40 TB of data held across 11 separate legacy data warehouses
• Relied on consultants for reports that required customization and long lead time
Benefits
• 360 degree view of the tax citizen
• Created a single Big Data platform and ability to consolidate 40 reporting streams with self-service reporting
• New reports save an estimated 900 man hours per day (based on a user-base of 1,200) by streamlining the reporting process
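To put the last benefit in per-user terms, the two figures on the slide (900 hours saved per day, 1,200 users) can be divided out. This is only arithmetic on the stated estimate, assuming the saving is spread evenly across the user base:

```python
hours_saved_per_day = 900  # estimate stated on the slide
user_base = 1200           # user base stated on the slide

# Assumes the saving is spread evenly across all users.
hours_per_user = hours_saved_per_day / user_base
minutes_per_user = hours_per_user * 60

print(f"{hours_per_user:.2f} hours ({minutes_per_user:.0f} minutes) saved per user per day")
```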
Explain why a data lake – and when I talk about a data lake I expect it to be clean, pure and pristine …
Might be good to use a quote from an analyst or analyst blog – maybe from Philip Russom - https://upside.tdwi.org/articles/2016/11/03/benefits-of-hadoop-data-lake.aspx
Better intelligence is kind of obvious, but the rest are the key benefits – pointing back to the strategic initiatives that can be achieved.
Not so easy…
There is no EASY button –
Let’s be honest, Hadoop is Hard
And here’s the other problem… When Hadoop first hit the scene, we recognized how great it was in that we could put data into Hadoop – literally “dump” data in – and guess what…
Next slide
That’s what we got! A dump –
We loaded all our data into the truck, backed the truck up and dumped it….
So to ensure you have a pure, clean, pristine data lake, you have to put thought into it long before you start dumping…
Understand Corporate Goals
Recognize Lack of Alignment
Identify Executive Leadership and Sponsorship
Ensure Efforts are aligned with Strategic Goals
Align your efforts with the organization’s strategic goals and initiatives – namely business acceleration, operational efficiency, and security and risk
Ensure Organizational Agreement on Strategy – find ways to incentivize appropriate behavior from data owners
In the world of Big Data, your data integration strategy is what enables you to manage data – in fact, big data governance is based on your big data integration strategy – you have to have a solid strategy, not just tools (and certainly not the same old tools you’ve been using)
Accept that Data Governance is Separate from Data Management – understand that you no longer control data – it’s on prem, in the cloud, in apps – accept it and govern accordingly
Look at this: http://www.informationweek.com/big-data/big-data-analytics/8-critical-elements-of-a-successful-data-integration-strategy/d/d-id/1327107?image_number=1
Data Integration strategy must support your data lake goals AND operate across the entire data fabric – the data lake is part of the whole fabric, but it’s not the only thing
Ensure you fill the lake without just dumping
Modern data onboarding challenges go beyond just ‘connecting’ to data sources or ‘ingesting’ data into a store of choice
They introduce significant new challenges related to dealing with many more sources of data that may change over time
They also require a flexible, efficient, and governed process to be fully successful…
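One way to picture a repeatable, governed onboarding process is a single parameterized routine driven by per-source configuration, so that a new or changed source means a configuration change rather than a new hand-built job. The Python sketch below assumes two made-up sources ("billing" and "web") with hypothetical column names; it does not reflect any particular tool's API.

```python
import csv
import io

# Hypothetical source definitions. Onboarding another source means adding an
# entry here, not writing a new routine.
SOURCES = {
    "billing": {"delimiter": ",", "rename": {"cust": "customer_id", "amt": "amount"}},
    "web": {"delimiter": "|", "rename": {"uid": "customer_id", "spend": "amount"}},
}


def onboard(source_name, raw_text):
    """Parse one source using its config and return rows in a shared layout."""
    cfg = SOURCES[source_name]
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=cfg["delimiter"])
    rows = []
    for record in reader:
        # Map source-specific column names onto the shared names.
        row = {cfg["rename"].get(key, key): value for key, value in record.items()}
        row["source"] = source_name  # keep lineage with every row for governance
        rows.append(row)
    return rows


# Made-up sample payloads, for illustration only.
billing_rows = onboard("billing", "cust,amt\n101,20.0\n")
web_rows = onboard("web", "uid|spend\n101|7.5\n")
print(billing_rows + web_rows)
```

Because the same routine handles every source, the process stays repeatable at scale, and the `source` field gives each row a basic lineage marker.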
ELT type process
Metadata – define what we mean – all metadata – technical metadata (data types, etc.) and business metadata
New Data Mgmt Strategies – think beyond what you’ve done in past years with data warehousing – things like early ingest (get the data in, then process), capture metadata on the way in, automate the creation of analytic data models for user interaction
Really comes down to AUTOMATION – at the scale and pace of data in the modern world, you can’t keep up by having IT or business users building models by hand – it must be automated.
Prepare Data and Engineer New Features
Train, Tune and Test Models
Deploy and Operationalize Models
Update Models Regularly
Simplify Data Prep
Prepare data and perform feature engineering tasks faster in an easy-to-use drag-and-drop environment (enabling self-service for data scientists)
(show the IT->DA->DS triangle of data foundation)
Use ML Tool of Choice
Train, tune and test your R, Python, Spark MLlib or Weka machine learning algorithms faster to build more predictive models (for data scientists)
Operationalize
Quickly operationalize your data scientist’s machine learning models, whether they use R, Python, MLlib or Weka (for data engineers and IT in general)
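The four steps in these notes (prepare data and engineer features; train, tune and test; deploy and operationalize; update regularly) can be sketched end to end in a few lines of Python. Everything here is an assumption made for illustration: the churn records are made up, the "model" is a deliberately trivial threshold rule rather than an R, Spark MLlib or Weka algorithm, and deployment is simulated by a JSON round trip.

```python
import json

# Made-up records: (monthly_spend, support_calls, churned)
records = [
    (20.0, 5, 1), (25.0, 4, 1), (30.0, 6, 1), (22.0, 7, 1),
    (60.0, 1, 0), (55.0, 0, 0), (70.0, 2, 0), (65.0, 1, 0),
]


def engineer_feature(spend, calls):
    """Prepare: derive one feature, support calls per 10 units of spend."""
    return calls / (spend / 10.0)


def train(rows):
    """Train: set the threshold midway between the two class means."""
    churned = [engineer_feature(s, c) for s, c, y in rows if y == 1]
    stayed = [engineer_feature(s, c) for s, c, y in rows if y == 0]
    threshold = (sum(churned) / len(churned) + sum(stayed) / len(stayed)) / 2
    return {"threshold": threshold}


def predict(model, spend, calls):
    """Score one customer: 1 means predicted churn, 0 means predicted stay."""
    return 1 if engineer_feature(spend, calls) > model["threshold"] else 0


def accuracy(model, rows):
    """Test: fraction of rows where the prediction matches the label."""
    return sum(predict(model, s, c) == y for s, c, y in rows) / len(rows)


# Hold out one record per class for testing.
train_rows = records[:3] + records[4:7]
test_rows = [records[3], records[7]]

model = train(train_rows)
test_accuracy = accuracy(model, test_rows)

# Deploy: serialize the model so a separate scoring process could load it.
deployed = json.loads(json.dumps(model))
print(test_accuracy, predict(deployed, 24.0, 6))

# Update: retrain on all available data as new records arrive.
model = train(records)
```

The point of the sketch is the hand-off: the same feature function is used in training and in scoring, and the deployed artifact is plain data that IT can operate without the data scientist's environment.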