Five Keys to a Killer Data Lake
Chuck Yarbrough
VP, Solutions Marketing and Management
Agenda
Why a Data Lake?
Five Keys to a Killer Data Lake
What You Should Do Now
Why a Data Lake?
What Do You Get from a Data Lake?
• More accurate intelligence
• Transform your business
• Ability to increase revenue
• Create new products
• Better understand your customers
• Streamline operations and improve efficiencies
Five Keys to a Killer Data Lake
1. Align to corporate strategy
2. Solid data integration strategy
3. Big Data on-boarding process
4. Embrace new data management practices
5. Operationalize machine learning models
KEY #1
Align to Strategic Organizational Goals
Align Goals and Executive Buy In
• Understand corporate goals
• Identify executive leadership and sponsorship
• Recognize lack of alignment
• Ensure efforts are aligned with strategic goals
Align to Strategic Organizational Goals
Business Acceleration
• Know your customer: Customer 360, Churn, Recommendation engine
• Maximize Profit: Pricing analytics, Targeted promotions, Market basket analytics
• New Product Development: Customization of product, Next product to build
Operational Efficiency
• Modernizing Data Architecture: EDWO, Storage data optimization
• Industrial IoT: Sensor Analytics, Predictive Maintenance, Telematics, Infrastructure Analytics
Security and Risk
• Risk: Credit scoring, Fraud detection
• Security: Cyber security
• Compliance: Trade compliance, Health care compliance, Anti Money Laundering
KEY #2
Have a Solid Data Integration Strategy
Data Integration Strategy
• Ensure organizational agreement on strategy
• Manage and automate the Data Pipeline
• Modernize your architecture
• Adaptive execution strategy
• Secure your data
• Accept that Data Governance is separate from Data Management
• Rethink Metadata Management
Managing and Automating the Pipeline
Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline, Monitoring, Automation
Analytic Data Pipeline
DATA ENGINEERING, DATA PREPARATION, ANALYTICS
Cleanse, Conform, Shape, Transform, Ingest, Refine, Virtualize, Blend, Orchestrate, Prepare, Enrich, Visualize, Build, Score, Analyze, Model
Data Lake
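The pipeline slide pairs processing stages (such as Cleanse, Conform and Enrich) with monitoring and automation. A minimal sketch of that pairing in Python follows. The step functions, field names and sample rows are illustrative assumptions, not part of the original deck or of any Pentaho API.

```python
from typing import Callable, List, Tuple

Row = dict
Step = Tuple[str, Callable[[List[Row]], List[Row]]]


def cleanse(rows):
    """Drop rows that have no customer name."""
    return [r for r in rows if r.get("customer")]


def conform(rows):
    """Normalize customer names to one casing."""
    return [{**r, "customer": r["customer"].strip().lower()} for r in rows]


def enrich(rows):
    """Add a derived field that later analytics can use."""
    return [{**r, "high_value": r["spend"] >= 100} for r in rows]


def run_pipeline(rows, steps: List[Step]):
    """Run the steps in order and log row counts in and out of each one."""
    log = []
    for name, step in steps:
        before = len(rows)
        rows = step(rows)
        log.append({"step": name, "rows_in": before, "rows_out": len(rows)})
    return rows, log


# Hypothetical ingested rows standing in for real source data.
ingested = [
    {"customer": " Acme ", "spend": 120},
    {"customer": "", "spend": 40},
    {"customer": "Globex", "spend": 75},
]
result, log = run_pipeline(
    ingested, [("cleanse", cleanse), ("conform", conform), ("enrich", enrich)]
)
```

Because the steps are just a list, the per-step log doubles as simple pipeline monitoring, and adding or reordering stages does not require touching the runner.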
KEY #3
Establish a Big Data Onboarding Strategy
Filling the Data Lake
More Data, More Problems
Modern data onboarding is more than just “connecting” or “loading” – it includes:
• Managing a changing array of data sources
• Establishing repeatable processes at scale
• Maintaining control and governance
Dynamic Integration Processes, Dynamic Transformations
Ingest Procedures (x10)
Big Data On Boarding
Disparate Data Sources: RDBMS, Hadoop, CSV
Integration Processes, Transformations
Ingest Procedures (x10)
Metadata Injection
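One reading of metadata injection is that a single generic ingest procedure takes over from many hand-built ones, with per-source metadata supplied at run time. The Python sketch below illustrates that reading only. `SourceSpec`, `ingest` and the sample sources are hypothetical names chosen for the example, and in-memory CSV text stands in for real RDBMS or Hadoop sources.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical metadata record: one per source, injected into a single template.
@dataclass
class SourceSpec:
    name: str
    delimiter: str
    columns: Dict[str, Callable[[str], object]]  # column name -> type converter


def ingest(spec: SourceSpec, raw_text: str) -> List[dict]:
    """Generic ingest template whose behavior is driven entirely by the spec."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=spec.delimiter)
    rows = []
    for record in reader:
        row = {col: convert(record[col]) for col, convert in spec.columns.items()}
        row["_source"] = spec.name  # keep provenance with every row
        rows.append(row)
    return rows


# Two illustrative sources with different layouts share the same procedure.
specs = [
    SourceSpec("billing", ",", {"customer_id": int, "amount": float}),
    SourceSpec("web", "|", {"customer_id": int, "page": str}),
]
raw = {
    "billing": "customer_id,amount\n1,9.50\n2,20.00\n",
    "web": "customer_id|page\n1|/home\n",
}

lake = [row for spec in specs for row in ingest(spec, raw[spec.name])]
```

Onboarding another source in this sketch means adding one `SourceSpec` entry, not writing another ingest procedure.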
KEY #4
Embrace New Data Management Strategies
Modern Data Management Strategies
• Adopt early ingest and adaptive processing
• Enable the capture of metadata on ingest
• Adopt streaming data processing where appropriate
• Model on the fly
• Modernize data integration infrastructure
• Extend data management to all data
• Apply analytics to all data
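Two of the practices above, early ingest and capturing metadata on ingest, can be sketched together: load the data as-is, and record what was learned about it on the way in. The example below is a minimal Python illustration under assumed names (`infer_type`, `ingest_with_metadata`, the `sensor_readings` sample); it captures only simple technical metadata, not business metadata.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of 'int', 'float', 'str' that fits every value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def ingest_with_metadata(source_name, raw_text):
    """Load rows unchanged (early ingest) and record metadata about them."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    columns = rows[0].keys() if rows else []
    metadata = {
        "source": source_name,
        "row_count": len(rows),
        "columns": {c: infer_type([r[c] for r in rows]) for c in columns},
    }
    return rows, metadata


rows, meta = ingest_with_metadata(
    "sensor_readings",  # illustrative source name
    "device,temp_c,reading_no\npump-1,71.5,1\npump-2,68.0,2\n",
)
```

The rows stay as raw strings so that later, adaptive processing can decide how to model them, while the captured metadata already describes their shape.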
KEY #5
Apply Machine Learning Algorithms
Machine Learning Workflow
Prepare Data
Engineer Features
Train, Tune and Test Models
Deploy and Operationalize Models
Update Models
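The workflow steps can be walked through end to end on toy data. The Python sketch below is an assumption-heavy illustration: the churn records are made up, and a simple threshold rule stands in for a real machine learning algorithm, which the deck does not specify.

```python
# Toy churn data: (monthly_spend, support_calls, churned). Values are made up.
raw = [
    (50.0, 1, 0), (20.0, 4, 1), (80.0, 0, 0), (25.0, 5, 1),
    (60.0, 2, 0), (30.0, 6, 1), (None, 3, 1), (70.0, 1, 0),
]


def prepare(records):
    """Prepare data: drop records with missing values."""
    return [r for r in records if None not in r]


def engineer(record):
    """Engineer a feature: support calls per 10 units of spend."""
    spend, calls, _ = record
    return calls / (spend / 10.0)


def accuracy(threshold, records):
    """Share of records where 'feature > threshold' matches the churn label."""
    hits = sum((engineer(r) > threshold) == bool(r[2]) for r in records)
    return hits / len(records)


def train(records, candidates=(0.25, 0.5, 1.0, 2.0)):
    """Train and tune: pick the threshold with the best training accuracy."""
    return max(candidates, key=lambda t: accuracy(t, records))


def deploy(threshold):
    """Operationalize: wrap the model as a scoring function."""
    return lambda spend, calls: engineer((spend, calls, None)) > threshold


clean = prepare(raw)
train_set, test_set = clean[:5], clean[5:]
model = train(train_set)
test_accuracy = accuracy(model, test_set)  # test on held-out records
score = deploy(model)

# Update: retrain and redeploy when new labeled data arrives.
model = train(clean + [(40.0, 1, 0)])
score = deploy(model)
```

Keeping each step as its own function mirrors the workflow: the update step simply reruns training and deployment on the larger data set.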
360° View of the Taxpayer
Data Lake Blueprint
Global Data Integration
Ingest, Blend and Refine
Sources: Network, Location, Web, EDW (x12), Billing, Provisioning, Customer, Social Media
Pentaho Data Integration
Hadoop Cluster
Data Publisher
Analytical Database
Pentaho Analytics Server
Existing BI and Data Mining Tools
Data Lake
Pentaho Data Integration: Visual MapReduce and some native PDI Transformations
On-demand Data Marts
To be decommissioned
Deliver
Uncover Billions of Tax Revenue
Challenge
• £34B missed tax revenue
• Managing 40 TB of data held across 11 separate legacy data warehouses
• Relied on consultants for reports that required customization and long lead time
Benefits
• 360 degree view of the tax citizen
• Created a single Big Data platform and ability to consolidate 40 reporting streams with self-service reporting
• New reports save an estimated 900 man hours per day (based on a user-base of 1,200) by streamlining the reporting process
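To put the reporting estimate in per-user terms, the two figures from the slide can be divided directly. This is plain arithmetic on the stated numbers, assuming the savings are spread evenly across the user base:

```python
hours_saved_per_day = 900  # estimate from the slide
user_base = 1_200          # user base from the slide

hours_per_user = hours_saved_per_day / user_base
minutes_per_user = hours_per_user * 60
# 0.75 hours, i.e. 45 minutes saved per user per day
```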
Takeaways
• Align with clear corporate/strategic initiatives
• Embrace data management practices
• Enable adaptive data execution for data processing and integration
• Drive adoption of Machine Learning and Automation
Questions
Thank You
Chuck Yarbrough
VP, Solutions Marketing and Management
@cyarbrough

Editor's Notes

  • #4, #5 Explain why a data lake. When I talk about a data lake, I expect it to be clean, pure and pristine. Might be good to use a quote from an analyst or analyst blog, maybe from Philip Russom: https://upside.tdwi.org/articles/2016/11/03/benefits-of-hadoop-data-lake.aspx
  • #6 Better intelligence is kind of obvious, but the rest are the key benefits, pointing back to the strategic initiatives that can be achieved.
  • #7 Not so easy. There is no EASY button. Let’s be honest, Hadoop is hard. And here’s the other problem: when Hadoop first hit the scene, we recognized how great it was that we could put data into Hadoop, literally “dump” data in, and guess what... (next slide)
  • #8 That’s what we got! A dump. We loaded all our data into the truck, backed the truck up and dumped it.
  • #9 So to ensure you have a pure, clean, pristine data lake, you have to put thought into it long before you start dumping.
  • #11 Understand Corporate Goals Recognize Lack of Alignment Identify Executive Leadership and Sponsorship Ensure Efforts are aligned with Strategic Goals
  • #12 Align your efforts with the organization's strategic goals and initiatives, namely business acceleration, operational efficiency, and security and risk
  • #14 Ensure Organizational Agreement on Strategy: find ways to incentivize appropriate behavior from data owners. In the world of Big Data, your data integration strategy is what enables you to manage; in fact, big data governance is based on your big data integration strategy. You have to have a solid strategy, not just tools (and certainly not the same old tools you’ve been using). Accept that Data Governance is Separate from Data Management: understand that you no longer control data. It’s on prem, in the cloud, in apps. Accept it and govern accordingly. Look at this: http://www.informationweek.com/big-data/big-data-analytics/8-critical-elements-of-a-successful-data-integration-strategy/d/d-id/1327107?image_number=1
  • #15 Data Integration strategy must support your data lake goals AND operate across the entire data fabric. The data lake is part of the whole fabric, but it’s not the only thing.
  • #17 Ensure filling the lake without just dumping
  • #18 Modern data onboarding challenges go beyond just ‘connecting’ to data sources or ‘ingesting’ data into a store of choice. They introduce significant new challenges related to dealing with many more sources of data that may change over time. They also require a flexible, efficient, and governed process to be fully successful.
  • #19 ELT-type process
  • #21 Metadata: define what we mean. All metadata: data types and similar metadata, plus business metadata. New Data Mgmt Strategies: think beyond what you’ve done for the past year with data warehousing. Things like early ingest (get the data in, then process), capturing metadata on the way in, and automating the creation of analytic data models for user interaction. It really comes down to AUTOMATION: at the scale and pace of data in the modern world, you can't keep up by having IT or business users building models by hand. It must be automated.
  • #23 Prepare Data and Engineer New Features. Train, Tune and Test Models. Deploy and Operationalize Models. Update Models Regularly. Simplify Data Prep: prepare data and perform feature engineering tasks faster in an easy-to-use drag-and-drop environment (enabling self-service for data scientists). (Show the IT->DA->DS triangle of data foundation.) Use ML Tool of Choice: train, tune and test your R, Python, Spark MLlib or Weka machine learning algorithms faster to build more predictive models (for data scientists). Operationalize: quickly operationalize your data scientist’s machine learning models, whether they use R, Python, MLlib or Weka (for data engineers and IT in general).
  • #25 High level view