SlideShare a Scribd company logo
1 of 29
Five Keys to a
Killer Data Lake
Chuck Yarbrough
VP, Solutions Marketing and Management
Agenda
Why a Data Lake?
Five Keys to a Killer Data Lake
What You Should Do Now
Why a Data Lake?
Why a Data Lake?
What Do You Get from Data Lake?
More
accurate
intelligence
Transform
your
business
Ability to
increase
revenue
Create
new
products
Better
understand
your
customers
Streamline
operations
and improve
efficiencies
Five Keys to a Killer Data Lake
Align to
corporate strategy
1
Solid data
integration strategy
2
Big Data
on-boarding process
3
Embrace new data
management practices
4
Operationalize
machine learning
models
5
KEY #1
Align to Strategic Organizational Goals
Align Goals and Executive Buy In
• Understand corporate goals
• Identify executive leadership and sponsorship
• Recognize lack of alignment
• Ensure efforts are aligned with strategic goals
Align to Strategic Organizational Goals
Business Acceleration Operational Efficiency Security and Risk
Know your
customer
Customer 360
Churn
Recommendation
engine
Maximize
Profit
Pricing analytics
Targeted
promotions
Market basket
analytics
New Product
Development
Customization of
product
Next product to
build
Modernizing Data
Architecture
EDWO
Storage data
optimization
Industrial
IoT
Sensor Analytics
Predictive
Maintenance
TelematicsInfrastructure
Analytics
Risk
Credit scoring
Fraud detection
Security
Cyber security
Compliance
Trade compliance
Health care
compliance
Anti Money
Laundering
KEY #2
Have a Solid Data Integration Strategy
Data Integration Strategy
• Ensure organizational agreement on strategy
• Manage and automate the Data Pipeline
• Modernize your architecture
• Adaptive execution strategy
• Secure your data
• Accept that Data Governance is separate from Data Management
• Rethink Metadata Management
Managing and Automating the Pipeline
Administration Security Lifecycle
Management
Data
Provenance
Dynamic Data
Pipeline Monitoring Automation
Analytic Data Pipeline
DATA ENGINEERING DATA PREPARATION ANALYTICS
Cleanse Conform Shape
Transform Ingest
Refine Virtualize Blend
Orchestrate Prepare Enrich
Visualize Build Score
Analyze Model
Data
Lake
KEY #3
Establish a Big Data Onboarding Strategy
Filling the Data Lake
More Data, More Problems
Modern data onboarding is more than just “connecting” or “loading” – it includes:
Managing a
changing array of
data sources
Establishing
repeatable
processes at scale
Maintaining control
and governance
Dynamic Integration Processes Dynamic Transformations
Ingest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest Procedures
Big Data On Boarding
RDBMS
Hadoop
Disparate Data Sources
CSV
Integration Processes Transformations
Ingest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest Procedures
Metadata
Injection
KEY #4
Embrace New Data Management Strategies
Modern Data Management Strategies
• Adopt early ingest and adaptive processing
• Enable the capture of metadata on ingest
• Adopt streaming data processing where appropriate
• Model on the fly
• Modernize data integration infrastructure
• Extend data management to all data
• Apply analytics to all data
KEY #5
Apply Machine Learning Algorithms
Machine Learning Workflow
Prepare
Data
Engineer
Features
Train, Tune and
Tet Models
Update Models
Deploy and
Operationalize Models
360° View of the Taxpayer
Data Lake Blueprint
Global Data Integration
Ingest Blend and Refine
Network
Location
Web
EDW (x12)
Billing
Provisioning
Customer
Social
Media
Pentaho Data
Integration
Hadoop
Cluster
Data
Publisher
Analytical
Database
Pentaho
Analytics
Server
Existing BI and
Data Mining Tools
Data Lake
Pentaho Data Integration
Visual MapReduce
and some native PDI
Transformations On-demand
Data Marts
To be
decommissioned Deliver
 Do you want Protegrity logo
or keep it generic? Go ahead
and delete this note if you
don’t need it.
Uncover Billions of Tax Revenue
Challenge
• £34B missed tax revenue
• Managing 40 TB of data held across
11 separate legacy data warehouses
• Relied on consultants for reports that
required customization and long
lead time
Benefits
• 360 degree view of the tax citizen
• Created a single Big Data platform and
ability to consolidate 40 reporting
streams with self-service reporting
• New reports save an estimated 900 man
hours per day (based on a user-base of
1,200) by streamlining the reporting
process
Takeaways
Takeaways
Align with clear
corporate/strategic
initiatives
Embrace data
management
practices
Enable adaptive
data execution for
data processing
and integration
Drive adoption
of Machine
Learning and
Automation
Questions
Thank You
Chuck Yarbrough
VP, Solutions Marketing and Management
@cyarbrough

More Related Content

What's hot

Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataCloudera, Inc.
 
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessEntity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessDataWorks Summit
 
Performance Acceleration: Summaries, Recommendation, MPP and more
Performance Acceleration: Summaries, Recommendation, MPP and morePerformance Acceleration: Summaries, Recommendation, MPP and more
Performance Acceleration: Summaries, Recommendation, MPP and moreDenodo
 
Emergence of MongoDB as an Enterprise Data Hub
Emergence of MongoDB as an Enterprise Data HubEmergence of MongoDB as an Enterprise Data Hub
Emergence of MongoDB as an Enterprise Data HubMongoDB
 
Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureLorenzo Nicora
 
The Emerging Data Lake IT Strategy
The Emerging Data Lake IT StrategyThe Emerging Data Lake IT Strategy
The Emerging Data Lake IT StrategyThomas Kelly, PMP
 
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Technologies
 
Enterprise 360 - Graphs at the Center of a Data Fabric
Enterprise 360 - Graphs at the Center of a Data FabricEnterprise 360 - Graphs at the Center of a Data Fabric
Enterprise 360 - Graphs at the Center of a Data FabricPrecisely
 
Ibm big data ibm marriage of hadoop and data warehousing
Ibm big dataibm marriage of hadoop and data warehousingIbm big dataibm marriage of hadoop and data warehousing
Ibm big data ibm marriage of hadoop and data warehousing DataWorks Summit
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubCloudera, Inc.
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
Big Data Fabric for At-Scale Real-Time Analysis by Edwin RobbinsData Con LA
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Denodo
 
Why Data Virtualization? An Introduction.
Why Data Virtualization? An Introduction.Why Data Virtualization? An Introduction.
Why Data Virtualization? An Introduction.Denodo
 
Rethink Analytics with an Enterprise Data Hub
Rethink Analytics with an Enterprise Data HubRethink Analytics with an Enterprise Data Hub
Rethink Analytics with an Enterprise Data HubCloudera, Inc.
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use CasesInSemble
 
Moving to the Cloud: Modernizing Data Architecture in Healthcare
Moving to the Cloud: Modernizing Data Architecture in HealthcareMoving to the Cloud: Modernizing Data Architecture in Healthcare
Moving to the Cloud: Modernizing Data Architecture in HealthcarePerficient, Inc.
 
Data Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureData Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureZaloni
 
Increase your ROI with Hadoop in Six Months - Presented by Dell, Cloudera and...
Increase your ROI with Hadoop in Six Months - Presented by Dell, Cloudera and...Increase your ROI with Hadoop in Six Months - Presented by Dell, Cloudera and...
Increase your ROI with Hadoop in Six Months - Presented by Dell, Cloudera and...Cloudera, Inc.
 

What's hot (20)

Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
 
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessEntity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
 
Performance Acceleration: Summaries, Recommendation, MPP and more
Performance Acceleration: Summaries, Recommendation, MPP and morePerformance Acceleration: Summaries, Recommendation, MPP and more
Performance Acceleration: Summaries, Recommendation, MPP and more
 
Emergence of MongoDB as an Enterprise Data Hub
Emergence of MongoDB as an Enterprise Data HubEmergence of MongoDB as an Enterprise Data Hub
Emergence of MongoDB as an Enterprise Data Hub
 
Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and Future
 
The Emerging Data Lake IT Strategy
The Emerging Data Lake IT StrategyThe Emerging Data Lake IT Strategy
The Emerging Data Lake IT Strategy
 
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
 
Enterprise 360 - Graphs at the Center of a Data Fabric
Enterprise 360 - Graphs at the Center of a Data FabricEnterprise 360 - Graphs at the Center of a Data Fabric
Enterprise 360 - Graphs at the Center of a Data Fabric
 
Ibm big data ibm marriage of hadoop and data warehousing
Ibm big dataibm marriage of hadoop and data warehousingIbm big dataibm marriage of hadoop and data warehousing
Ibm big data ibm marriage of hadoop and data warehousing
 
The Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data HubThe Future of Data Management: The Enterprise Data Hub
The Future of Data Management: The Enterprise Data Hub
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
 
Why Data Virtualization? An Introduction.
Why Data Virtualization? An Introduction.Why Data Virtualization? An Introduction.
Why Data Virtualization? An Introduction.
 
Rethink Analytics with an Enterprise Data Hub
Rethink Analytics with an Enterprise Data HubRethink Analytics with an Enterprise Data Hub
Rethink Analytics with an Enterprise Data Hub
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Moving to the Cloud: Modernizing Data Architecture in Healthcare
Moving to the Cloud: Modernizing Data Architecture in HealthcareMoving to the Cloud: Modernizing Data Architecture in Healthcare
Moving to the Cloud: Modernizing Data Architecture in Healthcare
 
Smart data for a predictive bank
Smart data for a predictive bankSmart data for a predictive bank
Smart data for a predictive bank
 
Data Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureData Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data Architecture
 
Increase your ROI with Hadoop in Six Months - Presented by Dell, Cloudera and...
Increase your ROI with Hadoop in Six Months - Presented by Dell, Cloudera and...Increase your ROI with Hadoop in Six Months - Presented by Dell, Cloudera and...
Increase your ROI with Hadoop in Six Months - Presented by Dell, Cloudera and...
 

Similar to Five Keys to Building a Killer Data Lake

Five keys to a killer data lake - dataworkssummit 2017
Five keys to a killer data lake - dataworkssummit 2017Five keys to a killer data lake - dataworkssummit 2017
Five keys to a killer data lake - dataworkssummit 2017Chuck Yarbrough
 
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalDataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalHarvinder Atwal
 
Business Intelligence and Analytics Capability
Business Intelligence and Analytics CapabilityBusiness Intelligence and Analytics Capability
Business Intelligence and Analytics CapabilityALTEN Calsoft Labs
 
Building a 360 Degree View of Your Customers on BICS
Building a 360 Degree View of Your Customers on BICSBuilding a 360 Degree View of Your Customers on BICS
Building a 360 Degree View of Your Customers on BICSPerficient, Inc.
 
Migrating Analytics to the Cloud at Fannie Mae
Migrating Analytics to the Cloud at Fannie MaeMigrating Analytics to the Cloud at Fannie Mae
Migrating Analytics to the Cloud at Fannie MaeDataWorks Summit
 
Five Attributes to a Successful Big Data Strategy
Five Attributes to a Successful Big Data StrategyFive Attributes to a Successful Big Data Strategy
Five Attributes to a Successful Big Data StrategyPerficient, Inc.
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 
DataOps: Nine steps to transform your data science impact Strata London May 18
DataOps: Nine steps to transform your data science impact  Strata London May 18DataOps: Nine steps to transform your data science impact  Strata London May 18
DataOps: Nine steps to transform your data science impact Strata London May 18Harvinder Atwal
 
Adopting a Process-Driven Approach to Master Data Management
Adopting a Process-Driven Approach to Master Data ManagementAdopting a Process-Driven Approach to Master Data Management
Adopting a Process-Driven Approach to Master Data ManagementSoftware AG
 
IBM Solutions Connect 2013 - Getting started with Big Data
IBM Solutions Connect 2013 - Getting started with Big DataIBM Solutions Connect 2013 - Getting started with Big Data
IBM Solutions Connect 2013 - Getting started with Big DataIBM Software India
 
#bluecruxtalks in May: Building master data factories, together
#bluecruxtalks in May: Building master data factories, together#bluecruxtalks in May: Building master data factories, together
#bluecruxtalks in May: Building master data factories, togetherBluecrux
 
Information Excellence for Digital Transformation
Information Excellence for Digital TransformationInformation Excellence for Digital Transformation
Information Excellence for Digital TransformationMethod360
 
SG Data Mgt - Findings and Recommendations.pptx
SG Data Mgt - Findings and Recommendations.pptxSG Data Mgt - Findings and Recommendations.pptx
SG Data Mgt - Findings and Recommendations.pptxssuser57f752
 
Bigdata and Analytics Services - Clover Infotech
Bigdata and Analytics Services - Clover InfotechBigdata and Analytics Services - Clover Infotech
Bigdata and Analytics Services - Clover InfotechSwetha Elias
 
Data Mashups for Analytics
Data Mashups for AnalyticsData Mashups for Analytics
Data Mashups for AnalyticsPentaho
 
Data Mashups for Analytics
Data Mashups for AnalyticsData Mashups for Analytics
Data Mashups for AnalyticsKatharine Bierce
 
Big Data, Big Thinking: Untapped Opportunities
Big Data, Big Thinking: Untapped OpportunitiesBig Data, Big Thinking: Untapped Opportunities
Big Data, Big Thinking: Untapped OpportunitiesSAP Technology
 

Similar to Five Keys to Building a Killer Data Lake (20)

Five keys to a killer data lake - dataworkssummit 2017
Five keys to a killer data lake - dataworkssummit 2017Five keys to a killer data lake - dataworkssummit 2017
Five keys to a killer data lake - dataworkssummit 2017
 
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalDataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
 
A9 schubert
A9 schubertA9 schubert
A9 schubert
 
Business Intelligence and Analytics Capability
Business Intelligence and Analytics CapabilityBusiness Intelligence and Analytics Capability
Business Intelligence and Analytics Capability
 
Building a 360 Degree View of Your Customers on BICS
Building a 360 Degree View of Your Customers on BICSBuilding a 360 Degree View of Your Customers on BICS
Building a 360 Degree View of Your Customers on BICS
 
Migrating Analytics to the Cloud at Fannie Mae
Migrating Analytics to the Cloud at Fannie MaeMigrating Analytics to the Cloud at Fannie Mae
Migrating Analytics to the Cloud at Fannie Mae
 
How Businesses use Big Data to Impact the Bottom Line
How Businesses use Big Data to Impact the Bottom LineHow Businesses use Big Data to Impact the Bottom Line
How Businesses use Big Data to Impact the Bottom Line
 
06 summary
06 summary06 summary
06 summary
 
Five Attributes to a Successful Big Data Strategy
Five Attributes to a Successful Big Data StrategyFive Attributes to a Successful Big Data Strategy
Five Attributes to a Successful Big Data Strategy
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
DataOps: Nine steps to transform your data science impact Strata London May 18
DataOps: Nine steps to transform your data science impact  Strata London May 18DataOps: Nine steps to transform your data science impact  Strata London May 18
DataOps: Nine steps to transform your data science impact Strata London May 18
 
Adopting a Process-Driven Approach to Master Data Management
Adopting a Process-Driven Approach to Master Data ManagementAdopting a Process-Driven Approach to Master Data Management
Adopting a Process-Driven Approach to Master Data Management
 
IBM Solutions Connect 2013 - Getting started with Big Data
IBM Solutions Connect 2013 - Getting started with Big DataIBM Solutions Connect 2013 - Getting started with Big Data
IBM Solutions Connect 2013 - Getting started with Big Data
 
#bluecruxtalks in May: Building master data factories, together
#bluecruxtalks in May: Building master data factories, together#bluecruxtalks in May: Building master data factories, together
#bluecruxtalks in May: Building master data factories, together
 
Information Excellence for Digital Transformation
Information Excellence for Digital TransformationInformation Excellence for Digital Transformation
Information Excellence for Digital Transformation
 
SG Data Mgt - Findings and Recommendations.pptx
SG Data Mgt - Findings and Recommendations.pptxSG Data Mgt - Findings and Recommendations.pptx
SG Data Mgt - Findings and Recommendations.pptx
 
Bigdata and Analytics Services - Clover Infotech
Bigdata and Analytics Services - Clover InfotechBigdata and Analytics Services - Clover Infotech
Bigdata and Analytics Services - Clover Infotech
 
Data Mashups for Analytics
Data Mashups for AnalyticsData Mashups for Analytics
Data Mashups for Analytics
 
Data Mashups for Analytics
Data Mashups for AnalyticsData Mashups for Analytics
Data Mashups for Analytics
 
Big Data, Big Thinking: Untapped Opportunities
Big Data, Big Thinking: Untapped OpportunitiesBig Data, Big Thinking: Untapped Opportunities
Big Data, Big Thinking: Untapped Opportunities
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

Five Keys to Building a Killer Data Lake

  • 1. Five Keys to a Killer Data Lake Chuck Yarbrough VP, Solutions Marketing and Management
  • 2. Agenda Why a Data Lake? Five Keys to a Killer Data Lake What You Should Do Now
  • 3. Why a Data Lake?
  • 4. Why a Data Lake?
  • 5. What Do You Get from Data Lake? More accurate intelligence Transform your business Ability to increase revenue Create new products Better understand your customers Streamline operations and improve efficiencies
  • 6.
  • 7.
  • 8. Five Keys to a Killer Data Lake Align to corporate strategy 1 Solid data integration strategy 2 Big Data on-boarding process 3 Embrace new data management practices 4 Operationalize machine learning models 5
  • 9. KEY #1 Align to Strategic Organizational Goals
  • 10. Align Goals and Executive Buy In • Understand corporate goals • Identify executive leadership and sponsorship • Recognize lack of alignment • Ensure efforts are aligned with strategic goals
  • 11. Align to Strategic Organizational Goals Business Acceleration Operational Efficiency Security and Risk Know your customer Customer 360 Churn Recommendation engine Maximize Profit Pricing analytics Targeted promotions Market basket analytics New Product Development Customization of product Next product to build Modernizing Data Architecture EDWO Storage data optimization Industrial IoT Sensor Analytics Predictive Maintenance TelematicsInfrastructure Analytics Risk Credit scoring Fraud detection Security Cyber security Compliance Trade compliance Health care compliance Anti Money Laundering
  • 12. KEY #2 Have a Solid Data Integration Strategy
  • 13. Data Integration Strategy • Ensure organizational agreement on strategy • Manage and automate the Data Pipeline • Modernize your architecture • Adaptive execution strategy • Secure your data • Accept that Data Governance is separate from Data Management • Rethink Metadata Management
  • 14. Managing and Automating the Pipeline Administration Security Lifecycle Management Data Provenance Dynamic Data Pipeline Monitoring Automation Analytic Data Pipeline DATA ENGINEERING DATA PREPARATION ANALYTICS Cleanse Conform Shape Transform Ingest Refine Virtualize Blend Orchestrate Prepare Enrich Visualize Build Score Analyze Model Data Lake
  • 15. KEY #3 Establish a Big Data Onboarding Strategy
  • 17. More Data, More Problems Modern data onboarding is more than just “connecting” or “loading” – it includes: Managing a changing array of data sources Establishing repeatable processes at scale Maintaining control and governance
  • 18. Dynamic Integration Processes Dynamic Transformations Ingest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest Procedures Big Data On Boarding RDBMS Hadoop Disparate Data Sources CSV Integration Processes Transformations Ingest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest ProceduresIngest Procedures Metadata Injection
  • 19. KEY #4 Embrace New Data Management Strategies
  • 20. Modern Data Management Strategies • Adopt early ingest and adaptive processing • Enable the capture of metadata on ingest • Adopt streaming data processing where appropriate • Model on the fly • Modernize data integration infrastructure • Extend data management to all data • Apply analytics to all data
  • 21. KEY #5 Apply Machine Learning Algorithms
  • 22. Machine Learning Workflow Prepare Data Engineer Features Train, Tune and Tet Models Update Models Deploy and Operationalize Models
  • 23. 360° View of the Taxpayer
  • 24. Data Lake Blueprint Global Data Integration Ingest Blend and Refine Network Location Web EDW (x12) Billing Provisioning Customer Social Media Pentaho Data Integration Hadoop Cluster Data Publisher Analytical Database Pentaho Analytics Server Existing BI and Data Mining Tools Data Lake Pentaho Data Integration Visual MapReduce and some native PDI Transformations On-demand Data Marts To be decommissioned Deliver  Do you want Protegrity logo or keep it generic? Go ahead and delete this note if you don’t need it.
  • 25. Uncover Billions of Tax Revenue Challenge • £34B missed tax revenue • Managing 40 TB of data held across 11 separate legacy data warehouses • Relied on consultants for reports that required customization and long lead time Benefits • 360 degree view of the tax citizen • Created a single Big Data platform and ability to consolidate 40 reporting streams with self-service reporting • New reports save an estimated 900 man hours per day (based on a user-base of 1,200) by streamlining the reporting process
  • 27. Takeaways Align with clear corporate/strategic initiatives Embrace data management practices Enable adaptive data execution for data processing and integration Drive adoption of Machine Learning and Automation
  • 29. Thank You Chuck Yarbrough VP, Solutions Marketing and Management @cyarbrough

Editor's Notes

  1. Explain why a data lake – and when I talk about a data lake I expect it to be clean, pure and pristine … Might be good to use a quote from an analyst or analyst blog – maybe from Philip Russom - https://upside.tdwi.org/articles/2016/11/03/benefits-of-hadoop-data-lake.aspx
  2. Explain why a data lake – and when I talk about a data lake I expect it to be clean, pure and pristine … Might be good to use a quote from an analyst or analyst blog – maybe from Philip Russom - https://upside.tdwi.org/articles/2016/11/03/benefits-of-hadoop-data-lake.aspx
  3. Better intelligence is kind of obvious, but the rest are the key benefits – pointing back to the strategic initiatives that can be achieved .
  4. Not so easy… There is no EASY button – Let’s be honest, Hadoop is Hard And here’s the other problem…. When hadoop first hit the scene, we recognized how great it was in that we could put data into hadoop – litterally “dump” data in – and guess what… Next slide
  5. That’s what we got! A dump – We loaded all our data into the truck, backed the truck up and dumped it….
  6. So to ensure you have a pure clean, pristine data lake, you have to put thought into it long before you start dumping…
  7. Understand Corporate Goals Recognize Lack of Alignment Identify Executive Leadership and Sponsorship Ensure Efforts are aligned with Strategic Goals
  8. Align your efforts with the organizations strategic goals and initiatives – naming business acceleration, operational efficiency, and security and risk
  9. Ensure Organizational Agreement on Strategy – find ways to incent appropriate behavior of data owners In the world Big Data, yoru data integration strategy is what enables you to manage – in fact, big data governance is based on your big data integration strategy – you have to have a solid strategy not just tools (and certainly not the same old tools you’ve been using Accept that Data Governance is Separate from Data Management - understand that you no longer control data – it’s on prem, in the cloud, in apps, - accept it and govern accordingly Look at this: http://www.informationweek.com/big-data/big-data-analytics/8-critical-elements-of-a-successful-data-integration-strategy/d/d-id/1327107?image_number=1
  10. Data Integration strategy must support your data lake goals AND operate across the entire data fabric – the data lake part of the whole fabric, but it’s the only thing
  11. Ensure filling the lake without just dumping
  12. Modern data onboarding challenges go beyond just ‘connecting’ to data sources or ‘ingesting’ data into a store of choice They introduce significant new challenges related to dealing with many more source of data that may changes over time They also require a flexible, efficient, and governed process to be fully successful…
  13. ELT type process
  14. Metadata – define what we mean – all metadata – data types, etc metadata and business metadata New Data Mgmt Strategies – think beyond what you’ve done for the past year with data warehousing – things like early ingest (get the data in, then process), capture matadata on the way in, automate the creation of anlaytic data models for use interaction - Really comes down to AUTOMATION – at the scale and pace fo data int eh modern world, you can keep up by having IT or buisness users building models by had - must be automated.
  15. Prepare Data and Engineer New Features Train, Tune and Test Models Deploy and Operationalize Models Update Models Regularly Simplify Data Prep Prepare data and perform feature engineering tasks faster in an easy to use drag and drop environment (enabling self-service for data scientists) (show the IT->DA->DS triangle of data foundation Use ML Tool of Choice Train, tune and test your R, Python, Spark MLlib or Weka machine learning algorithms faster to build more predictive models (for data scientists) Operationalize Quickly operationalize your data scientist’s machine learning models, whether they use R, Python, MLlib or Weka (for data engineers and IT in general)
  16. High level view