SlideShare a Scribd company logo
1 of 28
How to build a successful Data
Lake
‣ Alex Gorelik
‣ Founder and CEO, Waterline Data
Alex Gorelik
‣ Founder and CEO, Waterline Data
‣ 3rd start up: Acta (SAP), Exeros (IBM), Waterline
‣ 25 years of building data related technology (Sybase, IBM, Informatica)
‣ 5 years of hands on IT consulting (BAE, Unilever, IBM, Jysk, Stanford
Hospital, Kaiser, etc.)
‣ Last 5 years focused on Big Data (Waterline, Informatica, Menlo
Ventures, IBM)
‣ IBM Distinguished Engineer
‣ BS CS Columbia, MS CS Stanford
Visit Booth 640 (Waterline Data)
and enter to win a signed copy of
The Enterprise Big Data Lake
to be published this
year by O’Reilly Media
For a free two-chapter
preview, go to
http://go.waterlinedata.com/stratasj
Data Lake Powers Data Driven Decision Making
Business value
Data Lake
Data
Warehouse
Off-loading
Data
Puddles
Data
Swamp
Value
No Value
Cost Savings
Limited Scope and Value
Enterprise Impact
Data Swamps
‣ Raw data
‣ Can’t find or use
data
‣ Can’t allow access
without protecting
sensitive data
Data Warehouse Off-loading: Cost Savings
I prefer DW – it’s
more predictable
It takes IT 3 months of data
architecture and ETL work to
add new data to the Data Lake.
I can’t get the original data

Data Puddles: Limited Scope and Value
What Makes a Successful Data Lake?
Right Data Right Interface
Right Platform + +
Right Platform: Hadoop
‣ Volume - Massively scalable
‣ Variety - Schema on read
‣ Future Proof – Modular – same data can be used by
many different projects and technologies
‣ Platform cost – extremely attractive cost structure
Right Data Challenges: Most Data is Lost,
so can’t Analyze it Later
Only a Small Portion of Data is Saved in
Enterprises Today in Data Warehouses
Right Data: Save Raw Data Now to
Analyze Later
‣ Don’t know now what data
will be needed later
‣ Save as much data as
possible now to analyze
later
‣ Save raw data, so can be
treated correctly for each
use case
Right Data Challenges: Data silos and data
hoarding
‣ Departments hoard and protect their
data and do not share it with the rest
of the enterprise
‣ Frictionless ingestion does not
depend on data owners
Right Interface: Key to Broad Adoption
‣ Data marketplace
for data self-service
‣ Providing data at
the right level of
expertise
Providing Data at the Right Level of
Expertise
Data Scientists Business Analyst
Raw
data
Clean, trusted,
prepared data
Roadmap to Data Lake Success
Organize the lake
Set up for Self-Service
Open the lake to the users
Organize Data Lake into Zones Organize
the lake
Multi-modal IT – Different Governance
Levels for Different Zones
• Heavy Governance
• Trusted, curated data
• Lineage, Data Quality
• Minimal Governance
• Make sure there is no
sensitive data
• Heavy governance
• Restricted access
• Minimal Governance
• Make sure there is no
sensitive data
Raw/Lan
ding
Sensitive
Gold/Cu
rated
Work
Multi-modal IT with different governance levels in different zones
Business Analysts
Data Stewards
Data Scientists
Data Scientists
Data Engineers
Business Analyst Self-Service Workflow
Find and
Understand
Provision Prep Analyze
Set up
for Self-
Service
Finding and Understanding Data
Crowd-source metadata and
automate creation of a catalog
1. Institutionalize tribal knowledge
2. Automate discovery to cover all
data sets
3. Establish Trust
• Curated annotated data sets
• Lineage
• Data quality
• Governance
Find and
Understand
Accessing and Provisioning Data
Top down approach
‣ Find and de-identify all
sensitive data
‣ Provide access to every user
for every dataset as needed
Agile/Self-Service Approach
‣ Create a metadata only catalog
‣ When users request access, data
is de-identified and provisioned
Cannot give all access to all users
Must protect PII data and Sensitive business information
Provision
Provide data marketplace interface to find,
understand and provision data
Data Prep
‣ Prepare data for analytics
• Clean data
- Remove or fix bad data
- Fill in missing values
- Convert to common units of measure
• Shape data
- Combine (join, concatenate)
- Resolve entities (e.g., create a single customer record from multiple records or sources)
- Transform (aggregate, filter, bucketize, convert codes to names, etc.)
• Blend data - harmonize data from multiple sources to a common schema/model
‣ Tooling
• Emerging space with many great dedicated data wrangling tools
• Some capabilities in BI/Visualization tools
• SQL and scripting languages for the more technical analysts
Prep
Data Analysis
‣ Many wonderful self-service
BI and Visualization tools
‣ Mature space with many
established and innovative
vendors
Analyze
Magic Quadrant for Business Intelligence and Analytics Platforms
04 February 2016 | ID:G00275847
Analyst(s): Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao
Tapadinhas, Kurt Schlegel, Thomas W. Oestreich
Successful Data Lake
Right Data Right Interface
Right Platform + +
Q&A
Visit Booth 640 (Waterline Data)
and enter to win a signed copy of
The Enterprise Big Data Lake
to be published this
year by O’Reilly Media
For a free two-chapter
preview, go to
http://go.waterlinedata.com/stratasj
Thank you

More Related Content

Similar to How to build a successful data lake Presentation.pptx

The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopInside Analysis
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneySai Paravastu
 
Active Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with AlationActive Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with AlationDatabricks
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationInside Analysis
 
Got data?… now what? An introduction to modern data platforms
Got data?… now what?  An introduction to modern data platformsGot data?… now what?  An introduction to modern data platforms
Got data?… now what? An introduction to modern data platformsJamesAnderson599331
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation Caserta
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 
Managing Data Warehouse Growth in the New Era of Big Data
Managing Data Warehouse Growth in the New Era of Big DataManaging Data Warehouse Growth in the New Era of Big Data
Managing Data Warehouse Growth in the New Era of Big DataVineet
 
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...MongoDB
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefitsRicky Barron
 
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...Mark Rittman
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarioskcmallu
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesDataWorks Summit
 
Four Key Considerations for your Big Data Analytics Strategy
Four Key Considerations for your Big Data Analytics StrategyFour Key Considerations for your Big Data Analytics Strategy
Four Key Considerations for your Big Data Analytics StrategyArcadia Data
 

Similar to How to build a successful data lake Presentation.pptx (20)

The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of Hadoop
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, Sydney
 
Active Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with AlationActive Governance Across the Delta Lake with Alation
Active Governance Across the Delta Lake with Alation
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data Implementation
 
Got data?… now what? An introduction to modern data platforms
Got data?… now what?  An introduction to modern data platformsGot data?… now what?  An introduction to modern data platforms
Got data?… now what? An introduction to modern data platforms
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Managing Data Warehouse Growth in the New Era of Big Data
Managing Data Warehouse Growth in the New Era of Big DataManaging Data Warehouse Growth in the New Era of Big Data
Managing Data Warehouse Growth in the New Era of Big Data
 
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...
MongoDB IoT City Tour LONDON: Hadoop and the future of data management. By, M...
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefits
 
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
 
Four Key Considerations for your Big Data Analytics Strategy
Four Key Considerations for your Big Data Analytics StrategyFour Key Considerations for your Big Data Analytics Strategy
Four Key Considerations for your Big Data Analytics Strategy
 

Recently uploaded

Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 

Recently uploaded (20)

Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 

How to build a successful data lake Presentation.pptx

  • 1. How to build a successful Data Lake ‣ Alex Gorelik ‣ Founder and CEO, Waterline Data
  • 2. Alex Gorelik ‣ Founder and CEO, Waterline Data ‣ 3rd start up: Acta (SAP), Exeros (IBM), Waterline ‣ 25 years of building data related technology (Sybase, IBM, Informatica) ‣ 5 years of hands on IT consulting (BAE, Unilever, IBM, Jysk, Stanford Hospital, Kaiser, etc.) ‣ Last 5 years focused on Big Data (Waterline, Informatica, Menlo Ventures, IBM) ‣ IBM Distinguished Engineer ‣ BS CS Columbia, MS CS Stanford
  • 3. Visit Booth 640 (Waterline Data) and enter to win a signed copy of The Enterprise Big Data Lake to be published this year by O’Reilly Media For a free two-chapter preview, go to http://go.waterlinedata.com/stratasj
  • 4. Data Lake Powers Data Driven Decision Making
  • 5. Business value Data Lake Data Warehouse Off-loading Data Puddles Data Swamp Value No Value Cost Savings Limited Scope and Value Enterprise Impact
  • 6. Data Swamps ‣ Raw data ‣ Can’t find or use data ‣ Can’t allow access without protecting sensitive data
  • 7. Data Warehouse Off-loading: Cost Savings I prefer DW – it’s more predictable It takes IT 3 months of data architecture and ETL work to add new data to the Data Lake. I can’t get the original data 
  • 8. Data Puddles: Limited Scope and Value
  • 9. What Makes a Successful Data Lake? Right Data Right Interface Right Platform + +
  • 10. Right Platform: Hadoop ‣ Volume - Massively scalable ‣ Variety - Schema on read ‣ Future Proof – Modular – same data can be used by many different projects and technologies ‣ Platform cost – extremely attractive cost structure
  • 11. Right Data Challenges: Most Data is Lost, so can’t Analyze it Later Only a Small Portion of Data is Saved in Enterprises Today in Data Warehouses
  • 12. Right Data: Save Raw Data Now to Analyze Later ‣ Don’t know now what data will be needed later ‣ Save as much data as possible now to analyze later ‣ Save raw data, so can be treated correctly for each use case
  • 13. Right Data Challenges: Data silos and data hoarding ‣ Departments hoard and protect their data and do not share it with the rest of the enterprise ‣ Frictionless ingestion does not depend on data owners
  • 14. Right Interface: Key to Broad Adoption ‣ Data marketplace for data self-service ‣ Providing data at the right level of expertise
  • 15. Providing Data at the Right Level of Expertise Data Scientists Business Analyst Raw data Clean, trusted, prepared data
  • 16. Roadmap to Data Lake Success Organize the lake Set up for Self-Service Open the lake to the users
  • 17. Organize Data Lake into Zones Organize the lake
  • 18. Multi-modal IT – Different Governance Levels for Different Zones • Heavy Governance • Trusted, curated data • Lineage, Data Quality • Minimal Governance • Make sure there is no sensitive data • Heavy governance • Restricted access • Minimal Governance • Make sure there is no sensitive data Raw/Lan ding Sensitive Gold/Cu rated Work Multi-modal IT with different governance levels in different zones Business Analysts Data Stewards Data Scientists Data Scientists Data Engineers
  • 19. Business Analyst Self-Service Workflow Find and Understand Provision Prep Analyze Set up for Self- Service
  • 20. Finding and Understanding Data Crowd-source metadata and automate creation of a catalog 1. Institutionalize tribal knowledge 2. Automate discovery to cover all data sets 3. Establish Trust • Curated annotated data sets • Lineage • Data quality • Governance Find and Understand
  • 21. Accessing and Provisioning Data Top down approach ‣ Find and de-identify all sensitive data ‣ Provide access to every user for every dataset as needed Agile/Self-Service Approach ‣ Create a metadata only catalog ‣ When users request access, data is de-identified and provisioned Cannot give all access to all users Must protect PII data and Sensitive business information Provision
  • 22. Provide data marketplace interface to find, understand and provision data
  • 23. Data Prep ‣ Prepare data for analytics • Clean data - Remove or fix bad data - Fill in missing values - Convert to common units of measure • Shape data - Combine (join, concatenate) - Resolve entities (e.g., create a single customer record from multiple records or sources) - Transform (aggregate, filter, bucketize, convert codes to names, etc.) • Blend data - harmonize data from multiple sources to a common schema/model ‣ Tooling • Emerging space with many great dedicated data wrangling tools • Some capabilities in BI/Visualization tools • SQL and scripting languages for the more technical analysts Prep
  • 24. Data Analysis ‣ Many wonderful self-service BI and Visualization tools ‣ Mature space with many established and innovative vendors Analyze Magic Quadrant for Business Intelligence and Analytics Platforms 04 February 2016 | ID:G00275847 Analyst(s): Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich
  • 25. Successful Data Lake Right Data Right Interface Right Platform + +
  • 26. Q&A
  • 27. Visit Booth 640 (Waterline Data) and enter to win a signed copy of The Enterprise Big Data Lake to be published this year by O’Reilly Media For a free two-chapter preview, go to http://go.waterlinedata.com/stratasj