SlideShare a Scribd company logo
Is the traditionnel data
warehouse dead?
James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com
(Data Lake and Data Warehouse – the
best of both worlds)
About Me
 Microsoft, Big Data Evangelist
 In IT for 30 years, worked on many BI and DW projects
 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
 Been perm employee, contractor, consultant, business owner
 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
 Blog at JamesSerra.com
 Former SQL Server MVP
 Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
 Data Warehouse
 Data Lake
 The best of both worlds
 Federated querying
 Patterns
Considering Data Types
Audio, video, images. Meaningless
without adding some structure
Unstructured
JSON, XML, sensor data, social media,
device data, web logs. Flexible data
model structure
Semi-Structured
Structured CSV, Columnar Storage (Parquet,
ORC). Strict data model structure
Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
Observation
Pattern
Theory
Hypothesis
What will
happen?
How can we
make it happen?
Predictive
Analytics
Prescriptive
Analytics
What
happened?
Why did
it happen?
Descriptive
Analytics
Diagnostic
Analytics
Confirmation
Theory
Hypothesis
Observation
Two Approaches to getting value out of data: Top-Down +
Bottoms-Up
Of course you still need a data warehouse
A data warehouse is where you store data from multiple data sources to be used for historical and
trend analysis reporting. It acts as a central repository for many subject areas and contains the "single
version of truth".
Reasons for a data warehouse:
 Reduce stress on production system
 Optimized for read access, sequential disk scans
 Integrate many sources of data
 Keep historical records (no need to save hardcopy reports)
 Restructure/rename tables and fields, model data
 Protect against source system upgrades
 Use Master Data Management, including hierarchies
 No IT involvement needed for users to create reports
 Improve data quality and plugs holes in source systems
 One version of the truth
 Easy to create BI solutions on top of it (i.e. SSAS Cubes)
Implement Data Warehouse
Physical Design
ETL
Development
Reporting &
Analytics
Development
Install and Tune
Reporting &
Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand
Corporate
Strategy
Traditional Data Warehousing Uses A Top-Down Approach
Data sources
Gather
Requirements
Business
Requirements
Technical
Requirements
ETL pipeline
Dedicated ETL tools (e.g. SSIS)
Defined schema
Queries
Results
Relational
LOB
Applications
Traditional business analytics process
1. Start with end-user requirements to identify desired reports
and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create a Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to target schema
(‘schema-on-write’)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
14
Harness the growing and changing nature of data
Need to collect any data
StreamingStructured
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
Get the right information to the right people at the right time in the right format
Unstructured
“ ”
The three V’s
Store indefinitely Analyze See results
Gather data
from all sources
Iterate
New big data thinking: All data has value
All data has potential value
Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read).
Apps and users interpret the data as they see fit
17
The “data lake” Uses A Bottoms-Up Approach
Ingest all data
regardless of requirements
Store all data
in native format without
schema definition
Do analysis
Using analytic engines
like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• Inexpensively store unlimited data
• Collect all data “just in case”
• Store data with no modeling – “Schema on read”
• Complements EDW
• Frees up expensive EDW resources
• Quick user access to data
• ETL Hadoop tools
• Easily scalable
• Place to backup data to
• Place to move older data
Needs data governance so your data lake does not turn
into a data swamp!
The real cost of Hadoop
https://www.scribd.com/document/172491475/WinterCorp-
Report-Big-Data-What-Does-It-Really-Cost/
A data lake is just a glorified file folder with data files in it – how many end-users can
accurately create reports from it?
• Query performance not as good as relational database
• Complex query support not good due to lack of query optimizer, in-database operators, advanced memory management,
concurrency, dynamic workload management and robust indexing
• Concurrency limitations
• No concept of “hot” and “cold” data storage with different levels of performance to reduce cost
• Not a DBMS so lack of features such as update/delete of data, referential integrity, statistics, ACID compliance, data security
• File based so no granular security definition at the column level
• No metadata stored in HDFS, so another tool required adding complexity and slowing performance
• Finding expertise in Hadoop is very difficult
• Super complex, with lot’s of integration with multiple technologies to make it work
• Many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard
• Lack of master data management tools for Hadoop
• Requires end-users to learn new reporting tools and Hadoop technologies to query the data
• Pace of change is so quick many Hadoop technologies become obsolete, adding risk
• Lack of cost savings: cloud consumption, support, licenses, training, and migration costs
• Need conversion process to convert data to a relational format if a reporting tool requires it
• Some reporting tools don’t work against Hadoop
Current state of a data warehouse
Traditional Approaches
CRMERPOLTP LOB
DATA SOURCES ETL DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
BI AND ANALYTCIS
Emailed,
centrally
stored Excel
reports and
dashboards
Well manicured, often relational
sources
Known and expected data volume
and formats
Little to no change
Complex, rigid transformations
Required extensive monitoring
Transformed historical into read
structures
Flat, canned or multi-dimensional
access to historical data
Many reports, multiple versions of
the truth
24 to 48h delay
MONITORING AND TELEMETRY
Current state of a data warehouse
Traditional Approaches
CRMERPOLTP LOB
DATA SOURCES ETL DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
BI AND ANALYTCIS
Emailed,
centrally
stored Excel
reports and
dashboards
Increase in variety of data sources
Increase in data volume
Increase in types of data
Pressure on the ingestion engine
Complex, rigid transformations can’t
longer keep pace
Monitoring is abandoned
Delay in data, inability to transform
volumes, or react to new sources
Repair, adjust and redesign ETL
Reports become invalid or unusable
Delay in preserved reports increases
Users begin to “innovate” to relieve
starvation
MONITORING AND TELEMETRY
INCREASING DATA VOLUME NON-RELATIONAL DATA
INCREASE IN TIME
STALE REPORTING
Data Lake Transformation (ELT not ETL)
New Approaches
All data sources are considered
Leverages the power of on-prem
technologies and the cloud for
storage and capture
Native formats, streaming data, big
data
Extract and load, no/minimal transform
Storage of data in near-native format
Orchestration becomes possible
Streaming data accommodation becomes
possible
Refineries transform data on read
Produce curated data sets to
integrate with traditional warehouses
Users discover published data
sets/services using familiar tools
CRMERPOLTP LOB
DATA SOURCES
FUTURE DATA
SOURCESNON-RELATIONAL DATA
EXTRACT AND LOAD
DATA LAKE DATA REFINERY PROCESS
(TRANSFORM ON READ)
Transform
relevant data
into data sets
BI AND ANALYTCIS
Discover and
consume
predictive
analytics, data
sets and other
reports
DATA WAREHOUSE
Star schemas,
views
other read-
optimized
structures
Data Lake + Data Warehouse Better Together
Data sources
What happened?
Descriptive
Analytics
Diagnostic
Analytics
Why did it happen?
What will happen?
Predictive
Analytics
Prescriptive
Analytics
How can we make it happen?
Modern Data Warehouse
• Ultimate goal
• Supports future data needs
• Data harmonized and analyzed in
the data lake or moved to EDW for
more quality and performance
Data Lake Data Warehouse
Schema-on-read Schema-on-write
Physical collection of uncurated data Data of common meaning
System of Insight: Unknown data to do
experimentation / data discovery
System of Record: Well-understood data to do
operational reporting
Any type of data Limited set of data types (ie. relational)
Skills are limited Skills mostly available
All workloads – batch, interactive, streaming,
machine learning
Optimized for interactive querying
Complementary to DW Can be sourced from Data Lake
Data Warehouse
Serving, Security & Compliance
• Business people
• Low latency
• Complex joins
• Interactive ad-hoc query
• High number of users
• Additional security
• Large support for tools
• Dashboards
• Easily create reports (Self-service BI)
• Know questions
Use cases using Hadoop and a DW in combination
Bringing islands of Hadoop data together
Archiving data warehouse data to Hadoop (move)
(Hadoop as cold storage)
Exporting relational data to Hadoop (copy)
(Hadoop as backup/DR, analysis, cloud use)
Importing Hadoop data into data warehouse (copy)
(Hadoop as staging area, sandbox, Data Lake)
Reasons you still need a cube/OLAP
• Semantic layer
• Handle many concurrent users
• Aggregating data for performance
• Multidimensional analysis
• No joins or relationships
• Hierarchies, KPI’s
• Row-level security
• Advanced time-calculations
• Slowly Changing Dimensions (SCD)
?
?
?
?
Federated Querying
Federated Querying
Other names: Data virtualization, logical data warehouse, data
federation, virtual database, and decentralized data warehouse.
A model that allows a single query to retrieve and combine data as it sits
from multiple data sources, so as to not need to use ETL or learn more
than one retrieval technology
SQL Server and PolyBase
Query relational and non-relational data with T-SQL
Advanced Analytics
Social
LOB
Graph
IoT
Image
CRM
INGEST STORE PREP & TRAIN MODEL & SERVE
Data orchestration
and monitoring
Big data store Hadoop/Spark and
machine learning
Data warehouse
Cloud Bursting
BI + Reporting
Azure Data Factory Azure Blob Storage Azure Databricks
Azure Data Lake
Azure HDInsight
Azure Machine Learning
Machine Learning Server
Azure SQL Data Warehouse
Azure Analysis Services
INGEST STORE PREP & TRAIN MODEL & SERVE
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Azure Data Factory
Azure Data Factory
Azure Databricks
Azure HDInsight
Data Lake Analytics
Analytical
dashboards
PolyBase
Business/custom apps
(Structured) Azure Analysis
Services
Azure Data Lake Store
INGEST STORE PREP & TRAIN MODEL & SERVE
Azure Data Lake Store
Analytical
dashboards
Business/custom apps
(Structured)
Logs, files and media
(unstructured)
Azure SQL Data
Warehouse
Tableau
Server
PolyBase
Operational
Reports
Ad-Hoc
Query
Azure SQL
Database
Hortonworks
https://aka.ms/ADAG
Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

More Related Content

What's hot

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
Knoldus Inc.
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
James Serra
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
LibbySchulze
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Timothy McAliley
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
DATAVERSITY
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
ConfluentInc1
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 

What's hot (20)

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 

Similar to Is the traditional data warehouse dead?

Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
Martin Bém
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
punedevscom
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Amazon Web Services LATAM
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
Moacyr Passador
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Rukmani Gopalan
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
datastack
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
Attunity
 
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
Bill Kohnen
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Michael Hiskey
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
Ashnikbiz
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Pentaho
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
Milos Milovanovic
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Institute of Contemporary Sciences
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
Skillwise Group
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
Skillwise Group
 

Similar to Is the traditional data warehouse dead? (20)

Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Designing modern dw and data lake
Designing modern dw and data lakeDesigning modern dw and data lake
Designing modern dw and data lake
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
IT Category Purchasing Managers Opportunity for Savings with Non Relational S...
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 

More from James Serra

Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
James Serra
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
James Serra
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
James Serra
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
James Serra
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
James Serra
 
How to build your career
How to build your careerHow to build your career
How to build your career
James Serra
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
James Serra
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
James Serra
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at it
James Serra
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
James Serra
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
James Serra
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
James Serra
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB
James Serra
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
James Serra
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine Learning
James Serra
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
James Serra
 

More from James Serra (20)

Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
How to build your career
How to build your careerHow to build your career
How to build your career
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Azure SQL Database Managed Instance
Azure SQL Database Managed InstanceAzure SQL Database Managed Instance
Azure SQL Database Managed Instance
 
What’s new in SQL Server 2017
What’s new in SQL Server 2017What’s new in SQL Server 2017
What’s new in SQL Server 2017
 
Learning to present and becoming good at it
Learning to present and becoming good at itLearning to present and becoming good at it
Learning to present and becoming good at it
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
 
Introducing DocumentDB
Introducing DocumentDB Introducing DocumentDB
Introducing DocumentDB
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine Learning
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 

Recently uploaded

WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
maazsz111
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 

Recently uploaded (20)

WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 

Is the traditional data warehouse dead?

  • 1. Is the traditionnel data warehouse dead? James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com (Data Lake and Data Warehouse – the best of both worlds)
  • 2. About Me  Microsoft, Big Data Evangelist  In IT for 30 years, worked on many BI and DW projects  Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer  Been perm employee, contractor, consultant, business owner  Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference  Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP  Author of book “Reporting with Microsoft SQL Server 2012”
  • 3. Agenda  Data Warehouse  Data Lake  The best of both worlds  Federated querying  Patterns
  • 4. Considering Data Types Audio, video, images. Meaningless without adding some structure Unstructured JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure Semi-Structured Structured CSV, Columnar Storage (Parquet, ORC). Strict data model structure Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
  • 5. Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why did it happen? Descriptive Analytics Diagnostic Analytics Confirmation Theory Hypothesis Observation Two Approaches to getting value out of data: Top-Down + Bottoms-Up
  • 6.
  • 7.
  • 8. Of course you still need a data warehouse A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting. It acts as a central repository for many subject areas and contains the "single version of truth". Reasons for a data warehouse:  Reduce stress on production system  Optimized for read access, sequential disk scans  Integrate many sources of data  Keep historical records (no need to save hardcopy reports)  Restructure/rename tables and fields, model data  Protect against source system upgrades  Use Master Data Management, including hierarchies  No IT involvement needed for users to create reports  Improve data quality and plugs holes in source systems  One version of the truth  Easy to create BI solutions on top of it (i.e. SSAS Cubes)
  • 9. Implement Data Warehouse Physical Design ETL Development Reporting & Analytics Development Install and Tune Reporting & Analytics Design Dimension Modelling ETL Design Setup Infrastructure Understand Corporate Strategy Traditional Data Warehousing Uses A Top-Down Approach Data sources Gather Requirements Business Requirements Technical Requirements
  • 10.
  • 11.
  • 12.
  • 13.
  • 14. ETL pipeline Dedicated ETL tools (e.g. SSIS) Defined schema Queries Results Relational LOB Applications Traditional business analytics process 1. Start with end-user requirements to identify desired reports and analysis 2. Define corresponding database schema and queries 3. Identify the required data sources 4. Create a Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to target schema (‘schema-on-write’) 5. Create reports. Analyze data All data not immediately required is discarded or archived 14
  • 15. Harness the growing and changing nature of data Need to collect any data StreamingStructured Challenge is combining transactional data stored in relational databases with less structured data Big Data = All Data Get the right information to the right people at the right time in the right format Unstructured “ ”
  • 17. Store indefinitely Analyze See results Gather data from all sources Iterate New big data thinking: All data has value All data has potential value Data hoarding No defined schema—stored in native format Schema is imposed and transformations are done at query time (schema-on-read). Apps and users interpret the data as they see fit 17
  • 18. The “data lake” Uses A Bottoms-Up Approach Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  • 19. Data Analysis Paradigm Shift OLD WAY: Structure -> Ingest -> Analyze NEW WAY: Ingest -> Analyze -> Structure
  • 20. Exactly what is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Collect all data “just in case” • Store data with no modeling – “Schema on read” • Complements EDW • Frees up expensive EDW resources • Quick user access to data • ETL Hadoop tools • Easily scalable • Place to backup data to • Place to move older data
  • 21.
  • 22. Needs data governance so your data lake does not turn into a data swamp!
  • 23.
  • 24. The real cost of Hadoop https://www.scribd.com/document/172491475/WinterCorp- Report-Big-Data-What-Does-It-Really-Cost/
  • 25.
  • 26. A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
  • 27. • Query performance not as good as relational database • Complex query support not good due to lack of query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing • Concurrency limitations • No concept of “hot” and “cold” data storage with different levels of performance to reduce cost • Not a DBMS so lack of features such as update/delete of data, referential integrity, statistics, ACID compliance, data security • File based so no granular security definition at the column level • No metadata stored in HDFS, so another tool required adding complexity and slowing performance • Finding expertise in Hadoop is very difficult • Super complex, with lot’s of integration with multiple technologies to make it work • Many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard • Lack of master data management tools for Hadoop • Requires end-users to learn new reporting tools and Hadoop technologies to query the data • Pace of change is so quick many Hadoop technologies become obsolete, adding risk • Lack of cost savings: cloud consumption, support, licenses, training, and migration costs • Need conversion process to convert data to a relational format if a reporting tool requires it • Some reporting tools don’t work against Hadoop
  • 28.
  • 29. Current state of a data warehouse Traditional Approaches CRMERPOLTP LOB DATA SOURCES ETL DATA WAREHOUSE Star schemas, views other read- optimized structures BI AND ANALYTCIS Emailed, centrally stored Excel reports and dashboards Well manicured, often relational sources Known and expected data volume and formats Little to no change Complex, rigid transformations Required extensive monitoring Transformed historical into read structures Flat, canned or multi-dimensional access to historical data Many reports, multiple versions of the truth 24 to 48h delay MONITORING AND TELEMETRY
  • 30. Current state of a data warehouse Traditional Approaches CRMERPOLTP LOB DATA SOURCES ETL DATA WAREHOUSE Star schemas, views other read- optimized structures BI AND ANALYTCIS Emailed, centrally stored Excel reports and dashboards Increase in variety of data sources Increase in data volume Increase in types of data Pressure on the ingestion engine Complex, rigid transformations can’t longer keep pace Monitoring is abandoned Delay in data, inability to transform volumes, or react to new sources Repair, adjust and redesign ETL Reports become invalid or unusable Delay in preserved reports increases Users begin to “innovate” to relieve starvation MONITORING AND TELEMETRY INCREASING DATA VOLUME NON-RELATIONAL DATA INCREASE IN TIME STALE REPORTING
  • 31. Data Lake Transformation (ELT not ETL) New Approaches All data sources are considered Leverages the power of on-prem technologies and the cloud for storage and capture Native formats, streaming data, big data Extract and load, no/minimal transform Storage of data in near-native format Orchestration becomes possible Streaming data accommodation becomes possible Refineries transform data on read Produce curated data sets to integrate with traditional warehouses Users discover published data sets/services using familiar tools CRMERPOLTP LOB DATA SOURCES FUTURE DATA SOURCESNON-RELATIONAL DATA EXTRACT AND LOAD DATA LAKE DATA REFINERY PROCESS (TRANSFORM ON READ) Transform relevant data into data sets BI AND ANALYTCIS Discover and consume predictive analytics, data sets and other reports DATA WAREHOUSE Star schemas, views other read- optimized structures
  • 32. Data Lake + Data Warehouse Better Together Data sources What happened? Descriptive Analytics Diagnostic Analytics Why did it happen? What will happen? Predictive Analytics Prescriptive Analytics How can we make it happen?
  • 33. Modern Data Warehouse • Ultimate goal • Supports future data needs • Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance
  • 34. Data Lake Data Warehouse Schema-on-read Schema-on-write Physical collection of uncurated data Data of common meaning System of Insight: Unknown data to do experimentation / data discovery System of Record: Well-understood data to do operational reporting Any type of data Limited set of data types (ie. relational) Skills are limited Skills mostly available All workloads – batch, interactive, streaming, machine learning Optimized for interactive querying Complementary to DW Can be sourced from Data Lake
  • 35. Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (Self-service BI) • Know questions
  • 36. Use cases using Hadoop and a DW in combination Bringing islands of Hadoop data together Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage) Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use) Importing Hadoop data into data warehouse (copy) (Hadoop as staging area, sandbox, Data Lake)
  • 37. Reasons you still need a cube/OLAP • Semantic layer • Handle many concurrent users • Aggregating data for performance • Multidimensional analysis • No joins or relationships • Hierarchies, KPI’s • Row-level security • Advanced time-calculations • Slowly Changing Dimensions (SCD)
  • 39. Federated Querying Other names: Data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse. A model that allows a single query to retrieve and combine data as it sits from multiple data sources, so as to not need to use ETL or learn more than one retrieval technology
  • 40. SQL Server and PolyBase Query relational and non-relational data with T-SQL
  • 41.
  • 42. Advanced Analytics Social LOB Graph IoT Image CRM INGEST STORE PREP & TRAIN MODEL & SERVE Data orchestration and monitoring Big data store Hadoop/Spark and machine learning Data warehouse Cloud Bursting BI + Reporting Azure Data Factory Azure Blob Storage Azure Databricks Azure Data Lake Azure HDInsight Azure Machine Learning Machine Learning Server Azure SQL Data Warehouse Azure Analysis Services
  • 43. INGEST STORE PREP & TRAIN MODEL & SERVE Logs, files and media (unstructured) Azure SQL Data Warehouse Azure Data Factory Azure Data Factory Azure Databricks Azure HDInsight Data Lake Analytics Analytical dashboards PolyBase Business/custom apps (Structured) Azure Analysis Services Azure Data Lake Store
  • 44. INGEST STORE PREP & TRAIN MODEL & SERVE Azure Data Lake Store Analytical dashboards Business/custom apps (Structured) Logs, files and media (unstructured) Azure SQL Data Warehouse Tableau Server PolyBase Operational Reports Ad-Hoc Query Azure SQL Database Hortonworks
  • 45.
  • 47. Q & A ? James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)

Editor's Notes

  1. With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that?  No! In the presentation I'll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I'll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I'll put it all together by showing common big data architectures. http://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/ https://www.slideshare.net/jamserra/big-data-architectures-and-the-data-lake https://www.slideshare.net/jamserra/differentiate-big-data-vs-data-warehouse-use-cases-for-a-cloud-solution
  2. Fluff, but point is I bring real work experience to the session
  3. No Pay the piper now or later The real question is… Dump files into data lake and tell user to go for it
  4. Relational databases (RDBMS) generally work with structured data. Non-relational databases (NoSQL) work with semi-structured data Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types
  5. Top down starts with descriptive analytics and progresses to prescriptive analytics. Know the questions to ask. Lot’s of upfront work to get data to where you can use it Bottoms up starts with predictive analytics. Don’t know the questions to ask. Little work needs to be done to start using data There are two approaches to doing information management for analytics: Top-down (deductive approach). This is where analytics is done starting with a clear understanding of corporate strategy where theories and hypothesis are made up front. The right data model is then designed and implemented prior to any data collection. Oftentimes, the top-down approach is good for descriptive and diagnostic analytics. What happened in the past and why did it happen? Bottom-up (inductive approach). This is the approach where data is collected up front before any theories and hypothesis are made. All data is kept so that patterns and conclusions can be derived from the data itself. This type of analysis allows for more advanced analytics such as doing predictive or prescriptive analytics: what will happen and/or how can we make it happen? In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottom-up approach becomes part of the top-down approach. .
  6. One version of truth story: different departments using different financial formulas to help bonus This leads to reasons to use BI. This is used to convince your boss of need for DW Note that you still want to do some reporting off of source system (i.e. current inventory counts). It’s important to know upfront if data warehouse needs to be updated in real-time or very frequently as that is a major architectural decision JD Edwards has tables names like T117
  7. The Data Warehouses leverages the top-down approach where there is a well-architected information store and enterprisewide BI solution. To build a data warehouse follows the top-down approach where the company’s corporate strategy is defined first. This is followed by gathering of business and technical requirements for the warehouse. The data warehouse is then implemented by dimension modelling and ETL design followed by the actual development of the warehouse. This is all done prior to any data being collected. It utilizes a rigorous and formalized methodology because a true enterprise data warehouse supports many users/applications within an organization to make better decisions.
  8. Key Points: Businesses can use new data streams to gain a competitive advantage. Microsoft is uniquely equipped to help you manage the growing volume and variety of data: structured, unstructured, and streaming. Talk Track: Does it not seem like every day there is a new kind of data that we need to understand? New data types continue to expand—we need to be prepared to collect that data so that the organization can then go do something with it. Structured data, the type of data we have been working with for years, continues to accelerate. Think how many transactions are occurring across your business. Unstructured data, the typical source of all our big data, takes many forms and originates from various places across the web including social. Streaming data is the data at the heart of the Internet of Things revolution. Just think about how many things in your organization are smart or instrumented and generating data every second. All of this means that data volumes are growing and bringing new capacity challenges. You are also dealing with an enormous opportunity, taking all of this data and putting it to work. In order to take advantage of all this data, you first need a platform that enables you to collect any data—no matter the size or type. The Microsoft data platform is uniquely complete and can help you collect any data using a flexible approach: Collecting data on-premises with SQL Server SQL Server can help you collect and manage structured, unstructured, and streaming data to power all your workloads: OLTP, BI, and Data Warehousing With new in-memory capabilities that are built into SQL Server 2014, you get the benefit of breakthrough speed with your existing hardware and without having to rewrite your apps. If you’ve been considering the cloud, SQL Server provides an on-ramp to help you get started. Using the wizards built into SQL Server Management Studio, extending to the cloud by combining SQL and Microsoft Azure is simple. Capture new data types using the power and flexibility of the Microsoft Azure Cloud Azure is well equipped to provide the flexibility you need to collect and manage any data in the cloud in a way that meets the needs of your business. Big data in Azure: HDInsight: an Apache Hadoop-based analytics solution that allows cluster deployment in minutes, scale up or down as needed, and insights through familiar BI tools. SQL Databases: managed relational SQL Database-as-a-service that offers business-ready capabilities built on SQL Server technology. Blobs: a cloud storage solution offering the simplest way to store large amounts of unstructured text or binary data, such as video, audio, and images. Tables: a NoSQL key/value storage solution that provides simple access to data at a lower cost for applications that do not need robust querying capabilities. Intelligent Systems Service: cloud service that helps enterprises embrace the Internet of Things by securely connecting, managing, and capturing machine-generated data from a variety of sensors and devices to drive improvements in operations and tap into new business opportunities. Machine Learning: if you’re looking to anticipate business challenges or opportunities, or perhaps expand your data practice into data science, Azure’s new Machine Learning service—cloud-based predictive analytics— can help. ML Studio is a fully-managed cloud service that enables data scientists and developers to efficiently embed predictive analytics into their applications, helping organizations use massive data sets and bring all the benefits of the cloud to machine learning. Document DB: a fully managed, highly scalable, NoSQL document database service Azure Stream Analytics: real-time event processing engine that helps uncover insights from devices, sensors, infrastructure, applications, and data Azure Data Factory: enables information production by orchestrating and managing diverse data Azure Event Hubs: a scalable service for collecting data from millions of “things” in seconds Microsoft Analytics Platform System: In the past, to provide users with reliable, trustworthy information, enterprises gathered relational and transactional data in a single data warehouse. But this traditional data warehouse is under pressure, hitting limits amidst massive change. Data volumes are projected to grow tenfold over the next five years. End users want real-time responses and insights. They want to use non-relational data, which now constitutes 85 percent of data growth. They want access to “cloud-born” data, data that was created from growing cloud IT investments. Your enterprise can only cope with these shifts with a modern data warehouse—the Microsoft Analytics Platform System is the answer. The Analytics Platform System brings Microsoft’s massively parallel processing (MPP) data warehouse technology—the SQL Server Parallel Data Warehouse (PDW), together with HDInsight, Microsoft’s 100 percent Apache Hadoop distribution—and delivers it as a turnkey appliance. Now you can collect relational and non-relational data in one appliance. You can have seamless integration of the relational data warehouse and Hadoop with PolyBase.   All of these options give you the flexibility to get the most out of your existing data capture investments while providing a path to a more efficient and optimized data environment that is ready to support new data types.
  9. All data has immediate or potential value This leads to data hoarding—all data is stored indefinitely With an unknown future, there is no defined schema. Data is prepared and stored in native format; No upfront transformation or aggregation Schema is imposed and transformations are done at query time (schema-on-read). Applications and users interpret the data as they see fit.
  10. Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera) http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/ http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/ http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/ http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses http://www.martinsights.com/?p=1088 http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/ http://www.martinsights.com/?p=1082 http://www.martinsights.com/?p=1094 http://www.martinsights.com/?p=1102
  11. Inexpensively store unlimited data Collect all data “just in case” Easy integration of differently-structured data Store data with no modeling – “Schema on read” Complements enterprise data warehouse (EDW) Frees up expensive EDW resources, especially for refining data Hadoop cluster offers faster ETL processing over SMP solutions Quick user access to data Data exploration to see if data valuable before writing ETL and schema for relational database Allows use of Hadoop tools such as ETL and extreme analytics Place to land IoT streaming data On-line archive or backup for data warehouse data Easily scalable With Hadoop, high availability built in Allows for data to be used many times for different analytic needs and use cases Low-cost storage for raw data saving space on the EDW
  12. https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning Question: Do you see many companies building data lakes? Raw: Raw events are stored for historical reference. Also called staging layer or landing area Cleansed: Raw events are transformed (cleaned and mastered) into directly consumable data sets. Aim is to uniform the way files are stored in terms of encoding, format, data types and content (i.e. strings). Also called conformed layer Application: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (i.e. DW application, advanced analysis process, etc). This is also called by a lot of other names: workspace, trusted, gold, secure, production ready, governed, presentation Sandbox: Optional layer to be used to “play” in.  Also called exploration layer or data science workspace
  13. Question: Do you see many companies building data lakes?
  14. I’m not saying your data warehouse can’t consist of just a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, and Yahoo.  But are you as big as them?  Do you have their resources?  Do you generate data like them?  Do you want a solution that only 1% of the workforce has the skillset for?  Is your IT department radical or is it conservative?
  15. does a data scientist or analyst think locally or globally?  Do they create a model that supports just their use case or do think more broadly how this data set can support other use cases?  So it may be best to continue to let IT model and refine the data inside a relational data warehouse so that it is suitable for different types of business users.
  16. As far as reporting goes, whether to have users report off of a data lake or via a relational database and/or a cube is a balance between giving users data quickly and having them do the work to join, clean and master data (getting IT out-of-the-way) versus having IT make multiple copies of the data and cleaning, joining and mastering it to make it easier for users to report off of the data but dealing with the delay in waiting for IT to do all this.  The risk in the first case is having users repeating the process to clean/join/master data and cleaning/joining/mastering it wrong and getting different answers to the same question.  Another risk in the first case is slower performance because the data is not laid out efficiently.  Most solutions incorporate both to allow power users or data scientists to access the data quickly via a data lake while allowing all the other users to access the data in a relational database or cube, making self-service BI a reality (as most users would not have the skills to access data in a data lake properly or at all so a cube would be appropriate as it provides a semantic layer among other advantages to make report building very easy – see Why use a SSAS cube?).
  17. http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/ Not saying your EDW can’t consist of a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, Yahoo. But are you as big as them? Do you have their resources? Do you generate data like them? Do you want a solution that only 1% of the workforce has the skillset? Radical vs conservative http://www.wintercorp.com/tcod-report/
  18. Why move relational data to data lake? Offload processing to refine data to free-up EDW, use low-cost storage for raw data saving space on EDW, help if ETL jobs on EDW taking too long. So can actually use a data lake for small data – move EDW to Hadoop, refine it, move it back to EDW. Cons: rewriting all current ETL to Hadoop, re-training I believe APS should be used for staging (i.e. “ELT”) in most cases, but there are some good use cases for using a Hadoop Data Lake:   - Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced - Wanting to use some Hadoop technologies/tools to refine/filter data that are not available for APS - Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy - ELT jobs on EDW are taking too long, so offload some of them to the Hadoop data lake - There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to EDW (offload processing, need to use Hadoop tools) - The data lake is a good place for data that you “might” use down the road. You can land it in the data lake and have users use SQL via Polybase to look at the data and determine if it has value
  19. In Gartner’s 2013 study, “Big Data Business Benefits Are Hampered by ‘Culture Clash’”, they make the argument that both approaches are needed for innovation to be successful. Oftentimes what happens in the bottoms-up approach becomes part of the top-down approach. The Top-down approach with the data warehouse utilizes a rigorous and formal approach to designing an enterprise wide data warehouse that can support the entire enterprise. It usually can answer questions that are backwards facing like what just happened or even answer why things happened. The bottoms-up approach with the data lake utilizes a exploratory and informal approach of collecting all data in a single place so that data scientists can do advanced analytics like leveraging Hadoop and machine learning tools. It usually can identify new opportunities, predict future outcomes, etc. In the ideal world, both are leveraged so that they can exploit information in the most valued way where each works together with the other to grow the business.
  20. An evolution of the three previous scenarios that provides multiple options for the various technologies.  Data may be harmonized and analyzed in the data lake or moved out to a EDW when more quality and performance is needed, or when users simply want control.  ELT is usually used instead of ETL (see Difference between ETL and ELT).  The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data. Hub-and-spoke should be your ultimate goal.  See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.
  21. HDInsights benefits: Cheap, quickly procure Key goal of slide: Highlight the four main use cases for PolyBase. Slide talk track: There are four key scenarios for using PolyBase with the data lake of data normally locked up in Hadoop. PolyBase leverages the APS MPP architecture along with optimizations like push-down computing to query data using Transact-SQL faster than using other Hadoop technologies like Hive. More importantly, you can use the Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first. PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL. There are times when you need to share your PDW with Hadoop users and PolyBase makes it easy to copy data to a Hadoop cluster. Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
  22. Question: Data warehouses and data lakes are now so fast, do we still need cubes?
  23. Question: Do we need a relational database if we create a data lake?  So we don’t have to make another copy of the data?  Hive LLAP, Spark SQL, and Impala are so fast can’t we get away with just having a data lake? In other words, is the traditional data warehouse dead? I’m a little confused with the update to SQL DW that has unlimited columnar storage.  What was the limit before this change?  Does this change the maximum database size of 240TB? The max previously was governed by max db size of 240TB. We now are holding the columnstore data out on blob store and so the amount of that data we can hold is unlimited. We are still bound by the 240TB limit for page data so indexes and heaps are still capped today. Speed: Blob/ADLS/Local, size, versions, query type, data size, driver, front-end tool, concurrency, etc. Storage spaces Azure SQL Database Managed Instance will be added to this.
  24. SMP is one server where each CPU in the server shares the same memory, disk, and network controllers (scale-up). MPP means data is distributed among many independent servers running in parallel and is a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk (scale-out).
  25. http://demo.sqlmag.com/scaling-success-sql-server-2016/integrating-big-data-and-sql-server-2016 When it comes to key BI investments we are making it much easier to manage relational and non-relational data with Polybase technology that allows you to query Hadoop data and SQL Server relational data through single T-SQL query. One of the challenges we see with Hadoop is there are not enough people out there with Hadoop and Map Reduce skillset and this technology simplifies the skillset needed to manage Hadoop data. This can also work across your on-premises environment or SQL Server running in Azure.
  26. https://blogs.technet.microsoft.com/msuspartner/2017/04/05/data-analytics-partners-navigating-data/
  27. Question: Should SQL Database be considered in the Model & Serve blade, using it as a data mart?
  28. Four Reasons to Migrate Your SQL Server Databases to the Cloud: Security, Agility, Availability, and Reliability Reasons not to move to the cloud: Security concerns (potential for compromised information, issues of privacy when data is stored on a public facility, might be more prone to outside security threats because its high-profile, some providers might not implement the same layers of protection you can achieve in-house) Lack of operational control: Lack of access to servers (i.e. say you are hacked and want to get to security and system log files; if something goes wrong you have no way of controlling how and when a response is carried out; the provider can update software, change configuration settings, and allocate resources without your input or your blessing; you must conform to the environment and standards implemented by the provider) Lack of ownership (an outside agency can get to data easier in the cloud data center that you don’t own vs getting to data in your onsite location that you own.  Or a concern that you share a cloud data center with other companies and someone from another company can be onsite near your servers) Compliance restrictions Regulations (health, financial) Legal restrictions (i.e. data can’t leave your country) Company policies You may be sharing resources on your server, as well as competing for system and network resources Data getting stolen in-flight (i.e. from the cloud data center to the on-prem user)
  29. Question: Where should be do data transformations (data lake, relational database, Databricks, etc)? Question: What are the cost vs performance tradeoffs with our products? (many companies will sacrifice performance to save money)