David Rice & Tom Bruce
The Modern Data Warehouse
15th January 2019
@snapanalytics
hello@snapanalytics.co.uk
Snap-analytics
Agenda
Topic
01 Introductions
02 Evolution of the Data Warehouse
03 Problems with traditional Data Warehousing
04 Why the Modern Data Platform?
05 Three components of the Modern Data Platform
06 Demo
07 Key takeaways
Introductions
Tom Bruce (Delivery Lead and Co-founder)
Extensive experience designing and delivering enterprise data warehouse and analytics solutions.
Core functional expertise in:
• Finance
• Marketing
Tom has worked with clients including:
• Jaguar Land Rover
• Deutsche Bank
• Carlsberg
David Rice, aka ‘Data Dave’ (CEO and Co-founder)
Over 15 years' experience in data analytics including:
• Data warehouse design
• ETL (data integration)
• Data Modelling
• Delivering self-service analytics
David has worked with clients including:
• ING Bank
• Barclays Capital
• Jaguar Land Rover
Evolution of the Data Warehouse
Early 1970s: ACNielsen’s ‘Data Mart’
ACNielsen provided ‘Data Marts’ to their clients in order to help them understand their sales better.
Mid 1970s: Bill Inmon
Bill Inmon begins to define and discuss the term ‘Data Warehouse’.
Early 1980s: MPP Databases
Teradata create the DBC/1012 database.
Goodyear Aerospace build the ‘Goodyear MPP’ supercomputer.
Late 1980s: IBM Article on Data Warehousing
In 1988 IBM published ‘An architecture for a business and information system’ and coined the term “business data warehouse”.
Early 1990s: Ralph Kimball
Red Brick Systems, founded by Ralph Kimball, introduces ‘Red Brick Warehouse’.
Mid 1990s: TDWI
‘The Data Warehousing Institute’ is founded.
1996: The Data Warehouse Toolkit
‘The Data Warehouse Toolkit’ is published by Ralph Kimball.
Early 2000s: Data Vault
Dan Linstedt introduces Data Vault modelling.
Late 2000s: ‘Big Data’ & NoSQL, Cloud Computing
Early 2010s: Cloud Data Warehousing
The benefits of Data Warehousing in the cloud were realised as:
• Google launched a Data Warehouse as a service, ‘BigQuery’, in 2011
• Amazon launched Redshift in 2013
• Snowflake Inc. was publicly launched in 2014
• Microsoft launched Azure SQL Data Warehouse in 2016
Late 2010s: Cloud Adoption, DW Automation, Connectivity
Three big problems!
• The Data Warehouse
• Data Integration (ETL)
• Data Modelling
Poor outcomes
60 percent of Big Data projects will fail (Gartner, 2017)
Problems with traditional DW solutions
• Initial Set Up
• Performance Tuning
• Ongoing Maintenance
• Scalability
• Data Security & Compliance
• Flexibility
• High Upfront Costs
• Resilience
Problems with traditional ETL solutions
• Time consuming
• Documentation
• Inconsistent
• Auditability & Lineage
• Performance
• Inefficient
A new way of thinking
Modern data platform
Modern data platforms like Snowflake are fast to set up and scale up. Low-cost storage and decoupled storage and compute eliminate resource contention. Native JSON support and ‘time travel’ features also provide great benefits.
Data Warehouse automation
Tools like Fivetran improve consistency and significantly reduce development cycles.
Agile data modelling
Data Vault 2.0 enables parallel loading and support for unstructured data, and is built with change in mind.
Combining modern data platforms, agile data modelling principles and DW automation tools delivers highly agile, highly scalable, performant solutions. These can serve the needs of your data scientists and business community alike.
Benefits of Snowflake Data Platform
• Multi Cloud Availability
• Per Second Pricing
• Performance
• Data Sharing
• Multi Use Cases
• Zero Copy Clone
• Time Travel
• Instant Elasticity
High Level Architecture - Snowflake
Benefits of Fivetran
• Streaming Support
• ELT Performance
• Zero-configuration
• SQL Transforms
• Rapid Dev
• Pre-built Connectors
Fivetran – Salesforce Schema
CitiBike Demo Context
• CitiBike is a bike share program in New York (similar to
Boris Bikes in London)
• Users are either annual members or buy short term passes
• There are numerous different stations across the city and
users will collect a bike from a station and then return it to
another station once they are finished
• CitiBike want to have a data warehouse to allow them to
analyse all of the historical trips and join this with external
data to give greater insight
• We will see how a modern data platform can be created
within minutes to help them achieve this goal
Demo Architecture
Sources:
• Amazon S3: Citibike Trips (CSV)
• Amazon S3: NYC Weather Data (JSON)
• Azure Blob: Station MD (JSON)
Snowflake layers:
• Staging: Trips, Weather, Station MD
• Transformation: Trips & Weather
• Reporting: Trips View
Load path: Direct Load
Loading in Snowflake
• Data is loaded and queried using virtual warehouses, available in sizes from XS (1 server) to 4XL (128 servers)
• Compute and storage are completely isolated, meaning no resource contention
• Queries are processed using massively parallel processing (MPP) compute clusters
• A warehouse can be scaled up with no administration needed
• Bulk data loading can be done from cloud storage sources, such as the S3 stage used in the demo
SNOWFLAKE
DEMO
a) Bulk loading from S3 Stage
b) Scaling up the server
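The two demo steps can be sketched as the SQL statements they roughly correspond to. This is a minimal sketch, not the statements used in the presentation: it only builds SQL text and does not connect to Snowflake, and the table, stage and warehouse names (trips, citibike_stage, demo_wh) are hypothetical. The server counts assume each size step doubles the previous one, from 1 server at the smallest size to 128 at the largest.

```python
# Sketch only: builds Snowflake SQL text, it does not connect to Snowflake.
# All object names passed in (trips, citibike_stage, demo_wh) are hypothetical.

# Warehouse size keywords, smallest to largest.
WAREHOUSE_SIZES = [
    "XSMALL", "SMALL", "MEDIUM", "LARGE",
    "XLARGE", "XXLARGE", "XXXLARGE", "X4LARGE",
]


def servers_for(size: str) -> int:
    """Assumed sizing rule: each step up doubles the servers (XSMALL = 1)."""
    return 2 ** WAREHOUSE_SIZES.index(size)


def copy_into_sql(table: str, stage: str) -> str:
    """a) Bulk-load CSV files from a named stage, e.g. one pointing at S3."""
    return (
        f"COPY INTO {table} FROM @{stage} "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )


def resize_sql(warehouse: str, size: str) -> str:
    """b) Scale a virtual warehouse up or down by changing its size."""
    if size not in WAREHOUSE_SIZES:
        raise ValueError(f"unknown warehouse size: {size}")
    return f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{size}'"


if __name__ == "__main__":
    print(copy_into_sql("trips", "citibike_stage"))
    print(resize_sql("demo_wh", "LARGE"), "--", servers_for("LARGE"), "servers")
```

Generating the statements as strings keeps the sketch runnable without an account; in practice they would be run in a Snowflake worksheet or through a client library.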
02 – ELT v ETL
• Modern cloud-based solutions mean that we can now use ELT rather than ETL:
  - Endless storage capabilities and scalable processing power
  - Ability to store semi-structured data, meaning that it can be transformed after loading
• A big advantage of ELT is the extra flexibility it adds:
  - Data can be loaded very quickly
  - Developers can then decide to transform only what is necessary, and can quickly change what needs to be transformed
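The ELT ordering can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for a cloud data warehouse, and the sample trip rows, table names and column names are invented for illustration. The point is the order of operations: raw data is landed untouched first, and the transformation is defined afterwards as SQL inside the database, so it can be changed without reloading.

```python
import csv
import io
import sqlite3

# Invented sample data; the second row has a bad duration on purpose.
RAW_CSV = """trip_id,duration_seconds,start_station
1,600,Central Park
2,not_recorded,Broadway
3,1800,Central Park
"""


def load_raw(conn: sqlite3.Connection, text: str) -> None:
    """Load step: land every column as text, with no cleaning or typing."""
    conn.execute(
        "CREATE TABLE staging_trips "
        "(trip_id TEXT, duration_seconds TEXT, start_station TEXT)"
    )
    rows = list(csv.reader(io.StringIO(text)))[1:]  # skip the header row
    conn.executemany("INSERT INTO staging_trips VALUES (?, ?, ?)", rows)


def transform(conn: sqlite3.Connection) -> None:
    """Transform step: a view defined in the database after loading."""
    conn.execute(
        """
        CREATE VIEW trips AS
        SELECT CAST(trip_id AS INTEGER) AS trip_id,
               CAST(duration_seconds AS INTEGER) / 60.0 AS duration_minutes,
               start_station
        FROM staging_trips
        WHERE duration_seconds GLOB '[0-9]*'  -- keep values starting with a digit
        """
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_CSV)
    transform(conn)
    for row in conn.execute("SELECT * FROM trips ORDER BY trip_id"):
        print(row)
```

Because the bad row is still present in the staging table, a later change to the view (for example, defaulting missing durations instead of dropping them) needs no new extract or load.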
FIVETRAN
DEMO
a) JSON source file
b) Loaded into Azure blob storage
c) Fivetran connector
d) Load
e) Transformation
03 – Semi-structured Data
• Snowflake is able to store semi-structured data (JSON, Avro, ORC & Parquet) natively, enabling ELT
• The VARIANT data type in Snowflake stores this data, with SQL extensions to query it directly
• Transforming JSON data into structured tables in Snowflake is extremely simple
• Snowflake is a combination of both a Data Warehouse and a Data Lake – a ‘Data Lakehouse’
WEATHER
DATA LOAD
a) Load Weather JSON data from stage
b) View the weather data in raw form
c) Transform the JSON into structured
data
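Step c) can be sketched as a query that pulls typed columns out of a VARIANT column using Snowflake's path-and-cast notation (column:path::TYPE). The helper below only builds the SQL text. The table name weather_raw, the column name v and the JSON field paths are assumptions made for illustration, since the structure of the weather file is not shown in the slides.

```python
from typing import Dict, Tuple


def variant_select_sql(
    table: str, column: str, fields: Dict[str, Tuple[str, str]]
) -> str:
    """Build a SELECT that casts paths inside a VARIANT column to typed columns.

    `fields` maps an output column name to a (JSON path, SQL type) pair.
    """
    parts = [
        f"{column}:{path}::{sql_type} AS {alias}"
        for alias, (path, sql_type) in fields.items()
    ]
    return f"SELECT {', '.join(parts)} FROM {table}"


if __name__ == "__main__":
    # Hypothetical field paths; the real weather JSON may be shaped differently.
    print(
        variant_select_sql(
            "weather_raw",
            "v",
            {"city": ("city.name", "STRING"), "temp": ("main.temp", "FLOAT")},
        )
    )
```

Wrapping such a SELECT in a view or a CREATE TABLE AS statement is one way to expose the raw JSON as a structured table.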
04 – Zero-copy Cloning for Dev and Test
• Data often needs to be copied for purposes such as QA and test environments
• Creating copies of the data and environments takes considerable time, and there is a cost associated with storing the data twice
• Snowflake cloning instantly creates a copy without persisting a second copy of the data; the clone simply references the original data
  - Only new or updated records get stored in the new cloned table
CLONING
DEMO
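In Snowflake a clone is created with a statement of the form CREATE TABLE trips_dev CLONE trips (the table names here are hypothetical). The idea behind it can be sketched with a toy copy-on-write model. This is a deliberate simplification and not Snowflake's implementation: the toy shares individual rows, whereas a real clone shares underlying storage.

```python
class Table:
    """Toy table: maps a row key to a reference to a stored row."""

    def __init__(self, rows=None):
        # Copies only the key -> row references, never the row data itself.
        self.rows = dict(rows or {})

    def clone(self) -> "Table":
        """Create a clone that initially shares every row with the source."""
        return Table(self.rows)

    def upsert(self, key, row) -> None:
        """New or updated rows are stored only in the table written to."""
        self.rows[key] = row

    def own_rows(self, source: "Table") -> list:
        """Keys whose row is not shared with `source` (new or changed)."""
        return sorted(
            k for k, r in self.rows.items() if source.rows.get(k) is not r
        )


if __name__ == "__main__":
    prod = Table({1: ("Central Park",), 2: ("Broadway",)})
    dev = prod.clone()
    dev.upsert(2, ("Broadway (renamed)",))
    dev.upsert(3, ("Wall Street",))
    print(dev.own_rows(prod))  # keys stored separately for the clone
    print(prod.rows[2])        # the original table is unaffected
```

Only the rows written after cloning take up extra space, which is why a dev or test copy can be created almost instantly.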
05 – Time Travel
• Frequently there are issues with tables or data being accidentally deleted
• Data may be corrupted, or changes may be implemented that adversely affect the data
• Snowflake allows access to historical data (i.e. changed or deleted data) at any point within a retention period of up to 90 days, depending on edition and configuration
• Data can be quickly recovered from key points in the past
TIME TRAVEL
DEMO
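Time Travel is exposed through SQL clauses such as AT(OFFSET => ...) and AT(TIMESTAMP => ...), plus the UNDROP command. The helpers below are a minimal sketch that only builds the statement text. The table names and the timestamp in the usage example are hypothetical, and the statements only work for data still inside the account's retention period.

```python
def query_as_of_sql(table: str, seconds_ago: int) -> str:
    """Query a table as it was `seconds_ago` seconds in the past."""
    if seconds_ago < 0:
        raise ValueError("seconds_ago must not be negative")
    return f"SELECT * FROM {table} AT(OFFSET => -{seconds_ago})"


def restore_clone_sql(new_table: str, table: str, timestamp: str) -> str:
    """Materialise a past state of a table by cloning it at a timestamp."""
    return (
        f"CREATE TABLE {new_table} CLONE {table} "
        f"AT(TIMESTAMP => '{timestamp}'::timestamp_ltz)"
    )


def undrop_sql(table: str) -> str:
    """Bring back a table that was dropped by mistake."""
    return f"UNDROP TABLE {table}"


if __name__ == "__main__":
    print(query_as_of_sql("trips", 300))  # state five minutes ago
    print(restore_clone_sql("trips_restored", "trips", "2019-01-15 09:00:00"))
    print(undrop_sql("trips"))
```

Combining Time Travel with cloning, as in restore_clone_sql, gives a quick way to recover a table's earlier state without touching the current one.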
06 – Reporting Connectivity
• Snowflake connects to many different reporting tools; we’ve just selected a few below:
POWER BI
DEMO
Key takeaways
• Maximise the work NOT done
• Build for Change
• Are you future ready?


Editor's Notes

  • #14 Modern Data Platform: start small and scale up quickly, minimising risk. Support for JSON and data science use cases at massive scale. Data Warehouse Automation. Note that there are other solutions that solve some of these problems too: SQL Data Warehouse, as presented by Kamil, and Databricks, presented by Niall, which is based on Apache Spark. We recommend exploring multiple options, taking a fact-based view and implementing what works for you and your organisation.