Modernising Data Warehousing
In the age of Big Data Analytics
Phil Watt
21st January 2019
Phil Watt
Bio
Phil is a Director in the Escient Victoria Consulting Team
with more than 25 years in large scale enterprise analytics
and integrated data management programmes. His focus is
in the journey to scale business programmes from small,
proof of concept initiatives through to operational company-
wide solutions with a high strategic impact. He has deep
experience in applying business analytics in the CME and
FS sectors in Western Europe and South Pacific, including
global technology leadership roles for Fortune 500
companies. After leading the definition of the technology
components of a State Government data reform strategy,
he now leads the technology implementation and business
alignment of three of its key foundation programmes.
Disclaimer
All views expressed are my own and may not represent the
opinions of any entity whatsoever with whom I have been, am
now, or will be affiliated.
Outline
Why have a data warehouse?
Why modernise your data warehouse?
Design Principles for a modern data warehouse
Cloud and Big Data
Patterns
Value from integrated data is proportional to the number of users
‘Build it and they will come’ is not a good strategy
Why Modernise?
The modernisation business case is likely to involve a mixture of:
• New capability
• Better query performance
• Lower data latency (data freshness)
• Lower support / Opex costs
• Higher developer / end user productivity
• Faster implementation of new data / requirements
• Risk reduction (stack out of support, security concerns, skills availability)
Your biggest costs are likely to be labour – not software or infrastructure:
• Developer productivity
• Maintenance (number of operations and support staff)
• End user productivity
Avoid Appeal to Tradition & Sunk Cost Fallacies
Incumbent vendors may encourage you to stick with current ‘best practice’ or
suggest you have too much invested in the current platform
https://en.wikipedia.org/wiki/Appeal_to_tradition
Avoid the Appeal to Novelty Fallacy
Vendors often use Appeal to Novelty (shiny-shiny is better than old-fangled…)
to upsell or get in the door
Remember: If it ain’t broke, don’t fix it
https://en.wikipedia.org/wiki/Appeal_to_novelty
Design Principles
Carefully Choose Your Design Principles
(Samples below)
1. Climb the Stack: SaaS | PaaS | IaaS | Metal. Compose higher-order solutions from components. As-a-Service allows outsourcing of lower-level components.
2. Connect People to Data: While transactional business systems are designed to prevent direct access to data, analytics systems are designed to enable a connection to data.
3. Privacy by Design: Information privacy and governance are included from the start of system design, on par with system functionality.
4. Scalable Day 1: Capable of distributed scale-out from day 1.
5. Open Innovation: Innovation in data and analytics capabilities is being driven by open collaboration on algorithms and open source software.
6. Pipeline of Parts: Data processing and pipeline components must have clear boundaries & hand-off points.
7. Reuse over Rebuild: Reuse and extend components; design and build them in re-usable ways. Use DRY (Don’t Repeat Yourself) code versus WET (Write Every Time) code.
8. Repeatable over Recoverable: Service continuity is driven by repeatability and automation over backup/restore.
9. Everything Testable: All components must be verifiable via test automation.
10. Know your Data: Ensure a solid understanding of the data, including how it was collected (& why), data definitions, data quality, transformation rules and lineage, and operational metadata.
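As one possible illustration of the "Pipeline of Parts", "Reuse over Rebuild" and "Everything Testable" principles, here is a minimal Python sketch. It is not taken from the slides: the names `compose`, `trim_strings`, `drop_missing` and the `customer_id` field are hypothetical, chosen only to show small steps with a clear hand-off point (an iterable of records in, an iterable of records out) that can be reused and tested in isolation.

```python
from typing import Callable, Iterable

Record = dict
# Hand-off contract (an assumption for this sketch): every step takes an
# iterable of records and returns an iterable of records.
Step = Callable[[Iterable[Record]], Iterable[Record]]


def compose(*steps: Step) -> Step:
    """Chain steps so that each step's output is the next step's input."""
    def pipeline(records: Iterable[Record]) -> Iterable[Record]:
        for step in steps:
            records = step(records)
        return records
    return pipeline


def trim_strings(records: Iterable[Record]) -> Iterable[Record]:
    """Reusable cleaning step: strip whitespace from every string value."""
    for record in records:
        yield {k: v.strip() if isinstance(v, str) else v
               for k, v in record.items()}


def drop_missing(field: str) -> Step:
    """Build a reusable step that drops records with an empty `field`."""
    def step(records: Iterable[Record]) -> Iterable[Record]:
        return (r for r in records if r.get(field) not in (None, ""))
    return step


# The same parts can be recombined for other feeds instead of being rewritten.
customer_pipeline = compose(trim_strings, drop_missing("customer_id"))
```

Because each part has a single input and output, a test only needs a handful of in-memory records, for example `list(customer_pipeline([{"customer_id": " 42 ", "name": " Ann "}]))` returns `[{"customer_id": "42", "name": "Ann"}]`.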
Cloud encourages an engineering approach
“If a human operator needs to touch your system during normal
operations, you have a bug. The definition of normal changes as
your systems grow.”
Carla Geisser, Google SRE
SRE – Site Reliability Engineering
With SRE we work to avoid Toil
Toil often has the following characteristics:
• Manual
• Repetitive
• Automatable
• Tactical
• No enduring value
• Effort to do it scales linearly as a service grows
See https://landing.google.com/sre/sre-book/toc/
Tenets of SRE
• Ensuring a Durable Focus on Engineering
• Pursuing Maximum Change Velocity Without Violating a Service’s SLO
• Monitoring (Alerts, Tickets, Logging)
• Emergency Response
• Change Management
• Demand Forecasting and Capacity Planning
• Provisioning
• Efficiency and Performance
Using Cloud Infrastructure
Enables rapid response to changing business requirements
Reduces the body of technical knowledge you need to maintain internally
Spend time considering security and privacy challenges
• Engage a third party security expert if needed to help with security designs
Best match for the technical design principles above
• Easier access to SaaS and PaaS offerings
Be open to a multi-cloud platform
• Help convince your cloud provider you have choices
• Take advantage of best of breed capabilities
• Don’t always rely on your cloud vendor’s native offerings – consider third parties to help mitigate stickiness (lock-in)
Cloud may INCREASE your infrastructure costs
• Likely to be offset by increased business responsiveness and richer feature availability
Big Data and Cloud
‘Hadoop’ is much less relevant in the cloud today
• The overhead of HDFS is unnecessary given cloud storage options like
AWS S3 or Azure Blob Storage
• Useful data processing services are often packaged in PaaS – avoiding
the need to manage complex Hadoop clusters
Design Patterns
Analytical Ecosystem
Based on a diagram by Humza Naseer, University of Melbourne 2019
High Level Patterns Have Hardly Changed for Data Warehouse ETL in the last 15 years
[Diagram: Source (CRM / ERP / Billing, etc.) → Extract / Access (Get / Put) → Transform (Clean, Validate, Conform to model) → Load (Use/present)]
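The stage names in this pattern (extract, clean, validate, conform to model, load) map naturally onto small functions. The following is a minimal Python sketch of that flow, not the author's implementation: the table name `fact_billing`, the fields `customer_id` and `amount`, and the use of an in-memory SQLite database as the "warehouse" are all assumptions made purely for illustration.

```python
import sqlite3


def extract(source_rows):
    """Stand-in for Get / Put against a CRM, ERP or billing system."""
    return list(source_rows)


def clean(rows):
    """Strip stray whitespace from string values."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in rows]


def validate(rows):
    """Split rows into those that satisfy basic rules and those that do not."""
    valid, rejected = [], []
    for row in rows:
        ok = bool(row.get("customer_id")) and row.get("amount") is not None
        (valid if ok else rejected).append(row)
    return valid, rejected


def conform(rows):
    """Conform to the (hypothetical) target model: text key, numeric amount."""
    return [(str(r["customer_id"]), round(float(r["amount"]), 2)) for r in rows]


def load(conn, rows):
    """Load conformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_billing (customer_id TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_billing VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn, source_rows):
    """Run the stages in order and report how many rows passed or failed."""
    valid, rejected = validate(clean(extract(source_rows)))
    load(conn, conform(valid))
    return len(valid), len(rejected)
```

A run against `sqlite3.connect(":memory:")` with one good row and one row lacking a `customer_id` returns `(1, 1)`: one row loaded, one rejected for follow-up.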
But latency requirements have
[Diagram: the same Source → Extract / Access → Transform → Load flow, annotated with three latency modes: Batch, Mini Batch, Stream]
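One way to see how batch, mini batch and stream relate is as the same processing applied to different chunk sizes. The Python sketch below is an illustration under that assumption, not something shown in the slides; `mini_batches` is a hypothetical helper that groups an incoming sequence of events into fixed-size chunks.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def mini_batches(events: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` events.

    A very large `size` behaves like a classic batch load, `size=1` hands
    over one event at a time (stream-like), and anything in between is a
    mini batch.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

For example, `list(mini_batches(range(5), 2))` gives `[[0, 1], [2, 3], [4]]`. A real streaming engine would also deal with time windows, ordering and delivery guarantees, which this sketch deliberately ignores.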
And there is a bewildering choice of tools just to get data into the Data Warehouse
• ETL: Talend, Databricks, Snaplogic, etc.
• iPaaS: Informatica, Dell Boomi, Mulesoft, etc.
• ELT: SSIS, SQL, Oracle Data Integrator, etc.
• Frameworks: Bonobo, Pygrametl, Apache Airflow, etc.
• Raw code: Python, Scala, Spark, etc.
IO and Query Concurrency Drive Performance and User Experience
Recommendation: Use tools that closely support your design principles
Keep these in mind when choosing:
• A database / query execution engine
• Where you do your data transformations – e.g. should you separate transformations from user queries?
De-risk tool selection by using
Continuous Integration / Continuous Delivery (CI/CD)
Deployment – avoid all or nothing ‘big bang’
De-risk using a phased approach
CI/CD from day one
Select some core reusable services to use first and do
parallel runs if possible,
e.g. load modules, address cleansing
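A parallel run is only useful if the outputs of the old and new components are compared automatically. The Python sketch below shows one way such a check could look, and it could also run as a step in a CI/CD pipeline. The function name `reconcile` and the row layout are hypothetical and not from the slides.

```python
def reconcile(legacy_rows, candidate_rows, key):
    """Compare the output of a legacy module with its replacement.

    Rows are dictionaries; `key` names the field that identifies a row.
    Returns the keys that are missing, unexpected or different in the
    candidate output, so a parallel run can be signed off (or not).
    """
    legacy = {row[key]: row for row in legacy_rows}
    candidate = {row[key]: row for row in candidate_rows}
    shared = legacy.keys() & candidate.keys()
    return {
        "missing_in_candidate": sorted(legacy.keys() - candidate.keys()),
        "unexpected_in_candidate": sorted(candidate.keys() - legacy.keys()),
        "mismatched": sorted(k for k in shared if legacy[k] != candidate[k]),
    }
```

An empty result in all three lists for several consecutive loads is a reasonable, if informal, signal that the new load module or address cleansing service can take over from the old one.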
Choosing a Data Model Methodology
Inmon (normalized core) – Labrador
• Labradors love being around people and want to be everybody's friend. They are
very sociable, intelligent, active, fun-loving animals who are eager to please. They
make ideal pets for families with children, and make great watchdogs too. The best
possible reference for the breed's docile and reliable nature is the fact that
virtually all guide dogs for the blind in Australia are Labrador Retrievers.
Kimball (dimensional core) – Kelpie
• Australian Kelpies are tough, independent, highly intelligent dogs with extreme
loyalty and utmost devotion to duty, and have a tractable disposition. Obedient
and super alert, the Australian Kelpie is eager to please and makes a devoted
companion; however, its inexhaustible energy makes it unsuitable for
suburban living.
Data Vault (hubs and satellites) – Chow Chow
• The Chow Chow has a reputation for being a one-man dog and not very tolerant of
those it doesn’t know. It can also tend to be willful and hard to train, so it is
not a good choice for a weak or new owner. In addition, this dog has a thick coat
that it sheds about twice a year. Expect to find fur everywhere during this time.
www.escient.com.au
Phil Watt
Director
phil.watt@escient.com.au
linkedin.com/in/dataphil
Appendix
Dev Stack (excluding ETL choices above…)

Mature dropshipping via API with DroFx.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 

Modernising the data warehouse - January 2019

  • 1. In the age of Big Data Analytics Phil Watt 21st January 2019 Modernising Data Warehousing
  • 2. Phil Watt Bio Phil is a Director in the Escient Victoria Consulting Team with more than 25 years in large scale enterprise analytics and integrated data management programmes. His focus is in the journey to scale business programmes from small, proof of concept initiatives through to operational company-wide solutions with a high strategic impact. He has deep experience in applying business analytics in the CME and FS sectors in Western Europe and South Pacific, including global technology leadership roles for Fortune 500 companies. After leading the definition of the technology components of a State Government data reform strategy, he now leads the technology implementation and business alignment of three of its key foundation programmes.
  • 3. Disclaimer: All views expressed are my own and may not represent the opinions of any entity whatsoever with whom I have been, am now, or will be affiliated.
  • 4. Outline: Why have a data warehouse?; Why modernise your data warehouse?; Design Principles for a modern data warehouse; Cloud and Big Data; Patterns
  • 5. Value from integrated data is proportional to the number of users
  • 6. ‘Build it and they will come’ is not a good strategy
  • 8. The modernisation business case is likely to involve a mixture of:
      • New capability
      • Better query performance
      • Lower data latency (data freshness)
      • Lower support / Opex costs
      • Higher developer / end user productivity
      • Faster implementation of new data / requirements
      • Risk reduction (stack out of support, security concerns, skills availability)
    Your biggest costs are likely to be labour – not software or infrastructure:
      • Developer productivity
      • Maintenance (number of operations and support staff)
      • End user productivity
  • 9. Avoid Appeal to Tradition & Sunk Cost Fallacies: Incumbent vendors may encourage you to stick with current ‘best practice’ or suggest you have too much invested in the current platform. https://en.wikipedia.org/wiki/Appeal_to_tradition Avoid the Appeal to Novelty Fallacy: Vendors often use Appeal to Novelty (shiny-shiny is better than old-fangled…) to upsell or get in the door. Remember: If it ain’t broke, don’t fix it. https://en.wikipedia.org/wiki/Appeal_to_novelty
  • 11. Carefully Choose Your Design Principles (Samples below)
      1. Climb the Stack: SaaS | PaaS | IaaS | Metal. Compose higher order solutions from components. as-a-Service allows outsourcing of lower level components.
      2. Connect People to Data: While transactional business systems are designed to prevent direct access to data, Analytics systems are designed to enable a connection to data.
      3. Privacy by Design: Information privacy and governance is included from the start of system design, on par with system functionality.
      4. Scalable Day 1: Capable of distributed scale-out from day 1.
      5. Open Innovation: Innovation in data and analytics capabilities is being driven by open collaboration on algorithms and open source software.
      6. Pipeline of Parts: Data processing and pipeline components must have clear boundaries & hand-off points.
      7. Reuse over Rebuild: Reuse and extend components - design and build them in re-usable ways. Use DRY (Don’t Repeat Yourself) code versus WET (Write Every Time) code.
      8. Repeatable over Recoverable: Service continuity driven by repeatability and automation over backup/restore.
      9. Everything Testable: All components must be verifiable via test automation.
      10. Know your Data: Ensure a solid understanding of the data – including how it was collected (& why), data definitions, data quality, transformation rules and lineage, and operational metadata.
  • 12. Cloud encourages an engineering approach
  • 13. “If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.” Carla Geisser, Google SRE SRE – Site Reliability Engineering
  • 14. With SRE we work to avoid Toil
    Toil often has the following characteristics:
      • Manual
      • Repetitive
      • Automatable
      • Tactical
      • No enduring value
      • Effort to do it scales linearly as a service grows
    See https://landing.google.com/sre/sre-book/toc/
    Tenets of SRE:
      • Ensuring a Durable Focus on Engineering
      • Pursuing Maximum Change Velocity Without Violating a Service’s SLO
      • Monitoring (Alerts, Tickets, Logging)
      • Emergency Response
      • Change Management
      • Demand Forecasting and Capacity Planning
      • Provisioning
      • Efficiency and Performance
  • 15. Using Cloud Infrastructure
      • Enables responsive change in business requirements
      • Reduces the body of technical knowledge you need to maintain internally
      • Spend time considering security and privacy challenges
          • Engage a third party security expert if needed to help with security designs
      • Best match for the technical design principles above
          • Easier access to SaaS and PaaS offerings
      • Be open to multi-cloud platform
          • Help convince your cloud provider you have choices
          • Take advantage of best of breed capabilities
          • Don’t always rely on cloud vendor’s native offerings – consider third parties to help mitigate for stickiness
      • Cloud may INCREASE your infrastructure costs
          • Likely to be offset by increased business responsiveness and richer feature availability
  • 16. Big Data and Cloud: ‘Hadoop’ is much less relevant in the cloud today
      • The overhead of HDFS is unnecessary given cloud storage options like AWS S3 or Azure Blob Storage
      • Useful data processing services are often packaged in PaaS – avoiding the need to manage complex Hadoop clusters
  • 18. Analytical Ecosystem Based on a diagram by Humza Naseer, University of Melbourne 2019
  • 19. High Level Patterns Have Hardly Changed for Data Warehouse ETL in the last 15 years. Diagram stages: Source (CRM / ERP / Billing, etc.), Extract / Access, Transform, Load. Diagram labels: Get / Put, Clean, Validate, Conform to model, Use/present.
  • 20. But latency requirements have: Batch | Mini Batch | Stream. Diagram stages: Source (CRM / ERP / Billing, etc.), Extract / Access, Transform, Load. Diagram labels: Get / Put, Clean, Validate, Conform to model, Use/present.
  • 21. And there is a bewildering choice of tools just to get data into the Data Warehouse
      • ETL: Talend, Databricks, Snaplogic, etc.
      • iPaaS: Informatica, Dell Boomi, Mulesoft, etc.
      • ELT: SSIS, SQL, Oracle Data Integrator, etc.
      • Frameworks: Bonobo, Pygrametl, Apache Airflow, etc.
      • Raw code: Python, Scala, Spark, etc.
    Recommendation: Use tools that closely support your design principles
  • 22. IO and Query Concurrency Drives Performance and User Experience. Keep these in mind when choosing:
      • A database / query execution engine
      • Where you do your data transformations – e.g. should you separate transformations from user queries?
  • 23. De-risk tool selection by using Continuous Integration / Continuous Delivery (CI/CD)
  • 24. Deployment – avoid all or nothing ‘big bang’
      • De-risk using a phased approach
      • CI/CD from day one
      • Select some core-reusable services to use first and do parallel runs if possible, e.g. load modules, address cleansing
  • 25. Choosing a Data Model Methodology
      • Inmon (normalized core) – Labrador: Labradors love being around people and want to be everybody's friend. They are very sociable, intelligent, active, fun-loving animals who are eager to please. They make ideal pets for families with children, and make great watchdogs too. The best possible reference for the breed's docile and reliable nature is the fact that virtually all guide dogs for the blind in Australia are Labrador Retrievers.
      • Kimball (dimensional core) – Kelpie: Australian Kelpies are tough, independent, highly intelligent dogs with extreme loyalty and utmost devotion to duty, and have a tractable disposition. Obedient and super alert, the Australian Kelpie is eager to please and makes a devoted companion; however, their inexhaustible energy makes them unsuitable for suburban living.
      • Data Vault (hubs and satellites) – Chow Chow: The Chow Chow has a reputation for being a one-man dog and not very tolerant of those it doesn’t know. It can also tend to be willful and hard to train, so it is not a good choice for a weak or new owner. In addition, this dog has a thick coat that it sheds about twice a year. Expect to find fur everywhere during this time.
  • 28. Dev Stack (excluding ETL choices above…)
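
Slide 24 recommends doing parallel runs of core reusable services such as address cleansing, and principle 9 on slide 11 asks that every component be verifiable via test automation. The following is a minimal Python sketch of what such a parallel run could look like. The functions `legacy_cleanse` and `new_cleanse` are hypothetical stand-ins, not taken from the deck; in practice they would wrap the existing and the replacement service.

```python
from typing import Callable, Iterable, List, Tuple


def legacy_cleanse(address: str) -> str:
    """Hypothetical stand-in for the existing address cleansing routine."""
    return " ".join(address.upper().split())


def new_cleanse(address: str) -> str:
    """Hypothetical stand-in for the replacement cleansing service."""
    return " ".join(part.upper() for part in address.strip().split())


def parallel_run(
    records: Iterable[str],
    old: Callable[[str], str],
    new: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Feed the same input to both implementations and collect any differences.

    Returns a list of (input, old_output, new_output) for every mismatch.
    An empty list means the two implementations agreed on this sample.
    """
    mismatches = []
    for record in records:
        old_out, new_out = old(record), new(record)
        if old_out != new_out:
            mismatches.append((record, old_out, new_out))
    return mismatches


if __name__ == "__main__":
    sample = ["  1 main   st ", "2 High St"]
    print(parallel_run(sample, legacy_cleanse, new_cleanse))
```

Because the comparison is an ordinary function, it could run inside a CI/CD pipeline on every change, which fits the "CI/CD from day one" point on the same slide.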

Editor's Notes

  1. Providing engineered, integrated data for an individual is expensive – but becomes valuable when you integrate that data for many people or the whole organisation. There is a necessary governance overhead as data is integrated across the organisation as multiple departments need to get together to agree definitions, usage, etc.
  2. Have the capability to build and change things quickly – choose principles to enable this. Don’t build before the demand appears – you probably can’t anticipate demand as well as you think.
  3. Use design principles to inform and shape design and architecture choices. Choose them carefully to avoid driving unintended consequences. Our initial qualifying criterion is: Is there a reasonable opposite position to take for this principle? For example, you might reasonably prefer closed source software (principle 5), or prefer to use bare metal wherever you can (principle 1). These have been chosen carefully to encourage high reuse, low vendor lock-in, optionality and to be highly responsive to changing business requirements. Keep them few in number so they are easy to absorb, understand (individually and in concert with the others) and easy to recall. For example, principles 6, 7 and 8 lead to a conclusion that you should separate application logic from the data – so you have an implied ‘separation of concerns’ principle that doesn’t need to be explicitly stated. This is especially relevant for cloud and the ability to migrate technologies (e.g. change the underlying database). It’s OK to have some tension between principles, as long as they don’t provoke confusion and team conflict.
  4. Batch processing is seldom NOT required. Ensure consistency in update methods when using both batch and streaming to update the same target – this can cause profound DQ errors otherwise. Be wary about patterns like the Lambda architecture (note this is not AWS Lambda serverless…) as they can cause information conflicts and different sources of the truth. To get a consistent time for your integrated data it may not make sense to stream data all the way through. Latency requirements can increase issues with records arriving out of order. How do you validate an order record if the customer record hasn’t been processed in the system yet? Should you just pass it through and revalidate later? Etc.
  5. Don’t forget concurrency for users – this is often a big performance issue. iPaaS = integration Platform as a Service.
  6. Note that the debate around ETL vs ELT has passionate advocates on both sides. Both patterns can be appropriate and you will need clear guidelines to choose between the two. There are also new cloud patterns to spin up compute on demand – see Snowflake Data Warehouse. Think strategically, not tactically – remember local optimisation can cause global sub-optimisation.
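
Note 4 asks how to validate an order record when its customer record has not yet been processed. One possible answer, sketched below in Python, is to park the order and revalidate it once the customer arrives. This is a variation on the "pass it through and revalidate later" option the note mentions, not a method described in the deck. The class name, method names and record shapes are all hypothetical, and a real system would also need persistence and a timeout for orders whose customer never arrives.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class OrderValidator:
    """Holds orders that arrive before their customer and releases them later."""

    def __init__(self) -> None:
        self.known_customers: Set[str] = set()
        # customer_id -> order_ids waiting for that customer to arrive
        self.pending: DefaultDict[str, List[str]] = defaultdict(list)
        self.validated: List[str] = []

    def on_customer(self, customer_id: str) -> None:
        """Register a customer and revalidate any orders parked for it."""
        self.known_customers.add(customer_id)
        # pop() with a default does not create a new entry in the defaultdict
        self.validated.extend(self.pending.pop(customer_id, []))

    def on_order(self, order_id: str, customer_id: str) -> None:
        """Validate immediately if the customer is known, otherwise park it."""
        if customer_id in self.known_customers:
            self.validated.append(order_id)
        else:
            self.pending[customer_id].append(order_id)


if __name__ == "__main__":
    v = OrderValidator()
    v.on_order("o1", "c1")   # arrives before its customer: parked
    v.on_customer("c1")      # customer arrives: o1 is released
    v.on_order("o2", "c1")   # customer already known: validated at once
    print(v.validated, dict(v.pending))
```

The trade-off is that downstream consumers see parked orders later than other records, which bears on the note's point about getting a consistent time for integrated data.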