I was invited to present on Modernising the Data Warehouse to post-graduate students at the University of Melbourne in January 2019. These slides describe my experience and perspective on this topic that many, if not most, large organisations face. At Escient, we can help organisations navigate this area, and drive better outcomes from data.
1. In the age of Big Data Analytics
Phil Watt
21st January 2019
Modernising Data Warehousing
2. Phil Watt
Bio
Phil is a Director in the Escient Victoria Consulting Team
with more than 25 years in large scale enterprise analytics
and integrated data management programmes. His focus is
on scaling business programmes from small proof-of-concept
initiatives through to operational company-wide solutions
with high strategic impact. He has deep
experience in applying business analytics in the CME and
FS sectors in Western Europe and South Pacific, including
global technology leadership roles for Fortune 500
companies. After leading the definition of the technology
components of a State Government data reform strategy,
he now leads the technology implementation and business
alignment of three of its key foundation programmes.
3.
All views expressed are my own and may not represent the
opinions of any entity whatsoever with whom I have been, am
now, or will be affiliated.
Disclaimer
4.
Why have a data warehouse?
Why modernise your data warehouse?
Design Principles for a modern data warehouse
Cloud and Big Data
Patterns
Outline
8. The modernisation business case is likely to involve a mixture of:
• New capability
• Better query performance
• Lower data latency (data freshness)
• Lower support / opex costs
• Higher developer / end-user productivity
• Faster implementation of new data / requirements
• Risk reduction (stack out of support, security concerns, skills availability)
Your biggest costs are likely to be labour – not software or infrastructure:
• Developer productivity
• Maintenance (number of operations and support staff)
• End-user productivity
9. Avoid Appeal to Tradition & Sunk Cost fallacies
Incumbent vendors may encourage you to stick with current ‘best practice’ or suggest you have too much invested in the current platform.
https://en.wikipedia.org/wiki/Appeal_to_tradition
Avoid the Appeal to Novelty fallacy
Vendors often use Appeal to Novelty (shiny-shiny is better than old-fangled…) to upsell or get in the door. Remember: if it ain’t broke, don’t fix it.
https://en.wikipedia.org/wiki/Appeal_to_novelty
11. Carefully Choose Your Design Principles (samples below)
1 Climb the Stack – SaaS | PaaS | IaaS | Metal. Compose higher-order solutions from components. As-a-Service allows outsourcing of lower-level components.
2 Connect People to Data – While transactional business systems are designed to prevent direct access to data, analytics systems are designed to enable a connection to data.
3 Privacy by Design – Information privacy and governance are included from the start of system design, on a par with system functionality.
4 Scalable Day 1 – Capable of distributed scale-out from day 1.
5 Open Innovation – Innovation in data and analytics capabilities is being driven by open collaboration on algorithms and open-source software.
6 Pipeline of Parts – Data processing and pipeline components must have clear boundaries and hand-off points.
7 Reuse over Rebuild – Reuse and extend components, designing and building them in reusable ways. Use DRY (Don’t Repeat Yourself) code versus WET (Write Every Time) code.
8 Repeatable over Recoverable – Service continuity driven by repeatability and automation over backup/restore.
9 Everything Testable – All components must be verifiable via test automation.
10 Know your Data – Ensure a solid understanding of the data, including how it was collected (and why), data definitions, data quality, transformation rules and lineage, and operational metadata.
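Principle 7's DRY-versus-WET contrast can be sketched in a few lines of Python. The cleansing rule and function name below are illustrative assumptions, not from the slides; the point is that shared logic is written once and reused by every load job rather than re-typed in each one.

```python
# Hypothetical example: one reusable cleansing step (DRY) instead of
# copy-pasting the same trimming/casing logic into every load script (WET).

def clean_customer_name(raw: str) -> str:
    """Normalise a customer name: trim whitespace, collapse runs of
    spaces, and title-case the result. Written once, reused everywhere."""
    return " ".join(raw.split()).title()

# Every load job calls the shared function rather than re-implementing it.
crm_names = [clean_customer_name(n) for n in ["  alice  SMITH ", "bob jones"]]
billing_names = [clean_customer_name(n) for n in ["ALICE SMITH"]]
```

If the cleansing rule changes, it changes in one place, and every pipeline picks it up.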
13. “If a human operator needs to touch your system during normal
operations, you have a bug. The definition of normal changes as
your systems grow.”
Carla Geisser, Google SRE
SRE – Site Reliability Engineering
14. Toil often has the following characteristics:
• Manual
• Repetitive
• Automatable
• Tactical
• No enduring value
• Effort to do it scales linearly as a service
grows
See https://landing.google.com/sre/sre-book/toc/
Tenets of SRE
• Ensuring a Durable Focus on Engineering
• Pursuing Maximum Change Velocity Without
Violating a Service’s SLO
• Monitoring (Alerts, Tickets, Logging)
• Emergency Response
• Change Management
• Demand Forecasting and Capacity Planning
• Provisioning
• Efficiency and Performance
With SRE we work to avoid Toil
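One concrete (and hypothetical) data-warehouse example of toil: an operator manually comparing source and target row counts each morning. The check below is a minimal sketch, assuming an illustrative tolerance-based reconciliation; run on a schedule and alerted on, it replaces the manual, repetitive step entirely.

```python
# Illustrative sketch: automating a manual reconciliation check so an operator
# no longer has to eyeball source vs. target row counts each day.

def reconcile(source_count: int, target_count: int, tolerance: float = 0.0) -> bool:
    """Return True when the target row count is within the allowed tolerance
    (expressed as a fraction of the source count); False signals a load problem."""
    if source_count == 0:
        return target_count == 0
    return abs(source_count - target_count) / source_count <= tolerance

# In practice this would run on a schedule and page someone only on failure.
ok = reconcile(source_count=1_000_000, target_count=1_000_000)
drifted = reconcile(source_count=1_000_000, target_count=990_000)
```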
15. Enables responsive change in business requirements
Reduces the body of technical knowledge you need to maintain internally
Spend time considering security and privacy challenges
• Engage a third party security expert if needed to help with security designs
Best match for the technical design principles above
• Easier access to SaaS and PaaS offerings
Be open to a multi-cloud approach
• Help convince your cloud provider you have choices
• Take advantage of best of breed capabilities
• Don’t always rely on your cloud vendor’s native offerings – consider third parties to help mitigate stickiness
Cloud may INCREASE your infrastructure costs
• Likely to be offset by increased business responsiveness and richer feature availability
Using Cloud Infrastructure
16.
‘Hadoop’ is much less relevant in the cloud today
• The overhead of HDFS is unnecessary given cloud storage options like
AWS S3 or Azure Blob Storage
• Useful data processing services are often packaged in PaaS – avoiding
the need to manage complex Hadoop clusters
Big Data and Cloud
19. High-Level Patterns Have Hardly Changed for Data Warehouse ETL in the Last 15 Years
[Diagram] Source (CRM / ERP / Billing, etc.) → Extract / Access (get / put) → Transform (clean, validate, conform to model) → Load → Use / present
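The extract → transform (clean, validate, conform) → load flow above can be sketched as small stages with clear hand-off points (principle 6, Pipeline of Parts). The record shapes and rules here are illustrative assumptions:

```python
# Minimal sketch of the classic ETL flow: each stage is a separate function
# with a clear hand-off point, so stages can be tested and swapped independently.

def extract() -> list[dict]:
    # Stand-in for pulling rows from a source system (CRM / ERP / billing).
    return [{"id": "1", "amount": " 42.50 "}, {"id": "2", "amount": "bad"}]

def transform(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Clean and validate rows, conforming them to the target model.
    Returns (valid rows, rejected rows)."""
    valid, rejects = [], []
    for row in rows:
        try:
            valid.append({"id": int(row["id"]), "amount": float(row["amount"].strip())})
        except ValueError:
            rejects.append(row)
    return valid, rejects

def load(rows: list[dict], target: list[dict]) -> None:
    # Stand-in for writing conformed rows to the warehouse.
    target.extend(rows)

warehouse: list[dict] = []
good, bad = transform(extract())
load(good, warehouse)
```

Rejected rows are handed off rather than silently dropped, so a downstream step can report or reprocess them.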
Keep these in mind when choosing:
• a database / query execution engine
• where you do your data transformations – e.g. should you separate transformations from user queries?
IO and Query Concurrency Drive Performance and User Experience
24. De-risk using a phased approach
CI/CD from day one
Select some core reusable services to use first and do
parallel runs if possible – e.g. load modules, address cleansing
Deployment – avoid an all-or-nothing ‘big bang’
25.
Inmon (normalized core) – Labrador
•Labradors love being around people and want to be everybody's friend. They are
very sociable, intelligent, active, fun-loving animals who are eager to please. They
make ideal pets for families with children, and make great watchdogs too. The best
possible reference for the breed's docile and reliable nature is the fact that
virtually all guide dogs for the blind in Australia are Labrador Retrievers.
Kimball (dimensional core) – Kelpie
•Australian Kelpies are tough, independent, highly intelligent dogs with extreme
loyalty and utmost devotion to duty, and have a tractable disposition. Obedient
and super alert, the Australian Kelpie is eager to please and makes a devoted
companion; however, their inexhaustible energy makes them unsuitable for
suburban living.
Data Vault (hubs and satellites) – Chow Chow
•The Chow Chow has a reputation for being a one-man dog and not very tolerant of
those it doesn’t know. It can also tend to be willful and hard to train, so they are
not a good choice for a weak or new owner. In addition, this dog has a thick coat
that it sheds about twice a year. Expect to find fur everywhere during this time.
Choosing a Data Model Methodology
Providing engineered, integrated data for an individual is expensive – but becomes valuable when you integrate that data for many people or the whole organisation.
There is a necessary governance overhead as data is integrated across the organisation as multiple departments need to get together to agree definitions, usage, etc.
Have the capability to build and change things quickly – choose principles to enable this
Don’t build before the demand appears – you probably can’t anticipate demand as well as you think
Use design principles to inform and shape design and architecture choices
Choose them carefully to avoid driving unintended consequences
Our initial qualifying criterion is: is there a reasonable opposite position to take for this principle?
For example, you might reasonably prefer closed source software (principle 5), or prefer to use bare metal wherever you can (principle 1)
These have been chosen carefully to encourage high reuse, low vendor lock-in, optionality and to be highly responsive to changing business requirements
Keep them few in number so they are easy to absorb, understand (individually and in concert with the others) and easy to recall
For example, principles 6, 7 and 8 lead to a conclusion that you should separate application logic from the data – so you have an implied ‘separation of concerns’ principle that doesn’t need to be explicitly stated.
This is especially relevant for cloud and the ability to migrate technologies (e.g. change the underlying database)
It’s OK to have some tension between principles, as long as they don’t provoke confusion and team conflict
Batch processing is seldom NOT required.
Ensure consistency in update methods when using both batch and streaming to update the same target – this can cause profound DQ errors otherwise
Be wary about patterns like the Lambda architecture (note this is not AWS Lambda serverless…) as they can cause information conflicts and different sources of the truth
To get a consistent time for your integrated data it may not make sense to stream data all the way through
Latency requirements can increase issues with records arriving out of order. How do you validate an order record if the customer record hasn’t been processed in the system yet? Should you just pass it through and revalidate later? Etc.
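One possible way to handle the early-arriving order described above is to park it until its customer record turns up, then revalidate. This is a minimal sketch of that idea; the buffering strategy and record shapes are assumptions, not a recommendation from the slides:

```python
# Illustrative sketch: orders that arrive before their customer record are
# parked in a pending buffer and re-validated when the customer turns up.

known_customers: set[str] = set()
accepted_orders: list[dict] = []
pending_orders: list[dict] = []

def on_customer(customer_id: str) -> None:
    """A customer record arrives: admit it, then retry any parked orders."""
    known_customers.add(customer_id)
    still_pending = []
    for order in pending_orders:
        if order["customer_id"] in known_customers:
            accepted_orders.append(order)
        else:
            still_pending.append(order)
    pending_orders[:] = still_pending

def on_order(order: dict) -> None:
    """An order arrives: accept it if its customer is known, else park it."""
    if order["customer_id"] in known_customers:
        accepted_orders.append(order)
    else:
        pending_orders.append(order)

# Out-of-order arrival: the order turns up before its customer.
on_order({"order_id": 1, "customer_id": "C42"})
on_customer("C42")
```

A real system would also need a policy for records whose parent never arrives (time-outs, dead-letter queues, etc.).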
Don’t forget concurrency for users – this is often a big performance issue
iPaaS = integration Platform as a Service
Note that the debate around ETL vs ELT has passionate advocates on both sides.
Both patterns can be appropriate and you will need clear guidelines to choose between the two
There are also new cloud patterns to spin up compute on demand – see Snowflake Data Warehouse
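The ELT side of that debate can be sketched with SQLite standing in for a cloud warehouse (an illustrative assumption; platforms like Snowflake work analogously at much larger scale): raw data is landed first, and the cleansing is pushed down into the database as SQL, rather than being applied before loading as in ETL.

```python
# Illustrative ELT sketch using SQLite as a stand-in for a cloud warehouse:
# load the raw data first, then transform it *inside* the database with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (id INTEGER, amount TEXT)")

# Load: land the raw, untransformed data as-is.
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [(1, " 10.0 "), (2, "25.5")])

# Transform: push the cleansing down to the database engine.
conn.execute("""
    CREATE TABLE sales AS
    SELECT id, CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_sales
""")

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Keeping the raw table around is one of the practical attractions of ELT: transformations can be re-run or corrected without re-extracting from the source.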
Think strategically, not tactically – remember local optimisation can cause global sub-optimisation.