☺
ANALYSIS
&
VISUALISATION
DIGITAL
COMPANY
DATA
PROCESSING
&
STORAGE
DATA
INTEGRATION
DATA
EXPLOITATION
ENTERPRISE
INFORMATION
MANAGEMENT
DATA
SOURCES
DATA
WAREHOUSE
BUSINESS
INTELLIGENCE
APPLICATION
DEVELOPMENT
LEGACY
ERP, CRM…
IoT
BIG DATA
HADOOP
ADVANCED
ANALYTICS
MOBILE APPS
ARTIFICIAL
INTELLIGENCE
APPLICATION
PERFORMANCE
MANAGEMENT
CLOUD
BLOCKCHAIN
Canada
Czechia
Slovakia
Germany
Bulgaria
Russia
USA
Thailand
Adastra
Adastra Business Consulting
Adastra One
Acamar
Ataccama
Proboston Creative
Blindspot AI
Instarea
2000+ Employees
Source: CzechInvest, April 2019
Sources: Domo, erwin
Sources: Wikipedia, IBM
1985 - 1995 Prehistory
•Controlled chaos
•Best practice awakening
•Manual scripting
•Basic Relational Analytics
1995 – 2005 Antiquity
•Clash of Titans: Kimball vs. Inmon
•Mature Best practices
•Enterprise Data Warehouse
•ETL
•OLAP
•Complex Relational Analytics
2005-2015 Middle Ages
•High-end traditional Data Warehousing
•Hub-and-spoke architecture
•Data Governance
•MDM (mechanically divided meat)
•MDD
•ELT
•Data Vault
•Data Mining
•DW appliance
•Columnar DB
•In-memory DB
•Hadoop Stack Dawn
•Unstructured data analytics
2015 – Modern Age
•Logical Data Warehouse
•Extended Data Warehouse
•Data Lake
•Polyglot Architecture
•Kappa / Lambda
•Databus
•Data Pipeline
•Real-time
•Big Data ETL
•Open Source
•Big Data Analytics
•Self-service
•Data Science
•Machine Learning & AI
•Hadoop without Hadoop
•Stream Analytics
•All data analytics
•Data Management Platform
•Cloud
•Automation
•Autonomous Technologies
•New Elasticities of Compute & Storage
•Serverless (incl. Serverless DW)
Small & Midsize DW
Midsize & Large DW
Midsize & Large DW
Vision Reality
Bigger Data
Volumes
Semi-structured
& Unstructured
Data
Solution
Complexity
Fast Data
Legacy Data
Warehouses
Slow Cloud
Adoption (incl.
DWaaS)
Tighter SLA &
Low Performance
Better Service
Quality
Regulatory
Requests
Increasing TCO
Bad Time 2
Market
Weird
Technology
Stacks
Weak Interaction
with Enterprise
Architecture
Sterile
Multitenant
Solutions
Technology First
Data Lineage
Obsession
Lack of DW
experts
Accumulate
Technological
Debts
Non-actionable
Analytics
Too many Data
Warehouses
Divided Data
Platforms (Silos)
Low Added Value
for Business
No Raw Data
Relaxed or
Overfitted Data
Quality
Missing Self
Service
Insufficient
History
No Automation
Hard Coding
No Agile
Data Lake as DW
Replacement
Slow Provisioning
Limited Elasticity
Old Templates for
Current Problems
Bad Data
Granularity
One Processing
Frequency
Data Hoarding
Megalomaniac
Scope
No Parallel
Processing
Limited Data
History
Missing
Metadata
Missing
Documentation
Missing Data
Architecture
Missing Design
Standards
Ineffective ETL
Development
Infinite cycle of
insourcing &
outsourcing
Multivendor
Competitions
without Strategy
Missing Data
Strategy
Siloed MDMs
Data Stream
Isolation
Not Real
Reference Data
or only for DW
No Business
Analytics Above
Unmanaged Data
Variety and
Variance
No Real Data
Governance
Evil Database
Dwarves
https://learn.panoply.io/hubfs/Downloadable%20Content/Data_Warehouse_Trends_2019.pdf
Opportunities Abound (DWs are rare in SME segment)
Redshift is losing ground
Complexity remains a significant ‘sore spot’ for data warehouse users
Performance issues are also persistent
PROPER TOOLS SIMPLIFY DATA WAREHOUSING (AND ALSO NASA’S EXPERIMENTS)
2012
2018
HIGH PERFORMANCE IS MANDATORY (AND COOL)
DATA INTEGRATION ARCHITECTURE
ETL vs. ELT Big Data ETL: ETL on Hadoop vs. ELT in Hadoop
Lambda = Kappa + Batch Layer
Kappa = Lambda – Batch Layer
Polyglot / Big SQL
Sources: Ericsson, Oracle, Software Advice
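The two identities above (Lambda = Kappa + Batch Layer, Kappa = Lambda – Batch Layer) can be illustrated with a minimal Python sketch. It is not taken from the slides: the event log, the per-user count view, and the function names are all hypothetical, chosen only to show that a Lambda query merges a batch view with a stream view, while a Kappa query replays the whole log through the stream path alone.

```python
from collections import Counter

# Hypothetical event log: (event_id, user) pairs, oldest first.
EVENTS = [(1, "ann"), (2, "bob"), (3, "ann"), (4, "cid"), (5, "ann")]


def batch_view(events):
    """Batch layer: recompute the view from scratch over a full snapshot."""
    return Counter(user for _, user in events)


def stream_view(events, state=None):
    """Stream (speed) layer: fold events into the state one at a time."""
    state = Counter() if state is None else state
    for _, user in events:
        state[user] += 1
    return state


def lambda_query(log, batch_cutoff):
    """Lambda: batch view up to the cutoff, merged with a stream view after it."""
    old = [e for e in log if e[0] <= batch_cutoff]
    new = [e for e in log if e[0] > batch_cutoff]
    return batch_view(old) + stream_view(new)


def kappa_query(log):
    """Kappa: no batch layer; the stream processor replays the whole log."""
    return stream_view(log)


lambda_result = lambda_query(EVENTS, batch_cutoff=3)
kappa_result = kappa_query(EVENTS)
print(lambda_result == kappa_result)
```

Both paths give the same counts here. The practical difference is that Lambda maintains two code paths (batch and stream), while Kappa maintains one and pays for it with log replay.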
Attribute | Traditional Data Warehouse (DW) | Data Lake (DL) | Extended Data Warehouse (XDW)
Data | Structured | Structured & Semi-Structured & Unstructured | Structured & Semi-Structured & Unstructured
Data Processing | Processed | Raw | Processed & Raw
Data Schema | Schema-on-write | Schema-on-read | Schema-on-write & Schema-on-read
Data Model | Relational | Object-based | Relational & Object-based
Data History | Hierarchically archived | No hierarchy | Hierarchically archived & No hierarchy
Agility | Fixed configuration | Reconfigured anytime as needed | Fixed configuration & Reconfigured anytime as needed
Security | Mature | Maturing | Mature
Primary Users | Data analysts & Business professionals | Data scientists | Data analysts & Business professionals & Data scientists
Technology | RDBMS | NoSQL DBMS, Hadoop, Other distributed storages | RDBMS, NoSQL DBMS, Hadoop, Other distributed storages
Agility | Low | High | Medium
Added Value | Medium | Medium | High
Cost | High | Low | Medium
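The schema-on-write and schema-on-read entries in the comparison above can be made concrete with a small Python sketch. Everything in it is an illustrative assumption rather than material from the slides: the three raw JSON records, the `sales` table, and the function names are hypothetical. The warehouse-style path validates records before storing them in SQLite. The lake-style path keeps the raw text and applies the schema only at query time.

```python
import json
import sqlite3

# Hypothetical raw records: the second carries an unexpected extra field,
# the third has the wrong type for "amount".
RAW = [
    '{"id": 1, "amount": 10.5}',
    '{"id": 2, "amount": 3.0, "channel": "web"}',
    '{"id": 3, "amount": "n/a"}',
]


def load_schema_on_write(lines):
    """Validate and shape records before storing them (warehouse style)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
    rejected = []
    for line in lines:
        rec = json.loads(line)
        if isinstance(rec.get("amount"), (int, float)):
            # Fields outside the schema (e.g. "channel") are dropped at load time.
            db.execute("INSERT INTO sales VALUES (?, ?)", (rec["id"], rec["amount"]))
        else:
            rejected.append(rec["id"])
    return db, rejected


def read_schema_on_read(lines):
    """Keep raw text as stored; interpret it only when a query runs (lake style)."""
    total = 0.0
    for line in lines:
        amount = json.loads(line).get("amount")
        if isinstance(amount, (int, float)):
            total += amount
    return total


db, rejected = load_schema_on_write(RAW)
warehouse_total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
lake_total = read_schema_on_read(RAW)
print(warehouse_total, lake_total, rejected)
```

The two totals agree, but the trade-off differs. Schema-on-write surfaces the bad record at load time and discards the extra field. Schema-on-read keeps everything and defers both decisions to each query.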
DATA WAREHOUSE VS. DATA LAKE: DIFFERENT TECHNOLOGIES BUT SAME RESULTS
Application Server
Web Server
Pentaho Data
Integration
(Web Console)
Adastra
Workflow
GUI
Adastra
Ref Books
GUI
Adastra
Workflow
Middleware
Adastra
Ref Books
Middleware
Pentaho Data
Integration
(Carte)
Pentaho Data
Integration
(Repository)
Adastra
Workflow
for RDBMS
Database
Scheduler
Adastra
Ref Books
Store
Adastra
ELT
SAP
PowerDesigner
Adastra
ELT & Workflow
Code Generator
External
Workflow
Adastra
Data Model
Design Time
Run Time
RDBMS
Data Source
Data
Warehouse
Stream Processor
Stage Database
ETL/ELT
Custom Development
Big Data ETL
Cluster Filesystem
Data Services
Data Extractor File System
Messaging
Change Data Capture
Clustered Stage DB
NoSQL
Many Ways from Data Sources to Data Warehouse (Real Example)
On-Premises
Applications
Data
Runtime
Middleware
OS
Virtualization
Servers
Storage
Networking
IaaS
(Infrastructure as a
Service)
Applications
Data
Runtime
Middleware
OS
Virtualization
Servers
Storage
Networking
PaaS
(Platform as a Service)
Applications
Data
Runtime
Middleware
OS
Virtualization
Servers
Storage
Networking
SaaS
(Software as a Service)
Applications
Data
Runtime
Middleware
OS
Virtualization
Servers
Storage
Networking
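The four stacks above list the same nine layers. What distinguishes the models is who operates each layer, which the original slide presumably showed by colour and which is lost in the text. The sketch below encodes the commonly cited split as an assumption, not as the slide's own content. The `FIRST_PROVIDER_LAYER` mapping and the `split` helper are hypothetical names.

```python
# The nine layers, top to bottom, as listed in each stack.
LAYERS = [
    "Applications", "Data", "Runtime", "Middleware", "OS",
    "Virtualization", "Servers", "Storage", "Networking",
]

# Assumption: index of the first layer operated by the provider in each
# model, following the commonly cited split (not recoverable from the slide).
FIRST_PROVIDER_LAYER = {
    "On-Premises": len(LAYERS),              # customer runs everything
    "IaaS": LAYERS.index("Virtualization"),  # provider runs the infrastructure
    "PaaS": LAYERS.index("Runtime"),         # provider also runs the platform
    "SaaS": 0,                               # provider runs everything
}


def split(model):
    """Return (customer-managed layers, provider-managed layers) for a model."""
    cut = FIRST_PROVIDER_LAYER[model]
    return LAYERS[:cut], LAYERS[cut:]


for model in FIRST_PROVIDER_LAYER:
    customer, provider = split(model)
    print(f"{model}: customer manages {len(customer)}, provider manages {len(provider)}")
```

Under this assumption a DWaaS offering sits near the SaaS end: the customer keeps responsibility for the data it loads, while the layers below are run by the provider.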
ELASTICITY LEADS THE WAY (AND THIS IS NOT 737 MAX ON THE PICTURE)
AUTOMATION & AUTONOMOUS TECHNOLOGIES ARE THE NEW BLACK
DATA GOVERNANCE MUST BE COMPLEX
Concepts
Vision & Mission
Guiding Principles
Organization & Roles
Business Rules
Activities
Scope
Benefits & Goals
Components
Data Architecture
Data Quality
Data Integration
Operations
Security
RDM & MDM
Metadata
Data Platform & BI
Tools
CASE
Enterprise Metadata Repository
Data Quality Tools
QA Framework
Workflow & Orchestration
IDE
Audit Log
Resource Management
RDBMS
NoSQL
Hadoop
Integration tools
Monitoring
Source Code Repository
Testing Tools
Others
Why What How
Business Drivers
• Digitalization
• Smart Data incl. Single Customer
View
• Data Literacy
• External Analytics Innovation
• Quick Time-to-Market
• Business Process Autonomous
Optimization
• Individual Personal Offers at
Massive Scale
Analytics
• Single View of Facts
• Augmented Analytics
• Advanced Data Visualization
• Predictive & Prescriptive Analytics
• Collaborative Business Intelligence
• Natural Language Processing
• Self-Service BI & Data Preparation
• Data Discovery
• Data Science
• Effective Advanced Analytics incl.
Real-time
• External Analytics Innovation
• Embedded Analytics
• Stream Analytics
• Geospatial Analytics
• Data Blending
Governance & Architecture
• Data Quality Management &
Master Data Management
• Holistic Data Governance incl. Big
Data
• Embracing Data Catalogs
• Real Data Science Governance
• Full Data Lifecycle Management
• Metadata Integration incl. Real-
time
• Advanced Security & Audit
• Data Warehouse Modernization
(XDW)
• Data Warehouse Automation
(DWA)
• Doom of Classical Data
Warehousing (Hub & Spoke)
• Lambda & Kappa Architecture
• Data Lake 2.0
• Analytic Data Store 2.0
Technology
• Serverless
• Compute & Storage Divided
Elasticities
• Extreme Performance &
Appliances
• Polymorphic Data Models
• Artificial Intelligence
• Cloud Continuum
• Edge Computing
• Complex Metadata Driven
Development
• Data Management Platforms
• Data & Processing Offloading
• Data Virtualization & Data
Federation (Polyglot Architecture)
• Autonomous Technologies
• Next Generation Sandboxing
• In-memory
• DWaaS
• Digital Twins
☺
TORONTO
LONDON
FRANKFURT
PRAGUE
BRATISLAVA
SOFIA
MOSCOW
THANK YOU !
CONTACT ADASTRA GROUP
CZECH REPUBLIC
KAROLINSKÁ 654/2
186 00 PRAHA 8
+420 271 733 303
INFOCZ@ADASTRAGRP.COM
WWW.ADASTRA.CZ
https://www.linkedin.com/in/martin-b%C3%A9m-7a92089/

Pitfalls of Data Warehousing_2019-04-24
