Data Warehousing
- in the real world -
Dr. Thomas Zurek
@tfxz
November 2015
Big Data und Analytische Applikationen
Who am I ?
• Vice President of Development @ SAP for
– HANA Data Warehousing (DW)
– Enterprise Performance Management (EPM)
– HANA Analytics
• 18 years at SAP
• PhD in Computer Science
• Universities of Karlsruhe and Edinburgh
2November 2015 Real-World Data Warehouses / Thomas Zurek
Agenda
1. Examples
2. Business Intelligence (BI) + Data Warehouses (DW)
3. Data Warehouses
4. Layered Scalable Architecture (LSA)
5. Big Data + Data Warehousing
6. Summary
3November 2015 Real-World Data Warehouses / Thomas Zurek
EXAMPLES
4November 2015 Real-World Data Warehouses / Thomas Zurek
Examples of Business Intelligence Scenarios
• fraud detection
 retail company
 point-of-sales data & given discounts
 huge amounts of data
 a prototypical BI question
• long tail analysis
 e-commerce companies like Amazon, Ebay, iTunes, Netflix, …
 translate sales of popular products into (additional) sales in
the long tail
 BI integrated into operational processes
November 2015 Real-World Data Warehouses / Thomas Zurek 5
Long Tail Analysis (1) – An Amazon Example
November 2015 Real-World Data Warehouses / Thomas Zurek 6
Long Tail Analysis (2)
Source: Chris Anderson, The Long Tail, Wired, October 2004, http://www.wired.com/wired/archive/12.10/tail.html
November 2015 Real-World Data Warehouses / Thomas Zurek 7
Long Tail Analysis (3)
Source: Chris Anderson, The Long Tail, Wired, October 2004, http://www.wired.com/wired/archive/12.10/tail.html
November 2015 Real-World Data Warehouses / Thomas Zurek 8
BUSINESS INTELLIGENCE +
DATA WAREHOUSES
9November 2015 Real-World Data Warehouses / Thomas Zurek
Business Intelligence and Data Warehouses
• Business Intelligence (BI)
An environment in which business users conduct analyses that yield
overall understanding of where
 the business has been,
 where it is now, and
 where it will be in the near future (i.e. planning).
• Data Warehouse (DW)
 An implementation of an informational database used to collect,
integrate and provide sharable data sourced from multiple
operational databases for analyses.
 Provide data that is reliable, consistent, understandable.
 It typically serves as the foundation for a business intelligence
system.
November 2015 Real-World Data Warehouses / Thomas Zurek 10
Business Intelligence and Data Warehouses
11
Business Intelligence
OLAP, cubes, dimensions, measures, KPIs, scoreboards, dashboards,
pivot tables, data mining, predictive, slice & dice, planning, EPM,
analytics, …
Data Warehouse
connectivity, cleansing, scrubbing, ETL, ELT, EHL,
transformation, harmonisation,consistency,
compliance, auditing, big data, scalability, …
Operational
System
ERP, CRM, SCM, HR,
…
MetaData
security,models,…
November 2015 Real-World Data Warehouses / Thomas Zurek
Business Intelligence and Data Warehouses
12
Operational
System
ERP, CRM, SCM, HR,
…
MetaData
security,models,…
simply remember:
(1) BI and DW
(2) BI ≠ DW
Business Intelligence
OLAP, cubes, dimensions, measures, KPIs, scoreboards, dashboards,
pivot tables, data mining, predictive, slice & dice, planning, EPM,
analytics, …
Data Warehouse
connectivity, cleansing, scrubbing, ETL, ELT, EHL,
transformation, harmonisation,consistency,
compliance, auditing, big data, scalability, …
Focus today!
November 2015 Real-World Data Warehouses / Thomas Zurek
DATA WAREHOUSES
13November 2015 Real-World Data Warehouses / Thomas Zurek
Multiple Data Sources
Why are there so many DBs at an enterprise?
• business processes  data captured in some DB
• organisation reflected in system landscape
• geography reflected in system landscape
• smaller systems easier to manage than big systems
• mergers and acquisitions
• external data: market data, supplier data, …
• …
14November 2015 Real-World Data Warehouses / Thomas Zurek
A Typical Example for Business Processes in an Enterprise
15
source: http://thebankwatch.com/2006/09/13/simplifying-the-business-model/
November 2015 Real-World Data Warehouses / Thomas Zurek
Business
transform
End-user access / Presentation
Provide data
Data Acquisition
Harmonization
Data Propagation
Reporting / Analyses /
Planning
Main Service : Spot for apps/Delta to app/App recovery
Transform : Enriched || General Business logic
Content : Data source || Business domain specific
History : Determined by rebuild requirements of apps
Store : DSO(can be logical partitioned)
Main Service : Decouple, Fast load and distribute
Transform : 1:1
Content : 1 data source, All fields
History : 4 weeks
Store : PSA, DSO-WO.
Main Service : Integrated, harmonized
Transform : Harmonize quality assure (in flow|| lookup)
Content : Defined fields
History : Short or not at all || Long term
Store : Info source || IO/DSO/Z-table
Main Service : Make data available for reporting & planning tools
Transform : Application specific/(dis-)aggregate/lookup
Content : Application specific
History : Application specific
Store : IC,DSO, Info Set, Virtual Provider, Multi Provider.
A Typical Data Warehouse Architecture
Corp.
Memory
ODSBI Layer
Data Warehouse
Source 1 Source 2 Source 3 Source 4 Source 5
ProjectGovernanceITGovernance
November 2015 Real-World Data Warehouses / Thomas Zurek 16
Challenge 1: RELIABLE
• typical: data from 50-100 data sources
• availability of data sources not given
– system downtimes
– network failures
– example:
• availability per data source = 98%
• all 100 data sources available = 0.98**100 = 13%
• 1 out of 100 data sources not available = 1 – 0.13 = 87%
 all data in one place asserts reliable data access
17November 2015 Real-World Data Warehouses / Thomas Zurek
Challenge 2: CONSISTENT
• Assume: each data source is consistent!
• Is the union of all data sources consistent?
NO !
 In a DW, data gets synchronised and
harmonized to provide a consistent view
spanning multiple data sources.
18November 2015 Real-World Data Warehouses / Thomas Zurek
Examples Challenge 2: Transformation, Cleansing
• Jun 1, 2011 = 1.6.2011 = 06/01/11 = …
• VW Touareg = VW TOUAREG = [product] 87654 = …
• currency and unit conversions:
– box  kg
– €, $, £, ¥, …  €
• resolve ID clashes:
product 123 [in subsiduary A] ≠ product 123 [in subsiduary B]
• enrich data:
add attributes from source A to data from source B
19November 2015 Real-World Data Warehouses / Thomas Zurek
Examples Challenge 2: History / Time-Dependency
• data is time-dependent, e.g.
– employee A worked in department X in 2012
– employee A worked in department Y in 2013
– currency exchange rates
– current view vs historic view analysis
• versioning of meta data
– models change
– development  test  production
– auditing
20November 2015 Real-World Data Warehouses / Thomas Zurek
Automatisierte Überprüfung der Datenqualität in Form eines
Plausibility Gates
Single Point of Truth
Quelle 1 Quelle 2 Quelle ... Quelle n
Fachliche Überprüfung der Daten verringern den Administrationsaufwand und den
anschließenden „Ärger“
Harmonisierte Auswertungen
Plausibility Gate
UNSPSC-Code vorhanden?
RVO mit BVO-Bezug?
DUNS-Nummer vorhanden?
Größenordnung BVO/RVO?
Challenge 3: UNDERSTANDABLE
• texts for cryptic numbers
• multi-language support
• data provenance:
know where the data
originated
• auditing: track changes
• relevance:
show the user data from his "realm of command"
22November 2015 Real-World Data Warehouses / Thomas Zurek
LAYERED SCALABALE
ARCHITECTURE (LSA)
23November 2015 Real-World Data Warehouses / Thomas Zurek
Business
transform
End-user access / Presentation
Provide data
Data Acquisition
Harmonization
Data Propagation
Reporting / Analyses /
Planning
Main Service : Spot for apps/Delta to app/App recovery
Transform : Enriched || General Business logic
Content : Data source || Business domain specific
History : Determined by rebuild requirements of apps
Store : DSO(can be logical partitioned)
Main Service : Decouple, Fast load and distribute
Transform : 1:1
Content : 1 data source, All fields
History : 4 weeks
Store : PSA, DSO-WO.
Main Service : Integrated, harmonized
Transform : Harmonize quality assure (in flow|| lookup)
Content : Defined fields
History : Short or not at all || Long term
Store : Info source || IO/DSO/Z-table
Main Service : Make data available for reporting & planning tools
Transform : Application specific/(dis-)aggregate/lookup
Content : Application specific
History : Application specific
Store : IC,DSO, Info Set, Virtual Provider, Multi Provider.
A Typical Data Warehouse Architecture
Corp.
Memory
ODSBI Layer
Data Warehouse
Source 1 Source 2 Source 3 Source 4 Source 5
ProjectGovernanceITGovernance
November 2015 Real-World Data Warehouses / Thomas Zurek 24
Yet Another, Arbitrary Example …
25
Source: http://www.zentut.com/wp-content/uploads/2012/10/stand-alone-data-mart.jpg
November 2015 Real-World Data Warehouses / Thomas Zurek
The Layered Scalable Architecture (LSA)
• reference architecture for DW
• term introduced by SAP, but not SAP-specific
• layers:
– each layer has a certain task
– each layer has an associated service-level
– layers describe the step-wise refinement of data
• not every DW needs all LSA-layers
• modern technology allows to remove / merge layers as
less or no performance-motivated services are required
• more: http://tinyurl.com/sap-lsa
26November 2015 Real-World Data Warehouses / Thomas Zurek
BIG DATA +
DATA WAREHOUSING
28November 2015 Real-World Data Warehouses / Thomas Zurek
Big Data – The 3 "V"s
• Velocity  speed, parallelism
• Volume  scale
• Variety  many formats, file system
November 2015 Real-World Data Warehouses / Thomas Zurek 29
Big Data Example: Connected Cows
November 2015 Real-World Data Warehouses / Thomas Zurek 30
• estrus detection by counting
steps
– motion sensors for 40000 cows
• benefits for artifical
insemination:
– less labour intensive
– higher success rates:
45%  63%
– sex determination
• references:
– Strata Feb 2015:
http://tinyurl.com/connected-cows
http://www.fujitsu.com/global/about/resources/news/press-releases/2013/1015-01.html
RDBMS vs Big Data
Data Warehouses / RDBMS
• INSERT + UPDATE + DELETE
• prescreptive schema
– DDL
• strict notion of consistency
Big Data / HDFS
• INSERT (mostly)
• descriptive schema
– XML
– JSON
• sloppier due to scale of data
and number of systems
involved
November 2015 Real-World Data Warehouses / Thomas Zurek 31
Big Data + Data Warehouses
November 2015 Real-World Data Warehouses / Thomas Zurek 32
Source: Strata Feb 2015, San Jose, CA, US – Keynote by Amr Awadallah (Cloudera)
SUMMARY
33November 2015 Real-World Data Warehouses / Thomas Zurek
What You Should Take Away
1. Difference: BI vs DW vs Big Data
2. What are the problems that a DW handles?
3. How are those problems tackled?
34November 2015 Real-World Data Warehouses / Thomas Zurek
35November 2015 Real-World Data Warehouses / Thomas Zurek

Data Warehousing - in the real world

  • 1.
    Data Warehousing - inthe real world - Dr. Thomas Zurek @tfxz November 2015 Big Data und Analytische Applikationen
  • 2.
    Who am I? • Vice President of Development @ SAP for – HANA Data Warehousing (DW) – Enterprise Performance Management (EPM) – HANA Analytics • 18 years at SAP • PhD in Computer Science • Universities of Karlsruhe and Edinburgh 2November 2015 Real-World Data Warehouses / Thomas Zurek
  • 3.
    Agenda 1. Examples 2. BusinessIntelligence (BI) + Data Warehouses (DW) 3. Data Warehouses 4. Layered Scalable Architecture (LSA) 5. Big Data + Data Warehousing 6. Summary 3November 2015 Real-World Data Warehouses / Thomas Zurek
  • 4.
    EXAMPLES 4November 2015 Real-WorldData Warehouses / Thomas Zurek
  • 5.
    Examples of BusinessIntelligence Scenarios • fraud detection  retail company  point-of-sales data & given discounts  huge amounts of data  a prototypical BI question • long tail analysis  e-commerce companies like Amazon, Ebay, iTunes, Netflix, …  translate sales of popular products into (additional) sales in the long tail  BI integrated into operational processes November 2015 Real-World Data Warehouses / Thomas Zurek 5
  • 6.
    Long Tail Analysis(1) – An Amazon Example November 2015 Real-World Data Warehouses / Thomas Zurek 6
  • 7.
    Long Tail Analysis(2) Source: Chris Anderson, The Long Tail, Wired, October 2004, http://www.wired.com/wired/archive/12.10/tail.html November 2015 Real-World Data Warehouses / Thomas Zurek 7
  • 8.
    Long Tail Analysis(3) Source: Chris Anderson, The Long Tail, Wired, October 2004, http://www.wired.com/wired/archive/12.10/tail.html November 2015 Real-World Data Warehouses / Thomas Zurek 8
  • 9.
    BUSINESS INTELLIGENCE + DATAWAREHOUSES 9November 2015 Real-World Data Warehouses / Thomas Zurek
  • 10.
    Business Intelligence andData Warehouses • Business Intelligence (BI) An environment in which business users conduct analyses that yield overall understanding of where  the business has been,  where it is now, and  where it will be in the near future (i.e. planning). • Data Warehouse (DW)  An implementation of an informational database used to collect, integrate and provide sharable data sourced from multiple operational databases for analyses.  Provide data that is reliable, consistent, understandable.  It typically serves as the foundation for a business intelligence system. November 2015 Real-World Data Warehouses / Thomas Zurek 10
  • 11.
    Business Intelligence andData Warehouses 11 Business Intelligence OLAP, cubes, dimensions, measures, KPIs, scoreboards, dashboards, pivot tables, data mining, predictive, slice & dice, planning, EPM, analytics, … Data Warehouse connectivity, cleansing, scrubbing, ETL, ELT, EHL, transformation, harmonisation,consistency, compliance, auditing, big data, scalability, … Operational System ERP, CRM, SCM, HR, … MetaData security,models,… November 2015 Real-World Data Warehouses / Thomas Zurek
  • 12.
    Business Intelligence andData Warehouses 12 Operational System ERP, CRM, SCM, HR, … MetaData security,models,… simply remember: (1) BI and DW (2) BI ≠ DW Business Intelligence OLAP, cubes, dimensions, measures, KPIs, scoreboards, dashboards, pivot tables, data mining, predictive, slice & dice, planning, EPM, analytics, … Data Warehouse connectivity, cleansing, scrubbing, ETL, ELT, EHL, transformation, harmonisation,consistency, compliance, auditing, big data, scalability, … Focus today! November 2015 Real-World Data Warehouses / Thomas Zurek
  • 13.
    DATA WAREHOUSES 13November 2015Real-World Data Warehouses / Thomas Zurek
  • 14.
    Multiple Data Sources Whyare there so many DBs at an enterprise? • business processes  data captured in some DB • organisation reflected in system landscape • geography reflected in system landscape • smaller systems easier to manage than big systems • mergers and acquisitions • external data: market data, supplier data, … • … 14November 2015 Real-World Data Warehouses / Thomas Zurek
  • 15.
    A Typical Examplefor Business Processes in an Enterprise 15 source: http://thebankwatch.com/2006/09/13/simplifying-the-business-model/ November 2015 Real-World Data Warehouses / Thomas Zurek
  • 16.
    Business transform End-user access /Presentation Provide data Data Acquisition Harmonization Data Propagation Reporting / Analyses / Planning Main Service : Spot for apps/Delta to app/App recovery Transform : Enriched || General Business logic Content : Data source || Business domain specific History : Determined by rebuild requirements of apps Store : DSO(can be logical partitioned) Main Service : Decouple, Fast load and distribute Transform : 1:1 Content : 1 data source, All fields History : 4 weeks Store : PSA, DSO-WO. Main Service : Integrated, harmonized Transform : Harmonize quality assure (in flow|| lookup) Content : Defined fields History : Short or not at all || Long term Store : Info source || IO/DSO/Z-table Main Service : Make data available for reporting & planning tools Transform : Application specific/(dis-)aggregate/lookup Content : Application specific History : Application specific Store : IC,DSO, Info Set, Virtual Provider, Multi Provider. A Typical Data Warehouse Architecture Corp. Memory ODSBI Layer Data Warehouse Source 1 Source 2 Source 3 Source 4 Source 5 ProjectGovernanceITGovernance November 2015 Real-World Data Warehouses / Thomas Zurek 16
  • 17.
    Challenge 1: RELIABLE •typical: data from 50-100 data sources • availability of data sources not given – system downtimes – network failures – example: • availability per data source = 98% • all 100 data sources available = 0.98**100 = 13% • 1 out of 100 data sources not available = 1 – 0.13 = 87%  all data in one place asserts reliable data access 17November 2015 Real-World Data Warehouses / Thomas Zurek
  • 18.
    Challenge 2: CONSISTENT •Assume: each data source is consistent! • Is the union of all data sources consistent? NO !  In a DW, data gets synchronised and harmonized to provide a consistent view spanning multiple data sources. 18November 2015 Real-World Data Warehouses / Thomas Zurek
  • 19.
    Examples Challenge 2:Transformation, Cleansing • Jun 1, 2011 = 1.6.2011 = 06/01/11 = … • VW Touareg = VW TOUAREG = [product] 87654 = … • currency and unit conversions: – box  kg – €, $, £, ¥, …  € • resolve ID clashes: product 123 [in subsiduary A] ≠ product 123 [in subsiduary B] • enrich data: add attributes from source A to data from source B 19November 2015 Real-World Data Warehouses / Thomas Zurek
  • 20.
    Examples Challenge 2:History / Time-Dependency • data is time-dependent, e.g. – employee A worked in department X in 2012 – employee A worked in department Y in 2013 – currency exchange rates – current view vs historic view analysis • versioning of meta data – models change – development  test  production – auditing 20November 2015 Real-World Data Warehouses / Thomas Zurek
  • 21.
    Automatisierte Überprüfung derDatenqualität in Form eines Plausibility Gates Single Point of Truth Quelle 1 Quelle 2 Quelle ... Quelle n Fachliche Überprüfung der Daten verringern den Administrationsaufwand und den anschließenden „Ärger“ Harmonisierte Auswertungen Plausibility Gate UNSPSC-Code vorhanden? RVO mit BVO-Bezug? DUNS-Nummer vorhanden? Größenordnung BVO/RVO?
  • 22.
    Challenge 3: UNDERSTANDABLE •texts for cryptic numbers • multi-language support • data provenance: know where the data originated • auditing: track changes • relevance: show the user data from his "realm of command" 22November 2015 Real-World Data Warehouses / Thomas Zurek
  • 23.
    LAYERED SCALABALE ARCHITECTURE (LSA) 23November2015 Real-World Data Warehouses / Thomas Zurek
  • 24.
    Business transform End-user access /Presentation Provide data Data Acquisition Harmonization Data Propagation Reporting / Analyses / Planning Main Service : Spot for apps/Delta to app/App recovery Transform : Enriched || General Business logic Content : Data source || Business domain specific History : Determined by rebuild requirements of apps Store : DSO(can be logical partitioned) Main Service : Decouple, Fast load and distribute Transform : 1:1 Content : 1 data source, All fields History : 4 weeks Store : PSA, DSO-WO. Main Service : Integrated, harmonized Transform : Harmonize quality assure (in flow|| lookup) Content : Defined fields History : Short or not at all || Long term Store : Info source || IO/DSO/Z-table Main Service : Make data available for reporting & planning tools Transform : Application specific/(dis-)aggregate/lookup Content : Application specific History : Application specific Store : IC,DSO, Info Set, Virtual Provider, Multi Provider. A Typical Data Warehouse Architecture Corp. Memory ODSBI Layer Data Warehouse Source 1 Source 2 Source 3 Source 4 Source 5 ProjectGovernanceITGovernance November 2015 Real-World Data Warehouses / Thomas Zurek 24
  • 25.
    Yet Another, ArbitraryExample … 25 Source: http://www.zentut.com/wp-content/uploads/2012/10/stand-alone-data-mart.jpg November 2015 Real-World Data Warehouses / Thomas Zurek
  • 26.
    The Layered ScalableArchitecture (LSA) • reference architecture for DW • term introduced by SAP, but not SAP-specific • layers: – each layer has a certain task – each layer has an associated service-level – layers describe the step-wise refinement of data • not every DW needs all LSA-layers • modern technology allows to remove / merge layers as less or no performance-motivated services are required • more: http://tinyurl.com/sap-lsa 26November 2015 Real-World Data Warehouses / Thomas Zurek
  • 27.
    BIG DATA + DATAWAREHOUSING 28November 2015 Real-World Data Warehouses / Thomas Zurek
  • 28.
    Big Data –The 3 "V"s • Velocity  speed, parallelism • Volume  scale • Variety  many formats, file system November 2015 Real-World Data Warehouses / Thomas Zurek 29
  • 29.
    Big Data Example:Connected Cows November 2015 Real-World Data Warehouses / Thomas Zurek 30 • estrus detection by counting steps – motion sensors for 40000 cows • benefits for artifical insemination: – less labour intensive – higher success rates: 45%  63% – sex determination • references: – Strata Feb 2015: http://tinyurl.com/connected-cows http://www.fujitsu.com/global/about/resources/news/press-releases/2013/1015-01.html
  • 30.
    RDBMS vs BigData Data Warehouses / RDBMS • INSERT + UPDATE + DELETE • prescreptive schema – DDL • strict notion of consistency Big Data / HDFS • INSERT (mostly) • descriptive schema – XML – JSON • sloppier due to scale of data and number of systems involved November 2015 Real-World Data Warehouses / Thomas Zurek 31
  • 31.
    Big Data +Data Warehouses November 2015 Real-World Data Warehouses / Thomas Zurek 32 Source: Strata Feb 2015, San Jose, CA, US – Keynote by Amr Awadallah (Cloudera)
  • 32.
    SUMMARY 33November 2015 Real-WorldData Warehouses / Thomas Zurek
  • 33.
    What You ShouldTake Away 1. Difference: BI vs DW vs Big Data 2. What are the problems that a DW handles? 3. How are those problems tackled? 34November 2015 Real-World Data Warehouses / Thomas Zurek
  • 34.
    35November 2015 Real-WorldData Warehouses / Thomas Zurek