Data Vault 2.0: Big Data Meets Data Warehousing

Presented at All Things Open
Presented by Dean Hallman with WireSoft, LLC
10/22/18 - 3:15 PM - Databases

Published in: Technology

  1. 1. DATA VAULT 2.0: Big Data Meets Data Warehousing DEAN HALLMAN WIRESOFT, LLC
  2. 2. DATA WAREHOUSING VS BIG DATA • Does Big Data replace Data Warehousing, or do I need both? • What’s the difference: • Between the data flowing into a data warehouse and the data flowing into big data tools? • Between the ingestion processes and infrastructure? • Data Lakes arrived with Big Data, so are they useful in Data Warehousing? • How should I model my data in the EDW? • 3NF, Star Schema, same as my operational data stores? • Data Vault 2.0 • Graph Databases • What architecture allows both to coexist effectively?
  3. 3. Architecture diagram labels: Impressions (Big Data); clickstream (SerDe); Internet; Clients; External Data Sources; Core Business Services (x3); Operational Data Stores; CDC, snapshot; Data Source; Landing; Data Lake; Big Data Toolchain; Batch (SerDe); Streaming (Kafka); Streaming Analytics; Batch Analytics (Hadoop); Schema-on-Read; Schema-on-Write; ETL; ELT; Enterprise Data Warehouse; Staging Vault; Raw Vault; Business Vault; Information Mart; BI Tools; Monitoring; Discovery; Audit
  4. 4. THE DATA MODEL (section title over the architecture diagram from slide 3)
  5. 5. DATA VAULT 2.0 COMMON FOUNDATIONAL WAREHOUSE ARCHITECTURE • “The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise” -- Dan Linstedt, Creator of Data Vault • Data loaded as-is from sources, no edits or cleanup • Append-only to afford highest performance • Agile & agnostic to changes in the operational store’s data model • Essentially, a prescription for Layered Graph to Relational Mapping
  6. 6. DATA WAREHOUSING & DATA VAULT 2.0 • 60’s, 70’s, 80’s: • E.F. Codd => 3NF • Bill Inmon invents Data Warehousing concept • Dr. Ralph Kimball popularizes Star Schema design • 90’s, 00’s: • Dan Linstedt creates Data Vault Model @ DOD • 2014: • Dan Linstedt introduces Data Vault 2.0
  7. 7. Source: “What are Graph Databases and Why should I care?“, by Dave Bechberger of Expero
  8. 8. SOLVE BY STAR SCHEMA ?
  9. 9. RELATIONAL VS GRAPH DATABASES • Enterprise Grade • Well-worn path • SQL has been relatively stagnant vs programming languages
  10. 10. GRAPH DATA MODEL Source: https://neo4j.com/developer/graph-database/
  11. 11. GRAPH DATABASE VS DATA VAULT
  12. 12. GRAPH DATABASE VS DATA VAULT
  13. 13. Flight (Base, Dest, Forecast) | Record Source, LoadDate, Depart, Gate: LGA, 2018-10-11, 1:25PM, B27; CAE, 2018-10-24, 3:30PM, A14; SFO, 2018-09-06, 8:55PM, G19; RDU, 2018-08-12, 4:45PM, C22 | SERVICED_BY | Record Source: Airport CAE; Load Date: 2018-11-17; Source Id: 20181117-32-983 | Aircraft (Base, Service, FAA, NTSB) | Record Source, LoadDate, Model, Tailno: United, 2017-02-11, 767, 1477; Delta, 2015-11-04, A6, 2381; Alaska, 2013-08-28, 747, 8312; Frontier, 2016-07-19, 182, 1438 | Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c | SERVICED_BY (Base, Dest, Manifest) | Record Source, LoadDate, Begin, End: United, 2017-02-11, 2017-04-23, 2017-09-23; Delta, 2015-11-04, 2015-12-01, 2017-04-22; Alaska, 2013-08-28, 2013-09-14, 2016-05-04; Frontier, 2016-07-19, 2016-08-02, 2018-04-11 | Record Source: United Airlines; Load Date: 2018-09-17 | Legend: Hubs, Links, Satellites, Tab
  14. 14. • Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations - Mel Conway
  15. 15. FLIGHT (Base, Dest, Forecast) | Record Source, LoadDate, Depart, Gate: LGA, 2018-10-11, 1:25PM, B27; CAE, 2018-10-24, 3:30PM, A14 | FLIGHT | Record Source: Airport CAE; Load Date: 2018-11-17; Source Id: 20181117-32-983 | Aircraft (Base, Service, FAA, NTSB) | Record Source, LoadDate, Model, Tailno: United, 2017-02-11, 767, 1477; Delta, 2015-11-04, A6, 2381; Alaska, 2013-08-28, 747, 8312; Frontier, 2016-07-19, 182, 1438 | Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c | Airport (Base, Dest, Manifest) | Record Source, LoadDate, Begin, End: United, 2017-02-11, 2017-04-23, 2017-09-23; Delta, 2015-11-04, 2015-12-01, 2017-04-22; Alaska, 2013-08-28, 2013-09-14, 2016-05-04; Frontier, 2016-07-19, 2016-08-02, 2018-04-11 | Record Source: United Airlines; Load Date: 2018-09-17 | Airline (Base, Service, FAA, NTSB) | Record Source, LoadDate, Model, Tailno: United, 2017-02-11, 767, 1477; Delta, 2015-11-04, A6, 2381 | Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c | Legend: Hubs, Links, Satellites, Tab
  16. 16. Source: https://www.wherescape.com/solutions/project-types/data-vault-automation/
  17. 17. • Modeled after self-organizing networks • A Business Key identifies a key concept in the business • Business keys have a business meaning • They are unique and have a very low propensity to change • Business keys change only when the business changes • Enables (forces) cross-source modeling Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
  18. 18. Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
  19. 19. Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
  20. 20. DATA VAULT 2.0 MODELING: HUBS, LINKS & SATELLITES
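A minimal sketch of the three Data Vault table types named on this slide, assuming a hub holds a business key, a link relates hubs, and a satellite holds descriptive attributes stamped with a record source and load date. The class names, the `hash_key` helper, and the choice of MD5 over a normalized business key are illustrative assumptions, not taken from the slides. The sample values (tail number 1477, model 767, airport CAE) are borrowed from the slide 13 example, and pairing them in one link is hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def hash_key(*business_keys: str) -> str:
    """Derive a deterministic key from one or more business keys.

    Assumption: keys are trimmed, upper-cased, joined with '|', then hashed.
    """
    normalized = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Hub:
    key: str            # hash of the business key
    business_key: str   # e.g. a tail number or an airport code
    record_source: str
    load_date: date


@dataclass(frozen=True)
class Link:
    key: str            # hash of the participating business keys
    hub_keys: tuple     # keys of the hubs this link relates
    record_source: str
    load_date: date


@dataclass(frozen=True)
class Satellite:
    parent_key: str     # key of the hub or link being described
    load_date: date
    record_source: str
    attributes: tuple   # (name, value) pairs, stored as received


# Hubs: one row per business key.
aircraft = Hub(hash_key("1477"), "1477", "United", date(2017, 2, 11))
airport = Hub(hash_key("CAE"), "CAE", "Airport CAE", date(2018, 11, 17))

# Link: relates the two hubs (hypothetical pairing of slide values).
serviced_by = Link(
    hash_key("1477", "CAE"),
    (aircraft.key, airport.key),
    "United Airlines",
    date(2018, 1, 17),
)

# Satellite: descriptive attributes hang off the hub, never inside it.
model_sat = Satellite(aircraft.key, date(2017, 2, 11), "United", (("Model", "767"),))
```

Keeping descriptive attributes out of the hub is what lets new sources add satellites without altering existing tables.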
  21. 21. @wiresoft/Pathfinder
  22. 22. THE DATA: Impressions vs Business Data (section title over the architecture diagram from slide 3)
  23. 23. ENTERPRISE DATA SILOS • Small Data • Large Data • Big Data • Describes the user base • Describes the Enterprise • Describes the Product
  24. 24. DATA GRANULARITY FUNNEL (diagram labels: Instance Grain; Transaction Grain; Audit Grain; Impression Grain; Big Data; Enterprise Data Warehouse; Operational Data Stores; Impression Analytics; Business Analytics; External Data Sources)
  25. 25. DATA INGESTION: ETL vs ELT vs SerDe (section title over the architecture diagram from slide 3)
  26. 26. ETL VS ELT VS SerDe • Beware the Turing tar-pit, in which everything is possible, but nothing of interest is easy - Alan Perlis
  27. 27. DATA CLASSIFICATION MATRIX: DECLARATIVE VS INTERPRETIVE (matrix labels: Declarative; Interpretive; Hadoop; RDBMS; Web Events; Media Player)
  28. 28. DATA WAREHOUSING • Deep Topic • 60’s, 70’s, 80’s: • E.F. Codd => 3NF • Bill Inmon invents Data Warehousing concept • Dr. Ralph Kimball popularizes Star Schema design • 90’s, 00’s: • Dan Linstedt creates Data Vault Model @ DOD • 2014: • Dan Linstedt introduces Data Vault 2.0 • Data Warehouse vs Operational Data Stores • Data Warehouse as Version Control System BIG DATA • MapReduce, 2004, Google: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”; GFS • Nutch 2005, Hadoop 2006, 2007 - Doug Cutting • What exactly is “Big Data”?
  29. 29. UNSTRUCTURED USER EXPERIENCE (diagram labels: Client; User; Interpreter; Analysis; L, L_n, L_i; lossy)
  30. 30. STRUCTURED USER EXPERIENCE (diagram labels: Client; User; Time Series Event Record; Analysis; L_p, L_e; lossless)
  31. 31. ETL OR SERDE? (diagram labels: Client; User; Serializer; Internet; Kafka/Kinesis; S3; Hadoop; Deserializer; Time Series Event Record; Analysis; Eventlog.e; Eventlog.d; Single Source (Version Locked); L_p, L_e, L_d, L_m)
  32. 32. ETL (Schema On Write) vs ELT (SerDe) (Schema On Read) Source: https://www.ironsidegroup.com/2015/03/01/etl-vs-elt-whats-the-big-difference/
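The contrast between schema on write and schema on read on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's pipeline: the function names, the two-field schema, and the `UA100` flight value are hypothetical, and plain Python lists stand in for the warehouse and the data lake.

```python
import json

# Hypothetical two-field schema used by both paths.
REQUIRED_FIELDS = {"flight": str, "gate": str}


def write_with_schema(store: list, record: dict) -> None:
    """Schema on write: validate and shape the record before storing it."""
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(name), expected):
            raise ValueError(f"field {name!r} is missing or not {expected.__name__}")
    store.append({name: record[name] for name in REQUIRED_FIELDS})


def land_raw(lake: list, payload: str) -> None:
    """Schema on read, step 1: land the payload unchanged."""
    lake.append(payload)


def read_with_schema(lake: list) -> list:
    """Schema on read, step 2: apply structure only at query time."""
    rows = []
    for payload in lake:
        try:
            doc = json.loads(payload)
        except json.JSONDecodeError:
            continue  # unparseable payloads stay in the lake, skipped here
        if isinstance(doc, dict):
            rows.append({name: doc.get(name) for name in REQUIRED_FIELDS})
    return rows


warehouse, lake = [], []
write_with_schema(warehouse, {"flight": "UA100", "gate": "B27"})
land_raw(lake, '{"flight": "UA100", "gate": "B27", "extra": 1}')
land_raw(lake, "not json at all")
rows = read_with_schema(lake)
```

In the write path a bad record is rejected at load time. In the read path everything lands, and the cost of interpretation moves to each reader.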
  33. 33. OTHER CHALLENGES • Satellites must be loaded chronologically • Time-based scheduling vs data-availability scheduling
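One way to see why satellites must be loaded chronologically: if a satellite appends a row only when attributes differ from the latest stored row, then loading out of order can hide or misplace a change. The sketch below assumes that change-detection behavior, which the slide does not spell out. The function name, the tuple layout, and the sample gate values are illustrative.

```python
from datetime import date


def load_satellite(satellite: list, incoming: list) -> None:
    """Append (parent_key, load_date, attributes) rows in load-date order.

    Assumptions: the satellite is append-only and already in load-date
    order, and a row is written only when attributes differ from the
    latest stored row for the same parent key.
    """
    latest = {}
    for parent_key, load_date, attrs in satellite:
        latest[parent_key] = (load_date, attrs)

    # Sort the batch so each row is compared against its true predecessor.
    for parent_key, load_date, attrs in sorted(incoming, key=lambda row: row[1]):
        current = latest.get(parent_key)
        if current is not None and load_date <= current[0]:
            raise ValueError(f"out-of-order load for {parent_key!r}: {load_date}")
        if current is None or current[1] != attrs:
            satellite.append((parent_key, load_date, attrs))
            latest[parent_key] = (load_date, attrs)


flight_sat = []
batch = [
    ("LGA-flight", date(2018, 10, 13), {"gate": "B27"}),  # arrives first, is newest
    ("LGA-flight", date(2018, 10, 11), {"gate": "B27"}),
    ("LGA-flight", date(2018, 10, 12), {"gate": "A14"}),
]
load_satellite(flight_sat, batch)
gates = [attrs["gate"] for _, _, attrs in flight_sat]
```

Because the batch is sorted before comparison, the gate history B27, A14, B27 is preserved. Processing the batch in arrival order would compare the 2018-10-11 row against the 2018-10-13 row and drop it as unchanged. This is also why data-availability scheduling matters: a late-arriving older batch cannot simply be appended after a newer one.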
  34. 34. QUESTIONS? • Contact: Dean Hallman • rdhallman@gmail.com • LinkedIn: https://www.linkedin.com/in/dean-hallman/
