Is the traditional data warehouse dead?

With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and an RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.

Published in: Technology

  1. 1. Is the traditional data warehouse dead? James Serra Big Data Evangelist Microsoft JamesSerra3@gmail.com (Data Lake and Data Warehouse – the best of both worlds)
  2. 2. About Me • Microsoft, Big Data Evangelist • In IT for 30 years, worked on many BI and DW projects • Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer • Been perm employee, contractor, consultant, business owner • Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference • Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions • Blog at JamesSerra.com • Former SQL Server MVP • Author of book “Reporting with Microsoft SQL Server 2012”
  3. 3. Agenda • Data Warehouse • Data Lake • The best of both worlds • Federated querying • Patterns
  4. 4. Considering Data Types • Unstructured: audio, video, images. Meaningless without adding some structure • Semi-structured: JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure • Structured: CSV, columnar storage (Parquet, ORC). Strict data model structure. Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types.
  5. 5. Two Approaches to getting value out of data: Top-Down + Bottoms-Up • Top-down: Theory, Hypothesis, Observation, Confirmation. What happened? (Descriptive Analytics) Why did it happen? (Diagnostic Analytics) • Bottoms-up: Observation, Pattern, Theory, Hypothesis. What will happen? (Predictive Analytics) How can we make it happen? (Prescriptive Analytics)
  6. 6. Of course you still need a data warehouse. A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting. It acts as a central repository for many subject areas and contains the "single version of truth". Reasons for a data warehouse: • Reduce stress on production system • Optimized for read access, sequential disk scans • Integrate many sources of data • Keep historical records (no need to save hardcopy reports) • Restructure/rename tables and fields, model data • Protect against source system upgrades • Use Master Data Management, including hierarchies • No IT involvement needed for users to create reports • Improve data quality and plug holes in source systems • One version of the truth • Easy to create BI solutions on top of it (e.g. SSAS cubes)
  7. 7. Traditional Data Warehousing Uses A Top-Down Approach • Understand Corporate Strategy • Gather Requirements: Business Requirements, Technical Requirements • Implement Data Warehouse: Dimension Modelling, Physical Design, ETL Design, ETL Development, Reporting & Analytics Design, Reporting & Analytics Development, Setup Infrastructure, Install and Tune • Data sources
  8. 8. Traditional business analytics process: 1. Start with end-user requirements to identify desired reports and analysis. 2. Define corresponding database schema and queries. 3. Identify the required data sources. 4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to target schema (‘schema-on-write’). 5. Create reports. Analyze data. All data not immediately required is discarded or archived. Diagram labels: Relational LOB Applications; ETL pipeline; Dedicated ETL tools (e.g. SSIS); Defined schema; Queries; Results.
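
The schema-on-write process on this slide could be sketched roughly as follows. This is a minimal illustration only, not the deck's tooling: it uses Python's built-in `sqlite3` as a stand-in for the warehouse, and the `RAW_ORDERS` extract, the `fact_orders` table, and the `run_etl` helper are hypothetical names introduced for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from a LOB application.
RAW_ORDERS = """order_id,customer,amount,notes
1,Contoso,100.50,rush
2,Fabrikam,not-a-number,
3,Contoso,20.00,gift
"""


def run_etl(raw_csv: str, conn: sqlite3.Connection) -> int:
    """Extract rows, transform them to the target schema, and load them."""
    # Schema-on-write: the target schema is defined before any data is loaded.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    loaded = 0
    for row in csv.DictReader(io.StringIO(raw_csv)):
        try:
            record = (int(row["order_id"]), row["customer"].strip(), float(row["amount"]))
        except ValueError:
            continue  # rows that do not fit the schema are rejected
        # The "notes" column is not needed by the reports, so it is discarded.
        conn.execute("INSERT INTO fact_orders VALUES (?, ?, ?)", record)
        loaded += 1
    conn.commit()
    return loaded


conn = sqlite3.connect(":memory:")
rows_loaded = run_etl(RAW_ORDERS, conn)
# Reports then query the curated table.
total_by_customer = dict(
    conn.execute("SELECT customer, SUM(amount) FROM fact_orders GROUP BY customer")
)
```

Note how the malformed row and the unused `notes` column never reach the warehouse, which mirrors the slide's point that data not immediately required is discarded or archived.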
  9. 9. Harness the growing and changing nature of data. Need to collect any data: Streaming, Structured, Unstructured. Challenge is combining transactional data stored in relational databases with less structured data. Big Data = All Data. “Get the right information to the right people at the right time in the right format”
  10. 10. The three V’s
  11. 11. New big data thinking: All data has value. All data has potential value. Data hoarding. No defined schema; stored in native format. Schema is imposed and transformations are done at query time (schema-on-read). Apps and users interpret the data as they see fit. Diagram labels: Gather data from all sources; Store indefinitely; Analyze; See results; Iterate.
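
Schema-on-read, as described on this slide, might look something like the sketch below. It assumes the raw data is newline-style JSON kept exactly as it arrived; the `RAW_EVENTS` sample and the two reader functions are hypothetical and only illustrate that each consumer imposes its own schema at query time.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema imposed).
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2019-01-01T00:00:00"}',
    '{"device": "sensor-2", "temperature": "70F"}',
    '{"user": "alice", "action": "login"}',
]


def read_temperatures(raw_lines):
    """Impose a (device, temp_c) schema at query time; skip records that lack it."""
    for line in raw_lines:
        event = json.loads(line)
        if "device" in event and "temp_c" in event:
            yield event["device"], float(event["temp_c"])


def read_actions(raw_lines):
    """A different consumer interprets the same raw data with a different schema."""
    for line in raw_lines:
        event = json.loads(line)
        if "user" in event:
            yield event["user"], event.get("action", "unknown")


temperatures = list(read_temperatures(RAW_EVENTS))
actions = list(read_actions(RAW_EVENTS))
```

Nothing is rejected at load time here. The `sensor-2` record that does not match the temperature reader's schema stays in the store and remains available to any future reader that understands it.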
  12. 12. The “data lake” Uses A Bottoms-Up Approach. Ingest all data regardless of requirements. Store all data in native format without schema definition. Do analysis using analytic engines like Hadoop. Diagram labels: Devices; Interactive queries; Batch queries; Machine Learning; Data warehouse; Real-time analytics.
  13. 13. Data Analysis Paradigm Shift OLD WAY: Structure -> Ingest -> Analyze NEW WAY: Ingest -> Analyze -> Structure
  14. 14. Exactly what is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Collect all data “just in case” • Store data with no modeling – “Schema on read” • Complements EDW • Frees up expensive EDW resources • Quick user access to data • ETL Hadoop tools • Easily scalable • Place to back up data • Place to move older data
  15. 15. Needs data governance so your data lake does not turn into a data swamp!
  16. 16. The real cost of Hadoop https://www.scribd.com/document/172491475/WinterCorp-Report-Big-Data-What-Does-It-Really-Cost/
  17. 17. A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
  18. 18. • Query performance not as good as relational database • Complex query support not good due to lack of query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing • Concurrency limitations • No concept of “hot” and “cold” data storage with different levels of performance to reduce cost • Not a DBMS, so it lacks features such as update/delete of data, referential integrity, statistics, ACID compliance, data security • File based so no granular security definition at the column level • No metadata stored in HDFS, so another tool is required, adding complexity and slowing performance • Finding expertise in Hadoop is very difficult • Super complex, with lots of integration with multiple technologies to make it work • Many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard • Lack of master data management tools for Hadoop • Requires end-users to learn new reporting tools and Hadoop technologies to query the data • Pace of change is so quick many Hadoop technologies become obsolete, adding risk • Lack of cost savings: cloud consumption, support, licenses, training, and migration costs • Need conversion process to convert data to a relational format if a reporting tool requires it • Some reporting tools don’t work against Hadoop
  19. 19. Current state of a data warehouse: Traditional Approaches. DATA SOURCES (CRM, ERP, OLTP, LOB): Well manicured, often relational sources; known and expected data volume and formats; little to no change. ETL: Complex, rigid transformations; required extensive monitoring; transformed historical into read structures. DATA WAREHOUSE: Star schemas, views, other read-optimized structures. BI AND ANALYTICS: Emailed, centrally stored Excel reports and dashboards; flat, canned or multi-dimensional access to historical data; many reports, multiple versions of the truth; 24 to 48h delay. MONITORING AND TELEMETRY.
  20. 20. Current state of a data warehouse: Traditional Approaches. DATA SOURCES (CRM, ERP, OLTP, LOB): Increase in variety of data sources; increase in data volume; increase in types of data; pressure on the ingestion engine. ETL: Complex, rigid transformations can no longer keep pace; monitoring is abandoned; delay in data, inability to transform volumes, or react to new sources; repair, adjust and redesign ETL. DATA WAREHOUSE: Star schemas, views, other read-optimized structures. BI AND ANALYTICS: Emailed, centrally stored Excel reports and dashboards; reports become invalid or unusable; delay in preserved reports increases; users begin to “innovate” to relieve starvation. MONITORING AND TELEMETRY. Pressures: INCREASING DATA VOLUME, NON-RELATIONAL DATA, INCREASE IN TIME, STALE REPORTING.
  21. 21. Data Lake Transformation (ELT not ETL): New Approaches. DATA SOURCES (CRM, ERP, OLTP, LOB), NON-RELATIONAL DATA, FUTURE DATA SOURCES: All data sources are considered; leverages the power of on-prem technologies and the cloud for storage and capture; native formats, streaming data, big data. EXTRACT AND LOAD: Extract and load, no/minimal transform. DATA LAKE: Storage of data in near-native format; orchestration becomes possible; streaming data accommodation becomes possible. DATA REFINERY PROCESS (TRANSFORM ON READ): Transform relevant data into data sets; refineries transform data on read; produce curated data sets to integrate with traditional warehouses. DATA WAREHOUSE: Star schemas, views, other read-optimized structures. BI AND ANALYTICS: Discover and consume predictive analytics, data sets and other reports; users discover published data sets/services using familiar tools.
  22. 22. Data Lake + Data Warehouse Better Together Data sources What happened? Descriptive Analytics Diagnostic Analytics Why did it happen? What will happen? Predictive Analytics Prescriptive Analytics How can we make it happen?
  23. 23. Modern Data Warehouse • Ultimate goal • Supports future data needs • Data harmonized and analyzed in the data lake or moved to EDW for more quality and performance
  24. 24. Data Lake | Data Warehouse
      Schema-on-read | Schema-on-write
      Physical collection of uncurated data | Data of common meaning
      System of Insight: Unknown data to do experimentation / data discovery | System of Record: Well-understood data to do operational reporting
      Any type of data | Limited set of data types (i.e. relational)
      Skills are limited | Skills mostly available
      All workloads – batch, interactive, streaming, machine learning | Optimized for interactive querying
      Complementary to DW | Can be sourced from Data Lake
  25. 25. Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (Self-service BI) • Know questions
  26. 26. Use cases using Hadoop and a DW in combination Bringing islands of Hadoop data together Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage) Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use) Importing Hadoop data into data warehouse (copy) (Hadoop as staging area, sandbox, Data Lake)
  27. 27. Reasons you still need a cube/OLAP • Semantic layer • Handle many concurrent users • Aggregating data for performance • Multidimensional analysis • No joins or relationships • Hierarchies, KPIs • Row-level security • Advanced time-calculations • Slowly Changing Dimensions (SCD)
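
As a rough illustration of "aggregating data for performance" over a hierarchy, the sketch below pre-computes totals at every level of a year > month hierarchy so that a query becomes a dictionary lookup rather than a scan. It is a toy stand-in, not how SSAS or any real OLAP engine is implemented; the `FACTS` rows and `build_cube` helper are hypothetical, and `None` is used here to mean "all members" of a level.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, sales).
FACTS = [
    (2018, 1, "East", 100.0),
    (2018, 1, "West", 50.0),
    (2018, 2, "East", 25.0),
    (2019, 1, "East", 10.0),
]


def build_cube(facts):
    """Pre-aggregate sales at each level of a year > month hierarchy, per region and overall."""
    cube = defaultdict(float)
    for year, month, region, sales in facts:
        # Roll each fact up to month, year, and all-time levels.
        for time_key in ((year, month), (year, None), (None, None)):
            # Roll each fact up to its own region and to all regions.
            for region_key in (region, None):
                cube[(time_key, region_key)] += sales
    return dict(cube)


cube = build_cube(FACTS)
sales_2018 = cube[((2018, None), None)]          # all regions, whole year
sales_east_jan_2018 = cube[((2018, 1), "East")]  # one region, one month
grand_total = cube[((None, None), None)]         # everything
```

The trade-off is the usual one for cubes: more storage and processing up front in exchange for fast, join-free answers for many concurrent users.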
  28. 28. Federated Querying
  29. 29. Federated Querying. Other names: data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse. A model that allows a single query to retrieve and combine data where it sits across multiple data sources, so that there is no need to use ETL or to learn more than one retrieval technology.
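
The idea of a single query spanning several sources can be illustrated with a small stand-in. The sketch below is not PolyBase or any data virtualization product: it uses SQLite's `ATTACH DATABASE` from Python's built-in `sqlite3` module so that one SQL statement joins tables living in two separate databases. The `crm` and `erp` schemas, their tables, and the sample rows are hypothetical.

```python
import sqlite3

# Two separate in-memory SQLite databases stand in for independent source systems.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS erp")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE erp.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Contoso"), (2, "Fabrikam")]
)
conn.executemany(
    "INSERT INTO erp.invoices VALUES (?, ?)", [(1, 100.0), (1, 50.0), (2, 75.0)]
)

# One query combines data from both sources in place; nothing is copied via ETL.
federated_result = conn.execute(
    "SELECT c.name, SUM(i.amount) FROM crm.customers AS c "
    "JOIN erp.invoices AS i ON i.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
```

A real federated engine does the same thing across heterogeneous systems (for example relational and Hadoop sources), which is where query pushdown and performance become the hard parts.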
  30. 30. SQL Server and PolyBase Query relational and non-relational data with T-SQL
  31. 31. Advanced Analytics. Sources: Social, LOB, Graph, IoT, Image, CRM. Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE. Functions: Data orchestration and monitoring; Big data store; Hadoop/Spark and machine learning; Data warehouse; Cloud Bursting; BI + Reporting. Services: Azure Data Factory, Azure Blob Storage, Azure Databricks, Azure Data Lake, Azure HDInsight, Azure Machine Learning, Machine Learning Server, Azure SQL Data Warehouse, Azure Analysis Services.
  32. 32. Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE. Sources: Logs, files and media (unstructured); Business/custom apps (Structured). Components: Azure Data Factory, Azure Data Lake Store, Azure Databricks, Azure HDInsight, Data Lake Analytics, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Analytical dashboards.
  33. 33. Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE. Sources: Logs, files and media (unstructured); Business/custom apps (Structured). Components: Azure Data Lake Store, Hortonworks, PolyBase, Azure SQL Data Warehouse, Azure SQL Database, Tableau Server, Analytical dashboards, Operational Reports, Ad-Hoc Query.
  34. 34. https://aka.ms/ADAG
  35. 35. Q & A ? James Serra, Big Data Evangelist Email me at: JamesSerra3@gmail.com Follow me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
