Successfully reported this slideshow.
Your SlideShare is downloading. ×

Data Lakehouse Symposium | Day 4

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

Check these out next

1 of 74 Ad

Data Lakehouse Symposium | Day 4

Download to read offline

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.

Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.

Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.

This is an educational event.

Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.

Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.

Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.

This is an educational event.

Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Advertisement
Advertisement

More Related Content

Slideshows for you (20)

Similar to Data Lakehouse Symposium | Day 4 (20)

Advertisement

More from Databricks (20)

Advertisement

Recently uploaded (20)

Data Lakehouse Symposium | Day 4

  1. 1. ©2022 Databricks Inc. — All rights reserved The Databricks Lakehouse Platform Modern data, analytics and AI workloads 1
  2. 2. ©2022 Databricks Inc. — All rights reserved Jason Pohl Principal Solutions Architect Sean Owen Principal Solutions Architect Presenters
  3. 3. ©2022 Databricks Inc. — All rights reserved The future is here ...it’s just not evenly distributed 3 83% 85% $3.9T 87% of CEOs say AI is a strategic priority of big data projects fail in business value created by AI in 2022 of data science projects never make it into production
  4. 4. ©2022 Databricks Inc. — All rights reserved The Data and AI Maturity Curve From descriptive to prescriptive Analytics Maturity Competitive Advantage Raw Data Cleaned Data Reports Ad Hoc Queries Data Exploration Predictive Modeling Prescriptive Analytics What happened? Why did it happen? What will happen? How can we make it happen?
  5. 5. ©2022 Databricks Inc. — All rights reserved Modern Data Teams 5 Data Engineers Data Scientists Data Analysts
  6. 6. ©2022 Databricks Inc. — All rights reserved The vast majority of companies struggle to get this right
  7. 7. ©2022 Databricks Inc. — All rights reserved Most enterprises struggle with data Siloed stacks increase data architecture complexity Data Warehousing Data Engineering Streaming Data Science and ML Extract Load Transform Streaming data sources Streaming Data Engine Real-time Database Analytics and BI Data marts Data warehouse Structured data Structured, semi- structured and unstructured data Data Lake Data prep Data Lake Machine Learning Data Science Structured, semi- structured and unstructured data 7
  8. 8. ©2022 Databricks Inc. — All rights reserved Most enterprises struggle with data Disconnected systems and proprietary data formats make integration difficult Siloed stacks increase data architecture complexity Data Warehousing Data Engineering Streaming Data Science and ML Amazon Redshift Teradata Azure Synapse Google BigQuery Snowflake IBM Db2 SAP Oracle Autonomous Data Warehouse Hadoop Apache Airflow Amazon EMR Apache Spark Google Dataproc Cloudera Jupyter Amazon SageMaker Azure ML Studio MatLAB Domino Data Labs SAS TensorFlow PyTorch Apache Kafka Apache Spark Apache Flink Amazon Kinesis Azure Stream Analytics Google Dataflow Tibco Spotfire Confluent Extract Load Transform Streaming data sources Streaming Data Engine Real-time Database Analytics and BI Data marts Data warehouse Structured data Structured, semi- structured and unstructured data Data Lake Data prep Data Lake Machine Learning Data Science Structured, semi- structured and unstructured data 8
  9. 9. ©2022 Databricks Inc. — All rights reserved Most enterprises struggle with data Siloed data teams decrease productivity Disconnected systems and proprietary data formats make integration difficult Siloed stacks increase data architecture complexity Data Warehousing Data Engineering Streaming Data Science and ML Data Analysts Data Engineers Data Engineers Data Scientists Amazon Redshift Teradata Azure Synapse Google BigQuery Snowflake IBM Db2 SAP Oracle Autonomous Data Warehouse Hadoop Apache Airflow Amazon EMR Apache Spark Google Dataproc Cloudera Jupyter Amazon SageMaker Azure ML Studio MatLAB Domino Data Labs SAS TensorFlow PyTorch Apache Kafka Apache Spark Apache Flink Amazon Kinesis Azure Stream Analytics Google Dataflow Tibco Spotfire Confluent Extract Load Transform Streaming data sources Streaming Data Engine Real-time Database Analytics and BI Data marts Data warehouse Structured data Structured, semi- structured and unstructured data Data Lake Data prep Data Lake Machine Learning Data Science Structured, semi- structured and unstructured data 9
  10. 10. ©2022 Databricks Inc. — All rights reserved Where does all this complexity come from? data warehouses vs. data lakes
  11. 11. ©2022 Databricks Inc. — All rights reserved 11 Data Warehouse Data Lake vs.
  12. 12. ©2022 Databricks Inc. — All rights reserved Warehouses and lakes create complexity Two separate copies of the data Warehouses Proprietary Lakes Open Incompatible interfaces Warehouses SQL Lakes Python Incompatible security and governance models Warehouses Tables Lakes Files
  13. 13. ©2022 Databricks Inc. — All rights reserved Data Warehouse Data Lake Streaming Analytics BI Data Science Machine Learning Structured, Semi-Structured and Unstructured Data Data Lakehouse One platform to unify all of your data, analytics, and AI workloads
  14. 14. ©2022 Databricks Inc. — All rights reserved The data lakehouse offers a better path Lake-first approach that builds upon where the freshest, most complete data resides AI/ML from the ground up Single approach to managing data Support for all use cases on a single platform: • Data engineering • Data warehousing • Real time streaming • Data science and ML Built on open source and open standards High reliability and performance Data processing and management built on open source and open standards Common security, governance, and administration Modern Data Engineering Analytics and Data Warehousing Data Science and ML Integrated and collaborative role-based experiences with open API’s Cloud Data Lake Structured, semi-structured, and unstructured data Multi-cloud, work with your cloud of choice
  15. 15. ©2022 Databricks Inc. — All rights reserved Databricks The Data Lakehouse Foundation Modern Data Engineering on the Lakehouse Analytics & Data Warehousing on the Lakehouse Governance on the Lakehouse 1 2 3 4 Machine Learning on the Lakehouse 5
  16. 16. ©2022 Databricks Inc. — All rights reserved The Data Lakehouse Foundation 16 1
  17. 17. ©2021 Databricks Inc. — All rights reserved An open approach to bringing data management and governance to data lakes Better reliability with transactions 48x faster data processing with indexing Data governance at scale with fine- grained access control lists Data Warehouse Data Lake
  18. 18. ©2022 Databricks Inc. — All rights reserved Delta Lake solves challenges with data lakes RELIABILITY & QUALITY PERFORMANCE & LATENCY GOVERNANCE ACID transactions Advanced indexing & caching Governance with Data Catalogs
  19. 19. ©2022 Databricks Inc. — All rights reserved Delta Lake adoption Already >50% of Databricks workload Broad industry support Ingest from Query from Store data in
  20. 20. ©2022 Databricks Inc. — All rights reserved What Delta Lake can do for you Scale data insights throughout your organization with a simplified solution Provide best price/performance Enable a multi-cloud, secure infrastructure The foundation of your lakehouse
  21. 21. ©2022 Databricks Inc. — All rights reserved Demo: Delta Lake 21
  22. 22. ©2022 Databricks Inc. — All rights reserved Modern Data Engineering on the Lakehouse 22 2
  23. 23. ©2022 Databricks Inc. — All rights reserved Modern data engineering on the lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering Analytics and Data Warehousing Data Science and ML
  24. 24. ©2022 Databricks Inc. — All rights reserved Problems with Today’s Architectures Data reliability suffers: • Multiple storage systems with different semantics, SQL dialects, etc • Extra ETL steps that can go wrong Timeliness suffers: • Extra ETL steps before data available in DW High cost: • Continuous ETL, duplicated storage BI Data Science Machine Learning Structured, Semi-structured & Unstructured Data Reports Data Warehouses ETL Data Lake Cheap to store all the data, but the 2-tier architecture is much more complex!
  25. 25. ©2022 Databricks Inc. — All rights reserved Lakehouse Systems Implement data warehouse management and performance features on top of directly-accessible data in open formats Structured, Semi & Unstructured Data Data Lake Storage BI Data Science Machine Learning Reports Management & Performance Layer ETL Can we get state-of-the-art performance & governance features with this design? Cheap storage in open formats for all data Direct access to data files SQL
  26. 26. ©2022 Databricks Inc. — All rights reserved Modern data engineering on the lakehouse Data Engineering on the Databricks Lakehouse Platform Open format storage Data transformation Scheduling & orchestration Automatic deployment & operations BI / Reporting Dashboarding Machine Learning / Data Science Data & ML Sharing Data Products Databases Streaming Sources Cloud Object Stores SaaS Applications NoSQL On-premises Systems Data Sources Data Consumers Observability, lineage, and end-to-end pipeline visibility Data quality management Data ingestion
  27. 27. ©2022 Databricks Inc. — All rights reserved 27 Building the foundation of a Lakehouse Greatly improve the quality of your data for end users BRONZE Filtered, Cleaned, Augmented SILVER Business-level Aggregates GOLD Raw Ingestion and History
  28. 28. ©2022 Databricks Inc. — All rights reserved 28 Building the foundation of a Lakehouse Greatly improve the quality of your data for end users BRONZE Filtered, Cleaned, Augmented SILVER Business-level Aggregates GOLD Raw Ingestion and History Data Lake CSV, JSON, TXT… Kinesis Quality BI & Reporting Streaming Analytics Data Science & ML
  29. 29. ©2022 Databricks Inc. — All rights reserved But the reality is not so simple Maintaining data quality and reliability at scale is complex and brittle Data Lake CSV, JSON, TXT… Kinesis BI & Reporting Streaming Analytics Data Science & ML
  30. 30. ©2022 Databricks Inc. — All rights reserved Large scale ETL is complex and brittle • Hard to build and maintain table dependencies • Difficult to switch between batch and stream processing • Difficult to monitor & enforce data quality • Impossible to trace data lineage • Poor observability at granular, data level • Error handling and recovery is laborious Complex pipeline development Poor data quality Difficult pipeline operations
  31. 31. ©2022 Databricks Inc. — All rights reserved Demo: Modern Data Engineering on the Lakehouse 31
  32. 32. ©2022 Databricks Inc. — All rights reserved Data Warehousing on the Lakehouse 32 3
  33. 33. ©2022 Databricks Inc. — All rights reserved Data Warehousing on the Lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering Analytics and Data Warehousing Data Science and ML
  34. 34. ©2022 Databricks Inc. — All rights reserved BI & SQL workloads (DW) on Databricks • Great performance and concurrency for BI and SQL workloads on Delta Lake • Native SQL interface for analysts • Support for BI tools to directly query your most recent data in Delta Lake • Serverless
  35. 35. ©2022 Databricks Inc. — All rights reserved SQL Analytics World-Class Performance and Data Lake Economics • Fast and predictable performance for all queries • Simplified administration and fine-grained governance • Analytics on the freshest data with your tools of choice Analytics on the Lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering Analytics and Data Warehousing Data Science and ML
  36. 36. ©2022 Databricks Inc. — All rights reserved A platform for your tools of choice Get critical business data in with one click integrations, and benefit from fast performance, low latency, and high user concurrency for your existing BI tools. Ingest & ETL BI & Analytics
  37. 37. ©2022 Databricks Inc. — All rights reserved A first-class SQL development experience Query data lake data using familiar ANSI SQL, and find and share new insights faster with the built-in SQL query editor, alerts, visualizations, and interactive dashboards.
  38. 38. ©2022 Databricks Inc. — All rights reserved Simple administration and governance Quickly setup instant, elastic SQL compute decoupled from storage. Databricks automatically determines instance types and configuration for the best price/performance. Then, easily manage users, data, and resources with endpoint monitoring, query history, and fine- grained governance.
  39. 39. ©2022 Databricks Inc. — All rights reserved Demo: Data Warehousing on the Lakehouse 39
  40. 40. ©2022 Databricks Inc. — All rights reserved Data Governance on the Lakehouse 40 4
  41. 41. ©2022 Databricks Inc. — All rights reserved Governance requirements for data are quickly evolving
  42. 42. ©2022 Databricks Inc. — All rights reserved Governance is hard to enforce on data lakes 42 Cloud 2 Cloud 3 Structured Semi-structured Unstructured Streaming Cloud 1
  43. 43. ©2022 Databricks Inc. — All rights reserved The problem is getting bigger Enterprises need a way to share and govern a wide variety of data products Files Dashboards Models Tables
  44. 44. ©2022 Databricks Inc. — All rights reserved Data governance on the lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering Analytics and Data Warehousing Data Science and ML
  45. 45. ©2022 Databricks Inc. — All rights reserved Unity Catalog for Lakehouse Governance • Centrally catalog, Search, and discover data and AI assets • Simplify governance with a unified Cross- cloud governance model • Easily integrate with your existing Enterprise Data Catalogs • Securely share live data across platforms with delta sharing
  46. 46. ©2022 Databricks Inc. — All rights reserved Data Sharing is Critical in the Digital Economy
  47. 47. ©2022 Databricks Inc. — All rights reserved Data Sharing is Critical in the Digital Economy To fully realize the value locked in data, enterprises must be able to securely exchange data with trusted customers, partners, and suppliers Current options are not fit for purpose Homegrown Solutions Built on APIs, direct connectivity, or SFTP ● Not scalable ● Complex ● Brittle Commercial Solutions Vendor provided technology ● Expensive ● Locked in ● Inflexible
  48. 48. ©2022 Databricks Inc. — All rights reserved The open approach to sharing Fully open, without proprietary lock-in using any computing platform Simple to share data with other organizations Easily managed privacy, security, and compliance
  49. 49. ©2022 Databricks Inc. — All rights reserved Permissible Access via Open Source Protocol Delta Lake Table Delta Sharing Server Delta Sharing Protocol Data Provider Data Recipient Any Sharing Client Access permissions
  50. 50. ©2022 Databricks Inc. — All rights reserved Machine Learning on the Lakehouse 50 Sean Owen Principal Solution Architect @ Databricks 5
  51. 51. ©2022 Databricks Inc. — All rights reserved Machine Learning on the Lakehouse Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering BI and SQL Analytics Data Science and ML
  52. 52. ©2021 Databricks Inc. — All rights reserved Open Multi-Cloud Data Lakehouse and Feature Store Collaborative Multi-Language Notebooks ← Full ML Lifecycle → Model Tracking and Registry Model Training and Tuning Model Serving and Monitoring Automation and Governance Data Science and Machine Learning A data-native and collaborative solution for the full ML lifecycle
  53. 53. ©2022 Databricks Inc. — All rights reserved 54 Three Data Users • SQL and BI tools • Prepare and run reports • Summarize data • Visualize data • (Sometimes) Big Data • Data Warehouse data store • R, SAS, some Python • Statistical analysis • Explain data • Visualize data • Often small data sets • Database, data warehouse data store; local files Business Intelligence Data Science • Python • Deep learning and specialized GPU hardware • Create predictive models • Deploy models to prod • Often big data sets • Unstructured data in files Machine Learning
  54. 54. ©2022 Databricks Inc. — All rights reserved Data Warehouse vs Data Lake Which appeals to each user? 55 Business Intelligence Data Science Machine Learning Data Warehouse Data Lake
  55. 55. ©2022 Databricks Inc. — All rights reserved 56 What ML Needs from a Lakehouse
  56. 56. ©2022 Databricks Inc. — All rights reserved How Is ML Different? 57 • Operates on unstructured data like text and images • Can require learning from massive data sets, not just analysis of a sample • Uses open source tooling to manipulate data as “DataFrames” rather than with SQL • Outputs are models rather than data or reports • Sometimes needs special hardware
  57. 57. ©2022 Databricks Inc. — All rights reserved 58 What Does ML Need from a Lakehouse? Your subtitle here Access to Unstructured Data • Images, text, audio, custom formats • Libraries understand files, not tables • Must scale to petabytes Open Source Libraries • OSS dominates ML tooling (Tensorflow, scikit- learn, xgboost, R, etc) • Must be able to apply these in Python, R Specialized Hardware, Distributed Compute • Scalability of algorithms • GPUs, for deep learning • Cloud elasticity to manage that cost! Model Lifecycle Management • Outputs are model artifacts • Artifact lineage • Productionization of model
  58. 58. ©2022 Databricks Inc. — All rights reserved 59 Architecture Sketches Your subtitle here Data Warehouse Data Lakehouse
  59. 59. ©2022 Databricks Inc. — All rights reserved 61 MLOps
  60. 60. ©2022 Databricks Inc. — All rights reserved MLOps and the Lakehouse 62 • Applying open tools in-place to data in the lakehouse is a win for training • Applying them for operating models is important too! • "Models are data too" • Need to apply models to data • MLFlow for MLOps on the lakehouse • Track and manage model data, lineage, inputs • Deploy models as lakehouse "services"
  61. 61. ©2022 Databricks Inc. — All rights reserved 63 Example: Chest X-Ray Classification
  62. 62. ©2022 Databricks Inc. — All rights reserved Classifying Chest X-rays 64 • 45,000 X-ray images • About 50GB • Includes correct doctor diagnosis • From National Institute of Health • Relatively easy deep learning problem • If you have access to the data • If you have deep learning open source software • If you have GPU hardware • (Don't diagnose at home!) https://databricks.com/p/webinar/operationalizing-machine-learning-at-scale
  63. 63. ©2022 Databricks Inc. — All rights reserved 65 Architecture
  64. 64. ©2022 Databricks Inc. — All rights reserved 66 Feature Stores
  65. 65. ©2022 Databricks Inc. — All rights reserved A day (or 6 months) in the life of an ML model Raw Data Featurization Training Joins, Aggregates, Transforms, etc. csv csv Serving
  66. 66. ©2022 Databricks Inc. — All rights reserved csv csv A day (or 6 months) in the life of an ML model Client Raw Data Featurization Training Joins, Aggregates, Transforms, etc. Serving
  67. 67. ©2022 Databricks Inc. — All rights reserved csv csv Client Raw Data Featurization Training Joins, Aggregates, Transforms, etc. Serving A day (or 6 months) in the life of an ML model need to be equivalent
  68. 68. ©2022 Databricks Inc. — All rights reserved Feature Stores for Model Inputs 70 • Tables are OK for managing model input • Input often structured • Well understood, easy to access • … but not quite enough • Upstream lineage: how were features computed? • Downstream lineage: where is the feature used? • Model caller has to read, feed inputs • How to do (also) access in real time?
  69. 69. ©2022 Databricks Inc. — All rights reserved 71 Example: Telco Churn Classification
  70. 70. ©2022 Databricks Inc. — All rights reserved 72 Bonus: Auto ML
  71. 71. ©2022 Databricks Inc. — All rights reserved 73 Learn More
  72. 72. ©2022 Databricks Inc. — All rights reserved The data lakehouse offers a better path Lake-first approach that builds upon where the freshest, most complete data resides AI/ML from the ground up Single approach to managing data Support for all use cases on a single platform: • Data engineering • Data warehousing • Real time streaming • Data science and ML Built on open source and open standards High reliability and performance Multi-cloud, work with your cloud of choice Integrated and collaborative role-based experiences with open APIs Data processing and management built on open source and open standards Common security, governance, and administration Cloud Data Lake Structured, semi-structured, and unstructured data Data Engineering BI and SQL Analytics Data Science and ML
  73. 73. ©2022 Databricks Inc. — All rights reserved
  74. 74. ©2022 Databricks Inc. — All rights reserved Thank you 76

×