[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic

Databricks' founders caused a seismic shift in the data analysis community when they created Apache Spark, which has become a cornerstone of Big Data processing pipelines and tools in companies large and small all around the world. They have since built a revolutionary, comprehensive and easy-to-use platform around Apache Spark and their other inventions, such as the MLflow and Koalas frameworks and, most importantly, the Data Lakehouse: a concept that fuses data warehouse and data lake architectures into a single versatile and fast platform. The technical foundation of the Databricks Data Lakehouse is Delta Lake. More than 7,000 organizations today rely on Databricks for massive-scale data engineering, collaborative data science, full-lifecycle machine learning and business analytics. Come to the talk and see the demo to find out why.

  1. Overview of the Databricks Platform
     Petar Zečević, Poslovna inteligencija (inteligencija.com)
  2. We are a Data & Analytics consulting company committed to delivering great solutions and products that enable our clients to unlock hidden opportunities within data, become data-driven and make better business decisions
     • Our goal is to enable data-driven business decisions
     • Offices in UK, Sweden, Austria, Slovenia and Croatia
     • 200+ employees
     • 20 years in Data & Analytics
     • 250+ projects
     • 100+ clients on 5 continents
  3. We deliver E2E Cloud Data & Analytics solutions
     • Data Strategy & Governance: Discover opportunities for data monetization, assess organizational maturity, evaluate architectural options, define a cloud migration strategy, plan and prioritize projects, and estimate costs.
     • Data Management: Implement practices, concepts and processes dedicated to leveraging data as a valuable asset. Design data models, improve data quality and master data, protect data, manage the whole data supply chain and make data available for any relevant business need.
     • Data Engineering: Collect and store data at scale, from multiple sources and formats, and make it reliable and consistent for analysis.
     • Data Science & Analytics: Utilize data and answer business questions through reporting, self-service BI and data visualization. Use machine learning algorithms to uncover unseen patterns, insights and trends in data and derive meaningful information.
     • Performance Management: Automate budgeting and forecasting, financial consolidation and performance management reporting.
  4. Databricks Lakehouse Platform
  5. The story about Databricks
     • The team who built Apache Spark founded Databricks in 2013
     • They started several OSS projects:
       • Apache Spark
       • Delta Lake
       • MLflow
     • Invented the Data Lakehouse concept
     • Named leader by Gartner in both
       • Database Management Systems
  6. Databricks in numbers
     • 7000+ customers
     • 3000+ employees
     • Received more than $3B in funding
  7. Data Lakehouse Concept
     • Marries Data Warehouses and Data Lakes
     • Data Warehouses
       • Built for efficient BI and reporting
       • But:
         • Poor support for unstructured data, data science and streaming
         • Closed formats
         • Expensive to scale
  8. Data Lakehouse Concept
     • Data Lakes
       • Store any kind of data
       • Cheap storage
       • Allow for exploratory data analysis and streaming use cases
       • However:
         • Complex to set up
         • Poor BI performance
         • Often devolve into data swamps
  9. Gartner insights
     • 85% of Big Data and Data Science projects fail
     • $3.9T business value created by AI in 2022 (by the 15%?)
     • Why do Data Science projects fail?
     • Recent MIT Technology Review survey of 600 C-level executives: “72% of the technology executives we surveyed for this study say that, should their companies fail to achieve their AI goals, data issues are more likely than not to be the reason. Improving processing speeds, governance, and quality of data, as well as its sufficiency for models, are the main data imperatives to ensure AI can be scaled, say the survey respondents.”
  10. The usual problems
      • Ill-defined use cases
      • Data warehouses and data lakes in separate silos:
        • Data often duplicated and/or difficult to access (formats, interfaces)
        • Difficult to consolidate security models
        • Difficult to apply governance
  11. Databricks Lakehouse Platform - benefits
      • Unifies Data Warehouse and AI use cases on a single platform
      • Built on open source and open standards
      • Consistent across cloud providers (Azure, AWS, GCP)
      • Provides ACID transactions
      • Schema enforcement capabilities
      • In one platform:
        • Data Warehousing
        • Data Engineering
        • Data Streaming
        • Data Science and ML
        • Data Governance
  12. Platform Integrations
  18. Computing resources
      • Clusters
        • One or more VM instances running Spark components: Driver and Executors
        • Required for running notebooks, jobs, pipelines, …
        • All-purpose clusters and job clusters
      • SQL Warehouses (formerly “SQL Endpoints”)
        • Optimized for BI workloads
        • Required for running anything in SQL Workspace
        • For exploring data, running queries, alerts, …
  19. Accounts for cloud resources
  20. Databricks Lakehouse Platform: technical foundations
      • Apache Spark
      • Delta Lake and Delta Live Tables
      • MLflow
      • Jupyter Notebooks
      • Jobs and Pipelines
  21. Apache Spark
      • General-purpose, distributed data processing engine
      • Efficient and fast
      • Spark SQL, Spark Streaming, Spark ML
      • APIs in Java, Scala, Python, R
      • Widely used today; ubiquitous
      • Databricks provides the Photon execution engine on top
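
The division of work between a driver and executors can be illustrated without Spark itself. What follows is a minimal plain-Python sketch under stated assumptions, not Spark's API: the hypothetical `word_count` function plays the role of the driver, splitting the input into partitions, and a thread pool stands in for executors that each process one partition before the partial results are merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(lines):
    """Executor-side work: count the words in one partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, num_partitions=2):
    """Driver-side work: partition the data, fan out, merge partial counts."""
    # Round-robin partitioning keeps the sketch short; a real engine
    # would typically partition by storage block or by key.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # the "reduce" step: add up partial counts
    return dict(total)


if __name__ == "__main__":
    data = ["spark is fast", "delta lake builds on spark", "spark sql"]
    print(word_count(data))
```

The same map-then-merge shape is what lets such a computation scale out: each partition can be processed independently, and only the small partial results travel back to the driver.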
  22. Jupyter notebooks
      • Web-based, interactive and collaborative
      • Databricks supports Python, SQL, R and Scala
      • Can also serve as documentation (can be exported to HTML, PDF, etc.)
      • Can be executed as jobs in Databricks and organized in Pipelines
      • In Databricks attached to clusters
  23. Delta Lake
      • Data storage framework built on top of Parquet
      • Provides ACID transactions; upserts (MERGE statements) and deletes
      • Schema enforcement
      • Time travel
      • Scalable metadata handling
      • Unifies streaming and batch processing
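
To make upserts, schema enforcement and time travel concrete, here is a toy in-memory model in plain Python. It is a sketch for illustration only and does not use Delta Lake's API: the class `ToyDeltaTable` and its methods are hypothetical names, and it copies whole snapshots per version, whereas Delta Lake records commits in a transaction log alongside Parquet files.

```python
import copy


class ToyDeltaTable:
    """In-memory stand-in for a versioned table; not Delta Lake's API."""

    def __init__(self, schema):
        self.schema = schema      # column name -> expected Python type
        self.versions = [{}]      # version 0 is the empty table

    def _validate(self, row):
        # Schema enforcement: reject rows with wrong columns or types.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for col, typ in self.schema.items():
            if not isinstance(row[col], typ):
                raise TypeError(f"column {col!r} expects {typ.__name__}")

    def merge(self, rows, key):
        """Upsert: update rows whose key exists, insert the rest."""
        for row in rows:
            self._validate(row)   # validate everything before committing
        snapshot = copy.deepcopy(self.versions[-1])
        for row in rows:
            snapshot[row[key]] = dict(row)
        self.versions.append(snapshot)  # all-or-nothing: one new version
        return len(self.versions) - 1

    def delete(self, key_value):
        """Remove one row by key and commit the result as a new version."""
        snapshot = copy.deepcopy(self.versions[-1])
        snapshot.pop(key_value, None)
        self.versions.append(snapshot)
        return len(self.versions) - 1

    def read(self, version=None):
        """Time travel: read the latest snapshot or any earlier version."""
        index = -1 if version is None else version
        return copy.deepcopy(self.versions[index])


table = ToyDeltaTable({"id": int, "city": str})
v1 = table.merge([{"id": 1, "city": "Zagreb"}, {"id": 2, "city": "Split"}], key="id")
v2 = table.merge([{"id": 2, "city": "Rijeka"}, {"id": 3, "city": "Osijek"}], key="id")
print(table.read())            # latest version: row 2 updated, row 3 inserted
print(table.read(version=v1))  # earlier version: row 2 still has its old value
```

Because a failed validation raises before any snapshot is appended, a bad batch leaves the table at its previous version, which is the all-or-nothing behaviour the ACID bullet refers to.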
  24. Delta Live Tables
      • Framework for building data processing pipelines
      • You define transformations and DLT manages:
        • Orchestration
        • Cluster management
        • Monitoring
        • Data quality (Expectations)
        • Error handling
      • Can perform CDC with APPLY CHANGES INTO .. FROM ..
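
The two ideas named on this slide, expectations and change data capture, can be sketched in plain Python. This is an illustration under stated assumptions, not DLT's API: the function names, the `op` column and the event layout are all hypothetical. `apply_expectations` models a policy where rows violating a named rule are set aside, and `apply_changes` replays change events per key in sequence order.

```python
def apply_expectations(rows, expectations):
    """Split rows into those passing every named rule and those failing any."""
    passed, failed = [], []
    for row in rows:
        violated = [name for name, check in expectations.items() if not check(row)]
        if violated:
            failed.append((row, violated))  # keep the reasons for monitoring
        else:
            passed.append(row)
    return passed, failed


def apply_changes(target, changes, key, sequence_by):
    """Apply CDC events to `target` (a dict keyed by `key`) in sequence order."""
    for change in sorted(changes, key=lambda c: c[sequence_by]):
        k = change[key]
        if change["op"] == "DELETE":
            target.pop(k, None)
        else:
            # INSERT and UPDATE are both treated as upserts in this sketch.
            target[k] = {col: val for col, val in change.items() if col != "op"}
    return target


rules = {
    "valid_id": lambda r: r["id"] is not None,
    "non_negative_amount": lambda r: r["amount"] >= 0,
}
good, bad = apply_expectations(
    [{"id": 1, "amount": 10}, {"id": None, "amount": 5}], rules
)

events = [
    {"op": "UPDATE", "id": 1, "name": "Ana M.", "seq": 3},  # arrives out of order
    {"op": "INSERT", "id": 1, "name": "Ana", "seq": 1},
    {"op": "INSERT", "id": 2, "name": "Ivo", "seq": 2},
    {"op": "DELETE", "id": 2, "name": None, "seq": 4},
]
print(apply_changes({}, events, key="id", sequence_by="seq"))
```

Sorting by a sequence column before applying events is what makes the result independent of arrival order, which matters when changes are delivered late or out of order.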
  25. MLflow
      • Framework for managing machine learning lifecycles
      • MLflow Tracking: tracks experiments and runs, parameters, metrics
      • MLflow Models: storage format for describing models of different “flavors” (e.g. sklearn, keras, xgboost etc.)
      • MLflow Projects: packages code in a format that lets runs be reproduced on different platforms
      • Model Registry: manages models in a central repository
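
What "tracking experiments and runs, parameters, metrics" amounts to can be shown with a small in-memory model. This is a hedged sketch in plain Python, not MLflow's API: `ToyTracker` and its methods are hypothetical names chosen for illustration, and nothing is persisted.

```python
import itertools


class ToyTracker:
    """In-memory stand-in for an experiment tracker; not MLflow's API."""

    def __init__(self):
        self.runs = {}
        self._ids = itertools.count(1)

    def start_run(self, params):
        """Register a new run with its hyperparameters and return its id."""
        run_id = next(self._ids)
        self.runs[run_id] = {"params": dict(params), "metrics": {}}
        return run_id

    def log_metric(self, run_id, name, value):
        # Keep the full history so a metric can be inspected step by step.
        self.runs[run_id]["metrics"].setdefault(name, []).append(value)

    def best_run(self, metric, higher_is_better=True):
        """Return the id of the run with the best final value of `metric`."""
        scored = {
            run_id: run["metrics"][metric][-1]
            for run_id, run in self.runs.items()
            if metric in run["metrics"]
        }
        pick = max if higher_is_better else min
        return pick(scored, key=scored.get)


tracker = ToyTracker()
first = tracker.start_run({"learning_rate": 0.1})
tracker.log_metric(first, "accuracy", 0.80)
tracker.log_metric(first, "accuracy", 0.85)
second = tracker.start_run({"learning_rate": 0.01})
tracker.log_metric(second, "accuracy", 0.90)
print(tracker.runs[tracker.best_run("accuracy")]["params"])
```

Recording parameters and metrics per run is what makes runs comparable afterwards; a model registry would then build on this by promoting the model produced by the chosen run.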
  26. SQL Editor
  32. Jupyter Notebooks
  38. Delta Live Tables
  42. Questions?
