WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Garren Staubli
Solutions Architect
Databricks + Snowflake:
Catalyzing Data and AI
#UnifiedAnalytics #SparkAISummit
Slides & Resources: garrens.com/DataSnowCat
garren@databricks.com | @gstaubli
Agenda
3#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Introductions
Scenario
Challenges
Solutions
Demo
4#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Introductions - Me
2011 2012 2013 2014 2015 2016 2017 2018 2019
MySQL
Ruby
AWS
Pig & Hive
Python
Linux
Scala, Python & Java
Apache Spark & ML
Hadoop
NoSQL
Databricks Delta
Databricks Workspace
Collaborative Notebooks, Production Jobs
Databricks Runtime
Transactions Indexing
ML FrameworksML Frameworks
Introductions - Databricks
Cloud
Data & ML
Lifecycle
Data Engineering Data Science
Accelerate innovation by unifying data science and engineering
6#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Introductions - Snowflake
7#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Forget Oil. Data is worth
more than Gold
8#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Scenario
9#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Data Mining Data Science ML Engineering
QADevOps
Production
Delivery*
* not Digiorno
Scenario - Annotated
10#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Scenario - Reality
11#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Challenges
WAREHOUSES NOSQL
LAKES STREAMS
Sources
APIs Apps BI
Data-Driven
Production
12#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Challenges - Reality
Partitions: 20
Rows per second: 10,000
Format: JSON
Extract Transform
ML
Load
Analysis
Insights
Flat, RDBMS, Streams, etc
Unified batch & stream APIs
Autoscaling with usage
Python, Scala, SQL, R & Java
JVM w/ optimization
Multilevel APIs (SQL + RDDs)
13#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Challenges & Solutions - ETL
Partitions: 20
Rows per second: 10,000
Format: JSON
Sources
Syntax
Scale
Languages
Performance
Expressiveness
14#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Challenges & Solutions - ETL
Partitions: 20
Rows per second: 10,000
Format: JSON
Malformed Records
Errors
Changing Fields
Writes
- Performance
- Semantics
- Reliability
Ignore/infer + log records
Handle + retry w/ checkpoint
Schema Evolution
Partitioned + optimized files
Exactly once
ACID transactions
15#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Challenges & Solutions - ML
Partitions: 20
Rows per second: 10,000
Format: JSON
Data Access
Syntax
Collaboration
Models
- Iteration
- Reproducibility
- Deployment
Apache Spark + Delta
Koalas
Databricks Notebooks
- Tracking
- Projects
- Models
Interactive queries
Instant Scaling
SQL
Tableau, PowerBI, etc
Optimized DWaaS
Decoupled storage + compute
16#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Challenges & Solutions - Analysis
Partitions: 20
Rows per second: 10,000
Format: JSON
Time to value
Intermittent demand
Language
Common Tooling
Ease of Use
Cost control
17#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Final Solution Architecture
Partitions: 20
Rows per second: 10,000
Format: JSON
Machine Learning
BI Reporting
Dashboards
18#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Demo
Review
19#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Introductions
Scenario
Challenges
Solutions
Demo
20#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat
Solution
WAREHOUSES NOSQL
LAKES STREAMS
Sources
Processing
WAREHOUSES NOSQL
BI
Persistence
APIs Apps BI
LAKES
DELTA
Integration
DELTA
ML
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
Databricks + Snowflake: Catalyzing Data and AI Initiatives

Databricks + Snowflake: Catalyzing Data and AI Initiatives

  • 1.
    WIFI SSID:SparkAISummit |Password: UnifiedAnalytics
  • 2.
    Garren Staubli Solutions Architect Databricks+ Snowflake: Catalyzing Data and AI #UnifiedAnalytics #SparkAISummit Slides & Resources: garrens.com/DataSnowCat garren@databricks.com | @gstaubli
  • 3.
    Agenda 3#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Introductions Scenario Challenges Solutions Demo
  • 4.
    4#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Introductions - Me 2011 2012 2013 2014 2015 2016 2017 2018 2019 MySQL Ruby AWS Pig & Hive Python Linux Scala, Python & Java Apache Spark & ML Hadoop NoSQL
  • 5.
    Databricks Delta Databricks Workspace CollaborativeNotebooks, Production Jobs Databricks Runtime Transactions Indexing ML FrameworksML Frameworks Introductions - Databricks Cloud Data & ML Lifecycle Data Engineering Data Science Accelerate innovation by unifying data science and engineering
  • 6.
    6#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Introductions - Snowflake
  • 7.
    7#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Forget Oil. Data is worth more than Gold
  • 8.
    8#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Scenario
  • 9.
    9#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Data Mining Data Science ML Engineering QADevOps Production Delivery* * not Digiorno Scenario - Annotated
  • 10.
    10#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Scenario - Reality
  • 11.
    11#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Challenges WAREHOUSES NOSQL LAKES STREAMS Sources APIs Apps BI Data-Driven Production
  • 12.
    12#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Challenges - Reality Partitions: 20 Rows per second: 10,000 Format: JSON Extract Transform ML Load Analysis Insights
  • 13.
    Flat, RDBMS, Streams,etc Unified batch & stream APIs Autoscaling with usage Python, Scala, SQL, R & Java JVM w/ optimization Multilevel APIs (SQL + RDDs) 13#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat Challenges & Solutions - ETL Partitions: 20 Rows per second: 10,000 Format: JSON Sources Syntax Scale Languages Performance Expressiveness
  • 14.
    14#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Challenges & Solutions - ETL Partitions: 20 Rows per second: 10,000 Format: JSON Malformed Records Errors Changing Fields Writes - Performance - Semantics - Reliability Ignore/infer + log records Handle + retry w/ checkpoint Schema Evolution Partitioned + optimized files Exactly once ACID transactions
  • 15.
    15#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Challenges & Solutions - ML Partitions: 20 Rows per second: 10,000 Format: JSON Data Access Syntax Collaboration Models - Iteration - Reproducibility - Deployment Apache Spark + Delta Koalas Databricks Notebooks - Tracking - Projects - Models
  • 16.
    Interactive queries Instant Scaling SQL Tableau,PowerBI, etc Optimized DWaaS Decoupled storage + compute 16#UnifiedAnalytics #SparkAISummit | Slides & Resources: garrens.com/DataSnowCat Challenges & Solutions - Analysis Partitions: 20 Rows per second: 10,000 Format: JSON Time to value Intermittent demand Language Common Tooling Ease of Use Cost control
  • 17.
    17#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Final Solution Architecture Partitions: 20 Rows per second: 10,000 Format: JSON Machine Learning BI Reporting Dashboards
  • 18.
    18#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Demo
  • 19.
    Review 19#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Introductions Scenario Challenges Solutions Demo
  • 20.
    20#UnifiedAnalytics #SparkAISummit |Slides & Resources: garrens.com/DataSnowCat Solution WAREHOUSES NOSQL LAKES STREAMS Sources Processing WAREHOUSES NOSQL BI Persistence APIs Apps BI LAKES DELTA Integration DELTA ML
  • 21.
    DON’T FORGET TORATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT