1
Databricks for dummies
Rodney Joyce – Data & AI Consultant
LinkedIn - bit.ly/rodneyjoyce
© 2019
2
Agenda
ο Objective / Data Science Series
ο Boring theory
ο Use Cases
ο Demos
ο Getting started
ο Interactive Notebooks
ο ETL Batch job with Azure Data Factory
ο The humble dataframe
ο Pricing
ο Takeaways
ο Questions
3
Objective & Data Science Series
1. Databricks for dummies
2. Titanic survival prediction with Databricks + Python + Spark ML
3. Titanic with Azure Machine Learning Studio
4. Titanic with Databricks + Azure Machine Learning Service
5. Titanic with Databricks + MLS + AutoML
6. Titanic with Databricks + MLflow
7. Titanic with DataRobot
8. Deployment, DevOps/MLOps and Operationalization
What is Azure Databricks, why you should learn it and how to get started…
4
How to get value out of your data?
5
Data Science Workflow
Extract → Organise → Analyse + Model → Present → Data Value
(with Data Munging, Explore, Feature Engineering and Visualisations along the way)
6
Why is data science so hard?
ο Data Science requires a lot of data engineering before it can succeed
ο Siloed roles = unique terminology
ο Fragmented technologies and solutions
ο Model training sometimes requires huge scale
ο The more data we use to train, the better
ο Big Data infrastructure is expensive to build and costly to maintain
ο Operational challenges – how do you get a model into production?
Problem = $$$ and slow to deliver value
7
Where Data Scientists spend most of their time
ο Cleaning and Organising Data – 60%
ο Extracting Data – 19%
ο Mining Data for Patterns – 9%
ο Other – 5%
ο Refining Algorithms – 4%
ο Building Training Datasets – 3%
8
Solution: Unified Analytics Platform
ο Unifies Data Science, Engineering and Business
ο Removes silos, improves collaboration
ο Supports multiple languages
ο Business value:
ο Cost savings (resources, operations, training, etc.)
ο Speed to market
ο Easy to extend for future ML
ο Focus on extracting insights from your data, not on the infrastructure and processes around it!
9
What is Apache Spark?
ο Open-source big data processing engine
ο Massively scalable/distributed
ο Highly extensible with many libraries
ο Started in 2009 and written in Scala
ο Supports 4 languages
ο Designed for speed and ease of use
ο In-memory processing = faster than Hadoop MapReduce (see the PySpark sketch after this list)
ο You can run Spark on Azure directly
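To make these bullets concrete, here is a minimal PySpark sketch (the row count and column name are made up for illustration) showing a DataFrame being partitioned across the cluster and cached in memory:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession named `spark` already exists; getOrCreate()
# reuses it here (or builds a local one if you run this outside Databricks).
spark = SparkSession.builder.appName("spark-for-dummies").getOrCreate()

# A 1,000,000-row DataFrame, partitioned across the cluster's workers
df = spark.range(1000 * 1000).withColumn("squared", F.col("id") * F.col("id"))

df.cache()                                   # keep the partitions in memory
print(df.count())                            # first action computes and caches the data
print(df.agg(F.avg("squared")).first()[0])   # second action is served from the in-memory cache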
10
Databricks PaaS – Managed Spark Service on Azure
11
Use Case - Modern Analytics Platform
12
Use Case – Real Example
13
Demo 1 – Getting started with Databricks
ο Create a Databricks service (Resource Group)
ο Launching a workspace – AD Integration
ο Menu Overview
ο Workspaces
ο Notebooks
ο RBAC (Premium)
ο Add a new cluster with auto-scale (a REST API sketch follows this list)
ο Installing libraries
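The demo uses the Azure portal and workspace UI, but the same cluster can be created programmatically. A hedged sketch against the Databricks Clusters and Libraries REST APIs follows; the workspace URL, token, runtime version and VM type are placeholder values, not the ones used in the demo:

import requests

# Placeholders – substitute your own workspace URL and personal access token
HOST = "https://<region>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
headers = {"Authorization": "Bearer " + TOKEN}

# Create an auto-scaling cluster that shuts itself down after 30 idle minutes
cluster_spec = {
    "cluster_name": "dummies-demo",
    "spark_version": "5.5.x-scala2.11",      # pick a runtime from the release notes
    "node_type_id": "Standard_DS3_v2",       # small Azure VM size for demos
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
}
resp = requests.post(HOST + "/api/2.0/clusters/create", headers=headers, json=cluster_spec)
cluster_id = resp.json()["cluster_id"]

# Install a PyPI library (e.g. Koalas) on the new cluster
libraries = {"cluster_id": cluster_id,
             "libraries": [{"pypi": {"package": "koalas"}}]}
requests.post(HOST + "/api/2.0/libraries/install", headers=headers, json=libraries)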
14
Demo 2 – Interactive Notebook
ο Notebook overview
ο Attach to Cluster
ο Cells
ο Markdown
ο Running a command & shortcuts
ο Comments
ο Revisions (Git)
ο Data Tables
ο Language choice (4: Python, Scala, SQL, R)
ο Magic commands (e.g. %sh for Unix shell commands, %md for Markdown)
ο Charting and Dashboards (see the notebook-cell sketch after this list)
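A hedged sketch of a few notebook cells, shown here as one Python block with the non-Python cells commented out; the SQL table name is made up:

# Cell 1 – plain Python/PySpark; `spark` and `display` are provided by Databricks
df = spark.range(100).selectExpr("id", "id * id AS squared")
display(df)          # renders a sortable table with built-in charting options

# Cell 2 – switch language for a single cell with a magic command
# %sql
# SELECT COUNT(*) FROM my_table      -- hypothetical table registered in the metastore

# Cell 3 – Markdown for documentation
# %md ## Notes on this demo

# Cell 4 – run a Unix command on the driver node
# %sh ls /dbfs/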
15
Demo 3 – ETL Batch Job with Databricks
ο Key Vault integration
ο Storage integration
ο Widgets/Parameters (see the ETL notebook sketch after this list)
ο Nesting pipelines
ο Scheduling a Job (Time based)
ο ADF integration (Event driven)
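A minimal sketch of the kind of parameterised ETL notebook behind this demo. The secret scope, storage account, container names and the order_id column are placeholders, not the real demo values:

# Parameter passed in by the Job scheduler or an ADF pipeline
dbutils.widgets.text("run_date", "2019-01-01")
run_date = dbutils.widgets.get("run_date")

# Key Vault-backed secret scope: fetch the storage key without hard-coding it
storage_key = dbutils.secrets.get(scope="kv-scope", key="storage-account-key")
spark.conf.set("fs.azure.account.key.mystorageacct.blob.core.windows.net", storage_key)

base = "wasbs://{container}@mystorageacct.blob.core.windows.net/sales/"

# Extract – read the raw CSVs for the given date
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv(base.format(container="raw") + run_date + "/"))

# Transform – simple cleansing
clean = raw.dropDuplicates().na.drop(subset=["order_id"])

# Load – write curated Parquet for downstream consumption
(clean.write
      .mode("overwrite")
      .parquet(base.format(container="curated") + run_date + "/"))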
16
Demo 4 – The humble dataframe
ο Import a Notebook
ο DataFrames versus Datasets versus RDDs
ο Download and read a CSV file and infer the schema (see the sketch after this list)
ο Intellisense
ο Lazy Evaluation
ο Actions versus Transformations
ο Importing a library (e.g. Pandas)
ο Immutability
ο Static versus Dynamically typed
ο Koalas / .NET for Apache Spark
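As a taster for this demo, a minimal DataFrame sketch. It assumes the Kaggle Titanic CSV has been uploaded to /FileStore/tables/titanic.csv (the path is a placeholder; the column names are the standard Kaggle ones):

# Read the CSV and let Spark sample the file to guess column types
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/FileStore/tables/titanic.csv"))

# Transformations are lazy – this builds a plan but runs nothing on the cluster
survivors = df.filter(df.Survived == 1).select("Name", "Age", "Pclass")

# Actions trigger execution of the whole plan
print(survivors.count())
display(survivors)

# DataFrames are immutable – each transformation returns a new DataFrame
with_family = df.withColumn("FamilySize", df.SibSp + df.Parch + 1)

# Interop with single-node pandas (collects the data to the driver – small data only)
pdf = with_family.toPandas()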
18
Databricks Pricing
Notes
ο Pay for DBUs (billed per minute) only while your cluster is running Spark workloads
ο Cluster VM size determines DBU usage per hour – the cheapest is 0.5 DBU/hour
ο Azure infrastructure costs (VMs, storage, etc.) depend on your VM sizes and cluster state
ο Shut down your clusters when not in use… and don't leave an infinite loop running in a notebook before a long weekend
Premium Features
ο RBAC for notebooks, clusters, jobs and tables
ο JDBC/ODBC Authentication (Power BI!)
ο RStudio Integration
              Data Engineering (Jobs)        Data Analytics (Interactive)
STANDARD      $0.20 per DBU + Azure cost     $0.40 per DBU + Azure cost
PREMIUM       $0.35 per DBU + Azure cost     $0.55 per DBU + Azure cost
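A back-of-the-envelope cost sketch in Python. Only the $/DBU tier prices come from the table above; the DBU rate per node, VM price, cluster size and run time are made-up example values:

nodes = 4                     # cluster size
dbu_per_node_hour = 0.75      # depends on the chosen VM size (example value)
hours = 2                     # how long the job runs
dbu_price = 0.20              # Standard tier, Data Engineering (Jobs) workload, $/DBU
vm_price_per_hour = 0.50      # assumed Azure VM cost per node per hour

dbu_cost = nodes * dbu_per_node_hour * hours * dbu_price   # 4 * 0.75 * 2 * 0.20 = $1.20
vm_cost = nodes * vm_price_per_hour * hours                # 4 * 0.50 * 2 = $4.00
print("Total approx. $%.2f" % (dbu_cost + vm_cost))        # approx. $5.20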
19
Takeaways
ο Data Science requires a lot of data engineering before it can succeed
ο CI/CD, API, DBFS, CLI, Databricks Delta, MLflow – some other time
ο Databricks is awesome in the analytics stack
ο Learn Python!
ο To get started, spin up Databricks on your MSDN subscription and start playing around:
ο https://databricks.com/spark/getting-started-with-apache-spark
ο https://docs.databricks.com/getting-started/index.html
20
Questions?
Rodney Joyce – Data & AI Consultant
LinkedIn - bit.ly/rodneyjoyce
© 2019
1. Databricks for dummies
2. Titanic survival prediction with Databricks + Python + Spark ML
3. Titanic with Azure Machine Learning Studio
4. Titanic with Databricks + Azure Machine Learning Service
5. Titanic with Databricks + MLS + AutoML
6. Titanic with Databricks + MLflow
7. Titanic with DataRobot
8. Deployment, DevOps/MLOps and Operationalization
e.g. What is HDInsight on Azure?


Editor's Notes

  • #2 Prep: clusters running and bumped up, no timeout – make sure the Koalas library is installed; Storage Explorer open; browser tabs for Kaggle, Azure and Databricks
  • #3 Objective: Understand what databricks is, where it fits into the analytics stack and how it adds value for our customers
  • #4 We will repeat the same process but with different tools and compare the ease and accuracy
  • #5 ML sounds sexy but it’s not really ;)
  • #7 Before we go into Databricks we need to understand the data science workflow. The Business: understands the core business and how to make money. Data Scientist: experiments, explores and models; forms hypotheses and looks for correlations in the data, considering the business value. Data Engineer: architects a solution to get the data in the right format, scales it, operationalizes the output of the data scientist into production, and maintains the damn thing (tests, CI/CD etc). We do all the hard work before handing the easy stuff over to the data scientist.
  • #8 ML sounds sexy but it’s not really ;)
  • #10 Before we talk about Databricks we’re going to talk about another open-source big data technology - Spark
  • #11 Databricks is a first-class PaaS citizen on Azure (although not written by Microsoft): Active Directory, Key Vault, DevOps, Data Factory, billing, SLAs. Founded by the creators of Apache Spark. Ingest data from any data source (Azure or non-Azure): relational, structured, IoT, unstructured etc. Data is separate. Collaborative Workspace: interactive notebooks for collaboration between engineers, data scientists and analysts. Jobs and Workflows: run notebooks as jobs in multi-stage pipelines with easy auditing and notifications. Runtime: serverless – all the rage these days! REST API to access programmatically. Massively scalable on the Spark engine – up and outwards. Output to ML, ADW, storage.
  • #12 Lambda Architecture – hot and cold paths, streaming and batch. Transformations: data cleansing, deduplication, data prepping, consolidation, validation – Azure Data Lake Analytics, HDInsight (Spark, Hadoop), SQL stored procedures, Batch AI, VMs, custom Python apps running on VMs/Docker. Real-time streaming: Stream Analytics jobs. AI-related tasks: HDInsight cluster with R (perpetual), Machine Learning Studio. Data exploration/interactive notebooks: Azure Notebooks, Jupyter. Jobs: Azure Data Factory. * Similar tools – Jupyter Notebook, Python, Pandas, hosted Spark, Hadoop
  • #13 Lambda Architecture – hot and cold paths, streaming and batch. Transformations: data cleansing, deduplication, data prepping, consolidation, validation – Azure Data Lake Analytics, HDInsight (Spark, Hadoop), SQL stored procedures, Batch AI, VMs, custom Python apps running on VMs/Docker. Real-time streaming: Stream Analytics jobs. AI-related tasks: HDInsight cluster with R (perpetual), Machine Learning Studio. Data exploration/interactive notebooks: Azure Notebooks, Jupyter. Jobs: Azure Data Factory. *
  • #14 Show the components created by the Databricks service in the Resource Group. Workspaces are like folders for organisation. Discuss autoscale. Runtime release notes and libraries: https://docs.azuredatabricks.net/release-notes/runtime/5.2ml.html#python-libraries
  • #15 What is a notebook? Jupyter (open source), Azure Notebooks. We will use Python (PySpark) – many of the same commands apply. Very good for data exploration – e.g. by data scientists to find correlations between data points. Author Spark applications that can be run on the Spark cluster.
  • #16 Show the components created by the Databricks service in the Resource Group. Workspaces are like folders for organisation. Discuss autoscale. Runtime release notes and libraries: https://docs.azuredatabricks.net/release-notes/runtime/5.2ml.html#python-libraries
  • #17 Resilient Distributed Datasets: immutability, massive scale; fault tolerance via lineage. Datasets are like a DataFrame but with typing – so only available in Scala, not in Python. DBFS: a distributed file system that can be mounted over storage and easily accessed; persisted to blob so not lost when the cluster shuts down. inferSchema looks at X rows and makes a guess – better to set the schema on read (see the sketch below).
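A minimal sketch of setting the schema on read instead of inferring it (the columns are the usual Kaggle Titanic ones, used purely as an example; the path is a placeholder):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Declaring the schema up front avoids the extra pass over the file that
# inferSchema needs, and guarantees the types you expect.
schema = StructType([
    StructField("PassengerId", IntegerType(), True),
    StructField("Survived",    IntegerType(), True),
    StructField("Name",        StringType(),  True),
    StructField("Age",         DoubleType(),  True),
])

df = spark.read.option("header", "true").schema(schema).csv("/FileStore/tables/titanic.csv")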
  • #18 Connecting to Power BI – JDBC connection (Premium). Streaming. Machine Learning – Azure AutoML / TensorFlow / scikit-learn.
  • #19 https://databricks.com/product/azure-pricing – 2 PaaS tiers: Standard and Premium; 2 modes: Data Engineering and Data Analytics (Jobs versus Notebooks, ETL versus exploration). Show the components created by the Databricks service in the Resource Group.
  • #20 Resources: eBooks, Pluralsight, YouTube. Public datasets – Kaggle, Uber, government sites.