SlideShare a Scribd company logo
1 of 50
Download to read offline
Predicting Flights
with Azure Databricks
Presented by Sarah Dutkiewicz
Microsoft MVP, Developer Technologies
Cleveland Tech Consulting, LLC
Microsoft Cloud Workshops
(MCWs)
• Great way to see how Azure services work
together in end-to-end scenarios
• This presentation is based on content for the Big
Data Analysis and Visualization MCW.
• Learn more at: Microsoft Cloud Workshop
Agenda
• What is Databricks?
• Introducing the flight prediction scenario
• Exploring Databricks while getting to the solution
Scenario
• Margie’s Travels – a concierge for business travelers - wants to enable
their agents to enter in the flight information and produce a
prediction as to whether the departing flight will encounter a 15-
minute or longer delay, considering the weather forecast for the
departure hour.
• Data details:
• Sample data from 2013
• 2.7 million flight delays with airport codes - examples
• 20 columns – features
Training the model
• Using Decision Tree algorithm (binary classification) from Spark MLlib
• Part of the historical data (2013) is used for training and another part
for test
• Flights are delayed have DepDel15 value of 1
• Sample keeps all delayed and a downsample of 30% not delayed –
stratified sampling
• One-Hot encoded categorical variables and use the Pipeline API
• 3-fold CrossValidator
• Save the model for use in other notebooks and saved in case cluster
restarts
What is Databricks?
• Web-based analytics platform with
3 environments:
• Databricks SQL – for querying data
lakes with SQL
• Databricks Data Science &
Engineering – for data engineers, data
scientists, and ML engineers for data
ingestion and analysis using the
Apache Spark Ecosystem
• Databricks Machine Learning –
experiment tracking, model training,
feature development and
management
Databricks SQL
Databricks SQL
• Query data lake using SQL
• Create visualizations as part of queries
• Build and share dashboards using visualizations
SQL Endpoint
• Starter Endpoint
created by default
• Additional endpoints
can be created
Databricks SQL flow
Databricks SQL Components flow
Data Source: Data Lake
• Using Azure
Storage
• CSVs will get
migrated from an
on-premises
environment to
Azure Storage
using Azure Data
Factory
Azure Data Factory
• Azure Data
Factory can
migrate data
from on-
premises to
Azure Storage.
• Once migrated,
we can run a
Databricks
notebook to
score the data.
Loading the Data into Azure Databricks
• Done in the Data Science
and Engineering load
• Uses a cluster for loading
data into Azure Databricks
• Tables have 2 different
types:
• Global tables -
accessible across all
clusters
• Local tables - available
only within one cluster
Creating a Table
Working with Queries in Azure Databricks
Query in SQL Editor
• Schema Browser to help show tables and columns
• Past Executions for metrics of past performance
• Query tabs
• Results / Visualizations
• Table is by default
• Create additional visualizations here
Dashboard
Dashboard
• Visualizations need to
be created with
queries first
• Lay out multiple
visualizations
• Filter details – filters
also come from
queries
Alerts
Databricks
Data Science & Engineering
Classic Databricks
Databricks Data Science & Engineering
• Classic Databricks
environment
• Backbone for the ML
environment
• Key components:
• Workspaces
• Runtimes
• Clusters
• Notebooks
• Jobs
Workspaces
• Environment for all Azure Databricks
assets
• Organizes notebooks, libraries, and
experiments into folders
• Notebooks – runnable code + markdown
• Libraries – third party resources or locally
built code accessible to clusters
• Experiments – MLflow machine learning
model activities
• Provides access to clusters and jobs
• Integrates with Git through Repos
Clusters
• Powerhouse for Azure Databricks
• Compute and configuration
• Supports workloads for:
• Data engineering
• Data science
• Data analytics
• Takes time to start
• Two types
• All-purpose – can be shared for collaborative works;
manually managed
• Job clusters – used for jobs, created and terminated by the
Azure Databricks job scheduler; cannot be restarted
Runtimes
• Assigned at the cluster-level
• Provides the engine for the platform, based on Apache Spark
• Includes Delta Lake for storage
• GPU-enabled support available
• Ubuntu and system libraries
• Supports the following languages:
• Python
• R
• Java
• Scala
Special Runtimes
• Databricks Runtime for Machine Learning
• Databricks Light
• Photon-enabled runtimes
• Uses a native vectorized query engine
• Currently in Public Preview
• Works in both Azure Databricks clusters and Databricks SQL endpoints
Notebooks
• Can mix
Markdown
and
languages to
present data
• Example:
Data
Preparation
notebook
with:
• Markdown
• SQL
• Python
• R
Notebook cells
• Command
number helps for
execution and
navigation
• Execution details
include:
• Duration
• User
• Timestamp
• Cluster name
• Example shows
Python
command and
output
SQL notebook cell
Data
munging
• Using SparkR to
clean
• Yes… R in the same
notebook as SQL…
labeled as a
Python notebook
Export to a Databricks table
• Storing
munged data
into another
table
Cleaning weather data with Python
• Same
notebook
DataFrame schema
• dfFlightDelays is a
DataFrame
• Python code, using
pretty print pprint
library
Jobs
• Used for running non-interactive code
• ETL or ELT
• Task orchestration
• Tasks include:
• JAR file (Java)
• Azure Databricks notebook
• Delta Live Tables pipeline
• Application written in Scala, Java, or Python
• Not included in this presentation’s demo
Databricks Machine Learning
Databricks Machine Learning
• Builds on top of Data
Science & Engineering
• Same workspace
components
• ML components:
• Experiments
• Feature Stores
• Models
Experiments
• In the 02 Train and
Evaluate Models
notebook, Cmd27
• Uses mlflow to trigger
experiment via code
• Experiments can
show MLflow
experiments across
an organization that
you have access to
Creating your own AutoML experiment
• Select a compute cluster
• ML Problem Type:
• Classification
• Regression
• Forecasting – ML Runtimes 10.x and higher
• Evaluation metric (advanced
configuration):
• Classification
• F1 score
• Accuracy
• Log loss
• Precision
• ROC/AUC
• Regression
• R-squared
• Mean absolute error
• Mean squared error
• Root mean squared error
Train and Evaluate Models
• Working with
PySpark in Python
• Using stratified
sampling with
sampleBy() function
• Using binary
classification – flight
is either delayed or it
is not
• Using the Decision
Tree classifier from
Spark MLib
• Models will be saved
for batch scoring
Batch Scoring
• Reads from Azure Data Storage
• Creates DataFrames for the CSVs
• Load the trained model
• Make predictions against the set
• Save the scored data in a global table scoredflights
Databricks as a Data Source
Summary data in Power BI
Looking at PHX’s track record
Source: Microsoft Cloud Workshop: Big Data Analytics and Visualization, Hands-on Lab
Additional Resources
• User guides
• Databricks SQL user guide - Azure Databricks - Databricks SQL | Microsoft
Docs
• Databricks Data Science & Engineering guide - Azure Databricks | Microsoft
Docs
• Databricks Machine Learning guide - Azure Databricks | Microsoft Docs
• Microsoft Learn pathways
• Build and operate machine learning solutions with Azure Databricks - Learn |
Microsoft Docs
• Data engineering with Azure Databricks - Learn | Microsoft Docs
• Perform data science with Azure Databricks - Learn | Microsoft Docs

More Related Content

What's hot

Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukErwin de Kreuk
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture DesignKujambu Murugesan
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Amazon Web Services
 
Data saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de KreukData saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de KreukErwin de Kreuk
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Daniel Toomey
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureDATAVERSITY
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for DatabricksDatabricks
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data MeshLibbySchulze
 
Introduction to snowflake
Introduction to snowflakeIntroduction to snowflake
Introduction to snowflakeSunil Gurav
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 

What's hot (20)

Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
DataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de KreukDataMinds 2022 Azure Purview Erwin de Kreuk
DataMinds 2022 Azure Purview Erwin de Kreuk
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Data saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de KreukData saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de Kreuk
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data Architecture
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Introduction to snowflake
Introduction to snowflakeIntroduction to snowflake
Introduction to snowflake
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 

Similar to Predicting Flight Delays with Azure Databricks

A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowDatabricks
 
Making Data Scientists Productive in Azure
Making Data Scientists Productive in AzureMaking Data Scientists Productive in Azure
Making Data Scientists Productive in AzureValdas Maksimavičius
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?Dan Sullivan, Ph.D.
 
Cnam azure ze cloud resource manager
Cnam azure ze cloud  resource managerCnam azure ze cloud  resource manager
Cnam azure ze cloud resource managerAymeric Weinbach
 
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...Sotrender
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on DatabricksDataScienceConferenc1
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureMark Tabladillo
 
EM12c: Capacity Planning with OEM Metrics
EM12c: Capacity Planning with OEM MetricsEM12c: Capacity Planning with OEM Metrics
EM12c: Capacity Planning with OEM MetricsMaaz Anjum
 
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model ServingDAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model Servingamesar0
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated MLMark Tabladillo
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AIJames Serra
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure applicationCodecamp Romania
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowDatabricks
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDatabricks
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Automation 2.0 - Automation Tools for Hybrid Cloud Environments
Automation 2.0 - Automation Tools for Hybrid Cloud EnvironmentsAutomation 2.0 - Automation Tools for Hybrid Cloud Environments
Automation 2.0 - Automation Tools for Hybrid Cloud EnvironmentsMichael Rüefli
 
Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsS N
 
A practical guidance of the enterprise machine learning
A practical guidance of the enterprise machine learning A practical guidance of the enterprise machine learning
A practical guidance of the enterprise machine learning Jesus Rodriguez
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAlberto Diaz Martin
 

Similar to Predicting Flight Delays with Azure Databricks (20)

A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development Workflow
 
Making Data Scientists Productive in Azure
Making Data Scientists Productive in AzureMaking Data Scientists Productive in Azure
Making Data Scientists Productive in Azure
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?
 
Cnam azure ze cloud resource manager
Cnam azure ze cloud  resource managerCnam azure ze cloud  resource manager
Cnam azure ze cloud resource manager
 
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
Trenowanie i wdrażanie modeli uczenia maszynowego z wykorzystaniem Google Clo...
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
 
EM12c: Capacity Planning with OEM Metrics
EM12c: Capacity Planning with OEM MetricsEM12c: Capacity Planning with OEM Metrics
EM12c: Capacity Planning with OEM Metrics
 
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model ServingDAIS Europe Nov. 2020 presentation on MLflow Model Serving
DAIS Europe Nov. 2020 presentation on MLflow Model Serving
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated ML
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflow
 
05 entity framework
05 entity framework05 entity framework
05 entity framework
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Automation 2.0 - Automation Tools for Hybrid Cloud Environments
Automation 2.0 - Automation Tools for Hybrid Cloud EnvironmentsAutomation 2.0 - Automation Tools for Hybrid Cloud Environments
Automation 2.0 - Automation Tools for Hybrid Cloud Environments
 
Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloads
 
A practical guidance of the enterprise machine learning
A practical guidance of the enterprise machine learning A practical guidance of the enterprise machine learning
A practical guidance of the enterprise machine learning
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientist
 

More from Sarah Dutkiewicz

Passwordless Development using Azure Identity
Passwordless Development using Azure IdentityPasswordless Development using Azure Identity
Passwordless Development using Azure IdentitySarah Dutkiewicz
 
Azure DevOps for Developers
Azure DevOps for DevelopersAzure DevOps for Developers
Azure DevOps for DevelopersSarah Dutkiewicz
 
Azure DevOps for JavaScript Developers
Azure DevOps for JavaScript DevelopersAzure DevOps for JavaScript Developers
Azure DevOps for JavaScript DevelopersSarah Dutkiewicz
 
Azure DevOps for the Data Professional
Azure DevOps for the Data ProfessionalAzure DevOps for the Data Professional
Azure DevOps for the Data ProfessionalSarah Dutkiewicz
 
Noodling with Data in Jupyter Notebook
Noodling with Data in Jupyter NotebookNoodling with Data in Jupyter Notebook
Noodling with Data in Jupyter NotebookSarah Dutkiewicz
 
Intro to Python for C# Developers
Intro to Python for C# DevelopersIntro to Python for C# Developers
Intro to Python for C# DevelopersSarah Dutkiewicz
 
Introduction to Testing and TDD
Introduction to Testing and TDDIntroduction to Testing and TDD
Introduction to Testing and TDDSarah Dutkiewicz
 
Becoming a Servant Leader, Leading from the Trenches
Becoming a Servant Leader, Leading from the TrenchesBecoming a Servant Leader, Leading from the Trenches
Becoming a Servant Leader, Leading from the TrenchesSarah Dutkiewicz
 
NEOISF - On Mentoring Future Techies
NEOISF - On Mentoring Future TechiesNEOISF - On Mentoring Future Techies
NEOISF - On Mentoring Future TechiesSarah Dutkiewicz
 
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerThe Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerSarah Dutkiewicz
 
The importance of UX for Developers
The importance of UX for DevelopersThe importance of UX for Developers
The importance of UX for DevelopersSarah Dutkiewicz
 
The Impact of Women Trailblazers in Tech
The Impact of Women Trailblazers in TechThe Impact of Women Trailblazers in Tech
The Impact of Women Trailblazers in TechSarah Dutkiewicz
 
Unstoppable Course Final Presentation
Unstoppable Course Final PresentationUnstoppable Course Final Presentation
Unstoppable Course Final PresentationSarah Dutkiewicz
 
Even More Tools for the Developer's UX Toolbelt
Even More Tools for the Developer's UX ToolbeltEven More Tools for the Developer's UX Toolbelt
Even More Tools for the Developer's UX ToolbeltSarah Dutkiewicz
 
History of Women in Tech - Trivia
History of Women in Tech - TriviaHistory of Women in Tech - Trivia
History of Women in Tech - TriviaSarah Dutkiewicz
 
The UX Toolbelt for Developers
The UX Toolbelt for DevelopersThe UX Toolbelt for Developers
The UX Toolbelt for DevelopersSarah Dutkiewicz
 
World Usability Day 2014 - UX Toolbelt for Developers
World Usability Day 2014 - UX Toolbelt for DevelopersWorld Usability Day 2014 - UX Toolbelt for Developers
World Usability Day 2014 - UX Toolbelt for DevelopersSarah Dutkiewicz
 

More from Sarah Dutkiewicz (20)

Passwordless Development using Azure Identity
Passwordless Development using Azure IdentityPasswordless Development using Azure Identity
Passwordless Development using Azure Identity
 
Azure DevOps for Developers
Azure DevOps for DevelopersAzure DevOps for Developers
Azure DevOps for Developers
 
Azure DevOps for JavaScript Developers
Azure DevOps for JavaScript DevelopersAzure DevOps for JavaScript Developers
Azure DevOps for JavaScript Developers
 
Azure DevOps for the Data Professional
Azure DevOps for the Data ProfessionalAzure DevOps for the Data Professional
Azure DevOps for the Data Professional
 
Noodling with Data in Jupyter Notebook
Noodling with Data in Jupyter NotebookNoodling with Data in Jupyter Notebook
Noodling with Data in Jupyter Notebook
 
Pairing and mobbing
Pairing and mobbingPairing and mobbing
Pairing and mobbing
 
Intro to Python for C# Developers
Intro to Python for C# DevelopersIntro to Python for C# Developers
Intro to Python for C# Developers
 
Introduction to Testing and TDD
Introduction to Testing and TDDIntroduction to Testing and TDD
Introduction to Testing and TDD
 
Becoming a Servant Leader, Leading from the Trenches
Becoming a Servant Leader, Leading from the TrenchesBecoming a Servant Leader, Leading from the Trenches
Becoming a Servant Leader, Leading from the Trenches
 
NEOISF - On Mentoring Future Techies
NEOISF - On Mentoring Future TechiesNEOISF - On Mentoring Future Techies
NEOISF - On Mentoring Future Techies
 
Becoming a Servant Leader
Becoming a Servant LeaderBecoming a Servant Leader
Becoming a Servant Leader
 
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerThe Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
 
The importance of UX for Developers
The importance of UX for DevelopersThe importance of UX for Developers
The importance of UX for Developers
 
The Impact of Women Trailblazers in Tech
The Impact of Women Trailblazers in TechThe Impact of Women Trailblazers in Tech
The Impact of Women Trailblazers in Tech
 
Unstoppable Course Final Presentation
Unstoppable Course Final PresentationUnstoppable Course Final Presentation
Unstoppable Course Final Presentation
 
Even More Tools for the Developer's UX Toolbelt
Even More Tools for the Developer's UX ToolbeltEven More Tools for the Developer's UX Toolbelt
Even More Tools for the Developer's UX Toolbelt
 
History of Women in Tech
History of Women in TechHistory of Women in Tech
History of Women in Tech
 
History of Women in Tech - Trivia
History of Women in Tech - TriviaHistory of Women in Tech - Trivia
History of Women in Tech - Trivia
 
The UX Toolbelt for Developers
The UX Toolbelt for DevelopersThe UX Toolbelt for Developers
The UX Toolbelt for Developers
 
World Usability Day 2014 - UX Toolbelt for Developers
World Usability Day 2014 - UX Toolbelt for DevelopersWorld Usability Day 2014 - UX Toolbelt for Developers
World Usability Day 2014 - UX Toolbelt for Developers
 

Recently uploaded

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 

Predicting Flight Delays with Azure Databricks

  • 1. Predicting Flights with Azure Databricks Presented by Sarah Dutkiewicz Microsoft MVP, Developer Technologies Cleveland Tech Consulting, LLC
  • 2. Microsoft Cloud Workshops (MCWs) • Great way to see how Azure services work together in end-to-end scenarios • This presentation is based on content for the Big Data Analysis and Visualization MCW. • Learn more at: Microsoft Cloud Workshop
  • 3. Agenda • What is Databricks? • Introducing the flight prediction scenario • Exploring Databricks while getting to the solution
  • 4. Scenario • Margie’s Travels – a concierge for business travelers - wants to enable their agents to enter in the flight information and produce a prediction as to whether the departing flight will encounter a 15- minute or longer delay, considering the weather forecast for the departure hour. • Data details: • Sample data from 2013 • 2.7 million flight delays with airport codes - examples • 20 columns – features
  • 5. Training the model • Using Decision Tree algorithm (binary classification) from Spark MLlib • Part of the historical data (2013) is used for training and another part for test • Flights are delayed have DepDel15 value of 1 • Sample keeps all delayed and a downsample of 30% not delayed – stratified sampling • One-Hot encoded categorical variables and use the Pipeline API • 3-fold CrossValidator • Save the model for use in other notebooks and saved in case cluster restarts
  • 6. What is Databricks? • Web-based analytics platform with 3 environments: • Databricks SQL – for querying data lakes with SQL • Databricks Data Science & Engineering – for data engineers, data scientists, and ML engineers for data ingestion and analysis using the Apache Spark Ecosystem • Databricks Machine Learning – experiment tracking, model training, feature development and management
  • 8. Databricks SQL • Query data lake using SQL • Create visualizations as part of queries • Build and share dashboards using visualizations
  • 9. SQL Endpoint • Starter Endpoint created by default • Additional endpoints can be created
  • 12. Data Source: Data Lake • Using Azure Storage • CSVs will get migrated from an on-premises environment to Azure Storage using Azure Data Factory
  • 13. Azure Data Factory • Azure Data Factory can migrate data from on- premises to Azure Storage. • Once migrated, we can run a Databricks notebook to score the data.
  • 14. Loading the Data into Azure Databricks • Done in the Data Science and Engineering load • Uses a cluster for loading data into Azure Databricks • Tables have 2 different types: • Global tables - accessible across all clusters • Local tables - available only within one cluster
  • 16. Working with Queries in Azure Databricks
  • 17. Query in SQL Editor • Schema Browser to help show tables and columns • Past Executions for metrics of past performance • Query tabs • Results / Visualizations • Table is by default • Create additional visualizations here
  • 18.
  • 19.
  • 20.
  • 22. Dashboard • Visualizations need to be created with queries first • Lay out multiple visualizations • Filter details – filters also come from queries
  • 24. Databricks Data Science & Engineering Classic Databricks
  • 25. Databricks Data Science & Engineering • Classic Databricks environment • Backbone for the ML environment • Key components: • Workspaces • Runtimes • Clusters • Notebooks • Jobs
  • 26. Workspaces • Environment for all Azure Databricks assets • Organizes notebooks, libraries, and experiments into folders • Notebooks – runnable code + markdown • Libraries – third party resources or locally built code accessible to clusters • Experiments – MLflow machine learning model activities • Provides access to clusters and jobs • Integrates with Git through Repos
  • 27. Clusters • Powerhouse for Azure Databricks • Compute and configuration • Supports workloads for: • Data engineering • Data science • Data analytics • Takes time to start • Two types • All-purpose – can be shared for collaborative works; manually managed • Job clusters – used for jobs, created and terminated by the Azure Databricks job scheduler; cannot be restarted
  • 28.
  • 29. Runtimes • Assigned at the cluster-level • Provides the engine for the platform, based on Apache Spark • Includes Delta Lake for storage • GPU-enabled support available • Ubuntu and system libraries • Supports the following languages: • Python • R • Java • Scala
  • 30. Special Runtimes • Databricks Runtime for Machine Learning • Databricks Light • Photon-enabled runtimes • Uses a native vectorized query engine • Currently in Public Preview • Works in both Azure Databricks clusters and Databricks SQL endpoints
  • 31. Notebooks • Can mix Markdown and languages to present data • Example: Data Preparation notebook with: • Markdown • SQL • Python • R
  • 32. Notebook cells • Command number helps for execution and navigation • Execution details include: • Duration • User • Timestamp • Cluster name • Example shows Python command and output
  • 34. Data munging • Using SparkR to clean • Yes… R in the same notebook as SQL… labeled as a Python notebook
  • 35. Export to a Databricks table • Storing munged data into another table
  • 36. Cleaning weather data with Python • Same notebook
  • 37. DataFrame schema • dfFlightDelays is a DataFrame • Python code, using pretty print pprint library
  • 38. Jobs • Used for running non-interactive code • ETL or ELT • Task orchestration • Tasks include: • JAR file (Java) • Azure Databricks notebook • Delta Live Tables pipeline • Application written in Scala, Java, or Python • Not included in this presentation’s demo
  • 40. Databricks Machine Learning • Builds on top of Data Science & Engineering • Same workspace components • ML components: • Experiments • Feature Stores • Models
  • 41. Experiments • In the 02 Train and Evaluate Models notebook, Cmd27 • Uses mlflow to trigger experiment via code • Experiments can show MLflow experiments across an organization that you have access to
  • 42. Creating your own AutoML experiment • Select a compute cluster • ML Problem Type: • Classification • Regression • Forecasting – ML Runtimes 10.x and higher • Evaluation metric (advanced configuration): • Classification • F1 score • Accuracy • Log loss • Precision • ROC/AUC • Regression • R-squared • Mean absolute error • Mean squared error • Root mean squared error
  • 43. Train and Evaluate Models • Working with PySpark in Python • Using stratified sampling with sampleBy() function • Using binary classification – flight is either delayed or it is not • Using the Decision Tree classifier from Spark MLib • Models will be saved for batch scoring
  • 44. Batch Scoring • Reads from Azure Data Storage • Creates DataFrames for the CSVs • Load the trained model • Make predictions against the set • Save the scored data in a global table scoredflights
  • 45. Databricks as a Data Source
  • 46.
  • 47. Summary data in Power BI
  • 48. Looking at PHX’s track record
  • 49. Source: Microsoft Cloud Workshop: Big Data Analytics and Visualization, Hands-on Lab
  • 50. Additional Resources • User guides • Databricks SQL user guide - Azure Databricks - Databricks SQL | Microsoft Docs • Databricks Data Science & Engineering guide - Azure Databricks | Microsoft Docs • Databricks Machine Learning guide - Azure Databricks | Microsoft Docs • Microsoft Learn pathways • Build and operate machine learning solutions with Azure Databricks - Learn | Microsoft Docs • Data engineering with Azure Databricks - Learn | Microsoft Docs • Perform data science with Azure Databricks - Learn | Microsoft Docs