Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before, check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
Agenda
• Objective
• Titanic Kaggle Competition
• Series Overview
• Disclaimer
• Boring Theory – Data Science Workflow
• Demo – Organising and exploring Titanic data
• Machine Learning Theory
• Demo – Predicting survival on the Titanic
• Takeaways
• Questions
Objective – Solve a Kaggle Competition
• The "Hello World" of Data Science problems - a simple business problem
• https://www.kaggle.com/c/titanic/overview
• Use Machine Learning to predict which passengers survived the tragedy
• Binary Classification - Survived or Not Survived
• Your score is the % of passenger outcomes correctly predicted ("accuracy")
• Submit a CSV file with exactly 418 entries plus a header row, with 2 columns
• Personal tool choice: Databricks + Python + Spark ML (no NumPy or Pandas if possible!)
• TECHNICAL demos - demonstrate the power of Spark
• Focusing more on Data Engineering than on mathematical algorithms
Series Overview
1. Databricks for Dummies
2. Titanic survival prediction with Databricks + Python + Spark ML
3. Titanic with Azure Machine Learning Studio
4. Titanic with Databricks + Azure Machine Learning Service
5. Titanic with Databricks + MLS + AutoML
6. Titanic with Databricks + MLFlow
7. Titanic with DataRobot
8. Deployment, DevOps/MLOps and Operationalization
Where Data Scientists spend most of their time
• Cleaning and organising data – 60%
• Extracting data – 19%
• Mining data for patterns – 9%
• Other – 5%
• Refining algorithms – 4%
• Building training datasets – 3%
Data Science Workflow
[Diagram: Extract → Organise → Analyse + Model → Present → Data Value. The Organise stage comprises Explore, Data Munging, Feature Engineering and Visualisations.]
Demo – Extracting Titanic Data
[Workflow diagram: Extract → Organise → Analyse + Model → Present → Data Value, with the Extract stage highlighted]
• https://www.kaggle.com/c/titanic/data
• Data dictionary – domain knowledge
• Download and store on blob storage for access by Databricks
• Merge the training and test sets to have more input data
Organising the Data
[Workflow diagram: Extract → Organise → Analyse + Model → Present → Data Value, with the Organise stage (Explore, Data Munging, Feature Engineering, Visualisations) highlighted]
The Organise stage breaks down into:
• Exploratory Data Analysis (EDA): Basic Structure, Summary Statistics, Distributions, Grouping, Crosstabs, Pivots
• Data Munging: Missing Values, Outliers, Incorrect Values
• Feature Engineering: Derived Features, Feature Encoding
• Visualisations
Demo – EDA – Basic Structure
• How many rows (Observations)?
• How many columns (Features) are there?
• What are the data types?
• Explore subset of data – How complete is it?
• Filtering and sorting
Demo – EDA – Summary Statistics
Summary statistics describe the data in an overall sense and provide overview information about it.
Numerical Feature/Column
• Centrality measures – one number to describe the data: mean, median
• Dispersion measures – how spread out the values are: range, percentiles, variance, standard deviation
Categorical Feature (no numeric scale to measure)
• Centrality and dispersion measures cannot be calculated
• Total count, unique count, per-category count
• Per-category statistics (e.g. average Fare by port of embarkation)
Demo – EDA – Distributions
Visualise the distribution of data
Univariate (1 Feature)
• Box plot (Outliers)
• Histogram (Bins - Skewness)
• Kernel Density Estimation (KDE) plot
Bivariate (2 Features)
• Scatter plot (Correlations)
More than 2…
Demo – Analyse + Model the Data
[Workflow diagram: Extract → Organise → Analyse + Model → Present → Data Value, with the Analyse + Model stage highlighted]
• Machine Learning = learning from data or examples
• Train: look for patterns in the input (predictors) – e.g. spam detection
• Apply the learned pattern (model) to new input to predict the outcome
• Binary Classification: 2 discrete labels (Survived / Not Survived); Regression: continuous output (e.g. mileage)
• Supervised Machine Learning: known inputs and outputs
• Unsupervised Machine Learning: only known inputs – e.g. grouping similar customers
• Split the data so we can test without submitting to Kaggle
• Measure/Evaluate: Accuracy, Precision, Recall
• Build a baseline model that always predicts the majority class
• Choose the most accurate classifier/model (here: Logistic Regression)
Presenting the Data
[Workflow diagram: Extract → Organise → Analyse + Model → Present → Data Value, with the Present stage highlighted]
Recap – Data Science Workflow
[Diagram: Extract → Organise (Explore, Data Munging, Feature Engineering, Visualisations) → Analyse + Model → Present → Data Value]
Takeaways
• Data Science requires a lot of data engineering before it can succeed
• Domain knowledge is key
• This workflow can be applied to most data problems
• Databricks is awesome. Python is pretty cool too
• Technologies: Databricks, Python (PySpark), Spark ML, Koalas/Pandas
• Kudos: Pluralsight course – Data Science with Python: Pandas/Scikit-learn
Speaker notes:
Prep: clusters running and scaled up, no timeout – make sure the Koalas library is installed
Storage Explorer
Browsers, Kaggle, Azure, Databricks
What is Kaggle?
Simple business problem so that we can focus on the technical process that is similar in most solutions
Will focus a lot on Data Engineering and less on mathematical models
Up to you what tools/frameworks etc. you use. Databricks is a good fit as a unified platform: it handles huge volumes of data, supports Python (and other languages!), and is extensible with common libraries.
We are not going to use source control, Key Vault or set up Databricks – in the interests of time we are focusing on the workflow only
Important: Numpy and Pandas make a lot of this easier – I wanted to use native PySpark and Spark ML instead
We will repeat the same process but with different tools and compare the ease and accuracy
ML sounds sexy but it’s not really ;)
Switch to Databricks after - https://australiaeast.azuredatabricks.net/?o=839250446484486#notebook/2693555023493589
Helps to summarise data in an overall sense and provide overview information about the data.
Depends on the column type
Range is easily affected by extreme values
Switch to Databricks after - https://australiaeast.azuredatabricks.net/?o=839250446484486#notebook/3538880722377381/command/4408895896691184
Go to demo
We won’t go into more detail here as we have already looked at some ways to present insights with box plots and histograms. Databricks has built-in dashboards and graphs, and there’s always Power BI and Matplotlib if need be.
The Kaggle competition expects a file as output so we’ll work on that.