1
1 - Titanic survival prediction with Databricks + Python + Spark ML
Data Science for dummies
Rodney Joyce – Data & AI Consultant
LinkedIn - bit.ly/rodneyjoyce
© 2019
2
Agenda
ο Objective
ο Titanic Kaggle Competition
ο Series Overview
ο Disclaimer
ο Boring Theory - Data Science Workflow
ο Demo – Organising and exploring Titanic data
ο Machine Learning Theory
ο Demo – Predicting survival on the Titanic
ο Takeaways
ο Questions
3
Objective – Solve a Kaggle Competition
ο The “Hello World” of Data Science problems - Simple business problem
ο https://www.kaggle.com/c/titanic/overview
ο Use Machine Learning to predict which passengers survived the tragedy
ο Binary Classification – Survived or Not Survived
ο Your score is the percentage of passengers' outcomes correctly predicted (“accuracy”)
ο Submit a CSV file with exactly 418 entries plus a header row, with 2 columns
ο Personal tool choice: Databricks + Python + ML (no NumPy or Pandas if possible!)
ο TECHNICAL demos – Demonstrate the power of Spark
ο Focusing more on Data Engineering than on mathematical algorithms
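The submission format described above can be sketched in plain Python. This is an illustration only, not the deck's Spark code: the column names `PassengerId` and `Survived` follow the Kaggle Titanic competition, while the `build_submission` helper and the three predictions are hypothetical placeholders rather than real model output.

```python
import csv
import io

def build_submission(predictions):
    """Render {passenger_id: 0 or 1} as CSV text: one header row, one row per passenger."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])  # the 2 required columns
    for passenger_id in sorted(predictions):
        writer.writerow([passenger_id, int(predictions[passenger_id])])
    return buffer.getvalue()

# Hypothetical predictions for three passengers (a real submission needs 418 rows).
submission_text = build_submission({892: 0, 893: 1, 894: 0})
print(submission_text)
```

Writing to a string buffer keeps the sketch free of file paths; in practice the same writer could target a file on whatever storage the notebook uses.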
4
Series Overview
1. Databricks for dummies
2. Titanic survival prediction with Databricks + Python + Spark ML
3. Titanic with Azure Machine Learning Studio
4. Titanic with Databricks + Azure Machine Learning Service
5. Titanic with Databricks + MLS + AutoML
6. Titanic with Databricks + MLflow
7. Titanic with DataRobot
8. Deployment, DevOps/MLOps and Operationalization
5
Where Data Scientists spend most of their time
Cleaning and Organising Data: 60%
Extracting Data: 19%
Mining Data for Patterns: 9%
Other: 5%
Refining Algorithms: 4%
Building Training Datasets: 3%
6
Data Science Workflow
Stages (from Data to Value): Extract, Organise, Analyse + Model, Present
Organise: Explore, Data Munging, Feature Engineering, Visualisations
7
Demo – Extracting Titanic Data
[Workflow diagram: Extract, Organise, Analyse + Model, Present]
8
Demo – Extracting Titanic Data
ο https://www.kaggle.com/c/titanic/data
ο Data dictionary – Domain knowledge
ο Download and store on blob for access by Databricks
ο Merge Training and Test Set to have more input data
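A minimal sketch of the merge step, in plain Python rather than the Spark DataFrames used in the demo. It assumes each row is a dictionary and that test rows have no `Survived` label; the `Source` column and the `merge_train_and_test` helper are hypothetical names added for illustration.

```python
def merge_train_and_test(train_rows, test_rows):
    """Combine both sets into one list, tagging each row with its origin."""
    merged = []
    for row in train_rows:
        merged.append({**row, "Source": "train"})
    for row in test_rows:
        # The test set has no label, so mark it explicitly as missing.
        merged.append({**row, "Survived": None, "Source": "test"})
    return merged

# Made-up rows for illustration only.
train_rows = [{"PassengerId": 1, "Sex": "male", "Survived": 0}]
test_rows = [{"PassengerId": 892, "Sex": "male"}]
all_rows = merge_train_and_test(train_rows, test_rows)
print(all_rows)
```

Keeping a `Source` tag makes it possible to split the data back into the two original sets before training.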
9
Organising the Data
[Workflow diagram: Extract, Organise, Analyse + Model, Present]
10
Organising the Data
Organise
Exploratory Data Analysis (EDA): Basic Structure, Summary Statistics, Distributions, Grouping, Crosstabs, Pivots
Data Munging: Missing Values, Outliers, Incorrect Values
Feature Engineering: Derived Features, Feature Encoding
Visualisations
11
Demo – EDA – Basic Structure
Basic Structure
• How many rows (Observations)?
• How many columns (Features) are there?
• What are the data types?
• Explore subset of data – How complete is it?
• Filtering and sorting
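The questions above can be answered with a few lines of plain Python. This is a sketch under assumptions: rows are dictionaries, missing values are `None`, and the three sample rows and the `describe_structure` helper are made up for illustration.

```python
# Made-up sample rows; None marks a missing value.
rows = [
    {"PassengerId": 1, "Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"PassengerId": 2, "Age": 38.0, "Cabin": "C85", "Embarked": "C"},
    {"PassengerId": 3, "Age": None, "Cabin": None, "Embarked": "S"},
]

def describe_structure(rows):
    """Report row count, column count, observed types and completeness per column."""
    columns = list(rows[0].keys())
    structure = {"row_count": len(rows), "column_count": len(columns), "columns": {}}
    for column in columns:
        values = [row[column] for row in rows if row[column] is not None]
        structure["columns"][column] = {
            "types": sorted({type(value).__name__ for value in values}),
            "completeness": len(values) / len(rows),  # share of non-missing values
        }
    return structure

structure = describe_structure(rows)

# Filtering and sorting: passengers with a known age, oldest first.
known_age = sorted(
    (row for row in rows if row["Age"] is not None),
    key=lambda row: row["Age"],
    reverse=True,
)
print(structure)
print(known_age)
```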
12
Demo – EDA – Summary Statistics
Summary Statistics
Helps to summarise data in an overall sense and provide overview information about the data
Numerical Feature/Column
• Centrality measure
• One number to describe data
• mean, median
• Dispersion measure
• Variability – spread out or not
• range, percentiles, variance, standard deviation
Categorical Feature (Cannot be measured)
• Cannot calculate centrality or dispersion measures
• Total count
• Unique count
• Per Category count
• Per Category Statistics (E.g. Average Fare by Embarkment)
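The measures listed above can be computed with Python's `statistics` module. This is a sketch in plain Python, not the Spark code from the demo, and the fare and embarkation values are made-up samples.

```python
import statistics
from collections import Counter, defaultdict

# Made-up sample values for illustration.
fares = [7.25, 71.28, 7.92, 53.10, 8.05]
embarked = ["S", "C", "S", "S", "Q"]

# Numerical feature: centrality and dispersion measures.
numerical_summary = {
    "mean": statistics.mean(fares),
    "median": statistics.median(fares),
    "range": max(fares) - min(fares),
    "quartiles": statistics.quantiles(fares, n=4),
    "variance": statistics.variance(fares),
    "stdev": statistics.stdev(fares),
}

# Categorical feature: counts only.
categorical_summary = {
    "total_count": len(embarked),
    "unique_count": len(set(embarked)),
    "per_category_count": dict(Counter(embarked)),
}

# Per-category statistic: average fare by embarkation point.
fares_by_port = defaultdict(list)
for port, fare in zip(embarked, fares):
    fares_by_port[port].append(fare)
average_fare_by_port = {
    port: statistics.mean(values) for port, values in fares_by_port.items()
}

print(numerical_summary)
print(categorical_summary)
print(average_fare_by_port)
```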
13
Demo – EDA – Distributions
Distributions
Visualise the distribution of data
Univariate (1 Feature)
• Box plot (Outliers)
• Histogram (Bins - Skewness)
• Kernel Density Estimation (KDE) plot
Bivariate (2 Features)
• Scatter plot (Correlations)
More than 2…
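To make the histogram idea concrete, the sketch below bins a univariate feature and prints a text histogram. It assumes integer ages and a fixed bin width; the `histogram` helper and the sample ages are hypothetical.

```python
def histogram(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = {}
    for value in values:
        lower_edge = int(value // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))

# Made-up ages for illustration.
ages = [2, 22, 24, 26, 27, 35, 38, 54, 71]
age_bins = histogram(ages, bin_width=10)

# One row per non-empty bin; the bar length is the count in that bin.
for lower_edge, count in age_bins.items():
    print(f"{lower_edge:>2}-{lower_edge + 9:<2} {'#' * count}")
```

A lopsided shape in the printed bars is a quick hint of skewness before any plotting library is involved.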
14
Demo – EDA – Grouping, Crosstabs & Pivots
Grouping/Aggregations
Crosstabs & Pivots
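A small sketch of grouping, a crosstab and a pivot in plain Python. The column names `Sex`, `Pclass` and `Survived` come from the Titanic dataset; the five rows are made up, and the pivot here simply lays the crosstab counts out as rows and columns.

```python
from collections import Counter

# Made-up passengers for illustration.
passengers = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 1},
]

# Grouping/aggregation: survival rate per Sex.
totals, survivors = Counter(), Counter()
for p in passengers:
    totals[p["Sex"]] += 1
    survivors[p["Sex"]] += p["Survived"]
survival_rate_by_sex = {sex: survivors[sex] / totals[sex] for sex in totals}

# Crosstab: counts for every (Sex, Pclass) combination.
crosstab = Counter((p["Sex"], p["Pclass"]) for p in passengers)

# Pivot: Sex as rows, Pclass as columns, count as the cell value.
classes = sorted({p["Pclass"] for p in passengers})
pivot = {sex: [crosstab[(sex, c)] for c in classes] for sex in sorted(totals)}

print(survival_rate_by_sex)
print(classes, pivot)
```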
15
Demo – Data Munging
Missing Values
• Not available / known
• Incorrect manual entry
• Error in machine reading
• Leads to:
• Inaccurate analysis
• Models might not work with nulls
• Solutions:
• Delete row / observation (40%?)
• Replace value (Imputation)
Outliers / Extreme Values
• Different from normal, good to explore
• Analysis could be biased by extremes
• Solutions:
• Removal, Keep, Binning, Transform, Imputation
Incorrect Values
• Requires business knowledge – out of scope
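Two of the solutions above, imputation and outlier detection, can be sketched as follows. Median imputation and the 1.5 × IQR rule are common choices picked here for illustration; the deck does not say which techniques the demo uses, and the helper names and sample values are hypothetical.

```python
import statistics

def impute_median(values):
    """Replace None with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]

def flag_outliers(values):
    """Flag values more than 1.5 * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    return [v < q1 - spread or v > q3 + spread for v in values]

# Made-up values for illustration.
ages = [22.0, None, 26.0, 35.0, None, 54.0]
ages_filled = impute_median(ages)

fares = [7.25, 7.92, 8.05, 8.46, 13.0, 26.0, 512.33]
fare_is_outlier = flag_outliers(fares)

print(ages_filled)
print(fare_is_outlier)
```

Flagging rather than deleting keeps the decision (remove, keep, bin, transform or impute) separate from the detection.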
16
Demo – Feature Engineering
Derived Features
Transform raw data to better representative features in order to create better predictive models
• Transformation (e.g. Log of Fare)
• Creation using domain knowledge (e.g. Title)
• Selection (e.g. Dropping Cabin)
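The three kinds of derived feature above can be sketched in plain Python. The regular expression assumes names are formatted as "Surname, Title. Given names", as in the Titanic data; using `log1p` instead of a plain logarithm is an assumption made here so that a fare of 0 does not fail; the helper names are hypothetical.

```python
import math
import re

# Assumes names look like "Surname, Title. Given names".
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    """Creation using domain knowledge: pull the title out of the name."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

def log_fare(fare):
    """Transformation: log(1 + fare), which is also defined for a fare of 0."""
    return math.log1p(fare)

passenger = {"Name": "Braund, Mr. Owen Harris", "Fare": 7.25, "Cabin": None}

# Selection: Cabin is dropped simply by not carrying it forward.
features = {
    "Title": extract_title(passenger["Name"]),
    "LogFare": log_fare(passenger["Fare"]),
}
print(features)
```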
17
Demo – Feature Engineering - Encoding
Feature Encoding
• ML usually requires Numerical Features, not Categorical
• Categorical Feature Encoding converts Categorical Features into Numerical Features
• Binary Encoding
• 2 categories/classes.
• Male = 0, Female = 1
• Label Encoding
• More than 2 classes with implicit ordered values.
• Low = 1, Medium = 2, High = 3
• One-Hot Encoding
• No ordered values. Embarkment Point – S, C, Q
• Creates a Numerical Feature for each value
• Is_S = 0|1, Is_C = 0|1, Is_Q = 0|1…
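The three encodings above can be written as small plain-Python functions. This is a sketch using the slide's own examples; the function names and the `Is_` prefix parameter are hypothetical, and Spark ML has its own transformers for this that are not shown here.

```python
def binary_encode(value, positive):
    """Two classes: the positive class maps to 1, the other to 0."""
    return 1 if value == positive else 0

def label_encode(value, ordered_levels):
    """Ordered classes: map each level to its 1-based rank."""
    return ordered_levels.index(value) + 1

def one_hot_encode(value, categories, prefix="Is_"):
    """Unordered classes: one 0/1 feature per category."""
    return {f"{prefix}{category}": int(value == category) for category in categories}

sex_encoded = binary_encode("female", positive="female")           # Male = 0, Female = 1
level_encoded = label_encode("Medium", ["Low", "Medium", "High"])  # Low = 1, Medium = 2, High = 3
embarked_encoded = one_hot_encode("C", ["S", "C", "Q"])

print(sex_encoded, level_encoded, embarked_encoded)
```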
18
Demo – Visualisations
[Organise diagram: Exploratory Data Analysis (EDA), Data Munging, Feature Engineering, Visualisations]
19
Analyse + Model the Data
[Workflow diagram: Extract, Organise, Analyse + Model, Present]
20
Demo - Analyse + Model the Data
• Machine Learning = Learning from Data or Examples
• Look for patterns (train) based on Input (predictors) – e.g. Spam detection
• Apply pattern (model) to new input to predict outcome
• Binary Classification (2 discrete labels). Regression = continuous output (e.g. mileage)
• Supervised Machine Learning (known input and output).
• Unsupervised Machine Learning (only known input) - e.g. grouping good customers
• Splitting Data for testing without submission
• Measure/Evaluate
• Accuracy, Precision, Recall
• Make a Baseline Model with majority class
• Choosing the most accurate Classifier/Model (Logistic Regression Model)
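The split, baseline and evaluation steps above can be sketched without any ML library. This is an illustration in plain Python with made-up labels; it covers only the majority-class baseline, not the Logistic Regression model the deck selects, and the helper names and the 30% test fraction are assumptions.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def majority_class(labels):
    """Baseline model: always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]

def evaluate(actual, predicted, positive=1):
    """Accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up Survived labels for illustration.
labels = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
train, test = train_test_split(labels, test_fraction=0.3, seed=42)
baseline = majority_class(train)
metrics = evaluate(test, [baseline] * len(test))
print(baseline, metrics)
```

Any real classifier should beat the baseline's accuracy on the held-out rows; if it does not, the features are probably not adding information.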
21
Presenting the Data
[Workflow diagram: Extract, Organise, Analyse + Model, Present]
22
Recap – Data Science Workflow
[Workflow diagram: Extract, Organise, Analyse + Model, Present]
23
Takeaways
ο Data Science requires a lot of data engineering before it can succeed
ο Domain knowledge is key
ο This workflow can be applied to most data problems
ο Databricks is awesome. Python is pretty cool too
ο Technologies: Databricks, Python (PySpark), Spark ML, Koalas/Pandas
ο Kudos: Pluralsight Course – Data Science with Python: Pandas/Scikit Learn
24
Questions?
e.g. What is Apache Spark .net?
Rodney Joyce – Data & AI Consultant
LinkedIn - bit.ly/rodneyjoyce
© 2019

Data Science for Dummies - Data Engineering with Titanic dataset + Databricks + Python

  • 1. 1 - Titanic survival prediction with Databricks + Python + Spark ML. Data Science for dummies. Rodney Joyce – Data & AI Consultant. LinkedIn - bit.ly/rodneyjoyce. © 2019
  • 2. Agenda
      ο Objective
      ο Titanic Kaggle Competition
      ο Series Overview
      ο Disclaimer
      ο Boring Theory - Data Science Workflow
      ο Demo – Organising and exploring Titanic data
      ο Machine Learning Theory
      ο Demo – Predicting survival on the Titanic
      ο Takeaways
      ο Questions
  • 3. Objective – Solve a Kaggle Competition
      ο The “Hello World” of Data Science problems - Simple business problem
      ο https://www.kaggle.com/c/titanic/overview
      ο Use Machine Learning to predict which passengers survived the tragedy
      ο Binary Classification – Survived or Not Survived
      ο Your score is the % of passenger outcomes correctly predicted (“accuracy”)
      ο Submit a CSV file with exactly 418 entries plus a header row with 2 columns
      ο Personal Tool choice: Databricks + Python + ML (No Numpy or Pandas if possible!)
      ο TECHNICAL demos – Demonstrate the power of Spark
      ο Focusing more on Data Engineering than mathematical algorithms
  • 4. Series Overview
      1. Databricks for dummies
      2. Titanic survival prediction with Databricks + Python + Spark ML
      3. Titanic with Azure Machine Learning Studio
      4. Titanic with Databricks + Azure Machine Learning Service
      5. Titanic with Databricks + MLS + AutoML
      6. Titanic with Databricks + MLFlow
      7. Titanic with DataRobot
      8. Deployment, DevOps/MLops and Operationalization
  • 5. Where Data Scientists spend most of their time (chart): Cleaning and Organising Data 60%; Extracting Data 19%; Mining Data for Patterns 9%; Other 5%; Refining Algorithms 4%; Building Training Datasets 3%
  • 6. Data Science Workflow (diagram): Data → Extract → Organise → Analyse + Model → Present → Value. The Organise stage covers Explore, Data Munging, Feature Engineering and Visualisations
  • 7. Demo – Extracting Titanic Data (workflow diagram repeated)
  • 8. Demo – Extracting Titanic Data
      ο https://www.kaggle.com/c/titanic/data
      ο Data dictionary – Domain knowledge
      ο Download and store on blob for access by Databricks
      ο Merge Training and Test Set to have more input data
  • 9. Organising the Data (workflow diagram repeated)
  • 10. Organising the Data: breakdown of the Organise stage
      ο Exploratory Data Analysis (EDA): Basic Structure, Summary Statistics, Distributions, Grouping, Crosstabs, Pivots
      ο Data Munging: Missing Values, Outliers, Incorrect Values
      ο Feature Engineering: Derived Features, Feature Encoding
      ο Visualisations
  • 11. Demo – EDA – Basic Structure
      ο How many rows (Observations)?
      ο How many columns (Features) are there?
      ο What are the data types?
      ο Explore subset of data – How complete is it?
      ο Filtering and sorting
  • 12. Demo – EDA – Summary Statistics
      ο Helps to summarise data in an overall sense and provide overview information about the data
      ο Numerical Feature/Column
          ο Centrality measure: one number to describe data (mean, median)
          ο Dispersion measure: variability, spread out or not (range, percentiles, variance, standard deviation)
      ο Categorical Feature (Cannot be measured)
          ο Cannot calculate centrality or dispersion measures
          ο Total count
          ο Unique count
          ο Per Category count
          ο Per Category Statistics (E.g. Average Fare by Embarkment)
  • 13. Demo – EDA – Distributions
      ο Visualise the distribution of data
      ο Univariate (1 Feature)
          ο Box plot (Outliers)
          ο Histogram (Bins - Skewness)
          ο Kernel Density Estimation (KDE) plot
      ο Bivariate (2 Features)
          ο Scatter plot (Correlations)
      ο More than 2…
  • 14. Demo – EDA – Grouping, Crosstabs & Pivots
      ο Grouping: Grouping/Aggregations
      ο Crosstabs & Pivots: Crosstabs, Pivots
  • 15. Demo – Data Munging
      ο Missing Values
          ο Not available / known
          ο Incorrect manual entry
          ο Error in machine reading
          ο Leads to: inaccurate analysis; models might not work with nulls
          ο Solutions: delete row / observation (40%?); replace value (Imputation)
      ο Outliers / Extreme Values
          ο Different from normal, good to explore
          ο Analysis could be biased by extremes
          ο Solutions: Removal, Keep, Binning, Transform, Imputation
      ο Incorrect Values
          ο Requires business knowledge – out of scope
  • 16. Demo – Feature Engineering
      ο Derived Features: transform raw data into more representative features in order to build better predictive models
          ο Transformation (e.g. Log of Fare)
          ο Creation using domain knowledge (e.g. Title)
          ο Selection (e.g. dropping Cabin)
  • 17. Demo – Feature Engineering - Encoding
      ο ML usually requires Numerical Features, not Categorical
      ο Categorical Feature Encoding converts Categorical Features into Numerical Features
      ο Binary Encoding
          ο 2 categories/classes
          ο Male = 0, Female = 1
      ο Label Encoding
          ο More than 2 classes with implicit ordered values
          ο Low = 1, Medium = 2, High = 3
      ο One-Hot Encoding
          ο No ordered values. Embarkment Point – S, C, Q
          ο Creates a Numerical Feature for each value
          ο Is_S = 0|1, Is_C = 0|1, Is_Q = 0|1…
  • 18. Demo – Visualisations (Organise diagram repeated)
  • 19. Analyse + Model the Data (workflow diagram repeated)
  • 20. Demo - Analyse + Model the Data
      ο Machine Learning = Learning from Data or Examples
      ο Look for patterns (train) based on Input (predictors) – e.g. Spam detection
      ο Apply pattern (model) to new input to predict outcome
      ο Binary Classification (2 discrete labels). Regression = continuous output (e.g. mileage)
      ο Supervised Machine Learning (known input and output)
      ο Unsupervised Machine Learning (only known input) - e.g. grouping good customers
      ο Splitting Data for testing without submission
      ο Measure/Evaluate
          ο Accuracy, Precision, Recall
      ο Make a Baseline Model with majority class
      ο Choosing the most accurate Classifier/Model (Logistic Regression Model)
  • 21. Presenting the Data (workflow diagram repeated)
  • 22. Recap – Data Science Workflow (workflow diagram repeated)
  • 23. Takeaways
      ο Data Science requires a lot of data engineering before it can succeed
      ο Domain knowledge is key
      ο This workflow can be applied to most data problems
      ο Databricks is awesome. Python is pretty cool too
      ο Technologies: Databricks, Python (PySpark), Spark ML, Koalas/Pandas
      ο Kudos: Pluralsight Course – Data Science with Python: Pandas/Scikit Learn
  • 24. Questions? e.g. What is Apache Spark .net? Rodney Joyce – Data & AI Consultant. LinkedIn - bit.ly/rodneyjoyce. © 2019
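
The evaluation steps listed on slide 20 (accuracy, precision, recall and a majority-class baseline) can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook code from the demo: the function names and the label lists are made up, and the deck itself works with Spark ML rather than hand-written metrics.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary label."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    """Share of predictions that match the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def precision(actual, predicted, positive=1):
    """Of everything predicted positive, how much really was positive."""
    tp, fp, _, _ = confusion_counts(actual, predicted, positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(actual, predicted, positive=1):
    """Of everything really positive, how much was found."""
    tp, _, fn, _ = confusion_counts(actual, predicted, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def majority_baseline(train_labels, n):
    """Baseline model: always predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n


# Hypothetical labels (1 = survived, 0 = did not survive), not Titanic data.
train_labels = [0, 0, 0, 1, 1]
actual = [0, 1, 0, 0, 1, 1, 0, 1]
model_pred = [0, 1, 0, 1, 1, 0, 0, 1]
baseline_pred = majority_baseline(train_labels, len(actual))

print("model   ", accuracy(actual, model_pred),
      precision(actual, model_pred), recall(actual, model_pred))
print("baseline", accuracy(actual, baseline_pred),
      precision(actual, baseline_pred), recall(actual, baseline_pred))
```

Comparing the two rows shows why the baseline matters: a model is only worth keeping if it clearly beats "always predict the majority class" on held-out data.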

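The three encodings described on slide 17 can be shown with a few lines of plain Python. This is a minimal sketch under stated assumptions: the helper names and the sample rows are hypothetical, and in the deck's own stack the equivalent work would be done with Spark ML feature transformers rather than these functions.

```python
def binary_encode(value, one_value):
    """Binary encoding: 1 for the chosen category, 0 for the other."""
    return 1 if value == one_value else 0


def label_encode(value, ordered_levels):
    """Label encoding: position in an explicitly ordered list, starting at 1."""
    return ordered_levels.index(value) + 1


def one_hot_encode(value, categories):
    """One-hot encoding: one 0/1 feature per category, named Is_<category>."""
    return {f"Is_{c}": int(value == c) for c in categories}


# Hypothetical sample rows using the Titanic column names Sex and Embarked.
passengers = [
    {"Sex": "male", "Embarked": "S"},
    {"Sex": "female", "Embarked": "C"},
    {"Sex": "female", "Embarked": "Q"},
]

encoded = []
for p in passengers:
    row = {"Sex": binary_encode(p["Sex"], "female")}  # Male = 0, Female = 1
    row.update(one_hot_encode(p["Embarked"], ["S", "C", "Q"]))
    encoded.append(row)

for row in encoded:
    print(row)

# Ordered categories keep their ranking: Low = 1, Medium = 2, High = 3.
print(label_encode("Medium", ["Low", "Medium", "High"]))
```

One-hot encoding is used for the embarkation point because S, C and Q have no natural order; label-encoding them would suggest a ranking that does not exist.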
Editor's Notes

  1. Prep: Clusters running and bumped up, no timeout. Make sure the Koalas library is installed. Storage Explorer. Browsers: Kaggle, Azure, Databricks.
  2. What is Kaggle? A simple business problem, so that we can focus on the technical process, which is similar in most solutions. We will focus a lot on Data Engineering and less on mathematical models. It is up to you what tools/frameworks etc. you use. Databricks is a good fit as a unified platform as it allows huge scale of data, Python (and other choices!) and extensibility with common libraries. We are not going to use source control, key vaults or set up Databricks. In the interests of time we are focusing on the workflow only. Important: Numpy and Pandas make a lot of this easier, but I wanted to use native PySpark and Spark ML instead.
  3. We will repeat the same process but with different tools and compare the ease and accuracy.
  4. ML sounds sexy but it’s not really ;)
  5. Switch to Databricks after - https://australiaeast.azuredatabricks.net/?o=839250446484486#notebook/2693555023493589
  6. Helps to summarise data in an overall sense and provide overview information about the data. Depends on the column type. Range is easily affected by extreme values. Switch to Databricks after - https://australiaeast.azuredatabricks.net/?o=839250446484486#notebook/3538880722377381/command/4408895896691184
  7. Go to demo
  8. We won’t go into more detail here as we have already looked at some ways to present insights with box plots and histograms. Databricks has built-in dashboards and graphs, and there’s always Power BI and Matplotlib if need be. The Kaggle competition expects a file as output so we’ll work on that.
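
The output file mentioned in the last note (418 entries plus a header row with 2 columns, per slide 3) could be produced along these lines. This is a sketch under stated assumptions: the column names `PassengerId` and `Survived`, the ID range starting at 892, and the `build_submission` helper are not taken from the deck, and the all-zero predictions are placeholders rather than model output.

```python
import csv
import io

EXPECTED_ROWS = 418  # entry count required by the competition, per slide 3


def build_submission(predictions):
    """Render {passenger_id: 0 or 1} as CSV text with a 2-column header."""
    if len(predictions) != EXPECTED_ROWS:
        raise ValueError(
            f"expected {EXPECTED_ROWS} predictions, got {len(predictions)}"
        )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Assumed column names; check the competition's sample submission file.
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id in sorted(predictions):
        writer.writerow([passenger_id, int(predictions[passenger_id])])
    return buffer.getvalue()


# Placeholder predictions: every passenger marked as "did not survive".
# The ID range is an assumption used only to get 418 distinct keys.
predictions = {pid: 0 for pid in range(892, 892 + EXPECTED_ROWS)}

submission_csv = build_submission(predictions)
lines = submission_csv.splitlines()
print(lines[0])
print(len(lines) - 1, "data rows")
```

Validating the row count before writing catches the most common submission mistake early: predicting on a filtered or deduplicated test set and uploading too few rows.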