©2018 dataiku, Inc.
Applied Data Science Online Course
1st Class: Learning the Basics,
concepts, & your first ML model
Curriculum
Go from Small to Big Data in 4 weeks
● September 20th at 12PM ET: Learning the Basics, concepts, & your first ML model
● September 27th at 12PM ET: The data science workflow, building a predictive model flow
● October 4th at 12PM ET: Getting dirty; data preparation and feature creation
● October 11th at 12PM ET: Understanding your model - and communicating about it
The plan for today
Learning the Basics, concepts, & your first ML model
> Intro (that’s now)
> Background: going from small to big data
> Machine Learning definitions & basic concepts
> The data science workflow
> Questions
> Hands-on exercise: Titanic Prediction data
Going from Small to Big Data
Local processing vs. Database processing
Going from Small to Big Data
Local processing
● Limited power (100k lines)
● Downloading and opening CSV or XLS files on a local machine
● Not distributed
Database processing
● Can process billions of lines
● Files are not stored and processed in the same space as co-workers
● Distributed analysis
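The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `sales` table and its rows are hypothetical, and the in-memory SQLite database from the standard library stands in for a real database server (SQLite itself is not distributed). The point it shows is that the aggregation runs inside the database engine, and only the small summary comes back to the analyst.

```python
import sqlite3

# Hypothetical toy table; in practice this would be a large table
# living on a database server rather than in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5)],
)

# The GROUP BY is executed by the database engine; the client only
# receives one row per region instead of every line of the table.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()
print(totals)
```

With local processing, the equivalent would be to download the whole file and sum the lines on your own machine, which is what stops scaling at some point.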
Cell-to-cell Modifications vs. Mass Actions
Going from Small to Big Data
The basic element you’re working on when you’re modifying data in Excel is the cell.
When you’re working with data from a database, your basic element is a column.
Whether you’re cleaning your data or enriching it with new variables, you’ll be creating new columns in new datasets, never changing one line of a file at a time.
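A minimal sketch of a "mass action", assuming a hypothetical `orders` dataset represented as a list of rows. The helper `add_column` is made up for this illustration; it is not a Dataiku function. It applies one rule to every row and returns a new dataset, leaving the input untouched.

```python
# Hypothetical toy dataset: each dict is one row.
orders = [
    {"product": "pen", "price": 2.5, "quantity": 4},
    {"product": "pad", "price": 1.0, "quantity": 3},
]


def add_column(rows, name, compute):
    """Return a NEW dataset with one extra column computed for every row."""
    return [{**row, name: compute(row)} for row in rows]


# One rule, applied to the whole column at once (no cell-by-cell editing).
enriched = add_column(orders, "total", lambda row: row["price"] * row["quantity"])

for row in enriched:
    print(row["product"], row["total"])
```

The original `orders` list is unchanged, which mirrors the idea of creating new columns in new datasets.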
Potential pain points for analysts to transition
Going from Small to Big Data
1. Interacting with databases
● How to connect to Amazon Web Services? Hadoop?
● How to extract and transform the data?
2. Working with Big Data
● Extract and transform my data on very large files
3. Collaboration with other profiles
● How to benefit from and interact with the work of a Data Scientist or Data Engineer?
4. Advanced Analytics
● How to create Machine Learning models without coding skills? And how to handle geography, time series…?
Concepts and Definitions
What are we talking about?
Definitions
Data Science: An interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured (Wikipedia)
What are we talking about?
Definitions
Machine Learning: A field of study focused on constructing systems that learn from large amounts of data to make predictions or find relationships.
Different types of Machine Learning
Definitions
Supervised
Data is labeled, algorithm predicts an output from the input data
Examples:
• Predicting the genre of a song based on a label
Unsupervised
Data isn’t labeled, algorithm learns the inherent structure of the data and makes a prediction
Examples:
• Predicting the genre of a song without a label
Different types of Machine Learning
Definitions
Prediction
Goal: Create a model that can predict a target variable
Examples:
• Predict the sales price of an apartment
• Forecast the winner of an election
• Diagnose a disease
Clustering
Goal: Separate data into clusters based on similarity (no specific target)
Examples:
• Find groups of similar apartments
• Segment voters into demographic groups
• Group diseases based on symptoms
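To make "find groups of similar apartments" concrete, here is a rough sketch of clustering with a very small k-means on a single feature. Everything in it is an assumption for illustration: the apartment surfaces are made-up numbers, `kmeans_1d` is a hypothetical helper, and k-means is just one possible clustering technique, not necessarily the one used in the course.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Tiny k-means on one numeric feature. There is no target variable:
    groups are formed only from how close the values are to each other."""
    ordered = sorted(values)
    # Deterministic start: k values spread evenly across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical apartment surfaces in square meters.
surfaces = [20, 25, 30, 90, 100, 110]
centers, clusters = kmeans_1d(surfaces, k=2)
print(centers)
print(clusters)
```

Note that no "correct" group is ever supplied; a prediction model, by contrast, would need a target column such as the sales price.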
Different types of Prediction
Definition
● If the target is continuous, then regression. Ex: predicting the price of airline tickets
● If the target is discrete, then classification. Ex: predicting fraud
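The difference between the two kinds of target can be sketched as follows. The numbers are invented for illustration, and the two techniques (a least-squares line and a 1-nearest-neighbor rule) are simple stand-ins chosen for brevity; `fit_line` and `nearest_neighbor_label` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (regression)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def nearest_neighbor_label(xs, labels, query):
    """Return the label of the closest known example (classification)."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return labels[nearest]


# Regression: the target (ticket price) is a continuous number.
distances_km = [100, 200, 300]   # made-up feature values
prices = [50, 90, 130]           # made-up continuous target
slope, intercept = fit_line(distances_km, prices)
predicted_price = slope * 250 + intercept

# Classification: the target (fraud or not) takes discrete values.
amounts = [10, 20, 5000, 7000]           # made-up feature values
labels = ["ok", "ok", "fraud", "fraud"]  # made-up discrete target
predicted_label = nearest_neighbor_label(amounts, labels, 6500)

print(predicted_price, predicted_label)
```

The regression output is a number on a continuous scale, while the classification output is always one of the labels seen during training.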
Different types of Machine Learning
Examples
● Predicting mortgage defaults
● Forecasting lifetime spending of customer
● Grouping songs into genres
● Predicting amount of snowfall
● Segmenting website visitors
● Recommending movies to Netflix users
● Detecting unusual financial transactions
Each example falls under Prediction (classification or regression) or Clustering.
What’s in a dataset
Definitions
● Feature: a column of the dataset (one variable)
● Observation: a row of the dataset (one record)
● Types of data
Train, test, validate
● Training set: used to create your model
● Validation set: used to measure performance
● Testing set: used to check model performance after deployment
If performance on the test set starts to decline, think about retraining your model
Train, test, validate
Split: random or time based
● Training set: 70%
● Validation set: 20%
● Test set: 10%
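A 70/20/10 split, random or time based, could look roughly like this. It is a sketch under assumptions: `split_dataset` is a hypothetical helper, the rows are plain integers standing in for observations, and the `time_key` parameter is an invented way to switch between the two strategies.

```python
import random


def split_dataset(rows, fractions=(0.7, 0.2, 0.1), time_key=None, seed=0):
    """Split rows into (train, validation, test).

    time_key=None  -> random split (rows are shuffled first).
    time_key=func  -> time-based split (oldest rows go to training,
                      most recent rows go to the test set).
    """
    if time_key is None:
        ordered = list(rows)
        random.Random(seed).shuffle(ordered)
    else:
        ordered = sorted(rows, key=time_key)
    n = len(ordered)
    train_end = round(n * fractions[0])
    valid_end = train_end + round(n * fractions[1])
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


# Hypothetical dataset: 100 observations, numbered in arrival order.
rows = list(range(100))

train, valid, test = split_dataset(rows)                                # random
t_train, t_valid, t_test = split_dataset(rows, time_key=lambda r: r)   # time based

print(len(train), len(valid), len(test))
```

A time-based split is usually the safer choice when the model will be applied to future data, since it avoids training on observations that are more recent than the ones used for evaluation.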
The Data Science Workflow
7 steps of a data project
The Data Science Workflow
Advanced version of the workflow
The Data Science Workflow
[Diagram: stages Business Understanding, Data Acquisition & Understanding, Data Preparation, Model Creation, Evaluation, Deployment; repeated over Iteration 1, Iteration 2, Iteration n with Dataset 1, Dataset 2, Dataset n and scored datasets]
Questions?
Hands-on
Kaggle Titanic Challenge
Titanic Use Case
Predicting who survived the tragedy
About Dataiku - Your Path to Enterprise AI

Applied Data Science Course Part 1: Concepts & your first ML model
