SlideShare a Scribd company logo
©2018 dataiku, Inc.
Applied Data Science Online Course
3rd Class: Getting dirty; data
preparation and feature creation
©2018 dataiku, Inc.
● September 20th at 12PM ET: Learning the Basics, concepts, & your first ML model
● September 27th at 12PM ET: The data science workflow, building a predictive model flow
● October 4th at 12PM ET: Getting dirty; data preparation and feature creation
● October 11th at 12PM ET: Understanding your model - and communicating about it
Curriculum
Go from Small to Big Data in 4 weeks
©2018 dataiku, Inc.
Most Advanced version of the workflow
The Data Science Workflow
Determine
business
objectives
Assess situation
Determine data
science goals
Produce project
plan
Collect data Describe data Explore data
Assess qualitySelectCleanBuildReport Reformat
Select modeling
techniques
Generate test
design
Build model Assess model Evaluate results Review process
Determine next
steps
Check common
risks
Plan
deployment
Plan monitoring
& performance
Produce final
report
Review project
©2018 dataiku, Inc.
Data preparation
70% of the work
Select Clean Build ReportReformat
©2018 dataiku, Inc.
Different types of data
©2018 dataiku, Inc.
Different types of Data
Definitions
Structured Unstructured
Data stored with clearly defined data types whose
pattern makes them easily searchable and linkable
- most often tabular format
Data is not structured via pre-defined data
models or schema.
Examples:
• Database with columns for name, phone
number etc
Examples:
• Text files
• Websites and social media
©2018 dataiku, Inc.
Different types of Data
Definitions
©2018 dataiku, Inc.
Different types of Storage
Definitions
Relational Databases Unstructured data
storage
Data is stored in different tables that are
connected with unique identifiers
Systems that can store and process both
structured and unstructured data
Examples:
• Structured Query Language DB
Examples:
• Hadoop
• NOSQL
©2018 dataiku, Inc.
Different type of structured data
Data can be one of several
categories
Examples:
• Gender
• Nationality
• Hair color
Data is a number
Examples:
• Age
• Weight
• Salary
Data is free-form text
Examples:
• Tweets
• Documents
• Business name
+ semi-structured data: json
©2018 dataiku, Inc.
Different type of structured data
Examples
Eye Color (e.g. Blue)
Height (e.g. 170 cm)
Country of Birth (e.g. France)
Postal Code (e.g. 75001)
Date (e.g. Wednesday, 15 Jan 1976)
Address (e.g. 10 Rue Saint Martin, Paris)
Curriculum Vitae
©2018 dataiku, Inc.
Missing values
©2018 dataiku, Inc.
What to do with missing values?
©2018 dataiku, Inc.
Drop values
Delete all data from any participant with missing values
Few Warnings:
• Be sure your sample is large enough, then you likely can
drop data without substantial loss of statistical power.
• Be sure data is not missing at Random: There is a
pattern in the missing data that affect your primary
dependent variables.
For example, lower-income participants are less likely to
respond income column.
©2018 dataiku, Inc.
Imputation
Replacing missing values with substitute values.
14
Method #1: Common Value
For Number:
• Average
• Median
• Constant Value
For Category:
• Treat like the category
« Empty »
• Most frequent value
• A constant value
Method #2: Educated Guess
Infer a missing value:
• If Age is lower than 20, Income
is likely to be 0
• If living in a house in a rich city:
income is likely to be higher
than average
• Nb. of child is likely to not be 0
if age is high and situation not
married
Method #3: Sub-Model
• Create a specific model of
machine learning to
predict the missing
(Regression,
Classification)
©2018 dataiku, Inc.
Example
How to handle the missing data in this doc?
Not OK
• Drop Rows
• Average
Better :
• Educated guess from
SPCategory
• Sub-prediction Model
©2018 dataiku, Inc.
Grouping & Joining Data
©2018 dataiku, Inc.
Join operation
Similar to a VLOOKUP / INDEX MATCH
©2018 dataiku, Inc.
Group by operations
Definition
©2018 dataiku, Inc.
Aggregate data from an entity
How can you aggregate this dataset?
©2018 dataiku, Inc.
Aggregate data from an entity
How can you aggregate this dataset?
Group By Options:
For Number:
• Average
• Sum
• Minimum and Maximum
• Standard Deviation
For Category and Number:
• Count of Value
• Count of Distinct Value
• First and Last Value
• Most Frequent
©2018 dataiku, Inc.
Result
©2018 dataiku, Inc.
Dummification & Rescaling
©2018 dataiku, Inc.
● Some data is in a textual or numerical format but should be understood as a category by your model
● This also allows you to use non numerical data in a linear model
> Create a dummy variable that corresponds to these categories: 0-1 for linear model
● Your numerical data can be distributed in a way that will be misunderstood by your model
> Change the values to rearrange them on a scale
Dummification & Rescaling: the issue
The specificity of your data can create bias in your model
15
0
15
0
18
0
18
0
Y= a1
x1
+ a2
x2
+ a3
x3
+ a4
x4
+
a5
x5
Father’s Height
Your
Height
©2018 dataiku, Inc.
Dummification for linear models
For categorical variables
Then your formula will look like this: Y= abatteries
x1
+ aSoap
x2
+ aCandy
x3…
And X1,
X2,
X3…
are either 0 or 1
©2018 dataiku, Inc.
Rescaling for numerical variables
Without rescaling your formula will look like this:
Yrosalindrosenbaum
= aincome
*28114 + aChild
*1…
YBrettStamm
= aincome
*47901 + aChild
*3…
©2018 dataiku, Inc.
Feature rescaling for linear models
Feature scaling is a method used to standardize the range of independent variables
Example of Rescaling – Min-Max Rescaling
©2018 dataiku, Inc.
Qu s o s?
©2018 dataiku, Inc.
Hands-on
©2018 dataiku, Inc.
©2018 dataiku, Inc.
About Dataiku - Your Path to Enterprise AI

More Related Content

What's hot

The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science Process
Vishal Patel
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
Institute of Technology Telkom
 
Bias in AI
Bias in AIBias in AI
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
Gopal Sakarkar
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
Simplilearn
 
Fairness and Bias in Machine Learning
Fairness and Bias in Machine LearningFairness and Bias in Machine Learning
Fairness and Bias in Machine Learning
Surya Dutta
 
AI and Data Science.pdf
AI and Data Science.pdfAI and Data Science.pdf
AI and Data Science.pdf
MohammadMuzammilAnsa2
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
SadhanaParameswaran
 
Text mining
Text miningText mining
Text mining
Koshy Geoji
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Tharushi Ruwandika
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine Learning
Joel Graff
 
Prepare your data for machine learning
Prepare your data for machine learningPrepare your data for machine learning
Prepare your data for machine learning
Ivo Andreev
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Simplilearn
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
Gregory Piatetsky-Shapiro
 
Artificial Intelligence Introduction & Business usecases
Artificial Intelligence Introduction & Business usecasesArtificial Intelligence Introduction & Business usecases
Artificial Intelligence Introduction & Business usecases
Vikas Jain
 
Fraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive AnalyticsFraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive Analytics
Alejandro Correa Bahnsen, PhD
 
Machine learning
Machine learningMachine learning
Machine learning
omaraldabash
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
Salah Amean
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
Knoldus Inc.
 
Machine Learning project presentation
Machine Learning project presentationMachine Learning project presentation
Machine Learning project presentation
Ramandeep Kaur Bagri
 

What's hot (20)

The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science Process
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
Bias in AI
Bias in AIBias in AI
Bias in AI
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
 
Fairness and Bias in Machine Learning
Fairness and Bias in Machine LearningFairness and Bias in Machine Learning
Fairness and Bias in Machine Learning
 
AI and Data Science.pdf
AI and Data Science.pdfAI and Data Science.pdf
AI and Data Science.pdf
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 
Text mining
Text miningText mining
Text mining
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Applications in Machine Learning
Applications in Machine LearningApplications in Machine Learning
Applications in Machine Learning
 
Prepare your data for machine learning
Prepare your data for machine learningPrepare your data for machine learning
Prepare your data for machine learning
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
 
Artificial Intelligence Introduction & Business usecases
Artificial Intelligence Introduction & Business usecasesArtificial Intelligence Introduction & Business usecases
Artificial Intelligence Introduction & Business usecases
 
Fraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive AnalyticsFraud Detection with Cost-Sensitive Predictive Analytics
Fraud Detection with Cost-Sensitive Predictive Analytics
 
Machine learning
Machine learningMachine learning
Machine learning
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
 
Machine Learning project presentation
Machine Learning project presentationMachine Learning project presentation
Machine Learning project presentation
 

Similar to Applied Data Science Part 3: Getting dirty; data preparation and feature creation

Big data
Big dataBig data
Big data
Srinivasa Reddy
 
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
VMware Tanzu
 
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
DATAVERSITY
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
DataKitchen
 
Big Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationBig Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview Preparation
Intellipaat
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Caserta
 
Nonprofits + Data: Pathway to Innovation
Nonprofits + Data: Pathway to InnovationNonprofits + Data: Pathway to Innovation
Nonprofits + Data: Pathway to Innovation
Tim Sarrantonio
 
Data Visualization: Sales forecasting
Data Visualization: Sales forecastingData Visualization: Sales forecasting
Data Visualization: Sales forecasting
AlgoAnalytics Financial Consultancy Pvt. Ltd.
 
Big Data Testing Strategies
Big Data Testing StrategiesBig Data Testing Strategies
Big Data Testing Strategies
Knoldus Inc.
 
Leap to Next Generation Data Management with Denodo 7.0
Leap to Next Generation Data Management with Denodo 7.0Leap to Next Generation Data Management with Denodo 7.0
Leap to Next Generation Data Management with Denodo 7.0
Denodo
 
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
DATAVERSITY
 
23.pdf
23.pdf23.pdf
23.pdf
JeanJaggu
 
Big data careers
Big data careersBig data careers
Big data careers
Mohammad Hassan Adjigol
 
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Workday, Inc.
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
Caserta
 
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjnWHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
RohitKumar639388
 
Big_Data.pptx
Big_Data.pptxBig_Data.pptx
Big_Data.pptx
mohamedibrahim946387
 
Data Analytics Business Intelligence
Data Analytics Business IntelligenceData Analytics Business Intelligence
Data Analytics Business Intelligence
Ravikanth-BA
 
Data Mining Services in various types
Data Mining Services in various typesData Mining Services in various types
Data Mining Services in various types
loginworks software
 
Big Data and Business Insight
Big Data and Business InsightBig Data and Business Insight
Big Data and Business Insight
Amazon Web Services
 

Similar to Applied Data Science Part 3: Getting dirty; data preparation and feature creation (20)

Big data
Big dataBig data
Big data
 
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
 
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
 
Big Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationBig Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview Preparation
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
Nonprofits + Data: Pathway to Innovation
Nonprofits + Data: Pathway to InnovationNonprofits + Data: Pathway to Innovation
Nonprofits + Data: Pathway to Innovation
 
Data Visualization: Sales forecasting
Data Visualization: Sales forecastingData Visualization: Sales forecasting
Data Visualization: Sales forecasting
 
Big Data Testing Strategies
Big Data Testing StrategiesBig Data Testing Strategies
Big Data Testing Strategies
 
Leap to Next Generation Data Management with Denodo 7.0
Leap to Next Generation Data Management with Denodo 7.0Leap to Next Generation Data Management with Denodo 7.0
Leap to Next Generation Data Management with Denodo 7.0
 
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
 
23.pdf
23.pdf23.pdf
23.pdf
 
Big data careers
Big data careersBig data careers
Big data careers
 
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjnWHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
 
Big_Data.pptx
Big_Data.pptxBig_Data.pptx
Big_Data.pptx
 
Data Analytics Business Intelligence
Data Analytics Business IntelligenceData Analytics Business Intelligence
Data Analytics Business Intelligence
 
Data Mining Services in various types
Data Mining Services in various typesData Mining Services in various types
Data Mining Services in various types
 
Big Data and Business Insight
Big Data and Business InsightBig Data and Business Insight
Big Data and Business Insight
 

More from Dataiku

Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
Dataiku
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
Dataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
Dataiku
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
Dataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
Dataiku
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
Dataiku
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
Dataiku
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku
Dataiku
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015
Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Dataiku
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
Dataiku
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
Dataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Dataiku
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
Dataiku
 

More from Dataiku (20)

Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 

Recently uploaded

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 

Recently uploaded (20)

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 

Applied Data Science Part 3: Getting dirty; data preparation and feature creation

  • 1. ©2018 dataiku, Inc. Applied Data Science Online Course 3rd Class: Getting dirty; data preparation and feature creation
  • 2. ©2018 dataiku, Inc. ● September 20th at 12PM ET: Learning the Basics, concepts, & your first ML model ● September 27th at 12PM ET: The data science workflow, building a predictive model flow ● October 4th at 12PM ET: Getting dirty; data preparation and feature creation ● October 11th at 12PM ET: Understanding your model - and communicating about it Curriculum Go from Small to Big Data in 4 weeks
  • 3. ©2018 dataiku, Inc. Most Advanced version of the workflow The Data Science Workflow Determine business objectives Assess situation Determine data science goals Produce project plan Collect data Describe data Explore data Assess qualitySelectCleanBuildReport Reformat Select modeling techniques Generate test design Build model Assess model Evaluate results Review process Determine next steps Check common risks Plan deployment Plan monitoring & performance Produce final report Review project
  • 4. ©2018 dataiku, Inc. Data preparation 70% of the work Select Clean Build ReportReformat
  • 6. ©2018 dataiku, Inc. Different types of Data Definitions Structured Unstructured Data stored with clearly defined data types whose pattern makes them easily searchable and linkable - most often tabular format Data is not structured via pre-defined data models or schema. Examples: • Database with columns for name, phone number etc Examples: • Text files • Websites and social media
  • 7. ©2018 dataiku, Inc. Different types of Data Definitions
  • 8. ©2018 dataiku, Inc. Different types of Storage Definitions Relational Databases Unstructured data storage Data is stored in different tables that are connected with unique identifiers Systems that can store and process both structured and unstructured data Examples: • Structured Query Language DB Examples: • Hadoop • NOSQL
  • 9. ©2018 dataiku, Inc. Different type of structured data Data can be one of several categories Examples: • Gender • Nationality • Hair color Data is a number Examples: • Age • Weight • Salary Data is free-form text Examples: • Tweets • Documents • Business name + semi-structured data: json
  • 10. ©2018 dataiku, Inc. Different type of structured data Examples Eye Color (e.g. Blue) Height (e.g. 170 cm) Country of Birth (e.g. France) Postal Code (e.g. 75001) Date (e.g. Wednesday, 15 Jan 1976) Address (e.g. 10 Rue Saint Martin, Paris) Curriculum Vitae
  • 12. ©2018 dataiku, Inc. What to do with missing values?
  • 13. ©2018 dataiku, Inc. Drop values Delete all data from any participant with missing values Few Warnings: • Be sure your sample is large enough, then you likely can drop data without substantial loss of statistical power. • Be sure data is not missing at Random: There is a pattern in the missing data that affect your primary dependent variables. For example, lower-income participants are less likely to respond income column.
  • 14. ©2018 dataiku, Inc. Imputation Replacing missing values with substitute values. 14 Method #1: Common Value For Number: • Average • Median • Constant Value For Category: • Treat like the category « Empty » • Most frequent value • A constant value Method #2: Educated Guess Infer a missing value: • If Age is lower than 20, Income is likely to be 0 • If living in a house in a rich city: income is likely to be higher than average • Nb. of child is likely to not be 0 if age is high and situation not married Method #3: Sub-Model • Create a specific model of machine learning to predict the missing (Regression, Classification)
  • 15. ©2018 dataiku, Inc. Example How to handle the missing data in this doc? Not OK • Drop Rows • Average Better : • Educated guess from SPCategory • Sub-prediction Model
  • 17. ©2018 dataiku, Inc. Join operation Similar to a VLOOKUP / INDEX MATCH
  • 18. ©2018 dataiku, Inc. Group by operations Definition
  • 19. ©2018 dataiku, Inc. Aggregate data from an entity How can you aggregate this dataset?
  • 20. ©2018 dataiku, Inc. Aggregate data from an entity How can you aggregate this dataset? Group By Options: For Number: • Average • Sum • Minimum and Maximum • Standard Deviation For Category and Number: • Count of Value • Count of Distinct Value • First and Last Value • Most Frequent
  • 23. ©2018 dataiku, Inc. ● Some data is in a textual or numerical format but should be understood as a category by your model ● This also allows you to use non numerical data in a linear model > Create a dummy variable that corresponds to these categories: 0-1 for linear model ● Your numerical data can be distributed in a way that will be misunderstood by your model > Change the values to rearrange them on a scale Dummification & Rescaling: the issue The specificity of your data can create bias in your model 15 0 15 0 18 0 18 0 Y= a1 x1 + a2 x2 + a3 x3 + a4 x4 + a5 x5 Father’s Height Your Height
  • 24. ©2018 dataiku, Inc. Dummification for linear models For categorical variables Then your formula will look like this: Y= abatteries x1 + aSoap x2 + aCandy x3… And X1, X2, X3… are either 0 or 1
  • 25. ©2018 dataiku, Inc. Rescaling for numerical variables Without rescaling your formula will look like this: Yrosalindrosenbaum = aincome *28114 + aChild *1… YBrettStamm = aincome *47901 + aChild *3…
  • 26. ©2018 dataiku, Inc. Feature rescaling for linear models Feature scaling is a method used to standardize the range of independent variables Example of Rescaling – Min-Max Rescaling
  • 30. ©2018 dataiku, Inc. About Dataiku - Your Path to Enterprise AI