© 2019 KNIME AG. All rights reserved.
Practicing Data Science
KNIME: Rosaria.Silipo@knime.com
@KNIME
Asking for Directions in an AI Project
Introduction
This webinar collects the answers to the questions I get every time I start a new data science project.
The Standard DS Process
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
The Training Process as a Workflow
How Standard is the Standard Process?
• Data Preparation
– Data types (structured vs. unstructured)
– Weird distributions (rare and infrequent classes)
– Model-dependent transformations
• Machine Learning model
– Model yes/no
– Which problem?
– Which model for which problem?
• Deployment
– Reports, dashboards, REST, or just save to DB?
– Scalability
The standard data mining process is not very standard.
Do I need to train a ML model?
• Sometimes a picture is better than 1000 words
– Customer Description
– Money vs. Loyalty
– User Behaviour
– Energy Consumption
• Sometimes we only need KPI measures
– Clickstream Analysis
– Multiple Aggregations
• Sometimes we only need a Data Warehouse (DWH)
– [Diagram: data sources and a DB connected through ETL steps to a DWH and business units]
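When KPI measures are all that is needed, plain aggregation can answer the question without training any model. The following is a minimal sketch in Python, assuming a hypothetical clickstream of (user, page) events; the function name, field names, and sample data are illustrative and do not come from the webinar.

```python
from collections import Counter, defaultdict


def clickstream_kpis(events):
    """Compute simple KPIs from (user, page) click events; no model involved."""
    clicks_per_user = Counter(user for user, _ in events)
    pages_per_user = defaultdict(set)
    for user, page in events:
        pages_per_user[user].add(page)
    n_users = len(clicks_per_user)
    return {
        "total_clicks": len(events),
        "unique_users": n_users,
        "avg_clicks_per_user": len(events) / n_users if n_users else 0.0,
        "distinct_pages_per_user": {u: len(p) for u, p in pages_per_user.items()},
    }


# Hypothetical sample data
events = [("ann", "/home"), ("ann", "/pricing"), ("bob", "/home"), ("ann", "/home")]
kpis = clickstream_kpis(events)
```

Each KPI here is a single aggregation over the raw events, which is the kind of result a dashboard or a data warehouse query could deliver directly.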
Classification or Number Prediction?
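• Classification: the target is a class, e.g. Red, Blonde, Brown, Black
• Number prediction: the target is a number, e.g. energy usage (kWh)
• Binning / discretization
• [Figure: energy usage (kWh) over time, with axis labels "now" and "Wed 12:00"; deep learning network]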
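Binning (discretization) is one way to move between the two problem types: it turns a numeric target into classes, so that a number prediction task can be treated as classification. A minimal sketch, assuming hypothetical kWh thresholds and class labels that are not taken from the slides:

```python
from bisect import bisect_right


def bin_value(value, edges, labels):
    """Map a numeric value to a class label using sorted bin edges.

    A value equal to an edge falls into the upper bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


# Hypothetical thresholds in kWh and hypothetical class names
edges = [1.0, 3.0]
labels = ["low", "medium", "high"]
classes = [bin_value(v, edges, labels) for v in [0.4, 1.0, 2.5, 7.2]]
```

The choice of bin edges changes the class distribution, so it is worth checking how many rows land in each bin before training a classifier on the result.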
Number Prediction or Time Series Analysis?
• Number prediction (e.g. Linear Regression): y from x1, x2, ..., xn
• Time Series Prediction: x(t) from past x(t-1) ... x(t-n)
• [Figure: original vs. predicted values over time; deep learning network]
• Make sure that the future does not mix with the past in data partitioning
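The two points above, predicting x(t) from x(t-1) ... x(t-n) and keeping the future out of the training set, can be sketched in a few lines. This is an illustrative example with made-up numbers and hypothetical function names; it only shows the data layout, not a forecasting model.

```python
def make_lagged(series, n_lags):
    """Build (inputs, target) rows: [x(t-n), ..., x(t-1)] -> x(t)."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows


def chronological_split(rows, train_fraction=0.8):
    """Split without shuffling, so every training row precedes every test row."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Made-up series, oldest value first
series = [10, 12, 13, 15, 14, 16, 18, 17, 19, 21]
rows = make_lagged(series, n_lags=3)
train, test = chronological_split(rows, train_fraction=0.8)
```

A random shuffle before splitting would place later observations in the training set and earlier ones in the test set, which tends to make the evaluation look better than it would be on genuinely new data.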
Supervised vs. Unsupervised ML Algorithms
• Supervised: Labelled Training Set with inputs x1, x2, x3, ..., xn and a target (class or y)
• Unsupervised: Unlabelled Training Set with inputs x1, x2, x3, ..., xn only
– DBSCAN
– Fuzzy c-Means
– Hierarchical clustering
• Active Learning
Unevenly Distributed, Infrequent, and Rare Classes
• Unevenly distributed
• Infrequent
• Rare (anomaly)
– Training only on "normal" data
– Auto-encoder (distance)
– Numerical prediction (distance)
– Clustering
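The idea of training only on "normal" data and then measuring a distance can be illustrated with a much simpler stand-in than an auto-encoder. The sketch below swaps in a plain mean and standard deviation profile; the threshold of 3 standard deviations and the sample readings are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def fit_normal_profile(normal_values):
    """Learn the mean and spread from "normal" observations only."""
    return mean(normal_values), pstdev(normal_values)


def is_anomaly(value, profile, threshold=3.0):
    """Flag a value whose distance from the normal mean exceeds threshold * sigma."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical readings recorded while the system behaved normally
normal = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]
profile = fit_normal_profile(normal)
flags = [is_anomaly(v, profile) for v in [20.1, 25.0]]
```

No anomalous example is needed for training, which is the point when anomalies are too rare to form a class of their own. An auto-encoder or a numerical predictor follows the same pattern, with the reconstruction or prediction error playing the role of the distance.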
Structured Data vs. Unstructured Data
• Structured Data
• Unstructured Data: Text, Networks, Images
– Text / Image / Network / Chemistry Extension
– To numbers
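Turning unstructured data "to numbers" can be illustrated for text with a bag-of-words encoding. This is a minimal sketch, assuming whitespace tokenization and two made-up documents; it is one common encoding, not necessarily the one used by the extensions named above.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of all distinct lowercase tokens across the documents."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def to_vector(document, vocabulary):
    """Term-count vector of one document over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


# Made-up example documents
docs = ["good product", "bad product bad service"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

Once each document is a fixed-length numeric vector, it can be fed to the same algorithms used for structured data.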
The Deployment Process as a Workflow
Deployment: REST API, Shiny Dashboards, plain Background Execution
Your workflow as ...
... a REST API
... Guided Application
Scalability: Spark, Parallel Execution, in-DB Processing
• Spark
• Parallel Execution on Server
• In-database processing
Summary
• Is the standard DS process so standard?
• Do I need a ML model?
• Training
– Classification or Number Prediction?
– Number Prediction or Time Series Analysis?
– Supervised or Unsupervised Learning?
– Unevenly Distributed, Infrequent, and Rare Classes
– Structured vs. Unstructured Data
• Deployment
– REST API, Dashboards, Background Execution
– Scalability Options
KNIME Spring Summit 2019
March 18 – 22 at bcc Berlin Congress Center, Berlin
• Monday & Tuesday: One-day courses
• Wednesday & Thursday: Summit sessions
• Friday: Workshops
Use the code WEBINAR-20 for 20% off tickets!
Register at knime.com/spring-summit2019
Practicing Data Science
Free Copy of “Practicing Data Science” e-Book from KNIME Press
https://www.knime.com/knimepress
with this code: PDS-WEBINAR-0319
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH,
and are registered in the United States. KNIME® is also registered in Germany.
Thank You!
Let’s unroll it!
It always starts with some data …
• Data Preparation
– Data Manipulation
– Data Blending
– Missing Values Handling
– Feature Generation
– Dimensionality Reduction
– Feature Selection
– Outlier Removal
– Normalization
– Partitioning
– …
• Model Training
– Model Training
– Bag of Models
– Model Selection
– Ensemble Models
– Own Ensemble Model
– External Models
– Import Existing Models
– Model Factory
– …
• Model Optimization
– Parameter Tuning
– Parameter Optimization
– Regularization
– Model Size
– No. Iterations
– …
• Model Evaluation
– Performance Measures
– Accuracy
– ROC Curve
– Cross-Validation
– …
• Deployment
– Files & DBs
– Dashboards
– REST API
– SQL Code Export
– Reporting
– …
The many Lives of a Dataset
• Data Preparation: Original Data Set with Past Observations
– Partitioning: Training Set, Validation Set, Test Set
• Model Training: Training Set
• Model Optimization: Validation Set
• Model Evaluation: Test Set
• Deployment: New Data from Real World Applications
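The three-way partitioning can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name are assumptions for illustration; the slides do not prescribe specific values.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must leave room for a test set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# 100 hypothetical row ids standing in for past observations
train_set, validation_set, test_set = partition(range(100))
```

The training set fits the model, the validation set guides parameter tuning, and the test set is touched only once for the final evaluation. The random shuffle is suitable for independent rows; time-ordered data needs a split that preserves chronological order instead.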
Data Exploration
• Data Understanding is a Data Exploration phase
• The Data Exploration phase is useful to get to know the data
• KNIME offers a few visualization nodes to build dashboards to explore data
