1. © 2019 KNIME AG. All rights reserved.
Practicing Data Science
KNIME: Rosaria.Silipo@knime.com
@KNIME
Asking for Directions in an AI Project
3. Introduction
This webinar collects the answers to the questions I get every time I start a new data science project
4. The Standard DS Process
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
5. The Training Process as a Workflow
6. How Standard is the Standard Process?
• Data Preparation
– Data types (structured vs. unstructured)
– Weird distributions (rare and infrequent classes)
– Model-dependent transformations
• Machine Learning model
– Model yes/no
– Which problem?
– Which model for which problem?
• Deployment
– Reports, dashboards, REST, or just save to DB?
– Scalability
The standard data mining process is not very standard
7. Do I need to train an ML model?
Sometimes a picture is better than 1000 words
Customer Description
Money vs. Loyalty
User Behaviour
Energy Consumption
Sometimes we only need KPI measures.
Clickstream Analysis
Multiple Aggregations
Sometimes we only need a Data Warehouse
[Diagram: data sources, a DB, and a DWH connected through ETL steps to several Business Units]
8. Classification or Number Prediction?
Classes: Red, Blonde, Brown, Black
[Plot: Energy Usage (kWh) over time; axis marks: now, Wed 12:00]
Binning
Discretization
deep learning network
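Binning (discretization) is what turns a number prediction target into a set of classes. The sketch below illustrates the idea in plain Python. The bin edges and the class labels ("low", "medium", "high") are assumptions made up for this example, not values from the webinar.

```python
from bisect import bisect_right

# Hypothetical bin edges (in kWh) and class labels; derive real ones from your data.
BIN_EDGES = [1.0, 3.0]                    # boundaries between consecutive bins
BIN_LABELS = ["low", "medium", "high"]    # one more label than edges


def discretize(value, edges=BIN_EDGES, labels=BIN_LABELS):
    """Map a numeric value to the label of the bin it falls into.

    A value exactly equal to an edge is assigned to the upper bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


energy_usage_kwh = [0.4, 1.0, 2.7, 3.5]
classes = [discretize(v) for v in energy_usage_kwh]
print(classes)
```

After binning, the numeric target has become a class column, so a classifier can be trained instead of a regression model.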
9. Number Prediction or Time Series Analysis?
deep learning network
Linear Regression: y from x1, x2, ..., xn
Time Series Prediction: x(t) from past x(t-1) ... x(t-n)
[Plot: original vs. predicted values over time]
Make sure that the future does not mix with the past in data partitioning
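The two points on this slide, predicting x(t) from its past values and keeping the future out of the training set, can be sketched as follows. This is a minimal illustration in plain Python. The function names, the sample series, and the 80/20 split are assumptions for the example.

```python
def make_lagged_rows(series, n_lags):
    """Build (inputs, target) pairs where x(t) is the target and
    x(t-n_lags) ... x(t-1) are the inputs."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows


def time_ordered_split(rows, train_fraction=0.8):
    """Split without shuffling, so every training row precedes every test row."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical series, ordered from oldest to newest observation.
series = [10, 12, 13, 15, 14, 16, 18, 17, 19, 21]
rows = make_lagged_rows(series, n_lags=3)
train, test = time_ordered_split(rows, train_fraction=0.8)
print(len(train), len(test))
```

A random split would place later observations in the training set and earlier ones in the test set, which lets the model see the future it is supposed to predict.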
10. Supervised vs. Unsupervised ML Algorithms
Supervised: Labelled Training Set
x1 x2 x3 ... xn class
x1 x2 x3 ... xn y
Unsupervised: Unlabelled Training Set
x1 x2 x3 ... xn
DBSCAN
Fuzzy c-Means
Hierarchical clustering
Active Learning
11. Unevenly Distributed, Infrequent, and Rare Classes
Unevenly distributed | Infrequent | Rare (anomaly)
[Diagram labels: Auto-encoder, numerical prediction, clustering, distance]
Training only on „normal“ data
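When the rare class is an anomaly, the model is trained only on normal data, and anything far from it is flagged. The sketch below shows that idea with a deliberately simple stand-in: a mean and standard deviation profile instead of the auto-encoder named on the slide. The readings and the threshold of 3 standard deviations are assumptions for the example.

```python
from statistics import mean, stdev


def fit_normal_profile(normal_values):
    """Learn a profile (mean, standard deviation) from "normal" data only."""
    spread = stdev(normal_values)
    if spread == 0:
        raise ValueError("normal data has no variation; cannot scale distances")
    return mean(normal_values), spread


def distance(value, profile):
    """Distance of a value from the normal profile, in standard deviations."""
    center, spread = profile
    return abs(value - center) / spread


def is_anomaly(value, profile, threshold=3.0):
    """Flag a value whose distance from the normal profile exceeds the threshold."""
    return distance(value, profile) > threshold


# Hypothetical sensor readings recorded while the system behaved normally.
normal_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 19.7, 20.0]
profile = fit_normal_profile(normal_readings)
print(is_anomaly(20.1, profile), is_anomaly(25.0, profile))
```

No anomalous examples are needed for training, which is the point when anomalies are too rare to form a class of their own.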
12. Structured Data vs. Unstructured Data
Structured Data
Unstructured Data: Text, Networks, Images
Text / Image / Network / Chemistry Extension
To numbers
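As one illustration of the "to numbers" step for text, the sketch below builds a bag-of-words count vector for each document in plain Python. This is only one common encoding. It is not a description of what the KNIME extensions do internally, and the sample documents are made up.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the distinct lowercase words across all documents.

    Sorting gives a stable column order for the numeric vectors.
    """
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_count_vector(document, vocabulary):
    """Represent a document as word counts, one position per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


documents = ["good product good price", "bad product"]
vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(doc, vocabulary) for doc in documents]
print(vocabulary)
print(vectors)
```

Once every document is a fixed-length numeric vector, it can be fed to the same algorithms used for structured data.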
13. The Deployment Process as a Workflow
14. Deployment: REST API, Shiny Dashboards, plain Background Execution
Your workflow as ...
... a REST API
... a Guided Application
15. Scalability: Spark, Parallel Execution, in-DB Processing
Spark
Parallel Execution on Server
In-database processing
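In-database processing means the aggregation runs where the data lives, and only the small result travels back. The sketch below shows the idea with Python's built-in sqlite3 module and an in-memory database. The table name energy_usage and its rows are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical table, standing in for a real DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE energy_usage (customer TEXT, kwh REAL)")
conn.executemany(
    "INSERT INTO energy_usage VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],
)

# The GROUP BY runs inside the database; only one row per customer is fetched.
totals = conn.execute(
    "SELECT customer, SUM(kwh) FROM energy_usage "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

The alternative, fetching every row and summing on the client, moves far more data and scales worse as the table grows.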
16. Summary
• Is the standard DS process so standard?
• Do I need an ML model?
• Training
– Classification or Number Prediction?
– Number Prediction or Time Series Analysis?
– Supervised or Unsupervised Learning?
– Unevenly Distributed, Infrequent, and Rare Classes
– Structured vs. Unstructured Data
• Deployment
– REST API, Dashboards, Background Execution
– Scalability Options
17. KNIME Spring Summit 2019
March 18 – 22 at bcc Berlin Congress Center, Berlin
• Monday & Tuesday: One-day courses
• Wednesday & Thursday: Summit sessions
• Friday: Workshops
Use the code WEBINAR-20 for 20% off tickets!
Register at knime.com/spring-summit2019
18. Practicing Data Science
Free Copy of “Practicing Data Science” e-Book from KNIME Press
https://www.knime.com/knimepress
with this code: PDS-WEBINAR-0319
19. Thank You!
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States. KNIME® is also registered in Germany.
20. Let’s unroll it!
It always starts with some data …
• Data Preparation
– Data Manipulation
– Data Blending
– Missing Values Handling
– Feature Generation
– Dimensionality Reduction
– Feature Selection
– Outlier Removal
– Normalization
– Partitioning
– …
• Model Training
– Model Training
– Bag of Models
– Model Selection
– Ensemble Models
– Own Ensemble Model
– External Models
– Import Existing Models
– Model Factory
– …
• Model Optimization
– Parameter Tuning
– Parameter Optimization
– Regularization
– Model Size
– No. Iterations
– …
• Model Evaluation
– Performance Measures
– Accuracy
– ROC Curve
– Cross-Validation
– …
• Deployment
– Files & DBs
– Dashboards
– REST API
– SQL Code Export
– Reporting
– …
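Two of the evaluation items listed on this slide, accuracy and cross-validation, can be sketched in a few lines of plain Python. The function names and the interleaved fold assignment are assumptions for the example. For time series, this interleaving would mix future and past and should not be used.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Every row index appears in exactly one test fold.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    for fold in range(k):
        test_idx = list(range(fold, n_rows, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_rows) if i not in held_out]
        yield train_idx, test_idx


folds = list(k_fold_indices(n_rows=6, k=3))
print(folds[0])
print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```

In practice, a model would be trained on each train_indices subset and scored on the matching test_indices, and the k scores averaged.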
21. The many Lives of a Dataset
Partitioning:
• Training Set
• Validation Set
• Test Set
Data Preparation: Original Data Set with Past Observations
Model Training: Training Set
Model Optimization: Validation Set
Model Evaluation: Test Set
Deployment: New Data from Real World Applications
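The partitioning of the original data set into training, validation, and test sets can be sketched as below. This is a plain Python illustration. The 70/15/15 proportions and the fixed seed are assumptions for the example. The random shuffle is only appropriate when rows are not time-ordered.

```python
import random


def partition(rows, train_fraction=0.7, validation_fraction=0.15, seed=42):
    """Shuffle a copy of the rows and cut it into training, validation, and test sets.

    The test set receives whatever remains after the first two cuts.
    """
    if (train_fraction <= 0 or validation_fraction <= 0
            or train_fraction + validation_fraction >= 1):
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return training, validation, test


rows = list(range(100))                     # stand-in for past observations
training, validation, test = partition(rows)
print(len(training), len(validation), len(test))
```

The training set fits the model, the validation set guides optimization, and the test set is held back for the final evaluation, so no row serves two of these purposes.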
22. Data Exploration
• Data Understanding is a Data Exploration phase
• The Data Exploration phase is useful to get to know the data
• KNIME offers a few visualization nodes to build dashboards to explore data