Supervised Learning
Orozco Hsu
2023-12-05
About me
• Education
• NCU (MIS), NCCU (CS)
• Experiences
• Telecom big data Innovation
• Retail Media Network (RMN)
• Customer Data Platform (CDP)
• Know-your-customer (KYC)
• Digital Transformation
• Research
• Data Ops (ML Ops)
• Business Data Analysis, AI
Tutorial Content
• What is supervised learning
• Decision Tree
• Build the supervised learning models
• Homework
Code
• Download code
• https://drive.google.com/drive/folders/19fTtqp-nyASeL-Qkpr7yjbt18YgfKsyj?usp=sharing
Reviewing the previous assignment
About 80% of the work is data pre-processing!
• It requires data engineering work.
• Launch JupyterLab and open the notebook Transpose_matrix.ipynb.
Previous Homework (workflow)
03.ows
Previous Homework (Find the rules!)
What is supervised learning?
Supervised learning
• Supervised learning discovers patterns in the data that relate the data attributes to a target (labeled) attribute.
• These patterns are then used to predict the value of the target attribute in future data instances.
• Classic supervised learning tasks
• Classification => label prediction (e.g., a Yes/No binary question)
• Regression => measure prediction (a continuous numeric value)
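The difference between the two tasks can be sketched with one tiny learner applied to two kinds of target. The data below is made up for illustration, and the 1-nearest-neighbour rule is only a stand-in chosen for brevity; it is not a method from the slides.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Predict by copying the target of the closest training instance."""
    distances = [abs(x - query) for x in train_x]
    return train_y[distances.index(min(distances))]


# Toy labeled data (hypothetical values).
hours_studied = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]      # data attribute
passed = ["No", "No", "No", "Yes", "Yes", "Yes"]    # classification target (label)
exam_score = [35.0, 42.0, 50.0, 71.0, 80.0, 88.0]   # regression target (numeric)

# Same attribute, two different targets: a label versus a continuous value.
predicted_label = nearest_neighbor_predict(hours_studied, passed, 6.4)
predicted_score = nearest_neighbor_predict(hours_studied, exam_score, 2.2)
print(predicted_label, predicted_score)
```

In both calls the model learns from instances whose target is already known, which is what makes the learning "supervised".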
Supervised learning
Ref: https://www.tibco.com/reference-center/what-is-supervised-learning
About the decision tree model
Decision Tree Model
• A decision tree is a prediction method that asks a sequence of conditions with YES/NO answers; when the outcome is a category, this is called classification.
• Because it resembles human thought processes, its results are easy to understand.
• The parts involving conditions are called nodes or internal nodes. The topmost node is called the root node.
• The end nodes of a decision tree are called leaf nodes; each one represents a category.
(Figure: a decision tree with the root node, internal nodes, leaf nodes, and splits labeled.)
The tree model is one kind of supervised learning model
About the tree-based model
• In general, it can make classification predictions or numerical predictions.
• A more advanced tree-based model is the Random Forest, which combines multiple CART models.
• The idea is to ensemble several weak learners into a stronger model.
Leo Breiman introduced the CART, Random Forest, and Bagging algorithms.
https://en.wikipedia.org/wiki/Leo_Breiman
About the tree based model
• Classify by testing a condition on the feature values at each internal node.
Temp < 15?
  Yes: Uncozy
  No: Temp > 25?
    Yes: Uncozy
    No: Humid < 40%?
      Yes: Uncozy
      No: Humid > 60%?
        Yes: Uncozy
        No: Cozy
Feature importance: Temp > Humid
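The tree on this slide can be read as a chain of conditions. A minimal sketch follows, assuming the chain order Temp < 15, Temp > 25, Humid < 40%, Humid > 60% and that every "Yes" branch ends in an Uncozy leaf; the function and parameter names are hypothetical.

```python
def classify_comfort(temp, humid_pct):
    """Walk the tree from the root; each `if` is one internal node."""
    if temp < 15:         # root node
        return "Uncozy"
    if temp > 25:
        return "Uncozy"
    if humid_pct < 40:
        return "Uncozy"
    if humid_pct > 60:
        return "Uncozy"
    return "Cozy"         # leaf reached only when every condition answers "No"


for sample in [(20, 50), (10, 50), (20, 70)]:
    print(sample, classify_comfort(*sample))
```

Because the Temp conditions sit closer to the root, they decide more samples than the Humid conditions, which matches the slide's note that Temp is the more important feature.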
About the tree based model
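• Classification: in the Temp versus Humid % plane, the Cozy region is the rectangle with Temp between 15 and 25 and Humid % between 40 and 60; everything outside it is Non-cozy.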
About the tree based model
https://sharkyun.medium.com/decision-tree-%E6%B1%BA%E7%AD%96%E6%A8%B9-41597818c075
About the tree based model
• Each block contains LIKE and DISLIKE samples.
• The probability of LIKE in each block:
• The first block: 5/6
• The second block: 8/12
• The third block: 3/10
• The fourth block: 1/4
Quiz:
1. How do you select the conditions that give the most LIKE samples?
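One way to approach the quiz is to compare the LIKE proportion of each block and keep the block with the highest value. The sketch below assumes the fractions on the slide are LIKE proportions; the dictionary keys are hypothetical names.

```python
from fractions import Fraction

# LIKE proportion per block, taken from the slide.
like_rate = {
    "first": Fraction(5, 6),
    "second": Fraction(8, 12),
    "third": Fraction(3, 10),
    "fourth": Fraction(1, 4),
}

# Rank blocks from the most to the least LIKE-heavy.
ranking = sorted(like_rate, key=like_rate.get, reverse=True)
best_block = ranking[0]
print(ranking, float(like_rate[best_block]))
```

The conditions that lead to the top-ranked block are the ones that describe the most LIKE-heavy region.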
About the tree based model
• As a regression solver, the leaf nodes contain numeric values, such as the:
• Mean
• Mode
• Median
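As an illustration of how a regression leaf turns its training samples into one predicted value, the sketch below summarizes a made-up list of target values with each of the three statistics, using Python's standard `statistics` module.

```python
import statistics

# Target values of the training samples that ended up in one leaf (hypothetical).
leaf_targets = [10.0, 12.0, 12.0, 15.0, 31.0]

leaf_mean = statistics.mean(leaf_targets)      # sensitive to the outlier 31.0
leaf_median = statistics.median(leaf_targets)  # robust to the outlier
leaf_mode = statistics.mode(leaf_targets)      # most frequent value
print(leaf_mean, leaf_median, leaf_mode)
```

Every new sample that reaches this leaf would receive the same prediction, whichever statistic is chosen.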
About the tree based model
Quiz:
1. Can you explain the importance of features?
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/datasets/descr/diabetes.rst
About the tree based model
• Fast computation speed.
• Minimal data engineering required: no need for data normalization, dummy variables, one-hot encoding, etc.
• CART and C4.5 can handle continuous and categorical features at the same time; ID3 handles only categorical features.
• Prone to the impact of sample imbalance (in the dependent variable); the data needs pre-processing in advance.
• Over-sampling (Synthetic Minority Oversampling Technique, SMOTE)
• Under-sampling
• High interpretability, suitable for visual analysis, and easy to extract rules.
• The model results serve as conditions, and can be directly retrieved later using SQL syntax.
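The two rebalancing ideas above can be sketched with plain random resampling. Note that this is not SMOTE: SMOTE synthesizes new minority points, while the sketch below only duplicates or drops existing rows. All names and data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


rows = [[i] for i in range(10)]
labels = ["no"] * 8 + ["yes"] * 2          # imbalanced dependent variable
_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

Resampling should be applied to the training data only, so that the testing data keeps the original class distribution.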
The three types of tree-based models
• ID3 algorithm
• Chooses the split with the highest information gain
• C4.5 algorithm
• Chooses the split with the highest information gain ratio
• CART algorithm (Classification and Regression Tree)
• Chooses the split with the lowest Gini impurity
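For the CART criterion, a minimal sketch of Gini impurity is shown below, using the standard definition (one minus the sum of squared class proportions). The candidate splits and function names are made up for illustration.

```python
def gini_impurity(class_counts):
    """Gini impurity of one node: 1 - sum(p_i ** 2)."""
    total = sum(class_counts)
    return 1.0 - sum((n / total) ** 2 for n in class_counts)


def weighted_gini(children):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini_impurity(child) for child in children)


# Class counts [yes, no] in the two children of two candidate splits.
split_a = [[4, 1], [1, 4]]
split_b = [[3, 2], [2, 3]]

# CART keeps the candidate with the lowest weighted impurity.
best_split = min([split_a, split_b], key=weighted_gini)
print(weighted_gini(split_a), weighted_gini(split_b))
```

A pure node has impurity 0, and a two-class node split 50/50 has the maximum impurity of 0.5.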
The three types of tree-based models
https://github.com/richzw/MachineLearningTips/blob/master/DecisionTree.md
Node split criteria
• Discussing the criteria for splitting nodes is mainly to explore
the feature importance.
• Why do we need the importance of features?
Node split criteria
• In 1948, Shannon published "A Mathematical Theory of Communication", laying the foundation for modern information theory. He introduced the concept of information entropy, which solved the problem of quantifying information.
• The greater the uncertainty of a problem, the more information is needed to understand it, which means a higher information entropy.
• Information entropy: H(X) = -Σ P(x) log2 P(x), where P(x) is the probability of event x occurring. (Pictured: Claude Shannon)
Node split criteria
• A box contains 5 white balls and 5 red balls. If you randomly pick one ball, what is its color? How much information does this question carry?
• The probability of drawing a white ball and that of drawing a red ball are both 1/2. Plugging them into the information entropy formula gives H = -(1/2 * log2(1/2) + 1/2 * log2(1/2)) = 1 bit.
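The box example can be checked numerically. This is a minimal sketch of Shannon entropy in bits; the function name and the dictionary layout are illustrative choices.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


box = {"white": 5, "red": 5}
total = sum(box.values())
box_entropy = entropy([count / total for count in box.values()])
print(box_entropy)  # 1.0
```

With two equally likely colors the question carries exactly one bit, the maximum for a two-outcome draw; any less balanced box gives a lower value.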
Node split criteria
• When building a decision tree, how do you prioritize which feature to
choose for splitting?
• Examine all features, calculate the change in information entropy before and
after splitting the dataset for each feature.
• Finally, choose the feature that results in the largest change in information
entropy as the primary basis for splitting nodes.
Node split criteria
Ball Color        Red    White  Black  Blue
Quality           1      1      1      2
Quantity          1      2      1      2
Number of balls   2      2      4      8
Probability       2/16   2/16   4/16   8/16
• Which feature should be split on first: Quality or Quantity?
Node split criteria
Split on Quantity = 1 (16 balls, entropy = 1.75)
• Yes: Red*2, Black*4 (entropy = 0.918296)
• No: White*2, Blue*8 (entropy = 0.721928)
Split on Quality = 1 (16 balls, entropy = 1.75)
• Yes: Red*2, White*2, Black*4 (entropy = 1.5)
• No: Blue*8 (entropy = 0)
Quiz: How do you select the feature to split the node?
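The quiz can be worked through by computing the information gain of each candidate feature on the 16 balls from the table. The sketch below is one way to do it; the function names and the row layout are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target="color"):
    """Parent entropy minus the size-weighted entropy of the children."""
    parent = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return parent - remainder


# (color, quality, quantity, number of balls), as in the table.
ball_types = [("Red", 1, 1, 2), ("White", 1, 2, 2), ("Black", 1, 1, 4), ("Blue", 2, 2, 8)]
balls = [
    {"color": color, "quality": quality, "quantity": quantity}
    for color, quality, quantity, count in ball_types
    for _ in range(count)
]

gains = {feature: information_gain(balls, feature) for feature in ("quality", "quantity")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

Quality removes 1.0 bit of the 1.75 bits of uncertainty, while Quantity removes about 0.954 bits, so Quality is the better first split under this criterion.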
Node split criteria
• When information gain is the criterion for splitting, it tends to favor features with the largest number of distinct values.
• For example, using a personal ID as a feature might leave each leaf node with only one sample.
• Solution:
• Merge feature values based on business experience to reduce the number of distinct values.
• Use another criterion, such as the information gain ratio (C4.5).
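The bias toward many-valued features, and how the gain ratio counteracts it, can be seen on a toy dataset. The sketch assumes the usual C4.5 definition (gain divided by the entropy of the feature's own value distribution); the data and names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of a list of values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_and_ratio(feature_values, labels):
    """Return (information gain, gain ratio) of splitting `labels` on a feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)  # grows with the number of distinct values
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


labels = ["yes"] * 4 + ["no", "no", "no", "yes"]
personal_id = list(range(8))              # unique per sample
segment = ["a"] * 4 + ["b"] * 4           # only two distinct values

id_gain, id_ratio = gain_and_ratio(personal_id, labels)
seg_gain, seg_ratio = gain_and_ratio(segment, labels)
print(id_gain, id_ratio, seg_gain, seg_ratio)
```

By raw gain the ID column wins because every leaf holds one sample; by gain ratio the two-valued feature wins, since the ID's gain is divided by log2(8) = 3.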
Node split criteria
• Orange 3 Tree widget (hyper-parameters)
• https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/model/tree.html
Node split criteria
• Prioritize the hyper-parameter that limits the maximum tree depth.
Model evaluation
About bias and variance
• It is like marksmanship training
• Hitting the target
• Low bias, low variance
• Concentrated but off target
• Low variance, high bias
Lower the total error
What is overfitting?
Overfitting usually comes with high variance.
Causes of overfitting:
1. The dataset has noise
2. The hyper-parameters make the model too complex
3. The dataset is too small
Ways to reduce overfitting:
1. Early stopping
2. Feature reduction
3. Normalization
4. Adjust the hyper-parameters
5. Switch to another algorithm
What is underfitting?
• The model is too simple
• Increase the number of iterations until convergence
• Adjust the hyper-parameters
• Add more features to the dataset
• Switch to a more complex model
Supervised learning evaluation
• Confusion matrix (for classification)
Supervised learning evaluation
True Positive Rate (TPR)
False Positive Rate (FPR)
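Both rates come straight from the four cells of the confusion matrix: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A minimal sketch with made-up predictions follows; the function name and the "Yes"/"No" labels are illustrative.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, FP, TN, FN for a binary classification."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


actual = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
tpr = tp / (tp + fn)                 # share of actual positives that were found
fpr = fp / (fp + tn)                 # share of actual negatives flagged by mistake
accuracy = (tp + tn) / len(actual)
print(tp, fp, tn, fn, tpr, fpr, accuracy)
```

An ROC curve plots TPR against FPR as the decision threshold varies, so these two numbers are a single point on that curve.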
Tree based models and prediction
Tree based model (it uses the ID3 algorithm)
• Orange 3 has its own re-defined algorithm for tree-based models.
• You must check the feature values of your dataset carefully!
• Why?
• In Orange 3, you can use both categorical and numeric inputs.
04.ows
Tree based model
Rank: Feature selection
Data Sampler: Splitting the dataset
ROC Analysis: Model evaluation
Confusion Matrix: Model evaluation
The AUC on the testing dataset should be below the AUC on the training dataset.
Tree based model
-- Rules read off the tree; predicted_class is an illustrative alias.
SELECT CASE
  WHEN "petal.length" <= 1.9 THEN 'Iris-setosa'
  WHEN "petal.length" > 1.9 AND "petal.width" > 1.7 THEN 'Iris-virginica'
  WHEN "petal.length" > 1.9 AND "petal.width" <= 1.7 AND "petal.length" <= 4.9 THEN 'Iris-versicolor'
  WHEN "petal.length" > 1.9 AND "petal.width" <= 1.7 AND "petal.length" > 4.9 AND "petal.width" <= 1.5 THEN 'Iris-virginica'
  WHEN "petal.length" > 1.9 AND "petal.width" <= 1.7 AND "petal.length" > 4.9 AND "petal.width" > 1.5 THEN 'Iris-versicolor'
END AS predicted_class
FROM Iris;
Tree based model
• Quiz: Can you read the feature importance from the tree layout?
• Yes, petal.length is the most significant feature!
Feature selection
Random Forest algorithm
• The Random Forest algorithm's widespread popularity stems from its user-friendly nature and adaptability, which let it tackle both classification and regression problems effectively.
• Its strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning.
Bagging and bootstrap
• Randomly draw multiple subsets of the dataset (rows and features) and build a CART model on each.
• Each CART model must have an accuracy of over 50%.
• Finally, combine the outputs of these CART models to produce the ensemble prediction.
• Voting
https://gaussian37.github.io/ml-concept-bagging/
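The two mechanical parts of bagging, drawing bootstrap subsets and voting, can be sketched without training real trees. Fitting a CART model on each subset is left out, and the per-tree predictions below are hypothetical placeholders.

```python
import random
from collections import Counter


def bootstrap_sample(rows, feature_names, n_features, rng):
    """Draw rows with replacement and choose a random subset of features."""
    sampled_rows = rng.choices(rows, k=len(rows))             # same size, with replacement
    sampled_features = rng.sample(feature_names, n_features)  # without replacement
    return sampled_rows, sampled_features


def majority_vote(predictions):
    """Return the class predicted by most models."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
feature_names = ["temp", "humid", "wind"]
rows = [{"temp": 10 + i, "humid": 30 + 3 * i, "wind": i % 4} for i in range(12)]

# One (rows, features) subset per tree; a real workflow would fit a CART model on each.
subsets = [bootstrap_sample(rows, feature_names, 2, rng) for _ in range(5)]

# Hypothetical outputs of five trained trees for one new sample.
tree_predictions = ["Cozy", "Uncozy", "Cozy", "Cozy", "Uncozy"]
ensemble_prediction = majority_vote(tree_predictions)
print(ensemble_prediction)  # Cozy
```

For a regressor, the voting step would typically be replaced by averaging the numeric outputs of the trees.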
Random Forest algorithm
In general, Random Forest tends to perform better.
Stack
• It supports classifier or regressor.
• It normally works well on complex data sets.
• Steps:
1. Use the training set to build the base models.
2. Stack all prediction outputs from the testing inputs.
3. Add the testing Y to the stacking dataset, and build a meta model.
4. Get the stacking results.
• Download Bank Marketing dataset
Stacking
This workflow is built only from the training dataset; there is no testing dataset!
Quiz: Do you know how to split the dataset and rebuild the workflow?
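In Orange 3 the split is done with the Data Sampler widget. As a rough illustration of what such a split does, here is a minimal sketch of a random train/test split; the function name, the 70/30 ratio, and the seed are assumptions, not settings from the slides.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))   # stand-in for the dataset rows
train, test = train_test_split(rows)
print(len(train), len(test))
```

The models are then built on `train` only, and `test` is kept aside for the evaluation, so the reported scores reflect data the models have not seen.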
Stacking
• Overall, Stacking gives better results!
Homework
• Use healthcare_dataset.csv to build a classification model and predict the Test Results.
• Follow steps such as:
• Exploratory Data Analysis / pre-processing (missing values, data cleaning…)
• Feature selection
• Build multiple models and pick the one with the best accuracy.
• Submit the evaluation of your model on the training and testing data
• Document the steps you used above in a PPT, and present your observations clearly with text or illustrations.

2023 Supervised Learning for Orange3 from scratch

  • 2. About me • Education • NCU (MIS), NCCU (CS) • Experiences • Telecom big data Innovation • Retail Media Network (RMN) • Customer Data Platform (CDP) • Know-your-customer (KYC) • Digital Transformation • Research • Data Ops (ML Ops) • Business Data Analysis, AI
  • 3. Tutorial Content • What is supervised learning • Decision Tree • Build the supervised learning models • Homework
  • 4. Code • Download code • https://drive.google.com/drive/folders/19fTtqp-nyASeL-Qkpr7yjbt18YgfKsyj?usp=sharing
  • 5. Reviewing the previous assignment
  • 6. 80% of the work is in data pre-processing! • It requires data engineering work. • Launch Jupyter Lab and open the notebook. • Notebook: Transpose_matrix.ipynb
  • 8. Previous Homework (Find the rules!)
  • 9. What is supervised learning?
  • 10. Supervised learning • Supervised learning: discover patterns in the data that relate the data attributes to a target attribute (called the label). • These patterns are then used to predict the value of the target attribute in future data instances. • Classic supervised learning tasks • Classification => label prediction (for example, a Yes/No binary question) • Regression => measure prediction (a continuous numeric value)
  • 13. About the decision tree model
  • 14. Decision Tree Model • A decision tree is a prediction method that applies a sequence of conditions with YES/NO answers; when it predicts a category, this is called classification. • Because it resembles human thought processes, its results are easy to understand. • The parts that hold conditions are called nodes or internal nodes. The topmost node is called the root node. • The end nodes of a decision tree are called leaf nodes; each one represents a predicted category.
  • 16. Tree model is a kind of them 16
  • 17. About the tree based model • In general, you can make classification predictions or numerical predictions. • A more advanced tree based model is Random Forest, which combines multiple CART models. • The idea is to ensemble several weak learners to construct a stronger model. • Leo Breiman introduced the CART, Random Forest, and Bagging algorithms. https://en.wikipedia.org/wiki/Leo_Breiman
  • 18. About the tree based model • Classify samples by testing conditions on feature values at the internal nodes. • [Figure: decision tree that tests Temp < 15, Temp > 25, Humid < 40%, and Humid > 60% in turn; each Yes branch ends in an "Uncozy" leaf, and the final No branch ends in the "Cozy" leaf.] • Feature importance: Temp > Humid
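The Temp and Humid conditions on this slide can be read as a chain of yes/no tests. The sketch below is one possible reading, not the deck's own code: the function name `is_cozy`, the Celsius unit, and the exact branch order are assumptions taken from the order in which the conditions appear.

```python
def is_cozy(temp_c: float, humid_pct: float) -> bool:
    """Walk the tree from the root: each failed test ends in an 'uncozy' leaf."""
    if temp_c < 15:      # root node: too cold
        return False
    if temp_c > 25:      # internal node: too hot
        return False
    if humid_pct < 40:   # internal node: too dry
        return False
    if humid_pct > 60:   # internal node: too humid
        return False
    return True          # the only 'cozy' leaf


print(is_cozy(20, 50))  # inside both ranges
print(is_cozy(10, 50))  # fails the first temperature test
```

Temperature is tested before humidity, which matches the slide's note that Temp is the more important feature.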
  • 19. About the tree based model • Classification • [Figure: Temp vs. Humid % plane with boundaries at Temp = 15 and 25 and Humid = 40% and 60%; the regions are labeled Cozy and Non Cozy.]
  • 20. About the tree based model https://sharkyun.medium.com/decision-tree-%E6%B1%BA%E7%AD%96%E6%A8%B9-41597818c075 9
  • 21. About the tree based model • Each block has LIKE and DISLIKE samples • The probability of each block: • The first block: 5/6 • The second block: 8/12 • The third block: 3/10 • The fourth block: 1/4 • Quiz: 1. How do you select the conditions that capture the most LIKE samples?
  • 22. About the tree based model • In a regression tree, the leaf nodes contain numeric values, such as the: • Mean • Mode • Median
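As a small illustration of a regression leaf, the sketch below summarizes the training targets that fall into one leaf with the mean, median, and mode. The list `leaf_targets` is made-up data, and the helper name `leaf_value` is hypothetical.

```python
from statistics import mean, median, mode

# Made-up target values of the training samples that ended up in one leaf.
leaf_targets = [10, 12, 12, 15, 21]


def leaf_value(targets, how="mean"):
    """Return the numeric prediction stored in a leaf for the chosen summary."""
    summaries = {"mean": mean, "median": median, "mode": mode}
    return summaries[how](targets)


print(leaf_value(leaf_targets, "mean"))    # 14
print(leaf_value(leaf_targets, "median"))  # 12
print(leaf_value(leaf_targets, "mode"))    # 12
```

Every sample routed to this leaf receives the same number, which is why a regression tree produces a step-shaped prediction.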
  • 23. About the tree based model 23 Quiz: 1. Can you explain the importance of features? https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/datasets/descr/diabetes.rst
  • 24. About the tree based model • Fast computation speed. • Minimal data engineering required: no need for data normalization, dummy variables, one-hot encoding, etc. • CART can handle both continuous and categorical features simultaneously. • Prone to the impact of sample imbalance in the dependent variable, so the data needs pre-processing in advance. • Over-sampling (Synthetic Minority Oversampling Technique, SMOTE) • Under-sampling • High interpretability, suitable for visual analysis, and easy to extract rules from. • The model's results are conditions, so they can be applied directly later using SQL syntax.
  • 25. The 3 types of Tree based models • ID3 algorithm • Chooses the split with the highest information gain • C4.5 algorithm • Chooses the split with the highest information gain ratio • CART algorithm (Classification and Regression Tree) • Chooses the split with the lowest Gini impurity
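To make the CART criterion concrete, here is a minimal sketch of Gini impurity and of comparing two candidate splits by their size-weighted impurity. The label lists are made-up, and the helper names `gini` and `weighted_gini` are illustrative only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(groups):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


# Two hypothetical ways to split the same 8 samples.
split_a = [["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]]
split_b = [["yes", "no", "yes", "no"], ["yes", "no", "yes", "no"]]

print(weighted_gini(split_a))  # 0.375
print(weighted_gini(split_b))  # 0.5
```

CART would prefer `split_a` here because its weighted impurity is lower, meaning its children are purer.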
  • 26. The 3 types of Tree based models 26 https://github.com/richzw/MachineLearningTips/blob/master/DecisionTree.md
  • 27. Node split criteria • Discussing the criteria for splitting nodes is mainly to explore the feature importance. • Why do we need the importance of features? 27
  • 28. Node split criteria • In 1948, Shannon published "A Mathematical Theory of Communication", laying the foundation for modern information theory. He introduced the concept of information entropy, which solved the problem of quantifying information. • The greater the uncertainty of a problem, the more information is needed to understand it, which means a higher information entropy. • $H(X) = -\sum_{x} P(x) \log_2 P(x)$, where $P(x)$ is the probability of event $x$ occurring. • [Photo: Claude Shannon]
  • 29. Node split criteria • A box contains 5 white balls and 5 red balls. If you randomly pick one ball, what is its color? How much information does this question carry? • The probabilities of getting a white ball and of getting a red ball are both 1/2. Plugging them into the information entropy formula gives $H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$ bit.
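The ball example can be checked numerically. The sketch below assumes Shannon entropy with a base-2 logarithm (bits); the function name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


box = ["white"] * 5 + ["red"] * 5
print(entropy(box))  # 1.0
```

With two equally likely colors the question carries exactly 1 bit; a box with a single color would carry 0 bits because the answer is already certain.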
  • 30. Node split criteria • When building a decision tree, how do you prioritize which feature to choose for splitting? • Examine all features, calculate the change in information entropy before and after splitting the dataset for each feature. • Finally, choose the feature that results in the largest change in information entropy as the primary basis for splitting nodes. 30
  • 31. Node split criteria • Which feature should be split on first: Quality or Quantity?
      Ball color       Red    White  Black  Blue
      Quality          1      1      1      2
      Quantity         1      2      1      2
      Number of balls  2      2      4      8
      Probability      2/16   2/16   4/16   8/16
  • 32. Node split criteria • [Figure: two candidate splits of the 16 balls; the entropy before splitting is 1.75 in both cases. Split on Quantity = 1: No → White*2, Blue*8 (entropy = 0.721928); Yes → Red*2, Black*4 (entropy = 0.918296). Split on Quality = 1: No → Blue*8 (entropy = 0); Yes → Red*2, Black*4, White*2 (entropy = 1.5).]
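The entropies on this slide can be turned into an information gain for each candidate split. This is a sketch under one assumption: the Quality = 1 branch holds the red, white, and black balls, as the ball table on the previous slide implies. The helper names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    remainder = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - remainder


balls = ["red"] * 2 + ["white"] * 2 + ["black"] * 4 + ["blue"] * 8

# Split on Quantity = 1: Yes -> red, black; No -> white, blue.
quantity_split = [["red"] * 2 + ["black"] * 4, ["white"] * 2 + ["blue"] * 8]
# Split on Quality = 1: Yes -> red, white, black; No -> blue.
quality_split = [["red"] * 2 + ["white"] * 2 + ["black"] * 4, ["blue"] * 8]

gain_quantity = information_gain(balls, quantity_split)
gain_quality = information_gain(balls, quality_split)
print(round(gain_quantity, 6), round(gain_quality, 6))
```

Quality yields the larger gain (1.0 bit versus about 0.954 bits), so under the information gain criterion it would be chosen as the first split.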
  • 33. Quiz: How do you select the feature to split the node?
  • 34. Node split criteria • When information gain is the splitting criterion, it tends to favor the features with the most distinct values. • For example: using a personal ID as a feature might result in each leaf node having only one sample. • Solution: • Merge feature values based on business experience to reduce the number of distinct values. • Use another criterion, such as the information gain ratio (C4.5).
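The bias toward many-valued features can be demonstrated on toy data. In the sketch below, `personal_id` is unique per sample and `segment` is a made-up two-valued feature; both the data and the helper names are assumptions for illustration. The gain ratio divides the information gain by the entropy of the feature itself (the split information), as in C4.5.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_and_ratio(feature, labels):
    """Return (information gain, gain ratio) for splitting `labels` by `feature`."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature)  # penalizes features with many values
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


labels = [1, 1, 1, 1, 0, 0, 0, 0]
personal_id = list(range(8))                         # one value per sample
segment = ["a", "a", "a", "a", "a", "b", "b", "b"]   # two values, one mistake

id_gain, id_ratio = gain_and_ratio(personal_id, labels)
seg_gain, seg_ratio = gain_and_ratio(segment, labels)
print(id_gain, round(id_ratio, 4))
print(round(seg_gain, 4), round(seg_ratio, 4))
```

Information gain ranks `personal_id` first, although it cannot generalize to new samples, while the gain ratio ranks `segment` first.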
  • 35. Node split criteria • Tree widget in Orange 3 (hyper-parameters) • https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/model/tree.html
  • 36. Node split criteria • Prioritize the hyper-parameter that limits the maximum tree depth
  • 38. About Bias and Variance • Think of marksmanship training • Hitting the target • Low bias, low variance • Concentrated but not accurate • Low variance, high bias
  • 39. Lower the total error • Overfitting typically comes with high variance
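For squared-error loss, the total error mentioned here has a standard decomposition. The notation is an assumption, since the slide does not define symbols: $\hat{f}(x)$ is the model's prediction, $f(x)$ the true function, and $\sigma^2$ the variance of the irreducible noise.

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \sigma^2
```

A very deep tree tends to shrink the bias term while inflating the variance term, which is one reason for limiting the maximum tree depth.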
  • 40. What is overfitting? • Reasons for overfitting, and remedies: 1. The dataset has noise 2. The hyper-parameters make the model too complex 3. Early stopping 4. The dataset is too small 5. Feature reduction 6. Normalization 7. Adjust the hyper-parameters 8. Change to another algorithm
  • 41. What is underfitting? • The model is too simple • Increase the number of iterations until convergence • Adjust the hyper-parameters • Add more features to the dataset • Change to a more complex model
  • 42. Supervised learning evaluation • Confusion matrix (for classification) 42
  • 43. Supervised learning evaluation • True Positive Rate (TPR) = TP / (TP + FN) • False Positive Rate (FPR) = FP / (FP + TN)
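The two rates follow directly from the confusion matrix counts. The sketch below uses made-up labels and predictions; the function names are illustrative and not taken from Orange 3.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def tpr_fpr(tp, fp, tn, fn):
    """TPR = TP / (TP + FN), FPR = FP / (FP + TN)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)            # (3, 1, 5, 1)
print(tpr_fpr(*counts))  # TPR 0.75, FPR about 0.167
```

An ROC curve plots these two rates against each other as the decision threshold of the classifier varies.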
  • 44. Tree based models and prediction 44
  • 45. Tree based model (it uses the ID3 algorithm) • Orange 3 has its own re-defined algorithm for the tree based model. • You must check the feature values of your dataset carefully! • Why? • In Orange 3, you can use both categorical and numeric inputs. • Workflow: 04.ows
  • 46. Tree based model • Rank: feature selection • Data Sampler: splitting the dataset • ROC Analysis: model evaluation • Confusion Matrix: model evaluation • The AUC on the testing dataset should be below the AUC on the training dataset
  • 47. Tree based model
      -- Tree rules rewritten as a SQL CASE expression; branches are checked in order.
      SELECT CASE
          WHEN "petal.length" <= 1.9 THEN 'Iris-setosa'
          WHEN "petal.length" > 1.9 AND "petal.width" > 1.7 THEN 'Iris-virginica'
          WHEN "petal.length" > 1.9 AND "petal.width" <= 1.7 AND "petal.length" <= 4.9 THEN 'Iris-versicolor'
          WHEN "petal.length" > 1.9 AND "petal.width" <= 1.7 AND "petal.length" > 4.9 AND "petal.width" <= 1.5 THEN 'Iris-virginica'
          WHEN "petal.length" > 1.9 AND "petal.width" <= 1.7 AND "petal.length" > 4.9 AND "petal.width" > 1.5 THEN 'Iris-versicolor'
      END AS predicted_species
      FROM Iris;
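The same extracted rules can be expressed as a plain function, which makes them easy to test outside a database. This is a sketch: the function name `classify_iris` is hypothetical, the thresholds are the ones listed on the slide, and the class names use the standard spellings Iris-setosa, Iris-versicolor, and Iris-virginica.

```python
def classify_iris(petal_length: float, petal_width: float) -> str:
    """Apply the slide's tree rules top-down; the first matching branch wins."""
    if petal_length <= 1.9:
        return "Iris-setosa"
    if petal_width > 1.7:
        return "Iris-virginica"
    if petal_length <= 4.9:
        return "Iris-versicolor"
    if petal_width <= 1.5:
        return "Iris-virginica"
    return "Iris-versicolor"


print(classify_iris(1.4, 0.2))  # Iris-setosa
print(classify_iris(4.5, 1.5))  # Iris-versicolor
print(classify_iris(6.0, 2.5))  # Iris-virginica
```

Because each branch is reached only when the earlier tests have failed, the repeated conditions in the SQL form are not needed here.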
  • 48. Tree based model • Feature selection • Quiz: Can you observe the feature importance from the tree layout? • Yes, petal.length is the most significant feature!
  • 49. Random Forest algorithm • The Random Forest algorithm's widespread popularity stems from its user-friendly nature and adaptability, which let it tackle both classification and regression problems effectively. • Its strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning.
  • 50. Bagging and bootstrap • Randomly sample multiple subsets of the dataset (rows and features) and build a CART model on each subset. • Each CART model must have an accuracy of over 50%. • Finally, combine the outputs of these CART models to produce the ensemble prediction. • Voting • https://gaussian37.github.io/ml-concept-bagging/
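The bagging steps can be sketched in a few lines. To stay short, this example makes several simplifying assumptions: a one-feature threshold rule (a decision stump) stands in for a full CART model, only rows are resampled (no feature sampling), and the data are made-up. All function names are illustrative.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Find the threshold t minimizing errors for the rule 'predict 1 if x > t'."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in rows}):
        err = sum((x > t) != bool(y) for x, y in rows)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(rows, n_models, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def bagging_predict(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Made-up data: label 1 when x > 5, otherwise 0.
data = [(x, int(x > 5)) for x in range(1, 11)]
models = bagging_fit(data, n_models=25)
print(bagging_predict(models, 2), bagging_predict(models, 9))
```

Using an odd number of models avoids ties in a two-class vote. A Random Forest additionally samples a subset of features when building each tree.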
  • 51. Random Forest algorithm 51 In general, Random Forest tends to perform better.
  • 52. Stacking • Use the training set to build the base models. • Stack all the prediction outputs from the testing inputs. • Add the testing Y to the stacking dataset, and build a meta model. • Get the stacking results. • It supports classifiers or regressors. • It normally works well on complex datasets.
  • 53. Stacking • Download the Bank Marketing dataset
  • 54. Stacking • This workflow is built only from the training dataset; there is no testing dataset! • Quiz: Do you know how to split the dataset and rebuild the workflow?
  • 55. Stacking • Overall, stacking produces better results!
  • 56. Homework • Use healthcare_dataset.csv to build a classification model and predict the Test Results. • Follow steps such as: • Exploratory Data Analysis and pre-processing (missing values, data cleaning…) • Feature selection • Build multiple models and keep the one with the best accuracy. • Submit your training/testing model evaluation • Copy the steps you used into a PPT and clearly present your observations with text or illustrations.