Abstract: This PDSG workshop introduces basic concepts of machine learning. The course covers the fundamentals of Supervised and Unsupervised Learning, Decision Trees, Pruning, Ensemble Trees, Linear Regression, Loss Functions, K-Means, and dataset preparation.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
5. Ladder (Complexity == Salary)
[Diagram: a ladder of application domains ordered by complexity: Marketing / Sales, Text Classification, Human Interface, Computer Vision, Factory Automation, 3D Printing, Autonomous, Space. Market maturity runs from Mature Growth through Emergent up to Frontier, and pay rises from Very Good $$ through Exceptional $$ to Stratosphere $$.]
6. It’s About Training
Machine Learning is about using data to train a model:
• Split the dataset into training data and test data.
• Use the training data to train the model, producing the model.
• Use the test data to test the model and determine its accuracy.
A minimal sketch of this split-train-test workflow follows.
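Here is a minimal Python sketch of the split step, assuming scikit-learn is available (slide 15 shows the full workflow with real library calls); the samples are hypothetical stand-ins for any dataset:

from sklearn.model_selection import train_test_split

# Ten hypothetical samples: two feature values each, plus a matching label
X = [[i, i * 2] for i in range(10)]
y = ["a", "a", "b", "a", "b", "b", "a", "b", "a", "b"]

# Split the dataset into training data (80%) and test data (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)
print(len(X_train), "training samples,", len(X_test), "test samples")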
7. It’s in the Label
Supervised versus Unsupervised Learning
Labeled Data: Features + Label, where the label is what the item (row) is, e.g., apple.
• A human (or a program) pre-labels the data.
• Learn how features map to labels.
Unlabeled Data: Features only, e.g., we do not know it's an apple.
• Learn how features map to clusters.
• Learn how clusters map to labels.
8. Supervised Learning
Feature 1    Feature 2    Feature 3    Feature 4            Label
real-value   real-value   real-value   categorical-value    category/value
real-value   real-value   real-value   categorical-value    category/value
real-value   real-value   real-value   categorical-value    category/value
(Attributes of each sample)                                  (What the sample is)

Example:
Weight (oz)   Width (in)   Height (in)   Color    Label
6             3            3.5           green    apple
2             1.5          8             yellow   banana
7             4            4.7           yellow   apple
(Features range from little significance to greater significance.)
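Such a labeled table is commonly held as a pandas DataFrame, with the feature columns kept apart from the label column; a minimal sketch (the column names mirror the example above):

import pandas as pd

# The example table above as a DataFrame
df = pd.DataFrame({
    "weight_oz": [6, 2, 7],
    "width_in": [3, 1.5, 4],
    "height_in": [3.5, 8, 4.7],
    "color": ["green", "yellow", "yellow"],
    "label": ["apple", "banana", "apple"],
})

X = df.drop(columns="label")  # features: the attributes of each sample
y = df["label"]               # label: what each sample is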
9. Decision Tree
Simple Training Model
[Diagram: a decision tree. The 1st feature (color) splits on green vs. yellow; the 2nd feature (weight) learns thresholds such as < 4 / >= 4 and < 3.5 / >= 3.5 (e.g., yellow apples weigh more than green apples); the 3rd and 4th features (width, height) split further. Leaves are the classification: banana or apple.]
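With scikit-learn, training such a tree and inspecting its learned thresholds takes a few lines; a sketch on the fruit example (color is a two-valued category here, so a single 0/1 dummy stands in for it):

from sklearn.tree import DecisionTreeClassifier, export_text

# Samples: weight (oz), width (in), height (in), color (0 = green, 1 = yellow)
X = [[6, 3, 3.5, 0], [2, 1.5, 8, 1], [7, 4, 4.7, 1]]
y = ["apple", "banana", "apple"]

tree = DecisionTreeClassifier().fit(X, y)

# Print the learned splits and thresholds; the leaves are the classification
print(export_text(tree, feature_names=["weight", "width", "height", "color"]))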
10. Pruning
Weight (oz)   Width (in)   Height (in)   Color    Label
6             3            3.5           green    apple
2             1.5          8             yellow   banana
7             4            4.7           yellow   apple
Assume Color does not contribute to the outcome, so drop it:
Weight (oz)   Width (in)   Height (in)   Label
6             3            3.5           apple
2             1.5          8             banana
7             4            4.7           apple
11. Decision Tree After Pruning
Simple Training Model
[Diagram: the pruned decision tree. The 1st feature (weight) splits on < 4 / >= 4; the 2nd feature (width) splits on < 2 / >= 2 and < 2.5 / >= 2.5; the 3rd feature (height) splits on > 3 / <= 3. Leaves are the classification: banana or apple.]
13. (Simple) Linear Regression
It’s In The Line
[Plot: Spend (y), the label to learn, against Age (x), the feature data; the data is plotted as a scatter and a best-fitted line is drawn through it.]
y = a + bx, where a is the intercept and b is the slope.
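Fitting that line by least squares is one call in NumPy; a sketch with made-up age/spend numbers:

import numpy as np

# Hypothetical data: age (feature x) and spend (label y)
age = np.array([18, 25, 33, 41, 52, 60])
spend = np.array([120, 180, 260, 310, 400, 455])

# Fit y = a + bx; polyfit returns the coefficients highest power first: [b, a]
b, a = np.polyfit(age, spend, deg=1)
print(f"y = {a:.1f} + {b:.1f}x")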
14. Loss Function
Minimize Loss (Estimated Error) when Fitting a Line
[Plot: actual values y1..y6 scattered around the fitted line; the gap (y – yhat) between each actual value (y) and the line's predicted value (yhat) is the error.]
Mean Square Error: sum the square of the differences, then divide by the number of samples:
MSE = (1/n) Σ (y – yhat)², summed over the n samples (j = 1..n).
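Computed directly in NumPy, the loss is just the mean of the squared differences; a sketch with hypothetical actual and predicted values:

import numpy as np

y = np.array([3.0, 5.0, 7.5, 9.0])      # actual values
yhat = np.array([2.8, 5.4, 7.0, 9.3])   # predicted values

# Sum the square of the differences, divide by the number of samples
mse = np.mean((y - yhat) ** 2)
print(mse)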
15. Libraries Do the Work
Python & R have libraries that do the math!
e.g., numPy, sci-learn
Split the
Dataset X_train, X_test = split( dataset, 0.80 )
Train the
Model
dataset percentage (e.g., 80% train, 20% test)training & test data
model = train( X_train, 4 )
training
datamethod
trained
model
Test the
Model Y_test = model( X_test, 4 )
column of
label
predicted
values
trained
model
test
data
column of
label
Calculate
Accuracy
result = accuracy( X_test, 4, Y_test )
actual value
predicted
values
Pseudo Names
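With real scikit-learn calls, the same workflow looks roughly like this (a sketch, assuming a pandas DataFrame whose 'label' column holds the label and hypothetical fruit data):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical dataset; the 'label' column is the label
df = pd.DataFrame({
    "weight": [6, 2, 7, 5, 2.5, 6.5],
    "width": [3, 1.5, 4, 2.8, 1.6, 3.4],
    "height": [3.5, 8, 4.7, 3.2, 7.5, 4.0],
    "label": ["apple", "banana", "apple", "apple", "banana", "apple"],
})
X, y = df.drop(columns="label"), df["label"]

# Split the dataset: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)

# Train the model
model = DecisionTreeClassifier().fit(X_train, y_train)

# Test the model: predicted values for the test data
y_pred = model.predict(X_test)

# Calculate accuracy: actual values vs. predicted values
print(accuracy_score(y_test, y_pred))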
16. Unsupervised Learning
Feature 1    Feature 2    Feature 3    Feature 4
real-value   real-value   real-value   categorical-value
real-value   real-value   real-value   categorical-value
real-value   real-value   real-value   categorical-value
(Attributes of each sample)

Example:
Weight (oz)   Width (in)   Height (in)   Color
6             3            3.5           green
2             1.5          8             yellow
7             4            4.7           yellow
There is NO label – we don’t know what each sample is in the training set!
17. Clusters
It’s In The Cluster
[Plot: Weight (x2) against Height (x1) as a scatter; the data separates into two clusters, e.g., one for Apple and one for Banana.]
Find a relationship between the data points that separates them into clusters.
18. K-Means
[Plot: Weight (x2) against Height (x1) with a randomly placed cluster centroid for each cluster.]
• Pick the number of clusters (e.g., 2, for Apple and Banana).
• Place a point (cluster centroid) randomly for each cluster.
• Assign each sample to a cluster based on the closest cluster centroid (calculate the distance to each centroid).
19. Recalculate Centroids
[Plot: the same scatter; each previous cluster centroid moves to its newly calculated location.]
• Calculate the centroid (center) of each cluster.
• Move each centroid to its newly calculated location.
• Assign each sample to a cluster based on the closest (new) cluster centroid (recalculate the distances).
REPEAT these steps until the centroids do not move anymore; a minimal implementation is sketched below.
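Those steps translate almost line for line into NumPy; a minimal sketch assuming two features per sample and k = 2, ignoring edge cases such as an empty cluster (scikit-learn's KMeans does the same job in practice):

import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    # Place a cluster centroid "randomly" for each cluster: here, k random samples
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Assign each sample to the cluster with the closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recalculate each centroid as the center (mean) of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids do not move anymore
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids

# Hypothetical (height, weight) samples forming two clusters
X = np.array([[3.5, 6], [4.7, 7], [3.2, 5], [8, 2], [7.5, 2.5], [8.2, 1.8]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)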
20. Preparing a Dataset – Clean
[Pipeline: Dataset → Clean]
• Fix/Remove unreadable entries
  • e.g., bad (funny) characters from different character codesets
• Fix/Remove misaligned entries
  • e.g., an incorrect number of fields for a row in a CSV file
• Replace blank fields (i.e., synthesize a value), as in the sketch below
  • e.g., the mean value of all non-blank values
  • e.g., use the rows that have values as a training set to learn the missing value
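Replacing blank fields with the column mean is the common first pass; in pandas it looks roughly like this (a sketch with one hypothetical column):

import pandas as pd
import numpy as np

df = pd.DataFrame({"weight_oz": [6, np.nan, 7, 5, np.nan]})

# Synthesize a value for each blank field: the mean of all non-blank values
df["weight_oz"] = df["weight_oz"].fillna(df["weight_oz"].mean())
print(df)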
21. Preparing a Dataset – Conversion
[Pipeline: Clean Dataset → Categorical Value Conversion]
• Change categorical values into real values.
• Cannot use enumeration (the numeric values would imply an ordering/importance!).
• Expand into dummy variables, one per category, using 0 and 1 as values, as sketched below:

Fruit        Apple   Banana   Pear
Apple        1       0        0
Banana       0       1        0
Pear         0       0        1
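In pandas this expansion is a single call to get_dummies; a sketch on the fruit column above:

import pandas as pd

df = pd.DataFrame({"fruit": ["Apple", "Banana", "Pear"]})

# Expand the category into dummy variables, one 0/1 column per category
dummies = pd.get_dummies(df["fruit"], dtype=int)
print(dummies)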
22. Preparing a Dataset – Feature Scaling
[Pipeline: Clean Dataset → Categorical Value Conversion → Feature Scaling]
• Scale values to be within the same proportional range.
• A column with a much larger range will over-influence learning compared to a column with a smaller range.
• Typically, scale the range to between 0 and 1 (normalization) or to zero mean and unit variance (standardization, which puts most values roughly between -1 and 1).

Normalization: X' = (x − min(x)) / (max(x) − min(x)), mapping each original value x to a new value X'.
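Min-max normalization in code (a sketch; scikit-learn's MinMaxScaler wraps the same formula):

import numpy as np

x = np.array([6.0, 2.0, 7.0, 5.0])   # original values, e.g., weight (oz)

# X' = (x - min(x)) / (max(x) - min(x)) scales every value into [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)   # [0.8 0.  1.  0.6]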