2. About me
• Education
• NCU (MIS), NCCU (CS)
• Work Experience
• Telecom big data Innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data/ ML/ AIOT/ AI Columnist
7. Supervised learning vs. Unsupervised learning
• Supervised learning: discover patterns in the data that relate data
attributes to a target (class-labeled) attribute.
• These patterns are then utilized to predict the values of the target attribute in
future data instances.
• Unsupervised learning: The data have no target attribute.
• We want to explore the data to find some intrinsic structures in them.
• Classic supervised learning tasks:
• Classification
• Regression
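The contrast above can be sketched in a few lines of scikit-learn. This is an illustrative example on synthetic data, not part of the course material: the same points are given to a supervised learner (which sees the labels) and an unsupervised one (which must discover structure on its own).

```python
# Minimal sketch: supervised vs. unsupervised learning on the same data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: the target attribute y (class labels) guides training.
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))  # predicted class labels

# Unsupervised: only X is given; the algorithm finds intrinsic structure.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])      # discovered cluster assignments
```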
9. What is Classification in Supervised Learning?
• Classification is a supervised learning task where an algorithm is
trained to assign input data to discrete categories.
• During training, algorithms are given training input data with a class
label. For example, training data might consist of the last credit card
bills of a set of customers, labeled with whether they made a future
purchase or not.
• When a new customer’s credit balance is presented to the algorithm,
it classifies the customer into either the “will purchase” or the
“will not purchase” group.
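The credit-card example above can be sketched as follows. The data here is entirely hypothetical (bill amounts in thousands of dollars, with a toy labeling rule standing in for real purchase history), just to show the train-then-classify pattern:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical training data: last credit-card bill (in $1000s),
# labeled with whether the customer later made a purchase (1) or not (0).
bills = rng.uniform(0, 5, size=(200, 1))
made_purchase = (bills[:, 0] > 2.5).astype(int)  # toy labeling rule

clf = LogisticRegression().fit(bills, made_purchase)

# A new customer's balance is assigned to one of the two discrete groups.
print(clf.predict([[4.0], [0.5]]))
```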
10. What is Regression in Supervised Learning?
• Regression is a supervised learning method where an algorithm is
trained to predict an output from a continuous range of possible
values. For example, real estate training data would take note of the
location, area, and other relevant parameters. The output is the price
of the specific real estate.
• In regression, an algorithm needs to identify a functional relationship
between the input parameters and the output.
• The output value is not discrete as in classification; instead, it is
drawn from a continuous range of values.
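The real-estate example above can be sketched with an ordinary least-squares fit. The features and price rule below are hypothetical, chosen only to show a functional relationship between input parameters and a continuous output:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical real-estate data: [area in m^2, distance to centre in km].
X = np.column_stack([rng.uniform(30, 200, 100), rng.uniform(1, 30, 100)])
# Toy price rule: price rises with area, falls with distance (plus noise).
y = 3000 * X[:, 0] - 5000 * X[:, 1] + rng.normal(0, 10000, 100)

reg = LinearRegression().fit(X, y)
# The output is a continuous price estimate, not a discrete class.
print(reg.predict([[120.0, 5.0]]))
```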
11. Real-life Applications of Classification
• Binary classification (the most widely used in industry)
• Spam detection
• Churn prediction
• Conversion prediction
• Imbalanced Classification
• Fraud detection: In the labeled data set used for training, only a small number of
inputs are labeled as fraud.
• Medical diagnostics: In a large pool of samples, those with a positive case of a disease
may be far fewer.
• Multi-class Classification
• Face classification: Based on the training data, a model categorizes a photo and maps
it to a specific person.
• Email classification: Multi-class classification is used to segregate emails into various
categories – social, education, work, and family.
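For the imbalanced case (fraud detection, medical diagnostics), a common remedy is to reweight the rare class during training. A minimal sketch on synthetic data, assuming scikit-learn's `class_weight="balanced"` option; the ~2% positive rate is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical imbalanced data: roughly 2% of rows labeled as fraud.
X, y = make_classification(n_samples=2000, weights=[0.98], flip_y=0,
                           random_state=0)

# class_weight="balanced" upweights the rare positive class so the
# model does not simply predict "not fraud" for everything.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(recall_score(y, clf.predict(X)))  # recall on the minority class
```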
12. Real-life Applications of Regression
• Linear regression
• It can be used to predict values within a continuous
range (e.g. sales, price forecasting); its classification
counterpart, logistic regression, assigns inputs to
discrete categories (e.g. cat vs. dog).
• Polynomial regression
• It is used for more complex data sets whose
input-output relationship a straight line cannot fit
well; the algorithm fits a higher-degree polynomial
to the labeled training data.
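The linear vs. polynomial distinction can be sketched by fitting both to the same curved data. The quadratic data below is synthetic and only illustrative; the point is that the degree-2 pipeline recovers the curve a straight line misses:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(0, 0.1, 100)  # curved data

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)

# R^2 scores: the quadratic fit captures the curvature the line misses.
print(linear.score(X, y), poly.score(X, y))
```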
15. Image Classification using Logistic Regression
• Embedder:
• Inception V3: Google’s Inception v3 model trained on ImageNet.
• SqueezeNet: Deep model for image recognition that achieves AlexNet-level
accuracy on ImageNet with 50x fewer parameters.
• VGG-16: 16-layer image recognition model trained on ImageNet.
• VGG-19: 19-layer image recognition model trained on ImageNet.
• Painter: A model trained to predict painters from artwork images.
• DeepLoc: A model trained to analyze yeast cell images.
• OpenFace: Face recognition model trained on the FaceScrub and CASIA-WebFace
datasets.
• http://vintage.winklerbros.net/facescrub.html
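The embed-then-classify pipeline behind this slide can be sketched as follows. Here random vectors stand in for the feature vectors a pretrained embedder such as Inception v3 would produce (the 2048-dimension size matches its penultimate layer); in practice each image would be run through the network and its activations used as features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for embedder output: 2048-dim vectors per image.
# The mean shift between the two groups is purely illustrative.
emb_class_a = rng.normal(0.0, 1.0, (50, 2048))
emb_class_b = rng.normal(0.5, 1.0, (50, 2048))
X = np.vstack([emb_class_a, emb_class_b])
y = np.array([0] * 50 + [1] * 50)  # hypothetical image labels

# Logistic regression on top of the fixed embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))
```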
16. ImageNet/ Inception V3
• ImageNet is an image database. The images in the database are
organized into a hierarchy, with each node of the hierarchy depicted
by hundreds and thousands of images.
• Training data sample size: 1 million
• Validation data sample size: 50,000
• Number of classes: 1,000
19. Outlier Detection
• Many applications require being
able to decide whether a new
observation belongs to the same
distribution as existing
observations (it is an inlier), or
should be considered as different
(it is an outlier). Often, this ability
is used to clean real data sets.
Ref: https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection
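The scikit-learn page referenced above describes several detectors; a minimal sketch using `IsolationForest` on synthetic data (the 5% contamination rate is an assumed parameter, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, (200, 2))    # the "existing" distribution
outliers = rng.uniform(-8, 8, (10, 2))  # observations far from it
X = np.vstack([inliers, outliers])

# IsolationForest flags observations that are easy to isolate:
# predict() returns +1 for inliers and -1 for outliers.
det = IsolationForest(contamination=0.05, random_state=0).fit(X)
labels = det.predict(X)
print((labels == -1).sum())  # number of points flagged as outliers
```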
26. Open model_build_on_preduction.ows
• Main concepts:
• Data exploration
• Feature Statistics
• Rank
• Data Preprocess
• Preprocess
• Data Split
• Data Sampler
• Model
• Tree/ Tree Viewer
• Save Model/ Load Model
• Test and Score
• Confusion Matrix
• Prediction
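The Orange widgets above have rough scikit-learn equivalents. A sketch of the split / tree / test-and-score / confusion-matrix flow, using the built-in breast cancer dataset as a stand-in for the workflow's own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Data Sampler: hold out 30% of rows for testing.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# Model (Tree): a shallow decision tree, depth chosen for illustration.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Test and Score + Confusion Matrix on the held-out data.
pred = tree.predict(X_te)
print(accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))
```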
27. Homework
• Please attempt to apply the below-mentioned models to your own
binary dataset, and endeavor to identify the model with the optimal
performances as well as the most significant variables. (in next page)
• Furthermore, it is advised to elucidate the underlying factors of model;
you should include those subsections as below:
• Introduction to your dataset
• Data exploration
• Model evaluation
• Conclusion
• Use PPT slides with text and illustrations to present your observations.