4. 4
An example application
An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc.) of newly
admitted patients.
A decision is needed: whether to put a new patient
in an intensive-care unit.
Due to the high cost of ICU, those patients who
may survive less than a month are given higher
priority.
Problem: to predict high-risk patients and
discriminate them from low-risk patients.
5. 5
Another application
A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant,
age
Marital status
annual salary
outstanding debts
credit rating
etc.
Problem: to decide whether an application should
be approved, or to classify applications into two
categories, approved and not approved.
6. 6
Machine learning and our focus
Like human learning from past experiences.
A computer does not have “experiences”.
A computer system learns from data, which
represent some “past experiences” of an
application domain.
Our focus: learn a target function that can be used
to predict the values of a discrete class attribute,
e.g., approved or not-approved, and high-risk or low-risk.
The task is commonly called: Supervised learning,
classification, or inductive learning.
7. 7
Data: A set of data records (also called
examples, instances or cases) described by
k attributes: A1, A2, … Ak.
a class: Each example is labelled with a pre-
defined class.
Goal: To learn a classification model from the
data that can be used to predict the classes
of new (future, or test) cases/instances.
The data and the goal
9. 9
An example: the learning task
Learn a classification model from the data
Use the model to classify future loan applications
into
Yes (approved) and
No (not approved)
What is the class for following case/instance?
10. 10
Supervised vs. unsupervised Learning
Supervised learning: classification is seen as
supervised learning from examples.
Supervision: The data (observations,
measurements, etc.) are labeled with pre-defined
classes. It is as if a “teacher” gives the classes
(supervision).
Test data are classified into these classes too.
Unsupervised learning (clustering)
Class labels of the data are unknown
Given a set of data, the task is to establish the
existence of classes or clusters in the data
11. 11
Supervised learning process: two steps
Learning (training): Learn a model using the
training data
Testing: Test the model using unseen test data
to assess the model accuracy
Accuracy = Number of correct classifications / Total number of test cases
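As an illustration (not part of the original slides), the accuracy measure above can be computed directly from predicted and true labels; the snippet below is a minimal Python sketch with invented labels.

```python
def accuracy(predicted, actual):
    """Accuracy = number of correct classifications / total number of test cases."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Invented example labels for illustration only.
predicted = ["Yes", "Yes", "No", "Yes", "No"]
actual    = ["Yes", "No",  "No", "Yes", "Yes"]
print(accuracy(predicted, actual))  # 0.6
```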
12. 12
What do we mean by learning?
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to
perform the task T if, after learning, the
system's performance on T improves as
measured by M.
In other words, the learned model helps the
system to perform T better as compared to
no learning.
13. 13
An example
Data: Loan application data
Task: Predict whether a loan should be
approved or not.
Performance measure: accuracy.
No learning: classify all future applications (test
data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
We can do better than 60% with learning.
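A minimal sketch (not from the slides, using invented labels that match the 9 Yes / 6 No split above) of the no-learning baseline: always predict the majority class.

```python
from collections import Counter

# Hypothetical loan data: 9 approved ("Yes") and 6 rejected ("No") examples.
labels = ["Yes"] * 9 + ["No"] * 6

majority_class, count = Counter(labels).most_common(1)[0]
baseline_accuracy = count / len(labels)
print(majority_class, baseline_accuracy)  # Yes 0.6
```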
14. 14
Fundamental assumption of learning
Assumption: The distribution of training
examples is identical to the distribution of test
examples (including future unseen examples).
In practice, this assumption is often violated
to a certain degree.
Strong violations will clearly result in poor
classification accuracy.
To achieve good accuracy on the test data,
training examples must be sufficiently
representative of the test data.
16. In Machine Learning,
Linear Regression is a supervised machine
learning algorithm.
It tries to find out the best linear relationship
that describes the data you have.
It assumes that there exists a linear
relationship between a dependent variable
and independent variable(s).
The value of the dependent variable of a
linear regression model is a continuous value
i.e. real numbers.
16
18. Representing Linear Regression Model-
Linear regression model represents the linear
relationship between a dependent variable
and independent variable(s) via a sloped
straight line.
The sloped straight line representing the
linear relationship that fits the given data best
is called the regression line.
It is also called the best-fit line.
18
19. Types of Linear Regression-
Based on the number of independent
variables, there are two types of linear
regression-
Simple Linear Regression
Multiple Linear Regression
19
20. 1. Simple Linear Regression-
In simple linear regression, the dependent variable
depends only on a single independent variable.
For simple linear regression, the form of the
model is- Y = β0 + β1X
Here, Y is a dependent variable,
X is an independent variable.
β0 and β1 are the regression coefficients.
β0 is the intercept or the bias that fixes the offset
to a line.
β1 is the slope or weight that specifies the factor
by which X has an impact on Y.
20
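To make the model concrete, here is a minimal sketch (not part of the slides) that fits Y = β0 + β1X by ordinary least squares with NumPy; the toy data are invented for illustration.

```python
import numpy as np

# Invented toy data: X is the single independent variable, Y the dependent variable.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Closed-form least-squares estimates of the regression coefficients.
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)  # slope
beta0 = Y.mean() - beta1 * X.mean()                                            # intercept

print(f"Y = {beta0:.3f} + {beta1:.3f} * X")
```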
21. The following 3 cases are possible-
Case-01: β1 < 0
It indicates that variable X has negative impact
on Y.
If X increases, Y will decrease and vice-versa.
21
22. Case-02: β1 = 0
It indicates that variable X has no impact on
Y.
If X changes, there will be no change in Y.
22
23. Case-03: β1 > 0
It indicates that variable X has positive impact
on Y.
If X increases, Y will increase and vice-versa.
23
24. 2. Multiple Linear Regression-
In multiple linear regression, the dependent
variable depends on more than one independent
variable.
For multiple linear regression, the form of the
model is- Y = β0 + β1X1 + β2X2 + β3X3 + ……
+ βnXn
Here,
Y is a dependent variable.
X1, X2, …., Xn are independent variables.
β0, β1,…, βn are the regression coefficients.
βj (1<=j<=n) is the slope or weight that specifies
the factor by which Xj has an impact on Y. 24
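A minimal sketch (not from the slides) of fitting the multiple linear regression model Y = β0 + β1X1 + β2X2 by least squares; the toy data are invented, and a column of ones is added so the intercept β0 is estimated too.

```python
import numpy as np

# Invented toy data: two independent variables X1, X2 and one dependent variable Y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
Y = np.array([5.0, 4.5, 10.2, 9.8, 14.1])

# Design matrix with a leading column of ones for the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of Y = X_design @ [beta0, beta1, beta2].
beta, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
print("beta0, beta1, beta2 =", beta)
```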
26. 26
Bayesian classification
Probabilistic view: Supervised learning can naturally
be studied from a probabilistic point of view.
Let A1 through Ak be attributes with discrete values.
The class is C.
Given a test example d with observed attribute
values a1 through ak.
Classification is basically to compute the following
posterior probability. The prediction is the class cj
such that
Pr(C = cj | A1 = a1, …, A|A| = a|A|)
is maximal.
27. 27
Apply Bayes’ Rule
Pr(C = cj | A1 = a1, …, A|A| = a|A|)
  = Pr(A1 = a1, …, A|A| = a|A| | C = cj) Pr(C = cj) / Pr(A1 = a1, …, A|A| = a|A|)
  = Pr(A1 = a1, …, A|A| = a|A| | C = cj) Pr(C = cj) / Σ_{r=1}^{|C|} Pr(A1 = a1, …, A|A| = a|A| | C = cr) Pr(C = cr)
Pr(C = cj) is the class prior probability: easy to
estimate from the training data.
28. 28
Computing probabilities
The denominator P(A1=a1,...,Ak=ak) is
irrelevant for decision making since it is the
same for every class.
We only need Pr(A1=a1,...,Ak=ak | C=cj), which
can be written as
Pr(A1=a1|A2=a2,...,Ak=ak, C=cj)* Pr(A2=a2,...,Ak=ak |C=cj)
Recursively, the second factor above can be
written in the same way, and so on.
Now an assumption is needed.
29. 29
Conditional independence assumption
All attributes are conditionally independent
given the class C = cj.
Formally, we assume,
Pr(A1=a1 | A2=a2, ..., A|A|=a|A|, C=cj) = Pr(A1=a1 | C=cj)
and so on for A2 through A|A|. I.e.,
Pr(A1 = a1, …, A|A| = a|A| | C = cj) = ∏_{i=1}^{|A|} Pr(Ai = ai | C = cj)
30. 30
Final naïve Bayesian classifier
We are done!
Pr(C = cj | A1 = a1, …, A|A| = a|A|)
  = Pr(C = cj) ∏_{i=1}^{|A|} Pr(Ai = ai | C = cj) / Σ_{r=1}^{|C|} Pr(C = cr) ∏_{i=1}^{|A|} Pr(Ai = ai | C = cr)
How do we estimate Pr(Ai = ai | C = cj)? Easy!
31. 31
Classify a test instance
If we only need a decision on the most
probable class for the test instance, we only
need the numerator as its denominator is the
same for every class.
Thus, given a test example, we compute the
following to decide the most probable class
for the test instance
c = arg max_{cj} Pr(C = cj) ∏_{i=1}^{|A|} Pr(Ai = ai | C = cj)
33. 33
An Example (cont …)
For C = t, we have
Pr(C = t) ∏_{j=1}^{2} Pr(Aj = aj | C = t) = 1/2 × 2/5 × 2/5 = 2/25
For class C = f, we have
Pr(C = f) ∏_{j=1}^{2} Pr(Aj = aj | C = f) = 1/2 × 1/5 × 2/5 = 1/25
C = t is more probable. t is the final class.
34. 34
Additional issues
Numeric attributes: Naïve Bayesian learning
assumes that all attributes are categorical.
Numeric attributes need to be discretized.
Zero counts: A particular attribute value
never occurs together with a class in the
training set. We need smoothing.
Missing values: Ignored
The conditional probabilities are estimated from counts in the training data,
Pr(Ai = ai | C = cj) = nij / nj
where nij is the number of training examples with Ai = ai and class cj, and nj is the number of training examples with class cj; smoothing adds small counts to the numerator and denominator so that zero counts do not produce zero probabilities.
35. 35
On naïve Bayesian classifier
Advantages:
Easy to implement
Very efficient
Good results obtained in many applications
Disadvantages
Assumption: class conditional independence,
therefore loss of accuracy when the assumption
is seriously violated (those highly correlated
data sets)
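Not part of the original slides: a minimal sketch of the naïve Bayesian classifier just described. It estimates Pr(C = cj) and Pr(Ai = ai | C = cj) from counts, uses add-one (Laplace) smoothing for the zero-count problem, and predicts with arg max; the tiny loan-style dataset is invented for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Estimate class priors Pr(C=cj) and conditional counts for Pr(Ai=ai|C=cj)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}

    # value_counts[(i, c)][a] = number of examples of class c with attribute i equal to a
    value_counts = defaultdict(Counter)
    attr_values = defaultdict(set)
    for x, c in zip(examples, labels):
        for i, a in enumerate(x):
            value_counts[(i, c)][a] += 1
            attr_values[i].add(a)
    return priors, value_counts, attr_values, class_counts

def predict(x, model):
    priors, value_counts, attr_values, class_counts = model
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, a in enumerate(x):
            # Add-one (Laplace) smoothing handles unseen attribute/class combinations.
            num = value_counts[(i, c)][a] + 1
            den = class_counts[c] + len(attr_values[i])
            score *= num / den
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Invented toy loan data: (age-group, has-job) -> approved?
examples = [("young", "yes"), ("young", "no"), ("old", "yes"), ("old", "yes"), ("old", "no")]
labels = ["Yes", "No", "Yes", "Yes", "No"]
model = train_naive_bayes(examples, labels)
print(predict(("young", "yes"), model))  # prints the most probable class
```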
37. 37
Text classification/categorization
Due to the rapid growth of online documents in
organizations and on the Web, automated document
classification has become an important problem.
Techniques discussed previously can be applied to
text classification, but they are not as effective as
the next three methods.
We first study a naïve Bayesian method specifically
formulated for texts, which makes use of some text
specific features.
However, the ideas are similar to the preceding
method.
38. 38
Probabilistic framework
Generative model: Each document is
generated by a parametric distribution
governed by a set of hidden parameters.
The generative model makes two
assumptions
The data (or the text documents) are generated by
a mixture model,
There is one-to-one correspondence between
mixture components and document classes.
39. 39
Mixture model
A mixture model models the data with a
number of statistical distributions.
Intuitively, each distribution corresponds to a data
cluster and the parameters of the distribution
provide a description of the corresponding cluster.
Each distribution in a mixture model is also
called a mixture component.
The distribution/component can be of any
kind
40. 40
An example
The figure shows a plot of the probability
density function of a 1-dimensional data set
(with two classes) generated by
a mixture of two Gaussian distributions,
one per class, whose parameters (denoted by θi) are
the mean (μi) and the standard deviation (σi), i.e., θi = (μi, σi).
(Figure: the two Gaussian density curves, labelled class 1 and class 2.)
41. 41
Mixture model (cont …)
Let the number of mixture components (or
distributions) in a mixture model be K.
Let the jth distribution have the parameters θj.
Let Θ be the set of parameters of all
components, Θ = {α1, α2, …, αK, θ1, θ2, …, θK},
where αj is the mixture weight (or mixture
probability) of the mixture component j and θj
is the set of parameters of component j.
How does the model generate documents?
42. 42
Document generation
Due to the one-to-one correspondence, each class
corresponds to a mixture component. The mixture
weights are class prior probabilities, i.e., αj = Pr(cj|Θ).
The mixture model generates each document di by:
first selecting a mixture component (or class) according to
class prior probabilities (i.e., mixture weights), αj = Pr(cj|Θ),
then having this selected mixture component (cj) generate
a document di according to its parameters, with distribution
Pr(di|cj; Θ) or more precisely Pr(di|cj; θj).
Pr(di|Θ) = Σ_{j=1}^{|C|} Pr(cj|Θ) Pr(di|cj; θj)    (23)
43. 43
Model text documents
The naïve Bayesian classification treats each
document as a “bag of words”. The
generative model makes the following further
assumptions:
Words of a document are generated
independently of context given the class label.
The familiar naïve Bayes assumption used before.
The probability of a word is independent of its
position in the document. The document length is
chosen independent of its class.
44. 44
Multinomial distribution
With the assumptions, each document can be
regarded as generated by a multinomial
distribution.
In other words, each document is drawn from
a multinomial distribution of words with as
many independent trials as the length of the
document.
The words are from a given vocabulary V =
{w1, w2, …, w|V|}.
45. 45
Use probability function of multinomial
distribution
Pr(di | cj; Θ) = Pr(|di|) |di|! ∏_{t=1}^{|V|} Pr(wt | cj; Θ)^{Nti} / Nti!    (24)
where Nti is the number of times that word wt
occurs in document di and
Σ_{t=1}^{|V|} Nti = |di|,    Σ_{t=1}^{|V|} Pr(wt | cj; Θ) = 1.    (25)
46. 46
Parameter estimation
The parameters are estimated based on empirical
counts.
In order to handle 0 counts for infrequently occurring
words that do not appear in the training set, but may
appear in the test set, we need to smooth the
probability with Lidstone smoothing, 0 ≤ λ ≤ 1.
The unsmoothed estimate is
Pr(wt | cj; Θ̂) = Σ_{i=1}^{|D|} Nti Pr(cj | di) / Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} Nsi Pr(cj | di)    (26)
and the smoothed (Lidstone) estimate is
Pr(wt | cj; Θ̂) = (λ + Σ_{i=1}^{|D|} Nti Pr(cj | di)) / (λ|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} Nsi Pr(cj | di))    (27)
47. 47
Parameter estimation (cont …)
Class prior probabilities, which are the mixture
weights αj, can be easily estimated using the
training data:
Pr(cj | Θ̂) = Σ_{i=1}^{|D|} Pr(cj | di) / |D|    (28)
48. 48
Classification
Given a test document di, from Eqs. (23), (27) and (28),
Pr(cj | di; Θ̂) = Pr(cj | Θ̂) Pr(di | cj; Θ̂) / Pr(di | Θ̂)
             = Pr(cj | Θ̂) ∏_{k=1}^{|di|} Pr(w_{di,k} | cj; Θ̂) / Σ_{r=1}^{|C|} [ Pr(cr | Θ̂) ∏_{k=1}^{|di|} Pr(w_{di,k} | cr; Θ̂) ]
where w_{di,k} is the word in position k of document di.
49. 49
Discussions
Most assumptions made by naïve Bayesian
learning are violated to some degree in
practice.
Despite such violations, researchers have
shown that naïve Bayesian learning produces
very accurate models.
The main problem is the mixture model
assumption. When this assumption is seriously
violated, the classification performance can be
poor.
Naïve Bayesian learning is extremely efficient.
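An illustrative sketch (not the slides' own code) of the text classifier in Equations (23)-(28): it treats each document as a bag of words, estimates Pr(cj) as in Eq. (28) and Pr(wt|cj) with Lidstone smoothing as in Eq. (27) (here with hard class labels), and classifies with arg max in log space to avoid underflow. The toy documents are invented.

```python
import math
from collections import Counter

def train(docs, labels, vocab, lam=1.0):
    """Estimate Pr(cj) and the smoothed word probabilities Pr(wt|cj) per class."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_probs = {}
    for c in classes:
        counts = Counter()
        for doc, y in zip(docs, labels):
            if y == c:
                counts.update(doc)
        total = sum(counts[w] for w in vocab)
        word_probs[c] = {w: (lam + counts[w]) / (lam * len(vocab) + total) for w in vocab}
    return priors, word_probs

def classify(doc, priors, word_probs):
    """Pick arg max_cj Pr(cj) * prod_k Pr(w_k | cj), computed in log space."""
    scores = {}
    for c in priors:
        scores[c] = math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
    return max(scores, key=scores.get)

# Invented toy corpus.
docs = [["cheap", "pills", "buy"], ["meeting", "agenda"],
        ["buy", "cheap", "now"], ["project", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
vocab = sorted({w for d in docs for w in d})

priors, word_probs = train(docs, labels, vocab)
print(classify(["cheap", "meeting"], priors, word_probs))
```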
51. 51
Introduction
Support vector machines were invented by V.
Vapnik and his co-workers in the 1970s in Russia and
became known to the West in 1992.
SVMs are linear classifiers that find a hyperplane to
separate two classes of data, positive and negative.
Kernel functions are used for nonlinear separation.
SVM not only has a rigorous theoretical foundation,
but also performs classification more accurately than
most other methods in applications, especially for
high dimensional data.
It is perhaps the best classifier for text classification.
52. 52
Basic concepts
Let the set of training examples D be
{(x1, y1), (x2, y2), …, (xr, yr)},
where xi = (x1, x2, …, xn) is an input vector in a
real-valued space X ⊆ Rⁿ and yi is its class label
(output value), yi ∈ {1, -1}.
1: positive class and -1: negative class.
SVM finds a linear function of the form (w: weight vector)
f(x) = ⟨w · x⟩ + b
so that
yi = 1 if ⟨w · xi⟩ + b ≥ 0
yi = -1 if ⟨w · xi⟩ + b < 0
53. 53
The hyperplane
The hyperplane that separates positive and negative
training data is
⟨w · x⟩ + b = 0
It is also called the decision boundary (surface).
So many possible hyperplanes, which one to choose?
54. 54
Maximal margin hyperplane
SVM looks for the separating hyperplane with the largest
margin.
Machine learning theory says this hyperplane minimizes the
error bound
55. 55
Linear SVM: separable case
Assume the data are linearly separable.
Consider a positive data point (x+, 1) and a negative
(x-, -1) that are closest to the hyperplane
⟨w · x⟩ + b = 0.
We define two parallel hyperplanes, H+ and H-, that
pass through x+ and x- respectively. H+ and H- are
also parallel to ⟨w · x⟩ + b = 0.
56. 56
Compute the margin
Now let us compute the distance between the two
margin hyperplanes H+ and H-. Their distance is the
margin (d+ + d− in the figure).
Recall from vector space algebra that the
(perpendicular) distance from a point xi to the
hyperplane ⟨w · x⟩ + b = 0 is
|⟨w · xi⟩ + b| / ||w||    (36)
where ||w|| is the norm of w,
||w|| = √⟨w · w⟩ = √(w1² + w2² + … + wn²)    (37)
57. 57
Compute the margin (cont …)
Let us compute d+.
Instead of computing the distance from x+ to the
separating hyperplane ⟨w · x⟩ + b = 0, we pick
any point xs on ⟨w · x⟩ + b = 0 and compute the
distance from xs to the hyperplane ⟨w · x⟩ + b = 1 (H+)
by applying the distance Eq. (36) and noticing that ⟨w · xs⟩ + b = 0,
d+ = |⟨w · xs⟩ + b − 1| / ||w|| = 1 / ||w||    (38)
margin = d+ + d− = 2 / ||w||    (39)
58. 58
A optimization problem!
Definition (Linear SVM: separable case): Given a set of
linearly separable training examples,
D = {(x1, y1), (x2, y2), …, (xr, yr)}
Learning is to solve the following constrained
minimization problem:
Minimize:  ⟨w · w⟩ / 2
Subject to:  yi(⟨w · xi⟩ + b) ≥ 1,  i = 1, 2, …, r    (40)
The constraint yi(⟨w · xi⟩ + b) ≥ 1 summarizes
⟨w · xi⟩ + b ≥ 1   for yi = 1
⟨w · xi⟩ + b ≤ -1  for yi = -1.
59. 59
Solve the constrained minimization
Standard Lagrangian method
LP = (1/2) ⟨w · w⟩ − Σ_{i=1}^{r} αi [yi(⟨w · xi⟩ + b) − 1]    (41)
where αi ≥ 0 are the Lagrange multipliers.
Optimization theory says that an optimal
solution to (41) must satisfy certain conditions,
called Kuhn-Tucker conditions, which are
necessary (but not sufficient).
Kuhn-Tucker conditions play a central role in
constrained optimization.
60. 60
Kuhn-Tucker conditions
Eq. (50) is the original set of constraints.
The complementarity condition (52) shows that only those
data points on the margin hyperplanes (i.e., H+ and H-) can
have αi > 0, since for them yi(⟨w · xi⟩ + b) − 1 = 0.
These points are called the support vectors. All the other
points have αi = 0.
61. 61
Solve the problem
In general, Kuhn-Tucker conditions are necessary
for an optimal solution, but not sufficient.
However, for our minimization problem with a
convex objective function and linear constraints, the
Kuhn-Tucker conditions are both necessary and
sufficient for an optimal solution.
Solving the optimization problem is still a difficult
task due to the inequality constraints.
However, the Lagrangian treatment of the convex
optimization problem leads to an alternative dual
formulation of the problem, which is easier to solve
than the original problem (called the primal).
62. 62
Dual formulation
From primal to a dual: Setting to zero the
partial derivatives of the Lagrangian (41) with
respect to the primal variables (i.e., w and
b), and substituting the resulting relations
back into the Lagrangian.
I.e., substitute (48) and (49) into the original
Lagrangian (41) to eliminate the primal variables, which gives
LD = Σ_{i=1}^{r} αi − (1/2) Σ_{i,j=1}^{r} yi yj αi αj ⟨xi · xj⟩    (55)
63. 63
Dual optimization problem
This dual formulation is called the Wolfe dual.
For the convex objective function and linear constraints of
the primal, it has the property that the maximum of LD
occurs at the same values of w, b and αi as the minimum
of LP (the primal).
Solving (56) requires numerical techniques and clever
strategies, which are beyond our scope.
64. 64
The final decision boundary
After solving (56), we obtain the values for αi, which
are used to compute the weight vector w and the
bias b using Equations (48) and (52) respectively.
The decision boundary is
⟨w · x⟩ + b = Σ_{i∈sv} yi αi ⟨xi · x⟩ + b = 0    (57)
where sv is the set of indices of the support vectors.
Testing: Given a test instance z, compute
sign(⟨w · z⟩ + b) = sign(Σ_{i∈sv} yi αi ⟨xi · z⟩ + b)    (58)
If (58) returns 1, then the test instance z is classified
as positive; otherwise, it is classified as negative.
65. 65
Linear SVM: Non-separable case
Linear separable case is the ideal situation.
Real-life data may have noise or errors.
Class label incorrect or randomness in the application
domain.
Recall in the separable case, the problem was
Minimize:  ⟨w · w⟩ / 2
Subject to:  yi(⟨w · xi⟩ + b) ≥ 1,  i = 1, 2, …, r
With noisy data, the constraints may not be
satisfied. Then, no solution!
66. 66
Relax the constraints
To allow errors in data, we relax the margin
constraints by introducing slack variables, ξi
(≥ 0), as follows:
⟨w · xi⟩ + b ≥ 1 − ξi    for yi = 1
⟨w · xi⟩ + b ≤ -1 + ξi   for yi = -1.
The new constraints:
Subject to:  yi(⟨w · xi⟩ + b) ≥ 1 − ξi,  i = 1, …, r,
ξi ≥ 0,  i = 1, 2, …, r.
68. 68
Penalize errors in objective function
We need to penalize the errors in the
objective function.
A natural way of doing it is to assign an extra
cost for errors to change the objective
function to
Minimize:  ⟨w · w⟩ / 2 + C Σ_{i=1}^{r} ξi^k    (60)
k = 1 is commonly used, which has the
advantage that neither ξi nor its Lagrange
multipliers appear in the dual formulation.
69. 69
New optimization problem
This formulation is called the soft-margin
SVM:
Minimize:  ⟨w · w⟩ / 2 + C Σ_{i=1}^{r} ξi
Subject to:  yi(⟨w · xi⟩ + b) ≥ 1 − ξi,  i = 1, 2, …, r
             ξi ≥ 0,  i = 1, 2, …, r    (61)
The primal Lagrangian is
LP = (1/2) ⟨w · w⟩ + C Σ_{i=1}^{r} ξi − Σ_{i=1}^{r} αi [yi(⟨w · xi⟩ + b) − 1 + ξi] − Σ_{i=1}^{r} μi ξi    (62)
where αi, μi ≥ 0 are the Lagrange multipliers.
71. 71
From primal to dual
As in the linearly separable case, we transform
the primal to a dual by setting to zero the
partial derivatives of the Lagrangian (62) with
respect to the primal variables (i.e., w, b
and ξi), and substituting the resulting
relations back into the Lagrangian.
I.e., we substitute Equations (63), (64) and
(65) into the primal Lagrangian (62).
From Equation (65), C − αi − μi = 0, we can
deduce that αi ≤ C because μi ≥ 0.
72. 72
Dual
The dual of (61) is
Interestingly, ξi and its Lagrange multipliers μi are
not in the dual. The objective function is identical to
that for the separable case.
The only difference is the constraint αi ≤ C.
73. 73
Find primal variable values
The dual problem (72) can be solved numerically.
The resulting αi values are then used to compute w
and b. w is computed using Equation (63) and b is
computed using the Kuhn-Tucker complementarity
conditions (70) and (71).
Since we have no values for ξi, we need to get around this.
From Equations (65), (70) and (71), we observe that if 0 < αi
< C then both ξi = 0 and yi(⟨w · xi⟩ + b) − 1 + ξi = 0. Thus, we
can use any training data point xj for which 0 < αj < C and
Equation (69) (with ξj = 0) to compute b:
yj (Σ_{i=1}^{r} yi αi ⟨xi · xj⟩ + b) − 1 = 0    (73)
74. CS583, Bing Liu, UIC 74
(65), (70) and (71) in fact tell us more
(74) shows a very important property of SVM:
the solution is sparse in αi. Many training data points are
outside the margin area and their αi's in the solution are 0.
Only those data points that are on the margin (i.e., yi(⟨w · xi⟩
+ b) = 1, the support vectors in the separable case),
inside the margin (i.e., αi = C and yi(⟨w · xi⟩ + b) < 1), or
errors have non-zero αi.
Without this sparsity property, SVM would not be practical for
large data sets.
75. 75
The final decision boundary
The final decision boundary is (we note that many
αi's are 0)
⟨w · x⟩ + b = Σ_{i=1}^{r} yi αi ⟨xi · x⟩ + b = 0    (75)
The decision rule for classification (testing) is the
same as in the separable case, i.e.,
sign(⟨w · x⟩ + b).
Finally, we also need to determine the parameter C
in the objective function. It is normally chosen
through the use of a validation set or cross-validation.
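To illustrate Equations (57), (58) and (75), the sketch below (not from the slides) evaluates an SVM decision function from a set of support vectors, their labels and their αi values; the numbers are invented, as if they had been obtained by solving the dual.

```python
import numpy as np

def svm_decision(z, support_vectors, labels, alphas, b):
    """sign( sum_i y_i * alpha_i * <x_i, z> + b ), as in Eqs. (57)/(58)/(75)."""
    score = sum(a * y * np.dot(x, z) for x, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1

# Invented support vectors, labels, multipliers and bias (as if solved from the dual).
support_vectors = [np.array([1.0, 1.0]), np.array([2.0, 3.0])]
labels = [1, -1]
alphas = [0.5, 0.5]
b = 0.25

print(svm_decision(np.array([0.5, 0.5]), support_vectors, labels, alphas, b))
```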
76. 76
How to deal with nonlinear separation?
The SVM formulations require linear separation.
Real-life data sets may need nonlinear separation.
To deal with nonlinear separation, the same
formulation and techniques as for the linear case
are still used.
We only transform the input data into another space
(usually of a much higher dimension) so that
a linear decision boundary can separate positive and
negative examples in the transformed space,
The transformed space is called the feature space.
The original data space is called the input space.
77. 77
Space transformation
The basic idea is to map the data in the input
space X to a feature space F via a nonlinear
mapping φ,
φ : X → F
x ↦ φ(x)    (76)
After the mapping, the original training data
set {(x1, y1), (x2, y2), …, (xr, yr)} becomes:
{(φ(x1), y1), (φ(x2), y2), …, (φ(xr), yr)}    (77)
78. 78
Geometric interpretation
In this example, the transformed space is
also 2-D. But usually, the number of
dimensions in the feature space is much
higher than that in the input space
80. 80
An example space transformation
Suppose our input space is 2-dimensional,
and we choose the following transformation
(mapping) from 2-D to 3-D:
(x1, x2) ↦ (x1², x2², √2·x1x2)
The training example ((2, 3), -1) in the input
space is transformed to the following in the
feature space:
((4, 9, 8.5), -1)
81. 81
Problem with explicit transformation
The potential problem with this explicit data
transformation and then applying the linear SVM is
that it may suffer from the curse of dimensionality.
The number of dimensions in the feature space can
be huge with some useful transformations even with
reasonable numbers of attributes in the input space.
This makes it computationally infeasible to handle.
Fortunately, explicit transformation is not needed.
82. 82
Kernel functions
We notice that in the dual formulation both
the construction of the optimal hyperplane (79) in F and
the evaluation of the corresponding decision function (80)
only require dot products ⟨φ(x) · φ(z)⟩ and never the mapped
vector φ(x) in its explicit form. This is a crucial point.
Thus, if we have a way to compute the dot product
⟨φ(x) · φ(z)⟩ using the input vectors x and z directly,
there is no need to know the feature vector φ(x) or even φ itself.
In SVM, this is done through the use of kernel
functions, denoted by K:
K(x, z) = ⟨φ(x) · φ(z)⟩    (82)
83. 83
An example kernel function
Polynomial kernel
K(x, z) = ⟨x · z⟩^d    (83)
Let us compute the kernel with degree d = 2 in a 2-
dimensional space: x = (x1, x2) and z = (z1, z2).
⟨x · z⟩² = (x1z1 + x2z2)²
        = x1²z1² + 2x1z1x2z2 + x2²z2²
        = ⟨(x1², x2², √2·x1x2) · (z1², z2², √2·z1z2)⟩
        = ⟨φ(x) · φ(z)⟩    (84)
where φ(x) = (x1², x2², √2·x1x2).
This shows that the kernel ⟨x · z⟩² is a dot product in
a transformed feature space.
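The identity in (84) is easy to check numerically; the sketch below (illustrative, not from the slides) compares the degree-2 polynomial kernel with the explicit mapping φ(x) = (x1², x2², √2·x1x2) on invented points.

```python
import math

def poly_kernel(x, z):
    """K(x, z) = <x, z>^2 for 2-dimensional inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map from the 2-D input space to the 3-D feature space."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, z = (2.0, 3.0), (1.0, 4.0)
lhs = poly_kernel(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, rhs)  # both 196.0: the kernel equals the dot product in the feature space
```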
84. 84
Kernel trick
The derivation in (84) is only for illustration
purposes.
We do not need to find the mapping function.
We can simply apply the kernel function
directly, by replacing all the dot products ⟨φ(x) · φ(z)⟩ in (79) and
(80) with the kernel function K(x, z) (e.g., the
polynomial kernel ⟨x · z⟩^d in (83)).
This strategy is called the kernel trick.
85. 85
Is it a kernel function?
The question is: how do we know whether a
function is a kernel without performing a
derivation such as that in (84)? I.e.,
how do we know that a kernel function is indeed a
dot product in some feature space?
This question is answered by a theorem
called Mercer's theorem, which we will
not discuss here.
86. 86
Commonly used kernels
It is clear that the idea of a kernel generalizes the dot
product in the input space. The dot product itself is also
a kernel, with the feature map being the identity.
87. 87
Some other issues in SVM
SVM works only in a real-valued space. For a
categorical attribute, we need to convert its
categorical values to numeric values.
SVM does only two-class classification. For multi-
class problems, some strategies can be applied, e.g.,
one-against-rest, and error-correcting output coding.
The hyperplane produced by SVM is hard for
human users to understand. The matter is made
worse by kernels. Thus, SVM is commonly used in
applications that do not require human
understanding.
89. 89
k-Nearest Neighbor Classification (kNN)
Unlike all the previous learning methods, kNN
does not build a model from the training data.
To classify a test instance d, define k-
neighborhood P as k nearest neighbors of d
Count number n of training instances in P that
belong to class cj
Estimate Pr(cj|d) as n/k
No training is needed. Classification time is
linear in training set size for each test case.
90. 90
kNN Algorithm
k is usually chosen empirically via a validation
set or cross-validation by trying a range of k
values.
Distance function is crucial, but depends on
applications.
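A minimal kNN sketch (not from the slides): Euclidean distance is used as the distance function, and the class with the largest count n among the k nearest neighbours (i.e., the largest estimated Pr(cj|d) = n/k) is returned. The toy points are invented.

```python
import math
from collections import Counter

def knn_classify(d, training_data, k=3):
    """Classify test instance d by majority vote among its k nearest training instances."""
    neighbours = sorted(
        training_data,
        key=lambda item: math.dist(item[0], d)  # Euclidean distance to d
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]           # class with the largest count n (Pr = n/k)

# Invented 2-D training data: (point, class label).
training_data = [((1.0, 1.0), "pos"), ((1.2, 0.8), "pos"),
                 ((4.0, 4.0), "neg"), ((4.2, 3.9), "neg"), ((0.9, 1.1), "pos")]
print(knn_classify((1.1, 1.0), training_data, k=3))  # "pos"
```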
92. 92
Discussions
kNN can deal with complex and arbitrary
decision boundaries.
Despite its simplicity, researchers have
shown that the classification accuracy of kNN
can be quite strong and, in many cases, as
accurate as more elaborate methods.
kNN is slow at classification time
kNN does not produce an understandable
model
94. Artificial Neural Network
94
Computational models inspired by the
human brain:
Algorithms that try to mimic the brain.
Massively parallel, distributed system, made up
of simple processing units (neurons)
Synaptic connection strengths among neurons
are used to store the acquired knowledge.
Knowledge is acquired by the network from its
environment through a learning process
95. Applications of ANNs
95
ANNs have been widely used in various
domains for:
Pattern recognition
Function approximation
Associative memory
96. Properties:
96
Inputs are flexible
any real values
Highly correlated or independent
Target function may be discrete-valued, real-valued,
or vectors of discrete or real values
Outputs are real numbers between 0 and 1
Resistant to errors in the training data
Long training time
Fast evaluation
The function produced can be difficult for humans to
interpret
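To make the "outputs are real numbers between 0 and 1" point concrete, here is an illustrative forward pass of a tiny one-hidden-layer network with sigmoid units (not from the slides); the weights are invented, not learned.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_output):
    """One hidden layer of sigmoid units followed by a single sigmoid output unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b) for weights, b in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_output[0], hidden)) + w_output[1])
    return output  # a real number between 0 and 1

# Invented weights: two hidden units over a 2-dimensional input, then one output unit.
w_hidden = [([0.5, -0.3], 0.1), ([-0.8, 0.9], 0.0)]   # (weights, bias) per hidden unit
w_output = ([1.2, -0.7], 0.05)                        # (weights, bias) of the output unit
print(forward([1.0, 2.0], w_hidden, w_output))
```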
97. When to consider Neural Networks
97
Input is high-dimensional, discrete or real-valued
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of the result is not important
Examples:
Speech phoneme recognition
Image classification
Financial prediction
111. 111
Summary
Applications of supervised learning are in almost
any field or domain.
We studied 5 classification techniques.
There are still many other methods, e.g.,
Bayesian networks
Genetic algorithms
Fuzzy classification
This large number of methods also shows the importance of
classification and its wide applicability.
It remains an active research area.