Unit I and II Machine Learning MCA CREC.pptx

Paul Bharath Bhushan Petlu
Computer Applications Department,
Chadalawada Ramanamma Engineering College,
Tirupati, Andhra Pradesh, India.

A duck seems to be pleasant on the surface of the pond;
But, there is a restless paddling under the water.

 Human beings have long dreamt of creating machines with
human-like traits
 Robots are used in manufacturing, mining, agriculture,
space, ocean exploration, health sciences, etc.
 These machines are enslaved by commands
 The goal: create intelligent machines that emulate human
intelligence

 Human intelligence possesses robust attributes,
with complex sensory, control, affective
(emotional), and cognitive (thinking) processes
 Central Nervous System (CNS): over one hundred billion
biological neurons
 The CNS acquires information from the natural sensory
organs

 Cognitive Mathematics?
 Neural networks: a low-level cognitive machine – a
thinking machine
 Fuzzy logic: mathematical power for the emulation
of the higher-order cognitive functions – the
thought and perception process
 Neural networks + Fuzzy logic

 Needs, Motivations, and Rationale:
 Information is power and a must for success
 The collected information may be categorized on
the basis of nature of experience:
1. Experimental data
2. Structured human knowledge expressed in linguistic
form
 Banks, hospitals, automobiles, observatories
around the world, etc

 Soft Computing / Machine Learning:
 Conventional approach: human intelligence applied to
solving decision problems
 The present scene is much different from
yesterday. We now have an ocean of data to be
processed, and humans are unable to extract useful
information from it unaided. Computers of today can
store this data and analyze it. However, to lead to
meaningful analysis, a new mathematical theory
has emerged, built on the foundation of the
human faculties of learning, memorizing, adapting
and generalizing

Soft Computing / Machine Learning:
The basic premises of machine learning are:
 The real world is pervasively imprecise and
uncertain
 The precision and certainty carry a cost
The guiding principle of machine learning, which
follows from these premises, is as follows:
Exploit tolerance for imprecision and uncertainty to
achieve tractability, robustness, and low-cost solutions

 A well-defined learning problem has three
identified features:
1) The learning task
2) The measure of performance
3) The task experience
 Important aspects of ‘learning from experience’
behavior of humans and other animals embedded
in machine learning are:
 1) Remembering and Adapting
 2) Generalizing

 Machine Computer Program
 Learning Machine
 Learning Algorithms Computer Program Design
 Learned Knowledge

A block diagrammatic representation of a Learning
Machine:

 Google is by far the most popular and extensively
used of all search engines.
 The moment we start browsing for items on
Amazon, we see recommendations for products,
books, movies, music, etc., that we are probably
interested in.
 Amazon uses a recommender system built with
machine learning, based on the data generated
from social networking sites.

 Some application domains are:
 Medical Diagnostics
 Finance Domain
 Stock Market Forecasting
 Machine Vision
 Speech Recognition
 Text Mining
 Robotics and Automation
 etc.

 Medical Diagnostics: The major success of deep learning
in machine vision applications has made it
possible to accurately analyze medical images – X-rays,
MRI, CT scans, ultrasound images, ECG, etc.
Machine learning diagnosis augmented with deep learning
promises to revolutionize healthcare.

 Finance Domain: Applications of machine
learning in the finance domain help banks offer
personalized services to customers at lower cost
and with better compliance. This helps banks
generate higher revenue. Machine learning can
scan large amounts of transactional data in
seconds to detect, and help predict, fraudulent
behavior.

 Stock Market Forecasting: Readers aware of the stock
market know that data on the continual buying and selling
of company stocks takes the form of a time
series. It is sequential data wherein the data at time t
depends on the past history at t-1, t-2, … A stock
index is an average value calculated by
combining various stocks, and its prediction
represents the market’s movements over time.

 Machine Vision: A machine vision system captures
images through a camera and analyzes these to
describe the images. The level of visual
understanding and recognition that humans
exhibit cannot be matched by machine vision
algorithms. However, certain problems, such as
biometric recognition (fingerprint identification,
face recognition, etc.), are being handled with success.
 Speech Recognition: Using signal processing techniques
we can represent the speech signal by a set of real
values. The resulting data is sequential in nature
and with deep learning we get higher levels of
performance for speech recognition systems.
Virtual Personal Assistants (Amazon Echo, Google
Home) assist in finding information when asked
by voice.
 For example: “what are the flights from Delhi to
Chennai?”

 Text Mining: Text mining is an area that is
concerned with the identification of patterns in text
data. The procedure involves analysis of text for
extraction of useful information for specific
purposes – email spam detection etc.

 Robotics and Automation: Computers are controlling and
monitoring manufacturing processes with high degree
of automation, facilitated by machine learning
techniques and robots – for industrial automation,
medical robots, military robots, robots employed in
disaster areas, and so on. Machine Vision is an integral
part of many robotic applications, for example, images
have to be analyzed online and a machine learning
system has to categorize the objects into ‘defect’ and
‘non-defect’ category and then the robot can put the
objects in the right place.

 More on Application Domains: Machine Learning /
Data Mining is omnipresent and is an empirical
technology that has applications in all knowledge
domains: engineering, business management,
natural science, social science.

 Data Representation: Structured / Unstructured Data
 Experience in the form of raw data is a source of
learning in many applications and human knowledge
in linguistic form is an additional learning source.
 Data warehousing provides integrated, consistent and
cleaned data to the machine learning algorithms. Data
may also be available in a flat file, which is a
simple data table.
 Logical structure of a database is established by data
modelling. A data model determines how data is
stored, organized, and then manipulated in the
database.

 Structured Data: It is the data that adheres to a
predefined data model. It can be stored in a
relational database. It conforms to a tabular format
with relationships between different rows and
columns. Data can easily be aggregated from
various locations in the database. This data model
is the simplest way to manage information, and the
techniques to analyze structured data are well established.

 Unstructured Data: It is the information that either
doesn’t have a data model or is not organized in a
predefined manner. It often includes text and
multimedia data, for example, social media data
generated from YouTube, Facebook, Twitter,
Instagram, LinkedIn, etc, text internal to the
company such as documents, logs, survey results,
emails, images, videos and audio. In cases where
this kind of data has internal structure, it is
still considered ‘unstructured’ because it doesn’t fit
neatly in a relational database.

 Semi-structured Data: It is the information that
doesn’t conform to the formal structure of the data
models associated with relational databases, but
that does have some organizational properties that
make it easier to analyze. With some processing,
we can transform it into a format that machines
accept for various prediction tasks.
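
As an illustration of the "some processing" step, the sketch below flattens nested JSON records into a simple table. The records, the field names, and the `flatten` helper are hypothetical and chosen only for the example; real pipelines would usually rely on a dedicated library.

```python
import json

# Hypothetical semi-structured records: fields vary from record to record.
raw = '''[
  {"id": 1, "user": {"name": "Asha", "city": "Tirupati"}, "tags": ["ml", "ai"]},
  {"id": 2, "user": {"name": "Ravi"}, "rating": 4.5}
]'''

def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            # Lists are joined into one string so every cell stays scalar.
            flat[name] = ";".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat

records = [flatten(r) for r in json.loads(raw)]
# The union of all keys gives the table columns; missing cells become None.
columns = sorted({key for r in records for key in r})
table = [[r.get(col) for col in columns] for r in records]

print(columns)
for row in table:
    print(row)
```

Each row now has the same columns, which is the flat-file form that most learning algorithms expect.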

 Unlocking the information power of Unstructured data:
About 80% of the total data being collected and
stored today is unstructured. Therefore, unlocking
the information power of such data is very
important. Examples of non-relational data stores and
big-data processing frameworks include Apache Cassandra, MongoDB,
Hadoop/MapReduce, and Spark, among others. A
number of software solutions are being designed to
search unstructured data and extract important
information.

 Forms of Learning:
 Any method that incorporates information from
experience in the design of a machine employs
learning. A learning method depends on the type of
experience from which the machine will learn or be
trained. The type of available learning experience
can have a significant impact on the success or
failure of the learning machine.

 Forms of Learning:
 The field of machine learning usually
distinguishes four forms of learning:
 1) Supervised Learning
 2) Unsupervised Learning
 3) Reinforcement Learning
 4) Learning based on natural processes – evolution,
swarming, and immune systems

 1) Predictive / Directed / Supervised Learning:
 The goal is to learn a mapping from inputs xi to
outputs yi, given a labeled training set
D = {(xi, yi)}, i = 1 to N.
 In general xi is a D-dimensional vector of numbers,
say, the height and weight of a person. These are called
features, attributes or covariates.
 Input xi could also be a complex structured object, such
as an image, a sentence, an email message, a time
series, a molecular shape, a graph, etc.
 Similarly, the form of the output or response variable
can in principle be anything, but most methods
assume that yi is a categorical or nominal variable
from some finite set, yi ϵ {1, 2, …, C}

 Binary Classification: C = 2
 (a) Some labeled training examples of colored shapes, along with 3
unlabeled test cases. (b) Representing the training data as an N x D
design matrix. Row i represents the feature vector xi. The last column is
the label, yi ϵ {0, 1}. Based on a figure by Leslie Kaelbling

 Classification of flowers:
 Three types of Iris flowers: Setosa, Versicolor and Virginica. Source:
http://www.statlab.uni-heidelberg.de/data/iris/.

Image Classification:
 We might want to classify the image as a whole, e.g., is it an
indoors or outdoors scene? Is it a horizontal or vertical
photo? Does it contain a dog or not? This is called image
classification.
Handwriting recognition:
 In the special case that the images consist of isolated
handwritten letters and digits, for example, in a postal or
ZIP code on a letter, we can use classification to perform
handwriting recognition.

 Face detection and Recognition:
 Example of face detection. (a) Input image (b) Output of classifier,
which detected 5 faces at different poses.

 Regression:
 (a) Linear Regression on some 1d data. (b) Same data with polynomial
regression (degree 2). Figure generated by linregpolyVsDegree

Some of the examples of real-world regression problems are:
•Predict tomorrow’s stock market price given current market
conditions and other possible side information.
•Predict the age of a viewer watching a given video on
YouTube.
•Predict the location in 3d space of a robot arm end effector,
given control signals (torques) sent to its various motors.
•Predict the amount of prostate specific antigen (PSA) in the
body as a function of a number of different clinical
measurements.
•Predict the temperature at any location inside a building
using weather data, time, door sensors, etc.

 2) Descriptive / Undirected / Unsupervised Learning:
 Here we are only given inputs, D = {xi}, i = 1 to N,
and the goal is to find “interesting patterns” in the
data. This is a much less well-defined problem,
since we are not told what kinds of patterns to look
for, and there is no obvious error metric to use.

 3) Reinforcement Learning:
 This is somewhat less commonly used. This is
useful for learning how to act or behave when
given occasional reward or punishment signals.
(For example, consider how a baby learns to walk)

 4) Learning based on Natural Processes: Evolution,
Swarming, and Immune Systems
 Some learning approaches take inspiration from
nature for the development of novel problem-
solving techniques. These are applied successfully
to a variety of optimization problems.

 Evolutionary Computation
 Evolutionary biology essentially states that a
population of individuals possessing the ability to
reproduce and exposed to genetic variation
followed by selection gives rise to new
populations which are fitter to their environment.
The primary streams are: genetic algorithms,
evolution strategies, evolutionary programming
and genetic programming

 Swarm Intelligence
 It is a feature of systems of unintelligent agents
with inadequate individual abilities, displaying
collectively intelligent behaviour. It includes
algorithms derived from the collective behaviour of
social insects (Ant Colony Optimization) and other
animal / human societies (Particle Swarm
Optimization)

 Artificial Immune Systems
 An Artificial Immune System (AIS) replicates
certain aspects of the natural immune system and
is primarily applied to solve pattern-recognition
problems and to cluster data. The natural
immune system has an extraordinary ability to
match patterns, which it employs to differentiate between
foreign cells entering the body
(antigens) and the cells that are part of the body.

 Machine Learning and Data Mining
 Machine Learning:
 Early AI research focused on hard-coding
rules that mimic human intelligence. Machine
Learning, a subfield of AI, still involves classical
programming: human intelligence is required to
convert raw data into the representations used by the machine
for learning. Deep learning, a subfield of Machine
Learning, is a form of ‘representation learning’
inspired by the biological nervous system. Machine
Learning is the computational process wherein a
machine ‘learns’ and adjusts its behaviour based on
feedback from data.

 Machine Learning and Data Mining
 Data Mining:
 DM is concerned with real-world applications:
data analysis concentrated on commercial applications and
business-related problems tends to
drift in the direction of data mining.
 Both ML and DM are related to each other sharing
methods and algorithms pertaining to the analysis
of the data to seek informative patterns.

 Data Science
 Data Science is a new name given to an action plan
for expanding the technical areas of the field of
statistics.
 Data Science is the extraction of knowledge from
data.
 The task ‘knowledge extraction’ does not have any
boundaries.

 Relationship among key technologies

 Learning from Observations
 We can visualize each pattern with n numerical
features as a point in n-dimensional state space $\mathbb{R}^n$:
 $x = [x_1\ x_2\ \dots\ x_n]^T \in \mathbb{R}^n$
 The training experience is in the form of data D that
describes how the system behaves over its entire range
of operation.
 $D : \{x^{(i)}, y^{(i)}\};\ i = 1, 2, \dots, N$    (2.1)
 The data D is independently drawn and identically
distributed (iid), represented by the probability density
function p(x, y)

 Learning from Observations
 Assume a machine defined by a function f: X → Y
 When f(.) is selected, the machine is called a trained
machine that gives the estimated output value $\hat{y} = f(x)$
for a given pattern x.
 We can define the set of learning machines by a
function f(x, w), where w contains adjustable
parameters.
 Loss function is L(y, f(x, w))

 Empirical Risk Minimization
 Our problem is to find a decision function f(x, w)
that minimizes the risk function R(w), the expected
loss under p(x, y):
 $R(w) = \int L(y, f(x, w))\, p(x, y)\, dx\, dy$
 With the dataset D : {x(i), y(i)}; i = 1, 2, …, N, being
the only source of information, the empirical risk
function is given by:
 $R_{emp}(w) = \frac{1}{N} \sum_{i=1}^{N} L(y^{(i)}, f(x^{(i)}, w))$
 This empirical risk function replaces the average over
p(x, y) by an average over the training sample.
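
A minimal sketch of empirical risk as an average of per-sample losses. The linear machine `f`, the toy dataset `D`, and the parameter values are assumptions made for illustration; the loss is the squared-error loss ½ (y – f(x, w))² used elsewhere in these slides.

```python
def squared_loss(y, y_hat):
    """Loss L(y, f(x, w)) = 1/2 (y - f(x, w))^2."""
    return 0.5 * (y - y_hat) ** 2

def f(x, w):
    """Hypothetical learning machine: a line with parameters w = (w0, w1)."""
    return w[0] + w[1] * x

def empirical_risk(data, w, loss=squared_loss):
    """Average loss over the N training pairs (x, y) in data."""
    return sum(loss(y, f(x, w)) for x, y in data) / len(data)

# Toy dataset generated by y = 1 + 2x (an assumption for the example).
D = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

risk_good = empirical_risk(D, (1.0, 2.0))  # parameters that fit D exactly
risk_bad = empirical_risk(D, (0.0, 2.0))   # intercept off by one
print(risk_good, risk_bad)
```

Empirical risk minimization then amounts to searching for the w that makes this average as small as possible.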

 Inductive Learning
 Given a collection of examples (x(i), f(x(i))); i = 1,
2, …, N, of a function f(x), return a function h(x)
that approximates f(x).
 The assumption in inductive learning is that the
ideal hypothesis related to unseen patterns is the
one induced by the observed training data.

 Bias and Variance
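 Consider the following experiment. We first collect a
random sample D of N independently drawn patterns
from the distribution p(x, y), and then measure the
sample error / training error / approximation error of the
hypothesis h(x) = f(x, w) from:
 $\text{error}_D[h] = \frac{1}{N} \sum_{i=1}^{N} L(y^{(i)}, f(x^{(i)}, w))$
 using the 0-1 loss function for classification problems:
 $L(y, f(x, w)) = 0$ if $y = f(x, w)$, and $1$ otherwise

 using the squared-error loss function for regression problems:
 $L(y, f(x, w)) = \frac{1}{2} (y - f(x, w))^2$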

 Bias and Variance (figure: linear curve fitting)
 If we run K such experiments, measuring the random variable
$\text{error}_{D_j}[h]$; j = 1, 2, …, K, then the average over
the K experiments is:
 $\text{error}_D[h] = E_D\{\text{error}_{D_j}[h]\}$
 where $E_D\{\cdot\}$ denotes the expectation or ensemble
average.

 Bias and Variance
 A non-zero error can arise for two reasons:
 1) It may be that the hypothesis function h(.) is, on
an average, different from the regression function
f(x). This is called bias.
 2) It may be that the hypothesis function is very
sensitive to the particular dataset Dj, so that for a
given x, approximation error is larger for some
datasets, and smaller for other datasets. This is
called variance.
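
The two error sources can be observed in a small simulation: draw K datasets, fit a hypothesis on each, and look at the spread of its predictions at one query point. Everything specific here is an assumption for the demo: the regression function x², the noise level, the query point 0.9, and the two hypothesis classes (a constant predictor and a 1-nearest-neighbour predictor).

```python
import random
import statistics

def true_f(x):
    return x * x  # regression function f(x), assumed for the demo

def sample_dataset(rng, n=20, noise=0.3):
    """Draw one dataset Dj of n noisy observations of f on [-1, 1]."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]

def constant_model(data):
    """Rigid hypothesis: always predict the mean target value."""
    mean_y = statistics.fmean([y for _, y in data])
    return lambda x: mean_y

def nearest_neighbour_model(data):
    """Flexible hypothesis: copy the target of the closest training input."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def bias_variance(make_model, x0=0.9, K=2000, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    preds = [make_model(sample_dataset(rng))(x0) for _ in range(K)]
    avg = statistics.fmean(preds)
    bias_sq = (avg - true_f(x0)) ** 2      # average h(x0) versus f(x0)
    variance = statistics.pvariance(preds)  # sensitivity to the dataset
    return bias_sq, variance

b_const, v_const = bias_variance(constant_model)
b_nn, v_nn = bias_variance(nearest_neighbour_model)
print("constant:", b_const, v_const)
print("1-NN:    ", b_nn, v_nn)
```

In this setup the constant predictor shows the larger bias and the nearest-neighbour predictor the larger variance, which is the trade-off the slide describes.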

 Occam’s razor principle
 The Franciscan Monk, William of Occam, was born
in 1280. His principle:
 ‘The simpler explanations are more reasonable, and
any unnecessary complexity should be shaved off’.
 ‘Simpler’ may imply needing fewer parameters,
less training time, fewer attributes for data
representation, and lower computational
complexity.

 Overfitting avoidance
 Occam’s razor principle suggests hypothesis
functions that avoid overfitting of the training data.
We stop looking for a design when the solution is
‘good enough’, not necessarily the optimal one.
 In machine learning jargon, a learning machine
is said to overfit the training examples if some
other learning machine that fits the training
examples less well actually performs better over
the total distribution of patterns.

 Heuristic Search in Inductive Learning
 Trial-and-error is the approach of searching for a ‘good
enough’ solution.
 Applied Machine Learning organizes the search as per
the following two-step procedure:
 1) The search is first focused on a class of the possible
hypothesis, chosen for the learning task in hand. Prior
knowledge and experience are helpful in this selection.
Different hypothesis functions are appropriate for
different kinds of learning tasks, and available data.
 2) For each of the members of the class, the
corresponding learning algorithm organizes the search
through all possible structures of the learning machine.

 Principal techniques used in heuristic search
 Regularization:
 Early Stopping:
 Pruning:

 Regularization:
 The regularization model promotes smoother functions by
creating a new criterion function that relies not only on the
training error, but also on the complexity of the hypothesis.
 $\bar{E} = E + \lambda \Omega$    (2.1)
 = error on data + λ * hypothesis complexity, where λ gives the weight
of the penalty.
 When λ = 0, there is no regularization, which results in a model that
tends to have high variance. That means this model won’t
generalize well to a dataset different from its training data
(overfitting). As the value of λ rises, up to a point, it reduces the
variance without substantially increasing the bias. But beyond
that point, further increases in λ give rise to increased
bias in the model, and thus underfitting. λ is optimized using
cross-validation.
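
A minimal sketch of the effect of λ, assuming a one-parameter model h(x) = w·x, data error E = ½ Σ (y – w·x)², and complexity penalty Ω = ½ w² (a ridge-style penalty chosen for the example, not necessarily the one intended on the slide). Setting the derivative of E + λΩ to zero gives w = Σxy / (Σx² + λ).

```python
def ridge_weight(data, lam):
    """Minimizer of 1/2*sum((y - w*x)^2) + lam * 1/2*w^2 over w."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def training_error(data, w):
    """Data term E = 1/2 * sum of squared residuals."""
    return 0.5 * sum((y - w * x) ** 2 for x, y in data)

# Toy data roughly following y = 2x (an assumption for the example).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
lambdas = (0.0, 1.0, 10.0, 100.0)

weights = [ridge_weight(data, lam) for lam in lambdas]
errors = [training_error(data, w) for w in weights]

for lam, w, e in zip(lambdas, weights, errors):
    print(f"lambda={lam:6.1f}  w={w:.4f}  training error={e:.4f}")
```

As λ grows the weight shrinks toward zero and the training error rises: the hypothesis becomes simpler but more biased. In practice λ would be chosen by comparing validation error across candidate values.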

 Early Stopping:
 Stopping the training before attaining the minimum
training error is a technique for restricting the
effective hypothesis complexity.
 Pruning:
 An alternative that is sometimes more
successful than stopping the growth of the
hypothesis early is pruning the full-grown hypothesis that is
likely to be overfitting the training data.

 Evaluation of Learning System
 Before using the Machine Learning System, it
should be evaluated in many aspects, which are:
 Accuracy:
 Robustness:
 Computation Complexity and Speed:
 Interpretability:
 Online Learning:
 Scalability:

 Accuracy: The learning system extracts knowledge
from the training data. The learned knowledge should
be general enough to deal with unknown data. The
generalization capability of a learning system is an
index of accuracy of the learning machine.
 Robustness: It means that the machine can perform
adequately under all circumstances, including cases
where information is corrupted by noise, is incomplete,
or is mixed with irrelevant data. All these
conditions are part of the real world, and must
be considered while evaluating a learning system.

 Computation Complexity and Speed: Computational
complexity of a learning algorithm and learning speed
determine the efficiency of a learning system: how fast
the systems can arrive at a correct answer and how
much computer memory is required. We know how
important speed is in real-time situations.
 Interpretability: This is the level of understanding and
insight offered by a learning algorithm. Interpretability
is subjective and, hence, tougher to evaluate.
Interpretability is easy in decision trees, but still their
interpretability may decrease with an increase in their
complexity.

 Online Learning: The spectrum of applications is
increasing with the growth of technology. There are
sources which are generating streaming data, which
has to be analyzed in real time. An online learning
system continues to receive inputs from a real-time
environment and analyzes them in real time.
 Scalability: Today huge amounts of data are being
generated in real-world applications. The capability of
higher levels of scalability is a desirable feature of a
learning machine. Typically, the assessment of
scalability can be done with a series of datasets of
ascending size and complexity.

 Estimating Generalization Errors
 The success of learning depends on the hypothesis
space complexity and sample complexity, which
are interdependent. The goal is to find the function that is
simplest in terms of complexity and best in terms
of empirical error on the data. Such a choice is expected to
give good generalization performance.
 If we partition the available data into training /
validation / testing datasets, the validation set is
used to optimize the parameters of the model
obtained using the training data, and the testing set is
used to estimate its generalization error.

 Holdout Method and Random Subsampling
 In the holdout technique, some amount of data
(typically one-third) is earmarked for testing,
while the remainder is employed for training. If the
data is collected over time (time series data), then we
can use the earlier part for training and the
latter part for testing.

 The samples used to train and test have to
represent the underlying distribution for the
problem area. The proportion of class-data in
training, testing, and full datasets should more or
less be same. To make sure this happens, random
sampling should be performed in a manner that
will guarantee that each class is properly
represented in training as well as test sets. This
process is known as stratification.
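
A minimal sketch of a stratified holdout split, assuming class labels are given as a plain list and one-third of each class is held out for testing. The function name, the labels, and the seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Split sample indices so each class keeps its proportion in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Imbalanced toy labels: 30 positives and 60 negatives.
labels = ["pos"] * 30 + ["neg"] * 60
train_idx, test_idx = stratified_holdout(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] == "pos" for i in test_idx), "positives in the test set")
```

Both sets keep the 1:2 class ratio of the full dataset, which a purely random split would only achieve approximately.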

 Even though stratified holdout is generally well
worth doing, it offers merely a basic safeguard
against uneven representation in the training and test
sets. A more general way to alleviate any bias
resulting from the specific sample selected for
holdout is random subsampling, wherein the holdout
technique is repeated K times with different random
samples. The overall accuracy estimate is
taken as the average of the accuracies obtained
from each repetition.

 Cross-Validation
 A commonly used technique for forecasting the
success rate of a learning method, taking into
account a fixed data sample, is the K-fold cross-
validation.
 Another estimate prevalent is the leave-one-out
cross-validation.

 K-Fold cross-validation
 In K-fold cross-validation, the given data D is randomly
divided into K mutually exclusive subsets or folds, Dk,
where k = 1, 2, …, K, each of about equal size. Training
and testing are done K times. In iteration k, partition Dk
is set aside for testing, and the remaining
partitions are collectively employed to train the model.
That is, in the first iteration, the set D2 ∪ D3 ∪ … ∪ DK serves
as the training set to obtain the first model, which is
tested on D1; the second model is trained on D1 ∪ D3 ∪
… ∪ DK and tested on D2, and so on.
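
The partitioning can be sketched as an index generator. This is a plain, unstratified version; the function name and the seed are assumptions for the example.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the K iterations."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into K mutually exclusive folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n_samples=10, k=5))
for iteration, (train, test) in enumerate(splits, start=1):
    print(iteration, sorted(test), sorted(train))
```

Every sample lands in exactly one test fold, so each one is used for testing once and for training K – 1 times.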

 Stratified K-Fold cross-validation
 If stratification is also used, it is known as stratified
K-fold cross-validation for classification. Ultimately,
the K error estimates obtained from the K iterations are
averaged to give an overall error estimate.
K = 10 folds is the standard number employed to
predict the error rate of a learning method. Out of
the resulting 10 machines, the one with the lowest error
may be deployed.

 Leave-one-out Cross-validation
 Only a single sample is left out for the test in each
iteration. The learning machine is trained on the
remainder of the samples. It is judged by its
accuracy on the left-out sample. The average of all
outcomes of all N judgements in N iterations is
taken, and this is the average which is
representative of the error-estimate.

 Leave-one-out Cross-validation
 The computational expense of this process is quite
high as the whole learning process has to be
iterated N times, and this is generally not feasible
for big datasets. Nevertheless, leave-one-out seems
to present an opportunity to squeeze the maximum
out of a small dataset and obtain an estimate that is
as precise as possible. This process disallows
stratification.
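
A minimal sketch of leave-one-out, assuming the simplest possible learning machine: predicting the mean of the training targets. The data and the use of squared error as the judgement are assumptions for the example.

```python
def loocv_mse(ys):
    """Leave-one-out estimate of squared error for a mean predictor."""
    errors = []
    for i, y_left_out in enumerate(ys):
        rest = ys[:i] + ys[i + 1:]           # train on the other N-1 samples
        prediction = sum(rest) / len(rest)   # "training" is just averaging
        errors.append((y_left_out - prediction) ** 2)
    return sum(errors) / len(errors)         # average over the N judgements

ys = [1.0, 2.0, 3.0]
score = loocv_mse(ys)
print(score)
```

With N samples the model is refit N times, which is why the cost grows quickly for large datasets.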

 Bootstrapping
 The bootstrap technique is based on the process of
sampling with replacement. In the earlier techniques,
an instance that had been chosen once could
not be chosen again. However, most learning
techniques can employ an instance several times,
and it affects the learning outcome if it is present
in the training set more than once. Bootstrapping
samples the dataset with
replacement to form a training set and a test
set.

 Bootstrapping
 There are many bootstrap techniques. The most
popular one is the 0.632 bootstrap, which works as
follows:
 A dataset of N instances is sampled N times with
replacement to give another dataset of N
instances, the bootstrap sample, which serves as a training set of
N samples. As certain elements in the bootstrap sample
will be repeated, there will be certain instances in the
original dataset D that have not been selected – these
are used as test instances. If we attempt this many
times, on average, 63.2% of the original data
instances will appear in the bootstrap sample and the
remaining 36.8% will form the test set (hence
the name, 0.632 bootstrap).
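
The 63.2% figure follows from a short calculation: the chance that a given instance is missed in all N draws is (1 – 1/N)^N, which approaches e⁻¹ ≈ 0.368 for large N. The sketch below checks this by simulation; the dataset size, the number of repetitions, and the seed are assumptions for the example.

```python
import random

def bootstrap_split(n, rng):
    """Sample n indices with replacement; unselected indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

rng = random.Random(0)
n = 1000
fractions = []
for _ in range(200):
    train, test = bootstrap_split(n, rng)
    fractions.append(len(set(train)) / n)  # share of distinct instances used

avg_unique = sum(fractions) / len(fractions)
expected = 1 - (1 - 1 / n) ** n

print(f"simulated: {avg_unique:.4f}  theoretical: {expected:.4f}")
```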

 Metrics for Assessing Regression (Numeric
Prediction) Accuracy
 The task is to find a model h(x) that explains the
underlying data, that is, y ≈ h(x) for all samples (x, y).
Equivalently, the task is to approximate function
f(x) with unknown properties by h(x).
 Estimating the error in prediction using holdout
and random subsampling, cross-validation and
bootstrap methods are common techniques for
assessing accuracy of the predictor. Several
alternative metrics can be used to assess the
accuracy of numeric prediction.

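 Mean Square Error (MSE): The mean is obtained
from the data as the arithmetic average of the squared errors,
 $\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - h(x^{(i)}))^2$
 Root Mean Square Error (RMSE):
 Taking the square root yields,
 $\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y^{(i)} - h(x^{(i)}))^2}$

 Sum-of-error-squares:
 Sometimes the total error, and not the average, is taken
for mathematical manipulation by some statistical
/ machine learning techniques:
 $\text{SSE} = \sum_{i=1}^{N} (y^{(i)} - h(x^{(i)}))^2$

 Mean Absolute Error: Measures the average
deviation of the predicted value from the true
value
 $\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y^{(i)} - h(x^{(i)})|$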
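
A minimal sketch of the four regression metrics, using their standard definitions over true values y and predictions h(x). The function names and the sample values are assumptions for the example.

```python
import math

def sse(y_true, y_pred):
    """Sum of squared errors (total, not averaged)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

def mse(y_true, y_pred):
    """Mean square error: SSE divided by the number of samples."""
    return sse(y_true, y_pred) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: average absolute deviation."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

print("SSE :", sse(y_true, y_pred))
print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("MAE :", mae(y_true, y_pred))
```

Because errors are squared before averaging, MSE and RMSE penalize the single residual of 2 more heavily than MAE does.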

 Metrics for Assessing Classification (Pattern
Recognition) Accuracy
 The basic principles – use of an independent test
dataset instead of the training set to evaluate
performance, the holdout technique and cross-
validation – are equally applicable to classification.
The errors in numeric prediction arise in various
sizes whereas in classification, errors simply exist
or are absent.
 Several different measures can be used to assess
the accuracy of a classifier. They are:

 Misclassification Error:
 The metric for assessing the accuracy of
classification algorithms is the number of samples
misclassified by the model h(w, x). For example, for
binary classification problems,
 y(i) ϵ {0, 1}, and h(w, x(i)) = ŷ(i) ϵ {0, 1};
 i = 1, 2, …, N.
 For 0% error, (y(i) - ŷ(i)) = 0 for all data points.

 Misclassification Error:
 This accuracy measure works well for the
situations where class tuples are more or less evenly
distributed. However, when the classes are
imbalanced, decisions made on classifications based
on misclassification error lead to poor
performance.
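
A small illustration of why misclassification error misleads on imbalanced classes. The 5% positive rate and the trivial always-negative classifier are assumptions chosen to make the point.

```python
def misclassification_error(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for y, y_hat in zip(y_true, y_pred) if y != y_hat)
    return wrong / len(y_true)

# Imbalanced toy problem: 5 positives among 100 samples.
y_true = [1] * 5 + [0] * 95

# A useless classifier that predicts the majority class for everything.
always_negative = [0] * 100

error = misclassification_error(y_true, always_negative)
positives_found = sum(
    1 for y, y_hat in zip(y_true, always_negative) if y == 1 and y_hat == 1
)

print(f"misclassification error: {error:.2%}")
print(f"positives detected: {positives_found} of 5")
```

The error looks low, yet the classifier never detects the minority class, which is often the class of interest.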

 Log Loss: A loss function is a method of evaluating
how well our algorithm models our dataset. Log
Loss takes into account the uncertainty of
prediction based on how much it varies from the
actual label.

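 Log Loss: Log loss is a straightforward modification
of the log-likelihood function. With maximization
transformed to minimization, the log loss for one
sample in binary classification, with label y ϵ {0, 1} and
predicted probability p = P(y = 1 | x), is given by:
 $L(y, p) = -[\,y \log p + (1 - y) \log(1 - p)\,]$

 Log Loss: The cost function is taken as the average
of the loss over the entire dataset. Therefore, the log loss
metric for classification tasks is expressed as:
 $\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} [\,y^{(i)} \log p^{(i)} + (1 - y^{(i)}) \log(1 - p^{(i)})\,]$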

 Log Loss:
 Note that the log loss of a sample is low when the
predicted probability of its actual class is high, indicating that the
prediction matches the actual value. The log loss
increases as that predicted probability reduces;
probabilities close to 0 would be bad and result in
a high loss value.
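
A minimal sketch of binary log loss, assuming labels in {0, 1} and predictions given as the probability of class 1. The clipping constant `eps` is an implementation detail added here to avoid log(0); it is not part of the definition.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Same two labels, scored with confident-correct and confident-wrong predictions.
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])

print(f"confident and right: {confident_right:.4f}")
print(f"confident and wrong: {confident_wrong:.4f}")
```

Confident wrong predictions are penalized far more heavily than confident right ones are rewarded, which is the behaviour the note above describes.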

 Cross Entropy:
 Cross entropy is a measure from the field of
information theory. Although the two measures –
log loss and cross entropy – are derived from
different fields, when used as cost functions for
classification models, both the measures calculate
the same quantity and can be used
interchangeably.
