Big Data Analytics - Unit 3.pptx

Classification
 Supervised v/s Unsupervised Learning
 Supervised learning (classification):
 Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the
observations.
 New data is classified based on the training
set
 Unsupervised learning (clustering):
 The class labels of the training data are unknown.
 Given a set of measurements, observations,
etc., the aim is to establish the existence
of classes or clusters in the data.

What is Classification and
Prediction?
 There are two forms of data analysis that can be used
for extracting models describing important classes or to
predict future data trends. These two forms are as
follows −
➢ Classification
➢ Prediction
 Classification models predict categorical class labels,
while prediction models predict continuous-valued
functions.
 For example, we can build a classification model to
categorize bank loan applications as either safe or risky,
or a prediction model to predict the expenditures in
dollars of potential customers on computer equipment
given their income and occupation.

WHAT IS CLASSIFICATION?
 Following are the examples of cases where
the data analysis task is Classification −
➢ A bank loan officer wants to analyze the data
in order to know which customers (loan
applicants) are risky and which are safe.
➢ A marketing manager at a company needs to
predict whether a customer with a given profile
will buy a new computer.
 In both of the above examples, a model or
classifier is constructed to predict the
categorical labels. These labels are risky or
safe for loan application data and yes or no
for marketing data

WHAT IS PREDICTION?
 Following are the examples of cases where
the data analysis task is Prediction −
 Suppose the marketing manager needs to
predict how much a given customer will
spend during a sale at the company. In this
example we need to predict a
numeric value. Therefore the data analysis
task is an example of numeric prediction. In
this case, a model or a predictor will be
constructed that predicts a continuous-valued
function or ordered value.
 Note − Regression analysis is a statistical
methodology that is most often used for
numeric prediction
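
As a minimal sketch of numeric prediction by regression, the following fits a straight line by ordinary least squares. The income and spending figures are made-up illustrative values, and `fit_line` is a hypothetical helper name, not something taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# Made-up data: customer income (thousands) vs. amount spent in a sale (dollars).
income = [20, 30, 40, 50]
spend = [110, 160, 210, 260]

a, b = fit_line(income, spend)
predicted = a + b * 60  # continuous-valued prediction for a new customer
print(predicted)
```

The output is a number rather than a class label, which is what separates prediction from classification here.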

Classification – A 2 step
process
 Model construction: describing a set of predetermined
classes
◦ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
◦ The set of tuples used for model construction is the training set
◦ The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
◦ The known label of each test sample is compared with the class
predicted by the model
◦ Accuracy rate is the percentage of test set samples that are correctly
classified by the model
◦ The test set must be independent of the training set; otherwise
over-fitting goes undetected and the accuracy estimate is too optimistic
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
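
The two steps above can be sketched as follows. This is only an illustration: a 1-nearest-neighbour rule stands in for the classifier (the slides do not prescribe one), and the (income, label) tuples are made up.

```python
def train_1nn(training_set):
    """Model construction: 1-NN simply memorises the labelled tuples."""
    return list(training_set)


def classify(model, x):
    """Model usage: return the label of the closest training tuple."""
    nearest = min(model, key=lambda item: abs(item[0] - x))
    return nearest[1]


def accuracy(model, test_set):
    """Fraction of test tuples whose known label matches the prediction."""
    correct = sum(1 for x, label in test_set if classify(model, x) == label)
    return correct / len(test_set)


# Made-up (income in thousands, class label) tuples.
training_set = [(15, "risky"), (22, "risky"), (48, "safe"), (60, "safe")]
# The test set is kept separate from the training set.
test_set = [(18, "risky"), (55, "safe"), (30, "safe"), (70, "safe")]

model = train_1nn(training_set)
acc = accuracy(model, test_set)
print(acc)
```

If the accuracy is judged acceptable, `classify` would then be applied to tuples whose labels are not known.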

Classification of Data Mining
Techniques
 Classification of data mining
frameworks based on the type of data
sources mined:
◦ Here the classification is as per the type
of data. E.g.: multimedia, text data, spatial
data, time series data, WWW, etc.
 Classification of data mining
frameworks based on the database
involved:
◦ Here the classification is as per the data
model involved. E.g.: object-oriented
database, transactional database,

Classification of Data Mining
Techniques
 Classification of data mining
frameworks as per the kind of
knowledge discovered:
◦ This classification depends on the types
of knowledge discovered. E.g.:
discrimination, classification, clustering,
characterization, etc.
 Classification of data mining
frameworks based on the data mining
techniques used:
◦ This classification is based on the data
analysis approach utilized. E.g.: neural

Issues regarding Classification
and Prediction
 The major issue is preparing the data for
Classification and Prediction. Preparing the
data involves the following activities −
➢ Data Cleaning − Data cleaning involves
removing noise and treating missing
values. Noise is removed by applying
smoothing techniques, and the problem of
missing values is solved by replacing a missing
value with the most commonly occurring value for
that attribute.
➢ Relevance Analysis − The database may also
contain irrelevant attributes. Correlation
analysis is used to determine whether any two given
attributes are related.
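
A minimal sketch of both preparation steps, under stated assumptions: missing values are marked with `None`, the attribute values are made up, and Pearson's coefficient is used as one common form of correlation analysis (the slides do not name a specific measure).

```python
from collections import Counter
from math import sqrt


def fill_missing_with_mode(values, missing=None):
    """Replace missing entries with the most commonly occurring value."""
    observed = [v for v in values if v is not missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up attribute column with two missing values.
humidity = ["High", None, "Normal", "High", None]
cleaned = fill_missing_with_mode(humidity)

# Two made-up numeric attributes; the second is exactly twice the first,
# so one of them carries no extra information.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
print(cleaned, r)
```

A coefficient close to +1 or -1 suggests the two attributes are strongly related, so one of them may be redundant.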

Issues regarding Classification
and Prediction
➢ Data Transformation and Reduction −
The data can be transformed by any of the
following methods.
◦ Normalization − Normalization involves
scaling all values of a given attribute so that
they fall within a small specified range.
It is used when the learning step relies on
neural networks or on methods involving
distance measurements.
◦ Generalization − The data can also be
transformed by generalizing it to a higher-level
concept. For this purpose we can use
concept hierarchies.
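
Normalization can be sketched with min-max scaling, one common way of mapping values into a small range. The function name, the default range [0, 1] and the income values are assumptions made for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Made-up income attribute.
incomes = [12000, 30000, 48000, 84000]
scaled = min_max_normalize(incomes)
print(scaled)
```

After scaling, an attribute with large raw values (such as income) no longer dominates distance computations over attributes with small raw values.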

Comparison of Classification and
Prediction Methods
 Here are the criteria for comparing the methods of
Classification and Prediction −
➢ Accuracy − Accuracy of a classifier refers to its ability
to predict the class label correctly; accuracy of a
predictor refers to how well it can estimate the value of
the predicted attribute for new data.
➢ Speed − This refers to the computational cost of
generating and using the classifier or predictor.
➢ Robustness − This refers to the ability of the classifier or
predictor to make correct predictions from noisy
data.
➢ Scalability − This refers to the ability to
construct the classifier or predictor efficiently, given
a large amount of data.
➢ Interpretability − This refers to the extent to which the
classifier or predictor can be understood.

Naïve Bayes Classifier
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
PlayTennis is the target variable, with output Yes / No.
New instance to be classified as Yes / No:
Outlook = Sunny, Temperature = Cool,
Humidity = High, Wind = Strong
 Prior probabilities
 P(PlayTennis = Yes) = 9/14 = 0.64
 P(PlayTennis = No) = 5/14 = 0.36
 Conditional probabilities of individual
attributes:
 There are 4 attributes: Outlook,
Temperature, Humidity and Wind
 Find the conditional probability of each
attribute value given each class (Y = Yes, N = No).
Outlook      Y    N
Sunny        2/9  3/5
Overcast     4/9  0/5
Rain         3/9  2/5
Temperature  Y    N
Hot          2/9  2/5
Mild         4/9  2/5
Cool         3/9  1/5
Humidity     Y    N
High         3/9  4/5
Normal       6/9  1/5
Wind         Y    N
Strong       3/9  3/5
Weak         6/9  2/5

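For the new instance:
Score(Yes) = P(Yes) × P(Sunny|Yes) × P(Cool|Yes) × P(High|Yes) × P(Strong|Yes)
= 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.0053
Score(No) = P(No) × P(Sunny|No) × P(Cool|No) × P(High|No) × P(Strong|No)
= 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ 0.0206
The probability that the person will play tennis is less than the probability that he will
not play tennis. Hence the conclusion is that he will not play tennis.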
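
The same calculation can be reproduced from the PlayTennis table with a short script. This is a sketch of the plain Naïve Bayes computation without any smoothing; the function name `score` is an assumption, and exact fractions are used so the counts stay visible.

```python
from fractions import Fraction

# Rows D1..D14: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def score(instance, label):
    """Prior times the product of per-attribute conditional probabilities."""
    rows = [r for r in DATA if r[-1] == label]
    result = Fraction(len(rows), len(DATA))            # prior P(label)
    for i, value in enumerate(instance):
        matches = sum(1 for r in rows if r[i] == value)
        result *= Fraction(matches, len(rows))         # P(value | label)
    return result


instance = ("Sunny", "Cool", "High", "Strong")
score_yes = score(instance, "Yes")   # 9/14 * 2/9 * 3/9 * 3/9 * 3/9
score_no = score(instance, "No")     # 5/14 * 3/5 * 1/5 * 4/5 * 3/5
prediction = "Yes" if score_yes > score_no else "No"
p_no = score_no / (score_yes + score_no)   # normalised probability of "No"
print(prediction, float(score_yes), float(score_no), float(p_no))
```

Normalising the two scores gives roughly 0.80 for No against 0.20 for Yes, which matches the conclusion above.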

Classification by
Backpropagation

Back Propagation
 Features of Back-propagation:
 It uses the gradient descent method
 It differs from other networks in the way
the weights are calculated during the learning period of the
network.
 Training is done in three stages:
◦ the feed-forward of the input training pattern
◦ the calculation and back-propagation of the error
◦ the updating of the weights

Back Propagation Algorithm
 Step 1: Inputs X arrive through the pre-connected path.
 Step 2: The input is modeled using real-valued weights W. The weights are
usually initialized randomly.
 Step 3: Calculate the output of each neuron, from the input layer through
the hidden layer to the output layer.
 Step 4: Calculate the error in the outputs:
 Back-propagation error = Actual Output – Desired Output
 Step 5: From the output layer, go back to the hidden layer and adjust
the weights to reduce the error.
 Step 6: Repeat the process until the desired output is achieved.
 Parameters:
 x = input training vector, x = (x1, x2, …, xn).
 t = target vector, t = (t1, t2, …, tn).
 δk = error at output unit k.
 δj = error at hidden unit j.
 α = learning rate.
 V0j = bias of hidden unit j.
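
The steps and parameters above can be sketched for a network with one hidden layer and one output unit. Several choices here are assumptions, not taken from the slides: sigmoid activations, squared error, two hidden units, per-sample updates, a fixed random seed, and the logical OR function as toy training data. The code uses (target - output) in its error terms, so the updates are additions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, n_hidden=2, alpha=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # v[j][i]: weight from input i to hidden unit j; v0[j]: bias of hidden unit j
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    v0 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    # w[j]: weight from hidden unit j to the output unit; w0: output bias
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w0 = rng.uniform(-0.5, 0.5)

    def forward(x):
        z = [sigmoid(v0[j] + sum(v[j][i] * x[i] for i in range(n_in)))
             for j in range(n_hidden)]
        y = sigmoid(w0 + sum(w[j] * z[j] for j in range(n_hidden)))
        return z, y

    def total_error():
        return sum((t - forward(x)[1]) ** 2 for x, t in samples)

    initial_error = total_error()
    for _ in range(epochs):
        for x, t in samples:
            z, y = forward(x)                        # stage 1: feed-forward
            delta_k = (t - y) * y * (1 - y)          # stage 2: output error term
            delta_j = [delta_k * w[j] * z[j] * (1 - z[j])
                       for j in range(n_hidden)]     # error terms at hidden units
            for j in range(n_hidden):                # stage 3: weight updates
                w[j] += alpha * delta_k * z[j]
                v0[j] += alpha * delta_j[j]
                for i in range(n_in):
                    v[j][i] += alpha * delta_j[j] * x[i]
            w0 += alpha * delta_k
    return forward, initial_error, total_error()


# Toy training data: the logical OR function.
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
forward, err_before, err_after = train(or_samples)
print(err_before, err_after)
```

Comparing `err_before` with `err_after` shows the total squared error shrinking as the weights are adjusted.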

Types of Back-Propagation
 Static back-propagation: a network designed
to map static inputs to static
outputs.
 E.g.: OCR (Optical Character Recognition)
 Recurrent back-propagation: activation is
fed forward until a
fixed value is reached.
 Static back-propagation provides an instant mapping,
while recurrent back-propagation does not.

 Advantages:
 It is simple, fast, and easy to program.
 It is flexible and efficient.
 No need for users to learn any special functions.
 Disadvantages:
 It is sensitive to noisy data and irregularities. Noisy
data can lead to inaccurate results.
 Performance is highly dependent on input data.
 Training can take a long time.

Prediction
1. Predictive Modeling and Machine Learning
Machine learning takes weather data and builds
relationships between the available data and the
relevant predictors.
2. Data – A Crucial Part of Weather Predictions
3. Weather Data – An Aid for many Events
Predicting floods, sports events, car sales and
asthma attacks.
4. Satellite Imagery and Sensor Data
Example: A record 1.2 million people (equal to the
population of Mauritius) were evacuated in less
than 48 hours, thanks to the work of data scientists,
ahead of one of the strongest cyclones to have hit India in the
last 20 years.

Accuracy
 True Positive – Model correctly predicts
the positive class.
 True Negative – Model correctly predicts
the negative class.
 False Positive – Model incorrectly
predicts the positive class.
 False Negative – Model incorrectly
predicts the negative class.
 Accuracy = (No. of correct predictions) / (Total number of predictions)
= (TP + TN) / (TP + TN + FP + FN)
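
A short sketch of how the four counts and the accuracy could be computed from lists of actual and predicted labels. The label lists are made up, and treating "Yes" as the positive class is an assumption.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (TP, TN, FP, FN) for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, really positive
        elif p != positive and a != positive:
            tn += 1          # predicted negative, really negative
        elif p == positive:
            fp += 1          # predicted positive, really negative
        else:
            fn += 1          # predicted negative, really positive
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels for eight test samples.
actual = ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)
```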

Clustering
 Clustering is an unsupervised machine learning
technique.
 Clustering uses only the input data to determine
patterns, anomalies or similarities in it.
 A good clustering algorithm aims to obtain clusters
in which:
◦ The intra-cluster similarity is high, meaning
that the data inside a cluster are similar to
one another.
◦ The inter-cluster similarity is low, meaning
that each cluster holds data that are not similar to
the data in other clusters.
 What is a Cluster?
◦ A cluster is a subset of similar objects.

Clustering
 Clustering is the grouping of objects based on their
characteristics and their similarities.
 A good clustering algorithm is able to identify
clusters independent of cluster shape.
 The 3 basic stages of a clustering algorithm are:
◦ Raw Data
◦ Clustering Algorithm
◦ Clusters of Data

Methods of Clustering in Data
Mining
 A data set can be partitioned into clusters
in many ways.
 Methods of Clustering in Data Mining
◦ 1. Partitioning based method
◦ 2. Density-based method
◦ 3. Centroid-based method
◦ 4. Hierarchical method
◦ 5. Grid-based method
◦ 6. Model-based method

Partitioning based Method
 A partitioning algorithm divides the data into several subsets (groups).
 Suppose the database holds n objects and the algorithm builds
k partitions of the data, with k ≤ n.
 Each group must contain at least one object, and
every object must belong to exactly one group.

Density based Method
 The algorithm produces clusters from high-density
regions of the data space, separated by regions of
lower point density.
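
As an illustration, here is a compact sketch of DBSCAN, one well-known density-based algorithm (the slides do not name a specific one). It works on 1-D values to stay short; `eps`, `min_pts` and the sample points are assumed values.

```python
def dbscan(points, eps, min_pts):
    """Label each 1-D point with a cluster id, or -1 for noise."""
    labels = {}  # point index -> cluster id (or -1)

    def neighbours(i):
        # All points (including i itself) within distance eps of point i.
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # not dense enough: noise for now
            continue
        labels[i] = cluster_id             # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id     # former noise becomes a border point
            if j in labels:
                continue
            labels[j] = cluster_id
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:       # j is also a core point: keep expanding
                queue.extend(nbrs)
        cluster_id += 1
    return [labels[i] for i in range(len(points))]


# Two dense groups and one isolated point.
points = [1.0, 1.2, 1.4, 5.0, 5.1, 5.3, 9.0]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

The isolated point ends up labelled -1 (noise) because its neighbourhood is too sparse to form or join a cluster.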

Centroid Based Method
 In centroid-based clustering, clusters are
formed by the closeness of data points to
the centroid of each cluster.
 The cluster center, i.e. the centroid, is placed so
that the distance from the data points to the
center is minimal.
 Each cluster is represented by a vector of
values (its centroid) in this type of grouping technique.
 The number of groups must be defined in advance.
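
A minimal sketch of k-means, a widely used centroid-based algorithm (named here as an example; the slides do not specify one). The points and the two starting centroids are made up, and the number of clusters is fixed by the number of starting centroids supplied.

```python
from math import dist


def k_means(points, centroids, max_iters=100):
    """Lloyd's algorithm; `centroids` holds the k initial centres."""
    clusters = []
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break                          # assignments have stabilised
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Because the result depends on the starting centroids and on k, both usually have to be chosen with some care in practice.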

Hierarchical Method
 Hierarchical clustering is a method of
cluster analysis that seeks to build a hierarchy of
clusters, i.e. a tree-type structure.
 Agglomerative Clustering: Also known as
bottom-up approach or hierarchical agglomerative
clustering (HAC).
 This clustering algorithm does not require us to
prespecify the number of clusters.
 Bottom-up algorithms treat each data point as a
singleton cluster at the outset and then
successively agglomerate (merge) pairs of clusters until all
clusters have been merged into a single cluster
that contains all the data.
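
The bottom-up procedure can be sketched as below. Single linkage (distance between the closest members of two clusters) and 1-D values are assumptions chosen to keep the example short; `target_clusters` is a hypothetical parameter that lets the merging stop early instead of running down to one cluster.

```python
def agglomerate(points, target_clusters=1):
    """Single-linkage bottom-up clustering on 1-D values."""
    clusters = [[p] for p in points]      # every point starts as a singleton
    merges = []                           # record of (cluster_a, cluster_b, distance)
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [1, 2, 6, 7, 15]
two_clusters, _ = agglomerate(points, target_clusters=2)
one_cluster, merges = agglomerate(points)
print(two_clusters)
print(merges)
```

The `merges` list records the order in which clusters were joined, which is the information a dendrogram would display.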

Hierarchical Agglomerative
Approach

Divisive Approach
 Also known as a top-down approach.
 This algorithm also does not require us to prespecify
the number of clusters.
 Top-down clustering requires a method for splitting
a cluster. It starts with one cluster that contains the whole data set and
proceeds by splitting clusters recursively until
each data point is in its own singleton
cluster.

Grid-Based Method
 The grid is divided based on the characteristics of the
data.
 With this method, non-numeric data is easy to
manage.
 Data order does not affect the partitioning of the
grid.
 An important advantage is fast execution
time.

Model-Based Method
 In this method, a model based on a
probability distribution is hypothesized for the data.
 The method locates the clusters by estimating
the density function of the data.

Applications of Clustering in Data
Mining
 Clustering analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image
processing.
 Clustering can also help marketers discover distinct groups in their
customer base and characterize those groups
based on their purchasing patterns.
 In the field of biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities and gain
insight into structures inherent to populations.
 Clustering also helps in identification of areas of similar land use in an
earth observation database. It also helps in the identification of groups
of houses in a city according to house type, value, and geographic
location.
 Clustering also helps in classifying documents on the web for
information discovery.
 Clustering is also used in outlier detection applications such as
detection of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain
