Big Data Analytics - Unit 3
Classification
 Supervised v/s Unsupervised Learning
 Supervised learning (classification):
 Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the
observations.
 New data is classified based on the training
set
 Unsupervised learning (clustering):
 The class labels of the training data are unknown
 Given a set of measurements, observations,
etc., the aim is to establish the existence
of classes or clusters in the data
What is Classification and
Prediction?
 There are two forms of data analysis that can be used
for extracting models describing important classes or to
predict future data trends. These two forms are as
follows −
➢ Classification
➢ Prediction
 Classification models predict categorical class labels;
and prediction models predict continuous valued
functions.
 For example, we can build a classification model to
categorize bank loan applications as either safe or risky,
or a prediction model to predict the expenditures in
dollars of potential customers on computer equipment
given their income and occupation.
WHAT IS CLASSIFICATION?
 Following are the examples of cases where
the data analysis task is Classification −
➢ A bank loan officer wants to analyze the data
in order to know which customers (loan
applicants) are risky and which are safe.
➢ A marketing manager at a company needs to
analyze customer data to predict whether a customer
with a given profile will buy a new computer.
 In both of the above examples, a model or
classifier is constructed to predict the
categorical labels. These labels are risky or
safe for loan application data and yes or no
for marketing data
WHAT IS PREDICTION?
 Following are the examples of cases where
the data analysis task is Prediction −
 Suppose the marketing manager needs to
predict how much a given customer will
spend during a sale at the company. In this
example we need to predict a
numeric value. Therefore the data analysis
task is an example of numeric prediction. In
this case, a model or a predictor will be
constructed that predicts a continuous-valued
function, or ordered value.
 Note − Regression analysis is a statistical
methodology that is most often used for
numeric prediction
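As a minimal sketch of numeric prediction by regression, the following fits a least-squares line to a handful of (income, spend) pairs and predicts a continuous value for an unseen customer. The data values and the names `fit_line`, `incomes` and `spend` are illustrative assumptions, not part of the original slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical training data: income (thousands) -> spend during a sale (dollars)
incomes = [20, 30, 40, 50]
spend = [110, 160, 210, 260]

a, b = fit_line(incomes, spend)
predicted = a + b * 60  # continuous-valued prediction for an unseen customer
```

Unlike a classifier, the output here is a number on a continuous scale rather than one of a fixed set of labels.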
Classification – A 2 step
process
 Model construction: describing a set of predetermined
classes
◦ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
◦ The set of tuples used for model construction is the training set
◦ The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
◦ The known label of test sample is compared with the classified result
from the model
◦ Accuracy rate is the percentage of test set samples that are correctly
classified by the model
◦ The test set must be independent of the training set; otherwise the accuracy
estimate is overly optimistic and over-fitting goes undetected
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
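The two steps can be sketched as follows, assuming a deliberately simple classifier that learns one rule per attribute value (the majority class seen in training). The loan data, the single `income level` attribute and the function names are hypothetical and only illustrate the train / test / use workflow.

```python
from collections import Counter, defaultdict


def build_model(training_set):
    """Step 1 (model construction): one rule per attribute value."""
    votes = defaultdict(Counter)
    for value, label in training_set:
        votes[value][label] += 1
    # Each rule maps an attribute value to its majority class label.
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}


def accuracy(model, test_set, default):
    """Step 2 (accuracy estimate): percentage of test samples classified correctly."""
    correct = sum(1 for value, label in test_set
                  if model.get(value, default) == label)
    return 100.0 * correct / len(test_set)


# Hypothetical labeled tuples: (income level, class label)
training_set = [("low", "risky"), ("low", "risky"), ("low", "safe"),
                ("high", "safe"), ("high", "safe"), ("medium", "safe")]
# The test set is kept separate from the training set.
test_set = [("low", "risky"), ("high", "safe"),
            ("medium", "risky"), ("high", "safe")]

model = build_model(training_set)
acc = accuracy(model, test_set, default="safe")

# If the accuracy is acceptable, classify a tuple whose label is unknown.
new_label = model.get("low", "safe")
```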
Model Construction
Using the Model in Prediction
Classification of Data Mining
Techniques
 Classification of Data mining
frameworks based on the type of data
sources mined:
◦ Here the classification is as per the type
of data. Eg: multimedia, text data, spatial
data, time series data, www etc.
 Classification of Data mining
frameworks based on database
involved
◦ Here the classification is as per the data
model involved. Eg: Object-oriented
database, transactional database,
Classification of Data Mining
Techniques
 Classification of Data mining
frameworks as per the kind of
knowledge discovered:
◦ This classification depends on the types
of knowledge discovered Eg:
Discrimination, classification, clustering,
characterization etc.
 Classification of data mining
frameworks based on data mining
techniques used:
◦ This classification is based on the data
analysis approach utilized. Eg: neural
Issues regarding Classification
and Prediction
 The major issue is preparing the data for
Classification and Prediction. Preparing the
data involves the following activities −
 ➢ Data Cleaning − Data cleaning involves
removing noise and treating missing
values. Noise is removed by applying
smoothing techniques, and the problem of
missing values is solved by replacing a missing
value with the most commonly occurring value for
that attribute.
 ➢ Relevance Analysis − Database may also
have irrelevant attributes. Correlation
analysis is used to determine whether any two given
attributes are related.
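A minimal sketch of the two activities above, assuming missing values are stored as `None` and that relatedness between two numeric attributes is measured with the Pearson correlation coefficient (one common choice; the slides do not name a specific measure). The sample values are hypothetical.

```python
import math
from collections import Counter


def fill_missing_with_mode(values):
    """Replace None entries with the most commonly occurring value."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Data cleaning: fill the gaps in a categorical attribute
humidity = ["High", None, "Normal", "High", None]
cleaned = fill_missing_with_mode(humidity)

# Relevance analysis: a coefficient near +1 or -1 suggests the two
# attributes carry largely the same information
age = [25, 35, 45, 55]
years_employed = [3, 13, 23, 33]
r = pearson(age, years_employed)
```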
Issues regarding Classification
and Prediction
 ➢ Data Transformation and reduction −
The data can be transformed by any of the
following methods.
◦ ■ Normalization − The data is transformed
using normalization. Normalization involves
scaling all values of a given attribute so that
they fall within a small specified range.
Normalization is used when the learning step
involves neural networks or methods based on
distance measurements.
◦ ■ Generalization − The data can also be
transformed by generalizing it to a higher-level
concept. For this purpose we can use
concept hierarchies.
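As an illustration of normalization, here is a small min-max scaling sketch that maps an attribute onto the range [0, 1]. Min-max scaling is only one normalization scheme, chosen here as an assumption; the function name and the income values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / span
            for v in values]


incomes = [12000, 35000, 58000, 104000]
scaled = min_max_normalize(incomes)
```

After scaling, an attribute with large raw values (such as income) no longer dominates attributes measured on smaller scales.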
Comparison of Classification and
Prediction Methods
 Here are the criteria for comparing the methods of
Classification and Prediction −
 ➢ Accuracy − The accuracy of a classifier refers to its ability
to predict the class label correctly; the accuracy of a
predictor refers to how well it can estimate the value of
the predicted attribute for new data.
 ➢ Speed − This refers to the computational cost of
generating and using the classifier or predictor.
 ➢ Robustness − This refers to the ability of the classifier or
predictor to make correct predictions from noisy
data.
 ➢ Scalability − This refers to the ability to
construct the classifier or predictor efficiently, given a
large amount of data.
 ➢ Interpretability − This refers to how easily the classifier
or predictor can be understood.
Classification and Prediction
Building Classification Model
Decision Tree
Decision Tree Algorithm
Naïve Bayes Classifier
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
PlayTennis is target variable with output Yes / No.
New Instance to be classified as Yes / No.
Outlook = Sunny , Temperature = Cool,
Humidity = High, Wind = Strong
 Prior Probability
 P(Play Tennis = yes) = 9 / 14 =
0.64
 P(Play Tennis = no) = 5 / 14 =
0.36
 Current Probability / conditional
probabilities of individual
attributes:
 4 attributes viz., – Outlook,
Temperature, Humidity and Wind
 Find conditional probabilities of
individual attributes.
Outlook Y N
Sunny 2/9 3/5
Overcast 4/9 0
Rainy 3/9 2/5
Temperature Y N
Hot 2/9 2/5
Mild 4/9 2/5
Cool 3/9 1/5
Humidity Y N
High 3/9 4/5
Normal 6/9 1/5
Wind Y N
Strong 3/9 3/5
Weak 6/9 2/5
Naïve Bayes Classifier
The probability that the person will play tennis is less than the probability that he will
not play tennis. Hence the conclusion is that he will not play tennis.
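The conclusion can be checked with a short sketch that recomputes the priors and conditional probabilities directly from the 14-row table and multiplies them for the new instance (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong). The function name is hypothetical, and no smoothing is applied, which is an assumption matching the raw fractions in the tables above.

```python
from fractions import Fraction

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def naive_bayes_scores(data, instance):
    """Return {label: prior * product of per-attribute conditionals}."""
    scores = {}
    for label in sorted({row[-1] for row in data}):
        rows = [row for row in data if row[-1] == label]
        score = Fraction(len(rows), len(data))         # prior P(label)
        for i, value in enumerate(instance):
            matches = sum(1 for row in rows if row[i] == value)
            score *= Fraction(matches, len(rows))      # P(attribute = value | label)
        scores[label] = score
    return scores


scores = naive_bayes_scores(DATA, ("Sunny", "Cool", "High", "Strong"))
prediction = max(scores, key=scores.get)
```

The two scores are unnormalized: 1/189 (about 0.0053) for Yes and 18/875 (about 0.0206) for No, so the instance is classified as No.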
Classification by
Backpropagation
Back Propagation
 Features of Back-propagation:
 It uses the gradient descent method
 It is different from other networks in respect to the process by
which the weights are calculated during the learning period of the
network.
 Training is done in the three stages :
◦ the feed-forward of input training pattern
◦ the calculation and back-propagation of the error
◦ Updating the weight
Back Propagation Algorithm
 Step 1: Inputs X arrive through the pre-connected path.
 Step 2: The input is modeled using real-valued weights W, which are
usually initialized randomly.
 Step 3: Calculate the output of each neuron, from the input layer through
the hidden layer to the output layer.
 Step 4: Calculate the error in the outputs:
 Back-propagation error = Actual Output – Desired Output
 Step 5: From the output layer, go back to the hidden layer and adjust
the weights to reduce the error.
 Step 6: Repeat the process until the desired output is achieved.
 Parameters:
 x = input training vector, x = (x1, x2, …, xn).
 t = target vector, t = (t1, t2, …, tn).
 δk = error at output unit k.
 δj = error at hidden unit j.
 α = learning rate.
 V0j = bias of hidden unit j.
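The steps can be sketched for a network with one hidden layer and a single output unit. This is a minimal illustration under stated assumptions, not the exact procedure from the slides: it assumes sigmoid activations and a squared-error loss, writes the error term as (target − output) so that the update adds α·δ, and uses hypothetical names (`V`, `v0`, `W`, `w0`, `forward`, `train_step`, `train`) for the weights and helpers.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, V, v0, W, w0):
    """Feed-forward pass: input -> hidden layer -> single output unit."""
    z = [sigmoid(v0[j] + sum(x[i] * V[i][j] for i in range(len(x))))
         for j in range(len(v0))]
    y = sigmoid(w0 + sum(z[j] * W[j] for j in range(len(z))))
    return z, y


def train_step(x, t, V, v0, W, w0, alpha):
    """Feed-forward, back-propagate the error, then update the weights."""
    z, y = forward(x, V, v0, W, w0)
    delta_k = (t - y) * y * (1.0 - y)                # error term at the output unit
    delta_j = [delta_k * W[j] * z[j] * (1.0 - z[j])  # error terms at the hidden units
               for j in range(len(z))]
    for j in range(len(z)):
        W[j] += alpha * delta_k * z[j]
        v0[j] += alpha * delta_j[j]                  # bias of hidden unit j
        for i in range(len(x)):
            V[i][j] += alpha * delta_j[j] * x[i]
    return w0 + alpha * delta_k  # V, v0, W change in place; the scalar bias is returned


def train(samples, n_hidden, alpha=0.5, epochs=1000, seed=0):
    """Repeat the three stages over all training patterns for a fixed number of epochs."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    v0 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    W = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w0 = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, t in samples:
            w0 = train_step(x, t, V, v0, W, w0, alpha)
    return V, v0, W, w0
```

A call such as `train([([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)], n_hidden=2)` would train the sketch on the logical OR function; a fixed epoch count stands in for "until the desired output is achieved".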
Types of Back-Propagation
 Static back-propagation: a network that maps a static
input to a static output.
 Eg: OCR (Optical Character Recognition)
 Recurrent back-propagation: activation is fed forward
until a fixed value is reached.
Static back-propagation provides an instant mapping,
while recurrent back-propagation does not.
 Advantages:
 It is simple, fast, and easy to program.
 It is flexible and efficient.
 Users do not need to learn any special functions.
 Disadvantages:
 It is sensitive to noisy data and irregularities; noisy
data can lead to inaccurate results.
 Performance is highly dependent on the input data.
 Training can take a long time.
Prediction
1. Predictive Modeling and Machine Learning
- Machine learning takes weather data and builds
relationships between the available data and the
relevant predictors.
2. Data – A Crucial Part of Weather Predictions
3. Weather Data – An Aid for Many Events
Predicting floods, sports events, car sales, and
asthma attacks.
4. Satellite Imagery and Sensor Data
Example: A record 1.2 million people (equal to the
population of Mauritius) were evacuated in less
than 48 hours thanks to the work of data scientists.
The cyclone involved was one of the strongest to have hit India in the
last 20 years.
Accuracy
Classification accuracy
 True Positive – The model correctly predicts
the positive class.
 True Negative – The model correctly predicts
the negative class.
 False Positive – The model incorrectly
predicts the positive class.
 False Negative – The model incorrectly
predicts the negative class.
 Accuracy = (Number of correct predictions) / (Total number of predictions)
= (TP + TN) / (TP + TN + FP + FN)
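A small sketch of how the four counts and the accuracy could be computed from lists of actual and predicted labels. The label lists and function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


actual = ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"]
predicted = ["Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No"]

tp, tn, fp, fn = confusion_counts(actual, predicted, positive="Yes")
acc = accuracy(tp, tn, fp, fn)
```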
Clustering
 Clustering is an unsupervised Machine Learning-based
Algorithm
 Clustering only utilizes input data, to determine
patterns, anomalies or similarities in its input data
 A good clustering algorithm aims to obtain clusters
whose:
◦ The intra-cluster similarity is high, meaning
that the data inside a cluster are similar to
one another.
◦ The inter-cluster similarity is low, meaning
that each cluster holds data that are not similar to the data
in other clusters.
 What is a Cluster?
◦ A cluster is a subset of similar objects
Clustering
 Grouping of specific objects based on their
characteristics and their similarities.
 A good clustering algorithm is able to identify the
cluster independent of cluster shape.
 3 basic stages of clustering algorithm are
Raw Data → Clustering Algorithm → Clusters of Data
Methods of Clustering in Data
Mining
 A data set can be partitioned into clusters
in many different ways.
 Methods of Clustering in Data Mining
◦ 1. Partitioning based method
◦ 2. Density-based method
◦ 3. Centroid-based method
◦ 4. Hierarchical method
◦ 5. Grid-based method
◦ 6. Model-based method
Partitioning based Method
 A partitioning algorithm divides the data into several subsets (partitions).
 Given n objects in the database, the algorithm builds k partitions
of the data, where k ≤ n.
 Each group must contain at least one object, and
every object must belong to exactly one group.
Density based Method
 The algorithm forms clusters from high-density
regions of the data space, separated by regions of
lower point density.
Centroid Based Method
 In centroid-based clustering, clusters are
formed by the closeness of data points to
the centroid of each cluster.
 The cluster center, i.e. the centroid, is chosen such
that the distance from the data points to the
center is minimal.
 Each cluster in this type of grouping technique is
represented by a vector of values (its centroid).
 The number of groups must be defined in advance.
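A minimal k-means sketch illustrates the centroid-based idea: assign each point to its nearest centroid, then move each centroid to the mean of its members. K-means is one well-known centroid-based algorithm, used here as an assumption since the slides do not name one. Seeding the centroids with the first k points and running a fixed number of iterations are simplifications, and the points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Return (centroids, clusters) after a fixed number of iterations."""
    # Assumption: the first k points seed the centroids (real
    # implementations usually pick the seeds more carefully).
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance to each centroid
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for j, members in enumerate(clusters):
            if members:
                # New centroid = coordinate-wise mean of the members
                centroids[j] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

Note that `k` has to be supplied by the caller, which is the "predefined number of groups" requirement in practice.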
Hierarchical Method
 Hierarchical clustering is a method of
cluster analysis that seeks to build a hierarchy of
clusters, i.e. a tree-type structure.
 Agglomerative Clustering: Also known as the
bottom-up approach or hierarchical agglomerative
clustering (HAC).
 This clustering algorithm does not require us to
prespecify the number of clusters.
 Bottom-up algorithms treat each data point as a
singleton cluster at the outset and then
successively agglomerate pairs of clusters until all
clusters have been merged into a single cluster
that contains all the data.
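The bottom-up procedure can be sketched for one-dimensional data. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members), which the slides do not specify; the `target_clusters` parameter and the sample points are likewise hypothetical.

```python
def agglomerative(points, target_clusters=1):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]   # every point starts as a singleton cluster
    merges = []                        # record of (left, right, distance) per merge
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


points = [1.0, 2.0, 6.0, 7.0, 15.0]
full_tree, merges = agglomerative(points)      # merge down to one cluster
two_clusters, _ = agglomerative(points, 2)     # stop early at two clusters
```

The recorded merge distances (1, 1, 4, 8 for this data) are what a dendrogram would plot; cutting the hierarchy at any level yields a clustering without fixing the number of clusters in advance.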
Hierarchical Agglomerative
Approach
Divisive Approach
 Also known as the top-down approach.
 This algorithm also does not require the number of
clusters to be prespecified.
 Top-down clustering requires a method for splitting
a cluster. It starts with one cluster that contains all the data and
proceeds by splitting clusters recursively until
each data point is in its own singleton
cluster.
Grid-Based Method
 Grid is divided based on the characteristics of the
data.
 By using this method, non-numeric data is easy to
manage.
 Data order does not affect the partitioning of the
grid.
 An important advantage is the faster execution
time.
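One way to picture the grid-based idea is to bin points into fixed-size cells and keep only the cells that hold enough points. This is a simplified sketch under assumptions: it uses two-dimensional numeric points, a uniform `cell_size` and a `min_points` density threshold, none of which come from the slides.

```python
from collections import defaultdict


def grid_cells(points, cell_size):
    """Map each point to the grid cell that contains it."""
    cells = defaultdict(list)
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y))
    return cells


def dense_cells(cells, min_points):
    """Keep only the cells holding at least min_points points."""
    return {key for key, members in cells.items() if len(members) >= min_points}


points = [(0.5, 0.5), (0.7, 0.2), (0.9, 0.9),
          (3.1, 3.4), (3.6, 3.2), (7.5, 0.3)]
cells = grid_cells(points, cell_size=1.0)
dense = dense_cells(cells, min_points=2)
```

Each point is touched once and later work operates on cells rather than points, which is where the speed advantage comes from; the cell a point lands in also does not depend on the order in which points are processed.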
Grid – Based Method
Model-Based Method
 In this method, a model is hypothesized for each cluster, based on
a probability distribution, and the data are fitted to it.
 The method locates clusters by constructing a density function
that reflects the spatial distribution of the data points.
Applications of Clustering in Data
Mining
 Clustering analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image
processing.
 Clustering can also help marketers discover distinct groups in their
customer base. And they can characterize their customer groups
based on the purchasing patterns.
 In the field of biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities and gain
insight into structures inherent to populations.
 Clustering also helps in identification of areas of similar land use in an
earth observation database. It also helps in the identification of groups
of houses in a city according to house type, value, and geographic
location.
 Clustering also helps in classifying documents on the web for
information discovery.
 Clustering is also used in outlier detection applications such as
detection of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain

More Related Content

Similar to Big Data Analytics - Unit 3.pptx

AI Algorithms
AI AlgorithmsAI Algorithms
AI Algorithms
Dr. C.V. Suresh Babu
 
classification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdfclassification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdf
321106410027
 
Classification and Prediction.pptx
Classification and Prediction.pptxClassification and Prediction.pptx
Classification and Prediction.pptx
SandeepAgrawal84
 
A Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of DiseasesA Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of Diseases
ijsrd.com
 
5. Machine Learning.pptx
5.  Machine Learning.pptx5.  Machine Learning.pptx
5. Machine Learning.pptx
ssuser6654de1
 
Detection Of Fraudlent Behavior In Water Consumption Using A Data Mining Base...
Detection Of Fraudlent Behavior In Water Consumption Using A Data Mining Base...Detection Of Fraudlent Behavior In Water Consumption Using A Data Mining Base...
Detection Of Fraudlent Behavior In Water Consumption Using A Data Mining Base...
pandavaTirumala
 
pradeep ppt final.pptx
pradeep ppt final.pptxpradeep ppt final.pptx
pradeep ppt final.pptx
pandavaTirumala
 
Artificial Neural Networks for data mining
Artificial Neural Networks for data miningArtificial Neural Networks for data mining
Artificial Neural Networks for data mining
ALIZAIB KHAN
 
Artificial Neural Networks for Data Mining
Artificial Neural Networks for Data MiningArtificial Neural Networks for Data Mining
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
Kai Koenig
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
Sơn Còm Nhom
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
AmAn Singh
 
Data mining chapter04and5-best
Data mining chapter04and5-bestData mining chapter04and5-best
Data mining chapter04and5-best
ABDUmomo
 
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESIMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
Vikash Kumar
 
classification in data warehouse and mining
classification in data warehouse and miningclassification in data warehouse and mining
classification in data warehouse and mining
anjanasharma77573
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET Journal
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
Sanghun Kim
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
Tonmoy Bhagawati
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET Journal
 
presentationIDC - 14MAY2015
presentationIDC - 14MAY2015presentationIDC - 14MAY2015
presentationIDC - 14MAY2015
Anat Reiner-Benaim
 

Similar to Big Data Analytics - Unit 3.pptx (20)

AI Algorithms
AI AlgorithmsAI Algorithms
AI Algorithms
 
classification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdfclassification in data mining and data warehousing.pdf
classification in data mining and data warehousing.pdf
 
Classification and Prediction.pptx
Classification and Prediction.pptxClassification and Prediction.pptx
Classification and Prediction.pptx
 
A Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of DiseasesA Decision Tree Based Classifier for Classification & Prediction of Diseases
A Decision Tree Based Classifier for Classification & Prediction of Diseases
 
5. Machine Learning.pptx
5.  Machine Learning.pptx5.  Machine Learning.pptx
5. Machine Learning.pptx
 
Detection Of Fraudlent Behavior In Water Consumption Using A Data Mining Base...
Detection Of Fraudlent Behavior In Water Consumption Using A Data Mining Base...Detection Of Fraudlent Behavior In Water Consumption Using A Data Mining Base...
Detection Of Fraudlent Behavior In Water Consumption Using A Data Mining Base...
 
pradeep ppt final.pptx
pradeep ppt final.pptxpradeep ppt final.pptx
pradeep ppt final.pptx
 
Artificial Neural Networks for data mining
Artificial Neural Networks for data miningArtificial Neural Networks for data mining
Artificial Neural Networks for data mining
 
Artificial Neural Networks for Data Mining
Artificial Neural Networks for Data MiningArtificial Neural Networks for Data Mining
Artificial Neural Networks for Data Mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
Data mining chapter04and5-best
Data mining chapter04and5-bestData mining chapter04and5-best
Data mining chapter04and5-best
 
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHESIMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES
 
classification in data warehouse and mining
classification in data warehouse and miningclassification in data warehouse and mining
classification in data warehouse and mining
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
 
presentationIDC - 14MAY2015
presentationIDC - 14MAY2015presentationIDC - 14MAY2015
presentationIDC - 14MAY2015
 

Recently uploaded

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
a9qfiubqu
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 

Recently uploaded (20)

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 

Big Data Analytics - Unit 3.pptx

  • 2. Classification
    ◦ Supervised vs. unsupervised learning
    ◦ Supervised learning (classification):
      ■ Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
      ■ New data is classified based on the training set.
    ◦ Unsupervised learning (clustering):
      ■ The class labels of the training data are unknown.
      ■ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
  • 3. What is Classification and Prediction?
    ◦ Two forms of data analysis can be used to extract models that describe important classes or predict future data trends:
      ➢ Classification
      ➢ Prediction
    ◦ Classification models predict categorical class labels; prediction models predict continuous-valued functions.
    ◦ For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.
  • 4. WHAT IS CLASSIFICATION?
    ◦ The following are examples of cases where the data analysis task is classification:
      ➢ A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
      ➢ A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
    ◦ In both examples, a model or classifier is constructed to predict categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.
  • 5. WHAT IS PREDICTION?
    ◦ The following is an example of a case where the data analysis task is prediction:
    ◦ Suppose the marketing manager needs to predict how much a given customer will spend during a sale at the company. Here we need to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
    ◦ Note: Regression analysis is the statistical methodology most often used for numeric prediction.
  • 6. Classification: a two-step process
    ◦ Model construction: describing a set of predetermined classes
      ■ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
      ■ The set of tuples used for model construction is the training set.
      ■ The model is represented as classification rules, decision trees, or mathematical formulae.
    ◦ Model usage: classifying future or unknown objects
      ■ Estimate the accuracy of the model:
        ➢ The known label of each test sample is compared with the classified result from the model.
        ➢ The accuracy rate is the percentage of test set samples that are correctly classified by the model.
        ➢ The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected.
      ■ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
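The two steps above can be sketched in a few lines of Python. This is only an illustration: the loan data, the single "income level" attribute, and the helper names `build_model`, `classify`, and `accuracy` are assumptions made for the example, and the "model" is a deliberately simple majority rule per attribute value rather than any particular classifier from the slides.

```python
from collections import Counter, defaultdict

def build_model(training_set):
    """Step 1 (model construction): learn one classification rule per attribute value."""
    counts = defaultdict(Counter)
    for value, label in training_set:
        counts[value][label] += 1
    # Fallback class for attribute values never seen during training.
    default = Counter(label for _, label in training_set).most_common(1)[0][0]
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return rules, default

def classify(model, value):
    """Step 2 (model usage): apply the learned rules to a tuple."""
    rules, default = model
    return rules.get(value, default)

def accuracy(model, test_set):
    """Fraction of test tuples whose known label matches the model's output."""
    correct = sum(1 for value, label in test_set if classify(model, value) == label)
    return correct / len(test_set)

# Hypothetical labeled tuples: (income level, class label).
training_set = [("low", "risky"), ("low", "risky"), ("medium", "safe"),
                ("high", "safe"), ("high", "safe"), ("medium", "risky"),
                ("medium", "safe")]
# The test set is kept separate from the training set.
test_set = [("low", "risky"), ("high", "safe"), ("medium", "risky"), ("high", "safe")]

model = build_model(training_set)
test_accuracy = accuracy(model, test_set)
print(f"accuracy on the independent test set: {test_accuracy:.2f}")
```

Evaluating on tuples that were not used for training is what makes the reported accuracy a fair estimate for unseen data.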
  • 8. Using the Model in Prediction
  • 9. Classification of Data Mining Techniques
    ◦ Classification of data mining frameworks based on the type of data sources mined:
      ■ Here the classification is by type of data, e.g. multimedia, text data, spatial data, time series data, WWW, etc.
    ◦ Classification of data mining frameworks based on the database involved:
      ■ Here the classification is by the data model involved, e.g. object-oriented database, transactional database,
  • 10. Classification of Data Mining Techniques
    ◦ Classification of data mining frameworks by the kind of knowledge discovered:
      ■ This classification depends on the types of knowledge discovered, e.g. discrimination, classification, clustering, characterization, etc.
    ◦ Classification of data mining frameworks based on the data mining techniques used:
      ■ This classification is based on the data analysis approach utilized, e.g. neural
  • 11. Issues regarding Classification and Prediction
    ◦ The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities:
      ➢ Data cleaning: removing noise and treating missing values. Noise is removed by applying smoothing techniques, and a missing value is handled by replacing it with the most commonly occurring value for that attribute.
      ➢ Relevance analysis: the database may also contain irrelevant attributes. Correlation analysis is used to find out whether any two given attributes are related.
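As a small illustration of the missing-value treatment described above, the sketch below replaces missing entries with the most commonly occurring value of the attribute. The function name `fill_missing_with_mode`, the use of `None` as the missing marker, and the sample column are assumptions for the example.

```python
from collections import Counter

def fill_missing_with_mode(values, missing=None):
    """Replace each missing entry with the most frequent observed value."""
    observed = [v for v in values if v is not missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]

# Hypothetical attribute column with two missing entries.
humidity = ["High", None, "Normal", "High", None, "High"]
cleaned = fill_missing_with_mode(humidity)
print(cleaned)
```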
  • 12. Issues regarding Classification and Prediction
    ◦ Data transformation and reduction: the data can be transformed by either of the following methods.
      ■ Normalization: scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when the learning step involves neural networks or methods involving distance measurements.
      ■ Generalization: the data can also be transformed by generalizing it to a higher-level concept. Concept hierarchies can be used for this purpose.
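One common way to scale an attribute into a small specified range is min-max normalization. The slide does not name a specific method, so the choice of min-max scaling here, the function name `min_max_normalize`, and the income values are assumptions for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical income attribute in dollars.
incomes = [12000, 35000, 58000, 98000]
scaled = min_max_normalize(incomes)
print(scaled)
```

After scaling, attributes measured in large units (such as income) no longer dominate attributes measured in small units when distances are computed.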
  • 13. Comparison of Classification and Prediction Methods
    ◦ The criteria for comparing classification and prediction methods are:
      ➢ Accuracy: the accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.
      ➢ Speed: the computational cost of generating and using the classifier or predictor.
      ➢ Robustness: the ability of the classifier or predictor to make correct predictions from noisy data.
      ➢ Scalability: the ability to construct the classifier or predictor efficiently given a large amount of data.
      ➢ Interpretability: it refers to what extent the classifier or predictor understands
  • 18. Naïve Bayes Classifier

    Day  Outlook   Temperature  Humidity  Wind    PlayTennis
    D1   Sunny     Hot          High      Weak    No
    D2   Sunny     Hot          High      Strong  No
    D3   Overcast  Hot          High      Weak    Yes
    D4   Rain      Mild         High      Weak    Yes
    D5   Rain      Cool         Normal    Weak    Yes
    D6   Rain      Cool         Normal    Strong  No
    D7   Overcast  Cool         Normal    Strong  Yes
    D8   Sunny     Mild         High      Weak    No
    D9   Sunny     Cool         Normal    Weak    Yes
    D10  Rain      Mild         Normal    Weak    Yes
    D11  Sunny     Mild         Normal    Strong  Yes
    D12  Overcast  Mild         High      Strong  Yes
    D13  Overcast  Hot          Normal    Weak    Yes
    D14  Rain      Mild         High      Strong  No

    ◦ PlayTennis is the target variable, with output Yes / No.
    ◦ New instance to be classified as Yes / No: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
    ◦ Prior probability:
      ■ P(PlayTennis = yes) = 9 / 14 = 0.64
      ■ P(PlayTennis = no) = 5 / 14 = 0.36
    ◦ Conditional probabilities of the individual attributes (4 attributes: Outlook, Temperature, Humidity and Wind):

    Attribute    Value     Y    N
    Outlook      Sunny     2/9  3/5
    Outlook      Overcast  4/9  0
    Outlook      Rain      3/9  2/5
    Temperature  Hot       2/9  2/5
    Temperature  Mild      4/9  2/5
    Temperature  Cool      3/9  1/5
    Humidity     High      3/9  4/5
    Humidity     Normal    6/9  1/5
    Wind         Strong    3/9  3/5
    Wind         Weak      6/9  2/5
  • 20. Naïve Bayes Classifier: The probability that the person will play tennis is less than the probability that he will not play tennis. Hence the conclusion is that he will not play tennis.
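The comparison behind this conclusion can be checked by multiplying each class prior by the conditional probabilities of the new instance (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong), using the values from the PlayTennis tables. The sketch below uses exact fractions; the variable names are illustrative, and the normalization into posteriors is an added step the slides do not show.

```python
from fractions import Fraction as F

# Priors and conditional probabilities taken from the PlayTennis tables.
prior = {"Yes": F(9, 14), "No": F(5, 14)}
likelihood = {
    # Order: Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong
    "Yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "No":  [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

# Naive Bayes score: prior times the product of per-attribute conditionals.
score = {}
for label in prior:
    s = prior[label]
    for p in likelihood[label]:
        s *= p
    score[label] = s

# Normalizing the two scores gives posterior probabilities that sum to 1.
total = sum(score.values())
posterior = {label: float(s / total) for label, s in score.items()}
prediction = max(score, key=score.get)

for label in score:
    print(label, float(score[label]), round(posterior[label], 4))
print("prediction:", prediction)
```

The unnormalized scores are 1/189 (about 0.0053) for Yes and 18/875 (about 0.0206) for No, so the instance is classified as No.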
  • 23. Back Propagation
    ◦ Features of back-propagation:
      ■ It uses the gradient descent method.
      ■ It differs from other networks in the process by which the weights are calculated during the learning period of the network.
      ■ Training is done in three stages:
        ➢ the feed-forward of the input training pattern
        ➢ the calculation and back-propagation of the error
        ➢ the updating of the weights
  • 24. Back Propagation Algorithm
    ◦ Step 1: Inputs X arrive through the pre-connected path.
    ◦ Step 2: The input is modeled using real-valued weights W. The weights are usually chosen randomly.
    ◦ Step 3: Calculate the output of each neuron, from the input layer through the hidden layer to the output layer.
    ◦ Step 4: Calculate the error in the outputs: Back-propagation error = Actual Output - Desired Output
    ◦ Step 5: From the output layer, go back to the hidden layer and adjust the weights to reduce the error.
    ◦ Step 6: Repeat the process until the desired output is achieved.
    ◦ Parameters:
      ■ x = input training vector, x = (x1, x2, …, xn)
      ■ t = target vector, t = (t1, t2, …, tn)
      ■ δk = error at output unit
      ■ δj = error at hidden layer
      ■ α = learning rate
      ■ V0j = bias of hidden unit j
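The six steps can be sketched as a tiny network with one hidden layer, written in plain Python. Everything specific here is an assumption for illustration: the sigmoid activation, the squared-error objective, the OR training data, and the names `train`, `v`, `v0`, `w`, `w0`. The error follows the slide's sign convention (actual minus desired), so the weights are moved against it.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(samples, n_hidden=2, alpha=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Step 2: weights and biases start as small random values.
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    v0 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]   # hidden biases (V0j)
    w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w0 = rng.uniform(-0.5, 0.5)                               # output bias

    def forward(x):
        # Step 3: feed-forward from input layer to hidden layer to output.
        z = [sigmoid(v0[j] + sum(x[i] * v[i][j] for i in range(n_in)))
             for j in range(n_hidden)]
        y = sigmoid(w0 + sum(z[j] * w[j] for j in range(n_hidden)))
        return z, y

    def mean_squared_error():
        return sum((forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    history = [mean_squared_error()]
    for _ in range(epochs):                                   # Step 6: repeat
        for x, t in samples:                                  # Step 1: inputs arrive
            z, y = forward(x)
            # Step 4: error at the output unit (actual - desired), scaled by
            # the derivative of the sigmoid.
            delta_k = (y - t) * y * (1.0 - y)
            # Step 5: propagate the error back to each hidden unit.
            delta_j = [delta_k * w[j] * z[j] * (1.0 - z[j]) for j in range(n_hidden)]
            # Gradient-descent updates with learning rate alpha.
            for j in range(n_hidden):
                w[j] -= alpha * delta_k * z[j]
                v0[j] -= alpha * delta_j[j]
                for i in range(n_in):
                    v[i][j] -= alpha * delta_j[j] * x[i]
            w0 -= alpha * delta_k
        history.append(mean_squared_error())
    return forward, history

# Hypothetical training set: the logical OR function.
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
predict, history = train(or_samples)
print("error before training:", round(history[0], 4))
print("error after training: ", round(history[-1], 4))
```

The recorded `history` makes the effect of gradient descent visible: the mean squared error after training is lower than before training.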
  • 25. Types of Back-Propagation
    ◦ Static back-propagation: a network designed to map static inputs to static outputs, e.g. OCR (Optical Character Recognition).
    ◦ Recurrent back-propagation: activation is fed forward until a fixed value is reached.
    ◦ Static back-propagation provides an instant mapping, while recurrent back-propagation does not.
  • 26. Advantages and disadvantages
    ◦ Advantages:
      ■ It is simple, fast, and easy to program.
      ■ It is flexible and efficient.
      ■ Users do not need to learn any special functions.
    ◦ Disadvantages:
      ■ It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
      ■ Performance is highly dependent on the input data.
      ■ Training can take a long time.
  • 28. Prediction
    ◦ 1. Predictive modeling and machine learning: machine learning takes weather data and builds relationships between the available data and the relative predictors.
    ◦ 2. Data: a crucial part of weather predictions
    ◦ 3. Weather data: an aid for predicting many events, such as floods, sports outcomes, car sales, and asthma attacks.
    ◦ 4. Satellite imagery and sensor data
    ◦ Example: A record 1.2 million people (equal to the population of Mauritius) were evacuated in less than 48 hours thanks to the work of data scientists. It was one of the strongest cyclones to have hit India in the last 20 years.
  • 31. Accuracy
    ◦ True Positive: the model correctly predicts the positive class.
    ◦ True Negative: the model correctly predicts the negative class.
    ◦ False Positive: the model incorrectly predicts the positive class.
    ◦ False Negative: the model incorrectly predicts the negative class.
    ◦ Accuracy = (No. of correct predictions) / (Total number of predictions)
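In terms of the four outcomes, the number of correct predictions is the sum of true positives and true negatives, so accuracy = (TP + TN) / (TP + TN + FP + FN). A minimal sketch, with hypothetical labels and illustrative function names:

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1          # predicted negative, actually negative
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        else:
            fn += 1          # predicted negative, actually positive
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical ground truth and model output.
actual    = ["Yes", "No", "Yes", "Yes", "No",  "No", "Yes", "No"]
predicted = ["Yes", "No", "No",  "Yes", "Yes", "No", "Yes", "No"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)
```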
  • 33. Clustering
    ◦ Clustering is an unsupervised machine learning technique.
    ◦ Clustering uses only the input data to determine patterns, anomalies, or similarities in that data.
    ◦ A good clustering algorithm aims to obtain clusters for which:
      ■ The intra-cluster similarity is high: the data inside a cluster is similar to one another.
      ■ The inter-cluster similarity is low: each cluster holds data that is not similar to the data in other clusters.
    ◦ What is a cluster?
      ■ A cluster is a subset of similar objects.
  • 34. Clustering
    ◦ Grouping of specific objects based on their characteristics and similarities.
    ◦ A good clustering algorithm is able to identify clusters independent of cluster shape.
    ◦ The 3 basic stages of a clustering algorithm are: Raw Data → Clustering Algorithm → Clusters of Data
  • 35. Methods of Clustering in Data Mining
    ◦ A data set can be partitioned into clusters by many different methods.
    ◦ Methods of clustering in data mining:
      ■ 1. Partitioning-based method
      ■ 2. Density-based method
      ■ 3. Centroid-based method
      ■ 4. Hierarchical method
      ■ 5. Grid-based method
      ■ 6. Model-based method
  • 36. Partitioning-Based Method
    ◦ A partitioning algorithm divides the data into many subsets.
    ◦ Given n objects in the database, the algorithm builds k partitions of the data, where k ≤ n.
    ◦ Each group has at least one object, and every object must belong to exactly one group.
  • 37. Density-Based Method
    ◦ The algorithm produces clusters as high-density regions of the data space, separated by regions of lower point density.
  • 38. Centroid-Based Method
    ◦ In a centroid-based clustering algorithm, clusters are formed by the closeness of data points to the centroids of the clusters.
    ◦ The cluster center, i.e. the centroid, is chosen so that the distance of the data points to the center is minimal.
    ◦ In this grouping technique, each cluster is represented by a vector of values (its centroid).
    ◦ The number of groups should be predefined.
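A well-known centroid-based algorithm is k-means. The slide does not name it, so using k-means here is an assumption, as are the function name `k_means`, the sample points, and the fixed starting centroids (chosen by hand so the run is reproducible). The number of groups is predefined through the length of the initial centroid list.

```python
def k_means(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, then recompute centroids, until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid with the smallest squared Euclidean distance.
            nearest = min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Each new centroid is the mean of its cluster; an empty cluster keeps
        # its previous centroid.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D data with two visible groups; k = 2 is fixed in advance.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```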
  • 39. Hierarchical Method
    ◦ Hierarchical cluster analysis is a method of cluster analysis that seeks to build a hierarchy of clusters, i.e. a tree-type structure.
    ◦ Agglomerative clustering: also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
    ◦ This clustering algorithm does not require the number of clusters to be specified in advance.
    ◦ Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
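The bottom-up procedure can be sketched on one-dimensional data. The linkage criterion is not specified on the slide; this example assumes single linkage (distance between the closest members of two clusters), and the function name `agglomerative` and the sample values are illustrative.

```python
def agglomerative(values):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[v] for v in values]      # every value starts as a singleton
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerative([1.0, 2.0, 6.0, 7.0, 15.0])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The merge history is the tree-type structure: reading it from the last merge back to the first gives the hierarchy from the single all-inclusive cluster down to the singletons.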
  • 41. Divisive Approach
    ◦ Also known as the top-down approach.
    ◦ This algorithm also does not require the number of clusters to be specified in advance.
    ◦ Top-down clustering requires a method for splitting a cluster. It starts with one cluster that contains all the data and splits clusters recursively until every data point is in its own singleton cluster.
  • 42. Grid-Based Method
    ◦ The grid is divided based on the characteristics of the data.
    ◦ With this method, non-numeric data is easy to manage.
    ◦ The order of the data does not affect the partitioning of the grid.
    ◦ An important advantage is the faster execution time.
  • 43. Grid – Based Method
  • 44. Model-Based Method
    ◦ In this method, a hypothesized model based on a probability distribution is used.
    ◦ By clustering the density function, this method locates the clusters.
  • 45. Applications of Clustering in Data Mining
    ◦ Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
    ◦ Clustering can help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.
    ◦ In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities, and gain insight into structures inherent to populations.
    ◦ Clustering helps identify areas of similar land use in an earth observation database, and groups of houses in a city according to house type, value, and geographic location.
    ◦ Clustering helps in classifying documents on the web for information discovery.
    ◦ Clustering is also used in outlier detection applications such as detection of credit card fraud.
    ◦ As a data mining function, cluster analysis serves as a tool to gain