UNIT-5
CLASS IMBALANCE PROBLEM
What is the Class Imbalance Problem?
It is the problem in machine learning where the number of examples in
one class (the positive class) is far smaller than the number in another
class (the negative class). This problem is extremely common in
practice and can be observed in various disciplines, including fraud
detection, anomaly detection, medical diagnosis, oil spill detection,
and facial recognition.
ABOUT THE PROBLEM
Given a dataset of transaction data, we would like to find out which
transactions are fraudulent and which are genuine. It is highly costly to
the e-commerce company if a fraudulent transaction goes through, as this
damages our customers' trust in us and costs us money. So we want to
catch as many fraudulent transactions as possible.
If a dataset consists of 10000 genuine and 10 fraudulent transactions, the
classifier will tend to classify fraudulent transactions as genuine transactions. The reason
can be easily explained by the numbers. Suppose the machine learning algorithm has two
possible outputs, as follows:
Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions and 10 out of
10000 genuine transactions as fraudulent transactions.
Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions and 100 out
of 10000 genuine transactions as fraudulent transactions.
How do we decide which model is the better
solution?
M1 or M2?
Counting mistakes alone favours Model 1: it makes 7 + 10 = 17 errors, while
Model 2 makes 2 + 100 = 102. Yet Model 2 catches 8 of the 10 fraudulent
transactions and Model 1 catches only 3. To tell the machine learning algorithm
(or the researcher) that Model 2 is better than Model 1, we need better metrics
than just counting the number of mistakes made.
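A minimal sketch of why a plain error count is misleading here, using the numbers from the two models above. The helper name `accuracy` and its parameters are illustrative assumptions, not part of any library.

```python
# Hypothetical helper: overall accuracy computed from the two kinds of error.
def accuracy(missed_frauds, false_alarms, n_fraud=10, n_genuine=10_000):
    """Fraction of all transactions that were classified correctly."""
    errors = missed_frauds + false_alarms
    return 1 - errors / (n_fraud + n_genuine)

acc_m1 = accuracy(missed_frauds=7, false_alarms=10)    # Model 1: 17 errors
acc_m2 = accuracy(missed_frauds=2, false_alarms=100)   # Model 2: 102 errors

print(f"Model 1 accuracy: {acc_m1:.4f}")
print(f"Model 2 accuracy: {acc_m2:.4f}")
# Model 1 scores higher on accuracy even though it misses 7 of the 10 frauds.
print("Model 1 looks better by accuracy:", acc_m1 > acc_m2)
```

Both models score close to 100% because the genuine class dominates the total, so accuracy says very little about how many frauds were caught.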
The concepts of True Positive, True Negative, False Positive and False Negative
are introduced:
True Positive (TP) – An example that is positive and is classified correctly
as positive
True Negative (TN) – An example that is negative and is classified correctly
as negative
False Positive (FP) – An example that is negative but is classified wrongly
as positive
False Negative (FN) – An example that is positive but is classified wrongly
as negative
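Based on this we will have True Positive Rate, True Negative Rate, False Positive
Rate, False Negative Rate:
True Positive Rate (TPR) = TP / (TP + FN)
True Negative Rate (TNR) = TN / (TN + FP)
False Positive Rate (FPR) = FP / (FP + TN)
False Negative Rate (FNR) = FN / (FN + TP)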
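As a hedged illustration, the four rates can be computed for Model 1 and Model 2 from the fraud example, taking "fraudulent" as the positive class. Each rate divides a count by the size of its true class: TPR = TP / (TP + FN), TNR = TN / (TN + FP), FPR = FP / (FP + TN), FNR = FN / (FN + TP). The class name `ConfusionCounts` is an assumption made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class ConfusionCounts:
    """Hypothetical container for the four confusion-matrix counts."""
    tp: int  # frauds flagged as fraud
    fn: int  # frauds passed as genuine
    fp: int  # genuine transactions flagged as fraud
    tn: int  # genuine transactions passed as genuine

    def rates(self):
        return {
            "TPR": self.tp / (self.tp + self.fn),
            "TNR": self.tn / (self.tn + self.fp),
            "FPR": self.fp / (self.fp + self.tn),
            "FNR": self.fn / (self.fn + self.tp),
        }

# Counts derived from the example: 10 frauds and 10000 genuine transactions.
model_1 = ConfusionCounts(tp=3, fn=7, fp=10, tn=9_990)
model_2 = ConfusionCounts(tp=8, fn=2, fp=100, tn=9_900)

for name, model in (("Model 1", model_1), ("Model 2", model_2)):
    formatted = ", ".join(f"{k}={v:.3f}" for k, v in model.rates().items())
    print(f"{name}: {formatted}")
```

On these counts Model 2 catches 80% of the frauds against 30% for Model 1, at the price of a false positive rate of 1% instead of 0.1%. That trade-off is invisible when only the total number of mistakes is counted.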
Graph Mining
Graphs have become increasingly important in modelling complicated structures, such as
circuits, images, chemical compounds, protein structures, biological networks, social
networks, the Web, workflows, and XML documents. Many graph search algorithms
have been developed in chemical informatics, computer vision, video indexing, and
text retrieval. With the increasing demand for the analysis of large amounts of structured
data, graph mining has become an active and important theme in data mining.
Among the various kinds of graph patterns, frequent substructures are the most basic
patterns that can be discovered in a collection of graphs.
They are useful for characterizing graph sets, discriminating different groups of graphs,
classifying and clustering graphs, building graph indices, and facilitating similarity
search in graph databases.
Recent studies have developed several graph mining methods and applied them to the
discovery of interesting patterns in various applications.
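To make the idea of a frequent substructure concrete, here is a deliberately simplified sketch that considers only the smallest possible substructure, a single edge between two node labels, and keeps those that occur in at least `min_support` graphs. Real frequent-subgraph mining methods grow larger patterns and must test subgraph isomorphism, which this sketch does not attempt. The function name `frequent_edges` and the toy graphs are assumptions made for illustration.

```python
from collections import Counter

def frequent_edges(graphs, min_support):
    """Return label-pair edges that occur in at least `min_support` graphs."""
    support = Counter()
    for edges in graphs:
        # Normalise each undirected edge and count it at most once per graph.
        distinct = {tuple(sorted(edge)) for edge in edges}
        support.update(distinct)
    return {edge: count for edge, count in support.items() if count >= min_support}

# Toy collection (hypothetical data): each graph is a list of undirected
# edges, and each edge is a pair of node labels.
graphs = [
    [("C", "O"), ("C", "H"), ("C", "C")],
    [("O", "C"), ("C", "N")],
    [("C", "H"), ("C", "O"), ("N", "H")],
]

result = frequent_edges(graphs, min_support=2)
print(result)
```

With a support threshold of 2, only the C-O edge (present in all three graphs) and the C-H edge (present in two) are kept.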
Social network
A social network can be defined as the set of
relationships between individuals, where each individual is
a social entity. It represents both the collection of ties
between people and the strength of those ties. In
general, a social network is used as a measure of social
“connectedness”, for observing
and calculating the quality and quantity of information flow
between individuals and also within groups.
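One possible way to represent ties and their strengths is a weighted, undirected graph stored as nested dictionaries. The sketch below is only an illustration: the names, the strength values, and the choice of "sum of tie strengths" as a rough connectedness score are all assumptions, not definitions taken from the text.

```python
from collections import defaultdict

def build_network(ties):
    """Build an undirected weighted graph from (person, person, strength) triples."""
    network = defaultdict(dict)
    for a, b, strength in ties:
        network[a][b] = strength
        network[b][a] = strength
    return network

def connectedness(network, person):
    """Assumed score: total strength of all ties held by `person`."""
    return sum(network.get(person, {}).values())

# Hypothetical individuals and tie strengths on a 0-to-1 scale.
ties = [("Asha", "Ben", 0.9), ("Asha", "Chen", 0.4), ("Ben", "Chen", 0.1)]
network = build_network(ties)

for person in sorted(network):
    n_ties = len(network[person])
    print(f"{person}: {n_ties} ties, connectedness {connectedness(network, person):.1f}")
```

Storing the strength on each tie keeps both pieces of information mentioned above: which ties exist and how strong they are.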
Ppt on CLASS IMBALANCE PROBLEM in Data Mining