Ppt on CLASS IMBALANCE PROBLEM in Data Mining

UNIT-5
CLASS IMBALANCE PROBLEM

What is the Class Imbalance Problem?
The class imbalance problem arises in machine learning when one
class of data (the positive class) has far fewer examples than another
class (the negative class). It is extremely common in practice and
appears in many domains, including fraud detection, anomaly
detection, medical diagnosis, oil spill detection, and facial
recognition.
ABOUT THE PROBLEM
Given a dataset of transactions, we would like to find out which are
fraudulent and which are genuine. A fraudulent transaction that goes
through is highly costly to the e-commerce company: it damages our
customers' trust in us and costs us money. So we want to catch as
many fraudulent transactions as possible.
If a dataset consists of 10000 genuine and 10 fraudulent transactions, the
classifier will tend to classify fraudulent transactions as genuine transactions. The reason
can be easily explained by the numbers. Suppose the machine learning algorithm has two
possible outputs, as follows:
Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions and 10 out of
10000 genuine transactions as fraudulent transactions.
Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions and 100 out
of 10000 genuine transactions as fraudulent transactions.

How to decide which model is the better
solution?
M1 or M2?
Model 2 makes more mistakes in total (102 versus 17), yet it catches 8 of the 10
fraudulent transactions where Model 1 catches only 3. To tell the machine learning
algorithm (or the researcher) that Model 2 is better than Model 1, we need better
metrics than just counting the number of mistakes made.
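The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python using only the counts from the example; the function names (`error_count`, `accuracy`, `fraud_caught`) are illustrative and not part of the original slides.

```python
# Class sizes from the example: 10 fraudulent (positive), 10000 genuine (negative).
N_FRAUD = 10
N_GENUINE = 10_000
TOTAL = N_FRAUD + N_GENUINE


def error_count(missed_fraud: int, false_alarms: int) -> int:
    """Total mistakes: fraud labelled genuine plus genuine labelled fraud."""
    return missed_fraud + false_alarms


def accuracy(missed_fraud: int, false_alarms: int) -> float:
    """Fraction of all transactions that were labelled correctly."""
    return (TOTAL - error_count(missed_fraud, false_alarms)) / TOTAL


def fraud_caught(missed_fraud: int) -> int:
    """Number of fraudulent transactions the model actually flagged."""
    return N_FRAUD - missed_fraud


# (missed fraud, genuine transactions flagged as fraud) for each model
models = {"Model 1": (7, 10), "Model 2": (2, 100)}

for name, (missed, alarms) in models.items():
    print(
        name,
        "errors:", error_count(missed, alarms),
        "accuracy:", round(accuracy(missed, alarms), 4),
        "fraud caught:", fraud_caught(missed),
    )
```

Model 1 wins on raw error count and accuracy, while Model 2 catches more of the fraud, which is why error counting alone is misleading on imbalanced data.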
The concepts of True Positive, True Negative, False Positive and False Negative
are introduced for this purpose:
True Positive (TP) – An example that is positive and is classified correctly
as positive
True Negative (TN) – An example that is negative and is classified correctly
as negative
False Positive (FP) – An example that is negative but is classified wrongly
as positive
False Negative (FN) – An example that is positive but is classified wrongly
as negative
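Based on these, we have the True Positive Rate, True Negative Rate, False Positive
Rate and False Negative Rate:
True Positive Rate (TPR) = TP / (TP + FN)
True Negative Rate (TNR) = TN / (TN + FP)
False Positive Rate (FPR) = FP / (FP + TN)
False Negative Rate (FNR) = FN / (FN + TP)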
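As a hedged illustration, the four rates can be computed for Model 1 and Model 2 using the standard definitions (each rate divides a count by the size of its true class). The `Confusion` class and its method names are assumptions made for this sketch; the counts are derived from the example, treating fraudulent transactions as the positive class.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, with fraud as the positive class."""
    tp: int  # fraud correctly flagged as fraud
    fn: int  # fraud wrongly passed as genuine
    fp: int  # genuine wrongly flagged as fraud
    tn: int  # genuine correctly passed as genuine

    def tpr(self) -> float:
        return self.tp / (self.tp + self.fn)

    def tnr(self) -> float:
        return self.tn / (self.tn + self.fp)

    def fpr(self) -> float:
        return self.fp / (self.fp + self.tn)

    def fnr(self) -> float:
        return self.fn / (self.fn + self.tp)


# Model 1: 7 of 10 fraud missed, 10 of 10000 genuine flagged.
model_1 = Confusion(tp=3, fn=7, fp=10, tn=9990)
# Model 2: 2 of 10 fraud missed, 100 of 10000 genuine flagged.
model_2 = Confusion(tp=8, fn=2, fp=100, tn=9900)

for name, m in (("Model 1", model_1), ("Model 2", model_2)):
    print(name, "TPR:", m.tpr(), "TNR:", m.tnr(), "FPR:", m.fpr(), "FNR:", m.fnr())
```

On these numbers Model 2 has the much higher True Positive Rate (0.8 versus 0.3) at the price of a higher False Positive Rate (0.01 versus 0.001), a trade-off that the raw error count hides.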

Graph Mining
Graphs have become increasingly important in modelling complicated structures, such as
circuits, images, chemical compounds, protein structures, biological networks, social
networks, the Web, workflows, and XML documents. Many graph search algorithms
have been developed in chemical informatics, computer vision, video indexing, and
text retrieval. With the increasing demand for the analysis of large amounts of structured
data, graph mining has become an active and important theme in data mining.
Among the various kinds of graph patterns, frequent substructures are the most basic
patterns that can be discovered in a collection of graphs.
They are useful for characterizing graph sets, discriminating different groups of graphs,
classifying and clustering graphs, building graph indices, and facilitating similarity
search in graph databases.
Recent studies have developed several graph mining methods and applied them to the
discovery of interesting patterns in various applications.
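To make the idea of a frequent substructure concrete, here is a minimal sketch that handles only the smallest possible substructure, a single edge between two labelled vertices. It is not a full frequent-subgraph miner: real methods grow such edges into larger patterns and must test subgraph isomorphism. The graph representation (a list of label pairs per graph), the function name `frequent_edges`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return the labelled edges that occur in at least `min_support` graphs.

    Each graph is given as a list of (label_a, label_b) undirected edges.
    """
    support = Counter()
    for edges in graphs:
        # Sort each pair so (a, b) and (b, a) are the same edge, and use a
        # set so an edge is counted at most once per graph.
        unique = {tuple(sorted(edge)) for edge in edges}
        support.update(unique)
    return {edge: count for edge, count in support.items() if count >= min_support}


# Toy collection of three graphs with vertex labels resembling atoms.
graphs = [
    [("C", "O"), ("C", "H"), ("C", "C")],
    [("O", "C"), ("C", "N")],
    [("C", "H"), ("C", "O"), ("N", "H")],
]

result = frequent_edges(graphs, min_support=2)
print(result)
```

Here the support of a pattern is the number of graphs that contain it, so an edge repeated inside one graph still counts once.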

Social network
A social network can be defined as a set of
relationships between individuals, where each individual is
a social entity. It represents both the collection of ties
between people and the strength of those ties. More
generally, a social network is used as a measure of social
“connectedness”, for observing and measuring the quality
and quantity of information flow among individuals and
also within groups.
