S. Malpani Radhika and Dr. Sulochana Sonkamble
http://www.iaeme.com/IJCET/index.asp 28 editor@iaeme.com
International Journal of Computer Engineering and Technology (IJCET), 6(7), 2015, pp. 27-34.
http://www.iaeme.com/IJCET/issues.asp?JTypeIJCET&VType=6&IType=7
_____________________________________________________________________
1. INTRODUCTION
In the common sense, the word bias refers to an action that leads to unfair
decision making toward people on the basis of their membership in a group, without
regard to individual merit. For instance, U.S. federal laws forbid bias based
on race, color, religion, nationality, gender, marital status, and age in a number of
settings, including credit scoring, insurance, sale, rental, etc. [1].
From the researchers' side, the problem of bias in credit, finance,
insurance, the labor market, education, and other human activities has been studied
by many researchers in the human and economic sciences. Data mining technology
is both a source of biased decisions and a method for
detecting and preventing bias. Direct bias occurs when
a user is treated adversely because of sensitive individual attributes such as sex,
race, age, disability, or marital status [1].
This type of bias is simple and can seriously affect the person being discriminated
against. Indirect bias occurs when a rule appears to treat all people equally
but in effect disadvantages a certain group of people. Services in the information
society allow the collection of large amounts of data. These data are used to train
association or classification rules in view of making automatic decisions, such as loan
acceptance or rejection, insurance premium calculation, personnel selection, etc.
Automating decisions may give a sense of fairness: classification rules are not guided
by personal preference. However, at a closer look, one realizes that
classification rules are learned by the system from the training data. If the training
data are intrinsically biased for or against a particular community, the learned model
may show biased behavior. In other words, the genuine reason behind
denying a loan may be that the person belongs to another nationality. Therefore there is a
need to remove such potential biases from the training data without affecting the
decision-making utility. Everyone wants to prevent their data from becoming a
source of bias, since data mining tasks generate biased models from biased data
sets as part of automated decision making. In [4], it is concluded that data mining can be
both a source of bias and a means for discovering bias. Hence techniques
to avoid bias have to be introduced, and they should be refined to achieve more
accuracy and allow a DSS to make bias-free decisions based on a bias-free
dataset [1].
In this paper we discuss the prevention of bias in a dataset. For this purpose
we introduce two methods. The first is a post-processing method,
which prevents biased rules by generating strong rules from the input dataset.
The second method is categorization with the least biased method, which also
prevents the data from generating biased rules.
We discuss the implementation details of the proposed system in the
following sections.
The remainder of the paper is organized as follows. Section II discusses
the related work done by researchers for preventing databases
from biased rules. Section III discusses the implementation details of the
proposed system, including the system overview and the algorithms
of the proposed system. Section IV discusses the results and discussion of the
proposed system. Section V presents the conclusion of the proposed system,
and finally we list the references used in the paper.
2. RELATED WORK
2.1. Literature Review
Despite the tremendous enhancement of decision-making information systems based on data
mining technology, the problem of anti-bias in data
mining has not received much attention. Here we discuss the work done by researchers on
detecting and measuring the biases that occur with data mining technology,
as well as the related work done on preventing bias in data mining.
In [1], the authors present two novel algorithms, Apriori and AprioriTid, for
solving these issues, which are fundamentally distinct from the previously known
algorithms. The features of the two algorithms are combined to form a hybrid
algorithm known as AprioriHybrid. Apriori begins by counting the
item occurrences in every pass; candidate itemsets are then generated and the
support of the candidates in each pass is evaluated. The distinguishing feature
of AprioriTid is that it does not scan the database to calculate support after the first pass.
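The candidate-generation and support-counting loop described above can be sketched as follows. This is a generic illustration of the Apriori idea (the subset-pruning step is omitted for brevity), not the authors' implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: return every itemset whose support
    (fraction of transactions containing it) reaches min_support."""
    n = len(transactions)
    # Pass 1: the candidates are the single items.
    candidates = [frozenset([i]) for i in {i for t in transactions for i in t}]
    frequent, k = {}, 1
    while candidates:
        # Count the support of each candidate in this pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: v / n for c, v in counts.items() if v / n >= min_support}
        frequent.update(survivors)
        # Join the surviving k-itemsets to form the (k+1)-candidates.
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent
```

For example, with transactions {A,B}, {A,C}, {A,B,C}, {B} and a minimum support of 0.5, the itemset {A,B} survives (support 0.5) while {B,C} does not (support 0.25).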
In [2], the author describes a framework for evaluating potential biases by
analyzing historical decision records involving sensitive attributes, and also
addresses the issue of determining an accurate measure of the degree of bias
against a given group in a given context with respect to a decision. The problem
is rearticulated in a classification-rule-based setting, and a
collection of quantitative measures of bias is introduced based on existing norms
and regulations. Several measures to calculate potential bias, i.e., the elift, olift, and slift
formulas, are introduced in this work.
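For illustration, the extended lift (elift) of a rule A,B → C is the ratio conf(A,B → C) / conf(B → C), i.e., how much adding the sensitive itemset A raises the confidence of the base rule. A minimal sketch, assuming each record is represented as an item set:

```python
def conf(data, premise, conclusion):
    """Confidence of the rule premise -> conclusion over item-set records."""
    covered = [r for r in data if premise <= r]
    return sum(1 for r in covered if conclusion <= r) / len(covered) if covered else 0.0

def elift(data, a, b, c):
    """Extended lift of A,B -> C: the factor by which the sensitive
    itemset A raises the confidence of the base rule B -> C."""
    base = conf(data, b, c)
    return conf(data, a | b, c) / base if base else 0.0
```

For instance, if the rule {city} → deny has confidence 0.75 but {female, city} → deny has confidence 1.0, the elift is 1.0/0.75 ≈ 1.33, flagging the rule as potentially biasing.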
In [3], the authors introduce the issue of discovering bias through data
mining in a dataset of historical decision records, selected by the user or by the
system. The authors model direct and indirect bias discovery through
protected-by-law groups and contexts where bias occurs, in a classification-rule-based
syntax.
In [4], the authors introduce and study the idea of biased classification rules.
Providing an assurance of non-bias is shown to be a non-trivial task. The authors also
introduce the term "bias in datasets" for the first time. In data mining, classification
models are constructed on the basis of historical data; hence, if some biased
decision making was done previously, then the classifications learned by the
models will also be biased. The research therefore focuses on identifying the sensitive
attributes that could contribute to biased decision making. This idea led to the term
"direct-bias prevention". In addition, the "inference model" to tackle
indirect bias was also introduced. The inference model suggests maintaining a secondary
database along with the original dataset, called "background
knowledge".
In [5], the authors guide the reader through the legal problems posed by the biases
hidden in data, and through distinct legally grounded analyses to unveil biased
circumstances. The authors present DCUBE, an analytical tool supporting the
interactive and iterative process of detecting potential biases. The intended users of
DCUBE include anti-bias authorities, owners of socially sensitive decision
databases, auditors, and researchers in the social sciences, economics, and law.
In [6], the authors present a model for finding evidence of bias in datasets of
historical decision records for socially sensitive tasks, including access to
credit, mortgages, insurance, the labor market, and other benefits. The authors present a
reference model for the analysis and discovery of bias in socially sensitive
decisions taken by a DSS. The methodology consists first of extracting frequent
classification rules, and then of analyzing them on the basis of quantitative
measures of bias and their statistical significance. The key legal concepts of
protected-by-law groups, direct bias, indirect bias, genuine
occupational requirement, affirmative action, and favoritism are formalized as
reasoning over the set of extracted rules and, possibly, additional background
knowledge.
In [7], the authors address the issue of biased rules occurring in a
dataset by introducing a novel classification method for learning a non-biased
classifier from training data. The method is based on manipulating the dataset by
making the least intrusive modifications that lead to an unbiased dataset.
In [8], the authors examine and study how to adapt the naive Bayes classifier to
perform classification that is constrained to be independent with respect to a given
sensitive attribute.
In [9], the authors discuss how to clean training datasets and outsourced
datasets in such a way that legitimate classification rules can still be extracted but
biased rules based on sensitive attributes cannot. The authors analyze how
biased decision making could affect cyber security applications, particularly
intrusion detection systems (IDSs). IDSs use computational intelligence
technologies such as data mining, and it is evident
that the training data of these systems could be capable of generating biases,
which would cause them to make biased decisions when predicting
intrusions.
In [10], the authors discuss a novel preprocessing technique for indirect bias
prevention based on data transformation that can handle distinct
biased attributes and their combinations. They also propose some measures for
evaluating the proposed technique in terms of its success in bias prevention and its
impact on data quality.
In [11], the Adult dataset is provided. This dataset consists of 48,842 records,
split into a "train" part with 32,561 records and a "test" part with 16,281 records. The
dataset has 14 attributes (excluding the class attribute).
2.2. Existing System
The existing system used a preprocessing approach for preventing direct and indirect
bias in the dataset. The existing system is divided into two phases:
2.2.1. Measurement of biases
Direct and indirect bias detection consists of obtaining the alpha-biasing rules and the
redlining rules. Initially, potentially biasing rules and potentially non-biasing rules are
identified based on the biased items present in the database DB and the set FP of frequent
classification rules. Then, using the direct bias measures and the bias
threshold, direct bias is measured by obtaining the alpha-biasing rules from the
potentially biasing rules. In the same way as for direct bias, indirect bias is measured
by obtaining the redlining rules from the potentially non-biasing rules combined with
background knowledge, using the indirect bias measures and the bias threshold [1].
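The direct-bias half of this measurement phase can be sketched as follows. The (premise, conclusion) rule representation and the `elift_of` callback are illustrative assumptions, not the data structures of [1]:

```python
def measure_direct_bias(freq_rules, sensitive_items, elift_of, alpha):
    """Sketch of the measurement phase: split frequent classification
    rules (premise, conclusion) into potentially biasing rules (the
    premise mentions a sensitive item) and potentially non-biasing
    rules, then keep the alpha-biasing ones whose measure reaches
    the bias threshold alpha."""
    pd_rules = [r for r in freq_rules if r[0] & sensitive_items]
    pnd_rules = [r for r in freq_rules if not (r[0] & sensitive_items)]
    alpha_biasing = [r for r in pd_rules if elift_of(r) >= alpha]
    return alpha_biasing, pnd_rules
```

The potentially non-biasing rules returned here are the ones that would then be checked against background knowledge for redlining, as described above.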
2.2.2. Transformation of biases
Transforming the original database DB in such a way that direct or indirect biases are
eliminated, with minimal impact on the data and on legitimate decision rules, so that no
unfair decision rule can be mined from the transformed database [1].
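One simple flavor of such a transformation can be sketched as follows, assuming records are item sets and using the extended lift as the bias measure. The flip-one-record-at-a-time strategy is an illustrative simplification, not the minimal-impact method of [1]:

```python
def direct_rule_protection(dataset, a, b, neg, pos, alpha):
    """Sketch: flip the class item of records covered by the biasing
    rule A,B -> neg, one at a time, until the rule's extended lift
    falls below the bias threshold alpha. Returns a transformed copy."""
    data = [set(r) for r in dataset]
    def conf(prem, concl):
        cov = [r for r in data if prem <= r]
        return sum(1 for r in cov if concl <= r) / len(cov) if cov else 0.0
    def elift():
        base = conf(b, neg)
        return conf(a | b, neg) / base if base else 0.0
    for r in data:
        if elift() < alpha:
            break                      # the rule is no longer alpha-biasing
        if (a | b) <= r and neg <= r:
            r -= neg                   # change the negative class item
            r |= pos                   # to the positive one
    return data
```

Because only as many records as needed are modified, the impact on the rest of the data, and hence on legitimate rules, stays small.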
2.2.3. Algorithms used in the preprocessing approach [1]
We assume the class attribute in the database is binary, and we consider FP together
with the negative classification rule. We also allow the biased
itemset (A') and the non-biased itemset (D) to be of binary or non-binary categories.
i) Direct Bias Prevention Algorithms:
• Direct Rule Protection (Method I)
• Direct Rule Protection (Method II)
• Direct Rule Protection and Rule Generalization
ii) Indirect Bias Prevention Algorithm:
3. IMPLEMENTATION DETAILS
3.1. System Overview
Figure 1 shows the proposed system. In the proposed system we
introduce methods for preventing bias in the database. We discuss
two methods: a post-processing method, i.e., Extended-CPAR for removing
bias, and the categorization with least biased rule algorithm. Initially, the user uploads the
dataset, which contains biased rules, and we first prevent the biased rules by using the
post-processing method. This algorithm merges the benefits of both associative
categorization and traditional rule-based categorization. The method is basically
split into three steps:
• Rule generation
• Estimation of rule accuracy
• Categorization and rule analysis
Figure 1 System Architecture
Next, we introduce the method for preventing bias by using categorization
with least bias. This method changes the allocation of individual data objects in a given
dataset to make it bias free. The basic idea is that the data objects nearest to the
decision boundary are the most prone to be victims of bias. Therefore the main
purpose is to alter the distribution of these borderline objects to make them bias free.
A ranking function is applied to the original dataset to identify the objects
nearest to the biased data. The basic steps of this algorithm are as follows:
• Check the eligibility criteria.
• Rank the clients based on the number of eligibility criteria they satisfy.
• Apply cancellation to those having the lowest rank.
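The three steps above can be sketched as follows; the quota parameter (how many approvals are available) and the client representation are illustrative assumptions not fixed by the paper:

```python
def rank_and_cancel(clients, criteria, quota):
    """Sketch of categorization with least bias: score each client by
    the number of eligibility criteria satisfied, rank them, and apply
    cancellation only to the lowest-ranked clients beyond the quota."""
    ranked = sorted(clients,
                    key=lambda c: sum(1 for rule in criteria if rule(c)),
                    reverse=True)
    return ranked[:quota], ranked[quota:]   # (approved, cancelled)
```

Because rejection is driven only by the objective criteria count, borderline objects are no longer singled out by sensitive attributes.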
3.2. Algorithm
Algorithm 1: Algorithm for Post-processing (Extended-CPAR)
1. Let D be the original dataset.
2. Declare a counter Weight.
3. Assign each attribute Weight = 1.
4. Declare a set Gain that holds the values of the strong attributes.
5. Initially, the set Rules is NULL.
6. Declare the set Result, which holds the result obtained by applying the traditional
techniques.
7. Declare the set Negative, which holds the attributes not included in Result.
8. If Weight(Negative) > Weight(Result) then
9. Evaluate the gain for each attribute in Negative.
10. For each attribute, if its gain is strong then
11. change its class attribute from ¬c to c,
12. add the new classification rule to the Rules set, and
13. include such attributes in the Result set.
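Steps 8-13 of Algorithm 1 can be sketched as follows. The weight and gain tables and the numeric threshold for a "strong" gain are illustrative stand-ins for quantities the algorithm leaves abstract:

```python
def post_process(result, negative, weight, gain, threshold):
    """Sketch of steps 8-13: when the attributes excluded from the
    traditional result outweigh those included, promote each excluded
    attribute with strong gain, recording a new classification rule
    whose class is changed from not-c to c."""
    rules = []
    if sum(weight[x] for x in negative) > sum(weight[x] for x in result):
        for attr in sorted(negative):
            if gain[attr] >= threshold:       # "gain is strong"
                rules.append((attr, "c"))     # class changed from not-c to c
                result.add(attr)
        negative -= {a for a, _ in rules}
    return result, negative, rules
```

The promoted attributes yield the fewer but stronger rules reported in the results section.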
4. RESULTS AND DISCUSSION
4.1. Expected Experimentation
The system is built using Java (JDK 8.1) on the Windows platform.
NetBeans (version 8.0.1) is used as the development tool. The system does not
require any specific hardware to run; any standard machine is capable of running the
application.
4.2. Results
Here we discuss the results and the generated graphs for the proposed system. With
K value = 3, minimum best gain = 0.3, total weight factor = 0.6, and gain similarity
ratio = 0.4, the bias removal is as follows:
In Graph 1, the red bar shows the existing system and the blue bar shows the
proposed system. In the existing system the pre-processing algorithms have been
implemented, whereas in the proposed system the post-processing algorithms have been
implemented. The potential bias discovered in the existing system ranges between 0.1%
and 2.5%, whereas the potential bias discovered in the proposed system is up to 15%.
Graph 1 Degree of Potential Biases Removed
Graph 2 shows the memory required for the respective algorithms to execute.
Graph 2 Memory Requirement
5. CONCLUSION
In this paper we discussed the biased rules generated in a database. There
are two types of biased rules: direct biased rules and indirect biased rules.
A number of techniques have been developed for preventing biased rules. In this paper we
proposed the post-processing method, which prevents biased rules in the
database.
The existing system (pre-processing) identifies only 2.5% of the dataset as
biased, whereas the proposed system (post-processing) identifies up
to 15% of the dataset as biased while generating fewer but stronger classification rules.
The proposed system is thus an effective solution for avoiding bias in data mining.
REFERENCES
[1] Hajian, S. and Domingo-Ferrer, J. A Methodology for Direct and Indirect
Discrimination Prevention in Data Mining. IEEE Transactions on Knowledge and
Data Engineering, 25(7), July 2013.
[2] Pedreschi, D., Ruggieri, S. and Turini, F. Measuring Discrimination in Socially-
Sensitive Decision Records. Proc. Ninth SIAM Data Mining Conf. (SDM 09),
2009, pp. 581–592.
[3] Ruggieri, S., Pedreschi, D. and Turini, F. Data Mining for Discrimination
Discovery, ACM Trans. Knowledge Discovery from Data, 4(2), 2010, article 9.
[4] Pedreschi, D., Ruggieri, S. and Turini, F. Discrimination-Aware Data Mining.
Proc. ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (KDD 08),
2008, pp. 560–568.
[5] Ruggieri, S., Pedreschi, D. and Turini, F. DCUBE: Discrimination Discovery in
Databases. Proc. ACM Intl Conf. Management of Data (SIGMOD 10), 2010, pp.
1127–1130.
[6] Pedreschi, D., Ruggieri, S. and Turini, F. Integrating Induction and Deduction for
Finding Evidence of Discrimination. Proc. 12th Intl Conf. Artificial
Intelligence and Law (ICAIL 09), 2009, pp. 157–166.
[7] Kamiran, F. and Calders, T. Classification without Discrimination. Proc. IEEE
Second Intl Conf. Computer, Control and Comm. (IC4 09), 2009.
[8] Calders, T. and Verwer, S. Three Naive Bayes Approaches for Discrimination-
Free Classification. Data Mining and Knowledge Discovery, 21(2), 2010, pp.
277–292.
[9] Hajian, S., Domingo-Ferrer, J. and Martínez-Ballesté, A. Discrimination
Prevention in Data Mining for Intrusion and Crime Detection. Proc. IEEE Symp.
Computational Intelligence in Cyber Security (CICS 11), 2011, pp. 47–54.
[10] Hajian, S., Domingo-Ferrer, J. and Martínez-Ballesté, A. Rule Protection for
Indirect Discrimination Prevention in Data Mining. Proc. Eighth Intl Conf.
Modeling Decisions for Artificial Intelligence (MDAI 11), 2011, pp. 211–222.
[11] Kohavi, R. and Becker, B. UCI Repository of Machine Learning Databases,
1996, http://archive.ics.uci.edu/ml/datasets/Adult.