The Independence of Fairness-aware Classifiers
IEEE International Workshop on Privacy Aspects of Data Mining (PADM), in conjunction with ICDM2013
Article @ Official Site:
Article @ Personal Site: http://www.kamishima.net/archive/2013-ws-icdm-print.pdf
Handnote: http://www.kamishima.net/archive/2013-ws-icdm-HN.pdf
Program codes: http://www.kamishima.net/fadm/
Workshop Homepage: http://www.cs.cf.ac.uk/padm2013/
Abstract:
Due to the spread of data mining technologies, such technologies are being used for determinations that seriously affect individuals' lives. For example, credit scoring is frequently determined based on the records of past credit data together with statistical prediction techniques. Needless to say, such determinations must be nondiscriminatory and fair with respect to sensitive features, such as race, gender, religion, and so on. The goal of fairness-aware classifiers is to classify data while taking into account the potential issues of fairness, discrimination, neutrality, and/or independence. In this paper, after reviewing fairness-aware classification methods, we focus on one such method, Calders and Verwer's two-naive-Bayes method. This method has been shown to be superior to other classifiers in terms of fairness, which is formalized as the statistical independence between a class and a sensitive feature. However, the cause of this superiority is unclear, because the method relies on a somewhat heuristic post-processing technique rather than an explicitly formalized model. We clarify the cause by comparing this method with an alternative naive Bayes classifier, which is modified by a modeling technique called "hypothetical fair-factorization." This investigation reveals the theoretical background of the two-naive-Bayes method and its connections with other methods. Based on these findings, we develop another naive Bayes method with an "actual fair-factorization" technique and empirically show that this new method can achieve a level of fairness equal to that of the two-naive-Bayes classifier.
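For reference, the independence criterion used throughout this line of work can be checked empirically with the Calders-Verwer (CV) score, the gap in positive-decision rates between the two sensitive groups. A minimal sketch in Python (function and variable names are illustrative, not taken from the released program codes):

import numpy as np

def cv_score(y_pred, s):
    # Calders-Verwer score: P(y=1 | s=1) - P(y=1 | s=0).
    # A value near zero indicates the decisions are (marginally)
    # independent of the binary sensitive feature s.
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    return y_pred[s == 1].mean() - y_pred[s == 0].mean()

# Example: decisions that favor the s=1 group
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0])
s      = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(cv_score(y_pred, s))  # 0.75 - 0.25 = 0.5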
Correcting Popularity Bias by Enhancing Recommendation Neutrality
The 8th ACM Conference on Recommender Systems, Poster
Article @ Official Site: http://ceur-ws.org/Vol-1247/
Article @ Personal Site: http://www.kamishima.net/archive/2014-po-recsys-print.pdf
Abstract:
In this paper, we attempt to correct a popularity bias, which is the tendency for popular items to be recommended more frequently, by enhancing recommendation neutrality. Recommendation neutrality involves excluding specified information from the prediction process of recommendation. This neutrality was formalized as the statistical independence between a recommendation result and the specified information, and we developed a recommendation algorithm that satisfies this independence constraint. We correct the popularity bias by enhancing neutrality with respect to information regarding whether candidate items are popular or not. We empirically show that a popularity bias in the predicted preference scores can be corrected.
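As a rough illustration of what the neutrality constraint targets here, the first-moment dependence of predicted scores on a binary popularity flag can be measured directly; the paper's algorithm penalizes such dependence during training. A toy sketch (names are illustrative):

import numpy as np

def mean_score_gap(scores, popular):
    # Difference of mean predicted scores between popular and unpopular
    # items; a zero gap is a necessary (first-moment) condition for
    # independence of scores from the popularity flag.
    scores, popular = np.asarray(scores), np.asarray(popular, bool)
    return scores[popular].mean() - scores[~popular].mean()

scores  = np.array([4.5, 4.0, 4.2, 3.1, 2.9, 3.0])
popular = np.array([1, 1, 1, 0, 0, 0])
print(mean_score_gap(scores, popular))  # about 1.23: popularity bias present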
Future Directions of Fairness-Aware Data Mining: Recommendation, Causality, and Theoretical Aspects
Invited Talk @ Workshop on Fairness, Accountability, and Transparency in Machine Learning
In conjunction with the ICML 2015 @ Lille, France, Jul. 11, 2015
Web Site: http://www.kamishima.net/fadm/
Handnote: http://www.kamishima.net/archive/2015-ws-icml-HN.pdf
The goal of fairness-aware data mining (FADM) is to analyze data while taking into account potential issues of fairness. In this talk, we will cover three topics in FADM:
1. Fairness in a Recommendation Context: In classification tasks, the term "fairness" is regarded as anti-discrimination. We will present other types of problems related to fairness in a recommendation context.
2. What is Fairness: Most formal definitions of fairness have a connection with the notion of statistical independence. We will explore other types of formal fairness based on causality, agreement, and unfairness.
3. Theoretical Problems of FADM: After reviewing technical and theoretical open problems in the FADM literature, we will introduce the theory of the generalization bound in terms of accuracy as well as fairness.
Joint work with Jun Sakuma, Shotaro Akaho, and Hideki Asoh
Model-based Approaches for Independence-Enhanced Recommendation
IEEE International Workshop on Privacy Aspects of Data Mining (PADM), in conjunction with ICDM2016
Article @ Official Site: http://doi.ieeecomputersociety.org/10.1109/ICDMW.2016.0127
Workshop Homepage: http://pddm16.eurecat.org/
Abstract:
This paper studies a new approach to enhance recommendation independence. Such approaches are useful in ensuring adherence to laws and regulations, fair treatment of content providers, and exclusion of unwanted information. For example, recommendations that match an employer with a job applicant should not be based on socially sensitive information, such as gender or race, from the perspective of social fairness. An algorithm that could exclude the influence of such sensitive information would be useful in this case. We previously gave a formal definition of recommendation independence and proposed a method adopting a regularizer that imposes such an independence constraint. As no options other than this regularization approach have been put forward, we here propose a new model-based approach, which is based on a generative model that satisfies the constraint of recommendation independence. We apply this approach to a latent class model and empirically show that the model-based approach can enhance recommendation independence. Recommendation algorithms based on generative models, such as topic models, are important, because their flexibility enables them to incorporate a wide variety of information types. Our new model-based approach will broaden the applications of independence-enhanced recommendation by integrating the functionality of generative models.
Fairness-aware Learning through Regularization Approach
The 3rd IEEE International Workshop on Privacy Aspects of Data Mining (PADM 2011)
Dec. 11, 2011 @ Vancouver, Canada, in conjunction with ICDM2011
Article @ Official Site: http://doi.ieeecomputersociety.org/10.1109/ICDMW.2011.83
Article @ Personal Site: http://www.kamishima.net/archive/2011-ws-icdm_padm.pdf
Handnote: http://www.kamishima.net/archive/2011-ws-icdm_padm-HN.pdf
Workshop Homepage: http://www.zurich.ibm.com/padm2011/
Abstract:
With the spread of data mining technologies and the accumulation of social data, such technologies and data are being used for determinations that seriously affect people's lives. For example, credit scoring is frequently determined based on the records of past credit data together with statistical prediction techniques. Needless to say, such determinations must be socially and legally fair from a viewpoint of social responsibility; namely, they must be unbiased and nondiscriminatory with respect to sensitive features, such as race, gender, religion, and so on. Several researchers have recently begun to attempt the development of analysis techniques that are aware of social fairness or discrimination. They have shown that simply avoiding the use of sensitive features is insufficient for eliminating biases in determinations, due to the indirect influence of sensitive information. From a privacy-preserving viewpoint, this can be interpreted as hiding sensitive information when classification results are observed. In this paper, we first discuss three causes of unfairness in machine learning. We then propose a regularization approach that is applicable to any prediction algorithm with probabilistic discriminative models. We further apply this approach to logistic regression and empirically show its effectiveness and efficiency.
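As a hedged sketch of the regularization idea (not the released implementation): logistic regression's negative log-likelihood plus an independence penalty and an L2 term. The penalty below matches group means of predicted probabilities, a simpler surrogate for the mutual-information regularizer described in the paper; eta trades accuracy against fairness.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, X, y, s, eta=1.0, lam=0.01):
    # negative log-likelihood (accuracy term); X, y, s are numpy arrays
    p = sigmoid(X @ w)
    nll = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # independence penalty: squared gap in mean positive probability
    # between the two sensitive groups (mean-matching surrogate)
    gap = p[s == 1].mean() - p[s == 0].mean()
    # L2 regularizer on the weights
    return nll + eta * gap ** 2 + lam * np.dot(w, w)

This objective can be minimized with any generic optimizer, e.g. scipy.optimize.minimize.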
Efficiency Improvement of Neutrality-Enhanced Recommendation
Workshop on Human Decision Making in Recommender Systems, in conjunction with RecSys 2013
Article @ Official Site: http://ceur-ws.org/Vol-1050/
Article @ Personal Site: http://www.kamishima.net/archive/2013-ws-recsys-print.pdf
Handnote: http://www.kamishima.net/archive/2013-ws-recsys-HN.pdf
Program codes: http://www.kamishima.net/inrs/
Workshop Homepage: http://recex.ist.tugraz.at/RecSysWorkshop/
Abstract:
This paper proposes an algorithm for making recommendations so that neutrality from a viewpoint specified by the user is enhanced. This algorithm is useful for avoiding decisions based on biased information. One such problem is the filter bubble, in which personalization technologies bias social decisions. To provide a neutrality-enhanced recommendation, we must first assume that a user can specify a particular viewpoint from which the neutrality is to be applied, because a recommendation that is neutral from all viewpoints is no longer a recommendation. Given such a target viewpoint, we implement an information-neutral recommendation algorithm by introducing a penalty term to enforce statistical independence between the target viewpoint and a rating. We empirically show that our algorithm enhances independence from the specified viewpoint.
Consideration on Fairness-aware Data Mining
IEEE International Workshop on Discrimination and Privacy-Aware Data Mining (DPADM 2012)
Dec. 10, 2012 @ Brussels, Belgium, in conjunction with ICDM2012
Article @ Official Site: http://doi.ieeecomputersociety.org/10.1109/ICDMW.2012.101
Article @ Personal Site: http://www.kamishima.net/archive/2012-ws-icdm-print.pdf
Handnote: http://www.kamishima.net/archive/2012-ws-icdm-HN.pdf
Workshop Homepage: https://sites.google.com/site/dpadm2012/
Abstract:
With the spread of data mining technologies and the accumulation of social data, such technologies and data are being used for determinations that seriously affect individuals' lives. For example, credit scoring is frequently determined based on the records of past credit data together with statistical prediction techniques. Needless to say, such determinations must be nondiscriminatory and fair regarding sensitive features such as race, gender, religion, and so on. Several researchers have recently begun to develop fairness-aware or discrimination-aware data mining techniques that take into account issues of social fairness, discrimination, and neutrality. In this paper, after demonstrating the applications of these techniques, we explore the formal concepts of fairness and techniques for handling fairness in data mining. We then provide an integrated view of these concepts based on statistical independence. Finally, we discuss the relations between fairness-aware data mining and other research topics, such as privacy-preserving data mining or causal inference.
Fairness-aware Classifier with Prejudice Remover Regularizer
Proceedings of the European Conference on Machine Learning and Principles of Knowledge Discovery in Databases (ECMLPKDD), Part II, pp.35-50 (2012)
Article @ Official Site: http://dx.doi.org/10.1007/978-3-642-33486-3_3
Article @ Personal Site: http://www.kamishima.net/archive/2012-p-ecmlpkdd-print.pdf
Handnote: http://www.kamishima.net/archive/2012-p-ecmlpkdd-HN.pdf
Program codes: http://www.kamishima.net/fadm/
Conference Homepage: http://www.ecmlpkdd2012.net/
Abstract:
With the spread of data mining technologies and the accumulation of social data, such technologies and data are being used for determinations that seriously affect individuals' lives. For example, credit scoring is frequently determined based on the records of past credit data together with statistical prediction techniques. Needless to say, such determinations must be nondiscriminatory and fair with respect to sensitive features, such as race, gender, religion, and so on. Several researchers have recently begun to attempt the development of analysis techniques that are aware of social fairness or discrimination. They have shown that simply avoiding the use of sensitive features is insufficient for eliminating biases in determinations, due to the indirect influence of sensitive information. In this paper, we first discuss three causes of unfairness in machine learning. We then propose a regularization approach that is applicable to any prediction algorithm with probabilistic discriminative models. We further apply this approach to logistic regression and empirically show its effectiveness and efficiency.
Considerations on Recommendation Independence for a Find-Good-Items Task
Workshop on Responsible Recommendation (FATREC), in conjunction with RecSys2017
Article @ Official Site: http://doi.org/10.18122/B2871W
Workshop Homepage: https://piret.gitlab.io/fatrec/
This paper examines the notion of recommendation independence, which is a constraint that a recommendation result be independent of specified information. This constraint is useful in ensuring adherence to laws and regulations, fair treatment of content providers, and exclusion of unwanted information. For example, to make a job-matching recommendation socially fair, the matching should be independent of socially sensitive information, such as gender or race. We previously developed several recommenders satisfying recommendation independence, but these were all designed for a predicting-ratings task, whose goal is to predict the score that a user would give. Here we focus on another task, find-good-items, which aims to find items that a user would prefer. In this task, scores representing the degree of preference for items are first predicted, and the items having the largest scores are displayed in the form of a ranked list. We developed a preliminary algorithm for this task through a naive approach: enhancing independence between a preference score and sensitive information. We empirically show that although this algorithm can enhance the independence of a preference score, it is not fit for the purpose of enhancing independence in terms of a ranked list. This result indicates the need to invent a notion of independence that is suitable for use with a ranked list and applicable to a find-good-items task.
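To see why score-level independence need not carry over to a ranked list, one can compare a group's share of the top-k list with its share of the whole candidate set. A toy sketch (illustrative names, not the paper's code):

import numpy as np

def topk_exposure(scores, s, k):
    # Share of items from group s=1 in the top-k ranked list,
    # compared with that group's share in the full candidate set.
    order = np.argsort(scores)[::-1]   # rank items by descending score
    topk = np.asarray(s)[order[:k]]
    return topk.mean(), np.mean(s)

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
s      = np.array([1,   1,   1,   0,   0,   0])
print(topk_exposure(scores, s, k=3))  # (1.0, 0.5): top-3 is all group 1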
Recommendation Independence
The 1st Conference on Fairness, Accountability, and Transparency
Article @ Official Site: http://proceedings.mlr.press/v81/kamishima18a.html
Conference site: https://fatconference.org/2018/
Abstract:
This paper studies a recommendation algorithm whose outcomes are not influenced by specified information. It is useful in contexts where potentially unfair decisions should be avoided, such as job-applicant recommendations that must not be influenced by socially sensitive information. An algorithm that could exclude the influence of sensitive information would thus be useful for job-matching with fairness. We call this condition between a recommendation outcome and a sensitive feature Recommendation Independence, which is formally defined as statistical independence between the outcome and the feature. Our previous independence-enhanced algorithms simply matched the means of predictions between sub-datasets sharing the same sensitive value. However, this approach could not remove the sensitive information represented by the second or higher moments of the distributions. In this paper, we develop new methods that can deal with the second moment, i.e., the variance, of recommendation outcomes without increasing the computational complexity. These methods remove the sensitive information more strictly, and experimental results demonstrate that our new algorithms can more effectively eliminate the factors that undermine fairness. Additionally, we explore potential applications for independence-enhanced recommendation, and discuss its relation to other concepts, such as recommendation diversity.
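The role of the second moment can be illustrated with a post-hoc transform that standardizes predictions within each sensitive group so that group means and variances agree. The paper's methods impose this during learning, so the following is only a conceptual sketch (illustrative names):

import numpy as np

def match_two_moments(pred, s):
    # Rescale each group's predictions to the pooled mean and standard
    # deviation, removing sensitive information carried by the first
    # and second moments of the prediction distribution.
    pred = np.asarray(pred, float).copy()
    s = np.asarray(s)
    mu, sd = pred.mean(), pred.std()
    for v in np.unique(s):
        g = (s == v)
        pred[g] = (pred[g] - pred[g].mean()) / (pred[g].std() + 1e-12) * sd + mu
    return pred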
Alleviating Privacy Attacks Using Causal Models (Amit Sharma)
Machine learning models, especially deep neural networks, have been shown to reveal membership information about inputs in the training data. Such membership inference attacks are a serious privacy concern; for example, patients providing medical records to build a model that detects HIV would not want their identity to be leaked. Further, we show that the attack accuracy amplifies when the model is used to predict samples that come from a different distribution than the training set, which is often the case in real-world applications. Therefore, we propose the use of causal learning approaches, where a model learns the causal relationship between the input features and the outcome. An ideal causal model is known to be invariant to the training distribution and hence generalizes well to shifts between samples from the same distribution and across different distributions. First, we prove that models learned using causal structure provide stronger differential privacy guarantees than associational models under reasonable assumptions. Next, we show that causal models trained on sufficiently large samples are robust to membership inference attacks across different distributions of datasets, and that those trained on smaller sample sizes always have lower attack accuracy than corresponding associational models. Finally, we confirm our theoretical claims with experimental evaluation on 4 moderately complex Bayesian network datasets and a colored MNIST image dataset. Associational models exhibit up to 80% attack accuracy under different test distributions and sample sizes, whereas causal models exhibit attack accuracy close to a random guess. Our results confirm the value of the generalizability of causal models in reducing susceptibility to privacy attacks. Paper available at https://arxiv.org/abs/1909.12732
Instance Selection and Optimization of Neural Networks (ITII Industries)
Credit scoring is an important tool in financial institutions, used in credit granting decisions. Credit applications are scored by credit scoring models: those with high marks are treated as “good”, while those with low marks are regarded as “bad”. As data mining techniques develop, automatic credit scoring systems are warmly welcomed for their high efficiency and objective judgments. Many machine learning algorithms have been applied to training credit scoring models, and the ANN is one of them with good performance. This paper presents a higher-accuracy credit scoring model based on MLP neural networks trained with the back propagation algorithm. Our work focuses on enhancing credit scoring models in three aspects: optimizing the data distribution in datasets using a new method called Average Random Choosing; comparing the effects of training-validation-test instance numbers; and finding the most suitable number of hidden units. Another contribution of this paper is summarizing the tendency of models' scoring accuracy as the number of hidden units increases. The experimental results show that our methods can achieve high credit scoring accuracy with imbalanced datasets. Thus, credit granting decisions can be made by data mining methods using MLP neural networks.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org. If you would like to use this course content for teaching, please reach out to inquiry@deltanalytics.org
DoWhy: An End-to-End Library for Causal Inference (Amit Sharma)
In addition to efficient statistical estimators of a treatment's effect, successful application of causal inference requires specifying assumptions about the mechanisms underlying observed data and testing whether they are valid, and to what extent. However, most libraries for causal inference focus only on the task of providing powerful statistical estimators. We describe DoWhy, an open-source Python library that is built with causal assumptions as its first-class citizens, based on the formal framework of causal graphs to specify and test causal assumptions. DoWhy presents an API for the four steps common to any causal analysis: 1) modeling the data using a causal graph and structural assumptions, 2) identifying whether the desired effect is estimable under the causal model, 3) estimating the effect using statistical estimators, and finally 4) refuting the obtained estimate through robustness checks and sensitivity analyses. In particular, DoWhy implements a number of robustness checks including placebo tests, bootstrap tests, and tests for unobserved confounding. DoWhy is an extensible library that supports interoperability with other implementations, such as EconML and CausalML, for the estimation step.
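The four-step API reads roughly as follows in practice. This sketch uses DoWhy's bundled synthetic-data helper; the chosen estimator and refuter are just examples:

import dowhy.datasets
from dowhy import CausalModel

# synthetic data with a known effect size (beta) and common causes
data = dowhy.datasets.linear_dataset(beta=10, num_common_causes=3,
                                     num_samples=1000, treatment_is_binary=True)
# 1) model: encode causal assumptions as a graph
model = CausalModel(data=data["df"], treatment=data["treatment_name"],
                    outcome=data["outcome_name"], graph=data["gml_graph"])
# 2) identify: is the effect estimable under the stated assumptions?
estimand = model.identify_effect()
# 3) estimate: apply a statistical estimator
estimate = model.estimate_effect(estimand,
                                 method_name="backdoor.propensity_score_matching")
# 4) refute: robustness check with a placebo treatment
refutation = model.refute_estimate(estimand, estimate,
                                   method_name="placebo_treatment_refuter")
print(estimate.value, refutation)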
Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shapley Values (Sri Ambati)
This session was recorded in NYC on October 22nd, 2019 and can be viewed here: https://youtu.be/ngOBhhINWb8
Explainable Machine Learning with Shapley Values
Shapley values are a popular approach for explaining predictions made by complex machine learning models. In this talk I will discuss what problems Shapley values solve, give an intuitive presentation of what they mean, and show examples of how they can be used through the ‘shap’ python package.
Bio: I am a senior researcher at Microsoft Research. Before joining Microsoft, I did my Ph.D. studies at the Paul G. Allen School of Computer Science & Engineering of the University of Washington working with Su-In Lee. My work focuses on explainable artificial intelligence and its application to problems in medicine and healthcare. This has led to the development of broadly applicable methods and tools for interpreting complex machine learning models that are now used in banking, logistics, sports, manufacturing, cloud services, economics, and many other areas.
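A minimal usage pattern for the ‘shap’ package mentioned in the talk; the model and dataset here are placeholders chosen for illustration:

import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)         # fast, exact Shapley values for tree models
shap_values = explainer.shap_values(X[:100])  # one additive attribution per feature per row
shap.summary_plot(shap_values, X[:100])       # global view of feature effects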
The fuzzy soft set approach plays a crucial role in decision making when combined with the Dempster-Shafer theory of evidence. First, the uncertainty degrees of several parameters are obtained via grey relational analysis, which is applied to calculate the grey mean relational degree. Secondly, mass functions for different independent choices over several parameters are assigned according to these uncertainty degrees. Lastly, to aggregate the choices into a collective choice, Dempster's rule of evidence combination is utilized. The aforesaid soft-computing-based method has been applied to a decision-making problem.
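For reference, the combination step, Dempster's rule of evidence combination, has a standard closed form. A small self-contained sketch with frozensets as focal elements (illustrative, not the paper's code):

def dempster_combine(m1, m2):
    # Dempster's rule: m(A) is proportional to the sum of m1(B)*m2(C)
    # over all B, C with B & C == A, normalized by 1 - K, where K is
    # the total conflicting mass (pairs with empty intersection).
    combined, conflict = {}, 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = B & C
            if A:
                combined[A] = combined.get(A, 0.0) + mB * mC
            else:
                conflict += mB * mC
    return {A: v / (1.0 - conflict) for A, v in combined.items()}

m1 = {frozenset({"a"}): 0.6, frozenset({"a", "b"}): 0.4}
m2 = {frozenset({"b"}): 0.5, frozenset({"a", "b"}): 0.5}
print(dempster_combine(m1, m2))  # {a}: 3/7, {b}: 2/7, {a,b}: 2/7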
Outlier Detection Using Unsupervised Learning on High Dimensional Data (IJERA Editor)
Outliers in data mining can be detected using semi-supervised and unsupervised methods. Outlier detection in high-dimensional data faces various challenges arising from the curse of dimensionality: because of distance concentration, distances between points become less informative in high dimensions. Among outlier detection techniques, distance-based methods are commonly used to detect and label outlying points. To detect outliers effectively in high-dimensional data, we use unsupervised learning methods such as IQR and KNN with Antihub.
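Of the listed methods, the IQR rule is the simplest to state. A minimal sketch of Tukey's fences:

import numpy as np

def iqr_outliers(x, k=1.5):
    # Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule).
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.array([10, 12, 11, 13, 12, 95])   # 95 is an obvious outlier
print(iqr_outliers(x))  # [False False False False False  True]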
Statistical skepticism: How to use significance tests effectively (jemille6)
Prof. D. Mayo, presentation Oct. 12, 2017 at the ASA Symposium on Statistical Inference: “A World Beyond p < .05”, in the session “What are the best uses for P-values?”
Cyber Summit 2016: Privacy Issues in Big Data Sharing and Reuse (Cybera Inc.)
Although there is no well-established definition of big data, its main characteristic is its sheer volume. Large volumes of data are generated by people (e.g., via social media) and by technology, including sensors (e.g., cameras, microphones), trackers (e.g., RFID tags, web surfing behavior) and other devices (e.g., mobile phones, wearables for self-surveillance/quantified self), whether or not they are connected to the Internet of Things. However, the large volumes of data needed to capitalize on the benefits of big data can to some extent also be established by the reuse of existing data, a source that is sometimes overlooked.
Data can be reused for purposes similar to those for which it was initially collected, but also beyond these purposes. Similarly, data can be reused in its original context, but also beyond this context. However, such repurposing and recontextualizing of data may lead to privacy issues. For instance, data reuse may lead to issues regarding informed consent and informational self-determination. When the data is used for profiling and other types of predictive analytics, issues regarding stigmatization and discrimination may also arise. This presentation by Bart Custers, Head of Research, eLaw – Center for Law and Digital Technologies at Leiden University, The Netherlands, focuses on the privacy issues of big data sharing and reuse and how these issues could be addressed.
What is the impact of bias in data analysis, and how can it be mitigated? (Soumodeep Nanee Kundu)
Data analysis is a powerful tool for deriving insights and making informed decisions across various domains, from healthcare and finance to marketing and criminal justice. However, data analysis is not immune to bias, which can significantly impact the quality and fairness of the results. Bias in data analysis can stem from various sources, including biased data collection, algorithmic biases, and human biases in decision-making. In this article, we will explore the impact of bias in data analysis and discuss strategies for mitigating it to ensure more accurate, ethical, and fair outcomes.
Multiple Linear Regression Models in Outlier Detection (IJORCS)
Identifying anomalous values in a real-world database is important both for improving the quality of the original data and for reducing the impact of anomalous values in the process of knowledge discovery in databases. Such anomalous values give the data analyst useful information for discovering useful patterns. Through isolation, these data may be separated and analyzed. The analysis of outliers and influential points is an important step of regression diagnostics. In this paper, our aim is to detect the points that are very different from the other points: they do not seem to belong to a particular population and behave differently. If these influential points are removed, a different model results. The distinction between these points is not always obvious and clear, so several indicators are used for identifying and analyzing outliers. Existing methods of outlier detection are based on manual inspection of graphically represented data. In this paper, we present a new approach to automating the process of detecting and isolating outliers. The impact of anomalous values on the dataset is established by using two indicators, DFFITS and Cook's D. The process is based on modeling the human perception of exceptional values by using multiple linear regression analysis.
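Both indicators are available from a fitted OLS model in statsmodels. A minimal sketch on synthetic data (not the paper's data):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)
y[0] += 10.0                          # plant one influential point

res = sm.OLS(y, sm.add_constant(X)).fit()
infl = res.get_influence()
cooks_d, _ = infl.cooks_distance      # Cook's D per observation
dffits, _ = infl.dffits               # DFFITS per observation
print(np.argmax(cooks_d), np.argmax(np.abs(dffits)))  # both flag observation 0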
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
Classification with No Direct Discrimination (Editor IJCATR)
In many automated applications, a large amount of data is collected every day and is used to learn classifiers as well as to make automated decisions. If that training data is biased towards or against a certain entity, race, nationality, or gender, then the mined model may lead to discrimination. This paper elaborates a direct discrimination prevention method. The DRP algorithm modifies the original data set to prevent direct discrimination. Direct discrimination takes place when decisions are made based on discriminatory attributes specified by the user. The performance of this system is evaluated using measures such as MC, GC, DDPP, and DPDM. Different discrimination measures can be used to discover discrimination.
A Comparative Study on Privacy Preserving Data Mining Techniques (IJMER)
Privacy protection has become very important in recent years because of the increasing ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy. Data in its original form, however, typically contains sensitive information about individuals, and publishing such data will violate individual privacy. The current practice in data publishing is based on what type of data can be released and how that data may be used. Recently, PPDM has received immense attention in research communities, and many approaches have been proposed for different data publishing scenarios. In this comparative study we systematically summarize and evaluate different approaches to PPDM, study the challenges, differences, and requirements that distinguish PPDM from other related problems, and propose future research directions.
Exploring Author Gender in Book Rating and Recommendation
M. D. Ekstrand, M. Tian, M. R. I. Kazi, H. Mehrpouyan, and D. Kluver
https://doi.org/10.1145/3240323.3240373
RecSys2018 paper-reading session (2018-11-17) https://atnd.org/events/101334
WSDM2018 reading session
2018-04-14 @ Cookpad
https://atnd.org/events/95510
Offline A/B Testing for Recommender Systems
A. Gilotte, C. Calauzénes, T. Nedelec, A. Abraham, and S. Dollé
https://doi.org/10.1145/3159652.3159687
KDD2016 study session https://atnd.org/events/80771
Paper: “Why Should I Trust You?” Explaining the Predictions of Any Classifier
Authors: M. T. Ribeiro, S. Singh, and C. Guestrin
Paper link: http://www.kdd.org/kdd2016/subtopic/view/why-should-i-trust-you-explaining-the-predictions-of-any-classifier
WSDM2016 study session https://atnd.org/events/74341
Paper: Portrait of an Online Shopper: Understanding and Predicting Consumer Behavior
Authors: F. Kooti, K. Lerman, L. M. Aiello, M. Grbovic, and N. Djuric
Paper link: http://dx.doi.org/10.1145/2835776.2835831
Absolute and Relative Clustering
4th MultiClust Workshop on Multiple Clusterings, Multi-view Data, and Multi-source Knowledge-driven Clustering (Multiclust 2013)
Aug. 11, 2013 @ Chicago, U.S.A, in conjunction with KDD2013
Article @ Official Site: http://dx.doi.org/10.1145/2501006.2501013
Article @ Personal Site: http://www.kamishima.net/archive/2013-ws-kdd-print.pdf
Handnote: http://www.kamishima.net/archive/2013-ws-kdd-HN.pdf
Workshop Homepage: http://cs.au.dk/research/research-areas/data-intensive-systems/projects/multiclust2013/
Abstract:
Research into (semi-)supervised clustering has been increasing. Supervised clustering aims to group similar data, partially guided by the user's supervision. In this supervised clustering, there are many choices for formalization. For example, as a type of supervision, one can adopt labels of data points, must/cannot links, and so on. Given a real clustering task, such as grouping documents or image segmentation, users must confront the question "How should we mathematically formalize our task?" To help answer this question, we propose the classification of real clusterings into absolute and relative clusterings, which are defined based on the relationship between the resultant partition and the data set to be clustered. This categorization can be exploited to choose a type of task formalization.
Enhancement of the Neutrality in Recommendation
Workshop on Human Decision Making in Recommender Systems, in conjunction with RecSys 2012
Article @ Official Site: http://ceur-ws.org/Vol-893/
Article @ Personal Site: http://www.kamishima.net/archive/2012-ws-recsys-print.pdf
Handnote: http://www.kamishima.net/archive/2012-ws-recsys-HN.pdf
Program codes: http://www.kamishima.net/inrs
Workshop Homepage: http://recex.ist.tugraz.at/RecSysWorkshop2012
Abstract:
This paper proposes an algorithm for making recommendations so that neutrality toward a viewpoint specified by a user is enhanced. This algorithm is useful for avoiding decisions based on biased information. One such problem is the filter bubble, in which personalization technologies bias social decisions. To provide such a recommendation, we assume that a user specifies a viewpoint toward which the user wants to enforce neutrality, because a recommendation that is neutral from every viewpoint is no longer a recommendation. Given such a target viewpoint, we implemented an information-neutral recommendation algorithm by introducing a penalty term to enforce statistical independence between the target viewpoint and a preference score. We empirically show that our algorithm enhances independence toward the specified viewpoint, and then demonstrate how the sets of recommended items are changed.
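A hedged sketch of the objective's shape: squared rating error plus a penalty forcing mean predicted scores to agree across the two viewpoint values. The paper's regularizer enforces statistical independence more generally, so this mean-matching term is only a simplified stand-in; eta controls the accuracy-neutrality trade-off (all names are illustrative).

import numpy as np

def neutral_mf_loss(P, Q, R, V, eta=1.0):
    # R[u, i]: observed ratings with NaN for missing entries
    # V[u, i] in {0, 1}: value of the target viewpoint for each event
    # P, Q: user and item latent factor matrices
    S = P @ Q.T                         # predicted preference scores
    obs = ~np.isnan(R)
    err = ((R - S)[obs] ** 2).mean()    # accuracy term on observed ratings
    gap = S[obs & (V == 1)].mean() - S[obs & (V == 0)].mean()
    return err + eta * gap ** 2         # neutrality (mean-matching) penalty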
1. The Independence of Fairness-Aware Classifiers
Toshihiro Kamishima*, Shotaro Akaho*, Hideki Asoh*, and Jun Sakuma**
*National Institute of Advanced Industrial Science and Technology (AIST), Japan
**University of Tsukuba, Japan; and Japan Science and Technology Agency
IEEE International Workshop on Privacy Aspects of Data Mining (PADM)
@ Dallas, USA, Dec. 7, 2013
2. Introduction
[Romei+ 13]
Fairness-aware Data Mining
data analysis taking into account potential issues of fairness, discrimination, neutrality, or independence
Fairness-aware Classification
one of the major tasks of fairness-aware data mining: the problem of learning a classifier that predicts a class as accurately as possible under a fairness constraint, from potentially unfair data
✤ We use the term “fairness-aware” instead of “discrimination-aware,” because the word “discrimination” means classification in the ML context, and this technique is applicable to tasks other than avoiding discriminative decisions
3. Introduction
Calders & Verwer’s 2-naive-Bayes (CV2NB) is very simple, but highly effective
Other fairness-aware classifiers are equally accurate, but less fair
Why did the CV2NB method perform better?
model bias
deterministic decision rule
Based on our findings, we discuss how to improve the method
4. Outline
Applications of Fairness-aware Data Mining
prevention of unfairness, information-neutral recommendation, ignoring uninteresting information
Fairness-aware Classification
basic notations, fairness in data mining, fairness-aware classification, connection with PPDM, Calders & Verwer’s 2-naive-Bayes
Hypothetical Fair-factorization Naive Bayes
hypothetical fair-factorization naive Bayes (HFFNB), connection with other methods, experimental results
Why Did the HFFNB Method Fail?
model bias, deterministic decision rule
How to Modify the HFFNB Method
actual fair-factorization method, experimental results
Conclusion
5. Outline
Applications of Fairness-aware Data Mining
prevention of unfairness, information-neutral recommendation, ignoring uninteresting information
Fairness-aware Classification
basic notations, fairness in data mining, fairness-aware classification, connection with PPDM, Calders & Verwer’s 2-naive-Bayes
Hypothetical Fair-factorization Naive Bayes
hypothetical fair-factorization naive Bayes (HFFNB), connection with other methods, experimental results
Why Did the HFFNB Method Fail?
model bias, deterministic decision rule
How to Modify the HFFNB Method
actual fair-factorization method, experimental results
Conclusion
6. Prevention of Unfairness
[Sweeney 13]
A suspicious placement of keyword-matching advertisements:
advertisements indicating arrest records were more frequently displayed for names that are more popular among individuals of African descent than among those of European descent
[Figure: sample advertisements shown for African-descent names (“Arrested?”) and for European-descent names (“Located:”)]
This situation arose simply from the optimization of the click-through rate; no information about users’ race was used
Such unfair decisions can be prevented by FADM techniques
7. Recommendation Neutrality
[TED Talk by Eli Pariser, http://www.filterbubble.com/]
The Filter Bubble Problem
Pariser raised the concern that personalization technologies narrow and bias the topics of information provided to people
Example: the friend recommendation list on Facebook
To fit Pariser’s preferences, conservative friends were eliminated from his recommendation list, and he did not notice this fact
FADM technologies are useful for providing neutral information
8. Ignoring Uninteresting Information
[Gondek+ 04]
non-redundant clustering: find clusters that are as independent as possible from a given uninteresting partition, using a conditional information bottleneck method, a variant of the information bottleneck method
Example: clustering facial images
Simple clustering methods find two clusters: one contains only faces, and the other contains faces with shoulders
Data analysts consider this clustering useless and uninteresting
By ignoring this uninteresting information, more useful male and female clusters can be obtained
Uninteresting information can be excluded by FADM techniques
9. Outline
Applications of Fairness-aware Data Mining
prevention of unfairness, information-neutral recommendation, ignoring uninteresting information
Fairness-aware Classification
basic notations, fairness in data mining, fairness-aware classification, connection with PPDM, Calders & Verwer’s 2-naive-Bayes
Hypothetical Fair-factorization Naive Bayes
hypothetical fair-factorization naive Bayes (HFFNB), connection with other methods, experimental results
Why Did the HFFNB Method Fail?
model bias, deterministic decision rule
How to Modify the HFFNB Method
actual fair-factorization method, experimental results
Conclusion
10. Basic Notations
Y : objective variable
the result of a serious decision; e.g., whether or not to allow credit
S : sensitive feature
socially sensitive information; e.g., gender or religion
X : non-sensitive feature vector
all features other than the sensitive feature; non-sensitive, but may correlate with S
11. Fairness in Data Mining
Fairness in Data Mining
Sensitive information must not influence the target value
red-lining effect: everyone is inclined to simply eliminate the sensitive feature from the calculation, but this action is insufficient, because non-sensitive features that correlate with sensitive features also contain sensitive information
Removing S only makes Y and S conditionally independent: Y \perp S \mid X
A sensitive feature and a target variable must be unconditionally independent: Y \perp S
13. Fairness-aware Classification
[Figure: the space of distributions over (Y, X, S), showing the model sub-space \hat{\Pr}[Y, X, S; \Theta], the fair sub-space satisfying Y \perp S, the true distribution \Pr[Y, X, S], its best approximation \hat{\Pr}[Y, X, S; \Theta^*], the fair true distribution \Pr^{\dagger}[Y, X, S], and the fair model \hat{\Pr}^{\dagger}[Y, X, S; \Theta^*]]
Fairness-aware classification: find a fair model that approximates the true distribution under the fairness constraints, instead of approximating a fair true distribution
We want to approximate the fair true distribution, but samples from this distribution cannot be obtained, because samples from the real world are potentially unfair
14. Privacy-Preserving Data Mining
Fairness in Data Mining
the independence between an objective variable Y and a sensitive feature S
from the information-theoretic perspective: the mutual information between Y and S is zero
from the viewpoint of privacy preservation: protection of sensitive information when an objective variable is exposed
Differences from PPDM
introducing randomness is occasionally inappropriate for serious decisions, such as job applications
disclosure of identity is generally not problematic in FADM
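The information-theoretic formalization above can be made concrete in a few lines of code. The following is a minimal sketch, not the authors' implementation; the normalization by sqrt(H(Y)H(S)) follows the normalized prejudice index (NPI) that appears later in these slides, and the plug-in estimators are our own assumption.

import numpy as np

def entropy(v):
    """Empirical entropy (in nats) of a discrete array."""
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(y, s):
    """Empirical mutual information I(Y; S); zero iff Y and S are
    independent in the sample."""
    y, s = np.asarray(y), np.asarray(s)
    mi = 0.0
    for yv in np.unique(y):
        for sv in np.unique(s):
            p_joint = np.mean((y == yv) & (s == sv))
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (np.mean(y == yv) * np.mean(s == sv)))
    return mi

def npi(y, s):
    """Normalized prejudice index: I(Y; S) / sqrt(H(Y) H(S))."""
    return mutual_information(y, s) / np.sqrt(entropy(y) * entropy(s))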
15. Calders-Verwer’s 2-Naive-Bayes
[Calders+ 10]
Unfair decisions are modeled by introducing the dependence of X on S as well as on Y
[Graphical models: in a plain naive Bayes, S and X are conditionally independent given Y; in the Calders-Verwer two-naive-Bayes (CV2NB), the non-sensitive features in X are conditionally independent given Y and S]
✤ It is as if two naive Bayes classifiers are learned, one for each value of the sensitive feature; that is why this method was named the 2-naive-Bayes
16. Calders-Verwer’s 2-Naive-Bayes
[Calders+ 10]
The parameters are initialized by the corresponding sample distributions:
\hat{\Pr}[Y, X, S] = \hat{\Pr}[Y, S] \prod_i \hat{\Pr}[X_i \mid Y, S]
\hat{\Pr}[Y, S] is then modified so as to improve fairness, turning the estimated model \hat{\Pr}[Y, S] into a fair estimated model \hat{\Pr}^{\dagger}[Y, S], while keeping the updated marginal distribution close to \hat{\Pr}[Y]:
while Pr[Y=1 | S=1] - Pr[Y=1 | S=0] > 0:
    if (# of data classified as “1”) < (# of “1” samples in the original data):
        increase Pr[Y=1, S=0] and decrease Pr[Y=0, S=0]
    else:
        increase Pr[Y=0, S=1] and decrease Pr[Y=1, S=1]
    reclassify the samples using the updated model Pr[Y, S]
The joint distribution is updated until its fairness is enhanced
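The update loop above can be sketched in code. This is a minimal illustration under our own assumptions, not the authors' implementation: Y and S are binary, `joint` is the 2x2 table Pr[Y, S] indexed as joint[y, s], `delta` is a fixed step size, and `classify` is a hypothetical helper that reclassifies the training samples under the current model.

import numpy as np

def cv2nb_postprocess(joint, y_train, classify, delta=0.01):
    """Adjust Pr[Y, S] until the discrimination score
    Pr[Y=1|S=1] - Pr[Y=1|S=0] is non-positive."""
    n_pos = np.sum(y_train == 1)  # number of "1" samples in the original data
    while joint[1, 1] / joint[:, 1].sum() - joint[1, 0] / joint[:, 0].sum() > 0:
        y_pred = classify(joint)  # reclassify samples with the current model
        if np.sum(y_pred == 1) < n_pos:
            joint[1, 0] += delta  # increase Pr[Y=1, S=0]
            joint[0, 0] -= delta  # decrease Pr[Y=0, S=0]
        else:
            joint[0, 1] += delta  # increase Pr[Y=0, S=1]
            joint[1, 1] -= delta  # decrease Pr[Y=1, S=1]
    return joint

A real implementation would also clip the entries of `joint` so that they remain valid probabilities.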
17. Outline
Applications of Fairness-aware Data Mining
prevention of unfairness, information-neutral recommendation, ignoring uninteresting information
Fairness-aware Classification
basic notations, fairness in data mining, fairness-aware classification, connection with PPDM, Calders & Verwer’s 2-naive-Bayes
Hypothetical Fair-factorization Naive Bayes
hypothetical fair-factorization naive Bayes (HFFNB), connection with other methods, experimental results
Why Did the HFFNB Method Fail?
model bias, deterministic decision rule
How to Modify the HFFNB Method
actual fair-factorization method, experimental results
Conclusion
18. Hypothetical Fair-factorization
Hypothetical Fair-factorization
a modeling technique to make a classifier fair: in the classification model, the sensitive feature and the target variable are decoupled
[Graphical model of a fair-factorized classifier: Y and S are disconnected, and both point to X]
By this technique, the sensitive feature and the target variable become statistically independent
Hypothetical Fair-factorization Naive Bayes (HFFNB)
the fair-factorization technique applied to a naive Bayes model:
\hat{\Pr}^{\dagger}[Y, X, S] = \hat{\Pr}^{\dagger}[Y]\, \hat{\Pr}^{\dagger}[S] \prod_k \hat{\Pr}^{\dagger}[X^{(k)} \mid Y, S]
Under the ML or MAP principle, the model parameters can be derived by simply counting training examples
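Because the ML estimates reduce to counting, the fitting step admits a short sketch. The code below is our own illustration (with Laplace smoothing added to avoid zero counts), assuming discrete features encoded as small non-negative integers; it is not the authors' program.

import numpy as np

def fit_hffnb(X, y, s):
    """Counting estimates for the fair factorization
    Pr†[Y, X, S] = Pr†[Y] Pr†[S] prod_k Pr†[X^(k) | Y, S]."""
    n, d = X.shape
    pr_y = np.bincount(y) / n            # Pr†[Y] from class counts
    pr_s = np.bincount(s) / n            # Pr†[S] from sensitive-value counts
    tables = []
    for k in range(d):
        tbl = np.ones((pr_y.size, pr_s.size, X[:, k].max() + 1))  # Laplace prior
        for yi, si, xi in zip(y, s, X[:, k]):
            tbl[yi, si, xi] += 1
        tables.append(tbl / tbl.sum(axis=2, keepdims=True))  # Pr†[X^(k) | Y, S]
    return pr_y, pr_s, tables

At prediction time the factor Pr†[S] is constant over classes, so the predicted class is simply argmax_y of pr_y[y] * prod_k tables[k][y, s, x_k].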
19. Connection with the ROC Decision Rule
[Kamiran+ 12]
In the non-fairized case, a new object (x, s) is classified into class 1 if
\hat{\Pr}[Y{=}1 \mid x, s] \ge 1/2 \equiv p
Kamiran et al.’s ROC decision rule
A fair classifier is built by changing the decision boundary, p, according to the value of the sensitive feature
The HFFNB method is equivalent to changing the decision boundary, p, to
p' = \frac{\hat{\Pr}[Y \mid S]\,(1 - \hat{\Pr}[Y])}{\hat{\Pr}[Y] + \hat{\Pr}[Y \mid S] - 2\,\hat{\Pr}[Y]\,\hat{\Pr}[Y \mid S]}
(by Elkan’s theorem regarding cost-sensitive learning)
The HFFNB can therefore be considered a special case of the ROC method
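As a quick sanity check on the reconstructed threshold, a few lines suffice; `p_y` and `p_ys` stand for \hat{\Pr}[Y=1] and \hat{\Pr}[Y=1 | S=s], names of our own choosing.

def roc_threshold(p_y, p_ys):
    """Per-group decision boundary p' that makes HFFNB an instance of the ROC rule."""
    return p_ys * (1.0 - p_y) / (p_y + p_ys - 2.0 * p_y * p_ys)

# When the class distribution does not depend on S, the boundary reduces to 1/2.
assert abs(roc_threshold(0.3, 0.3) - 0.5) < 1e-12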
20. CV2NB vs HFFNB
The CV2NB and HFFNB methods are compared in their accuracy and fairness.
Accuracy: a larger value indicates a more accurate prediction.
Unfairness (Normalized Prejudice Index): a larger value indicates an unfairer prediction.

              HFFNB        CV2NB
Accuracy      0.828        0.828
Unfairness    1.52×10⁻²    6.89×10⁻⁶

The HFFNB method is as accurate as the CV2NB method, but it makes much unfairer predictions. WHY?
21. Outline
Applications of Fairness-aware Data Mining
prevention of unfairness, information-neutral recommendation, ignoring uninteresting information
Fairness-aware Classification
basic notations, fairness in data mining, fairness-aware classification, connection with PPDM, Calders & Verwer’s 2-naive-Bayes
Hypothetical Fair-factorization Naive Bayes
hypothetical fair-factorization naive Bayes (HFFNB), connection with other methods, experimental results
Why Did the HFFNB Method Fail?
model bias, deterministic decision rule
How to Modify the HFFNB Method
actual fair-factorization method, experimental results
Conclusion
22. Why Did the HFFNB Method Fail?
HFFNB method: the fairness constraint is explicitly imposed on the model
CV2NB method: the fairness of the model is enhanced by post-processing
Though both models are designed to enhance fairness, the CV2NB method consistently learns a much fairer model
Two reasons why the modeled independence is damaged:
Model Bias: the difference between the model and true distributions
Deterministic Decision Rule: class labels are not probabilistically generated, but deterministically chosen by the decision rule
23. Model Bias
In a hypothetically fair-factorized model, data are assumed to be generated according to the estimated distribution:
\hat{\Pr}[Y]\, \hat{\Pr}[S]\, \hat{\Pr}[X \mid Y, S]
In reality, input objects are first generated from the true distribution, and each object is then labeled according to the estimated distribution:
\hat{\Pr}[Y \mid X, S]\, \Pr[X, S]   (estimated posterior × true input distribution)
These two distributions diverge, especially when the model bias is high
24. Model Bias
Synthetic data: the data are truly generated from a naive Bayes model, and the model bias is controlled by the number of features.
Changes of the NPI (unfairness), from high model bias to low model bias:

           high bias                  low bias
HFFNB      1.02×10⁻¹    1.10×10⁻¹    1.28×10⁻¹
CV2NB      5.68×10⁻⁴    9.60×10⁻²    1.28×10⁻¹

As the model bias decreases, the difference in fairness between the two methods decreases.
The divergence between the estimated and true distributions caused by the model bias damages the fairness of classification.
25. Deterministic Decision Rule
In a hypothetically fair-factorized model, labels are assumed to be generated probabilistically according to the distribution \hat{\Pr}[Y \mid X, S]
In practice, predicted labels are generated by the deterministic decision rule:
y^{*} = \arg\max_{y \in \mathrm{Dom}(Y)} \hat{\Pr}[y \mid X, S]
Labels generated by these two processes do not agree in general
26. Deterministic Decision Rule
A simple classification model: one binary label and one binary feature
the class distribution is uniform: \hat{\Pr}[Y{=}1] = 0.5
Y* is deterministically decided: Y^{*} = \arg\max_{y} \hat{\Pr}[y \mid X]
the parameters \Pr[X{=}1 \mid Y{=}1] and \Pr[X{=}1 \mid Y{=}0] are varied over [0, 1]
[Figure: surface plot of E[Y*] over the (\Pr[X{=}1 \mid Y{=}1], \Pr[X{=}1 \mid Y{=}0]) plane, compared with the constant plane E[Y] = 0.5; the two coincide only where E[Y*] = E[Y]]
E[Y*] and E[Y] do not agree in general
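The disagreement is easy to reproduce numerically. The toy computation below (ours, mirroring the setting of this slide) evaluates E[Y*] exactly by enumerating the two values of X; with Pr[X=1|Y=1] = 0.9 and Pr[X=1|Y=0] = 0.3 it yields E[Y*] = 0.6, while E[Y] = 0.5.

def e_ystar(p_x1_given_y1, p_x1_given_y0, p_y1=0.5):
    """Exact E[Y*] for one binary feature under the deterministic rule
    Y* = argmax_y Pr[y | X]."""
    e = 0.0
    for x, (a, b) in [(1, (p_x1_given_y1, p_x1_given_y0)),
                      (0, (1 - p_x1_given_y1, 1 - p_x1_given_y0))]:
        p_x = a * p_y1 + b * (1 - p_y1)          # Pr[X = x]
        post = a * p_y1 / p_x if p_x > 0 else 0  # Pr[Y=1 | X = x] by Bayes' rule
        e += p_x * (1.0 if post > 0.5 else 0.0)  # deterministic decision
    return e

print(e_ystar(0.9, 0.3))  # 0.6, whereas E[Y] = 0.5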
27. Outline
Applications of Fairness-aware Data Mining
prevention of unfairness, information-neutral recommendation, ignoring uninteresting information
Fairness-aware Classification
basic notations, fairness in data mining, fairness-aware classification, connection with PPDM, Calders & Verwer’s 2-naive-Bayes
Hypothetical Fair-factorization Naive Bayes
hypothetical fair-factorization naive Bayes (HFFNB), connection with other methods, experimental results
Why Did the HFFNB Method Fail?
model bias, deterministic decision rule
How to Modify the HFFNB Method
actual fair-factorization method, experimental results
Conclusion
28. Actual Fair-factorization
The reason the HFFNB method failed is that it ignores the influence of the model bias and the deterministic decision rule
Actual Fair-factorization
the class and the sensitive feature are decoupled not over the estimated distribution, but over the actual distribution
as in the hypothetical fair-factorization, a class label and a sensitive feature are made statistically independent
we consider not the estimated distribution, \hat{\Pr}[Y, X, S], but the actual distribution, \hat{\Pr}[Y \mid X, S]\, \Pr[X, S]
as class labels, we adopt the deterministically decided labels
29. Actual Fair-factorization Naive Bayes
Actual Fair-factorization Naive Bayes (AFFNB)
the actual fair-factorization technique applied to a naive Bayes model
model bias: the product involving the true distribution, \hat{\Pr}[Y \mid X, S]\, \Pr[X, S], is approximated by the sample mean (1/|D|) \sum_{(x, s) \in D} \hat{\Pr}[Y \mid X{=}x, S{=}s]
deterministic decision rule: instead of using a distribution of class labels, we count up the deterministically decided class labels
Y* and S are made independent, under the constraint that the marginal distributions of Y* and S equal the corresponding sample distributions
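Our reading of this construction can be sketched as follows; `predict` is a hypothetical helper returning the deterministic labels argmax_y \hat{\Pr}[y | x, s], and the 2x2 layout assumes binary Y* and S.

import numpy as np

def empirical_joint(X, s, predict):
    """Pr[Y*, S] estimated by counting deterministically decided labels."""
    y_star = predict(X, s)               # deterministic labels Y*
    joint = np.zeros((2, 2))
    for yi, si in zip(y_star, s):
        joint[yi, si] += 1
    return joint / len(s)

def fair_joint(joint):
    """The fair target Pr†[Y*, S] = Pr[Y*] Pr[S]: Y* and S independent,
    with both marginal distributions preserved."""
    return np.outer(joint.sum(axis=1), joint.sum(axis=0))

The product of marginals is exactly the joint distribution that makes Y* and S independent while leaving both marginals at their sample values.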
30. CV2NB and AFFNB
The CV2NB and AFFNB methods are compared in their accuracy and fairness.

              AFFNB        CV2NB
Accuracy      0.828        0.828
Unfairness    5.43×10⁻⁶    6.89×10⁻⁶

CV2NB and AFFNB are equally accurate as well as equally fair.
The superiority of the CV2NB method comes from considering the independence not over the estimated distribution, but over the actual distribution of the class label and the sensitive feature.
31. Conclusion
Contributions
After reviewing the fairness-aware classification task, we focused on why the CV2NB method can attain fairer results than other methods
We theoretically and empirically showed the reason by comparing it with a simple alternative naive Bayes modified by a hypothetical fair-factorization technique
Based on our findings, we developed a modified version, the actual fair-factorization technique, and showed that it drastically improves fairness
Future Work
We plan to apply our actual fair-factorization technique to other classification methods, such as logistic regression or support vector machines
32. Program Codes and Data Sets
Fairness-aware Data Mining
http://www.kamishima.net/fadm
Information-neutral Recommender System
http://www.kamishima.net/inrs
Acknowledgements
This work is supported by MEXT/JSPS KAKENHI Grant Numbers 16700157, 21500154, 23240043, 24500194, and 25540094.