AN EVALUATION OF DATA MINING TECHNIQUES IN
THE CREDIT AND DATA BUREAU SECTOR
Gideon Stephanus du Toit
A research report submitted to the Faculty of Commerce, Law and Management, University
of the Witwatersrand, Johannesburg, in partial fulfillment of the requirements of the degree
of Master of Business Administration.
Johannesburg, March 2006
ABSTRACT
This research investigated the scope of data mining, common data mining techniques and
algorithms and their uses, the possible future direction of data mining techniques, and
the possible value and uses of these techniques.
A synthesis of the literature review gave a definition, scope, techniques and uses of data
mining. A panel of experts was constituted to discover the uses, techniques, benefits and
possible future benefits of the techniques in this sector.
Thirty-five different techniques for data mining were found and these were classified into 4
different sections.
Eighteen separate applications of data mining in this sector were uncovered.
The research demonstrated the use of data mining in this sector although many techniques
were not yet being used, especially amongst the smaller bureaus, but the possible future
benefits of data mining would lead to the greater use of more techniques.
Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page ii of vii
DECLARATION
I declare that this report is my own, unaided work. It is submitted in partial fulfillment of the
requirements for the degree of Master of Business Administration at the University of the
Witwatersrand, Johannesburg. It has not been submitted for any degree or examination in
any other university.
Gideon Stephanus du Toit
April 2006
ACKNOWLEDGEMENTS
The assistance provided by a number of people in completing this research is greatly
appreciated.
Thanks to my wife, Christelle du Toit, for her unwavering support, love and assistance.
My supervisor, Professor Neil Duffy, who provided his support and encouragement willingly
and freely.
The support of the members of the expert panel and the faith they showed by allowing me
to conduct this research, and without whom this report would not have been possible.
TABLE OF CONTENTS
ABSTRACT
DECLARATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDICES
CHAPTER 1: INTRODUCTION
1.1 THE RELEVANCE OF DATA MINING
1.2 THE IMPORTANCE OF THE STUDY
1.3 THE RESEARCH OBJECTIVES
1.4 INTRODUCTION
1.5 THE STATEMENT OF THE PROBLEM
1.6 THE SUB-PROBLEMS
1.7 THE DELIMITATIONS
1.8 DEFINITION OF TERMS
1.9 ASSUMPTIONS
1.10 THE RESEARCH STRUCTURE
CHAPTER 2: LITERATURE REVIEW
2.1 DATA AND DATA MINING IN THE BUSINESS CONTEXT
2.2 DATA MINING TECHNIQUES AND ALGORITHMS
2.2.1 Pure statistics
2.2.2 Artificial Intelligence (AI) methods
2.2.3 Genetic algorithms and genetic programming
2.2.4 Decision trees
2.2.5 Data visualisation
2.2.6 Rule induction methods
2.2.7 Data warehousing
2.3 THE USES OF THESE TECHNIQUES AND ALGORITHMS
2.3.1 Targeting / Predictive / Descriptive models
2.3.2 Fraud prediction and identification
2.3.3 Going concern prediction
2.4 THE FUTURE DIRECTION OF DATA MINING AND ITS TECHNIQUES AND THE POSSIBLE USES OF THIS
CHAPTER 3: RESEARCH QUESTIONS
3.1 WHAT ARE COMMON DATA MINING TECHNIQUES AND ALGORITHMS?
3.2 WHAT ARE THE USES OF THESE TECHNIQUES?
3.3 WHAT IS THE FUTURE DIRECTION OF DATA MINING AND ITS TECHNIQUES IN THIS SECTOR AND THE POSSIBLE USES THEREOF?
CHAPTER 4: RESEARCH METHODOLOGY
4.1 QUALITATIVE RESEARCH PARADIGM
4.2 DESCRIPTIVE RESEARCH DESIGN
4.3 POPULATION AND SAMPLE
4.4 DATA COLLECTION
4.5 DATA ANALYSIS
4.6 VALIDITY AND RELIABILITY
4.6.1 Internal validity
4.6.2 External validity
4.6.3 Reliability
4.7 COMPLETION OF THE RESEARCH REPORT
CHAPTER 5: RESULTS
5.1 COMMON DATA MINING TECHNIQUES AND ALGORITHMS
5.1.1 Descriptive statistics
5.1.2 Inferential statistics
5.1.3 Data reduction techniques
5.1.4 Numerical techniques
5.1.5 Other techniques
5.2 THE USES OF DATA MINING IN THE CREDIT AND DATA BUREAU SECTOR
5.3 FUTURE TECHNIQUES AND THEIR POSSIBLE USES
CHAPTER 6: DISCUSSION
6.1 COMMON DATA MINING TECHNIQUES AND ALGORITHMS
6.1.1 Descriptive statistics
6.1.2 Inferential statistics
6.1.3 Data reduction techniques
6.1.4 Numerical techniques
6.1.5 Other techniques
6.2 THE USES OF THESE TECHNIQUES
6.3 FUTURE TECHNIQUES AND THEIR POSSIBLE USES
CHAPTER 7: CONCLUSION AND RECOMMENDATIONS
7.1 BUSINESS IMPLICATIONS
7.2 SUGGESTIONS FOR FURTHER RESEARCH
REFERENCES
APPENDIX A: THE WRITTEN REQUEST
APPENDIX B: TELEPHONE PROTOCOL
APPENDIX C: INTERVIEW PROTOCOL
END
LIST OF TABLES
Table 1 Organisations that agreed to partake in the research
Table 2 Data Mining Techniques used in this sector
Table 3 Summary of Data Mining techniques in this sector
Table 4 Uses of Data Mining in this sector
LIST OF FIGURES
Figure 1 Research on basic scientific issues will influence data mining applications in many other areas
Figure 2 Data mining techniques
LIST OF APPENDICES
APPENDIX A: THE WRITTEN REQUEST
APPENDIX B: TELEPHONE PROTOCOL
Chapter 1: Introduction
1.1 The relevance of data mining
Data mining has a tradition of research and practice going back to the early 1960s, when it
was originally known as statistical analysis and, in a cruder form, as "data dredging", a term
implying that there was no specific predetermined hypothesis or aim. Data mining has
evolved from statistical analysis using classical statistical techniques such as penetration
analysis, univariate analysis, correlation, regression, chi-square and cross tabulation to being
augmented by more diverse techniques such as fuzzy logic, heuristic reasoning and
neural networks. Since the 1990s the best approaches have been packaged together along
with newer and even more powerful techniques, and the results are being presented in much
more user-friendly and effective ways (Kimball et al., 1998:19; Parr Rud, 2001).
Early applications of data mining were in specialist fields such as geological research
(searching for natural resources, e.g. mining exploration) and meteorological research (weather
forecasting). Data mining is presently applied in areas such as retailing, the insurance,
financial and credit industries, as well as the medical domain (Benyon-Davies, 1996).
In today's intensely competitive global marketplace, enterprise decision makers look for
ways to increase competitive advantages by eliminating inefficiencies, optimizing internal
operations, and maximizing relationships with all organizational stakeholders (employees,
customers, partners, and shareholders). One area that assists in this is the deployment of
data mining technologies to leverage data resources to enhance their decision-making
capabilities (Nemati & Barko, 2003).
Knowledge discovery / data mining techniques were formed from several decades of research
into machine learning, pattern recognition, statistics and visualisation techniques and
have been a research topic of long-standing interest (Vickery, 1997).
The techniques used in data mining give knowledge workers deeper insights than those
provided by management information systems, standard production reports, managed queries,
executive information systems, and online analytical processing.
Techniques employed in data mining to facilitate the finding of previously hidden information
include the capabilities to discover rules, classify, partition, associate, and optimise. In a
dynamic environment data continuously changes and the timeliness of using data mining
translates into a big advantage for the user. The ability to seamlessly automate and embed
some of the mundane, repetitive and tedious steps traditionally used is another advantage of
data mining (Gargano & Raggad, 1999).
1.2 The importance of the study
IBM defined four major operations for data mining (Technology Forecast, 1997, cited in
Lee & Siau, 2001):
1. Predictive modeling: using inductive reasoning techniques such as neural networks
and inductive reasoning algorithms to create predictive models.
2. Database segmentation: using statistical clustering techniques to partition data into
clusters.
3. Link analysis: identifying useful associations between data.
4. Deviation detection: detecting and explaining why certain records cannot be put into
specific segments.
Lee & Siau (2001) also defined three main steps in data mining. These steps are:
1. Preparing the data,
2. Reducing the data and,
3. Looking for valuable information in the data.
The specific approaches may differ from company to company and researcher to researcher.
Fayyad, Piatetsky-Shapiro & Smyth (1996) proposed the following steps:
1. Retrieving the data from a large database.
2. Selecting the relevant subset to work with.
3. Deciding on the appropriate sampling system, cleaning the data and dealing with
missing fields and records.
4. Applying the appropriate transformations, dimensionality reduction, and projections.
5. Fitting models to the preprocessed data.
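The five steps above can be illustrated with a minimal sketch. The records, field names, mean imputation, min-max rescaling and the trivial "model" below are all assumptions made for illustration; they are not taken from Fayyad, Piatetsky-Shapiro & Smyth (1996), who do not prescribe specific methods for each step.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing field.
records = [
    {"id": 1, "income": 42000.0, "balance": 1200.0, "region": "north"},
    {"id": 2, "income": None, "balance": 300.0, "region": "north"},
    {"id": 3, "income": 58000.0, "balance": 4100.0, "region": "south"},
    {"id": 4, "income": 61000.0, "balance": None, "region": "north"},
    {"id": 5, "income": 39000.0, "balance": 900.0, "region": "north"},
]


def select_subset(rows, region):
    """Steps 1-2: retrieve the rows and keep only the relevant subset."""
    return [r for r in rows if r["region"] == region]


def clean(rows, fields):
    """Step 3: replace missing values with the mean of the observed values."""
    cleaned = [dict(r) for r in rows]
    for f in fields:
        observed = [r[f] for r in cleaned if r[f] is not None]
        fill = mean(observed)
        for r in cleaned:
            if r[f] is None:
                r[f] = fill
    return cleaned


def transform(rows, fields):
    """Step 4: rescale each field to the range [0, 1]."""
    out = [dict(r) for r in rows]
    for f in fields:
        lo = min(r[f] for r in out)
        hi = max(r[f] for r in out)
        span = (hi - lo) or 1.0  # avoid division by zero for constant fields
        for r in out:
            r[f] = (r[f] - lo) / span
    return out


def fit_model(rows, fields):
    """Step 5: fit a trivial 'model', here the mean of each prepared field."""
    return {f: mean(r[f] for r in rows) for f in fields}


fields = ["income", "balance"]
subset = select_subset(records, "north")
prepared = transform(clean(subset, fields), fields)
model = fit_model(prepared, fields)
```

In practice each step would be far more involved; the sketch only shows how the output of one step becomes the input of the next.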
A classification of techniques, algorithms, and uses in data mining, and possible future
direction of data mining in this sector will provide managers and business users with a
reference, source of understanding and a means to verify the claims made by this sector
about the results of the data mining and the subsequent release of information and data
sets.
The results of data mining exercises and some of the generic uses of data mining and
techniques in this field may be of use to other users. They may allow data miners themselves
to adapt some of these algorithms or techniques and to consider the possible future
direction or use of data mining. An understanding of the uses of the techniques will also
enable managers to better motivate use of the data mining services and data value-add of
the bureaus.
1.3 The research objectives
Based on the background provided above, the research objectives become clearer:
• To determine what common data mining techniques and algorithms are and what the
uses of these techniques are;
• To determine what the future direction of data mining techniques in this sector is
and the possible uses of these future techniques.
These objectives will aid the reader in understanding some of the benefits and uses that
could be achieved for their organisation through the use of the data mining techniques and
the subsequent data output by the vendors in this sector and how the users may benefit
from understanding the techniques used and their value.
The objectives of this research will be achieved by answering each of the research questions
posed.
1.4 Introduction
Many businesses today make use of data provided by credit and data bureaus and also of the
data mining techniques used by these bureaus (sometimes inadvertently and unknowingly).
These include businesses like marketing research companies, banks, retailers, micro-lenders,
brokers and employment agencies, which have long been avid consumers of the data
and techniques used by the bureaus. This usage has been accentuated by increased
interest in making efficient use of organisational data through data mining and data
warehousing. Usage of all forms of data and data mining is becoming more popular and more
frequent, and this is likely to continue. The algorithms
and techniques used in data mining are complex and require a solid understanding of
statistical methods and other techniques (Cabena, Hadjinian, Stadler, Verhees & Zanasi,
1998; Beynon, Curry & Morgan, 2001).
Credit and Data Bureaus are ideal for this research since they collect and mine enormous
amounts of data. Data Bureaus like Effective Intelligence hold more than 20,000,000 records
(J. Ardagh from Effective Intelligence, personal communication, 30 January 2005) on credit
active consumers in South Africa and Credit Bureaus like Kredit Inform hold more than
1,000,000 records (M. Hendriksen from Kredit Inform, personal communication, 30 January
2005) on business entities in South Africa and process more than 1,000,000 online requests
for information daily. This information and the applied data mining is used in more than
3,000 businesses (C. Capper from Experian, personal communication, 30 January 2005) in
South Africa to make credit decisions, for direct marketing, to predict fraud, consumer
behaviour or the propensity of a business to default.
1.5 The Statement of the problem
The aim of the research is to identify and evaluate data mining techniques in the Credit and
Data Bureau sector and to expand on the body of knowledge available to managers in this
sector, and users of these data and techniques as clients of this sector.
Describing and classifying the main data mining algorithms and techniques, and commenting
on the generic uses to the end-user, tools used and possible future direction of data
mining provide the background for this study.
The aim of the research and sub-problems are based on a study done by Chidley (2002) on
an evaluation of data mining techniques in the banking sector. This was expanded to include
research into the possible future direction of data mining in this sector and the uses thereof.
These objectives should assist managers and business people who interact with this sector
to better understand the techniques used, and the benefits and uses of these techniques.
Users get their data from these vendors and are not sure what the vendors have done to this
data in order to get the delivered results. If users understand the uses of data mining and
the techniques and tools used they could build on this or even request new or unmined data
to analyse.
1.6 The sub-problems
I. What are common data mining techniques and algorithms?
II. What are the uses of these techniques?
III. a. What is the future direction of data mining techniques in this sector?
b. And the possible uses of these future techniques?
1.7 The delimitations
This study will not compare software tools used by the bureaus.
1.8 Definition of terms
Data mining - Data mining is the process of extracting valuable knowledge from large
databases and using it to make decisions critical to some organisations. There are a number
of features to this definition:
I. Data mining is concerned with the discovery of hidden, unexpected patterns of data.
II. Data mining usually works on large volumes of data. Frequently large volumes are
needed to produce reliable conclusions in relation to data patterns.
III. Data mining is useful in making critical organisational decisions, particularly those of
a strategic nature. (Benyon-Davies, 1996; Kimball, Reeves, Ross & Thornthwaite,
1998).
1.9 Assumptions
The assumptions made are based on what Chidley (2002) used in his study and are also
applicable here:
I. That the experts approached for the study will have sufficient skills and experience in
the field for the report to present a true reflection of the uses to which data mining is
being put;
II. That the experts' views were representative of those in this sector.
1.10 The research structure
The research was based on the literature review and the results from interviewing experts in
this sector in data mining.
The literature review reveals current definitions of data mining and techniques (including
algorithms as applicable) used and the uses of these techniques as well as possible future
directions of techniques and data mining. The chapter concludes with three research
questions.
The results are presented in Chapter Five. The results of a synthesis of the literature, in
order to answer two of the three research questions, are presented. This chapter also
describes the results of the interviews with members of the expert panel.
In Chapter Five, the applications of data mining that were found in the interview process are
reviewed. This allows comparisons to be made between the uses discovered during the
literature review and the uses suggested by the expert panel. Appropriate conclusions are
drawn in Chapter Six.
A similar process is followed with regards to data mining techniques and algorithms. A
contrast is drawn between the techniques and algorithms mentioned in the literature and the
techniques being used in the Credit and Data Bureau sector.
Chapter Five is finalized with a summary and discussion of the expert panel's views on the
possible future techniques of data mining and possible uses of these techniques in this
sector.
The research is concluded with a chapter for conclusions and recommendations. In this
chapter, the research questions are again posed and a summarized answer to each is
presented, together with the business implications of the research and suggestions for
future research.
Chapter 2: Literature Review
2.1 Data and data mining in the business context
Data mining is defined as: "... leveraging data-mining tools and technologies to enhance the
decision-making process by transforming data into valuable and actionable knowledge to
gain a competitive advantage." (Nemati & Barko, 2003:282).
Knowledge discovery has been defined as: "...the 'extraction of implicit, previously
unknown, and potentially useful information from data'. The information extracted includes
concepts, concept interrelations, classifications, decision rules, and other patterns of
interest." (Vickery, 1997:107)
Data is everywhere and is used and created in almost every activity in an organisation's day-
to-day workings. The amount of data collected and stored continues to grow at an enormous
rate. Unfortunately for business users wishing to mine this data, add value to it or create
value from it, the data is usually stored in a way that is essentially random. How to create
a competitive advantage from this data and its mining is the critical challenge facing many
organisations today (Forcht & Cochran, 1999).
Recently three new and interrelated areas that emphasise obtaining and creating more
information and knowledge from data have emerged strongly in information systems and
information technology. These are:
• Data warehousing
• Knowledge management
• Data mining
Data mining can be considered a recently developed methodology and technology that has
seen increased focus and importance in organisations and that is expected to have an important
impact on organisational performance. Data mining has only come into prominence in the last ten
or so years. Recently data mining has gained widespread attention and increasing popularity
in the commercial world. Successful data mining applications have been reported and recent
surveys have found that data mining has grown in usage and effectiveness (Fayyad, Piatetsky-
Shapiro & Smyth, 1996; Koh & Low, 2004).
2.2 Data mining techniques and algorithms
In the review of the literature the terms "techniques", "algorithms" and "tools" were found
to be used to describe the same or similar things. Chidley (2002) found the same in his
research.
"Techniques" were described by Lee & Siau (2001) as a clustering of similar mathematical
algorithms, such as statistics, artificial intelligence, the decision tree approach, genetic
algorithms and visualisation, while the "tools" were described by Gargano & Raggad (1999) as
including artificial intelligence methods (e.g. expert systems, fuzzy logic), decision trees,
rule induction methods, genetic algorithms and genetic programming, neural networks (e.g.
backpropagation, associative memories), and clustering techniques.
"Algorithms" are defined as the mathematical and statistical formulas and/or software
code behind specific ways of querying the data when mining it (Chidley, 2002).
Gargano & Raggad (1999:83) further defined the tools used in data mining as "simple,
concise, easy to implement algorithms, that model nonrandom (i.e. statistically
significant) relationships (or patterns) in large historic data sets."
For the purposes of this research the terms "techniques", "algorithms" and "tools"
will be used interchangeably. A clear distinction must however be made between
the techniques used for data mining and the uses of data mining.
A review of the literature found the following techniques:
2.2.1 Pure statistics
Basic statistics
Statistics is the most basic and an indispensable component of data mining and is
also used to evaluate the results of the mining done and to separate the good from
the bad. Statistics allows the miner to get a hands-on, and sometimes visual, feel for
the data, enables a basic understanding of the nature of the data and serves as
an indication of the most suitable techniques for further mining. It is used in the
cleaning of data and enables the identification of outliers and anomalies ("noise") in
the data. Statistics also assists in dealing with missing data using estimation techniques
(Lee & Siau, 2001).
Probability distributions - Probability distributions aim to find relations between
data points or variables (Forcht & Cochran, 1999).
Inference - Inference estimates the likelihood of various outcomes, given a set of
variables and is frequently a step beyond a probability distribution as it often uses
the results of a probability distribution as part of its raw data (Forcht & Cochran,
1999).
Estimation - One way of dealing with missing data is the use of estimation techniques
(Lee & Siau, 2001). Estimations are almost always made on the basis of assumptions
that may not be strictly met for a variety of different reasons. When this happens
one should not assume that if the model is incorrect, the assumptions must be
incorrect. This may sometimes be true but is not always the case. Analysts often
test their models by finding ways to weaken their assumptions. They attempt to
discount weak assumptions and leave only the strongest assumptions. When using
inference or estimation models different models may be sound, even though they
have competing assumptions. Instead of using only one model, it is best to use
several and to combine the models and find a weighted average, which when
considered and averaged, should improve the quality of the estimation made (Forcht
& Cochran, 1999).
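The idea of combining several models into a weighted average can be sketched as follows. The function name, the three estimates and the weights are hypothetical; how the weights should be chosen is not specified in the literature reviewed here.

```python
def combine_estimates(estimates, weights):
    """Return the weighted average of several models' estimates."""
    if len(estimates) != len(weights):
        raise ValueError("need one weight per estimate")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(e * w for e, w in zip(estimates, weights)) / total


# Hypothetical estimates of one missing value from three competing models,
# with weights reflecting how much each model is trusted (assumed values).
estimates = [100.0, 110.0, 130.0]
weights = [2.0, 1.0, 1.0]
combined = combine_estimates(estimates, weights)
```

Here the first model carries twice the weight of the others, so the combined estimate is pulled towards its value.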
Hypothesis testing - Hypothesis testing is a type of estimation that seeks an answer
that is binary in nature. The test seeks only a "yes or no" type of answer to verify
whether a hypothesis is plausible or not. Usually, one hypothesis is tested against
an alternative one to find the stronger of the two (Forcht & Cochran, 1999).
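As an illustration of the "yes or no" nature of a hypothesis test, the sketch below compares the default rates of two customer segments. The two-proportion z-test, the counts and the 5% significance level are assumptions chosen for the example; the source does not name a particular test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: both groups share the same underlying proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: defaults among accounts in two customer segments.
z, p_value = two_proportion_z_test(30, 200, 15, 200)
reject_null = p_value < 0.05  # the "yes or no" answer at an assumed 5% level
```

The null hypothesis (equal default rates) is tested against the alternative (different rates), and the outcome is reduced to a single binary decision.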
Regression - This is the most important of all the multivariate techniques available to non-
experimentalists. Once analysts understand regression, almost any question amenable to
quantitative analysis can be answered. This technique, perhaps more than any other data
manipulation technique, lends itself to visualisation. Regression contains many different
subsets, e.g. bivariate or multiple regression. In its purest form regression answers the
common query: What is the relationship between variable X and variable Y? (Lewis-Beck,
Berry, Feldman, Fox & Hardy, 1993). This technique has a myriad of uses in data mining
(Koh & Low, 2004).
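The question "what is the relationship between variable X and variable Y?" can be made concrete with a minimal sketch of bivariate regression fitted by ordinary least squares. The function name and the data are hypothetical, and ordinary least squares is assumed here as the fitting method.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-products of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(xs, ys)
```

The slope describes how much Y changes for a unit change in X; multiple regression extends the same idea to several X variables.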
Discriminant analysis - This is a classification technique used to describe group separation
(Rencher, 1995; Gordon, 1999). Standard linear discriminant analysis involves a linear
classification boundary and is used to group the population (Rencher, 1995), but it should be
noted that it depends on assumptions regarding normality of the underlying populations,
which must also possess identical variance-covariance matrices. The linear rule can be shown
to minimise the expected number of misclassifications.
Clustering
Clustering may be used as a preparatory step to segment a database before applying other data
mining techniques or as a separate technique for data mining (Chidley, 2002). The technique
itself is the process of identifying useful and homogeneous clusters (e.g. objects or people),
patterns, relationships or interesting trends with similar characteristics in time-dependent
data (Emory & Cooper, 1991; Gargano & Raggad, 1999; Forcht & Cochran, 1999; Lee &
Siau, 2001). A cluster or pattern may be regarded as a collection or class of records sharing
something in common. Conceptual clustering uses not only similarity but also what has
been called 'conceptual cohesiveness' as defined by background information. Interactive
clustering includes contributions from the human user's knowledge (Vickery, 1997).
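A minimal sketch of similarity-based clustering is given below, using k-means on a single variable. K-means is chosen here only as one well-known clustering method; the literature cited above does not name a specific algorithm, and the data and starting centroids are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group one-dimensional values around the nearest of k centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical account balances forming two obvious groups.
balances = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(balances, centroids=[0.0, 5.0])
```

Unlike classification, no groups are defined in advance; the groups emerge from the similarity of the records themselves.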
Classification
Classification is the process of dividing and allocating data items in a data set into previously
defined and mutually exclusive groups so that the members of each group are as close as
possible to one another, and the members of different groups are as far as possible from one
another. An example of a typical classification problem is dividing a database of customers
into groups that are as homogeneous as possible with respect to a variable such as
creditworthiness (Lee & Siau, 2001).
Link analysis
Link analysis is a descriptive approach to identifying useful associations and relationships
between values in a database (Lee & Siau, 2001).
Association rules and associative memories
These techniques are used to mine transactional or relational databases (Lee & Siau, 2001)
and are able to detect similarities between new patterns and previously stored patterns
(Caudill & Butler, 1990).
The main tool used for this according to Gargano & Raggad (1999) is associative
memories where pairs (or larger groups) of associated data items are memorised
(or discarded, in effect “forgotten”) using a long-term memory network model. A
partial stimulation of the long-term memory network results in a retrieved data pair.
This retrieved pair may have been either a previously memorised pair or the best
attempts of the network in trying to compromise the initial stimulus with a reason-
able output pair response.
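For association rules mined from transactional data, the strength of a rule is commonly expressed through its support and confidence. The sketch below assumes those two standard measures, which the text above does not name, and uses hypothetical product holdings; the function names are illustrative only.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, items):
    """Fraction of all transactions that contain the item set."""
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    joint = count(transactions, set(antecedent) | set(consequent))
    return joint / count(transactions, antecedent)


# Hypothetical products held by five customers.
holdings = [
    {"store_card", "personal_loan"},
    {"store_card", "personal_loan", "home_loan"},
    {"store_card"},
    {"home_loan"},
    {"store_card", "personal_loan"},
]

rule_support = support(holdings, {"store_card", "personal_loan"})
rule_confidence = confidence(holdings, {"store_card"}, {"personal_loan"})
```

Here the rule "store card implies personal loan" is supported by three of the five customers and holds for three of the four store card holders.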
2.2.2 Artificial Intelligence (AI) methods
Artificial Intelligence techniques are widely used in data mining (Lee & Siau, 2001;
Koh & Low, 2004). These include neural networks, backpropagation, expert systems
and fuzzy logic (Gargano & Raggad, 1999; Zwick, 2004).
Neural networks
Neural networks were originally designed for use in mainly the disciplines of psy-
chology and biology. Their application in a data mining context is driven by the
desire to exploit their properties as non-linear statistical methods (Beynon et al,
2001).
These are powerful techniques for analysing complex non-linear and interaction
relationships, and can be used to supplement and complement traditional statistical
methods, for example in constructing going concern prediction models (Lee & Siau,
2001; Koh & Low, 2004).
Neural networks are some of the most common types of data mining tools used.
They are used for recognising patterns in data, especially when the relationships
between the dependent and independent variables are unknown and/or complex.
They are designed to "think" like, and are modelled after, the human brain, which can
be perceived as a highly connected network of neurons (called nodes in neural network
terminology). Each node (in a layer of nodes) receives inputs from at least one node in a
previous layer and combines the inputs and generates an output to at least one
node in the next layer. Generally, the independent variables comprise the input layer
and the dependent variable the output layer and between these there may be one or
more hidden layers of nodes. In combining inputs and generating an output, each
node performs a computation (to combine the inputs) and a transformation (to
generate an output). Each connection between two nodes has a weight that deter-
mines how the input from a prior node must be combined with other inputs to
generate an output that must be received by the next node (Vickery, 1997; Gargano
& Raggad, 1999; Lee & Siau, 2001).
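The computation and transformation performed by each node can be sketched as follows. The sigmoid transformation, the network shape and all weight values are assumptions made for illustration; the literature cited above does not specify them.

```python
import math


def node_output(inputs, weights):
    """One node: combine the weighted inputs, then transform the result."""
    combined = sum(x * w for x, w in zip(inputs, weights))  # computation
    return 1.0 / (1.0 + math.exp(-combined))                # transformation (sigmoid, assumed)


def forward(inputs, hidden_weights, output_weights):
    """Input layer -> one hidden layer -> a single output node."""
    hidden = [node_output(inputs, w) for w in hidden_weights]
    return node_output(hidden, output_weights)


# Hypothetical example: two independent variables, two hidden nodes, one output.
prediction = forward(
    inputs=[1.0, 2.0],
    hidden_weights=[[0.5, -0.25], [0.1, 0.3]],
    output_weights=[0.8, -0.4],
)
```

Each inner list in `hidden_weights` holds the weights on the connections into one hidden node, mirroring the description of weighted connections between layers.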
Neural networks first break down data sets into smaller, more manageable pieces
before trying to discover patterns in the data. Such techniques require large amounts
of resources and frequently require some custom programming for each search, as
well as more processing afterward because the system may "discover" patterns that
seem logical to it but after human intervention it becomes clear that they are not
(Forcht & Cochran, 1999; Koh & Low, 2004).
Lu et al. (1996) (cited in Lee & Siau, 2001), split the neural network-based data
mining approach into three major phases:
• Network construction and training: in this phase, a layered neural network, based on
the number of attributes, the number of classes and the chosen input coding method, is
constructed and trained.
• Network pruning: in this phase, redundant links and units are removed without
increasing the classification error rate of the network.
• Rule extraction: classification rules are extracted in this phase (Lee & Siau, 2001).
Backpropagation systems
These techniques are highly supervised. The backprop neural network model is ideal for
prediction and classification in situations where there is a good deal of historic data available
for training. This tool uses output variables generated by the neural network that are cor-
rected by adjusting the weights of the hidden layer variables until the output variables match
those in the training dataset (Gargano & Raggad, 1999; Chidley, 2002).
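The weight correction described above can be illustrated with a small network that has one hidden layer. This is a sketch under stated assumptions: sigmoid nodes, a squared-error measure, no bias terms, a hypothetical learning rate of 0.5 and arbitrary starting weights. None of these details come from the cited sources.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden-node outputs and the network output for input x."""
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, w))) for w in w_hidden]
    output = sigmoid(sum(h * w for h, w in zip(hidden, w_out)))
    return hidden, output


def train_step(x, target, w_hidden, w_out, rate=0.5):
    """One backpropagation step for a single training example."""
    hidden, output = forward(x, w_hidden, w_out)

    # Error signal at the output node (squared error times the sigmoid derivative).
    delta_out = (output - target) * output * (1.0 - output)

    # Propagate the error signal back to each hidden node.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]

    # Adjust every weight a small step against its error gradient.
    new_w_out = [w_out[j] - rate * delta_out * hidden[j]
                 for j in range(len(hidden))]
    new_w_hidden = [[w_hidden[j][i] - rate * delta_hidden[j] * x[i]
                     for i in range(len(x))]
                    for j in range(len(hidden))]
    return new_w_hidden, new_w_out


# Hypothetical training example: inputs [1, 0] should produce an output of 1.
w_hidden, w_out = [[0.1, -0.2], [0.3, 0.4]], [0.2, -0.1]
for _ in range(200):
    w_hidden, w_out = train_step([1.0, 0.0], 1.0, w_hidden, w_out)
```

After repeated steps the network output for this example moves closer to the target, which is the sense in which the outputs are "corrected" until they match the training data.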
Expert systems
Expert systems are made up of a knowledge base of rules (extracted from experts), facts (or
data), and a logic based inference engine (or control) that creates new rules and facts based
on previously accumulated knowledge and facts. Expert systems attempt to mimic, with
some success, the reasoning of human experts whose knowledge of a specific and narrow
domain is deep. Because the human expert and the expert system arrive at similar
conclusions, the system justifies its existence by improving the expert decision
maker's own productivity. The expert system thus operates using queries formulated by
human experts and incorporated into the system. Expert systems do not rely on algorithmic
or statistical methods and cannot solve problems that have not been defined during the
programming of the model (Jackson, 1990; Gargano & Raggad, 1999; Chidley, 2002).
Jackson (1990:4) listed the following characteristics for expert systems:
• They simulate human reasoning,
• They perform reasoning "over representations of human knowledge",
• Heuristic or approximate methods are used to solve problems (which does not guar-
antee success as would have been the case had algorithmic techniques or solutions
been used).
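A knowledge base of rules and facts combined with a simple inference engine can be sketched as below. The forward-chaining strategy, the rule format and the credit-related rule names are assumptions for illustration, and this sketch derives new facts only, not new rules.

```python
def infer(facts, rules):
    """Forward chaining: keep firing rules whose conditions are all known facts."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conclusion not in known and set(conditions) <= known:
                known.add(conclusion)  # a new fact derived from existing ones
                changed = True
    return known


# Hypothetical rules of the form (conditions, conclusion), as an expert might state them.
RULES = [
    (("missed_payments", "high_utilisation"), "elevated_risk"),
    (("elevated_risk", "short_credit_history"), "refer_to_analyst"),
]

conclusions = infer(
    {"missed_payments", "high_utilisation", "short_credit_history"}, RULES
)
```

Because the engine can only apply rules it has been given, a case that no rule covers produces no new conclusions, which reflects the limitation noted above.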
Fuzzy expert systems
Fuzzy expert systems employ fuzzy logic concepts and were developed in an attempt to
solve the brittleness problem inherent in expert systems. The truth or falsity of a fact
can be measured in a fuzzy way using values from the real number interval zero to one
inclusive (i.e. [0, 1]). In expert systems, information is either totally false (i.e. zero) or
totally true (i.e. one), but in fuzzy expert systems, truth values can lie anywhere on the zero
to one interval of real numbers. Some facts are close to being true or close to being false
(having low entropy), while other facts lie close to the middle between being true or false
(having high entropy). Using fuzzy operators, such as AND, OR, NOT, VERY, and SOME-
WHAT, the system can make fuzzy implications. Fuzzy systems can easily handle illogical
complexities, poor clarity (in the facts and/or rules), or internal inconsistencies (Gargano &
Raggad, 1999).
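The fuzzy operators and the notion of entropy can be made concrete with a short sketch. The definitions used here (minimum for AND, maximum for OR, complement for NOT, squaring for VERY, square root for SOMEWHAT, and a Shannon-style entropy measure) are common conventions assumed for illustration; the cited source does not state which definitions it has in mind.

```python
import math


def fuzzy_and(a, b):
    return min(a, b)


def fuzzy_or(a, b):
    return max(a, b)


def fuzzy_not(a):
    return 1.0 - a


def very(a):
    return a ** 2          # concentrates: only strongly true facts stay high


def somewhat(a):
    return math.sqrt(a)    # dilates: weakly true facts are raised


def fuzzy_entropy(a):
    """0 for a fact that is fully true or fully false, 1 at the midpoint 0.5."""
    if a in (0.0, 1.0):
        return 0.0
    return -(a * math.log2(a) + (1.0 - a) * math.log2(1.0 - a))


# Hypothetical truth values for two facts about an applicant.
stable_income = 0.8
low_debt = 0.3
good_prospect = fuzzy_and(very(stable_income), fuzzy_not(low_debt))
```

A truth value of 0.9 has lower entropy than one of 0.6, matching the description of facts that are close to true or false versus facts near the middle.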
2.2.3 Genetic algorithms and genetic programming
Genetic algorithms are a relatively new technique inspired by Darwin's theory of evolution
(natural selection and survival of the fittest). A population of rules, each of which may or may
not represent a solution to a problem, is created at random. Pairs of these rules, usually the
strongest, are then selected as "parents" and combined to produce "offspring" for the next
generation. A mutation process is used to randomly modify the genetic structures of some
members of each new generation. The system runs for dozens or hundreds of generations
and is terminated only when an acceptable or optimum solution is found, or after a fixed
time limit. Genetic algorithms are appropriate for problems that require optimisation with
respect to some computable criterion (Lee & Siau, 2001; Mitchell, 2005).
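The generational loop described above might be sketched as follows. The bit-string encoding, single-point crossover, the practice of carrying the best individual forward unchanged, and every parameter value are assumptions for illustration; a fixed number of generations stands in for the fixed time limit.

```python
import random


def evolve(fitness, length=16, population_size=20, generations=50,
           mutation_rate=0.05, good_enough=None, seed=0):
    """Evolve bit strings towards higher fitness and return the best one found."""
    rng = random.Random(seed)
    # A random initial population of candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        if good_enough is not None and fitness(ranked[0]) >= good_enough:
            return ranked[0]                      # acceptable solution found

        parents = ranked[: population_size // 2]  # the strongest half may breed
        next_generation = [ranked[0][:]]          # keep the best unchanged (assumed)
        while len(next_generation) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)      # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally flip a bit.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_generation.append(child)
        population = next_generation

    return max(population, key=fitness)


# Hypothetical criterion: the number of ones in the string.
best = evolve(fitness=sum)
```

Any computable criterion can be passed as `fitness`; counting ones is used here only because its optimum is easy to recognise.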
While genetic algorithms evolve complex data structures, genetic programming evolves
complex algorithmic structures (i.e. computer programs). This technique is useful for
finding solutions to hard optimisation problems by generating optimal or near optimal
solutions to such problems, to fine tune the parameters of other data mining techniques and
models and also for classification (Vickery, 1997; Gargano & Raggad, 1999; Lee & Siau,
2001).
2.2.4 Decision trees
A decision tree is a statistical approach based on a branching system of decisions. A
decision rule is answered at each node either positively (Yes) or negatively (No), and the
answer leads to another set of decisions (Gargano & Raggad, 1999).
Koh and Low (2004:466) summarised the main algorithms as follows: "In the Automatic Interaction Detection
(AID) algorithm, all possible two-way splits of each node for each independent variable are
examined. The split that leads to the most significant t-statistic (as per the analysis of the
variance) for the difference in means of the dependent variable between the two lower-level
nodes is selected. In the chi-square Automatic Interaction Detection (CHAID) algorithm, the
chi-square statistic is used to determine the best split while in the Classification and Regres-
sion Trees (CART) algorithm, an index of diversity is used to determine the best split."
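To illustrate how a split might be chosen using an index of diversity, the sketch below evaluates every two-way split of one numeric variable and keeps the split with the lowest weighted diversity. The Gini index is assumed as the diversity measure, and the income figures and labels are hypothetical.

```python
def gini(labels):
    """Gini index of diversity: 0 when all labels agree, higher when they are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split of the form value <= threshold."""
    best = None
    for threshold in sorted(set(values))[:-1]:   # the largest value cannot split the data
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: income (in thousands) and whether the customer defaulted.
income = [10, 20, 30, 40, 50, 60]
defaulted = ["yes", "yes", "yes", "no", "no", "no"]
threshold, diversity = best_split(income, defaulted)
```

In this data the split at an income of 30 separates the two groups completely, so the weighted diversity of the two lower-level nodes is zero.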
This technique has several strengths:
• Understandable rules can be generated
• Both continuous and categorical variables can be handled
• The ability to indicate the relative importance of the variables for classification and
prediction
• Outputs are easy to understand
• They are relatively simple to implement and
• Their results can be easily explained
(Gargano & Raggad, 1999; Chidley, 2002; Koh & Low, 2004)
2.2.5 Data visualisation
Visualisation is a method of clearly presenting the typically complex results found using data
mining tools. This allows the presentation of the complex interdependencies among many
attributes in a visual format in order to get an intuitive feel of the data and the results of the
analysis. Analysts and management users can easily assess and make sense of vast amounts
of data. Techniques include colors, shapes and sounds in various combinations, statistical
scatter plots, decision trees, displays of the results of curve fitting, geographical maps, and
development dashboards that track and control the evolution of a data mining
modeling tool (Gargano & Raggad, 1999; Lee & Siau, 2001).
2.2.6 Rule induction methods
Rule induction uses statistical discovery methods to develop rules that depend on the fre-
quency of correlation, the rate of accuracy, and the accuracy of prediction. Typically, IF -
THEN type rules are created by focusing on either the variables forming the IF part of a rule
or the variables forming the THEN part of a rule. For rule induction it is useful to think of
data mining from marketing databases. The technique is based on measures of data ambi-
guity or approximation quality. These measures are formulated in terms of ratios, involving
objects either definitely or possibly allocated to a decision class, on the basis of a given table
or data matrix. The end result is a set of decision rules, which are very easy to understand
and interpret. Rule induction is a useful tool for development of expert systems (Gargano &
Raggad, 1999; Beynon et al, 2001).
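One reading of the ratios mentioned above is the rough-set notion of lower and upper approximations, where objects are "definitely" or "possibly" allocated to a decision class. Assuming that reading, the sketch below computes both approximations and their ratio for a hypothetical data table; the column names and records are invented for illustration.

```python
from collections import defaultdict


def approximations(rows, condition_keys, decision_key, decision_value):
    """Return the sets of row indices definitely and possibly in the decision class."""
    # Group rows that are indistinguishable on the condition attributes.
    blocks = defaultdict(list)
    for index, row in enumerate(rows):
        blocks[tuple(row[k] for k in condition_keys)].append(index)

    lower, upper = set(), set()
    for members in blocks.values():
        hits = [i for i in members if rows[i][decision_key] == decision_value]
        if hits:
            upper.update(members)          # possibly in the class
            if len(hits) == len(members):
                lower.update(members)      # definitely in the class
    return lower, upper


def approximation_accuracy(lower, upper):
    """Ratio of definitely allocated objects to possibly allocated objects."""
    return len(lower) / len(upper) if upper else 1.0


# Hypothetical data table.
table = [
    {"arrears": "yes", "employed": "no", "default": "yes"},
    {"arrears": "yes", "employed": "no", "default": "yes"},
    {"arrears": "yes", "employed": "yes", "default": "yes"},
    {"arrears": "yes", "employed": "yes", "default": "no"},
    {"arrears": "no", "employed": "yes", "default": "no"},
]
lower, upper = approximations(table, ["arrears", "employed"], "default", "yes")
accuracy = approximation_accuracy(lower, upper)
```

From this table a certain rule can be read off (IF arrears = yes AND employed = no THEN default = yes), while the records with arrears = yes and employed = yes support only a possible rule.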
Gargano & Raggad (1999:85) caution that: "Sometimes, however, the novelty, significance,
value, or exceptionality of a rule is deemed to be most interesting. Rule induction methods
are highly unsupervised, however, they do require that experts evaluate the rules generated.
This technique is most often used when new rules need to be generated. Owing to the
combinatorially explosive nature of generating rules in this manner, such models usually run
in the background or at times when computing demand is low."
2.2.7 Data warehousing
Data warehousing is described by Lee & Siau (2001) as one of the most important research
areas related to data mining. A data warehouse is necessary to organise historical data
gathered from large-scale client/server-based applications for further analysis.
A data warehouse is a read-only database containing large volumes of subject-oriented
data, where all levels of an organisation can find the information in a timely manner (Lee &
Siau, 2001).
Kimball et al (1998:19) call the data warehouse the foundation of decision-making in an
organisation: "The queryable source of data in the enterprise".
Data warehousing enables each user to share a common, diverse database that they may
explore analytically, using all of the available data quickly and correctly, and so increases the
effectiveness of data-driven decision making (Cabena et al, 1998; Gargano & Raggad, 1999).
The data warehouse architecture consists of a series of data marts that give a consolidated,
consistent view of the organisation's historical analytical, time-based data (Cabena et al,
1998; Kimball et al, 1998). Raw data are extracted, cleaned, transformed, and integrated into
the marts from a variety of sources. Metadata, data about the data in the warehouse, is also
an integral part of the system. The warehouse architecture must manage standard informa-
tion delivery systems and data queries, interfaces with applications development platforms
and management information systems (MIS), and online analytical processing (OLAP), in
addition to advanced information technology data mining and business intelligence tools
(Kimball et al, 1998; Forcht & Cochran, 1999; Gargano & Raggad, 1999).
2.3 The uses of these techniques and algorithms
Mitchell (1999) stated that in the field of data mining there are practical applications in areas
like analyzing medical outcomes, detecting credit card fraud, predicting customer purchase
behavior, predicting the personal interests of internet users, optimizing manufacturing
processes and identifying which bank-loan applicants are at high risk of failing to repay their loans.
As shown in Figure 1 from Mitchell (1999), data in such applications typically consists of
time-series descriptions of customer bank balances and other demographic information.
Other data mining applications include predicting customer purchase behavior, customer
retention, and the quality of goods produced by a particular manufacturing line. Mitchell
(1999) believes that research on basic scientific issues (like the medical field) will influence
data mining applications in many other business related areas. Data mining is thus valuable
to itself, as techniques used in one sector or industry may be adapted for use in another.
Data miners learn from other data miners, and a technique that has one use in one sector
could have a completely different use in another sector.
[Figure 1: Research on basic scientific issues (left) will influence data mining applications
in many areas (right). The figure has three columns: Scientific Issues, Basic Technologies and
Applications. Scientific issues: learning from mixed media data, such as numeric, text, image,
voice and sensor data; active experimentation and exploration; optimizing decisions, rather
than predictions; inventing new features to improve accuracy; learning from multiple databases
and the Web. Applications: medicine, manufacturing, financial, intelligence analysis, public
policy and marketing. Source: (Mitchell, 1999)]
Data mining and its techniques can be applied to many areas in business and in many
different businesses. The uses of data mining techniques described below have been
extracted from the literature; they apply both in the sectors that make use of the credit
and data bureaus and in the bureau sector itself.
2.3.1 Targeting / Predictive / Descriptive models
These models typically calculate a value that represents possible future activity. This could
be a purchase amount or the likelihood of an action, such as a response to an offer or
defaulting on a loan (Parr Rud, 2001).
They may include:
• Customer profiling and segmentation
Understanding the customer, including their demographics, attributes and behaviour,
is the first step in good customer relationship management.
Data mining enables understanding of who the customers are and how to split them
into segments that have the same or similar attributes. This leads to further mining to
enable steps like prospecting, scoring, propensity to buy and others as discussed later
(Vickery, 1997; Cabena et al, 1998; Gargano & Raggad, 1999; Lee & Siau, 2001; Parr
Rud, 2001; Geist, 2002; Nemati & Barko, 2003).
• Database marketing
Database marketing is a type of marketing segmentation used by businesses via data
mining. Data mining of customer databases has had a large impact on marketing in
organisations. Individual consumers can be targeted for direct marketing offers. The
value here is that the correct customer may be directly targeted with the correct offer,
saving time, money and effort and enabling a focused approach to marketing that
promises much better results. Algorithms are used to predict consumer behavior by
predicting which consumers would be most responsive to promotional and sales cam-
paigns (Forcht & Cochran, 1999).
The goal of this type of marketing is to attract new clients, retain profitable clients
or avoid high-risk clients, and multiple opportunities for this exist in the data
mining of large databases. Increasing the response rates of direct mailing campaigns
by margins as small as 1-2% can have a large impact on ROI, so data mining, as a
powerful tool for increasing response rates, is ultimately of immense value to the
organisation (Cabena et al, 1998; Forcht & Cochran, 1999; Parr Rud, 2001; Apte, Liu,
Pednault, & Smyth, 2002).
• Customer attrition prediction
A growing risk for organisations in increasingly competitive markets is the loss, or attrition,
of customers to competitors. Data mining is used to predict these customer
losses and to identify vulnerable customers so that steps may be taken to prevent or
mitigate attrition and thus save costs and effort in attracting new clients or spending
on attracting customers who depart before their lifetime value has justified the ex-
pense of attracting them in the first place (Cabena et al, 1998; Nemati & Barko, 2003).
• Credit scoring / Risk modelling
Credit scoring algorithms have the ability to consider and use many different factors
and variables in determining a customer's 'creditworthiness' and assigning a credit
limit or particular loan amount to that customer in either pre-scoring to extend a
marketing offer or when the customer applies for credit. This is very valuable in
assuring that a customer does not have a line of credit extended to them that they
cannot or will not repay. This has a knock-on effect in savings of time, effort and
expenditure in preventing unnecessary collections and administration. Numerous com-
panies have used data mining in developing credit risk scores for their own use or for
selling on to other users (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud, 2001;
Geist, 2002; Nemati & Barko, 2003).
Customers' data is mined and algorithms applied in an attempt to determine who
the higher risk clients are, so that these may either be avoided or be dealt with through a
different interaction strategy. An insurance company may for instance want to
determine the risk profile of clients to enable it to customise each client's policy
individually (Parr Rud, 2001; Apte et al, 2002).
• Customer value analysis
Performing customer value and lifetime value analysis allows managers to understand
their customer database in terms of revenue and risk. Mining the customers' data
assists in:
- Determining the risk category;
- Determining the amount of customer spend over a given period;
- Assigning a value to each customer that is used in determining the
company's interaction and dealings with each client on an individual basis
(Cabena et al, 1998; Parr Rud, 2001).
2.3.2 Fraud prediction and identification
Fraud costs companies and economies millions of Dollars / Pounds / Rands every year and
with the increase in electronic transactions, credit cards and telephonic transacting this is
becoming even more prevalent. The masses of data available to companies allow them to
mine these transactions and applications in an effort to identify or predict fraud. The general
approach is to build a model of known, suspected or potential fraudulent behaviour and
then use data mining to identify similar occurrences. Data mining tools are valuable as
they learn the patterns of fraud and enable its identification and prevention (Cabena et al,
1998; Lee & Siau, 2001; Parr Rud, 2001).
2.3.3 Going concern prediction
Koh & Low (2004) researched this field and found that several researchers had developed
prediction models for making going concern predictions of companies. The suggested models
are based primarily on statistical methods. Koh & Low (2004) listed the following examples:
Altman (1982), Dopuch et al. (1987) and Koh (1991). This area of data mining
also includes bankruptcy prediction. Several studies listed by Koh & Low (2004) have dealt
with prediction models in the going concern context. These include models derived from
statistical methods such as multiple discriminant analysis, logit and probit analyses and
neural networks.
Altman (1968), Sung, Chang & Lee (1999), Beynon et al (2001) and Koh & Low (2004)
noted that discriminant analysis is the most widely used technique for going concern and
bankruptcy prediction.
2.4 The future direction of data mining and its techniques and the possible
uses of this
The literature review found mainly data relating to other sectors, techniques and uses.
Only one source was found describing possible future uses of data mining or future
techniques. It is possible that the bureaus may have some ideas as to what their future use of
data mining will be, what new techniques may emerge and how these might be used.
The only source describing possible direction of data mining was from Mitchell (1999) who
speculated that the accuracy of predictions from data mining may be improved by inventing
more appropriate sets of features for describing the available data, provided the dataset was
large enough. It is suggested that this could lead to increased accuracy in many prediction
problems like customer attrition and credit repayments. More universities are also offering
data mining as a subject as there is a lack of skills in this area.
Research into the area of data mining could lead to more useful data visualization tools,
ways of supporting mixed initiative human-machine data exploration and more efficient data
warehousing and legacy data combinations (Mitchell 1999).
Mitchell (1999:36) and Fayyad, Haussler & Stolorz (1996) further speculated that "progress
in data mining over the next decade was driven by three mutually reinforcing trends:
• Development of new machine learning algorithms that learn more accurately, utilize
data from dramatically more diverse data sources available over the Internet and
intranets, and incorporate more human input as they work,
• Integration of these algorithms into standard database management systems,
• An increasing awareness of data mining technology within many organizations and an
attendant increase in efforts to capture, warehouse, and utilize historical data to sup-
port evidence-based decision making."
Chapter 3: Research questions
The literature review for this research is in most respects quite comprehensive. However,
data mining in South Africa, and particularly in the credit and data bureau sector, is a
relatively new field, and although there is agreement amongst the authors of the respective
works in most fields, there are some areas of discrepancy. Most authors agree on the
techniques used and the uses of these techniques, but there is little literature on the uses of
data mining in this sector and more specifically in South Africa. As a result of the literature
review the following questions arise:
3.1 What are common data mining techniques and algorithms?
A review of the literature produced the following list of techniques used in data mining and
these techniques could be used in the Credit and Data bureau sector for data mining:
▪ Pure statistics (Lee & Siau, 2001)
• Basic Statistics (Forcht & Cochran, 1999; Beynon et al, 2001; Koh & Low, 2004)
- Probability distributions
- Inference
- Estimation
- Hypothesis testing
- Regression
- Discriminant analysis
• Clustering (Emory & Cooper, 1991; Vickery, 1997; Forcht & Cochran, 1999; Gargano
& Raggad, 1999; Chidley, 2002)
• Classification (Lee & Siau, 2001)
• Link analysis (Lee & Siau, 2001)
• Association rules (Caudill & Butler, 1990; Lee & Siau, 2001), and associative
memories (Gargano & Raggad, 1999)
▪ Artificial intelligence methods (Lee & Siau, 2001; Koh & Low, 2004)
• Neural networks (Gargano & Raggad, 1999)
- Backpropagation (Gargano & Raggad, 1999)
• Expert systems (Jackson, 1990; Gargano & Raggad, 1999)
• Fuzzy logic (Gargano & Raggad, 1999; Zwick, 2004)
▪ Genetic algorithms (Lee & Siau, 2001; Mitchell, 2005) and genetic programming
(Vickery, 1997; Lee & Siau, 2001)
▪ Decision trees (Gargano & Raggad, 1999; Chidley, 2002; Koh & Low, 2004)
▪ Data visualisation (Gargano & Raggad, 1999; Lee & Siau, 2001)
▪ Rule induction methods (Gargano & Raggad, 1999; Beynon et al, 2001)
▪ Data warehousing (Kimball et al, 1998; Forcht & Cochran, 1999; Gargano & Raggad,
1999)
3.2 What are the uses of these techniques?
A review of the literature gave the following uses of the different techniques used in data
mining that could be applicable to this sector. The possibility is that these are where the
value in data mining lies for the bureaus and their users. Mitchell (1999) also believed that
techniques in one sector may influence techniques used in other sectors and thus data mining
is valuable to itself in that new techniques are developed in one sector because of the
influences in another sector. The research will attempt to determine if this is the case in the
credit and data bureau sector as well. Other uses found were:
• Targeting / Predictive / Descriptive models (Parr Rud, 2001)
- Customer profiling and segmentation (Vickery, 1997; Cabena et al, 1998; Gargano
& Raggad, 1999; Lee & Siau, 2001; Parr Rud, 2001; Geist, 2002; Nemati & Barko,
2003).
- Database marketing (Cabena et al, 1998; Forcht & Cochran, 1999; Parr Rud, 2001;
Apte et al, 2002).
- Customer attrition prediction (Cabena et al, 1998; Nemati & Barko, 2003).
- Credit Scoring / Risk modelling (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud,
2001; Apte et al, 2002; Geist, 2002; Nemati & Barko, 2003).
- Customer value analysis (Cabena et al, 1998; Parr Rud, 2001).
These techniques enable:
- An understanding of the customer and thus good customer relationship manage-
ment.
- Marketing to the correct customer who may be directly targeted with the correct
offer, saving time, money and effort and enabling a focused approach to marketing
that promises much better results.
- The attraction of new, retention of profitable clients or avoidance of high-risk cli-
ents.
- Increased response rates for direct mailing campaigns, where margins as small as
1-2% can have a large impact on ROI.
- Savings in attracting new clients or spending on attracting customers who depart
before their lifetime value has justified the expense of attracting them in the first
place.
- Credit scoring of clients to ensure that the line of credit extended is not so large that it
forces a client into a position of overextension where they cannot or will not repay. This
has a knock-on effect in savings of time, effort and expenditure in preventing unnec-
essary collections and administration.
• Fraud prediction and identification (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud,
2001).
• Going concern prediction (Altman, 1968; Sung, Chang & Lee, 1999; Beynon et al,
2001; Koh & Low, 2004).
3.3 What is the future direction of data mining and its techniques in this sec-
tor and the possible uses thereof?
As there was only one source for a possible answer to this question, it is left quite open-
ended. Some possibilities are:
• New and more accurate means of prediction may be found using more appropriate
sets of features for describing the available data, provided the dataset was large enough,
• Increased accuracy in many prediction problems like customer attrition and credit
repayments,
• More useful data visualization tools, ways of supporting mixed initiative human-ma-
chine data exploration and more efficient data warehousing and legacy data combina-
tions,
• More efforts to train people in data mining as the skills are not common (Mitchell
1999).
CHAPTER 4: RESEARCH METHODOLOGY
4.1 Qualitative Research Paradigm
The aim of this research is to identify and evaluate data mining techniques in the Credit and
Data Bureau sector and to expand on the body of knowledge available to managers in this
sector, and users of these data and techniques as clients of this sector. The research para-
digm for the research is qualitative in nature.
Qualitative techniques are intended more to determine 'what' things are than to determine
the quantity of those things. These techniques are not concerned with measurement and are
thus less structured than quantitative techniques and can therefore be made more respon-
sive to the needs of the respondents and to the nature of the subject being researched.
Typically qualitative techniques yield large volumes of very rich and descriptive data from a
limited number of individuals in a particular field (Walker, 1985).
The intent of qualitative research is to answer questions about the complex nature of
phenomena, often with the purpose of describing and understanding the phenomena from
the participants' point of view (Leedy & Ormrod, 2001:101).
Based on the characteristics of a qualitative paradigm given by Walker (1985) and Leedy &
Ormrod (2001), this approach is proposed for the following reasons:
• There is insufficient theory on the particular sector,
• The purpose of the research is to describe and explore,
• The research is not concerned with measurement,
• The variables are unknown,
• The research is context bound and encompasses personal views,
• The sample size is small,
• In-depth semi-structured interviews are to be used to collect data,
• The data gathered are explicitly interpretive, creative and personal.
Added to the assumptions made in Chapter 1 (1.5) of this document are particular assump-
tions that are part of qualitative research. These were proposed by Creswell (1994) and
Marshall & Rossman (1989) and must also be considered:
• The participant's perspective on the social phenomenon of interest should unfold as
the participant views it, not as the researcher views it (Marshall & Rossman, 1989:82),
• The researcher interacts with what they are researching,
• The research is value-laden and biased (Creswell, 1994:5),
• Respondents in research see reality in subjective and multiple ways,
• The language of the research is informal, with evolving decisions, a personal voice and
accepted qualitative words.
4.2 Descriptive research design
The qualitative design was in the form of a content analysis. This was described by Walker
(1985) and Leedy & Ormrod (2001) as being a technique that identifies patterns, themes or
biases in data on communication and the examination of this data allows the researcher to
determine if a hypothesis is supported or not. In this research the content analysis was done
on the transcripts of the interviews between the researcher and the respondents.
For this research in-depth semi-structured interviews were used as the method of data
collection. The interviews were based on a number of open-ended questions (Leedy &
Ormrod, 2001). In-depth interviewing is ideal for this kind of research and has been
described as a conversation with a purpose (Marshall & Rossman, 1989:82). Interviews are
typically more like conversations than formally structured interviews; this assists in
uncovering the respondent's meaning and perspective while at the same time respecting the way
in which the respondent frames and structures the responses (Marshall & Rossman, 1989).
Advantages of using in-depth semi-structured interviews for data collection include (Marshall
& Rossman, 1989; Pirow, 1990; Creswell, 1994; Leedy & Ormrod, 2001):
• Interviews are a useful means of quickly obtaining large amounts of data.
• Respondents can provide historical background information.
• Interviews allow for the gathering of a wide variety of information and a large number
of different subjects.
• Immediate follow-up questions and clarification of points can be done.
• The researcher has control over both the questions asked and the environment.
• It is flexible and enables the researcher to prompt and probe as necessary.
• It enables the researcher to take cognisance of non-verbal behaviour.
• The researcher can alter the order of questions and ensure that all the questions are
answered.
Despite its many advantages, the researcher is aware that skill and care are required in using
this method of collecting data. There are also some disadvantages associated with this
method of data collection and the researcher took care to be aware of these when conducting
the research. Marshall & Rossman (1989) and Creswell (1994) listed the following:
• Information provided by the respondent is colored by their own perspective,
• The interviewer must obtain the cooperation of the interviewee,
• Respondents may not be willing to share some (possibly sensitive) information,
• Respondents may not all be of the same level of articulation or perception,
• The researcher may not be able to ask the correct type of questions because of a lack
of technical expertise on the side of the researcher.
The researcher attempted to mitigate some of these disadvantages by:
• Continuously confirming with the respondent the intended meaning of their response,
• Not intentionally leading the respondent and avoiding colloquialisms and ambiguous
words.
4.3 Population and Sample
The population in this research can be considered to be all the data miners, data managers,
analysts, practitioners, facilitators, and vendors for and from all the credit and data bureaus
in the country. This is to the extent that they are subject matter experts on data mining. The
sample drawn contained the managers of the data mining departments or business intelligence
departments, analysts, directors and/or practitioners in these fields in these bureaus
and their vendors that are located within South Africa. The nineteen respondents can be
considered to form 100% of the population.
The respondents were not selected at random. The researcher at all times attempted to ensure
that they were experienced and knowledgeable enough in the area of study (Creswell, 1994),
while remaining objective in the selection of the respondents (Walker, 1985). The sample
design is thus purposive (Walker, 1985:30).
The small number of data and credit bureaus in South Africa limited the sample size. The
sample was drawn from the bureaus and their vendors directly, specifically from the ranks of
the data mining, business intelligence and managerial areas.
The selection of experts in the field used the following criteria and ensured that the respon-
dent was able to comment, from an informed position, on the techniques, uses and trends in
data mining in the credit and data bureau sector. The opinions expressed during the inter-
views should be based on a sound knowledge of this sector and of data mining.
The criteria were:
• The expert is to be involved in data mining, having implemented, or had management
oversight of a data mining project in South Africa;
• The expert should occupy a senior or management position in the organisation;
• The expert should have experience in the products and uses of data mining in the
sector;
• The expert should have at least three years' experience in the field;
• The expert should be available for a one-hour interview;
• The organisation the expert represents should not have an objection to the expert
partaking in the research.
In total, nineteen interviews were conducted during the entire research process. Every major
credit bureau and all of the minor credit bureaus except one, both the data bureaus and
every vendor that engaged with the bureaus on data mining had at least one person who
met the criteria to qualify as an expert to be interviewed in this field. One of the vendors
interviewed had extensive experience in data mining, but none with the South African credit
bureaus.
The researcher approached respondents from the institutions listed in Table 1 and
obtained each institution's agreement to participate in the research:
Institution | Sector | Name | Designation
Experian | Credit Bureau | Craig Capper | Director: Product Development, Business Intelligence and Marketing
Experian | Credit Bureau | Alan Broderick | BI Manager
Experian | Credit Bureau | Marlize Buys | Scoring Analyst
Experian | Credit Bureau | Gerhard Bos | Scoring Analyst
Micro Lenders Credit Bureau (MLCB) | Credit Bureau | Fred Steffers | Director
KreditInform (KI) | Credit Bureau | Mike Hussey | Manager
Effective Intelligence | Data Bureau | Julian Ardagh | Managing Director
Effective Intelligence | Data Bureau | Gerhard Botha | IT Systems Manager
P-Cubed | Vendor | Raoul Miller | Managing Director
ETL | Data Bureau | Andy Quinan | Managing Director
CompuScan | Credit Bureau | Jaco Alberts | Director
Raptor | Vendor | Mark Heyman | Analyst
Raptor | Vendor | Riaan Berg | Analyst
SAS | Vendor | Stacey Caddick | Account Manager
TransUnion ITC | Credit Bureau | John Fourie | Director - Analytics and consulting
TransUnion ITC | Credit Bureau | Leslie Harrison | Business Consultant
TransUnion ITC | Credit Bureau | Emmerenthia van Greining | Statistical Analyst
TransUnion ITC | Credit Bureau | Warrick Thomas | Data Warehouse Architect
TransUnion ITC Decision Support Services (TUDSS) | Credit Bureau | Thamir Hassan | Managing Director
Table 1: Organisations that agreed to partake in the research.
4.4 Data collection
The institutions were contacted formally in writing, detailing the nature, purpose and meth-
odology of the research and requesting their formal approval of their participation. The
respondent nominated by each institution was contacted initially by telephone to invite them
to participate in the research and to inform them of the purpose of the research, subjects to
be covered and the research process and methodology, including the expected duration of
the interview.
A formal written communication by e-mail was sent thanking the respondent for being
willing to participate in the study and confirming the place, date and time of the interview.
Each respondent was offered a copy of the research report as an incentive for participating in
the study. Respondents were guaranteed that their responses would be confidential and
remain anonymous (refer to Appendices 1 & 2 for copies of the written request and telephone
protocol).
The interviews were in-depth and of a semi-structured nature and took place at a site
convenient to the respondent. As the researcher knows many of the respondents personally,
the locations for the interviews tended to be informal. This was aimed at putting the
respondents at ease and enabling them to discuss the research questions more easily with
the researcher. Each interview was audiotaped with the permission of the respondent. Notes
were also taken as the interview progressed.
Creswell (1994:152) suggested the following protocol and the researcher attempted to fol-
low this for each interview (Refer Appendix 3 for a copy of the Interview Protocol). The
components of the protocol are as follows:
• (a) a heading,
• (b) instructions to the interviewer (opening statements),
• (c) the key research questions to ask,
• (d) probes to follow key questions,
• (e) transition messages for the interviewer,
• (f) space for recording the interviewer's comments, and
• (g) space in which the researcher records reflective notes.
Care was taken not to lead respondents in their response during the course of the interview.
4.5 Data analysis
Unlike quantitative research where the process is linear, here data analysis took place at the
same time as the collection and interpretation of the data, and the writing of the report
(Creswell, 1994).
The following procedures were deployed in analysing the data (Walker, 1985; Creswell,
1994; Leedy & Ormrod, 2001):
1. The taped interviews were transcribed,
2. The notes made during the interview were reviewed immediately after the interview
and additional comments and thoughts added,
3. The data were organized into categories, coded and interpreted through the use
of schemas,
4. The data were integrated and synthesized. This was represented in the form of matrices.
In addition, the frequency of each identifiable factor uncovered in the transcripts was tabulated.
This informed the researcher as to the perceived importance of each identifiable factor
across the respondents. No statistical analysis was performed on these results.
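The tabulation step can be illustrated with a minimal sketch. It assumes that each transcript has already been coded into a set of factor labels and that a factor is counted once per respondent; the respondent identifiers and factor labels below are hypothetical and are not taken from the research data.

```python
from collections import Counter

# Hypothetical coded transcripts: each respondent maps to the set of factors
# (codes) identified in that interview. The labels are illustrative only.
coded_transcripts = {
    "respondent_01": {"regression", "segmentation"},
    "respondent_02": {"regression", "visualisation"},
    "respondent_03": {"segmentation", "regression"},
}


def tabulate_factor_frequency(transcripts):
    """Count how many respondents mentioned each factor at least once."""
    counts = Counter()
    for factors in transcripts.values():
        counts.update(set(factors))  # one count per respondent per factor
    return counts


frequency = tabulate_factor_frequency(coded_transcripts)
for factor, count in frequency.most_common():
    print(f"{factor}: {count}")
```

Counting each factor once per respondent is a design choice made for this sketch; counting every separate mention instead would only require storing lists of mentions rather than sets.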
4.6 Validity and reliability
The validity of research is determined by the internal and external validity of the research.
Internal validity is "the extent to which its design and the data that it yields allow the
researcher to draw accurate conclusions about cause-and-effect and other relationships within
the data" (Leedy & Ormrod, 2001:103), and external validity is "the extent to which its
results apply to situations beyond the study itself" (Leedy & Ormrod, 2001:105).
4.6.1 Internal validity
The importance of internal validity is in attempting to find other possible explanations for
the results obtained in the research (Leedy & Ormrod, 2001). The internal validity of this
research was checked by asking the respondents whether they agreed with the accuracy,
objectivity and reliability of the conclusions made by the researcher. Each respondent was
given a copy of the findings and requested to add any comments.
4.6.2 External validity
The intent of qualitative research is not to generalise the findings to the population,
but to attempt to interpret the event from a unique perspective (Creswell, 1994). The validity
criteria used in this research are that it is well argued and believable and that the purposive
sample should reflect the views of the general population.
4.6.3 Reliability
Similar research conducted in a different context within the same industry is unlikely to reach
different conclusions, but research in a different industry could do so. The reliability of the
research is therefore limited.
Marshall & Rossman (1989:148) suggested that "the researcher purposefully avoids controlling
the research conditions and concentrates on recording the complexity of situational
contexts and interrelations as they occur". It is unlikely that future researchers will replicate
the research by altering research strategies, and doing so is discouraged (Marshall & Rossman,
1989).
4.7 Completion of the research report
The research report was then written, identifying the dominant themes in this sector and
commenting about the applicability of the different algorithms and techniques and their
various uses in this sector.
The interview transcripts were summarized and each use assigned to two categories. The
methodology followed here was that of Chidley (2002).
The first use category was based upon the terms used by the respondents during the interviews,
that is, the terms they used in describing the specific data mining projects they had worked
on and/or the specific uses they assigned to or equated with each data mining technique or
algorithm.
The second categorization was done by using the generic data mining uses taken from the
literature. The aim of the specific project and use referred to by the respondent was com-
pared to the generic use category and if there was a match, the project or stated value and
use was assigned to that category. Sometimes the process followed in the actual data mining
was analysed and a category assigned to the project or technique used.
The interview data, processed in this way, were used as the basis for the results and the
interpretation of the results in this research report.
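The two-step categorisation described above can be sketched as a simple lookup. This is only an illustration under stated assumptions: the generic categories are named after uses discussed in this report, but the keyword lists, the function name `categorise` and the example inputs are hypothetical, and the actual matching in the research was done by the researcher's judgement rather than by keyword search.

```python
# Hypothetical generic use categories with illustrative keywords. The real
# categories came from the literature review; these keyword lists did not.
GENERIC_USES = {
    "Direct marketing": ("mailing list", "marketing", "campaign"),
    "Risk modeling": ("default", "liquidation", "bad debt"),
    "Credit scoring": ("scorecard", "credit score"),
}


def categorise(respondent_term, project_aim):
    """Return (first category, second category) for one stated use.

    The first category is the respondent's own term. The second is the first
    generic category whose keywords appear in the project aim, or None if
    there is no match.
    """
    aim = project_aim.lower()
    for category, keywords in GENERIC_USES.items():
        if any(keyword in aim for keyword in keywords):
            return respondent_term, category
    return respondent_term, None


example = categorise(
    "list cleansing", "Cleaning a client mailing list before a campaign"
)
print(example)
```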
Chapter 5: Results
5.1 Common data mining techniques and algorithms
In his work on data mining in the banking sector, Chidley (2002) proposed a metric based
on the findings of his literature review. This same metric was compared to what was
found when doing the literature review for this research report, and the categorization was
virtually identical. The common techniques and algorithms identified in section 2.2 were
compared to Chidley's findings and distilled into a single model showing how each technique
related to the others. This new model is shown in Figure 2:
Pre-preparation
Pure Statistics:
• Mean
• Standard deviation
• Graphical representation
• Probability distributions
• Inference
• Estimation
• Hypothesis testing
• Regression
• Discriminant analysis
Data mining techniques
Statistical | Artificial Intelligence
Interdependence | Dependence
Clustering:
• Conceptual clustering
• Interactive clustering
• K-nearest neighbor
• Memory based reasoning
Classification:
• Discriminant analysis
• Logistic regression
Neural networks:
• Network construction and training
• Network pruning
• Rule extraction
Classification:
• Rule induction
Decision Trees:
• CHAID
• CART
Backpropagation
Decision Trees:
• CART
• MARS
General additive models | Expert systems
General additive models | Fuzzy expert systems
Link analysis
Associative rules
Associative memories
Visualisation
• Scatterplot matrices
• Prospecting matrices
• Parallel coordinates
• Projection matrices
• Geometric projection techniques
Data Warehousing
• Extraction, Transformation, Loading (ETL)
• Data marts
• OLAP
• MIS
Figure 2: Data mining techniques
The techniques and algorithms found were categorised to enable the manager to easily and
at a single glance understand the techniques and algorithms used and to match these to the
possible uses of these techniques as described in this report.
Interviews were conducted with all the members of the expert panel with a view to establishing
the techniques used in data mining in the credit and data bureau sector. It was clear from the
interviews that there were numerous techniques referred to by the members of the panel,
and invariably the same terminology was used to describe the different techniques.
There were thirty-five techniques mentioned during the interviews and these are listed in
Table 2:
Table 2: Data mining techniques used in this sector
No. | Techniques | Total Occurrences
1 | Mathematical Algorithms / Basic Statistics e.g. Means, Std. Dev, Avg. | 15
2 | Regression | 15
3 | Segmentation | 14
4 | Artificial Intelligence | 11
5 | Profiling | 10
6 | Decision Trees | 7
7 | Visualisation | 7
8 | Data Warehousing | 6
9 | Predictive Modeling | 5
10 | Seasonality | 3
11 | Cluster Analysis | 3
12 | Chi-square | 3
13 | GIS | 2
14 | Classification | 2
15 | CHAID | 2
16 | Time series analysis | 2
17 | Probability Analysis | 1
18 | Trending | 1
19 | Delphi-Technique | 1
20 | Response Modeling | 1
21 | What-If Analysis | 1
22 | Iterative Searching | 1
23 | Pareto Analysis | 1
24 | Hypothesis Testing | 1
25 | Bio-statistics | 1
26 | Expert System | 1
27 | Contingency Tables | 1
28 | Correlation Analysis | 1
29 | Genetic Algorithms | 1
30 | Rule Based Analysis | 1
31 | Multivariate Analysis | 1
32 | Covariate Analysis | 1
33 | Co-joint Analysis | 1
34 | Trend Analysis | 1
35 | Pattern Recognition | 1
Totals | | 126
These thirty-five techniques were classified into the following categories:
Table 3: Summary of data mining techniques in this sector
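Category | Technique | # of occurrences in interviews
Descriptive stats | Basic statistics e.g. mean, std. dev. etc. | 15
Descriptive stats | Visualisation | 7
Descriptive stats | Seasonality | 3
Descriptive stats | Time series analysis | 2
Descriptive stats | Trending | 1
Descriptive stats | Pareto analysis | 1
Descriptive stats | Hypothesis testing | 1
Descriptive stats | Trend analysis | 1
Descriptive stats | Iterative searching | 1
Descriptive stats | Pattern recognition | 1
Inferential stats | Regression | 15
Inferential stats | Decision trees | 7
Inferential stats | Chi-square | 3
Inferential stats | CHAID | 2
Inferential stats | Probability analysis | 1
Inferential stats | Delphi-technique | 1
Inferential stats | Response modeling | 1
Inferential stats | Multivariate analysis | 1
Inferential stats | Covariate analysis | 1
Inferential stats | Co-joint analysis | 1
Inferential stats | Contingency tables | 1
Inferential stats | What-If analysis | 1
Data reduction | Segmentation | 14
Data reduction | Profiling | 10
Data reduction | Cluster analysis | 3
Data reduction | Classification | 2
Data reduction | Correlation analysis | 1
Data reduction | Rule based analysis | 1
Data reduction | Predictive modeling | 5
Numerical techniques | Artificial intelligence | 11
Numerical techniques | Bio-statistics | 1
Numerical techniques | Expert system | 1
Numerical techniques | What-If analysis | 1
Numerical techniques | Genetic algorithms | 1
Other | Data warehousing | 6
Other | GIS | 2
Other | Response modeling | 1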
5.1.1 Descriptive statistics
Every single bureau had a respondent speak of using simple mathematical algorithms, e.g.
means, standard deviations, averages and so on. Fifteen of the nineteen respondents indicated
that because of the large volumes of data they dealt with, the more basic mathematical
algorithms and statistical techniques were invaluable in:
• determining which parts of data sets could and/or should be mined,
• achieving a better understanding of what was contained in the datasets,
• getting a visual feel of the data,
• standardizing different data sets,
• matching different data sets,
• excluding bad / corrupt data,
• improving the quality of data.
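As an illustration of how such basic statistics support data preparation, the following minimal sketch computes a mean and standard deviation, excludes implausible records and standardizes the remaining values. The income figures, the range limits and the function names are assumptions made for the example and do not come from any bureau's data.

```python
import statistics


def describe(values):
    """Return the mean and sample standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def drop_invalid(values, lower, upper):
    """Exclude records outside a plausible range (e.g. corrupt entries)."""
    return [v for v in values if lower <= v <= upper]


def standardise(values):
    """Rescale values to z-scores so different data sets are comparable."""
    mean, stdev = describe(values)
    return [(v - mean) / stdev for v in values]


# Hypothetical monthly incomes; -1 and 9_999_999 stand in for corrupt records.
incomes = [4200, 5100, -1, 3900, 9_999_999, 4800]
clean = drop_invalid(incomes, lower=0, upper=1_000_000)
mean, stdev = describe(clean)
z_scores = standardise(clean)
print(clean, mean, round(stdev, 2))
```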
Of the nineteen people interviewed, seven indicated that they also made use of visualisation
techniques to better understand their data sets, to better understand the results of their data
mining exercises and also to highlight any discrepancies in their analysis.
Further mention was made of the other techniques listed in Table 3 in this category,
but mostly by single individuals. Interestingly, only one person used the term
hypothesis testing, although it was obvious from the interviews with virtually every single
person that all the data mining was using some form of hypothesis testing in that they were
hypothesizing as to the outcome of particular tests.
5.1.2 Inferential statistics
Of the eight bureaus, seven mentioned that they used regression in one form or another,
whether it was logistic regression, linear regression, bivariate or multiple or stepwise
regression.
Fifteen of the nineteen respondents indicated that regression analysis played a large role in
the data mining done by the bureaus.
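The simplest of the regression forms mentioned, bivariate linear regression, can be sketched with an ordinary least squares fit. The variable names and the data below are hypothetical and chosen so that the fitted line is exact; the respondents did not describe their models at this level of detail.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: number of open accounts versus months in arrears.
open_accounts = [1, 2, 3, 4, 5]
months_in_arrears = [3, 5, 7, 9, 11]
intercept, slope = fit_simple_linear_regression(open_accounts, months_in_arrears)
print(intercept, slope)
```

Multiple, stepwise and logistic regression extend the same idea to several predictors or to a binary outcome, and in practice would normally be fitted with a statistical package rather than by hand.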
Decision trees were mentioned by seven of the nineteen respondents, but used by only the
three bigger credit bureaus and both the data bureaus. Of this series of techniques, Chi-square
and CHAID were mentioned by one of the larger consumer credit bureaus and one of
the data bureaus as techniques specifically used because they were good techniques for shorter
time continuums and were excellent for explaining response models, an area that all of these
bureaus were moving into more and more.
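CHAID (Chi-squared Automatic Interaction Detection) chooses its splits using chi-square tests on contingency tables. A minimal sketch of the underlying Pearson chi-square statistic follows. The response counts are hypothetical, and the sketch computes only the statistic, not a full CHAID tree or a significance level.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two customer segments, columns are
# "responded" and "did not respond" to an offer.
responses = [[10, 20], [20, 10]]
chi2 = chi_square_statistic(responses)
print(round(chi2, 3))
```

A larger statistic indicates a stronger association between segment and response, which is why a split on that variable would be favoured when building a response model.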
5.1.3 Data reduction techniques
This category of techniques was well represented amongst all the bureaus. They all used
segmentation or classification, typically segmenting databases of customers into groups
that are as homogeneous as possible with respect to a variable such as creditworthiness.
Fourteen of the nineteen respondents listed this as an important part of data mining in this
sector, and this was also the second most referred to technique.
All of the bureaus also referred to profiling. Although the specific term was not found in the
literature review, the techniques described by the bureaus match those described in the
literature on classification. Some respondents also referred specifically to classification and
cluster analysis when describing these techniques.
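Segmentation on a single variable such as creditworthiness can be sketched as banding customers by score. The customer names, scores and cut-offs below are hypothetical, and fixed cut-offs are only one of several ways in which groups could be formed.

```python
def segment_by_score(customers, boundaries):
    """Assign each customer to a band based on a creditworthiness score.

    `boundaries` is an ascending list of cut-offs. A score below the first
    cut-off falls in band 0, below the second in band 1, and so on.
    """
    segments = {}
    for name, score in customers.items():
        band = sum(score >= cutoff for cutoff in boundaries)
        segments.setdefault(band, []).append(name)
    return segments


# Hypothetical customers, scores and cut-offs, for illustration only.
customers = {"A": 520, "B": 610, "C": 700, "D": 640, "E": 480}
segments = segment_by_score(customers, boundaries=[550, 650])
print(segments)
```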
5.1.4 Numerical techniques
Only one of the credit bureaus was using these techniques in conjunction with an external
vendor who was also interviewed. The techniques used included Artificial Intelligence, neu-
ral networks and to a lesser degree bio-statistics.
5.1.5 Other techniques
The other techniques mentioned here were data warehousing and Geographical Information
Systems (GIS). As was found in the literature, the larger credit bureaus and both the data
bureaus were using data warehousing to organise large volumes of historical information
gathered from large-scale client/server-based applications for further analysis.
Only one of the data bureaus mentioned using GIS techniques to mine its data, but it was
unwilling to provide further information as this was considered too sensitive at the time.
5.2 The uses of data mining in the credit and data bureau sector
During the interviews, the members of the expert panel mentioned several uses of data
mining. Some of these uses were described in different ways, but were clearly the same
thing and these uses were categorised by the members in nearly identical fashion. Given the
small size of this sector in South Africa, this is hardly surprising.
In total eighteen uses of data mining were discovered during the interviews. Each use will be
discussed in the following sections. Table 4 lists the uses found during the interviews.
Table 4: Uses of data mining in this sector
No. | Uses of data mining | Total Occurrences
1 | Direct marketing, including identifying marketing opportunities | 18
2 | Risk modeling, including predicting probability of liquidation or default and decreasing default rates | 18
3 | Credit Scoring | 10
4 | Prediction / Prevention of fraud, including forensic investigations | 6
5 | Behavioral Scoring | 5
6 | Response models | 5
7 | Macro + Micro Economic trends | 5
8 | Predicting income | 4
9 | Predictive modeling | 4
10 | Determining the risk of debtors books | 3
11 | Industry comparisons | 3
12 | Determining credit limits | 3
13 | Predicting probability of collection | 2
14 | Prediction of attrition | 2
15 | Predicting client exposure / affordability | 1
16 | Predicting lapses in policies | 1
17 | Competitive analysis | 1
18 | Tracing | 1
The table shows that the dominant uses of data mining in the credit and data bureau sector
in South Africa are direct marketing, risk modeling and credit scoring. Every single bureau is
doing some form of risk modeling and in some way assisting their clients with direct marketing,
either via cleansing of data or the creation of mailing lists or telephone lists. While the main
use of the techniques described above for the credit bureaus is still risk modeling and
assisting their clients in preventing or predicting default, bad debt or liquidation, the assistance
with direct marketing now features as much. Eighteen of the nineteen respondents
mentioned these uses of the techniques above.
Specific mention was made of the identification of marketing opportunities, the creation of
strategies based on existing market segmentation, the growing role of behavioral scoring
(mentioned five times at four of the bureaus) and of the profiling and prediction abilities of
data mining.
The predictive abilities of data mining for use in direct marketing were mentioned in different
forms on fourteen separate occasions during the interview process. These included predicting
income, attrition, client affordability and/or exposure and probability of accepting an
offer.
The third most common use, credit scoring, was mentioned by only ten of the nineteen
respondents. It was not in use by the data bureaus and was used by only seven of the credit
bureaus, not all of them as would be expected.
All of the uses of data mining mentioned in the literature were found at some of the bureaus,
with fraud and the prevention, detection and prediction thereof being a use at both the data
bureaus and four of the eight credit bureaus.
Uses that were not found in the literature included the tracking of macro- and micro-economic
trends, which was a new use of the data mining techniques at two of the credit bureaus
and one of the data bureaus as well as one of the vendors. Another use that was not
specifically mentioned in the research was that of using data mining techniques for the
tracing of debtors. This may be because of the stringent privacy laws internationally.
5.3 Future techniques and their possible uses
No specific techniques or possible techniques were mentioned in the interviews, and all the
respondents felt that data mining was still too new in their sector for them to be able to
predict any possible new techniques.
Eight of the respondents mentioned that they thought there should be some form of data
standardization, data set standards and/or one data standard for all elements in the
future.
Behavioral scoring was mentioned by six of the respondents as a definite new direction for
data mining in the sector with growing interest from all sectors of their client bases.
Artificial Intelligence techniques and their possible application were mentioned
by six of the respondents who were not currently using these techniques, but all
of them said that they had no experience of them and only thought they might be a possibility
to look into in the future, particularly for fraud prevention and predictive modeling for
direct marketing.
Chapter 6: Discussion
6.1 Common data mining techniques and algorithms
Before any real data mining is done, a basic understanding is needed of the data and the
data set. For this, the basic statistical techniques are typically used. There are many more
statistical techniques than those described in this research, but those mentioned here were
found to be the ones most commonly used to gain an understanding of data and in
preparation for further data mining.
The next step in the process of data mining is divided into two broad categories:
• Statistics and,
• Artificial Intelligence
The major difference between these two areas is that the field of statistics has its basis in the
science of pure mathematics and the field of pure statistics, and has undergone rigorous
mathematical proofs. Artificial Intelligence techniques are not necessarily subject to these
same rigorous mathematical proofs, but instead derive primarily from machine learning.
The two areas that follow, Visualisation and Data Warehousing, are both areas that display
the results of data mining. In data warehousing the data may be even further explored
using Online Analytical Processing (OLAP), resulting in Management Information Systems
(MIS) which may be visual in nature, whereas visualisation techniques usually assist in
the interpretation of the results of data mining in the form of charts and graphs. Another
visualisation technique is Geographical Information Systems (GIS), where data is converted
into spatial information and graphically displayed in the form of maps of areas, suburbs,
municipal areas and the like. This is an innovative way to not only graphically display the
findings of data mining, but to also make it easily and visually understandable to a large
audience base.
6.1.1 Descriptive statistics
When this category is compared to what was found in the literature, it is clear that all of the
bureaus are using all of the techniques found for this category of descriptive statistics when
embarking on data mining, particularly in the initial stages of their projects or for smaller
scale projects. Although no specific reference was made to probability distributions or esti-
mation techniques, it was again obvious from a large number of interviews that these
techniques, although not specifically named, were being applied in some form.
6.1.2 Inferential statistics
That all the bureaus except one mentioned regression in one form or another was not
surprising, as the literature review found that this is the most important of
all the multivariate techniques available. The one bureau that did not mention this technique
was also one of the smaller bureaus that did no data mining at all.
The respondents indicated that regression and decision trees, including Chi-square and CHAID
techniques, were being used more and more and that their value in creating credit policies,
deriving the most predictive values and assessing the predictivity of regression analysis
was excellent. This is also what was found in the review of the literature. In fact, Gargano &
Raggad (1999), Chidley (2002) and Koh & Low (2004) confirmed that the strengths of these
techniques were in their ability to generate more understandable rules and their ability to
indicate the relative importance of the variables for classification and prediction.
6.1.3 Data reduction techniques
The results found in Chapter 5, with this category of techniques being well represented,
match what Lee & Siau (2001) described. They gave the example of a typical classification
problem as being the division of a database of customers into groups that are as
homogeneous as possible with respect to a variable such as creditworthiness, exactly what
the bureaus in this sector are doing when using classification.
The bureaus all use their data and data mining techniques like segmentation and classification
to further filter and refine their data sets and also the data sets of their clients for
particular variables, typically being creditworthiness, affordability or profiling based on certain
demographic characteristics.
6.1.4 Numerical techniques
When the bureaus that were not using these techniques (all except one) were queried on
this, the almost standard reply was that they believed these techniques
were too complex and difficult, and did not yield results that were worth the additional effort
and expense. It was interesting to note from the literature review that these techniques are
considered to be widely and successfully used in many other industries, specifically in banking
and financial institutions.
6.1.5 Other techniques
Data warehousing enables the bureaus that use this technique to quickly and effectively
analyse very large volumes of data to enable them to further build models using other
techniques like regression analysis, for example for the building of behavioral scoring models.
All the bureaus mentioned the importance of this technique, but also listed the prohibitive
cost and the very low rate of successful implementation of data warehousing projects in all
sectors as an obstacle to implementing data warehouses in their companies.
Only one of the data bureaus mentioned using GIS techniques to mine its data, but it was
unwilling to provide further information as this was considered too sensitive at the time. It will
be interesting to see what the use of this technique will be and if the other bureaus follow
suit should this prove to be a successful tool for the particular bureau.
Looking at the findings above it is clear that the common data mining techniques in this
sector closely match those found in the literature review, so it can be accepted that the
sub-problem "What are the common data mining techniques and algorithms?" has been
answered by this research.
6.2 The uses of these techniques
The dominant uses of data mining in the credit and data bureau sector in South Africa are
direct marketing, risk modeling and credit scoring. Every single bureau is doing some form of
risk modeling and in some way assisting their clients with direct marketing, either via cleansing
of data or the creation of mailing lists or telephone lists.
The bureaus and their clients are all well aware that the bureaus hold the largest pool of
consumer information in the country, and that this is a good opportunity for the bureaus to
assist their clients in their direct marketing efforts.
The respondents reported a growing demand for these services, for the use of data mining
techniques to deliver them, and for the value they add for clients. This was consistent with
the finding in the literature review that data mining is a powerful tool for increasing
response rates and is ultimately of immense value to the organisation in direct marketing,
risk modelling and credit scoring.
Several of the respondents mentioned that there was a growing trend amongst their customers
to build their own scoring models, and that their clients relied on the bureaus less for
the actual models and more for specific data with which to build these models. The bureaus
were entering a new era in which they were being used more in a consulting role and as the
suppliers of data for their clients' own scorecards, whereas in the past they had built
these models for their clients.
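To illustrate what a client-built scorecard of this kind might look like, the following is a minimal sketch in Python. It is not drawn from any bureau or respondent in this research: the characteristics, bands, points and base score are all hypothetical, and a real scorecard would derive its points from a fitted statistical model rather than from hand-picked values.

```python
# Hypothetical scorecard: every characteristic, band and point value below is
# invented for illustration and does not come from the research.
# Each band is (lower bound inclusive, upper bound exclusive or None, points).
SCORECARD = {
    "months_since_last_default": [(0, 12, -40), (12, 36, 0), (36, None, 25)],
    "open_accounts": [(0, 3, 15), (3, 6, 5), (6, None, -10)],
}
BASE_SCORE = 600  # hypothetical starting score


def points_for(characteristic, value):
    """Return the points for the band that contains the given value."""
    for lower, upper, points in SCORECARD[characteristic]:
        if value >= lower and (upper is None or value < upper):
            return points
    raise ValueError(f"no band for {characteristic}={value}")


def score(bureau_record):
    """Add the points for every characteristic to the base score."""
    return BASE_SCORE + sum(
        points_for(name, bureau_record[name]) for name in SCORECARD
    )


# A client would feed data supplied by the bureau into its own scorecard.
applicant = {"months_since_last_default": 40, "open_accounts": 4}
applicant_score = score(applicant)
print(applicant_score)
```

In this arrangement the bureau supplies only the input record, while the bands and points, which carry the modelling decisions, remain with the client.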
Another use that was not specifically mentioned in the research was that of using data
mining techniques for the tracing of debtors. This may be because of the stringent privacy
laws internationally. Respondents mentioned several times the possible impact of the new
Credit Bill on the way that they conduct their business and on the way they will use data
mining techniques in future.
Although not all of the uses mentioned in the literature were found to be used at every
single bureau or even very widely in some cases, there were also some uses that were found
at the bureaus that were not found in the literature review. Overall a good description is
given of the uses of these techniques, thus addressing the research problem of what the
uses of these techniques are.
6.3 Future techniques and their possible uses
Every single respondent raised the new Credit Bill and the possible changes to legislation as
a factor that would influence any future data mining techniques and possible uses in this
sector. So far the quality standards for data in the sector have been self-regulating, but the
proposed bill would hold the directors of companies and in particular the bureaus personally
liable for any errors, and this was of grave concern to them.
All the bureaus believe that data standardization, or a set of data standards, will have an
enormous impact on the future of data mining in this sector and on the techniques used, as
one standard would make it simpler to combine data sets from different sources and
ultimately expedite the process of mining these data sets.
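The point about a single standard can be sketched in a few lines of Python. The sketch assumes two hypothetical data sources whose field names, formats and values are invented purely for illustration. Once each source is mapped onto one shared layout, combining the data sets becomes a simple keyed merge.

```python
# Hypothetical source layouts: none of these field names come from the research.
def standardise_a(record):
    """Map source A's layout onto the shared standard."""
    return {
        "id_number": record["IDNo"].strip(),
        "surname": record["Surname"].strip().upper(),
        "balance": float(record["Bal"]),  # source A stores rands as text
    }


def standardise_b(record):
    """Map source B's layout onto the shared standard."""
    return {
        "id_number": record["id"].strip(),
        "surname": record["last_name"].strip().upper(),
        "balance": record["balance_cents"] / 100,  # source B stores cents
    }


def combine(source_a, source_b):
    """Merge standardised records by id_number, summing balances per consumer."""
    combined = {}
    records = [standardise_a(r) for r in source_a]
    records += [standardise_b(r) for r in source_b]
    for rec in records:
        entry = combined.setdefault(
            rec["id_number"], {"surname": rec["surname"], "balance": 0.0}
        )
        entry["balance"] += rec["balance"]
    return combined


source_a = [{"IDNo": " A001", "Surname": "Nkosi", "Bal": "1500.50"}]
source_b = [
    {"id": "A001", "last_name": "nkosi ", "balance_cents": 25000},
    {"id": "B002", "last_name": "Smith", "balance_cents": 9900},
]
merged = combine(source_a, source_b)
print(merged)
```

Without an agreed standard, every new source needs its own mapping function of this kind. With one standard, those mappings fall away and only the merge step remains.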
There was a general feeling amongst all of the respondents that data mining and data
management were becoming more and more important, not only in their own organisations,
but also in those of their clients. It was felt that the impact of data mining would
continue to increase and receive more and more focus in the future.
Five of the bureaus also had plans either to improve their own staff's skills in the
statistical areas of data mining or to expand their data mining areas with more statistically
skilled individuals with greater experience than the current individuals.
This question was the least satisfactorily answered of all the questions. Data mining is still
a relatively new field in this sector, as in most sectors in South Africa, and the respondents
were not sure what the future direction might be. All of the respondents nevertheless
had some specific thoughts on the subject, thus addressing the problem of what the possible
future techniques may be.
  • 1. AN EVALUATION OF DATA MINING TECHNIQUES IN THE CREDIT AND DATA BUREAU SECTOR Gideon Stephanus du Toit A research report submitted to the Faculty of Commerce, Law and Management, University of the Witwatersrand, Johannesburg, in partial fulfillment of the requirements of the degree of Master of Business Administration. Johannesburg, March 2006
  • 2. ABSTRACT This research investigated the scope of data mining, common data mining techniques and algorithms, uses of these and also possible future direction of data mining techniques and what the possible value and uses of these techniques might be. A synthesis of the literature review gave a definition, scope, techniques and uses of data mining. A panel of experts was constituted to discover the uses, techniques, benefits and possible future benefits of the techniques in this sector. Thirty-five different techniques for data mining were found and these were classified into 4 different sections. Eighteen separate applications of data mining in this sector were uncovered. The research demonstrated the use of data mining in this sector although many techniques were not yet being used, especially amongst the smaller bureaus, but the possible future benefits of data mining would lead to the greater use of more techniques. Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page ii of vii
  • 3. Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page iii of vii DECLARATION I declare that this report is my own, unaided work. It is submitted in partial fulfillment of the requirements for the degree of Master of Business Administration at the University of the Witwatersrand, Johannesburg. It has not been submitted for any degree or examination in any other university. Gideon Stephanus du Toit April 2006
  • 4. Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page iv of vii ACKNOWLEDGEMENTS The assistance provided by a number of people in completing this research is greatly appre- ciated. Thanks to my wife, Christelle du Toit, for her unwavering support, love and assistance. My supervisor, Professor Neil Duffy, who provided his support and encouragement willingly and freely. The support of the members of the expert panel and the faith they showed by allowing me to conduct this research, and without whom this report would not have been possible.
  • 5. Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page v of vii TABLE OF CONTENTS ABSTRACT ............................................................................................................. DECLARATION ....................................................................................................... ACKNOWLEDGEMENTS ........................................................................................... TABLE OF CONTENTS ............................................................................................. LIST OF TABLES ...................................................................................................... LIST OF FIGURES .................................................................................................... LIST OF APPENDICES .............................................................................................. CHAPTER 1: INTRODUCTION .............................................................................. 1.1 THE RELEVANCE OF DATA MINING ............................................................... 1.2 THE IMPORTANCE OF THE STUDY ................................................................ 1.3 THE RESEARCH OBJECTIVES ........................................................................ 1.4 INTRODUCTION .......................................................................................... 1.5 THE STATEMENT OF THE PROBLEM .............................................................. 1.6 THE SUB-PROBLEMS .................................................................................... 1.7 THE DELIMITATIONS ................................................................................... 1.8 DEFINITION OF TERMS ................................................................................ 1.9 ASSUMPTIONS ............................................................................................ 
1.10 THE RESEARCH STRUCTURE ........................................................................ CHAPTER 2: LITERATURE REVIEW .................................................................... 2.1 DATA AND DATA MINING IN THE BUSINESS CONTEXT .................................. 2.2 DATA MINING TECHNIQUES AND ALGORITHMS ............................................ 2.2.1 Pure statistics ...................................................................................... 2.2.2 Artificial Intelligence (AI) methods ......................................................... 2.2.3 Genetic algorithms and genetic programming .......................................... 2.2.4 Decision trees ...................................................................................... 2.2.5 Data visualisation ................................................................................. 2.2.6 Rule induction methods ........................................................................ 2.2.7 Data warehousing ................................................................................ 2.3 THE USES OF THESE TECHNIQUES AND ALGORITHMS .................................. 2.3.1 Targeting / Predictive / Descriptive models .............................................. 2.3.2 Fraud prediction and identification ......................................................... 2.3.3 Going concern prediction ..................................................................... 2.4 THE FUTURE DIRECTION OF DATA MINING AND ITS TECHNIQUES AND THE POSSIBLE USES OF THIS ........................................... CHAPTER 3: RESEARCH QUESTIONS .................................................................. 3.1 WHAT ARE COMMON DATA MINING TECHNIQUES AND ALGORITHMS? ........... 3.2 WHAT ARE THE USES OF THESE TECHNIQUES? ............................................. 3.3 WHAT IS THE FUTURE DIRECTION OF DATA MINING AND ITS TECHNIQUES IN THIS SECTOR AND THE POSSIBLE USES THEREOF? .............. 
ii iii iv v vii vii vii Page
  • 6. CHAPTER 4: RESEARCH METHODOLOGY ........................................................... 4.1 QUALITATIVE RESEARCH PARADIGM ........................................................... 4.2 DESCRIPTIVE RESEARCH DESIGN ................................................................. 4.3 POPULATION AND SAMPLE .......................................................................... 4.4 DATA COLLECTION ...................................................................................... 4.5 DATA ANALYSIS .......................................................................................... 4.6 VALIDITY AND RELIABILITY ......................................................................... 4.6.1 Internal validity ................................................................................... 4.6.2 External validity ................................................................................... 4.6.3 Reliability ............................................................................................ 4.7 COMPLETION OF THE RESEARCH REPORT .................................................... CHAPTER 5: RESULTS .......................................................................................... 5.1 COMMON DATA MINING TECHNIQUES AND ALGORITHMS ............................. 5.1.1 Descriptive statistics ............................................................................. 5.1.2 Inferential statistics .............................................................................. 5.1.3 Data reduction techniques ..................................................................... 5.1.4 Numerical techniques ........................................................................... 5.1.5 Other techniques ................................................................................. 5.2 THE USES OF DATA MINING IN THE CREDIT AND DATA BUREAU SECTOR ... 5.3 FUTURE TECHNIQUES AND THEIR POSSIBLE USES ....................................... 
CHAPTER 6: DISCUSSION ................................................................................... 6.1 COMMON DATA MINING TECHNIQUES AND ALGORITHMS ............................. 6.1.1 Descriptive statistics ............................................................................. 6.1.2 Inferential statistics .............................................................................. 6.1.3 Data reduction techniques ..................................................................... 6.1.4 Numerical techniques ........................................................................... 6.1.5 Other techniques ................................................................................. 6.2 THE USES OF THESE TECHNIQUES ............................................................... 6.3 FUTURE TECHNIQUES AND THEIR POSSIBLE USES ....................................... CHAPTER 7: CONCLUSION AND RECOMMENDATIONS ...................................... 7.1 BUSINESS IMPLICATIONS ............................................................................ 7.2 SUGGESTIONS FOR FURTHER RESEARCH ..................................................... REFERENCES ........................................................................................................ APPENDIX A: THE WRITTEN REQUEST ............................................................. APPENDIX B: TELEPHONE PROTOCOL ............................................................... APPENDIX C: INTERVIEW PROTOCOL ............................................................... END ....................................................................................................................... ii iii iv v vii vii vii Page Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page vi of vii
  • 7. Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page vii of vii LIST OF TABLES TABLE# TABLE TITLE Table 1 Organisations that agreed to partake in the research ............................... Table 2 Data Mining Techniques used in this sector ............................................. Table 3 Summary of Data Mining techniques in this sector .................................. Table 4 Uses of Data Mining in this sector .......................................................... PAGE 23 22 22 22 FIGURE# FIGURE TITLE Figure 1 Research on basic scientific issues will influence data mining applications in many other areas ........................................................... Figure 2 Data mining techniques ....................................................................... PAGE 23 22 LIST OF FIGURES LIST OF APPENDICES APPENDIX A: THE WRITTEN REQUEST .................................................................. APPENDIX B: TELEPHONE PROTOCOL .................................................................. 23 22 PAGE
  • 8. Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page 1 of 51 Chapter 1: Introduction 1.1 The relevance of data mining Data mining has a tradition of research and practice going back to the early 1960s, when it was originally known as statistical analysis and in a cruder form as "data dredging" where it was implied that there was no specific predetermined hypothesis or aim. Data mining has evolved from statistical analysis using classical statistical techniques such as penetration analysis, univariate analysis, correlation, regression, chi-square and cross tabulation to be- ing augmented by more diverse techniques such as fuzzy logic, heuristic reasoning and neural networks. Since the 1990s the best approaches have been packaged together along with newer and even more powerful techniques and the results are being presented in much more user friendly and effective ways (Kimball et al, 1998:19; Parr Rud, 2001). Early applications of data mining were in specialist applications such as geological research (searching for natural resources e.g. mining exploration) and meteorological research (weather forecasting), and are presently applied in areas such as retailing, the insurance, financial and credit industries as well as the medical domain (Benyon-Davies, 1996). In today's intensely competitive global marketplace, enterprise decision makers look for ways to increase competitive advantages by eliminating inefficiencies, optimizing internal operations, and maximizing relationships with all organizational stakeholders (employees, customers, partners, and shareholders). One area that assists in this is the deployment of data mining technologies to leverage data-resources to enhance their decision-making capa- bilities (Nemati & Barko, 2003). 
Knowledge discovery / data mining techniques were formed from several decades of re- search into machine learning, pattern recognition, statistics and visualisation techniques and have been a research topic of long-standing interest (Vickery, 1997). The techniques used in data mining give knowledge workers deeper insights than those provided by management information systems, standard production reports, managed que- ries, executive information systems, and online analytical processing. Techniques employed in data mining to facilitate the finding of previously hidden informa- tion include the capabilities to discover rules, classify, partition, associate, and optimise. In a dynamic environment data continuously changes and the timeliness of using data mining translates into a big advantage for the user. The ability to seamlessly automate and embed some of the mundane, repetitive and tedious steps traditionally used is another advantage of data mining (Gargano & Raggad, 1999). 1.2 The importance of the study IBM defined four major operations for data mining reported in Technology Forecast, 1997 cited in Lee & Siau, 2001: 1. Predictive modeling: using inductive reasoning techniques such as neural networks and inductive reasoning algorithms to create predictive models.
  • 9. Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page 2 of 51 2. Database segmentation: using statistical clustering techniques to partition data into clusters. 3. Link analysis: identifying useful associations between data. 4. Deviation detection: detecting and explaining why certain records cannot be put into specific segments. Lee & Siau (2001) also defined three main steps in data mining. These steps are: 1. Preparing the data, 2. Reducing the data and, 3. Looking for valuable information in the data. The specific approaches may differ from company to company and researcher to researcher. Fayyad, Piatetsky-Shapiro & Smyth (1996), proposed the following steps: 1. Retrieving the data from a large database. 2. Selecting the relevant subset to work with. 3. Deciding on the appropriate sampling system, cleaning the data and dealing with missing fields and records. 4. Applying the appropriate transformations, dimensionality reduction, and projections. 5. Fitting models to the preprocessed data. A classification of techniques, algorithms, and uses in data mining, and possible future direction of data mining in this sector will provide managers and business users with a reference, source of understanding and a means to verify the claims made by this sector about the results of the data mining and the subsequent release of information and data sets. The results of data mining exercises and some of the generic uses of data mining and techniques in this field may be of use to other users. They may allow data miners them- selves to adapt some of these algorithms or techniques and to consider the possible future direction or use of data mining. An understanding of the uses of the techniques will also enable managers to better motivate use of the data mining services and data value-add of the bureaus. 
1.3 The research objectives Based on the background provided above, the research objectives become clearer: • To determine what common data mining techniques and algorithms are and what the uses of these techniques are; • To determine what the future direction of data mining techniques in this sector are and the possible uses of these future techniques.
  • 10. Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page 3 of 51 These objectives will aid the reader in understanding some of the benefits and uses that could be achieved for their organisation through the use of the data mining techniques and the subsequent data output by the vendors in this sector and how the users may benefit from understanding the techniques used and their value. The objectives of this research will be achieved by answering each of the research questions posed. 1.4 Introduction Many businesses today make use of data provided by credit and data bureaus and also of the data mining techniques (sometimes inadvertently and unknowingly) used by these bureaus. These include businesses like marketing research companies, banks, retailers, micro-lend- ers, brokers and employment agencies who have all along been avid consumers of the data and techniques used by the bureaus. The increased usage has been accentuated by in- creased interest in making efficient use of organisational data through data mining and data warehousing. Usage of all forms of data and data mining is gaining popularity and is being used more and more frequently, and this is likely to continue being the case. The algorithms and techniques used in data mining are complex and require a solid understanding of statistical methods and other techniques (Cabena, Hadjinian, Stadler, Verhees & Zanasi, 1998; Beynon, Curry & Morgan, 2001). Credit and Data Bureaus are ideal for this research since they collect and mine enormous amounts of data. Data Bureaus like Effective Intelligence hold more than 20,000,000 records (J. Ardagh from Effective Intelligence, personal communication, 30 January 2005) on credit active consumers in South Africa and Credit Bureaus like Kredit Inform hold more than 1,000,000 records (M. 
Hendriksen from Kredit Inform, personal communication, 30 January 2005) on business entities in South Africa and process more than 1,000,000 online requests for information daily. This information and the applied data mining is used in more than 3,000 businesses (C. Capper from Experian, personal communication, 30 January 2005) in South Africa to make credit decisions, for direct marketing, to predict fraud, consumer behaviour or the propensity of a business to default. 1.5 The Statement of the problem The aim of the research is to identify and evaluate data mining techniques in the Credit and Data Bureau sector and to expand on the body of knowledge available to managers in this sector, and users of these data and techniques as clients of this sector. Describing and classifying the main data mining algorithms and techniques, and comment- ing on the generic uses to the end-user, tools used and possible future direction of data mining provide the background for this study. The aim of the research and sub-problems are based on a study done by Chidley (2002) on an evaluation of data mining techniques in the banking sector. This was expanded to include research into the possible future direction of data mining in this sector and the uses thereof. These objectives should assist managers and business people who interact with this sector to better understand the techniques used, and the benefits and uses of these techniques. Users get their data from these vendors and are not sure what the vendors have done to this
  • 11. data in order to get the delivered results. If users understand the uses of data mining and the techniques and tools used they could build on this or even request new or unmined data to analyse. 1.6 The sub-problems I. What are common data mining techniques and algorithms? II. What are the uses of these techniques? III. a. What is the future direction of data mining techniques in this sector? b. And the possible uses of these future techniques? 1.7 The delimitations This study will not compare software tools used by the bureaus. 1.8 Definition of terms Data mining - Data mining is the process of extracting valuable knowledge from large databases and using it to make decisions critical to some organisations. There are a number of features to this definition: I. Data mining is concerned with the discovery of hidden, unexpected patterns of data. II. Data mining usually works on large volumes of data. Frequently large volumes are needed to produce reliable conclusions in relation to data patterns. III. Data mining is useful in making critical organisational decisions, particularly those of a strategic nature. (Benyon-Davies, 1996; Kimball, Reeves, Ross & Thornthwaite, 1998). 1.9 Assumptions The assumptions made are based on what Chidley (2002) used in his study and are also applicable here: I. That the experts approached for the study will have sufficient skills and experience in the field for the report to present a true reflection of the uses to which data mining is being put; II. That the experts' views were representative of those in this sector. 1.10 The research structure The research was based on the literature review and the results from interviewing experts in this sector in data mining. The literature review reveals current definitions of data mining and techniques (including algorithms as applicable) used and the uses of these techniques as well as possible future directions of techniques and data mining. 
The chapter concludes with three research questions.
The results are presented in Chapter Five. The results of a synthesis of the literature, in order to answer two of the three research questions, are presented. This chapter also describes the results of the interviews with members of the expert panel.
In Chapter Five, the applications of data mining that were found in the interview process are reviewed. This allows comparisons to be made between the uses discovered during the literature review and the uses suggested by the expert panel. Appropriate conclusions are drawn in Chapter Six.
A similar process is followed with regard to data mining techniques and algorithms. A contrast is drawn between the techniques and algorithms mentioned in the literature and the techniques being used in the Credit and Data Bureau sector.
Chapter Five is finalized with a summary and discussion of the expert panel's views on the possible future techniques of data mining and possible uses of these techniques in this sector.
The research is concluded with a chapter for conclusions and recommendations. In this chapter, the research questions are again posed and a summarized answer to each is presented, together with the business implications of the research and suggestions for future research.
Chapter 2: Literature Review
2.1 Data and data mining in the business context
Data mining is defined as: "... leveraging data-mining tools and technologies to enhance the decision-making process by transforming data into valuable and actionable knowledge to gain a competitive advantage." (Nemati & Barko, 2003:282).
Knowledge discovery has been defined as: "...the 'extraction of implicit, previously unknown, and potentially useful information from data'. The information extracted includes concepts, concept interrelations, classifications, decision rules, and other patterns of interest." (Vickery, 1997:107)
Data is everywhere and is used and created in almost every activity in an organisation's day-to-day workings. The amount of data collected and stored continues to grow at an enormous rate. Unfortunately for business users wishing to mine this data, to add value to it or to create value from it, this data is usually stored in a way that is essentially random. How to create a competitive advantage from this data and its mining is the critical challenge facing many organisations today (Forcht & Cochran, 1999).
Recently three new and interrelated areas that emphasise obtaining and creating more information and knowledge from data have emerged strongly in information systems and information technology. These are:
• Data warehousing
• Knowledge management
• Data mining
Data mining can be considered a recently developed methodology and technology that has seen increased focus and importance in organisations and that will have an important impact on an organisation's performance. Data mining has only come into prominence in the last ten or so years. Recently data mining has gained widespread attention and increasing popularity in the commercial world.
Successful data mining applications have been reported and recent surveys have found that data mining has grown in usage and effectiveness (Fayyad, Piatetsky-Shapiro & Smyth, 1996; Koh & Low, 2004).
2.2 Data mining techniques and algorithms
In the review of the literature the terms "techniques", "algorithms" and "tools", and the terminology used to describe them, were found to refer to the same or similar things. Chidley (2002) found the same in his research.
"Techniques" were described by Lee & Siau (2001) as a clustering of similar mathematical algorithms like statistics, artificial intelligence, decision tree approach, genetic algorithm, and visualisation, while the "tools" were described by Gargano & Raggad (1999) as including artificial intelligence methods (e.g. expert systems, fuzzy logic), decision trees, rule induction methods, genetic algorithms and genetic programming, neural networks (e.g. backpropagation, associative memories), and clustering techniques.
"Algorithms" are defined as the mathematical and statistical formulas and/or software code behind specific ways of querying the data when mining it (Chidley, 2002). Gargano & Raggad (1999:83) further defined the tools used in data mining as "simple, concise, easy to implement algorithms, that model nonrandom (i.e. statistically significant) relationships (or patterns) in large historic data sets."
For the purposes of this research the terms "techniques", "algorithms" and "tools" will be used interchangeably. A clear distinction must however be made between the techniques used for data mining and the uses of data mining. A review of the literature found the following techniques:
2.2.1 Pure statistics
Basic statistics
Statistics is the most basic and an indispensable component of data mining. It is also used to evaluate the results of the mining done and to separate the good from the bad. Statistics allows the miner to get a hands-on, and sometimes visual, feel for the data, enables a basic understanding of the nature of the data and serves as an indication of the most suitable techniques for further mining. It is used in the cleaning of data and enables the identification of outliers and anomalies ("noise") in the data. Statistics also assists in dealing with missing data using estimation techniques (Lee & Siau, 2001).
Probability distributions - Probability distributions aim to find relations between data points or variables (Forcht & Cochran, 1999).
Inference - Inference estimates the likelihood of various outcomes, given a set of variables, and is frequently a step beyond a probability distribution as it often uses the results of a probability distribution as part of its raw data (Forcht & Cochran, 1999).
Estimation - One way of dealing with missing data is the use of estimation techniques (Lee & Siau, 2001).
Estimations are almost always made on the basis of assumptions that may not be strictly met for a variety of different reasons. When this happens one should not assume that if the model is incorrect, the assumptions must be incorrect. This may sometimes be true but is not always the case. Analysts often test their models by finding ways to weaken their assumptions. They attempt to discount weak assumptions and leave only the strongest assumptions. When using inference or estimation models, different models may be sound even though they have competing assumptions. Instead of using only one model, it is best to use several and to combine them in a weighted average, which should improve the quality of the estimation made (Forcht & Cochran, 1999).
Hypothesis testing - Hypothesis testing is a type of estimation that seeks an answer that is binary in nature. The test seeks only a "yes or no" type of answer to verify whether a hypothesis is plausible or not. Usually, one hypothesis is tested against an alternative one to find the stronger of the two (Forcht & Cochran, 1999).
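The "yes or no" character of hypothesis testing can be illustrated with a minimal sketch. The example below is not taken from the literature reviewed; it assumes a two-proportion z-test, a hypothetical function name (two_proportion_z_test) and invented default counts for two customer segments. The null hypothesis is that both segments default at the same rate; the alternative is that the rates differ.

```python
import math

def two_proportion_z_test(defaults_a, n_a, defaults_b, n_b, alpha=0.05):
    """Test H0: both segments share one default rate (two-sided z-test).

    Returns (z, p_value, reject_h0). All names and inputs are illustrative.
    """
    p_a = defaults_a / n_a
    p_b = defaults_b / n_b
    # Pooled default rate under the null hypothesis of equal rates.
    pooled = (defaults_a + defaults_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value for a standard normal variable: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value, p_value < alpha

# Hypothetical counts: 120 defaults out of 1000 versus 60 out of 1000.
z, p_value, reject = two_proportion_z_test(120, 1000, 60, 1000)
print("reject H0" if reject else "do not reject H0")
```

The final boolean is the binary answer described above; the significance level alpha is a choice made by the analyst, not something the test supplies.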
Regression - This is the most important of all the multivariate techniques available to non-experimentalists. Once analysts understand regression, almost any question amenable to quantitative analysis can be answered. This technique, perhaps more than any other data manipulation technique, lends itself to visualisation. Regression contains many different subsets, e.g. bivariate or multiple regression. In its purest form regression answers the common query: What is the relationship between variable X and variable Y? (Lewis-Beck, Berry, Feldman, Fox & Hardy, 1993). This technique has a myriad of uses in data mining (Koh & Low, 2004).
Discriminant analysis - This is a classification technique used to describe group separation (Rencher, 1995; Gordon, 1999). Standard linear discriminant analysis involves a linear classification boundary and is used to group the population (Rencher, 1995), but it should be noted that it depends on assumptions regarding normality of the underlying populations, which must also possess identical variance-covariance matrices. The linear rule can be shown to minimise the expected number of misclassifications.
Clustering
Clustering may be a preparatory step to segmenting a database before applying other data mining techniques, or a separate technique for data mining (Chidley, 2002). The technique itself is the process of identifying useful and homogenous clusters (e.g. objects or people), patterns, relationships or interesting trends with similar characteristics in time-dependent data (Emory & Cooper, 1991; Gargano & Raggad, 1999; Forcht & Cochran, 1999; Lee & Siau, 2001). A cluster or pattern may be regarded as a collection or class of records sharing something in common. Conceptual clustering uses not only similarity but also what has been called 'conceptual cohesiveness' as defined by background information.
Interactive clustering includes contributions from the human user's knowledge (Vickery, 1997).
Classification
Classification is the process of dividing and allocating data items in a data set into previously defined and mutually exclusive groups so that the members of each group are as close as possible to one another, and the members of different groups are as far as possible from one another. An example of a typical classification problem is dividing a database of customers into groups that are as homogeneous as possible with respect to a variable such as creditworthiness (Lee & Siau, 2001).
Link analysis
Link analysis is a descriptive approach to identifying useful associations and relationships between values in a database (Lee & Siau, 2001).
Association rules and associative memories
These techniques are used to mine transactional or relational databases (Lee & Siau, 2001) and are able to detect similarities between new patterns and previously stored patterns (Caudill & Butler, 1990). The main tool used for this according to Gargano & Raggad (1999) is associative memories, where pairs (or larger groups) of associated data items are memorised (or discarded, in effect "forgotten") using a long-term memory network mode. A partial stimulation of the long-term memory network results in a retrieved data pair. This retrieved pair may be either a previously memorised pair or the network's best attempt to reconcile the initial stimulus with a reasonable output pair response.
2.2.2 Artificial Intelligence (AI) methods
Artificial Intelligence techniques are widely used in data mining (Lee & Siau, 2001; Koh & Low, 2004). These include neural networks, backpropagation, expert systems and fuzzy logic (Gargano & Raggad, 1999; Zwick, 2004).
Neural networks
Neural networks were originally designed for use mainly in the disciplines of psychology and biology. Their application in a data mining context is driven by the desire to exploit their properties as non-linear statistical methods (Beynon et al, 2001). These are powerful techniques for analysing complex non-linear and interaction relationships, and can be used to supplement and complement traditional statistical methods, for example in constructing going concern prediction models (Lee & Siau, 2001; Koh & Low, 2004).
Neural networks are some of the most common types of data mining tools used. They are used for recognising patterns in data, especially when the relationships between the dependent and independent variables are unknown and/or complex. They are designed to "think" like, and are modeled after, the human brain, which can be perceived as a highly connected network of neurons (called nodes in neural network terminology). Each node (in a layer of nodes) receives inputs from at least one node in a previous layer, combines the inputs and generates an output to at least one node in the next layer. Generally, the independent variables comprise the input layer and the dependent variable the output layer, and between these there may be one or more hidden layers of nodes.
In combining inputs and generating an output, each node performs a computation (to combine the inputs) and a transformation (to generate an output). Each connection between two nodes has a weight that determines how the input from a prior node must be combined with other inputs to generate an output that must be received by the next node (Vickery, 1997; Gargano & Raggad, 1999; Lee & Siau, 2001).
Neural networks first break down data sets into smaller, more manageable pieces before trying to discover patterns in the data. Such techniques require large amounts of resources and frequently require some custom programming for each search, as well as more processing afterward, because the system may "discover" patterns that seem logical to it but that turn out, after human intervention, not to be (Forcht & Cochran, 1999; Koh & Low, 2004).
Lu et al. (1996) (cited in Lee & Siau, 2001) split the neural network-based data mining approach into three major phases:
• Network construction and training: in this phase, a layered neural network based on the number of attributes, number of classes, and chosen input coding method is constructed and trained.
• Network pruning: in this phase, redundant links and units are removed without increasing the classification error rate of the network.
• Rule extraction: classification rules are extracted in this phase (Lee & Siau, 2001).
Backpropagation systems
These techniques are highly supervised. The backprop neural network model is ideal for prediction and classification in situations where there is a good deal of historic data available for training. This tool uses output variables generated by the neural network that are corrected by adjusting the weights of the hidden layer variables until the output variables match those in the training dataset (Gargano & Raggad, 1999; Chidley, 2002).
Expert systems
Expert systems are made up of a knowledge base of rules (extracted from experts), facts (or data), and a logic-based inference engine (or control) that creates new rules and facts based on previously accumulated knowledge and facts. Expert systems attempt to mimic, with some success, the reasoning of human experts whose knowledge of a specific and narrow domain is deep. This permits human experts and expert systems to arrive at similar conclusions, which serves to justify the system's existence by improving the expert decision maker's own productivity. The expert system thus operates using queries formulated by human experts and incorporated into the system. Expert systems do not rely on algorithmic or statistical methods and cannot solve problems that have not been defined during the programming of the model (Jackson, 1990; Gargano & Raggad, 1999; Chidley, 2002).
Jackson (1990:4) listed the following characteristics for expert systems:
• They simulate human reasoning,
• They perform reasoning "over representations of human knowledge",
• Heuristic or approximate methods are used to solve problems (which does not guarantee success as would have been the case had algorithmic techniques or solutions been used).
Fuzzy expert systems
Fuzzy expert systems employ fuzzy logic concepts and were developed in an attempt to solve the brittleness problem inherent in expert systems. The truth or falsity of a fact can be measured in a fuzzy way using values from the real number interval zero to one inclusive (i.e. [0, 1]). In expert systems, information is either totally false (i.e. zero) or totally true (i.e. one), but in fuzzy expert systems, truth values can lie anywhere on the zero to one interval of real numbers. Some facts are close to being true or close to being false (having low entropy), while other facts lie close to the middle between being true or false (having high entropy). Using fuzzy operators, such as AND, OR, NOT, VERY, and SOMEWHAT, the system can make fuzzy implications. Fuzzy systems can easily handle illogical complexities, poor clarity (in the facts and/or rules), or internal inconsistencies (Gargano & Raggad, 1999).
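The fuzzy operators named above can be sketched in a few lines. The definitions below are an assumption: they follow commonly used conventions (minimum for AND, maximum for OR, complement for NOT, squaring for VERY and the square root for SOMEWHAT) and are not necessarily the ones intended by Gargano & Raggad (1999). The fuzziness function uses the binary Shannon entropy as one possible measure of how close a truth value lies to the middle of the interval. All function names and truth values are hypothetical.

```python
import math

def fuzzy_and(a, b):
    return min(a, b)        # AND as the minimum of the truth values (assumed)

def fuzzy_or(a, b):
    return max(a, b)        # OR as the maximum of the truth values (assumed)

def fuzzy_not(a):
    return 1.0 - a          # NOT as the complement

def very(a):
    return a ** 2           # VERY sharpens: pushes partial truths toward 0

def somewhat(a):
    return math.sqrt(a)     # SOMEWHAT relaxes: pushes partial truths toward 1

def fuzziness(a):
    """Binary Shannon entropy: 0 at a = 0 or 1, and 1 at a = 0.5."""
    if a in (0.0, 1.0):
        return 0.0
    return -(a * math.log2(a) + (1 - a) * math.log2(1 - a))

# Hypothetical truth values for two facts about a customer.
good_payer = 0.8
high_income = 0.4

# Rule: IF VERY good payer AND SOMEWHAT high income THEN low risk.
low_risk = fuzzy_and(very(good_payer), somewhat(high_income))
print(round(low_risk, 3), round(fuzziness(low_risk), 3))
```

A crisp expert system would have to round each fact to zero or one before applying the rule; here the conclusion carries a degree of truth of its own.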
2.2.3 Genetic algorithms and genetic programming
Genetic algorithms are a relatively new technique inspired by Darwin's theory of evolution (natural selection and survival of the fittest). A population of rules, which may or may not represent a solution to a problem, is created at random. Pairs of these rules (usually the strongest rules are selected as "parents") are then combined to produce "offspring" for the next generation. A mutation process is used to randomly modify the genetic structures of some members of each new generation. The system runs for dozens or hundreds of generations and is only terminated when an acceptable or optimum solution is found, or after a fixed time limit. Genetic algorithms are appropriate for problems that require optimisation with respect to some computable criterion (Lee & Siau, 2001; Mitchell, 2005).
While genetic algorithms evolve complex data structures, genetic programming evolves complex algorithmic structures (i.e. computer programs). This technique is useful for finding solutions to hard optimisation problems by generating optimal or near optimal solutions to such problems, for fine tuning the parameters of other data mining techniques and models, and also for classification (Vickery, 1997; Gargano & Raggad, 1999; Lee & Siau, 2001).
2.2.4 Decision trees
Decision trees - This is a statistical approach based on a branching system of decisions. A decision rule is answered at each node either positively (Yes) or negatively (No). The answer gives another set of decisions (Gargano & Raggad, 1999).
Koh and Low (2004:466) summarised it well: "In the Automatic Interaction Detection (AID) algorithm, all possible two-way splits of each node for each independent variable are examined. The split that leads to the most significant t-statistic (as per the analysis of the variance) for the difference in means of the dependent variable between the two lower-level nodes is selected. In the chi-square Automatic Interaction Detection (CHAID) algorithm, the chi-square statistic is used to determine the best split while in the Classification and Regression Trees (CART) algorithm, an index of diversity is used to determine the best split."
This technique has several strengths:
• Understandable rules can be generated
• Both continuous and categorical variables can be handled
• The ability to indicate the relative importance of the variables for classification and prediction
• Outputs are easy to understand
• They are relatively simple to implement and
• Their results can be easily explained
(Gargano & Raggad, 1999; Chidley, 2002; Koh & Low, 2004)
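The search for the best two-way split at a single node, as described in the quotation above, can be sketched as follows. This is an illustration under stated assumptions rather than the CART algorithm itself: it takes the Gini impurity as the "index of diversity", handles only one numeric variable and a binary (0/1) target, and uses hypothetical function names and invented data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)   # equals 1 - p**2 - (1 - p)**2 for two classes

def best_split(values, labels):
    """Examine every two-way split of one numeric variable.

    Returns (threshold, weighted_impurity); threshold is None when no
    split improves on the impurity of the unsplit node.
    """
    n = len(values)
    pairs = list(zip(values, labels))
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between neighbouring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Invented data: monthly income (thousands) and whether the customer defaulted.
income = [10, 20, 30, 40, 50, 60]
defaulted = [1, 1, 1, 0, 0, 0]
print(best_split(income, defaulted))   # (35.0, 0.0)
```

The resulting rule, "income at or below 35 implies default" for this toy data, is the kind of understandable output listed among the strengths of the technique. A full tree would repeat the search within each lower-level node and across all independent variables.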
2.2.5 Data visualisation
Visualisation is a method of clearly presenting the typically complex results found using data mining tools. This allows the presentation of the complex interdependencies among many attributes in a visual format in order to get an intuitive feel of the data and the results of the analysis. Analysts and management users can easily assess and make sense of vast amounts of data. Techniques include colors, shapes and sounds in various combinations, statistical scatter plots, decision trees, plots that demonstrate the results of curve fitting, geographical maps, or a development dashboard which tracks and controls the evolution of a data mining modeling tool (Gargano & Raggad, 1999; Lee & Siau, 2001).
2.2.6 Rule induction methods
Rule induction uses statistical discovery methods to develop rules that depend on the frequency of correlation, the rate of accuracy, and the accuracy of prediction. Typically, IF - THEN type rules are created by focusing on either the variables forming the IF part of a rule or the variables forming the THEN part of a rule. For rule induction it is useful to think of data mining from marketing databases. The technique is based on measures of data ambiguity or approximation quality. These measures are formulated in terms of ratios, involving objects either definitely or possibly allocated to a decision class, on the basis of a given table or data matrix. The end result is a set of decision rules, which are very easy to understand and interpret. Rule induction is a useful tool for development of expert systems (Gargano & Raggad, 1999; Beynon et al, 2001).
Gargano & Raggad (1999:85) caution that: "Sometimes, however, the novelty, significance, value, or exceptionality of a rule is deemed to be most interesting. Rule induction methods are highly unsupervised, however, they do require that experts evaluate the rules generated. This technique is most often used when new rules need to be generated. Owing to the combinatorially explosive nature of generating rules in this manner, such models usually run in the background or at times when computing demand is low."
2.2.7 Data warehousing
Data warehousing is described by Lee & Siau (2001) as one of the most important research areas related to data mining. A data warehouse is necessary to organise historical data gathered from large-scale client/server-based applications for further analysis. A data warehouse is a read-only database containing large volumes of subject-oriented data, where all levels of an organisation can find the information in a timely manner (Lee & Siau, 2001).
Kimball et al (1998:19) call the data warehouse the foundation of decision-making in an organisation: "The queryable source of data in the enterprise". Data warehousing enables each user to share a common, diverse database that they may analytically explore, using all of the available data quickly and correctly, and increases the effectiveness of data-driven decision making (Cabena et al, 1998; Gargano & Raggad, 1999).
The data warehouse architecture consists of a series of data marts that give a consolidated, consistent view of the organisation's historical analytical, time-based data (Cabena et al, 1998; Kimball et al, 1998). Raw data are extracted, cleaned, transformed, and integrated into the marts from a variety of sources. Metadata, data about the data in the warehouse, is also an integral part of the system. The warehouse architecture must manage standard information delivery systems and data queries, interfaces with applications development platforms and management information systems (MIS), and online analytical processing (OLAP), in addition to advanced information technology data mining and business intelligence tools (Kimball et al, 1998; Forcht & Cochran, 1999; Gargano & Raggad, 1999).
2.3 The uses of these techniques and algorithms
Mitchell (1999) stated that in the field of data mining there are practical applications in areas like analyzing medical outcomes, detecting credit card fraud, predicting customer purchase behavior, predicting the personal interests of internet users, optimizing manufacturing processes and identifying which bank-loan applicants are at high risk of failing to repay their loans.
As shown in Figure 1 from Mitchell (1999), data in such applications typically consists of time-series descriptions of customer bank balances and other demographic information. Other data mining applications include predicting customer purchase behavior, customer retention, and the quality of goods produced by a particular manufacturing line.
Mitchell (1999) believes that research on basic scientific issues (like the medical field) will influence data mining applications in many other business related areas. Data mining thus feeds on itself: techniques used in one sector or industry may be adapted for different uses in another. Data miners learn from other data miners, and a technique that has one use in one sector could have a completely different use in another.
Figure 1: Research on basic scientific issues (left) will influence data mining applications in many areas (right). Source: (Mitchell, 1999)
Figure columns: Scientific Issues; Basic Technologies; Applications
Scientific issues shown: learning from mixed media data, such as numeric, text, image, voice, sensor; active experimentation, exploration; optimizing decisions, rather than predictions; inventing new features to improve accuracy; learning from multiple databases and the Web
Applications shown: Medicine; Manufacturing; Financial; Intelligence analysis; Public policy; Marketing
Data mining and its techniques can be applied to many areas in business and in many different businesses. The different uses of the techniques used in data mining described below have been extracted from the literature and have uses in the sectors that make use of the data and credit bureaus as well as in this sector itself.
2.3.1 Targeting / Predictive / Descriptive models
These models typically calculate a value that represents possible future activity. This could be a purchase amount or the likelihood of an action, such as a response to an offer or defaulting on a loan (Parr Rud, 2001). They may include:
• Customer profiling and segmentation
Understanding customers' demographics, attributes and behaviour is the first step in good customer relationship management. Data mining enables understanding of who the customers are and how to split them into segments that have the same or similar attributes. This leads to further mining to enable steps like prospecting, scoring, propensity to buy and others as discussed later (Vickery, 1997; Cabena et al, 1998; Gargano & Raggad, 1999; Lee & Siau, 2001; Parr Rud, 2001; Geist, 2002; Nemati & Barko, 2003).
• Database marketing
Database marketing is a type of marketing segmentation used by businesses via data mining. Data mining of customer databases has had a large impact on marketing in organisations. Individual consumers can be targeted for direct marketing offers. The value here is that the correct customer may be directly targeted with the correct offer, saving time, money and effort and enabling a focused approach to marketing that promises much better results. Algorithms are used to predict consumer behavior by predicting which consumers would be most responsive to promotional and sales campaigns (Forcht & Cochran, 1999). The value and goal of this type of marketing is to attract new clients, to retain profitable clients or to avoid high-risk clients, and multiple opportunities for this exist in data mining of large databases. Increasing the response rates of direct mailing campaigns by margins as small as 1-2% can have large impacts on ROI, and data mining is a powerful tool for increasing response rates and is ultimately of immense value to the organisation (Cabena et al, 1998; Forcht & Cochran, 1999; Parr Rud, 2001; Apte, Liu, Pednault, & Smyth, 2002).
• Customer attrition prediction
A growing risk for companies in ever more competitive markets is the loss or attrition of their customers to competitors. Data mining is used to predict these customer losses and to identify vulnerable customers so that steps may be taken to prevent or mitigate attrition. This saves the cost and effort of attracting new clients, and avoids spending on attracting customers who depart before their lifetime value has justified the expense of attracting them in the first place (Cabena et al, 1998; Nemati & Barko, 2003).
• Credit scoring / Risk modelling
Credit scoring algorithms have the ability to consider and use many different factors and variables in determining a customer's 'creditworthiness' and assigning a credit limit or particular loan amount to that customer, either in pre-scoring to extend a marketing offer or when the customer applies for credit. This is very valuable in ensuring that a customer does not have a line of credit extended to them that they cannot or will not repay. This has a knock-on effect in savings of time, effort and expenditure by preventing unnecessary collections and administration. Numerous companies have used data mining in developing credit risk scores for their own use or for selling on to other users (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud, 2001; Geist, 2002; Nemati & Barko, 2003).
Customers' data is mined and algorithms applied in an attempt to determine who the higher risk clients are so that these may be either avoided or a different interaction strategy enacted to deal with them. An insurance company may for instance want to determine the risk profile of clients to enable them to customise each client's policy individually (Parr Rud, 2001; Apte et al, 2002).
• Customer value analysis
Performing customer value analysis and lifetime value analysis allows managers to understand their customer database in terms of revenue and risk. Mining the customers' data:
- Assists in determining the risk category;
- Assists in determining the amount of customer spend over a given period;
- Lets the manager assign a value to each customer that is used in determining the company's interaction and dealings with each client on an individual basis (Cabena et al, 1998; Parr Rud, 2001).
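As an illustration of the credit scoring use described above, the sketch below combines several customer variables into a probability of default and turns that probability into a decision. It is a minimal sketch under stated assumptions: a logistic model is assumed, the variable names, weights, intercept and cut-off are all hypothetical, and none of them come from the bureaus or the literature reviewed. In practice the weights would be fitted to historical repayment data.

```python
import math

# Hypothetical weights, for illustration only; positive values raise risk.
WEIGHTS = {
    "missed_payments": 0.8,       # number of missed payments in the last year
    "credit_utilisation": 1.5,    # share of available credit in use (0 to 1)
    "years_at_address": -0.2,     # longer residence lowers the estimated risk
}
INTERCEPT = -3.0                  # hypothetical baseline log-odds of default

def default_probability(applicant):
    """Combine the applicant's variables into a probability of default."""
    linear = INTERCEPT + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-linear))   # logistic transformation

def decide(applicant, cutoff=0.2):
    """Decline when the estimated default probability exceeds the cut-off."""
    return "decline" if default_probability(applicant) > cutoff else "approve"

# Two invented applicants.
steady = {"missed_payments": 0, "credit_utilisation": 0.2, "years_at_address": 5}
stretched = {"missed_payments": 3, "credit_utilisation": 0.9, "years_at_address": 1}

print(decide(steady), decide(stretched))
```

The same probability could feed a credit limit or a customer value calculation instead of a simple approve or decline decision; the cut-off reflects the lender's appetite for risk rather than anything intrinsic to the model.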
2.3.2 Fraud prediction and identification
Fraud costs companies and economies millions of Dollars / Pounds / Rands every year, and with the increase in electronic transactions, credit cards and telephonic transacting it is becoming even more prevalent. The masses of data available to companies allow them to mine these transactions and applications in an effort to identify or predict fraud. The general approach is to build a model of known, suspected or potential fraudulent behaviour and then to use data mining to identify similar occurrences. Data mining tools are valuable as they learn the patterns of fraud and enable its identification and prevention (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud, 2001).
2.3.3 Going concern prediction
Koh & Low (2004) researched this field and found that several researchers had developed prediction models for making going concern predictions of companies. The suggested models are based primarily on statistical methods. Koh & Low (2004) listed the following examples: Altman (1982), Dopuch et al. (1987) and Koh (1991). This area of data mining also includes bankruptcy prediction. Several studies listed by Koh & Low (2004) have dealt with prediction models in the going concern context. These include models derived from statistical methods such as multiple discriminant analysis, logit and probit analyses and neural networks. Altman (1968), Sung, Chang & Lee (1999), Beynon et al (2001) and Koh & Low (2004) noted that discriminant analysis is the most widely used technique for going concern and bankruptcy prediction.
2.4 The future direction of data mining and its techniques and the possible uses of this
The literature review found mainly material relating to other sectors and to techniques and uses. Only one source was found describing possible future uses of data mining or future techniques. It is possible that the bureaus have some ideas as to their future use of data mining, new techniques and the possible uses of these.
The only source describing the possible direction of data mining was Mitchell (1999), who speculated that the accuracy of predictions from data mining may be improved by inventing more appropriate sets of features for describing the available data, provided the dataset was large enough. It is suggested that this could lead to increased accuracy in many prediction problems like customer attrition and credit repayments. More universities are also offering data mining as a subject as there is a lack of skills in this area.
Research into the area of data mining could lead to more useful data visualization tools, ways of supporting mixed initiative human-machine data exploration and more efficient data warehousing and legacy data combinations (Mitchell, 1999).
Mitchell (1999:36) and Fayyad, Haussler & Stolorz (1996) further speculated that "progress in data mining over the next decade was driven by three mutually reinforcing trends:
• Development of new machine learning algorithms that learn more accurately, utilize data from dramatically more diverse data sources available over the Internet and intranets, and incorporate more human input as they work,
• Integration of these algorithms into standard database management systems,
• An increasing awareness of data mining technology within many organizations and an attendant increase in efforts to capture, warehouse, and utilize historical data to support evidence-based decision making."
Chapter 3: Research questions

The literature review for this research is in most respects quite comprehensive; however, data mining in South Africa, and particularly in the credit and data bureau sector, is a relatively new field, and although there is agreement amongst the authors of the respective works in most fields, there are some areas of discrepancy. Most authors agree on the techniques used and the uses of these techniques, but there is little literature on the uses of data mining in this sector, and more specifically in South Africa. As a result of the literature review the following questions arise:

3.1 What are common data mining techniques and algorithms?

A review of the literature produced the following list of techniques used in data mining, which could be used in the credit and data bureau sector:

Pure statistics (Lee & Siau, 2001)
• Basic statistics (Forcht & Cochran, 1999; Beynon et al, 2001; Koh & Low, 2004)
  - Probability distributions
  - Inference
  - Estimation
  - Hypothesis testing
  - Regression
  - Discriminant analysis
• Clustering (Emory & Cooper, 1991; Vickery, 1997; Forcht & Cochran, 1999; Gargano & Raggad, 1999; Chidley, 2002)
• Classification (Lee & Siau, 2001)
• Link analysis (Lee & Siau, 2001)
• Association rules (Caudill & Butler, 1990; Lee & Siau, 2001), and associative memories (Gargano & Raggad, 1999)
Artificial intelligence methods (Lee & Siau, 2001; Koh & Low, 2004)
• Neural networks (Gargano & Raggad, 1999)
  - Backpropagation (Gargano & Raggad, 1999)
• Expert systems (Jackson, 1990; Gargano & Raggad, 1999)
• Fuzzy logic (Gargano & Raggad, 1999; Zwick, 2004)
Genetic algorithms (Mitchell, 2005; Lee & Siau, 2001) and genetic programming (Vickery, 1997; Lee & Siau, 2001)
Decision trees (Gargano & Raggad, 1999; Chidley, 2002; Koh & Low, 2004)
Data visualisation (Gargano & Raggad, 1999; Lee & Siau, 2001)
Rule induction methods (Gargano & Raggad, 1999; Beynon et al, 2001)
Data warehousing (Kimball et al, 1998; Forcht & Cochran, 1999; Gargano & Raggad, 1999)
3.2 What are the uses of these techniques?

A review of the literature gave the following uses of the different data mining techniques that could be applicable to this sector. It is possible that this is where the value of data mining lies for the bureaus and their users. Mitchell (1999) also believed that techniques in one sector may influence the techniques used in other sectors, and thus that data mining is valuable to itself in that new techniques are developed in one sector because of influences from another. The research will attempt to determine if this is the case in the credit and data bureau sector as well. Other uses found were:

• Targeting / Predictive / Descriptive models (Parr Rud, 2001)
  - Customer profiling and segmentation (Vickery, 1997; Cabena et al, 1998; Gargano & Raggad, 1999; Lee & Siau, 2001; Parr Rud, 2001; Geist, 2002; Nemati & Barko, 2003).
  - Database marketing (Cabena et al, 1998; Forcht & Cochran, 1999; Parr Rud, 2001; Apte et al, 2002).
  - Customer attrition prediction (Cabena et al, 1998; Nemati & Barko, 2003).
  - Credit scoring / Risk modelling (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud, 2001; Apte et al, 2002; Geist, 2002; Nemati & Barko, 2003).
  - Customer value analysis (Cabena et al, 1998; Parr Rud, 2001).
  These techniques enable:
  - An understanding of the customer and thus good customer relationship management.
  - Marketing to the correct customer, who may be directly targeted with the correct offer, saving time, money and effort and enabling a focused approach to marketing that promises much better results.
  - The attraction of new clients, the retention of profitable clients or the avoidance of high-risk clients.
  - Increases in the response rates of direct mailing campaigns, where even small margins of only 1-2% can have large impacts on ROI.
  - Savings in attracting new clients, or in spending on attracting customers who depart before their lifetime value has justified the expense of attracting them in the first place.
  - Credit scoring of clients to ensure that a line of credit extended is not so large that it forces a client into a position of overextension where they cannot or will not repay. This has a knock-on effect in savings of time, effort and expenditure by preventing unnecessary collections and administration.
• Fraud prediction and identification (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud, 2001).
• Going concern prediction (Altman, 1968; Sung, Chang & Lee, 1999; Beynon et al, 2001; Koh & Low, 2004).
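The claim in the list above that a 1-2% lift in direct mailing response rates can have a large impact on ROI can be illustrated with a small worked calculation. This is a minimal sketch under assumed figures: the function name `campaign_roi`, the campaign size, the cost per piece and the profit per response are all hypothetical and are not taken from the literature reviewed.

```python
def campaign_roi(mailed, cost_per_piece, response_rate, profit_per_response):
    """Return (net_profit, roi) for a direct mailing campaign.

    roi is net profit divided by campaign cost. All inputs are
    illustrative assumptions, not figures from the research.
    """
    cost = mailed * cost_per_piece
    gross = mailed * response_rate * profit_per_response
    net_profit = gross - cost
    return net_profit, net_profit / cost


# Hypothetical campaign: 100 000 pieces at 5.00 each,
# 200.00 profit per responding customer.
base_profit, base_roi = campaign_roi(100_000, 5.0, 0.03, 200.0)
lift_profit, lift_roi = campaign_roi(100_000, 5.0, 0.04, 200.0)

print(f"3% response: net profit {base_profit:,.0f}, ROI {base_roi:.0%}")
print(f"4% response: net profit {lift_profit:,.0f}, ROI {lift_roi:.0%}")
```

Under these assumed figures the campaign costs 500 000, so moving the response rate from 3% to 4% raises net profit from 100 000 to 300 000 and ROI from 20% to 60%. The cost of mailing is fixed, which is why a small change in response rate is magnified in the return.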
3.3 What is the future direction of data mining and its techniques in this sector and the possible uses thereof?

As there was only one source for a possible answer to this question, it is left quite open-ended. Some possibilities are:
• New and more accurate means of prediction may be found using more appropriate sets of features for describing the available data, provided the dataset is large enough,
• Increased accuracy in many prediction problems like customer attrition and credit repayments,
• More useful data visualization tools, ways of supporting mixed-initiative human-machine data exploration and more efficient data warehousing and legacy data combinations,
• More efforts to train people in data mining as the skills are not common (Mitchell, 1999).
Chapter 4: Research methodology

4.1 Qualitative Research Paradigm

The aim of this research is to identify and evaluate data mining techniques in the credit and data bureau sector and to expand on the body of knowledge available to managers in this sector, and to users of these data and techniques as clients of this sector. The research paradigm is qualitative in nature.

Qualitative techniques are intended more to determine 'what' things are than to determine the quantity of those things. These techniques are not concerned with measurement and are thus less structured than quantitative techniques; they can therefore be made more responsive to the needs of the respondents and to the nature of the subject being researched. Typically qualitative techniques yield large volumes of very rich and descriptive data from a limited number of individuals in a particular field (Walker, 1985). The intent of qualitative research is to answer questions about the complex nature of phenomena, often with the purpose of describing and understanding the phenomena from the participants' point of view (Leedy & Ormrod, 2001:101).

Based on the characteristics of a qualitative paradigm given by Walker (1985) and Leedy & Ormrod (2001), this approach is proposed for the following reasons:
• There is insufficient theory on the particular sector,
• The purpose of the research is to describe and explore,
• The research is not concerned with measurement,
• The variables are unknown,
• The research is context bound and encompasses personal views,
• The sample size is small,
• In-depth semi-structured interviews are to be used to collect data,
• The data gathered are explicitly interpretive, creative and personal.

Added to the assumptions made in Chapter 1 (1.5) of this document are particular assumptions that are part of qualitative research.
These were proposed by Creswell (1994) and Marshall & Rossman (1989) and must also be considered:
• The participant's perspective on the social phenomenon of interest should unfold as the participant views it, not as the researcher views it (Marshall & Rossman, 1989:82),
• The researcher interacts with what they are researching,
• The role of values is value-laden and biased (Creswell, 1994:5),
• Respondents in research see reality subjectively and in multiple ways,
• The language of the research is informal, with evolving decisions, a personal voice and accepted qualitative words.
4.2 Descriptive research design

The qualitative design was in the form of a content analysis. This was described by Walker (1985) and Leedy & Ormrod (2001) as a technique that identifies patterns, themes or biases in data on communication; the examination of these data allows the researcher to determine whether a hypothesis is supported or not. In this research the content analysis was done on the transcripts of the interviews between the researcher and the respondents.

For this research in-depth semi-structured interviews were used as the method of data collection. The interviews were based on a number of open-ended questions (Leedy & Ormrod, 2001). In-depth interviewing is ideal for this kind of research and has been described as a conversation with a purpose (Marshall & Rossman, 1989:82). Such interviews are typically more like conversations than formally structured interviews; this assists in uncovering the respondent's meaning and perspective while at the same time respecting the way in which the respondent frames and structures the responses (Marshall & Rossman, 1989).

Advantages of using in-depth semi-structured interviews for data collection include (Marshall & Rossman, 1989; Pirow, 1990; Creswell, 1994; Leedy & Ormrod, 2001):
• Interviews are a useful means of quickly obtaining large amounts of data.
• Respondents can provide historical background information.
• Interviews allow for the gathering of a wide variety of information and a large number of different subjects.
• Immediate follow-up questions and clarification of points can be done.
• The researcher has control over both the questions asked and the environment.
• It is flexible and enables the researcher to prompt and probe as necessary.
• It enables the researcher to take cognisance of non-verbal behaviour.
• The researcher can alter the order of questions and ensure that all the questions are answered.
Despite its many advantages the researcher is aware that skill and care are required in using this method of collecting data. There are also some disadvantages associated with this method of data collection and the researcher took care to be aware of these when conducting the research. Marshall & Rossman (1989) and Creswell (1994) listed the following:
• Information provided by the respondent is colored by their own perspective,
• The interviewer must obtain the cooperation of the interviewee,
• Respondents may not be willing to share some (possibly sensitive) information,
• Respondents may not all be of the same level of articulation or perception,
• The researcher may not be able to ask the correct type of questions because of a lack of technical expertise on the side of the researcher.

The researcher attempted to mitigate some of these disadvantages by:
• Continuously confirming with the respondent the intended meaning of their response,
• Not intentionally leading the respondent and avoiding colloquialisms and ambiguous words.

4.3 Population and Sample

The population in this research can be considered to be all the data miners, data managers, analysts, practitioners, facilitators and vendors for and from all the credit and data bureaus in the country, to the extent that they are subject matter experts on data mining. The sample drawn contained the managers of the data mining or business intelligence departments, analysts, directors and/or practitioners in these fields in these bureaus, and their vendors located within South Africa. The nineteen respondents can be considered to form 100% of the population.

The respondents were not selected in a random fashion; the researcher at all times attempted to ensure that they were experienced and knowledgeable enough in the area of study (Creswell, 1994) and attempted to be objective in the selection of the respondents (Walker, 1985). The sample design is thus purposive (Walker, 1985:30). The small number of data and credit bureaus in South Africa limited the sample size. The sample was drawn from the bureaus and their vendors directly, specifically from the ranks of the data mining, business intelligence and managerial areas.

The selection of experts in the field used the following criteria to ensure that each respondent was able to comment, from an informed position, on the techniques, uses and trends in data mining in the credit and data bureau sector. The opinions expressed during the interviews should be based on a sound knowledge of this sector and of data mining.
The criteria were:
• The expert is to be involved in data mining, having implemented, or had management oversight of, a data mining project in South Africa;
• The expert should occupy a senior or management position in the organisation;
• The expert should have experience in the products and uses of data mining in the sector;
• The expert should have at least three years' experience in the field;
• The expert should be available for a one-hour interview;
• The organisation the expert represents should not have an objection to the expert partaking in the research.

In total, nineteen interviews were conducted during the entire research process. Every major credit bureau, all of the minor credit bureaus except one, both the data bureaus and every vendor that engaged with the bureaus on data mining had at least one person who met the criteria to qualify as an expert to be interviewed in this field. One of the vendors interviewed had extensive experience in data mining, but none with the South African credit bureaus.

The researcher approached respondents from the institutions listed in Table 1 and received their institutions' willingness to participate in the research:
Institution | Sector | Name | Designation
Experian | Credit Bureau | Craig Capper | Director: Product Development, Business Intelligence and Marketing
Experian | Credit Bureau | Alan Broderick | BI Manager
Experian | Credit Bureau | Marlize Buys | Scoring Analyst
Experian | Credit Bureau | Gerhard Bos | Scoring Analyst
Micro Lenders Credit Bureau (MLCB) | Credit Bureau | Fred Steffers | Director
Kredit Inform (KI) | Credit Bureau | Mike Hussey | Manager
Effective Intelligence | Data Bureau | Julian Ardagh | Managing Director
Effective Intelligence | Data Bureau | Gerhard Botha | IT Systems Manager
P-Cubed | Vendor | Raoul Miller | Managing Director
ETL | Data Bureau | Andy Quinan | Managing Director
CompuScan | Credit Bureau | Jaco Alberts | Director
Raptor | Vendor | Mark Heyman | Analyst
Raptor | Vendor | Riaan Berg | Analyst
SAS | Vendor | Stacey Caddick | Account Manager
TransUnion ITC | Credit Bureau | John Fourie | Director - Analytics and consulting
TransUnion ITC | Credit Bureau | Leslie Harrison | Business Consultant
TransUnion ITC | Credit Bureau | Emmerenthia van Greining | Statistical Analyst
TransUnion ITC | Credit Bureau | Warrick Thomas | Data Warehouse Architect
TransUnion ITC Decision Support Services (TUDSS) | Credit Bureau | Thamir Hassan | Managing Director

Table 1: Organisations that agreed to partake in the research.

4.4 Data collection

The institutions were contacted formally in writing, detailing the nature, purpose and methodology of the research and requesting their formal approval of their participation. The respondent nominated by each institution was contacted initially by telephone to invite them to participate in the research and to inform them of the purpose of the research, subjects to be covered and the research process and methodology, including the expected duration of the interview.
A formal written communication by e-mail was sent thanking the respondent for being willing to participate in the study and confirming the place, date and time of the interview. Each respondent was offered a copy of the research report as an incentive for participating in the study. Respondents were guaranteed that their responses would be confidential and remain anonymous (refer to Appendix 1 & 2 for copies of the written request and telephone protocol).

The interviews were in-depth and of a semi-structured nature and took place at a site convenient to the respondent. As the researcher knows many of the respondents personally, the locations for the interviews tended to be informal; this was aimed at putting the respondents at ease and enabled them to discuss the research questions more easily with the researcher. Each interview was audiotaped with the permission of the respondent. Notes were also taken as the interview progressed.

Creswell (1994:152) suggested the following protocol and the researcher attempted to follow this for each interview (refer to Appendix 3 for a copy of the Interview Protocol). The components of the protocol are as follows:
• (a) a heading,
• (b) instructions to the interviewer (opening statements),
• (c) the key research questions to ask,
• (d) probes to follow key questions,
• (e) transition messages for the interviewer,
• (f) space for recording the interviewer's comments, and
• (g) space in which the researcher records reflective notes.

Care was taken not to lead respondents in their responses during the course of the interview.

4.5 Data analysis

Unlike quantitative research, where the process is linear, here data analysis took place at the same time as the collection and interpretation of the data and the writing of the report (Creswell, 1994). The following procedures were deployed in analysing the data (Walker, 1985; Creswell, 1994; Leedy & Ormrod, 2001):
1. The taped interviews were transcribed,
2. The notes made during the interview were reviewed immediately after the interview and additional comments and thoughts added,
3. The data were organized into categories, coded and interpreted through the use of schemas,
4. The data were integrated and synthesized. This was represented in the form of matrices.
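Steps 3 and 4 above (coding the data into categories and representing the result as a matrix) can be sketched in a few lines. This is only an illustration under stated assumptions: the respondent labels, the category codes and the helper names `build_matrix` and `respondent_frequency` are hypothetical, and counting each category once per respondent is an assumed convention rather than a documented part of the method.

```python
from collections import Counter

# Hypothetical coded transcripts: respondent label -> category codes
# assigned while reading that respondent's transcript (illustrative only).
coded = {
    "R1": ["regression", "segmentation", "regression"],
    "R2": ["segmentation", "visualisation"],
    "R3": ["regression"],
}


def build_matrix(coded):
    """Return (categories, matrix); matrix[r][c] is 1 if respondent r
    mentioned category c at least once, otherwise 0."""
    categories = sorted({c for codes in coded.values() for c in codes})
    matrix = {
        r: {c: int(c in codes) for c in categories}
        for r, codes in coded.items()
    }
    return categories, matrix


def respondent_frequency(coded):
    """Count how many respondents mentioned each category
    (repeat mentions by the same respondent count once)."""
    return Counter(c for codes in coded.values() for c in set(codes))


categories, matrix = build_matrix(coded)
freq = respondent_frequency(coded)

# Print the respondent-by-category matrix, then the per-category totals.
print("respondent", *categories)
for r in sorted(matrix):
    print(r, *(matrix[r][c] for c in categories))
print(dict(freq))
```

A respondent-by-category matrix of this kind makes it easy to see which themes recur across interviews, and the column totals give a simple count of how many respondents raised each one.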
In addition, the frequency of each identifiable factor uncovered in the transcripts was tabulated. This informs the researcher as to the perceived importance of the identifiable factor across the respondents. No statistical analysis was performed on these results.

4.6 Validity and reliability

The validity of research is determined by the internal and external validity of the research. Internal validity is the extent to which its design and the data that it yields allow the researcher to draw accurate conclusions about cause-and-effect and other relationships within the data (Leedy & Ormrod, 2001:103), and external validity is the extent to which its results apply to situations beyond the study itself (Leedy & Ormrod, 2001:105).

4.6.1 Internal validity

The importance of internal validity is in attempting to find other possible explanations for the results obtained in the research (Leedy & Ormrod, 2001). The internal validity of this research was checked by asking the respondents whether they agreed with the accuracy, objectivity and reliability of the conclusions made by the researcher. Each respondent was given a copy of the findings and requested to add any comments.

4.6.2 External validity

The intent of qualitative research is not to be able to infer the findings onto the population, but to attempt to interpret the event from a unique perspective (Creswell, 1994). The validity criteria used in this research are that it is well argued and believable, and that the purposive sample should reflect the views of the general population.

4.6.3 Reliability

As it is unlikely that similar research conducted in a different context would reach different conclusions in the same industry, but could reach different conclusions in a different industry, the reliability of the research is limited.
Marshall & Rossman (1989:148) suggested that the researcher purposefully avoids controlling the research conditions and concentrates on recording the complexity of situational contexts and interrelations as they occur. It is unlikely that future researchers will replicate the research by altering research strategies and it is discouraged (Marshall & Rossman, 1998).

4.7 Completion of the research report

The research report was then written, identifying the dominant themes in this sector and commenting on the applicability of the different algorithms and techniques and their various uses in this sector.

The interview transcripts were summarized and each use assigned to two categories. The methodology followed here was that used by Chidley in 2002.

The first use category was based upon the terms used by the respondents during the interviews. The information to determine the first category of uses was based on the terms used by respondents in describing the specific data mining projects they had worked on and/or the specific uses they assigned to or equated with each data mining technique or algorithm.

The second categorization was done by using the generic data mining uses taken from the literature. The aim of the specific project and use referred to by the respondent was compared to the generic use category and, if there was a match, the project or stated value and use was assigned to that category. Sometimes the process followed in the actual data mining was analysed and a category assigned to the project or technique used.

The interview data, processed in this way, were used as the basis for the results and the interpretation of the results for this research report.
Chapter 5: Results

5.1 Common data mining techniques and algorithms

In his work on data mining in the banking sector, Chidley (2002) proposed a metric based on his findings when doing his literature review. This same metric was compared to what was found when doing the literature review for this research report, and the categorization was virtually identical. The common techniques and algorithms identified in section 2.2 were compared to Chidley's findings and distilled into a single model showing how each technique relates to the others. This new model is shown in Figure 2:
Pre-preparation
Pure Statistics: Mean; Standard deviation; Graphical representation; Probability distributions; Inference; Estimation; Hypothesis testing; Regression; Discriminant analysis
Data mining techniques
Statistical / Artificial Intelligence
Interdependence / Dependence
Clustering: Conceptual clustering; Interactive clustering; K-nearest neighbor; Memory based reasoning
Classification: Discriminant analysis; Logistic regression
Neural networks: Network construction and training; Network pruning; Rule extraction
Classification: Rule induction
Decision Trees: CHAID; CART
Backpropagation
Decision Trees: CART; MARS
General additive models
Expert systems
General additive models
Fuzzy expert systems
Link analysis
Associative rules
Associative memories
Visualisation: Scatterplot matrices; Prospecting matrices; Parallel coordinates; Projection matrices; Geometric projection techniques
Data Warehousing: Extraction, Transformation, Loading (ETL); Data marts; OLAP; MIS

Figure 2: Data mining techniques
The techniques and algorithms found were categorised to enable the manager to understand, easily and at a single glance, the techniques and algorithms used and to match these to the possible uses of these techniques as described in this report.

Interviews were conducted with all the members of the expert panel with a view to establishing the techniques used in data mining in the credit and data bureau sector. It was clear from the interviews that there were numerous techniques referred to by the members of the panel, and invariably the same terminology was used to describe the different techniques. There were thirty-five techniques mentioned during the interviews and these are listed in Table 2:
Table 2: Data mining techniques used in this sector

No. | Techniques | Total Occurrences
1 | Mathematical Algorithms / Basic Statistics e.g. Means, Std. Dev, Avg. | 15
2 | Regression | 15
3 | Segmentation | 14
4 | Artificial Intelligence | 11
5 | Profiling | 10
6 | Decision Trees | 7
7 | Visualisation | 7
8 | Data Warehousing | 6
9 | Predictive Modeling | 5
10 | Seasonality | 3
11 | Cluster Analysis | 3
12 | Chi-square | 3
13 | GIS | 2
14 | Classification | 2
15 | CHAID | 2
16 | Time series analysis | 2
17 | Probability Analysis | 1
18 | Trending | 1
19 | Delphi-Technique | 1
20 | Response Modeling | 1
21 | What-If Analysis | 1
22 | Iterative Searching | 1
23 | Pareto Analysis | 1
24 | Hypothesis Testing | 1
25 | Bio-statistics | 1
26 | Expert System | 1
27 | Contingency Tables | 1
28 | Correlation Analysis | 1
29 | Genetic Algorithms | 1
30 | Rule Based Analysis | 1
31 | Multivariate Analysis | 1
32 | Covariate Analysis | 1
33 | Co-joint Analysis | 1
34 | Trend Analysis | 1
35 | Pattern Recognition | 1
Totals | | 126
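These thirty-five techniques were classified into the following categories:

Table 3: Summary of data mining techniques in this sector

Category | Technique | # of occurrences in interviews
Descriptive stats | Basic statistics e.g. mean, std. dev. etc. | 15
Descriptive stats | Visualisation | 7
Descriptive stats | Seasonality | 3
Descriptive stats | Time series analysis | 2
Descriptive stats | Trending | 1
Descriptive stats | Pareto analysis | 1
Descriptive stats | Hypothesis testing | 1
Descriptive stats | Trend analysis | 1
Descriptive stats | Iterative searching | 1
Descriptive stats | Pattern recognition | 1
Inferential stats | Regression | 15
Inferential stats | Decision trees | 7
Inferential stats | Chi-square | 3
Inferential stats | CHAID | 2
Inferential stats | Probability analysis | 1
Inferential stats | Delphi-technique | 1
Inferential stats | Response modeling | 1
Inferential stats | Multivariate analysis | 1
Inferential stats | Covariate analysis | 1
Inferential stats | Co-joint analysis | 1
Inferential stats | Contingency tables | 1
Inferential stats | What-If analysis | 1
Data reduction | Segmentation | 14
Data reduction | Profiling | 10
Data reduction | Cluster analysis | 3
Data reduction | Classification | 2
Data reduction | Correlation analysis | 1
Data reduction | Rule based analysis | 1
Data reduction | Predictive modeling | 5
Numerical techniques | Artificial intelligence | 11
Numerical techniques | Bio-statistics | 1
Numerical techniques | Expert system | 1
Numerical techniques | What-If analysis | 1
Numerical techniques | Genetic algorithms | 1
Other | Data warehousing | 6
Other | GIS | 2
Other | Response modeling | 1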
5.1.1 Descriptive statistics

Every single bureau had a respondent speak of using simple mathematical algorithms, e.g. means, standard deviations, averages and so on. Fifteen of the nineteen respondents indicated that, because of the large volumes of data they dealt with, the more basic mathematical algorithms and statistical techniques were invaluable in:
• determining which parts of data sets could and/or should be mined,
• achieving a better understanding of what was contained in the datasets,
• getting a visual feel of the data,
• standardizing different data sets,
• matching different data sets,
• excluding bad / corrupt data,
• improving the quality of data.

Of the nineteen people interviewed, seven indicated that they also made use of visualisation techniques to better understand their data sets, to better understand the results of their data mining exercises and also to highlight any discrepancies in their analysis.

Further mention was made of the other techniques listed in this category in Table 3, but mostly by single individuals. Interestingly, only one person made use of the term hypothesis testing, although it was obvious from the interviews with virtually every single person that all the data mining was using some form of hypothesis testing, in that they were hypothesizing as to the outcome of particular tests.

5.1.2 Inferential statistics

Of the eight bureaus, seven mentioned that they used regression in one form or another, whether it was logistic regression, linear regression, bivariate, multiple or stepwise regression. Fifteen of the nineteen respondents indicated that regression analysis played a large role in the data mining done by the bureaus.

Decision trees were mentioned by seven of the nineteen respondents, but used by only the three bigger credit bureaus and both the data bureaus.
Of this series of techniques, Chi-square and CHAID were mentioned by one of the larger consumer credit bureaus and one of the data bureaus as techniques specifically used because they were good techniques for shorter time continuums and were excellent for explaining response models, an area that all of these bureaus were moving into more and more.

5.1.3 Data reduction techniques

This category of techniques was well represented amongst all the bureaus, as they all used segmentation or classification, typically segmenting databases of customers into groups that are as homogeneous as possible with respect to a variable such as creditworthiness. Fourteen of the nineteen respondents listed this as an important part of data mining in this sector, and this was also the second most referred to technique.
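The idea of segmenting customers into groups that are as homogeneous as possible with respect to a single variable can be sketched as follows. This is a minimal illustration only: it uses a basic one-dimensional k-means loop, which is a swapped-in technique chosen for brevity and not a method the respondents are reported to use, and the function name `segment_scores` and the score values are hypothetical.

```python
def segment_scores(scores, k, iterations=20):
    """Group one-dimensional scores into k segments with a basic
    k-means loop. Returns a list of k lists, ordered from the lowest
    scoring segment to the highest."""
    ordered = sorted(scores)
    n = len(ordered)
    # Deterministic start: spread the initial centres across the sorted scores.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign every score to its nearest centre.
        groups = [[] for _ in range(k)]
        for s in ordered:
            nearest = min(range(k), key=lambda i: abs(s - centres[i]))
            groups[nearest].append(s)
        # Move each centre to the mean of its group (keep it if the group is empty).
        new_centres = [
            sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return groups


# Hypothetical creditworthiness scores for nine customers.
scores = [310, 325, 340, 580, 600, 615, 820, 835, 850]
for number, segment in enumerate(segment_scores(scores, 3), start=1):
    print(f"segment {number}: {segment}")
```

Each resulting segment contains customers whose scores lie close together, which is the sense in which the groups are homogeneous. In practice a bureau would segment on many variables at once and would choose the number of segments to suit the intended use.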
All of the bureaus also referred to profiling. Although the specific term was not found in the literature review, the techniques described by the bureaus match those described in the literature as classification. Some respondents also referred specifically to classification and cluster analysis when describing these techniques.

5.1.4 Numerical techniques

Only one of the credit bureaus was using these techniques, in conjunction with an external vendor who was also interviewed. The techniques used included Artificial Intelligence, neural networks and, to a lesser degree, bio-statistics.

5.1.5 Other techniques

The other techniques mentioned here were data warehousing and Geographical Information Systems (GIS). As was found in the literature, the larger credit bureaus and both the data bureaus were using data warehousing to organise large volumes of historical information gathered from large-scale client/server-based applications for further analysis.

Only one of the data bureaus mentioned using GIS techniques to mine its data, but it was unwilling to provide further information as this was considered too sensitive at the time.

5.2 The uses of data mining in the credit and data bureau sector

During the interviews, the members of the expert panel mentioned several uses of data mining. Some of these uses were described in different ways, but were clearly the same thing, and these uses were categorised by the members in nearly identical fashion. Given the small size of this sector in South Africa, this is hardly surprising. In total eighteen uses of data mining were discovered during the interviews. Each use will be discussed in the following sections. Table 4 lists the uses found during the interviews.
Table 4: Uses of data mining in this sector

No. | Uses of data mining | Total Occurrences
1 | Direct marketing, including identifying marketing opportunities | 18
2 | Risk modeling, including predicting probability of liquidation or default and decreasing default rates | 18
3 | Credit Scoring | 10
4 | Prediction / Prevention of fraud, including forensic investigations | 6
5 | Behavioral Scoring | 5
6 | Response models | 5
7 | Macro + Micro Economic trends | 5
8 | Predicting income | 4
9 | Predictive modeling | 4
10 | Determining the risk of debtors books | 3
11 | Industry comparisons | 3
12 | Determining credit limits | 3
13 | Predicting probability of collection | 2
14 | Prediction of attrition | 2
15 | Predicting client exposure / affordability | 1
16 | Predicting lapses in policies | 1
17 | Competitive analysis | 1
18 | Tracing | 1

The table shows that the dominant uses of data mining in the credit and data bureau sector in South Africa are direct marketing, risk modeling and credit scoring. Every single bureau is doing some form of risk modeling and in some way assisting its clients with direct marketing, whether via cleansing of data or the creation of mailing lists or telephone lists. While the main use of the techniques described above for the credit bureaus is still risk modeling and assisting their clients in preventing or predicting default, bad debt or liquidation, the assistance with direct marketing now features as much. Eighteen of the nineteen respondents mentioned these uses of the techniques above.

Specific mention was made of the identification of marketing opportunities, the creation of strategies based on existing market segmentation, the growing role of behavioral scoring (mentioned five times at four of the bureaus) and of the profiling and prediction abilities of data mining. The predictive abilities of data mining for use in direct marketing were mentioned in different forms on fourteen separate occasions during the interview process.
These included predicting income, attrition, client affordability and/or exposure, and the probability of accepting an offer.
The third most common use, credit scoring, was mentioned by only ten of the nineteen respondents. It was not in use at the data bureaus and was used by only seven of the credit bureaus, rather than by all of them as would be expected.
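The proportions behind these findings can be made explicit with a short calculation. The sketch below is illustrative only: it uses the counts reported for the three dominant uses and the total of nineteen respondents, and the names `mentions` and `share_of_respondents` are introduced here for the example rather than taken from the research.

```python
# Counts for the three dominant uses, as reported in Table 4 and the text.
# The total of nineteen respondents also comes from the text.
TOTAL_RESPONDENTS = 19
mentions = {
    "Direct marketing": 18,
    "Risk modeling": 18,
    "Credit scoring": 10,
}


def share_of_respondents(count, total=TOTAL_RESPONDENTS):
    """Return the percentage of respondents who mentioned a use."""
    return 100.0 * count / total


# Percentage of respondents per use, rounded to one decimal place.
shares = {use: round(share_of_respondents(n), 1) for use, n in mentions.items()}

for use, pct in shares.items():
    print(f"{use}: {pct}% of respondents")
```

On these figures, direct marketing and risk modeling were each mentioned by roughly 95% of respondents, against just over half for credit scoring.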
  • 42. All of the uses of data mining mentioned in the literature were found at some of the bureaus, with the prevention, detection and prediction of fraud being a use at both the data bureaus and four of the eight credit bureaus.
Uses that were not found in the literature included the tracking of macro- and micro-economic trends, a new use of data mining techniques at two of the credit bureaus, one of the data bureaus and one of the vendors.
Another use that was not specifically mentioned in the research was that of using data mining techniques for the tracing of debtors. This may be because of the stringent privacy laws internationally.
5.3 Future techniques and their possible uses
No specific techniques or possible techniques were mentioned in the interviews, and all the respondents felt that data mining was still too new in their sector for them to be able to predict any possible new techniques.
Eight of the respondents thought that there should in future be some form of data standardization, whether data set standards or a single data standard for all elements.
Behavioral scoring was mentioned by six of the respondents as a definite new direction for data mining in the sector, with growing interest from all sectors of their client bases.
Artificial Intelligence techniques were mentioned as a possibility by six of the respondents who were not currently using them, but all six said that they had no experience with these techniques and only thought they might be worth looking into in the future, particularly for fraud prevention and predictive modeling for direct marketing.
  • 43. Chapter 6: Discussion
6.1 Common data mining techniques and algorithms
Before any real data mining is done, a basic understanding of the data and the data set is needed. For this, the basic statistical techniques are typically used. There are many more statistical techniques than those described in this research, but those mentioned here were found to be the ones most commonly used to gain an understanding of data and to prepare for further data mining.
The next step in the process of data mining is divided into two broad categories:
• Statistics, and
• Artificial Intelligence
The major difference between these two areas is that the field of statistics is grounded in pure mathematics and its methods are supported by rigorous mathematical proofs. Artificial Intelligence techniques are not necessarily subject to the same rigorous mathematical proofs, but instead derive primarily from machine learning.
The two areas that follow, Visualisation and Data Warehousing, both display the results of data mining. In data warehousing the data may be explored even further using Online Analytical Processing (OLAP), resulting in Management Information Systems (MIS) which may be visual in nature, whereas visualisation techniques usually assist in the interpretation of the results of data mining in the form of charts and graphs.
Another visualisation technique is Geographical Information Systems (GIS), where data is converted into spatial information and displayed graphically in the form of maps of areas, suburbs, municipal areas and the like. This is an innovative way not only to display the findings of data mining graphically, but also to make them easily and visually understandable to a large audience.
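The initial step described above, gaining a basic understanding of the data before mining it, might look something like the following minimal sketch. It uses Python's standard `statistics` module. The account balances are hypothetical values invented for illustration and do not come from any bureau.

```python
import statistics

# Hypothetical outstanding balances (in rand) for a handful of accounts.
# These values are invented for illustration only.
balances = [1200, 850, 4300, 990, 15000, 2750, 640]

# Basic descriptive statistics used to get a first feel for the data.
summary = {
    "count": len(balances),
    "mean": statistics.mean(balances),
    "median": statistics.median(balances),
    "stdev": statistics.stdev(balances),  # sample standard deviation
    "min": min(balances),
    "max": max(balances),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In this invented sample the mean is well above the median, which points to a skewed distribution driven by one very large balance. This is the kind of observation that would shape later modeling choices.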
6.1.1 Descriptive statistics
When this category is compared to what was found in the literature, it is clear that all of the bureaus are using all of the descriptive statistics techniques found when embarking on data mining, particularly in the initial stages of their projects or for smaller-scale projects. Although no specific reference was made to probability distributions or estimation techniques, it was again obvious from a large number of interviews that these techniques, although not specifically named, were being applied in some form.
6.1.2 Inferential statistics
That all the bureaus except one mentioned regression in one form or another was not surprising, as the literature review found this to be the most important of all the multivariate techniques available. The one bureau that did not mention this technique was also one of the smaller bureaus that did no data mining at all.
The respondents indicated that regression and decision trees, including Chi-square and CHAID techniques, were being used more and more, and that their value in creating credit policies, deriving the most predictive values and assessing the predictive power of regression analysis was excellent.
  • 44. This is also what was found in the review of the literature. In fact, Gargano & Raggad (1999), Chidley (2002) and Koh & Low (2004) confirmed that the strengths of these techniques lie in their ability to generate more understandable rules and to indicate the relative importance of the variables for classification and prediction.
6.1.3 Data reduction techniques
The results found in Chapter 5, with this category of techniques being well represented, match what Lee & Siau (2001) described. They gave the example of a typical classification problem as being the division of a database of customers into groups that are as homogeneous as possible with respect to a variable such as creditworthiness, which is exactly what the bureaus in this sector are doing when using classification.
The bureaus all use data mining techniques like segmentation and classification to further filter and refine their own data sets and those of their clients for particular variables, typically creditworthiness, affordability or profiling based on certain demographic characteristics.
6.1.4 Numerical techniques
When the bureaus that were not using these techniques (all except one) were asked why, the almost standard reply was that they believed the techniques were too complex and difficult and did not yield results that were worth the additional effort and expense. It was interesting to note from the literature review that these techniques are considered to be widely and successfully used in many other industries, specifically in banking and financial institutions.
6.1.5 Other techniques
Data warehousing enables the bureaus that use it to analyse very large volumes of data quickly and effectively, so that they can build further models using other techniques like regression analysis, for example behavioral scoring models.
All the bureaus mentioned the importance of this technique, but also listed the prohibitive cost and the very low rate of successful implementation of data warehousing projects in all sectors as obstacles to implementing data warehouses in their companies.
Only one of the data bureaus mentioned using GIS techniques to mine its data, but it was unwilling to provide further information as this was considered too sensitive at the time. It will be interesting to see how this technique is used and whether the other bureaus follow suit should it prove to be a successful tool for that bureau.
Looking at the findings above, it is clear that the common data mining techniques in this sector closely match those found in the literature review, so it can be accepted that the sub-problem "What are the common data mining techniques and algorithms?" has been answered by this research.
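To give a sense of how the Chi-square based techniques discussed under inferential statistics indicate the relative importance of variables, the sketch below computes the Pearson Chi-square statistic for two candidate predictors against a good/bad credit outcome. It is a simplified illustration and not a full CHAID implementation, which would also merge categories and compare adjusted p-values. The predictor names and all counts are hypothetical.

```python
def chi_square(table):
    """Pearson Chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the predictor and the outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts. Rows are the two categories of a predictor (yes, no);
# columns are the outcome (good payer, bad payer).
candidates = {
    "home_owner": [[80, 20], [60, 40]],
    "has_landline": [[70, 30], [70, 30]],
}

scores = {name: chi_square(table) for name, table in candidates.items()}

# Both tables are 2x2, so they have the same degrees of freedom and the
# larger statistic indicates the stronger association with the outcome.
best = max(scores, key=scores.get)
print(scores, best)
```

In this invented example, home ownership separates good and bad payers while the landline variable does not, so a Chi-square based tree would split on home ownership first.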
  • 45. 6.2 The uses of these techniques
The dominant uses of data mining in the credit and data bureau sector in South Africa are direct marketing, risk modeling and credit scoring. Every single bureau is doing some form of risk modeling and is in some way assisting its clients with direct marketing, whether through cleansing of data or the creation of mailing lists or telephone lists.
The bureaus and their clients are all well aware that the bureaus hold the largest pool of consumer information in the country, and that this is a good opportunity for the bureaus to assist their clients in their direct marketing efforts. The respondents reported a growing demand for these services and for the use of data mining techniques to deliver them, and pointed to the value this adds for their clients.
This matches the finding in the literature review that data mining is a powerful tool for increasing response rates and is ultimately of immense value to the organisation in direct marketing, risk modelling and credit scoring.
Several of the respondents mentioned a growing trend amongst their customers to build their own scoring models, with clients relying on the bureaus less for the actual models and more for specific data with which to build them. The bureaus were entering a new era in which they were being used more in a consulting role and as suppliers of data for clients constructing their own scorecards, where in the past they had built these models for their clients.
Another use that was not specifically mentioned in the research was that of using data mining techniques for the tracing of debtors. This may be because of the stringent privacy laws internationally.
Respondents mentioned several times the possible impact of the new Credit Bill on the way they conduct their business and on the way they will use data mining techniques in future.
Although not all of the uses mentioned in the literature were found at every single bureau, or even very widely in some cases, some uses were found at the bureaus that were not found in the literature review. Overall, a good description is given of the uses of these techniques, thus addressing the research problem of what the uses of these techniques are.
6.3 Future techniques and their possible uses
Every single respondent raised the new Credit Bill and the possible changes to legislation as a factor that would influence any future data mining techniques and possible uses in this sector. So far the quality standards for data in the sector have been self-regulating, but the proposed bill would hold the directors of companies, and in particular of the bureaus, personally liable for any errors, and this was of grave concern to them.
All the bureaus believe that data standardization or a set of data standards will have an enormous impact on the future of data mining in this sector and on the techniques used, as one standard would make it simpler to combine data sets from different sources and ultimately expedite the process of mining these data sets.
  • 46. There was a general feeling amongst all of the respondents that data mining and data management were becoming more and more important, not only in their own organisations but also in those of their clients. It was felt that the impact of data mining would continue to increase and would enjoy more and more focus in the future.
Five of the bureaus also had plans either to improve the skills of their own staff in the statistical areas of data mining or to expand their data mining areas with more statistically skilled individuals with greater experience than the current individuals.
This question was the least satisfactorily answered of all the questions, as data mining is still a relatively new field in this sector, as in most sectors in South Africa, and the respondents are not too sure what the future direction may hold. Nevertheless, all of the respondents had some specific thoughts on the subject, thus addressing the problem of what possible future techniques may be.