
AN EVALUATION OF DATA MINING TECHNIQUES IN
THE CREDIT AND DATA BUREAU SECTOR
Gideon Stephanus du Toit
A research report submitted to the Faculty of Commerce, Law and Management, University
of the Witwatersrand, Johannesburg, in partial fulfillment of the requirements of the degree
of Master of Business Administration.
Johannesburg, March 2006

ABSTRACT
This research investigated the scope of data mining, common data mining techniques and
algorithms and their uses, and the possible future direction of data mining techniques
together with their potential value and uses.
A synthesis of the literature review gave a definition, scope, techniques and uses of data
mining. A panel of experts was constituted to discover the uses, techniques, benefits and
possible future benefits of the techniques in this sector.
Thirty-five different techniques for data mining were found and these were classified into
four different sections.
Eighteen separate applications of data mining in this sector were uncovered.
The research demonstrated that data mining is used in this sector, although many techniques
were not yet being used, especially amongst the smaller bureaus. The possible future
benefits of data mining would, however, lead to the greater use of more techniques.
Research for MBA - Gideon S. du Toit - MBA P/T 2003/6 - Page ii of vii

DECLARATION
I declare that this report is my own, unaided work. It is submitted in partial fulfillment of the
requirements for the degree of Master of Business Administration at the University of the
Witwatersrand, Johannesburg. It has not been submitted for any degree or examination in
any other university.
Gideon Stephanus du Toit
April 2006

ACKNOWLEDGEMENTS
The assistance provided by a number of people in completing this research is greatly
appreciated.
Thanks to my wife, Christelle du Toit, for her unwavering support, love and assistance.
Thanks to my supervisor, Professor Neil Duffy, who provided his support and encouragement
willingly and freely.
Thanks also to the members of the expert panel for their support and for the faith they
showed by allowing me to conduct this research; without them this report would not have
been possible.

TABLE OF CONTENTS
ABSTRACT
DECLARATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDICES
CHAPTER 1: INTRODUCTION
1.1 THE RELEVANCE OF DATA MINING
1.2 THE IMPORTANCE OF THE STUDY
1.3 THE RESEARCH OBJECTIVES
1.4 INTRODUCTION
1.5 THE STATEMENT OF THE PROBLEM
1.6 THE SUB-PROBLEMS
1.7 THE DELIMITATIONS
1.8 DEFINITION OF TERMS
1.9 ASSUMPTIONS
1.10 THE RESEARCH STRUCTURE
CHAPTER 2: LITERATURE REVIEW
2.1 DATA AND DATA MINING IN THE BUSINESS CONTEXT
2.2 DATA MINING TECHNIQUES AND ALGORITHMS
2.2.1 Pure statistics
2.2.2 Artificial Intelligence (AI) methods
2.2.3 Genetic algorithms and genetic programming
2.2.4 Decision trees
2.2.5 Data visualisation
2.2.6 Rule induction methods
2.2.7 Data warehousing
2.3 THE USES OF THESE TECHNIQUES AND ALGORITHMS
2.3.1 Targeting / Predictive / Descriptive models
2.3.2 Fraud prediction and identification
2.3.3 Going concern prediction
2.4 THE FUTURE DIRECTION OF DATA MINING AND ITS
TECHNIQUES AND THE POSSIBLE USES OF THIS
CHAPTER 3: RESEARCH QUESTIONS
3.1 WHAT ARE COMMON DATA MINING TECHNIQUES AND ALGORITHMS?
3.2 WHAT ARE THE USES OF THESE TECHNIQUES?
3.3 WHAT IS THE FUTURE DIRECTION OF DATA MINING AND ITS
TECHNIQUES IN THIS SECTOR AND THE POSSIBLE USES THEREOF?

CHAPTER 4: RESEARCH METHODOLOGY
4.1 QUALITATIVE RESEARCH PARADIGM
4.2 DESCRIPTIVE RESEARCH DESIGN
4.3 POPULATION AND SAMPLE
4.4 DATA COLLECTION
4.5 DATA ANALYSIS
4.6 VALIDITY AND RELIABILITY
4.6.1 Internal validity
4.6.2 External validity
4.6.3 Reliability
4.7 COMPLETION OF THE RESEARCH REPORT
CHAPTER 5: RESULTS
5.1 COMMON DATA MINING TECHNIQUES AND ALGORITHMS
5.1.1 Descriptive statistics
5.1.2 Inferential statistics
5.1.3 Data reduction techniques
5.1.4 Numerical techniques
5.1.5 Other techniques
5.2 THE USES OF DATA MINING IN THE CREDIT AND DATA BUREAU SECTOR
5.3 FUTURE TECHNIQUES AND THEIR POSSIBLE USES
CHAPTER 6: DISCUSSION
6.1 COMMON DATA MINING TECHNIQUES AND ALGORITHMS
6.1.1 Descriptive statistics
6.1.2 Inferential statistics
6.1.3 Data reduction techniques
6.1.4 Numerical techniques
6.1.5 Other techniques
6.2 THE USES OF THESE TECHNIQUES
6.3 FUTURE TECHNIQUES AND THEIR POSSIBLE USES
CHAPTER 7: CONCLUSION AND RECOMMENDATIONS
7.1 BUSINESS IMPLICATIONS
7.2 SUGGESTIONS FOR FURTHER RESEARCH
REFERENCES
APPENDIX A: THE WRITTEN REQUEST
APPENDIX B: TELEPHONE PROTOCOL
APPENDIX C: INTERVIEW PROTOCOL
END

LIST OF TABLES
Table 1 Organisations that agreed to partake in the research
Table 2 Data Mining Techniques used in this sector
Table 3 Summary of Data Mining techniques in this sector
Table 4 Uses of Data Mining in this sector
LIST OF FIGURES
Figure 1 Research on basic scientific issues will influence data mining
applications in many other areas
Figure 2 Data mining techniques
LIST OF APPENDICES
APPENDIX A: THE WRITTEN REQUEST
APPENDIX B: TELEPHONE PROTOCOL

Chapter 1: Introduction
1.1 The relevance of data mining
Data mining has a tradition of research and practice going back to the early 1960s, when it
was originally known as statistical analysis and, in a cruder form, as "data dredging", a term
implying that there was no specific predetermined hypothesis or aim. Data mining has
evolved from statistical analysis using classical statistical techniques such as penetration
analysis, univariate analysis, correlation, regression, chi-square and cross tabulation to
being augmented by more diverse techniques such as fuzzy logic, heuristic reasoning and
neural networks. Since the 1990s the best approaches have been packaged together along
with newer and even more powerful techniques, and the results are being presented in much
more user-friendly and effective ways (Kimball et al, 1998:19; Parr Rud, 2001).
Early applications of data mining were in specialist fields such as geological research
(searching for natural resources, e.g. mining exploration) and meteorological research
(weather forecasting). Data mining is presently applied in areas such as retailing, the
insurance, financial and credit industries, as well as the medical domain (Benyon-Davies, 1996).
In today's intensely competitive global marketplace, enterprise decision makers look for
ways to increase competitive advantages by eliminating inefficiencies, optimizing internal
operations, and maximizing relationships with all organizational stakeholders (employees,
customers, partners, and shareholders). One area that assists in this is the deployment of
data mining technologies to leverage data resources and enhance decision-making
capabilities (Nemati & Barko, 2003).
Knowledge discovery / data mining techniques were formed from several decades of research
into machine learning, pattern recognition, statistics and visualisation techniques and
have been a research topic of long-standing interest (Vickery, 1997).
The techniques used in data mining give knowledge workers deeper insights than those
provided by management information systems, standard production reports, managed queries,
executive information systems, and online analytical processing.
Techniques employed in data mining to facilitate the finding of previously hidden information
include the capabilities to discover rules, classify, partition, associate, and optimise. In a
dynamic environment data continuously changes and the timeliness of using data mining
translates into a big advantage for the user. The ability to seamlessly automate and embed
some of the mundane, repetitive and tedious steps traditionally used is another advantage of
data mining (Gargano & Raggad, 1999).
1.2 The importance of the study
IBM defined four major operations for data mining (Technology Forecast, 1997, cited in
Lee & Siau, 2001):
1. Predictive modeling: using inductive reasoning techniques such as neural networks
and inductive reasoning algorithms to create predictive models.
2. Database segmentation: using statistical clustering techniques to partition data into
clusters.
3. Link analysis: identifying useful associations between data.
4. Deviation detection: detecting and explaining why certain records cannot be put into
specific segments.
Lee & Siau (2001) also defined three main steps in data mining. These steps are:
1. Preparing the data;
2. Reducing the data; and
3. Looking for valuable information in the data.
The specific approaches may differ from company to company and researcher to researcher.
Fayyad, Piatetsky-Shapiro & Smyth (1996) proposed the following steps:
1. Retrieving the data from a large database.
2. Selecting the relevant subset to work with.
3. Deciding on the appropriate sampling system, cleaning the data and dealing with
missing fields and records.
4. Applying the appropriate transformations, dimensionality reduction, and projections.
5. Fitting models to the preprocessed data.
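The five steps above can be illustrated with a minimal sketch. The records, field names and the trivial threshold "model" below are hypothetical and chosen only for illustration; they are not taken from Fayyad, Piatetsky-Shapiro & Smyth (1996) or from any bureau's data.

```python
from statistics import mean

# Hypothetical bureau records; None marks a missing field.
RECORDS = [
    {"id": 1, "income": 12000.0, "accounts": 3, "defaulted": 0},
    {"id": 2, "income": None, "accounts": 5, "defaulted": 1},
    {"id": 3, "income": 30000.0, "accounts": 1, "defaulted": 0},
    {"id": 4, "income": 8000.0, "accounts": 6, "defaulted": 1},
]


def retrieve():
    """Step 1: retrieve the data (here, from an in-memory list)."""
    return [dict(r) for r in RECORDS]


def select(rows, fields):
    """Step 2: keep only the relevant subset of fields."""
    return [{f: r[f] for f in fields} for r in rows]


def clean(rows, field):
    """Step 3: replace missing values of `field` with the mean of the known values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def transform(rows, field):
    """Step 4: rescale `field` to the range [0, 1]."""
    lo = min(r[field] for r in rows)
    hi = max(r[field] for r in rows)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in rows]


def fit(rows, field, target):
    """Step 5: fit a trivial model, a threshold midway between the two class means."""
    pos = mean(r[field] for r in rows if r[target] == 1)
    neg = mean(r[field] for r in rows if r[target] == 0)
    return (pos + neg) / 2


rows = retrieve()
rows = select(rows, ["income", "accounts", "defaulted"])
rows = clean(rows, "income")
rows = transform(rows, "accounts")
threshold = fit(rows, "accounts", "defaulted")
```

In practice each step would be considerably more involved; the sketch only shows how the output of one step feeds the next.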
A classification of techniques, algorithms, and uses in data mining, and possible future
direction of data mining in this sector will provide managers and business users with a
reference, source of understanding and a means to verify the claims made by this sector
about the results of the data mining and the subsequent release of information and data
sets.
The results of data mining exercises and some of the generic uses of data mining and
techniques in this field may be of use to other users. They may allow data miners themselves
to adapt some of these algorithms or techniques and to consider the possible future
direction or use of data mining. An understanding of the uses of the techniques will also
enable managers to better motivate the use of the data mining services and data value-add of
the bureaus.
1.3 The research objectives
Based on the background provided above, the research objectives become clearer:
• To determine what common data mining techniques and algorithms are and what the
uses of these techniques are;
• To determine what the future direction of data mining techniques in this sector is
and the possible uses of these future techniques.

These objectives will aid the reader in understanding some of the benefits and uses that
could be achieved for their organisation through the use of the data mining techniques and
the subsequent data output by the vendors in this sector and how the users may benefit
from understanding the techniques used and their value.
The objectives of this research will be achieved by answering each of the research questions
posed.
1.4 Introduction
Many businesses today make use of data provided by credit and data bureaus and also of the
data mining techniques (sometimes inadvertently and unknowingly) used by these bureaus.
These include businesses like marketing research companies, banks, retailers, micro-lenders,
brokers and employment agencies, all of which have long been avid consumers of the data
and techniques used by the bureaus. The increased usage has been accentuated by increased
interest in making efficient use of organisational data through data mining and data
warehousing. All forms of data and data mining are gaining popularity and are being
used more and more frequently, and this is likely to continue. The algorithms
and techniques used in data mining are complex and require a solid understanding of
statistical methods and other techniques (Cabena, Hadjinian, Stadler, Verhees & Zanasi,
1998; Beynon, Curry & Morgan, 2001).
Credit and Data Bureaus are ideal for this research since they collect and mine enormous
amounts of data. Data Bureaus like Effective Intelligence hold more than 20,000,000 records
(J. Ardagh from Effective Intelligence, personal communication, 30 January 2005) on credit
active consumers in South Africa and Credit Bureaus like Kredit Inform hold more than
1,000,000 records (M. Hendriksen from Kredit Inform, personal communication, 30 January
2005) on business entities in South Africa and process more than 1,000,000 online requests
for information daily. This information and the applied data mining is used in more than
3,000 businesses (C. Capper from Experian, personal communication, 30 January 2005) in
South Africa to make credit decisions, for direct marketing, to predict fraud, consumer
behaviour or the propensity of a business to default.
1.5 The Statement of the problem
The aim of the research is to identify and evaluate data mining techniques in the Credit and
Data Bureau sector and to expand on the body of knowledge available to managers in this
sector, and users of these data and techniques as clients of this sector.
Describing and classifying the main data mining algorithms and techniques, and commenting
on the generic uses to the end-user, the tools used and the possible future direction of data
mining, provide the background for this study.
The aim of the research and sub-problems are based on a study done by Chidley (2002) on
an evaluation of data mining techniques in the banking sector. This was expanded to include
research into the possible future direction of data mining in this sector and the uses thereof.
These objectives should assist managers and business people who interact with this sector
to better understand the techniques used, and the benefits and uses of these techniques.
Users get their data from these vendors and are not sure what the vendors have done to this
data in order to get the delivered results. If users understand the uses of data mining and
the techniques and tools used, they could build on this or even request new or unmined data
to analyse.
1.6 The sub-problems
I. What are common data mining techniques and algorithms?
II. What are the uses of these techniques?
III. a. What is the future direction of data mining techniques in this sector?
b. And the possible uses of these future techniques?
1.7 The delimitations
This study will not compare software tools used by the bureaus.
1.8 Definition of terms
Data mining - Data mining is the process of extracting valuable knowledge from large
databases and using it to make decisions critical to some organisations. There are a number
of features to this definition:
I. Data mining is concerned with the discovery of hidden, unexpected patterns of data.
II. Data mining usually works on large volumes of data. Frequently large volumes are
needed to produce reliable conclusions in relation to data patterns.
III. Data mining is useful in making critical organisational decisions, particularly those of
a strategic nature. (Benyon-Davies, 1996; Kimball, Reeves, Ross & Thornthwaite,
1998).
1.9 Assumptions
The assumptions made are based on what Chidley (2002) used in his study and are also
applicable here:
I. That the experts approached for the study have sufficient skills and experience in
the field for the report to present a true reflection of the uses to which data mining is
being put;
II. That the experts' views are representative of those in this sector.
1.10 The research structure
The research was based on the literature review and the results from interviewing experts in
this sector in data mining.
The literature review reveals current definitions of data mining, the techniques used
(including algorithms as applicable), the uses of these techniques, and possible future
directions of data mining and its techniques. The chapter concludes with three research
questions.

The results are presented in Chapter Five. The results of a synthesis of the literature,
undertaken to answer two of the three research questions, are presented. This chapter also
describes the results of the interviews with members of the expert panel.
In Chapter Five, the applications of data mining that were found in the interview process are
reviewed. This allows comparisons to be made between the uses discovered during the
literature review and the uses suggested by the expert panel. Appropriate conclusions are
drawn in Chapter Six.
A similar process is followed with regards to data mining techniques and algorithms. A
contrast is drawn between the techniques and algorithms mentioned in the literature and the
techniques being used in the Credit and Data Bureau sector.
Chapter Five is finalized with a summary and discussion of the expert panel's views on the
possible future techniques of data mining and possible uses of these techniques in this
sector.
The research is concluded with a chapter for conclusions and recommendations. In this
chapter, the research questions are again posed and a summarized answer to each is
presented. The chapter also presents the business implications of the research and
suggestions for future research.

Chapter 2: Literature Review
2.1 Data and data mining in the business context
Data mining is defined as: "... leveraging data-mining tools and technologies to enhance the
decision-making process by transforming data into valuable and actionable knowledge to
gain a competitive advantage." (Nemati & Barko, 2003:282).
Knowledge discovery has been defined as: "...the 'extraction of implicit, previously unknown,
and potentially useful information from data'. The information extracted includes
concepts, concept interrelations, classifications, decision rules, and other patterns of
interest." (Vickery, 1997:107)
Data is everywhere and is used and created in almost every activity in an organisation's
day-to-day workings. The amount of data collected and stored continues to grow at an
enormous rate. Unfortunately for business users wishing to mine this data, add value to it
or create value from it, the data is usually stored in a way that is essentially random.
How to create a competitive advantage from this data and its mining
is the critical challenge facing many organisations today (Forcht & Cochran, 1999).
Recently three new and interrelated areas that emphasise obtaining and creating more
information and knowledge from data have emerged strongly in information systems and
information technology. These are:
• Data warehousing
• Knowledge management
• Data mining
Data mining can be considered a recently developed methodology and technology that has
seen increased focus and importance in organisations and that will have an important impact
on organisational performance. It has only come into prominence in the last ten
or so years and has recently gained widespread attention and increasing popularity
in the commercial world. Successful data mining applications have been reported and recent
surveys have found that data mining has grown in usage and effectiveness (Fayyad,
Piatetsky-Shapiro & Smyth, 1996; Koh & Low, 2004).
2.2 Data mining techniques and algorithms
In the review of the literature the terms "techniques", "algorithms" and "tools" were
found to be used to describe the same or similar things. Chidley (2002) found the same
in his research.
"Techniques" were described by Lee & Siau (2001) as a clustering of similar mathematical
algorithms like statistics, artificial intelligence, the decision tree approach, genetic
algorithms, and visualisation, while the "tools" were described by Gargano & Raggad (1999)
as including artificial intelligence methods (e.g. expert systems, fuzzy logic), decision
trees, rule induction methods, genetic algorithms and genetic programming, neural networks
(e.g. backpropagation, associative memories), and clustering techniques.
"Algorithms" are defined as the mathematical and statistical formulas and/or software
code behind specific ways of querying the data when mining it (Chidley, 2002).
Gargano & Raggad (1999:83) further defined the tools used in data mining as "simple,
concise, easy to implement algorithms, that model nonrandom (i.e. statistically
significant) relationships (or patterns) in large historic data sets."
For the purposes of this research the terms "techniques", "algorithms" and "tools"
will be used interchangeably. A clear distinction must however be made between
the techniques used for data mining and the uses of data mining.
A review of the literature found the following techniques:
2.2.1 Pure statistics
Basic statistics
Statistics is the most basic and an indispensable component of data mining and is
also used to evaluate the results of the mining done and to separate the good from
the bad. Statistics allows the miner to get a hands-on, and sometimes visual, feel for
the data, enables a basic understanding of the nature of the data and serves as
an indication of the most suitable techniques for further mining. It is used in the
cleaning of data and enables the identification of outliers and anomalies ("noise") in
the data. Statistics also assists in dealing with missing data through estimation
techniques (Lee & Siau, 2001).
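As an illustration of the cleaning role described above, the following sketch flags outliers with z-scores and fills missing values with the mean of the known values. The z-score rule, the threshold of 2 and the sample values are assumptions made for this example; the literature cited does not prescribe a specific method.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the number of standard deviations each value lies from the mean."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return [(v - m) / s for v in values]


def outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold` (assumed cut-off)."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# Hypothetical account balances with one suspicious value.
flagged = outliers([10, 11, 9, 10, 12, 50])
filled = impute_mean([1.0, None, 3.0])
```

Mean imputation is only one of several estimation approaches and can understate the variability in the data.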
Probability distributions - Probability distributions aim to find relations between
data points or variables (Forcht & Cochran, 1999).
Inference - Inference estimates the likelihood of various outcomes, given a set of
variables and is frequently a step beyond a probability distribution as it often uses
the results of a probability distribution as part of its raw data (Forcht & Cochran,
1999).
Estimation - One way of dealing with missing data is the use of estimation techniques
(Lee & Siau, 2001). Estimations are almost always made on the basis of assumptions
that may not be strictly met for a variety of different reasons. When this happens
one should not assume that if the model is incorrect, the assumptions must be
incorrect. This may sometimes be true but is not always the case. Analysts often
test their models by finding ways to weaken their assumptions. They attempt to
discount weak assumptions and leave only the strongest assumptions. When using
inference or estimation models different models may be sound, even though they
have competing assumptions. Instead of using only one model, it is best to use
several and to combine them into a weighted average, which should improve the
quality of the estimation made (Forcht & Cochran, 1999).
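The idea of combining several models into a weighted average can be sketched as follows. The three estimates and their weights are hypothetical; how the weights should be chosen is not specified in the literature reviewed here.

```python
def combine_estimates(estimates, weights):
    """Return the weighted average of several model estimates."""
    if len(estimates) != len(weights):
        raise ValueError("estimates and weights must have the same length")
    total_weight = sum(weights)
    return sum(e * w for e, w in zip(estimates, weights)) / total_weight


# Hypothetical default-probability estimates from three competing models,
# with the second model trusted twice as much as the others.
combined = combine_estimates([0.10, 0.20, 0.15], [1, 2, 1])
```

With equal weights the function reduces to a simple average of the estimates.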
Hypothesis testing - Hypothesis testing is a type of estimation that seeks an answer
that is binary in nature. The test seeks only a "yes or no" type of answer to verify
whether a hypothesis is plausible or not. Usually, one hypothesis is tested against
an alternative one to find the stronger of the two (Forcht & Cochran, 1999).
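A minimal sketch of such a "yes or no" test is given below. It assumes a two-sided z-test for a proportion using the normal approximation, with a hypothetical null default rate of 10% and a significance level of 5%; these choices are illustrative and are not drawn from Forcht & Cochran (1999).

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def proportion_z_test(successes, n, p0, alpha=0.05):
    """Two-sided z-test of the null hypothesis that the true proportion equals p0.

    Returns the z statistic, the p-value and the binary decision (True = reject).
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical sample: 18 defaults observed among 100 accounts.
z, p_value, reject = proportion_z_test(18, 100, 0.10)
```

The decision is binary, but the p-value indicates how strongly the data contradict the hypothesis being tested.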

Regression - This is the most important of all the multivariate techniques available to
non-experimentalists. Once analysts understand regression, almost any question amenable to
quantitative analysis can be answered. This technique, perhaps more than any other data
manipulation technique, lends itself to visualisation. Regression has many variants,
e.g. bivariate and multiple regression. In its purest form regression answers the
common query: What is the relationship between variable X and variable Y? (Lewis-Beck,
Berry, Feldman, Fox & Hardy, 1993). This technique has a myriad of uses in data mining
(Koh & Low, 2004).
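The question "what is the relationship between variable X and variable Y?" can be illustrated with a bivariate ordinary least squares fit. The sketch below uses hypothetical data and is not an example from the sources cited.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return intercept + slope * x


# Hypothetical data lying exactly on the line y = 1 + 2x.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The slope estimates how much Y changes for a one-unit change in X; multiple regression extends the same idea to several X variables.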
Discriminant analysis - This is a classification technique used to describe group separation
(Rencher, 1995; Gordon, 1999). Standard linear discriminant analysis involves a linear
classification boundary and is used to group the population (Rencher, 1995), but it should be
noted that it depends on assumptions regarding normality of the underlying populations,
which must also possess identical variance-covariance matrices. The linear rule can be shown
to minimise the expected number of misclassifications.
Clustering
Clustering may be used as a preparatory step to segment a database before applying other
data mining techniques, or as a separate data mining technique in its own right (Chidley,
2002). The technique itself is the process of identifying useful and homogeneous clusters
(e.g. objects or people), patterns, relationships or interesting trends with similar
characteristics in time-dependent data (Emory & Cooper, 1991; Gargano & Raggad, 1999;
Forcht & Cochran, 1999; Lee & Siau, 2001). A cluster or pattern may be regarded as a
collection or class of records sharing something in common. Conceptual clustering uses not
only similarity but also what has been called 'conceptual cohesiveness' as defined by
background information. Interactive
clustering includes contributions from the human user's knowledge (Vickery, 1997).
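One common way of forming homogeneous clusters is the k-means procedure, sketched below for one-dimensional data. K-means is chosen here purely as an illustration; the sources cited above do not single it out, and the values and starting centroids are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around the nearest centroid and update the centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical account ages (in months) with two obvious groups.
centres, groups = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
```

The result depends on the starting centroids, so in practice the procedure is usually run from several starting points.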
Classification
Classification is the process of dividing and allocating data items in a data set into previously
defined and mutually exclusive groups so that the members of each group are as close as
possible to one another, and the members of different groups are as far as possible from one
another. An example of a typical classification problem is dividing a database of customers
into groups that are as homogeneous as possible with respect to a variable such as
creditworthiness (Lee & Siau, 2001).
Link analysis
Link analysis is a descriptive approach to identifying useful associations and relationships
between values in a database (Lee & Siau, 2001).
Association rules and associative memories
These techniques are used to mine transactional or relational databases (Lee & Siau, 2001)
and are able to detect similarities between new patterns and previously stored patterns
(Caudill & Butler, 1990).
The main tool used for this, according to Gargano & Raggad (1999), is associative
memories, where pairs (or larger groups) of associated data items are memorised
(or discarded, in effect “forgotten”) using a long-term memory network mode. A
partial stimulation of the long-term memory network results in a retrieved data pair.
This retrieved pair may be either a previously memorised pair or the best
attempt of the network to reconcile the initial stimulus with a reasonable
output pair response.
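Association rules over transactional data are commonly evaluated with two measures, support and confidence. The sketch below computes both for hypothetical product holdings; the measures are standard in association rule mining, but the data and item names are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical product holdings of four consumers.
holdings = [
    {"loan", "card"},
    {"loan", "card", "store_account"},
    {"card"},
    {"loan"},
]
rule_support = support(holdings, {"loan", "card"})
rule_confidence = confidence(holdings, {"loan"}, {"card"})
```

Here the rule "loan implies card" holds for two of the three consumers with a loan, so its confidence is two thirds.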
2.2.2 Artificial Intelligence (AI) methods
Artificial Intelligence techniques are widely used in data mining (Lee & Siau, 2001;
Koh & Low, 2004). These include neural networks, backpropagation, expert systems
and fuzzy logic (Gargano & Raggad, 1999; Zwick, 2004).
Neural networks
Neural networks were originally designed mainly for use in the disciplines of
psychology and biology. Their application in a data mining context is driven by the
desire to exploit their properties as non-linear statistical methods (Beynon et al,
2001).
These are powerful techniques for analysing complex non-linear and interaction
relationships, and can be used to supplement and complement traditional statistical
methods in, for example, constructing going concern prediction models (Lee & Siau,
2001; Koh & Low, 2004).
Neural networks are some of the most common types of data mining tools used.
They are used for recognising patterns in data, especially when the relationships
between the dependent and independent variables are unknown and/or complex.
They are designed to "think" like, and are modeled after, the human brain, which can be
perceived as a highly connected network of neurons (called nodes in neural network
terminology). Each node (in a layer of nodes) receives inputs from at least one node in a
previous layer and combines the inputs and generates an output to at least one
node in the next layer. Generally, the independent variables comprise the input layer
and the dependent variable the output layer and between these there may be one or
more hidden layers of nodes. In combining inputs and generating an output, each
node performs a computation (to combine the inputs) and a transformation (to
generate an output). Each connection between two nodes has a weight that deter-
mines how the input from a prior node must be combined with other inputs to
generate an output that must be received by the next node (Vickery, 1997; Gargano
& Raggad, 1999; Lee & Siau, 2001).
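The node computation described above (a weighted combination of inputs followed by a transformation) can be sketched as follows. The sigmoid transformation, the function names and all weights are assumptions chosen for illustration; the sources do not prescribe them.

```python
import math


def sigmoid(z):
    """Transformation step: squash the combined input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Each node combines its weighted inputs, then transforms the result."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        combined = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(sigmoid(combined))
    return outputs


def network_forward(inputs, layers):
    """Feed the input layer through each later layer in turn."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 independent variables, 2 hidden nodes, 1 output node.
layers = [
    ([[0.4, -0.6], [0.3, 0.8]], [0.0, -0.1]),  # hidden layer weights and biases
    ([[1.2, -0.7]], [0.05]),                   # output layer weights and biases
]
prediction = network_forward([0.5, 0.2], layers)
```

The single value in `prediction` plays the role of the dependent variable; training consists of finding weights that make such outputs match known cases.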
Neural networks first break down data sets into smaller, more manageable pieces
before trying to discover patterns in the data. Such techniques require large amounts
of resources and frequently require some custom programming for each search, as
well as more processing afterward, because the system may "discover" patterns that
seem logical to it but that, on human review, turn out not to be meaningful
(Forcht & Cochran, 1999; Koh & Low, 2004).
Lu et al. (1996) (cited in Lee & Siau, 2001) split the neural network-based data
mining approach into three major phases:
• Network construction and training: in this phase, a layered neural network based on
the number of attributes, number of classes, and chosen input coding method is
constructed and trained.
• Network pruning: in this phase, redundant links and units are removed without
increasing the classification error rate of the network.
• Rule extraction: classification rules are extracted in this phase (Lee & Siau, 2001).
Backpropagation systems
These techniques are highly supervised. The backprop neural network model is ideal for
prediction and classification in situations where there is a good deal of historic data available
for training. In this tool, the output variables generated by the neural network are
corrected by adjusting the weights of the hidden layer variables until the output variables
match those in the training dataset (Gargano & Raggad, 1999; Chidley, 2002).
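A minimal sketch of this weight-adjustment loop is given below, assuming a network with two inputs, two hidden nodes and one output node, sigmoid transformations and a squared-error measure. The starting weights, the learning rate and the four historic records are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def mean_squared_error(data, params):
    return sum((forward(x, *params)[1] - t) ** 2 for x, t in data) / len(data)


def initial_parameters():
    # Fixed, arbitrary starting weights so that the run is repeatable.
    return [[0.1, -0.2], [0.3, 0.2]], [0.0, 0.0], [0.2, -0.1], 0.0


def train(data, params, epochs=3000, rate=0.5):
    w_hidden, b_hidden, w_out, b_out = params
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
            # Error signal at the output node.
            delta_out = (output - target) * output * (1.0 - output)
            # Propagate the error back to the hidden nodes before changing w_out.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(len(hidden))]
            for j in range(len(hidden)):
                w_out[j] -= rate * delta_out * hidden[j]
                for i in range(len(x)):
                    w_hidden[j][i] -= rate * delta_hidden[j] * x[i]
                b_hidden[j] -= rate * delta_hidden[j]
            b_out -= rate * delta_out
    return w_hidden, b_hidden, w_out, b_out


# Hypothetical history: (scaled income, scaled debt) -> 1 if the loan was repaid.
history = [((0.9, 0.1), 1), ((0.8, 0.3), 1), ((0.2, 0.9), 0), ((0.1, 0.7), 0)]
error_before = mean_squared_error(history, initial_parameters())
trained = train(history, initial_parameters())
error_after = mean_squared_error(history, trained)
```

Comparing `error_before` with `error_after` shows how far the repeated corrections have moved the outputs towards the values in the training data.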
Expert systems
Expert systems are made up of a knowledge base of rules (extracted from experts), facts (or
data), and a logic-based inference engine (or control) that creates new rules and facts based
on previously accumulated knowledge and facts. Expert systems attempt to mimic, with
some success, the reasoning of human experts whose knowledge of a specific and narrow
domain is deep. Human experts and expert systems can therefore arrive at similar
conclusions, and the system justifies its existence by improving the expert decision
maker's own productivity. The expert system thus operates using queries formulated by
human experts and incorporated into the system. Expert systems do not rely on algorithmic
or statistical methods and cannot solve problems that have not been defined during the
programming of the model (Jackson, 1990; Gargano & Raggad, 1999; Chidley, 2002).
Jackson (1990:4) listed the following characteristics for expert systems:
• They simulate human reasoning,
• They perform reasoning "over representations of human knowledge",
• Heuristic or approximate methods are used to solve problems (which does not guar-
antee success as would have been the case had algorithmic techniques or solutions
been used).
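The interplay of rules, facts and an inference engine can be illustrated with a small forward-chaining sketch. It derives new facts only (not new rules), and the rule contents and fact names are hypothetical examples rather than rules taken from any bureau.

```python
def infer(facts, rules):
    """Forward chaining: fire rules until no new facts can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conclusion not in known and set(conditions) <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rules of the kind a human expert might supply.
rules = [
    (["has_judgement"], "adverse_record"),
    (["missed_payments"], "adverse_record"),
    (["adverse_record", "high_debt_ratio"], "high_risk"),
    (["high_risk"], "refer_to_credit_officer"),
]

derived = infer({"missed_payments", "high_debt_ratio"}, rules)
unhandled = infer({"stable_income"}, rules)
```

The second call returns only the fact it was given, which mirrors the point above that an expert system cannot solve problems that were not defined when its rules were written.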
Fuzzy expert systems
Fuzzy expert systems employ fuzzy logic concepts and were developed in an attempt to
solve the brittleness problem inherent in expert systems. The truth or falsity of a fact
can be measured in a fuzzy way using values from the real number interval zero to one
inclusive (i.e. [0, 1]). In expert systems, information is either totally false (i.e. zero) or
totally true (i.e. one), but in fuzzy expert systems, truth values can lie anywhere on the zero
to one interval of real numbers. Some facts are close to being true or close to being false
(having low entropy), while other facts lie close to the middle between being true or false
(having high entropy). Using fuzzy operators, such as AND, OR, NOT, VERY, and SOME-
WHAT, the system can make fuzzy implications. Fuzzy systems can easily handle illogical
complexities, poor clarity (in the facts and/or rules), or internal inconsistencies (Gargano &
Raggad, 1999).
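The fuzzy operators named above can be sketched as simple functions on truth values in [0, 1]. The definitions used here (minimum, maximum, complement, squaring and square root) are the conventional ones and are assumed for illustration, since the source does not define the operators; the example truth values are hypothetical.

```python
import math


def fuzzy_and(a, b):
    return min(a, b)


def fuzzy_or(a, b):
    return max(a, b)


def fuzzy_not(a):
    return 1.0 - a


def very(a):
    return a ** 2  # sharpens a statement: only high truth values stay high


def somewhat(a):
    return math.sqrt(a)  # relaxes a statement: moderate truth values are raised


# Hypothetical truth values for two facts about an applicant.
good_payment_history = 0.7
low_debt = 0.4
creditworthy = fuzzy_and(good_payment_history, somewhat(low_debt))
```

Here `creditworthy` is itself a truth value between zero and one rather than a yes/no answer, so later rules can keep working with partial truth.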

2.2.3 Genetic algorithms and genetic programming
Genetic algorithms are a relatively new technique inspired by Darwin's theory of evolution
(natural selection and survival of the fittest). A population of rules, which may or may not
represent a solution to a problem, is created at random. Pairs of these rules, usually the
strongest, are then selected as "parents" and combined to produce "offspring" for the next
generation. A mutation process is used to randomly modify the genetic structures of some
members of each new generation. The system runs for dozens or hundreds of generations
and is only terminated when an acceptable or optimum solution is found, or after a fixed
time limit. Genetic algorithms are appropriate for problems that require optimisation with
respect to some computable criterion (Lee & Siau, 2001; Mitchell, 2005).
While genetic algorithms evolve complex data structures, genetic programming evolves
complex algorithmic structures (i.e. computer programs). This technique is useful for
finding optimal or near-optimal solutions to hard optimisation problems, for fine-tuning
the parameters of other data mining techniques and models, and also for classification
(Vickery, 1997; Gargano & Raggad, 1999; Lee & Siau, 2001).
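The select, combine and mutate cycle of a genetic algorithm can be sketched on a toy problem. In this sketch bit strings stand in for the "rules", and the fitness criterion (the number of 1 bits) is a hypothetical placeholder for a real computable criterion. The best member of each generation is carried over unchanged, which is a design choice of this sketch rather than something the sources require.

```python
import random


def fitness(bits):
    """Toy criterion: the more 1 bits, the stronger the candidate."""
    return sum(bits)


def evolve(pop_size=20, length=16, generations=50, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(member) for member in population)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == length:
            break  # an optimum solution has been found
        parents = population[: pop_size // 2]  # the strongest half may breed
        next_generation = [population[0][:]]   # keep the best member unchanged
        while len(next_generation) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally flip a bit in the offspring.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            next_generation.append(child)
        population = next_generation
    return initial_best, max(population, key=fitness)


initial_best, best = evolve()
```

Because the strongest member always survives, the best fitness in the population can never fall from one generation to the next.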
2.2.4 Decision trees
A decision tree is a statistical approach based on a branching system of decisions. A
decision question is answered at each node either positively (Yes) or negatively (No). The
answer leads to another set of decisions (Gargano & Raggad, 1999).
Koh & Low (2004:466) summarised the main algorithms: "In the Automatic Interaction Detection
(AID) algorithm, all possible two-way splits of each node for each independent variable are
examined. The split that leads to the most significant t-statistic (as per the analysis of the
variance) for the difference in means of the dependent variable between the two lower-level
nodes is selected. In the chi-square Automatic Interaction Detection (CHAID) algorithm, the
chi-square statistic is used to determine the best split while in the Classification and Regres-
sion Trees (CART) algorithm, an index of diversity is used to determine the best split."
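To make the idea of choosing a split by an index of diversity concrete, the sketch below examines every two-way split of one continuous variable and keeps the split with the lowest weighted diversity. The Gini index is assumed here as the diversity measure, and the variable names and data are hypothetical.

```python
def gini(labels):
    """Gini index of diversity: 0 for a pure node, larger for mixed nodes."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_split(values, labels):
    """Examine every two-way split of one continuous variable."""
    candidates = sorted(set(values))
    best_threshold, best_score = None, float("inf")
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [label for v, label in zip(values, labels) if v <= threshold]
        right = [label for v, label in zip(values, labels) if v > threshold]
        # Weight each child's diversity by the share of records it receives.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical data: months in arrears and whether the account defaulted.
months_in_arrears = [1, 2, 3, 4]
defaulted = [0, 0, 1, 1]
threshold, diversity = best_split(months_in_arrears, defaulted)
```

A full tree would repeat this search over all independent variables at every node and stop when the nodes are pure enough.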
This technique has several strengths:
• Understandable rules can be generated
• Both continuous and categorical variables can be handled
• The ability to indicate the relative importance of the variables for classification and
prediction
• Outputs are easy to understand
• They are relatively simple to implement and
• Their results can be easily explained
(Gargano & Raggad, 1999; Chidley, 2002; Koh & Low, 2004)

2.2.5 Data visualisation
Visualisation is a method of clearly presenting the typically complex results found using data
mining tools. This allows the presentation of the complex interdependencies among many
attributes in a visual format in order to get an intuitive feel of the data and the results of the
analysis. Analysts and management users can then more easily assess and make sense of vast
amounts of data. Techniques include colors, shapes and sounds in various combinations,
statistical scatter plots, decision trees, plots that demonstrate the results of curve fitting,
geographical maps, and development dashboards that track and control the evolution of a
data mining modeling tool (Gargano & Raggad, 1999; Lee & Siau, 2001).
2.2.6 Rule induction methods
Rule induction uses statistical discovery methods to develop rules that depend on the
frequency of correlation, the rate of accuracy, and the accuracy of prediction. Typically,
IF-THEN type rules are created by focusing on either the variables forming the IF part of a rule
or the variables forming the THEN part of a rule. For rule induction it is useful to think of
data mining from marketing databases. The technique is based on measures of data ambi-
guity or approximation quality. These measures are formulated in terms of ratios, involving
objects either definitely or possibly allocated to a decision class, on the basis of a given table
or data matrix. The end result is a set of decision rules, which are very easy to understand
and interpret. Rule induction is a useful tool for development of expert systems (Gargano &
Raggad, 1999; Beynon et al, 2001).
Gargano & Raggad (1999:85) caution that: "Sometimes, however, the novelty, significance,
value, or exceptionality of a rule is deemed to be most interesting. Rule induction methods
are highly unsupervised, however, they do require that experts evaluate the rules generated.
This technique is most often used when new rules need to be generated. Owing to the
combinatorially explosive nature of generating rules in this manner, such models usually run
in the background or at times when computing demand is low."
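A very small rule induction sketch is shown below. It proposes single-condition IF-THEN rules and keeps those whose accuracy meets a threshold, reporting how many records each rule covers. The records, attribute names and the 0.9 accuracy threshold are hypothetical, and practical rule induction methods search far larger rule spaces.

```python
from collections import Counter


def induce_rules(records, target, min_accuracy=0.9):
    """Propose single-condition IF-THEN rules with their coverage and accuracy."""
    rules = []
    attributes = [key for key in records[0] if key != target]
    for attribute in attributes:
        for value in sorted({record[attribute] for record in records}):
            outcomes = Counter(r[target] for r in records if r[attribute] == value)
            conclusion, hits = outcomes.most_common(1)[0]
            coverage = sum(outcomes.values())
            accuracy = hits / coverage
            if accuracy >= min_accuracy:
                rules.append((attribute, value, conclusion, coverage, accuracy))
    # Most accurate rules first, then those covering the most records.
    return sorted(rules, key=lambda rule: (-rule[4], -rule[3]))


# Hypothetical decision table.
records = [
    {"employed": "yes", "owns_home": "yes", "outcome": "repaid"},
    {"employed": "yes", "owns_home": "no", "outcome": "repaid"},
    {"employed": "no", "owns_home": "yes", "outcome": "default"},
    {"employed": "no", "owns_home": "no", "outcome": "default"},
    {"employed": "yes", "owns_home": "no", "outcome": "default"},
]
rules = induce_rules(records, target="outcome")
```

Each returned tuple reads as "IF attribute = value THEN conclusion", followed by the number of records covered and the rule's accuracy, which an expert would then evaluate.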
2.2.7 Data warehousing
Data warehousing is described by Lee & Siau (2001) as one of the most important research
areas related to data mining. A data warehouse is necessary to organise historical data
gathered from large-scale client/server-based applications for further analysis.
A data warehouse is a read-only database containing large volumes of subject-oriented
data, where all levels of an organisation can find the information in a timely manner (Lee &
Siau, 2001).
Kimball et al (1998:19) call the data warehouse the foundation of decision-making in an
organisation: "The queryable source of data in the enterprise".
Data warehousing enables each user to share a common, diverse database that they may
analytically explore, using all of the available data quickly and correctly, which increases the
effectiveness of data-driven decision making (Cabena et al, 1998; Gargano & Raggad, 1999).
The data warehouse architecture consists of a series of data marts that give a consolidated,
consistent view of the organisation's historical analytical, time-based data (Cabena et al,
1998; Kimball et al, 1998). Raw data are extracted, cleaned, transformed, and integrated into
the marts from a variety of sources. Metadata, data about the data in the warehouse, is also
an integral part of the system. The warehouse architecture must manage standard informa-
tion delivery systems and data queries, interfaces with applications development platforms
and management information systems (MIS), and online analytical processing (OLAP), in
addition to advanced information technology data mining and business intelligence tools
(Kimball et al, 1998; Forcht & Cochran, 1999; Gargano & Raggad, 1999).
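The extract, clean, transform and integrate steps can be sketched with an in-memory SQLite database standing in for a data mart. The source names, table layout and cleaning rules below are hypothetical and far simpler than a production warehouse.

```python
import sqlite3

# Hypothetical raw extracts from two operational sources.
raw_sources = {
    "branch_a": [("  Alice ", "1200.50"), ("Bob", "n/a")],
    "branch_b": [("carol", "300")],
}


def clean(name, amount):
    """Return a tidy (name, amount) pair, or None if the amount is unusable."""
    try:
        value = float(amount)
    except ValueError:
        return None
    return name.strip().title(), value


def load_mart(sources):
    """Integrate the cleaned rows from every source into one queryable table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE balances (source TEXT, customer TEXT, amount REAL)"
    )
    for source, rows in sources.items():
        for name, amount in rows:
            cleaned = clean(name, amount)
            if cleaned is not None:
                connection.execute(
                    "INSERT INTO balances VALUES (?, ?, ?)", (source, *cleaned)
                )
    connection.commit()
    return connection


mart = load_mart(raw_sources)
rows = mart.execute(
    "SELECT customer, amount FROM balances ORDER BY customer"
).fetchall()
```

Once loaded, the mart can be queried consistently regardless of which source a record came from, which is the consolidated view described above.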
2.3 The uses of these techniques and algorithms
Mitchell (1999) stated that in the field of data mining there are practical applications in areas
like analyzing medical outcomes, detecting credit card fraud, predicting customer purchase
behavior, predicting the personal interests of internet users, optimizing manufacturing
processes and predicting which bank-loan applicants are at high risk of failing to repay their loans.
As shown in Figure 1 from Mitchell (1999), data in such applications typically consists of
time-series descriptions of customer bank balances and other demographic information.
Other data mining applications include predicting customer purchase behavior, customer
retention, and the quality of goods produced by a particular manufacturing line. Mitchell
(1999) believes that research on basic scientific issues (like the medical field) will influence
data mining applications in many other business related areas. Data mining thus feeds on
itself: techniques used in one sector or industry may be adapted for different uses in another
sector. Data miners learn from other data miners, and a technique that has one use in one
sector could have a completely different use in another.
Figure 1: Research on basic scientific issues (left) will influence data mining applications in
many areas (right). Source: (Mitchell, 1999)
Left panel (Scientific Issues, Basic Technologies): learning from mixed media data, such as
numeric, text, image, voice, sensor; active experimentation, exploration; optimizing decisions,
rather than predictions; inventing new features to improve accuracy; learning from multiple
databases and the Web.
Right panel (Applications): medicine; manufacturing; financial; intelligence analysis; public
policy; marketing.

Data mining and its techniques can be applied to many areas in business and in many
different businesses. The uses of data mining techniques described below have been
extracted from the literature. They apply in the sectors that make use of the data and
credit bureaus as well as in the bureau sector itself.
2.3.1 Targeting / Predictive / Descriptive models
These models typically calculate a value that represents possible future activity. This could
be a purchase amount or the likelihood of an action, such as a response to an offer or
defaulting on a loan (Parr Rud, 2001).
They may include:
• Customer profiling and segmentation
Understanding the customer, in terms of demographics, attributes and behaviour, is the
first step in good customer relationship management.
Data mining enables understanding of who the customers are and how to split them
into segments that have the same or similar attributes. This leads to further mining to
enable steps like prospecting, scoring, propensity to buy and others as discussed later
(Vickery, 1997; Cabena et al, 1998; Gargano & Raggad, 1999; Lee & Siau, 2001; Parr
Rud, 2001; Geist, 2002; Nemati & Barko, 2003).
• Database marketing
Database marketing is a type of marketing segmentation used by businesses via data
mining. Data mining of customer databases has had a large impact on marketing in
organisations. Individual consumers can be targeted for direct marketing offers. The
value here is that the correct customer may be directly targeted with the correct offer,
saving time, money and effort and enabling a focused approach to marketing that
promises much better results. Algorithms are used to predict consumer behavior by
predicting which consumers would be most responsive to promotional and sales cam-
paigns (Forcht & Cochran, 1999).
The value and goal of this type of marketing is to attract new clients, retain profitable
clients or avoid high-risk clients, and multiple opportunities for this exist in data
mining of large databases. Increasing the response rates of direct mailing campaigns
by margins as small as 1-2% can have large impacts on ROI, and data mining is a
powerful tool in increasing response rates and ultimately of immense value to the
organisation (Cabena et al, 1998; Forcht & Cochran, 1999; Parr Rud, 2001; Apte, Liu,
Pednault, & Smyth, 2002).
• Customer attrition prediction
A growing risk in increasingly competitive markets is the loss or attrition of a
company's customers to competitors. Data mining is used to predict these customer
losses and to identify vulnerable customers so that steps may be taken to prevent or
mitigate attrition and thus save costs and effort in attracting new clients or spending
on attracting customers who depart before their lifetime value has justified the ex-
pense of attracting them in the first place (Cabena et al, 1998; Nemati & Barko, 2003).

• Credit scoring / Risk modelling
Credit scoring algorithms have the ability to consider and use many different factors
and variables in determining a customer's 'creditworthiness' and assigning a credit
limit or particular loan amount to that customer in either pre-scoring to extend a
marketing offer or when the customer applies for credit. This is very valuable in
assuring that a customer does not have a line of credit extended to them that they
cannot or will not repay. This has a knock-on effect in savings of time, effort and
expenditure in preventing unnecessary collections and administration. Numerous com-
panies have used data mining in developing credit risk scores for their own use or for
selling on to other users (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud, 2001;
Geist, 2002; Nemati & Barko, 2003).
Customers' data is mined and algorithms applied in an attempt to determine who
the higher-risk clients are so that these may be either avoided or a different interaction
strategy enacted to deal with them. An insurance company may for instance want to
determine the risk profile of clients to enable it to customise each client's policy
individually (Parr Rud, 2001; Apte et al, 2002).
• Customer value analysis
Performing customer value and lifetime value analysis allows managers to understand
their customer database in terms of revenue and risk. Mining the customers' data
assists in:
- Determining the risk category;
- Determining the amount of customer spend over a given period;
- Assigning a value to each customer that is used in determining the
company's interaction and dealings with each client on an individual basis
(Cabena et al, 1998; Parr Rud, 2001).
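The database marketing point made earlier in this section, that a response rate increase of only 1-2% can have a large impact on ROI, can be checked with simple arithmetic. All campaign figures in the sketch below (volume, cost per piece and margin per response) are hypothetical.

```python
def campaign_roi(pieces_mailed, response_pct, margin_per_response, cost_per_piece):
    """Return on investment of a mailing: (margin earned - cost) / cost."""
    cost = pieces_mailed * cost_per_piece
    responses = pieces_mailed * response_pct / 100
    return (responses * margin_per_response - cost) / cost


# Hypothetical campaign: 100 000 pieces at 0.50 each, 30.00 margin per response.
baseline = campaign_roi(100_000, 2, 30.0, 0.5)  # 2% response rate
improved = campaign_roi(100_000, 3, 30.0, 0.5)  # 3% response rate
```

With these made-up figures, lifting the response rate from 2% to 3% raises the ROI from 20% to 80%, because the mailing cost stays fixed while every extra response adds margin.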
2.3.2 Fraud prediction and identification
Fraud costs companies and economies millions of dollars, pounds or rands every year, and
with the increase in electronic transactions, credit cards and telephonic transacting it is
becoming even more prevalent. The masses of data available to companies allow them to
mine these transactions and applications in an effort to identify or predict fraud. The general
approach is to build a model of known, suspected or potential fraudulent behaviour and
then use data mining to identify similar occurrences. Data mining tools are valuable as
they learn the patterns of fraud and enable its identification and prevention (Cabena et al,
1998; Lee & Siau, 2001; Parr Rud, 2001).
2.3.3 Going concern prediction
Koh & Low (2004) researched this field and found that several researchers had developed
prediction models for making going concern predictions of companies. The suggested
models are based primarily on statistical methods. Koh & Low (2004) listed the following
examples: Altman (1982); Dopuch et al. (1987); and Koh (1991). This area of data mining
also includes bankruptcy prediction. Several studies listed by Koh & Low (2004) have dealt
with prediction models in the going concern context. These include models derived from
statistical methods such as multiple discriminant analysis, logit and probit analyses and
neural networks.
Altman (1968), Sung, Chang & Lee (1999), Beynon et al (2001) and Koh & Low (2004)
noted that discriminant analysis is the most widely used technique for going concern and
bankruptcy prediction.
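The best-known discriminant function of this kind is the Z-score of Altman (1968). The sketch below uses the commonly quoted coefficients and cut-off values for publicly traded manufacturing firms; this report does not itself reproduce the formula, and the function names and the example ratios are hypothetical.

```python
def altman_z(working_capital_ta, retained_earnings_ta, ebit_ta,
             market_equity_tl, sales_ta):
    """Z-score from five financial ratios.

    The first three and the last are ratios to total assets (ta);
    market_equity_tl is market value of equity over total liabilities.
    """
    return (1.2 * working_capital_ta
            + 1.4 * retained_earnings_ta
            + 3.3 * ebit_ta
            + 0.6 * market_equity_tl
            + 1.0 * sales_ta)


def zone(z):
    """Classify a score using the commonly quoted cut-offs."""
    if z > 2.99:
        return "safe"
    if z < 1.81:
        return "distress"
    return "grey"


# Hypothetical company ratios.
score = altman_z(0.2, 0.3, 0.1, 1.0, 1.5)
```

A score in the "distress" zone would flag a company as a possible going concern problem, while the "grey" zone signals that the ratios alone are inconclusive.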
2.4 The future direction of data mining and its techniques and the possible
uses of this
The literature review found mainly material relating to techniques and uses in other sectors.
Only one source was found describing possible future uses of data mining or future
techniques. It is possible that the bureaus have some ideas about their future use of
data mining, about new techniques, or about the possible uses of these.
The only source describing a possible direction of data mining was Mitchell (1999), who
speculated that the accuracy of predictions from data mining may be improved by inventing
more appropriate sets of features for describing the available data, provided the dataset was
large enough. It is suggested that this could lead to increased accuracy in many prediction
problems like customer attrition and credit repayments. More universities are also offering
data mining as a subject as there is a lack of skills in this area.
Research into the area of data mining could lead to more useful data visualization tools,
ways of supporting mixed initiative human-machine data exploration and more efficient data
warehousing and legacy data combinations (Mitchell, 1999).
Mitchell (1999:36) and Fayyad, Haussler & Stolorz (1996) further speculated that "progress
in data mining over the next decade was driven by three mutually reinforcing trends:
• Development of new machine learning algorithms that learn more accurately, utilize
data from dramatically more diverse data sources available over the Internet and
intranets, and incorporate more human input as they work,
• Integration of these algorithms into standard database management systems,
• An increasing awareness of data mining technology within many organizations and an
attendant increase in efforts to capture, warehouse, and utilize historical data to sup-
port evidence-based decision making."

Chapter 3: Research questions
The literature review for this research is in most respects quite comprehensive. However,
data mining in South Africa, and particularly in the credit and data bureau sector, is a
relatively new field, and although there is agreement amongst the authors of the respective
works in most fields, there are some areas of discrepancy. Most authors agree on the
techniques used and the uses of these techniques, but there is little literature on uses of
data mining in this sector and more specifically in South Africa. As a result of the literature
review the following questions arise:
3.1 What are common data mining techniques and algorithms?
A review of the literature produced the following list of techniques used in data mining and
these techniques could be used in the Credit and Data bureau sector for data mining:
Pure statistics (Lee & Siau, 2001)
• Basic Statistics (Forcht & Cochran, 1999; Beynon et al, 2001; Koh & Low, 2004)
- Probability distributions
- Inference
- Estimation
- Hypothesis testing
- Regression
- Discriminant analysis
• Clustering (Emory & Cooper, 1991; Vickery, 1997; Forcht & Cochran, 1999; Gargano &
Raggad, 1999; Chidley, 2002)
• Classification (Lee & Siau, 2001)
• Link analysis (Lee & Siau, 2001)
• Association rules (Caudill & Butler, 1990; Lee & Siau, 2001), and associative
memories (Gargano & Raggad, 1999)
Artificial intelligence methods (Lee & Siau, 2001; Koh & Low, 2004)
• Neural networks (Gargano & Raggad, 1999)
- Backpropagation (Gargano & Raggad, 1999)
• Expert systems (Jackson, 1990; Gargano & Raggad, 1999)
• Fuzzy logic (Gargano & Raggad, 1999; Zwick, 2004)
Genetic algorithms (Mitchell, 2005; Lee & Siau, 2001) and genetic programming
(Vickery, 1997; Lee & Siau, 2001)
Decision trees (Gargano & Raggad, 1999; Chidley, 2002; Koh & Low, 2004)
Data visualisation (Gargano & Raggad, 1999; Lee & Siau, 2001)
Rule induction methods (Gargano & Raggad, 1999; Beynon et al, 2001)
Data warehousing (Kimball et al, 1998; Forcht & Cochran, 1999; Gargano & Raggad,
1999)

3.2 What are the uses of these techniques?
A review of the literature gave the following uses of the different techniques used in data
mining that could be applicable to this sector. These may be where the
value in data mining lies for the bureaus and their users. Mitchell (1999) also believed that
techniques in one sector may influence techniques used in other sectors, so that data mining
feeds on itself: new techniques are developed in one sector because of the
influences of another sector. The research will attempt to determine if this is the case in the
credit and data bureau sector as well. Other uses found were:
• Targeting / Predictive / Descriptive models (Parr Rud, 2001)
- Customer profiling and segmentation (Vickery, 1997; Cabena et al, 1998; Gargano &
Raggad, 1999; Lee & Siau, 2001; Parr Rud, 2001; Geist, 2002; Nemati & Barko,
2003).
- Database marketing (Cabena et al, 1998; Forcht & Cochran, 1999; Parr Rud, 2001;
Apte et al, 2002).
- Customer attrition prediction (Cabena et al, 1998; Nemati & Barko, 2003).
- Credit Scoring / Risk modelling (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud,
2001; Apte et al, 2002; Geist, 2002; Nemati & Barko, 2003).
- Customer value analysis (Cabena et al, 1998; Parr Rud, 2001).
These techniques enable:
- An understanding of the customer and thus good customer relationship manage-
ment.
- Marketing to the correct customer who may be directly targeted with the correct
offer, saving time, money and effort and enabling a focused approach to marketing
that promises much better results.
- The attraction of new clients, retention of profitable clients or avoidance of high-risk
clients.
- Increased response rates for direct mailing campaigns, where margins as small as
1-2% can have large impacts on ROI.
- Savings in attracting new clients or spending on attracting customers who depart
before their lifetime value has justified the expense of attracting them in the first
place.
- Credit scoring of clients to ensure that a line of credit extended is not so large that it
forces a client into a position of overextension where they cannot or will not repay. This
has a knock-on effect in savings of time, effort and expenditure in preventing
unnecessary collections and administration.
• Fraud prediction and identification (Cabena et al, 1998; Lee & Siau, 2001; Parr Rud,
2001).
• Going concern prediction (Altman, 1968; Sung, Chang & Lee, 1999; Beynon et al,
2001; Koh & Low, 2004).

3.3 What is the future direction of data mining and its techniques in this sec-
tor and the possible uses thereof?
As there was only one source for a possible answer to this question, it is left quite open-
ended. Some possibilities are:
• New and more accurate means of prediction may be found using more appropriate
sets of features for describing the available data, provided the dataset was large enough,
• Increased accuracy in many prediction problems like customer attrition and credit
repayments,
• More useful data visualization tools, ways of supporting mixed initiative human-ma-
chine data exploration and more efficient data warehousing and legacy data combina-
tions,
• More efforts to train people in data mining as the skills are not common (Mitchell,
1999).

CHAPTER 4: RESEARCH METHODOLOGY
4.1 Qualitative Research Paradigm
The aim of this research is to identify and evaluate data mining techniques in the Credit and
Data Bureau sector and to expand on the body of knowledge available to managers in this
sector, and users of these data and techniques as clients of this sector. The research para-
digm for the research is qualitative in nature.
Qualitative techniques are intended more to determine 'what' things are than to determine
the quantity of those things. These techniques are not concerned with measurement and are
thus less structured than quantitative techniques and can therefore be made more respon-
sive to the needs of the respondents and to the nature of the subject being researched.
Typically qualitative techniques yield large volumes of very rich and descriptive data from a
limited number of individuals in a particular field (Walker, 1985).
The intent of qualitative research is to answer questions about the complex nature of
phenomena, often with the purpose of describing and understanding the phenomena from
the participants' point of view (Leedy & Ormrod, 2001:101).
Based on the characteristics of a qualitative paradigm given by Walker (1985) and Leedy &
Ormrod (2001), this approach is proposed for the following reasons:
• There is insufficient theory on the particular sector,
• The purpose of the research is to describe and explore,
• The research is not concerned with measurement,
• The variables are unknown,
• The research is context bound and encompasses personal views,
• The sample size is small,
• In-depth semi-structured interviews are to be used to collect data,
• The data gathered are explicitly interpretive, creative and personal.
Added to the assumptions made in Chapter 1 (1.5) of this document are particular
assumptions that are part of qualitative research. These were proposed by Creswell (1994) and
Marshall & Rossman (1989) and must also be considered:
• The participant's perspective on the social phenomenon of interest should unfold as
the participant views it, not as the researcher views it (Marshall & Rossman, 1989:82),
• The researcher interacts with what they are researching,
• The research is value-laden and biased (Creswell, 1994:5),
• Respondents in research see reality subjectively and in multiple ways,
• The language of the research is informal, with evolving decisions, a personal voice and
accepted qualitative words.

4.2 Descriptive research design
The qualitative design was in the form of a content analysis. This was described by Walker
(1985) and Leedy & Ormrod (2001) as being a technique that identifies patterns, themes or
biases in data on communication and the examination of this data allows the researcher to
determine if a hypothesis is supported or not. In this research the content analysis was done
on the transcripts of the interviews between the researcher and the respondents.
For this research in-depth semi-structured interviews were used as the method of data
collection. The interviews were based on a number of open-ended questions (Leedy &
Ormrod, 2001). In-depth interviewing is ideal for this kind of research and has been
described as a conversation with a purpose (Marshall & Rossman, 1989:82). Such interviews are
typically more like conversations than formally structured interviews. This assists in
uncovering the respondent's meaning and perspective while respecting the way
in which the respondent frames and structures the responses (Marshall & Rossman, 1989).
Advantages of using in-depth semi-structured interviews for data collection include (Marshall
& Rossman, 1989; Pirow, 1990; Creswell, 1994; Leedy & Ormrod, 2001):
• Interviews are a useful means of quickly obtaining large amounts of data.
• Respondents can provide historical background information.
• Interviews allow for the gathering of a wide variety of information and a large number
of different subjects.
• Immediate follow-up questions and clarification of points can be done.
• The researcher has control over both the questions asked and the environment.
• It is flexible and enables the researcher to prompt and probe as necessary.
• It enables the researcher to take cognisance of non-verbal behaviour.
• The researcher can alter the order of questions and ensure that all the questions are
answered.
Despite its many advantages the researcher is aware that skill and care are required in using
this method of collecting data. There are also some disadvantages associated with this
method of data collection and the researcher took care to be aware of these when
conducting the research. Marshall & Rossman (1989) and Creswell (1994) listed the following:
• Information provided by the respondent is colored by their own perspective,
• The interviewer must obtain the cooperation of the interviewee,
• Respondents may not be willing to share some (possibly sensitive) information,
• Respondents may not all be of the same level of articulation or perception,
• The researcher may not be able to ask the correct type of questions because of a lack
of technical expertise on the side of the researcher.
The researcher attempted to mitigate some of these disadvantages by:
• Continuously confirming with the respondent the intended meaning of their response,
• Not intentionally leading the respondent and avoiding colloquialisms and ambiguous
words.
4.3 Population and Sample
The population in this research can be considered to be all the data miners, data managers,
analysts, practitioners, facilitators, and vendors for and from all the credit and data bureaus
in the country, to the extent that they are subject matter experts on data mining. The
sample drawn contained the managers of the data mining or business intelligence
departments, analysts, directors and/or practitioners in these fields in these bureaus
and their vendors that are located within South Africa. The nineteen respondents can be
considered to form 100% of the population.
The respondents were not selected in a random fashion. The researcher at all times attempted
to ensure that they were experienced and knowledgeable enough in the area of study
(Creswell, 1994) and to be objective in their selection (Walker, 1985). The sample design is
thus purposive (Walker, 1985:30).
The small number of data and credit bureaus in South Africa limited the sample size. The
sample was drawn from the bureaus and their vendors directly, specifically from the ranks of
the data mining, business intelligence and managerial areas.
The selection of experts in the field used the criteria below, which ensured that each respondent
was able to comment, from an informed position, on the techniques, uses and trends in
data mining in the credit and data bureau sector. The opinions expressed during the interviews
should be based on a sound knowledge of this sector and of data mining.
The criteria were:
• The expert should be involved in data mining, having implemented or had management
oversight of a data mining project in South Africa;
• The expert should occupy a senior or management position in the organisation;
• The expert should have experience in the products and uses of data mining in the
sector;
• The expert should have at least three years' experience in the field;
• The expert should be available for a one-hour interview;
• The organisation the expert represents should not have an objection to the expert
partaking in the research.
In total, nineteen interviews were conducted during the research process. Every major
credit bureau, all of the minor credit bureaus except one, both the data bureaus and
every vendor that engaged with the bureaus on data mining had at least one person who
met the criteria to qualify as an expert to be interviewed in this field. One of the vendors
interviewed had extensive experience in data mining, but none with the South African credit
bureaus.
The researcher approached respondents from the institutions listed in the table below
and obtained each institution's agreement to participate in the research:

Institution                   Sector         Name                      Designation
Experian                      Credit Bureau  Craig Capper              Director: Product Development, Business Intelligence and Marketing
                                             Alan Broderick            BI Manager
                                             Marlize Buys              Scoring Analyst
                                             Gerhard Bos               Scoring Analyst
Micro Lenders Credit          Credit Bureau  Fred Steffers             Director
Bureau (MLCB)
Kredit Inform (KI)            Credit Bureau  Mike Hussey               Manager
Effective Intelligence        Data Bureau    Julian Ardagh             Managing Director
                                             Gerhard Botha             IT Systems Manager
P-Cubed                       Vendor         Raoul Miller              Managing Director
ETL                           Data Bureau    Andy Quinan               Managing Director
CompuScan                     Credit Bureau  Jaco Alberts              Director
Raptor                        Vendor         Mark Heyman               Analyst
                                             Riaan Berg                Analyst
SAS                           Vendor         Stacey Caddick            Account Manager
TransUnion ITC                Credit Bureau  John Fourie               Director - Analytics and Consulting
                                             Leslie Harrison           Business Consultant
                                             Emmerenthia van Greining  Statistical Analyst
                                             Warrick Thomas            Data Warehouse Architect
TransUnion ITC Decision       Credit Bureau  Thamir Hassan             Managing Director
Support Services (TUDSS)
Table 1: Organisations that agreed to partake in the research.
4.4 Data collection
The institutions were contacted formally in writing, detailing the nature, purpose and meth-
odology of the research and requesting their formal approval of their participation. The
respondent nominated by each institution was contacted initially by telephone to invite them
to participate in the research and to inform them of the purpose of the research, subjects to
be covered and the research process and methodology, including the expected duration of
the interview.

A formal written communication was sent by e-mail thanking the respondent for being
willing to participate in the study and confirming the place, date and time of the interview.
Each respondent was offered a copy of the research report as an incentive for participating in
the study. Respondents were guaranteed that their responses would be confidential and
remain anonymous (refer to Appendices 1 and 2 for copies of the written request and telephone
protocol).
The interviews were in-depth and semi-structured, and took place at a site
convenient to the respondent. As the researcher knows many of the respondents personally,
the locations for the interviews tended to be informal. This was aimed at putting the respondents
at ease and enabling them to discuss the research questions more easily with the researcher.
Each interview was audiotaped with the permission of the respondent. Notes were also taken
as the interview progressed.
Creswell (1994:152) suggested the following protocol and the researcher attempted to fol-
low this for each interview (Refer Appendix 3 for a copy of the Interview Protocol). The
components of the protocol are as follows:
• (a) a heading,
• (b) instructions to the interviewer (opening statements),
• (c) the key research questions to ask,
• (d) probes to follow key questions,
• (e) transition messages for the interviewer,
• (f) space for recording the interviewer's comments, and
• (g) space in which the researcher records reflective notes.
Care was taken not to lead respondents in their response during the course of the interview.
4.5 Data analysis
Unlike quantitative research, where the process is linear, here data analysis took place at the
same time as the collection and interpretation of the data and the writing of the report
(Creswell, 1994).
The following procedures were used in analysing the data (Walker, 1985; Creswell,
1994; Leedy & Ormrod, 2001):
1. The taped interviews were transcribed,
2. The notes made during the interview were reviewed immediately after the interview
and additional comments and thoughts added,
3. The data were organized into categories, coded and interpreted through the use
of schemas,
4. The data were integrated and synthesized. This was represented in the form of matrices.

In addition, the frequency of each identifiable factor uncovered in the transcripts was tabulated.
This informed the researcher as to the perceived importance of each factor
across the respondents. No statistical analysis was performed on these results.
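The tabulation of factor frequencies can be illustrated with a minimal sketch in Python. The respondent labels and factor names are hypothetical placeholders and not data from this research, and the sketch assumes that each factor is counted at most once per transcript; the exact counting rule used in the research is not stated here.

```python
from collections import Counter

# Hypothetical coded transcripts (illustrative only): each respondent maps
# to the set of factors identified while coding that interview.
coded_transcripts = {
    "respondent_01": {"regression", "segmentation", "profiling"},
    "respondent_02": {"regression", "decision trees"},
    "respondent_03": {"segmentation", "regression", "visualisation"},
}


def tabulate_factors(transcripts):
    """Count the number of transcripts in which each factor appears."""
    counts = Counter()
    for factors in transcripts.values():
        # A set holds each factor once, so a factor is counted at most
        # once per respondent (an assumption of this sketch).
        counts.update(factors)
    return counts


frequencies = tabulate_factors(coded_transcripts)
for factor, occurrences in frequencies.most_common():
    print(f"{factor}: {occurrences}")
```

Sorting the counts in descending order gives the same kind of ranking as the occurrence tables in Chapter 5, without any statistical analysis being applied.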
4.6 Validity and reliability
The validity of research is determined by its internal and external validity.
Internal validity is the extent to which its design and the data that it yields allow the
researcher to draw accurate conclusions about cause-and-effect and other relationships within
the data (Leedy & Ormrod, 2001:103), and external validity is the extent to which its
results apply to situations beyond the study itself (Leedy & Ormrod, 2001:105).
4.6.1 Internal validity
The importance of internal validity is in attempting to find other possible explanations for
the results obtained in the research (Leedy & Ormrod, 2001). The internal validity of this
research was checked by asking the respondents whether they agreed with the accuracy,
objectivity and reliability of the conclusions made by the researcher. Each respondent was
given a copy of the findings and requested to add any comments.
4.6.2 External validity
The intent of qualitative research is not to generalise the findings to the population,
but to attempt to interpret the event from a unique perspective (Creswell, 1994). The validity
criterion used in this research is that it is well argued and believable, and the purposive
sample should reflect the views of the general population.
4.6.3 Reliability
Similar research conducted in a different context within the same industry is unlikely to reach
different conclusions, but research in a different industry could do so. The reliability of
the research is therefore limited.
Marshall & Rossman (1989:148) suggested that the researcher purposefully avoids controlling
the research conditions and concentrates on recording the complexity of situational
contexts and interrelations as they occur. It is unlikely that future researchers will replicate
the research by altering research strategies and it is discouraged (Marshall & Rossman,
1998).
4.7 Completion of the research report
The research report was then written, identifying the dominant themes in this sector and
commenting about the applicability of the different algorithms and techniques and their
various uses in this sector.
The interview transcripts were summarized and each use assigned to two categories. The
methodology followed here was that used by Chidley (2002).
The first use category was based on the terms used by the respondents during the interviews
in describing the specific data mining projects they had worked on and/or the
specific uses they assigned to or equated with each data mining technique or algorithm.
The second categorization was done by using the generic data mining uses taken from the
literature. The aim of the specific project and use referred to by the respondent was compared
to the generic use category and, if there was a match, the project or stated value and
use was assigned to that category. Sometimes the process followed in the actual data mining
was analysed and a category assigned to the project or technique used.
The interview data, processed in this way, were used as the basis for the results and their
interpretation in this research report.

Chapter 5: Results
5.1 Common data mining techniques and algorithms
In his work on data mining in the banking sector, Chidley (2002) proposed a metric based
on the findings of his literature review. This metric was compared to what was
found in the literature review for this research report, and the categorization was
virtually identical. The common techniques and algorithms identified in section 2.2 were
compared to Chidley's findings and distilled into a single model showing how each technique
relates to the others. This new model is shown below:

Pre-preparation
  Pure Statistics:
  • Mean
  • Standard deviation
  • Graphical representation
  • Probability distributions
  • Inference
  • Estimation
  • Hypothesis testing
  • Regression
  • Discriminant analysis
Data mining techniques
  Branches: Statistical (Interdependence, Dependence) and Artificial Intelligence
  Clustering:
  • Conceptual clustering
  • Interactive clustering
  • K-nearest neighbor
  • Memory based reasoning
  Classification:
  • Discriminant analysis
  • Logistic regression
  Neural networks:
  • Network construction and training
  • Network pruning
  • Rule extraction
  Classification:
  • Rule induction
  Decision Trees:
  • CHAID
  • CART
  Backpropagation
  Decision Trees:
  • CART
  • MARS
  General additive models
  Expert systems
  Fuzzy expert systems
  Link analysis
  Associative rules
  Associative memories
Visualisation
  • Scatterplot matrices
  • Prospecting matrices
  • Parallel coordinates
  • Projection matrices
  • Geometric projection techniques
Data Warehousing
  • Extraction, Transformation, Loading (ETL)
  • Data marts
  • OLAP
  • MIS
Figure 2: Data mining techniques

The techniques and algorithms found were categorised to enable the manager to understand,
easily and at a single glance, the techniques and algorithms used and to match these to the
possible uses of these techniques as described in this report.
Interviews were conducted with all the members of the expert panel with a view to establishing
the techniques used in data mining in the credit and data bureau sector. It was clear from the
interviews that there were numerous techniques referred to by the members of the panel,
and invariably the same terminology was used to describe the different techniques.
There were thirty-five techniques mentioned during the interviews and these are listed in the
table below:

Table 2: Data mining techniques used in this sector
No.  Techniques                                                            Total Occurrences
1    Mathematical Algorithms/Basic Statistics e.g. Means, Std. Dev, Avg.   15
2    Regression                                                            15
3    Segmentation                                                          14
4    Artificial Intelligence                                               11
5    Profiling                                                             10
6    Decision Trees                                                        7
7    Visualisation                                                         7
8    Data Warehousing                                                      6
9    Predictive Modeling                                                   5
10   Seasonality                                                           3
11   Cluster Analysis                                                      3
12   Chi-square                                                            3
13   GIS                                                                   2
14   Classification                                                        2
15   CHAID                                                                 2
16   Time series analysis                                                  2
17   Probability Analysis                                                  1
18   Trending                                                              1
19   Delphi-Technique                                                      1
20   Response Modeling                                                     1
21   What-If Analysis                                                      1
22   Iterative Searching                                                   1
23   Pareto Analysis                                                       1
24   Hypothesis Testing                                                    1
25   Bio-statistics                                                        1
26   Expert System                                                         1
27   Contingency Tables                                                    1
28   Correlation Analysis                                                  1
29   Genetic Algorithms                                                    1
30   Rule Based Analysis                                                   1
31   Multivariate Analysis                                                 1
32   Covariate Analysis                                                    1
33   Co-joint Analysis                                                     1
34   Trend Analysis                                                        1
35   Pattern Recognition                                                   1
     Totals                                                                126

These thirty-five techniques were classified into the following categories:
Table 3: Summary of data mining techniques in this sector
Category              Technique                                    # of occurrences in interviews
Descriptive stats     Base statistics e.g. mean, std. dev. etc.    15
                      Visualisation                                7
                      Seasonality                                  3
                      Time series analysis                         2
                      Trending                                     1
                      Pareto analysis                              1
                      Hypothesis testing                           1
                      Trend analysis                               1
                      Iterative searching                          1
                      Pattern recognition                          1
Inferential stats     Regression                                   15
                      Decision trees                               7
                      Chi-square                                   3
                      CHAID                                        2
                      Probability analysis                         1
                      Delphi-technique                             1
                      Response modeling                            1
                      Multivariate analysis                        1
                      Covariate analysis                           1
                      Co-joint analysis                            1
                      Contingency tables                           1
                      What-If analysis                             1
Data reduction        Segmentation                                 14
                      Profiling                                    10
                      Cluster analysis                             3
                      Classification                               2
                      Correlation analysis                         1
                      Rule based analysis                          1
                      Predictive modeling                          5
Numerical techniques  Artificial intelligence                      11
                      Bio-statistics                               1
                      Expert system                                1
                      What-If analysis                             1
                      Genetic algorithms                           1
Other                 Data warehousing                             6
                      GIS                                          2
                      Response modeling                            1

5.1.1 Descriptive statistics
Every single bureau had a respondent speak of using simple mathematical algorithms, e.g.
means, standard deviations, averages and so on. Fifteen of the nineteen respondents indicated
that because of the large volumes of data they dealt with, the more basic mathematical
algorithms and statistical techniques were invaluable in:
• determining which parts of data sets could and/or should be mined,
• achieving a better understanding of what was contained in the data sets,
• getting a visual feel of the data,
• standardizing different data sets,
• matching different data sets,
• excluding bad or corrupt data,
• improving the quality of data.
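As an illustration only, the sketch below shows how the mean and standard deviation might be used to standardise a data set and to flag suspect values for exclusion. The balances, the function names and the threshold of two standard deviations are assumptions made for this example and were not reported by the respondents.

```python
import statistics


def standardise(values):
    """Convert values to z-scores using the sample mean and standard deviation."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample standard deviation
    return [(value - mean) / std_dev for value in values]


def flag_suspect(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    z_scores = standardise(values)
    return [value for value, z in zip(values, z_scores) if abs(z) > threshold]


# Hypothetical account balances; the last entry imitates a capture error.
balances = [1200, 1350, 1100, 1280, 1420, 1310, 99999]

print("mean:", round(statistics.mean(balances), 2))
print("standard deviation:", round(statistics.stdev(balances), 2))
print("suspect values:", flag_suspect(balances))
```

A single extreme value inflates both the mean and the standard deviation, so in practice a flagged record would be reviewed before it is excluded.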
Of the nineteen people interviewed, seven indicated that they also made use of visualisation
techniques to better understand their data sets and the results of their data
mining exercises, and also to highlight any discrepancies in their analysis.
Further mention was made of the other techniques listed in the above table in this category,
but mostly by single individuals. Interestingly, only one person made use of the term
hypothesis testing, although it was obvious from the interviews with virtually every single
person that all the data mining used some form of hypothesis testing, in that they were
hypothesizing as to the outcome of particular tests.
5.1.2 Inferential statistics
Of the eight bureaus, seven mentioned that they used regression in one form or another,
whether it was logistic regression, linear regression, or bivariate, multiple or stepwise
regression.
Fifteen of the nineteen respondents indicated that regression analysis played a large role in
the data mining done by the bureaus.
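The simplest of the regression forms mentioned, linear regression with a single predictor, can be sketched as follows. This is a minimal ordinary least squares fit written for illustration; the variable names and the data are hypothetical and do not represent any model built by the bureaus.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-products of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict y for a new value of x from the fitted line."""
    return intercept + slope * x


# Hypothetical data: years of credit history against an illustrative score.
years_of_history = [1, 2, 3, 4]
score = [3, 5, 7, 9]

slope, intercept = fit_line(years_of_history, score)
print("slope:", slope, "intercept:", intercept)
print("prediction for 5 years:", predict(slope, intercept, 5))
```

Logistic, multiple and stepwise regression extend the same idea to binary outcomes and to several predictors, and would normally be fitted with a statistical package rather than by hand.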
Decision trees were mentioned by seven of the nineteen respondents, but used by only the
three bigger credit bureaus and both the data bureaus. Of this series of techniques, Chi-square
and CHAID were mentioned by one of the larger consumer credit bureaus and one of
the data bureaus as techniques they specifically used, as these were well suited to shorter
time continuums and were excellent for explaining response models, an area that all of these
bureaus were moving into more and more.
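CHAID chooses its splits by means of chi-square tests of association between a candidate predictor and the outcome. The sketch below computes the Pearson chi-square statistic for a small contingency table as an illustration of that building block. The table of responders and non-responders is hypothetical, and the sketch does not implement the full CHAID procedure of merging categories and selecting splits.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical response to a mailing, by customer segment.
# Rows: segment A, segment B. Columns: responded, did not respond.
response_table = [
    [30, 10],
    [20, 40],
]

print("chi-square statistic:", round(chi_square(response_table), 3))
```

A larger statistic indicates a stronger association between segment and response, which is what makes a predictor attractive as a split in a response model.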
5.1.3 Data reduction techniques
This category of techniques was well represented amongst all the bureaus. They all used
segmentation or classification, typically segmenting databases of customers into groups
that are as homogeneous as possible with respect to a variable such as creditworthiness.
Fourteen of the nineteen respondents listed this as an important part of data mining in this
sector, and this was also the second most referred to technique.
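One simple way of forming groups that are similar with respect to a single variable is to rank the customers on that variable and cut the ranking into bands. The sketch below assumes a hypothetical creditworthiness score per customer; the customer labels, scores and number of segments are invented for illustration, and the bureaus may well use more sophisticated methods.

```python
def segment_by_score(customers, n_segments):
    """Split (customer, score) pairs into n_segments bands of near-equal size."""
    ranked = sorted(customers, key=lambda customer: customer[1])
    size, remainder = divmod(len(ranked), n_segments)
    segments = []
    start = 0
    for index in range(n_segments):
        # The first `remainder` segments take one extra customer each.
        end = start + size + (1 if index < remainder else 0)
        segments.append(ranked[start:end])
        start = end
    return segments


# Hypothetical customers with an illustrative creditworthiness score.
customers = [
    ("A", 710), ("B", 540), ("C", 650), ("D", 480),
    ("E", 690), ("F", 600), ("G", 720),
]

segments = segment_by_score(customers, 3)
for number, segment in enumerate(segments, start=1):
    print(f"segment {number}:", segment)
```

Because neighbouring customers in the ranking have similar scores, each band is more homogeneous in creditworthiness than the customer base as a whole.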

All of the bureaus also referred to profiling. Although the specific term was not found in the
literature review, the techniques described by the bureaus match those described in the
literature on classification. Some respondents also referred specifically to classification and
cluster analysis when describing these techniques.
5.1.4 Numerical techniques
Only one of the credit bureaus was using these techniques in conjunction with an external
vendor who was also interviewed. The techniques used included Artificial Intelligence, neu-
ral networks and to a lesser degree bio-statistics.
5.1.5 Other techniques
The other techniques mentioned here were data warehousing and Geographical Information
Systems (GIS). As was found in the literature, the larger credit bureaus and both the data
bureaus were using data warehousing to organise large volumes of historical information
gathered from large-scale client/server-based applications for further analysis.
Only one of the data bureaus mentioned using GIS techniques to mine its data, but was
unwilling to provide further information as it was considered too sensitive at the time.
5.2 The uses of data mining in the credit and data bureau sector
During the interviews, the members of the expert panel mentioned several uses of data
mining. Some of these uses were described in different ways, but were clearly the same
thing and these uses were categorised by the members in nearly identical fashion. Given the
small size of this sector in South Africa, this is hardly surprising.
In total, eighteen uses of data mining were discovered during the interviews. Each use will be
discussed in the following sections. The table below lists the uses found
during the interviews.

Table 4: Uses of data mining in this sector
No.  Uses of data mining                                                                  Total Occurrences
1    Direct marketing, including identifying marketing opportunities                     18
2    Risk modeling including, predicting probability of liquidation or default and
     decreasing default rates                                                             18
3    Credit Scoring                                                                       10
4    Prediction/Prevention of fraud, including forensic Investigations                   6
5    Behavioral Scoring                                                                   5
6    Response models                                                                      5
7    Macro + Micro Economic trends                                                        5
8    Predicting income                                                                    4
9    Predictive modeling                                                                  4
10   Determining the risk of debtors books                                               3
11   Industry comparisons                                                                 3
12   Determining credit limits                                                            3
13   Predicting probability of collection                                                 2
14   Prediction of attrition                                                              2
15   Predicting client exposure/affordability                                             1
16   Predicting lapses in policies                                                        1
17   Competitive analysis                                                                 1
18   Tracing                                                                              1
The table shows that the dominant uses of data mining in the credit and data bureau sector
in South Africa are direct marketing, risk modeling and credit scoring. Every single bureau is
doing some form of risk modeling and in some way assisting its clients with direct marketing,
whether via cleansing of data or the creation of mailing or telephone lists. While the main
use of the techniques described above for the credit bureaus is still risk modeling and
assisting their clients in preventing or predicting default, bad debt or liquidation, assistance
with direct marketing now features just as much. Eighteen of the nineteen respondents
mentioned these uses of the techniques above.
Specific mention was made of the identification of marketing opportunities, the creation of
strategies based on existing market segmentation, the growing role of behavioral scoring
(mentioned five times at four of the bureaus) and of the profiling and prediction abilities of
data mining.
The predictive abilities of data mining for use in direct marketing were mentioned in different
forms on fourteen separate occasions during the interview process. These included predicting
income, attrition, client affordability and/or exposure, and probability of accepting an
offer.
The third most used, credit scoring, was mentioned by only ten of the nineteen respondents.
It was not in use by the data bureaus and was used by only seven of the credit bureaus,
not all of them as would be expected.

All of the uses of data mining mentioned in the literature were found at some of the bureaus,
with fraud and the prevention, detection and prediction thereof being a use at both the data
bureaus and four of the eight credit bureaus.
Uses that were not found in the literature included the tracking of macro- and micro-economic
trends, which was a new use of the data mining techniques at two of the credit bureaus,
one of the data bureaus and one of the vendors. Another use that was not
specifically mentioned in the research was that of using data mining techniques for the
tracing of debtors. This may be because of the stringent privacy laws internationally.
5.3 Future techniques and their possible uses
No specific techniques or possible techniques were mentioned in the interviews, and all the
respondents felt that data mining was still too new in their sector for them to be able to
predict any possible new techniques.
Eight of the respondents mentioned that they thought there should be some form of data
standardization and/or data set standards and/or one data standard for all elements in the
future.
Behavioral scoring was mentioned by six of the respondents as a definite new direction for
data mining in the sector with growing interest from all sectors of their client bases.
Artificial Intelligence techniques and their possible application were mentioned as possible
future techniques by six of the respondents who were not currently using them, but all
of them said that they had no experience of these techniques and only thought they might be a
possibility to look into in the future, particularly for fraud prevention and predictive modeling
for direct marketing.

Chapter 6: Discussion
6.1 Common data mining techniques and algorithms
Before any real data mining is done on a data set, a basic understanding of the
data is needed. For this, the basic statistical
techniques are typically used. There are many more statistical techniques than those described
in this research, but those mentioned here were found to be the ones most commonly
used to gain an understanding of data and in preparation for further data mining.
The next step in the process of data mining is divided into two broad categories:
• Statistics, and
• Artificial Intelligence
The major difference between these two areas is that the field of statistics has its basis in the
science of pure mathematics and the field of pure statistics, and has undergone rigorous
mathematical proofs. Artificial Intelligence techniques are not necessarily subject to these
same rigorous mathematical proofs, but instead derive primarily from machine learning.
The two areas that follow, Visualisation and Data Warehousing, are both areas that display
the results of data mining. In data warehousing the data may be explored even further
using Online Analytical Processing (OLAP), resulting in Management Information Systems
(MIS) which may be visual in nature, whereas visualisation techniques usually assist in
the interpretation of the results of data mining in the form of charts and graphs. Another
visualisation technique is Geographical Information Systems (GIS), where data is converted
into spatial information and graphically displayed in the form of maps of areas, suburbs,
municipal areas and the like. This is an innovative way not only to display the
findings of data mining graphically, but also to make them easily and visually understandable
to a large audience base.
6.1.1 Descriptive statistics
When this category is compared to what was found in the literature, it is clear that all of the
bureaus are using all of the techniques found for this category of descriptive statistics when
embarking on data mining, particularly in the initial stages of their projects or for smaller
scale projects. Although no specific reference was made to probability distributions or esti-
mation techniques, it was again obvious from a large number of interviews that these
techniques, although not specifically named, were being applied in some form.
6.1.2 Inferential statistics
That all the bureaus except one mentioned regression in one form or another was not
surprising, as the literature review found that this is the most important of
all the multivariate techniques available. The one bureau that did not mention this technique
was also one of the smaller bureaus that did no data mining at all.
The respondents indicated that regression and decision trees, including Chi-square and CHAID
techniques, were being used more and more, and that their value in creating credit policies,
deriving the most predictive values and assessment and predictivity of regression analysis
was excellent. This is also what was found in the review of the literature. In fact, Gargano &
Raggad (1999), Chidley (2002) and Koh & Low (2004) confirmed that the strengths of these
techniques were in their ability to generate more understandable rules and their ability to
indicate the relative importance of the variables for classification and prediction.
6.1.3 Data reduction techniques
The results found in Chapter 5, with this category of techniques being well represented,
match what Lee & Siau (2001) described. They gave the example of a typical classification
problem as being the division of a database of customers into groups that are as
homogeneous as possible with respect to a variable such as creditworthiness, which is exactly
what the bureaus in this sector are doing when using classification.
The bureaus all use data mining techniques like segmentation and classification
to further filter and refine their data sets and also the data sets of their clients for
particular variables, typically creditworthiness, affordability or profiling based on certain
demographic characteristics.
6.1.4 Numerical techniques
When the bureaus that were not using these techniques (all except one) were queried on this,
the almost standard reply was that they believed these techniques
were too complex and difficult, and did not yield results that were worth the additional effort and
expense. It was interesting to note from the literature review that these techniques are
considered to be widely and successfully used in many other industries, specifically in banking
and financial institutions.
6.1.5 Other techniques
Data warehousing enables the bureaus that use this technique to analyse very large volumes
of data quickly and effectively, and then to build models using other
techniques like regression analysis, for example behavioral scoring models.
All the bureaus mentioned the importance of this technique, but also listed the prohibitive
cost and the very low rate of successful implementation of data warehousing projects in all
sectors as an obstacle to implementing data warehouses in their companies.
Only one of the data bureaus mentioned using GIS techniques to mine its data, but was
unwilling to provide further information as it was considered too sensitive at the time. It will
be interesting to see what the use of this technique will be and if the other bureaus follow
suit should this prove to be a successful tool for the particular bureau.
Looking at the findings above, it is clear that the common data mining techniques in this
sector closely match those found in the literature review, so it can be accepted that the
sub-problem "What are the common data mining techniques and algorithms?" has been
answered by this research.

6.2 The uses of these techniques
The dominant uses of data mining in the credit and data bureau sector in South Africa are
direct marketing, risk modeling and credit scoring. Every single bureau is doing some form of
risk modeling and in some way assisting its clients with direct marketing, whether via cleansing
of data or the creation of mailing or telephone lists.
The bureaus and their clients are all well aware that the bureaus hold the largest pool of
consumer information in the country, and that this is a good opportunity for the bureaus to
assist their clients in their direct marketing efforts.
The respondents reported a growing demand for these services, for the use of data mining
techniques to deliver them, and for the value added to their
clients. This was identical to the finding in the literature review that data mining is
a powerful tool in increasing response rates and ultimately of immense value to the organisation
in direct marketing, risk modelling and credit scoring.
Several of the respondents mentioned that there was a growing trend amongst their customers
to build their own scoring models, and that their clients relied on the bureaus less for
the actual models and more for specific data with which to build these models. The bureaus
were entering a new era where they were being used more in a consulting role and as the
suppliers of data for use by their clients in the construction of their own scorecards, whereas
in the past they used to build these models for their clients.
Another use that was not specifically mentioned in the research was that of using data
mining techniques for the tracing of debtors. This may be because of the stringent privacy
laws internationally. Several times mention was made by respondents of the possible impact
of the new Credit Bill on the way that they conduct their business and on the way they use
the data mining techniques in future.
Although not all of the uses mentioned in the literature were found to be used at every
single bureau or even very widely in some cases, there were also some uses that were found
at the bureaus that were not found in the literature review. Overall a good description is
given of the uses of these techniques, thus addressing the research problem of what the
uses of these techniques are.
6.3 Future techniques and their possible uses
Every single respondent raised the new Credit Bill and the possible changes to legislation as
a factor that would influence any future data mining techniques and possible uses in this
sector. So far the quality standards for data in the sector have been self-regulating, but the
proposed bill would hold the directors of companies and in particular the bureaus personally
liable for any errors, and this was of grave concern to them.
All the bureaus believe that data standardization or a set of data standards will have an
enormous impact on the future of data mining in this sector and the techniques used as one
standard would make it simpler to combine data sets from different sources and ultimately
expedite the process of mining these data sets.

There was a general feeling amongst all of the respondents that data mining and data
management were becoming more and more important, not only in their own organisations,
but also in those of their clients. It was felt that the impact of data mining would continue to
increase and enjoy more and more focus into the future.
Five of the bureaus also had plans either to skill their own staff better in the statistical areas
of data mining or to expand their data mining areas with more statistically skilled individuals
with greater experience than the current individuals.
This question was the least satisfactorily answered of all the questions, as data mining is still
a relatively new field in this sector, as in most sectors in South Africa, and the respondents
were not too sure of what the future direction might be. All of the respondents nevertheless
had some specific thoughts on the subject, thus addressing the problem of what possible
future techniques may be.

Chapter 7: Conclusion and recommendations
Chapter one established the need for this research, the relevance of data mining and the
research objectives. Chapter two gave an overview of the literature on data mining that
provided an insight into the techniques used and the uses of these techniques.
Chapter three provided an overview of the research questions and a summary of the
literature findings. Chapter four covered the research methodology.
In Chapter five the research questions were answered based on the interviews that were
conducted and the findings compared to the literature review. The research questions
answered in this section were the common data mining techniques and algorithms, the uses of
these techniques and the possible future direction of data mining techniques in this sector
and the possible uses of these future techniques.
In Chapter six the findings were discussed and each sub-problem addressed.
On the whole, the majority of techniques and uses mentioned in the literature were found to
be in use, and put to similar use, in the credit and data bureau sector. There were no major
discrepancies between what was described in the literature and what was found during the
research.
Data mining in this sector is still in its infancy, with the majority of respondents having no or
very little formal training in either the specific data mining techniques mentioned or in
statistics of any form. There are one or two very experienced and qualified individuals in the
industry who appear to be setting the trend for the rest of the sector, with very much a
wait-and-see attitude being followed. Most of the senior managers interviewed agreed that there
was a growing demand for data mining, but did not see how they were going to be able to
meet this demand, and only one had a specific plan and strategy for training and upskilling the
individuals in their specific data area.
7.1 Business implications
Data mining is being used extensively in this sector, and there is a growing demand
for it by the clients of this sector. The sector has a very small body of individuals and
several of the organisations have strategic alliances. Given that the behaviour of one organisation
affects all of the others, it seems logical that there should be some effort toward establishing a
data mining forum and/or a data standards / mining association.
A set of data standards, together with the exchange of ideas and techniques, not only within
this sector but also between this sector and other sectors, could have an enormous impact
in the future.
It is suggested that this sector follow the example set by the larger banks in establishing a
data management / data mining forum amongst themselves and/or joining the current data
mining / management forums that exist in the banks. At least one of the vendors
interviewed indicated that there was an association of data management that the bureaus could
join and that regular forum meetings on data mining were held in other industries.

It is obvious from the research that, although data mining is taking place, many of the
individuals who are doing the mining have no formal training in specific statistical or data
mining techniques. On the whole they do have an excellent idea of what they need
to achieve, and are at times applying certain techniques without being specifically aware of
the name or nature of the technique. With specific training and upskilling of resources, data
mining will enter a new level in this sector.
Specifically, educating analysts and so-called data miners in the more basic statistical
techniques, or employing individuals with a strong background in statistical techniques, should go
a long way toward improving the skills of individuals and the specific results achieved by
data mining departments in these organisations.
Nearly every one of the organisations interviewed had some sort of link to other
international organisations and these international links should be explored and exploited as the
international companies are very far advanced in the area of data mining, as was clear from
the research.
A suggestion might be for the bureaus to contract with the various tertiary organisations or
to form strategic alliances with them in the development of their data management and
mining capabilities. This will have a two-fold benefit: students with the necessary
academic training will be able to apply the various statistical and/or data
mining techniques they have learnt about, and bureaus should be able to explore new
avenues and uses of data mining for their clients.
7.2 Suggestions for further research
Given that nearly every individual interviewed raised the possible implications of the Credit
Bill, there is an opportunity and possibly a need for the impact and/or implications of the
new Credit Bill to be researched.
This research could also be duplicated in other industry sectors and the results then
compared to what has been done here and also previously on the banking sector by Chidley
(2002).
The impact of data warehousing on data mining would also be of interest to the business
world, as more and more organisations are embarking on data mining and also data
warehousing. Given the enormous costs of data warehousing projects, a quantification of the
benefits of data warehousing and the specific impact on data mining and organisational
profits would be very valuable to organisations.
Another area of research that could be explored would be the often-highlighted area of
behavioural scoring or the data mining of behavioural data in this or other sectors in South
Africa. Several of the respondents interviewed indicated that the behaviour of clients in
different areas was of growing interest and that the impact of these findings was increasing.
A final suggestion would be for a study of the monetary or bottom-line quantification of the
effect of data mining on an organisation. There is an increased awareness of data mining
amongst the bureaus and also business as a whole in South Africa, and the investment in
infrastructure like computing hardware and staff and the necessary skills is often quite
substantial. Businesses would feel a much greater level of comfort in investing in data
mining if they had some idea of the results and quantifiable benefits they could expect from
such an exercise.

REFERENCES:
Altman, E.I. (1968): Financial ratios, discriminant analysis and the prediction of
corporate bankruptcy, Journal of Finance, 23(3):589-609.
Apte, C., Liu, B., Pednault, E. & Smyth, P. (2002): Business Applications of Data Mining,
Communications of the ACM, 45(8):49-53.
Benyon-Davies, P. (1996): Data Mining in Database Systems, London: Macmillan, 372-375.
Beynon, M., Curry, B. & Morgan, P. (2001): Knowledge discovery in marketing, An
approach through Rough Set Theory, European Journal of Marketing, 35(7):915-935.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. & Zanasi, A. (1998): Discovering Data
Mining – From concept to implementation, New Jersey: Prentice-Hall.
Caudill, M. & Butler, C. (1990): Naturally Intelligent Systems, Cambridge: MIT Press.
Chidley, C.T. (2002): An Evaluation of Data Mining Techniques in the Banking Sector,
Unpublished MBA Research Project, Johannesburg: University of the Witwatersrand.
Emory, W. & Cooper, D.R. (1991): Business Research Methods, fourth edition, Boston:
Irwin McGraw Hill.
Creswell, J.W. (1994): Research Design: Qualitative & Quantitative Approaches, London:
SAGE Publications.
Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996): From data mining to knowledge
discovery in databases, Artificial Intelligence Magazine, (Fall) 1996:37-51.
Fayyad, U., Haussler, D. & Stolorz, P. (1996): Mining scientific data, ACM, 1996:51-57.
Forcht, K. & Cochran, K. (1999): Using data mining and datawarehousing techniques,
Industrial Management & Data Systems, 99(5):189-196.
Gargano, M. & Raggad, B. (1999): Data mining – a powerful information creating tool,
OCLC Systems and Services, 15(2):81-89.
Gordon, A.D. (1999): Classification, London: Chapman & Hall.
Geist, I. (2002): A Framework for Data Mining and KDD, ACM, 2(02/03):508-513.
Jackson, P. (1990): Introduction to Expert Systems, second edition, Wokingham: Addison
Wesley Publishing Company.
Kimball, R., Reeves, L., Ross, M. & Thornthwaite, W. (1998): Architecture for the front
room in The Data Warehouse Lifecycle Toolkit, R.M. Elliot, P. Sobotka & B. Snapp (Eds.),
Canada: Wiley, 19-20, 26-27, 377, 401-403, 637-640.

Koh, H.C. & Low, C.K. (2004): Going concern prediction using data mining techniques,
Managerial Auditing, 19(3):462-476.
Lee, S.J. & Siau, K. (2001): A review of data mining techniques, Industrial Management
and Data Systems, 101(1):41-46.
Leedy, P.D. & Ormrod, J.E. (2001): Practical Research: Planning and Design, New
Jersey: Merrill Prentice Hall.
Lewis-Beck, M.S., Berry, W.D., Feldman, S., Fox, J. & Hardy, M.A. (1993): Applied
Regression: An introduction, in Regression Analysis, M.S. Lewis-Beck (Ed.), second edition,
Iowa: Sage Publications, 1-18.
Marshall, C. & Rossman, G.B. (1989): Designing Qualitative Research, London: Sage
Publications.
Mitchell, T.M. (1999): Machine Learning and Data Mining, Communications of the ACM,
42(11):30-36.
Mitchell, T.M. (2005): Machine Learning. Draft of chapter 1 for inclusion into the new
edition of the book Machine Learning, [http://www.cs.cmu.edu/~tom/mlbook.html]
(Accessed 26th February 2005).
Nemati, H.R. & Barko, C.D. (2003): Key factors for achieving organizational data-mining
success, Journal of Industrial Management and Data Systems, 103(4):282-292.
Parr Rud, O. (2001): Introduction in Data Mining Cookbook, R.M. Elliott, E. Herman, J.
Atkins & B. Snapp (Eds.), Canada: Wiley and Sons, Inc.
Pirow, P.C. (1990): How To Do Business Research, Johannesburg: Juta Publishers.
Rencher, A.C. (1995): Methods of Multivariate Analysis, New York: Wiley and Sons, Inc.
Sung, T.K., Chang, N. & Lee, G. (1999): Dynamics of modeling in data mining:
Interpretive approach to bankruptcy prediction, Management Information Systems, 16(1):63-86.
Vickery, B. (1997): Knowledge Discovery from Databases: An Introductory Review, The
Journal of Documentation, 53(2):107-122.
Walker, R. (1985): Applied Qualitative Research, Aldershot: Gower Publishing Company
Limited.
Zwick, M. (2004): An overview of reconstructability analysis, Kybernetes, 33(5):877-882.

APPENDIX A
The Written Request:
Company Name:
Company Address:
Postal Code:
2 May 2005
Attention:
MBA RESEARCH: AN EVALUATION OF DATA MINING TECHNIQUES IN THE CREDIT AND
DATA BUREAU SECTOR IN SOUTH AFRICA
Following your earlier indication of a willingness on the part of your organisation to assist in
the above mentioned research, please find attached further detail as well as the interview
questions.
The research is being conducted as part of the requirements for the completion of the Master
of Business Administration degree at the Wits Business School, which is part of the Faculty of
Management at the University of the Witwatersrand.
The topic of the research is: An evaluation of data mining techniques in the credit and data
bureau sector. It is to be a qualitative study where experts are interviewed by the researcher,
with the data gathered in the interviews forming the main body of the report.
The study should be of value to your organisation, the sector and users of the bureaus in
general in that it will provide a summary of the data mining techniques and algorithms used
and their uses.
The background for the study is provided by describing and classifying the main data mining
algorithms and techniques, and commenting on the generic uses to the end-user, the tools
used, and the possible future direction of data mining and its techniques in this sector and
the possible uses thereof.
The findings should assist managers and business people who interact with this sector to
better understand the techniques used, and the benefits of these techniques. Users get their
data from this sector and are not sure what the vendors have done to this data in order to
get the delivered results. If users understand the uses of data mining and the techniques and
tools used they could build on this or even request new or un-mined data to analyse.
Your organisation has been identified as a source of experts in the field who are to be
interviewed as part of the research.
The criteria for the experts are as follows:
• the expert is to be involved in data mining, having implemented, or had management
oversight of a data mining project in South Africa;
• the expert should occupy a senior or management position in the organisation;
• the expert should have experience in the products and uses of data mining in the
sector;
• the expert should have at least three years' experience in the field;
• the expert should be available for a one-hour interview;
• the organisation the expert represents should not have an objection to the expert
partaking in the research.
The interviews are to be conducted from the 23rd of May 2005 at a date and place of your
convenience.
The questions to be asked during the interview process are as follows:
• Please list the commercial bureau data mining projects that you have been involved in.
The aim of this question is to draw up a list of projects that have used data mining and
the discussion around these projects will form the basis for further discussion in the
interview.
• What was the aim of each of these projects?
This question is asked to determine the uses of the techniques used in data mining.
• What data mining techniques / algorithms were used during these projects?
The aim of this question is to determine the statistical and other techniques that are
used in data mining. The level of detail discussed does not have to be specific to the
steps taken, but only as to the particular techniques used. This will also lead into the
uses of these techniques.
• What do you believe may be the future direction of data mining and its techniques in
this sector?
• What do you believe the possible uses of this to be?
The interview should not take more than one hour.
If you have any additional queries regarding this research that you feel cannot be answered
by me, please do not hesitate to contact my supervisor for the project, Professor Neil
Duffy, a member of the faculty of the Wits Business School. He may be reached on telephone
717-3536 or via e-mail at duffy@megaweb.co.za.
Thank you for your positive response to date and I look forward to your further
contribution.
Yours sincerely
Gideon S. du Toit
MBA Student
(W) 011-679-4894
(F) 088-011-679-4894
(C) 082-450-3222
gsdutoit@kreature.co.za

APPENDIX B
Telephone Protocol
The purpose of the call: to set up an in-depth interview with the respondent.
Speaking to the Referred Expert Respondent
Hello Mr/ Ms ....................................., my name is Gideon du Toit. I am an MBA student at
Wits Business School and I am currently conducting research on data mining in the credit
and data bureaus in South Africa.
Mr/Ms ..................................... at/ in ..................................... suggested that I speak to
you about the possibility of you contributing to the research given your expertise in the area.
The research I am conducting explores the different techniques used in data mining and the
uses of these techniques to the bureaus and their clients. The research is based on in-depth
interviews with key players and experts in this field.
I would really like to discuss this area with you. The discussion should take about an hour of
your time. If you wish, your contributions will be anonymous and neither your name nor
your company will be linked to any statements or comments made.
A very exciting aspect of the research is the fact that no academic research currently exists
on data mining by the bureaus in the local market and I believe the results have a very
valuable contribution to make. For assisting with this research, you will receive a copy of the
research findings and I would also be glad to make a presentation of the findings at a
suitable forum, if you should wish.
Based on this, would you be interested in participating? [If yes, schedule date, time, place of
interview, and parking. If no, explore reasons for objections and if necessary, request
referral.]

APPENDIX C
Interview Protocol:
Data Mining in the Credit and Data Bureaus in South Africa
Details of the Respondent:
Respondent Name:
Title:
Organisation:
Contact Telephone Number:
Email Address:
Date:
Time:
Duration of Interview:
Opening Statement to Respondent:
Thank you for agreeing to contribute to this research. The interview should not last longer
than an hour. If you wish, your contributions will be anonymous and neither your name nor
your company will be linked to any statements or comments you make.
As I explained briefly in our telephone conversation, I am doing research into the techniques
used by the bureaus for data mining and the uses of these techniques. I would also like to
explore what may be the possible future direction of data mining and its techniques in this
sector and the possible uses thereof.
Key Interview Questions:
Interview Question 1
Please list the commercial bureau data mining projects that you have been involved in.
The aim of this question is to draw up a list of projects that have used data mining and the
discussion around these projects will form the basis for further discussion in the interview.
Prompt if needed:
Typical techniques I found during my literature review included things like cluster analysis,
regression analysis, data warehousing and so on.
Probe:
What was the use of this technique?
Prompt:
• Possibly this was used for segmentation or prediction of fraud?

The prompts will depend on the nature of the answer given above.
Probe:
What use did your clients get from the use of these techniques?
Prompt:
• The data may have been segmented to enable the customer to simply extract it and
use it for direct mailing.
The prompts will depend on the nature of the answer given above.
What do you believe may be the future direction of data mining and its techniques in this
sector?
What do you believe the possible uses of this to be?
Is there anything else you would like to add?
Closing Statement to Respondent:
Thank you very much for your time and for contributing to the research. The information
gathered today has been very valuable.
As soon as I have transcribed and analysed today’s interview I shall send you a copy of the
results for your comments and possible suggestions for changes.
Once the research is completed I shall send you a copy. Once again thank you and goodbye.

Interviewer's Comments:

Reflective Notes:
END
