Data Mining In Support Of Fraud Management
The techniques of data mining in support of Fraud Management
How to design a Predictive Model, by Marco Scattareggia – HP EMEA Fraud Center of Excellence Manager
Information Security magazine, May/June 2011
>> Data mining sits at the intersection of several rapidly evolving disciplines, statistics and artificial intelligence foremost among them. This article clarifies the main technical terms that can otherwise make the methods of analysis difficult to follow, in particular those used for predicting the phenomena of interest and for building the corresponding predictive models. Fraud management, like other industrial applications, relies on data mining techniques to make fast decisions based on the scoring of fraud risks. The concepts presented in this article come from the work done by the author while preparing the workshop on data mining and fraud management held in Rome, in the auditorium of Telecom Italia, on September 13, 2011, thanks to a worthy initiative of Stefano Maria de' Rossi, to whom the author gives his thanks.
About the author
Marco Scattareggia, a graduate in Electronic Engineering and Computer Science, works in Rome at Hewlett-Packard Italia, where he directs the HP EMEA Center of Excellence dedicated to the design and implementation of fraud management solutions for telecom operators.
Inductive reasoning, data mining and fraud management

Data mining, or digging for gold in large amounts of data, is the combination of several disciplines, including statistical inference, the management of computer databases, and machine learning, the branch of artificial intelligence research that studies self-learning systems.

Literally, data mining refers to extracting knowledge from a mass of data in order to acquire the rules that provide decision support and determine what action should be taken. Such a concept is effectively expressed by the term actionable insight, and the benefits to a business process, like fraud management, can be drawn from forecasting techniques. In data mining, this predictive analytics activity is based on three elements:
1. Large amounts of available data to be analyzed, providing representative samples for the training, verification, and validation of predictive models.
2. Analytical techniques for understanding the data, their structures, and their significance.
3. Forecasting models articulated, as in every computer process, in terms of input, process, and output; in other words, by predictors (the input), algorithms (the process), and the target of the forecast (the output).

In addition to the techniques of analysis, adequate tools and methods for data collection, normalization, and loading are also needed. These preliminary activities are highlighted in the early stages of the KDD (Knowledge Discovery in Databases) paradigm, and are generally found in products known as ETL (Extract, Transform, Load). By visiting the site www.kdd.org, you can see how data mining actually constitutes the analysis phase of the interactive process for extracting knowledge from data shown in Figure 1.

Figure 1
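To make the ETL step concrete, the following minimal sketch shows an extract-transform-load pass over call detail records in Python; the file names, the duration field, and the min-max normalization are illustrative assumptions, not features of any specific ETL product.

import csv

def extract(path):
    # Extract: read the raw call detail records from a CSV file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    # Transform: cast the duration field and normalize it to [0, 1]
    durations = [float(r["duration_sec"]) for r in records]
    lo, hi = min(durations), max(durations)
    for r, d in zip(records, durations):
        r["duration_norm"] = (d - lo) / (hi - lo) if hi > lo else 0.0
    return records

def load(records, path):
    # Load: write the cleaned records where the analysis tools expect them
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)

load(transform(extract("cdr_raw.csv")), "cdr_clean.csv")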
Besides the practical applications of data mining in an industrial context, it is also useful to examine Figure 2, which sets forth the evolution of the techniques of business analytics. It starts with the simple act of reporting, which provides a graphical summary of data grouped according to their different dimensions and highlights the main differences and elements of interest. The second phase corresponds to the activity of analysis, which seeks to understand why a specific phenomenon occurred. Subsequently, monitoring is the use of tools that let you keep under control what is happening, and finally, predictive analytics allows you to determine what could or should happen in the future.
Obviously, it should be pointed out that the future can be predicted only in probabilistic terms, and nobody can be one hundred percent sure about what really will happen. The result of this process is an ordering, a probabilistic ranking of the possible events based on previously accumulated experience. This activity, known as scoring, assigns a value in percentage terms, the score, which expresses the confidence we may have in the forecast itself, and it allows us to act consistently according to the score values. For example, in fraud management a high score corresponds to a high risk of fraud, and the consequent action could be to stop the service (e.g., the loan from a bank, the telephone line, the insurance protection, etc.), while a more moderate score may only require an additional investigation by the analyst.

This article will show how a fraud management application, designed as a business process, can benefit from data mining techniques and the practical use of predictive models.

It is interesting to note that the techniques of business analytics are derived from inferential statistics and, more specifically, from Bayesian probabilistic reasoning. Thomas Bayes' theorem on conditional probability answers the question "Knowing that there was the effect B, what is the probability that A is the cause?" In a nutshell, it gives the probability of a cause when its effect is known.

The article “How to build a predictive model”, published in May/June 2011 by the Italian Information Security magazine, explained how to calculate the probability of buying given the gender (man or woman) and the observed dressing style of the customers:
- During the construction of the model, the outcome or effect, which in the example is the positive or negative result of a purchase, is known, while the cause requires a probabilistic assessment and is the object of the analysis. The roles are reversed: knowing the effect, we look for the cause.
- When forecasting, the roles of cause and effect return to their natural sequence: given the causes, the model predicts the resulting effect. The gender of a person and his or her dressing style are the predictors, while the purchase decision, whether positive or negative, becomes the target to predict.

The analysis phase, during which the roles of cause and effect (i.e., the predictors and the target) are reversed, is referred to in predictive analytics as the supervised training of the model.

Figure 2
Figure 3 below shows the contingency table with exemplary values of the probabilities to be used in Bayes' theorem to calculate the probability of purchase for a man or a woman. It is like saying that, having analyzed the history of purchases and having been able to calculate or estimate the probability of the causes (predictors) conditioned on a specific effect (target), we can use a forecasting model based on Bayes' theorem to predict the likelihood of a future purchase once we know the person's gender and his or her dressing style.

Figure 3
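As a minimal sketch of this use of Bayes' theorem, the Python fragment below inverts the conditional probabilities for the purchase example; the numeric values are invented for illustration and are not taken from Figure 3.

p_buy = 0.4              # prior probability of a purchase, P(buy)
p_woman_given_buy = 0.7  # estimated from purchase history: P(woman | buy)
p_woman = 0.5            # overall share of women among observed customers

# Bayes' theorem: P(buy | woman) = P(woman | buy) * P(buy) / P(woman)
p_buy_given_woman = p_woman_given_buy * p_buy / p_woman
print(f"P(buy | woman) = {p_buy_given_woman:.2f}")  # prints 0.56

The same inversion, applied per gender and dressing style, is what the forecasting model evaluates when it scores a new customer.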
Bayes' theorem of the probability of causes is widely used to predict which causes are most likely to have produced an observed event. However, it was Pierre-Simon Laplace who consolidated, in his Essai philosophique sur les probabilités (1814), the logical system that is the foundation of inductive reasoning, now referred to as Bayesian reasoning.

The formula that follows is Laplace's rule of succession. Assuming that the results of a phenomenon have only two options, "success" and "failure", and that a priori we know little or nothing about how the outcome is determined, Laplace derived the way to calculate the probability that the next result is a success:

P = (s + 1) / (n + 2)

where "s" is the number of previously observed successes and "n" is the total number of known instances. Laplace went on to use his rule of succession to calculate the probability of the sun rising each new day, based on the fact that, to date, this event had never failed; he was, of course, strongly criticized by his contemporaries for this irreverent extrapolation.
The goal of inferential statistics is to provide methods for learning from experience, that is, for building models that move from a set of particular cases to the general case. However, Laplace's rule of succession, like the whole system of Bayesian inductive reasoning, can lead to blatant errors.

The pitfalls inherent in reasoning about probabilities are highlighted by the so-called paradoxes, which pose questions whose correct answers are highly illogical. The philosopher Bertrand Russell, for example, pointed out that, falling from the roof of a twenty-floor building and arriving at the first floor, you might incorrectly infer from Laplace's rule of succession that, because nothing bad happened during the fall for 19 of the 20 floors, there is no danger in the last twentieth of the fall either. Russell concluded pragmatically that an inductive inference can be accepted if it not only leads to a high-probability prediction, but is also reasonably credible.

Figure 4

Another example often used to demonstrate the limits of the inductive procedure is the paradox of the black ravens formulated by Carl Gustav Hempel. By examining a million ravens, one by one, we note that they are all black. After each observation, therefore, the theory that all ravens are black becomes increasingly likely to be true, consistently with the inductive principle. But the assumption "all ravens are black", taken in isolation, is logically equivalent to the assumption "all things that are not black are not ravens". This second statement would become more likely even after the observation of a red apple: we would have observed, in fact, something "not black" that "is not a raven". Obviously, the observation of a red apple, if taken to confirm the proposition that all ravens are black, is neither consistent nor reasonably credible. Bertrand Russell would argue that if the population of ravens in the world totaled a million plus one exemplars, then the inference "all ravens are black", after examining a million black ravens, could be considered reasonably correct; but if you were to estimate the existence of a hundred million ravens, then a sample of only one million black ravens would no longer be sufficient.

The forecasts provided by inductive models, and their practical use in business decisions, rest upon this response of Russell.

When selecting the data samples for the training, testing, and validation of a predictive model, you need to raise two fundamental questions:
a) Are the rules that constitute the algorithm of the model consistent with the characteristics of the individual entities that make up the sample?
b) Are the sample data really representative of the whole population of entities to be inferred?
The answers to these questions derive, respectively, from the concepts of internal validity and external validity of an inferential statistical analysis, as shown in Figure 5. Internal validity measures how correct the results of the analysis are for the sample of entities that have been studied, and it may be undermined by a not-perfectly-random sampling procedure, which becomes an element of noise and disturbance (bias). Good internal validity is necessary but not sufficient: we should also check the external validity, that is, the degree of generalization acquired by the predictive model. When the model has not learned sufficiently general rules, we may merely have recorded most of the data present in the training sample (we have overfitted the model) rather than effectively learned from the data (we did not extract the knowledge hidden behind it). In this situation, the model will not be able to successfully process new cases from other samples.

Figure 5
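One common way to expose this kind of overfitting is to hold out part of the sample and compare the model's accuracy on the training data with its accuracy on the held-out data. The minimal sketch below illustrates the bookkeeping with an invented dataset and a deliberately trivial majority-label "model".

import random

def train(samples):
    # deliberately trivial "model": always predict the majority label
    majority = sum(label for _, label in samples) >= len(samples) / 2
    return lambda features: majority

def accuracy(model, samples):
    # fraction of cases whose known label matches the model's prediction
    return sum(model(x) == y for x, y in samples) / len(samples)

# invented labeled cases: (features, is_fraud) pairs
cases = [((i % 7, i % 3), i % 4 == 0) for i in range(200)]

random.shuffle(cases)              # internal validity: randomize the split
cut = int(0.7 * len(cases))
training, holdout = cases[:cut], cases[cut:]

model = train(training)
gap = accuracy(model, training) - accuracy(model, holdout)
print(f"training-vs-holdout accuracy gap: {gap:.3f}")
# a large positive gap would signal overfitting: the model memorized
# the training sample instead of learning rules that generalize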
The techniques of predictive analytics help you make decisions once the data have been classified and characterized with respect to a certain phenomenon. Other techniques, such as OLAP (On-Line Analytical Processing), support decision making too, because they allow you to see what happened. A predictive model, however, directly provides the prediction of a phenomenon, estimates its size, and allows you to perform the right actions.

A further possibility offered by the techniques of predictive analytics is the separation and classification of the elements belonging to a non-homogeneous set. The most common example of this type of application is selecting which customers to address in a marketing campaign, that is, whom to send a business proposal with a reasonable chance of getting a positive response; rightly, in these cases, one speaks of business intelligence. This technique, known as clustering, is also useful in the fraud management area because it allows you to better target the action of a predictive model. It improves the internal validity of the training sample by dividing the mass of available data into homogeneous subsets. It may also discover new patterns of fraud and help you generate new detection rules. Moreover, the identification of values very distant from the average, called outliers, leads directly to cases that have a high probability of fraud and therefore require more thorough investigation, as in the sketch below.
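As an illustration, the pure-Python sketch below assigns invented usage records to the nearest of two cluster centroids and flags the record farthest from its own centroid as the first candidate for investigation; a real FMS would use richer features and a proper clustering library.

def sq_dist(a, b):
    # squared Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

# invented (calls_per_day, avg_duration_min) usage records
records = [(5, 3), (6, 4), (7, 3), (50, 40), (55, 42), (300, 2)]
centroids = [(6, 3), (52, 41)]  # e.g., taken from a previous k-means run

# distance from each record to its nearest centroid
nearest = [min(sq_dist(r, c) for c in centroids) for r in records]

# the record farthest from every cluster is the most anomalous
worst = max(range(len(records)), key=lambda i: nearest[i])
print("most suspicious record:", records[worst])  # (300, 2): a clear outlier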
The Dilemma of the fraud manager

The desire of every organization that is aware of the loss of revenues due to fraud is, obviously, to achieve zero losses. Unfortunately, this is not possible, both because of the rapid reaction of the criminal organizations that profit from fraud, which quickly find new attack patterns and discover new weaknesses in the defense systems, and because fighting fraud has a cost that grows in proportion to the level of defense put in place. Figure 6 shows graphically that, without enforcement systems, the losses due to fraud can reach very high levels, over 30% of total revenues, and may even threaten the very survival of the company. By putting in place an appropriate organization to manage fraud, and by providing an appropriate technology infrastructure, losses can be brought down quite quickly to acceptable levels, in the order of a few percentage points.

Figure 6

The competence of the fraud manager is important in identifying the optimum compromise between the costs of managing fraud and the residual losses due to fraud. This tradeoff is indicated by the red point in Figure 6. Going further would significantly increase the cost of personnel and instruments while achieving only tiny incremental loss reductions.

The main difficulty, however, lies not in demonstrating the value of residual fraud but in estimating the losses actually prevented by the regular activities performed by the fraud management team. In other words, it is not easy to estimate the size and consequences of the losses theoretically due to the frauds that were not perpetrated thanks to the daily prevention work.

For more details, and to understand how to calculate the ROI of an FMS, you can refer to the article Return on Investment of an FMS, published in March/April 2011 by the Italian Information Security magazine.

Technically, you must choose the appropriate KPIs (Key Performance Indicators) and measure both the value of the fraud detected in a given period and the value of the fraud remaining in the same period. For example, the trends of two popular KPIs, known as precision (the percentage of fraud detected in the total of analyzed cases) and recall (the percentage of fraud detected in the total of existing fraud), are shown in Figure 7.

Figure 7
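In code, the two KPIs reduce to simple ratios over the period's case counts; a minimal sketch with invented daily figures:

def precision(true_positives, cases_analyzed):
    # share of the analyzed cases that turned out to be actual fraud
    return true_positives / cases_analyzed

def recall(true_positives, total_existing_fraud):
    # share of all the existing fraud that was actually detected
    return true_positives / total_existing_fraud

# invented figures: 120 cases analyzed, 48 confirmed frauds among them,
# 160 fraud cases estimated to exist in the period's traffic overall
print(f"precision = {precision(48, 120):.2f}")  # 0.40
print(f"recall    = {recall(48, 160):.2f}")     # 0.30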
Wishing to reach the ideal point, at which you would have a precision and a recall of 100% at the same time, one can make several attempts to improve one KPI or the other. For example, you could increase the number of cases of suspected fraud to be examined daily (increasing recall) and, of course, increase the number of working hours too. Conversely, one may attempt to configure the FMS better and to reduce the number of cases to be analyzed in a day by eliminating the false alarms that needlessly consume the analysts' time (increasing precision). However, if you do not really increase the information given to the system, by adding new rules or better search keywords, improving the precision will worsen the recall percentage, and vice versa.

This problem leads to the dilemma that afflicts every fraud manager: you cannot improve the results of the fight against fraud without simultaneously increasing the costs of the structure (i.e., its power), or without increasing the information provided to the FMS. It is therefore necessary to act on at least one of the two levers, costs or information, and possibly to improve them both.

Figure 8

Predictive models lend themselves to improving the effectiveness and efficiency of a fraud management department. For example, the inductive techniques of decision trees can be used to extract new rules from the data for a better identification of the cases of fraud, while the scoring technique makes it easier to organize the human resources on a risk-priority basis and, if needed, enables automatic mechanisms to be used at night or in the absence of personnel. Figure 8 represents the gain chart for three different scoring models. The productivity gain consists of the analyst's time saved, in contrast to a non-guided processing of the cases following the random sequence indicated by the red diagonal. The blue solid line indicates the ideal path, which is practically unattainable but is precisely the aim: all cases of outright fraud, the true positives, are discovered immediately, without losing time on false alarms. It is interesting to note that this ideal situation occurs when both the precision and recall KPIs are equal to 100%, and it therefore corresponds to a model that has matched the ideal point shown in Figure 7.
makes it easier to organize the human
resources on a risk priority basis and,
eventually, enabling automatic
mechanisms to be used at night or in the
absence of personnel. Figure 8 represents
the gain chart for three different scoring
models. This productivity gain consists of
the analyst’s time saving and it is in
contrast to a non-guided processing of the
cases that follows a random sequence
indicated by the red diagonal. The blue
solid line indicates the ideal path, which is Figure 9
practically unattainable, but is just the
aim. According to that, all cases of The alarms and cases generated by the FMS
outright fraud, the true positives, are are derived from aggregations, or other
immediately discovered without losing information processing, of the elementary
time due to false alarms. It is interesting data coming from the telecommunication
to note that this ideal situation occurs traffic. In fact, all input data to a
The alarms and cases generated by the FMS are derived from aggregations, or other processing, of the elementary data coming from the telecommunication traffic. In fact, all the input data to a predictive model can be elaborated and replaced with other, derived parameters. All input data and derived parameters compete, in a sort of analytic game, to be elected as predictors, that is, the right inputs to the core forecasting algorithm highlighted in the blue box of Figure 9.

The output of the predictive model is simply the score value associated with the case. This value is a percentage, varying between zero and one hundred, or between zero and one, and it expresses the probability that the case represents an outright fraud (when the score is close to 100) or a false alarm (when the score is close to 0).
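End to end, the model is therefore just a function from predictor values to a score in [0, 1]. In the sketch below a single logistic unit stands in for the neural network of Figure 9; the predictor names and weights are invented for illustration.

import math

# invented weights for predictors derived from the traffic data
WEIGHTS = {"intl_calls_per_day": 0.9, "avg_call_cost": 0.6, "night_ratio": 1.2}
BIAS = -4.0

def score(case):
    # map a dict of predictor values to a fraud score between 0 and 1
    z = BIAS + sum(WEIGHTS[name] * value for name, value in case.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing, as in one neuron

suspect = {"intl_calls_per_day": 4.0, "avg_call_cost": 2.5, "night_ratio": 0.8}
print(f"fraud score: {score(suspect):.2f}")  # ~0.89: close to 1, likely fraud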
The inclusion of a predictive model in the operational context of the company has a significant impact on its existing information technology (IT) structure, and it can take many months to develop the dedicated custom software and the associated operating procedures. However, the recent growth of data transfer capacity through the Internet and web service technology, together with the emerging paradigms of cloud computing and SaaS (Software-as-a-Service), has paved the way for an easier transition of predictive models into production. The data mining community, represented by the Data Mining Group (DMG), has recently developed a language, PMML (Predictive Model Markup Language), that is destined to become the lingua franca, i.e., one spoken by many vendors and systems, for the standard definition and immediate use of predictive models. PMML, which is based on XML, provides all the methods and tools to define, verify, and then put into practice the predictive models. By adopting PMML, it is no longer necessarily the case that the model is developed and run by software from the same vendor. All the definitions and descriptions necessary for understanding PMML can be found on the DMG website, http://www.dmg.org/.
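To give a feel for the format, the fragment below uses only the Python standard library to emit the skeleton of a PMML file; the content is abridged and illustrative, not a complete, schema-valid model (see the specification on dmg.org).

import xml.etree.ElementTree as ET

# skeletal, abridged PMML: a header plus one data dictionary field
pmml = ET.Element("PMML", version="4.1", xmlns="http://www.dmg.org/PMML-4_1")
ET.SubElement(pmml, "Header", description="fraud scoring model (illustrative)")
dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="1")
ET.SubElement(dd, "DataField", name="intl_calls_per_day",
              optype="continuous", dataType="double")
# a real file would continue with a model element, e.g. <NeuralNetwork ...>

ET.ElementTree(pmml).write("model.pmml", encoding="utf-8", xml_declaration=True)

Because the format is standardized, a model exported this way can, in principle, be loaded by any PMML-aware scoring engine, regardless of the tool that produced it.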
In conclusion, PMML, being an open standard, can, when combined with a cloud computing offering, dramatically lower the TCO (Total Cost of Ownership) by breaking down the barriers of incompatibility between the different systems of the IT infrastructure already in place in the company. Furthermore, once the operational model is included in the application context, it can be run directly by the same people who developed it, i.e., without involving the IT department in heavy technical work.

For more on the creation of predictive models, see the article How to design a Predictive Model, published in May/June 2011 by the Italian Information Security magazine.