Data mining has become one of the latest trends in using data. Rod Newing
explains that it is a complex process which has been around for a long time.
Organisations world-wide are accumulating vast quantities of electronic data as databases become ever more pervasive. The recent trend to implement a data warehouse architecture is increasing the quality and accessibility of data. This is all being done at great cost, but the information is only valuable if used effectively.

Users have been using query tools, OLAP servers, Business Intelligence tools, Enterprise Information Systems and a wide range of other packaged software to examine their data. However, these tools either work with summarised data or answer users' specific questions. The more numerate analysts have recognised that there are hidden patterns, relationships and rules in their data which cannot be found by using these traditional methods.

The answer is to use specialist software which harnesses advanced mathematics to examine large volumes of detailed data. This specialist group of software has become known as "data mining" or "knowledge discovery". Data mining is defined as the process of extracting valid, previously unknown and ultimately comprehensible information from large databases and using it to make critical business decisions.

The name is derived from the process of sifting large amounts of ore to discover nuggets of gold, just as the software is able to sift large volumes of data to find nuggets of information which yield gold in the form of competitive advantage. The extracted information can be used to do one or more of the following:

• Provide an understanding of data relationships to end users.
• Form a prediction or classification model.
• Allow prediction of future trends based on past experience.
• Identify relationships between database records.
• Provide a summary of the database being mined.

With a query, the user knows what is in the database and knows what information to ask for, so they must know what patterns exist. With data mining, the software establishes the patterns and relationships. It is possible to carry out data mining operations using a query tool, but the process is extremely complex and would be prohibitively labour-intensive. Data mining software uses algorithms which have automated most of the work involved.

Data mining differs from statistical analysis in that the latter is used to verify existing knowledge in order to prove a known relationship. Most data mining involves carrying out several different operations using more than one technology, so it should be thought of as an operation, rather than a product.

Data mining can be carried out on any data file, from a spreadsheet to a data warehouse. Transaction processing systems can be mined, and the exercise can be used to generate benefits which can help to justify the considerable investment required to implement a data warehouse architecture.

Figure 1 outlines the major milestones in the evolution of Data Mining.

Time | Evolutionary step | Business question | Enabling technologies | Characteristics
1960s | Data collection | "What was my total revenue in each of the last five years?" | Computers, tapes, disks. | Retrospective static data delivery.
1980s | Data access | "What were unit sales in New England in March?" | Relational databases, SQL, ODBC. | Retrospective dynamic data delivery at record level.
1990s | Data warehousing and decision support | "What were unit sales in New England in March?" Drill down to Boston. | On-Line Analytical Processing, data warehouses. | Retrospective dynamic data delivery at multiple levels.
Now | Data mining | "What is likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multi-processor computers, massive databases. | Prospective proactive information delivery.

Figure 1 - Milestones in the evolution of Data Mining.

Objectives

Data mining can achieve a number of different objectives, using one or more different technologies.

Prediction And Classification

This approach uses the historical data in the database to predict future behaviour. It creates a generalised description which characterises the contents of the database by generating
Issue 74 Page 13
PC Network Advisor File: M0481.1
an understandable model. It enables the model to be applied to new data sets in order to predict the behaviour hidden in that data. For example, a predictive model of existing customers can be applied to potential customers in order to identify those most likely to purchase a particular product or service. It has traditionally used statistical techniques, but many automatic model development techniques are being developed, often based on supervised induction.

Analysing Links

Data mining can be used to establish relationships between the records in the database which would otherwise be impossible to find because they cannot be predicted and so cannot be found other than by accident. It is a relatively recent technique, which has become well known through shopping basket analysis, which indicates popular combinations purchased by retail customers.

Segmenting Databases

This is a form of sophisticated query to identify common groups of records within a database. It may be a technique in its own right or may be used to prepare data for further processing.

Detecting Deviations

This identifies unusual values which do not conform to the expected pattern. It is often a source of new knowledge since the results defy known logic. It is also used in fraud detection, where unusual values may represent an unauthorised transaction.

There are four basic steps which need to be carried out in order to complete a data mining exercise.

Data Selection

The objective determines the type of information and the way it is organised. Only part of the data available from the source data file will be needed, so the relevant data must be identified. Noise and missing values may need to be addressed. It may also be preferable to sample the data required and mine the sample.

Data Transformation

Once it has been selected, the data may need to be transformed. For instance, neural networks require nominal values to be converted to numeric ones. Alternatively, derived attributes may need to be created by applying mathematical or logical operators, such as a ratio or logarithmic value.

Applying Algorithms

One or more data mining techniques are carried out to try to extract the required information or meet the required objective. Some of the algorithms used are described in Figure 2.

Neural Networks: Software which learns from training to identify patterns and construct a model. This model is then applied to larger data sets to predict its structures. It can also identify changes, which then become a notifiable event.
Decision Trees: Decision trees are tree-shaped structures which represent sets of decisions. They generate rules for classifying the data set, using algorithms such as ID3, Classification and Regression Trees ("CART") and Chi Square Automatic Interaction Detection ("CHAID").
[...]: In this method, artificial intelligence search techniques are used to identify subsets in a cluster. It uses software such as AQ11, UNIMEM and COBWEB.
Rule Induction: Rule induction involves the extraction of "if ... then ...." rules from data based on statistical significance. Examples are IBM’s RMINI, and FOIL, which are in the public domain.
Genetic Algorithms: This is an optimisation technique which uses processes such as genetic combination, mutation and natural selection in a design based on the concepts of evolution.

Figure 2 - Data Mining Technologies.

Results Interpretation

The result of applying data mining algorithms will be tables of values or relationships. The user will have to look for interesting groupings of data and establish if there is any business value in them. They need to be analysed using a data visualisation (see Figure 3) or decision support tool. Visualisation helps the user to understand the data and identify patterns.

If the objective is to produce a model, it must be validated and tested. It may be necessary to refine the data, repeating the sequence again. This process is often referred to as "data refining".

Techniques

There are a number of techniques for carrying out the data mining exercise.

Supervised Induction

Supervised induction automatically creates a classification model from a set of records, known as a "training set", which may be the whole database or a sample of data from it. The induced model consists of generalised patterns which can be used to classify new records. It can use neural networks or decision trees, but the latter do not work well with noisy data.

It produces high quality models, even when data in the training set is poor or incomplete. The result is more accurate than that obtained using statistical methods, because it checks for local patterns, whereas the latter work across the entire database. The models are easy for the user to understand. An example would be a credit card analysis to discover the attributes of a good
credit risk in order to predict credit worthiness of applicants.

Association Discovery

This is a technique which identifies the affinities which exist among records. The output might find that 67% of records containing A, B and C, also contain Y and Z. The percentage is known as the "confidence factor". An example of association discovery is market basket analysis.

Sequence Discovery

This is similar to association discovery, but works over time. It is frequently directed towards individual customers as a means of identifying their preferences. It detects buying patterns which occur in a sequence of related transactions. It is used for targeting direct mail.

Clustering

This technique is used to segment a database into subsets of mutually exclusive groups. The members of each group should be as close to each other as possible and as far apart from other groups as possible. The members of each cluster should possess properties which are interesting to the user. Data visualisation techniques are then used to examine each cluster to establish which are useful or interesting.

It is less precise than other techniques because of redundant or irrelevant data. The solution is for the user to direct the software to ignore subsets of attributes, assign weightings to them or apply filters to the information. The importance of the attributes themselves can be established using [...]

Clustering can also be used to provide data for other techniques, such as supervised induction. Clusters can be created using statistics, neural networks or unsupervised induction. However, using statistical methods makes it difficult to assign new records to existing clusters, because of the difficulty of measuring and handling a record's deviation from those clusters.

Data visualisation provides the user with visual summaries of the results of the data mining algorithms. This helps them to understand the results of the data mining algorithms by communicating relationships in a way that rows and columns cannot. It is interactive, allowing the user to filter or change the information displayed. The user can also change the presentation method used, such as from a histogram to a scatter chart.

Visualisation allows users to browse the data looking for unusual features. It is good at identifying small meaningful sub-sets of data which defy conventional wisdom. These "outliers" are anomalies which may be errors, or genuine and valuable exceptions to established wisdom.

A wide range of advanced chart types can be used:

• Geographical maps, combined with histograms, colour coding, pie charts [...]
• Tree maps showing the hierarchy of a classified database.
• Rule visualisation.
• Scatter graphs.
• Heat maps.

These chart types are very advanced when compared with traditional graphing tools and need powerful workstations. For instance, a five dimensional chart can be created by representing clusters on a three dimensional scatter chart as a sphere. The size and colour of the sphere represent the fourth and fifth dimensions.

The time dimension can be incorporated by "playing" the chart like a video. The user can watch the movements in a multi-dimensional chart as it changes with the elapsed time.

Figure 3 - Data visualisation.

Applications

The importance of data mining has been recognised by information intensive industries which have large databases of customer transactions, such as banking, health care, insurance, marketing, retail and telecommunications.

One of the most well-known data mining applications is market/shopping basket analysis. This involves running an association discovery operation over Electronic Point Of Sale (EPOS) data. It analyses the combinations of products purchased by individual buyers to find dependencies. Until the recent arrival of loyalty cards, it has been the only way supermarkets and high street stores have had to understand who their customers are and how they behave.

Other common applications are for promotion effectiveness, customer vulnerability analysis, cross-selling, portfolio creation and fraud detection. It is also used in healthcare, where it can find relationships between patient histories, illnesses and surgical operations. It is also used in manufacturing processes to monitor quality and spot machine wear.

In marketing, if an organisation wants to cross-sell one product to buyers of another, it cannot target all customers, because the volume may be too large. Therefore it is necessary to mine the database of existing customers to identify patterns which describe the characteristics of purchasers of the product. These patterns can then be applied to the database of customers who have not purchased the product to segment and predict those who are more likely to purchase the product. These are then targeted in a very specific marketing campaign.

Data mining is often used to predict and identify people most likely to respond to direct mail. This reduces the cost of mailing without affecting the response rate. Organisations have found up to a twenty-fold decrease in costs over conventional approaches. The data mining operation can also be taken a step further by identifying clusters of the most profitable likely customers, which may be different to those most likely to respond.

Identifying exceptions can be just as important as finding hidden patterns. In fraud detection, credit card transactions are often analysed by a neural network to identify unusual transactions which may indicate that the card is not being used by its holder, even before the loss is reported.

It is important to understand that a particular data mining exercise may use more than one stage and use several algorithms by passing the results from one analysis to another. For instance, the user might produce associations using a decision tree and then pass the result to a neural network to identify changes over time. Mining elements can be combined in an infinite variety of ways.

Software

For most organisations, the software needs to be scalable from a stand-alone PC to a parallel-processing server. This allows data mining operations to be carried out on desktop databases, relational or multi-dimensional data marts, transaction processing systems or enterprise data warehouses.

Because of the different techniques and technologies, the software needs to integrate various different algorithms into one product. Most vendors use several different ones and are writing further modules to expand the scope of their products. The software must present the data in an easy to understand manner so that users can assess its significance to the business. It may incorporate its own visualisation tools or work with third-party packages.

The software must incorporate filters to remove "noise", which is incorrect information or spurious relationships. For instance, the software shouldn’t waste the user’s time by reporting that 99.9% of married people have a spouse of the opposite gender!

Software for data mining is available either direct from the authors or through decision support vendors who have embedded it into their own applications. IBM and the other vendors have open Application Programming Interfaces so that application builders can add value to their decision support software by driving a data mining engine from their own tools.

Supplier | Product | Contact Details
Angoss | Knowledge Seeker | http://www.angoss.com
Attar | XpertRule | http://www.attar.com
Brann Software | Viper | http://www.brannsoftware.co.uk
DataMind Corporation | Mine Your Own Business | http://www.datamindcorp.com
EDS | Dbintellect | http://www.dbintellect.com
IBM | Intelligent Miner, Intelligent Decision Server | http://www.software.ibm.com
Integral Solutions | Clementine | http://www.isl.co.uk
Right Information Systems | 4Thought | http://www.4thought.com
The SAS Institute | Neural Network Application, Insight, Spectraview, GIS | http://www.sas.com
Silicon Graphics | MineSet | http://www.sgi.com
SPSS | SPSS CHAID, Neural Connection, Professional Statistics etc | http://www.spss.com

Figure 4 - The Main Data Mining Products.

Supplier | Product | Tool | Contact Details
Cognos | PowerPlay | 4Thought, Knowledge Seeker | http://www.cognos.com
Comshare | Commander Decision | Own | http://www.comshare.com
NCR | Knowledge Discovery | Clementine | http://www.ncr.com
Holistic Systems | Holos | Own | http://www.holossys.com
Oracle | Express | Partners’ | http://www.oracle.com
Pilot Software | Pilot Discovery Server | Own, based on the Thinking Machine | http://www.pilotsw.com
Planning Sciences | Gentia | Own, plus Intelligent Miner | http://www.gentium.com
Red Brick Systems | Red Brick Data Mine | Mine Your Own Business | http://www.redbrick.com

Figure 5 - Products incorporating data mining.

The Author

Rod Newing MBA FCA FInstD is a specialist writer on Executive Computing. He can be contacted via email as firstname.lastname@example.org-link.co.uk.
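
As a closing illustration, the "confidence factor" described under Association Discovery can be worked through on a handful of shopping baskets. This is only a minimal sketch: the basket contents and the function name `confidence` are hypothetical and are not taken from any product named in this article. It simply counts how many of the baskets containing one set of items also contain another.

```python
def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`.

    Returns None when no basket contains the antecedent items.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain every antecedent item.
    with_antecedent = [b for b in baskets if antecedent <= set(b)]
    if not with_antecedent:
        return None
    # Of those, the baskets that also contain every consequent item.
    with_both = [b for b in with_antecedent if consequent <= set(b)]
    return len(with_both) / len(with_antecedent)


# Hypothetical EPOS baskets, invented purely for illustration.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "jam", "tea"},
    {"bread", "tea"},
    {"butter", "jam"},
]

rule_confidence = confidence(baskets, {"bread", "butter"}, {"jam"})
print(f"{rule_confidence:.0%}")  # prints 67%
```

Three of the five baskets contain both bread and butter, and two of those three also contain jam, so the rule "bread and butter implies jam" has a confidence factor of about 67%. A real association discovery tool would search for such rules automatically across all item combinations rather than test one rule chosen by the user.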