Welcome to this course. Data mining is an interdisciplinary field which
brings together techniques from machine learning, databases, information
retrieval, mathematics and statistics. These techniques are used to find
useful patterns in large datasets. Methods for such knowledge discovery in
databases are required owing to the size and complexity of the data
collected in administration, business and science.

This course balances the essential mathematical framework, computer
algorithms and their performance, and applications to a number of fields
including
Sample case studies
Example 0.1: the cancer revolution. In 1998 Todd Golub and his team
proved that microarrays could pinpoint faulty genetic activity. They
used microarrays to analyse bone samples from 38 patients with acute
leukaemia. For decades researchers had known that this disease came
in two major types, acute lymphoblastic leukaemia (ALL) and acute
myeloid leukaemia (AML), each with its own course of disease. Using a
microarray containing 6817 genes (see footnote 1), Golub’s team found
that each type had its own genetic signature. From these signatures
they selected 50 genes as markers for identifying each disease.
Subsequently, they used these markers to predict the subtype of 34
additional samples. The markers made an accurate prediction in 29 of
the samples. With these genetic signatures a one-step process is more
accurate than other tests. New Scientist, 23 August 2003

Footnote 1: A microarray is a slide dotted with thousands of tiny samples
of DNA, each representing a different gene. Microarrays rely on the
ability of one strand of DNA to stick to another strand with a
complementary sequence. Researchers make a DNA copy of a cell’s messenger
RNA and “label” this copy with a chemical that fluoresces under laser
light. Any sample that meets its match on the chip will stick to that
spot, and the pattern of glowing DNA dots indicates which genes were
turned on when the sample was taken.

© USQ, February 28, 2005

Example 0.2: catalogue astral objects from sky surveys. The task is to
systematically examine 3000 old photographic plates of astronomical
surveys and categorise objects as star, fuzzy star, galaxy, quasar, or
artifact. Human cataloguing is tedious, error prone, and misses most
features. Instead each plate is digitised to 23,040 × 23,040 pixels with
16-bit intensity, forming a database of about 3 terabytes (3 × 10^12
bytes). Data mining classifies the pixel images with about 94% accuracy
and classifies objects an order of magnitude fainter than humans have
managed. For example, the approach makes the search for quasars about
40 times more efficient. Also, at least 3 times the number of classified
sky objects than would be possible by traditional computational

Example 0.3: automobile manufacturing plant. Unexplained slowdowns in
production had been noticed to occur on one particular day of the week.
A thorough study was carried out to discover the culprit. Every bit of
activity in the plant was recorded, with a total of 10 GB of data
collected per day. These data were then used as input for data mining
tools; the problem was solved, resulting in savings of millions of
dollars. The method used was not disclosed. Neither the day nor the
reason was disclosed by the researcher, as these were the property of
the company.
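
The storage figure quoted in Example 0.2 can be checked with a few lines
of arithmetic. The sketch below is only a back-of-envelope estimate: it
assumes one uncompressed 16-bit intensity value per pixel and ignores any
file headers or metadata, neither of which is stated in the case study.
The function name survey_bytes is made up for this illustration.

```python
# Back-of-envelope check of the storage estimate in Example 0.2.
# Assumption (not from the case study): images are stored uncompressed,
# one 16-bit intensity value per pixel, with no headers or metadata.

PLATES = 3000            # number of photographic plates
SIDE_PIXELS = 23_040     # each plate is digitised to 23,040 x 23,040 pixels
BITS_PER_PIXEL = 16      # 16-bit intensity


def survey_bytes(plates, side_pixels, bits_per_pixel):
    """Total size in bytes of `plates` square images of the given side."""
    pixels_per_plate = side_pixels * side_pixels
    bytes_per_pixel = bits_per_pixel // 8
    return plates * pixels_per_plate * bytes_per_pixel


total = survey_bytes(PLATES, SIDE_PIXELS, BITS_PER_PIXEL)
print(f"{total:,} bytes = {total / 1e12:.2f} x 10^12 bytes")
# prints: 3,185,049,600,000 bytes = 3.19 x 10^12 bytes
```

Under these assumptions each plate takes a little over 1 GB, and the
whole survey comes to roughly 3.2 × 10^12 bytes, which agrees with the
"about 3 terabytes" quoted in the example.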
Example 0.4: medical fraud detected. The aim of the study was to explore
fraud and changes in healthcare delivery. The collected data are
transaction based. Fraud occurs at the patient level, the provider level
or the provider-ring level. Fraud is usually not self-revealing, and it
is normally mixed with legitimate business. In the pathology fraud
investigation, link analysis (IBM) was used to process 6.8 million
records with 120 variables (3.5 GB). It took the research group fifteen
months for data preparation and two weeks for data mining. Unexpected
combinations of services were discovered, and $550,000 worth of cover
was refused.
Example 0.5: taxation fraud and compliance. The aim of the study was to
identify groups with common unusual behaviour and also to carry out a
compliance study. Hot spots were used as a technique for fraud
detection; these combine clustering, rules and interestingness. Boosted
stumps and ROC curves were used for the compliance studies. There were
25 different databases throughout the enterprise. Data extraction took
eight months and data analysis four months. Regression models were
constructed to display compliance and were used to uncover unusual
behaviours.
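
Example 0.5 mentions ROC curves without showing one. As a minimal
sketch of the idea, the code below builds the points of an ROC curve
from a list of scores and true labels, and computes the area under it by
the trapezoidal rule. The scores and labels are invented purely for
illustration and have nothing to do with the taxation study; the
function names roc_points and auc are likewise hypothetical.

```python
# Minimal ROC curve sketch. All data below are hypothetical.

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    scores: higher means "more likely non-compliant".
    labels: 1 for a truly non-compliant case, 0 for a compliant one.
    One point is produced per distinct score, used as a threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Count every case tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores and true labels for six cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(f"AUC = {auc(curve):.3f}")
```

A curve that hugs the top-left corner (area close to 1) means the scores
rank non-compliant cases ahead of compliant ones; an area near 0.5 is no
better than random ranking.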
Prerequisites: A basic knowledge of programming and discrete
mathematics, such as that obtained through CSC1401 and MAT1101, is a
fundamental prerequisite for this course. However, other courses provide
analogous skills. You will often need the concepts and techniques of
such courses. Be sure you are familiar with those concepts, and have
appropriate references on hand.
• Berry and Linoff [BL97, Chapt. 1] is a good overview. Overall, Berry
and Linoff also gives good support for introductory qualitative
discussion of data mining and some applications in the business
• Hand, Mannila & Smyth [HMS01] is at a medium level, as is Han
& Kamber [HK00]. These books are more computationally oriented.
• The following is the most advanced and abstract: M. Hegland, Data
mining: challenges, models, methods and algorithms, http://datamining.
Please also let us know as soon as possible if you suspect any errors in the
study book. This is a new course, and your constructive feedback will enable
us to improve the materials.

[ABB+99] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel,
J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney,
and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia,
3rd edition, 1999. [http://www.netlib.org/lapack/lug/].

[AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for
mining association rules. In Jorge B. Bocca, Matthias Jarke, and
Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data
Bases, VLDB, pages 487–499. Morgan Kaufmann, 12–15 1994.

[BGL+99] Michael P. S. Brown, William Noble Grundy, David Lin, Nello
Cristianini, Charles Sugnet, Manuel Ares, Jr., and David Haussler.
Support vector machine classification of microarray gene
expression data. Technical report, [http://??], 1999.
UCSC-CRL-99-09.

[BL97] M. Berry and G. Linoff. Data mining techniques: for marketing,
sales, and customer support. Wiley, 1997. 658.802 Ber.

[EN00] Ramez Elmasri and Shamkant B. Navathe. Fundamentals of
database systems. Addison-Wesley, 3rd edition, 2000.

[FLPR99] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran.
Cache-oblivious algorithms. In 40th Annual Symposium on Foundations
of Computer Science, FOCS 99, 1999. [http://supertech.lcs.mit.

[Heg01] Markus Hegland. Data mining techniques. Acta Numerica,
10:313–355, 2001.

[HK00] Jiawei Han and Micheline Kamber. Data mining: Concepts and
techniques. Morgan Kaufmann, 2000. 006.3 Han.

[HMS01] David Hand, Heikki Mannila, and Padhraic Smyth. Principles
of Data Mining. MIT Press, 2001. 006.3 Han.

[Koh97] T. Kohonen. Self-Organizing Maps. Springer, 2nd edition, 1997.

[LK03] Xiaohui Liu and Paul Kellam. Mining gene expression data. In
??, editor, Bioinformatics: genes, proteins and computers,
chapter 15, pages 229–244. ??, 2003.

[Pic00] P. Picton. Neural Networks. Palgrave, 2nd edition, 2000.

[PTVF92] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P.
Flannery. Numerical recipes in C. The art of scientific
computing. Cambridge University Press, 2nd edition, 1992.
[http://www.library.cornell.edu/nr/bookcpdf.html].

[XOX02] Ying Xu, Victor Olman, and Dong Xu. Clustering gene
expression data using a graph-theoretic approach: an application of
minimum spanning trees. Bioinformatics, 18:536–545, 2002.