Data mining discovers knowledge

  • 1. Data mining discovers knowledge
Tony Roberts, Jiuyong Li, et al.
Department of Mathematics & Computing
February 28, 2005
  • 2. © USQ, February 28, 2005
  • 3. Preface

Welcome to this course. Data mining is an interdisciplinary field which brings together techniques of machine learning, database information retrieval, mathematics and statistics. These techniques are used to find useful patterns in large datasets. Methods for such knowledge discovery in databases are required owing to the size and complexity of data collection in administration, business and science. This course is a balance of essential mathematical framework, computer algorithms and performance, and applications to a number of fields including bioinformatics.

Sample case studies

Example 0.1: the cancer revolution. In 1998 Todd Golub and his team proved that microarrays could pinpoint faulty genetic activity. They used microarrays to analyse bone samples from 38 patients with acute leukaemia. For decades researchers had known that this disease came in two major types, acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML), each with its own course of disease. Using a microarray containing 6817 genes,[1] Golub's team found that each type had its own genetic signature.

[1] A microarray is a slide dotted with thousands of tiny samples of DNA, each representing a different gene. Microarrays rely on the ability of one strand of DNA to stick to another strand with a complementary sequence. Researchers make a DNA copy of a cell's messenger RNA and "label" this copy with a chemical that fluoresces under laser light. Any sample that meets its match on the chip will stick to that spot, and the pattern of glowing DNA dots indicates which genes were turned on when the sample was taken.
  • 4. From these signatures they selected 50 genes as markers for identifying each disease. Subsequently, they used these markers to predict the subtype of 34 additional samples; the markers made an accurate prediction in 29 of those samples. With these genetic signatures, a one-step process is more accurate than other tests. (A small sketch of signature-based prediction follows at the end of this page.) New Scientist, 23 August 2003

Example 0.2: catalogue astral objects from sky surveys. Systematically examine 3000 old photographic plates of astronomical surveys to categorise objects as either star, fuzzy star, galaxy, quasar, or artifact. Human cataloguing is tedious, error prone, and misses most features. Instead, each plate is digitised to 23,040 × 23,040 pixels with 16-bit intensity, forming a database of about 3 terabytes (each plate holds about 5.3 × 10^8 pixels at 2 bytes each, roughly 1 GB, so 3000 plates give about 3 × 10^12 bytes). Data mining classifies the pixel images with about 94% accuracy and classifies objects an order of magnitude fainter than humans have managed. For example, the approach makes the search for quasars about 40 times more efficient, and it classifies at least three times the number of sky objects possible by traditional computational methods.

Example 0.3: automobile manufacturing plant. Unexplained slowdowns in production had been noticed to occur on one particular day of the week. A thorough study was carried out to discover the culprit. Every bit of activity in the plant was recorded, with a total of 10 GByte of data collected per day. These data were then used as input for data mining tools; the problem was solved, resulting in savings of millions of dollars. The method used was not disclosed; neither the day nor the reason were revealed by the researcher, as these were the property of the company.
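Example 0.1 distils a general idea: summarise each known class by a signature over the marker genes, then assign a new sample to the class whose signature it most resembles. Golub's team used a weighted voting scheme over their 50 markers; purely as a flavour of such signature-based prediction, here is a minimal nearest-centroid sketch in Python on synthetic data. The class sizes, shifts and names are illustrative assumptions, not the published method or data.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Synthetic expression levels for 50 marker genes: two leukaemia-like
    # classes of training samples, shifted apart on average (assumed sizes).
    X_all = rng.normal(0.0, 1.0, size=(27, 50))   # "ALL-like" samples
    X_aml = rng.normal(1.5, 1.0, size=(11, 50))   # "AML-like" samples

    # Each class's "genetic signature" is its mean expression profile.
    signatures = {"ALL": X_all.mean(axis=0), "AML": X_aml.mean(axis=0)}

    def predict(sample):
        """Assign the class whose signature is nearest in Euclidean distance."""
        return min(signatures, key=lambda c: np.linalg.norm(sample - signatures[c]))

    # Classify a fresh sample drawn from the "AML-like" distribution.
    print(predict(rng.normal(1.5, 1.0, size=50)))   # prints "AML" almost always

Even this crude rule classifies well when the marker genes separate the classes cleanly, which is why the selection of good markers matters as much as the classifier itself.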
  • 5. Example 0.4: medical fraud detected. The aim of the study was the exploration of fraud and changes in healthcare delivery. The collected data is transaction based. Fraud occurs at the patient level, the provider level or the provider-ring level. Fraud is usually not self-revealing, and it is normally mixed with legitimate business. In a pathology fraud investigation, Link Analysis (IBM) was used to process 6.8 million records by 120 variables (3.5 GB). It took the research group fifteen months for data preparation and two weeks for data mining. Unexpected combinations of services were discovered, and $550,000 worth of cover was refused.

Example 0.5: taxation fraud and compliance. The aim of the study was to identify groups with common unusual behaviour and to carry out a compliance study. Hot spots were used as a technique for fraud detection; these combine clustering, rules and interestingness. The techniques of boosted stumps and ROC curves were used for the compliance studies (a brief illustrative ROC sketch follows this page). There were 25 different databases throughout the enterprise. Data extraction took eight months, and data analysis a further four months. Regression models were constructed to display compliance and were used to uncover unusual behaviours.

Prerequisites: A basic knowledge of programming and discrete mathematics, such as obtained through CSC1401 and MAT1101, is a fundamental prerequisite for this course. However, other courses provide analogous skills. There will be many times when you need the concepts and techniques of such courses. Be sure you are familiar with those concepts, and have appropriate references on hand.
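The ROC (receiver operating characteristic) curves mentioned in Example 0.5 plot the true-positive rate against the false-positive rate as a decision threshold sweeps across a classifier's scores. The taxation study's data and models were never published, so the following minimal Python sketch works on invented scores; the function name and the toy numbers are assumptions for illustration only.

    # Minimal ROC-curve computation: lower the threshold past one scored case
    # at a time, recording (false-positive rate, true-positive rate).
    def roc_points(scores, labels):
        """labels: 1 = actual fraud, 0 = legitimate; higher score = more suspect."""
        pairs = sorted(zip(scores, labels), reverse=True)
        pos = sum(labels)
        neg = len(labels) - pos
        tp = fp = 0
        points = [(0.0, 0.0)]
        for _, label in pairs:
            if label == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))
        return points

    # Toy scores for six audited cases, three of them actual fraud.
    print(roc_points([0.9, 0.8, 0.7, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0, 0]))

A curve hugging the top-left corner means high detection with few false alarms; the area under the curve is a common single-number summary of a compliance model's quality.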
  • 6. Reference material:

• Berry and Linoff [BL97, Chapt. 1] is a good overview. Overall, Berry and Linoff have good support for introductory qualitative discussion about data mining and some applications in the business area.

• Hand, Mannila & Smyth [HMS01] seems at a medium level, as does Han & Kamber [HK00]. These books are more computationally oriented.

• The most advanced and abstract is: M. Hegland, Data mining — challenges, models, methods and algorithms, http://datamining.anu.edu.au/~hegland/script.pdf, 2003.

Please also let us know as soon as possible if you suspect any errors in the study book. This is a new course, and your constructive feedback will enable us to improve the materials.
  • 7. Table of Contents

Preface  iii

1 Find association rules  1
  1.1 Investigate market baskets  2
  1.2 Rules encode knowledge  3
  1.3 Some itemsets occur frequently  8
  1.4 a priori algorithm  10
    1.4.1 Find all frequent itemsets  13
    1.4.2 Form the strong association rules  14
    1.4.3 Data structure  19
    1.4.4 a priori is usually not scalable  20
  1.5 Extend the scope  22
    1.5.1 Rules of inhibition  22
    1.5.2 Usually some data is missing  24
    1.5.3 Alternative rules explode combinations  25
  1.6 Overview  26
  1.7 Exercises  27
  • 8.
2 Determine clusters  29
  2.1 k-means algorithm finds clusters  33
    2.1.1 Graphical introduction  33
    2.1.2 The k-means algorithm is scalable  34
    2.1.3 Normalise data to compare apples and oranges  40
    2.1.4 Measure distance flexibly  42
  2.2 Agglomerate into hierarchical clusters  47
    2.2.1 Introduce a two-dimensional example  48
    2.2.2 Prim grows the minimum spanning tree  51
  2.3 Summary  54
  2.4 Exercises  57

3 Grow and prune decision trees to classify  59
  3.1 Overview  62
  3.2 Greedily grow a decision tree  65
  3.3 Entropy may be the best decision  68
  3.4 Prune the decision tree  73
  3.5 Decision trees scale  77
  3.6 Overview  79
  3.7 Exercises  80
  3.8 Appendix: unimodal minimisation  82
  • 9.
4 Discover complex linear relationships  85
  4.1 Linear regression appears naturally  86
  4.2 Vector spaces provide structure  88
  4.3 Solve linear equations to minimise residuals  90
  4.4 Global linear regression is scalable  96
  4.5 Linear regression summary  97
  4.6 Always cross validate  98
  4.7 Exercises  102

5 Hypersurfaces model complex decisions  103
  5.1 Overview basic heuristic processes  105
  5.2 Fill in missing data with nearest neighbours  107
    5.2.1 Measurement error is missing data  110
  5.3 Hypersplines vary smoothly  111
    5.3.1 Solve linear equations for coefficients  115
  5.4 Combine with decision trees to scale  119
  5.5 Radial basis function splines also classify  126
  5.6 Artificial neural networks  127
    5.6.1 Multiple layer perceptrons (back-propagation networks)  132
  5.7 Self organising maps  135
  5.8 Hypersurface summary  138
  5.9 Exercises  141

6 Outlook  143
  • 11. Bibliography

[ABB+99] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK User's Guide. SIAM, Philadelphia, 3rd edition, 1999. [http://www.netlib.org/lapack/lug/]. 94, 97

[AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994. 11

[BGL+99] Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares, Jr., and David Haussler. Support vector machine classification of microarray gene expression data. Technical report UCSC-CRL-99-09, [http://??], 1999. 60, 80

[BL97] M. Berry and G. Linoff. Data mining techniques: for marketing, sales, and customer support. Wiley, 1997. 658.802 Ber. vi, 2, 62, 77, 107

[EN00] Ramez Elmasri and Shamkant B. Navathe. Fundamentals of database systems. Addison-Wesley, 3rd edition, 2000.
  • 12.
[FLPR99] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 1999. [http://supertech.lcs.mit.edu/cilk/papers/abstracts/abstract4.html]. 78

[Heg01] Markus Hegland. Data mining techniques. Acta Numerica, 10:313–355, 2001. 3

[HK00] Jiawei Han and Micheline Kamber. Data mining: Concepts and techniques. Morgan Kaufmann, 2000. 006.3 Han. vi, 28, 57, 82, 102, 141

[HMS01] David Hand, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. MIT Press, 2001. 006.3 Han. vi, 78

[Koh97] T. Kohonen. Self-Organizing Maps. Springer, 2nd edition, 1997. 138

[LK03] Xiaohui Liu and Paul Kellam. Mining gene expression data. In ??, editor, Bioinformatics: genes, proteins and computers, chapter 15, pages 229–244. ??, 2003. 31, 32, 55

[Pic00] P. Picton. Neural Networks. Palgrave, 2nd edition, 2000. 130

[PTVF92] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992. [http://www.library.cornell.edu/nr/bookcpdf.html]. 83, 95, 96, 112, 119

[XOX02] Ying Xu, Victor Olman, and Dong Xu. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18:536–545, 2002. 47
