Data Mining Applications - SEDE'07 - Invited Talk

  • Ratanamahatana, C. A. & Keogh, E. (2004). Using Relevance Feedback in Multimedia Databases. In Proceedings of the 7th International Conference on Visual Information Systems. Pierre Maurel and Guillermo Sapiro (2003). Dynamic shapes average. Institute for Mathematics and its Applications. Thomas G. Dietterich and Ashit Gandhi. Content-Based Image Retrieval: Plant Species Identification using Dynamic Programming. http://web.engr.oregonstate.edu/~tgd/leaves/
  • This is an image of an actual expression array after scanning. The image on the left is the full array, with every feature represented; notice that it looks blurry and it is hard to make things out. The image on the right is a zoomed-in view of a section of the array, representing about 324 features. Each feature is scanned for its intensity, and there is clearly a wide range of intensities in this area. The black features represent no intensity (no RNA combined with the probes in the feature). The intensity level from lowest to highest by color is: Dark blue -> Blue -> Light blue -> Green -> Yellow -> Orange -> Red -> White. Remember that more intensity means more RNA combined with that feature, which basically means the gene was expressed at a higher level.

    1. 1. DATA MINING APPLICATIONS Margaret H. Dunham Southern Methodist University Dallas, Texas 75275 [email_address] This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841 Some slides used by permission from Dr Eamonn Keogh; University of California Riverside; [email_address]
    2. 2. The 2000 ozone hole over the Antarctic, seen by EPTOMS http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
    3. 3. OBJECTIVE <ul><li>Explore some of the applications of data mining techniques. </li></ul>
    4. 4. Data Mining Applications Outline <ul><li>Introduction – Data Mining Overview </li></ul><ul><ul><li>Classification (Prediction, Forecasting) </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Association Rules (Link Analysis) </li></ul></ul><ul><li>Applications </li></ul><ul><ul><li>Fraud Detection & Illegal Activities </li></ul></ul><ul><ul><li>Facial Recognition </li></ul></ul><ul><ul><li>Cheating & Plagiarism </li></ul></ul><ul><ul><li>Bioinformatics </li></ul></ul><ul><li>Conclusions </li></ul>
    5. 5. Data Mining Overview <ul><li>Finding hidden information in a database </li></ul><ul><li>Fit data to a model </li></ul><ul><li>You must know what you are looking for </li></ul><ul><li>You must know how to look for it </li></ul>
    6. 6. “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.” Classification Clustering Link Analysis (Profiling) (Similarity) “If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.” Description Behavior Associations
    7. 7. Classification Applications <ul><li>Teachers classify students’ grades as A, B, C, D, or F. </li></ul><ul><li>Letter Recognition </li></ul><ul><li>Handwriting Recognition </li></ul><ul><li>Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82 </li></ul><ul><li>Pluto: http://www.npr.org/templates/story/story.php?storyId=5705254 </li></ul>
    8. 8. Classification Example: Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. Figure panels: Grasshoppers, Katydids. (c) Eamonn Keogh, eamonn@cs.ucr.edu
    9. 9. [Scatter plot of Grasshoppers and Katydids; axes: Antenna Length and Abdomen Length, each running from 1 to 10] (c) Eamonn Keogh, eamonn@cs.ucr.edu
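The grasshopper/katydid task can be sketched as a 1-nearest-neighbor classifier over the two measured features. This is only an illustration under stated assumptions, not the method used in the talk: the measurements are hypothetical values (the slide's actual data points are not recoverable from the transcript), and `TRAINING` and `classify` are illustrative names.

```python
import math

# Hypothetical (abdomen_length, antenna_length) measurements; NOT data from the slides.
TRAINING = [
    ((2.0, 1.5), "Grasshopper"),
    ((3.0, 2.0), "Grasshopper"),
    ((4.5, 3.0), "Grasshopper"),
    ((2.5, 6.0), "Katydid"),
    ((4.0, 8.0), "Katydid"),
    ((5.5, 9.0), "Katydid"),
]

def classify(sample, training=TRAINING):
    """Return the label of the training instance closest to `sample` (1-nearest neighbor)."""
    nearest = min(training, key=lambda item: math.dist(item[0], sample))
    return nearest[1]

if __name__ == "__main__":
    # An unlabeled insect with a short abdomen and long antennae.
    print(classify((3.0, 7.0)))
```

With these made-up points, an insect with long antennae relative to its abdomen lands nearest a katydid. Any other distance measure or classifier could be substituted.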
    10. 10. Clustering Applications <ul><li>Targeted Marketing </li></ul><ul><li>Determining Gene Functionality </li></ul><ul><li>Identifying Species </li></ul><ul><li>Clustering vs. Classification </li></ul><ul><ul><li>No prior knowledge </li></ul></ul><ul><ul><li>Number of clusters </li></ul></ul><ul><ul><li>Meaning of clusters </li></ul></ul><ul><li>Unsupervised learning </li></ul>
    11. 11. http://149.170.199.144/multivar/ca.htm
    12. 12. What is Similarity ? (c) Eamonn Keogh, eamonn@cs.ucr.edu
    13. 13. Association Rules Applications <ul><li>People who buy diapers also buy beer </li></ul><ul><li>If gene A is highly expressed in this disease then gene B is also expressed </li></ul><ul><li>Relationships between people </li></ul><ul><li>www.amazon.com </li></ul><ul><li>Book Stores </li></ul><ul><li>Department Stores </li></ul><ul><li>Advertising </li></ul><ul><li>Product Placement </li></ul>
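A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. The sketch below computes both over a handful of hypothetical baskets; the basket contents and the names `BASKETS`, `support`, and `confidence` are assumptions made for illustration, not material from the talk.

```python
# Hypothetical market baskets; the items echo the slide's diapers/beer example.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
]

def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets=BASKETS):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent).

    Assumes the antecedent occurs in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

if __name__ == "__main__":
    print(support({"diapers", "beer"}))
    print(confidence({"diapers"}, {"beer"}))
```

Here three of the five baskets contain both items, and three of the four diaper baskets also contain beer.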
    14. 14. Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003. DILBERT reprinted by permission of United Feature Syndicate, Inc.
    16. 17. Fraud Detection <ul><li>Identify fraudulent behavior </li></ul><ul><li>Used extensively in the financial, law enforcement, health care, and other sectors </li></ul><ul><li>http://www.aaai.org/AITopics/html/fraud.html </li></ul><ul><li>SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm </li></ul><ul><li>Neural Technologies: http://www.neuralt.com/fraud_management.html </li></ul>
    17. 18. Law Enforcement <ul><li>Identify suspect behavior and relationships </li></ul><ul><li>I2 Inc. </li></ul><ul><ul><li>Investigative analytic/visualization software </li></ul></ul><ul><ul><li>http://www.i2inc.com </li></ul></ul><ul><li>Social Network Analysis – Analyze patterns of relationships </li></ul><ul><li>Relationships: personal, religious, operational, etc. </li></ul>
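One simple way to "analyze patterns of relationships" is degree centrality: how many distinct contacts each person has, scaled by the largest possible number. The sketch below is a generic illustration, not the analysis performed by any product named on the slide; the link list and the name `degree_centrality` are hypothetical.

```python
from collections import defaultdict

# Hypothetical relationship pairs (person, person); the names are placeholders.
LINKS = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

def degree_centrality(links):
    """Map each person to (number of distinct contacts) / (n - 1).

    Assumes at least two distinct people appear in `links`.
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {person: len(contacts) / (n - 1) for person, contacts in neighbors.items()}

if __name__ == "__main__":
    scores = degree_centrality(LINKS)
    # The person with the highest score is the best-connected node.
    print(max(scores, key=scores.get), scores)
```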
    18. 19. Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network”  Lecture Notes in Computer Science, Publisher: Springer-Verlag GmbH, Volume 3495 / 2005 , p. 287.
    20. 21. How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm
    21. 22. Facial Recognition <ul><li>Based upon features in face </li></ul><ul><li>Convert face to a feature vector </li></ul><ul><li>Less invasive than other biometric techniques </li></ul><ul><li>http://www.face-rec.org </li></ul><ul><li>http://computer.howstuffworks.com/facial-recognition.htm </li></ul><ul><li>SIMS: </li></ul><ul><ul><li>http://www.casinoincidentreporting.com/Products.aspx </li></ul></ul>
    22. 23. (c) Eamonn Keogh, eamonn@cs.ucr.edu
    24. 25. Cheating on Multiple Choice Tests <ul><li>Similarity between tests based on number of common wrong answers. </li></ul><ul><li>(George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol 27, no 7, 2000, pp 909-923.) </li></ul><ul><li>The number of common correct answers is often ignored. </li></ul><ul><li>H-H Index (D.N. Harpp, J.J. Hogan, and J.S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol 73, no 4, pp 349-351): </li></ul><ul><ul><li>H-H = (Number of exact answers in common) / (Number of different answers) </li></ul></ul>
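The H-H ratio can be sketched for a pair of answer sheets as follows. This is an illustration under one assumption that the slide does not spell out: "exact answers in common" is taken to mean identical wrong answers, in line with the first bullet. The function name `hh_index` and the sample answer strings are hypothetical.

```python
def hh_index(answers_a, answers_b, key):
    """Ratio of identical wrong answers to differing answers for two exams.

    Assumption: "exact answers in common" counts only answers that are the
    same on both sheets AND wrong according to `key`.
    """
    if not (len(answers_a) == len(answers_b) == len(key)):
        raise ValueError("answer sheets and key must have the same length")
    common_wrong = sum(a == b != k for a, b, k in zip(answers_a, answers_b, key))
    different = sum(a != b for a, b in zip(answers_a, answers_b))
    if different == 0:
        # Identical sheets: the ratio is unbounded if any shared answer is wrong.
        return float("inf") if common_wrong else 0.0
    return common_wrong / different

if __name__ == "__main__":
    key = "ABCDABCDAB"
    student_1 = "ABCCABDDAA"
    student_2 = "ABCCABDDAB"
    print(hh_index(student_1, student_2, key))
```

In this made-up pair the two sheets share two wrong answers and differ on one question, so the ratio is 2.0. What value counts as suspicious is not stated on the slide and is left open here.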
    25. 26. Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
    26. 27. No/Little Cheating Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
    27. 28. Rampant Cheating Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
    29. 30. DNA <ul><li>Basic building blocks of organisms </li></ul><ul><li>Located in nucleus of cells </li></ul><ul><li>Composed of 4 nucleotides </li></ul><ul><li>Two strands bound together </li></ul>http://www.visionlearning.com/library/module_viewer.php?mid=63
    30. 31. Central Dogma: DNA -> RNA -> Protein. DNA (CCTGAGCCAACTATTGATGAA) -> transcription -> RNA (CCUGAGCCAACUAUUGAUGAA) -> translation -> Protein (PEPTIDE). Source: www.bioalgorithms.info; chapter 6; Gene Prediction
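The slide's DNA -> RNA -> protein example can be reproduced in a few lines. This is a deliberately simplified sketch: transcription is modeled as replacing T with U on the coding strand, and `CODON_TABLE` holds only the seven codons that occur in the example rather than the full 64-entry genetic code. The function names are illustrative.

```python
# Only the codons that occur in the slide's example; the full table has 64 entries.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamic acid
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartic acid
}

def transcribe(dna):
    """Simplified transcription: rewrite the coding strand with U in place of T."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Translate consecutive codons to one-letter amino acid codes.

    Codons missing from the partial table are rendered as '?'.
    """
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = (rna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)

if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)
    print(translate(rna))
```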
    31. 32. miRNA <ul><li>Short (20-25nt) sequence of noncoding RNA </li></ul><ul><li>Known since 1993 but significance not widely appreciated until 2001 </li></ul><ul><li>Impact / Prevent translation of mRNA </li></ul><ul><li>Generally reduce protein levels without impacting mRNA levels (animal cells) </li></ul><ul><li>Functions </li></ul><ul><ul><li>Cause some cancers </li></ul></ul><ul><ul><li>Guide embryo development </li></ul></ul><ul><ul><li>Regulate cell differentiation </li></ul></ul><ul><ul><li>Associated with HIV </li></ul></ul><ul><ul><li>… </li></ul></ul>
    32. 33. Questions <ul><li>If each cell in an organism contains the same DNA – </li></ul><ul><ul><li>How does each cell behave differently? </li></ul></ul><ul><ul><li>Why do cells behave differently during childhood? </li></ul></ul><ul><ul><li>What causes some cells to act differently – such as during disease? </li></ul></ul><ul><li>DNA contains many genes, but only a few are being transcribed – why? </li></ul><ul><li>One answer - miRNA </li></ul>
    33. 34. <ul><li>http://www.time.com/time/magazine/article/0,9171,1541283,00.html </li></ul>
    34. 35. Human Genome <ul><li>Scientists originally thought there would be about 100,000 genes </li></ul><ul><li>Appear to be about 20,000 </li></ul><ul><li>WHY? </li></ul><ul><li>Almost identical to that of Chimps. What makes the difference? </li></ul><ul><li>Visualization from UCR </li></ul><ul><li>dnaQT.mov </li></ul><ul><li>Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk) </li></ul>
    35. 36. RNAi – Nobel Prize in Medicine 2006 Double stranded RNA Short Interfering RNA (~20-25 nt) RNA-Induced Silencing Complex Binds to mRNA Cuts RNA siRNA may be artificially added to cell! Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html , Advanced Information, Image 3
    36. 37. Computer Science & Bioinformatics <ul><li>Algorithms </li></ul><ul><li>Data Structures </li></ul><ul><li>Improving efficiency </li></ul><ul><li>Data Mining </li></ul><ul><li>Biologists don’t usually understand or even appreciate what Computer Science can do </li></ul><ul><li>Issues: </li></ul><ul><ul><li>Scalability </li></ul></ul><ul><ul><li>Fuzzy </li></ul></ul><ul><li>We will look at: </li></ul><ul><ul><li>Microarray Clustering </li></ul></ul><ul><ul><li>TCGR </li></ul></ul>
    37. 38. Affymetrix GeneChip ® Array http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
    38. 39. Microarray Data Analysis <ul><li>Each probe location associated with gene </li></ul><ul><li>Measure the amount of mRNA </li></ul><ul><li>Color indicates degree of gene expression </li></ul><ul><li>Compare different samples (normal/disease) </li></ul><ul><li>Track same sample over time </li></ul><ul><li>Questions </li></ul><ul><ul><li>Which genes are related to this disease? </li></ul></ul><ul><ul><li>Which genes behave in a similar manner? </li></ul></ul><ul><ul><li>What is the function of a gene? </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>Hierarchical </li></ul></ul><ul><ul><li>K-means </li></ul></ul>
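The K-means option listed for microarray clustering can be sketched on a handful of expression profiles. This is a minimal illustration, not the procedure used in the cited studies: the expression values are hypothetical, and the initial centroids are simply the first k profiles (a simplification, since real implementations usually choose them more carefully).

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical expression levels for six genes measured in three samples.
PROFILES = [
    (0.1, 0.2, 0.1), (5.0, 5.2, 4.9), (0.2, 0.1, 0.3),
    (4.8, 5.1, 5.0), (0.0, 0.3, 0.2), (5.1, 4.9, 5.2),
]

if __name__ == "__main__":
    labels, centroids = kmeans(PROFILES, k=2)
    print(labels)
```

Genes that receive the same label behave similarly across the samples, which is the kind of grouping the slide's question "Which genes behave in a similar manner?" asks for.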
    39. 40. Microarray Data - Clustering "Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004
    40. 41. miRNA Research Issues <ul><li>Predict / Find miRNA in genomic sequence </li></ul><ul><li>Predict miRNA targets </li></ul><ul><li>Identify miRNA functions </li></ul>
    41. 42. Temporal CGR (TCGR) <ul><li>2D Array </li></ul><ul><ul><li>Each Row represents counts for a particular window in sequence </li></ul></ul><ul><ul><ul><li>First row – first window </li></ul></ul></ul><ul><ul><ul><li>Last row – last window </li></ul></ul></ul><ul><ul><ul><li>We start successive windows at the next character location </li></ul></ul></ul><ul><ul><li>Each Column represents the counts for the associated pattern in that window </li></ul></ul><ul><ul><ul><li>Initially we have assumed order of patterns is alphabetic </li></ul></ul></ul><ul><ul><li>Size of TCGR depends on sequence length and subpattern length </li></ul></ul>
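The TCGR layout described above (one row per sliding window, one column per pattern in alphabetical order) can be sketched as follows. The function name `tcgr`, the default RNA alphabet, and the choice to count overlapping occurrences inside each window are assumptions made for illustration; the slide does not specify them.

```python
from itertools import product

def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Return (patterns, rows): one row of pattern counts per sliding window.

    Windows start at successive character positions, so there are
    len(sequence) - window + 1 rows and len(alphabet) ** pattern_len columns.
    """
    # Columns are all patterns of the given length, in alphabetical order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {pattern: i for i, pattern in enumerate(patterns)}
    rows = []
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        counts = [0] * len(patterns)
        # Assumption: overlapping occurrences within the window are all counted.
        for i in range(window - pattern_len + 1):
            sub = chunk[i:i + pattern_len]
            if sub in column:  # skip symbols outside the alphabet
                counts[column[sub]] += 1
        rows.append(counts)
    return patterns, rows

if __name__ == "__main__":
    patterns, rows = tcgr("ACGUACG", window=5, pattern_len=3)
    print(len(rows), "windows x", len(patterns), "patterns")
```

With a window of 5 and patterns of length 3, the settings named on the mature miRNA slide, each row holds three counts spread over 64 columns.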
    42. 43. TCGR Example (cont’d) TCGRs for Sub-patterns of length 1, 2, and 3
    43. 44. TCGR – Mature miRNA (Window=5; Pattern=3) All Mature Mus Musculus Homo Sapiens C Elegans ACG CGC GCG UCG
    44. 45. TCGRs for Xue Training Data C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310. NEGATIVE POSITIVE
    45. 46. TCGRs for Xue Test Data NEGATIVE POSITIVE
    47. 48. Conclusions <ul><li>Not magic </li></ul><ul><li>Doesn’t work for all applications </li></ul><ul><li>Stock Market Prediction </li></ul><ul><li>Issues </li></ul><ul><ul><li>Privacy </li></ul></ul><ul><ul><li>Data </li></ul></ul><ul><li>Here are some infamous examples of failed data mining applications </li></ul>
    48. 50. Dallas Morning News October 7, 2005
    49. 51. http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
    50. 52. BIG BROTHER ? <ul><li>Total Information Awareness </li></ul><ul><ul><li>http://infowar.net/tia/www.darpa.mil/iao/index.htm </li></ul></ul><ul><ul><li>http://www.govtech.net/magazine/story.php?id=45918 </li></ul></ul><ul><ul><li>http://en.wikipedia.org/wiki/Information_Awareness_Office </li></ul></ul><ul><li>Terror Watch List </li></ul><ul><ul><li>http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm </li></ul></ul><ul><ul><li>http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/ </li></ul></ul><ul><ul><li>http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html </li></ul></ul><ul><ul><li>http://www.thedenverchannel.com/news/9559707/detail.html </li></ul></ul><ul><li>CAPPS </li></ul><ul><ul><li>http://www.theregister.co.uk/2004/04/26/airport_security_failures/ </li></ul></ul><ul><ul><li>http://www.heritage.org/Research/HomelandDefense/BG1683.cfm </li></ul></ul><ul><ul><li>http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/ </li></ul></ul><ul><ul><li>http://en.wikipedia.org/wiki/CAPPS </li></ul></ul>
    51. 55. Thank You
