Mining Knowledge in Data Explosion Age



  1. Mining Knowledge in Data Explosion Age • I-En Liao (廖宜恩), Department of Computer Science and Engineering, National Chung Hsing University
  2. Outline • Some News Reports • Why Data Mining • What is Data Mining • Knowledge Discovery Process • Data Mining Functionalities • Data Mining Process • Data Mining Tools • Trends in Data Mining • Some Research Results on Data Mining • Conclusions
  3. Some News Reports • Time's Person of the Year for 2006 • 12 IT skills that employers can't say no to • F.B.I. Data Mining Reached Beyond Initial Targets • MIT names its top 10 emerging technologies for 2008 • Effect of US Recession on Data Mining Demand (July 2008)
  4. Why Data Mining • Data Explosion Problem – Data in the world doubles every 20 months! – NASA’s Earth Observing System: 46 megabytes of data per second • about 4,000,000,000,000 bytes a day (4 terabytes/day; 20 × 200 GB hard disks) – FBI fingerprint image library: • 200,000,000,000,000 bytes (200 TB) – In-line image analysis for particle detection: 1 megabyte per second
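
The daily volume quoted for NASA follows from the per-second rate. A quick check, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which the slide does not state:

```python
import math

MB = 10**6   # assumption: decimal megabyte
GB = 10**9   # assumption: decimal gigabyte
SECONDS_PER_DAY = 24 * 60 * 60

# 46 MB/s sustained for one full day.
bytes_per_day = 46 * MB * SECONDS_PER_DAY

# Number of 200 GB disks needed to hold one day of data.
disks_needed = math.ceil(bytes_per_day / (200 * GB))

print(f"{bytes_per_day:,} bytes per day")
print(f"{disks_needed} disks of 200 GB")
```

The result is just under 4 × 10^12 bytes, which the slide rounds to 4 terabytes per day.
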
  5. Why Data Mining? Commercial Viewpoint • Lots of data is being collected and warehoused – Web data, e-commerce – purchases at department/grocery stores – Bank/Credit Card transactions • Competitive Pressure is Strong – Provide better, customized services for an edge (e.g. in Customer Relationship Management)
  6. Why Data Mining? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) – remote sensors on a satellite – telescopes scanning the skies – microarrays generating gene expression data – scientific simulations generating terabytes of data • Traditional techniques infeasible for raw data
  7. Mining Large Data Sets - Motivation • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all • We are drowning in data, but starving for knowledge! [Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999.]
  8. What is Data Mining? • Data Mining (also called Knowledge Discovery in Databases, KDD): – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
  9. Knowledge Discovery Process • Data mining: the core of the knowledge discovery process. [Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
  10. Origins of Data Mining • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • Traditional techniques may be unsuitable due to – Enormity of data – Curse of high dimensionality – Heterogeneous, distributed nature of data [Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
  11. Data Mining Functionalities 1. Concept description: characterization and discrimination (describing the characteristics of, or differences between, data sets) 2. Classification 3. Association rule mining 4. Clustering 5. Sequence analysis 6. Anomaly detection
  12. Concept description: Characterization and discrimination • Concept description: – Characterization: provides a concise summarization of the given collection of data • Example: Describe general characteristics of graduate students in the NCHU database – Discrimination: provides descriptions comparing two or more collections of data • Example: Compare graduate and undergraduate students of NCHU using a discriminant rule
  13. Classification • Given a collection of records (training set) – Each record contains a set of attributes; one of the attributes is the class. • Find a model for the class attribute as a function of the values of the other attributes. • Goal: previously unseen records should be assigned a class as accurately as possible. – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
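
The train/test procedure described on this slide can be sketched as follows. This is a minimal illustration, not the method used in the talk: the records are made up, and the "model" is a deliberately trivial majority-class baseline so that the split and the accuracy measurement stay in focus. The function names are hypothetical.

```python
import random
from collections import Counter

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(training_set):
    """Baseline model: always predict the most common class in the training set."""
    most_common = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: most_common

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Made-up records of the form (attributes, class).
records = [({"income": 20 + 10 * i}, "Yes" if i % 3 == 0 else "No") for i in range(10)]

training_set, test_set = train_test_split(records)
model = fit_majority_class(training_set)
print(len(training_set), len(test_set), accuracy(model, test_set))
```

Any real classifier can replace `fit_majority_class`; the point is that accuracy is measured only on records the model did not see during training.
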
  14. Decision Tree Classification Task
     Training set (used to learn the model, here a decision tree):
     Tid  Attrib1  Attrib2  Attrib3  Class
     1    Yes      Large    125K     No
     2    No       Medium   100K     No
     3    No       Small    70K      No
     4    Yes      Medium   120K     No
     5    No       Large    95K      Yes
     6    No       Medium   60K      No
     7    Yes      Large    220K     No
     8    No       Small    85K      Yes
     9    No       Medium   75K      No
     10   No       Small    90K      Yes
     Test set (the learned model is applied to these records):
     Tid  Attrib1  Attrib2  Attrib3  Class
     11   No       Small    55K      ?
     12   Yes      Medium   80K      ?
     13   Yes      Large    110K     ?
     14   No       Small    95K      ?
     15   No       Large    67K      ?
  15. Example of a Decision Tree
     Training data:
     Tid  Refund  Marital Status  Taxable Income  Cheat
     1    Yes     Single          125K            No
     2    No      Married         100K            No
     3    No      Single          70K             No
     4    Yes     Married         120K            No
     5    No      Divorced        95K             Yes
     6    No      Married         60K             No
     7    Yes     Divorced        220K            No
     8    No      Single          85K             Yes
     9    No      Married         75K             No
     10   No      Single          90K             Yes
     Model: decision tree (splitting attributes: Refund, MarSt, TaxInc)
     Refund?
       Yes → NO
       No  → MarSt?
               Married          → NO
               Single, Divorced → TaxInc?
                                    < 80K → NO
                                    > 80K → YES
  16. Apply Model to Test Data • Start from the root of the tree.
     Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
     Refund?
       Yes → NO
       No  → MarSt?
               Married          → NO
               Single, Divorced → TaxInc?
                                    < 80K → NO
                                    > 80K → YES
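
Walking the tree from the root can be written as a few nested conditions. The sketch below encodes the tree from the two preceding slides (Refund, then marital status, then taxable income) and applies it to the test record. The function name is hypothetical, and sending an income of exactly 80K to the YES branch is an assumption, since the slide only labels the branches "< 80K" and "> 80K".

```python
def classify(refund, marital_status, taxable_income_k):
    """Walk the decision tree from the root and return the predicted Cheat class."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: exactly 80K takes the YES branch; the slide leaves this open.
    return "Yes" if taxable_income_k >= 80 else "No"

# Training records from the slide: (Refund, Marital Status, Taxable Income in K, Cheat).
training_data = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# The test record: Refund = No, Married, 80K.
prediction = classify("No", "Married", 80)
print(prediction)  # the Married branch is a leaf, so this prints "No"
```

The test record never reaches the income split, so the boundary assumption does not affect its prediction.
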
  17. Examples of Classification Task • Predicting tumor cells as benign or malignant • Classifying credit card transactions as legitimate or fraudulent • Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil • Categorizing news stories as finance, weather, entertainment, sports, etc.
  18. Association rule mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
     Market-basket transactions:
     TID  Items
     1    Bread, Milk
     2    Bread, Diaper, Beer, Eggs
     3    Milk, Diaper, Beer, Coke
     4    Bread, Milk, Diaper, Beer
     5    Bread, Milk, Diaper, Coke
     Examples of association rules:
     {Diaper} → {Beer}
     {Milk, Bread} → {Eggs, Coke}
     {Beer, Bread} → {Milk}
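
Rules like these are usually scored by support and confidence. The slide does not define these measures; the sketch below uses the standard textbook definitions (support = fraction of transactions containing all items of the rule, confidence = support of the whole rule divided by support of the antecedent) on the five transactions listed above.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent. The antecedent must occur at least once."""
    return support(antecedent | consequent) / support(antecedent)

rules = [
    ({"Diaper"}, {"Beer"}),
    ({"Milk", "Bread"}, {"Eggs", "Coke"}),
    ({"Beer", "Bread"}, {"Milk"}),
]
for antecedent, consequent in rules:
    s = support(antecedent | consequent)
    c = confidence(antecedent, consequent)
    print(sorted(antecedent), "->", sorted(consequent), f"support={s:.2f} confidence={c:.2f}")
```

On this data, {Diaper} → {Beer} holds in 3 of the 4 transactions that contain Diaper, while {Milk, Bread} → {Eggs, Coke} is never satisfied, so a mining algorithm with any positive support threshold would discard it.
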
  19. Association Rule Discovery: Application 1 • Marketing and Sales Promotion: – Let the rule discovered be {Beer, … } --> {Potato Chips} – Potato Chips as consequent => Can be used to determine what should be done to boost its sales. – Beer in the antecedent => Can be used to see which products would be affected if the store discontinues selling beer. – Beer in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Beer to promote sale of Potato chips!
  20. Clustering • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that – Data points in one cluster are more similar to one another. – Data points in separate clusters are less similar to one another.
  21. Illustrating Clustering • Euclidean distance based clustering in 3-D space. • Intracluster distances are minimized. • Intercluster distances are maximized. [Figure]
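
One common way to obtain clusters with small intracluster Euclidean distances is k-means. The slides do not name a specific algorithm, so the sketch below is an illustrative choice: the 3-D points and the starting centroids are made up, and the function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move each
    centroid to the mean of its members, until nothing changes."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Made-up 3-D points forming two well-separated groups.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0, 0), (10, 11, 10)])
print(clusters)
```

k-means only finds a local optimum and depends on the starting centroids; with clearly separated groups like these it recovers the two groups directly.
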
  22. Clustering: Applications • Market Segmentation: – Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. • Document Clustering: – Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  23. Clustering of Microarray Data [Figure]
  24. Sequence analysis
     Sequence database | Sequence | Element (Transaction) | Event (Item)
     Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
     Web data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
     Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
     Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
     [Figure: a sequence drawn as an ordered list of elements (transactions), each containing one or more events (items) E1 to E4]
  25. [Figure] Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
  26. How does the human genome stack up?
     Organism                              Genome Size (Bases)  Estimated Genes
     Human (Homo sapiens)                  3 billion            25,000
     Laboratory mouse (M. musculus)        2.6 billion          30,000
     Mustard weed (A. thaliana)            100 million          25,000
     Roundworm (C. elegans)                97 million           19,000
     Fruit fly (D. melanogaster)           137 million          13,000
     Yeast (S. cerevisiae)                 12.1 million         6,000
     Bacterium (E. coli)                   4.6 million          3,200
     Human immunodeficiency virus (HIV)    9700                 9
  27. Why Is Finding a (15,4) Motif Difficult?
     atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
     acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
     tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
     gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
     tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
     gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
     cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
     aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
     ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
     ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
     Comparing two of the implanted instances (| marks a matching position):
     AgAAgAAAGGttGGG
     ..|..|||.|..|||
     cAAtAAAAcGGcGGG
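
The difficulty can be made concrete with Hamming distances. The sketch below compares the two aligned instances from the slide with each other and with a consensus pattern. The consensus `AAAAAAAAGGGGGGG` is an assumption inferred from the uppercase letters in the sequences; the slide does not state it. Reading "(15,4)" as "length 15, with 4 mutated positions per instance" is likewise an interpretation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

# The two instances aligned on the slide; lowercase letters mark mutated positions.
instance_1 = "AgAAgAAAGGttGGG"
instance_2 = "cAAtAAAAcGGcGGG"

# Assumption: the underlying motif, inferred from the uppercase letters.
consensus = "AAAAAAAAGGGGGGG"

to_consensus_1 = hamming(instance_1.upper(), consensus)
to_consensus_2 = hamming(instance_2.upper(), consensus)
between_instances = hamming(instance_1.upper(), instance_2.upper())

print(to_consensus_1, to_consensus_2, between_instances)
```

Each instance is close to the hidden pattern, but the two instances differ from each other at nearly half of their positions, so pairwise comparison of substrings gives only a weak signal.
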
  28. Anomaly Detection • Detect significant deviations from normal behavior • Applications: – Credit Card Fraud Detection – Network Intrusion Detection • Typical network traffic at the university level may reach over 100 million connections per day
  29. Social Network Analysis (Link Mining) • From the stage play Six Degrees of Separation: "I learned somewhere that, on this Earth, any two people are separated by only six others. Six degrees of separation: that is the distance between people on this planet." • Link: relationship among data objects • Link-Based Object Ranking (LBR): Exploit the link structure of a graph to order or prioritize the set of objects within the graph • Web information analysis methods such as PageRank and HITS are typical LBR approaches
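
As an illustration of link-based object ranking, here is a minimal PageRank sketch using power iteration. The four-page link graph is made up, the damping factor 0.85 is the commonly quoted default rather than a value from the slides, and pages without outgoing links are assumed to spread their rank evenly over all pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps each node to the list of nodes it
    links to; every link target must also appear as a key."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # Assumption: a page with no outgoing links shares its rank with all pages.
            targets = links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Made-up link graph: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C, which receives links from three pages, ends up ranked first; page D, which nobody links to, keeps only the baseline share.
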
  30. Complex Network • A complex network is a network (graph) that has certain non-trivial topological features that do not occur in simple networks. • Such non-trivial features include: a heavy tail in the degree distribution; a high clustering coefficient; assortativity or disassortativity among vertices (a positive or negative correlation between the degrees of connected nodes); and evidence of a hierarchical structure.
  31. Web Mining • Web Usage Mining • Web Structure Mining • Web Content Mining – Google has a precious asset: the Database of Intentions
  32. Graph Mining • Find frequent subgraphs in a given graph database • Graphs are ubiquitous – Web databases, XML databases – Cheminformatics (chemical compounds) – Bioinformatics (protein structures, pathways) – Workflow analysis – Social network analysis
  33. Example (Cheminformatics) [Figure: a graph dataset of three graphs (A), (B), (C) and the two frequent patterns (1), (2) found with a minimum support of 2]
  34. Data Mining Process • Define the problem • Build data mining database • Explore data • Prepare data for modeling • Build model • Evaluate model • Deploy model
  35. Examples of data mining in science & engineering • Data mining in Biomedical Engineering – “Robotic Arm Control Using Data Mining Techniques”
  36. Data Mining Process: 1. Define the problem • Control a robotic arm by means of EMG signals from the biceps and triceps muscles. • Electromyography (EMG) is a medical technique for evaluating and recording physiologic properties of muscles at rest and while contracting.
     Muscle contraction  Biceps  Triceps
     Supination          H       H
     Pronation           L       L
     Flexion             H       L
     Extension           L       H
     [Figure: arm positions for Supination, Pronation, Flexion, Extension]
  37. Data Mining Process: 2. Build a data mining database • The dataset includes 80 records. • There are two input variables: biceps signal and triceps signal. • There is one output variable with four possible values: supination, pronation, flexion, and extension.
  38. Data Mining Process: 3. Explore data [Figure: scatter plot of the triceps signal against record number, with points labeled Flexion, Extension, Supination, Pronation]
  39. Data Mining Process: 3. Explore data (cont.) [Figure: scatter plot of the biceps signal against record number, with points labeled Flexion, Extension, Supination, Pronation]
  40. Data Mining Process: 4. Prepare data for modeling • Build a dataset in the ARFF format:
     @relation EMG
     @attribute Triceps real
     @attribute Biceps real
     @attribute Move {Flexion,Extension,Pronation,Supination}
     @data
     13,31,Flexion
     14,30,Flexion
     10,31,Flexion
     13,29,Flexion
     ……
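
To show how little structure such a file needs, here is a minimal reader for the ARFF fragment above. It is a sketch that handles only what appears on the slide (`@relation`, `@attribute`, `@data`, comma-separated rows, `%` comments); it is not a full ARFF parser, and the function name is hypothetical. Only the four rows shown on the slide are included; the elided rows are left out.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, records)."""
    relation, attributes, records, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            record = {}
            for (name, kind), value in zip(attributes, values):
                numeric = kind.lower() in ("real", "numeric")
                record[name] = float(value) if numeric else value
            records.append(record)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, records

sample = """@relation EMG
@attribute Triceps real
@attribute Biceps real
@attribute Move {Flexion,Extension,Pronation,Supination}
@data
13,31,Flexion
14,30,Flexion
10,31,Flexion
13,29,Flexion
"""

relation, attributes, records = parse_arff(sample)
print(relation, [name for name, _ in attributes], records[0])
```

Numeric attributes are converted to floats, and the nominal `Move` attribute is kept as a string.
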
  41. Data Mining Process: 5. Build Model • Classification: OneR, Decision Tree, Naïve Bayesian, K-Nearest Neighbors, Neural Networks, Linear Discriminant Analysis, Support Vector Machines, …
  42. Data Mining Process: 5. Decision Tree 1. Find the attribute that best classifies the training data. 2. Use this attribute as the root of the decision tree. 3. Repeat the process for each subtree.
     Triceps?
       <= 37 → Triceps?
                 <= 14 → Flexion
                 > 14  → Pronation
       > 37  → Biceps?
                 <= 17 → Extension
                 > 17  → Supination
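
The learned tree can be expressed as two nested comparisons. The sketch below follows one reading of the slide's diagram (root split on Triceps at 37, then Triceps at 14 on the low side and Biceps at 17 on the high side, with leaves Flexion, Pronation, Extension, Supination from left to right). That pairing of thresholds and leaves is an assumption recovered from the flattened figure, and the function name is hypothetical.

```python
def classify_move(triceps, biceps):
    """Predict the arm movement from the two EMG signal values."""
    if triceps <= 37:
        # Low triceps activity: a second triceps threshold separates the two moves.
        return "Flexion" if triceps <= 14 else "Pronation"
    # High triceps activity: the biceps signal decides.
    return "Extension" if biceps <= 17 else "Supination"

# The four (Triceps, Biceps) samples listed in the ARFF excerpt, all labeled Flexion.
arff_samples = [(13, 31), (14, 30), (10, 31), (13, 29)]
predictions = [classify_move(t, b) for t, b in arff_samples]
print(predictions)
```

All four samples from the ARFF excerpt fall in the Flexion leaf, which is consistent with their labels.
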
  43. Data Mining Process: 6. Evaluate Models • Simple validation: training set and test set • n-fold cross-validation • Leave-one-out
     10-fold cross-validation accuracy:
     OneR                 76%
     Decision Tree        90%
     Naïve Bayesian       98%
     1-Nearest Neighbors  100%
     Neural Networks      100%
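
The fold bookkeeping behind n-fold cross-validation can be sketched as below. The generator name is hypothetical; the figure of 80 records comes from the dataset description earlier in the talk. Records are assigned to folds round-robin without shuffling, which is a simplification: in practice the data is usually shuffled or stratified by class first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n records.
    Every record appears in exactly one test fold."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# 10-fold cross-validation over the 80 EMG records.
splits = list(k_fold_indices(80, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))

# Leave-one-out is the special case k = n.
leave_one_out = list(k_fold_indices(5, 5))
```

A model is trained on each `train` list and scored on the matching `test` list; the reported accuracy is the average over the folds.
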
  44. Data Mining Process: 7. Deploy Model • The neural network model was successfully implemented inside the robotic arm.
  45. Data Mining Tools • Commercial tools: SAS Enterprise Miner, IBM Intelligent Miner, SPSS Clementine • Open source tools: – WEKA – RapidMiner • Poll: Data mining/analytic tools you used in 2006 • Good portals for data mining: KDnuggets
  46. Trends in Data Mining • Application exploration – development of application-specific data mining systems – Invisible data mining (mining as a built-in function) • Scalable data mining methods – Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns • Integration of data mining with database systems, data warehouse systems, and Web database systems
  47. Trends in Data Mining • Web mining • Social network analysis • Recommender systems: – US$1 Million prize for 10% improvement on Cinematch movie recommender system – Netflix – If You Liked This, You’re Sure to Love That (New York Times, Nov. 21, 2008)
  48. Trends in Data Mining • Spam filters: – Cost of Spam: – How much does spam cost you? Google will calculate – oi_calculator.html • Privacy protection and information security in data mining • Bioinformatics
  49. Some Research Results on DM • Localization system for WLAN • Rogue Access Point Detection System Based on Packet Analysis • Library Recommender System Based on Personal Ontology Model
  50. Localization system for WLAN • Enhancing the Accuracy of WLAN-based Location Determination Systems Using Predicted Orientation Information (Information Sciences, Vol. 178, No. 4, Feb. 15, 2008, pp. 1049–1068.) • We proposed the Accumulated Orientation Strength (AOS) algorithm, based on a Bayesian classifier, to predict the orientation of a mobile user and thereby improve the accuracy of the localization system.
  51. Rogue Access Point Detection System • A paper entitled "Detecting Rogue Access Points Using Client-side Bottleneck Bandwidth Analysis" has been accepted for publication in Computers & Security.
  52. Rogue Access Point Detection System • Big challenge in managing APs on a university campus: NCHU has a class B network with more than 50 departmental networks
  53. Rogue Access Point Detection System: Intruders from the Air [Figure]
  54. Rogue Access Point Detection System • Proposed a novel approach for detecting rogue access points by estimating client-side bottleneck bandwidth based on the ACK packet pair technique. • The system is implemented and tested in the Computer and Information Network Center at NCHU. • Experimental results show that the accuracy is higher than 90%.
  55. Library Recommender System Based on Personal Ontology Model (PORE) • A paper entitled "PORE: A Personal Ontology Recommender System for Digital Library" has been accepted for publication in The Electronic Library. • Proposed a personal ontology model for recommending books to library patrons based on keywords extracted from the books borrowed by the user
  56. Library Recommender System Based on Personal Ontology Model (PORE) • Collaborative filtering techniques are also incorporated into the PORE system • The PORE system is in service at the NCHU Library
  57. Conclusions • We are drowning in data, but starving for knowledge! • Data mining is the key to knowledge discovery. • Applications of data mining techniques can be found in almost every research area of computer science and engineering. • Even in a recession, data mining services are still in strong demand.
  58. References 1. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006. 2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann, 2005. 3. Neil Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004. 4. Duncan Watts, Six Degrees (Chinese edition: 6個人的小世界), 大塊文化 (Locus Publishing), 2004. 5. Mark Buchanan, Nexus (Chinese edition: 連結), 天下文化 (Commonwealth Publishing), 2003.