Algorithmics Algorithmics Research on Knowledge Research on ...



  1. Algorithmics Research on Knowledge Discovery and Data Mining
     © Vladimir Estivill-Castro, School of Computing and Information Technology
     Outline
       • Motivation
       • What is Data Mining
       • Extent of the Field
       • Web Mining, Text Mining, Software Engineering and Data Mining
       • Example of algorithm
       • Attribute Oriented Generalization
  2. Motivation
       • The problem of data overload looms ominously ahead.
       • Our technology to analyze data and understand massive datasets lags far behind our technology to gather and store data.
  3. Interest for Knowledge Discovery
       • Emerged in databases
           • Large logs of transactions
           • Credit card transactions
           • Supermarket transactions
       • Data Mining (1960s)
           • “Massaging the data so statistics would reflect my preconceived hypothesis”
       • Deductive vs. inductive science
           • Data => Hypothesis vs. hypothesis validated by data
     Knowledge Discovery
       The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in large data sets.
       • Data: the geo-referenced layers.
       • Information: the average population per administrative region.
       • Knowledge: the patterns of growth of population densities and valid explanations for them.
  4. Aspects of Knowledge Discovery
       • Nontrivial
           • it goes beyond computing closed-form quantities or evaluating models.
       • Potentially useful
           • for the user
       • Understandable
           • simple, descriptive, informative
       • Valid
           • the discovered patterns are true, with some degree of certainty, on unseen data
  5. What is Data Mining?
       • Different answers
           1. Using computers to make sense of large volumes of data
                • using a set of fundamental data manipulation tasks
                    • classification
                    • association rules / basket analysis
                • what makes it different from Statistics?
           2. A task in the process of knowledge discovery
           3. A multidisciplinary field
                • what makes it different from Statistics?
     Mining Different Kinds of Knowledge
       • Description
       • Estimation
       • Prediction
       • Affinity grouping
       • Classification
       • Clustering
  6. Mining Different Kinds of Knowledge (by J. Han)
       • Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions.
       • Association: Rules like “inside(x, city) → near(x, highway)”.
       • Classification: Classify data based on the values in a classifying attribute, e.g., classify countries based on climate.
       • Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns.
       • Trend and deviation analysis: Find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis.
       • Pattern-directed analysis: Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.
     Example: Retail
       • Bar-code technology makes it possible to collect and store massive amounts of sales data (the basket data).
       • An information-driven marketing process demands mining association rules over basket data.
       • “98% of customers that purchase tires and auto accessories also get service done”
  7. Application of Rules
       • Cross-marketing and attached mailing applications.
           • customized and designed junk mail
           • catalog design, add-on sales.
           • store layout
           • customer segmentation based on buying patterns.
     Illustration
       Customer-by-product table (columns: Product A, Product B, Product C, Product D)
         Customer 1   X X X
         Customer 2   X X
         Customer 3   X X X
       • “90% of purchases that have bread and butter also include milk”.
       • It is a rule of the form A ⇒ B (90%).
           • A is the antecedent.
           • B is the consequent.
           • There is a confidence value associated with the rule.
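The confidence of a rule A ⇒ B can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own code: the five baskets are hypothetical (the product columns of the table above were not recoverable), and `count`, `support` and `confidence` are names chosen here. Confidence is taken as the number of baskets containing A and B divided by the number containing A.

```python
from fractions import Fraction
from typing import FrozenSet, List

Basket = FrozenSet[str]


def count(baskets: List[Basket], items: FrozenSet[str]) -> int:
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)


def support(baskets: List[Basket], items: FrozenSet[str]) -> Fraction:
    """Fraction of all baskets that contain `items`."""
    return Fraction(count(baskets, items), len(baskets))


def confidence(baskets: List[Basket], antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> Fraction:
    """Fraction of the baskets containing A that also contain B."""
    with_antecedent = count(baskets, antecedent)
    if with_antecedent == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return Fraction(count(baskets, antecedent | consequent), with_antecedent)


# Hypothetical basket data, invented for illustration only.
BASKETS = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

A = frozenset({"bread", "butter"})
B = frozenset({"milk"})
rule_confidence = confidence(BASKETS, A, B)  # 2 of the 3 baskets with A also hold B
rule_support = support(BASKETS, A | B)       # 2 of the 5 baskets hold A and B
```

Exact fractions are used so that a value such as 2/3 is not blurred by floating-point rounding.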
  8. Usage of mining for rules
       • Find all rules that have Diet Coke as the consequent.
           • Helps plan what should be done to boost the sales of Diet Coke.
       • Find all the rules that have bagels in the antecedent.
           • Determines what products may be impacted if the store discontinues selling bagels.
       • Find all rules that have sausage in the antecedent and mustard in the consequent
           • what items should be sold with sausages to make it highly likely that mustard will also be sold.
       • Find all rules relating items located on shelves A and B
           • understand if the distance affects the sales of items from both shelves.
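Once rules have been mined, the queries listed on this slide are simple filters over the rule set. The sketch below assumes a hypothetical `Rule` record and four made-up rules; none of the confidence values come from the deck.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    confidence: float


def with_consequent(rules: List[Rule], item: str) -> List[Rule]:
    """Rules that predict `item` (e.g. what drives Diet Coke sales)."""
    return [r for r in rules if item in r.consequent]


def with_antecedent(rules: List[Rule], item: str) -> List[Rule]:
    """Rules that depend on `item` (e.g. what is affected if bagels go)."""
    return [r for r in rules if item in r.antecedent]


def linking(rules: List[Rule], left: str, right: str) -> List[Rule]:
    """Rules with `left` in the antecedent and `right` in the consequent."""
    return [r for r in rules if left in r.antecedent and right in r.consequent]


# Hypothetical mined rules, for illustration only.
RULES = [
    Rule(frozenset({"chips"}), frozenset({"diet coke"}), 0.6),
    Rule(frozenset({"bagels"}), frozenset({"cream cheese"}), 0.8),
    Rule(frozenset({"sausage", "buns"}), frozenset({"mustard"}), 0.7),
    Rule(frozenset({"sausage"}), frozenset({"diet coke"}), 0.3),
]

diet_coke_rules = with_consequent(RULES, "diet coke")
bagel_rules = with_antecedent(RULES, "bagels")
sausage_mustard = linking(RULES, "sausage", "mustard")
```

For the sausage and mustard query, the other items in the antecedent of each returned rule (here `buns`) are the candidates to sell alongside sausages.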
  9. The Knowledge Discovery Process
       Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Arrangement) → Transformed Data → (Data Mining) → Patterns → (Interpretation) → Knowledge
     A formal process (and a standard)
  10. Multi-disciplinary
       • Finding useful patterns in data is known by different names among different communities:
           • data mining (statistics, databases)
           • knowledge discovery
           • information discovery, information harvesting
           • data archeology
           • pattern processing
     A multi-disciplinary field
       Databases, Statistics, Visualization, Machine Learning, Logic Programming
  11. Knowledge Discovery
       • Discovery of useful knowledge from data.
           • Databases
           • Machine Learning
           • Pattern Recognition
           • Artificial Intelligence
           • Knowledge Acquisition
           • Scientific Discovery
           • High-Performance Computing
           • Algorithms (Analysis and Design)
           • Statistics
     Differences with Statistics
  12. A significant amount of overlap
       • Statistics
           • probabilistic models
           • descriptive, non-parametric, exploratory
           • mathematically sound (advanced)
           • informative and predictive
       • KDDM
           • concern for computability and scalability
           • interpret data
           • understandable
           • data on electronic media (and structured)
       Overlap: Exploratory Analysis of Massive Data Sets
     Extent of the Field
  13. Large number of conferences and specialized workshops
       • KDD - ACM
           • but now IEEE conference on KDD
           • SIAM conference on KDD
           • PAKDD (2001 - 5th Pacific Rim Conf. on KDD)
           • PKDD (2000 - 4th European Conf. on Principles and Practice of KDD)
       • Database conferences
           • SIGMOD
           • VLDB
       • Artificial Intelligence and Machine Learning
           • ICML
           • ECML
     Successful Example
       • Recent analysis of the Bank of America loan database
           • 250 fields per customer
           • back to 1914!
           • over nine million records
       • A clustering tool was used to automatically segment customers into groups with many similar categorical attributes.
           • 14 groups identified, only one could be explained
       • The interesting cluster had 2 properties
           • 39% of customers had business and personal accounts with the bank
           • this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home equity loan offer
  14. Text Mining
       • With / without Natural Language Processing
       • Different from Information Retrieval/Extraction
       • Results:
           • There exists a text (or a set of texts) such as speak-much-of Miss Lewinsky & speak-little-of Mrs. Clinton
           • There exists another text such as speak-a-lot-of Mrs. Clinton & do-not-speak-of Miss Lewinsky
     Spatial Data Mining
       • Data Mining
           • system generates hypothesis
           • search (and visualization) in abstract space
           • inductive generalizations exceeding content of database
       • GIS
           • user generates hypothesis
           • visualization in geographical space
           • shows what’s inside the data
       Hard to visualize multivariate dependencies on a map.
  15. Spatial Data Mining
       [Figures: Thematic Map, Decision Tree]
     Spatial Associations
       FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course"
       FROM Washington_Golf_courses, Washington
       WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km")
         AND Washington.CFCC <> "D81"
       IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC
       SET SUPPORT THRESHOLD 0.5
  16. Spatial Associations & Hierarchy of Spatial Relationships
       • Spatial association: Association relationship containing spatial predicates, e.g., close_to, intersect, contains, etc.
       • Topological relations:
           • intersects, overlaps, disjoint, etc.
       • Spatial orientations:
           • left_of, west_of, under, etc.
       • Distance information:
           • close_to, within_distance, etc.
       • Hierarchy of spatial relationship:
           • “g_close_to”: near_by, touch, intersect, contain, etc.
           • First search for a rough relationship and then refine it.
     Example: Spatial Association Rule Mining
       • “What kinds of spatial objects are close to each other in B.C.?”
           • Kinds of objects: cities, water, forests, usa_boundary, mines, etc.
       • Rules mined:
           • is_a(x, large_town) ^ intersect(x, highway) → adjacent_to(x, water). [7%, 85%]
           • is_a(x, large_town) ^ adjacent_to(x, georgia_strait) → close_to(x, u.s.a.). [1%, 78%]
       • Mining method: Apriori + multi-level association + geo-spatial algorithms (from rough to high precision).
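The "first rough, then refine" strategy can be sketched as a two-stage filter. This is a simplified assumption, not the method from the slides: objects are reduced to points with coordinates in kilometres, the rough predicate (named `g_close_to` here after the slide's generalized predicate) is a cheap bounding-square test, and the refined predicate `close_to` is an exact Euclidean distance check. All coordinates are invented.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def g_close_to(a: Point, b: Point, d: float) -> bool:
    """Rough test: b lies in the square of half-width d around a."""
    return abs(a[0] - b[0]) <= d and abs(a[1] - b[1]) <= d


def close_to(a: Point, b: Point, d: float) -> bool:
    """Refined test: Euclidean distance between a and b is at most d."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= d


def close_pairs(left: Dict[str, Point], right: Dict[str, Point],
                d: float) -> List[Tuple[str, str]]:
    """Run the cheap test on every pair, the exact test only on survivors."""
    rough = [(l, r) for l in left for r in right
             if g_close_to(left[l], right[r], d)]
    return [(l, r) for l, r in rough if close_to(left[l], right[r], d)]


# Hypothetical coordinates in km, for illustration only.
TOWNS = {"T1": (0.0, 0.0), "T2": (10.0, 10.0)}
WATERS = {"W1": (2.5, 2.5), "W2": (1.0, 2.0), "W3": (20.0, 20.0)}

pairs = close_pairs(TOWNS, WATERS, 3.0)
```

`T1`/`W1` passes the rough test but fails the exact one (distance about 3.54 km), which is the kind of candidate the refinement step is meant to discard.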
  17. Spatial Classification
       • Generalization-based induction
       • Interactive classification
     Visualization of Predicted Distribution
  18. Spatial Prediction and Trend Analysis
       • Spatial trend predictive modeling (Ester et al.’97):
           • Discover centers: local maxima of some non-spatial attribute.
           • Determine the (theoretical) trend of some non-spatial attribute when moving away from the centers.
           • Discover deviations (from the theoretical trend).
           • Explain the deviations.
       • Example: Trend of unemployment rate change according to the distance to Munich.
       • Similar modeling can be used to study the trend of temperature with altitude, the degree of pollution in relation to regions of population density, etc.
     Spatial Clustering
       • How can we cluster points?
       • What are the distinct features of the clusters?
       There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!
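The "determine the trend, then discover deviations" steps can be illustrated with a least-squares line. This is only a sketch under strong assumptions: the trend is taken to be linear in the distance from a single center, the tolerance is arbitrary, and the distances and rates are invented. It is not the method of Ester et al.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deviations(xs: List[float], ys: List[float], tolerance: float) -> List[int]:
    """Indices of observations further than `tolerance` from the fitted trend."""
    slope, intercept = fit_line(xs, ys)
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if abs(y - (slope * x + intercept)) > tolerance]


# Hypothetical data: distance from the center (km) and an attribute value (%).
DISTANCE_KM = [0.0, 10.0, 20.0, 30.0, 40.0]
RATE = [4.0, 5.0, 6.0, 9.0, 8.0]

slope, intercept = fit_line(DISTANCE_KM, RATE)
outliers = deviations(DISTANCE_KM, RATE, tolerance=1.0)
```

The flagged observations are the ones a later step would try to explain; the explanation itself is outside this sketch.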
  19. Mining in Image & Raster Databases
       • Magellan project (Fayyad et al.’96, JPL).
           • identify volcanoes on the surface of Venus
           • over 30,000 high-resolution images
           • Resolution accuracy: 80%
           • 3 steps: data focusing, feature extraction, and classification learning
       • POSS-II project (Palomar Observatory Sky Survey II)
           • 2 × 10^9 stellar objects (galaxies, stars, etc.) classified
           • Resolution: one magnitude better than in previous studies
           • Classification accuracy: no normalization 75%, with normalization 94%, and compared with neural networks.
       • QuakeFinder (Stolorz et al.’96): Find earthquakes from space.
           • using statistics, massive parallelism, and global optimization
     Basket Analysis / Link Analysis
       • Who bought what together
       • What are related books
       • Finding who has a fax at home
           • Get phone data for businesses with a fax number
           • Get usage records of lines to find who dials a business fax number from home for longer than 20 seconds
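The fax-at-home example amounts to joining call records against a list of business fax numbers and filtering on call duration. The sketch below uses a hypothetical `Call` record and invented line identifiers; only the 20-second cut-off comes from the slide.

```python
from typing import List, NamedTuple, Set


class Call(NamedTuple):
    caller: str    # line that placed the call
    callee: str    # number that was dialled
    seconds: int   # call duration


def likely_home_fax(calls: List[Call], business_fax_numbers: Set[str],
                    home_lines: Set[str], min_seconds: int = 20) -> List[str]:
    """Home lines that called a business fax number for longer than min_seconds."""
    return sorted({
        c.caller for c in calls
        if c.caller in home_lines
        and c.callee in business_fax_numbers
        and c.seconds > min_seconds
    })


# Hypothetical usage records, for illustration only.
BUSINESS_FAX = {"fax-100", "fax-200"}
HOME_LINES = {"home-1", "home-2", "home-3"}
CALLS = [
    Call("home-1", "fax-100", 45),    # long call to a fax: likely a home fax
    Call("home-2", "fax-200", 5),     # too short: probably a misdial
    Call("home-3", "voice-300", 120), # long call, but not to a fax number
    Call("office-9", "fax-100", 60),  # not a home line
]

candidates = likely_home_fax(CALLS, BUSINESS_FAX, HOME_LINES)
```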
  20. Crime detection
       • crime investigation (e.g., the Oklahoma City bombing)
       • fraud detection [Italy KDD-99 San Diego] but also Australian Taxation Office and HIC [PAKDD-99]
           • Characterization of Doctor Shoppers
     Software Engineering and Data Mining
  21. KDD techniques used in Software Engineering: Clustering
       • (Mancoridis) Clustering for graph partitioning towards high cohesion and low coupling.
           • Graph partitioning algorithms using hill-climbing
       • (Ouyang) Clustering towards improving reusability in the design phase.
       • (Anquetil) Re-modularization of software
     Decomposition into subsystems - Non-OO case
       • Input from ISA
           • Software S made of a set of programs P and a set of files F.
       • Representation of the database of the system
           • A set alpha: A = {(P', F') : P' ⊆ P, F' ⊆ F}
       • Creation of a grouping table (programs on rows, files in columns)
           • T = t1, t2, ..., t|F|
           • ti = { p ∈ P | p uses fi }
               p1 p2 p3 p4 p5
           f1  X X X X
           f2  X X
           f3  X X X
           f4  X X X X
       • KDDM association rules
           • c_p3 = 3/4, c_p4 = 3/3
           • “c% of the programs that use file X also use files Y and Z”.
           • p4 ⇒ p3 (100%)
           • Results in a group of programs that use a similar set of files.
       • Collect and summarize results
           • Association rules are used to guide clustering
           • A set of programs and files makes a subsystem
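The grouping table and the two confidences quoted on the slide can be reproduced with a small sketch. The exact cell positions of the table were lost, so the placement of marks below is a hypothetical one, chosen only to match the row sizes (4, 2, 3, 4) and the quoted values c_p3 = 3/4 and c_p4 = 3/3; the function names are also assumptions.

```python
from fractions import Fraction
from typing import Dict, FrozenSet

# t_i = set of programs that use file f_i (hypothetical placement of marks).
GROUPING_TABLE: Dict[str, FrozenSet[str]] = {
    "f1": frozenset({"p1", "p2", "p3", "p4"}),
    "f2": frozenset({"p1", "p3"}),
    "f3": frozenset({"p3", "p4", "p5"}),
    "f4": frozenset({"p2", "p3", "p4", "p5"}),
}


def files_using(table: Dict[str, FrozenSet[str]], program: str) -> int:
    """Number of transactions t_i that contain `program`."""
    return sum(1 for programs in table.values() if program in programs)


def confidence(table: Dict[str, FrozenSet[str]], a: str, b: str) -> Fraction:
    """Confidence of a => b: share of the t_i containing a that also contain b."""
    with_a = [programs for programs in table.values() if a in programs]
    if not with_a:
        raise ValueError(f"{a} does not appear in any transaction")
    return Fraction(sum(1 for programs in with_a if b in programs), len(with_a))


c_p3 = confidence(GROUPING_TABLE, "p3", "p4")  # p3 => p4
c_p4 = confidence(GROUPING_TABLE, "p4", "p3")  # p4 => p3
```

A confidence of 1 for p4 ⇒ p3 means every file touched by p4 is also touched by p3, which is the evidence used to place both programs in the same candidate subsystem.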
  22. Decomposition into Subsystems - OO
       • Using metrics, OO, KDDM and clustering to split with low coupling.
       • Approach:
           • Create a list of related entity pairs (classes, methods and objects)
           • Use OO metrics to create sets of metrics CBOSet, RFCSet, and DACSet (DAC: Data Abstraction Coupling): CBO_d, CBO_d’, and DAC_t
           • Generate matrices of classes that interact (interaction matrices)
           • Apply KDDM algorithms (association rules)
           • Use hierarchical clustering on the coefficients of the association rules to produce a hierarchical decomposition.
       [Figure labels: CBOSet’, Interaction Matrix]
     Case Study - OO, System II: Mozilla (Netscape Communicator) and its HTML Composer/Editor
       SUMMARY Symbol Table Statistics
                              HTML Editor      Mozilla
                              (30 projects)    (1223 projects)
         Files                111              6713
         Includes             697              36492
         Macros               147              27024
         Functions            60               15898
         Types                0                3176
         Variables            79               11151
         Enums                4                715
         Userdef              0                0
         Classes              90               5933
         Instance Variables   415              23757
         Methods              1320             41015
         Friends              25               273
         Localdefs            46               290
       SUMMARY File Type Statistics (number of files)
                              HTML Editor      Mozilla
                              (30 projects)    (1223 projects)
         HTML                 4                763
         Header               65               3331
         IDL Interface        4
         Image                57
         Implementation       42               3382
         Make                                  117
         Project Description  38               1309
  23. Graph Mining and WEB Mining
       • Step beyond link analysis
       • Finding sub-graphs that are similar
           • Chemical molecules
           • CAD/CAM parts (designs)
           • Patterns of use
       • Analysis of WEB data
           • Text Mining / Multimedia Mining
     Time Series Mining
       • Predicting the stock market
       • Monitoring condition of equipment, weather, pilot behavior during long flights.
  24. An example of generalization
       • For attribute-oriented data
     Automated Attribute Oriented Induction
       • Illustration of basic strategies
           • based on a hierarchy of concepts,
           • where discovery is initiated by a query for a rule with necessary conditions for a class.
  25. The data set (relation STUDENT)
       NAME      STATUS     MAJOR       BIRTH PLACE  GPA
       Anderson  M.A.       History     Vancouver    3.5
       Bach      junior     Math        Calgary      3.7
       Carlton   junior     Lib. Arts   Edmonton     2.6
       Fraser    M.S.       Physics     Ottawa       3.9
       Gupta     PhD        Math        Bombay       3.3
       Hart      sophomore  Chemistry   Richmond     2.7
       Jackson   senior     Computing   Victoria     3.5
       Liu       PhD        Biology     Shanghai     3.4
       Wang      M.S.       Statistics  Nanjing      3.2
       Wise      freshman   Literature  Toronto      3.9
     The concept hierarchy for attribute STATUS
       ANY
         undergraduate: freshman, sophomore, junior, senior
         graduate: M.A., M.S., PhD
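A concept hierarchy like the one for STATUS can be stored as a child-to-parent map. The sketch below is one possible encoding, with `PARENT`, `ancestors` and `is_a` being names chosen here; the NAME/STATUS pairs are the ten rows of the STUDENT table above.

```python
from typing import Dict, List

# Child -> parent links of the STATUS concept hierarchy.
PARENT: Dict[str, str] = {
    "freshman": "undergraduate", "sophomore": "undergraduate",
    "junior": "undergraduate", "senior": "undergraduate",
    "M.A.": "graduate", "M.S.": "graduate", "PhD": "graduate",
    "undergraduate": "ANY", "graduate": "ANY",
}


def ancestors(value: str) -> List[str]:
    """Concepts above `value`, from most specific to ANY."""
    chain = []
    while value in PARENT:
        value = PARENT[value]
        chain.append(value)
    return chain


def is_a(value: str, concept: str) -> bool:
    """True if `value` is `concept` or lies below it in the hierarchy."""
    return value == concept or concept in ancestors(value)


STATUS_BY_NAME = {
    "Anderson": "M.A.", "Bach": "junior", "Carlton": "junior",
    "Fraser": "M.S.", "Gupta": "PhD", "Hart": "sophomore",
    "Jackson": "senior", "Liu": "PhD", "Wang": "M.S.", "Wise": "freshman",
}

graduates = [name for name, status in STATUS_BY_NAME.items()
             if is_a(status, "graduate")]
```

Selecting on a high-level concept such as "graduate" then reduces to an `is_a` test per tuple.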
  26. The discovery query
       • In relation STUDENT, learn characteristic rule for STATUS = “graduate” in relevance to NAME, BIRTH PLACE, GPA
           • (threshold value of 3)
     The induction
       • 1) Select “graduate” students.
       NAME      STATUS  MAJOR       BIRTH PLACE  GPA  VOTE
       Anderson  M.A.    History     Vancouver    3.5  1
       Fraser    M.S.    Physics     Ottawa       3.9  1
       Gupta     PhD     Math        Bombay       3.3  1
       Liu       PhD     Biology     Shanghai     3.4  1
       Monk      PhD     Computing   Victoria     3.8  1
       Wang      M.S.    Statistics  Nanjing      3.2  1
  27. The induction
       • 2) Eliminate attribute “STATUS” because it is “graduate” for all students.
       NAME      MAJOR       BIRTH PLACE  GPA  VOTE
       Anderson  History     Vancouver    3.5  1
       Fraser    Physics     Ottawa       3.9  1
       Gupta     Math        Bombay       3.3  1
       Liu       Biology     Shanghai     3.4  1
       Monk      Computing   Victoria     3.8  1
       Wang      Statistics  Nanjing      3.2  1
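Step 2 can be sketched as a check followed by a projection: if every value of an attribute generalizes to the same concept, the attribute carries no information for the rule and can be dropped. The rows are the six graduate tuples shown on the slides; the helper names and the tuple layout are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

COLUMNS = ("NAME", "STATUS", "MAJOR", "BIRTH PLACE", "GPA", "VOTE")
ROWS = [
    ("Anderson", "M.A.", "History", "Vancouver", 3.5, 1),
    ("Fraser", "M.S.", "Physics", "Ottawa", 3.9, 1),
    ("Gupta", "PhD", "Math", "Bombay", 3.3, 1),
    ("Liu", "PhD", "Biology", "Shanghai", 3.4, 1),
    ("Monk", "PhD", "Computing", "Victoria", 3.8, 1),
    ("Wang", "M.S.", "Statistics", "Nanjing", 3.2, 1),
]
STATUS_PARENT: Dict[str, str] = {"M.A.": "graduate", "M.S.": "graduate", "PhD": "graduate"}


def is_constant_after_generalization(rows: List[tuple], columns: Tuple[str, ...],
                                     attribute: str, parent: Dict[str, str]) -> bool:
    """True if all values of `attribute` map to one concept one level up."""
    idx = columns.index(attribute)
    return len({parent.get(row[idx], row[idx]) for row in rows}) == 1


def drop_attribute(rows: List[tuple], columns: Tuple[str, ...], attribute: str):
    """Project `attribute` out of the relation."""
    idx = columns.index(attribute)
    new_columns = columns[:idx] + columns[idx + 1:]
    return [row[:idx] + row[idx + 1:] for row in rows], new_columns


if is_constant_after_generalization(ROWS, COLUMNS, "STATUS", STATUS_PARENT):
    reduced_rows, reduced_columns = drop_attribute(ROWS, COLUMNS, "STATUS")
```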
  28. The induction
       • 3) Generalize on the smallest decomposable components.
           • Not illustrated here, because we do not have composite attributes.
       • 4) If there is a large set of distinct values for an attribute but there is no higher level concept provided for the attribute, the attribute should be removed.
           • Attribute “NAME” satisfies this.
  29. The induction
       • 2) Eliminate attribute “NAME” because it has too many values.
       NAME      MAJOR       BIRTH PLACE  GPA  VOTE
       Anderson  History     Vancouver    3.5  1
       Fraser    Physics     Ottawa       3.9  1
       Gupta     Math        Bombay       3.3  1
       Liu       Biology     Shanghai     3.4  1
       Monk      Computing   Victoria     3.8  1
       Wang      Statistics  Nanjing      3.2  1
     A generalization
       MAJOR       BIRTH PLACE  GPA  VOTE
       History     Vancouver    3.5  2
       Physics     Ottawa       3.9  3
       Math        Bombay       3.3  1
       Biology     Shanghai     3.4  1
       Computing   Victoria     3.8  2
       Statistics  Nanjing      3.2  1
       The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
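The attribute-removal rule can be written as a one-line test. In this sketch "a large set of distinct values" is read as "more distinct values than the threshold of 3 from the discovery query", which is an interpretation rather than something the slides state; the function name is also chosen here.

```python
from typing import Iterable


def should_remove(values: Iterable[str], has_hierarchy: bool, threshold: int) -> bool:
    """Remove an attribute with many distinct values and no concept hierarchy."""
    return len(set(values)) > threshold and not has_hierarchy


NAMES = ["Anderson", "Fraser", "Gupta", "Liu", "Monk", "Wang"]
MAJORS = ["History", "Physics", "Math", "Biology", "Computing", "Statistics"]
THRESHOLD = 3  # threshold value given with the discovery query

# NAME has six distinct values and no higher-level concept: remove it.
remove_name = should_remove(NAMES, has_hierarchy=False, threshold=THRESHOLD)
# MAJOR also has six distinct values, but a concept tree exists: keep it and generalize.
remove_major = should_remove(MAJORS, has_hierarchy=True, threshold=THRESHOLD)
```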
  30. Concept tree ascension
       • 4) If there is a higher level concept in the concept tree for an attribute, the substitution of the higher level concept generalizes the tuple.
       Concept tree for MAJOR
         Science: Statistics, Computing, Biology, Math, Physics
         Arts: History
     A generalization of each attribute
       MAJOR    BIRTH PLACE       GPA        VOTE
       Arts     British Columbia  excellent  35
       Science  Ontario           excellent  10
       Science  British Columbia  excellent  30
       Science  India             good       10
       Science  China             good       15
       The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
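Concept tree ascension with vote accumulation can be sketched as follows. To stay within what the slides provide, only MAJOR is climbed (its concept tree is shown above; no tree is given for BIRTH PLACE or GPA), and the input votes are the MAJOR and VOTE columns of the earlier "A generalization" table. The function name `ascend` is an assumption.

```python
from collections import Counter
from typing import Dict

# Child -> parent links of the MAJOR concept tree.
MAJOR_PARENT: Dict[str, str] = {
    "Statistics": "Science", "Computing": "Science", "Biology": "Science",
    "Math": "Science", "Physics": "Science", "History": "Arts",
}


def ascend(votes_by_value: Dict[str, int], parent: Dict[str, str]) -> Counter:
    """Replace each value by its parent concept and add up the votes of
    tuples that become identical."""
    merged = Counter()
    for value, vote in votes_by_value.items():
        merged[parent.get(value, value)] += vote
    return merged


VOTES = {"History": 2, "Physics": 3, "Math": 1,
         "Biology": 1, "Computing": 2, "Statistics": 1}

generalized = ascend(VOTES, MAJOR_PARENT)
```

The total number of votes is unchanged by the merge, which is what lets the final votes be read as proportions.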
  31. Threshold Control
       • If the number of distinct values of an attribute is larger than the generalization threshold value, further generalization on this attribute should be performed.
           • Not the case for “Major”
           • It is the case for “Birth Place”
     Further generalization
       MAJOR    BIRTH PLACE  GPA        VOTE
       Arts     Canada       excellent  35
       Science  Canada       excellent  40
       Science  foreign      good       25
       The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
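Threshold control can be sketched as a loop that keeps climbing one attribute while it has more distinct values than the threshold. The input is the five-row generalized table from the previous slide; the BIRTH PLACE hierarchy (provinces to Canada, other countries to "foreign") is inferred from the result shown above and is an assumption, as are the function name and the fallback to "ANY".

```python
from collections import Counter
from typing import Dict, Tuple

Row = Tuple[str, str, str]  # (MAJOR, BIRTH PLACE, GPA)

PLACE_PARENT: Dict[str, str] = {
    "British Columbia": "Canada", "Ontario": "Canada",
    "India": "foreign", "China": "foreign",
}


def generalize(rows: Dict[Row, int], col: int, parent: Dict[str, str],
               threshold: int) -> Dict[Row, int]:
    """Climb column `col` one level at a time while it has more than
    `threshold` distinct values, accumulating votes of merged tuples."""
    while len({key[col] for key in rows}) > threshold:
        merged: Counter = Counter()
        for key, vote in rows.items():
            higher = parent.get(key[col], "ANY")  # no parent known: go to ANY
            merged[key[:col] + (higher,) + key[col + 1:]] += vote
        if dict(merged) == dict(rows):  # nothing left to climb
            break
        rows = dict(merged)
    return rows


TABLE: Dict[Row, int] = {
    ("Arts", "British Columbia", "excellent"): 35,
    ("Science", "Ontario", "excellent"): 10,
    ("Science", "British Columbia", "excellent"): 30,
    ("Science", "India", "good"): 10,
    ("Science", "China", "good"): 15,
}

# BIRTH PLACE (column 1) has 4 distinct values, above the threshold of 3.
further = generalize(TABLE, col=1, parent=PLACE_PARENT, threshold=3)
# MAJOR (column 0) has only 2 distinct values, so it is left untouched.
unchanged = generalize(TABLE, col=0, parent={}, threshold=3)
```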
  32. Transformation to rules
       MAJOR           BIRTH PLACE  GPA        VOTE
       {Arts, Science}  Canada       excellent  75
       Science         foreign      good       25
       A graduate student is either (with 75% probability) a Canadian with excellent GPA or (with 25% probability) a foreign student, majoring in science with a good GPA.
     Questions?
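The last step, turning the generalized table into a rule with probabilities, can be sketched by merging tuples that differ only in MAJOR and dividing each vote by the total. The input is the three-row table from the "Further generalization" slide; the function name and output layout are assumptions of this sketch.

```python
from fractions import Fraction
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str, str, int]  # (MAJOR, BIRTH PLACE, GPA, VOTE)


def to_rule(rows: List[Row]) -> List[Tuple[List[str], str, str, Fraction]]:
    """Merge rows that agree on BIRTH PLACE and GPA, collecting their majors
    into a set, and convert accumulated votes into probabilities."""
    total = sum(vote for *_, vote in rows)
    majors: Dict[Tuple[str, str], Set[str]] = {}
    votes: Dict[Tuple[str, str], int] = {}
    for major, place, gpa, vote in rows:
        key = (place, gpa)
        majors.setdefault(key, set()).add(major)
        votes[key] = votes.get(key, 0) + vote
    return [(sorted(majors[key]), key[0], key[1], Fraction(votes[key], total))
            for key in votes]


TABLE: List[Row] = [
    ("Arts", "Canada", "excellent", 35),
    ("Science", "Canada", "excellent", 40),
    ("Science", "foreign", "good", 25),
]

rule = to_rule(TABLE)
```

Each entry of `rule` is one disjunct of the characteristic rule: the first covers Canadian students with excellent GPA in either major group, the second covers foreign science students with good GPA.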