6107 Ch4 V2


Published in: Economy & Finance, Technology

    1. 1. Introduction to Data Mining Chapter 4
    2. 2. Chapter 4 Outline <ul><ul><li>Background </li></ul></ul><ul><ul><li>Information is Power </li></ul></ul><ul><ul><li>Knowledge is Power </li></ul></ul><ul><ul><li>Data Mining </li></ul></ul>
    3. 3. Introduction
    4. 4. Information is Power <ul><li>Relevant </li></ul><ul><li>Right Information </li></ul><ul><li>Globalised world </li></ul><ul><li>Vast amount of information available </li></ul>
    5. 5. What is Information? <ul><li>A collection of data </li></ul><ul><li>The act of human analysis and interpretation of activities </li></ul><ul><li>Decomposing it into various components and tackling them </li></ul>
    6. 6. What is Knowledge? <ul><li>The act of human synthesis and evaluation of information </li></ul><ul><li>Integration of the relevant components and form as a relevant whole system. </li></ul>
    7. 7. Why Mine Data? Commercial Viewpoint <ul><li>Lots of data is being collected and warehoused </li></ul><ul><ul><li>Web data, e-commerce </li></ul></ul><ul><ul><li>Purchases at department/grocery stores </li></ul></ul><ul><ul><li>Bank/Credit Card transactions </li></ul></ul><ul><li>Computers have become cheaper and more powerful </li></ul><ul><li>Competitive pressure is strong </li></ul><ul><ul><li>Provide better, customized services for an edge (e.g. in Customer Relationship Management) </li></ul></ul>
    8. 8. Why Mine Data? Scientific Viewpoint <ul><li>Data collected and stored at enormous speeds (GB/hour) </li></ul><ul><ul><li>remote sensors on a satellite </li></ul></ul><ul><ul><li>telescopes scanning the skies </li></ul></ul><ul><ul><li>microarrays generating gene expression data </li></ul></ul><ul><ul><li>scientific simulations generating terabytes of data </li></ul></ul><ul><li>Traditional techniques infeasible for raw data </li></ul><ul><li>Data mining may help scientists </li></ul><ul><ul><li>in classifying and segmenting data </li></ul></ul><ul><ul><li>in Hypothesis Formation </li></ul></ul>
    9. 9. Data Mining Definition I <ul><li>The nontrivial extraction of hidden, previously unidentified, and potentially valuable knowledge from data </li></ul><ul><li>The use of a variety of techniques, such as neural networks, decision trees, or standard statistical techniques, to identify nuggets of information or decision-making knowledge in bodies of data, and to extract them in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation. </li></ul>
    10. 10. Data Mining Definition II <ul><li>Finding hidden information in a database </li></ul>
    11. 11. Hidden Information <ul><li>Years of experience </li></ul><ul><li>Great secret recipes </li></ul><ul><li>Success factors </li></ul>
    12. 12. Origins of Data Mining <ul><li>Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems </li></ul><ul><li>Traditional techniques may be unsuitable due to </li></ul><ul><ul><li>Enormity of data </li></ul></ul><ul><ul><li>High dimensionality of data </li></ul></ul><ul><ul><li>Heterogeneous, distributed nature of data </li></ul></ul>[Figure labels: Machine Learning/Pattern Recognition, Statistics/AI, Database systems, Data Mining]
    13. 13. What is (not) Data Mining? <ul><li>What is Data Mining? </li></ul><ul><ul><li>Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area) </li></ul></ul><ul><ul><li>Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com) </li></ul></ul><ul><li>What is not Data Mining? </li></ul><ul><ul><li>Look up a phone number in a phone directory </li></ul></ul><ul><ul><li>Query a Web search engine for information about “Amazon” </li></ul></ul>
    14. 14. Database Processing vs. Data Mining Processing <ul><li>Database Processing </li></ul><ul><ul><li>Query: well defined, expressed in SQL </li></ul></ul><ul><ul><li>Data: operational data </li></ul></ul><ul><ul><li>Output: precise, a subset of the database </li></ul></ul><ul><li>Data Mining Processing </li></ul><ul><ul><li>Query: poorly defined, no precise query language </li></ul></ul><ul><ul><li>Data: not operational data </li></ul></ul><ul><ul><li>Output: fuzzy, not a subset of the database </li></ul></ul>
    15. 15. Query Examples <ul><li>Database </li></ul><ul><ul><li>Find all customers who have purchased bread. </li></ul></ul><ul><ul><li>Find all credit applicants with the surname Lee. </li></ul></ul><ul><ul><li>Identify customers who have purchased more than $100,000 in the last year. </li></ul></ul><ul><li>Data Mining </li></ul><ul><ul><li>Find all items which are frequently purchased with bread. (association rules) </li></ul></ul><ul><ul><li>Find all credit applicants who are good credit risks. (classification) </li></ul></ul><ul><ul><li>Identify customers with similar eating habits. (clustering) </li></ul></ul>
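The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the baskets, the function names, and the `min_count` threshold are hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"customer": "c1", "items": {"bread", "butter", "jam"}},
    {"customer": "c2", "items": {"bread", "butter"}},
    {"customer": "c3", "items": {"milk", "cereal"}},
    {"customer": "c4", "items": {"bread", "jam"}},
]

def customers_who_bought(item):
    """Database-style query: a precise condition whose result is a subset of stored rows."""
    return [b["customer"] for b in baskets if item in b["items"]]

def items_often_bought_with(item, min_count=2):
    """Mining-style question: 'frequently' needs a threshold (an assumption here),
    and the answer is a derived summary, not a subset of the stored rows."""
    counts = Counter()
    for b in baskets:
        if item in b["items"]:
            counts.update(b["items"] - {item})
    return {other: n for other, n in counts.items() if n >= min_count}

print(customers_who_bought("bread"))
print(items_often_bought_with("bread"))
```

The first function could be written directly in SQL; the second depends on what "frequently" is taken to mean, which is why the slide calls the mining query poorly defined.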
    16. 16. Data Mining Models and Tasks
    17. 17. Classification: Definition <ul><li>Given a collection of records (training set) </li></ul><ul><ul><li>Each record contains a set of attributes; one of the attributes is the class. </li></ul></ul><ul><li>Find a model for the class attribute as a function of the values of the other attributes. </li></ul><ul><li>Goal: previously unseen records should be assigned a class as accurately as possible. </li></ul><ul><ul><li>A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. </li></ul></ul>
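A minimal sketch of the train/test procedure described above, assuming a 1-nearest-neighbour rule as the model (a simple form of memory-based reasoning, chosen here only for brevity). The records, the 30% test fraction, and all function names are hypothetical.

```python
import random

def split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, attributes):
    """Assign the class of the closest training record (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], attributes)),
    )
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, attrs) == label for attrs, label in test)
    return correct / len(test)

# Hypothetical, well-separated toy records of the form ((x, y), class).
records = [((x, y), "low") for x in range(3) for y in range(3)] + \
          [((x, y), "high") for x in range(10, 13) for y in range(10, 13)]

train, test = split(records)
print(len(train), len(test), accuracy(train, test))
```

The test records are never shown to the model while it is built, so the accuracy figure estimates how well previously unseen records would be classified.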
    18. 18. Illustrating Classification Task
    19. 19. Examples of Classification Task <ul><li>Predicting tumor cells as benign or malignant </li></ul><ul><li>Classifying credit card transactions as legitimate or fraudulent </li></ul><ul><li>Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil </li></ul><ul><li>Categorizing news stories as finance, weather, entertainment, sports, etc </li></ul>
    20. 20. Classification Techniques <ul><li>Decision Tree based Methods </li></ul><ul><li>Rule-based Methods </li></ul><ul><li>Memory based reasoning </li></ul><ul><li>Neural Networks </li></ul><ul><li>Naïve Bayes and Bayesian Belief Networks </li></ul><ul><li>Support Vector Machines </li></ul>
    21. 21. Example of a Decision Tree [Figure: training data and the decision tree model built from it. Splitting attributes: Refund (categorical; branches Yes / No), MarSt (categorical; branches Married / Single, Divorced), TaxInc (continuous; branches < 80K / > 80K). Leaves carry the class labels YES and NO.]
    22. 22. Another Example of Decision Tree [Figure: a second tree for the same training data, with nodes MarSt, Refund, and TaxInc; branches Married / Single, Divorced, Yes / No, and < 80K / > 80K; leaves YES and NO.] There could be more than one tree that fits the same data!
    23. 23. Decision Tree Classification Task Decision Tree
    24. 24. Apply Model to Test Data: start from the root of the tree. [Figure: a test record and the decision tree with nodes Refund, MarSt, and TaxInc.]
    29. 29. Apply Model to Test Data [Figure: the test record followed from the root of the tree to a leaf.] Assign Cheat to “No”
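Applying a tree to a test record amounts to following one branch per node until a leaf is reached. The sketch below assumes a particular layout for the tree (Refund at the root, then MarSt, then TaxInc), since the figure itself did not survive in this transcript; the test record and the handling of an income of exactly 80K are also assumptions.

```python
def classify(record):
    """Walk the assumed tree from the root and return the predicted Cheat label."""
    # Root: Refund. Assumed: a refund leads directly to a "No" leaf.
    if record["Refund"] == "Yes":
        return "No"
    # Second level: marital status. Assumed: Married leads to a "No" leaf.
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: split on taxable income in thousands.
    # The slide only labels "< 80K" and "> 80K"; treating exactly 80 as "No"
    # is an arbitrary choice made for this sketch.
    return "Yes" if record["TaxInc"] > 80 else "No"

# Hypothetical test record; the one on the slide is not recoverable.
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))
```

Each `if` corresponds to one node visited on the way down, which is what the sequence of slides animates.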
    30. 30. Decision Tree Classification Task Decision Tree
    31. 31. What is Cluster Analysis? <ul><li>Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups </li></ul><ul><li>Intra-cluster distances are minimized </li></ul><ul><li>Inter-cluster distances are maximized </li></ul>
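One way to make the two distance criteria concrete is to compare a sensible grouping with a poor one on the same points. The points, both groupings, and the use of mean pairwise Euclidean distance are assumptions made for this sketch; other measures are equally possible.

```python
from itertools import combinations
from math import dist

def mean_intra(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def mean_inter(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(pairs) / len(pairs)

# Hypothetical points forming two visually obvious groups.
good = [[(0, 0), (0, 1), (1, 0)], [(5, 5), (5, 6), (6, 5)]]
# The same six points, grouped badly.
bad = [[(0, 0), (5, 5), (1, 0)], [(0, 1), (5, 6), (6, 5)]]

for name, clustering in (("good", good), ("bad", bad)):
    print(name, round(mean_intra(clustering), 2), round(mean_inter(clustering), 2))
```

The good grouping has the smaller intra-cluster average and the larger inter-cluster average, which is the property the slide describes.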
    32. 32. Applications of Cluster Analysis <ul><li>Understanding </li></ul><ul><ul><li>Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations </li></ul></ul><ul><li>Summarization </li></ul><ul><ul><li>Reduce the size of large data sets </li></ul></ul>
    33. 33. What is not Cluster Analysis? <ul><li>Supervised classification </li></ul><ul><ul><li>Have class label information </li></ul></ul><ul><li>Simple segmentation </li></ul><ul><ul><li>Dividing students into different registration groups alphabetically, by last name </li></ul></ul><ul><li>Results of a query </li></ul><ul><ul><li>Groupings are a result of an external specification </li></ul></ul><ul><li>Graph partitioning </li></ul><ul><ul><li>Some mutual relevance and synergy, but areas are not identical </li></ul></ul>
    34. 34. Notion of a Cluster can be Ambiguous [Figure: how many clusters? Panels labelled Two Clusters, Four Clusters, and Six Clusters.]
    35. 35. Types of Clusterings <ul><li>A clustering is a set of clusters </li></ul><ul><li>Important distinction between hierarchical and partitional sets of clusters </li></ul><ul><li>Partitional Clustering </li></ul><ul><ul><li>A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset </li></ul></ul><ul><li>Hierarchical clustering </li></ul><ul><ul><li>A set of nested clusters organized as a hierarchical tree </li></ul></ul>
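The relationship between the two types can be illustrated with a small agglomerative procedure: it produces a sequence of nested clusterings, and any single level of that sequence is a partitional clustering. The single-link merge rule, the restriction to 1-D points, and the sample data are all assumptions chosen to keep the sketch short.

```python
def agglomerate(points):
    """Single-link agglomerative clustering on 1-D points.

    Returns every level of the hierarchy, from all singletons
    down to one cluster containing everything.
    """
    clusters = [[p] for p in sorted(points)]
    levels = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # With sorted 1-D data, the two closest clusters are always neighbours,
        # and their single-link distance is the gap between them.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
        levels.append([c[:] for c in clusters])
    return levels

levels = agglomerate([1, 2, 6, 7, 15])  # hypothetical data
for level in levels:
    print(level)
```

Each printed level is a partition in which every point belongs to exactly one subset, and each cluster at one level is contained in a cluster at the next, which is the nesting a dendrogram draws.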
    36. 36. Partitional Clustering [Figure: the original points and a partitional clustering of them.]
    37. 37. Hierarchical Clustering [Figure: a traditional and a non-traditional hierarchical clustering, each shown with its dendrogram.]
    38. 38. Association Rules <ul><li>Association rules are a data mining technique and complement market basket analysis. </li></ul><ul><li>All association rules are unidirectional and take the following form: </li></ul><ul><li>Left-hand side IMPLIES Right-hand side </li></ul><ul><li>Both the left-hand side and the right-hand side of the rule may contain multiple items or combinations of items, such as the following: </li></ul><ul><li>Yellow Peppers IMPLIES Red Peppers, Bananas, and Bakery </li></ul><ul><li>Associations are written as A → B, where A is called the antecedent or left-hand side (LHS) and B is called the consequent or right-hand side (RHS). </li></ul><ul><ul><li>Ex: “If people buy a printer then they buy a cartridge” </li></ul></ul><ul><ul><ul><li>The antecedent is “buy printer” and the consequent is “buy cartridge” </li></ul></ul></ul>
    39. 39. Association Rules <ul><li>Market Basket Analysis </li></ul><ul><li>Requires a list of transactions and what was purchased in each one. </li></ul><ul><li>Example: </li></ul><ul><li>Transaction 1: Frozen Pizza, Cola, Milk </li></ul><ul><li>Transaction 2: Milk, Potato Chips </li></ul><ul><li>Transaction 3: Cola, Frozen Pizza </li></ul><ul><li>Transaction 4: Milk, Pretzels </li></ul><ul><li>Transaction 5: Cola, Pretzels </li></ul>
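    40. 40. Association Rules: co-occurrence counts for the five transactions <table><tr><th></th><th>Frozen Pizza</th><th>Milk</th><th>Cola</th><th>Potato Chips</th><th>Pretzels</th></tr><tr><th>Frozen Pizza</th><td>2</td><td>1</td><td>2</td><td>0</td><td>0</td></tr><tr><th>Milk</th><td>1</td><td>3</td><td>1</td><td>1</td><td>1</td></tr><tr><th>Cola</th><td>2</td><td>1</td><td>3</td><td>0</td><td>1</td></tr><tr><th>Potato Chips</th><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr><tr><th>Pretzels</th><td>0</td><td>1</td><td>1</td><td>0</td><td>2</td></tr></table>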
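The counts on this slide can be recomputed from the five transactions listed on slide 39. The sketch below is one straightforward way to do that; the function name and the nested-dictionary layout are choices made here, not part of the slides.

```python
# The five transactions from slide 39.
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]
items = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]

def cooccurrence(transactions, items):
    """Count, for every pair of items, how many baskets contain both.

    The diagonal entry table[a][a] is the number of baskets containing a.
    """
    table = {a: {b: 0 for b in items} for a in items}
    for basket in transactions:
        for a in basket:
            for b in basket:
                table[a][b] += 1
    return table

table = cooccurrence(transactions, items)
for a in items:
    print(f"{a:13}", [table[a][b] for b in items])
```

The table is symmetric because co-occurrence has no direction; direction only enters when a rule is evaluated with the confidence measure.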
    41. 41. Association Rules <ul><li>Measures of Association </li></ul><ul><ul><li>Support: the percentage of baskets in the analysis where the rule is true, that is, where both the left-hand side and the right-hand side of the association are found. </li></ul></ul><ul><ul><li>Confidence </li></ul></ul><ul><ul><ul><li>The percentage of baskets containing the left-hand side item that also contain the right-hand side item. Unlike support, confidence is the probability that the right-hand side item is present given that the left-hand side item is in the basket. </li></ul></ul></ul><ul><ul><ul><li>Calculated as a ratio: </li></ul></ul></ul><ul><ul><ul><ul><li>(frequency of A and B)/(frequency of A) </li></ul></ul></ul></ul>
    42. 42. Association Rules <ul><li>Measures of Association </li></ul><ul><li>The support measure for a rule: </li></ul><ul><ul><li>“Cola IMPLIES Frozen Pizza” is 40% </li></ul></ul><ul><ul><li>“Frozen Pizza IMPLIES Cola” is 40% </li></ul></ul><ul><li>The support measure for a single item: </li></ul><ul><ul><li>“Milk” is 60% </li></ul></ul><ul><li>(Note: support considers only the combination and not the direction.) </li></ul>
    43. 43. Association Rules <ul><li>Measures of Association </li></ul><ul><ul><li>Confidence </li></ul></ul><ul><ul><li>“Milk IMPLIES Potato Chips” has confidence: </li></ul></ul><ul><ul><li>= (frequency of A and B) / (frequency of A) </li></ul></ul><ul><ul><li>= 20% / 60% </li></ul></ul><ul><ul><li>= 33% </li></ul></ul>
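The support and confidence figures on the last two slides can be checked against the five transactions from slide 39. This sketch uses exact fractions to avoid rounding noise; the function names are choices made here.

```python
from fractions import Fraction

# The five transactions from slide 39.
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs):
    """(frequency of A and B) / (frequency of A) for the rule A IMPLIES B."""
    return support(lhs | rhs) / support(lhs)

print(f"support(Cola, Frozen Pizza) = {float(support({'Cola', 'Frozen Pizza'})):.0%}")
print(f"support(Milk) = {float(support({'Milk'})):.0%}")
print(f"confidence(Milk IMPLIES Potato Chips) = {float(confidence({'Milk'}, {'Potato Chips'})):.0%}")
```

Swapping the two sides of a rule leaves `support` unchanged but can change `confidence`, because only the denominator depends on which side is the antecedent.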
    44. 44. Data Mining vs. KDD <ul><li>Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. </li></ul><ul><li>Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process. </li></ul>
    45. 45. KDD Process <ul><li>Selection ( Pre-Mining 1): Obtain data from various sources. </li></ul><ul><li>Preprocessing (Pre-Mining 2) : Cleanse data. </li></ul><ul><li>Transformation (Pre-Mining 3): Convert to common format. Transform to new format. </li></ul><ul><li>Data Mining: Obtain desired results. </li></ul><ul><li>Interpretation/Evaluation (Post-Mining): Present results to user in meaningful manner. </li></ul>Modified from [FPSS96C]
    46. 46. KDD Process Ex: Web Log <ul><li>Selection: </li></ul><ul><ul><li>Select log data (dates and locations) to use </li></ul></ul><ul><li>Preprocessing: </li></ul><ul><ul><li>Remove identifying URLs </li></ul></ul><ul><ul><li>Remove error logs </li></ul></ul><ul><li>Transformation: </li></ul><ul><ul><li>Sessionize (sort and group) </li></ul></ul><ul><li>Data Mining: </li></ul><ul><ul><li>Identify and count patterns </li></ul></ul><ul><ul><li>Construct data structure </li></ul></ul><ul><li>Interpretation/Evaluation: </li></ul><ul><ul><li>Identify and display frequently accessed sequences. </li></ul></ul><ul><li>Potential User Applications: </li></ul><ul><ul><li>Cache prediction </li></ul></ul><ul><ul><li>Personalisation </li></ul></ul>
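The web-log steps can be sketched end to end on a toy log. Everything concrete here is an assumption: the record layout `(user, timestamp, url, status)`, treating a status of 400 or above as an error, treating all of a user's requests as one session, and counting consecutive page pairs as the "patterns". The step that removes identifying URLs is not modelled.

```python
from collections import Counter, defaultdict

# Selection: hypothetical log records (user, timestamp, url, status).
log = [
    ("u1", 1, "/home", 200),
    ("u1", 2, "/products", 200),
    ("u2", 1, "/home", 200),
    ("u1", 3, "/cart", 200),
    ("u2", 2, "/missing", 404),
    ("u2", 3, "/products", 200),
    ("u3", 1, "/home", 200),
    ("u3", 2, "/products", 200),
]

def preprocess(records):
    """Preprocessing: drop error entries (assumed to be status >= 400)."""
    return [r for r in records if r[3] < 400]

def sessionize(records):
    """Transformation: sort by user and time, then group each user's pages."""
    sessions = defaultdict(list)
    for user, _, url, _ in sorted(records, key=lambda r: (r[0], r[1])):
        sessions[user].append(url)
    return dict(sessions)

def count_pairs(sessions):
    """Data mining: count how often one page is followed directly by another."""
    counts = Counter()
    for pages in sessions.values():
        counts.update(zip(pages, pages[1:]))
    return counts

sessions = sessionize(preprocess(log))
pair_counts = count_pairs(sessions)

# Interpretation/Evaluation: show the most frequently accessed sequence.
print(pair_counts.most_common(1))
```

A cache-prediction application could use `pair_counts` to prefetch the page most often requested after the current one.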
    47. 47. Data Mining Development <ul><li>Similarity Measures </li></ul><ul><li>Hierarchical Clustering </li></ul><ul><li>IR Systems </li></ul><ul><li>Imprecise Queries </li></ul><ul><li>Textual Data </li></ul><ul><li>Web Search Engines </li></ul><ul><li>Bayes Theorem </li></ul><ul><li>Regression Analysis </li></ul><ul><li>EM Algorithm </li></ul><ul><li>K-Means Clustering </li></ul><ul><li>Time Series Analysis </li></ul><ul><li>Neural Networks </li></ul><ul><li>Decision Tree Algorithms </li></ul><ul><li>Algorithm Design Techniques </li></ul><ul><li>Algorithm Analysis </li></ul><ul><li>Data Structures </li></ul><ul><li>Relational Data Model </li></ul><ul><li>SQL </li></ul><ul><li>Association Rule Algorithms </li></ul><ul><li>Data Warehousing </li></ul><ul><li>Scalability Techniques </li></ul>
    48. 48. Data mining: What it can’t do <ul><li>tell the value of the patterns to the organization </li></ul><ul><li>replace skilled business analysts or managers </li></ul><ul><li>automatically discover solutions without guidance </li></ul>