
© P. Giorgini, F. Dalpiaz 1



  1. 1. Data Mining – Day 1 Fabiano Dalpiaz Department of Information and Communication Technology University of Trento - Italy Database e Business Intelligence A.A. 2007-2008
  2. 2. Acknowledgements <ul><li>This presentation is partially based on the slides for the book: </li></ul><ul><li>Data Mining: Concepts and Techniques, 2nd ed. </li></ul><ul><li>Jiawei Han and Micheline Kamber </li></ul>
  3. 3. Two-day outline <ul><li>Data Mining and KDD </li></ul><ul><li>Why Data Mining </li></ul><ul><li>Applications of Data Mining </li></ul><ul><li>Data Preprocessing </li></ul><ul><li>Data Mining techniques </li></ul><ul><li>Visualization of the results </li></ul><ul><li>Summary </li></ul>
  4. 4. Data Mining and KDD KDD Conference Logo
  5. 5. Looking for knowledge <ul><li>The Explosive Growth of Data </li></ul><ul><ul><li>The World Wide Web </li></ul></ul><ul><ul><li>Business: e-commerce, transactions, stocks, … </li></ul></ul><ul><ul><li>Science: Remote sensing, bioinformatics, scientific simulation </li></ul></ul><ul><ul><li>Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co </li></ul></ul><ul><li>We are drowning in data, but starving for knowledge! </li></ul><ul><li>Avoid data tombs </li></ul><ul><li>“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets. </li></ul>
  6. 6. What is Data Mining? <ul><li>Data mining (knowledge discovery from data) </li></ul><ul><ul><li>Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data </li></ul></ul><ul><li>Alternative names </li></ul><ul><ul><li>Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. </li></ul></ul><ul><ul><li>Are simple search engines data mining? Are queries data mining? Are expert systems data mining? </li></ul></ul>
  7. 7. Knowledge Discovery (KDD) Process Data sources Data Cleaning Data Warehouse Data Mining Knowledge Pattern Evaluation Selection Data Integration Task-relevant Data
  8. 8. Data Mining and Business Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems Quantity of data
  9. 9. Data Mining: confluence of multiple disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithms Other Disciplines Visualization
  10. 10. Why Data Mining?
  11. 11. Why is Data Mining so complex? A matter of data dimensions <ul><li>Tremendous amount of data </li></ul><ul><ul><li>Walmart – Customer buying patterns – a data warehouse 7.5 Terabytes large in 1995 </li></ul></ul><ul><ul><li>VISA – Detecting credit card interoperability issues – 6800 payment transactions per second </li></ul></ul><ul><li>High-dimensionality of data </li></ul><ul><ul><li>Many dimensions to be combined together </li></ul></ul><ul><ul><li>Data cube example: time, location, product → sales </li></ul></ul><ul><li>High complexity of data </li></ul><ul><ul><li>Time-series data, temporal data, sequence data </li></ul></ul><ul><ul><li>Structured data, graphs, social networks and multi-linked data </li></ul></ul><ul><ul><li>Spatial, spatiotemporal, multimedia, text and Web data </li></ul></ul>
  12. 12. What does Data Mining provide me with? (1) <ul><li>Multidimensional concept description: Characterization and discrimination </li></ul><ul><ul><li>Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions </li></ul></ul><ul><ul><li>Characterization describes things in the same class, discrimination describes how to separate different classes </li></ul></ul><ul><li>Frequent patterns, association, correlation vs. causality </li></ul><ul><ul><li>Wine → Spaghetti [0.3% of all basket cases, 75% of cases when tomato sauce is bought] </li></ul></ul><ul><ul><li>Is this correlation or not? </li></ul></ul>
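The two bracketed figures on the wine and spaghetti rule read like the usual pair of measures attached to an association rule: support (how often the items occur together) and confidence (how often the consequent occurs when the antecedent does). A minimal sketch of both follows; the baskets and the function names `support` and `confidence` are hypothetical and not taken from the slides.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"wine", "spaghetti", "tomato sauce"},
    {"wine", "spaghetti"},
    {"wine", "bread"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"wine", "spaghetti"})            # 2 of 4 baskets
rule_confidence = confidence(baskets, {"wine"}, {"spaghetti"})    # 2 of 3 wine baskets
print(rule_support, rule_confidence)
```

A high confidence alone does not show that wine causes spaghetti purchases, which is the point of the slide's closing question.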
  13. 13. What does Data Mining provide me with? (2) <ul><li>Classification and prediction </li></ul><ul><ul><li>Construct models (functions) that describe and distinguish classes or concepts for future prediction </li></ul></ul><ul><ul><li>E.g., classify countries based on climate, or classify cars based on gas mileage </li></ul></ul><ul><ul><li>Predict some unknown or missing numerical values </li></ul></ul><ul><li>Cluster analysis </li></ul><ul><ul><li>Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns </li></ul></ul><ul><ul><li>Maximizing intra-class similarity and minimizing inter-class similarity </li></ul></ul>
  14. 14. What does Data Mining provide me with? (3) <ul><li>Outlier analysis </li></ul><ul><ul><li>Outlier: Data object that does not comply with the general behavior of the data </li></ul></ul><ul><ul><li>Fraud detection is the main application area </li></ul></ul><ul><ul><li>Noise or exception? </li></ul></ul><ul><li>Trend and evolution analysis </li></ul><ul><ul><li>Trend and deviation: e.g., regression analysis </li></ul></ul><ul><ul><li>Sequential pattern mining: e.g., digital camera → large SD memory </li></ul></ul><ul><ul><li>Periodicity analysis </li></ul></ul><ul><ul><li>Similarity-based analysis </li></ul></ul>
  15. 15. Applications of Data Mining Market Analysis and Management <ul><li>Data sources: </li></ul><ul><ul><li>credit card transactions, loyalty cards, smart cards, discount coupons, ... </li></ul></ul><ul><li>Target marketing </li></ul><ul><ul><li>Find clusters of “model” customers who share the same characteristics: </li></ul></ul><ul><ul><ul><li>Geographics (lives in Rome, lives in Trentino) </li></ul></ul></ul><ul><ul><ul><li>Demographics (married, aged 21-35, at least one child, family income more than 40.000€/year) </li></ul></ul></ul><ul><ul><ul><li>Psychographics (likes new products, consistently uses the Web) </li></ul></ul></ul><ul><ul><ul><li>Behaviors (searches for information on the Internet, always defends her decisions) </li></ul></ul></ul><ul><ul><li>Determine customer purchasing patterns over time </li></ul></ul>
  16. 16. Applications of Data Mining Market Analysis and Management <ul><li>Cross-market analysis </li></ul><ul><ul><li>Find associations between product sales, and predict based on such associations </li></ul></ul><ul><ul><li>Compare the sales in the US and in Italy, find associations in old products and predict whether new ones will succeed </li></ul></ul><ul><li>Customer profiling </li></ul><ul><ul><li>What types of customers buy what products </li></ul></ul><ul><ul><li>Customers aged 20-30 with income > 20K€ will buy product A </li></ul></ul><ul><li>Customer requirement analysis </li></ul><ul><ul><li>Identify the best products for different groups of customers </li></ul></ul><ul><ul><li>Predict what factors will attract new customers </li></ul></ul>
  17. 17. Applications of Data Mining Corporate Analysis <ul><li>Finance Planning and Asset Evaluation </li></ul><ul><ul><li>Cash flow prediction and analysis </li></ul></ul><ul><ul><li>Cross-sectional and time-series analysis (financial ratio, trend analysis) </li></ul></ul><ul><li>Resource Planning </li></ul><ul><ul><li>summarize and compare the resources and spending </li></ul></ul><ul><li>Competition </li></ul><ul><ul><li>monitor competitors and market directions </li></ul></ul><ul><ul><li>group customers into classes and define a class-based pricing procedure </li></ul></ul><ul><ul><li>set pricing strategy in a highly competitive market </li></ul></ul><ul><li>Other examples? </li></ul>
  18. 18. What’s next? <ul><li>Data Preprocessing </li></ul><ul><ul><li>Why is it needed? </li></ul></ul><ul><ul><li>Data cleaning </li></ul></ul><ul><ul><li>Data integration and transformation </li></ul></ul><ul><ul><li>Data reduction </li></ul></ul><ul><ul><li>Discretization and Concept hierarchy </li></ul></ul><ul><li>Data Mining techniques </li></ul><ul><ul><li>Frequent patterns, association rules </li></ul></ul><ul><ul><li>Classification and prediction </li></ul></ul><ul><ul><li>Cluster Analysis </li></ul></ul><ul><li>Visualization of the results </li></ul><ul><li>Summary </li></ul>Are you sleeping?
  19. 19. Data Preprocessing
  20. 20. Why Data Preprocessing? <ul><li>Data in the real world is dirty </li></ul><ul><ul><li>incomplete : lacking attribute values, lacking certain attributes of interest, or containing only aggregate data </li></ul></ul><ul><ul><ul><li>e.g., occupation=“ ”, birthdate=“31/12/2099” </li></ul></ul></ul><ul><ul><li>noisy : containing errors or outliers </li></ul></ul><ul><ul><ul><li>e.g., Salary=“-10” </li></ul></ul></ul><ul><ul><li>inconsistent : containing discrepancies in codes or names </li></ul></ul><ul><ul><ul><li>e.g., Age=“42” Birthday=“03/07/1997” ( we are in 2007!! ) </li></ul></ul></ul><ul><ul><ul><li>e.g., Was rating “1,2,3”, now rating “A, B, C” </li></ul></ul></ul><ul><ul><ul><li>e.g., discrepancy between duplicate records. In one copy of the data customer A has to pay 200.000€, in the second copy of the data A does not have to pay anything. </li></ul></ul></ul>
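The inconsistent Age/Birthday record above can be caught with a simple cross-field check. The sketch below assumes the slide's date is written day/month/year (3 July 1997) and picks an arbitrary reference date in 2007; the names `age_on` and `is_consistent` are illustrative only.

```python
from datetime import date


def age_on(birthday, today):
    """Age in completed years on the given reference date."""
    years = today.year - birthday.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def is_consistent(record, today):
    """True when the stored age matches the age implied by the birthday."""
    return record["age"] == age_on(record["birthday"], today)


# The slide's example record; the day/month order is an assumption.
record = {"age": 42, "birthday": date(1997, 7, 3)}
today = date(2007, 12, 31)  # assumed reference date ("we are in 2007")

print(age_on(record["birthday"], today), is_consistent(record, today))
```

A check like this only flags the discrepancy; deciding which of the two fields is wrong still needs another source or a human.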
  21. 21. Why is data dirty? <ul><li>Incomplete data may come from </li></ul><ul><ul><li>“Not applicable” data value when collected </li></ul></ul><ul><ul><li>Different considerations between the time when the data was collected and when it is analyzed. </li></ul></ul><ul><ul><li>Human/hardware/software problems </li></ul></ul><ul><li>Noisy data (incorrect values) may come from </li></ul><ul><ul><li>Faulty data collection instruments </li></ul></ul><ul><ul><li>Human or computer error at data entry </li></ul></ul><ul><ul><li>Errors in data transmission </li></ul></ul><ul><li>Inconsistent data may come from </li></ul><ul><ul><li>Different data sources </li></ul></ul><ul><ul><li>Functional dependency violation (e.g., modify some linked data) </li></ul></ul>
  22. 22. Why Is Data Preprocessing Important?
  23. 23. Data Preprocessing 1. Data cleaning – missing values <ul><ul><li>“ Data cleaning is one of the three biggest problems in data warehousing”— Ralph Kimball </li></ul></ul><ul><li>Fill in missing values </li></ul><ul><ul><li>Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“” </li></ul></ul><ul><ul><li>Ignore the record (is it always feasible?) </li></ul></ul><ul><ul><li>Manually filling missing attributes </li></ul></ul><ul><ul><li>Automatically insert a constant </li></ul></ul><ul><ul><li>Automatically insert the mean value (relative to the record class) </li></ul></ul><ul><ul><li>Most probable value: make some inference! </li></ul></ul>
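As a rough illustration of the "mean value relative to the record class" option, the sketch below fills John's missing salary with the mean salary of records that share his occupation. Treating occupation as the class, the extra records, and the function name `fill_with_class_mean` are all assumptions made for the example.

```python
from statistics import mean


def fill_with_class_mean(records, attr, class_attr):
    """Return copies of the records where a missing `attr` (None) is
    replaced by the mean of `attr` over records of the same class."""
    known = {}
    for r in records:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    filled = []
    for r in records:
        r = dict(r)  # copy so the input is left untouched
        if r[attr] is None and r[class_attr] in known:
            r[attr] = mean(known[r[class_attr]])
        filled.append(r)
    return filled


# Only John's record comes from the slide; the others are invented.
records = [
    {"name": "John", "occupation": "Lawyer", "age": 28, "salary": None},
    {"name": "Anna", "occupation": "Lawyer", "age": 35, "salary": 60000},
    {"name": "Marco", "occupation": "Lawyer", "age": 41, "salary": 80000},
    {"name": "Sara", "occupation": "Teacher", "age": 30, "salary": 30000},
]

filled = fill_with_class_mean(records, "salary", "occupation")
print(filled[0]["salary"])
```

If no record of the same class has a known value, the sketch leaves the gap in place rather than guessing.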
  24. 24. Data Preprocessing 1. Data cleaning – binning <ul><li>Handle noisy data </li></ul><ul><ul><li>Binning, clustering, regression (no details here) </li></ul></ul><ul><li>Binning </li></ul><ul><li>Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26 </li></ul><ul><li>Partition into equal-frequency (equi-depth) bins: </li></ul><ul><ul><li>Bin 1: 4, 8, 9 </li></ul></ul><ul><ul><li>Bin 2: 15, 21, 21 </li></ul></ul><ul><ul><li>Bin 3: 24, 25, 26 </li></ul></ul><ul><li>Smoothing by bin means: </li></ul><ul><ul><li>Bin 1: 7, 7, 7 </li></ul></ul><ul><ul><li>Bin 2: 19, 19, 19 </li></ul></ul><ul><ul><li>Bin 3: 25, 25, 25 </li></ul></ul>
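The binning example above can be reproduced in a few lines. This is a minimal sketch that assumes the number of values divides evenly into the number of bins; the function names are illustrative.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("this sketch assumes an even split into bins")
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]  # prices from the slide
bins = equal_frequency_bins(prices, 3)
smoothed = smooth_by_bin_means(bins)
print(bins)      # [[4, 8, 9], [15, 21, 21], [24, 25, 26]]
print(smoothed)  # [[7.0, 7.0, 7.0], [19.0, 19.0, 19.0], [25.0, 25.0, 25.0]]
```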
  25. 25. Data Preprocessing 1. Data cleaning – clustering noise
  26. 26. Data Preprocessing 2. Integration and transformation <ul><li>Data Integration combines data from multiple sources into a coherent store </li></ul><ul><li>Schema integration </li></ul><ul><ul><li>Integrate metadata from different sources </li></ul></ul><ul><ul><li>A.cust-id ≡ B.cust-number </li></ul></ul><ul><li>Entity identification problem: </li></ul><ul><ul><li>Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton </li></ul></ul><ul><li>Detecting and resolving data value conflicts </li></ul><ul><ul><li>For the same real world entity, attribute values from different sources are different (e.g., cm vs. inch) </li></ul></ul>D 1 D 2 D 3 D 1,2,3
  27. 27. Data Preprocessing 2. Integration and transformation <ul><li>Data integration can lead to redundant attributes </li></ul><ul><ul><li>Same object ( = B.residence) </li></ul></ul><ul><ul><li>Derived attributes (A.annualIncome = B.salary+C.rentalIncome) </li></ul></ul><ul><li>Redundant attributes can be discovered via correlation analysis </li></ul><ul><ul><li>A mathematical method for detecting the correlation between two attributes </li></ul></ul><ul><ul><li>Correlation coefficient (Pearson’s product moment coefficient): the higher its absolute value, the stronger the correlation between attributes </li></ul></ul><ul><ul><li>χ² (chi-square) test </li></ul></ul><ul><ul><li>No details on these methods here </li></ul></ul>
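Since the slide skips the details, here is a small sketch of Pearson's coefficient applied to two attributes, one of which is derived from the other. The salary figures and the attribute names are invented for illustration; a coefficient close to +1 or -1 suggests one of the two attributes may be redundant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Hypothetical attributes: annual income is derived from monthly salary.
monthly_salary = [2000, 2500, 3000, 4000]
annual_income = [12 * s for s in monthly_salary]
unrelated = [3, 1, 4, 1]

print(pearson(monthly_salary, annual_income))  # very close to 1
print(pearson(monthly_salary, unrelated))
```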
  28. 28. Data Preprocessing 2. Integration and transformation <ul><li>Aggregation: </li></ul><ul><ul><li>Sum the sales of different branches (in different data sources) to compute the company sales </li></ul></ul><ul><li>Generalization: </li></ul><ul><ul><li>concept hierarchy climbing </li></ul></ul><ul><ul><li>From integer attribute age to classes of age (child, adult, senior) </li></ul></ul><ul><li>Normalization: scaled to fall within a small, specified range </li></ul><ul><ul><li>Change the range from [-∞, +∞] to [-1, +1] </li></ul></ul><ul><ul><li>{-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1} </li></ul></ul>
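The slide does not name the normalization method behind its example. One simple technique that reproduces those numbers is dividing every value by the largest absolute value, sketched below; the function name `scale_by_max_abs` is an assumption, and other methods (min-max, z-score) would give different results.

```python
def scale_by_max_abs(values):
    """Scale values into [-1, +1] by dividing by the largest absolute value.
    Assumes at least one value is non-zero."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


scaled = scale_by_max_abs([-13, -6, -3, 10, 100])  # values from the slide
print(scaled)  # [-0.13, -0.06, -0.03, 0.1, 1.0]
```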
  29. 29. Data Preprocessing 3. Data reduction <ul><li>Data reduction </li></ul><ul><ul><li>Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results </li></ul></ul><ul><ul><li>Different reduction types (dimensions, numerosity, discretization) </li></ul></ul><ul><li>Dimensionality: Attribute subset selection </li></ul><ul><ul><li>Example with a decision tree (left branches True, right False) </li></ul></ul>Initial attribute set: {A1, A2, A3, A4, A5, A6} Reduced attribute set: {A1, A4, A6} A1? A6? Class 1 A4? Class 1 Class 2 Class 2
  30. 30. Data Preprocessing 3. Data reduction <ul><li>Dimensionality: Principal Components Analysis </li></ul><ul><ul><li>Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent data </li></ul></ul><ul><ul><li>Works for numeric data only </li></ul></ul><ul><ul><li>Used when the number of dimensions is large </li></ul></ul><ul><li>Numerosity: Clustering </li></ul><ul><ul><li>Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only </li></ul></ul>2 clusters Sparse data leads to many clusters – not effective
  31. 31. Data Preprocessing 3. Data reduction <ul><li>Numerosity: Sampling </li></ul><ul><ul><li>obtaining a small sample s to represent the whole data set N </li></ul></ul><ul><ul><li>Problem: How to select a representative sampling set </li></ul></ul><ul><ul><li>Random sampling is not enough – representative samples should be preserved </li></ul></ul><ul><ul><li>Stratified sampling: Approximate the percentage of each class (or subpopulation of interest) in the overall database </li></ul></ul>No samples from here Random sampling Stratified sampling
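One way to sketch stratified sampling is to group the records by class and draw the same fraction from each group, so that small classes are not missed by chance. The data, the `key` function and the fixed seed below are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each stratum,
    keeping at least one record per stratum."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data set: 80 records of one class, 20 of another.
records = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)

counts = {label: sum(1 for r in sample if r[0] == label) for label in ("young", "senior")}
print(counts)  # {'young': 8, 'senior': 2}
```

A plain random sample of 10 records from the same data could contain no "senior" records at all, which is the situation the slide's figure labels "No samples from here".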
  32. 32. Data Preprocessing 4. Discretization - concept hierarchy <ul><li>Three types of attributes </li></ul><ul><ul><li>Nominal — values from an unordered set (color, profession) </li></ul></ul><ul><ul><li>Ordinal — values from an ordered set (military or academic rank) </li></ul></ul><ul><ul><li>Continuous — numbers (integer or real numbers) </li></ul></ul><ul><li>Discretization </li></ul><ul><ul><li>Divide the range of a continuous attribute into intervals </li></ul></ul><ul><ul><li>Reduces data size and its complexity </li></ul></ul><ul><ul><li>Some data mining algorithms do not support continuous types, and in those cases discretization is mandatory </li></ul></ul><ul><li>Some useful methods: </li></ul><ul><ul><li>Binning, clustering (already presented) </li></ul></ul><ul><ul><li>Entropy-based discretization (no details here) </li></ul></ul>
  33. 33. Data Preprocessing 4. Discretization - concept hierarchy <ul><li>Concept hierarchy generation </li></ul><ul><ul><li>For categorical data </li></ul></ul><ul><ul><li>Specification of an ordering between attributes (schema level) </li></ul></ul><ul><ul><ul><li>street < city < state < country </li></ul></ul></ul><ul><ul><li>Specification of a hierarchy of values (data level) </li></ul></ul><ul><ul><ul><li>{Urbana, Champaign, Chicago} < Illinois </li></ul></ul></ul><ul><ul><li>Automatic generation using the number of distinct values </li></ul></ul><ul><ul><ul><li>For the set of attributes: {street, city, state, country} </li></ul></ul></ul><ul><ul><ul><li>IF: |street| = 600.000, |city|=3.000, |state|=300, |country|=15 </li></ul></ul></ul><ul><ul><ul><li>THEN: street < city < state < country </li></ul></ul></ul>
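The automatic generation rule above amounts to sorting the attributes by their number of distinct values, with the most distinct values at the bottom of the hierarchy. A minimal sketch using the counts from the slide (the function name is illustrative):

```python
def order_by_distinct_values(distinct_counts):
    """Order attribute names from most to fewest distinct values,
    i.e. from the lowest to the highest level of the hierarchy."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)


# Distinct-value counts from the slide, given in arbitrary order.
counts = {"country": 15, "street": 600_000, "state": 300, "city": 3_000}
hierarchy = order_by_distinct_values(counts)
print(" < ".join(hierarchy))  # street < city < state < country
```

This is a heuristic: it can produce a wrong ordering when a higher-level attribute happens to have more distinct values than a lower-level one, so the result is worth a manual check.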
  34. 34. Day 1 Summary <ul><li>Data Mining and KDD </li></ul><ul><li>Why Data Mining </li></ul><ul><li>Applications of Data Mining </li></ul><ul><li>Data Preprocessing </li></ul><ul><ul><li>Data Cleaning </li></ul></ul><ul><ul><li>Data Integration and Transformation </li></ul></ul><ul><ul><li>Data Reduction </li></ul></ul><ul><ul><li>Discretization and concept hierarchy </li></ul></ul><ul><li>Tomorrow? </li></ul><ul><ul><li>Data Mining techniques </li></ul></ul><ul><ul><li>Results visualization </li></ul></ul><ul><ul><li>Summary </li></ul></ul>Questions?