Your SlideShare is downloading. ×
0
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
[ppt]
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

[ppt]

420

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
420
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
10
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • 3
  • 4
  • Transcript

    • 1. The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 3: An Introduction To Data Mining (I) Johannes Gehrke [email_address] http://www.cs.cornell.edu/ johannes
    • 2. Lectures Three and Four <ul><li>Data preprocessing </li></ul><ul><li>Multidimensional data analysis </li></ul><ul><li>Data mining </li></ul><ul><ul><li>Association rules </li></ul></ul><ul><ul><li>Classification trees </li></ul></ul><ul><ul><li>Clustering </li></ul></ul>
    • 3. Why Data Preprocessing? <ul><li>Quality decisions come from quality data. </li></ul><ul><li>Problems with real life data: </li></ul><ul><ul><li>Data needs to be integrated from different sources </li></ul></ul><ul><ul><li>Missing values </li></ul></ul><ul><ul><li>Noisy and inconsistent values </li></ul></ul><ul><ul><li>Data is not at the right level of aggregation </li></ul></ul>
    • 4. Recall: The Relational Data Model <ul><li>A relational database is a set of relations </li></ul><ul><li>A relation has two components: </li></ul><ul><ul><li>The relation instance. Basically a table, with rows (also: records, tuples) and columns (also: fields, attributes). Number of records in the relation instance: Cardinality. </li></ul></ul><ul><ul><li>The relation schema . Specifies name of the relation, plus name and type of each column. </li></ul></ul><ul><li>Turing award (Nobel price in CS) for Codd in 1981 for his work on the relational model </li></ul>
    • 5. Example: Customer Relation <ul><li>Relation schema: Customers (cid: integer, name: string, byear: integer, state: string) </li></ul><ul><li>Relation instance: </li></ul>
    • 6. Data Integration <ul><li>Integrate data from multiple sources into a common format for data mining. </li></ul><ul><li>Note: A good data warehouse has already taken care of this step. </li></ul>Data Warehouse Server OLTP DBMSs Other Data Sources Extract, clean, transform, aggregate, load, update Data Marts
    • 7. Data Integration (Contd.) <ul><li>Problem: Heterogeneous schema integration </li></ul><ul><li>Different attribute names </li></ul><ul><li>Different units: Sales in $, sales in Yen, sales in DM </li></ul>
    • 8. Data Integration (Contd.) <ul><li>Problem: Heterogeneous schema integration </li></ul><ul><li>Different scales: Sales in dollars versus sales in pennies </li></ul><ul><li>Derived attributes: Annual salary versus monthly salary </li></ul>
    • 9. Data Integration (Contd.) <ul><li>Problem: Inconsistency due to redundancy </li></ul><ul><li>Customer with customer-id 150 has three children in relation1 and four children in relation2 </li></ul><ul><li>Computation of annual salary from monthly salary in relation1 does not match “annual-salary” attribute in relation2 </li></ul>
    • 10. Missing Values <ul><li>Often the values of some attributes in a record are not known. </li></ul><ul><li>Reasons for missing values: </li></ul><ul><ul><li>Attribute does not apply (e.g., maiden name) </li></ul></ul><ul><ul><li>Inconsistency with other recorded data </li></ul></ul><ul><ul><li>Equipment malfunction </li></ul></ul><ul><ul><li>Human errors </li></ul></ul><ul><ul><li>Attribute introduced recently (e.g., email address) </li></ul></ul>
    • 11. Missing Values: Approaches <ul><li>Ignore the record </li></ul><ul><li>Complete the missing value: </li></ul><ul><ul><li>Manual completion: Tedious and likely to be infeasible </li></ul></ul><ul><ul><li>Fill in a global constant, e.g., “NULL”, “unknown” </li></ul></ul><ul><ul><li>Use the attribute mean or mode </li></ul></ul><ul><ul><li>Construct a data mining model that predicts the missing value </li></ul></ul>
    • 12. Noisy Data <ul><li>Examples: </li></ul><ul><ul><li>Faulty data collection instruments </li></ul></ul><ul><ul><li>Data entry problems, misspellings </li></ul></ul><ul><ul><li>Data transmission problems </li></ul></ul><ul><ul><li>Technology limitation </li></ul></ul><ul><ul><li>Inconsistency in naming conventions </li></ul></ul><ul><ul><li>Duplicate records with different values for a common field </li></ul></ul>
    • 13. Noisy Data: Remove Outliers
    • 14. Noisy Data: Smoothing x y X1 Y1 Y1’
    • 15. Noisy Data: Normalization <ul><li>Scale data to fall within a small, specified range </li></ul><ul><ul><li>Leave out extreme order statistics </li></ul></ul><ul><ul><li>Min-max normalization </li></ul></ul><ul><ul><li>Z-score normalization </li></ul></ul><ul><ul><li>Normalization by decimal scaling </li></ul></ul>
    • 16. Data Reduction <ul><li>Problem: </li></ul><ul><li>Data might not be at the right scale for analysis. Example: Individual phone calls versus monthly phone call usage </li></ul><ul><li>Complex data mining tasks might run a very long time. Example: Multi-terabyte data warehouses One disk drive: About 20MB/s </li></ul>
    • 17. Data Reduction: Attribute Selection <ul><li>Select the “relevant” attributes for the data mining task </li></ul><ul><li>If there are k attributes, there are 2 k -1 different subsets </li></ul><ul><ul><li>Example: {salary,children,byear}, {salary,children}, {salary,byear}, {children,byear}, {salary}, {children}, {byear} </li></ul></ul><ul><li>Choice of the right subset depends on: </li></ul><ul><ul><li>Data mining task </li></ul></ul><ul><ul><li>Underlying probability distribution </li></ul></ul>
    • 18. Attribute Selection (Contd.) <ul><li>How to choose relevant attributes: </li></ul><ul><ul><li>Forward selection: Select greedily one attribute at a time </li></ul></ul><ul><ul><li>Backward elimination: Start with all attributes, eliminate irrelevant attributes </li></ul></ul><ul><ul><li>Combination of forward selection and backward elimination </li></ul></ul>
    • 19. Data Reduction: Parametric Models <ul><li>Main idea: </li></ul><ul><ul><li>Fit a parametric model to the data (e.g., multivariate normal distribution) </li></ul></ul><ul><ul><li>Store the model parameters, discard the data (except for outliers) </li></ul></ul>
    • 20. Parametric Models: Example <ul><li>Instead of storing (x,y) pairs, store only the x-value. Then recompute the y-value using y = ax + b </li></ul>x y
    • 21. Data Reduction: Sampling <ul><li>Choose a representative subset of the data </li></ul><ul><li>Simple random sampling may have very poor performance in the presence of skew </li></ul>
    • 22. Data Reduction: Sampling (Contd.) <ul><li>Stratified sampling: Biased sampling </li></ul><ul><li>Example: Keep population group ratios </li></ul><ul><li>Example: Keep minority population group count </li></ul>
    • 23. Data Reduction: Histograms <ul><li>Divide data into buckets and store average (sum) for each bucket </li></ul><ul><li>Can be constructed “optimally” for one attribute using dynamic programming </li></ul><ul><li>Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6, 7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1-2,12,16), (3-6,8,36), (7-9,6,48), (10-12,6,66) </li></ul>
    • 24. Histograms (Contd.) <ul><li>Equal-width histogram </li></ul><ul><ul><li>Divides the domain of an attribute into k intervals of equal size </li></ul></ul><ul><ul><li>Interval width = (Max – Min)/k </li></ul></ul><ul><ul><li>Computationally easy </li></ul></ul><ul><ul><li>Problems with data skew and outliers </li></ul></ul><ul><li>Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6, 7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1-3,14,22), (4-6,6,30), (7-9,6,48), (10-12,6,66) </li></ul>
    • 25. Histograms (Contd.) <ul><li>Equal-depth histogram </li></ul><ul><ul><li>Divides the domain of an attribute into k intervals, each containing the same number of records </li></ul></ul><ul><ul><li>Variable interval width </li></ul></ul><ul><ul><li>Computationally easy </li></ul></ul><ul><li>Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6, 7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1,8,8), (2-4,8,22), (5-8,8,52), (9-12,8,84) </li></ul>
    • 26. Data Reduction: Discretization <ul><li>Same concept as histograms </li></ul><ul><li>Divide domain of a numerical attribute into intervals. </li></ul><ul><li>Replace attribute value with label for interval. </li></ul><ul><li>Example: </li></ul><ul><ul><li>Dataset (age; salary): (25;30,000),(30;80,000),(27;50,000), (60;70,000),(50;55,000),(28;25,000) </li></ul></ul><ul><ul><li>Discretized dataset (age, discretizedSalary): (25,low),(30,high),(27,medium),(60,high), (50,medium),(28,low) </li></ul></ul>
    • 27. Discretization: Natural Hierarchies <ul><li>Replace low-level concepts with high-level concepts </li></ul><ul><li>Example replacements: </li></ul><ul><ul><li>Product SKU by category </li></ul></ul><ul><ul><li>City by state </li></ul></ul>Industry Category Product Country State City Year Quarter Month Week Day
    • 28. Data Reduction: Aggregation <ul><li>Natural hierarchies on attributes can be used to aggregate data along the hierarchy </li></ul>Industry Category Product Country State City Year Quarter Month Week Day
    • 29. Aggregation: Example
    • 30. Data Reduction: Other Methods <ul><li>Principal component analysis </li></ul><ul><li>Fourier transformation </li></ul><ul><li>Wavelet transformation </li></ul>
    • 31. Data Preprocessing: Summary <ul><li>Problems during data integration: </li></ul><ul><ul><li>Different attribute names </li></ul></ul><ul><ul><li>Different units </li></ul></ul><ul><ul><li>Different scales </li></ul></ul><ul><ul><li>Derived attributes </li></ul></ul><ul><ul><li>Redundant data </li></ul></ul><ul><li>Missing values </li></ul><ul><ul><li>Imputation </li></ul></ul><ul><ul><li>Prediction </li></ul></ul><ul><li>Noisy data: </li></ul><ul><ul><li>Outlier removal </li></ul></ul><ul><ul><li>Smoothing </li></ul></ul><ul><ul><li>Normalization </li></ul></ul><ul><li>Data Reduction: </li></ul><ul><ul><li>Attribute selection </li></ul></ul><ul><ul><li>Fitting parametric models </li></ul></ul><ul><ul><li>Sampling </li></ul></ul><ul><ul><li>Histograms </li></ul></ul><ul><ul><li>Discretization </li></ul></ul><ul><ul><li>Aggregation </li></ul></ul>
    • 32. Data Analysis <ul><li>Data preprocessing </li></ul><ul><li>Multidimensional data analysis </li></ul><ul><li>Data mining </li></ul><ul><ul><li>Association rules </li></ul></ul><ul><ul><li>Classification trees </li></ul></ul><ul><ul><li>Clustering </li></ul></ul>
    • 33. Multidimensional Data Analysis <ul><li>Recall: </li></ul><ul><ul><li>Transactions(ckey, timekey, pkey, units, price) </li></ul></ul><ul><ul><li>Customers(ckey, cid, name, byear, city, state, country) </li></ul></ul><ul><ul><li>Time(tkey, day, month, quarter, year) </li></ul></ul><ul><ul><li>Products(pkey, pname, price, pid, category, industry) </li></ul></ul><ul><li>Hierarchies on dimensions: </li></ul>Industry Category Product Country State City Year Quarter Month Week Day
    • 34. Multidimensional Data Analysis Industry Category Country=“USA” State City Year Quarter Month Week Day Product $500 $1000 $500 Industry2 $3000 $2000 CA $3000 $3000 Industry3 $1000 $1000 Industry1 WI NY
    • 35. Corresponding Query in SQL <ul><li>SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.country = “USA” GROUP BY P.industry, C.state </li></ul><ul><li>We think that Industry3 in CA is interesting. </li></ul>Country=“USA” State City Year Quarter Month Week Day Industry Category Product
    • 36. Slice and Drill-Down Industry=“Industry3” Category Country State=“CA” City Year Quarter Month Week Day Product $400 $300 $300 Category2 $800 $300 San Jose $100 $100 Category3 $400 $300 Category1 Los Angeles San Francisco
    • 37. Corresponding Query in SQL <ul><li>SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND P.industry = “Industry3” AND C.state = “CA” GROUP BY P.category, C.city </li></ul><ul><li>We think that Category3 is interesting. </li></ul>Industry=“Industry3” Category Country State=“CA” City Quarter Month Week Day Product Year
    • 38. Slice and Drill-Down Industry Category=“Category3” Country State=“CA” City Quarter Month Week Day Product Year $20 $160 $20 Product2 $480 $160 San Jose $60 $60 Product3 $20 $20 Product1 Los Angeles San Francisco
    • 39. Corresponding Query in SQL <ul><li>SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3” GROUP BY P.product, C.city </li></ul><ul><li>Nothing new in this view of the data. </li></ul>Quarter Month Week Day Year Industry Category=“Category3” Country State=“CA” City Product
    • 40. Pivot To (City, Year) Quarter Month Week Day Year Industry Category=“Category3” Country State=“CA” City Product $20 $600 $20 1998 $100 $100 San Jose $60 $60 1999 $20 $20 1997 Los Angeles San Francisco
    • 41. Corresponding Query in SQL <ul><li>SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3” GROUP BY C.city, T.year </li></ul>Industry Category=“Category3” Country State=“CA” City Quarter Month Week Day Product Year
    • 42. Multidimensional Data Analysis <ul><li>Set of data manipulation operators </li></ul><ul><ul><li>Roll-up: Go up one step in a dimension hierarchy. Example: month -> quarter </li></ul></ul><ul><ul><li>Drill-down: Go down one step in a dimension hierarchy. Example: quarter -> month </li></ul></ul><ul><ul><li>Slice: Select a subset of the values of a dimension. Example: All categories -> only Category3 </li></ul></ul><ul><ul><li>Dice: Select all values of a dimension. Example: Only Category3 -> all categories </li></ul></ul><ul><ul><li>Pivot: Select new dimensions to visualize the data. Example: Pivot to Time(quarter) and Customer(state) </li></ul></ul>
    • 43. Visual Intuition: Cube Customer Data Mart Product Time M T W Th F S S Product1 Product2 Product3 Product4 Product5 Product6 SH SF LA 20 30 20 15 10 50 50 Units of Product6 sold on Monday in LA roll-up to week roll-up to category roll-up to state
    • 44. OLAP Server Architectures <ul><li>Relational OLAP (ROLAP) </li></ul><ul><ul><li>Relational DBMS stores data mart </li></ul></ul><ul><ul><li>OLAP middleware: </li></ul></ul><ul><ul><ul><li>Aggregation and navigation logic </li></ul></ul></ul><ul><ul><ul><li>Optimized for DBMS in the background, but slow and complex </li></ul></ul></ul><ul><ul><ul><li>Basically only one vendor: Microstrategy </li></ul></ul></ul><ul><li>Multidimensional OLAP (MOLAP) </li></ul><ul><ul><li>Specialized array-based storage structure </li></ul></ul><ul><ul><li>Vendors: Hyperion (Essbase), Appix (iTM1), Oracle, Microsoft </li></ul></ul>
    • 45. OLAP Server Architectures <ul><li>Desktop OLAP (DOLAP) </li></ul><ul><ul><li>Performs OLAP directly at your PC </li></ul></ul><ul><ul><li>Vendors: Cognos (Powerplay), Business Objects, Brio Technology, Hummingbird </li></ul></ul><ul><li>Hybrids and Application OLAP </li></ul><ul><li>More: www.olapreport.com </li></ul>
    • 46. The OLAP Market
    • 47. Enterprise Reporting Tools
    • 48. Summary: Multidimensional Analysis <ul><li>Spreadsheet style data analysis </li></ul><ul><li>Roll-up, drill-down, slice, dice, and pivot your way to interesting cells in the CUBE </li></ul><ul><li>Mainstream technology </li></ul><ul><li>Established enterprises already have OLAP installations </li></ul><ul><li>When establishing your e-business, OLAP will be your first step in data analysis </li></ul>
    • 49. Data Mining
    • 50. Definition <ul><li>Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. </li></ul><ul><li>Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6% </li></ul>
    • 51. Definition (Cont.) <ul><li>Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. </li></ul><ul><li>Valid: The patterns hold in general. </li></ul><ul><li>Novel: We did not know the pattern beforehand. </li></ul><ul><li>Useful: We can devise actions from the patterns. </li></ul><ul><li>Understandable: We can interpret and comprehend the patterns. </li></ul>
    • 52. Why Use Data Mining Today? <ul><li>Human analysis skills are inadequate: </li></ul><ul><ul><li>Volume and dimensionality of the data </li></ul></ul><ul><ul><li>High data growth rate </li></ul></ul><ul><li>Availability of: </li></ul><ul><ul><li>Data </li></ul></ul><ul><ul><li>Storage </li></ul></ul><ul><ul><li>Computational power </li></ul></ul><ul><ul><li>Off-the-shelf software </li></ul></ul>
    • 53. An Abundance of Data <ul><li>Supermarket scanners, POS data </li></ul><ul><li>Preferred customer cards </li></ul><ul><li>Credit card transactions </li></ul><ul><li>Direct mail response </li></ul><ul><li>Call center records </li></ul><ul><li>Web server logs </li></ul><ul><li>Customer web site trails </li></ul><ul><li>ATM machines </li></ul><ul><li>Demographic data </li></ul>
    • 54. Evolution of Database Technology <ul><li>1960s: IMS and network DBMS </li></ul><ul><li>1970s: The relational data model, small relational DBMS implementations </li></ul><ul><li>1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, images, etc.) </li></ul><ul><li>1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology </li></ul><ul><li>2000s: High availability, maintainability, seamless integration into business processes </li></ul>
    • 55. Computational Power <ul><li>Moore’s Law: In 1965, Intel Corp. cofounder Gordon Moore predicted that the density of transistors in an integrated circuit would double every year. </li></ul><ul><li>Later changed to reflect 18 months progress. </li></ul><ul><li>Experts on ants estimate that there are 10 16 to 10 17 ants on earth. In the year 1997, we produced one transistor per ant. </li></ul>
    • 56. OTS-Software for Data Mining <ul><li>ANGOSS KnowledgeStudio </li></ul><ul><li>IBM Intelligent Miner </li></ul><ul><li>Metaputer PolyAnalyst </li></ul><ul><li>SAS Enterprise Miner </li></ul><ul><li>SGI Mineset </li></ul><ul><li>SPSS Clementine </li></ul><ul><li>Many others </li></ul><ul><li>More at http://www.kdnuggets.com/software </li></ul>
    • 57. Why Use Data Mining Today? <ul><li>And … competitive pressure! </li></ul><ul><li>“ The secret of success is to know something that nobody else knows.” </li></ul><ul><li>Aristotle Onassis </li></ul><ul><li>Competition on service, not only on price (Banks, phone companies, hotel chains, rental car companies) </li></ul><ul><li>Personalization </li></ul><ul><li>Your competitors already apply data mining </li></ul>
    • 58. The Knowledge Discovery Process <ul><li>Steps: </li></ul><ul><li>Identify business problem </li></ul><ul><li>Data preprocessing </li></ul><ul><ul><li>Data selection: Identify target datasets and relevant fields </li></ul></ul><ul><ul><li>Data cleaning: Remove noise and outliers </li></ul></ul><ul><ul><li>Data transformation </li></ul></ul><ul><ul><li>Create common units </li></ul></ul><ul><ul><li>Generate new fields </li></ul></ul><ul><li>Data mining and model evaluation </li></ul><ul><li>Deployment and measurement of the results </li></ul>
    • 59. Preprocessing and Mining Original Data Target Data Preprocessed Data Patterns Knowledge Data Integration and Selection Preprocessing Model Construction Interpretation
    • 60. What is a Data Mining Model? <ul><li>A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values. </li></ul><ul><li>Examples: </li></ul><ul><li>Linear regression model </li></ul><ul><li>Classification model </li></ul><ul><li>Clustering </li></ul>
    • 61. Data Mining Models (Contd.) <ul><li>A data mining model can be described at two levels: </li></ul><ul><li>Functional level: </li></ul><ul><ul><li>Describes model in terms of its intended usage. Examples: Classification, clustering </li></ul></ul><ul><li>Representational level: </li></ul><ul><ul><li>Specific representation of a model. Example: Log-linear model, classification tree, nearest neighbor method. </li></ul></ul><ul><li>Black-box models versus transparent models </li></ul>
    • 62. Data Mining Techniques <ul><li>Supervised learning </li></ul><ul><ul><li>Classification and regression </li></ul></ul><ul><li>Unsupervised learning </li></ul><ul><ul><li>Clustering </li></ul></ul><ul><li>Dependency modeling </li></ul><ul><ul><li>Associations, summarization, causality </li></ul></ul><ul><li>Outlier and deviation detection </li></ul><ul><li>Trend analysis and change detection </li></ul>
    • 63. Fields Related to Data Mining <ul><li>Database systems </li></ul><ul><li>Machine learning </li></ul><ul><li>Statistics </li></ul><ul><li>Visualization </li></ul><ul><li>Parallel computing </li></ul>
    • 64. Traditional Analysis Tools <ul><li>Predefined set of reports </li></ul><ul><ul><li>Capture most important known business questions </li></ul></ul><ul><ul><li>Generated at fixed points in time </li></ul></ul><ul><ul><li>No support for impromptu queries </li></ul></ul><ul><li>SQL: “Write your own SQL query.” </li></ul><ul><li>Statistics department within a company </li></ul><ul><ul><li>Slow reaction to users needs </li></ul></ul><ul><ul><li>Limited understanding of the business questions </li></ul></ul><ul><ul><li>Problems communicating the findings </li></ul></ul>
    • 65. Example Application: Sports <ul><li>IBM Advanced Scout analyzes NBA game statistics </li></ul><ul><ul><li>Shots blocked </li></ul></ul><ul><ul><li>Assists </li></ul></ul><ul><ul><li>Fouls </li></ul></ul><ul><li>http://www.research.ibm.com/scout/home.html </li></ul>
    • 66. Advanced Scout <ul><li>Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “ When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots.&quot; </li></ul><ul><li>Pattern is interesting: The average shooting percentage for the Charlotte Hornets during that game was 54%. </li></ul>
    • 67. Example Application: Sky Survey <ul><li>Input data: 3 TB of image data with 2 billion sky objects, took more than six years to complete </li></ul><ul><li>Goal: Generate a catalog with all objects and their type </li></ul><ul><li>Method: Use decision trees as data mining model </li></ul><ul><li>Results: </li></ul><ul><ul><li>94% accuracy in predicting sky object classes </li></ul></ul><ul><ul><li>Increased number of faint objects classified by 300% </li></ul></ul><ul><ul><li>Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time </li></ul></ul>
    • 68. Gold Nuggets? <ul><li>Investment firm mailing list: Discovered that old people do not respond to IRA mailings </li></ul><ul><li>Bank clustered their customers. One cluster: Older customers, no mortgage, less likely to have a credit card </li></ul><ul><li>“ Bank of 1911” </li></ul><ul><li>Customer churn example </li></ul>
    • 69. Market Basket Analysis <ul><li>Consider shopping cart filled with several items </li></ul><ul><li>Market basket analysis tries to answer the following questions: </li></ul><ul><ul><li>Who makes purchases? </li></ul></ul><ul><ul><li>What do customers buy together? </li></ul></ul><ul><ul><li>In what order do customers purchase items? </li></ul></ul>
    • 70. Market Basket Analysis <ul><li>Given: </li></ul><ul><li>A database of customer transactions </li></ul><ul><li>Each transaction is a set of items </li></ul><ul><li>Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice} </li></ul>
    • 71. Market Basket Analysis (Contd.) <ul><li>Coocurrences </li></ul><ul><ul><li>80% of all customers purchase items X, Y and Z together. </li></ul></ul><ul><li>Association rules </li></ul><ul><ul><li>60% of all customers who purchase X and Y also buy Z. </li></ul></ul><ul><li>Sequential patterns </li></ul><ul><ul><li>60% of customers who first buy X also purchase Y within three weeks. </li></ul></ul>
    • 72. Confidence and Support <ul><li>We prune the set of all possible association rules using two interestingness measures: </li></ul><ul><li>Confidence of a rule: </li></ul><ul><ul><li>X => Y has confidence c if P(Y|X) = c </li></ul></ul><ul><li>Support of a rule: </li></ul><ul><ul><li>X => Y has support s if P(XY) = s </li></ul></ul><ul><li>We can also define </li></ul><ul><li>Support of an itemset (a coocurrence) XY: </li></ul><ul><ul><li>XY has support s if P(XY) = s </li></ul></ul>
    • 73. Example <ul><li>Examples: </li></ul><ul><li>{Pen} => {Milk} Support: 75% Confidence: 75% </li></ul><ul><li>{Ink} => {Pen} Support: 100% Confidence: 100% </li></ul>
    • 74. Exercise <ul><li>Find all itemsets with support >= 75%? </li></ul>
    • 75. Exercise <ul><li>Can you find all association rules with support >= 50%? </li></ul>
    • 76. Extensions <ul><li>Imposing constraints </li></ul><ul><ul><li>Only find rules involving the dairy department </li></ul></ul><ul><ul><li>Only find rules involving expensive products </li></ul></ul><ul><ul><li>Only find “expensive” rules </li></ul></ul><ul><ul><li>Only find rules with “whiskey” on the right hand side </li></ul></ul><ul><ul><li>Only find rules with “milk” on the left hand side </li></ul></ul><ul><ul><li>Hierarchies on the items </li></ul></ul><ul><ul><li>Calendars (every Sunday, every 1 st of the month) </li></ul></ul>
    • 77. Market Basket Analysis: Applications <ul><li>Sample Applications </li></ul><ul><ul><li>Direct marketing </li></ul></ul><ul><ul><li>Fraud detection for medical insurance </li></ul></ul><ul><ul><li>Floor/shelf planning </li></ul></ul><ul><ul><li>Web site layout </li></ul></ul><ul><ul><li>Cross-selling </li></ul></ul>
    • 78. Beyond Support and Confidence <ul><li>Example: 5000 students </li></ul><ul><ul><li>3000 students play basketball </li></ul></ul><ul><ul><li>3750 students eat cereal </li></ul></ul><ul><ul><li>2000 students both play basketball and eat cereal </li></ul></ul>
    • 79. Misleading Association Rules <ul><li>Basketball => Cereal (support: 40%, confidence: 66.7%) is misleading because 75% of students eat cereal </li></ul><ul><li>Basketball => No cereal (support: 20%, confidence: 33.3%) is more interesting, although with lower support and confidence </li></ul>
    • 80. Interest <ul><li>Interest of rule A => B: P(AB)/(P(A)*P(B)) </li></ul><ul><ul><li>Symmetric (uses both P(A) and P(B)) </li></ul></ul><ul><ul><li>Note that confidence is not symmetric (confidence of rule A => B: P(AB)/P(A)) </li></ul></ul><ul><li>Interest values: </li></ul><ul><ul><li>Interest = 1: A and B are independent (P(AB)=P(B)*P(A)) </li></ul></ul><ul><ul><li>Interest > 1: A and B are positively correlated </li></ul></ul><ul><ul><li>Interest < 1: A and B are negatively correlated </li></ul></ul>
    • 81. Interest: Example
    • 82. Extensions <ul><li>Imposing constraints </li></ul><ul><ul><li>Only find rules involving the dairy department </li></ul></ul><ul><ul><li>Only find rules involving expensive products </li></ul></ul><ul><ul><li>Only find “expensive” rules </li></ul></ul><ul><ul><li>Only find rules with “whiskey” on the right hand side </li></ul></ul><ul><ul><li>Only find rules with “milk” on the left hand side </li></ul></ul><ul><ul><li>Hierarchies on the items </li></ul></ul><ul><ul><li>Calendars (every Sunday, every 1 st of the month) </li></ul></ul>
    • 83. Applications <ul><li>Sample Applications </li></ul><ul><ul><li>Direct marketing </li></ul></ul><ul><ul><li>Fraud detection for medical insurance </li></ul></ul><ul><ul><li>Floor/shelf planning </li></ul></ul><ul><ul><li>Web site layout </li></ul></ul><ul><ul><li>Cross-selling </li></ul></ul><ul><li>Case study </li></ul>
    • 84. Summary <ul><li>Data preprocessing </li></ul><ul><ul><li>Quality analysis comes from quality data </li></ul></ul><ul><li>Multidimensional data analysis </li></ul><ul><ul><li>Interactive, spreadsheet-style data analysis </li></ul></ul><ul><li>Data mining </li></ul><ul><ul><li>The knowledge discovery process </li></ul></ul><ul><ul><li>Association rules </li></ul></ul>
    • 85. Questions? <ul><li>(In the fourth lecture: More data mining.) </li></ul>

    ×