Data Mining & Knowledge Discovery


Published in: Technology, Business

  1. 1. Data Mining & Knowledge Discovery: Personalization Technologies for One to One Marketing Bhagi Narahari
  2. 2. Outline of Lecture <ul><li>What and Why of Data Mining and KDD? </li></ul><ul><ul><li>Importance and Applications to E-commerce </li></ul></ul><ul><li>How? </li></ul><ul><li>Personalization </li></ul><ul><ul><li>personalized one-to-one business on the internet </li></ul></ul><ul><li>Part 1: Overview of Personalization </li></ul><ul><li>Part 2: The Data Mining Process </li></ul>
  3. 3. Predictive Modelling <ul><li>A “black box” that makes predictions about the future based on information from the past and present </li></ul>[Diagram: inputs (age, balance, income) → Model (crystal ball?) → output: How much will the customer spend on the next catalog order?]
  4. 4. What is Data Mining? <ul><li>It is the exploration and analysis by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. </li></ul>
  5. 5. Why now? (A historical perspective) <ul><li>Because data is now available (wasn’t always) </li></ul><ul><li>Distributed sources </li></ul><ul><li>Technology evolution </li></ul><ul><li>Competition (do what you can to outdo) </li></ul>
  6. 6. Why DM? <ul><li>CRM (Customer Relationship Management) - important success factor in E-commerce </li></ul><ul><ul><li>price differentiation no longer enough </li></ul></ul><ul><ul><li>customer service more important </li></ul></ul><ul><li>Links with suppliers already exist (B2B) - JIT, joint forecasting, planning, procurement </li></ul><ul><li>Current emphasis on links with customers - feedback, input in design, etc. </li></ul>
  7. 7. CRM <ul><li>Identifying profitable customers </li></ul><ul><li>Better service for more valued customers </li></ul><ul><li>Retaining profitable customers </li></ul><ul><ul><li>Getting a new customer costs a lot more than retaining an existing one </li></ul></ul><ul><ul><li>acquiring a new customer takes about 5X the cost (Peppers & Rogers) </li></ul></ul><ul><ul><li>An increase from 75% to 80% in retention reduces costs by about 10% </li></ul></ul><ul><li>Larger share of customer pool </li></ul>
  8. 8. CRM <ul><li>Product differentiations based on “price” and “quality” are increasingly difficult </li></ul><ul><ul><li>need to differentiate based on relationships </li></ul></ul><ul><li>Increasingly sophisticated mass marketing increases probability of success </li></ul><ul><ul><li>cost of mass marketing is driven down by internet (reach) </li></ul></ul>
  9. 9. CRM <ul><li>Goal: Positively interact with your customers and prospects </li></ul><ul><ul><li>define customer segments </li></ul></ul><ul><ul><li>lights out execution of campaigns against segments </li></ul></ul><ul><ul><li>attribution and evaluation of responses </li></ul></ul>
  10. 10. Personalization in Ecommerce <ul><li>Positive: </li></ul><ul><ul><li>much better chance of personalization </li></ul></ul><ul><ul><ul><li>customer identification </li></ul></ul></ul><ul><ul><ul><li>tracking across visits and within visit </li></ul></ul></ul><ul><ul><li>ability to do ‘what if’ experiments </li></ul></ul><ul><li>Negative: </li></ul><ul><ul><li>cost of switching is much less </li></ul></ul><ul><ul><li>is web-based shopping good for ‘touchy feely’ things? </li></ul></ul><ul><ul><li>price differentiation across geographies not easy </li></ul></ul>
  11. 11. Personalization [Diagram: Customer Chain: Product Discovery, Product Evaluation, Terms Negotiation, Order Placement, Order Payment, Customer Service & Support. Producer Chain: Market Research, Market Stimulation/Education, Terms Negotiations, Order Receipt, Order Billing and Payment Management, Customer Service & Support.]
  12. 12. B2C Personalization Objectives <ul><li>Know the customer </li></ul><ul><ul><li>profile - registration, cookies </li></ul></ul><ul><li>Determine what the customer wants </li></ul><ul><ul><li>Ask: Questionnaires </li></ul></ul><ul><ul><ul><li>what is the incentive for truthfulness </li></ul></ul></ul><ul><ul><li>Deduce: click streams, history, collaborative filtering (Amazon!!) </li></ul></ul><ul><li>Deliver </li></ul><ul><ul><li>Customize the look and feel </li></ul></ul><ul><ul><li>offer special promotions </li></ul></ul><ul><ul><li>offer customized products (Holy Grail) </li></ul></ul>
  13. 13. Use of Personalization <ul><li>In addition to storing and retrieving information on the individual’s profile “on the fly” </li></ul><ul><ul><li>can also use mining software to analyze the information in the database to make recommendations or comments specific to the individual </li></ul></ul>
  14. 14. Impact of Personalization <ul><li>Customer relationship </li></ul><ul><li>Learn more about customers </li></ul><ul><ul><li>learn and understand why and how they prefer to do business with your organization </li></ul></ul><ul><li>In tandem with tracking, provides you with a tool to monitor your website </li></ul><ul><ul><li>what works, what doesn’t, what makes your audience “click” </li></ul></ul>
  15. 15. Security and Privacy as Barrier to Personalization <ul><li>Large number of customers concerned about personalization (DoubleClick!) </li></ul><ul><li>will they pay more to preserve privacy? </li></ul><ul><li>Some falsify info to preserve privacy </li></ul><ul><li>customers give more info to trusted site </li></ul><ul><li>need secure site with clear privacy policies stated at site </li></ul>
  16. 16. Personalization [Diagram labels: Know the Customer; Identify; Give the customer his/her wants; Questionnaires; Past history; Click Streams; Profile; Login; Credit Card #; Predicting the wants; Mapping to “peers”; Extrapolation from past; Extrapolation from peers; Look & feel; Product selection & promotions; New Product]
  17. 17. Know the customer <ul><li>Cookies </li></ul><ul><ul><li>backlash (users do not trust them) </li></ul></ul><ul><li>OPS: Open Profiling Standard </li></ul><ul><ul><li>combined with eTrust certification </li></ul></ul><ul><li>Registration </li></ul><ul><ul><li>User certificates: logons </li></ul></ul><ul><li>Key Question: </li></ul><ul><ul><li>how do you know that this customer is the same one who goes to your storefront? </li></ul></ul><ul><ul><li>need standard warehouse techniques like address resolution, credit card resolution, etc. </li></ul></ul>
  18. 18. Know the Customer: OPS <ul><li>Two drivers </li></ul><ul><ul><li>user should not have to retype basic info again and again </li></ul></ul><ul><ul><li>data is used in a fashion users can trust (not leaked, other data not seen, etc.) </li></ul></ul><ul><li>Two parts </li></ul><ul><ul><li>Common data </li></ul></ul><ul><ul><ul><li>Demographics (country, zip, age, gender) </li></ul></ul></ul><ul><ul><ul><li>Contact (name, address, CreditCard…) </li></ul></ul></ul><ul><ul><ul><li>User agent preferences </li></ul></ul></ul><ul><ul><li>Per-site Sections (can be shared across sites, if user allows) </li></ul></ul>
  19. 19. What if no profile??? <ul><li>Deduce </li></ul><ul><ul><li>collect information: history of purchases, time spent on pages </li></ul></ul><ul><ul><li>ask questions (offer rewards) </li></ul></ul><ul><ul><li>combine with database marketing data </li></ul></ul><ul><li>Predict behaviour </li></ul><ul><ul><li>buy probabilities </li></ul></ul><ul><ul><li>build customer relationship </li></ul></ul><ul><li>mining is key! </li></ul>
  20. 20. Personalization: Actions to take - Look and feel <ul><li>Personalized pages </li></ul><ul><ul><li>specific data </li></ul></ul><ul><ul><li>specific presentation and design </li></ul></ul><ul><ul><li>sent through various mediums </li></ul></ul><ul><li>Manage customers, not products: 1-1 marketing </li></ul><ul><ul><li>deliver personalized pages </li></ul></ul><ul><ul><ul><li>e.g. stock portfolio, personal info including alarm, travel reservations </li></ul></ul></ul><ul><ul><li>use different mediums </li></ul></ul><ul><ul><ul><li>WAP-enabled phones (e.g. Sprint PCS Web) </li></ul></ul></ul>
  21. 21. Storefront Personalization <ul><li>Customers visit Store Website </li></ul><ul><ul><li>Howard buys ties </li></ul></ul><ul><ul><li>Rob buys Baby Products </li></ul></ul><ul><ul><li>Ray buys toys </li></ul></ul><ul><ul><li>Amy buys clothes </li></ul></ul><ul><li>Provide a view of the store to these customers </li></ul><ul><ul><li>present them with what they are likely to buy? </li></ul></ul><ul><ul><ul><li>Howard: ties, and men’s formal wear </li></ul></ul></ul><ul><ul><ul><li>Ray: Toys and gadgets </li></ul></ul></ul><ul><ul><ul><li>Rob: Infant, Toddler section </li></ul></ul></ul><ul><ul><ul><li>Amy: Women’s Clothes section </li></ul></ul></ul>
  22. 22. More Actions: Product Presentations & Promotions [Diagram: Basic storefront product hierarchy: Clothes → Men’s (Shirts, Pants), Women’s (Casuals, Evening), Children’s (Infants, Kids); overlaid with John’s View and Mary’s View]
  23. 23. <ul><li>BroadVision One-to-One application </li></ul><ul><ul><li>allows businesses to develop and manage personalized web sites </li></ul></ul><ul><ul><li>interactively profile each visitor and dynamically match info based on their profile and business rules specified by providers of site & services </li></ul></ul><ul><ul><ul><li>users do not go through hoops finding relevant data </li></ul></ul></ul>
  24. 24. DM Terminology: OLAP, ROLAP, Data Warehouse, Data Marts, Data Stores, Neural Networks, Genetic Algorithms, Data Mining, Rule Based Systems, SQL
  25. 25. How? <ul><li>Determine probability of buying as a function of customer attributes such as age, income, past buying patterns, .. </li></ul><ul><li>Target customers by ranking from highest to lowest probabilities </li></ul><ul><li>Other techniques: Decision Trees, Neural Networks, …. </li></ul>
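
The scoring-and-ranking idea on this slide can be sketched in a few lines. This is only an illustration, assuming a logistic model; the coefficient values, the attribute names (`age`, `income_k`, `past_orders`) and the three customers are made up for the example and are not from the lecture.

```python
import math

# Hypothetical coefficients; in practice these would be fitted to past data.
WEIGHTS = {"age": 0.02, "income_k": 0.01, "past_orders": 0.5}
INTERCEPT = -3.0


def buy_probability(customer):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = INTERCEPT + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def rank_customers(customers):
    """Return (name, probability) pairs sorted from highest to lowest."""
    scored = [(c["name"], buy_probability(c)) for c in customers]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up customers, for illustration only.
customers = [
    {"name": "A", "age": 30, "income_k": 40, "past_orders": 0},
    {"name": "B", "age": 45, "income_k": 80, "past_orders": 4},
    {"name": "C", "age": 60, "income_k": 55, "past_orders": 2},
]
ranking = rank_customers(customers)
```

With these made-up numbers, customer B ranks first, then C, then A, so a campaign with a limited budget would target B before the others.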
  26. 26. KDD <ul><li>Knowledge Discovery in Databases </li></ul><ul><li>It is the process of identifying valid, novel, potentially useful, and understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth) </li></ul><ul><li>It involves data preparation, pattern extraction, knowledge evaluation, and refinement, in iteration </li></ul>
  27. 27. KDD <ul><li>Data mining is a step in the KDD process that involves the application of certain algorithms to extract patterns </li></ul><ul><li>Steps in the KDD process: </li></ul><ul><ul><ul><li>Select Data </li></ul></ul></ul><ul><ul><ul><li>Data Cleansing and Pre-processing </li></ul></ul></ul><ul><ul><ul><li>Data Mining </li></ul></ul></ul><ul><ul><ul><li>Results interpretation </li></ul></ul></ul><ul><ul><ul><li>Implementation </li></ul></ul></ul>
  28. 28. Pre-processing in KDD <ul><li>80-90% of KDD process is spent here </li></ul><ul><li>Why? </li></ul><ul><ul><ul><li>Operational data is incomplete, inconsistent, in different formats across systems </li></ul></ul></ul><ul><ul><ul><li>DM techniques might require data in a specific format </li></ul></ul></ul>
  29. 29. Data Mining Problems <ul><li>Classification/Segmentation </li></ul><ul><ul><li>Binary (Yes/No) </li></ul></ul><ul><ul><li>Multiple Category (Large/Medium/Small) </li></ul></ul><ul><li>Forecasting (how much) </li></ul><ul><li>Association Rule extraction (market basket analysis) </li></ul><ul><li>Sequence detection </li></ul><ul><ul><li>balance increase -> missed payment -> default </li></ul></ul>
  30. 30. Typical DM tasks <ul><li>Prediction and Classification </li></ul><ul><ul><li>Directed </li></ul></ul><ul><ul><li>Decision trees, Neural networks, memory based reasoning, logistic regression </li></ul></ul><ul><ul><li>Examples: </li></ul></ul><ul><ul><ul><li>How many units will be sold on a given day? </li></ul></ul></ul><ul><ul><ul><li>What will be the stock price on a given day? </li></ul></ul></ul><ul><ul><ul><li>Will a customer buy the product or not? </li></ul></ul></ul>
  31. 31. DM tasks <ul><li>Affinity grouping </li></ul><ul><ul><li>Undirected </li></ul></ul><ul><ul><li>Which products go together naturally? </li></ul></ul><ul><ul><li>The beer-diaper syndrome? </li></ul></ul><ul><ul><li>Market basket analysis </li></ul></ul><ul><ul><li>Examples: </li></ul></ul><ul><ul><ul><li>Which products peak in demand simultaneously? </li></ul></ul></ul>
  32. 32. DM tasks <ul><li>Clustering task </li></ul><ul><ul><li>Undirected </li></ul></ul><ul><ul><li>Segmenting into similar clusters </li></ul></ul><ul><ul><li>Different from classification </li></ul></ul><ul><ul><li>Examples </li></ul></ul><ul><ul><ul><li>Customers with similar buying profiles </li></ul></ul></ul><ul><ul><ul><li>Products with similar demand patterns </li></ul></ul></ul>
  33. 33. DM success factors <ul><li>Integration with data warehouses and DSS </li></ul><ul><li>Users should develop a good understanding of techniques </li></ul><ul><li>Recognize that these tools cannot automatically find patterns without being told what to do </li></ul><ul><li>Most methods now used are extensions of analytical methods that have been around for decades </li></ul>
  34. 34. Legal and Ethical Issues <ul><li>Privacy concerns </li></ul><ul><ul><li>becoming more important </li></ul></ul><ul><ul><li>will impact the way that data can be used and analyzed </li></ul></ul><ul><ul><li>ownership issues </li></ul></ul><ul><ul><li>European data laws have implications on US </li></ul></ul><ul><li>Often data included in the data warehouse cannot legally be used in decision making process </li></ul><ul><ul><li>Race, Gender, Age </li></ul></ul><ul><li>Data contamination will become critical </li></ul>
  35. 35. Making Decisions [Diagram: multiple Data sources → Data Warehouse? → Models → Decisions]
  36. 36. Data Warehouse <ul><li>Bill Inmon: “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decisions.” </li></ul><ul><li>is managed data that is situated after and outside the operational systems </li></ul>
  37. 37. Data Warehousing <ul><li>Increasing need to find, summarize, and interpret large amounts of data effectively </li></ul><ul><ul><li>Especially when data is distributed across many different databases </li></ul></ul><ul><li>Transaction processing systems not easily accessible to other systems </li></ul><ul><ul><li>Plus TP systems have time constraints </li></ul></ul>
  38. 38. Enter the Data Warehouse <ul><li>To deliver decision data to decision makers </li></ul><ul><li>by integrating data from various TPS to a single storage which can then </li></ul><ul><li>feed a range of decision support applications </li></ul><ul><li>through an OLAP interface! </li></ul>
  39. 39. Data Complications <ul><li>Noise </li></ul><ul><li>Missing data </li></ul><ul><li>Transformation </li></ul><ul><ul><li>numeric data </li></ul></ul><ul><ul><li>text </li></ul></ul><ul><li>Need to differentiate between variables you can control and those you cannot </li></ul><ul><ul><li>Actionable: size of discount, number of offers etc. </li></ul></ul><ul><ul><li>Non-actionable: age, income .. </li></ul></ul>
  40. 40. Data Mining Techniques <ul><li>Market Basket Analysis </li></ul><ul><li>Memory Based Reasoning </li></ul><ul><li>Cluster Detection </li></ul><ul><li>Link Analysis </li></ul><ul><li>Decision Trees and Rule Induction </li></ul><ul><li>Neural Networks </li></ul><ul><li>Genetic Algorithms </li></ul><ul><li>OLAP </li></ul>
  41. 41. OLAP: On Line Analytical Processing <ul><li>While a data warehouse brings data together, OLAP lets you look at data and manipulate interactively </li></ul><ul><li>OLAP allows users to “slice and dice” data </li></ul><ul><li>Allows user to drill-down into detail data </li></ul>
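
A rough sketch of what “slice and dice” and drill-down mean in practice, using plain Python over a tiny in-memory fact table. The table contents and the helper names `slice_rows` and `roll_up` are assumptions made for this example; a real OLAP server would do the same thing over pre-aggregated cubes.

```python
from collections import defaultdict

# Hypothetical fact table: one row per (region, product, month) with a sales measure.
SALES = [
    {"region": "East", "product": "Nuts", "month": "Jan", "sales": 100},
    {"region": "East", "product": "Bolts", "month": "Jan", "sales": 150},
    {"region": "West", "product": "Nuts", "month": "Jan", "sales": 80},
    {"region": "West", "product": "Nuts", "month": "Feb", "sales": 120},
    {"region": "Central", "product": "Bolts", "month": "Feb", "sales": 60},
]


def slice_rows(rows, **fixed):
    """Slice/dice: keep rows whose dimensions match all the fixed values."""
    return [r for r in rows if all(r[d] == v for d, v in fixed.items())]


def roll_up(rows, *dims):
    """Aggregate sales over the given dimensions (fewer dims = coarser view)."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["sales"]
    return dict(totals)


# Summary view: total sales per region.
by_region = roll_up(SALES, "region")
# Drill-down: fix region = West, then break its total out by product and month.
west_detail = roll_up(slice_rows(SALES, region="West"), "product", "month")
```

Here `by_region` is the consolidated view, and `west_detail` shows the detail rows behind the West total.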
  42. 42. Relational vs Multidimensional
  43. 43. Consolidations
  44. 44. Multidimensional Terminology <ul><li>East, West, Central are input members of the Region dimension. Total Region is an output member of the Region dimension. Similarly, Nuts, Screws, Bolts, Washers, and Total are members of the Product dimension. </li></ul><ul><li>Variables are typically numerical measures like Sales, Costs, Profits, Expenses, and so forth. </li></ul><ul><li>Dimensions are roughly equivalent to Fields in a relational database. Cells are roughly equivalent to Records. </li></ul>
  45. 45. Steps in DW and OLAP [Diagram components: Data sources, Data Loader, Data Converter, Data Scrubber, Data Transformer, Data Warehouse, OLAP Server, OLAP Interface]
  46. 47. Cluster Detection <ul><li>Undirected data mining </li></ul><ul><li>Finds records that are similar to each other (clusters) </li></ul><ul><li>Clusters are found using geometric methods, statistical methods, and neural networks </li></ul><ul><li>Good way to start any analysis </li></ul>
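
As one example of a geometric clustering method, here is a minimal k-means sketch on one-dimensional data. The slide does not name a specific algorithm, so k-means is a choice made for this illustration; the spending values and the starting centroids are made up.

```python
def k_means_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then recompute means."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer; no labels are supplied (undirected).
spend = [12, 15, 14, 95, 102, 99]
centers, groups = k_means_1d(spend, centroids=[0, 50])
```

No target variable is given to the algorithm; the two groups (low spenders and high spenders) emerge from similarity alone, which is what distinguishes clustering from classification.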
  47. 48. Market Basket Analysis <ul><li>Form of clustering used for finding items that occur together (in a transaction or market basket) </li></ul><ul><li>Likelihood of different products being purchased together as rules </li></ul><ul><li>Planning store layouts, limiting specials to one of the products in a set,... </li></ul>
  48. 49. Transaction data
  49. 50. Co-occurrence matrix
  50. 51. Support and confidence <ul><li>For a rule that says: If A then B </li></ul><ul><li>Support is defined as the ratio of number of transactions that include both A and B to total number of transactions </li></ul><ul><li>Confidence is defined by the ratio of the number of transactions that include both A and B to the number of transactions that include A. </li></ul><ul><li>How do you specify ‘significant’ support and confidence ? </li></ul>
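
The two ratios defined above translate directly into code. The baskets below are hypothetical (they are not the lecture's sample transaction data), and the function names are chosen for this sketch.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """For the rule 'if A then B': support(A and B) / support(A).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical baskets, for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"diapers", "milk"},
    {"milk", "bread"},
]
s = support(baskets, {"beer", "diapers"})
c = confidence(baskets, {"beer"}, {"diapers"})
```

In these baskets, beer and diapers occur together in 2 of 5 transactions (support 40%), and beer occurs in 3, so the rule “if beer then diapers” has confidence 2/3.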
  51. 52. Algorithm for Finding Association Rules <ul><li>Input is Min-Support and Min-Confidence </li></ul><ul><li>Find all sets of items with Min-Support ( frequent itemsets ) </li></ul><ul><ul><li>Frequent Itemsets Property: Every subset of a frequent itemset must also be a frequent itemset </li></ul></ul><ul><ul><ul><li>iterative algorithm: start with frequent itemsets with one item, and construct larger itemsets using only smaller frequent itemsets. </li></ul></ul></ul>
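
The iterative search for frequent itemsets can be sketched as follows. This is a small level-wise (Apriori-style) illustration under the frequent-itemset property stated above, not a tuned implementation; the function name and the example baskets are assumptions. Turning the resulting itemsets into rules by checking Min-Confidence is left out.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: build k-item candidates only from frequent (k-1)-itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def is_frequent(itemset):
        return sum(1 for t in transactions if itemset <= t) / n >= min_support

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if is_frequent(frozenset([i]))}
    result = set(level)
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c
            for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c for c in candidates if is_frequent(c)}
        result |= level
        k += 1
    return result


# Hypothetical baskets, for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"diapers", "milk"},
    {"milk", "bread"},
]
found = frequent_itemsets(baskets, min_support=0.4)
```

The prune step is where the property pays off: an itemset is never counted against the data if one of its subsets has already been ruled out.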
  52. 53. MBA example <ul><li>Using the sample data create a co-occurrence table </li></ul><ul><li>Let relevant Support = 25% and Confidence= 50%: </li></ul><ul><ul><li>Beer and Diapers appear in 3/5= 60% </li></ul></ul><ul><ul><li>If beer then diapers has confidence of 2/3=67% </li></ul></ul><ul><ul><li>Thus, “If customer buys beer then customer buys diapers” satisfies 25% support & 50% confidence </li></ul></ul><ul><li>Conclusion drawn by mining system: </li></ul><ul><ul><li>Customers who buy beer also buy diapers </li></ul></ul>
  53. 54. Applying MBA Results <ul><li>Is the relationship useful ? </li></ul><ul><ul><li>Beer and Diapers may not be of use </li></ul></ul><ul><ul><li>Victoria’s Secret transaction mining led to specific apparel sent to specific stores -- MicroStrategy software </li></ul></ul><ul><li>Who defines “usefulness” </li></ul><ul><ul><li>only as good as rules specified by humans/marketing workforce </li></ul></ul><ul><ul><li>NBA mining: designers of s/w did not include height mismatches at first…coaches made the correction </li></ul></ul>
  54. 55. Data Mining Algorithms <ul><li>Four algorithms commonly cited </li></ul><ul><ul><li>Association Rule (used in over 90% of the cases!) </li></ul></ul><ul><ul><li>Nearest Neighbor </li></ul></ul><ul><ul><ul><li>quick and easy but models get large </li></ul></ul></ul><ul><ul><li>Decision Tree </li></ul></ul><ul><ul><li>Neural Network </li></ul></ul><ul><ul><ul><li>difficult to interpret and takes a long time </li></ul></ul></ul>
  55. 56. Decision Trees <ul><li>Series of if/then rules </li></ul><ul><ul><li>easy to understand, complexity in implementation </li></ul></ul>[Diagram: example tree that splits on Balance (< 10K vs. > 10K) and on Age (< 48 vs. > 48), with yes/no leaves]
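
To make the “series of if/then rules” reading concrete, here is a small sketch that stores a tree as nested tuples, classifies a record by walking it, and prints the equivalent rules. The tree shape, thresholds and leaf labels are illustrative assumptions; they loosely echo the balance/age splits on the slide but do not reproduce its exact tree.

```python
# Each internal node: (feature, threshold, subtree_if_below, subtree_otherwise).
# Leaves are plain strings. Structure and thresholds are illustrative only.
TREE = ("balance", 10_000, "no", ("age", 48, "no", "yes"))


def classify(node, record):
    """Walk from the root to a leaf, taking one branch per test."""
    while not isinstance(node, str):
        feature, threshold, below, otherwise = node
        node = below if record[feature] < threshold else otherwise
    return node


def to_rules(node, conditions=()):
    """Flatten the tree into (condition, outcome) pairs, one per leaf."""
    if isinstance(node, str):
        return [(" and ".join(conditions) or "always", node)]
    feature, threshold, below, otherwise = node
    return to_rules(below, conditions + (f"{feature} < {threshold}",)) + to_rules(
        otherwise, conditions + (f"{feature} >= {threshold}",)
    )


rules = to_rules(TREE)
answer = classify(TREE, {"balance": 25_000, "age": 52})
```

Each root-to-leaf path becomes one rule, which is why trees are easy to explain to non-specialists even when building them from data is the harder part.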
  56. 57. CRM and Data Mining <ul><li>Recall: customer segmentation is key in CRM </li></ul><ul><ul><li>data mining can help improve understanding of customer behaviour </li></ul></ul><ul><ul><ul><li>helps locate meaningful segments from customer data </li></ul></ul></ul><ul><ul><li>users want to turn that understanding into automated interactions with their customers </li></ul></ul>
  57. 58. Integrating Data Mining & CRM <ul><li>Data mining application owns the modelling process </li></ul><ul><li>CRM application owns the campaign execution process </li></ul><ul><li>Goals: </li></ul><ul><ul><li>minimize pain involved with using models in campaigns </li></ul></ul><ul><ul><li>score records only when and where necessary </li></ul></ul>
  58. 59. Integrating Mining & CRM <ul><li>Step 1: </li></ul><ul><ul><li>analytic user creates model using mining system </li></ul></ul><ul><ul><li>model is then exported into campaign management system </li></ul></ul><ul><li>Step 2: </li></ul><ul><ul><li>Marketing user creates campaign that includes predictive models </li></ul></ul><ul><ul><li>when campaign executes, data mining engine scores customers dynamically </li></ul></ul>
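
The two steps can be sketched as a hand-off between two pieces of code. Everything here is hypothetical: the JSON export format, the function names and the customer fields are assumptions made to illustrate the flow, not the interface of any real mining or campaign management product.

```python
import json
import math


def export_model(weights, intercept):
    """Step 1: the mining side serializes a fitted model (here, as JSON)."""
    return json.dumps({"weights": weights, "intercept": intercept})


def load_scorer(exported):
    """Step 2: the campaign side turns the exported model into a scoring function."""
    model = json.loads(exported)

    def score(record):
        z = model["intercept"] + sum(
            w * record[k] for k, w in model["weights"].items()
        )
        return 1.0 / (1.0 + math.exp(-z))

    return score


def run_campaign(customers, in_segment, score, threshold):
    """Score only records in the campaign's segment, at execution time."""
    return [c["id"] for c in customers if in_segment(c) and score(c) >= threshold]


# Made-up model and customers, for illustration only.
exported = export_model({"past_orders": 1.0}, intercept=-2.0)
customers = [
    {"id": 1, "region": "East", "past_orders": 3},
    {"id": 2, "region": "East", "past_orders": 0},
    {"id": 3, "region": "West", "past_orders": 5},
]
targets = run_campaign(
    customers,
    in_segment=lambda c: c["region"] == "East",
    score=load_scorer(exported),
    threshold=0.5,
)
```

Because the segment test runs first, customer 3 is never scored at all, which is the point of scoring only when and where necessary.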
  59. 60. Benefits of Integration <ul><li>Pre-generated model selection </li></ul><ul><li>Score defined segments “on the fly” </li></ul><ul><ul><li>eliminates need to score entire database </li></ul></ul><ul><ul><li>improves efficiency of campaigns </li></ul></ul><ul><li>Reduces manual intervention and error </li></ul><ul><li>Accelerates the market cycle </li></ul><ul><ul><li>increases likelihood of reaching customers before competitors </li></ul></ul><ul><ul><li>improves campaign results and lowers costs </li></ul></ul>
  60. 61. Summary <ul><li>“Using the new media of the one-to-one future, you will be able to communicate directly with customers individually…” - Don Peppers & Martha Rogers (One-to-One Future) </li></ul><ul><li>“What are you afraid of?… Even if you’re not afraid of these things, the beauty is, with proper marketing, we can make you afraid” - Michael Saylor, CEO, MicroStrategy. </li></ul>