Speaker note: neural networks are a “black box” in the sense that it is not (easily) possible to analyze the reasons behind an answer.
1. eCommerce Technology 20-751 Data Mining
2. Coping with Information <ul><li>Computerization of daily life produces data </li></ul><ul><ul><li>Point-of-sale, Internet shopping (& browsing), credit cards, banks . . . </li></ul></ul><ul><ul><li>Info on credit cards, purchase patterns, product preferences, payment history, sites visited . . . </li></ul></ul><ul><li>Travel: one trip by one person generates info on destination, airline preferences, seat selection, hotel, rental car, name, address, restaurant choices . . . </li></ul><ul><li>Data cannot be processed or even inspected manually </li></ul>
3. Data Overload <ul><li>Only a small portion of data collected is analyzed (estimate: 5%) </li></ul><ul><li>Vast quantities of data are collected and stored out of fear that important info will be missed </li></ul><ul><li>Data volume grows so fast that old data is never analyzed </li></ul><ul><li>Database systems do not support queries like </li></ul><ul><ul><li>“Who is likely to buy product X?” </li></ul></ul><ul><ul><li>“List all reports of problems similar to this one” </li></ul></ul><ul><ul><li>“Flag all fraudulent transactions” </li></ul></ul><ul><li>But these may be the most important questions! </li></ul>
4. Data Mining <ul><li>“The key in business is to know something that nobody else knows.” </li></ul><ul><li>— Aristotle Onassis </li></ul><ul><li>“To understand is to perceive patterns.” </li></ul><ul><li>— Sir Isaiah Berlin </li></ul>PHOTO: LUCINDA DOUGLAS-MENZIES PHOTO: HULTON-DEUTSCH COLL
5. Data Mining <ul><li>Extracting previously unknown relationships from large datasets </li></ul><ul><ul><li>summarize large data sets </li></ul></ul><ul><ul><li>discover trends, relationships, dependencies </li></ul></ul><ul><ul><li>make predictions </li></ul></ul><ul><li>Differs from traditional statistics </li></ul><ul><ul><li>Huge, multidimensional datasets </li></ul></ul><ul><ul><li>High proportion of missing/erroneous data </li></ul></ul><ul><ul><li>Sampling unimportant; work with the whole population </li></ul></ul><ul><li>Sometimes called </li></ul><ul><ul><li>KDD (Knowledge Discovery in Databases) </li></ul></ul><ul><ul><li>OLAP (Online Analytical Processing) </li></ul></ul>
6. Taxonomy of Data Mining Methods <ul><li>Predictive Modeling: Decision Trees, Neural Networks, Naive Bayesian, Branching criteria </li></ul><ul><li>Database Segmentation: Clustering, K-Means </li></ul><ul><li>Deviation Detection </li></ul><ul><li>Link Analysis: Rule Association </li></ul><ul><li>Visualization </li></ul><ul><li>Text Mining: Semantic Maps </li></ul>SOURCE: WELGE & REINCKE, NCSA
7. Predictive Modeling <ul><li>Objective: use data about the past to predict future behavior </li></ul><ul><li>Sample problems: </li></ul><ul><ul><li>Will this (new) customer pay his bill on time? (classification) </li></ul></ul><ul><ul><li>What will the Dow-Jones Industrial Average be on October 15? (prediction) </li></ul></ul><ul><li>Technique: supervised learning </li></ul><ul><ul><li>decision trees </li></ul></ul><ul><ul><li>neural networks </li></ul></ul><ul><ul><li>naive Bayesian </li></ul></ul>
8. Predictive Modeling <ul><li>Which characteristics distinguish the two groups? </li></ul>(Figure: cartoon faces — Tridas, Vickie, Mike labeled “Honest”; Barney, Waldo, Wally labeled “Crooked”) SOURCE: WELGE & REINCKE, NCSA
9. Learned Rules in Predictive Modeling <ul><li>Honest = has round eyes and a smile </li></ul>(Figure: Tridas, Vickie, Mike) SOURCE: WELGE & REINCKE, NCSA
10. Rule Induction Example <ul><li>Data (height, hair, eyes → class): </li></ul><ul><ul><li>short, blond, blue → A </li></ul></ul><ul><ul><li>tall, blond, brown → B </li></ul></ul><ul><ul><li>tall, red, blue → A </li></ul></ul><ul><ul><li>short, dark, blue → B </li></ul></ul><ul><ul><li>tall, dark, blue → B </li></ul></ul><ul><ul><li>tall, blond, blue → A </li></ul></ul><ul><ul><li>tall, dark, brown → B </li></ul></ul><ul><ul><li>short, blond, brown → B </li></ul></ul><ul><li>Devise a predictive rule to classify a new person as A or B </li></ul>SOURCE: WELGE & REINCKE, NCSA
11. Build a Decision Tree <ul><li>Split first on hair (dark / red / blond) </li></ul><ul><ul><li>dark: short, blue = B; tall, blue = B; tall, brown = B </li></ul></ul><ul><ul><li>red: tall, blue = A </li></ul></ul><ul><ul><li>blond: short, blue = A; tall, brown = B; tall, blue = A; short, brown = B </li></ul></ul><ul><li>Completely classifies dark-haired and red-haired people </li></ul><ul><li>Does not completely classify blond-haired people; more work is required </li></ul>SOURCE: WELGE & REINCKE, NCSA
12. Build a Decision Tree <ul><li>Split blond-haired people further on eyes (blue / brown) </li></ul><ul><ul><li>blond, blue: short = A; tall = A </li></ul></ul><ul><ul><li>blond, brown: tall = B; short = B </li></ul></ul><ul><li>Decision tree is complete because </li></ul><ul><ul><li>1. All 8 cases appear at leaf nodes </li></ul></ul><ul><ul><li>2. At each leaf, all cases are in the same class (A or B) </li></ul></ul>SOURCE: WELGE & REINCKE, NCSA
13. Learned Predictive Rules <ul><li>hair = dark → B </li></ul><ul><li>hair = red → A </li></ul><ul><li>hair = blond and eyes = blue → A </li></ul><ul><li>hair = blond and eyes = brown → B </li></ul>SOURCE: WELGE & REINCKE, NCSA
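The learned rules above are compact enough to write out directly. A minimal sketch in Python — the data rows and the rule structure come from the rule-induction slides; the function name `classify` is mine:

```python
# The 8 training rows from the rule-induction table: (height, hair, eyes, class).
DATA = [
    ("short", "blond", "blue",  "A"),
    ("tall",  "blond", "brown", "B"),
    ("tall",  "red",   "blue",  "A"),
    ("short", "dark",  "blue",  "B"),
    ("tall",  "dark",  "blue",  "B"),
    ("tall",  "blond", "blue",  "A"),
    ("tall",  "dark",  "brown", "B"),
    ("short", "blond", "brown", "B"),
]

def classify(height, hair, eyes):
    """Apply the learned rules: test hair first, then eye color for blonds."""
    if hair == "dark":
        return "B"
    if hair == "red":
        return "A"
    # hair == "blond": split on eye color
    return "A" if eyes == "blue" else "B"

# The rules reproduce the class of every training example (height is unused).
assert all(classify(h, ha, e) == c for h, ha, e, c in DATA)
```

Note that the tree never consults height — a good tree compresses the data by ignoring irrelevant variables.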
14. Decision Trees <ul><li>Good news: a decision tree can always be built from training data </li></ul><ul><li>Any variable can be used at any level of the tree </li></ul><ul><li>Bad news: every data point may wind up at its own leaf (the tree has not compressed the data) </li></ul>(Figure: a tree splitting on height, then eyes, then hair — 8 cases, 7 nodes; this tree has not summarized the data effectively)
15. Database Segmentation (Clustering) <ul><li>“The art of finding groups in data” — Kaufman & Rousseeuw </li></ul><ul><li>Objective: gather items from a database into sets according to (unknown) common characteristics </li></ul><ul><li>Much more difficult than classification since the classes are not known in advance (no training) </li></ul><ul><li>Examples: </li></ul><ul><ul><li>Demographic patterns </li></ul></ul><ul><ul><li>Topic detection (words about the topic often occur together) </li></ul></ul><ul><li>Technique: unsupervised learning </li></ul>
16. Clustering Example <ul><li>Are there natural clusters in the data (36,10), (12,8), (38,42), (13,6), (36,38), (16,9), (40,36), (35,19), (37,7), (39,8)? </li></ul>
17. Clustering <ul><li>K-means algorithm </li></ul><ul><li>To divide a set into K clusters: </li></ul><ul><li>Pick K points at random. Use them to divide the set into K clusters based on nearest distance </li></ul><ul><li>Loop: </li></ul><ul><ul><li>Find the mean of each cluster and move the cluster center there </li></ul></ul><ul><ul><li>Redefine the clusters </li></ul></ul><ul><ul><li>If no point changes cluster, done </li></ul></ul><ul><li>K-means demo </li></ul><ul><li>Agglomerative clustering: start with N clusters & merge </li></ul><ul><li>Agglomerative clustering demo </li></ul>
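The K-means loop above can be sketched in a few lines of Python, run here on the ten points from the Clustering Example slide. This is a toy illustration, not any particular demo's implementation; squared Euclidean distance and the fixed random seed are my assumptions:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """K-means as on the slide: pick K points at random, assign each point
    to its nearest center, move each center to its cluster mean, and stop
    when no point changes cluster."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    assignment = None
    for _ in range(iters):
        # Assign each point to the nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:   # no point changed cluster: done
            break
        assignment = new_assignment
        for c in range(k):                 # move each center to its cluster mean
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return assignment, centers

# The ten points from the Clustering Example slide; K = 3 is a guess at the
# number of natural groups.
pts = [(36,10), (12,8), (38,42), (13,6), (36,38),
       (16,9), (40,36), (35,19), (37,7), (39,8)]
assignment, centers = kmeans(pts, 3)
```

With a different seed K-means can converge to a different (locally optimal) partition, which is why implementations usually run it several times.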
18. Neural Networks Networks of processing units called neurons. This is the j-th neuron: SOURCE: CONSTRUCTING INTELLIGENT AGENTS WITH JAVA <ul><li>Neurons are easy to simulate </li></ul><ul><li>n inputs x1, …, xn </li></ul><ul><li>n weights w1j, …, wnj </li></ul><ul><li>Neuron computes a linear function of the inputs </li></ul><ul><li>1 output yj, which depends only on that linear function </li></ul>
19. Neural Networks (Figure: INPUT LAYER → HIDDEN LAYER → OUTPUT LAYER; 1 input per input-layer neuron, 1 output per output-layer neuron, with one distinguished output as the “answer”)
20. Neural Networks <ul><li>Learning through back-propagation </li></ul><ul><ul><li>1. Network is trained by giving it many inputs whose output is known </li></ul></ul><ul><ul><li>2. The deviation between the computed and the known output is “fed back” to the neurons to adjust their weights </li></ul></ul><ul><ul><li>3. Network is then ready for live data </li></ul></ul>SOURCE: CONSTRUCTING INTELLIGENT AGENTS WITH JAVA
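As a minimal illustration of step 2 — feeding the deviation back to adjust the weights — here is a single neuron trained by gradient descent on a toy problem. A real back-propagation network has hidden layers and propagates the error through them; the sigmoid output, learning rate, and OR dataset are assumptions of this sketch, not something from the slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, n_inputs, rate=0.5, epochs=2000):
    """Present inputs with known outputs; feed the deviation back to
    adjust each weight (single-neuron special case of back-propagation)."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            deviation = target - y                     # error fed back
            for i in range(n_inputs):
                weights[i] += rate * deviation * x[i]  # adjust the weights
            bias += rate * deviation
    return weights, bias

# Learn logical OR, a linearly separable toy problem.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train(samples, 2)
predict = lambda x: sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

After training, `predict((0, 0))` is close to 0 and the other three inputs give values close to 1.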
21. Neural Network Classification <ul><li>“Which factors determine a pet’s favorite food?” </li></ul>(Figure: network inputs Species = Dog, Breed = Mixed, Owner’s age > 45, Owner’s sex = F; outputs food: Chum and food: Mr. Dog)
22. Neural Network Demos <ul><li>Demos: Notre Dame football, Automated surveillance, Handwriting analyzer </li></ul><ul><li>Financial applications: </li></ul><ul><ul><li>Churning: are trades being instituted just to generate commissions? </li></ul></ul><ul><ul><li>Fraud detection in credit card transactions </li></ul></ul><ul><ul><li>Kiting: isolate float on uncollected funds </li></ul></ul><ul><ul><li>Money laundering: detect suspicious money transactions (US Treasury’s Financial Crimes Enforcement Network) </li></ul></ul><ul><li>Insurance applications: </li></ul><ul><ul><li>Auto insurance: detect groups of people who stage accidents to collect on insurance </li></ul></ul><ul><ul><li>Medical insurance: detect professional patients and rings of doctors and references </li></ul></ul>
23. Rule Association <ul><li>Try to find rules of the form </li></ul><ul><li>IF <left-hand-side> THEN <right-hand-side> </li></ul><ul><li>(This is the reverse of a rule-based agent, where the rules are given and the agent must act. Here the actions are given and we have to discover the rules!) </li></ul><ul><li>Prevalence = probability that LHS and RHS occur together (sometimes called the “support factor”) </li></ul><ul><li>Predictability = probability of RHS given LHS (sometimes called “confidence” or “strength”) </li></ul>
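Both measures can be computed directly from a set of market baskets. A sketch with an invented toy dataset — the function name `rule_stats`, the basket contents, and the item names are mine:

```python
def rule_stats(baskets, lhs, rhs):
    """Prevalence (support) and predictability (confidence) of LHS -> RHS."""
    n = len(baskets)
    both = sum(1 for b in baskets if lhs in b and rhs in b)
    lhs_count = sum(1 for b in baskets if lhs in b)
    prevalence = both / n                                     # P(LHS and RHS)
    predictability = both / lhs_count if lhs_count else 0.0   # P(RHS | LHS)
    return prevalence, predictability

# Toy baskets, invented for illustration.
baskets = [
    {"milk", "soda"}, {"milk", "bread"}, {"milk", "soda", "chips"},
    {"bread", "soda"}, {"milk"},
]
prev, pred = rule_stats(baskets, "milk", "soda")
# milk and soda occur together in 2 of 5 baskets; milk appears in 4 baskets
assert (prev, pred) == (2 / 5, 2 / 4)
```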
24. Association Rules from Market Basket Analysis <ul><li><Dairy-Milk-Refrigerated> → <Soft Drinks Carbonated> </li></ul><ul><ul><li>prevalence = 4.99%, predictability = 22.89% </li></ul></ul><ul><li><Dry Dinners - Pasta> → <Soup-Canned> </li></ul><ul><ul><li>prevalence = 0.94%, predictability = 28.14% </li></ul></ul><ul><li><Paper Towels - Jumbo> → <Toilet Tissue> </li></ul><ul><ul><li>prevalence = 2.11%, predictability = 38.22% </li></ul></ul><ul><li><Dry Dinners - Pasta> → <Cereal - Ready to Eat> </li></ul><ul><ul><li>prevalence = 1.36%, predictability = 41.02% </li></ul></ul><ul><li><American Cheese Slices> → <Cereal - Ready to Eat> </li></ul><ul><ul><li>prevalence = 1.16%, predictability = 38.01% </li></ul></ul>
25. Use of Rule Associations <ul><li>Coupons, discounts </li></ul><ul><ul><li>Don’t give discounts on 2 items that are frequently bought together. Use the discount on 1 to “pull” the other </li></ul></ul><ul><li>Product placement </li></ul><ul><ul><li>Offer correlated products to the customer at the same time. Increases sales </li></ul></ul><ul><li>Timing of cross-marketing </li></ul><ul><ul><li>Send camcorder offer to VCR purchasers 2-3 months after VCR purchase </li></ul></ul><ul><li>Discovery of patterns </li></ul><ul><ul><li>People who bought X, Y and Z (but not any pair) bought W over half the time </li></ul></ul>
26. Finding Rule Associations <ul><li>Example: grocery shopping </li></ul><ul><li>For each item, count # of occurrences (say out of 100,000) </li></ul><ul><ul><li>apples 1891, caviar 3, ice cream 1088, pet food 2451, … </li></ul></ul><ul><li>Drop the ones that are below a minimum support level </li></ul><ul><ul><li>apples 1891, ice cream 1088, pet food 2451, … </li></ul></ul><ul><li>Make a table of each item against each other item </li></ul><ul><li>Discard cells below the support threshold. Now make a cube for triples, etc. Add 1 dimension for each product on the LHS. </li></ul>
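The counting scheme above — count single items, drop the rare ones, then tabulate pairs and discard cells below the threshold — can be sketched in Python. The mini-dataset is invented (the item names echo the slide, the counts do not):

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(baskets, min_support):
    """Count item occurrences, drop items below min_support, then count
    pairs among the surviving items and keep only frequent pairs."""
    item_counts = Counter(item for b in baskets for item in b)
    frequent = {i for i, c in item_counts.items() if c >= min_support}
    pair_counts = Counter()
    for b in baskets:
        kept = sorted(set(b) & frequent)       # only frequent items matter
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    # Discard cells below the threshold; triples would extend this to a cube.
    return {p: c for p, c in pair_counts.items() if c >= min_support}

# Invented toy baskets: caviar and pet food fall below the threshold of 2.
baskets = [
    {"apples", "ice cream"}, {"apples", "pet food"},
    {"apples", "ice cream"}, {"caviar", "apples"},
]
print(frequent_pairs(baskets, 2))   # {('apples', 'ice cream'): 2}
```

Pruning rare single items first is what keeps the pair (and triple) tables tractable — the same idea the Apriori family of algorithms exploits.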
27. Rule Association Demos <ul><li>Magnum Opus (RuleQuest, free download) </li></ul><ul><li>See5/C5.0 (RuleQuest, free download) </li></ul><ul><li>Cubist numerical rule finder (RuleQuest, free download) </li></ul><ul><li>IBM Interactive Miner </li></ul>
28. Text Mining <ul><li>Objective: discover relationships among people & things from their appearance in text </li></ul><ul><li>Topic detection, term detection </li></ul><ul><ul><li>When has a new term been seen that is worth recording? </li></ul></ul><ul><li>Generation of a “knowledge map,” a graph representing terms/topics and their relationships </li></ul><ul><li>SemioMap demo (Semio Corp.) </li></ul><ul><ul><li>Phrase extraction </li></ul></ul><ul><ul><li>Concept clustering (through co-occurrence), not by document </li></ul></ul><ul><ul><li>Graphic navigation (a link means concepts co-occur) </li></ul></ul><ul><ul><li>Processing time: 90 minutes per gigabyte </li></ul></ul><ul><li>Summary server (inxight.com) </li></ul>
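Concept linking through co-occurrence can be illustrated in a few lines of Python. This is a toy sketch of the general idea only, not SemioMap's method; the corpus, the word-level notion of “term,” and the threshold are all invented:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_links(fragments, min_count=2):
    """Link two terms if they appear together in at least min_count
    text fragments (the 'link means concepts co-occur' idea)."""
    counts = Counter()
    for text in fragments:
        terms = sorted(set(text.lower().split()))  # dedupe within a fragment
        counts.update(combinations(terms, 2))
    return {pair for pair, c in counts.items() if c >= min_count}

# Invented three-fragment corpus.
fragments = [
    "data mining finds patterns",
    "mining large data sets",
    "patterns in text",
]
links = cooccurrence_links(fragments, min_count=2)
# "data" and "mining" co-occur in two fragments, so they get linked
assert ("data", "mining") in links
```

A real system would extract multi-word phrases first and weight the links, but the graph it navigates is built from counts like these.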
29. Catalog Mining (image slide) SOURCE: TUPAI SYSTEMS
30. Visualization <ul><li>Objective: produce a graphic view of data so it becomes understandable to humans </li></ul><ul><li>Hyperbolic trees </li></ul><ul><li>SpotFire (free download from www.spotfire.com ) </li></ul><ul><li>SeeItIn3D </li></ul><ul><li>TableLens </li></ul><ul><li>OpenViz </li></ul>
31. Major Ideas <ul><li>There’s too much data </li></ul><ul><li>We don’t understand what it means </li></ul><ul><li>It can be handled without human intervention </li></ul><ul><li>Relationships can be discovered automatically </li></ul>
32. Q & A