Transcript

  • 1. Data Mining Introduction Prithwis Mukerjee, Ph.D.
  • 2. Why Suddenly Data Mining ?
    • Fluid, dynamic business environment
        • Markets are, or seem to be, saturated
        • Customers are aggressive and disloyal
        • Speed is essential
          • “The quick and the dead”
    • Availability of data
        • Vast amounts of data are generated, stored electronically and are waiting to be processed !
    • Availability of tools and techniques
      • Mathematical tools are available to “process” this data
          • This processing is significantly different from MIS / EDP style of data processing
      • Software tools are available to implement some of these mathematical models
  • 3. The Business Environment
    • Customer Behaviour
      • Customers have access to more channels
        • Newer retail formats
        • Online stores
      • Customers have access to more suppliers
        • Increased commoditisation
        • Customer loyalty is not assured any more
    • Market Saturation
        • Multiple suppliers operating in each market
        • Niche market based on demographics and preferences
      • Competition has intensified
    • Need for speed
        • Product life cycles are getting shorter
        • Time to market
  • 4. New way of looking at customer
    • Customer Relationship Management
      • Intimacy, collaboration, one-to-one partnerships are necessary
    • Need to ask ...
      • What classes of customers do we have ? Are there subclasses in terms of behaviour ?
      • How can we sell more to existing customers ? What exactly are they buying now ?
      • Is there a pattern in the way our customers behave ?
      • Who are my good customers ?
        • To whom should we sell more ?
      • Who are my bad customers ?
        • Who are likely to default or defraud ?
  • 5. Availability of vast amounts of data
    • ERP and OLTP systems
      • Their centralised RDBMSs are huge pools of firm-wide data that can overwhelm even the most dedicated manager
    • Datawarehouse
      • Technology has resulted in equally huge pools of historical data
    • Storage Capacity
      • Inexpensive
      • Ultra high capacity
  • 6. Availability of vast amounts of data
    • Cards
      • Credit & Debit Cards
      • Loyalty Card
      • Result in capture of huge pools of data
    • Transactional Data Capture
      • Point of sales systems, Bar code readers
      • Capture vast amount of transaction data at increasing levels of granularity
        • What was sold ? Product, SKU
        • When was it sold ? Date, time
        • How was it sold ? Discount, Promotions
      • Beyond simple sales
        • Telephone calls, frequent flyer data
  • 7. Trends leading to Data Flood
    • More data is generated:
      • Web, text, images …
      • Business transactions, calls, ...
      • Scientific data: astronomy, biology, etc
    • More data is captured:
      • Storage technology faster and cheaper
      • DBMS can handle bigger DB
  • 8. Data Growth
    • In 2 years (2003 to 2005), the size of the largest database tripled!
  • 9. Coming to the point ....
    • Data mining
      • Is the process of extracting unknown, valid and actionable information from large databases and then using this information to make crucial business decisions.
    • Database → Data Mining Tools → Data Presentation and Visualisation Tools → Decisions
  • 10. Knowledge Discovery Definition
    • Knowledge Discovery in Data is the
    • non-trivial process of identifying
      • valid
      • novel
      • potentially useful
      • and ultimately understandable patterns in data.
    • from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
  • 11. Old Wine in new bottles ?
    • Are these the same as data mining ?
      • SQL queries against large databases
      • Multidimensional database analysis
      • Online analytical processing
      • Sophisticated graphic visualisation
      • “Classical” statistical analysis
        • ANOVA ? Regression ? Correlation ?
    • What is missing ?
      • Discovery of information without a previously articulated or formulated hypothesis
  • 12. Related Fields
    • Data Mining and Knowledge Discovery draws on:
      • Statistics
      • Machine Learning
      • Databases
      • Visualization
  • 13. Statistics, Machine Learning, Data Mining
    • Statistics:
      • more theory-based
      • more focused on testing hypotheses
    • Machine learning
      • more heuristic
      • focused on improving performance of a learning agent
      • also looks at real-time learning and robotics – areas not part of data mining
    • Data Mining and Knowledge Discovery
      • integrates theory and heuristics
      • focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
    • Distinctions are fuzzy
  • 14. Data Mining Tasks
  • 15. Some Definitions
    • Instance (also Item or Record):
      • an example, described by a number of attributes,
      • e.g. a day can be described by temperature, humidity and cloud status
    • Attribute or Field
      • measuring aspects of the Instance, e.g. temperature
    • Class (Label)
      • grouping of instances, e.g. days good for playing
  • 16. Major Data Mining Tasks
    • Classification
      • predicting an item class
    • Clustering
      • finding clusters in data
    • Associations
      • e.g. A & B & C occur frequently
    • Visualization
      • to facilitate human discovery
    • Summarization
      • describing a group
    • Deviation Detection
      • finding changes
    • Estimation
      • predicting a continuous value
    • Link Analysis
      • finding relationships
    • And
      • So on ...
  • 17. Classification
    • Learn a method for predicting the instance class from pre-labeled (classified) instances
    • Many approaches: Statistics, Decision Trees, Neural Networks, ...
  • 18. Clustering
    • Find “natural” grouping of instances given un-labeled data
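
The idea of finding a "natural" grouping in un-labeled data can be sketched with k-means. This is only one possible technique; the slide does not name one. The points and the starting centroids below are hypothetical.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical un-labeled 2-D instances forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

No class labels are supplied; the grouping comes only from distances between instances.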
  • 19. Association Rules & Frequent Itemsets
    • Transactions (table shown on slide)
    • Frequent itemsets:
      • Milk, Bread (4)
      • Bread, Cereal (3)
      • Milk, Bread, Cereal (2)
      • …
    • Rules:
      • Milk => Bread (66%)
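
The itemset counts and the rule confidence on this slide can be reproduced with a small brute-force count. The slide's transaction table is not available in the transcript, so the transactions below are hypothetical and were chosen only so that the counts match the slide. The enumeration is a naive sketch, not an efficient frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, chosen only so the counts match the slide.
transactions = [
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Eggs"},
    {"Bread", "Cereal"},
    {"Milk", "Eggs"},
    {"Milk", "Cereal"},
]

def itemset_counts(transactions, max_size=3):
    """Count how many transactions contain each itemset of up to max_size items."""
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(basket), size):
                counts[frozenset(itemset)] += 1
    return counts

def confidence(counts, antecedent, consequent):
    """confidence(A => B) = count(A and B together) / count(A)."""
    a = frozenset(antecedent)
    return counts[a | frozenset(consequent)] / counts[a]

counts = itemset_counts(transactions)
print(counts[frozenset({"Milk", "Bread"})])
print(counts[frozenset({"Bread", "Cereal"})])
print(counts[frozenset({"Milk", "Bread", "Cereal"})])
print(confidence(counts, {"Milk"}, {"Bread"}))
```

With these transactions Milk appears 6 times and Milk with Bread 4 times, so the confidence of Milk => Bread is 4/6, about two thirds, which the slide shows as 66%.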
  • 20. Visualization & Data Mining
    • Visualizing the data to facilitate human discovery
    • Presenting the discovered results in a visually "nice" way
  • 21. Summarization
    • Describe features of the selected group
    • Use natural language and graphics
    • Usually in Combination with Deviation detection or other methods
    • Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ..."
  • 22. Data Mining Central Quest
    • Find true patterns and avoid overfitting (finding seemingly significant but really random patterns due to searching too many possibilities)
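
Why searching too many possibilities produces "significant" random patterns can be shown with a short probability calculation. The setting is an assumption made for illustration: 20 instances with random yes/no labels, and candidate features that are themselves random coin flips, independent of each other.

```python
from math import comb

def p_chance_hit(n_samples, min_correct):
    """P(a random yes/no feature agrees with random labels on >= min_correct samples)."""
    return sum(comb(n_samples, k) for k in range(min_correct, n_samples + 1)) / 2 ** n_samples

def p_any_chance_hit(n_candidates, n_samples, min_correct):
    """P(at least one of n_candidates independent random features does that well)."""
    return 1 - (1 - p_chance_hit(n_samples, min_correct)) ** n_candidates

# How likely is a meaningless feature that looks 80% accurate (16 of 20)?
for n_candidates in (1, 10, 100, 1000):
    print(n_candidates, round(p_any_chance_hit(n_candidates, n_samples=20, min_correct=16), 4))
```

A single random feature reaches 16 of 20 by chance well under 1% of the time, but among 1,000 candidates at least one almost certainly does. That feature would look like a pattern and would not hold up on new data.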
  • 23. Classification Methods
  • 24. Classification
    • Learn a method for predicting the instance class from pre-labeled (classified) instances
    • Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ...
    • Given a set of points from classes, what is the class of a new point ?
  • 25. Classification: Linear Regression
    • Linear Regression
    • w0 + w1*x + w2*y >= 0
    • Regression computes the weights wi from data to minimize squared error to ‘fit’ the data
    • Not flexible enough
  • 26. Regression for Classification
    • Any regression technique can be used for classification
      • Training: perform a regression for each class, setting the output to 1 for training instances that belong to class, and 0 for those that don’t
      • Prediction: predict class corresponding to model with largest output value (membership value)
    • For linear regression this is known as multi-response linear regression
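
The training and prediction steps above can be sketched in plain Python. The least-squares fit here uses the normal equations with a small Gaussian-elimination helper, which is an implementation choice for this sketch and not something the slides specify. The one-attribute training set is hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_least_squares(rows, targets):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve(xtx, xty)

def train(instances, labels):
    """One regression per class: target 1 for members of the class, 0 otherwise."""
    rows = [[1.0, *x] for x in instances]  # leading 1.0 is the intercept term
    return {
        cls: fit_least_squares(rows, [1.0 if lab == cls else 0.0 for lab in labels])
        for cls in sorted(set(labels))
    }

def predict(models, x):
    """Pick the class whose model gives the largest output (membership value)."""
    row = [1.0, *x]
    return max(models, key=lambda cls: sum(w * v for w, v in zip(models[cls], row)))

# Hypothetical one-attribute training set with two classes.
instances = [(0.0,), (1.0,), (3.0,), (4.0,)]
labels = ["a", "a", "b", "b"]
models = train(instances, labels)
print(predict(models, (0.5,)), predict(models, (3.5,)))
```

For this data the class "a" model works out to 1.1 - 0.3x and the class "b" model to -0.1 + 0.3x, so small x values are assigned to "a" and large ones to "b".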
  • 27. Classification: Decision Trees
    • if X > 5 then blue
    • else if Y > 3 then blue
    • else if X > 2 then green
    • else blue
    • (Figure: X-Y plot with the thresholds 5, 2 and 3 marked)
  • 28. DECISION TREE
    • An internal node is a test on an attribute.
    • A branch represents an outcome of the test, e.g., Color=red.
    • A leaf node represents a class label or class label distribution.
    • At each node, one attribute is chosen to split training examples into distinct classes as much as possible
    • A new instance is classified by following a matching path to a leaf node.
  • 29. Weather Data: Play or not Play?
    • Note: Outlook is the forecast, no relation to the Microsoft email program
      Outlook   Temperature  Humidity  Windy  Play?
      sunny     hot          high      false  No
      sunny     hot          high      true   No
      overcast  hot          high      false  Yes
      rain      mild         high      false  Yes
      rain      cool         normal    false  Yes
      rain      cool         normal    true   No
      overcast  cool         normal    true   Yes
      sunny     mild         high      false  No
      sunny     cool         normal    false  Yes
      rain      mild         normal    false  Yes
      sunny     mild         normal    true   Yes
      overcast  mild         high      true   Yes
      overcast  hot          normal    false  Yes
      rain      mild         high      true   No
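
The decision-tree slide says that at each node one attribute is chosen to split the training examples into distinct classes as much as possible, without naming a criterion. A minimal sketch, assuming information gain (one common choice) and using the 14 weather instances from this slide:

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("outlook", "temperature", "humidity", "windy")
# The 14 weather instances; the last field is the class (Play?).
WEATHER = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index):
    """Class entropy before the split minus the weighted entropy after it."""
    before = entropy([row[-1] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

gains = {name: information_gain(WEATHER, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print(best, {name: round(g, 3) for name, g in gains.items()})
```

Under this criterion Outlook has the largest gain (about 0.247 bits), which is consistent with Outlook being the root test of the example tree.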
  • 30. Example Tree for “Play?”
    • Outlook = sunny: test Humidity
      • high: No
      • normal: Yes
    • Outlook = overcast: Yes
    • Outlook = rain: test Windy
      • true: No
      • false: Yes
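
The example tree can be written as a small function and checked against the 14 weather instances. The function name and the tuple layout are choices made for this sketch.

```python
def play(outlook, humidity, windy):
    """Follow the example tree from the root test (Outlook) down to a leaf."""
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "Yes" if humidity == "normal" else "No"
    # Remaining branch: outlook == "rain", decided by the Windy test.
    return "No" if windy else "Yes"

# (outlook, temperature, humidity, windy, play) rows from the weather data.
WEATHER = [
    ("sunny", "hot", "high", False, "No"),
    ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),
    ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),
    ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"),
    ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),
    ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),
    ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"),
    ("rain", "mild", "high", True, "No"),
]

misclassified = [row for row in WEATHER if play(row[0], row[2], row[3]) != row[4]]
print(len(misclassified))
```

The tree never looks at Temperature, and it classifies all 14 training instances correctly. That says nothing yet about accuracy on new days.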
  • 31. Classification: Neural Nets
    • Can select more complex regions
    • Can be more accurate
    • Also can overfit the data – find patterns in random noise
  • 32. Classification: other approaches
    • Naïve Bayes
    • Rules
    • Support Vector Machines
    • Genetic Algorithms
    • See www.KDnuggets.com/software/
  • 33. Direct Marketing Paradigm
    • Find most likely prospects to contact
    • Not everybody needs to be contacted
    • Number of targets is usually much smaller than number of prospects
    • Typical Applications
      • retailers, catalogues, direct mail (and e-mail)
      • customer acquisition, cross-sell, attrition prediction
      • ...
  • 34. Direct Marketing Evaluation
    • Accuracy on the entire dataset is not the right measure
    • Approach
      • develop a target model
      • score all prospects and rank them by decreasing score
      • select top P% of prospects for action
    • How do we decide what is the best subset of prospects ?
  • 35. Model-Sorted List
    • Use a model to assign a score to each customer
    • Sort customers by decreasing score
    • Expect more targets (hits) near the top of the list
      No   Score  Target  CustID  Age
      1    0.97   Y       1746    …
      2    0.95   N       1024    …
      3    0.94   Y       2478    …
      4    0.93   Y       3820    …
      5    0.92   N       4897    …
      …    …      …       …       …
      99   0.11   N       2734    …
      100  0.06   N       2422    …
    • 3 hits in the top 5% of the list
    • If there are 15 targets overall, then the top 5 has 3/15 = 20% of targets
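
The score, rank and select-top-P% procedure can be sketched as follows. Only the first five rows and the total of 15 targets come from the slide; the positions of the other 12 targets and the score formula are hypothetical.

```python
def gains_at(scored, percent):
    """Fraction of all targets found in the top `percent` of the score-sorted list."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * percent / 100))
    hits = sum(1 for _, is_target in ranked[:top_n] if is_target)
    total = sum(1 for _, is_target in ranked if is_target)
    return hits / total

# Hypothetical list of 100 prospects with 15 targets overall.
# Ranks 1, 3 and 4 are targets, as on the slide; the other positions are made up.
target_ranks = {1, 3, 4, 7, 9, 12, 15, 20, 26, 33, 41, 50, 62, 75, 90}
scored = [(1 - rank / 101, rank in target_ranks) for rank in range(1, 101)]

captured = gains_at(scored, 5)
print(round(captured, 2))          # share of targets in the top 5%
print(round(captured / 0.05, 2))   # ratio to the 5% a random selection would expect
```

Contacting only the top 5% reaches 20% of the targets here, four times what a random 5% sample would be expected to reach. This is why overall accuracy is not the measure of interest in direct marketing.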
  • 36. Data Mining Applications
  • 37. Data Mining Applications
    • Science: Chemistry, Physics, Medicine
      • Biochemical analysis
      • Remote sensors on a satellite
      • Telescopes – star galaxy classification
      • Medical Image analysis
    • Bioscience
      • Sequence-based analysis
      • Protein structure and function prediction
      • Protein family classification
      • Microarray gene expression
  • 38. Microarrays: Classifying Leukemia
    • Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
      • 72 examples (38 train, 34 test), about 7,000 genes
    • ALL vs AML: visually similar, but genetically very different
    • Best model: 97% accuracy, 1 error (sample suspected mislabelled)
  • 39. Microarray Potential Applications
    • New and better molecular diagnostics
      • Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip, based on Affymetrix technology
    • New molecular targets for therapy
      • few new drugs, large pipeline, …
    • Improved treatment outcome
      • Partially depends on genetic signature
    • Fundamental Biological Discovery
      • finding and refining biological pathways
    • Personalized medicine ?!
  • 40. Data Mining Applications
    • Pharmaceutical companies, Insurance and Health care, Medicine
      • Drug development
      • Identify successful medical therapies
      • Claims analysis, fraudulent behavior
      • Medical diagnostic tools
      • Predict office visits
    • Financial Industry, Banks, Businesses, E-commerce
      • Stock and investment analysis
      • Identify loyal customers vs. risky customers
      • Predict customer spending
      • Risk management
      • Sales forecasting
  • 41. Data Mining Applications
    • Retail and Marketing
      • Customer buying patterns/demographic characteristics
      • Mailing campaigns
      • Market basket analysis
      • Trend analysis
  • 42. Application: Direct Marketing and CRM
    • Most major direct marketing companies are using modeling and data mining
    • Most financial companies are using customer modeling
    • Modeling is easier than changing customer behaviour
    • Example
      • Verizon Wireless reduced customer attrition rate from 2% to 1.5%, saving many millions of $
  • 43. Application: Security and Fraud Detection
    • Credit Card Fraud Detection
      • over 20 Million credit cards protected by Neural networks (Fair, Isaac)
    • Securities Fraud Detection
      • NASDAQ KDD system
    • Phone fraud detection
      • AT&T, Bell Atlantic, British Telecom/MCI
  • 44. Fraud Detection and Management (1)
    • Applications
      • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
    • Approach
      • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
    • Examples
        • auto insurance: detect a group of people who stage accidents to collect on insurance
        • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
        • medical insurance: detect professional patients and ring of doctors and ring of references
  • 45. Fraud Detection and Management (2)
    • Detecting inappropriate medical treatment
      • Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving A$1m/yr).
    • Detecting telephone fraud
      • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
      • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
    • Retail
      • Analysts estimate that 38% of retail shrink is due to dishonest employees.
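
The telephone call model above, analysing patterns that deviate from an expected norm, can be sketched with a simple z-score rule on call duration. The history, the new calls and the threshold of 3 standard deviations are hypothetical. A real model would also use destination and time of day or week, as the slide notes.

```python
from statistics import mean, pstdev

def unusual_calls(history, new_calls, threshold=3.0):
    """Return new call durations that lie far from the norm set by past calls."""
    baseline = mean(history)
    spread = pstdev(history)  # assumes the history is not all one value
    return [d for d in new_calls if abs(d - baseline) / spread > threshold]

# Hypothetical call durations in minutes for one subscriber.
history = [3, 4, 5, 4, 3, 5, 4, 4]
new_calls = [4, 5, 60]
print(unusual_calls(history, new_calls))
```

The norm is estimated from past behaviour only, so a single extreme new call cannot inflate the spread it is judged against.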
  • 46. Application: e-Commerce
    • Amazon.com recommendations
      • if you bought (viewed) X, you are likely to buy Y
    • Netflix
      • If you liked "Monty Python and the Holy Grail", you get a recommendation for "This is Spinal Tap"
    • Comparison shopping
      • Froogle, mySimon, Yahoo Shopping, …
  • 47. Example : Processing Loan Applications
    • Given: questionnaire with financial and personal information
    • Problem: should money be lent?
    • Borderline cases referred to loan officers
    • But: 50% of accepted borderline cases defaulted!
    • Solution:
      • reject all borderline cases?
    • Borderline cases are most active customers!
  • 48. Enter Machine Learning
    • Given:
      • 1000 training examples of borderline cases
    • 20 attributes:
      • age, years with current employer, years at current address, years with the bank, years at current job, other credit cards
    • Learned rules predicted 2/3 of borderline cases correctly!
    • Rules could be used to explain decisions to customers
  • 49. Case study 2: Screening images
    • Given:
      • radar satellite images of coastal waters
    • Problem:
      • detecting oil slicks in those images
    • Oil slicks = dark regions with changing size and shape
    • Look-alike dark regions can be caused by weather conditions (e.g. high wind)
    • Expensive process requiring highly trained personnel
  • 50. Enter Machine Learning
    • Dark regions extracted from normalized image
    • Attributes:
      • size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background
    • Constraints:
      • Scarcity of training examples (oil slicks are rare!)
      • Unbalanced data: most dark regions aren’t oil slicks
      • Regions from same image form a batch
      • Requirement is adjustable false-alarm rate
  • 51. Data Mining Applications ..
    • Prediction & Description
      • Would this customer buy this product ?
      • Is this customer likely to leave ?
    • Relationship Marketing
      • What kind of products have been bought by this customer ?
      • What kind of marketing strategy has this customer responded to ?
    • Outlier identification and Fraud detection
      • Locating unusual cases and behaviours
    • Customer Profiling & Segmentation
      • Is the bottomline that we are all looking at ...
  • 52. Data Mining Challenges
    • Computationally expensive to investigate all possibilities
    • Dealing with noise/missing information and errors in data
    • Choosing appropriate attributes/input representation
    • Finding the minimal attribute space
    • Finding adequate evaluation function(s)
    • Extracting meaningful information
    • Not overfitting
  • 53. Are All “Discovered” Patterns Interesting?
    • Interestingness measures:
      • A pattern is interesting if
        • it is easily understood by humans,
        • valid on new or test data with some degree of certainty,
        • potentially useful,
        • novel , or validates some hypothesis that a user
    • Objective vs. subjective measures:
      • Objective: based on statistics and structures of patterns
        • support and confidence
      • Subjective: based on user’s belief in the data
        • unexpectedness, novelty, actionability, etc.
    • Completeness - Find all the interesting patterns
        • Can a data mining system find all the interesting patterns?
        • Association vs. classification vs. clustering
  • 54. Privacy Issues
  • 55. Data Mining, Privacy, and Security
    • TIA: Terrorism (formerly Total) Information Awareness Program
      • TIA program closed by Congress in 2003 because of privacy concerns
    • However, in 2006 we learned that the NSA was analyzing US domestic call info to find potential terrorists
      • Invasion of Privacy or Needed Intelligence?
  • 56. Criticism of Analytic Approaches to Threat Detection:
    • Data Mining will
      • be ineffective - generate millions of false positives
      • and invade privacy
    • First, can data mining be effective?
  • 57. Can Data Mining and Statistics be Effective for Threat Detection?
    • Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
    • Reality: Analytical models correlate many items of information to reduce false positives.
    • Example: Identify one biased coin from 1,000.
      • After one throw of each coin, we cannot tell which one is biased
      • After 30 throws, one biased coin will stand out with high probability.
      • Can identify 19 biased coins out of 100 million with sufficient number of throws
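
The coin example can be checked with binomial probabilities. The slide does not say how biased the coin is or how a coin gets flagged, so the bias of 0.95 and the rule "27 or more heads in 30 throws" are assumptions made for this sketch.

```python
from math import comb

def p_at_least(k, n, p):
    """P(at least k heads in n throws of a coin whose heads probability is p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

THROWS, CUTOFF = 30, 27   # assumed rule: flag a coin showing 27+ heads in 30 throws
BIAS = 0.95               # assumed heads probability of the one biased coin
FAIR_COINS = 999          # the other coins out of 1,000

detect = p_at_least(CUTOFF, THROWS, BIAS)
false_alarms = FAIR_COINS * p_at_least(CUTOFF, THROWS, 0.5)
heads_after_one_throw = FAIR_COINS * 0.5

print(round(detect, 3))             # chance the biased coin is flagged
print(round(false_alarms, 4))       # expected number of fair coins flagged
print(round(heads_after_one_throw)) # fair coins showing heads after a single throw
```

After one throw about half of the fair coins look exactly like the biased one. After 30 throws the expected number of fair coins passing the cutoff is far below one, while the biased coin passes most of the time. Combining many observations is what drives false positives down.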
  • 58. Another Approach: Link Analysis
    • Can find unusual patterns in the network structure
  • 59. Analytic technology can be effective
    • Data Mining is just one additional tool to help analysts
    • Combining multiple models and link analysis can reduce false positives
    • Today there are millions of false positives with manual analysis
    • Analytic technology has the potential to reduce the current high rate of false positives
  • 60. Data Mining with Privacy
    • Data Mining looks for patterns, not people!
    • Technical solutions can limit privacy invasion
      • Replacing sensitive personal data with anon. ID
      • Give randomized outputs
      • Multi-party computation – distributed data
    • Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
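
The first technical solution listed above, replacing sensitive personal data with an anonymous ID, might look like the sketch below. The keyed hash (HMAC-SHA256 from the Python standard library) is one possible mechanism chosen for illustration; it is not taken from the slides or from the cited article. The key, field names and records are hypothetical.

```python
import hashlib
import hmac

def pseudonym(customer_id, secret_key):
    """Keyed hash of the identifier: stable across tables, hard to reverse without the key."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical key; in practice it would be stored apart from the mined data.
SECRET_KEY = b"hypothetical-key-kept-outside-the-dataset"

records = [
    {"customer_id": "1746", "score": 0.97},
    {"customer_id": "1024", "score": 0.95},
]

# The mining data set keeps the attributes but drops the real identifier.
anonymised = [
    {"anon_id": pseudonym(r["customer_id"], SECRET_KEY), "score": r["score"]}
    for r in records
]
print(anonymised)
```

The same customer always maps to the same anonymous ID, so patterns can still be found across records without the analyst seeing who the customer is.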
  • 61. Summary
    • Data Mining and Knowledge Discovery are needed to deal with the flood of data
    • Knowledge Discovery is a process !
    • Avoid overfitting (finding random patterns by searching too many possibilities)
  • 62. Additional Resources
    • www.KDnuggets.com: data mining software, jobs, courses, etc.
    • www.acm.org/sigkdd: ACM SIGKDD, the professional society for data mining