Data Mining Lecture 1: Introduction to Data Mining

  • 1. Data Mining
      • Lecture 1:
      • Introduction to Data Mining
      • Manuel Penaloza, PhD
  • 2. Introduction to Data Mining
    • Society produces huge amounts of data daily
      • Retail Store
        • POS data on customer purchases
      • Banks
        • Collection of customer service calls
      • Telecommunications
        • Phone call records (mobile and landline calls)
      • Medicine
        • Genomic data collected on the structure of genes
      • Government
        • Law enforcement data, income tax data
      • Others: (Transactional) data from Sports, Schools, Research, Search engines, etc.
  • 3. What is Data Mining (DM)?
    • It is the process of discovering hidden relationships and patterns in large data sets
      • It can also predict the outcome of a future observation
    • Data mining is an interdisciplinary field
      • It is an extension to statistical analysis
      • It uses techniques from:
        • Statistics
        • Machine learning
        • Pattern recognition
        • Database technology
        • Visualization
        • High-performance computing
  • 4. Questions answered by DM
    • DM extracts useful information from a dataset to answer questions such as:
      • Which credit card (CC) customers are most profitable?
      • Which loan applicants are high-risk?
      • Which customers will respond to a planned promotion?
      • How do we detect phone card fraud?
      • How do customer profiles change over time?
      • Which customers prefer product A over product B?
      • What is the revenue prediction for next year?
      • Which students are more likely to transfer than others?
      • Which taxpayers may be cheating the system?
      • Who is most likely to violate a probation sentence?
      • What is the predicted outcome for some treatment?
  • 5. Data sources
    • Relational Databases
      • Transactional data with many tables
    • Data warehouses
      • Historical data, aggregated and updated periodically
    • Files
      • In a specific text format (e.g., CSV) or a proprietary binary format
    • Internet or electronic mail
      • HTML, XML, web search results, e-mails
    • Scientific, research
      • Seismology, remote sensing, etc.
  • 6. Example: Health System
    • Characteristics of the Health System:
      • Personal medical records (GP, specialists, etc.)
      • Billing records
      • Hospital data (surgery, admission, etc.)
    • Questions:
      • Are MDs following the procedures?
      • Which patients may have an adverse drug reaction?
      • Are people committing fraud?
      • Which patients are most likely to get cancer?
  • 7. Case study: E-commerce
    • A person buys a book from Amazon.com
    • Objective: Recommend other books this person is likely to buy
    • Amazon may do clustering or sequential pattern analysis based on books bought by other people
    • Data analyzed:
      • “Customers who bought ‘Data Mining: Practical Machine Learning Tools and Techniques’ also bought ‘Introduction to Data Mining’”
    • Recommendations have been successful for Amazon
      • They increase buyers’ satisfaction and purchases
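
The "customers who bought X also bought Y" idea can be sketched with simple co-purchase counting. This is only an illustration, not Amazon's actual method: the function name `recommend` and the purchase histories below are made up for the example.

```python
from collections import Counter

def recommend(orders, bought, top_n=3):
    """Rank titles that appear alongside `bought` in other customers' orders."""
    counts = Counter()
    for basket in orders:
        if bought in basket:
            # Count every other title found in a basket that contains `bought`.
            counts.update(title for title in basket if title != bought)
    return [title for title, _ in counts.most_common(top_n)]

# Hypothetical purchase histories: one set of titles per customer.
PRACTICAL = "Data Mining: Practical Machine Learning Tools and Techniques"
INTRO = "Introduction to Data Mining"
CONCEPTS = "Data Mining: Concepts and Techniques"

orders = [
    {PRACTICAL, INTRO},
    {PRACTICAL, INTRO, CONCEPTS},
    {CONCEPTS},
]

print(recommend(orders, PRACTICAL, top_n=1))
```

With this toy data, the title bought most often together with the first book is recommended first. A production system would also have to handle popularity bias and much larger data volumes.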
  • 8. What motivated data mining?
    • Growth in data collection
    • Presence of data warehouses with reliable data
    • Competitive pressure to increase sales
    • The development of commercial off-the-shelf (COTS) data mining software
      • Examples: XLMiner, Insightful Miner, SAS, SPSS
    • Growth of computing power and storage capacity
    • High dimensionality of the data
    • Heterogeneous and complex data
    • Limitations of humans
  • 9. Insightful Miner™ 7: GUI
    * Figures taken from the Insightful Miner 7 Guide
  • 10. Creating Models
    • Create a network of pipelined components
      • By dragging and dropping components
  • 11. Choosing a data mining system
    • Data mining systems differ in functionality and methodology
    • Selection determined by:
      • Type of operating system used in your organization
      • The data sources handled by the tool:
        • ASCII text files, relational databases, XML data
      • The data mining functions and methods offered
      • Scalability of the system
        • Row and column scalability
      • Visualization tools available
      • Graphical user interface that guides the execution of the methods
      • Integration with other information systems
      • Cost and performance
  • 12. Data Mining in Databases
    • Current applications include data mining modules
    • Example:
      • Database management systems such as Oracle and MS SQL Server
      • CRM (Customer Relationship Management)
    • Advantages for Database systems:
      • One-stop shopping
      • Minimize data movement and conversion
    • Disadvantages for Database systems:
      • Limited to DM methods available in the system
      • Data extractions and transformations may not be powerful enough
  • 13. Standard data mining life cycle
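    • CRISP-DM (Cross-Industry Standard Process for Data Mining)
    • It is an iterative process with phase dependencies
    • It consists of six (6) phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment
    • See www.crisp-dm.org for more information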
  • 14. CRISP-DM
    • Cross-industry standard developed in 1996
      • Analysts from SPSS/ISL, NCR, Daimler-Benz, OHRA
    • Funding from European Commission
    • Important Characteristics:
      • Non-proprietary
      • Application/Industry neutral
      • Tool neutral
      • General problem-solving process
      • Process with six phases but missing:
        • Saving results and updating the model
  • 15. CRISP-DM Phases (1)
    • Business Understanding
      • Understand project objectives and requirements
      • Formulation of a data mining problem definition
    • Data Understanding
      • Data collection
      • Evaluate the quality of the data
      • Perform exploratory data analysis
    • Data Preparation
      • Clean, prepare, integrate, and transform the data
      • Select appropriate attributes and variables
  • 16. CRISP-DM Phases (2)
    • Modeling
      • Select and apply appropriate modeling techniques
      • Calibrate model parameters to optimize results
      • If necessary, return to data preparation phase to satisfy model's data format
    • Evaluation
      • Determine if model satisfies objectives set in phase 1
      • Identify business issues that have not been addressed
    • Deployment
      • Organize and present the model to the “user”
      • Put model into practice
      • Set up for continuous mining of the data
  • 17. Data mining tasks (1)
    • Classification
      • Predict the categorical value of a target (dependent) variable based on the values of other attributes
      • Target variable is partitioned into classes
      • It predicts class membership of a new observation
      • Example: Which drug should be prescribed for older patients with low sodium/potassium ratios?
    • Estimation
      • Similar to classification except target variable is numeric
      • That is, predicting a numeric value
      • Example: Estimate the blood pressure of a person based on his/her age, gender, body mass index, etc.
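
Estimation can be illustrated with a least-squares line fitted to one numeric predictor. This is a minimal sketch: the helper `fit_line` and the age and blood-pressure figures are invented for the example and are not medical data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: age in years and systolic blood pressure.
ages = [30, 40, 50, 60]
systolic = [115, 120, 125, 130]

slope, intercept = fit_line(ages, systolic)
estimate = intercept + slope * 45  # estimate for a 45-year-old
print(slope, intercept, estimate)
```

The target here is numeric, which is what separates estimation from classification. A realistic model would use several attributes (gender, body mass index, etc.) rather than age alone.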
  • 18. Data mining tasks (2)
    • Prediction
      • Similar to estimation except that results lie in the future
      • Example: Predict the price of a stock 3 months into the future
    • Clustering
      • Grouping similar records together
      • Example: Find patients with similar profiles
    • Associations
      • Uncover rules that indicate the association between two or more attributes
      • Find out which items are purchased together
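
Association rules are usually scored by support and confidence. The sketch below computes both for a rule "bread → milk" over a handful of made-up shopping baskets; the function names and the data are assumptions for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))       # share of baskets with both items
print(confidence(transactions, {"bread"}, {"milk"}))  # how often bread buyers also buy milk
```

A rule is typically kept only when both measures exceed thresholds chosen by the analyst.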
  • 19. Task: Classification
    • Build a model that learns to predict the class from pre-labeled instances or observations
      • Many approaches: Regression, Decision Trees, Neural Networks
    • Given a set of points from known classes, what is the class of a new point?
    * Diagram taken from www.kdnuggets.com/data_mining_course/index.html
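
One way to answer that question is a k-nearest-neighbor vote. The slide names regression, decision trees, and neural networks, so k-NN is chosen here only because it is short to write down. The function `knn_classify` and the labeled points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, point, k=3):
    """Predict the class of `point` by majority vote among its k nearest labeled points."""
    # Sort labeled points by Euclidean distance to the new point and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up pre-labeled instances: ((x, y), class).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify(train, (2.0, 2.0)))
print(knn_classify(train, (6.0, 6.5)))
```

The first new point lies among the "A" instances and the second among the "B" instances, so each receives the class of its neighbors.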
  • 20. Task: Clustering
    • Find groupings of instances given unlabeled data
    * Diagram taken from www.kdnuggets.com/data_mining_course/index.html
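
A grouping of unlabeled points can be sketched with k-means, one common clustering method. The starting centroids are passed in explicitly so the run is repeatable; they, the function name `kmeans`, and the points are assumptions made for this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up unlabeled instances forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No class labels are used anywhere: the groups emerge from distances alone, which is the difference from the classification task.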
  • 21. DM looks easy
    • Diagram: Data → Data Mining Method (Regression, Decision Tree, Neural Network, …, Association Rules) → Model
    • But it is not easy
    • The real world is complicated
  • 22. Methods and Techniques
    • Cluster Analysis (tasks: clustering)
    • Association Rules (tasks: association)
    • Decision trees (tasks: prediction, classification)
    • Neural networks (tasks: prediction, classification)
    • K-nearest neighbor (tasks: prediction, classification, clustering)
    • Regression analysis (tasks: estimation, prediction)
    • Confidence interval estimation (task: estimation)
  • 23. Fallacies of Data Mining (1)
    • Fallacy 1: There are data mining tools that automatically find the answers to our problem
      • Reality: There are no automatic tools that will solve your problems “while you wait”
    • Fallacy 2: The DM process requires little human intervention
      • Reality: The DM process requires human intervention in all its phases, including updating and evaluating the model by human experts
    • Fallacy 3: Data mining has a quick ROI
      • Reality: It depends on the startup costs, personnel costs, data source costs, and so on
  • 24. Fallacies of Data Mining (2)
    • Fallacy 4: DM tools are easy to use
      • Reality: Analysts must be familiar with the model
    • Fallacy 5: DM will identify the causes of the business problem
      • Reality: DM tools only identify patterns in your data; analysts must identify the causes
    • Fallacy 6: Data mining will clean up a data repository automatically
      • Reality: The sequence of transformation tasks must be defined by an analyst during the early DM phases
      • * Fallacies described by Jen Que Louie, President of Nautilus Systems, Inc.
  • 25. In summary,
    • Problems suitable for Data Mining:
      • Require discovering knowledge to make the right decisions
      • Current solutions are not adequate
      • Expect a high payoff from the right decisions
      • Have accessible, sufficient, and relevant data
      • Have a changing environment
    • IMPORTANT:
      • ENSURE privacy if personal data is used!
      • Not every data mining application is successful!
  • 26. Main References
    • Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann Publishers
    • Daniel Larose. Discovering Knowledge in Data: An Introduction to Data Mining, Wiley
    • Pang-Ning Tan et al. Introduction to Data Mining, Addison Wesley
    • Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
    • Online data mining course offered by KDnuggets™ at www.kdnuggets.com/data_mining_course/index.html
    • Engineering Statistics Handbook available online at http://www.itl.nist.gov/div898/handbook/eda/section1/eda126.htm
  • 27. Exercise #1
    • CRISP-DM is not the only DM process. Do a quick search on the Internet for another process and describe its similarities to and differences from CRISP-DM.
    • Determine how data mining could help a web search engine company like Google in its operation.
      • Identify one or more objectives.
      • Which data mining task(s) could help this company?