Data mining introduction


Published in: Technology

  1. Data Mining Introduction
  2. Intro
     Data mining is a powerful new technology with great potential to help companies focus on the most important information in the data they have collected about the behavior of their customers and potential customers.
  3. Data collections in the real world
     • The ten largest transaction-processing databases range from 3 to 18 terabytes
     • The ten largest decision-support databases range from 10 to 29 terabytes
     • Sizes doubled or tripled between 2001 and the end of 2003
  4. Questions arise
     • Is there any new, unexpected and potentially useful information contained in this data?
     • Can we use historical data to predict future outcomes (e.g. customer behavior, fraud detection)?
  5. Some examples of data mining: 1. Telecommunications
     A huge amount of data is collected daily:
     • Transactional data (about each phone call)
     • Data on mobile phones, landline phones, Internet, etc.
     • Other customer data (billing, personal information, etc.)
     • Additional data (network load, faults, etc.)
     Questions that arise:
     • Which customer group is highly profitable, and which one is not?
     • To which customers should we advertise which special offers?
     • What call rates would increase profit without losing good customers?
     • How do customer profiles change over time?
     • Fraud detection (stolen mobile phones or phone cards)
  6. Another example: 2. Health
     Different aspects of the health system:
     • Personal health records (at GPs, specialists, etc.)
     • Hospital data (e.g. admission data, midwives data, surgery data)
     • Billing information (Medicare, PBS)
     Questions:
     • Are doctors following the procedures (e.g. prescription of medication)?
     • Adverse drug reactions (analysis of different data collections to find correlations)
     • Are people committing fraud (e.g. doctor shoppers)?
     • Are there correlations between social and environmental issues and people's health?
  7. What is data mining?
     Data mining is the automated extraction of previously unrealized information from large data sources for the purpose of supporting business actions.
  8. Some more definitions
     • Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
     • An information extraction activity whose goal is to discover hidden facts contained in databases.
     • Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data.
  9. Data mining process
  10. Data mining process
      • Extract, transform, and load transaction data onto the data warehouse system.
      • Store and manage the data in a multidimensional database system.
      • Provide data access to business analysts and information technology professionals.
  11. Data mining process
      • Analyze the data with application software.
      • Present the data in a useful format, such as a graph or table.
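
The steps on the two process slides above can be sketched end to end in a few lines. This is only a minimal illustration, not a real warehouse: the in-memory SQLite database, the `sales` table, the sample rows, and all function names are assumptions made up for the example.

```python
import sqlite3

# Hypothetical raw transaction records: (customer, product, amount as text).
RAW_TRANSACTIONS = [
    ("alice", "beer", "4.50"),
    ("alice", "diapers", "12.00"),
    ("bob", "beer", "4.50"),
    ("carol", "diapers", "12.00"),
    ("carol", "beer", "9.00"),
]


def load_warehouse(rows):
    """Extract, transform (normalize text, text -> float), and load into SQLite."""
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    conn.execute("CREATE TABLE sales (customer TEXT, product TEXT, amount REAL)")
    cleaned = [(c.strip().lower(), p.strip().lower(), float(a)) for c, p, a in rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return conn


def revenue_by_product(conn):
    """Analyze: total revenue per product, highest first."""
    query = (
        "SELECT product, SUM(amount) FROM sales "
        "GROUP BY product ORDER BY SUM(amount) DESC"
    )
    return conn.execute(query).fetchall()


def present(rows):
    """Present: format the analysis result as a plain-text table."""
    lines = [f"{'product':<10}{'revenue':>10}"]
    lines += [f"{product:<10}{total:>10.2f}" for product, total in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    warehouse = load_warehouse(RAW_TRANSACTIONS)
    print(present(revenue_by_product(warehouse)))
```

In practice the "store and manage" and "provide access" steps would involve a dedicated warehouse and access controls; here a single SQLite connection stands in for both.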
  12. DM is multidisciplinary
  13. What they do
      Detect patterns in data: rules, patterns, classes, associations and functional dependencies, outliers, data distributions, clusters
  14. How they do it
      Search through data and pattern space, non-parametric modelling, filtering, aggregation
      How well they do it
      Errors and biases, over-fitting, confounding effects, speed, scalability
  15. Challenges in DM
      • Data size
          • The size of data collections grows faster than linearly, doubling every 18 months
          • Scalable algorithms are needed
      • Data complexity
          • Different types of data (free text, HTML, XML, multimedia)
          • Dimensionality of the data increases (more attributes)
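
To make the "doubling every 18 months" claim concrete, the growth can be written as size(t) = size(0) × 2^(t / 18), with t in months. The sketch below assumes that idealized exponential model; the function name and the 10-terabyte starting size are illustrative, not figures from the slides.

```python
def projected_size(initial_tb, months, doubling_months=18):
    """Project a data collection's size under a fixed doubling period.

    Assumes idealized exponential growth: size doubles every
    `doubling_months` months.
    """
    return initial_tb * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # A hypothetical 10 TB collection, projected over three years.
    for months in (0, 18, 36):
        print(months, "months:", projected_size(10, months), "TB")
```

Under this model a collection quadruples in three years, which is why algorithms whose cost grows faster than the data itself quickly become impractical.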
  16. Challenges contd.
      • The curse of dimensionality affects many algorithms (for example, finding nearest neighbors in high dimensions)
      • Data quality
          • Real-world data is messy and dirty (missing and out-of-date values, typographical errors, different coding/formats, etc.)
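
One common way to see the curse of dimensionality in nearest-neighbor search is distance concentration: as the number of attributes grows, the nearest and farthest points end up almost equally far from a query. The sketch below is a small experiment under assumed conditions (points drawn uniformly at random from the unit hypercube, Euclidean distance); the function name and parameters are hypothetical.

```python
import math
import random


def nearest_to_farthest_ratio(dimensions, n_points=200, seed=0):
    """Return (nearest distance) / (farthest distance) from a random query.

    Points are drawn uniformly from the unit hypercube. A ratio close to 1
    means the nearest neighbor is barely closer than the farthest point.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


if __name__ == "__main__":
    for d in (2, 20, 200):
        print(f"{d:>4} dimensions: ratio = {nearest_to_farthest_ratio(d):.3f}")
```

The ratio rises toward 1 as the dimension increases, so "nearest" carries less and less information, which is the difficulty the slide refers to.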
  17. Why mine data?
      • Data is being recorded
      • Recorded data is being warehoused
      • Computing power is affordable
      • Competitive pressure is strong
      • Commercial DM products are available
      • It provides support for business decisions
  18. Value to business
      • Market segmentation - Identify the common characteristics of customers who buy the same products from your company.
      • Customer churn - Predict which customers are likely to leave your company and go to a competitor.
      • Fraud detection - Identify which transactions are most likely to be fraudulent.
  19. Value to business
      • Interactive marketing - Predict what each individual accessing a Web site is most likely interested in seeing.
      • Market basket analysis - Understand what products or services are commonly purchased together; e.g., beer and diapers.
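
Market basket analysis is often summarized with two measures: support (how often two items appear together) and confidence (how often the second item appears given the first). The sketch below computes both on a handful of made-up baskets; the data and function names are assumptions for illustration, and real tools use more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]


def pair_support(baskets):
    """Return the fraction of baskets containing each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always counted under the same key.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


if __name__ == "__main__":
    support = pair_support(BASKETS)
    print("support(beer, diapers) =", support[("beer", "diapers")])
    print("confidence(beer -> diapers) =", confidence(BASKETS, "beer", "diapers"))
```

Here beer and diapers share 2 of 5 baskets (support 0.4), and 2 of the 3 baskets containing beer also contain diapers (confidence about 0.67).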
  20. Value to business
      • Trend analysis - Reveal the difference between a typical customer this month and last.
      • Data mining can also effectively deal with missing, inconsistent, and noisy data.
      • Direct marketing - Identify which prospects should be included in a mailing list to obtain the highest response rate.