DATA MINING Document Transcript

  • 1. DATA MINING Nigel Martin School of Computer Science and Information Systems email: tel: 020 7631 6714 Overview • What is data mining? • Data mining applications • Steps in the data mining process • Data mining tasks • Data mining techniques • A data mining product - Clementine • Web mining • The KDnuggets website has links to much useful information and other websites related to data mining, web mining and knowledge discovery.
  • 2. What is Data Mining? • The discovery of previously unknown patterns in large databases • Isn’t that just OLAP operations on data warehouses? Not really. Data warehouses support • information processing (querying and reporting) and • analytical processing (using OLAP operations on summary data) but data mining supports • knowledge discovery
  • 3. Data Mining Applications • These range from traditional business applications such as • Fraud detection • Market basket analysis • Credit worthiness assessment to new business applications • Web mining • E-commerce as well as scientific applications • Analyzing X-ray images • Modelling gene behaviour
  • 4. Steps in the Data Mining Process • Data mining requires a sequence of steps.
    1. Understanding the application domain, relevant prior knowledge and goals of the end user
    2. Creating a target data set
    3. Data cleaning and preprocessing
    4. Data reduction and transformation
    5. Choosing the data mining task
    6. Choosing the data mining algorithm(s)
    7. Data mining
    8. Evaluating the output of Step 7
    9. Consolidating discovered knowledge
    • Note that mining is only one step in the overall process, and often not the most difficult step.
    • Since the overall process and an individual step within it are both referred to as data mining, some use the term knowledge discovery to refer to the overall process to avoid confusion.
  • 5. Data Mining Tasks These include • Prediction. Given a model and a new sample, predict the value of a specific data item of that sample. For example, predict the number of purchases a new customer will make in the first year. • Classification. Given a set of pre-defined classes, determine which class a new sample belongs to. For example, classify the credit-worthiness of a new customer. • Clustering. For a set of samples, partition the set such that samples with similar characteristics are in the same partition. For example, partition customers into groups each of which will be suitable for a different direct marketing campaign. • Link Analysis (Associations). Given a set of samples, find relationships between data items of those samples such that if one data item has some value, then another data item is likely to have some other value. For example, find which sets of items a customer is likely to buy in a single transaction.
  • 6. Data Mining Techniques • We look at three different techniques as representatives of the many techniques which exist: • Association Rule Mining • Decision Trees • Neural Nets • The first is a commonly used link analysis technique, while the last two are commonly used classification techniques.
  • 7. Association Rule Mining • Association rule mining searches for interesting relationships among data items in a set of samples. • The classic application is market basket analysis: find items that are frequently purchased together by a customer. • Not all rules are interesting: "99% of the people taking the exam 'Internet Technology' are students registered at Birkbeck College." • Interesting rules are identified by checking that the rule has some minimum confidence and support.
  • 8. • For example, consider transactions consisting of the purchase of a number of items. Suppose a rule (A -> B) has been identified that customers who buy item A also often buy item B.
    • Confidence for the rule (A -> B) is defined as:
        confidence(A -> B) = (number of transactions containing A and B) / (number of transactions containing A)
    • Support for the rule (A -> B) is defined as:
        support(A -> B) = (number of transactions containing A and B) / (total number of transactions)
    • Association rules that satisfy both user-specified minimum confidence and support thresholds are sometimes referred to as strong association rules.
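The two definitions above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `support` and `confidence` are illustrative, and the transaction set is the same four transactions listed on the next slide.

```python
# Four transactions, each a set of purchased items.
TRANSACTIONS = {
    "T1": {"I1", "I3", "I4"},
    "T2": {"I2", "I3", "I5"},
    "T3": {"I1", "I2", "I3", "I5"},
    "T4": {"I2", "I5"},
}


def count_containing(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for basket in TRANSACTIONS.values() if items <= basket)


def support(a, b):
    """Support of rule A -> B: share of all transactions containing A and B."""
    return count_containing(a | b) / len(TRANSACTIONS)


def confidence(a, b):
    """Confidence of rule A -> B: share of transactions with A that also contain B."""
    return count_containing(a | b) / count_containing(a)


print(support({"I2"}, {"I5"}))     # 3 of 4 transactions contain both I2 and I5
print(confidence({"I2"}, {"I5"}))  # every transaction with I2 also contains I5
```

Note that support is symmetric in A and B, while confidence is not: the rule (I3 -> I5) has lower confidence than (I2 -> I5) because I3 also appears in a transaction without I5.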
  • 9. • Algorithms exist to find all sets of items (itemsets) that have minimum support. For example, the apriori algorithm works as follows.
    1. From the records of transactions in the database, generate all possible itemsets containing exactly one item.
    2. Eliminate itemsets which do not have the necessary support.
    3. Using the remaining itemsets, generate all possible itemsets which contain one more item than an existing itemset.
    4. Repeat from Step 2.
    • Apply the algorithm to the following transaction data, looking for frequent itemsets with support > 50%.
        trans_id   list_of_items
        T1         I1,I3,I4
        T2         I2,I3,I5
        T3         I1,I2,I3,I5
        T4         I2,I5
    • The basic algorithm above can be optimised to give improved speed of operation.
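The four steps above can be sketched in Python as follows. This is an unoptimised illustration under stated assumptions rather than a reference implementation: the function name `apriori` and its parameters are hypothetical, and "necessary support" is taken to mean support strictly greater than the threshold, matching the "support > 50%" exercise.

```python
def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support > min_support."""
    baskets = [frozenset(t) for t in transactions]
    total = len(baskets)

    def support(itemset):
        return sum(1 for basket in baskets if itemset <= basket) / total

    # Step 1: every itemset containing exactly one item.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    while candidates:
        # Step 2: eliminate candidates without the necessary support.
        survivors = {}
        for candidate in candidates:
            value = support(candidate)
            if value > min_support:
                survivors[candidate] = value
        frequent.update(survivors)
        # Step 3: extend each surviving itemset by one item drawn from
        # the items that still appear in some surviving itemset.
        items = set()
        for itemset in survivors:
            items |= itemset
        candidates = {s | {i} for s in survivors for i in items if i not in s}
        # Step 4: the loop repeats from Step 2 with the larger candidates.
    return frequent


data = [
    {"I1", "I3", "I4"},
    {"I2", "I3", "I5"},
    {"I1", "I2", "I3", "I5"},
    {"I2", "I5"},
]
for itemset, value in sorted(apriori(data, 0.5).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), value)
```

On the slide's data this finds the single items I2, I3 and I5 (each in three of four transactions) and the pair {I2, I5}. The pairs {I2, I3} and {I3, I5} have support of exactly 50% and so fail the strict threshold.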
  • 10. Decision Trees • A decision tree is a tree structure where each internal node represents a test on a data item, each branch represents an outcome of the test, and each leaf represents a class. • In order to classify a sample, the tree is followed from the root with each test being applied to the data items of the sample and the appropriate branch being taken. The leaf node eventually reached classifies the sample.
  • 11. • Algorithms exist to construct decision trees, to prune branches which represent noise or outliers in the training data, and to make other enhancements that improve scalability. • The knowledge inherent in a decision tree can be extracted and represented in the form of classification IF..THEN rules. One rule is created for each path from the root to a leaf node. • The reasons for a classification made by a decision tree are easily interpretable by humans.
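To make the tree-following and rule-extraction ideas concrete, here is a small Python sketch. Everything in it is invented for illustration: the tree, the attribute names `income` and `has_defaulted`, and the class labels are hypothetical and do not come from the slides. An internal node is a pair (attribute tested, branches by outcome) and a leaf is a class name.

```python
# Hypothetical credit-worthiness tree (illustrative only).
TREE = ("income", {
    "high": "good",
    "low": ("has_defaulted", {"yes": "poor", "no": "fair"}),
})


def classify(node, sample):
    """Follow the tree from the root, taking the branch chosen by each test."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[sample[attribute]]
    return node  # the leaf reached is the class


def extract_rules(node, conditions=()):
    """Produce one IF..THEN rule per path from the root to a leaf."""
    if isinstance(node, str):
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN class = {node}"]
    attribute, branches = node
    rules = []
    for outcome, child in branches.items():
        rules += extract_rules(child, conditions + (f"{attribute} = {outcome}",))
    return rules


print(classify(TREE, {"income": "low", "has_defaulted": "no"}))
for rule in extract_rules(TREE):
    print(rule)
```

The tree has three leaves, so three rules are produced. Reading a rule aloud gives the reason for a classification, which is the interpretability point made above.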
  • 12. Neural Nets • Inspired by the way the brain operates, but not intended to be biologically realistic in detail. • They are organised in layers as shown below. • Data is input to the first layer, and values are then propagated to each subsequent layer until a final output layer delivers results. • Values are transformed by weights associated with the links. • Weights are initially random, but new weights are "learned" by experience.
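The propagation of values through layers can be sketched in a few lines of Python. This is an illustration under assumptions the slides do not state: each unit applies a sigmoid to the weighted sum of its inputs, there are no bias terms, and the layer sizes (3 inputs, 4 hidden units, 2 outputs) are arbitrary.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_layer(n_inputs, n_outputs, rng):
    """One weight per link: a row of input weights for each unit in the layer."""
    return [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_outputs)]


def forward(layers, inputs):
    """Propagate the input values through each layer in turn."""
    values = inputs
    for weights in layers:
        values = [sigmoid(sum(w * v for w, v in zip(row, values))) for row in weights]
    return values


rng = random.Random(0)  # weights start out random, before any learning
network = [random_layer(3, 4, rng), random_layer(4, 2, rng)]
outputs = forward(network, [0.5, -1.0, 0.25])
print(outputs)
```

With untrained random weights the outputs are meaningless; learning consists of adjusting the numbers in `network` so that the outputs become useful.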
  • 13. Classes of Learning for Neural Nets • Supervised A "teach" input is provided which tells the net the output required for a given input. Weights are adjusted so as to minimise the difference between the desired and actual outputs for each input pattern. A backpropagation algorithm modifies weights in the "backwards" direction from the output layer to the first hidden layer. • Reinforced The net receives a global reward/penalty signal. Weights are changed in order to develop an input/output mapping which maximises the probability of receiving a reward and minimises that of receiving a penalty. • Unsupervised The net is able to discover statistical regularities in its input space and automatically develops different modes of behaviour to represent different classes of input. A Kohonen network (self-organising map) is an example of this technique.
  • 14. Characteristics of Neural Nets • Neural nets are typically used for problems where: • Rules for solving the problem are difficult to formalise. • The desired input and output sets are known. • Data is noisy. • Speed is needed. • Disadvantages of neural nets are: • There are no clear rules or guidance for a given application. • No general way of assessing the internal operation of the network. • Training may be difficult or impossible. • Difficult to predict future network performance.
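The supervised case described above can be illustrated with the simplest possible network: a single sigmoid unit trained by gradient descent on squared error (the delta rule). This is a deliberate simplification and not full backpropagation, which applies the same kind of update layer by layer from the output backwards. The task (logical OR), learning rate and epoch count are arbitrary choices for the sketch.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, epochs=5000, rate=0.5, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in range(3)]  # two inputs plus a bias

    def predict(x):
        return sigmoid(weights[0] * x[0] + weights[1] * x[1] + weights[2])

    def total_error():
        return sum((target - predict(x)) ** 2 for x, target in samples)

    error_before = total_error()
    for _ in range(epochs):
        for x, target in samples:
            out = predict(x)
            # The "teach" input is `target`; move each weight so as to
            # reduce the difference between desired and actual output.
            delta = (target - out) * out * (1 - out)
            weights[0] += rate * delta * x[0]
            weights[1] += rate * delta * x[1]
            weights[2] += rate * delta
    return predict, error_before, total_error()


OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
predict, error_before, error_after = train(OR_SAMPLES)
print(error_before, error_after)
print([round(predict(x)) for x, _ in OR_SAMPLES])
```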
  • 15. A Data Mining Product - Clementine • Clementine is a data mining tool from SPSS. • Using it, you can: • Obtain data from a variety of sources • Select and transform data • Visualise the data using a variety of plots and graphs • Model the data with data mining methods including • Neural Nets • Decision Trees • Kohonen Nets • Association Rules • Statistical Models • Clustering Models • Output the results in a variety of forms.
  • 16. Web Mining • A data mining application of particular interest in an e-commerce context is web mining. • This encompasses • Web Content Mining - automated discovery of Web-based information. • Web Usage Mining - automated discovery of user access patterns. • Particular problems arise from the nature of the Web when trying to mine Web server logs, such as: • How can users be identified? • How can sessions or transactions be identified? • One approach to dealing with such problems is to log at the application server rather than Web server level.
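When only Web server logs are available, one common heuristic (an assumption here, not something the slides prescribe) is to treat each distinct combination of IP address and user agent as a user, and to start a new session after a period of inactivity. The sketch below uses a 30-minute timeout; the log entries, the `sessionise` function and the timeout value are all hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def sessionise(entries, timeout=SESSION_TIMEOUT):
    """Group (user_key, timestamp, url) log entries into per-user sessions."""
    sessions = []   # list of (user_key, [urls in visit order])
    last_seen = {}  # user_key -> (last timestamp, index of current session)
    for user_key, stamp, url in sorted(entries, key=lambda entry: entry[1]):
        previous = last_seen.get(user_key)
        if previous is None or stamp - previous[0] > timeout:
            sessions.append((user_key, [url]))  # first visit or long gap
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index][1].append(url)
        last_seen[user_key] = (stamp, index)
    return sessions


# Hypothetical log entries; the user key is (IP address, user agent).
LOG = [
    (("10.0.0.1", "Mozilla"), datetime(2024, 1, 1, 9, 0), "/home"),
    (("10.0.0.1", "Mozilla"), datetime(2024, 1, 1, 9, 5), "/products"),
    (("10.0.0.2", "Opera"), datetime(2024, 1, 1, 9, 6), "/home"),
    (("10.0.0.1", "Mozilla"), datetime(2024, 1, 1, 10, 30), "/home"),
]
for user_key, urls in sessionise(LOG):
    print(user_key, urls)
```

The heuristic is fragile: users behind a shared proxy collapse into one key, and one user on two devices splits into two. That fragility is the motivation for logging at the application server, where a login or session identifier is available.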