Teaching an Introductory Course in Data Mining
Presentation Transcript

  • Teaching an Introductory Course in Data Mining
    • Richard J. Roiger
    • Computer and Information Sciences Dept.
    • Minnesota State University, Mankato USA
    • Email: [email_address]
    • Web site: krypton.mnsu.edu/~roiger
  • Teaching an Introductory Course in Data Mining
    • Designed for university instructors teaching in information science or computer science departments who wish to introduce a data mining course or unit into their curriculum.
    • Appropriate for anyone interested in a detailed overview of data mining as a problem-solving tool.
    • Will emphasize material found in the text “Data Mining: A Tutorial-Based Primer,” published by Addison-Wesley in 2003.
    • Additional materials covering the most recent trends in data mining will also be presented.
    • Participants will have the opportunity to experience the data mining process.
    • Each participant will receive a complimentary copy of the aforementioned text together with a CD containing PowerPoint slides and a student version of IDA.
  • Questions to Answer
    • What constitutes data mining?
    • Where does data mining fit in a CS or IS curriculum?
    • Can I use data mining to solve my problem?
    • How do I use data mining to solve my problem?
  • What Constitutes Data Mining?
    • Finding interesting patterns in data
    • Model building
    • Inductive learning
    • Generalization
  • What Constitutes Data Mining?
    • Business applications
      • beer and diapers
      • valid vs. invalid credit purchases
      • churn analysis
    • Web applications
      • crawler vs. human being
      • user browsing habits
  • What Constitutes Data Mining?
    • Medical applications
      • microarray data mining
      • disease diagnosis
    • Scientific applications
      • earthquake detection
      • gamma-ray bursts
  • Where does data mining fit in a CS or IS curriculum? Intelligent Systems: Computer Science, minimum 1, maximum 5; Information Systems, minimum 1, maximum 1.
  • Where does data mining fit in a CS or IS curriculum? Decision Theory: Computer Science, minimum 0, maximum 0; Information Systems, minimum 3, maximum 3.
  • Can I use data mining to solve my problem?
    • Do I have access to the data?
    • Is the data easily obtainable?
    • Do I have access to the right attributes?
  • How do I use data mining to solve my problem?
    • What strategies should I apply?
    • What data mining techniques should I use?
    • How do I evaluate results?
    • How do I apply what has been learned?
    • Have I adhered to all data privacy issues?
  • Data Mining: A First View Chapter 1
  • Data Mining The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
  • Knowledge Discovery in Databases (KDD) The application of the scientific method to data mining. Data mining is one step of the KDD process.
  • Computers & Learning Computers are good at learning concepts. Concepts are the output of a data mining session.
  • Supervised Learning
    • Build a learner model using data instances of known origin.
    • Use the model to determine the outcome of new instances of unknown origin.
  • Supervised Learning: A Decision Tree Example
  • Decision Tree A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
  • Production Rules IF Swollen Glands = Yes THEN Diagnosis = Strep Throat IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
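    A minimal sketch in Python (not from the text) showing how the three production rules above can be encoded directly as a function; attribute names and the Yes/No encoding are taken from the slide.

        def diagnose(swollen_glands, fever):
            """Apply the three production rules derived from the decision tree."""
            if swollen_glands == "Yes":
                return "Strep Throat"
            if fever == "Yes":          # swollen glands = No
                return "Cold"
            return "Allergy"            # swollen glands = No and fever = No

        print(diagnose("No", "Yes"))    # -> Cold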
  • Unsupervised Clustering A data mining method that builds models from data without predefined classes.
  • The Acme Investors Dataset
  • The Acme Investors Dataset & Supervised Learning
    • Can I develop a general profile of an online investor?
    • Can I determine if a new customer is likely to open a margin account?
    • Can I build a model to predict the average number of trades per month for a new investor?
    • What characteristics differentiate female and male investors?
  • The Acme Investors Dataset & Unsupervised Clustering
    • What attribute similarities group customers of Acme Investors together?
    • What differences in attribute values segment the customer database?
  • 1.3 Is Data Mining Appropriate for My Problem?
  • Data Mining or Data Query?
    • Shallow Knowledge
    • Multidimensional Knowledge
    • Hidden Knowledge
    • Deep Knowledge
  • Shallow Knowledge Shallow knowledge is factual. It can be easily stored and manipulated in a database.
  • Multidimensional Knowledge Multidimensional knowledge is also factual. On-line Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.
  • Hidden Knowledge Hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease.
  • Data Mining vs. Data Query: An Example
    • Use data query if you already almost know what you are looking for.
    • Use data mining to find regularities in data that are not obvious.
  • 1.4 Expert Systems or Data Mining?
  • Expert System A computer program that emulates the problem-solving skills of one or more human experts.
  • Knowledge Engineer A person trained to interact with an expert in order to capture their knowledge.
  • 1.5 A Simple Data Mining Process Model
  • 1.6 Why Not Simple Search?
    • Nearest Neighbor Classifier
    • K-nearest Neighbor Classifier
  • Nearest Neighbor Classifier Classification is performed by searching the training data for the instance closest in distance to the unknown instance.
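    A minimal nearest neighbor sketch in Python, assuming numeric attribute vectors and Euclidean distance; the training data shown is made up for illustration.

        import math

        def nearest_neighbor(unknown, training):
            """Return the class of the training instance closest to the unknown instance."""
            def distance(a, b):
                return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            best = min(training, key=lambda item: distance(unknown, item[0]))
            return best[1]

        # (attribute vector, class) pairs -- hypothetical values
        training = [((25, 40000), "No"), ((42, 90000), "Yes"), ((38, 65000), "Yes")]
        print(nearest_neighbor((30, 50000), training))   # -> "No"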
  • Customer Intrinsic Value
  • Data Mining: A Closer Look Chapter 2
  • 2.1 Data Mining Strategies
  • Data Mining Strategies: Classification
    • Learning is supervised.
    • The dependent variable is categorical.
    • Well-defined classes.
    • Current rather than future behavior.
  • Data Mining Strategies: Estimation
    • Learning is supervised.
    • The dependent variable is numeric.
    • Well-defined classes.
    • Current rather than future behavior.
  • Data Mining Strategies: Prediction
    • The emphasis is on predicting future rather than current outcomes.
    • The output attribute may be categorical or numeric.
  • Classification, Estimation or Prediction?
    • The nature of the data determines whether a model is suitable for classification, estimation, or prediction.
  • The Cardiology Patient Dataset This dataset contains 303 instances. Each instance holds information about a patient who either has or does not have a heart condition.
  • The Cardiology Patient Dataset
    • 138 instances represent patients with heart disease.
    • 165 instances contain information about patients free of heart disease.
  • Classification, Estimation or Prediction? The next two slides each contain a rule generated from this dataset. Are either of these rules predictive?
  • A Healthy Class Rule for the Cardiology Patient Dataset IF 169 <= Maximum Heart Rate <=202 THEN Concept Class = Healthy Rule accuracy: 85.07% Rule coverage: 34.55%
  • A Sick Class Rule for the Cardiology Patient Dataset IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick Rule accuracy: 91.14% Rule coverage: 52.17%
  • Data Mining Strategies: Unsupervised Clustering
  • Unsupervised Clustering can be used to:
    • determine if relationships can be found in the data.
    • evaluate the likely performance of a supervised model.
    • find a best set of input attributes for supervised learning.
    • detect outliers.
  • Data Mining Strategies: Market Basket Analysis
    • Find interesting relationships among retail products.
    • Uses association rule algorithms.
  • 2.2 Supervised Data Mining Techniques
  • The Credit Card Promotion Database
  • A Hypothesis for the Credit Card Promotion Database A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.
  • Supervised Data Mining Techniques: Production Rules
  • A Production Rule for the Credit Card Promotion Database IF Sex = Female & 19 <=Age <= 43 THEN Life Insurance Promotion = Yes Rule Accuracy: 100.00% Rule Coverage: 66.67%
  • Production Rule Accuracy & Coverage
    • Rule accuracy is a between-class measure.
    • Rule coverage is a within-class measure.
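    A small Python sketch (illustrative only) of the two measures above: accuracy is the fraction of instances covered by the rule's antecedent that also satisfy its consequent (between-class), and coverage is the fraction of the consequent class captured by the rule (within-class). The toy data below is hypothetical but reproduces the 100% / 66.67% pattern of the rule on the earlier slide.

        def rule_accuracy_and_coverage(instances, antecedent, consequent):
            """antecedent/consequent are predicates over an instance (dict)."""
            covered = [i for i in instances if antecedent(i)]
            class_members = [i for i in instances if consequent(i)]
            correct = [i for i in covered if consequent(i)]
            accuracy = len(correct) / len(covered) if covered else 0.0
            coverage = len(correct) / len(class_members) if class_members else 0.0
            return accuracy, coverage

        data = [
            {"sex": "Female", "age": 30, "life_ins": "Yes"},
            {"sex": "Female", "age": 45, "life_ins": "Yes"},
            {"sex": "Male",   "age": 35, "life_ins": "No"},
            {"sex": "Female", "age": 25, "life_ins": "Yes"},
        ]
        acc, cov = rule_accuracy_and_coverage(
            data,
            antecedent=lambda i: i["sex"] == "Female" and 19 <= i["age"] <= 43,
            consequent=lambda i: i["life_ins"] == "Yes",
        )
        print(acc, cov)   # 1.0, 0.666...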
  • Supervised Data Mining Techniques: Neural Networks
  • Supervised Data Mining Techniques: Statistical Regression Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727
  • 2.3 Association Rules
  • Comparing Association Rules & Production Rules
    • Association rules can have one or several output attributes. Production rules are limited to one output attribute.
    • With association rules, an output attribute for one rule can be an input attribute for another rule.
  • Two Association Rules for the Credit Card Promotion Database IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes IF Sex = Female & Age = over40 THEN Credit Card Insurance = No & Life Insurance Promotion = Yes
  • 2.4 Clustering Techniques
  • 2.5 Evaluating Performance
  • Evaluating Supervised Learner Models
  • Confusion Matrix
    • A matrix used to summarize the results of a supervised classification.
    • Entries along the main diagonal are correct classifications.
    • Entries other than those on the main diagonal are classification errors.
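    A minimal Python sketch of building a confusion matrix from actual and predicted class labels; the label lists below are hypothetical.

        from collections import Counter

        def confusion_matrix(actual, predicted, classes):
            """counts[(a, p)] = number of instances of class a predicted as class p."""
            counts = Counter(zip(actual, predicted))
            return [[counts[(a, p)] for p in classes] for a in classes]

        actual    = ["Sick", "Sick", "Healthy", "Healthy", "Sick", "Healthy"]
        predicted = ["Sick", "Healthy", "Healthy", "Healthy", "Sick", "Sick"]
        matrix = confusion_matrix(actual, predicted, classes=["Sick", "Healthy"])
        print(matrix)                     # [[2, 1], [1, 2]]
        accuracy = sum(matrix[i][i] for i in range(2)) / len(actual)
        print(accuracy)                   # 0.666... (main diagonal / total)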
  • Two-Class Error Analysis
  • Evaluating Numeric Output
    • Mean absolute error
    • Mean squared error
    • Root mean squared error
  • Mean Absolute Error The average absolute difference between classifier predicted output and actual output.
  • Mean Squared Error The average of the sum of squared differences between classifier predicted output and actual output.
  • Root Mean Squared Error The square root of the mean squared error.
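    A minimal Python sketch of the three numeric error measures defined above; the example values are made up.

        import math

        def mean_absolute_error(actual, predicted):
            return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

        def mean_squared_error(actual, predicted):
            return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

        def root_mean_squared_error(actual, predicted):
            return math.sqrt(mean_squared_error(actual, predicted))

        actual, predicted = [2.0, 4.0, 6.0], [2.5, 3.0, 6.5]
        print(mean_absolute_error(actual, predicted))       # 0.666...
        print(mean_squared_error(actual, predicted))        # 0.5
        print(root_mean_squared_error(actual, predicted))   # 0.707...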
  • Comparing Models by Measuring Lift
  • Computing Lift
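    The lift calculation itself appears in the slide's figure. As a hedged reminder (an assumption, not a quotation from the text), lift is commonly computed as the ratio of the class proportion within a targeted sample to the class proportion in the whole population. A minimal Python sketch under that assumption, with made-up mailing numbers:

        def lift(positives_in_sample, sample_size, positives_in_population, population_size):
            """Lift = P(class | sample) / P(class | population)."""
            p_sample = positives_in_sample / sample_size
            p_population = positives_in_population / population_size
            return p_sample / p_population

        # Hypothetical mailing example: 540 responders in a targeted 20,000-piece mailing,
        # versus 1,000 responders expected from the full 100,000-customer population.
        print(lift(540, 20000, 1000, 100000))   # 2.7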
  • Unsupervised Model Evaluation
  • Unsupervised Model Evaluation (cluster quality)
    • All clustering techniques compute some measure of cluster quality.
    • One evaluation method is to calculate the sum of squared error differences between the instances of each cluster and their cluster center.
    • Smaller values indicate clusters of higher quality.
  • Supervised Learning for Unsupervised Model Evaluation
    • Designate each formed cluster as a class and assign each class an arbitrary name.
    • Choose a random sample of instances from each class for supervised learning.
    • Build a supervised model from the chosen instances. Employ the remaining instances to test the correctness of the model.
  • Basic Data Mining Techniques Chapter 3
  • 3.1 Decision Trees
  • An Algorithm for Building Decision Trees
    • 1. Let T be the set of training instances.
    • 2. Choose an attribute that best differentiates the instances in T.
    • 3. Create a tree node whose value is the chosen attribute. Create child links from this node where each link represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances into subclasses.
    • 4. For each subclass created in step 3: (a) If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path. (b) If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
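    A minimal Python sketch of the recursive procedure above. This is not the book's implementation: it uses a simple accuracy-based score to choose the splitting attribute, whereas a full algorithm such as C4.5 would use gain ratio and handle numeric attributes. The toy dataset reproduces the earlier strep throat example.

        from collections import Counter

        def majority_class(instances):
            return Counter(i["class"] for i in instances).most_common(1)[0][0]

        def split_score(instances, attr):
            """Fraction classified correctly if we split on attr and predict each
            branch's majority class (a simplified stand-in for gain ratio)."""
            groups = {}
            for i in instances:
                groups.setdefault(i[attr], []).append(i)
            correct = sum(Counter(g["class"] for g in group).most_common(1)[0][1]
                          for group in groups.values())
            return correct / len(instances)

        def build_tree(instances, attributes):
            classes = {i["class"] for i in instances}
            if len(classes) == 1 or not attributes:      # stopping criteria
                return majority_class(instances)
            best = max(attributes, key=lambda a: split_score(instances, a))
            node = {"attribute": best, "branches": {}}
            remaining = [a for a in attributes if a != best]
            for value in {i[best] for i in instances}:
                subset = [i for i in instances if i[best] == value]
                node["branches"][value] = build_tree(subset, remaining)
            return node

        data = [
            {"swollen_glands": "Yes", "fever": "Yes", "class": "Strep Throat"},
            {"swollen_glands": "Yes", "fever": "No",  "class": "Strep Throat"},
            {"swollen_glands": "No",  "fever": "Yes", "class": "Cold"},
            {"swollen_glands": "No",  "fever": "No",  "class": "Allergy"},
        ]
        print(build_tree(data, ["swollen_glands", "fever"]))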
  • Decision Trees for the Credit Card Promotion Database
  • Decision Tree Rules
  • A Rule for the Tree in Figure 3.4 IF Age <=43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No
  • A Simplified Rule Obtained by Removing Attribute Age IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No
  • Other Methods for Building Decision Trees
    • CART
    • CHAID
  • Advantages of Decision Trees
    • Easy to understand.
    • Map nicely to a set of production rules.
    • Have been applied to real problems.
    • Make no prior assumptions about the data.
    • Able to process both numerical and categorical data.
  • Disadvantages of Decision Trees
    • Output attribute must be categorical.
    • Limited to one output attribute.
    • Decision tree algorithms are unstable.
    • Trees created from numeric datasets can be complex.
  • 3.2 Generating Association Rules
  • Confidence and Support
  • Rule Confidence Given a rule of the form “If A then B”, rule confidence is the conditional probability that B is true when A is known to be true.
  • Rule Support The minimum percentage of instances in the database that contain all items listed in a given association rule.
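    A minimal Python sketch of the confidence and support computations for a rule of the form "IF A THEN B" over a set of transactions; per the definition above, support is the fraction of instances containing every item in the rule. The transactions are hypothetical.

        def support(transactions, items):
            """Fraction of transactions containing all listed items."""
            return sum(items <= t for t in transactions) / len(transactions)

        def confidence(transactions, antecedent, consequent):
            """P(consequent present | antecedent present)."""
            has_a = [t for t in transactions if antecedent <= t]
            return sum(consequent <= t for t in has_a) / len(has_a)

        transactions = [
            {"magazine", "life_insurance"},
            {"magazine", "life_insurance", "watch"},
            {"magazine"},
            {"life_insurance"},
            {"magazine", "life_insurance"},
        ]
        rule_items = {"magazine", "life_insurance"}
        print(support(transactions, rule_items))                            # 0.6
        print(confidence(transactions, {"magazine"}, {"life_insurance"}))   # 0.75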
  • Mining Association Rules: An Example
  • Two Possible Two-Item Set Rules IF Magazine Promotion =Yes THEN Life Insurance Promotion =Yes (5/7) IF Life Insurance Promotion =Yes THEN Magazine Promotion =Yes (5/5)
  • Three-Item Set Rules IF Watch Promotion =No & Life Insurance Promotion = No THEN Credit Card Insurance =No (4/4) IF Watch Promotion =No THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
  • General Considerations
    • We are interested in association rules that show a lift in product sales where the lift is the result of the product’s association with one or more other products.
    • We are also interested in association rules that show a lower than expected confidence for a particular association.
  • 3.3 The K-Means Algorithm
    • Choose a value for K , the total number of clusters.
    • Randomly choose K points as cluster centers.
    • Assign the remaining instances to their closest cluster center.
    • Calculate a new cluster center for each cluster.
    • Repeat steps 3 and 4 until the cluster centers do not change (see the sketch below).
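    A minimal K-Means sketch in Python following the steps above, for one-dimensional real-valued data; cluster centers are recomputed as means until the assignments stabilize. The input points are made up.

        import random

        def k_means(points, k, seed=0):
            random.seed(seed)
            centers = random.sample(points, k)            # step 2: random initial centers
            while True:
                clusters = [[] for _ in range(k)]
                for p in points:                          # step 3: assign to closest center
                    nearest = min(range(k), key=lambda c: abs(p - centers[c]))
                    clusters[nearest].append(p)
                new_centers = [sum(c) / len(c) if c else centers[i]   # step 4: recompute
                               for i, c in enumerate(clusters)]
                if new_centers == centers:                # stop when centers are stable
                    return centers, clusters
                centers = new_centers

        points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
        print(k_means(points, k=2))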
  • An Example Using K-Means
  • General Considerations
    • Requires real-valued data.
    • We must select the number of clusters present in the data.
    • Works best when the clusters in the data are of approximately equal size.
    • Attribute significance cannot be determined.
    • Lacks explanation capabilities.
  • 3.4 Genetic Learning
  • Genetic Learning Operators
    • Crossover
    • Mutation
    • Selection
  • Genetic Algorithms and Supervised Learning
  • Genetic Algorithms and Unsupervised Clustering
  • General Considerations
    • Global optimization is not a guarantee.
    • The fitness function determines the complexity of the algorithm.
    • Can explain their results, provided the fitness function is understandable.
    • Transforming the data to a form suitable for genetic learning can be a challenge.
  • 3.5 Choosing a Data Mining Technique
  • Initial Considerations
    • Is learning supervised or unsupervised?
    • Is explanation required?
    • What is the interaction between input and output attributes?
    • What are the data types of the input and output attributes?
  • Further Considerations
    • Do We Know the Distribution of the Data?
    • Do We Know Which Attributes Best Define the Data?
    • Does the Data Contain Missing Values?
    • Is Time an Issue?
    • Which Technique Is Most Likely to Give a Best Test Set Accuracy?
  • An Excel-based Data Mining Tool Chapter 4
  • 4.2 ESX: A Multipurpose Tool for Data Mining
  • Knowledge Discovery in Databases Chapter 5
  • 5.1 A KDD Process Model
  • Step 1: Goal Identification
    • Define the Problem.
    • Choose a Data Mining Tool.
    • Estimate Project Cost.
    • Estimate Project Completion Time.
    • Address Legal Issues.
    • Develop a Maintenance Plan.
  • Step 2: Creating a Target Dataset
  • Step 3: Data Preprocessing
    • Noisy Data
    • Missing Data
  • Noisy Data
    • Locate Duplicate Records.
    • Locate Incorrect Attribute Values.
    • Smooth Data.
  • Preprocessing Missing Data
    • Discard Records With Missing Values.
    • Replace Missing Real-valued Items With the Class Mean.
    • Replace Missing Values With Values Found Within Highly Similar Instances.
  • Processing Missing Data While Learning
    • Ignore Missing Values.
    • Treat Missing Values As Equal Compares.
    • Treat Missing values As Unequal Compares.
  • Step 4: Data Transformation
    • Data Normalization
    • Data Type Conversion
    • Attribute and Instance Selection
  • Data Normalization
    • Decimal Scaling
    • Min-Max Normalization
    • Normalization using Z-scores
    • Logarithmic Normalization
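    A minimal Python sketch of the four normalization methods listed above, using the usual definitions; here the decimal-scaling divisor is 10 raised to the number of digits in the value of largest magnitude, and logarithmic normalization assumes positive values.

        import math

        def decimal_scaling(values):
            j = len(str(int(max(abs(v) for v in values))))   # digits in the largest magnitude
            return [v / (10 ** j) for v in values]

        def min_max(values, new_min=0.0, new_max=1.0):
            lo, hi = min(values), max(values)
            return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

        def z_score(values):
            mean = sum(values) / len(values)
            std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            return [(v - mean) / std for v in values]

        def log_normalize(values):
            return [math.log(v) for v in values]             # assumes positive values

        data = [2.0, 10.0, 45.0, 120.0]
        print(decimal_scaling(data))   # [0.002, 0.01, 0.045, 0.12]
        print(min_max(data))
        print(z_score(data))
        print(log_normalize(data))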
  • Attribute and Instance Selection
    • Eliminating Attributes
    • Creating Attributes
    • Instance Selection
  • Step 5: Data Mining
    • Choose training and test data.
    • Designate a set of input attributes.
    • If learning is supervised, choose one or more output attributes.
    • Select learning parameter values.
    • Invoke the data mining tool.
  • Step 6: Interpretation and Evaluation
    • Statistical analysis.
    • Heuristic analysis.
    • Experimental analysis.
    • Human analysis.
  • Step 7: Taking Action
    • Create a report.
    • Relocate retail items.
    • Mail promotional information.
    • Detect fraud.
    • Fund new research.
  • 5.9 The Crisp-DM Process Model
    • Business understanding
    • Data understanding
    • Data preparation
    • Modeling
    • Evaluation
    • Deployment
  • The Data Warehouse Chapter 6
  • 6.1 Operational Databases
  • Data Modeling and Normalization
    • One-to-One Relationships
    • One-to-Many Relationships
    • Many-to-Many Relationships
  • Data Modeling and Normalization
    • First Normal Form
    • Second Normal Form
    • Third Normal Form
  • The Relational Model
  • 6.2 Data Warehouse Design
  • The Data Warehouse
    • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process” (W. H. Inmon).
  • Granularity
    • Granularity is a term used to describe the level of detail of stored information.
  • Entering Data into the Warehouse
    • Independent Data Mart
    • ETL (Extract, Transform, Load Routine)
    • Metadata
  • Structuring the Data Warehouse: Two Methods
    • Structure the warehouse model using the star schema
    • Structure the warehouse model as a multidimensional array
  • The Star Schema
    • Fact Table
    • Dimension Tables
    • Slowly Changing Dimensions
  • The Multidimensionality of the Star Schema
  • Additional Relational Schemas
    • Snowflake Schema
    • Constellation Schema
  • Decision Support: Analyzing the Warehouse Data
    • Reporting Data
    • Analyzing Data
    • Knowledge Discovery
  • 6.3 On-line Analytical Processing
  • OLAP Operations
    • Slice – A single dimension operation
    • Dice – A multidimensional operation
    • Roll-up – A higher level of generalization
    • Drill-down – A greater level of detail
    • Rotation – View data from a new perspective
  • Concept Hierarchy A mapping that allows attributes to be viewed from varying levels of detail.
  • Formal Evaluation Techniques Chapter 7
  • 7.1 What Should Be Evaluated?
    • Supervised Model
    • Training Data
    • Attributes
    • Model Builder
    • Parameters
    • Test Set Evaluation
  • Single-Valued Summary Statistics
    • Mean
    • Variance
    • Standard deviation
  • The Normal Distribution
  • Normal Distributions & Sample Means
    • A distribution of means taken from random sets of independent samples of equal size is distributed normally.
    • Any sample mean will vary less than two standard errors from the population mean 95% of the time.
  • A Classical Model for Hypothesis Testing
  • 7.3 Computing Test Set Confidence Intervals
  • Computing 95% Confidence Intervals
    • Given a test set sample S of size n and error rate E
    • Compute sample variance as V= E(1-E)
    • Compute the standard error (SE) as the square root of V/n.
    • Calculate an upper bound error as E + 2(SE)
    • Calculate a lower bound error as E - 2(SE)
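    A minimal Python sketch of the 95% confidence interval computation above, using SE = sqrt(E(1 - E)/n); the 10% error rate and 100 test instances are hypothetical.

        import math

        def error_confidence_interval(error_rate, n):
            variance = error_rate * (1 - error_rate)       # V = E(1 - E)
            standard_error = math.sqrt(variance / n)       # SE = sqrt(V / n)
            return error_rate - 2 * standard_error, error_rate + 2 * standard_error

        print(error_confidence_interval(0.10, 100))   # (0.04, 0.16)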
  • Cross Validation
    • Used when ample test data is not available
    • Partition the dataset into n fixed-size units. n-1 units are used for training and the nth unit is used as a test set.
    • Repeat this process until each of the fixed-size units has been used as test data.
    • Model correctness is taken as the average of all training-test trials.
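    A minimal Python sketch of the n-fold partitioning and averaging described above; train_and_test is a hypothetical stand-in for whatever model builder and scoring routine are used.

        def cross_validation(instances, n_folds, train_and_test):
            """Average test accuracy over n train/test trials."""
            fold_size = len(instances) // n_folds
            scores = []
            for f in range(n_folds):
                test = instances[f * fold_size:(f + 1) * fold_size]
                train = instances[:f * fold_size] + instances[(f + 1) * fold_size:]
                scores.append(train_and_test(train, test))
            return sum(scores) / len(scores)

        # Toy stand-in: "accuracy" is just the fraction of even numbers in the test fold
        data = list(range(20))
        print(cross_validation(data, 5,
                               lambda train, test: sum(x % 2 == 0 for x in test) / len(test)))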
  • Bootstrapping
    • Used when ample training and test data is not available.
    • Bootstrapping allows instances to appear more than once in the training data.
  • 7.4 Comparing Supervised Learner Models
  • Comparing Models with Independent Test Data, where E1 = the error rate for model M1, E2 = the error rate for model M2, q = (E1 + E2)/2, n1 = the number of instances in test set A, and n2 = the number of instances in test set B.
  • Comparing Models with a Single Test Dataset, where E1 = the error rate for model M1, E2 = the error rate for model M2, q = (E1 + E2)/2, and n = the number of test set instances.
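    The comparison statistics themselves appear in the book's figures. A common form (stated here as an assumption, not a quotation from the text) compares |E1 - E2| against sqrt(q(1 - q)(1/n1 + 1/n2)), with a ratio of about 2 or more indicating a significant difference at the 95% level. A minimal Python sketch with made-up error rates:

        import math

        def compare_error_rates(e1, n1, e2, n2):
            """Classical test statistic for the difference between two error rates
            (assumed form; see Chapter 7 of the text for the exact formulas used)."""
            q = (e1 + e2) / 2
            return abs(e1 - e2) / math.sqrt(q * (1 - q) * (1 / n1 + 1 / n2))

        # Hypothetical: M1 errs 10% on 100 instances, M2 errs 20% on 100 instances
        score = compare_error_rates(0.10, 100, 0.20, 100)
        print(score, "significant at 95%" if score >= 2 else "not significant")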
  • 7.5 Attribute Evaluation
  • Locating Redundant Attributes with Excel
    • Correlation Coefficient
    • Positive Correlation
    • Negative Correlation
    • Curvilinear Relationship
  • Creating a Scatterplot Diagram with MS Excel
  • Hypothesis Testing for Numerical Attribute Significance
  • 7.6 Unsupervised Evaluation Techniques
    • Unsupervised Clustering for Supervised Evaluation
    • Supervised Evaluation for Unsupervised Clustering
    • Additional Methods
  • 7.7 Evaluating Supervised Models with Numeric Output
  • Mean Squared Error MSE = (1/n) * Σ (ai - ci)^2, where for the ith instance, ai = actual output value and ci = computed output value.
  • Mean Absolute Error MAE = (1/n) * Σ |ai - ci|, where for the ith instance, ai = actual output value and ci = computed output value.
  • Neural Networks Chapter 8
  • 8.1 Feed-Forward Neural Networks
  • The Sigmoid Function
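    The slide's figure shows the activation function; assuming it is the standard logistic sigmoid f(x) = 1 / (1 + e^(-x)), a minimal Python sketch:

        import math

        def sigmoid(x):
            """Standard logistic function: maps any real input into (0, 1)."""
            return 1.0 / (1.0 + math.exp(-x))

        print(sigmoid(-2), sigmoid(0), sigmoid(2))   # ~0.12, 0.5, ~0.88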
  • Supervised Learning with Feed-Forward Networks
    • Backpropagation Learning
    • Genetic Learning
  • Unsupervised Clustering with Self-Organizing Maps
  • 8.3 Neural Network Explanation
    • Sensitivity Analysis
    • Average Member Technique
  • 8.4 General Considerations
    • What input attributes will be used to build the network?
    • How will the network output be represented?
    • How many hidden layers should the network contain?
    • How many nodes should there be in each hidden layer?
    • What condition will terminate network training?
  • Neural Network Strengths
    • Work well with noisy data.
    • Can process numeric and categorical data.
    • Appropriate for applications requiring a time element.
    • Have performed well in several domains.
    • Appropriate for supervised learning and unsupervised clustering.
  • Weaknesses
    • Lack explanation capabilities.
    • May not provide optimal solutions to problems.
    • Overtraining can be a problem.
  • Statistical Techniques Chapter 10
  • 10.1 Linear Regression Analysis
  • Multiple Linear Regression with Excel
  • Regression Trees
  • 10.2 Logistic Regression
  • Transforming the Linear Regression Model Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.
  • The Logistic Regression Model
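    A minimal Python sketch of the conditional probability produced by a logistic model for a single input attribute, assuming the usual form p(y = 1 | x) = 1 / (1 + e^-(ax + c)); the slope and intercept here are made up.

        import math

        def logistic_probability(x, slope, intercept):
            """Conditional probability that the output class is 1 given input x."""
            return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))

        # Hypothetical coefficients
        for x in (0.0, 1.0, 2.0):
            print(x, round(logistic_probability(x, slope=1.5, intercept=-2.0), 3))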
  • 10.3 Bayes Classifier
  • Bayes Classifier: An Example
  • The Instance to be Classified Magazine Promotion = Yes Watch Promotion = Yes Life Insurance Promotion = No Credit Card Insurance = No Sex = ?
  • Computing The Probability For Sex = Male
  • Conditional Probabilities for Sex = Male P(magazine promotion = yes | sex = male) = 4/6 P(watch promotion = yes | sex = male) = 2/6 P(life insurance promotion = no | sex = male) = 4/6 P(credit card insurance = no | sex = male) = 4/6 P(E | sex = male) = (4/6)(2/6)(4/6)(4/6) = 8/81
  • The Probability for Sex = Male Given Evidence E P(sex = male | E) ≈ 0.0593 / P(E)
  • The Probability for Sex = Female Given Evidence E P(sex = female | E) ≈ 0.0281 / P(E)
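    A minimal Python sketch of the naive Bayes style calculation above: multiply the conditional probabilities of the evidence given a class by the class prior, then compare classes (the shared 1/P(E) factor cancels). The conditional probabilities come from the slide; the 6/10 prior for male is an assumption that reproduces the slide's 0.0593 value.

        def naive_bayes_score(conditionals, prior):
            """Unnormalized P(class | E): product of P(evidence_i | class) times P(class)."""
            score = prior
            for p in conditionals:
                score *= p
            return score

        # Conditional probabilities for sex = male taken from the slide;
        # the 6/10 prior is an assumption consistent with the 0.0593 figure.
        male_conditionals = [4/6, 2/6, 4/6, 4/6]
        print(round(naive_bayes_score(male_conditionals, prior=6/10), 4))   # 0.0593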
  • Zero-Valued Attribute Counts
  • Missing Data With Bayes classifier missing data items are ignored.
  • Numeric Data For a numeric attribute, the conditional probability is estimated with the normal probability density function f(x) = (1 / (sqrt(2 * pi) * σ)) * e^(-(x - μ)^2 / (2 * σ^2)), where e = the exponential function, μ = the class mean for the given numerical attribute, σ = the class standard deviation for the attribute, and x = the attribute value.
  • 10.4 Clustering Algorithms
  • Agglomerative Clustering
    • 1. Place each instance into a separate partition.
    • 2. Until all instances are part of a single cluster:
    • a. Determine the two most similar clusters.
    • b. Merge the clusters chosen into a single cluster.
    • 3. Choose a clustering formed by one of the step 2 iterations as a final result.
  • Conceptual Clustering
    • 1. Create a cluster with the first instance as its only member.
    • 2. For each remaining instance, take one of two actions at each tree level:
    • a. Place the new instance into an existing cluster.
    • b. Create a new concept cluster having the new instance as its only member.
  • Expectation Maximization
    • The EM (expectation-maximization) algorithm is a statistical technique that makes use of the finite Gaussian mixtures model.
  • Expectation Maximization
    • A mixture is a set of n probability distributions where each distribution represents a cluster.
    • The mixtures model assigns each data instance a probability that it would have a certain set of attribute values given it was a member of a specified cluster.
  • Expectation Maximization
    • The EM algorithm is similar to the K-Means procedure in that a set of parameters is recomputed until a desired convergence criterion is achieved.
    • In the simplest case, there are two clusters, a single real-valued attribute, and the probability distributions are normal.
  • EM Algorithm ( two-class, one attribute scenario)
    • Guess initial values for the five parameters.
    • Until a termination criterion is achieved:
    • a. Use the probability density function for normal distributions to compute the cluster probability for each instance.
    • b. Use the probability scores assigned to each instance in step 2(a) to re-estimate the parameters.
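    A minimal EM sketch in Python for the two-cluster, single real-valued attribute case described above: guess the five parameters (two means, two standard deviations, one mixing probability), then alternate between computing cluster probabilities for each instance and re-estimating the parameters. The data and initial guesses are made up.

        import math

        def normal_pdf(x, mean, std):
            return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

        def em_two_clusters(data, iterations=50):
            # Step 1: guess initial values for the five parameters
            mean_a, mean_b = min(data), max(data)
            std_a = std_b = (max(data) - min(data)) / 4 or 1.0
            p_a = 0.5
            for _ in range(iterations):
                # Expectation: probability that each instance belongs to cluster A
                resp = []
                for x in data:
                    wa = p_a * normal_pdf(x, mean_a, std_a)
                    wb = (1 - p_a) * normal_pdf(x, mean_b, std_b)
                    resp.append(wa / (wa + wb))
                # Maximization: re-estimate the parameters from the probability scores
                na, nb = sum(resp), len(data) - sum(resp)
                mean_a = sum(r * x for r, x in zip(resp, data)) / na
                mean_b = sum((1 - r) * x for r, x in zip(resp, data)) / nb
                std_a = max(math.sqrt(sum(r * (x - mean_a) ** 2 for r, x in zip(resp, data)) / na), 1e-3)
                std_b = max(math.sqrt(sum((1 - r) * (x - mean_b) ** 2 for r, x in zip(resp, data)) / nb), 1e-3)
                p_a = na / len(data)
            return (mean_a, std_a), (mean_b, std_b), p_a

        data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 5.1]
        print(em_two_clusters(data))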
  • Specialized Techniques Chapter 11
  • 11.1 Time-Series Analysis Time-series Problems: Prediction applications with one or more time-dependent attributes.
  • 11.2 Mining the Web
  • Web-Based Mining (identifying the goal)
      • Decrease the average number of pages visited by a customer before a purchase transaction.
      • Increase the average number of pages viewed per user session.
      • Increase Web server efficiency
      • Personalize Web pages for customers
      • Determine those products that tend to be purchased or viewed together
      • Decrease the total number of item returns
      • Increase visitor retention rates
  • Web-Based Mining (preparing the data)
      • Data is stored in Web server log files, typically in the form of clickstream sequences
      • Server log files provide information in extended common log file format
  • Extended Common Log File Format
    • Host Address
    • Date/Time
    • Request
    • Status
    • Bytes
    • Referring Page
    • Browser Type
  • Extended Common Log File Format
    • 80.202.8.93 - - [16/Apr/2002:22:43:28 -0600] "GET /grbts/images/msu-new-color.gif HTTP/1.1" 200 5006 "http://grb.mnsu.edu/doc/index.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.01 [nb]"
    • 134.29.41.219 - - [17/Apr/2002:19:23:30 -0600] "GET /resin-doc/images/resin_powered.gif HTTP/1.1" 200 571 "http://grb.mnsu.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"
  • Preparing the Data (the session file)
    • A session file is a file created by the data preparation process.
    • Each instance of a session file represents a single user session.
  • Preparing the Data (the session file)
    • A user session is a set of pageviews requested by a single user from a single Web server.
    • A pageview contains one or more page files each forming a display window in a Web browser.
    • Each pageview is tagged with a unique uniform resource identifier (URI).
  • Preparing the Data (the session file)
    • Creating the session file is difficult
      • Identify individual users in a log file
      • Host addresses are of limited help
      • Host address combined with referring page is beneficial
      • One user page request may generate multiple log file entries from several types of servers
      • Easiest when sites are allowed to use cookies
  • Web-Based Mining (mining the data)
    • Traditional techniques such as association rule generators or clustering methods can be applied.
    • Sequence miners, which are special data mining algorithms used to discover frequently accessed Web pages that occur in the same order, are often used.
  • Web-Based Mining (evaluating results)
    • Consider four hypothetical pageview instances
    • P5 → P4 → P10 → P3 → P15 → P2 → P1
    • P2 → P4 → P10 → P8 → P15 → P4 → P15 → P1
    • P4 → P3 → P7 → P11 → P14 → P8 → P2 → P10
    • P1 → P3 → P10 → P11 → P4 → P15 → P9
  • Evaluating Results (association rules)
    • An association rule generator outputs the following rule from our session data.
    • IF P4 & P10
    • THEN P15 {3/4}
    • This rule states that P4, P10, and P15 appear together in three session instances. Also, all four instances have P4 and P10 appearing in the same session instance.
  • Evaluating Results (unsupervised clustering)
    • Use agglomerative clustering to place session instances into clusters.
    • Instance similarity is computed by dividing the total number of pageviews each pair of instances share by the total number of pageviews contained within the instances.
  • Evaluating Results (unsupervised clustering)
    • Consider the following session instances:
    • P5 → P4 → P10 → P3 → P15 → P2 → P1
    • P2 → P4 → P10 → P8 → P15 → P4 → P15 → P1
    • The computed similarity is 5/8 = 0.625
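    A minimal Python sketch of the similarity computation above, interpreting "total number of pageviews contained within the instances" as the distinct pageviews across the pair of sessions, which reproduces the 5/8 = 0.625 on the slide.

        def session_similarity(session_a, session_b):
            """Shared pageviews divided by total distinct pageviews across both sessions."""
            a, b = set(session_a), set(session_b)
            return len(a & b) / len(a | b)

        s1 = ["P5", "P4", "P10", "P3", "P15", "P2", "P1"]
        s2 = ["P2", "P4", "P10", "P8", "P15", "P4", "P15", "P1"]
        print(session_similarity(s1, s2))   # 0.625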
  • Evaluating Results (summary statistics)
    • Summary statistics about the activities taking place at a Web site can be obtained using a Web server log analyzer.
    • The output of the analyzer is an aggregation of log file data displayed in graphical format.
  • Web-Based Mining (Taking Action)
    • Implement a strategy based on created user profiles to personalize the Web pages viewed by site visitors.
    • Adapt the indexing structure of a Web site to better reflect the paths followed by typical users.
    • Set up online advertising promotions for registered Web site customers.
    • Send e-mail to promote products of likely interest to a select group of registered customers.
    • Modify the content of a Web site by grouping products likely to be purchased together, removing products of little interest, and expanding the offerings of high-demand products.
  • Data Mining for Web Site Evaluation Web site evaluation is concerned with determining whether the actual use of a site matches the intentions of its designer.
  • Data Mining for Web Site Evaluation
    • Data mining can help with site evaluation by determining the frequent patterns and routes traveled by the user population.
    • Sequential ordering of pageviews is of primary interest.
    • Sequence miners are used to determine pageview order sequencing.
  • Data Mining for Personalization
    • The goal of personalization is to present Web users with what interests them without requiring them to ask for it directly.
    • Manual techniques force users to register at a Web site and to fill in questionnaires.
    • Data mining can be used to automate personalization.
  • Data Mining for Personalization
    • Automatic personalization is accomplished by creating usage profiles from stored session data.
  • Data Mining for Web Site Adaptation The index synthesis problem : Given a Web site and a visitor access log, create new index pages containing collections of links to related but currently unlinked pages.
  • 11.3 Mining Textual Data
    • Train: Create an attribute dictionary.
    • Filter: Remove common words.
    • Classify: Classify new documents.
  • 11.4 Improving Performance
    • Bagging
    • Boosting
    • Instance Typicality
  • Data Mining Standards
    • Grossman, R. L., Hornick, M. F., Meyer, G., Data Mining Standards Initiatives, Communications of the ACM, August 2002, Vol. 45, No. 8.
  • Privacy & Data Mining
    • Inference is the process of users posing queries and deducing unauthorized information from the legitimate responses that they receive.
    • Data mining offers sophisticated tools to deduce sensitive patterns from data.
  • Privacy & Data Mining (an example)
    • Unnamed health records are public information.
    • People’s names are public.
    • The association of people with their individual health records is private information.
  • Privacy & Data Mining (an example)
    • Former employees have their employment records stored in a data warehouse. An employer uses data mining to build a classification model to differentiate employees relative to their termination:
    • They quit
    • They were fired
    • They were laid off
    • They retired
    • The employer now uses the model to classify current employees. He fires employees likely to quit, and lays off employees likely to retire. Is this ethical?
  • Privacy & Data Mining (handling the inference problem)
    • Given a database and a data mining tool, apply the tool to determine if sensitive information can be deduced.
    • Use an inference controller to detect the motives of the user.
    • Give only samples of the data to the user thereby preventing the user from building a data mining model.
  • Privacy & Data Mining
    • Thuraisingham, B., Web Data Mining and Applications in Business Intelligence and Counter-Terrorism, CRC Press, 2003.
  • Data Mining Software
    • http://datamining.itsc.uah.edu/adam/binary.html
    • http://www.cs.waikato.ac.nz/ml/weka/
    • http://magix.fri.uni-lj.si/orange/
    • www.kdnuggets.com
    • http://grb.mnsu.edu/grbts/ts.jsp
  • Data Mining Textbooks
    • Berry, M. J., Linoff, G., Data Mining Techniques: For Marketing, Sales, and Customer Support, Wiley, 1997.
    • Han, J., Kamber, M., Data Mining: Concepts and Techniques, Academic Press, 2001.
    • Roiger, R. J., Geatz, M. W., Data Mining: A Tutorial-Based Primer, Addison-Wesley, 2003.
    • Tan, P., Steinbach, M., Kumar, V., Introduction to Data Mining, Addison-Wesley, 2005.
    • Witten, I. H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Academic Press, 2000.
  • Data Mining Resources
    • AI magazine
    • Communications of the ACM
    • SIGKDD Explorations
    • Computer Magazine
    • PC AI
    • IEEE Transactions on Data and Knowledge Engineering
  • Data Mining A Tutorial-Based Primer
    • Part I: Data Mining Fundamentals
    • Part II: Tools for Knowledge Discovery
    • Part III: Advanced Data Mining Techniques
    • Part IV: Intelligent Systems
  • Part I: Data Mining Fundamentals
      • Chapter 1 Data Mining: A First View
      • Chapter 2 Data Mining: A Closer Look
      • Chapter 3 Basic Data Mining Techniques
      • Chapter 4 An Excel-Based Data Mining Tool
  • Part II: Tools for Knowledge Discovery
    • Chapter 5: Knowledge Discovery in Databases
    • Chapter 6: The Data Warehouse
    • Chapter 7: Formal Evaluation Techniques
  • Part III: Advanced Data Mining Techniques
    • Chapter 8: Neural Networks
    • Chapter 9: Building Neural Networks with IDA
    • Chapter 10: Statistical Techniques
    • Chapter 11: Specialized Techniques
  • Part IV: Intelligent Systems
    • Chapter 12: Rule-Based Systems
    • Chapter 13: Managing Uncertainty in Rule-Based Systems
    • Chapter 14: Intelligent Agents