DataMining-techniques1.ppt: Presentation Transcript

    • Data Mining CS 341, Spring 2007 Lecture 4: Data Mining Techniques (I)
    • Review:
      • Information Retrieval
        • Similarity measures
        • Evaluation Metrics : Precision and Recall
      • Question Answering
      • Web Search Engine
        • An application of IR
        • Related to web mining
    • Data Mining Techniques Outline
      • Statistical
        • Point Estimation
        • Models Based on Summarization
        • Bayes Theorem
        • Hypothesis Testing
        • Regression and Correlation
      • Similarity Measures
      Goal: Provide an overview of basic data mining techniques
    • Point Estimation
      • Point Estimate: a single value used to estimate a population parameter.
      • May be made by calculating the parameter for a sample.
      • May be used to predict a value for missing data.
      • Ex:
        • R contains 100 employees
        • 99 have salary information
        • Mean salary of these is $50,000
        • Use $50,000 as value of remaining employee’s salary.
        • Is this a good idea?
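As a quick check of the idea on this slide, here is a minimal Python sketch (not from the lecture) that imputes a missing value with the sample mean. The function name and salary figures are invented; the known values are chosen to average $50,000 as in the example.

```python
def impute_with_mean(values):
    """Replace None entries with the mean of the known values
    (point estimation of missing data via the sample mean)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# One employee's salary is missing; the known salaries average 50,000.
salaries = [45000, 52000, 53000, None]
filled = impute_with_mean(salaries)
print(filled[-1])  # 50000.0
```

Whether this is a good idea depends on whether the missing value is plausibly typical of the sample, which is exactly the question the slide raises.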
    • Estimation Error
      • Bias: Difference between expected value and actual value.
      • Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: MSE = E[(estimate − actual)²]
      • Root Mean Squared Error (RMSE): the square root of the MSE.
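The three error measures can be written as short Python helpers. This is an illustrative sketch with made-up numbers, not code from the slides; it treats the sample of estimates as an empirical stand-in for the expectation.

```python
import math

def bias(estimates, actual):
    """Bias: difference between the expected (mean) estimate and the actual value."""
    return sum(estimates) / len(estimates) - actual

def mse(estimates, actual):
    """Mean squared error: average squared difference from the actual value."""
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)

def rmse(estimates, actual):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(estimates, actual))

estimates = [48, 52, 50, 54]  # invented repeated estimates of a parameter
actual = 50
print(bias(estimates, actual))  # 1.0
print(mse(estimates, actual))   # 6.0
```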
    • Jackknife Estimate
      • Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.
      • Named to describe a “handy and useful tool”
      • Used to reduce bias
      • Property: the Jackknife estimator lowers the bias from order 1/n to order 1/n².
    • Jackknife Estimate
      • Definition:
        • Divide the sample of size n into g groups of size m each, so n = mg (often m = 1 and g = n).
        • Let θ̂ be an estimator of the parameter θ computed from the full sample.
        • Compute θj by ignoring the jth group.
        • θ_avg is the average of the θj.
        • The Jackknife estimator is θ_Q = g·θ̂ − (g − 1)·θ_avg.
    • Jackknife Estimator: Example 1
      • Estimate of the mean for X = {x1, x2, x3}: n = 3, g = 3, m = 1, θ̂ = x̄ = (x1 + x2 + x3)/3
      • θ1 = (x2 + x3)/2, θ2 = (x1 + x3)/2, θ3 = (x1 + x2)/2
      • θ_avg = (θ1 + θ2 + θ3)/3 = (x1 + x2 + x3)/3
      • θ_Q = g·θ̂ − (g − 1)·θ_avg = 3θ̂ − 2·θ_avg = (x1 + x2 + x3)/3
      • In this case, the Jackknife Estimator is the same as the usual estimator.
    • Jackknife Estimator: Example 2
      • Estimate of the variance for X = {1, 4, 4}: n = 3, g = 3, m = 1, θ̂ = σ̂²
      • σ̂² = ((1 − 3)² + (4 − 3)² + (4 − 3)²)/3 = 2
      • θ1 = ((4 − 4)² + (4 − 4)²)/2 = 0
      • θ2 = ((1 − 2.5)² + (4 − 2.5)²)/2 = 2.25, θ3 = 2.25
      • θ_avg = (θ1 + θ2 + θ3)/3 = 1.5
      • θ_Q = g·θ̂ − (g − 1)·θ_avg = 3(2) − 2(1.5) = 3
      • In this case, the Jackknife Estimator is different from the usual estimator.
    • Jackknife Estimator: Example 2 (cont'd)
      • In general, applying the Jackknife technique to the biased variance estimator σ̂²
      • σ̂² = Σ (xi − x̄)² / n
      • yields the jackknife estimator s²
      • s² = Σ (xi − x̄)² / (n − 1)
      • which is known to be unbiased for σ².
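The leave-one-out jackknife (m = 1, g = n) described above can be sketched in a few lines of Python. The function names are my own; the example reproduces the slides' Example 2, where jackknifing the biased variance estimator yields the unbiased s² = 3 for X = {1, 4, 4}.

```python
def biased_var(xs):
    """Biased variance estimator: sum((x - mean)^2) / n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def jackknife(xs, estimator):
    """Jackknife estimate with g = n groups of size m = 1:
    theta_Q = g*theta_hat - (g - 1)*theta_avg, where theta_avg is the
    average of the leave-one-out estimates theta_j."""
    g = len(xs)
    theta_hat = estimator(xs)
    theta_avg = sum(estimator(xs[:j] + xs[j + 1:]) for j in range(g)) / g
    return g * theta_hat - (g - 1) * theta_avg

# Example 2 from the slides: X = {1, 4, 4}
print(biased_var([1, 4, 4]))            # 2.0 (the biased estimate)
print(jackknife([1, 4, 4], biased_var)) # 3.0 (matches theta_Q on the slide)
```

Note that 3.0 equals s² = Σ(xi − x̄)²/(n − 1) = 6/2 for this data, illustrating the bias-reduction property.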
    • Maximum Likelihood Estimate (MLE)
      • Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
      • The likelihood function is the joint probability of observing the sample data, obtained by multiplying the individual probabilities: L(θ | x1, …, xn) = ∏ f(xi | θ)
      • Maximize L.
    • MLE Example
      • Coin toss five times: {H,H,H,H,T}
      • Assuming a fair coin with H and T equally likely, the likelihood of this sequence is (1/2)^5 = 0.03125.
      • However, if the probability of an H is 0.8, the likelihood is (0.8)^4 (0.2) = 0.08192, which is larger.
    • MLE Example (cont’d)
      • General likelihood formula: L(p) = p^h (1 − p)^(n − h), where h is the number of heads in n tosses.
      • Maximizing L gives the estimate for p: h/n = 4/5 = 0.8.
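The coin-toss likelihoods can be checked numerically. A minimal sketch (function names are mine, not the lecture's):

```python
def likelihood(p, tosses):
    """L(p) = p^(#H) * (1 - p)^(#T) for independent coin tosses."""
    h = tosses.count("H")
    t = tosses.count("T")
    return p ** h * (1 - p) ** t

tosses = ["H", "H", "H", "H", "T"]
print(likelihood(0.5, tosses))  # 0.03125 (fair coin)
print(likelihood(0.8, tosses))  # ~0.08192, higher than for p = 0.5

# The MLE is the head fraction h/n:
p_hat = tosses.count("H") / len(tosses)
print(p_hat)  # 0.8
```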
    • Expectation-Maximization (EM)
      • Solves estimation with incomplete data.
      • Obtain initial estimates for parameters.
      • Iteratively use estimates for missing data and continue until convergence.
    • EM Example
    • EM Algorithm
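The EM example and algorithm on the slides are figures that did not survive the transcript. As a stand-in, the following toy sketch applies the iterate-until-convergence idea stated above to estimating a mean when some values are missing: the E-step fills each missing entry with the current mean estimate, and the M-step re-estimates the mean from the completed data. The function and data are invented for illustration, not taken from the lecture.

```python
def em_mean(values, init, tol=1e-6, max_iter=100):
    """Toy EM for estimating a mean with missing data (None entries)."""
    mu = init
    for _ in range(max_iter):
        completed = [mu if v is None else v for v in values]  # E-step
        new_mu = sum(completed) / len(completed)              # M-step
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Observed values 1, 5, 10 with two missing entries; converges to 16/3.
print(em_mean([1.0, 5.0, 10.0, None, None], init=3.0))
```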
    • Models Based on Summarization
      • Use basic statistical concepts to provide an abstraction and summarization of the data as a whole.
        • Statistical concepts: mean, variance, median, mode, etc.
      • Visualization: display the structure of the data graphically.
        • Line graphs, Pie charts, Histograms, Scatter plots, Hierarchical graphs
    • Scatter Diagram
    • Bayes Theorem
      • Posterior Probability: P(h1 | xi)
      • Prior Probability: P(h1)
      • Bayes Theorem: P(h1 | xi) = P(xi | h1) · P(h1) / P(xi)
      • Assign probabilities of hypotheses given a data value.
    • Bayes Theorem Example
      • Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police
      • Assign twelve data values for all combinations of credit and income:
      • From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.
    • Bayes Example(cont’d)
      • Training Data:
    • Bayes Example(cont’d)
      • Calculate P(xi | hj) and P(xi)
      • Ex: P(x7 | h1) = 2/6; P(x4 | h1) = 1/6; P(x2 | h1) = 2/6; P(x8 | h1) = 1/6; P(xi | h1) = 0 for all other xi.
      • Predict the class for x4:
        • Calculate P(hj | x4) for all hj.
        • Place x4 in the class with the largest value.
        • Ex:
          • P(h1 | x4) = P(x4 | h1) · P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1
          • x4 is placed in class h1.
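The posterior for x4 can be verified numerically. The dictionaries below are my own encoding of the values quoted on the slides (the full training table is a figure that did not survive the transcript):

```python
def posterior(x, h, p_x_given_h, p_h, p_x):
    """Bayes theorem: P(h | x) = P(x | h) * P(h) / P(x)."""
    return p_x_given_h[(x, h)] * p_h[h] / p_x[x]

# Priors from the training data on the slides.
p_h = {"h1": 0.6, "h2": 0.2, "h3": 0.1, "h4": 0.1}
# Only the x4 entries are needed for this prediction.
p_x_given_h = {("x4", "h1"): 1 / 6}
p_x = {"x4": 0.1}

# P(h1 | x4) = (1/6)(0.6)/0.1, which is approximately 1, so x4 goes to h1.
print(posterior("x4", "h1", p_x_given_h, p_h, p_x))
```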
    • Hypothesis Testing
      • Find model to explain behavior by creating and then testing a hypothesis about the data.
      • Exact opposite of usual DM approach.
      • H0 – null hypothesis; the hypothesis to be tested.
      • H1 – alternative hypothesis.
    • Chi-Square Test
      • One technique to perform hypothesis testing
      • Used to test the association between two observed variables and to determine whether a set of observed values is statistically different from the expected values.
      • The chi-squared statistic is defined as: χ² = Σ (O − E)² / E
      • O – observed value
      • E – expected value based on the hypothesis.
    • Chi-Square Test
      • Given the average scores of five schools, determine whether the differences are statistically significant.
      • Ex:
        • O = {50, 93, 67, 78, 87}
        • E = 75
        • χ² = 15.55, and therefore significant
      • Examine a chi-squared significance table:
        • with 4 degrees of freedom and a significance level of 95%, the critical value is 9.488. Thus the variation between the schools' scores and the expected value cannot be attributed to pure chance.
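The school-scores statistic is easy to reproduce; a minimal sketch (function name is mine):

```python
def chi_square(observed, expected):
    """Chi-squared statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# School scores example from the slides: E = 75 for every school.
O = [50, 93, 67, 78, 87]
E = [75] * 5
stat = chi_square(O, E)
print(round(stat, 2))  # 15.55
# This exceeds the 95% critical value of 9.488 for 4 degrees of freedom,
# so the differences are statistically significant.
```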
    • Regression
      • Predict future values based on past values
      • Fitting a set of points to a curve
      • Linear Regression assumes linear relationship exists.
      • y = c0 + c1·x1 + … + cn·xn
        • n input variables (called regressors or predictors)
        • one output variable (called the response)
        • n + 1 constants, chosen during the modeling process to match the input examples
    • Linear Regression -- with one input value
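The one-input plot on this slide is a figure that did not survive the transcript. As a stand-in, here is a small least-squares sketch for the one-regressor case y = c0 + c1·x; the data points are invented and lie exactly on the line y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x (one regressor).
    c1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = my - c1 * mx
    return c0, c1

c0, c1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 1 + 2x
print(c0, c1)  # 1.0 2.0
```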
    • Correlation
      • Examine the degree to which the values for two variables behave similarly.
      • Correlation coefficient r:
        • 1 = perfect correlation
        • -1 = perfect but opposite correlation
        • 0 = no correlation
    • Correlation
      • r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² ), where x̄ and ȳ are the means of X and Y respectively.
      • Suppose X=(1,3,5,7,9) and Y=(9,7,5,3,1)
        • r = ?
      • Suppose X=(1,3,5,7,9) and Y=(2,4,6,8,10)
        • r = ?
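A short sketch of the correlation coefficient (the function name is mine) that also checks the two quiz cases above:

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r, in [-1, 1]."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 3, 5, 7, 9], [9, 7, 5, 3, 1]))   # -1.0 (perfect opposite)
print(correlation([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))  # 1.0  (perfect correlation)
```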
    • Similarity Measures
      • Determine similarity between two objects.
      • Similarity characteristics:
      • Alternatively, distance measures measure how unlike or dissimilar two objects are.
    • Similarity Measures
    • Distance Measures
      • Measure dissimilarity between objects
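The specific similarity and distance formulas on these slides are figures that did not survive the transcript. Two standard distance measures commonly covered at this point, sketched in Python (names are mine):

```python
import math

def euclidean(a, b):
    """Euclidean distance: square root of summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Manhattan (city-block) distance: summed absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```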
    • Next Lecture:
      • Data Mining techniques (II)
        • Decision trees, neural networks and genetic algorithms
      • Reading assignments: Chapter 3