# DataMining-techniques1.ppt

## Uploaded on May 10, 2010

## DataMining-techniques1.ppt Presentation Transcript

• Data Mining CS 341, Spring 2007 Lecture 4: Data Mining Techniques (I)
• Review:
• Information Retrieval
• Similarity measures
• Evaluation Metrics : Precision and Recall
• Web Search Engine
• An application of IR
• Related to web mining
• Data Mining Techniques Outline
• Statistical
• Point Estimation
• Models Based on Summarization
• Bayes Theorem
• Hypothesis Testing
• Regression and Correlation
• Similarity Measures
Goal: Provide an overview of basic data mining techniques
• Point Estimation
• Point Estimate: estimate a population parameter.
• May be made by calculating the parameter for a sample.
• May be used to predict value for missing data.
• Ex:
• R contains 100 employees
• 99 have salary information
• Mean salary of these is $50,000
• Use $50,000 as the value of the remaining employee's salary.
• Is this a good idea?
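The imputation idea above can be sketched in a few lines of Python. The individual salary values below are invented for illustration; the slide only states that the mean of the 99 known salaries is $50,000.

```python
# Point estimation used for missing-data imputation (values invented;
# stand-ins for the 99 employees with known salaries).
known = [48_000, 50_000, 52_000]

mean_salary = sum(known) / len(known)   # point estimate of the population mean
missing_salary = mean_salary            # impute the unknown employee's salary

print(mean_salary)   # 50000.0
```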
• Estimation Error
• Bias: Difference between expected value and actual value.
• Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: MSE(θ̂) = E[(θ̂ − θ)²]
• Root Mean Square Error (RMSE): the square root of the MSE, RMSE(θ̂) = √MSE(θ̂)
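A minimal empirical sketch of both error measures, averaging the squared error over a set of estimates of a known true value (toy data invented for illustration):

```python
import math

def mse(estimates, actual):
    # mean of the squared differences between each estimate and the true value
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)

def rmse(estimates, actual):
    # root mean square error: square root of the MSE
    return math.sqrt(mse(estimates, actual))

print(mse([2.0, 4.0], 3.0))    # ((2-3)^2 + (4-3)^2)/2 = 1.0
print(rmse([2.0, 4.0], 3.0))   # sqrt(1.0) = 1.0
```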
• Jackknife Estimate
• Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.
• Named to describe a “handy and useful tool”
• Used to reduce bias
• Property: the jackknife estimator reduces the bias from order 1/n to order 1/n²
• Jackknife Estimate
• Definition:
• Divide the sample of size n into g groups of size m each, so n = mg (often m = 1 and g = n).
• θ̂_j is the estimate obtained by ignoring the jth group.
• θ̄ is the average of the θ̂_j.
• The jackknife estimator is
• θ̂_Q = g·θ̂ − (g − 1)·θ̄
• where θ̂ is the estimator for the parameter θ computed from the full sample.
• Jackknife Estimator: Example 1
• Estimate of the mean for X = {x1, x2, x3}, n = 3, g = 3, m = 1, θ = μ, θ̂ = x̄ = (x1 + x2 + x3)/3
• θ̂_1 = (x2 + x3)/2, θ̂_2 = (x1 + x3)/2, θ̂_3 = (x1 + x2)/2
• θ̄ = (θ̂_1 + θ̂_2 + θ̂_3)/3 = (x1 + x2 + x3)/3
• θ̂_Q = g·θ̂ − (g − 1)·θ̄ = 3θ̂ − 2θ̄ = (x1 + x2 + x3)/3
• In this case, the Jackknife Estimator is the same as the usual estimator.
• Jackknife Estimator: Example 2
• Estimate of the variance for X = {1, 4, 4}, n = 3, g = 3, m = 1, θ = σ²
• θ̂ = σ̂² = ((1 − 3)² + (4 − 3)² + (4 − 3)²)/3 = 2
• θ̂_1 = ((4 − 4)² + (4 − 4)²)/2 = 0
• θ̂_2 = ((1 − 2.5)² + (4 − 2.5)²)/2 = 2.25, θ̂_3 = 2.25
• θ̄ = (0 + 2.25 + 2.25)/3 = 1.5
• θ̂_Q = g·θ̂ − (g − 1)·θ̄ = 3·θ̂ − 2·θ̄
• = 3(2) − 2(1.5) = 3
• In this case, the Jackknife Estimator is different from the usual estimator.
• Jackknife Estimator: Example 2 (cont'd)
• In general, applying the jackknife technique to the biased variance estimator σ̂²
• σ̂² = Σ(x_i − x̄)² / n
• yields the jackknife estimator s²
• s² = Σ(x_i − x̄)² / (n − 1)
• which is known to be unbiased for σ².
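Both worked examples can be checked numerically with a small leave-one-out jackknife routine. This is a sketch assuming m = 1 and g = n, as in the examples:

```python
def jackknife(estimator, data):
    # theta_Q = n*theta_hat - (n-1)*theta_bar, with g = n groups of size m = 1
    n = len(data)
    theta_hat = estimator(data)                                   # full-sample estimate
    loo = [estimator(data[:j] + data[j + 1:]) for j in range(n)]  # leave-one-out estimates
    theta_bar = sum(loo) / n
    return n * theta_hat - (n - 1) * theta_bar

mean = lambda xs: sum(xs) / len(xs)

def biased_var(xs):
    # biased variance estimator: divides by n rather than n - 1
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

X = [1.0, 4.0, 4.0]
print(jackknife(mean, X))        # 3.0 -- same as the ordinary sample mean
print(biased_var(X))             # 2.0 -- the biased estimate from Example 2
print(jackknife(biased_var, X))  # 3.0 -- equals the unbiased s^2
```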
• Maximum Likelihood Estimate (MLE)
• Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
• The likelihood function is the joint probability of observing the sample data, obtained by multiplying the individual probabilities: L(Θ | x1, …, xn) = ∏ f(xi | Θ)
• Maximize L.
• MLE Example
• Coin toss five times: {H,H,H,H,T}
• Assuming a fair coin with H and T equally likely, the likelihood of this sequence is (1/2)⁵ = 0.03125.
• However, if the probability of an H is 0.8, then the likelihood is (0.8)⁴(0.2) = 0.08192.
• MLE Example (cont’d)
• General likelihood formula: L(p) = ∏ p^(xi) (1 − p)^(1 − xi); here L(p) = p⁴(1 − p)
• Maximizing L gives the estimate p̂ = 4/5 = 0.8
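The coin-toss example can be verified with a brute-force grid search over p (a sketch; the closed form already gives 4/5 as the maximizer):

```python
# Likelihood of the sequence {H,H,H,H,T} as a function of p = P(H).
def likelihood(p):
    return p ** 4 * (1 - p)

print(likelihood(0.5))   # 0.03125  -- fair coin
print(likelihood(0.8))   # 0.08192  -- biased coin, higher likelihood

# Grid search over candidate values of p; the maximum lands at 0.8.
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)   # 0.8
```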
• Expectation-Maximization (EM)
• Solves estimation with incomplete data.
• Obtain initial estimates for parameters.
• Iteratively use estimates for missing data and continue until convergence.
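A minimal EM-style sketch of this idea, estimating a mean when one value is missing. The data and stopping threshold below are my own illustration, not from the slides: the E-step fills the missing value with the current estimate, and the M-step recomputes the mean over all values.

```python
observed = [1.0, 5.0, 10.0, 4.0]   # invented data; one further value is missing
guess = 0.0                         # initial estimate of the mean

for _ in range(100):
    filled = observed + [guess]             # E-step: fill missing value with estimate
    new_guess = sum(filled) / len(filled)   # M-step: recompute the mean
    if abs(new_guess - guess) < 1e-9:       # stop at convergence
        break
    guess = new_guess

print(round(guess, 6))   # 5.0 -- converges to the mean of the observed values
```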
• EM Example
• EM Algorithm
• Models Based on Summarization
• Use basic statistical concepts to provide an abstraction and summarization of the data as a whole.
• Statistical concepts: mean, variance, median, mode, etc.
• Visualization: display the structure of the data graphically.
• Line graphs, Pie charts, Histograms, Scatter plots, Hierarchical graphs
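The statistical summaries listed above can be computed directly with Python's statistics module (the sample data is invented):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # 5
median = statistics.median(data)        # 4.5 (average of the two middle values)
mode = statistics.mode(data)            # 4 (most frequent value)
variance = statistics.pvariance(data)   # 4 (population variance)
```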
• Scatter Diagram
• Bayes Theorem
• Posterior probability: P(h1 | xi)
• Prior probability: P(h1)
• Bayes theorem: P(hj | xi) = P(xi | hj) P(hj) / P(xi)
• Assigns probabilities to hypotheses given a data value.
• Bayes Theorem Example
• Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police
• Assign twelve data values for all combinations of credit and income:
• From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.
• Bayes Example (cont'd)
• Training Data:
• Bayes Example (cont'd)
• Calculate P(xi | hj) and P(xi)
• Ex: P(x7 | h1) = 2/6; P(x4 | h1) = 1/6; P(x2 | h1) = 2/6; P(x8 | h1) = 1/6; P(xi | h1) = 0 for all other xi.
• Predict the class for x4:
• Calculate P(hj | x4) for all hj.
• Place x4 in the class with the largest value.
• Ex:
• P(h1 | x4) = P(x4 | h1) P(h1) / P(x4)
• = (1/6)(0.6)/0.1 = 1
• So x4 is placed in class h1.
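The posterior computation for x4, using the numbers given on the slides:

```python
prior_h1 = 0.6          # P(h1) from the training data
p_x4_given_h1 = 1 / 6   # P(x4|h1) from the training data
p_x4 = 0.1              # P(x4)

# Bayes theorem: P(h1|x4) = P(x4|h1) * P(h1) / P(x4)
posterior_h1 = p_x4_given_h1 * prior_h1 / p_x4
print(round(posterior_h1, 6))   # 1.0 -> x4 is assigned to class h1
```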
• Hypothesis Testing
• Find model to explain behavior by creating and then testing a hypothesis about the data.
• Exact opposite of usual DM approach.
• H 0 – Null hypothesis; Hypothesis to be tested.
• H 1 – Alternative hypothesis
• Chi-Square Test
• One technique to perform hypothesis testing
• Used to test whether a set of observed values differs statistically from the values expected under a hypothesis.
• The chi-squared statistic is defined as: χ² = Σ (O − E)² / E
• O – observed value
• E – expected value based on the hypothesis
• Chi-Square Test
• Given the average scores of five schools, determine whether the differences are statistically significant.
• Ex:
• O={50,93,67,78,87}
• E=75
• χ² = 15.55 and therefore significant
• Examine a chi-squared significance table.
• With 4 degrees of freedom and a significance level of 95%, the critical value is 9.488. Since 15.55 > 9.488, the variation between the schools' scores and the expected value cannot be attributed to pure chance.
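The five-school computation in full, with the critical value 9.488 taken from a standard chi-squared table (df = 4, α = 0.05):

```python
observed = [50, 93, 67, 78, 87]   # average scores of the five schools
expected = 75                     # expected score under the null hypothesis

# chi-squared statistic: sum of (O - E)^2 / E over all observations
chi2 = sum((o - expected) ** 2 / expected for o in observed)

print(round(chi2, 2))   # 15.55
print(chi2 > 9.488)     # True -> the difference is statistically significant
```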
• Regression
• Predict future values based on past values
• Fitting a set of points to a curve
• Linear Regression assumes linear relationship exists.
• y = c0 + c1 x1 + … + cn xn
• n input variables (called regressors or predictors)
• One output variable, called the response
• n + 1 constants, chosen during the modeling process to match the input examples
• Linear Regression -- with one input value
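With one input value, the least-squares coefficients have a closed form: c1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and c0 = ȳ − c1·x̄. A sketch on invented data that lies exactly on y = 1 + 2x:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # exactly y = 1 + 2x

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form least-squares solution for one input variable.
c1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
c0 = y_bar - c1 * x_bar

print(c0, c1)   # 1.0 2.0
```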
• Correlation
• Examine the degree to which the values for two variables behave similarly.
• Correlation coefficient r: r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²)
• 1 = perfect correlation
• -1 = perfect but opposite correlation
• 0 = no correlation
• Correlation
• where x̄ and ȳ are the means of X and Y, respectively.
• Suppose X=(1,3,5,7,9) and Y=(9,7,5,3,1)
• r = ?
• Suppose X=(1,3,5,7,9) and Y=(2,4,6,8,10)
• r = ?
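Both questions can be answered with a small Pearson-r function (the X and Y values come from the slide; the function itself is my sketch):

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient: covariance over product of deviations
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson_r([1, 3, 5, 7, 9], [9, 7, 5, 3, 1]), 6))    # -1.0
print(round(pearson_r([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]), 6))   # 1.0
```

The first pair is perfectly negatively correlated (r = −1); the second is perfectly positively correlated (r = 1).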
• Similarity Measures
• Determine similarity between two objects.
• Similarity characteristics:
• Alternatively, distance measures measure how unlike, or dissimilar, objects are.
• Similarity Measures
• Distance Measures
• Measure dissimilarity between objects
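Two common distance measures, sketched for numeric vectors (the 3-4-5 point pair is invented for illustration):

```python
import math

def euclidean(a, b):
    # straight-line distance: sqrt of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # city-block distance: sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
```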
• Next Lecture:
• Data Mining techniques (II)
• Decision trees, neural networks and genetic algorithms