Kevin Swingler: Introduction to Data Mining

Presentation to the third LIS DREaM workshop, held at Edinburgh Napier university on Wednesday 25th April 2012.

More information about the event can be found at




Usage Rights

© All Rights Reserved

Kevin Swingler: Introduction to Data Mining Presentation Transcript

  • 1. Data Mining Methodology
    Kevin Swingler, University of Stirling
    Lecturer, Computing Science
  • 2. What is Data Mining?
    • Generally, methods of using large quantities of data and appropriate algorithms to allow a computer to ‘learn’ to perform a task
    • Task oriented:
        – Predict outcomes or forecast the future
        – Classify objects as belonging to one of several categories
        – Separate data into clusters of similar objects
    • Most methods produce a model of the data that performs the task
  • 3. Some Examples
    • Predicting patterns of drug side-effects
    • Spotting credit card or insurance fraud
    • Controlling complex machinery
    • Predicting the outcome of medical interventions
    • Predicting the price of stocks and shares or exchange rates
    • Knowing when a cow is most fertile (really!)
  • 4. Examples in LIS
    • Text Mining
        – Automatically determine what an article is ‘about’
        – Classify attitudes in social media
    • Demand Prediction
        – Predicting demand for resources such as new books or journals or buildings
    • Search and Recommend
        – Analysis of borrowing history to make recommendations
        – Links analysis for citation clustering
  • 5. Data Sources
    • In House
        – Data you own
        – Borrow records
        – Search histories
        – Catalogue data
    • Bought in
        – Demographic data about customers
        – Demographic data about the locality around a library
  • 6. Methods
    • Techniques for data mining are based on mathematics and statistics, but are implemented in easy to use software packages
    • Where methodology is important is in pre-processing the data, choosing the techniques, and interpreting the results
  • 7. CRISP-DM Standard
    • CRoss Industry Standard Process for Data Mining
  • 8. Data Preparation
    • Clean the data
        – Remove rows with missing values
        – Remove rows with obvious data entry errors, e.g. Age = 200
        – Recode obvious data entry inconsistencies, e.g. if Gender = M or F, but occasionally Male
        – Remove rows with minority values
        – Select which variables to use in the model
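The cleaning steps on slide 8 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names `age` and `gender`, the plausible age range of 0 to 120, and the recoding table are all invented for the example and do not come from the presentation.

```python
# Hypothetical records; field names and values are invented for illustration.
records = [
    {"age": 34, "gender": "M"},
    {"age": 200, "gender": "F"},      # obvious data entry error
    {"age": 51, "gender": "Male"},    # inconsistent coding
    {"age": None, "gender": "F"},     # missing value
    {"age": 28, "gender": "Female"},  # inconsistent coding
]

# Assumed recoding table: map the occasional long forms onto M / F.
GENDER_CODES = {"M": "M", "F": "F", "Male": "M", "Female": "F"}


def clean(rows, min_age=0, max_age=120):
    """Drop rows with missing or implausible values and recode gender."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # remove rows with missing values
        if not min_age <= row["age"] <= max_age:
            continue  # remove rows with obvious data entry errors
        if row["gender"] not in GENDER_CODES:
            continue  # unknown code: treated here as an entry error
        cleaned.append({"age": row["age"], "gender": GENDER_CODES[row["gender"]]})
    return cleaned


cleaned_records = clean(records)
print(cleaned_records)
```

Whether a row with a bad value should be dropped or repaired depends on how much data is available; the sketch simply drops it, as the slide suggests.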
  • 9. Data Quantity
    • Choose the variables to be used for the model
    • Look at the distributions of the chosen values
    • Look at the level of noise in the data
    • Look at the degree of linearity in the data
    • Decide whether or not there are sufficient examples in the data
    • Treat unbalanced data
  • 10. Consider Error Costs
    • Imagine a system that classifies input patterns into one of several possible categories
    • Sometimes it will get things wrong; how often depends on the problem:
        – Direct mail targeting – very often
        – Credit risk assessment – quite often
        – Medical reasoning – very infrequently
  • 11. Error Costs
    • An error in one direction can cost more than an error in the opposite direction
        – Recommending a blood test based on a false positive is better than missing an infection due to a false negative
        – Missing a case of insurance fraud is more costly than flagging a claim to be double checked
    • The balance of examples in each case can be manipulated to reflect the cost
  • 12. Check Points
    • Data quantity and quality: do you have sufficient good data for the task?
        – How many variables are there?
        – How complex is the task?
        – Is the data’s distribution appropriate?
            • Outliers
            • Balance
            • Value set size
  • 13. Distributions
    • A frequency distribution is a count of how often a variable takes each value in a data set
    • For discrete numbers and categorical values, this is simply a count of each value
    • For continuous numbers, the count is of how many values fall into each of a set of sub-ranges
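Both kinds of frequency distribution described on slide 13 can be computed with the standard library. The sketch below is illustrative only: the sample values, the equal-width binning, and the function names are assumptions made for the example.

```python
from collections import Counter


def categorical_distribution(values):
    """Count how often each distinct value occurs."""
    return Counter(values)


def binned_distribution(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width sub-ranges of [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if not low <= v <= high:
            continue  # ignore values outside the chosen range
        index = min(int((v - low) / width), bins - 1)  # v == high goes in the last bin
        counts[index] += 1
    return counts


# Invented sample data.
genders = ["M", "F", "F", "M", "F", "Male"]
ages = [18, 22, 25, 31, 38, 40, 47, 59, 60]

gender_counts = categorical_distribution(genders)
age_counts = binned_distribution(ages, low=0, high=60, bins=3)
print(gender_counts)
print(age_counts)
```

The list returned by `binned_distribution` holds the bar heights a histogram would show, one per sub-range.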
  • 14. Plotting Distributions
    • The easiest way to visualise a distribution is to plot it in a histogram.
  • 15. Features of a Distribution to Look For
    • Outliers
    • Minority values
    • Data Balance
    • Data entry errors
  • 16. Outliers
    • A small number of values that are much larger or much smaller than all the others
    • Can disrupt the data mining process and give misleading results
    • You should either remove them or, if they are important, collect more data to reflect this aspect of the world you are modelling
    • Could be data entry errors
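The slides do not say how to detect outliers. One common rule of thumb, used here purely as an assumption, flags any value more than 1.5 interquartile ranges beyond the quartiles. The sample data and the function name are invented for the example.

```python
import statistics


def find_outliers(values, spread=1.5):
    """Return values lying more than `spread` interquartile ranges outside the quartiles.

    This is a rule of thumb chosen for illustration, not a method from the slides.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - spread * iqr, q3 + spread * iqr
    return [v for v in values if v < lower or v > upper]


# Invented ages, including one likely data entry error.
ages = [21, 23, 25, 27, 29, 31, 33, 200]
outliers = find_outliers(ages)
print(outliers)
```

A flagged value still needs a human decision: remove it if it is an entry error, or collect more data around it if it reflects something real.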
  • 17. Minority Values
    • Values that only appear infrequently in the data
    • Do they appear often enough to contribute to the model?
    • Might be worth removing them from the data or collecting more data where they are represented
    • Are they needed in the finished system?
    • Could they be the result of data entry errors?
  • 18. Minority Values
    [Bar chart: counts of the values Male, Female, M and F; vertical axis from 0 to 600]
    What does this chart tell you about the gender variable in a data set?
    What should you do before modelling or mining the data?
  • 19. Flat and Wide Variables
    • Variables where all the values are minority values have a flat, wide distribution – one or two of each possible value
    • Such variables are of little use in data mining because the goal of DM is to find general patterns from specific data
    • No such patterns can exist if each data point is completely different
    • Such variables should be excluded from a model
  • 20. Data Balance
    • Imagine I want to predict whether or not a prospective customer will respond to a mailing campaign
    • I collect the data and put it into a data mining algorithm, which learns and reports a success rate of 98%
    • Sounds good, but when I put a new set of prospects through to see who to mail, what happens?
  • 21. A Problem
    • … the system predicts ‘No’ for every single prospect.
    • With a campaign response rate of 2%, the system is right 98% of the time if it always says ‘No’.
    • So it never chooses anybody to target in the campaign
  • 22. A Solution
    • One data pre-processing solution is to balance the number of examples of each target class in the output variable
    • In our previous example: 50% customers and 50% non-customers
    • That way, any gain in accuracy over 50% would certainly be due to patterns in the data, not the prior distribution
    • This is not always easy to achieve – you might need to throw away a lot of data to balance the examples, or build several models on balanced subsets
    • Not always necessary – if an event is rare because its cause is rare, then the problem won’t arise
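The mailing campaign example can be reproduced with a small sketch. It shows the 98% accuracy of an always-‘No’ predictor and then balances the classes by randomly discarding examples from the larger class. The data set, field names and function names are all invented for the example; discarding rows is only one of the options the slide mentions.

```python
import random


def always_no_accuracy(labels):
    """Accuracy of a 'model' that predicts 'No' for every prospect."""
    return sum(1 for label in labels if label == "No") / len(labels)


def undersample(rows, label_of, seed=0):
    """Randomly discard rows so every class has as many rows as the rarest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Invented campaign data: 1000 prospects with a 2% response rate.
prospects = [{"id": i, "responded": "Yes" if i < 20 else "No"} for i in range(1000)]
labels = [p["responded"] for p in prospects]
print(always_no_accuracy(labels))  # 0.98

balanced = undersample(prospects, lambda p: p["responded"])
balanced_labels = [p["responded"] for p in balanced]
print(len(balanced), always_no_accuracy(balanced_labels))  # 40 0.5
```

On the balanced set the always-‘No’ strategy scores only 50%, so any model that does better must be using patterns in the data. The price is that 960 of the 1000 rows are thrown away.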
  • 23. Data Quantity
    • How much data do you need?
    • How long is a piece of string?
    • Data must be sufficient to:
        – Represent the dynamics of the system to be modelled
        – Cover all situations likely to be encountered when predictions are needed
        – Compensate for any noise in the data
  • 24. Model Building
    • Choose a number of techniques suitable to the task:
        – Neural network for prediction or classification
        – Decision tree for classification
        – Rule induction for classification
        – Bayesian network for classification
        – K-Means for clustering
  • 25. Train Models
    • For each technique:
        – Run a series of experiments with different parameters
        – Each experiment should use around 70% of the data for training and the rest for testing
        – When a good solution is found, use cross validation (10-fold is a good choice) to verify the result
  • 26. Cross Validation
    • Split the data into ten subsets, then train 10 models, each one using 9 of the 10 subsets as training data and the 10th as test data. The score is the average of all 10.
    • This is a more accurate representation of how well the data may be modelled, as it reduces the risk of getting a lucky test set
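The ten-fold procedure on slide 26 can be sketched as follows. The splitting and averaging follow the slide; everything else is an assumption for illustration, including the function names, the random shuffle before splitting, and the placeholder "model", which just predicts the most common training label so that the example stays self-contained.

```python
import random


def k_fold_splits(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(rows, train_model, score_model, k=10):
    """Train k models, each tested on its held-out subset; return the mean score."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(rows), k):
        model = train_model([rows[i] for i in train_idx])
        scores.append(score_model(model, [rows[i] for i in test_idx]))
    return sum(scores) / len(scores)


# Placeholder model: always predict the most common label in the training data.
def train_majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)


def accuracy(model, rows):
    return sum(1 for _, label in rows if label == model) / len(rows)


# Invented data: 100 rows, a quarter of them labelled "yes".
data = [(x, "yes" if x % 4 == 0 else "no") for x in range(100)]
mean_accuracy = cross_validate(data, train_majority, accuracy, k=10)
print(mean_accuracy)
```

In practice `train_majority` and `accuracy` would be replaced by the training and scoring routines of whichever technique is being evaluated.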
  • 27. Assess Models
    • You can measure the success of your model in a number of ways
        – Mean squared error – not always meaningful
        – Percentage correct for classification
        – Confusion matrix for classification

                   Output = True   Output = False
          True          80              30
          False         20              90
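The counts in the confusion matrix on slide 27 are enough to compute the percentage correct. The sketch below assumes that rows are the actual class and columns are the model output; the slide does not label the rows, so that orientation is an assumption. Accuracy does not depend on it, but precision and recall do.

```python
# Counts from slide 27. Assumption: keys are (actual class, model output).
confusion = {
    ("True", "True"): 80,
    ("True", "False"): 30,
    ("False", "True"): 20,
    ("False", "False"): 90,
}

total = sum(confusion.values())
correct = confusion[("True", "True")] + confusion[("False", "False")]
accuracy = correct / total  # fraction on the diagonal, independent of orientation

# The next two figures are only valid under the assumed orientation.
precision = confusion[("True", "True")] / (
    confusion[("True", "True")] + confusion[("False", "True")]
)
recall = confusion[("True", "True")] / (
    confusion[("True", "True")] + confusion[("True", "False")]
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these counts, 170 of the 220 cases are classified correctly, which is about 77%.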
  • 28. Probability Outputs
    • Most classification techniques provide a score with the classification – either a probability or some other measure
    • This can be used to:
        – Allow an answer of “unsure” for cases where no single class has a high enough probability
        – Weight outputs to allow for unequal cost of outcomes
        – Produce lift charts and ROC curves
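The first two uses of probability outputs can be sketched briefly. The threshold of 0.7, the class names, the cost figures and the function names are all invented for illustration; the cost-weighted rule shown, choosing the action with the lowest expected cost, is one possible way to weight outputs and is not taken from the slides.

```python
def classify_with_unsure(probabilities, threshold=0.7):
    """Return the most probable class, or 'unsure' if it falls below the threshold."""
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= threshold else "unsure"


def least_cost_decision(probabilities, costs):
    """Pick the decision with the lowest expected cost.

    costs[decision][actual] is the cost of taking `decision` when the truth is `actual`.
    """
    def expected_cost(decision):
        return sum(p * costs[decision][actual] for actual, p in probabilities.items())

    return min(costs, key=expected_cost)


print(classify_with_unsure({"fraud": 0.55, "ok": 0.45}))  # unsure
print(classify_with_unsure({"fraud": 0.10, "ok": 0.90}))  # ok

# Invented costs: missing a fraud costs 50, a needless double check costs 1.
costs = {
    "flag": {"fraud": 0, "ok": 1},
    "pass": {"fraud": 50, "ok": 0},
}
print(least_cost_decision({"fraud": 0.10, "ok": 0.90}, costs))  # flag
```

In the last call the claim is flagged even though fraud has only a 10% probability, because the expected cost of passing it (5.0) is higher than the expected cost of a double check (0.9).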
  • 29. Generalisation and Over Fitting
    • Most data mining models have a degree of complexity that can be controlled by the designer
    • The goal is to find the degree of complexity that is best suited to the data
    • A model that is too simple over generalises
    • A model that is too complex over fits
    • Both have an adverse effect on performance
  • 30. Gen-Spec Trade Off
    • Adding to the complexity of the model fits the training data better, but beyond a certain point this comes at the expense of higher test error
  • 31. Repeat or Finish
    • The result of the data mining will leave you with either a model that works or the need to improve it
    • More data may need to be collected
    • Different variables might be tried
    • The process can loop several times before a satisfactory answer is found
  • 32. Understanding and Using the Results
    • The resulting model has the ability to perform the task it was set, so can be embedded in an automated system
    • Some techniques produce models that are human readable and allow insights into the structure of the data
    • Some are almost impossible to extract knowledge from