Data Mining Methodology
          Kevin Swingler
       University of Stirling
   Lecturer, Computing Science
        kms@cs.stir.ac.uk
What is Data Mining?
• Generally, methods of using large quantities of data
  and appropriate algorithms to allow a computer to
  ‘learn’ to perform a task
• Task oriented:
   – Predict outcomes or forecast the future
   – Classify objects as belonging to one of several categories
   – Separate data into clusters of similar objects
• Most methods produce a model of the data that
  performs the task

Some Examples
• Predicting patterns of drug side-effects
• Spotting credit card or insurance fraud
• Controlling complex machinery
• Predicting the outcome of medical
  interventions
• Predicting the price of stocks and shares or
  exchange rates
• Knowing when a cow is most fertile (really!)
Examples in LIS
• Text Mining
  – Automatically determine what an article is ‘about’
  – Classify attitudes in social media
• Demand Prediction
  – Predicting demand for resources such as new books or
    journals or buildings
• Search and Recommend
  – Analysis of borrowing history to make recommendations
  – Links analysis for citation clustering


Data Sources
• In House – Data you own
  – Borrowing records
  – Search histories
  – Catalogue data
• Bought in
  – Demographic data about customers
  – Demographic data about the locality around a
    library

Methods
• Techniques for data mining are based on
  mathematics and statistics, but are
  implemented in easy-to-use software
  packages
• Where methodology is important is in pre-
  processing the data, choosing the techniques,
  and interpreting the results


CRISP-DM Standard
• CRoss-Industry Standard Process for Data
  Mining
• Six iterative phases: business understanding,
  data understanding, data preparation,
  modelling, evaluation and deployment




Data Preparation
• Clean the data
  – Remove rows with missing values
  – Remove rows with obvious data entry errors – e.g.
    Age = 200
  – Recode obvious data entry inconsistencies – e.g. If
    Gender = M or F, but occasionally Male
  – Remove rows with minority values
  – Select which variables to use in the model


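The cleaning steps above can be sketched in plain Python on a hypothetical set of records (the field names, the 0–120 age range and the recoding table are illustrative assumptions, not part of the slides):

```python
# Hypothetical records illustrating the cleaning steps above.
rows = [
    {"age": 34, "gender": "F"},
    {"age": None, "gender": "M"},    # missing value -> remove row
    {"age": 200, "gender": "F"},     # obvious entry error -> remove row
    {"age": 51, "gender": "Male"},   # inconsistent coding -> recode
]

recode = {"Male": "M", "Female": "F"}   # normalise occasional long forms

clean = []
for row in rows:
    if row["age"] is None:              # remove rows with missing values
        continue
    if not 0 <= row["age"] <= 120:      # remove obvious data entry errors
        continue
    row["gender"] = recode.get(row["gender"], row["gender"])
    clean.append(row)
# clean now holds two consistent rows
```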
Data Quantity
• Choose the variables to be used for the model
• Look at the distributions of the chosen values
• Look at the level of noise in the data
• Look at the degree of linearity in the data
• Decide whether or not there are sufficient
  examples in the data
• Treat unbalanced data


Consider Error Costs
• Imagine a system that classifies input patterns
  into one of several possible categories
• Sometimes it will get things wrong; how often
  depends on the problem:
  – Direct mail targeting – very often
  – Credit risk assessment – quite often
  – Medical reasoning – very infrequently



Error Costs
• An error in one direction can cost more than
  an error in the opposite direction
  – Recommending a blood test based on a false
    positive is better than missing an infection due to
    a false negative
  – Missing a case of insurance fraud is more costly
    than flagging a claim to be double checked
• The balance of examples in each case can be
  manipulated to reflect the cost

Check Points
• Data quantity and quality: do you have
  sufficient good data for the task?
  – How many variables are there?
  – How complex is the task?
  – Is the data’s distribution appropriate?
     • Outliers
     • Balance
     • Value set size


Distributions
• A frequency distribution is a count of how
  often each variable contains each value in a
  data set
• For discrete numbers and categorical values,
  this is simply a count of each value
• For continuous numbers, the count is of how
  many values fall into each of a set of sub-
  ranges

Plotting Distributions
• The easiest way to visualise a distribution is to
  plot it as a histogram




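Both kinds of count can be sketched with the standard library alone (the variable names and the bin width of 20 are illustrative assumptions):

```python
from collections import Counter

# Discrete / categorical values: a straight count of each value
genders = ["M", "F", "F", "M", "F"]
counts = Counter(genders)                      # F: 3, M: 2

# Continuous values: count how many fall into each sub-range (bin)
ages = [21, 34, 37, 45, 52, 58, 63]
bin_width = 20
bins = Counter((a // bin_width) * bin_width for a in ages)
# bins[20] counts ages in [20, 40), bins[40] ages in [40, 60), ...
```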
Features of a Distribution to Look For
•   Outliers
•   Minority values
•   Data Balance
•   Data entry errors




Outliers
• A small number of values that are much larger
  or much smaller than all the others
• Can disrupt the data mining process and give
  misleading results
• You should either remove them or, if they are
  important, collect more data to reflect this
  aspect of the world you are modelling
• Could be data entry errors

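One common rule of thumb (an assumption here, not prescribed by the slides) flags values more than two standard deviations from the mean:

```python
import statistics

values = [12, 14, 13, 15, 14, 13, 200]   # 200 is a suspected outlier

mean = statistics.mean(values)
sd = statistics.stdev(values)

# Flag values far from the mean; note the outlier itself inflates the
# standard deviation, so a robust (median-based) rule is often preferable
outliers = [v for v in values if abs(v - mean) > 2 * sd]
```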
Minority Values
• Values that only appear infrequently in the data
• Do they appear often enough to contribute to the
  model?
• Might be worth removing them from the data or
  collecting more data where they are represented
• Are they needed in the finished system?
• Could they be the result of data entry errors?



Minority Values
[Bar chart: counts of each value of a gender variable – 'Male' and 'Female'
dominate, while the minority spellings 'M' and 'F' appear only rarely]
What does this chart tell you about the gender variable in a data set?
What should you do before modelling or mining the data?

Flat and Wide Variables
• Variables where all the values are minority values
  have a flat, wide distribution – one or two of each
  possible value
• Such variables are of little use in data mining because
  the goal of DM is to find general patterns from
  specific data
• No such patterns can exist if each data point is
  completely different
• Such variables should be excluded from a model

Data Balance
• Imagine I want to predict whether or not a
  prospective customer will respond to a mailing
  campaign
• I collect the data, put it into a data mining
  algorithm, which learns and reports a success
  rate of 98%
• Sounds good, but when I put a new set of
  prospects through to see who to mail, what
  happens?

A Problem
• … the system predicts ‘No’ for every single
  prospect.
• With a response rate on a campaign of 2%,
  then the system is right 98% of the time if it
  always says ‘No’.
• So it never chooses anybody to target in the
  campaign


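The arithmetic behind this trap is easy to reproduce (hypothetical numbers matching the 2% response rate above):

```python
# 1000 prospects, a 2% response rate, and a model that always says "No"
actual = ["Yes"] * 20 + ["No"] * 980
predicted = ["No"] * 1000

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
# accuracy comes out at 0.98, yet the model never selects anyone to mail
```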
A Solution
• One data pre-processing solution is to balance the number of
  examples of each target class in the output variable
• In our previous example: 50% customers and 50% non-
  customers
• That way, any gain in accuracy over 50% would certainly be
  due to patterns in the data, not the prior distribution
• This is not always easy to achieve – you might need to throw
  away a lot of data to balance the examples, or build several
  models on balanced subsets
• Not always necessary – if an event is rare because its cause is
  rare, then the problem won’t arise


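A minimal sketch of balancing by downsampling the majority class (the labels and class sizes are illustrative; `random.sample` draws without replacement):

```python
import random

random.seed(0)   # reproducible sketch

responders = [("prospect", "Yes")] * 20       # rare class
non_responders = [("prospect", "No")] * 980   # common class

# Downsample the majority class to match the minority class
balanced = responders + random.sample(non_responders, len(responders))
random.shuffle(balanced)
# any accuracy above 50% on this set must come from patterns in the data
```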
Data Quantity
• How much data do you need?
• How long is a piece of string?
• Data must be sufficient to:
  – Represent the dynamics of the system to be
    modelled
  – Cover all situations likely to be encountered when
    predictions are needed
  – Compensate for any noise in the data

Model Building
• Choose a number of techniques suitable to
  the task:
  – Neural network for prediction or classification
  – Decision tree for classification
  – Rule induction for classification
  – Bayesian network for classification
  – K-Means for clustering



Train Models
• For each technique:
  – Run a series of experiments with different
    parameters
  – Each experiment should use around 70% of the
    data for training and the rest for testing
  – When a good solution is found, use cross
    validation (10 fold is a good choice) to verify the
    result


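The 70/30 split described above can be sketched as a shuffle and a slice (index data stands in for real labelled examples):

```python
import random

random.seed(42)                   # make the split repeatable
data = list(range(100))           # stand-in for 100 labelled examples
random.shuffle(data)

split = int(0.7 * len(data))      # around 70% for training
train, test = data[:split], data[split:]
```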
Cross Validation
• Split the data into ten subsets, then train 10
  models – each one using 9 of the 10 subsets
  as training data and the 10th as test. The score
  is the average of all 10.
• This is a more accurate representation of how
  well the data may be modelled, as it reduces
  the risk of getting a lucky test set


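A pure-Python sketch of the index bookkeeping for 10-fold cross validation (the model-fitting and scoring steps themselves are omitted):

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; each example is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Each of the 10 models would be trained on `train` and scored on `test`;
# the final score is the average of the 10 test scores.
```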
Assess Models
• You can measure the success of your model in a
  number of ways
   – Mean Squared error – not always meaningful
   – Percentage correct for classification
   – Confusion matrix for classification

                 Actual True   Actual False
   Output True        80            30
   Output False       20            90

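Both measures can be computed directly from paired actual/predicted labels (the labels shown are hypothetical):

```python
from collections import Counter

actual    = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

# Percentage correct
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Confusion matrix keyed by (actual, predicted);
# (True, False) counts false negatives, (False, True) false positives
confusion = Counter(zip(actual, predicted))
```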
Probability Outputs
• Most classification techniques provide a score
  with the classification – either a probability or
  some other measure
• This can be used to:
  – Allow an answer of "unsure" for cases where no
    single class has a high enough probability
  – Weight outputs to allow for the unequal cost of
    outcomes
  – Build lift charts and ROC curves

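The "unsure" option can be a simple threshold on the winning class's score (the 0.7 cut-off and class names are illustrative assumptions):

```python
def classify(probabilities, threshold=0.7):
    """Return the most probable class, or 'unsure' below the threshold."""
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= threshold else "unsure"

confident = classify({"spam": 0.9, "ham": 0.1})     # "spam"
borderline = classify({"spam": 0.55, "ham": 0.45})  # "unsure"
```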
Generalisation and Over Fitting
• Most data mining models have a degree of
  complexity that can be controlled by the
  designer
• The goal is to find the degree of complexity
  that is best suited to the data
• A model that is too simple over-generalises
• A model that is too complex overfits
• Both have an adverse effect on performance
Gen-Spec Trade Off
• Adding to the complexity of the model fits the
  training data better at the expense of higher
  test error




Repeat or Finish
• The result of the data mining will leave you
  with either a model that works or the need to
  improve
• More data may need to be collected
• Different variables might be tried
• The process can loop several times before a
  satisfactory answer is found


Understanding and Using the Results
• The resulting model can perform the task it was
  set, so it can be embedded in an automated
  system
• Some techniques produce models that are
  human readable and allow insights into the
  structure of the data
• Some are almost impossible to extract
  knowledge from


 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 

Kevin Swingler: Introduction to Data Mining

1. Data Mining Methodology
Kevin Swingler, University of Stirling
Lecturer, Computing Science
kms@cs.stir.ac.uk

2. What is Data Mining?
• Generally, methods of using large quantities of data and appropriate algorithms to allow a computer to ‘learn’ to perform a task
• Task oriented:
  – Predict outcomes or forecast the future
  – Classify objects as belonging to one of several categories
  – Separate data into clusters of similar objects
• Most methods produce a model of the data that performs the task

3. Some Examples
• Predicting patterns of drug side-effects
• Spotting credit card or insurance fraud
• Controlling complex machinery
• Predicting the outcome of medical interventions
• Predicting the price of stocks and shares, or exchange rates
• Knowing when a cow is most fertile (really!)

4. Examples in LIS
• Text Mining
  – Automatically determine what an article is ‘about’
  – Classify attitudes in social media
• Demand Prediction
  – Predicting demand for resources such as new books, journals or buildings
• Search and Recommend
  – Analysis of borrowing history to make recommendations
  – Link analysis for citation clustering

5. Data Sources
• In house – data you own
  – Borrowing records
  – Search histories
  – Catalogue data
• Bought in
  – Demographic data about customers
  – Demographic data about the locality around a library

6. Methods
• Techniques for data mining are based on mathematics and statistics, but are implemented in easy-to-use software packages
• Where methodology matters is in pre-processing the data, choosing the techniques, and interpreting the results

7. CRISP-DM Standard
• CRoss Industry Standard Process for Data Mining
8. Data Preparation
• Clean the data
  – Remove rows with missing values
  – Remove rows with obvious data entry errors – e.g. Age = 200
  – Recode obvious data entry inconsistencies – e.g. if Gender = M or F, but occasionally Male
  – Remove rows with minority values
• Select which variables to use in the model
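The cleaning steps above can be sketched in plain Python; the record layout and field names here are invented for illustration, not taken from the slides.

```python
# A minimal sketch of the cleaning steps on the Data Preparation slide,
# applied to a small invented record set.

def clean(rows):
    cleaned = []
    for row in rows:
        # Remove rows with missing values
        if any(v is None or v == "" for v in row.values()):
            continue
        # Remove rows with obvious data entry errors, e.g. Age = 200
        if not 0 < row["age"] < 120:
            continue
        # Recode obvious inconsistencies, e.g. "Male" where "M" is expected
        recode = {"Male": "M", "Female": "F"}
        row = dict(row, gender=recode.get(row["gender"], row["gender"]))
        cleaned.append(row)
    return cleaned

records = [
    {"age": 34, "gender": "M"},
    {"age": 200, "gender": "F"},    # obvious entry error
    {"age": 28, "gender": "Male"},  # inconsistent coding
    {"age": None, "gender": "F"},   # missing value
]
print(clean(records))  # keeps only the two valid rows, with gender recoded
```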
9. Data Quantity
• Choose the variables to be used for the model
• Look at the distributions of the chosen values
• Look at the level of noise in the data
• Look at the degree of linearity in the data
• Decide whether or not there are sufficient examples in the data
• Treat unbalanced data

10. Consider Error Costs
• Imagine a system that classifies input patterns into one of several possible categories
• Sometimes it will get things wrong; how often that is acceptable depends on the problem:
  – Direct mail targeting – very often
  – Credit risk assessment – quite often
  – Medical reasoning – very infrequently

11. Error Costs
• An error in one direction can cost more than an error in the opposite direction
  – Recommending a blood test based on a false positive is better than missing an infection due to a false negative
  – Missing a case of insurance fraud is more costly than flagging a claim to be double-checked
• The balance of examples in each case can be manipulated to reflect the cost
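One way to act on unequal error costs is to compare expected costs before deciding. The slides do not prescribe a method, so this is a sketch under assumed cost figures for the insurance-fraud example.

```python
# A sketch of cost-sensitive decision making, assuming the classifier
# outputs a fraud probability. The cost figures are invented.

def decide(p_fraud, cost_false_negative=500.0, cost_false_positive=10.0):
    """Flag a claim when the expected cost of missing fraud exceeds
    the expected cost of a needless double-check."""
    expected_cost_if_ignored = p_fraud * cost_false_negative
    expected_cost_if_flagged = (1 - p_fraud) * cost_false_positive
    return "flag" if expected_cost_if_ignored > expected_cost_if_flagged else "ignore"

print(decide(0.05))   # even a 5% fraud probability is worth a check
print(decide(0.001))  # a very unlikely fraud is left alone
```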
12. Check Points
• Data quantity and quality: do you have sufficient good data for the task?
  – How many variables are there?
  – How complex is the task?
  – Is the data’s distribution appropriate?
    • Outliers
    • Balance
    • Value set size

13. Distributions
• A frequency distribution is a count of how often each variable contains each value in a data set
• For discrete numbers and categorical values, this is simply a count of each value
• For continuous numbers, the count is of how many values fall into each of a set of sub-ranges

14. Plotting Distributions
• The easiest way to visualise a distribution is to plot it as a histogram
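Both kinds of frequency distribution described above can be computed with the standard library; the data values here are invented.

```python
# Frequency distributions as described on the Distributions slide.
from collections import Counter

# Categorical values: simply count each value
genders = ["M", "F", "F", "M", "F"]
print(Counter(genders))

# Continuous numbers: count how many values fall into each sub-range (bin)
ages = [21, 34, 37, 45, 52, 58, 63]
bin_width = 20
bins = Counter((age // bin_width) * bin_width for age in ages)
print(dict(sorted(bins.items())))  # {20: 3, 40: 3, 60: 1}
```

These counts are exactly what a histogram plots: one bar per value (or per bin), with height equal to the count.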
15. Features of a Distribution to Look For
• Outliers
• Minority values
• Data balance
• Data entry errors

16. Outliers
• A small number of values that are much larger or much smaller than all the others
• Can disrupt the data mining process and give misleading results
• You should either remove them or, if they are important, collect more data to reflect this aspect of the world you are modelling
• Could be data entry errors
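A common way to flag such values is the interquartile-range rule; the 1.5 × IQR fence below is a widely used convention, not something the slides prescribe.

```python
# A sketch of flagging outliers with the interquartile-range rule.
from statistics import quantiles

values = [12, 14, 13, 15, 14, 13, 12, 98]  # 98 could be an entry error
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if not low <= v <= high]
print(outliers)  # [98]
```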
17. Minority Values
• Values that only appear infrequently in the data
• Do they appear often enough to contribute to the model?
• It might be worth removing them from the data, or collecting more data in which they are represented
• Are they needed in the finished system?
• Could they be the result of data entry errors?

18. Minority Values
[Bar chart of counts for a gender variable, axis 0–600: tall bars for ‘Male’ and ‘Female’, tiny bars for ‘M’ and ‘F’.]
• What does this chart tell you about the gender variable in a data set?
• What should you do before modelling or mining the data?

19. Flat and Wide Variables
• Variables where all the values are minority values have a flat, wide distribution – one or two of each possible value
• Such variables are of little use in data mining, because the goal of DM is to find general patterns from specific data
• No such patterns can exist if each data point is completely different
• Such variables should be excluded from a model

20. Data Balance
• Imagine I want to predict whether or not a prospective customer will respond to a mailing campaign
• I collect the data and put it into a data mining algorithm, which learns and reports a success rate of 98%
• Sounds good, but when I put a new set of prospects through to see who to mail, what happens?

21. A Problem
• … the system predicts ‘No’ for every single prospect
• With a response rate of 2% on a campaign, the system is right 98% of the time if it always says ‘No’
• So it never chooses anybody to target in the campaign
22. A Solution
• One data pre-processing solution is to balance the number of examples of each target class in the output variable
• In our previous example: 50% customers and 50% non-customers
• That way, any gain in accuracy over 50% would certainly be due to patterns in the data, not the prior distribution
• This is not always easy to achieve – you might need to throw away a lot of data to balance the examples, or build several models on balanced subsets
• It is not always necessary – if an event is rare because its cause is rare, then the problem won’t arise
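Balancing by throwing away majority-class examples (undersampling) can be sketched as follows; the 2% response rate matches the mailing example, and the field name is invented.

```python
# A sketch of balancing classes by undersampling the majority class.
import random

random.seed(0)
data = [{"responded": "Yes"}] * 20 + [{"responded": "No"}] * 980  # 2% response

yes = [r for r in data if r["responded"] == "Yes"]
no = [r for r in data if r["responded"] == "No"]

# Discard majority examples at random until the classes are 50/50
balanced = yes + random.sample(no, len(yes))
print(len(balanced))  # 40 rows: half 'Yes', half 'No'
```

Note how much data is discarded (960 of 1000 rows) — the slide's point about needing several models on balanced subsets follows directly from this.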
23. Data Quantity
• How much data do you need? How long is a piece of string?
• Data must be sufficient to:
  – Represent the dynamics of the system to be modelled
  – Cover all situations likely to be encountered when predictions are needed
  – Compensate for any noise in the data

24. Model Building
• Choose a number of techniques suited to the task:
  – Neural network for prediction or classification
  – Decision tree for classification
  – Rule induction for classification
  – Bayesian network for classification
  – K-Means for clustering
25. Train Models
• For each technique:
  – Run a series of experiments with different parameters
  – Each experiment should use around 70% of the data for training and the rest for testing
  – When a good solution is found, use cross-validation (10-fold is a good choice) to verify the result
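The 70/30 split mentioned above can be sketched in a few lines; the data here is an invented stand-in for real rows.

```python
# A minimal sketch of a random 70/30 train/test split.
import random

random.seed(0)
data = list(range(100))  # stand-in for 100 data rows
random.shuffle(data)     # shuffle so the split is random, not ordered

cut = int(len(data) * 0.7)
train, test = data[:cut], data[cut:]
print(len(train), len(test))  # 70 30
```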
26. Cross-Validation
• Split the data into ten subsets, then train ten models – each one using nine of the ten subsets as training data and the tenth as test data. The score is the average of all ten.
• This is a more accurate picture of how well the data can be modelled, as it reduces the risk of getting a lucky test set
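Constructing the ten folds described above can be sketched as follows; the evaluation step is left out, since the slides do not fix a particular model.

```python
# A sketch of building the train/test pairs for k-fold cross-validation.

def k_fold_splits(data, k=10):
    """Yield (train, test) pairs, each test set being one of k folds."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(50))
splits = list(k_fold_splits(data))
print(len(splits))                           # 10 models to train
print(len(splits[0][0]), len(splits[0][1]))  # 45 train, 5 test
```

The final score is the average test score over the ten (train, test) pairs.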
27. Assess Models
• You can measure the success of your model in a number of ways:
  – Mean squared error – not always meaningful
  – Percentage correct, for classification
  – A confusion matrix, for classification:

                    Actual True   Actual False
      Output True        80            30
      Output False       20            90
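Reading the percentage correct off the confusion matrix above is simple arithmetic:

```python
# Accuracy from the confusion matrix on the Assess Models slide:
# 80 true positives, 30 false positives, 20 false negatives, 90 true negatives.
tp, fp = 80, 30
fn, tn = 20, 90

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
print(f"{accuracy:.2%}")  # 77.27%
```

The matrix also shows *where* the errors fall — here, false positives (30) outnumber false negatives (20) — which a single accuracy figure hides.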
28. Probability Outputs
• Most classification techniques provide a score with the classification – either a probability or some other measure
• This can be used to:
  – Allow an answer of “unsure” for cases where no single class has a high enough probability
  – Weight outputs to allow for the unequal cost of outcomes
  – Build lift charts and ROC curves
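The “unsure” answer described above amounts to thresholding the winning class's probability; the class names and the 0.8 threshold below are invented for illustration.

```python
# A sketch of turning class probabilities into a class or an 'unsure' answer.

def classify(class_probs, threshold=0.8):
    """Return the most probable class, or 'unsure' if its probability
    does not reach the threshold."""
    best_class = max(class_probs, key=class_probs.get)
    if class_probs[best_class] < threshold:
        return "unsure"
    return best_class

print(classify({"fraud": 0.95, "genuine": 0.05}))  # fraud
print(classify({"fraud": 0.55, "genuine": 0.45}))  # unsure
```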
29. Generalisation and Over-Fitting
• Most data mining models have a degree of complexity that can be controlled by the designer
• The goal is to find the degree of complexity best suited to the data
• A model that is too simple over-generalises; a model that is too complex over-fits
• Both have an adverse effect on performance

30. Gen–Spec Trade-Off
• Adding to the complexity of the model fits the training data better, at the expense of higher test error

31. Repeat or Finish
• The result of the data mining will leave you with either a model that works or the need to improve it
• More data may need to be collected
• Different variables might be tried
• The process can loop several times before a satisfactory answer is found

32. Understanding and Using the Results
• The resulting model can perform the task it was set, so it can be embedded in an automated system
• Some techniques produce models that are human-readable and give insights into the structure of the data
• Others are almost impossible to extract knowledge from