SlideShare a Scribd company logo
1 of 18
Download to read offline
DATA MINING PROCESS
LECTURE 02
DR. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES
1
DATA MINING PROCESS
• Phases in a typical Data Mining effort:
1. Discovery
Frame business problem
Identify analytics component
Formulate initial hypotheses
2. Data Preparation
Obtain dataset form internal and external sources
Data consistency checks in terms of definitions of fields, units of measurement, time
periods etc.,
Sample
2
DATA MINING PROCESS
• Phases in a typical Data Mining effort:
3. Data Exploration and Conditioning
Missing data handling, Range reasonability, Outliers,
Graphical or Visual Analysis
Transformation, Creation of new variables, and Normalization
Partitioning into Training, Validation, and Test datasets
3
DATA MINING PROCESS
• Phases in a typical Data Mining effort:
4. Model Planning
Determine data mining task such as prediction, classification etc.
Select appropriate data mining methods and techniques such as regression, neural
networks, clustering etc.
5. Model Building
Building different candidate models using selected techniques and their variants using
training data
Refine and select the final model using validation data
Evaluate the final model on test data
4
DATA MINING PROCESS
• Phases in a typical Data Mining effort:
6. Results Interpretation
Model evaluation using key performance metrics
7. Model Deployment
Pilot project to integrate and run the model on operational systems
• Similar data mining methodologies developed by SAS and
IBM Modeler (SPSS Clementine) are called SEMAA and CRISP-
DM respectively
5
DATA MINING PROCESS
• Data mining techniques can be divided into Supervised
Learning Methods and Unsupervised Learning Methods
• Supervised Learning
– In supervised learning, algorithms are used to learn the function ‘f’
that can map input variables (X) into output variables (Y)
Y = f(X)
– Idea is to approximate ‘f’ such that new data on input variables (X) can
predict the output variables (Y) with minimum possible error (Ɛ)
6
DATA MINING PROCESS
• Supervised Learning problems can be grouped into prediction
and classification problems
• Unsupervised Learning
– In Unsupervised Learning, algorithms are used to learn the underlying
structure or patterns hidden in the data
• Unsupervised Learning problems can be grouped into
clustering and association rule learning problems
7
DATA MINING PROCESS
• Target Population
– Subset of the population under study
– Results are generalized to the target population
• Sample
– Subset of the target population
• Simple Random Sampling
– A sampling method wherein each observation has an equal chance of
being selected
8
DATA MINING PROCESS
• Random Sampling
– A sampling method wherein each observation does not necessarily
have an equal chance of being selected
• Sampling with Replacement
– Sample values are independent
• Sampling without Replacement
– Sample values aren’t independent
9
DATA MINING PROCESS
• Sampling results in less no. of observations than the no. of
total observations in the dataset
• Data Mining algorithms
– Varying limitations on number of observations and variables
• Limitations due to computing power and storage capacity
• Limitations due to statistical software being used
• How many observations to build accurate models?
10
DATA MINING PROCESS
• Rare Event, e.g., low response rate in advertising by
traditional mail or email
– Oversampling of ‘success’ cases
– Arises mainly in classification tasks
– Costs of misclassification
• Asymmetric costs due to more importance of ‘success’ class
– Costs of failing to identify ‘success’ cases are generally more than costs
of detailed review of all cases
– Prediction of ‘success’ cases is likely to come at cost of misclassifying
more ‘failure’ cases as ‘success’ cases than usual
11
DATA MINING PROCESS
• Dummy coding for categorical variables
– Some statistical software cannot use categorical variables expressed in
the label format
– Dummy binary variables (having 0’s and 1’s: 0 indicating ‘absence’ and
1 indicating ‘presence’) for different classes of categorical variables are
created
– For example, if ‘activity status’ of individuals can be put into four
mutually exclusive and jointly exhaustive classes as {student,
unemployed, employed, retired}, only three dummy variables would
be required
12
DATA MINING PROCESS
• Principle of Parsimony
– A model or theory with less no. of assumptions and variables but with
high explanatory power is generally desirable
• More no. of variables also increase the sample size
requirements due to reliability of estimate
• Overfitting
– A model built using a complex function that fits the data perfectly
– Model ends up fitting the noise and explaining the chance variation
13
DATA MINING PROCESS
• Overfitting
– More no. of iterations resulting in excessive learning of the data
– More no. of variables in the model may lead to fitting spurious
relationships
• Sample Size
– Domain Knowledge
– General rule of thumb: 10 × p observations, where p is the no. of
predictors
– For classification tasks: 6 × m × p observations, where m is the no. of
classes in the outcome variable (Delmaster & Hancock, 2001)
14
DATA MINING PROCESS
• Outliers
– A distant data point
– Valid point or erroneous value?
– Further review
• Manual Inspection (Sorting, minimum and maximum values, clustering etc.)
• Domain Knowledge
• Missing Values
– Few records with missing values can be removed
– Imputation
15
DATA MINING PROCESS
• Missing Values
– Drop the variables having missing values
– Replace with proxy variable
• Normalization
– Standardization using z-score
– Min-max normalization
16
Key References
• Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data by EMC Education Services
(2015)
• Data Mining for Business Intelligence: Concepts, Techniques,
and Applications in Microsoft Office Excel with XLMiner by
Shmueli, G., Patel, N. R., & Bruce, P. C. (2010)
17
Thanks…
18

More Related Content

Similar to Lecture 2 Data mining process.pdf

Qualitative and quantitative analysis
Qualitative and quantitative analysisQualitative and quantitative analysis
Qualitative and quantitative analysisNellie Deutsch (Ed.D)
 
Data warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniquesData warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniquesVaibhav Khanna
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onwordSulman Ahmed
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfSaketBansal9
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introductionDr-Dipali Meher
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.pptSK Chew
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introductionBasma Gamal
 
Mba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation aMba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation aRai University
 
Data Gathering and Statistical Treatment
Data Gathering and Statistical TreatmentData Gathering and Statistical Treatment
Data Gathering and Statistical TreatmentMarielleARopeta
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxshumPanwar
 
Mind Map Test Data Management Overview
Mind Map Test Data Management OverviewMind Map Test Data Management Overview
Mind Map Test Data Management Overviewdublinx
 

Similar to Lecture 2 Data mining process.pdf (20)

Qualitative and quantitative analysis
Qualitative and quantitative analysisQualitative and quantitative analysis
Qualitative and quantitative analysis
 
Data warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniquesData warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniques
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
Data Analysis, Intepretation
Data Analysis, IntepretationData Analysis, Intepretation
Data Analysis, Intepretation
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data analysis
Data analysisData analysis
Data analysis
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introduction
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.ppt
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.ppt
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
 
Mba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation aMba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation a
 
ml-02x01.pdf
ml-02x01.pdfml-02x01.pdf
ml-02x01.pdf
 
Data Gathering and Statistical Treatment
Data Gathering and Statistical TreatmentData Gathering and Statistical Treatment
Data Gathering and Statistical Treatment
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptx
 
Lesson1.2.pptx.pdf
Lesson1.2.pptx.pdfLesson1.2.pptx.pdf
Lesson1.2.pptx.pdf
 
Mind Map Test Data Management Overview
Mind Map Test Data Management OverviewMind Map Test Data Management Overview
Mind Map Test Data Management Overview
 

More from Kaushik Kundu

Investment portfolio analysis and behavioural bises
Investment portfolio analysis and behavioural bisesInvestment portfolio analysis and behavioural bises
Investment portfolio analysis and behavioural bisesKaushik Kundu
 
Team Dynamics Transactional Analysis.ppt
Team Dynamics Transactional Analysis.pptTeam Dynamics Transactional Analysis.ppt
Team Dynamics Transactional Analysis.pptKaushik Kundu
 
Labour Policy in Indian Five Years Plan.pptx
Labour Policy in Indian Five Years Plan.pptxLabour Policy in Indian Five Years Plan.pptx
Labour Policy in Indian Five Years Plan.pptxKaushik Kundu
 
lecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
lecture-6 WAGE DETERMINATION-lecture-5CU2010.pptlecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
lecture-6 WAGE DETERMINATION-lecture-5CU2010.pptKaushik Kundu
 
Theory of queuing systems
Theory of queuing systemsTheory of queuing systems
Theory of queuing systemsKaushik Kundu
 
Concept of organizational culture
Concept of organizational cultureConcept of organizational culture
Concept of organizational cultureKaushik Kundu
 

More from Kaushik Kundu (9)

Investment portfolio analysis and behavioural bises
Investment portfolio analysis and behavioural bisesInvestment portfolio analysis and behavioural bises
Investment portfolio analysis and behavioural bises
 
Team Dynamics Transactional Analysis.ppt
Team Dynamics Transactional Analysis.pptTeam Dynamics Transactional Analysis.ppt
Team Dynamics Transactional Analysis.ppt
 
Labour Policy in Indian Five Years Plan.pptx
Labour Policy in Indian Five Years Plan.pptxLabour Policy in Indian Five Years Plan.pptx
Labour Policy in Indian Five Years Plan.pptx
 
Coaching.pptx
Coaching.pptxCoaching.pptx
Coaching.pptx
 
lecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
lecture-6 WAGE DETERMINATION-lecture-5CU2010.pptlecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
lecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
 
Duopoly
DuopolyDuopoly
Duopoly
 
Q research
Q researchQ research
Q research
 
Theory of queuing systems
Theory of queuing systemsTheory of queuing systems
Theory of queuing systems
 
Concept of organizational culture
Concept of organizational cultureConcept of organizational culture
Concept of organizational culture
 

Recently uploaded

Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 

Recently uploaded (20)

Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 

Lecture 2 Data mining process.pdf

  • 1. DATA MINING PROCESS LECTURE 02 DR. GAURAV DIXIT DEPARTMENT OF MANAGEMENT STUDIES 1
  • 2. DATA MINING PROCESS • Phases in a typical Data Mining effort: 1. Discovery Frame business problem Identify analytics component Formulate initial hypotheses 2. Data Preparation Obtain dataset form internal and external sources Data consistency checks in terms of definitions of fields, units of measurement, time periods etc., Sample 2
  • 3. DATA MINING PROCESS • Phases in a typical Data Mining effort: 3. Data Exploration and Conditioning Missing data handling, Range reasonability, Outliers, Graphical or Visual Analysis Transformation, Creation of new variables, and Normalization Partitioning into Training, Validation, and Test datasets 3
  • 4. DATA MINING PROCESS • Phases in a typical Data Mining effort: 4. Model Planning Determine data mining task such as prediction, classification etc. Select appropriate data mining methods and techniques such as regression, neural networks, clustering etc. 5. Model Building Building different candidate models using selected techniques and their variants using training data Refine and select the final model using validation data Evaluate the final model on test data 4
  • 5. DATA MINING PROCESS • Phases in a typical Data Mining effort: 6. Results Interpretation Model evaluation using key performance metrics 7. Model Deployment Pilot project to integrate and run the model on operational systems • Similar data mining methodologies developed by SAS and IBM Modeler (SPSS Clementine) are called SEMAA and CRISP- DM respectively 5
  • 6. DATA MINING PROCESS • Data mining techniques can be divided into Supervised Learning Methods and Unsupervised Learning Methods • Supervised Learning – In supervised learning, algorithms are used to learn the function ‘f’ that can map input variables (X) into output variables (Y) Y = f(X) – Idea is to approximate ‘f’ such that new data on input variables (X) can predict the output variables (Y) with minimum possible error (Ɛ) 6
  • 7. DATA MINING PROCESS • Supervised Learning problems can be grouped into prediction and classification problems • Unsupervised Learning – In Unsupervised Learning, algorithms are used to learn the underlying structure or patterns hidden in the data • Unsupervised Learning problems can be grouped into clustering and association rule learning problems 7
  • 8. DATA MINING PROCESS • Target Population – Subset of the population under study – Results are generalized to the target population • Sample – Subset of the target population • Simple Random Sampling – A sampling method wherein each observation has an equal chance of being selected 8
  • 9. DATA MINING PROCESS • Random Sampling – A sampling method wherein each observation does not necessarily have an equal chance of being selected • Sampling with Replacement – Sample values are independent • Sampling without Replacement – Sample values aren’t independent 9
  • 10. DATA MINING PROCESS • Sampling results in less no. of observations than the no. of total observations in the dataset • Data Mining algorithms – Varying limitations on number of observations and variables • Limitations due to computing power and storage capacity • Limitations due to statistical software being used • How many observations to build accurate models? 10
  • 11. DATA MINING PROCESS • Rare Event, e.g., low response rate in advertising by traditional mail or email – Oversampling of ‘success’ cases – Arises mainly in classification tasks – Costs of misclassification • Asymmetric costs due to more importance of ‘success’ class – Costs of failing to identify ‘success’ cases are generally more than costs of detailed review of all cases – Prediction of ‘success’ cases is likely to come at cost of misclassifying more ‘failure’ cases as ‘success’ cases than usual 11
  • 12. DATA MINING PROCESS • Dummy coding for categorical variables – Some statistical software cannot use categorical variables expressed in the label format – Dummy binary variables (having 0’s and 1’s: 0 indicating ‘absence’ and 1 indicating ‘presence’) for different classes of categorical variables are created – For example, if ‘activity status’ of individuals can be put into four mutually exclusive and jointly exhaustive classes as {student, unemployed, employed, retired}, only three dummy variables would be required 12
  • 13. DATA MINING PROCESS • Principle of Parsimony – A model or theory with less no. of assumptions and variables but with high explanatory power is generally desirable • More no. of variables also increase the sample size requirements due to reliability of estimate • Overfitting – A model built using a complex function that fits the data perfectly – Model ends up fitting the noise and explaining the chance variation 13
  • 14. DATA MINING PROCESS • Overfitting – More no. of iterations resulting in excessive learning of the data – More no. of variables in the model may lead to fitting spurious relationships • Sample Size – Domain Knowledge – General rule of thumb: 10 × p observations, where p is the no. of predictors – For classification tasks: 6 × m × p observations, where m is the no. of classes in the outcome variable (Delmaster & Hancock, 2001) 14
  • 15. DATA MINING PROCESS • Outliers – A distant data point – Valid point or erroneous value? – Further review • Manual Inspection (Sorting, minimum and maximum values, clustering etc.) • Domain Knowledge • Missing Values – Few records with missing values can be removed – Imputation 15
  • 16. DATA MINING PROCESS • Missing Values – Drop the variables having missing values – Replace with proxy variable • Normalization – Standardization using z-score – Min-max normalization 16
  • 17. Key References • Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services (2015) • Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner by Shmueli, G., Patel, N. R., & Bruce, P. C. (2010) 17