SlideShare a Scribd company logo
DATA MINING PROCESS
LECTURE 02
DR. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES
1
DATA MINING PROCESS
• Phases in a typical Data Mining effort:
1. Discovery
Frame business problem
Identify analytics component
Formulate initial hypotheses
2. Data Preparation
Obtain dataset form internal and external sources
Data consistency checks in terms of definitions of fields, units of measurement, time
periods etc.,
Sample
2
DATA MINING PROCESS
• Phases in a typical Data Mining effort:
3. Data Exploration and Conditioning
Missing data handling, Range reasonability, Outliers,
Graphical or Visual Analysis
Transformation, Creation of new variables, and Normalization
Partitioning into Training, Validation, and Test datasets
3
DATA MINING PROCESS
• Phases in a typical Data Mining effort:
4. Model Planning
Determine data mining task such as prediction, classification etc.
Select appropriate data mining methods and techniques such as regression, neural
networks, clustering etc.
5. Model Building
Building different candidate models using selected techniques and their variants using
training data
Refine and select the final model using validation data
Evaluate the final model on test data
4
DATA MINING PROCESS
• Phases in a typical Data Mining effort:
6. Results Interpretation
Model evaluation using key performance metrics
7. Model Deployment
Pilot project to integrate and run the model on operational systems
• Similar data mining methodologies developed by SAS and
IBM Modeler (SPSS Clementine) are called SEMAA and CRISP-
DM respectively
5
DATA MINING PROCESS
• Data mining techniques can be divided into Supervised
Learning Methods and Unsupervised Learning Methods
• Supervised Learning
– In supervised learning, algorithms are used to learn the function ‘f’
that can map input variables (X) into output variables (Y)
Y = f(X)
– Idea is to approximate ‘f’ such that new data on input variables (X) can
predict the output variables (Y) with minimum possible error (Ɛ)
6
DATA MINING PROCESS
• Supervised Learning problems can be grouped into prediction
and classification problems
• Unsupervised Learning
– In Unsupervised Learning, algorithms are used to learn the underlying
structure or patterns hidden in the data
• Unsupervised Learning problems can be grouped into
clustering and association rule learning problems
7
DATA MINING PROCESS
• Target Population
– Subset of the population under study
– Results are generalized to the target population
• Sample
– Subset of the target population
• Simple Random Sampling
– A sampling method wherein each observation has an equal chance of
being selected
8
DATA MINING PROCESS
• Random Sampling
– A sampling method wherein each observation does not necessarily
have an equal chance of being selected
• Sampling with Replacement
– Sample values are independent
• Sampling without Replacement
– Sample values aren’t independent
9
DATA MINING PROCESS
• Sampling results in less no. of observations than the no. of
total observations in the dataset
• Data Mining algorithms
– Varying limitations on number of observations and variables
• Limitations due to computing power and storage capacity
• Limitations due to statistical software being used
• How many observations to build accurate models?
10
DATA MINING PROCESS
• Rare Event, e.g., low response rate in advertising by
traditional mail or email
– Oversampling of ‘success’ cases
– Arises mainly in classification tasks
– Costs of misclassification
• Asymmetric costs due to more importance of ‘success’ class
– Costs of failing to identify ‘success’ cases are generally more than costs
of detailed review of all cases
– Prediction of ‘success’ cases is likely to come at cost of misclassifying
more ‘failure’ cases as ‘success’ cases than usual
11
DATA MINING PROCESS
• Dummy coding for categorical variables
– Some statistical software cannot use categorical variables expressed in
the label format
– Dummy binary variables (having 0’s and 1’s: 0 indicating ‘absence’ and
1 indicating ‘presence’) for different classes of categorical variables are
created
– For example, if ‘activity status’ of individuals can be put into four
mutually exclusive and jointly exhaustive classes as {student,
unemployed, employed, retired}, only three dummy variables would
be required
12
DATA MINING PROCESS
• Principle of Parsimony
– A model or theory with less no. of assumptions and variables but with
high explanatory power is generally desirable
• More no. of variables also increase the sample size
requirements due to reliability of estimate
• Overfitting
– A model built using a complex function that fits the data perfectly
– Model ends up fitting the noise and explaining the chance variation
13
DATA MINING PROCESS
• Overfitting
– More no. of iterations resulting in excessive learning of the data
– More no. of variables in the model may lead to fitting spurious
relationships
• Sample Size
– Domain Knowledge
– General rule of thumb: 10 × p observations, where p is the no. of
predictors
– For classification tasks: 6 × m × p observations, where m is the no. of
classes in the outcome variable (Delmaster & Hancock, 2001)
14
DATA MINING PROCESS
• Outliers
– A distant data point
– Valid point or erroneous value?
– Further review
• Manual Inspection (Sorting, minimum and maximum values, clustering etc.)
• Domain Knowledge
• Missing Values
– Few records with missing values can be removed
– Imputation
15
DATA MINING PROCESS
• Missing Values
– Drop the variables having missing values
– Replace with proxy variable
• Normalization
– Standardization using z-score
– Min-max normalization
16
Key References
• Data Science and Big Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data by EMC Education Services
(2015)
• Data Mining for Business Intelligence: Concepts, Techniques,
and Applications in Microsoft Office Excel with XLMiner by
Shmueli, G., Patel, N. R., & Bruce, P. C. (2010)
17
Thanks…
18

More Related Content

Similar to Lecture 2 Data mining process.pdf

Qualitative and quantitative analysis
Qualitative and quantitative analysisQualitative and quantitative analysis
Qualitative and quantitative analysis
Nellie Deutsch (Ed.D)
 
Data warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniquesData warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniques
Vaibhav Khanna
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Niko Vuokko
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
Sulman Ahmed
 
Data Analysis, Intepretation
Data Analysis, IntepretationData Analysis, Intepretation
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
Izwan Nizal Mohd Shaharanee
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
SaketBansal9
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
Izwan Nizal Mohd Shaharanee
 
Data analysis
Data analysisData analysis
Data analysis
Titus Mutambu Mweta
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
Valerii Klymchuk
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introduction
Dr-Dipali Meher
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.ppt
SK Chew
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.ppt
ChiragJoshi59934
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
Basma Gamal
 
Mba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation aMba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation a
Rai University
 
ml-02x01.pdf
ml-02x01.pdfml-02x01.pdf
Data Gathering and Statistical Treatment
Data Gathering and Statistical TreatmentData Gathering and Statistical Treatment
Data Gathering and Statistical Treatment
MarielleARopeta
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptx
shumPanwar
 
Module-1.pptxcjxifkgzkzigoyxyxoxoyztiai. Tisi
Module-1.pptxcjxifkgzkzigoyxyxoxoyztiai. TisiModule-1.pptxcjxifkgzkzigoyxyxoxoyztiai. Tisi
Module-1.pptxcjxifkgzkzigoyxyxoxoyztiai. Tisi
Arunnaik63
 
Lesson1.2.pptx.pdf
Lesson1.2.pptx.pdfLesson1.2.pptx.pdf
Lesson1.2.pptx.pdf
JhimarPeredoJurado
 

Similar to Lecture 2 Data mining process.pdf (20)

Qualitative and quantitative analysis
Qualitative and quantitative analysisQualitative and quantitative analysis
Qualitative and quantitative analysis
 
Data warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniquesData warehouse 16 data analysis techniques
Data warehouse 16 data analysis techniques
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
Data Analysis, Intepretation
Data Analysis, IntepretationData Analysis, Intepretation
Data Analysis, Intepretation
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data analysis
Data analysisData analysis
Data analysis
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introduction
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.ppt
 
crisp.ppt
crisp.pptcrisp.ppt
crisp.ppt
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
 
Mba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation aMba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation a
 
ml-02x01.pdf
ml-02x01.pdfml-02x01.pdf
ml-02x01.pdf
 
Data Gathering and Statistical Treatment
Data Gathering and Statistical TreatmentData Gathering and Statistical Treatment
Data Gathering and Statistical Treatment
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptx
 
Module-1.pptxcjxifkgzkzigoyxyxoxoyztiai. Tisi
Module-1.pptxcjxifkgzkzigoyxyxoxoyztiai. TisiModule-1.pptxcjxifkgzkzigoyxyxoxoyztiai. Tisi
Module-1.pptxcjxifkgzkzigoyxyxoxoyztiai. Tisi
 
Lesson1.2.pptx.pdf
Lesson1.2.pptx.pdfLesson1.2.pptx.pdf
Lesson1.2.pptx.pdf
 

More from Kaushik Kundu

Investment portfolio analysis and behavioural bises
Investment portfolio analysis and behavioural bisesInvestment portfolio analysis and behavioural bises
Investment portfolio analysis and behavioural bises
Kaushik Kundu
 
Team Dynamics Transactional Analysis.ppt
Team Dynamics Transactional Analysis.pptTeam Dynamics Transactional Analysis.ppt
Team Dynamics Transactional Analysis.ppt
Kaushik Kundu
 
Labour Policy in Indian Five Years Plan.pptx
Labour Policy in Indian Five Years Plan.pptxLabour Policy in Indian Five Years Plan.pptx
Labour Policy in Indian Five Years Plan.pptx
Kaushik Kundu
 
Coaching.pptx
Coaching.pptxCoaching.pptx
Coaching.pptx
Kaushik Kundu
 
lecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
lecture-6 WAGE DETERMINATION-lecture-5CU2010.pptlecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
lecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
Kaushik Kundu
 
Duopoly
DuopolyDuopoly
Duopoly
Kaushik Kundu
 
Q research
Q researchQ research
Q research
Kaushik Kundu
 
Theory of queuing systems
Theory of queuing systemsTheory of queuing systems
Theory of queuing systems
Kaushik Kundu
 
Concept of organizational culture
Concept of organizational cultureConcept of organizational culture
Concept of organizational culture
Kaushik Kundu
 

More from Kaushik Kundu (9)

Investment portfolio analysis and behavioural bises
Investment portfolio analysis and behavioural bisesInvestment portfolio analysis and behavioural bises
Investment portfolio analysis and behavioural bises
 
Team Dynamics Transactional Analysis.ppt
Team Dynamics Transactional Analysis.pptTeam Dynamics Transactional Analysis.ppt
Team Dynamics Transactional Analysis.ppt
 
Labour Policy in Indian Five Years Plan.pptx
Labour Policy in Indian Five Years Plan.pptxLabour Policy in Indian Five Years Plan.pptx
Labour Policy in Indian Five Years Plan.pptx
 
Coaching.pptx
Coaching.pptxCoaching.pptx
Coaching.pptx
 
lecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
lecture-6 WAGE DETERMINATION-lecture-5CU2010.pptlecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
lecture-6 WAGE DETERMINATION-lecture-5CU2010.ppt
 
Duopoly
DuopolyDuopoly
Duopoly
 
Q research
Q researchQ research
Q research
 
Theory of queuing systems
Theory of queuing systemsTheory of queuing systems
Theory of queuing systems
 
Concept of organizational culture
Concept of organizational cultureConcept of organizational culture
Concept of organizational culture
 

Recently uploaded

原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 

Recently uploaded (20)

原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 

Lecture 2 Data mining process.pdf

  • 1. DATA MINING PROCESS LECTURE 02 DR. GAURAV DIXIT DEPARTMENT OF MANAGEMENT STUDIES 1
  • 2. DATA MINING PROCESS • Phases in a typical Data Mining effort: 1. Discovery Frame business problem Identify analytics component Formulate initial hypotheses 2. Data Preparation Obtain dataset form internal and external sources Data consistency checks in terms of definitions of fields, units of measurement, time periods etc., Sample 2
  • 3. DATA MINING PROCESS • Phases in a typical Data Mining effort: 3. Data Exploration and Conditioning Missing data handling, Range reasonability, Outliers, Graphical or Visual Analysis Transformation, Creation of new variables, and Normalization Partitioning into Training, Validation, and Test datasets 3
  • 4. DATA MINING PROCESS • Phases in a typical Data Mining effort: 4. Model Planning Determine data mining task such as prediction, classification etc. Select appropriate data mining methods and techniques such as regression, neural networks, clustering etc. 5. Model Building Building different candidate models using selected techniques and their variants using training data Refine and select the final model using validation data Evaluate the final model on test data 4
  • 5. DATA MINING PROCESS • Phases in a typical Data Mining effort: 6. Results Interpretation Model evaluation using key performance metrics 7. Model Deployment Pilot project to integrate and run the model on operational systems • Similar data mining methodologies developed by SAS and IBM Modeler (SPSS Clementine) are called SEMAA and CRISP- DM respectively 5
  • 6. DATA MINING PROCESS • Data mining techniques can be divided into Supervised Learning Methods and Unsupervised Learning Methods • Supervised Learning – In supervised learning, algorithms are used to learn the function ‘f’ that can map input variables (X) into output variables (Y) Y = f(X) – Idea is to approximate ‘f’ such that new data on input variables (X) can predict the output variables (Y) with minimum possible error (Ɛ) 6
  • 7. DATA MINING PROCESS • Supervised Learning problems can be grouped into prediction and classification problems • Unsupervised Learning – In Unsupervised Learning, algorithms are used to learn the underlying structure or patterns hidden in the data • Unsupervised Learning problems can be grouped into clustering and association rule learning problems 7
  • 8. DATA MINING PROCESS • Target Population – Subset of the population under study – Results are generalized to the target population • Sample – Subset of the target population • Simple Random Sampling – A sampling method wherein each observation has an equal chance of being selected 8
  • 9. DATA MINING PROCESS • Random Sampling – A sampling method wherein each observation does not necessarily have an equal chance of being selected • Sampling with Replacement – Sample values are independent • Sampling without Replacement – Sample values aren’t independent 9
  • 10. DATA MINING PROCESS • Sampling results in less no. of observations than the no. of total observations in the dataset • Data Mining algorithms – Varying limitations on number of observations and variables • Limitations due to computing power and storage capacity • Limitations due to statistical software being used • How many observations to build accurate models? 10
  • 11. DATA MINING PROCESS • Rare Event, e.g., low response rate in advertising by traditional mail or email – Oversampling of ‘success’ cases – Arises mainly in classification tasks – Costs of misclassification • Asymmetric costs due to more importance of ‘success’ class – Costs of failing to identify ‘success’ cases are generally more than costs of detailed review of all cases – Prediction of ‘success’ cases is likely to come at cost of misclassifying more ‘failure’ cases as ‘success’ cases than usual 11
  • 12. DATA MINING PROCESS • Dummy coding for categorical variables – Some statistical software cannot use categorical variables expressed in the label format – Dummy binary variables (having 0’s and 1’s: 0 indicating ‘absence’ and 1 indicating ‘presence’) for different classes of categorical variables are created – For example, if ‘activity status’ of individuals can be put into four mutually exclusive and jointly exhaustive classes as {student, unemployed, employed, retired}, only three dummy variables would be required 12
  • 13. DATA MINING PROCESS • Principle of Parsimony – A model or theory with less no. of assumptions and variables but with high explanatory power is generally desirable • More no. of variables also increase the sample size requirements due to reliability of estimate • Overfitting – A model built using a complex function that fits the data perfectly – Model ends up fitting the noise and explaining the chance variation 13
  • 14. DATA MINING PROCESS • Overfitting – More no. of iterations resulting in excessive learning of the data – More no. of variables in the model may lead to fitting spurious relationships • Sample Size – Domain Knowledge – General rule of thumb: 10 × p observations, where p is the no. of predictors – For classification tasks: 6 × m × p observations, where m is the no. of classes in the outcome variable (Delmaster & Hancock, 2001) 14
  • 15. DATA MINING PROCESS • Outliers – A distant data point – Valid point or erroneous value? – Further review • Manual Inspection (Sorting, minimum and maximum values, clustering etc.) • Domain Knowledge • Missing Values – Few records with missing values can be removed – Imputation 15
  • 16. DATA MINING PROCESS • Missing Values – Drop the variables having missing values – Replace with proxy variable • Normalization – Standardization using z-score – Min-max normalization 16
  • 17. Key References • Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data by EMC Education Services (2015) • Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner by Shmueli, G., Patel, N. R., & Bruce, P. C. (2010) 17