Data Mining: A Practical
Introduction for Organizational
Researchers
Jeffrey Stanton
Syracuse University
School of Information Studies
A Chapter in “Modern Research Methods for the Study of Behavior in Organizations”
edited by Jose Cortina and Ron Landis, Routledge 2013 (pp. 199-232)
We Are Awash in Data
Data Can Serve Research in New Ways
• Available data on a scale millions of times
larger than 20 years ago: customer
transactions; sensor outputs; web documents;
digital images and audio
• As a complementary alternative to the
hypothetico-deductive method that has
dominated social science research, what if we
could use large, existing data sets to
inductively discover new insights?
The Classic Example
[Figure: three columns labeled Customers, Carts, and Store Inventory. Customer 1's cart holds Items 1-5, Customer 2's cart holds Items 1-3, and Customer 3's cart holds Items 1-7. The store inventory lists Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, and Toothpaste.]
Other Examples
• Recommender functions (e.g., other people who
bought this book also enjoyed…)
• The Iris dataset: Published by R.A. Fisher, uses
measurements of flower attributes (sepal and petal dimensions) to
classify species
• Soybean disease classification: determining the
cause of disease based on symptom sets
• 1987-1988 Canadian labor contract negotiations:
predicting which contracts fall through based on
characteristics of contracts
A Definition of Data Mining
• Data mining refers to the use of algorithms
and computers to discover novel and
interesting structure within data
(Fayyad, Grinstein, & Wierse, 2002).
Examples of Data Mining Techniques
Supervised learning
• Neural networks
• Support vector machines
• Boosted regression trees
• Classification and regression trees
• Generalized additive models
Unsupervised learning
• Independent components analysis
• K-means clustering
• Self-organizing maps
• Association rules mining
Supervised learning is parallel in concept to the predictive statistical techniques used by many social science researchers, such as linear regression, but without the restriction of only exploring linear relationships.
Unsupervised learning includes a variety of machine learning techniques that do not use a criterion or dependent variable, but rather look for patterns solely among “independent” variables.
Four Familiar Steps
Data Mining Flowchart (steps in order):
• Pre-processing / Data Preparation
• Exploratory Analysis / Dimension Reduction
• Model Exploration and Development
• Model Interpretation / Deployment
Data Pre-Processing
• Screening – Detecting outliers, missing data, illegal values, unusual patterns, unexpected distributions, unusable coding schemes
• Diagnosis – Mechanisms of missing data, coding/entry errors, true extreme values, alternative distributions
• Repair – Leave data unchanged, missing data mitigation, deletion of anomalous records, transformation, recoding, binning
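The screening and repair steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the variable names, the legal ranges, and the choice of median replacement are all hypothetical, and median replacement is just one of the repair options listed above.

```python
from statistics import median

# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "tenure_years": 5.0},
    {"id": 2, "age": None, "tenure_years": 2.5},
    {"id": 3, "age": 41, "tenure_years": 12.0},
    {"id": 4, "age": 29, "tenure_years": 1.0},
    {"id": 5, "age": 230, "tenure_years": 3.0},  # likely a data-entry error
]

# Assumed legal range for each variable (illustrative only).
LEGAL_RANGES = {"age": (16, 100), "tenure_years": (0, 60)}


def screen(rows, ranges):
    """Screening: list (id, variable, problem) for missing or illegal values."""
    problems = []
    for row in rows:
        for var, (low, high) in ranges.items():
            value = row[var]
            if value is None:
                problems.append((row["id"], var, "missing"))
            elif not low <= value <= high:
                problems.append((row["id"], var, "illegal"))
    return problems


def repair(rows, ranges):
    """Repair: replace missing or illegal values with the median of the legal ones."""
    repaired = [dict(row) for row in rows]  # leave the original records unchanged
    for var, (low, high) in ranges.items():
        legal = [r[var] for r in rows if r[var] is not None and low <= r[var] <= high]
        fill = median(legal)
        for r in repaired:
            if r[var] is None or not low <= r[var] <= high:
                r[var] = fill
    return repaired


issues = screen(records, LEGAL_RANGES)
clean = repair(records, LEGAL_RANGES)
print(issues)
print(clean)
```

Diagnosis, the middle step, is deliberately left out: deciding whether a flagged value is an entry error or a true extreme value usually takes human judgment rather than code.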
Curse of Dimensionality
• Data mining tasks often begin
with a dataset that has
hundreds or even thousands of
variables and little or no
indication of which of the
variables are important and
should be retained versus
those that can safely be
discarded
• Analytical techniques used in
the model building phase of
data mining depend upon
“searching” through a
multidimensional space for a
set of locally or globally
optimal coefficients
Addressing High Dimensionality
• Any data set with dozens or hundreds of variables is likely
to have considerable redundancy in it as well as numerous
variables that are not useful or relevant; two main approaches
for dealing with this:
– Feature selection: The process of choosing which variables to
keep and which to discard; simplest method: screen each
input-output pair with a Pearson correlation (or more efficiently
with a form of multiple regression); major goal is to discard
input variables that are unlikely to contribute to the analysis
– Feature extraction: The process of replacing a large set of
variables that contain redundancy with a smaller number of
non-redundant variables; simplest method: principal
components analysis; major goal is to combine (linearly or
non-linearly) the redundant set into a smaller non-redundant set
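A minimal sketch of the simplest feature selection method, screening each input against the output with a Pearson correlation, might look like the following. The variable names, the data, and the 0.3 cutoff are assumptions made for illustration; a real analysis would choose the threshold with care.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant variable carries no linear information
    return cov / (sd_x * sd_y)


def select_features(inputs, output, threshold=0.3):
    """Keep inputs whose absolute correlation with the output meets the threshold."""
    return [name for name, values in inputs.items()
            if abs(pearson(values, output)) >= threshold]


# Hypothetical data: one output and four candidate input variables.
performance = [1, 2, 3, 4, 5, 6]
candidates = {
    "hours_training": [2, 4, 6, 8, 10, 12],   # rises with the output
    "absences": [6, 5, 4, 3, 2, 1],           # falls as the output rises
    "shoe_size": [3, 1, 2, 2, 1, 3],          # unrelated to the output
    "site_code": [5, 5, 5, 5, 5, 5],          # constant
}

kept = select_features(candidates, performance)
print(kept)
```

Because the screen looks at one input at a time, it can miss variables that matter only in combination, which is one reason the multiple regression variant is mentioned above.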
ICA Example
Algorithm/Model Selection
• Within a family of DM techniques
(i.e., supervised or unsupervised)
there will almost always be
multiple choices of algorithms
• How to decide which one to use?
• Given the empirical nature of data
mining, it is often satisfactory to
choose the algorithm that “works
best” (i.e., has the lowest error
rate) across the largest amount of
evaluation (validation) data
• What is training data versus
evaluation data?
[Figure: Model building screen from Statistica]
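One way to picture the distinction: training data are the cases used to fit each candidate model, and evaluation (validation) data are held-out cases used only to compare error rates. The sketch below assumes a toy one-predictor data set, a deterministic every-fourth-case holdout, and two deliberately simple candidate models; none of these come from the chapter.

```python
from collections import Counter

# Hypothetical labelled cases: (predictor value, class label).
data = [(x, 1 if x > 5 else 0) for x in range(20)]

# Deterministic holdout: every fourth case is evaluation data, the rest is training data.
evaluation = [case for i, case in enumerate(data) if i % 4 == 0]
training = [case for i, case in enumerate(data) if i % 4 != 0]


def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: majority


def fit_nearest_mean(train):
    """Predict the label whose training mean is closest to x."""
    means = {}
    for label in {lab for _, lab in train}:
        xs = [x for x, lab in train if lab == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))


def error_rate(model, cases):
    """Share of cases the model labels incorrectly."""
    wrong = sum(1 for x, label in cases if model(x) != label)
    return wrong / len(cases)


candidates = {"majority": fit_majority, "nearest_mean": fit_nearest_mean}

# Fit on training data only, then score on the held-out evaluation data.
errors = {name: error_rate(fit(training), evaluation)
          for name, fit in candidates.items()}
best = min(errors, key=errors.get)
print(errors, best)
```

In practice the split is usually random and often repeated (cross-validation); the fixed split here only keeps the example reproducible.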
Selected Unsupervised Algorithms
• Association rules mining / Market basket analysis – Looks for
combinations of items that occur together
• Independent components analysis – Conceptually similar
to principal components analysis, but can work on variables
that are not jointly normally distributed; a form of blind
source/signal separation
• K-means clustering – Organizes a set of observations into
clusters, where the observations in each cluster group closely
around a centroid/mean
• Self-organizing maps – Similar to multidimensional
scaling; takes a high-dimensional problem and translates it
into a low-dimensional space so it can be visualized; uses
neural networks to process data
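As an illustration of the K-means idea, the sketch below alternates between assigning each observation to its nearest centroid and moving each centroid to the mean of its members. The points and the starting centroids are hypothetical, and passing the starting centroids in by hand is a simplification; common implementations choose them randomly and rerun several times.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Plain K-means: returns (cluster label per point, final centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-dimensional observations forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The result depends on the starting centroids and on the number of clusters, both of which the analyst has to supply.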
Association Rules Mining Example
[Figure: three columns labeled Customers, Carts, and Store Inventory. Customer 1's cart holds Items 1-5, Customer 2's cart holds Items 1-3, and Customer 3's cart holds Items 1-7. The store inventory lists Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, and Toothpaste.]
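A small Python sketch can make the market basket idea concrete. The baskets below are hypothetical (they are not the carts from the slide, only items drawn from the same inventory), and the support and confidence thresholds are arbitrary. Support is the share of baskets containing an item set; confidence is the share of baskets with the antecedent that also contain the consequent.

```python
from itertools import permutations

# Hypothetical shopping baskets built from items in the store inventory.
baskets = [
    {"Beer", "Diapers", "Chips"},
    {"Beer", "Diapers", "Baby Wipes"},
    {"Bread", "Peanut Butter", "Milk"},
    {"Beer", "Chips", "Salami"},
    {"Diapers", "Baby Wipes", "Milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets with the antecedent, the fraction that also hold the consequent."""
    return support({antecedent, consequent}, baskets) / support({antecedent}, baskets)


def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """List single-item rules (antecedent, consequent, support, confidence)."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, c in permutations(items, 2):
        s = support({a, c}, baskets)
        if s >= min_support:
            conf = confidence(a, c, baskets)
            if conf >= min_confidence:
                rules.append((a, c, s, conf))
    return rules


rules = pair_rules(baskets)
for a, c, s, conf in rules:
    print(f"{a} -> {c}: support={s:.2f}, confidence={conf:.2f}")
```

Checking every pair works only for tiny inventories; algorithms such as Apriori prune the search by discarding item sets whose subsets are already too rare.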
Selected Supervised Algorithms
• Artificial neural networks (ANNs) – Use a simulation of biological neurons
to create an interconnected system of elements that translates inputs
accurately into outputs; can work well for systems with multiple outputs
• Generalized additive models – Like general linear models (e.g., multiple
regression), but relax constraints on the distributions of the input and
output variables; can accommodate non-linear relations between input
and output variables
• Decision/classification/regression trees (CART) – Iteratively create a
tree-like decision structure with internal branches that bifurcate on values of
the input variables; each path from the root to a leaf translates particular
input values into output values; results are easy to visualize and interpret
• Support vector machines – Use a “kernel” algorithm to develop a
separation line (or plane or hyperplane) that divides a set of observations
into two classes (can also solve multi-class problems); results are hard to
interpret, but models can be highly accurate and generalizable
CART Example
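The core step of a classification tree, choosing the split that bifurcates the data most cleanly, can be sketched as follows. This is not the example from the slide, whose figure is not available here; the employee data, variable names, and the use of Gini impurity as the splitting criterion are assumptions for illustration. A full tree would apply the same search recursively within each branch.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (weighted impurity, feature index, threshold) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between adjacent observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


# Hypothetical employees: (tenure in years, satisfaction rating), with an outcome label.
rows = [(1, 2), (2, 8), (3, 3), (6, 7), (7, 2), (8, 9)]
labels = ["leave", "leave", "leave", "stay", "stay", "stay"]

score, feature, threshold = best_split(rows, labels)
print(score, feature, threshold)
```

Here the best split is on tenure, which separates the two outcomes completely, while no threshold on satisfaction does.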
Data Mining Software Choices
• R – Open source, free, many algorithms, Rattle GUI,
command line difficult, little support
• WEKA – Quasi-open source, free, great textbooks, nice
GUI, little support
• RapidMiner – Open source (registration required), paid
training available, connections to R
• SAS/Enterprise Miner – Proprietary, expensive, lots of
support, lots of documentation
• SPSS/Clementine – Proprietary, expensive, lots of
support, lots of documentation
• Statistica – Proprietary, workbench/workflow style
interface good for beginners, support, documentation
Selected References
• Berkhin, P. (2006). A survey of clustering data mining techniques. Grouping
Multidimensional Data, 25-71.
• Bigus, J. (1996). Data mining with neural networks. McGraw-Hill.
• Caragea, D., Cook, D., Wickham, H., & Honavar, V. (2008). Visual methods for
examining SVM classifiers. Visual Data Mining, 136-153.
• Elith, J., Leathwick, J., & Hastie, T. (2008). A working guide to boosted regression
trees. Journal of Animal Ecology, 77(4), 802-813.
• Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009).
The WEKA data mining software: An update. ACM SIGKDD Explorations
Newsletter, 11(1), 10-18.
• Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman &
Hall/CRC.
• Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9),
1464-1480.
• Stone, J. V. (2004). Independent component analysis: A tutorial introduction. The
MIT Press.
• Witten, I. H., Frank, E., Holmes, G., & Hall, M. A. (2011). Data mining: Practical
machine learning tools and techniques. Morgan Kaufmann.

Basic Overview of Data Mining

  • 1. Data Mining: A Practical Introduction for Organizational Researchers Jeffrey Stanton Syracuse University School of Information Studies A Chapter in “Modern Research Methods for the Study of Behavior in Organizations” edited by Jose Cortina and Ron Landis, Routledge 2013 (pp. 199-232)
  • 2. We Are Awash in Data
  • 3. Data Can Serve Research in New Ways • Available data on a scale millions of times larger than 20 years ago: customer transactions; sensor outputs; web documents; digital images and audio • As a complementary alternative to the hypothetico-deductive method that has dominated social science research, what if we could use large, existing data sets to inductively discover new insights?
  • 4. The Classic Example [Figure: shopping carts for three customers (Customer 1 with five items, Customer 2 with three items, Customer 3 with seven items) shown beside the store inventory: Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, Toothpaste]
  • 5. Other Examples • Recommender functions (e.g., other people who bought this book also enjoyed…) • The Irises dataset: Collected by R.A. Fisher, uses the ratios of measurements of plant attributes to classify species • Soybean disease classification: determining the cause of disease based on symptom sets • 1987-1988 Canadian labor contract negotiations: predicting which contracts fall through based on characteristics of contracts
  • 6. A Definition of Data Mining • Data mining refers to the use of algorithms and computers to discover novel and interesting structure within data (Fayyad, Grinstein, & Wierse, 2002).
  • 7. Examples of Data Mining Techniques • Supervised learning: neural networks; support vector machines; boosted regression trees; classification and regression trees; generalized additive models • Unsupervised learning: independent components analysis; k-means clustering; self-organizing maps; association rules mining • Supervised learning is parallel in concept to the predictive statistical techniques used by many social science researchers, such as linear regression, but without the restriction of only exploring linear relationships. • Unsupervised learning includes a variety of machine learning techniques that do not use a criterion or dependent variable, but rather look for patterns solely among “independent” variables.
  • 8. Four Familiar Steps • 1. Pre-processing / Data Preparation • 2. Exploratory Analysis / Dimension Reduction • 3. Model Exploration and Development • 4. Model Interpretation / Deployment
  • 10. Data Pre-Processing • Screening – Detecting outliers, missing data, illegal values, unusual patterns, unexpected distributions, unusable coding schemes • Diagnosis – Mechanisms of missing data, coding/entry errors, true extreme values, alternative distributions • Repair – Leave data unchanged, missing data mitigation, deletion of anomalous records, transformation, recoding, binning
  • 11. Curse of Dimensionality • Data mining tasks often begin with a dataset that has hundreds or even thousands of variables and little or no indication of which of the variables are important and should be retained versus those that can safely be discarded • Analytical techniques used in the model building phase of data mining depend upon “searching” through a multidimensional space for a set of locally or globally optimal coefficients
  • 12. Addressing High Dimensionality • Any data set with dozens or hundreds of variables is likely to contain considerable redundancy as well as numerous variables that are not useful or relevant; there are two main methods for dealing with this: – Feature selection: The process of choosing which variables to keep and which to discard; simplest method: screen each input-output pair with a Pearson correlation (or more efficiently with a form of multiple regression); major goal is to discard input variables that are unlikely to contribute to the analysis – Feature extraction: The process of replacing a large set of variables that contain redundancy with a smaller number of non-redundant variables; simplest method: principal components analysis; major goal is to combine (linearly or non-linearly) a redundant set into a smaller non-redundant set
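The Pearson screen described under feature selection can be sketched in a few lines. This is a minimal illustration, not the chapter's own procedure: the variable names, the toy values, and the 0.3 cutoff are all assumptions chosen for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def screen_features(inputs, output, threshold=0.3):
    """Keep inputs whose |r| with the output meets the (assumed) threshold."""
    return [name for name, values in inputs.items()
            if abs(pearson(values, output)) >= threshold]

# Hypothetical toy data: names and values are illustrative only.
inputs = {
    "tenure": [1, 2, 3, 4, 5, 6],
    "noise": [3, 1, 3, 1, 3, 1],
    "constant": [7, 7, 7, 7, 7, 7],
}
output = [2, 4, 5, 4, 6, 7]

kept = screen_features(inputs, output)
print(kept)
```

A screen like this only detects linear, one-variable-at-a-time relationships, so an input that matters only in combination with others could be discarded by mistake.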
  • 14. Algorithm/Model Selection • Within a family of DM techniques (i.e., supervised or unsupervised) there will almost always be multiple choices of algorithms • How to decide which one to use? • Given the empirical nature of data mining, it is often satisfactory to choose the algorithm that “works best” (i.e., has the lowest error rate) across the largest amount of evaluation (validation) data • What is training data versus evaluation data? [Figure: model building screen from Statistica]
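One way to make the training-versus-evaluation distinction concrete is a hold-out split: models are fitted on one portion of the data and compared on a portion they never saw. The sketch below assumes a 70/30 split, synthetic data, and two deliberately simple candidate models (`fit_majority` and `fit_threshold` are hypothetical helpers, not algorithms from the slides).

```python
import random

def split_holdout(rows, eval_fraction=0.3, seed=0):
    """Shuffle rows and split them into training and evaluation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

def error_rate(predict, rows):
    """Fraction of rows whose label the predict function gets wrong."""
    wrong = sum(1 for x, label in rows if predict(x) != label)
    return wrong / len(rows)

def fit_majority(train):
    """Baseline: always predict the most common training label."""
    labels = [label for _, label in train]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda x: majority

def fit_threshold(train):
    """Predict 1 when x exceeds the midpoint between the two class means."""
    ones = [x for x, label in train if label == 1]
    zeros = [x for x, label in train if label == 0]
    cut = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: 1 if x > cut else 0

# Hypothetical data: the label is 1 whenever x is greater than 50.
rows = [(x, 1 if x > 50 else 0) for x in range(1, 101)]
train, evaluation = split_holdout(rows)

# Fit every candidate on the training data, score it on the evaluation data.
candidates = {"majority": fit_majority, "threshold": fit_threshold}
errors = {name: error_rate(fit(train), evaluation)
          for name, fit in candidates.items()}
best = min(errors, key=errors.get)
print(errors, best)
```

The error rates are computed only on the evaluation rows, so a model cannot look good merely by memorizing its training data.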
  • 15. Selected Unsupervised Algorithms • Association rules mining / Market basket analysis: Looks for combinations of items that occur together • Independent Components Analysis – Conceptually similar to principal components analysis, but can work on variables that are not jointly normally distributed; a form of blind source/signal separation • K-means clustering – Organizes a set of observations into clusters, where observations in a group cluster closely around a centroid/mean • Self-organizing maps – Similar to multidimensional scaling, takes a high dimensional problem and translates it into low dimensional space so it can be visualized; uses neural networks to process data
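To illustrate how k-means groups observations around centroids, here is a minimal sketch of the standard iterative procedure (Lloyd's algorithm) on two-dimensional points. The points and the starting centroids are assumptions made for the example; practical implementations usually choose starting centroids more carefully and run several restarts.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Hypothetical observations forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No criterion variable appears anywhere in the code, which is what makes this an unsupervised technique: the grouping comes entirely from distances among the input values.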
  • 16. Association Rules Mining Example [Figure: shopping carts for three customers (Customer 1 with five items, Customer 2 with three items, Customer 3 with seven items) shown beside the store inventory: Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, Toothpaste]
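The market basket idea can be sketched by counting how often pairs of items share a cart. Two standard measures are used: support (the share of all carts containing both items) and confidence (the share of carts containing the first item that also contain the second). The cart contents below are hypothetical, since the slide does not state which items each customer bought, and the 0.5 support cutoff is likewise an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical carts built from the inventory names on the slide.
carts = [
    {"Beer", "Diapers", "Chips"},
    {"Beer", "Diapers", "Baby Wipes"},
    {"Bread", "Peanut Butter", "Milk"},
    {"Beer", "Chips", "Salami"},
]

def pair_rules(carts, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(carts)
    item_counts = Counter(item for cart in carts for item in cart)
    pair_counts = Counter(pair for cart in carts
                          for pair in combinations(sorted(cart), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # Each frequent pair yields one rule in each direction.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

rules = pair_rules(carts)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

This sketch only considers pairs; full association rules algorithms such as Apriori extend the same counting to larger item sets while pruning combinations that cannot reach the support cutoff.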
  • 17. Selected Supervised Algorithms • Artificial neural networks (ANNs) – Use a simulation of biological neurons to create an interconnected system of elements that translates inputs accurately into outputs; can work well for systems with multiple outputs • Generalized additive models – Like general linear models (e.g., multiple regression) but relax constraints on the distributions of the input and output variables; can accommodate non-linear relations between input and output variables • Decision/classification/regression trees (CART) – Iteratively create a tree-like decision structure with internal branches that bifurcate on values of the input variables; each path from the root to a leaf translates particular input values into output values; results are easy to visualize and interpret • Support vector machines – Use a “kernel” algorithm to develop a separation line (or plane or hyperplane) that divides a set of observations into two classes (can also solve multi-class problems); hard to interpret results, but can produce highly accurate and generalizable models
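The branching step at the heart of a classification tree can be sketched as a search for the single best split on one input variable. This is a simplified illustration under stated assumptions: it scores splits by raw misclassification count, whereas CART implementations typically use an impurity measure such as the Gini index, and the data are hypothetical.

```python
def best_split(rows):
    """Find the threshold on one numeric input that minimizes misclassification.

    rows is a list of (x, label) pairs.
    Returns (threshold, left_label, right_label, errors).
    """
    def majority(labels):
        return max(sorted(set(labels)), key=labels.count)

    xs = sorted({x for x, _ in rows})
    best = None
    # Candidate thresholds sit midway between adjacent distinct values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [label for x, label in rows if x <= threshold]
        right = [label for x, label in rows if x > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[3]:
            best = (threshold, left_label, right_label, errors)
    return best

# Hypothetical data: larger x values mostly go with the label "yes".
rows = [(1, "no"), (2, "no"), (3, "no"), (4, "yes"),
        (5, "no"), (6, "yes"), (7, "yes"), (8, "yes")]

threshold, left_label, right_label, errors = best_split(rows)
print(f"if x <= {threshold}: predict {left_label}, "
      f"else predict {right_label} ({errors} training error(s))")
```

A full tree repeats this search within each resulting branch, across all input variables, until a stopping rule is met. The printed if/else rule shows why tree results are easy to read.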
  • 19. Data Mining Software Choices • R – Open source, free, many algorithms, Rattle GUI, command line difficult, little support • WEKA – Quasi-open source, free, great textbooks, nice GUI, little support • RapidMiner – Open source (registration required), paid training available, connections to R • SAS/Enterprise Miner – Proprietary, expensive, lots of support, lots of documentation • SPSS/Clementine – Proprietary, expensive, lots of support, lots of documentation • Statistica – Proprietary, workbench/workflow style interface good for beginners, support, documentation
  • 20. Selected References • Berkhin, P. (2006). A survey of clustering data mining techniques. Grouping Multidimensional Data, 25-71. • Bigus, J. (1996). Data mining with neural networks. McGraw-Hill, USA. • Caragea, D., Cook, D., Wickham, H., & Honavar, V. (2008). Visual methods for examining SVM classifiers. Visual Data Mining, 136-153. • Elith, J., Leathwick, J., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802-813. • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18. • Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman & Hall/CRC. • Kohonen, T. (2002). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480. • Stone, J. V. (2004). Independent component analysis: a tutorial introduction. The MIT Press. • Witten, I. H., Frank, E., Holmes, G., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.