Data Mining: A Practical
Introduction for Organizational
Researchers
Jeffrey Stanton
Syracuse University
School of Information Studies
A Chapter in “Modern Research Methods for the Study of Behavior in Organizations”
edited by Jose Cortina and Ron Landis, Routledge 2013 (pp. 199-232)
We Are Awash in Data
Data Can Serve Research in New Ways
• Available data on a scale millions of times
larger than 20 years ago: customer
transactions; sensor outputs; web documents;
digital images and audio
• As a complementary alternative to the
hypothetico-deductive method that has
dominated social science research, what if we
could use large, existing data sets to
inductively discover new insights?
The Classic Example
[Figure: three customers' shopping carts (Customer 1 with 5 items, Customer 2 with 3 items, Customer 3 with 7 items) drawn from a store inventory of Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, and Toothpaste]
Other Examples
• Recommender functions (e.g., other people who
bought this book also enjoyed…)
• The Iris dataset: Published by R.A. Fisher, uses
measurements of flower attributes to
classify species
• Soybean disease classification: determining the
cause of disease based on symptom sets
• 1987-1988 Canadian labor contract negotiations:
predicting which contracts fall through based on
characteristics of contracts
A Definition of Data Mining
• Data mining refers to the use of algorithms
and computers to discover novel and
interesting structure within data
(Fayyad, Grinstein, & Wierse, 2002).
Examples of Data Mining Techniques
• Supervised learning: neural networks; support vector machines;
boosted regression trees; classification and regression trees;
generalized additive models
• Unsupervised learning: independent components analysis; k-means
clustering; self-organizing maps; association rules mining
• Supervised learning is parallel in concept to the predictive
statistical techniques used by many social science researchers,
such as linear regression, but without the restriction of only
exploring linear relationships.
• Unsupervised learning includes a variety of machine learning
techniques that do not use a criterion or dependent variable, but
rather look for patterns solely among “independent” variables.
Four Familiar Steps
1. Pre-processing / Data Preparation
2. Exploratory Analysis / Dimension Reduction
3. Model Exploration and Development
4. Model Interpretation / Deployment
[Figure: Data Mining Flowchart]
Data Pre-Processing
• Screening – Detecting outliers, missing data, illegal values,
unusual patterns, unexpected distributions, unusable coding schemes
• Diagnosis – Mechanisms of missing data, coding/entry errors,
true extreme values, alternative distributions
• Repair – Leave data unchanged, missing data mitigation,
deletion of anomalous records, transformation, recoding, binning
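The screening and repair steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ages` values, the legal range of 16 to 100, and the choice of median imputation as the repair strategy. The diagnosis step (deciding why values are missing or extreme) is a judgment call and is not automated here.

```python
from statistics import median

# Hypothetical records: None marks a missing value; 140 is an implausible age.
ages = [34, 29, None, 41, 38, 140, 27, None, 45, 33]

def screen(values, low, high):
    """Screening: report row indices of missing and out-of-range values."""
    missing = [i for i, v in enumerate(values) if v is None]
    illegal = [i for i, v in enumerate(values)
               if v is not None and not (low <= v <= high)]
    return missing, illegal

def repair(values, low, high):
    """Repair: treat illegal values as missing, then impute the median
    of the remaining valid values (one mitigation option among many)."""
    valid = [v for v in values if v is not None and low <= v <= high]
    fill = median(valid)
    return [v if (v is not None and low <= v <= high) else fill
            for v in values]

missing_idx, illegal_idx = screen(ages, low=16, high=100)
cleaned = repair(ages, low=16, high=100)
print("missing at rows:", missing_idx)
print("illegal at rows:", illegal_idx)
print("cleaned:", cleaned)
```

Deleting the anomalous records or leaving the data unchanged would be equally legitimate repairs; which one is appropriate depends on the diagnosis.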
Curse of Dimensionality
• Data mining tasks often begin
with a dataset that has
hundreds or even thousands of
variables and little or no
indication of which of the
variables are important and
should be retained versus
those that can safely be
discarded
• Analytical techniques used in
the model building phase of
data mining depend upon
“searching” through a
multidimensional space for a
set of locally or globally
optimal coefficients
Addressing High Dimensionality
• Any data set with dozens or hundreds of variables is likely
to have considerable redundancy in it, as well as numerous
variables that are not useful or relevant; two main approaches
for dealing with this:
– Feature selection: The process of choosing which variables to
keep and which to discard; simplest method: screen each input-output
pair with a Pearson correlation (or more efficiently with
a form of multiple regression); the major goal is to drop input
variables that are unlikely to contribute to the analysis
– Feature extraction: The process of replacing a large set of
variables that contain redundancy with a smaller number of
non-redundant variables; simplest method: principal
components analysis; the major goal is to combine (linearly or
non-linearly) a redundant set into a smaller non-redundant set
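The simplest feature selection method named above, screening each input against the output with a Pearson correlation, can be sketched as follows. The variable names, the data values, and the 0.3 cutoff are all hypothetical; in practice the threshold is a judgment call.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical input variables and one output variable.
inputs = {
    "tenure":   [1, 2, 3, 4, 5, 6, 7, 8],
    "workload": [8, 7, 6, 5, 4, 3, 2, 1],
    "noise":    [4, 9, 1, 6, 5, 2, 8, 5],
}
output = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]

def select_features(inputs, output, threshold=0.3):
    """Keep inputs whose absolute correlation with the output meets
    the (assumed) threshold; return the kept names and all scores."""
    scores = {name: pearson(col, output) for name, col in inputs.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

kept, scores = select_features(inputs, output)
print("kept:", kept)
```

Note that `tenure` and `workload` are perfectly redundant with each other here, which pairwise screening against the output does not detect; that is the gap feature extraction is meant to address.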
ICA Example
Algorithm/Model Selection
• Within a family of DM techniques
(i.e., supervised or unsupervised)
there will almost always be
multiple choices of algorithms
• How to decide which one to use?
• Given the empirical nature of data
mining, it is often satisfactory to
choose the algorithm that “works
best” (i.e., has the lowest error
rate) across the largest amount of
evaluation (validation) data
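• What is training data versus evaluation data? Training data
are used to fit the model; evaluation (validation) data are held
out to estimate its error on cases it has not seen
[Figure: Model building screen from Statistica]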
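A minimal sketch of choosing the algorithm that "works best" on evaluation data: the data are split into a training portion used for fitting and a held-out portion used only for scoring. The simulated data, the 70/30 split, the two candidate models, and mean squared error as the error rate are all assumptions for illustration.

```python
import random

random.seed(0)
# Hypothetical data: y depends roughly linearly on x, plus noise.
xs = [i / 10 for i in range(100)]
data = [(x, 3.0 * x + 1.0 + random.gauss(0, 1)) for x in xs]
random.shuffle(data)

split = int(0.7 * len(data))
train, valid = data[:split], data[split:]  # fit on train, score on valid

def fit_mean(rows):
    """Baseline model: always predict the training mean of y."""
    m = sum(y for _, y in rows) / len(rows)
    return lambda x: m

def fit_line(rows):
    """Simple least-squares regression of y on x."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, rows):
    """Mean squared prediction error of a fitted model on rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

candidates = {"mean baseline": fit_mean, "linear fit": fit_line}
errors = {name: mse(fit(train), valid) for name, fit in candidates.items()}
best = min(errors, key=errors.get)
print("validation errors:", errors)
print("selected:", best)
```

Scoring on the held-out rows rather than the training rows is what guards against picking a model that merely memorized its training data.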
Selected Unsupervised Algorithms
• Association rules mining / Market basket analysis – Looks for
combinations of items that occur together
• Independent components analysis – Conceptually similar
to principal components analysis, but can work on variables
that are not jointly normally distributed; a form of blind
source/signal separation
• K-means clustering – Organizes a set of observations into
clusters, where the observations in each cluster group closely
around a centroid/mean
• Self-organizing maps – Similar to multidimensional
scaling; takes a high-dimensional problem and translates it
into a low-dimensional space so it can be visualized; uses
neural networks to process data
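As an illustration of one of these algorithms, the following is a minimal sketch of k-means clustering in plain Python. The eight two-dimensional points, the choice of k = 2, the random initialization, and the iteration cap are assumptions; production implementations add better initialization and multiple restarts.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Alternate assignment and centroid-update steps until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k observed points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (2.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0), (9.0, 8.0)]
labels, centroids = kmeans(points, k=2)
print("labels:", labels)
print("centroids:", centroids)
```

No criterion variable is involved: the grouping emerges solely from distances among the input variables, which is what makes the method unsupervised.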
Association Rules Mining Example
[Figure: the customers, carts, and store inventory diagram from "The Classic Example": three customers' carts of 5, 3, and 7 items drawn from an inventory of Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, and Toothpaste]
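A minimal sketch of how association rules might be scored on basket data. The slide does not show which items are in which cart, so the five baskets below are hypothetical, built from item names in the store inventory; the support and confidence cutoffs are likewise assumptions. Support is the share of baskets containing an itemset, and confidence is the share of baskets containing the left-hand item that also contain the right-hand item.

```python
from itertools import combinations

# Hypothetical baskets using item names from the store inventory.
baskets = [
    {"Diapers", "Beer", "Chips"},
    {"Diapers", "Beer", "Baby Wipes"},
    {"Bread", "Peanut Butter", "Milk"},
    {"Diapers", "Baby Wipes", "Milk"},
    {"Beer", "Chips", "Salami"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def pair_rules(min_support=0.3, min_confidence=0.6):
    """Score every two-item rule lhs -> rhs that clears both cutoffs."""
    items = sorted(set().union(*baskets))
    found = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue  # the pair co-occurs too rarely to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found

rules = pair_rules()
for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

This brute-force pass over all pairs is only workable for small inventories; algorithms such as Apriori prune the search by extending only itemsets that already meet the support cutoff.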
Selected Supervised Algorithms
• Artificial neural networks (ANNs) – Use a simulation of biological neurons
to create an interconnected system of elements that translates inputs
accurately into outputs; can work well for systems with multiple outputs
• Generalized additive models – Like general linear models (e.g., multiple
regression), except that they relax constraints on the distributions of the
input and output variables; can accommodate non-linear relations between
input and output variables
• Decision/classification/regression trees (CART) – Iteratively create a tree-like
decision structure with internal branches that bifurcate on values of
the input variables; each path from the root to a leaf translates particular
input values into output values; results are easy to visualize and interpret
• Support vector machines – Use a “kernel” algorithm to develop a
separating line (or plane or hyperplane) that divides a set of observations
into two classes (can also solve multi-class problems); results are hard to
interpret, but the models can be highly accurate and generalizable
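The bifurcation step at the heart of a classification tree can be sketched as a search for the single best split. This is a minimal illustration, not a full CART implementation: it uses Gini impurity as the split criterion (one common choice), and the employee records, variable meanings, and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels in the group are identical."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Try every variable and threshold; return the tuple
    (weighted impurity, variable index, threshold) with lowest impurity."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send cases down both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Hypothetical records: (tenure in years, satisfaction rating 1-5).
rows = [(1, 2), (7, 1), (3, 4), (8, 5), (2, 1), (6, 2), (5, 4), (4, 5)]
labels = ["leave", "leave", "stay", "stay", "leave", "leave", "stay", "stay"]

split = best_split(rows, labels)
print("impurity, variable index, threshold:", split)
```

A full tree would apply `best_split` recursively to the left and right groups until the leaves are pure or a stopping rule is met; each root-to-leaf path then reads as a plain if-then rule, which is why the results are easy to interpret.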
CART Example
Data Mining Software Choices
• R – Open source, free, many algorithms, Rattle GUI,
command line difficult, little support
• WEKA – Quasi-open source, free, great textbooks, nice
GUI, little support
• RapidMiner – Open source (registration required), paid
training available, connections to R
• SAS/Enterprise Miner – Proprietary, expensive, lots of
support, lots of documentation
• SPSS/Clementine – Proprietary, expensive, lots of
support, lots of documentation
• Statistica – Proprietary, workbench/workflow style
interface good for beginners, support, documentation
Selected References
• Berkhin, P. (2006). A survey of clustering data mining techniques. Grouping
Multidimensional Data, 25-71.
• Bigus, J. (1996). Data mining with neural networks. McGraw-Hill.
• Caragea, D., Cook, D., Wickham, H., & Honavar, V. (2008). Visual methods for
examining SVM classifiers. Visual Data Mining, 136-153.
• Elith, J., Leathwick, J., & Hastie, T. (2008). A working guide to boosted regression
trees. Journal of Animal Ecology, 77(4), 802-813.
• Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009).
The WEKA data mining software: An update. ACM SIGKDD Explorations
Newsletter, 11(1), 10-18.
• Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman &
Hall/CRC.
• Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9),
1464-1480.
• Stone, J. V. (2004). Independent component analysis: A tutorial introduction. The
MIT Press.
• Witten, I. H., Frank, E., Holmes, G., & Hall, M. A. (2011). Data Mining: Practical
Machine Learning Tools and Techniques. Morgan Kaufmann.

 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 

Basic Overview of Data Mining

  • 1. Data Mining: A Practical Introduction for Organizational Researchers Jeffrey Stanton Syracuse University School of Information Studies A Chapter in “Modern Research Methods for the Study of Behavior in Organizations” edited by Jose Cortina and Ron Landis, Routledge 2013 (pp. 199-232)
  • 2. We Are Awash in Data
  • 3. Data Can Serve Research in New Ways • Available data on a scale millions of times larger than 20 years ago: customer transactions; sensor outputs; web documents; digital images and audio • As a complementary alternative to the hypothetico-deductive method that has dominated social science research, what if we could use large, existing data sets to inductively discover new insights?
  • 4. The Classic Example • Customers and carts: Customer 1 (Items 1-5); Customer 2 (Items 1-3); Customer 3 (Items 1-7) • Store inventory: Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, Toothpaste
  • 5. Other Examples • Recommender functions (e.g., other people who bought this book also enjoyed…) • The Irises dataset: Collected by R.A. Fisher, uses the ratios of measurements of plant attributes to classify species • Soybean disease classification: determining the cause of disease based on symptom sets • 1987-1988 Canadian labor contract negotiations: predicting which contracts fall through based on characteristics of contracts
  • 6. A Definition of Data Mining • Data mining refers to the use of algorithms and computers to discover novel and interesting structure within data (Fayyad, Grinstein, & Wierse, 2002).
  • 7. Examples of Data Mining Techniques • Supervised learning: neural networks; support vector machines; boosted regression trees; classification and regression trees; generalized additive models • Unsupervised learning: independent components analysis; k-means clustering; self-organizing maps; association rules mining • Supervised learning is parallel in concept to the predictive statistical techniques used by many social science researchers, such as linear regression, but without the restriction of only exploring linear relationships. • Unsupervised learning includes a variety of machine learning techniques that do not use a criterion or dependent variable, but rather look for patterns solely among “independent” variables.
  • 8. Four Familiar Steps • (1) Pre-processing / data preparation • (2) Exploratory analysis / dimension reduction • (3) Model exploration and development • (4) Model interpretation / deployment
  • 10. Data Pre-Processing • Screening – Detecting outliers, missing data, illegal values, unusual patterns, unexpected distributions, unusable coding schemes • Diagnosis – Mechanisms of missing data, coding/entry errors, true extreme values, alternative distributions • Repair – Leave data unchanged, missing data mitigation, deletion of anomalous records, transformation, recoding, binning
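The screening step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `screen_column`, the legal range, the z-score cutoff, and the sample ages are all hypothetical and do not come from the slides.

```python
from statistics import mean, stdev

def screen_column(values, legal_min, legal_max, z_cutoff=3.0):
    """Flag missing, illegal, and outlying entries in one numeric column.

    Returns the row indices in each category; nothing is repaired here.
    """
    # Missing data: entries recorded as None.
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]

    # Illegal values: present but outside the documented legal range.
    illegal = [i for i, v in present if not (legal_min <= v <= legal_max)]
    legal = [(i, v) for i, v in present if legal_min <= v <= legal_max]

    # Outliers: legal values far from the mean in standard-deviation units.
    legal_values = [v for _, v in legal]
    m = mean(legal_values)
    s = stdev(legal_values)
    outliers = [i for i, v in legal if s > 0 and abs(v - m) / s > z_cutoff]

    return {"missing": missing, "illegal": illegal, "outliers": outliers}

# Hypothetical employee ages with one missing, one illegal, one extreme value.
ages = [34, 29, None, 41, 38, -5, 36, 33, 31, 95]
report = screen_column(ages, legal_min=16, legal_max=100, z_cutoff=2.0)
print(report)
```

Diagnosis and repair are deliberately left out: whether a flagged value is an entry error or a true extreme value is a judgment the analyst has to make.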
  • 11. Curse of Dimensionality • Data mining tasks often begin with a dataset that has hundreds or even thousands of variables and little or no indication of which of the variables are important and should be retained versus those that can safely be discarded • Analytical techniques used in the model building phase of data mining depend upon “searching” through a multidimensional space for a set of locally or globally optimal coefficients
  • 12. Addressing High Dimensionality • Any data set with dozens or hundreds of variables is likely to have considerable redundancy in it, as well as numerous variables that are not useful or relevant; there are two main methods for dealing with this: – Feature selection: The process of choosing which variables to keep and which to discard; simplest method: screen each input-output pair with a Pearson correlation (or more efficiently with a form of multiple regression); major goal is to discard input variables that are unlikely to contribute to the analysis – Feature extraction: The process of replacing a large set of variables that contain redundancy with a smaller number of non-redundant variables; simplest method: principal components analysis; major goal is to combine (linearly or non-linearly) the redundant set into a smaller non-redundant set
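The simplest feature selection method named on this slide, screening each input-output pair with a Pearson correlation, might look like the sketch below. The variable names, the toy data, and the 0.3 threshold are hypothetical choices made for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def select_features(inputs, output, threshold=0.3):
    """Keep the inputs whose absolute correlation with the output meets the threshold."""
    scores = {name: pearson(col, output) for name, col in inputs.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

# Hypothetical inputs for six employees and a performance rating as the output.
inputs = {
    "tenure": [1, 2, 3, 4, 5, 6],
    "commute": [5, 1, 4, 2, 6, 3],
    "training_hours": [10, 8, 9, 5, 4, 2],
}
performance = [2, 3, 5, 6, 8, 9]

kept, scores = select_features(inputs, performance)
print(kept)
```

Pairwise screening of this kind ignores inputs that only matter in combination with others, which is one reason the slide mentions multiple regression as an alternative.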
  • 14. Algorithm/Model Selection • Within a family of DM techniques (i.e., supervised or unsupervised) there will almost always be multiple choices of algorithms • How to decide which one to use? • Given the empirical nature of data mining, it is often satisfactory to choose the algorithm that “works best” (i.e., has the lowest error rate) across the largest amount of evaluation (validation) data • What is training data versus evaluation data? Training data are the cases used to fit a model; evaluation data are held-out cases used only to estimate its error • [Figure: model building screen from Statistica]
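The "works best on evaluation data" rule can be sketched with a holdout split. Everything here is an assumption made for illustration: the simulated data, the 70/30 split, and the two deliberately simple candidate models (a mean-only baseline and a one-variable least-squares line).

```python
import random

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean of the output."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least squares with a single input variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on the given cases."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Simulated data: a linear signal plus noise (seeded for repeatability).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# Shuffle case indices, then hold out 30% as evaluation (validation) data.
idx = list(range(len(xs)))
rng.shuffle(idx)
train, valid = idx[:70], idx[70:]
x_train, y_train = [xs[i] for i in train], [ys[i] for i in train]
x_valid, y_valid = [xs[i] for i in valid], [ys[i] for i in valid]

# Fit every candidate on training data only; score it on evaluation data only.
candidates = {"mean_only": fit_mean, "linear": fit_line}
errors = {}
for name, fit in candidates.items():
    model = fit(x_train, y_train)
    errors[name] = mse(model, x_valid, y_valid)

best = min(errors, key=errors.get)
print(best, errors)
```

The key design point is that the evaluation cases never influence fitting, so the comparison reflects how each model generalizes rather than how well it memorizes.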
  • 15. Selected Unsupervised Algorithms • Association rules mining / Market basket analysis – Looks for combinations of items that occur together • Independent Components Analysis – Conceptually similar to principal components analysis, but can work on variables that are not jointly normally distributed; a form of blind source/signal separation • K-means clustering – Organizes a set of observations into clusters, where the observations in each cluster lie close to that cluster's centroid/mean • Self-organizing maps – Similar to multidimensional scaling; takes a high-dimensional problem and translates it into a low-dimensional space so it can be visualized; uses neural networks to process data
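As a rough illustration of k-means clustering, the sketch below implements the standard assign-then-update loop (Lloyd's algorithm) in plain Python. The six points, the choice of two clusters, and the starting centroids are hypothetical; real analyses usually rely on a library implementation with multiple random starts.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its previous centroid.
        centroids = [mean_point(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical two-variable observations forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[points[0], points[1]])
print(centroids)
```

Both starting centroids sit in the same group here, yet the loop still separates the two groups within a couple of iterations; with less clearly separated data the result can depend heavily on the starting points.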
  • 16. Association Rules Mining Example • Customers and carts: Customer 1 (Items 1-5); Customer 2 (Items 1-3); Customer 3 (Items 1-7) • Store inventory: Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, Toothpaste
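To make the example concrete, the sketch below computes support, confidence, and lift for item pairs. The slide does not say which products are in each cart, so the cart contents are hypothetical: they are drawn from the store inventory and match only the cart sizes shown (5, 3, and 7 items). The thresholds are likewise illustrative.

```python
from itertools import combinations

def support(itemset, carts):
    """Fraction of carts that contain every item in the itemset."""
    return sum(1 for cart in carts if itemset <= cart) / len(carts)

def pair_rules(carts, min_support=0.5, min_confidence=0.8):
    """Find rules of the form 'lhs -> rhs' over pairs of single items."""
    items = sorted(set().union(*carts))
    rules = []
    for a, b in combinations(items, 2):
        both = support({a, b}, carts)
        if both < min_support:
            continue  # the pair co-occurs too rarely to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: how often rhs is in the cart when lhs is.
            confidence = both / support({lhs}, carts)
            if confidence >= min_confidence:
                # Lift: confidence relative to how common rhs is overall.
                lift = confidence / support({rhs}, carts)
                rules.append((lhs, rhs, both, confidence, lift))
    return rules

# Hypothetical cart contents for Customers 1, 2, and 3.
carts = [
    {"Beer", "Bread", "Chips", "Diapers", "Salami"},
    {"Baby Wipes", "Beer", "Diapers"},
    {"Bread", "Corn Flakes", "Lettuce", "Mayonnaise", "Milk",
     "Peanut Butter", "Tomatoes"},
]

rules = pair_rules(carts)
for lhs, rhs, sup, conf, lift in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

Checking every pair is fine for 16 products but grows quickly with inventory size, which is why dedicated algorithms prune candidate itemsets by support before generating rules.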
  • 17. Selected Supervised Algorithms • Artificial neural networks (ANNs) – Use a simulation of biological neurons to create an interconnected system of elements that translates inputs accurately into outputs; can work well for systems with multiple outputs • Generalized additive models – Like general linear models (e.g., multiple regression), except that they relax constraints on the distributions of the input and output variables; can accommodate non-linear relations between input and output variables • Decision/classification/regression trees (CART) – Iteratively create a tree-like decision structure with internal branches that bifurcate on values of the input variables; each path from the root to a leaf translates particular input values into output values; results are easy to visualize and interpret • Support vector machines – Use a “kernel” algorithm to develop a separation line (or plane or hyperplane) that divides a set of observations into two classes (can also solve multi-class problems); results are hard to interpret, but the models can be highly accurate and generalizable
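The branching idea behind classification trees can be illustrated with a single split. The sketch below searches every input variable and candidate threshold for the split with the lowest weighted Gini impurity, which is one common criterion; the slides do not specify one. The data and variable meanings are hypothetical, and a full tree would apply the same search recursively to each branch.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (weighted impurity, feature index, threshold) of the best
    single binary split of the form 'feature <= threshold'."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds: midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Hypothetical employees: (tenure in years, satisfaction on a 1-10 scale).
rows = [(1, 2), (2, 6), (3, 3), (6, 7), (7, 4), (8, 8)]
labels = ["quit", "stay", "quit", "stay", "quit", "stay"]

best = best_split(rows, labels)
print(best)
```

In this toy data, tenure cannot separate the two outcomes cleanly but satisfaction can, so the search picks a satisfaction threshold; reading off such thresholds is what makes trees easy to interpret.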
  • 19. Data Mining Software Choices • R – Open source, free, many algorithms, Rattle GUI, command line difficult, little support • WEKA – Quasi-open source, free, great textbooks, nice GUI, little support • RapidMiner – Open source (registration required), paid training available, connections to R • SAS/Enterprise Miner – Proprietary, expensive, lots of support, lots of documentation • SPSS/Clementine – Proprietary, expensive, lots of support, lots of documentation • Statistica – Proprietary, workbench/workflow style interface good for beginners, support, documentation
  • 20. Selected References • Berkhin, P. (2006). A survey of clustering data mining techniques. Grouping Multidimensional Data, 25-71. • Bigus, J. (1996). Data mining with neural networks. McGraw-Hill. • Caragea, D., Cook, D., Wickham, H., & Honavar, V. (2008). Visual methods for examining SVM classifiers. Visual Data Mining, 136-153. • Elith, J., Leathwick, J., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802-813. • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18. • Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman & Hall/CRC. • Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480. • Stone, J. V. (2004). Independent component analysis: a tutorial introduction. The MIT Press. • Witten, I. H., Frank, E., Holmes, G., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.