A companion slide deck for this chapter:
Stanton, J. M. (2013). Data Mining: A Practical Introduction for Organizational Researchers. In Cortina, J. M., & Landis, R. S., Modern Research Methods for the Study of Behavior in Organizations. New York: Routledge Academic.
1. Data Mining: A Practical Introduction for Organizational Researchers
Jeffrey Stanton
Syracuse University
School of Information Studies
A chapter in “Modern Research Methods for the Study of Behavior in Organizations,” edited by Jose Cortina and Ron Landis, Routledge 2013 (pp. 199-232)
3. Data Can Serve Research in New Ways
• Available data on a scale millions of times larger than 20 years ago: customer transactions, sensor outputs, web documents, digital images and audio
• As a complementary alternative to the hypothetico-deductive method that has dominated social science research, what if we could use large, existing data sets to inductively discover new insights?
5. Other Examples
• Recommender functions (e.g., other people who bought this book also enjoyed…)
• The iris dataset: collected by R. A. Fisher; uses the ratios of measurements of plant attributes to classify species
• Soybean disease classification: determining the cause of disease based on sets of symptoms
• 1987-1988 Canadian labor contract negotiations: predicting which contracts fall through based on characteristics of the contracts
6. A Definition of Data Mining
• Data mining refers to the use of algorithms and computers to discover novel and interesting structure within data (Fayyad, Grinstein, & Wierse, 2002).
7. Examples of Data Mining Techniques
• Supervised learning: neural networks, support vector machines, boosted regression trees, classification and regression trees, general additive models
• Unsupervised learning: independent components analysis, k-means clustering, self-organizing maps, association rules mining
• Supervised learning is parallel in concept to the predictive statistical techniques used by many social science researchers, such as linear regression, but without the restriction of only exploring linear relationships.
• Unsupervised learning includes a variety of machine learning techniques that do not use a criterion or dependent variable, but rather look for patterns solely among “independent” variables.
8. Four Familiar Steps
• Pre-processing / data preparation → exploratory analysis / dimension reduction → model exploration and development → model interpretation / deployment
10. Data Pre-Processing
• Screening – detecting outliers, missing data, illegal values, unusual patterns, unexpected distributions, unusable coding schemes
• Diagnosis – identifying mechanisms of missing data, coding/entry errors, true extreme values, alternative distributions
• Repair – leaving data unchanged, missing-data mitigation, deletion of anomalous records, transformation, recoding, binning
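The screening–diagnosis–repair sequence above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's method: the values, the illegal code 999, and the two-standard-deviation cutoff are all invented for the example.

```python
import statistics

# Hypothetical records: None marks missing data; 999 is an
# out-of-range value that slipped through data entry.
ages = [34, 41, None, 29, 999, 38, 45, None, 31, 36]

# Screening: locate missing values and flag extreme values.
missing = [i for i, v in enumerate(ages) if v is None]
valid = [v for v in ages if v is not None]
mean, sd = statistics.mean(valid), statistics.stdev(valid)
outliers = [v for v in valid if abs(v - mean) > 2 * sd]

# Repair (one option among several listed above): drop the
# anomalous values and impute missing cases with the mean
# of the remaining clean values.
clean = [v for v in valid if v not in outliers]
repaired = [v if v is not None and v not in outliers
            else round(statistics.mean(clean)) for v in ages]
```

In practice the repair choice (deletion, imputation, transformation, or leaving the data unchanged) depends on the diagnosis step, which this sketch compresses into a simple outlier flag.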
11. Curse of Dimensionality
• Data mining tasks often begin with a dataset that has hundreds or even thousands of variables and little or no indication of which variables are important and should be retained versus those that can safely be discarded
• Analytical techniques used in the model-building phase of data mining depend upon “searching” through a multidimensional space for a set of locally or globally optimal coefficients
12. Addressing High Dimensionality
• Any data set with dozens or hundreds of variables is likely to have considerable redundancy as well as numerous variables that are not useful or relevant; two major methods for dealing with this:
– Feature selection: the process of choosing which variables to keep and which to discard; simplest method: screen each input-output pair with a Pearson correlation (or, more efficiently, with a form of multiple regression); major goal is to discard input variables that are unlikely to contribute to the analysis
– Feature extraction: the process of replacing a large, redundant set of variables with a smaller number of non-redundant variables; simplest method: principal components analysis; major goal is to combine (linearly or non-linearly) the redundant set into a smaller non-redundant set
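The simplest feature-selection method named above, screening each input against the output with a Pearson correlation, can be sketched as follows. The data, the feature names, and the 0.5 threshold are invented for illustration; real screens typically use significance tests or regression-based criteria.

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation of two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical inputs: x1 tracks the output closely; x2 is noise.
y = [2.0, 4.1, 6.0, 8.2, 10.1, 12.0]
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [5.0, 1.0, 4.0, 2.0, 5.0, 1.0],
}

# Keep only features whose absolute correlation with the output
# exceeds an (arbitrary) screening threshold.
kept = {name for name, col in features.items()
        if abs(pearson_r(col, y)) > 0.5}
```

Here the screen retains x1 and discards x2, which is the "ditch the unlikely contributors" goal of feature selection in miniature.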
14. Algorithm/Model Selection
• Within a family of DM techniques (i.e., supervised or unsupervised) there will almost always be multiple choices of algorithms
• How to decide which one to use?
• Given the empirical nature of data mining, it is often satisfactory to choose the algorithm that “works best” (i.e., has the lowest error rate) across the largest amount of evaluation (validation) data
• What is training data versus evaluation data?
[Figure: model-building screen from Statistica]
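The training-versus-evaluation distinction above can be made concrete with a small sketch: models are fit on training data, but the "works best" decision is made on held-out validation data the models never saw. The data generator, the two stand-in models, and the 150/50 split are all invented for illustration.

```python
import random

# Hypothetical labeled data: x is a single score; y mostly follows
# the rule "x above 0.5 means class 1", with ~10% label noise.
random.seed(0)
data = [(x, int(x > 0.5) if random.random() > 0.1 else int(x <= 0.5))
        for x in [random.random() for _ in range(200)]]

# Hold out a validation set; only the training rows inform the models.
random.shuffle(data)
train, valid = data[:150], data[150:]

def error_rate(model, rows):
    return sum(model(x) != y for x, y in rows) / len(rows)

# Two deliberately simple stand-ins for competing algorithms:
# a threshold rule and a majority-class rule fit on the training data.
threshold_model = lambda x: int(x > 0.5)
majority = max({0, 1}, key=lambda c: sum(y == c for _, y in train))
majority_model = lambda x: majority

candidates = {"threshold": threshold_model, "majority": majority_model}
best = min(candidates, key=lambda name: error_rate(candidates[name], valid))
```

Real model selection compares genuine algorithms (trees, SVMs, neural networks) the same way: lowest error on the evaluation data wins, which guards against rewarding a model that merely memorized its training data.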
15. Selected Unsupervised Algorithms
• Association rules mining / market basket analysis – Looks for combinations of items that occur together
• Independent components analysis – Conceptually similar to principal components analysis, but can work on variables that are not jointly normally distributed; a form of blind source/signal separation
• K-means clustering – Organizes a set of observations into clusters, where the observations in a group cluster closely around a centroid/mean
• Self-organizing maps – Similar to multidimensional scaling; takes a high-dimensional problem and translates it into a low-dimensional space so it can be visualized; uses neural networks to process data
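Of the algorithms above, k-means is simple enough to sketch in full. This is a minimal one-dimensional illustration with k = 2; the points and starting centroids are invented, and real implementations add smarter initialization and a convergence check.

```python
# Minimal k-means on one-dimensional data (k = 2).
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids = [0.0, 5.0]  # arbitrary initial guesses

for _ in range(10):  # a few iterations; assignments stabilize quickly
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]
```

After convergence the two centroids sit at the means of the two natural groups, which is exactly the "observations cluster closely around a centroid/mean" behavior described above.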
17. Selected Supervised Algorithms
• Artificial neural networks (ANNs) – Use a simulation of biological neurons to create an interconnected system of elements that translates inputs accurately into outputs; can work well for systems with multiple outputs
• General additive models – Like general linear models (e.g., multiple regression), except that they relax constraints on the distributions of the input and output variables; can accommodate non-linear relations between input and output variables
• Decision/classification/regression trees (CART) – Iteratively create a tree-like decision structure with internal branches that bifurcate on values of the input variables; each path from the root to a leaf translates particular input values into output values; results are easy to visualize and interpret
• Support vector machines – Use a “kernel” algorithm to develop a separating line (or plane, or hyperplane) that divides a set of observations into two classes (can also solve multi-class problems); results are hard to interpret, but the models can be highly accurate and generalizable
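The core step of the tree algorithms above, choosing where a branch should bifurcate, can be sketched for a single input variable. The data are invented, and this shows only one split chosen by misclassification count; full CART grows many splits recursively and typically uses impurity measures such as the Gini index.

```python
# One CART-style split on a single input variable: pick the split
# point that best separates two classes.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

def misclassified(split):
    # Each branch predicts the majority class on its side of the split;
    # the score is how many training cases that prediction gets wrong.
    left = [y for x, y in zip(xs, ys) if x < split]
    right = [y for x, y in zip(xs, ys) if x >= split]
    errs = 0
    for side in (left, right):
        if side:
            majority = max(set(side), key=side.count)
            errs += sum(y != majority for y in side)
    return errs

# Candidate splits: midpoints between consecutive input values.
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
best_split = min(candidates, key=misclassified)
```

Repeating this search on each resulting branch is what produces the interpretable root-to-leaf paths described above.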
19. Data Mining Software Choices
• R – Open source, free, many algorithms, Rattle GUI; command line is difficult, little support
• WEKA – Quasi-open source, free, great textbooks, nice GUI, little support
• RapidMiner – Open source (registration required), paid training available, connections to R
• SAS/Enterprise Miner – Proprietary, expensive, lots of support, lots of documentation
• SPSS/Clementine – Proprietary, expensive, lots of support, lots of documentation
• Statistica – Proprietary; workbench/workflow-style interface good for beginners; support, documentation
20. Selected References
• Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping Multidimensional Data (pp. 25-71).
• Bigus, J. (1996). Data mining with neural networks. New York: McGraw-Hill.
• Caragea, D., Cook, D., Wickham, H., & Honavar, V. (2008). Visual methods for examining SVM classifiers. In Visual Data Mining (pp. 136-153).
• Elith, J., Leathwick, J., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802-813.
• Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
• Hastie, T., & Tibshirani, R. (1990). Generalized additive models. London: Chapman & Hall/CRC.
• Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480.
• Stone, J. V. (2004). Independent component analysis: A tutorial introduction. Cambridge, MA: MIT Press.
• Witten, I. H., Frank, E., Holmes, G., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann.