Basic Overview of Data Mining


A companion slide deck for this chapter:

Stanton, J. M. (2013). Data Mining: A Practical Introduction for Organizational Researchers. In Cortina, J. M., & Landis, R. S., Modern Research Methods for the Study of Behavior in Organizations. New York: Routledge Academic.

  1. 1. Data Mining: A Practical Introduction for Organizational Researchers
     Jeffrey Stanton, Syracuse University, School of Information Studies
     A chapter in “Modern Research Methods for the Study of Behavior in Organizations,” edited by Jose Cortina and Ron Landis, Routledge 2013 (pp. 199-232)
  2. 2. We Are Awash in Data
  3. 3. Data Can Serve Research in New Ways
     • Available data on a scale millions of times larger than 20 years ago: customer transactions; sensor outputs; web documents; digital images and audio
     • As a complementary alternative to the hypothetico-deductive method that has dominated social science research, what if we could use large, existing data sets to inductively discover new insights?
  4. 4. The Classic Example
     [Figure: customers, their carts, and the store inventory. Carts: Customer 1 (Items 1-5), Customer 2 (Items 1-3), Customer 3 (Items 1-7). Store inventory: Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, Toothpaste.]
  5. 5. Other Examples
     • Recommender functions (e.g., other people who bought this book also enjoyed…)
     • The Irises dataset: Collected by R.A. Fisher, uses the ratios of measurements of plant attributes to classify species
     • Soybean disease classification: determining the cause of disease based on symptom sets
     • 1987-1988 Canadian labor contract negotiations: predicting which contracts fall through based on characteristics of contracts
  6. 6. A Definition of Data Mining
     • Data mining refers to the use of algorithms and computers to discover novel and interesting structure within data (Fayyad, Grinstein, & Wierse, 2002).
  7. 7. Examples of Data Mining Techniques
     • Supervised learning: neural networks; support vector machines; boosted regression trees; classification and regression trees; generalized additive models
     • Unsupervised learning: independent components analysis; k-means clustering; self-organizing maps; association rules mining
     • Supervised learning is parallel in concept to the predictive statistical techniques used by many social science researchers, such as linear regression, but without the restriction of only exploring linear relationships.
     • Unsupervised learning includes a variety of machine learning techniques that do not use a criterion or dependent variable, but rather look for patterns solely among “independent” variables.
  8. 8. Four Familiar Steps
     • Pre-processing / Data Preparation
     • Exploratory Analysis / Dimension Reduction
     • Model Exploration and Development
     • Model Interpretation / Deployment
  9. 9. Data Mining Flowchart
  10. 10. Data Pre-Processing
     • Screening – Detecting outliers, missing data, illegal values, unusual patterns, unexpected distributions, unusable coding schemes
     • Diagnosis – Mechanisms of missing data, coding/entry errors, true extreme values, alternative distributions
     • Repair – Leave data unchanged, missing data mitigation, deletion of anomalous records, transformation, recoding, binning
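The screening and repair steps on the pre-processing slide can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the field names, the legal ranges, and the choice of mean imputation as the "missing data mitigation" are all hypothetical and are not taken from the chapter.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing value.
records = [
    {"age": 34, "satisfaction": 4},
    {"age": None, "satisfaction": 5},
    {"age": 29, "satisfaction": 9},  # illegal if the scale is assumed to run 1-5
    {"age": 41, "satisfaction": 3},
]


def screen(rows, field, low, high):
    """Screening: return row indices with missing values and with out-of-range values."""
    missing = [i for i, r in enumerate(rows) if r[field] is None]
    illegal = [
        i for i, r in enumerate(rows)
        if r[field] is not None and not (low <= r[field] <= high)
    ]
    return missing, illegal


def impute_mean(rows, field):
    """Repair: return copies of the rows with missing values replaced by the observed mean."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


missing_age, illegal_age = screen(records, "age", 18, 99)
missing_sat, illegal_sat = screen(records, "satisfaction", 1, 5)
repaired = impute_mean(records, "age")

print("missing age rows:", missing_age)
print("illegal satisfaction rows:", illegal_sat)
```

The diagnosis step (deciding why a value is missing or extreme) is deliberately left out, since it usually calls for judgment about the data source rather than a mechanical rule.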
  11. 11. Curse of Dimensionality
     • Data mining tasks often begin with a dataset that has hundreds or even thousands of variables and little or no indication of which of the variables are important and should be retained versus those that can safely be discarded
     • Analytical techniques used in the model building phase of data mining depend upon “searching” through a multidimensional space for a set of locally or globally optimal coefficients
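One way to see why the search described above gets harder as variables are added is to count the points an exhaustive grid search would have to visit. Grid search is used here only as an illustrative assumption; the slide does not name a particular search method, and `grid_points` is a hypothetical helper.

```python
def grid_points(steps_per_dimension: int, dimensions: int) -> int:
    """Number of coefficient combinations on a regular grid over the search space."""
    return steps_per_dimension ** dimensions


# With 10 candidate values per coefficient, the grid grows exponentially
# with the number of variables.
for d in (1, 2, 5, 10, 100):
    print(f"{d} variables: {grid_points(10, d)} grid points")
```

Practical algorithms avoid exhaustive search, but the same growth in the size of the space is what makes reducing the number of variables worthwhile before model building.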
  12. 12. Addressing High Dimensionality
     • Any data set with dozens or hundreds of variables is likely to have considerable redundancy in it as well as numerous variables that are not useful or relevant; two main methods for dealing with this:
       – Feature selection: The process of choosing which variables to keep and which to discard; simplest method: screen each input-output pair with a Pearson correlation (or more efficiently with a form of multiple regression); major goal is to drop input variables that are unlikely to contribute to the analysis
       – Feature extraction: The process of replacing a large set of variables that contain redundancy with a smaller number of non-redundant variables; simplest method: principal components analysis; major goal is to combine (linearly or non-linearly) a redundant set into a smaller non-redundant set
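The "simplest method" of feature selection named above, screening each input against the output with a Pearson correlation, might look like the following sketch. The variable names, the data, and the 0.3 cutoff are assumptions chosen for illustration; the slide does not specify a threshold.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (a constant input would divide by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def select_features(inputs, output, threshold=0.3):
    """Keep inputs whose absolute correlation with the output meets the threshold."""
    scores = {name: pearson(values, output) for name, values in inputs.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Hypothetical data: one input related to the output, one mostly unrelated.
inputs = {
    "tenure": [1, 2, 3, 4, 5, 6],
    "noise": [3, 1, 3, 1, 3, 1],
}
output = [2, 4, 5, 4, 7, 8]

kept, scores = select_features(inputs, output)
print(kept, scores)
```

Screening one input at a time cannot detect variables that matter only in combination, which is one reason the slide mentions multiple regression as an alternative.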
  13. 13. ICA Example
  14. 14. Algorithm/Model Selection
     • Within a family of DM techniques (i.e., supervised or unsupervised) there will almost always be multiple choices of algorithms
     • How to decide which one to use?
     • Given the empirical nature of data mining, it is often satisfactory to choose the algorithm that “works best” (i.e., has the lowest error rate) across the largest amount of evaluation (validation) data
     • What is training data versus evaluation data?
     [Figure: model building screen from Statistica]
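The distinction between training and evaluation data can be made concrete with a small holdout comparison: each candidate is fitted on the training rows only, and the one with the lowest error rate on the held-out rows is chosen. The data, the fixed split, and the two candidate models (a majority-class baseline and a one-nearest-neighbor rule) are hypothetical stand-ins, not the algorithms the chapter evaluates.

```python
from collections import Counter

# Hypothetical labeled observations: (feature value, class label).
data = [
    (1.0, "a"), (1.2, "a"), (0.8, "a"), (3.0, "b"), (3.3, "b"), (1.1, "a"),
    (2.9, "b"), (3.1, "b"), (0.9, "a"), (1.3, "a"),
]

# Training data is used to fit models; evaluation data is held back
# and used only to measure error.
train, evaluation = data[:6], data[6:]


def fit_majority(rows):
    """Baseline: always predict the most common label in the training rows."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_neighbor(rows):
    """Predict the label of the closest training observation."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]


def error_rate(model, rows):
    """Share of rows the model labels incorrectly."""
    return sum(model(x) != y for x, y in rows) / len(rows)


candidates = {
    "majority": fit_majority(train),
    "1-NN": fit_nearest_neighbor(train),
}
errors = {name: error_rate(model, evaluation) for name, model in candidates.items()}
best = min(errors, key=errors.get)
print(errors, "->", best)
```

Measuring error on rows the models never saw during fitting is what keeps the comparison from simply rewarding the model that memorizes the training data best.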
  15. 15. Selected Unsupervised Algorithms
     • Association rules mining / Market basket analysis – Looks for combinations of items that occur together
     • Independent Components Analysis – Conceptually similar to principal components analysis, but can work on variables that are not jointly normally distributed; a form of blind source/signal separation
     • K-means clustering – Organizes a set of observations into clusters, where observations in a group cluster closely around a centroid/mean
     • Self-organizing maps – Similar to multidimensional scaling; takes a high-dimensional problem and translates it into low-dimensional space so it can be visualized; uses neural networks to process data
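As an illustration of the k-means idea described above (observations grouped around a centroid), here is a minimal sketch of the usual assign-then-update loop. The points are hypothetical, and seeding the centroids with the first k points is a simplifying assumption made to keep the example deterministic; real implementations typically use random or smarter initialization.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(points[:k])  # simplistic seeding, assumed for this sketch
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical two-dimensional observations forming two loose groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No dependent variable appears anywhere in the loop, which is what makes this an unsupervised technique in the sense used earlier in the deck.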
  16. 16. Association Rules Mining Example
     [Figure: customers, their carts, and the store inventory. Carts: Customer 1 (Items 1-5), Customer 2 (Items 1-3), Customer 3 (Items 1-7). Store inventory: Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, Toothpaste.]
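The market basket idea can be sketched with the two standard measures used to rank association rules, support and confidence. The baskets below are hypothetical: they borrow item names from the store inventory in the figure, but the transcript does not record which items each customer actually bought.

```python
# Hypothetical carts; each set is one customer's basket.
baskets = [
    {"Beer", "Diapers", "Chips"},
    {"Beer", "Diapers", "Bread"},
    {"Bread", "Milk", "Peanut Butter"},
    {"Beer", "Chips"},
    {"Diapers", "Baby Wipes", "Beer"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


pair_support = support({"Diapers", "Beer"}, baskets)
diapers_to_beer = confidence({"Diapers"}, {"Beer"}, baskets)
beer_to_diapers = confidence({"Beer"}, {"Diapers"}, baskets)

print(f"support(Diapers, Beer) = {pair_support:.2f}")
print(f"confidence(Diapers -> Beer) = {diapers_to_beer:.2f}")
print(f"confidence(Beer -> Diapers) = {beer_to_diapers:.2f}")
```

A full association rules miner would also need a strategy for enumerating candidate itemsets efficiently; this sketch only scores itemsets that are supplied by hand.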
  17. 17. Selected Supervised Algorithms
     • Artificial neural networks (ANNs) – Use a simulation of biological neurons to create an interconnected system of elements that translates inputs accurately into outputs; can work well for systems with multiple outputs
     • Generalized additive models – Like general linear models (e.g., multiple regression) but with relaxed constraints on the distributions of the input and output variables; can accommodate non-linear relations between input and output variables
     • Decision/classification/regression trees (CART) – Iteratively create a tree-like decision structure with internal branches that bifurcate on values of the input variables; each path from the root to a leaf translates particular input values into output values; results are easy to visualize and interpret
     • Support vector machines – Use a “kernel” algorithm to develop a separation line (or plane or hyperplane) that divides a set of observations into two classes (can also solve multi-class problems); results are hard to interpret, but the models can be highly accurate and generalizable
  18. 18. CART Example
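The figure for this slide is not reproduced in the transcript. As a stand-in, the sketch below shows the single step a classification tree repeats at every internal branch: choosing the threshold on an input variable that best separates the classes. Gini impurity is assumed as the split criterion (a common choice for classification trees), and the variable names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels agree, larger when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Try midpoints between adjacent distinct values and return the
    (threshold, weighted impurity) pair with the lowest impurity."""
    best = None
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical data: years of tenure and whether the employee left.
tenure = [1, 2, 3, 6, 7, 8]
left_job = ["yes", "yes", "yes", "no", "no", "no"]

threshold, impurity = best_split(tenure, left_job)
print(f"split at tenure <= {threshold}, weighted impurity {impurity}")
```

Growing a full tree means applying the same search recursively to the rows on each side of the chosen split, across all input variables, until a stopping rule is met.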
  19. 19. Data Mining Software Choices
     • R – Open source, free, many algorithms, Rattle GUI, command line difficult, little support
     • WEKA – Quasi-open source, free, great textbooks, nice GUI, little support
     • RapidMiner – Open source (registration required), paid training available, connections to R
     • SAS/Enterprise Miner – Proprietary, expensive, lots of support, lots of documentation
     • SPSS/Clementine – Proprietary, expensive, lots of support, lots of documentation
     • Statistica – Proprietary, workbench/workflow style interface good for beginners, support, documentation
  20. 20. Selected References
     • Berkhin, P. (2006). A survey of clustering data mining techniques. Grouping Multidimensional Data, 25-71.
     • Bigus, J. (1996). Data mining with neural networks. McGraw-Hill, USA.
     • Caragea, D., Cook, D., Wickham, H., & Honavar, V. (2008). Visual methods for examining SVM classifiers. Visual Data Mining, 136-153.
     • Elith, J., Leathwick, J., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802-813.
     • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
     • Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman & Hall/CRC.
     • Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480.
     • Stone, J. V. (2004). Independent component analysis: a tutorial introduction. The MIT Press.
     • Witten, I. H., Frank, E., Holmes, G., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.