Stanton, J. M. (2013). Data Mining: A Practical Introduction for Organizational Researchers. In Cortina, J. M., & Landis, R. S., Modern Research Methods for the Study of Behavior in Organizations. New York: Routledge Academic.

- 1. Data Mining: A Practical Introduction for Organizational Researchers. Jeffrey Stanton, Syracuse University, School of Information Studies. A chapter in “Modern Research Methods for the Study of Behavior in Organizations,” edited by Jose Cortina and Ron Landis, Routledge 2013 (pp. 199-232)
- 2. We Are Awash in Data
- 3. Data Can Serve Research in New Ways
  • Available data on a scale millions of times larger than 20 years ago: customer transactions; sensor outputs; web documents; digital images and audio
  • As a complementary alternative to the hypothetico-deductive method that has dominated social science research, what if we could use large, existing data sets to inductively discover new insights?
- 4. The Classic Example
  • Customers and carts: Customer 1 (Items 1-5); Customer 2 (Items 1-3); Customer 3 (Items 1-7)
  • Store inventory: Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, Toothpaste
- 5. Other Examples
  • Recommender functions (e.g., other people who bought this book also enjoyed…)
  • The Irises dataset: Collected by R.A. Fisher, uses the ratios of measurements of plant attributes to classify species
  • Soybean disease classification: determining the cause of disease based on symptom sets
  • 1987-1988 Canadian labor contract negotiations: predicting which contracts fall through based on characteristics of contracts
- 6. A Definition of Data Mining
  • Data mining refers to the use of algorithms and computers to discover novel and interesting structure within data (Fayyad, Grinstein, & Wierse, 2002).
- 7. Examples of Data Mining Techniques
  • Supervised learning: neural networks; support vector machines; boosted regression trees; classification and regression trees; general additive models
  • Unsupervised learning: independent components analysis; k-means clustering; self-organizing maps; association rules mining
  • Supervised learning is parallel in concept to the predictive statistical techniques used by many social science researchers, such as linear regression, but without the restriction of only exploring linear relationships.
  • Unsupervised learning includes a variety of machine learning techniques that do not use a criterion or dependent variable, but rather look for patterns solely among “independent” variables.
- 8. Four Familiar Steps
  1. Pre-processing / Data Preparation
  2. Exploratory Analysis / Dimension Reduction
  3. Model Exploration and Development
  4. Model Interpretation / Deployment
- 9. Data Mining Flowchart
- 10. Data Pre-Processing
  • Screening – Detecting outliers, missing data, illegal values, unusual patterns, unexpected distributions, unusable coding schemes
  • Diagnosis – Mechanisms of missing data, coding/entry errors, true extreme values, alternative distributions
  • Repair – Leave data unchanged, missing data mitigation, deletion of anomalous records, transformation, recoding, binning
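
The screening step on slide 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names `age` and `tenure_years`, the legal age range, and the outlier rule (a modified z-score based on the median absolute deviation, with the commonly used 3.5 cutoff) are all hypothetical choices and do not come from the chapter.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "tenure_years": 5},
    {"id": 2, "age": None, "tenure_years": 2},
    {"id": 3, "age": 29, "tenure_years": 3},
    {"id": 4, "age": 212, "tenure_years": 4},   # outside any plausible age range
    {"id": 5, "age": 41, "tenure_years": 38},   # legal but extreme tenure
    {"id": 6, "age": 38, "tenure_years": 6},
    {"id": 7, "age": 45, "tenure_years": 4},
]


def screen(records, field, legal_range=None, cutoff=3.5):
    """Return ids of records whose `field` is missing, illegal, or an outlier."""
    missing = [r["id"] for r in records if r[field] is None]
    present = [r for r in records if r[field] is not None]

    illegal = []
    if legal_range is not None:
        low, high = legal_range
        illegal = [r["id"] for r in present if not low <= r[field] <= high]
        present = [r for r in present if low <= r[field] <= high]

    # Modified z-score: robust to the very outliers it is trying to find.
    values = [r[field] for r in present]
    center = median(values)
    mad = median([abs(v - center) for v in values])
    outliers = [
        r["id"] for r in present
        if mad > 0 and 0.6745 * abs(r[field] - center) / mad > cutoff
    ]
    return {"missing": missing, "illegal": illegal, "outliers": outliers}


age_report = screen(records, "age", legal_range=(16, 100))  # assumed legal range
tenure_report = screen(records, "tenure_years")
print(age_report)
print(tenure_report)
```

Screening only flags records. Whether a flagged value is an entry error or a true extreme value is the diagnosis step, and what to do about it is the repair step.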
- 11. Curse of Dimensionality
  • Data mining tasks often begin with a dataset that has hundreds or even thousands of variables and little or no indication of which of the variables are important and should be retained versus those that can safely be discarded
  • Analytical techniques used in the model building phase of data mining depend upon “searching” through a multidimensional space for a set of locally or globally optimal coefficients
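
A small piece of arithmetic shows why searching a multidimensional space becomes hard. The sketch below assumes each variable is cut into 10 bins (an arbitrary illustrative choice) and asks what fraction of the resulting grid cells a dataset of one million observations could occupy at most.

```python
def grid_cells(n_variables, bins_per_variable=10):
    """Number of cells when every variable is split into the same number of bins."""
    return bins_per_variable ** n_variables


def max_occupied_fraction(n_observations, n_variables, bins_per_variable=10):
    """Upper bound on the share of cells that can hold at least one observation."""
    return min(1.0, n_observations / grid_cells(n_variables, bins_per_variable))


n = 1_000_000  # hypothetical sample size
fractions = {d: max_occupied_fraction(n, d) for d in (3, 6, 10, 20)}
for d, fraction in fractions.items():
    print(f"{d:>2} variables: at most {fraction:.0e} of the cells are occupied")
```

With 6 variables a million observations could in principle cover every cell; with 10 variables they can cover at most one cell in ten thousand, so most of the space the model searches contains no data at all.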
- 12. Addressing High Dimensionality
  • Any data set with dozens or hundreds of variables is likely to have considerable redundancy in it, as well as numerous variables that are not useful or relevant. Two main methods for dealing with this:
    – Feature selection: The process of choosing which variables to keep and which to discard; simplest method: screen each input-output pair with a Pearson correlation (or, more efficiently, with a form of multiple regression); the major goal is to drop input variables that are unlikely to contribute to the analysis
    – Feature extraction: The process of replacing a large set of variables that contain redundancy with a smaller number of non-redundant variables; simplest method: principal components analysis; the major goal is to combine (linearly or non-linearly) a redundant set into a smaller non-redundant set
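
The simplest feature-selection method named on slide 12, screening each input against the output with a Pearson correlation, might look like the sketch below. The variable names, the data, and the cutoff of |r| ≥ 0.3 are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_features(inputs, output, min_abs_r=0.3):
    """Keep inputs whose absolute correlation with the output meets the cutoff."""
    scores = {name: pearson(column, output) for name, column in inputs.items()}
    kept = [name for name, r in scores.items() if abs(r) >= min_abs_r]
    return kept, scores


# Hypothetical data: six employees, one output and four candidate inputs.
performance = [1, 2, 3, 4, 5, 6]
inputs = {
    "hours_training": [2, 4, 5, 8, 10, 12],
    "absences": [6, 5, 5, 3, 2, 1],
    "badge_digit": [4, 1, 5, 5, 1, 4],   # unrelated to the output
    "site_code": [7, 7, 7, 7, 7, 7],     # constant, carries no information
}

kept, scores = select_features(inputs, performance)
print(kept)
```

A screen like this only detects linear relations, so an input with a strong non-linear link to the output could be discarded by mistake; that is one reason to treat the cutoff as a judgment call.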
- 13. ICA Example
- 14. Algorithm/Model Selection
  • Within a family of DM techniques (i.e., supervised or unsupervised) there will almost always be multiple choices of algorithms
  • How to decide which one to use?
  • Given the empirical nature of data mining, it is often satisfactory to choose the algorithm that “works best” (i.e., has the lowest error rate) across the largest amount of evaluation (validation) data
  • What is training data versus evaluation data?
  • (Figure: model building screen from Statistica)
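
The distinction between training and evaluation data can be made concrete with a toy comparison. The sketch below is an assumption-laden illustration, not a procedure from the chapter: the data are invented, the split is a deterministic every-other-row split (a random split is more usual), and the two "algorithms" are a majority-class baseline and a one-threshold rule.

```python
def fit_majority(train):
    """Baseline: always predict the most common training label."""
    ones = sum(label for _, label in train)
    majority = 1 if ones * 2 >= len(train) else 0
    return lambda x: majority


def fit_threshold(train):
    """Predict 1 when x exceeds the cut that gives the fewest training errors."""
    best_cut, best_errors = None, None
    for cut in sorted({x for x, _ in train}):
        errors = sum(int(x > cut) != label for x, label in train)
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return lambda x: int(x > best_cut)


def error_rate(model, data):
    return sum(model(x) != label for x, label in data) / len(data)


# Hypothetical data: label is mostly 1 when x > 10, with two noisy cases.
labels = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
data = list(zip(range(1, 21), labels))
train, evaluation = data[0::2], data[1::2]

candidates = {"majority": fit_majority, "threshold": fit_threshold}
models = {name: fit(train) for name, fit in candidates.items()}
train_errors = {name: error_rate(m, train) for name, m in models.items()}
eval_errors = {name: error_rate(m, evaluation) for name, m in models.items()}
best = min(eval_errors, key=eval_errors.get)
print(train_errors, eval_errors, best)
```

The threshold rule makes no errors on the data it was fitted to but misclassifies 30% of the held-out cases, which is why the comparison between algorithms is made on evaluation data rather than training data.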
- 15. Selected Unsupervised Algorithms
  • Association rules mining / Market basket analysis – Looks for combinations of items that occur together
  • Independent Components Analysis – Conceptually similar to principal components analysis, but can work on variables that are not jointly normally distributed; a form of blind source/signal separation
  • K-means clustering – Organizes a set of observations into clusters, where observations in a group cluster closely around a centroid/mean
  • Self-organizing maps – Similar to multidimensional scaling; takes a high-dimensional problem and translates it into a low-dimensional space so it can be visualized; uses neural networks to process data
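
To make the k-means description concrete, here is a minimal sketch of the standard assign-then-update loop (often called Lloyd's algorithm). The points and the choice of the first two points as starting centroids are hypothetical; practical implementations usually use random or smarter initialization and several restarts.

```python
from math import dist


def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]   # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:     # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-dimensional observations forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=points[:2])
print(centroids)
print(clusters)
```

Even from a poor start (both initial centroids sit in the same group), the loop settles on one centroid per group after a few iterations on this small example.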
- 16. Association Rules Mining Example
  • Customers and carts: Customer 1 (Items 1-5); Customer 2 (Items 1-3); Customer 3 (Items 1-7)
  • Store inventory: Baby Wipes, Beer, Bread, Cheddar, Chips, Corn Flakes, Diapers, Lettuce, Mayonnaise, Milk, Peanut Butter, Salami, Shampoo, Sponges, Tomatoes, Toothpaste
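
The slide does not say what is in each cart, so the sketch below fills three carts of the same sizes (5, 3, and 7 items) with invented contents drawn from the store inventory. It then computes the two standard quantities behind association rules: support (the share of carts containing an itemset) and confidence (the share of carts containing the antecedent that also contain the consequent). The thresholds are arbitrary.

```python
from itertools import permutations

# Hypothetical cart contents; only the cart sizes follow the slide.
baskets = [
    {"Beer", "Bread", "Cheddar", "Chips", "Diapers"},
    {"Baby Wipes", "Beer", "Diapers"},
    {"Bread", "Lettuce", "Mayonnaise", "Milk", "Peanut Butter", "Salami", "Tomatoes"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def pair_rules(baskets, min_support=0.5, min_confidence=0.8):
    """Enumerate one-item -> one-item rules that pass both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in permutations(items, 2):
        s = support({a, b}, baskets)
        if s >= min_support:
            c = confidence({a}, {b}, baskets)
            if c >= min_confidence:
                rules.append((a, b, s, c))
    return rules


rules = pair_rules(baskets)
for a, b, s, c in rules:
    print(f"{a} -> {b}: support {s:.2f}, confidence {c:.2f}")
```

Enumerating every pair works for a toy inventory; algorithms designed for this task, such as Apriori, prune itemsets that cannot reach the support threshold so that the search scales to real transaction data.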
- 17. Selected Supervised Algorithms
  • Artificial neural networks (ANNs) – Use a simulation of biological neurons to create an interconnected system of elements that translates inputs accurately into outputs; can work well for systems with multiple outputs
  • Generalized additive models – Like general linear models (e.g., multiple regression) but with relaxed constraints on the distributions of the input and output variables; can accommodate non-linear relations between input and output variables
  • Decision/classification/regression trees (CART) – Iteratively create a tree-like decision structure with internal branches that bifurcate on values of the input variables; each path from the root to a leaf translates particular input values into output values; results are easy to visualize and interpret
  • Support vector machines – Use a “kernel” algorithm to develop a separation line (or plane or hyperplane) that divides a set of observations into two classes (can also solve multi-class problems); results are hard to interpret, but the models can be highly accurate and generalizable
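
The way a classification tree "bifurcates on values of the input variables" can be illustrated with a single split. The sketch below searches every feature and candidate threshold for the split with the lowest weighted Gini impurity, which is one common splitting criterion; a full tree would apply the same search recursively to each branch and add stopping or pruning rules. The contract data and field names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (weighted impurity, feature, threshold) of the best single split."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between adjacent observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold)
    return best


# Hypothetical contract records and outcomes.
rows = [
    {"wage_increase": 2.0, "vacation_days": 10},
    {"wage_increase": 2.5, "vacation_days": 15},
    {"wage_increase": 3.0, "vacation_days": 12},
    {"wage_increase": 4.0, "vacation_days": 11},
    {"wage_increase": 4.5, "vacation_days": 14},
    {"wage_increase": 5.0, "vacation_days": 13},
]
labels = ["failed", "failed", "failed", "settled", "settled", "settled"]

split = best_split(rows, labels)
print(split)
```

The chosen split reads directly as a rule ("wage increase at or below 3.5: failed; above: settled"), which is the interpretability advantage the slide attributes to trees.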
- 18. CART Example
- 19. Data Mining Software Choices
  • R – Open source, free, many algorithms, Rattle GUI, command line difficult, little support
  • WEKA – Quasi-open source, free, great textbooks, nice GUI, little support
  • RapidMiner – Open source (registration required), paid training available, connections to R
  • SAS/Enterprise Miner – Proprietary, expensive, lots of support, lots of documentation
  • SPSS/Clementine – Proprietary, expensive, lots of support, lots of documentation
  • Statistica – Proprietary, workbench/workflow style interface good for beginners, support, documentation
- 20. Selected References
  • Berkhin, P. (2006). A survey of clustering data mining techniques. Grouping Multidimensional Data, 25-71.
  • Bigus, J. (1996). Data mining with neural networks. McGraw-Hill, USA.
  • Caragea, D., Cook, D., Wickham, H., & Honavar, V. (2008). Visual methods for examining SVM classifiers. Visual Data Mining, 136-153.
  • Elith, J., Leathwick, J., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802-813.
  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
  • Hastie, T., & Tibshirani, R. (1990). Generalized additive models. Chapman & Hall/CRC.
  • Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464-1480.
  • Stone, J. V. (2004). Independent component analysis: a tutorial introduction. The MIT Press.
  • Witten, I. H., Frank, E., Holmes, G., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
