Colleen Farrelly

Jun. 7, 2017

From my graduate work and extended to the field of education. Citation of paper from which presentation was derived: Farrelly, C. M., Schwartz, S. J., Amodeo, A. L., Feaster, D. J., Steinley, D. L., Meca, A., & Picariello, S. (2017). The Analysis of Bridging Constructs with Hierarchical Clustering Methods: An application to identity. Journal of Research in Personality.

- 1. Hierarchical Clustering and Topology for Psychometric Validation By Colleen M. Farrelly
- 2. Creating a New Survey: Psychometrics Many types of surveys and tests exist for assessing academic achievement, psychological traits, or sociological constructs; the field that studies the construction and functioning of tests is called psychometrics. Sometimes a new survey must be created, either to improve upon a previous or discontinued one or to assess a new idea or context for a given behavior or trait. These new surveys pose several statistical challenges: (1) consistency within a survey (measuring what the survey is thought to measure), assessed with Cronbach’s alpha, differential item functioning, and similar tools; (2) validation across samples (measuring the same thing across populations and time), assessed with exploratory factor analysis followed by confirmatory factor analysis; (3) subscales for easier computation and interpretation of results (identifying which items on the survey function similarly). Statistical frameworks exist for addressing these challenges, but they typically require large sample sizes and assume that certain structures underlie the survey design.
- 3. Example Survey a) Red:Rainbow::July:____ (Month, Year, Hot, Cloud) b) Soothing:Anodyne::____:Esoteric (Eccentric, School, Abstruse, Calming) c) Pyrrhic:Victory::Potemkin:____ (Village, Battle, Hollow, Achilles) d) Stegosaurus:Jurassic::Trilobite:____ (Triassic, Dinosaur, Mesozoic, Cambrian) e) Mice:Men::Cabbages:____ (Women, Lettuce, Salad, Kings) f) Fill in the following series: 1, 1/8, 1/27, 1/64, ___ g) Fill in the following series: ___, 25, 168, 1229, 9592 h) Fill in the following series: 3, ___, 4, 1, 5
- 4. Factor Analysis Creation of new surveys requires internal and external validation, typically done through factor analysis. Exploratory factor analysis is used to cluster items measuring similar underlying processes. Confirmatory factor analysis can then be applied to validate the clusters, or subscales, found in the exploratory analysis. Cronbach’s alpha establishes internal consistency. [Figure: factor diagram for the example survey, with a Verbal factor over items a, b, c, d, e and a Math factor over items f, g, h]
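As a rough illustration of the internal-consistency measure named on this slide, the sketch below computes Cronbach’s alpha from a respondents-by-items score matrix using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total score). The function name `cronbach_alpha` and the toy score matrix are assumptions for illustration; they do not come from the presentation.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) score matrix.

    Hypothetical helper, not from the slides. Uses sample variances (ddof=1).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_var = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed score
    return (k / (k - 1)) * (1.0 - item_var.sum() / total_var)


# Toy data (made up): three items that move together perfectly,
# so alpha reaches its maximum value of 1.
toy_scores = [[1, 1, 1],
              [2, 2, 2],
              [4, 4, 4]]
alpha = cronbach_alpha(toy_scores)
```

Real survey items are never perfectly consistent, so values below 1 are expected in practice.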
- 5. Potential Pitfalls in Psychometric Validation with Factor Analysis Two major problems challenge the assumptions of these methods and necessitate a new way to analyze and validate the measure. (1) Time-wise or context-wise measurement can introduce non-independent, non-hierarchical components into the model. Examples: study habits across terms (longitudinal effects on measurement); identity across social spheres (student perception of intellectual ability when with friends, at work, and at school). Factor analysis can be broadened to Bayesian networks and structural equation models, but these methods come with their own assumptions about the underlying geometry and sample size. (2) Small sample size can create numerical instability in traditional algorithms for both factor analysis and structural equation models (guidelines suggest 5-10 participants per item). If there are 90 items, at least 450 students would be needed to discover subscales, and another 450 would be needed to validate these findings. Cost and population size can be prohibitive to the study. Example: bridging constructs, or loosely connected concepts without a defined hierarchy, typically run into both limitations and require a new method to validate their surveys. Many of these issues arise from the dependence on a linear mapping from the survey response space to a lower-dimensional space.
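The sample-size arithmetic on this slide can be written out as a short sketch. The helper name `participants_needed` and its parameters are hypothetical; the 5-10 participants-per-item rule, the 90-item example, and the two phases (discovery and validation) are taken from the slide, and the 406-participant, 91-item figures come from the identity survey described later in the deck.

```python
def participants_needed(n_items, per_item=5, n_phases=2):
    """Rule-of-thumb sample size: per_item respondents per item, per phase.

    Hypothetical helper; n_phases=2 covers discovery plus validation.
    """
    return n_items * per_item * n_phases


# 90-item survey from the slide: 450 to discover subscales + 450 to validate.
low = participants_needed(90, per_item=5)     # lower bound of the rule
high = participants_needed(90, per_item=10)   # upper bound of the rule

# Identity survey reported later in the deck: 406 participants, 91 items.
available_per_item = 406 / 91
```

Under these figures the identity survey has fewer than 5 participants per item even before a validation sample is set aside, which is the situation the slide describes as prohibitive for factor analysis.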
- 6. Moving from Euclidean-Based Statistics to Topologically-Based Statistics Each projection to a lower-dimensional space loses information (introduces error). Topological methods instead work by partitioning the existing space into homogeneous components (no maps, no projection error). [Figure: 2D example]
- 7. Algebraic Topology and Topological Spaces Spaces, such as the one formed by survey response data, can be defined topologically and decomposed using algebraic topology/geometry. Data follow discrete versions of many theoretical results in this area of math. Topology is rubber-sheet geometry, with areas analogous to gluing together children’s building blocks, examining connections on shapes, or hunting for mountain/valley water flows. Examining how the pieces fit together in a given space allows one to study the topological space’s defining characteristics and the behavior of functions in that space. One can define connections between pieces of this space via algebra and examine structural properties computationally: homotopy (shrinking connected paths to a point) and homology (hole-counting to define the topological classification of a structure). [Figure: three numbered panels: (1) homotopy/homology, (2) basins of attraction (Morse theory), (3) Hodge theory]
- 8. Applied Homology: Filtrations and Persistence A filtration is an iterative change of the lens through which the data are examined (height, neighbors…). Topological features appear and disappear as the lens changes. This creates a nested sequence of features with underlying algebraic objects, called a homology sequence: Hom1⊂Hom2⊂Hom3⊂Hom4. Persistence is the length of time a feature exists in a homology sequence, which can be visualized. This information maps back to the topology (shape) of the data space. The first level of algebraic objects corresponds to the connectedness of the space (0th Betti numbers), and this is directly related to a type of clustering analysis. [Figure: persistence plot over time (axis from 0 to 10), with features labeled connected space, vertices, and hole in middle]
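To make the link between a distance filtration and the 0th Betti number concrete, here is a minimal sketch that counts connected components of a point cloud as the distance threshold grows. The function name `betti0_curve`, the toy points, and the chosen thresholds are all assumptions for illustration; the slide does not specify an implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform


def betti0_curve(points, thresholds):
    """Number of connected components (0th Betti number) at each threshold.

    Hypothetical helper: two points are joined when their Euclidean
    distance is at most the threshold.
    """
    dist = squareform(pdist(points))
    counts = []
    for eps in thresholds:
        adjacency = csr_matrix((dist <= eps).astype(float))
        n_components, _ = connected_components(adjacency, directed=False)
        counts.append(int(n_components))
    return counts


# Toy data (made up): a tight group of three points and a distant pair.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [10.0, 10.0], [11.0, 10.0]])
thresholds = [0.5, 1.0, 1.5, 20.0]
curve = betti0_curve(points, thresholds)  # for this toy data: [5, 2, 2, 1]
```

The two groups stay separate over a wide range of thresholds, so that two-component structure has long persistence, while the five isolated points merge almost immediately.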
- 9. Solution: Use Machine Learning to Exploit Underlying Topology of Survey Data Single-linkage hierarchical clustering partitions data space according to connected components (0th Betti numbers) across filtration levels (i.e. a series of distance filtrations). This method has been successfully applied to neuroimaging studies focused on patterns of brain activity across diseases, neuropsychological tests, and drug states. This provides a nuanced scanning of topologically-based features within the datasets at different correlation/similarity thresholds. These can be summarized in feature plots, called persistence diagrams, that track the birth and death of a given feature across thresholds, and can be compared through existing statistical tests, such as a nonparametric Wasserstein metric test. It has also been used to track gene expression pattern changes across time and/or disease states in microarray studies. These studies particularly emphasize the visualization of hierarchical clustering through dendrograms (tree diagrams of relationships at different filtration levels) and heat maps (color-coded expression-similarity plots among genes in the microarray). These visualizations provide a user-friendly way to understand and communicate key findings of this statistical method. This method can handle data with fewer observations than predictors (p>>n), and, thus, does not require large sample sizes. Internal correlations do not pose issues; in fact, the method excels at separating data within and across dependencies.
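A minimal sketch of single-linkage clustering on survey items follows, assuming simulated responses with one verbal and one math factor (mirroring the example survey) and a distance of 1 minus the Pearson correlation between items. The simulated data, the variable names, and the choice of distance are illustrative assumptions, not the analysis from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Simulated responses (assumption): 5 items driven by a verbal factor,
# 3 items driven by a math factor, each with a little item-level noise.
rng = np.random.default_rng(0)
n_respondents = 200
verbal = rng.normal(size=(n_respondents, 1))
math_factor = rng.normal(size=(n_respondents, 1))
responses = np.hstack([
    verbal + 0.3 * rng.normal(size=(n_respondents, 5)),
    math_factor + 0.3 * rng.normal(size=(n_respondents, 3)),
])

# Distance between items: 1 - Pearson correlation (one common choice).
corr = np.corrcoef(responses, rowvar=False)
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)

# Single linkage merges items in order of increasing distance, so the merge
# heights are the thresholds at which connected components join.
tree = linkage(squareform(dist, checks=False), method="single")
merge_heights = tree[:, 2]

# Cutting the tree into two clusters should recover the two item groups.
labels = fcluster(tree, t=2, criterion="maxclust")
```

Passing `tree` to `scipy.cluster.hierarchy.dendrogram` would draw the tree diagram that the slide refers to.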
- 10. Hierarchical Clustering: Example Survey [Figure: heatmap with dendrogram of the example survey items, showing Math and Verbal clusters] The items separate very distinctly, as shown by the sharp color contrast in the heatmap and the long height bars on the dendrogram.
- 11. Validation: Dendrograms and Topology Dendrograms are a special type of graph, called a tree. Because graphs have a defined topological space and dendrograms are a type of graph, they can be studied or measured with the tools of topology and metric geometry. The Hausdorff distance allows two objects of the same dimension to be compared by a defined metric. It takes the greatest of the distances from each point in one object to its closest point in the other, giving a nearness-of-match metric on the two objects (top left). Within a graph framework, it allows one to calculate the worst-case best match between two graphs (as shown at bottom left). This allows for the development of a distance-based nonparametric test for dendrogram structural differences in a statistical framework. [Figure: Hausdorff distance]
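The Hausdorff distance itself can be sketched for two finite point sets with SciPy’s `directed_hausdorff`, taking the larger of the two directed distances. The wrapper name `hausdorff` and the toy point sets are assumptions; how a dendrogram is turned into a point set is a separate modeling choice that the slide does not spell out.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets (rows are points).

    Hypothetical wrapper: the larger of the two directed distances.
    """
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])


# Toy sets (made up): b contains every point of a plus one outlier.
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 3.0]])

a_to_b = directed_hausdorff(a, b)[0]  # 0.0: every point of a is also in b
b_to_a = directed_hausdorff(b, a)[0]  # 3.0: the outlier is 3 away from a
distance = hausdorff(a, b)            # 3.0: the worst best match
```

The single outlier determines the result, which is the "worst best match" behavior described on the slide.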
- 12. Steps in Exploration and Validation of Surveys with Hierarchical Clustering
  1) Partition the sample into training and validation sets, or draw a small number of bootstrap samples from the original dataset.
  2) Calculate distance metrics in each sample.
  3) Run a single-linkage hierarchical clustering algorithm on the training set to obtain exploratory clusters of similar survey items (the pvclust R package statistically tests internal survey structure, playing a role similar to Cronbach’s alpha). Create a heat map and dendrogram.
  4) Repeat (3) on the validation sets to obtain a set of dendrograms.
  5) Calculate the Hausdorff distance (a topological metric) between dendrograms to estimate differences in results (validation step).
  6) Obtain a p-value by permuting the extant dendrograms or generating random dendrograms.
  7) If the p-value is larger than 0.05/n (Bonferroni correction for n comparisons) for the dendrograms in (5), no statistically significant differences exist in dendrogram structure, meaning that the survey clusters are consistent and valid.
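The steps above can be sketched end to end on simulated data. This is only an illustration under several loudly stated assumptions, not the procedure from the paper: each dendrogram is represented by the rows of its cophenetic distance matrix, the Hausdorff distance is taken between those row sets, and the p-value comes from a respondent-level permutation test (pooling both samples and re-splitting at random), which is swapped in here for the dendrogram permutation that the slide mentions. All function names and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import directed_hausdorff, pdist, squareform


def dendrogram_points(responses):
    """Single-linkage tree over items, embedded as cophenetic-matrix rows."""
    item_dist = pdist(responses.T, metric="correlation")  # steps 2-3
    tree = linkage(item_dist, method="single")
    return squareform(cophenet(tree))


def dendrogram_distance(sample_a, sample_b):
    """Hausdorff distance between two embedded dendrograms (step 5)."""
    a, b = dendrogram_points(sample_a), dendrogram_points(sample_b)
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])


def permutation_p_value(sample_a, sample_b, n_perm=200, seed=0):
    """Assumed stand-in for step 6: shuffle respondents between samples."""
    rng = np.random.default_rng(seed)
    observed = dendrogram_distance(sample_a, sample_b)
    pooled = np.vstack([sample_a, sample_b])
    n_a = len(sample_a)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        d = dendrogram_distance(pooled[idx[:n_a]], pooled[idx[n_a:]])
        exceed += int(d >= observed)
    return (exceed + 1) / (n_perm + 1)


# Step 1 on simulated data (assumption): 5 verbal items and 3 math items.
rng = np.random.default_rng(1)
n = 240
verbal = rng.normal(size=(n, 1))
math_factor = rng.normal(size=(n, 1))
responses = np.hstack([verbal + 0.5 * rng.normal(size=(n, 5)),
                       math_factor + 0.5 * rng.normal(size=(n, 3))])
train, valid = responses[: n // 2], responses[n // 2:]

p_value = permutation_p_value(train, valid)
n_comparisons = 1                      # one training/validation pair here
threshold = 0.05 / n_comparisons       # Bonferroni-corrected level (step 7)
consistent = p_value > threshold       # True means no detected difference
```

With several validation draws, `n_comparisons` would be the number of dendrogram pairs compared, and each pair’s p-value would be checked against the same corrected threshold.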
- 13. Example Measure: Bridging Constructs Identity expression across life contexts (ILLCQ survey): leading theories of identity include many components of identity (example: religious identity in school, family, and friend contexts). It was unknown whether identity type or social context plays the greater role in the expression of identity within an individual. Identity type being more influential would suggest that identity is a fairly static trait; context being more influential would suggest that identity is fluid. Sample size and survey size: 406 participants (FIU students) and 91 distinct survey items, with 5 draws of 130 participants each for validation and consistency checks. Results suggest that certain aspects of identity are fluid and others are fixed: political and racial/ethnic identity are fairly fixed, while other types, such as athletic or gender identity, are fairly fluid. Bootstrapped samples suggest consistency of the measure and validate the findings: subscales hold over different samples (tests of difference, all p>0.05). This validates the measure and allows for inference into the psychology of identity.
- 14. Identity by Context Survey Heatmap [Figure: heatmap of the ILLCQ survey items, with rows and columns labeled ILLCa_<identity type>_<context> (for example, ILLCa_religion_school and ILLCa_politics_family); color scale from -0.2 to 1]
- 15. Conclusion This method offers a robust way to create survey subscales and validate measures without needing a large sample or a pre-defined measure structure. It is flexible, deeply rooted in mathematics, and statistically testable: internal validity through pvclust’s statistical test of cluster hierarchy for cut-points, and external validity through the Hausdorff nonparametric test on bootstrapped samples. It has been successfully applied to a bridging concept survey (factorial design), as well as to more traditional survey designs. This offers a general way to extend traditional areas of statistics to a more general framework through the use of topological theory and tools. It is likely to be useful as data become more complex in industry and academia, and it may be able to circumvent other problems in modern statistics: item response theory (how people in different groups perform on test items); network comparison (social networks, covariance networks…) between groups or over time; structural equation modeling when data do not meet method assumptions.
- 16. Co-authors Paper: The Analysis of Bridging Constructs with Hierarchical Clustering Methods: An Application to Identity (under review, Journal of Research in Personality). Seth Schwartz, University of Miami; Anna Lisa Amodeo, University of Naples; Daniel Feaster, University of Miami; Douglas Steinley, University of Missouri; Alan Meca, University of Miami; Simona Picariello, University of Naples.

- (Answers: Year, Abstruse, Village, Cambrian, Kings, 1/125, 4, 1) Bonus: Laplacian:Heat::Ricci:___ (Water, Cold, Curvature, Valley). Answer: Curvature
- Santos, J. R. A. (1999). Cronbach’s alpha: A tool for assessing the reliability of scales. Journal of extension, 37(2), 1-5. Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and applications. American Psychological Association.
- Costello, A. B. (2009). Getting the most from your analysis. Pan, 12(2), 131-146. Rouquette, A., & Falissard, B. (2011). Sample size requirements for the internal validation of psychiatric scales. International Journal of Methods in Psychiatric Research, 20(4), 235-249. DeCoster, J. (1998). Overview of factor analysis.
- Zomorodian, A., & Carlsson, G. (2005). Computing persistent homology. Discrete & Computational Geometry, 33(2), 249-274. Lee, H., Kang, H., Chung, M. K., Kim, B. N., & Lee, D. S. (2012). Persistent brain network homology from the perspective of dendrogram. IEEE transactions on medical imaging, 31(12), 2267-2277.
- Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1), 57-74. Lee, H., Kang, H., Chung, M. K., Kim, B. N., & Lee, D. S. (2012). Persistent brain network homology from the perspective of dendrogram. IEEE transactions on medical imaging, 31(12), 2267-2277. Suzuki, R., & Shimodaira, H. (2006). Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics, 22(12), 1540-1542. Chipman, H., & Tibshirani, R. (2006). Hybrid hierarchical clustering with applications to microarray data. Biostatistics, 7(2), 286-301.
- Gross, J. L., & Tucker, T. W. (1987). Topological graph theory. Courier Corporation.