2. Problem
Data contains many
underlying structures
and relationships.
Current methods (such
as k-means
clustering):
◦ Don’t capture all of these
structures
◦ Struggle with certain
data properties
(dimensionality)
◦ Provide little information
about connectedness
between
clusters/individuals
◦ Are unstable
3. Recent Solutions
Nonlinear distance
metrics
◦ Random forest-based
◦ Manifold learning-based
Hierarchical clustering
◦ Nested clustering
approach
Multiscale K-Nearest
Neighbors
◦ Adjust number of
neighbors to slice data
Still don’t provide a
comprehensive view of
data structure
4. Topology Overview
Branch of mainly pure
mathematics
Study of changes in
function behavior on
different shapes
(called manifolds)
Can examine locally-
variant and globally-
invariant properties
Classify
similarities/differences
between shapes
based on these
characteristics
Algebra can be used to build
more complex structures
from basic building blocks
5. Topology and Data
Data clouds can be turned into combinations of discrete
shapes (simplices)
Identify key topological features across different slices of
the data (circles, holes…)
◦ Counted by Betti numbers (number of features in each dimension)
Find connected components of similar topological
structure
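To make the Betti-number idea concrete, here is a minimal sketch in plain Python. It assumes the point cloud is turned into a graph by joining points closer than a threshold `eps` (a hypothetical parameter), which gives only the vertices and edges of a simplicial complex, with no filled triangles. On such a graph, b0 counts connected components and b1 = E - V + b0 counts independent loops. The function names are illustrative, not from any library.

```python
import math
from itertools import combinations

def neighborhood_graph(points, eps):
    """Edges between all pairs of points closer than eps (assumed threshold)."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) < eps]

def betti_numbers(n_vertices, edges):
    """Return (b0, b1) of a graph: connected components and independent loops."""
    parent = list(range(n_vertices))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    b0 = len({find(v) for v in range(n_vertices)})
    b1 = len(edges) - n_vertices + b0  # Euler characteristic of a graph
    return b0, b1

# Eight points on a unit circle plus one far-away point (toy data).
circle = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
          for k in range(8)]
cloud = circle + [(10.0, 10.0)]
edges = neighborhood_graph(cloud, eps=0.9)
b0, b1 = betti_numbers(len(cloud), edges)
print(b0, b1)
```

On this toy cloud the ring and the stray point are separate components, and the ring contributes one loop. A fuller treatment would also fill in triangles and track features across a range of `eps` values.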
6. Mapper Algorithm
Topological clustering
◦ Define distance metric
Linear or nonlinear
◦ Define filtration function
Linear, density-based…
◦ Slice multidimensional
dataset with Morse function
Type of function associated
with gradient flow and critical
point identification on smooth
manifolds
◦ Examine function behavior
across slice (level set)
◦ Cluster function behavior
◦ Graph cluster connections
Type of extended Reeb Graph
[Figure: Mapper output graph; labels: response gradations, outliers]
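The Mapper steps listed on this slide can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the reference implementation: the distance metric is Euclidean, the filter (`lens`) is a plain coordinate function rather than a specific Morse function, the slices are overlapping intervals, and clustering within a slice is single linkage at a hypothetical threshold `eps`. All function and parameter names are made up for the example.

```python
import math

def single_linkage(indices, points, eps):
    """Cluster the given indices: points closer than eps are chained together."""
    clusters, unseen = [], set(indices)
    while unseen:
        stack = [unseen.pop()]
        cluster = set(stack)
        while stack:
            p = stack.pop()
            near = {q for q in unseen if math.dist(points[p], points[q]) < eps}
            unseen -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters

def mapper(points, lens, n_intervals, overlap, eps):
    """Return (nodes, edges): clusters per slice, linked when they share points."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    pad = width * overlap / 2  # how far each slice extends into its neighbors
    nodes = []
    for k in range(n_intervals):
        a, b = lo + k * width - pad, lo + (k + 1) * width + pad
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(members, points, eps))
    edges = [(u, v) for u in range(len(nodes)) for v in range(u + 1, len(nodes))
             if nodes[u] & nodes[v]]
    return nodes, edges

# Toy data: 24 evenly spaced points on a unit circle, sliced by x-coordinate.
circle = [(math.cos(2 * math.pi * k / 24), math.sin(2 * math.pi * k / 24))
          for k in range(24)]
nodes, edges = mapper(circle, lens=lambda p: p[0],
                      n_intervals=4, overlap=0.5, eps=0.4)
print(len(nodes), len(edges))
```

With these settings the two middle slices each split into an upper and a lower arc, so the output graph has as many edges as nodes and closes into a single loop, mirroring the shape of the data.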
7. Multiscale Extension of
Mapper
Instability of single-
scale mapper algorithm
◦ Clusters may change with
scale
◦ Connections may change
with scale
Filtrations at multiple
resolution settings
Connections change as
lens zooms in or out
◦ Contains information
about underlying data
structure and
relationships
◦ Hierarchy of Reeb graphs
◦ Topological summary
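The scale dependence described here can be illustrated by re-running a Mapper-style construction at several resolutions and comparing the resulting graphs. The sketch below is a simplification: it just varies the number of cover intervals, whereas the multiscale framework in the literature works with a formal tower of covers. The Euclidean metric, the threshold `eps`, and every function name are assumptions made for the example.

```python
import math

def components(indices, points, eps):
    """Connected components of the eps-neighborhood graph on the given indices."""
    comps, unseen = [], set(indices)
    while unseen:
        frontier = [unseen.pop()]
        comp = set(frontier)
        while frontier:
            p = frontier.pop()
            near = {q for q in unseen if math.dist(points[p], points[q]) < eps}
            unseen -= near
            comp |= near
            frontier.extend(near)
        comps.append(comp)
    return comps

def mapper_size(points, lens, n_intervals, overlap, eps):
    """(node count, edge count) of a Mapper-style graph at one resolution."""
    values = [lens(p) for p in points]
    lo = min(values)
    width = (max(values) - lo) / n_intervals
    pad = width * overlap / 2
    nodes = []
    for k in range(n_intervals):
        a, b = lo + k * width - pad, lo + (k + 1) * width + pad
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes += components(members, points, eps)
    n_edges = sum(1 for u in range(len(nodes)) for v in range(u + 1, len(nodes))
                  if nodes[u] & nodes[v])
    return len(nodes), n_edges

def multiscale(points, lens, resolutions, overlap, eps):
    """Map each resolution (number of intervals) to its (nodes, edges) summary."""
    return {r: mapper_size(points, lens, r, overlap, eps) for r in resolutions}

# Toy data: 24 evenly spaced points on a unit circle, sliced by x-coordinate.
circle = [(math.cos(2 * math.pi * k / 24), math.sin(2 * math.pi * k / 24))
          for k in range(24)]
summary = multiscale(circle, lens=lambda p: p[0],
                     resolutions=[1, 2, 4], overlap=0.5, eps=0.4)
for r, (n_nodes, n_edges) in summary.items():
    print(r, n_nodes, n_edges)
```

On this toy circle the coarse settings collapse the data to one node or one edge, and the loop only shows up at the finest setting. That is the kind of change in clusters and connections that a single-scale run cannot reveal.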
8. Graph Theory Extensions of
Mapper
Cluster relationships from
Mapper give an adjacency
matrix and distance metric
◦ Clusters as vertices
◦ Nested hierarchy as edges
◦ Connected/unconnected
components
◦ Centrality of certain points
◦ Bridges linking disparate
clusters
◦ Path lengths between
clusters
Can apply network
analytics to assess cluster
relationships and
individual connections
across clusters
This is a weighted,
undirected graph!
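As a rough illustration of how cluster output becomes a weighted, undirected graph, the sketch below builds an adjacency matrix from cluster memberships. It assumes the edge weight is the number of individuals two clusters share; that weighting is one plausible choice for the example, not necessarily the one used in this work. The individuals are labeled with made-up integer ids.

```python
def overlap_adjacency(clusters):
    """Symmetric matrix: entry (i, j) = number of individuals shared by clusters i and j."""
    n = len(clusters)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weight = len(clusters[i] & clusters[j])
            adj[i][j] = adj[j][i] = weight  # undirected, so mirror the entry
    return adj

# Hypothetical cluster memberships (individual ids are invented).
clusters = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6}, {7}]
adj = overlap_adjacency(clusters)
for row in adj:
    print(row)
```

A zero row, like the one for the last cluster here, marks an unconnected component. Path lengths and centrality can then be computed on `adj` with standard graph algorithms.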
9. Network Extensions of Multiscale
Mapper
Graph theory
algorithms applied to
Mapper results dig
deeper into:
◦ Data topology/structure
◦ Nature of individuals’
similarities across
multivariate distribution
Examine across
different lenses
◦ Hierarchy of networks
connected through
individuals common to
multiple networks
◦ Analyze across slices to
gain deeper insight into
network and underlying
data structures
10. Information from Each
Network
Hubs
◦ Direct connection
to many other
clusters
Betweenness
◦ Non-extremity
measure
Diversity
◦ Information
contained
Bridges
◦ Connection
between less-
related
components
Graph Laplacian
◦ Eigenvectors with
connection/bridge
weights
Centrality
◦ Weight direct
connections and
bridges for
importance to
network
Vertices
◦ Clusters at a
particular resolution
Edges
◦ Connections between
clusters
◦ Individuals common
between clusters
Levels
◦ Level sets (height
slices) containing one
or more vertices
◦ Individuals bridging
levels
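A few of the network quantities on this slide can be computed directly from a weighted adjacency matrix. The sketch below, in plain Python, covers hubs (as neighbor counts), bridges (edges whose removal disconnects the graph), and the graph Laplacian L = D - A. Betweenness and diversity are left out for brevity. The example graph and all names are hypothetical.

```python
def neighbor_counts(adj):
    """Number of direct connections per vertex; the largest values mark hubs."""
    return [sum(1 for w in row if w) for row in adj]

def n_components(adj):
    """Count connected components with a depth-first search."""
    seen, count = set(), 0
    for start in range(len(adj)):
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v, w in enumerate(adj[u]):
                if w and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return count

def bridges(adj):
    """Edges whose removal increases the number of components (brute force)."""
    base, found, n = n_components(adj), [], len(adj)
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j]:
                w = adj[i][j]
                adj[i][j] = adj[j][i] = 0       # temporarily drop the edge
                if n_components(adj) > base:
                    found.append((i, j))
                adj[i][j] = adj[j][i] = w       # restore it
    return found

def laplacian(adj):
    """Weighted graph Laplacian L = D - A (assumes a zero diagonal in adj)."""
    return [[sum(row) if i == j else -row[j] for j in range(len(row))]
            for i, row in enumerate(adj)]

# Hypothetical network: two triangles of clusters joined by a single edge.
weighted_edges = [(0, 1, 2), (0, 2, 1), (1, 2, 1), (2, 3, 1),
                  (3, 4, 2), (3, 5, 1), (4, 5, 1)]
adj = [[0] * 6 for _ in range(6)]
for u, v, w in weighted_edges:
    adj[u][v] = adj[v][u] = w

print(neighbor_counts(adj))
print(bridges(adj))
print(laplacian(adj))
```

The brute-force bridge search is quadratic in the number of edges, which is fine for the small graphs Mapper tends to produce. A larger network would call for a linear-time bridge-finding algorithm.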
11. Combined Insight of
Extensions
Multiple resolutions
◦ Cluster hierarchy
Evolving cluster structure
More complete picture of
individual classifications
◦ Network hierarchy
Evolving network structure
More complete picture of
cluster relationships and
structure
More complete picture of
individual connections
12. Example Demonstration
Demo dataset of 7th grade SAT
scores
Group-level data mining of results
[Figure: Mapper output; annotations: transition (two locations), emergence of subgroups, split into two distinct groups]
13. Individual Mining Results
Map back to individuals
◦ Bridging individuals
Transition between clusters
Multivariate cut-off score
determination
◦ Isolated individuals
Outliers and outlier groups
Unique response or predictor
subsets
◦ Consistently clustered
individuals
Cohesive subgroups in data
Underlying similarity of
predictors or response
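Mapping clusters back to individuals can be sketched as set operations on cluster memberships. The example below makes three assumptions for illustration: a bridging individual belongs to more than one cluster at a given resolution, an isolated individual sits alone in a singleton cluster, and two individuals are consistently clustered when they share a cluster at every resolution examined. The labels and function names are invented.

```python
from collections import Counter
from itertools import combinations

def classify_individuals(clusters):
    """Return (bridging, isolated) individuals for one resolution."""
    membership = Counter(ind for c in clusters for ind in c)
    bridging = {ind for ind, k in membership.items() if k > 1}
    isolated = {ind for c in clusters if len(c) == 1
                for ind in c if membership[ind] == 1}
    return bridging, isolated

def co_clustered(clusters):
    """All unordered pairs of individuals that share at least one cluster."""
    return {frozenset(pair) for c in clusters
            for pair in combinations(sorted(c), 2)}

def consistent_pairs(levels):
    """Pairs that share a cluster at every resolution in `levels`."""
    pair_sets = [co_clustered(clusters) for clusters in levels]
    return set.intersection(*pair_sets) if pair_sets else set()

# Hypothetical memberships at a coarse and a fine resolution.
coarse = [{"a", "b", "c", "d"}, {"d", "e"}, {"z"}]
fine = [{"a", "b"}, {"b", "c"}, {"d", "e"}, {"z"}]

print(classify_individuals(coarse))
print(classify_individuals(fine))
print(consistent_pairs([coarse, fine]))
```

Which individual bridges clusters can change with resolution, as it does between `coarse` and `fine` here, which is why the results are read across scales instead of from a single run.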
14. Conclusion
New method ameliorates some of the issues
with clustering methods
◦ Robust
◦ Works in high dimensions
◦ Captures connectedness
◦ Stable
◦ Provides hierarchy
◦ Quantifies relationships
Editor's Notes
Dey, T. K., Memoli, F., & Wang, Y. (2015). Mutiscale Mapper: A Framework for Topological Summarization of Data and Maps. arXiv preprint arXiv:1504.03763.
Singh, G., Mémoli, F., & Carlsson, G. E. (2007, September). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In SPBG (pp. 91-100).
Ghosh, A. K., Chaudhuri, P., & Murthy, C. A. (2006). Multiscale classification using nearest neighbor density estimates. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 36(5), 1139-1148.
Shi, T., Seligson, D., Belldegrun, A. S., Palotie, A., & Horvath, S. (2005). Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Modern Pathology, 18(4), 547-557.
Navarro, J. F., Frenk, C. S., & White, S. D. (1997). A Universal density profile from hierarchical clustering. The Astrophysical Journal, 490(2), 493.
Spanier, E. H. (1994). Algebraic topology (Vol. 55, No. 1). Springer Science & Business Media.
Aspinwall, P. S., Greene, B. R., & Morrison, D. R. (1994). Calabi-Yau moduli space, mirror manifolds and spacetime topology change in string theory. Nuclear Physics B, 416(2), 414-480.
Schwarz, M. (1993). Morse homology. In Progress in Mathematics.
Palis, J. (1969). On morse-smale dynamical systems. Topology, 8(4), 385-404.
Devaney, R. L. (1989). An introduction to chaotic dynamical systems (Vol. 13046). Reading: Addison-Wesley.
Epstein, C., Carlsson, G., & Edelsbrunner, H. (2011). Topological data analysis. Inverse Problems, 27(12), 120201.
Zomorodian, A. (2007). Topological data analysis. Advances in Applied and Computational Topology, 70, 1-39.
Lum, P. Y., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., ... & Carlsson, G. (2013). Extracting insights from the shape of complex data using topology. Scientific reports, 3.
Carlsson, G., Jardine, R., Feichtner-Kozlov, D., Morozov, D., Chazal, F., de Silva, V., ... & Wang, Y. (2012). Topological Data Analysis and Machine Learning Theory.
Opens the door to extremely deep unsupervised learning and a new way to look at data structure and underlying relationships.
Scott, J. (2012). Social network analysis. Sage.
Carrington, P. J., Scott, J., & Wasserman, S. (Eds.). (2005). Models and methods in social network analysis (Vol. 28). Cambridge university press.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications (Vol. 8). Cambridge university press.