1. DAMIAN A. VON SCHOENBORN
Topological Data Analysis
2. Abstract
By now, the Big Data revolution is well under way. Storage capacity has ballooned, and simple queries against these data stores can be executed with relative ease. However, analytic techniques have generally not matured to handle the massive datasets of this new era. This talk presents a set of techniques known collectively as Topological Data Analysis (TDA), in which concepts from topology are applied to classify, visualize, and explore data. TDA shows promise in the era of Big Data.
3. Agenda
Issues with Big Data analysis
Topology Overview
Computational Topology and Formal TDA
Relaxed TDA
Q&A
4. Problems in Big Data Analytics
Problems with legacy analytic techniques:
• Run in series, in memory
• Hypothesis-driven
• Limited visualizations
5. Topology Overview (as relevant here)
Metric Space
• Pairwise distance between points
• Continuously defined surfaces
Coordinate free
• Orientation doesn’t matter
• Ability to compare sets from different coordinate systems
Small deformations don’t change topology
• Stretching, bending, etc. okay
• Cutting, gluing, etc. not okay
• Less sensitivity to noise [1]
Simplicial Complexes
• Coarse (“compressed”) representations of reality
Intuitively, a topological space is a set of points, each of which knows its neighbors. Formally, a topology on a set X is a subset T ⊆ 2^X such that:
• If S_1, S_2 ∈ T, then S_1 ∩ S_2 ∈ T
• If {S_j : j ∈ J} ⊆ T, then ⋃_{j∈J} S_j ∈ T
• ∅, X ∈ T
[3]
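To make the axioms concrete, here is a minimal Python sketch (not part of the original slides) that brute-forces the three conditions for a small finite collection of subsets; the helper name is_topology and the example sets are illustrative only.

```python
from itertools import combinations

def is_topology(X, T):
    """Check the topology axioms for a finite collection T of subsets of X."""
    T = {frozenset(s) for s in T}
    X = frozenset(X)

    # The empty set and X itself must be open.
    if frozenset() not in T or X not in T:
        return False
    # Pairwise intersections of open sets must be open.
    if any(a & b not in T for a, b in combinations(T, 2)):
        return False
    # Unions of subfamilies must be open (finite case: brute-force every subfamily).
    members = list(T)
    for r in range(1, len(members) + 1):
        for family in combinations(members, r):
            if frozenset().union(*family) not in T:
                return False
    return True

X = {"a", "b", "c"}
print(is_topology(X, [set(), {"a"}, {"a", "b"}, X]))   # True
print(is_topology(X, [set(), {"a"}, {"b"}, X]))        # False: {a} ∪ {b} is missing
```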
6. Topological Data Analysis
Definition: Given a finite dataset S ⊆ 𝕐 of noisy
points sampled from an unknown space 𝕏,
topological data analysis recovers the topology of
𝕏, assuming both 𝕏 and 𝕐 are topological spaces.[3]
We want a process that does not require
assumptions about manifold structure,
smoothness, or lack of curvature.[3]
7. Formal Combinatorial Representations
Goal
• Construct a combinatorial representation that approximates the underlying space from which the data was sampled[3]
• Many types of these representations (simplicial complexes) have been developed; two of the most popular are the Čech and Vietoris-Rips (VR) complexes
• Both the Čech and VR complexes typically produce simplices in dimensions much higher than the dimension of the space[4]
• The VR complex is less expensive to compute than the corresponding Čech complex, even though the VR complex has more simplices[2]
• The Čech complex is not computed in practice due to its computational complexity[3]
• Currently, the VR complex is one of the few practical methods for topological analysis in high dimensions[3]
8. Defining the VR Complex
Definition 1[3]
Given S ⊆ 𝕐 and ε ∈ ℝ, let G_ε = (S, E_ε) be the ε-neighborhood graph on S, where
E_ε = { {u, v} : u, v ∈ S, u ≠ v, d(u, v) ≤ ε }
The VR complex is the clique complex of the ε-neighborhood graph.
A clique is a subset of vertices that induces a complete subgraph; it is maximal if it cannot be made any larger.
The clique complex has the maximal cliques of a graph as its maximal simplices.
Definition 2[4]
Let X denote a metric space with metric d. Then the VR complex for X, attached to the parameter ε, is the simplicial complex whose vertex set is X and where {x0, x1, …, xk} spans a k-simplex if and only if d(xi, xj) ≤ ε for all 0 ≤ i, j ≤ k.
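A minimal construction of Definition 1, assuming NumPy, SciPy, and NetworkX are available: build the ε-neighborhood graph, then take its clique complex. The function name vietoris_rips and all parameters are illustrative, not a standard API.

```python
import itertools
import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform

def vietoris_rips(points, eps):
    """Return the simplices of the VR complex for `points` at scale eps."""
    d = squareform(pdist(points))          # pairwise distance matrix
    n = len(points)

    # ε-neighborhood graph G_eps = (S, E_eps): edge when d(u, v) <= eps.
    G = nx.Graph()
    G.add_nodes_from(range(n))
    G.add_edges_from((i, j) for i in range(n) for j in range(i + 1, n)
                     if d[i, j] <= eps)

    # The VR complex is the clique complex of G: every clique spans a simplex.
    simplices = set()
    for clique in nx.find_cliques(G):       # maximal cliques = maximal simplices
        for k in range(1, len(clique) + 1):
            for face in itertools.combinations(sorted(clique), k):
                simplices.add(face)
    return simplices

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 2))
print(len(vietoris_rips(pts, eps=0.75)))
```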
9. Creating the VR Complex
• Begin with the complete dataset
• Create ε-balls around each data point
• Draw an edge connecting each overlapping ε-ball pair
[2]
Describe the result with Betti numbers:
• b0: # of connected components
• b1: # of 1D holes
• b2: # of 2D holes
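As a hedged sketch of the first of these only: b0 can be read off the ε-neighborhood graph as its number of connected components (SciPy). Higher Betti numbers require a homology computation, usually delegated to a dedicated TDA library.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def betti_0(points, eps):
    """b0 = number of connected components of the ε-neighborhood graph."""
    d = squareform(pdist(points))
    adjacency = csr_matrix((d <= eps) & (d > 0))   # edge when d(u, v) <= eps
    n_components, _ = connected_components(adjacency, directed=False)
    return n_components

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(size=(15, 2)), rng.normal(size=(15, 2)) + 8.0])
print(betti_0(pts, eps=1.0))   # typically 2: the two well-separated clusters
```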
10. What features are an artifact of the chosen ε vs. a representation of the underlying structure?
Betti numbers alone are insufficient.
Persistence
• Features persisting over a large range of ε values are significant
• Features that quickly arise and drop off are noise and can be ignored
[2]
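A crude way to see persistence for b0, assuming only NumPy/SciPy: sweep ε over a range and watch how long each component count survives. Dedicated tools such as Ripser or GUDHI compute full persistence diagrams in all dimensions; the sketch below tracks connected components only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def b0_over_scales(points, scales):
    """Number of connected components of the ε-neighborhood graph at each scale."""
    d = squareform(pdist(points))
    counts = []
    for eps in scales:
        adj = csr_matrix((d <= eps) & (d > 0))
        counts.append(connected_components(adj, directed=False)[0])
    return counts

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(size=(25, 2)), rng.normal(size=(25, 2)) + 10.0])
scales = np.linspace(0.1, 12.0, 60)
for eps, c in zip(scales, b0_over_scales(pts, scales)):
    print(f"eps={eps:5.2f}  b0={c}")   # b0 = 2 persists over a wide ε range
```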
13. So where do we stand?
Pros
• Useful when a high-resolution representation is needed
• Surface reconstruction
• Anomaly detection
• Comparing datasets
• Optimizing models
• Choosing models and parameters best suited to the type of dataset you’re analyzing
Cons
• Some subjective judgment
• Potentially difficult to read
• Not ideal for Big Data
• Computationally expensive (ε-balls, pairwise overlap flags, etc. are all computed for every ε value in the range)[4]
• Typically need to sample from the data, reducing resolution
14. Shrinking Data Size
Dimensionality Reduction (Principal Components Analysis, MDS, ISOMAP)
• Retain much of the underlying structure of the data while limiting the number of dimensions needed to describe it[6]
• Drawbacks: loss of information and missed subtleties; assumes normality; assumes the data come from a flat hyperplane with no curvature[3]
Record Consolidation (Cluster Analysis)
• Discover underlying segments of the data by grouping the data points that are most similar[6]
• Drawbacks: produces distinct groups with no relationship between them, an arbitrary distinction in continuous data; the number of clusters must be specified upfront; clustering algorithms are often difficult to apply to very large datasets[4]
With many algorithms in each category, choosing the right one takes experience or luck.
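For contrast with TDA, a brief scikit-learn sketch (reference [7]) of the two “shrink the data” strategies named above: PCA and Isomap for dimensionality reduction, k-means for record consolidation. The dataset and parameter choices are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                     # stand-in for a wide dataset

# Dimensionality reduction: linear (PCA) and nonlinear (Isomap) projections.
X_pca = PCA(n_components=3).fit_transform(X)
X_iso = Isomap(n_neighbors=10, n_components=3).fit_transform(X)

# Record consolidation: cluster similar rows, requires choosing k upfront.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print(X_pca.shape, X_iso.shape, np.bincount(labels))
```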
16. Process Overview
A. Discrete sample space
B. Filter function can be any combination of dimensions in the dataset, or derived calculated fields
C. Slightly overlapping bins
D. Simplified representation
[1]
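A minimal Mapper-style sketch of steps A–D in the spirit of Singh et al.[5], assuming NumPy, scikit-learn, and NetworkX; the choice of DBSCAN as the per-bin clusterer, the function name mapper, and all parameters are illustrative rather than prescribed by the talk.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

def mapper(X, filter_values, n_bins=10, overlap=0.2, eps=0.5, min_samples=3):
    lo, hi = filter_values.min(), filter_values.max()
    width = (hi - lo) / n_bins
    graph = nx.Graph()
    node_members = {}

    # C. Slightly overlapping bins along the filter function.
    for b in range(n_bins):
        left = lo + b * width - overlap * width
        right = lo + (b + 1) * width + overlap * width
        idx = np.where((filter_values >= left) & (filter_values <= right))[0]
        if len(idx) == 0:
            continue
        # Cluster the points whose filter values fall in this bin.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lab in set(labels) - {-1}:                 # ignore DBSCAN noise
            node = (b, lab)
            node_members[node] = set(idx[labels == lab])
            graph.add_node(node)

    # D. Simplified representation: connect clusters that share data points.
    nodes = list(node_members)
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if node_members[nodes[i]] & node_members[nodes[j]]:
                graph.add_edge(nodes[i], nodes[j])
    return graph

# A. Discrete sample space; B. filter = first coordinate (any field would do).
X = np.random.default_rng(2).normal(size=(300, 2))
g = mapper(X, filter_values=X[:, 0])
print(g.number_of_nodes(), g.number_of_edges())
```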
17. Useful filter functions[5]
Field(s) from the data
• Combinations of in-data dimensions (or derivations thereof), typically chosen by domain knowledge
Density
• Use a Gaussian kernel: f_ε(x) = C_ε Σ_y exp(−d(x, y)² / ε)
Eccentricity (data depth)
• Identify points which are far from the center without identifying the actual center
• For 1 ≤ p < ∞, let E_p(x) = ( Σ_{y∈X} d(x, y)^p / N )^{1/p}
Eigenvectors of graph Laplacians
• Let L(x, y) = w(x, y) / ( √(Σ_z w(x, z)) · √(Σ_z w(y, z)) ), where w(x, y) = k(d(x, y)) for a smoothing kernel k (e.g. Gaussian)
• Eigenvectors of L(x, y) are a set of orthogonal vectors that give interesting geometric information
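Two of these filter functions are straightforward to compute with NumPy/SciPy. The sketch below implements the Gaussian-kernel density f_ε and the p-eccentricity E_p; the normalizing constant C_ε is set to 1 here, since only the ordering of filter values matters for binning.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_filter(X, eps):
    """f_eps(x) = C_eps * sum_y exp(-d(x, y)^2 / eps), with C_eps = 1."""
    d = squareform(pdist(X))
    return np.exp(-(d ** 2) / eps).sum(axis=1)

def eccentricity_filter(X, p=2):
    """E_p(x) = ( sum_{y in X} d(x, y)^p / N )^(1/p)."""
    d = squareform(pdist(X))
    return ((d ** p).sum(axis=1) / len(X)) ** (1.0 / p)

X = np.random.default_rng(3).normal(size=(100, 3))
print(density_filter(X, eps=1.0)[:5])
print(eccentricity_filter(X, p=2)[:5])
```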
19. Benefits
Visual Exploration
• Able to move away from hypothesis-driven analyses[1]
• Visualize the entire dataset without making unfounded assumptions
Fungibility
• Process can be applied to a wide variety of data sources
• No predefined format, scaling, etc. needed
• Multiscale representations: useful to have the flexibility of changing the resolution “on the fly”[4]
Integration of favorite machine learning techniques
• Choice of clustering algorithms
• Choice of filter functions
Computation
• Clustering performed on subsets – allows for parallelization
21. References
1. Lum, P.Y. et al. Extracting insights from the shape of complex data using topology. Sci. Rep. 3, 1236; DOI: 10.1038/srep01236 (2013).
2. Ghrist, R. Barcodes: The Persistent Topology of Data. Bulletin of the AMS 45(1), pp. 61-75, S 0273-0979(07)01191-3 (2008).
3. Zomorodian, A. Topological Data Analysis. Proceedings of Symposia in Applied Mathematics, AMS (2011).
4. Carlsson, G. Topology and Data. Bulletin of the AMS 46(2), pp. 255-308, S 0273-0979(09)01249-X (2009).
5. Singh, G. et al. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics (2007).
6. Ayasdi. TDA and Machine Learning: Better Together (2015).
7. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830 (2011); see also “2.3. Clustering,” scikit-learn documentation.