TOPOLOGICAL DATA ANALYSIS
HJ vanVeen· Data Science· Nubank Brasil
• "When a truth is necessary, the reason for it can be
found by analysis, that is, by resolving it into simpler
ideas and truths until the primary ones are reached."
• Topology is the mathematical study of topological spaces.
• Topology is interested in shapes,
• More specifically: the concept of 'connectedness'
• A topologist is someone who does not see the
difference between a coffee mug and a donut.
• “Nothing at all takes place in the universe in which
some rule of maximum or minimum does not
appear.” - Euler
• Seven Bridges of Königsberg: devise a walk
through the city that crosses each bridge
once and only once.
• Euler's big insights:
• It doesn’t matter where you start walking, only which bridges you cross.
• A similar solution should be found regardless of where you start your walk.
• Only the connectedness of the bridges matters.
• A solution should also apply to all other bridges that are connected
in a similar fashion, no matter the distances between them.
• We now call these graph walks ‘Eulerian walks’.
• Euler's first proven graph theory theorem:
• 'Eulerian walks' are possible in a connected graph if exactly zero or
two nodes have an odd number of edges.
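The theorem above can be checked mechanically. The following is a minimal sketch, not Euler's own formulation: the function name `has_eulerian_walk`, the land-mass labels A to D, and the edge-list encoding of the seven bridges are all assumptions made for illustration.

```python
from collections import defaultdict


def has_eulerian_walk(edges):
    """Return True if the multigraph given by `edges` is connected and
    has exactly zero or two nodes with an odd number of edges."""
    degree = defaultdict(int)
    adjacency = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[u].add(v)
        adjacency[v].add(u)
    if not degree:
        return True  # no bridges, nothing to walk

    # Depth-first search to confirm every node is reachable from the first.
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    if len(seen) != len(degree):
        return False

    odd_nodes = sum(1 for d in degree.values() if d % 2 == 1)
    return odd_nodes in (0, 2)


# Hypothetical labels: A and B are the river banks, C the central island,
# D the eastern land mass. Each tuple is one bridge.
KONIGSBERG = [
    ("A", "C"), ("A", "C"),
    ("B", "C"), ("B", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

print(has_eulerian_walk(KONIGSBERG))
```

In this encoding all four land masses touch an odd number of bridges, so the check fails, which matches Euler's conclusion that no such walk exists. Note that distances never enter the computation, only which nodes are connected.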
• TDA marries 300-year-old maths with
modern data analysis.
• Captures the shape of data
• Is invariant to size, position, and pose
• Compresses large datasets
• Functions well in the presence of noise / missing variables
• Capturing the shape of data
• Traditional techniques like clustering or dimensionality reduction have
trouble capturing this shape.
• Euler showed that only connectedness matters. The size, position, or
pose of an object doesn't change that object.
• Compressed representations use
the order in data.
• Only order can be compressed.
• Random noise or slight variations
• Lossy compression retains the most
• "Now where there are no parts, there neither extension, nor shape, nor divisibility is possible.
And these monads are the true atoms of nature and, in a word, the elements of things." - Leibniz
• Mapper was created by Ayasdi co-founder
Gurjeet Singh during his PhD under Gunnar Carlsson.
• Based on the idea of partial clustering of the data
guided by a set of functions defined on the data.
• Mapper was inspired by the Reeb Graph.
• Map the data with a function and cover its range with overlapping intervals.
• Cluster the points inside each interval.
• When clusters share data points, draw an edge between them.
• Color nodes by function value.
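The four steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not Singh's implementation or any library's API: the helper names `cover`, `cluster`, and `mapper`, the fixed-radius connected-components clustering, and all parameter values are hypothetical choices made for the example.

```python
import math
from itertools import combinations


def cover(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into equal intervals that overlap by a fraction."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def cluster(indices, points, radius):
    """Connected components of the 'closer than radius' relation
    (a stand-in for whatever clustering algorithm is preferred)."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= radius}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, lens, n_intervals=4, overlap=0.5, radius=0.3):
    """Return (nodes, edges): nodes are sets of point indices,
    edges join nodes that share at least one point."""
    values = [lens(p) for p in points]
    tol = 1e-9  # tolerance for floating-point interval boundaries
    nodes = []
    for a, b in cover(min(values), max(values), n_intervals, overlap):
        members = [i for i, v in enumerate(values) if a - tol <= v <= b + tol]
        nodes.extend(cluster(members, points, radius))
    edges = [(s, t) for s, t in combinations(range(len(nodes)), 2)
             if nodes[s] & nodes[t]]
    return nodes, edges


# Toy data: 40 points on the unit circle; the function is the x-coordinate.
circle = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]
nodes, edges = mapper(circle, lens=lambda p: p[0])

# "Color" each node by the mean function value of its members.
colors = [sum(circle[i][0] for i in node) / len(node) for node in nodes]
print(len(nodes), len(edges))
```

With these settings the resulting graph is a single loop, so the output graph keeps the shape of the circle while compressing 40 points into a handful of nodes. Changing `n_intervals`, `overlap`, or `radius` changes the resolution of the graph.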
• Example of spousal fraud
• Create: global linear model
• Function: L2-norm
• Color: Heatmap by ground truth and animate to out-of-fold model predictions
• Identify: Low-accuracy subgraphs
• Select: Features that are most important for those subgraphs
• Create: Local linear models on the subgraphs
• Stack: DecisionTree
• Compare: Divide-and-Conquer and LIME