DAMIAN A. VON SCHOENBORN
Topological Data Analysis
Abstract
By now, the Big Data revolution is well under way.
Storage capacity has ballooned, and simple queries
against these data stores can be executed with relative
ease. However, analytic techniques have generally not
matured to handle the massive datasets of this new
era. This talk will present a set of techniques known
collectively as Topological Data Analysis (TDA), where
concepts from Topology are applied to classify,
visualize, and explore data. TDA shows promise in the
era of Big Data.
Agenda
• Issues with Big Data analysis
• Topology Overview
• Computational Topology and Formal TDA
• Relaxed TDA
• Q&A
Problems in Big Data Analytics
Problems with legacy analytic techniques:
• Run in series, in memory
• Hypothesis-driven
• Visualizations limited
Topology Overview (as relevant here)
Metric Space
• Pair-wise distance between points
• Continuously defined surfaces
Coordinate free
• Orientation doesn’t matter
• Ability to compare sets from different coordinate
systems
Small deformations don’t change topology
• Stretching, bending, etc. okay
• Cutting, gluing, etc. not okay
• Less sensitivity to noise [1]
Simplicial Complexes
• Coarse (“compressed”) representations of reality
Intuitively, a topological space is a set of points, each of which
knows its neighbors. Formally, a topology on a set $X$ is a subset
$T \subseteq 2^X$ such that:
• If $S_1, S_2 \in T$, then $S_1 \cap S_2 \in T$
• If $\{S_j \mid j \in J\} \subseteq T$, then $\bigcup_{j \in J} S_j \in T$
• $\emptyset, X \in T$
[3]
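For a finite set, the three axioms of a topology can be checked mechanically. The following is a minimal sketch that is not part of the original slides; the function name `is_topology` is hypothetical. It relies on the fact that for a finite family, closure under pairwise unions already implies closure under arbitrary unions.

```python
from itertools import combinations


def is_topology(X, T):
    """Check whether the family T of subsets of the finite set X is a topology on X."""
    X = frozenset(X)
    T = {frozenset(s) for s in T}
    # The empty set and X itself must be members of T.
    if frozenset() not in T or X not in T:
        return False
    # Every member of T must be a subset of X.
    if any(not s <= X for s in T):
        return False
    # Closure under (pairwise) intersections and unions.
    return all(a & b in T and a | b in T for a, b in combinations(T, 2))


nested = [set(), {1}, {1, 2}, {1, 2, 3}]    # a chain of nested sets
not_closed = [set(), {1}, {2}, {1, 2, 3}]   # {1} ∪ {2} = {1, 2} is missing
print(is_topology({1, 2, 3}, nested))
print(is_topology({1, 2, 3}, not_closed))
```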
Topological Data Analysis
Definition: Given a finite dataset S ⊆ 𝕐 of noisy
points sampled from an unknown space 𝕏,
topological data analysis recovers the topology of
𝕏, assuming both 𝕏 and 𝕐 are topological spaces.[3]
We want a process that does not require
assumptions about manifold structure,
smoothness, or lack of curvature.[3]
Formal Combinatorial Representations
Goal
• Construct a combinatorial representation that approximates
the underlying space from which the data was sampled [3]
• Many types of these representations (simplicial complexes)
have been developed
Two of the most popular are the Čech and Vietoris-Rips (VR) complexes
• Both the Čech and VR complexes typically produce simplices
in dimensions much higher than the dimension of the space [4]
• The VR complex is less expensive than the corresponding Čech
complex, even though the VR complex has more simplices [2]
• The Čech complex is not computed in practice due to its
computational complexity [3]
• Currently, the VR complex is one of the few practical methods
for topological analysis in high dimensions [3]
Defining the VR Complex
Definition 1[3]
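Given $S \subseteq \mathbb{Y}$ and $\varepsilon \in \mathbb{R}$, let
$G_\varepsilon = (S, E_\varepsilon)$ be the ε-neighborhood graph on $S$,
where
$E_\varepsilon = \{\{u, v\} \mid d(u, v) \le \varepsilon,\ u \ne v \in S\}$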
The VR Complex is the clique
complex of the ε-neighborhood
graph
A clique is a subset of vertices
that induces a complete
subgraph; it is maximal if it
cannot be made any larger
The clique complex has the
maximal cliques of a graph as
its maximal simplices
Definition 2[4]
Let $X$ denote a metric space with metric
$d$. Then the VR complex for $X$, attached
to the parameter $\varepsilon$, is the
simplicial complex whose vertex set is
$X$ and where $\{x_0, x_1, \ldots, x_k\}$ spans a
k-simplex if and only if $d(x_i, x_j) \le \varepsilon$ for all
$0 \le i, j \le k$
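Both definitions say the same thing operationally: a set of points spans a simplex exactly when all of its pairs lie within ε of each other. The sketch below illustrates this for a small Euclidean point set. It is a brute-force illustration under assumptions, not the construction used in the cited papers: the function name `vietoris_rips` and the `max_dim` cutoff are hypothetical, and enumerating all subsets is only feasible for tiny inputs.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, eps, max_dim=2):
    """Return the simplices (tuples of point indices) of the VR complex up to max_dim."""
    n = len(points)
    # Edge set of the eps-neighborhood graph: pairs u != v with d(u, v) <= eps.
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= eps
    }
    simplices = [(i,) for i in range(n)]
    # A (k+1)-subset spans a k-simplex iff every pair in it is an edge (a clique).
    for k in range(1, max_dim + 1):
        for subset in combinations(range(n), k + 1):
            if all(pair in close for pair in combinations(subset, 2)):
                simplices.append(subset)
    return simplices


# Corners of a unit square: sides have length 1, diagonals about 1.414.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
small = vietoris_rips(square, eps=1.0)   # 4 vertices + 4 sides, no triangles
large = vietoris_rips(square, eps=1.5)   # diagonals join, so triangles appear
print(len(small), len(large))
```

At ε = 1.0 the complex is a hollow square (one loop); at ε = 1.5 the diagonals enter and the loop is filled in, which is the kind of ε-dependence that persistence is meant to track.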
Creating the VR Complex
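Begin with complete dataset
Create a ball around each data point (radius ε/2 in Euclidean space,
so that two balls overlap exactly when d(u, v) ≤ ε)
Draw an edge connecting each overlapping pair of balls
Fill in a simplex for every set of points that are pairwise connected
[2]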
Describe with Betti Numbers
b0: # of connected components
b1: # of 1D holes
b2: # of 2D holes
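Betti numbers can be computed from the ranks of boundary matrices, since b_k = (number of k-simplices) − rank(∂_k) − rank(∂_{k+1}). The sketch below is one way to do this with coefficients in Z/2, which avoids sign bookkeeping. It is an illustration under assumptions: the function names are hypothetical, and the input list must contain every face of every simplex, with no duplicates.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    pivots = {}  # highest set bit -> reduced row with that leading bit
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]  # eliminate the leading bit and keep reducing
    return len(pivots)


def betti_numbers(simplices):
    """Betti numbers over Z/2 of a face-closed simplicial complex (list of vertex tuples)."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    top = max(by_dim)
    ranks = {0: 0, top + 1: 0}  # the boundary maps below dim 0 and above top are zero
    for k in range(1, top + 1):
        index = {face: i for i, face in enumerate(by_dim[k - 1])}
        rows = []
        for s in by_dim[k]:
            mask = 0
            for face in combinations(s, k):  # the (k-1)-dimensional faces of s
                mask |= 1 << index[face]
            rows.append(mask)
        ranks[k] = rank_gf2(rows)
    return [len(by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top + 1)]


hollow = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]   # triangle boundary
filled = hollow + [(0, 1, 2)]                          # triangle with its interior
print(betti_numbers(hollow))
print(betti_numbers(filled))
```

The hollow triangle has one connected component and one loop; adding the 2-simplex fills the loop.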
What features are an artifact of the chosen ε vs. a
representation of the underlying structure?
• Betti numbers alone are insufficient
• Persistence
  • Features persisting over a large range of ε values are
  considered significant
  • Features that quickly arise and drop off are treated as
  noise and can be ignored
[2]
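The idea of persistence is easiest to see for connected components (b0). As ε grows, components merge, and the ε at which a component disappears is its "death". The sketch below computes these intervals with a union-find structure over the sorted pairwise distances. It is a simplified illustration covering dimension 0 only, with a hypothetical function name, and not a general persistent homology algorithm.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Return (birth, death) intervals of connected components as eps grows from 0."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    bars = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, eps))  # two components merge, so one of them dies
    bars.append((0.0, inf))          # the final component never dies
    return bars


# Two well-separated pairs of points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bars = h0_barcode(points)
for birth, death in bars:
    print(birth, death)
```

Two bars end near ε = 1 (points joining within each pair), one bar survives until ε = 9 (the two groups), and one never ends. The long bar is the feature that reflects the two-cluster structure.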
Visualizing Persistent Homology
[Figures: graphs and barcodes] [2][3]
Potential Application: Optimizing Model Selection
[7]
So where do we stand?
Pros
• Useful when high
resolution representation
needed
• Surface reconstruction
• Anomaly detection
• Comparing datasets
• Optimize models
• Choose models and
parameters best suited to
handle the type of dataset
you’re analyzing
Cons
• Some subjective judgment
• Potentially difficult to read
• Not ideal for Big Data
• Computationally expensive (epsilon balls,
pairwise overlap flags,
etc. all computed for
every epsilon value in
range) [4]
• Typically need to sample
from data, reducing
resolution.
Shrinking Data Size
Dimensionality Reduction (Principal Components Analysis, MDS, ISOMAP)
• Retain much of the underlying structure of the data while limiting the
number of dimensions needed to describe it [6]
• Drawbacks
  • Loss of information, missing subtleties
  • Assumes normality
  • Assumes that data is from a flat hyperplane with no curvature [3]
Record Consolidation (Cluster Analysis)
• Discover underlying segments of the data by grouping data
points that are most similar [6]
• Drawbacks
  • Distinct groups, no relationship between them, arbitrary
  distinction in continuous data
  • Specification of number of clusters upfront
  • Often difficult to apply clustering algorithms to very large datasets [4]
With many algorithms in each category, choosing the right one takes experience or luck
An alternate approach
[Figure: process diagram with steps 1 to 4] [6]
Process Overview
A. Discrete sample space
B. Filter function can be
any combination of
dimensions in the
dataset or derived
calculated fields
C. Slightly-overlapping
bins
D. Simplified
representation
[1]
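Steps A to D can be sketched in a few lines, in the spirit of the construction described in [5]. Everything named here is an assumption made for illustration: the functions `cluster` and `mapper`, the parameters `n_bins`, `overlap` and `eps`, and the choice of a fixed-threshold single-linkage clustering inside each bin (any clustering algorithm could be substituted).

```python
from itertools import combinations
from math import cos, dist, pi, sin


def cluster(indices, points, eps):
    """Single-linkage clusters: connected components of the eps-graph on `indices`."""
    remaining, clusters = set(indices), []
    while remaining:
        stack = [remaining.pop()]
        component = set(stack)
        while stack:
            u = stack.pop()
            near = {v for v in remaining if dist(points[u], points[v]) <= eps}
            remaining -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, filter_fn, n_bins=4, overlap=0.25, eps=0.3):
    """Return (nodes, edges): clusters per overlapping bin, linked when they share points."""
    values = [filter_fn(p) for p in points]           # B. filter function
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    nodes = []
    for k in range(n_bins):                           # C. slightly overlapping bins
        start = lo + k * width - overlap * width
        stop = lo + (k + 1) * width + overlap * width
        members = [i for i, v in enumerate(values) if start <= v <= stop]
        nodes.extend(cluster(members, points, eps))
    edges = {                                         # D. simplified representation
        (a, b)
        for a, b in combinations(range(len(nodes)), 2)
        if nodes[a] & nodes[b]
    }
    return nodes, edges


# A. discrete sample: 40 points on the unit circle, filtered by their x coordinate.
circle = [(cos(2 * pi * k / 40), sin(2 * pi * k / 40)) for k in range(40)]
nodes, edges = mapper(circle, filter_fn=lambda p: p[0])
print(len(nodes), len(edges))
```

For this sample the output graph has six nodes joined in a single cycle, so the loop in the data survives the compression.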
Useful filter functions [5]
Field(s) from the data
• Combinations of in-data dimensions (or derivations thereof), typically
chosen by domain knowledge
Density
• Use a Gaussian kernel: $f_\varepsilon(x) = C_\varepsilon \sum_{y} \exp\left(-\frac{d(x,y)^2}{\varepsilon}\right)$
Eccentricity (data depth)
• Identify points which are far from the center without identifying the actual
center
• For $1 \le p < \infty$, let $E_p(x) = \left(\frac{\sum_{y \in X} d(x,y)^p}{N}\right)^{1/p}$
Eigenvectors of graph Laplacians
• Let $L(x,y) = \frac{w(x,y)}{\sqrt{\sum_z w(x,z)}\,\sqrt{\sum_z w(y,z)}}$,
where $w(x,y) = k(d(x,y))$ for a smoothing kernel $k$ (e.g. Gaussian)
• Eigenvectors of $L(x,y)$ are a set of orthogonal vectors that give interesting
geometric information
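The density and eccentricity filters translate almost directly into code. The sketch below assumes Euclidean distance and leaves the normalizing constant as a plain parameter `c` defaulting to 1.0, which is an assumption made here since only the relative ordering of the values matters for binning. The function names are hypothetical.

```python
from math import dist, exp


def density(points, eps, c=1.0):
    """Gaussian kernel density estimate at each point (unnormalized when c = 1.0)."""
    return [c * sum(exp(-dist(x, y) ** 2 / eps) for y in points) for x in points]


def eccentricity(points, p=1):
    """E_p(x): the p-th power mean of the distances from x to all points."""
    n = len(points)
    return [(sum(dist(x, y) ** p for y in points) / n) ** (1 / p) for x in points]


# Three nearby points and one outlier on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
dens = density(points, eps=1.0)
ecc = eccentricity(points, p=1)
print(dens)
print(ecc)
```

The outlier at 10 receives the lowest density and the highest eccentricity, so either filter would place it in a different bin from the rest of the data.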
Application: Gene expression in cancer cells [1]
[Figure: traditional methods vs. TDA]
Benefits
Visual Exploration
• Able to move away from hypothesis-driven analyses [1]
• Visualize entire dataset, without making unfounded assumptions
Fungibility
• Process can be applied to wide variety of data sources
• No predefined format, scaling, etc. needed
• Multiscale representations: useful to have the flexibility of changing the
resolution “on the fly” [4]
Integration of favorite machine learning techniques
• Choice of clustering algorithms
• Choice of filter functions
Computation
• Clustering performed on subsets, which allows for parallelization
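Because each bin is clustered independently of the others, the per-bin work can be handed to a pool of workers. The sketch below shows the pattern with a deliberately simple one-dimensional clustering rule; the function `cluster_1d`, the `gap` parameter, and the sample bins are all hypothetical. A thread pool is used to keep the example short; for CPU-bound clustering in CPython a process pool would likely be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def cluster_1d(values, gap=1.0):
    """Split 1-D values into clusters wherever sorted neighbors are more than `gap` apart."""
    clusters, current = [], []
    for v in sorted(values):
        if current and v - current[-1] > gap:
            clusters.append(current)
            current = []
        current.append(v)
    if current:
        clusters.append(current)
    return clusters


# Three overlapping bins of filter values; each bin is clustered on its own.
bins = [[0.0, 0.4, 0.9, 5.0], [4.8, 5.0, 5.3], [9.0, 9.2, 12.0]]

with ThreadPoolExecutor() as pool:
    per_bin = list(pool.map(cluster_1d, bins))  # results come back in bin order

for clusters in per_bin:
    print(clusters)
```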
Q & A
References
1. Lum, P.Y. et al. Extracting insights from the shape of complex
data using topology. Sci. Rep. 3, 1236; DOI: 10.1038/srep01236
(2013)
2. Ghrist, R. Barcodes: The Persistent Topology of Data. Bulletin of
the AMS 45.1 pp61-75 S 0273-0979(07)01191-3 (2008)
3. Zomorodian, A. Topological Data Analysis. Proceedings of
Symposia in Applied Mathematics. AMS (2011)
4. Carlsson, G. Topology and Data. Bulletin of the AMS 46.2 pp255-
308 S 0273-0979(09)01249-X (2009)
5. Singh, G. et al. Topological Methods for the Analysis of High
Dimensional Data Sets and 3D Object Recognition. Eurographics
Symposium on Point-Based Graphics (2007)
6. Ayasdi. TDA and Machine Learning: Better Together. (2015)
7. "Clustering." 2.3. Clustering — Scikit-learn 0.15.2 Documentation.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR
12, pp. 2825-2830 (2011)
