WiDS Alexandria, Egypt workshop on topological data analysis (Python and R code available on request), covering persistent homology, the Mapper algorithm, and discrete Ricci curvature. Examples include text data and social network data.
Topological Data Analysis: visual presentation of multidimensional data sets (DataRefiner)
Topological data analysis (TDA) is an unsupervised approach that may revolutionise the way data can be mined and eventually drive the next generation of analytical tools. The idea behind TDA is to "measure" the shape of data and find a compressed combinatorial representation of that shape. As in ordinary topology, these combinatorial representations provide a compressed representation of high-dimensional data sets that retains information about the geometric relationships between data points. TDA can also be used as a very powerful clustering technique. Edward will present a comparison between TDA and other dimension-reduction algorithms such as PCA, LLE, Isomap, MDS, and Spectral Embedding.
Introduction to Topological Data Analysis (Mason Porter)
Here are slides for my 3/14/21 talk on an introduction to topological data analysis.
This is the first talk in our Short Course on topological data analysis at the 2021 American Physical Society (APS) March Meeting: https://march.aps.org/program/dsoft/gsnp-short-course-introduction-to-topological-data-analysis/
CCS2019: Topological time-series analysis with delay-variant embedding (Ha Phuong)
Q. H. Tran and Y. Hasegawa, Topological time-series analysis with delay-variant embedding, Oral Presentation at Conference on Complex Systems, Singapore, Singapore, Oct. 2019.
SIAM-AG21: Topological Persistence Machine of Phase Transition (Ha Phuong)
Presentation at SIAM Conference on Applied Algebraic Geometry (AG21), Aug. 2021.
Abstract. The study of phase transitions using data-driven approaches is challenging, especially when little prior knowledge of the system is available. Topological data analysis is an emerging framework for characterizing the shape of data and has recently achieved success in detecting structural transitions in materials science, such as the glass-liquid transition. However, data obtained from physical states may not have explicit shapes as structural materials do. We thus propose a general framework, termed the "topological persistence machine," to construct the shape of data from correlations in states, so that we can subsequently decipher phase transitions via qualitative changes in that shape. Our framework enables an effective and unified approach to phase transition analysis without prior knowledge of the phases and without requiring the investigation of large system sizes. We demonstrate the efficacy of the approach in detecting the Berezinskii-Kosterlitz-Thouless phase transition in the classical XY model and quantum phase transitions in the transverse Ising and Bose-Hubbard models. Interestingly, while these phase transitions have proven notoriously difficult to analyze with traditional methods, they can be characterized through our framework without prior knowledge of the phases. Our approach is thus expected to be widely applicable and to offer practical value for exploring the phases of experimental physical systems.
Extending the superlearner framework to survival analysis. Includes boosted regression, random forests, decision trees, Bayesian model averaging, and Morse-Smale regression.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
UMAP is a technique for dimensionality reduction that was proposed two years ago and quickly gained widespread usage.
In this presentation I will try to demystify UMAP by comparing it to t-SNE. I also sketch its theoretical background in topology and fuzzy sets.
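The comparison can be made concrete in a few lines. The sketch below is an illustration, not from the talk: it embeds a toy 10-dimensional dataset with scikit-learn's t-SNE. umap-learn's `UMAP` class exposes the same `fit_transform` interface, so swapping the two methods is a one-line change.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy data: two well-separated Gaussian blobs in 10 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(25, 10)),
    rng.normal(8.0, 1.0, size=(25, 10)),
])

# t-SNE embedding; perplexity must be smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (50, 2)
```

With umap-learn installed, `umap.UMAP(n_neighbors=10).fit_transform(X)` would slot into the same place.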
Residuals represent variation in the data that cannot be explained by the model.
Residual plots are useful for discovering patterns, outliers, or misspecifications of the model. Systematic patterns may suggest how to reformulate the model.
If the residuals exhibit no pattern, then this is a good indication that the model is appropriate for the particular data.
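As a minimal illustration (not from the original slides), the NumPy sketch below fits a straight line to data generated from a quadratic relationship; the residuals then show the U-shaped systematic pattern that signals a misspecified model.

```python
import numpy as np

# Fit a straight line to data that is actually quadratic, then inspect residuals.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 0.5 * x**2 + rng.normal(0, 1, size=x.size)  # true relationship is quadratic

slope, intercept = np.polyfit(x, y, deg=1)      # misspecified linear model
residuals = y - (slope * x + intercept)

# A systematic pattern signals model misspecification: here the residuals
# are positive at both ends of x and negative in the middle (U-shaped).
```

Plotting `residuals` against `x` would make the curvature obvious; for a correctly specified model the plot would be structureless noise.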
Slide show for the webinar on "Spatial Data Science with R" organized for the GeoDevelopers.org community. The video of the webinar and all the related materials including source code and sample data can be downloaded from this link: http://amsantac.co/blog/en/2016/08/07/spatial-data-science-r.html
In this webinar I talked about Data Science in the context of its application to spatial data and explained how we can use the R language for the analysis of geographic information within the different stages of a data science workflow, from the import and processing of spatial data to visualization and publication of results.
Tim Maudlin: New Foundations for Physical Geometry (Arun Gupta)
New Foundations for Physical Geometry
Original URL: http://www.unil.ch/webdav/site/philo/shared/summer_school_2013/NYU.ppt
Tim Maudlin
NYU
Physics & Philosophy of Time
July 25, 2013
High-Dimensional Data Visualization, Geometry, and Stock Market Crashes (Colleen Farrelly)
Miami Data Science Salon (Nov 2018) talk regarding geometric methods for dimensionality reduction, data visualization, and stock market analysis (India's NSE).
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress... (Databricks)
Given the resurgence of neural network-based techniques in recent years, it is important for data science practitioners to understand how to apply these techniques and the tradeoffs between neural network-based and traditional statistical methods.
This lecture discusses two specific techniques: Vector Autoregressive (VAR) models and Recurrent Neural Networks (RNNs). The former is one of the most important classes of multivariate time series statistical models applied in finance, while the latter is a neural network architecture that is suitable for time series forecasting. I'll demonstrate how they are implemented in practice and compare their advantages and disadvantages. Real-world applications, demonstrated using Python and Spark, are used to illustrate these techniques. While not the focus of this lecture, exploratory time series data analysis using time-series plots, autocorrelation plots (i.e. correlograms), partial autocorrelation plots, cross-correlation plots, histograms, and kernel density plots will also be included in the demo.
The attendees will learn: the formulation of a time series forecasting problem in the context of VAR and RNN; the application of RNN-based techniques to time series forecasting; the application of VAR models to multivariate time series forecasting; the pros and cons of VAR and RNN-based techniques in the context of financial time series forecasting; and when to use VAR versus RNN-based techniques.
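To make the VAR side concrete, here is a hedged NumPy-only sketch (not the lecture's Spark code) of the core idea: a VAR(1) model regresses today's vector of observations on yesterday's, and the coefficient matrix can be recovered by least squares.

```python
import numpy as np

# Simulate a 2-variable VAR(1) process: x_t = A @ x_{t-1} + noise.
rng = np.random.default_rng(0)
A_true = np.array([[0.6, 0.2],
                   [0.1, 0.5]])   # stationary: eigenvalues 0.7 and 0.4
T = 2000
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = A_true @ x[t - 1] + rng.normal(0, 0.1, size=2)

# Estimate A by least squares: regress x_t on x_{t-1}.
Y, Z = x[1:], x[:-1]
A_hat = np.linalg.lstsq(Z, Y, rcond=None)[0].T

# One-step-ahead forecast from the last observation.
forecast = A_hat @ x[-1]
print(np.round(A_hat, 2))
```

Real libraries (e.g. statsmodels) add lag selection, intercepts, and inference on top of exactly this regression.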
Exploratory data analysis using xgboost package in RSatoshi Kato
Explains a how-to procedure for exploratory data analysis using xgboost (EDAXGB), covering feature importance, sensitivity analysis, feature contribution, and feature interaction. It is based entirely on the built-in predict() function in the R package.
All of the sample codes are available at: https://github.com/katokohaku/EDAxgboost
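The sample code in the repository is in R; as a language-neutral illustration of the same predict()-driven idea, the NumPy sketch below implements permutation importance (a form of sensitivity analysis) against a hand-fitted linear model. Any model exposing a predict function, including an xgboost booster, could be dropped in. All names here are illustrative, not from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends strongly on feature 0, weakly on feature 1, not on feature 2.
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=500)

# Any fitted model with a predict() works; here, ordinary least squares.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda M: M @ beta

def permutation_importance(X, y, predict, rng):
    """Importance of feature j = increase in MSE after shuffling column j."""
    base_mse = np.mean((y - predict(X)) ** 2)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        scores.append(np.mean((y - predict(Xp)) ** 2) - base_mse)
    return np.array(scores)

imp = permutation_importance(X, y, predict, rng)
print(imp.round(2))  # feature 0 dominates, feature 2 near zero
```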
"Number Crunching in Python": slides presented at EuroPython 2012, Florence, Italy
Slides have been authored by me and by Dr. Enrico Franchi.
Scientific and engineering computing, the NumPy ndarray implementation, and some working case studies are covered.
Statistical Programming with JavaScript (David Simons)
Almost every application needs data to function - and if you don't know how to be nice to your data, then things will start to go wrong. This talk aims to convince JavaScript developers that they do need to care about statistics, and then talk about how to do so. We look at some theory and lots of case studies and real-world advice to deal with a range of scenarios.
The talk aims to touch on the entire data life cycle: We'll dive into data modelling and how the shape and size of your data affects your architecture, and how to build these architectures using JavaScript. Once the data is in the front-end, we'll touch on the wide range of libraries that allows your code to react based on the data, and the wrappers on top that aid visualisation and readability.
Talk given at neo4j conference "Graph Connect" - discussing some graph theory (old and new), and why knowing your stuff can come in handy on a software project.
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ... (Databricks)
Explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Apache Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks.
This session will examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). Learn how these methods are applied to terabyte-sized problems in particle physics, climate modeling and bioimaging, as use cases where interpretable analytics is of interest. The data matrices are tall-and-skinny, which enable the algorithms to map conveniently into Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns and provide tuning guidance to obtain high performance. Based on joint work with Alex Gittens and many others.
Forecasting time series: powerful and simple (Ivo Andreev)
Time series are a sequence of data points ordered in time. Time series forecasting has two main purposes: to understand the mechanisms that lead to rises or falls, and to predict future values. It often analyses trends, cyclical events, and seasonality, and has unique importance in economics and business. The quality of predictions can only be evaluated in the future due to temporal dependencies on previous data points, and there are many model types for approximation. In this session we are going to talk about challenges, ways of improvement, and a technology stack including ML.NET, ARIMA, Python, Azure ML, regression, and FB Prophet.
Data modelling is an important tool in the toolbox of a developer. By building and communicating a shared understanding of the domain they're working with, their applications and APIs become more usable and maintainable. However, as you scale up your technical teams, how do you keep these benefits whilst avoiding time-consuming meetings every time something new comes along? This talk reminds us of key data modelling techniques and how our use of Kafka changes and informs them. It then examines how these patterns change as more teams join your organisation and how Kafka comes into its own in this world.
Spatially resolved pair correlation functions for point cloud data (Tony Fast)
Presentation on computing spatial correlation functions for point cloud materials science information. This presentation uses tree algorithms and Fourier methods to compute the statistics. The analysis is performed on Al-Cu interface information provided by John Gibbs and Peter Voorhees at Northwestern University as funded by the Mosaic of Microstructure MURI program.
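A hedged, brute-force sketch of the first step (illustrative NumPy, not the talk's tree or Fourier implementations): the pair correlation function starts from a histogram of all pairwise separations, normalized by the area of each radial shell. At scale, a k-d tree such as `scipy.spatial.cKDTree` replaces the all-pairs distance computation.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 1, size=(300, 2))   # synthetic 2-D point cloud

# All pairwise distances (brute force; a k-d tree replaces this at scale).
diff = points[:, None, :] - points[None, :, :]
d = np.sqrt((diff ** 2).sum(-1))
d = d[np.triu_indices_from(d, k=1)]         # keep each unique pair once

# Histogram of pair separations; dividing by the shell area 2*pi*r*dr
# turns the counts into an (unnormalized) pair correlation function g(r).
bins = np.linspace(0, 0.5, 26)
counts, edges = np.histogram(d, bins=bins)
r = 0.5 * (edges[:-1] + edges[1:])
shell_area = 2 * np.pi * r * np.diff(edges)
g = counts / (shell_area * len(d))          # up to a density normalization
```

For uniform points, g flattens out at large r; peaks at short range would indicate spatial structure such as the interfaces studied in the talk.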
These slides describe the basic concepts of industrial-strength compiler design, including static single-assignment form (SSA) and various optimizations such as dead code elimination, global value numbering, and constant propagation. This is intended for a 150-minute undergraduate compiler class.
Generative AI for Social Good at Open Data Science East 2024 (Colleen Farrelly)
A brief overview of generative AI technologies and their use for social good initiatives, including cultural training, medical image generation, drug design, and public health.
PyData Global 2023 talk overviewing case studies in network science, including stock market crash prediction, food price pattern mining, and stopping the spread of epidemics.
Overview of mathematical and machine learning models related to climate risk modeling, climate change simulations, and change point detection. Includes a hands-on session with geometry-based systems analysis of food prices related to climate change and geopolitical factors.
WiDS Workshop on natural language processing and generative AI. Details common methods that tie into coding examples. Ends with ethics discussion regarding these technologies and potential for misuse.
Link to talk YouTube: https://www.youtube.com/watch?v=byGzKm0H1-8&list=PLHAk3jHXWpxI7fHw8m5PhrpSRpR3NIjQo&index=3
ODSC-East 2023 presentation covering topics related to my book, The Shape of Data, including how geometry plays a role in text/image embeddings, network science problems, survey data analytics, image analytics, and epidemic wrangling.
This talk overviews my background as a female data scientist, introduces many types of generative AI, discusses potential use cases, highlights the need for representation in generative AI, and showcases a few tools that currently exist.
Emerging Technologies for Public Health in Remote Locations (Colleen Farrelly)
The tools available to leverage for public health interventions have changed significantly in the past decades. Tools from geometry, natural language processing, and generative AI allow for quick design and implementation of interventions, even in very rural parts of the world. Case studies involve HIV, Ebola, and COVID interventions.
WoComToQC workshop lecture on Forman-Ricci curvature for applications in industry (social networks, disaster logistics, spatial data, and spatiotemporal goods pricing data).
PyData Global talk covering tools from geometry/topology and their uses in public health, public policy, and social good initiatives. Examples include food price prediction, COVID policies, public health interventions, and fair AI.
Data Science Dojo Talk on comparing time series using persistent homology. Short overview of time series data. A bit of topology. Code available. Example includes stock exchange data.
Statistical and topological algorithm piece of an Applied Machine Learning Days Morocco talk. Covers ARIMA models, SSA models, GEE models, and persistent homology. Applications include pricing data, stock data, development data, and healthcare data. Datasets and full presentation can be found on GitHub: https://github.com/gabayae/Time-Series-Applications_AMLD2022
An introduction to quantum machine learning (Colleen Farrelly)
Very basic introduction to quantum computing given at Indaba Malawi 2022. Overviews some basic hardware in classical and quantum computing, as well as a few quantum machine learning algorithms in use today. Resources for self-study provided.
Indaba Malawi workshop on basic approaches to time series data, including ARIMA models and SSA models. Example in R includes an agricultural example from historical Malawi data with Rssa package and base ARIMA models.
NLP: Challenges and Opportunities in Underserved Areas (Colleen Farrelly)
This talk highlights the challenges and opportunities that exist in linguistically underserved areas. It covers NLP initiatives in Sub-Saharan Africa, as well as financial opportunities in technology if linguistically neglected areas can produce tools in their local languages. Ethics, ownership, and other concerns are highlighted to guide development initiatives.
Geometry, Data, and One Path Into Data Science (Colleen Farrelly)
Women in Data Science (Alexandria, Egypt) keynote address. Topics cover my journey into data science/machine learning, an overview of data science as a profession, and some case studies on topology/geometry in analytics. Example case studies include insurance, natural language processing, social network analysis, and psychometrics.
First part of a workshop looking at industry case studies in natural language processing for From Theory to Practice Workshop (AIMS, Kigali, March 2022).
SAS Global 2021 Introduction to Natural Language Processing (Colleen Farrelly)
Overview of text data, processing of text data, integration of text data with structured databases, and uses of text data in analytics across a variety of fields. Here's the talk link: https://www.youtube.com/watch?v=wS0X1bSsuUU
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Opendatabay - Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
1. Topological Data Analysis (Colleen M. Farrelly, Datasembly)
2. Why Topological Data Analysis?
• Autocorrelations/dynamic systems (time series, spatiotemporal data)
• Wide data (-omics data)
• Small data (pilot studies, rare diseases…)
• Visualization-heavy needs for comparisons/groups (especially high-dimensional data)
• Data that breaks assumptions of machine learning algorithms/statistical models
3. Examples of TDA Tools
• Persistent homology
• Mapper algorithm
• Homotopy continuation
• Morse functions/clustering/regression
• Euler calculus
• Discrete exterior calculus
• Ricci curvature
• Mappings to Teichmüller space
4. Persistent Homology: Comparing Groups and Extending Hierarchical Clustering
5. Point Clouds and Distance Metrics
7. Homology Overview: Betti Numbers: (1,0,0…) (1,1,0…) (1,0,1…)
8. Filtrations and Persistence
• Filter distances or objects to obtain a series of topological objects (graphs, simplicial complexes…)
• Compute a series of metrics or summary statistics over filtrations
• Track how metrics/statistics change across the filtration
9. Algorithm Details
• Rips filtration: pairwise intersections of ε-balls centered at a given point in the point cloud or distance matrix
• Dimension parameter: number of Betti numbers to compute (usually set to a dimension of 0 or 1)
• Diagram parameters/distance computation parameters: optional visualization or statistical testing functions after using ripser()
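To ground the filtration idea, here is an illustrative pure-Python sketch (not from the slides) of the 0-dimensional part of a Rips filtration: sorting edges by length and merging components with union-find is exactly how H0 death scales arise, and it is essentially Kruskal's algorithm. Libraries such as ripser also compute the higher-dimensional classes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clusters in the plane: expect two long-lived connected components.
pts = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])

# Pairwise distances, sorted: these are the epsilon values at which
# edges enter the Rips filtration.
n = len(pts)
edges = sorted(
    (float(np.linalg.norm(pts[i] - pts[j])), i, j)
    for i in range(n) for j in range(i + 1, n)
)

# Union-find: each edge that merges two components kills one H0 class;
# its epsilon is the "death" value of that class.
parent = list(range(n))

def find(a):
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

deaths = []
for eps, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        deaths.append(eps)  # one component dies at this scale

# n points start as n components; n-1 merges leave one immortal component.
# The largest death value marks the scale at which the two clusters join.
print(len(deaths), round(max(deaths), 2))
```

In a persistence diagram these deaths appear as H0 points: many short-lived ones from within-cluster merges and one long-lived point for the gap between the two clusters.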
10. Implementation in Python or R
• R packages: TDAstats, TDAverse
• Python packages: Scikit-TDA, Ripser/persim, Giotto-TDA
11. Example Analysis: Problem/Data
• Small set of BERT-embedded poems that are either humorous or serious in tone
• Want to understand if there are significant differences in BERT features between the two sets of poems
12. Mapper: Clustering and Data Mining
13. Morse Functions: Height Functions and Critical Points
15. Algorithm Details
• Project data: takes input data and projects it to custom embeddings (3-dimensional space, knn distances…)
• Create cover: percent of overlap across covers and number of covers (different results with different parameters)
• Cluster: DBSCAN or other clusterers available in scikit-learn
• Save model: save output and details to a webpage (path_html)
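The project/cover/cluster steps above can be sketched end to end in plain NumPy. This is an illustration under simplified assumptions, not Kepler-Mapper's actual implementation: the lens is just the x-coordinate, and clustering is single linkage at a fixed distance scale rather than DBSCAN.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
# Noisy circle: Mapper on a 1-D lens should recover a loop of clusters.
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))

# 1) Lens/filter: project each point to its x-coordinate.
lens = X[:, 0]

# 2) Cover: overlapping intervals over the lens range.
n_intervals, overlap = 6, 0.3
lo, hi = lens.min(), lens.max()
width = (hi - lo) / (n_intervals * (1 - overlap) + overlap)
step = width * (1 - overlap)

# 3) Cluster the points inside each interval (single linkage at a fixed scale).
clusters = []
for k in range(n_intervals):
    a = lo + k * step
    idx = np.where((lens >= a) & (lens <= a + width))[0]
    remaining = set(idx.tolist())
    while remaining:
        comp = {remaining.pop()}
        grew = True
        while grew:
            grew = False
            for p in list(remaining):
                if any(np.linalg.norm(X[p] - X[q]) < 0.3 for q in comp):
                    comp.add(p)
                    remaining.discard(p)
                    grew = True
        clusters.append(comp)

# 4) Nerve: connect clusters that share points (this graph is the Mapper output).
edges = [(i, j) for (i, ci), (j, cj) in combinations(list(enumerate(clusters)), 2)
         if ci & cj]
print(len(clusters), len(edges))
```

On this data the nerve comes out as a cycle of clusters, reflecting the circular shape of the point cloud.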
16. Implementation in Python or R
• R packages: TDAmapper
• Python packages: Kepler-Mapper (part of Scikit-TDA), Giotto-TDA, tmap
17. Example Analysis: Problem/Data
• Small set of BERT-embedded poems that are either humorous or serious in tone
• Want to cluster poems to understand the existence of subgroups
18. Ricci Curvature: Finding Key Pieces of a Social Network
19. Ricci Curvature: negative, zero, positive
20. Power/Disease Network Backbones
21. Algorithm Details
• Calculate curvature on edges: examine vertices and their adjacent edges to see how much "pull" there is on an edge
• Calculate curvature on vertices: sum up edge weights around a vertex to find out how much "stuff" is weighing it down
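These two steps can be sketched with the simplest combinatorial version of Forman-Ricci curvature on an unweighted graph, F(u, v) = 4 - deg(u) - deg(v), ignoring higher-order (triangle) terms. The toy network below is hypothetical and uses plain dictionaries rather than igraph or networkx.

```python
from collections import defaultdict

# Toy "supply chain" network: a hub connected to two small communities.
edges = [("hub", "a1"), ("hub", "a2"), ("hub", "b1"), ("hub", "b2"),
         ("a1", "a2"), ("b1", "b2")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Forman-Ricci curvature of an edge in an unweighted graph (no triangle
# terms): F(u, v) = 4 - deg(u) - deg(v). Strongly negative edges behave
# like load-bearing bridges; positive edges sit inside tight clusters.
def forman(u, v):
    return 4 - len(adj[u]) - len(adj[v])

edge_curv = {(u, v): forman(u, v) for u, v in edges}

# Vertex curvature: aggregate the curvature of the incident edges.
node_curv = {u: sum(forman(u, v) for v in adj[u]) for u in adj}
print(edge_curv)
```

The hub's edges come out most negative, matching the intuition from the slides that bridge-like connections are the vulnerable, load-bearing pieces of the network.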
22. Implementation in Python or R
• R packages: custom in igraph
• Python packages: custom in igraph, custom in networkx
23. Example Analysis: Problem/Data
• Town network representing a supply chain (medical, food, electricity…)
• Want to understand vulnerabilities that exist within the network