Talk given at CMStatistics 2016 (http://cmstatistics.org/CMStatistics2016/).
The standard methodology for clustering financial time series is quite brittle to outliers and heavy tails, for several reasons: Single Linkage / MST suffers from the chaining phenomenon, and the Pearson correlation coefficient is appropriate for Gaussian distributions, which financial returns (especially for credit derivatives) usually do not follow. At Hellebore Capital Ltd, we strive to improve the methodology and to put it on firmer statistical ground. We think that stability is a paramount property to verify, and it is closely linked to the statistical convergence rates of the methodologies (combinations of clustering algorithms and dependence estimators). This gives us a model selection criterion: the best clustering methodology is the one that can reach a given 'accuracy' with the minimum sample size.
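The chaining phenomenon mentioned above is easy to reproduce. The sketch below is a toy 1D example (not CDS data): a thin chain of points bridges two well-separated groups, so Single Linkage glues the groups together through the chain and, when asked for two clusters, only peels off an outlier, while Average Linkage still splits the data between the two groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight groups of points, bridged by a sparse "chain", plus one outlier.
group_a = np.linspace(0.0, 1.0, 30)
chain = np.linspace(1.55, 8.45, 20)
group_b = np.linspace(9.0, 10.0, 30)
outlier = np.array([30.0])
data = np.concatenate([group_a, chain, group_b, outlier]).reshape(-1, 1)

# Single Linkage: the chain links the two groups, so cutting at 2 clusters
# only peels off the outlier -- the groups are merged (chaining phenomenon).
single_labels = fcluster(linkage(data, method="single"), 2, criterion="maxclust")

# Average Linkage at 3 clusters ({outlier} + 2 real clusters) still splits
# the data between the two groups.
average_labels = fcluster(linkage(data, method="average"), 3, criterion="maxclust")
```

Indices 0-29 are group A, 50-79 group B, 80 the outlier; under single linkage, a point of A and a point of B end up with the same label.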
Clustering Financial Time Series: How Long is Enough?, by Gautier Marti
IJCAI-16, New York, conference presentation of paper http://www.ijcai.org/Proceedings/16/Papers/367.pdf
Researchers have used from 30 days to several years of daily returns as source data for clustering financial time series based on their correlations. This paper sets up a statistical framework to study the validity of such practices. We first show that clustering correlated random variables from their observed values is statistically consistent. Then, we also give a first empirical answer to the much debated question: How long should the time series be? If too short, the clusters found can be spurious; if too long, dynamics can be smoothed out.
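The question can be probed numerically. The sketch below is a toy setup (not the paper's exact protocol): simulate 15 correlated Gaussian "returns" in 3 hidden blocks, cluster them from T observations with correlation-distance average linkage, and check that the true partition is recovered once T is large enough.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)

def simulate(T, n_blocks=3, n_per_block=5, rho=0.9):
    """Block-correlated Gaussian series via one common factor per block."""
    series, truth = [], []
    for b in range(n_blocks):
        factor = rng.standard_normal(T)
        for _ in range(n_per_block):
            noise = rng.standard_normal(T)
            series.append(np.sqrt(rho) * factor + np.sqrt(1 - rho) * noise)
            truth.append(b)
    return np.array(series), np.array(truth)

def correlation_clusters(series, k):
    corr = np.corrcoef(series)
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))  # correlation distance
    condensed = dist[np.triu_indices_from(dist, k=1)]
    return fcluster(linkage(condensed, method="average"), k, criterion="maxclust")

def partition(labels):
    """Cluster labels -> set of frozensets (label-invariant comparison)."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(i)
    return {frozenset(g) for g in groups.values()}

series_long, truth = simulate(T=1000)
labels_long = correlation_clusters(series_long, k=3)
```

Rerunning with a very small T (say 10 instead of 1000) frequently fails to recover the blocks, which is the paper's point: the required sample size is an empirical, measurable quantity.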
Optimal Transport between Copulas for Clustering Time Series, by Gautier Marti
Presentation slides of our ICASSP 2016 conference paper in Shanghai. They describe the motivation and design of the Target Dependence Coefficient, a coefficient which can target or forget specific dependence relationships between the variables. This coefficient can be useful for clustering financial time series. Several such use cases are described on our Tech Blog: https://www.datagrapple.com/Tech/optimal-copula-transport.html
You may have already read many times that the job of a Data Scientist is to skim through a huge amount of data searching for correlations between some variables of interest. And also, that one of their worst enemies (besides "correlation doesn't imply causation") is spurious correlation. But what really is correlation? Are there several types of correlations? Some "good", some "bad"? What about their estimation? This talk is a very visual presentation around the notions of correlation and dependence. I first illustrate how the standard linear correlation is estimated (the Pearson coefficient), then a more robust alternative: the Spearman coefficient. Building on a geometric understanding of their nature, I present a generalization that can help Data Scientists explore, interpret, and measure the dependence (not necessarily linear or comonotonic) between the variables of a given dataset. Financial time series (stocks, credit default swaps, FX rates) and features from the UCI datasets are considered as use cases.
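A tiny illustration of the robustness gap between the two coefficients (toy data, one injected outlier): Pearson collapses while Spearman, which only sees ranks, barely moves.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(100, dtype=float)
y = x.copy()        # perfect monotone (even linear) dependence
y[-1] = -1e6        # a single heavy-tailed outlier

pearson = pearsonr(x, y)[0]    # dragged far below 1 by one point
spearman = spearmanr(x, y)[0]  # ranks shift by one position: still ~0.94
```

This is why rank-based estimators are often preferred for heavy-tailed financial returns.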
Optimal Transport vs. Fisher-Rao distance between Copulas, by Gautier Marti
How can we compare two dependence structures (represented by copulas)? It depends on the task. For clustering variables with similar dependence, prefer Optimal Transport. For detecting change points in a dynamical dependence structure, prefer Fisher-Rao and its associated f-divergences (for example, an approach à la Frédéric Barbaresco in radar signal processing). This study illustrates these properties with bivariate Gaussian copulas.
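As a concrete toy sketch of the optimal-transport side (not the paper's implementation): represent each empirical copula by its rank-transformed sample, and compute a discrete OT distance between two equally-weighted point clouds with scipy's assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_copula(x, y):
    """Rank-transform both margins to [0, 1]: the empirical copula sample."""
    n = len(x)
    u = np.argsort(np.argsort(x)) / (n - 1)
    v = np.argsort(np.argsort(y)) / (n - 1)
    return np.column_stack([u, v])

def ot_distance(P, Q):
    """Discrete OT with uniform weights = min-cost perfect matching."""
    cost = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(200), rng.standard_normal(200)
comonotone = empirical_copula(x1, x1)            # same dependence...
comonotone2 = empirical_copula(x2, np.exp(x2))   # ...despite different margins
countermonotone = empirical_copula(x1, -x1)
```

Note the desired invariance: two comonotone pairs with very different margins have (near-)zero OT distance between their copulas, while opposite dependence structures are far apart.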
Clustering Financial Time Series using their Correlations and their Distributions, by Gautier Marti
We have designed a distance that takes into account both the correlation between the time series and also the distribution of the individual time series. A tutorial with Python code is available: https://www.datagrapple.com/Tech/GNPR-tutorial-How-to-cluster-random-walks.html
This talk was given at the Paris Machine Learning Meetup.
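A toy rendition of such a blended distance (the exact definitions in the tutorial differ; the parameter `theta`, the Spearman-based dependence part, and the L2 quantile part below are illustrative simplifications):

```python
import numpy as np
from scipy.stats import spearmanr

def blended_distance(x, y, theta=0.5):
    """theta = 1: pure dependence distance; theta = 0: pure distributional."""
    rho = spearmanr(x, y)[0]
    d_dependence = (1.0 - rho) / 2.0              # 0 iff comonotone
    qx, qy = np.sort(x), np.sort(y)               # empirical quantiles
    d_distribution = np.mean((qx - qy) ** 2)      # 0 iff same margins
    return theta * d_dependence + (1.0 - theta) * d_distribution

x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 5.0   # same dependence pattern, different distribution
```

With theta = 1 the distance ignores margins entirely (x and 2x + 5 look identical); with theta = 0 it ignores dependence and only compares distributions.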
On the stability of clustering financial time series, by Gautier Marti
Talk given at IEEE ICMLA 2015, Miami.
In this presentation, we suggest some data perturbations that can help to validate or reject a clustering methodology, besides yielding insights on the time series at hand. We show in this study that Pearson correlation is not that relevant for clustering these time series since it yields unstable clusters; prefer a more robust measure such as Spearman correlation, which is based on rank statistics.
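The perturbation idea can be sketched as follows (a generic bootstrap-stability protocol, not the exact perturbations of the talk): resample time indices, recluster, and score how consistently pairs of series stay together across replicates.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

def cluster_labels(series, k):
    corr = np.corrcoef(series)
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    condensed = dist[np.triu_indices_from(dist, k=1)]
    return fcluster(linkage(condensed, method="average"), k, criterion="maxclust")

def stability(series, k, n_rep=20):
    """Mean pairwise agreement of co-membership across bootstrap replicates."""
    T = series.shape[1]
    comemberships = []
    for _ in range(n_rep):
        idx = rng.integers(0, T, size=T)          # bootstrap the dates
        labels = cluster_labels(series[:, idx], k)
        comemberships.append(labels[:, None] == labels[None, :])
    scores = [np.mean(a == b)
              for i, a in enumerate(comemberships)
              for b in comemberships[i + 1:]]
    return float(np.mean(scores))

# 3 blocks of 5 strongly correlated series vs. pure noise
T = 60
factor = np.repeat(rng.standard_normal((3, T)), 5, axis=0)
signal = np.sqrt(0.9) * factor + np.sqrt(0.1) * rng.standard_normal((15, T))
noise = rng.standard_normal((15, T))

stab_signal = stability(signal, k=3)
stab_noise = stability(noise, k=3)
```

A methodology that produces stable partitions on structured data and visibly unstable ones on noise is behaving as one would hope; spurious clusters tend to reveal themselves through low stability scores.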
A review of two decades of correlations, hierarchies, networks and clustering in financial markets, by Gautier Marti
Opinionated review of two decades of correlations, hierarchies, networks and clustering in financial markets, presented at Ton Duc Thang University in Ho Chi Minh City, Vietnam.
Autoregressive Convolutional Neural Networks for Asynchronous Time Series, by Gautier Marti
In this talk, we present a CNN architecture for predicting autoregressive asynchronous time series. We illustrate its application on predicting traders' quotes of credit default swaps (a proprietary dataset from Hellebore Capital), and on artificial time series. The paper is available here: http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf
Affine cascade models for term structure dynamics of sovereign yield curves, uploaded by LAURAMICHAELA
Rafael Serrano, professor at Universidad del Rosario
Abstract:
In the first part of the talk, I will present an introduction to stochastic affine short-rate models for the term structure of yield curves. In the second part, I will focus on a recursive affine cascade with persistent factors for which the number of parameters, under certain specifications, is invariant to the size of the state space and converges to a stochastic limit as the number of factors goes to infinity. The cascade construction thereby overcomes dimensionality difficulties associated with general affine models. We contrast two specifications of the model using a linear Kalman filter on a panel of Colombian sovereign yields.
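As a minimal concrete instance of an affine short-rate model (the cascade in the talk generalizes far beyond this), the one-factor Vasicek model prices zero-coupon bonds in exponential-affine form, P(τ) = exp(A(τ) − B(τ)·r). Parameter values below are made up for illustration.

```python
import numpy as np

def vasicek_yield(r0, tau, k=0.5, theta=0.04, sigma=0.01):
    """Zero-coupon yield y(tau) = (B(tau) * r0 - A(tau)) / tau in the
    Vasicek model dr = k (theta - r) dt + sigma dW."""
    B = (1.0 - np.exp(-k * tau)) / k
    A = (theta - sigma**2 / (2 * k**2)) * (B - tau) - sigma**2 * B**2 / (4 * k)
    return (B * r0 - A) / tau

short_yield = vasicek_yield(0.03, 1e-6)    # tends to r0 as tau -> 0
long_yield = vasicek_yield(0.03, 1000.0)   # tends to theta - sigma^2 / (2 k^2)
```

A Kalman filter, as used in the talk, would then treat such affine yields plus observation noise as the measurement equation, with the latent factors following the state dynamics.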
Vertex-Centric Asynchronous Belief Propagation Algorithm for Large-Scale Graphs (Universidade de São Paulo)
Inference problems on networks and their algorithms have always been important subjects, all the more so now that so much data is available and there is so little time to make sense of it.
Common applications range from product recommendation to social networks and protein interactions.
One of the main inference methods in these types of networks is guilt-by-association, where labeled nodes propagate their information throughout the network towards unlabeled nodes.
While there is a widely used algorithm for this setting, called Belief Propagation (BP), it lacks convergence guarantees on loopy networks.
More recently, an alternative method called LinBP was proposed; while it solves the convergence issue, scalability to large graphs that do not fit in memory remains a challenge.
Additionally, most works that apply BP to large-scale graphs rely on specific infrastructure such as supercomputers and computational clusters.
We therefore propose a new algorithm that combines state-of-the-art asynchronous vertex-centric parallel processing techniques with the state-of-the-art BP alternative LinBP, to provide a scalable framework for large-graph inference that runs on a single commodity machine.
Our results show that our algorithm is up to 200 times faster than LinBP's SQL implementation on the tested networks, while achieving the same accuracy.
We also show that, thanks to asynchronous processing, our algorithm needs fewer iterations to converge than LinBP with the same parameters.
Finally, we believe our methodology highlights the not yet fully explored parallelism available on commodity machines, pointing towards a more cost-efficient computational paradigm.
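To make the linearized update concrete, here is a scalar toy version in the spirit of LinBP (a hypothetical simplification: the real algorithm propagates belief vectors modulated by a coupling matrix). Final beliefs solve a linear system, and the fixed-point iteration converges whenever the spectral radius of the propagation operator is below 1.

```python
import numpy as np

# Small loopy graph: a triangle (0-1-2) with a pendant node 3.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
e = np.array([1.0, 0.0, 0.0, -1.0])   # prior ("explicit") beliefs
eps = 0.2                             # propagation strength

# Convergence condition: spectral radius of eps * A below 1.
assert np.max(np.abs(np.linalg.eigvals(eps * A))) < 1.0

# Iterative propagation b <- e + eps * A b ...
b = np.zeros_like(e)
for _ in range(200):
    b = e + eps * (A @ b)

# ... converges to the closed-form solution of (I - eps A) b = e.
b_direct = np.linalg.solve(np.eye(4) - eps * A, e)
```

The iterative form is what a vertex-centric engine parallelizes: each node repeatedly recomputes its belief from its neighbors' current beliefs, and asynchrony changes the update order without changing the fixed point.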
A review of two decades of correlations, hierarchies, networks and clustering...Gautier Marti
Opinionated review of two decades of correlations, hierarchies,
networks and clustering in financial markets presented at Ton Duc Thang University in Ho Chi Minh City, Vietnam.
Autoregressive Convolutional Neural Networks for Asynchronous Time SeriesGautier Marti
In this talk, we present a CNN architecture for predicting autoregressive asynchronous time series. We illustrate its application on predicting traders’ quotes of credit default swaps (proprietary dataset from Hellebore Capital), and on artificial time series. The paper is available there: http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf
Affine cascade models for term structure dynamics of sovereign yield curvesLAURAMICHAELA
Rafael Serrano profesor de la Universidad del Rosario
Resumen:
In the first part of the talk, I will present an introduction to stochastic affine short rate models for term structure of yield curves In the second part, I will focus on a recursive affine cascade with persistent factors for which the number of parameters, under specifications, is invariant to the size of the state space and converges to a stochastic limit as the number of factors goes to infinity. The cascade construction thereby overcomes dimensionality difficulties associated with general affine models. We contrast two specfifications of the model using linear Kalman filter for a panel of Colombian sovereign yields.
Vertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale GraphsUniversidade de São Paulo
Inference problems on networks and their algorithms were always important subjects, but more so now with so much data available and so little time to make sense of it.
Common applications range from product recommendation to social networks and protein interaction.
One of the main inferences in this types of networks is the guilty-by-association method, where labeled nodes propagate their information throughout the network, towards unlabeled nodes.
While there is a widely used algorithm for this context, called Belief Propagation, it lacks the necessary convergence guarantees for loopy-networks.
More recently, a new alternative method was proposed, called LinBP and while it solved the convergence issue, the scalability for large graphs that do not fit memory remains a challenge.
Additionally, most works that try to use BP considering large scale graphs rely on specific infrastructure such as supercomputers and computational clusters.
Therefore we propose a new algorithm, that leverages state-of-the-art asynchronous vertex-centric parallel processing techniques in conjunction with the state-of-the-art BP alternative LinBP, to provide a scalable framework for large graph inference that runs on a single commodity machine.
Our results show that our algorithm is up to 200 times faster than LinBP's SQL implementation on tested networks, while achieving the same accuracy rate.
We also show that due to the asynchronous processing, our algorithm actually needs less iterations to converge when compared to LinBP when using the same parameters.
Finally, we believe that our methodology highlights the yet not fully explored parallelism available on commodity machines, leaning towards a more cost-efficient computational paradigm.
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/xlTSk72QHbs
Speaker's Bio:
Leland Wilkinson is Chief Scientist at H2O and Adjunct Professor of Computer Science at the University of Illinois Chicago. He received an A.B. degree from Harvard in 1966, an S.T.B. degree from Harvard Divinity School in 1969, and a Ph.D. from Yale in 1975. Wilkinson wrote the SYSTAT statistical package and founded SYSTAT Inc. in 1984. After the company grew to 50 employees, he sold SYSTAT to SPSS in 1994 and worked there for ten years on research and development of visualization systems. Wilkinson subsequently worked at Skytree and Tableau before joining H2O. Wilkinson is a Fellow of the American Statistical Association, an elected member of the International Statistical Institute, and a Fellow of the American Association for the Advancement of Science. He has won the best speaker award at the National Computer Graphics Association and the Youden prize for best expository paper in the statistics journal Technometrics. He has served on the Committee on Applied and Theoretical Statistics of the National Research Council and is a member of the Boards of the National Institute of Statistical Sciences (NISS) and the Institute for Pure and Applied Mathematics (IPAM). In addition to authoring journal articles, the original SYSTAT computer program and manuals, and patents in visualization and distributed analytic computing, Wilkinson is the author (with Grant Blank and Chris Gruber) of Desktop Data Analysis with SYSTAT. He is also the author of The Grammar of Graphics, the foundation for several commercial and open-source visualization systems (IBM RAVE, Tableau, R's ggplot2, and Python's Bokeh).
Presented at #H2OWorld 2017 in Mountain View, CA.
Enjoy the video: https://youtu.be/bas3-Ue2qxc.
Learn more about H2O.ai: https://www.h2o.ai/.
Follow @h2oai: https://twitter.com/h2oai.
- - -
Abstract:
Auto Visualization involves the problem of producing meaningful graphics when presented with data. Relevant to this task are the strategies that expert statisticians and data analysts use to gain insights through visualization, as well as the portfolio of diagnostic methods devised by statisticians in the last 50 years. While some researchers and companies may claim to do automatic visualization, the problem is much deeper than simply producing collections of histograms, bar charts, and scatterplots. The deeper problem is what subset of these graphics is critical to recognizing anomalies, outliers, unusual distributions, missing values, and so on. This talk will cover aspects of this deeper problem and will introduce H2O software that implements some of these algorithms.
Many computer programs and software systems used in the interpretation of forensic evidence have as their output Bayes factors, also commonly referred to as likelihood ratios. For example, it is not unusual to see it reported that the DNA recovered at the crime scene is a million times more likely under the assumption that the defendant is a contributor to the crime stain than under the assumption that the defendant is not a contributor. In this talk we summarize existing approaches for examining the validity of likelihood ratio systems and discuss a new statistical methodology, based on generalized fiducial inference, for empirically examining the validity of such likelihood ratio assessments. We illustrate our approach by examining LR values calculated with one or more widely available data sets.
Joint work with Hari Iyer at the National Institute of Standards and Technology.
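For readers new to the quantity being validated: a likelihood ratio compares the probability of the same piece of evidence under two competing hypotheses. A deliberately simple Gaussian toy, with all numbers invented for illustration (real forensic LR systems are far more elaborate):

```python
from math import exp, sqrt, pi

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def likelihood_ratio(x, mu_prosecution=0.0, mu_defense=3.0, sigma=1.0):
    """LR > 1: evidence favors the prosecution model; LR < 1: the defense."""
    return normal_pdf(x, mu_prosecution, sigma) / normal_pdf(x, mu_defense, sigma)
```

Validating an LR system amounts to checking that such reported ratios are well calibrated against what the data actually support, which is what the fiducial approach in the talk addresses.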
Identification of Outliers in Time Series Data via Simulation Study, by iosrjce
IOSR Journal of Mathematics (IOSR-JM) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of mathematics and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in mathematics. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
Measures of dispersion are of two broad types, absolute and graphical, each with further subtypes.
In this slide deck, the points discussed are:
1. Dispersion and its types
2. Definition
3. Use
4. Merits
5. Demerits
6. Formula and math
7. Graphs and pictures
8. Real-life applications
Direct use of hydroclimatic information for reservoir operation, by Andrea Castelletti
Direct use of hydroclimatic information for reservoir operation. Plenary talk at the conference "From operational hydrological forecast to reservoir management optimization", Québec City, Québec, Canada. http://acrhrta2014.ouranos.ca/program.html
Distribution of Estimates (Linear Regression Model), by madlynplamondon
Distribution of Estimates
Linear Regression Model
Assume (yt, xt) are independent and identically distributed and E(xtet) = 0
Estimation Consistency
The estimates approach the true values as the sample size increases.
Estimation variance decreases as the sample size increases.
Illustration of Consistency
Take a random sample of U.S. men
Estimate a linear regression of log(wages) on education
Total sample = 9089
Start with 100 observations, and sequentially increase the sample size until the final regression uses the whole 9089.
Sequence of Slope Coefficients
Asymptotic Normality
Illustration of Asymptotic Normality
Time Series
Do these results apply to time-series data?
Consistency
Asymptotic Normality
Variance Formula
Time-series models
AR models, i.e., xt = yt-1
Trend and seasonal models
One-step and multi-step forecasting
Derivation of Variance Formula
For simplicity:
Assume the variables have zero mean.
The regression has no intercept.
Model with no intercept: y_t = β x_t + e_t.
OLS minimizes the sum of squares S(β) = Σ_t (y_t − β x_t)².
The first-order condition is Σ_t x_t (y_t − β̂ x_t) = 0.
Solution: β̂ = (Σ_t x_t y_t) / (Σ_t x_t²).
Now substitute y_t = β x_t + e_t.
We have β̂ = β + (Σ_t x_t e_t) / (Σ_t x_t²).
The denominator divided by T is the sample variance σ̂_x² (when x has mean zero), so
β̂ − β = (T⁻¹ Σ_t x_t e_t) / σ̂_x².
Then √T (β̂ − β) ≈ (T^(−1/2) Σ_t v_t) / σ_x², where v_t = x_t e_t.
Since E(v_t) = E(x_t e_t) = 0, then var(T^(−1/2) Σ_t v_t) = T⁻¹ [Σ_t var(v_t) + Σ_{t≠j} cov(v_t, v_j)].
From the covariance formula: when the observations are independent, the covariances cov(v_t, v_j) are zero.
And since var(v_t) = E(x_t² e_t²) = Ω,
we obtain var(√T (β̂ − β)) ≈ Ω / σ_x⁴.
We have found √T (β̂ − β) →_d N(0, Ω / σ_x⁴),
as stated at the beginning.
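The closed-form OLS solution above is easy to verify numerically (simulated data; β = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta = 500, 2.0
x = rng.standard_normal(T)      # zero-mean regressor
e = rng.standard_normal(T)      # regression error, E(x_t e_t) = 0
y = beta * x + e

beta_hat = (x @ y) / (x @ x)    # closed-form no-intercept OLS
# Same answer from a generic least-squares solver:
beta_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
```

With T = 500, the standard error is about 1/sqrt(T) ≈ 0.045, so the estimate lands close to the true β.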
Extension to Time-Series
The only place in this argument where we used the assumption of independent observations was to show that v_t = x_t e_t has zero covariance with v_j = x_j e_j.
This says that v_t is not autocorrelated.
Unforecastable one-step errors
In one-step-ahead forecasting, if the regression error is unforecastable, then v_t is not autocorrelated.
In this case, the variance formula for the least-squares estimate is var(√T (β̂ − β)) ≈ Ω / σ_x⁴, as before.
Why is this true?
The error is unforecastable if E(e_t | x_t, x_{t−1}, e_{t−1}, …) = 0.
For simplicity, suppose that x_t = 1.
Then for t > j, cov(v_t, v_j) = E(e_t e_j) = E(E(e_t | e_j) e_j) = 0.
Summary
In one-step-ahead time-series models, if the error is unforecastable, then least-squares estimates satisfy the asymptotic (approximate) distribution β̂ ≈ N(β, Ω / (T σ_x⁴)).
As the sample size T is in the denominator, the variance decreases as the sample size increases.
This means that least squares is consistent.
Variance Formula
The variance formula for the least-squares estimate takes the form var(β̂) ≈ Ω / (T σ_x⁴).
This formula is valid in time-series regression when the error is unforecastable.
Classical Variance Formula
If we make the simplifying assumption E(e_t² | x_t) = σ², then Ω = E(x_t² e_t²) = σ² σ_x², and the formula simplifies to var(β̂) ≈ σ² / (T σ_x²).
Homoskedasticity
The variance simplification is valid under "conditional homoskedasticity".
This is a simplifying assumption made to ease calculations, and it is conventional in introductory econometrics courses.
It is not used in serious econometrics.
Variance Formula: AR(1) Model
Take the AR(1) model y_t = α y_{t−1} + e_t with unforecastable homoskedastic errors.
Then the variance of the OLS estimate is var(α̂) ≈ σ² / (T σ_y²),
since in this model x_t = y_{t−1} and thus σ_x² = σ_y².
AR(1) Asymptotic Variance
We know that the stationary variance is σ_y² = σ² / (1 − α²).
So var(α̂) ≈ σ² (1 − α²) / (T σ²).
The asymptotic variance is therefore var(α̂) ≈ (1 − α²) / T.
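A quick Monte Carlo check of the AR(1) formula (α = 0.5 chosen arbitrarily): the sampling variance of the OLS estimate over many simulated paths should be close to (1 − α²)/T.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, T, reps = 0.5, 200, 2000

estimates = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(T + 1)
    y = np.empty(T + 1)
    y[0] = e[0] / np.sqrt(1 - alpha**2)    # start at the stationary law
    for t in range(1, T + 1):
        y[t] = alpha * y[t - 1] + e[t]
    x, y_next = y[:-1], y[1:]
    estimates[r] = (x @ y_next) / (x @ x)  # OLS of y_t on y_{t-1}

theory = (1 - alpha**2) / T                # asymptotic variance formula
empirical = estimates.var()
```

Small-sample bias pulls the mean of the estimates slightly below α, but the variance matches the asymptotic formula well at T = 200.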
Similar to Clustering CDS: algorithms, distances, stability and convergence rates
Using Large Language Models in 10 Lines of Code, by Gautier Marti
Modern NLP models can be daunting: No more bag-of-words but complex neural network architectures, with billions of parameters. Engineers, financial analysts, entrepreneurs, and mere tinkerers, fear not! You can get started with as little as 10 lines of code.
Presentation prepared for the Abu Dhabi Machine Learning Meetup Season 3 Episode 3 hosted at ADGM in Abu Dhabi.
... two decades of correlation, hierarchies, networks and clustering in financial markets
Summary of some of my past research work at Complex Networks 2022.
The study of correlations, hierarchies, networks and communities (or clustering) has more than 20 years of history in econophysics.
However, for the practitioner, it seems that these tools are not fully ready yet:
Many questions around their proper use for trading or risk monitoring are left unanswered.
Deep Learning might help solve some hard problems such as finding more reliably communities (or clusters) and their number.
Running large simulations (based on GANs, VAEs or realistic market simulators) could also help understand when complex networks methods give wrong insights (e.g., not enough data, insufficient stationarity, or too-low correlations).
Conference: Complex Networks 2022 in Palermo, Sicily, Italy.
A quick demo of Top2Vec, with application on 2020 10-K business descriptions, by Gautier Marti
A short presentation I did at the Hong Kong Machine Learning Meetup Season 4 Episode 4. Top2Vec is a novel method to find topics in a corpus of documents. It can automatically find a relevant number of topics in the corpus. Besides, you also get relevant word and document vectors for further processing.
cCorrGAN: Conditional Correlation GAN for Learning Empirical Conditional Distributions, by Gautier Marti
A Generative Adversarial Networks model to generate realistic correlation matrices. In these slides, we discuss a use case in quantitative finance (comparison of risk-based portfolio allocation methods), and how to improve the seminal model with information geometry (Riemannian neural networks suited for correlation matrices). There are many use cases to explore within, and outside, quantitative finance. The Riemannian geometry of correlation matrices is still under-developed.
We highlight exciting problems at the intersection of Riemannian geometry and deep learning.
How deep generative models can help quants reduce the risk of overfitting?, by Gautier Marti
How can deep generative models help quants reduce the risk of overfitting? Applications of GANs for quants.
Presentation at the "QuantUniversity Autumn School 2020".
Generating Realistic Synthetic Data in Finance, by Gautier Marti
Talk at an IHS Markit webinar (15 October 2020) on the potential applications of GANs in Finance. These models could be useful for quants and their managers to avoid over-fitting, for portfolio and risk managers for proper capital and risk allocation, for cloud computing services willing to work with banks and other organizations rich in sensitive data, for auditors and regulators to detect fraud, and for data vendors (such as IHS Markit) to bring new products to market and iterate quickly with clients.
This presentation highlights potential use cases of deep generative models, and Generative Adversarial Networks (GANs) in particular, in Finance. Essentially, these models are useful to generate realistic synthetic datasets. Quantitative strategists, traders, asset and risk managers can find these novel techniques useful. Auditors and regulators should also become aware of their existence, as they may be a source of new accounting frauds and misleading financial statements (deepfakes).
My recent attempts at using GANs for simulating realistic stock returns, by Gautier Marti
A presentation for the Hong Kong Machine Learning meetup summarizing my hobby research over the past year. My goal is to be able to simulate realistic multivariate financial time series. If I succeed, I will be able to compare different statistical methods for portfolio construction, study complex networks, do algorithmic trading, apply reinforcement learning, etc. Still far from achieved...
Takeaways from ICML 2019, Long Beach, California, by Gautier Marti
A few slides that highlight some of my personal takeaways from the ICML 2019 conference. I tried to identify niche trends such as Shapley values, topological data analysis, Hawkes processes...
On Clustering Financial Time Series - Beyond Correlation, by Gautier Marti
Financial correlation matrices are noisy: most of their coefficients are meaningless. Random Matrix Theory (RMT) suggests that the intrinsic dimension is much lower than O(N²). Clustering can help reduce the dimension, but it can also leverage information other than mere correlation...
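The RMT argument can be sketched in a few lines: for pure noise, correlation-matrix eigenvalues fall inside the Marchenko-Pastur bulk [(1 − √(N/T))², (1 + √(N/T))²]; eigenvalues escaping the bulk carry real structure.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 50, 500                      # assets, observations
q = N / T
lam_lo, lam_hi = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2

# Pure-noise "returns": (almost) all eigenvalues inside the MP bulk.
noise = rng.standard_normal((T, N))
eig_noise = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))
frac_in_bulk = np.mean((eig_noise >= lam_lo) & (eig_noise <= lam_hi))

# Add a common "market" factor: one eigenvalue escapes far above the bulk.
market = rng.standard_normal((T, 1))
returns = noise + 2.0 * market
eig_signal = np.linalg.eigvalsh(np.corrcoef(returns, rowvar=False))
```

Filtering the bulk and keeping only escaping eigenvalues (or, alternatively, clustering) is one standard way to denoise the correlation matrix before using it.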
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2..., by pchutichetpong
M Capital Group ("MCG") expects demand to keep growing and the supply landscape to evolve, as institutional investment rotates out of offices and into work-from-home ("WFH") infrastructure, while the need for data storage keeps expanding with global internet usage; experts predict 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, with the industry expected to see strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has made key adjustments; MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Opendatabay - Open Data Marketplace, by Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Clustering CDS: algorithms, distances, stability and convergence rates
1. HELLEBORECAPITAL
Introduction
The standard methodology
Exploring dependence between returns
Copula-based dependence coefficients (clustering distances)
Empirical convergence rates
Beyond dependence: a (copula,margins) representation
Clustering CDS: algorithms, distances, stability and convergence rates
CMStatistics 2016, University of Seville, Spain
Gautier Marti, Frank Nielsen, Philippe Donnat
HELLEBORECAPITAL
December 9, 2016
Gautier Marti, Clustering CDS: algorithms, distances, stability and convergence rates
1 Introduction
2 The standard methodology
3 Exploring dependence between returns
4 Copula-based dependence coefficients (clustering distances)
5 Empirical convergence rates
6 Beyond dependence: a (copula,margins) representation
Introduction
Goal: finding groups of 'homogeneous' assets that can help to:
• build alternative measures of risk,
• elaborate trading strategies…
But we need high confidence in these clusters (networks).
So we need appropriate AND fast-converging methodologies [8]:
• to be consistent yet efficient (bias–variance tradeoff),
• to avoid non-stationarity of the time series (i.e. too large a sample).
A good model selection criterion:
the minimum sample size needed to reach a given 'accuracy'.
The standard methodology - description
The methodology widely adopted in empirical studies: [7].
Let $N$ be the number of assets.
Let $P_i(t)$ be the price at time $t$ of asset $i$, $1 \leq i \leq N$.
Let $r_i(t)$ be the log-return at time $t$ of asset $i$:
$$r_i(t) = \log P_i(t) - \log P_i(t-1).$$
For each pair $i, j$ of assets, compute their correlation:
$$\rho_{ij} = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sqrt{\left(\langle r_i^2 \rangle - \langle r_i \rangle^2\right)\left(\langle r_j^2 \rangle - \langle r_j \rangle^2\right)}}.$$
Convert the correlation coefficients $\rho_{ij}$ into distances:
$$d_{ij} = \sqrt{2(1 - \rho_{ij})}.$$
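As an illustration, the pipeline above (log-returns, then Pearson correlations, then distances) can be sketched in a few lines of numpy; the price series here are simulated, not real market data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated prices for N = 4 assets over 251 days (hypothetical data, not CDS quotes).
prices = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=(251, 4)), axis=0))

# Log-returns: r_i(t) = log P_i(t) - log P_i(t - 1).
returns = np.diff(np.log(prices), axis=0)

# Pairwise Pearson correlations rho_ij ...
rho = np.corrcoef(returns, rowvar=False)

# ... converted into distances d_ij = sqrt(2 (1 - rho_ij));
# np.maximum guards against tiny negative round-off under the square root.
dist = np.sqrt(np.maximum(2.0 * (1.0 - rho), 0.0))
```

Note that $d_{ij}$ ranges from 0 (perfect positive correlation) to 2 (perfect negative correlation), which is what makes it usable as a clustering distance.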
The standard methodology - description
From all the distances $d_{ij}$, compute a minimum spanning tree:
Figure: A minimum spanning tree of stocks (from [1]); stocks from the same industry (represented by color) tend to cluster together.
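This step is a one-liner with scipy; a minimal sketch on a toy distance matrix (the values below are made up to form two tight pairs of assets):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Toy symmetric distance matrix for 4 assets (hypothetical values):
# assets 0-1 and 2-3 form two tight pairs, loosely connected to each other.
d = np.array([[0.0, 0.5, 1.2, 1.4],
              [0.5, 0.0, 1.1, 1.3],
              [1.2, 1.1, 0.0, 0.4],
              [1.4, 1.3, 0.4, 0.0]])

# The MST keeps the N - 1 shortest edges that connect all assets without a cycle.
mst = minimum_spanning_tree(d)
edges = {tuple(sorted(map(int, e))) for e in np.transpose(mst.nonzero())}
```

The two cheap edges (0, 1) and (2, 3) always survive; the single remaining edge bridges the two pairs, which is exactly the chaining behaviour criticized on the next slide.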
The standard methodology - limitations
• MST clustering is equivalent to Single Linkage clustering:
• chaining phenomenon,
• not stable to noise / small perturbations [11].
• Use of the Pearson correlation:
• can be 0 even when the variables are strongly dependent,
• not invariant under monotone transformations of the variables,
• not robust to outliers.
Is it still relevant for financial time series? stocks? CDS?
Copulas
Sklar's Theorem [13]
For $(X_i, X_j)$ having continuous marginal cdfs $F_{X_i}, F_{X_j}$, its joint cumulative distribution $F$ is uniquely expressed as
$$F(X_i, X_j) = C(F_{X_i}(X_i), F_{X_j}(X_j)),$$
where $C$ is known as the copula of $(X_i, X_j)$.
The copula's uniform marginals jointly encode all the dependence.
From ranks to empirical copula
$r_i, r_j$ are the rank statistics of $X_i, X_j$ respectively, i.e. $r_i^t$ is the rank of $X_i^t$ in $\{X_i^1, \ldots, X_i^T\}$: $r_i^t = \sum_{k=1}^{T} \mathbf{1}\{X_i^k \leq X_i^t\}$.
Deheuvels' empirical copula [3]
Any copula $\hat{C}$ defined on the lattice $\mathcal{L} = \{(\frac{t_i}{T}, \frac{t_j}{T}) : t_i, t_j = 0, \ldots, T\}$ by
$$\hat{C}\left(\frac{t_i}{T}, \frac{t_j}{T}\right) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{r_i^t \leq t_i,\; r_j^t \leq t_j\}$$
is an empirical copula.
$\hat{C}$ is a consistent estimator of $C$, with uniform convergence [4].
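Deheuvels' estimator is straightforward to compute from rank statistics; a small sketch on simulated data (the correlated pair below is purely illustrative):

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(1)
T = 500
x = rng.normal(size=T)
y = 0.8 * x + 0.6 * rng.normal(size=T)   # an illustrative correlated pair

# Rank statistics: r^t = #{k : X^k <= X^t}.
rx, ry = rankdata(x), rankdata(y)

def empirical_copula(u, v):
    """C_hat(u, v) = (1/T) #{t : r_x^t <= uT and r_y^t <= vT} (Deheuvels)."""
    return np.mean((rx <= u * T) & (ry <= v * T))
```

The boundary behaviour matches a genuine copula: $\hat{C}(1,1) = 1$ and $\hat{C}(u, 0) = 0$, and the Fréchet upper bound $\hat{C}(u,v) \leq \min(u,v)$ holds by construction.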
Clustering of bivariate empirical copulas
Generate the $\binom{N}{2}$ bivariate empirical copulas
Find clusters of copulas using optimal transport [10, 9]
Compute and display the clusters' centroids [2]
Some code is available at www.datagrapple.com/Tech.
Copula-centers for stocks (CAC 40)
Figure: Stocks: more mass in the bottom-left corner, i.e. lower tail dependence. Stock prices tend to plummet together.
Copula-centers for Credit Default Swaps (XO index)
Figure: Credit default swaps: more mass in the top-right corner, i.e. upper tail dependence. The cost of insuring against entities' default tends to soar in stressed markets.
Dependence as relative distances between copulas
$C$ copula of $(X_i, X_j)$; $|u - v|/\sqrt{2}$ is the distance from $(u, v)$ to the diagonal.
Spearman's $\rho_S$:
$$\rho_S(X_i, X_j) = 12 \int_0^1 \int_0^1 (C(u, v) - uv)\, du\, dv = 1 - 6 \int_0^1 \int_0^1 (u - v)^2\, dC(u, v)$$
Many correlation coefficients can be expressed as distances to the Fréchet–Hoeffding bounds or to the independence copula [6]. Some are explicitly built this way (e.g. [12, 5, 9]).
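The second expression has a well-known sample analogue on ranks, $\rho_S = 1 - 6\sum_t (r_i^t - r_j^t)^2 / (T(T^2-1))$, which can be checked against scipy on simulated data:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(2)
T = 1000
x = rng.normal(size=T)
y = x**3 + 0.1 * rng.normal(size=T)   # monotone link plus a little noise

# Sample analogue of rho_S = 1 - 6 int (u - v)^2 dC(u, v), written on ranks.
rx, ry = rankdata(x), rankdata(y)
rho_ranks = 1 - 6 * np.sum((rx - ry) ** 2) / (T * (T**2 - 1))

# Coincides with scipy's estimator (no ties for continuous data).
rho_scipy = spearmanr(x, y)[0]
```

Unlike Pearson correlation, the value is essentially unchanged by the monotone transform $x \mapsto x^3$, illustrating the invariance property the previous slides call for.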
A metric space for copulas: Optimal Transport
The Target/Forget Dependence Coefficient (TFDC)
Now, we can define our bespoke dependence coefficient:
Build the forget-dependence copulas $\{C_l^F\}_l$
Build the target-dependence copulas $\{C_k^T\}_k$
Compute the empirical copula $C_{ij}$ from $x_i, x_j$
$$\mathrm{TFDC}(C_{ij}) = \frac{\min_l D(C_l^F, C_{ij})}{\min_l D(C_l^F, C_{ij}) + \min_k D(C_{ij}, C_k^T)}$$
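A minimal sketch of the TFDC construction. The authors use an optimal transport distance $D$ between copulas; purely to keep this self-contained, the stand-in below uses a Euclidean distance between binned copula histograms, and the target/forget sets (comonotonicity vs. independence) are just one illustrative configuration — neither choice should be read as the paper's actual setup.

```python
import numpy as np

def copula_hist(u, v, bins=10):
    """Bin pseudo-observations (u, v) in [0,1]^2 into a normalized 2D histogram."""
    h, _, _ = np.histogram2d(u, v, bins=bins, range=[[0, 1], [0, 1]])
    return h / h.sum()

def tfdc(c, targets, forgets):
    """TFDC = min_l D(C_l^F, C) / (min_l D(C_l^F, C) + min_k D(C, C_k^T))."""
    dist = lambda a, b: np.linalg.norm(a - b)   # stand-in for optimal transport
    df = min(dist(f, c) for f in forgets)
    dt = min(dist(c, t) for t in targets)
    return df / (df + dt)

rng = np.random.default_rng(3)
T = 2000
z = rng.uniform(size=T)

# Target: comonotonicity (perfect positive dependence); forget: independence.
target = copula_hist(z, z)
forget = copula_hist(rng.uniform(size=T), rng.uniform(size=T))

score_dep = tfdc(copula_hist(z, z), [target], [forget])   # comonotonic pair: near 1
score_ind = tfdc(copula_hist(rng.uniform(size=T), rng.uniform(size=T)),
                 [target], [forget])                      # independent pair: near 0
```

The coefficient lands near 1 when the empirical copula resembles a target and near 0 when it resembles a forgotten pattern, which is the behaviour the definition above encodes.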
Spearman vs. TFDC
Figure: Empirical copulas for $(X, Y)$ where $X = Z\,\mathbf{1}\{Z < a\} + X'\,\mathbf{1}\{Z > a\}$, $Y = Z\,\mathbf{1}\{Z < a + 0.25\} + Y'\,\mathbf{1}\{Z > a + 0.25\}$, $a = 0, 0.05, \ldots, 0.95, 1$, where $Z$ is uniform on $[0, 1]$ and $X', Y'$ are independent noises (left). TFDC and Spearman coefficients estimated between $X$ and $Y$ as a function of the discontinuity position $a$ (right).
For $a = 0.75$, the Spearman coefficient yields a negative value, yet $X = Y$ over $[0, a]$.
Process: Recovering a simulated ground-truth [8]
A simulation & benchmark process that needs to be refined:
Extract (using a large sample) a filtered correlation matrix R
Generate samples of size T = 10, …, 20, … from a relevant distribution (parameterized by R)
Compute the ratio of the number of correct clusterings obtained over the number of trials, as a function of T
Figure: Empirical rates of convergence (score vs. sample size, from 100 to 500) for Single Linkage, Average Linkage and Ward, each under four settings: Gaussian - Pearson, Gaussian - Spearman, Student - Pearson, Student - Spearman.
A full comparative study will be posted online at www.datagrapple.com/Tech.
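The benchmark loop described above can be sketched end to end; everything below (the two-block correlation matrix, the 0.7/0.1 correlation levels, Average Linkage, 50 trials) is an illustrative toy setup, not the paper's actual protocol.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(4)

# Toy ground truth R: two blocks of 5 assets, within-block correlation 0.7,
# cross-block 0.1 (hypothetical values, not an estimated filtered matrix).
N, K = 10, 2
R = np.full((N, N), 0.1)
R[:5, :5] = R[5:, 5:] = 0.7
np.fill_diagonal(R, 1.0)
truth = np.repeat([0, 1], 5)

def recovery_ratio(T, trials=50):
    """Fraction of trials where Average Linkage recovers the 2 blocks from T samples."""
    hits = 0
    for _ in range(trials):
        X = rng.multivariate_normal(np.zeros(N), R, size=T)
        rho = np.corrcoef(X, rowvar=False)
        d = squareform(np.sqrt(np.maximum(2 * (1 - rho), 0)), checks=False)
        labels = fcluster(linkage(d, method="average"), K, criterion="maxclust")
        # exact recovery, up to a relabeling of the clusters
        hits += len(set(labels)) == K and len(set(zip(labels, truth))) == K
    return hits / trials

r_small, r_large = recovery_ratio(10), recovery_ratio(200)
```

As on the slide, the recovery ratio improves with the sample size T, and the sample size at which it saturates is the model-selection quantity of interest.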
ON CLUSTERING FINANCIAL TIME SERIES
GAUTIER MARTI, PHILIPPE DONNAT AND FRANK NIELSEN
NOISY CORRELATION MATRICES
Let $X$ be the matrix storing the standardized returns of $N = 560$ assets (credit default swaps) over a period of $T = 2500$ trading days.
Then, the empirical correlation matrix of the returns is
$$C = \frac{1}{T} X X^\top.$$
We can compute the empirical density of its eigenvalues
$$\rho(\lambda) = \frac{1}{N} \frac{dn(\lambda)}{d\lambda},$$
where $n(\lambda)$ counts the number of eigenvalues of $C$ less than $\lambda$.
From random matrix theory, the Marchenko–Pastur distribution gives the limit distribution as $N \to \infty$, $T \to \infty$ with $T/N$ fixed. It reads:
$$\rho(\lambda) = \frac{T/N}{2\pi} \frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda},$$
where $\lambda_{\max/\min} = 1 + N/T \pm 2\sqrt{N/T}$, and $\lambda \in [\lambda_{\min}, \lambda_{\max}]$.
Figure 1: Marchenko–Pastur density vs. empirical density of the correlation matrix eigenvalues
Notice that the Marchenko–Pastur density fits the empirical density well, meaning that most of the information contained in the empirical correlation matrix amounts to noise: only 26 eigenvalues are greater than $\lambda_{\max}$.
The highest eigenvalue corresponds to the 'market'; the 25 others can be associated with 'industrial sectors'.
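The Marchenko–Pastur bounds are easy to verify on pure noise; the sketch below uses toy sizes (N = 100, T = 500) rather than the poster's N = 560, T = 2500, and i.i.d. Gaussian noise in place of real returns.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 100, 500   # toy sizes (the poster uses N = 560, T = 2500)

# Pure-noise "returns": every eigenvalue should fall inside the MP bulk.
X = rng.normal(size=(T, N))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
C = X.T @ X / T                            # empirical correlation matrix
eig = np.linalg.eigvalsh(C)

# Edges of the Marchenko-Pastur bulk: 1 + N/T +- 2 sqrt(N/T).
q = N / T
lmin, lmax = 1 + q - 2 * np.sqrt(q), 1 + q + 2 * np.sqrt(q)
inside = np.mean((eig > lmin - 0.1) & (eig < lmax + 0.1))
```

On real CDS returns, by contrast, the eigenvalues escaping above $\lambda_{\max}$ are precisely the ones carrying market and sector structure.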
CLUSTERING TIME SERIES
Given a correlation matrix of the returns,
Figure 2: An empirical and noisy correlation matrix
one can re-order assets using a hierarchical clustering algorithm to make the hierarchical correlation pattern apparent,
Figure 3: The same noisy correlation matrix re-ordered by a hierarchical clustering algorithm
and finally filter the noise according to the correlation pattern:
Figure 4: The resulting filtered correlation matrix
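The re-ordering step can be sketched with scipy's dendrogram leaf order (a simple form of matrix seriation); the block-structured matrix below is a synthetic stand-in for the empirical one.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(6)
N = 30

# Synthetic block correlation matrix, with the asset order shuffled away.
R = np.full((N, N), 0.1)
R[:15, :15] = R[15:, 15:] = 0.6
np.fill_diagonal(R, 1.0)
perm = rng.permutation(N)
R_shuffled = R[np.ix_(perm, perm)]

# Re-order assets along the dendrogram's leaf order to reveal the blocks.
d = squareform(np.sqrt(2 * (1 - R_shuffled)), checks=False)
order = leaves_list(linkage(d, method="average"))
R_ordered = R_shuffled[np.ix_(order, order)]

# After re-ordering, the leading 15x15 diagonal block is one homogeneous cluster.
block_mean = R_ordered[:15, :15][np.triu_indices(15, k=1)].mean()
```

After the permutation the block pattern is invisible in `R_shuffled`, but `R_ordered` shows it again, which is exactly the Figure 2 to Figure 3 transformation.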
BEYOND CORRELATION
Sklar's Theorem. For any random vector $X = (X_1, \ldots, X_N)$ having continuous marginal cumulative distribution functions $F_i$, its joint cumulative distribution $F$ is uniquely expressed as
$$F(X_1, \ldots, X_N) = C(F_1(X_1), \ldots, F_N(X_N)),$$
where $C$, the multivariate distribution with uniform marginals, is known as the copula of $X$.
Figure 5: ArcelorMittal and Société générale prices are projected on dependence ⊕ distribution space; notice their
heavy-tailed exponential distribution.
Let $\theta \in [0, 1]$. Let $(X, Y) \in \mathcal{V}^2$. Let $G = (G_X, G_Y)$, where $G_X$ and $G_Y$ are respectively the marginal cdfs of $X$ and $Y$. We define the following distance:
$$d_\theta^2(X, Y) = \theta\, d_1^2(G_X(X), G_Y(Y)) + (1 - \theta)\, d_0^2(G_X, G_Y),$$
where $d_1^2(G_X(X), G_Y(Y)) = 3\,\mathbb{E}\left[|G_X(X) - G_Y(Y)|^2\right]$, and $d_0^2(G_X, G_Y) = \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{dG_X}{d\lambda}} - \sqrt{\frac{dG_Y}{d\lambda}} \right)^2 d\lambda.$$
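A sample-based sketch of $d_\theta$: $d_1$ is estimated from normalized ranks, and $d_0$ as a Hellinger distance between histogram densities on a shared grid. The bin count and the simulated data are illustrative assumptions, not the authors' estimation choices.

```python
import numpy as np
from scipy.stats import rankdata

def d_theta(x, y, theta=0.5, bins=50):
    """Sample sketch of d_theta^2 = theta d_1^2 + (1 - theta) d_0^2."""
    T = len(x)
    # d_1^2 = 3 E[|G_X(X) - G_Y(Y)|^2], estimated with normalized ranks.
    u, v = rankdata(x) / T, rankdata(y) / T
    d1_sq = 3 * np.mean((u - v) ** 2)
    # d_0^2 = (1/2) int (sqrt(dG_X/dl) - sqrt(dG_Y/dl))^2 dl, i.e. a squared
    # Hellinger distance between marginal densities, here from histograms.
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    gx, edges = np.histogram(x, bins=bins, range=(lo, hi), density=True)
    gy, _ = np.histogram(y, bins=bins, range=(lo, hi), density=True)
    d0_sq = 0.5 * np.sum((np.sqrt(gx) - np.sqrt(gy)) ** 2 * np.diff(edges))
    return np.sqrt(theta * d1_sq + (1 - theta) * d0_sq)

rng = np.random.default_rng(7)
x = rng.normal(size=5000)
d_same = d_theta(x, x)                                  # identical series: 0
d_diff = d_theta(x, rng.normal(3.0, 1.0, size=5000))    # shifted marginal: d_0 > 0
```

Setting $\theta$ interpolates between clustering on dependence only ($\theta = 1$) and on marginal distributions only ($\theta = 0$).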
CLUSTERING RESULTS & STABILITY
Figure 6: (Top) The returns correlation structure appears more clearly using rank correlation; (Bottom) clusters of returns distributions can be partly described by the returns' volatility (histogram of standard deviations, in basis points)
Figure 7: Stability test on odd/even trading days subsampling: our approach (GNPR) yields more stable clusters with respect to this perturbation than standard approaches (using Pearson correlation or L2 distances).
References
[1] Ricardo Coelho, Przemyslaw Repetowicz, Stefan Hutzler, and Peter Richmond. Investigation of cluster structure in the London Stock Exchange.
[2] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 685–693, 2014.
[3] Paul Deheuvels. La fonction de dépendance empirique et ses propriétés. Un test non paramétrique d'indépendance. Acad. Roy. Belg. Bull. Cl. Sci. (5), 65(6):274–292, 1979.
[4] Paul Deheuvels. A non-parametric test for independence. Publications de l'Institut de Statistique de l'Université de Paris, 26:29–50, 1981.
[5] Fabrizio Durante and Roberta Pappada. Cluster analysis of time series via Kendall distribution. In Strengthening Links Between Data Analysis and Soft Computing, pages 209–216. Springer, 2015.
[6] Eckhard Liebscher et al. Copula-based dependence measures. Dependence Modeling, 2(1):49–64, 2014.
[7] Rosario N. Mantegna. Hierarchical structure in financial markets. The European Physical Journal B - Condensed Matter and Complex Systems, 11(1):193–197, 1999.
[8] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Clustering financial time series: How long is enough? In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2583–2589, 2016.
[9] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Exploring and measuring non-linear correlations: Copulas, lightspeed transportation and clustering. NIPS 2016 Time Series Workshop, 55, 2016.
[10] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Optimal transport vs. Fisher-Rao distance between copulas for clustering multivariate time series. In IEEE Statistical Signal Processing Workshop, SSP 2016, Palma de Mallorca, Spain, June 26-29, 2016, pages 1–5, 2016.
[11] Gautier Marti, Philippe Very, Philippe Donnat, and Frank Nielsen. A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series. In 14th IEEE International Conference on Machine Learning and Applications, ICMLA 2015, Miami, FL, USA, December 9-11, 2015, pages 32–37, 2015.
[12] Barnabás Póczos, Zoubin Ghahramani, and Jeff G. Schneider. Copula-based kernel dependency measures. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
[13] A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Université Paris 8, 1959.