# Clustering CDS: algorithms, distances, stability and convergence rates

Talk given at CMStatistics 2016 (http://cmstatistics.org/CMStatistics2016/).
The standard methodology for clustering financial time series is brittle to outliers and heavy tails for several reasons: Single Linkage / MST suffers from the chaining phenomenon, and the Pearson correlation coefficient is well suited to Gaussian distributions, which financial returns usually are not (especially for credit derivatives). At Hellebore Capital Ltd, we strive to improve the methodology and to ground it. We think that stability is a paramount property to verify, and that it is closely linked to the statistical convergence rates of the methodologies (combinations of clustering algorithms and dependence estimators). This gives us a model selection criterion: the best clustering methodology is the one that reaches a given 'accuracy' with the minimum sample size.

Published in: Data & Analytics

### Clustering CDS: algorithms, distances, stability and convergence rates

1. Clustering CDS: algorithms, distances, stability and convergence rates
   CMStatistics 2016, University of Seville, Spain
   Gautier Marti, Frank Nielsen, Philippe Donnat
   Hellebore Capital
   December 9, 2016
2. Outline
   1 Introduction
   2 The standard methodology
   3 Exploring dependence between returns
   4 Copula-based dependence coefficients (clustering distances)
   5 Empirical convergence rates
   6 Beyond dependence: a (copula, margins) representation
3. Introduction
   Goal: finding groups of 'homogeneous' assets that can help to:
   • build alternative measures of risk,
   • elaborate trading strategies...
   But we need high confidence in these clusters (networks). So we need appropriate AND fast-converging methodologies:
   • to be consistent yet efficient (bias–variance tradeoff),
   • to avoid non-stationarity of the time series (too large a sample).
   A good model selection criterion: the minimum sample size needed to reach a given 'accuracy'.
5. The standard methodology - description
   The methodology widely adopted in empirical studies:
   • Let $N$ be the number of assets.
   • Let $P_i(t)$ be the price at time $t$ of asset $i$, $1 \le i \le N$.
   • Let $r_i(t)$ be the log-return at time $t$ of asset $i$: $r_i(t) = \log P_i(t) - \log P_i(t-1)$.
   • For each pair $i, j$ of assets, compute their correlation:
     $$\rho_{ij} = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sqrt{\left(\langle r_i^2 \rangle - \langle r_i \rangle^2\right)\left(\langle r_j^2 \rangle - \langle r_j \rangle^2\right)}}.$$
   • Convert the correlation coefficients $\rho_{ij}$ into distances: $d_{ij} = \sqrt{2(1 - \rho_{ij})}$.
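
The three steps of the standard methodology (log-returns, Pearson correlation, correlation-to-distance map in its usual form $d_{ij} = \sqrt{2(1 - \rho_{ij})}$) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `correlation_distance` and the simulated one-factor prices are assumptions made for the example.

```python
import numpy as np

def correlation_distance(prices):
    """Map a (T+1, N) array of positive prices to the N x N matrix of distances d_ij."""
    log_returns = np.diff(np.log(prices), axis=0)   # r_i(t) = log P_i(t) - log P_i(t-1)
    rho = np.corrcoef(log_returns, rowvar=False)    # Pearson correlation rho_ij
    rho = np.clip(rho, -1.0, 1.0)                   # guard against rounding outside [-1, 1]
    return np.sqrt(2.0 * (1.0 - rho))               # d_ij = sqrt(2 (1 - rho_ij))

# Hypothetical data: 4 assets driven by one common factor plus idiosyncratic noise.
rng = np.random.default_rng(0)
T, N = 250, 4
returns = 0.01 * (0.7 * rng.normal(size=(T, 1)) + 0.3 * rng.normal(size=(T, N)))
prices = 100.0 * np.exp(np.vstack([np.zeros((1, N)), np.cumsum(returns, axis=0)]))
dist = correlation_distance(prices)
```

The resulting matrix is symmetric with a zero diagonal and entries in $[0, 2]$, so it can be passed directly to a clustering or spanning-tree routine.
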
6. The standard methodology - description
   From all the distances $d_{ij}$, compute a minimum spanning tree.
   Figure: A minimum spanning tree of stocks (from ); stocks from the same industry (represented by color) tend to cluster together.
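
A minimum spanning tree over the distances can be computed with SciPy's `minimum_spanning_tree`. The sketch below assumes a hand-written toy correlation matrix with two blocks of two assets; the helper name `mst_edges` is hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edges(dist):
    """Return the MST of a dense symmetric distance matrix as sorted (i, j, weight) tuples."""
    # Note: csgraph treats exact zeros as missing edges, so perfectly correlated
    # pairs (d_ij = 0) would need a small offset before calling this function.
    tree = minimum_spanning_tree(np.asarray(dist, dtype=float)).tocoo()
    return sorted((int(min(i, j)), int(max(i, j)), float(w))
                  for i, j, w in zip(tree.row, tree.col, tree.data))

# Toy correlation matrix (an assumption for illustration): assets {0, 1} and {2, 3}
# are strongly correlated within each pair and weakly correlated across pairs.
rho = np.array([[1.0, 0.8, 0.1, 0.2],
                [0.8, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.7],
                [0.2, 0.1, 0.7, 1.0]])
edges = mst_edges(np.sqrt(2.0 * (1.0 - rho)))
```

The tree keeps the two short within-pair edges and a single bridge between the pairs, which is the structure the figure on this slide displays at a larger scale.
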
7. The standard methodology - limitations
   • MST clustering is equivalent to Single Linkage clustering:
     • chaining phenomenon
     • not stable to noise / small perturbations
   • Use of the Pearson correlation:
     • can take value 0 even though the variables are strongly dependent
     • not invariant to monotone transformations of the variables
     • not robust to outliers
   Is it still useful for financial time series? Stocks? CDS?
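
The three limitations of the Pearson correlation listed above can be checked numerically. The sketch below uses simulated Gaussian data and arbitrary choices (the map $x \mapsto e^{3x}$, one outlier at $(10, -10)$) that are assumptions made for the illustration only.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = rng.normal(size=5000)

# (1) Zero correlation despite strong dependence: y = x^2 is a function of x.
pearson_square = pearsonr(x, x ** 2)[0]

# (2) No invariance to monotone transformations: exp(3x) is increasing in x,
#     so the ranks are unchanged, yet the Pearson coefficient drops sharply.
pearson_exp = pearsonr(x, np.exp(3 * x))[0]
spearman_exp = spearmanr(x, np.exp(3 * x))[0]

# (3) No robustness to outliers: one extreme point added to 200 correlated pairs.
xs = x[:200]
ys = xs + 0.3 * rng.normal(size=200)
xo, yo = np.append(xs, 10.0), np.append(ys, -10.0)
pearson_clean, pearson_outlier = pearsonr(xs, ys)[0], pearsonr(xo, yo)[0]
spearman_outlier = spearmanr(xo, yo)[0]
```

In this setup the rank-based Spearman coefficient is unaffected by the monotone map and only mildly affected by the outlier, which is one motivation for the copula-based view in the following slides.
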
10. Copulas
    Sklar's Theorem. For $(X_i, X_j)$ having continuous marginal cdfs $F_{X_i}, F_{X_j}$, its joint cumulative distribution $F$ is uniquely expressed as
    $$F(X_i, X_j) = C(F_{X_i}(X_i), F_{X_j}(X_j)),$$
    where $C$ is known as the copula of $(X_i, X_j)$.
    Copula's uniform marginals jointly encode all the dependence.
11. From ranks to empirical copula
    $r_i, r_j$ are the rank statistics of $X_i, X_j$ respectively, i.e. $r_i^t$ is the rank of $X_i^t$ in $\{X_i^1, \ldots, X_i^T\}$:
    $$r_i^t = \sum_{k=1}^{T} \mathbf{1}\{X_i^k \le X_i^t\}.$$
    Deheuvels' empirical copula. Any copula $\hat{C}$ defined on the lattice $L = \{(t_i/T, t_j/T) : t_i, t_j = 0, \ldots, T\}$ by
    $$\hat{C}\left(\frac{t_i}{T}, \frac{t_j}{T}\right) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{r_i^t \le t_i,\ r_j^t \le t_j\}$$
    is an empirical copula. $\hat{C}$ is a consistent estimator of $C$ with uniform convergence.
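
The rank statistics and the empirical copula on the lattice can be computed directly from their definitions. The sketch below is a straightforward $O(T^2)$ implementation written for readability; the function names are hypothetical and the sample is simulated.

```python
import numpy as np

def ranks(x):
    """r^t = sum_k 1{x^k <= x^t}, for t = 1..T."""
    x = np.asarray(x)
    return (x[None, :] <= x[:, None]).sum(axis=1)

def empirical_copula(xi, xj):
    """Return the (T+1) x (T+1) array C_hat[a, b] = C_hat(a/T, b/T)."""
    T = len(xi)
    ri, rj = ranks(xi), ranks(xj)
    grid = np.arange(T + 1)
    below_i = (ri[None, :] <= grid[:, None]).astype(float)  # below_i[a, t] = 1{r_i^t <= a}
    below_j = (rj[None, :] <= grid[:, None]).astype(float)  # below_j[b, t] = 1{r_j^t <= b}
    return below_i @ below_j.T / T                          # average of the product over t

# Simulated example: x_j is an increasing function of x_i (comonotone pair),
# so the empirical copula should equal min(a, b) / T on the lattice.
rng = np.random.default_rng(1)
xi = rng.normal(size=50)
xj = np.exp(xi)
C_hat = empirical_copula(xi, xj)
```

Because only ranks enter the computation, any increasing transformation of either margin leaves `C_hat` unchanged.
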
12. Clustering of bivariate empirical copulas
    • Generate the $\binom{N}{2}$ bivariate empirical copulas
    • Find clusters of copulas using optimal transport [10, 9]
    • Compute and display the clusters' centroids
    Some code available at www.datagrapple.com/Tech.
13. Copula-centers for stocks (CAC 40)
    Figure: Stocks: more mass in the bottom-left corner, i.e. lower tail dependence. Stock prices tend to plummet together.
14. Copula-centers for Credit Default Swaps (XO index)
    Figure: Credit default swaps: more mass in the top-right corner, i.e. upper tail dependence. The cost of insurance against entities' default tends to soar in stressed markets.
16. Dependence as relative distances between copulas
    $C$: copula of $(X_i, X_j)$; $|u - v| / \sqrt{2}$: distance from $(u, v)$ to the diagonal.
    Spearman's $\rho_S$:
    $$\rho_S(X_i, X_j) = 12 \int_0^1 \int_0^1 \left(C(u, v) - uv\right) du\, dv = 1 - 6 \int_0^1 \int_0^1 (u - v)^2\, dC(u, v)$$
    Many correlation coefficients can be expressed as distances to the Fréchet–Hoeffding bounds or to independence. Some are explicitly built this way (e.g. [12, 5, 9]).
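
The second expression for Spearman's $\rho_S$, $1 - 6 \int\!\!\int (u - v)^2\, dC(u, v)$, has a direct plug-in estimate: replace $C$ by the empirical measure on the normalized ranks. The sketch below compares that plug-in value with `scipy.stats.spearmanr` on simulated data; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_from_copula(x, y):
    """Plug-in estimate of 1 - 6 * E[(U - V)^2] under the empirical copula."""
    T = len(x)
    u = rankdata(x) / T   # normalized ranks play the role of F_X(X)
    v = rankdata(y) / T
    return 1.0 - 6.0 * np.mean((u - v) ** 2)

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = x + rng.normal(size=500)
plug_in = spearman_from_copula(x, y)
reference = spearmanr(x, y)[0]
```

Without ties the two values differ only by a factor $(T^2 - 1)/T^2$ on the $1 - \rho_S$ term, so they agree up to $O(1/T^2)$.
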
17. A metric space for copulas: Optimal Transport
18. The Target/Forget Dependence Coefficient (TFDC)
    Now, we can define our bespoke dependence coefficient:
    • Build the forget-dependence copulas $\{C_l^F\}_l$
    • Build the target-dependence copulas $\{C_k^T\}_k$
    • Compute the empirical copula $C_{ij}$ from $x_i, x_j$
    $$\mathrm{TFDC}(C_{ij}) = \frac{\min_l D(C_l^F, C_{ij})}{\min_l D(C_l^F, C_{ij}) + \min_k D(C_{ij}, C_k^T)}$$
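
The TFDC ratio can be sketched under explicit assumptions that are not taken from the slides: $D$ is taken to be a 2-Wasserstein distance between equal-size point clouds (solved exactly as an assignment problem), the only forget-dependence copula is independence (a regular grid of points), and the only target-dependence copula is the comonotone one (points on the diagonal). All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from scipy.stats import rankdata

def ot_distance(a, b):
    """2-Wasserstein distance between two equal-size, uniformly weighted point clouds."""
    cost = cdist(a, b, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one matching
    return np.sqrt(cost[rows, cols].mean())

def copula_points(x, y):
    """Support of the empirical copula: normalized rank pairs."""
    T = len(x)
    return np.column_stack([rankdata(x), rankdata(y)]) / T

def tfdc(x, y, forget, target):
    c = copula_points(x, y)
    d_forget = min(ot_distance(f, c) for f in forget)   # distance to nearest forget copula
    d_target = min(ot_distance(c, t) for t in target)   # distance to nearest target copula
    return d_forget / (d_forget + d_target)

# Assumed reference copulas, each discretized with T = m * m points.
m = 12
T = m * m
centers = (np.arange(m) + 0.5) / m
independence = np.array([(u, v) for u in centers for v in centers])
comonotone = np.column_stack([np.arange(1, T + 1), np.arange(1, T + 1)]) / T

rng = np.random.default_rng(3)
x = rng.normal(size=T)
score_dependent = tfdc(x, np.exp(x), [independence], [comonotone])
score_independent = tfdc(x, rng.normal(size=T), [independence], [comonotone])
```

With these choices the coefficient equals 1 for a perfectly comonotone pair and stays close to 0 for an independent pair; other target copulas would let the coefficient focus on other dependence patterns.
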
19. Spearman vs. TFDC
    Figure (right panel): "Spearman & TFDC values as a function of a"; x-axis: discontinuity position $a$ (0 to 1); y-axis: estimated positive dependence (0 to 1); curves: TFDC, Spearman.
    Figure: Empirical copulas for $(X, Y)$ where $X = Z\,\mathbf{1}\{Z < a\} + \epsilon_X \mathbf{1}\{Z > a\}$, $Y = Z\,\mathbf{1}\{Z < a + 0.25\} + \epsilon_Y \mathbf{1}\{Z > a + 0.25\}$, $a = 0, 0.05, \ldots, 0.95, 1$, and where $Z$ is uniform on $[0, 1]$ and $\epsilon_X, \epsilon_Y$ are independent noises (left). TFDC and Spearman coefficients estimated between $X$ and $Y$ as a function of $a$ (right). For $a = 0.75$, the Spearman coefficient yields a negative value, yet $X = Y$ over $[0, a]$.
21. Process: Recovering a simulated ground-truth
    A simulation & benchmark process that needs to be refined:
    • Extract (using a large sample) a filtered correlation matrix $R$
    • Generate samples of size $T = 10, \ldots, 20, \ldots$ from a relevant distribution (parameterized by $R$)
    • Compute the ratio of the number of correct clusterings obtained over the number of trials, as a function of $T$
    Figure (three panels): Empirical rates of convergence for Single Linkage, Average Linkage and Ward; x-axis: sample size (100 to 500); y-axis: score (0 to 1); curves: Gaussian - Pearson, Gaussian - Spearman, Student - Pearson, Student - Spearman.
    A full comparative study will be posted online at www.datagrapple.com/Tech.
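
The benchmark process (fix a ground-truth matrix $R$, draw samples of size $T$, count how often the clustering is recovered) can be sketched as follows. The block-structured $R$, the Gaussian sampling, the Pearson-based distance and all function names are assumptions made for this illustration; the study on the slides also covers Student distributions and Spearman correlation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def block_correlation(sizes, rho_in=0.7, rho_out=0.2):
    """Hypothetical ground truth: constant correlation inside and across clusters."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    R = np.where(labels[:, None] == labels[None, :], rho_in, rho_out)
    np.fill_diagonal(R, 1.0)
    return R, labels

def same_partition(a, b):
    """True when two label vectors define the same clusters, up to label names."""
    return np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :])

def success_rate(R, truth, T, method="average", trials=50, seed=0):
    """Fraction of trials in which hierarchical clustering recovers `truth`."""
    rng = np.random.default_rng(seed)
    k = len(np.unique(truth))
    chol = np.linalg.cholesky(R)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((T, len(R))) @ chol.T        # Gaussian sample with correlation R
        rho = np.clip(np.corrcoef(X, rowvar=False), -1.0, 1.0)
        dist = np.sqrt(2.0 * (1.0 - rho))
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method=method)
        found = fcluster(Z, t=k, criterion="maxclust")       # cut the tree into k clusters
        hits += same_partition(found, truth)
    return hits / trials

R, truth = block_correlation([5, 5, 5])
rate_small = success_rate(R, truth, T=10)
rate_large = success_rate(R, truth, T=500)
```

Plotting `success_rate` against `T` for several `method` values (for instance `"single"`, `"average"`, `"ward"`) gives curves of the kind shown in the figure; the smallest `T` at which a curve reaches a chosen score is the model selection criterion described in the introduction.
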
25. Beyond dependence: a (copula, margins) representation
    Poster: ON CLUSTERING FINANCIAL TIME SERIES, by Gautier Marti, Philippe Donnat and Frank Nielsen.
    NOISY CORRELATION MATRICES
    Let $X$ be the matrix storing the standardized returns of $N = 560$ assets (credit default swaps) over a period of $T = 2500$ trading days. Then the empirical correlation matrix of the returns is $C = \frac{1}{T} X X^\top$. We can compute the empirical density of its eigenvalues
    $$\rho(\lambda) = \frac{1}{N} \frac{dn(\lambda)}{d\lambda},$$
    where $n(\lambda)$ counts the number of eigenvalues of $C$ less than $\lambda$. From random matrix theory, the Marchenko–Pastur distribution gives the limit distribution as $N \to \infty$, $T \to \infty$ with $T/N$ fixed. It reads:
    $$\rho(\lambda) = \frac{T/N}{2\pi} \frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda},$$
    where $\lambda_{\max}^{\min} = 1 + N/T \pm 2\sqrt{N/T}$ and $\lambda \in [\lambda_{\min}, \lambda_{\max}]$.
    Figure 1: Marchenko–Pastur density vs. empirical density of the correlation matrix eigenvalues (x-axis: $\lambda$; y-axis: $\rho(\lambda)$).
    Notice that the Marchenko–Pastur density fits the empirical density well, meaning that most of the information contained in the empirical correlation matrix amounts to noise: only 26 eigenvalues are greater than $\lambda_{\max}$. The highest eigenvalue corresponds to the 'market'; the 25 others can be associated with 'industrial sectors'.
    CLUSTERING TIME SERIES
    Given a correlation matrix of the returns (Figure 2: An empirical and noisy correlation matrix), one can re-order assets using a hierarchical clustering algorithm to make the hierarchical correlation pattern blatant (Figure 3: The same noisy correlation matrix re-ordered by a hierarchical clustering algorithm), and finally filter the noise according to the correlation pattern (Figure 4: The resulting filtered correlation matrix).
    BEYOND CORRELATION
    Sklar's Theorem. For any random vector $X = (X_1, \ldots, X_N)$ having continuous marginal cumulative distribution functions $F_i$, its joint cumulative distribution $F$ is uniquely expressed as
    $$F(X_1, \ldots, X_N) = C(F_1(X_1), \ldots, F_N(X_N)),$$
    where $C$, the multivariate distribution of uniform marginals, is known as the copula of $X$.
    Figure 5: ArcelorMittal and Société Générale prices are projected on dependence ⊕ distribution space; notice their heavy-tailed exponential distribution.
    Let $\theta \in [0, 1]$. Let $(X, Y) \in \mathcal{V}^2$. Let $G = (G_X, G_Y)$, where $G_X$ and $G_Y$ are respectively the marginal cdfs of $X$ and $Y$. We define the following distance:
    $$d_\theta^2(X, Y) = \theta\, d_1^2(G_X(X), G_Y(Y)) + (1 - \theta)\, d_0^2(G_X, G_Y),$$
    where
    $$d_1^2(G_X(X), G_Y(Y)) = 3\, \mathbb{E}\left[|G_X(X) - G_Y(Y)|^2\right]$$
    and
    $$d_0^2(G_X, G_Y) = \frac{1}{2} \int_{\mathbb{R}} \left(\sqrt{\frac{dG_X}{d\lambda}} - \sqrt{\frac{dG_Y}{d\lambda}}\right)^2 d\lambda.$$
    CLUSTERING RESULTS & STABILITY
    Figure 6: (Top) The returns correlation structure appears more clearly using rank correlation; (Bottom) Clusters of returns distributions can be partly described by the returns volatility. Inset: "Standard Deviations Histogram" (x-axis: standard deviation in basis points; y-axis: number of occurrences).
    Figure 7: Stability test on Odd/Even trading days sub-sampling: our approach (GNPR) yields more stable clusters with respect to this perturbation than standard approaches (using Pearson correlation or L2 distances).
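
The Marchenko–Pastur bounds quoted in the poster's section on noisy correlation matrices can be evaluated numerically for $N = 560$ and $T = 2500$. The sketch below also builds a pure-noise correlation matrix to illustrate the null case; the simulated Gaussian returns are an assumption (the CDS data itself is not available here), and the function names are hypothetical.

```python
import numpy as np

def marchenko_pastur_edges(N, T):
    """lambda_min, lambda_max = 1 + N/T -/+ 2 sqrt(N/T)."""
    q = N / T
    return 1.0 + q - 2.0 * np.sqrt(q), 1.0 + q + 2.0 * np.sqrt(q)

def marchenko_pastur_density(lam, N, T):
    """rho(lambda) on [lambda_min, lambda_max], zero outside."""
    lam_min, lam_max = marchenko_pastur_edges(N, T)
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    dens = np.zeros_like(lam)
    inside = (lam > lam_min) & (lam < lam_max)
    dens[inside] = ((T / N) / (2.0 * np.pi)
                    * np.sqrt((lam_max - lam[inside]) * (lam[inside] - lam_min))
                    / lam[inside])
    return dens

N, T = 560, 2500
lam_min, lam_max = marchenko_pastur_edges(N, T)   # roughly 0.277 and 2.171

# Null case: standardized i.i.d. Gaussian "returns" carry no structure, so the
# eigenvalues of C = X X^T / T should essentially stay inside the MP support.
rng = np.random.default_rng(0)
X = rng.standard_normal((N, T))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
eigenvalues = np.linalg.eigvalsh(X @ X.T / T)
share_inside = np.mean((eigenvalues > lam_min - 0.05) & (eigenvalues < lam_max + 0.05))
```

On real returns, the eigenvalues that exceed `lam_max` are the ones the poster interprets as the 'market' and 'industrial sector' components.
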
26. References
    [1] Ricardo Coelho, Przemyslaw Repetowicz, Stefan Hutzler, and Peter Richmond. Investigation of Cluster Structure in the London Stock Exchange.
    [2] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 685–693, 2014.
    [3] Paul Deheuvels. La fonction de dépendance empirique et ses propriétés. Un test non paramétrique d'indépendance. Acad. Roy. Belg. Bull. Cl. Sci. (5), 65(6):274–292, 1979.
    [4] Paul Deheuvels. A non-parametric test for independence. Publications de l'Institut de Statistique de l'Université de Paris, 26:29–50, 1981.
    [5] Fabrizio Durante and Roberta Pappada. Cluster analysis of time series via Kendall distribution. In Strengthening Links Between Data Analysis and Soft Computing, pages 209–216. Springer, 2015.
    [6] Eckhard Liebscher et al. Copula-based dependence measures. Dependence Modeling, 2(1):49–64, 2014.
    [7] Rosario N. Mantegna. Hierarchical structure in financial markets. The European Physical Journal B - Condensed Matter and Complex Systems, 11(1):193–197, 1999.
    [8] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Clustering financial time series: How long is enough? In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2583–2589, 2016.
    [9] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Exploring and measuring non-linear correlations: Copulas, lightspeed transportation and clustering. NIPS 2016 Time Series Workshop, 55, 2016.
    [10] Gautier Marti, Sébastien Andler, Frank Nielsen, and Philippe Donnat. Optimal transport vs. Fisher-Rao distance between copulas for clustering multivariate time series. In IEEE Statistical Signal Processing Workshop, SSP 2016, Palma de Mallorca, Spain, June 26-29, 2016, pages 1–5, 2016.
    [11] Gautier Marti, Philippe Very, Philippe Donnat, and Frank Nielsen. A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series. In 14th IEEE International Conference on Machine Learning and Applications, ICMLA 2015, Miami, FL, USA, December 9-11, 2015, pages 32–37, 2015.
    [12] Barnabás Póczos, Zoubin Ghahramani, and Jeff G. Schneider. Copula-based kernel dependency measures. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
    [13] A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Université Paris 8, 1959.