
This document discusses methods for clustering random walks. It introduces the GNPR (Generic Non-Parametric Representation) method for defining a distance between two random walks that separates dependence and distribution information. The GNPR method is shown to outperform standard approaches on synthetic datasets containing different clusters based on distribution and dependence. The GNPR method is also used to cluster credit default swaps, identifying a cluster of "Western sovereigns". The document concludes that GNPR is an effective way to deal with dependence and distribution information separately without losing information.

- 1. How to cluster random walks? Paris Machine Learning #5 Season 2: Time Series and FinTech. Philippe Donnat (1), Gautier Marti (1, 2), Frank Nielsen (2), Philippe Very (1); (1) Hellebore Capital Management, (2) Ecole Polytechnique. th January. Sections: Introduction; How to define a distance between two random walks?; Applications; Conclusion.
- 2. Outline: 1 Introduction (Data Science for the CDS market; How to group random walks?; What is a clustering program?); 2 How to define a distance between two random walks? (Standard approach on time series; Comovements and distributions; GNPR: the best of both worlds); 3 Applications (Results on synthetic datasets; Clustering Credit Default Swaps); 4 Conclusion.
- 4. Hellebore Capital Management & Data Science. Current R&D projects in Data Science: data mining (parsing & natural language processing); inference (incomplete data sources); portfolio & risk analysis (understanding joint behaviours).
- 5. Do you see clusters? [Figure: random walks; French banks and building materials CDS over 2006-2010.]
- 11. Do you see clusters? [Figure: random walks; French banks and building materials CDS over 2006-2015.]
- 13. What is a clustering program? Definition: Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than those in different groups. Definition: We aim at finding K groups by positioning K group centers $\{c_1, \dots, c_K\}$ such that the data points $\{x_1, \dots, x_n\}$ minimize $\min_{c_1, \dots, c_K} \sum_{i=1}^{n} \min_{j=1}^{K} d(x_i, c_j)^2$. But what is the distance $d$ between two random walks?
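The objective on this slide can be evaluated directly for a given set of centers. The sketch below is only an illustration: it assumes that d is the Euclidean distance, and the name `clustering_cost` and the toy points are hypothetical, not taken from the talk.

```python
import numpy as np

def clustering_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared Euclidean distances, shape (n, K).
    diff = points[:, None, :] - centers[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Each point only pays for its closest center (the inner min over j).
    return float(sq_dist.min(axis=1).sum())

# Toy data: two tight groups of two points each (illustrative values).
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
good_centers = [[0.0, 0.5], [10.0, 10.5]]
bad_centers = [[0.0, 0.5], [5.0, 5.0]]

print(clustering_cost(points, good_centers))  # 1.0
print(clustering_cost(points, bad_centers))   # much larger cost
```

A clustering algorithm such as k-means searches over the centers to make this cost small; the open question raised by the slide is which d to plug in when the points are random walks.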
- 15. Naive distance between two random walks. Random walks $Y, Y'$ are mapped to their increments $X, X'$, with $X = (y_2 - y_1, \dots, y_T - y_{T-1})$. [Figure: covariance scatterplot.] $X, X'$ are points in $\mathbb{R}^T$: $\|X - X'\|^2 = \sum_{i=1}^{T-1} (X_i - X'_i)^2$. Apply normalizations, e.g. $(X - \mu)/\sigma$ or $(X - \min)/(\max - \min)$. This captures comovements rather well. Drawbacks: not robust to outliers, blind to signal shape.
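As a rough illustration of this naive approach, the sketch below differences two walks and compares their increments with a squared Euclidean distance, with and without normalization. The function names and the simulated walks are assumptions made for the example.

```python
import numpy as np

def increments(walk):
    """Turn a random walk (y_1, ..., y_T) into its T-1 increments."""
    return np.diff(np.asarray(walk, dtype=float))

def zscore(x):
    """Standardize: (X - mu) / sigma."""
    return (x - x.mean()) / x.std()

def minmax(x):
    """Rescale to [0, 1]: (X - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

def naive_sq_distance(walk_a, walk_b, normalize=None):
    """Squared Euclidean distance between the increments of two walks."""
    xa, xb = increments(walk_a), increments(walk_b)
    if normalize is not None:
        xa, xb = normalize(xa), normalize(xb)
    return float(((xa - xb) ** 2).sum())

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=500))
scaled_walk = 3.0 * walk  # same comovements, different scale

print(naive_sq_distance(walk, scaled_walk))                    # large: scales differ
print(naive_sq_distance(walk, scaled_walk, normalize=zscore))  # close to zero
print(naive_sq_distance(walk, scaled_walk, normalize=minmax))  # close to zero
```

Both normalizations remove the scale difference here, but they depend on the mean, standard deviation, minimum or maximum, so a single extreme increment can shift the result, in line with the robustness drawback listed on the slide.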
- 16. Our approach: split comovements and distributions?
- 17. GNPR: A suitable representation. Definition: GNPR (Generic Non-Parametric Representation) projection: $T: \mathcal{V}^N \to \mathcal{U}^N \times \mathcal{G}^N$, $X \mapsto (G_X(X), G_X)$ (1), where $G_X: x \mapsto P[X \le x]$ is the cumulative distribution function and $G_X(X) \sim \mathcal{U}[0, 1]$. Empirically, $\frac{1}{T}\,\mathrm{rank}(X_t) = \frac{1}{T} \sum_{k \le T} \mathbf{1}\{X_k \le X_t\} \to P[X \le X_t] = G_X(X_t)$ as $T \to \infty$. Property: $T$ is a bijection. N.B. It replicates Sklar's theorem, the seminal result of Copula Theory.
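The rank formula on this slide translates almost literally into code. The sketch below is one possible empirical reading, under the assumption that the sorted sample stands in for the marginal $G_X$; the function names are hypothetical, and the inversion shown at the end only works when the sample has no ties.

```python
import numpy as np

def empirical_cdf_values(x):
    """Return (1/T) * #{k : X_k <= X_t} for every t, i.e. rank(X_t) / T."""
    x = np.asarray(x, dtype=float)
    # Entry [t, k] is True when X_k <= X_t; the row mean is rank(X_t) / T.
    return (x[None, :] <= x[:, None]).mean(axis=1)

def gnpr(x):
    """Split a sample into (rank-transformed values, sorted sample)."""
    x = np.asarray(x, dtype=float)
    return empirical_cdf_values(x), np.sort(x)

x = np.array([2.5, -1.0, 0.3, 7.0])
u, support = gnpr(x)
print(u.tolist())        # [0.75, 0.25, 0.5, 1.0]
print(support.tolist())  # [-1.0, 0.3, 2.5, 7.0]

# Without ties, the pair (u, support) gives back the original sample,
# which mirrors the bijection property at the sample level.
recovered = support[np.rint(u * len(x)).astype(int) - 1]
print(recovered.tolist())  # [2.5, -1.0, 0.3, 7.0]
```

The first component carries only the ordering of the observations (dependence information), while the second carries only their values (distribution information).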
- 18. A distance $d_\theta$ leveraging GNPR. Definition: Let $(X, Y) \in \mathcal{V}^2$. Let $G_X$, $G_Y$ be vectors of marginal cdf. Let $\theta \in [0, 1]$. We define the following distance: $d_\theta^2(X, Y) = \theta\, d_1^2(G_X(X), G_Y(Y)) + (1 - \theta)\, d_0^2(G_X, G_Y)$ (2), where $d_1^2(G_X(X), G_Y(Y)) = 3\,\mathbb{E}[|G_X(X) - G_Y(Y)|^2]$ (3), and $d_0^2(G_X, G_Y) = \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{dG_X}{d\lambda}} - \sqrt{\frac{dG_Y}{d\lambda}} \right)^2 d\lambda$ (4). $d_0$ is the Hellinger distance; $d_1^2 = (1 - \rho_S)/2$, with $\rho_S$ the rank correlation between $X$ and $Y$.
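A possible empirical estimate of $d_\theta$ is sketched below. The slides do not say how the densities in (4) are estimated, so the histogram estimator, the number of bins, and all function names are assumptions made for this example.

```python
import numpy as np

def rank_transform(x):
    """Empirical G_X(X): rank of each observation divided by T."""
    x = np.asarray(x, dtype=float)
    return (x[None, :] <= x[:, None]).mean(axis=1)

def d1_squared(x, y):
    """Dependence term (3): 3 * E[|G_X(X) - G_Y(Y)|^2], estimated on ranks."""
    return float(3.0 * np.mean((rank_transform(x) - rank_transform(y)) ** 2))

def d0_squared(x, y, bins=50):
    """Distribution term (4): squared Hellinger distance between histograms.

    Assumption: densities are approximated by histograms on common bins.
    """
    lo, hi = min(np.min(x), np.min(y)), max(np.max(x), np.max(y))
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()  # bin probabilities
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def d_theta_squared(x, y, theta, bins=50):
    """Equation (2): convex combination of the two squared distances."""
    return theta * d1_squared(x, y) + (1.0 - theta) * d0_squared(x, y, bins)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + 0.1 * rng.normal(size=500)   # strongly dependent, similar distribution
z = rng.standard_t(df=3, size=500)   # independent, heavier tails

for name, other in [("y", y), ("z", z)]:
    print(name, [round(d_theta_squared(x, other, t), 3) for t in (0.0, 0.5, 1.0)])
```

Setting `theta` to 1 keeps only the rank-based dependence term, setting it to 0 keeps only the distribution term, and intermediate values blend the two.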
- 19. GNPR θ = 1: Increase of correlation. [Figure: "Distribution of Correlations"; densities of the Pearson and Spearman correlations; x-axis: Correlation, y-axis: Density.] 10% more correlation, on average, using GNPR θ = 1 (rank statistics) instead of standard correlation.
- 20. GNPR θ = 0: Find distribution peculiarities. Parametric modelling vs. real-life CDS variations: which distribution? [Figures: Nokia 5Y CDS spread over time (Jan-2006 to Jul-2014) and log-density of its increments; Telecom Italia 5Y CDS spread over time (Jan-2006 to Mar-2014) and log-density of its increments.]
- 22. Description of the testing datasets. We define some interesting test case datasets to study: distribution clustering (dataset A), dependence clustering (dataset B), a mix of both (dataset C).
- 23. Results: GNPR works! Adjusted Rand Index on datasets A, B and C:

  | Representation | Algorithm | A | B | C |
  |---|---|---|---|---|
  | X | Ward | 0 | 0.94 | 0.42 |
  | (X − µX)/σX | Ward | 0 | 0.94 | 0.42 |
  | (X − min)/(max − min) | Ward | 0 | 0.48 | 0.45 |
  | GNPR θ = 0 | Ward | 1 | 0 | 0.47 |
  | GNPR θ = 1 | Ward | 0 | 0.91 | 0.72 |
  | GNPR θ | Ward | 1 | 0.92 | 1 |
  | X | k-means++ | 0 | 0.90 | 0.44 |
  | (X − µX)/σX | k-means++ | 0 | 0.91 | 0.45 |
  | (X − min)/(max − min) | k-means++ | 0.11 | 0.55 | 0.47 |
  | GNPR θ = 0 | k-means++ | 1 | 0 | 0.53 |
  | GNPR θ = 1 | k-means++ | 0.06 | 0.99 | 0.80 |
  | GNPR θ | k-means++ | 1 | 0.99 | 1 |
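For readers unfamiliar with the score reported on this slide, here is a small self-contained sketch of the Adjusted Rand Index, written from its standard pair-counting definition. The function name and the toy labels are illustrative; the talk does not say which implementation was used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # nothing to compare
    pairs = Counter(zip(labels_true, labels_pred))
    # Number of item pairs grouped together in both partitions,
    # in the reference partition, and in the predicted partition.
    together_both = sum(comb(c, 2) for c in pairs.values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)  # chance level
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (together_both - expected) / (maximum - expected)

truth = [0, 0, 0, 1, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 1, 0, 0, 0]))  # 1.0 (labels renamed)
print(adjusted_rand_index(truth, [0, 1, 2, 0, 1, 2]))  # below zero: worse than chance
```

A score of 1 means the recovered clusters match the ground truth exactly, up to a renaming of the labels, while scores near 0 correspond to a chance-level partition.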
- 24. HCMapper: Compare Hierarchical Clustering. [Figure.]
- 26. "Western sovereigns" cluster. [Figure.]
- 28. Conclusion: Take Home Message. GNPR is a way to deal separately with dependence information and distribution information, without losing any. Avenues for research: better aggregation (generalized means?); consistency proof?; any idea of interesting random walks outside finance? Check out www.datagrapple.com to follow our R&D stuff and news about the CDS market!
- 29. Internships. If interested, please contact gautier.marti@helleborecapital.com or philippe.very@helleborecapital.com for further details and application.
- 30. References I. [BDVL08] Shai Ben-David and Ulrike Von Luxburg, Relating clustering stability to properties of cluster boundaries, COLT, vol. 2008, 2008, pp. 379–390. [BDVLP06] Shai Ben-David, Ulrike Von Luxburg, and Dávid Pál, A sober look at clustering stability, Learning Theory, Springer, 2006, pp. 5–19. [BHEG01] Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon, A stability based method for discovering structure in clustered data, Pacific Symposium on Biocomputing, vol. 7, 2001, pp. 6–17. [LRBB04] Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann, Stability-based validation of clustering solutions, Neural Computation 16 (2004), no. 6, 1299–1323.
- 31. How to find θ? Clustering stability using perturbations due to: bootstrap; time-sliding window; draws from an oracle. Stability pros: [BHEG01, LRBB04]; stability cons: [BDVLP06, BDVL08]. [Figures: "Accuracy landscape" (axes: θ, T, accuracy); "Accuracy and stability of clustering using GNPR" (ARI against θ; curves: accuracy, bootstrap stability, time stability, cross-datasets stability).]
- 32. Properties of $d_\theta$. Property: $\forall \theta \in [0, 1]$, $0 \le d_\theta \le 1$. Property: For $0 < \theta < 1$, $d_\theta$ is a metric. For $\theta \in \{0, 1\}$ it is not, since distinct variables can be at distance zero: with $U \sim \mathcal{U}[0, 1]$, $U \ne 1 - U$, yet $d_0(U, 1 - U) = 0$; and $V \ne 2V$, yet $d_1(V, 2V) = 0$.
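The two counterexamples on this slide can be checked numerically. The sketch below uses a deterministic, evenly spaced grid as a stand-in for a uniform sample, and a histogram-based Hellinger estimate on a fixed support; both choices, like the function names, are assumptions made for the illustration.

```python
import numpy as np

def rank_transform(x):
    """Rank of each observation divided by the sample size."""
    x = np.asarray(x, dtype=float)
    return (x[None, :] <= x[:, None]).mean(axis=1)

def d1_squared(x, y):
    """Dependence term: 3 * mean squared difference of the ranks."""
    return float(3.0 * np.mean((rank_transform(x) - rank_transform(y)) ** 2))

def d0_squared(x, y, bins=10, support=(0.0, 2.0)):
    """Distribution term: squared Hellinger distance between histograms."""
    p, _ = np.histogram(x, bins=bins, range=support)
    q, _ = np.histogram(y, bins=bins, range=support)
    p, q = p / p.sum(), q / q.sum()
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

n = 1000
u = (np.arange(n) + 0.5) / n   # evenly spread stand-in for U ~ U[0, 1]
v = u.copy()

print(d0_squared(u, 1.0 - u))   # same distribution, so d0 vanishes
print(d1_squared(u, 1.0 - u))   # yet the dependence term is close to 1
print(d1_squared(v, 2.0 * v))   # same ranks, so d1 vanishes
print(d0_squared(v, 2.0 * v))   # yet the distributions differ
```

In each pair one term is blind to a difference that the other term detects, which is why a strictly intermediate θ is needed for $d_\theta$ to separate distinct variables.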
- 33. Description of the testing datasets. For $1 \le i \le N = pK \prod_{s=1}^{S} K_s$, we define $X_i = \sum_{s=1}^{S} \sum_{k=1}^{K_s} \beta^s_{k,i} Y^s_k + \sum_{k=1}^{K} \alpha_{k,i} Z^i_k$ (5), where a) $\alpha_{k,i} = 1$ if $i \equiv k - 1 \pmod{K}$, 0 otherwise; b) $\beta^s_k \in [0, 1]$; c) $\beta^s_{k,i} = \beta^s_k$ if $\lceil i K_s / N \rceil = k$, 0 otherwise. $(X_i)_{i=1}^{N}$ are partitioned into $Q = K \prod_{s=1}^{S} K_s$ clusters of $p$ random variables each.
- 34. OTC data flow processing. [Figure.]
- 35. Conjecture: consistency of clustering random walks. [Figures: ARI as a function of T and N; "Clustering convergence to the ground-truth partition" (ARI against T; curves: clustering distribution θ = 0, clustering dependence θ = 1, clustering total information θ = θ*).]