Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Learning from (dis)similarity data

128 views

Published on

Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia

Published in: Science
  • Be the first to comment

  • Be the first to like this

Learning from (dis)similarity data

  1. 1. Learning from (dis)similarity data Nathalie Vialaneix nathalie.vialaneix@inra.fr http://www.nathalievialaneix.eu MelbURN 2018 July 16th, 2018 - Melbourne, Australia Nathalie Vialaneix | Learning from (dis)similarity data 1/24
  2. 2. What are my data like? Nathalie Vialaneix | Learning from (dis)similarity data 2/24
  3. 3. A medieval social network [Boulet et al., 2008, Rossi et al., 2013] corpus with more than 6,000 transactions, 3 centuries, all related to Castelnau Montratier Nathalie Vialaneix | Learning from (dis)similarity data 3/24
  4. 4. A medieval social network [Boulet et al., 2008, Rossi et al., 2013] corpus with more than 6,000 transactions, 3 centuries, all related to Castelnau Montratier Individual Transaction q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Ratier Ratier (II) Castelnau Jean Laperarede Bertrande Audoy Gailhard Gourdon Guy Moynes (de) Pierre Piret (de) Bernard Audoy Hélène Castelnau Guiral Baro Bernard Audoy Arnaud Bernard Laperarede Guilhem Bernard Prestis Jean Manas Jean Laperarede Jean Laperarede Jean Roquefeuil Jean Pojols Ramond Belpech Raymond Laperarede Bertrand Prestis (de) Ratier (Monseigneur) Roquefeuil (de) Guilhem Bernard Prestis Arnaud Gasbert Castanhier (del) Ratier (III) Castelnau Pierre Prestis (de) P Valeribosc Guillaume Marsa Berenguier Roquefeuil Arnaud Bernard Perarede Jean Roquefeuil Arnaud I Audoy Arnaud Bernard Perarede bipartite network with more than 17,000 nodes (∼ 10,000 individuals) What can we learn from the French medieval society? Nathalie Vialaneix | Learning from (dis)similarity data 3/24
  5. 5. Career paths [Olteanu and Villa-Vialaneix, 2015] Survey “Génération 98”: labor market status (9 categories) on more than 16,000 people having graduated in 1998 during 94 months. 1 1. Available thanks to Génération 1998 à 7 ans - 2005, [producer] CEREQ, [diffusion] Centre Maurice Halbwachs (CMH). Nathalie Vialaneix | Learning from (dis)similarity data 4/24
  6. 6. Career paths [Olteanu and Villa-Vialaneix, 2015] Survey “Génération 98”: labor market status (9 categories) on more than 16,000 people having graduated in 1998 during 94 months. 1 How to cluster career paths into homogeneous groups? 1. Available thanks to Génération 1998 à 7 ans - 2005, [producer] CEREQ, [diffusion] Centre Maurice Halbwachs (CMH). Nathalie Vialaneix | Learning from (dis)similarity data 4/24
  7. 7. Career paths [Olteanu and Villa-Vialaneix, 2015] Survey “Génération 98”: labor market status (9 categories) on more than 16,000 people having graduated in 1998 during 94 months. 1 How to cluster career paths into homogeneous groups? It is all about distance... χ2 dissimilarity emphasizes the contemporary identical situations Optimal-matching dissimilarities is more focused on the sequences similarities [Needleman and Wunsch, 1970] (or “edit distance”, “Levenshtein distance”) 1. Available thanks to Génération 1998 à 7 ans - 2005, [producer] CEREQ, [diffusion] Centre Maurice Halbwachs (CMH). Nathalie Vialaneix | Learning from (dis)similarity data 4/24
  8. 8. and then I went into NGS data... and again... distances are everywhere Nathalie Vialaneix | Learning from (dis)similarity data 5/24
  9. 9. a collection of NGS data... DNA barcoding Astraptes fulgerator optimal matching (edit) distances to differentiate species Nathalie Vialaneix | Learning from (dis)similarity data 6/24
  10. 10. a collection of NGS data... DNA barcoding Astraptes fulgerator optimal matching (edit) distances to differentiate species Hi-C data pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale Nathalie Vialaneix | Learning from (dis)similarity data 6/24
  11. 11. a collection of NGS data... DNA barcoding Astraptes fulgerator optimal matching (edit) distances to differentiate species Hi-C data pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale Metagenomics dissemblance between samples is better captured when phylogeny between species is taken into account (unifrac distances) Nathalie Vialaneix | Learning from (dis)similarity data 6/24
  12. 12. Relational Self-Organizing Map algorithm Nathalie Vialaneix | Learning from (dis)similarity data 7/24
  13. 13. Basics on (standard) stochastic SOM [Kohonen, 2001] x x x (xi)i=1,...,n ⊂ Rd are affected to a unit f(xi) ∈ {1, . . . , U} the grid is equipped with a “distance” between units: d(u, u ) and observations affected to close units are close in Rd every unit u corresponds to a prototype, pu (x) in Rd Nathalie Vialaneix | Learning from (dis)similarity data 8/24
  14. 14. Basics on (standard) stochastic SOM [Kohonen, 2001] x x x Iterative learning (assignment step): xi is picked at random within (xk )k and affected to best matching unit: ft (xi) = arg min u xi − pt u 2 Nathalie Vialaneix | Learning from (dis)similarity data 8/24
  15. 15. Basics on (standard) stochastic SOM [Kohonen, 2001] x x x Iterative learning (representation step): all prototypes in neighboring units are updated with a gradient descent like step: pt+1 u ←− pt u + µ(t)Ht (d(f(xi), u))(xi − pt u) Nathalie Vialaneix | Learning from (dis)similarity data 8/24
  16. 16. Extension of SOM to data described by a kernel or a dissimilarity [Olteanu and Villa-Vialaneix, 2015] Data: (xi)i=1,...,n ∈ Rd 1: Initialization: randomly set p0 1 , ..., p0 U in Rd 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U xi − pt u 2 βt u 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) 7: end for 8: end for Nathalie Vialaneix | Learning from (dis)similarity data 9/24
  17. 17. Extension of SOM to data described by a kernel or a dissimilarity [Olteanu and Villa-Vialaneix, 2015] Data: (xi)i=1,...,n ∈ X 1: Initialization: randomly set p0 1 , ..., p0 U in Rd 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U xi − pt u 2 βt u 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) 7: end for 8: end for Nathalie Vialaneix | Learning from (dis)similarity data 9/24
  18. 18. Extension of SOM to data described by a kernel or a dissimilarity [Olteanu and Villa-Vialaneix, 2015] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u ∼ n i=1 β0 ui xi (convex combination) 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U xi − pt u 2 βt u 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) 7: end for 8: end for Nathalie Vialaneix | Learning from (dis)similarity data 9/24
  19. 19. Extension of SOM to data described by a kernel or a dissimilarity [Olteanu and Villa-Vialaneix, 2015] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u ∼ n i=1 β0 ui xi (convex combination) 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U βt uD(pt u, xi) 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) 7: end for 8: end for Nathalie Vialaneix | Learning from (dis)similarity data 9/24
  20. 20. Extension of SOM to data described by a kernel or a dissimilarity [Olteanu and Villa-Vialaneix, 2015] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u ∼ n i=1 β0 ui xi (convex combination) 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U βt uD(pt u, xi) 5: for all u = 1 → U do Representation 6: pt+1 u = pt u + µ(t)Ht (d(ft (xi), u)) ∼ xi − pt u 7: end for 8: end for Nathalie Vialaneix | Learning from (dis)similarity data 9/24
  21. 21. Extension of SOM to data described by a kernel or a dissimilarity [Olteanu and Villa-Vialaneix, 2015] Data: (xi)i=1,...,n ∈ X 1: Initialization: p0 u ∼ n i=1 β0 ui xi (convex combination) 2: for t = 1 → T do 3: pick at random i ∈ {1, . . . , n} 4: Assignment ft (xi) = arg min u=1,...,U βt u(βt u) D(., xi) − 1 2 (βt u) Dβt u 5: for all u = 1 → U do Representation 6: βt+1 u = βt u + µ(t)Ht (d(ft (xi), u)) 1i − βt u 7: end for 8: end for Nathalie Vialaneix | Learning from (dis)similarity data 9/24
  22. 22. Note on drawbacks of RSOM Two main drawbacks: For T ∼ γn iterations, complexity of RSOM is O(γn3 U) (compared to O(γUdn) for numeric) [Rossi, 2014] Nathalie Vialaneix | Learning from (dis)similarity data 10/24
  23. 23. Note on drawbacks of RSOM Two main drawbacks: For T ∼ γn iterations, complexity of RSOM is O(γn3 U) (compared to O(γUdn) for numeric) [Rossi, 2014] Exact solution proposed in [Mariette et al., 2017] to reduce the complexity to O(γn2 U) with additional storage memory of O(Un) Nathalie Vialaneix | Learning from (dis)similarity data 10/24
  24. 24. Note on drawbacks of RSOM Two main drawbacks: For T ∼ γn iterations, complexity of RSOM is O(γn3 U) (compared to O(γUdn) for numeric) [Rossi, 2014] Exact solution proposed in [Mariette et al., 2017] to reduce the complexity to O(γn2 U) with additional storage memory of O(Un) For the non Euclidean case, the learning algorithm can be very unstable (saddle points) Nathalie Vialaneix | Learning from (dis)similarity data 10/24
  25. 25. Note on drawbacks of RSOM Two main drawbacks: For T ∼ γn iterations, complexity of RSOM is O(γn3 U) (compared to O(γUdn) for numeric) [Rossi, 2014] Exact solution proposed in [Mariette et al., 2017] to reduce the complexity to O(γn2 U) with additional storage memory of O(Un) For the non Euclidean case, the learning algorithm can be very unstable (saddle points) clip or flip? [Chen et al., 2009] Nathalie Vialaneix | Learning from (dis)similarity data 10/24
  26. 26. SOMbrero [Villa-Vialaneix, 2017] SOMbrero is an R package implementing stochastic variants of SOM for non vectorial data Specifically well adapted to... non expert use and teaching use with graphs and obtain simplified representations first release: March 2013; latest release: Feb. 2018 (version 1.2.3) depends on R (version ≥ 3.1.0) http://www.r-project.org and on several packages available on CRAN: wordcloud, igraph, RColorBrewer, scatterplot3d, knitr, shiny available at https://cran.r-project.org/package=SOMbrero (licence GPL) and can be installed from inside R using install.packages("SOMbrero") Nathalie Vialaneix | Learning from (dis)similarity data 11/24
  27. 27. Training mysom <- trainSOM(iris[ ,1:4], ...) Options to train the SOM: grid: square grid, with arbitrary width and length distance between units: standard distances as in dist or "letremy" (Euclidean then "maximum") neighborhood relationship: Gaussian or "letremy" prototypes: initialized randomly, with a PCA, with random observations from the training sample preprocessing: centering, scaling to unit variance or nothing training: number of iterations, standard or Heskes’s assignment step ft (xi) ← arg min u=1,...,U U u =1 Ht (d(u, u )) xi − pt−1 u 2 Nathalie Vialaneix | Learning from (dis)similarity data 12/24
  28. 28. Diagnostic tools quality(mysom) topographic error: average frequency (over the samples) for which the prototypes that comes closest is in the direct neighborhood on the grid of the BMU quantization error Q = 1 n n i=1 xi − pf(xi) 2 Nathalie Vialaneix | Learning from (dis)similarity data 13/24
  29. 29. Plots... plot(mysom, what = c("observations", "prototypes", "add"), type = ..., ...) Nathalie Vialaneix | Learning from (dis)similarity data 14/24
  30. 30. Super-clustering mysom.sc <- superClass(mysom) Nathalie Vialaneix | Learning from (dis)similarity data 15/24
  31. 31. Start with SOMbrero 3 datasets corresponding to the three types of data that SOMbrero can handle (iris, presidentielles2002 and lesmis, a graph from “Les Misérables”) Nathalie Vialaneix | Learning from (dis)similarity data 16/24
  32. 32. Start with SOMbrero 3 datasets corresponding to the three types of data that SOMbrero can handle (iris, presidentielles2002 and lesmis, a graph from “Les Misérables”) comprehensive (HTML) vignettes included in the package and available on the website Nathalie Vialaneix | Learning from (dis)similarity data 16/24
  33. 33. Start with SOMbrero 3 datasets corresponding to the three types of data that SOMbrero can handle (iris, presidentielles2002 and lesmis, a graph from “Les Misérables”) comprehensive (HTML) vignettes included in the package and available on the website Web User Interface (made with shiny) for using the package even if you do not know R programming language (included in the package with sombreroGUI() Tested and approved on an historian! Nathalie Vialaneix | Learning from (dis)similarity data 16/24
  34. 34. RSOM for mining a medieval social network with the heat kernel Individual Transaction q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Ratier Ratier (II) Castelnau Jean Laperarede Bertrande Audoy Gailhard Gourdon Guy Moynes (de) Pierre Piret (de) Bernard Audoy Hélène Castelnau Guiral Baro Bernard Audoy Arnaud Bernard Laperarede Guilhem Bernard Prestis Jean Manas Jean Laperarede Jean Laperarede Jean Roquefeuil Jean Pojols Ramond Belpech Raymond Laperarede Bertrand Prestis (de) Ratier (Monseigneur) Roquefeuil (de) Guilhem Bernard Prestis Arnaud Gasbert Castanhier (del) Ratier (III) Castelnau Pierre Prestis (de) P Valeribosc Guillaume Marsa Berenguier Roquefeuil Arnaud Bernard Perarede Jean Roquefeuil Arnaud I Audoy Arnaud Bernard Perarede [Boulet et al., 2008] Graph induced by clusters: has nice relations with space and time emphasizes leading people has helped to identify problems in the database (namesakes) But: biggest communities are still very complex Nathalie Vialaneix | Learning from (dis)similarity data 17/24
  35. 35. RSOM for typology of Astraptes fulgerator from DNA barcoding Edit distances between DNA sequences [Olteanu and Villa-Vialaneix, 2015] Almost perfect clustering (identifying a possible label error on one sample) with (in addition) information on relations between species. Nathalie Vialaneix | Learning from (dis)similarity data 18/24
  36. 36. RSOM for typology of school-to-time transitions Edit distance between 12,000 categorical time series Nathalie Vialaneix | Learning from (dis)similarity data 19/24
  37. 37. Also in SOMbrero: KORRESP [Cottrell and Letrémy, 2005] Data: contingency table T = (nij)ij with p rows and q columns transformed into a numeric dataset X: X = columns rows columns rows column profile row profile with ∀ i = 1, . . . , p and ∀ j = 1, . . . , q, xij = nij ni. × n n.j Nathalie Vialaneix | Learning from (dis)similarity data 20/24
  38. 38. Also in SOMbrero: KORRESP [Cottrell and Letrémy, 2005] Data: contingency table T = (nij)ij with p rows and q columns transformed into a numeric dataset X: X = columns rows columns rows augmented column profile augmented row profile with ∀ i = 1, . . . , p and ∀ j = q + 1, . . . , q + p, xij = xk(i)+p,j with k(i) = arg maxk=1,...,q xik Nathalie Vialaneix | Learning from (dis)similarity data 20/24
  39. 39. Also in SOMbrero: KORRESP [Cottrell and Letrémy, 2005] Data: contingency table T = (nij)ij with p rows and q columns transformed into a numeric dataset X: X = columns rows columns rows augmented column profile augmented row profile column profile row profile assignment uses reduced profile representation uses augmented profile alternatively process row profiles and column profiles Nathalie Vialaneix | Learning from (dis)similarity data 20/24
  40. 40. Also available in SOMbrero mysom <- trainSOM(presidentielles2002 , type = "korresp") plot(mysom, what = "obs", type = "names") Nathalie Vialaneix | Learning from (dis)similarity data 21/24
  41. 41. SOMbrero Madalina Olteanu, Fabrice Rossi, Marie Cottrell, Laura Bendhaïba and Julien Boelaert SOMbrero and mixKernel Jérôme Mariette adjclust Pierre Neuvial, Guillem Rigail, Christophe Ambroise and Shubham Chaturvedi Nathalie Vialaneix | Learning from (dis)similarity data 22/24
  42. 42. Don’t miss useR! 2019 user2019.r-project.org Nathalie Vialaneix | Learning from (dis)similarity data 23/24
  43. 43. Credits for pictures Slide 2: Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/ Slide 3: Picture of Castelnau Montratier from https://commons.wikimedia.org/wiki/File: Place_Gambetta,_Castelnau-Montratier.JPG by Duch.seb CC BY-SA 3.0 Slide 4: image based on ENCODE project, by Darryl Leja (NHGRI), Ian Dunham (EBI) and Michael Pazin (NHGRI) Slide 6: Astraptes picture is from https://www.flickr.com/photos/39139121@N00/2045403823/ by Anne Toal (CC BY-SA 2.0), Hi-C experiment is taken from the article Matharu et al., 2015 DOI:10.1371/journal.pgen.1005640 (CC BY-SA 4.0) and metagenomics illustration is taken from the article Sommer et al., 2010 DOI:10.1038/msb.2010.16 (CC BY-NC-SA 3.0) Slide 12: TADS picture is from the article Fraser et al., 2015 DOI:10.15252/msb.20156492 (CC BY-SA 4.0) Nathalie Vialaneix | Learning from (dis)similarity data 24/24
  44. 44. References Boulet, R., Jouve, B., Rossi, F., and Villa, N. (2008). Batch kernel SOM and related Laplacian methods for social network analysis. Neurocomputing, 71(7-9):1257–1273. Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009). Similarity-based classification: concepts and algorithm. Journal of Machine Learning Research, 10:747–776. Cottrell, M. and Letrémy, P. (2005). How to use the Kohonen algorithm to simultaneously analyse individuals in a survey. Neurocomputing, 63:193–207. Kohonen, T. (2001). Self-Organizing Maps, 3rd Edition, volume 30. Springer, Berlin, Heidelberg, New York. Mariette, J., Rossi, F., Olteanu, M., and Villa-Vialaneix, N. (2017). Accelerating stochastic kernel som. In Verleysen, M., editor, XXVth European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2017), pages 269–274, Bruges, Belgium. i6doc. Needleman, S. and Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453. Olteanu, M. and Villa-Vialaneix, N. (2015). On-line relational and multiple relational SOM. Neurocomputing, 147:15–30. Rossi, F. (2014). How many dissimilarity/kernel self organizing map variants do we need? Nathalie Vialaneix | Learning from (dis)similarity data 24/24
  45. 45. In Villmann, T., Schleif, F., Kaden, M., and Lange, M., editors, Advances in Self-Organizing Maps and Learning Vector Quantization (Proceedings of WSOM 2014), volume 295 of Advances in Intelligent Systems and Computing, pages 3–23, Mittweida, Germany. Springer Verlag, Berlin, Heidelberg. Rossi, F., Villa-Vialaneix, N., and Hautefeuille, F. (2013). Exploration of a large database of French notarial acts with social network methods. Digital Medievalist, 9. Villa-Vialaneix, N. (2017). Stochastic self-organizing map variants with the R package SOMbrero. In Lamirel, J., Cottrell, M., and Olteanu, M., editors, 12th International Workshop on Self-Organizing Maps and Learning Vector Quantization, Clustering and Data Visualization (Proceedings of WSOM 2017), Nancy, France. IEEE. Nathalie Vialaneix | Learning from (dis)similarity data 24/24

×