DataXDay - Exploring graphs: looking for communities & leaders

Ever been stuck on a data science use case where every approach seems too hard? Graph theory, which describes a system purely in terms of nodes and links, could be your answer! In the practical example we’ll show, we’ll try to find data science communities and their leaders on LinkedIn. Challenge accepted?

Aurélia Nègre & Alberto Guggiola - Quantmetry
https://dataxday.fr/

Published in: Technology

  1. @DataXDay © Quantmetry 2018 | Distribution prohibited without permission
  2. The Panama Papers: a massive leak (image: VectorOpenStock)
  3. The Panama Papers: a massive leak. 11.5M documents, 2.6 TB of data
  6. And graphs to make sense of it... https://www.silicon.fr/linkurious-start-up-big-data-panama-papers-144051.html?inf_by=5ae98d4c671db887218b5652
  7. ... sparking an international scandal
  8. Graph Theory … looking for communities & finding the leaders…
     Aurélia Nègre, Data Scientist, anegre@quantmetry.com
     Alberto Guggiola, Data Scientist, aguggiola@quantmetry.com
     DataXDay, 17th May 2017
  9. Who are we? (Aurélia Nègre & Alberto Guggiola)
     • 70 consultants (data scientists, architects, engineers, consultants & more …)
     • From proofs of concept to production
     • Fraud detection, predictive maintenance, customer insights …
  10. A graph: a structure made up of nodes and links. Examples: a social network, a transportation network
  11. Some use cases of graph theory
      Spreading (viral marketing, vaccination campaigns)
      • Determine the speed of a spreading phenomenon
      • How can it be sped up or slowed down?
      Dynamics & optimisation (transportation systems, social networks)
      • What is the shortest path between two nodes?
      • What are the effects of modifying the structure?
      Domino effects (security systems, economics, infrastructures)
      • How resilient is the network to random failures?
      • And to targeted attacks?
      Structural importance (Google PageRank algorithm)
      • Which nodes are the most important or authoritative? Who are the leaders?
  12. Part 1: Looking for communities
  14. Community detection: looking for a structure
      Community: a region having some degree of autonomy -> no unique formal definition!
      • Which communities interact with each other?
      • Which elements act as « bridges » between communities?
  15. Two approaches for finding clusters
      • Cutting the bridges: spectral clustering, Girvan-Newman
      • Gathering the most connected elements: Fastgreedy, Louvain, Walktrap
  16. Two examples
      • Girvan-Newman: a good algorithm on small graphs (<500 nodes), but with a very high complexity
        Cut the bridges: iteratively remove the links with the highest betweenness. Communities are found when the graph becomes disconnected
      • Walktrap: much more efficient on large graphs
        Random walk on a network: a path following randomly chosen edges of the graph. Community « strength »: proportional to the time a random walker spends inside it
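
The "cut the bridges" idea can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the speakers' code: the graph is a small unweighted, undirected toy graph stored as a dict of neighbour sets, and the function names (`edge_betweenness`, `connected_components`, `girvan_newman_first_split`) are hypothetical. It stops at the first split, whereas the full Girvan-Newman procedure keeps removing links to build a hierarchy.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge (BFS from each node).

    Scores are unnormalised: each node pair is counted in both directions.
    """
    score = {}
    for source in adj:
        dist, sigma, preds, order = {source: 0}, {source: 1.0}, {}, []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:  # first time w is reached
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path counts on edges.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds.get(w, []):
                flow = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + flow
                delta[v] += flow
    return score


def connected_components(adj):
    """List of node sets, one per connected component."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adj[v] - component)
        seen |= component
        components.append(component)
    return components


def girvan_newman_first_split(adj):
    """Remove the highest-betweenness link until the graph disconnects."""
    adj = {v: set(neighbours) for v, neighbours in adj.items()}  # work on a copy
    while len(connected_components(adj)) == 1:
        scores = edge_betweenness(adj)
        if not scores:  # no link left to remove
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return connected_components(adj)


# Toy graph (an assumption for illustration): two triangles joined by the link 2-3.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(sorted(sorted(c) for c in girvan_newman_first_split(adjacency)))
# [[0, 1, 2], [3, 4, 5]]
```

Every removal triggers a full recomputation of the betweenness scores, which is why the approach becomes expensive beyond small graphs.
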
  18. And the winner is... the Louvain algorithm
      ✅ Able to identify heterogeneous communities
      ✅ Efficient on large graphs: complexity O(N log N)
      ✅ Available in most graph analytics libraries: fine as a first try
      Modularity optimization: density of edges inside vs outside clusters
      $Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$
      where $A_{ij}$ is the adjacency matrix, $k_i$ the degree of node $i$, $m$ the number of links and $c_i$ the community of node $i$
      Local-to-global greedy: from groups of nodes … to groups of clusters
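
To make the modularity score concrete, here is a small sketch that evaluates Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j) for a given partition. It assumes an unweighted, undirected graph without self-loops, given as a list of edges; the function name `modularity` and the toy graph are illustrative choices. It only scores a partition and does not implement the Louvain search itself.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition.

    edges: list of (u, v) pairs, each undirected link listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Sum of A_ij over ordered pairs in the same community:
    # every internal link is counted twice, as (i, j) and (j, i).
    internal = sum(2 for u, v in edges if community[u] == community[v])

    # Sum of k_i * k_j over ordered pairs in the same community equals
    # the squared total degree of each community.
    degree_per_community = defaultdict(int)
    for node, k in degree.items():
        degree_per_community[community[node]] += k
    expected = sum(d * d for d in degree_per_community.values()) / (2 * m)

    return (internal - expected) / (2 * m)


# Toy graph (an assumption for illustration): two triangles joined by the link 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

Putting all nodes in a single community always gives Q = 0, so a positive score means more internal links than expected from the degrees alone.
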
  19. Testing the algorithms and measuring the performance
      I observe the truth: the known communities
      • I measure the capability to reconstruct real, known communities
      • Example metric: Normalized Mutual Information
      I create the truth: the Stochastic Block Model
      • I define the probability for each pair of nodes to be connected
      • In the simplest case: $p_{ij} = \begin{cases} A & \text{if } i, j \text{ are in the same community} \\ B < A & \text{if not} \end{cases}$
      • As a consequence, there are more links inside communities
      • Many observations can be generated to test algorithms
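
Both sides of this testing setup can be sketched with the standard library. The block below is an illustration under assumptions: `p_in` and `p_out` stand for the slide's A and B, the function names are hypothetical, and the NMI is normalised by the arithmetic mean of the two entropies (other normalisations exist).

```python
import math
import random
from collections import Counter
from itertools import combinations


def stochastic_block_model(sizes, p_in, p_out, seed=0):
    """Generate a graph with planted communities.

    Each pair of nodes is linked with probability p_in if both nodes are in
    the same community, p_out otherwise (p_out < p_in).
    Returns (labels, edges), where labels[i] is the community of node i.
    """
    rng = random.Random(seed)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    edges = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if rng.random() < (p_in if labels[i] == labels[j] else p_out)
    ]
    return labels, edges


def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 * I(A; B) / (H(A) + H(B)); 1 means identical partitions."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mutual = sum(
        c / n * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    total_entropy = entropy(count_a) + entropy(count_b)
    if total_entropy == 0:  # both partitions put everything in one group
        return 1.0
    return 2 * mutual / total_entropy


# Two planted communities of 20 nodes, dense inside and sparse between.
labels, edges = stochastic_block_model([20, 20], p_in=0.5, p_out=0.05, seed=42)
inside = sum(labels[i] == labels[j] for i, j in edges)
print(f"{inside} links inside communities, {len(edges) - inside} between")

# The score ignores label names: swapping the two labels still gives 1.0.
swapped = [1 - c for c in labels]
print(normalized_mutual_information(labels, swapped))
```

In a real test, the second argument of `normalized_mutual_information` would be the partition returned by the community detection algorithm under evaluation, compared against the planted `labels`.
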
  20. But sometimes, there is just no pattern to be discovered …
      Look at the modularity of the best solution: if it is below 0.3, there is no real community structure
      Possible causes:
      • On generated data, the intra- and inter-community link probabilities are too close
      • On real networks, the known communities do not influence the structure
      • The approximate solution is too far from the global optimum
      Possible follow-up:
      • NLP + graphs: groups of people discussing a certain topic
  21. Part 2: Finding the leaders
  22. Which node is the most important?
  23. Different ways of measuring node importance
      • A local importance: the degree. Is the node « well connected »? Count its number of direct neighbours
      • A global importance: the betweenness centrality. Is the node a « bridge »? Count the number of shortest paths passing through it
      • A well-known, iterative metric: Google PageRank -> Is the node connected to many important nodes?
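
The three importance measures can be compared on a toy graph with a short standard-library sketch. Assumptions: the graph is unweighted and undirected, stored as a dict of neighbour sets; betweenness uses a breadth-first variant of Brandes' algorithm and counts each node pair once; PageRank uses a plain power iteration with a damping factor of 0.85. Function names and parameters are illustrative, not taken from the talk.

```python
from collections import deque


def degree_centrality(adj):
    """Local importance: number of direct neighbours."""
    return {v: len(neighbours) for v, neighbours in adj.items()}


def betweenness_centrality(adj):
    """Global importance: shortest paths passing through each node."""
    centrality = {v: 0.0 for v in adj}
    for source in adj:
        dist, order = {source: 0}, []
        sigma = {v: 0.0 for v in adj}  # number of shortest paths from source
        sigma[source] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Undirected graph: every pair was visited from both endpoints.
    return {v: c / 2 for v, c in centrality.items()}


def pagerank(adj, damping=0.85, iterations=100):
    """Iterative importance: a node is important if its neighbours are."""
    n = len(adj)
    rank = {v: 1 / n for v in adj}
    for _ in range(iterations):
        new_rank = {v: (1 - damping) / n for v in adj}
        for v, neighbours in adj.items():
            if neighbours:
                share = damping * rank[v] / len(neighbours)
                for w in neighbours:
                    new_rank[w] += share
            else:  # isolated node: spread its rank uniformly
                for w in adj:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank


# Toy graph (an assumption for illustration): two triangles joined by the link 2-3.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(degree_centrality(adjacency))
print(betweenness_centrality(adjacency))
print(pagerank(adjacency))
```

On this graph, nodes 2 and 3 come out on top for all three measures, since they are both the best connected nodes and the only bridge between the two triangles. On larger graphs the rankings can differ, which is what makes combining the metrics informative.
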
  24. Other centrality metrics
  25. Combining centrality metrics & identifying hierarchies: this can provide information on the profiles of nodes
  26. Part 3: And, in practice?
  27. Several tools, depending on your objectives: non-distributed analytical libraries, distributed analytical libraries, databases
  28. Free network data to play with
  29. Demo time, using LinkedIn data
  30. To go further...
      • 3 blog articles (in French):
        – Introduction à une théorie aux applications multiformes (Alberto Guggiola)
        – Détection de communautés : théorie et retour d’expérience (Aurélia Nègre)
        – Comment identifier les rôles stratégiques des influenceurs d'un réseau ? (Ysé Wanono)
      • https://www.quantmetry.com/blog
  31. The video of this presentation will soon be available at dataxday.fr. Thanks to our sponsors. Stay tuned by following @DataXDay