Analyzing Rich-Club Behavior
in Open Source Projects
OpenSym 2019, the 15th International Symposium on Open Collaboration
Skövde, Sweden
Mattia Gasparini¹, Javier Luis Cànovas Izquierdo²,
Robert Clarisò², Marco Brambilla¹, Jordi Cabot²
¹ Politecnico di Milano, ² Universitat Oberta de Catalunya
Introduction
• Use Git and GitHub data to analyze the evolution, success and management of Open Source Software.
• Define developers' behavioral patterns.
• Discover how collaborations between developers work.
Problem Statement
• Analysis of collaboration networks
• Commits, issues and pull requests as sources
• Discover presence of specific collaboration structures: rich-clubs
Rich-club coefficient
• Graph structural property:
It represents the tendency of well-connected nodes (i.e., hubs) to interact with other well-connected nodes.
• Formulation:
$$\phi(k) = \frac{2 E_k}{N_k (N_k - 1)} \qquad \rho(k) = \frac{\phi(k)}{\phi_{\mathrm{random}}(k)}$$
$E_k$: number of edges between nodes of degree greater than or equal to $k$
$N_k$: number of nodes with degree greater than or equal to $k$
$\phi(k)$: rich-club coefficient
$\rho(k)$: normalized rich-club coefficient
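As a minimal sketch of the formulation above, the following Python function computes the rich-club coefficient directly from its definition, counting nodes of degree greater than or equal to k as the slide states. The function name `rich_club_phi` and the use of NetworkX graph objects are assumptions made for illustration; this is not the authors' code.

```python
import networkx as nx


def rich_club_phi(graph, k):
    """Return phi(k) = 2 * E_k / (N_k * (N_k - 1)), or None if undefined.

    Follows the slide's convention: the club contains nodes of degree >= k.
    """
    club = [node for node, degree in graph.degree() if degree >= k]
    n_k = len(club)
    if n_k < 2:
        return None  # fewer than two club members: no possible edge
    e_k = graph.subgraph(club).number_of_edges()
    return 2 * e_k / (n_k * (n_k - 1))


# In a complete graph every pair of club members is connected, so phi is 1.
print(rich_club_phi(nx.complete_graph(5), 4))
# In a star graph the leaves are not linked to each other, so phi is low.
print(rich_club_phi(nx.star_graph(4), 1))
```

The denominator is the number of ordered pairs of club members, which is why the edge count is doubled in the numerator.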
Related Work
• Rich-club phenomenon for a specific project [2],
or for a single FLOSS community [3].
• Study of the presence of a rich-club effect
across the whole GitHub social network [4].
• Analysis of open source communities based on
email exchanges among participants [5].
[2] Weifeng Pan, Bing Li, Yutao Ma, and Jing Liu. 2011. Multi-granularity evolution analysis of software using complex network theory
[3] Guido Conaldi. 2010. Flat for the few, steep for the many: Structural cohesion and Rich-Club effect as measures of hierarchy and control in FLOSS communities
[4] Antonio Lima, Luca Rossi, and Mirco Musolesi. 2014. Coding Together at Scale: GitHub as a Collaborative Social Network
[5] Sergi Valverde and Ricard V. Solé. 2007. Self-organization versus hierarchy in open-source social networks
Case Study
• Top-100 starred projects in 2016 on GitHub
• 926K commits produced by 50K Git users
• 1.3M issue-related events generated by 118K GitHub users
• 280K pull-request-related events generated by 20K GitHub users
Analysis Pipeline
Data Collection & Preprocessing
• Git repositories cloned to extract commit data using Gitana
• GitHub issue and PR activities obtained by querying GHArchive
• Duplicity and clashing problem
Graphs Construction
• Definition of 4 undirected graphs:
a. PR graph
b. Commits graph
c. Issues graph
d. Supergraph (a + b + c)
• Nodes: users
• Edges connect a pair of users if
they interacted on the same
element (issue, PR, file)
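The construction above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function name `build_interaction_graph`, the `(user, element)` record format, and the sample records are hypothetical, and the supergraph is assumed to be the plain union of the nodes and edges of the three graphs.

```python
from collections import defaultdict
from itertools import combinations

import networkx as nx


def build_interaction_graph(records):
    """Build an undirected graph from (user, element) interaction records.

    Two users are connected if they interacted on the same element.
    """
    users_by_element = defaultdict(set)
    for user, element in records:
        users_by_element[element].add(user)

    graph = nx.Graph()
    for users in users_by_element.values():
        graph.add_nodes_from(users)  # keep users with no collaborators
        graph.add_edges_from(combinations(users, 2))
    return graph


# Hypothetical records, one list per source.
pr_graph = build_interaction_graph([("alice", "pr-1"), ("bob", "pr-1")])
commits_graph = build_interaction_graph(
    [("bob", "src/main.py"), ("carol", "src/main.py")]
)
issues_graph = build_interaction_graph([("alice", "issue-7"), ("carol", "issue-7")])

# Assumed reading of "a + b + c": union of nodes and edges.
supergraph = nx.compose_all([pr_graph, commits_graph, issues_graph])
print(sorted(supergraph.edges()))
```

Because `nx.Graph` stores at most one edge per pair of users, repeated interactions between the same two users collapse into a single unweighted edge.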
Graphs Example
[Figure: graphs of the Materialize project: (a) PR graph, (b) commits graph, (c) issues graph, (d) supergraph]
Rich-club Coefficient Calculation
• Calculation using the algorithm implementation included in NetworkX [6]
• Normalized coefficient ρ(k): rich-club effect relevant if ρ(k) > 1
• Discard networks for which randomization fails
[6] https://networkx.github.io/documentation/stable/reference/algorithms/rich_club.html
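A hedged sketch of this step is shown below. It calls `nx.rich_club_coefficient` with `normalized=True`, which randomizes a copy of the graph through double edge swaps, and returns `None` when that step fails so the network can be discarded. The wrapper name `normalized_rich_club`, the parameter name `swaps_per_edge`, and the choice of exceptions to catch are assumptions, not details given in the slides. Note that the NetworkX documentation defines the club as nodes with degree greater than k, so its index may be shifted by one relative to a "greater than or equal to" definition.

```python
import networkx as nx


def normalized_rich_club(graph, swaps_per_edge=100, seed=None):
    """Return {k: rho(k)}, or None when the randomization step fails."""
    simple = nx.Graph(graph)  # work on a simple undirected copy
    simple.remove_edges_from(list(nx.selfloop_edges(simple)))
    try:
        return nx.rich_club_coefficient(
            simple, normalized=True, Q=swaps_per_edge, seed=seed
        )
    except (nx.NetworkXException, ZeroDivisionError):
        # Raised, for example, when the graph is too small for double edge
        # swaps, when the swap budget is exhausted, or when the randomized
        # coefficient is zero for some degree.
        return None


# A three-node path cannot be randomized by double edge swaps.
print(normalized_rich_club(nx.path_graph(3)))
```

Passing a fixed `seed` makes the randomized baseline reproducible, which matters because ρ(k) depends on one random rewiring of the graph.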
Rich-club Coefficient Results
• 60 projects have a defined coefficient for the supergraph.
• Each graph presents a rich-club effect, since ρ(k) > 1 for some k
Materialize [7]: Rich-Club Supergraph Coefficient
Maximum normalized coefficient (k = 49) corresponds to maximum club effect with nodes of degree at least 49.
[7] https://materializecss.com
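Reading off the degree with the maximum normalized coefficient can be sketched as below. The helper name `strongest_rich_club` is hypothetical and the input values are illustrative only; they are not results from the study.

```python
def strongest_rich_club(rho_by_degree):
    """Return (k, rho) for the largest rho(k), or None if no rho(k) > 1."""
    if not rho_by_degree:
        return None
    k_best = max(rho_by_degree, key=rho_by_degree.get)
    rho_best = rho_by_degree[k_best]
    return (k_best, rho_best) if rho_best > 1 else None


# Illustrative values only, not taken from the study.
example = {1: 0.98, 2: 1.05, 3: 1.31, 4: 1.12}
print(strongest_rich_club(example))  # (3, 1.31)
```

Returning `None` when every coefficient is at most 1 keeps "no rich-club effect" distinct from a weak one.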
Materialize: Supergraph
Swift [8]: Rich-Club Supergraph Coefficient
[8] https://swift.org/
Swift: Supergraph
Rich-club Coefficient Results
Maximum coefficient distribution
• Distribution of the maximum rich-club coefficient for each type of graph across the studied projects.
• Mean value around 1 for the coefficients of the issues and commits graphs: weak rich-club presence.
• Mean value around 1.4 for the coefficients of the PR graphs: strong rich-club presence.
Further insights
Multi-club users
• 25 out of 60 projects have a set of users belonging to multiple rich-clubs.
• Distribution of multi-club users across the 25 projects.
• Developers form a community with strong influence at each project level.
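One possible way to identify such users is sketched below. The slides do not spell out the membership rule, so two assumptions are made here: a club consists of the users whose degree in a graph is at least a chosen threshold k, and a multi-club user appears in the clubs of at least two graphs of the same project. The function names and the toy graphs are hypothetical.

```python
from collections import Counter

import networkx as nx


def club_members(graph, k):
    """Users whose degree in `graph` is at least k (assumed club definition)."""
    return {user for user, degree in graph.degree() if degree >= k}


def multi_club_users(clubs_by_graph, min_clubs=2):
    """Users appearing in at least `min_clubs` of the given clubs."""
    counts = Counter(user for club in clubs_by_graph.values() for user in club)
    return {user for user, n in counts.items() if n >= min_clubs}


# Toy graphs, not data from the study.
pr_graph = nx.Graph([("alice", "bob"), ("alice", "carol"), ("bob", "carol")])
issues_graph = nx.Graph(
    [("alice", "dave"), ("alice", "erin"), ("dave", "erin"), ("erin", "bob")]
)
clubs = {
    "pr": club_members(pr_graph, 2),
    "issues": club_members(issues_graph, 2),
}
print(sorted(multi_club_users(clubs)))  # ['alice']
```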
Conclusions
First systematic evaluation of rich-club behavior in open source projects:
• 60% of projects show rich-clubs in the supergraph, mostly with a slight effect.
• Rich-club behavior could undermine the open paradigm, but the phenomenon requires further analysis.
• Strong rich-club presence in PR graphs may be due to the criticality of the activity.
• 25 out of 60 projects have users belonging to multiple rich-clubs.
Future Work
• Weighted rich-club coefficient
• Rich-club effect at module and ecosystem level
• Time dimension to highlight temporal clubs
Questions?

Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 


Analyzing rich club behavior in open source projects

  • 1. Analyzing Rich-Club Behavior in Open Source Projects. OpenSym 2019, the 15th International Symposium on Open Collaboration, Skövde, Sweden. Mattia Gasparini (1), Javier Luis Cànovas Izquierdo (2), Robert Clarisò (2), Marco Brambilla (1), Jordi Cabot (2). Affiliations: (1) Politecnico di Milano, (2) Universitat Oberta de la Catalunya
  • 2. Introduction • Git and GitHub data to analyze the evolution, success and management of Open Source Software. • Define developers' behavioral patterns. • Discover how collaborations between developers work.
  • 3. Problem Statement • Analysis of collaboration networks • Commits, issues and pull requests as sources • Discover the presence of specific collaboration structures: rich-clubs
  • 4. Rich-club coefficient • Graph structural property: it represents the tendency of well-connected nodes (i.e., hubs) to interact with other well-connected nodes. • Formulation: $\phi(k) = \frac{2E_k}{N_k(N_k - 1)}$, $\rho(k) = \frac{\phi(k)}{\phi_{\mathrm{random}}(k)}$, where $E_k$ is the number of edges between nodes of degree greater than or equal to $k$, $N_k$ is the number of nodes with degree greater than or equal to $k$, $\phi(k)$ is the rich-club coefficient, and $\rho(k)$ is the normalized rich-club coefficient.
  • 5. Related Work • Rich-club phenomenon for a specific project [2], or for a single FLOSS community [3]. • Study of the presence of a rich-club effect across the whole GitHub social network [4]. • Analysis of open source communities exploiting email exchanges among participants [5]. [2] Weifeng Pan, Bing Li, Yutao Ma, and Jing Liu. 2011. Multi-granularity evolution analysis of software using complex network theory. [3] Guido Conaldi. 2010. Flat for the few, steep for the many: Structural cohesion and Rich-Club effect as measures of hierarchy and control in FLOSS communities. [4] Antonio Lima, Luca Rossi, and Mirco Musolesi. 2014. Coding Together at Scale: GitHub as a Collaborative Social Network. [5] Sergi Valverde and Ricard V. Solé. 2007. Self-organization versus hierarchy in open-source social networks.
  • 6. Case Study • Top-100 starred projects in 2016 on GitHub • 926K commits produced by 50K Git users • 1.3M issue-related events generated by 118K GitHub users • 280K pull-request-related events generated by 20K GitHub users
  • 8. Data Collection & Preprocessing • Commit data: Git repositories cloned and processed with Gitana • Issue and PR activity: GitHub events obtained by querying GHArchive • Duplicity and clashing problem in user identities
  • 9. Graphs Construction • Definition of 4 undirected graphs: (a) PR graph, (b) commits graph, (c) issues graph, (d) supergraph (a + b + c) • Nodes: users • Edges connect a pair of users if they interacted on the same element (issue, PR, file)
  • 10. Graphs Example [Figure: graphs for the Materialize project: (a) PR graph, (b) commits graph, (c) issues graph, (d) supergraph]
  • 11. Rich-club Coefficient Calculation • Calculation using the algorithm implementation included in NetworkX [6] • Normalized coefficient $\rho(k)$: the rich-club effect is relevant if $\rho(k) > 1$ • Discard networks for which randomization fails. [6] https://networkx.github.io/documentation/stable/reference/algorithms/rich_club.html
  • 12. Rich-club Coefficient Results • 60 projects have a defined coefficient for the supergraph. • Each graph presents a rich-club effect, since $\rho(k) > 1$ for some $k$.
  • 13. Materialize [7]: Rich-Club Supergraph Coefficient • The maximum normalized coefficient (k = 49) corresponds to the maximum club effect, among nodes of degree at least 49. [7] https://materializecss.com
  • 18. Further insights: maximum coefficient distribution • Distribution of the maximum rich-club coefficient for each type of graph across the studied projects. • Mean value around 1 for issues and commits graph coefficients: weak rich-club presence. • Mean value around 1.4 for PR graph coefficients: strong rich-club presence.
  • 19. Further insights: multi-club users • 25 of the 60 projects have a set of users belonging to multiple rich-clubs. • Distribution of multi-club users across the 25 projects. • These developers form a community with strong influence at every level of the project.
  • 20. Conclusions First systematic evaluation of rich-club behavior in open source projects: • 60% of projects show rich-clubs in the supergraph, mostly with a slight effect. • Rich-club behavior could undermine the open paradigm, but the phenomenon requires further analysis. • The strong rich-club presence in PR graphs may be due to the criticality of the activity. • 25 of the 60 projects have users belonging to multiple rich-clubs.
  • 21. Future Work • Weighted rich-club coefficient • Rich-club effect at module and ecosystem level • Time dimension to highlight temporal clubs
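
The formulation on slide 4 can be illustrated with a minimal Python sketch. It is not the authors' code (slide 11 states that the NetworkX implementation was used); the function name and the toy edge list are hypothetical, and the club is taken as nodes with degree greater than or equal to k, as on the slide. NetworkX documents its rich_club_coefficient over nodes with degree strictly greater than k, so its values may be shifted by one degree relative to this sketch.

```python
def rich_club_coefficient(edges, k):
    """Return phi(k) = 2*E_k / (N_k*(N_k - 1)), or None when undefined.

    The club is the set of nodes with degree >= k (slide 4 convention).
    """
    # Build adjacency sets for an undirected simple graph (self-loops ignored).
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    club = {node for node, nbrs in adjacency.items() if len(nbrs) >= k}
    n_k = len(club)
    if n_k < 2:
        return None  # fewer than two club members: the ratio is undefined

    # Each edge inside the club is seen from both endpoints, hence // 2.
    e_k = sum(1 for u in club for v in adjacency[u] if v in club) // 2
    return 2 * e_k / (n_k * (n_k - 1))


# Hypothetical toy network: three hubs forming a triangle, two leaves per hub.
EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("a", "l1"), ("a", "l2"),
    ("b", "l3"), ("b", "l4"),
    ("c", "l5"), ("c", "l6"),
]

print(rich_club_coefficient(EDGES, 1))  # 0.25 (9 edges among 9 nodes)
print(rich_club_coefficient(EDGES, 2))  # 1.0  (the three hubs form a clique)
print(rich_club_coefficient(EDGES, 5))  # None (no node has degree >= 5)
```

The normalized coefficient $\rho(k)$ would additionally divide this value by the same quantity computed on a degree-preserving randomization of the graph, which this sketch does not attempt.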

Editor's Notes

  1. GitHub is the most popular service to develop and maintain open source software. Each user interacts with many other users in the project development process (commits, issues, PRs), defining collaboration networks. Studying collaboration networks helps in discovering properties and behaviors that influence the development, management and success of an OSS project.
  2. Developers collaborate mostly with the same fixed subset of other important colleagues, rather than spreading their cooperation across all members of the team.
  3. Formally, it can be measured by the so-called rich-club coefficient ϕ(k). Intuitively, ϕ(k) measures how far the set of nodes with degree at least k is from being a complete subgraph. The value of ϕ(k) ranges from 0 (no edges among these nodes) to 1 (a clique), with higher values showing a stronger rich-club behavior in the network. Because ϕ(k) tends to increase with k even in random networks, a normalized coefficient has been introduced in the literature: ϕ(k) is divided by the coefficient calculated for a random network with the same degree distribution as the original one.
  4. The presence or absence of rich-clubs in open source projects has not been studied in a systematic way, nor on a dataset as large as the one that GitHub can now provide.
  5. Clashing: the same name used by different users. Duplicity: different names for the same user. Solution: use the commit SHA to associate Git commits with GitHub users (if still present).
  6. Two users are connected in the PR graph if they commented/interacted on the same PR…
  7. The rich-club coefficient is calculated for each project’s supergraph to obtain a global view of the effect. The maximum value for each project is shown: each of the 60 graphs presents a rich-club behavior, even if most of them have values only slightly higher than 1. For this reason, we want to better understand the correspondence between the coefficient and the actual graphs.
  8. The first example that we take is the materialize repository: the rich-club coefficient is presented as a function of node degree. A rich-club behavior can be seen over a range of degrees, with a peak at k = 49, which should correspond to groups of nodes with degree at least 49 connected to each other.
  9. This seems to go against the open source paradigm: the project is “owned” by a few users. It was established in 2014 by a team of 4 developers and has 3,853 commits and 252 contributors. Nevertheless, the project only has two top contributors (more than 1,000 commits), who belong to the original team, and no frequent contributors.
  10. Mixed behavior: the coefficient is slightly above 1, then dramatically lower. The overall intuition is that the graph does not present rich-clubs.
  11. It was publicly announced by Apple in 2014 and was later open sourced in December 2015. Currently, the project has more than 84k commits and 674 contributors, with 14 top contributors (more than 1,000 commits) and 44 frequent contributors (between 100 and 1,000 commits). Remarkably, 4 of the top contributors and 21 of the frequent contributors do not belong to Apple according to their GitHub profile. This is a sign that the project has successfully attracted and retained external talent.
  12. This table presents the 10 projects with the highest coefficient for the supergraph. Along with them, the coefficient for the other kinds of graphs is calculated when possible. In fact, these other graphs can also “hide” other club structures.
  13. Maximum coefficient distribution for each kind of graph, as a further insight. The blue line is the one already discussed. The green and orange lines show the maximum coefficient distribution for commits and issues: the density has a peak at 1, meaning that most of the graphs do not present strong rich-clubs. The red line has its peak around 1.4: most of the projects present evident rich-club structures. This behavior could be related to the fact that PR is the most critical level in open-source software development and a few trustworthy developers are in charge of most of the tasks.
  14. We also focused our attention on the users: 25 of the 60 projects (about 42%) have users that belong to multiple clubs. The distribution presents the number of users shared across all the projects’ clubs: this means that, on average, 7 developers are in the PR club as well as in the commits and issues clubs. These developers form a sub-community inside the project that has strong influence at all of the project’s levels.
  15. As the rich-club phenomenon is quite complex and its application to OSS communities is relatively new, plenty of further work can be done. First, we want to apply the weighted version of the coefficient to check whether other patterns arise. Second, we want to extend the analysis to the module and ecosystem levels. Third, we want to introduce a time variable: in this work the graphs are built using the entire data as a 1-year snapshot, but it is possible to build monthly graphs and check whether temporal clubs show up.
  16. This concludes the presentation. Thank you for your attention.
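
As an illustration of notes 6 and 14, the following Python sketch builds an undirected collaboration graph by linking users who interacted on the same element, then looks for users who fall in the high-degree group of more than one graph. It is a sketch under stated assumptions, not the authors' pipeline: the record format, the user names, the function names and the degree threshold k = 2 are all hypothetical, and a club is approximated here as the set of users with degree at least k.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(records):
    """Link every pair of users who touched the same element.

    records: iterable of (user, element_id) pairs (hypothetical format).
    Returns a set of (u, v) tuples with u < v, so each edge appears once.
    """
    users_by_element = defaultdict(set)
    for user, element in records:
        users_by_element[element].add(user)

    edges = set()
    for users in users_by_element.values():
        edges.update(combinations(sorted(users), 2))
    return edges


def club_members(edges, k):
    """Users whose degree in the graph is at least k (assumed club rule)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {user for user, d in degree.items() if d >= k}


# Hypothetical toy activity for one project.
pr_records = [("ana", "pr1"), ("bob", "pr1"), ("cy", "pr1"),
              ("ana", "pr2"), ("dee", "pr2")]
issue_records = [("ana", "i1"), ("bob", "i1"), ("ana", "i2"),
                 ("eve", "i2"), ("bob", "i3"), ("eve", "i3")]

pr_edges = build_edges(pr_records)
issue_edges = build_edges(issue_records)

# Users present in both the PR club and the issues club.
multi_club = club_members(pr_edges, 2) & club_members(issue_edges, 2)
print(sorted(multi_club))  # ['ana', 'bob']
```

In a real analysis the threshold for each graph would presumably come from the degree at which the normalized coefficient peaks, rather than from a fixed value as assumed here.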