2. Motivation
The motivation for this project was to analyse user connections and gain more
insight into the GitHub network.
The goal is to find clusters in the GitHub network based on the repositories that users
have collaborated on. A cluster is a group of similar things or people occurring or
positioned together.
4. Data
Data source is GitHub Archive.
Processed around 1 TB of data.
Dataset includes Users, Followers, Repositories and Events.
Events from the last 6 months were taken into consideration.
~2 million users had a push event to some repositories.
~16 million push events happened to repositories.
~112 million events were processed in total.
5. Processing Data
Filtered push events from the entire set of events, keeping the mapping of user to
repository:
User → Repository
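The filtering step can be sketched in plain Scala as below. This is only a minimal illustration, not the project's actual job: the `Event` record and its field names are hypothetical simplifications, and the `"PushEvent"` type string is an assumption about how push events are labelled in the archive.

```scala
// Hypothetical, simplified event record; real archive records carry many more fields.
final case class Event(eventType: String, user: String, repo: String)

object PushFilter {
  // Keep only push events and reduce each one to a distinct (user, repository) pair.
  // "PushEvent" is an assumed label for push events.
  def userRepoPairs(events: Seq[Event]): Seq[(String, String)] =
    events
      .filter(_.eventType == "PushEvent")
      .map(e => (e.user, e.repo))
      .distinct
}
```

On a Spark RDD the same `filter`, `map` and `distinct` calls apply, so the shape of the step stays the same at scale.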
Transformed the User → Repository mapping into a User → User mapping:
User → User
Using this, I created a graph in GraphX where users are the vertices and
collaboration on a shared repository is the edge.
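One way to turn user-to-repository pairs into user-to-user edges is to group users by repository and emit every pair of co-contributors. The sketch below uses plain Scala collections and hypothetical names (`CollaborationEdges`, `userUserEdges`). It assumes one undirected edge per pair of users regardless of how many repositories they share, which may differ from what the project did.

```scala
object CollaborationEdges {
  // Input: distinct (user, repository) pairs.
  // Output: one undirected edge per pair of users sharing at least one repository,
  // stored with the lexicographically smaller user first so (a, b) and (b, a) coincide.
  def userUserEdges(pairs: Seq[(String, String)]): Set[(String, String)] =
    pairs
      .groupBy { case (_, repo) => repo }
      .values
      .flatMap { group =>
        val users = group.map(_._1).distinct.sorted
        // All 2-element combinations of the repository's contributors.
        users.combinations(2).map { case Seq(a, b) => (a, b) }
      }
      .toSet
}
```

The number of pairs grows quadratically with the number of contributors to a repository, so very popular repositories dominate the cost of this step.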
6. Graph Structure
Vertices 1, 2, 3 and 4 are connected based on their
contributions to repositories.
The graph answers the following queries:
❏ Find the clusters in the graph using
Connected Components.
❏ Compute the top contributor using PageRank.
The data structures that hold the vertices and edges look
like this:
val vertexRDD: RDD[(Long, (String, List[String]))]
val edgeRDD: RDD[Edge[Long]]
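As a rough sketch of how these two RDDs feed the queries above, the example below builds a toy graph and runs GraphX's `connectedComponents()` and `pageRank(tol)`. The object and method names, the toy data, and the readings of the vertex attribute (login plus repository list) and edge attribute (count of shared repositories) are assumptions for illustration, since the slides do not define them.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphQueries {
  type UserGraph = Graph[(String, List[String]), Long]

  // Toy data only: four users, three of them linked through shared repositories.
  def toyGraph(sc: SparkContext): UserGraph = {
    val vertexRDD: RDD[(VertexId, (String, List[String]))] = sc.parallelize(Seq(
      (1L, ("user1", List("repoA"))),
      (2L, ("user2", List("repoA", "repoB"))),
      (3L, ("user3", List("repoB"))),
      (4L, ("user4", List("repoC")))
    ))
    // Edge attribute assumed to be the number of shared repositories.
    val edgeRDD: RDD[Edge[Long]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1L),
      Edge(2L, 3L, 1L)
    ))
    Graph(vertexRDD, edgeRDD)
  }

  // Connected components labels each vertex with the smallest vertex id in its
  // component; counting vertices per label gives the cluster sizes.
  def clusterSizes(graph: UserGraph): Map[VertexId, Long] =
    graph.connectedComponents().vertices
      .map { case (_, componentId) => (componentId, 1L) }
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // PageRank run until the per-vertex change drops below the tolerance 0.001;
  // returns the login and rank of the highest-ranked vertex.
  def topContributor(graph: UserGraph): (String, Double) =
    graph.pageRank(0.001).vertices
      .join(graph.vertices)
      .map { case (_, (rank, (login, _))) => (login, rank) }
      .reduce((a, b) => if (a._2 >= b._2) a else b)
}
```

GraphX's PageRank follows edge direction, so for an undirected collaboration graph it may be preferable to add each edge in both directions before ranking.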
7. Data Insights
❏ Total unique vertices are close to 600K from the last 6 months' events.
❏ Processed around 1.5 million collaboration edges between users.
❏ The average user is connected to 6 other people, indicating that the average vertex in the
graph is connected to only a small fraction of the other nodes.
❏ One user is connected to 1,788 other users.
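Figures such as the average and maximum number of connections can be read off the vertex degrees. The sketch below shows one possible way to compute them with GraphX's `degrees`; `DegreeStats` and `degreeStats` are hypothetical names and the edges are toy data, not the project's.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object DegreeStats {
  // Returns (average degree, maximum degree).
  // graph.degrees only lists vertices with at least one edge, so isolated
  // vertices are not part of the average.
  def degreeStats[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): (Double, Int) = {
    val degrees = graph.degrees.map { case (_, d) => d }.cache()
    (degrees.map(_.toDouble).mean(), degrees.max())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("degree-sketch").setMaster("local[*]"))
    // Toy collaboration edges; vertex attributes default to the placeholder 0.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1L), Edge(1L, 3L, 1L), Edge(1L, 4L, 1L)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    val (avg, max) = degreeStats(graph)
    println(s"average degree = $avg, max degree = $max")
    sc.stop()
  }
}
```

As a sanity check, the average degree of an undirected graph is 2E/V. With roughly 1.5 million edges and 600K vertices that gives about 5, which is in line with the reported figure given that both counts are rounded.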
8. Challenges
Unstructured data; the schema changed between years.
Spark ran out of memory when processing the data. Optimized the jobs to run
efficiently. Divided the job processing into 2 stages, reducing the processing time for
the graph.
9. About Me
Akshara Chaturvedi
Full Stack Developer
Past : Zendesk, Allscript, Aberdeen Group.
MS Computer Science, Syracuse University
Git: https://github.com/zenachaturvedi
LinkedIn: https://www.linkedin.com/in/aksharachaturvedi