GitConnect

Analysing GitHub Connections
Akshara Chaturvedi

Motivation
Networks are all around us, and our position and connections within them
strongly influence the way we think and act. In this project I wanted to analyze
GitHub data.
By analyzing GitHub connections, I identify sources of influence and see what
clusters these influences form.

Demo
https://drive.google.com/file/d/0B24RNwmWYSdNTEozZTZhV2FrNk0/view

Data
The data source is GitHub Archive.
Processed around 1 TB of data.
The dataset includes users, followers, repositories and events.
Events from the last 6 months were taken into consideration.
~2 million users had a push event to some repository.
~16 million push events happened to repositories.
~112 million total events were processed.

Processing Data
Filtered push events from the entire set of events, keeping the mapping of user to
repository:
User → Repository
From the User → Repository mapping, constructed the graph mapping:
User → User
Using this I created a graph in GraphX where users are the vertices and
collaboration on a repository is the edge.
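
The two mapping steps above can be sketched on plain Scala collections. This is only an illustration of the idea, not the author's Spark job: the `Event` record, its field names, and the `CollaborationEdges` object are hypothetical, and the real pipeline ran on Spark rather than on in-memory sequences.

```scala
// Hypothetical, simplified event record; real archive events carry many more fields.
final case class Event(eventType: String, user: String, repo: String)

object CollaborationEdges {
  /** Keep only push events and reduce them to distinct (user, repository) pairs. */
  def userRepoPairs(events: Seq[Event]): Set[(String, String)] =
    events.collect { case Event("PushEvent", user, repo) => (user, repo) }.toSet

  /** Link every pair of users who pushed to the same repository. */
  def userUserEdges(pairs: Set[(String, String)]): Set[(String, String)] =
    pairs.groupBy(_._2).values.flatMap { group =>
      val users = group.map(_._1).toSeq.sorted
      // Emit each unordered pair once, smaller login first.
      for {
        i <- users.indices
        j <- (i + 1) until users.length
      } yield (users(i), users(j))
    }.toSet
}
```

Emitting each unordered pair once keeps the edge set small; a repository with many contributors still produces a quadratic number of pairs, which is one place a job like this can run short of memory.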

Graph Structure
Vertices 1, 2, 3, 4 are connected based on
contributions to the same repositories.
The graph answers the following queries:
❏ Find the clusters in the graph using Connected
Components.
❏ Compute the top contributor using PageRank.
The data structures holding the vertices and edges look like this:
val vertexRDD: RDD[(Long, (String, List[String]))]
val edgeRDD: RDD[Edge[Long]]
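
To make the two queries concrete, here is a minimal sketch of both on plain Scala collections. It is not the GraphX implementation used in the project (GraphX exposes these as `connectedComponents()` and `pageRank(...)` on a graph); the `GraphQueries` object, its function names, and the default parameters are assumptions chosen for illustration. Edges are treated as undirected, and vertices with no edges simply lose their rank mass in this simplified PageRank.

```scala
object GraphQueries {
  type VertexId = Long

  /** Label each vertex with the smallest vertex id in its component. */
  def connectedComponents(vertices: Set[VertexId],
                          edges: Seq[(VertexId, VertexId)]): Map[VertexId, VertexId] = {
    var label = vertices.map(v => v -> v).toMap
    var changed = true
    // Propagate the smaller label across every edge until nothing changes.
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        val m = math.min(label(a), label(b))
        if (label(a) != m || label(b) != m) {
          label = label.updated(a, m).updated(b, m)
          changed = true
        }
      }
    }
    label
  }

  /** Simple iterative PageRank over undirected edges. */
  def pageRank(vertices: Set[VertexId],
               edges: Seq[(VertexId, VertexId)],
               iterations: Int = 20,
               damping: Double = 0.85): Map[VertexId, Double] = {
    val neighbours: Map[VertexId, Seq[VertexId]] =
      edges.flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, es) => v -> es.map(_._2) }
    val n = vertices.size.toDouble
    var rank = vertices.map(v => v -> (1.0 / n)).toMap
    for (_ <- 1 to iterations) {
      // Each vertex splits its current rank evenly among its neighbours.
      val contributions = for {
        (v, ns) <- neighbours.toSeq
        nb <- ns
      } yield nb -> (rank(v) / ns.size)
      val received = contributions.groupBy(_._1)
        .map { case (v, cs) => v -> cs.map(_._2).sum }
      rank = vertices.map { v =>
        v -> ((1 - damping) / n + damping * received.getOrElse(v, 0.0))
      }.toMap
    }
    rank
  }
}
```

For example, `GraphQueries.pageRank(vertices, edges).maxBy(_._2)._1` would give the id of the highest-ranked user in this sketch.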

Schema
user_ranks: user_id, user_rank
component_vertex: vertex_id, component_id
component_lookup: component_id, vertex_list list<text>
Sample PushEvent:
{
  "id": "3390141329",
  "type": "PushEvent",
  "actor": {
    "id": 5126316,
    "login": "wjfwzzc",
    "gravatar_id": "",
    "url": "https://api.github.com/users/wjfwzzc",
    "avatar_url": "https://avatars.githubusercontent.com/u/5126316?"
  },
  "repo": {
    "id": 46919799,
    "name": "wjfwzzc/Kaggle_Script",
    "url": "https://api.github.com/repos/wjfwzzc/Kaggle_Script"
  },
  "payload": {
    "push_id": 883618927,
    "size": 1,
    "distinct_size": 1,
    "ref": "refs/heads/master",
    "head": "88efd7dbaf6f5392e08fd25b910c395649cce9a3",
    "before": "9e8c5025706526abcc61ef8da9062dc42481c36a",
    "commits": [
      {
        "sha": "88efd7dbaf6f5392e08fd25b910c395649cce9a3",
        "author": {"email": "wjfwzzc@gmail.com", "name": "wjfwzzc"},
        "message": "upgrade README.md",
        "distinct": true,
        "url": "https://api.github.com/repos/wjfwzzc/Kaggle_Script/commits/88efd7dbaf6f5392e08fd25b910c395649cce9a3"
      }
    ]
  },
  "public": true,
  "created_at": "2015-11-30T09:02:36Z"
}

Data Insights
❏ Total unique vertices are close to 600K from the last 6 months of events.
❏ Processed around 1.5 million collaboration edges between users.
❏ The average user is connected to 6 other people, indicating that the average vertex in the
graph is connected to only a small fraction of the other nodes.
❏ One user is connected to 1,788 users.
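
Figures like the average and the largest number of connections can be derived from the edge list alone. The sketch below shows one way to do that on plain Scala collections; the `DegreeStats` object and its function names are hypothetical, and it counts only users that appear in at least one edge.

```scala
object DegreeStats {
  /** Number of distinct neighbours per user, from undirected user-user edges. */
  def degrees(edges: Set[(String, String)]): Map[String, Int] =
    edges.toSeq
      .flatMap { case (a, b) => Seq(a -> b, b -> a) } // count the edge for both endpoints
      .groupBy(_._1)
      .map { case (user, links) => user -> links.map(_._2).distinct.size }

  /** Mean number of neighbours over all users that have at least one edge. */
  def averageDegree(deg: Map[String, Int]): Double =
    if (deg.isEmpty) 0.0 else deg.values.sum.toDouble / deg.size
}
```

Because every undirected edge contributes to two users, the average degree is roughly twice the edge count divided by the vertex count.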

Challenges
Unstructured data; the schema changed between years.
Spark ran out of memory when processing the data, so I optimized the jobs to run
efficiently and divided the job processing into 2 stages, reducing the processing time for
the graph.

About Me
Akshara Chaturvedi
Full Stack Developer
Past : Zendesk, Allscript, Aberdeen Group.
MS Computer Science, Syracuse University
LinkedIn: https://www.linkedin.com/in/aksharachaturvedi
