gitConnect
Analysing GitHub Connections
Akshara Chaturvedi
Motivation
Networks are all around us, and our position and connections within these networks
strongly influence the way we think and act. In this project I wanted to analyze
GitHub data.
By analyzing GitHub connections I can identify sources of influence and see what
clusters these influences form.
Demo
https://drive.google.com/file/d/0B24RNwmWYSdNTEozZTZhV2FrNk0/view
Pipeline
Data
Data source is GitHub Archive.
Processed around 1 TB of data.
Dataset includes Users, Followers, Repositories and Events.
Events from the last 6 months were taken into consideration.
~2 million users had a push event to some repository.
~16 million push events happened to repositories.
~112 million total events were processed.
Processing Data
Filtered push events from the entire set of events, keeping the mapping of user to
repository:
User → Repository
From the User → Repository mapping, constructed the graph:
User ↔ User
Using this, I created a graph in GraphX where users are the vertices and
collaboration on a repository is an edge.
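The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: plain Scala collections stand in for Spark RDDs, and the `Event` record, the object name, and the function names are hypothetical, not taken from the project.

```scala
// Plain Scala collections stand in for Spark RDDs in this sketch.
object CollaborationEdges {
  // Hypothetical, simplified event record: only the fields this step needs.
  final case class Event(eventType: String, user: String, repo: String)

  // Step 1: keep only push events and reduce them to distinct
  // (user, repository) pairs.
  def userRepoPairs(events: Seq[Event]): Set[(String, String)] =
    events
      .filter(_.eventType == "PushEvent")
      .map(e => (e.user, e.repo))
      .toSet

  // Step 2: for every repository, link each pair of users who pushed to it.
  // Each undirected edge is stored once, with the smaller user name first.
  def userUserEdges(pairs: Set[(String, String)]): Set[(String, String)] =
    pairs
      .groupBy { case (_, repo) => repo }
      .values
      .flatMap { group =>
        val users = group.map(_._1).toSeq.sorted
        for {
          i <- users.indices
          j <- (i + 1) until users.size
        } yield (users(i), users(j))
      }
      .toSet
}
```

A repository with k contributors produces k(k-1)/2 edges, so very popular repositories dominate the edge count in a construction like this one.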
Graph Structure
Vertices 1, 2, 3 and 4 are connected based on their
contributions to the same repositories.
The graph answers the following queries:
❏ Find the clusters in the graph using Connected
Components.
❏ Compute the top contributor using PageRank.
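To make the two queries concrete, here is a small sketch of both algorithms on an in-memory graph. It is a textbook formulation written with plain Scala collections and is not the GraphX implementation, which may differ in details such as rank normalisation. The object name, function names, and the default parameters (20 iterations, damping 0.85) are assumptions made for illustration.

```scala
object GraphQueries {
  type Edge = (Long, Long)

  // Connected components by repeated minimum-label propagation: every vertex
  // ends up labelled with the smallest vertex id in its component.
  // Assumes both endpoints of every edge are contained in `vertices`.
  def connectedComponents(vertices: Set[Long], edges: Set[Edge]): Map[Long, Long] = {
    var label: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        val m = math.min(label(a), label(b))
        if (label(a) != m || label(b) != m) {
          label = label.updated(a, m).updated(b, m)
          changed = true
        }
      }
    }
    label
  }

  // PageRank by power iteration. Each collaboration edge is treated as a
  // link in both directions. Isolated vertices only receive the base rank.
  def pageRank(
      vertices: Set[Long],
      edges: Set[Edge],
      iterations: Int = 20,
      damping: Double = 0.85
  ): Map[Long, Double] = {
    val neighbours: Map[Long, Set[Long]] =
      edges.toSeq
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, ps) => v -> ps.map(_._2).toSet }
    val n = vertices.size.toDouble
    var rank: Map[Long, Double] = vertices.map(v => v -> (1.0 / n)).toMap
    for (_ <- 1 to iterations) {
      rank = vertices.map { v =>
        // Sum the rank each neighbour shares equally among its own neighbours.
        val incoming = neighbours
          .getOrElse(v, Set.empty[Long])
          .toSeq
          .map(u => rank(u) / neighbours(u).size)
          .sum
        v -> ((1 - damping) / n + damping * incoming)
      }.toMap
    }
    rank
  }
}
```

With this formulation, the user with the highest rank is the one best connected to other well-connected collaborators, which is one way to read "top contributor".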
The data structures holding the vertices and edges look like this:
val vertexRDD: RDD[(Long, (String, List[String]))]
val edgeRDD: RDD[Edge[Long]]
Schema
component_vertex: vertex_id, component_id
component_lookup: component_id, vertex_list list<text>
user_ranks: user_id, user_rank
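As a rough illustration of how a connected-components result could be shaped into rows for component_vertex and component_lookup, consider the sketch below. The column types are assumptions: ids are taken to be `Long`, and `vertex_list` is assumed to hold vertex ids rendered as text, which the slides do not state. The object and function names are hypothetical.

```scala
object ComponentTables {
  // Rows for component_vertex: (vertex_id, component_id).
  def componentVertexRows(componentOf: Map[Long, Long]): Seq[(Long, Long)] =
    componentOf.toSeq.sorted

  // Rows for component_lookup: (component_id, vertex_list), where the list
  // holds the members of the component as text (assumed representation).
  def componentLookupRows(componentOf: Map[Long, Long]): Seq[(Long, List[String])] =
    componentOf.toSeq
      .groupBy { case (_, component) => component }
      .map { case (component, members) =>
        component -> members.map(_._1).sorted.map(_.toString).toList
      }
      .toSeq
      .sortBy(_._1)
}
```

Keeping both tables allows lookups in either direction: from a user to their cluster, and from a cluster to all of its members.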
Sample PushEvent:
{
  "id": "3390141329",
  "type": "PushEvent",
  "actor": {
    "id": 5126316,
    "login": "wjfwzzc",
    "gravatar_id": "",
    "url": "https://api.github.com/users/wjfwzzc",
    "avatar_url": "https://avatars.githubusercontent.com/u/5126316?"
  },
  "repo": {
    "id": 46919799,
    "name": "wjfwzzc/Kaggle_Script",
    "url": "https://api.github.com/repos/wjfwzzc/Kaggle_Script"
  },
  "payload": {
    "push_id": 883618927,
    "size": 1,
    "distinct_size": 1,
    "ref": "refs/heads/master",
    "head": "88efd7dbaf6f5392e08fd25b910c395649cce9a3",
    "before": "9e8c5025706526abcc61ef8da9062dc42481c36a",
    "commits": [
      {
        "sha": "88efd7dbaf6f5392e08fd25b910c395649cce9a3",
        "author": {"email": "wjfwzzc@gmail.com", "name": "wjfwzzc"},
        "message": "upgrade README.md",
        "distinct": true,
        "url": "https://api.github.com/repos/wjfwzzc/Kaggle_Script/commits/88efd7dbaf6f5392e08fd25b910c395649cce9a3"
      }
    ]
  },
  "public": true,
  "created_at": "2015-11-30T09:02:36Z"
}
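Only three fields of a record like this are needed for the user-to-repository mapping: the event `type`, the actor's `login`, and the repository `name`. The sketch below pulls them out with regular expressions so that it runs without extra dependencies. That choice is a simplification for illustration; a real pipeline would more likely use a JSON parser, and the object and function names here are hypothetical.

```scala
import scala.util.matching.Regex

object PushEventFields {
  // First "type" field in the record, which is the event type in the
  // layout of the sample record.
  private val EventType: Regex =
    "\"type\"\\s*:\\s*\"([^\"]+)\"".r
  // First "login" inside the "actor" object.
  private val ActorLogin: Regex =
    "\"actor\"\\s*:\\s*\\{[^}]*?\"login\"\\s*:\\s*\"([^\"]+)\"".r
  // First "name" inside the "repo" object.
  private val RepoName: Regex =
    "\"repo\"\\s*:\\s*\\{[^}]*?\"name\"\\s*:\\s*\"([^\"]+)\"".r

  private def first(pattern: Regex, record: String): Option[String] =
    pattern.findFirstMatchIn(record).map(_.group(1))

  // Returns Some((user, repository)) for push events and None for any
  // other event type or for records missing one of the fields.
  def userRepo(record: String): Option[(String, String)] =
    for {
      eventType <- first(EventType, record) if eventType == "PushEvent"
      user      <- first(ActorLogin, record)
      repo      <- first(RepoName, record)
    } yield (user, repo)
}
```

Applied to the sample record, `PushEventFields.userRepo` yields the pair of `wjfwzzc` and `wjfwzzc/Kaggle_Script`.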
Data Insights
❏ Total unique vertices are close to 600K from the last 6 months’ events.
❏ Processed around 1.5 million collaboration edges between users.
❏ The average user is connected to 6 other people, indicating that the average vertex in the
graph is connected to only a small fraction of the other nodes.
❏ One user is connected to 1,788 users.
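Figures such as the average number of connections and the most connected user come from vertex degrees. A minimal sketch of that computation, using plain Scala collections in place of the GraphX graph and hypothetical names throughout, could look like this:

```scala
object DegreeStats {
  // Number of distinct neighbours per vertex in an undirected edge set.
  // Vertices without any edge do not appear in the result.
  def degrees(edges: Set[(Long, Long)]): Map[Long, Int] =
    edges.toSeq
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet.size }

  // Mean degree over the vertices that have at least one edge.
  def averageDegree(deg: Map[Long, Int]): Double =
    if (deg.isEmpty) 0.0 else deg.values.sum.toDouble / deg.size

  // Vertex with the highest degree; ties are broken arbitrarily.
  def mostConnected(deg: Map[Long, Int]): Option[(Long, Int)] =
    if (deg.isEmpty) None else Some(deg.maxBy(_._2))
}
```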
Challenges
Unstructured data, with a schema that changed across years.
Spark ran out of memory when processing the data. Optimized the jobs to run
efficiently and divided the job processing into 2 stages, reducing the processing
time for the graph.
About Me
Akshara Chaturvedi
Full Stack Developer
Past : Zendesk, Allscript, Aberdeen Group.
MS Computer Science, Syracuse University
LinkedIn: https://www.linkedin.com/in/aksharachaturvedi
