gitConnect
Analysing GitHub Connections
Akshara Chaturvedi
Motivation
This project analyses the connections between GitHub users to gain more insight into the GitHub network.
The goal is to find clusters in the GitHub network based on the repositories that users have collaborated on. A cluster is a group of similar things or people occurring or positioned together.
Pipeline
Data
Data source is Git Archive.
Processed around 1 TB of data.
The dataset includes Users, Followers, Repositories and Events.
Only events from the last 6 months were considered.
~2 million users had a push event to at least one repository.
~16 million push events to repositories.
~112 million events processed in total.
Processing Data
Filtered the push events out of the full set of events, keeping the mapping of user to repository:
User → Repository
Converted the User → Repository mapping into a User → User mapping:
User → User
From this mapping I created a graph in GraphX in which users are the vertices and collaboration on a shared repository is the edge.
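The User → Repository to User → User step can be sketched in plain Scala, without Spark. This is only an illustration under assumptions: the object and method names (`CollaborationEdges`, `userPairs`) and the `(userId, repoName)` input shape are hypothetical, not taken from the project's code, and the real job ran on Spark rather than on in-memory collections.

```scala
object CollaborationEdges {
  // Hypothetical input: one (userId, repoName) pair per push event.
  // Output: one undirected edge (smaller id first) for every pair of
  // users who pushed to the same repository.
  def userPairs(pushes: Seq[(Long, String)]): Set[(Long, Long)] =
    pushes.distinct
      .groupBy { case (_, repo) => repo } // repo -> its (user, repo) pairs
      .values
      .flatMap { pairs =>
        val users = pairs.map(_._1).distinct.sorted
        // every unordered pair of contributors to this repository
        for {
          i <- users.indices
          j <- (i + 1) until users.size
        } yield (users(i), users(j))
      }
      .toSet
}
```

Note that a repository with n contributors yields n(n-1)/2 edges, so very popular repositories dominate the edge count in this kind of expansion.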
Graph Structure
Vertices 1, 2, 3 and 4 are connected based on their contributions to shared repositories.
The graph answers the following queries:
❏ Find the clusters in the graph using Connected Components.
❏ Compute the top contributor using PageRank.
The data structures that hold the vertices and edges look like this:
val vertexRDD: RDD[(Long, (String, List[String]))]
val edgeRDD: RDD[Edge[Long]]
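To make the two queries concrete, here is a small plain-Scala stand-in that runs on an in-memory edge set. GraphX provides both as built-in graph operators; the code below is not that implementation, only a textbook sketch. The names (`GraphQueries`, `connectedComponents`, `pageRank`), the damping factor of 0.85 and the fixed 20 iterations are assumptions made for illustration. Component labels follow the convention of using the smallest vertex id in each component.

```scala
object GraphQueries {
  type Edge = (Long, Long)

  private def verticesOf(edges: Set[Edge]): Set[Long] =
    edges.flatMap { case (a, b) => Set(a, b) }

  // Label every vertex with the smallest vertex id in its component by
  // repeatedly pushing the smaller label across each edge until stable.
  def connectedComponents(edges: Set[Edge]): Map[Long, Long] = {
    var label: Map[Long, Long] = verticesOf(edges).map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        val m = math.min(label(a), label(b))
        if (label(a) != m || label(b) != m) {
          label = label + (a -> m) + (b -> m)
          changed = true
        }
      }
    }
    label
  }

  // Textbook PageRank on an undirected graph; ranks sum to 1.
  def pageRank(edges: Set[Edge],
               iterations: Int = 20,
               damping: Double = 0.85): Map[Long, Double] = {
    val vertices = verticesOf(edges)
    val n = vertices.size
    // Each collaboration edge counts in both directions.
    val neighbours: Map[Long, Set[Long]] =
      edges.toSeq
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, ps) => v -> ps.map(_._2).toSet }
    var rank: Map[Long, Double] = vertices.map(v => v -> 1.0 / n).toMap
    for (_ <- 1 to iterations) {
      rank = vertices.map { v =>
        // toSeq so that equal contributions are not collapsed by the Set
        val received =
          neighbours(v).toSeq.map(u => rank(u) / neighbours(u).size).sum
        v -> ((1 - damping) / n + damping * received)
      }.toMap
    }
    rank
  }
}
```

On the four-vertex example from the slide, with edges 1-2, 1-3, 2-3 and 3-4, all vertices fall into one component and vertex 3, which has the most collaborators, receives the highest rank.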
Data Insights
❏ Close to 600K unique vertices from the last 6 months of events.
❏ Around 1.5 million collaboration edges between users.
❏ The average user is connected to 6 other users, so a typical vertex is connected to only a tiny fraction of the other nodes.
❏ One user is connected to 1,788 other users.
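As a rough sanity check on these figures: in an undirected graph the average degree is 2E / V, and with the rounded counts above that gives 2 × 1.5M / 600K = 5, in the same range as the reported average of 6 (the gap presumably comes from rounding in the slide figures). The sketch below shows how such degree statistics could be derived from an edge set in plain Scala; the names `DegreeStats`, `degrees`, `averageDegree` and `maxDegree` are hypothetical.

```scala
object DegreeStats {
  // Number of edges touching each vertex in an undirected edge set.
  def degrees(edges: Set[(Long, Long)]): Map[Long, Int] =
    edges.toSeq
      .flatMap { case (a, b) => Seq(a, b) }
      .groupBy(identity)
      .map { case (v, occurrences) => v -> occurrences.size }

  // Average degree over all vertices that appear in at least one edge.
  def averageDegree(edges: Set[(Long, Long)]): Double = {
    val d = degrees(edges)
    if (d.isEmpty) 0.0 else d.values.sum.toDouble / d.size
  }

  // The best-connected vertex and its degree, if the graph has any edges.
  def maxDegree(edges: Set[(Long, Long)]): Option[(Long, Int)] = {
    val d = degrees(edges)
    if (d.isEmpty) None else Some(d.maxBy(_._2))
  }
}
```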
Challenges
Unstructured data; the schema changed between years.
Spark ran out of memory while processing the data. Optimized the jobs to run efficiently and split the job processing into 2 stages, which reduced the processing time for the graph.
About Me
Akshara Chaturvedi
Full Stack Developer
Past: Zendesk, Allscript, Aberdeen Group.
MS Computer Science, Syracuse University
Git: https://github.com/zenachaturvedi
LinkedIn: https://www.linkedin.com/in/aksharachaturvedi
Schema
user_rank
cc_adjlist
component_lookup
