SlideShare a Scribd company logo
1 of 15
Git-Influencer
- Discover Social influencers in Github Network
Catherine Shen
Data Engineering project
http://bit.ly/Git-Influencer
http://bit.ly/Git-Influncer-github
Who can help me find out a list of users to
follow and learn from?
Hi, I am learning Scala.
Could you give me some
influential Scala users on github
to follow, so I can receive their
daily update and learn from their
code?
Demo
Discoverinfluencers
Github network
Run PageRank on the whole github network
Estimate of how important the user is by
counting the number and quality of users who
follow him/her.
TechStack
● Change S3 to HDFS for spark read directly
● Two step cleanup: History & newcome data
Data Cleaning Challenge - amount and speed
● 3TB history JSON event data
● 1GB new data come every hour
Get highly bias result when run pagerank on the whole network
Scala
Python Golang
Shell
Java
JavaScript
All language
User Classification Challenge
Inspired by linkedin “Endorsement by experts”
Only users who use this language before
will be considered into the language network calculation.
Java_user
user_a
user_b
User_Following User_Followed
user_a user_b
Java_User_Following Java_User_Followed
user_a user_b
Solution: Mapping users with languages.
Find Language User Group
Look into Details - Find clues from Event JSON files
Java_user
user_a
user_b
Clues
Commit level
Repo level
Catherine Shen
● Love go to conferences and
learn new technology
● Part-time photographer
● Love cats (including octocat)
Clues
ETL Challenge
history data cleaning: schema change
● Change of schema and field name , 2015
● Test of data availability - sample - find break point
Useful schema extraction: from various schemas
● User type: user, company, org combine other data sources
● User availability: active account, deleted account
● Zero following trap: filter out these users
Further Improvement
● Explore HDFS data storage efficiency - Parquet
● Try different classification metric for discovering more user topics
● Use more Graph analysis algorithms in GraphX
Github users numbers varies in each language , pagerank results show most influencers use
JavaScript which is highly bias.
JavaScript
User Classification Challenge
Heavy calculation for graph analysis algorithms, deal with ring issues and 0 following trap.
PageRank Algorithms
Github Archive contains JSON
encoded events as reported by
the GitHub API since 2011.
Available as a public dataset
around 3TB in total.
Github Resources

More Related Content

Similar to Git influencer - PPT

Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
PyCon Korea - Real World Graphene
PyCon Korea - Real World GraphenePyCon Korea - Real World Graphene
PyCon Korea - Real World GrapheneMarcin Gębala
 
GraphQL Advanced
GraphQL AdvancedGraphQL Advanced
GraphQL AdvancedLeanIX GmbH
 
Introduction to GraphQL
Introduction to GraphQLIntroduction to GraphQL
Introduction to GraphQLRodrigo Prates
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupMárton Kodok
 
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 Apache AGE and the synergy effect in the combination of Postgres and NoSQL Apache AGE and the synergy effect in the combination of Postgres and NoSQL
Apache AGE and the synergy effect in the combination of Postgres and NoSQLEDB
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big dataTrieu Nguyen
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataTrieu Nguyen
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & MarquezJulien Le Dem
 
Data Sharing, Distribution and Updating Using Social Coding Community Github ...
Data Sharing, Distribution and Updating Using Social Coding Community Github ...Data Sharing, Distribution and Updating Using Social Coding Community Github ...
Data Sharing, Distribution and Updating Using Social Coding Community Github ...Universität Salzburg
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSAPRBETTER
 
Graph processing at scale using spark & graph frames
Graph processing at scale using spark & graph framesGraph processing at scale using spark & graph frames
Graph processing at scale using spark & graph framesRon Barabash
 
Sprint 44 review
Sprint 44 reviewSprint 44 review
Sprint 44 reviewManageIQ
 
WikiLoop: Big tech's Open Knowledge contributions
WikiLoop: Big tech's Open Knowledge contributionsWikiLoop: Big tech's Open Knowledge contributions
WikiLoop: Big tech's Open Knowledge contributionsAll Things Open
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsJohann Schleier-Smith
 
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample TrackingBruce Kozuma
 
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...NETWAYS
 
ExSchema - ICSM'13
ExSchema - ICSM'13ExSchema - ICSM'13
ExSchema - ICSM'13jccastrejon
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 

Similar to Git influencer - PPT (20)

Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
PyCon Korea - Real World Graphene
PyCon Korea - Real World GraphenePyCon Korea - Real World Graphene
PyCon Korea - Real World Graphene
 
GraphQL Advanced
GraphQL AdvancedGraphQL Advanced
GraphQL Advanced
 
Introduction to GraphQL
Introduction to GraphQLIntroduction to GraphQL
Introduction to GraphQL
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 Apache AGE and the synergy effect in the combination of Postgres and NoSQL Apache AGE and the synergy effect in the combination of Postgres and NoSQL
Apache AGE and the synergy effect in the combination of Postgres and NoSQL
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
Stackalytics
StackalyticsStackalytics
Stackalytics
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Data Sharing, Distribution and Updating Using Social Coding Community Github ...
Data Sharing, Distribution and Updating Using Social Coding Community Github ...Data Sharing, Distribution and Updating Using Social Coding Community Github ...
Data Sharing, Distribution and Updating Using Social Coding Community Github ...
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
 
Graph processing at scale using spark & graph frames
Graph processing at scale using spark & graph framesGraph processing at scale using spark & graph frames
Graph processing at scale using spark & graph frames
 
Sprint 44 review
Sprint 44 reviewSprint 44 review
Sprint 44 review
 
WikiLoop: Big tech's Open Knowledge contributions
WikiLoop: Big tech's Open Knowledge contributionsWikiLoop: Big tech's Open Knowledge contributions
WikiLoop: Big tech's Open Knowledge contributions
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time Applications
 
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
 
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
OSMC 2022 | Unifying Observability Weaving Prometheus, Jaeger, and Open Sourc...
 
ExSchema - ICSM'13
ExSchema - ICSM'13ExSchema - ICSM'13
ExSchema - ICSM'13
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 

Recently uploaded

The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Hiroshi SHIBATA
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...marcuskenyatta275
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGDSC PJATK
 

Recently uploaded (20)

The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 

Git influencer - PPT

  • 1. Git-Influencer - Discover Social influencers in Github Network Catherine Shen Data Engineering project http://bit.ly/Git-Influencer http://bit.ly/Git-Influncer-github
  • 2. Who can help me find out a list of users to follow and learn from? Hi, I am learning Scala. Could you give me some influential Scala users on github to follow, so I can receive their daily update and learn from their code?
  • 4. Discoverinfluencers Github network Run PageRank on the whole github network Estimate of how important the user is by counting the number and quality of users who follow him/her.
  • 6. ● Change S3 to HDFS for spark read directly ● Two step cleanup: History & newcome data Data Cleaning Challenge - amount and speed ● 3TB history JSON event data ● 1GB new data come every hour
  • 7. Get highly bias result when run pagerank on the whole network Scala Python Golang Shell Java JavaScript All language User Classification Challenge
  • 8. Inspired by linkedin “Endorsement by experts” Only users who use this language before will be considered into the language network calculation. Java_user user_a user_b User_Following User_Followed user_a user_b Java_User_Following Java_User_Followed user_a user_b Solution: Mapping users with languages.
  • 9. Find Language User Group Look into Details - Find clues from Event JSON files Java_user user_a user_b Clues Commit level Repo level
  • 10. Catherine Shen ● Love go to conferences and learn new technology ● Part-time photographer ● Love cats (including octocat)
  • 11. Clues ETL Challenge history data cleaning: schema change ● Change of schema and field name , 2015 ● Test of data availability - sample - find break point Useful schema extraction: from various schemas ● User type: user, company, org combine other data sources ● User availability: active account, deleted account ● Zero following trap: filter out these users
  • 12. Further Improvement ● Explore HDFS data storage efficiency - Parquet ● Try different classification metric for discovering more user topics ● Use more Graph analysis algorithms in GraphX
  • 13. Github users numbers varies in each language , pagerank results show most influencers use JavaScript which is highly bias. JavaScript User Classification Challenge
  • 14. Heavy calculation for graph analysis algorithms, deal with ring issues and 0 following trap. PageRank Algorithms
  • 15. Github Archive contains JSON encoded events as reported by the GitHub API since 2011. Available as a public dataset around 3TB in total. Github Resources