1. Git-Influencer
- Discover social influencers in the GitHub network
Catherine Shen
Data Engineering project
http://bit.ly/Git-Influencer
http://bit.ly/Git-Influncer-github
2. Who can help me find a list of users to
follow and learn from?
Hi, I am learning Scala.
Could you give me some
influential Scala users on GitHub
to follow, so I can receive their
daily updates and learn from their
code?
6. Data Cleaning Challenge - amount and speed
● 3 TB of historical JSON event data
● 1 GB of new data arrives every hour
● Switched from S3 to HDFS so Spark can read the data directly
● Two-step cleanup: historical data and newly arriving data
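A minimal sketch of the per-file cleanup step, assuming each hourly file is gzip-compressed, newline-delimited JSON (an assumption, not stated on the slide). The function name `read_events` and the sample payload are hypothetical. Malformed lines are counted and skipped so that one bad record does not stop the batch.

```python
import gzip
import io
import json


def read_events(raw_gzip_bytes):
    """Parse gzip-compressed, newline-delimited JSON; skip malformed lines."""
    events = []
    skipped = 0
    with gzip.open(io.BytesIO(raw_gzip_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1  # count the bad record and move on
    return events, skipped


# Hypothetical sample: two valid events and one broken line.
sample = gzip.compress(b'{"type": "WatchEvent"}\n{broken\n{"type": "PushEvent"}\n')
events, skipped = read_events(sample)
print(len(events), skipped)
```

The same function could serve both cleanup steps: once over the historical files, then on each newly arrived hourly file.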
7. Running PageRank on the whole network gives a highly biased result.
[Figure: language groups within the network: Scala, Python, Golang, Shell, Java, JavaScript, all languages]
User Classification Challenge
8. Inspired by LinkedIn's “Endorsement by experts”
Only users who have used a language before
are included in that language's network calculation.
Java_user
user_a
user_b
User_Following User_Followed
user_a user_b
Java_User_Following Java_User_Followed
user_a user_b
Solution: map users to languages.
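The tables above can be read as a filter on the follow graph. The sketch below assumes a follow edge is kept only when both the follower and the followed user belong to the language group. The helper name `language_subgraph` and the sample users other than user_a and user_b are hypothetical.

```python
def language_subgraph(follow_edges, language_users):
    """Keep follow edges whose follower and followed user are both in the group."""
    members = set(language_users)
    return [
        (follower, followed)
        for follower, followed in follow_edges
        if follower in members and followed in members
    ]


# Hypothetical data mirroring the Java_user and User_Following tables.
java_users = {"user_a", "user_b"}
follows = [
    ("user_a", "user_b"),  # both are Java users: kept
    ("user_a", "user_c"),  # user_c is not a Java user: dropped
    ("user_d", "user_b"),  # user_d is not a Java user: dropped
]
java_follows = language_subgraph(follows, java_users)
print(java_follows)
```

On a cluster the same idea would probably be a join between the follow table and the language user table rather than an in-memory set lookup.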
9. Find Language User Group
Look into details - find clues in the event JSON files
Java_user
user_a
user_b
Clues
Commit level
Repo level
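A sketch of how a repo-level clue might turn into language user groups. It assumes each event has already been reduced to a record with an `actor` login and a `repo_language` value. Both field names are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def users_by_language(events):
    """Group actor logins by the language of the repository they acted on."""
    groups = defaultdict(set)
    for event in events:
        actor = event.get("actor")
        language = event.get("repo_language")
        if actor and language:  # skip events that carry no usable clue
            groups[language].add(actor)
    return dict(groups)


# Hypothetical, already simplified event records.
sample_events = [
    {"actor": "user_a", "repo_language": "Java"},
    {"actor": "user_b", "repo_language": "Java"},
    {"actor": "user_a", "repo_language": "Scala"},
    {"actor": "user_c", "repo_language": None},
]
groups = users_by_language(sample_events)
print(sorted(groups["Java"]))
```

A user can land in several groups, which matches the idea of one network per language.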
10. Catherine Shen
● Loves going to conferences and
learning new technology
● Part-time photographer
● Loves cats (including Octocat)
11. Clues
ETL Challenge
Historical data cleaning: schema change
● Schema and field names changed in 2015
● Tested data availability on samples to find the break point
Useful schema extraction: from various schemas
● User type: user, company, or org (combined with other data sources)
● User availability: active account, deleted account
● Zero-following trap: filter out these users
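One way to handle the schema change is to normalize each field through a small accessor. The sketch below assumes older records store `actor` as a plain login string and newer records store it as an object with a `login` key. That shape is an assumption for illustration, and `actor_login` is a hypothetical helper.

```python
import json


def actor_login(event):
    """Return the actor login from either assumed schema shape, or None."""
    actor = event.get("actor")
    if isinstance(actor, dict):  # assumed newer shape: {"login": "..."}
        return actor.get("login")
    if isinstance(actor, str):  # assumed older shape: plain login string
        return actor or None
    return None  # field missing or of an unexpected type


# Hypothetical sample lines, one per assumed schema.
old_line = '{"type": "WatchEvent", "actor": "user_a"}'
new_line = '{"type": "WatchEvent", "actor": {"login": "user_b"}}'
logins = [actor_login(json.loads(line)) for line in (old_line, new_line)]
print(logins)
```

Running such an accessor over samples from different dates is also a cheap way to locate the break point: it is the date where the first branch starts to match.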
12. Further Improvement
● Explore HDFS data storage efficiency with Parquet
● Try different classification metrics to discover more user topics
● Use more graph analysis algorithms in GraphX
13. The number of GitHub users varies by language; PageRank results show that most influencers use
JavaScript, which is highly biased.
JavaScript
User Classification Challenge
14. Graph analysis algorithms are computationally heavy and must deal with ring (cycle) issues and the zero-following trap.
PageRank Algorithms
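For illustration, here is a small pure-Python PageRank by power iteration. It is not the project's GraphX job. It assumes an edge (follower, followed) passes rank from the follower to the followed user. Unlike the filtering approach mentioned in the slides, this sketch keeps users who follow nobody and spreads their rank evenly over all users, which is a common alternative. The damping factor of 0.85 and the sample edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over (follower, followed) edges."""
    nodes = sorted({user for edge in edges for user in edge})
    out_links = {user: [] for user in nodes}
    for follower, followed in edges:
        out_links[follower].append(followed)

    count = len(nodes)
    rank = {user: 1.0 / count for user in nodes}
    for _ in range(iterations):
        # Rank held by users who follow nobody is shared by everyone.
        dangling = sum(rank[user] for user in nodes if not out_links[user])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {user: base for user in nodes}
        for follower in nodes:
            targets = out_links[follower]
            if targets:
                share = damping * rank[follower] / len(targets)
                for followed in targets:
                    new_rank[followed] += share
        rank = new_rank
    return rank


# Hypothetical graph: user_a and user_b form a ring, user_d follows nobody.
sample_edges = [
    ("user_a", "user_b"),
    ("user_b", "user_a"),
    ("user_c", "user_b"),
    ("user_c", "user_d"),
]
ranks = pagerank(sample_edges)
print(max(ranks, key=ranks.get))
```

The damping term keeps rank from circulating forever inside the user_a/user_b ring, and the dangling redistribution keeps the total rank at 1.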
15. GitHub Archive contains JSON-encoded
events as reported by
the GitHub API since 2011.
It is available as a public dataset
of around 3 TB in total.
GitHub Resources