Harnessing Twitter to Support
Serendipitous Learning of Developers
Abhishek Sharma1, Yuan Tian1, Agus Sulistya1, David Lo1
and Aiko Fallas Yamashita2
1School of Information Systems,
Singapore Management University
2Oslo and Akershus University, Norway
24th IEEE International Conference on Software Analysis,
Evolution, and Reengineering (SANER 2017)
Developer Challenges?
• Keeping up to date a big challenge
(Storey et al. TSE’16)
Why Twitter for Learning
• Keeping up to date a big challenge
(Storey et al. TSE’16)
• Twitter is used by software
developers to share important
information (Tian et al. MSR’12)
• Twitter enables serendipitous
(pleasant and undirected) learning
for developers (Singer et al.
ICSE’14)
Challenges
• Finding useful articles not easy
• Developers need to
– identify many relevant Twitter users to follow
– sift through a large number of tweets/URLs
(Singer et al. ICSE’14)
• Too much information can make learning using Twitter an
unpleasant experience
This Study
• Can we automatically extract popular and relevant URLs
from Twitter for developers?
• In this work, we:
– propose 14 features to characterize a URL
– evaluate a supervised and an unsupervised approach to
recommend URLs harvested from Twitter
Methodology (1): Collecting Seed Data
• Get a list of seed Twitter users
(http://www.noop.nl/2009/02/twitter-top-100-for-softwaredevelopers.htm)
• Get a larger set of people who
– follow (or are followed by) >= 5 seed users
• This results in 85,171 Twitter users
• Collect tweets generated by these users over a 1-month
period (Nov ’15)
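The expansion step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `expand_seed_users`, the `(follower, followee)` edge format, and the choice to exclude seed users from the result are all assumptions made for the example.

```python
def expand_seed_users(seed_users, follows, min_links=5):
    """Return non-seed users linked to at least `min_links` seed users.

    `follows` is a hypothetical iterable of (follower, followee) pairs.
    A user is "linked" to a seed if they follow it or are followed by it.
    """
    seeds = set(seed_users)
    linked_seeds = {}  # user -> set of seed users they are linked to
    for follower, followee in follows:
        if followee in seeds and follower not in seeds:
            linked_seeds.setdefault(follower, set()).add(followee)
        if follower in seeds and followee not in seeds:
            linked_seeds.setdefault(followee, set()).add(follower)
    return {user for user, s in linked_seeds.items() if len(s) >= min_links}
```

Each seed is counted once per user, even if the link exists in both directions.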
Methodology (2): URL Extraction
• Find tweets that contain the keyword “java” (2,104 tweets)
• Find tweets that contain a URL (1,606 tweets)
• Extract URLs, e.g.
– http://ow.ly/UIxwS
– http://bit.ly/1OFsZSj
– http://goo.gl/IGxGlo
– https://t.co/ryPI3
• Expand short URLs (770 expanded URLs)
• Resolve duplicate/broken URLs (577 URLs)
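The keyword filter, URL extraction, and de-duplication steps could look roughly like this. It is a sketch under assumptions: the regular expression, the function name `extract_urls`, and the plain substring match on the keyword are illustrative choices, and the short-URL expansion step is left out because it needs network access.

```python
import re

# Simple pattern for http(s) links; an assumption, not the pattern used in the study.
URL_PATTERN = re.compile(r"https?://\S+")

def extract_urls(tweets, keyword="java"):
    """Keep tweets mentioning `keyword`, then collect their distinct URLs in order."""
    seen = set()
    urls = []
    for text in tweets:
        if keyword not in text.lower():
            continue  # tweet does not mention the keyword
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,;:!?)")  # drop trailing punctuation
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

A substring match also accepts words such as "javascript", so a real pipeline would probably need stricter tokenization.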
Methodology (3): Feature Extraction
• 14 features extracted
– Content
– Popularity
– Network
Methodology (3): Feature Extraction
• Content
– cosine similarity between the keyword and
• tweet text (CosSimT)
• user profile text (CosSimP)
• webpage text (CosSimW)
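A cosine-similarity content feature can be sketched as below. The slides do not say how the text is tokenized or weighted, so the raw term-frequency vectors and the helper names `term_vector` and `cosine_similarity` are assumptions for illustration only.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Lower-cased bag-of-words term counts (assumed weighting: raw frequency)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between the term vectors of texts `a` and `b`."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with the keyword
    return dot / (norm_a * norm_b)
```

Under this sketch, CosSimT would be `cosine_similarity("java", tweet_text)`, with CosSimP and CosSimW computed the same way on profile and webpage text.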
Methodology (3): Feature Extraction
– Network
• estimate importance of users through
– centrality scores
– PageRank
– Popularity
• number of times the tweets containing the URL were
– retweeted
– liked
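One way to estimate user importance from the follower network is PageRank by power iteration. The sketch below is illustrative only: the edge direction (rank flows from follower to followee), the damping factor 0.85, and the fixed iteration count are assumptions, not settings reported in the slides.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """PageRank over (follower, followee) pairs; returns {user: score}."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over all users.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / len(nodes) if nodes else 0.0
        share = damping * dangling / len(nodes) if nodes else 0.0
        new_rank = {n: base + share for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank
```

A URL-level feature could then aggregate the scores of the users who tweeted that URL; how the study aggregates them is not stated here.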
Methodology (4): Labelling the URLs
• Labelled independently by
– 2 persons, each with more than 4 years of professional
programming experience in Java
– one a PhD student, the other a Research Engineer
• Both persons sat together to resolve disagreements
• URLs assigned relevance scores from 0 to 3
Methodology (5): Recommendation
• Unsupervised: Borda Count
– assigns ranking points for each feature score of a
URL and then combines them
• Supervised: Learning to Rank
– learns a ranking function based on a weighted sum
of the features of a URL
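The unsupervised Borda Count idea can be sketched as follows. This is one common variant, assumed for illustration: for each feature the lowest-scoring URL gets 0 points and the highest gets n-1, and ties are broken by input order. The function name `borda_rank` and the input layout are hypothetical.

```python
def borda_rank(feature_scores):
    """Rank URLs by summed Borda points.

    `feature_scores` maps each URL to a list of feature scores
    (same length for every URL, higher score is better).
    """
    if not feature_scores:
        return []
    urls = list(feature_scores)
    n_features = len(feature_scores[urls[0]])
    points = {url: 0 for url in urls}
    for f in range(n_features):
        # Ascending order: position 0 is the worst URL for this feature.
        ordered = sorted(urls, key=lambda u: feature_scores[u][f])
        for position, url in enumerate(ordered):
            points[url] += position
    return sorted(urls, key=lambda u: points[u], reverse=True)
```

Because only rank positions are combined, features on very different scales (similarities in 0-1, retweet counts in the hundreds) need no normalization.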
RQ1: Effectiveness of Our Approach
[Bar chart: NDCG score by recommendation approach: Supervised 0.832, Unsupervised 0.719]
• NDCG (Normalized Discounted Cumulative Gain)
• Measures the capability to place more relevant URLs at
top ranks
• Scores range from 0 to 1; a score closer to 1 indicates
better performance
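NDCG for a single ranked list can be computed as in the sketch below. The slides do not state which gain formula was used, so the exponential gain `2**rel - 1` with a log2 discount is an assumption; the input is the list of relevance labels (0-3) in the order the URLs were recommended.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance labels in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal else 0.0
```

A perfect ordering such as `[3, 2, 1, 0]` scores 1, and pushing highly relevant URLs further down the list lowers the score.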
RQ2: Sensitivity of Supervised Approach to Training Data
[Bar chart: NDCG score by k (no. of folds used)]
k:     10     9      8      7      6      5      4      3      2
NDCG:  0.832  0.825  0.833  0.845  0.834  0.842  0.837  0.847  0.843
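Assuming that k here refers to standard k-fold cross-validation (the slide does not spell this out), each model is trained on (k-1)/k of the labelled URLs, so a smaller k means less training data. A minimal sketch of such a split, with the hypothetical helper `k_fold_splits`:

```python
def k_fold_splits(items, k):
    """Yield (train, test) lists for k-fold cross-validation over `items`."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With 10 items, k=5 trains on 8 items per round while k=2 trains on only 5.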
Threats to Validity
• Subjectivity in the labelling process
– asked 2 persons to label independently
• Only 1 domain
– evaluate more domains in future work
• Suitability of evaluation metric
– used NDCG, which is a standard metric
Conclusion and Future Work
• Supervised and unsupervised approaches
show promise in recommending URLs
• Future work:
– Automatically categorize the recommended
URLs
– Build an automated system to recommend
relevant URLs
Feedback/Advice
• What additional resources can we
consider for mining URLs?
• How to infer developer interests
automatically?
Thank you!
