A data driven journey through research on software engineering

  1. A DATA-DRIVEN JOURNEY THROUGH RESEARCH ON SOFTWARE ENGINEERING
     Mario Sangiorgio
  2. MOTIVATION
     Getting a better idea of what’s going on in the software engineering research community through a quantitative approach
  3. RELATED WORKS
     • C. Ghezzi - Keynote at ICSE 2008: Reflections on 40+ years of software engineering research and beyond
     • L. Briand - Keynote at ICSM 2011: Useful software engineering research: leading a double agent life
     • D. Rosenblum - Keynote at ASE 2012: Whither software engineering research?
  4. SUBJECTS OF OUR STUDY
     • researchers
     • research topics
     • affiliations
     • geographical areas
  5. DATA
  6. ACADEMIC LITERATURE
  7. SELECTED PUBLICATIONS
     • REPRESENTATIVENESS
     • AUTHORITATIVENESS
  8. DATA SOURCES
     • Articles published and their authors: COMPLETE XML DATABASE
     • Citations, authors and affiliation details: APIs
  9. COLLECTED DATA

     Venue      Number of papers   From   To
     TSE        3043               1975   2012
     TOSEM      295                1992   2012
     ICSE       2907               1976   2012
     ASE        1116               1997   2012
     ESEC/FSE   416                1987   2012
     TOTAL      7777               1975   2012

     9865 researchers, 278794 citations
  10. ANALYSIS
  11. AUTHOR ANALYSIS
      Who published the most? Are there sub-communities?
  12. MOST PROLIFIC AUTHORS
      Number of papers in parentheses.

      Software Engineering   ICSE             ASE            ESEC/FSE         TSE            TOSEM
      Basili (60)            Boehm (28)       Xie (24)       Clarke (8)       Basili (33)    Notkin (13)
      Notkin (56)            Basili (26)      Grundy (18)    D. Jackson (8)   Briand (26)    Rothermel (8)
      Kramer (49)            Osterweil (23)   Hosking (16)   Ernst (7)        Weyuker (18)   Roman (6)
      Harrold (46)           Kramer (21)      Egyed (16)     Notkin (7)       Knight (17)    Wolf (6)
      Xie (46)               Notkin (21)      Lo (16)        Uchitel (7)      Kramer (16)    Harrold (6)
  13. SUB-COMMUNITY DETECTION
      For each venue we consider the most prolific authors.
      We compute the set similarity between all pairs of venues:

      $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
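
The set similarity on slide 13 is the Jaccard index. A minimal Python sketch of the pairwise computation follows. The function names `jaccard` and `pairwise_similarity` are hypothetical. The slides do not say how many top authors are considered per venue, so taking the five authors listed on slide 12 is an assumption made only for this illustration.

```python
def jaccard(a, b):
    """Jaccard index J(A, B) = |A intersect B| / |A union B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as having similarity 0.0 (a convention chosen here).
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(top_authors):
    """Map each unordered pair of venues to the Jaccard index of their author sets."""
    venues = sorted(top_authors)
    return {
        (v, w): jaccard(top_authors[v], top_authors[w])
        for i, v in enumerate(venues)
        for w in venues[i + 1:]
    }


# Assumption: the top-5 lists from slide 12 stand in for "the most prolific authors".
top_authors = {
    "ICSE": ["Boehm", "Basili", "Osterweil", "Kramer", "Notkin"],
    "TSE": ["Basili", "Briand", "Weyuker", "Knight", "Kramer"],
}
similarities = pairwise_similarity(top_authors)
# ICSE and TSE share Basili and Kramer out of 8 distinct authors, so J = 2 / 8 = 0.25.
```

A matrix of `1 - J` values could then be handed to a multidimensional scaling routine to obtain a two-dimensional layout of the venues, which is what the axis labels on the next slide suggest was done.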
  14. SUB-COMMUNITIES
      [Figure: two-dimensional MDS plot of the venues FSE, TOSEM, ASE, TSE and ICSE; axes mds[,1] and mds[,2]]
  15. TOPIC ANALYSIS
      What is the topic of a paper? What are the hot topics in software engineering? How have they evolved?
  16. CITATION NETWORK
      Papers in the dataset
  17. CITATION NETWORK
      Internal citations
  18. CITATION NETWORK
      Citations from specific venues
      Complete citations
  19. EXAMPLE
      What is the topic of the yellow paper?
  20. EXAMPLE
      What is the topic of the yellow paper?

      Topic     Direct citations
      Topic A   2
      Topic B   0
      General   1

      What is the topic of the general paper?
  21. EXAMPLE
      What is the topic of the yellow paper?

      Topic     Direct citations
      Topic A   2
      Topic B   1
      General   1

      Topic profile
      Topic A   66%
      Topic B   33%
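
The example on slides 19 to 21 can be sketched in Python as follows. This is one possible reading, not necessarily the author's implementation. It assumes that a citation to a paper with a known topic counts 1 for that topic, and that a citation to a general paper is resolved by recursively using that paper's own topic profile. The names `topic_profile`, `citations` and `topic_of` are hypothetical.

```python
from collections import Counter


def topic_profile(paper, citations, topic_of, _seen=frozenset()):
    """Return {topic: fraction} for `paper`, derived from the papers it cites.

    citations: dict mapping a paper to the list of papers it cites.
    topic_of:  dict mapping a paper to its topic; papers missing from it
               are treated as general papers (an assumption of this sketch).
    """
    seen = _seen | {paper}
    weights = Counter()
    for cited in citations.get(paper, []):
        topic = topic_of.get(cited)
        if topic is not None:
            # Paper with a known topic: one full vote for that topic.
            weights[topic] += 1.0
        elif cited not in seen:
            # General paper: spread its single vote over its own profile.
            for t, share in topic_profile(cited, citations, topic_of, seen).items():
                weights[t] += share
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total else {}


# The yellow paper cites two Topic A papers and one general paper,
# and the general paper cites one Topic B paper.
citations = {"yellow": ["a1", "a2", "general"], "general": ["b1"]}
topic_of = {"a1": "Topic A", "a2": "Topic A", "b1": "Topic B"}
profile = topic_profile("yellow", citations, topic_of)
# Topic A gets 2 of 3 votes and Topic B gets 1 of 3, matching the 66% / 33% on the slide.
```

The `_seen` set only guards against citation cycles among general papers; a general paper that cites nothing classifiable simply contributes no votes.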
  22. SOFTWARE ENGINEERING TOPICS

      Topic                    Fraction of papers
      Programming Languages    9.34%
      Formal Methods           8.49%
      Software Reliability     6.13%
      Distributed Systems      5.96%
      Software Maintenance     5.92%
      Testing                  4.64%
      Software Quality         4.53%
      Models                   4.36%
      Software Architectures   4.36%
  23. TOPICS IN THE ’70S

      Topic                    Fraction of papers
      Programming Languages    16.71%
      Performance              7.95%
      Operating Systems        7.29%
      Database Systems         6.84%
      Formal Methods           6.65%
      Software Architectures   6.14%
      Knowledge Engineering    5.69%
      Distributed Systems      4.94%
      Software Maintenance     4.18%

      Slide annotations: "By far the most represented"; "Topics from other fields"
  24. TOPICS IN THE ’80S

      Topic                     Fraction of papers
      Programming Languages     10.48%
      Distributed Systems       9.30%
      Knowledge Engineering     8.47%
      Software Reliability      6.68%
      Formal Methods            6.51%
      Information Systems       5.55%
      Software Maintenance      5.04%
      Models                    4.35%
      Artificial Intelligence   3.74%

      Slide annotations: "Significant rise"; "Other fields, related to distributed systems"; "Not only code"
  25. TOPICS IN THE ’90S

      Topic                    Fraction of papers
      Formal Methods           8.29%
      Programming Languages    8.13%
      Distributed Systems      6.80%
      Software Maintenance     6.55%
      Software Architectures   5.34%
      Software Quality         4.80%
      Knowledge Engineering    4.67%
      Models                   4.65%
      Information Systems      4.40%

      Slide annotations: "Change of the most published topic"; "Focus on software quality"
  26. TOPICS IN THE 2000S

      Topic                    Fraction of papers
      Formal Methods           9.93%
      Programming Languages    8.37%
      Testing                  6.86%
      Software Maintenance     6.58%
      Software Reliability     6.22%
      Software Quality         5.72%
      Models                   4.80%
      Empirical Studies        4.76%
      Software Architectures   4.38%

      Slide annotations: "Still a lot of emphasis on software quality"; "Analysis of open source repositories"
  27. NEED FOR A FINER ANALYSIS
      Topics change constantly, not once in a decade.
      SOLUTION: sliding window instead of fixed subdivision
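
A sliding window over publication years can be sketched as below. The slides give neither the window width nor the exact aggregation, so the default of five years, the averaging of per-paper topic profiles, and the name `sliding_topic_share` are all assumptions made for illustration.

```python
def sliding_topic_share(papers, topic, window=5):
    """Average share of `topic` among the papers in each window of `window` years.

    papers: list of (year, profile) pairs, where profile is {topic: fraction}.
    Returns {first_year_of_window: average share}; windows overlap and
    advance one year at a time.
    """
    if not papers:
        return {}
    first = min(year for year, _ in papers)
    last = max(year for year, _ in papers)
    shares = {}
    for start in range(first, last - window + 2):
        end = start + window - 1
        in_window = [profile for year, profile in papers if start <= year <= end]
        if in_window:  # skip windows that contain no papers
            shares[start] = sum(p.get(topic, 0.0) for p in in_window) / len(in_window)
    return shares


# Made-up toy data, not taken from the study.
papers = [
    (2000, {"Testing": 1.0}),
    (2001, {"Formal Methods": 1.0}),
    (2002, {"Testing": 0.5, "Models": 0.5}),
    (2003, {"Testing": 1.0}),
]
trend = sliding_topic_share(papers, "Testing", window=2)
# Windows 2000-2001, 2001-2002 and 2002-2003 give 0.5, 0.25 and 0.75.
```

Unlike the per-decade tables, consecutive windows share most of their papers, so the resulting curve changes gradually from one year to the next.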
  28. TESTING
      [Plot over time; x-axis 1975 to 2005, y-axis 0 to 0.18]
  29. EMPIRICAL STUDIES
      [Plot over time; x-axis 1975 to 2005, y-axis 0 to 0.18]
  30. SERVICES
      [Plot over time; x-axis 1975 to 2005, y-axis 0 to 0.18]
  31. DISTRIBUTED SYSTEMS
      [Plot over time; x-axis 1975 to 2005, y-axis 0 to 0.18]
  32. PROGRAMMING LANGUAGES
      [Plot over time; x-axis 1975 to 2005, y-axis 0 to 0.18]
  33. PER-VENUE INSIGHTS

      Venue      Peculiarities
      TSE        Biased towards empirical works
      TOSEM      More focused on formal aspects
      ICSE       Balanced with respect to other venues
      ESEC/FSE   Formal, with interests in testing, modeling and requirements engineering
      ASE        Interests in program analysis and automated reasoning
  34. AFFILIATION ANALYSIS
      Where do the most prolific authors work? How much research is done in industry?
  35. AFFILIATION PROFILE

      Author     Affiliation
      Author A   1
      Author B   2
      Author B   2

      Affiliation profile
      Affiliation 1   33%
      Affiliation 2   66%
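
The affiliation profile of slide 35 can be read as the share of a paper's authors that belong to each affiliation. The sketch below follows that reading. Summing these shares over many papers would give fractional paper counts per affiliation; whether the study aggregates in exactly this way is an assumption, and the names `affiliation_profile` and `papers_per_affiliation` are hypothetical.

```python
from collections import Counter


def affiliation_profile(author_affiliations):
    """Return {affiliation: share of the paper's authors}.

    author_affiliations: one affiliation entry per author of the paper.
    """
    counts = Counter(author_affiliations)
    total = sum(counts.values())
    return {aff: n / total for aff, n in counts.items()}


def papers_per_affiliation(papers):
    """Sum the profiles of several papers into fractional paper counts."""
    credit = Counter()
    for author_affiliations in papers:
        for aff, share in affiliation_profile(author_affiliations).items():
            credit[aff] += share
    return dict(credit)


# Slide 35: one author at Affiliation 1 and two authors at Affiliation 2.
profile = affiliation_profile(["Affiliation 1", "Affiliation 2", "Affiliation 2"])
# profile is {"Affiliation 1": 1/3, "Affiliation 2": 2/3}, shown as 33% and 66% on the slide.

# Two papers: the one above plus a single-author paper from Affiliation 1.
totals = papers_per_affiliation([
    ["Affiliation 1", "Affiliation 2", "Affiliation 2"],
    ["Affiliation 1"],
])
```

Each paper contributes exactly one unit of credit in total, so the per-affiliation sums add up to the number of papers.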
  36. MOST PROLIFIC AFFILIATIONS

      Affiliation                             Papers
      IBM                                     186.32
      Carnegie Mellon University              166.52
      University of Texas, Austin             122.62
      University of Maryland                  106.83
      Microsoft                               101.63
      AT&T Bell Laboratories                  101.37
      University of California, Irvine        98.17
      Georgia Institute of Technology         94.75
      Massachusetts Institute of Technology   93.24
      University of Virginia                  81.55

      ALL FROM THE USA
  37. PER-VENUE INSIGHTS
      Venue: Peculiarities
      • TSE: The venue with the most industrial contribution. Callout: "Is it linked to the presence of empirical works?"
      • TOSEM: European universities among the top contributors. Callout: "Is Europe more formal?"
      • ICSE: Balanced set of the contributors we saw in the other venues. Callout: "It is representative"
      • ESEC/FSE: Despite ESEC, there is no bias towards Europe.
      • ASE: Industrial contribution is less relevant. Some affiliations appear only in its top list.
  38. INDUSTRY VS ACADEMIA
      [Plot over time with series Industry and Academia; x-axis 1970 to 2005, y-axis 0 to 1.00]
  39. GEOGRAPHICAL ANALYSIS
      Where does the contribution come from?
  40. GEOGRAPHICAL AREAS
      • Europe
      • North America
      • Asia & Oceania
      • South America
      • Africa
  41. LOCATION OF A PAPER

      Affiliation profile
      Affiliation 1   20%
      Affiliation 2   30%
      Affiliation 3   50%

      Locations
      Affiliation 1   North America
      Affiliation 2   Europe
      Affiliation 3   Europe

      Location profile
      North America   20%
      Europe          80%
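
The step from affiliation profile to location profile on slide 41 amounts to summing the affiliation shares per geographical area. A minimal sketch, where the function name `location_profile` and the dictionary layout are hypothetical:

```python
def location_profile(affiliation_profile, location_of):
    """Return {area: share} by summing the affiliation shares within each area.

    affiliation_profile: {affiliation: share of the paper}.
    location_of:         {affiliation: geographical area}.
    """
    profile = {}
    for affiliation, share in affiliation_profile.items():
        area = location_of[affiliation]
        profile[area] = profile.get(area, 0.0) + share
    return profile


# Values from slide 41.
affiliations = {"Affiliation 1": 0.20, "Affiliation 2": 0.30, "Affiliation 3": 0.50}
locations = {
    "Affiliation 1": "North America",
    "Affiliation 2": "Europe",
    "Affiliation 3": "Europe",
}
areas = location_profile(affiliations, locations)
# North America keeps its 20%; Europe receives 30% + 50% = 80%.
```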
  42. GEOGRAPHICAL DISTRIBUTION
      [Plot over time with series Europe, North America, South America, Asia & Oceania and Africa; x-axis 1970 to 2005, y-axis 0 to 1.00]
  43. CONCLUSION
      Academic literature contains a lot of information about a scientific community.
      With data mining techniques we can unveil it and get some interesting insights.
  44. QUESTIONS?