A data-driven journey through research on software engineering
A DATA-DRIVEN JOURNEY THROUGH RESEARCH ON SOFTWARE ENGINEERING
Mario Sangiorgio
MOTIVATION
Getting a better idea of what’s going on in the software engineering research community through a quantitative approach
RELATED WORKS
• C. Ghezzi - Keynote at ICSE 2008: Reflections on 40+ years of software engineering research and beyond
• L. Briand - Keynote at ICSM 2011: Useful software engineering research: leading a double agent life
• D. Rosenblum - Keynote at ASE 2012: Whither software engineering research?
SUBJECTS OF OUR STUDY
• researchers
• research topics
• affiliations
• geographical areas
DATA SOURCES
• Articles published and their authors: complete XML database
• Citations, authors and affiliation details: APIs
COLLECTED DATA
Venue      Number of papers   From   To
TSE        3043               1975   2012
TOSEM      295                1992   2012
ICSE       2907               1976   2012
ASE        1116               1997   2012
ESEC/FSE   416                1987   2012
TOTAL      7777               1975   2012
9865 researchers
278794 citations
TOPICS IN THE ‘70S
Topic                     Fraction of papers
Programming Languages     16.71%
Performance               7.95%
Operating Systems         7.29%
Database Systems          6.84%
Formal Methods            6.65%
Software Architectures    6.14%
Knowledge Engineering     5.69%
Distributed Systems       4.94%
Software Maintenance      4.18%
• By far the most represented: Programming Languages
• Topics from other fields
TOPICS IN THE ‘80S
Topic                     Fraction of papers
Programming Languages     10.48%
Distributed Systems       9.30%
Knowledge Engineering     8.47%
Software Reliability      6.68%
Formal Methods            6.51%
Information Systems       5.55%
Software Maintenance      5.04%
Models                    4.35%
Artificial Intelligence   3.74%
• Significant rise
• Other fields, related to distributed systems
• Not only code
TOPICS IN THE ‘90S
Topic                     Fraction of papers
Formal Methods            8.29%
Programming Languages     8.13%
Distributed Systems       6.80%
Software Maintenance      6.55%
Software Architectures    5.34%
Software Quality          4.80%
Knowledge Engineering     4.67%
Models                    4.65%
Information Systems       4.40%
• Change of the most published topic
• Focus on software quality
TOPICS IN THE 2000S
Topic                     Fraction of papers
Formal Methods            9.93%
Programming Languages     8.37%
Testing                   6.86%
Software Maintenance      6.58%
Software Reliability      6.22%
Software Quality          5.72%
Models                    4.80%
Empirical Studies         4.76%
Software Architectures    4.38%
• Still a lot of emphasis on software quality
• Analysis of open source repositories
NEED FOR A FINER ANALYSIS
Topics change constantly, not once in a decade
SOLUTION: sliding window instead of fixed subdivision
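The sliding-window idea can be pictured with a minimal sketch. The input format (a list of `(year, topic)` pairs), the five-year window width and the function name `topic_shares_by_window` are assumptions made for illustration. They are not the data model or parameters used in the study.

```python
from collections import Counter


def topic_shares_by_window(papers, width=5):
    """Return {window_start_year: {topic: share}}.

    `papers` is an iterable of (year, topic) pairs (hypothetical format).
    Each window covers `width` consecutive years and slides one year at a
    time, instead of cutting the timeline into fixed decades.
    """
    papers = list(papers)
    if not papers:
        return {}
    first = min(year for year, _ in papers)
    last = max(year for year, _ in papers)
    # Always evaluate at least one window, even if the data spans < width years.
    last_start = max(first, last - width + 1)
    shares = {}
    for start in range(first, last_start + 1):
        counts = Counter(
            topic for year, topic in papers if start <= year < start + width
        )
        total = sum(counts.values())
        if total:
            shares[start] = {topic: n / total for topic, n in counts.items()}
    return shares


# Toy data, invented for the example.
example = [
    (1990, "Formal Methods"), (1991, "Testing"), (1992, "Testing"),
    (1993, "Formal Methods"), (1994, "Testing"), (1995, "Testing"),
]
for start, dist in topic_shares_by_window(example, width=5).items():
    print(start, dist)
```

Because consecutive windows overlap, a topic's share changes gradually from one window to the next, which makes trends visible that a per-decade split would hide.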
PER-VENUE INSIGHTS
Venue      Peculiarities
TSE        Biased towards empirical works
TOSEM      More focused on formal aspects
ICSE       Balanced with respect to other venues
ESEC/FSE   Formal, with interests in testing, modeling and requirements engineering
ASE        Interests in program analysis and automated reasoning
AFFILIATION ANALYSIS
Where do the most prolific authors work?
How much research is done in industry?
AFFILIATION PROFILE
Author     Affiliation
Author A   1
Author B   2
Author B   2

Affiliation     Affiliation profile
Affiliation 1   33%
Affiliation 2   66%
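The affiliation profile of a paper can be sketched as the fraction of its author slots held by each affiliation. The function names below are hypothetical. Summing the per-paper shares to get fractional paper counts per affiliation is an assumption about how such totals could be derived, not a confirmed description of the study's method.

```python
from collections import Counter


def affiliation_profile(author_affiliations):
    """Map each affiliation to the fraction of the paper's authors it accounts for.

    `author_affiliations` lists one affiliation per author (hypothetical format).
    """
    counts = Counter(author_affiliations)
    total = sum(counts.values())
    return {aff: n / total for aff, n in counts.items()}


def fractional_paper_counts(papers):
    """Sum the per-paper shares of every affiliation (assumed aggregation)."""
    totals = Counter()
    for author_affiliations in papers:
        for aff, share in affiliation_profile(author_affiliations).items():
            totals[aff] += share
    return dict(totals)


# The slide's example: one author at Affiliation 1, two at Affiliation 2.
profile = affiliation_profile(["Affiliation 1", "Affiliation 2", "Affiliation 2"])
print(profile)
```

With this weighting every paper contributes exactly one unit in total, split among the affiliations of its authors, so per-affiliation totals need not be whole numbers.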
MOST PROLIFIC AFFILIATIONS
Affiliation                             Papers
IBM                                     186.32
Carnegie Mellon University              166.52
University of Texas, Austin             122.62
University of Maryland                  106.83
Microsoft                               101.63
AT&T Bell Laboratories                  101.37
University of California, Irvine        98.17
Georgia Institute of Technology         94.75
Massachusetts Institute of Technology   93.24
University of Virginia                  81.55
ALL FROM THE USA
PER-VENUE INSIGHTS
Venue      Peculiarities
TSE        The venue with the largest industrial contribution
TOSEM      European universities among the top contributors
ICSE       Balanced set of the contributors we saw in the other venues
ESEC/FSE   Despite ESEC, there is no bias towards Europe
ASE        Industrial contribution is less relevant. Some affiliations appear only in its top list.
• TSE: is it linked to the presence of empirical works?
• TOSEM: is Europe more formal?
• It is representative
INDUSTRY VS ACADEMIA
[Chart: share of contributions from Industry and Academia over time; x-axis 1970 to 2005, y-axis 0 to 1.00]
GEOGRAPHICAL ANALYSIS Where does the contribution come from?
GEOGRAPHICAL AREAS
• Europe
• North America
• Asia & Oceania
• South America
• Africa
LOCATION OF A PAPER
Affiliation     Affiliation profile
Affiliation 1   20%
Affiliation 2   30%
Affiliation 3   50%

Affiliation     Location
Affiliation 1   North America
Affiliation 2   Europe
Affiliation 3   Europe

Location        Location profile
North America   20%
Europe          80%
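The step from affiliation profile to location profile amounts to summing the shares of all affiliations that sit in the same geographical area. A minimal sketch follows. The function name `location_profile` and the dictionary-based inputs are assumptions for illustration.

```python
from collections import defaultdict


def location_profile(affiliation_profile, locations):
    """Aggregate affiliation shares by geographical area.

    `affiliation_profile` maps affiliation -> share of the paper;
    `locations` maps affiliation -> area (both hypothetical formats).
    """
    profile = defaultdict(float)
    for affiliation, share in affiliation_profile.items():
        profile[locations[affiliation]] += share
    return dict(profile)


# The slide's example: 20% + 30% + 50% spread over three affiliations.
shares = {"Affiliation 1": 0.20, "Affiliation 2": 0.30, "Affiliation 3": 0.50}
areas = {
    "Affiliation 1": "North America",
    "Affiliation 2": "Europe",
    "Affiliation 3": "Europe",
}
print(location_profile(shares, areas))
```

For the slide's example this gives about 20% for North America and 80% for Europe, matching the location profile shown.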
GEOGRAPHICAL DISTRIBUTION
[Chart: share of contributions from Europe, North America, South America, Asia & Oceania and Africa over time; x-axis 1970 to 2005, y-axis 0 to 1.00]
CONCLUSION
Academic literature contains a lot of information about a scientific community
With data mining techniques we can unveil it and get some interesting insights