
Data Science, Empirical SE and Software Analytics: It’s a Family Affair!


My views on the interactions and complementarities between data science, empirical SE and software analytics (i.e., mining software repositories). This talk was given at IASESE 2018 (http://eseiw2018.wixsite.com/iasese).



  1. 1. Data Science, Empirical SE and Software Analytics: It’s a Family Affair! 1 Bram Adams http://mcis.polymtl.ca
  2. 2. Who is Bram Adams?
  3. 3. 3
  4. 4. Wolfgang De Meuter Vrije Universiteit Brussel Herman Tromp Ghent University 3
  5. 5. Ahmed E. Hassan Queen's University 3
  6. 6. M C IS (Lab on Maintenance, Construction and Intelligence of Software) 3
  7. 7. Who is MCIS? !4 http://mcis.polymtl.ca
  8. 8. Who is MCIS? !4 Parisa Doriane Shivashree Isabella me Yutaro M C I S Javier Armstrong http://mcis.polymtl.ca
  9. 9. !5 https://conf.researchr.org/home/msr-2019
  10. 10. !5 https://conf.researchr.org/home/msr-2019 deadline mid January 2019
  11. 11. Data Science, Empirical SE and Software Analytics: It’s a Family Affair! 6 Bram Adams http://mcis.polymtl.ca
  12. 12. Dear Bram, As per our conversation during ICSE, we are pleased to invite you as a speaker at IASESE 2018, co-located with the ESEM conference. We are planning to have a discussion about similarities, differences, and synergies between empirical software engineering, data science, and mining software repositories under the umbrella of "Empirical Software Engineering and Data Science: Old Wine in a New Bottle". We are looking for a talk of around 45-60 minutes and a round-table discussion. Attached please find the agenda for IASESE. Please don't hesitate to contact us in case of any questions. We hope to hear back from you soon. Best regards, Silvia Abrahão, Maleknaz Nayebi
  13. 13. Data Science has Overtaken Other Major Technologies in Popularity https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  14. 14. Data Science has Overtaken Other Major Technologies in Popularity https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  15. 15. Data Science has Overtaken Other Major Technologies in Popularity (big data) https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  16. 16. Data Science has Overtaken Other Major Technologies in Popularity (big data, cloud computing) https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  17. 17. Data Science has Overtaken Other Major Technologies in Popularity (big data, cloud computing, data science) https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  18. 18. Data Science has Overtaken Other Major Technologies in Popularity (big data, cloud computing, data science, empirical research :-( ) https://trends.google.com/trends/explore?date=today%205-y,today%205-y,today%205-y,today%205-y&geo=,,,&q=%22big%20data%22,%22data%20science%22,%22cloud%20computing%22,%22empirical%20research%22
  19. 19. https://www.forbes.com/sites/louiscolumbus/2018/01/29/data-scientist-is-the-best-job-in-america-according-glassdoors-2018-rankings/#c2d2ae55357e
  20. 20. https://elitedatascience.com/books
  21. 21. !11
  22. 22. !11 Software Data Science Mining: SoDaSciM 2018. Would such a renaming make sense? Why (not)?
  23. 23. Part I: Data Science
  24. 24. !13
  25. 25. !13
  26. 26. !13
  27. 27. !13
  28. 28. https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#539fcc0c55cf !14
  29. 29. John Tukey, “The Future of Data Analysis”, Ann. Math. Statist., 33(1), pp. 1-67, 1962. https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#539fcc0c55cf I have come to feel that my central interest is in data analysis... Data analysis, and the parts of statistics which adhere to it, must...take on the characteristics of science rather than those of mathematics... data analysis is intrinsically an empirical science [John Tukey, 1962] https://en.wikipedia.org/w/index.php?curid=17099473 !14
  30. 30. Empirical Science?
  31. 31. Empirical Science? testable hypothesis
  32. 32. Empirical Science? testable hypothesis validation
  33. 33. Empirical Science? testable hypothesis validation validated hypothesis
  34. 34. Empirical Science? testable hypothesis validation validated hypothesis
  35. 35. Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis
  36. 36. Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis
  37. 37. Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis observation/ experience
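To make the validation loop concrete, here is a minimal sketch (my own illustration, not from the talk) of one pass through "testable hypothesis → validation → validated/rejected hypothesis"; the sample values and the 0.05 threshold are assumptions.

```python
# Sketch: validate the hypothesis "reviewed modules have fewer post-release
# defects than unreviewed ones" on two hypothetical samples.
from scipy.stats import mannwhitneyu

reviewed = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2]    # defects/KLOC (hypothetical)
unreviewed = [1.9, 2.4, 1.7, 2.1, 2.6, 1.8, 2.2]

# Non-parametric test: are the reviewed values systematically smaller?
stat, p = mannwhitneyu(reviewed, unreviewed, alternative="less")
if p < 0.05:
    print(f"hypothesis supported (p = {p:.3f})")            # "validated hypothesis"
else:
    print(f"no support for the hypothesis (p = {p:.3f})")   # back to observation/experience
```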
  38. 38. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process !16
  39. 39. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data !16
  40. 40. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data knowledge !16
  41. 41. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data select target data knowledge !16
  42. 42. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data select target data knowledge preprocess clean data !16
  43. 43. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data select target data knowledge transform a b preprocess clean data !16
  44. 44. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data select target data knowledge data mining transform a b preprocess clean data !16
  45. 45. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data interpret select target data knowledge data mining transform a b preprocess clean data !16
  46. 46. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data interpret select target data knowledge iterate data mining transform a b preprocess clean data !16
  47. 47. Fayyad, Piatetsky-Shapiro, Smyth, "From Data Mining to Knowledge Discovery: An Overview", in “Advances in Knowledge Discovery and Data Mining”, 1996, pp.1-34 1989: KDD Workshop & Process raw data interpret select target data knowledge iterate data mining transform a b preprocess clean data !16
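To make the 1989 KDD steps concrete, here is a small sketch (not from the slides) that runs select, preprocess, transform, data mining and interpret over a hypothetical changes.csv; the file name and its author/files_changed/bug_fix columns are assumptions.

```python
# Sketch of the KDD process on a hypothetical table of software changes.
import pandas as pd

raw = pd.read_csv("changes.csv")                       # raw data

target = raw[["author", "files_changed", "bug_fix"]]   # select target data
clean = target.dropna()                                # preprocess: clean data
clean = clean.assign(large_change=clean["files_changed"] > 5)  # transform: derive a feature

# data mining: a simple pattern, e.g. how often large vs. small changes fix bugs
pattern = clean.groupby("large_change")["bug_fix"].mean()

print(pattern)                                         # interpret -> (candidate) knowledge
```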
  48. 48. 1996/1998 !17
  49. 49. Data Science is not only a synthetic concept to unify statistics, data analysis and their related methods but also comprises its results. It includes three phases, design for data, collection of data, and analysis on data. 1996/1998 !17
  50. 50. Data Science is not only a synthetic concept to unify statistics, data analysis and their related methods but also comprises its results. It includes three phases, design for data, collection of data, and analysis on data. 1996/1998 !17
  51. 51. Data Science is not only a synthetic concept to unify statistics, data analysis and their related methods but also comprises its results. It includes three phases, design for data, collection of data, and analysis on data. 1996/1998 !17
  52. 52. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18
  53. 53. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 raw data
  54. 54. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 raw data knowledge
  55. 55. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge
  56. 56. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge Scrub a b
  57. 57. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge Scrub a b Explore
  58. 58. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge Scrub a b Model Explore
  59. 59. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge Scrub a b iNterpret Model Explore
  60. 60. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge iterate Scrub a b iNterpret Model Explore
  61. 61. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge iterate Scrub a b iNterpret Model Explore
  62. 62. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !18 Obtain metrics raw data knowledge iterate Scrub a b iNterpret Model Explore
  63. 63. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ raw data Obtain Scrub Model iNterpret metrics knowledge iterate a b !19 Explore
  64. 64. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ raw data Obtain Scrub Model iNterpret metrics knowledge iterate a b !19 Explore
  65. 65. Exercise: Fairness of Eurovision Song Contest https://www.imdb.com/title/tt0800959/mediaviewer/rm623579392
  66. 66. Hilary Mason & Chris Wiggins, http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ 2010: OSEMN Process for Data Science !21 Obtain metrics raw data knowledge iterate Scrub a b iNterpret Model Explore
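One possible way to walk the Eurovision fairness exercise through OSEMN, as a sketch only: the votes.csv file and its from_country/to_country/points columns are assumptions, and the "bias" score is just a naive deviation from the overall mean rather than a proper fairness model.

```python
# Sketch: OSEMN applied to (hypothetical) Eurovision voting data.
import pandas as pd

votes = pd.read_csv("votes.csv")                              # Obtain
votes = votes.dropna().query("from_country != to_country")    # Scrub

# Explore: which country pairs exchange the most points on average?
pair_means = (votes.groupby(["from_country", "to_country"])["points"]
                   .mean()
                   .sort_values(ascending=False))
print(pair_means.head(10))

# Model (naively): deviation of each pair from the overall mean score
bias = pair_means - votes["points"].mean()

# iNterpret: pairs far above zero hint at block voting, i.e. possible unfairness
print(bias.head(10))
```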
  67. 67. Why (Not) Data Science? !22
  68. 68. customize software tools and data analysis techniques to problem at hand Why (Not) Data Science? !22
  69. 69. exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources customize software tools and data analysis techniques to problem at hand Why (Not) Data Science? !22
  70. 70. exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources lacking guidance on the planning/problem formulation side (due to focus on exploratory/ bottom-up research) customize software tools and data analysis techniques to problem at hand Why (Not) Data Science? !22
  71. 71. Part II: Empirical SE
  72. 72. Where does “Raw Data” Come from? !24
  73. 73. Where does “Raw Data” Come from? any data domain !24
  74. 74. Where does “Raw Data” Come from? SE Development Process any data domain !24
  75. 75. … and How does it Relate to High-Level Questions? !25
  76. 76. … and How does it Relate to High-Level Questions? What kinds of patches are more likely to be accepted by OSS projects? !25
  77. 77. … and How does it Relate to High-Level Questions? What kinds of patches are more likely to be accepted by OSS projects? Does our new dashboard for code review improve reviewers’ effectiveness? !25
  78. 78. … and How does it Relate to High-Level Questions? What kinds of patches are more likely to be accepted by OSS projects? Does our new dashboard for code review improve reviewers’ effectiveness? Commit. Why do developers prefer short commit messages? !25
  79. 79. These Questions All Require Empirical Evidence • Why? SE development process is managed by humans, hence no formal laws, rules, etc. exist that “prove” the answer to these questions • How? through “a posteriori” knowledge “observed” or “experienced” through: • quantitative analysis: how large is effect of given manipulation or activity? • qualitative analysis: why does something happen, from perspective/point of view of subjects? !26
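For the quantitative question "how large is the effect of a given manipulation?", a typical answer is an effect size; below is a small sketch using Cliff's delta on hypothetical review times for the dashboard question above (the numbers are invented).

```python
# Sketch: Cliff's delta, a non-parametric effect size in [-1, 1].
def cliffs_delta(xs, ys):
    """Fraction of (x, y) pairs with x > y minus fraction with x < y."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Hypothetical review times (hours) with and without the new dashboard.
with_dashboard = [2.1, 1.8, 2.5, 1.9, 2.2]
without_dashboard = [3.0, 2.8, 3.4, 2.9, 3.1]

print(cliffs_delta(with_dashboard, without_dashboard))  # near -1: large effect
```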
  80. 80. Common Strategies to Obtain Empirical Evidence !27
  81. 81. Common Strategies to Obtain Empirical Evidence case study (observational study, no control) !27
  82. 82. Common Strategies to Obtain Empirical Evidence case study (observational study, no control) experiment (assigning treatments, with control) !27
  83. 83. Common Strategies to Obtain Empirical Evidence case study (observational study, no control) experiment (assigning treatments, with control) survey (interviews/ questionnaire) !27
  84. 84. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process !28
  85. 85. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process !28
  86. 86. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define !28
  87. 87. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan !28
  88. 88. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan iterate !28
  89. 89. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw data operate (case study, experiment or survey) iterate !28
  90. 90. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw data operate (case study, experiment or survey) analyze/ interpret stats iterate !28
  91. 91. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw data operate (case study, experiment or survey) analyze/ interpret stats knowledge present/ package iterate !28
  92. 92. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw data operate (case study, experiment or survey) analyze/ interpret stats knowledge present/ package iterate iterate !28
  93. 93. Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw data operate (case study, experiment or survey) analyze/ interpret stats knowledge present/ package iterate iterate !28
  94. 94. Enhanced Empirical SE Process define plan iterate operate (case study, experiment or survey) ESE !29
  95. 95. Enhanced Empirical SE Process define plan iterate operate (case study, experiment or survey) empirical SE data science raw data Obtain Scrub Explore Model iNterpret iterate a b OSEMN ESE !29
  96. 96. Why this Hybrid Process? !30
  97. 97. strong empirical definition/planning/ operation base Why this Hybrid Process? !30
  98. 98. strong empirical definition/planning/ operation base Why this Hybrid Process? !30 exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources
  99. 99. strong empirical definition/planning/ operation base Why this Hybrid Process? !30 exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources customize software tools and data analysis techniques to problem at hand
  100. 100. strong empirical definition/planning/ operation base Why this Hybrid Process? experiments, surveys and even case studies require large effort, depending on available data sources !30 exploit explosion in computing power (cloud!) to scale with volume and velocity of available data sources customize software tools and data analysis techniques to problem at hand
  101. 101. effort empirical strategy case study experiment survey !31
  102. 102. effort empirical strategy case study experiment survey !31
  103. 103. effort empirical strategy case study experiment survey !31
  104. 104. effort empirical strategy case study experiment survey !31
  105. 105. What if Accurate SE Process Data Were Available? effort empirical strategy case study experiment survey !31
  106. 106. What if Accurate SE Process Data Were Available? effort empirical strategy case study experiment survey !31
  107. 107. What if Accurate SE Process Data Were Available? effort empirical strategy case study experiment survey !31
  108. 108. What if Accurate SE Process Data Were Available? effort empirical strategy case study experiment survey !31
  109. 109. What if Accurate SE Process Data Were Available? effort empirical strategy case study experiment survey !31
  110. 110. What if Accurate SE Process Data Were Available? effort empirical strategy case study experiment survey enables more case studies! !31 reduces effort!
  111. 111. Part III: Software Analytics (aka MSR)
  112. 112. Mining Version Histories to Guide Software Changes, Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, Andreas Zeller (Saarland University, Saarbrücken, Germany). [screenshot of the paper’s first page] “We apply data mining to version histories in order to guide programmers along related changes: ‘Programmers who changed these functions also changed…’” Zimmermann T., Weißgerber P., Diehl S., Zeller A., “Mining version histories to guide software changes”, 26th International Conference on Software Engineering (ICSE), pp. 563-572, 2004. !34
  113. 113. [same paper screenshot] what other functions should I change for this task? similar to: “what did other people buy (in the past)?” !34
  114. 114. [same paper screenshot] so: mine the version control system to understand which functions were changed together in the past, in order to predict the future !34
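The core idea of the ROSE paper, mining which entities are changed together, can be sketched at file granularity with plain git commands. This is a simplification, not the actual ROSE tool (which mines association rules over fine-grained entities such as functions and variables):

```python
# Sketch: count which files co-change in the version history and report
# simple "changed A, also changed B" rules with their confidence.
import subprocess
from collections import Counter
from itertools import combinations

log = subprocess.run(
    ["git", "log", "--name-only", "--pretty=format:__COMMIT__"],
    capture_output=True, text=True, check=True
).stdout

commits, current = [], []
for line in log.splitlines():
    if line == "__COMMIT__":          # start of a new commit's file list
        if current:
            commits.append(sorted(set(current)))
        current = []
    elif line.strip():
        current.append(line.strip())
if current:
    commits.append(sorted(set(current)))

pair_counts, file_counts = Counter(), Counter()
for files in commits:
    file_counts.update(files)
    pair_counts.update(combinations(files, 2))

# "Programmers who changed A also changed B": confidence = P(B changed | A changed)
for (a, b), together in pair_counts.most_common(10):
    print(f"{a} -> {b}: confidence {together / file_counts[a]:.2f}")
```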
  115. 115. !35
  116. 116. Ahmed E. Hassan, “The road ahead for mining software repositories”, Frontiers of Software Maintenance, pp. 48-57, 2008. The Mining Software Repositories (MSR) field analyzes and cross-links the rich data available in software repositories to uncover interesting and actionable information about software systems and projects [Ahmed E. Hassan, 2008] !36
  117. 117. What Repositories are we Talking About? !37
  118. 118. What Repositories are we Talking About? !37
  119. 119. What Repositories are we Talking About? !37
  120. 120. What Repositories are we Talking About? !37
  121. 121. What Repositories are we Talking About? !37
  122. 122. What Repositories are we Talking About? !37
  123. 123. What Repositories are we Talking About? !37
  124. 124. What Repositories are we Talking About? !37
  125. 125. !38
  126. 126. !38
  127. 127. !38
  128. 128. !38 recording traces of activities
  129. 129. actionable insights !38 recording traces of activities
  130. 130. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? !39
  131. 131. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? !39
  132. 132. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? deep qualitative analysis possible, lack of generalization !39
  133. 133. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? deep qualitative analysis possible, lack of generalization generalization of shallow findings due to large- scale quantitative analysis !39
  134. 134. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? deep qualitative analysis possible, lack of generalization generalization of shallow findings due to large- scale quantitative analysis !39
  135. 135. #detailed insight #analyzed repositories How Many Repositories and Projects are Required? deep qualitative analysis possible, lack of generalization generalization of shallow findings due to large- scale quantitative analysis mix of both !39
  136. 136. Hold On, What about that “Cross-Linking”? !40
  137. 137. !41
  138. 138. !41 how did reviewers react to this commit?
  139. 139. !42
  140. 140. !43
  141. 141. !43 does this commit fix a bug or add a new feature?
  142. 142. !44
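Cross-linking a commit to a bug report usually relies on scanning the commit message for issue identifiers. The sketch below illustrates that heuristic; the ID patterns are assumptions, and such heuristics miss or mismatch links, which is exactly the problem raised on the next slides.

```python
# Sketch: link commits to issues by matching IDs such as "#1234" or "JIRA-567"
# in the commit subject line.
import re
import subprocess

ISSUE_ID = re.compile(r"(?:#|[A-Z]{2,}-)\d+")

log = subprocess.run(
    ["git", "log", "--pretty=format:%H%x09%s"],   # hash <TAB> subject
    capture_output=True, text=True, check=True
).stdout

for line in log.splitlines():
    sha, _, subject = line.partition("\t")
    ids = ISSUE_ID.findall(subject)
    if ids:
        print(sha[:8], "->", ", ".join(ids))      # candidate commit-issue links
```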
  143. 143. Is it Always that Easy? !45
  144. 144. Is it Always that Easy? NO! !45
  145. 145. Linking Repositories is Challenging! !46
  146. 146. Linking Repositories is Challenging! lack of access to confidential/closed repositories; not enough data left after linking; imprecise heuristics based on email addresses, …; no recorded links due to lack of developer discipline; ambiguous or incorrect links; invalid identifiers, e.g., after migration to newer repository technology !46
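As an illustration of the "imprecise heuristics based on email addresses" point, here is a naive identity-merging sketch; the author records and merge keys are invented, and real identity resolution needs far more care (and still errs in both directions).

```python
# Sketch: cluster developer identities by shared email, email prefix or name.
from collections import defaultdict

authors = [
    ("Jane Doe", "jane.doe@example.com"),
    ("jdoe", "jane.doe@users.noreply.github.com"),
    ("Jane D.", "jane@example.com"),
]

def merge_keys(name, email):
    """Candidate keys: full email, email prefix, normalised name."""
    return {email.lower(), email.split("@")[0].lower(),
            name.lower().replace(".", "").strip()}

clusters = defaultdict(set)
for name, email in authors:
    for key in merge_keys(name, email):
        clusters[key].add((name, email))

# Any shared key merges two records; the third alias above is silently missed.
for key, members in clusters.items():
    if len(members) > 1:
        print(key, "->", members)
```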
  147. 147. OSEMN Enhanced Empirical SE Process define plan iterate raw data Obtain Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science a b ESE !47
  148. 148. OSEMN Enhanced Empirical SE Process define plan iterate raw data Obtain Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science a b how to incorporate software analytics? ESE !47
  149. 149. Today’s Empirical SE Process define plan iterate Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science a b LONSEMN ESE !48
  150. 150. Today’s Empirical SE Process define plan iterate Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science software analytics a b LONSEMN ESE !48
  151. 151. Today’s Empirical SE Process define plan iterate Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science software analytics Link a b LONSEMN ESE !48
  152. 152. Today’s Empirical SE Process define plan iterate Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) empirical SE data science software analytics Link a b ObtaiN LONSEMN ESE !48
  153. 153. ObtaiN Activity should be Time-aware! !49
  154. 154. ObtaiN Activity should be Time-aware! !49
  155. 155. ObtaiN Activity should be Time-aware! check out snapshot of code !49
  156. 156. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  157. 157. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  158. 158. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  159. 159. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  160. 160. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  161. 161. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  162. 162. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  163. 163. ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  164. 164. what if snapshot does not compile? ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction !49
  165. 165. what if snapshot does not compile? ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction what if analysis is time-intensive? !49
  166. 166. what if snapshot does not compile? ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction what if more than one branch? what if analysis is time-intensive? !49
  167. 167. what if snapshot does not compile? ObtaiN Activity should be Time-aware! check out snapshot of code apply time- unaware metric extraction what if more than one branch? what if results of prior snapshots needed? what if analysis is time-intensive? !49
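A bare-bones version of the "check out snapshot, run a time-unaware metric extractor" loop is sketched below. It assumes a single branch, a clean working tree and a Python code base with line count as the (toy) metric; the questions on the slides (snapshots that do not compile, multiple branches, expensive analyses, reuse of prior results) are exactly what this naive loop ignores.

```python
# Sketch: walk the history oldest-first, check out snapshots, extract a metric.
import subprocess
from pathlib import Path

def git(*args):
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

start = git("rev-parse", "--abbrev-ref", "HEAD").strip()   # remember where we were
shas = git("rev-list", "--reverse", "HEAD").split()        # oldest-first commits

for sha in shas[::50]:                # sample every 50th snapshot: analysis is costly
    git("checkout", "--quiet", sha)   # check out snapshot of code
    loc = sum(len(p.read_text(errors="ignore").splitlines())
              for p in Path(".").rglob("*.py"))            # time-unaware metric
    print(sha[:8], loc)

git("checkout", "--quiet", start)     # restore the original branch
```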
  168. 168. Exercise: Predicting Bug-introducing Commits (1) !50
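For this exercise, the classic starting point is an SZZ-style heuristic: given a known bug-fixing commit, take the lines it deleted and blame them at the fix's parent; the commits that last touched those lines are candidate bug introducers. The sketch below is a rough approximation (the fix SHA is hypothetical, and real SZZ variants add refinements such as ignoring whitespace, comment and cosmetic changes).

```python
# Sketch: SZZ-style search for candidate bug-introducing commits.
import re
import subprocess

def git(*args):
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

fix_sha = "abc1234"                    # hypothetical bug-fixing commit
diff = git("show", "--unified=0", "--pretty=format:", fix_sha)

candidates = set()
current_file, old_line = None, 0
for line in diff.splitlines():
    if line.startswith("--- a/"):                  # old path of the changed file
        current_file = line[6:]
    elif line.startswith("@@"):                    # hunk header: old start line
        old_line = int(re.search(r"-(\d+)", line).group(1))
    elif line.startswith("-") and not line.startswith("---") and current_file:
        # a deleted line: whoever last touched it at the parent is a suspect
        blame = git("blame", "-L", f"{old_line},{old_line}",
                    f"{fix_sha}^", "--", current_file)
        candidates.add(blame.split()[0])
        old_line += 1

print("candidate bug-introducing commits:", candidates)
```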
  169. 169. Today’s Empirical SE Process define plan iterate Link Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) ObtaiN empirical SE data science software analytics a b LONSEMN ESE !51
  170. 170. Part IV: Now What?
  171. 171. !53
  172. 172. !53 The 3 concepts are in fact complementary!
  173. 173. … so can you just submit your work to venues of any of these domains? !53 The 3 concepts are in fact complementary!
  174. 174. Yes, you can! (Depending on the contributions you want to stress) !54
  175. 175. Yes, you can! (Depending on the contributions you want to stress) NOTE: “stress” != “include” (i.e., all activities will be included, but with different importance) !54
  176. 176. empirical venues (ESEM, EMSE, …) define plan iterate operate (case study, experiment or survey) !55
  177. 177. empirical venues (ESEM, EMSE, …) define plan iterate operate (case study, experiment or survey) software analytics venues (MSR, PROMISE, …) Link ObtaiN !55
  178. 178. empirical venues (ESEM, EMSE, …) define plan iterate operate (case study, experiment or survey) software analytics venues (MSR, PROMISE, …) Link ObtaiN data science/general SE venues (KDD, ICDM, ICSE, TSE, …) Scrub Explore Model iNterpret iterate a b !55
  179. 179. Exercise: Predicting Bug-introducing Commits (2) !56
  180. 180. Exercise: Predicting Bug-introducing Commits (2) How to sell this work as an ESEM paper? !56
  181. 181. Exercise: Predicting Bug-introducing Commits (2) How to sell this work as an ESEM paper? How to sell this work as an MSR paper? !56
  182. 182. Exercise: Predicting Bug-introducing Commits (2) How to sell this work as an ESEM paper? How to sell this work as an MSR paper? How to sell this work as an ICSE paper? !56
  183. 183. empirical venues (ESEM, EMSE, …) data science/general SE venues (KDD, ICDM, ICSE, TSE, …) software analytics venues (MSR, PROMISE, …) Link ObtaiN define plan iterate operate (case study, experiment or survey) Scrub Explore Model iNterpret iterate a b !57
  184. 184. What Will Tomorrow’s ESE Process Look Like? !58
  185. 185. What Will Tomorrow’s ESE Process Look Like? hard to predict … !58
  186. 186. What Will Tomorrow’s ESE Process Look Like? hard to predict … … but very likely an evolution of today’s process! !58
  187. 187. Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis observation/ experience
  188. 188. Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis observation/ experience Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw data operate (case study, experiment or survey) iterate !28
  189. 189. Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis observation/ experience Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw data operate (case study, experiment or survey) iterate !28 actionable insights !37 recording traces of activities
  190. 190. Today’s Empirical SE Process define plan iterate Link Scrub Explore Model iNterpret iterate operate (case study, experiment or survey) ObtaiN empirical SE data science software analytics a b LONSEMN ESE !50 Empirical Science? testable hypothesis validation rejected hypothesis validated hypothesis observation/ experience Claes Wohlin et al., “Experimentation in Software Engineering - An Introduction”, 2000. Traditional Empirical SE Process define plan raw data operate (case study, experiment or survey) iterate !28 actionable insights !37 recording traces of activities
