Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data

2,157 views


(Abstract and video links below)

ACM SIGSOFT Webinar May 4th, 2016
Distinguished lecture at ISR, UCI, April 2016.

UCI Video is available at: https://www.youtube.com/watch?v=Ujm4G7ayRQQ
Webinar link will be available shortly.

This talk is based on a short chapter to appear in the forthcoming book "Perspectives on Data Science for Software Engineering"; the book can be preordered here:
http://goo.gl/Wi30Ra

Abstract:

Software analytics and the use of computational methods on "big" data in software engineering is transforming the ways software is developed, used, improved and deployed. Software engineering researchers and practitioners are witnessing an increasing trend in the availability of diverse trace and operational data and the methods to analyze it. This information is being used to paint a picture of how software is engineered and suggest ways it may be improved. But we have to remember that software engineering is inherently a socio-technical endeavour, with complex practices, activities and cultural aspects that cannot be externalized or captured by tools alone---in fact, they may be perturbed when trace data is surfaced and analyzed in a transparent manner.

In this talk, I will ask:

- Are researchers and practitioners adequately considering the unanticipated impacts that software analytics can have on software engineering processes and stakeholders?
- Are there important questions that are not being asked because the answers do not lie in the data that are readily available?
- Can we improve the application of software analytics using other methods that collect insights directly from participants in software engineering (e.g., through observations)?

I will explore these questions through specific examples. I hope to engage the audience in discussing how software analytics that depend on "big data" from tools, as well as methods that collect "thick" data from participants, can be mutually beneficial in improving software engineering research and practice.


Lies, Damned Lies and Software Analytics: Why Big Data Needs Rich Data

  1. 1. Lies, Damned Lies and Software Analytics: Why Big Data Needs Thick Data Margaret-Anne (Peggy) Storey University of Victoria @margaretstorey Presented at UCI, Irvine, April 2016 and ACM SIGSOFT Webinar, May 4th 2016
  2. 2. Acknowledgements: Alexey Zagalsky, Daniel German, Matthieu Foucault (UVic) Jacek Czerwonka, Brendan Murphy (Microsoft Research) http://www.slideshare.net/mastorey/lies-damned-lies-and-software-analytics-why-big-data-needs-rich-data @margaretstorey
  3. 3. My research… Human and social aspects in software engineering: Software visualization The social programmer and a participatory culture in software engineering Qualitative research and mixed methods in software engineering
  4. 4. Dashboards for developers awareness: Treude and Storey, “Awareness 2.0: staying aware of projects, developers and tasks using dashboards and feeds,” ICSE 2010.
  5. 5. 1968 1980 1990 2000 20101970 Developer tools…
  6. 6. How developers stay up to date using Twitter How developers assess each other based on their development and networking activity How a crowd of developers document open source APIs through Stack Overflow How developers share tacit knowledge on How developers coordinate which code is committed and accepted through GitHub
  7. 7. 1968 1980 1990 2000 20101970 Telephone Face2Face Project Workbook Documents Email Email Lists VisualAge Visual Studio NetBeans Eclipse IRC ICQ Skype SourceForge Wikis Trello Basecamp Jazz Slack Google Hangouts Punchcards TFS Books Usenet Stack Overflow Twitter Google Groups Podcasts Blogs GitHub Conferences Societies LinkedIn Facebook Slashdot HackerNews Nondigital Digital Digital & Socially Enabled Masterbranch Coderwall Meetups Yammer
  8. 8. 1968 1980 1990 2000 20101970 Telephone Face2Face Project Workbook Documents Email Email Lists VisualAge Visual Studio NetBeans Eclipse IRC ICQ Skype SourceForge Wikis Trello Basecamp Jazz Slack Google Hangouts Punchcards TFS Books Usenet Stack Overflow Twitter Google Groups Podcasts Blogs GitHub Conferences Societies LinkedIn Facebook Slashdot HackerNews Nondigital Digital Digital & Socially Enabled Masterbranch Coderwall Meetups Yammer Surveyed over 2,500 devs
  9. 9. Ecosystem of tools and activities
  10. 10. Learning CodeHosting Q&Asites Websearch Ecosystem of tools and activities
  11. 11. Coordination CodeHosting Coordinationtools Privatechat Privatediscuss Ecosystem of tools and activities
  12. 12. FacetoFace Connecting Microblogging Privatediscuss FacetoFace Codehosting Ecosystem of tools and activities
  13. 13. Social tools facilitate a participatory development culture in software engineering, with support for the social creation and sharing of content, informal mentorship, and awareness that contributions matter to one another Storey, M.-A., L. Singer, F. Figueira Filho, B. Cleary and A. Zagalsky, The (R)evolutionary Role of Social Media in Software Engineering, ICSE 2014 Future of Software Engineering.
  14. 14. How to study a participatory culture?
  15. 15. (Competing) concerns in software engineering… Code: faster, cheaper, more features, more reliable/secure Developers: more productive, more skilled, happier, better connected Organizations/communities: attract/retain contributors, encourage a participatory culture, increase value
  16. 16. https://www.flickr.com/photos/opensourceway/5755219017 Do the answers lie in here?
  17. 17. “The machine does not isolate us from the great problems of nature but plunges us more deeply into them.” Antoine de Saint-Exupéry
  18. 18. Thick data…
  19. 19. Talk outline… History of software analytics in software engineering Risks of software analytics Why big data needs thick data Consider both researchers and practitioners….
  20. 20. Talk outline… History of software analytics in software engineering Risks of software analytics Why big data needs thick data Consider both researchers and practitioners….
  21. 21. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  22. 22. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  23. 23. The dawn of software metrics “The realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.” Maurice Wilkes, 1949 “If you can't measure it, you can't manage it” Tom de Marco, 1982
  24. 24. Why use metrics? To discover facts about the world To steer our actions To modify human behaviour [DeMarco] Used by individuals, teams, companies, external organizations…
  25. 25. Software metrics Product: KLOC, Complexity measures (cyclomatic complexity, function points), OO metrics, #defects Process metrics: Testing, code review, deployment, agile practices (e.g., #sprints, burndown rate) Productivity: KLOC, Mean time to repair, #commits Developer metrics: Skills, followers, biometrics Estimation: cost metrics and models
  26. 26. Research success?
  27. 27. Success in industry? • Adoption at large, small companies (e.g., HP) • Integrated in CASE tools • Initial focus on product rather than process • Initial poor use of metrics led to the Goal Question Metric Approach [Basili et al.]
  28. 28. Lines of Code § Easy to calculate, to understand, to visualize § Descriptive of the product, and developer productivity § Correlates with complexity measures and # of bugs
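The slide's point that LOC is easy to calculate can be made concrete with a minimal Python sketch. The helper name `count_loc` is hypothetical, and what counts as a "line of code" varies between tools; this version counts non-blank lines that are not `#` comments:

```python
import os

def count_loc(path, exts=(".py",)):
    """Count non-blank, non-comment lines in source files under `path`.
    Illustrative only: real LOC tools differ on what they count."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            if not name.endswith(exts):
                continue
            with open(os.path.join(root, name), errors="ignore") as f:
                for line in f:
                    stripped = line.strip()
                    # Skip blank lines and full-line comments.
                    if stripped and not stripped.startswith("#"):
                        total += 1
    return total
```

The very simplicity of this sketch is the slide's warning: the number is trivially cheap to produce, which is exactly why it gets used as a proxy for things it does not measure.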
  29. 29. “Measuring programming progress by lines of code is like measuring aircraft building progress by weight.”
  30. 30. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  31. 31. Mining software repositories “We have all this data, the problem is what to do with it.” [A Software Engineering Researcher] Mining Software Repositories (MSR) conference series established in 2004 “Outcroppings of past human behaviour.” [McGrath]
  32. 32. Data, data, everywhere… Program data: runtime traces, program logs, system events, failure logs, performance logs, continuous deployment,… User data: usage logs, user surveys, user forums, A/B testing, Twitter, blogs, … Development data: source code versions, bug data, check-in information, test cases and results, communication between developers, social media
  33. 33. Techniques Association rules and frequency patterns Classification Clustering Text mining/natural language processing Searching and mining Qualitative analysis See papers from the Mining Software Repositories Conference!
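The first technique on the slide, frequency patterns, can be illustrated with a toy co-change counter. The function and its input format are hypothetical; in practice the per-commit file lists would come from parsing output such as `git log --name-only`:

```python
from collections import Counter
from itertools import combinations

def co_change_counts(commits):
    """Count how often each pair of files is modified in the same commit.
    `commits` is a list of per-commit file-name lists."""
    pairs = Counter()
    for files in commits:
        # Each unordered pair of distinct files in a commit co-changed once.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs
```

Pairs with high counts suggest logical coupling between files, one of the simplest patterns mined from version histories.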
  34. 34. Benefits of mining trace data Low interference Low reactivity Records made by the participants Data is easy to collect
  35. 35. “Only metric worth counting is defects” [DeMarco, 1997] Why mine and measure information about bugs? Personal discovery, evaluation by managers, understand product status, predict reliability
  36. 36. Bug prediction • Models to predict bugs show promise (ownership, churn, tangled code changes) • Poor replication across organizations! • Poor actionability (practitioners know which modules are buggy!) • The secret life of bugs [Aranda et al.]
  37. 37. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  38. 38. Data science movement… http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science
  39. 39. Goals of software analytics Improve: quality of the software experience of the users developer productivity Dongmei Zhang & Tao Xie, http://research.microsoft.com/en-us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf
  40. 40. Data Science Spectrum Past Present Future Explore trends alerts forecasting Analyze summarize compare what-if Experiment model benchmark simulate The Art and Science of Analyzing Software Data, by Bird, Menzies, Zimmermann, Elsevier 2015.
  41. 41. Software Analytics and its role in Automation • Scaling to thousands of developers — automation is required! [Jacek Czerwonka] • Goal is to optimize competing concerns of quality, time, resources • Data Scientists manage and measure impacts of automation and software analytics [Kim et al., 2016]
  42. 42. Does increasing test code coverage increase reliability?
  43. 43. No! Wasting time testing simple code may increase the presence of bugs! [Mockus et al.] Does increasing test code coverage increase reliability?
  44. 44. Role of data science in software engineering Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s)
  45. 45. Talk outline… History of software analytics in software engineering Risks of software analytics Why big data needs thick data Consider both researchers and practitioners….
  46. 46. Five Risks 1) Data and construct trustworthiness 2) Reliability of the results 3) Ethical concerns 4) Unintended and unexpected consequences 5) Big data can’t answer big questions
  47. 47. Risk #1: Trustworthiness of the data Data representativeness (construct validity) Data completeness Inaccuracies in profiles, exaggerations, skewed opinions Treating humans as “rational” animals [Harper et al.]
  48. 48. Perils from using GitHub data: A repository is not necessarily a (development) project Most projects are inactive or have few commits Most projects are for personal use only Only 10% of projects use pull requests History can be rewritten on GitHub A lot happens outside of GitHub The Promises and Perils of Mining GitHub, Eirini Kalliamvakou et al., MSR 2014.
  49. 49. Risk #2: Trustworthiness of the results Researcher bias [Shepperd et al., 2014] Confusing correlations with cause and effect Big data and small effects [Marcus et al.] Inappropriate generalization Conclusion instability [Menzies et al.]
  50. 50. “all models are wrong, but some are useful” [Box, 1976] http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
  51. 51. Risk #3: Ethical concerns Private, public, blurred spaces Surveillance at the level of the individual Opaque algorithms, opaque biases [Tufekci, CSCW Keynote, 2015]
  52. 52. http://www.informationweek.com/big-data/big-data-analytics/data-scientists-want-big-data-ethics-standards/d/d-id/1315798
  53. 53. Risk #4: Unexpected consequences Negative side effects [Gender studies] Gaming the gamification Incentives? handle with care!
  54. 54. Assessing and watching developers Singer, Filho, Cleary, Treude, Storey, Schneider. Mutual Assessment in the Social Programmer Ecosystem: An Empirical Investigation of Developer Profile Aggregators, CSCW 2013.
  55. 55. Contribution graphs considered harmful https://github.com/isaacs/github/issues/627 http://www.hanselman.com/blog/GitHubActivityGuiltAndTheCodersFitBit.aspx
  56. 56. Most unwise questions! Analyze This! 145 Questions for Data Scientists in Software Engineering Andrew Begel and Thomas Zimmermann
  57. 57. Risk #5: Big Data can’t answer Big Questions Or
  58. 58. Risk #5: Big Data can’t answer Big Questions Or
  59. 59. Risk #5: Big Data can’t answer Big Questions alone
  60. 60. Examples of big questions? • What is a good architecture to solve problem x? [Devanbu] • What makes a really awesome programmer? [Software managers] • How to build a great development team? [Google] • How is program knowledge distributed? [Naur] • What is the ideal software engineering process? [Facebook, Microsoft, IBM,…] • What tools/practices support a participatory development process? [Storey et al.]
  61. 61. Five Risks 1) Data and construct trustworthiness 2) Reliability of the results 3) Ethical concerns 4) Unintended and unexpected consequences 5) Big data can’t answer big questions
  62. 62. Talk outline… History of software analytics in software engineering Risks of software analytics Why big data needs thick data, and why thick data needs big data! Consider both researchers and practitioners….
  63. 63. Data scientists… “Typically start with the data, rather than starting with the problem.” [Forbes] “I love data” “I love patterns” [Kim et al., ICSE 2016] http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/print/
  64. 64. John Snow’s theory about cholera came from talking to people [1850’s]
  65. 65. Danger zones… http://blogs.lse.ac.uk/impactofsocialsciences/2015/02/12/philosophy-of-data-science- emma-uprichard/ “Most big data is social data – the analytics need serious interrogation” Social Science+ “It doesn’t matter how much or how good our data is if the approach to modelling social systems is backwards.”
  66. 66. What is “thick” data? Researcher generated “thick” data Explanations, motivations, recommendations Questions rather than answers Variables for a model Future challenges Limitations: Self reporting, researcher bias, ambiguity in instruments and collected data
  67. 67. Beyond “Mixed Methods”: Ethnomining Combines the ethos of ethnography interleaved with data mining techniques around behavioral/social data Storytelling (to support the numbers) Leverages visualization within tight loops of eliciting/reporting results http://ethnographymatters.net/blog/2013/04/02/april-2013-ethnomining-and-the-combination-of-qualitative-quantitative-data/
  68. 68. Tagging work items in
  69. 69. ConcernLines
  70. 70. Research challenges ahead Big data! (of trace and thick data!) Rapid pace of change (increased automation, participatory culture) Studying unstable objects [Rogers] Poor boundaries of study contexts
  71. 71. Kevin Kelly, Futurist: “You’ll be paid in the future based on how well you work with robots.”
  72. 72. Key Takeaway: Big Data needs Thick Data
  73. 73. Future of data science in software engineering? Metrics (late 1960’s) Mining software repositories (mid 2000’s) Software analytics (early 2010’s) Big Data meets Thick Data @margaretstorey
  74. 74. References: “Mad about Measurement”, De Marco, http://ca.wiley.com/WileyCDA/WileyTitle/productCd-0818676450.html Van Solingen, Rini, et al. "Goal question metric (GQM) approach." Encyclopedia of Software Engineering (2002). The Emerging Role of Data Scientists on Software Development Teams, Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel, ICSE May 2016. Analyze This! 145 Questions for Data Scientists in Software Engineering, Andrew Begel and Thomas Zimmermann, ICSE June 2014. Dongmei Zhang & Tao Xie, http://research.microsoft.com/en-us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf Rules of Data Science in SE, see www.slideshare.net/timmenzies/the-art-and-science-of-analyzing-software-data Audris Mockus, Nachiappan Nagappan, Trung T. Dinh-Trong, Test coverage and post-verification defects: A multiple case study. ESEM 2009: 291-301. Shepperd, Martin, David Bowes, and Tracy Hall. "Researcher bias: The use of machine learning in software defect prediction." IEEE Transactions on Software Engineering 40.6 (2014): 603-616.
  75. 75. M. Storey, The Evolution of the Social Programmer, Mining Software Repositories (MSR) 2012 Keynote http://www.slideshare.net/mastorey/msr-2012-keynote-storey-slideshare M. Storey et al., The (R)evolution of Social Media in Software Engineering, ICSE Future of Software Engineering 2014, http://www.slideshare.net/mastorey/icse2014-fose-social-media H. Jenkins, K. Clinton, R. Purushotma, A. J. Robison, and M. Weigel. Confronting the challenges of participatory culture: Media education for the 21st century, 2006. http://digitallearning.macfound.org/atf/cf/%7B7E45C7E0-A3E0-4B89-AC9C-E807E1B0AE4E%7D/JENKINS_WHITE_PAPER.PDF L. Singer, F. F. Filho, B. Cleary, C. Treude, M.-A. Storey, K. Schneider. Mutual Assessment in the Social Programmer Ecosystem: An Empirical Investigation of Developer Profile Aggregators, CSCW 2013. Treude, C., and M.-A. Storey, “Awareness 2.0: staying aware of projects, developers and tasks using dashboards and feeds,” in ICSE’10: Proc. of the 32nd ACM/IEEE Int. Conference on Software Engineering, ACM. C. Treude and M.-A. Storey. Work Item Tagging: Communicating Concerns in Collaborative Software Development. IEEE Transactions on Software Engineering 38, 1 (January/February 2012), pp. 19-34.
  76. 76. [Marcus2014] Gary Marcus and Ernest Davis, "Eight (No, Nine!) Problems with Big Data", New York Times, April 6, 2014. [Harper2013] Richard Harper, Christian Bird, Thomas Zimmermann, and Brendan Murphy, "Dwelling in Software: Aspects of the felt-life of engineers in large software projects", Proceedings of the 13th European Conference on Computer Supported Cooperative Work (ECSCW '13), Springer, September 2013. P. Naur and B. Randell. Software Engineering: Report of a Conference Sponsored by the NATO Science Committee, Garmisch, Germany, Oct. 1968. NATO. McGrath, J. E. "Methodology matters: Doing research in the behavioral and social sciences." Readings in Human-Computer Interaction: Toward the Year 2000 (2nd ed.), 1995. Aranda, Jorge, and Gina Venolia. "The secret life of bugs: Going past the errors and omissions in software repositories." Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009. Ethno-Mining: Integrating Numbers and Words from the Ground Up: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-125.pdf How Google builds a really good development team, New York Times, 2016. [Tufekci2015] Zeynep Tufekci, "Algorithms in our Midst: Information, Power and Choice when Software is Everywhere", Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 1918-1918, ACM 2015.
