
Mining Sociotechnical Information From Software Repositories



A large amount of data is produced during collaborative software development. Analyzing this data offers a great opportunity to better understand Software Engineering from the perspective of evidence-based research. Mining software repositories (MSR) studies have explored both the technical and social aspects of software development and have contributed to the discovery of important information about how software evolves and how developers collaborate. Several repositories store data regarding source code production (version control systems), communication between developers and users (forums and mailing lists), and coordination of activities (issue trackers, task managers, etc.). In the open source world, such data is available in large software development ecosystems. Platforms such as GitHub host millions of repositories, which receive contributions from millions of developers worldwide. Some project repositories record more than a decade of development, enabling the analysis of projects from a historical perspective. In this talk, I will discuss some of the uses and challenges of mining software repositories, focusing on work conducted in our group, such as: identification of change dependencies, evaluation of architectural degradation from commit metadata, core-periphery analysis of developer participation, change-proneness prediction, analysis of the impact of refactoring on code quality, and relations between quality attributes of tests and the code being tested.

Published in: Technology

  1. Mining Sociotechnical Information From Software Repositories. Marco Aurélio Gerosa, University of São Paulo, Brazil. University of California, Irvine, Informatics Seminar, March/2014
  2. Software repositories. [Figure labels: SW architect, manager, tester, programmers, client, user; computer-mediated tools; source code, issues, bug reports, message archives, etc.] Current and historical artifacts and interactions are registered in software repositories
  3. Repositories of repositories. 11.3 million repositories, 5.4 million users. In 2013: • 3 million new users • 152 million pushes • 25 million comments • 14 million issues • 7 million pull requests. Other figures shown on the slide: 93K projects; 1 million users; 250K projects; 661K projects; 29 billion lines of code; 3 million users; 30K projects; 324K projects; 3.4 million developers; 36K projects; 33K projects; 200 projects
  4. Mining software repositories. [Diagram labels: 3C Model* (Communication, Coordination, Cooperation); source code and artifacts, discussion lists, comments on issues, code comments, user reports, Q&A sites, social media, issue trackers, project management systems, reputation systems; collaboration and software production; mining; applications; practitioner, researcher; information about a project, decision making, information about an ecosystem, software understanding, information about Software Engineering, support for maintenance, empirical validation of ideas & techniques.] Tag cloud from the MSR 2014 CFP. “The Mining Software Repositories (MSR) field analyzes the rich data available in software repositories to uncover interesting and actionable information about software systems and projects.” *3C Model: Fuks, H., Raposo, A., Gerosa, M.A., Pimentel, M. & Lucena, C.J.P. (2007) “The 3C Collaboration Model” in: The Encyclopedia of E-Collaboration, Ned Kock (org), ISBN 978-1-59904-000-4, pp. 637-644.
  5. Examples
  6. Sentiment analysis on commits. GitHub Data Challenge, 2nd place
  7. Programming language relations
  8. Mining discussion lists. What is the discussion list for? Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, and Anand Swaminathan. 2006. Mining email social networks. MSR 2006. Anja Guzzi, Alberto Bacchelli, Michele Lanza, Martin Pinzger, and Arie van Deursen. 2013. Communication in open source software development mailing lists. MSR 2013
  9. Coordination requirements. Cataldo, M., Dependencies in geographically distributed software development: Overcoming the limits of modularity, PhD Thesis. Santana, F. et al. “XFlow: An Extensible Tool for Empirical Analysis of Software Systems Evolution”. ESELAW 2011
  10. Programmers who changed this function also changed … Thomas Zimmermann, Peter Weissgerber, Stephan Diehl, and Andreas Zeller. 2005. Mining Version Histories to Guide Software Changes. IEEE Trans. Software Eng. 31, 6 (June 2005), 429-445. DOI=10.1109/TSE.2005.72
  11. Don’t program on Fridays
  12. Which files are more buggy?
  13. Lack of documentation. Deficient Documentation Detection: A Methodology to Locate Deficient Project Documentation using Topic Analysis, Joshua Charles Campbell, Chenlei Zhang, Zhen Xu, Abram Hindle, and James Miller, MSR 2013
  14. Some questions from MSR 2013 • How to recommend developers to bug reports? • How to filter update notifications? • What can apps’ permissions reveal? • How to extract feature requests from apps’ online reviews? • How to analyze peer code-review data? • How to find API usage patterns? • Is programming knowledge related to age? • How do patches reach the Linux kernel? • How do code clones evolve? • How to process crash reports? • How to predict defects and effort? • Etc.
  15. Some of our work
  16. Some of our work: 1. Change dependencies identification 2. Design degradation identification 3. Key developers characterization 4. Unit tests’ feedback for code quality 5. Refactoring 6. Change prediction 7. Support for OSS project newcomers
  17. 1) Change dependencies identification. Dependency? [Figure: ClassA, ClassB, Dependency, Change]
      • “A dependency means that a client element has knowledge of the supplier element and a change in the supplier may affect the client” [Larman, 2004]
      • A dependency relation means that the semantics of the depending elements is semantically or structurally dependent on the definition of the supplier element [UML Formal Specification]
      • Unrecognized dependencies result in a higher number of defects [Herbsleb et al., 2006]
      • Structural (or static), dynamic, or logical dependency?
      • Dependencies may be hard to identify
      • Publisher/subscriber, polymorphism, clones, crosscutting concerns, semantic relations, etc.
  18. Change dependencies. Files frequently changed together share some sort of dependency [Gall et al., 1998]. [Figure: strong change dependency from B to A; the opposite is a much weaker dependency.] A logical dependency denotes an implicit and evolutionary relationship between software artifacts. The concept has proven useful in several different studies:
      ◦ Software quality analysis [Cataldo & Nambiar, 2010]
      ◦ Bug prediction [D’Ambros et al., 2009a]
      ◦ Change prediction and change impact analysis [Zimmermann et al., 2005]
      ◦ Uncover cross-cutting concerns [Adams et al., 2010]
      ◦ Uncover design flaws and opportunities for refactoring [D’Ambros et al., 2009b]
      ◦ Understand and evaluate software architecture [Zimmermann et al., 2003]
      ◦ Requirements traceability [Ali et al., 2013]
      ◦ Maintain documentation [Kagdi et al., 2006]
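The co-change idea on the slide above can be illustrated with a small sketch. It assumes each commit is simply a list of changed file names, and it uses a directional confidence (how often B's changes also touch A) to show why the dependency from B to A can be strong while the opposite is weak. The function name, the `min_support` threshold, and the sample data are hypothetical and are not taken from the talk.

```python
from collections import Counter
from itertools import combinations


def change_dependencies(commits, min_support=2):
    """Return {(x, y): confidence}, where confidence is the share of
    commits touching x that also touch y (a directional measure)."""
    file_count = Counter()   # commits touching each file
    pair_count = Counter()   # commits touching both files of a pair
    for files in commits:
        unique = sorted(set(files))
        file_count.update(unique)
        pair_count.update(combinations(unique, 2))

    deps = {}
    for (a, b), together in pair_count.items():
        if together < min_support:
            continue  # too few joint changes to count as evidence
        deps[(a, b)] = together / file_count[a]
        deps[(b, a)] = together / file_count[b]
    return deps


# Hypothetical history: B never changes without A, but A often changes alone.
commits = [
    ["A.java", "B.java"],
    ["A.java", "B.java"],
    ["A.java"],
    ["A.java", "C.java"],
    ["A.java"],
]
deps = change_dependencies(commits)
print(deps[("B.java", "A.java")])  # dependency from B to A
print(deps[("A.java", "B.java")])  # the weaker opposite direction
```

In this toy history every change to B comes with a change to A, while only two of A's five changes involve B, which mirrors the asymmetry described on the slide.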
  19. 1.1) Structural x Change Dependencies. Overlap between change dependencies and structural dependencies. [Figure: Structural dependency, Co-change?] Gustavo Oliva, PhD candidate. Analysis of 150K commits of the ASF showed that: - 93% of the change dependencies did not involve structural dependencies - 95% of the structural dependencies did not imply a change dependency. Oliva, G.A., Gerosa, M. A. (2011) “On the Interplay between Structural and Logical Dependencies in Free Software”. Brazilian Symposium on Software Engineering (SBES 2011)
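As a rough illustration of how overlap percentages like those on the slide above could be computed, the sketch below compares two sets of file pairs. It treats both kinds of dependency as undirected pairs, which is a simplifying assumption; the function name and the sample pairs are hypothetical and do not reproduce the study's data or method.

```python
def overlap_report(change_deps, structural_deps):
    """Share of each dependency kind that has no counterpart of the other kind.

    Pairs are compared as unordered sets (an assumption of this sketch).
    """
    change = {frozenset(pair) for pair in change_deps}
    structural = {frozenset(pair) for pair in structural_deps}
    shared = change & structural
    return {
        "change_without_structural":
            1 - len(shared) / len(change) if change else 0.0,
        "structural_without_change":
            1 - len(shared) / len(structural) if structural else 0.0,
    }


# Hypothetical input: four co-change pairs, two structural pairs, one in common.
change_pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
structural_pairs = [("B", "A"), ("A", "E")]
report = overlap_report(change_pairs, structural_pairs)
print(report)
```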
  20. 1.2) Change dependencies origins. Manual classification of commits to understand the origins of change dependencies (Gustavo Oliva, PhD candidate)

      Category                                                    Joint-changes (total)   %
      Refactoring elements that belong to a same semantic class   80                      19.6%
      Structural dependencies on a changing semantic class        9                       2.2%
      Cross-cutting concerns                                      165                     40.4%
      Overloaded revision                                         60                      14.7%
      Repository operations                                       21                      5.1%
      Structural dependencies on specific elements                66                      16.2%
      Other reasons                                               7                       1.7%
      Total                                                       408

      Oliva, G.A., Santana, F., Gerosa, M. A., Souza, C. (2011) “Towards a Classification of Logical Dependencies Origins: A Case Study”. Proceedings of the 12th International Workshop on Principles of Software Evolution and the 7th annual ERCIM Workshop on Software Evolution (IWPSE-EVOL '11)
  21. 1.3) Preprocessing commits. Commit = change? Using the sliding time window approach [Zimmermann & Weißgerber, 2004] to group SVN commits (Gustavo Oliva, PhD candidate)

               N         Sum         Mean   StDev   Skewness   Kurtosis
      Before   479,794   3,206,900   6.68   37.84   33.80      1,844.00
      After    453,865   3,174,051   6.99   40.79   39.56      2,829.94

      Evaluation in the Apache code repository showed that the produced grouping corresponded to 4.6% of the number of commits. What about commit habits/practices/policies? Social aspects matter! Oliva, G. A., Santana, F., Gerosa, M. A., Souza, C. (2012), “Preprocessing Change-Sets to Improve Logical Dependencies Identification”, 6th Int. Workshop on Software Quality and Maintainability (SQM 2012)
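A minimal sketch of sliding-time-window grouping follows, to make the preprocessing step on the slide above concrete. It assumes commits are merged when they share author and log message and each one falls within a fixed number of seconds of the previous commit in the same group; the window slides because it is measured from the latest commit, not the first. The `Commit` record, the 200-second default, and the sample data are assumptions of this sketch and are not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    author: str
    message: str
    timestamp: int                      # seconds since some epoch
    files: list = field(default_factory=list)


def group_commits(commits, window=200):
    """Group commits with the same author and message whose gap to the
    previous commit in the group is at most `window` seconds."""
    groups = []
    latest = {}  # (author, message) -> most recent group for that key
    for commit in sorted(commits, key=lambda c: c.timestamp):
        key = (commit.author, commit.message)
        group = latest.get(key)
        if group is not None and commit.timestamp - group[-1].timestamp <= window:
            group.append(commit)        # still inside the sliding window
        else:
            group = [commit]            # start a new logical change
            groups.append(group)
            latest[key] = group
    return groups


# Hypothetical history: three quick commits by one author, then a late one.
history = [
    Commit("ana", "fix parser", 0, ["a.c"]),
    Commit("ana", "fix parser", 150, ["b.c"]),
    Commit("ana", "fix parser", 300, ["c.c"]),
    Commit("bob", "update docs", 310, ["README"]),
    Commit("ana", "fix parser", 900, ["d.c"]),
]
groups = group_commits(history)
sizes = [len(g) for g in groups]
print(sizes)
```

The first three commits end up in one group even though the first and third are 300 seconds apart, because each is within the window of its predecessor.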
  22. 2) Design degradation identification. Rigidity and fragility [Martin & Martin, 2006] identification based on commit metadata (Gustavo Oliva, PhD candidate)
      • Rigidity => designs difficult to change due to ripple effects => commit density (number of changed files per commit)
      • Fragility => designs break in different areas when a change is performed => commit dispersion (distance in the directory tree among file paths included in a commit)
      Oliva, G., Steinmacher, I., Wiese, I.S., Gerosa, M.A. “What Can Commit Metadata Tell Us About Design Degradation?”, In: 13th International Workshop on Principles on Software Evolution (IWPSE 2013), Saint Petersburg.
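The two commit-level measures on the slide above can be sketched as follows. Commit density is taken directly as the number of distinct changed files. For commit dispersion the slide only says "distance in the directory tree among file paths", so this sketch assumes one possible reading: the mean, over all file pairs, of the number of directory steps between the two files' folders. The function names and sample paths are hypothetical.

```python
from itertools import combinations
from pathlib import PurePosixPath


def commit_density(paths):
    """Number of distinct files changed in a commit."""
    return len(set(paths))


def tree_distance(p, q):
    """Directory steps from p's folder up to the common ancestor and down to q's folder."""
    a = PurePosixPath(p).parent.parts
    b = PurePosixPath(q).parent.parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def commit_dispersion(paths):
    """Mean pairwise directory distance (an assumed definition, see lead-in)."""
    pairs = list(combinations(sorted(set(paths)), 2))
    if not pairs:
        return 0.0
    return sum(tree_distance(p, q) for p, q in pairs) / len(pairs)


# Hypothetical commit touching two packages.
paths = ["src/core/A.java", "src/core/B.java", "src/ui/C.java"]
density = commit_density(paths)
dispersion = commit_dispersion(paths)
print(density, dispersion)
```

Under this reading, a commit confined to one folder has dispersion 0, and the value grows as the changed files sit further apart in the tree.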
  23. 3) Key developers characterization. Key developers’ participation. Oliva, G., Santana, F.W., da Silva, J. T., Oliveira, K.C.M., Werner, C.M.L., Souza, C.R.B. & Gerosa, M.A., “Evolving the System’s Core: A Case Study on the Identification and Characterization of Key Developers in Apache Ant”, Computing and Informatics [to appear].
  24. 4) Unit tests’ feedback for code quality (Mauricio Aniche, PhD candidate). Does the number of asserts indicate: - Cyclomatic complexity? - LOC? - Method calls? Findings: - The “Asserted Objects” metric presents better results than “number of asserts” - Statistically significant difference in 20% of the projects. Data: 22 ASF projects, 3 industry projects. Aniche, M., Oliva, G.A., Gerosa, M.A., “What Do the Asserts in a Unit Test Tell Us about Code Quality? A Study on Open Source and Industrial Projects”, 17th European Conference on Software Maintenance and Reengineering (CSMR 2013).
  25. 5) Refactoring. Most documented refactorings do not reduce cyclomatic complexity. However, 23% of the documented refactorings reduce cyclomatic complexity, while 12% of the other commits have the same effect. Francisco Sokol, MSc candidate. Sokol, F., Aniche, M.F., Gerosa, M.A., “Does the Act of Refactoring Really Make Code Simpler? A Preliminary Study”. In: IV Brazilian Workshop of Agile Methods (WBMA 2013).
  26. 6) Change prediction. Using social, process, and architectural metrics to improve change proneness prediction (Igor Wiese, PhD candidate). 13 projects, 6 classifiers, 11 metrics. Social and architectural metrics improved the prediction model. Similar results when considering projects grouped by change ratio. Dimensions: Process, Social, Architecture.

      Metric name                          Reference
      HCM                                  Hassan [13]
      WHCM                                 D’Ambros et al. [8]
      Churn                                D’Ambros et al. [8]
      WChurn                               Nagappan and Ball [23]
      MFD – Modification File-Developer    New
      DevParticipation                     Rahman & Devanbu [28]
      WDevParticipation                    New
      Rigidity                             Oliva et al. [26]
      Fragility                            Oliva et al. [26]
      WCo-Changes                          New
      UD – Unworked Dependencies           New

      Wiese, I. S, Nassif Jr, Steinmacher I, Re Reginaldo, Gerosa, M.A “Comparing communication and development networks for predicting file change proneness: An exploratory study considering process and social metrics”. In: International Workshop on Software Quality and Maintainability (SQM 2014).
  27. 7) Supporting Newcomers to OSS projects. Analysis of newcomers’ dropout reasons. Use of MSR to identify newcomers. Igor Steinmacher, PhD candidate. Systematic review on awareness in DSD [JCSCW 2013]. [3C Model diagram, translated from Portuguese: COMMUNICATION (comum + ação: the act of making common) generates commitments managed by COORDINATION (co + ordem + ação: the act of organizing together), which organizes the tasks for COOPERATION (co + operar + ação: the act of operating together), which demands COMMUNICATION; Awareness sits at the center.] Steinmacher, I., Wiese, I.S., Chaves, A.P., Gerosa, M.A., Why do newcomers abandon open source software projects?, 6th Int. Workshop on Cooperative and Human Aspects of Software Engineering (CHASE 2013)
  28. 7) Supporting Newcomers to OSS projects ◦ Systematic Literature Review on barriers for newcomers to OSS ◦ Qualitative analysis of interviews. Igor Steinmacher, PhD candidate. Steinmacher, I., Wiese, I., Conte, T., Gerosa, M.A., Redmiles, D.F. The Hard Life of Newcomers to OSS Projects, 7th International Workshop on Cooperative and Human Aspects of Software Engineering. Steinmacher, I., Graciotto Silva, M. A., Gerosa, M.A., Barriers faced by newcomers to open source projects: a systematic review. 10th International Conference on Open Source Systems (OSS 2014)
  29. And now?
  30. Some MSR challenges
      • Years of sociotechnical data of millions of projects
      • Big data & large-scale computing
      • Natural language processing
      • Zhong, H., Zhang, L., Xie, T. & Mei, H. “Inferring Resource Specifications from Natural Language API Documentation”, ASE 2009.
      • Incomplete and inaccurate information (e.g., commit comments)
      • Multiple identities and commits on behalf of others
      • Traceability and provenance
      • Davies, German, Godfrey & Hindle. “Software bertillonage: finding the provenance of an entity”, MSR 2011
      • Tool usage changes over time
      • Guzzi et al. “Communication in open source software development mailing lists”, MSR 2013
  31. What’s next? (1/2) Towards theories and consolidation
      • Sharing insights
      • Sharing patterns ◦ DAPSE ◦ MSR Cookbook
      • Sharing data ◦ [Gaines, 1999] / Dataset papers published at MSR ◦ MSR Challenges dataset ◦ MetricMiner
      • Replicating studies ◦ Ghezzi, G. & Gall, H. “Replicating Mining Studies with SOFAS”, MSR 2013
  32. What’s next? (2/2)
      • Actionable information in the Software Environments (new IDEs)
      • Software Analytics ◦ “The use of analysis, data, and systematic reasoning to make decisions”
      • New sources
      ◦ A lot of information is lost between commits => IDE & desktop instrumentation and live data collection
      ◦ Singer, J. “Navtracks: Supporting navigation in software maintenance”, ICSM'05
      ◦ Blincoe, K., Valetto, G. & Goggins, S. “Proximity: a measure to quantify the need for developers' coordination”, CSCW'12
      ◦ Software execution logs
      ◦ Social Media
      ◦ Storey, M., Treude, C., Deursen, A. & Cheng, L., “The impact of social media on software engineering practices and tools”, FoSER'10
      ◦ Bougie, G., Starke, J., Storey, M. & German, D., “Towards understanding twitter use in software engineering: preliminary findings, ongoing challenges and future questions”, Web2SE'11
      ◦ Q&A Sites
      ◦ “Over 92% of Stack Overflow questions about expert topics are answered - in a median time of 11 minutes” [Mamykina et al., CHI'11]
      ◦ Campbell et al., “Deficient Documentation Detection: A Methodology to Locate Deficient Project Documentation using Topic Analysis”, MSR'13
  33. Conferences
      ◦ MSR: Working Conference on Mining Software Repositories
      ◦ ICSM: International Conference on Software Maintenance
      ◦ CSMR-WCRE: Software Evolution Week (joins CSMR and WCRE)
      ◦ ICSE: International Conference on Software Engineering
      ◦ SCAM: International Working Conference on Source Code Analysis and Manipulation
      ◦ ESEM: International Symposium on Empirical Software Engineering and Measurement
      ◦ PROMISE
  34. Thank you! Marco Aurelio Gerosa (gerosa@ime.usp.br), Office 5228 @ DBH. @gerosa_marco. http://lapessc.ime.usp.br/ (all publications are available at this site). http://www.ime.usp.br/~gerosa. http://nap.usp.br/naweb
  35. Other projects. Smart Audio City Guide: a social mobile system based on georeferenced audio information to support the mobility of blind people. 82,000 images of Brazilian architecture
  36. Opportunities for Research in Brazil. More than 38k scholarships implemented in 2013 for sending students abroad. Visiting researcher (PVE Ciencia sem Fronteiras) ◦ Projects for 2 or 3 years, with a stay of 30 to 90 days per year in Brazil, continuous or not ◦ ≈ US$ 6400 (R$ 14,000)/month stayed in Brazil ◦ Transportation assistance, among other benefits. Young Talent Attraction (BJT) ◦ 1 to 3 years ◦ ≈ US$ 23,000 (R$ 50,000)/year for funding + scholarships ◦ Min ≈ US$ 3200 (R$ 50,000)/month ◦ Support for accommodation and transportation. FAPESP (Sao Paulo state) ◦ Visiting professor: ≈ US$ 6300, 5000, or 4200 (R$ 13,653, R$ 10,950, R$ 9184)/month depending on the level, visits of any length up to 12 months + support for transportation and health insurance ◦ ≈ US$ 3200 or US$ 4100 (R$ 6900 or R$ 8900)/month depending on the level ◦ Transportation assistance, support for accommodation. Visiting scholar (CAPES PVE) ◦ Visits from 15 days to 1 year ◦ Min ≈ US$ 9200 (R$ 20,000)/year for funding ◦ Post-doc: ≈ US$ 2700 (R$ 5908) + support for accommodation. 6 to 36 months. USP (University of Sao Paulo) ◦ Specific grants for receiving retired researchers or faculty on sabbatical. Bilateral agreements (with several universities). [Chart: CNPq investment in research (in BRL millions), 1996 to 2012] [Chart: Scholarships abroad (in thousands of BRL), 1996 to 2012]
  37. References
      Change dependencies
      [Larman, 2004] Larman, Craig: Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development. 3rd ed. Prentice Hall, 2004
      [UML Formal Specification] OMG (Object Management Group): Unified Modeling Language (UML).
      [Ball et al., 1997] Ball, Thomas; Kim, Jung-Min; Porter, Adam A.; Siy, Harvey P.: If Your Version Control System Could Talk... In: ICSE Workshop on Process Modeling and Empirical Studies of Software Engineering, 1997
      [Gall et al., 1998] Gall, Harald; Hajek, Karin; Jazayeri, Mehdi: Detection of Logical Coupling Based on Product Release History. In: Proceedings of the International Conference on Software Maintenance, ICSM ’98. Washington, DC, USA: IEEE Computer Society, 1998, ISBN 0-8186-8779-7, p. 190–
      Change dependencies and their use
      [Cataldo & Nambiar, 2010] Cataldo, Marcelo; Nambiar, Sangeeth: The impact of geographic distribution and the nature of technical coupling on the quality of global software development projects. In: Journal of Software Maintenance and Evolution: Research and Practice, John Wiley & Sons, Ltd. (2010)
      [D’Ambros et al., 2009a] D’Ambros, Marco; Lanza, Michele; Robbes, Romain: On the Relationship Between Change Coupling and Software Defects. In: Zaidman, A.; Antoniol, G.; Ducasse, S. (eds.): 16th Working Conference on Reverse Engineering, WCRE 2009, 13-16 October 2009, Lille, France: IEEE Computer Society, 2009, ISBN 978-0-7695-3867-9, pp. 135–144
      [Zimmermann et al., 2005] Zimmermann, Thomas; Weißgerber, Peter; Diehl, Stephan; Zeller, Andreas: Mining Version Histories to Guide Software Changes. In: IEEE Trans. Softw. Eng. vol. 31. Piscataway, NJ, USA, IEEE Press (2005), Nr. 6, pp. 429–445
  38. References
      Change dependencies and their use (continued)
      [Adams et al., 2010] Adams, Bram; Jiang, Zhen Ming; Hassan, Ahmed E.: Identifying crosscutting concerns using historical code changes. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10. Cape Town, South Africa: ACM, 2010, ISBN 978-1-60558-719-6, pp. 305–314
      [D’Ambros et al., 2009b] D’Ambros, Marco; Lanza, Michele; Lungu, Mircea: Visualizing Co-Change Information with the Evolution Radar. In: IEEE Trans. Software Eng. vol. 35 (2009), Nr. 5, pp. 720–735
      [Zimmermann et al., 2003] Zimmermann, T.; Diehl, S.; Zeller, A.: How history justifies system architecture (or not). In: Software Evolution, 2003. Proceedings. Sixth International Workshop on Principles of, 2003, pp. 73–83
      [Ali et al., 2013] Ali, N.; Jaafar, F.; Hassan, A.E.: Leveraging historical co-change information for requirements traceability. In: Reverse Engineering (WCRE), 2013 20th Working Conference on, 2013, pp. 361–370
      [Kagdi et al., 2006] Kagdi, H.; Maletic, J.I.: Mining for Co-Changes in the Context of Web Localization. In: Web Site Evolution, 2006. WSE ’06. Eighth IEEE International Symposium on, 2006, pp. 50–57
      Improving change dependencies identification
      [Zimmermann & Weißgerber, 2004] Zimmermann, Thomas; Weißgerber, Peter: Preprocessing CVS Data for Fine-Grained Analysis. In: Proceedings 1st International Workshop on Mining Software Repositories (MSR 2004). Los Alamitos CA: IEEE Computer Society Press, 2004, pp. 2–6
      Design degradation
      [Martin & Martin, 2006] Martin, Robert C.; Martin, Micah: Agile Principles, Patterns, and Practices in C#. 1st ed. Prentice Hall, 2006