An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems

1. Outline: Introduction, Methodology, Results, Conclusion
An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems
Yiming Tang (1), Raffi Khatchadourian (2,1), Mehdi Bagherzadeh (3), Rhia Singh (4), Ajani Stewart (2), Anita Raja (2,1)
(1) City University of New York (CUNY) Graduate Center, USA
(2) City University of New York (CUNY) Hunter College, USA
(3) Oakland University, USA
(4) City University of New York (CUNY) Macaulay Honors College
International Conference on Software Engineering, May 25, 2021, Madrid, Spain (held remotely)
Tang, Khatchadourian, Bagherzadeh, Singh, Stewart, Raja. An Empirical Study of Refactorings & Tech. Debt in ML Systems. 1 / 17

2. Machine Learning Systems & Technical Debt
- Machine Learning (ML) systems, including Deep Learning (DL) systems, are pervasive.
- They do not consist only of ML models; they also encompass complex subsystems that support ML processes [Sculley et al., 2015].
- Like other long-lived, complex systems, they are prone to classic technical debt issues [Tom et al., 2013].
- They also exhibit debt specific to ML systems [Sculley et al., 2014].
- As ML systems become more difficult and expensive to evolve and maintain, knowledge of the modifications required is of the utmost importance.

3. Mining Refactorings in Open-source ML Systems
- Refactoring is a widely accepted mechanism for effectively reducing technical debt [Beck, 1999; Behutiye et al., 2017; Brown et al., 2010; Suryanarayana et al., 2014].
- Understanding the kinds of refactorings performed yields insight into the technical debt tackled in ML systems.
- No previous studies quantify and qualify refactorings and technical debt in open-source ML systems.
- We studied common refactorings in real-world, open-source ML systems to discover:
  - The kinds of refactorings performed, both specific and tangential to ML.
  - Refactoring frequency in model code vs. other supporting subsystems.
  - The types of technical debt being addressed.
  - Whether the debt corresponds to established ML-specific technical debt.
  - Whether any new, potentially generalizable, ML-specific refactorings and technical debt categories could be derived.
- The findings can help improve existing (and drive new ML-specific) automated refactoring techniques, IDE code completion, and automated refactoring mining approaches.

4. Contributions Summary
Refactoring hierarchical taxonomy
- Manually examined 327 patches of 26 projects.
- Built a rich hierarchical, crosscutting taxonomy of common generic and ML-specific refactorings, recording:
  - Whether they occur in ML-related code, i.e., code specific to ML-related tasks (e.g., classifiers, feature extraction, algorithm parameters).
  - The ML-specific technical debt they address.
New ML-specific refactorings & technical debt categories
- Introduce 14 new ML-specific refactorings and 7 new ML-specific technical debt categories.
Recommendations, best practices, & anti-patterns
- Propose preliminary recommendations, best practices, and anti-patterns for long-lasting ML system evolution, based on our statistical results and analysis.

5. Subject ML Systems
- 26 open-source ML systems: ~4.2 million lines of source code, 175,839 Git commits, and 183.76 years of project history in total (7.07 years/subject).
- Vary widely in domain, application, size, and popularity.
- Non-trivial GitHub metrics (stars, forks, collaborators).
- Mix of ML libraries, frameworks, and applications.
- At least one commit whose log message mentions “refactor.”
- Non-trivial portion involving ML.
- Favored Java (popular for large-scale ML [Kamath and Choppella, 2017]), but model code can be in other languages, e.g., Python, C++.
- Includes CoreNLP, Deeplearning4j, and Weka (see paper).

6. Discovering & Classifying Changesets with Refactorings
- Mined repositories for commit logs mentioning “refactor.”
- Randomly selected a subset to manually examine.
- Studied code changes to determine:
  - The refactoring category.
  - Whether the refactoring took place in ML-related code.
  - The ML-specific technical debt, if any, that the refactoring addressed.
- Utilized referenced bug reports and commit log messages.
- Occasionally used RefactoringMiner [Tsantalis et al., 2018] to help isolate refactorings within larger commits.
- Used terms like “cluster” and “train” in the commit logs to help identify whether the changesets were related to ML.
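The keyword-based steps above can be sketched in a few lines of Java. This is a minimal illustration, not the authors' tooling: the class CommitLogFilter and its methods are hypothetical names, and only the terms “refactor,” “cluster,” and “train” come from the slide (the full term list used in the study is not given here).

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper, not the authors' tooling: filters commit log messages by keyword.
final class CommitLogFilter {
    // "cluster" and "train" come from the slide; the study's full term list is an unknown.
    static final List<String> ML_TERMS = List.of("cluster", "train");

    /** True if the message mentions "refactor" (case-insensitive, so "Refactored" matches too). */
    static boolean mentionsRefactor(String message) {
        return message.toLowerCase(Locale.ROOT).contains("refactor");
    }

    /** True if the message contains any ML hint term. This is only a hint, not a verdict. */
    static boolean hintsAtMl(String message) {
        String lower = message.toLowerCase(Locale.ROOT);
        return ML_TERMS.stream().anyMatch(lower::contains);
    }

    /** Keeps only the messages that mention "refactor". */
    static List<String> candidates(List<String> messages) {
        return messages.stream()
                .filter(CommitLogFilter::mentionsRefactor)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "Refactored ClusteringPolicies into hierarchy",
                "Fix typo in README",
                "refactor training loop configuration");
        for (String message : candidates(log)) {
            System.out.println(message + " | ML hint: " + hintsAtMl(message));
        }
    }
}
```

Plain substring matching is deliberately crude: “train” also matches “constraint,” for example, which is one reason a keyword hit can only narrow the set of commits that are then examined by hand.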

7. Quantitative Analysis
- 175,839 total subject commits.
- 2,892 commits with the “refactor” keyword.
- 327 manually examined (randomly selected).
- 285 “true” refactorings.
- 165 appeared in ML-related code.
- Devised a refactoring category hierarchy with two top-level groups: refactorings specifically related to ML and refactorings tangentially related to ML.

8. Figure: Discovered refactorings (hierarchical).

9. Table: Discovered refactorings (nonhierarchical). cnt is the number of refactorings; MLc is the number occurring in ML-related code.

group              category                                            abbr  cnt  MLc
Generic            Defer execution                                     DEF     1    0
                   Make immutable                                      IMM     1    0
                   Make more reusable                                  RUS     1    1
                   Generalization                                      GEN     2    1
                   Make more interoperable                             INT     2    2
                   Simplify regex                                      RGX     2    0
                   Concurrency                                         CON     4    2
                   Safety                                              SAF     5    2
                   Dead code elimination                               DED     6    4
                   Make more extensible                                EXT    11    8
                   New language feature                                LNG    14    5
                   Test                                                TST    15    4
                   Unknown                                             UKN    15   10
                   Improve performance                                 PRF    27   18
                   Duplicate code elimination                          DUP    33   24
                   Clean up                                            CLN    48   26
                   Reorganization                                      ORG    81   41
                   Total                                                     268  148
ML-specific (new)  Make algorithms more visible                        VIZ     1    1
                   Make matrix variable names more verbose             VRB     1    1
                   Monitor feature extraction progress                 MON     1    1
                   Push down hyperparameters                           HYP     1    1
                   Pull up policy                                      PLC     1    1
                   Remove unnecessary matrix operation                 RMA     1    1
                   Replace flags with polymorphic classifier           CLS     1    1
                   Replace flags with polymorphic feature extraction   FET     1    1
                   Replace primitive array with matrix                 AMT     1    1
                   Replace with sparse matrix                          SMT     1    1
                   Replace primitives with rich prediction             PRD     2    2
                   Replace rich model parameter with primitives        RMP     2    2
                   Replace primitives with rich model parameter        PRM     3    3
                   Total                                                      17   17
Grand Total                                                                  285  165
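The table names “Replace flags with polymorphic classifier” (CLS) without showing it. One plausible reading, sketched below under that assumption, is that a flag selecting the algorithm is replaced by one class per algorithm behind a common interface. Every type here (FlagBasedClassifier, Classifier, NearestMeanClassifier, ThresholdClassifier) is hypothetical and the toy decision rules are invented for illustration; none of it is taken from the studied projects.

```java
// Hypothetical illustration; none of these types come from the studied projects.

// Before: an int flag selects the algorithm, so every new classifier edits this method.
final class FlagBasedClassifier {
    static final int NEAREST_MEAN = 0;
    static final int THRESHOLD = 1;

    static int classify(int kind, double x) {
        if (kind == NEAREST_MEAN) {
            // Class means are hard-coded at -1.0 (label 0) and 1.0 (label 1).
            return Math.abs(x - 1.0) < Math.abs(x + 1.0) ? 1 : 0;
        } else if (kind == THRESHOLD) {
            return x > 0.5 ? 1 : 0;
        }
        throw new IllegalArgumentException("unknown classifier kind: " + kind);
    }
}

// After: each algorithm is its own type behind a common interface.
interface Classifier {
    int classify(double x);
}

final class NearestMeanClassifier implements Classifier {
    private final double mean0;
    private final double mean1;

    NearestMeanClassifier(double mean0, double mean1) {
        this.mean0 = mean0;
        this.mean1 = mean1;
    }

    @Override
    public int classify(double x) {
        // Label 1 if x is strictly closer to mean1 than to mean0.
        return Math.abs(x - mean1) < Math.abs(x - mean0) ? 1 : 0;
    }
}

final class ThresholdClassifier implements Classifier {
    private final double threshold;

    ThresholdClassifier(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public int classify(double x) {
        return x > threshold ? 1 : 0;
    }
}

final class ClassifierDemo {
    public static void main(String[] args) {
        Classifier classifier = new ThresholdClassifier(0.5);
        System.out.println(classifier.classify(0.7));
    }
}
```

In this sketch, adding a third algorithm means adding a class rather than another branch, and callers depend only on the Classifier interface.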

14. Figure: Discovered generic refactorings (nonhierarchical). Axes: generic refactoring vs. count.

15. Refactoring-related Finding Highlights
- Performance improvement and reorganization (e.g., inheritance introduction) refactorings crosscut concerns, affecting multiple categories associated with ML systems both specifically and tangentially, and were among the most frequent (37.89%).
- Duplicate code elimination (11.58%) was a major, crosscutting ML system refactoring theme, combating debt in various ways.
- Inheritance introduction, appearing under six categories (more than any other refactoring), was a common and crosscutting way to eliminate duplication.
  - It may be key in coping with subtle variations intrinsic to ML algorithms.
- Despite being the smallest subsystem [Sculley et al., 2015], ML-related code was refactored the most (57.89%).
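The percentages quoted on this slide follow from the counts in the refactorings table (PRF 27, ORG 81, DUP 33, 285 true refactorings, 165 of them in ML-related code). The small check below recomputes them; the class name RefactoringShares is only for illustration.

```java
import java.util.Locale;

// Recomputes the slide's percentages from the refactorings table counts.
final class RefactoringShares {
    static final int TOTAL = 285;         // "true" refactorings
    static final int PERFORMANCE = 27;    // PRF
    static final int REORGANIZATION = 81; // ORG
    static final int DUPLICATE = 33;      // DUP
    static final int ML_RELATED = 165;    // refactorings in ML-related code

    /** Formats part/whole as a percentage with two decimals. */
    static String percent(int part, int whole) {
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * part / whole);
    }

    public static void main(String[] args) {
        System.out.println(percent(PERFORMANCE + REORGANIZATION, TOTAL)); // 37.89%
        System.out.println(percent(DUPLICATE, TOTAL));                    // 11.58%
        System.out.println(percent(ML_RELATED, TOTAL));                   // 57.89%
    }
}
```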

16. Table: Discovered ML-specific technical debt vs. refactoring categories.

Refactoring columns, in order: AMT, CLS, FET, GEN, HYP, LNG, MON, PLC, RMA, RUS, SMT, VIZ, VRB, PRD, DED, INT, PRF, SAF, CLN, PRM, EXT, DUP, ORG, Total.

group     technical debt                     nonzero cells, left to right              total
Existing  Dead experimental code paths       1                                             1
          Abstraction                        1 1                                           2
          Boundary erosion                   2                                             2
          Glue code                          1 1                                           2
          Prototype                          2                                             2
          Monitoring and testing             1 1 1                                         3
          Multiple languages                 1 1 1 2 2                                     7
          Plain-old-data type                1 2 1 1 3 2                                  10
          Configuration                      1 1 1 1 1 1 2 2 2 3 7 15                     37
          Total                              1 1 1 1 1 1 1 1 1 2 3 2 3 3 3 5 3 8 25       66
New       Custom data types                  1                                             1
          Duplicate feature extraction code  1                                             1
          Model code reusability             1                                             1
          Unnecessary model code             1                                             1
          Model code comprehension           1 1 1 1                                       4
          Model code modifiability           5                                             5
          Duplicate model code               17 1                                         18
          Total                              1 1 1 1 1 1 5 18 2                           31

Grand Total per column: AMT 1, CLS 1, FET 1, GEN 1, HYP 1, LNG 1, MON 1, PLC 1, RMA 1, RUS 1, SMT 1, VIZ 1, VRB 1, PRD 2, DED 3, INT 3, PRF 3, SAF 3, CLN 4, PRM 5, EXT 8, DUP 26, ORG 27; overall 97.

21. Technical Debt-related Finding Highlights
- Configuration, duplicate model code, and plain-old-data type were the most tackled technical debt categories (36.84%, 18.95%, and 10.53%, respectively).
- Duplicate code elimination was a major refactoring (27.37%) in reducing ML-specific technical debt, overwhelmingly related to configuring and implementing different yet related ML algorithms (92.31%).
- Inheritance and other reorganization refactorings were commonly (28.42%) used to reduce a variety of ML-specific debt, especially configuration (55.56%).

22. Qualitative Analysis: Duplicate Model Code Debt Example
Pull Up Policy (PLC) refactoring in Mahout: “Refactored ClusteringPolicies into hierarchy under new AbstractClusteringPolicy . . .”

    +public abstract class AbstractClusteringPolicy implements ClusteringPolicy {
    +  public Vector classify(Vector d, ClusterClassifier p){
    +    List<Cluster> models = p.getModels(); /*..*/ }}
     public class CanopyClusteringPolicy
    -    implements ClusteringPolicy {
    +    extends AbstractClusteringPolicy {
    -  public Vector classify(Vector d, List<Cluster> models){
    -    Vector pdfs = new DenseVector(models.size());/*..*/}}
     public class DirichletClusteringPolicy
    -    implements ClusteringPolicy {
    +    extends AbstractClusteringPolicy {
    -  public Vector classify(Vector d, List<Cluster> models){
    -    Vector pdfs = new DenseVector(models.size());/*..*/}}

- Multiple classes represent different clustering algorithm policies.
- Each class previously implemented a common interface.
- An abstract class is introduced, encapsulating common policy functionality.
- Duplicated model code is replaced with polymorphic calls to classify().
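The diff on this slide elides the method bodies, so it cannot be compiled as shown. The shape of the refactoring can still be sketched with stand-in types. In the version below, plain double[] arrays replace Mahout's Vector and Cluster, and the names PolicySketch, AbstractPolicySketch, MostLikelyPolicy, and ThresholdPolicy are hypothetical, as is the inverse-distance scoring rule; none of this is Mahout's actual code.

```java
import java.util.List;

// Hypothetical stand-ins: plain arrays replace Mahout's Vector and Cluster types.
interface PolicySketch {
    double[] classify(double[] point, List<double[]> centers);
    int select(double[] pdfs);
}

// The shared classify() lives here once instead of being copied into every policy.
abstract class AbstractPolicySketch implements PolicySketch {
    @Override
    public double[] classify(double[] point, List<double[]> centers) {
        double[] pdfs = new double[centers.size()];
        double sum = 0.0;
        for (int i = 0; i < pdfs.length; i++) {
            // Illustrative score only: closer centers get larger weights.
            pdfs[i] = 1.0 / (1.0 + distance(point, centers.get(i)));
            sum += pdfs[i];
        }
        for (int i = 0; i < pdfs.length; i++) {
            pdfs[i] /= sum; // normalize so the weights sum to 1
        }
        return pdfs;
    }

    private static double distance(double[] a, double[] b) {
        double squared = 0.0;
        for (int i = 0; i < a.length; i++) {
            squared += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(squared);
    }
}

// Each concrete policy keeps only the behavior that actually differs.
final class MostLikelyPolicy extends AbstractPolicySketch {
    @Override
    public int select(double[] pdfs) {
        int best = 0;
        for (int i = 1; i < pdfs.length; i++) {
            if (pdfs[i] > pdfs[best]) {
                best = i;
            }
        }
        return best;
    }
}

final class ThresholdPolicy extends AbstractPolicySketch {
    private final double threshold;

    ThresholdPolicy(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public int select(double[] pdfs) {
        for (int i = 0; i < pdfs.length; i++) {
            if (pdfs[i] >= threshold) {
                return i;
            }
        }
        return -1; // no center is likely enough
    }
}

final class PullUpPolicyDemo {
    public static void main(String[] args) {
        List<double[]> centers = List.of(new double[] {0, 0}, new double[] {3, 4});
        PolicySketch policy = new MostLikelyPolicy();
        System.out.println(policy.select(policy.classify(new double[] {0, 0}, centers)));
    }
}
```

Both concrete policies inherit the same classify(), so a fix to the scoring logic is made in one place, which is the duplication the slide's refactoring removes.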

23. In the Paper . . .
- More refactoring examples.
- Common attributes of ML-specific technical debt categories.
- Preliminary best practices, anti-patterns, and recommendations.
- Discussion.
- And much more!

24. Conclusion
Studied refactorings performed and the technical debt they alleviate in ML systems:
1. Refactorings both specific and tangential to ML.
2. Occurring within and outside of ML-related code.
3. Hierarchical taxonomy of refactorings in ML systems.
4. Introduced 14 new ML-specific refactorings and 7 new ML-specific technical debt categories.
5. Proposed preliminary recommendations, best practices, and anti-patterns.
Future Work
- Juxtapose findings with developer specialties and expertise.
- Develop automated refactoring approaches.
- Integrate our results into automated refactoring detection techniques [Tsantalis et al., 2018].
- Explore using refactorings for (ML-specific) self-admitted technical debt (SATD) detection.

26. For Further Reading
- Beck, Kent (1999). Extreme Programming Explained: Embrace Change. Addison-Wesley.
- Behutiye, Woubshet Nema, Pilar Rodríguez, Markku Oivo, and Ayşe Tosun (2017). “Analyzing the concept of technical debt in the context of agile software development: A systematic literature review”. In: Information and Software Technology 82, pp. 139–158. doi: 10.1016/j.infsof.2016.10.004.
- Brown, Nanette, Yuanfang Cai, Yuepu Guo, Rick Kazman, Miryung Kim, Philippe Kruchten, Erin Lim, Alan MacCormack, Robert Nord, Ipek Ozkaya, Raghvinder Sangwan, Carolyn Seaman, Kevin Sullivan, and Nico Zazworka (2010). “Managing Technical Debt in Software-Reliant Systems”. In: FSE/SDP Workshop on Future of Software Engineering Research. ACM, pp. 47–52. doi: 10.1145/1882362.1882373.
- Kamath, U. and K. Choppella (2017). Mastering Java Machine Learning. Packt Publishing.
- Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young (2014). “Machine Learning: The High Interest Credit Card of Technical Debt”. In: SE4ML: Software Engineering for Machine Learning. NIPS 2014 Workshop.
- Sculley, D., Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison (2015). “Hidden Technical Debt in Machine Learning Systems”. In: Neural Information Processing Systems. Vol. 2. MIT Press, pp. 2503–2511.
- Suryanarayana, Girish, Ganesh Samarthyam, and Tushar Sharma (2014). Refactoring for Software Design Smells: Managing Technical Debt. 1st ed. Morgan Kaufmann.
- Tom, Edith, Aybüke Aurum, and Richard Vidgen (2013). “An Exploration of Technical Debt”. In: Journal of Systems and Software 86.6, pp. 1498–1516. doi: 10.1016/j.jss.2012.12.052.
- Tsantalis, Nikolaos, Matin Mansouri, Laleh M. Eshkevari, Davood Mazinanian, and Danny Dig (2018). “Accurate and Efficient Refactoring Detection in Commit History”. In: International Conference on Software Engineering.
