Title: Empowering Empirical Research in Software Design: Construction and Studies on a Large-Scale Corpus of UML Models
Ph.D Candidate: Truong Ho-Quang (Chalmers | Gothenburg Univ. Sweden)
Opponent: Dr. Klaas-Jan Stol (University College Cork, Ireland)
Grading Committee Members: 1) Dr. Maria Teresa Baldassarre (University of Bari Aldo Moro, Italy); 2) Dr. Christoph Treude (University of Adelaide, Australia); 3) Dr. Sebastian Herold (Karlstad University, Sweden)
Supervisors: Dr. Michel R.V. Chaudron & Dr. Regina Hebig (Chalmers | Gothenburg Univ. Sweden)
1. Empowering Empirical Research in Software Design:
Construction and Studies on a Large-Scale Corpus of UML Models
Ph.D Thesis Defence
Ph.D Candidate:
Truong Ho-Quang
Chalmers | Gothenburg Univ.
Sweden
Opponent:
Dr. Klaas-Jan Stol
University College Cork
Ireland
Supervisors:
Dr. Michel R.V. Chaudron
Dr. Regina Hebig
Chalmers | Gothenburg Univ.
Sweden
Grading committee members:
Dr. Maria Teresa Baldassarre (University of Bari Aldo Moro, Italy)
Dr. Christoph Treude (University of Adelaide, Australia)
Dr. Sebastian Herold (Karlstad University, Sweden)
2. Content
• Context of the study (the ’Why’)
• Research Approach & Methods (the ’How’)
• Findings (the ’What’)
• Conclusion
3. Design & Modeling: House vs Software
• Design (v.) is the process of making decisions about something that is to be built or created.
• Model (n.) is an abstract representation of a thing or system.
• Modeling (v.) is the process of making models (i.e. choosing what to represent and how to represent it).
• House: expressing the design as house design plans (a house plan).
• Software: expressing the design as models (a UML model).
4. Context of the study
• Software design and modeling are an essential part of the
software development process.
• There is a lack of empirical research about software design
and modeling in practice.
5. Example of contradictory findings
• Empirical research about use of UML (*) is contradictory
– [Petre, 2013]: ”… The majority of those interviewed (50)
simply do not use UML. Informants criticized UML for its
complexity, lack of formal semantics, inconsistency …”
– [Scanniello et al., 2010]: ”… the majority of the companies
(20/22) use UML in their projects…”
– [Anda et al., 2006]: ” … The interviewees obtained
immediate improvements as a consequence of
introducing a UML-based development method …”
(*) Abbreviation of the Unified Modeling Language
6. Problem statement
• Lack of practical guidelines on the use of modeling
– How is software design and modeling used?
– What are the impacts of software design and modeling?
• Lack of empirical data about software design & modeling
– Lack of generalisability and replicability
Goal of the PhD:
To empower empirical studies in software design and modeling by
collecting and studying a large corpus of software modeling artifacts
from real-life software systems.
8. Goals of the Ph.D study
• Dataset
– G1. Building & sharing a corpus of curated modeling artifacts
• Knowledge
– G2. Modeling practices
– G3. Impacts of modeling
9. Scope: What modeling artifacts?
Modeling languages
We focus on the Unified Modeling Language (UML) because:
• UML has become the de-facto standard for software
modeling in industry.
10. Scope: Which sources of UML models?
Sources of UML models
• Industry
– Benefits: industry-relevant context
– Challenges: data availability, generalisation, replicability
• OSS projects
– Benefits: OSS-relevant context, data availability, data transparency
– Challenges: identification of UML files, filtering out toy projects
11. Challenges to identifying impacts:
Modeling is contextual (*)
(*) Figure from this paper:
Fernández-Sáez, Ana M., Michel RV Chaudron, and Marcela Genero. "An industrial case study on the use of UML in software
maintenance and its perceived benefits and hurdles." Empirical Software Engineering 23.6 (2018): 3281-3345.
12. Scope: Studying impacts in context
• Dataset
– G1. Building & sharing a corpus of curated modeling artifacts
• Knowledge
– G2. Modeling practices
– G3. Impacts of modeling
[Figure: G1. Collection, G2. Practices, and G3. Impacts, connected by two "Enables" links, with "Context of use" shown alongside.]
13. Relationship of papers and goals
[Figure: papers A to H mapped to the goals G1 (Corpus), G2 (Practices), and G3 (Impacts). Legend: a link from paper 1 to paper 2 means that paper 2 extends paper 1.]
14. Research methodology
[Table: goals (G3. Impacts, G2. Practices, G1. Collection) against papers A to H, annotated with the methods used, in this order: empirical method (experiment, case study, user study); empirical method (survey study); constructive method.]
16. Contributions to Goal G1.
Data Contribution
• Lindholmen dataset
– 93k+ UML models from 24k+ GitHub projects
– Metadata of the UML models and projects
– Data are curated
• Researchers are using it
– 12 published papers by other authors
Lindholmen Dataset: http://models.cs.chalmers.se/oss/
G1. Collection
17. Contributions to Goal G1.
Data Collection Process
• A complete tool set for automatically crawling UML models from
GitHub projects
• This can be extended to crawl other software development artifacts
Pipeline (figure):
• Input: GHTorrent, ~ 12 800 000 non-forked repos
• Step 1: Data collection (GitHub), producing a potential UML file list
• Step 2: Filter UML files (UML image filter, textual filter, validation), producing the UML file list
• Step 3: Extract meta-data
• Step 4: Query database
• Step 5: Analyse result
• Tools shown: CVSAnalY, MySQL
• Output: 93 648 UML files & 24 797 projects that use UML
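As a rough illustration of step 2 (filter UML files), the sketch below narrows a list of repository files down to likely UML files. It is not the tool set from the thesis: the extension lists, name hints, and text markers are all assumptions made for the example, and the image filter is replaced by a simple file-name check.

```python
from pathlib import PurePosixPath

# All heuristics below are hypothetical; the slides do not list the real rules.
MODEL_EXTENSIONS = {".uml", ".xmi"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".bmp"}
NAME_HINTS = ("uml", "diagram", "class", "sequence")
TEXT_MARKERS = ("xmi:XMI", "<XMI", "uml:Model")


def is_candidate(path: str) -> bool:
    """Cheap first pass: keep model files, and images whose name hints at UML."""
    p = PurePosixPath(path.lower())
    if p.suffix in MODEL_EXTENSIONS:
        return True
    if p.suffix in IMAGE_EXTENSIONS:
        return any(hint in p.name for hint in NAME_HINTS)
    return False


def passes_textual_filter(content: str) -> bool:
    """Second pass for text-based model files: look for XMI/UML markers."""
    return any(marker in content for marker in TEXT_MARKERS)


def filter_uml_files(files: dict[str, str]) -> list[str]:
    """`files` maps a path to its text content (empty string for images)."""
    kept = []
    for path, content in files.items():
        if not is_candidate(path):
            continue
        suffix = PurePosixPath(path.lower()).suffix
        # Text-based model files must also contain a UML/XMI marker.
        if suffix in MODEL_EXTENSIONS and not passes_textual_filter(content):
            continue
        kept.append(path)
    return kept


sample_files = {
    "docs/model.xmi": '<xmi:XMI xmlns:uml="...">',
    "src/main.py": "print(1)",
    "docs/class_diagram.png": "",
    "data/config.xmi": "<config/>",
    "img/logo.png": "",
}
uml_files = filter_uml_files(sample_files)
```

A real image filter would need to inspect the picture itself rather than its name; the name check here only stands in for that step.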
18. Contributions to Goal G2.
• Scientific insights into how UML is used in OSS
projects.
G2. Practices
19. Paper B: “How is UML used in OSS projects?”
• Most projects work on UML only briefly, usually at the beginning.
• Models are introduced during all possible phases in the lifespan of
OSS projects.
• A few projects are active with UML during their whole lifetime.
G2. Practices
20. Paper C: Why is UML used in OSS projects?
• The majority of models are intended for creating
software designs and documenting software systems.
• Non-UML Contributors (NUCs) benefit from UML models
for understanding a system and for communication.
G2. Practices
21. Contributions to goal G3.
• Scientific insights into impacts of software
design and UML modeling in software
development
G3. Impacts
22. Paper C: What are the impacts of using UML?
• UML is helpful for new contributors to get up to speed.
• UML changes the working routine, mainly in
the planning phase and in communication.
G3. Impacts
23. Paper F: Impacts of using UML on defect proneness?
• Compared: 93 GitHub projects without UML models and 50 GitHub projects that have UML models.
• Projects with UML have about 35% fewer bugs reported
than projects without UML.
G3. Impacts
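To make the "about 35% fewer" figure concrete, here is a minimal sketch of a relative-reduction calculation. The slides do not say which statistic Paper F compares (for example means or medians, raw or normalised counts), so the function and both input values are hypothetical and only illustrate the arithmetic.

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Fraction by which `treated` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - treated) / baseline


# Hypothetical values, chosen only to illustrate the calculation.
bugs_without_uml = 100.0
bugs_with_uml = 65.0

reduction = relative_reduction(bugs_without_uml, bugs_with_uml)
print(f"{reduction:.0%} fewer bugs reported")
```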
24. Paper H: Impacts of using role stereotypes on developers’ understanding?
[Figure: the RoleViz tool (*), with charts comparing RoleViz and Softagram on mean TLX, mean SUS, and mean understanding.]
• Participants achieved better scores on software understanding
tasks with RoleViz, without any cognitive-load penalty.
(*) Demo video is available online at: https://youtu.be/1JYQMPMF9do
G3. Impacts
25. Conclusion
I am proud that after five years of my PhD study:
• I went from little data to big empirical data on UML use.
• I made qualitative and quantitative observations on the
use and impacts of software design and
modeling.
• I ran into an area that has lots of room for discussion &
future research.
• I was part of great teams and worked with great colleagues.
Thank you for your attention!