This document explores language interaction in software projects. It analyzes the Apache Hadoop project to address three research questions. RQ1 finds that over 50% of commits involve more than one language. RQ2 finds that C, SH, properties, and XML files interact most with other languages. RQ3 finds that while cross-language modules generally are not more defect-prone, some language pairs like C-Java have significantly more defects in cross-language modules. The study provides initial evidence on language interactions in software but notes threats from confounding factors and the single-project scope.
Languages interaction study explores defect relationships
1. Languages interaction and possible
effects: an exploratory study
Antonio Vetrò - Federico Tomassetti
Marco Torchiano - Maurizio Morisio
2. No one writes in a single language anymore. Even trivial applications
have a general-purpose language, SQL, JavaScript, CSS, and dozens of
frameworks, each of which includes an external DSL
Wampler 2010
3. How do those languages interact?
Is that interaction problematic?
4. Research questions
RQ1 How much interaction is there between the
languages used in a project?
RQ2 Which language pairs interact more?
RQ3 Are Cross Language Modules more defect-prone
than Intra Language Modules?
5. Plan
• Define a measure for the level of interaction
among languages
• Investigate interaction vs. defect proneness
• Perform a case study
6. The Case Study
Apache Hadoop, a software framework that supports
distributed data storage and processing.
Used in many real applications (e.g., Yahoo, Facebook).
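7. Commit types
Cross-Language Commit (CLC): a commit that touches files of more than one language, e.g. Language A (.extA) and Language B (.extB)
Intra-Language Commit (ILC): a commit that touches files of a single language only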
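The two commit types can be illustrated with a minimal Python sketch. It assumes that a language is identified by its file extension and that a commit is cross-language when it touches files with more than one extension; the function names and file paths are hypothetical and not taken from the study's tooling.

```python
from pathlib import PurePosixPath


def extensions(paths):
    """Return the set of lowercase extensions (without the dot) touched by a commit.

    Files without an extension are ignored (an assumption of this sketch).
    """
    found = set()
    for path in paths:
        suffix = PurePosixPath(path).suffix
        if suffix:
            found.add(suffix.lstrip(".").lower())
    return found


def commit_type(paths):
    """Label a commit 'CLC' if it touches more than one extension, else 'ILC'."""
    return "CLC" if len(extensions(paths)) > 1 else "ILC"


# Hypothetical commits, each given as the list of files it changes.
mixed = ["src/Foo.java", "conf/core-site.xml"]
single = ["src/Foo.java", "src/Bar.java"]

print(commit_type(mixed))   # CLC
print(commit_type(single))  # ILC
```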
8. RQ1 How much interaction is there between
the languages present in a project?
Metric: Percentage of Cross-Language Commits
• All types of commits (RQ1.1)
• Commits divided by activity type (e.g., improvement,
bug fixing, new feature) (RQ1.2)
Activity type   All (RQ1.1)   Bug    Improvement   New Feature   Sub task   Task   Test
CLC fraction    0.53          0.12   0.26          0.30          0.45       0.26   0.05
9. Cross Language Ratio
[Figure: module m and the commits involving it, with files in Language A (.extA), Language B (.extB), and Language C (.extC)]
3 out of 4 commits involving m are Cross-Language
Cross Language Ratio of module m: CLRm = 0.75
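The ratio for a single module can be sketched in Python as follows. This assumes that the Cross Language Ratio of m is the number of cross-language commits involving m divided by the number of all commits involving m, and that a commit is cross-language when it touches more than one file extension. The function names and the commit data are hypothetical; they are chosen to reproduce the 3-out-of-4 example.

```python
import os


def is_cross_language(files):
    """True if the commit touches files with more than one extension."""
    exts = {os.path.splitext(f)[1].lower() for f in files}
    exts.discard("")  # ignore files without an extension (assumption)
    return len(exts) > 1


def cross_language_ratio(module, commits):
    """Fraction of the commits involving `module` that are cross-language."""
    involving = [files for files in commits if module in files]
    if not involving:
        return 0.0  # convention chosen for this sketch
    cross = sum(1 for files in involving if is_cross_language(files))
    return cross / len(involving)


# Hypothetical history: four commits involve m.java, three of them cross-language.
commits = [
    ["m.java", "native.c"],
    ["m.java", "start.sh"],
    ["m.java", "config.xml"],
    ["m.java"],
    ["other.java"],  # does not involve m.java, so it is not counted
]

print(cross_language_ratio("m.java", commits))  # 0.75
```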
10. Interaction level of a language
• Cross language ratio of an extension (language)
11. RQ2 Which extensions interact more?
Metric: CLRext
Considering one extension versus all the other
extensions (RQ2.1)
CLRext   Nr files   Extension
0.96     49         c
0.87     114        sh
0.72     75         properties
0.71     320        xml
0.59     4328       java
12. Focusing on extension pairs
[Figure: module m and the commits involving it, with files in Language A (.extA), Language B (.extB), and Language C (.extC)]
2 out of 3 commits involving m together with extA are Cross Language
Cross Language Ratio of module m w.r.t. extA: CLRm,extA = 0.67
13. Interaction level of a pair
• Cross language ratio of an extension w.r.t.
another extension
– Asymmetrical measure!
14. RQ2 Which extensions interact more?
Metric: CLRextA,extB
Considering the most interacting ordered pairs of
extensions (RQ2.2).
extA/extB    C      Java   Properties   Sh
C            -      0.51   0.10         0.50
Java         0.01   -      0.28         0.04
Properties   0      0.54   -            0.36
Sh           0.09   0.22   0.24         -
Xml          0.04   0.52   0.43         0.24
15. Cross vs. Intra Lang Modules
Cross Language Module (CLM): CLR ≥ t
Intra Language Module (ILM): CLR < t
t = 50%
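The threshold rule can be written as a short Python sketch. It assumes the CLR of each module has already been computed as a fraction between 0 and 1; the function name and the example modules are hypothetical.

```python
def classify_module(clr, t=0.5):
    """Return 'CLM' when the module's CLR reaches the threshold t, else 'ILM'."""
    return "CLM" if clr >= t else "ILM"


# Hypothetical CLR values for three modules.
module_clr = {"a.java": 0.75, "b.java": 0.50, "c.java": 0.10}
labels = {name: classify_module(clr) for name, clr in module_clr.items()}

print(labels)  # {'a.java': 'CLM', 'b.java': 'CLM', 'c.java': 'ILM'}
```

A module whose CLR equals the threshold exactly is counted as a CLM, following the "≥" in the definition.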
16. RQ3 Are Cross Language Modules more defect-prone?
Metric: Odds ratio of CLM with/without defects, ILM
with/without defects
- all modules regardless of extension (RQ3.1)
- by extension (RQ3.2)
Extension    ILM no def.   ILM def.   CLM no def.   CLM def.   p-value   OR
all          1891          225        2875          89         <0.001    0.26
c            2             0          46            1          1.000     Inf
java         1692          201        2239          25         <0.001    0.09
properties   19            1          45            7          0.429     2.92
sh           10            5          64            13         0.162     0.41
xml          96            11         184           24         0.851     1.14
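As a rough check of the OR column, the sample odds ratio can be computed from the four counts of each row. The slides do not state which estimator was used, so the Python sketch below is an assumption; the function name is hypothetical. The plain cross-product ratio reproduces the values for "all", "java", and "c", while rows with small counts can differ slightly (for "properties" it gives about 2.96 rather than 2.92), which suggests a different estimator, possibly a conditional one, behind the reported figures.

```python
import math


def odds_ratio(ilm_no_def, ilm_def, clm_no_def, clm_def):
    """Sample odds ratio of a CLM being defective versus an ILM being defective.

    Computed as (clm_def / clm_no_def) / (ilm_def / ilm_no_def), written as a
    cross-product so that a zero cell yields inf instead of raising an error.
    """
    numerator = clm_def * ilm_no_def
    denominator = clm_no_def * ilm_def
    if denominator == 0:
        return math.inf if numerator > 0 else math.nan
    return numerator / denominator


# Counts taken from the table rows "all", "java", and "c".
print(round(odds_ratio(1891, 225, 2875, 89), 2))  # 0.26
print(round(odds_ratio(1692, 201, 2239, 25), 2))  # 0.09
print(odds_ratio(2, 0, 46, 1))                    # inf
```

An odds ratio below 1 means that CLMs are less likely to be defective than ILMs; a value above 1 means the opposite.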
17. RQ3 Are Cross Language Modules more defect-prone?
Metric: Odds ratio of CLM with/without defects, ILM
with/without defects
Considering interaction between specific ordered
pairs of extensions (RQ3.3).
extA/extB    C      Java   Properties   Sh      Xml
C            -      Inf    0            0       Inf
Java         2.79   -      0.32         0.43    0.96
Properties   Inf    1      -            12.08   0.94
Sh           3.55   4.45   17.17        -       7.44
Xml          3.83   0.95   3.22         4.73    -
Significant values are in bold
18. Threats
• Confounding factors: age and size of modules
• Use of a proxy for interaction between artifacts
• Apache Hadoop representativeness
• Renaming of modules
22. In general, language interaction is not related to
higher defect proneness (see Java).
However, several language pairs have CLMs that are
significantly more defect-prone than ILMs (see C).
23. Questions?