How different are different diff algorithms in Git?
Yusuf S. Nugroho, Hideaki Hata, Kenichi Matsumoto

diff is essential in the SE research field
Empirical Software Engineering Research
git diff [<options>] <commit> <commit> [--] [<path>…]

Git offers 4 diff algorithms (per the Git documentation)
• Myers (default algorithm)
• Minimal (improved Myers)
• Patience (try to give contextual diff)
• Histogram (enhanced Patience, normally faster)
--diff-algorithm={algorithm name}
Histogram was introduced in git 1.7.7 in 2011
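
As a rough illustration of the option above, the sketch below builds and runs a `git diff` call with an explicit algorithm. The helper names `build_diff_command` and `run_diff` are hypothetical and not part of Git or of the study; only the `--diff-algorithm` option and the four algorithm names come from the slide. It assumes `git` is on the PATH when `run_diff` is called.

```python
import subprocess

# The four algorithm names listed on the slide for --diff-algorithm.
ALGORITHMS = ("myers", "minimal", "patience", "histogram")


def build_diff_command(old, new, algorithm="myers", paths=()):
    """Return the argv list for `git diff` with an explicit algorithm.

    Hypothetical helper: it only assembles arguments and never touches a repo.
    """
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown diff algorithm: {algorithm}")
    cmd = ["git", "diff", f"--diff-algorithm={algorithm}", old, new]
    if paths:
        # `--` separates revisions from paths, as in the slide's synopsis.
        cmd += ["--", *paths]
    return cmd


def run_diff(repo_dir, old, new, algorithm="myers", paths=()):
    """Run the command inside repo_dir and return the patch text."""
    result = subprocess.run(
        build_diff_command(old, new, algorithm, paths),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_diff(repo, "HEAD~1", "HEAD", "myers")` and then the same call with `"histogram"` would give two patches for the same pair of commits that can be compared line by line.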

Different algorithms produce different diff outputs
Differences:
• Number of changed lines
• Position of changed lines
Myers: 9 added lines, 2 deleted lines
Histogram: 4 deleted lines, 9 added lines

2 sequential analyses
• Systematic Mapping Study
  • How did previous studies use git diff?
• Comparison Study
  • Differences in diff outputs between Myers and Histogram

Results of Mapping Study
Sources: 3 journals (TSE, EMSE, TOSEM) and 8 proceedings (FSE, ICSE, OOPSLA, PLDI, ASE, ICSME, ISSTA, MSR)
Type of diff algorithm (number of papers): Default 52, Other diff 0
Top 3 purposes of using diff (number of papers): Patches 24, Metrics 13, SZZ 12
Other purposes shown in the chart: Merges 2, Authorship 1

Comparing diff outputs in 3 applications
Manual Comparison: Patches
Myers:
... some code ...
- a deleted line
+ an added line
+ an added line
- a deleted line
... some code ...
Histogram:
... some code ...
- a deleted line
- a deleted line
+ an added line
+ an added line
... some code ...
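
The two schematic patches above can be compared mechanically. The following is a minimal sketch, not the study's tooling: `changed_lines` is a hypothetical helper that records where the `-` and `+` markers sit inside a hunk body, and the two lists simply restate the slide's schematic example.

```python
def changed_lines(hunk_body):
    """Return (deleted, added) marker positions within a hunk body.

    Positions are 0-based indexes into the hunk body, so they describe
    where the markers sit in the diff output, not file line numbers.
    The input must not include the `---`/`+++` file header lines.
    """
    deleted, added = [], []
    for index, line in enumerate(hunk_body):
        if line.startswith("-"):
            deleted.append(index)
        elif line.startswith("+"):
            added.append(index)
    return deleted, added


# The slide's schematic outputs; context lines start with a space.
MYERS = [
    " ... some code ...",
    "- a deleted line",
    "+ an added line",
    "+ an added line",
    "- a deleted line",
    " ... some code ...",
]
HISTOGRAM = [
    " ... some code ...",
    "- a deleted line",
    "- a deleted line",
    "+ an added line",
    "+ an added line",
    " ... some code ...",
]

for name, hunk in (("Myers", MYERS), ("Histogram", HISTOGRAM)):
    deleted, added = changed_lines(hunk)
    print(name, "deleted at", deleted, "added at", added)
```

In this schematic case both outputs contain two deletions and two additions, and only their positions differ, which is the kind of difference that tools attributing changes to specific lines would be sensitive to.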

Different diff algorithms can report different locations of identified changed lines of code
Frequency of different locations of identified changed lines of code:
            Metrics    SZZ Algorithm
#Files      4.17%      4.62%
#Commits    7.79%      9.68%
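
A frequency like the ones above could be computed along the following lines. This is only a sketch under assumptions: `share_of_differing_files` is a hypothetical helper, the input format (file path mapped to a set of changed line numbers per algorithm) is an assumption, and the sample data is made up for illustration rather than taken from the study.

```python
def share_of_differing_files(myers, histogram):
    """Percentage of files whose changed-line locations differ.

    Both arguments map a file path to the set of changed line numbers
    reported for that file by one diff algorithm (assumed input format).
    """
    paths = set(myers) | set(histogram)
    if not paths:
        return 0.0
    differing = sum(
        1 for path in paths
        if myers.get(path, set()) != histogram.get(path, set())
    )
    return 100.0 * differing / len(paths)


# Made-up example data: only b.py is reported at a different location.
myers_lines = {"a.py": {5, 6}, "b.py": {10}, "c.py": {1}, "d.py": {3, 4}}
histogram_lines = {"a.py": {5, 6}, "b.py": {11}, "c.py": {1}, "d.py": {3, 4}}
print(share_of_differing_files(myers_lines, histogram_lines))  # 25.0
```

The same function applies at commit level if the keys are commit hashes instead of file paths.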

Histogram is better for describing code changes
Manual comparison of patches (which algorithm gives the better diff):
                    Histogram    Myers    Same
in-Code diff        62.6%        16.9%    20.6%
in-Non-Code diff    13.4%        14.9%    71.6%

Use the histogram diff algorithm when analyzing code changes
• Different diff algorithms can produce different numbers and locations of changed lines.
• Histogram detects the changed lines in source code more appropriately.

Publication
• Nugroho, Y.S., Hata, H. & Matsumoto, K., "How Different are Different diff Algorithms in Git? Use --histogram for Code Changes", Empirical Software Engineering 25, 790-823 (2020).
Available at: https://doi.org/10.1007/s10664-019-09772-z

Application on actual tools
• Git Extensions (feature request)
https://github.com/gitextensions/gitextensions/issues/6991
• PyDriller
https://pydriller.readthedocs.io/en/latest/configuration.html#git-diff-algorithms
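
For readers who want to try the PyDriller option, a minimal sketch follows. It assumes PyDriller 2.x, where the entry point is `Repository` and the histogram switch is the `histogram_diff` argument; older releases used different class and attribute names. The function `count_changed_lines` is a hypothetical helper, not part of PyDriller.

```python
def count_changed_lines(repo_path, histogram=True):
    """Sum added and deleted lines over all commits in repo_path.

    Assumes PyDriller 2.x is installed; histogram=True asks PyDriller to
    compute diffs with the Histogram algorithm instead of the default.
    """
    # Imported lazily so this module loads even without PyDriller installed.
    from pydriller import Repository

    added = deleted = 0
    repository = Repository(repo_path, histogram_diff=histogram)
    for commit in repository.traverse_commits():
        for modified_file in commit.modified_files:
            added += modified_file.added_lines
            deleted += modified_file.deleted_lines
    return added, deleted
```

Running it twice on the same repository, once with `histogram=True` and once with `histogram=False`, gives a quick impression of how much the totals shift between the two algorithms.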

Editor's Notes

  • #2 Hello everyone, I'm Yusuf Sulistyo Nugroho from Nara Institute of Science and Technology, Japan. Today I would like to present my journal paper entitled "How different are different diff algorithms in Git?"
  • #3 In empirical software engineering research, diff is commonly used across various topics. With the growth of GitHub, recent studies analyze software changes extracted from Git repositories by using git commands, for example git diff.
  • #4 But did you know that, according to the Git documentation, Git lets users select the diff algorithm: Myers, Minimal, Patience, or Histogram? In textual differencing, all of these algorithms are computationally correct in generating diff outputs. When no algorithm is specified in the git diff command, Myers is used as the default. Although Histogram, which was introduced in git 1.7.7 in 2011, might give better performance to git diff, it is not popular among software engineering communities. The results of previous studies might have been different if they had considered Histogram in the git diff command. Thus, in this study, I focus only on comparing Myers and Histogram.
  • #5 Diff outputs sometimes differ depending on the diff algorithm. Different diff algorithms might identify different change hunks, where a hunk is a list of program statements deleted or added contiguously, separated by at least one line of unchanged context. The figures illustrate two diff outputs produced by two diff algorithms, Myers and Histogram. As we can see in the example, there are two differences: the number of changed lines and the position of the changed lines. In the Myers diff output, we see 9 added lines of code from line 5 to 12 plus line 16, and 2 deleted lines of code between line 18 and 19. In the Histogram diff output, we see 4 deleted lines of code from line 5 to 8, and 9 added lines of code at line 9 and from line 13 to 20.
  • #6 To clarify the impact of adopting different diff algorithms on empirical studies, and to investigate which diff algorithm provides better diff results that can be expected to recover the changing operations, we carry out two sequential analyses: a systematic mapping study and empirical comparisons.
  • #7 For the systematic mapping, we collected papers published from 2013 to 2017 in three top journals and eight high-ranking international conference proceedings. Out of 3,057 papers, we mapped 52 selected papers to understand which diff algorithm was used in previous studies and the purposes of mining Git repositories. The results reveal that none of the collected papers considered the other, more advanced diff algorithms. We also found that the most common purposes of using the git command were to get patches, followed by metrics collection and bug-introduction identification using the SZZ algorithm. These top 3 usages of git diff in previous studies are the basis for the comparison that follows.
  • #8 We conducted an empirical analysis to compare the diff outputs generated by two popular diff algorithms, Myers and Histogram. We then applied the two algorithms in three applications: collecting metrics, identifying bug-introducing changes using the SZZ algorithm, and manual comparison of patches.
  • #9 The results show that the percentage of files with different locations of identified changed lines of code is 4.17% and 4.62% on average in the metrics and SZZ applications, respectively. Considering commits, different locations of changed lines of code account for 7.79% in the metrics application and 9.68% in the SZZ application on average.
  • #10 Our manual investigation of patches shows that the Histogram algorithm provides better outputs than Myers in describing code changes in source code files. However, the two algorithms have almost the same ability to extract diffs from non-code changes, for example changes to a text file.
  • #11 To sum up, our quantitative analyses have shown that different diff algorithms can report different numbers of changed lines and identify different change locations. Our qualitative investigation reveals that Histogram is better for describing code changes. Since diff is a fundamental tool for various software engineering tasks, considering the limitations and advantages of the algorithms is important. Thus, we currently recommend using the Histogram algorithm when analyzing code changes.
  • #12 I am happy to announce that this journal paper has attracted discussion on Hacker News. This publication has also been referenced in Wikipedia in its description of “diff”.
  • #13 Our study has also been referenced by actual tools. First, it was shared and discussed in Git Extensions. Second, it has been implemented in PyDriller, so PyDriller users can now consider the Histogram algorithm in their experiments. That’s all I have for my presentation. You are very welcome to read my journal paper for more details. Thank you for listening.