cregit: Who Authored the Kernel?
Recovering Token-Level Authorship Information from Git
Daniel M German
University of Victoria
dmg@uvic.ca
Kate Stewart
Linux Foundation
Bram Adams
Polytechnique Montréal
Cincinnati Library
Image in the public domain
diff-able
Image in the public domain
“History is written by the winners”
By Y. Karsh. Image in the public domain in
Canada. Copyrighted in the US
“Archeology is the search for fact, not truth.
If it's truth you're interested in,
Dr. Tyree's Philosophy class is right down the hall.”
-- Indiana Jones
Image Copyright Walt Disney Company
The history in git is likely to be incomplete
Yet, what can we do with it?
The Dream, by H. Rousseau. In the public domain.
Image in the public domain
Evolutionary Views of VC Repos
Per Line
Per Token
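One way to get from the per-line view to the per-token view is to rewrite each source file with one token per line, so that line-based tools such as git blame and git diff end up reporting per token. The sketch below illustrates that idea with a simplified regex lexer. The lexer, `TOKEN_RE`, and `tokens_per_line` are assumptions made for illustration, not cregit's implementation, and comments are simply dropped.

```python
import re

# Simplified C lexer (an assumption for illustration, not a full C grammar).
TOKEN_RE = re.compile(r"""
    /\*.*?\*/                 # block comment
  | //[^\n]*                  # line comment
  | "(?:\\.|[^"\\\n])*"       # string literal
  | '(?:\\.|[^'\\\n])*'       # character literal
  | [A-Za-z_]\w*              # identifier or keyword
  | \d[\w.]*                  # number (loose)
  | ->|\+\+|--|<<=|>>=|<<|>>|[-+*/%&|^=!<>]=|&&|\|\|   # multi-char operators
  | \S                        # any other single non-space character
""", re.VERBOSE | re.DOTALL)


def tokens_per_line(source: str) -> str:
    """Return the source rewritten with one code token per line."""
    tokens = [m.group(0) for m in TOKEN_RE.finditer(source)]
    # Comments are skipped in this sketch so that every output line is one token.
    code_tokens = [t for t in tokens if not t.startswith(("/*", "//"))]
    return "\n".join(code_tokens)


if __name__ == "__main__":
    print(tokens_per_line("int x = a->b + 1;"))
```

With files stored this way, a commit that changes a single constant touches one line of the tokenized file instead of the whole original line, so blame can keep the remaining tokens attributed to their earlier authors.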
Linux History
Warning
● The author (git parlance) of a commit is not necessarily the author of the code
– Code imported from another source
– Refactorings
– Moving code
Up to 4.7
Persons in blame:
Line: 12,005
Token: 12,087
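A count of persons in blame can be sketched by collecting the distinct author names that `git blame --line-porcelain` reports. This is only an illustration: how the slides identify and merge persons is not stated, and the helper names `blame_porcelain` and `persons_in_blame` are hypothetical.

```python
import subprocess


def blame_porcelain(path: str) -> str:
    """Run git blame on one file and return its --line-porcelain output."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def persons_in_blame(porcelain: str) -> set[str]:
    """Collect distinct author names from --line-porcelain output.

    Header lines look like 'author Some Name'. Related headers
    ('author-mail', 'author-time', 'author-tz') do not match the prefix
    'author ', and content lines start with a tab, so neither is counted.
    """
    prefix = "author "
    return {
        line[len(prefix):]
        for line in porcelain.splitlines()
        if line.startswith(prefix)
    }
```

Run over a tokenized file (one token per line), the same function yields persons per token rather than per line. Counting by name alone treats one person using two names as two persons, which this sketch does not try to resolve.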
[Figure: Token vs. Line, Linux]
[Figure: Token vs. Line, kernel/]
Many small changes
Non-merge commits that modified C and H files, as a share of all commits
● 9.5% of commits added 3 or fewer c-tokens and removed 3 or fewer c-tokens
● 7% of commits did not add any c-tokens but removed c-tokens
● 3.8% of commits added one c-token and removed one c-token
● 22.4% of commits added 10 or fewer c-tokens and removed 10 or fewer c-tokens
● 50% of commits added 60 or fewer c-tokens and removed 60 or fewer c-tokens
● 2 commits added at least 1M c-tokens and removed at least 1M c-tokens
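The thresholds above can be computed from per-commit token counts as follows. This is a minimal sketch: `share_within` is a hypothetical helper, and the `(added, removed)` pairs in the example are made-up values, not kernel data.

```python
def share_within(commits, limit):
    """Fraction of commits that added <= limit and removed <= limit c-tokens.

    `commits` is a list of (added, removed) c-token counts, one per commit.
    """
    if not commits:
        return 0.0
    hits = sum(
        1 for added, removed in commits
        if added <= limit and removed <= limit
    )
    return hits / len(commits)


if __name__ == "__main__":
    # Made-up example counts, for illustration only.
    example = [(1, 1), (0, 5), (12, 3), (70, 80)]
    for limit in (3, 10, 60):
        print(limit, share_within(example, limit))
```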
C-Churn
● C-churn = c-tokens added - c-tokens removed, in non-merge commits
Non-merge commits that modified C and H files, as a share of all commits
● 10% of commits had c-churn == 0
● 48% had c-churn <= 10
● 26% had negative c-churn
● 2 commits had c-churn >= 1M
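The c-churn of a single change can be sketched by diffing the token sequences of the old and new versions of a file. The example uses Python's `difflib` as a stand-in diff; the function names are hypothetical, and the added and removed counts may differ from what git's own diff would report.

```python
from difflib import SequenceMatcher


def token_delta(old_tokens, new_tokens):
    """Return (added, removed) token counts between two token sequences."""
    added = removed = 0
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1   # tokens present only in the old version
        if tag in ("replace", "insert"):
            added += j2 - j1     # tokens present only in the new version
    return added, removed


def c_churn(old_tokens, new_tokens):
    """C-churn = tokens added minus tokens removed."""
    added, removed = token_delta(old_tokens, new_tokens)
    return added - removed


if __name__ == "__main__":
    old = "int x = a + 1 ;".split()
    new = "int x = a + 2 ;".split()
    print(token_delta(old, new), c_churn(old, new))
```

A c-churn of 0 does not mean nothing changed: replacing one token with another adds one and removes one, which is why the added and removed counts are worth reporting alongside the churn.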
Conclusion
● In the large
– Token and Line are equivalent
● In the small
– Tokens provide a fine-grained view of the evolution of the code