How Linux uses Git

Distributed version control systems (DVCs) are replacing centralized version control systems (CVCs). Our research tries to uncover how DVCs are being used and how their use differs from that of CVCs. For the last 23 months we have been mining Linux git repositories in an attempt to understand how Linux developers use their DVC (git). I will describe the challenges of this mining and provide an overview of how Linux uses DVCs.

Published in: Technology, News & Politics

  1. How is distributed version control software being used? Daniel M. German, University of Victoria
  2. Git
     ● Software engineers are moving towards git
       – And other DVCs
     ● GitHub is a major reason
  3. The Promise of Git. From: http://thkoch2001.github.io/whygitisbetter/
  4. Challenge 1
     ● Personal repos are beyond reach
     ● Local commits might never be observable
  5. Challenge 2: History. “History is written by the victors” (Niccolò Machiavelli)
  6. Rebasing changes history
  7. Save history before it is lost!
  8. DVC model: a clone is a branch
  9. Super-repository
     ● Collection of repositories cloned (recursively) from the same repo
       – At least one per developer
         ● On their personal computer
       – At least one public repository
         ● The blessed
     ● In git, no way to trace them
  10. Moving commits across the superRepo
      ● Method
        – Push: done at source, needs write access to destination
        – Pull: done at destination, needs read access to source
        – Email: source creates patch, recipient applies it
  11. Merging in DVCs
      ● If not all commits are in destination
        – Create temp branch
        – Copy commits
      ● Merge locally
      ● If created,
        – Delete temp branch
  12. Ecosystem of Repos
  13. Can we learn from Linux?
  14. Life of a Patch in Linux
  15. How can we observe them?
  16. We have to find them!
  17. Snapshot Mining
  18. Continuous Mining
  19. Continuous Mining
      ● Every interval
        – Mine as many repos as known
          ● What is new?
          ● What has been deleted?
  20. Continuous Mining
      ● Challenges:
        1) Finding the repositories
        2) Logging their commits
      ● Proactive vs. reactive implementation
        – Logging vs. discovery
  21. ContinuousMining of Linux
      ● Linux has no centralized logging
        – Nobody really knows what the superRepo is
        – Commits flow without any event broadcasting mechanism
      ● Where do we find the activity?
        – Repos
        – Commits
  22. Repos and Committers
      ● Most repos will have a known set of persons committing to them
        – Simplest case: its owner is the only committer
        – Extreme case: repo is used as a centralized version control system: everybody commits to it
  23. Semiautomatic Process
      ● Every 3 hrs, ask every repo:
        – What new commits do you have?
        – What commits did you delete?
        – Automatically resolve propagations
          ● Commits might propagate before we scan
      ● Daily:
        – Are commits in repo by unknown committers?
      ● Answer:
        – Is there a new repo, or is the committer new to the repo?
  24. Implementation
      ● Running since Nov. 2011
        – Currently scans 650 repos every 3 hrs
        – Retrieved
          ● 2.3 million commits (compared to 400k in Linus repo)
          ● 109 million records in propagation table <commit-id, added|deleted, repo, when>
  25. Discovery of new repos
  26. Is one better than the other?
      ● RQ1
        – Does continuousMining uncover a larger development ecosystem than snapMining?
      ● RQ2
        – Does continuousMining expose any missing information, or bias in the recorded history of the project recovered using snapMining?
  27. Metric                             Snapshot (Linus)   Continuous
      No. of repos                       1                  479
      Commits                            64k                533k
      Non-merge commits                  59k                485k
      Unique non-merges                  58k                135k
      % unique non-merges                98.9%              27.9%
      Non-merges that reached blessed                       43.1%
      Different author emails            3434               5646
      Different authors                  2883               4575
      Different committer emails         283                1185
      Different committers               245                1058
  28. RQ2
      – Does continuousMining expose any missing information, or bias in the recorded history of the project recovered using snapMining?
  29. Commits vs. Patches
      ● Commit ids are insufficient to track patches
      ● Large amount of work not reaching blessed
  30. The data in blessed is biased
      ● 37.9% of patches arriving at blessed did not arrive in their original commit
  31. Arrival of Commits at Blessed
  32. Arrival of Commits at Blessed...
      ● We can classify patches as a new feature or bug-fix
  33. So what? (the reviewer will ask)
      ● What can we do with this data?
        – For researchers: enable empirical studies of activities previously invisible
        – For practitioners: implement traceability of
          ● Commits
          ● Repos
  34. Empirical study
      ● RQ: What are the characteristics of the repos in the Linux super-repository?
  35. The Repos
  36. The Repos
      ● X: activity (in commits)
      ● Y: ratio of commits accepted by Linus to total commits
      ● Shape:
        – Triangles: official repos
        – Circles: non-official repos
      ● Size:
        – Smaller: consume commits
        – Larger: produce commits
      ● Color: merge/commit ratio
        – Grey: never merge
        – “Cooler”: high ratio
        – “Warmer”: lower ratio
  37. Consumers are very active
  38. Propagation
      ● RQ: How do repositories interact, and how do commits propagate across repositories?
  39. The Latency: Time of Authorship, Time of Commit
  40. The Interaction
  41. Linux Dashboard
      ● We asked two Linux maintainers:
        – Can this info be useful?
      ● Answer:
        – “Yes” … but not for what we expected...
  42. Tracking commits in Linux
      ● Need to track patches, not commits
        – Particularly important in consumer repositories
        – Need to cross-reference commits
          ● What commits contain the same patch?
          ● What commits are mentioned in the log?
        – Some repos track commits from blessed via cherry-picking
          ● Commit ids are useless
          ● So they annotate log with the origin commit id
  43. Example
  44. Has it reached linux-next before blessed?
      ● Commits should pass through linux-next before arriving at blessed
      ● If not, potential issue
      ● Hard to do with current tools:
        – Patches change commit id
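
The scan loop from the Continuous Mining, Semiautomatic Process, and Implementation slides can be illustrated with a minimal Python sketch. It is not the authors' tool. It assumes the set of commit ids currently reachable in each repo has already been collected by some other step (for example by listing commits in a fetched clone). The names `PropagationRecord`, `diff_scan`, and `scan_all`, and the sample date, are hypothetical. Only the record shape <commit-id, added|deleted, repo, when> comes from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass(frozen=True)
class PropagationRecord:
    """One row of the propagation table: <commit-id, added|deleted, repo, when>."""
    commit_id: str
    event: str       # "added" or "deleted"
    repo: str
    when: datetime


def diff_scan(repo: str, previous: Set[str], current: Set[str],
              when: datetime) -> List[PropagationRecord]:
    """Compare two scans of one repo and emit a record per changed commit."""
    added = [PropagationRecord(c, "added", repo, when)
             for c in sorted(current - previous)]
    deleted = [PropagationRecord(c, "deleted", repo, when)
               for c in sorted(previous - current)]
    return added + deleted


def scan_all(known: Dict[str, Set[str]], observed: Dict[str, Set[str]],
             when: datetime) -> List[PropagationRecord]:
    """Run one scan interval over every observed repo and update the state.

    A repo that is missing from `known` is treated as newly discovered,
    so all of its commits are logged as "added".
    """
    records: List[PropagationRecord] = []
    for repo, current in observed.items():
        records.extend(diff_scan(repo, known.get(repo, set()), current, when))
        known[repo] = set(current)
    return records


if __name__ == "__main__":
    # Hypothetical state from the previous scan and the current observation.
    state = {"linus": {"a1", "b2"}, "dev": {"a1"}}
    seen = {"linus": {"a1", "b2", "c3"}, "dev": {"c3"}}
    now = datetime(2013, 1, 1, tzinfo=timezone.utc)  # arbitrary sample time
    for rec in scan_all(state, seen, now):
        print(rec.commit_id, rec.event, rec.repo)
```

Logging deletions as well as additions is what preserves history that a later rebase would otherwise erase. A snapshot taken afterwards would show only the commits that survived.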

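The point of the "Tracking commits in Linux" slide, that patches rather than commit ids must be tracked, can be sketched as follows. This is an illustration under stated assumptions, not the method used in the study. `patch_fingerprint` is a simplified stand-in for what `git patch-id` does (it hashes only the changed lines of a unified diff). `origin_commit` assumes the log annotation has the form written by `git cherry-pick -x`, and the repos in the study may use a different convention. All function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List, Optional

# "git cherry-pick -x" appends a line of this form to the commit message.
CHERRY_RE = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,40})\)")


def patch_fingerprint(diff_text: str) -> str:
    """Hash only the added and removed lines of a unified diff.

    File headers and hunk headers are skipped and whitespace is dropped,
    so the same change applied at a different offset gets the same hash.
    Limitation of this sketch: a removed line whose content starts with
    "--" looks like a file header and is skipped too.
    """
    digest = hashlib.sha1()
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not part of the change itself
        if line.startswith(("+", "-")):
            normalized = line[0] + "".join(line[1:].split())
            digest.update((normalized + "\n").encode("utf-8"))
    return digest.hexdigest()


def origin_commit(log_message: str) -> Optional[str]:
    """Return the origin commit id recorded in a log message, if any."""
    match = CHERRY_RE.search(log_message)
    return match.group(1) if match else None


def group_by_patch(commits: Dict[str, str]) -> List[List[str]]:
    """Group commit ids (mapped to their diff text) that carry the same patch."""
    groups = defaultdict(list)
    for commit_id, diff_text in commits.items():
        groups[patch_fingerprint(diff_text)].append(commit_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

With groups like these, one could check whether a patch was first seen in linux-next before it arrived at blessed even when its commit id changed along the way.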