Linguistic diversity in open-source development. Bogdan Vasilescu, Alexander Serebrenik, Mark van den Brand

IPA Spring Days 2012


I used these slides during my talk at the IPA Spring Days in Garderen, The Netherlands (2012).

Published in: Education, Technology

  1. Linguistic diversity in open-source development. Bogdan Vasilescu, Alexander Serebrenik, Mark van den Brand
  2. Motivation. Project languages: C, C++, …, HTML, Lisp, XML, Java, Python, Unix shell. Developers: “I ‘speak’ Java”, “I ‘speak’ Java and Python”, “I ‘speak’ Python”, … (Mathematics and Computer Science, 23-4-2012)
  3. Motivation. If a developer leaves the project, what is the risk of not finding replacement developers that speak Python? No risk: plenty of other Python developers to choose from. What about now?
  4. Linguistic diversity. Greenberg (1956): compare geographic regions; probability that two random individuals do not speak the same language.
  5. Linguistic diversity: probability that two random individuals do not speak the same language. Simple model: everyone speaks exactly one language; languages are independent. $A = 1 - \sum_{\ell \in L} p_\ell^2$, where $p_\ell$ is the share of the population $P$ that speaks $\ell$.
  6. Linguistic diversity: probability that two random individuals do not speak the same language. Related-languages model: everyone speaks exactly one language; languages are similar. $B = 1 - \sum_{\ell, m \in L} p_\ell \, p_m \, \mathrm{sim}(\ell, m)$, with $0 \le \mathrm{sim}(\ell, m) \le 1$ and $\mathrm{sim}(\ell, \ell) = 1$.
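  7. Linguistic diversity: probability that two random individuals do not speak the same language. Polyglot related-languages model: everyone speaks at least one language; languages are similar. $F = 1 - \sum_{s, t \in \mathcal{P}(L)} p_s \, p_t \, \mathrm{sim}(s, t)$, where $p_s$ is the share of the population $P$ that speaks the language set $s$, and $\mathrm{sim}(s, t)$ is derived from $\mathrm{sim}(\ell, m)$ for $\ell \in s$, $m \in t$. Example: $L = \{A, B, C\}$, $\mathcal{P}(L) = \{A, B, C, AB, AC, BC, ABC\}$.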
  8. Our risk measure. Probability that two random individuals do not speak the same language: $F = 1 - \sum_{s, t \in \mathcal{P}(L)} p_s \, p_t \, \mathrm{sim}(s, t)$, with $\mathrm{sim}(s, t)$ derived from $\mathrm{sim}(\ell, m)$ for $\ell \in s$, $m \in t$. Risk of not finding developers that “speak” $\ell$: $\mathrm{risk}(\ell) = 1 - \sum_{s \in \mathcal{P}(L)} p_s \max_{k \in s} \mathrm{sim}_\ell(k)$.
  9. StackOverflow.com
  10. User tags
  11. Similarity measure. Reverend Gonzo: Java, C, C++, C#, Python, …; Alexander Serebrenik: Prolog, SQL, C++, …; Bogdan Vasilescu: Python, …; Jon Skeet: C#, Java, ASP.net, XML, …; … (> 400,000). Association rule mining, e.g. “C => Java”: $\mathrm{sim}_\ell(k) = \mathrm{conf}(k \Rightarrow \ell) = n_{Both} / n_{Left}$.
  12. Similarity measure - results. Assembly posts: 44. Assembly + Java developers: > 1000. ⇒ When in need of Java developers, ask Assembly guys.
  13. Case study - Emacs. 1985-2012: C, Emacs Lisp, C++, Java, Lisp, Python, M4, … (26). Exotic languages. High/low risk.
  14. Case study - Emacs. C: spoken by half of the community + similar to other languages ⇒ low risk. Python: spoken very sporadically + similar to other languages ⇒ low risk.
  15. Conclusions. What is the risk of not finding developers that speak Python? Low risk; depends on similarity. Risk measure: $\mathrm{risk}(\ell) = 1 - \sum_{s \in \mathcal{P}(L)} p_s \max_{k \in s} \mathrm{sim}_\ell(k)$. Similarity measure (StackOverflow), e.g. “C => Java”: $\mathrm{sim}_\ell(k) = \mathrm{conf}(k \Rightarrow \ell) = n_{Both} / n_{Left}$.
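
The measures on slides 5, 6, 8 and 11 can be tried out on toy data. The following is a minimal Python sketch, not the authors' implementation. It assumes that each StackOverflow user and each project developer is represented as a set of language names, that nLeft counts users tagged with the left-hand side of a rule and nBoth counts users tagged with both sides, and that p_s is the share of developers whose language set is exactly s. All function names and the sample data are hypothetical.

```python
def confidence(users, left, right):
    """conf(left => right) = nBoth / nLeft over the users' tag sets."""
    n_left = sum(1 for tags in users if left in tags)
    n_both = sum(1 for tags in users if left in tags and right in tags)
    return n_both / n_left if n_left else 0.0


def sim(users, target, k):
    """sim_target(k): 1 for the target itself, otherwise conf(k => target)."""
    return 1.0 if k == target else confidence(users, k, target)


def diversity_simple(shares):
    """Simple model: 1 - sum of squared language shares."""
    return 1.0 - sum(p * p for p in shares.values())


def diversity_related(shares, sim_fn):
    """Related-languages model: 1 - sum over pairs of p_l * p_m * sim(l, m)."""
    return 1.0 - sum(
        shares[l] * shares[m] * sim_fn(l, m) for l in shares for m in shares
    )


def risk(community, users, target):
    """risk(target) = 1 - sum_s p_s * max_{k in s} sim_target(k).

    Averaging over developers is the same as summing over language
    sets s weighted by p_s (assumption: p_s = share with exactly set s).
    """
    best = [max(sim(users, target, k) for k in langs) for langs in community]
    return 1.0 - sum(best) / len(community)


# Hypothetical tag data standing in for StackOverflow users.
users = [{"Java", "C"}, {"Java", "Python"}, {"C"}, {"Python"}]
# Hypothetical project community; every developer speaks at least one language.
community = [{"C"}, {"C", "Python"}, {"Python"}, {"Lisp"}]
# Hypothetical single-language shares for the simple and related models.
shares = {"C": 0.5, "Python": 0.25, "Lisp": 0.25}

print(confidence(users, "C", "Java"))
print(risk(community, users, "Java"))
print(risk(community, users, "Python"))
print(diversity_simple(shares))
```

With an identity similarity (1 for equal languages, 0 otherwise), `diversity_related` reduces to `diversity_simple`, which matches the step from slide 5 to slide 6.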
