Jupyter con 2018 Diversity Analytics & OSS Adventures


Many of us believe that gender diversity in open source projects is important (for example, O’Reilly, Google, and the Python Software Foundation). (If you don’t, this isn’t going to convince you.) But what things are correlated with improved gender diversity, and what can we learn from similar historic industries?

Holden Karau and Matt Hunt explore the diversity of different projects, examine historic EEOC complaints, and detail parallels and historic solutions. To keep things interesting, Holden and Matt conclude with a comparative analysis of the state of OSS and various complaints handled by the EEOC in the ’60s, along with the solutions, suggestions, and binding settlements that were reached for similar diversity problems in other industries. This comparison is not legal advice but rather examples of what we can learn from early equal opportunity commission decisions.

Topics include:

Diversity of gender among the different levels of a given project’s leadership (committers, PMC, etc.)
The existence of codes of conduct
Language used in comments, code, and mailing lists
The rate of promotions for project participants

Published in: Internet

  1. 1. What things are correlated with gender diversity: A data science stroll through the ASF and Jupyter projects. By @holdenkarau & @instantmatthew
  2. 2. What is this all about? ● Curiosity: few metrics on open source diversity exist ● Fun use of Jupyter, Spark, ML ● Pull requests welcome! Lori Erickson
  3. 3. Who are you? We have nothing in common. Me: smart, funny, straight, bald, New Yorker. Holden: trans, queer, Canadian San Franciscan, wants you to follow her on YouTube … etc
  4. 4. Or do we? ● English speaking bi-coastal North American techies ● Breathe same air, mortal ● Distinctive fashion sense ● A shared appreciation for the Cheesecake Factory ● Whisky ● Neither of us is talking on behalf of our employers today
  5. 5. Historical Perspective ● Quote from “The Good Girls Revolt” ○ “Writers come to magazine over the transom,” he said, “and women aren’t coming. We can’t do anything if they aren’t interested” ● And a similar quote from open source luminaries ○ “I don’t have any experience working with women in programming projects; I don’t think that any volunteered to work on Emacs or GCC.” - RMS *The Good Girls Revolt: How the Women of Newsweek Sued their Bosses and Changed the Workplace by Lynn Povich sheologian
  6. 6. Recent studies GitHub 2017 “These researchers found that women’s coding suggestions was accepted 71.8% of the time when their gender was kept a secret, but only 62.5% of the time when their gender was revealed.” “Only 3% of the 5500 randomly selected respondents were women. 25% of those women reported being exposed to language or content that made them uncomfortable”
  7. 7. What have we done? Pulled data from git, meetup, etc, done some ML magic to infer gender and get stats Used Jupyter! Made some pretty(ish) pictures
  8. 8. What you can’t get from this? ● Causation. Which correlation ain’t. ● Legal advice ● Academic quality data Quirky Confectioner Lawyer cat objects!
  9. 9. Data sources/Methods ● Git commits and messages ● Inferred gender ● Gender from human review ● Project websites ● Mailing lists ● You can see our work - ○ And contribute… hint hint….. Melissa Wiese
  10. 10. Such Data ● ~50 projects ● ~30 GB of commits & posts Human reviewed: ● Sampled down to ~1600 code contributors + all ~2600 committers Andrey Belenko
  11. 11. Stage One: Eyeballing Jennifer Morrow
  12. 12. So what do ASF & Jupyter projects look like?
  13. 13. Wait what’s that tall bar? fabien duplan
  14. 14. Some other things stand out quickly... ● Broad base of companies (maybe different kinds of diversity or correlated)? ● Easy to find community page ● Get involved link right on the home page ● Academic funding sources (NSF) + GSOC
  15. 15. Stage 2: Science John Floyd
  16. 16. What are some interesting project attributes? ● Does the project have a code of conduct? ● Does the project have a stated way for people to become committers? ● Does the project have a contributing guide? ● What’s the sentiment of the project’s user/dev list? ● PR acceptance rate ● Your ideas/suggestions - seriously e-mail us (and/or make PRs to the notebook!) j0035001-2
  17. 17. What about gender related attributes? ● Gender %s of code contributors ● Gender %s of mailing list users ● Gender %s of PMC / committers ● And correlations charlene mcbride
  18. 18. Slides for Correlations
      [Row(
      corr(sampled.nonmale_percentage, infered.nonmale_percentage)=0.8402836506347078,
      corr(sampled.nonmale_percentage, Answer_code_of_conduct_easy)=-0.05088697801152734,
      corr(infered.nonmale_percentage, Answer_code_of_conduct_easy)=0.004552341326140643,
      corr(sampled.nonmale_percentage, Answer_code_of_conduct_exists)=-0.05088697801152734,
      corr(infered.nonmale_percentage, Answer_code_of_conduct_exists)=0.004552341326140643,
      corr(sampled.nonmale_percentage, Answer_committer_guide_easy)=-0.30915940064845393,
      corr(infered.nonmale_percentage, Answer_committer_guide_easy)=-0.0381086842740672,
      corr(sampled.nonmale_percentage, Answer_committer_guide_exists)=-0.34084081419416784,
      corr(infered.nonmale_percentage, Answer_committer_guide_exists)=-0.03831572641820849,
      corr(sampled.nonmale_percentage, Answer_contributing_guide_easy)=0.00950903602820991,
      corr(infered.nonmale_percentage, Answer_contributing_guide_easy)=0.04837014770606781,
      corr(sampled.nonmale_percentage, Answer_contributing_guide_exists)=0.0202429856533326,
      corr(infered.nonmale_percentage, Answer_contributing_guide_exists)=0.03636869585244893,
      corr(sampled.nonmale_percentage, Answer_mentoring_guide_easy)=-0.15392301526227192,
      corr(infered.nonmale_percentage, Answer_mentoring_guide_easy)=-0.055002597763866734,
      corr(sampled.nonmale_percentage, Answer_mentoring_guide_exists)=-0.15392301526227192,
      corr(infered.nonmale_percentage, Answer_mentoring_guide_exists)=-0.055002597763866734,
      corr(sampled.nonmale_percentage, has_female_or_enby_committer_magic)=0.18942118337810188,
      corr(infered.nonmale_percentage, has_female_or_enby_committer_magic)=0.20349367651041672,
      corr(sampled.nonmale_percentage, nonmale_committer_percentage_magic)=0.5441035627011365,
      corr(infered.nonmale_percentage, nonmale_committer_percentage_magic)=0.35402599653343864,
      corr(sampled.nonmale_percentage,
      R. Crap Mariner
  19. 19. This wasn’t much better +------------------------------------------------------------..... |corr(sampled.nonmale_percentage, infered.nonmale_percentage)|corr(sampled.nonmale_percentage, Answer_code_of_conduct_easy)|corr(infered.nonmale_percentage, Answer_code_of_conduct_easy)|corr(sampled.nonmale_percentage, Answer_code_of_conduct_exists)|corr(infered.nonmale_percentage, Answer_code_of_conduct_exists)|corr(sampled.nonmale_percentage, Answer_committer_guide_easy)|corr(infered.nonmale_percentage, Answer_committer_guide_easy)|corr(sampled.nonmale_percentage, Answer_committer_guide_exists)|..... | 0.8402836506347078| -0.05088697801152734| 0.004552341326140643| -0.05088697801152734| 0.004552341326140643| -0.30915940064845393| -0.0381086842740672| -0.34084081419416784| -0.03831572641820849| 0.00950903602820991| 0.04837014770606781| 0.0202429856533326| 0.03636869585244893| -0.15392301526227192| -0.05500259776386...| -0.15392301526227192| -0.05500259776386...| 0.18942118337810188| 0.20349367651041672| 0.5441035627011365| 0.35402599653343864| 0.27903907421646745| -0.19842388895891314| 0.018343520672052215| -0.0531287316430999| -0.04570527792465824| -0.11407965948006175| -0.02941906552049...| 0.010923839206653968| -0.19651751264222414| -0.2121016705878764| -0.20639989813410967| -0.21973083941480384| -0.31067113317726425| -0.15172448698670876| -0.31736988968372776| -0.17906926611311288| 0.14828713581114333| -0.28798744559651446| 0.540848408698061| -0.11571044537290899| 0.5044867286902844| -0.44725076538864206| 0.4935819383384438| R. Crap Mariner
  20. 20. Slides for Correlations: Inferred gender information, Sampled gender information Barry Badcock
  21. 21. Oh howdy, there’s some differences…. ● Maybe it’s from our data collection methods ● Inferred gender is also known to have issues, especially with non-American names, non-cis folks, etc. ● Inferred sentiment detection maybe not great? ○ I just used nltk vader cause w/e
  22. 22. How was the human data collection done? Instructions: Find the gender of the user in question. You can look at the e-mails sent in response to them, but also feel free to search online to find other information about the user (use the project information to disambiguate cases of multiple people with the same name). List additional links possibly about the user used (e.g. linkedin, twitter, etc.) Provided with: E-mails in response to user, project name, author name, and github name (All depending on what could be found) DocChewbacca
  23. 23. First look Khairil Zhafri
  24. 24. Sentiment of mailing lists J. Triepke
  25. 25. And the rest…. Hajime NAKANO
  26. 26. What about that inferred data?
  27. 27. Stage 3: Solutions to historical challenges Remember the parallels in quotes? Maybe there are parallels in solutions? ● Short answer: hire women ○ In OSS we sometimes pretend we are not paid…. but a lot of us are. ● Longer answer: make training/mentorship programs to promote internal candidates ○ Strangely enough, the existence of mentoring programs was negatively correlated ● Explicit “try-outs” ○ (or ways of hiring people who weren’t just friends) ● Not depending on randomly finding people Nacho
  28. 28. Related work ● ● nel-technical-contributions/ ● (PR acceptance rates for women insiders/outsiders) ● Livestreams of the data processing/collection - ○ Did you know it’s perf season at Google? And Google is very metrics driven…. Also my manager’s name is Steve. Arthur Cruz
  29. 29. Special thanks! Ann Spencer Wrangler of cats and unicorns as the Head of Content at Domino Data Lab. Formerly Data Editor at O'Reilly Media (aka Holden's editor). Born and raised in San Francisco.
  30. 30. Want to participate? ● New forum:!managemembers/oss-diversity-discussion ● Notebook code at / ● Slides: ● @holdenkarau & @instantmatthew ● And/or come say hi to us @ Strata Melissa Wiese
  31. 31. High Performance Spark! Unrelated to this talk. I’ll have a book signing @ 3:20pm at the O’Reilly booth. You can also buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee.
  32. 32. Questions?
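
The `corr(...)` values on slides 18 and 19 look like the output of Spark's `corr` aggregate, which computes a Pearson correlation coefficient between two columns. As a minimal sketch of what such a number means, the plain-Python function below computes the same kind of coefficient for two equal-length lists. The project rows and their values are made up for illustration and are not the talk's data; the key names only echo the column names on the slides, and `pearson` is a hypothetical helper, not code from the authors' notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance times n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances times n).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-project rows: invented numbers, one dict per project.
projects = [
    {"nonmale_percentage": 4.0, "nonmale_committer_percentage": 0.0},
    {"nonmale_percentage": 9.5, "nonmale_committer_percentage": 6.0},
    {"nonmale_percentage": 7.0, "nonmale_committer_percentage": 10.0},
    {"nonmale_percentage": 15.0, "nonmale_committer_percentage": 12.5},
]

if __name__ == "__main__":
    contributors = [p["nonmale_percentage"] for p in projects]
    committers = [p["nonmale_committer_percentage"] for p in projects]
    print(pearson(contributors, committers))
```

A value near 1 or -1 means the two columns rise and fall together (or in opposition) across projects; as slide 8 points out, that is still not causation.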