What things are correlated
with gender diversity
A data science stroll through the ASF and Jupyter
projects
By
@holdenkarau & @instantmatthew
What is this all about?
● Curiosity: few metrics on open source diversity exist
● Fun use of Jupyter, Spark, ML
● Pull requests welcome!
Lori Erickson
Who are you?
we have nothing in
common
Me: smart, funny,
straight, bald, New Yorker
Holden: trans, queer,
Canadian San Franciscan,
wants you to follow her on
YouTube … etc
Or do we?
● English speaking bi-coastal North American techies
● Breathe same air, mortal
● Distinctive fashion sense
● A shared appreciation for the Cheesecake Factory
● Whisky
● Neither of us is talking on behalf of our employers today
Historical Perspective
● Quote from “The Good Girls Revolt”
○ “Writers come to magazine over the transom,” he said, “and women aren’t coming. We can’t
do anything if they aren’t interested”
● And a similar quote from open source luminaries
○ “I don’t have any experience working with women in programming projects; I don’t think that
any volunteered to work on Emacs or GCC.” - RMS
*The Good Girls Revolt: How the Women of Newsweek Sued their Bosses and Changed the Workplace
by Lynn Povich
sheologian
Recent studies
GitHub 2017
“These researchers found that women’s coding suggestions was accepted 71.8% of the time
when their gender was kept a secret, but only 62.5% of the time when their gender was
revealed.”
“Only 3% of the 5500 randomly selected respondents were women. 25% of those women
reported being exposed to language or content that made them uncomfortable”
What have we done?
Pulled data from git, meetup, etc,
done some ML magic to infer gender and get stats
Used Jupyter!
Made some pretty(ish) pictures
What can’t you get from this?
● Causation. Which correlation ain’t.
● Legal advice
● Academic quality data
Quirky Confectioner
Lawyer cat
objects!
Data sources/Methods
● Git commits and messages
● Inferred gender
● Gender from human review
● Project websites
● Mailing lists
● You can see our work - http://bit.ly/holdendDiversityAnalyticsRepo
○ And contribute… hint hint…..
Melissa Wiese
Such Data
● ~50 projects
● ~30 GB of commits & posts
Human reviewed:
● Sampled down to ~1600 code contributors + all ~2600 committers
Andrey Belenko
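A rough sketch of how a review set like this could be put together: keep every committer and draw a uniform random sample of the remaining code contributors. The function name `build_review_set`, the seed, and the toy identifiers are all hypothetical, and uniform sampling is an assumption; the slides don't say how the sample was actually drawn.

```python
import random


def build_review_set(contributors, committers, sample_size, seed=0):
    """Return (all committers, random sample of non-committer contributors)."""
    committer_set = set(committers)
    # Sort before sampling so the result depends only on the seed,
    # not on arbitrary set ordering.
    candidates = sorted(set(contributors) - committer_set)
    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(sample_size, len(candidates)))
    return sorted(committer_set), sampled


# Hypothetical toy data: 100 contributors, the first 10 of whom are committers.
contributors = [f"contributor-{i}" for i in range(100)]
committers = contributors[:10]

kept_committers, sampled_contributors = build_review_set(
    contributors, committers, sample_size=20
)
print(len(kept_committers), "committers kept,",
      len(sampled_contributors), "contributors sampled")
```

Fixing the seed keeps the sample reproducible, which helps if the human review has to be re-run or extended later.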
Stage One: Eyeballing Jennifer Morrow
So what do ASF & Jupyter projects look like?
Wait what’s that tall bar?
fabien duplan
Some other things stand out quickly...
● Broad base of companies (maybe different kinds of diversity or correlated)?
● Easy to find community page
● Get involved link right on the home page
● Academic funding sources (NSF) + GSOC
Stage 2: Science John Floyd
What are some interesting project attributes?
● Does the project have a code of conduct?
● Does the project have a stated way for people to become committers?
● Does the project have a contributing guide?
● What’s the sentiment of the project’s user/dev list?
● PR acceptance rate
● Your ideas/suggestions - seriously e-mail us (and/or make PRs to the
notebook!)
j0035001-2
What about gender related attributes?
● Gender %s of code contributors
● Gender %s of mailing list users
● Gender %s of PMC / committers
● And correlations
charlene mcbride
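For anyone who wants to see what "correlations" means here: a minimal sketch of a Pearson correlation between a per-project gender percentage and a yes/no project attribute. The talk's numbers came out of Spark; this plain-Python version is a stand-in for illustration, and the five project rows are made-up toy data, not results from the notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-project rows: (nonmale percentage, has a code of conduct as 0/1).
projects = [
    (4.0, 0),
    (6.5, 1),
    (3.0, 0),
    (9.0, 1),
    (5.5, 1),
]
nonmale_pct = [p[0] for p in projects]
has_coc = [float(p[1]) for p in projects]

r = pearson(nonmale_pct, has_coc)
print(f"corr(nonmale_percentage, code_of_conduct_exists) = {r:.3f}")
```

With only ~50 projects, a coefficient like this is a hint about where to look, not evidence of cause.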
Slides for Correlations
[Row(corr(sampled.nonmale_percentage, infered.nonmale_percentage)=0.8402836506347078, corr(sampled.nonmale_percentage,
Answer_code_of_conduct_easy)=-0.05088697801152734, corr(infered.nonmale_percentage,
Answer_code_of_conduct_easy)=0.004552341326140643, corr(sampled.nonmale_percentage,
Answer_code_of_conduct_exists)=-0.05088697801152734, corr(infered.nonmale_percentage,
Answer_code_of_conduct_exists)=0.004552341326140643, corr(sampled.nonmale_percentage,
Answer_committer_guide_easy)=-0.30915940064845393, corr(infered.nonmale_percentage,
Answer_committer_guide_easy)=-0.0381086842740672, corr(sampled.nonmale_percentage,
Answer_committer_guide_exists)=-0.34084081419416784, corr(infered.nonmale_percentage,
Answer_committer_guide_exists)=-0.03831572641820849, corr(sampled.nonmale_percentage,
Answer_contributing_guide_easy)=0.00950903602820991, corr(infered.nonmale_percentage,
Answer_contributing_guide_easy)=0.04837014770606781, corr(sampled.nonmale_percentage,
Answer_contributing_guide_exists)=0.0202429856533326, corr(infered.nonmale_percentage,
Answer_contributing_guide_exists)=0.03636869585244893, corr(sampled.nonmale_percentage,
Answer_mentoring_guide_easy)=-0.15392301526227192, corr(infered.nonmale_percentage,
Answer_mentoring_guide_easy)=-0.055002597763866734, corr(sampled.nonmale_percentage,
Answer_mentoring_guide_exists)=-0.15392301526227192, corr(infered.nonmale_percentage,
Answer_mentoring_guide_exists)=-0.055002597763866734, corr(sampled.nonmale_percentage,
has_female_or_enby_committer_magic)=0.18942118337810188, corr(infered.nonmale_percentage,
has_female_or_enby_committer_magic)=0.20349367651041672, corr(sampled.nonmale_percentage,
nonmale_committer_percentage_magic)=0.5441035627011365, corr(infered.nonmale_percentage,
nonmale_committer_percentage_magic)=0.35402599653343864, corr(sampled.nonmale_percentage,
R. Crap Mariner
This wasn’t much better
+------------------------------------------------------------.....
|corr(sampled.nonmale_percentage, infered.nonmale_percentage)|corr(sampled.nonmale_percentage,
Answer_code_of_conduct_easy)|corr(infered.nonmale_percentage, Answer_code_of_conduct_easy)|corr(sampled.nonmale_percentage,
Answer_code_of_conduct_exists)|corr(infered.nonmale_percentage, Answer_code_of_conduct_exists)|corr(sampled.nonmale_percentage,
Answer_committer_guide_easy)|corr(infered.nonmale_percentage, Answer_committer_guide_easy)|corr(sampled.nonmale_percentage,
Answer_committer_guide_exists)|.....
| 0.8402836506347078| -0.05088697801152734|
0.004552341326140643| -0.05088697801152734| 0.004552341326140643|
-0.30915940064845393| -0.0381086842740672| -0.34084081419416784|
-0.03831572641820849| 0.00950903602820991| 0.04837014770606781|
0.0202429856533326| 0.03636869585244893| -0.15392301526227192|
-0.05500259776386...| -0.15392301526227192| -0.05500259776386...|
0.18942118337810188| 0.20349367651041672| 0.5441035627011365|
0.35402599653343864| 0.27903907421646745| -0.19842388895891314|
0.018343520672052215| -0.0531287316430999| -0.04570527792465824|
-0.11407965948006175| -0.02941906552049...| 0.010923839206653968|
-0.19651751264222414| -0.2121016705878764| -0.20639989813410967|
-0.21973083941480384| -0.31067113317726425| -0.15172448698670876|
-0.31736988968372776| -0.17906926611311288| 0.14828713581114333|
-0.28798744559651446| 0.540848408698061| -0.11571044537290899|
0.5044867286902844| -0.44725076538864206| 0.4935819383384438|
R. Crap Mariner
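One way to make a dump like the two above readable is to turn the wide row of `corr(a, b)` columns into one line per pair, sorted by strength. This is only a sketch: the dictionary holds a handful of values copied from the Row output above (column names verbatim, including the `infered` spelling), and `to_long` is a hypothetical helper, not something from the notebook.

```python
import re

# A few entries copied from the Row dump; not the full result set.
row = {
    "corr(sampled.nonmale_percentage, infered.nonmale_percentage)": 0.8402836506347078,
    "corr(sampled.nonmale_percentage, Answer_code_of_conduct_exists)": -0.05088697801152734,
    "corr(infered.nonmale_percentage, Answer_code_of_conduct_exists)": 0.004552341326140643,
    "corr(sampled.nonmale_percentage, Answer_committer_guide_exists)": -0.34084081419416784,
    "corr(infered.nonmale_percentage, Answer_committer_guide_exists)": -0.03831572641820849,
    "corr(sampled.nonmale_percentage, nonmale_committer_percentage_magic)": 0.5441035627011365,
    "corr(infered.nonmale_percentage, nonmale_committer_percentage_magic)": 0.35402599653343864,
}

# Matches column names of the form "corr(left, right)".
PATTERN = re.compile(r"corr\((?P<left>[^,]+),\s*(?P<right>[^)]+)\)")


def to_long(wide):
    """Turn {"corr(a, b)": value} into [(a, b, value)], strongest first."""
    out = []
    for name, value in wide.items():
        match = PATTERN.fullmatch(name)
        if match is None:
            continue  # skip anything that is not a corr(...) column
        out.append((match.group("left").strip(), match.group("right").strip(), value))
    out.sort(key=lambda item: abs(item[2]), reverse=True)
    return out


table = to_long(row)
for left, right, value in table:
    print(f"{left:<28} {right:<38} {value:+.3f}")
```

Sorting by absolute value puts the strongest positive and negative relationships at the top, which is usually what you want to eyeball first.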
Slides for Correlations
Inferred gender information
Sampled gender information
Barry Badcock
Oh howdy, there’s some differences….
● Maybe it’s from our data collection methods
● Inferred gender is also known to have issues, especially with non-American
names, non-cis folks, etc.
● Inferred sentiment detection maybe not great?
○ I just used nltk vader cause w/e
How was the human data collection done?
Instructions:
Find the gender of the user in question. You can look at the e-mails sent in
response to them, but also feel free to search online to find other information
about the user (use the project information to disambiguate cases of multiple people
with the same name).
List additional links possibly about the user used (e.g. linkedin, twitter, etc.)
Provided with:
E-mails in response to user, project name, author name, and github name
(All depending on what could be found)
DocChewbacca
First look Khairil Zhafri
Sentiment of mailing lists J. Triepke
And the rest…. Hajime
NAKANO
What about that inferred data?
Stage 3: Solutions to historical challenges
Remember the parallels in quotes? Maybe there are parallels in solutions?
● Short answer: hire women
○ In OSS we sometimes pretend we are not paid…. but a lot of us are.
● Longer answer: make training/mentorship programs to promote internal
candidates
○ Strangely enough, the existence of mentoring programs was negatively correlated
● Explicit “try-outs”
○ (or ways of hiring people who weren’t just friends)
● Not depending on randomly finding people
Nacho
Related work
● https://code.likeagirl.io/gender-bias-in-open-source-d1deda7dec28
● https://blog.bitergia.com/2016/10/11/gender-diversity-analysis-of-the-linux-kernel-technical-contributions/
● https://peerj.com/articles/cs-111/ (PR acceptance rates for women
insiders/outsiders)
● Livestreams of the data processing/collection -
http://bit.ly/holdenJupyterStreams
○ Did you know it’s perf season at Google? And Google is very metrics driven…. Also my
manager’s name is Steve.
Arthur Cruz
Special thanks!
Ann Spencer
Wrangler of cats and unicorns as the Head of Content at Domino Data Lab.
Formerly Data Editor at O'Reilly Media (aka Holden's editor).
Born and raised in San Francisco.
https://blog.dominodatalab.com/
Want to participate?
● New forum:
https://groups.google.com/forum/#!managemembers/oss-diversity-discussion
● Notebook code at https://github.com/holdenk/diversity-analytics /
http://bit.ly/holdendDiversityAnalyticsRepo
● Slides: https://www.slideshare.net/hkarau
● @holdenkarau & @instantmatthew
● And/or come say hi to us @ Strata
Melissa Wiese
High Performance Spark!
Unrelated to this talk. I’ll have a book signing @ 3:20pm at
the O’Reilly booth.
You can also buy it from that scrappy Seattle bookstore,
Jeff Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
Questions?

Jupyter con 2018 Diversity Analytics & OSS Adventures
