Mining Email Social Networks
Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, Anand Swaminathan

University of California, Davis

Presented By: Arnamoy Bhattacharyya
Communication & Co-ordination (C&C) activities are central to large software
projects

Difficult to observe and study in traditional (closed-source, commercial)
settings

The email archives of OSS projects provide a useful trace of the
communication and co-ordination activities of the participants
CHATTERERS & CHANGERS


A mailing list in an OSS project is a public forum



Anyone can post messages to the list.



Posted messages are visible to all the mailing list
subscribers.



  Posters include developers, bug-reporters, contributors (who submit
  patches, but don't have commit privileges) and ordinary
  users.
A response b to a message a is an indication that the sender of b (Sb)
found that the sender of a (Sa) had something interesting to say



    It is also an indication of Sa’s status, i.e., Sb indicates that s/he
    found Sa's email worth reading, and worthy of response.




However, the vast majority of individuals participating on the email
list sent very few messages, and received very few replies to their
messages
OF DOGS AND DEVELOPERS



“On the Internet, no one knows if you're a Dog” - Peter Steiner




The same individual can use different email aliases

       Developer Ian Holsman uses 7 different email aliases

           Ignoring these aliases would confound later steps of data analysis
Unmasking Aliases

Most emails include a header that identifies the sender, of this form:

From: "Bill Stoddard" <reddrum@attglobal.net>


        Crawl messages and extract all
        headers to produce a list of
        <Name,email> identifiers (IDs)

            Execute a clustering algorithm
            that measures the similarity
            between every pair of IDs

                Manually post-process the
                clusters formed to remove
                further aliases


     Set the cluster similarity threshold quite low:
     it is easier to split big clusters than to unify two disparate clusters from a
     very large set.
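
The first step above (turning From: headers into <Name,email> IDs) can be sketched as follows. This is a minimal illustration, not the authors' crawler: the function names `extract_id` and `collect_ids` are hypothetical, addresses are lowercased as an assumption, and the input is assumed to be the header value without the leading "From:".

```python
from email.utils import parseaddr


def extract_id(from_header):
    """Return a (name, email) pair from the value of a From: header."""
    name, address = parseaddr(from_header)
    # Lowercasing the address is an assumption made for this sketch.
    return name, address.lower()


def collect_ids(from_headers):
    """Collect the distinct (name, email) IDs seen across many headers."""
    return sorted({extract_id(h) for h in from_headers})


# The same sender written two ways collapses to a single ID.
ids = collect_ids([
    '"Bill Stoddard" <reddrum@attglobal.net>',
    'Bill Stoddard <reddrum@attglobal.net>',
])
```

The resulting list of IDs is what the clustering step would then compare pairwise.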
THE CLUSTERING ALGORITHM




1. Normalize name
    Remove all punctuation and suffixes (“jr”)

    Turn all whitespace into a single space

    Remove generic terms like “admin”, “support” from the name

    Split the name into first name and last name (using whitespace
    and commas as cues)
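
The normalization steps above might look like this in code. This is a sketch under stated assumptions: `normalize_name` is a hypothetical name, lowercasing is added for convenience, "sr" is an assumed extra suffix, and a comma is read as "Last, First" (or as introducing a suffix).

```python
import string

GENERIC_TERMS = {"admin", "support"}  # from the slide; extend as needed
SUFFIXES = {"jr", "sr"}               # "sr" is an assumed addition
DROPPED = GENERIC_TERMS | SUFFIXES


def normalize_name(raw):
    """Return (first, last) after the normalization steps above."""
    # Use a comma as a cue for "Last, First" before punctuation is stripped.
    if "," in raw:
        last, _, first = raw.partition(",")
        raw = f"{first} {last}"
    # Remove punctuation and lowercase (lowercasing is an assumption).
    cleaned = raw.translate(str.maketrans("", "", string.punctuation)).lower()
    # split() collapses any run of whitespace; drop suffixes and generic terms.
    words = [w for w in cleaned.split() if w not in DROPPED]
    if not words:
        return "", ""
    if len(words) == 1:
        return words[0], ""
    # Middle names, if any, are ignored in this sketch.
    return words[0], words[-1]
```

For example, `normalize_name("Smith, Andy")` and `normalize_name("Andy  Smith")` both yield the same pair, which is what the later similarity steps rely on.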
THE CLUSTERING ALGORITHM


2. Name Similarity:
Use a scoring algorithm between:

    The full names
    The first name and last name separately
    Consider names similar if the full names are similar, or
    if both first and last names are similar

E.g.

  Andy Smith <-> Andrew Smith

  Deepa Patel !<-> Deepa Ratnaswamy
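
The slide does not name the scoring algorithm, so the sketch below swaps in Python's `difflib.SequenceMatcher` ratio as a stand-in. The 0.75 cut-off and the function names are assumptions chosen only to illustrate the "full names, or both parts" rule.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.75  # assumed cut-off, not taken from the slides


def score(a, b):
    """Similarity in [0, 1]; a stand-in for the unspecified scoring algorithm."""
    return SequenceMatcher(None, a, b).ratio()


def names_similar(first_a, last_a, first_b, last_b, threshold=THRESHOLD):
    """Similar if the full names are similar, or if both parts are similar."""
    full = score(f"{first_a} {last_a}", f"{first_b} {last_b}")
    if full >= threshold:
        return True
    return (score(first_a, first_b) >= threshold
            and score(last_a, last_b) >= threshold)
```

With these assumptions the two slide examples behave as described: the Andy/Andrew Smith pair passes on the full-name score, while the two Deepas fail because the last names differ.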
THE CLUSTERING ALGORITHM


3. Names-email Similarity:
   If the email contains both first and last names: match

   Arnamoy Bhattacharyya <-> ar.bhat@yahoo.com

   If the email contains the initial of one part of the name and the entirety
   of the other part: match

   Erin Bird <-> ebird
   Erin Bird <-> erinb
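
A strict reading of the two rules above can be sketched as below. The function name is hypothetical, and the way the initial is combined with the other name part (initial first, or initial last) is an assumption based on the "ebird" and "erinb" examples.

```python
def name_matches_email(first, last, email):
    """Apply the two name-to-email rules to the part before the "@"."""
    base = email.split("@", 1)[0].lower()
    first, last = first.lower(), last.lower()
    if not first or not last:
        return False
    # Rule 1: the address contains both the first and the last name.
    if first in base and last in base:
        return True
    # Rule 2: initial of one part plus the entirety of the other part.
    return (first[0] + last) in base or (first + last[0]) in base
```

This strict version only accepts whole name parts, so abbreviated forms such as "ar.bhat" would need a looser rule than the one sketched here.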
THE CLUSTERING ALGORITHM

4. Email Similarity:
  If the Levenshtein edit distance between two email address bases (the part
  before the "@", not including the domain) is small: match
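
The Levenshtein distance between address bases can be computed with the usual dynamic-programming recurrence. In this sketch the helper names and the `max_distance=2` cut-off are assumptions; the slide only says the distance should be "small".

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def email_base(address):
    """Part of the address before the "@", lowercased."""
    return address.split("@", 1)[0].lower()


def emails_similar(addr_a, addr_b, max_distance=2):
    """True if the two address bases are within the assumed edit distance."""
    return levenshtein(email_base(addr_a), email_base(addr_b)) <= max_distance
```

Because only the bases are compared, the same user name at two different domains counts as a match.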
THE CLUSTERING ALGORITHM

5. Cumulative ID similarity:
   The similarity between two IDs is the maximum of all the similarities
   above

E.g.

Name similarity: 3
Names-email similarity: 5
Email similarity: 2

The cumulative similarity is 5; if the threshold is 4, the pair is considered a match
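
The worked example above reduces to a maximum followed by a threshold test. In this sketch the function names are hypothetical, and treating a score equal to the threshold as a match is an assumption.

```python
def cumulative_similarity(name_score, name_email_score, email_score):
    """The similarity of two IDs is the maximum of the individual scores."""
    return max(name_score, name_email_score, email_score)


def is_match(scores, threshold):
    """True if the cumulative similarity reaches the threshold (assumed >=)."""
    return cumulative_similarity(*scores) >= threshold


# The slide's example: scores 3, 5 and 2 against a threshold of 4.
example = is_match((3, 5, 2), threshold=4)
```

Taking the maximum means a single strong signal is enough to pair two IDs, which fits the deliberately low threshold and the manual post-processing step that follows.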
The vast majority of people send only one message, and
there are some who send a great many
Out-degree - number of different people from whom an individual has
received responses

                Higher out-degree <-> higher status
In-degree - number of different people to whom an individual has
replied
                   Indicates the level of engagement of an
                   individual in the mailing list and the breadth of
                   his/her interests
            The distributions show a small-world character
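
Both degree measures, as defined on these slides, can be derived from a list of reply events. This is a sketch: the input shape (pairs of replier and original sender, after aliases have been merged) and the choice to ignore self-replies are assumptions.

```python
from collections import defaultdict


def degrees(replies):
    """replies: iterable of (replier, original_sender) pairs.

    Returns (in_degree, out_degree) dictionaries keyed by person.
    """
    responders = defaultdict(set)  # person -> people who replied to them
    replied_to = defaultdict(set)  # person -> people they replied to
    for replier, original in replies:
        if replier == original:
            continue  # assumed: self-replies are ignored
        responders[original].add(replier)
        replied_to[replier].add(original)
    people = set(responders) | set(replied_to)
    # Out-degree: distinct people from whom this person received responses.
    out_degree = {p: len(responders[p]) for p in people}
    # In-degree: distinct people to whom this person has replied.
    in_degree = {p: len(replied_to[p]) for p in people}
    return in_degree, out_degree


# Made-up reply events for illustration only.
in_deg, out_deg = degrees([
    ("bob", "alice"), ("carol", "alice"), ("bob", "alice"), ("alice", "bob"),
])
```

Using sets means repeated replies between the same two people count once, matching the "different people" wording in the definitions.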
High correlation between messages sent and replies received (out order):
0.97
The correlation may not be genuine:

1. People who post only relevant messages receive many responses to their
   messages
2. Only people who receive replies from several people keep sending
   messages (Survival Effect)
Each link indicates at least 150 messages sent
C&C ACTIVITY AND DEVELOPMENT ACTIVITY

How does email activity relate to software development activity?

73 committers:

1. A correlation of 0.80 between the number of messages sent by an
individual and the number of source changes they make:
more software development work <-> more C&C activity

2. A correlation of 0.57 between the number of messages sent by an
individual and the number of document changes they make:
source code activities require much more co-ordination effort than
documentation effort
Are developers more likely to play the role of gatekeepers or brokers in the
complete email social network?

Betweenness (BW)

   High betweenness <-> the person is a kind of broker or gatekeeper
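
To make the measure concrete, here is a minimal sketch of the standard shortest-path betweenness, computed with Brandes' algorithm on an unweighted directed graph. The adjacency-dictionary input and the lack of normalization are assumptions; the normalization used in the study may differ.

```python
from collections import deque


def betweenness(adj):
    """adj: dict mapping each node to a set of successor nodes.

    Returns unnormalized betweenness, summed over ordered (source, target) pairs.
    """
    nodes = set(adj) | {w for succ in adj.values() for w in succ}
    bw = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, visiting the farthest nodes first.
        delta = {v: 0.0 for v in nodes}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bw[w] += delta[w]
    return bw


# A made-up three-person chain: every shortest path between a and c runs through b.
chain = betweenness({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}})
```

In the chain, only "b" lies between other people, so it is the only node with a non-zero score; that is the sense in which a high-betweenness person acts as a broker.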
Developers are higher in status than non-developers
Relative Status of Developers

Do the most active developers have the highest status among developers?

 Source changes are not as highly correlated with document changes <-> not all
 developers are engaged in both to the same degree

   Source changes show the strongest rank correlation with social network
   status <-> the most active developers play the strongest role of
   communicators, brokers, and gatekeepers
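
A rank correlation of this kind can be computed as below. The sketch assumes Spearman's coefficient (Pearson correlation of average ranks); the slides say only "rank correlation", and the numbers in the example are made up.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up per-developer counts: source changes and a status measure.
rho = spearman([5, 40, 12, 300], [2, 9, 4, 30])
```

Because only the ordering matters, a few extremely active developers do not dominate the coefficient the way they could with raw counts.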
Conclusion

The level of activity on the mailing list is strongly correlated with source code
change activity, and to a lesser extent with document change activity.

Social network measures such as in-degree, out-degree and betweenness
indicate that developers who actually commit changes play much more
significant roles in the email community than non-developers.

Even within the select group of developers, there is a strong correlation
between social network importance and level of source code change activity.
Questions?

Mining Email Social Networks

  • 1. Mining Email Social Networks Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, Anand Swaminathan University of California, Davis Presented By: Arnamoy Bhattacharyya
  • 2. Communication & Co-ordination (C&C) activities are central to large software projects
  • 3. Communication & Co-ordination (C&C) activities are central to large software projects Difficult to observe and study in traditional (closed-source, commercial) settings
  • 4. Communication & Co-ordination (C&C) activities are central to large software projects Difficult to observe and study in traditional (closed-source, commercial) settings the email archives of OSS projects provide a useful trace of the communication and co-ordination activities of the participants
  • 5. CHATTERERS & CHANGERS A mailing list in an OSS project is a public forum
  • 6. CHATTERERS & CHANGERS A mailing list in an OSS project is a public forum Anyone can post messages to the list.
  • 7. CHATTERERS & CHANGERS A mailing list in an OSS project is a public forum Anyone can post messages to the list. Posted messages are visible to all the mailing list subscribers.
  • 8. CHATTERERS & CHANGERS A mailing list in an OSS project is a public forum Anyone can post messages to the list. Posted messages are visible to all the mailing list subscribers. Posters include developers, bug-reporters, contributors (who submit patches, but don't have commit privileges) and ordinary users.
  • 9. A response b to a message a is an indication That – the sender of b; (Sb) found that the sender of a; (Sa) had something interesting to say
  • 10. A response b to a message a is an indication That – the sender of b; (Sb) found that the sender of a; (Sa) had something interesting to say It is also an indication of Sa’s status, i.e., Sb indicates that s/he found Sa's email worth reading, and worthy of response.
  • 11. A response b to a message a is an indication That – the sender of b; (Sb) found that the sender of a; (Sa) had something interesting to say It is also an indication of Sa’s status, i.e., Sb indicates that s/he found Sa's email worth reading, and worthy of response. However, the vast majority of individuals participating on the email list sent very few messages, and received very few replies to their messages
  • 12. OF DOGS AND DEVELOPERS “On the Internet, no one knows if you're a Dog“ - Peter Steiner
  • 13. OF DOGS AND DEVELOPERS “On the Internet, no one knows if you're a Dog" The same individual can use different email aliases
  • 14. OF DOGS AND DEVELOPERS “On the Internet, no one knows if you're a Dog" The same individual can use different email aliases developer Ian Holsman uses 7 different email aliases
  • 15. OF DOGS AND DEVELOPERS “On the Internet, no one knows if you're a Dog" The same individual can use different email aliases developer Ian Holsman uses 7 different email aliases Ignoring these aliases would confound later steps of data analysis
  • 16. Unmasking Aliases Most emails include a header that identifies the sender, of this form: From: "Bill Stoddard" <reddrum@attglobal.net>
  • 17. Unmasking Aliases Most emails include a header that identifies the sender, of this form: From: "Bill Stoddard" <reddrum@attglobal.net> Crawl messages and extract all headers to produce a list of <Name,email> identifiers (IDs) Execute a clustering algorithm that measure the similarity between every pair of IDs Manually Post Process the clusters formed to remove further aliases
  • 18. Unmasking Aliases Most emails include a header that identifies the sender, of this form: From: "Bill Stoddard" <reddrum@attglobal.net> Crawl messages and extract all headers to produce a list of <Name,email> identifiers (IDs) Execute a clustering algorithm that measure the similarity between every pair of IDs Manually Post Process the clusters formed to remove further aliases set the cluster similarity threshold quite low: easier to split big clusters than to unify two disparate clusters from a very large set.
  • 19. THE CLUSTERING ALGORITHM 1. Normalize name remove all punctuation, suffixes (“jr") turn all whitespace into a single space Remove generic terms like “admin", “support", from the name split the name into first name and last name (using whitespace and commas as cues)
  • 20. THE CLUSTERING ALGORITHM 2. Name Similarity: Use a scoring algorithm between – The full names The first name and last name separately Consider names similar if the full names are similar, or if both first and last names are similar e.G Andy Smith <-> Andrew Smith Deepa Patel !<-> Deepa Ratnaswamy
  • 21. THE CLUSTERING ALGORITHM 3. Names-email Similarity: If the email contains both first and last names – match Arnamoy Bhattacharyya <-> ar.bhat@yahoo.com if the email contains the initial of one part of the name and entirety of the other part – match Erin Bird <-> ebird Erin Bird <-> erinb
  • 22. THE CLUSTERING ALGORITHM 4. Email Similarity: If the Levenshtein edit distance between two email address bases (not including the domain, after the "@") is small – Match
  • 23. THE CLUSTERING ALGORITHM 5. Cumulative ID similarity: The similarity between two IDs is the maximum of the all mentioned above E.G Name Similarity – 3 Names-email similarity – 5 Email Similarity – 2 If the threshold is 4, it would be considered as a match
  • 25. The vast majority of people send only one message, while a few send a great many.
  • 27. Out-degree: the number of different people from whom an individual has received responses. Higher out-degree <-> higher status.
  • 29. In-degree: the number of different people to whom an individual has replied. It indicates the level of engagement of an individual in the mailing list and the breadth of his/her interests. The distributions show a small-world character.
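Using the definitions on these slides, both degrees can be derived from a list of reply events. The sketch below assumes a hypothetical input format of (replier, original sender) pairs and ignores self-replies; neither choice comes from the slides.

```python
from collections import defaultdict


def degrees(replies):
    """replies: iterable of (replier, original_sender) pairs.

    Returns {person: {"out_degree": ..., "in_degree": ...}} where
    out_degree counts the distinct people who replied to the person and
    in_degree counts the distinct people the person replied to.
    """
    responders = defaultdict(set)  # person -> people who replied to them
    replied_to = defaultdict(set)  # person -> people they replied to
    for replier, original in replies:
        if replier == original:
            continue  # assumption: self-replies are ignored
        responders[original].add(replier)
        replied_to[replier].add(original)
    people = set(responders) | set(replied_to)
    return {
        p: {"out_degree": len(responders.get(p, ())),
            "in_degree": len(replied_to.get(p, ()))}
        for p in people
    }
```

Because sets are used, repeated replies between the same two people count once, matching the "number of different people" wording.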
  • 30. High correlation between messages sent and replies received (out-degree): 0.97
  • 31. The correlation may have other explanations: 1. People who post only relevant messages receive many responses. 2. Only people who receive replies from several people keep sending messages (survival effect).
  • 32. Each link indicates at least 150 messages sent
  • 34. C&C ACTIVITY AND DEVELOPMENT ACTIVITY. How does email activity relate to software development activity? Among 73 committers: 1. A correlation of 0.80 between the number of messages sent by an individual and the number of source changes they make: more software development work <-> more C&C activity. 2. A correlation of 0.57 between the number of messages sent by an individual and the number of document changes they make: source code activities require much more co-ordination effort than documentation activities.
  • 37. Are developers more likely to play the role of gatekeepers or brokers in the complete email social network? Betweenness (BW): high betweenness <-> the person is a kind of broker, or gatekeeper.
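Betweenness counts how often a node lies on shortest paths between other nodes. The slides do not say how it was computed; the sketch below uses Brandes' algorithm on an unweighted directed graph, with an assumed input format of a dictionary mapping each person to the set of people their edges point to.

```python
from collections import deque


def betweenness(graph):
    """graph: dict mapping node -> set of successor nodes (unweighted, directed).

    Returns unnormalized betweenness scores (Brandes' algorithm).
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the nodes farthest from s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score
```

In a star-shaped network where every exchange passes through one hub, the hub receives the entire score and the leaves receive zero, which is the broker or gatekeeper pattern the slide describes.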
  • 39. Developers are higher in status than non-developers
  • 42. Relative Status of Developers. Do the most active developers have the highest status among developers? Source changes are not as highly correlated with document changes <-> not all developers are engaged in both to the same degree. Source changes show the strongest rank correlation with social network status <-> the most active developers play the strongest role of communicators, brokers, and gatekeepers.
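A rank correlation of the kind mentioned here can be computed as Spearman's coefficient: rank both variables (averaging ranks for ties), then take the Pearson correlation of the ranks. The slides do not state which rank statistic or which status measure was used, so the sketch below is only an illustration with hypothetical function names.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation of two equal-length sequences."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Here `xs` could hold each developer's number of source changes and `ys` a status measure such as out-degree or betweenness.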
  • 45. Conclusion. The level of activity on the mailing list is strongly correlated with source code change activity, and to a lesser extent with document change activity. Social network measures such as in-degree, out-degree and betweenness indicate that developers who actually commit changes play much more significant roles in the email community than non-developers. Even within the select group of developers, there is a strong correlation between social network importance and the level of source code change activity.