Social Interactions around
Cross-System Bug Fixings:
        The Case of
  FreeBSD and OpenBSD
  Gerardo Canfora, Luigi Cerulo,
Marta Cimitile, Massimiliano Di Penta
       dipenta@unisannio.it
Context
  Source code is often reused across different systems
    Unixes (FreeBSD, OpenBSD, Linux)
    Office applications (NeoOffice, OpenOffice)
    Desktop environment apps (KDE or GNOME apps)
  Maintenance might require propagating bug fixes
    We call this “Cross System Bug Fixing” (CSBF)


  Example:
     FreeBSD, 1996/01/19, file ip_icmp.h:
       – “Added definitions for ICMP router discovery. Reviewed by:
         wollman”
     OpenBSD, 1996/08/02, file ip_icmp.h:
       – “ICMP Router Discovery definitions; from FreeBSD”
What we propose
  A method to track CSBFs
  A study of the social characteristics
   and development activity of
   CSBF committers
    degree, betweenness, brokerage
    commits, lines changed
Detecting CSBF - I
  Step 1: mining cross-referencing commits
    openbsd, atphy.c,2008/09/25 20:47:16,brad,
     Add a driver for the Attansic F1 PHY. From FreeBSD via
     kevlo@
  Step 2: mine commits previously made to files
   with the same name in the other system
    freebsd,atphy.c,2008/05/19 01:12:10,yongari,
     Add Attansic/Atheros F1 PHY driver.
    openbsd, atphy.c,2008/09/25 20:47:16,brad,
     Add a driver for the Attansic F1 PHY. From FreeBSD via
     kevlo@
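Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the record layout follows the comma-separated examples above, while the matching rule (the note mentions the other system's name) and the helper names `parse`, `references`, and `candidates` are assumptions.

```python
import re
from datetime import datetime

def parse(line):
    # Fields as in the slide examples: system, file, timestamp, committer, note.
    system, path, stamp, committer, note = [f.strip() for f in line.split(",", 4)]
    return {
        "system": system.lower(),
        "file": path,
        "when": datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"),
        "committer": committer,
        "note": note,
    }

def references(note, other_system):
    # Step 1 (assumed rule): the note names the other system.
    pattern = r"\b" + re.escape(other_system) + r"\b"
    return re.search(pattern, note, re.IGNORECASE) is not None

def candidates(referring, all_commits):
    # Step 2: earlier commits on a file with the same name in the other system.
    return [
        c for c in all_commits
        if c["system"] != referring["system"]
        and c["file"] == referring["file"]
        and c["when"] < referring["when"]
    ]

log = [
    "freebsd,atphy.c,2008/05/19 01:12:10,yongari,Add Attansic/Atheros F1 PHY driver.",
    "openbsd, atphy.c,2008/09/25 20:47:16,brad,Add a driver for the Attansic F1 PHY. From FreeBSD via kevlo@",
]
commits = [parse(line) for line in log]
referring = [c for c in commits
             if c["system"] == "openbsd" and references(c["note"], "FreeBSD")]
earlier = candidates(referring[0], commits)
```

Here `earlier` holds the single FreeBSD commit by yongari, which Steps 3 and 4 would then confirm or reject.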
Detecting CSBF - II
  Step 3: compute file similarity with clone detection
    CCFinder
    Threshold: at least 10% of cloned lines
  Step 4: take the earlier change whose commit note has
   the highest textual similarity
    Use of Vector Space models
    Cosine similarity; threshold (0.20) to filter out unrelated
     commits

     cosine(“Add Attansic/Atheros F1 PHY driver.”,
            “Add a driver for the Attansic F1 PHY. From FreeBSD via kevlo@”)
       = 0.72
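Step 4 can be illustrated with a small vector space sketch. It assumes raw term counts and a simple alphanumeric tokenizer; the slides do not state the term weighting or preprocessing, so the scores it produces will generally differ from the 0.72 reported above. The names `tokens`, `cosine`, and `best_link` are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(note):
    # Assumed preprocessing: lowercase alphanumeric terms, no stop-word removal.
    return re.findall(r"[a-z0-9]+", note.lower())

def cosine(a, b):
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def best_link(referring_note, candidate_notes, threshold=0.20):
    # Keep the most similar earlier note, or nothing if it falls below the threshold.
    scored = [(cosine(referring_note, n), n) for n in candidate_notes]
    score, note = max(scored, default=(0.0, None))
    return (note, score) if score >= threshold else (None, score)

referring_note = "Add a driver for the Attansic F1 PHY. From FreeBSD via kevlo@"
linked, score = best_link(
    referring_note,
    ["Add Attansic/Atheros F1 PHY driver.", "Fix typo in manual page"],
)
```

With these two candidates, the driver commit is selected and the unrelated note is filtered out because it shares no terms with the referring note.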
Building Committers' Network
  We extract communication from mailing
   lists
    Bug fixing mailing lists
  Heuristic similar to that of Bird et al.
   [2006] to unify inconsistent names /
   emails
    Also used to map committer IDs to mailing list
     names/emails
  Nodes of the network labeled as:
    Committer / other mailing list contributors
    CSBFs committer
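The identity-unification step might look like the sketch below. It is a deliberately simplified stand-in, not the Bird et al. heuristic itself: it only treats two senders as the same person when their addresses or normalized names match exactly, and it maps a committer ID to a sender when the ID equals the local part of the email address. The sample identities and the helper names are invented for illustration.

```python
import re
from email.utils import parseaddr

def normalize(name):
    # Lowercase, drop punctuation and digits, collapse whitespace.
    name = re.sub(r"[^a-z ]", " ", name.lower())
    return " ".join(name.split())

def same_person(sender_a, sender_b):
    name_a, mail_a = parseaddr(sender_a)
    name_b, mail_b = parseaddr(sender_b)
    if mail_a and mail_a.lower() == mail_b.lower():
        return True
    na, nb = normalize(name_a), normalize(name_b)
    return bool(na) and na == nb

def matches_committer(committer_id, sender):
    # Assumed rule: the committer ID equals the email local part.
    _, mail = parseaddr(sender)
    return mail.lower().split("@")[0] == committer_id.lower()

# Hypothetical mailing list senders.
a = "Jane Doe <jdoe@example.org>"
b = "jane doe <jane.doe@example.com>"
c = "John Roe <jroe@example.org>"
```

A real heuristic would also need fuzzy name matching and manual checks of the merged identities, since exact matching misses many aliases.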
Empirical Study
 Goal: analyze the phenomenon of CSBFs
 Purpose: understanding its relevance with
  respect to the social characteristics of the
  involved developers
 Context: CVS repositories and mailing lists
  archives of FreeBSD and OpenBSD
   Period: 1993-2009 (FreeBSD), 1998-2009
    (OpenBSD)
   Commits: 119,000 (FreeBSD), 70,000 (OpenBSD)
Research Questions
  RQ1: How do the source code committers
   and contributors of the two systems
   overlap?
  RQ2: How frequent is the phenomenon of
   CSBFs?
  RQ3: Who are the contributors involved in
   CSBFs?
  RQ4: Are mailing list contributors involved
   in CSBFs more active than others?
RQ1 – Team overlap
                                              FreeBSD  OpenBSD  Both
  Committers                                      383      211    26
  Mailing list contributors                      8035     3843   359
  Committers and mailing list contributors        213      122    17


  The two projects share fewer than 10% of
 their contributors →
 the FreeBSD and OpenBSD development
 teams are largely distinct
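The overlap figure can be checked with a short calculation. The sketch assumes that each per-project count already includes the people counted under "Both", and it measures overlap against the union of the two teams; the slides do not say which denominator was used.

```python
# Counts from the RQ1 table: (FreeBSD, OpenBSD, Both).
counts = {
    "Committers": (383, 211, 26),
    "Mailing list contributors": (8035, 3843, 359),
    "Committers and mailing list contributors": (213, 122, 17),
}

def overlap(freebsd, openbsd, both):
    # Assumption: per-project counts include shared people,
    # so the union is freebsd + openbsd - both.
    return both / (freebsd + openbsd - both)

shares = {role: overlap(*row) for role, row in counts.items()}
for role, share in shares.items():
    print(f"{role}: {share:.1%}")
```

Under this reading, every row stays below 10%.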
RQ2 – Commit filtering
                       FreeBSD  OpenBSD
  Referring commits        439      933
  Cloned files             133      296
  Linked commits            59      120



          After filtering, not that many remain, but...
RQ2 – Cloned lines in CSBF files




  [Figure: percentage of cloned lines in CSBF files; panels: C source files, header files]
  The percentage is smaller for .h files
  Preprocessor conditionals make header files system-dependent
    #if defined(__FreeBSD__)
RQ3 – CSBF Graph (excerpt)
Blue/cyan: FreeBSD
Red/orange: OpenBSD
Yellow: common
RQ3: social characteristics
  Importance in terms of
    (in/out) degree: number of (incoming/outgoing)
     communication links
    Betweenness: number of shortest communication paths
     on which the node lies
  Brokerage metrics: useful to analyze the
   communication between two clusters

    B is a coordinator
    B is a gatekeeper
    B is a representative
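Degree and brokerage counts can be sketched on a toy network. The role definitions below follow the usual Gould and Fernandez scheme for a path A → B → C with no direct A → C link: B is a coordinator when all three belong to the same group, a gatekeeper when only A is an outsider, and a representative when only C is an outsider. That the study uses exactly these definitions is an assumption, and the names and edges are invented.

```python
from collections import Counter

def degrees(edges):
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg

def brokerage_roles(edges, group):
    edge_set = set(edges)
    roles = {}
    for a, b in edge_set:
        for b2, c in edge_set:
            # Need a path a -> b -> c over three distinct nodes, with no shortcut a -> c.
            if b2 != b or len({a, b, c}) < 3 or (a, c) in edge_set:
                continue
            ga, gb, gc = group[a], group[b], group[c]
            if ga == gb == gc:
                role = "coordinator"
            elif ga != gb and gb == gc:
                role = "gatekeeper"
            elif ga == gb and gb != gc:
                role = "representative"
            else:
                continue  # other brokerage roles are not covered here
            roles.setdefault(b, Counter())[role] += 1
    return roles

# Hypothetical contributors and who writes to whom.
group = {"alice": "FreeBSD", "bob": "FreeBSD", "carol": "OpenBSD", "dave": "OpenBSD"}
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]
in_deg, out_deg = degrees(edges)
roles = brokerage_roles(edges, group)
```

In this toy network bob carries a message out of the FreeBSD group (representative) and carol lets it into the OpenBSD group (gatekeeper).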
RQ3 – social characteristics
  [Figure: bar chart comparing CSBF committers and others on degree, in-degree, out-degree, betweenness / 1000, coordinator / 10, gatekeeper, and representative]



  All differences are statistically significant
  Large effect size (Cohen's d > 1)
  Contributors involved in CSBFs are more important in
   the communication network and in the flow of communication
   between systems
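For reference, Cohen's d can be computed as below. The sketch uses the common pooled-standard-deviation form, which the slides do not confirm as the variant used in the study, and the two samples are made-up values rather than study data.

```python
import math
import statistics

def cohens_d(x, y):
    # Difference of means divided by the pooled sample standard deviation.
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)

# Hypothetical degree values for two groups of contributors.
csbf = [12, 15, 11, 14]
others = [3, 5, 4, 6, 2]
d = cohens_d(csbf, others)
```

By the usual convention, values of d around 0.8 or above are read as a large effect.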
RQ3 – committers with highest
social metrics
RQ4 – change activity of CSBF
committers and others
  [Figure: two bar charts, LOC added/removed and commits, comparing CSBF committers and others in FreeBSD and OpenBSD]




    All differences are statistically significant
    Large effect size (Cohen's d ≈ 1)
    Contributors involved in CSBFs are more active
     than others
Conclusions and Work-in-Progress
  We proposed a method to mine CSBFs
  We reported a study on FreeBSD and OpenBSD where:
    The development teams are almost disjoint
    There is a small, though not negligible, portion of CSBFs
    Committers involved in CSBFs have
     – Higher social importance
     – Higher brokerage level
     – Higher activity in source code commits
  Work-in-progress:
    Better approaches to identify implicit CSBFs, tracking and
     linking changes occurring on both systems
    More extensive study of less obvious cases

Dipenta msr2011-csbf
