An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data
Talk given at the 2009 International Conference on Software Maintenance in Edmonton, Alberta, Canada.

Published in: Education, Technology, Business
An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

  1. An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data
       Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan
       Queen’s University, Canada
  2. Development Repositories
       • Source code
       • Communication archives
       • Bug databases
  4. The Importance of Mailing List Archives
       • Email popular form of communication
       • Mailing lists to distribute messages
       • Messages contain valuable information
       • Discussions of source code
       • Development decisions
       • Error reports
       • User support requests
  5. Mining the Mailing Lists of 23 Open-Source Projects
       • Summarizing developer mailing lists
       • Using off-the-shelf tools
       • Data from around 500,000 emails
       • Unexpected results from experiments
  6. [Figure: cloud of terms taken from the mailing list text. Ordinary words (things, info, configuration, reasonable) appear next to source code fragments (palloc(bufsize);, +extern, #include), PGP residue (SIGNATURE, -----END, PGP), file names and paths (pg_hba.conf, /etc/pgsql/), URLs (http://xyzzy.dhs.org/~drew/), and date and time fragments (Nov, Dec, 2001, 08:27:06).]
  8. While mining Mailing Lists of 23 Open-Source Projects
       • Don’t treat mail archives as textual data
       • Changing technologies
       • Up to 98% of messages contain noise
       Additional processing and cleaning needed!
  10. Example message:

        From geek+@cmu.edu Wed Jan 21 08:11:26 1998
        Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
        From: "Brian E. Gallew" <geek+@cmu.edu>
        Subject: Re: [HACKERS] configure

        - ---559023410-851401618-854387445=:824
        Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

        > If you can grab a copy and run it on your machine, and send me
        > the output, that would help alot.

        Here is a gzip'ed tar of the results.

        =====================================================================
        | Please do not shoot at the thermonuclear weapons! -- Deacon |
        =====================================================================
        | Finger geek@andrew.cmu.edu for my public key. |
        =====================================================================

        - ---559023410-851401618-854387445=:824
        Content-Type: APPLICATION/x-gzip
        Content-Transfer-Encoding: BASE64
        Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

        H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W
        UgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/Qx
        ffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/
        gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn
  12. Resolving Multiple Sender Identities
       • Participants send mail from different addresses
       • Up to 21% of addresses are aliases
       • Such aliases bias identity-based analyses
       • Manual inspection and correction tedious
       • No fully automated approach to resolve identities
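
Since the slide notes that no fully automated approach exists, the following is only a minimal sketch of one common heuristic: grouping addresses that share a normalized display name and handing the groups to a human for review. The function names and the sample headers are hypothetical and are not taken from the talk.

```python
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(display_name):
    """Lowercase a display name and drop punctuation so variants compare equal."""
    kept = "".join(ch for ch in display_name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def candidate_aliases(from_headers):
    """Group e-mail addresses that share a normalized display name.

    Only names seen with more than one address are returned. These are
    candidates for manual review, not confirmed aliases.
    """
    by_name = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        if name and address:
            by_name[normalize_name(name)].add(address.lower())
    return {name: sorted(addrs) for name, addrs in by_name.items() if len(addrs) > 1}


# Hypothetical From headers, for illustration only.
headers = [
    '"Jane Q. Dev" <jane@example.org>',
    '"Jane Q Dev" <jdev@example.com>',
    '"Sam Other" <sam@example.net>',
]
aliases = candidate_aliases(headers)
```

A name-based heuristic like this misses people who change their display name and can wrongly merge two people with the same name, which is one reason manual correction remains necessary.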
  13. Reconstructing Discussion Threads
       • Mail stored sequentially in archives
       • Logical grouping: discussion topics
       • Required information erroneous or missing
       • Essential for social network and topic analysis
       [Figure: messages A, B, C, D shown as a linear sequence and as a thread hierarchy]
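
As a rough illustration of turning a linear sequence into a thread hierarchy, the sketch below links messages through the standard `Message-ID` and `In-Reply-To` headers. It assumes those headers are present; messages whose parent is missing or unknown are simply treated as thread roots, which is the situation the slide warns about. The function name and the sample messages are hypothetical.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(raw_messages):
    """Map each parent Message-ID to the Message-IDs of its direct replies.

    Messages whose In-Reply-To header is missing, or points at a message
    that is not in the archive, are listed under the key None (thread roots).
    """
    parsed = [message_from_string(raw) for raw in raw_messages]
    known_ids = {(m["Message-ID"] or "").strip() for m in parsed}
    known_ids.discard("")
    children = defaultdict(list)
    for msg in parsed:
        msg_id = (msg["Message-ID"] or "").strip()
        if not msg_id:
            continue  # a message without an identifier cannot be placed
        parent = (msg["In-Reply-To"] or "").strip()
        if parent not in known_ids:
            parent = None  # missing or dangling reference: treat as a root
        children[parent].append(msg_id)
    return dict(children)


# Hypothetical archive: B replies to A, C replies to B, D points at a lost message.
archive = [
    "Message-ID: <a@example.org>\n\nfirst\n",
    "Message-ID: <b@example.org>\nIn-Reply-To: <a@example.org>\n\nreply\n",
    "Message-ID: <c@example.org>\nIn-Reply-To: <b@example.org>\n\nreply\n",
    "Message-ID: <d@example.org>\nIn-Reply-To: <lost@example.org>\n\norphan\n",
]
threads = build_threads(archive)
```

When the headers are unreliable, a real pipeline would need fallbacks such as the `References` header or subject matching; those are left out here.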
  16. Attachments
       • MIME standard defines extensions to email
       • Binary data encoded as text
       • Around 10% of messages have attachments
       • Extract attachments and store separately
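
One way to extract attachments and keep them apart from the message text is Python's standard `email` package. This is a minimal sketch, assuming well-formed MIME input; the function name and the sample message are hypothetical, and malformed boundaries in real archives may need extra handling.

```python
from email import message_from_string


def split_body_and_attachments(raw_message):
    """Return (body_text, attachments) for one raw message.

    attachments is a list of (content_type, decoded_bytes) tuples.
    """
    msg = message_from_string(raw_message)
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        if part.get_content_type() == "text/plain" and part.get_filename() is None:
            charset = part.get_content_charset() or "us-ascii"
            try:
                body_parts.append(payload.decode(charset, errors="replace"))
            except LookupError:  # unknown charset label in the header
                body_parts.append(payload.decode("latin-1"))
        else:
            attachments.append((part.get_content_type(), payload))
    return "\n".join(body_parts), attachments


# Hypothetical two-part message: plain text plus a base64-encoded attachment.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Here is the output.\n"
    "--BOUND\n"
    "Content-Type: application/octet-stream\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8=\n"
    "--BOUND--\n"
)
body, attachments = split_body_and_attachments(raw)
```

Only the decoded text reaches the text-mining step; the binary payload can be written to separate storage keyed by message.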
  19. Quotes and Signatures
       • Duplicate information
       • Unrelated to actual message
       • Removing signatures is challenging
       • Quoted text may or may not be desirable
       • Signatures impact text mining approaches
       • No perfect method for signature removal
       [Figure: the signature block from the example message]
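
The sketch below shows a deliberately simple cleaning step: it drops lines quoted with `>` and cuts the message at the conventional `-- ` signature separator. It is an illustration of why signature removal is hard rather than a solution, because signatures that do not use the separator, such as a banner framed by `=` lines, pass through untouched. The function name and sample texts are hypothetical.

```python
def strip_quotes_and_signature(body):
    """Drop '>'-quoted lines and anything after a '-- ' signature separator."""
    kept = []
    for line in body.splitlines():
        if line in ("-- ", "--"):
            break  # conventional separator; everything after it is signature
        if line.lstrip().startswith(">"):
            continue  # quoted text from an earlier message
        kept.append(line)
    return "\n".join(kept).strip()


# Hypothetical message body with quoted text and a separator-style signature.
with_separator = (
    "> If you can grab a copy and run it on your machine\n"
    "\n"
    "Here is a gzip'ed tar of the results.\n"
    "-- \n"
    "Sender Name\n"
)
cleaned = strip_quotes_and_signature(with_separator)

# A banner-style signature has no separator, so this heuristic keeps it.
with_banner = "Here are the results.\n=====\n| Finger me for my public key. |\n====="
still_noisy = strip_quotes_and_signature(with_banner)
```

Whether quoted text should be dropped at all depends on the analysis, as the slide points out.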
  20. More Risks presented in the Paper
  21. (1) Mailing Lists contain valuable information on a project.
      (2) Data Needs Pre-Processing before applying traditional tools.
      (3) Manual Data Processing is often not feasible or requires much effort.
      (4) Off-the-Shelf tools were not designed to prepare data for mining.