An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan
Queen’s University, Canada

Development Repositories

• Source code
• Communication archives
• Bug databases

The Importance of Mailing List Archives

• Email popular form of communication
• Mailing lists to distribute messages
• Messages contain valuable information
  • Discussions of source code
  • Development decisions
  • Error reports
  • User support requests

Mining the Mailing Lists of
23 Open-Source Projects

• Summarizing developer mailing lists
• Using off-the-shelf tools
• Data from around 500,000 emails
• Unexpected results from experiments

[Figure: cloud of tokens extracted from mailing list messages. Ordinary words (e.g. "things", "reasonable.", "people") are mixed with source code fragments (e.g. "palloc(bufsize);", "#include", "malloc(strlen(DataDir)"), patch markers ("+++", "diff", "+extern"), signature remnants ("SIGNATURE", "-----END", "PGP"), file paths, URLs, dates, and hexadecimal strings.]

While mining Mailing Lists of

• Don’t treat mail archives as textual data
• Changing technologies
• Up to 98% of messages contain noise

Additional processing and cleaning needed!
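
As a minimal sketch of what "not treating mail archives as textual data" can look like, the following parses a message into structured fields with Python's standard `email` package instead of tokenizing the raw text. The `RAW` sample is a simplified, hand-written variant of the example message in this deck (the `MIME-Version` and `Content-Type` lines are assumptions added for illustration), and `summarize` is a hypothetical helper name, not a tool used in the study.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Simplified sample message; headers marked below are illustrative assumptions.
RAW = """\
Date: Mon, 27 Jan 1997 12:50:44 -0500
From: "Brian E. Gallew" <geek+@cmu.edu>
Subject: Re: [HACKERS] configure
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Here is a gzip'ed tar of the results.
"""


def summarize(raw):
    """Parse one raw message and return its main fields as a dict."""
    msg = message_from_string(raw)
    # Split the From header into display name and address.
    name, address = parseaddr(msg["From"])
    # Convert the Date header into a timezone-aware datetime.
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "name": name,
        "address": address,
        "sent": sent,
        "subject": msg["Subject"],
        "body": msg.get_payload(),  # plain string for a non-multipart message
    }


if __name__ == "__main__":
    info = summarize(RAW)
    print(info["name"], info["address"], info["sent"].isoformat())
```

Headers and body are now separate values, so header text such as dates and addresses no longer leaks into word counts computed over the body.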

From geek+@cmu.edu Wed Jan 21 08:11:26 1998
Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
From: "Brian E. Gallew" <geek+@cmu.edu>
Subject: Re: [HACKERS] configure

- ---559023410-851401618-854387445=:824
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger geek@andrew.cmu.edu for my public key. |
=====================================================================

- ---559023410-851401618-854387445=:824
Content-Type: APPLICATION/x-gzip
Content-Transfer-Encoding: BASE64
Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W
UgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/Qx
ffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/
gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn

Resolving Multiple Sender Identities

• Participants send mail from different addresses
• Up to 21% of addresses are aliases
• Such aliases bias identity-based analyses
• Manual inspection and correction tedious
• No fully automated approach to resolve identities
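
One simple heuristic for finding alias candidates is to group addresses by a normalized display name. The sketch below illustrates that idea only; it is not the approach used in the study, and its output still needs the manual inspection mentioned above. The function names and the `example.org` / `example.com` addresses are hypothetical.

```python
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(name):
    """Lowercase a display name, drop punctuation, collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(cleaned.split())


def candidate_aliases(from_headers):
    """Return {normalized name: sorted addresses} for names seen with
    more than one address. These are candidates, not confirmed aliases."""
    groups = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        key = normalize_name(name)
        if key and address:
            groups[key].add(address.lower())
    return {
        name: sorted(addrs) for name, addrs in groups.items() if len(addrs) > 1
    }


if __name__ == "__main__":
    headers = [
        '"Jane Q. Doe" <jdoe@example.org>',
        "jane q doe <jane.doe@example.com>",
        "John Roe <roe@example.org>",
    ]
    print(candidate_aliases(headers))
```

Name-based grouping merges different people who share a name and misses people who change their display name, which is one reason a fully automated resolution remains out of reach.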

Reconstructing Discussion Threads

• Mail stored sequentially in archives
• Logical grouping: discussion topics
• Required information erroneous or missing
• Essential for social network and topic analysis

[Figure: messages A, B, C, and D shown twice, once as a linear sequence and once as a thread hierarchy.]
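
A minimal sketch of turning a linear sequence into a thread hierarchy, assuming each message carries an identifier and, optionally, the identifier of the message it replies to (as the `Message-ID` and `In-Reply-To` headers do). Messages whose parent is missing or unknown become thread roots, which is one simple way to cope with erroneous or missing information. The dictionary keys, function names, and the reply structure in the example are hypothetical.

```python
from collections import defaultdict


def build_threads(messages):
    """messages: list of dicts with 'id' and optional 'in_reply_to',
    given in archive (linear) order. Returns (roots, children)."""
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in known and parent != m["id"]:
            children[parent].append(m["id"])
        else:
            # Missing or unknown parent: start a new thread here.
            roots.append(m["id"])
    return roots, dict(children)


def render(nodes, children, depth=0):
    """Return the hierarchy as indented lines, two spaces per level."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node)
        lines.extend(render(children.get(node, []), children, depth + 1))
    return lines


if __name__ == "__main__":
    archive = [
        {"id": "A"},
        {"id": "B", "in_reply_to": "A"},
        {"id": "C", "in_reply_to": "A"},
        {"id": "D", "in_reply_to": "C"},
    ]
    roots, children = build_threads(archive)
    print("\n".join(render(roots, children)))
```

This sketch relies on explicit reply references only. When they are absent, real archives would need fallbacks such as subject matching, which are not shown here.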

Attachments

• MIME standard defines extensions to email
• Binary data encoded as text
• Around 10% of messages have attachments
• Extract attachments and store separately
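
Extracting attachments and keeping them apart from the message text can be sketched with Python's standard `email` package. The sample message is built in code so the example is self-contained; its subject, file name, and payload bytes are placeholders, and `split_message` is a hypothetical helper rather than the tooling used in the study.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample():
    """Create a small multipart message with one binary attachment."""
    msg = EmailMessage()
    msg["Subject"] = "Re: configure"
    msg.set_content("Here is a tar of the results.\n")
    msg.add_attachment(
        b"\x1f\x8b placeholder bytes",  # not a real gzip archive
        maintype="application",
        subtype="x-gzip",
        filename="results.tar.gz",
    )
    return msg.as_bytes()


def split_message(raw_bytes):
    """Return (plain-text body, [(filename, decoded bytes), ...])."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    text = body_part.get_content() if body_part is not None else ""
    attachments = [
        (part.get_filename(), part.get_content())
        for part in msg.iter_attachments()
    ]
    return text, attachments


if __name__ == "__main__":
    text, attachments = split_message(build_sample())
    print(repr(text))
    for name, data in attachments:
        print(name, len(data), "bytes")
```

The base64 text is decoded back to bytes and stored separately, so the encoded payload no longer appears as words in the message body.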

Quotes and Signatures
• Duplicate information
• Unrelated to actual message
• Removing signatures is challenging
• Quoted text may or may not be desirable
• Signatures impact text mining approaches
• No perfect method for signature removal

[Figure: the signature block of the example message ("Please do not shoot at the thermonuclear weapons! -- Deacon", "Finger geek@andrew.cmu.edu for my public key."), framed by lines of "=" characters.]
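
The following is a rough heuristic sketch for dropping quoted lines and a trailing signature from a message body. It assumes quotes start with ">" and that a signature begins either at the conventional "-- " delimiter or at a long ruler line like the one in the example message. Both assumptions fail on real data (a ruler line may be part of the message itself), in line with the point that no perfect method exists. The function name and patterns are illustrative.

```python
import re

SIG_DELIMITER = re.compile(r"^-- ?$")       # conventional signature separator
RULE_LINE = re.compile(r"^[=\-_*]{10,}$")   # long ruler line (heuristic)

SAMPLE_BODY = """\
> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
"""


def strip_quotes_and_signature(body, keep_quotes=False):
    """Return the body without the signature and, unless keep_quotes is
    set, without quoted lines."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if SIG_DELIMITER.match(line) or RULE_LINE.match(stripped):
            break  # treat the rest of the message as signature
        if stripped.startswith(">") and not keep_quotes:
            continue  # drop quoted text
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    print(strip_quotes_and_signature(SAMPLE_BODY))
```

The `keep_quotes` flag reflects that quoted text may or may not be desirable, depending on the analysis.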

More Risks presented
in the Paper

(1) Mailing Lists contain valuable
information on a project.

(2) Data Needs Pre-Processing before
applying traditional tools.

(3) Manual Data Processing is often not
feasible or requires much effort.

(4) Off-the-Shelf tools were not designed
to prepare data for mining.
