An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

    Presentation Transcript

    1. An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data. Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan. Queen’s University, Canada
    2. Development Repositories • Source code • Communication archives • Bug databases
    4. The Importance of Mailing List Archives • Email: popular form of communication • Mailing lists to distribute messages • Messages contain valuable information • Discussions of source code • Development decisions • Error reports • User support requests
    5. Mining the Mailing Lists of 23 Open-Source Projects • Summarizing developer mailing lists • Using off-the-shelf tools • Data from around 500,000 emails • Unexpected results from experiments
    6. [Figure: scattered terms extracted from the mailing list text. Visible terms include source code and patch fragments (palloc(bufsize);, +extern, #include, diff), PGP signature markers (-----END, SIGNATURE, PGP), file paths (pg_hba.conf, /etc/pgsql/), and dates and timestamps (Nov, Dec, 2001, 08:27:06).]
    8. While Mining the Mailing Lists of 23 Open-Source Projects • Don’t treat mail archives as textual data • Changing technologies • Up to 98% of messages contain noise • Additional processing and cleaning needed!
    10. From geek+@cmu.edu Wed Jan 21 08:11:26 1998
        Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
        From: "Brian E. Gallew" <geek+@cmu.edu>
        Subject: Re: [HACKERS] configure

        - ---559023410-851401618-854387445=:824
        Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

        > If you can grab a copy and run it on your machine, and send me
        > the output, that would help alot.

        Here is a gzip'ed tar of the results.

        =====================================================================
        | Please do not shoot at the thermonuclear weapons! -- Deacon |
        =====================================================================
        | Finger geek@andrew.cmu.edu for my public key. |
        =====================================================================

        - ---559023410-851401618-854387445=:824
        Content-Type: APPLICATION/x-gzip
        Content-Transfer-Encoding: BASE64
        Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

        H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W
        UgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/Qx
        ffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/
        gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn
    12. Resolving Multiple Sender Identities • Participants send mail from different addresses • Up to 21% of addresses are aliases • Such aliases bias identity-based analyses • Manual inspection and correction is tedious • No fully automated approach to resolve identities
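The slide stresses that no fully automated approach resolves identities, so the following is only a minimal sketch of one possible heuristic, not the authors' method: group addresses whose display names match after normalization, and hand the resulting candidates to a human for review. The function names `normalize_name` and `alias_candidates` are hypothetical, and the second `From` header in the usage lines is invented for illustration.

```python
import re
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(display_name: str) -> str:
    """Lowercase a display name and collapse punctuation and whitespace."""
    cleaned = re.sub(r"[^a-z\s]", " ", display_name.lower())
    return " ".join(cleaned.split())


def alias_candidates(from_headers):
    """Group addresses that share a normalized display name.

    Returns only names seen with more than one address. These are
    candidates for manual review, not confirmed aliases.
    """
    groups = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        key = normalize_name(name)
        if key and address:
            groups[key].add(address.lower())
    return {name: sorted(addrs) for name, addrs in groups.items() if len(addrs) > 1}


# Hypothetical input: the second header is made up for this example.
headers = [
    '"Brian E. Gallew" <geek+@cmu.edu>',
    "Brian E. Gallew <geek@andrew.cmu.edu>",
    "Someone Else <someone@example.org>",
]
candidates = alias_candidates(headers)
```

A name-based heuristic like this misses people who change their display name and can wrongly merge two people who share a name, which is one reason manual inspection remains necessary.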
    13. Reconstructing Discussion Threads • Mail stored sequentially in archives • Logical grouping: discussion topics • Required information erroneous or missing • Essential for social network and topic analysis [Figure: messages A, B, C, D shown as a linear sequence and as a thread hierarchy]
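As a rough illustration of turning the linear sequence into a thread hierarchy, the sketch below links each message to its parent through the standard `In-Reply-To` and `References` headers. It assumes those headers are present and well-formed, which the slide warns is often not the case, so any message whose parent cannot be found is simply treated as a new thread root. The helper names and the `example.org` message IDs are hypothetical.

```python
from collections import defaultdict
from email.message import Message


def parent_id(msg: Message):
    """Return the Message-ID this message replies to, or None."""
    in_reply_to = (msg.get("In-Reply-To") or "").split()
    if in_reply_to:
        return in_reply_to[0]
    # Fall back to the last entry of References, the most recent ancestor.
    references = (msg.get("References") or "").split()
    return references[-1] if references else None


def build_threads(messages):
    """Map each parent Message-ID to the IDs of its direct replies.

    Thread roots, and messages whose parent is missing from the
    archive, are listed under the key None.
    """
    known = {m["Message-ID"].strip() for m in messages if m["Message-ID"]}
    children = defaultdict(list)
    for m in messages:
        if not m["Message-ID"]:
            continue  # cannot place a message that has no ID
        parent = parent_id(m)
        if parent not in known:
            parent = None  # missing or erroneous threading header
        children[parent].append(m["Message-ID"].strip())
    return dict(children)


def make(message_id, in_reply_to=None):
    """Build a bare message carrying only threading headers (illustration only)."""
    msg = Message()
    msg["Message-ID"] = message_id
    if in_reply_to:
        msg["In-Reply-To"] = in_reply_to
    return msg


archive = [
    make("<a@example.org>"),
    make("<b@example.org>", "<a@example.org>"),
    make("<c@example.org>", "<b@example.org>"),
    make("<d@example.org>", "<lost@example.org>"),  # parent not in the archive
]
threads = build_threads(archive)
```

Here message D ends up as a separate root because its parent is absent. A real pipeline would likely need further fallbacks, such as matching normalized subject lines, to recover such cases.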
    16. Attachments • MIME standard defines extensions to email • Binary data encoded as text • Around 10% of messages have attachments • Extract attachments and store separately
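One way to extract attachments and keep them apart from the message text is to let a MIME-aware parser do the decoding. The sketch below uses Python's standard `email` package; the function name `split_message` and the sample message are assumptions made for illustration, and real archives may carry malformed boundaries or unknown character sets that need extra handling.

```python
from email import message_from_string


def split_message(raw: str):
    """Separate the readable text of a message from its attachments.

    Returns (body_text, attachments), where attachments is a list of
    (filename, content_type, decoded_bytes) tuples.
    """
    msg = message_from_string(raw)
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no payload of their own
        payload = part.get_payload(decode=True) or b""  # undoes base64 etc.
        filename = part.get_filename()
        if filename or part.get_content_maintype() != "text":
            attachments.append((filename, part.get_content_type(), payload))
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            body_parts.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name unknown to Python
            body_parts.append(payload.decode("latin-1"))
    return "\n".join(body_parts), attachments


# Hypothetical two-part message: plain text plus a base64 attachment.
raw = (
    "From: sender@example.org\n"
    "Subject: results\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Here is the file.\n"
    "--XYZ\n"
    "Content-Type: application/x-gzip\n"
    "Content-Transfer-Encoding: base64\n"
    'Content-Disposition: attachment; filename="results.tar.gz"\n'
    "\n"
    "aGVsbG8=\n"
    "--XYZ--\n"
)
body, attachments = split_message(raw)
```

Only `body` would then be passed to text mining tools, while the decoded attachment bytes can be stored separately.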
    19. Quotes and Signatures • Duplicate information • Unrelated to actual message • Removing signatures is challenging • Quoted text may or may not be desirable • Signatures impact text mining approaches • No perfect method for signature removal [Figure: the signature block from the example message]
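A common first approximation, sketched below, drops lines that start with the `>` quote marker and cuts the message at the conventional `-- ` signature delimiter. This is a deliberately simple heuristic under those two assumptions, not a method from the paper; the function name `strip_quotes_and_signature` and its `keep_quotes` flag are hypothetical.

```python
def strip_quotes_and_signature(body: str, keep_quotes: bool = False) -> str:
    """Remove quoted lines and a delimiter-marked signature from a body.

    Quoted lines are those starting with '>'. The signature is taken to
    be everything after a line consisting only of '--' (the usual '-- '
    delimiter, with trailing whitespace ignored).
    """
    kept = []
    for line in body.splitlines():
        if line.strip() == "--":
            break  # everything below the delimiter is signature
        if not keep_quotes and line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


reply = "> quoted question\nReply text.\n-- \nBrian\n"
cleaned = strip_quotes_and_signature(reply)
with_quotes = strip_quotes_and_signature(reply, keep_quotes=True)

# A boxed signature without the delimiter passes through untouched.
boxed = "Here is the file.\n=====\n| Finger me for my public key. |\n====="
unchanged = strip_quotes_and_signature(boxed)
```

The last case mirrors the example message shown earlier, whose signature is a box of `=` characters with no delimiter. The heuristic leaves it in place, which illustrates why the slide says there is no perfect method for signature removal.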
    20. More Risks Presented in the Paper
    21. (1) Mailing lists contain valuable information on a project. (2) Data needs pre-processing before traditional tools are applied. (3) Manual data processing is often not feasible or requires much effort. (4) Off-the-shelf tools were not designed to prepare data for mining.

    Nicolas Bettenburg, 2 months ago

    Talk given at the 2009 International Conference on ...
