An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data
Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan
Queen’s University, Canada




                                                                                 1
Development Repositories

SOURCE CODE    COMMUNICATION ARCHIVES    BUG DATABASES




                                      2
The Importance of Mailing List Archives

• Email popular form of communication
• Mailing lists to distribute messages
• Messages contain valuable information
    • Discussions of source code
    • Development decisions
    • Error reports
    • User support requests

                                                  4
Mining the Mailing Lists of
23 Open-Source Projects


       • Summarizing developer mailing lists
       • Using off-the-shelf tools
       • Data from around 500,000 emails
       • Unexpected results from experiments



                                               5
catter   !! things !! info !! mlw !! bool !! palloc(bufsize); !! symlinks !! configuration
! Nov !! --- !! && !! NULL, !! reasonable. !! -r !! argv, !! Added !! reasons !! A1 !! http://w

!   > !!   break; !! get   !! SIGNATURE !! -----END !! != !! symlinks. !! command !!

! char !! 1F !!    file !!   postgres   !! Dec !! 43 !! DataDir, !! pg_hba.conf !! 69 !! + SetData

 ! -- !! 093E !! stuff !! -D !! switch !! NULL !! +extern !! recursion !! admin !! setting !! 5B
path/default.conf !! see !! overides !! http://xyzzy.dhs.org/~drew/ !! + /* !! (char !! -p !! sizeof(ch

"datadir"   !! running !! PGP !! (GNU/Linux) !! "hbaconfig" !! file. !! ((opt !! 2001 !!

! bits !! simple !! databases !!   */ !!   servers   !! multiple !! /* !! share !! "A:a:B:b:c:D:d:Fh:ik:lm:M

#include !! *) !! vendors !! E3 !! people !! + { !! 08:27:06 !! 3B !! 16 !! +# !! explic
malloc(strlen(DataDir)!! +++ !! !! v1.0.6 !! having !! 56 !! /etc. !! 613-389-5481 !! extern !! case
diff !! easier !! certs !! given !! { !!

                                                                                                           6
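A token dump like the one above can arise when raw messages are split on whitespace without any cleaning. The following is a minimal sketch of that effect, not the tooling used in the study; `RAW_MESSAGE` is a made-up fragment and `naive_tokens` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical message fragment mixing a quote, prose, a patch and a PGP footer.
RAW_MESSAGE = (
    "> If you can grab a copy and run it on your machine\n"
    "Here is the patch.\n"
    "+    if (DataDir != NULL)\n"
    "+        break;\n"
    "-----END PGP SIGNATURE-----\n"
)


def naive_tokens(text):
    """Split on whitespace only, as a generic text tool might."""
    return text.split()


tokens = naive_tokens(RAW_MESSAGE)
# Anything that is not purely alphabetic is unlikely to be a useful term.
non_words = [t for t in tokens if not t.isalpha()]
print(Counter(non_words).most_common())
```

Quote markers, code fragments and signature delimiters all end up in the vocabulary next to ordinary words, which is the kind of noise the slide shows.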
[Slide 6, second build: the token dump above overlaid with additional tokens, e.g. PGDATA, postgresql, '/etc/pgsql/pg_hba.conf', pgdatadir, Debian.]
While mining Mailing Lists of
        23 Open-Source Projects


• Don’t treat mail archives as textual data
• Changing technologies
• Up to 98% of messages contain noise

 Additional processing and cleaning needed!



                                              8
From geek+@cmu.edu Wed Jan 21 08:11:26 1998
Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
From: "Brian E. Gallew" <geek+@cmu.edu>
Subject: Re: [HACKERS] configure

- ---559023410-851401618-854387445=:824
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

>      If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.



=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon       |
=====================================================================
| Finger geek@andrew.cmu.edu for my public key.                     |
=====================================================================

- ---559023410-851401618-854387445=:824
Content-Type: APPLICATION/x-gzip
Content-Transfer-Encoding: BASE64
Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W
UgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/Qx
ffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/
gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn

                                                                        9
Resolving Multiple Sender Identities


•   Participants send mail from different addresses
•   Up to 21% of addresses are aliases
•   Such aliases bias identity-based analyses
•   Manual inspection and correction tedious
•   No fully automated approach to resolve identities
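Since no fully automated approach exists, one option is to let a simple heuristic propose alias candidates and leave the final decision to manual review. The sketch below groups addresses by a normalized display name; `name_key` and `candidate_aliases` are hypothetical helpers, the second and third headers are invented for illustration, and the heuristic will both miss aliases and merge distinct people who share a name.

```python
import re
from collections import defaultdict
from email.utils import parseaddr


def name_key(display_name):
    """Lowercase the display name and keep only alphabetic words."""
    words = re.findall(r"[a-z]+", display_name.lower())
    # Drop single-letter initials so "Brian E. Gallew" matches "Brian Gallew".
    return " ".join(w for w in words if len(w) > 1)


def candidate_aliases(from_headers):
    """Group addresses whose display names normalize to the same key."""
    groups = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        key = name_key(name)
        if key and address:
            groups[key].add(address.lower())
    # Only names seen with more than one address are alias candidates.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


headers = [
    '"Brian E. Gallew" <geek+@cmu.edu>',
    "Brian Gallew <geek@andrew.cmu.edu>",   # hypothetical second header
    "Someone Else <someone@example.org>",   # hypothetical
]
aliases = candidate_aliases(headers)
print(aliases)
```

The output is a list of candidates only; each group still needs a human check before identities are merged.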




                                                        11
Reconstructing Discussion Threads

•   Mail stored sequentially in archives
•   Logical grouping: discussion topics
•   Required information erroneous or missing
•   Essential for social network and topic analysis

Linear Sequence       Thread Hierarchy
A                     A
B                       B
C                         C
D                       D
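The step from linear sequence to thread hierarchy can be sketched with the standard `Message-ID` and `In-Reply-To` headers. This is a minimal illustration under the assumption that those headers are present and correct; as the slide notes, they are often erroneous or missing, in which case this sketch simply treats the message as a new thread root. `build_threads` and `render` are hypothetical helpers and the four messages are invented.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(raw_messages):
    """Return (roots, children), both keyed by Message-ID."""
    parsed = [message_from_string(raw) for raw in raw_messages]
    ids = [
        (m["Message-ID"] or f"<no-id-{i}>").strip()
        for i, m in enumerate(parsed)
    ]
    known = set(ids)
    roots, children = [], defaultdict(list)
    for msg, msg_id in zip(parsed, ids):
        parent = (msg["In-Reply-To"] or "").strip()
        if parent in known and parent != msg_id:
            children[parent].append(msg_id)
        else:
            # Missing or unknown parent: start a new thread.
            roots.append(msg_id)
    return roots, children


def render(ids, children, depth=0):
    """Indent each message under its parent, depth-first."""
    lines = []
    for msg_id in ids:
        lines.append("  " * depth + msg_id)
        lines.extend(render(children.get(msg_id, []), children, depth + 1))
    return lines


archive = [
    "Message-ID: <A@example.org>\n\nfirst post",
    "Message-ID: <B@example.org>\nIn-Reply-To: <A@example.org>\n\nreply",
    "Message-ID: <C@example.org>\nIn-Reply-To: <B@example.org>\n\nreply",
    "Message-ID: <D@example.org>\nIn-Reply-To: <A@example.org>\n\nreply",
]
roots, children = build_threads(archive)
tree = render(roots, children)
print("\n".join(tree))
```

A more robust version would also consult the `References` header and fall back to subject matching, at the cost of more false links.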




                                                                                    12
Attachments


•   MIME standard defines extensions to email
•   Binary data encoded as text
•   Around 10% of messages have attachments
•   Extract attachments and store separately
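Extracting attachments and storing them separately can be sketched with Python's standard `email` package, which decodes the MIME structure and the base64 payloads. This is one possible approach rather than the pipeline from the study; `split_attachments` is a hypothetical helper, and the sample message, its file name and its payload bytes are invented.

```python
import email
from email.message import EmailMessage


def split_attachments(raw_bytes):
    """Return (body_text, attachments) for one raw message."""
    msg = email.message_from_bytes(raw_bytes)
    text_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True)  # undoes base64 etc.
        filename = part.get_filename()
        if filename or part.get_content_maintype() != "text":
            attachments.append((filename, part.get_content_type(), payload))
        else:
            charset = part.get_content_charset() or "us-ascii"
            text_parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(text_parts), attachments


# Build a small two-part message to exercise the function.
sample = EmailMessage()
sample["Subject"] = "Re: [HACKERS] configure"
sample.set_content("Here is a gzip'ed tar of the results.\n")
sample.add_attachment(
    b"\x1f\x8b fake gzip bytes",          # placeholder, not a real archive
    maintype="application",
    subtype="x-gzip",
    filename="results.tar.gz",            # hypothetical file name
)

body, files = split_attachments(sample.as_bytes())
print(body.strip())
print([(name, ctype, len(data)) for name, ctype, data in files])
```

Non-text parts are treated as attachments even without a file name, because older messages may carry only a `Content-Description` header.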




                                               15
Quotes and Signatures
•   Duplicate information
•   Unrelated to actual message
•   Removing signatures is challenging
•   Quoted text may or may not be desirable
•   Signatures impact text mining approaches
•   No perfect method for signature removal
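As there is no perfect method, the following is only a rough heuristic sketch: it drops lines that start with ">" and cuts the body at the conventional "-- " delimiter or at a long ASCII rule like the one in the example message. `clean_body`, the patterns and the 10-character threshold are assumptions for illustration, and a rule line inside the actual message text would truncate it wrongly.

```python
import re

SIG_DELIMITER = re.compile(r"^-- ?$")           # conventional signature separator
RULE_LINE = re.compile(r"^[=\-_*|]{10,}\s*$")   # ASCII-art rule, 10+ characters


def clean_body(text, keep_quotes=False):
    """Remove quoted lines (optionally) and everything after a signature marker."""
    kept = []
    for line in text.splitlines():
        if SIG_DELIMITER.match(line) or RULE_LINE.match(line):
            break  # treat the rest of the message as signature
        if not keep_quotes and line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


body = "\n".join([
    ">      If you can grab a copy and run it on your machine, and send me",
    "> the output, that would help alot.",
    "",
    "Here is a gzip'ed tar of the results.",
    "",
    "=" * 69,
    "| Please do not shoot at the thermonuclear weapons! -- Deacon       |",
    "=" * 69,
])
print(clean_body(body))
```

The `keep_quotes` switch reflects that quoted text may or may not be desirable, depending on the analysis.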
[Figure: the signature block from the example message, shown rotated across the slide.]
                                                                                                          18
More Risks Presented in the Paper




                        19
(1) Mailing lists contain valuable
    information on a project.


(2) Data needs pre-processing before
    applying traditional tools.


(3) Manual data processing is often not
    feasible or requires much effort.


(4) Off-the-shelf tools were not designed
    to prepare data for mining.

                                           20

More Related Content

Viewers also liked

Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Nicolas Bettenburg
 
Automatic Identification of Bug Introducing Changes
Automatic Identification of Bug Introducing ChangesAutomatic Identification of Bug Introducing Changes
Automatic Identification of Bug Introducing ChangesNicolas Bettenburg
 
Cloning Considered Harmful Considered Harmful
Cloning Considered Harmful Considered HarmfulCloning Considered Harmful Considered Harmful
Cloning Considered Harmful Considered HarmfulNicolas Bettenburg
 
Finding Paths in Large Spaces - A* and Hierarchical A*
Finding Paths in Large Spaces - A* and Hierarchical A*Finding Paths in Large Spaces - A* and Hierarchical A*
Finding Paths in Large Spaces - A* and Hierarchical A*Nicolas Bettenburg
 
Think Locally, Act Gobally - Improving Defect and Effort Prediction Models
Think Locally, Act Gobally - Improving Defect and Effort Prediction ModelsThink Locally, Act Gobally - Improving Defect and Effort Prediction Models
Think Locally, Act Gobally - Improving Defect and Effort Prediction ModelsNicolas Bettenburg
 
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...Nicolas Bettenburg
 
The Quality of Bug Reports in Eclipse ETX'07
The Quality of Bug Reports in Eclipse ETX'07The Quality of Bug Reports in Eclipse ETX'07
The Quality of Bug Reports in Eclipse ETX'07Nicolas Bettenburg
 
Duplicate Bug Reports Considered Harmful ... Really?
Duplicate Bug Reports Considered Harmful ... Really?Duplicate Bug Reports Considered Harmful ... Really?
Duplicate Bug Reports Considered Harmful ... Really?Nicolas Bettenburg
 
Computing Accuracy Precision And Recall
Computing Accuracy Precision And RecallComputing Accuracy Precision And Recall
Computing Accuracy Precision And RecallNicolas Bettenburg
 

Viewers also liked (10)

Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...
 
Automatic Identification of Bug Introducing Changes
Automatic Identification of Bug Introducing ChangesAutomatic Identification of Bug Introducing Changes
Automatic Identification of Bug Introducing Changes
 
Cloning Considered Harmful Considered Harmful
Cloning Considered Harmful Considered HarmfulCloning Considered Harmful Considered Harmful
Cloning Considered Harmful Considered Harmful
 
Finding Paths in Large Spaces - A* and Hierarchical A*
Finding Paths in Large Spaces - A* and Hierarchical A*Finding Paths in Large Spaces - A* and Hierarchical A*
Finding Paths in Large Spaces - A* and Hierarchical A*
 
Think Locally, Act Gobally - Improving Defect and Effort Prediction Models
Think Locally, Act Gobally - Improving Defect and Effort Prediction ModelsThink Locally, Act Gobally - Improving Defect and Effort Prediction Models
Think Locally, Act Gobally - Improving Defect and Effort Prediction Models
 
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
Ph.D. Dissertation - Studying the Impact of Developer Communication on the Qu...
 
The Quality of Bug Reports in Eclipse ETX'07
The Quality of Bug Reports in Eclipse ETX'07The Quality of Bug Reports in Eclipse ETX'07
The Quality of Bug Reports in Eclipse ETX'07
 
Duplicate Bug Reports Considered Harmful ... Really?
Duplicate Bug Reports Considered Harmful ... Really?Duplicate Bug Reports Considered Harmful ... Really?
Duplicate Bug Reports Considered Harmful ... Really?
 
Computing Accuracy Precision And Recall
Computing Accuracy Precision And RecallComputing Accuracy Precision And Recall
Computing Accuracy Precision And Recall
 
Fuzzy Logic in Smart Homes
Fuzzy Logic in Smart HomesFuzzy Logic in Smart Homes
Fuzzy Logic in Smart Homes
 

Similar to An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

A DevOps Perspective: MongoDB & MMF
A DevOps Perspective: MongoDB & MMFA DevOps Perspective: MongoDB & MMF
A DevOps Perspective: MongoDB & MMFMapMyFitness
 
Crunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesCrunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesSergii Khomenko
 
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享elevenma
 
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...CODE BLUE
 
What happens when firefox crashes?
What happens when firefox crashes?What happens when firefox crashes?
What happens when firefox crashes?Erik Rose
 
About Multiblock Reads v4
About Multiblock Reads v4About Multiblock Reads v4
About Multiblock Reads v4Enkitec
 
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc   bigo meetup rolling hashesCloud flare jgc   bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashesCloudflare
 
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"Yandex
 
Groovy 1.8 and 2.0 at GR8Conf Europe 2012
Groovy 1.8 and 2.0 at GR8Conf Europe 2012Groovy 1.8 and 2.0 at GR8Conf Europe 2012
Groovy 1.8 and 2.0 at GR8Conf Europe 2012Guillaume Laforge
 
An Introduction to Go
An Introduction to GoAn Introduction to Go
An Introduction to GoCloudflare
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Chris Fregly
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboyKenneth Geisshirt
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
NGS Informatics and Interpretation - Hardware Considerations by Michael McManus
NGS Informatics and Interpretation - Hardware Considerations by Michael McManusNGS Informatics and Interpretation - Hardware Considerations by Michael McManus
NGS Informatics and Interpretation - Hardware Considerations by Michael McManusKnome_Inc
 

Similar to An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data (20)

A DevOps Perspective: MongoDB & MMF
A DevOps Perspective: MongoDB & MMFA DevOps Perspective: MongoDB & MMF
A DevOps Perspective: MongoDB & MMF
 
Crunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-casesCrunching data with go: Tips, tricks, use-cases
Crunching data with go: Tips, tricks, use-cases
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
 
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
CODE BLUE 2014 : マイクロソフトの脆弱性調査 : ベンダーでありながら発見者となるために by デイヴィッド・シードマン David Se...
 
Brasil Ross 2011
Brasil Ross 2011Brasil Ross 2011
Brasil Ross 2011
 
What happens when firefox crashes?
What happens when firefox crashes?What happens when firefox crashes?
What happens when firefox crashes?
 
About Multiblock Reads v4
About Multiblock Reads v4About Multiblock Reads v4
About Multiblock Reads v4
 
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc   bigo meetup rolling hashesCloud flare jgc   bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
 
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
Дмитрий Щадей "Что помогает нам писать качественный JavaScript-код?"
 
Groovy 1.8 and 2.0 at GR8Conf Europe 2012
Groovy 1.8 and 2.0 at GR8Conf Europe 2012Groovy 1.8 and 2.0 at GR8Conf Europe 2012
Groovy 1.8 and 2.0 at GR8Conf Europe 2012
 
An Introduction to Go
An Introduction to GoAn Introduction to Go
An Introduction to Go
 
20091203gemini
20091203gemini20091203gemini
20091203gemini
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015
 
Unleash your inner console cowboy
Unleash your inner console cowboyUnleash your inner console cowboy
Unleash your inner console cowboy
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
NGS Informatics and Interpretation - Hardware Considerations by Michael McManus
NGS Informatics and Interpretation - Hardware Considerations by Michael McManusNGS Informatics and Interpretation - Hardware Considerations by Michael McManus
NGS Informatics and Interpretation - Hardware Considerations by Michael McManus
 
PowerDNS Webinar
PowerDNS Webinar PowerDNS Webinar
PowerDNS Webinar
 

More from Nicolas Bettenburg

10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...Nicolas Bettenburg
 
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source CodeUsing Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source CodeNicolas Bettenburg
 
Managing Community Contributions: Lessons Learned from a Case Study on Andro...
Managing Community Contributions:  Lessons Learned from a Case Study on Andro...Managing Community Contributions:  Lessons Learned from a Case Study on Andro...
Managing Community Contributions: Lessons Learned from a Case Study on Andro...Nicolas Bettenburg
 
Predictors of Customer Perceived Quality
Predictors of Customer Perceived QualityPredictors of Customer Perceived Quality
Predictors of Customer Perceived QualityNicolas Bettenburg
 
Extracting Structural Information from Bug Reports.
Extracting Structural Information from Bug Reports.Extracting Structural Information from Bug Reports.
Extracting Structural Information from Bug Reports.Nicolas Bettenburg
 

More from Nicolas Bettenburg (7)

10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
10 Year Impact Award Presentation - Duplicate Bug Reports Considered Harmful ...
 
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source CodeUsing Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
Using Fuzzy Code Search to Link Code Fragments in Discussions to Source Code
 
Managing Community Contributions: Lessons Learned from a Case Study on Andro...
Managing Community Contributions:  Lessons Learned from a Case Study on Andro...Managing Community Contributions:  Lessons Learned from a Case Study on Andro...
Managing Community Contributions: Lessons Learned from a Case Study on Andro...
 
Approximation Algorithms
Approximation AlgorithmsApproximation Algorithms
Approximation Algorithms
 
Predictors of Customer Perceived Quality
Predictors of Customer Perceived QualityPredictors of Customer Perceived Quality
Predictors of Customer Perceived Quality
 
Extracting Structural Information from Bug Reports.
Extracting Structural Information from Bug Reports.Extracting Structural Information from Bug Reports.
Extracting Structural Information from Bug Reports.
 
Metropolis Instant Radiosity
Metropolis Instant RadiosityMetropolis Instant Radiosity
Metropolis Instant Radiosity
 

An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

  • 1. An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data. Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan, Queen’s University, Canada
  • 2. Development Repositories • Source code • Communication archives • Bug databases
  • 4. The Importance of Mailing List Archives • Email popular form of communication • Mailing lists to distribute messages • Messages contain valuable information • Discussions of source code • Development decisions • Error reports • User support requests
  • 5. Mining the Mailing Lists of 23 Open-Source Projects • Summarizing developer mailing lists • Using off-the-shelf tools • Data from around 500,000 emails • Unexpected results from experiments
  • 6. [Figure: cloud of terms extracted from mailing list messages, dominated by noise such as code fragments (palloc(bufsize);, #include, malloc(strlen(DataDir)), diff markers (+++, +extern), PGP signature residue (SIGNATURE, -----END, PGP), dates, hex strings, and URLs]
  • 7. [Figure: the term cloud from the previous slide, overlaid with further terms from the same discussion (e.g. PGDATA, postgresql, '/etc/pgsql/pg_hba.conf', Debian, packager)]
  • 9. While mining Mailing Lists of 23 Open-Source Projects • Don’t treat mail archives as textual data • Changing technologies • Up to 98% of messages contain noise • Additional processing and cleaning needed!
  • 10. From geek+@cmu.edu Wed Jan 21 08:11:26 1998
    Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
    From: "Brian E. Gallew" <geek+@cmu.edu>
    Subject: Re: [HACKERS] configure

    - ---559023410-851401618-854387445=:824
    Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

    > If you can grab a copy and run it on your machine, and send me
    > the output, that would help alot.

    Here is a gzip'ed tar of the results.

    =====================================================================
    | Please do not shoot at the thermonuclear weapons! -- Deacon |
    =====================================================================
    | Finger geek@andrew.cmu.edu for my public key. |
    =====================================================================
    - ---559023410-851401618-854387445=:824
    Content-Type: APPLICATION/x-gzip
    Content-Transfer-Encoding: BASE64
    Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

    H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W
    UgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/Qx
    ffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/
    gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn
  • 12. Resolving Multiple Sender Identities • Participants send mail from different addresses • Up to 21% of addresses are aliases • Such aliases bias identity-based analyses • Manual inspection and correction is tedious • No fully automated approach to resolve identities
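
Because the slide notes that no fully automated approach exists, the following is only a minimal sketch of a heuristic that proposes candidate aliases for manual review. It assumes that addresses sharing a normalized display name may belong to one person. The function name `candidate_aliases` and the sample headers are hypothetical; only the first header is taken from the example message in this deck.

```python
from collections import defaultdict
from email.utils import parseaddr


def candidate_aliases(from_headers):
    """Group addresses that share a normalized display name.

    Returns {normalized name: sorted addresses} for names seen with
    two or more distinct addresses. These are candidates only.
    """
    by_name = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        # Normalize: lowercase, drop dots after initials, collapse whitespace.
        key = " ".join(name.lower().replace(".", " ").split())
        if key and address:
            by_name[key].add(address.lower())
    return {k: sorted(v) for k, v in by_name.items() if len(v) > 1}


# Illustrative input; the second and third headers are made up.
headers = [
    '"Brian E. Gallew" <geek+@cmu.edu>',
    "Brian E Gallew <geek@andrew.cmu.edu>",
    "Someone Else <someone@example.org>",
]
aliases = candidate_aliases(headers)
```

A name-based grouping like this will both miss aliases (different spellings, missing display names) and merge distinct people who share a name, so its output would still need the manual inspection the slide describes.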
  • 13. Reconstructing Discussion Threads • Mail stored sequentially in archives • Logical grouping: discussion topics • Required information erroneous or missing • Essential for social network and topic analysis [Figure: messages A, B, C, D shown as a linear sequence and as a thread hierarchy]
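
One common way to turn the linear sequence into a thread hierarchy is to follow the `In-Reply-To` and `References` headers. The sketch below is an illustration under that assumption, not the method used in the study. The function name `build_threads` and the sample messages are hypothetical. Messages whose parent header is missing or points to an unknown message are treated as thread roots, which is exactly where the "erroneous or missing" information mentioned on the slide shows up.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(messages):
    """Map each parent Message-ID to the Message-IDs of its direct replies.

    `messages` must be a list of parsed messages (it is iterated twice).
    Messages with no resolvable parent are filed under the key None.
    """
    known_ids = {m["Message-ID"].strip() for m in messages if m["Message-ID"]}
    children = defaultdict(list)
    for m in messages:
        msg_id = (m["Message-ID"] or "").strip()
        if not msg_id:
            continue  # without an ID the message cannot be placed
        # Prefer In-Reply-To, then fall back to References (newest first).
        candidates = (m["In-Reply-To"] or "").split()
        candidates += (m["References"] or "").split()[::-1]
        parent = next(
            (c for c in candidates if c in known_ids and c != msg_id), None
        )
        children[parent].append(msg_id)
    return dict(children)


# Illustrative messages with made-up IDs.
raw = [
    "Message-ID: <a@x>\nSubject: configure\n\nfirst message",
    "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nSubject: Re: configure\n\nreply",
    "Message-ID: <c@x>\nReferences: <a@x> <b@x>\nSubject: Re: configure\n\nreply 2",
    "Message-ID: <d@x>\nIn-Reply-To: <missing@x>\nSubject: Re: other\n\norphan",
]
threads = build_threads([message_from_string(r) for r in raw])
```

Here message `<d@x>` ends up as a root because its parent is not in the archive. A fuller approach might also compare normalized subject lines, at the cost of false matches.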
  • 16. Attachments • MIME standard defines extensions to email • Binary data encoded as text • Around 10% of messages have attachments • Extract attachments and store separately
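
The "extract attachments and store separately" step can be sketched with Python's standard `email` package, which decodes base64 and quoted-printable parts. This is a minimal illustration that assumes well-formed MIME; the function name `split_attachments`, the file name `results.tar.gz`, and the attachment bytes are hypothetical, and older archive messages with broken boundaries would need extra handling.

```python
from email import message_from_bytes
from email.message import EmailMessage


def split_attachments(msg):
    """Return (body_text, attachments) for a parsed message.

    attachments is a list of (content_type, label, raw_bytes) tuples.
    """
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes the Content-Transfer-Encoding (e.g. base64).
        payload = part.get_payload(decode=True) or b""
        if part.get_content_type() == "text/plain" and part.get_filename() is None:
            charset = part.get_content_charset() or "us-ascii"
            body_parts.append(payload.decode(charset, errors="replace"))
        else:
            label = part.get_filename() or part.get("Content-Description", "")
            attachments.append((part.get_content_type(), label, payload))
    return "\n".join(body_parts), attachments


# Build a small two-part message to exercise the function.
sample = EmailMessage()
sample["Subject"] = "Re: [HACKERS] configure"
sample.set_content("Here is a gzip'ed tar of the results.\n")
sample.add_attachment(
    b"\x1f\x8b fake gzip bytes",
    maintype="application",
    subtype="x-gzip",
    filename="results.tar.gz",
)
body, files = split_attachments(message_from_bytes(sample.as_bytes()))
```

After this step only `body` would be passed on to text mining tools, while `files` could be written to disk or indexed separately.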
  • 19. Quotes and Signatures • Duplicate information • Unrelated to actual message • Removing signatures is challenging • Quoted text may or may not be desirable • Signatures impact text mining approaches • No perfect method for signature removal [Figure: the boxed signature from the example message]
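
A simple heuristic, sketched below, drops lines quoted with `>` and cuts the message at the conventional `-- ` signature separator. It is an assumption-laden illustration, not the approach from the paper: the function name `strip_quotes_and_signature` is hypothetical, and the sample adds a `-- ` separator that the original example message does not contain.

```python
def strip_quotes_and_signature(body, keep_quotes=False):
    """Drop '>'-quoted lines and everything after a '-- ' separator line."""
    kept = []
    for line in body.splitlines():
        if line in ("--", "-- "):
            break  # conventional signature separator: the rest is signature
        if line.lstrip().startswith(">") and not keep_quotes:
            continue  # quoted text from an earlier message
        kept.append(line)
    return "\n".join(kept).strip()


# Sample adapted from the example message; the "-- " line is an assumption.
sample = (
    "> If you can grab a copy and run it on your machine, and send me\n"
    "> the output, that would help alot.\n"
    "\n"
    "Here is a gzip'ed tar of the results.\n"
    "-- \n"
    "Finger geek@andrew.cmu.edu for my public key.\n"
)
cleaned = strip_quotes_and_signature(sample)
```

The boxed `=====` signature in the actual example has no such separator, so this heuristic would leave it in place, which illustrates why the slide says there is no perfect method. The `keep_quotes` flag reflects that quoted text may or may not be wanted, depending on the analysis.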
  • 20. More risks presented in the paper
  • 21. (1) Mailing lists contain valuable information on a project. (2) Data needs pre-processing before applying traditional tools. (3) Manual data processing is often not feasible or requires much effort. (4) Off-the-shelf tools were not designed to prepare data for mining.