MUD 2010
   Workshop on Mining Unstructured Data

Nicolas Bettenburg
Bram Adams
SOFTWARE ANALYSIS & INTELLIGENCE LAB
http://sailhome.cs.queensu.ca/mud/
                                                           1
Unstructured Data?




               2
EXAMPLE OF STRUCTURED DATA
<bug>
  <bug_id>45411</bug_id>
  <creation_ts>2000-07-13 13:46:00 -0700</creation_ts>
  <short_desc>Drag, hover over tab should open tab</short_desc>
  <delta_ts>2009-12-04 13:03:48 -0800</delta_ts>
  <reporter_accessible>1</reporter_accessible>
  <cclist_accessible>1</cclist_accessible>
  <classification_id>2</classification_id>
  <classification>Client Software</classification>
  <product>SeaMonkey</product>
  <component>Tabbed Browser</component>
  <version>Trunk</version>
  <rep_platform>All</rep_platform>
  <op_sys>All</op_sys>
  <bug_status>RESOLVED</bug_status>
  <resolution>WONTFIX</resolution>
  <priority>--</priority>
  <bug_severity>enhancement</bug_severity>
  <target_milestone>---</target_milestone>
  <blocked>121292</blocked>
  ...
</bug>
                                                                  3
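Because every field in such a record has a name, a program can read it without guessing. A minimal sketch in Python, assuming an abridged copy of the record above (the elided fields are left out) and a hypothetical helper named bug_fields:

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record on the slide; the fields elided there stay elided.
BUG_XML = """
<bug>
  <bug_id>45411</bug_id>
  <short_desc>Drag, hover over tab should open tab</short_desc>
  <product>SeaMonkey</product>
  <component>Tabbed Browser</component>
  <bug_status>RESOLVED</bug_status>
  <resolution>WONTFIX</resolution>
</bug>
"""


def bug_fields(xml_text):
    """Return a dict mapping each child tag of <bug> to its text (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "") for child in root}


fields = bug_fields(BUG_XML)
print(fields["bug_id"], fields["bug_status"], fields["resolution"])
```

Each value is addressed by its tag, so no heuristics are needed to find the status or the resolution.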
So What?
EXAMPLES OF UNSTRUCTURED DATA

   web-sites, diagrams, requirements documents,
   social media, documentation, help files, IRC chat,
   source code comments, bug reports, captchas,
   commit logs, email, system logs
                                                    4
SE data without explicit format




COMPLEXITY   DIVERSITY   IMPERFECTION


                                        5
Unstructured Data is
        COMPLEX ...

  natural language
  rich semantics
  no authoritative formats

  S10000: The SQLite library shall translate high-level SQL statements
  into low-level I/O calls to persistent storage.

  The essential task of every SQL database engine is to translate
  human-readable SQL statements into sequences of I/O operations.

  Bonjour,
  ces deux problèmes sont reliés. En effet, les paquets Ubuntu ne
  comportent pas les dépendances (e.g. libpng, libjpeg, libglew, ...).
  Si Tulip ne peut afficher les fichiers PNG, c'est sans doute car le
  paquet libpng est manquant sur le système. Nous travaillons à ajouter
  les dépendances sur les paquets, mais ceci n'arrivera probablement
  pas avant Tulip 3.5.
  Cordialement,
  Charles.

  [English: Hello, these two problems are related. The Ubuntu packages
  do not include the dependencies (e.g. libpng, libjpeg, libglew, ...).
  If Tulip cannot display PNG files, it is probably because the libpng
  package is missing on the system. We are working on adding the
  dependencies to the packages, but this will probably not happen
  before Tulip 3.5. Regards, Charles.]
                                                                                                    6
... AND DIVERSE (Eclipse #150222)
In this report, you have defined a parameter named blocksize,
which is given a value of "7|D|1|D". In open script of data set,
there are below lines code:

<script begin>
token=Packages.java.util.StringTokenizer(params["blocksize"],"|");
vec=new Packages.java.util.Vector();
while(token.hasMoreTokens()){
   vec.addElement(token.nextToken());
}
params["DateRange"]=java.lang.Integer.parseInt(vec.elementAt(0));
</script end>

Since the value of params["blocksize"] is "7|D|1|D", vec.elementAt(0)
is "7", and then it can not be parsed to int value. In 1.0.1,
the value of params["blocksize"] might be 7|D|1|D, so it can be
parsed to int value of 7.

                                                                     7
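A report like this mixes prose and source code in one text field. A minimal sketch in Python of pulling the two apart, assuming the ad hoc "<script begin>" and "</script end>" markers this particular reporter used (they are not a standard, and other reports will need other heuristics); split_report is a hypothetical helper name:

```python
import re

# Assumption: code is delimited by the reporter's own ad hoc markers.
SCRIPT_BLOCK = re.compile(r"<script begin>\s*(.*?)\s*</script end>", re.DOTALL)


def split_report(text):
    """Return (prose, code_blocks) for a free-text bug description."""
    code_blocks = SCRIPT_BLOCK.findall(text)
    # Remove the code, then collapse the remaining whitespace into single spaces.
    prose = " ".join(SCRIPT_BLOCK.sub(" ", text).split())
    return prose, code_blocks


report = """In open script of data set, there are below lines code:

<script begin>
vec=new Packages.java.util.Vector();
</script end>

Since the value of params["blocksize"] is "7|D|1|D", vec.elementAt(0) is "7"."""

prose, code = split_report(report)
print(code)
print(prose)
```

The sketch only works because this reporter happened to mark the code; nothing in the bug tracker enforces such markers.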
... AND IMPERFECT
From: john.doe@gmail.com
To: devlist@sourceforge.net
Subject: BSOD WTF!!??!!

Hi devs,

found a bug in JDBC-RPC very badass lol. OMG can't believe you missed
that. I get a bsod after pw, pls fix :'(

JD $$$

  inconsistency
  ambiguity
  incorrect
  informal language
                                                 8
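Even an e-mail like this is only partly unstructured: the header lines follow a fixed format, while the body is free text. A minimal sketch in Python using the standard email module, assuming the message text as reconstructed from the slide (body abridged); split_email is a hypothetical helper name:

```python
from email import message_from_string

# Message reconstructed from the slide; the body is abridged for the example.
RAW = (
    "From: john.doe@gmail.com\n"
    "To: devlist@sourceforge.net\n"
    "Subject: BSOD WTF!!??!!\n"
    "\n"
    "Hi devs,\n"
    "found a bug in JDBC-RPC very badass lol.\n"
)


def split_email(raw):
    """Return (headers, body): the structured part and the free-text part."""
    msg = message_from_string(raw)
    headers = {name: msg[name] for name in ("From", "To", "Subject")}
    return headers, msg.get_payload()


headers, body = split_email(RAW)
print(headers["From"], "|", headers["Subject"])
print(body)
```

The headers come out cleanly, but what the bug is, where it occurs, and how severe it is all remain buried in the informal body text.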
So What?
EXAMPLES OF UNSTRUCTURED DATA

   web-sites, diagrams, requirements documents,
   social media, documentation, help files, IRC chat,
   source code comments, bug reports, captchas,
   commit logs, email, system logs
                                                    9
