Mud flash

  1. MUD 2010: Workshop on Mining Unstructured Data. Nicolas Bettenburg, Bram Adams. Software Analysis & Intelligence Lab. http://sailhome.cs.queensu.ca/mud/
  2. Unstructured Data?
  3. EXAMPLE OF STRUCTURED DATA

     <bug>
       <bug_id>45411</bug_id>
       <creation_ts>2000-07-13 13:46:00 -0700</creation_ts>
       <short_desc>Drag, hover over tab should open tab</short_desc>
       <delta_ts>2009-12-04 13:03:48 -0800</delta_ts>
       <reporter_accessible>1</reporter_accessible>
       <cclist_accessible>1</cclist_accessible>
       <classification_id>2</classification_id>
       <classification>Client Software</classification>
       <product>SeaMonkey</product>
       <component>Tabbed Browser</component>
       <version>Trunk</version>
       <rep_platform>All</rep_platform>
       <op_sys>All</op_sys>
       <bug_status>RESOLVED</bug_status>
       <resolution>WONTFIX</resolution>
       <priority>--</priority>
       <bug_severity>enhancement</bug_severity>
       <target_milestone>---</target_milestone>
       <blocked>121292</blocked>
       ...
     </bug>
  4. So What? EXAMPLES OF UNSTRUCTURED DATA: web sites, diagrams, requirements documents, social media, documentation, help files, IRC chat, source code, comments, bug reports, captchas, commit logs, email, system logs
  5. SE data without explicit format: COMPLEXITY, DIVERSITY, IMPERFECTION
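  6. Unstructured Data is COMPLEX ... (natural language, rich semantics, no authoritative formats)

     Requirements text: "S10000: The SQLite library shall translate high-level SQL statements into low-level I/O calls to persistent storage. The essential task of every SQL database engine is to translate human-readable SQL statements into sequences of I/O operations."

     Mailing-list reply (French): "Bonjour, En effet, ces deux problèmes sont reliés. Les paquets Ubuntu ne comportent pas les dépendances (e.g. libpng, libjpeg, libglew, ...). Si Tulip ne peut afficher les fichiers PNG, c'est sans doute car le paquet libpng est manquant sur le système. Nous travaillons à ajouter les dépendances sur les paquets, mais ceci n'arrivera probablement pas avant Tulip 3.5. Cordialement, Charles."

     [English: "Hello, Indeed, these two problems are related. The Ubuntu packages do not include the dependencies (e.g. libpng, libjpeg, libglew, ...). If Tulip cannot display PNG files, it is probably because the libpng package is missing from the system. We are working on adding the dependencies to the packages, but this will probably not happen before Tulip 3.5. Regards, Charles."]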
  7. ... AND DIVERSE (Eclipse #150222)

     In this report, you have defined a parameter named blocksize, which is given a value of "7|D|1|D". In open script of data set, there are below lines code:

     <script begin>
     token=Packages.java.util.StringTokenizer(params["blocksize"],"|");
     vec=new Packages.java.util.Vector();
     while(token.hasMoreTokens()){
       vec.addElement(token.nextToken());
     }
     params["DateRange"]=java.lang.Integer.parseInt(vec.elementAt(0));
     </script end>

     Since the value of params["blocksize"] is "7|D|1|D", vec.elementAt(0) is "7", and then it can not be parsed to int value. In 1.0.1, the value of params["blocksize"] might be 7|D|1|D, so it can be parsed to int value of 7.
  8. ... AND IMPERFECT (inconsistency, ambiguity, incorrect, informal language)

     From: john.doe@gmail.com
     To: devlist@sourceforge.net
     Subject: BSOD WTF!!??!!

     Hi devs,
     found a bug in JDBC-RPC [...] very badass [...]. OMG can't believe you missed that. I get a bsod after [...] pw, pls fix :'(
     JD $$$
  9. So What? EXAMPLES OF UNSTRUCTURED DATA: web sites, diagrams, requirements documents, social media, documentation, help files, IRC chat, source code, comments, bug reports, captchas, commit logs, email, system logs
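
To make the contrast with slide 3 concrete, here is a minimal Python sketch of why structured data is easy to mine: every field of the bug record can be read by tag name, with no guessing about where a value starts or ends. The XML below is an abridged copy of the slide 3 record; the helper name `bug_fields` is hypothetical and chosen only for illustration, not taken from the talk.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the bug record shown on slide 3 (a subset of its fields).
BUG_XML = """
<bug>
  <bug_id>45411</bug_id>
  <short_desc>Drag, hover over tab should open tab</short_desc>
  <product>SeaMonkey</product>
  <component>Tabbed Browser</component>
  <bug_status>RESOLVED</bug_status>
  <resolution>WONTFIX</resolution>
</bug>
"""


def bug_fields(xml_text):
    """Return a dict mapping each child tag of <bug> to its text content.

    Hypothetical helper: it assumes a flat record with unique tag names,
    which holds for the abridged example above.
    """
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


fields = bug_fields(BUG_XML)
print(fields["bug_id"], fields["product"], fields["resolution"])
```

Nothing comparable works for the email, chat, or requirements snippets on the later slides, because they carry no explicit format that names their fields.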
