Mud flash

MUD 2010
Workshop on Mining Unstructured Data

Nicolas Bettenburg
Bram Adams
SOFTWARE ANALYSIS & INTELLIGENCE LAB
http://sailhome.cs.queensu.ca/mud/

Unstructured Data?

EXAMPLE OF STRUCTURED DATA
<bug>
<bug_id>45411</bug_id>
<creation_ts>2000-07-13 13:46:00 -0700</creation_ts>
<short_desc>Drag, hover over tab should open tab</short_desc>
<delta_ts>2009-12-04 13:03:48 -0800</delta_ts>
<reporter_accessible>1</reporter_accessible>
<cclist_accessible>1</cclist_accessible>
<classification_id>2</classification_id>
<classification>Client Software</classification>
<product>SeaMonkey</product>
<component>Tabbed Browser</component>
<version>Trunk</version>
<rep_platform>All</rep_platform>
<op_sys>All</op_sys>
<bug_status>RESOLVED</bug_status>
<resolution>WONTFIX</resolution>
<priority>--</priority>
<bug_severity>enhancement</bug_severity>
<target_milestone>---</target_milestone>
<blocked>121292</blocked>
...
</bug>
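
Because every field in a record like this carries a named tag, reading it back takes only a few lines. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and a shortened copy of the record above (the elided fields stay elided). The helper name `bug_fields` is hypothetical and only serves the illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record shown above; only a few fields are reproduced.
BUG_XML = """<bug>
<bug_id>45411</bug_id>
<short_desc>Drag, hover over tab should open tab</short_desc>
<product>SeaMonkey</product>
<component>Tabbed Browser</component>
<bug_status>RESOLVED</bug_status>
<resolution>WONTFIX</resolution>
</bug>"""


def bug_fields(xml_text):
    """Return a dict mapping each child tag of <bug> to its text (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


fields = bug_fields(BUG_XML)
print(fields["bug_id"], fields["bug_status"], fields["resolution"])
```

No comparable lookup exists for the free-text examples on the following slides, which is the contrast the deck goes on to draw.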

So What?
EXAMPLES OF UNSTRUCTURED DATA

web-sites
diagrams
requirements documents
social media
documentation
help files
IRC chat
source code comments
bug reports
captchas
commit logs
email
system logs

SE data without explicit format

COMPLEXITY DIVERSITY IMPERFECTION


Unstructured Data is
COMPLEX ...

natural language
rich semantics
no authoritative formats

S10000: The SQLite library shall translate high-level SQL statements into low-level I/O calls to persistent storage.

The essential task of every SQL database engine is to translate human-readable SQL statements into sequences of I/O operations.

Bonjour,
ces deux problèmes sont reliés. En effet, les paquets Ubuntu ne comportent pas les dépendances (e.g. libpng, libjpeg, libglew, ...). Si Tulip ne peut afficher les fichiers PNG, c'est sans doute car le paquet libpng est manquant sur le système. Nous travaillons à ajouter les dépendances sur les paquets, mais ceci n'arrivera probablement pas avant Tulip 3.5.
Cordialement,
Charles.

(In English: Hello, these two problems are related. The Ubuntu packages do not include the dependencies (e.g. libpng, libjpeg, libglew, ...). If Tulip cannot display PNG files, it is probably because the libpng package is missing on the system. We are working on adding the dependencies to the packages, but this will probably not happen before Tulip 3.5. Regards, Charles.)

... AND DIVERSE
Eclipse #150222

In this report, you have defined a parameter named blocksize,
which is given a value of "7|D|1|D". In open script of data set,
there are below lines code:

<script begin>
token=Packages.java.util.StringTokenizer(params["blocksize"],"|");
vec=new Packages.java.util.Vector();
while(token.hasMoreTokens()){
vec.addElement(token.nextToken());
}
params["DateRange"]=java.lang.Integer.parseInt(vec.elementAt(0));
</script end>

Since the value of params["blocksize"] is "7|D|1|D", vec.elementAt(0)
is "7", and then it can not be parsed to int value. In 1.0.1,
the value of params["blocksize"] might be 7|D|1|D, so it can be
parsed to int value of 7.
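
The reasoning in the report above can be replayed in a few lines. This is a sketch in Python rather than the report's own script, so `int()` stands in for `java.lang.Integer.parseInt`, `str.split` stands in for `StringTokenizer`, and the helper name `first_block` is hypothetical. It assumes, as the report's wording suggests, that the failing parameter value contains the literal double quotes.

```python
def first_block(blocksize):
    """Split a blocksize parameter on '|' and parse the first token as an int."""
    tokens = blocksize.split("|")
    return int(tokens[0])


# Value without surrounding quotes: the first token is "7" and parses cleanly.
print(first_block("7|D|1|D"))

# Value with literal quotes: the first token is '"7', which is not a number,
# so the conversion raises ValueError instead of returning 7.
try:
    first_block('"7|D|1|D"')
except ValueError as err:
    print("cannot parse:", err)
```

Recovering this explanation automatically means separating the script from the surrounding prose first, which is exactly the kind of mixed content the slide highlights.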


... AND IMPERFECT

inconsistency
ambiguity
incorrect informal language

From: john.doe@gmail.com
To: devlist@sourceforge.net
Subject: BSOD WTF!!??!!

Hi devs,

found a bug in JDBC-RPC very badass lol. OMG can’t
believe you missed that. I get a bsod after
pw, pls fix :'(

JD $$$

