Text Mining and SEASR


1. Introduction to SEASR and Text Mining
UIUC/NCSA, Feb 4, 2009
Loretta Auvil
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

2. The SEASR Picture

3. SEASR: Reach + Relevance + Reuse + Repeatability

SEASR emphasizes flexibility, scalability, and modularity, and provides a community hub and access to heterogeneous data and computational systems:
– Semantic-driven environment for SOA interoperability
– Encourages sharing and participation for building communities
– Modular construction allows flows to be modified and configured to encourage reusability within and across domains
– Enables a mashup and integration of tools
– Data-intensive flows can be executed on a simple desktop or a large cluster without modification
– Computation can be created for distributed execution on servers where the content lives
– User accessibility to control trust and compliance with the required copyright license of content
– Relies on the standardized Resource Description Framework (RDF) to define components and flows

4. Knowledge Discovery in Data

5. Workbench
• Web-based UI
• Components and flows are retrieved from the server
• Additional locations of components and flows can be added to the server
• Create a flow using a graphical drag-and-drop interface
• Change property values
• Execute the flow

6. Community Hub

7. SEASR @ Work – Zotero
• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics
  – Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
  – Zotero Export to Fedora through SEASR
  – Saves results from SEASR Analytics to a Collection
• Launch MONK Processing
  – MONK DB Ingestion Workflow

8. SEASR @ Work – Fedora
Interactive Web Application
Web Service

9. SEASR @ Work – Entity Mash-up
• Entity Extraction with OpenNLP
• Locations viewed on Google Map
• Dates viewed on Simile Timeline

10. SEASR @ Work – Audio Analysis
• NEMA: Executes a SEASR flow for each run
  – Loads audio data
  – Extracts features for every 10 sec moving window of audio
  – Loads and applies the models
  – Sends results back to the WebUI
• NESTER: Annotation of Audio via Spectral Analysis

11. SEASR @ Work – MONK
Executes flows for each analysis requested
  – Predictive modeling using Naïve Bayes
  – Predictive modeling using Support Vector Machines (SVM)

12. SEASR @ Work – DISCUS
• On-demand usage of analytics while surfing
  – While navigating, request analytics to be performed on the page
  – Text extraction and cleaning
• Summarization and keyword extraction
  – List the important terms on the page being analyzed
  – Provide relevant short summaries
• Visual maps
  – Provide a visual representation of the key concepts
  – Show the graph of relations between concepts

13. SEASR and UIMA: Emotion Tracking
The goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)

14. SEASR Text Analytics Goals
Address scholarly text analytics needs by:
• Efficiently managing distributed literary and historical textual assets
• Structuring extracted information to facilitate knowledge discovery
• Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich to support question answering
• Devising a representation for the extracted information that can be efficiently reasoned over to recover data in the question-answering process
• Devising algorithms for question answering and inference
• Developing a UI for effective visual knowledge discovery that separates query logic from application logic
• Leveraging existing approaches and devising algorithms for clustering, inference, and Q&A
• Developing an interaction UI for effective visual data exploration
• Enabling text analytics through SEASR components

15. The Zotero Picture
(Diagram: The WEB → Zotero → Store)

16. The Zotero + SEASR Picture
(Diagram: The WEB → Zotero → Store)

17. Your Zotero Collection

18. The SEASR Analytics

19. The Value Added

20. Some Examples
• Authorship Analysis (JUNG network importance algorithms to rank the authors in the citation network)
• Author Centrality Analysis
  – Uses Betweenness Centrality, which ranks each author in the coauthor graph by the number of shortest paths that pass through them
• Author Degree Analysis
  – Uses AuthorDegreeDistributionAnalysis, which ranks each author by the number of coauthors
• Author HITS Analysis
  – The *hubness* of a node is the degree to which a node links to other important authorities. The *authoritativeness* of a node is the degree to which a node is pointed to by important hubs.
• Readability
  – Flesch-Kincaid readability test (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)

21. SEASR Flow

22. Text Mining Definition
Many definitions in the literature:
• “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data”
• An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge
• What is “previously unknown” information?
  – Strict definition
    • Information that not even the writer knows
  – Lenient definition
    • Rediscovering the information that the author encoded in the text


23. Text Mining Process
• Text Preprocessing
  – Syntactic Text Analysis
  – Semantic Text Analysis
• Features Generation
  – Bag of Words
  – Ngrams
• Feature Selection
  – Simple Counting
  – Statistics
  – Selection based on POS
• Text/Data Mining
  – Classification – Supervised Learning
  – Clustering – Unsupervised Learning
  – Information Extraction
• Analyzing Results
  – Visual Exploration, Discovery and Knowledge Extraction
  – Query-based question answering
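
The feature-generation step listed above (bag of words, n-grams) can be sketched in a few lines of Python. The function names here are illustrative only, not part of any SEASR component:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document by its lowercased tokens and their counts."""
    return Counter(text.lower().split())

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = "the lord of the rings"
bow = bag_of_words(doc)           # word -> count, e.g. 'the' occurs twice
bigrams = ngrams(doc.split(), 2)  # adjacent word pairs
```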

24. Text Characteristics (1)
• Large textual databases
  – Enormous wealth of textual information on the Web
  – Publications are electronic
• High dimensionality
  – Consider each word/phrase as a dimension
• Noisy data
  – Spelling mistakes
  – Abbreviations
  – Acronyms
• Text messages are very dynamic
  – Web pages are constantly being generated (and removed)
  – Web pages are generated from database queries
• Not well-structured text
  – Email/Chat rooms
    • “r u available ?”
    • “Hey whazzzzzz up”
  – Speech

25. Text Characteristics (2)
• Dependency
  – Relevant information is a complex conjunction of words/phrases
  – Order of words in the query matters
    • hot dog stand in the amusement park
    • hot amusement stand in the dog park
• Ambiguity
  – Word ambiguity
    • Pronouns (he, she, …)
    • Synonyms (buy, purchase)
    • Words with multiple meanings (bat – related to baseball or a mammal)
  – Semantic ambiguity
    • The king saw the rabbit with his glasses. (multiple meanings)
• Authority of the source
  – IBM is more likely to be an authoritative source than my second cousin

26. Text Preprocessing
• Syntactic analysis
  – Tokenization
  – Lemmatization
  – POS tagging
  – Shallow parsing
  – Custom literary tagging
• Semantic analysis
  – Information Extraction
    • Named Entity tagging
    • Semantic Category (unnamed entity) tagging
    • Co-reference resolution
    • Ontological association (WordNet, VerbNet)
    • Semantic Role analysis
    • Concept-Relation extraction
27. Syntactic Analysis
• Tokenization
  – Text document is represented by the words it contains (and their occurrences)
  – e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
  – Highly efficient
  – Makes learning far simpler and easier
  – Order of words is not that important for certain applications
• Lemmatization/Stemming
  – Involves the reduction of corpus words to their respective headwords (i.e., lemmas)
  – Reduces dimensionality
  – Identifies a word by its root
  – e.g., flying, flew → fly
• Stop words
  – Identifies the most common words that are unlikely to help with text mining
  – e.g., “the”, “a”, “an”, “you”
• Parsing / Part-of-Speech (POS) tagging
  – Generates a parse tree (graph) for each sentence
    • Each sentence is a stand-alone graph
  – Finds the corresponding POS for each word
  – e.g., John (noun) gave (verb) the (det) ball (noun)
• Shallow Parsing
  – Analysis of a sentence which identifies the constituents (noun groups, verbs, …) but does not specify their internal structure, nor their role in the main sentence
• Deep Parsing
  – More sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
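
As a concrete illustration of tokenization, stop-word removal, and stemming, here is a minimal Python sketch. The toy stemmer only strips a few suffixes; a real system would use a Porter-style stemmer or a lemmatizer (irregular forms such as flew → fly need lemmatization, not suffix stripping):

```python
import re

# A tiny illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"the", "a", "an", "you", "of"}

def tokenize(text):
    """Split a document into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop the most common words, which rarely help text mining."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Toy suffix-stripping stemmer: reduces a word toward its root."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("Lord of the rings")
content = remove_stop_words(tokens)   # content-bearing tokens only
```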
28. Semantic Analysis: Information Extraction
• Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output it in a predetermined format

29. Information Extraction

Information Type | Description | State of the art (Accuracy)
Entities | an object of interest such as a person or organization | 90–98%
Attributes | a property of an entity such as its name, alias, descriptor, or type | 80%
Facts | a relationship held between two or more entities, such as the Position of a Person in a Company | 60–70%
Events | an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction | 50–60%

“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, Israel
30. Information Extraction Approaches
• Terminology (name) lists
  – This works very well if the list of names and name expressions is stable and available
• Tokenization and morphology
  – This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
• Use of characteristic patterns
  – This works fairly well for novel entities
  – Rules can be created by hand or learned via machine learning or statistical algorithms
  – Rules capture local patterns that characterize entities from instances of annotated training data
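
The tokenization-and-morphology approach — recognizing items by their internal format — can be illustrated with a simple regular expression for the DD/MM/YY dates mentioned above (the pattern and helper name are illustrative):

```python
import re

# Characteristic internal format of a DD/MM/YY date.
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")

def find_dates(text):
    """Return all (day, month, year) tuples matching DD/MM/YY."""
    return DATE_RE.findall(text)

hits = find_dates("The meeting on 04/02/09 was rescheduled to 11/02/09.")
```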

31. Information Extraction: Relation (Event) Extraction
• Identify (and tag) the relation between two entities:
  – A person is_located_at a location (news)
  – A gene codes_for a protein (biology)
• Relations require more information
  – Identification of two entities & their relationship
  – Predicted relation accuracy
    • Pr(E1) * Pr(E2) * Pr(R) ≈ (.93) * (.93) * (.93) ≈ .80
• Information in relations is less local
  – Contextual information is a problem: the right word may not be explicitly present in the sentence
  – Events involve more relations and are even harder
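
The accuracy product above can be checked directly: each factor is the probability of correctly identifying one entity or the relation itself, so the joint accuracy is their product:

```python
# Relation extraction succeeds only if both entities AND the relation
# are each identified correctly, so the accuracies multiply.
p_e1, p_e2, p_r = 0.93, 0.93, 0.93
p_relation = p_e1 * p_e2 * p_r   # roughly 0.80, as on the slide
```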

32. Semantic Analytics
Named Entity (NE) Tagging

Mayor Rex Luthor (NE:Person) announced today (NE:Time) the establishment of a new research facility in Alderwood (NE:Location). It will be known as Boynton Laboratory (NE:Organization).

33. Semantic Analysis
Semantic Category (unnamed entity, UNE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility (UNE:Organization) in Alderwood. It will be known as Boynton Laboratory.

34. Semantic Analysis
Co-reference Resolution for entities and unnamed entities

Mayor Rex Luthor announced today the establishment of a new research facility (UNE:Organization) in Alderwood. It will be known as Boynton Laboratory.

35. Semantic Analysis
Semantic Role Analysis

Mayor Rex Luthor (ACTOR) announced (ACTION) today (WHEN) the establishment (OBJECT) of a new research facility (OBJECT) in Alderwood (WHERE). It will be known (ACTION) as Boynton Laboratory (COMPL).
36. Semantic Analysis
Concept-Relation Extraction

(Diagram: Rex Luthor (person) —announce (action)→ establishment (event), with relations who: Rex Luthor, when: today, what: Boynton Lab (organization), where: Alderwood (location))
37. IE – Template Extraction – Steps
(Slide shows the extraction pipeline as XML markup steps, e.g. </VerbGroup> …)
38. Template Extraction
Source text (excerpt): “The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States. The mosque's chief cleric, Abu Hamza al-Masri, lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited.” — (c) 2001, Chicago Tribune, by Stephen J. Hedges and Cam Simpson.

Extracted templates (excerpt):
<Facility>Finsbury Park Mosque</Facility>
<Country>England</Country>
<Country>France</Country>
<Country>Belgium</Country>
<Country>United States</Country>
<PersonPositionOrganization>
  <Person>Abu Hamza al-Masri</Person>
  <Position>chief cleric</Position>
  <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>
<PersonArrest>
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>
39. Streaming Text: Knowledge Extraction
• Leveraging some earlier work on information extraction from text streams
• Information extraction: the process of using advanced, automated machine-learning approaches
  – to identify entities in text documents
  – to extract this information along with the relationships these entities may have in the text documents

(Slide visualization: demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.)
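
A minimal sketch of the co-occurrence rule described above — two entities are related when they appear within a fixed window of words — might look like the following (function name and window size are illustrative, not from SEASR):

```python
from itertools import combinations

def cooccurrences(tokens, entities, window=5):
    """Pair up entities whose token positions fall within `window` words."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in entities]
    pairs = set()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i <= window:
            pairs.add(tuple(sorted((a, b))))
    return pairs

tokens = "Luthor opened a new lab in Alderwood near Boynton".split()
rels = cooccurrences(tokens, {"Luthor", "Alderwood", "Boynton"}, window=6)
```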

40. Semantic Analysis
• Word Sense Disambiguation
  – Context based or proximity based
  – Very accurate

41. Ontological Association (WordNet)
• WordNet: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
• Search for dog
  – n dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
  – n frump, dog (a dull unattractive unpleasant girl or woman)
  – n dog (informal term for a man)
  – n cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
  – n frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
  – n pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
  – n andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
  – v chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)

42. Feature Selection
• Reduce Dimensionality
  – Learners have difficulty addressing tasks with high dimensionality
• Irrelevant Features
  – Not all features help!
  – Remove features that occur in only a few documents
  – Reduce features that occur in too many documents
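
The two filtering rules above can be sketched as a document-frequency filter; the thresholds and function name are illustrative:

```python
def filter_by_document_frequency(docs, min_df=2, max_ratio=0.8):
    """Keep features appearing in at least `min_df` documents but in
    no more than `max_ratio` of all documents."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):          # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t for t, c in df.items() if c >= min_df and c / n <= max_ratio}

docs = [["the", "dog", "barks"], ["the", "dog", "runs"],
        ["the", "cat", "sleeps"], ["the", "sun", "sets"]]
kept = filter_by_document_frequency(docs)  # "the" is too common, rare words drop
```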

43. Text Mining: General Application Areas
• Information Retrieval
  – Indexing and retrieval of textual documents
  – Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
  – Extraction of partial knowledge in the text
• Web Mining
  – Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
  – Predict a class for each text document
• Clustering
  – Generating collections of similar text documents

44. Text Mining: Supervised vs. Unsupervised
• Supervised learning (Classification)
  – Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – Split into training data and test data for the model-building process
  – New data is classified based on the model built with the training data
  – Techniques
    • Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines
• Unsupervised learning (Clustering)
  – Class labels of training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

45. Results: Social Network (Tom in Red)

46. Results: Timeline

47. Results: Maps

48. Text Mining: T2K and ThemeWeaver

49. Text Mining: Themescape and ThemeRiver
• Visualizing Relationships Between Documents
• Images from Pacific Northwest Laboratory
50. Gather – Analyze – Present


51. Text Mining: Applications
• Email: Spam filtering
• News Feeds: Discover what is interesting
• Medical: Identify relationships and link information from different medical fields
• Homeland Security
• Marketing: Discover distinct groups of potential buyers and make suggestions for other products
• Industry: Identifying groups of competitors' web pages
• Job Seeking: Identify parameters in searching for jobs

52. Text Mining: Classification Definition
• Given: Collection of labeled records
  – Each record contains a set of features (attributes) and the true class (label)
  – Create a training set to build the model
  – Create a testing set to test the model
• Find: Model for the class as a function of the values of the features
• Goal: Assign a class (as accurately as possible) to previously unseen records
• Evaluation: What Is Good Classification?
  – Correct classification
    • The known label of a test example is identical to the predicted class from the model
  – Accuracy ratio
    • Percent of test-set examples that are correctly classified by the model
  – Distance measure between classes can be used
    • e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”
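
The accuracy ratio described above is straightforward to compute; the labels below are made-up examples:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test-set examples whose predicted class matches
    the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

acc = accuracy(["sports", "crime", "sports", "politics"],
               ["sports", "crime", "politics", "politics"])  # 3 of 4 correct
```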

53. Text Mining: Clustering Definition
• Given: Set of documents and a similarity measure among documents
• Find: Clusters such that
  – Documents in one cluster are more similar to one another
  – Documents in separate clusters are less similar to one another
• Goal:
  – Finding a correct set of documents
• Similarity Measures:
  – Euclidean distance if attributes are continuous
  – Other problem-specific measures
    • e.g., how many words are common in these documents
• Evaluation: What Is Good Clustering?
  – Produce high-quality clusters with
    • high intra-class similarity
    • low inter-class similarity
  – Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
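
A simple problem-specific similarity measure of the kind mentioned — how many words two documents share — can be sketched as a Jaccard index over word sets (the function name is illustrative):

```python
def word_overlap(doc_a, doc_b):
    """Similarity as the fraction of shared words (Jaccard index)."""
    a = set(doc_a.lower().split())
    b = set(doc_b.lower().split())
    return len(a & b) / len(a | b)

sim = word_overlap("the cat sat on the mat", "the cat ate the rat")
```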

54. SEASR Meandre Workbench

55. Future Work
• Enhancements to Semantic Analysis
  – Use of Ontological Association (WordNet, VerbNet)
  – Improve co-referencing
  – Improve fact extraction
• Visual exploration tools

