Enterprise Collaboration and
Innovation Support Systems

(GE/ IE 498 ECI): 
The Unstructured Data Contains a Clue

Xavier ...
o
n‘. 
a

Where did we leave it? 

- Graphs as structure

- Basic concepts

~ Types of graphs

- Basic concepts of graph a...
What's the plan for today? 

- Human communication

- Text as a form of unstructured data
- Basic notions of text mining

...
Unstructured data all around

- Human communication involves text

- Language is highly ambiguous

- Even speech can be tr...
o
u‘. 
o

The overview of today topics

- Basic concepts (this lecture)

— Collecting documents

— Document standardizatio...
Collecting documents

- Think about what kind of text you have access from
your computer

- Documents you write in Word,  ...
Document standardization

- Efforts to provide a common representation and sharing
of documents

- Efforts focus on the co...
Some issues to be aware

- Two main approaches to gain insight from text
— Natural language processing
— Statistic-driven ...
o
u‘. 
o

Zipf’s law

0 Publicized by Harvard linguist George Kingsley Zipf

- The most frequent word will occur approxima...
Zipf’s law

A ~-----~
English
b German
5 Hungarian , _
Hungarian 2 

O ’ Zipl‘s

I ‘ Lavaletie

U BNC

I

8

f

f

9

Q

U...
"i‘ —1?E-[C Eur ngl

Zipf’s law lessons

High-frequency terms are not so meaningful
These terms are usually called stop wo...
o
n‘. 
o

Tokenization

— Language dependent

- Identify words (also know as tokens)

~ Basic units of text

- Take care o...
'_§‘ -135-[C Ear ngl

Lemmatization

Process tokens

It is also know as stemming
Language and application dependent
Some e...
o
n‘. 
a

A simple vectorized structure

- A collection of documents
- Each document is a set of tokens
~ We can stem each...
Some examples

- Done in class

‘ 4 %E.  [C Ear mg EC-CI?  Xavier Lima in Spring 2007
What will we do in the next lecture? 

- Communications over the net

- How can support innovation and creativity? 
- Over...
Something to think about

- Recommended reading

Text Mining:  Predictive Methods for Analyzing Unstructured Information

...
Upcoming SlideShare
Loading in …5
×

GE498-ECI, Lecture 9:The Unstructured Data Contains a Clue

1,008 views
949 views

Published on

GE498-ECI, Lecture 9:The Unstructured Data Contains a Clue

Published in: Education, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,008
On SlideShare
0
From Embeds
0
Number of Embeds
47
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

GE498-ECI, Lecture 9:The Unstructured Data Contains a Clue

  1. 1. Enterprise Collaboration and Innovation Support Systems (GE/ IE 498 ECI): The Unstructured Data Contains a Clue Xavier Llora Illinois Genetic Algorithms Lab & National Center for Supercomputing Applications University of Illinois at Urbana-Champaign xllo: a@uiuc. edu
  2. 2. o n‘. a Where did we leave it? - Graphs as structure - Basic concepts ~ Types of graphs - Basic concepts of graph analysis GE ‘l[ 475. [C SN mg EC-CI? Xavier Lima in Spring 2007
  3. 3. What's the plan for today? - Human communication - Text as a form of unstructured data - Basic notions of text mining - Common operations in text mining GE/ IE 498 ECI Spring 2007 Xavier Lloré -3: Spring 2007 3
  4. 4. Unstructured data all around - Human communication involves text - Language is highly ambiguous - Even speech can be transcribed as text - Available everywhere (company, web, your laptop) - Contains a lot of information and knowledge, but it is highly unstructured GE/ IE 498 ECI Spring 2007 Xavier Llora -1.1 Spring 2007 4
  5. 5. o u‘. o The overview of today topics - Basic concepts (this lecture) — Collecting documents — Document standardization — Tokenization Lemmatization A simple vectorized structure - Advanced topics (for your leisure) Sentence boundary detection — Part-of-speech tagging - Word sense disambiguation — Phrase recognition — Named entity recognition — Parsing Feature creation Si i‘l[ 4?-E. [C Sm rig 203.7 Xavier Llora -D Spring 2007
  6. 6. Collecting documents - Think about what kind of text you have access from your computer - Documents you write in Word, PDFs you download, web pages, chat logs, etc. - Researchers have put sample corpuses together (Reuters, Wall Street, ... ) - Each of them have different properties and language usages (not to mention different languages) - The web has produces a clear explosion of the amount of text GE/ IE 498 ECI Spring 2007 Xavier Llcira in Spring 2007 6
  7. 7. Document standardization - Efforts to provide a common representation and sharing of documents - Efforts focus on the content not the visualization of that content - The common framework: XML - A common resource description framework: RDF - Why we are interested on it? Make it possible to automate document processing and management. GE/ IE 498 ECI Spring 2007 Xavier Llora 1.1 Spring 2007 7
  8. 8. Some issues to be aware - Two main approaches to gain insight from text — Natural language processing — Statistic-driven analysis - Common applications (classification, summarization, indexing, etc. ) - Different languages - Typos - Syntax - Grammar - Zipf’s law GE/ IE 498 ECI Spring 2007 Xavier Llora -1.1 Spring 2007 B
  9. 9. o u‘. o Zipf’s law 0 Publicized by Harvard linguist George Kingsley Zipf - The most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. - The term usually refers to any power law probability distributions - Brown Corpus - "the" is the most frequently-occurring word (7% of all word occurrences) The second-place word "of" accounts for slightly over 3.5% of words The list follows with "and", etc. 6: i‘l[ ~17-E. [C Snr rig ZCICIF Xavier Llora in Spring 2007
  10. 10. Zipf’s law A ~-----~ English b German 5 Hungarian , _ Hungarian 2 O ’ Zipl‘s I ‘ Lavaletie U BNC I 8 f f 9 Q U e I’) C Y -. 1o 10' 10’ 10’ 10' 1o 10“ 10 Number of different word forms in descending frequency order (}cz: iNcn>:1li, ( siilxi Zuuiku illxili. 'Miil1iliii-, :iiu| ~r. iii. -ziuil rcu uIl‘. l]¥. ~I. Zigii‘. ~ la“ and lluii-, :.iiiuii . ~;wt~t-l: gt-m~i. «iiuu'. M14 L| l‘, i!lll. ll-I | li. iii-, :<rit-. i.4W 5-6i: K5—lU| ozrir 4:5 cc Sam-g 2-337 Xavier Llora 0 Spring 2007 -0
  11. 11. "i‘ —1?E-[C Eur ngl Zipf’s law lessons High-frequency terms are not so meaningful These terms are usually called stop words Low-frequency terms a very common — They may be specific to the document - They may be typos Any statistics must take this into account C-CI? Xavier Lima in Spring 2007 ‘ i
  12. 12. o n‘. o Tokenization — Language dependent - Identify words (also know as tokens) ~ Basic units of text - Take care of delimiters - ‘ is a ‘s or a delimiter? What about - ? - Other elements . ,:<>()? ! GE ‘l[ 475. [C Eur mg EC-CI? Xavier Lima in Spring 2007
  13. 13. '_§‘ -135-[C Ear ngl Lemmatization Process tokens It is also know as stemming Language and application dependent Some examples: — Run, runs, ran, running -> run - Book, books -> book Benefits - Reduce the number of different words to deal with Drawbacks — May introduce ambiguity C-CI? Xavier Lima in Spring 2007
  14. 14. o n‘. a A simple vectorized structure - A collection of documents - Each document is a set of tokens ~ We can stem each token - Each stemmed token that is not a stop word (ore a typo) forms a dictionary GE ‘l[ 4 $5 [C Snr mg EC-CI? Xavier Lima in Spring 2007 ‘4
  15. 15. Some examples - Done in class ‘ 4 %E. [C Ear mg EC-CI? Xavier Lima in Spring 2007
  16. 16. What will we do in the next lecture? - Communications over the net - How can support innovation and creativity? - Overcoming superficiality - Get focused - Visual maps GE/ IE 498 ECI Spring 2007 Xavier Llora in Spring 2007 16
  17. 17. Something to think about - Recommended reading Text Mining: Predictive Methods for Analyzing Unstructured Information by Shalom Weiss, Nitin Indurkhya, Tong Zhang, Fred Damerau Springer (2005) - Homework: - Research on lemmatization for English. is there a common tool used for everybody? GEIIE 498 ECI Spring 2007 Xavier Llora -3.1 Spring 2007 17

×