GE498-ECI, Lecture 9:The Unstructured Data Contains a Clue

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    1 Favorite

    GE498-ECI, Lecture 9:The Unstructured Data Contains a Clue - Presentation Transcript

    1. Enterprise Collaboration and Innovation Support Systems (GE/IE 498 ECI): The Unstructured Data Contains a Clue Xavier Llorà Illinois Genetic Algorithms Lab & National Center for Supercomputing Applications University of Illinois at Urbana-Champaign xllora@uiuc.edu
    2. Where did we leave it? • Graphs as structure • Basic concepts • Types of graphs • Basic concepts of graph analysis GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 2
    3. What’s the plan for today? • Human communication • Text as a form of unstructured data • Basic notions of text mining • Common operations in text mining GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 3
    4. Unstructured data all around • Human communication involves text • Language is highly ambiguous • Even speech can be transcribed as text • Available everywhere (company, web, your laptop) • Contains a lot of information and knowledge, but it is highly unstructured GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 4
    5. The overview of today topics Basic concepts (this lecture) • – Collecting documents – Document standardization – Tokenization – Lemmatization – A simple vectorized structure Advanced topics (for your leisure) • – Sentence boundary detection – Part-of-speech tagging – Word sense disambiguation – Phrase recognition – Named entity recognition – Parsing – Feature creation GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 5
    6. Collecting documents • Think about what kind of text you have access from your computer • Documents you write in Word, PDFs you download, web pages, chat logs, etc. • Researchers have put sample corpuses together (Reuters, Wall Street,…) • Each of them have different properties and language usages (not to mention different languages) • The web has produces a clear explosion of the amount of text GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 6
    7. Document standardization • Efforts to provide a common representation and sharing of documents • Efforts focus on the content not the visualization of that content • The common framework: XML • A common resource description framework: RDF • Why we are interested on it? Make it possible to automate document processing and management. GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 7
    8. Some issues to be aware • Two main approaches to gain insight from text – Natural language processing – Statistic-driven analysis • Common applications (classification, summarization, indexing, etc.) • Different languages • Typos • Syntax • Grammar • Zipf’s law GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 8
    9. Zipf’s law Publicized by Harvard linguist George Kingsley Zipf • The most frequent word will occur approximately twice as often as • the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. The term usually refers to any power law probability distributions • Brown Corpus • – "the" is the most frequently-occurring word (7% of all word occurrences) – The second-place word "of" accounts for slightly over 3.5% of words – The list follows with "and", etc. GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 9
    10. Zipf’s law Geza Nemeth, Csaba Zainko (2002), "Multilingual statistical text analysis, Zipf''s law and Hungarian speech generation", Acta Linguistica Hungarica, 49(3-4):385-401. GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 10
    11. Zipf’s law lessons • High-frequency terms are not so meaningful • These terms are usually called stop words • Low-frequency terms a very common – They may be specific to the document – They may be typos • Any statistics must take this into account GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 11
    12. Tokenization • Language dependent • Identify words (also know as tokens) • Basic units of text • Take care of delimiters • ‘ is a ‘s or a delimiter? What about - ? • Other elements .,:<>()?! GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 12
    13. Lemmatization • Process tokens • It is also know as stemming • Language and application dependent • Some examples: – Run, runs, ran, running -> run – Book, books -> book • Benefits – Reduce the number of different words to deal with • Drawbacks – May introduce ambiguity GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 13
    14. A simple vectorized structure • A collection of documents • Each document is a set of tokens • We can stem each token • Each stemmed token that is not a stop word (ore a typo) forms a dictionary GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 14
    15. Some examples • Done in class GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 15
    16. What will we do in the next lecture? • Communications over the net • How can support innovation and creativity? • Overcoming superficiality • Get focused • Visual maps GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 16
    17. Something to think about Recommended reading • Text Mining: Predictive Methods for Analyzing Unstructured Information by Sholom Weiss, Nitin Indurkhya, Tong Zhang, Fred Damerau Springer (2005) Homework: • – Research on lemmatization for English. Is there a common tool used for everybody? GE/IE 498 ECI - Spring 2007 Xavier Llorà © Spring 2007 17

    + Xavier LloràXavier Llorà, 3 years ago

    custom

    973 views, 1 favs, 1 embeds more stats

    GE498-ECI, Lecture 9:The Unstructured Data Contains more

    More info about this document

    CC Attribution-NonCommercial-ShareAlike LicenseCC Attribution-NonCommercial-ShareAlike LicenseCC Attribution-NonCommercial-ShareAlike License

    Go to text version

    • Total Views 973
      • 970 on SlideShare
      • 3 from embeds
    • Comments 0
    • Favorites 1
    • Downloads 0
    Most viewed embeds
    • 3 views on http://www.illigal.uiuc.edu

    more

    All embeds
    • 3 views on http://www.illigal.uiuc.edu

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories