Text Mining What It Is With Audio

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Notes on slide 1

    What is IE. As a task it is… Starting with some text… and a empty data base with a defined ontology of fields and records, Use the information in the text to fill the database.

    As a family of techniques, is consists of Segmentation to find the beginning and ending boundaries of the text snippets that will go in the DB. Classification…

    1 Favorite

    Text Mining What It Is With Audio - Presentation Transcript

    1. Text-Mining The process used when some people read: detailed text explanatory documents financial reports
    2. What is Text-Mining?
      • Finding interesting regularities in large textual
      • Where interesting means: non-trivial , hidden, previously unknown and potentially useful
      • Finding semantic and abstract information from the surface form of textual
    3. Which areas are active in Text Processing? Natural Language Processing Information Retrieval Text Processing Knowledge Rep. & Reasoning
    4. Why Text is Tough? (M.Hearst 97)
      • Abstract concepts are difficult to understand.
      • “ Countless” combinations of subtle, abstract relationships among concepts.
      • There are many ways to represent similar concepts.
        • Example: space ship, flying saucer, UFO
      • Concepts are difficult to visualize
    5. Why Text is Easy? (M.Hearst 97)
      • Highly redundant
        • … most text count on this property
      • Just about any simple system can get “good” results for simple tasks:
        • Pull out “important” phrases
        • Find “meaningfully” related words
        • Mentally create some sort of summary from documents
    6. Levels of Text Processing
      • Word Level
        • Words Properties
        • Stop-Words
        • Stemming
        • Frequent N-Grams
        • Thesaurus
    7. Words Properties
      • Relations among word surface forms and their senses:
        • Homonomy : same form, but different meaning
        • Examples: bank: river bank, financial institution
        • Polysemy : same form, related meaning
        • Examples: bank: blood bank, financial institution
        • Synonymy : different form, same meaning
        • Examples: singer, vocalist
        • Hyponymy : one word denotes a subclass of an another
        • Examples: breakfast, meal
      • Word frequencies in texts have power distribution :
        • … small number of very frequent words
        • … big number of low frequency words
    8. Stop-words
      • Stop-words are words that from
      • non-linguistic view that do not carry information
        • … they have mainly functional role
        • … usually we remove them to help the methods
        • to perform better
      • Natural language dependent
      • Examples:
        • a, about, above, across, after, again, against, all, almost, alone, along, already, also, ...
      • Original text
      • Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
      • After the stop-words removal
      • Information Systems Asia Web provides research IS-related commercial materials interaction research sponsorship interested corporations focus Asia Pacific region
      • Original text
      • Survey of Information Retrieval - guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to interesting papers.
      • After the stop-words removal
      • Survey Information Retrieval guide IR emphasis web-based projects Includes glossary pointers interesting papers
    9. Stemming
      • Different forms of the same word are
      • usually problematic for text analysis,
      • because they have different spelling and
      • similar meaning .
      • Example: learns, learned, learning,…
      • Stemming is a process of transforming a
      • word into its stem (normalized form)
    10. Phrases in the form of frequent N-Grams
      • Simple way phrases are generated are frequent n-grams :
        • N-Gram is a sequence of n consecutive words
        • “ Frequent n-grams” are the ones which appear in all observed documents
      • N-grams are interesting because of the simple.
      • Given:
        • Set of documents (each document is a sequence of words),
        • MinFreq (minimal n-gram frequency),
        • MaxNGramSize (maximal n-gram length)
      • Original text on the Yahoo Web page:
      • Top:Reference:Libraries:Library and Information Science:Information Retrieval
      • UK Only
      • Idomeneus - IR & DB repository - These pages mostly contain IR related resources such as test collections, stop lists, stemming algorithms, and links to other IR sites.
      • Document represented by n-grams:
      • "REFERENCE LIBRARIES LIBRARY INFORMATION SCIENCE (#3 LIBRARY INFORMATION SCIENCE) INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL)"
      • "UK"
      • "IR PAGES IR RELATED RESOURCES COLLECTIONS LISTS LINKS IR SITES"
      • Original text on the Yahoo Web page:
      • University of Glasgow - Information Retrieval Group - information on the resources and people in the Glasgow IR group.
      • Centre for Intelligent Information Retrieval (CIIR).
      • Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
      • Document represented by n-grams:
      • "UNIVERSITY GLASGOW INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL) GROUP INFORMATION RESOURCES (#2 INFORMATION RESOURCES) PEOPLE GLASGOW IR GROUP"
      • "CENTRE INFORMATION RETRIEVAL (#2 INFORMATION RETRIEVAL)"
      • "INFORMATION SYSTEMS ASIA WEB RESEARCH COMMERCIAL MATERIALS RESEARCH ASIA PACIFIC REGION"
    11. Word relationships – lexical relations
      • Word relationships is a well developed and effective
        • … it consist from 4 classes which are:
        • nouns,
        • verbs,
        • adjectives,
        • adverbs
      • Each class consists from sense entries consisting from
      • a set of synonyms,
      • Examples:
        • musician, instrumentalist, player
        • person, individual, someone
        • life form, organism, being
    12. Word relationships
      • Each word entry is connected with another.
      leader -> follower Opposites Antonym course -> meal From parts to wholes Part-Of table -> leg From wholes to parts Has-Part copilot -> crew From members to their groups Member-Of faculty -> professor From groups to their members Has-Member meal -> lunch From concepts to subtypes Hyponym breakfast -> meal From concepts to subordinate Hypernym Example Definition Relation
      • Document Level
        • Summarization
        • Single Document Visualization
        • Text Segmentation
      Sentence Level
    13. Summarization
      • Task : the task is to produce shorter, summary
      • version of an original document.
      • Two main approaches to the problem:
        • Knowledge rich – performing semantic analysis, representing the meaning and generating the text satisfying length restriction
        • Selection based
    14. Selection based summarization
      • Main phases:
        • Analyzing the source text
        • Determining its important points
    15. Selected units Selection threshold Example of selection based approach from MS Word
    16. Visualization of a single document
    17. Why visualization of a single document is hard?
      • Visualizing of big text is easier task because of the mass amount of information:
        • ...statistics already starts working
        • ...most known approaches are statistics based
      • Visualization of a single (possibly short)
      • document is much harder task because:
        • ...we can not count of statistical properties of the text (lack of data)
        • ...we must rely on syntactical and logical structure of the document
    18. Simple approach
      • The text is split into the sentences.
      • Anaphora resolution is performed on all sentences
        • ...all ‘he’, ‘she’, ‘they’, ‘him’, ‘his’, ‘her’, etc. references to the objects are replaced by its proper name
      • From all the sentences we extract:
      • Subject-Predicate-Object triples (SPO)
    19. Text Segmentation
    20. Text Segmentation
      • Problem : divide text that has no given structure into segments with similar content
      • Example applications:
        • topic tracking in news (spoken news)
        • identification of topics in large, unstructured text
    21. Visualization
    22. Why text visualization?
      • To have a top level view of the topics
      • To see relationships between the topics
      • To understand better what’s going on
      • To view highly structured nature of textual
      • contents in a simplified way
    23. How do we extract keywords?
      • Characteristic keywords for a group of
      • documents are the most highly weighted
      • words in the center of the cluster
        • ...center of the cluster could be understood as an “average document” for specific group of documents
        • ...efficient solution
    24. Information Extraction The mental process that one goes through when reading and analyzing documents of any kind. This is a process that for a varying degree do automatically once we the mental processes down.
    25. What is “Information Extraction” Filling in the slots from sub-segments of text. Utilize the information in the text to fill in the database. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION
    26. What is “Information Extraction” Filling in the slots from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte , a Microsoft VP . "That's a super-important shift for us in terms of code access.“ Richard Stallman , founder of the Free Software Foundation , countered saying… NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. IE
    27. What is “Information Extraction” Information Extraction = segmentation + classification + clustering + association As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte , a Microsoft VP . "That's a super-important shift for us in terms of code access.“ Richard Stallman , founder of the Free Software Foundation , countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation aka “named entity extraction”
    28. What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte , a Microsoft VP . "That's a super-important shift for us in terms of code access.“ Richard Stallman , founder of the Free Software Foundation , countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation
    29. What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte , a Microsoft VP . "That's a super-important shift for us in terms of code access.“ Richard Stallman , founder of the Free Software Foundation , countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation
    30. What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte , a Microsoft VP . "That's a super-important shift for us in terms of code access.“ Richard Stallman , founder of the Free Software Foundation , countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation * * * * NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft..
    31. Levels of Text Processing
        • Question-Answering
    32. Question Answering
      • QA Systems are returning short and accurate
      • replies to the well-formed natural language
      • questions such as:
        • What is the height of Mount Everest?
        • After which animal is the Canary Island named?
        • How many liters are there in to a gallon?
      • QA Systems can be classified into levels of
      • sophistication:
        • Slot-filling – easy questions
    33. Question Answering Example
      • Example question and answer:
        • Q:What is the color of grass?
        • A: Green.
        • The answer comes from a document saying: “ grass is green ” without mentioning “ color”
      • Hypernym hierarchy:
        • green, chromatic color, color, visual property, property
    SlideShare Zeitgeist 2009

    + Michael SadowskiMichael Sadowski Nominate

    custom

    173 views, 1 favs, 0 embeds more stats

    How might a potential reader view the material you more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 173
      • 173 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 1
    • Downloads 15
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories