Using Wikipedia as a reference for extracting semantic information

    Using Wikipedia as a reference for extracting semantic information - Presentation Transcript

    1. Using Wikipedia as a reference for extracting semantic information from a text. Andrea Prato & Marco Ronchetti, Università di Trento, Italy
    2. Explicit Semantic Analysis (Gabrilovich & Markovitch, 2007)
    3. Throw away: stopwords; fragment pages (<100 words); suffixes (stemming)
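The preprocessing on slide 3 and the concept ranking that ESA performs can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stopword list, the crude suffix stripper standing in for a real stemmer, the TF-IDF weighting details, and the names `build_index` and `rank_concepts` are all assumptions made for the example. Only the 100-word fragment threshold comes from the slide.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "is", "and", "to", "for"}  # illustrative subset
MIN_WORDS = 100  # fragment-page threshold taken from the slide


def stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer (assumption).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stopwords, strip suffixes.
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


def build_index(pages, min_words=MIN_WORDS):
    """pages: dict title -> text. Returns dict term -> {title: TF-IDF weight}."""
    # Throw away fragment pages (fewer than min_words words).
    kept = {t: tokenize(x) for t, x in pages.items() if len(x.split()) >= min_words}
    doc_freq = Counter()
    for tokens in kept.values():
        doc_freq.update(set(tokens))
    n_pages = len(kept)
    index = {}
    for title, tokens in kept.items():
        for term, count in Counter(tokens).items():
            weight = (count / len(tokens)) * math.log(n_pages / doc_freq[term])
            if weight > 0:
                index.setdefault(term, {})[title] = weight
    return index


def rank_concepts(text, index, top=10):
    """Score each page (concept) by summing the weights of the terms in text."""
    scores = Counter()
    for term, count in Counter(tokenize(text)).items():
        for title, weight in index.get(term, {}).items():
            scores[title] += count * weight
    return [title for title, _ in scores.most_common(top)]
```

With a real Wikipedia dump, `pages` would hold one entry per article; the returned titles play the role of the ranked concept lists shown in the sample slides.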
    4. A sample (ESA)
       Input text: The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients.
       Concepts returned:
       - Leukemia
       - Severe combined immunodeficiency
       - Cancer
       - Non-Hodgkin lymphoma
       - AIDS
       - ICD-10 Chapter II: Neoplasms; Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
       - Bone marrow transplant
       - Immunosuppressive drug
       - Acute lymphoblastic leukemia
       - Multiple sclerosis
    5. A sample (ESA)
       Input text (Lonely Planet Tourist Guide): Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer).
       Concepts returned (by rank):
       1 - Glossary_of_cue_sports_terms
       2 - Swimming
       3 - Ian_Thorpe
       4 - NCAA_football_bowl_games, 2005-06
       5 - Swimming_machine
       6 - American_football_strategy
       7 - Contract_bridge_glossary
       8 - Olympic_Games
       9 - Pingu_episodes_series_6
       10 - Venice
       …
       15 - Corruption_in_Ghana
       …
       27 - Legislative_system_of_the_People's_Republic_of_China
    6. Clustering
    7. Wikipedia is hyperlinked
    8. Swimming is clustered with Olympic Games
    9. A sample (ESA): the Venice text and ranked concept list of slide 5, shown again
    10. Throw away: large aggregators; category links; numbers; pages with more than N links (N=100)
    11. After clustering: only 3 clusters with cardinality larger than 1. The first cluster, with cardinality 21, was automatically named Swimming. The second and the third both have cardinality equal to 2, and they are named Training and Venice-bucentaur.
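The slides do not spell out the clustering algorithm, so the following is only one plausible reading of slides 6 to 11: treat the concepts returned by ESA as nodes, connect two concepts when one Wikipedia page links to the other, and take connected components as clusters. The connected-components choice, the function name `cluster_concepts`, and the `links` structure are assumptions; the rule of discarding pages with more than N=100 links comes from slide 10 (category links and numbers are assumed to have been filtered out already).

```python
def cluster_concepts(concepts, links, max_links=100):
    """Group concepts into clusters using Wikipedia hyperlinks.

    concepts: list of page titles returned by ESA.
    links: dict title -> set of titles that the page links to (hypothetical input).
    Pages with more than max_links outgoing links are discarded as aggregators.
    """
    kept = [c for c in concepts if len(links.get(c, ())) <= max_links]
    kept_set = set(kept)

    # Undirected adjacency restricted to the kept concepts.
    adjacency = {c: set() for c in kept}
    for source in kept:
        for target in links.get(source, ()):
            if target in kept_set and target != source:
                adjacency[source].add(target)
                adjacency[target].add(source)

    # Connected components via depth-first search.
    clusters, seen = [], set()
    for start in kept:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))

    # Largest clusters first; singletons end up at the tail.
    clusters.sort(key=len, reverse=True)
    return clusters
```

Under this reading, spurious concepts such as Corruption_in_Ghana would tend to remain singletons, while Swimming and Olympic_Games fall into the same cluster as on slide 8.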
    12. Validation: Turing test. Which one is machine-generated? [Figure; surviving labels: Text, Classification]
    13. Outcome: 20 texts of length ranging between 60 and 200 words. Texts were collected from various sources such as newspaper articles, textbooks, random web pages, and MSN Encarta.
    14. Further improvements
    15. Using only nouns: use a POS tagger to identify syntactic roles in the document to be classified; keep only nouns (throw away the rest). No degradation in the results!
    16. Define multiwords. Lexical multiword identification approach: the following generative pattern is considered: ((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun, where + means one or more, * zero or more, ? zero or one, and | or. Validation: a candidate multiword is valid if there is a Wikipedia entry related to it.
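The generative pattern on slide 16 translates directly into a regular expression over part-of-speech tags. The sketch below assumes the text has already been tagged with the (hypothetical) labels ADJ, NOUN and PREP, encodes each tag as one character, and keeps matches of at least two words. The set `wikipedia_titles` is a stand-in for a real lookup against Wikipedia entries, and exact lowercase title matching is an assumption about what "related to it" means.

```python
import re

# ((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun
# with A = adjective, N = noun, P = preposition.
PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PREP": "P"}  # hypothetical tag names


def candidate_multiwords(tagged):
    """tagged: list of (word, tag) pairs. Returns candidate multiwords (2+ words)."""
    # Any tag outside the pattern's alphabet becomes 'x' and breaks a match.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    candidates = []
    for match in PATTERN.finditer(codes):
        if match.end() - match.start() >= 2:
            words = [word for word, _ in tagged[match.start():match.end()]]
            candidates.append(" ".join(words))
    return candidates


def valid_multiwords(candidates, wikipedia_titles):
    """Keep candidates that have a matching entry in the given title set."""
    titles = {title.lower() for title in wikipedia_titles}
    return [c for c in candidates if c.lower() in titles]
```

Note that `finditer` returns leftmost, non-overlapping matches only; enumerating sub-spans of a long match (for example "immune deficiency" inside "severe combined immune deficiency") would need an extra loop.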
    17. Text with multiwords: keep all nouns; keep all adjectives that are part of a multiword
    18. Evaluation (human inspection of results): 100 samples (50 technical, 50 generic). Multiwords improved results significantly in 7 cases (5 technical), improved them marginally in 13, and worsened them marginally in 6. Overall improvement: 10% on technical text.
    19. Work in progress
    20. Concept-mediated mapping among documents. How similar are two docs? Jaccard index. [Diagram: Doc 1 and Doc 3 linked to Concepts 1 to 4]
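For the concept-mediated similarity of slide 20, the Jaccard index of two concept sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the concept sets in the usage note are made up for illustration and are not read off the slide's diagram.

```python
def jaccard(concepts_a, concepts_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of concepts."""
    a, b = set(concepts_a), set(concepts_b)
    union = a | b
    # Two empty concept sets are treated as having similarity 0 (a convention).
    return len(a & b) / len(union) if union else 0.0
```

For example, `jaccard({"Concept 1", "Concept 2", "Concept 3"}, {"Concept 2", "Concept 3", "Concept 4"})` gives 2 shared concepts out of 4 distinct ones, i.e. 0.5.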
    21. Syllabi comparison
    22. Inter links
    23. Mapping documents in different languages: deploying Wikipedia interlinks. Jaccard index. [Diagram: Doc 1 and Doc 3 linked to Concepts 1 to 4 across INTERLINKS]
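The cross-language variant of slide 23 can be sketched by first translating one document's concepts through Wikipedia interlanguage links and then comparing concept sets with the Jaccard index. The `interlinks` dictionary below is a hypothetical, hand-built mapping from titles in one language edition to titles in another; the function names are likewise assumptions. Concepts without an interlink are simply dropped, which is one possible policy among several.

```python
def map_concepts(concepts, interlinks):
    """Translate concept titles through an interlink mapping, dropping unmapped ones."""
    return {interlinks[c] for c in concepts if c in interlinks}


def cross_language_similarity(concepts_a, concepts_b, interlinks_a_to_b):
    """Jaccard index between document A's mapped concepts and document B's concepts."""
    mapped = map_concepts(concepts_a, interlinks_a_to_b)
    other = set(concepts_b)
    union = mapped | other
    return len(mapped & other) / len(union) if union else 0.0


# Illustrative data only: an Italian document compared with an English one.
interlinks_it_en = {
    "Nuoto": "Swimming",
    "Venezia": "Venice",
    "Giochi olimpici": "Olympic Games",
}
italian_doc = {"Nuoto", "Venezia", "Giochi olimpici"}
english_doc = {"Swimming", "Venice", "Training"}
similarity = cross_language_similarity(italian_doc, english_doc, interlinks_it_en)
```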

    In this paper we present an algorithm that, using W…

    CC Attribution-NonCommercial License
