Using Wikipedia as a reference for extracting semantic information

In this paper we present an algorithm that, using Wikipedia as a reference, extracts semantic information from an arbitrary text. Our algorithm refines a procedure proposed by others, which mines the full text of Wikipedia. Our refinement, based on a clustering approach, exploits the semantic information contained in certain types of Wikipedia hyperlinks and also introduces an analysis based on multiwords. Our algorithm outperforms current methods in that its output contains far fewer false positives. We were also able to determine which (structural) parts of the texts provide most of the semantic information extracted by the algorithm.
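The link-based clustering step can be illustrated with a minimal Python sketch. The source does not say which clustering algorithm is used, so this sketch assumes the simplest reading: candidate pages that hyperlink to one another, in either direction, end up in the same cluster (connected components). Candidates with more than 100 links are dropped as aggregators, which is one interpretation of the N=100 threshold given on slide 10. The function name `cluster_by_links` and the `links` dictionary are hypothetical.

```python
from collections import defaultdict

def cluster_by_links(candidates, links, max_links=100):
    """Group candidate pages connected by hyperlinks (assumed: connected components).

    candidates: page titles returned for a text
    links: dict mapping a page title to the set of titles it links to (toy data)
    max_links: candidates with more links than this are dropped as aggregators
    """
    # Drop aggregator pages before building the graph.
    kept = [p for p in candidates if len(links.get(p, ())) <= max_links]
    kept_set = set(kept)

    # Build an undirected graph restricted to the kept candidates.
    adjacency = defaultdict(set)
    for page in kept:
        for target in links.get(page, ()):
            if target in kept_set and target != page:
                adjacency[page].add(target)
                adjacency[target].add(page)

    # Collect connected components with an iterative depth-first search.
    seen = set()
    clusters = []
    for page in kept:
        if page in seen:
            continue
        seen.add(page)
        stack = [page]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))

    # Largest clusters first; singletons are likely false positives.
    clusters.sort(key=len, reverse=True)
    return clusters
```

Under this reading, concepts left in clusters of cardinality 1 are the natural candidates for removal as false positives, which matches the deck's report that only 3 clusters had cardinality larger than 1 for the Venice sample.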

Published in: Technology, Health & Medicine

Transcript of "Using Wikipedia as a reference for extracting semantic information"

  1. Using Wikipedia as a reference for extracting semantic information from a text. Andrea Prato & Marco Ronchetti, Università di Trento, Italy
  2. Explicit Semantic Analysis (Gabrilovich & Markovitch, 2007)
  3. Throw away: stopwords; fragment pages (<100 words); suffixes (stemming)
  4. A sample (ESA)
     Text: "The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients."
     Concepts:
     - Leukemia
     - Severe combined immunodeficiency
     - Cancer
     - Non-Hodgkin lymphoma
     - AIDS
     - ICD-10 Chapter II: Neoplasms;
     - Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
     - Bone marrow transplant
     - Immunosuppressive drug
     - Acute lymphoblastic leukemia
     - Multiple sclerosis
  5. A sample (ESA)
     Text (Lonely Planet Tourist Guide): "Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer)."
     Ranked concepts:
     1-Glossary_of_cue_sports_terms
     2-Swimming
     3-Ian_Thorpe
     4-NCAA_football_bowl_games,_2005-06
     5-Swimming_machine
     6-American_football_strategy
     7-Contract_bridge_glossary
     8-Olympic_Games
     9-Pingu_episodes_series_6
     10-Venice
     …
     15-Corruption_in_Ghana
     …
     27-Legislative_system_of_the_Peopleʼs_Republic_of_China
  6. Clustering
  7. Wikipedia is hyperlinked
  8. Swimming is clustered with Olympic Games
  9. A sample (ESA): the Venice text from the Lonely Planet Tourist Guide and its ranked concept list, repeated from slide 5.
  10. Throw away: large aggregators; category links; numbers; pages with more than N = 100 links.
  11. After clustering: only 3 clusters with cardinality larger than 1. The first cluster, with cardinality 21, was automatically named Swimming. The second and the third both have cardinality equal to 2, and they are named Training and Venice-bucentaur.
  12. Validation: Turing test. Which one is machine-generated? [Diagram labels: Classification, Text, Classification]
  13. Outcome: 20 texts of length ranging between 60 and 200 words. Texts were collected from various sources such as newspaper articles, textbooks, random web pages, and MSN Encarta.
  14. Further improvements
  15. Using only nouns: use a POS tagger to identify syntactic roles in the document to be classified. Keep only nouns (throw away the rest). No degradation in the results!
  16. Define multiwords. Lexical multiword identification approach: the following generative pattern is considered:
      ((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?) (Adj | Noun)*) Noun
      where + means one or more, * zero or more, ? zero or one, and | or.
      Validation: a candidate multiword is valid if there is a Wikipedia entry related to it.
  17. Text with multiwords: keep all nouns; keep all adjectives that are part of a multiword.
  18. Evaluation (human inspection of results): 100 samples (50 technical, 50 generic). Multiwords improved the result significantly in 7 cases (5 technical), improved it marginally in 13, and worsened it marginally in 6. Overall improvement: 10% on technical text.
  19. Work in progress
  20. Concept-mediated mapping among documents. How similar are two docs? Jaccard Index. [Diagram: Doc 1 and Doc 3 connected to Concepts 1 to 4]
  21. Syllabi comparison
  22. Interlinks
  23. Mapping documents in different languages: deploying Wikipedia interlinks. Jaccard Index. [Diagram: Doc 1 and Doc 3 connected to Concepts 1 to 4 through INTERLINKS]
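
The document comparison on slides 20 and 23 can be sketched in Python as follows, assuming each document has already been reduced to a set of Wikipedia concept titles. The Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. For the cross-language case, concepts are first translated through an interlanguage-link table. The function names and the `it_to_en` mapping are hypothetical toy data, not part of the original work.

```python
def jaccard(concepts_a, concepts_b):
    """Jaccard index of two concept sets: size of intersection over size of union."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets have similarity 0.
        return 0.0
    return len(a & b) / len(a | b)

def map_concepts(concepts, interlinks):
    """Translate concept titles via an interlink table; unmapped concepts are dropped."""
    return {interlinks[c] for c in concepts if c in interlinks}

# Toy example: an Italian document compared with an English one.
it_to_en = {"Nuoto": "Swimming", "Venezia": "Venice", "Giochi_olimpici": "Olympic_Games"}
doc_it = {"Nuoto", "Venezia", "Giochi_olimpici"}
doc_en = {"Swimming", "Venice", "Training"}

similarity = jaccard(map_concepts(doc_it, it_to_en), doc_en)
print(f"Jaccard similarity: {similarity:.2f}")
```

Dropping concepts that have no interlink is a design choice of this sketch; keeping them would enlarge the union and lower the score.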
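
The multiword pattern on slide 16 can be expressed as a regular expression over part-of-speech tags. The Python sketch below assumes the text has already been tagged (no particular tagger is implied) and encodes each token as one letter: A for adjective, N for noun, P for preposition, O for anything else. The requirement of at least two tokens and the `wikipedia_titles` set used for validation are assumptions of this sketch.

```python
import re

# Slide 16 pattern, one letter per token:
# ((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun
# Note: Python's re takes the first alternative that succeeds, not the longest.
PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")

def multiword_candidates(tagged):
    """Return candidate multiwords from a list of (word, tag) pairs."""
    tags = "".join(t if t in ("A", "N", "P") else "O" for _, t in tagged)
    candidates = []
    for match in PATTERN.finditer(tags):
        # The pattern also matches a single noun; keep only spans of 2+ tokens.
        if match.end() - match.start() >= 2:
            words = [w for w, _ in tagged[match.start():match.end()]]
            candidates.append(" ".join(words))
    return candidates

def validate(candidates, wikipedia_titles):
    """Keep candidates that have a Wikipedia entry (here: a hypothetical title set)."""
    known = {title.lower() for title in wikipedia_titles}
    return [c for c in candidates if c.lower() in known]

tagged = [("the", "O"), ("haematopoietic", "A"), ("stem", "N"),
          ("cells", "N"), ("are", "O"), ("rare", "A")]
print(multiword_candidates(tagged))
```

In practice the title lookup would be run against a Wikipedia dump or index rather than an in-memory set.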