2011-03-11 (UPM) eMadrid, L. Sanchez (UC3M): Semantic annotation of text

  1. Semantic annotation of text: techniques and applications
     Prof. Luis Sanchez-Fernandez
     Web Technologies Laboratory, University Carlos III of Madrid
     http://webtlab.it.uc3m.es
  2. Outline
     - Semantic Web
     - Techniques for semantic annotation of text
     - An approach to named entity disambiguation using Wikipedia
  3. Short history of the Web
     - 1990: Creation of the World Wide Web infrastructure at CERN by Tim Berners-Lee (HTTP, HTML, first Web client, first Web server)
     - 1993: Mosaic, first graphical Web client
     - 1994: Netscape Navigator
     - 1996: Commercial use of the WWW becomes widespread
     - 1999: Tim Berners-Lee proposes the Semantic Web
     - 2002: Weblogs and RSS → Web 2.0
     - 6 October 2009: at least 8 billion indexable Web pages
     - 23 September 2010: at least 15 billion indexable Web pages
     - Page counts according to http://www.worldwidewebsize.com/
  4. The problem of information overload
     - The great success of the Web has led to one of its current problems: information overload
     - Finding and updating relevant information is difficult and time-consuming for people and companies
       - Ex.: keeping an up-to-date view of the state of the art
     - Company employees can spend up to 20% of their working time searching the Web (Outsell Inc, 2002)
  5. The Semantic Web proposal
     - The goal of the Semantic Web is to automate Web tasks by enriching the current Web content with formal representations that enable better cooperation between humans and computers
  6. Semantic Web Stack
  7. RDF
     Course: Máster interuniversitario en Ingeniería Telemática
     - "Resource Description Framework" (RDF)
     - Goal of RDF (alternative views):
       - Language for resource description in the Web
       - Language for formal representation of (parts of) the information available in a Web document (metadata)
       - Formal => machine readable
       - Vocabulary defined with ontologies
     - What is a resource?
       - Web content: Web pages, images, e-mails, files, …
       - Resources mentioned in Web content: persons, locations, organizations, …
  8. RDF basic principles
     - We want to represent a piece of information, available in the Web, that describes a resource
     - Each piece of metadata states a property and can be modelled as a (formal) statement composed of:
       - subject: the resource being described
       - predicate: a property of the resource
       - object: the value of that property for the resource being described
     - Ex.: "http://www.example.org has a creator whose value is John Smith"
  9. RDF Model
     - An RDF model (a set of RDF statements) can be represented as a graph
     - For each statement:
       - the subject is a node
       - the predicate is an arc
       - the object is a node
     - Subject and predicate are resources
     - The object can be either a resource or a literal
  10. Example
      - "http://www.example.org has a creator whose value is John Smith"
  11. Textual notation (triples)
      <http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/creator> <http://www.example.org/staffid/85740> .
      <http://www.example.org/index.html> <http://www.example.org/terms/creation-date> "August 16, 1999" .
      <http://www.example.org/index.html> <http://www.example.org/terms/language> "English" .
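
The three statements above can be read into subject/predicate/object tuples and viewed as a graph. The following is a minimal sketch, assuming every term is either a `<URI>` or a plain `"literal"` with no escapes, language tags or datatypes. The helper names `parse_triples` and `is_literal` are hypothetical and do not come from any RDF library.

```python
import re

# The three statements from the slide, one triple per line.
TRIPLES_TEXT = '''
<http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/creator> <http://www.example.org/staffid/85740> .
<http://www.example.org/index.html> <http://www.example.org/terms/creation-date> "August 16, 1999" .
<http://www.example.org/index.html> <http://www.example.org/terms/language> "English" .
'''

# Simplifying assumption: a term is either <URI> (resource) or "text" (literal).
TERM = r'(<[^>]*>|"[^"]*")'
LINE = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$')


def parse_triples(text):
    """Return a list of (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"not a triple: {line!r}")
        triples.append(match.groups())
    return triples


def is_literal(term):
    """Literals are quoted; resources are written between angle brackets."""
    return term.startswith('"')


triples = parse_triples(TRIPLES_TEXT)

# Graph view: each subject node maps to its outgoing (arc, node) pairs.
graph = {}
for subject, predicate, obj in triples:
    graph.setdefault(subject, []).append((predicate, obj))

for subject, arcs in graph.items():
    for predicate, obj in arcs:
        kind = "literal" if is_literal(obj) else "resource"
        print(subject, predicate, obj, f"({kind})")
```

All three statements share one subject, so the graph has a single source node with three outgoing arcs: one to a resource and two to literals.
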
  12. Ontologies: goal
      - An ontology is a formal, explicit specification of a shared conceptualization
      - An ontology defines the basic terms and relations that make up the vocabulary of a topic area, as well as the rules that those terms and relations must satisfy
  13. RDF Schema
      - RDF vocabulary for:
        - Definition and description of properties
        - Definition and description of classes
      - Can be used to define simple ontologies
  14. Properties in RDF Schema
      - rdfs:subPropertyOf
      - rdfs:range
      - rdfs:domain
      - rdfs:subClassOf
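
To illustrate what these four properties allow a tool to infer, here is a minimal sketch of a forward-chaining closure over triples held as Python tuples. It applies only the standard RDFS rules for domain, range, subPropertyOf and subClassOf. The `ex:` vocabulary and the function name `rdfs_closure` are hypothetical, chosen for illustration.

```python
def rdfs_closure(triples):
    """Apply four RDFS inference rules until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))      # subject belongs to the domain class
                elif p2 == "rdfs:range" and s2 == p:
                    new.add((o, "rdf:type", o2))      # object belongs to the range class
                elif p2 == "rdfs:subPropertyOf" and s2 == p:
                    new.add((s, o2, o))               # statement also holds for the super-property
                elif p2 == "rdfs:subClassOf" and p == "rdf:type" and s2 == o:
                    new.add((s, "rdf:type", o2))      # instance also belongs to the super-class
        if new <= facts:
            return facts
        facts |= new


# Hypothetical schema plus a single data statement.
data = [
    ("ex:hasMother", "rdfs:subPropertyOf", "ex:hasParent"),
    ("ex:hasParent", "rdfs:domain", "ex:Person"),
    ("ex:hasParent", "rdfs:range", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
    ("ex:ann", "ex:hasMother", "ex:beth"),
]

closure = rdfs_closure(data)
for triple in sorted(closure - set(data)):
    print(triple)
```

From the single statement about `ex:ann`, the closure derives that she has `ex:beth` as a parent and that both are of type `ex:Person` and `ex:Agent`.
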
  15. Sample taxonomy
      - Picture by Ian Ruotsala
  16. OWL
      - Ontology language
      - More powerful than RDF Schema
      - Examples:
        - Existence/cardinality constraints: all instances of person have a mother that is also a person, or persons have exactly 2 parents
        - Transitive, inverse or symmetric properties: isPartOf is a transitive property, hasPart is the inverse of isPartOf, touches is symmetric
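
The three property characteristics named on this slide can be sketched over plain sets of (subject, object) pairs. This is an illustration of what a reasoner may conclude, not an OWL implementation. The individuals (`spoke`, `wheel`, `car`, `road`) are made up for the example.

```python
def transitive_closure(pairs):
    """If (a, b) and (b, c) hold for a transitive property, so does (a, c)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def inverse(pairs):
    """(a, b) in a property means (b, a) in its inverse property."""
    return {(b, a) for a, b in pairs}


def symmetric_closure(pairs):
    """A symmetric property holds in both directions."""
    return set(pairs) | inverse(pairs)


# Hypothetical asserted facts.
is_part_of_asserted = {("spoke", "wheel"), ("wheel", "car")}
touches_asserted = {("wheel", "road")}

is_part_of = transitive_closure(is_part_of_asserted)  # isPartOf is transitive
has_part = inverse(is_part_of)                        # hasPart is the inverse of isPartOf
touches = symmetric_closure(touches_asserted)         # touches is symmetric

print(sorted(is_part_of))
print(sorted(has_part))
print(sorted(touches))
```
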
  17. Semantic Web and Technology Enhanced Learning
  18. Typical applications
      - Modelling (ontologies) of:
        - learning processes
        - learning content
        - learning output (competences)
        - learning agents (students, teachers)
      - Adding metadata (annotations) according to the models
      - Using the models and the metadata in tools to make decisions
        - Example: personalized, adaptive content and/or problems
  19. Semantic annotation of text
  20. Generalities
      - Goal: extract semantic annotations from free text
      - Natural language is complex and ambiguous
      - Language dependent
      - Domain-dependent applications:
        - News
        - Literature
        - E-mail
        - Transcriptions of spoken dialogues
      - Some useful results can be achieved nowadays
  21. Taxonomy of semantic annotations
      - Content-based annotations
        - Document categorization
        - Named entities, e.g. Named Entity (Washington, location)
      - Ontology-based domain annotations
        - Concept and instance identification, e.g.
          <rdf:Description rdf:about='WST'>
            <rdf:type rdf:resource='State'/>
          </rdf:Description>
          <rdf:Description rdf:about='WDC'>
            <rdf:type rdf:resource='City'/>
          </rdf:Description>
        - Relation extraction, e.g. isGovernor(GaryLocke, WST)
  22. Basic techniques (i)
      - Symbolic NLP
        - Based on the use of lexicons and grammar rules to process text
        - Example: "Barack Obama Elected President"
      - Lexical analysis
        - NP → Barack
        - NP → Obama
        - VBT → Elect
        - VBT → VBT + 'ed'
        - NN → President
      - Parsing
        - S → NP NP* VBT NN
      - Semantic analysis
        - S → NP NP*(X) VBT(Elect) NN(Y) gives hasFunction(X, Y)
        - Result: hasFunction(BarackObama, President)
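
The three stages on this slide can be sketched as one toy pipeline for the example sentence. This is a minimal sketch that hard-codes the slide's lexicon and its single grammar rule. The function names `tag` and `analyse` are hypothetical, and a real symbolic system would use a far larger lexicon and a general parser.

```python
# Lexicon from the slide: word -> category.
LEXICON = {"Barack": "NP", "Obama": "NP", "Elect": "VBT", "President": "NN"}


def tag(word):
    """Lexical analysis: return (category, base form) for one word."""
    if word in LEXICON:
        return LEXICON[word], word
    # Morphological rule from the slide: VBT -> VBT + 'ed'
    if word.endswith("ed") and LEXICON.get(word[:-2]) == "VBT":
        return "VBT", word[:-2]
    raise ValueError(f"unknown word: {word!r}")


def analyse(sentence):
    """Parse with S -> NP NP* VBT NN, then map VBT(Elect) to hasFunction(X, Y)."""
    tagged = [tag(word) for word in sentence.split()]
    categories = [category for category, _ in tagged]

    # Parsing: one or more NP, followed by exactly VBT NN.
    n_np = 0
    while n_np < len(categories) and categories[n_np] == "NP":
        n_np += 1
    if n_np == 0 or categories[n_np:] != ["VBT", "NN"]:
        raise ValueError("sentence does not match S -> NP NP* VBT NN")

    # Semantic analysis: X is the noun-phrase sequence, Y is the final noun.
    x = "".join(base for _, base in tagged[:n_np])
    verb = tagged[n_np][1]
    y = tagged[n_np + 1][1]
    if verb != "Elect":
        raise ValueError(f"no semantic rule for verb {verb!r}")
    return f"hasFunction({x}, {y})"


print(analyse("Barack Obama Elected President"))
# prints: hasFunction(BarackObama, President)
```
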
  23. Basic techniques (ii)
      - Statistical NLP
        - Based on counting: finding frequent patterns that make the occurrence of a certain text feature likely
        - Use of extensive corpora
      - Example:
        - "Washington" appearing in the same document as "Hollywood" is likely to represent (Denzel Washington, actor), while "Washington" appearing in the same document as "Obama" is likely to represent (Washington D.C., American capital)
        - We can count the frequency of the different meanings of "Washington" in different contexts
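
The counting idea can be sketched with a handful of labelled documents. The corpus below is invented purely for illustration; a real system would count over an extensive corpus. The names `count_senses` and `most_likely_sense` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labelled documents: (context words, meaning of "Washington").
LABELLED_DOCS = [
    ({"Hollywood", "film"}, "Denzel Washington"),
    ({"Hollywood", "Oscar"}, "Denzel Washington"),
    ({"Obama", "Congress"}, "Washington D.C."),
    ({"Obama", "Hollywood"}, "Washington D.C."),
    ({"Obama", "White House"}, "Washington D.C."),
]


def count_senses(docs):
    """For each context word, count how often each meaning occurs with it."""
    counts = defaultdict(Counter)
    for context, sense in docs:
        for word in context:
            counts[word][sense] += 1
    return counts


def most_likely_sense(counts, context_word):
    """Return the most frequent meaning seen with this context word, or None."""
    seen = counts.get(context_word)
    if not seen:
        return None
    return seen.most_common(1)[0][0]


counts = count_senses(LABELLED_DOCS)
print(most_likely_sense(counts, "Hollywood"))  # prints: Denzel Washington
print(most_likely_sense(counts, "Obama"))      # prints: Washington D.C.
```
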
  24. An approach to named entity disambiguation with Wikipedia
  25. Introduction
      - Entity: text + type
      - Instance: a particular person, location (GPE), organization, ...
  26. Strategy I
  27. Strategy II
      - Approach:
        - Find entities in the document
        - For each entity, identify candidate instances that are compatible with the entity name
        - Assign a ranking value to each candidate instance: 0 ≤ r ≤ 1
        - Greater ranking values indicate greater likelihood of occurrence
  28. Strategy III
      - Semantic coherence (in terms of ranking)
      - "An instance would have a high ranking value if the instances that typically co-occur with it also have high ranking values"
  29. Strategy IV
      - We can add a vector E that accounts for other context information
      - Equation similar to Google PageRank
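
The slide's equation is not reproduced in this transcript, so the following is only a generic PageRank-style power iteration that shows how a coherence matrix and a context vector E could interact. It is an assumption, not the author's formula: the update r = (1 - d)·E + d·A·r, the damping value `d = 0.85`, the normalisation step, and the toy matrix are all illustrative choices.

```python
def rank(A, E, damping=0.85, iterations=50):
    """Generic PageRank-style iteration (illustrative, not the slide's equation).

    A[i][j] is how strongly candidate j supports candidate i.
    E[i] is the prior weight of candidate i from other context information.
    Returns ranking values in [0, 1] that sum to 1.
    """
    n = len(E)
    total = sum(E)
    prior = [weight / total for weight in E]
    r = prior[:]
    for _ in range(iterations):
        r = [
            (1 - damping) * prior[i]
            + damping * sum(A[i][j] * r[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(r)
        r = [value / norm for value in r]
    return r


# Hypothetical candidates for a document that mentions "Athens" and "Greece".
candidates = ["Athens (Greece)", "Athens (Georgia)", "Greece (country)"]
A = [
    [0.0, 0.0, 1.0],  # Athens (Greece) is supported by Greece (country)
    [0.0, 0.0, 0.0],  # Athens (Georgia) has no supporting candidate here
    [1.0, 0.0, 0.0],  # Greece (country) is supported by Athens (Greece)
]
E = [0.5, 0.5, 1.0]   # assumed prior weights

ranking = rank(A, E)
for name, value in zip(candidates, ranking):
    print(f"{name}: {value:.3f}")
```

With these made-up inputs the two mutually supporting candidates reinforce each other, so "Athens (Greece)" ends up ranked above "Athens (Georgia)" even though both start with the same prior.
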
  30. Instance finder & filter
      - Alternative instance names are extracted by processing a Wikipedia dump
        - Page titles, redirects, disambiguation pages, anchors
        - Indexed by Lucene
      - Candidate instances are obtained by querying Lucene
      - Candidate instances are weighted by combining Lucene scores and PageRank values
      - Filtering limits the maximum number of candidates
  31. Instance ranker
      - AL: based on direct links
      - E: candidate instance weights passed by the instance filter
      - AC: based on instance co-occurrence in Wikipedia pages
  32. Results I
  33. Results II
  34. Conclusions
      - Approach based on instance co-occurrence
      - Text from Wikipedia restricted to: titles, anchors
      - Results considered promising
      - Should improve for GPE
  35. Thank you!
      - Questions?
  36. Future work
      - Differentiate according to entity type
      - Improve the selection of candidate instances
        - Responsible by itself for errors in 12.7% of non-nil queries
  37. Instance ranker
      - AC: based on instance co-occurrence in Wikipedia pages
      - AL: based on direct links
      - E: candidate instance weights passed by the instance filter
  38. AC computation
      - $\alpha^C_{ij} \approx P(I_i \mid I_j)$
      - Based on counting co-occurrences of $I_i$ and $I_j$ in Wikipedia pages
      - Ex.: P(Pau Gasol | Lakers) = (number of pages where both Pau Gasol and Los Angeles Lakers are mentioned) / (total number of pages where Los Angeles Lakers are mentioned)
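
The ratio in the example can be computed directly from a mapping of pages to the instances they mention. This is a minimal sketch: the page identifiers and their contents are invented, and `cooccurrence_probability` is a hypothetical helper name.

```python
def cooccurrence_probability(pages, instance_i, instance_j):
    """Estimate P(I_i | I_j) as (#pages with both) / (#pages with I_j)."""
    pages_with_j = [mentions for mentions in pages.values() if instance_j in mentions]
    if not pages_with_j:
        return 0.0  # I_j is never mentioned, so there is no evidence
    pages_with_both = sum(1 for mentions in pages_with_j if instance_i in mentions)
    return pages_with_both / len(pages_with_j)


# Hypothetical pages and the instances each one mentions.
PAGES = {
    "page_a": {"Pau Gasol", "Los Angeles Lakers"},
    "page_b": {"Los Angeles Lakers", "Kobe Bryant"},
    "page_c": {"Pau Gasol", "Memphis Grizzlies"},
    "page_d": {"Los Angeles Lakers", "Pau Gasol", "Kobe Bryant"},
    "page_e": {"Los Angeles Lakers"},
}

p_gasol_given_lakers = cooccurrence_probability(PAGES, "Pau Gasol", "Los Angeles Lakers")
p_lakers_given_gasol = cooccurrence_probability(PAGES, "Los Angeles Lakers", "Pau Gasol")
print(p_gasol_given_lakers)  # 2 of the 4 Lakers pages mention Pau Gasol
print(p_lakers_given_gasol)  # 2 of the 3 Pau Gasol pages mention the Lakers
```

The estimate is not symmetric: conditioning on the Lakers gives a different value than conditioning on Pau Gasol, which is why the matrix entry is indexed by the ordered pair (i, j).
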
  39. AL computation
      - Based on direct links
        - Ex.: the Wikipedia page of Pau Gasol links to the Wikipedia page of Los Angeles Lakers
      - Initial idea
        - If $I_j$ links many times to $I_i$ and $I_j$ is likely to occur (it has a high ranking), then $I_i$ is also likely to occur
      - The Lucene score is used to compute $\alpha^L_{ij}$
  40. Instance filter
      - Target entity: 200 or 30
      - Other entities: 15
  41. Disambiguation Strategy I
      - Semantic coherence principle
        - "An instance typically co-occurs with other related instances"
        - Ex.: (Pau Gasol, Los Angeles Lakers); (Athens, Georgia); (Athens, Greece); (Hillary Clinton, Barack Obama)
      - Requires disambiguating all document entities
      - Slightly different from entity co-occurrence approaches
  42. Results IV: ablation tests
