Metadata first, ontologies second

Published in: Business, Education

  1. 1. Towards a solution to extract knowledge from the social web (“metadata first, ontologies second”) Project Collaborative Ontology Building System (CollOnBus) INTEK Nets 2005-2007 Aitor Almeida, Borja Sotomayor, Joseba Abaitua, Diego López de Ipiña
  2. 2. Social web: source of knowledge <ul><li>Crowds share and tag resources of different types: </li></ul><ul><ul><li>pictures, music, posts, videoclips, slides, books, bookmarks, etc. </li></ul></ul><ul><li>Social tagging (or crowd- tagging ) is a very effective and economic way of generating knowledge </li></ul><ul><ul><ul><li>Crowdsourcing “the trend of leveraging the mass collaboration enabled by Web2.0 technologies to achieve business goals. ” </li></ul></ul></ul><ul><ul><ul><ul><li><http://en.wikipedia.org/wiki/Crowdsourcing> </li></ul></ul></ul></ul>
  3. 3. Related work (since 2006) <ul><li>mapping tags to ontologies </li></ul><ul><li>Schmitz 2006. Inducing Ontology from Flickr tags. WWW’2006: Collaborative Web Tagging workshop </li></ul><ul><li>Abbasi et al. 2007. Organizing Resources on Tagging Systems using T-ORG. ESWC2007 SemNet workshop </li></ul><ul><li>identifying semantic relations </li></ul><ul><li>Specia, Motta. 2007. Integrating Folksonomies with the Semantic Web. ESWC2007 </li></ul><ul><li>transforming folksonomies into formal representations </li></ul><ul><li>Marlow et al. 2006. Tagging, Taxonomy, Flickr, Article, ToRead. WWW’2006: Collaborative Web Tagging workshop </li></ul><ul><li>Hotho et al. 2006. Trend Detection in Folksonomies. Semantics And Digital Media Technology SAMT2006 </li></ul><ul><li>Maala et al. A Conversion Process From Flickr Tags to RDF Descriptions. BIS2007 workshop </li></ul>
  4. 4. Which knowledge representation model? <ul><li>Extracting knowledge from data sharing Web 2.0 sites, but into which formal representation? </li></ul><ul><li>Semantic Networks </li></ul><ul><ul><li>Lexical networks (WordNet) </li></ul></ul><ul><li>Taxonomies </li></ul><ul><ul><li>e.g. categories from Wikipedia, Thesauri </li></ul></ul><ul><li>Metadata </li></ul><ul><ul><li>“mapping to Dublin Core is a weak choice” </li></ul></ul><ul><li>Ontologies </li></ul><ul><li>“metadata first, ontologies second” </li></ul>
  5. 5. Crowds tagging pictures
  6. 6. Crowds tagging pictures Aitor Almeida Borja Sotomayor Diego López de Ipiña
  7. 7. Crowds tagging pictures
  8. 8. Crowds tagging posts
  9. 9. Crowds tagging slides
  10. 10. Crowds tagging books
  11. 11. Crowds tagging URL
  12. 12. Crowd-sharing of tags <ul><li>Flickr, del.icio.us... group tags by social sharing (or “co-usage”) </li></ul><ul><ul><li>but the semantic information that socially shared tags acquire is poorly exploited </li></ul></ul>
  13. 13. Mapping folksonomies into tag clusters <ul><li>RawSugar <http://rawsugar.com/> </li></ul><ul><ul><li>allows users to assign hierarchies to their tags, improving the navigation and searching of folksonomies </li></ul></ul><ul><ul><li>non-expert users will find it easier to tag resources without any restrictions </li></ul></ul>
  14. 14. Tag clustering <ul><li>TAG clustering is the main technique used to improve the wealth of social tagging </li></ul><ul><ul><li>but semantic relations are not detected </li></ul></ul>
  15. 15. Beyond tag clusters?
  16. 16. Should we map them into ontologies?
  17. 17. Better mapping 1st into metadata
  18. 18. Metadata vs ontologies <ul><li>Why are metadata structures better than ontologies (for resource classification and categorisation)? </li></ul><ul><li>Let’s reflect on the different knowledge representations and on who uses them: </li></ul><ul><ul><li>Folksonomies (crowds) </li></ul></ul><ul><ul><li>Taxonomies, ontologies (knowledge engineers, AI/SW practitioners) </li></ul></ul><ul><ul><li>Metadata structures (librarians, archivists, documentalists) </li></ul></ul>
  19. 19. What are metadata?
  20. 20. TAG vs metadata ?
  21. 21. Metadata vs ontologies <ul><li>Why are metadata structures better? </li></ul><ul><ul><li>Because metadata provide a wide and complete range of facets for representing knowledge about an entity or resource </li></ul></ul><ul><ul><li>Each facet (or data type) could be part of one or several ontological structures </li></ul></ul><ul><ul><li>Facet: “any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)” </li></ul></ul><ul><ul><li>“A faceted classification system allows the assignment of multiple classifications to an object, enabling the classifications to be ordered in multiple ways, rather than in a single, pre-determined, taxonomic order” (Wikipedia). </li></ul></ul>
  22. 22. Better mapping 1st folksonomies into metadata structures
  23. 23. Dublin Core Metadata Initiative http://jodi.tamu.edu/Articles/v02/i02/Greenberg/metadataform.gif
  24. 24. Dublin Core Metadata Initiative
  25. 25. Dublin Core Metadata Initiative
  26. 26. Our mapping tool: folk2onto (? folk2meta) designed by Borja Sotomayor
  27. 27. folk2onto: Tag Distiller <ul><li>Tag Distiller : </li></ul><ul><ul><li>Downloads tags from Web 2.0 sites </li></ul></ul><ul><ul><li>Matches each tag against WordNet (taking into account the tag’s context/cloud) </li></ul></ul><ul><ul><li>Filters out synonyms </li></ul></ul><ul><ul><li>Keeps the list of remaining tags </li></ul></ul><ul><ul><li>Generates an XML file </li></ul></ul><ul><ul><ul><li>Implemented by Aitor Almeida </li></ul></ul></ul>
  28. 28. TAG clouds from del.icio.us <ul><li>http://del.icio.us/url/check?url=site </li></ul><ul><li>Looks for <title> and gets its content: the hash </li></ul><ul><li>Gets the RSS in </li></ul><ul><ul><ul><li>http://del.icio.us/rss/url/ + hash </li></ul></ul></ul><ul><li>Then tag-clouds are downloaded from </li></ul><ul><ul><li>< rdf:li resource=&quot;http://del.icio.us/tag/&quot; > </li></ul></ul>
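The del.icio.us retrieval steps on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not the Tag Distiller's actual code: the function names and the sample feed are hypothetical, the feed layout is assumed from the `rdf:li` fragment shown on the slide, and the network calls themselves are left out because the endpoints date from the period of the project.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TAG_PREFIX = "http://del.icio.us/tag/"


def extract_hash(check_page_html):
    """Return the content of <title>, which the slide says holds the URL hash."""
    match = re.search(r"<title>(.*?)</title>", check_page_html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def rss_url_for(url_hash):
    """Build the RSS address for a hash: http://del.icio.us/rss/url/ + hash."""
    return "http://del.icio.us/rss/url/" + url_hash


def tags_from_rss(rss_xml):
    """Collect tag names from rdf:li elements whose resource points at /tag/."""
    root = ET.fromstring(rss_xml)
    tags = []
    for li in root.iter("{%s}li" % RDF_NS):
        # The attribute may or may not carry the rdf: prefix (assumption).
        resource = li.get("resource") or li.get("{%s}resource" % RDF_NS)
        if resource and resource.startswith(TAG_PREFIX):
            tag = resource[len(TAG_PREFIX):]
            if tag:
                tags.append(tag)
    return tags


# Hypothetical feed fragment, shaped after the rdf:li line on the slide.
SAMPLE_RSS = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/database"/>
    <rdf:li resource="http://del.icio.us/tag/performance"/>
  </rdf:Bag>
</rdf:RDF>"""

print(tags_from_rss(SAMPLE_RSS))
```

Keeping the parsing separate from the download step makes it possible to test the tag extraction on saved feeds.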
  29. 29. TAG clouds from Technorati <ul><li>Technorati: blog aggregator </li></ul><ul><ul><li>We can get tag clouds from Technorati through: http://api.technorati.com/blogposttags?key=[apikey]&url=[blog URL] </li></ul></ul>
  30. 30. TAG clouds from Technorati <ul><ul><li><?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?> </li></ul></ul><ul><ul><li><!-- generator=&quot;Technorati API version 1.0 /blogposttags&quot; --> </li></ul></ul><ul><ul><li><!DOCTYPE tapi PUBLIC &quot;-//Technorati, Inc.//DTD TAPI 0.02//EN&quot; &quot;http://api.technorati.com/dtd/tapi-002.xml&quot;> </li></ul></ul><ul><ul><li><tapi version=&quot;1.0&quot;> </li></ul></ul><ul><ul><li><document> </li></ul></ul><ul><ul><li><result> </li></ul></ul><ul><ul><li><querycount>13</querycount> </li></ul></ul><ul><ul><li></result> </li></ul></ul><ul><ul><li><item> </li></ul></ul><ul><ul><li><tag>christmas cookie recipes</tag> </li></ul></ul><ul><ul><li><posts>274</posts> </li></ul></ul><ul><ul><li></item> </li></ul></ul><ul><ul><li>… . </li></ul></ul>
  31. 31. Tagged URL at Technorati <ul><li>All <tag> elements are downloaded </li></ul><ul><li>To get the “title”: http://api.technorati.com/bloginfo?key=[apikey]&url=[blog url] </li></ul><ul><li>And <name> is recovered </li></ul>
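As an illustration of the Technorati steps, the sketch below builds the `blogposttags` request URL and reads every `<tag>` element (with its `<posts>` count) from a response shaped like the one on the previous slide. The function names, the `KEY` placeholder and the trimmed sample response are assumptions made for the example; no request is actually sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def blogposttags_url(api_key, blog_url):
    """Build the blogposttags request URL with properly escaped parameters."""
    query = urlencode({"key": api_key, "url": blog_url})
    return "http://api.technorati.com/blogposttags?" + query


def parse_tags(response_xml):
    """Return (tag, posts) pairs for every <item> in a blogposttags response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for item in root.iter("item"):
        tag = item.findtext("tag")
        posts = item.findtext("posts")
        if tag is None:
            continue
        count = int(posts) if posts and posts.strip().isdigit() else None
        pairs.append((tag.strip(), count))
    return pairs


# Trimmed response modelled on the slide (DOCTYPE and XML declaration omitted).
SAMPLE_RESPONSE = """<tapi version="1.0">
  <document>
    <result><querycount>13</querycount></result>
    <item>
      <tag>christmas cookie recipes</tag>
      <posts>274</posts>
    </item>
  </document>
</tapi>"""

print(blogposttags_url("KEY", "http://example.org/blog"))
print(parse_tags(SAMPLE_RESPONSE))
```

The same parsing approach would apply to the `bloginfo` response, reading `<name>` instead of `<tag>`.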
  32. 32. semantic relations in WordNet <ul><li>WordNet relations for tag ‘Spanish’: </li></ul>
  33. 33. TAG filtering algorithm <ul><li>Tags are filtered out by means of WordNet </li></ul><ul><li>If a TAG has only one meaning (synset) that meaning is assigned </li></ul><ul><li>If it has more than one, then </li></ul><ul><ul><li>T: resources tag set </li></ul></ul><ul><ul><li>Related(a,b): gives 1 if a and b have some type of relation (hypernym, hyponym, holonym, meronym) </li></ul></ul><ul><ul><li>w: weights </li></ul></ul><ul><li>Several iterations are made until a meaning is found (10 iterations max.) </li></ul>
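The scoring formula itself did not survive on this slide, so the following is only a sketch of how such a sense selection could work, under stated assumptions: each candidate synset gets a weighted sum of a familiarity score, a context score (the share of the resource's other tags with a sense that is Related to the candidate) and a trained score, using the three weights (frec, wordnet, trained) named in the experiments. The toy sense inventory and relation set stand in for WordNet, all names are hypothetical, and the iteration loop (10 iterations max.) is not modelled.

```python
def related(a, b, relations):
    """Related(a, b): 1 if the two senses share any relation, else 0."""
    return 1 if (a, b) in relations or (b, a) in relations else 0


def context_score(candidate, tag, tag_set, senses, relations):
    """Share of the other tags in T with at least one sense related to candidate."""
    others = [t for t in tag_set if t != tag]
    if not others:
        return 0.0
    hits = sum(
        1 for t in others
        if any(related(candidate, s, relations) for s in senses.get(t, []))
    )
    return hits / len(others)


def choose_sense(tag, tag_set, senses, relations,
                 familiarity=None, trained=None, weights=(0.0, 0.4, 0.6)):
    """Pick a synset for tag; weights are (frec, wordnet, trained)."""
    familiarity = familiarity or {}
    trained = trained or {}
    candidates = senses.get(tag, [])
    if not candidates:
        return None            # tag unknown to the lexicon: discarded
    if len(candidates) == 1:
        return candidates[0]   # a single meaning is assigned directly
    w_frec, w_wordnet, w_trained = weights

    def score(c):
        return (w_frec * familiarity.get(c, 0.0)
                + w_wordnet * context_score(c, tag, tag_set, senses, relations)
                + w_trained * trained.get(c, 0.0))

    return max(candidates, key=score)


# Toy data (hypothetical sense identifiers, not real WordNet synsets).
SENSES = {"jaguar": ["jaguar#cat", "jaguar#car"],
          "feline": ["feline#1"], "zoo": ["zoo#1"]}
RELATIONS = {("jaguar#cat", "feline#1")}
TAGS = ["jaguar", "feline", "zoo"]

print(choose_sense("jaguar", TAGS, SENSES, RELATIONS, weights=(0, 1, 0)))
```

With context only, the tag "jaguar" next to "feline" resolves to the animal sense; a strong trained score for another sense would override it under the (0, 0.4, 0.6) weighting.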
  34. 34. TAG filtering algorithm <ul><li>Once senses have been discarded, synonyms are also filtered out </li></ul><ul><li>Words then are grouped in senses using WordNet’s relation network </li></ul><ul><li>The output is exported to a: </li></ul><ul><ul><li>XML file with senses </li></ul></ul><ul><ul><li>XML file with tags that were discarded </li></ul></ul><ul><ul><li>RDF containing WordNet’s relation network </li></ul></ul>
  35. 35. TAG XML file <ul><li><?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?> </li></ul><ul><li><resource> </li></ul><ul><li><tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</tittle> </li></ul><ul><li><type>Text</type> </li></ul><ul><li><format>text/html</format> </li></ul><ul><li><identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</identifier> </li></ul><ul><li><tags> </li></ul><ul><li><tag> </li></ul><ul><li><lemma>tune</lemma> </li></ul><ul><li><idlex>236726</idlex> </li></ul><ul><li></tag> </li></ul><ul><li><tag> </li></ul><ul><li><lemma>bd</lemma> </li></ul><ul><li><idlex>5604473</idlex> </li></ul><ul><li></tag> </li></ul>
  36. 36. TAG file without senses <ul><li><resource> </li></ul><ul><li><tittle>Wired News: The Virus That Ate DHS</tittle> </li></ul><ul><li><type>Text</type> </li></ul><ul><li><format>text/html</format> </li></ul><ul><li><identifier>www.wired.com/news/technology/0,72051-0.html?tw=rss.index</identifier> </li></ul><ul><li><tags> </li></ul><ul><li><tag>bit200f06</tag> </li></ul><ul><li><tag>group141</tag> </li></ul><ul><li><tag>dhs</tag> </li></ul><ul><li><tag>group35</tag> </li></ul><ul><li><tag>malware</tag><tag>group91</tag><tag>group17</tag> </li></ul><ul><li><tag>group53</tag> </li></ul><ul><li><tag>computer_security</tag> </li></ul><ul><li></tags> </li></ul><ul><li></resource> </li></ul>
  37. 37. WordNet’s sense sets <ul><li>Words are grouped in sense sets </li></ul><ul><ul><li>If Related(a,b) = 1, then the words are grouped in the same set </li></ul></ul><ul><ul><li>The relation depth must be less than or equal to 3 </li></ul></ul>
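One way to read the grouping rule is as a bounded search over the relation network: two senses count as related when a path of at most 3 relation links connects them, and related senses end up in the same set. The sketch below follows that reading with a breadth-first search over a toy undirected graph; the graph, the helper names and the interpretation of "depth" as path length are assumptions, not the tool's documented behaviour.

```python
from collections import deque


def within_depth(graph, start, goal, max_depth=3):
    """True if goal is reachable from start in at most max_depth relation links."""
    if start == goal:
        return True
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the allowed depth
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return True
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return False


def group_senses(senses, graph, max_depth=3):
    """Group senses so that related senses share one set."""
    groups = []
    for sense in senses:
        linked = [g for g in groups
                  if any(within_depth(graph, sense, member, max_depth)
                         for member in g)]
        merged = {sense}
        for g in linked:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups


def undirected(edges):
    """Build an adjacency map from a list of (a, b) relation pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Toy relation chain a - b - c - d - e (hypothetical sense identifiers).
GRAPH = undirected([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
print(group_senses(["a", "d", "x"], GRAPH))
```

Here "a" and "d" are three links apart and are grouped, while "x" has no relations and stays in a set of its own.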
  38. 38. folk2onto: Tag Trainer
  39. 39. folk2onto: Map Trainer
  40. 40. folk2onto: Tag Mapper <ul><li>The Mapper makes tag-element associations </li></ul><ul><li>These associations are made according to the senses assigned by the Distiller </li></ul><ul><li>The mapping targets Dublin Core metadata records </li></ul>
  41. 41. folk2onto: Dublin Core <ul><li>The Distiller gets 4 elements from the tag source (del.icio.us, Technorati, etc.): </li></ul><ul><ul><li>Title : URL’s title -> from the <title> XML tag </li></ul></ul><ul><ul><li>Type : content type -> depending on the source (here both are “Text”) </li></ul></ul><ul><ul><li>Format : MIME class -> depending on the source (here we have 2 text/html) </li></ul></ul><ul><ul><li>Identifier : we take the resource’s URL </li></ul></ul>
  42. 42. folk2onto: Dublin Core <ul><li>The Tag-Mapper deals with: </li></ul><ul><ul><li>Subject : the “topic”. </li></ul></ul><ul><ul><li>Language : en, es, fr, de, ru... </li></ul></ul><ul><ul><li>Coverage : when, where (about the topic) </li></ul></ul><ul><ul><li>Rights : type of licence </li></ul></ul>
  43. 43. folk2onto: mapping formulae <ul><li>When a TAG has one mapping, that TAG is used </li></ul><ul><li>If it has more than one: </li></ul><ul><li>If it has no mapping, then: </li></ul>
  44. 44. folk2onto: file mapping <ul><li><rdf:RDF </li></ul><ul><li>xmlns:j.0=&quot;http://purl.org/dc/elements/1.1&quot; </li></ul><ul><li>xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot; > </li></ul><ul><li><rdf:Description rdf:nodeID=&quot;A0&quot;> </li></ul><ul><ul><li><rdf:type rdf:resource=&quot;http://purl.org/dc/elements/1.1identifier&quot;/> </li></ul></ul><ul><ul><li><j.0:identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</j.0:identifier> </li></ul></ul><ul><ul><li><j.0:type>Text</j.0:type> </li></ul></ul><ul><ul><li><j.0:format>text/html</j.0:format> </li></ul></ul><ul><ul><li><j.0:tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</j.0:tittle> </li></ul></ul><ul><ul><li><j.0:subject>database</j.0:subject> </li></ul></ul><ul><ul><li><j.0:subject>performance</j.0:subject> </li></ul></ul><ul><ul><li><j.0:subject>bd</j.0:subject> </li></ul></ul><ul><li></rdf:Description> </li></ul><ul><li></rdf:RDF> </li></ul>
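A record like the one above could be produced with Python's standard `xml.etree.ElementTree`, as in the sketch below. It is not the tool's serializer: the function name is hypothetical, and the sketch deliberately uses the standard Dublin Core namespace (with its trailing slash) and the standard `dc:title` element name, so its output differs in those details from the file shown on the slide.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Use readable prefixes instead of auto-generated ones such as ns0.
ET.register_namespace("dc", DC)
ET.register_namespace("rdf", RDF)


def build_record(identifier, title, dc_type, dc_format, subjects):
    """Serialize one resource as an rdf:Description with Dublin Core elements."""
    root = ET.Element("{%s}RDF" % RDF)
    desc = ET.SubElement(root, "{%s}Description" % RDF)
    # The four elements the Distiller obtains from the tag source.
    for name, value in (("identifier", identifier), ("type", dc_type),
                        ("format", dc_format), ("title", title)):
        ET.SubElement(desc, "{%s}%s" % (DC, name)).text = value
    # One dc:subject per tag mapped by the Tag Mapper.
    for subject in subjects:
        ET.SubElement(desc, "{%s}subject" % DC).text = subject
    return ET.tostring(root, encoding="unicode")


record = build_record(
    "www.postgresql.org/docs/faqs.FAQ_brazilian.html",
    "PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL",
    "Text",
    "text/html",
    ["database", "performance", "bd"],
)
print(record)
```

Other mapped elements (Language, Coverage, Rights) could be added in the same way as `dc:subject`.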
  45. 45. Mapping trainer
  46. 46. folk2onto: 6 tests (A-F) <ul><li>Experiment A : Selecting random synsets for the tags. </li></ul><ul><li>Experiment B : Without any limit in the semantic relation depth. Only taking into account the trained synsets (frec=0, wordnet=0, trained=1). </li></ul><ul><li>Experiment C : Without any limit in the semantic relation depth. Only taking into account the context (frec=0, wordnet=1, trained=0). </li></ul><ul><li>Experiment D : Without any limit in the semantic relation depth. Taking the context and the trained synsets into account (frec=0, wordnet=0.4, trained=0.6). </li></ul><ul><li>Experiment E : Without any limit in the semantic relation depth. Taking all three components of the equation (familiarity, context and trained synsets) into account (frec=0.1, wordnet=0.3, trained=0.6). </li></ul><ul><li>Experiment F : Limiting the semantic relation depth to 3 and taking the context and the trained synsets into account (frec=0, wordnet=0.4, trained=0.6). </li></ul>
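  47. 47. folk2onto: tests output <ul><li>Experiment A : 706 correct synsets (32.5%), 1466 erroneous synsets (67.5%) </li></ul><ul><li>Experiment B : 1594 correct synsets (73.4%), 578 erroneous synsets (26.6%) </li></ul><ul><li>Experiment C : 1199 correct synsets (55.2%), 973 erroneous synsets (44.8%) </li></ul><ul><li>Experiment D : 1492 correct synsets (68.7%), 680 erroneous synsets (31.3%) </li></ul><ul><li>Experiment E : 1349 correct synsets (62.1%), 823 erroneous synsets (37.9%) </li></ul><ul><li>Experiment F : 1894 correct synsets (87.2%), 278 erroneous synsets (12.8%) </li></ul>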
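The percentages in the test output follow directly from the counts: every experiment covers the same 2172 synset assignments, and the share of correct ones is correct / (correct + erroneous). The short check below recomputes them from the counts reported on the slide; the dictionary layout and function name are only illustrative.

```python
# (correct, erroneous) synset counts per experiment, as reported on the slide.
RESULTS = {
    "A": (706, 1466),
    "B": (1594, 578),
    "C": (1199, 973),
    "D": (1492, 680),
    "E": (1349, 823),
    "F": (1894, 278),
}


def accuracy(correct, erroneous):
    """Percentage of correct synsets, rounded to one decimal place."""
    return round(100.0 * correct / (correct + erroneous), 1)


for name, (correct, erroneous) in sorted(RESULTS.items()):
    total = correct + erroneous
    print(name, total, accuracy(correct, erroneous))

best = max(RESULTS, key=lambda name: accuracy(*RESULTS[name]))
print("best:", best)
```

On these figures, experiment F (relation depth limited to 3, context and trained synsets combined) gives the highest share of correct synsets.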
  48. 48. folk2onto: tests output
  49. 49. Open issues <ul><li>Tag filtering through WordNet </li></ul><ul><ul><li>blog, wiki </li></ul></ul><ul><ul><li>xml, rdf, rss </li></ul></ul><ul><ul><li>wordpress, tuenti, flickr </li></ul></ul><ul><ul><li>social, open </li></ul></ul><ul><li>“tags can be about so many things </li></ul><ul><ul><li>mapping to Dublin Core is a weak choice” </li></ul></ul><ul><li>Mappings </li></ul><ul><ul><li>Coverage: Japan </li></ul></ul><ul><ul><li>Language: Spanish </li></ul></ul><ul><li>Learning the right synset of e.g. &quot;jaguar&quot; </li></ul><ul><ul><li>&quot;vehicle&quot;, &quot;video game console&quot;, or &quot;cat of prey&quot; </li></ul></ul><ul><ul><li>&quot;<dc:subject>Jaguar</dc:subject>&quot; </li></ul></ul><ul><li>Word-sense disambiguation </li></ul><ul><ul><li>tag-category disambiguation </li></ul></ul>
  50. 50. That was all about CollOnBus/folk2onto <ul><li>Thank you very much! </li></ul><ul><li>Any questions? </li></ul>
