Towards a solution to extract knowledge from the social web (“metadata first, ontologies second”)
  Project: Collaborative Ontology Building System (CollOnBus), INTEK Nets 2005-2007
  Aitor Almeida, Borja Sotomayor, Joseba Abaitua, Diego López de Ipiña
Social web: source of knowledge
  Crowds share and tag resources of different types: pictures, music, posts, video clips, slides, books, bookmarks, etc.
  Social tagging (or crowd-tagging) is a very effective and economical way of generating knowledge
  Crowdsourcing: “the trend of leveraging the mass collaboration enabled by Web2.0 technologies to achieve business goals.” <http://en.wikipedia.org/wiki/Crowdsourcing>
Related work (since 2006)
  Mapping tags to ontologies
    Schmitz 2006. Inducing Ontology from Flickr tags. WWW’2006: Collaborative Web Tagging workshop
    Abbasi et al. 2007. Organizing Resources on Tagging Systems using T-ORG. ESWC2007 SemNet workshop
  Identifying semantic relations
    Specia, Motta. 2007. Integrating Folksonomies with the Semantic Web. ESWC2007
  Transforming folksonomies into formal representations
    Marlow et al. 2006. Tagging, Taxonomy, Flickr, Article, ToRead. WWW’2006: Collaborative Web Tagging workshop
    Hotho et al. 2006. Trend Detection in Folksonomies. Semantics And Digital Media Technology SAMT2006
    Maala et al. A Conversion Process From Flickr Tags to RDF Descriptions. BIS2007 workshop
Which knowledge representation model?
  Extracting knowledge from data-sharing Web 2.0 sites, but into which formal representation?
    Semantic networks
    Lexical networks (WordNet)
    Taxonomies (e.g. categories from Wikipedia), thesauri
    Metadata: “mapping to Dublin Core is a weak choice”
    Ontologies: “metadata first, ontologies second”
Crowds tagging   pictures
Crowds tagging pictures
  Aitor Almeida, Borja Sotomayor, Diego López de Ipiña
Crowds tagging   pictures
Crowds tagging   posts
Crowds tagging   slides
Crowds tagging   books
Crowds tagging   URL
Crowd-sharing of tags
  Flickr, del.icio.us... group tags by social sharing (or “co-usage”)
  but the semantic information that socially shared tags acquire is poorly exploited
Mapping folksonomies into tag clusters
  RawSugar <http://rawsugar.com/> allows users to assign hierarchies to their tags, improving the navigation and searching of folksonomies
  Non-expert users will find it easier to tag resources without any restrictions
Tag clustering
  Tag clustering is the main technique used to enrich social tagging
  but semantic relations are not detected
Beyond tag clusters?
Should we map   them into  ontologies?
Better to map them into metadata first
Metadata vs ontologies
  Why are metadata structures better than ontologies (for resource classification and categorisation)?
  Let’s reflect on the different knowledge representations and on who uses them:
    Folksonomies (crowds)
    Taxonomies, ontologies (knowledge engineers, AI/SW practitioners)
    Metadata structures (librarians, archivists, documentalists)
What are metadata?
Tags vs metadata?
Metadata vs ontologies
  Why are metadata structures better?
  Because metadata provide a wide and complete range of facets for representing knowledge about an entity or resource
  Each facet (or data type) could be part of one or several ontological structures
  Facet: “any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)”
  “A faceted classification system allows the assignment of multiple classifications to an object, enabling the classifications to be ordered in multiple ways, rather than in a single, pre-determined, taxonomic order” (Wikipedia).
Better to map folksonomies into metadata structures first
Dublin Core Metadata Initiative
  <http://jodi.tamu.edu/Articles/v02/i02/Greenberg/metadataform.gif>
Dublin Core  Metadata Initiative
Dublin Core Metadata Initiative
Our mapping tool: folk2onto (or folk2meta?)
  Designed by Borja Sotomayor
folk2onto: Tag Distiller
  The Tag Distiller:
    Downloads tags from Web 2.0 sites
    Matches each tag against WordNet (taking into account the tag’s context/cloud)
    Filters out synonyms
    Keeps the list of remaining tags
    Generates an XML file
  Implemented by Aitor Almeida
Tag clouds from del.icio.us
  Requests http://del.icio.us/url/check?url=site
  Looks for <title> and gets its content: the hash
  Gets the RSS feed at http://del.icio.us/rss/url/ + hash
  Then the tag clouds are downloaded from the <rdf:li resource="http://del.icio.us/tag/"> elements
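The last step above can be sketched in Python as follows. This is only an illustration, not the Tag Distiller's code: the feed fragment in `SAMPLE_RSS` and the function names are hypothetical, and no network request is made. Only the URL prefixes come from the slide.

```python
import re

# Prefix of tag resources, as given on the slide.
TAG_PREFIX = "http://del.icio.us/tag/"


def rss_url_for(url_hash):
    """Build the RSS address from the hash found in <title> (pattern from the slide)."""
    return "http://del.icio.us/rss/url/" + url_hash


def extract_tags(rss_text):
    """Return the tag names found in rdf:li resource attributes."""
    pattern = re.compile(r'<rdf:li\s+resource="' + re.escape(TAG_PREFIX) + r'([^"]+)"')
    return pattern.findall(rss_text)


# Hypothetical feed fragment, written only to exercise the function.
SAMPLE_RSS = """
<item>
  <rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/postgresql" />
    <rdf:li resource="http://del.icio.us/tag/database" />
  </rdf:Bag>
</item>
"""

print(extract_tags(SAMPLE_RSS))
```

A regular expression is used here because the fragment is not a complete, namespace-declared RDF document; a full feed would be better handled with an XML parser.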
Tag clouds from Technorati
  Technorati: blog aggregator
  We can get tag clouds from Technorati through:
  http://api.technorati.com/blogposttags?key=[apikey]&url=[blog URL]
Tag clouds from Technorati
  <?xml version="1.0" encoding="utf-8"?>
  <!-- generator="Technorati API version 1.0 /blogposttags" -->
  <!DOCTYPE tapi PUBLIC "-//Technorati, Inc.//DTD TAPI 0.02//EN" "http://api.technorati.com/dtd/tapi-002.xml">
  <tapi version="1.0">
    <document>
      <result>
        <querycount>13</querycount>
      </result>
      <item>
        <tag>christmas cookie recipes</tag>
        <posts>274</posts>
      </item>
      …
Tagged URL at Technorati
  All <tag> elements are downloaded
  To get the “title”: http://api.technorati.com/bloginfo?key=[apikey]&url=[blog url]
  and the <name> element is retrieved
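A minimal sketch of how such a response could be read, assuming the `<item>`/`<tag>`/`<posts>` layout shown on the previous slide. The function names are hypothetical, the second `<item>` in the sample is invented for illustration, and the XML declaration and DOCTYPE are left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def blogposttags_url(api_key, blog_url):
    """Build the request address using the pattern from the slide."""
    return "http://api.technorati.com/blogposttags?" + urlencode(
        {"key": api_key, "url": blog_url}
    )


def parse_blogposttags(xml_text):
    """Return (tag, posts) pairs for every <item> in the response."""
    root = ET.fromstring(xml_text)
    cloud = []
    for item in root.iter("item"):
        tag = item.findtext("tag")
        posts = item.findtext("posts")
        if tag is not None:
            cloud.append((tag, int(posts) if posts else 0))
    return cloud


# Sample response: the first item is from the slide, the second is hypothetical.
SAMPLE_RESPONSE = """<tapi version="1.0">
  <document>
    <result><querycount>2</querycount></result>
    <item><tag>christmas cookie recipes</tag><posts>274</posts></item>
    <item><tag>baking</tag><posts>120</posts></item>
  </document>
</tapi>"""

print(parse_blogposttags(SAMPLE_RESPONSE))
```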
Semantic relations in WordNet
  WordNet relations for tag ‘Spanish’:
Tag filtering algorithm
  Tags are filtered by means of WordNet
  If a tag has only one meaning (synset), that meaning is assigned
  If it has more than one, then:
    T: the resource’s tag set
    Related(a,b): gives 1 if a and b have some type of relation (hypernym, hyponym, holonym, meronym)
    w: weights
  Several iterations are made until a meaning is found (10 iterations max.)
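The selection formula itself is not reproduced in this transcript, so the following is only a sketch under stated assumptions, not the authors' algorithm. It assumes the score of a candidate synset is a weighted sum of three components (familiarity, context and trained synsets, with the weight names `frec`, `wordnet` and `trained` borrowed from the experiments slide), and that context is the share of already assigned synsets for which Related(a,b) = 1. The sense identifiers and `RELATED_PAIRS` are hypothetical toy data standing in for WordNet.

```python
# Hypothetical toy relation data standing in for WordNet.
RELATED_PAIRS = {
    frozenset(("jaguar.cat", "lion.cat")),
    frozenset(("jaguar.cat", "zoo.place")),
    frozenset(("lion.cat", "zoo.place")),
}


def related(a, b):
    """Related(a,b): 1 if the two senses are related, else 0."""
    return 1 if frozenset((a, b)) in RELATED_PAIRS else 0


def score(sense, context_senses, familiarity, trained, weights):
    """Assumed weighted sum of familiarity, context and trained components."""
    context = 0.0
    if context_senses:
        context = sum(related(sense, c) for c in context_senses) / len(context_senses)
    return (weights["frec"] * familiarity.get(sense, 0.0)
            + weights["wordnet"] * context
            + weights["trained"] * trained.get(sense, 0.0))


def disambiguate(tags, candidates, weights, familiarity=None, trained=None,
                 max_iterations=10):
    """Assign one sense per tag; tags without candidates stay unassigned."""
    familiarity = familiarity or {}
    trained = trained or {}
    # Tags with exactly one synset are assigned immediately.
    assigned = {t: candidates[t][0] for t in tags if len(candidates.get(t, [])) == 1}
    for _ in range(max_iterations):
        pending = [t for t in tags
                   if t not in assigned and len(candidates.get(t, [])) > 1]
        progress = False
        for tag in pending:
            context = list(assigned.values())
            scored = [(score(s, context, familiarity, trained, weights), s)
                      for s in candidates[tag]]
            best_score, best_sense = max(scored)
            if best_score > 0:  # accept only senses with some supporting evidence
                assigned[tag] = best_sense
                progress = True
        if not pending or not progress:
            break
    return assigned


CANDIDATES = {
    "jaguar": ["jaguar.car", "jaguar.cat"],
    "lion": ["lion.cat"],
    "zoo": ["zoo.place"],
}
WEIGHTS = {"frec": 0.0, "wordnet": 0.4, "trained": 0.6}
print(disambiguate(["jaguar", "lion", "zoo", "group141"], CANDIDATES, WEIGHTS))
```

In this toy run the unambiguous tags fix the context first, and the ambiguous tag is then resolved against them; a tag with no candidate sense at all is simply left out of the result.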
Tag filtering algorithm
  Once senses have been discarded, synonyms are also filtered out
  Words are then grouped into senses using WordNet’s relation network
  The output is exported to:
    an XML file with senses
    an XML file with the tags that were discarded
    an RDF file containing WordNet’s relation network
Tag XML file
  <?xml version="1.0" encoding="UTF-8"?>
  <resource>
    <tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</tittle>
    <type>Text</type>
    <format>text/html</format>
    <identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</identifier>
    <tags>
      <tag>
        <lemma>tune</lemma>
        <idlex>236726</idlex>
      </tag>
      <tag>
        <lemma>bd</lemma>
        <idlex>5604473</idlex>
      </tag>
Tag file without senses
  <resource>
    <tittle>Wired News: The Virus That Ate DHS</tittle>
    <type>Text</type>
    <format>text/html</format>
    <identifier>www.wired.com/news/technology/0,72051-0.html?tw=rss.index</identifier>
    <tags>
      <tag>bit200f06</tag>
      <tag>group141</tag>
      <tag>dhs</tag>
      <tag>group35</tag>
      <tag>malware</tag>
      <tag>group91</tag>
      <tag>group17</tag>
      <tag>group53</tag>
      <tag>computer_security</tag>
    </tags>
  </resource>
WordNet’s sense sets
  Words are grouped into sense sets
  If Related(a,b) = 1, the words are grouped in the same set
  The relation depth has to be less than or equal to 3
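The grouping rule can be illustrated with a small sketch. It assumes that "relation depth" means the number of relation links on the shortest path between two words, and it uses a hypothetical hand-written graph (`RELATION_GRAPH`) in place of WordNet. Groups are merged transitively, which is a design choice of this sketch rather than something the slide states.

```python
from collections import deque

# Hypothetical relation network (undirected adjacency lists).
RELATION_GRAPH = {
    "jaguar": ["big_cat"],
    "big_cat": ["jaguar", "lion", "feline"],
    "lion": ["big_cat"],
    "feline": ["big_cat", "carnivore"],
    "carnivore": ["feline", "mammal"],
    "mammal": ["carnivore", "whale"],
    "whale": ["mammal"],
    "database": ["information"],
    "information": ["database"],
}


def related(a, b, max_depth=3):
    """Return 1 if b is reachable from a in at most max_depth links, else 0."""
    if a == b:
        return 1
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # neighbours of this node would exceed the depth limit
        for neighbour in RELATION_GRAPH.get(node, []):
            if neighbour == b:
                return 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return 0


def group_into_sense_sets(words, max_depth=3):
    """Put words in the same set whenever related() gives 1 for some member."""
    groups = []
    for word in words:
        matching = [g for g in groups if any(related(word, m, max_depth) for m in g)]
        merged = {word}
        for group in matching:
            merged |= group
            groups.remove(group)
        groups.append(merged)
    return groups


print(group_into_sense_sets(["jaguar", "lion", "whale", "database"]))
```

With this toy graph, "jaguar" and "lion" are two links apart and share a set, while "whale" is five links from "jaguar" and stays on its own.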
folk2onto:  Tag Trainer
folk2onto: Map Trainer
folk2onto: Tag Mapper
  The Mapper makes tag-element associations
  These associations are made according to the senses assigned by the Distiller
  The mapping targets Dublin Core metadata records
folk2onto: Dublin Core
  The Distiller gets 4 elements from the tag source (del.icio.us, Technorati, etc.):
    Title: the URL’s title, from the <title> XML tag
    Type: content type, depending on the source (here both are “Text”)
    Format: MIME type, depending on the source (here both are text/html)
    Identifier: the resource’s URL
folk2onto: Dublin Core
  The Tag Mapper deals with:
    Subject: the “topic”
    Language: en, es, fr, de, ru...
    Coverage: when, where (about the topic)
    Rights: type of licence
folk2onto: mapping formulae
  When a tag has one mapping, that tag is used
  If it has more than one:
  If it has no mapping, then:
folk2onto: file mapping
  <rdf:RDF xmlns:j.0="http://purl.org/dc/elements/1.1" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:nodeID="A0">
      <rdf:type rdf:resource="http://purl.org/dc/elements/1.1identifier"/>
      <j.0:identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</j.0:identifier>
      <j.0:type>Text</j.0:type>
      <j.0:format>text/html</j.0:format>
      <j.0:tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</j.0:tittle>
      <j.0:subject>database</j.0:subject>
      <j.0:subject>performance</j.0:subject>
      <j.0:subject>bd</j.0:subject>
    </rdf:Description>
  </rdf:RDF>
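For comparison, a record of this shape could be produced with Python's standard library roughly as follows. This is a sketch, not the Mapper's code, and `build_dc_record` is a hypothetical name. It deliberately differs from the output above in two details: it uses the standard Dublin Core element namespace, which ends in a slash (`http://purl.org/dc/elements/1.1/`), and the standard element name `title`.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # note the trailing slash


def build_dc_record(identifier, title, dc_type, dc_format, subjects):
    """Serialise one resource as an RDF/XML description with Dublin Core elements."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(root, f"{{{RDF_NS}}}Description")
    fields = (("identifier", identifier), ("title", title),
              ("type", dc_type), ("format", dc_format))
    for name, value in fields:
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    # One dc:subject element per mapped tag.
    for subject in subjects:
        ET.SubElement(description, f"{{{DC_NS}}}subject").text = subject
    return ET.tostring(root, encoding="unicode")


record = build_dc_record(
    "www.postgresql.org/docs/faqs.FAQ_brazilian.html",
    "PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL",
    "Text",
    "text/html",
    ["database", "performance", "bd"],
)
print(record)
```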
Mapping trainer
folk2onto: 6 tests (A-F)
  Experiment A: random synsets are selected for the tags.
  Experiment B: no limit on the semantic relation depth; only the trained synsets are taken into account (frec=0, wordnet=0, trained=1).
  Experiment C: no limit on the semantic relation depth; only the context is taken into account (frec=0, wordnet=1, trained=0).
  Experiment D: no limit on the semantic relation depth; the context and the trained synsets are taken into account (frec=0, wordnet=0.4, trained=0.6).
  Experiment E: no limit on the semantic relation depth; all three components of the equation (familiarity, context and trained synsets) are taken into account (frec=0.1, wordnet=0.3, trained=0.6).
  Experiment F: the semantic relation depth is limited to 3; the context and the trained synsets are taken into account (frec=0, wordnet=0.4, trained=0.6).
folk2onto: tests output
  Experiment   Correct synsets   Erroneous synsets
  A            706 (32.5%)       1466 (67.5%)
  B            1594 (73.4%)      578 (26.6%)
  C            1199 (55.2%)      973 (44.8%)
  D            1492 (68.7%)      680 (31.3%)
  E            1349 (62.1%)      823 (37.9%)
  F            1894 (87.2%)      278 (12.8%)
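The percentages can be rechecked from the raw counts with a few lines of Python. The counts are those reported on the slide; the variable and function names are only illustrative.

```python
# (correct synsets, erroneous synsets) per experiment, as reported on the slide.
RESULTS = {
    "A": (706, 1466),
    "B": (1594, 578),
    "C": (1199, 973),
    "D": (1492, 680),
    "E": (1349, 823),
    "F": (1894, 278),
}


def accuracy(correct, erroneous):
    """Percentage of tags that received the correct synset."""
    return 100.0 * correct / (correct + erroneous)


for name, (correct, erroneous) in RESULTS.items():
    total = correct + erroneous
    print(f"{name}: {accuracy(correct, erroneous):.1f}% correct of {total} tags")

best = max(RESULTS, key=lambda name: accuracy(*RESULTS[name]))
print("Best configuration:", best)
```

Every experiment covers the same 2172 tags, so the rows are directly comparable.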
folk2onto:  tests output
Open issues
  Tag filtering through WordNet
    blog, wiki
    xml, rdf, rss
    wordpress, tuenti, flickr
    social, open
  “tags can be about so many things, mapping to Dublin Core is a weak choice”
  Mappings
    Coverage: Japan
    Language: Spanish
  Learning the right synset of e.g. "jaguar": "vehicle", "video game console", or "cat of prey"
    "<dc:subject>Jaguar</dc:subject>"
  Word-sense disambiguation
  Tag-category disambiguation
That was all about CollOnBus/folk2onto
  Thank you very much! Any questions?
