Metadata first, ontologies second

Towards a solution to extract knowledge from the social web (“metadata first, ontologies second”)
Project: Collaborative Ontology Building System (CollOnBus), INTEK Nets 2005-2007
Aitor Almeida, Borja Sotomayor, Joseba Abaitua, Diego López de Ipiña

Social web: source of knowledge
Crowds share and tag resources of different types: pictures, music, posts, video clips, slides, books, bookmarks, etc.
Social tagging (or crowd-tagging) is a very effective and economical way of generating knowledge.
Crowdsourcing: “the trend of leveraging the mass collaboration enabled by Web2.0 technologies to achieve business goals.” <http://en.wikipedia.org/wiki/Crowdsourcing>

Related work (since 2006)
Mapping tags to ontologies:
  Schmitz 2006. Inducing Ontology from Flickr Tags. WWW’2006: Collaborative Web Tagging workshop
  Abbasi et al. 2007. Organizing Resources on Tagging Systems using T-ORG. ESWC2007 SemNet workshop
Identifying semantic relations:
  Specia, Motta. 2007. Integrating Folksonomies with the Semantic Web. ESWC2007
Transforming folksonomies into formal representations:
  Marlow et al. 2006. Tagging, Taxonomy, Flickr, Article, ToRead. WWW’2006: Collaborative Web Tagging workshop
  Hotho et al. 2006. Trend Detection in Folksonomies. Semantics And Digital Media Technology SAMT2006
  Maala et al. A Conversion Process From Flickr Tags to RDF Descriptions. BIS2007 workshop

Which knowledge representation model?
Extracting knowledge from data-sharing Web 2.0 sites, but into which formal representation?
  Semantic networks
  Lexical networks (WordNet)
  Taxonomies, e.g. categories from Wikipedia, thesauri
  Metadata: “mapping to Dublin Core is a weak choice”
  Ontologies: “metadata first, ontologies second”

Crowds tagging pictures Aitor Almeida Borja Sotomayor Diego López de Ipiña

Crowd-sharing of tags
Flickr, del.icio.us, etc. group tags by social sharing (or “co-usage”), but the semantic information that socially shared tags acquire is poorly exploited.

Mapping folksonomies into tag clusters
RawSugar <http://rawsugar.com/> allows users to assign hierarchies to their tags, improving the navigation and searching of folksonomies.
Non-expert users will find it easier to tag resources without any restrictions.

Tag clustering
Tag clustering is the main technique used to exploit the wealth of social tagging, but it does not detect semantic relations.

Should we map them into ontologies?

Better mapping 1st into metadata

Metadata vs ontologies
Why are metadata structures better than ontologies (for resource classification and categorisation)?
Let’s reflect on the different knowledge representations and on who uses them:
  Folksonomies (crowds)
  Taxonomies, ontologies (knowledge engineers, AI/SW practitioners)
  Metadata structures (librarians, archivists, documentalists)

Metadata vs ontologies
Why are metadata structures better?
Because metadata provide a wide and complete range of facets for representing knowledge about an entity or resource.
Each facet (or data type) could be part of one or several ontological structures.
Facet: “any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)”
“A faceted classification system allows the assignment of multiple classifications to an object, enabling the classifications to be ordered in multiple ways, rather than in a single, pre-determined, taxonomic order” (Wikipedia).

Better mapping 1st folksonomies into metadata structures

Dublin Core Metadata Initiative http://jodi.tamu.edu/Articles/v02/i02/Greenberg/metadataform.gif

Dublin Core Metadata Initiative

Dublin Core Metadata Initiative

Our mapping tool: folk2onto (? folk2meta) designed by Borja Sotomayor

folk2onto: Tag Distiller
The Tag Distiller:
  Downloads tags from Web 2.0 sites
  Matches each tag against WordNet (taking into account the tag’s context/cloud)
  Filters out synonyms
  Keeps the list of remaining tags
  Generates an XML file
Implemented by Aitor Almeida

Tag clouds from del.icio.us
  Request http://del.icio.us/url/check?url=site
  Look for <title> and get its content: the hash
  Get the RSS at http://del.icio.us/rss/url/ + hash
  The tag cloud is then read from the <rdf:li resource="http://del.icio.us/tag/"> elements
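The last two steps can be sketched in Python as follows. This is only an illustration under assumptions: the function names and the sample feed are made up for the example, the feed layout is reduced to the `rdf:li` elements mentioned on the slide, and no network request is made (the del.icio.us endpoints quoted above may no longer be available).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TAG_PREFIX = "http://del.icio.us/tag/"

# Hypothetical, heavily simplified feed: only the rdf:li elements matter here.
SAMPLE_RSS = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/database"/>
    <rdf:li resource="http://del.icio.us/tag/performance"/>
    <rdf:li rdf:resource="http://del.icio.us/tag/database"/>
  </rdf:Bag>
</rdf:RDF>"""


def rss_url_for(url_hash):
    """Build the feed URL from the hash found in <title> (URL pattern from the slide)."""
    return "http://del.icio.us/rss/url/" + url_hash


def tags_from_rss(rss_text):
    """Return the distinct tags referenced by rdf:li elements, in document order."""
    root = ET.fromstring(rss_text)
    tags = []
    for li in root.iter("{%s}li" % RDF_NS):
        # Accept both the prefixed (rdf:resource) and the bare (resource) attribute.
        resource = li.get("{%s}resource" % RDF_NS) or li.get("resource", "")
        if resource.startswith(TAG_PREFIX):
            tag = resource[len(TAG_PREFIX):]
            if tag and tag not in tags:
                tags.append(tag)
    return tags
```

Calling `tags_from_rss(SAMPLE_RSS)` on the sample keeps each tag once, so a tag repeated across several bookmarks does not inflate the cloud.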

Tag clouds from Technorati
Technorati: blog aggregator
We can get tag clouds from Technorati through:
  http://api.technorati.com/blogposttags?key=[apikey]&url=[blog URL]

Tag clouds from Technorati
  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE tapi PUBLIC "-//Technorati, Inc.//DTD TAPI 0.02//EN" "http://api.technorati.com/dtd/tapi-002.xml">
  <tapi version="1.0">
    <document>
      <result>
        <querycount>13</querycount>
      </result>
      <item>
        <tag>christmas cookie recipes</tag>
        <posts>274</posts>
      </item>
      …
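A response of this shape can be turned into a tag cloud with a few lines of Python. The sketch below is an assumption-laden illustration, not the project's code: `parse_blogposttags` is a made-up name, the DOCTYPE is dropped from the sample for brevity, and the second `<item>` is invented so the example has more than one tag.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the slide; the "cookies" item is invented for the example.
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="utf-8"?>
<tapi version="1.0">
  <document>
    <result><querycount>13</querycount></result>
    <item><tag>christmas cookie recipes</tag><posts>274</posts></item>
    <item><tag>cookies</tag><posts>120</posts></item>
  </document>
</tapi>"""


def parse_blogposttags(xml_bytes):
    """Return a list of (tag, post_count) pairs from a blogposttags-style response."""
    root = ET.fromstring(xml_bytes)
    cloud = []
    for item in root.iter("item"):
        tag = item.findtext("tag")
        if tag is None:
            continue  # skip items that carry no <tag> element
        posts = item.findtext("posts")
        cloud.append((tag.strip(), int(posts) if posts else 0))
    return cloud
```

The post counts are kept next to each tag because they could later serve as a frequency signal when choosing among candidate senses.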

Tagged URL at Technorati
All <tag> elements are downloaded.
To get the “title”:
  http://api.technorati.com/bloginfo?key=[apikey]&url=[blog URL]
and the content of <name> is recovered.

semantic relations in WordNet WordNet relations for tag ‘Spanish’:

Tag filtering algorithm
Tags are filtered by means of WordNet.
If a tag has only one meaning (synset), that meaning is assigned.
If it has more than one, the meaning is chosen by a formula [formula missing from this transcript] whose terms are:
  T: the resource's tag set
  Related(a,b): gives 1 if a and b have some type of relation (hypernym, hyponym, holonym, meronym)
  w: weights
Several iterations are made until a meaning is found (10 iterations max.).
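Since the scoring formula itself is not preserved, the following Python sketch only illustrates the general idea described above and should not be read as the authors' formula. It assumes a purely context-based score (each candidate sense is scored by how many already-assigned senses in the tag set it is related to), uses a hand-made relation set instead of WordNet, and invents all names and sample data.

```python
MAX_ITERATIONS = 10  # the slide states a maximum of 10 iterations


def related(a, b, relations):
    """Return 1 if senses a and b are directly related, else 0."""
    return 1 if frozenset((a, b)) in relations else 0


def disambiguate(tag_senses, relations, max_iterations=MAX_ITERATIONS):
    """tag_senses maps each tag to its candidate senses; returns tag -> chosen sense."""
    # Tags with exactly one sense are assigned immediately.
    assigned = {t: s[0] for t, s in tag_senses.items() if len(s) == 1}
    for _ in range(max_iterations):
        progress = False
        for tag, senses in tag_senses.items():
            if tag in assigned or not senses:
                continue
            # Context score: number of already-assigned senses related to the candidate.
            scores = {s: sum(related(s, o, relations) for o in assigned.values())
                      for s in senses}
            ranked = sorted(scores.values(), reverse=True)
            # Accept a sense only when it is the single best-supported candidate.
            if ranked[0] > 0 and (len(ranked) == 1 or ranked[0] > ranked[1]):
                assigned[tag] = max(scores, key=scores.get)
                progress = True
        if not progress:
            break  # nothing changed, further iterations cannot help
    return assigned


# Toy data, invented for the example.
TAG_SENSES = {
    "jaguar": ["jaguar.cat", "jaguar.car"],
    "cat": ["cat.animal"],
    "zoo": ["zoo.place"],
}
RELATIONS = {frozenset(("jaguar.cat", "cat.animal"))}
```

With this toy data, `disambiguate(TAG_SENSES, RELATIONS)` assigns "jaguar" the feline sense because it is the only candidate related to a sense already fixed in the same tag set. Tags with no winning sense stay unassigned, which corresponds to the discarded tags mentioned later in the deck.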

Tag filtering algorithm
Once senses have been discarded, synonyms are also filtered out.
Words are then grouped into senses using WordNet’s relation network.
The output is exported to:
  an XML file with senses
  an XML file with the tags that were discarded
  an RDF file containing WordNet’s relation network

Tag XML file
  <?xml version="1.0" encoding="UTF-8"?>
  <resource>
    <title>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</title>
    <type>Text</type>
    <format>text/html</format>
    <identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</identifier>
    <tags>
      <tag>
        <lemma>tune</lemma>
        <idlex>236726</idlex>
      </tag>
      <tag>
        <lemma>bd</lemma>
        <idlex>5604473</idlex>
      </tag>
      …

Tag file without senses
  <resource>
    <title>Wired News: The Virus That Ate DHS</title>
    <type>Text</type>
    <format>text/html</format>
    <identifier>www.wired.com/news/technology/0,72051-0.html?tw=rss.index</identifier>
    <tags>
      <tag>bit200f06</tag>
      <tag>group141</tag>
      <tag>dhs</tag>
      <tag>group35</tag>
      <tag>malware</tag>
      <tag>group91</tag>
      <tag>group17</tag>
      <tag>group53</tag>
      <tag>computer_security</tag>
    </tags>
  </resource>

WordNet’s sense sets
Words are grouped into sense sets.
If related(a,b) = 1, the words are placed in the same set.
The relation depth has to be less than or equal to 3.
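One way to read the grouping rule is as a bounded graph search: two senses belong together when one can be reached from the other in at most three relation steps. The Python sketch below follows that reading. It is an assumption about how the depth limit applies, and the graph, names, and sample senses are invented for the example rather than taken from WordNet.

```python
from collections import deque


def within_depth(graph, start, goal, max_depth=3):
    """Breadth-first search: is goal reachable from start in at most max_depth steps?"""
    if start == goal:
        return True
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return False


def group_senses(senses, graph, max_depth=3):
    """Merge senses into sets whenever they are related within max_depth steps."""
    groups = []
    for sense in senses:
        matching = [g for g in groups
                    if any(within_depth(graph, sense, m, max_depth) for m in g)]
        merged = {sense}
        for g in matching:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups


# Toy undirected relation graph, invented for the example.
GRAPH = {
    "database": ["information"],
    "information": ["database", "data"],
    "data": ["information"],
    "snake": ["reptile"],
    "reptile": ["snake"],
}
```

Here `group_senses(["database", "data", "snake"], GRAPH)` puts "database" and "data" in one set (two steps apart) and leaves "snake" on its own.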

folk2onto: Tag Mapper
The Mapper makes tag-element associations.
These associations are made according to the senses assigned by the Distiller.
The mapping targets are Dublin Core metadata records.

folk2onto: Dublin Core
The Distiller gets 4 elements from the tag source (del.icio.us, Technorati, etc.):
  Title: the URL’s title -> from the <title> XML tag
  Type: content type -> depends on the source (here both are “Text”)
  Format: MIME type -> depends on the source (here both are text/html)
  Identifier: the resource’s URL

folk2onto: Dublin Core The Tag-Mapper deals with: Subject : the “topic”. Language : en, es, fr, de, ru... Coverage : when, where (about the topic) Rights : type of licence

folk2onto: mapping formulae
When a tag has one mapping, that mapping is used.
If it has more than one: [formula missing from this transcript]
If it has no mapping: [formula missing from this transcript]

folk2onto: file mapping
  <rdf:RDF
      xmlns:j.0="http://purl.org/dc/elements/1.1/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:nodeID="A0">
      <rdf:type rdf:resource="http://purl.org/dc/elements/1.1/identifier"/>
      <j.0:identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</j.0:identifier>
      <j.0:type>Text</j.0:type>
      <j.0:format>text/html</j.0:format>
      <j.0:title>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</j.0:title>
      <j.0:subject>database</j.0:subject>
      <j.0:subject>performance</j.0:subject>
      <j.0:subject>bd</j.0:subject>
    </rdf:Description>
  </rdf:RDF>
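A record like this one can be serialised with Python's standard library. The sketch below is illustrative only: `to_dc_rdf` and the dictionary layout are assumptions made for the example, the `dc` prefix is used instead of `j.0`, and the `rdf:type` statement is left out.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Record values copied from the slide; the dict layout itself is an assumption.
RECORD = {
    "identifier": "www.postgresql.org/docs/faqs.FAQ_brazilian.html",
    "type": "Text",
    "format": "text/html",
    "title": "PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL",
    "subject": ["database", "performance", "bd"],
}


def to_dc_rdf(record, node_id="A0"):
    """Serialise one resource record as an RDF/XML string with Dublin Core elements."""
    ET.register_namespace("dc", DC)
    ET.register_namespace("rdf", RDF)
    root = ET.Element("{%s}RDF" % RDF)
    desc = ET.SubElement(root, "{%s}Description" % RDF,
                         {"{%s}nodeID" % RDF: node_id})
    # Single-valued elements supplied by the Distiller.
    for name in ("identifier", "type", "format", "title"):
        if name in record:
            ET.SubElement(desc, "{%s}%s" % (DC, name)).text = record[name]
    # One dc:subject per tag that the Mapper associated with the Subject element.
    for subject in record.get("subject", []):
        ET.SubElement(desc, "{%s}subject" % DC).text = subject
    return ET.tostring(root, encoding="unicode")
```

Building the tree with namespace-qualified names keeps the trailing slash of the Dublin Core namespace in one place, so every element resolves to a valid `http://purl.org/dc/elements/1.1/` term.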

folk2onto: 6 tests (A-F)
  Experiment A: Selecting random synsets for the tags.
  Experiment B: No limit on the semantic relation depth. Only the trained synsets are taken into account (frec=0, wordnet=0, trained=1).
  Experiment C: No limit on the semantic relation depth. Only the context is taken into account (frec=0, wordnet=1, trained=0).
  Experiment D: No limit on the semantic relation depth. The context and the trained synsets are taken into account (frec=0, wordnet=0.4, trained=0.6).
  Experiment E: No limit on the semantic relation depth. All three components of the equation (familiarity, context and trained synsets) are taken into account (frec=0.1, wordnet=0.3, trained=0.6).
  Experiment F: The semantic relation depth is limited to 3, and the context and the trained synsets are taken into account (frec=0, wordnet=0.4, trained=0.6).

folk2onto: tests output
  Experiment   Correct synsets   Erroneous synsets
  A            706 (32.5%)       1466 (67.5%)
  B            1594 (73.4%)      578 (26.6%)
  C            1199 (55.2%)      973 (44.8%)
  D            1492 (68.7%)      680 (31.3%)
  E            1349 (62.1%)      823 (37.9%)
  F            1894 (87.2%)      278 (12.8%)
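The percentages follow directly from the counts, since every experiment covers the same 2172 tag assignments. A short Python check, with made-up names and the counts copied from the results above:

```python
# (correct, erroneous) synset counts per experiment, as reported in the results.
RESULTS = {
    "A": (706, 1466),
    "B": (1594, 578),
    "C": (1199, 973),
    "D": (1492, 680),
    "E": (1349, 823),
    "F": (1894, 278),
}


def percentages(correct, erroneous):
    """Return (correct %, erroneous %) rounded to one decimal place."""
    total = correct + erroneous
    return round(100 * correct / total, 1), round(100 * erroneous / total, 1)


def best_experiment(results):
    """Return the experiment label with the highest share of correct synsets."""
    return max(results, key=lambda k: results[k][0] / sum(results[k]))
```

For instance, `percentages(*RESULTS["F"])` gives `(87.2, 12.8)`, and `best_experiment(RESULTS)` returns `"F"`, the run that limits the relation depth to 3.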

Open issues
Tag filtering through WordNet:
  blog, wiki
  xml, rdf, rss
  wordpress, tuenti, flickr
  social, open
“tags can be about so many things mapping to Dublin Core is a weak choice”
Mappings:
  Coverage: Japan
  Language: Spanish
Learning the right synset of e.g. "jaguar": "vehicle", "video game console", or "cat of prey"
  "<dc:subject>Jaguar</dc:subject>"
Word-sense disambiguation, tag-category disambiguation

That was all about CollOnBus/folk2onto. Thank you very much! Any questions?
