Rio info 2013 - Linked Data at Globo.com

Published in: Education

  1. Linked Data at globo.com. Tatiana Al-Chueyr Martins, tatiana.martins@corp.globo.com, @tati_alchueyr. September 18, 2013, Simpósio Rio Info
  2. BROADCAST, MOVIES, PAY TV, INTERNET, EVENTS, MUSIC, PUBLISHING, NEW VENTURES, NEWSPAPER, RADIO NETWORK
  3. Semantic Team: Andréia Bustamante, Ícaro Medeiros, Tatiana Al-Chueyr, Rodrigo Senra
  4. Contributors: Franklin Amorim, Diogo Kiss
  5. Motivation: Not only words. São Paulo
  6. Motivation: Not only words. São Paulo?
  7. Motivation: Not only words. São Paulo state
  8. Motivation: Not only words. São Paulo city
  9. Motivation: Not only words. São Paulo saint
  10. Motivation: Not only words. São Paulo soccer team
  11. Motivation: Multiple words for the same thing. Female, f, F, female, woman, ...
  12. Motivation: Multiple words for the same thing. http://data.globo.com/female
  13. Motivation: Cross-link content from different web products (soccer player)
  14. Motivation: Cross-link content from different web products (politician)
  15. Motivation: Cross-link content from different web products (celebrity)
  16. Motivation: Semantic markup in web pages (RDF, FOAF, GEO, Dublin Core, SKOS). G1 SP, Juliana Cardilli, "Isabella Nardoni case": Isabella Nardoni was killed on March 29, 2008, in the North Zone of São Paulo (Photo: Reproduction). Isabella de Oliveira Nardoni, aged 5, was killed on the night of March 29, 2008. The forensic investigation concluded that the girl was thrown from the sixth floor of the building where her father, Alexandre Nardoni, her stepmother, Anna Carolina Jatobá, and the couple's two young children lived, in Vila Isolina Mazzei, in the north zone of São Paulo. Isabella's grave becomes a visiting site in SP; the Nardoni couple is in prison.
  17. Motivation: Recommend annotations to the information producer
  18. Motivation: Suggest related content to the information consumer
  19. Motivation: Suggest related content to the information consumer
  20. Motivation: Suggest related content to the information consumer
  21. Changes ● Replacement of words by entities, e.g. http://data.globo.com/person/Person/santos_dumont
  22. Changes ● Replacement of labels by qualified relationships
  23. Changes ● Reorganization of data from tables to graphs
  24. Outcomes ● Replacing words with entities improved: ○ Finding ○ Linking ○ Reconciling ○ Organizing multiple layers of information
  25. Outcomes ● Flexible ways to organize content ● Ease of finding related issues ● Explicit relations derived from annotated content ● Up-to-date topic pages with little editorial effort ● Linking content across different web products ● Seamless navigation leading to flow state
  26. Status Quo: Used by the main web products of Globo.com, from August 2010 to May 2013: ○ 18,485 organizations ○ 83,000 people ○ 9,129 places ○ 1,000,000+ annotated news items. These add up to 2,500,000+ entities!
  27. Linked data problems
  28. Legacy Architecture: CDA, CMA, triple store, search engine, ontology
  29. Legacy Architecture: four CDA/CMA pairs, triple store, search engine, ontology
  30. Problems: Poor data management ○ direct access to triple store (unmanaged) ○ difficulty sharing data (distributed DBs) ○ re-syncing triple store and search engine index ○ scalability of triple store ○ high entropy in distributed ontology engineering
  31. Problems
  32. Ontology Engineering ● Product-driven (past): G1 (news), GE (sports), EGO (gossip), TVG (tv) ● Domain-driven (current): Upper, Base, Person, Organization, Place, Music, Politics, Programme, Education, Sports
  33. Possible Solution: Upper Ontology
  34. Problems: Semantic as a library ○ many different versions in production ○ programming language dependent ○ steep learning curve for RDF/OWL/SPARQL
  35. Solution: Create an open semantic data management platform ● Scalable ● Mobile and Web friendly ● Interconnect Globo's data with external data sources ● Automate content extraction (including NER)
  36. Brainiak: linked data RESTful API
  37. Legacy Architecture: four CDA/CMA pairs, triple store, search engine, ontology
  38. Under Development: CMA, four CDAs, Brainiak API, triple store, search engine
  39. Requirements ● Indirect usage of SPARQL ● Programming language independent ● Data management with quality ● Finer-grained authorization and authentication ● Isolate applications from triplestore ● Improve triplestore performance
  40. SPARQL query (task: list all sports teams): DEFINE input:inference <http://data.globo.com/ruleset> SELECT ?uri ?label FROM <http://data.globo.com/sports/> WHERE { ?uri a <http://data.globo.com/sports/Team>; rdfs:label ?label . } LIMIT 10 OFFSET 0
  41. Brainiak query: GET /sports/Team
  42. SPARQL response
  43. Brainiak response
  44. SPARQL query (task: retrieve all superclasses of a class): SELECT DISTINCT ?class WHERE { <http://data.globo.com/place/City> rdfs:subClassOf ?class OPTION (TRANSITIVE, t_distinct, t_step('step_no') as ?n, t_min (0)) . ?class a owl:Class . }
  45. SPARQL query (task: retrieve all properties of a group of classes): SELECT DISTINCT ?predicate ?predicate_graph ?predicate_comment ?type ?range ?title ?range_graph ?range_label ?super_property WHERE { { GRAPH ?predicate_graph { ?predicate rdfs:domain ?domain_class } . } UNION { GRAPH ?predicate_graph {?predicate rdfs:domain ?blank} . ?blank a owl:Class . ?blank owl:unionOf ?enumeration . OPTIONAL { ?enumeration rdf:rest ?list_node OPTION(TRANSITIVE, t_min (0)) } . OPTIONAL { ?list_node rdf:first ?domain_class } . } FILTER (?domain_class IN (<http://data.globo.com/place/City>, <http://data.globo.com/place/GeopoliticalDivision>, <http://data.globo.com/place/Place>, <http://data.globo.com/upper/Object>, <http://data.globo.com/upper/Substance>, <http://data.globo.com/upper/ConcreteEntity>, <http://data.globo.com/upper/Entity>)) {?predicate rdfs:range ?range .} UNION { ?predicate rdfs:range ?blank . ?blank a owl:Class . ?blank owl:unionOf ?enumeration . OPTIONAL { ?enumeration rdf:rest ?list_node OPTION(TRANSITIVE, t_min (0)) } . OPTIONAL { ?list_node rdf:first ?range } . } FILTER (!isBlank(?range)) ?predicate rdfs:label ?title . ?predicate rdf:type ?type . OPTIONAL { ?predicate rdfs:subPropertyOf ?super_property } . FILTER (?type in (owl:ObjectProperty, owl:DatatypeProperty)) . FILTER(langMatches(lang(?title), "en") OR langMatches(lang(?title), "")) . OPTIONAL { ?predicate rdfs:comment ?predicate_comment } FILTER(langMatches(lang(?predicate_comment), "en") OR langMatches(lang(?predicate_comment), "")) . OPTIONAL { GRAPH ?range_graph { ?range rdfs:label ?range_label . FILTER(langMatches(lang(?range_label), "en") OR langMatches(lang(?range_label), "")) . } } }
  46. SPARQL query (task: retrieve the cardinalities of all properties of a certain class): SELECT DISTINCT ?predicate ?min ?max ?range ?enumerated_value ?enumerated_value_label WHERE { <http://data.globo.com/place/City> rdfs:subClassOf ?s OPTION (TRANSITIVE, t_distinct, t_step('step_no') as ?n, t_min (0)) . ?s owl:onProperty ?predicate . OPTIONAL { ?s owl:minQualifiedCardinality ?min } . OPTIONAL { ?s owl:maxQualifiedCardinality ?max } . OPTIONAL { { ?s owl:onClass ?range } UNION { ?s owl:onDataRange ?range } UNION { ?s owl:allValuesFrom ?range } OPTIONAL { ?range owl:oneOf ?enumeration } . OPTIONAL { ?enumeration rdf:rest ?list_node OPTION(TRANSITIVE, t_min (0)) } . OPTIONAL { ?list_node rdf:first ?enumerated_value } . OPTIONAL { ?enumerated_value rdfs:label ?enumerated_value_label . } . } }
  47. Brainiak query: GET /place/City/_schema
  48. Next steps ● Enrich Globo.com search ● SEO (automatic schema.org) ● Improve annotator (DBpedia Spotlight) ● Richer content relationships (inference) ● Link to open data (e.g. DBpedia, dados.gov.br)
  49. Stay tuned: @brainiak_api ... will soon be released as an open source project!
  50. Slides: http://www.slideshare.net/ @semantic_team @alchueyr
  51. Thank you for your attention! tatiana.martins@corp.globo.com, semantica@corp.globo.com, globo.com
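
Slides 40 and 41 contrast a hand-written SPARQL query with the equivalent Brainiak request, GET /sports/Team. The sketch below illustrates how such a path could be expanded into the query from slide 40. It is not Brainiak's actual implementation: the helper name `path_to_sparql`, the rule that the first path segment names the graph and the second names the class, and the explicit `PREFIX rdfs:` line are all assumptions made for illustration. The `DEFINE input:inference` pragma and the URIs are copied from slide 40.

```python
import re

# Base URI taken from the URIs shown on slide 40.
BASE_URI = "http://data.globo.com/"

# Assumption: path segments are plain identifiers such as "sports" or "Team".
_SEGMENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def path_to_sparql(path, limit=10, offset=0):
    """Expand a path such as '/sports/Team' into a SPARQL listing query.

    Hypothetical helper: the mapping (first segment = graph, second
    segment = class) is inferred from slides 40 and 41 only.
    """
    segments = [s for s in path.split("/") if s]
    if len(segments) != 2 or not all(_SEGMENT.fullmatch(s) for s in segments):
        raise ValueError("expected a path of the form /<context>/<Class>")
    context, class_name = segments
    graph_uri = f"{BASE_URI}{context}/"
    class_uri = f"{graph_uri}{class_name}"
    return (
        f"DEFINE input:inference <{BASE_URI}ruleset>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?uri ?label\n"
        f"FROM <{graph_uri}>\n"
        "WHERE {\n"
        f"  ?uri a <{class_uri}> ;\n"
        "       rdfs:label ?label .\n"
        "}\n"
        f"LIMIT {int(limit)} OFFSET {int(offset)}"
    )


query = path_to_sparql("/sports/Team")
print(query)
```

Keeping the query construction behind a path-based interface is one way to meet the "indirect usage of SPARQL" and "isolate applications from triplestore" requirements listed on slide 39, since clients never see the query text.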
