Rio info 2013 - Linked Data at Globo.com

Linked Data at globo.com
Tatiana Al-Chueyr Martins
tatiana.martins@corp.globo.com
@tati_alchueyr
September 18, 2013, Simpósio Rio Info

BROADCAST MOVIES PAY TV INTERNET
EVENTS MUSIC
PUBLISHING
NEW VENTURES NEWSPAPER RADIO NETWORK

Semantic Team
Andréia Bustamante
Ícaro Medeiros
Tatiana Al-Chueyr
Rodrigo Senra

Contributors
Franklin Amorim
Diogo Kiss

Motivation
Not only words
São Paulo

São Paulo?

São Paulo state

São Paulo city

São Paulo saint

São Paulo soccer team

Motivation
Multiple words for the same thing
Female
f
F
female
woman
...

Motivation
Multiple words for the same thing
http://data.globo.com/female
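
A minimal sketch of the idea on this slide: several spellings collapse to one identifier. Only the URI comes from the slide; the lookup table `SURFACE_FORMS` and the function `to_entity` are hypothetical names used for illustration.

```python
# URI taken from the slide; everything else here is an illustrative assumption.
FEMALE = "http://data.globo.com/female"

# Hypothetical lookup table: every known surface form maps to one entity URI.
SURFACE_FORMS = {
    "female": FEMALE,
    "f": FEMALE,
    "woman": FEMALE,
}


def to_entity(word):
    """Return the entity URI for a word, or None if the word is unknown."""
    return SURFACE_FORMS.get(word.strip().lower())
```

With this table, `to_entity("F")` and `to_entity("Female")` resolve to the same URI, so content annotated with either spelling can be found together.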

Motivation
Cross-link content from different web products
● Soccer player
● Politician
● Celebrity

Motivation
Semantic markup in web pages
Vocabularies: RDF, FOAF, GEO, Dublin Core, SKOS
Example news page (G1 SP, by Juliana Cardilli):
The Isabella Nardoni case
Isabella's grave becomes a visiting site in São Paulo; the Nardoni couple is in prison.
Isabella de Oliveira Nardoni, age 5, was killed on the night of March 29, 2008. Forensic experts concluded that the girl was thrown from the sixth floor of the building where her father, Alexandre Nardoni, her stepmother, Anna Carolina Jatobá, and the couple's two young children lived, in Vila Isolina Mazzei, in the north zone of São Paulo.
Photo caption: Isabella Nardoni was killed on March 29, 2008, in the North Zone of São Paulo (Photo: Reproduction)

Motivation
Recommend annotations to the information producer

Motivation
Suggest related content to the information consumer

Changes
● Replacement of words with entities
http://data.globo.com/person/Person/santos_dumont

Changes
● Replacement of labels with qualified relationships

Changes
● Reorganize data from tables into graphs
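
To make the move from tables to graphs concrete, here is a minimal sketch that turns one table row into subject-predicate-object triples. The subject URI is the one shown on an earlier slide; the predicate base `http://example.org/schema/`, the column names, and the function `row_to_triples` are assumptions made for illustration only.

```python
def row_to_triples(subject_uri, row, predicate_base):
    """Turn one table row (a dict of column -> value) into a list of triples.

    Empty cells (None) are skipped: a graph simply omits the edge.
    """
    return [
        (subject_uri, predicate_base + column, value)
        for column, value in row.items()
        if value is not None
    ]


# Illustrative row; the column names are hypothetical.
row = {"name": "Santos Dumont", "birth_year": 1873, "nickname": None}

triples = row_to_triples(
    "http://data.globo.com/person/Person/santos_dumont",
    row,
    "http://example.org/schema/",  # placeholder vocabulary, not Globo.com's
)
```

Each cell becomes an edge from the entity to a value, so adding a new kind of relationship later means adding triples rather than altering a table schema.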

Outcomes
● Replacing words with entities improved the following across multiple layers of information:
○ Finding
○ Linking
○ Reconciling
○ Organizing

Outcomes
● Flexible ways to organize content
● Easier discovery of related issues
● Explicit relations derived from annotated content
● Up-to-date topic pages with little editorial effort
● Linking content across different web products
● Seamless navigation leading to flow state

Status Quo
Used by the main web products of Globo.com:
○ 18,485 organizations
○ 83,000 people
○ 9,129 places
○ 1,000,000+ annotated news
Which add up to 2,500,000+ entities!
from August 2010 to May 2013

Legacy Architecture
[Diagram: CMA, CDA, triple store, search engine, ontology]

Legacy Architecture
[Diagram: four CMA/CDA pairs, triple store, search engine, ontology]

Problems
Poor data management
○ direct, unmanaged access to the triple store
○ difficulty sharing data (distributed DBs)
○ need to re-sync the triple store and the search engine index
○ triple store scalability
○ high entropy in distributed ontology engineering

Ontology Engineering
[Diagram labels]
Product-driven (past): G1 (news), GE (sports), EGO (gossip), TVG (tv)
Domain-driven (current): Upper, Base, Person, Organization, Place, Music, Politics, Programme, Education, Sports

Possible Solution
Upper Ontology

Problems
Semantic as a library
○ many different versions in production
○ programming language dependent
○ steep learning curve for RDF/OWL/SPARQL

Solution
Create an open semantic data management platform
● Scalable
● Mobile and Web friendly
● Interconnect Globo's data with external data sources
● Automate content extraction (including NER)

Brainiak
Linked data RESTful API

Under Development
[Diagram: Brainiak API, one CMA, four CDAs, triple store, search engine]

Requirements
● Indirect usage of SPARQL
● Programming language independent
● Data management with quality
● Finer-grained authorization and authentication
● Isolate applications from triplestore
● Improve triplestore performance

SPARQL query
DEFINE input:inference <http://data.globo.com/ruleset>
SELECT ?uri ?label
FROM <http://data.globo.com/sports/>
WHERE
{
?uri a <http://data.globo.com/sports/Team>;
rdfs:label ?label .
}
LIMIT 10
OFFSET 0
task: list all sports teams

Brainiak query
GET /sports/Team
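
The two slides above pair a SPARQL query with a REST path. As a minimal sketch of how such a path could be expanded into that query, assume the path has the form `/<context>/<Class>` and that both segments map directly onto graph and class URIs under `http://data.globo.com/`. The template and the function `list_instances_query` are illustrative assumptions, not Brainiak's actual implementation.

```python
import re

BASE = "http://data.globo.com/"

# Same shape as the SPARQL query on the slide (Virtuoso dialect).
QUERY_TEMPLATE = """DEFINE input:inference <http://data.globo.com/ruleset>
SELECT ?uri ?label
FROM <{base}{context}/>
WHERE
{{
?uri a <{base}{context}/{class_name}>;
rdfs:label ?label .
}}
LIMIT {limit}
OFFSET {offset}"""


def list_instances_query(path, limit=10, offset=0):
    """Expand a path such as '/sports/Team' into a SPARQL listing query."""
    match = re.fullmatch(r"/(\w+)/(\w+)/?", path)
    if match is None:
        # Reject anything that is not two plain segments, so arbitrary
        # text cannot be spliced into the query.
        raise ValueError("expected a path like /<context>/<Class>")
    context, class_name = match.groups()
    return QUERY_TEMPLATE.format(
        base=BASE,
        context=context,
        class_name=class_name,
        limit=int(limit),
        offset=int(offset),
    )
```

Calling `list_instances_query("/sports/Team")` yields the query text from the slide, which illustrates the "indirect usage of SPARQL" idea: clients only see the path.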

SPARQL query
SELECT DISTINCT ?class
WHERE {
<http://data.globo.com/place/City> rdfs:subClassOf ?class OPTION
(TRANSITIVE, t_distinct, t_step('step_no') as ?n, t_min (0)) .
?class a owl:Class .
}
task: retrieve all superclasses of a class
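
The query above relies on Virtuoso's transitive option to walk `rdfs:subClassOf` links. The same traversal can be sketched in plain Python over an in-memory mapping. The toy hierarchy below is an assumption for illustration (the slides do not state the exact class tree), and including the starting class mirrors what `t_min (0)` appears to do.

```python
def superclasses(cls, sub_class_of):
    """Return cls plus every class reachable through subClassOf links.

    sub_class_of maps a class name to a list of its direct superclasses.
    """
    seen = []
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        stack.extend(sub_class_of.get(current, []))
    return seen


# Hypothetical hierarchy, loosely based on class names seen in these slides.
HIERARCHY = {
    "City": ["GeopoliticalDivision"],
    "GeopoliticalDivision": ["Place"],
    "Place": [],
}
```

Here `superclasses("City", HIERARCHY)` visits `City`, `GeopoliticalDivision`, and `Place` in that order.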

SPARQL query
SELECT DISTINCT ?predicate ?predicate_graph ?predicate_comment ?type ?range ?title ?range_graph ?range_label ?super_property
WHERE {
{
GRAPH ?predicate_graph { ?predicate rdfs:domain ?domain_class } .
} UNION {
graph ?predicate_graph {?predicate rdfs:domain ?blank} .
?blank a owl:Class .
?blank owl:unionOf ?enumeration .
OPTIONAL { ?enumeration rdf:rest ?list_node OPTION(TRANSITIVE, t_min (0)) } .
OPTIONAL { ?list_node rdf:first ?domain_class } .
}
FILTER (?domain_class IN (
<http://data.globo.com/place/City>,
<http://data.globo.com/place/GeopoliticalDivision>,
<http://data.globo.com/place/Place>,
<http://data.globo.com/upper/Object>,
<http://data.globo.com/upper/Substance>,
<http://data.globo.com/upper/ConcreteEntity>,
<http://data.globo.com/upper/Entity>))
{?predicate rdfs:range ?range .}
UNION {
?predicate rdfs:range ?blank .
?blank a owl:Class .
?blank owl:unionOf ?enumeration .
OPTIONAL { ?list_node rdf:first ?range } .
}
FILTER (!isBlank(?range))
?predicate rdfs:label ?title .
?predicate rdf:type ?type .
OPTIONAL { ?predicate rdfs:subPropertyOf ?super_property } .
FILTER (?type in (owl:ObjectProperty, owl:DatatypeProperty)) .
FILTER(langMatches(lang(?title), "en") OR langMatches(lang(?title), "")) .
OPTIONAL { ?predicate rdfs:comment ?predicate_comment }
FILTER(langMatches(lang(?predicate_comment), "en") OR langMatches(lang(?predicate_comment), "")) .
OPTIONAL {
GRAPH ?range_graph {
?range rdfs:label ?range_label .
FILTER(langMatches(lang(?range_label), "en") OR langMatches(lang(?range_label), "")) .
}
}
}
task: retrieve all properties of a group of classes

SPARQL query
SELECT DISTINCT ?predicate ?min ?max ?range ?enumerated_value ?enumerated_value_label
WHERE {
<http://data.globo.com/place/City> rdfs:subClassOf ?s OPTION (TRANSITIVE, t_distinct, t_step('step_no') as ?n,
t_min (0)) .
?s owl:onProperty ?predicate .
OPTIONAL { ?s owl:minQualifiedCardinality ?min } .
OPTIONAL { ?s owl:maxQualifiedCardinality ?max } .
OPTIONAL {
{ ?s owl:onClass ?range }
UNION { ?s owl:onDataRange ?range }
UNION { ?s owl:allValuesFrom ?range }
OPTIONAL { ?range owl:oneOf ?enumeration } .
OPTIONAL { ?list_node rdf:first ?enumerated_value } .
OPTIONAL {
?enumerated_value rdfs:label ?enumerated_value_label .
} .
}
}
task: retrieve the cardinalities of all properties of a certain class

Brainiak query
GET /place/City/_schema
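
The slides show class URIs such as `http://data.globo.com/place/City` next to paths such as `/place/City/_schema`. A minimal sketch of that correspondence, assuming the API path is simply the URI's path component with `/_schema` appended; the helper name `schema_path` is hypothetical.

```python
from urllib.parse import urlparse


def schema_path(class_uri):
    """Derive the schema path for a class URI, e.g. .../place/City -> /place/City/_schema."""
    return urlparse(class_uri).path.rstrip("/") + "/_schema"
```

A client would then issue a GET request for that path against wherever the API is hosted; the host is not given in the slides.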

Next steps
● Enrich Globo.com search
● SEO (automatic schema.org)
● Improve annotator (DBpedia Spotlight)
● Richer content relationships (inference)
● Link to open data (e.g. DBpedia, dados.gov.br)

Stay tuned
@brainiak_api
... will soon be released as an open source project!

Slides
http://www.slideshare.net/
@semantic_team
@alchueyr

Thank you for your attention!
tatiana.martins@corp.globo.com
semantica@corp.globo.com
globo.com
