2. About the Associated Press
– AP is a not-for-profit news cooperative, owned by US
newspaper and broadcast members, founded in 1846
– AP news content is seen by half the world’s
population on any given day
– We process and deliver 100k+ content items daily
– AP, member and third-party content
– Text, photos, audio, multimedia interactives, and
broadcast and online quality video
– Primarily B2B
3. Evolution of AP Metadata Services
2006
• Initial taxonomy and rule development starts
2007
• Automated tagging of Subjects, People, Companies starts
2008
• Automated tagging of Companies, Organizations, Geography, Events starts
2009-2010
• Scope and depth of coverage increases
• Platform stabilized
2011
• RDF modeling
• API development
• Pilot offering
2012
• AP Metadata Services Launch
4. Introducing AP Metadata Services
– Semantic Web services to drive the next generation of
news delivery and consumption:
– AP News Taxonomy
– AP Tagging Service
– B2B service with continuing investment and human
curation
– Ongoing and frequent updates to tagging
rules, entities, concepts and their semantic relationships
– Designed to meet AP’s exacting needs for its own content
5. What Does Rich Metadata Do for Publishers?
– Connect customers with more relevant content through:
– Improved search and discovery
– Automated aggregation, syndication and distribution of related
content
– Richer and more relevant content products and services
– Reduced time to market for new products and services
– Reduces editorial workload, creates efficiencies
– Content interoperability
6. • Site delivered ~5,000
articles and ~20,000 photos
over 2 months
• Routing and display of
content by team and
conference is automated
• Editorial resources are
focused on curating only
the most important parts of
the site
• Enables user experience
that would not be possible
without automated,
standard metadata
7. The AP News Taxonomy
– Breadth and depth to support news and current events
– Defines rich semantic metadata specific to news
– Generic subjects and hierarchy
– Named entities
– Relationships, synonyms, additional entity data
– Delivers automated notifications of taxonomy changes
– New terms, deprecated terms, name changes, etc.
8. The AP Tagging Service
– Software as a Service
– Leverages AP investment and expertise
– Tags concepts; more than entity extraction
– Automated tagging tied to AP News Taxonomy ensures
more consistent, comprehensive results
9. Coverage
– 4200 Subjects
– 2100 Geographic locations
– 1200 Organizations
– 91,000 People
– 41,000 Publicly-traded Companies
– Supports English language content
Top Level Subject Areas:
• Arts and Entertainment
• Business
• Demographic groups
• Environment and Nature
• Events
• General News
• Government and Politics
• Health
• Lifestyle
• Living Things
• Media
• Science
• Social Affairs
• Sports
• Technology
10. A Foundation of Semantic Web Standards
– URIs for all entities and topics
– Taxonomy modeled in RDF
– SKOS Ontology
– Supplemented with other ontologies
(Dublin Core, DBPedia, FOAF, GeoNames, etc.)
– Some AP-specific properties
– Taxonomy and Tagging Service accessible via
RESTful APIs
– Using a SPARQL end-point internally to provide
views of the taxonomy
11. Supported Formats
AP Tagging Service
– Input formats
– Plain Text
– Simple XML: XML encoded content, e.g. XHTML, NITF, NewsML, NewsML-G2
– Output formats
– RDF/XML
– RDF/JSON
– RDF/Turtle
– Simple XML
– NewsML-G2
AP Taxonomy
– Taxonomy Output Format
– RDF/XML
– RDF/Turtle
– RDF/JSON
– NewsML-G2
– Taxonomy Change Log Output formats
– XML
– CSV
12. Metadata Services in AP’s Content Lifecycle
[Diagram elements]
– AP Editorial Content and 3rd party content (Input)
– Standard AP News Taxonomy values
– AP Tagging Service applies standard values and related data
– Content Repository
– Products defined based on rich metadata
– Distribution methods: Internet syndication, Web portals, APIs
Metadata Services
• Taxonomy fed to editorial tools
• Automated tagging applies subject and entity metadata from taxonomy
• Rich relationships between subjects, entities
• Metadata used to deliver targeted feeds, auto-publish and improve search and browse experience
13. RDF/XML representation of Scott Walker, Governor of Wisconsin
<skos:Concept rdf:about="http://cv.ap.org/id/11AD96CF0A5149C5B3909F5BE9A5494A">
<skos:prefLabel xml:lang="en">Scott Walker</skos:prefLabel>
<ap:associatedState rdf:resource="http://cv.ap.org/id/1BC1BC3082C81004896CDF092526B43E" />
<ap:entryTerm xml:lang="en">Scott K. Walker</ap:entryTerm>
<ap:entryTerm xml:lang="en">Scott Kevin Walker</ap:entryTerm>
<ap:isPlaceholder rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</ap:isPlaceholder>
<dbpedia-owl:party rdf:resource="http://cv.ap.org/id/BF6E2E80760D10048F8AE6E7A0F4673E" />
<dbprop:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1967-11-02</dbprop:birthdate>
<dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-11-01T10:23:29-05:00</dcterms:created>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-02-26T10:14:13-05:00</dcterms:modified>
<rdf:type rdf:resource="http://cv.ap.org/c/Politician" />
<skos:altLabel xml:lang="en">Scott K. Walker</skos:altLabel>
<skos:altLabel xml:lang="en">Scott Kevin Walker</skos:altLabel>
<skos:broader rdf:resource="http://cv.ap.org/id/C9D7FA107E4E1004847ADF092526B43E" />
<skos:definition xml:lang="en">45th Governor of Wisconsin. Milwaukee, Wisconsin County Executive. US Republican member of the Wisconsin State Assembly.</skos:definition>
<skos:inScheme rdf:resource="http://cv.ap.org/a#person" />
</skos:Concept>
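A minimal sketch of how a consumer might read a concept like the one above, using only Python's standard library. The snippet on the slide omits its namespace declarations, so the `rdf:RDF` wrapper and the abbreviated copy of the concept below are assumptions added to make the example parse on its own. The function name `read_concepts` is hypothetical and is not part of any AP API.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and SKOS.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Abbreviated copy of the concept shown above. The rdf:RDF root element is an
# assumption: the slide shows only the skos:Concept element itself.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://cv.ap.org/id/11AD96CF0A5149C5B3909F5BE9A5494A">
    <skos:prefLabel xml:lang="en">Scott Walker</skos:prefLabel>
    <skos:altLabel xml:lang="en">Scott K. Walker</skos:altLabel>
    <skos:altLabel xml:lang="en">Scott Kevin Walker</skos:altLabel>
    <skos:broader rdf:resource="http://cv.ap.org/id/C9D7FA107E4E1004847ADF092526B43E" />
    <skos:inScheme rdf:resource="http://cv.ap.org/a#person" />
  </skos:Concept>
</rdf:RDF>"""


def read_concepts(rdf_xml):
    """Return URI, preferred label, synonyms and broader URIs per skos:Concept."""
    root = ET.fromstring(rdf_xml)
    concepts = []
    for node in root.iter(f"{{{SKOS}}}Concept"):
        pref = node.find(f"{{{SKOS}}}prefLabel")
        concepts.append({
            "uri": node.get(f"{{{RDF}}}about"),
            "pref_label": pref.text if pref is not None else None,
            "alt_labels": [e.text for e in node.findall(f"{{{SKOS}}}altLabel")],
            "broader": [e.get(f"{{{RDF}}}resource")
                        for e in node.findall(f"{{{SKOS}}}broader")],
        })
    return concepts


concepts = read_concepts(DOC)
for concept in concepts:
    print(concept["pref_label"], concept["alt_labels"])
```

The alternative labels are the synonyms a search index could use to match "Scott K. Walker" to the same concept URI. A full RDF library would be the more robust choice for real taxonomy data; the element-level parsing here only works for this flat RDF/XML shape.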
14. Subset of tags returned for article about NBA Finals game, in Simple XML format
<ClassificationResults>
<DocumentId>C495D353258440B487279767F9A16D02</DocumentId>
<DocumentDate>2012-06-06T15:59:46-05:00</DocumentDate>
<Entities>
<Entity>
<Authority>AP Person</Authority>
<AuthorityVersion>3420</AuthorityVersion>
<Name>LeBron James</Name>
<Id>http://cv.ap.org/id/7c05129d1a1741af8bcc326c9459545c</Id>
<Properties>
<PersonType>Professional Athlete</PersonType>
<PersonType>Sports Figure</PersonType>
<Team>Miami Heat</Team>
</Properties>
</Entity>
15. Subset of tags returned for article about NBA Finals game, in Simple XML format, cont.
<Entity>
<Authority>AP Organization</Authority>
<AuthorityVersion>3412</AuthorityVersion>
<Name>Miami Heat</Name>
<Id>http://cv.ap.org/id/8a85be975bf94cd18836b6eb5a1f6391</Id>
</Entity>
<Entity>
<Authority>AP Organization</Authority>
<AuthorityVersion>3412</AuthorityVersion>
<Name>NBA Eastern Conference</Name>
<Id>http://cv.ap.org/id/4a653a1806bc49518c5e667120a283e3</Id>
</Entity>
</Entities>
16. Subset of tags returned for article about NBA Finals game, in Simple XML format, cont.
<Subjects>
<Subject>
<Authority>AP Subject</Authority>
<AuthorityVersion>3415</AuthorityVersion>
<Name>NBA basketball</Name>
<Id>http://cv.ap.org/id/6c01c3e08c8010048288a13d9888b73e</Id>
</Subject>
<Subject>
<Authority>AP Subject</Authority>
<AuthorityVersion>3415</AuthorityVersion>
<Name>NBA Finals</Name>
<Id>http://cv.ap.org/id/fd862c8beea14e189c9a5617cf5c379c</Id>
</Subject>
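As an illustration of consuming the Simple XML output shown on the preceding slides, here is a small Python sketch that collects the tag names by authority. The response string is an abbreviated copy of the fragments above. The closing `</Subjects>` and `</ClassificationResults>` tags are assumed, since the slides show only a subset of the response, and `extract_tags` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated response based on the slide fragments. The closing </Subjects>
# and </ClassificationResults> tags are assumptions; the slides omit them.
RESPONSE = """<ClassificationResults>
  <DocumentId>C495D353258440B487279767F9A16D02</DocumentId>
  <Entities>
    <Entity>
      <Authority>AP Person</Authority>
      <Name>LeBron James</Name>
      <Id>http://cv.ap.org/id/7c05129d1a1741af8bcc326c9459545c</Id>
      <Properties>
        <PersonType>Professional Athlete</PersonType>
        <Team>Miami Heat</Team>
      </Properties>
    </Entity>
    <Entity>
      <Authority>AP Organization</Authority>
      <Name>Miami Heat</Name>
      <Id>http://cv.ap.org/id/8a85be975bf94cd18836b6eb5a1f6391</Id>
    </Entity>
  </Entities>
  <Subjects>
    <Subject>
      <Authority>AP Subject</Authority>
      <Name>NBA Finals</Name>
      <Id>http://cv.ap.org/id/fd862c8beea14e189c9a5617cf5c379c</Id>
    </Subject>
  </Subjects>
</ClassificationResults>"""


def extract_tags(xml_text):
    """Return (authority, name, id) tuples for every Entity and Subject."""
    root = ET.fromstring(xml_text)
    nodes = list(root.iter("Entity")) + list(root.iter("Subject"))
    return [
        (n.findtext("Authority"), n.findtext("Name"), n.findtext("Id"))
        for n in nodes
    ]


tags = extract_tags(RESPONSE)

# Group tag names by the authority (taxonomy facet) that produced them.
by_authority = {}
for authority, name, _uri in tags:
    by_authority.setdefault(authority, []).append(name)
print(by_authority)
```

Each tag carries a stable `Id` URI, so a publisher could store the URI rather than the display name and stay consistent when a term is renamed in the taxonomy.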
Historically, AP content had minimal descriptive metadata. Starting in 2005, AP began applying standard metadata across all content in order to improve access and enable new product development. The system needed to provide high accuracy and a high degree of control, scale to handle large volumes of content of different types, and not slow down editorial’s ability to get the news out quickly. We built a standard set of taxonomies and a rules-based automated classification system. As the service evolved, members asked us about the possibility of using AP’s systems for their own content. By 2011, the platform was mature enough, and we started the work to make our internal service available externally, using Semantic Web standards.
Targeted search and granular products: Allows customers to follow a company, person, or topic over time. Enables us to deliver specific products, e.g. Green technology, the Royal Wedding, 2012 Olympics hometown athletes, etc. Aggregate AP and member content.
Really a lightweight ontology – we model hierarchy, synonyms, relationships between entities, additional entity properties.
Rules-based system: each term in the taxonomy has an associated rule. People and Company tagging is based on mention, with significant disambiguation rules to ensure accuracy. Subject, Geo and Organization tagging is based on more complex rules, and strives for “aboutness.”
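To make the idea of mention-based tagging with disambiguation concrete, here is a toy Python sketch. It does not represent AP's actual rule language, which the slides do not describe. The `Rule` class, the context-cue mechanism, and the sample rules are all hypothetical, chosen only to show how one rule per term can accept a mention in one context and reject it in another.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule: a label must be mentioned, and if context cues are
    given, at least one of them must also appear in the text."""
    term: str
    labels: list
    context_cues: list = field(default_factory=list)

    def matches(self, text):
        lowered = text.lower()
        # Whole-word match on the preferred label or any synonym.
        mentioned = any(
            re.search(r"\b" + re.escape(label.lower()) + r"\b", lowered)
            for label in self.labels
        )
        if not mentioned:
            return False
        if not self.context_cues:
            return True
        # Disambiguation step: require supporting context for ambiguous names.
        return any(cue.lower() in lowered for cue in self.context_cues)


# One rule per taxonomy term (illustrative values only).
RULES = [
    Rule("Scott Walker (politician)",
         ["Scott Walker", "Scott K. Walker"],
         ["governor", "wisconsin"]),
    Rule("Miami Heat", ["Miami Heat"]),
]


def tag(text):
    """Return the taxonomy terms whose rules fire on the given text."""
    return [rule.term for rule in RULES if rule.matches(text)]


print(tag("Gov. Scott Walker spoke in Madison, Wisconsin."))
print(tag("Singer Scott Walker released a new album."))
```

The second sentence mentions the same name but lacks any supporting cue, so the politician tag is withheld. "Aboutness" rules for subjects would need more than this, for example weighting by position or frequency, which is beyond the scope of the sketch.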
Entity coverage continues to grow. The taxonomy comes to ~1.7 million triples.
High-level view of how services are integrated into AP’s pipeline. Taxonomy data is exposed in editorial tools and used for web site curation. Automated tagging happens downstream of editorial. Taxonomy data (i.e. synonyms and other properties) is integrated into search and browse within our portals. Because the services are available as APIs, publishers can integrate them into their own workflows in whatever way makes the most sense; APIs offer flexibility.