Schema.org news markup
Overall type of the object on this page, in HTML head
Headline, dateline, date as additions to div/span properties
Byline expressed as nested object (using itemscope) of type schema.org/Person
Driving application: “rich snippets”
Schema.org covers not just news but also music, restaurants, people, organizations, and more
Snippets, and better searchability generally, are the motivation for Google, Yahoo, and Bing to push schema.org
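A minimal sketch of what such markup might look like, and how a crawler could read it back. The headline and dateline are borrowed from the NYT example in these notes; the date, the byline "Jane Reporter", and the `ItempropCollector` class are placeholders invented for illustration. Here the type is declared on a wrapping div, and nesting is ignored for brevity.

```python
from html.parser import HTMLParser

# Illustrative markup only: the date and byline are placeholders.
ARTICLE_HTML = """
<div itemscope itemtype="http://schema.org/NewsArticle">
  <h1 itemprop="headline">Syria's Rebels Open Talks on Forging United Political Front</h1>
  <span itemprop="dateline">BEIRUT, Lebanon</span>
  <time itemprop="datePublished" datetime="2012-01-01">Jan. 1, 2012</time>
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Jane Reporter</span>
  </span>
</div>
"""

class ItempropCollector(HTMLParser):
    """Collect (itemprop, text) pairs; a toy reader, not a full microdata parser."""

    def __init__(self):
        super().__init__()
        self.stack = []   # itemprop (or None) for each currently open element
        self.props = []   # collected (property name, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(dict(attrs).get("itemprop"))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when the innermost open element carries an itemprop.
        if text and self.stack and self.stack[-1]:
            self.props.append((self.stack[-1], text))

collector = ItempropCollector()
collector.feed(ARTICLE_HTML)
collector.close()
for name, text in collector.props:
    print(f"{name}: {text}")
```

The nested Person object shows up here only through its `name` property; a real consumer would keep the nesting so the name stays attached to the author.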
Additional metadata from indexing team
In database, but doesn't necessarily make it to HTML.
Application: content navigation
Articles about “Syria” on NYT topic page
More reliable than simple text search (because the relevance algorithm knows a story is
Application: automatic stories
Wall Street is high on Molson Coors Brewing (TAP), expecting it to report earnings that
are up 17.5% from a year ago when it reports its third quarter earnings on Wednesday,
November 7, 2012. The consensus estimate is $1.34 per share, up from earnings of
$1.14 per share a year ago.
The consensus estimate has dipped over the past month, from $1.35, but it’s still up from
the consensus estimate of $1.19 three months ago. For the fiscal year, analysts are
expecting earnings of $3.89 per share. Revenue is projected to eclipse the year-earlier
total of $954.4 million by 31%, finishing at $1.25 billion for the quarter. For the year,
revenue is projected to roll in at $4.04 billion.
The company’s net income has declined in the last two quarters. The company posted
profit falling by 52.8% in the second quarter. This is after it reported a profit decline in the
first quarter by 4.1%.
Automatic story generation (AP/Narrative Science)
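A rough sketch of how an earnings preview like the one above could be produced from structured data with a fill-in template. This is not Narrative Science's actual method, which these notes do not describe. The function names and the dictionary layout are assumptions; the figures are taken from the example story.

```python
def pct_change(new, old):
    """Percentage change from old to new."""
    return (new - old) / old * 100

def earnings_preview(d):
    """Fill a one-sentence template from a dict of structured earnings data."""
    change = pct_change(d["eps_estimate"], d["eps_year_ago"])
    direction = "up" if change > 0 else "down"
    return (
        f"Wall Street expects {d['company']} ({d['ticker']}) to report earnings "
        f"of ${d['eps_estimate']:.2f} per share, {direction} {abs(change):.1f}% "
        f"from ${d['eps_year_ago']:.2f} a year ago."
    )

# Hypothetical record layout; the numbers come from the story text.
data = {
    "company": "Molson Coors Brewing",
    "ticker": "TAP",
    "eps_estimate": 1.34,
    "eps_year_ago": 1.14,
}

story = earnings_preview(data)
print(story)
```

The template reproduces the 17.5% figure in the story: (1.34 - 1.14) / 1.14 is about 0.175.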
News as relations between entities
“Alice attended the wedding”
“IBM was founded in 1911.”
“Hurricane Sandy hit New York”
Encode facts as relation(subject,object)
also written (subject relation object)
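One simple way to hold such facts in code is a list of three-field tuples. A minimal sketch, with the `Triple` type, the relation names, and the `about` helper chosen here for illustration:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One fact in (subject relation object) form."""
    subject: str
    relation: str
    obj: str

# Two of the example sentences, encoded by hand.
facts = [
    Triple("Alice", "attended", "the wedding"),
    Triple("Hurricane Sandy", "hit", "New York"),
]

def about(subject, facts):
    """Return every fact whose subject matches exactly."""
    return [t for t in facts if t.subject == subject]

print(about("Hurricane Sandy", facts))
```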
Things we could do with this
“The granddaughter of which actor starred in E.T.?”
(?x acted-in “E.T.”)(?y is-a actor)(?x granddaughter-of ?y)
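A sketch of how a query like this could be answered by pattern matching over triples. Names starting with `?` are variables; everything else must match exactly. The matcher is a naive backtracking search written for illustration, not how a production triple store works, and the small fact list is entered by hand.

```python
def match(pattern, fact, bindings):
    """Unify one pattern with one fact; return extended bindings, or None."""
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new

def query(patterns, facts, bindings=None):
    """Yield every variable binding that satisfies all patterns."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    for fact in facts:
        extended = match(patterns[0], fact, bindings)
        if extended is not None:
            yield from query(patterns[1:], facts, extended)

# Hand-entered facts for illustration.
facts = [
    ("Drew Barrymore", "acted-in", "E.T."),
    ("Henry Thomas", "acted-in", "E.T."),
    ("John Barrymore", "is-a", "actor"),
    ("Drew Barrymore", "granddaughter-of", "John Barrymore"),
]

patterns = [
    ("?x", "acted-in", "E.T."),
    ("?y", "is-a", "actor"),
    ("?x", "granddaughter-of", "?y"),
]

results = list(query(patterns, facts))
print(results)
```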
(bob brother-of alice)
(alice mother-of lucy) =>
(bob uncle-of lucy)
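The rule above can be sketched as a single forward-chaining step: whenever (a brother-of b) and (b mother-of c) are both present, derive (a uncle-of c). The function name `infer_uncles` is made up for this example, and only this one rule is implemented.

```python
def infer_uncles(facts):
    """Derive uncle-of facts from brother-of and mother-of facts."""
    derived = set()
    for a, rel1, b in facts:
        if rel1 != "brother-of":
            continue
        for b2, rel2, c in facts:
            # The brother's sibling must be the same person as the mother.
            if rel2 == "mother-of" and b2 == b:
                derived.add((a, "uncle-of", c))
    return derived

facts = {
    ("bob", "brother-of", "alice"),
    ("alice", "mother-of", "lucy"),
}

new_facts = infer_uncles(facts)
print(new_facts)
```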
Answer questions using inference
“how many executives of publicly-traded Canadian companies died in car
Every big news org has their own
topics, people, organizations, places...
Enter Linked Data
Triples of (subject relation object), each a URL or literal
Abbreviations possible with many formats...
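A small sketch of both ideas: triples whose terms are URLs or literals, and prefix abbreviations that expand to full URLs. The output follows the line-based N-Triples layout. The `ex:` namespace and the helper names are invented for illustration, and the rule "anything starting with http is a URL, everything else is a literal" is a simplifying assumption.

```python
# Prefix table: "ex" is a made-up namespace; "rdfs" is the RDF Schema vocabulary.
PREFIXES = {
    "ex": "http://example.org/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand(name):
    """Expand an abbreviated name like ex:Syria to a full URL."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return name

def term(value):
    """Format one term: URLs in angle brackets, anything else as a quoted literal."""
    value = expand(value)
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def to_ntriples(triples):
    """Serialize (subject, relation, object) tuples, one statement per line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)

triples = [
    ("ex:article1", "ex:about", "ex:Syria"),
    ("ex:Syria", "rdfs:label", "Syria"),
]

output = to_ntriples(triples)
print(output)
```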
NYT API can return linked data
"title": "Syria's Rebels Open Talks on Forging United Political Front"
"body": "BEIRUT, Lebanon — Syria’s fractious opposition groups began
negotiations in Doha, Qatar, on Sunday to forge a more unified front to reshape
the political landscape in a bloody conflict that claims more than 100 lives
virtually every day. Given the scant prospects that any attempt to restructure
the opposition will succeed — the",
"facet_terms": "CLINTON, HILLARY RODHAM ASSAD, BASHAR AL- SYRIA DOHA
(QATAR) SYRIAN NATIONAL COUNCIL STATE DEPARTMENT WAR AND REVOLUTION DEFENSE AND
Objects and relations in text?
names, dates, places, verbs.
Named Entity Recognition
Extract subjects and objects from text.
Also, resolve pronouns if possible.
"Gov. Andrew M. Cuomo on Wednesday gave a sea wall the
nod. Because of the recent history of powerful storms hitting the
area, he said, elected officials have a responsibility to consider
new and innovative plans to prevent similar damage in the
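A toy sketch of both steps on the passage above: entity lookup against a hand-built list, then resolving "he" to the nearest preceding person. Real named entity recognition relies on trained statistical models; the `GAZETTEER` list and both helper functions are assumptions made for this example.

```python
import re

# Hypothetical hand-built lookup list of known entities and their types.
GAZETTEER = {"Andrew M. Cuomo": "PERSON", "Wednesday": "DATE"}

def find_entities(text):
    """Return (position, name, label) for every gazetteer hit, in text order."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), name, label))
    return sorted(found)

def resolve_pronouns(text, entities):
    """Map each 'he'/'she' to the nearest preceding PERSON (a crude heuristic)."""
    people = [(pos, name) for pos, name, label in entities if label == "PERSON"]
    resolved = []
    for m in re.finditer(r"\b(he|she)\b", text, flags=re.IGNORECASE):
        earlier = [name for pos, name in people if pos < m.start()]
        if earlier:
            resolved.append((m.group(0), earlier[-1]))
    return resolved

text = (
    "Gov. Andrew M. Cuomo on Wednesday gave a sea wall the nod. "
    "Because of the recent history of powerful storms hitting the area, "
    "he said, elected officials have a responsibility to consider new and "
    "innovative plans"
)

entities = find_entities(text)
print([(name, label) for _, name, label in entities])
print(resolve_pronouns(text, entities))
```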
Relations from sentence parsing
“The water that made rivers of Avenues C and D receded
on Tuesday, and the East Village was a mixture of disaster
and nonchalance. A group of young men in pajama pants
and shorts threw a football on East 12th Street, while
workers pumped the basement of CHP Hardware on
Avenue C and Eighth Street.”
subject verb object
(water made rivers of Avenues C and D)
(East Village was a mixture of disaster and nonchalance)
(group of young men in pajama pants and shorts threw football)
(workers pumped the basement of CHP Hardware)
Do we have all of these in the ontology?
“General Question Answering”
Precision/recall tradeoff. State of the art is IBM’s DeepQA
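A small sketch of the tradeoff, assuming the system attaches a confidence to each answer and only answers above a threshold. Precision is taken here as correct answers over questions answered, and recall as correct answers over all questions; that definition, the data, and the thresholds are all made up for illustration.

```python
# (question, system answer, confidence, true answer): made-up data.
CANDIDATES = [
    ("q1", "A", 0.9, "A"),
    ("q2", "B", 0.8, "B"),
    ("q3", "C", 0.4, "D"),
    ("q4", "E", 0.3, "E"),
]

def precision_recall(threshold):
    """Answer only when confidence >= threshold, then score the answers."""
    answered = [c for c in CANDIDATES if c[2] >= threshold]
    right = sum(1 for _, given, _, truth in answered if given == truth)
    precision = right / len(answered) if answered else 0.0
    recall = right / len(CANDIDATES)
    return precision, recall

for threshold in (0.5, 0.0):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

With the high threshold the system is always right but answers only half the questions; answering everything raises recall and lowers precision.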
DeepQA use of structured data
“Watson can also use detected relations to query a triple store and
directly generate candidate answers. Due to the breadth of relations in
the Jeopardy domain and the variety of ways in which they are
expressed, however, Watson’s current ability to effectively use curated
databases to simply “look up” the answers is limited to fewer than 2
percent of the clues.”
- Ferrucci et al., “Building Watson”
Taught at Columbia Journalism School, Fall 2018
Full syllabus and lecture videos at http://www.compjournalism.com/?p=218