Data Curation at the New York Times

Digital Enterprise Research Institute www.deri.ie

Edward Curry, Andre Freitas, Seán O'Riain

ed.curry@deri.org
http://www.deri.org/
http://www.EdwardCurry.org/
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.

Speaker Profile

 Research Scientist at the Digital Enterprise Research
Institute (DERI)
 Leading international web science research organization
 Researching how the web of data is changing the way businesses
work and interact with information
 Projects include studies of enterprise linked data, community-
based data curation, semantic data analytics, and semantic
search
 Investigate utilization within the pharmaceutical, oil & gas,
financial, advertising, media, manufacturing, health care, ICT,
and automotive industries
 Invited speaker at the 2010 MIT Sloan CIO Symposium
to an audience of more than 600 CIOs

Overview

 Curation Background
 The Business Need for Curated Data
 What is Data Curation?
 Data Quality and Curation
 How to Curate Data

 New York Times Case Study

 Best Practices from Case Study Learning

The Business Need

 Knowledge workers need:
 Access to the right information
 Confidence in that information

 Working with incomplete,
inaccurate, or wrong
information can have
disastrous consequences

The Problems with Data

 Flawed Data
 Affects 25% of critical data in the world's top companies
(Gartner)

 Data Quality
 Recent banking crisis (Economist, Dec '09)
 Inaccurate figures made it difficult to manage operations
(investments exposure and risk)
– “assets are defined differently in different programs”
– “numbers did not always add up”
– “departments do not trust each other's figures”
– “figures … not worth the pixels they were made of”

What is Data Curation?

 Digital Curation
 Selection, preservation, maintenance, collection, and
archiving of digital assets

 Data Curation
 Active management of data over its life-cycle

 Data Curators
 Ensure data is trustworthy, discoverable, accessible,
reusable, and fit for use
– Museum cataloguers of the Internet age

What is Data Curation?

 Data Governance
 Convergence of data quality, data management,
business process management, and risk
management

 Data Curation is a complementary activity
 Part of the overall data governance strategy for an
organization

 Data Curator = Data Steward ??
 Overlapping terms between communities

Data Quality and Curation

 What is Data Quality?
 Desirable characteristics for information resource
 Described as a series of quality dimensions
– Discoverability, Accessibility, Timeliness, Completeness,
Interpretation, Accuracy, Consistency, Provenance &
Reputation

 Data curation can be used to improve these
quality dimensions


 Discoverability & Accessibility
 Curate to streamline search by storing and classifying
in appropriate and consistent manner

 Accuracy
 Curate to ensure data correctly represents the “real-
world” values it models

 Consistency
 Curate to ensure data created and maintained using
standardized definitions, calculations, terms, and
identifiers


 Provenance & Reputation
 Curate to track source of data and determine reputation
 Curate to include the objectivity of the source/producer
– Is the information unbiased, unprejudiced, and impartial?
– Or does it come from a reputable but partisan source?

Other dimensions discussed in chapter
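
As a rough illustration of how two of these dimensions might be measured before and after curation, the sketch below scores completeness and flags values that break a standardized term list (consistency). The record layout, field names, and allowed country codes are assumptions made for this example; they are not taken from the case study.

```python
def completeness(records: list[dict], required: list[str]) -> float:
    """Fraction of required fields that are filled in across all records."""
    total = len(records) * len(required)
    filled = sum(1 for r in records for f in required if r.get(f))
    return filled / total if total else 1.0


def inconsistent_values(records: list[dict], field_name: str, allowed: set[str]) -> list[str]:
    """Distinct non-empty values of a field that are not in the standardized term list."""
    return sorted({r[field_name] for r in records
                   if r.get(field_name) and r[field_name] not in allowed})


# Hypothetical records; "country" should use a standardized two-letter code.
records = [
    {"name": "Acme", "country": "US"},
    {"name": "Beta", "country": "U.S.A."},
    {"name": "Gamma", "country": ""},
]

print(completeness(records, ["name", "country"]))
print(inconsistent_values(records, "country", {"US", "IE"}))
```

A curator could use scores like these to decide where effort is needed and to check whether a curation pass actually improved the data.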

How to Curate Data

 Data Curation is a large field with sophisticated
techniques and processes

 Section provides high-level overview on:
 Should you curate data?
 Types of Curation
 Setting up a curation process

Additional detail and references available in book
chapter

Should You Curate Data?

 Curation can have multiple motivations
 Improving accessibility, quality, consistency,…

 Will the data benefit from curation?
 Identify business case
 Determine if the potential return supports the investment

 Not all enterprise data should be curated
 Suits knowledge-centric data rather than transactional
operations data

Types of Data Curation

 Multiple approaches to curate data, no single
correct way
 Who?
– Individual Curators
– Curation Departments
– Community-based Curation
 How?
– Manual Curation
– (Semi-)Automated
– Sheer Curation

Types of Data Curation – Who?

 Individual Data Curators
 Suitable for infrequently changing small quantity of
data
– (<1,000 records)
– Minimal curation effort (minutes per record)

Types of Data Curation – Who?

 Curation Departments
 Curation experts working with subject matter experts
to curate data within formal process
– Can deal with large curation effort (1,000s of records)

 Limitations
 Scalability: Can struggle with large quantities of
dynamic data (>million records)
 Availability: Post-hoc nature creates delay in curated
data availability

Types of Data Curation - Who?

 Community-Based Data Curation
 Decentralized approach to data curation
 Crowd-sourcing the curation process
– Leverages community of users to curate data
 Wisdom of the community (crowd)
 Can scale to millions of records

Types of Data Curation – How?

 Manual Curation
 Curators directly manipulate data
 Can tie users up with low-value-add activities

 (Semi-)Automated Curation
 Algorithms can (semi-)automate curation activities
such as data cleansing, record deduplication, and
classification
 Can be supervised or approved by human curators
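
A minimal sketch of semi-automated curation, assuming a simple string-similarity rule: the algorithm cleanses names and proposes likely duplicate pairs, and a human curator approves or rejects each proposal. The `normalize` and `suggest_duplicates` helpers and the 0.85 threshold are illustrative assumptions, not a method described in the talk.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Cleansing step: lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def suggest_duplicates(records: list[str], threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Return candidate duplicate pairs for a human curator to approve or reject."""
    suggestions = []
    for a, b in combinations(records, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            suggestions.append((a, b, round(score, 2)))
    return suggestions


# Hypothetical organization names.
records = ["New York Times Co.", "New York Times Co", "Thomson Reuters"]
for a, b, score in suggest_duplicates(records):
    print(f"Possible duplicate ({score}): {a!r} vs {b!r}")
```

The function only suggests merges; nothing is changed until a curator confirms, which keeps the human in the supervising role described above.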

Types of Data Curation – How?

 Sheer curation, or Curation at Source
 Curation activities integrated in normal workflow of
those creating and managing data
 Can be as simple as vetting or “rating” the results of a
curation algorithm
 Results can be available immediately

 Blended Approaches: Best of Both
 Sheer curation + post hoc curation department
 Allows immediate access to curated data
 Ensures quality control with expert curation

Setting up a Curation Process

 5 Steps to set up a curation process:
1 - Identify what data you need to curate
2 - Identify who will curate the data
3 - Define the curation workflow
4 - Identify appropriate data-in & data-out formats
5 - Identify the artifacts, tools, and processes needed to
support the curation process

The New York Times

100 Years of Expert Data Curation

The New York Times

 Largest metropolitan and third largest
newspaper in the United States

 nytimes.com
 Most popular newspaper
website in US

 100-year-old curated
repository defining its
participation in the
emerging Web of Data

The New York Times

 Data curation dates back to 1913
 Publisher/owner Adolph S. Ochs decided to provide a
set of additions to the newspaper
 New York Times Index
 Organized catalog of article titles and summaries
– Containing issue, date and column of article
– Categorized by subject and names
– Introduced on quarterly then annual basis
 Transitory content of newspaper became
important source of searchable historical data
 Often used to settle historical debates

The New York Times

 Index Department was created in 1913
 Curation and cataloguing of NYT resources
– Since 1851 NYT had low quality index for internal use

 Developed a comprehensive catalog using a
controlled vocabulary
 Covering subjects, personal names, organizations,
geographic locations and titles of creative works
(books, movies, etc), linked to articles and their
summaries

 Current Index Dept. has ~15 people

The New York Times

 Challenges with consistently and accurately
classifying news articles over time
 Keywords expressing subjects may show some
variance due to cultural or legal constraints
 Identities of some entities, such as organizations and
places, changed over time

 Controlled vocabulary grew to hundreds of
thousands of categories
 Adding complexity to classification process

The New York Times

 Increased importance of Web drove need to
improve categorization of online content

 Curation carried out by Index Department
 Library-time (days to weeks)
 Print edition can handle next-day index

 Not suitable for real-time online publishing
 nytimes.com needed a same-day index

The New York Times

 Introduced two stage curation process
 Editorial staff perform best-effort semi-automated
sheer curation at point of online publication
– Several hundred journalists
 Index Department follows up with long-term accurate
classification and archiving

 Benefits:
 Non-expert journalist curators provide instant
accessibility to online users
 Index Department provides long-term high-quality
curation in a “trust but verify” approach

NYT Curation Workflow

 Curation starts with article getting out of the newsroom


 Member of editorial staff submits article to web-based, rule-
based information extraction system (SAS Teragram)


 Teragram uses linguistic extraction rules based on subset of
Index Dept.'s controlled vocabulary


 Teragram suggests tags based on the Index vocabulary that
can potentially describe the content of article


 Editorial staff member selects terms that best describe the
contents and inserts new tags if necessary


 Reviewed by the taxonomy managers with feedback to
editorial staff on classification process


 Article is published online at nytimes.com


 At a later stage, the article receives second-level curation by the
Index Dept.: additional Index tags and a summary


 Article is submitted to NYT Index
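
The two-stage workflow above can be sketched in code. This is a simplified illustration only: it does not use or imitate the SAS Teragram API, the vocabulary entries and trigger phrases are hypothetical, the taxonomy-manager review step is omitted, and plain substring matching stands in for real linguistic extraction rules.

```python
from dataclasses import dataclass, field

# Hypothetical slice of a controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Banking and Financial Institutions": ["bank", "lender"],
    "Elections": ["election", "ballot"],
    "Motion Pictures": ["film", "movie"],
}


@dataclass
class Article:
    text: str
    tags: list[str] = field(default_factory=list)
    summary: str = ""
    stage: str = "draft"


def suggest_tags(text: str) -> list[str]:
    """Stage 1a: rule-based suggestion of vocabulary tags whose phrases occur in the text."""
    lowered = text.lower()
    return [tag for tag, phrases in VOCABULARY.items()
            if any(phrase in lowered for phrase in phrases)]


def editor_review(article: Article, suggested, rejected=(), added=()) -> Article:
    """Stage 1b: editorial staff keep, drop, or add tags, then the article is published."""
    article.tags = [t for t in suggested if t not in rejected] + list(added)
    article.stage = "published"
    return article


def index_review(article: Article, extra_tags, summary: str) -> Article:
    """Stage 2: later expert curation adds further tags and a summary."""
    article.tags += [t for t in extra_tags if t not in article.tags]
    article.summary = summary
    article.stage = "indexed"
    return article


article = Article("The bank financed a new film about the election.")
suggested = suggest_tags(article.text)
editor_review(article, suggested, rejected=["Motion Pictures"])
index_review(article, ["Documentary Films"], "A bank funds an election documentary.")
print(article.stage, article.tags)
```

The point of the split is that the article is usable online as soon as `editor_review` runs, while `index_review` can happen days later without blocking publication.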

The New York Times

 Early adopter of Linked Open Data (June '09)

The New York Times

 Linked Open Data @ data.nytimes.com
 Subset of 10,000 tags from index vocabulary
 Dataset of people, organizations & locations
– Complemented by search services to consume data
about articles, movies, best sellers, Congress votes,
real estate,…
 Benefits
 Improves traffic by third party data usage
 Lowers development cost of new applications for
different verticals inside the website
– E.g. movies, travel, sports, books

Overview

 Curation Background
 The Business Need for Curated Data
 What is Data Curation?
 Data Quality and Curation
 How to Curate Data

 Case Study New York Times

 Best Practices from Case Study Learning

Best Practices from Case Study
Learning

 Social Best Practices
 Participation
 Engagement
 Incentives
 Community Governance Models

 Technical Best Practices
 Data Representation
 Human and Automated Curation
 Track Provenance

Social Best Practices

 Participation
 Stakeholder involvement for data producers and
consumers must occur early in the project
– Provides insight into basic questions of what they want
to do, for whom, and what it will provide
 White papers are effective means to present these
ideas, and solicit opinion from community
– Can be used to establish informal 'social contract' for
community


 Engagement
 Outreach activities essential for promotion and
feedback
 Typical contributor-to-consumer ratios are less than
5%
 Social communication and networking forums are
useful
– Majority of community may not communicate using
these media
– Communication by email still remains important


 Incentives
 Sheer curation needs line of sight from data curating
activity, to tangible exploitation benefits
 Lack of awareness of value proposition will slow
emergence of collaborative contributions
 Recognizing contributing curators through a formal
feedback mechanism
– Reinforces contribution culture
– Directly increases output quality


 Community Governance Models
 Effective governance structure is vital to ensure
success of community
 Internal communities and consortia perform well
when they leverage traditional corporate and
democratic governance models
 Open communities need to engage the community
within the governance process
– Follow less orthodox approaches using meritocratic
and autocratic principles

Technical Best Practices

 Data Representation
 Must be robust and standardized to encourage
community usage and tools development
 Support for legacy data formats and ability to
translate data forward to support new technology and
standards
 Human & Automated Curation
 Balancing will improve data quality
 Automated curation should always defer to, and never
override, human curation edits
– Automate validating data deposition and entry
– Target community at focused curation tasks
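
One way to read the "defer to human edits" rule is as a merge policy: automated suggestions fill or update only those fields that no human curator has touched. The sketch below assumes each record tracks a set of human-edited field names; that bookkeeping and the `apply_automated_edits` helper are assumptions for illustration.

```python
def apply_automated_edits(record: dict, human_edited: set[str], automated: dict) -> dict:
    """Apply automated suggestions only to fields no human curator has edited."""
    merged = dict(record)  # leave the original record untouched
    for field_name, value in automated.items():
        if field_name in human_edited:
            continue  # human curation always wins
        merged[field_name] = value
    return merged


# Hypothetical record: a curator has already fixed "name" by hand.
record = {"name": "N.Y. Times", "city": "New York", "country": ""}
human_edited = {"name"}
automated = {"name": "New York Times", "country": "United States"}

print(apply_automated_edits(record, human_edited, automated))
```

Here the automated value for `country` is accepted, while the automated change to `name` is dropped because a human already edited that field.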

Technical Best Practices

 Track Provenance
 All curation activities should be recorded and
maintained as part of the data provenance effort
– Especially where human curators are involved
 Users can have different perspectives of provenance
– A scientist may need to evaluate the fine grained
experiment description behind the data
– For a business analyst the 'brand' of the data provider can
be sufficient for determining quality
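
A minimal sketch of recording curation activities as provenance, assuming an append-only log kept alongside the curated records. The `ProvenanceEntry` fields and the `CuratedStore` class are hypothetical names chosen for this example; a production system would more likely build on an established provenance vocabulary.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str
    field_name: str
    old_value: str
    new_value: str
    agent: str       # who made the change
    agent_type: str  # "human" or "automated"
    timestamp: str   # UTC time in ISO 8601 format


class CuratedStore:
    """Holds curated records plus an append-only log of every curation action."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}
        self.log: list[ProvenanceEntry] = []

    def curate(self, record_id, field_name, new_value, agent, agent_type) -> ProvenanceEntry:
        record = self.records.setdefault(record_id, {})
        entry = ProvenanceEntry(
            record_id, field_name, record.get(field_name, ""), new_value,
            agent, agent_type, datetime.now(timezone.utc).isoformat(),
        )
        record[field_name] = new_value
        self.log.append(entry)
        return entry

    def history(self, record_id: str) -> list[ProvenanceEntry]:
        """All logged curation actions for one record, oldest first."""
        return [e for e in self.log if e.record_id == record_id]


store = CuratedStore()
store.curate("article-1", "tag", "Elections", "rule-based tagger", "automated")
store.curate("article-1", "tag", "Politics", "index editor", "human")
for entry in store.history("article-1"):
    print(entry.agent_type, entry.old_value, "->", entry.new_value)
```

The same log can serve both audiences: a detailed reader can walk the full history, while a summary view could expose only who the curating organization was.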

Conclusions

 Data curation can ensure the quality of data and
its fitness for use
 Pre-competitive data can be shared without
conferring a commercial advantage
 Pre-competitive data communities
 Common curation tasks carried out once in public
domain
 Reduces cost, increases quantity and quality

Acknowledgements

 Collaborators Andre Freitas & Seán O'Riain

 Insight from Thought Leaders
 Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product
Development and Management), and Gregg Fenton (Director Emerging Platforms)
from the New York Times
 Krista Thomas (Vice President, Marketing & Communications), Tom Tague
(OpenCalais initiative Lead) from Thomson Reuters
 Antony Williams (VP of Strategic Development) from ChemSpider
 Helen Berman (Director), John Westbrook (Product Development) from the Protein
Data Bank
 Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.

 The work presented has been funded by Science
Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-
2).

Further Information

The Role of Community-Driven
Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain

In David Wood (ed.),
Linking Enterprise Data, Springer, 2010.
Available Free at:
http://3roundstones.com/led_book/led-curry-et-al.html
