Data Curation at the New York Times

The New York Times is the largest metropolitan and the third largest newspaper in the United States. The Times website, nytimes.com, is ranked as the most popular newspaper website in the United States and is an important source of advertising revenue for the company. The NYT has a rich history of curating its articles, and its 100-year-old curated repository has ultimately defined its participation as one of the first players in the emerging Web of Data.

Data curation is a process that can ensure the quality of data and its fitness for use. Traditional approaches to curation are struggling with increased data volumes and near real-time demands for curated data. In response, curation teams have turned to community crowd-sourcing and semi-automated metadata tools for assistance.

E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47.

Transcript

  • 1. Digital Enterprise Research Institute www.deri.ie Data Curation at the New York Times Edward Curry, Andre Freitas, Seán ORiain ed.curry@deri.org http://www.deri.org/ http://www.EdwardCurry.org/ Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  • 2. Speaker Profile
      • Research Scientist at the Digital Enterprise Research Institute (DERI)
      • Leading international web science research organization
      • Researching how the web of data is changing the way businesses work and interact with information
      • Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
      • Investigating utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
      • Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
  • 3. Overview
      • Curation Background
      • The Business Need for Curated Data
      • What is Data Curation?
      • Data Quality and Curation
      • How to Curate Data
      • New York Times Case Study
      • Best Practices from Case Study Learning
  • 4. The Business Need
      • Knowledge workers need:
          - Access to the right information
          - Confidence in that information
      • Working with incomplete, inaccurate, or wrong information can have disastrous consequences
  • 5. The Problems with Data
      • Flawed Data
          - Affects 25% of critical data in the world's top companies (Gartner)
      • Data Quality
          - Recent banking crisis (Economist, Dec '09)
          - Inaccurate figures made it difficult to manage operations (investment exposure and risk)
          - "asset are defined differently in different programs"
          - "numbers did not always add up"
          - "departments do not trust each other's figures"
          - "figures … not worth the pixels they were made of"
  • 6. What is Data Curation?
      • Digital Curation
          - Selection, preservation, maintenance, collection, and archiving of digital assets
      • Data Curation
          - Active management of data over its life-cycle
      • Data Curators
          - Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
          - Museum cataloguers of the Internet age
  • 7. What is Data Curation?
      • Data Governance
          - Convergence of data quality, data management, business process management, and risk management
      • Data Curation is a complementary activity
          - Part of the overall data governance strategy for an organization
      • Data Curator = Data Steward ??
          - Overlapping terms between communities
  • 8. Data Quality and Curation
      • What is Data Quality?
          - Desirable characteristics for an information resource
          - Described as a series of quality dimensions: Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
      • Data curation can be used to improve these quality dimensions
  • 9. Data Quality and Curation
      • Discoverability & Accessibility
          - Curate to streamline search by storing and classifying data in an appropriate and consistent manner
      • Accuracy
          - Curate to ensure data correctly represents the "real-world" values it models
      • Consistency
          - Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
  • 10. Data Quality and Curation
      • Provenance & Reputation
          - Curate to track the source of data and determine its reputation
          - Curate to include the objectivity of the source/producer
          - Is the information unbiased, unprejudiced, and impartial?
          - Or does it come from a reputable but partisan source?
      • Other dimensions are discussed in the chapter
  • 11. How to Curate Data
      • Data Curation is a large field with sophisticated techniques and processes
      • This section provides a high-level overview of:
          - Should you curate data?
          - Types of Curation
          - Setting up a curation process
      • Additional detail and references are available in the book chapter
  • 12. Should You Curate Data?
      • Curation can have multiple motivations
          - Improving accessibility, quality, consistency, …
      • Will the data benefit from curation?
          - Identify the business case
          - Determine whether the potential return supports the investment
      • Not all enterprise data should be curated
          - Suits knowledge-centric data rather than transactional operations data
  • 13. Types of Data Curation
      • Multiple approaches to curating data, no single correct way
      • Who?
          - Individual Curators
          - Curation Departments
          - Community-based Curation
      • How?
          - Manual Curation
          - (Semi-)Automated
          - Sheer Curation
  • 14. Types of Data Curation - Who?
      • Individual Data Curators
          - Suitable for an infrequently changing, small quantity of data (<1,000 records)
          - Minimal curation effort (minutes per record)
  • 15. Types of Data Curation - Who?
      • Curation Departments
          - Curation experts working with subject matter experts to curate data within a formal process
          - Can deal with large curation effort (thousands of records)
      • Limitations
          - Scalability: can struggle with large quantities of dynamic data (>million records)
          - Availability: post-hoc nature creates a delay in curated data availability
  • 16. Types of Data Curation - Who?
      • Community-Based Data Curation
          - Decentralized approach to data curation
          - Crowd-sourcing the curation process
          - Leverages a community of users to curate data
      • Wisdom of the community (crowd)
      • Can scale to millions of records
  • 17. Types of Data Curation - How?
      • Manual Curation
          - Curators directly manipulate data
          - Can tie users up with low-value-add activities
      • (Semi-)Automated Curation
          - Algorithms can (semi-)automate curation activities such as data cleansing, record duplication and classification
          - Can be supervised or approved by human curators
  • 18. Types of Data Curation - How?
      • Sheer curation, or Curation at Source
          - Curation activities integrated into the normal workflow of those creating and managing data
          - Can be as simple as vetting or "rating" the results of a curation algorithm
          - Results can be available immediately
      • Blended Approaches: Best of Both
          - Sheer curation + post hoc curation department
          - Allows immediate access to curated data
          - Ensures quality control with expert curation
  • 19. Setting up a Curation Process
      • 5 steps to set up a curation process:
          1. Identify what data you need to curate
          2. Identify who will curate the data
          3. Define the curation workflow
          4. Identify appropriate data-in & data-out formats
          5. Identify the artifacts, tools, and processes needed to support the curation process
  • 20. The New York Times
      • 100 Years of Expert Data Curation
  • 21. The New York Times
      • Largest metropolitan and third largest newspaper in the United States
      • nytimes.com
          - Most popular newspaper website in the US
      • 100-year-old curated repository defining its participation in the emerging Web of Data
  • 22. The New York Times
      • Data curation dates back to 1913
          - Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper
      • New York Times Index
          - Organized catalog of article titles and summaries
          - Containing issue, date and column of each article
          - Categorized by subject and names
          - Introduced on a quarterly, then annual, basis
      • Transitory content of the newspaper became an important source of searchable historical data
          - Often used to settle historical debates
  • 23. The New York Times
      • Index Department was created in 1913
          - Curation and cataloguing of NYT resources
          - Since 1851 the NYT had a low-quality index for internal use
      • Developed a comprehensive catalog using a controlled vocabulary
          - Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc.), linked to articles and their summaries
      • Current Index Dept. has ~15 people
  • 24. The New York Times
      • Challenges with consistently and accurately classifying news articles over time
          - Keywords expressing subjects may show some variance due to cultural or legal constraints
          - Identities of some entities, such as organizations and places, changed over time
      • Controlled vocabulary grew to hundreds of thousands of categories
          - Adding complexity to the classification process
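One way to picture the problem of entities whose names change over time (slide 24) is a controlled vocabulary that maps every variant label to a single canonical term. The sketch below is illustrative only: the `ControlledVocabulary` class, the matching rule, and the "Acme" labels are hypothetical and do not describe how the NYT Index is implemented.

```python
from typing import Dict, Optional


class ControlledVocabulary:
    """Hypothetical vocabulary mapping variant labels to one canonical term."""

    def __init__(self) -> None:
        self._canonical_by_label: Dict[str, str] = {}

    @staticmethod
    def _normalize(label: str) -> str:
        # Assumed matching rule: ignore case and collapse runs of whitespace.
        return " ".join(label.lower().split())

    def add_term(self, canonical: str, *variants: str) -> None:
        # Register the canonical label and each former or alternative name.
        for label in (canonical, *variants):
            self._canonical_by_label[self._normalize(label)] = canonical

    def resolve(self, label: str) -> Optional[str]:
        # Return the canonical term, or None if the label is not in the vocabulary.
        return self._canonical_by_label.get(self._normalize(label))


# Made-up organization that was renamed over time.
vocab = ControlledVocabulary()
vocab.add_term("Acme Holdings", "Acme Widget Company", "Acme Widgets Inc.")

resolved = vocab.resolve("  acme   widget company ")
unknown = vocab.resolve("Some Unlisted Entity")
```

Returning `None` for unknown labels, rather than guessing, leaves the decision to add a new category with a human curator.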
  • 25. The New York Times
      • Increased importance of the Web drove the need to improve categorization of online content
      • Curation carried out by the Index Department
          - Library-time (days to weeks)
          - Print edition can handle a next-day index
          - Not suitable for real-time online publishing
          - nytimes.com needed a same-day index
  • 26. The New York Times
      • Introduced a two-stage curation process
          - Editorial staff perform best-effort, semi-automated sheer curation at the point of online publication (several hundred journalists)
          - Index Department follows up with long-term accurate classification and archiving
      • Benefits:
          - Non-expert journalist curators provide instant accessibility to online users
          - Index Department provides long-term, high-quality curation in a "trust but verify" approach
  • 27. NYT Curation Workflow
      • Curation starts with the article getting out of the newsroom
  • 28. NYT Curation Workflow
      • A member of the editorial staff submits the article to a web-based, rule-based information extraction system (SAS Teragram)
  • 29. NYT Curation Workflow
      • Teragram uses linguistic extraction rules based on a subset of the Index Dept's controlled vocabulary
  • 30. NYT Curation Workflow
      • Teragram suggests tags based on the Index vocabulary that can potentially describe the content of the article
  • 31. NYT Curation Workflow
      • The editorial staff member selects the terms that best describe the contents and inserts new tags if necessary
  • 32. NYT Curation Workflow
      • Reviewed by the taxonomy managers, with feedback to editorial staff on the classification process
  • 33. NYT Curation Workflow
      • Article is published online at nytimes.com
  • 34. NYT Curation Workflow
      • At a later stage, the article receives second-level curation by the Index Dept.: additional Index tags and a summary
  • 35. NYT Curation Workflow
      • Article is submitted to the NYT Index
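The two-stage workflow on slides 27 to 35 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `RULES` table is a toy keyword matcher standing in for the rule-based extraction system, not SAS Teragram's actual rules or API, and the `Article` type, function names, and example tags are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Article:
    headline: str
    body: str
    tags: Set[str] = field(default_factory=set)
    summary: str = ""
    indexed: bool = False


# Assumed extraction rules: controlled-vocabulary tag -> trigger phrases.
RULES: Dict[str, List[str]] = {
    "Baseball": ["baseball", "home run"],
    "Elections": ["election", "ballot"],
}


def suggest_tags(article: Article, rules: Dict[str, List[str]]) -> List[str]:
    """Stage 1a: suggest vocabulary tags whose trigger phrases occur in the text."""
    text = f"{article.headline} {article.body}".lower()
    return sorted(tag for tag, phrases in rules.items()
                  if any(phrase in text for phrase in phrases))


def editorial_review(article: Article, suggested: Iterable[str],
                     accepted: Set[str], new_tags: Iterable[str] = ()) -> Article:
    """Stage 1b: the editor keeps the suggestions they accept and may add new tags."""
    article.tags = {tag for tag in suggested if tag in accepted} | set(new_tags)
    return article


def index_review(article: Article, extra_tags: Iterable[str], summary: str) -> Article:
    """Stage 2: the index team later adds tags and a summary, then files the article."""
    article.tags |= set(extra_tags)
    article.summary = summary
    article.indexed = True
    return article


article = Article("Late home run decides the game",
                  "The ballot for the all-star team opens Monday.")
suggested = suggest_tags(article, RULES)
# The editor rejects the spurious "Elections" suggestion and adds a tag of their own.
editorial_review(article, suggested, accepted={"Baseball"}, new_tags=["All-Star Game"])
# ... the article is published online here; second-level curation happens afterwards ...
index_review(article, ["Major League Baseball"], "Game decided by a late home run.")
```

The spurious "Elections" suggestion shows why the automated step only proposes tags: the human reviewer decides which ones are kept.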
  • 36. The New York Times
      • Early adopter of Linked Open Data (June '09)
  • 37. The New York Times
      • Linked Open Data @ data.nytimes.com
          - Subset of 10,000 tags from the index vocabulary
          - Dataset of people, organizations & locations
          - Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate, …
      • Benefits
          - Improves traffic through third-party data usage
          - Lowers the development cost of new applications for different verticals inside the website (e.g. movies, travel, sports, books)
  • 38. Overview
      • Curation Background
      • The Business Need for Curated Data
      • What is Data Curation?
      • Data Quality and Curation
      • How to Curate Data
      • Case Study: New York Times
      • Best Practices from Case Study Learning
  • 39. Best Practices from Case Study Learning
      • Social Best Practices
          - Participation
          - Engagement
          - Incentives
          - Community Governance Models
      • Technical Best Practices
          - Data Representation
          - Human and Automated Curation
          - Track Provenance
  • 40. Social Best Practices
      • Participation
          - Stakeholder involvement for data producers and consumers must occur early in the project
          - Provides insight into basic questions of what they want to do, for whom, and what it will provide
          - White papers are an effective means to present these ideas and solicit opinion from the community
          - Can be used to establish an informal 'social contract' for the community
  • 41. Social Best Practices
      • Engagement
          - Outreach activities are essential for promotion and feedback
          - Typical consumers-to-contributors ratios of less than 5%
          - Social communication and networking forums are useful
          - Majority of the community may not communicate using these media
          - Communication by email still remains important
  • 42. Social Best Practices
      • Incentives
          - Sheer curation needs a line of sight from the data curating activity to tangible exploitation benefits
          - Lack of awareness of the value proposition will slow the emergence of collaborative contributions
          - Recognizing contributing curators through a formal feedback mechanism reinforces the contribution culture and directly increases output quality
  • 43. Social Best Practices
      • Community Governance Models
          - An effective governance structure is vital to ensure the success of the community
          - Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models
          - Open communities need to engage the community within the governance process
          - Follow less orthodox approaches using meritocratic and autocratic principles
  • 44. Technical Best Practices
      • Data Representation
          - Must be robust and standardized to encourage community usage and tools development
          - Support for legacy data formats and the ability to translate data forward to support new technology and standards
      • Human & Automated Curation
          - Balancing will improve data quality
          - Automated curation should always defer to, and never override, human curation edits
          - Automate validating data deposition and entry
          - Target the community at focused curation tasks
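The rule on slide 44 that automated curation defers to human edits can be expressed as a simple precedence check. The sketch below is one possible reading, not a prescribed implementation: the `FieldValue` type, the `"human"` / `"automated"` source labels, and `apply_edit` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FieldValue:
    value: str
    source: str  # assumed labels: "human" or "automated"


def apply_edit(record: Dict[str, FieldValue], name: str, new: FieldValue) -> bool:
    """Apply an edit unless it is automated and would overwrite a human edit.

    Returns True if the edit was applied, False if it was rejected.
    """
    current = record.get(name)
    if current is not None and current.source == "human" and new.source == "automated":
        return False  # automated curation never overrides a human curator
    record[name] = new
    return True


record: Dict[str, FieldValue] = {}
first = apply_edit(record, "subject", FieldValue("Sports", "automated"))
second = apply_edit(record, "subject", FieldValue("Baseball", "human"))
third = apply_edit(record, "subject", FieldValue("Sports", "automated"))
```

Keeping the source next to each value is what makes the check possible, which ties this rule to the provenance tracking discussed on the next slide.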
  • 45. Technical Best Practices
      • Track Provenance
          - All curation activities should be recorded and maintained as part of the data provenance effort
          - Especially where human curators are involved
      • Users can have different perspectives on provenance
          - A scientist may need to evaluate the fine-grained experiment description behind the data
          - For a business analyst, the 'brand' of the data provider can be sufficient for determining quality
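Recording every curation activity, as slide 45 recommends, can be as simple as an append-only event log. The following sketch assumes a minimal event shape (record, action, agent, agent type, timestamp); the `ProvenanceEvent` and `ProvenanceLog` names and their fields are hypothetical, and a real deployment might instead adopt an established provenance vocabulary.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class ProvenanceEvent:
    record_id: str
    action: str
    agent: str
    agent_type: str  # assumed labels: "human" or "automated"
    timestamp: str   # ISO 8601, UTC


class ProvenanceLog:
    """Append-only log of curation activities (illustrative only)."""

    def __init__(self) -> None:
        self._events: List[ProvenanceEvent] = []

    def record(self, record_id: str, action: str, agent: str,
               agent_type: str) -> ProvenanceEvent:
        event = ProvenanceEvent(record_id, action, agent, agent_type,
                                datetime.now(timezone.utc).isoformat())
        self._events.append(event)
        return event

    def history(self, record_id: str) -> List[ProvenanceEvent]:
        # All events for one record, in the order they were logged.
        return [e for e in self._events if e.record_id == record_id]

    def to_json(self) -> str:
        return json.dumps([asdict(e) for e in self._events], indent=2)


log = ProvenanceLog()
log.record("article-1", "suggest_tags", "rule-engine", "automated")
log.record("article-1", "select_tags", "editor-42", "human")
log.record("article-2", "suggest_tags", "rule-engine", "automated")
article_history = log.history("article-1")
```

A detailed view such as `history` serves the scientist-style reader who needs fine-grained detail, while a summary of who the agents were may be enough for the business analyst.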
  • 46. Conclusions
      • Data curation can ensure the quality of data and its fitness for use
      • Pre-competitive data can be shared without conferring a commercial advantage
      • Pre-competitive data communities
          - Common curation tasks carried out once in the public domain
          - Reduces cost, increases quantity and quality
  • 47. Acknowledgements
      • Collaborators Andre Freitas & Seán ORiain
      • Insight from Thought Leaders
          - Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times
          - Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters
          - Antony Williams (VP of Strategic Development) from ChemSpider
          - Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank
          - Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance
      • The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
  • 48. Further Information
      • The Role of Community-Driven Data Curation for Enterprises
      • Edward Curry, Andre Freitas, & Seán ORiain
      • In David Wood (ed.), Linking Enterprise Data, Springer, 2010
      • Available free at: http://3roundstones.com/led_book/led-curry-et-al.html