Digital Enterprise Research Institute                                         www.deri.ie




                                                  Data Curation at the
                                                   New York Times
                      Edward Curry, Andre Freitas, Seán O'Riain




 ed.curry@deri.org
 http://www.deri.org/
 http://www.EdwardCurry.org/
 Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Speaker Profile



            Research Scientist at the Digital Enterprise Research
             Institute (DERI)
                   Leading international web science research organization
            Researching how the web of data is changing the way businesses
             work and interact with information
                   Projects include studies of enterprise linked data, community-
                    based data curation, semantic data analytics, and semantic
                    search
                   Investigate utilization within the pharmaceutical, oil & gas,
                    financial, advertising, media, manufacturing, health care, ICT,
                    and automotive industries
            Invited speaker at the 2010 MIT Sloan CIO Symposium
             to an audience of more than 600 CIOs
Overview



            Curation Background
                   The Business Need for Curated Data
                   What is Data Curation?
                   Data Quality and Curation
                   How to Curate Data


            New York Times Case Study

            Best Practices from Case Study Learning
The Business Need



               Knowledge workers need:
                   Access              to the right information
                   Confidence              in that information


               Working with incomplete,
                inaccurate, or wrong
                information can have
                disastrous consequences
The Problems with Data



          Flawed Data
             Affects 25% of critical data in the world's top companies
                 (Gartner)

          Data Quality
             Recent banking crisis (Economist, Dec '09)
             Inaccurate   figures made it difficult to manage operations
                 (investments exposure and risk)
                    –   “asset are defined differently in different programs”
                    –   “numbers did not always add up”
                    –   “departments do not trust each other's figures”
                    –   “figures … not worth the pixels they were made of”
What is Data Curation?


        Digital Curation
            Selection,    preservation, maintenance, collection, and
                archiving of digital assets

        Data Curation
            Active             management of data over its life-cycle

        Data Curators
            Ensure    data is trustworthy, discoverable, accessible,
                reusable, and fit for use
                   – Museum cataloguers of the Internet age
What is Data Curation?




            Data Governance
                Convergence     of data quality, data management,
                    business process management, and risk
                    management

            Data Curation is a complementary activity
                Part of the overall data governance strategy for an
                    organization

            Data Curator = Data Steward ??
                   Overlapping terms between communities
Data Quality and Curation



            What is Data Quality?
                Desirable              characteristics for information resource
                Described              as a series of quality dimensions
                       – Discoverability, Accessibility, Timeliness, Completeness,
                         Interpretation, Accuracy, Consistency, Provenance &
                         Reputation

            Data curation can be used to improve these
             quality dimensions
Data Quality and Curation



            Discoverability & Accessibility
                Curate    to streamline search by storing and classifying
                    in appropriate and consistent manner

            Accuracy
                Curate     to ensure data correctly represents the “real-
                    world” values it models

            Consistency
                Curate to ensure data is created and maintained using
                    standardized definitions, calculations, terms, and
                    identifiers
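
Some of these dimensions can be measured directly on records. The following is a minimal sketch of a completeness score and a consistency check. The field names, the list of allowed identifiers, and the sample records are all hypothetical and are not taken from the chapter.

```python
from typing import Iterable

# Hypothetical standardized identifiers and required fields (illustrative only).
ALLOWED_COUNTRIES = {"US", "IE", "GB"}
REQUIRED_FIELDS = ("name", "country")


def completeness(records: list[dict], fields: Iterable[str] = REQUIRED_FIELDS) -> float:
    """Fraction of required fields that are present and non-empty."""
    fields = tuple(fields)
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
    return filled / total


def inconsistent(records: list[dict]) -> list[dict]:
    """Records whose country is set but is not a standardized identifier."""
    return [r for r in records
            if r.get("country") and r["country"] not in ALLOWED_COUNTRIES]


records = [
    {"name": "Acme", "country": "US"},
    {"name": "Globex", "country": "United States"},  # non-standard identifier
    {"name": "", "country": "IE"},                   # missing name
]
score = completeness(records)
flagged = inconsistent(records)
print(score, flagged)
```

A curator would use the flagged records as a work queue rather than correcting them automatically.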
Data Quality and Curation




            Provenance & Reputation
                Curate                 to track source of data and determine reputation
                Curate                 to include the objectivity of the source/producer
                       – Is the information unbiased, unprejudiced, and impartial?
                       – Or does it come from a reputable but partisan source?




                       Other dimensions discussed in chapter
How to Curate Data




            Data Curation is a large field with sophisticated
             techniques and processes

            Section provides high-level overview on:
                Should                 you curate data?
                Types             of Curation
                Setting                up a curation process


               Additional detail and references available in book
               chapter
Should You Curate Data?




            Curation can have multiple motivations
                Improving                accessibility, quality, consistency,…

            Will the data benefit from curation?
                Identify               business case
                Determine whether the potential return supports the investment

            Not all enterprise data should be curated
                Suits   knowledge-centric data rather than transactional
                    operations data
Types of Data Curation



            Multiple approaches to curate data, no single
             correct way
                Who?
                       – Individual Curators
                       – Curation Departments
                       – Community-based Curation
                How?
                       – Manual Curation
                       – (Semi-)Automated
                       – Sheer Curation
Types of Data Curation – Who?




            Individual Data Curators
                Suitable for small quantities of infrequently changing
                    data
                       – (<1,000 records)
                       – Minimal curation effort (minutes per record)
Types of Data Curation – Who?


            Curation Departments
                Curation     experts working with subject matter experts
                    to curate data within formal process
                       – Can deal with large curation effort (thousands of records)

            Limitations
                Scalability: Can struggle with large quantities of
                    dynamic data (> 1 million records)
                Availability:  Post-hoc nature creates delay in curated
                    data availability
Types of Data Curation - Who?



            Community-Based Data Curation
                Decentralized               approach to data curation
                Crowd-sourcing                the curation process
                       – Leverages community of users to curate data
                Wisdom                 of the community (crowd)
                Can           scale to millions of records
Types of Data Curation – How?



            Manual Curation
                Curators               directly manipulate data
                Can tie users up with low-value-add activities

            (Semi-)Automated Curation
                Algorithms can (semi-)automate curation activities
                    such as data cleansing, record deduplication, and
                    classification
                Can           be supervised or approved by human curators
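
As an illustration of semi-automated curation with human approval, the sketch below proposes candidate duplicate records and leaves the decision to a curator. It is an assumption-laden example and not the method used in the case study: string similarity from Python's `difflib` stands in for whatever matching algorithm a real system would use, and the 0.85 threshold and sample names are arbitrary.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def suggest_duplicates(names: list[str], threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Propose candidate duplicate pairs; nothing is merged automatically."""
    suggestions = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            suggestions.append((a, b, round(score, 2)))
    return suggestions


def review(suggestions, approve):
    """Keep only the pairs that a human curator approves."""
    return [(a, b) for a, b, _ in suggestions if approve(a, b)]


names = ["New York Times", "new york  times", "The Guardian"]
candidates = suggest_duplicates(names)
# The approve callback stands in for a curator's decision in a review UI.
approved = review(candidates, approve=lambda a, b: True)
print(candidates, approved)
```

Keeping suggestion and approval as separate functions makes it easy to change the matching rule without changing who has the final say.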
Types of Data Curation – How?



            Sheer curation, or Curation at Source
                Curation    activities integrated in normal workflow of
                    those creating and managing data
                Can     be as simple as vetting or “rating” the results of a
                    curation algorithm
                Results                can be available immediately

            Blended Approaches: Best of Both
                Sheer             curation + post hoc curation department
                Allows             immediate access to curated data
                Ensures                quality control with expert curation
Setting up a Curation Process




            5 steps to set up a curation process:
               1 - Identify what data you need to curate
               2 - Identify who will curate the data
               3 - Define the curation workflow
               4 - Identify appropriate data-in & data-out formats
               5 - Identify the artifacts, tools, and processes needed to
                   support the curation process
The New York Times




                             100 Years of Expert Data Curation
The New York Times


            Largest metropolitan and third largest
             newspaper in the United States


            nytimes.com
                    Most popular newspaper
                     website in US

            100-year-old curated
             repository defining its
             participation in the
             emerging Web of Data
The New York Times


       Data curation dates back to 1913
           Publisher/owner      Adolph S. Ochs decided to provide a
               set of additions to the newspaper
       New York Times Index
           Organized catalog of article titles and summaries
                  – Containing issue, date and column of article
                  – Categorized by subject and names
                  – Introduced on quarterly then annual basis
       Transitory content of newspaper became
        important source of searchable historical data
           Often            used to settle historical debates
The New York Times


              Index Department was created in 1913
                Curation               and cataloguing of NYT resources
                       – Since 1851 NYT had low quality index for internal use

            Developed a comprehensive catalog using a
             controlled vocabulary
                Covering    subjects, personal names, organizations,
                    geographic locations and titles of creative works
                    (books, movies, etc), linked to articles and their
                    summaries

            Current Index Dept. has ~15 people
The New York Times



            Challenges with consistently and accurately
             classifying news articles over time
                Keywords     expressing subjects may show some
                    variance due to cultural or legal constraints
                Identities   of some entities, such as organizations and
                    places, changed over time

            Controlled vocabulary grew to hundreds of
             thousands of categories
                Adding                 complexity to classification process
The New York Times




            Increased importance of Web drove need to
             improve categorization of online content

            Curation carried out by Index Department
                Library-time           (days to weeks)
                Print          edition can handle next-day index

            Not suitable for real-time online publishing
                nytimes.com            needed a same-day index
The New York Times


            Introduced a two-stage curation process
                Editorial staff perform best-effort, semi-automated
                    sheer curation at the point of online publication
                       – Several hundred journalists
                Index Department follows up with long-term, accurate
                    classification and archiving

            Benefits:
                Non-expert      journalist curators provide instant
                    accessibility to online users
                Index    Department provides long-term high-quality
                    curation in a “trust but verify” approach
NYT Curation Workflow

  1 - Curation starts when an article leaves the newsroom
  2 - A member of the editorial staff submits the article to a web-based,
      rule-based information extraction system (SAS Teragram)
  3 - Teragram uses linguistic extraction rules based on a subset of the
      Index Department's controlled vocabulary
  4 - Teragram suggests tags from the Index vocabulary that can
      potentially describe the content of the article
  5 - The editorial staff member selects the terms that best describe the
      contents and inserts new tags if necessary
  6 - Taxonomy managers review the tags and give feedback to the
      editorial staff on the classification process
  7 - The article is published online at nytimes.com
  8 - At a later stage the article receives second-level curation by the
      Index Department, which adds further Index tags and a summary
  9 - The article is submitted to the NYT Index
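
The two-stage workflow can be sketched in a few lines. This is an illustrative model only: the simple phrase matching below stands in for a rule-based extraction system and is not Teragram's API, the vocabulary and article text are invented, and the taxonomy-manager review step is omitted for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical subset of a controlled vocabulary: tag -> trigger phrases.
VOCAB_RULES = {
    "Baseball": ["baseball", "yankees"],
    "Elections": ["election", "ballot"],
}


@dataclass
class Article:
    text: str
    tags: set[str] = field(default_factory=set)
    summary: str = ""
    published: bool = False


def suggest_tags(article: Article) -> set[str]:
    """Rule-based suggestion: propose a tag if any trigger phrase appears."""
    text = article.text.lower()
    return {tag for tag, phrases in VOCAB_RULES.items()
            if any(p in text for p in phrases)}


def editorial_pass(article: Article, selected: set[str], new_tags: set[str] = frozenset()) -> None:
    """Stage 1 (sheer curation): the editor keeps suggested tags and may add new ones."""
    article.tags = set(selected) | set(new_tags)
    article.published = True


def index_pass(article: Article, extra_tags: set[str], summary: str) -> None:
    """Stage 2 (post hoc): index staff add tags and a summary after publication."""
    article.tags |= extra_tags
    article.summary = summary


article = Article("The Yankees won as the city prepared for the election.")
suggested = suggest_tags(article)
editorial_pass(article, selected={"Baseball"}, new_tags={"New York City"})
index_pass(article, {"Elections"}, "Yankees win ahead of city election.")
print(sorted(article.tags))
```

The article is available with usable tags as soon as the editorial pass completes, and the later index pass only adds to them.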
The New York Times


           Early adopter of Linked Open Data (June '09)
The New York Times


    Linked Open Data @ data.nytimes.com
        Subset               of 10,000 tags from index vocabulary
        Dataset               of people, organizations & locations
               – Complemented by search services to consume data
                 about articles, movies, best sellers, Congress votes,
                 real estate,…
    Benefits
        Improves                  traffic by third party data usage
        Lowers      development cost of new applications for
            different verticals inside the website
               – E.g. movies, travel, sports, books
Overview



            Curation Background
                   The Business Need for Curated Data
                   What is Data Curation?
                   Data Quality and Curation
                   How to Curate Data


            Case Study New York Times

            Best Practices from Case Study Learning
Best Practices from Case Study Learning


            Social Best Practices
                Participation
                Engagement
                Incentives
                Community                Governance Models

            Technical Best Practices
                Data           Representation
                Human and Automated Curation
                Track            Provenance
Social Best Practices




            Participation
                Stakeholder involvement of data producers and
                    consumers must occur early in the project
                       – Provides insight into basic questions of what they want
                         to do, for whom, and what it will provide
                White     papers are effective means to present these
                    ideas, and solicit opinion from community
                       – Can be used to establish an informal 'social contract' for
                         community
Social Best Practices




            Engagement
                Outreach                 activities essential for promotion and
                    feedback
                Typical contributor-to-consumer ratios are less than
                    5%
                Social            communication and networking forums are
                    useful
                       – Majority of community may not communicate using
                         these media
                       – Communication by email still remains important
Social Best Practices




            Incentives
                Sheer curation needs a clear line of sight from the data
                    curation activity to tangible exploitation benefits
                Lack   of awareness of value proposition will slow
                    emergence of collaborative contributions
                Recognizing   contributing curators through a formal
                    feedback mechanism
                       – Reinforces contribution culture
                       – Directly increases output quality
Social Best Practices




            Community Governance Models
                Effective  governance structure is vital to ensure
                    success of community
                Internal communities and consortia perform well
                    when they leverage traditional corporate and
                    democratic governance models
                Open      communities need to engage the community
                    within the governance process
                       – Follow less orthodox approaches using meritocratic
                         and autocratic principles
Technical Best Practices

            Data Representation
                Must   be robust and standardized to encourage
                    community usage and tools development
                Support     for legacy data formats and ability to
                    translate data forward to support new technology and
                    standards
            Human & Automated Curation
                Balancing the two will improve data quality
                Automated      curation should always defer to, and never
                    override, human curation edits
                       – Automate validating data deposition and entry
                       – Target community at focused curation tasks
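
The rule that automated curation defers to human edits can be enforced at merge time. A minimal sketch, assuming records are plain dictionaries and that human edits are kept separately from automated suggestions (both are assumptions made for illustration):

```python
def merge_curation(record: dict, automated: dict, human_edits: dict) -> dict:
    """Apply automated suggestions, then human edits, so human values always win."""
    merged = dict(record)
    for key, value in automated.items():
        if key not in human_edits:  # never override a field a human has edited
            merged[key] = value
    merged.update(human_edits)
    return merged


record = {"title": "nyt index", "category": "news"}
automated = {"title": "NYT Index", "category": "media"}
human_edits = {"category": "Newspapers"}

result = merge_curation(record, automated, human_edits)
print(result)
```

Storing human edits apart from the base record means an automated job can be rerun at any time without losing a curator's corrections.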
Technical Best Practices



            Track Provenance
                All curation activities should be recorded and
                    maintained as part of the data provenance effort
                       – Especially where human curators are involved
                Users             can have different perspectives of provenance
                       – A scientist may need to evaluate the fine grained
                         experiment description behind the data
                       – For a business analyst the 'brand' of the data provider can
                         be sufficient for determining quality
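
One way to record curation activities is an append-only log that supports both a fine-grained and a coarse view of provenance. The sketch below is a hypothetical design: the class names, fields, and sample entries are assumptions made for illustration and do not describe any system from the case study.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str
    field_name: str
    old_value: str
    new_value: str
    curator: str        # person or algorithm that made the change
    is_human: bool
    timestamp: datetime


class ProvenanceLog:
    """Append-only log of curation activities."""

    def __init__(self) -> None:
        self._entries: list[ProvenanceEntry] = []

    def record(self, record_id, field_name, old_value, new_value,
               curator, is_human) -> ProvenanceEntry:
        entry = ProvenanceEntry(record_id, field_name, old_value, new_value,
                                curator, is_human, datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def history(self, record_id: str) -> list[ProvenanceEntry]:
        """Fine-grained view: every change made to one record, oldest first."""
        return [e for e in self._entries if e.record_id == record_id]

    def sources(self, record_id: str) -> set[str]:
        """Coarse view: only who contributed to the record."""
        return {e.curator for e in self.history(record_id)}


log = ProvenanceLog()
log.record("a1", "tags", "", "Baseball", "rule-engine", False)
log.record("a1", "tags", "Baseball", "Baseball; Elections", "index-staff", True)
print(len(log.history("a1")), sorted(log.sources("a1")))
```

The `history` view suits a user who needs the full trail behind a value, while `sources` suits a user for whom knowing the provider is enough.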
Conclusions




        Data curation can ensure the quality of data and
         its fitness for use
        Pre-competitive data can be shared without
         conferring a commercial advantage
        Pre-competitive data communities
                Common                 curation tasks carried out once in public
                    domain
                Reduces cost, increases quantity and quality
Acknowledgements


        Collaborators Andre Freitas & Seán O'Riain

        Insight from Thought Leaders
               Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product
                Development and Management), and Gregg Fenton (Director Emerging Platforms)
                from the New York Times
               Krista Thomas (Vice President, Marketing & Communications), Tom Tague
                (OpenCalais initiative Lead) from Thomson Reuters
               Antony Williams (VP of Strategic Development) from ChemSpider
               Helen Berman (Director), John Westbrook (Product Development) from the Protein
                Data Bank
               Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.

        The work presented has been funded by Science
         Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-
         2).
Further Information


The Role of Community-Driven
Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain




  In David Wood (ed.),
  Linking Enterprise Data, Springer, 2010.
  Available Free at:
  http://3roundstones.com/led_book/led-curry-et-al.html

 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
 
Crowdsourcing Approaches for Smart City Open Data Management
Crowdsourcing Approaches for Smart City Open Data ManagementCrowdsourcing Approaches for Smart City Open Data Management
Crowdsourcing Approaches for Smart City Open Data Management
 
Big Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –ReviewBig Data and Big Data Management (BDM) with current Technologies –Review
Big Data and Big Data Management (BDM) with current Technologies –Review
 

Data Curation at the New York Times

  • 1. Data Curation at the New York Times
      Digital Enterprise Research Institute, www.deri.ie
      Edward Curry, Andre Freitas, Seán O'Riain
      ed.curry@deri.org, http://www.deri.org/, http://www.EdwardCurry.org/
      Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  • 2. Speaker Profile
      - Research Scientist at the Digital Enterprise Research Institute (DERI)
          - Leading international web science research organization
      - Researching how the web of data is changing the way businesses work and interact with information
          - Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
          - Investigates utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
      - Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
  • 3. Overview
      - Curation Background
          - The Business Need for Curated Data
          - What is Data Curation?
          - Data Quality and Curation
          - How to Curate Data
      - New York Times Case Study
      - Best Practices from Case Study Learning
  • 4. The Business Need
      - Knowledge workers need:
          - Access to the right information
          - Confidence in that information
      - Working with incomplete, inaccurate, or wrong information can have disastrous consequences
  • 5. The Problems with Data
      - Flawed Data
          - Affects 25% of critical data in the world’s top companies (Gartner)
      - Data Quality
          - Recent banking crisis (Economist, Dec ’09)
          - Inaccurate figures made it difficult to manage operations (investment exposure and risk)
              - “assets are defined differently in different programs”
              - “numbers did not always add up”
              - “departments do not trust each other’s figures”
              - “figures … not worth the pixels they were made of”
  • 6. What is Data Curation?
      - Digital Curation
          - Selection, preservation, maintenance, collection, and archiving of digital assets
      - Data Curation
          - Active management of data over its life-cycle
      - Data Curators
          - Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
              - Museum cataloguers of the Internet age
  • 7. What is Data Curation?
      - Data Governance
          - Convergence of data quality, data management, business process management, and risk management
      - Data Curation is a complementary activity
          - Part of the overall data governance strategy for an organization
      - Data Curator = Data Steward?
          - Overlapping terms between communities
  • 8. Data Quality and Curation
      - What is Data Quality?
          - Desirable characteristics for information resource
          - Described as a series of quality dimensions
              - Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
      - Data curation can be used to improve these quality dimensions
  • 9. Data Quality and Curation
      - Discoverability & Accessibility
          - Curate to streamline search by storing and classifying data in an appropriate and consistent manner
      - Accuracy
          - Curate to ensure data correctly represents the “real-world” values it models
      - Consistency
          - Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
  • 10. Data Quality and Curation
      - Provenance & Reputation
          - Curate to track source of data and determine reputation
          - Curate to include the objectivity of the source/producer
              - Is the information unbiased, unprejudiced, and impartial?
              - Or does it come from a reputable but partisan source?
      - Other dimensions discussed in chapter
  • 11. How to Curate Data
      - Data Curation is a large field with sophisticated techniques and processes
      - Section provides high-level overview on:
          - Should you curate data?
          - Types of Curation
          - Setting up a curation process
      - Additional detail and references available in book chapter
  • 12. Should You Curate Data?
      - Curation can have multiple motivations
          - Improving accessibility, quality, consistency, …
      - Will the data benefit from curation?
          - Identify the business case
          - Determine if the potential return supports the investment
      - Not all enterprise data should be curated
          - Suits knowledge-centric data rather than transactional operations data
  • 13. Types of Data Curation
      - Multiple approaches to curate data, no single correct way
          - Who?
              - Individual Curators
              - Curation Departments
              - Community-based Curation
          - How?
              - Manual Curation
              - (Semi-)Automated
              - Sheer Curation
  • 14. Types of Data Curation – Who?
      - Individual Data Curators
          - Suitable for a small quantity of infrequently changing data
              - (<1,000 records)
              - Minimal curation effort (minutes per record)
  • 15. Types of Data Curation – Who?
      - Curation Departments
          - Curation experts working with subject matter experts to curate data within a formal process
              - Can deal with large curation effort (1,000s of records)
      - Limitations
          - Scalability: Can struggle with large quantities of dynamic data (>1 million records)
          - Availability: Post-hoc nature creates delay in curated data availability
  • 16. Types of Data Curation – Who?
      - Community-Based Data Curation
          - Decentralized approach to data curation
          - Crowd-sourcing the curation process
              - Leverages community of users to curate data
          - Wisdom of the community (crowd)
          - Can scale to millions of records
  • 17. Types of Data Curation – How?
      - Manual Curation
          - Curators directly manipulate data
          - Can tie users up with low-value-add activities
      - (Semi-)Automated Curation
          - Algorithms can (semi-)automate curation activities such as data cleansing, record deduplication, and classification
          - Can be supervised or approved by human curators
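The (semi-)automated approach on slide 17 can be illustrated with a minimal Python sketch. It is not taken from the presentation: the function names, the sample thresholds, and the use of `difflib` string similarity are all assumptions chosen for illustration. Near-identical records are queued for automatic merging, while borderline pairs go to a human curator for approval.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def triage_duplicates(records, auto_threshold=0.98, review_threshold=0.80):
    """Split candidate duplicate pairs into auto-merge and human-review queues.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    auto_merge, needs_review = [], []
    for a, b in combinations(records, 2):
        score = similarity(a, b)
        if score >= auto_threshold:
            auto_merge.append((a, b))      # safe enough to merge automatically
        elif score >= review_threshold:
            needs_review.append((a, b))    # a human curator approves or rejects
    return auto_merge, needs_review


if __name__ == "__main__":
    names = ["New York Times", "new york  times", "New York Time", "Washington Post"]
    merged, review = triage_duplicates(names)
    print("auto-merge:", merged)
    print("needs review:", review)
```

Comparing every pair is quadratic in the number of records, so a sketch like this only suits small collections; larger ones would need some form of blocking or indexing first.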
  • 18. Types of Data Curation – How? Digital Enterprise Research Institute www.deri.ie  Sheer curation, or Curation at Source  Curation activities integrated in normal workflow of those creating and managing data  Can be as simple as vetting or “rating” the results of a curation algorithm  Results can be available immediately  Blended Approaches: Best of Both  Sheer curation + post hoc curation department  Allows immediate access to curated data  Ensures quality control with expert curation
  • 19. Setting up a Curation Process Digital Enterprise Research Institute www.deri.ie  5 Steps to set up a curation process: 1 - Identify what data you need to curate 2 - Identify who will curate the data 3 - Define the curation workflow 4 - Identify appropriate data-in & data-out formats 5 - Identify the artifacts, tools, and processes needed to support the curation process
  • 20. The New York Times Digital Enterprise Research Institute www.deri.ie 100 Years of Expert Data Curation
  • 21. The New York Times Digital Enterprise Research Institute www.deri.ie  Largest metropolitan and third largest newspaper in the United States  nytimes.com  Most popular newspaper website in US  100-year-old curated repository defining its participation in the emerging Web of Data
  • 22. The New York Times Digital Enterprise Research Institute www.deri.ie  Data curation dates back to 1913  Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper  New York Times Index  Organized catalog of article titles and summaries – Containing issue, date and column of article – Categorized by subject and names – Introduced on quarterly then annual basis  Transitory content of newspaper became important source of searchable historical data  Often used to settle historical debates
  • 23. The New York Times Digital Enterprise Research Institute www.deri.ie  Index Department was created in 1913  Curation and cataloguing of NYT resources – Since 1851 NYT had low quality index for internal use  Developed a comprehensive catalog using a controlled vocabulary  Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc), linked to articles and their summaries  Current Index Dept. has ~15 people
  • 24. The New York Times Digital Enterprise Research Institute www.deri.ie  Challenges with consistently and accurately classifying news articles over time  Keywords expressing subjects may show some variance due to cultural or legal constraints  Identities of some entities, such as organizations and places, changed over time  Controlled vocabulary grew to hundreds of thousands of categories  Adding complexity to classification process
  • 25. The New York Times Digital Enterprise Research Institute www.deri.ie  Increased importance of Web drove need to improve categorization of online content  Curation carried out by Index Department  Library-time (days to weeks)  Print edition can handle next-day index  Not suitable for real-time online publishing  nytimes.com needed a same-day index
  • 26. The New York Times Digital Enterprise Research Institute www.deri.ie  Introduced two-stage curation process  Editorial staff performed best-effort semi-automated sheer curation at point of online pub. – Several hundred journalists  Index Department follows up with long-term accurate classification and archiving  Benefits:  Non-expert journalist curators provide instant accessibility to online users  Index Department provides long-term high-quality curation in a “trust but verify” approach
  • 27. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  Curation starts with article getting out of the newsroom
  • 28. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  Member of editorial staff submits article to web-based, rule-based information extraction system (SAS Teragram)
  • 29. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  Teragram uses linguistic extraction rules based on subset of Index Dept.'s controlled vocab.
  • 30. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  Teragram suggests tags based on the Index vocabulary that can potentially describe the content of article
  • 31. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  Editorial staff member selects terms that best describe the contents and inserts new tags if necessary
  • 32. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  Reviewed by the taxonomy managers with feedback to editorial staff on classification process
  • 33. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  Article is published online at nytimes.com
  • 34. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  At later stage article receives second-level curation by Index Dept.: additional Index tags and a summary
  • 35. NYT Curation Workflow Digital Enterprise Research Institute www.deri.ie  Article is submitted to NYT Index
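The two-stage workflow on slides 27 to 35 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NYT implementation: the vocabulary, the `Article` class, and the function names are all hypothetical, and plain substring matching stands in for the linguistic extraction rules of the real system (SAS Teragram), whose API is not shown here.

```python
from dataclasses import dataclass, field

# Hypothetical slice of a controlled vocabulary: tag -> lowercase trigger phrases.
VOCABULARY = {
    "Elections": ["election", "ballot"],
    "Baseball": ["baseball", "yankees"],
    "New York City": ["new york city", "manhattan"],
}


@dataclass
class Article:
    text: str
    tags: set = field(default_factory=set)
    summary: str = ""
    stage: str = "newsroom"


def suggest_tags(text, vocabulary=VOCABULARY):
    """Suggest vocabulary tags whose trigger phrases occur in the text."""
    lowered = text.lower()
    return sorted(
        tag for tag, phrases in vocabulary.items()
        if any(phrase in lowered for phrase in phrases)
    )


def editorial_review(article, accepted, new_tags=()):
    """Stage 1 (sheer curation): staff keep fitting suggestions and may add new tags."""
    suggested = set(suggest_tags(article.text))
    article.tags = (suggested & set(accepted)) | set(new_tags)
    article.stage = "published"
    return article


def index_review(article, extra_tags, summary):
    """Stage 2 (post hoc): the index department adds tags and a summary."""
    article.tags |= set(extra_tags)
    article.summary = summary
    article.stage = "indexed"
    return article


article = Article("The Yankees won in Manhattan on election night.")
suggestions = suggest_tags(article.text)
editorial_review(article, accepted=["Baseball", "New York City"], new_tags=["Sports"])
index_review(article, extra_tags=["Elections"], summary="Yankees win on election night.")
```

The article is usable as soon as `editorial_review` returns, which mirrors the same-day availability the website needed; `index_review` only adds to what the first stage produced.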
  • 36. The New York Times Digital Enterprise Research Institute www.deri.ie  Early adopter of Linked Open Data (June '09)
  • 37. The New York Times Digital Enterprise Research Institute www.deri.ie  Linked Open Data @ data.nytimes.com  Subset of 10,000 tags from index vocabulary  Dataset of people, organizations & locations – Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate,…  Benefits  Improves traffic by third party data usage  Lowers development cost of new applications for different verticals inside the website – E.g. movies, travel, sports, books
  • 38. Overview Digital Enterprise Research Institute www.deri.ie  Curation Background  The Business Need for Curated Data  What is Data Curation?  Data Quality and Curation  How to Curate Data  Case Study New York Times  Best Practices from Case Study Learning
  • 39. Best Practices from Case Study Learning Digital Enterprise Research Institute www.deri.ie  Social Best Practices  Participation  Engagement  Incentives  Community Governance Models  Technical Best Practices  Data Representation  Human- and Automated Curation  Track Provenance
  • 40. Social Best Practices Digital Enterprise Research Institute www.deri.ie  Participation  Stakeholder involvement for data producers and consumers must occur early in project – Provides insight into basic questions of what they want to do, for whom, and what it will provide  White papers are effective means to present these ideas, and solicit opinion from community – Can be used to establish informal 'social contract' for community
  • 41. Social Best Practices Digital Enterprise Research Institute www.deri.ie  Engagement  Outreach activities essential for promotion and feedback  Typical contributor-to-consumer ratios of less than 5%  Social communication and networking forums are useful – Majority of community may not communicate using these media – Communication by email still remains important
  • 42. Social Best Practices Digital Enterprise Research Institute www.deri.ie  Incentives  Sheer curation needs line of sight from data curating activity, to tangible exploitation benefits  Lack of awareness of value proposition will slow emergence of collaborative contributions  Recognizing contributing curators through a formal feedback mechanism – Reinforces contribution culture – Directly increases output quality
  • 43. Social Best Practices Digital Enterprise Research Institute www.deri.ie  Community Governance Models  Effective governance structure is vital to ensure success of community  Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models  Open communities need to engage the community within the governance process – Follow less orthodox approaches using meritocratic and autocratic principles
  • 44. Technical Best Practices Digital Enterprise Research Institute www.deri.ie  Data Representation  Must be robust and standardized to encourage community usage and tools development  Support for legacy data formats and ability to translate data forward to support new technology and standards  Human & Automated Curation  Balancing will improve data quality  Automated curation should always defer to, and never override, human curation edits – Automate validating data deposition and entry – Target community at focused curation tasks
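The rule that automated curation defers to human edits can be enforced at the point where changes are applied. The sketch below is one possible way to do that, assuming a simple per-field record; the `CuratedRecord` class and its method names are hypothetical and do not come from any system named in the slides.

```python
class CuratedRecord:
    """A record that remembers which fields a human curator has edited."""

    def __init__(self, values):
        self.values = dict(values)
        self.human_edited = set()

    def human_edit(self, name, value):
        """Human edits always apply and lock the field against automation."""
        self.values[name] = value
        self.human_edited.add(name)

    def automated_edit(self, name, value):
        """Apply an automated change only if no human has curated the field.

        Returns True when the change was applied, False when it was skipped.
        """
        if name in self.human_edited:
            return False
        self.values[name] = value
        return True


record = CuratedRecord({"name": "ny times", "city": "NYC"})
record.human_edit("name", "The New York Times")
name_applied = record.automated_edit("name", "Ny Times")  # skipped, human edit wins
city_applied = record.automated_edit("city", "New York")  # applied, field untouched by humans
```

Returning a flag instead of raising an error lets a batch cleansing job continue over the remaining fields while still reporting which suggestions were skipped.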
  • 45. Technical Best Practices Digital Enterprise Research Institute www.deri.ie  Track Provenance  All curation activities should be recorded and maintained as part of the data provenance effort – Especially where human curators are involved  Users can have different perspectives of provenance – A scientist may need to evaluate the fine-grained experiment description behind the data – For a business analyst the 'brand' of data provider can be sufficient for determining quality
  • 46. Conclusions Digital Enterprise Research Institute www.deri.ie  Data curation can ensure the quality of data and its fitness for use  Pre-competitive data can be shared without conferring a commercial advantage  Pre-competitive data communities  Common curation tasks carried out once in public domain  Reduces cost, increases quantity and quality
  • 47. Acknowledgements Digital Enterprise Research Institute www.deri.ie  Collaborators Andre Freitas & Seán O'Riain  Insight from Thought Leaders  Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times  Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters  Antony Williams (VP of Strategic Development) from ChemSpider  Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank  Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.  The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
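Recording every curation activity can be as simple as an append-only log that notes what changed, who changed it, and when. The following is a minimal sketch under that assumption; `ProvenanceEntry`, `ProvenanceLog`, and the sample identifiers are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str
    attribute: str
    old_value: str
    new_value: str
    actor: str
    actor_type: str  # "human" or "algorithm"
    timestamp: datetime


class ProvenanceLog:
    """Append-only log of curation activities."""

    def __init__(self):
        self._entries = []

    def record(self, record_id, attribute, old_value, new_value, actor, actor_type):
        """Store one curation action with a UTC timestamp and return the entry."""
        entry = ProvenanceEntry(
            record_id, attribute, old_value, new_value,
            actor, actor_type, datetime.now(timezone.utc),
        )
        self._entries.append(entry)
        return entry

    def history(self, record_id):
        """Return all logged actions for one record, oldest first."""
        return [e for e in self._entries if e.record_id == record_id]


log = ProvenanceLog()
log.record("article-1", "tags", "Baseball", "Baseball; Sports", "tagger-rules", "algorithm")
log.record("article-1", "tags", "Baseball; Sports", "Baseball", "index-editor", "human")
log.record("article-2", "summary", "", "Short summary.", "index-editor", "human")
article_history = log.history("article-1")
```

A fine-grained consumer can read the full `history`, while a coarse view (for example, only the most recent human actor) can be derived from the same entries.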
  • 48. Further Information Digital Enterprise Research Institute www.deri.ie The Role of Community-Driven Data Curation for Enterprises Edward Curry, Andre Freitas, & Seán O'Riain In David Wood (ed.), Linking Enterprise Data Springer, 2010. Available Free at: http://3roundstones.com/led_book/led-curry-et-al.html