The Role of Community-Driven Data Curation for Enterprises

With increased utilization of data within their operational and strategic processes, enterprises need to ensure data quality and accuracy. Data curation is a process that can ensure the quality of data and its fitness for use. Traditional approaches to curation are struggling with increased data volumes, and near real-time demands for curated data. In response, curation teams have turned to community crowd-sourcing and semi-automated metadata tools for assistance. This chapter provides an overview of data curation, discusses the business motivations for curating data and investigates the role of community-based data curation, focusing on internal communities and pre-competitive data collaborations. The chapter is supported by case studies from Wikipedia, The New York Times, Thomson Reuters, Protein Data Bank and ChemSpider upon which best practices for both social and technical aspects of community-driven data curation are described.

E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47.

  • 1. The Role of Community-Driven Data Curation for Enterprises
    Edward Curry, Andre Freitas, Seán O'Riain
  • 2. Speaker Profile
    Research Scientist at the Digital Enterprise Research Institute (DERI)
    Leading international web science research organization
    Researching how the web of data is changing the way businesses work and interact with information
    Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
    Investigate utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
    Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
  • 3. Web of Data
  • 4. Acknowledgements
    Collaborators Andre Freitas & Seán O'Riain
    Insight from Thought Leaders
    Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times
    Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters
    Antony Williams (VP of Strategic Development) from ChemSpider
    Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank
    Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance.
    The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
  • 5. Further Information
    The Role of Community-Driven
    Data Curation for Enterprises
    Edward Curry, Andre Freitas, & Seán O'Riain
    In David Wood (ed.),
    Linking Enterprise Data. Springer, 2010.
    Available Free at:
  • 6. Overview
    Curation Background
    The Business Need for Curated Data
    What is Data Curation?
    Data Quality and Curation
    How to Curate Data
    Curation Communities and Enterprise Data
    Case Studies
    Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank
    Best Practices from Case Study Learning 
  • 7. The Business Need
    Knowledge workers need:
    Access to the right information
    Confidence in that information
    Working with incomplete, inaccurate, or wrong information can have disastrous consequences
  • 10. The Problems with Data
    Flawed Data
    Affects 25% of critical data in the world’s top companies (Gartner)
    Data Quality
    Recent banking crisis (Economist Dec’09)
    Inaccurate figures made it difficult to manage operations (investments exposure and risk)
    “asset are defined differently in different programs”
    “numbers did not always add up”
    “departments do not trust each other’s figures”
    “figures … not worth the pixels they were made of”
  • 11. What is Data Curation?
    Selection, preservation, maintenance, collection, and archiving of digital assets
    Active management of data over its life-cycle
    Data Curators
    Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
    Museum cataloguers of the Internet age
  • 12. What is Data Curation?
    Data Governance
    Convergence of data quality, data management, business process management, and risk management
    Data Curation is a complementary activity
    Part of overall data governance strategy for organization
    Data Curator = Data Steward ??
    Overlapping terms between communities
  • 13. Data Quality and Curation
    What is Data Quality?
    Desirable characteristics for information resource
    Described as a series of quality dimensions
    Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
    Data curation can be used to improve these quality dimensions
  • 14. Data Quality and Curation
    Discoverability & Accessibility
    Curate to streamline search by storing and classifying in appropriate and consistent manner
    Curate to ensure data correctly represents the “real-world” values it models
    Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
  • 15. Data Quality and Curation
    Provenance & Reputation
    Curate to track source of data and determine reputation
    Curate to include the objectivity of the source/producer
    Is the information unbiased, unprejudiced, and impartial?
    Or does it come from a reputable but partisan source?
    Other dimensions discussed in chapter
  • 16. How to Curate Data
    Data Curation is a large field with sophisticated techniques and processes
    Section provides high-level overview on:
    Should you curate data?
    Types of Curation
    Setting up a curation process
    Additional detail and references available in book chapter
  • 17. Should You Curate Data?
    Curation can have multiple motivations
    Improving accessibility, quality, consistency,…
    Will the data benefit from curation?
    Identify business case
    Determine if potential return supports investment
    Not all enterprise data should be curated
    Suits knowledge-centric data rather than transactional operations data
  • 18. Types of Data Curation
    Multiple approaches to curate data, no single correct way
    Individual Curators
    Curation Departments
    Community-based Curation
    Manual Curation
    Sheer Curation
  • 19. Types of Data Curation – Who?
    Individual Data Curators
    Suitable for infrequently changing small quantity of data
    (<1,000 records)
    Minimal curation effort (minutes per record)
  • 20. Types of Data Curation – Who?
    Curation Departments
    Curation experts working with subject matter experts to curate data within formal process
    Can deal with large curation effort (000’s of records)
    Scalability: Can struggle with large quantities of dynamic data (>million records)
    Availability: Post-hoc nature creates delay in curated data availability
  • 21. Types of Data Curation - Who?
    Community-Based Data Curation
    Decentralized approach to data curation
    Crowd-sourcing the curation process
    Leverages community of users to curate data
    Wisdom of the community (crowd)
    Can scale to millions of records
  • 22. Types of Data Curation – How?
    Manual Curation
    Curators directly manipulate data
    Can tie users up with low-value add activities
    (Semi-)Automated Curation
    Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication and classification
    Can be supervised or approved by human curators
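The split between an automated step and a human approval step can be sketched as follows. This is a minimal illustration, not a method from the slides: the names `normalize`, `DuplicateProposal`, `propose_duplicates`, and `apply_approved` are hypothetical, the matching rule (identical names after cleansing) is deliberately naive, and the sample records are made up.

```python
from dataclasses import dataclass
from itertools import combinations


def normalize(name: str) -> str:
    """Data cleansing step: lowercase and collapse repeated whitespace."""
    return " ".join(name.lower().split())


@dataclass
class DuplicateProposal:
    """A candidate pair found by the algorithm, awaiting a curator's decision."""
    first: str
    second: str
    approved: bool = False


def propose_duplicates(records: list[str]) -> list[DuplicateProposal]:
    """Automated step: flag pairs whose cleansed forms are identical."""
    return [
        DuplicateProposal(a, b)
        for a, b in combinations(records, 2)
        if normalize(a) == normalize(b)
    ]


def apply_approved(records: list[str], proposals: list[DuplicateProposal]) -> list[str]:
    """Human-supervised step: drop the second record of approved pairs only."""
    to_drop = {p.second for p in proposals if p.approved}
    return [r for r in records if r not in to_drop]


# Made-up records for illustration.
records = ["Acme Corp", "acme  corp", "Globex Ltd"]
proposals = propose_duplicates(records)
proposals[0].approved = True  # stands in for a human curator's review
curated = apply_approved(records, proposals)
```

Nothing is removed until a curator approves a proposal, so the algorithm only reduces the search effort and the decision stays with a person.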
  • 23. Types of Data Curation – How?
    Sheer curation, or Curation at Source
    Curation activities integrated in normal workflow of those creating and managing data
    Can be as simple as vetting or “rating” the results of a curation algorithm
    Results can be available immediately
    Blended Approaches: Best of Both
    Sheer curation + post hoc curation department
    Allows immediate access to curated data
    Ensures quality control with expert curation
  • 24. Setting up a Curation Process
    5 Steps to set up a curation process:
    1 - Identify what data you need to curate
    2 - Identify who will curate the data
    3 - Define the curation workflow
    4 - Identify appropriate data-in & data-out formats
    5 - Identify the artifacts, tools, and processes needed to support the curation process
  • 25. Setting up a Curation Process
    Step 1: Identify what data you need to curate
    Newly created data and/or legacy data?
    How is new data created?
    Do users create the data, or is it imported from an external source?
    How frequently is new data created/updated?
    What quantity of data is created?
    How much legacy data exists?
    Is it stored within a single source, or scattered across multiple sources?
  • 26. Setting up a Curation Process
    Step 2: Identify who will curate the data
    Individuals, depts, groups, institutions, community
    Step 3: Define the curation workflow
    What curation activities are required?
    How will curation activities be carried out?
    Step 4: Identify suitable data-in & -out formats
    What is the best format for the data?
    Right format for receiving and publishing data is critical
    Support multiple formats to maximize participation
  • 27. Setting up a Curation Process
    Step 5: Identify the artifacts, tools, and processes needed to support curation
    Workflow support/Community collaboration platforms
    Algorithms can (semi-)automate curation activities
    Major factors that influence approach:
    Quantity of data to be curated (new and legacy data)
    Amount of effort required to curate the data
    Frequency of data change / data dynamics
    Availability of experts
  • 29. Community–based Curation
    Two community approaches:
    Internal corporate communities
    External pre-competitive communities
    To determine the right model consider:
    What is the purpose of the community?
    Will resulting curated dataset be publicly available? Or restricted?
  • 30. Community–based Curation
    Internal Communities
    Taps potential of workforce to assist data curation
    Curate competitive enterprise data that will remain internal to the company
    May not always be the case e.g. product technical support and marketing data
    Can work in conjunction with curation dept.
    Community governance typically follows the organization’s internal governance model
  • 31. Pre-competitive Communities
    Pre-competitive collaboration
    Well-established technique for open innovation
    Notable examples
  • 32. What is Pre-Competitive Data?
    Two Types of Enterprise Data
    Proprietary data for competitive advantage
    Common data with no competitive advantage
    What is pre-competitive data?
    Has little potential for differentiation
    Can be shared without conferring commercial advantage to competitor
    Common non-competitive data
    Needs to be maintained and curated
    Companies duplicate effort in-house, incurring full cost
  • 33. Pre-competitive Communities
    External pre-competitive communities
    Share costs, risks, and technical challenges
    Common curation tasks carried out once in public domain rather than multiple times in each company
    Reduces cost required to provide and maintain data
    Can increase the quantity, quality, and access
    Focus turns to value-add competitive activity
    Move “competitive onus” from novel data to novel algorithms, shifting emphasis from “proprietary data” to a “proprietary understanding of data”
    e.g. Protein Data Bank and Pistoia Alliance in Pharma
  • 34. External Pre-competitive Communities
    Two popular community models are
    Organization consortium
    Open community
    Organization consortium
    Operates like a private democratic club
    Usually closed community, members invited based on skill-set to contribute
    Output data - public or limited to members
    Consortiums follow a democratic process
    Member voting rights may reflect level of investment
    Larger players may be leaders of the consortium
  • 35. External Pre-competitive Communities
    Open community
    Everyone can participate
    “Founder(s)” defines desired curation activity
    Seek public support to contribute to curation activities
    Wikipedia, Linux, and Apache are good examples of large open communities
  • 37. Wikipedia
    The World's Largest Open Digital Curation Community
  • 38. Wikipedia
    Open-source encyclopedia
    Collaboratively built by large community
    Challenges existing models of content creation
    More than 19,000,000 articles
    270+ languages, 3,200,000+ articles in English
    More than 157,000 active contributors
    Studies show accuracy and stylistic formality are equivalent to resources developed in expert-based closed communities
    i.e. Columbia and Britannica encyclopedias
  • 39. Wikipedia
    Wiki platform behind Wikipedia
    Widespread and popular technology
    Wikis can also support data curation
    Lowers entry barriers for collaborative data curation
    Widely used inside organizations
    Intellipedia covering 16 U.S. Intelligence agencies
    Wiki Proteins, curated protein data for knowledge discovery and annotation
  • 40. Wikipedia
    Decentralized environment supports creation of high quality information with:
    Social organization
    Artifacts, tools & processes for cooperative work coordination
    Wikipedia collaboration dynamics highlight good practices
  • 41. Wikipedia – Social Organization
    Any user can edit its contents
    Without prior registration
    Does not lead to a chaotic scenario
    In practice highly scalable approach for high quality content creation on the Web
    Relies on simple but highly effective way to coordinate its curation process
    Curation is activity of Wikipedia admins
    Responsibility for information quality standards
  • 42. Wikipedia – Social Organization
    Four main types of accounts:
    Anonymous users
    Identified by their associated IP address
    Registered users
    Users with an account in the Wikipedia website
    Administrators
    Registered users with additional permissions in the system
    Access to curation tools
    Bots
    Programs that perform repetitive tasks
  • 43. Wikipedia – Social Organization
  • 44. Wikipedia – Social Organization
    Improvement of one’s reputation
    Sense of efficacy
    Contributing effectively to a meaningful project
    Over time focus of editors typically changes
    From curators of a few articles in specific topics
    To more global curation perspective
    Enforcing quality assessment of Wikipedia as a whole
  • 45. Wikipedia – Artifacts, Tools & Processes
    Wiki Article Editor (Tool)
    WYSIWYG or markup text editor
    Talk Pages (Tool)
    Public arena for discussions around Wikipedia resources
    Watchlists (Tool)
    Helps curators to actively monitor the integrity and quality of resources they contribute
    Permission Mechanisms (Tool)
    Users with administrator status can perform critical actions such as remove pages and grant administrative permissions to new users
  • 46. Wikipedia – Artifacts, Tools & Processes
    Automated Edition (Tool)
    Bots are automated or semi-automated tools that perform repetitive tasks over content
    Page History and Restore (Tool)
    Historical trail of changes to a Wikipedia Resource
    Guidelines, Policies & Templates (Artifact)
    Defines curation guidelines for editors to assess article quality
    Dispute Resolution (Process)
    Dispute mechanism between editors over the article contents
    Article Edition, Deletion, Merging, Redirection, Transwiking, Archival (Process)
    Describe the curation actions over Wikipedia resources
  • 47. Wikipedia - DBPedia
    DBPedia Knowledge base
    Inherits massive volume of curated Wikipedia data
    Built using information from infobox properties
    Indirectly uses wiki as data curation platform
    DBPedia provides direct access to data
    3.4 million entities and 1 billion RDF triples
    Comprehensive data infrastructure
    Concept URIs, definitions, and basic types
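To make the link between infobox properties and RDF triples concrete, the sketch below turns one article's infobox into (subject, predicate, object) tuples. It is a heavily simplified illustration and not the actual DBPedia extraction pipeline: `infobox_to_triples` is a hypothetical helper, the `started` property name is invented for the example, and values are kept as plain strings rather than typed literals.

```python
# Namespace prefixes used by DBPedia for resources and raw infobox properties.
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
DBPEDIA_PROPERTY = "http://dbpedia.org/property/"


def infobox_to_triples(
    article_title: str, infobox: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Turn one article's infobox properties into (subject, predicate, object) triples."""
    subject = DBPEDIA_RESOURCE + article_title.replace(" ", "_")
    return [
        (subject, DBPEDIA_PROPERTY + key, value)
        for key, value in infobox.items()
    ]


# Illustrative infobox content; the property name "started" is an assumption.
triples = infobox_to_triples("Protein Data Bank", {"started": "1971"})
```

Each infobox edit made by a Wikipedia contributor therefore changes the data that a downstream consumer reads, which is how the wiki acts indirectly as a curation platform.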
  • 49. Wikipedia - DBPedia
  • 50. The New York Times
    100 Years of Expert Data Curation
  • 51. The New York Times
    Largest metropolitan and third largest newspaper in the United States
    Most popular newspaper website in US
    100 year old curated repository defining its participation in the emerging Web of Data
  • The New York Times
    Data curation dates back to 1913
    Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper
    New York Times Index
    Organized catalog of articles titles and summaries
    Containing issue, date and column of article
    Categorized by subject and names
    Introduced on quarterly then annual basis
    Transitory content of newspaper became important source of searchable historical data
    Often used to settle historical debates
  • 54. The New York Times
    Index Department was created in 1913
    Curation and cataloguing of NYT resources
    Since 1851 NYT had low quality index for internal use
    Developed a comprehensive catalog using a controlled vocabulary
    Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc), linked to articles and their summaries
    Current Index Dept. has ~15 people
  • 55. The New York Times
    Challenges with consistently and accurately classifying news articles over time
    Keywords expressing subjects may show some variance due to cultural or legal constraints
    Identities of some entities, such as organizations and places, changed over time
    Controlled vocabulary grew to hundreds of thousands of categories
    Adding complexity to classification process
  • 56. The New York Times
    Increased importance of Web drove need to improve categorization of online content
    Curation carried out by Index Department
    Library-time (days to weeks)
    Print edition can handle next-day index
    Not suitable for real-time online publishing, which needed a same-day index
  • 57. The New York Times
    Introduced two stage curation process
    Editorial staff performed best-effort semi-automated sheer curation at point of online pub.
    Several hundred journalists
    Index Department follows up with long-term accurate classification and archiving
    Non-expert journalist curators provide instant accessibility to online users
    Index Department provides long-term high-quality curation in a “trust but verify” approach
  • 58. NYT Curation Workflow
    Curation starts with article getting out of the newsroom
  • 59. NYT Curation Workflow
    Member of editorial staff submits article to web-based, rule-based information extraction system (SAS Teragram)
  • 60. NYT Curation Workflow
    Teragram uses linguistic extraction rules based on subset of Index Dept’s controlled vocab.
  • 61. NYT Curation Workflow
    Teragram suggests tags based on the Index vocabulary that can potentially describe the content of article
  • 62. NYT Curation Workflow
    Editorial staff member selects terms that best describe the contents and inserts new tags if necessary
  • 63. NYT Curation Workflow
    Reviewed by the taxonomy managers with feedback to editorial staff on classification process
  • 64. NYT Curation Workflow
    Article is published online at
  • 65. NYT Curation Workflow
    At later stage article receives second-level curation by Index Dept.: additional Index tags and a summary
  • 66. NYT Curation Workflow
    Article is submitted to NYT Index
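The two-stage workflow above can be sketched in code. This is only an illustration under stated assumptions: it does not use or reproduce SAS Teragram, the vocabulary and trigger phrases are invented, naive substring matching stands in for real linguistic extraction rules, and the names `Article`, `suggest_tags`, `editorial_review`, and `index_review` are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical subset of a controlled vocabulary: tag -> trigger phrases.
VOCABULARY = {
    "Banking": ["bank", "lender"],
    "Pharmaceuticals": ["drug", "pharmaceutical"],
}


@dataclass
class Article:
    text: str
    tags: set[str] = field(default_factory=set)
    status: str = "draft"


def suggest_tags(text: str) -> set[str]:
    """Stage 1, automated: suggest vocabulary tags whose trigger phrases appear."""
    lowered = text.lower()
    return {
        tag
        for tag, phrases in VOCABULARY.items()
        if any(phrase in lowered for phrase in phrases)
    }


def editorial_review(article: Article, accepted: set[str], added: set[str]) -> None:
    """Stage 1, human: editor keeps suitable suggestions and may insert new tags."""
    article.tags = set(accepted) | set(added)
    article.status = "published"


def index_review(article: Article, extra_tags: set[str]) -> None:
    """Stage 2, post hoc: index staff add further tags (summary omitted here)."""
    article.tags |= set(extra_tags)
    article.status = "indexed"


article = Article("The bank reported losses on loans to a drug maker.")
suggested = suggest_tags(article.text)
editorial_review(article, accepted={"Banking"}, added={"Loans"})
index_review(article, extra_tags={"Credit Risk"})
```

The article is available with editor-selected tags as soon as `editorial_review` runs, while `index_review` can happen days later without blocking publication, which mirrors the "trust but verify" split described earlier.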
  • 67. The New York Times
    Early adopter of Linked Open Data (June ‘09)
  • 68. The New York Times
    Linked Open Data @
    Subset of 10,000 tags from index vocabulary
    Dataset of people, organizations & locations
    Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate,…
    Improves traffic by third party data usage
    Lowers development cost of new applications for different verticals inside the website
    E.g. movies, travel, sports, books
  • 69. Thomson Reuters
    Data Curation: A Core Business Competency
  • 70. Thomson Reuters
    Thomson Reuters is an information provider
    Created by acquisition of Reuters by Thomson
    Over 50,000 employees
    Commercial presence in 100+ countries
    Provides specialist curated information and information-based services
    Selects most relevant information for customers
    Classifying, enriching and distributing it in a way that can be readily consumed
  • 71. Thomson Reuters
    Curation process
    Working over approximately 1000 data sources
    Automatic tools provide first level triage and classification
    Refined by intervention of human curators
    Curator is a domain specialist
    Employs thousands of curators
  • 72. Thomson Reuters
    OneCalais platform
    Reduces workload for classification of content
    Natural Language Processing on unstructured text
    Automatically derives tags for analyzed content
    Enrichment with machine readable structured data
    Provides description of specific entities (places, people, events, facts) present in the text
    Open Calais (free version of OneCalais)
    20,000+ users, >4 million transactions per day
    CNET, CBS Interactive, The Huffington Post, The Powerhouse Museum of Science and Design,…
  • 73. ChemSpider
    Structure centric chemical community
    Over 300 data sources with 25 million records
    Provided by chemical vendors, government databases, private laboratories and individuals
    Pharma realizing benefits of open data
    Heavily leveraged by pharmaceutical companies as pre-competitive resources for experimental and clinical trial investigation
    GlaxoSmithKline made its proprietary malaria dataset of 13,500 compounds available
  • 74. Protein Data Bank
    Dedicated to improving understanding of biological systems functions with 3-D structure of macromolecules
    Started in 1971 with 3 core members
    Originally offered 7 crystal structures
    Grown to 63,000 structures
    Over 300 million dataset downloads
    Expanded beyond curated data download service to include complex molecular visualization, search, and analysis capabilities
  • 76. Best Practices from Case Study Learning
    Social Best Practices
    Community Governance Models
    Technical Best Practices
    Data Representation
    Human and Automated Curation
    Track Provenance
  • 77. Social Best Practices
    Stakeholder involvement for data producers and consumers must occur early in project
    Provides insight into basic questions of what they want to do, for whom, and what it will provide
    White papers are effective means to present these ideas, and solicit opinion from community
    Can be used to establish informal ‘social contract’ for community
  • 78. Social Best Practices
    Outreach activities essential for promotion and feedback
    Typical contributor-to-consumer ratios of less than 5%
    Social communication and networking forums are useful
    Majority of community may not communicate using these media
    Communication by email still remains important
  • 79. Social Best Practices
    Sheer curation needs line of sight from data curating activity, to tangible exploitation benefits
    Lack of awareness of value proposition will slow emergence of collaborative contributions
    Recognizing contributing curators through a formal feedback mechanism
    Reinforces contribution culture
    Directly increases output quality
  • 80. Social Best Practices
    Community Governance Models
    Effective governance structure is vital to ensure success of community
    Internal communities and consortium perform well when they leverage traditional corporate and democratic governance models
    Open communities need to engage the community within the governance process
    Follow less orthodox approaches using meritocratic and autocratic principles
  • 81. Technical Best Practices
    Data Representation
    Must be robust and standardized to encourage community usage and tools development
    Support for legacy data formats and ability to translate data forward to support new technology and standards
    Human & Automated Curation
    Balancing will improve data quality
    Automated curation should always defer to, and never override, human curation edits
    Automate validating data deposition and entry
    Target community at focused curation tasks
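One way to enforce the rule that automated curation defers to human edits is to store the source of each value next to the value itself. The sketch below assumes that design; `FieldValue`, `apply_automated_edit`, and `apply_human_edit` are hypothetical names, and the record is a plain dictionary for brevity.

```python
from dataclasses import dataclass


@dataclass
class FieldValue:
    value: str
    source: str  # "human" or "automated"


def apply_automated_edit(
    record: dict[str, FieldValue], field_name: str, new_value: str
) -> bool:
    """Apply an automated edit unless a human curator already set the field.

    Returns True if the edit was applied, False if it was deferred.
    """
    current = record.get(field_name)
    if current is not None and current.source == "human":
        return False
    record[field_name] = FieldValue(new_value, "automated")
    return True


def apply_human_edit(
    record: dict[str, FieldValue], field_name: str, new_value: str
) -> None:
    """Human edits always take effect, whatever was stored before."""
    record[field_name] = FieldValue(new_value, "human")


record: dict[str, FieldValue] = {}
first_pass = apply_automated_edit(record, "name", "water")   # field empty: applied
apply_human_edit(record, "name", "Water")                     # curator correction
second_pass = apply_automated_edit(record, "name", "water")  # deferred to human
```

A deferred edit could also be queued for curator review instead of being dropped; that choice is left open here.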
  • 82. Technical Best Practices
    Track Provenance
    All curation activities should be recorded and maintained as part of data provenance effort
    Especially where human curators are involved
    Users can have different perspectives of provenance
    A scientist may need to evaluate the fine grained experiment description behind the data
    For a business analyst the ’brand’ of data provider can be sufficient for determining quality
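A provenance record of this kind can be sketched as an append-only log with two views, one detailed and one coarse. The class and method names (`ProvenanceEntry`, `ProvenanceLog`, `history`, `sources`) and the sample entries are assumptions made for illustration, and a real system would persist the log rather than keep it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str
    curator: str
    curator_type: str  # "human" or "automated"
    action: str
    timestamp: datetime


class ProvenanceLog:
    """Append-only, in-memory log of curation activities."""

    def __init__(self) -> None:
        self._entries: list[ProvenanceEntry] = []

    def record(
        self, record_id: str, curator: str, curator_type: str, action: str
    ) -> ProvenanceEntry:
        """Store one curation activity with a UTC timestamp."""
        entry = ProvenanceEntry(
            record_id, curator, curator_type, action, datetime.now(timezone.utc)
        )
        self._entries.append(entry)
        return entry

    def history(self, record_id: str) -> list[ProvenanceEntry]:
        """Fine-grained view: every activity on one record, oldest first."""
        return [e for e in self._entries if e.record_id == record_id]

    def sources(self, record_id: str) -> set[str]:
        """Coarse view: only who has touched the record."""
        return {e.curator for e in self.history(record_id)}


log = ProvenanceLog()
log.record("rec-1", "dedup-bot", "automated", "merged duplicate")
log.record("rec-1", "alice", "human", "corrected name")
log.record("rec-2", "alice", "human", "added tag")
```

Here `history` serves the reader who needs the full trail behind a value, while `sources` serves the reader for whom knowing the provider is enough.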
  • 83. Conclusions
    Data curation can ensure the quality of data and its fitness for use
    Pre-competitive data can be shared without conferring a commercial advantage
    Pre-competitive data communities
    Common curation tasks carried out once in public domain
    Reduces cost, increases quantity and quality