The Role of Community-Driven Data Curation for Enterprises
Edward Curry, Andre Freitas, Seán O'Riain
ed.curry@deri.org
http://www.deri.org/
http://www.EdwardCurry.org/
Speaker Profile
- Research Scientist at the Digital Enterprise Research Institute (DERI)
  - Leading international web science research organization
- Researching how the web of data is changing the way businesses work and interact with information
- Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
- Investigates utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
- Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
Web of Data
Acknowledgements
- Collaborators: Andre Freitas & Seán O'Riain
- Insight from Thought Leaders:
  - Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times
  - Krista Thomas (Vice President, Marketing & Communications) and Tom Tague (OpenCalais Initiative Lead) from Thomson Reuters
  - Antony Williams (VP of Strategic Development) from ChemSpider
  - Helen Berman (Director) and John Westbrook (Product Development) from the Protein Data Bank
  - Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance
- The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
Further Information
The Role of Community-Driven Data Curation for Enterprises
Edward Curry, Andre Freitas, & Seán O'Riain
In David Wood (ed.), Linking Enterprise Data, Springer, 2010.
Available free at: http://3roundstones.com/led_book/led-curry-et-al.html
Overview
- Curation Background
  - The Business Need for Curated Data
  - What is Data Curation?
  - Data Quality and Curation
  - How to Curate Data
- Curation Communities and Enterprise Data
- Case Studies
  - Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank
- Best Practices from Case Study Learning
The Business Need
Knowledge workers need:
- Access to the right information
- Confidence in that information
Working with incomplete, inaccurate, or wrong information can have disastrous consequences.
The Problems with Data
- Flawed Data
  - Affects 25% of critical data in the world's top companies (Gartner)
- Data Quality
  - Recent banking crisis (Economist, Dec '09): inaccurate figures made it difficult to manage operations (investment exposure and risk)
  - "assets are defined differently in different programs"
  - "numbers did not always add up"
  - "departments do not trust each other's figures"
  - "figures … not worth the pixels they were made of"
What is Data Curation?
- Digital Curation
  - Selection, preservation, maintenance, collection, and archiving of digital assets
- Data Curation
  - Active management of data over its life-cycle
- Data Curators
  - Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
  - Museum cataloguers of the Internet age
What is Data Curation?
- Data Governance
  - Convergence of data quality, data management, business process management, and risk management
- Data curation is a complementary activity
  - Part of the overall data governance strategy for an organization
- Data Curator = Data Steward??
  - Overlapping terms between communities
Data Quality and CurationWhat is Data Quality?Desirable characteristics for information resource Described as a series of quality dimensionsDiscoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & ReputationData curation can be used to improve these quality dimensions
Data Quality and Curation
- Discoverability & Accessibility
  - Curate to streamline search by storing and classifying data in an appropriate and consistent manner
- Accuracy
  - Curate to ensure data correctly represents the "real-world" values it models
- Consistency
  - Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
Data Quality and Curation
- Provenance & Reputation
  - Curate to track the source of data and determine its reputation
  - Curate to include the objectivity of the source/producer
    - Is the information unbiased, unprejudiced, and impartial?
    - Or does it come from a reputable but partisan source?
- Other dimensions discussed in the chapter
How to Curate Data
- Data curation is a large field with sophisticated techniques and processes
- This section provides a high-level overview of:
  - Should you curate data?
  - Types of curation
  - Setting up a curation process
- Additional detail and references available in the book chapter
Should You Curate Data?
- Curation can have multiple motivations
  - Improving accessibility, quality, consistency, …
- Will the data benefit from curation?
  - Identify the business case
  - Determine if the potential return supports the investment
- Not all enterprise data should be curated
  - Suits knowledge-centric data rather than transactional operations data
Types of Data Curation
Multiple approaches to curate data; no single correct way.
- Who?
  - Individual Curators
  - Curation Departments
  - Community-based Curation
- How?
  - Manual Curation
  - (Semi-)Automated Curation
  - Sheer Curation
Types of Data Curation – Who?
- Individual Data Curators
  - Suitable for an infrequently changing, small quantity of data (<1,000 records)
  - Minimal curation effort (minutes per record)
Types of Data Curation – Who?
- Curation Departments
  - Curation experts working with subject matter experts to curate data within a formal process
  - Can deal with a large curation effort (thousands of records)
- Limitations
  - Scalability: can struggle with large quantities of dynamic data (>1 million records)
  - Availability: post-hoc nature creates a delay in curated data availability
Types of Data Curation – Who?
- Community-Based Data Curation
  - Decentralized approach to data curation
  - Crowd-sourcing the curation process
  - Leverages a community of users to curate data
  - Wisdom of the community (crowd)
  - Can scale to millions of records
Types of Data Curation – How?
- Manual Curation
  - Curators directly manipulate data
  - Can tie users up with low-value-add activities
- (Semi-)Automated Curation
  - Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication, and classification
  - Can be supervised or approved by human curators
Types of Data Curation – How?
- Sheer Curation, or Curation at Source
  - Curation activities integrated into the normal workflow of those creating and managing data
  - Can be as simple as vetting or "rating" the results of a curation algorithm
  - Results can be available immediately
- Blended Approaches: Best of Both
  - Sheer curation + post-hoc curation department
  - Allows immediate access to curated data
  - Ensures quality control with expert curation
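The (semi-)automated pattern above can be sketched briefly. In this toy example the record layout, field names, and similarity threshold are all invented for illustration: an algorithm flags likely duplicate records, and a human curator approves or rejects each suggestion.

```python
from difflib import SequenceMatcher

def flag_duplicates(records, threshold=0.8):
    """Suggest likely duplicate record pairs; a human curator vets each one."""
    suggestions = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = SequenceMatcher(None, records[i]["name"].lower(),
                                    records[j]["name"].lower()).ratio()
            if score >= threshold:
                suggestions.append((records[i]["id"], records[j]["id"], round(score, 2)))
    return suggestions

records = [
    {"id": 1, "name": "Digital Enterprise Research Institute"},
    {"id": 2, "name": "Digital Enterprise Research Inst."},
    {"id": 3, "name": "Protein Data Bank"},
]
# The algorithm proposes; the curator disposes (sheer curation: vet the suggestion).
for id_a, id_b, score in flag_duplicates(records):
    print(f"Possible duplicate ({score}): record {id_a} <-> record {id_b}")
```

In a blended approach, suggestions a front-line curator approves would still flow to the curation department for expert verification.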
Setting up a Curation Process
5 steps to set up a curation process:
1. Identify what data you need to curate
2. Identify who will curate the data
3. Define the curation workflow
4. Identify appropriate data-in & data-out formats
5. Identify the artifacts, tools, and processes needed to support the curation process
Setting up a Curation ProcessStep 1: Identify what data you need to curateNewly created data and/or legacy data? How is new data created? Do users create the data, or is it imported from an external source? How frequently is new data created/updated? What quantity of data is created?How much legacy data exists?Is it stored within a single source, or scattered across multiple sources?
Setting up a Curation Process
Step 2: Identify who will curate the data
- Individuals, departments, groups, institutions, community
Step 3: Define the curation workflow
- What curation activities are required?
- How will curation activities be carried out?
Step 4: Identify suitable data-in & data-out formats
- What is the best format for the data?
- The right format for receiving and publishing data is critical
- Support multiple formats to maximize participation
Setting up a Curation Process
Step 5: Identify the artifacts, tools, and processes needed to support curation
- Workflow support / community collaboration platforms
- Algorithms can (semi-)automate curation activities
Major factors that influence the approach:
- Quantity of data to be curated (new and legacy data)
- Amount of effort required to curate the data
- Frequency of data change / data dynamics
- Availability of experts
Overview
- Curation Background
  - The Business Need for Curated Data
  - What is Data Curation?
  - Data Quality and Curation
  - How to Curate Data
- Curation Communities and Enterprise Data
- Case Studies
  - Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank
- Best Practices from Case Study Learning
Community-Based Curation
Two community approaches:
- Internal corporate communities
- External pre-competitive communities
To determine the right model, consider:
- What is the purpose of the community?
- Will the resulting curated dataset be publicly available, or restricted?
Community-Based Curation
- Internal Communities
  - Taps the potential of the workforce to assist data curation
  - Curate competitive enterprise data that will remain internal to the company
    - May not always be the case, e.g. product technical support and marketing data
  - Can work in conjunction with a curation department
  - Community governance typically follows the organization's internal governance model
Pre-competitive Communities
- Pre-competitive collaboration
  - Well-established technique for open innovation
- Notable examples
What is Pre-Competitive Data?Two Types of Enterprise DataPropriety data for competitive advantageCommon data with no competitive advantageWhat is pre-competitive data?Has little potential for differentiationCan be shared without conferring commercial advantage to competitorCommon non-competitive dataNeeds to be maintaining and curatedCompanies duplicate effort in-house incurring full-cost
Pre-competitive Communities
- External pre-competitive communities
  - Share costs, risks, and technical challenges
  - Common curation tasks carried out once in the public domain rather than multiple times in each company
  - Reduces the cost required to provide and maintain data
  - Can increase the quantity, quality, and access
- Focus turns to value-add competitive activity
  - Move the "competitive onus" from novel data to novel algorithms, shifting emphasis from "proprietary data" to a "proprietary understanding of data"
  - e.g. Protein Data Bank and Pistoia Alliance in Pharma
External Pre-competitive Communities
Two popular community models:
- Organization consortium
- Open community
Organization consortium
- Operates like a private democratic club
- Usually a closed community; members invited based on skill-set to contribute
- Output data: public or limited to members
- Consortiums follow a democratic process
  - Member voting rights may reflect level of investment
  - Larger players may be leaders of the consortium
External Pre-competitive Communities
Open community
- Everyone can participate
- "Founder(s)" defines the desired curation activity
- Seeks public support to contribute to curation activities
- Wikipedia, Linux, and Apache are good examples of large open communities
Overview
- Curation Background
  - The Business Need for Curated Data
  - What is Data Curation?
  - Data Quality and Curation
  - How to Curate Data
- Curation Communities and Enterprise Data
- Case Studies
  - Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank
- Best Practices from Case Study Learning
Wikipedia
The World's Largest Open Digital Curation Community
Wikipedia
- Open-source encyclopedia
  - Collaboratively built by a large community
  - Challenges existing models of content creation
- More than 19,000,000 articles
  - 270+ languages, 3,200,000+ articles in English
- More than 157,000 active contributors
- Studies show accuracy and stylistic formality are equivalent to resources developed in expert-based closed communities
  - e.g. the Columbia and Britannica encyclopedias
Wikipedia
- MediaWiki
  - Wiki platform behind Wikipedia
  - Widespread and popular technology
- Wikis can also support data curation
  - Lowers entry barriers for collaborative data curation
- Widely used inside organizations
  - Intellipedia, covering 16 U.S. intelligence agencies
  - WikiProteins, curated protein data for knowledge discovery and annotation
Wikipedia
Decentralized environment supports creation of high-quality information with:
- Social organization
- Artifacts, tools & processes for cooperative work coordination
Wikipedia collaboration dynamics highlight good practices.
Wikipedia – Social Organization
- Any user can edit its contents
  - Without prior registration
- Does not lead to a chaotic scenario
  - In practice, a highly scalable approach for high-quality content creation on the Web
  - Relies on a simple but highly effective way to coordinate its curation process
- Curation is the activity of Wikipedia admins
  - Responsible for information quality standards
Wikipedia – Social Organization
Four main types of accounts:
- Anonymous users: identified by their associated IP address
- Registered users: users with an account on the Wikipedia website
- Administrators/Editors: registered users with additional permissions in the system, including access to curation tools
- Bots: programs that perform repetitive tasks
Wikipedia – Social Organization
Wikipedia – Social Organization
- Incentives
  - Improvement of one's reputation
  - Sense of efficacy
  - Contributing effectively to a meaningful project
- Over time, the focus of editors typically changes
  - From curators of a few articles on specific topics
  - To a more global curation perspective, enforcing quality assessment of Wikipedia as a whole
Wikipedia – Artifacts, Tools & Processes
- Wiki Article Editor (Tool)
  - WYSIWYG or markup text editor
- Talk Pages (Tool)
  - Public arena for discussions around Wikipedia resources
- Watchlists (Tool)
  - Help curators actively monitor the integrity and quality of resources they contribute to
- Permission Mechanisms (Tool)
  - Users with administrator status can perform critical actions such as removing pages and granting administrative permissions to new users
Wikipedia – Artifacts, Tools & Processes
- Automated Editing (Tool)
  - Bots are automated or semi-automated tools that perform repetitive tasks over content
- Page History and Restore (Tool)
  - Historical trail of changes to a Wikipedia resource
- Guidelines, Policies & Templates (Artifact)
  - Define curation guidelines for editors to assess article quality
- Dispute Resolution (Process)
  - Dispute mechanism between editors over article contents
- Article Editing, Deletion, Merging, Redirection, Transwiking, Archival (Processes)
  - Describe the curation actions over Wikipedia resources
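The bot role described above usually means one narrow, repetitive edit applied across many pages. A toy sketch of that pattern (the cleanup rule and in-memory page store are invented; a real bot would edit through the MediaWiki API under a bot account):

```python
import re

def cleanup_bot(pages):
    """Apply one repetitive fix (collapse runs of spaces) and report changed pages."""
    changed = []
    for title, text in pages.items():
        fixed = re.sub(r" {2,}", " ", text)
        if fixed != text:
            pages[title] = fixed      # the edit a human would find tedious to repeat
            changed.append(title)     # page history still records every bot edit
    return changed

pages = {"Galway": "A city  in  Ireland.", "Dublin": "The capital of Ireland."}
print(cleanup_bot(pages))  # -> ['Galway']
```

Because each edit lands in the page history, a bad bot run can be reverted with the same restore mechanism used for human edits.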
Wikipedia – DBpedia
- DBpedia knowledge base
  - Inherits a massive volume of curated Wikipedia data
  - Built using infobox properties
  - Indirectly uses the wiki as a data curation platform
- DBpedia provides direct access to data
  - 3.4 million entities and 1 billion RDF triples
  - Comprehensive data infrastructure: concept URIs, definitions, and basic types
Wikipedia – DBpedia
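As a rough illustration of the extraction idea (the resource name and infobox fields below are invented stand-ins, not actual DBpedia extraction code), an infobox's key-value pairs map naturally to subject-predicate-object triples keyed by a concept URI:

```python
def infobox_to_triples(resource, infobox):
    """Map an infobox's key-value pairs to (subject, predicate, object) triples."""
    subject = f"http://dbpedia.org/resource/{resource}"
    return [(subject, f"http://dbpedia.org/property/{key}", value)
            for key, value in infobox.items()]

# Toy infobox for a hypothetical article.
triples = infobox_to_triples("Example_City", {"country": "Ireland", "type": "city"})
for s, p, o in triples:
    print(s, p, o)
```

This is why curating the wiki indirectly curates the knowledge base: a corrected infobox value flows straight into the corresponding triple on the next extraction.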
The New York Times
100 Years of Expert Data Curation
The New York Times
- Largest metropolitan and third largest newspaper in the United States
- nytimes.com
  - Most popular newspaper website in the US
- 100-year-old curated repository defining its participation in the emerging Web of Data

The New York Times
- Data curation dates back to 1913
  - Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper
- New York Times Index
  - Organized catalog of article titles and summaries, containing the issue, date, and column of each article
  - Categorized by subject and names
  - Introduced on a quarterly, then annual, basis
- Transitory content of the newspaper became an important source of searchable historical data
  - Often used to settle historical debates
The New York Times
- Index Department was created in 1913
  - Curation and cataloguing of NYT resources
  - Since 1851, NYT had had a low-quality index for internal use
- Developed a comprehensive catalog using a controlled vocabulary
  - Covering subjects, personal names, organizations, geographic locations, and titles of creative works (books, movies, etc.), linked to articles and their summaries
- Current Index Department has ~15 people
The New York Times
Challenges with consistently and accurately classifying news articles over time:
- Keywords expressing subjects may show some variance due to cultural or legal constraints
- Identities of some entities, such as organizations and places, changed over time
- Controlled vocabulary grew to hundreds of thousands of categories, adding complexity to the classification process
The New York Times
Increased importance of the Web drove the need to improve categorization of online content:
- Curation carried out by the Index Department at "library time" (days to weeks)
- The print edition can handle a next-day index; not suitable for real-time online publishing
- nytimes.com needed a same-day index
The New York Times
Introduced a two-stage curation process:
- Editorial staff perform best-effort, semi-automated sheer curation at the point of online publication
  - Several hundred journalists
- Index Department follows up with long-term, accurate classification and archiving
Benefits:
- Non-expert journalist curators provide instant accessibility to online users
- Index Department provides long-term, high-quality curation in a "trust but verify" approach
NYT Curation Workflow
1. Curation starts with the article getting out of the newsroom
2. A member of the editorial staff submits the article to a web-based, rule-based information extraction system (SAS Teragram)
3. Teragram uses linguistic extraction rules based on a subset of the Index Department's controlled vocabulary
4. Teragram suggests tags, based on the Index vocabulary, that can potentially describe the content of the article
5. The editorial staff member selects the terms that best describe the contents and inserts new tags if necessary
6. Reviewed by the taxonomy managers, with feedback to the editorial staff on the classification process
7. The article is published online at nytimes.com
8. At a later stage, the article receives second-level curation by the Index Department: additional Index tags and a summary
9. The article is submitted to the NYT Index
The New York Times
Early adopter of Linked Open Data (June '09)
The New York Times
- Linked Open Data @ data.nytimes.com
  - Subset of 10,000 tags from the index vocabulary
  - Dataset of people, organizations & locations
  - Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate, …
- Benefits
  - Improves traffic through third-party data usage
  - Lowers the development cost of new applications for different verticals inside the website, e.g. movies, travel, sports, books
Thomson Reuters
Data Curation: A Core Business Competency
Thomson Reuters
- Thomson Reuters is an information provider
  - Created by the acquisition of Reuters by Thomson
  - Over 50,000 employees
  - Commercial presence in 100+ countries
- Provides specialist curated information and information-based services
  - Selects the most relevant information for customers
  - Classifying, enriching, and distributing it in a way that can be readily consumed
Thomson Reuters
- Curation process
  - Working over approximately 1,000 data sources
  - Automatic tools provide first-level triage and classification
  - Refined by the intervention of human curators
- Curator is a domain specialist
  - Employs thousands of curators
Thomson Reuters
- OneCalais platform
  - Reduces the workload for classification of content
  - Natural Language Processing on unstructured text
  - Automatically derives tags for analyzed content
  - Enrichment with machine-readable structured data
    - Provides descriptions of specific entities (places, people, events, facts) present in the text
- OpenCalais (free version of OneCalais)
  - 20,000+ users, >4 million transactions per day
  - CNET, CBS Interactive, The Huffington Post, The Powerhouse Museum of Science and Design, …
ChemSpider
- Structure-centric chemical community
  - Over 300 data sources with 25 million records
  - Provided by chemical vendors, government databases, private laboratories, and individuals
- Pharma realizing the benefits of open data
  - Heavily leveraged by pharmaceutical companies as a pre-competitive resource for experimental and clinical trial investigation
  - GlaxoSmithKline made its proprietary malaria dataset of 13,500 compounds available
Protein Data Bank
- Dedicated to improving understanding of biological system functions via the 3-D structure of macromolecules
- Started in 1971 with 3 core members
  - Originally offered 7 crystal structures
  - Grown to 63,000 structures
  - Over 300 million dataset downloads
- Expanded beyond a curated data download service to include complex molecular visualization, search, and analysis capabilities
Overview
- Curation Background
  - The Business Need for Curated Data
  - What is Data Curation?
  - Data Quality and Curation
  - How to Curate Data
- Curation Communities and Enterprise Data
- Case Studies
  - Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank
- Best Practices from Case Study Learning
Best Practices from Case Study Learning
- Social Best Practices
  - Participation
  - Engagement
  - Incentives
  - Community Governance Models
- Technical Best Practices
  - Data Representation
  - Human and Automated Curation
  - Track Provenance
Social Best Practices
- Participation
  - Stakeholder involvement for data producers and consumers must occur early in the project
  - Provides insight into basic questions of what they want to do, for whom, and what it will provide
  - White papers are an effective means to present these ideas and solicit opinion from the community
    - Can be used to establish an informal "social contract" for the community
Social Best Practices
- Engagement
  - Outreach activities essential for promotion and feedback
  - Typical consumers-to-contributors ratios of less than 5%
  - Social communication and networking forums are useful
    - Majority of the community may not communicate using these media
    - Communication by email still remains important
Social Best Practices
- Incentives
  - Sheer curation needs a line of sight from the data curating activity to a tangible exploitation benefit
  - Lack of awareness of the value proposition will slow the emergence of collaborative contributions
  - Recognize contributing curators through a formal feedback mechanism
    - Reinforces the contribution culture
    - Directly increases output quality
Social Best Practices
- Community Governance Models
  - An effective governance structure is vital to ensure the success of the community
  - Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models
  - Open communities need to engage the community within the governance process
    - Follow less orthodox approaches using meritocratic and autocratic principles

The Role of Community-Driven Data Curation for Enterprises

  • 1.
    The Role ofCommunity-Driven Data Curation for EnterprisesEdward Curry, Andre Freitas, Seán O'Riain ed.curry@deri.orghttp://www.deri.org/http://www.EdwardCurry.org/
  • 2.
    Speaker ProfileResearchScientist at the Digital Enterprise Research Institute (DERI)Leading international web science research organizationResearching how web of data is changing way business work and interact with informationProjects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic searchInvestigate utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industriesInvited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
  • 3.
  • 4.
    AcknowledgementsCollaborators Andre Freitas& SeánO'RiainInsight from Thought LeadersEvan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York TimesKrista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson ReutersAntony Williams (VP of Strategic Development ) from ChemSpiderHelen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance. The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
  • 5.
    Further Information The Roleof Community-Driven Data Curation for EnterprisesEdward Curry, Andre Freitas, & Seán O'RiainIn David Wood (ed.), Linking Enterprise Data Springer, 2010.Available Free at: http://3roundstones.com/led_book/led-curry-et-al.html
  • 6.
    OverviewCuration BackgroundThe BusinessNeed for Curated DataWhat is Data Curation?Data Quality and CurationHow to Curate DataCuration Communities and Enterprise DataCase StudiesWikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data BankBest Practices from Case Study Learning 
  • 7.
  • 8.
    Access to theright information
  • 9.
    Confidence in thatinformationWorking incomplete inaccurate, or wrong information can have disastrous consequences
  • 10.
    The Problems withDataFlawed DataEffects 25% of critical data in world’s top companies (Gartner)Data QualityRecent banking crisis (Economist Dec’09)Inaccurate figures made it difficult to manage operations (investments exposure and risk)“asset are defined differently in different programs”“numbers did not always add up”“departments do not trust each other’s figures”“figures … not worth the pixels they were made of”
  • 11.
    What is DataCuration?DigitalCuration Selection, preservation, maintenance, collection, and archiving of digital assetsDataCurationActive management of data over its life-cycleData CuratorsEnsure data is trustworthy, discoverable, accessible, reusable, and fit for useMuseum cataloguers of the Internet age
  • 12.
    What is DataCuration?Data GovernanceConvergence of data quality, data management, business process management, and risk managementData Curation is a complimentary activityPart of overall data governance strategy for organization Data Curator = Data Steward ??Overlapping terms between communities
  • 13.
    Data Quality andCurationWhat is Data Quality?Desirable characteristics for information resource Described as a series of quality dimensionsDiscoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & ReputationData curation can be used to improve these quality dimensions
  • 14.
    Data Quality andCurationDiscoverability & AccessibilityCurate to streamline search by storing and classifying in appropriate and consistent mannerAccuracyCurate to ensure data correctly represents the “real-world” values it modelsConsistencyCurate to ensure datacreated and maintained using standardized definitions, calculations, terms, and identifiers
  • 15.
    Data Quality andCurationProvenance & ReputationCurate to track source of data and determine reputationCurate to include the objectivity of the source/producerIs the information unbiased, unprejudiced, and impartial?Or does it come from a reputable but partisan source?Other dimensions discussed in chapter
  • 16.
    How to CurateDataData Curation is a large field with sophisticated techniques and processesSectionprovides high-leveloverview on:Should you curate data?Types of CurationSetting up a curation processAdditional detail and references available in book chapter
  • 17.
    Should You CurateData?Curation can have multiple motivationsImproving accessibility, quality, consistency,…Will the data benefit from curation?Identify business caseDetermine if potential return support investmentNot all enterprise data should be curatedSuits knowledge-centric data rather than transactional operations data
  • 18.
    Types of DataCurationMultiple approaches to curate data, no single correct wayWho?Individual CuratorsCuration DepartmentsCommunity-based CurationHow?Manual Curation(Semi-)AutomatedSheer Curation
  • 19.
    Types of DataCuration – Who?Individual Data CuratorsSuitable for infrequently changing small quantity of data (<1,000 records)Minimal curation effort (minutes per record)
  • 20.
    Types of DataCuration – Who?Curation DepartmentsCuration experts working with subject matter experts to curate data within formal processCan deal with large curation effort (000’s of records)LimitationsScalability: Can struggle with large quantities of dynamic data (>million records) Availability: Post-hoc nature creates delay incurated data availability
  • 21.
    Types of DataCuration - Who?Community-Based Data CurationDecentralized approach to data curationCrowd-sourcing the curation processLeverages community of users to curate data Wisdom of the community (crowd)Can scale to millions of records
  • 22.
    Types of DataCuration – How?Manual CurationCurators directly manipulate dataCan tie users up with low-value add activities(Sem-)Automated CurationAlgorithms can (semi-)automate curation activities such as data cleansing, record duplication and classificationCan be supervised or approved by human curators
  • 23.
    Types of DataCuration – How?Sheer curation, or Curation at SourceCuration activities integrated in normal workflow of those creating and managing dataCan be as simple as vetting or “rating” the results of a curation algorithmResults can be available immediatelyBlended Approaches: Best of Both Sheer curation +post hoc curation departmentAllows immediate access to curated data Ensures quality control with expert curation
  • 24.
    Setting up aCuration Process5 Steps to setup a curation process:1 - Identify what data you need to curate2 - Identify who will curate the data3 - Define the curation workflow4 - Identity appropriate data-in & data-out formats5 - Identify the artifacts, tools, and processes needed to support the curation process
  • 25.
    Setting up aCuration ProcessStep 1: Identify what data you need to curateNewly created data and/or legacy data? How is new data created? Do users create the data, or is it imported from an external source? How frequently is new data created/updated? What quantity of data is created?How much legacy data exists?Is it stored within a single source, or scattered across multiple sources?
  • 26.
    Setting up aCuration Process Step 2: Identify who will curate the dataIndividuals, depts, groups, institutions,communityStep 3: Define the curation workflowWhat curation activities are required?How will curation activities be carried out?Step 4: Identity suitable data-in & -out formatsWhat is the best format for the data?Right format for receiving and publishing data is criticalSupport multiple formats to maximum participation
  • 27.
    Setting up aCuration ProcessStep 5: Identify the artifacts, tools, and processes needed to support curationWorkflow support/Community collaboration platformsAlgorithms can (semi-)automate curation activitiesMajor factors that influence approach:Quantity of data to be curated (new and legacy data)Amount of effort required to curate the dataFrequency of data change / data dynamicsAvailability of experts
  • 28.
    OverviewCuration BackgroundThe BusinessNeed for Curated DataWhat is Data Curation?Data Quality and CurationHow to Curate DataCuration Communities and Enterprise DataCase StudiesWikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data BankBest Practices from Case Study Learning 
  • 29.
    Community–based CurationTwo communityapproaches:Internal corporate communitiesExternal pre-competitive communitiesTo determine the right model consider:What the purpose of the community is? Will resulting curateddataset be publicly available? Or restricted?
  • 30.
    Community–based CurationInternal CommunitiesTapspotential of workforce to assist data curationCurate competitive enterprise data that will remain internal to the companyMay not always be the case e.g. product technical support and marketing data Can work in conjunction with curation dept.Community governance typically follows the organization’s internal governance model
  • 31.
  • 32.
    What is Pre-CompetitiveData?Two Types of Enterprise DataPropriety data for competitive advantageCommon data with no competitive advantageWhat is pre-competitive data?Has little potential for differentiationCan be shared without conferring commercial advantage to competitorCommon non-competitive dataNeeds to be maintaining and curatedCompanies duplicate effort in-house incurring full-cost
  • 33.
    Pre-competitive CommunitiesExternal pre-competitivecommunitiesShare costs, risks, and technical challengesCommon curation tasks carried out once inpublic domain rather than multiple timesin each companyReduces cost required to provide and maintain dataCan increase the quantity, quality, and accessFocus turns to value-add competitive activityMove “competitive onus” from novel data to novel algorithms, shifting emphasis from “proprietary data” to a “proprietary understanding of data”e.g. Protein Data Bank and Pistoia Alliance in Pharma
  • 34.
    External Pre-Competitive Communities
    Two popular community models:
    - Organization consortium
    - Open community
    Organization consortium:
    - Operates like a private democratic club
    - Usually a closed community; members invited to contribute based on skill set
    - Output data may be public or limited to members
    - Consortiums follow a democratic process
      - Member voting rights may reflect level of investment
      - Larger players may lead the consortium
  • 35.
    External Pre-Competitive Communities
    Open community:
    - Everyone can participate
    - "Founder(s)" define the desired curation activity
    - Seek public support to contribute to curation activities
    - Wikipedia, Linux, and Apache are good examples of large open communities
  • 36.
    Overview
    - Curation Background
      - The Business Need for Curated Data
      - What is Data Curation?
      - Data Quality and Curation
      - How to Curate Data
    - Curation Communities and Enterprise Data
    - Case Studies
      - Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank
    - Best Practices from Case Study Learning
  • 37.
    Wikipedia
    The World's Largest Open Digital Curation Community
  • 38.
    Wikipedia
    - Open-source encyclopedia collaboratively built by a large community
    - Challenges existing models of content creation
    - More than 19,000,000 articles
    - 270+ languages; 3,200,000+ articles in English
    - More than 157,000 active contributors
    - Studies show accuracy and stylistic formality equivalent to resources developed in expert-based closed communities, i.e. the Columbia and Britannica encyclopedias
  • 39.
    Wikipedia
    - MediaWiki: the wiki platform behind Wikipedia
      - Widespread and popular technology
    - Wikis can also support data curation
      - Lowers entry barriers for collaborative data curation
    - Widely used inside organizations
      - Intellipedia, covering 16 U.S. intelligence agencies
      - WikiProteins, curated protein data for knowledge discovery and annotation
  • 40.
    Wikipedia
    Decentralized environment supports the creation of high-quality information with:
    - Social organization
    - Artifacts, tools, and processes for cooperative work coordination
    Wikipedia's collaboration dynamics highlight good practices
  • 41.
    Wikipedia – Social Organization
    - Any user can edit its contents, without prior registration
    - Does not lead to a chaotic scenario
      - In practice, a highly scalable approach for high-quality content creation on the Web
    - Relies on a simple but highly effective way to coordinate its curation process
    - Curation is the activity of Wikipedia admins
      - Responsible for information quality standards
  • 42.
    Wikipedia – Social Organization
    Four main types of accounts:
    - Anonymous users: identified by their associated IP address
    - Registered users: users with an account on the Wikipedia website
    - Administrators/editors: registered users with additional permissions in the system, including access to curation tools
    - Bots: programs that perform repetitive tasks
  • 43.
  • 44.
    Wikipedia – Social Organization
    Incentives:
    - Improvement of one's reputation
    - Sense of efficacy
    - Contributing effectively to a meaningful project
    Over time, the focus of editors typically changes:
    - From curating a few articles on specific topics
    - To a more global curation perspective, enforcing quality assessment of Wikipedia as a whole
  • 45.
    Wikipedia – Artifacts, Tools & Processes
    - Wiki Article Editor (tool): WYSIWYG or markup text editor
    - Talk Pages (tool): public arena for discussions around Wikipedia resources
    - Watchlists (tool): help curators actively monitor the integrity and quality of resources they contribute to
    - Permission Mechanisms (tool): users with administrator status can perform critical actions, such as removing pages and granting administrative permissions to new users
  • 46.
    Wikipedia – Artifacts, Tools & Processes
    - Automated Editing (tool): bots are automated or semi-automated tools that perform repetitive tasks over content
    - Page History and Restore (tool): historical trail of changes to a Wikipedia resource
    - Guidelines, Policies & Templates (artifact): define curation guidelines for editors to assess article quality
    - Dispute Resolution (process): dispute mechanism between editors over article contents
    - Article Editing, Deletion, Merging, Redirection, Transwiking, Archival (processes): describe the curation actions over Wikipedia resources
  • 47.
    Wikipedia – DBpedia
    - DBpedia: knowledge base that inherits the massive volume of curated Wikipedia data
      - Built using infobox properties
      - Indirectly uses the wiki as a data curation platform
    - DBpedia provides direct access to the data
      - 3.4 million entities and 1 billion RDF triples
    - Comprehensive data infrastructure
      - Concept URIs, definitions, and basic types
  • 49.
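As a toy illustration of how DBpedia mines infoboxes, the sketch below maps key/value pairs from a Wikipedia infobox into RDF-style (subject, predicate, object) triples. The namespace URIs follow DBpedia conventions, but the function and the sample infobox are hypothetical, not DBpedia's actual extraction code.

```python
# Hypothetical sketch: turning Wikipedia infobox key/value pairs into
# RDF-style triples, as the DBpedia extraction framework does at scale.

def infobox_to_triples(page_title, infobox):
    """Map each infobox property to a triple about the page's resource."""
    subject = "http://dbpedia.org/resource/" + page_title.replace(" ", "_")
    return [
        (subject, "http://dbpedia.org/property/" + prop, value)
        for prop, value in infobox.items()
    ]

triples = infobox_to_triples("Dublin", {"country": "Ireland", "population": "553165"})
```

Run over every article's infobox, a loop like this is how a wiki becomes, indirectly, a data curation platform: each human edit to an infobox refreshes the corresponding triples.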
  • 50.
    The New York Times
    100 Years of Expert Data Curation
  • 51.
    The New York Times
    - Largest metropolitan and third-largest newspaper in the United States
    - nytimes.com
  • 52.
  • 53.
    The New York Times
    A 100-year-old curated repository defining its participation in the emerging Web of Data
    - Data curation dates back to 1913
      - Publisher/owner Adolph S. Ochs decided to provide a set of additions to the newspaper
    - New York Times Index
      - Organized catalog of article titles and summaries, containing the issue, date, and column of each article
      - Categorized by subjects and names
      - Introduced on a quarterly, then annual, basis
    - Transitory content of the newspaper became an important source of searchable historical data
      - Often used to settle historical debates
  • 54.
    The New York Times
    - Index Department was created in 1913
      - Curation and cataloguing of NYT resources
    - Since 1851, the NYT had kept only a low-quality index for internal use
    - Developed a comprehensive catalog using a controlled vocabulary
      - Covering subjects, personal names, organizations, geographic locations, and titles of creative works (books, movies, etc.), linked to articles and their summaries
    - The current Index Department has ~15 people
  • 55.
    The New York Times
    Challenges with consistently and accurately classifying news articles over time:
    - Keywords expressing subjects may vary due to cultural or legal constraints
    - Identities of some entities, such as organizations and places, changed over time
    - The controlled vocabulary grew to hundreds of thousands of categories, adding complexity to the classification process
  • 56.
    The New York Times
    - The increased importance of the Web drove the need to improve categorization of online content
    - Curation carried out by the Index Department runs on "library time" (days to weeks)
      - The print edition can work with a next-day index
      - Not suitable for real-time online publishing
    - nytimes.com needed a same-day index
  • 57.
    The New York Times
    Introduced a two-stage curation process:
    - Editorial staff perform best-effort, semi-automated sheer curation at the point of online publication
      - Several hundred journalists
    - The Index Department follows up with long-term, accurate classification and archiving
    Benefits:
    - Non-expert journalist curators provide instant accessibility to online users
    - The Index Department provides long-term, high-quality curation in a "trust but verify" approach
  • 58.
    NYT Curation Workflow
    Curation starts when the article leaves the newsroom
  • 59.
    NYT Curation Workflow
    A member of the editorial staff submits the article to a web-based, rule-based information extraction system (SAS Teragram)
  • 60.
    NYT Curation Workflow
    Teragram uses linguistic extraction rules based on a subset of the Index Department's controlled vocabulary
  • 61.
    NYT Curation Workflow
    Teragram suggests tags from the Index vocabulary that can potentially describe the content of the article
  • 62.
    NYT Curation Workflow
    The editorial staff member selects the terms that best describe the contents and inserts new tags if necessary
  • 63.
    NYT Curation Workflow
    The tags are reviewed by the taxonomy managers, with feedback to the editorial staff on the classification process
  • 64.
    NYT Curation Workflow
    The article is published online at nytimes.com
  • 65.
    NYT Curation Workflow
    At a later stage, the article receives second-level curation by the Index Department: additional Index tags and a summary
  • 66.
    NYT Curation Workflow
    The article is submitted to the NYT Index
  • 67.
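The first stage of the workflow above can be sketched in miniature: a rule-based extractor (standing in here for SAS Teragram) suggests tags from a controlled vocabulary, then an editor confirms the suggestions and may add tags of their own. The vocabulary, trigger rules, and function names are illustrative assumptions, not the NYT's actual rules.

```python
# Illustrative sketch of the two-stage workflow (not the NYT's actual
# rules): a rule-based extractor suggests tags from a controlled
# vocabulary; an editor then confirms suggestions and may add new tags.

CONTROLLED_VOCAB = {   # trigger word -> tag (toy subset, assumed)
    "election": "Elections",
    "senate": "United States Politics and Government",
}

def suggest_tags(article_text):
    """Stage 1: suggest tags whose trigger words appear in the article."""
    text = article_text.lower()
    return sorted({tag for word, tag in CONTROLLED_VOCAB.items() if word in text})

def editorial_review(accepted, new_tags=()):
    """Stage 2: the editor keeps accepted suggestions and may add tags."""
    return sorted(set(accepted) | set(new_tags))

suggested = suggest_tags("The senate election dominated the news.")
final = editorial_review(suggested, new_tags=["Opinion"])
```

The split mirrors the "trust but verify" design: fast, best-effort tagging at publication time, with slower expert review layered on afterwards.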
    The New York Times
    Early adopter of Linked Open Data (June 2009)
  • 68.
    The New York Times
    Linked Open Data @ data.nytimes.com
    - Subset of 10,000 tags from the Index vocabulary
    - Dataset of people, organizations & locations
    - Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate, ...
    Benefits:
    - Improves traffic through third-party data usage
    - Lowers the development cost of new applications for different verticals inside the website, e.g. movies, travel, sports, books
  • 69.
    Thomson Reuters
    Data Curation: A Core Business Competency
  • 70.
    Thomson Reuters
    - Thomson Reuters is an information provider
      - Created by the acquisition of Reuters by Thomson
      - Over 50,000 employees; commercial presence in 100+ countries
    - Provides specialist curated information and information-based services
      - Selects the most relevant information for customers
      - Classifies, enriches, and distributes it in a way that can be readily consumed
  • 71.
    Thomson Reuters
    Curation process:
    - Works over approximately 1,000 data sources
    - Automatic tools provide first-level triage and classification
    - Refined by the intervention of human curators
      - Each curator is a domain specialist
    - Employs thousands of curators
  • 72.
    Thomson Reuters
    OneCalais platform:
    - Reduces the workload for classification of content
    - Natural language processing on unstructured text
    - Automatically derives tags for analyzed content
    - Enrichment with machine-readable structured data
      - Provides descriptions of specific entities (places, people, events, facts) present in the text
    - OpenCalais (free version of OneCalais)
      - 20,000+ users, >4 million transactions per day
      - CNET, CBS Interactive, The Huffington Post, The Powerhouse Museum of Science and Design, ...
  • 73.
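A minimal sketch of the kind of entity enrichment described above, assuming a toy gazetteer of known entities. The real platform applies NLP to unstructured text and returns its own entity descriptors, so every name, type, and URI below is a placeholder, not OneCalais output.

```python
# Toy sketch of entity enrichment: tag entities found in unstructured
# text with machine-readable descriptors. The gazetteer, types, and
# URIs are placeholders (assumptions), not the OneCalais API.

KNOWN_ENTITIES = {
    "Dublin": {"type": "Place", "uri": "http://example.org/place/Dublin"},
    "Thomson Reuters": {"type": "Organization",
                        "uri": "http://example.org/org/Thomson_Reuters"},
}

def tag_entities(text):
    """Return a structured descriptor for each known entity in the text."""
    return [{"surface": name, **meta}
            for name, meta in KNOWN_ENTITIES.items() if name in text]

entities = tag_entities("Thomson Reuters opened a new office in Dublin.")
```

In the curation pipeline, output like this forms the first-level triage that human domain specialists then refine.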
    ChemSpider
    - Structure-centric chemical community
    - Over 300 data sources with 25 million records
      - Provided by chemical vendors, government databases, private laboratories, and individuals
    - Pharma realizing the benefits of open data
      - Heavily leveraged by pharmaceutical companies as a pre-competitive resource for experimental and clinical trial investigation
      - GlaxoSmithKline made its proprietary malaria dataset of 13,500 compounds available
  • 74.
    Protein Data Bank
    - Dedicated to improving understanding of the functions of biological systems through the 3-D structure of macromolecules
    - Started in 1971 with 3 core members
      - Originally offered 7 crystal structures; grown to 63,000 structures
    - Over 300 million dataset downloads
    - Expanded beyond a curated data download service to include complex molecular visualization, search, and analysis capabilities
  • 75.
    Overview
    - Curation Background
      - The Business Need for Curated Data
      - What is Data Curation?
      - Data Quality and Curation
      - How to Curate Data
    - Curation Communities and Enterprise Data
    - Case Studies
      - Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank
    - Best Practices from Case Study Learning
  • 76.
    Best Practices from Case Study Learning
    Social best practices:
    - Participation
    - Engagement
    - Incentives
    - Community Governance Models
    Technical best practices:
    - Data Representation
    - Human and Automated Curation
    - Track Provenance
  • 77.
    Social Best Practices: Participation
    - Stakeholder involvement of both data producers and consumers must occur early in the project
      - Provides insight into basic questions of what the community wants to do, for whom, and what it will provide
    - White papers are an effective means to present these ideas and solicit opinion from the community
      - Can be used to establish an informal "social contract" for the community
  • 78.
    Social Best Practices: Engagement
    - Outreach activities are essential for promotion and feedback
    - Typically, fewer than 5% of consumers become contributors
    - Social communication and networking forums are useful
      - The majority of the community may not communicate through these media
    - Communication by email remains important
  • 79.
    Social Best Practices: Incentives
    - Sheer curation needs a line of sight from the data curation activity to tangible exploitation benefits
      - Lack of awareness of the value proposition will slow the emergence of collaborative contributions
    - Recognize contributing curators through a formal feedback mechanism
      - Reinforces the contribution culture
      - Directly increases output quality
  • 80.
    Social Best Practices: Community Governance Models
    - An effective governance structure is vital to the success of the community
    - Internal communities and consortiums perform well when they leverage traditional corporate and democratic governance models
    - Open communities need to engage the community within the governance process
      - Follow less orthodox approaches using meritocratic and autocratic principles
  • 81.
    Technical Best Practices
    Data Representation
    - Must be robust and standardized to encourage community usage and tool development
    - Support legacy data formats and the ability to translate data forward to new technologies and standards
    Human & Automated Curation
    - Balancing human and automated curation improves data quality
    - Automated curation should always defer to, and never override, human curation edits
    - Automate validation of data deposition and entry
    - Target the community at focused curation tasks
  • 82.
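The precedence rule above ("automated curation should always defer to, and never override, human curation edits") can be made concrete with a small sketch. The record layout, with a set of human-edited field names alongside the data fields, is an assumption for illustration.

```python
# Sketch of the precedence rule: automated curation fills gaps but never
# overwrites a field a human curator has already set. The record layout
# (a dict plus a "human_edited" set of field names) is an assumption.

def merge_automated(record, automated_updates):
    """Apply automated updates only to fields without a human edit."""
    locked = record.get("human_edited", set())
    for field, value in automated_updates.items():
        if field not in locked:
            record[field] = value
    return record

record = {"title": "Aspirin", "human_edited": {"title"}}
merged = merge_automated(record, {"title": "aspirin (bot)", "formula": "C9H8O4"})
# The human-set title survives; the bot may still add the missing formula.
```

Encoding the rule in the merge logic, rather than trusting each bot to behave, is what makes human curation authoritative by construction.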
    Technical Best Practices: Track Provenance
    - All curation activities should be recorded and maintained as part of the data provenance effort
      - Especially where human curators are involved
    - Users can have different perspectives on provenance
      - A scientist may need to evaluate the fine-grained experiment description behind the data
      - For a business analyst, the "brand" of the data provider can be sufficient for determining quality
  • 83.
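One hedged way to record such provenance is a per-action log entry that captures both the fine-grained detail a scientist might inspect and the coarse provider "brand" an analyst relies on. The field names below are illustrative assumptions, not a standard schema.

```python
# Sketch of per-action provenance logging: each curation action records
# fine-grained detail (for a scientist) and the provider "brand" (often
# enough for a business analyst). Field names are illustrative.
from datetime import datetime, timezone

def record_action(log, actor, action, detail, provider):
    """Append one provenance entry describing a curation action."""
    log.append({
        "actor": actor,        # human curator or bot identifier
        "action": action,      # e.g. "classify", "edit", "merge"
        "detail": detail,      # fine-grained description of the change
        "provider": provider,  # coarse-grained source "brand"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return log

log = record_action([], "curator_42", "classify", "added subject tag", "NYT Index")
```

Because the same entry serves both audiences, consumers can filter to the level of provenance they need instead of the system prescribing one view.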
    Conclusions
    - Data curation can ensure the quality of data and its fitness for use
    - Pre-competitive data can be shared without conferring a commercial advantage
    - Pre-competitive data communities
      - Common curation tasks carried out once in the public domain
      - Reduces cost; increases quantity and quality