With the increased utilization of data within their operational and strategic processes, enterprises need to ensure data quality and accuracy. Data curation is a process that can ensure the quality of data and its fitness for use. Traditional approaches to curation are struggling with increased data volumes and near real-time demands for curated data. In response, curation teams have turned to community crowd-sourcing and semi-automated metadata tools for assistance. This chapter provides an overview of data curation, discusses the business motivations for curating data, and investigates the role of community-based data curation, focusing on internal communities and pre-competitive data collaborations. The chapter is supported by case studies from Wikipedia, The New York Times, Thomson Reuters, the Protein Data Bank, and ChemSpider, from which best practices for both the social and technical aspects of community-driven data curation are described.
E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47.
Collaborative Data Management: How Crowdsourcing Can Help To Manage Data (Edward Curry)
Data management efforts such as Master Data Management (MDM) are a popular approach to high-quality enterprise data. However, MDM can be heavily centralized and labour intensive, and its cost and effort can become prohibitively high. The concentration of data management and stewardship in a few highly skilled individuals, such as developers and data experts, can be a significant bottleneck. This talk explores how to effectively involve a wider community of users in collaborative data management activities. The bottom-up approach of involving crowds in the creation and management of data has been demonstrated by projects like Freebase, Wikipedia, and DBpedia. The talk discusses how collaborative data management can be applied within an enterprise context using platforms such as Amazon Mechanical Turk, Mobile Works, and internal enterprise human computation platforms. A minimal sketch of one such curation step follows the topic list below.
Topics covered include:
- Introduction to Crowdsourcing and Human Computation for Data Management
- Crowds vs. Communities: when to use them and why
- Push vs. Pull methods of crowdsourcing data management
- Setting up and running a collaborative data management process
- Modelling the expertise of communities
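As a concrete illustration of the collaborative curation steps listed above, the sketch below aggregates redundant crowd judgements on a data record by majority vote, escalating to an expert data steward when the crowd disagrees. The task wording, answer labels, and agreement threshold are illustrative assumptions, not any specific platform's API.

```python
from collections import Counter

def aggregate_votes(answers, min_agreement=0.6):
    """Majority-vote aggregation of redundant crowd answers.

    Returns (winning_answer, agreement), or (None, agreement) when the
    crowd is too divided, so the record can be escalated to an expert
    data steward -- a typical fallback in collaborative curation.
    """
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)
    return (winner if agreement >= min_agreement else None), agreement

# Three workers check whether a master-data record is a duplicate.
answers = ["duplicate", "duplicate", "not-duplicate"]
label, agreement = aggregate_votes(answers)
print(label, round(agreement, 2))  # duplicate 0.67
```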
Developing a Sustainable IT Capability: Lessons From Intel's Journey (Edward Curry)
Intel Corporation set itself a goal to reduce its global-warming greenhouse gas footprint by 20% by 2012, from 2007 levels. Through the use of sustainable IT, the Intel IT organization is recognized as a significant contributor to the company’s sustainability strategy, transforming both its own IT operations and Intel’s wider operations. This article describes how Intel has achieved IT sustainability benefits thus far by developing four key capabilities. These capabilities have been incorporated into the Sustainable ICT Capability Maturity Framework (SICT-CMF), a model developed by an industry consortium in which the authors were key participants. The article ends with lessons learned from Intel’s experiences that can be applied by business and IT executives in other enterprises.
Wikipedia (DBpedia): Crowdsourced Data Curation (Edward Curry)
Wikipedia is an openly editable encyclopedia, built collaboratively by a large community of web editors. The success of Wikipedia as one of the most important sources of information available today still challenges existing models of content creation. Although the term ‘curation’ is not commonly used by Wikipedia’s contributors, digital curation is the central activity of Wikipedia editors, who are responsible for its information quality standards.
Wikipedia is already widely used as a collaborative environment inside organizations.
The investigation of the collaboration dynamics behind Wikipedia highlights important features and good practices that can be applied in other organizations. Our analysis focuses on the curation perspective and covers two important dimensions: social organization, and the artifacts, tools and processes for coordinating cooperative work. These are the key enablers that support the creation of high-quality information products in Wikipedia’s decentralized environment.
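The structured face of this community curation is DBpedia, which republishes Wikipedia's curated infobox data as linked data. As a minimal sketch, assuming the SPARQLWrapper Python library and the public DBpedia endpoint are available, one can query the community-curated abstract of a topic:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Query DBpedia, the linked-data extraction of Wikipedia's curated content.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Data_curation>
          <http://dbpedia.org/ontology/abstract> ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:200])
```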
Dealing with Semantic Heterogeneity in Real-Time Information (Edward Curry)
Tutorial at the EarthBiAs 2014 Summer School on Dealing with Semantic Heterogeneity in Real-Time Information
Part I: Large Scale Open Environments
Part II: Computational Paradigms
Part III: RDF Event Processing
Part IV: Theory of Event Exchange
Part V: Approaches to Semantic Decoupling
Part VI: Example Application: Linked Energy Intelligence
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup (Edward Curry)
Data management efforts such as Master Data Management and Data Curation are popular approaches to high-quality enterprise data. However, data curation can be heavily centralised and labour intensive, and its cost and effort can become prohibitively high. The concentration of data management and stewardship in a few highly skilled individuals, such as developers and data experts, can be a significant bottleneck. This talk explores how to effectively involve a wider community of users in big data management activities. The bottom-up approach of involving crowds in the creation and management of data has been demonstrated by projects like Freebase, Wikipedia, and DBpedia. The talk discusses how crowdsourcing data management techniques can be applied within an enterprise context.
Topics covered include:
- Data Quality and Data Curation
- Crowdsourcing
- Case Studies on Crowdsourced Data Curation
- Setting up a Crowdsourced Data Curation Process
- Linked Open Data Example
- Future Research Challenges
Big Data Analytics: A New Business Opportunity (Edward Curry)
This talk introduces Big Data analytics and how they can be used to deliver value within organisations. The talk will cover the transformational potential of creating data value chains between different sectors. Developing a Big Data analytics capability will be discussed in addition to the challenges facing the emerging data economy.
Towards a BIG Data Public Private Partnership (Edward Curry)
Building an industrial community around Big Data in Europe is the priority of the BIG: Big Data Public Private Forum project. In this workshop we will present the work of the project, including analysis of foundational Big Data research technologies, technology and strategy roadmaps that enable businesses to understand the potential of Big Data technologies, and the collaboration and dissemination infrastructure needed to link technology suppliers, integrators, and leading user organizations. BIG is working towards the definition and implementation of a clear strategy that tackles the necessary efforts in Big Data research and innovation, while also providing a major boost for technology adoption and supporting actions for the successful implementation of the Big Data economy.
Machine learning techniques to improve data management and data quality: this presentation by Prof. Christine Legner and Martin Fadler summarizes research conducted in the Competence Center Corporate Data Quality (CC CDQ). It was presented on February 13, 2019 at the DSAG Technologietage in Bonn.
Advanced Analytics and Machine Learning with Data Virtualization (Denodo)
Watch full webinar here: https://bit.ly/3aXysas
Advanced data science techniques, like machine learning, have proven extremely useful for deriving valuable insights from your data, and data science platforms have become more approachable and user friendly. Yet despite all these advancements, data scientists still spend most of their time massaging and manipulating data into a usable asset. How can we empower the data scientist? How can we make data more accessible and foster a data-sharing culture?
Join us, and we will show you how Data Virtualization can do just that, with an agile and AI/ML laced data management platform. It can empower your organization, foster a data sharing culture, and simplify the life of the data scientist.
Watch this webinar to learn:
- How data virtualization simplifies the life of the data scientist, by overcoming data access and manipulation hurdles.
- How the integrated Denodo Data Science notebook provides a unified environment
- How Denodo uses AI/ML internally to drive the value of the data and expose insights
- How customers have used Data Virtualization in their Data Science initiatives.
Metadata Standards and Organizational Resource Allocation: A Case for the Eff... (Camille Mathieu)
Metadata Standards and Organizational Resource Allocation: A Case for the Effective Management of Digital Assets (draft) is the draft for my Master's portfolio defense to occur in about 6 months. This presentation summarizes common deficiencies in enterprise content management and links these deficiencies to increases in organizational inefficiency. The standardization of metadata across repositories and across enterprises is advocated as a solution to many content management and information retrieval woes experienced by organizations. Any feedback greatly appreciated!
ABSTRACT: The management of digital intellectual assets has become a crucial governance challenge for many organizations. Investments in metadata standardization would greatly increase an organization’s ability to store, retrieve, and manipulate these assets most effectively. With their reliance on manageable digital assets for resource allocation and internal search skyrocketing, organizations should prioritize the development and implementation of consistent metadata standards.
How to Crunch Petabytes with Hadoop and Big Data Using InfoSphere BigInsights... (DATAVERSITY)
Do you wonder how to process huge amounts of data in a short amount of time? If yes, this session is for you! You will learn why Apache Hadoop and Streams are core frameworks that enable storing, managing, and analyzing vast amounts of data. You will learn the idea behind Hadoop's famous map-reduce algorithm and why it is at the heart of solutions that process massive amounts of data with flexible workloads and software-based scaling. We explore how to go beyond Hadoop with both real-time and batch analytics, usability, and manageability. For practical examples, we will use IBM InfoSphere BigInsights and Streams, which build on top of open source tooling when going beyond the basics and scaling up and out is needed.
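To make the map-reduce idea mentioned above concrete, here is a minimal pure-Python simulation of the classic word-count pattern; real Hadoop distributes the map, shuffle, and reduce phases across a cluster, which this single-process sketch only imitates.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs hadoop", "hadoop processes big data"]
pairs = (pair for doc in docs for pair in map_phase(doc))
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'processes': 1}
```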
With many organisations considering getting on the Hadoop bandwagon, this document provides an overview of the planned use cases for Hadoop, an illustration of some of the common technology components, suggestions on when Hadoop is worth considering, some of the challenges organisations are experiencing, cost considerations, and finally, how an organisation should position itself for a Big Data initiative. Any organisation considering a Big Data initiative with Hadoop should thoroughly consider each of these areas before embarking on a course of action.
A Technical Introduction to Big Data Analytics (Pethuru Raj PhD)
This presentation details the sources of big data, the value of big data, what to do with big data, and the platforms, infrastructures, and architectures for big data analytics.
How COVID-19 is Accelerating Digital Transformation in Health and Social Care? (NUS-ISS)
Without a doubt, COVID-19 has become the unexpected driver for digital transformation. It is accelerating the transformation, especially in the health and social care space, as we are forced to adapt to the new norm brought about by the crisis. Join us as we discuss the trends and what might be the new health and social care landscape in Singapore after 2020.
Analytics 3.0: Measurable business impact from analytics & big data (Microsoft)
Presentation from the Harvard Business Review event on Analytics and Big Data
(15 October 2013)
"Featuring analytics expert Tom Davenport, author of Competing on Analytics, Analytics at Work, and the just-released Keeping Up with the Quants" 
Analytics: The Real-world Use of Big Data (David Pittman)
UPDATE: Register now to participate in the 2013 survey at http://ibm.com/2013bigdatasurvey. IBM’s Institute for Business Value (IBV) and the University of Oxford released their information-rich and insightful report “Analytics: The real-world use of big data.” Based on a survey of over 1000 professionals from 100 countries across 25+ industries, the report provides insights into organizations’ top business objectives, where they are in their big data journey, and how they are advancing their big data efforts. It also provides a pragmatic set of recommendations to organizations as they proceed down the path of big data. For additional information, including links to a podcast with one of the lead researchers and a link to download the full report, visit http://ibm.co/RB14V0
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado... (Edureka!)
This Edureka Big Data tutorial helps you understand Big Data in detail. It discusses the evolution of Big Data, the factors associated with it, and the different opportunities it presents. It then covers the problems associated with Big Data and how Hadoop emerged as a solution. Below are the topics covered in this tutorial:
1) Evolution of Data
2) What is Big Data?
3) Big Data as an Opportunity
4) Problems in Encasing Big Data Opportunity
5) Hadoop as a Solution
6) Hadoop Ecosystem
7) Edureka Big Data & Hadoop Training
Big Data: From HindSight to Insight to Foresight (Sunil Ranka)
When it comes to analytics and reporting, there is a fine line between hindsight, insight, and foresight. With the evolution of Big Data technology, there is a need to derive value from larger datasets that were not available in the past. Even before we start using the shiny new technologies, we need to understand what is categorized as reporting, business intelligence, or Big Data and analytics. Based on my experience, people struggle to distinguish between reporting, analytics, and business intelligence.
Challenges Ahead for Converging Financial Data (Edward Curry)
Consumers of financial information come in many guises, from personal investors looking for that value-for-money share, to government regulators investigating corporate fraud, to business executives seeking advantage over their competition. While the particular analysis performed by each of these information consumers will vary, they all have to deal with the explosion of information available from multiple sources, including SEC filings, corporate press releases, market press coverage, and expert commentary. Recent economic events have begun to bring sharp focus on the activities and actions of financial markets, institutions, and not least regulatory authorities. Calls for enhanced scrutiny will bring increased regulation and information transparency. While extracting information from individual filings is relatively easy when a machine-readable format is utilized (for example, XBRL, the eXtensible Business Reporting Language), cross-comparison of extracted financial information can be problematic, as descriptions and accounting terms vary across companies and jurisdictions. Across multiple sources the problem becomes the classical data integration problem, where a common data abstraction is necessary before functional data use can begin. Within this paper we discuss the challenges in converging financial data from multiple sources. We concentrate on integrating data from multiple sources in terms of the abstraction, linking, and consolidation activities needed to consolidate data before more sophisticated analysis algorithms can examine it for the objectives of particular information consumers (e.g., competitive analysis, regulatory compliance, or investor analysis). We base our discussion on several years of researching and deploying data integration systems in both web and enterprise environments.
E. Curry, A. Harth, and S. O’Riain, “Challenges Ahead for Converging Financial Data,” in Proceedings of the XBRL/W3C Workshop on Improving Access to Financial Data on the Web, 2009.
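A toy illustration of the abstraction-and-consolidation step the paper describes: normalising company names that are reported differently across filings before linking their figures. The normalisation rules and sample records below are illustrative assumptions, not the paper's actual method.

```python
import re

# Legal-form suffixes that commonly vary across filings and jurisdictions.
SUFFIXES = re.compile(r"\b(inc|corp|corporation|ltd|plc|co)\.?$", re.I)

def canonical_name(name):
    """Crude abstraction step: strip punctuation and legal suffixes."""
    name = re.sub(r"[.,]", "", name).strip().lower()
    return SUFFIXES.sub("", name).strip()

# The same issuer reported under different labels in two sources.
filings = [("Acme Corp.", "10-K", 1.2e9), ("ACME Corporation", "press", 1.2e9)]
consolidated = {}
for name, source, revenue in filings:
    consolidated.setdefault(canonical_name(name), []).append((source, revenue))
print(consolidated)  # both records linked under the key 'acme'
```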
A Capability Maturity Framework for Sustainable ICT (Edward Curry)
Researchers estimate that information and communication technology (ICT) is responsible for at least 2 percent of global greenhouse gas (GHG) emissions. Furthermore, in any individual business, ICT is responsible for a much higher percentage of that business's GHG footprint. Yet researchers also estimate that ICT can provide business solutions that deliver GHG reductions five times the size of ICT's own footprint. However, because the field is new and evolving, few guidelines and best practices are available. To address this issue, a consortium of leading organizations from industry, the nonprofit sector, and academia has developed and tested a framework for systematically assessing and improving sustainable ICT (SICT) capabilities. The Innovation Value Institute (IVI; http://ivi.nuim.ie) consortium used an open-innovation model of collaboration, engaging academia and industry in scholarly work to create the SICT Capability Maturity Framework (SICT-CMF), which is discussed in this paper.
B. Donnellan, C. Sheridan, and E. Curry, "A Capability Maturity Framework for Sustainable Information and Communication Technology," IEEE IT Professional, vol. 13, no. 1, pp. 33-40, Jan. 2011.
The New York Times is the largest metropolitan newspaper and the third-largest newspaper in the United States. The Times website, nytimes.com, is ranked as the most popular newspaper website in the United States and is an important source of advertising revenue for the company. The NYT has a rich history of curating its articles, and its 100-year-old curated repository ultimately defined its participation as one of the first players in the emerging Web of Data.
Querying Heterogeneous Datasets on the Linked Data Web (Edward Curry)
The growing number of datasets published on the Web as linked data brings both opportunities for high data availability and challenges inherent to querying data in a semantically heterogeneous and distributed environment. Approaches used for querying siloed databases fail at Web-scale because users don't have an a priori understanding of all the available datasets. This article investigates the main challenges in constructing a query and search solution for linked data and analyzes existing approaches and trends.
Designing Next Generation Smart City Initiatives: Harnessing Findings And Les... (Edward Curry)
The proliferation of “Smart City” initiatives around the world is part of the strategic response by governments to the challenges and opportunities of increasing urbanization and the rise of cities as the nexus of societal development. As a framework for urban transformation, Smart City initiatives aim to harness information and communication technologies and knowledge infrastructures for economic regeneration, social cohesion, better city administration, and infrastructure management. However, experiences from earlier Smart City initiatives have revealed several technical, management, and governance challenges arising from the inherent nature of a Smart City as a complex socio-technical “system of systems”. While these early lessons are informing more modest objectives for planned Smart City programs, no rigorously developed framework based on careful analysis of existing initiatives is available to guide policymakers, practitioners, and other Smart City stakeholders. In response to this need, this paper presents a “Smart City Initiative Design (SCID) Framework” grounded in findings from the analysis of ten major Smart City programs from the Netherlands, Sweden, Malta, the United Arab Emirates, Portugal, Singapore, Brazil, South Korea, China, and Japan. The findings provide a design space for the objectives, implementation options, strategies, and the enabling institutional and governance mechanisms of Smart City initiatives.
Open Data Innovation in Smart Cities: Challenges and Trends (Edward Curry)
Open Data initiatives are increasingly considered as defining elements of emerging smart cities. However, few studies have attempted to provide a better understanding of the nature of this convergence and the impact on both domains. This talk examines the challenges and trends with open data initiatives using a socio-technical perspective of smart cities. The talk presents findings from a detailed study of 18 open data initiatives across five smart cities to identify emerging best practice. Three distinct waves of open data innovation for smart cities are discussed. The talk details the specific impacts of open data innovation on the different smart cities domains, governance of the cities, and the nature of datasets available in the open data ecosystem within smart cities.
Improving Policy Coherence and Accessibility through Semantic Web Technologie... (Edward Curry)
The complexity, volume, and diversity of government policies and regulations places a significant burden on both the complying parties and government itself. On the one hand, businesses, civil organizations, and other societal entities are required to simultaneously comply with and interpret different and possibly conflicting or inconsistent regulations. On the other hand, government as a whole must ensure policy and regulatory coherence across its various policy domains. While the recent wave of open government initiatives has led to significantly more public access to these documents, features that allow cross-referencing of related documents, or linking to less formal documents and comments on other media that are more understandable and accessible to the public, are not common, if available at all, today. As a solution to this challenge, we propose an Open Government-wide Policy and Regulation Information Space consisting of documents that are “semantically” annotated and cross-linked to other documents in the information space as well as to external resources such as interpretations, comments, and blogs on the social web.
Our approach is three-fold. First, we identify the requirements for the infrastructure. Second, we elaborate a reference architecture identifying the various elements needed within the infrastructure. Third, we show how such an infrastructure may be realised as a linked data portal where policies and regulations are published as linked open data. Finally, we present a case study involving environmental policy and regulations, discuss the potential impact of such an infrastructure on the coherency and accessibility of policies and regulations, and conclude with the challenges associated with provisioning a linked open policy and regulatory information infrastructure.
Citizen Actuation For Lightweight Energy Management (Edward Curry)
In this work, we aim to utilise the concept of citizen sensors but also introduce the theory of citizen actuation. Citizen sensors observe, report, and collect data; we propose that by supporting these citizen sensors with methods to affect their surroundings, we enable them to become citizen actuators. We outline a use case for citizen actuation in the energy management domain, propose an architecture (a Cyber-Physical Social System) built on previous work in energy management with Twitter integration and Complex Event Processing (CEP), and perform an experiment to test this theory. We motivate the need for citizen actuation in building management systems due to the high cost of actuation hardware. We define the concept of citizen actuation and outline an experiment that shows a reduction in average energy usage of 24%. The experiment supports the concept of citizen actuation to improve energy usage within the experimental environment, and we discuss future research directions in this area.
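As a rough sketch of the citizen-actuation loop described above (sensor event in, human-directed actuation request out), assuming a hypothetical `post_message` hook in place of the paper's actual Twitter integration:

```python
def post_message(channel, text):
    # Hypothetical stand-in for the paper's Twitter integration.
    print(f"[{channel}] {text}")

def on_energy_event(zone, power_kw, occupied, threshold_kw=5.0):
    """Simple event rule: when usage is high in an occupied zone, ask the
    occupants (citizen actuators) to intervene, instead of relying on
    costly automated actuation hardware."""
    if power_kw > threshold_kw and occupied:
        post_message("building-feed",
                     f"Zone {zone} is using {power_kw} kW - could someone "
                     "switch off unused equipment?")

on_energy_event(zone="3A", power_kw=7.2, occupied=True)
```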
Interactive Water Services: The Waternomics Approach (Edward Curry)
WATERNOMICS focuses on the development of ICT as an enabling technology to manage water as a resource, increase end-user conservation awareness, and effect behavioral changes. Unique aspects of WATERNOMICS include personalized feedback about end-user water consumption, the development of systematic and standards-based water resource management systems, new sensor hardware developments, and the introduction of forecasting and fault detection diagnosis to the analysis of water consumption data. These services will be bundled into the WATERNOMICS Water Information Services Platform. This paper presents the overall architectural approach to WATERNOMICS and details the potential interactive services possible based on this novel platform.
An Environmental Chargeback for Data Center and Cloud Computing Consumers (Edward Curry)
Government, business, and the general public increasingly agree that the polluter should pay. Carbon dioxide and environmental damage are considered viable chargeable commodities. The net effect of this for data center and cloud computing operators is that they should look to “chargeback” the environmental impacts of their services to the consuming end-users. An environmental chargeback model can have a positive effect on environmental impacts by linking consumers to the indirect impacts of their usage, facilitating clearer understanding of the impact of their actions. In this paper we motivate the need for environmental chargeback mechanisms. The environmental chargeback model is described including requirements, methodology for definition, and environmental impact allocation strategies. The paper details a proof-of-concept within an operational data center together with discussion on experiences gained and future research directions.
E. Curry, S. Hasan, M. White, and H. Melvin, “An Environmental Chargeback for Data Center and Cloud Computing Consumers,” in First International Workshop on Energy-Efficient Data Centers, J. Huusko, H. de Meer, S. Klingert, and A. Somov, Eds. Madrid, Spain: Springer Berlin / Heidelberg, 2012.
Within the operational phase, buildings are now producing more data than ever before: energy usage, utility information, occupancy patterns, weather data, and more. In order to manage a building holistically, it is important to use knowledge from across these information sources. However, many barriers exist to their interoperability, and there is little interaction between these islands of information. As building data moves to the cloud, there is a critical need to reflect on how cloud-based data services are designed from an interoperability perspective. If new cloud data services are designed in the same manner as traditional building management systems, they will suffer from the same data interoperability problems. Linked data technology leverages the existing open protocols and W3C standards of the Web architecture for sharing structured data on the web. In this paper we propose the use of linked data as an enabling technology for cloud-based building data services. The objective of linking building data in the cloud is to create an integrated, well-connected graph of relevant information for managing a building. This paper describes the fundamentals of the approach and demonstrates the concept within a small and medium-sized enterprise (SME) with an owner-occupied office building.
Towards Lightweight Cyber-Physical Energy Systems using Linked Data, the Web ... (Edward Curry)
Cyber-Physical Energy Systems (CPES) exploit the potential of information technology to boost energy efficiency while minimising environmental impacts. CPES can help manage energy more efficiently by providing a functional view of the entire energy system so that energy activities can be understood, changed, and reinvented to better support sustainable practices. CPES can be applied at different scales, from Smart Grids and Smart Cities to Smart Enterprises and Smart Buildings. Significant technical challenges exist in terms of information management, leveraging real-time sensor data, and coordinating the various stakeholders to optimize energy usage.
In this talk I describe an approach to overcoming these challenges by reusing Web standards to quickly connect the required systems within a CPES. The resulting lightweight architecture leverages Web technologies including Linked Data, the Web of Things, and social media. The talk describes the fundamentals of the approach and demonstrates it within an Enterprise Energy Management scenario in a smart building.
Sustainable IT for Energy Management: Approaches, Challenges, and Trends (Edward Curry)
An invited talk to the Galway-Mayo Institute of Technology on the current state of the art in Sustainable IT for energy management, the challenges, and the emerging trends.
SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing (Edward Curry)
Recent advances in web technologies allow people to help solve complex problems by performing online tasks in return for money, learning, or fun. At present, human contribution is limited to the tasks defined on individual crowdsourcing platforms. Furthermore, there is a lack of tools and technologies that support matching of tasks with appropriate users across multiple systems. A more explicit capture of the semantics of crowdsourcing tasks could enable the design and development of matchmaking services between users and tasks. The paper presents the SLUA ontology, which aims to model users and tasks in crowdsourcing systems in terms of the relevant actions, capabilities, and rewards. This model describes different types of human tasks that help in solving complex problems using crowds. The paper provides examples of describing users and tasks in some real-world systems with the SLUA ontology.
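A minimal sketch of the kind of user/task description the paper proposes, assuming the rdflib Python library; the SLUA namespace URI and property names below are illustrative guesses at the ontology's vocabulary, not its published terms.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Illustrative namespace; the real SLUA ontology URI may differ.
SLUA = Namespace("http://example.org/slua#")

g = Graph()
user = URIRef("http://example.org/users/alice")
task = URIRef("http://example.org/tasks/label-images")

# Describe a user's capability and a task's required capability and reward,
# mirroring the user/task/capability/reward concepts named in the abstract.
g.add((user, RDF.type, SLUA.User))
g.add((user, SLUA.hasCapability, Literal("image-classification")))
g.add((task, RDF.type, SLUA.Task))
g.add((task, SLUA.requiresCapability, Literal("image-classification")))
g.add((task, SLUA.offersReward, Literal("money")))

print(g.serialize(format="turtle"))
```

A matchmaking service could then pair users with tasks by querying for tasks whose required capability appears among a user's capabilities.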
Towards Unified and Native Enrichment in Event Processing Systems (Edward Curry)
Events are encapsulated pieces of information that flow from one event agent to another. In order to process an event, additional information that is external to the event is often needed. This is achieved using a process called event enrichment. Current approaches to event enrichment are external to event processing engines and are handled by specialized agents. Within large-scale environments with high heterogeneity among events, the enrichment process may become difficult to maintain. This paper examines event enrichment in terms of information completeness and presents a unified model for event enrichment that takes place natively within the event processing engine. The paper describes the requirements of event enrichment and highlights its challenges such as finding enrichment sources, retrieval of information items, finding complementary information and its fusion with events. It then details an instantiation of the model using Semantic Web and Linked Data technologies. Enrichment is realised by dynamically guiding a spreading activation algorithm in a Linked Data graph. Multiple spreading activation strategies have been evaluated on a set of Wikipedia events and experimentation shows the viability of the approach.
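The spreading-activation idea mentioned above is easy to sketch: starting from the entities mentioned in an event, activation flows along graph edges with decay, and highly activated neighbours become candidate enrichment facts. The graph and decay constant below are toy assumptions, not the paper's evaluated configuration.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Spreading activation over an adjacency-list graph.

    graph: dict mapping node -> list of neighbour nodes
    seeds: nodes mentioned in the incoming event (activation 1.0)
    """
    activation = {node: 1.0 for node in seeds}
    frontier = dict(activation)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour in graph.get(node, []):
                gained = energy * decay
                if gained > activation.get(neighbour, 0.0):
                    next_frontier[neighbour] = gained
                    activation[neighbour] = gained
        frontier = next_frontier
    return activation

# Tiny Linked Data-like neighbourhood around an event mentioning "Dublin".
graph = {"Dublin": ["Ireland", "Liffey"], "Ireland": ["Euro"]}
print(spread_activation(graph, seeds=["Dublin"]))
# {'Dublin': 1.0, 'Ireland': 0.5, 'Liffey': 0.5, 'Euro': 0.25}
```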
Approximate Semantic Matching of Heterogeneous Events (Edward Curry)
Event-based systems have loose coupling within space, time and synchronization, providing a scalable infrastructure for information exchange and distributed workflows. However, event-based systems are tightly coupled, via event subscriptions and patterns, to the semantics of the underlying event schema and values. The high degree of semantic heterogeneity of events in large and open deployments such as smart cities and the sensor web makes it difficult to develop and maintain event-based systems. In order to address semantic coupling within event-based systems, we propose vocabulary-free subscriptions together with the use of approximate semantic matching of events. This paper examines the requirement of event semantic decoupling and discusses approximate semantic event matching and the consequences it implies for event processing systems. We introduce a semantic event matcher and evaluate the suitability of an approximate hybrid matcher based on both thesauri-based and distributional-semantics-based similarity and relatedness measures. The matcher is evaluated over a structured representation of Wikipedia and Freebase events. Initial evaluations show that the approach matches events with a maximal combined precision-recall F1 score of 75.89% on average across all experiments, with a subscription set of 7 subscriptions. The evaluation shows how a hybrid approach to semantic event matching outperforms a single similarity measure approach.
S. Hasan, S. O’Riain, and E. Curry, “Approximate Semantic Matching of Heterogeneous Events,” in 6th ACM International Conference on Distributed Event-Based Systems (DEBS 2012), 2012.
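A rough sketch of the thesaurus-based side of such approximate matching, assuming NLTK with the WordNet corpus downloaded (nltk.download("wordnet")); the real system combines thesaurus and distributional measures, which this single-measure toy does not reproduce.

```python
from nltk.corpus import wordnet as wn  # requires the WordNet corpus

def similarity(word_a, word_b):
    """Best WordNet path similarity between any senses of two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(word_a)
              for s2 in wn.synsets(word_b)]
    return max(scores, default=0.0)

def matches(event_terms, subscription_terms, threshold=0.5):
    """Approximate match: every subscription term must have some
    sufficiently similar event term, even if vocabularies differ."""
    return all(
        max(similarity(sub, ev) for ev in event_terms) >= threshold
        for sub in subscription_terms
    )

# 'car'/'automobile' share a synset, so vocabularies need not be identical.
print(matches(["automobile", "crash"], ["car", "accident"]))  # expected True
```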
Crowdsourcing Approaches for Smart City Open Data Management (Edward Curry)
A wide-scale, bottom-up approach to the creation and management of open data has been demonstrated by projects like Freebase, Wikipedia, and DBpedia. This talk explores how to involve a wide community of users in collaborative open data management activities within a Smart City. The talk discusses how crowdsourcing techniques can be applied within a Smart City context using crowdsourcing and human computation platforms such as Amazon Mechanical Turk, Mobile Works, and CrowdFlower.
Data Analytics Ethics Issues and Questions
Presented at the University of Chicago Booth Big Data & Analytics Roundtable, April 2018
Presenter:
Arnie Aronoff, Ph.D.
Instructor, MScA in Data Analytics
Instructor, School of Social Services Administration
The University of Chicago
Group Concept OD
Organizational Development and Training
(312) 259-4544
aaronoff33@gmail.com
Data-driven decision-making (DDDM) is a process that helps data science professionals boost their businesses. Explore DDDM in detail and learn how you can master it in 2024.
Data Quality Strategy: A Step-by-Step Approach (FindWhitePapers)
Learn about the importance of having a data quality strategy and setting its overall goals. The six factors of data are also explained in detail, along with how to tie them together for implementation.
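A data quality strategy ultimately needs measurable checks. The sketch below computes two common data-quality indicators (completeness and validity) with pandas; the column names and validity rule are illustrative assumptions, not the white paper's factors.

```python
import pandas as pd

records = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["a@x.com", None, "c@x.com", "not-an-email"],
})

# Completeness: share of non-null values per column.
completeness = records.notna().mean()

# Validity: share of emails matching a simple pattern.
validity = records["email"].str.contains(r"^[^@\s]+@[^@\s]+$", na=False).mean()

print(completeness.to_dict())  # {'customer_id': 1.0, 'email': 0.75}
print(round(validity, 2))      # 0.5
```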
An examination of the ethical considerations involved in data analytics (Uncodemy)
Data analytics can be used for various purposes, including marketing, product development, and customer service. One of the primary benefits of data analytics is that it can help you identify patterns in your data that you might not have been able to see with other methods.
How to Create a Big Data Culture in Pharma (Chris Waller)
A talk presented at the Big Data and Analytics conference in Boston on January 28, 2014. Emphasis on data and information sharing cultures in companies.
MITS Advanced Research Techniques, Research Proposal, Student's Na... (EvonCanales257)
MITS Advanced Research Techniques
Research Proposal
Student’s Name
Higher Education Department
Victorian Institute of Technology
Proposed Title: Data Integrity Threats to Organizations
Abstract
Data integrity, an integral aspect of cyber security, is the consistency and accuracy of data that is assured across its life cycle, and is an imperative aspect of the implementation, design, and utilization of systems that process, store, and retrieve data (Graham, 2017). It is estimated that almost 90 percent of the world’s data was generated in the last two years, which shows the rate at which data is being produced. There are various threats associated with data integrity, for example security, human, and transfer errors, and cyber-attacks and malware, just to name a few. Data integrity is examined here in the context of organizations and business because of the impact it has on their operations and eventual success.
Data integrity is important to the productivity and operations of an organization, because management makes decisions based on the real-time data offered to them. If the data presented to management is inaccurate due to a lack of proper data integrity, then the decisions they make might have an adverse effect on the organization. For example, if data related to last year’s projections and profits in the finance department is altered in any way, then decisions about plans relating to the organization’s financial position might lead to further losses. Organizations ought to prioritize security measures through their various Information Systems departments, or by seeking third-party cyber security specialists, to protect and mitigate against the threats related to data integrity.
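One standard technical control against the transfer-error and tampering threats described above is a cryptographic checksum. The sketch below verifies that a record has not been altered since its hash was stored; the record layout is illustrative, not part of the proposal.

```python
import hashlib
import json

def fingerprint(record):
    """Deterministic SHA-256 digest of a record (sorted keys keep it stable)."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

record = {"year": 2023, "department": "finance", "profit": 1_200_000}
stored_digest = fingerprint(record)          # stored at write time

record["profit"] = 900_000                   # undetected alteration...
print(fingerprint(record) == stored_digest)  # ...caught at read time: False
```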
Outline of the Proposed Research
What are the threats associated with data integrity and the impact they have on organizational productivity and operations?
Background
Data plays an integral role in today’s business environment, especially when most organizations are harnessing the benefits of data to facilitate their decision-making processes. It is through understanding why and how data is important in business that one may also comprehend the importance of ensuring that the integrity of this same data is upheld. Many individuals think that data security and data integrity are one and the same, which is not true: security concerns the leaking of information such as intellectual property and healthcare documents, whereas data integrity concerns ensuring that data is trustworthy enough to facilitate the decision-making process.
Due to the lack of proper systems and structures to ensure that data integrity is at the top of an organization’s priorities, management has found it difficult to rely solely on data and analytics to facilitate their decision-making process. What this means is that a significant number of businesses are missing out on the advantages accorded through aspects such ...
My keynote speech at the ISACA IIA Belgium software watch day in October 2014 in Brussels on the value of big data and data analytics for auditors and other assurance professionals
DISCUSSION 15 4 All students must review one (1) Group PowerP.docx (cuddietheresa)
DISCUSSION 15 4
All students must review one (1) Group PowerPoint Presentation from another group and complete the following activities:
1. First, each student (individually) must summarize the content of the PowerPoint of another group in 200 words or more.
2. Additionally, each student must present a detailed discussion of what they learned from the presentation they summarized and discuss the ways in which they would use this information in their current or future profession.
PowerPoint is attached separately
Homework
Create a new product that will serve two business (organizational) markets.
Write a 750-1,000-word paper that describes your product, explains your strategy for entering the markets, and analyzes the potential barriers you may encounter. Explain how you plan to ensure your product will be successful, given your market strategy.
Include an introduction and conclusion that make relevant connections to course objectives.
Prepare this assignment according to the APA guidelines found in the APA Style Guide
Management Information Systems
Campbellsville University
Week 15: PowerPoint Presentation
Topic: Data
Group: E
GROUP MEMBERS FULL NAME
Data
Data can be defined as a specific piece of information or a basic building block of information.
Data is stored in files or in databases.
Data can be presented in tables, graphs, or charts, so that legitimate and analytical results can be derived from the gathered information.
Authentic data is very important for the smooth running of any business organization. It helps IT managers to make effective decisions. Data helps to interpret and enhance overall business processes (Cai & Zhu, 2015).
Uses of Data
The main purpose of data is to keep the records of several activities and situations.
Gathering data helps to better understand the interest of customers which can enhance the sales of organization (Haug & Liempd, 2011).
Relevant data assists in creating strong business strategies.
Use of big data helps to promote service support to the customers. It also helps organizations to find new markets and new business opportunities.
After all, data plays a great role in running the company more effectively and efficiently.
Data Management
Data management is the implementation of policies and procedures that put organizations in control of their business data regardless of where it resides. Data management is concerned with the end-to-end lifecycle of data, from creation to retirement, and the controlled progression of data to and from each stage within its lifecycle (Dunie, M. 2017).
Data Management
Information technology has evolved to deal with the most important data management computer science which helps the computer leads to the advantage of a navigable and transparent communication space.
Large volumes of data can be processed and managed with the help of management systems through the methods of algebra with applications in economic engineering especially in ...
The Role of Community-Driven Data Curation for Enterprises
1. The Role of Community-Driven Data Curation for Enterprises Edward Curry, Andre Freitas, Seán O'Riain ed.curry@deri.org http://www.deri.org/ http://www.EdwardCurry.org/
2. Speaker Profile Research Scientist at the Digital Enterprise Research Institute (DERI) Leading international web science research organization Researching how the web of data is changing the way businesses work and interact with information Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search Investigates utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries Invited speaker at the 2010 MIT Sloan CIO Symposium to an audience of more than 600 CIOs
4. Acknowledgements Collaborators Andre Freitas & Seán O'Riain Insight from Thought Leaders Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times Krista Thomas (Vice President, Marketing & Communications), Tom Tague (OpenCalais initiative Lead) from Thomson Reuters Antony Williams (VP of Strategic Development) from ChemSpider Helen Berman (Director), John Westbrook (Product Development) from the Protein Data Bank Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance. The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
5. Further Information The Role of Community-Driven Data Curation for Enterprises Edward Curry, Andre Freitas, & Seán O'Riain In David Wood (ed.), Linking Enterprise Data Springer, 2010. Available Free at: http://3roundstones.com/led_book/led-curry-et-al.html
6. Overview Curation Background The Business Need for Curated Data What is Data Curation? Data Quality and Curation How to Curate Data Curation Communities and Enterprise Data Case Studies Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank Best Practices from Case Study Learning
9. Confidence in that information Working with incomplete, inaccurate, or wrong information can have disastrous consequences
10. The Problems with Data Flawed Data Affects 25% of critical data in world’s top companies (Gartner) Data Quality Recent banking crisis (Economist Dec’09) Inaccurate figures made it difficult to manage operations (investment exposure and risk) “assets are defined differently in different programs” “numbers did not always add up” “departments do not trust each other’s figures” “figures … not worth the pixels they were made of”
11. What is Data Curation? Digital Curation Selection, preservation, maintenance, collection, and archiving of digital assets Data Curation Active management of data over its life-cycle Data Curators Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use Museum cataloguers of the Internet age
12. What is Data Curation? Data Governance Convergence of data quality, data management, business process management, and risk management Data Curation is a complementary activity Part of overall data governance strategy for organization Data Curator = Data Steward ?? Overlapping terms between communities
13. Data Quality and Curation What is Data Quality? Desirable characteristics for information resource Described as a series of quality dimensions Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation Data curation can be used to improve these quality dimensions
14. Data Quality and Curation Discoverability & Accessibility Curate to streamline search by storing and classifying in an appropriate and consistent manner Accuracy Curate to ensure data correctly represents the “real-world” values it models Consistency Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
15. Data Quality and Curation Provenance & Reputation Curate to track source of data and determine reputation Curate to include the objectivity of the source/producer Is the information unbiased, unprejudiced, and impartial? Or does it come from a reputable but partisan source? Other dimensions discussed in chapter
16. How to Curate Data Data Curation is a large field with sophisticated techniques and processes Section provides a high-level overview on: Should you curate data? Types of Curation Setting up a curation process Additional detail and references available in book chapter
17. Should You Curate Data? Curation can have multiple motivations Improving accessibility, quality, consistency,… Will the data benefit from curation? Identify business case Determine if potential return supports the investment Not all enterprise data should be curated Suits knowledge-centric data rather than transactional operations data
18. Types of Data Curation Multiple approaches to curate data, no single correct way Who? Individual Curators Curation Departments Community-based Curation How? Manual Curation (Semi-)Automated Sheer Curation
19. Types of Data Curation – Who? Individual Data Curators Suitable for small quantities of infrequently changing data (<1,000 records) Minimal curation effort (minutes per record)
20. Types of Data Curation – Who? Curation Departments Curation experts working with subject matter experts to curate data within formal process Can deal with large curation effort (thousands of records) Limitations Scalability: Can struggle with large quantities of dynamic data (>1 million records) Availability: Post-hoc nature creates delay in curated data availability
21. Types of Data Curation - Who? Community-Based Data Curation Decentralized approach to data curation Crowd-sourcing the curation process Leverages community of users to curate data Wisdom of the community (crowd) Can scale to millions of records
22. Types of Data Curation – How? Manual Curation Curators directly manipulate data Can tie users up with low-value add activities (Semi-)Automated Curation Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication, and classification Can be supervised or approved by human curators
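To make the "(semi-)automated" idea concrete, the sketch below shows an algorithm that only proposes duplicate merges and leaves the final decision to a human curator. This is a minimal Python illustration; the field names and similarity threshold are assumptions, not details from the chapter.

    # Minimal sketch of semi-automated curation: the algorithm *proposes*
    # likely duplicate records; a human curator approves or rejects them.
    from difflib import SequenceMatcher

    def normalize(name: str) -> str:
        # Basic cleansing: lowercase and collapse whitespace.
        return " ".join(name.lower().split())

    def candidate_duplicates(records, threshold=0.9):
        # Yield pairs of records whose normalized names are near-identical.
        for i, a in enumerate(records):
            for b in records[i + 1:]:
                score = SequenceMatcher(
                    None, normalize(a["name"]), normalize(b["name"])).ratio()
                if score >= threshold:
                    yield a, b, score

    records = [
        {"id": 1, "name": "Thomson Reuters"},
        {"id": 2, "name": "Thomson  Reuters "},
        {"id": 3, "name": "Protein Data Bank"},
    ]
    for a, b, score in candidate_duplicates(records):
        # In a real workflow these pairs would be queued for curator review.
        print(f"Possible duplicate ({score:.2f}): record {a['id']} <-> record {b['id']}")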
23. Types of Data Curation – How? Sheer curation, or Curation at Source Curation activities integrated in normal workflow of those creating and managing data Can be as simple as vetting or “rating” the results of a curation algorithm Results can be available immediately Blended Approaches: Best of Both Sheer curation + post-hoc curation department Allows immediate access to curated data Ensures quality control with expert curation
24. Setting up a Curation Process 5 Steps to set up a curation process: 1 - Identify what data you need to curate 2 - Identify who will curate the data 3 - Define the curation workflow 4 - Identify appropriate data-in & data-out formats 5 - Identify the artifacts, tools, and processes needed to support the curation process
25. Setting up a Curation Process Step 1: Identify what data you need to curate Newly created data and/or legacy data? How is new data created? Do users create the data, or is it imported from an external source? How frequently is new data created/updated? What quantity of data is created? How much legacy data exists? Is it stored within a single source, or scattered across multiple sources?
26. Setting up a Curation Process Step 2: Identify who will curate the data Individuals, depts, groups, institutions, community Step 3: Define the curation workflow What curation activities are required? How will curation activities be carried out? Step 4: Identify suitable data-in & -out formats What is the best format for the data? Right format for receiving and publishing data is critical Support multiple formats to maximize participation
27. Setting up a Curation Process Step 5: Identify the artifacts, tools, and processes needed to support curation Workflow support/Community collaboration platforms Algorithms can (semi-)automate curation activities Major factors that influence approach: Quantity of data to be curated (new and legacy data) Amount of effort required to curate the data Frequency of data change / data dynamics Availability of experts
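The factors above can be reduced to a simple, mechanical decision rule. The sketch below is a loose illustration using the record-count guidance from the earlier "Types of Data Curation" slides; the exact thresholds and function name are assumptions, not prescriptions from the chapter.

    # Illustrative rule of thumb for choosing a curation approach.
    # Thresholds echo the earlier slides (<1,000 records: individual;
    # thousands: department; millions: community) but are assumptions.
    def suggest_curation_approach(record_count: int, experts_available: bool) -> str:
        if record_count < 1_000:
            return "individual curator"
        if record_count < 1_000_000 and experts_available:
            return "curation department"
        return "community-based curation (possibly blended with expert review)"

    print(suggest_curation_approach(500, True))         # individual curator
    print(suggest_curation_approach(50_000, True))      # curation department
    print(suggest_curation_approach(5_000_000, False))  # community-based curation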
28. Overview Curation Background The Business Need for Curated Data What is Data Curation? Data Quality and Curation How to Curate Data Curation Communities and Enterprise Data Case Studies Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank Best Practices from Case Study Learning
29. Community–based Curation Two community approaches: Internal corporate communities External pre-competitive communities To determine the right model consider: What is the purpose of the community? Will the resulting curated dataset be publicly available? Or restricted?
30. Community–based Curation Internal Communities Taps potential of workforce to assist data curation Curate competitive enterprise data that will remain internal to the company May not always be the case e.g. product technical support and marketing data Can work in conjunction with curation dept. Community governance typically follows the organization’s internal governance model
32. What is Pre-Competitive Data? Two Types of Enterprise Data Proprietary data for competitive advantage Common data with no competitive advantage What is pre-competitive data? Has little potential for differentiation Can be shared without conferring commercial advantage to competitor Common non-competitive data Needs to be maintained and curated Companies duplicate effort in-house, incurring full cost
33. Pre-competitive Communities External pre-competitive communities Share costs, risks, and technical challenges Common curation tasks carried out once in the public domain rather than multiple times in each company Reduces cost required to provide and maintain data Can increase the quantity, quality, and access Focus turns to value-add competitive activity Move “competitive onus” from novel data to novel algorithms, shifting emphasis from “proprietary data” to a “proprietary understanding of data” e.g. Protein Data Bank and Pistoia Alliance in Pharma
34. External Pre-competitive Communities Two popular community models are Organization consortium Open community Organization consortium Operates like a private democratic club Usually closed community, members invited based on skill-set to contribute Output data - public or limited to members Consortiums follow a democratic process Member voting rights may reflect level of investment Larger players may be leaders of the consortium
35. External Pre-competitive Communities Open community Everyone can participate “Founder(s)” defines desired curation activity Seek public support to contribute to curation activities Wikipedia, Linux, and Apache are good examples of large open communities
36. Overview Curation Background The Business Need for Curated Data What is Data Curation? Data Quality and Curation How to Curate Data Curation Communities and Enterprise Data Case Studies Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank Best Practices from Case Study Learning
38. Wikipedia Open-source encyclopedia Collaboratively built by large community Challenges existing models of content creation More than 19,000,000 articles 270+ languages, 3,200,000+ articles in English More than 157,000 active contributors Studies show accuracy and stylistic formality are equivalent to resources developed in expert-based closed communities i.e. Columbia and Britannica encyclopedias
39. Wikipedia MediaWiki Wiki platform behind Wikipedia Widespread and popular technology Wikis can also support data curation Lowers entry barriers for collaborative data curation Widely used inside organizations Intellipedia covering 16 U.S. Intelligence agencies WikiProteins, curated protein data for knowledge discovery and annotation
40. Wikipedia Decentralized environment supports creation of high quality information with: Social organization Artifacts, tools & processes for cooperative work coordination Wikipedia collaboration dynamics highlight good practices
41. Wikipedia – Social Organization Any user can edit its contents Without prior registration Does not lead to a chaotic scenario In practice highly scalable approach for high quality content creation on the Web Relies on simple but highly effective way to coordinate its curation process Curation is the activity of Wikipedia admins Responsibility for information quality standards
42. Wikipedia – Social Organization Four main types of accounts: Anonymous users Identified by their associated IP address Registered users Users with an account in the Wikipedia website Administrators/Editors Registered users with additional permissions in the system Access to curation tools Bots Programs that perform repetitive tasks
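Bots of this kind are commonly built with the open-source Pywikibot library. Below is a minimal read-only sketch; the page title is illustrative, and a production bot would run under an approved bot account in line with Wikipedia's bot policy.

    # Read-only sketch of a repetitive bot-style task using Pywikibot
    # (pip install pywikibot; first run may ask for a user-config.py).
    import pywikibot

    site = pywikibot.Site("en", "wikipedia")
    page = pywikibot.Page(site, "Data curation")  # illustrative page title
    tags = page.text.lower().count("{{citation needed")
    print(f"'{page.title()}' contains {tags} 'citation needed' tags.")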
44. Wikipedia – Social Organization Incentives Improvement of one’s reputation Sense of efficacy Contributing effectively to a meaningful project Over time focus of editors typically changes From curators of a few articles in specific topics To more global curation perspective Enforcing quality assessment of Wikipedia as a whole
45. Wikipedia – Artifacts, Tools & Processes Wiki Article Editor (Tool) WYSIWYG or markup text editor Talk Pages (Tool) Public arena for discussions around Wikipedia resources Watchlists (Tool) Helps curators to actively monitor the integrity and quality of resources they contribute Permission Mechanisms (Tool) Users with administrator status can perform critical actions such as remove pages and grant administrative permissions to new users
46. Wikipedia – Artifacts, Tools & Processes Automated Editing (Tool) Bots are automated or semi-automated tools that perform repetitive tasks over content Page History and Restore (Tool) Historical trail of changes to a Wikipedia Resource Guidelines, Policies & Templates (Artifact) Defines curation guidelines for editors to assess article quality Dispute Resolution (Process) Dispute mechanism between editors over the article contents Article Editing, Deletion, Merging, Redirection, Transwiking, Archival (Process) Describe the curation actions over Wikipedia resources
47. Wikipedia - DBpedia DBpedia Knowledge base Inherits massive volume of curated Wikipedia data Built using infobox properties Indirectly uses wiki as data curation platform DBpedia provides direct access to data 3.4 million entities and 1 billion RDF triples Comprehensive data infrastructure Concept URIs, definitions, and basic types
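DBpedia's data can be consumed programmatically through its public SPARQL endpoint. A minimal Python sketch using plain HTTP follows; the specific resource and property URIs are illustrative examples of DBpedia's vocabulary rather than the only way in.

    # Query DBpedia's public SPARQL endpoint over HTTP (pip install requests).
    import requests

    query = """
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Wikipedia>
          <http://dbpedia.org/ontology/abstract> ?abstract .
      FILTER (lang(?abstract) = "en")
    } LIMIT 1
    """
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["abstract"]["value"][:200], "...")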
54. The New York Times Index Department was created in 1913 Curation and cataloguing of NYT resources Since 1851 NYT had low quality index for internal use Developed a comprehensive catalog using a controlled vocabulary Covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc), linked to articles and their summaries Current Index Dept. has ~15 people
55. The New York Times Challenges with consistently and accurately classifying news articles over time Keywords expressing subjects may show some variance due to cultural or legal constraints Identities of some entities, such as organizations and places, changed over time Controlled vocabulary grew to hundreds of thousands of categories Adding complexity to classification process
56. The New York Times Increased importance of Web drove need to improve categorization of online content Curation carried out by Index Department Library-time (days to weeks) Print edition can handle next-day index Not suitable for real-time online publishing nytimes.com needed a same-day index
57. The New York Times Introduced two stage curation process Editorial staff performed best-effort semi-automated sheer curation at point of online publication Several hundred journalists Index Department follow up with long-term accurate classification and archiving Benefits: Non-expert journalist curators provide instant accessibility to online users Index Department provides long-term high-quality curation in a “trust but verify” approach
67. The New York Times Early adopter of Linked Open Data (June ‘09)
68. The New York Times Linked Open Data @ data.nytimes.com Subset of 10,000 tags from index vocabulary Dataset of people, organizations & locations Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate,… Benefits Improves traffic by third party data usage Lowers development cost of new applications for different verticals inside the website E.g. movies, travel, sports, books
70. Thomson Reuters Thomson Reuters is an information provider Created by acquisition of Reuters by Thomson Over 50,000 employees Commercial presence in 100+ countries Provides specialist curated information and information-based services Selects most relevant information for customers Classifying, enriching and distributing it in a way that can be readily consumed
71. Thomson Reuters Curation process Working over approximately 1000 data sources Automatic tools provide first level triage and classification Refined by intervention of human curators Curator is a domain specialist Employs thousands of curators
72. Thomson Reuters OneCalais platform Reduces workload for classification of content Natural Language Processing on unstructured text Automatically derives tags for analyzed content Enrichment with machine readable structured data Provides description of specific entities (places, people, events, facts) present in the text Open Calais (free version of OneCalais) 20,000+ users, >4 million transactions per day CNET, CBS Interactive, The Huffington Post, The Powerhouse Museum of Science and Design,…
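OneCalais is a commercial service, so as a stand-in the sketch below uses the open-source spaCy library to illustrate the same first-level triage idea: automatically deriving entity tags (people, places, organizations) from unstructured text for human curators to refine. This is explicitly not the OpenCalais API.

    # Stand-in for OneCalais-style automatic entity tagging using spaCy
    # (pip install spacy; python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    text = ("Thomson Reuters, created by the acquisition of Reuters by "
            "Thomson, has a commercial presence in more than 100 countries.")
    for ent in nlp(text).ents:
        # Machine-derived tags, to be refined by a human curator.
        print(f"{ent.text:20s} -> {ent.label_}")  # e.g. ORG, GPE, CARDINAL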
73. ChemSpider Structure centric chemical community Over 300 data sources with 25 million records Provided by chemical vendors, government databases, private laboratories and individuals Pharma realizing benefits of open data Heavily leveraged by pharmaceutical companies as pre-competitive resources for experimental and clinical trial investigation GlaxoSmithKline made its proprietary malaria dataset of 13,500 compounds available
74. Protein Data Bank Dedicated to improving understanding of the functions of biological systems through the 3-D structure of macromolecules Started in 1971 with 3 core members Originally offered 7 crystal structures Grown to 63,000 structures Over 300 million dataset downloads Expanded beyond curated data download service to include complex molecular visualization, search, and analysis capabilities
75. Overview Curation Background The Business Need for Curated Data What is Data Curation? Data Quality and Curation How to Curate Data Curation Communities and Enterprise Data Case Studies Wikipedia, The New York Times, Thomson Reuters, ChemSpider, Protein Data Bank Best Practices from Case Study Learning
76. Best Practices from Case Study Learning Social Best Practices Participation Engagement Incentives Community Governance Models Technical Best Practices Data Representation Human and Automated Curation Track Provenance
77. Social Best Practices Participation Stakeholder involvement for data producers and consumers must occur early in project Provides insight into basic questions of what they want to do, for whom, and what it will provide White papers are effective means to present these ideas, and solicit opinion from community Can be used to establish informal ‘social contract’ for community
78. Social Best Practices Engagement Outreach activities essential for promotion and feedback Typical consumers-to-contributors ratios of less than 5% Social communication and networking forums are useful Majority of community may not communicate using these media Communication by email still remains important
79. Social Best Practices Incentives Sheer curation needs a line of sight from the data curation activity to tangible exploitation benefits Lack of awareness of value proposition will slow emergence of collaborative contributions Recognizing contributing curators through a formal feedback mechanism Reinforces contribution culture Directly increases output quality
80. Social Best Practices Community Governance Models Effective governance structure is vital to ensure success of community Internal communities and consortium perform well when they leverage traditional corporate and democratic governance models Open communities need to engage the community within the governance process Follow less orthodox approaches using meritocratic and autocratic principles
81. Technical Best Practices Data Representation Must be robust and standardized to encourage community usage and tools development Support for legacy data formats and ability to translate data forward to support new technology and standards Human & Automated Curation Balancing the two will improve data quality Automated curation should always defer to, and never override, human curation edits Automate validation of data deposition and entry Target community at focused curation tasks
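The "defer to human curation" rule can be enforced mechanically. A minimal sketch, assuming each field value records whether a human or a machine last set it (the flag names are assumptions):

    # Merge rule: an automated suggestion is applied only when the current
    # value was NOT set by a human curator; humans always win.
    def merge_field(current_value, current_source, suggested_value):
        if current_source == "human":
            return current_value, current_source  # never override a person
        return suggested_value, "machine"

    print(merge_field("Dublin", "human", "Dublin, Ireland"))    # ('Dublin', 'human')
    print(merge_field("dublin", "machine", "Dublin, Ireland"))  # ('Dublin, Ireland', 'machine')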
82. Technical Best Practices Track Provenance All curation activities should be recorded and maintained as part of the data provenance effort Especially where human curators are involved Users can have different perspectives of provenance A scientist may need to evaluate the fine-grained experiment description behind the data For a business analyst the ’brand’ of data provider can be sufficient for determining quality
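A minimal sketch of such provenance tracking, assuming a simple append-only JSON-lines log; real systems would more likely adopt a standard model such as W3C PROV:

    # Append-only provenance log: every curation action is recorded with
    # who/what/when, so later consumers can judge the data at whatever
    # granularity they need (fine-grained detail for scientists, the
    # provider "brand" for business analysts).
    import datetime
    import json

    def record_provenance(log_path, record_id, action, agent, detail=""):
        entry = {
            "record": record_id,
            "action": action,  # e.g. "edit", "merge", "classify"
            "agent": agent,    # human curator id or algorithm name
            "detail": detail,
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")  # one JSON object per line

    record_provenance("provenance.log", "rec-42", "classify",
                      "curator:jsmith", "moved from 'Misc' to 'Finance'")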
83. Conclusions Data curation can ensure the quality of data and its fitness for use Pre-competitive data can be shared without conferring a commercial advantage Pre-competitive data communities Common curation tasks carried out once in public domain Reduces cost, increases quantity and quality