Wikipedia (DBpedia): Crowdsourced Data Curation

Wikipedia is an open-source encyclopedia, built collaboratively by a large community of web editors. The success of Wikipedia as one of the most important sources of information available today still challenges existing models of content creation. Although Wikipedia’s contributors rarely use the term ‘curation’, digital curation is the central activity of Wikipedia editors, who are responsible for information quality standards.

MediaWiki, the wiki platform behind Wikipedia, is already widely used as a collaborative environment inside organizations.
Investigating the collaboration dynamics behind Wikipedia highlights important features and good practices that can be applied to different organizations. Our analysis focuses on the curation perspective and covers two important dimensions: social organization, and artifacts, tools & processes for cooperative work coordination. These are key enablers that support the creation of high quality information products in Wikipedia’s decentralized environment.

Published in: Technology, Education

  1. Wikipedia (DBpedia): Crowdsourced Data Curation
     Edward Curry, Andre Freitas, Seán ORiain
     Digital Enterprise Research Institute, www.deri.ie
     ed.curry@deri.org, http://www.deri.org/, http://www.EdwardCurry.org/
     Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  2. Speaker Profile
     • Research Scientist at the Digital Enterprise Research Institute (DERI)
       • Leading international web science research organization
     • Researching how the web of data is changing the way businesses work and interact with information
     • Projects include studies of enterprise linked data, community-based data curation, semantic data analytics, and semantic search
     • Investigating utilization within the pharmaceutical, oil & gas, financial, advertising, media, manufacturing, health care, ICT, and automotive industries
     • Invited speaker at the 2010 MIT Sloan CIO Symposium, to an audience of more than 600 CIOs
  3. Overview
     • Curation Background
       • The Business Need for Curated Data
       • What is Data Curation?
       • Data Quality and Curation
       • How to Curate Data
     • Wikipedia (DBpedia) Case Study
     • Best Practices from Case Study Learning
  4. The Business Need
     • Knowledge workers need:
       • Access to the right information
       • Confidence in that information
     • Working with incomplete, inaccurate, or wrong information can have disastrous consequences
  5. The Problems with Data
     • Flawed Data
       • Affects 25% of critical data in the world's top companies (Gartner)
     • Data Quality
       • Recent banking crisis (Economist, Dec '09)
       • Inaccurate figures made it difficult to manage operations (investment exposure and risk)
         – "assets are defined differently in different programs"
         – "numbers did not always add up"
         – "departments do not trust each other's figures"
         – "figures … not worth the pixels they were made of"
  6. What is Data Curation?
     • Digital Curation
       • Selection, preservation, maintenance, collection, and archiving of digital assets
     • Data Curation
       • Active management of data over its life-cycle
     • Data Curators
       • Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
         – Museum cataloguers of the Internet age
  7. What is Data Curation?
     • Data Governance
       • Convergence of data quality, data management, business process management, and risk management
     • Data Curation is a complementary activity
       • Part of the overall data governance strategy for an organization
     • Data Curator = Data Steward?
       • Overlapping terms between communities
  8. Data Quality and Curation
     • What is Data Quality?
       • Desirable characteristics for an information resource
       • Described as a series of quality dimensions
         – Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
     • Data curation can be used to improve these quality dimensions
  9. Data Quality and Curation
     • Discoverability & Accessibility
       • Curate to streamline search by storing and classifying data in an appropriate and consistent manner
     • Accuracy
       • Curate to ensure data correctly represents the "real-world" values it models
     • Consistency
       • Curate to ensure data is created and maintained using standardized definitions, calculations, terms, and identifiers
  10. Data Quality and Curation
      • Provenance & Reputation
        • Curate to track the source of data and determine its reputation
        • Curate to include the objectivity of the source/producer
          – Is the information unbiased, unprejudiced, and impartial?
          – Or does it come from a reputable but partisan source?
      • Other dimensions discussed in chapter
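
As a rough illustration of the quality dimensions on slides 8 to 10, the following Python sketch scores two of them, completeness and consistency, over a handful of records. The function names, field names, sample records, and the country-code vocabulary are assumptions made for this example and do not come from the slides; real quality measures are usually richer than these ratios.

```python
Record = dict[str, object]

EMPTY = (None, "")  # values treated as "missing" in this sketch


def completeness(records: list[Record], fields: list[str]) -> float:
    """Fraction of (record, field) cells that hold a non-empty value."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for record in records
        for name in fields
        if record.get(name) not in EMPTY
    )
    return filled / total


def consistency(records: list[Record], name: str, allowed: set[str]) -> float:
    """Fraction of non-empty values of one field that use the agreed vocabulary."""
    values = [r[name] for r in records if r.get(name) not in EMPTY]
    if not values:
        return 1.0
    return sum(1 for value in values if value in allowed) / len(values)


# Made-up sample records; "Ireland" deliberately breaks the ISO-style code list.
records = [
    {"name": "Acme Ltd", "country": "IE", "sector": "ICT"},
    {"name": "Globex", "country": "Ireland", "sector": ""},
    {"name": "Initech", "country": "US", "sector": None},
]

print(completeness(records, ["name", "country", "sector"]))
print(consistency(records, "country", {"IE", "US"}))
```

Scores like these could help decide where curation effort is best spent, for example by flagging fields whose consistency falls below an agreed threshold.
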
  11. How to Curate Data
      • Data Curation is a large field with sophisticated techniques and processes
      • Section provides a high-level overview of:
        • Should you curate data?
        • Types of Curation
        • Setting up a curation process
      • Additional detail and references available in book chapter
  12. Should You Curate Data?
      • Curation can have multiple motivations
        • Improving accessibility, quality, consistency, …
      • Will the data benefit from curation?
        • Identify the business case
        • Determine whether the potential return supports the investment
      • Not all enterprise data should be curated
        • Suits knowledge-centric data rather than transactional operations data
  13. Types of Data Curation
      • Multiple approaches to curate data, no single correct way
        • Who?
          – Individual Curators
          – Curation Departments
          – Community-based Curation
        • How?
          – Manual Curation
          – (Semi-)Automated
          – Sheer Curation
  14. Types of Data Curation – Who?
      • Individual Data Curators
        • Suitable for small quantities of infrequently changing data
          – (<1,000 records)
          – Minimal curation effort (minutes per record)
  15. Types of Data Curation – Who?
      • Curation Departments
        • Curation experts working with subject matter experts to curate data within a formal process
          – Can deal with large curation effort (thousands of records)
      • Limitations
        • Scalability: can struggle with large quantities of dynamic data (>million records)
        • Availability: post-hoc nature creates delay in curated data availability
  16. Types of Data Curation – Who?
      • Community-Based Data Curation
        • Decentralized approach to data curation
        • Crowd-sourcing the curation process
          – Leverages community of users to curate data
        • Wisdom of the community (crowd)
        • Can scale to millions of records
  17. Types of Data Curation – How?
      • Manual Curation
        • Curators directly manipulate data
        • Can tie users up with low-value-add activities
      • (Semi-)Automated Curation
        • Algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication, and classification
        • Can be supervised or approved by human curators
  18. Types of Data Curation – How?
      • Sheer curation, or Curation at Source
        • Curation activities integrated in the normal workflow of those creating and managing data
        • Can be as simple as vetting or "rating" the results of a curation algorithm
        • Results can be available immediately
      • Blended Approaches: Best of Both
        • Sheer curation + post hoc curation department
        • Allows immediate access to curated data
        • Ensures quality control with expert curation
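
The (semi-)automated approach on slides 17 and 18 can be sketched as an algorithm that proposes duplicate records and a human curator who vets each proposal. This is a minimal Python sketch under stated assumptions: the names `MergeProposal`, `propose_duplicates`, and `review`, the 0.85 similarity threshold, and the sample names are all hypothetical, and string similarity from the standard-library `difflib` stands in for whatever matching technique a real curation tool would use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


@dataclass(frozen=True)
class MergeProposal:
    left: str
    right: str
    score: float  # similarity of the two cleansed names, between 0 and 1


def normalize(name: str) -> str:
    """Data cleansing step: lowercase, trim, and collapse internal whitespace."""
    return " ".join(name.lower().split())


def propose_duplicates(names: list[str], threshold: float = 0.85) -> list[MergeProposal]:
    """Algorithmic step: flag pairs whose cleansed names look alike."""
    proposals = []
    for left, right in combinations(names, 2):
        score = SequenceMatcher(None, normalize(left), normalize(right)).ratio()
        if score >= threshold:
            proposals.append(MergeProposal(left, right, score))
    return proposals


def review(
    proposals: list[MergeProposal],
    approve: Callable[[MergeProposal], bool],
) -> list[MergeProposal]:
    """Human step: only the proposals the curator approves are kept."""
    return [p for p in proposals if approve(p)]


names = ["Acme Ltd", "  acme   ltd ", "Acme Limited", "Globex"]
proposals = propose_duplicates(names)

# The lambda stands in for a curator who accepts only exact matches after cleansing.
approved = review(proposals, approve=lambda p: p.score == 1.0)
for proposal in approved:
    print(proposal)
```

Keeping the approval step as a separate function means the algorithm never merges records on its own, which matches the "supervised or approved by human curators" bullet.
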
  19. Setting up a Curation Process
      • 5 steps to set up a curation process:
        1 - Identify what data you need to curate
        2 - Identify who will curate the data
        3 - Define the curation workflow
        4 - Identify appropriate data-in & data-out formats
        5 - Identify the artifacts, tools, and processes needed to support the curation process
  20. Wikipedia
      • The World's Largest Open Digital Curation Community
  21. Wikipedia
      • Open-source encyclopedia
        • Collaboratively built by a large community
        • Challenges existing models of content creation
      • More than 19,000,000 articles
        • 270+ languages, 3,200,000+ articles in English
      • More than 157,000 active contributors
      • Studies show accuracy and stylistic formality are equivalent to resources developed in expert-based closed communities
        • i.e. Columbia and Britannica encyclopedias
  22. Wikipedia
      • MediaWiki
        • Wiki platform behind Wikipedia
          – Widespread and popular technology
        • Wikis can also support data curation
          – Lowers entry barriers for collaborative data curation
      • Widely used inside organizations
        • Intellipedia, covering 16 U.S. intelligence agencies
        • Wiki Proteins, curated protein data for knowledge discovery and annotation
  23. Wikipedia
      • Decentralized environment supports creation of high quality information with:
        • Social organization
        • Artifacts, tools & processes for cooperative work coordination
      • Wikipedia collaboration dynamics highlight good practices
  24. Wikipedia – Social Organization
      • Any user can edit its contents
        • Without prior registration
      • Does not lead to a chaotic scenario
        • In practice, a highly scalable approach for high quality content creation on the Web
      • Relies on a simple but highly effective way to coordinate its curation process
      • Curation is an activity of Wikipedia admins
        • Responsibility for information quality standards
  25. Wikipedia – Social Organization
      • Four main types of accounts:
        • Anonymous users
          – Identified by their associated IP address
        • Registered users
          – Users with an account on the Wikipedia website
        • Administrators/Editors
          – Registered users with additional permissions in the system
          – Access to curation tools
        • Bots
          – Programs that perform repetitive tasks
  26. Wikipedia – Social Organization
  27. Wikipedia – Social Organization
      • Incentives
        • Improvement of one's reputation
        • Sense of efficacy
          – Contributing effectively to a meaningful project
      • Over time, the focus of editors typically changes
        – From curators of a few articles in specific topics
        – To a more global curation perspective
        – Enforcing quality assessment of Wikipedia as a whole
  28. Wikipedia – Artifacts, Tools & Processes
      • Wiki Article Editor (Tool)
        • WYSIWYG or markup text editor
      • Talk Pages (Tool)
        • Public arena for discussions around Wikipedia resources
      • Watchlists (Tool)
        • Help curators actively monitor the integrity and quality of resources they contribute to
      • Permission Mechanisms (Tool)
        • Users with administrator status can perform critical actions such as removing pages and granting administrative permissions to new users
  29. Wikipedia – Artifacts, Tools & Processes
      • Automated Edition (Tool)
        • Bots are automated or semi-automated tools that perform repetitive tasks over content
      • Page History and Restore (Tool)
        • Historical trail of changes to a Wikipedia resource
      • Guidelines, Policies & Templates (Artifact)
        • Define curation guidelines for editors to assess article quality
      • Dispute Resolution (Process)
        • Dispute mechanism between editors over the article contents
      • Article Edition, Deletion, Merging, Redirection, Transwiking, Archival (Process)
        • Describe the curation actions over Wikipedia resources
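
Two of the tools on slides 28 and 29, page history with restore and permission mechanisms, can be illustrated with a small Python sketch. This is not how MediaWiki is implemented: the `Revision` and `Page` classes, the `PERMISSIONS` mapping, the action names, and the sample users are all hypothetical and only mirror the behaviour the slides describe.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Revision:
    author: str
    text: str
    comment: str


@dataclass
class Page:
    title: str
    history: list[Revision] = field(default_factory=list)

    @property
    def text(self) -> str:
        """Current text is the text of the most recent revision."""
        return self.history[-1].text if self.history else ""

    def edit(self, author: str, text: str, comment: str = "") -> None:
        """Every change is appended, so earlier states are never lost."""
        self.history.append(Revision(author, text, comment))

    def restore(self, index: int, author: str) -> None:
        """Restoring re-applies an old revision as a new entry in the trail."""
        old = self.history[index]
        self.edit(author, old.text, f"restored revision {index}")


# Hypothetical, simplified mapping from account type to allowed actions.
PERMISSIONS = {
    "anonymous": {"edit"},
    "registered": {"edit", "watch"},
    "bot": {"edit"},
    "administrator": {"edit", "watch", "delete_page", "grant_admin"},
}


def can(role: str, action: str) -> bool:
    """True if the given account type is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())


page = Page("Data curation")
page.edit("203.0.113.7", "Data curation is the active management of data.", "first draft")
page.edit("user_x", "asdf", "")
page.restore(0, "admin_alice")

print(page.text)
print(len(page.history))
print(can("anonymous", "delete_page"))
```

Because a restore is stored as one more revision, the unwanted edit stays visible in the trail, which is what lets curators audit who changed what.
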
  30. Wikipedia – DBpedia
      • DBpedia knowledge base
        • Inherits massive volume of curated Wikipedia data
        • Built using information in infobox properties
        • Indirectly uses the wiki as a data curation platform
      • DBpedia provides direct access to data
        • 3.4 million entities and 1 billion RDF triples
        • Comprehensive data infrastructure
          – Concept URIs, definitions, and basic types
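
To make "built using information in infobox properties" concrete, the Python sketch below turns one infobox, represented as a plain dictionary, into subject-predicate-object triples. It is a simplified illustration and not DBpedia's actual extraction framework: the `http://dbpedia.org/resource/` and `http://dbpedia.org/property/` namespaces are the ones DBpedia uses, but the function names, the naive URI construction (spaces to underscores, no percent-encoding), and the sample article with its values are assumptions invented for this example.

```python
RESOURCE_NS = "http://dbpedia.org/resource/"
PROPERTY_NS = "http://dbpedia.org/property/"

Triple = tuple[str, str, str]


def resource_uri(article_title: str) -> str:
    """Concept URI for an article title (simplified: spaces become underscores)."""
    return RESOURCE_NS + article_title.strip().replace(" ", "_")


def infobox_to_triples(article_title: str, infobox: dict[str, object]) -> list[Triple]:
    """One triple per non-empty infobox property, with the value kept as a literal."""
    subject = resource_uri(article_title)
    return [
        (subject, PROPERTY_NS + key, str(value))
        for key, value in infobox.items()
        if value not in (None, "")
    ]


# Made-up infobox for a made-up article; the empty "mayor" value is skipped.
infobox = {"country": "Exampleland", "populationTotal": 12345, "mayor": ""}
triples = infobox_to_triples("Example City", infobox)

for triple in triples:
    print(triple)
```

Every correction a Wikipedia editor makes to an infobox would flow into triples like these on the next extraction, which is the sense in which DBpedia indirectly uses the wiki as its curation platform.
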
  32. Wikipedia – DBpedia
  33. Overview
      • Curation Background
        • The Business Need for Curated Data
        • What is Data Curation?
        • Data Quality and Curation
        • How to Curate Data
      • Wikipedia (DBpedia) Case Study
      • Best Practices from Case Study Learning
  34. Best Practices from Case Study Learning
      • Social Best Practices
        • Participation
        • Engagement
        • Incentives
        • Community Governance Models
      • Technical Best Practices
        • Data Representation
        • Human and Automated Curation
        • Track Provenance
  35. Social Best Practices
      • Participation
        • Stakeholder involvement for data producers and consumers must occur early in the project
          – Provides insight into basic questions of what they want to do, for whom, and what it will provide
        • White papers are an effective means to present these ideas and solicit opinion from the community
          – Can be used to establish an informal 'social contract' for the community
  36. Social Best Practices
      • Engagement
        • Outreach activities essential for promotion and feedback
        • Typical contributor-to-consumer ratios of less than 5%
        • Social communication and networking forums are useful
          – Majority of community may not communicate using these media
          – Communication by email still remains important
  37. Social Best Practices
      • Incentives
        • Sheer curation needs a line of sight from the data curating activity to tangible exploitation benefits
        • Lack of awareness of the value proposition will slow the emergence of collaborative contributions
        • Recognizing contributing curators through a formal feedback mechanism
          – Reinforces contribution culture
          – Directly increases output quality
  38. Social Best Practices
      • Community Governance Models
        • Effective governance structure is vital to ensure success of the community
        • Internal communities and consortia perform well when they leverage traditional corporate and democratic governance models
        • Open communities need to engage the community within the governance process
          – Follow less orthodox approaches using meritocratic and autocratic principles
  39. Technical Best Practices
      • Data Representation
        • Must be robust and standardized to encourage community usage and tools development
        • Support for legacy data formats and the ability to translate data forward to support new technology and standards
      • Human & Automated Curation
        • Balancing the two will improve data quality
        • Automated curation should always defer to, and never override, human curation edits
          – Automate validating data deposition and entry
          – Target community at focused curation tasks
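
The rule on slide 39 that automated curation should defer to human edits can be expressed as a simple precedence check. The Python sketch below is one possible reading, assuming edits arrive as an ordered list and that each edit is labelled as coming from a human or a bot; the `Edit` class, the `apply_edits` function, and the sample record are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    key: str      # name of the field being changed
    value: str    # new value proposed for that field
    source: str   # "human" or "bot"


def apply_edits(record: dict[str, str], edits: list[Edit]) -> dict[str, str]:
    """Apply edits in order; a bot edit is skipped once a human has set that field."""
    result = dict(record)
    human_fields: set[str] = set()
    for edit in edits:
        if edit.source == "human":
            result[edit.key] = edit.value
            human_fields.add(edit.key)
        elif edit.key not in human_fields:
            result[edit.key] = edit.value
    return result


record = {"name": "acme", "country": "ie"}
edits = [
    Edit("name", "Acme", "bot"),          # applied: no human edit on "name" yet
    Edit("name", "ACME Ltd", "human"),    # applied: human edits always win
    Edit("name", "Acme Ltd", "bot"),      # skipped: would override a human edit
    Edit("country", "IE", "bot"),         # applied: "country" was never human-edited
]

print(apply_edits(record, edits))
```

The bot remains useful for fields nobody has reviewed, while anything a curator has touched is left alone.
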
  40. Technical Best Practices
      • Track Provenance
        • All curation activities should be recorded and maintained as part of the data provenance effort
          – Especially where human curators are involved
        • Users can have different perspectives of provenance
          – A scientist may need to evaluate the fine-grained experiment description behind the data
          – For a business analyst, the 'brand' of the data provider can be sufficient for determining quality
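
Provenance tracking as described on slide 40 amounts to recording every curation activity and then offering different views of that record. The Python sketch below assumes an in-memory, append-only log; the `ProvenanceEntry` and `ProvenanceLog` names, their fields, and the sample entries are hypothetical. The `history` method gives the fine-grained view a scientist might want, and `sources` gives the provider-level view that may be enough for a business analyst.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEntry:
    record_id: str
    action: str        # what was done, e.g. "import" or "correct"
    agent: str         # who or what did it
    agent_type: str    # "human" or "bot"
    source: str        # provider the data or change came from
    timestamp: datetime


class ProvenanceLog:
    """Append-only log of curation activities."""

    def __init__(self) -> None:
        self._entries: list[ProvenanceEntry] = []

    def record(
        self,
        record_id: str,
        action: str,
        agent: str,
        agent_type: str,
        source: str,
        timestamp: Optional[datetime] = None,
    ) -> ProvenanceEntry:
        """Store one activity; the time defaults to now in UTC."""
        entry = ProvenanceEntry(
            record_id, action, agent, agent_type, source,
            timestamp or datetime.now(timezone.utc),
        )
        self._entries.append(entry)
        return entry

    def history(self, record_id: str) -> list[ProvenanceEntry]:
        """Fine-grained view: every recorded activity for one record, in order."""
        return [e for e in self._entries if e.record_id == record_id]

    def sources(self, record_id: str) -> list[str]:
        """Coarse view: the distinct providers behind one record."""
        return sorted({e.source for e in self.history(record_id)})


log = ProvenanceLog()
log.record("rec-1", "import", "loader_bot", "bot", "Provider A")
log.record("rec-1", "correct", "alice", "human", "Provider B")
log.record("rec-2", "import", "loader_bot", "bot", "Provider A")

print([e.action for e in log.history("rec-1")])
print(log.sources("rec-1"))
```
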
  41. Conclusions
      • Data curation can ensure the quality of data and its fitness for use
      • Pre-competitive data can be shared without conferring a commercial advantage
      • Pre-competitive data communities
        • Common curation tasks carried out once in the public domain
        • Reduces cost, increases quantity and quality
  42. Acknowledgements
      • Collaborators: Andre Freitas & Seán ORiain
      • Insight from Thought Leaders
        • Evan Sandhaus (Semantic Technologist), Rob Larson (Vice President Product Development and Management), and Gregg Fenton (Director Emerging Platforms) from the New York Times
        • Krista Thomas (Vice President, Marketing & Communications) and Tom Tague (OpenCalais initiative Lead) from Thomson Reuters
        • Antony Williams (VP of Strategic Development) from ChemSpider
        • Helen Berman (Director) and John Westbrook (Product Development) from the Protein Data Bank
        • Nick Lynch (Architect with AstraZeneca) from the Pistoia Alliance
      • The work presented has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
  43. Further Information
      • The Role of Community-Driven Data Curation for Enterprises. Edward Curry, Andre Freitas, & Seán ORiain. In David Wood (ed.), Linking Enterprise Data, Springer, 2010.
      • Available free at: http://3roundstones.com/led_book/led-curry-et-al.html